Using Custom Accelerators in Wireless Systems
Alex Papakonstantinou, Deming Chen
Illinois Center for Wireless Systems
Wireless SoC Design Trends and Challenges
• Shrinking transistor technologies have turned a single die into a host for systems of extraordinary size and complexity
– All the analog and digital components that required 3-4 separate ICs in past technologies can now fit on a single chip
• Designer productivity does not rise at the same rate as transistor capacity
– Design reuse and the use of Commercial Off-The-Shelf (COTS) Intellectual Property (IP) help meet Time-To-Market (TTM) constraints but have other downsides
• Design space exploration is becoming a daunting task and conflicts with the shrinking TTM requirements
• System customization suffers in terms of functionality, performance, power and area under a “one system fits all” approach
• Design focus is shifting from single thread speed optimization to execution parallelization through multi-processor systems
Typical Design Practice & Design Paradigm Shift
• COTS IP modules are integrated to meet the required system functionality
– Usually a generic microprocessor/micro-controller is used for the control part and a separate DSP processor for the signal processing part
– Fixed-functionality IP modules are integrated for the various data processing tasks
• IP use speeds up the design phase but:
– imposes coarse granularity on optimization decisions regarding functionality, performance and power dissipation
– does not eliminate design time entirely, as interfacing between different IP modules can take up considerable engineering resources
• The design paradigm needs to shift to a higher abstraction level
– Design systems efficiently, with higher flexibility and on-demand customization
• Instruction-less custom processor / accelerator:
– Microcode memory stores microcode words which control Functional-Units (FU) and data transfers each cycle
– Program Counter (PC) holds next microcode memory address
– Microcode words do not require any decoding
– FUs customized according to application domain
– Application-custom forwarding paths between FUs can eliminate unnecessary Register File (RF) reads/writes
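The execution model in the bullets above can be illustrated with a minimal Python sketch. It is not the authors' hardware or tool output: the names `FuControl`, `MicrocodeWord` and `run`, the operand encoding, and the single-cycle FU latency are all assumptions made for illustration. Each microcode word directly carries the control for every FU (no decode step), the PC indexes the microcode memory, and a result can be consumed from a forwarding path instead of the register file.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical operand encoding: ("rf", reg_index), ("fwd", fu_name) or ("imm", value).
Operand = Tuple[str, object]

@dataclass
class FuControl:
    op: Callable[[int, int], int]   # operation performed by the FU this cycle
    src_a: Operand
    src_b: Operand
    dest_reg: Optional[int] = None  # None: result is kept on the forwarding path only

@dataclass
class MicrocodeWord:
    fu: Dict[str, FuControl] = field(default_factory=dict)  # control per FU, no decoding

def run(microcode: List[MicrocodeWord], rf: List[int]):
    """Execute one microcode word per cycle; return (rf, rf_reads, rf_writes)."""
    pc = 0                # program counter: next microcode memory address
    fwd = {}              # FU outputs produced in the previous cycle
    rf_reads = rf_writes = 0
    while pc < len(microcode):
        new_fwd, writes = {}, []
        for name, ctl in microcode[pc].fu.items():
            vals = []
            for kind, v in (ctl.src_a, ctl.src_b):
                if kind == "rf":
                    vals.append(rf[v])
                    rf_reads += 1
                elif kind == "fwd":
                    vals.append(fwd[v])   # bypass the register file
                else:
                    vals.append(v)        # immediate constant
            result = ctl.op(*vals)
            new_fwd[name] = result
            if ctl.dest_reg is not None:
                writes.append((ctl.dest_reg, result))
        for reg, val in writes:           # commit RF writes at the end of the cycle
            rf[reg] = val
            rf_writes += 1
        fwd = new_fwd
        pc += 1                           # sequential fetch; branches are omitted here
    return rf, rf_reads, rf_writes

# r3 = (r0 + r1) * r2, with the sum forwarded from FU1 to FU2.
program = [
    MicrocodeWord({"FU1": FuControl(operator.add, ("rf", 0), ("rf", 1))}),
    MicrocodeWord({"FU2": FuControl(operator.mul, ("fwd", "FU1"), ("rf", 2), dest_reg=3)}),
]
final_rf, reads, writes = run(program, [2, 3, 4, 0])
print(final_rf[3], reads, writes)
```

In this toy program the intermediate sum never touches the register file, so the run uses three RF reads and one RF write; routing the sum through a register would cost one extra write and one extra read.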
EPOS (Explicitly Parallel Operations System)
• Instruction-Level Parallelism (ILP) extraction:
– The front-end of the IMPACT compiler is used to optimize the High-Level Language (HLL) description using:
• Traditional compiler techniques
• Superblock and Hyperblock creation
• The EPOS accelerators generated can substitute the generic COTS IP by:
– Offering high customization according to the system requirements
– Providing better performance and power efficiency than a generic DSP-core/microprocessor
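Superblock creation, listed above as one of the ILP extraction steps, can be sketched as trace selection followed by tail duplication. The Python below is a simplified illustration, not the IMPACT implementation: the function names, the dictionary-based CFG, the greedy most-frequent-successor heuristic, and the assumption of an acyclic CFG are all choices made for this example. The four-block CFG (BB1 branching 90/10 to BB2 and BB3, both rejoining at BB4) is illustrative.

```python
def select_trace(cfg, entry):
    """Greedily follow the most frequently taken successor from the entry block."""
    trace, seen, node = [entry], {entry}, entry
    while cfg.get(node):
        nxt = max(cfg[node], key=cfg[node].get)
        if nxt in seen:          # stop rather than follow a back edge
            break
        trace.append(nxt)
        seen.add(nxt)
        node = nxt
    return trace

def form_superblock(cfg, entry):
    """Return (trace, new_cfg) where the trace has no side entrances.

    cfg maps block -> {successor: edge weight}. Blocks after the first side
    entrance are duplicated (suffix "d") and side entrances are redirected to
    the copies. Edge weights are copied as-is, not re-profiled.
    """
    trace = select_trace(cfg, entry)
    new_cfg = {b: dict(succs) for b, succs in cfg.items()}

    def side_preds(j):
        # Predecessors of trace[j] other than the preceding trace block.
        return [p for p in cfg if trace[j] in cfg[p] and p != trace[j - 1]]

    first = next((j for j in range(1, len(trace)) if side_preds(j)), None)
    if first is None:
        return trace, new_cfg    # already a superblock
    tail = trace[first:]
    dup = {b: b + "d" for b in tail}
    for b in tail:               # copy the tail, keeping copies linked to each other
        new_cfg[dup[b]] = {dup.get(s, s): w for s, w in cfg[b].items()}
    for j in range(first, len(trace)):
        for p in side_preds(j):  # redirect each side entrance to the duplicate
            new_cfg[p][dup[trace[j]]] = new_cfg[p].pop(trace[j])
    return trace, new_cfg

cfg = {
    "BB1": {"BB2": 90, "BB3": 10},
    "BB2": {"BB4": 90},
    "BB3": {"BB4": 10},
    "BB4": {},
}
trace, new_cfg = form_superblock(cfg, "BB1")
print(trace)
print(new_cfg)
```

Here the hot path BB1, BB2, BB4 becomes a single-entry trace, while the cold path through BB3 is redirected to a duplicate of BB4 (named `BB4d` in this sketch), so the scheduler can optimize the hot trace without accounting for a side entrance.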
EPOS – based Wireless SoC Solution
• Each module is mapped directly onto a customized EPOS accelerator
• The interfaces between EPOS accelerators, as well as between other IP and EPOS modules, are defined in the HLL program and automatically synthesized along with the EPOS datapaths
• Exploration of alternative system implementations becomes efficient and extremely fast
• Each EPOS processor can be re-programmed within the system to execute optimized/modified versions of its original functionality
EPOS Performance Results
• EPOS Configuration used:
– 4 x ALU
– 1 x MUL
– 1 x ST-Port
– 1 x LD-Port
• FU Latencies:
– ALU: 1
– MUL: 3
– LD: 4
– ST: 1

Application   NISC (cycles)   EPOS (cycles)
startup       1002            793
dijkstra      36074           15096
bubble        9691            2916
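The cycle counts reported above can be turned into speed-up figures with a few lines of Python. The only assumption is the definition of speed-up as NISC cycles divided by EPOS cycles, which presumes both designs run at the same clock frequency; the poster does not state clock rates.

```python
# Cycle counts taken from the results above: (NISC cycles, EPOS cycles).
cycles = {
    "startup": (1002, 793),
    "dijkstra": (36074, 15096),
    "bubble": (9691, 2916),
}

# Speed-up of EPOS over NISC, assuming equal clock frequency for both.
speedup = {app: nisc / epos for app, (nisc, epos) in cycles.items()}

for app, s in speedup.items():
    print(f"{app:10s} {s:.2f}x")
# startup    1.26x
# dijkstra   2.39x
# bubble     3.32x
```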
[Figure: Wireless System built with EPOS accelerators. Blocks: Analog Circuits, Amplifier, Filter, ADC, USB EPOS, 802.11g EPOS, Bluetooth EPOS, SRAM, ROM, MCU, FFT EPOS, Interrupt Controller, Timers/Counters, DMA Controller, Crypto EPOS, DCT EPOS]
[Figure: Wireless System built with generic IP. Blocks: Analog Circuits, Amplifier, Filter, ADC, USB, 802.11g, Bluetooth, SRAM, ROM, CPU, DSP Core, Interrupt Controller, Timers/Counters, DMA Controller, Encryption/Decryption]
[Figure: Instruction-less custom processor. Labels: PC, Microcode Memory, FU1, FU2, FU3, Data Memory, Register File, Constant, Offset, +, 1]
[Figure: EPOS Flow. Stages: Superblock/Hyperblock Formation (IMPACT), Scheduling, Register Allocation, Forwarding Network Minimization]
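The scheduling stage of the EPOS flow can be illustrated with a generic resource-constrained list scheduler. The poster does not say which scheduling algorithm EPOS uses, so the Python below is only a sketch of a standard technique. It borrows the FU counts and latencies from the reported EPOS configuration; the function name `list_schedule`, the program-order priority, and the assumption that every unit is fully pipelined (one new operation per unit per cycle) are illustrative choices.

```python
# FU counts and latencies from the EPOS configuration used in the results.
UNITS = {"ALU": 4, "MUL": 1, "LD": 1, "ST": 1}
LATENCY = {"ALU": 1, "MUL": 3, "LD": 4, "ST": 1}

def list_schedule(ops, deps):
    """Assign a start cycle to each operation.

    ops:  {op_name: fu_type}, in priority (program) order
    deps: {op_name: [names of operations whose results it consumes]}
    Assumes an acyclic dependence graph and fully pipelined units.
    """
    start = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        free = dict(UNITS)                      # issue slots available this cycle
        for name in list(remaining):
            kind = ops[name]
            ready = all(
                p in start and start[p] + LATENCY[ops[p]] <= cycle
                for p in deps.get(name, [])
            )
            if ready and free[kind] > 0:
                start[name] = cycle
                free[kind] -= 1
                remaining.remove(name)
        cycle += 1
    return start

# u = a * b + a, then store u (a and b loaded from memory).
ops = {"ld_a": "LD", "ld_b": "LD", "mul": "MUL", "add": "ALU", "st": "ST"}
deps = {"mul": ["ld_a", "ld_b"], "add": ["mul", "ld_a"], "st": ["add"]}
schedule = list_schedule(ops, deps)
print(schedule)
```

With a single load port, the second load issues one cycle after the first, and the multiply waits for both loads to complete; the resulting start cycles would map onto successive microcode words.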
[Figure: EPOS accelerator. Labels: PC, +, 1, Offset, Constant, MCBank1, MCBank2, MCBank3, MCBank4, FU1, FU2, FU3, FU4, Register File, PRF, SRF1, SRF2, SRF3, SRF4, Data Memory]
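[Figure: Superblock formation. Control-flow graph of basic blocks BB1-BB4 with 90/10 edge weights, shown before and after formation; the result contains a duplicated block BB4d and superblocks SB1 and SB2]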
[Figure: Hyperblock formation. Control-flow graph of basic blocks BB1-BB4 with 55/45 edge weights, shown before and after formation; the result is hyperblock HB1]
[Figure: Performance Speed-up. Bar chart comparing NISC and EPOS on startup, dijkstra and bubble-sort; vertical axis from 0 to 3.5]