Reconfigurable Cell Array for DSP Applications
Chenxin Zhang
Department of Electrical and Information Technology, Lund University, Sweden
ETI180 DSP-design, Dec. 6th, 2011
Department of Electrical and Information Technology, Lund University
Outline
• Reconfigurable computing
• Coarse-grained reconfigurable cell array
– Processing cell
– Memory cell
– Network router cell
– System reconfiguration
• Case study
– Reconfigurable FIR
– Reconfigurable FFT processor
– Multi-standard OFDM coarse time synchronization
Reconfigurable computing
• Reconfigures the data path in addition to the control flow.
• Combines flexibility with high performance at a feasible
hardware cost.
• Software-centric programming approach.
• Coarse-grained granularity – trade-off between efficiency,
flexibility, and programmability.
• Dynamic reconfigurability.
High performance real-time DSP computing
[Figure: multiple standards on one platform (GPS, cellular, BT, WLAN, DVB-H, WiMAX, LTE-A, WCDMA, ..., 5G?) served by application and media processors]
Software-defined hardware
• Hardware sharing
– Accelerators: poor hardware reusability
– Reconfigurable architecture
+ Multi-task
+ Multi-standard
+ Multi-algorithm
− Control overhead, e.g. area, power.
[Figure: processing chain with blocks A, B, C, D]
Performance vs. Flexibility
• Specialized hardware (ASIC)
+ High performance, small size, low power
- Less flexible, manufacturing defects
- High NRE cost
• Standard processor (GPP, DSP…)
+ Flexible, Short design time
- Lack of computation capacity
• Fine-grained reconfigurable architecture (FPGA)
+ High calculation capacity, flexible
- Routing overhead, high power consumption
- Hardware oriented design approach
Application-specific DSP: Tensilica ConnX Baseband Engine
Tabula Spacetime
• Ultra-rapid reconfiguration:
multi-GHz rates
• 2.5x logic density
• 3.7x DSP performance
Coarse-grained reconfigurable architecture (CGRA)
• High calculation capacity and flexibility
• Software oriented: relatively fast development
• Tolerance to manufacturing defects
• Sacrificed area and energy efficiency compared to ASICs
• Sacrificed mapping flexibility compared to FPGAs
Related work
• ALU clusters: MathStar FPOA, RICA…
– Instruction level, data level parallelism
– SIMD or VLIW
• Processor array: RAW, WPPA, REMARC…
– Instruction level, data level, and task level parallelism
– MIMD
• Hybrid structure: ADRES, PACT XPP…
– Instruction level, data level, and task level parallelism
– SIMD or VLIW and MIMD
– Combined complexity?
Courtesy: MathStar, “FPOA architecture guide”.
Courtesy: D. Kissler et al., “A Highly Parameterizable Parallel Processor Array Architecture”.
Courtesy: PACT, “XPP-III Processor Overview”.
System infrastructure
• An array of resource cells.
• Heterogeneous cell array:
– Processing cell
– Memory cell
– Accelerator (e.g. no configuration)
• Hierarchical cell array.
[Figure: cell array with routers (R); labelled blocks include Addr. gen and Coeff. gen]
Resource cell
• Dedicated local interconnections:
– High data throughput
• Hierarchical global routing network:
– Flexible global data transmission
– External data access
– Global cell (re)configuration
• Data driven synchronization
• Single-Cycle-Per-Hop latency
• AMBA 4 AXI4-stream protocol
• GALS network data transmission
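The single-cycle-per-hop, data-driven behaviour listed above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: each hop is modelled as one register, a word advances when the next register is free, and the sink signals readiness through a callback. The names `simulate_hops` and `sink_ready` are hypothetical, and the model is not a signal-accurate AXI4-Stream implementation.

```python
def simulate_hops(num_hops, words, sink_ready):
    """Behavioural model: one register per hop, one word per register."""
    regs = [None] * num_hops
    pending = list(words)
    delivered = []          # (cycle, word) pairs seen by the sink
    cycle = 0
    while pending or any(r is not None for r in regs):
        # The sink accepts the last hop's word only when it is ready.
        if regs[-1] is not None and sink_ready(cycle):
            delivered.append((cycle, regs[-1]))
            regs[-1] = None
        # Data-driven forwarding: a word advances one hop per cycle
        # whenever the next register is free.
        for i in range(num_hops - 1, 0, -1):
            if regs[i] is None and regs[i - 1] is not None:
                regs[i], regs[i - 1] = regs[i - 1], None
        # The source injects a new word when the first hop is free.
        if pending and regs[0] is None:
            regs[0] = pending.pop(0)
        cycle += 1
    return delivered
```

With three hops and a sink that is always ready, a word injected in cycle 0 is delivered in cycle 3, and later words follow one per cycle. A sink that is never ready would make this loop run forever, so the callback must eventually return true.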
[Figure: resource cell (RC) with local ports L0 to L3, global port G0, and router R]
Processing cell
• Processing core
– ALU, DSP, SIMD, VLIW,
CORDIC...
– Implicit load-store operations in
all instructions.
– Run-time control and conditional
reconfiguration.
– In-cell NoC supervision and
reconfiguration.
• Processing shell
– Network adapter
[Figure: processing cell computing P3 = f(P1, P2)]
Example 1: Generic signal processing cell
• 4 pipeline stages.
• Hybrid Load-Store & Memory-Memory
architecture.
• Compact program size (memory
references).
• With external memory cells:
– Complex addressing modes, e.g.
memory indirect, auto-increment.
– Flexible usage: program/data
memory, processor stack, (cache).
• Single-cycle delayed branch.
• Zero-delay conditional inner loop
control.
[Figure: processing cell computing P3 = f(P1, P2)]
Example 2: Dataflow processing cell (I)
[Figure: pipeline labelled IF/ID, ID/EXE, EXE/WB with PC, branch logic, operation controller, registers, local IO ports L0 ... Lx, and global IO port G; the data path consists of an input arrangement MUX, arithmetic/logic selection, an output arrangement MUX, and an output MUX]
Example 2: Dataflow processing cell (II)
• SIMD/VLIW-like operation:
– 2/4-way 16/8-bit independent data processing
– Multi-level data processing (implicit prolog & epilog processing)
• Dual-operand instruction set:
– Dual-OpCode & Dual-Operand: e.g. ADDSUB R[d1], R[d2], R[s1], R[s2]
– Vector operation option: e.g. complex number arithmetic
• Dynamic data path reconfiguration
• Conditional instruction executions
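The dual-opcode instruction and the multi-way operation above can be sketched in a few lines. The semantics are assumptions made for illustration, not taken from the slides: `ADDSUB` is taken to write the sum to `R[d1]` and the difference to `R[d2]`, and the 2-way 16-bit and 4-way 8-bit modes are modelled as lanes packed into one 32-bit word.

```python
def addsub(regs, d1, d2, s1, s2):
    """Assumed ADDSUB semantics: R[d1] = R[s1] + R[s2], R[d2] = R[s1] - R[s2]."""
    a, b = regs[s1], regs[s2]      # read both sources before any write
    regs[d1] = a + b
    regs[d2] = a - b

def simd_add(x, y, lanes=2, width=16):
    """Lane-wise add of two packed words; carries never cross a lane boundary."""
    mask = (1 << width) - 1
    result = 0
    for lane in range(lanes):
        shift = lane * width
        total = ((x >> shift) & mask) + ((y >> shift) & mask)
        result |= (total & mask) << shift   # wrap-around within the lane
    return result
```

Under this reading, one `ADDSUB` produces both outputs of a radix-2 butterfly in a single instruction, which is one plausible motivation for the dual-opcode format.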
Dataflow processing cell: Dynamic data path reconfiguration
[Figure: data path stages: input arrangement MUX, arithmetic/logic selection, output arrangement MUX, output MUX]
Dataflow processing cell: Run-time data arrangement (II)
• Complex number multiplication vs. Real number multiplication
– MUL R3, R1, R2 ; R3 = R1 * R2, where {a, b} is stored in R1
and {c, d} is stored in R2.
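The difference between the two modes of `MUL` can be made concrete with a short sketch. It assumes, for illustration only, that {a, b} in R1 stands for a + jb and {c, d} in R2 for c + jd; the function names are hypothetical and registers are modelled as plain Python values.

```python
def mul_real(r1, r2):
    """Real mode: each register holds one real operand."""
    return r1 * r2

def mul_complex(r1, r2):
    """Complex mode: r1 = (a, b) means a + jb, r2 = (c, d) means c + jd."""
    a, b = r1
    c, d = r2
    real = a * c - b * d      # two multiplications and one subtraction
    imag = a * d + b * c      # two multiplications and one addition
    return (real, imag)
```

The complex product needs each operand half twice and in crossed combinations, which is why the same instruction requires a different input and output arrangement than the real product.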
Memory cell (I)
Memory cell (II): Memory descriptor
Memory cell (III)
[Figure: packing I/Q samples sent from PC0 into one word at address ‘X’ in MC0. Each in-phase and quadrature sample is reduced from 12 bits to 4 bits; the fields are shifted (by 0, 4, 16, and 20) and masked, then combined with a logic “or”. Panels (a) to (d) show the word filling up; after 4 iterations it holds 1(I) to 4(I) and 1(Q) to 4(Q).]
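The shift, mask, and OR packing shown in the figure can be sketched as follows. The field layout is an assumption chosen for illustration (Q fields in bits 0 to 15, I fields in bits 16 to 31, four bits per field), and the function names are hypothetical; the slides do not state the exact bit positions.

```python
def truncate_12_to_4(sample):
    """Keep the 4 most significant bits of a 12-bit two's-complement sample."""
    return (sample >> 8) & 0xF

def pack_iq(pairs):
    """Pack up to four (I, Q) pairs into one 32-bit word with shift, mask, OR."""
    word = 0
    for k, (i_sample, q_sample) in enumerate(pairs[:4]):
        word |= truncate_12_to_4(q_sample) << (4 * k)        # Q fields: bits 0..15
        word |= truncate_12_to_4(i_sample) << (16 + 4 * k)   # I fields: bits 16..31
    return word
```

With this layout the first pair uses shifts of 0 and 16 and the second pair uses shifts of 4 and 20. One memory word then carries four complex samples instead of one.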
Memory cell (IV): Reconfiguration
• Individual memory DSC loading & tracing
• Memory DSC execution program:
• Memory DSC execution mode: restart, resume
• Memory data dump (debug)
Network router cell (I)
• Cell structure:
– Decision unit
– Routing structure:
• Parallel network
• MUX-DEMUX switch
– Output packet queue (FIFOs)
Network router cell (II): Decision unit
• Static routing table
• Managing data transactions:
– Check in
– Packet arbitration (MUX-DEMUX switch)
• Fixed
• Round-robin
• Data broadcast
– Configure routing path
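The two arbitration policies named above can be sketched as follows. This is a generic illustration of fixed-priority and round-robin arbitration, not the router cell's actual decision logic; the function and class names are hypothetical, and `requests` is assumed to be a list of booleans with one entry per input port.

```python
def fixed_priority(requests):
    """Grant the lowest-numbered requesting input, or None if idle."""
    for index, wants in enumerate(requests):
        if wants:
            return index
    return None

class RoundRobinArbiter:
    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.last = num_inputs - 1   # so that input 0 is checked first

    def grant(self, requests):
        """Grant the next requesting input after the previous winner."""
        for offset in range(1, self.num_inputs + 1):
            index = (self.last + offset) % self.num_inputs
            if requests[index]:
                self.last = index
                return index
        return None
```

Fixed priority is cheaper but can starve high-numbered inputs under load. Round-robin rotates the starting point, so every requesting input is served within one full rotation.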
Action list with candidate transactions
O(0) O(1) O(2) O(3) O(4)
In(0) o
In(1) o o o
In(2) x
In(3) o
In(def) x
Action list with candidate transaction
O(0) O(1) O(2) O(3) O(4)
In(0) x
In(1) o x x
In(2) x
In(3) x
In(def) x
(Parallel network)
(MUX-DEMUX switch)
Static & Dynamic configuration (I)
[Figure: blocks shown are a master processor with icache and dcache, MPMC, memory, stream controller (StreamCtrl), configuration controller (Conf.Ctrl), and the router (R) network]
Static & Dynamic configuration (II)
[Figure: router (R) network with cells M1 to M4]
Case study: Reconfigurable FIR
• FIR filter
– Processing cell: MAC
– Memory cell: Input data FIFO, coefficient ROM
• Time-multiplexed structure for area-driven applications.
• Unfolding (parallelization) to improve processing throughput.
• High-precision computations.
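The trade-off between the time-multiplexed and the parallelized FIR mapping can be sketched in software. This is a behavioural illustration only: the input data FIFO is modelled with a `deque`, the coefficient ROM with a list, and the parallel variant simply splits the taps over several MAC units. The function names and the tap-splitting scheme are assumptions, not the mapping used on the cell array.

```python
from collections import deque

def fir_time_multiplexed(samples, coeffs):
    """One MAC unit: len(coeffs) multiply-accumulate steps per output."""
    fifo = deque([0] * len(coeffs), maxlen=len(coeffs))  # input data FIFO
    outputs = []
    for sample in samples:
        fifo.appendleft(sample)          # newest sample at index 0
        acc = 0
        for c, x in zip(coeffs, fifo):   # coefficient ROM read in tap order
            acc += c * x                 # one MAC operation
        outputs.append(acc)
    return outputs

def fir_parallel(samples, coeffs, num_macs=2):
    """Taps are split over num_macs MAC units; partial sums are then added."""
    fifo = deque([0] * len(coeffs), maxlen=len(coeffs))
    outputs = []
    for sample in samples:
        fifo.appendleft(sample)
        taps = list(zip(coeffs, fifo))
        partial = [sum(c * x for c, x in taps[m::num_macs])
                   for m in range(num_macs)]
        outputs.append(sum(partial))
    return outputs

def mac_steps_per_output(num_taps, num_macs):
    """Sequential MAC steps needed for one output sample (ceiling division)."""
    return -(-num_taps // num_macs)
```

Both functions produce the same output sequence. What changes is the number of sequential MAC steps per output, which falls as more processing cells are allocated.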
Case study: Reconfigurable FFT processor
• Radix-2^2 structure
• Folding
Radix-2^2 FFT building block
• Basic radix-2^2 FFT building block
• A 2,048-point radix-2^2 pipeline FFT
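The radix-2^2 decomposition behind the building block can be checked with a short software model. This is a recursive decimation-in-frequency sketch of the algorithm, not a model of the pipelined hardware: each level applies two butterflies, a trivial rotation by a power of -j, and one full twiddle multiplication before recursing on n/4 points. The function names are hypothetical. Sizes that are not a power of four, such as 2,048, end in a single radix-2 stage.

```python
import cmath

ROTATIONS = (1, -1j, -1, 1j)   # (-j)**m for m = 0..3, the trivial rotations

def fft_radix22(x):
    """Recursive radix-2^2 DIF FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:                          # leftover radix-2 stage
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        sign = 1 - 2 * k1               # (-1)**k1
        for k2 in (0, 1):
            m = k1 + 2 * k2
            branch = []
            for n3 in range(q):
                # First butterfly over distance n/2.
                b0 = x[n3] + sign * x[n3 + 2 * q]
                b1 = x[n3 + q] + sign * x[n3 + 3 * q]
                # Second butterfly with a trivial rotation by (-j)**m.
                h = b0 + ROTATIONS[m] * b1
                # Full twiddle factor W_n^(n3 * m).
                branch.append(h * cmath.exp(-2j * cmath.pi * n3 * m / n))
            sub = fft_radix22(branch)   # n/4-point transform
            for k3 in range(q):
                out[m + 4 * k3] = sub[k3]
    return out

def dft(x):
    """Direct O(n^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]
```

Only one non-trivial complex multiplication is needed per pair of butterfly stages, which is the property that makes the structure attractive for a pipeline built from processing and memory cells.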
Radix-2^2 pipeline FFT
• Simple mapping
– Simple to scale up.
– Local communication only.
– High storage capacity demand in each single memory cell.
• Simple mapping with concatenated memory cells
– Low storage capacity demand in each single memory cell.
– Global data communications.
Time-multiplexed FFT (I)
Time-multiplexed FFT (II)
• FFT benchmark comparison
– Rapid system reconfiguration: 40 ns @ 300 MHz
– High performance: 2.5x vs. DSPs, 6.5x vs. GPPs
Architecture         fmax [MHz]  FFT size [point]  Execution time [cc]  Code size [byte]  Reconfiguration code size [byte]
CGRA                 534         256               2,242                1,032             30
                                 1024              9,943
Texas TMS-320VC5502  300         256               5,389                462               462 (code reload)
                                 1024              25,921
ARM926EJ-S           276         256               13,194               -                 -
                                 1024              66,196
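The headline figures can be cross-checked against the table with a few lines of arithmetic. The sketch below only divides the cycle counts listed above; it assumes the quoted speed-ups refer to clock cycles rather than wall-clock time, which the slides do not state explicitly. The helper names are hypothetical.

```python
CYCLES = {
    "CGRA": {256: 2242, 1024: 9943},
    "TMS-320VC5502": {256: 5389, 1024: 25921},
    "ARM926EJ-S": {256: 13194, 1024: 66196},
}

def cycle_ratio(arch, size, reference="CGRA"):
    """How many times more clock cycles `arch` needs than the reference."""
    return CYCLES[arch][size] / CYCLES[reference][size]

def cycles_for(duration_ns, clock_mhz):
    """Number of clock cycles that fit into a duration at a given clock."""
    return round(duration_ns * clock_mhz / 1000)

for arch in ("TMS-320VC5502", "ARM926EJ-S"):
    for size in (256, 1024):
        print(arch, size, round(cycle_ratio(arch, size), 2))
print("reconfiguration cycles:", cycles_for(40, 300))
```

The ratios come out at roughly 2.4 to 2.6 against the DSP and roughly 5.9 to 6.7 against the ARM core, in line with the quoted 2.5x and 6.5x. A reconfiguration time of 40 ns at 300 MHz corresponds to 12 clock cycles.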
Case study: Multi-standard OFDM synchronization
• Multiple wireless radio standards
• Concurrent data stream processing
• Coarse Time Synchronization
• Carrier Frequency Offset (CFO) estimation
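[Figure: synchronization data path with correlation metric $\gamma[\theta]$, timing estimate $\hat{\theta}$, and phase $\arg\{\gamma[\hat{\theta}]\}$ used for CFO estimation]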
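Coarse time synchronization and CFO estimation can be illustrated with a textbook delay-and-correlate scheme. This is a sketch under stated assumptions, not the algorithm used in the case study: the preamble is assumed to consist of two identical halves of `half_len` samples, the metric is taken as gamma[theta] = sum over k of conj(r[theta+k]) * r[theta+k+half_len], the timing estimate is the position of the largest |gamma|, and the CFO follows from the phase of gamma at that position. All function names are hypothetical.

```python
import cmath
import math

def correlation_metric(r, half_len):
    """gamma[theta] = sum_k conj(r[theta+k]) * r[theta+k+half_len]."""
    span = len(r) - 2 * half_len + 1
    return [sum(r[t + k].conjugate() * r[t + k + half_len]
                for k in range(half_len))
            for t in range(span)]

def coarse_sync(r, half_len):
    """Return (theta_hat, cfo_hat); CFO is in cycles per sample."""
    gamma = correlation_metric(r, half_len)
    theta_hat = max(range(len(gamma)), key=lambda t: abs(gamma[t]))
    cfo_hat = cmath.phase(gamma[theta_hat]) / (2 * math.pi * half_len)
    return theta_hat, cfo_hat

def demo_signal(offset=10, cfo=0.01, half_len=16):
    """Zero padding, a preamble made of two identical halves, zero padding."""
    half = [cmath.exp(1j * math.pi / 2 * ((k * k) % 4)) for k in range(half_len)]
    tx = [0j] * offset + half + half + [0j] * offset
    return [s * cmath.exp(2j * math.pi * cfo * n) for n, s in enumerate(tx)]
```

Calling `coarse_sync(demo_signal(), 16)` on the noise-free demo signal recovers the preamble start and the applied frequency offset. The phase-based estimate is unambiguous only while the magnitude of the offset stays below 1 / (2 * half_len) cycles per sample.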
Implementation results (I)
• 65 nm low-power regular VT CMOS:
– Area: 0.48 mm^2
– Clock frequency: 534 MHz
• Adaptive word length scheduling.
• Adoption of different algorithms, e.g. a novel sign-bit OFDM acquisition.
Summary
• Reconfigurable cell array enables hardware sharing at
different levels, i.e., task-, function-, and algorithm-level.
• Coarse-grained reconfigurable cell array comprises
distributed processing and memory cells, and a
hierarchical NoC structure.
• In-cell dynamic reconfiguration enables fast context
switching.