Data-Parallel Digital Signal Processors: Algorithm Mapping, Architecture Scaling, and Workload Adaptation
Sridhar Rajagopal
Digital Signal Processors (DSPs)
Audio, automobile, broadband, military, networking, security, video and imaging, wireless communications
A $5 billion (and growing) market today
We always want something faster!
New high performance applications drive need for faster DSPs
• Physical-layer signal processing in high speed wireless communications to support multimedia
• Application-layer signal processing for video and imaging
Example: wireless systems (32-user system)

Generation (Time) | Data rates    | Estimation                | Detection                 | Decoding | Theoretical min ALUs @ 1 GHz
2G (1996)         | 16 Kbps/user  | Single-user correlator    | Matched filter            | Viterbi  | > 2
3G (2003)         | 128 Kbps/user | Multi-user max. likelihood | Interference cancellation | Viterbi  | > 20
4G (?)            | 1 Mbps/user   | MIMO chip equalizer       | Matched filter            | LDPC     | > 200
Data-Parallel DSPs: state-of-the-art
Clusters of ALUs provide billions of computations per second
Exploit data parallelism in signal processing applications
Imagine stream processor – Stanford (1998 - 2004)
[Figure: internal memory feeding clusters of ALUs (adders and multipliers)]
Proposal: Research questions for DP-DSPs
• Will DP-DSPs work well for wireless systems?
• How do I design DP-DSPs to meet real-time at lowest power?
• Can I improve power efficiency further by adapting DSPs to the application?
Contributions: Algorithm mapping
• Efficient mapping of (wireless) algorithms
– parallelization, structure, memory access patterns
– tradeoffs between ALU utilization, inter-cluster
communication, memory stalls, packing
• A reduced inter-cluster network proposed
– exploits inter-cluster communication patterns
– allows greater scalability of the architecture by reducing
wires
Contributions: Architecture scaling
• Design methodology and tool to explore architectures for low power
• Provides candidate architectures for low power
• Provides insights into ALU utilization and performance
• Compile-time exploration is orders-of-magnitude faster than run-time exploration
Contributions: Workload adaptation
• Adapt the number of clusters and ALUs to
changes in workload during run-time
• Multiplexer network designed
– adapts clusters to DP at run-time
– turns off unused clusters using power gating
• Significant power savings at run-time (up to 60%)
Thesis contributions
Data-Parallel DSPs
[Figure: DP-DSP cluster array annotated with the three contributions]
Algorithm mapping: design of algorithms for efficient mapping and performance
Architecture scaling: having designed the algorithms, find a low-power processor
Workload adaptation: having designed the processor, improve power at run-time
Outline
• DP-DSPs : Parallelism and architecture
• Power-aware design exploration
• Power-aware resource utilization at run-time
• Conclusions
Parallelism levels in DP-DSPs
Instruction Level Parallelism (ILP) - DSP
Subword Parallelism (SubP) - DSP
Data Parallelism (DP) – vector processor
Not independent: DP can decrease as ILP and SubP are increased
– e.g., by loop unrolling
Code snippet for ILP, SubP, DP

#define N 64

int i, a[N], b[N], sum[N];      /* 32-bit operands               */
short int c[N], d[N], diff[N];  /* 16-bit operands: SubP packs two per ALU op */

for (i = 0; i < N; ++i)
{
    sum[i]  = a[i] + b[i];   /* independent of diff[i] in same iteration: ILP */
    diff[i] = c[i] - d[i];   /* iterations independent of each other: DP      */
}
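To make the SubP case concrete, here is a minimal C sketch (hypothetical, not from the thesis) of how a SIMD-within-a-register ALU could execute two independent 16-bit subtractions of the loop above as one 32-bit operation; the function name `sub16x2` and the masking scheme are illustrative assumptions:

```c
#include <stdint.h>

/* Hypothetical illustration of subword parallelism (SubP):
 * two independent 16-bit subtractions packed into one 32-bit
 * operation, as a SIMD-within-a-register ALU would do. */
static uint32_t sub16x2(uint32_t c_pair, uint32_t d_pair)
{
    /* Subtract the low and high 16-bit lanes independently;
     * masking keeps a borrow in the low lane from corrupting
     * the high lane. */
    uint32_t lo = (c_pair - d_pair) & 0x0000FFFFu;
    uint32_t hi = ((c_pair >> 16) - (d_pair >> 16)) << 16;
    return hi | lo;
}
```

With `c[0]`, `c[1]` packed into one word (and likewise `d[0]`, `d[1]`), the loop body's 16-bit subtraction runs at twice the rate per ALU, which is exactly the SubP contribution counted alongside ILP and DP.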
Data-Parallel DSPs
• ILP, SubP within a cluster; DP across clusters
• Communication across clusters uses the inter-cluster communication network
• Microcontroller issues the same instruction to all clusters

[Figure: internal memory feeding clusters of ALUs; a microcontroller broadcasts instructions; ILP and SubP exploited within clusters, DP across clusters]
ILP is resource-bound
• ILP dependent on resources such as ALUs, read/write ports, inter-cluster communication, registers
• Any one resource bottleneck can affect ILP
[Figure: schedules for matrix-matrix multiplication as ALUs increase — adders, multipliers, and inter-cluster communication plotted against time]
Signal processing algorithms have abundant DP
Observations:
1. More DP is available after ILP and SubP have been exploited to the point of diminishing returns
2. This extra DP is used to set the number of clusters
3. As clusters are added to exploit this 'extra' DP, ILP and SubP are not affected significantly
This 'extra' DP is defined as Cluster DP (CDP)
Observing CDP in Viterbi decoding
[Figure: frequency needed to attain real-time (in MHz, log scale) vs. number of clusters (1–100) for Viterbi decoding with constraint lengths K = 5, 7, 9, with a conventional DSP for reference; the curves flatten at the maximum CDP]
Designing low power DP-DSPs
[Figure: the design space ranges from '1' cluster of many ALUs at 100 GHz, through 'c' clusters of 'a' adders and 'm' multipliers at 'f' MHz, to '100' clusters at 10 MHz]
Find the right (a,m,c,f) to minimize power
a – #adders/cluster, m – #multipliers/cluster, c – #clusters, f – clock frequency
Detailed simulation using the Imagine processor simulator
• Cycle accurate, parameterized simulator
– Insights into operations every cycle
• High-level C++-based programming
• GUI interface shows dependencies and schedule
• Power and VLSI scaling model available
• Open source allows modifications in architecture and tools
Need for design exploration tool
• Random choice may be way off
– 100x power variation possible
• Exhaustive simulation not possible
– large parameter space (hours for each simulation)
– DSP compilers need hand optimizations for performance
– evolving algorithms -- architecture exploration needed
Design exploration framework
Design phase: a base data-parallel DSP and the design (worst-case) workload feed an exploration of the (a,m,c,f) combination that minimizes power, yielding a hardware implementation.
Utilization phase: the application workload drives dynamic adaptation, turning down (a,m,c,f) to save power.
DSPs are compute-bound with predictable performance
[Figure: total execution time (cycles) broken into computations, microcontroller stalls, and exposed memory stalls; hidden memory stalls overlap with computation]

t_total = t_compute + t_stall
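Because DSPs are compute-bound with predictable performance, the real-time clock frequency follows directly from the cycle count and the deadline. A minimal C sketch of this model, with illustrative names and numbers (not taken from the tool):

```c
/* Sketch of the compute-bound performance model: with memory
 * stalls largely hidden, the clock frequency needed to meet a
 * real-time deadline follows directly from the cycle count.
 * The function name and arguments are illustrative assumptions. */
static double realtime_freq_mhz(double compute_cycles,
                                double exposed_stall_cycles,
                                double deadline_us)
{
    double total_cycles = compute_cycles + exposed_stall_cycles;
    return total_cycles / deadline_us;  /* cycles per microsecond == MHz */
}
```

For example, 500,000 compute cycles plus 38,000 exposed stall cycles against a 1 ms deadline would require a 538 MHz clock — this predictability is what lets the exploration run at compile time rather than in simulation.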
Minimization for power
C(a,m,c) – capacitance, from the simulator power model
f(a,m,c) – real-time clock frequency, obtained by running the application on the (a,m,c) architecture

min_{a,m,c} min_f P = C(a,m,c) · V² · f

With voltage scaling (V ∝ f):

min_{a,m,c} min_f P ∝ C(a,m,c) · f³
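The minimization above can be sketched as an exhaustive search over candidate (a,m,c) architectures. In the C sketch below, `cap_model` and `freq_model` are toy placeholders I have made up to stand in for the simulator-derived capacitance model and the measured real-time frequency; the search ranges are likewise illustrative:

```c
/* Toy stand-ins for the simulator-derived models (assumptions,
 * chosen only so the search is well-defined). */
static double cap_model(int a, int m, int c)  { return (a + 10.0 * m) * c; }
static double freq_model(int a, int m, int c) { (void)a; (void)m; return 1000.0 / c; }

/* Exhaustive minimization of P ~ C(a,m,c) * f^3 over candidate
 * (a, m, c) architectures; returns the minimum power found and
 * writes the winning configuration through the out-parameters. */
static double explore(int *best_a, int *best_m, int *best_c)
{
    double best_p = 1e300;
    for (int a = 1; a <= 5; ++a)
        for (int m = 1; m <= 3; ++m)
            for (int c = 1; c <= 512; c *= 2) {
                double f = freq_model(a, m, c);          /* MHz */
                double p = cap_model(a, m, c) * f * f * f; /* P ~ C f^3 */
                if (p < best_p) {
                    best_p = p;
                    *best_a = a; *best_m = m; *best_c = c;
                }
            }
    return best_p;
}
```

The real tool plugs in capacitance and frequency from the Imagine simulator model instead of these placeholders; the structure of the inner minimization is the point here.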
Sensitivity to technology and modeling
• Sensitivity to technology 'p'
• Sensitivity to the adder-multiplier power ratio 'α'
– 0.01 ≤ α ≤ 0.1 for 32-bit adders and 32x32-bit multipliers
• Sensitivity to memory stalls 'β', 0 ≤ β ≤ 1
– difficult to predict at compile time (5-20%)
– assume q = 25% of execution time as worst case
– f_stall = q · (1 - β) · f_min

min_{a,m,c} min_f P ∝ C(a,m,c) · f^p, where 2 ≤ p ≤ 3
Design exploration: big picture
1. Start with (a,m,c) = (∞, ∞, ∞)
2. Find (a,m,c) where ILP, SubP, and DP are fully exploited
3. Find c that minimizes P for (max(a), max(m))
4. Find (a,m) that minimizes P using that c
5. Explore sensitivity to α, β, and p

min_{a,m,c} min_f P ∝ C(a,m,c) · f^p
Running algorithms at (a_max, m_max, c_CDP)

Algorithm  | Kernel                    | CDP | MHz
Estimation | Correlation               | 32  | 1
           | Matrix mul                | 32  | 43
           | Iteration                 | 32  | 1
           | Transpose                 | 512 | < 1
           | Matrix mul L              | 32  | 22
           | Matrix mul C              | 32  | 22
Detection  | Matched filter            | 32  | 71
           | Interference cancellation | 32  | 83
Decoding   | Packing                   | 256 | < 1
           | Re-packing                | 64  | < 1
           | Initialization            | 64  | 17
           | Add-Compare-Select (ACS)  | 64  | 254
           | Decoding output           | 64  | 23

Min. real-time frequency for (a,m,c) = (5,3,512): 538 MHz
Real-time frequency with clusters for (a,m) = (5,3)
[Figure: real-time clock frequency (MHz, log scale) vs. clusters (1–1000, log scale) for β = 0, 0.5, 1; 538 MHz at 512 clusters, 541 MHz at 64 clusters]

f(c) = f(c_CDP) · c_CDP / c
Choosing clusters c = 64, 541 MHz
[Figure: normalized power (log scale) vs. clusters for Power ∝ f², f^2.5, f³; the minimum falls near c = 64]
ALU utilization (+,*)
[Figure: real-time frequency (MHz) vs. #adders and #multipliers per cluster, annotated with (+,*) utilization percentages (51,42), (55,62), (65,46), (67,62), (78,45); initial design (5,3,64) at 541 MHz, final design (3,1,64) at 567 MHz; c = 64, α = 0.01, β = 1, p = 3]
Choosing ALUs (a,m) for c = 64
                  | p = 2    | p = 2.5  | p = 3
β = 0,   α = 0.01 | (2,1,64) | (2,1,64) | (3,1,64)
β = 0.5, α = 0.01 | (2,1,64) | (3,1,64) | (3,1,64)
β = 1,   α = 0.01 | (2,1,64) | (3,1,64) | (3,1,64)
β = 1,   α = 0.1  | (2,1,64) | (3,1,64) | (3,1,64)
Insights from analysis
• Sensitivity importance: p, then β, then α
• Design gives candidates for low power solutions
– Design I : (a,m,c): (∞,∞,∞) → (5,3,512) → (5,3,64) → (2,1,64)
– Design II : (a,m,c): (∞,∞,∞) → (5,3,512) → (5,3,64) → (3,1,64)
• Power minimization is related to ALU efficiency
– same as maximizing a scaled version of ALU utilization
Advantages of design exploration tool
• Simulator (S)
– cycle-accurate (execution time found at run-time)
– explores 100 machine configurations in 100 hours (conservative)
– requires modification of parameters and code for different runs
• Tool (T)
– cycle-approximate (execution time found at compile time)
– explores millions of configurations in 100 hours
– automated process all the way: generated plots for the defense the day before
• Rapid evaluation of candidate algorithms for future systems
Verification of design tool
Human choice: (3,3,32) @ 1.2 V, 0.13 μm, 1 GHz = 18.2 W
Exploration tool choice: (2,1,64) at 887 MHz
Estimated base power @ 1.2 V, 0.13 μm = 13.2 W

[Figure: real-time clock frequency (MHz) predicted by the tool (T) vs. the simulator (S) for Design I, Design II, and the human design, split into computations and stalls]
Cluster utilization
• 64 clusters are inefficient in terms of cluster utilization (54% for the 33:64 case)
• But still lower power than 32 clusters due to the difference in f
– the difference shrinks as p → 2

[Figure: cluster utilization (%) vs. cluster index for 32-cluster and 64-cluster configurations]
Improving power efficiency
• Clusters significant source of power consumption (50-75%)
• When CDP < c, unutilized clusters waste power
• Dynamically turn off clusters using power gating to improve power efficiency
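The run-time adaptation can be sketched in a few lines of C. The power-of-two halving mirrors the mux network's 4:2 / 4:1 style reconfiguration described later; the linear cluster-power model and both function names are my assumptions, not the thesis's implementation:

```c
/* Sketch of run-time workload adaptation: with c physical
 * clusters and a workload whose cluster data parallelism is
 * cdp, power-gate the clusters the workload cannot use.
 * Halving mirrors power-of-two reconfiguration (4:2, 4:1). */
static int active_clusters(int c, int cdp)
{
    int active = c;
    while (active / 2 >= cdp && active > 1)
        active /= 2;            /* 4 -> 2 -> 1 style reconfiguration */
    return active;
}

/* Assumes cluster power scales linearly with powered-on clusters
 * and that gated clusters dissipate neither dynamic nor static
 * power -- a simplifying model for illustration. */
static double cluster_power(double full_power, int c, int active)
{
    return full_power * (double)active / (double)c;
}
```

Under this model, a workload with CDP = 32 on a 64-cluster DP-DSP powers only 32 clusters and halves cluster power, which is consistent in spirit with the up-to-60% run-time savings reported above.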
Data access difficult after adaptation
Clusters off – then how to get data from other banks?
4 → 2 clusters:
• Data is not in the correct memory banks
• Overhead in bringing data: external memory, inter-cluster network

[Figure: four-cluster array reconfigured down to two clusters]
Multiplexer network design
Multiplexer network adapts clusters to DP
Modes: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, all clusters off
Unused clusters are turned off using power gating to eliminate both static and dynamic power dissipation
Run-time variations in workload
[Figure: cluster utilization (%) vs. cluster index for Viterbi decoding with constraint lengths K = 9, 7, 5]
Benefits of multiplexer network
Power efficiency at design time:
Human choice: (3,3,32), base power @ 1.2 V, 0.13 μm, 1 GHz = 18.2 W
Exploration tool choice: (2,1,64), base power @ 1.2 V, 0.13 μm, 887 MHz = 13.2 W
Power efficiency at run-time (with mux network):
K = 9: 9.9 W, K = 7: 7.4 W, K = 5: 6.8 W
Design exploration for 2G-3G-4G systems
A “power”ful tool for algorithm-architecture exploration
[Figure: real-time clock frequency (MHz, log scale) vs. data rates for 2G, 3G, and 4G systems; (1,1,32) and (2,1,32) cover the lower data rates, (2,1,64) and (3,1,64) the 4G rates]
Broader impact
• Power-aware design exploration with improved run-time power efficiency
• Techniques can be applied to all high performance, power efficient DSP designs– Handsets, cameras, video
Future extensions
• Fabrication needed to verify concepts
• Higher performance
– multi-threading (ILP, SubP, DP, MT)
– pipelining (ILP, SubP, DP, MT, PP)
• LDPC decoding
– sparse matrix requires permutations over large data
– indexed SRF in stream processors [Jayasena, HPCA 2004]
Conclusions
• Providing high performance with 100s-1000s of ALUs while keeping power low – a challenge for DSP designers
• Algorithm design for efficient mapping on DP-DSPs
• Design exploration tool for low power DP-DSPs – Provides candidate DSPs for low power – Allows algorithm-architecture evaluation for new systems
• Power efficiency provided during both design and use of DP-DSPs
Acknowledgements
• Dr. Joseph R. Cavallaro, Dr. Scott Rixner
• Imagine stream processor group at Stanford– Abhishek, Ujval, Brucek, Dr. Dally
• Marjan, Predrag, Alex– 4G MIMO + LDPC
• Thesis committee
• Nokia, Texas Instruments, TATP, NSF