Data-Parallel Digital Signal Processors: Algorithm Mapping, Architecture Scaling, and Workload Adaptation
Sridhar Rajagopal
Digital Signal Processors (DSPs)
Audio, automobile, broadband, military, networking, security, video and imaging, wireless communications
A $5 billion (and growing) market today
We always want something faster!
New high performance applications drive need for faster DSPs
• Physical-layer signal processing in high speed wireless communications to support multimedia
• Application-layer signal processing for video and imaging
Example: wireless systems (32-user system)

Generation (Time) | Data rates    | Estimation                | Detection                 | Decoding | Theoretical min ALUs @ 1 GHz
2G (1996)         | 16 Kbps/user  | Single-user correlator    | Matched filter            | Viterbi  | > 2
3G (2003)         | 128 Kbps/user | Multi-user max. likelihood | Interference cancellation | Viterbi  | > 20
4G (?)            | 1 Mbps/user   | MIMO chip equalizer       | Matched filter            | LDPC     | > 200
Data-Parallel DSPs: state-of-the-art
Clusters of ALUs provide billions of computations per second
Exploit data parallelism in signal processing applications
Imagine stream processor – Stanford (1998 - 2004)
[Figure: internal memory feeding clusters of ALUs (adders and multipliers)]
Proposal: Research questions for DP-DSPs
• Will DP-DSPs work well for wireless systems?
• How do I design DP-DSPs to meet real-time at lowest power?
• Can I improve power efficiency further by adapting DSPs to the application?
Contributions: Algorithm mapping
• Efficient mapping of (wireless) algorithms
– parallelization, structure, memory access patterns
– tradeoffs between ALU utilization, inter-cluster
communication, memory stalls, packing
• A reduced inter-cluster network proposed
– exploits inter-cluster communication patterns
– allows greater scalability of the architecture by reducing
wires
Contributions: Architecture scaling
• Design methodology and tool to explore architectures for low power
• Provides candidate architectures for low power
• Provides insights into ALU utilization and performance
• Compile-time exploration is orders-of-magnitude faster than run-time exploration
Contributions: Workload adaptation
• Adapt the number of clusters and ALUs to
changes in workload during run-time
• Multiplexer network designed
– adapts clusters to DP at run-time
– turns off unused clusters using power gating
• Significant power savings at run-time (up to 60%)
Thesis contributions
Data-Parallel DSPs
[Figure: DP-DSP cluster array annotated with the three contributions]
Algorithm mapping: design of algorithms for efficient mapping and performance
Architecture scaling: having designed the algorithms, find a low-power processor
Workload adaptation: having designed the processor, improve power at run-time
Outline
• DP-DSPs : Parallelism and architecture
• Power-aware design exploration
• Power-aware resource utilization at run-time
• Conclusions
Parallelism levels in DP-DSPs
Instruction Level Parallelism (ILP) - DSP
Subword Parallelism (SubP) - DSP
Data Parallelism (DP) – vector processor
Not independent: DP can decrease as ILP and SubP are increased
– e.g., by loop unrolling
Code snippet for ILP, SubP, DP

#define N 64

int i, a[N], b[N], sum[N];      /* 32-bit operands               */
short int c[N], d[N], diff[N];  /* 16-bit operands: SubP packs two per ALU op */

for (i = 0; i < N; ++i)
{
    sum[i]  = a[i] + b[i];   /* independent of diff[i] in same iteration: ILP */
    diff[i] = c[i] - d[i];   /* iterations independent of each other: DP      */
}
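To make the SubP case concrete, here is a minimal C sketch (hypothetical, not from the thesis) of how a SIMD-within-a-register ALU could execute two independent 16-bit subtractions of the loop above as one 32-bit operation; the function name `sub16x2` and the masking scheme are illustrative assumptions:

```c
#include <stdint.h>

/* Hypothetical illustration of subword parallelism (SubP):
 * two independent 16-bit subtractions packed into one 32-bit
 * operation, as a SIMD-within-a-register ALU would do. */
static uint32_t sub16x2(uint32_t c_pair, uint32_t d_pair)
{
    /* Subtract the low and high 16-bit lanes independently;
     * masking keeps a borrow in the low lane from corrupting
     * the high lane. */
    uint32_t lo = (c_pair - d_pair) & 0x0000FFFFu;
    uint32_t hi = ((c_pair >> 16) - (d_pair >> 16)) << 16;
    return hi | lo;
}
```

With `c[0]`, `c[1]` packed into one word (and likewise `d[0]`, `d[1]`), the loop body's 16-bit subtraction runs at twice the rate per ALU, which is exactly the SubP contribution counted alongside ILP and DP.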
Data-Parallel DSPs
• ILP, SubP within a cluster; DP across clusters
• Communication across clusters uses the inter-cluster communication network
• Microcontroller issues the same instruction to all clusters

[Figure: internal memory feeding clusters of ALUs; a microcontroller broadcasts instructions; ILP and SubP exploited within clusters, DP across clusters]
ILP is resource-bound
• ILP dependent on resources such as ALUs, read/write ports, inter-cluster communication, registers
• Any one resource bottleneck can affect ILP
[Figure: schedules for matrix-matrix multiplication as ALUs increase — adders, multipliers, and inter-cluster communication plotted against time]
Signal processing algorithms have abundant DP
Observations:
1. More DP is available after ILP and SubP have been exploited to the point of diminishing returns
2. This extra DP is used to set the number of clusters
3. As clusters are added to exploit this 'extra' DP, ILP and SubP are not affected significantly
This 'extra' DP is defined as Cluster DP (CDP)
Observing CDP in Viterbi decoding
[Figure: frequency needed to attain real-time (in MHz, log scale) vs. number of clusters (1–100) for Viterbi decoding with constraint lengths K = 5, 7, 9, with a conventional DSP for reference; the curves flatten at the maximum CDP]
Designing low power DP-DSPs
[Figure: the design space ranges from '1' cluster of many ALUs at 100 GHz, through 'c' clusters of 'a' adders and 'm' multipliers at 'f' MHz, to '100' clusters at 10 MHz]
Find the right (a,m,c,f) to minimize power
a – #adders/cluster, m – #multipliers/cluster, c – #clusters, f – clock frequency
Detailed simulation using the Imagine processor simulator
• Cycle accurate, parameterized simulator
– Insights into operations every cycle
• High-level C++-based programming
• GUI interface shows dependencies and schedule
• Power and VLSI scaling model available
• Open source allows modifications in architecture and tools
Need for design exploration tool
• Random choice may be way off
– 100x power variation possible
• Exhaustive simulation not possible
– large parameter space (hours for each simulation)
– DSP compilers need hand optimizations for performance
– evolving algorithms -- architecture exploration needed
Design exploration framework
Design phase: a base data-parallel DSP and the design (worst-case) workload feed an exploration of the (a,m,c,f) combination that minimizes power, yielding a hardware implementation.
Utilization phase: the application workload drives dynamic adaptation, turning down (a,m,c,f) to save power.
DSPs are compute-bound with predictable performance
[Figure: total execution time (cycles) broken into computations, microcontroller stalls, and exposed memory stalls; hidden memory stalls overlap with computation]

t_total = t_compute + t_stall
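Because DSPs are compute-bound with predictable performance, the real-time clock frequency follows directly from the cycle count and the deadline. A minimal C sketch of this model, with illustrative names and numbers (not taken from the tool):

```c
/* Sketch of the compute-bound performance model: with memory
 * stalls largely hidden, the clock frequency needed to meet a
 * real-time deadline follows directly from the cycle count.
 * The function name and arguments are illustrative assumptions. */
static double realtime_freq_mhz(double compute_cycles,
                                double exposed_stall_cycles,
                                double deadline_us)
{
    double total_cycles = compute_cycles + exposed_stall_cycles;
    return total_cycles / deadline_us;  /* cycles per microsecond == MHz */
}
```

For example, 500,000 compute cycles plus 38,000 exposed stall cycles against a 1 ms deadline would require a 538 MHz clock — this predictability is what lets the exploration run at compile time rather than in simulation.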
Minimization for power
C(a,m,c) – capacitance, from the simulator power model
f(a,m,c) – real-time clock frequency, obtained by running the application on the (a,m,c) architecture

min_{a,m,c} min_f P = C(a,m,c) · V² · f

With voltage scaling (V ∝ f):

min_{a,m,c} min_f P ∝ C(a,m,c) · f³
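The minimization above can be sketched as an exhaustive search over candidate (a,m,c) architectures. In the C sketch below, `cap_model` and `freq_model` are toy placeholders I have made up to stand in for the simulator-derived capacitance model and the measured real-time frequency; the search ranges are likewise illustrative:

```c
/* Toy stand-ins for the simulator-derived models (assumptions,
 * chosen only so the search is well-defined). */
static double cap_model(int a, int m, int c)  { return (a + 10.0 * m) * c; }
static double freq_model(int a, int m, int c) { (void)a; (void)m; return 1000.0 / c; }

/* Exhaustive minimization of P ~ C(a,m,c) * f^3 over candidate
 * (a, m, c) architectures; returns the minimum power found and
 * writes the winning configuration through the out-parameters. */
static double explore(int *best_a, int *best_m, int *best_c)
{
    double best_p = 1e300;
    for (int a = 1; a <= 5; ++a)
        for (int m = 1; m <= 3; ++m)
            for (int c = 1; c <= 512; c *= 2) {
                double f = freq_model(a, m, c);          /* MHz */
                double p = cap_model(a, m, c) * f * f * f; /* P ~ C f^3 */
                if (p < best_p) {
                    best_p = p;
                    *best_a = a; *best_m = m; *best_c = c;
                }
            }
    return best_p;
}
```

The real tool plugs in capacitance and frequency from the Imagine simulator model instead of these placeholders; the structure of the inner minimization is the point here.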
Sensitivity to technology and modeling
• Sensitivity to technology 'p'
• Sensitivity to the adder-multiplier power ratio 'α'
– 0.01 ≤ α ≤ 0.1 for 32-bit adders and 32x32-bit multipliers
• Sensitivity to memory stalls 'β', 0 ≤ β ≤ 1
– difficult to predict at compile time (5-20%)
– assume q = 25% of execution time as worst case
– f_stall = q · (1 - β) · f_min

min_{a,m,c} min_f P ∝ C(a,m,c) · f^p, where 2 ≤ p ≤ 3
Design exploration: big picture
1. Start with (a,m,c) = (∞, ∞, ∞)
2. Find (a,m,c) where ILP, SubP, and DP are fully exploited
3. Find c that minimizes P for (max(a), max(m))
4. Find (a,m) that minimizes P using that c
5. Explore sensitivity to α, β, and p

min_{a,m,c} min_f P ∝ C(a,m,c) · f^p
Running algorithms at (a_max, m_max, c_CDP)

Algorithm  | Kernel                    | CDP | MHz
Estimation | Correlation               | 32  | 1
           | Matrix mul                | 32  | 43
           | Iteration                 | 32  | 1
           | Transpose                 | 512 | < 1
           | Matrix mul L              | 32  | 22
           | Matrix mul C              | 32  | 22
Detection  | Matched filter            | 32  | 71
           | Interference cancellation | 32  | 83
Decoding   | Packing                   | 256 | < 1
           | Re-packing                | 64  | < 1
           | Initialization            | 64  | 17
           | Add-Compare-Select (ACS)  | 64  | 254
           | Decoding output           | 64  | 23

Min. real-time frequency for (a,m,c) = (5,3,512): 538 MHz
Real-time frequency with clusters for (a,m) = (5,3)
[Figure: real-time clock frequency (MHz, log scale) vs. clusters (1–1000, log scale) for β = 0, 0.5, 1; 538 MHz at 512 clusters, 541 MHz at 64 clusters]

f(c) = f(c_CDP) · c_CDP / c
Choosing clusters c = 64, 541 MHz
[Figure: normalized power (log scale) vs. clusters for Power ∝ f², f^2.5, f³; the minimum falls near c = 64]
ALU utilization (+,*)
[Figure: real-time frequency (MHz) vs. #adders and #multipliers per cluster, annotated with (+,*) utilization percentages (51,42), (55,62), (65,46), (67,62), (78,45); initial design (5,3,64) at 541 MHz, final design (3,1,64) at 567 MHz; c = 64, α = 0.01, β = 1, p = 3]
Choosing ALUs (a,m) for c = 64
                  | p = 2    | p = 2.5  | p = 3
β = 0,   α = 0.01 | (2,1,64) | (2,1,64) | (3,1,64)
β = 0.5, α = 0.01 | (2,1,64) | (3,1,64) | (3,1,64)
β = 1,   α = 0.01 | (2,1,64) | (3,1,64) | (3,1,64)
β = 1,   α = 0.1  | (2,1,64) | (3,1,64) | (3,1,64)
Insights from analysis
• Sensitivity importance: p, then β, then α
• Design gives candidates for low power solutions
– Design I : (a,m,c): (∞,∞,∞) → (5,3,512) → (5,3,64) → (2,1,64)
– Design II : (a,m,c): (∞,∞,∞) → (5,3,512) → (5,3,64) → (3,1,64)
• Power minimization is related to ALU efficiency
– same as maximizing a scaled version of ALU utilization
Advantages of design exploration tool
• Simulator (S)
– cycle-accurate (execution time found at run-time)
– explores 100 machine configurations in 100 hours (conservative)
– requires modification of parameters and code for different runs
• Tool (T)
– cycle-approximate (execution time found at compile time)
– explores millions of configurations in 100 hours
– automated process all the way: generated plots for the defense the day before
• Rapid evaluation of candidate algorithms for future systems
Verification of design tool
Human choice: (3,3,32) @ 1.2 V, 0.13 μm, 1 GHz = 18.2 W
Exploration tool choice: (2,1,64) at 887 MHz
Estimated base power @ 1.2 V, 0.13 μm = 13.2 W

[Figure: real-time clock frequency (MHz) predicted by the tool (T) vs. the simulator (S) for Design I, Design II, and the human design, split into computations and stalls]
Cluster utilization
• 64 clusters are inefficient in terms of cluster utilization (54% for the 33:64 case)
• But still lower power than 32 clusters due to the difference in f
– the difference shrinks as p → 2

[Figure: cluster utilization (%) vs. cluster index for 32-cluster and 64-cluster configurations]
Improving power efficiency
• Clusters significant source of power consumption (50-75%)
• When CDP < c, unutilized clusters waste power
• Dynamically turn off clusters using power gating to improve power efficiency
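The run-time adaptation can be sketched in a few lines of C. The power-of-two halving mirrors the mux network's 4:2 / 4:1 style reconfiguration described later; the linear cluster-power model and both function names are my assumptions, not the thesis's implementation:

```c
/* Sketch of run-time workload adaptation: with c physical
 * clusters and a workload whose cluster data parallelism is
 * cdp, power-gate the clusters the workload cannot use.
 * Halving mirrors power-of-two reconfiguration (4:2, 4:1). */
static int active_clusters(int c, int cdp)
{
    int active = c;
    while (active / 2 >= cdp && active > 1)
        active /= 2;            /* 4 -> 2 -> 1 style reconfiguration */
    return active;
}

/* Assumes cluster power scales linearly with powered-on clusters
 * and that gated clusters dissipate neither dynamic nor static
 * power -- a simplifying model for illustration. */
static double cluster_power(double full_power, int c, int active)
{
    return full_power * (double)active / (double)c;
}
```

Under this model, a workload with CDP = 32 on a 64-cluster DP-DSP powers only 32 clusters and halves cluster power, which is consistent in spirit with the up-to-60% run-time savings reported above.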
Data access difficult after adaptation
Clusters off – then how to get data from other banks?
4 → 2 clusters:
• Data is not in the correct memory banks
• Overhead in bringing data: external memory, inter-cluster network

[Figure: four-cluster array reconfigured down to two clusters]
Multiplexer network design
Multiplexer network adapts clusters to DP
Modes: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, all clusters off
Unused clusters are turned off using power gating to eliminate both static and dynamic power dissipation
Run-time variations in workload
[Figure: cluster utilization (%) vs. cluster index for Viterbi decoding with constraint lengths K = 9, 7, 5]
Benefits of multiplexer network
Power efficiency at design time:
Human choice: (3,3,32), base power @ 1.2 V, 0.13 μm, 1 GHz = 18.2 W
Exploration tool choice: (2,1,64), base power @ 1.2 V, 0.13 μm, 887 MHz = 13.2 W
Power efficiency at run-time (with mux network):
K = 9: 9.9 W, K = 7: 7.4 W, K = 5: 6.8 W
Design exploration for 2G-3G-4G systems
A “power”ful tool for algorithm-architecture exploration
[Figure: real-time clock frequency (MHz, log scale) vs. data rates for 2G, 3G, and 4G systems; (1,1,32) and (2,1,32) cover the lower data rates, (2,1,64) and (3,1,64) the 4G rates]
Broader impact
• Power-aware design exploration with improved run-time power efficiency
• Techniques can be applied to all high performance, power efficient DSP designs– Handsets, cameras, video
Future extensions
• Fabrication needed to verify concepts
• Higher performance
– multi-threading (ILP, SubP, DP, MT)
– pipelining (ILP, SubP, DP, MT, PP)
• LDPC decoding
– sparse matrix requires permutations over large data
– indexed SRF in stream processors [Jayasena, HPCA 2004]
Conclusions
• Providing high performance with 100s-1000s of ALUs while keeping power low – a challenge for DSP designers
• Algorithm design for efficient mapping on DP-DSPs
• Design exploration tool for low power DP-DSPs – Provides candidate DSPs for low power – Allows algorithm-architecture evaluation for new systems
• Power efficiency provided during both design and use of DP-DSPs
Acknowledgements
• Dr. Joseph R. Cavallaro, Dr. Scott Rixner
• Imagine stream processor group at Stanford– Abhishek, Ujval, Brucek, Dr. Dally
• Marjan, Predrag, Alex– 4G MIMO + LDPC
• Thesis committee
• Nokia, Texas Instruments, TATP, NSF