Chapter 4 Pipelining and Parallel Processing - SOC & DSP...


VLSI DSP 2008 Y.T. Hwang 5-1

Chapter 4 Pipelining and Parallel Processing

Introduction (1)

Pipelining

Reduction in critical path

Increase the clock speed

Reduce power consumption at same speed

Parallel processing

Parallelism

Increase effective sampling speed

Reduction of power consumption

Introduction (2)

A 3-tap FIR filter

y(n)=ax(n)+bx(n-1)+cx(n-2)

Critical path: 1 multiply and 2 add

$T_{sample} \ge T_M + 2T_A$

$f_{sample} \le \dfrac{1}{T_M + 2T_A}$

Introduction (3)

Pipelining or parallel processing to increase the sampling frequency

Critical path: 2 add

Pipelining

Parallel processing

Pipelining of FIR digital filters (1)

Feed-forward cut set

Two iterations are computed concurrently

Critical path reduced from TM+2TA to TM+TA

Latency increased from 1 to 2

Drawbacks of pipelining

Increase in the number of latches and in system latency
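The effect of the pipelining latches can be sketched in a few lines of Python. This is only an illustrative model, not the circuit from the slides: the function names and the latch variables `latch_sum` and `latch_x` are assumptions, and the cut is assumed to fall between the first and the second adder. Stage 1 computes a·x(n) + b·x(n-1), stage 2 adds c·x(n-2), and the output comes out one clock later than in the sequential filter.

```python
def fir_direct(x, a, b, c):
    """Sequential 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a * x[n] + b * x1 + c * x2)
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with one extra latch on each edge of the assumed cut set."""
    d1 = d2 = 0      # original tap delays holding x(n-1) and x(n-2)
    latch_sum = 0    # pipelining latch between the two adders
    latch_x = 0      # pipelining latch on the path feeding the third tap
    y = []
    for xn in x:
        # Stage 2 works on the values latched during the previous clock.
        y.append(latch_sum + c * latch_x)
        # Stage 1 result and x(n-2) are captured for the next clock.
        latch_sum = a * xn + b * d1
        latch_x = d2
        # Advance the original delay line.
        d2, d1 = d1, xn
    return y
```

Feeding both functions the same input gives the same sample values, with the pipelined output shifted by one clock, which is the latency cost listed under the drawbacks.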

Observations

The clock period is limited by the longest path between

Two latches

An input and a latch

A latch and an output

An input and an output

Critical path can be reduced by suitably placing the pipelining latches

Pipelining latches can be placed across any feed-forward cutset of the graph

Cut set

A set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint

Feed-forward cut set

The data move in the forward direction on all the edges of the cut set

We can arbitrarily place latches on a feed-forward cut set w/o affecting the functionality of the algorithm
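The two definitions above can be turned into a small check. The sketch below is an illustration under stated assumptions: the graph is a direct-form 3-tap FIR with the delays modeled as nodes `D1` and `D2`, all node names are made up for the example, and the check is simplified to cuts that split the graph into exactly two parts.

```python
from collections import defaultdict


def is_feed_forward_cutset(edges, cut):
    """True if removing `cut` splits the graph in two and all cut edges
    point from the same part to the other part."""
    cut = set(cut)
    nodes = {n for edge in edges for n in edge}
    # Undirected adjacency of the graph with the cut edges removed.
    adj = defaultdict(set)
    for u, v in edges:
        if (u, v) not in cut:
            adj[u].add(v)
            adj[v].add(u)
    # Label connected components with a depth-first search.
    comp = {}
    for start in nodes:
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in comp:
                    comp[nxt] = start
                    stack.append(nxt)
    if len(set(comp.values())) != 2:
        return False
    # Every cut edge must cross between the parts in the same direction.
    directions = {(comp[u], comp[v]) for u, v in cut}
    return len(directions) == 1 and all(a != b for a, b in directions)


# Assumed direct-form 3-tap FIR graph (hypothetical node names).
FIR_EDGES = [
    ("x", "Ma"), ("x", "D1"), ("D1", "Mb"), ("D1", "D2"), ("D2", "Mc"),
    ("Ma", "A1"), ("Mb", "A1"), ("A1", "A2"), ("Mc", "A2"), ("A2", "y"),
]
```

With this graph, the pair of edges `("D1", "D2")` and `("A1", "A2")` passes the check, the single edge `("A1", "A2")` does not disconnect the graph, and the pair `("x", "Ma")`, `("Ma", "A1")` is a cut set but not a feed-forward one.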

Example 3.2.1

Figure: incorrect pipelining vs. correct pipelining

Original critical path: A3 → A5 → A4 → A6

After pipelining: A3 → A5 or A4 → A6

Critical path is reduced by one half

Direct vs. transpose form

Direct form with long critical path

Transpose form with data broadcast structure

Critical path is reduced to TM + TA

Fine-Grain pipelining

Pipelining the function unit

Assume TM = 10 units, TA = 2 units

After pipelining, the critical path is 6 units

Parallel processing of FIR filter (1)

Block processing of size L

y(n)=ax(n)+bx(n-1)+cx(n-2)

y(3k)=ax(3k)+bx(3k-1)+cx(3k-2)

y(3k+1)=ax(3k+1)+bx(3k)+cx(3k-1)

y(3k+2)=ax(3k+2)+bx(3k+1)+cx(3k)
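A quick numerical check of the block equations above, written as a minimal Python sketch. The function names are illustrative only, and samples before n = 0 are assumed to be zero.

```python
def sample(x, n):
    """x(n), with zero assumed outside the recorded range."""
    return x[n] if 0 <= n < len(x) else 0


def fir_sequential(x, a, b, c):
    """One output per iteration: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    return [a * sample(x, n) + b * sample(x, n - 1) + c * sample(x, n - 2)
            for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """Three outputs per iteration k, using the block equations."""
    assert len(x) % 3 == 0, "input length must be a multiple of the block size"
    y = []
    for k in range(len(x) // 3):
        y0 = a * sample(x, 3 * k) + b * sample(x, 3 * k - 1) + c * sample(x, 3 * k - 2)
        y1 = a * sample(x, 3 * k + 1) + b * sample(x, 3 * k) + c * sample(x, 3 * k - 1)
        y2 = a * sample(x, 3 * k + 2) + b * sample(x, 3 * k + 1) + c * sample(x, 3 * k)
        y.extend([y0, y1, y2])
    return y
```

Both functions return the same sequence; the parallel version needs one third as many loop iterations, which mirrors producing 3 samples per clock cycle.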

Block delay (L-slow): placing a latch at any line of MIMO structures produces an effective delay of L clocks at the sample rate

Block size 3

3 times the hardware

Critical path remains unchanged TM+2TA

Tclk ≥ TM+2TA

3 samples are produced in 1 clock cycle

Effective iteration period is $T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$

Note: Tclk ≠ Tsample

MIMO system

Complete parallel processing

System with block size 4

A serial-to-parallel converter

A parallel-to-serial converter

Pipelining vs. parallel processing

Limitation of pipelining processing

Input/output bottleneck, i.e. communication-bounded system

Pipelining period cannot be smaller than the communication or I/O bound

pipelining & parallel processing

Combined fine grain pipelining and parallel processing for 3-tap FIR filter

L = 3, M = 2

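$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} = \frac{1}{6} T_{clk}$, with $T_{clk}$ the clock period of the original filter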

Pipelining & parallel processing for low power

Advantages of pipelining and parallel processing

High speed

Low power

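CMOS circuit model

1st order analysis

Propagation delay: $T_{pd} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}$, where Ccharge is the capacitance charged/discharged in one clock cycle, V0 the supply voltage, Vt the threshold voltage, and k a technology-dependent constant

Power consumption: $P = C_{total} V_0^2 f$, where Ctotal is the total capacitance and f the clock frequency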

Pipelining for low power (1)

Sequential version

M-level pipelined version

Working at the same frequency, i.e. f = 1/Tseq remains unchanged

Capacitance in each pipeline stage is reduced to Ccharge/M

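Only βV0 (β < 1) is needed to charge Ccharge/M in Tseq

$P_{seq} = C_{total} V_0^2 f, \quad f = 1/T_{seq}$

$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$

Calculation of β

$T_{seq} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}, \quad T_{pip} = \dfrac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$

Setting $T_{pip} = T_{seq}$ gives $M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$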

Example

3-tap FIR filter

Tm = 10, Ta = 2, Cm = 5Ca

Pipelined multiplier, Tm1 = 6, Tm2 = 4, Cm1 = 3Ca , Cm2 = 2Ca

V0 = 5V, Vt = 0.6V

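Supply voltage calculation

Sequential: Ccharge = Cm + Ca = 6Ca

Pipelined: Ccharge = Cm1 = Cm2 + Ca = 3Ca

The capacitance per stage is halved (M = 2), so $2(5\beta - 0.6)^2 = \beta (5 - 0.6)^2$

$50\beta^2 - 31.36\beta + 0.72 = 0 \Rightarrow \beta = 0.6033$

$V_{pip} = \beta V_0 = 3.0165$ V

Power consumption ratio = $\beta^2$ = 36.4%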
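The numbers in this example can be reproduced with a short Python sketch. It assumes the first-order CMOS delay model, under which equal sequential and pipelined propagation delays lead to M(βV0 - Vt)² = β(V0 - Vt)², with β the factor by which the supply voltage V0 is scaled. The function name `scaling_factor` is illustrative.

```python
import math


def scaling_factor(m, v0, vt):
    """Solve m*(beta*v0 - vt)**2 == beta*(v0 - vt)**2 for beta.

    Expanded, this is the quadratic
    m*v0^2 * beta^2 - (2*m*v0*vt + (v0 - vt)^2) * beta + m*vt^2 = 0.
    """
    qa = m * v0 ** 2
    qb = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    qc = m * vt ** 2
    disc = math.sqrt(qb * qb - 4 * qa * qc)
    roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    # Only a root with beta*v0 above the threshold voltage is usable.
    valid = [r for r in roots if r * v0 > vt]
    return max(valid)


# 3-tap FIR example: M = 2, V0 = 5 V, Vt = 0.6 V.
beta = scaling_factor(2, 5.0, 0.6)
v_pip = beta * 5.0          # scaled supply voltage
power_ratio = beta ** 2     # P_pip / P_seq at the same clock frequency
```

With these inputs the quadratic is 50β² - 31.36β + 0.72 = 0, and the usable root is close to 0.6033; the other root would put the supply below the threshold voltage.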

Parallel processing for low power (1)

L-parallel version

Working at 1/L of the original clock frequency, i.e. f = 1/(L·Tseq)

Total Capacitance is increased to LCcharge

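Since each Ccharge is charged in L·Tseq, only βV0 (β < 1) is needed to charge it

Calculation of β

$T_{seq} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}, \quad T_{par} = L\, T_{seq} = \dfrac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$

$L(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$

$P_{par} = (L\, C_{total}) (\beta V_0)^2 \dfrac{f}{L} = \beta^2 C_{total} V_0^2 f = \beta^2 P_{seq}$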

Example of 2-parallel version

4-tap FIR filter

Tm = 8, Ta = 1, Cm = 8Ca

Tseq = 9

V0 = 3.3V, Vt = 0.45V

2-parallel FIR filter design (figure; delayed inputs labeled x(2k-1) and x(2k-2))

Note: each delay is 2-slow

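Sequential: Ccharge = Cm + Ca = 9Ca

2-parallel: Ccharge = Cm + 2Ca = 10Ca

$T_{par} = 2T_{sample} = 2T_{seq}$, so $9(\beta V_0 - V_t)^2 = 5\beta (V_0 - V_t)^2$

$98.01\beta^2 - 67.3425\beta + 1.8225 = 0$

$\beta = 0.6589$ or $0.0282$ (the smaller root is rejected, since it gives $\beta V_0 < V_t$)

$V_{par} = \beta V_0 = 2.17437$ V

Power consumption ratio = $\beta^2$ = 43.41%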

Area efficient 2-parallel version

Multipliers: 8 → 6, adders: 6 → 7, delays: 3 → 4

Architecture verification

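$y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + h_3 x(n-3)$

$y_1 = h_0 x(2k) + h_2 x(2k-2)$

$y_2 = (h_0 + h_1)(x(2k) + x(2k+1)) + (h_2 + h_3)(x(2k-2) + x(2k-1))$

$y_3 = h_1 x(2k+1) + h_3 x(2k-1)$

$y(2k) = y_1 + y_3$ [after 1 block delay] $= h_0 x(2k) + h_1 x(2k-1) + h_2 x(2k-2) + h_3 x(2k-3)$

$y(2k+1) = y_2 - y_1 - y_3 = h_0 x(2k+1) + h_1 x(2k) + h_2 x(2k-1) + h_3 x(2k-2)$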
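The verification can also be done numerically. The sketch below assumes the area-efficient structure computes three sub-filter outputs per block, y1 = h0·x(2k) + h2·x(2k-2), y3 = h1·x(2k+1) + h3·x(2k-1), and y2 = (h0+h1)(x(2k)+x(2k+1)) + (h2+h3)(x(2k-2)+x(2k-1)), and then forms y(2k) = y1 + (y3 delayed by one block) and y(2k+1) = y2 - y1 - y3. The names y1, y2, y3 and the Python function names are illustrative.

```python
def sample(x, n):
    """x(n), with zero assumed outside the recorded range."""
    return x[n] if 0 <= n < len(x) else 0


def fir4(x, h):
    """Sequential 4-tap FIR used as the reference."""
    return [sum(h[i] * sample(x, n - i) for i in range(4))
            for n in range(len(x))]


def fir4_area_efficient_2_parallel(x, h):
    """Two outputs per block using 6 multiplications instead of 8."""
    assert len(x) % 2 == 0, "input length must be a multiple of the block size"
    h0, h1, h2, h3 = h
    s01, s23 = h0 + h1, h2 + h3   # coefficient sums, computed once
    y = []
    prev_y3 = 0                   # y3 from the previous block (1 block delay)
    for k in range(len(x) // 2):
        xe, xo = sample(x, 2 * k), sample(x, 2 * k + 1)
        xe_d, xo_d = sample(x, 2 * k - 2), sample(x, 2 * k - 1)
        y1 = h0 * xe + h2 * xe_d
        y3 = h1 * xo + h3 * xo_d
        y2 = s01 * (xe + xo) + s23 * (xe_d + xo_d)
        y.append(y1 + prev_y3)    # y(2k)
        y.append(y2 - y1 - y3)    # y(2k+1)
        prev_y3 = y3
    return y
```

For any even-length input the two functions agree sample by sample, which is the property the slide verifies algebraically.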

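Area-efficient 2-parallel: Ccharge = Cm + 4Ca = 12Ca

$T_{par} = 2T_{sample} = 2T_{seq}$, so $3(\beta V_0 - V_t)^2 = 2\beta (V_0 - V_t)^2$

$32.67\beta^2 - 25.155\beta + 0.6075 = 0$

$\beta = 0.745$ or $0.025$ (the smaller root is rejected, since it gives $\beta V_0 < V_t$)

$V_{par} = \beta V_0 = 2.4585$ V

Power consumption ratio

$C_{total,seq} = 4C_m + 3C_a = 35C_a, \quad C_{total,par} = 6C_m + 7C_a = 55C_a, \quad f_{par} = \frac{1}{2} f_{seq}$

$P_{seq} = C_{total,seq} V_0^2 f_{seq}, \quad P_{par} = C_{total,par} (\beta V_0)^2 f_{par}$

$P_{ratio} = \dfrac{P_{par}}{P_{seq}} = \dfrac{55\, \beta^2}{2 \cdot 35} = 43.6\%$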
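The two 2-parallel designs can be compared with one small Python sketch. The assumptions, which are not spelled out in this form on the slides, are: the sequential 4-tap filter has critical-path capacitance Cm + Ca and uses 4 multipliers and 3 adders; the direct 2-parallel design uses 8 multipliers and 6 adders; the area-efficient design uses 6 multipliers and 7 adders. Function and variable names are illustrative.

```python
import math


def solve_beta(g, v0, vt):
    """Larger root of g*(beta*v0 - vt)**2 == beta*(v0 - vt)**2.

    The product of the two roots is (vt/v0)**2, so only the larger
    root keeps beta*v0 above the threshold voltage.
    """
    qa = g * v0 ** 2
    qb = -(2 * g * v0 * vt + (v0 - vt) ** 2)
    qc = g * vt ** 2
    return (-qb + math.sqrt(qb * qb - 4 * qa * qc)) / (2 * qa)


def parallel_power(l, c_crit_seq, c_crit_par, c_tot_seq, c_tot_par, v0, vt):
    """Return (beta, P_par / P_seq) for an l-parallel design.

    The parallel design may take l times the sequential delay to charge
    its critical-path capacitance, and it runs at 1/l of the clock rate.
    """
    g = l * c_crit_seq / c_crit_par
    beta = solve_beta(g, v0, vt)
    ratio = (c_tot_par / c_tot_seq) * beta ** 2 / l
    return beta, ratio


CA = 1.0          # adder capacitance (unit)
CM = 8 * CA       # multiplier capacitance
C_TOT_SEQ = 4 * CM + 3 * CA

beta_direct, ratio_direct = parallel_power(
    2, CM + CA, CM + 2 * CA, C_TOT_SEQ, 8 * CM + 6 * CA, 3.3, 0.45)
beta_area, ratio_area = parallel_power(
    2, CM + CA, CM + 4 * CA, C_TOT_SEQ, 6 * CM + 7 * CA, 3.3, 0.45)
```

Under these assumptions the direct design needs a slightly lower supply voltage, while the area-efficient design charges less total capacitance, so the two power ratios end up close to each other.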

Combining pipelining and parallel processing

Pipelining

Reduces the capacitance to be charged/discharged in 1 clock period

Parallel processing

Increases the clock period for charging/discharging the original capacitance

3-parallel 2-stage pipelining

pipelining + parallel processing

Propagation delay of the parallel pipelined filter

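$T_{pd} = \dfrac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2} = \dfrac{L\, C_{charge} V_0}{k (V_0 - V_t)^2}$

Solution of β

$ML(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$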
