
8/12/2019 Ch4_pipelining and Parallel Processing


VLSI DSP 2008 Y.T. Hwang 5-1

Chapter 4 Pipelining and Parallel Processing


Introduction (1)

Pipelining

Reduction in critical path

Increase the clock speed

Reduce power consumption at same speed

Parallel processing

Parallelism

Increase effective sampling speed

Reduction of power consumption




Introduction (2)

A 3-tap FIR filter

y(n)=ax(n)+bx(n-1)+cx(n-2)

Critical path: 1 multiply and 2 add

$T_{sample} \ge T_M + 2T_A$

$f_{sample} \le \frac{1}{T_M + 2T_A}$


Introduction (3)

Pipelining or parallel processing can be used to increase the sampling frequency

Critical path: 2 add

Pipelining

Parallel processing




Pipelining of FIR digital filters (1)

Feed-forward cut set

Two iterations are computed concurrently

Critical path reduced from TM+2TA to TM+TA

Latency increased from 1 to 2



Drawbacks of pipelining

Increase in the number of latches and in system latency

Observations

The clock period is limited by the longest path between any of the following:

Two latches

An input and a latch

A latch and an output

An input and an output

Critical path can be reduced by suitably placing the pipelining latches

Pipelining latches can be placed across any feed-forward cutset of the graph





Cut set

A set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint

Feed-forward cut set

The data move in the forward direction on all the edges of the cut set

We can arbitrarily place latches on a feed-forward cut set without affecting the functionality of the algorithm



Example 3.2.1

Incorrect pipelining vs. correct pipelining

Original critical path: A3 → A5 → A4 → A6

After pipelining: A3 → A5 or A4 → A6

Critical path is reduced by one half




Direct vs. transpose form

Direct form with long critical path

Transpose form with data broadcast structure

Critical path is reduced to TM + TA


Fine-Grain pipelining

Pipelining the function unit

Assume TM = 10 units, TA = 2 units

After pipelining, the critical path is 6 units
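A minimal sketch of how the 6-unit figure can arise. It assumes the multiplier is cut into two parts of 6 and 4 units (the split used in the low-power example later in these slides) and that the second part shares a pipeline stage with the adder; `critical_path` is a hypothetical helper, not part of the slides.

```python
def critical_path(stages):
    """Clock period bound: the slowest latch-to-latch stage.

    Each stage is a list of unit delays that sit in series
    between two pipelining latches.
    """
    return max(sum(stage) for stage in stages)


T_M, T_A = 10, 2

# Transpose form without fine-grain pipelining: multiplier then adder.
before = critical_path([[T_M, T_A]])

# Assumed fine-grain split: multiplier cut into 6 + 4 units,
# with the 4-unit part followed by the 2-unit adder.
after = critical_path([[6], [4, T_A]])

print(before, after)
```

Under these assumptions the clock period bound drops from TM + TA = 12 units to 6 units.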




Parallel processing of FIR filter (1)

Block processing of size L

y(n)=ax(n)+bx(n-1)+cx(n-2)

y(3k)=ax(3k)+bx(3k-1)+cx(3k-2)

y(3k+1)=ax(3k+1)+bx(3k)+cx(3k-1)

y(3k+2)=ax(3k+2)+bx(3k+1)+cx(3k)

Block delay (L-slow): placing a latch on any line of the MIMO structure produces an effective delay of L clocks at the sample rate



Block size 3

3 times hardware

Critical path remains unchanged: TM+2TA

$T_{clk} \ge T_M + 2T_A$

3 samples are produced in 1 clock cycle

Effective iteration period is

$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$

Note: $T_{clk} \ne T_{sample}$
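As a sanity check on the block-size-3 structure, the three block equations for y(3k), y(3k+1), y(3k+2) can be compared against the sample-by-sample filter. This is a behavioural sketch only: the function names are hypothetical, and samples before n = 0 are assumed to be zero.

```python
def fir_sequential(x, a, b, c):
    """y(n) = a x(n) + b x(n-1) + c x(n-2), one sample per step."""
    at = lambda n: x[n] if n >= 0 else 0  # zero initial conditions (assumed)
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_block3(x, a, b, c):
    """Same filter, computing 3 outputs per block index k."""
    assert len(x) % 3 == 0, "input length must be a multiple of the block size"
    at = lambda n: x[n] if n >= 0 else 0
    y = []
    for k in range(len(x) // 3):
        y0 = a * at(3 * k) + b * at(3 * k - 1) + c * at(3 * k - 2)
        y1 = a * at(3 * k + 1) + b * at(3 * k) + c * at(3 * k - 1)
        y2 = a * at(3 * k + 2) + b * at(3 * k + 1) + c * at(3 * k)
        y += [y0, y1, y2]
    return y


x = [1, 2, 3, 4, 5, 6]
print(fir_block3(x, 1, 2, 3) == fir_sequential(x, 1, 2, 3))
```

Each pass through the loop corresponds to one clock cycle of the parallel hardware, which is why the clock period and the sample period differ by the factor L.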





MIMO system

Complete parallel processing

System with block size 4

A serial-to-parallel converter

A parallel-to-serial converter


Pipelining vs. parallel processing

Limitation of pipelining

Input/output bottleneck, i.e. a communication-bounded system

The pipelining period cannot be smaller than the communication or I/O bound




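Pipelining & parallel processing

Combined fine-grain pipelining and parallel processing for 3-tap FIR filter

L = 3, M = 2

$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} = \frac{1}{6}(T_M + 2T_A) = \frac{14}{6}$

Here Tclk is the clock period of the original filter, TM+2TA, evaluated with TM = 10 and TA = 2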
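The arithmetic behind the combined figure can be sketched as follows. The helper `sample_period` is hypothetical, and it assumes the M pipeline stages divide the original critical path evenly, which is an idealization.

```python
from fractions import Fraction


def sample_period(t_critical, L=1, M=1):
    """T_sample = t_critical / (L * M).

    L: block size (parallelism), M: pipelining level.
    Assumes an ideal, even M-way split of the critical path.
    """
    return Fraction(t_critical, L * M)


T_M, T_A = 10, 2
t_crit = T_M + 2 * T_A                       # 14 units for the 3-tap direct form

parallel_only = sample_period(t_crit, L=3)       # 14/3
combined = sample_period(t_crit, L=3, M=2)       # 14/6, reduced to 7/3

print(parallel_only, combined)
```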


Pipelining & parallel processing for low power

Advantages of pipelining and parallel processing

High speed

Low power

CMOS circuit model (1st-order analysis)

Propagation delay

$T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$

Power consumption

$P = C_{total} V_0^2 f$
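The two first-order formulas can be written as small functions to see the trade-off they imply: lowering the supply voltage cuts power quadratically but lengthens the propagation delay. The function names and the default k = 1 are assumptions for illustration; k is a technology constant that the slides leave symbolic.

```python
def propagation_delay(c_charge, v0, vt, k=1.0):
    """T_pd = C_charge * V0 / (k * (V0 - Vt)^2)."""
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """P = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f


# Illustrative values only: capacitance in units of C_a, voltages in volts.
for v in (5.0, 3.0):
    print(v, propagation_delay(6, v, 0.6), dynamic_power(6, v, 1.0))
```

Pipelining and parallel processing both create slack in the delay budget, and that slack is what allows the supply voltage to be scaled down.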




Pipelining for low power (1)

Sequential version

$P_{seq} = C_{total} V_0^2 f, \quad f = 1/T_{seq}$

M-level pipelined version

Working at the same frequency, i.e. f = 1/Tseq remains unchanged

Capacitance in each pipeline stage is reduced to Ccharge/M

Only βV0 (β < 1) is needed to charge Ccharge/M in Tseq

$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$



Calculation of β

$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$

$T_{pip} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$

Let $T_{seq} = T_{pip}$:

$M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$





Example

3-tap FIR filter

Tm = 10, Ta = 2, Cm = 5Ca

Pipelined multiplier, Tm1 = 6, Tm2 = 4, Cm1 = 3Ca , Cm2 = 2Ca

V0 = 5V, Vt= 0.6V

Supply voltage calculation

Ccharge = Cm + Ca = 6Ca

Pipelined: Ccharge = Cm1 = Cm2 + Ca = 3Ca

$50\beta^2 - 31.36\beta + 0.72 = 0$

β = 0.6033

Vpip = βV0 = 3.0165 V

Power consumption ratio = β² = 36.4%
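The supply-voltage calculation for this example can be reproduced numerically. The sketch below expands M(βV0 − Vt)² = β(V0 − Vt)² into a quadratic in β and keeps the root for which the scaled supply stays above the threshold voltage; `solve_beta_pipelined` is a hypothetical helper name.

```python
import math


def solve_beta_pipelined(v0, vt, M):
    """Feasible root of M*(beta*v0 - vt)^2 = beta*(v0 - vt)^2."""
    a = M * v0 ** 2
    b = -(2 * M * v0 * vt + (v0 - vt) ** 2)
    c = M * vt ** 2
    d = math.sqrt(b * b - 4 * a * c)
    roots = ((-b - d) / (2 * a), (-b + d) / (2 * a))
    # Keep only roots whose scaled supply beta*v0 exceeds the threshold vt.
    return max(r for r in roots if r * v0 > vt)


beta = solve_beta_pipelined(v0=5.0, vt=0.6, M=2)
print(round(beta, 4), round(beta * 5.0, 4), round(beta ** 2, 3))
```

With V0 = 5 V, Vt = 0.6 V and M = 2 the coefficients are 50, −31.36 and 0.72, matching the quadratic above.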


Parallel processing for low power (1)

L-parallel version

Working at one L-th of the frequency, i.e. f = 1/(LTseq)

Total capacitance is increased to LCcharge

Since each Ccharge is charged in LTseq, only βV0 (β < 1) is needed to charge it





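Calculation of β

$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}, \quad L T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$

$L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$

$P_{par} = (L C_{charge}) (\beta V_0)^2 \frac{f}{L} = \beta^2 C_{charge} V_0^2 f = \beta^2 P_{seq}$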



Example of 2-parallel version

4-tap FIR filter

Tm = 8, Ta = 1, Cm = 8Ca

Tseq = 9

V0 = 3.3V, Vt = 0.45V





2-parallel FIR filter design

Note each delay is 2-slow

x(2k-1)

x(2k-2)





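2-parallel: Ccharge = Cm + 2Ca = 10Ca

$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}, \quad T_{par} = \frac{10 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$

Let $T_{par} = 2 T_{sample} = 2 T_{seq}$:

$9 (\beta V_0 - V_t)^2 = 5 \beta (V_0 - V_t)^2$

$98.01\beta^2 - 67.3425\beta + 1.8225 = 0$

β = 0.6589 or 0.0282 (rejected, since βV0 would fall below Vt)

Vpar = βV0 = 2.17437 V

Power consumption ratio = β² = 43.41%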
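The 2-parallel example differs from the pipelined one in that the charging capacitance changes (9Ca to 10Ca) while the available time doubles. A hedged sketch that returns both roots makes the rejected solution visible; `beta_roots` is a hypothetical helper, and the constants k and Ca cancel out of the equation.

```python
import math


def beta_roots(v0, vt, c_seq, c_par, L):
    """Both roots of L*c_seq*v0/(v0-vt)^2 = c_par*beta*v0/(beta*v0-vt)^2."""
    s = L * c_seq / c_par          # 2 * 9 / 10 = 1.8 for this example
    a = s * v0 ** 2
    b = -(2 * s * v0 * vt + (v0 - vt) ** 2)
    c = s * vt ** 2
    d = math.sqrt(b * b - 4 * a * c)
    return (-b - d) / (2 * a), (-b + d) / (2 * a)


V0, VT = 3.3, 0.45
small, large = beta_roots(V0, VT, c_seq=9, c_par=10, L=2)

# A root is usable only if the scaled supply stays above the threshold.
feasible = [r for r in (small, large) if r * V0 > VT]
beta = feasible[0]
print(round(small, 4), round(beta, 4), round(beta * V0, 3), round(beta ** 2, 4))
```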





Area efficient 2-parallel version

Multipliers: 8 → 6, adders: 6 → 7, delays: 3 → 4



Architecture verification

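$y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + h_3 x(n-3)$

$y_A = h_0 x(2k) + h_2 x(2k-2)$

$y_B = (h_0 + h_1)(x(2k) + x(2k+1)) + (h_2 + h_3)(x(2k-2) + x(2k-1))$

$y_C = h_1 x(2k+1) + h_3 x(2k-1)$

$y(2k) = y_A + y_C$ [after 1 block delay]

$= h_0 x(2k) + h_1 x(2k-1) + h_2 x(2k-2) + h_3 x(2k-3)$

$y(2k+1) = y_B - y_A - y_C$

$= h_0 x(2k+1) + h_1 x(2k) + h_2 x(2k-1) + h_3 x(2k-2)$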
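The verification can also be run numerically. The sketch below builds y(2k) and y(2k+1) from three sub-results yA, yB, yC (six multiplications per block instead of eight) and compares them with the direct 4-tap filter. The function names are hypothetical, and samples outside the input range are assumed to be zero.

```python
def x_at(x, n):
    """x(n) with zeros outside the recorded samples (assumed)."""
    return x[n] if 0 <= n < len(x) else 0


def fir4(x, h, n):
    """Direct 4-tap FIR: y(n) = sum_i h[i] * x(n - i)."""
    return sum(h[i] * x_at(x, n - i) for i in range(4))


def fast_2parallel(x, h):
    """Area-efficient 2-parallel form built from y_A, y_B, y_C."""
    y = []
    yc_prev = 0                     # y_C of the previous block (block delay)
    for k in range(len(x) // 2):
        x0, x1 = x_at(x, 2 * k), x_at(x, 2 * k + 1)
        xm1, xm2 = x_at(x, 2 * k - 1), x_at(x, 2 * k - 2)
        y_a = h[0] * x0 + h[2] * xm2
        y_b = (h[0] + h[1]) * (x0 + x1) + (h[2] + h[3]) * (xm2 + xm1)
        y_c = h[1] * x1 + h[3] * xm1
        y.append(y_a + yc_prev)     # y(2k)
        y.append(y_b - y_a - y_c)   # y(2k+1)
        yc_prev = y_c
    return y


x = [3, -1, 4, 1, -5, 9, 2, 6]
h = [2, 7, 1, 8]
print(fast_2parallel(x, h) == [fir4(x, h, n) for n in range(len(x))])
```

Integer test data keeps the comparison exact, so no floating-point tolerance is needed.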







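2-parallel: Ccharge = Cm + 4Ca = 12Ca

$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}, \quad T_{par} = \frac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$

Let $T_{par} = 2 T_{sample} = 2 T_{seq}$:

$\frac{2 \cdot 9 C_a V_0}{k (V_0 - V_t)^2} = \frac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$

$32.67\beta^2 - 25.155\beta + 0.6075 = 0$

β = 0.745 or 0.025 (rejected, since βV0 would fall below Vt)

Vpar = βV0 = 2.4585 V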



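Power consumption ratio

$C_{total}^{seq} = 4C_m + 3C_a = 35C_a, \quad P_{seq} = C_{total}^{seq} V_0^2 f_{seq}$

$C_{total}^{par} = 6C_m + 7C_a = 55C_a, \quad P_{par} = C_{total}^{par} (\beta V_0)^2 f_{par}$

$f_{par} = \frac{1}{2} f_{seq} = \frac{1}{2} f_s$

$P_{seq} = 35 C_a V_0^2 f_s, \quad P_{par} = 55 C_a \beta^2 V_0^2 \cdot \frac{1}{2} f_s$

$ratio = \frac{P_{par}}{P_{seq}} = \frac{0.555 \times 55}{2 \times 35} = 43.6\%$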
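For the area-efficient structure the total capacitance no longer simply doubles, so the power ratio is not just β². A short sketch of the arithmetic, taking β = 0.745 from the supply-voltage calculation as given; `power_ratio` is a hypothetical helper and capacitances are expressed in units of Ca.

```python
def power_ratio(c_seq_total, c_par_total, beta, L):
    """P_par / P_seq with P = C_total * V^2 * f.

    The parallel version runs at supply beta*V0 and clock f/L,
    so V0 and f cancel in the ratio.
    """
    return c_par_total * beta ** 2 / (L * c_seq_total)


C_a = 1
C_m = 8 * C_a                      # from the example: Cm = 8Ca
c_seq_total = 4 * C_m + 3 * C_a    # 4 multipliers, 3 adders -> 35 Ca
c_par_total = 6 * C_m + 7 * C_a    # 6 multipliers, 7 adders -> 55 Ca

ratio = power_ratio(c_seq_total, c_par_total, beta=0.745, L=2)
print(round(ratio, 3))
```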




Combining pipelining and parallel processing

Pipelining

Reduces the capacitance to be charged/discharged in 1 clock period

Parallel processing

Increases the clock period for charging/discharging the original capacitance

3-parallel

2-stage pipelining


pipelining + parallel processing

Propagation delay of the parallel pipelined filter

$L T_{pd} = \frac{L C_{charge} V_0}{k (V_0 - V_t)^2} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$

Solution of β

$M L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$