8/12/2019 Ch4_pipelining and Parallel Processing
1/15
VLSI DSP 2008 Y.T. Hwang 5-1
Chapter 4 Pipelining and Parallel Processing
Introduction (1)
Pipelining
Reduction in critical path
Increase the clock speed
Reduce power consumption at same speed
Parallel processing
Parallelism
Increase effective sampling speed
Reduction of power consumption
Introduction (2)
A 3-tap FIR filter
y(n)=ax(n)+bx(n-1)+cx(n-2)
Critical path: 1 multiply and 2 add
$$T_{sample} \ge T_M + 2T_A$$
$$f_{sample} \le \frac{1}{T_M + 2T_A}$$
Introduction (3)
Pipelining or parallel processing to increase the sampling frequency
Critical path: 2 add
Pipelining
Parallel processing
Pipelining of FIR digital filters (1)
Feed-forward cut set
Two iterations are computed concurrently
Critical path reduced from TM+2TA to TM+TA
Latency increased from 1 to 2
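The effect of the cut-set latches can be sketched in a few lines of Python. This is only an illustrative model: the choice of cut set (one latch on the edge between the two adders, one on the delay-line edge feeding multiplier c) and the names `fir_direct`, `fir_pipelined`, `sum_latch` and `tap_latch` are assumptions, not taken from the slide. Under that assumption each stage contains one multiply and one add, and the output is the same sequence delayed by one clock.

```python
def fir_direct(x, a, b, c):
    """Unpipelined 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y, x1, x2 = [], 0, 0
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with one latch on each edge of an assumed feed-forward cut set."""
    y, x1, x2 = [], 0, 0
    sum_latch = 0   # latch on the edge between the two adders (assumed placement)
    tap_latch = 0   # latch on the delay-line edge feeding multiplier c (assumed)
    for xn in x:
        # Stage 2 works on what stage 1 latched during the previous clock.
        y.append(sum_latch + c * tap_latch)
        # Stage 1 computes the partial sum and latches it for the next clock.
        sum_latch = a * xn + b * x1
        tap_latch = x2
        x1, x2 = xn, x1
    return y


x = [1, 2, 3, 4, 5]
direct = fir_direct(x, 2, 3, 5)
piped = fir_pipelined(x, 2, 3, 5)
```

Comparing `piped[1:]` with `direct[:-1]` shows the one-clock increase in latency while the computed values stay the same.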
Pipelining of FIR digital filters (2)
Drawbacks of pipelining
Increase in the number of latches and in system latency
Observations
The clock period is limited by the longest path between any of:
Two latches
An input and a latch
A latch and an output
An input and an output
Critical path can be reduced by suitably placing the
pipelining latches
Pipelining latches can be placed across any feed-forward cutset of the graph
Pipelining of FIR digital filters (3)
Cut set
A set of edges of a graph such that if these edges are
removed from the graph, the graph becomes disjoint
Feed-forward cut set
The data move in the forward direction on all the edges
of the cut set
We can arbitrarily place latches on a feed-forward cut
set w/o affecting the functionality of the algorithm
Pipelining of FIR digital filters (4)
Example 3.2.1
Figure panels: incorrect pipelining, correct pipelining
Original critical path: A3 → A5 → A4 → A6
After pipelining: A3 → A5 or A4 → A6
Critical path is reduced by one half
Direct vs. transpose form
Direct form with long critical path
Transpose form with data broadcast structure
Critical path is reduced to TM + TA
Fine-Grain pipelining
Pipelining the function unit
Assume TM = 10 units, TA = 2 units
After pipelining, the critical path is 6 units
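The 6-unit figure can be checked with a short calculation. The sketch below assumes the multiplier is split into two stages of 6 and 4 units and that the starting point is the transpose-form path of one multiply plus one add; the split and the variable names are illustrative assumptions, not stated on this slide.

```python
T_M, T_A = 10, 2      # multiplier and adder delays given on the slide
T_M1, T_M2 = 6, 4     # assumed split of the multiplier into two pipeline stages

# Before fine-grain pipelining: a full multiplier followed by an adder.
before = T_M + T_A

# After: the clock is set by the slower stage, either the first multiplier
# part alone or the second multiplier part followed by the adder.
after = max(T_M1, T_M2 + T_A)
```

With these numbers both stages take 6 units, so the split is balanced.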
Parallel processing of FIR filter (1)
Block processing of size L
y(n)=ax(n)+bx(n-1)+cx(n-2)
y(3k)=ax(3k)+bx(3k-1)+cx(3k-2)
y(3k+1)=ax(3k+1)+bx(3k)+cx(3k-1)
y(3k+2)=ax(3k+2)+bx(3k+1)+cx(3k)
Block delay (L-slow): placing a latch at any line of MIMO
structures produces an effective delay of L clocks at the
sample rate
Parallel processing of FIR filter (2)
Block size 3
3 times hardware
Critical path remains unchanged at $T_M + 2T_A$, so $T_{clk} \ge T_M + 2T_A$
3 samples are produced in 1 clock cycle
Effective iteration period is
$$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$$
Note: $T_{clk} \ne T_{sample}$
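As a quick check of the block formulation, the following Python sketch evaluates the three block equations once per "clock" and compares the result with the sample-by-sample filter. The function names and the zero initial conditions are assumptions made for the example.

```python
def fir_seq(x, a, b, c):
    """Sequential filter: one output sample per iteration."""
    def at(i):
        return x[i] if i >= 0 else 0   # zero initial conditions (assumed)
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_block3(x, a, b, c):
    """3-parallel filter: each loop iteration models one clock, producing 3 samples."""
    def at(i):
        return x[i] if i >= 0 else 0
    y = []
    for k in range(len(x) // 3):
        y.append(a * at(3 * k) + b * at(3 * k - 1) + c * at(3 * k - 2))      # y(3k)
        y.append(a * at(3 * k + 1) + b * at(3 * k) + c * at(3 * k - 1))      # y(3k+1)
        y.append(a * at(3 * k + 2) + b * at(3 * k + 1) + c * at(3 * k))      # y(3k+2)
    return y


x = list(range(1, 10))            # 9 samples, i.e. 3 blocks
seq_out = fir_seq(x, 2, 3, 5)
block_out = fir_block3(x, 2, 3, 5)
```

Both versions produce the same output sequence; the block version needs only a third as many clocks, which is where the factor 1/L in the iteration period comes from.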
Parallel processing of FIR filter (3)
MIMO system
Complete parallel processing
System with block size 4
A serial-to-parallel
converter
A parallel-to-serial converter
Pipelining vs. parallel processing
Limitation of pipelining
Input/output bottleneck, i.e. communication-bounded system
Pipelining period cannot be smaller than the communication or I/O bound
Pipelining & parallel processing
Combined fine-grain pipelining and parallel processing for 3-tap FIR filter
L = 3, M = 2
$$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} = \frac{1}{6}(T_M + 2T_A)$$
With TM = 10 and TA = 2 units this gives 14/6 units.
Pipelining & parallel processing for low power
Advantages of pipelining and parallel processing
High speed
Low power
CMOS circuit model, 1st-order analysis
Propagation delay
$$T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$
Power consumption
$$P = C_{total} V_0^2 f$$
Pipelining for low power (1)
Sequential version
M-level pipelined version
Working at the same frequency, i.e. f = 1/Tseq remains unchanged
Capacitance in each pipeline stage is reduced to Ccharge/M
Only a supply voltage of βV0 (β < 1) is needed to charge Ccharge/M in Tseq
$$P_{seq} = C_{total} V_0^2 f, \quad f = 1/T_{seq}$$
$$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$$
Pipelining for low power (2)
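Calculation of β
$$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$
$$T_{pip} = \frac{(C_{charge}/M)\,\beta V_0}{k (\beta V_0 - V_t)^2}$$
Let $T_{seq} = T_{pip}$:
$$M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$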
Pipelining for low power (3)
Example
3-tap FIR filter
Tm = 10, Ta = 2, Cm = 5Ca
Pipelined multiplier, Tm1 = 6, Tm2 = 4, Cm1 = 3Ca , Cm2 = 2Ca
V0 = 5V, Vt= 0.6V
Supply voltage calculation
Ccharge = Cm + Ca = 6Ca
Pipelined: Ccharge = Cm1 = Cm2 + Ca = 3Ca
With M = 2: 50β² - 31.36β + 0.72 = 0, so β = 0.6033
Vpip = βV0 = 3.0165 V
Power consumption ratio = β² = 36.4%
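The quadratic in this example can be reproduced numerically. The sketch below expands M(βV0 - Vt)² = β(V0 - Vt)² into standard quadratic form and keeps the root for which the scaled supply stays above the threshold voltage; the helper name `solve_beta` is hypothetical.

```python
import math


def solve_beta(m, v0, vt):
    """Return the feasible root of m*(beta*v0 - vt)**2 = beta*(v0 - vt)**2."""
    # Expanded form: a*beta**2 + b*beta + c = 0
    a = m * v0 ** 2
    b = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    c = m * vt ** 2
    root = math.sqrt(b * b - 4 * a * c)
    candidates = [(-b + root) / (2 * a), (-b - root) / (2 * a)]
    # The scaled supply beta*v0 must stay above the threshold voltage.
    feasible = [beta for beta in candidates if vt < beta * v0 <= v0]
    return max(feasible)


beta = solve_beta(m=2, v0=5.0, vt=0.6)   # M = 2 because Ccharge drops from 6Ca to 3Ca
v_pip = beta * 5.0                        # scaled supply voltage
power_ratio = beta ** 2                   # P_pip / P_seq
```

The second root of the quadratic is discarded because it would put the supply below Vt, where the delay model no longer applies.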
Parallel processing for low power (1)
L-parallel version
Working at 1/L of the original frequency, i.e. f = 1/(L Tseq)
Total capacitance is increased to L Ccharge
Since each Ccharge is charged in L Tseq, only a supply voltage of βV0 (β < 1) is needed to charge it
Parallel processing for low power (2)
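Calculation of β
$$T_{seq} = \frac{C_{charge} V_0}{k(V_0 - V_t)^2}, \quad L\,T_{seq} = \frac{C_{charge}\,\beta V_0}{k(\beta V_0 - V_t)^2}$$
$$L(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$
$$P_{par} = (L C_{charge})(\beta V_0)^2 \frac{f}{L} = \beta^2 C_{charge} V_0^2 f = \beta^2 P_{seq}$$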
Parallel processing for low power (3)
Example of 2-parallel version
4-tap FIR filter
Tm = 8, Ta = 1, Cm = 8Ca
Tseq = 9
V0 = 3.3V, Vt = 0.45V
Parallel processing for low power (4)
2-parallel FIR filter design
Note each delay is 2-slow
Figure labels: delayed inputs x(2k-1) and x(2k-2)
Parallel processing for low power (5)
Supply voltage calculation
Ccharge = Cm + Ca = 9Ca
2-parallel: Ccharge = Cm + 2Ca = 10Ca
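$$T_{seq} = \frac{9 C_a V_0}{k(V_0 - V_t)^2}, \quad T_{par} = \frac{10 C_a\,\beta V_0}{k(\beta V_0 - V_t)^2}$$
Let $T_{par} = 2T_{sample} = 2T_{seq}$:
$$9(\beta V_0 - V_t)^2 = 5\beta (V_0 - V_t)^2$$
$$98.01\beta^2 - 67.3425\beta + 1.8225 = 0$$
β = 0.6589 or 0.0282 (the smaller root is rejected because it gives βV0 < Vt)
Vpar = βV0 = 2.17437 V
Power consumption ratio = β² = 43.41%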
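Because the parallel design here charges a different capacitance (10Ca instead of 9Ca), the equation for β carries a capacitance ratio. A minimal sketch of the calculation follows, with capacitances expressed in units of Ca; the function name `solve_beta_parallel` and its parameter names are made up for the example.

```python
import math


def solve_beta_parallel(l, c_seq, c_par, v0, vt):
    """Feasible root of l*c_seq*(beta*v0 - vt)**2 = c_par*beta*(v0 - vt)**2.

    This is T_par = l * T_seq with the first-order delay model,
    after cancelling k and one factor of v0.
    """
    a = l * c_seq * v0 ** 2
    b = -(2 * l * c_seq * v0 * vt + c_par * (v0 - vt) ** 2)
    c = l * c_seq * vt ** 2
    root = math.sqrt(b * b - 4 * a * c)
    candidates = [(-b + root) / (2 * a), (-b - root) / (2 * a)]
    # Keep only the root whose supply voltage exceeds the threshold.
    return max(beta for beta in candidates if beta * v0 > vt)


beta = solve_beta_parallel(l=2, c_seq=9, c_par=10, v0=3.3, vt=0.45)
v_par = beta * 3.3
```

Dividing the coefficients `a`, `b`, `c` by 2 gives the quadratic shown on the slide.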
Parallel processing for low power (6)
Area efficient 2-parallel version
Multipliers: 8 → 6, adders: 6 → 7, delays: 3 → 4
Parallel processing for low power (7)
Architecture verification
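$$y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + h_3 x(n-3)$$
$$y_A = h_0 x(2k) + h_2 x(2k-2)$$
$$y_B = (h_0 + h_1)(x(2k) + x(2k+1)) + (h_2 + h_3)(x(2k-2) + x(2k-1))$$
$$y_C = h_1 x(2k+1) + h_3 x(2k-1)$$
$$y(2k) = y_A + y_C\ [\text{after 1 block delay}] = h_0 x(2k) + h_1 x(2k-1) + h_2 x(2k-2) + h_3 x(2k-3)$$
$$y(2k+1) = y_B - y_A - y_C = h_0 x(2k+1) + h_1 x(2k) + h_2 x(2k-1) + h_3 x(2k-2)$$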
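The verification can also be done numerically. The sketch below assumes the block is formed from x(2k) and x(2k+1), that y_C is the term delayed by one block before being added to y_A, and that initial conditions are zero; these indexing choices and the function names are assumptions of the example. Note that only six products are formed per block (two each for y_A, y_B and y_C).

```python
def fir4_direct(x, h):
    """Reference 4-tap FIR evaluated sample by sample."""
    def at(i):
        return x[i] if 0 <= i < len(x) else 0
    return [sum(h[j] * at(n - j) for j in range(4)) for n in range(len(x))]


def fir4_area_efficient(x, h):
    """2-parallel 4-tap FIR built from the three sub-filters y_A, y_B, y_C."""
    h0, h1, h2, h3 = h

    def at(i):
        return x[i] if 0 <= i < len(x) else 0

    y = [0] * len(x)
    yc_delayed = 0                      # y_C from the previous block
    for k in range(len(x) // 2):
        ya = h0 * at(2 * k) + h2 * at(2 * k - 2)
        yb = ((h0 + h1) * (at(2 * k) + at(2 * k + 1))
              + (h2 + h3) * (at(2 * k - 2) + at(2 * k - 1)))
        yc = h1 * at(2 * k + 1) + h3 * at(2 * k - 1)
        y[2 * k] = ya + yc_delayed      # y(2k) = y_A + y_C after 1 block delay
        y[2 * k + 1] = yb - ya - yc     # y(2k+1) = y_B - y_A - y_C
        yc_delayed = yc
    return y


x = [1, 2, 3, 4, 5, 6, 7, 8]
h = [1, 2, 3, 4]
reference = fir4_direct(x, h)
fast = fir4_area_efficient(x, h)
```

The two output lists agree sample for sample, which is the numerical counterpart of the algebraic check.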
Parallel processing for low power (8)
Supply voltage calculation
Ccharge = Cm + Ca = 9Ca
2-parallel: Ccharge = Cm + 4Ca = 12Ca
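$$T_{seq} = \frac{9 C_a V_0}{k(V_0 - V_t)^2}, \quad T_{par} = \frac{12 C_a\,\beta V_0}{k(\beta V_0 - V_t)^2}$$
Let $T_{par} = 2T_{sample} = 2T_{seq}$:
$$\frac{12 C_a\,\beta V_0}{k(\beta V_0 - V_t)^2} = 2 \cdot \frac{9 C_a V_0}{k(V_0 - V_t)^2}$$
$$32.67\beta^2 - 25.155\beta + 0.6075 = 0$$
β = 0.745 or 0.025 (the smaller root is rejected because it gives βV0 < Vt)
Vpar = βV0 = 2.4585 V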
Parallel processing for low power (9)
Power consumption ratio
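$$C_{total(seq)} = 4C_m + 3C_a = 35C_a, \quad P_{seq} = C_{total(seq)} V_0^2 f_{seq} = 35 C_a V_0^2 f_s$$
$$C_{total(par)} = 6C_m + 7C_a = 55C_a, \quad P_{par} = C_{total(par)} (\beta V_0)^2 f_{par}, \quad f_{par} = \tfrac{1}{2} f_{seq} = \tfrac{1}{2} f_s$$
$$P_{par} = 55 C_a \beta^2 V_0^2 \cdot \tfrac{1}{2} f_s$$
$$\text{ratio} = \frac{P_{par}}{P_{seq}} = \frac{55 \times 0.555}{2 \times 35} = 43.6\%$$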
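The ratio is simple arithmetic once the capacitances are expressed in units of Ca. A small sketch, using Cm = 8Ca and β = 0.745 from the example (the variable names are illustrative):

```python
C_A = 1.0                 # adder capacitance, used as the unit
C_M = 8 * C_A             # multiplier capacitance, Cm = 8Ca
BETA = 0.745              # supply scaling factor found for the area-efficient design

c_total_seq = 4 * C_M + 3 * C_A    # sequential filter: 4 multipliers, 3 adders
c_total_par = 6 * C_M + 7 * C_A    # area-efficient 2-parallel: 6 multipliers, 7 adders

# P = C_total * V^2 * f; V0^2 and f_s cancel in the ratio, and the parallel
# design is clocked at half the sample rate.
ratio = (c_total_par * BETA ** 2 * 0.5) / c_total_seq
```

The extra capacitance (55Ca versus 35Ca) is more than offset by the halved clock frequency and the reduced supply voltage.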
Combining pipelining and parallel processing
Pipelining
Reduces the capacitance to be charged/discharged in 1
clock period
Parallel processing
Increases the clock period for charging/discharging the
original capacitance
3-parallel
2-stage pipelining
pipelining + parallel processing
Propagation delay of the parallel pipelined filter
$$L\,T_{pd} = \frac{L\,C_{charge} V_0}{k(V_0 - V_t)^2} = \frac{(C_{charge}/M)\,\beta V_0}{k(\beta V_0 - V_t)^2}$$
Solution of β:
$$M L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$
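To get a feel for how much further the supply can be scaled when both techniques are combined, the sketch below solves the same quadratic for pipelining only (M = 2), parallel processing only (L = 3) and the combination (ML = 6). The values V0 = 5 V and Vt = 0.6 V are borrowed from the earlier pipelining example as an assumption, and the parallel-only case assumes Ccharge is unchanged; the helper name `solve_beta` is hypothetical.

```python
import math


def solve_beta(n, v0, vt):
    """Larger root of n*(beta*v0 - vt)**2 = beta*(v0 - vt)**2, where n = M*L."""
    a = n * v0 ** 2
    b = -(2 * n * v0 * vt + (v0 - vt) ** 2)
    c = n * vt ** 2
    # The larger root is the one that keeps beta*v0 above vt.
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


V0, VT = 5.0, 0.6   # assumed operating point, not given for this slide
cases = {
    "pipelined only (M=2)": 2,
    "parallel only (L=3)": 3,
    "combined (M=2, L=3)": 6,
}
betas = {name: solve_beta(n, V0, VT) for name, n in cases.items()}

for name, beta in betas.items():
    print(f"{name}: beta = {beta:.4f}, supply = {beta * V0:.3f} V")
```

Under these assumptions the combined design tolerates the lowest supply voltage of the three, since the product ML plays the role that M or L plays alone.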