Post on 18-Mar-2018
VLSI DSP 2008 Y.T. Hwang 5-1
Chapter 4 Pipelining and Parallel Processing
Introduction (1)
Pipelining: reduction in critical path
Increases the clock speed
Reduces power consumption at the same speed
Parallel processing: parallelism
Increases the effective sampling speed
Reduces power consumption
Introduction (2)
A 3-tap FIR filter: y(n) = ax(n) + bx(n-1) + cx(n-2)
Critical path: 1 multiply and 2 adds
$$T_{sample} \ge T_M + 2T_A$$
$$f_{sample} \le \frac{1}{T_M + 2T_A}$$
Introduction (3)
Pipelining or parallel processing can be used to increase the sampling frequency
Critical path: 2 add
Pipelining
Parallel processing
Pipelining of FIR digital filters (1)
Feed-forward cut set
Two iterations are computed concurrently
Critical path reduced from TM + 2TA to TM + TA
Latency increased from 1 to 2 clock cycles
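A minimal sketch of the effect of the pipeline latches, assuming the cut places one latch after the first adder and one on the delayed input path (the function names and the zero initial state are illustrative assumptions, not taken from the slides). Each stage then contains one multiply and one add, and the output appears one clock later than in the unpipelined filter.

```python
def fir_direct(x, a, b, c):
    """Reference 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a * x[n] + b * x1 + c * x2)
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with latches on a feed-forward cut set (assumed placement).

    Stage 1 computes the partial sum a*x(n) + b*x(n-1) and latches it.
    Stage 2 adds c times the latched, delayed input one clock later,
    so out[n] equals y(n-1).
    """
    x1 = x2 = x3 = 0   # x(n-1), x(n-2), x(n-3)
    partial = 0        # latched a*x(n-1) + b*x(n-2)
    out = []
    for xn in x:
        out.append(partial + c * x3)   # stage 2: one multiply, one add
        partial = a * xn + b * x1      # stage 1: multiplies in parallel, one add
        x1, x2, x3 = xn, x1, x2        # shift the delay line
    return out


samples = [1, 2, 3, 4, 5]
print(fir_direct(samples, 2, 3, 5))
print(fir_pipelined(samples, 2, 3, 5))
```

The pipelined output is the reference output shifted by one sample, which is the extra cycle of latency paid for the shorter critical path.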
Pipelining of FIR digital filters (2)
Drawbacks of pipelining: increase in the number of latches and in system latency
Observations
The clock period is limited by the longest path between any of the following:
Two latches
An input and a latch
A latch and an output
An input and an output
The critical path can be reduced by suitably placing the pipelining latches
Pipelining latches can be placed across any feed-forward cut set of the graph
Pipelining of FIR digital filters (3)
Cut set: a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint
Feed-forward cut set: a cut set in which the data move in the forward direction on all the edges
Latches can be placed arbitrarily on a feed-forward cut set without affecting the functionality of the algorithm
Pipelining of FIR digital filters (4)
Example 3.2.1
Figure: incorrect pipelining vs. correct pipelining
Original critical path: A3 → A5 → A4 → A6
After pipelining: A3 → A5 or A4 → A6
Critical path is reduced by one half
Direct vs. transposed form
Direct form: long critical path
Transposed form with data broadcast structure: critical path is reduced to TM + TA
Fine-Grain pipelining
Pipelining the function unit: assume TM = 10 units, TA = 2 units
After pipelining, the critical path is 6 units
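The 6-unit figure can be checked with a short calculation. This is a sketch that assumes the multiplier is split into a 6-unit and a 4-unit stage (the split used in the low-power example later in these slides) and that the starting point is the TM + TA path of the transposed form; the helper name `critical_path` is hypothetical.

```python
def critical_path(stages):
    """The clock period is bounded by the slowest stage between latches."""
    return max(sum(stage) for stage in stages)


T_M, T_A = 10, 2
T_M1, T_M2 = 6, 4   # assumed split of the multiplier into two pipeline stages

before = critical_path([[T_M, T_A]])           # multiply then add, no internal latch
after = critical_path([[T_M1], [T_M2, T_A]])   # latch placed inside the multiplier
print(before, after)   # 12 6
```

The second stage holds the remaining 4 units of the multiplier plus the 2-unit adder, so both stages take 6 units and the split is balanced.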
Parallel processing of FIR filter (1)
Block processing of size L (here L = 3)
y(n) = ax(n) + bx(n-1) + cx(n-2)
y(3k) = ax(3k) + bx(3k-1) + cx(3k-2)
y(3k+1) = ax(3k+1) + bx(3k) + cx(3k-1)
y(3k+2) = ax(3k+2) + bx(3k+1) + cx(3k)
Block delay (L-slow): placing a latch at any line of MIMO structures produces an effective delay of L clocks at the sample rate
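The three block equations can be checked against the serial filter with a small simulation. This is an illustrative sketch only: the function names, the zero initial state, and the requirement that the input length be a multiple of 3 are assumptions made for the example.

```python
def fir_serial(x, a, b, c):
    """Serial 3-tap FIR, one output per iteration."""
    def sample(i):
        return x[i] if i >= 0 else 0   # zero initial state
    return [a * sample(n) + b * sample(n - 1) + c * sample(n - 2)
            for n in range(len(x))]


def fir_block3(x, a, b, c):
    """3-parallel version: iteration k consumes x(3k..3k+2) and emits y(3k..3k+2)."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")

    def sample(i):
        return x[i] if i >= 0 else 0

    y = []
    for k in range(len(x) // 3):
        n = 3 * k
        y.append(a * sample(n) + b * sample(n - 1) + c * sample(n - 2))
        y.append(a * sample(n + 1) + b * sample(n) + c * sample(n - 1))
        y.append(a * sample(n + 2) + b * sample(n + 1) + c * sample(n))
    return y


data = [1, 2, 3, 4, 5, 6]
print(fir_block3(data, 2, 3, 5) == fir_serial(data, 2, 3, 5))
```

Within one block, x(3k-1) and x(3k-2) come from the previous block, which is why a single latch in the block structure corresponds to a delay of L = 3 samples.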
Parallel processing of FIR filter (2)
Block size 3: 3 times the hardware
Critical path remains unchanged: TM + 2TA
Tclk ≥ TM + 2TA
3 samples are produced in 1 clock cycle, so the effective iteration period is
$$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$$
Note: Tclk ≠ Tsample
Parallel processing of FIR filter (3)
MIMO system
Complete parallel processing: system with block size 4
A serial-to-parallel converter
A parallel-to-serial converter
Pipelining vs. parallel processing
Limitation of pipelining: input/output bottleneck, i.e., a communication-bounded system
The pipelining period cannot be smaller than the communication (I/O) bound
Pipelining & parallel processing
Combined fine-grain pipelining and parallel processing for the 3-tap FIR filter
L = 3, M = 2
$$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} = \frac{1}{6}(T_M + 2T_A) = \frac{14}{6}$$
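The combined speedup can be sketched numerically. The helper below is hypothetical and makes an idealizing assumption: M-stage pipelining divides the TM + 2TA critical path evenly, and L-parallel processing produces L samples per clock. TM = 10 and TA = 2 are the values used on the fine-grain pipelining slide.

```python
from fractions import Fraction


def iteration_period(t_m, t_a, L=1, M=1):
    """Effective sample period (T_M + 2*T_A) / (L*M), assuming ideal stage balancing."""
    return Fraction(t_m + 2 * t_a, L * M)


print(iteration_period(10, 2))             # original filter
print(iteration_period(10, 2, L=3))        # 3-parallel only
print(iteration_period(10, 2, L=3, M=2))   # 3-parallel with 2-stage pipelining
```

`Fraction` keeps the results exact, so the last value is 14/6 in lowest terms rather than a rounded decimal.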
Pipelining & parallel processing for low power
Advantages of pipelining and parallel processing: high speed, low power
CMOS circuit model (first-order analysis)
Propagation delay:
$$T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$
Power consumption:
$$P = C_{total} V_0^2 f$$
Pipelining for low power (1)
Sequential version
$$P_{seq} = C_{total} V_0^2 f, \quad f = 1/T_{seq}$$
M-level pipelined version: working at the same frequency, i.e., f = 1/Tseq remains unchanged
Capacitance in each pipeline stage is reduced to Ccharge/M
Only βV0 (β < 1) is needed to charge Ccharge/M in Tseq
$$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$$
Pipelining for low power (2)
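Calculation of β
$$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$
$$T_{pip} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$$
Let $T_{seq} = T_{pip}$:
$$M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$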
Pipelining for low power (3)
Example: 3-tap FIR filter
Tm = 10, Ta = 2, Cm = 5Ca
Pipelined multiplier: Tm1 = 6, Tm2 = 4, Cm1 = 3Ca, Cm2 = 2Ca
V0 = 5 V, Vt = 0.6 V
Supply voltage calculation
Sequential: Ccharge = Cm + Ca = 6Ca
Pipelined: Ccharge = Cm1 = Cm2 + Ca = 3Ca
$$50\beta^2 - 31.36\beta + 0.72 = 0 \;\Rightarrow\; \beta = 0.6033$$
Vpip = βV0 = 3.0165 V
Power consumption ratio = β² = 36.4%
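The quadratic for β can be solved directly from the condition M(βV0 - Vt)² = β(V0 - Vt)². The sketch below uses the example's values (M = 2 because the charging capacitance drops from 6Ca to 3Ca, V0 = 5 V, Vt = 0.6 V); the function name `solve_beta` is a hypothetical helper, not something defined in the slides.

```python
import math


def solve_beta(m, v0, vt):
    """Return the feasible root of m*(beta*v0 - vt)**2 == beta*(v0 - vt)**2."""
    # Expanded: (m*v0^2)*beta^2 - (2*m*v0*vt + (v0 - vt)^2)*beta + m*vt^2 = 0
    qa = m * v0 ** 2
    qb = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    qc = m * vt ** 2
    root_disc = math.sqrt(qb * qb - 4 * qa * qc)
    roots = [(-qb + root_disc) / (2 * qa), (-qb - root_disc) / (2 * qa)]
    # The scaled supply beta*v0 must stay above the threshold voltage vt.
    feasible = [r for r in roots if r * v0 > vt]
    if not feasible:
        raise ValueError("no root keeps the supply above the threshold voltage")
    return max(feasible)


beta = solve_beta(m=2, v0=5.0, vt=0.6)
print(f"beta = {beta:.4f}, Vpip = {beta * 5.0:.4f} V, power ratio = {beta ** 2:.1%}")
```

The smaller root of the quadratic is discarded because it would put the supply below Vt. The slide's 3.0165 V comes from multiplying the rounded β = 0.6033 by 5 V, so the last digit printed here can differ slightly.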
Parallel processing for low power (1)
L-parallel version: working at one L-th of the frequency, i.e., f = 1/(LTseq)
Total capacitance is increased to LCcharge
Since each Ccharge is charged in LTseq, only βV0 (β < 1) is needed to charge it
Parallel processing for low power (2)
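Calculation of β
$$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$
$$L\, T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$$
$$L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$
$$P_{par} = (L\, C_{charge}) (\beta V_0)^2 \frac{f}{L} = \beta^2 C_{charge} V_0^2 f = \beta^2 P_{seq}$$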
Parallel processing for low power (3)
Example of 2-parallel version: 4-tap FIR filter
Tm = 8, Ta = 1, Cm = 8Ca
Tseq = 9
V0 = 3.3V, Vt = 0.45V
Parallel processing for low power (4)
2-parallel FIR filter design: note each delay is 2-slow
Figure labels: x(2k-1), x(2k-2)
Parallel processing for low power (5)
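Supply voltage calculation
Sequential: Ccharge = Cm + Ca = 9Ca
2-parallel: Ccharge = Cm + 2Ca = 10Ca
$$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}$$
$$T_{par} = \frac{10 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$$
Let $T_{par} = 2T_{sample} = 2T_{seq}$:
$$9 (\beta V_0 - V_t)^2 = 5\beta (V_0 - V_t)^2$$
$$98.01\beta^2 - 67.3425\beta + 1.8225 = 0$$
β = 0.6589 or 0.0282 (rejected, since it gives βV0 < Vt)
Vpar = βV0 = 2.17437 V
Power consumption ratio = β² = 43.41%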
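The same calculation can be written for the parallel case, where the charging capacitance also changes. This is a sketch under the stated first-order model: setting Tpar = L·Tseq gives L·Cseq·(βV0 - Vt)² = Cpar·β(V0 - Vt)², with capacitances expressed in units of Ca. The function name and argument names are hypothetical.

```python
import math


def beta_parallel(L, c_seq, c_par, v0, vt):
    """Feasible root of L*c_seq*(beta*v0 - vt)**2 == c_par*beta*(v0 - vt)**2."""
    qa = L * c_seq * v0 ** 2
    qb = -(2 * L * c_seq * v0 * vt + c_par * (v0 - vt) ** 2)
    qc = L * c_seq * vt ** 2
    root_disc = math.sqrt(qb * qb - 4 * qa * qc)
    roots = [(-qb + root_disc) / (2 * qa), (-qb - root_disc) / (2 * qa)]
    # Discard any root whose scaled supply would fall below the threshold voltage.
    feasible = [r for r in roots if r * v0 > vt]
    if not feasible:
        raise ValueError("no root keeps the supply above the threshold voltage")
    return max(feasible)


# 2-parallel 4-tap example: sequential 9*Ca, parallel 10*Ca, V0 = 3.3 V, Vt = 0.45 V
beta = beta_parallel(L=2, c_seq=9, c_par=10, v0=3.3, vt=0.45)
print(f"beta = {beta:.4f}, Vpar = {beta * 3.3:.4f} V, power ratio = {beta ** 2:.2%}")
```

Calling it with `c_par=12` gives the β for the area-efficient 2-parallel structure, whose charging capacitance is Cm + 4Ca.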
Parallel processing for low power (6)
Area efficient 2-parallel version
Multipliers: 8 → 6; adders: 6 → 7; delays: 3 → 4
Parallel processing for low power (7)
Architecture verification
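$$y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + h_3 x(n-3)$$
$$y_A = h_0 x(2k) + h_2 x(2k-2)$$
$$y_B = (h_0 + h_1)\,(x(2k) + x(2k+1)) + (h_2 + h_3)\,(x(2k-2) + x(2k-1))$$
$$y_C = h_1 x(2k+1) + h_3 x(2k-1)$$
$$y(2k) = y_A + y_C\ \text{[after 1 block delay]} = h_0 x(2k) + h_1 x(2k-1) + h_2 x(2k-2) + h_3 x(2k-3)$$
$$y(2k+1) = y_B - y_A - y_C = h_0 x(2k+1) + h_1 x(2k) + h_2 x(2k-1) + h_3 x(2k-2)$$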
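The verification can also be done numerically. The sketch below assumes the three sub-filter outputs are yA = h0·x(2k) + h2·x(2k-2), yC = h1·x(2k+1) + h3·x(2k-1) and yB = (h0+h1)(x(2k)+x(2k+1)) + (h2+h3)(x(2k-2)+x(2k-1)), with y(2k) = yA + yC delayed by one block and y(2k+1) = yB - yA - yC. Function names, the zero initial state, and the even-length requirement are assumptions made for the example.

```python
def fir4_direct(x, h):
    """Direct 4-tap FIR: y(n) = sum over i of h[i] * x(n - i), zero initial state."""
    return [sum(h[i] * x[n - i] for i in range(4) if n - i >= 0)
            for n in range(len(x))]


def fir4_area_efficient(x, h):
    """Area-efficient 2-parallel form using 6 multiplications per block of 2 samples."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even for block size 2")
    h0, h1, h2, h3 = h

    def sample(i):
        return x[i] if i >= 0 else 0   # zero initial state

    y = []
    y_c_delayed = 0   # y_C of the previous block (the 1-block delay)
    for k in range(len(x) // 2):
        x_even, x_odd = sample(2 * k), sample(2 * k + 1)
        x_even_d, x_odd_d = sample(2 * k - 2), sample(2 * k - 1)
        y_a = h0 * x_even + h2 * x_even_d
        y_c = h1 * x_odd + h3 * x_odd_d
        y_b = (h0 + h1) * (x_even + x_odd) + (h2 + h3) * (x_even_d + x_odd_d)
        y.append(y_a + y_c_delayed)   # y(2k)
        y.append(y_b - y_a - y_c)     # y(2k+1)
        y_c_delayed = y_c
    return y


taps = (1, 2, 3, 4)
data = [1, 2, 3, 4, 5, 6]
print(fir4_area_efficient(data, taps) == fir4_direct(data, taps))
```

The six multiplications per block are h0, h2, h1, h3 and the two pre-added coefficients (h0+h1) and (h2+h3), which matches the reduction from 8 multipliers to 6.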
Parallel processing for low power (8)
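Supply voltage calculation
Sequential: Ccharge = Cm + Ca = 9Ca
2-parallel: Ccharge = Cm + 4Ca = 12Ca
$$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}$$
$$T_{par} = \frac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$$
Let $T_{par} = 2T_{sample} = 2T_{seq}$:
$$\frac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2} = 2 \times \frac{9 C_a V_0}{k (V_0 - V_t)^2}$$
$$32.67\beta^2 - 25.155\beta + 0.6075 = 0$$
β = 0.745 or 0.025 (rejected, since it gives βV0 < Vt)
Vpar = βV0 = 2.4585 V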
Parallel processing for low power (9)
Power consumption ratio
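$$C_{total}^{(seq)} = 4C_m + 3C_a = 35C_a, \quad P_{seq} = C_{total}^{(seq)} V_0^2 f_{seq} = 35 C_a V_0^2 f_s$$
$$C_{total}^{(par)} = 6C_m + 7C_a = 55C_a, \quad P_{par} = C_{total}^{(par)} \beta^2 V_0^2 f_{par} = 55 C_a \beta^2 V_0^2 \cdot \frac{1}{2} f_s$$
$$f_{par} = \frac{1}{2} f_{seq} = \frac{1}{2} f_s$$
$$ratio = \frac{P_{par}}{P_{seq}} = \frac{1}{2} \cdot \frac{55 \times 0.555}{35} = 43.6\%$$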
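The ratio can be reproduced from the unit counts. This is a rough sketch assuming Cm = 8Ca, a sequential filter with 4 multipliers and 3 adders, an area-efficient 2-parallel filter with 6 multipliers and 7 adders running at half the clock rate, and β = 0.745 from the supply voltage calculation; the function and parameter names are hypothetical.

```python
def power_ratio(seq_units, par_units, cm_over_ca, beta, L):
    """P_par / P_seq for P = C_total * V^2 * f, with the parallel clock at f / L.

    seq_units and par_units are (multipliers, adders) tuples;
    capacitances are expressed in units of C_a.
    """
    c_seq = seq_units[0] * cm_over_ca + seq_units[1]
    c_par = par_units[0] * cm_over_ca + par_units[1]
    return (c_par * beta ** 2 / L) / c_seq


ratio = power_ratio(seq_units=(4, 3), par_units=(6, 7), cm_over_ca=8, beta=0.745, L=2)
print(f"{ratio:.1%}")
```

The total capacitance grows from 35Ca to 55Ca, but the halved clock rate and the squared voltage scaling outweigh that increase.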
Combining pipelining and parallel processing
Pipelining: reduces the capacitance to be charged/discharged in 1 clock period
Parallel processing: increases the clock period for charging/discharging the original capacitance
3-parallel 2-stage pipelining
Pipelining + parallel processing
Propagation delay of the parallel pipelined filter
$$L\, T_{pd} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2} = \frac{L\, C_{charge} V_0}{k (V_0 - V_t)^2}$$
Solution of β
$$ML (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$
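For the combined case the condition becomes ML(βV0 - Vt)² = β(V0 - Vt)². The slides give no numeric example here, so the sketch below borrows V0 = 5 V and Vt = 0.6 V from the earlier pipelining example purely as an assumption, together with L = 3 and M = 2 from the 3-parallel 2-stage structure; `beta_combined` is a hypothetical helper name.

```python
import math


def beta_combined(L, M, v0, vt):
    """Feasible root of M*L*(beta*v0 - vt)**2 == beta*(v0 - vt)**2."""
    ml = M * L
    qa = ml * v0 ** 2
    qb = -(2 * ml * v0 * vt + (v0 - vt) ** 2)
    qc = ml * vt ** 2
    root_disc = math.sqrt(qb * qb - 4 * qa * qc)
    roots = [(-qb + root_disc) / (2 * qa), (-qb - root_disc) / (2 * qa)]
    # Keep only a root whose scaled supply stays above the threshold voltage.
    feasible = [r for r in roots if r * v0 > vt]
    if not feasible:
        raise ValueError("no root keeps the supply above the threshold voltage")
    return max(feasible)


# Assumed values: V0 and Vt are borrowed from the pipelining example, not from this slide.
beta = beta_combined(L=3, M=2, v0=5.0, vt=0.6)
print(f"beta = {beta:.4f}, supply = {beta * 5.0:.3f} V, power ratio = {beta ** 2:.1%}")
```

Under the same first-order model the power ratio is again β², so combining the two techniques allows a lower supply voltage than either technique alone at the same sample rate.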