An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavelet Transform
Rahul Jain
Preeti Ranjan Panda
IIT-Delhi
28 May 2007 ISCAS 2007 2
Agenda
• Existing Work
• Proposed Architecture
• Comparative Results
• Conclusion
Discrete Wavelet Transform (DWT)
• At the core of the JPEG2000 standard
• (9, 7) Daubechies coefficients defined in JPEG2000
• 1-D DWT using the Daubechies (9, 7) filter
   - two lifting steps
   - one scaling step
• Each lifting step
   - a prediction step
   - an update step
Hardware Implementation of DWT
• 2-D DWT implemented by row-wise and column-wise 1-D DWT
• Dominated by memory size and bandwidth
• Number of pipeline registers ∝ memory size
• Objective
   - Smaller critical path
   - Fewer pipeline registers
1-D DWT Equation
1. P1: Y(2i+1) = a * ( X(2i) + X(2i+2) ) + X(2i+1)
2. U1: Y(2i) = b * ( Y(2i-1) + Y(2i+1) ) + X(2i)
3. P2: Z(2i+1) = c * ( Y(2i) + Y(2i+2) ) + Y(2i+1)
4. U2: Z(2i) = d * ( Z(2i-1) + Z(2i+1) ) + Y(2i)
5. S: Z(2i) = k * Z(2i)
6. S: Z(2i+1) = (1/k) * Z(2i+1)
P: Prediction Step
U: Update Step
S: Scaling Step
a, b, c, d, k: constants defined in JPEG2000 standard
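
The six equations above can be sketched in software as follows. This is only an illustrative model, not the hardware described in the slides. The numeric values of the constants are the ones commonly quoted for the (9, 7) lifting scheme and are an assumption here, as are the symmetric boundary extension (`mirror`) and the function name `dwt97_1d`. Which subband is multiplied by k and which by 1/k varies between conventions; the sketch follows the slide (even samples times k).

```python
# Assumed constants for the (9, 7) lifting scheme; the slides only name them.
A, B, C, D = -1.586134342, -0.052980118, 0.882911075, 0.443506852
K = 1.230174105  # assumed scaling constant; conventions differ on k vs 1/k


def mirror(i, n):
    """Reflect an out-of-range index into [0, n) (assumed boundary handling)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def dwt97_1d(x):
    """One level of 1-D lifting DWT on an even-length sequence."""
    n = len(x)
    assert n >= 2 and n % 2 == 0
    v = [float(s) for s in x]
    # P1 and P2 update odd samples, U1 and U2 update even samples.
    # Within one pass the neighbours have the opposite parity, so the
    # in-place update never reads a value written in the same pass.
    for coeff, parity in ((A, 1), (B, 0), (C, 1), (D, 0)):
        for i in range(parity, n, 2):
            v[i] += coeff * (v[mirror(i - 1, n)] + v[mirror(i + 1, n)])
    low = [K * v[i] for i in range(0, n, 2)]    # S: Z(2i)   = k * Z(2i)
    high = [v[i] / K for i in range(1, n, 2)]   # S: Z(2i+1) = (1/k) * Z(2i+1)
    return low, high


low, high = dwt97_1d([1.0] * 8)
```

For a constant input the high-pass outputs come out as (numerically) zero, which is a quick sanity check on the assumed constants.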
Data Flow Graph (DFG)
• DFG derived from the equations
• The a, b, c and d nodes show the corresponding constant coefficient multipliers
• X7 and X8 are the off-chip reads required to compute Z4 and Z5
• X6, Y5, Y4 and Z3 are read from the on-chip buffer
Existing Architectures
• Non-pipelined direct implementation
   - Requires 6 registers, with critical path 4Tm + 8Ta
• Fully pipelined direct implementation
   - Requires 32 registers, with critical path Tm
• High-performance architecture
   - Lifting step equations modified
   - Throughput of 1 input/output per cycle
   - Requires 20 registers, with critical path Tm
• Flipping architecture
Flipping Architecture
• Multiplications moved off the critical path using inverse multipliers
• Critical path reduced to Tm + 5Ta
• No hardware overhead
• 5-stage pipelined implementation
• 11 registers required
• Critical path: Tm
Proposed DFG Optimizations
• The sample read as X8 in the present cycle takes the place of X6 in the next cycle
• "a*X8" computed now can be stored and reused as "a*X6" in the next cycle
   - no need to recompute it
• A similar argument holds for the computations involving Y5, Y4 and Z3
Optimized DFG
1. e1 = X6 * a
2. e2 = X6 + Y5 * b
3. e3 = Y5 + Y4 * c
4. e4 = Y4 + Z3 * d
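
To see why storing e1 to e4 is enough, the sketch below streams two new samples per cycle and compares the result with a direct evaluation of P1, U1, P2 and U2. It is a software illustration under assumptions: the constant values are the commonly quoted (9, 7) lifting constants, the scaling step is left out, and the names `reference` and `stream` are hypothetical. With m = 6 the state matches the slide: X6, Y5, Y4 and Z3 are buffered, X7 and X8 are read, and Z4 and Z5 are produced.

```python
A, B, C, D = -1.586134342, -0.052980118, 0.882911075, 0.443506852  # assumed


def reference(x):
    """Evaluate P1, U1, P2, U2 directly; only interior entries are valid."""
    n = len(x)
    y = list(x)
    for i in range(1, n - 1, 2):                      # P1 (odd samples)
        y[i] = A * (x[i - 1] + x[i + 1]) + x[i]
    for i in range(2, n - 1, 2):                      # U1 (even samples)
        y[i] = B * (y[i - 1] + y[i + 1]) + x[i]
    z = list(y)
    for i in range(3, n - 2, 2):                      # P2 (odd samples)
        z[i] = C * (y[i - 1] + y[i + 1]) + y[i]
    for i in range(4, n - 3, 2):                      # U2 (even samples)
        z[i] = D * (z[i - 1] + z[i + 1]) + y[i]
    return y, z


def stream(x, y, z, m=6):
    """Produce Z(m-2), Z(m-1) per cycle from the stored values e1..e4."""
    e1 = A * x[m]
    e2 = x[m] + B * y[m - 1]
    e3 = y[m - 1] + C * y[m - 2]
    e4 = y[m - 2] + D * z[m - 3]
    out = {}
    while m + 2 < len(x):
        x1, x2 = x[m + 1], x[m + 2]   # the two new reads of this cycle
        ax2 = A * x2                  # each product is computed once...
        y_odd = e1 + ax2 + x1         # P1
        by = B * y_odd
        y_even = e2 + by              # U1
        cy = C * y_even
        z_odd = e3 + cy               # P2
        dz = D * z_odd
        z_even = e4 + dz              # U2
        out[m - 2], out[m - 1] = z_even, z_odd
        # ...and reused to form e1..e4 for the next cycle.
        e1, e2, e3, e4 = ax2, x2 + by, y_odd + cy, y_even + dz
        m += 2
    return out


x = [float((3 * i * i + 7 * i) % 11) for i in range(17)]
y, z = reference(x)
out = stream(x, y, z)
```

Each loop iteration uses four multiplications and eight additions for two samples, which is in line with the resource count quoted for an initiation interval of 1.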
4-Stage Pipelining
• Critical path is Ta + Tm
• Initiation interval = 1, resource requirement
   - 4 multipliers
   - 8 adders
   - 10 registers
      - 6 pipelining registers
      - 4 for e1-e4
• Initiation interval = 2, resource requirement
   - 2 multipliers
   - 4 adders
   - 8 registers
Reducing the Scaling Step Multiplier Requirement
• 1D-DWT
   - Low-pass coefficients multiplied by k
   - High-pass coefficients multiplied by 1/k
• Effectively, in 2D-DWT
   - 25% of coefficients multiplied by k*k
   - 25% of coefficients multiplied by 1/(k*k)
   - 50% of coefficients multiplied by 1
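
The percentages above can be checked with a small counting sketch. It assumes an interleaved layout in which even indices hold low-pass and odd indices hold high-pass coefficients, as in the 1-D equations. The value of `k` is a hypothetical stand-in chosen as an exact fraction so that k * (1/k) is exactly 1, and `scale_map` is an illustrative name.

```python
from collections import Counter
from fractions import Fraction


def scale_map(rows, cols, k):
    """Net scale factor per coefficient after row-wise then column-wise scaling."""
    def factor(i):
        # Even index: low-pass (times k); odd index: high-pass (times 1/k).
        return k if i % 2 == 0 else 1 / k
    return [[factor(r) * factor(c) for c in range(cols)] for r in range(rows)]


k = Fraction(5, 4)  # hypothetical value, not the JPEG2000 constant
grid = scale_map(4, 4, k)
counts = Counter(v for row in grid for v in row)

# Scaling in both 1-D passes multiplies every coefficient twice.
separate_mults = 2 * 4 * 4
# With a single merged scaling, only factors other than 1 need a multiplier.
combined_mults = sum(1 for row in grid for v in row if v != 1)
```

On this 4x4 block, a quarter of the entries end up scaled by k*k, a quarter by 1/(k*k), and half by 1, so a merged scaling needs 8 multiplications where two separate scaling passes would need 32.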
Combining the 2 Scaling Steps
• Combine the scaling steps of the row-wise and column-wise 1D-DWT
   - Eliminates 75% of the scaling step multiplications
   - Saves 3 multipliers at a throughput of 2 inputs/outputs per cycle
• Proposed architecture
Multiplier and Adder Synthesis
• Existing work presented critical paths under the assumption that Tm > 2*Ta
• In DWT, the multiplications are by constants
• DWT constant multipliers were synthesized
• Tm = 1.6*Ta
Tm: Multiplier Latency, Ta: Adder Latency
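
As a worked comparison, the critical-path expressions quoted in these slides can be evaluated with the synthesized ratio Tm = 1.6*Ta. The sketch below expresses everything in units of Ta; the dictionary labels are informal names chosen here, not terms from the slides.

```python
TA = 1.0          # adder latency, used as the unit
TM = 1.6 * TA     # multiplier latency from the synthesis result

critical_path = {
    "non-pipelined direct": 4 * TM + 8 * TA,
    "flipping, non-pipelined": TM + 5 * TA,
    "proposed, 4-stage": TA + TM,
    "fully pipelined (one multiplier)": TM,
}

for name, delay in critical_path.items():
    print(f"{name}: {delay:.1f} Ta")
```

With this ratio the four expressions evaluate to 14.4, 6.6, 2.6 and 1.6 adder delays respectively.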
Comparison of 1D-DWT
• The critical path column takes the multiplier synthesis results into account
• The proposed architecture uses one register fewer than the flipping architecture
Comparison of 2D-DWT
• Combining the scaling step multiplications
   - requires 3 fewer multipliers
   - removes a pipeline register, which reduces the temporary buffer requirement
Flipping vs Proposed @ 4ns Clock
• The 2 architectures were synthesized under the same clock constraints
   - 20% area saving
   - 25% power saving
• 3 fewer registers required
   - Simplifies the clock network => clock power saving
Conclusion
• 1D-DWT DFG optimizations proposed
• In (9, 7) DWT, Tm is comparable to Ta
• Fewer registers required
• Area saving
• Lower memory requirement
• Simpler clock network
• Scaling steps combined
• Fewer multiplications
• Area saving
• Power saving