4/2/2012
1
Introduction
EE216B: VLSI Signal Processing
Prof. Dejan Marković [email protected]
EE216B Elevator Pitch
Area/energy-efficient mapping of advanced DSP algorithms to hardware
1.2
Background?
Familiarity with
Digital ICs
VLSI design
Signal processing
1.3
What is This Course About?
Circuit Optimization
Signal Proc. Architectures
Algorithm Modeling
Simulink/XSG Model
- bit-true cycle-accurate
- hw-equivalent blocks
- target: FPGA or ASIC
Min Energy & Area
- interleaving, folding
- iterative sqrt/div
- loop retiming
Opt Energy-Delay
- parallelism, time-mux
- circuit topology
- Vdd, Vth, gate size
[Figure: a complex DSP algorithm mapped to circuit topologies A and B, compared on an Energy vs. Delay plot; interleaved and folded datapath over N streams (x1…xN, y1…yN, z1…zN) with a + b + m = N, clocked at N*fClk]
1.4
Course Objectives
To implement signal processing systems in CMOS technology
To understand the issues involved in the design of signal processing systems
1.5
DSP Chip Design Challenges
Power-limited performance
More flexibility (multi-mode, multi-standard)
Algorithm and hardware design are separate
Increasing computational complexity
1.6
Course Outcomes
Systematic methodology for:
algorithm specification,
architecture mapping, and
hardware optimizations
Outcome 1: hardware-friendly algorithm development
Outcome 2: optimized hardware implementation
1.7
Course Highlights
A design methodology starting from a high-level description to an implementation optimized for performance, power and area
Unified description of algorithm and hardware parameters
– Methodology for automated wordlength reduction
– Automated exploration of many architectural solutions
– Design flow for FPGA and custom hardware including chip verification
Examples to show wide throughput range (kS/s to GS/s)
– Outcomes: energy/area optimal design, technology portability
Online resources: examples, references, tutorials etc.
1.8
icslwebs.ee.ucla.edu/dejan/ee219awiki
1.9
Create a wiki account
using your UCLA username
1.10
Course Material
Lecture notes
CAD tutorials
Class project
Selected papers from IEEExplore (http://ieeexplore.ieee.org)
1.11
Books
Textbook: DSP Architecture Design Essentials
– A free draft available online
Supplemental books (not required)
– K. Parhi, “VLSI Digital Signal Processing Systems: Design and Implementation,” Wiley (1999)
– Oppenheim, Schafer, “Discrete-Time Signal Processing,” Prentice Hall
– Rabaey, Nikolic, Chandrakasan, “Digital Integrated Circuits: A Design Perspective,” Prentice Hall
– And a few other books (see course wiki)
1.12
Material Based on a Book
To be published 2012
– Hard copy
– eBook formats
– Supplemental online material
1.13
Course/Book Development
Over 15 years of effort and revisions…
– Course material from UC Berkeley (Communication Signal Processing, EE225C), ~1995-2003
● Profs. Robert W. Brodersen, Jan M. Rabaey, Borivoje Nikolić
– The concepts were applied and expanded by researchers from the Berkeley Wireless Research Center (BWRC), 2000-2006
● W. Rhett Davis, Chen Chang, Changchun Shi, Hayden So, Brian Richards, Dejan Marković
– UCLA course (VLSI Signal Processing, EE216B), 2006-2008
● Prof. Dejan Marković
– The concepts were expanded by researchers from UCLA, 2006-2010
● Sarah Gibson, Vaibhav Karkare, Rashmi Nanda, Cheng C. Wang, Chia-Hsiang Yang
All of this is integrated into the course/book
– Lots of practical ideas and working examples
1.14
Chip Examples: Energy-Efficient DSP Kernels
DSP architecture optimization methodology
[Figure: die photos with dimensions and block labels, including an Rx DFE (0.4 mm2), a 16-PE array with register bank / scheduler (2.98 mm × 2.98 mm), and a 3.16 mm × 2.17 mm die with a 128-2048 pt FFT, register file bank, pre-processing, hard-output sphere decoder, and soft-output bank]
Chips: Rx DFE, 4x4 SVD, 8x8 SD, 16x16 SD, Cogno
References: [VLSI’06], [ESSCIRC’09], [VLSI’10], [VLSI’11], [ASSCC’11]
Reported efficiency and throughput: 12 GOPS/mW at 3.6 GS/s; 5 GOPS/mW at 200 MS/s; 10 GOPS/mW at 160 MS/s; 2 GOPS/mW at 100 MS/s; 17 GOPS/mW at 256 MS/s
[Plot: Energy Efficiency (GOPS/mW) vs. Area Efficiency (GOPS/mm2), comparing ISSCC and VLSI publications with our work]
1.15
Organization
The material is organized into four parts
1. Technology Metrics: performance, area, energy tradeoffs and their implications for architecture design
2. DSP Operations & Their Architecture: number representation, fixed-point, basic operations (direct, iterative) & their architecture
3. Architecture Modeling & Optimized Implementation: data-flow graph model, high-level scheduling and retiming, quantization, design flow
4. Design Examples, GHz to kHz: radio baseband DSP, parallel data processing (MIMO, neural spikes), architecture flexibility
1.16
Part 1: Technology Metrics
[Figures: CMOS gate model (PMOS and NMOS networks, inputs A1…AN, load CL, output Vout, supply VDD) with switching energies E0→1 and E1→0; Energy vs. Delay plot with sensitivities SA = (∂E/∂A)/(∂D/∂A) at A = A0 and SB, curves f(A0, B) and f(A, B0) through the point (A0, B0) at delay D0; Energy-Delay-Area space under VDD scaling for reference, pipelined, parallel, time-multiplexed, interleaved, and folded architectures; Energy Efficiency (MOPS/mW) vs. Chip Number for 20 chips, with about 3 orders of magnitude between microprocessors, general-purpose DSPs, and dedicated designs]
Ch 1: Energy and Delay Models
Ch 2: Circuit Optimization
Ch 3: Architecture Techniques
Ch 4: Architecture Flexibility
Energy and delay models of logic gates as a function of gate size and voltage…
are used to formulate sensitivity optimization, result: energy-delay plots
Extension to architecture tradeoff analysis…
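As a rough illustration of how energy and delay models produce an energy-delay plot under VDD scaling, the sketch below uses E = C·VDD² for switching energy and an alpha-power-law delay model. The delay model, the parameter values (`vth`, `alpha`, `k`), and all function names are assumptions chosen for illustration; they are not the models used in the course.

```python
def gate_energy(c_load, vdd):
    """Switching energy drawn from the supply for a 0->1 transition: E = C * VDD^2."""
    return c_load * vdd ** 2


def gate_delay(c_load, vdd, vth=0.3, alpha=1.3, k=1.0):
    """Assumed alpha-power-law delay: D = k * C * VDD / (VDD - Vth)^alpha."""
    if vdd <= vth:
        raise ValueError("this model is only meaningful for VDD > Vth")
    return k * c_load * vdd / (vdd - vth) ** alpha


def energy_delay_sweep(c_load, vdd_values):
    """Return a list of (delay, energy) points, one per supply voltage."""
    return [(gate_delay(c_load, v), gate_energy(c_load, v)) for v in vdd_values]


def vdd_sensitivity(c_load, vdd, dv=1e-4):
    """Numerical (dE/dVDD) / (dD/dVDD) using central differences."""
    d_energy = gate_energy(c_load, vdd + dv) - gate_energy(c_load, vdd - dv)
    d_delay = gate_delay(c_load, vdd + dv) - gate_delay(c_load, vdd - dv)
    return d_energy / d_delay


# Lowering VDD trades longer delay for lower energy.
points = energy_delay_sweep(1.0, [1.0, 0.8, 0.6])
```

In this model the sensitivity is negative: energy falls as delay grows, which is the tradeoff the energy-delay plots capture.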
1.17
Part 2: DSP Operations and Their Architecture
Ch 5: Arithmetic for DSP
Ch 6: CORDIC, Divider, Square Root
Ch 7: Digital Filters
Ch 8: Time-Frequency Analysis
Number representation, quantization modes, fixed-point arithmetic
Iterative DSP algorithms for standard ops, convergence analysis, the choice of initial condition
Direct and recursive digital filters, direct and transposed, pipelined…
FFT and wavelets (multi-rate filters)
[Figures: fixed-point word format (Sign, WInt, WFr) with overflow and quantization modes; CORDIC iterations It: 0 to It: 5 with rotation angles −45°, 26.57°, −14.04°, 7.13°, −3.58°; digital filter with taps h0, h1, h2, input x(n), z−1 delays and pipeline registers, tcritical = tmult + tadd; Time vs. Frequency tilings for Fourier and wavelet basis functions]
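The CORDIC rotation angles in Ch 6 (45°, 26.57°, 14.04°, 7.13°, 3.58°, …) are atan(2^−i). The sketch below is a minimal floating-point rotation-mode CORDIC, assuming the usual shift-and-add formulation; the function names are hypothetical, and the multiplications by 2^−i stand in for the shifts a hardware datapath would use.

```python
import math


def cordic_angles_deg(n):
    """Elementary rotation angles atan(2^-i), in degrees, for i = 0..n-1."""
    return [math.degrees(math.atan(2.0 ** -i)) for i in range(n)]


def cordic_rotate(theta, iterations=16):
    """Approximate (cos(theta), sin(theta)) for |theta| up to about 1.74 rad."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation scales the vector by sqrt(1 + 2^-2i); pre-divide by the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y


cos_30, sin_30 = cordic_rotate(math.pi / 6)
```

The residual angle after n iterations is bounded by atan(2^−(n−1)), so each added iteration contributes roughly one more bit of accuracy.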
1.18
Part 3: Architecture Model & Opt. Implementation
Ch 9: Data-Flow Graph Model
Ch 10: Wordlength Optimization
Ch 11: Architectural Optimization
Ch 12: Simulink-Hardware Flow
DFG model is used for architecture transformations based on high-level scheduling and retiming, an automated GUI tool is built…
[Figures: data-flow graph G (inputs x1(n), x2(n), output y(n), nodes v1 to v4, edges e1 to e3 with weights w(e1) = 0, w(e2) = 0, w(e3) = 1) and its matrix A; wordlength-annotated example for 1/sqrt() (legend: red = WL optimal, 409 slices; black = fixed WL, 877 slices); scheduled graph with nodes v1 to v6 mapped to multipliers M1, M2 and adder A1; model extraction]
Automated wordlength selection
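To make the retiming step on a data-flow graph concrete, the sketch below applies the standard retiming rule w_r(e) = w(e) + r(v) − r(u) for an edge e from u to v. The edge weights match the slide (w(e1) = 0, w(e2) = 0, w(e3) = 1), but the connectivity (v1→v3, v2→v3, v3→v4) and the names `retime`, `edges`, and `r` are assumptions for illustration, not the course's GUI tool.

```python
def retime(edges, r):
    """Apply retiming values r to a DFG.

    edges: {edge_name: (src_node, dst_node, delay_count)}
    r:     {node: integer retiming value}; missing nodes default to 0.
    Returns the retimed edges; raises if any edge would get a negative delay count.
    """
    retimed = {}
    for name, (src, dst, weight) in edges.items():
        new_weight = weight + r.get(dst, 0) - r.get(src, 0)
        if new_weight < 0:
            raise ValueError(f"illegal retiming: edge {name} would have {new_weight} delays")
        retimed[name] = (src, dst, new_weight)
    return retimed


# Assumed connectivity: two inputs feed an adder (v3), whose registered output feeds v4.
edges = {
    "e1": ("v1", "v3", 0),
    "e2": ("v2", "v3", 0),
    "e3": ("v3", "v4", 1),
}

# r(v3) = 1 moves the register from the adder output to both adder inputs.
retimed = retime(edges, {"v3": 1})
```

The number of delays around any loop and along any input-to-output path shifts only by r(output) − r(input), which is why retiming preserves functionality.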
1.19
Part 4: Design Examples: GHz to kHz
Ch 13: Multi-GHz Radio DSP
Ch 14: Dedicated MHz-rate Decoders
Ch 15: Flexible MHz-rate Decoders
Ch 16: kHz-rate Neural Processors
[Figures: multi-GHz radio receive chain (ADC at fs1 > 1 GHz, high-speed digital mixing for I/Q down-conversion with 0°/90° LO, high-speed filtering, sample-rate conversion decimating between arbitrary fs1 and fs2); Eigenvalues vs. Samples per sub-carrier (theoretical values, training, blind tracking); 16-PE array with register bank / scheduler, 2.98 mm × 2.98 mm]
High-speed (GHz+) digital filtering
Adaptive channel gain tracking, parallel data processing (SVD)
Increased number of antennas, added flexibility for multi-mode operation
1.20
Additional Design Examples
Integrated circuits for future radio and healthcare devices
– 4 orders of magnitude in speed: kHz (neural) to GHz (radio)
– 3 orders of magnitude in power: µW/mm2 to mW/mm2
[Figure: spike sorting process (recorded signal with action potentials from neurons #1 to #3 → analog front end → detection → clustering → sorted spikes)]
[Figure: chip examples: 200MHz Cognitive Radio Spectrum Sensing; Multi-core 8x8 MIMO Sphere Decoder (trace-back, radius shrinking; 3.16 mm × 2.17 mm die with 128-2048 pt FFT, register file bank, pre-processing, hard-output sphere decoder, soft-output bank); 16-ch Neural-spike Clustering]
Reported figures: 4 mW/mm2, 65 μW/mm2, 75 μW, 7.4 mW, 13.8 mW; LTE compliant; online clustering
1.21
Class Topics
Circuit and DSP basics
– Circuit and architecture techniques
– Scheduling and retiming
Arithmetic for DSP
Tools: Matlab/Simulink, Synphony HLS
Building blocks
– Filters, time-frequency analysis, DSP kernels
Systems
– Communications baseband
– Biomedical sensors
– Multimedia
1.22
Design Trajectory: From DSP Theory…
Digital Signal Processing
Harry Nyquist Alan Oppenheim Jean Baptiste Fourier
Sample & Quantize
Audio Video Radar
Add Multiply Memory
1.23
…to Optimized Hardware Realization
Design, Optimization, Verification in Matlab/Simulink
[Figure: Simulink / System Generator model of a 4x4 MIMO SVD system (channel H = U*S*V' with AWGN; Tx: V*x'; Rx: U'*y; V-modulation per channel: ch-1 16-PSK, ch-2 8-PSK, ch-3 QPSK, ch-4 BPSK; per-channel bit-error counters; resource estimator), targeting FPGA or ASIC, with energy (E), area (A), and delay (D) optimization at the circuit, micro-architecture, and macro-architecture levels]
Automated environment for hardware design and verification
optimization hardware design I/O verification
1.24
Class Organization
4 homework assignments
1 term-long design project
Midterm
Final
1.25
EE216B Weekly Schedule
[Weekly calendar, Mon to Fri, 9 am to 7 pm: two lecture slots in 8500 BH and two office-hour (OH) slots in 56-147E Eng-IV]
Instructor Info: Dejan Marković / [email protected] / 56-147E Eng-IV / Tel: 310-825-8656
1.26
Grading Policy and Timeline
Homeworks: 20%
Midterm: 25%
Project: 30%
Final: 25%
[Timeline over weeks 1 to 10: homeworks h1 to h4; class project with Phase-1, Phase-2, and Presentation milestones; Midterm Mon, May 7]
1.27
Homeworks and Project
Bi-weekly homeworks (4 assignments)
– Implement individual DSP blocks
Final project: a DSP system
– Work in teams of two (if > 2, we need to talk)
– Phase 1: proposal
– Phase 2: mid-term report
– Presentation + 4-page report
1.28
EE216B Design Flow
Timed dataflow
DSP algorithm
SysGen Synplify
B-box HDL
FPGA backend
ASIC backend
Architectural
Transformations
Speed Power Area
Hardware
co-simulation
1.29
Software Environment: Big Picture
[Figure: tool flow with the courses and platforms that cover each step]
Algorithm description (Matlab/Simulink)
Architecture transformations (Simulink/C++)
RTL description
FPGA hardware emulation (XUP, BEE2)
Chip synthesis: retiming, P&R (Cadence)
Circuit design, introductory (Cadence)
Circuit design, advanced (Cadence)
Related courses: 115A, 115B, 115C, 215A, 215B, 215E, 216A, 216B (DSP + Com.)
Platforms: Windows and Linux, depending on the tool
1.30
XUP Virtex-II Pro Based FPGA Board
You can borrow this board if you’d like (first-come, first-served)
Resources: 14k slices (~0.5M gates), 136 multipliers, 2448 Kb BRAM
1.31
The Basic Problem
Algorithm designers: Shannon limit, Rayleigh fading, cyclostationary process
Chip designers: gate delay, leakage power, number of bits, latency
Each group hears the other's vocabulary as: ^$*#^$E(W^$^&$
Very constrained implementation choices
Design reentry (Matlab/C, HDL)
1.32
Proposed Approach
Unified Simulink environment
– Enter design only once!
– Algorithm verification / emulation
– Abstract view of architecture
– FPGA based ASIC debug
Hardware-equivalent blocks
– Basic operators
● Add, multiply, shift, mux…
– Implementation constraints
● Word-size, latency
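To illustrate what a word-size constraint means for a hardware-equivalent block, the sketch below quantizes a real value to a signed two's-complement fixed-point format. The function `to_fixed` and its parameter names are hypothetical and only mimic the kind of settings such blocks expose (number of bits, binary point, quantization, overflow); this is not the Xilinx System Generator API.

```python
import math


def to_fixed(value, n_bits, binary_pt, quantization="truncate", overflow="wrap"):
    """Quantize value to signed fixed-point with n_bits total and binary_pt fractional bits.

    quantization: "truncate" (round toward minus infinity) or "round" (round half up)
    overflow:     "wrap" (two's-complement wraparound) or "saturate" (clip to the range)
    Returns the representable value as a float.
    """
    scaled = value * (1 << binary_pt)
    if quantization == "truncate":
        code = math.floor(scaled)
    elif quantization == "round":
        code = math.floor(scaled + 0.5)
    else:
        raise ValueError("unknown quantization mode")

    lo = -(1 << (n_bits - 1))
    hi = (1 << (n_bits - 1)) - 1
    if overflow == "saturate":
        code = max(lo, min(hi, code))
    elif overflow == "wrap":
        code = (code - lo) % (1 << n_bits) + lo
    else:
        raise ValueError("unknown overflow mode")
    return code / (1 << binary_pt)


# 8-bit word with 4 fractional bits: resolution 1/16, range [-8, 7.9375].
truncated = to_fixed(0.3, 8, 4)
rounded = to_fixed(0.3, 8, 4, quantization="round")
saturated = to_fixed(10.0, 8, 4, overflow="saturate")
wrapped = to_fixed(10.0, 8, 4, overflow="wrap")
```

Truncation and wraparound are the cheapest options in hardware; rounding and saturation cost extra logic but reduce bias and avoid large overflow errors.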
1.33
Hardware Libraries
Xilinx System Generator Synphony HLS
1.34
XSG Model Example: Iterative 1/sqrt()
User defined parameters:
- data type
- wordlength (#bits, binary pt)
- quantization
- overflow
- latency
- sample period
Iteration: $x_s(k+1) = \frac{x_s(k)}{2}\,\bigl(3 - Z\,x_s^2(k)\bigr)$
[Figure: XSG block diagram with input Z and output xs; wordlength and latency annotated on the blocks]
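The iteration xs(k+1) = xs(k)/2 · (3 − Z·xs²(k)) converges to 1/sqrt(Z) and needs only multiplies, a subtract, and a shift. A minimal floating-point sketch follows; the function name, the iteration count, and the initial guess are assumptions, and a hardware model would additionally apply the wordlength, quantization, and overflow settings listed above.

```python
def inv_sqrt(z, x0, iterations=6):
    """Iterate x <- x/2 * (3 - z*x^2), which converges to 1/sqrt(z).

    x0 must be a positive initial guess below sqrt(3/z) for the iteration to converge.
    """
    x = x0
    for _ in range(iterations):
        x = x / 2.0 * (3.0 - z * x * x)
    return x


# 1/sqrt(4) = 0.5, starting from a rough guess of 0.3.
estimate = inv_sqrt(4.0, 0.3)
```

Convergence is quadratic near the solution, so the number of correct bits roughly doubles per iteration; this is why the choice of initial condition largely determines how many iterations are needed.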
1.35
Block Characterization
[Figure: library blocks / macros (e.g. mult, add) synthesized at VDDref and characterized for Latency vs. Cycle Time and for Speed, Power, Area; Energy under VDD scaling from TClk @ VDDref to TClk @ VDDopt, using FO4 inverter simulation, pipeline logic scaling, and gate sizing]
1.36
ASIC Synthesis
10,000 FPGA slices
1mm2
(90nm CMOS)
$x_s(k+1) = \frac{x_s(k)}{2}\,\bigl(3 - N\,x_s^2(k)\bigr)$
500MOPS
0.18mW, 0.07mm2
1.37
Energy-Area-Performance Mapping
[Figure: 3-D plot with normalized axes (0.2 to 1), including Area, showing the valid architectures, the constraints, and the direct-mapping reference]
Each point is an architecture automatically generated in Simulink using scheduling and retiming
[Rashmi Nanda]
1.38
New Trend: Parallel Data Processing
Power limited technology scaling
– Increased impact of process variations
– More leakage power, multiple threshold devices
Single dimensional → Multidimensional data
– Multi-core processors (image: IBM / Sony / Toshiba)
– MIMO communications (image: Belkin)
– Neuroscience (image: www.sci.utah.edu)
1.39
Energy-Delay Tradeoff
[Figure: Energy vs. Delay under VDD scaling, with operating regions for processors, communications, and neural applications]
Processors
– Maximize performance
– Highest VDD required
Communications
– Minimize energy & area
– Typically, sensitivity ~ 1
Neuroscience
– Power density: 0.8mW/mm2
– Aggressive VDD scaling
1.40
Parallel Data in Neuroscience
[M.A.L. Nicolelis, Actions from thoughts, Nature 409 (2001), pp. 403–407.]
1.41
Animal Models
Observation of brain injured vs. naïve rat pups
[Figure: P19 rat pup with headstage; hippocampus and main probe locations; scale bar 10 mm]
1.42
64 site nanoprobe
Tungsten electrode
nanoprobes
Micromachined probes
Microelectrodes
Courtesy: S. Masmanidis
1.43
Monitoring of Freely-behaving Animals
Exploration of enriched environment: Brain injured vs. naïve pups
[Figure: rich (social) environment with levels 1 to 3, 4 ft (1.22 m) × 4 ft, used for naïve and injured pups]
1.44
Summary: Focus of This Course
3 components of the design problem
Algorithm specification
– Matlab (or C)
– Floating point, implementation independent, system simulation
Architecture mapping
– Simulink for data flow
– Stateflow for control
Hardware optimizations
– Real-time emulation
– FPGA/ASIC implementation
1.45