Prof. Dejan Marković - UCLAicslwebs.ee.ucla.edu/dejan/ee219awiki/images/2/26/Le… · ·...

4/2/2012

Introduction

EE216B: VLSI Signal Processing

Prof. Dejan Marković [email protected]

EE216B Elevator Pitch

Area/energy-efficient mapping of advanced DSP algorithms to hardware

1.2

Background?

Familiarity with

Digital ICs

VLSI design

Signal processing

1.3

What is This Course About?

Circuit Optimization

Signal Proc. Architectures

Algorithm Modeling

Simulink/XSG Model

- bit-true cycle-accurate

- hw-equivalent blocks

- target: FPGA or ASIC

Min Energy & Area

- interleaving, folding

- iterative sqrt/div

- loop retiming

Opt Energy-Delay

- parallelism, time-mux

- circuit topology

- Vdd, Vth, gate size

[Figure: Complex DSP example. Energy vs. Delay curves for circuit topology A and topology B; interleaved datapath with streams x1, x2, …, xN, y1, y2, …, yN and z1, z2, …, zN over time index k, clocked at N*fClk, with a + b + m = N]

1.4

Course Objectives

The implementation of signal processing systems in CMOS technology

To understand the issues involved in the design of signal processing systems

1.5

DSP Chip Design Challenges

Power-limited performance

More flexibility (multi-mode, multi-standard)

Algorithm and hardware design are separate

Increasing computational complexity

1.6

Course Outcomes

Systematic methodology for:

algorithm specification,

architecture mapping, and

hardware optimizations

Outcome 1: hardware-friendly algorithm development

Outcome 2: optimized hardware implementation

1.7

Course Highlights

A design methodology starting from a high-level description to an implementation optimized for performance, power and area

Unified description of algorithm and hardware parameters

– Methodology for automated wordlength reduction

– Automated exploration of many architectural solutions

– Design flow for FPGA and custom hardware including chip verification

Examples to show wide throughput range (kS/s to GS/s)

– Outcomes: energy/area optimal design, technology portability

Online resources: examples, references, tutorials etc.

1.8

icslwebs.ee.ucla.edu/dejan/ee219awiki

1.9

Create a wiki account

using your UCLA username

1.10

Course Material

Lecture notes

CAD tutorials

Class project

Selected papers from IEEExplore (http://ieeexplore.ieee.org)

1.11

Books

Textbook: DSP Architecture Design Essentials

– A free draft available online

Supplemental books (not required)

– K. Parhi, “VLSI Digital Signal Processing Systems: Design and Implementation,” Wiley (1999)

– Oppenheim, Schafer, “Discrete-Time Signal Processing,” Prentice Hall

– Rabaey, Nikolic, Chandrakasan, “Digital Integrated Circuits: A Design Perspective,” Prentice Hall

– And a few other books (see course wiki)

1.12

Material Based on a Book

To be published 2012

– Hard copy

– eBook formats

– Supplemental online material

1.13

Course/Book Development

Over 15 years of effort and revisions…

– Course material from UC Berkeley (Communication Signal Processing, EE225C), ~1995-2003

● Profs. Robert W. Brodersen, Jan M. Rabaey, Borivoje Nikolić

– The concepts were applied and expanded by researchers from the Berkeley Wireless Research Center (BWRC), 2000-2006

● W. Rhett Davis, Chen Chang, Changchun Shi, Hayden So, Brian Richards, Dejan Marković

– UCLA course (VLSI Signal Processing, EE216B), 2006-2008

● Prof. Dejan Marković

– The concepts were expanded by researchers from UCLA, 2006-2010

● Sarah Gibson, Vaibhav Karkare, Rashmi Nanda, Cheng C. Wang, Chia-Hsiang Yang

All of this is integrated into the course/book

– Lots of practical ideas and working examples

1.14

Chip Examples: Energy-Efficient DSP Kernels

DSP architecture optimization methodology

[Figure: die photos of the example chips: RxDFE, 8x8 SD, Cogno, 16x16 SD, 4x4 SVD. Labeled blocks include Reg. File Bank, 128-2048 pt FFT, Hard-output Sphere Decoder, Soft-output Bank, Pre-proc., and a 16-PE array (PE 1 to PE 16) with register bank / scheduler]

References on the slide: [ESSCIRC’09], [VLSI’10], [VLSI’11], [VLSI’06], [ASSCC’11]

Reported efficiency and throughput: 12 GOPS/mW at 3.6 GS/s; 5 GOPS/mW at 200 MS/s; 10 GOPS/mW at 160 MS/s; 2 GOPS/mW at 100 MS/s; 17 GOPS/mW at 256 MS/s

[Plot: Energy Efficiency (GOPS/mW) vs. Area Efficiency (GOPS/mm2), log axes; legend: ISSCC, VLSI, Our work; marked points include 16x16, 8x8, CR, SVD, RxDFE]

1.15

Organization

The material is organized into four parts

1. Technology Metrics: performance, area, energy tradeoffs and their implications for architecture design

2. DSP Operations & Their Architecture: number representation, fixed-point, basic operations (direct, iterative) & their architecture

3. Architecture Modeling & Optimized Implementation: data-flow graph model, high-level scheduling and retiming, quantization, design flow

4. Design Examples: GHz to kHz: radio baseband DSP, parallel data processing (MIMO, neural spikes), architecture flexibility

1.16

Part 1: Technology Metrics

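[Figure: CMOS gate with PMOS and NMOS networks, inputs A1…AN, load CL, output Vout, supply VDD; switching energies E0→1 and E1→0]

[Figure: Energy vs. Delay curves f(A, B0) and f(A0, B) through the reference point (A0, B0) at delay D0, with sensitivities S_A and S_B, where $S_A = \left.\frac{\partial E/\partial A}{\partial D/\partial A}\right|_{A=A_0}$]

[Figure: Energy vs. Delay and Area for architecture techniques relative to a reference design: pipeline, parallel, time-mux, interleaving (intl), fold, with VDD scaling]

[Plot: Energy Efficiency (MOPS/mW), log scale from 0.01 to 1000, vs. Chip Number (1 to 20) for microprocessors, general-purpose DSPs, and dedicated chips; the spread is ~3 orders of magnitude]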

Ch 1: Energy and Delay Models

Ch 2: Circuit Optimization

Ch 3: Architecture Techniques

Ch 4: Architecture Flexibility

Energy and delay models of logic gates as a function of gate size and voltage…

are used to formulate sensitivity optimization, result: energy-delay plots

Extension to architecture tradeoff analysis…
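
The energy-delay tradeoff under supply scaling can be sketched numerically. The following is a minimal illustration, not the course's own model: it assumes the commonly used alpha-power-law delay model and the switching energy CL·VDD² for a 0→1 transition. The function names and the parameter values (Vth = 0.3 V, alpha = 1.3) are hypothetical choices for illustration only.

```python
def gate_delay(vdd, vth=0.3, alpha=1.3, k=1.0):
    """Alpha-power-law delay model (assumed): D = k * VDD / (VDD - Vth)**alpha."""
    if vdd <= vth:
        raise ValueError("model is only valid for VDD > Vth")
    return k * vdd / (vdd - vth) ** alpha


def switching_energy(vdd, c_load=1.0):
    """Energy drawn from the supply for a 0->1 output transition: E = CL * VDD**2."""
    return c_load * vdd ** 2


def energy_delay_curve(vdd_values, vdd_ref=1.0):
    """Return (vdd, delay, energy) tuples normalized to the reference supply."""
    d_ref = gate_delay(vdd_ref)
    e_ref = switching_energy(vdd_ref)
    return [(v, gate_delay(v) / d_ref, switching_energy(v) / e_ref)
            for v in vdd_values]


def vdd_sensitivity(vdd, h=1e-6):
    """Numerical sensitivity (dE/dVDD) / (dD/dVDD) using central differences."""
    d_energy = (switching_energy(vdd + h) - switching_energy(vdd - h)) / (2 * h)
    d_delay = (gate_delay(vdd + h) - gate_delay(vdd - h)) / (2 * h)
    return d_energy / d_delay


curve = energy_delay_curve([1.0, 0.9, 0.8, 0.7, 0.6, 0.5])
for vdd, delay, energy in curve:
    print(f"VDD={vdd:.1f} V  delay={delay:.2f}x  energy={energy:.2f}x")
```

Lowering VDD moves the design along the curve toward lower energy and longer delay; the sensitivity is negative because energy rises with VDD while delay falls.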

1.17

Part 2: DSP Operations and Their Architecture

Ch 5: Arithmetic for DSP

Ch 6: CORDIC, Divider, Square Root

Ch 7: Digital Filters

Ch 8: Time-Frequency Analysis

Number representation, quantization modes, fixed-point arithmetic

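[Figure: fixed-point word format with Sign, WInt (integer) and WFr (fractional) fields, an overflow mode and a quantization mode, shown with an example encoding of π]

[Figure: CORDIC iterations It: 0 to It: 5 with rotation angles −45°, 26.57°, −14.04°, 7.13°, −3.58°]

[Figure: pipelined FIR filter with input x(n), taps h0, h1, h2, delay elements z−1, output y(n−1) and pipeline registers; t_critical = t_mult + t_add]

[Figure: Fourier basis functions vs. wavelet basis functions, each drawn as a tiling of the Time-Frequency plane]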

Iterative DSP algorithms for standard ops, convergence analysis, the choice of initial condition

Direct and recursive digital filters, direct and transposed, pipelined…

FFT and wavelets (multi-rate filters)
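
The fixed-point parameters named on this slide (sign bit, WInt, WFr, overflow mode, quantization mode) can be illustrated with a short sketch. This is a minimal model under stated assumptions: a signed two's-complement word with one sign bit, "round" meaning round-half-up, and "truncate" meaning round toward minus infinity. The function name `to_fixed` and its arguments are hypothetical, not a tool API.

```python
import math


def to_fixed(x, w_int, w_fr, quantization="round", overflow="saturate"):
    """Quantize x to a signed word with 1 sign bit, w_int integer bits and
    w_fr fractional bits; return the real value the word represents."""
    scale = 1 << w_fr
    scaled = x * scale

    # Quantization mode: how the bits below the LSB are discarded.
    if quantization == "round":
        q = math.floor(scaled + 0.5)
    elif quantization == "truncate":
        q = math.floor(scaled)
    else:
        raise ValueError("quantization must be 'round' or 'truncate'")

    # Overflow mode: what happens when the integer part does not fit.
    total_bits = 1 + w_int + w_fr
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    if overflow == "saturate":
        q = max(q_min, min(q_max, q))
    elif overflow == "wrap":
        q = (q - q_min) % (1 << total_bits) + q_min
    else:
        raise ValueError("overflow must be 'saturate' or 'wrap'")

    return q / scale


print(to_fixed(0.3, 2, 4, quantization="truncate"))  # 0.25
print(to_fixed(0.3, 2, 4, quantization="round"))     # 0.3125
print(to_fixed(5.0, 2, 4, overflow="saturate"))      # 3.9375
print(to_fixed(5.0, 2, 4, overflow="wrap"))          # -3.0
```

With WInt = 2 and WFr = 4 the word covers [−4, 3.9375] in steps of 1/16, so 5.0 either clips to the largest code or wraps around to a negative value.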

1.18

Part 3: Architecture Model & Opt. Implementation

Ch 9: Data-Flow Graph Model

Ch 10: Wordlength Optimization

Ch 11: Architectural Optimization

Ch 12: Simulink-Hardware Flow

The DFG model is used for architecture transformations based on high-level scheduling and retiming; an automated GUI tool is built…

[Figure: data-flow graph G with inputs x1(n), x2(n), output y(n), nodes v1 to v4, edges e1 to e3 and one delay element (z−1, D); edge weights w(e1) = 0, w(e2) = 0, w(e3) = 1; matrix A for graph G with rows (1 0 0), (0 1 0), (1 1 1), (0 0 1)]

[Figure: wordlength-annotated model, example 1/sqrt(), with per-signal annotations such as (16,12), (12,9), (24,16), (8,4). Legend: red = WL optimal, 409 slices; black = fixed WL, 877 slices]

[Figure: scheduled data-flow graph with inputs x1(n), x2(n), x3(n), outputs y1(n), y2(n), nodes v1 to v6 mapped to resources M1, M2, A1 within iteration period Titer; "Extract Model" step]

Automated wordlength selection
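
The retiming transformation mentioned on this slide can be sketched on a small data-flow graph. This is a minimal illustration assuming the standard retiming rule w_r(u→v) = w(u→v) + r(v) − r(u); the edge list below (e1: v1→v3, e2: v2→v3, e3: v3→v4) is an assumed reading of the slide's graph G, and the function name `retime` is hypothetical.

```python
def retime(edges, r):
    """Apply retiming values r (node -> integer) to a data-flow graph.

    edges maps an edge name to (source, destination, delay count).
    Returns a new edge dict; raises ValueError if any edge would end up
    with a negative number of delays (an invalid retiming).
    """
    retimed = {}
    for name, (u, v, w) in edges.items():
        w_r = w + r.get(v, 0) - r.get(u, 0)
        if w_r < 0:
            raise ValueError(f"edge {name} would get {w_r} delays")
        retimed[name] = (u, v, w_r)
    return retimed


# Assumed topology: two inputs feed v3, and v3 drives v4 through one delay.
edges = {
    "e1": ("v1", "v3", 0),
    "e2": ("v2", "v3", 0),
    "e3": ("v3", "v4", 1),
}

# r(v3) = 1 moves the delay from the output edge of v3 to both input edges.
retimed = retime(edges, {"v3": 1})
for name, (u, v, w) in retimed.items():
    print(f"{name}: {u} -> {v}, w = {w}")
```

The retimed graph has w(e1) = 1, w(e2) = 1, w(e3) = 0: the number of delays on every input-to-output path is unchanged, but the register now sits before v3 instead of after it.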

1.19

Part 4: Design Examples: GHz to kHz

Ch 13: Multi-GHz Radio DSP

Ch 14: Dedicated MHz-rate Decoders

Ch 15: Flexible MHz-rate Decoders

Ch 16: kHz-rate Neural Processors

[Figure: multi-GHz radio DSP front end: ADC with fs1 > 1 GHz, high-speed digital mixing (I/Q down-conversion with LO at 0° and 90°), high-speed filtering, and sample-rate conversion (decimation between arbitrary fs1 and fs2)]

[Plot: eigenvalues vs. samples per sub-carrier (0 to 2000), showing theoretical values, training and blind tracking for σ1² to σ4²]

[Figure: chip with a 16-PE array (PE 1 to PE 16) and register bank / scheduler, 2.98 mm x 2.98 mm]

High-speed (GHz+) digital filtering

Adaptive channel gain tracking, parallel data processing (SVD)

Increased number of antennas, added flexibility for multi-mode operation

1.20

Additional Design Examples

Integrated circuits for future radio and healthcare devices

– 4 orders of magnitude in speed: kHz (neural) to GHz (radio)

– 3 orders of magnitude in power: µW/mm2 to mW/mm2

[Figure: spike sorting process: analog front end, detection, clustering; a recorded signal containing action potentials #1, #2, #3 is separated into sorted spikes #1, #2, #3]

[Figure: chip examples: 200MHz Cognitive Radio Spectrum Sensing (3.16 mm x 2.17 mm die with Reg. File Bank, 128-2048 pt FFT, Hard-output Sphere Decoder, Soft-output Bank, Pre-proc.); Multi-core 8x8 MIMO Sphere Decoder (trace-back, radius shrinking; LTE compliant); 16-ch Neural-spike Clustering (Online Clust.)]

Figures quoted on the slide: 4 mW/mm2, 65 μW/mm2, 75 μW, 7.4 mW, 13.8 mW

1.21

Class Topics

Circuit and DSP basics

– Circuit and architecture techniques

– Scheduling and retiming

Arithmetic for DSP

Tools: Matlab/Simulink, Synphony HLS

Building blocks

– Filters, time-frequency analysis, DSP kernels

Systems

– Communications baseband

– Biomedical sensors

– Multimedia

1.22

Design Trajectory: From DSP Theory…

Digital Signal Processing

Harry Nyquist Alan Oppenheim Jean Baptiste Fourier

Sample & Quantize

Audio Video Radar

Add Multiply Memory

1.23

…to Optimized Hardware Realization

Design, Optimization, Verification in Matlab/Simulink

[Figure: Simulink / System Generator model of a 4x4 MIMO SVD system: channel H = U*S*V' with AWGN, Tx: V*x', Rx: U'*y, processing elements PE U-Sigma and PE V, V-modulation (ch-1: 16-PSK, ch-2: 8-PSK, ch-3: QPSK, ch-4: BPSK), per-channel bit-error counters (Ch-1 to Ch-4), Resource Estimator, and wordlength annotations such as 12,9; 10,8; 14,9; 8,5. The model targets ASIC and FPGA, with energy-delay (E-D) and energy-area (E-A) optimization at the macro-architecture, micro-architecture and circuit levels]

Automated environment for hardware design and verification

optimization hardware design I/O verification

1.24

Class Organization

4 homework assignments

1 term-long design project

Midterm

Final

1.25

EE216B Weekly Schedule

[Weekly grid: Mon to Fri, hours 9 to 7; two Lecture slots in 8500 BH and two office-hour (OH) slots in 56-147E Eng-4]

Instructor Info: Dejan Marković / [email protected] 56-147E Eng-IV / Tel: 310-825-8656

1.26

Grading Policy and Timeline

Homeworks: 20%

Midterm: 25%

Project: 30%

Final: 25%

[Timeline, weeks 1 to 10: homeworks h1, h2, h3, h4; class project Phase-1, Phase-2 and Presentation]

Midterm: Mon, May 7

1.27

Homeworks and Project

Bi-weekly homeworks (4 assignments)

– Implement individual DSP blocks

Final project: a DSP system

– Work in teams of two (if > 2, we need to talk)

– Phase 1: proposal

– Phase 2: mid-term report

– Presentation + 4-page report

1.28

EE216B Design Flow

[Flow diagram blocks: timed dataflow DSP algorithm; SysGen, Synplify, B-box HDL; FPGA backend; ASIC backend; architectural transformations; speed, power, area; hardware co-simulation]

1.29

Software Environment: Big Picture

Algorithm description (Matlab/Simulink)

FPGA hardware emulation (XUP, BEE2)

Chip synthesis: retiming, P&R (Cadence)

Circuit design, introductory (Cadence)

Circuit design, advanced (Cadence)

Architecture transformations (Simulink/C++)

RTL description

Course labels on the slide: 216B; 216A, 216B, 215B, 215E; 115A, 115B, 115C; 216A, 215B, 215A, 215E; 216B DSP + Com.; 216B DSP + Com.

Platform labels on the slide: Windows/Linux; Windows; Windows/Linux; Linux; Linux; Linux

1.30

XUP Virtex-II Pro Based FPGA Board

You can borrow this board if you’d like (first-come, first-served)

Resources: 14k slices (~0.5M gates), 136 mults, 2448Kb BRAM

1.31

The Basic Problem

Algorithm designers: Shannon limit, Rayleigh fading, cyclostationary process

Chip designers: gate delay, leakage power, number of bits, latency

[Cartoon: each group hears the other as "^$*#^$E(W^$^&$" and answers "?"]

Very constrained implementation choices

Design reentry (Matlab/C, HDL)

1.32

Proposed Approach

Unified Simulink environment

– Enter design only once!

– Algorithm verification / emulation

– Abstract view of architecture

– FPGA based ASIC debug

Hardware-equivalent blocks

– Basic operators

● Add, multiply, shift, mux…

– Implementation constraints

● Word-size, latency

1.33

Hardware Libraries

Xilinx System Generator

Synphony HLS

1.34

XSG Model Example: Iterative 1/sqrt()

User defined parameters:

- data type

- wordlength (#bits, binary pt)

- quantization

- overflow

- latency

- sample period

$x_s(k+1) = \frac{x_s(k)}{2}\left(3 - Z\,x_s^2(k)\right)$

[Figure: XSG model of the iteration with inputs xs and Z; wordlength and latency are annotated on the blocks]
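
The iteration on this slide, x_s(k+1) = x_s(k)/2 · (3 − Z·x_s²(k)), is a Newton-type recursion that converges to 1/sqrt(Z). A minimal floating-point sketch follows; it ignores the wordlength, quantization and latency settings of the XSG blocks, and the function name, initial guess and iteration count are illustrative assumptions. The recursion converges for an initial guess between 0 and sqrt(3/Z).

```python
def inv_sqrt(z, x0, iterations=6):
    """Iterate x <- x/2 * (3 - z*x*x), which converges to 1/sqrt(z)
    when 0 < x0 < sqrt(3/z)."""
    x = x0
    for _ in range(iterations):
        x = 0.5 * x * (3.0 - z * x * x)
    return x


def inv_sqrt_trace(z, x0, iterations=6):
    """Return every iterate, to show how quickly the error shrinks."""
    trace = [x0]
    for _ in range(iterations):
        trace.append(0.5 * trace[-1] * (3.0 - z * trace[-1] ** 2))
    return trace


for k, x in enumerate(inv_sqrt_trace(4.0, 0.4)):
    print(f"k={k}  xs={x:.12f}  error={abs(x - 0.5):.3e}")
```

Each step uses only multiplications, one subtraction and a shift (the division by 2), which is why the iteration maps well onto hardware-equivalent add, multiply and shift blocks.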

1.35

Block Characterization

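[Figure: block characterization. Library blocks / macros (e.g. mult, add) are synthesized at VDDref and characterized for speed, power and area, giving latency and cycle time TClk @ VDDref; FO4 inverter simulation and pipeline logic scaling give TClk @ VDDopt; the Energy axis is shown with VDD scaling and gate sizing]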

1.36

ASIC Synthesis

10,000 FPGA slices ~ 1mm2 (90nm CMOS)

$x_s(k+1) = \frac{x_s(k)}{2}\left(3 - N\,x_s^2(k)\right)$

500MOPS

0.18mW, 0.07mm2

1.37

[Plot: valid architectures under the constraints, on normalized axes (0.2 to 1) that include Area; direct mapping is the reference point]

Energy-Area-Performance Mapping

Each point is an architecture automatically generated in Simulink using scheduling and retiming

[Rashmi Nanda]

1.38

New Trend: Parallel Data Processing

Power limited technology scaling

– Increased impact of process variations

– More leakage power, multiple threshold devices

Single-dimensional to multidimensional data

[Images: multi-core processors, MIMO communications, neuroscience; credits: www.sci.utah.edu, IBM / Sony / Toshiba, Belkin]

1.39

Energy-Delay Tradeoff

[Plot: Energy vs. Delay under VDD scaling, with regions marked for Processors, Communications and Neural]

Processors

– Maximize performance

– Highest VDD required

Communications

– Minimize energy & area

– Typically, sensitivity ~ 1

Neuroscience

– Power density: 0.8mW/mm2

– Aggressive VDD scaling

1.40

Parallel Data in Neuroscience

[M.A.L. Nicolelis, Actions from thoughts, Nature 409 (2001), pp. 403–407.]

1.41

Animal Models

Observation of brain injured vs. naïve rat pups

[Images: P19 rat pup with headstage (10 mm scale bar); main probe locations, with the hippocampus labeled]

1.42

64 site nanoprobe

Tungsten electrode

nanoprobes

Micromachined probes

Microelectrodes

Courtesy: S. Masmanidis

1.43

Monitoring of Freely-behaving Animals

Exploration of enriched environment: Brain injured vs. naïve pups

[Figure: enriched-environment enclosure, 4 ft x 4 ft (1.22 m), with level 1, level 2 and level 3; labels: social, naïve rich environment, injured]

1.44

Summary: Focus of This Course

3 components of the design problem

Algorithm specification

– Matlab (or C)

– Floating point, implementation independent, system simulation

Architecture mapping

– Simulink for data flow

– Stateflow for control

Hardware optimizations

– Real-time emulation

– FPGA/ASIC implementation

1.45
