ENG6530 Reconfigurable Computing Systems

Digital Signal Processing using FPGAs

Topics

• Digital Signal Processing (DSP): definition, advantages and disadvantages, applications, ….
• DSP vs. GPP vs. ASIC vs. FPGA
• Why use reconfigurable computing
• Xilinx System Generator

References

I. http://www.xilinx.com
II. "Reconfigurable Computing for DSP: A Survey", by R. Tessier and W. Burleson, 2001
III. "Optimization Techniques for Efficient Implementation of DSP in FPGAs", by J. Wang
IV. "Reconfigurable Computing: The Theory and Practice of FPGA Based Computing", Chapter 24: Distributed Arithmetic

Introduction

The term Digital Signal Processing, or DSP, refers to the branch of electronics concerned with the representation and manipulation of signals in digital form.

Applications include:
i. Telecommunication (switches, …)
ii. Medical (images, equipment, …)
iii. Military (radar, missiles, …)
iv. Consumer (cell phones, TVs, …)

DSP Flow

• The data to be processed starts out as a signal in the real (analog) world.
• This analog signal is then sampled by means of an analog-to-digital converter.
• These samples are then processed in the digital domain.
• The digital samples are subsequently converted into an analog equivalent by means of a digital-to-analog converter.

Figure: Analog input signal → A/D → digital input samples → DSP → modified output samples → D/A → analog output signal (analog domain, digital domain, analog domain).

DSP Flow: Digital System

Figure: ADC (sampling + quantization) → DSP → DAC, operating on binary samples (1010.., 1001..). Related design topics: signal analysis, system analysis, filter design, and architecture (fixed-point arithmetic, architecture types, selection criteria).

Transition from Analog to Digital

The transition from analog to digital techniques has been driven by the many advantages of DSP:

• The main advantage of digital signals over analog signals is that the precise signal level of the former is not vital (immune to imperfections).
• Digital signals can be saved in memory and then recalled.
• Digital signals can convey information with greater noise immunity.
• Digital signals can be processed by digital circuit components, which are cheap and easily produced.
• Digital signals can be encrypted so that only the intended receiver can decode them.
• Precision is flexible: word lengths and/or number representation (e.g., fixed point vs. floating point) can be changed.
• A single processing element can process multiple incoming signals through multiplexing.
• Signals can be transmitted over long distances and at higher rates.
• Digital approaches can easily adjust their processing parameters, as in adaptive filtering.

Transition from Analog to Digital

The main disadvantages of DSP:

i. Increased system complexity: DSP requires that signals be converted between analog and digital forms using a sample-and-hold circuit, analog-to-digital converters (ADCs), digital-to-analog converters (DACs) and analog filtering.

ii. Power consumption: DSP tends to require more power since a dedicated processor is used.

iii. Frequency range limitation: analog hardware can naturally work with higher-frequency signals than DSP hardware, because of the limitations of analog-to-digital conversion.

For many applications, the advantages of DSP far outweigh these disadvantages.

DSP: Common Operations

Some of the most common operations performed on signals using digital or analog techniques include:

• Elementary time-domain operations: amplification, attenuation, integration, differentiation, addition of signals, multiplication of signals, etc.
• Filtering (FIR, IIR)
• Transforms (FFT, IFFT)
• Convolution (integral of the product of two functions)
• Error correction (transmission)
• Compression and decompression (audio, video)
• Modulation and demodulation (BPSK, QAM, FSK, ASK, …)
• Multiplexing and de-multiplexing
• Signal generation

DSP Applications

• Audio applications: MPEG audio, portable audio
• Photography: digital cameras, CAM
• Wireless applications: WiFi, WiMAX, Bluetooth
• Networking: switches, classifiers
• Medical equipment: hearing aids, heart pacers
• Cable modems: ADSL, VDSL
• Cellular phones: base stations, GSM, LTE
• Military applications: radar

Main DSP Operations

• DSP is the arithmetic processing of digital signals sampled at regular intervals.
• DSP can be reduced to three elementary operations: delay, add and multiply.
• Accumulate = Add + Delay
• MAC = Multiply + Accumulate
• The MAC is the engine behind DSP.
• More MACs = higher performance, better signal quality.
• MACs vs. MIPS: not always equal.

Figure: Filter implementations using 3 MACs, 50* MACs and 100 MACs.
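The relation "MAC = Multiply + Accumulate" can be sketched in a few lines of Python. This is only an illustrative software model, not hardware code; the function name `mac` and the sample values are assumptions made for the example.

```python
def mac(samples, coeffs):
    """Software model of a MAC unit: one multiply and one accumulate per step."""
    acc = 0  # accumulator register (an adder plus a delay element)
    for x, c in zip(samples, coeffs):
        acc += x * c  # one multiply-accumulate operation
    return acc


print(mac([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Each loop iteration corresponds to one MAC operation, so a processor with a single MAC unit needs one step per coefficient.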

Alternative DSP Implementations

DSP tasks can be implemented in a number of different ways.

i. A general purpose processor (GPP): The processor can perform DSP by running an appropriate DSP algorithm.

ii. A digital signal processor (PDSP): This is a specialized form of microprocessor chip that has been designed to perform DSP tasks much faster and more efficiently than a GPP.

iii. Dedicated ASIC hardware: Custom hardware implementation that executes the DSP task.

iv. Dedicated FPGA hardware: Similar to ASIC except that it offers:
– Flexibility in terms of reconfiguration.
– Embedded microprocessor cores on the FPGA.

The Performance Gap

• Algorithmic complexity increases as application demands increase.
• In order to process these new algorithms, higher performance signal processing engines are required.

Traditional DSP Approaches

• Digital signal processor IC
– Software programmable, like a microprocessor
– Single MAC unit
– All processing done sequentially
– Fit the algorithm to the architecture
• ASIC (gate array)
– Fit the architecture to the algorithm
– Significantly higher performance than a DSP processor
– High cost and high risk to develop
– Usually only for high-volume applications

Figure: 'Traditional' DSP processor. Analog input → ADC → MAC unit with data controller and memory → digital output → DAC → analog output.

The Promise of Programmable Logic

• ASIC
– Pros: high performance, high density, one-chip solution
– Cons: high design risk, long design cycle
• DSP processor
– Pros: high flexibility, good adaptability, low design risk
– Cons: performance, hardware complexity
• FPGA: best from both worlds, plus:
– Efficient IC architecture
– System features
– Short design cycle
– Automatic migration to low-cost HardWire

Why FPGAs?

The most commonly used DSP functions are:

FIR (Finite Impulse Response) filters, IIR (Infinite Impulse Response) filters, FFT (Fast Fourier Transform), DCT (Discrete Cosine Transform), encoder/decoder and error correction/detection functions.

All of these blocks perform intensive arithmetic operations (data-path-intensive operations) such as add, subtract, multiply, multiply-add or multiply-accumulate.

Why Use FPGAs in DSP Applications?

• 10x more DSP throughput than DSP processors: parallel vs. serial architecture
• Cost-effective for multi-channel applications
• Flexible hardware implementation
• Single-chip solution
• System (hardware/software) integration benefits

Figure: A DSP system built from a software DSP plus an FPGA, compared with a single FPGA running software on an embedded processor.

DSP-related embedded FPGA resources

• Many FPGAs incorporate dedicated multiplier blocks (Virtex-5/6/7).
• Similarly, some FPGAs offer dedicated adder blocks.
• One operation that is very common in DSP-type applications is the multiply-and-accumulate (MAC).
• To make life easier for implementing DSP on FPGAs, some provide an entire MAC as an embedded function (Virtex-4).

Figure: MAC unit. Inputs A[n:0] and B[n:0] feed a multiplier, followed by an adder and an accumulator that produce the output Y[(2n - 1):0].

DSP Functions are Parallel in Nature

8-Bit, 16-Tap Finite Impulse Response (FIR) Filter

Equation:

$$Y_j = \sum_{k=0}^{n-1} c_k \, x_{j-k}$$

Writing $x_k$ for the sample currently held in tap $k$, the symmetrical coefficients give:

$$Y = c_0 x_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \dots + c_3 x_{12} + c_2 x_{13} + c_1 x_{14} + c_0 x_{15}$$

Figure: Data input X[7:0] shifts through a chain of registers (filter taps 0 to 15). Taps that share a coefficient are paired (0 and 15, 1 and 14, 2 and 13, 3 and 12, 4 and 11, 5 and 10, 6 and 9, 7 and 8), each pair is multiplied by its filter coefficient C0 to C7, and the values are accumulated into data output Y[9:0].
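The benefit of symmetrical coefficients can be checked with a small Python model: adding the two taps that share a coefficient before multiplying halves the number of multiplications. The function names and the coefficient values below are assumptions chosen for illustration, not values from the slide.

```python
def fir_direct(taps, coeffs):
    """One multiplication per tap: 16 multiplies for a 16-tap filter."""
    return sum(c * x for c, x in zip(coeffs, taps))


def fir_symmetric(taps, half_coeffs):
    """Pre-add the paired taps (0 and 15, 1 and 14, ...), then multiply once per pair."""
    n = len(taps)
    acc = 0
    for k, c in enumerate(half_coeffs):
        acc += c * (taps[k] + taps[n - 1 - k])  # 8 multiplies for 16 taps
    return acc


half = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical C0..C7
full = half + half[::-1]                  # symmetrical 16-coefficient set
taps = list(range(16))                    # hypothetical tap contents x0..x15

print(fir_direct(taps, full), fir_symmetric(taps, half))  # 540 540
```

Both forms give the same output, so the pairing shown in the figure is purely an implementation saving.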

DSP and FPGA

The FPGA's parallel approach to DSP enables higher computational throughput.

Consider a 256-tap FIR filter:
• Conventional DSP processor: serial implementation
• FPGA: fully parallel implementation

Multiply Accumulate: Multiple Engines

• Parallel processing maximizes data throughput
– Support any level of parallelism
– Optimal performance/cost tradeoff
• 256-tap FIR filter
– 256 multiply and accumulate (MAC) operations per data sample
– One output every clock cycle
• Flexible architecture
– Distributed DSP resources (LUT, registers, multipliers, and memory)

Figure: Data In shifts through Reg0, Reg1, Reg2, …, Reg255; each register is multiplied by its coefficient C0, C1, C2, …, C255 and the products are summed to Data Out. All 256 MAC operations complete in 1 clock cycle.

FPGAs Outperform 'Traditional' DSP Processors

Figure: Performance comparisons (external performance) for an 8-bit, 16-tap FIR filter. Vertical axis: performance relative to a 50 MHz fixed-point DSP (0 to 25).

Implementation | Sample rate | Relative performance
133 MHz Pentium processor | 750 KHz | 0.24
Single 50 MHz DSP | 3 MHz | 1.00
XC4003E-3 FPGA (68% util.), Serial Distributed Arithmetic (SDA) | 8 MHz | 2.60
Four 50 MHz DSPs (MCM) | 12 MHz | 4.00
XC4010E-3 FPGA (98% util.), Parallel Distributed Arithmetic (PDA) | 56 MHz | 16.00
XC4013E-2 FPGA (75% util.), Parallel Distributed Arithmetic (PDA) (est.) | 66 MHz | 22.00

Case Study: Viterbi Decoder

Figure: Viterbi decoder datapath implemented as an FPGA-based DSP co-processor on the I/O bus. 24-bit adders and subtractors combine Old_1, Old_2 and INC; the MSBs of Diff_1 and Diff_2 drive multiplexers that select New_1 and New_2 and produce the prestate buffer bit; optional pipelining registers are shown.

Configuration | Devices | Components | Time
DSP-only | 8 | Two 66 MHz DSPs, six 15 ns SRAMs, system logic | 360 ns
DSP + FPGA | 4 | One 66 MHz DSP, XC4013E-3 FPGA (44%), three 15 ns SRAMs | 135 ns

Relative performance: 2.67 times better with FPGA-assisted DSP.

What to Look for in Your DSP Application

• Identify parallel data paths
• Find operations that require multiple clock cycles
• Processor bottlenecks

Figure: YES/NO comparison of DSP processor, ASIC and FPGA on flexibility, parallel data paths, scaleable bandwidth, design modification and device expansion.

When to Use FPGAs for DSP

• High sample rates
– Up to 500 MHz with Virtex 5/6/7
• Low sample rates
– Integrate DSP + system logic in a low-cost DSP using serial sequential algorithm
• Short word lengths
– The DA (distributed arithmetic) algorithm gets faster with shorter word length
• Lots of filter taps
– FPGA processes all taps in parallel, faster than DSP
• Fast correlators
• Single-chip solution required
• HardWire gate array migration path for high-volume designs

Figure: Data rate (with 50 MHz system clock, 0 to 50) versus arithmetic operations per sample (1 to 48). Curves for 1, 2, 3 and 4 DSPs separate the DSP region from the FPGA region.
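Why distributed arithmetic gets faster with shorter word lengths can be seen in a small software model. The sketch below assumes unsigned input samples and a small look-up table holding every possible sum of coefficients; the function names and test values are hypothetical, and real FPGA implementations differ in detail (for example in how signed data is handled).

```python
def build_da_lut(coeffs):
    """LUT entry `a` holds the sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << n)]


def da_dot(samples, lut, word_length):
    """Bit-serial dot product of unsigned samples: one LUT lookup per data bit."""
    acc = 0
    for b in range(word_length):            # one step per bit of the word
        address = 0
        for k, x in enumerate(samples):     # gather bit b of every sample
            address |= ((x >> b) & 1) << k
        acc += lut[address] << b            # shift-and-add, no multiplier needed
    return acc


coeffs = [3, -1, 4, 2]                      # hypothetical filter coefficients
samples = [200, 17, 255, 0]                 # unsigned 8-bit samples
lut = build_da_lut(coeffs)
print(da_dot(samples, lut, 8))              # 1603, same as the direct dot product
```

The outer loop runs once per data bit, so an 8-bit word needs 8 steps regardless of the number of taps, and a shorter word needs fewer.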

Co-processing with an FPGA

FPGA co-processors are an extremely cost-effective means of off-loading computationally intensive algorithms from a DSP processor.

• FPGA coprocessor for WiMAX baseband processing
• FPGA coprocessor for high-definition H.264 encoding

Digital Filters

• Digital filters are one of the main elements of DSP and are performed using only a MAC operation.
• A digital filter performs a filtering function on data by attenuating selected bands of frequencies.

Examples:
• Remove high-frequency noise from a speech signal
• Remove 50 Hz mains hum from an ECG signal
• Emphasize a particular frequency in a music signal
• Remove low-frequency noise for some sensors

Low Pass Digital Filter

An example of the operation of a low-pass filter is a weighted sum of the most recent N input samples.

The weights W0 to WN-1 must be appropriately chosen.
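As a rough illustration of how the choice of weights produces a low-pass behaviour, the Python sketch below uses four equal weights (a moving average). The weights, function name and test signals are assumptions for the example, not values from the lecture.

```python
def weighted_average_filter(signal, weights):
    """Output = W0*x[n] + W1*x[n-1] + ... over the most recent len(weights) samples."""
    history = [0.0] * len(weights)
    out = []
    for sample in signal:
        history = [sample] + history[:-1]   # newest sample first
        out.append(sum(w * x for w, x in zip(weights, history)))
    return out


weights = [0.25, 0.25, 0.25, 0.25]          # hypothetical W0..W3
fast = [1, -1, 1, -1, 1, -1, 1, -1]         # highest-frequency alternation
slow = [1, 1, 1, 1, 1, 1, 1, 1]             # constant (zero-frequency) signal

print(weighted_average_filter(fast, weights)[3:])  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(weighted_average_filter(slow, weights)[3:])  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

Once the window has filled, the fast alternation is removed completely while the constant signal passes unchanged, which is the low-pass behaviour these particular weights give.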

Digital Filters: Types

• Finite Impulse Response (FIR): non-recursive linear filter (i.e. no feedback present).
• Infinite Impulse Response (IIR): recursive linear filter (i.e. with feedback).
• Adaptive Digital Filter (ADF): a self-learning filter that adapts itself to a desired signal.
• Non-linear filters: filters that can perform non-linear operations, e.g. median filters and min/max filters.

FIR Filters

A Finite Impulse Response (FIR) filter performs a weighted average (convolution) on a window of N data samples:

$$y(n) = W_0\,x(n) + W_1\,x(n-1) + \dots + W_{N-1}\,x(n-N+1)$$

Figure: Finite-impulse response filter. The input passes through a chain of registers ($z^{-1}$ delays); the tap values are multiplied by coefficients C1, C2, …, C(N-1), CN and the products are summed by adders.

Frequency Response

The frequency/phase response of a digital filter is found by taking the Discrete Fourier Transform (DFT) of the impulse response.
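A direct, unoptimized DFT is enough to illustrate this. The sketch below uses only the Python standard library; the impulse response `h` (four equal coefficients) and the number of frequency points are assumptions chosen for the example.

```python
import cmath


def frequency_response(h, num_points):
    """DFT of impulse response h at num_points frequencies from 0 up to the sample rate."""
    response = []
    for m in range(num_points):
        w = 2 * cmath.pi * m / num_points   # normalized angular frequency
        response.append(sum(hk * cmath.exp(-1j * w * k) for k, hk in enumerate(h)))
    return response


h = [0.25, 0.25, 0.25, 0.25]                # hypothetical impulse response
H = frequency_response(h, 8)
magnitude = [abs(v) for v in H]             # gain at each frequency
phase = [cmath.phase(v) for v in H]         # phase shift at each frequency
print(round(magnitude[0], 6), round(magnitude[4], 6))  # 1.0 0.0
```

Index 0 is DC and index 4 is half the sample rate, so this filter passes DC with unity gain and rejects the highest representable frequency.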

FPGA Implementations

1. Hardware Description Language: VHDL, Verilog
2. Electronic System Level: Handel-C, Vivado HLS (Lab #7), Impulse-C
3. Core Generator (IP selection)
4. System Generator (Lab #6): Matlab, Simulink, System Generator

FIR Filter: VHDL Implementation

Simple VHDL design example of an 8-tap FIR filter.

Hardware Description Languages

Full VHDL/Verilog (RTL code)

• Advantages:
– Portability and efficient implementation
– Complete control of the design implementation and tradeoffs
– Easier to debug and understand code that you own
• Disadvantages:
– Can be time consuming
– Don't always have control over the synthesis tool
– Need to be familiar with the algorithm and how to write it

Abstraction: Advantages

CORE Generator

Figure: Design flow. Design steps: HDL with COREGen cores → synthesis → implementation → download. Verification steps: behavioral simulation, functional simulation, timing simulation and in-circuit verification.

Instantiate optimized IP within the HDL code.

Xilinx CORE Generator

• List of available IP
• Fully parameterizable
• IP CENTER: http://www.xilinx.com/ipcenter

Xilinx IP Solutions

DSP Functions:
$P Reed Solomon; $ 3GPP Turbo Code; $P Viterbi Decoder; $P Convolution Encoder; $P Interleaver/De-interleaver; P LFSR; P 1D DCT; P DA FIR; P MAC; P MAC-based FIR filter; Fixed FFTs 16, 64, 256, 1024 points; P FFT - 32 Point; P Sine Cosine; P Direct Digital Synthesizer; P Cascaded Integrator Comb; P Bit Correlator; P Digital Down Converter

Memory Functions:
P Asynchronous FIFO; P Block Memory modules; P Distributed Memory; P Distributed Mem Enhance; P Sync FIFO (SRL16); P Sync FIFO (Block RAM); P CAM (SRL16)

Base Functions:
P Binary Decoder; P Two's Complement; P Shift Register RAM/FF; P Gate modules; P Multiplexer functions; P Registers, FF & latch based; P Adder/Subtractor; P Accumulator; P Comparator; P Binary Counter

Math Functions:
P Multiplier Generator (Parallel Multiplier, Dyn Constant Coefficient Mult, Serial Sequential Multiplier, Multiplier Enhancements); P Divider; P CORDIC

PCI:
$P PCI 64/66; $PS PCI 32/33; $P PCI-X 64/66

Networking:
8B/10B Encoder/Decoder; $ POS-PHY L3; $ POS-PHY L4; $ Flexbus 4; $ RapidIO PHY Layer; $S HDLC 1 and 32 channel; $S G.711 PCM Cores; $S ADPCM 32 & 64 channel

Legend: $ - License Fee, P - Parameterized, S - Project License Available, BOLD – Available in the Xilinx Blockset for the System Generator for DSP

Core Generator: Summary

CORE Generator

• Advantages
– Can quickly access and generate existing functions
– No need to reinvent the wheel and re-design a block if it meets specifications
– IP is optimized for the specified architecture
• Disadvantages
– IP doesn't always do exactly what you are looking for
– Need to understand signals and parameters and match them to your specification
– Dealing with a black box, with little information on how the function is implemented

Xilinx System Generator for DSP

• Industry's first system-level design environment (IDE) for FPGAs
• Simulink library of arithmetic, logic operators and DSP functions (Xilinx blockset)
• Arithmetic abstraction
• VHDL code generation for most Spartan-based FPGAs and Virtex 4/5/6/7 FPGAs
• Enables hardware-in-the-loop co-simulation

MATLAB

• MATLAB™, the most popular system design tool, is a programming language, interpreter, and modeling environment
– Extensive libraries for math functions, signal processing, DSP, communications, and much more
– Visualization: large array of functions to plot and visualize your data and system/design
– Open architecture: software model based on base system and domain-specific plug-ins

System Level Evaluation

• Irrespective of the final implementation technology (GPP, DSP, ASIC, FPGA), if one is creating a product that is to be based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment.
• The de facto industry standard for DSP algorithmic verification is MATLAB.

Figure: Original concept → algorithmic verification → handcrafted assembly, handcrafted C/C++ or auto C/C++ generation → compile/assemble → machine code.

System/Algorithmic level to RTL

• Many DSP design teams commence by performing their system-level evaluation and algorithmic validation in MATLAB using a floating-point representation.
• Alternatively, they may first convert the floating-point representation into its fixed-point counterpart at the system level.
• At this point, many design teams move directly to hand-coding fixed-point RTL equivalents of the design in VHDL.

Figure: (a) Original concept → system/algorithmic verification (floating-point) → handcraft Verilog/VHDL RTL (fixed-point) → standard RTL-based simulation and synthesis. (b) The same flow with an additional system/algorithmic verification (fixed-point) step before the handcrafted RTL.

Simulink

• Simulink™ - Visual data flow environment for modeling and simulation of dynamical systems
– Fully integrated with the MATLAB engine
– Graphical block editor
– Event-driven simulator
– Models parallelism
– Extensive library of parameterizable functions
• Simulink Blockset - math, sinks, sources
• DSP Blockset - filters, transforms, etc.
• Communications Blockset - modulation, DPCM, etc.

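Traditional Simulink FPGA Flow

Figure: The system architect performs system verification in Simulink; the FPGA designer separately writes HDL and takes it through synthesis, implementation, timing simulation and download. A GAP separates the two sides, and their equivalence must be verified.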

System Generator

Figure: System verification is performed in MATLAB/Simulink; System Generator then produces the HDL side automatically (VHDL, IP, testbench, constraints file), which goes through synthesis, implementation, timing simulation and download.

Creating a System Generator Design

• The Xilinx Blockset is listed in the Simulink Library Browser
• Create the design by dragging and dropping components from the Xilinx Blockset onto a new sheet

Finding Blocks

• Use the Find feature to search ALL Simulink libraries
• The Xilinx blockset has nine major sections
– Basic elements: counters, delays
– Communication: error correction blocks
– Control Logic: MCode, Black Box
– Data Types: Convert, Slice
– DSP: FDATool, FFT, FIR
– Index: all Xilinx blocks, a quick way to view all blocks
– Math: multiply, accumulate, inverter
– Memory: Dual Port RAM, Single Port RAM
– Tools: ModelSim, Resource Estimator

Configure Your Blocks

• Double-click or go to Block Parameters to view a block's configurable parameters
– Arithmetic Type: unsigned or twos complement
– Implement with Xilinx Smart-IP Core (if possible)/Generate Core
– Latency: specify the delay through the block
– Overflow and Quantization: users can saturate or wrap on overflow, and truncate or round for quantization
– Override with Doubles: simulation only
– Precision: full, or the user can define the number of bits and where the binary point is for the block
– Sample Period: can be inherited with a "-1" or must be an integer value
• Note: While all parameters can be simulated, not all are realizable

Values Can Be Equations

• You can also enter equations in the block parameters, which can aid calculation and your own understanding of the model parameters
• The equations are calculated at the beginning of a simulation
• Useful MATLAB operators
– + add
– - subtract
– * multiply
– / divide
– ^ power
– pi (3.1415926535897…)
– exp(x) exponential (e^x)

Important Concept 1: The Numbers Game

• Simulink uses a "double" to represent numbers in a simulation. A double is a 64-bit IEEE 754 floating-point number (1 sign bit, 11 exponent bits, 52 fraction bits)
– Because the binary point can move, a double covers a very wide range (up to about ±1.8 x 10^308) with roughly 15 to 17 significant decimal digits: a wide, desirable range, but not efficient or realistic for FPGAs
• The Xilinx Blockset uses n-bit fixed-point numbers (twos complement optional)

Design Hint: Always try to maximize the dynamic range of the design by using only the required number of bits

Example: Format = Fix_16_13, Value = -2.261108…
• Integer bits (weights -2^2, 2^1, 2^0): 1 0 1
• Fraction bits (weights 2^-1 down to 2^-13): 1 0 1 1 1 1 0 1 0 0 1 0 1

Format = Sign_Width_Binary point position counted from the LSB (Sign: Fix = signed value, UFix = unsigned value)

Thus, a conversion is required when Xilinx blocks communicate with Simulink blocks (Xilinx blockset → MATLAB I/O → Gateway In/Out)
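The Fix_16_13 example can be checked numerically. The helper below is a hypothetical Python function written for this illustration (it is not part of System Generator); it interprets a bit string as a twos complement number with a given number of fraction bits.

```python
def decode_fix(bits, frac_bits, signed=True):
    """Interpret a bit string as a fixed-point value with frac_bits fraction bits."""
    raw = int(bits, 2)
    if signed and bits[0] == "1":
        raw -= 1 << len(bits)        # twos complement: the MSB carries a negative weight
    return raw / (1 << frac_bits)    # move the binary point frac_bits places from the LSB


# The 16 bits of the Fix_16_13 example, MSB first
print(decode_fix("1011011110100101", 13))  # -2.2611083984375
```

The MSB carries weight -2^2, which is why the value is negative even though most of the remaining bits are set.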

What About All Those Other Bits?

• The Gateway In and Out blocks support parameters to control the conversion from double precision to N-bit fixed-point precision

Figure: A DOUBLE value with bits from weight -2^6 (sign-extended 1 1 1 1 above 2^2) down to 2^-13 is converted to FIX_12_9, which keeps only the bits weighted -2^2 to 2^-9 (1 0 1 . 1 0 1 1 1 1 0 1 0).
• QUANTIZATION (bits below the LSB): Truncate or Round
• OVERFLOW (bits above the MSB): Wrap, Saturate or Flag Error
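The quantization and overflow options can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the Gateway In implementation: the function name and parameter names are hypothetical, "round" is assumed to mean round-half-up, and "flag error" is modelled as raising an exception.

```python
import math


def to_fixed(value, width, frac_bits, quantization="truncate", overflow="saturate"):
    """Convert a float to a signed fixed-point value with `width` total bits."""
    scaled = value * (1 << frac_bits)
    if quantization == "truncate":
        raw = math.floor(scaled)            # drop the bits below the LSB
    else:
        raw = math.floor(scaled + 0.5)      # assumed rounding rule: round half up
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if raw < lo or raw > hi:
        if overflow == "saturate":
            raw = max(lo, min(hi, raw))     # clamp to the largest/smallest code
        elif overflow == "wrap":
            raw = (raw - lo) % (1 << width) + lo   # discard the bits above the MSB
        else:
            raise OverflowError("value does not fit in the fixed-point format")
    return raw / (1 << frac_bits)


print(to_fixed(-2.2611083984375, 12, 9))            # -2.26171875
print(to_fixed(5.0, 12, 9, overflow="saturate"))    # 3.998046875
print(to_fixed(5.0, 12, 9, overflow="wrap"))        # -3.0
```

The first line reproduces the 12-bit pattern 1 0 1 . 1 0 1 1 1 1 0 1 0; the other two show how differently saturate and wrap treat a value outside the range of a signed 12-bit format with 9 fraction bits.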

Creating a System Generator Design

Figure: Example model, annotated with:
• Simulink sources
• IO blocks used as the interface between the Xilinx blockset and other Simulink blocks
• SysGen blocks realizable in hardware
• Simulink sinks and library functions

Using the Scope

• Click Properties to change the number of axes displayed and the time range value (X-axis)
• Use Data history to control how many values are stored and displayed on the scope
• Click Autoscale to quickly let the tools configure the display to the correct axis values
• Right-click on the Y-axis to set its value

Design & Simulate in Simulink

Simulate the design by pushing "play." Go to "Simulation Parameters" under the "Simulation" menu to control the length of simulations.

Resource Estimator

• The block provides fast estimates of the FPGA resources required to implement the subsystem
• Most of the blocks in the System Generator Blockset carry resource information
– LUTs
– FFs
– BRAM
– Embedded multipliers
– 3-state buffers
– I/Os
• Three types of estimation
– Estimate Area: computes resources for the current level and all sub-levels
– Quick Sum: uses the resources stored in the blocks directly and sums them up (no sub-level functions are invoked)
– Post-Map Area: opens a file browser and lets the user select a map report file. The design should have been generated and gone through the synthesis, translate, and mapping phases.

The Black Box

Use the Black Box when:
• You need a function that cannot be created with the Xilinx Blockset
• You already have a piece of VHDL you wish to use for a section of the design

The Black Box creates a placeholder in the generated VHDL. Use the Black Box parameters to control the VHDL placeholder's features.

Generate the VHDL Code

• Once complete, double-click the System Generator token
• Select the target device
• Select to generate the testbench
• Set the desired system clock period
• Generate the VHDL

Hardware-in-the-Loop Reduces Design Time & Cost

• Configure any development board for hardware-in-the-loop using the JTAG header in < 20 minutes
– Automatically create the FPGA bit-stream from Simulink
– Transparent use of FPGA implementation tools
– Accelerate and verify the Simulink design using FPGA hardware
– Mirrors traditional DSP processor design flows
• Combine with the black box to simulate HDL & EDIF

Create Bit-stream

Step 1: Select the target H/W platform.
Step 2: Generate the bit-stream.

Co-Simulate in Hardware

Step 3 (contd.): A post-generation script creates a new library containing a parameterized run-time co-simulation block.
Step 4: Copy the co-simulation run-time block into the original model.
Step 5: Simulate for verification.

Hardware in the Loop Performance Results

Application | Software simulation time (seconds) | Hardware simulation time (seconds) | Speed-up

Single Step Clock Mode (bit and cycle accurate):
5 x 5 Image Filter | 170 | 4 | 43X
Cordic Arc Tangent | 187 | 27 | 7X
Additive White Gaussian Noise Channel | 600 | 80 | 7.5X
QAM Demodulator + Extension | 1203 | 18 | 67X

Free Running Clock Mode:
Image Filtering | 676 | 6 | 112X

In free running clock mode, a free-running clock is provided to the design, so the hardware no longer runs in lockstep with the software. The test is started, and after some time a 'done' flag is set to read the results from the FPGA and display them in Simulink. Using this hardware co-simulation method, designers can achieve up to 6 orders of magnitude performance enhancement over the original software simulation.

DSP System Generator: Summary

• System Generator for DSP
– Advantages
• Ability to simulate the design at a system level
• High level of abstraction: very attractive for FPGA novices
• Optimize for area, speed, or a combination
• Estimate resources easily
• Hardware co-simulation (FPGA in the loop)
• Test-bench and golden data written automatically
– Disadvantages
• Cost of abstraction: doesn't always give the best result from an area usage point of view
• Only as good as the IP support

FPGAs versus DSP

• FPGAs can outperform DSP processors on certain DSP tasks: computation-intensive, highly parallelizable tasks
• DSP processors have the advantage in development infrastructure, time-to-market and developer familiarity
– DSP processors are still easier to use
– Many engineers possess DSP processor development skills
– Ultimate speed is not always the first priority
• A combination of FPGA and DSP processor is an excellent solution if performance requirements cannot be met by the processor alone
• The "best" architecture depends on the requirements of the application

Problem with this flow?

• There is a significant conceptual and representational divide between the system architects working at the system/algorithmic level and the hardware design engineers working with RTL representations in VHDL.
• Manual translation from one to the other is time consuming and prone to error.
• Any change made to the original specification during the course of the project must be translated to RTL again, which is painful and time consuming.

Figure: (a), (b) Original concept → handcraft Verilog/VHDL RTL (fixed-point).

Direct RTL Generation

• Some system/algorithmic level design environments offer direct VHDL code generation.
• An example of this type of environment is offered by AccelChip Inc, whose environment can accept floating-point MATLAB M-files, output their fixed-point equivalents for verification, and then use these new M-files to auto-generate RTL.

Figure: (a) Original concept → system/algorithmic environment → auto-generate Verilog/VHDL RTL (fixed-point). (b) Original concept → system/algorithmic environment → third-party environment with auto-interactive quantization (fixed-point) → auto-generate Verilog/VHDL RTL (fixed-point).

Transposed FIR with Multiplier Block

DSP Processors vs. FPGAs

High-speed DSP processor (MAC units used in sequence):
• 1-8 multipliers
• Needs looping for more than 8 multiplications
• Needs multiple clock cycles because of serial computation
• A 200-tap FIR filter would need 25+ clock cycles per sample with an 8-MAC-unit processor

High level of parallel processing in an FPGA (many MAC units side by side):
• Can implement hundreds of MAC functions in an FPGA
• Parallel implementation allows for faster throughput
• A 200-tap FIR filter would need 1 clock cycle per sample
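The clock-cycle figures quoted for the 200-tap and 256-tap filters follow from a simple ceiling division. The helper names below are hypothetical and the model ignores loop overhead and memory access, which is why a real processor needs "25+" rather than exactly 25 cycles.

```python
def cycles_per_sample(taps, mac_units):
    """Minimum clock cycles per output sample when mac_units MACs run in parallel."""
    return -(-taps // mac_units)   # integer ceiling division


def max_sample_rate_hz(clock_hz, taps, mac_units):
    """Upper bound on the sample rate for a given clock frequency."""
    return clock_hz / cycles_per_sample(taps, mac_units)


print(cycles_per_sample(200, 8))     # 25  (8-MAC DSP processor)
print(cycles_per_sample(200, 200))   # 1   (fully parallel FPGA)
print(cycles_per_sample(256, 1))     # 256 (single time-shared MAC unit)
```

For the same clock frequency, the achievable sample rate therefore scales directly with the number of MAC units that can work in parallel.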

Multiply Accumulate: Single Engine

• Sequential processing limits data throughput
– Time-shared MAC unit
– Data width is fixed!!
– High clock frequency creates a difficult system challenge
• 256-tap FIR filter
– 256 multiply and accumulate (MAC) operations per data sample
– One output every 256 clock cycles

Figure: Data In → Reg → MAC unit (loop algorithm 256 times) → Data Out.

Filters: Applications

Impulse Response

The impulse response of an FIR filter is obtained from the output of the filter when a single unit impulse is input.
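A short Python model makes this concrete: feeding a single 1 followed by zeros into an FIR filter returns the coefficients themselves, after which the output falls to zero (hence "finite"). The function name and coefficient values are assumptions for illustration.

```python
def fir_filter(signal, coeffs):
    """Direct-form FIR: each output is the weighted sum of the most recent samples."""
    history = [0] * len(coeffs)
    out = []
    for sample in signal:
        history = [sample] + history[:-1]   # shift the delay line
        out.append(sum(c * x for c, x in zip(coeffs, history)))
    return out


coeffs = [1, 3, 3, 1]                        # hypothetical coefficients
impulse = [1, 0, 0, 0, 0, 0]                 # single unit impulse
print(fir_filter(impulse, coeffs))           # [1, 3, 3, 1, 0, 0]
```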

Solution: Building a MAC with System Generator

$$c = \sum_i a_i \, b_i$$

Figure: Inputs a and b feed a multiplier followed by an accumulating adder (+) that produces c.

• MAC using embedded multiplier
– Slice count: 22 slices, 1 embedded multiplier
– Performance: ~126 MHz (2v1000 -4)
• MAC using slice-based multiplier
– Slice count: 70 slices
– Performance: ~130 MHz (2v1000 -4)

FIR: Cont … VHDL Implementation

• For convenience, the selected coefficients are powers of 2.
• To operate, the filter must have eight register stages, each of which is eight bits wide.
• Therefore, for the register or memory portion of the design, 64 flip-flops are required.
• At each clock cycle, each coefficient is multiplied by the eight-bit value in the appropriate register.
• Due to the selection of "powers of two" coefficients, multiplication is achieved by a simple shifting operation.
• The coefficient values may be stored as constants.

The coefficients used in the example are given below:

$$a_0 = 2^{-3},\; a_1 = 2^{-2},\; a_2 = 2^{-1},\; a_3 = 1,\; a_4 = 1,\; a_5 = 2^{-1},\; a_6 = 2^{-2},\; a_7 = 2^{-3}$$
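The shift-based multiplication can be modelled in Python before writing any VHDL. This is a behavioural sketch only: the names `SHIFTS` and `fir_shift` are hypothetical, the model is not cycle-exact, and it assumes unsigned 8-bit inputs with the newest sample entering register 7.

```python
# Right-shift amounts for a0..a7 = 2^-3, 2^-2, 2^-1, 1, 1, 2^-1, 2^-2, 2^-3
SHIFTS = [3, 2, 1, 0, 0, 1, 2, 3]


def fir_shift(signal):
    """8-tap FIR whose power-of-two coefficients are applied as right shifts."""
    regs = [0] * 8                       # eight register stages
    out = []
    for x in signal:
        regs = regs[1:] + [x]            # shift; the new sample enters register 7
        out.append(sum(r >> s for r, s in zip(regs, SHIFTS)))
    return out


print(fir_shift([8] + [0] * 8))          # [1, 2, 4, 8, 8, 4, 2, 1, 0]
print(fir_shift([255] * 8)[-1])          # 952, the largest possible output
```

The impulse of 8 reproduces the coefficient pattern scaled by 8, and the second line shows that full-scale 8-bit inputs can drive the output as high as 952, which is worth keeping in mind when sizing the output port.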

VHDL Description of FIR Filter

library ieee;
use ieee.std_logic_1164.all;

entity FIR1 is
  port (clk : in  std_logic;
        x   : in  integer range 0 to 255;    -- 8-bit input sample
        y   : out integer range 0 to 1023);  -- worst-case sum is 952, which would overflow 0 to 511
end entity FIR1;

VHDL Description of FIR Filter (cont.)

architecture arch1 of FIR1 is
begin
  process (clk)
    -- eight register stages, each eight bits wide
    type RegType is array (7 downto 0) of integer range 0 to 255;
    variable Reg : RegType := (others => 0);
  begin
    if (clk'event and clk = '1') then
      -- multiply/accumulate (MAC) operation;
      -- division by a power of two is implemented as a shift
      y <= Reg(0)/8 + Reg(1)/4 + Reg(2)/2 + Reg(3)
         + Reg(4) + Reg(5)/2 + Reg(6)/4 + Reg(7)/8;
      -- update register values by shifting
      Reg(0) := Reg(1);
      Reg(1) := Reg(2);
      Reg(2) := Reg(3);
      Reg(3) := Reg(4);
      Reg(4) := Reg(5);
      Reg(5) := Reg(6);
      Reg(6) := Reg(7);
      Reg(7) := x;
    end if;
  end process;
end architecture arch1;