ENG6530 RCS 1
ENG6530 Reconfigurable
Computing Systems
Digital Signal Processing using FPGAs
Topics
• Digital Signal Processing (DSP): definition, advantages and disadvantages, applications, …
• DSP vs. GPP vs. ASIC vs. FPGA
• Why use reconfigurable computing
• Xilinx System Generator
References
I. http://www.xilinx.com
II. “Reconfigurable Computing for DSP: A Survey”, by R. Tessier and W. Burleson, 2001
III. “Optimization Techniques for Efficient Implementation of DSP in FPGAs”, by J. Wang
IV. “Reconfigurable Computing: The Theory and Practice of FPGA-Based Computing”, Chapter 24: Distributed Arithmetic
Introduction
• The term Digital Signal Processing, or DSP, refers to the branch of electronics concerned with the representation and manipulation of signals in digital form.
• Application areas include:
i. Telecommunication (switches, …)
ii. Medical (images, equipment, …)
iii. Military (radar, missiles, …)
iv. Consumer (cell phones, TVs, …)
DSP Flow
• The data to be processed starts out as a signal in the real (analog) world.
• This analog signal is then sampled by means of an analog-to-digital converter (A/D).
• These samples are then processed in the digital domain.
• The digital samples are subsequently converted into an analog equivalent by means of a digital-to-analog converter (D/A).

[Figure: analog input signal → A/D → digital input samples → DSP → modified output samples → D/A → analog output signal. The A/D input and the D/A output lie in the analog domain; everything in between is the digital domain.]
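
The A/D step above can be sketched in a few lines of Python. This is only an illustration, assuming an ideal sampler and a signed `n_bits` quantizer; the function name `sample_and_quantize` and the 1 kHz tone sampled at 8 kHz are hypothetical choices, not values from the slides.

```python
import math


def sample_and_quantize(freq_hz, sample_rate_hz, n_samples, n_bits):
    """Sample a unit-amplitude sine wave and quantize it to signed integer codes."""
    full_scale = 2 ** (n_bits - 1) - 1  # largest positive code, e.g. 127 for 8 bits
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                          # sampling instant
        analog = math.sin(2 * math.pi * freq_hz * t)    # "analog" value in [-1, 1]
        codes.append(round(analog * full_scale))        # quantize to nearest code
    return codes


# One period of a 1 kHz tone sampled at 8 kHz with an 8-bit converter.
codes = sample_and_quantize(freq_hz=1000, sample_rate_hz=8000, n_samples=8, n_bits=8)
print(codes)
```

The list `codes` plays the role of the "digital input samples" in the figure; a D/A stage would map the codes back to voltages.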
[Figure: DSP flow in a digital system: ADC (sampling + quantization) → DSP → DAC, operating on binary sample streams (1010…, 1001…). Design topics shown: signal analysis, system analysis, filter design, and architecture (fixed-point arithmetic, architecture types, selection criteria).]
Transition from Analog to Digital
The transition from analog to digital techniques has been driven by the many advantages of DSP:
• The main advantage of digital signals over analog signals is that the precise signal level of the former is not vital (immune to imperfections).
• Digital signals can be saved in memory and then recalled.
• Digital signals can convey information with greater noise immunity.
• Digital signals can be processed by digital circuit components, which are cheap and easily produced.
• Digital signals can be encrypted so that only the intended receiver can decode them.
• Precision is flexible: word lengths and/or number representation (e.g., fixed point vs. floating point) can be changed.
• A single processing element can process multiple incoming signals through multiplexing.
• Signals can be transmitted over long distances and at higher rates.
• Digital approaches can easily adjust their processing parameters, as in adaptive filtering.
Transition from Analog to Digital
The main disadvantages of DSP:
i. Increased system complexity: DSP requires that signals be converted between analog and digital forms using a sample-and-hold circuit, analog-to-digital converters (ADCs), digital-to-analog converters (DACs), and analog filtering.
ii. Power consumption: DSP tends to require more power since a dedicated processor is used.
iii. Frequency range limitation: analog hardware can naturally work with higher-frequency signals than DSP hardware can, because of the limits of analog-to-digital conversion.
For many applications, the advantages of DSP far outweigh these disadvantages.
DSP: Common Operations
Some of the most common operations performed on signals using digital or analog techniques include:
• Elementary time-domain operations: amplification, attenuation, integration, differentiation, addition of signals, multiplication of signals, etc.
• Filtering (FIR, IIR)
• Transforms (FFT, IFFT)
• Convolution (integral of the product of two functions)
• Error correction (transmission)
• Compression and decompression (audio, video)
• Modulation and demodulation (BPSK, QAM, FSK, ASK, …)
• Multiplexing and de-multiplexing
• Signal generation
DSP Applications
• Audio applications: MPEG audio, portable audio
• Photography: digital cameras, CAM
• Wireless applications: WiFi, WiMAX, Bluetooth
• Networking: switches, classifiers
• Medical equipment: hearing aids, heart pacers
• Cable modems: ADSL, VDSL
• Cellular phones: base stations, GSM, LTE
• Military applications: radar
Main DSP Operations
• DSP is the arithmetic processing of digital signals sampled at regular intervals.
• DSP can be reduced to three trivial operations: Delay, Add, Multiply.
• Accumulate = Add + Delay
• MAC = Multiply + Accumulate
• The MAC is the engine behind DSP.
• More MACs = higher performance, better signal quality.
• MACs vs. MIPS: not always equal.

[Figure: filter examples annotated with MAC counts: 3 MACs, 50* MACs, 100 MACs.]
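
To make "MAC = Multiply + Accumulate" concrete, here is a small Python sketch. It is an illustration only; the names `mac` and `dot_product` are hypothetical, and a real MAC unit would of course operate on fixed-width hardware words rather than Python integers.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: multiply two operands and add to the running sum."""
    return acc + a * b


def dot_product(coeffs, samples):
    """Chain MAC steps; the variable acc models the accumulator (add + delay)."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc = mac(acc, c, x)
    return acc


result = dot_product([1, 2, 3], [4, 5, 6])
print(result)
```

A filter with N taps needs N such MAC steps per output sample, which is why the MAC count is used above as a measure of DSP capability.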
Alternative DSP Implementations
DSP tasks can be implemented in a number of different ways:
i. A general-purpose processor (GPP): the processor performs DSP by running an appropriate DSP algorithm.
ii. A digital signal processor (PDSP): a specialized form of microprocessor chip designed to perform DSP tasks much faster and more efficiently than a GPP.
iii. Dedicated ASIC hardware: a custom hardware implementation that executes the DSP task.
iv. Dedicated FPGA hardware: similar to an ASIC, except that it offers:
• Flexibility in terms of reconfiguration.
• Embedded microprocessor cores on the FPGA.
The Performance Gap
• Algorithmic complexity increases as application demands increase.
• In order to process these new algorithms, higher-performance signal processing engines are required.

Traditional DSP Approaches
• Digital signal processor IC
– Software programmable, like a microprocessor
– Single MAC unit
– All processing done sequentially
– Fit the algorithm to the architecture
• ASIC (gate array)
– Fit the architecture to the algorithm
– Significantly higher performance than a DSP processor
– High cost and high risk to develop
– Usually only for high-volume applications

[Figure: ‘Traditional’ DSP processor: analog input → ADC → MAC with data controller and memory → DAC → analog output, plus a digital output.]
The Promise of Programmable Logic
ASIC
• Pros: high performance, high density, one-chip solution
• Cons: high design risk, long design cycle
DSP Processor
• Pros: high flexibility, good adaptability, low design risk
• Cons: performance, hardware complexity
FPGA
• Best from both worlds, plus:
– Efficient IC architecture
– System features
– Short design cycle
– Automatic migration to low-cost HardWire
Why FPGAs?
• The most commonly used DSP functions are FIR (Finite Impulse Response) filters, IIR (Infinite Impulse Response) filters, FFT (Fast Fourier Transform), DCT (Discrete Cosine Transform), encoder/decoder, and error correction/detection functions.
• All of these blocks perform intensive arithmetic operations (data-path-intensive operations) such as add, subtract, multiply, multiply-add, or multiply-accumulate.
Why Use FPGAs in DSP Applications?
• 10x more DSP throughput than DSP processors
– Parallel vs. serial architecture
• Cost-effective for multi-channel applications
• Flexible hardware implementation
• Single-chip solution
• System (hardware/software) integration benefits

[Figure: a DSP system built from a software DSP plus an FPGA, compared with a single FPGA containing an embedded processor and software.]
DSP-related embedded FPGA resources
• Many FPGAs incorporate dedicated multiplier blocks (Virtex-5/6/7).
• Similarly, some FPGAs offer dedicated adder blocks.
• One operation that is very common in DSP-type applications is the multiply-and-accumulate (MAC).
• To make life easier when implementing DSP on FPGAs, some devices provide an entire MAC as an embedded function (Virtex-4).

[Figure: MAC = multiplier + adder + accumulator, with inputs A[n:0] and B[n:0] and output Y[(2n - 1):0].]
DSP Functions are Parallel in Nature
8-Bit, 16-Tap Finite Impulse Response (FIR) Filter

Equation:
$$Y_j = \sum_{k=0}^{n-1} c_k x_{j-k}$$
With symmetrical coefficients, and writing $x_0, \dots, x_{15}$ for the contents of the 16 tap registers, the sum expands to
$$Y = c_0 x_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \dots + c_3 x_{12} + c_2 x_{13} + c_1 x_{14} + c_0 x_{15}$$

[Figure: data input X[7:0] feeds a chain of registers forming the filter taps; taps are paired (0, 15), (1, 14), (2, 13), (3, 12), (4, 11), (5, 10), (6, 9), (7, 8), each pair is multiplied by one of the filter coefficients C0 to C7, and the values are accumulated into data output Y[9:0].]
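
The benefit of symmetrical coefficients can be checked with a short Python sketch: because tap k and tap 15 - k share a coefficient, the two taps can be added first and multiplied once, so 16 taps need only 8 multiplications. The function names and the coefficient values below are hypothetical and chosen only for illustration.

```python
def fir_direct(coeffs, taps):
    """Direct form: one multiplication per tap."""
    return sum(c * x for c, x in zip(coeffs, taps))


def fir_folded(half_coeffs, taps):
    """Symmetric form: add each mirrored tap pair, then multiply once per pair."""
    n = len(taps)
    acc = 0
    for k, c in enumerate(half_coeffs):
        acc += c * (taps[k] + taps[n - 1 - k])
    return acc


half = [1, 2, 3, 4, 5, 6, 7, 8]      # C0..C7 (illustrative values)
full = half + half[::-1]             # symmetric 16-coefficient set
taps = list(range(16))               # x0..x15 (illustrative register contents)

direct = fir_direct(full, taps)
folded = fir_folded(half, taps)
print(direct, folded)
```

Both functions return the same value; the folded version simply reorganizes the arithmetic the way the tap pairing in the figure suggests.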
DSP and FPGA
• The FPGA's parallel approach to DSP enables higher computational throughput.
• Consider a 256-tap FIR filter:
– Conventional DSP processor: serial implementation
– FPGA: fully parallel implementation
Multiply Accumulate: Multiple Engines
• Parallel processing maximizes data throughput
– Support any level of parallelism
– Optimal performance/cost tradeoff
• 256-tap FIR filter
– 256 multiply and accumulate (MAC) operations per data sample
– One output every clock cycle
• Flexible architecture
– Distributed DSP resources (LUTs, registers, multipliers, and memory)

[Figure: Data In feeds registers Reg0 … Reg255; each register output is multiplied by its coefficient C0 … C255 and the products are summed into Data Out. All 256 MAC operations complete in 1 clock cycle.]
FPGAs Outperform ‘Traditional’ DSP Processors
[Figure: bar chart, 8-bit, 16-tap FIR filter performance comparison (external performance). Vertical axis: performance relative to a 50 MHz fixed-point DSP, 0 to 25. Bar labels: 0.24, 1.00, 2.60, 4.00, 16.00, 22.00. FPGA bars are marked as Serial Distributed Arithmetic (SDA) or Parallel Distributed Arithmetic (PDA); the chart also carries the labels “(est.)” and “MCM”.]

Configuration                     Sample rate
133 MHz Pentium™ processor        750 kHz
Single 50 MHz DSP                 3 MHz
XC4003E-3 FPGA (68% util.)        8 MHz
Four 50 MHz DSPs                  12 MHz
XC4010E-3 FPGA (98% util.)        56 MHz
XC4013E-2 FPGA (75% util.)        66 MHz
Case Study: Viterbi Decoder
(FPGA-based DSP Co-Processor)

[Figure: decoder datapath on a 24-bit I/O bus: inputs Old_1 and Old_2 (with INC) feed adders/subtractors producing New_1, New_2, Diff_1, and Diff_2; the MSBs drive two MUXes and the prestate buffer bit; optional pipelining registers are shown.]

              DSP-Only               DSP + FPGA
Devices       8                      4
Components    Two 66 MHz DSPs        One 66 MHz DSP
              Six 15 ns SRAMs        XC4013E-3 FPGA (44%)
              System logic           Three 15 ns SRAMs
Time          360 ns                 135 ns

2.67 times better performance with FPGA-assisted DSP.
What to Look for in Your DSP Application
• Identify parallel data paths
• Find operations that require multiple clock cycles
• Processor bottlenecks

[Table: YES/NO comparison of DSP processor, ASIC, and FPGA on flexibility, parallel data paths, scaleable bandwidth, design modification, and device expansion; the individual marks are not shown.]
When to Use FPGAs for DSP
[Figure: data rate (with a 50 MHz system clock), 0 to 50, versus arithmetic operations per sample, 1 to 48. Curves for 1, 2, 3, and 4 DSPs bound the DSP region; the rest of the plot is the FPGA region.]

• High sample rates
– Up to 500 MHz with Virtex 5/6/7
• Low sample rates
– Integrate DSP + system logic in a low-cost DSP using serial sequential algorithm
• Short word lengths
– DA algorithm gets faster with shorter word length
• Lots of filter taps
– FPGA processes all taps in parallel, faster than DSP
• Fast correlators
• Single-chip solution required
• HardWire gate array migration path for high-volume designs
Co-processing with an FPGA
• FPGA co-processors are an extremely cost-effective means of off-loading computationally intensive algorithms from a DSP processor.
• FPGA coprocessor for WiMAX baseband processing
• FPGA coprocessor for high-definition H.264 encoding
Digital Filters
• Digital filters are one of the main elements of DSP and can be implemented using only MAC operations.
• A digital filter performs a filtering function on data by attenuating or reducing bands of frequencies. Examples:
– Remove high-frequency noise from a speech signal
– Remove 50 Hz mains hum from an ECG signal
– Emphasize a particular frequency in a music signal
– Remove low-frequency noise for some sensors
Low Pass Digital Filter
[Figure: an example of the operation of a low-pass filter.]
• The weights W0 to WN-1 must be appropriately chosen.
Digital Filters: Types
• Finite Impulse Response (FIR): non-recursive linear filter (i.e., no feedback present).
• Infinite Impulse Response (IIR): recursive linear filter (i.e., with feedback).
• Adaptive Digital Filter (ADF): a self-learning filter that adapts itself to a desired signal.
• Non-linear filters: filters that perform non-linear operations, e.g., median filter, min/max filters.
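
The practical difference between the FIR and IIR entries is the feedback path. As a rough illustration (not taken from the slides), the Python sketch below contrasts a two-tap FIR average with a first-order IIR smoother; the function names and the coefficient `a` are hypothetical.

```python
def fir_average2(samples):
    """Non-recursive: each output depends only on the current and previous input."""
    out, prev_x = [], 0
    for x in samples:
        out.append(0.5 * x + 0.5 * prev_x)
        prev_x = x
    return out


def iir_smooth(samples, a):
    """Recursive: each output feeds back the previous output, y[n] = a*y[n-1] + (1-a)*x[n]."""
    out, prev_y = [], 0.0
    for x in samples:
        prev_y = a * prev_y + (1 - a) * x
        out.append(prev_y)
    return out


impulse = [1, 0, 0, 0]
fir_out = fir_average2(impulse)      # dies out after two samples
iir_out = iir_smooth(impulse, 0.5)   # keeps decaying, never exactly zero
print(fir_out, iir_out)
```

Feeding both filters a single unit impulse shows where the names come from: the FIR output returns to zero after a finite number of samples, while the IIR output only decays.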
FIR Filters
• A Finite Impulse Response (FIR) filter performs a weighted average (convolution) on a window of N data samples: $Y_j = \sum_{k=0}^{N-1} c_k x_{j-k}$
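
The weighted average over a sliding window can be sketched directly in Python. This is a minimal software model, assuming the window starts out filled with zeros; the name `fir_filter` and the four equal weights are illustrative only.

```python
def fir_filter(coeffs, samples):
    """Weighted average over a sliding window; window[0] holds the newest sample."""
    window = [0] * len(coeffs)
    out = []
    for x in samples:
        window = [x] + window[:-1]   # shift the delay line by one sample
        out.append(sum(c * w for c, w in zip(coeffs, window)))
    return out


# A 4-tap moving average applied to a constant input of 4.
averaged = fir_filter([0.25, 0.25, 0.25, 0.25], [4, 4, 4, 4, 4])
print(averaged)
```

The output ramps up while the window fills and then settles at the input level, which is the expected behaviour of an averaging (low-pass) filter.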
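FIR Filters
[Figure: finite impulse response filter: the input passes through a chain of $z^{-1}$ delay registers; each tap is multiplied by its coefficient ($C_1, C_2, \dots, C_{N-1}, C_N$) and the products are summed by adders.]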
Frequency Response
• The frequency/phase response of a digital filter is found by taking the Discrete Fourier Transform (DFT) of the impulse response.
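
As an illustration of this statement, the Python sketch below evaluates the DFT of a short impulse response using only the standard library. The function name `frequency_response`, the zero-padding to `n_points`, and the 4-tap averaging filter are assumptions made for the example.

```python
import cmath
import math


def frequency_response(impulse_response, n_points):
    """DFT of the zero-padded impulse response; bin k corresponds to k/n_points of the sample rate."""
    h = list(impulse_response) + [0.0] * (n_points - len(impulse_response))
    return [
        sum(h[n] * cmath.exp(-2j * math.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


H = frequency_response([0.25, 0.25, 0.25, 0.25], 8)
magnitudes = [abs(v) for v in H]          # magnitude response
phases = [cmath.phase(v) for v in H]      # phase response in radians
print([round(m, 3) for m in magnitudes])
```

For this averaging filter the magnitude is 1 at bin 0 (DC) and falls to 0 at bin 4 (half the sample rate), which is the low-pass shape one would expect.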
FPGA Implementations
1. Hardware Description Language: VHDL, Verilog
2. Electronic System Level: Handel-C, Vivado HLS (Lab #7), Impulse-C
3. Core Generator (IP Selection)
4. System Generator (Lab #6): MATLAB, Simulink, System Generator
FIR Filter: VHDL Implementation
• Simple VHDL design example of an 8-tap FIR filter.
Hardware Description Languages
Full VHDL/Verilog (RTL code)
• Advantages:
– Portability and efficient implementation
– Complete control of the design implementation and tradeoffs
– Easier to debug and understand code that you own
• Disadvantages:
– Can be time consuming
– Don’t always have control over the synthesis tool
– Need to be familiar with the algorithm and how to write it
Abstraction: Advantages
[Figure: CORE Generator in the HDL design flow. Design steps: HDL, synthesis, implementation, download. Verification steps: behavioral simulation, functional simulation, timing simulation, in-circuit verification. COREGen is used to instantiate optimized IP within the HDL code.]
Xilinx CORE Generator
• List of available IP
• Fully parameterizable
• IP CENTER: http://www.xilinx.com/ipcenter

Xilinx IP Solutions
Key: $ - License Fee, P - Parameterized, S - Project License Available, BOLD – Available in the Xilinx Blockset for the System Generator for DSP

• DSP Functions: $P Reed Solomon; $ 3GPP Turbo Code; $P Viterbi Decoder; $P Convolution Encoder; $P Interleaver/De-interleaver; P LFSR; P 1D DCT; P DA FIR; P MAC; P MAC-based FIR filter; Fixed FFTs (16, 64, 256, 1024 points); P FFT - 32 Point; P Sine Cosine; P Direct Digital Synthesizer; P Cascaded Integrator Comb; P Bit Correlator; P Digital Down Converter
• Memory Functions: P Asynchronous FIFO; P Block Memory modules; P Distributed Memory; P Distributed Mem Enhance; P Sync FIFO (SRL16); P Sync FIFO (Block RAM); P CAM (SRL16)
• Base Functions: P Binary Decoder; P Two's Complement; P Shift Register RAM/FF; P Gate modules; P Multiplexer functions; P Registers, FF & latch based; P Adder/Subtractor; P Accumulator; P Comparator; P Binary Counter
• Math Functions: P Multiplier Generator (Parallel Multiplier, Dyn Constant Coefficient Mult, Serial Sequential Multiplier, Multiplier Enhancements); P Divider; P CORDIC
• PCI: $P PCI 64/66; $PS PCI 32/33; $P PCI-X 64/66
• Networking: 8B/10B Encoder/Decoder; $ POS-PHY L3; $ POS-PHY L4; $ Flexbus 4; $ RapidIO PHY Layer; $S HDLC 1 and 32 channel; $S G.711 PCM Cores; $S ADPCM 32 & 64 channel
Core Generator: Summary
• Advantages
– Can quickly access and generate existing functions
– No need to reinvent the wheel and re-design a block if it meets specifications
– IP is optimized for the specified architecture
• Disadvantages
– IP doesn’t always do exactly what you are looking for
– Need to understand signals and parameters and match them to your specification
– Dealing with a black box, with little information on how the function is implemented
Xilinx System Generator for DSP
• Industry’s first system-level design environment (IDE) for FPGAs
• Simulink library of arithmetic, logic operators, and DSP functions (Xilinx blockset)
• Arithmetic abstraction
• VHDL code generation for most Spartan-based FPGAs and Virtex 4/5/6/7 FPGAs
• Enables Hardware in the Loop co-simulation
MATLAB
• MATLAB™, the most popular system design tool, is a programming language, interpreter, and modeling environment
– Extensive libraries for math functions, signal processing, DSP, communications, and much more
– Visualization: large array of functions to plot and visualize your data and system/design
– Open architecture: software model based on base system and domain-specific plug-ins
System Level Evaluation
• Irrespective of the final implementation technology (GPP, DSP, ASIC, FPGA), if one is creating a product based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment.
• The de facto industry standard for DSP algorithmic verification is MATLAB.

[Figure: original concept → algorithmic verification → handcrafted assembly, handcrafted C/C++, or auto C/C++ generation → compile/assemble → machine code.]
System/Algorithmic level to RTL
• Many DSP design teams commence by performing their system-level evaluation and algorithmic validation in MATLAB using floating-point representation.
• Alternatively, they may first transition the floating-point representation into its fixed-point counterpart at the system level.
• At this point, many design teams move directly to hand-coding fixed-point RTL equivalents of the design in VHDL.

[Figure: (a) original concept → system/algorithmic verification (floating-point) → handcraft Verilog/VHDL RTL (fixed-point) → standard RTL-based simulation and synthesis. (b) The same flow with a system/algorithmic verification (fixed-point) step added before the RTL is hand-coded.]
Simulink
• Simulink™: visual data flow environment for modeling and simulation of dynamical systems
– Fully integrated with the MATLAB engine
– Graphical block editor
– Event-driven simulator
– Models parallelism
– Extensive library of parameterizable functions
• Simulink Blockset: math, sinks, sources
• DSP Blockset: filters, transforms, etc.
• Communications Blockset: modulation, DPCM, etc.
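Traditional Simulink FPGA Flow
[Figure: the system architect performs system verification in Simulink; the FPGA designer works separately in HDL through synthesis, implementation, and download, with functional simulation, timing simulation, and in-circuit verification. A GAP separates the two sides, and equivalence between them must be verified.]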
System Generator
[Figure: MATLAB/Simulink system verification feeds System Generator, which produces HDL (VHDL, IP, testbench, constraints file) for synthesis, implementation, and download, with functional simulation, timing simulation, and in-circuit verification.]
Creating a System Generator Design
• The Xilinx Blockset is listed in the Simulink Library Browser.
• Create a design by dragging and dropping components from the Xilinx Blockset onto your new sheet.
Finding Blocks
• Use the Find feature to search ALL Simulink libraries
• The Xilinx blockset has nine major sections:
– Basic elements: counters, delays
– Communication: error correction blocks
– Control Logic: MCode, Black Box
– Data Types: Convert, Slice
– DSP: FDATool, FFT, FIR
– Index: all Xilinx blocks (a quick way to view all blocks)
– Math: multiply, accumulate, inverter
– Memory: Dual Port RAM, Single Port RAM
– Tools: ModelSim, Resource Estimator
Configure Your Blocks
• Double-click a block, or go to Block Parameters, to view its configurable parameters:
– Arithmetic Type: unsigned or twos complement
– Implement with Xilinx Smart-IP Core (if possible) / Generate Core
– Latency: specify the delay through the block
– Overflow and Quantization: users can saturate or wrap on overflow, and truncate or round on quantization
– Override with Doubles: simulation only
– Precision: full, or the user can define the number of bits and where the binary point is for the block
– Sample Period: can be inherited with a “-1” or must be an integer value
• Note: while all parameters can be simulated, not all are realizable.
Values Can Be Equations
• You can also enter equations in the block parameters, which can aid calculation and your own understanding of the model parameters
• The equations are calculated at the beginning of a simulation
• Useful MATLAB operators
– + add
– - subtract
– * multiply
– / divide
– ^ power
– pi (3.1415926535897…)
– exp(x) exponential ($e^x$)
Important Concept 1: The Numbers Game
• Simulink uses a “double” to represent numbers in a simulation. A double is a 64-bit IEEE 754 floating-point number (1 sign bit, 11 exponent bits, 52 fraction bits)
– Because the binary point can move, a double covers magnitudes up to about 1.8 x 10^308 with roughly 15 to 17 significant decimal digits: a wide, desirable range, but not efficient or realistic for FPGAs
• The Xilinx Blockset uses n-bit fixed-point numbers (twos complement optional)
Design Hint: Always try to maximize the dynamic range of the design while using only the required number of bits
[Figure: the 16-bit word 101.1011110100101 in format Fix_16_13. The three integer bits carry weights -2^2, 2^1, 2^0 and the thirteen fraction bits carry weights 2^-1 down to 2^-13. Value = -2.261108…]

Format = Sign_Width_Binary point position counted from the LSB
(Sign: Fix = signed value, UFix = unsigned value)

Thus, a conversion is required when Xilinx blocks communicate with Simulink blocks (Xilinx blockset → MATLAB I/O → Gateway In/Out).
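
The Fix_16_13 example can be checked numerically. The Python sketch below interprets a bit string as a fixed-point value; the helper name `fix_to_float` and its arguments are hypothetical, and the bit string is the 16-bit pattern from the example above.

```python
def fix_to_float(bits, frac_bits, signed=True):
    """Interpret a bit string as a fixed-point number with frac_bits fractional bits."""
    raw = int(bits, 2)
    if signed and bits[0] == "1":
        raw -= 1 << len(bits)         # twos complement: the MSB carries a negative weight
    return raw / (1 << frac_bits)     # move the binary point frac_bits places from the LSB


# Fix_16_13: 3 integer bits (including sign) and 13 fraction bits.
value = fix_to_float("1011011110100101", 13)
print(value)
```

Read as an unsigned word (UFix_16_13), the same bits would give a positive value, which is why the Fix/UFix prefix is part of the format name.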
What About All Those Other Bits?
• The Gateway In and Out blocks support parameters to control the conversion from double precision to N-bit fixed-point precision

[Figure: converting a DOUBLE to FIX_12_9. The 12-bit word keeps bit weights -2^2 down to 2^-9. Bits below 2^-9 (2^-10 to 2^-13 in the example) are lost to QUANTIZATION, handled by Truncate or Round. Bits above the sign position (2^3, 2^4, 2^5, -2^6 in the example) are lost to OVERFLOW, handled by Wrap, Saturate, or Flag Error.]
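
A software model of this conversion helps when choosing the options. The Python sketch below is an approximation under stated assumptions: truncation drops the low bits (rounds toward negative infinity), rounding adds half an LSB before truncating, and the function name `to_fixed` and its option strings are hypothetical, not the actual Gateway In parameters.

```python
import math


def to_fixed(value, width, frac_bits, quantization="truncate", overflow="saturate"):
    """Convert a float to a signed fixed-point value with the given width and fraction bits."""
    scaled = value * (1 << frac_bits)
    # Quantization: decide what happens to bits below the LSB.
    raw = math.floor(scaled + 0.5) if quantization == "round" else math.floor(scaled)
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    # Overflow: decide what happens when the value does not fit in width bits.
    if raw < lo or raw > hi:
        if overflow == "saturate":
            raw = max(lo, min(hi, raw))
        elif overflow == "wrap":
            raw = (raw - lo) % (1 << width) + lo
        else:
            raise OverflowError("value does not fit in the fixed-point format")
    return raw / (1 << frac_bits)


saturated = to_fixed(5.0, 12, 9, overflow="saturate")   # clamps to the largest Fix_12_9 value
wrapped = to_fixed(5.0, 12, 9, overflow="wrap")         # wraps around by 8
truncated = to_fixed(-2.2611083984375, 12, 9)           # the Fix_16_13 example cut to Fix_12_9
print(saturated, wrapped, truncated)
```

Saturation keeps the result close to the true value at the cost of extra logic, while wrapping is free in hardware but can flip the sign, so the choice depends on whether overflow can occur in normal operation.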
Creating a System Generator Design
[Figure: example model annotated with: Simulink sources; IO blocks used as the interface between the Xilinx blockset and other Simulink blocks; SysGen blocks realizable in hardware; Simulink sinks and library functions.]
Using the Scope
• Click Properties to change the number of axes displayed and the time range value (X-axis)
• Use Data history to control how many values are stored and displayed on the scope
• Click Autoscale to quickly let the tools configure the display to the correct axis values
• Right-click on the Y-axis to set its value
Design & Simulate in Simulink
• Simulate the design by pushing “play.”
• Go to “Simulation Parameters” under the “Simulation” menu to control the length of simulations
Resource Estimator
• The block provides fast estimates of the FPGA resources required to implement the subsystem
• Most of the blocks in the System Generator Blockset carry resource information:
– LUTs
– FFs
– BRAM
– Embedded multipliers
– 3-state buffers
– I/Os
• Three types of estimation
– Estimate Area: computes resources for the current level and all sub-levels
– Quick Sum: uses the resources stored in each block directly and sums them up (no sub-level functions are invoked)
– Post-Map Area: opens a file browser and lets the user select a map report file. The design should have been generated and gone through the synthesis, translate, and mapping phases.
The Black Box
Use the Black Box when:
• You need a function that cannot be created with the Xilinx Blockset
• You already have a piece of VHDL you wish to use for a section of the design
The block creates a placeholder for the black box in the generated VHDL. Use the Black Box parameters to control the VHDL placeholder’s features.
Generate the VHDL Code
• Once complete, double-click the System Generator token
• Select the target device
• Select to generate the testbench
• Set the desired system clock period
• Generate the VHDL
Hardware-in-the-Loop Reduces Design Time & Cost
• Configure any development board for hardware-in-the-loop using a JTAG header in < 20 minutes
– Automatically create the FPGA bit-stream from Simulink
– Transparent use of FPGA implementation tools
– Accelerate and verify the Simulink design using FPGA hardware
– Mirrors traditional DSP processor design flows
• Combine with the black box to simulate HDL & EDIF
Create Bit-stream
• Step 1: Select the target H/W platform
• Step 2: Generate the bit-stream

Co-Simulate in Hardware
• Step 3 (contd.): A post-generation script creates a new library containing a parameterized run-time co-simulation block.
• Step 4: Copy the co-simulation run-time block into the original model.
• Step 5: Simulate for verification
Hardware in the Loop Performance Results

Application                              Software simulation time (s)   Hardware simulation time (s)   Speed-up
Single Step Clock Mode (bit and cycle accurate)
5 x 5 Image Filter                       170                            4                              43X
Cordic Arc Tangent                       187                            27                             7X
Additive White Gaussian Noise Channel    600                            80                             7.5X
QAM Demodulator + Extension              1203                           18                             67X
Free Running Clock Mode
Image Filtering                          676                            6                              112X

In free running clock mode, a free-running clock is provided to the design, so the hardware no longer runs in lockstep with the software. The test is started, and after some time a 'done' flag is set so that the results can be read from the FPGA and displayed in Simulink. Using this hardware co-simulation method, designers can achieve up to 6 orders of magnitude performance enhancement over the original software simulation.
DSP System Generator: Summary
• System Generator for DSP
– Advantages
• Ability to simulate the design at a system level
• High level of abstraction: very attractive for FPGA novices
• Optimize for area, speed, or a combination
• Estimate resources easily
• Hardware co-simulation (FPGA in the loop)
• Test-bench and golden data written automatically
– Disadvantages
• Cost of abstraction: doesn’t always give the best result from an area usage point of view
• Only as good as the IP support
FPGAs versus DSP
• FPGAs can outperform DSP processors on certain DSP tasks: computation-intensive, highly parallelizable tasks
• DSP processors have the advantage in development infrastructure, time-to-market, and developer familiarity
– DSP processors are still easier to use
– Many engineers possess DSP processor development skills
– Ultimate speed is not always the first priority
• A combination of FPGA and DSP processor is an excellent solution if performance requirements cannot be met by the processor alone
• The “best” architecture depends on the requirements of the application
Problem with this flow?
• There is a significant conceptual and representational divide between the system architects working at the system/algorithmic level and the hardware design engineers working with RTL representations in VHDL.
• Manual translation from one to the other is time consuming and prone to error.
• Any change made to the original specification during the course of the project must be translated to RTL again, which is painful and time consuming.

[Figure: the hand-coded flows (a) and (b): original concept → system/algorithmic verification (floating-point), optionally followed by system/algorithmic verification (fixed-point) → handcraft Verilog/VHDL RTL (fixed-point) → standard RTL-based simulation and synthesis.]
Direct RTL Generation
• Some system/algorithmic level design environments offer direct VHDL code generation.
• An example is the environment from AccelChip Inc., which accepts floating-point MATLAB M-files, outputs their fixed-point equivalents for verification, and then uses these new M-files to auto-generate RTL.

[Figure: (a) within one system/algorithmic environment: original concept → system/algorithmic verification (floating-point) → system/algorithmic verification (fixed-point) → auto-generate Verilog/VHDL RTL (fixed-point) → standard RTL-based simulation and synthesis. (b) system/algorithmic verification (floating-point) in the system/algorithmic environment, followed by auto-interactive quantization (fixed-point) and auto-generation of Verilog/VHDL RTL (fixed-point) in a third-party environment.]
Transposed FIR with Multiplier Block

DSP Processors vs. FPGAs
High-Speed DSP Processor
• 1-8 multipliers
• Needs looping for more than 8 multiplications
• Needs multiple clock cycles because of serial computation
• A 200-tap FIR filter would need 25+ clock cycles per sample with an 8-MAC-unit processor

High Level of Parallel Processing in FPGA
• Can implement hundreds of MAC functions in an FPGA
• Parallel implementation allows for faster throughput
– A 200-tap FIR filter would need 1 clock cycle per sample

[Figure: a small group of MAC units for the DSP processor versus a large array of MAC units in the FPGA.]
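
The cycle counts quoted above follow from dividing the number of taps by the number of MAC units that work in parallel. A back-of-the-envelope Python sketch, ignoring loop overhead and memory access (which is why the slide says "25+"); the function names are hypothetical:

```python
import math


def cycles_per_sample(num_taps, mac_units):
    """Lower bound on clock cycles needed to produce one FIR output sample."""
    return math.ceil(num_taps / mac_units)


def max_sample_rate_hz(clock_hz, num_taps, mac_units):
    """Upper bound on the sample rate for a given clock frequency."""
    return clock_hz / cycles_per_sample(num_taps, mac_units)


dsp_cycles = cycles_per_sample(200, 8)      # 8-MAC DSP processor
fpga_cycles = cycles_per_sample(200, 200)   # one MAC per tap in the FPGA
print(dsp_cycles, fpga_cycles)
```

The same formula covers intermediate cases, for example an FPGA design that time-shares 50 MAC blocks across 200 taps.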
Multiply Accumulate: Single Engine
• Sequential processing limits data throughput:
– Time-shared MAC unit
– Data width is fixed!
– High clock frequency creates a difficult system challenge
• 256-tap FIR filter
– 256 multiply and accumulate (MAC) operations per data sample
– One output every 256 clock cycles

[Figure: Data In → Reg → MAC unit → Data Out, with the loop algorithm run 256 times.]
Filters: Applications

Impulse Response
• The impulse response of an FIR filter is the output of the filter when a single unit impulse is applied at the input. For an FIR filter, the output samples are the filter coefficients in order.
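
This can be verified with a small Python model of the delay line: feed in a single 1 followed by zeros and the coefficients appear at the output one per sample. The function name `impulse_response` and the coefficient values are illustrative assumptions.

```python
from collections import deque


def impulse_response(coeffs, n_samples):
    """Drive an FIR delay line with a unit impulse and record the output."""
    delay_line = deque([0] * len(coeffs), maxlen=len(coeffs))
    out = []
    for n in range(n_samples):
        delay_line.appendleft(1 if n == 0 else 0)   # unit impulse at n = 0
        out.append(sum(c * x for c, x in zip(coeffs, delay_line)))
    return out


coeffs = [0.125, 0.25, 0.5, 1.0]
response = impulse_response(coeffs, 6)
print(response)
```

After the last coefficient has appeared the output stays at zero, which is the "finite" in finite impulse response.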
Solution: Building a MAC with System Generator
• MAC using embedded multiplier
– Slice count: 22 slices, 1 embedded multiplier
– Performance: ~126 MHz (2v1000 -4)
• MAC using slice-based multiplier
– Slice count: 70 slices
– Performance: ~130 MHz (2v1000 -4)

[Figure: MAC block with inputs $a_i$ and $b_i$ and accumulated output $c$.]
FIR: Cont … VHDL Implementation
• For convenience, the selected coefficients are powers of 2.
• To operate, the filter must have eight register stages, each of which is eight bits wide.
• Therefore, for the register or memory portion of the design, 64 flip-flops are required.
• At each clock cycle, each coefficient is multiplied by the eight-bit value in the appropriate register.
• Due to the selection of “powers of two” coefficients, multiplication is achieved by a simple shifting operation.
• The coefficient values may be stored as constants.
• The coefficients used in the example are: $a_0 = 2^{-3}$, $a_1 = 2^{-2}$, $a_2 = 2^{-1}$, $a_3 = 1$, $a_4 = 1$, $a_5 = 2^{-1}$, $a_6 = 2^{-2}$, $a_7 = 2^{-3}$
VHDL Description of FIR Filter

library ieee;
use ieee.std_logic_1164.all;

entity FIR1 is
  port (clk : in  std_logic;
        x   : in  integer range 0 to 255;
        -- worst-case output is 952 (all registers at 255), so 0 to 511 is too small
        y   : out integer range 0 to 1023);
end entity FIR1;

architecture arch1 of FIR1 is
begin
  process (clk)
    -- eight register stages, each eight bits wide
    type RegType is array (7 downto 0) of integer range 0 to 255;
    variable Reg : RegType := (others => 0);
  begin
    if (clk'event and clk = '1') then
      -- multiply/accumulate (MAC) operation; dividing by a power of two is a shift
      y <= Reg(0)/8 + Reg(1)/4 + Reg(2)/2 + Reg(3) +
           Reg(4) + Reg(5)/2 + Reg(6)/4 + Reg(7)/8;
      -- update register values by shifting
      Reg(0) := Reg(1);
      Reg(1) := Reg(2);
      Reg(2) := Reg(3);
      Reg(3) := Reg(4);
      Reg(4) := Reg(5);
      Reg(5) := Reg(6);
      Reg(6) := Reg(7);
      Reg(7) := x;
    end if;
  end process;
end architecture arch1;
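
A software reference model is a convenient way to check such a design before simulation. The Python sketch below mirrors the VHDL process cycle by cycle, using right shifts for the power-of-two coefficients; it is an illustrative model with hypothetical names (`fir1_step`, `simulate`), not part of the original example.

```python
def fir1_step(reg, x):
    """One clock edge: compute y from the current registers, then shift in x."""
    # Dividing by 8, 4, 2 is done with right shifts, as in the hardware.
    y = ((reg[0] >> 3) + (reg[1] >> 2) + (reg[2] >> 1) + reg[3]
         + reg[4] + (reg[5] >> 1) + (reg[6] >> 2) + (reg[7] >> 3))
    new_reg = reg[1:] + [x]      # Reg(0) := Reg(1); ...; Reg(7) := x
    return y, new_reg


def simulate(samples):
    """Run the filter over a list of 8-bit input samples, starting from cleared registers."""
    reg = [0] * 8
    outputs = []
    for x in samples:
        y, reg = fir1_step(reg, x)
        outputs.append(y)
    return outputs


outputs = simulate([255] * 9)    # hold the input at full scale
print(outputs)
```

Holding the input at 255 drives the output to its worst-case value once all eight registers are full, which gives the width needed for the `y` port.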