Accelerating DSP Algorithms Using FPGAs

Page 1: Accelerating DSP Algorithms Using FPGAs

Gallagher 1 P188/MAPLD2004

Accelerating DSP Algorithms Using FPGAs

Sean Gallagher

DSP Specialist

Xilinx Inc

Page 2: Accelerating DSP Algorithms Using FPGAs

Why DSP in FPGAs

• Availability of fast analog-to-digital converters (ADCs)

– Enables digital methods for functions traditionally done in RF components

• Massive parallel processing

– FPGAs may have several hundred embedded multipliers on-chip

– One FPGA can replace many DSP processors

Page 3: Accelerating DSP Algorithms Using FPGAs

Architectural Considerations

• FPGA architectures are vendor specific

– Unlike ASICs, no two are alike

• FPGA vendors develop distinct competencies

– In device architecture design

– In intellectual property (DSP functions, bus controllers, etc.)

– In design tool flows

• Vendor-independent HDL can be written, but it usually achieves mediocre results in clock speed and design size

Page 4: Accelerating DSP Algorithms Using FPGAs

FPGAs Are Massive Parallel Computing Machines

[Figure: four channels (ch1 to ch4) of 20 MHz samples, each with its own LPF, compared with a single multi-channel filter operating on 80 MHz samples]

• FPGAs are ideally suited for multi-channel DSP designs

– Many low sample rate channels can be multiplexed (e.g. TDM) and processed in the FPGA at a high rate

– Interpolation (using zeros) can also drive sample rates higher

Page 5: Accelerating DSP Algorithms Using FPGAs

FPGAs Allow Space/Speed Trade-offs

Q = (A × B) + (C × D) + (E × F) + (G × H) can be implemented in parallel

[Figure: inputs A to H feed four multipliers whose products are summed by adders to produce Q]

But is this the only way in the FPGA?

Page 6: Accelerating DSP Algorithms Using FPGAs

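[Figure: three implementations of the sum of products, labeled Parallel, Semi-Parallel, and Serial; the semi-parallel and serial versions feed an accumulator register (D Q)]

Customize Architectures to Suit your Ideal Algorithms

FPGAs allow Area (cost) / Performance tradeoffs

Optimized for? Speed ↔ Area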
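The trade-off between the three structures can be sketched in software. This is only an illustration, not the slide's hardware: `sum_of_products` and its `n_mult` parameter are hypothetical names, with `n_mult` standing in for the number of multipliers instantiated. Fewer multipliers means less area but more clock cycles per result.

```python
def sum_of_products(pairs, n_mult):
    """Accumulate a*b over all pairs, using n_mult multipliers per clock cycle.

    Returns (result, cycles). n_mult is a hypothetical knob that models how
    many hardware multipliers are available.
    """
    acc = 0      # models the accumulator register (D Q in the figure)
    cycles = 0
    for i in range(0, len(pairs), n_mult):
        # One clock cycle: up to n_mult products are formed and accumulated.
        acc += sum(a * b for a, b in pairs[i:i + n_mult])
        cycles += 1
    return acc, cycles


pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]   # (A, B), (C, D), (E, F), (G, H)
for name, n_mult in [("parallel", 4), ("semi-parallel", 2), ("serial", 1)]:
    q, cycles = sum_of_products(pairs, n_mult)
    print(f"{name}: Q={q}, multipliers={n_mult}, cycles={cycles}")
```

All three variants return the same Q; only the multiplier count and the cycle count differ.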

Page 7: Accelerating DSP Algorithms Using FPGAs

Exploiting The Xilinx Architecture For DSP Functions

• Memory blocks that can be configured as ROMs, dual-port RAMs, FIFOs

• Embedded 18x18 multipliers that can be ganged to form a 35x35-bit multiply

• SRL16 shift registers

– A patented technique for turning the 4-input lookup table (2 per slice) into an addressable shift register
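A behavioral sketch of an addressable shift register may make the SRL16 idea concrete. The class below is a hypothetical software model (the names `SRL16`, `clock`, and `q` are illustrative, not a Xilinx API): one bit shifts in per clock, and a 4-bit address selects which of the 16 stored bits is read out.

```python
class SRL16:
    """Hypothetical behavioral model of a 16-deep, 1-bit addressable shift register."""

    DEPTH = 16

    def __init__(self):
        self.bits = [0] * self.DEPTH

    def clock(self, d, ce=1):
        """Shift in bit d on a clock edge when clock enable (ce) is asserted."""
        if ce:
            self.bits = [d] + self.bits[:-1]

    def q(self, addr):
        """Read the bit at the given address (0..15) without disturbing the contents."""
        return self.bits[addr]


srl = SRL16()
data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
delayed = []
for d in data:
    srl.clock(d)
    # In this model, address 3 holds the bit that was shifted in 3 clocks earlier.
    delayed.append(srl.q(3))
print(delayed)
```

Changing the address changes the delay without adding registers, which is what lets a single lookup table stand in for a delay line of up to 16 stages.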

Page 8: Accelerating DSP Algorithms Using FPGAs

Using SRL16E to increase Compute Density

[Figure: FIR filter structures with coefficients k0 to k3 and a ‘0’ input to the first adder; labels show 9-bit and 18-bit buses, 20 MHz sample rates, and multiple channels]

SRL16E takes the same area as one LUT.

It can be used for up to 16 channels.
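One way to see why a deeper delay between taps serves several channels: if the samples of `n_ch` channels are interleaved into one stream and each tap delay is made `n_ch` samples deep, a single filter produces the same outputs as `n_ch` separate filters. The NumPy sketch below checks this numerically; the function name `tdm_fir`, the coefficient values, and the test data are assumptions made for illustration only.

```python
import numpy as np


def tdm_fir(stream, k, n_ch):
    """Filter an interleaved multi-channel stream with one FIR.

    Each tap delay is n_ch samples deep (the depth one SRL16E can provide,
    up to 16), modeled here by spacing the coefficients n_ch apart.
    """
    k_tdm = np.zeros((len(k) - 1) * n_ch + 1)
    k_tdm[::n_ch] = k
    return np.convolve(stream, k_tdm)[:len(stream)]


rng = np.random.default_rng(0)
n_ch, n_samp = 4, 32
k = np.array([0.1, 0.4, 0.4, 0.1])            # illustrative coefficients k0..k3
chans = rng.standard_normal((n_ch, n_samp))    # four independent channels

# Interleave: ch1[0], ch2[0], ch3[0], ch4[0], ch1[1], ...
stream = chans.T.reshape(-1)

# One shared filter on the fast stream, then de-interleave the outputs.
y = tdm_fir(stream, k, n_ch).reshape(n_samp, n_ch).T

# Reference: a separate filter per channel at the slow rate.
ref = np.array([np.convolve(c, k)[:n_samp] for c in chans])
print(np.allclose(y, ref))
```

The shared filter runs at `n_ch` times the per-channel sample rate, so the multipliers and adders are reused rather than replicated.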

Page 9: Accelerating DSP Algorithms Using FPGAs

Xilinx System Generator For DSP

– System Generator is a block set that resides in the Simulink/MATLAB environment.

– System Generator blocks are bit-true and cycle-true models of Xilinx’s DSP intellectual property (IP) cores.

– Hardware DSP design capture is significantly accelerated by automatic code generation from Simulink

Page 10: Accelerating DSP Algorithms Using FPGAs

Algorithm Instantiation Considerations

• There are cases where following a textbook approach does not necessarily translate into an efficient instantiation

• Manipulating the algorithm to exploit features of the architecture can lead to much more efficient instantiations

• Modifying a textbook algorithm includes changing how the math is executed as well as over-clocking structures so that they can be time division multiplexed

Page 11: Accelerating DSP Algorithms Using FPGAs

Example 1: Digital Down Conversion

• In digital down conversion we need to filter before we decimate to prevent aliasing

• These filters can get rather large because the transition band is rather narrow in relation to the sample rate

• A textbook solution is to reduce the sample rate in several steps

Page 12: Accelerating DSP Algorithms Using FPGAs

Digital Down Conversion

• The following three slides show three different filter designs for the down conversion of a 0.625 MHz band of interest that is centered at 20 MHz and sampled at 61.44 MHz.

– The decimation rate is 25

– The final sample rate will be 61.44/25 = 2.4576 MHz

• The next slide shows the filter design needed if decimating by 25 in one step

– The total coefficient count is 184

• The two slides after the next show the two filters necessary to decimate in steps, decimating by 5 in each step

– The total coefficient count is 11 + 43 = 54
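The rate arithmetic above can be checked with a few lines of Python. The sketch also estimates multiply-accumulate (MAC) rates under two assumptions that are not stated on the slide: each stage computes only the outputs it keeps, and the 11-tap filter is the first stage. The function name `mac_rate` is hypothetical.

```python
FS_IN = 61.44e6   # input sample rate in Hz, from the slide


def mac_rate(fs, stages):
    """Return (output sample rate, total MAC/s) for a chain of decimating FIRs.

    stages is a list of (decimation factor, tap count). Assumes each stage
    computes only the output samples it keeps.
    """
    total = 0.0
    for dec, taps in stages:
        fs = fs / dec          # sample rate at the output of this stage
        total += taps * fs     # one MAC per tap per output sample
    return fs, total


one_step = mac_rate(FS_IN, [(25, 184)])
two_step = mac_rate(FS_IN, [(5, 11), (5, 43)])   # stage order is an assumption

print(f"one step : {one_step[0] / 1e6:.4f} MHz out, {one_step[1] / 1e6:.1f} M MAC/s")
print(f"two steps: {two_step[0] / 1e6:.4f} MHz out, {two_step[1] / 1e6:.1f} M MAC/s")
```

Both chains arrive at the same 2.4576 MHz output rate; the stepped design needs fewer coefficients and, under these assumptions, fewer multiplies per second.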

Page 13: Accelerating DSP Algorithms Using FPGAs

Page 14: Accelerating DSP Algorithms Using FPGAs

Page 15: Accelerating DSP Algorithms Using FPGAs

Page 16: Accelerating DSP Algorithms Using FPGAs

Digital Down Conversion (DDC) Implementation

• The following design shows how the DDC function would be implemented using the FIR filter core from the Xilinx Library

• The coefficients are automatically loaded into the filter cores

• The design has been compiled and was found to use about 6000 logic slices

• The FIR filter core is a legacy core and is built as an optimized lookup table of coefficients

Page 17: Accelerating DSP Algorithms Using FPGAs

Digital Down Conversion Implementation

Page 18: Accelerating DSP Algorithms Using FPGAs

DDC – Another Way

• While we were able to exploit the math of DSP to reduce our coefficient count, we did not necessarily exploit the Xilinx architecture.

• The next design implements the 184-coefficient filter but is significantly smaller in instantiation size than the previous design

• This design exploits the memory, embedded multipliers, and SRL16s

Page 19: Accelerating DSP Algorithms Using FPGAs

Page 20: Accelerating DSP Algorithms Using FPGAs

Multiplexing I&Q multiplication so that just one filter is needed instead of two

Time Division Multiplexed Input

Page 21: Accelerating DSP Algorithms Using FPGAs

Efficient Shift Registers via SRL16s

The delay line would require 16 x 50 x 7 = 5600 registers, which at two registers per slice would be 2800 logic slices.

Use of SRL16s reduces the slice count to less than 700
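The register arithmetic can be reproduced as follows. The slide does not say what the three factors mean; the sketch assumes they are a 16-bit word width, a 50-sample delay, and 7 delay lines, and the helper name `delay_line_slices` is hypothetical. The SRL16 estimate covers the delay line only, whereas the figure of less than 700 slices refers to the whole design.

```python
from math import ceil

REGS_PER_SLICE = 2   # a Virtex-II slice has two flip-flops
LUTS_PER_SLICE = 2   # and two 4-input LUTs, each usable as an SRL16
SRL_DEPTH = 16       # maximum delay one SRL16 provides


def delay_line_slices(width, depth, lines):
    """Estimate the cost of delay lines built from registers versus SRL16s.

    Assumes width bits per word, depth samples of delay, and that many lines.
    Returns (registers, register slices, SRL16 LUTs, SRL16 slices).
    """
    regs = width * depth * lines
    reg_slices = regs // REGS_PER_SLICE
    # Each bit of each delay line needs ceil(depth / 16) cascaded SRL16s.
    srl_luts = width * lines * ceil(depth / SRL_DEPTH)
    srl_slices = ceil(srl_luts / LUTS_PER_SLICE)
    return regs, reg_slices, srl_luts, srl_slices


print(delay_line_slices(16, 50, 7))
```

Under these assumptions the register-based delay line matches the 5600 registers and 2800 slices quoted above, while the SRL16 version needs only a few hundred slices.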

Page 22: Accelerating DSP Algorithms Using FPGAs

Clock Based Demuxing And Automatic Pipeline Balancing

Down sample block grabs last sample in a frame

Down sample block grabs next sample in a frame

Delay block “slides” the frame

Balancing latencies is a common requirement in DSP designs. The Sync block uses SRL16s (very efficient) to automatically balance pipeline delays

Page 23: Accelerating DSP Algorithms Using FPGAs

Notes on Previous Design

• One filter structure is used by clocking the filter at twice the rate of the incoming data

• The coefficients are stored in memory, 25 per ROM. There are 200 coefficients, but this approach allows storage of many more

• The delay between taps is built using SRL16s. The delay line alone would have taken 2800 slices without SRL16s; instead, the entire design is less than 700 slices
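A software sketch of this kind of structure, assuming a polyphase decimate-by-25 filter with the 184 coefficients zero-padded to 200 and split across 8 ROMs of 25 entries each. The function name, the random stand-in coefficients, and the test signal are illustrative assumptions; I/Q multiplexing is left out for brevity.

```python
import numpy as np


def polyphase_decimate(x, h, M):
    """Decimate by M using T multipliers, each fed by an M-entry coefficient ROM."""
    T = -(-len(h) // M)               # number of taps/ROMs (8 for 184 coefficients, M = 25)
    roms = np.zeros(T * M)
    roms[:len(h)] = h                 # zero-pad, e.g. 184 -> 200 coefficients
    roms = roms.reshape(T, M)         # roms[q, p] = h[q*M + p]
    n_out = len(x) // M
    # Leading zeros stand in for an initially empty delay line.
    xpad = np.concatenate([np.zeros(T * M), x])
    y = np.zeros(n_out)
    for m in range(n_out):
        for q in range(T):
            base = T * M + (m - q) * M
            seg = xpad[base - np.arange(M)]   # x[(m-q)*M - p] for p = 0..M-1
            # Multiplier q accumulates M products, one ROM entry per clock.
            y[m] += np.dot(roms[q], seg)
    return y


rng = np.random.default_rng(1)
h = rng.standard_normal(184)          # stand-in for the 184 designed coefficients
x = rng.standard_normal(500)
y = polyphase_decimate(x, h, 25)

# Reference: filter at the full rate, then keep every 25th sample.
ref = np.convolve(x, h)[:len(x)][::25][:len(y)]
print(np.allclose(y, ref))
```

Successive taps see data that is 25 samples apart, which is the role the SRL16 delays play between multipliers; only the outputs that survive decimation are ever computed.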

Page 24: Accelerating DSP Algorithms Using FPGAs

Channelizer Design

• The following design is a 64-channel channelizer based on the technique known as a polyphase decimation filter with a DFT bank

• The design basebands and decimates 64 channels simultaneously

• The polyphase decimation is the same structure as the previous design, hence very efficient device utilization.

• This filter structure uses the on-chip RAM blocks of the Xilinx device to store the coefficients

• This technique requires a tapped shift register that requires 6272 registers (3136 slices). However, Xilinx’s patented ability to turn the logic look-up table into a 16-bit shift register reduces this requirement by more than an order of magnitude. The whole design is less than 1700 slices.

• The DFT is implemented with a streaming FFT core. The streaming mode allows the FFT to keep up with the data rate

• Individual channels out of the FFT are demuxed using the implied clocking technique seen in the previous design

Page 25: Accelerating DSP Algorithms Using FPGAs

512 coefficients are stored in on-chip block RAMs

64-point FFT set to streaming mode

Page 26: Accelerating DSP Algorithms Using FPGAs

Filter coefficients are stored in on-chip block RAMs. A new phase of the 64-phase polyphase filter is rotated into the multipliers on every clock cycle. There are 64 phases x 8 taps = 512 coefficients
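A minimal NumPy sketch of a polyphase channelizer with 64 phases and 8 taps per phase, checked against the direct approach of mixing, filtering, and decimating each channel separately. The prototype filter (a Hamming-windowed sinc), the mixer sign convention, and all names are assumptions made for illustration. With this convention the DFT bank comes out as an inverse FFT; a different sign or sample ordering would give a forward FFT.

```python
import numpy as np


def channelize(x, h, M):
    """Polyphase filter bank: one M-phase filter followed by an M-point DFT."""
    T = len(h) // M                   # taps per phase (8 here)
    hp = h.reshape(T, M)              # hp[q, p] = h[q*M + p], 64 x 8 = 512 coefficients
    n_out = len(x) // M
    # Leading zeros stand in for an initially empty delay line.
    xpad = np.concatenate([np.zeros(T * M, dtype=complex), x])
    out = np.empty((n_out, M), dtype=complex)
    for m in range(n_out):
        u = np.zeros(M, dtype=complex)
        for q in range(T):
            base = T * M + (m - q) * M
            u += hp[q] * xpad[base - np.arange(M)]   # per-phase partial sums
        out[m] = M * np.fft.ifft(u)   # DFT across the M phases gives all channels
    return out


def reference(x, h, M):
    """Direct method: mix channel k to baseband, low-pass filter, decimate by M."""
    n = np.arange(len(x))
    n_out = len(x) // M
    cols = []
    for k in range(M):
        mixed = x * np.exp(-2j * np.pi * k * n / M)
        cols.append(np.convolve(mixed, h)[:len(x)][::M][:n_out])
    return np.array(cols).T


M, T = 64, 8
L = M * T
n = np.arange(L)
h = np.sinc((n - (L - 1) / 2) / M) * np.hamming(L) / M   # hypothetical prototype filter

rng = np.random.default_rng(2)
x = rng.standard_normal(M * 16) + 1j * rng.standard_normal(M * 16)

y = channelize(x, h, M)
print(np.allclose(y, reference(x, h, M)))
```

The direct method runs 64 full-length filters, while the polyphase form runs one filter's worth of multiplies per output frame plus one 64-point transform, which is the source of the efficiency described in the slides.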

Page 27: Accelerating DSP Algorithms Using FPGAs

Page 28: Accelerating DSP Algorithms Using FPGAs

Conclusion

• Efficient FPGA instantiation of DSP algorithms requires exploitation of the FPGA vendor’s architecture. Xilinx’s Virtex-II architecture is especially amenable to systolic computation structures

• FPGA architectures may present non-obvious instantiation choices that are more efficient than a typical textbook approach

• Algorithms can and should be modified for parallelized data flow instantiation.
