Computing with FPGAs
Peter Škoda
Division of Electronics
Laboratories and groups:
Laboratory for Information Systems
Laboratory for Stochastic Signals and Processes Research (LISSP)
Computational biology and bioinformatics group
Research:
Intelligent data and signal analysis techniques
Knowledge representations for information systems
Development of advanced measurement systems and signal processing techniques with applications in biomedicine and bioinformatics
DEL and CIR (Centre for Informatics and Computing) have recently proposed the establishment of a Scientific Computing and Information Processing Institute (SCIP)
Laboratory for Stochastic Signals and Processes Research
Research:
High-resolution measurement in the time and amplitude domains
Methods for processing and compressing huge data structures in computational linguistics and bioinformatics
Methods for analysis of time series applying theory of stochastic processes, chaotic and fractal signals and nonlinear dynamics
New programmable architectures and advanced features based on FPGA embedded systems design
Research and development projects related to PLD/FPGA at DEL and CIR:
PLD Development and Programming System, CPM Operating System, 1988.
R&D of Optoelectronic based laser simulators, 1993.
Real Life Data Measurement and Characterization, long-term scientific project (Ministry of Science, Education and Sport), (2007-).
Reconfigurable embedded systems based assistive applications for elderly people, Croatian-Hungarian Intergovernmental S&T Programme, (2009-2011).
Reliability of programmable logic devices in industrial embedded systems, R&D project with the KONČAR Electrical Engineering INSTITUTE, (2007-2009).
Quantum Random Number Generator, World Bank Croatia TAL2 project (2004-2006), (with DEP).
Motivation
Perpetual issue: demand for computing power keeps increasing
Multi-core CPUs, multi-processor systems, computer clusters
Heterogeneous Computing
Use of different kinds of processing units in a single computing system – CPUs, DSPs, GPUs, custom accelerator units
Most common today: CPU+GPU, CPU+FPGA
FPGAs in computing – used to implement custom
accelerator units
FPGA – Field Programmable Gate Array
User-programmable digital
integrated circuit
Building elements:
Logic blocks
Input/output blocks
Programmable interconnect
Specialized memory,
arithmetic and
communication blocks
Logic Block
Implements general
combinational and
sequential logic
Look-Up Tables (LUT) –
combinational functions
Flip-Flops (FF) – sequential
functions
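The logic block described above can be modelled in a few lines of Python. This is only a behavioural sketch: the 4-input LUT size, the class names `LUT4` and `FlipFlop`, and the helper `build_truth_table` are illustrative assumptions, not taken from the slides or from any vendor tool. It shows that a LUT is a small truth table whose contents select the combinational function, and that a flip-flop holds one bit between clock edges.

```python
class LUT4:
    """Hypothetical 4-input look-up table: a 16-entry truth table indexed by the inputs."""

    def __init__(self, truth_table):
        # truth_table is a 16-bit integer; bit k holds the output for input pattern k.
        self.truth_table = truth_table & 0xFFFF

    def evaluate(self, a, b, c, d):
        # Pack the four input bits into a table index (a is the least significant bit).
        index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3)
        return (self.truth_table >> index) & 1


class FlipFlop:
    """D flip-flop model: stores its input on each clock edge and holds it afterwards."""

    def __init__(self):
        self.q = 0

    def clock(self, d):
        self.q = d & 1
        return self.q


def build_truth_table(func):
    """Derive the 16-bit LUT configuration for any 4-input Boolean function."""
    table = 0
    for index in range(16):
        bits = [(index >> k) & 1 for k in range(4)]
        if func(*bits):
            table |= 1 << index
    return table


# Configure the LUT as a 4-input XOR (parity) and register its output.
parity_lut = LUT4(build_truth_table(lambda a, b, c, d: a ^ b ^ c ^ d))
ff = FlipFlop()
ff.clock(parity_lut.evaluate(1, 0, 1, 1))
print(ff.q)
```

Changing only the truth table turns the same structure into an AND, a multiplexer, or any other 4-input function, which is what "user-programmable" means at the level of a single logic block.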
Input/Output Block
Provides connections to
outside components
Direction:
Output
Input
Bidirectional
Buffers:
Convert signal voltage
levels
Drive internal (In Buf) and
external (Out Buf) lines
Interconnect
Provides connections
between blocks
Two types of nets:
Signal net – regular
connections
Clock net – clock signal
distribution
Switch matrix
Specialized Blocks
Memory
Arithmetic
Multipliers
Multiply-accumulate
Communication
Fast serializer/deserializer
CPU vs. FPGA
CPU:
Fixed hardware
Easier to program
High clock speed – GHz range
Sequential execution of instructions
Limited parallelism levels – data, task
Fixed set of arithmetic precisions
FPGA:
User-defined hardware
More difficult to program
Low clock speed – 100s of MHz range
Logic circuits that operate concurrently
Wide range of parallelism levels – bit, operation, data, task
Custom arithmetic precisions
FPGA in computer systems
Provides a platform for implementation of custom accelerators
Used in addition to CPU
FPGA executes only computation kernel – the computationally most
intensive part of the application
Coprocessor
Connects directly to the CPU (HyperTransport, FSB), has direct access to main memory
Peripheral processing unit
Connects through peripheral bus (PCIe)
Programming FPGAs
Describe hardware function
In text form:
Hardware description languages: VHDL, Verilog
C to HDL tools: Jacquard ROCCC, Mentor Graphics Catapult C, Impulse C
In graphical form:
NI LabVIEW
Xilinx System Generator for DSP + MathWorks Simulink
Synthesis
Translates the HDL description into configurations of FPGA building blocks (logic, IO, memory, etc.)
Place and Route
Distributes blocks and connections to physical resources on the FPGA
Bitstream generation
Generates the configuration file which is written to the FPGA
Hardware Description vs. Programming Languages
HDL (VHDL, Verilog):
Concurrent execution
Explicit expression of parallelism
Sequential execution through finite state machines (FSM)
Wide range of behavioural abstraction levels (logic, RTL, algorithm)
Programming language (C/C++, Java):
Sequential execution
No expression of parallelism
Parallel execution through thread mechanism
High level of behavioural abstraction (algorithm)
Example: Artificial Neural Network
Artificial neural networks (ANN)
Computational models inspired by biological neural networks
of the brain
Processing is mainly parallel and distributed
Information is stored in connections
ANNs are widely used in many domains
E.g. signal processing, automation and control
Artificial Neuron
Fundamental parts:
Inputs
Synaptic links with weights
Activation function Φ
Bias constant b – usually incorporated into the weight vector
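Total synaptic input: $u = \sum_{i=1}^{n} w_i x_i + b$
Output: $y = \Phi(u)$
Commonly used activation functions:
Linear: $f(x) = x$
Log-sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
Tan-sigmoid: $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$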
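The neuron model above translates directly into code. The sketch below is a plain floating-point reference, with function names chosen here for illustration. It computes the total synaptic input as the weighted sum of the inputs plus the bias, applies one of the three activation functions listed, and shows the bias being folded into the weight vector by appending a constant input of 1.

```python
import math


def total_synaptic_input(x, w, b):
    """u = sum_i w_i * x_i + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def linear(u):
    return u


def log_sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def tan_sigmoid(u):
    # (e^u - e^-u) / (e^u + e^-u) is the hyperbolic tangent.
    return math.tanh(u)


def neuron(x, w, b, phi):
    """y = phi(u)"""
    return phi(total_synaptic_input(x, w, b))


# Explicit bias versus bias folded into the weight vector (extra input fixed at 1).
x, w, b = [1.0, 2.0], [0.5, -0.25], 0.3
y_explicit = neuron(x, w, b, linear)
y_folded = neuron(x + [1.0], w + [b], 0.0, linear)
print(y_explicit, y_folded)
```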
Multilayer Perceptron (MLP)
One of the most
commonly used ANN type
Feed-forward network
No connections between
non-adjacent layers
No connections between
neurons in the same layer
Input layer
Hidden layers
Output layer
MLP Parallelism
Layer parallelism
In multilayer networks the
layers can be pipelined
Node parallelism
Corresponds to individual
neurons – neurons are
processed in parallel
Weight parallelism
In computation of total
synaptic input – inputs are
multiplied with weights in
parallel
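The three levels of parallelism map onto the three nested loops of a straightforward forward pass. The sketch below is sequential Python with an illustrative two-layer network whose weights are made up for the example. It is meant only to mark which loop each level refers to; on an FPGA each of these loops is a candidate for being unrolled into concurrent hardware or pipelined.

```python
import math


def layer_forward(inputs, weights, biases, phi):
    outputs = []
    for w_row, b in zip(weights, biases):      # node parallelism: one iteration per neuron
        u = b
        for w, x in zip(w_row, inputs):        # weight parallelism: one multiply per synapse
            u += w * x
        outputs.append(phi(u))
    return outputs


def mlp_forward(inputs, layers):
    signal = inputs
    for weights, biases, phi in layers:        # layer parallelism: layers can be pipelined
        signal = layer_forward(signal, weights, biases, phi)
    return signal


# Illustrative 2-2-1 network (weights are hypothetical).
identity = lambda u: u
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], identity),   # hidden layer, linear
    ([[1.0, 1.0]], [0.5], math.tanh),                   # output layer, tan-sigmoid
]
print(mlp_forward([0.25, -0.25], layers))
```

Because the network is feed-forward with connections only between adjacent layers, each layer depends solely on the previous layer's outputs, which is what makes pipelining across layers possible.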
FPGA Implementation – Neuron
Implemented in two parts
Basic functional unit (BFU)
Implements computation of total synaptic input
Computed sequentially using a multiply-accumulate (MAC) unit
Synaptic weights stored in local ROM
Bias constant included as a synaptic weight
Activation function look-up table (LUT)
ROM addressed by total synaptic input
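A software model can make the two-part structure concrete. The sketch below is an assumption-laden illustration, not the presented design. The fixed-point formats (12 fractional bits for inputs and weights), the 1024-entry ROM covering the range [-8, 8), and all function names are hypothetical choices. It shows a sequential MAC loop that consumes one weight per cycle, with the bias stored as the last ROM entry, followed by an activation look-up addressed by the upper bits of the accumulator.

```python
import math

FRAC_IN = 12         # fractional bits of the inputs (assumed format)
FRAC_W = 12          # fractional bits of the weights (assumed format)
LUT_ADDR_BITS = 10   # activation ROM with 1024 entries (assumed depth)
LUT_STEP_BITS = 6    # ROM step of 2**-6, so the ROM spans [-8, 8)
OUT_SCALE = 2**15 - 1


def to_fixed(value, frac_bits):
    """Quantize a real value to a signed fixed-point integer."""
    return int(round(value * (1 << frac_bits)))


def bfu(inputs_fx, weight_rom):
    """Sequential MAC: one multiply-accumulate per clock cycle.

    A constant input of 1.0 is appended so the last ROM entry acts as the bias.
    """
    one = 1 << FRAC_IN
    acc, cycles = 0, 0
    for x, w in zip(list(inputs_fx) + [one], weight_rom):
        acc += x * w
        cycles += 1
    return acc, cycles


def build_activation_rom(phi):
    """Precompute phi at every ROM address, scaled to a signed 16-bit output."""
    half = 1 << (LUT_ADDR_BITS - 1)
    step = 2.0 ** -LUT_STEP_BITS
    return [int(round(phi((addr - half) * step) * OUT_SCALE))
            for addr in range(1 << LUT_ADDR_BITS)]


def activation_lookup(acc, rom):
    """Address the ROM with the upper accumulator bits, saturating at the ends."""
    half = 1 << (LUT_ADDR_BITS - 1)
    addr = (acc >> (FRAC_IN + FRAC_W - LUT_STEP_BITS)) + half
    return rom[max(0, min(len(rom) - 1, addr))]


weight_rom = [to_fixed(v, FRAC_W) for v in (1.0, 0.5, 0.25)]   # last entry is the bias
inputs_fx = [to_fixed(v, FRAC_IN) for v in (0.5, -0.25)]
acc, cycles = bfu(inputs_fx, weight_rom)
rom = build_activation_rom(math.tanh)
y = activation_lookup(acc, rom)
print(cycles, acc / 2**(FRAC_IN + FRAC_W), y / OUT_SCALE)
```

A neuron with n inputs takes n + 1 cycles in this model, since the bias occupies one extra MAC step. The activation costs a single ROM read regardless of which function is stored.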
FPGA Implementation – MLP
Single layer
One BFU per neuron
Single activation function LUT for a layer
Total synaptic inputs are loaded into shift registers and shifted to the activation function LUT
Computation on new inputs is carried out simultaneously with shifting of old results
Multilayer implementations
Pipelined layers – cascading
Sequential layers – results routed back as new inputs
Performance
Evaluated on a single layer of a larger neural network
266 inputs
176 neurons
linear activation function
Target device: Xilinx Virtex-5 XC5VSX50T
Placed and routed at 85 MHz clock frequency
14.96 Gop/s (fixed-point multiply-accumulate operations)
Signal Precision (bits)
Input 16
Weights 14
Output 16
Resource Available Used Utilization
DSP48E 288 176 61%
Flip-flop 32640 2825 9%
LUT 32640 20197 62%
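The throughput and utilization figures above can be cross-checked with simple arithmetic. The sketch assumes that each of the 176 BFUs completes one multiply-accumulate per clock cycle. That assumption is consistent with one DSP48E being used per neuron, but the slides do not state it explicitly.

```python
neurons = 176
clock_hz = 85_000_000

# One MAC per BFU per clock cycle (assumption), all BFUs running concurrently.
macs_per_second = neurons * clock_hz
print(macs_per_second / 1e9, "Gop/s")

# Resource utilization as used / available, rounded to whole percent.
resources = {
    "DSP48E": (288, 176),
    "Flip-flop": (32640, 2825),
    "LUT": (32640, 20197),
}
utilization = {name: round(100 * used / available)
               for name, (available, used) in resources.items()}
print(utilization)
```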
Performance
Extrapolation to entire network
Sequential layers implementation
Needs 542 clock cycles to evaluate (6.4 μs at 85 MHz)
Executes 62746 multiply-accumulate operations
9.84 Gop/s
Layer  Number of nodes  Activation function
input  266              -
1st    176              linear
2nd    88               tan-sigmoid
3rd    2                log-sigmoid
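The extrapolated figures can be reproduced from the layer sizes in the table. The sketch rests on one interpretive assumption: each neuron performs one MAC per input plus one for the bias, in line with the bias being stored as a synaptic weight. Under that reading the operation count comes out at the quoted 62746, and dividing by the evaluation time of 542 cycles at 85 MHz gives the quoted throughput.

```python
layer_sizes = [266, 176, 88, 2]   # input, 1st, 2nd, 3rd layer
clock_hz = 85_000_000
cycles = 542

# (inputs + 1 bias weight) MACs per neuron, summed over all neurons of each layer.
macs_per_layer = [(n_in + 1) * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_macs = sum(macs_per_layer)

eval_time_s = cycles / clock_hz
eval_time_us = eval_time_s * 1e6
gops = total_macs / eval_time_s / 1e9

print(macs_per_layer, total_macs)
print(round(eval_time_us, 1), "us,", round(gops, 2), "Gop/s")
```

The throughput is lower than for the single 176-neuron layer because in the sequential-layers scheme the smaller 2nd and 3rd layers keep only 88 and 2 of the MAC units busy.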
Conclusion
FPGAs provide great opportunities for computing acceleration...
Custom architectures tailored for specific applications
Wide range of parallelism levels – bit, operation, data, task
...but are underutilized
Development for FPGA requires significantly more effort than regular computer programming
Development tools and processes geared towards integrated circuit design
Limited support for computing applications
Future prospects
Hardware/software co-design
Automated hardware and software generation from high-level system model
The end