wl 2020 1.1
Custom Computing
• theory and practice of customising designs
– one of the fastest growing technologies
– impact on ASIC, CPU, many-core, GPU, multi-scale dataflow
• wide range of architectures and applications
– data-centre/supercomputers with user-customisable accelerators
– message routers, mobile robots, LCD TVs, car audio systems
– invent processors with your own instruction set!
• based mainly on customisable implementation technology
– e.g. Field-Programmable Gate Arrays (FPGAs)
– also called reconfigurable computing, FPGA-based computing
• we focus on concepts, abstractions, design methods
• requirement: willing to learn new ideas, languages, tools
– not afraid of C/Java/functional programs, maths, hardware
wl 2020 1.2
Course coverage
• topics
– custom computing technology overview
– design parametrisation and optimisation
– system-on-chip architecture and design
• 18 lectures, 8 tutorials (flexible), 1 assessed exercise
• course material
– https://www.doc.ic.ac.uk/~wl/teachlocal/cuscomp
– EEE students: may need access via EEE machines
• preparation for projects and research
– many received project prizes or distinctions
– summer projects for non-MSc students
wl 2020 1.3
Why custom computing?
• FPGAs: customisable hardware resources
– data centres for cloud computing
– mobile handsets, Internet of Things (IoT), edge computing
• acceleration of demanding workloads
– big data, finance, genomics, weather/climate modelling
– integrated solution: often with interface to memory, sensors…
– target multiple platforms: need to promote design re-use
• design approach: generalisation + customisation
– often start with design instance: f0
– generalise f0 to become a template f(x), such that f(x0) = f0, where x is a parameter and x0 is a specific value for x
– customise f with values for x to support tradeoff in speed, size…
(figure: f0 is generalised to the template f(x); customising f(x) with x=x0, x=x1, x=x2, x=x3 gives f0, f1, f2, f3)
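The generalise-then-customise approach can be sketched in Python. Everything below is a hypothetical illustration, not taken from the course: the `AdderTree` design, its fixed 16-bit width, and the choice of the operand count as the parameter x are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdderTree:
    """Hypothetical design instance: a balanced tree that sums `inputs` operands."""
    inputs: int      # number of operands summed (plays the role of x)
    bit_width: int   # operand width, kept fixed in this sketch

    def adders(self) -> int:
        # Reducing n operands to one result needs n - 1 two-input adders.
        return self.inputs - 1

    def depth(self) -> int:
        # Levels in a balanced binary tree: ceil(log2(n)).
        return (self.inputs - 1).bit_length()


def f(x: int) -> AdderTree:
    """Template f(x): generalises the starting instance over the operand count."""
    return AdderTree(inputs=x, bit_width=16)


f0 = AdderTree(inputs=4, bit_width=16)  # the starting design instance
x0 = 4
assert f(x0) == f0                       # the template reproduces f0 at x = x0

# Customise: each value of x gives a design with a different size and depth.
for x in (4, 8, 16):
    design = f(x)
    print(x, design.adders(), design.depth())
```

Larger x means more adders (size) and more levels (depth), which is the kind of tradeoff the parameter is meant to expose.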
wl 2020 1.4
Benefits of customisation
• improvements in
– accuracy: as needed, not necessarily 8, 32, 64, 128 bits
– throughput: rate of producing results
– latency: time between first input and first output
– reconfiguration time: speed of adapting to changes
– size: area, volume, weight
– energy and power consumption: mobile and remote applications
– development time: design and validation
– cost: minimise fabrication, post-delivery fixes, enhancements
• need to prioritise design objectives
– e.g. smallest design at a given speed consuming given energy
• opportunities for customisation
– application-oriented, e.g. run-time conditions
– implementation-oriented, e.g. technology used
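As a rough illustration of "accuracy: as needed", the sketch below finds the smallest fixed-point fraction width that represents one constant within a tolerance. The function names, the constant 0.1 and the tolerance 0.01 are assumptions chosen for the example.

```python
def quantise(value: float, frac_bits: int) -> float:
    """Round `value` to the nearest multiple of 2**-frac_bits (fixed point)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def bits_needed(value: float, tolerance: float, max_bits: int = 64) -> int:
    """Smallest fraction width whose rounding error for `value` is within tolerance."""
    for frac_bits in range(max_bits + 1):
        if abs(quantise(value, frac_bits) - value) <= tolerance:
            return frac_bits
    raise ValueError("tolerance not reachable within max_bits")


# Representing 0.1 to within 0.01 needs 5 fraction bits, not a standard 8 or 32.
print(bits_needed(0.1, 0.01))   # 5
print(quantise(0.1, 5))         # 0.09375
```

A customised datapath can use exactly that width, while a fixed-width processor has to round up to the next standard size.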
wl 2020 1.5
Implementation technology
• application-specific integrated circuit (ASIC)
– high performance, low part cost: cheap if producing large volume
– high risk, high development cost, slow time-to-market
– costly (Moore’s Second Law) to develop, build and test, inflexible
• Field-Programmable Gate Array (FPGA)
– low risk, fast time-to-market, low development cost, high part cost
– post-delivery improvement: fix bugs, update functions
– customisable at run time: adapt to environment changes
– prototype for ASIC
– enable internet routing
• custom computing systems
– stand-alone
– PCIe / Infiniband
– system-on-chip: instruction processor + FPGA
wl 2020 1.6
Technology comparison
(figure: flexibility plotted against efficiency and performance, positioning general-purpose processors, digital signal processors, special-purpose processors, FPGAs and ASICs)
(adapted from K. Fan, HPCA’09)
wl 2020 1.7
Where are FPGAs? Consumer applications
Digital Camera & Editing
LCD Projectors
PDP & HDTV
STB, DVR & VTR
Automotive
Handheld
Automotive
Diagnostics
Home Computing
Home Networking
(source: Xilinx Inc.)
wl 2020 1.8
New: accelerators for data centre servers
• Smart NIC (Network Interface Controller)
– compute accelerator: local / remote
– infrastructure accelerator: network / storage
– flexibility of Software Defined Network + speed of hardware
Source: Microsoft
wl 2020 1.9
Accelerate clouds: Microsoft + Amazon
aws.amazon.com/ec2/instance-types/f1/
www.top500.org/news/microsoft-goes-all-in-for-fpgas-to-build-out-cloud-based-ai/
wl 2020 1.10
Why Intel bought Altera
Source: Intel
IP: Intellectual Property
wl 2020 1.11
Drones + IoT + …
Aerotenna: Octagonal Pilot on Chip
Source: Intel
ASSP: Application-Specific Standard Part
SAM: Serviceable Available Market
wl 2020 1.12
Particle Physics: Large Hadron Collider
(source: Xilinx Inc.)
Opto-RX, 12 way
3 x Delay FPGA (ADC clk timing)
Virtex II, 2M gate FPGA performs signal processing
Optical ribbon cable input
Opto-to-electrical conversion Digitise & sync data Find hit clusters
• real-time analysis of particle collision
• combine data from various detectors
(source: G. Hall)
wl 2020 1.13
Customisation: pre-fab and post-fab
• fabrication: manufacturing the chip
– Xilinx UltraScale+ FPGA: 16nm, Intel i7-i770T: 22nm
– costly: very small geometry, ultra-clean room
• application-specific integrated circuit (ASIC)
– greatest customisation at pre-fabrication, but could be inflexible
– high performance, low part cost: cheap if producing large volume
– high risk, high development cost, slow time-to-market
– costly (in money and time) to develop and test: Moore’s Second Law
• field-programmable gate array (FPGA)
– post-fabrication, post-delivery, even run-time customisation
– hardware speed, software flexibility
– most basic, fine-grained unit of programmability
– need larger function blocks for efficiency
wl 2020 1.14
Design metrics
• NRE (non-recurring engineering) cost
– one-time cost of designing system
• total cost: total cost = NRE cost + unit cost * number of units
• size, performance, power
• flexibility
– make changes to the hardware with low NRE cost
• time-to-prototype, time-to-market
• maintainability
• correctness, safety, robustness
Source: J. Wong
wl 2020 1.15
FPGA/ASIC crossover points
(figure: cost plotted against production volume, marking regions of FPGA cost advantage and ASIC cost advantage)
Source: S.S.S.P. Rao
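A crossover point follows from the total cost formula, total cost = NRE cost + unit cost * number of units. The sketch below is a minimal illustration: the function names are hypothetical and the cost figures are invented for the example, not taken from the source.

```python
import math


def total_cost(nre: float, unit_cost: float, units: int) -> float:
    """total cost = NRE cost + unit cost * number of units."""
    return nre + unit_cost * units


def crossover_volume(nre_fpga: float, unit_fpga: float,
                     nre_asic: float, unit_asic: float) -> int:
    """Smallest volume at which the ASIC's total cost is no higher than the FPGA's."""
    if unit_fpga <= unit_asic:
        raise ValueError("no crossover: ASIC unit cost must be below FPGA unit cost")
    # Solve nre_fpga + unit_fpga * n = nre_asic + unit_asic * n for n.
    return max(0, math.ceil((nre_asic - nre_fpga) / (unit_fpga - unit_asic)))


# Invented figures: FPGA with no NRE and 50 per part, ASIC with 1,000,000 NRE and 10 per part.
n = crossover_volume(0, 50, 1_000_000, 10)
print(n)                                                       # 25000
print(total_cost(0, 50, n), total_cost(1_000_000, 10, n))      # equal at the crossover
```

Below that volume the FPGA is cheaper overall; above it the ASIC's lower unit cost outweighs its NRE cost.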
wl 2020 1.16
FPGA vs ASIC
FPGA
• faster time-to-market
– no layout, masks or other manufacturing steps are needed
• no upfront NRE costs
• simpler design cycle
– software tools for routing, placement, and timing
• more predictable project cycle
• field re-programmability
ASIC
• full custom capability
– for design since device is manufactured to design specs
• lower unit costs
– for very high volume
• smaller form factor
– device is made to design specs
• higher raw internal clock speeds
Source: J. Wong
wl 2020 1.17
Design flows
HDL: Hardware Description Language
DFT: Design For Test
Source: J. Wong
wl 2020 1.18
Early FPGA architecture
(figure: FPGA tile, with labels Logic Block, Connection Block, Switch Block, Routing Track, and horizontal and vertical Routing Channel)
Source: S. Wilton
wl 2020 1.19
Basic logic gate: lookup table
Function of each lookup table can be configured by shifting in bit-stream.
(figure: reconfigurable logic, with labels Inputs and Bit-Stream)
Source: S. Wilton
wl 2020 1.20
Basic logic gate: lookup table
Function of each lookup table can be configured by shifting in bit-stream. By-passable register at output.
(figure: reconfigurable logic, with labels Inputs and a D Q register at the output)
Source: S. Wilton
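The behaviour described above can be modelled in a few lines of Python. This is a sketch under assumptions: the class name, the shift order of the bit-stream and the input-to-address encoding are all illustrative choices, not details from the slide.

```python
class LookupTable:
    """Sketch of a k-input lookup table with a by-passable output register."""

    def __init__(self, num_inputs: int, registered: bool = False):
        self.num_inputs = num_inputs
        self.config = [0] * (1 << num_inputs)  # one configuration bit per input pattern
        self.registered = registered           # False means the register is by-passed
        self.q = 0                              # register contents

    def shift_in(self, bitstream):
        """Shift configuration bits in one at a time; the last 2**k bits remain."""
        for bit in bitstream:
            self.config = self.config[1:] + [bit & 1]

    def lookup(self, inputs) -> int:
        """Combinational output: the inputs form an address into the table."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:                      # first input is the most significant bit
            address = (address << 1) | (bit & 1)
        return self.config[address]

    def clock(self, inputs) -> int:
        """One clock cycle: return the visible output, then latch the new value."""
        value = self.lookup(inputs)
        if not self.registered:
            return value                        # by-passed: output follows the inputs
        out, self.q = self.q, value             # registered: output lags by one cycle
        return out


xor_lut = LookupTable(2)
xor_lut.shift_in([0, 1, 1, 0])                  # configure as a 2-input XOR
print([xor_lut.lookup((a, b)) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

Shifting in a different bit-stream, for example `[0, 0, 0, 1]`, turns the same hardware into an AND gate.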
wl 2020 1.22
Reconfigurable logic
• Connect logic blocks using fixed metal tracks and programmable switches
Everything can be built using fine-grained logic; why need anything else?
Source: S. Wilton
wl 2020 1.25
Implementing systems in an FPGA
FPGA vendors embed fixed blocks to improve speed and density:
• Embedded Memories (blocks of 2K-18K)
• Hard Blocks, e.g. multiplier
• High-Speed I/Os
But every user must pay for them, whether used or not…
Source: S. Wilton
wl 2020 1.26
Example: Xilinx Virtex CLB tile
• CLB tile is composed of:
– switch matrix
– Configurable Logic Block and associated general routing resources
– IMUX and OMUX
• all CLB inputs have access to interconnect on all 4 sides
• fast local feedback within CLB and direct connects to east and west CLBs: support wide functions of up to 19 inputs within a single CLB
(figure: CLB tile showing the switch matrix, a CLB with two slices, local feedback and carry chains, single, hex and long routing lines, tristate busses, and direct connects)
Source: Xilinx Inc.
wl 2020 1.27
Simplified CLB structure
(figure: CLB with Slice 0 and Slice 1; each slice has two LUTs with carry logic, each feeding a D Q register with CE, PRE and CLR inputs)
• two slices in each CLB
– two BUFTs associated with each CLB, accessible by all 8 CLB outputs
– carry logic runs vertically upwards, to speed up carry propagation
Source: Xilinx Inc.
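One way to see why dedicated carry logic helps is to model an adder in which each LUT computes only the propagate signal, while a separate carry chain passes the carry from bit to bit. The split below (LUT as XOR, a carry multiplexer and a sum XOR outside the LUT) is an assumed arrangement for illustration, and all function names are hypothetical.

```python
def lut_propagate(a: int, b: int) -> int:
    """LUT configured as XOR: 1 when this bit position passes its carry-in along."""
    return a ^ b


def carry_chain_add(a_bits, b_bits, carry_in: int = 0):
    """Add two equal-length bit lists, least significant bit first."""
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        p = lut_propagate(a, b)
        sum_bits.append(p ^ carry)     # dedicated XOR produces the sum bit
        carry = carry if p else a      # dedicated mux: propagate the carry, or take a
    return sum_bits, carry


def to_bits(value: int, width: int):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


s, c = carry_chain_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(s), c)   # 1 1, i.e. 11 + 6 = 17 = 16 + 1
```

The carry never passes through general routing in this model: it moves from one bit position straight to the next, which is the path the vertical carry logic is there to speed up.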
wl 2020 1.28
Look-Up Tables
(figure: combinatorial logic with inputs A, B, C, D and output Z)
A B C D Z
0 0 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 1 1 1
0 1 0 0 1
0 1 0 1 1
. . .
1 1 0 0 0
1 1 0 1 0
1 1 1 0 0
1 1 1 1 1
• combinatorial logic is stored in Look-Up Tables (LUTs) in a CLB
• capacity is limited by number of inputs, not complexity
• delay through CLB is constant
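The point that capacity depends on the number of inputs, not on complexity, can be checked with a short sketch. The function `z` below is an assumption: it is one function consistent with the rows visible in the table, since the middle rows are elided in the source. The helper names are hypothetical.

```python
from itertools import product


def lut_contents(func, num_inputs: int):
    """Truth table of `func`, one bit per input pattern, in order 00..0 to 11..1."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=num_inputs)]


def simple(a, b, c, d):
    return a & b & c & d                 # a single 4-input AND


def z(a, b, c, d):
    # Assumption: matches the visible table rows; the elided rows are not known.
    return (a ^ b) | (c & d)


simple_table = lut_contents(simple, 4)
z_table = lut_contents(z, 4)

# Both functions occupy exactly 16 configuration bits in a 4-input LUT.
print(len(simple_table), len(z_table))   # 16 16

# Evaluation is a single indexed read whichever function is stored.
address = 0b0011                          # A=0, B=0, C=1, D=1
print(z_table[address])                   # 1
```

A function of five inputs would need 32 bits however simple it is, so it does not fit in one 4-input LUT.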
wl 2020 1.29
Stratix IV GX 230: mid-size device
• Adaptive Logic Modules (fine grain)
• RAM Blocks (M9K & M144K)
• DSP Blocks (coarse grain)
• High Speed Serial Interfaces: e.g. connect multiple FPGAs
(source: V. Betz)
wl 2020 1.30
Stratix IV Overview
Feature                  Stratix III (65 nm)                  Stratix IV (40 nm)
Logic Elements           340k                                 680k
RAM bits                 16 Mb + 4 Mb                         33 Mb + 8.5 Mb
18x18 multipliers        768                                  1360
General I/O              1104                                 1104
High-speed serial links  0                                    48 transmit + 48 receive @ 11.3 Gb/s
Hard PCIe blocks         0                                    4
Clock generation         12 PLL(x10)                          12 PLL(x10) + 32 serial recovered + 24 serial transmit
Clock distribution       16 Global + 88 Quadrant + 132 PCLK   16 Global + 88 Quadrant + 132 PCLK
(from V. Betz)
wl 2020 1.31
Current and future: System-on-Chip
(figure: chip with I/O ring and interface circuitry around an embedded processor, on-chip memory, fixed IP blocks and reconfigurable logic)
Fixed Intellectual Property Block
- functionality fixed at design time
- little post-fab flexibility
Processor e.g. ARM
- functionality specified using software
Programmable Logic
- circuit can be specified / modified after fabrication, possibly at run time
- may be slower than fixed IP block
Source: S. Wilton
wl 2020 1.32
Summary
• custom computing: theory and practice of customisation
– from data centres/cloud computing to mobile appliances
• customisable off-the-shelf implementation technology
– e.g. FPGAs, coarse-grained/hybrid processors, custom instructions
• factors favouring field-programmability
– rise in FPGA capability: many exciting applications
– rise in integrated circuit fabrication cost: zero for FPGA users!
– customisation: facilitate product evolution and prototyping
• custom computing tools + applications at Imperial College
– financial analysis/trading, multimedia processing, medical imaging
– network firewall, data compression/encryption, mobile robots
– bio-informatics, machine learning, bio-inspired/self-aware systems
see: http://cc.doc.ic.ac.uk