Custom Computing - Imperial College Londonwl/teachlocal/cuscomp/notes/cc...–e.g. FPGAs,...

wl 2020 1.1

Custom Computing

• theory and practice of customising designs

– one of the fastest growing technologies

– impact on ASIC, CPU, many-core, GPU, multi-scale dataflow

• wide range of architectures and applications

– data-centre/supercomputers with user-customisable accelerators

– message routers, mobile robots, LCD TVs, car audio systems

– invent processors with your own instruction set!

• based mainly on customisable implementation technology

– e.g. Field-Programmable Gate Arrays (FPGAs)

– also called reconfigurable computing, FPGA-based computing

• we focus on concepts, abstractions, design methods

• requirement: willing to learn new ideas, languages, tools

– not afraid of C/Java/functional programs, maths, hardware

wl 2020 1.2

Course coverage

• topics

– custom computing technology overview

– design parametrisation and optimisation

– system-on-chip architecture and design

• 18 lectures, 8 tutorials (flexible), 1 assessed exercise

• course material

– https://www.doc.ic.ac.uk/~wl/teachlocal/cuscomp

– EEE students: may need access via EEE machines

• preparation for projects and research

– many received project prizes or distinctions

– summer projects for non-MSc students

wl 2020 1.3

Why custom computing?

• FPGAs: customisable hardware resources

– data centres for cloud computing

– mobile handsets, Internet of Things (IoT), edge computing

• acceleration of demanding workloads

– big data, finance, genomics, weather/climate modelling

– integrated solution: often with interface to memory, sensors…

– target multiple platforms: need to promote design re-use

• design approach: generalisation + customisation

– often start with design instance: f0

– generalise f0 to become a template f(x), such that f(x0) = f0, where x is a parameter and x0 is a specific value for x

– customise f with values for x to support tradeoff in speed, size…

[Figure: generalise the design instance f0 into the template f(x); customise f(x) with x = x0, x1, x2, x3 to obtain f0, f1, f2, f3]
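The generalise-then-customise approach can be illustrated with a small software sketch. The 8-bit wrap-around adder chosen as the design instance, and the choice of bit width as the parameter x, are assumptions made purely for illustration; they are not taken from the course material.

```python
from typing import Callable


def f0(a: int, b: int) -> int:
    """Design instance: an 8-bit wrap-around adder (illustrative choice)."""
    return (a + b) & 0xFF


def f(x: int) -> Callable[[int, int], int]:
    """Template: an x-bit wrap-around adder, where x is the parameter."""
    mask = (1 << x) - 1

    def adder(a: int, b: int) -> int:
        return (a + b) & mask

    return adder


# Customise the template with different parameter values.
x0 = 8
f1 = f(4)    # narrower datapath: smaller design, smaller range
f2 = f(16)   # wider datapath: larger design, larger range

# The template instantiated at x0 behaves like the original instance.
print(f(x0)(200, 100) == f0(200, 100))  # True
```

Each value of x gives a different point in the speed/size tradeoff, while the template itself is written and validated once.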

wl 2020 1.4

Benefits of customisation

• improvements in

– accuracy: as needed, not necessarily 8, 32, 64, 128 bits

– throughput: rate of producing results

– latency: time between first input and first output

– reconfiguration time: speed of adapting to changes

– size: area, volume, weight

– energy and power consumption: mobile and remote applications

– development time: design and validation

– cost: minimise fabrication, post-delivery fixes, enhancements

• need to prioritise design objectives

– e.g. smallest design at a given speed consuming given energy

• opportunities for customisation

– application-oriented, e.g. run-time conditions

– implementation-oriented, e.g. technology used
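Prioritising objectives, as in "smallest design at a given speed consuming given energy", can be read as a constrained selection: treat speed and energy as constraints and minimise size. The sketch below assumes a hypothetical `Design` record and made-up candidate figures; the units and numbers are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Design:
    name: str
    area: int           # hypothetical size measure, e.g. number of LUTs
    speed_mhz: float    # hypothetical achievable clock rate
    energy_mj: float    # hypothetical energy per task


def smallest_feasible(designs: Iterable[Design],
                      min_speed_mhz: float,
                      max_energy_mj: float) -> Optional[Design]:
    """Return the smallest design meeting the speed and energy constraints."""
    feasible = [d for d in designs
                if d.speed_mhz >= min_speed_mhz and d.energy_mj <= max_energy_mj]
    return min(feasible, key=lambda d: d.area) if feasible else None


# Made-up candidates produced by customising one template in three ways.
candidates = [
    Design("serial", area=120, speed_mhz=80.0, energy_mj=1.0),
    Design("pipelined", area=400, speed_mhz=250.0, energy_mj=2.5),
    Design("parallel", area=900, speed_mhz=300.0, energy_mj=6.0),
]

best = smallest_feasible(candidates, min_speed_mhz=200.0, max_energy_mj=3.0)
print(best.name if best else "no feasible design")
```

Changing which metric is minimised and which are constraints gives a different priority ordering over the same candidates.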

wl 2020 1.5

Implementation technology

• application-specific integrated circuit (ASIC)

– high performance, low part cost: cheap if producing large volume

– high risk, high development cost, slow time-to-market

– costly (Moore’s Second Law) to develop, build and test, inflexible

• Field-Programmable Gate Array (FPGA)

– low risk, fast time-to-market, low development cost, high part cost

– post-delivery improvement: fix bugs, update functions

– customisable at run time: adapt to environment changes

– prototype for ASIC

– enable internet routing

• custom computing systems

– stand-alone

– PCIe / Infiniband

– system-on-chip: instruction processor + FPGA

wl 2020 1.6

Technology comparison

[Figure: flexibility against efficiency and performance for general-purpose processors, digital signal processors, special-purpose processors, FPGAs and ASICs]

(adapted from K. Fan, HPCA’09)

wl 2020 1.7

Where are FPGAs? Consumer applications

Digital Camera & Editing

LCD Projectors

PDP & HDTV

STB, DVR & VTR

Automotive

Handheld

Automotive

Diagnostics

Home Computing

Home Networking

(source: Xilinx Inc.)

wl 2020 1.8

New: accelerators for data centre servers

• Smart NIC (Network Interface Controller)

– compute accelerator: local / remote

– infrastructure accelerator: network / storage

– flexibility of Software Defined Network + speed of hardware

Source: Microsoft

wl 2020 1.9

Accelerate clouds: Microsoft + Amazon

aws.amazon.com/ec2/instance-types/f1/

www.top500.org/news/microsoft-goes-all-in-for-fpgas-to-build-out-cloud-based-ai/

wl 2020 1.10

Why Intel bought Altera

Source: Intel

IP: Intellectual Property

wl 2020 1.11

Source: Intel

Drones + IoT + …

Aerotenna:

Octagonal Pilot on Chip

ASSP: Application-Specific Standard Part

SAM: Serviceable Available Market

wl 2020 1.12

Particle Physics: Large Hadron Collider

(source: Xilinx Inc.)

[Figure labels: optical ribbon cable input; Opto-RX, 12 way; 3 x Delay FPGA (ADC clk timing); Virtex II, 2M gate FPGA performs signal processing. Stages: opto-to-electrical conversion; digitise & sync data; find hit clusters]

• real-time analysis of particle collision

• combine data from various detectors

(source: G. Hall)

wl 2020 1.13

Customisation: pre-fab and post-fab

• fabrication: manufacturing the chip

– Xilinx UltraScale FPGA: 16nm, Intel i7-i770T: 22nm

– costly: very small geometry, ultra-clean room

• application-specific integrated circuit (ASIC)

– greatest customisation at pre-fabrication, but could be inflexible

– high performance, low part cost: cheap if producing large volume

– high risk, high development cost, slow time-to-market

– costly (in money and time) to develop and test: Moore’s Law

• field-programmable gate array (FPGA)

– post-fabrication, post-delivery, even run-time customisation

– hardware speed, software flexibility

– most basic, fine-grained unit of programmability

– need larger function blocks for efficiency

wl 2020 1.14

Design metrics

• NRE (non-recurring engineering) cost

– one-time cost of designing system

• total cost: NRE cost + unit cost * number of units

• size, performance, power

• flexibility

– make changes to the hardware with low NRE cost

• time-to-prototype, time-to-market

• maintainability

• correctness, safety, robustness

Source: J. Wong

wl 2020 1.15

FPGA/ASIC crossover points

[Figure: cost against production volume, showing regions of FPGA cost advantage and ASIC cost advantage separated by crossover points]

Source: S.S.S.P. Rao
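A crossover point can be estimated from the relation total cost = NRE cost + unit cost * number of units. The sketch below assumes a simple linear cost model for both technologies, and the cost figures are made up for illustration; real pricing is rarely this simple.

```python
def total_cost(nre_cost: float, unit_cost: float, units: int) -> float:
    """Total cost = NRE cost + unit cost * number of units."""
    return nre_cost + unit_cost * units


def crossover_volume(nre_fpga: float, unit_fpga: float,
                     nre_asic: float, unit_asic: float) -> float:
    """Volume at which the FPGA and ASIC total costs are equal.

    Assumes the ASIC has the higher NRE cost and the lower unit cost.
    """
    if unit_fpga <= unit_asic:
        raise ValueError("no crossover: FPGA unit cost must exceed ASIC unit cost")
    return (nre_asic - nre_fpga) / (unit_fpga - unit_asic)


# Hypothetical figures: no upfront NRE for the FPGA, costly masks for the ASIC.
NRE_FPGA, UNIT_FPGA = 0.0, 50.0
NRE_ASIC, UNIT_ASIC = 1_000_000.0, 5.0

volume = crossover_volume(NRE_FPGA, UNIT_FPGA, NRE_ASIC, UNIT_ASIC)
print(f"crossover at about {volume:.0f} units")

for units in (10_000, 100_000):
    fpga = total_cost(NRE_FPGA, UNIT_FPGA, units)
    asic = total_cost(NRE_ASIC, UNIT_ASIC, units)
    print(units, "FPGA cheaper" if fpga < asic else "ASIC cheaper")
```

Below the crossover volume the FPGA has the cost advantage in this model; above it the lower ASIC unit cost outweighs the NRE cost.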

wl 2020 1.16

FPGA vs ASIC

FPGA

• faster time-to-market

– no layout, masks or other manufacturing steps are needed

• no upfront NRE costs

• simpler design cycle

– software tools for routing, placement, and timing

• more predictable project cycle

• field re-programmability

ASIC

• full custom capability

– for design, since device is manufactured to design specs

• lower unit costs

– for very high volume

• smaller form factor

– device is made to design specs

• higher raw internal clock speeds

Source: J. Wong

wl 2020 1.17

Design flows

HDL: Hardware Description Language

DFT: Design For Test

Source: J. Wong

wl 2020 1.18

Early FPGA architecture

[Figure: tile of an early FPGA; labels: connection block, logic block, switch block, routing track, routing channel (horizontal and vertical)]

Source: S. Wilton

wl 2020 1.19

Basic logic gate: lookup table

Function of each lookup table can be configured by shifting in bit-stream.

[Figure: reconfigurable logic with inputs and a bit-stream shifted in]

Source: S. Wilton
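A lookup table configured by a shifted-in bit-stream can be modelled in a few lines of software. This is a behavioural sketch only: the class name `LUT`, the bit ordering (first input is the most significant address bit) and the shift direction are assumptions for illustration, not a description of any particular device.

```python
from typing import Iterable, List


class LUT:
    """Behavioural model of a k-input lookup table with 2**k stored bits."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.bits: List[int] = [0] * (1 << k)

    def configure(self, bitstream: Iterable[int]) -> None:
        """Shift configuration bits in one at a time, like a shift register."""
        stream = list(bitstream)
        if len(stream) != len(self.bits):
            raise ValueError("bit-stream length must be 2**k")
        for b in stream:
            # Drop the oldest bit and append the new one.
            self.bits = self.bits[1:] + [b & 1]

    def evaluate(self, *inputs: int) -> int:
        """Use the inputs as an address into the stored bits."""
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | (bit & 1)
        return self.bits[addr]


lut = LUT(2)
lut.configure([0, 1, 1, 0])                  # truth table of XOR
print([lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)])
lut.configure([0, 0, 0, 1])                  # same hardware, now AND
print([lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)])
```

The same table implements different functions depending only on the bits shifted in, which is the sense in which the logic is reconfigurable.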

wl 2020 1.20

Basic logic gate: lookup table

Function of each lookup table can be configured by shifting in bit-stream. By-passable register at output.

[Figure: lookup table with inputs, followed by a by-passable register (D, Q) at the output]

Source: S. Wilton

wl 2020 1.21


• Connect logic blocks using fixed metal tracks and programmable switches

Source: S. Wilton

wl 2020 1.22


• Connect logic blocks using fixed metal tracks and programmable switches

Everything can be built using fine-grained logic; why need anything else?

Source: S. Wilton

wl 2020 1.23

Implementing systems in an FPGA

FPGA vendors embed fixed blocks to improve speed and density:

Embedded Memories (blocks of 2K-18K)

But every user must pay for them, whether used or not…

Source: S. Wilton

wl 2020 1.24


FPGA vendors embed fixed blocks to improve speed and density:

Embedded Memories (blocks of 2K-18K)

Hard Blocks, e.g. multiplier

Source: S. Wilton


wl 2020 1.25



FPGA vendors embed fixed blocks to improve speed and density:

Embedded Memories (blocks of 2K-18K)

Hard Blocks, e.g. multiplier

High-Speed I/Os

Source: S. Wilton

wl 2020 1.26

Example: Xilinx Virtex CLB tile

• CLB tile is composed of:

– switch matrix

– Configurable Logic Block and associated general routing resources

– IMUX and OMUX

• all CLB inputs have access to interconnect on all 4 sides

• fast local feedback within CLB and direct connects to east and west CLBs: support wide functions of up to 19 inputs within a single CLB

[Figure: CLB tile; labels: switch matrix, CLB with two slices, carry chains, local feedback, direct connects, tristate busses, and SINGLE, HEX and LONG routing lines]

Source: Xilinx Inc.

wl 2020 1.27

Simplified CLB structure

[Figure: CLB with Slice 0 and Slice 1; each slice has two LUTs with carry logic, and two registers (D, Q) with CE, PRE and CLR inputs]

• two slices in each CLB

– two BUFTs associated with each CLB, accessible by all 8 CLB outputs

– carry logic runs vertically upwards, to speed up carry propagation

Source: Xilinx Inc.
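The role of dedicated carry logic can be sketched with a bit-level model of an adder in which each bit position uses a LUT for the propagate signal and a carry multiplexer for the carry-out. This is a simplified, assumed arrangement for illustration and not the exact Virtex circuit; the function name `add_with_carry_chain` is hypothetical.

```python
from typing import Tuple


def add_with_carry_chain(a: int, b: int, width: int) -> Tuple[int, int]:
    """Add two width-bit numbers bit by bit; return (sum, carry_out)."""
    carry = 0
    total = 0
    for i in range(width):              # bit 0 first: the carry ripples upwards
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        propagate = ai ^ bi             # computed by the LUT in this model
        sum_bit = propagate ^ carry     # XOR on the dedicated carry path
        # Carry multiplexer: pass the incoming carry if propagating,
        # otherwise the carry-out equals ai (which equals bi here).
        carry = carry if propagate else ai
        total |= sum_bit << i
    return total, carry


print(add_with_carry_chain(5, 3, 4))    # (8, 0)
print(add_with_carry_chain(15, 1, 4))   # (0, 1)
```

Only the multiplexer lies on the path from one carry to the next, which is why a dedicated vertical carry route is faster than passing the carry through general routing.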

wl 2020 1.28

Look-Up Tables

[Figure: combinatorial logic with inputs A, B, C, D and output Z]

A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
. . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1

• combinatorial logic is stored in Look-Up Tables (LUTs) in a CLB

• capacity is limited by number of inputs, not complexity

• delay through CLB is constant
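The point that capacity depends on the number of inputs rather than on complexity can be checked in software: any 4-input function fills exactly 16 table bits, while a 5-input function needs more than one 4-input table. The sketch below assumes 4-input LUTs and uses a Shannon expansion on the fifth input, with a 2-to-1 multiplexer selecting between two tables; the helper names and example functions are illustrative.

```python
from itertools import product
from typing import Callable, List


def make_lut(func: Callable[..., int], k: int) -> List[int]:
    """Tabulate a k-input Boolean function into 2**k stored bits."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]


def lut_read(table: List[int], *inputs: int) -> int:
    """Read a table entry; the first input is the most significant address bit."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return table[addr]


def parity4(a: int, b: int, c: int, d: int) -> int:
    return a ^ b ^ c ^ d


def and4(a: int, b: int, c: int, d: int) -> int:
    return a & b & c & d


# A "complex" and a "simple" function each occupy one 16-bit table.
parity_lut = make_lut(parity4, 4)
and_lut = make_lut(and4, 4)
print(len(parity_lut), len(and_lut))  # 16 16


def majority5(a: int, b: int, c: int, d: int, e: int) -> int:
    return int(a + b + c + d + e >= 3)


# Five inputs: one 4-input table per value of e, then a multiplexer on e.
lut_e0 = make_lut(lambda a, b, c, d: majority5(a, b, c, d, 0), 4)
lut_e1 = make_lut(lambda a, b, c, d: majority5(a, b, c, d, 1), 4)


def majority5_from_luts(a: int, b: int, c: int, d: int, e: int) -> int:
    return lut_read(lut_e1, a, b, c, d) if e else lut_read(lut_e0, a, b, c, d)
```

Each table read takes the same time whatever function is stored, which matches the constant delay noted above; the extra multiplexer level is the price of the additional input.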

wl 2020 1.29

Stratix IV GX 230: mid-size device

[Figure: device floorplan; labels: Adaptive Logic Modules (fine grain); RAM Blocks (M9K & M144K); DSP Blocks (coarse grain); High Speed Serial Interfaces, e.g. connect multiple FPGAs]

(source: V. Betz)

wl 2020 1.30

Stratix IV Overview

Feature                 | Stratix III (65 nm)                | Stratix IV (40 nm)
Logic Elements          | 340k                               | 680k
RAM bits                | 16 Mb + 4 Mb                       | 33 Mb + 8.5 Mb
18x18 multipliers       | 768                                | 1360
General I/O             | 1104                               | 1104
High-speed serial links | 0                                  | 48 transmit + 48 receive @ 11.3 Gb/s
Hard PCIe blocks        | 0                                  | 4
Clock generation        | 12 PLL(x10)                        | 12 PLL(x10) + 32 serial recovered + 24 serial transmit
Clock distribution      | 16 Global + 88 Quadrant + 132 PCLK | 16 Global + 88 Quadrant + 132 PCLK

(from V. Betz)

wl 2020 1.31

Current and future: System-on-Chip

[Figure: system-on-chip with embedded processor, on-chip memory, fixed IP blocks and reconfigurable logic, surrounded by I/O ring and interface circuitry]

• Fixed Intellectual Property (IP) block

– functionality fixed at design time

– little post-fab flexibility

• Processor, e.g. ARM

– functionality specified using software

• Programmable logic

– circuit can be specified / modified after fabrication, possibly at run time

– may be slower than fixed IP block

Source: S. Wilton

wl 2020 1.32

Summary

• custom computing: theory and practice of customisation

– from data centres/cloud computing to mobile appliances

• customisable off-the-shelf implementation technology

– e.g. FPGAs, coarse-grained/hybrid processors, custom instructions

• factors favouring field-programmability

– rise in FPGA capability: many exciting applications

– rise in integrated circuit fabrication cost: zero for FPGA users!

– customisation: facilitate product evolution and prototyping

• custom computing tools + applications at Imperial College

– financial analysis/trading, multimedia processing, medical imaging

– network firewall, data compression/encryption, mobile robots

– bio-informatics, machine learning, bio-inspired/self-aware systems

see: http://cc.doc.ic.ac.uk

Date post:	21-Aug-2020