
The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems

Frank Vahid, Professor

Department of Computer Science and Engineering, University of California, Riverside

Associate Director, Center for Embedded Computer Systems, UC Irvine

Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale

Contributing Students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), David Sheldon (3rd yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy (1st yr PhD)

Frank Vahid, UC Riverside

Outline

FPGAs – The New Software
  Why they’re great
  Why they’re not ubiquitous yet
Hiding FPGAs from programmers
  Warp processing
    Binary decompilation
    Just-in-time FPGA compilation
Towards Standard Binaries for FPGAs

FPGAs

FPGA: Field-Programmable Gate Array

Implement circuit by downloading bits
  A memory with N address inputs (a “LUT”, or lookup table) implements any N-input combinational logic function
  Register-controlled switch matrix (SM) connects LUTs

FPGA fabric
  Thousands of LUTs and SMs, increasingly with additional hard core components like multipliers, RAM, etc.
  CAD tools automatically map desired circuit onto FPGA fabric

[Figure: a 4x2 memory used as a LUT, with address inputs a and b and data outputs d1 and d0 implementing functions F and G; a 2x2 switch matrix with inputs a, b and outputs x, y; an FPGA fabric of LUTs connected through SMs. Caption: Implement circuit by downloading particular bits]
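The LUT idea on the FPGAs slide can be sketched in C as a small memory indexed by the logic inputs. This is only an illustration: the type `Lut4x2`, the helper names, and the choice of F = a AND b and G = a XOR b are assumptions made here, since the exact memory contents in the slide's figure are not recoverable.

```c
#include <stdint.h>

/* A 4x2 memory used as a 2-input, 2-output LUT (hypothetical type).
 * Each word holds two bits: bit 1 is output F, bit 0 is output G. */
typedef struct {
    uint8_t word[4];
} Lut4x2;

/* "Downloading bits": fill the memory so that F = a AND b, G = a XOR b.
 * The choice of functions is illustrative, not taken from the slide. */
static Lut4x2 lut_program_and_xor(void) {
    Lut4x2 lut;
    for (unsigned addr = 0; addr < 4; addr++) {
        unsigned a = (addr >> 1) & 1u;
        unsigned b = addr & 1u;
        lut.word[addr] = (uint8_t)(((a & b) << 1) | (a ^ b));
    }
    return lut;
}

/* Evaluating the logic is just a memory read at address {a, b}. */
static unsigned lut_f(const Lut4x2 *lut, unsigned a, unsigned b) {
    return (lut->word[((a & 1u) << 1) | (b & 1u)] >> 1) & 1u;
}

static unsigned lut_g(const Lut4x2 *lut, unsigned a, unsigned b) {
    return lut->word[((a & 1u) << 1) | (b & 1u)] & 1u;
}
```

Loading different words into the same memory would give a different pair of functions without changing any "hardware", which is the sense in which the circuit is implemented by downloading bits.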

FPGAs are "Programmable" like Microprocessors – Just Download Bits

Microprocessor Binaries ("Software")
  Bits loaded into program memory

FPGA "Binaries" ("Hardware")
  Bits loaded into LUTs and SMs
  More commonly known as "bitstream"

[Figure: bit strings (001010010…, 01110100...) being downloaded into a processor and into a processor with FPGA]

FPGA – Why (Sometimes) Better than Microprocessor

C Code for Bit Reversal

x = (x >>16) | (x <<16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Binary (after compilation)

sll $v1[3],$v0[2],0x10
srl $v0[2],$v0[2],0x10
or $v0[2],$v1[3],$v0[2]
srl $v1[3],$v0[2],0x8
and $v1[3],$v1[3],$t5[13]
sll $v0[2],$v0[2],0x8
and $v0[2],$v0[2],$t4[12]
or $v0[2],$v1[3],$v0[2]
srl $v1[3],$v0[2],0x4
and $v1[3],$v1[3],$t3[11]
sll $v0[2],$v0[2],0x4
and $v0[2],$v0[2],$t2[10]
...

Processor: requires between 32 and 128 cycles

Circuit for Bit Reversal

[Figure: wires connecting the bits of the Original X Value to the Bit Reversed X Value]

FPGA: requires only 1 cycle (speedup of 32x to 128x)
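The contrast on the bit-reversal slide can be checked in software. The sketch below wraps the slide's five swap steps in a function and compares them with a bit-by-bit version that mimics what the circuit does with wires alone. The function names are assumptions made here, and the code assumes a 32-bit unsigned value.

```c
#include <stdint.h>

/* The slide's five shift-and-mask steps, for a 32-bit unsigned value. */
uint32_t reverse_bits_swaps(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* What the circuit does: input bit i is wired to output bit 31 - i.
 * In hardware all 32 connections exist at once; here they are a loop. */
uint32_t reverse_bits_wiring(uint32_t x) {
    uint32_t out = 0;
    for (unsigned i = 0; i < 32; i++) {
        if ((x >> i) & 1u) {
            out |= (uint32_t)1 << (31 - i);
        }
    }
    return out;
}
```

Both functions return the same value for any input; the difference the slide points at is that the processor needs a sequence of instructions while the circuit needs no logic at all, only routing.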

FPGA: Why (Sometimes) Better than Microprocessor

C Code for FIR Filter

for (i=0; i < 128; i++)
  y[i] += c[i] * x[i]
...

Processor: 1000’s of instructions, several thousand cycles

Circuit for FIR Filter

[Figure: a row of multipliers feeding a tree of adders]

FPGA: ~7 cycles, speedup > 100x

In general, FPGA better due to circuit's concurrency, from bit-level to task level
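The "~7 cycles" figure is consistent with the circuit drawn on the FIR slide: 128 products can be summed by a binary adder tree with log2(128) = 7 levels. The sketch below is a software model of that structure, not the author's circuit. It assumes the products are summed into a single output, as the adder tree suggests (the slide's loop writes `y[i]`), and the names `fir_tree` and `TAPS` are chosen here.

```c
#include <stddef.h>

#define TAPS 128

/* Multiply all taps (one multiplier per tap in hardware), then reduce
 * the products pairwise, one adder-tree level per pass.
 * Returns the sum and reports how many adder levels were needed. */
long fir_tree(const int *c, const int *x, int *levels_out) {
    long stage[TAPS];
    for (size_t i = 0; i < TAPS; i++) {
        stage[i] = (long)c[i] * x[i];
    }

    int levels = 0;
    for (size_t width = TAPS; width > 1; width /= 2) {
        /* In hardware every adder of a level works in the same cycle. */
        for (size_t i = 0; i < width / 2; i++) {
            stage[i] = stage[2 * i] + stage[2 * i + 1];
        }
        levels++;
    }

    *levels_out = levels;
    return stage[0];
}
```

A sequential processor executes the 128 multiply-accumulate iterations one after another, while the tree needs only as many steps as it has levels, which is where the slide's speedup of more than 100x comes from.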

Extensive Studies over Past Decade

Large speedups on many important applications
See ACM/SIGDA Int. Symp. on FPGAs

So why aren't FPGAs ubiquitous?

[Figure: two bar charts of speedup, with y-axes 0 to 60 and 0 to 30; off-scale bars labeled 79.2, 200, and 500]

Why FPGAs aren’t Ubiquitous

Cost – But improving yearly
Power – But improving yearly, and energy benefits too
Extra chip – But integration continues
Programming methodology

[Figure: 1 million system gate FPGA cost. Source: Xilinx]

Why FPGAs aren’t Mainstream

Cost
Power
Extra chip
Programming methodology
  Though tremendous progress in past decade

[Figure: two implementation flows for an application (C/C++/Java/SystemC/Handel-C/Streams-C/…), divided by automated hardware/software partitioning]

Microprocessors: C/C++/Java; compilation (1960s, 1970s); assembly code; assembling, linking (1950s, 1960s); microprocessor binary; downloading

FPGA circuits: C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C...; behavioral synthesis (1990s); register transfers; RT synthesis (1980s, 1990s); logic equations / FSMs; logic synthesis, physical design (1970s, 1980s); FPGA binary; downloading

So What’s the Holdup?

FPGAs require special compilers
  Limits adoption – desktop world dominates
    100 software writers for every CAD user
    Millions of compiler seats worldwide, vs. 15,000 CAD seats
  Can't ignore "ecosystem" from separation of applications, tools, and architectures
    Just consider history of popular processors

[Figure: an application goes through a special compiler (includes synthesis, tech. map, place & route) rather than a standard compiler, producing a microprocessor binary and an FPGA binary for the processor and FPGA; ecosystem of applications, tools, and architectures joined by standard binaries]

Can we Hide FPGAs from Programmers and Standard Tools?

Example: radically different x86 architectures hidden from programmers and tools
  All execute standard x86 binaries
  On-chip tools dynamically translate binary to particular architecture

Idea: Hide FPGA from programmers and tools
  Download standard binary
  Have on-chip tools dynamically translate binary (portions) to FPGA
  We call this Warp Processing

[Figure: SW goes through profiling and a standard compiler to a standard binary ("traditional partitioning done here"); on-chip translators map the binary to a RISC architecture, a VLIW architecture, or a processor with FPGA]

Warp Processing Idea

Step 1: Initially, software binary loaded into instruction memory

[Figure: µP with I Mem and D$, Profiler, FPGA, and On-chip CAD]

Software Binary

Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4

Step 2: Microprocessor executes instructions in software binary

[Figure: µP, I Mem, D$, Profiler, FPGA, and On-chip CAD, with the µP active; time and energy bars for the µP]

Step 3: Profiler monitors instructions and detects critical regions in binary

[Figure: µP, I Mem, D$, Profiler, FPGA, and On-chip CAD, with the Profiler active; it observes a repeating stream of add and beq instructions. Label: Critical Loop Detected]
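The slides do not say how the profiler detects a critical loop. One simple possibility, sketched below purely as an assumption, is to count taken backward branches per target address and report the most frequent target once it passes a threshold. The `Profiler` type, the table size, and all function names are hypothetical.

```c
#include <stdint.h>

#define PROFILER_SLOTS 16

/* Hypothetical profiler state: a small table of loop-start addresses
 * and how often a backward branch to each one was taken. */
typedef struct {
    uint32_t target[PROFILER_SLOTS];
    uint32_t count[PROFILER_SLOTS];
    int used;
} Profiler;

void profiler_init(Profiler *p) {
    p->used = 0;
}

/* Record one taken branch. Only backward branches (target <= pc) are
 * counted, since those are the ones that close loops. */
void profiler_branch(Profiler *p, uint32_t pc, uint32_t target) {
    if (target > pc) {
        return;
    }
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILER_SLOTS) {
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Returns 1 and writes the hottest loop start if its count has reached
 * the threshold; returns 0 otherwise. */
int profiler_hot_loop(const Profiler *p, uint32_t threshold,
                      uint32_t *target_out) {
    int best = -1;
    for (int i = 0; i < p->used; i++) {
        if (best < 0 || p->count[i] > p->count[best]) {
            best = i;
        }
    }
    if (best < 0 || p->count[best] < threshold) {
        return 0;
    }
    *target_out = p->target[best];
    return 1;
}
```

Feeding it the branch of the example loop ten times would mark the loop's start address as the critical region that the on-chip CAD reads in next.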

Step 4: On-chip CAD reads in critical region

[Figure: µP, I Mem, D$, Profiler, FPGA, and On-chip CAD, with the On-chip CAD active; time and energy bars]

Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)

[Figure: µP, I Mem, D$, Profiler, FPGA, and On-chip CAD, here labeled Dynamic Part. Module (DPM); time and energy bars]

reg3 := 0
reg4 := 0

loop:
reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop

ret reg4

Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit

[Figure: the CDFG (reg3 := 0, reg4 := 0, ..., ret reg4) with its loop replaced by a tree of adders; µP, I Mem, D$, Profiler, and On-chip CAD; time and energy bars]

Step 7: On-chip CAD maps circuit onto FPGA

[Figure: the adder-tree circuit mapped onto the FPGA's CLBs and SMs; µP, I Mem, D$, Profiler, and On-chip CAD; time and energy bars]

Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more

Software Binary

Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4

[Figure: the adder-tree circuit running on the FPGA's CLBs and SMs next to the µP; time and energy bars comparing software-only with “warped” execution]

Warp Processing Challenges

Two key challenges
  Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?

[Figure: µP with I$ and D$, FPGA, Profiler, and On-chip CAD. On-chip CAD stages: profiling & partitioning, decompilation, synthesis, JIT FPGA compilation, binary updater. Artifacts: binary, std. HW binary, FPGA binary, micropr. binary]

Decompilation

If we don't decompile
  High-level information (e.g., loops, arrays) lost during compilation
  Direct translation of assembly to circuit – big overhead
Need to recover high-level information

[Figure: bar chart of speedup, energy, and size (y-axis 0 to 8) for g3fax, adpcm, crc, des, engine, jpeg, summin, v42, and avg. Caption: Overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]

Decompilation

Solution – Recover high-level information from binary: decompilation
  Adapted extensive previous work (for different purposes)
  Developed new decompilation methods also
  Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
  Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

Original C Code

long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}

Corresponding Assembly, then Control/Data Flow Graph Creation

reg3 := 0
reg4 := 0
loop:
reg1 := reg3 << 1
reg5 := reg2 + reg1
reg6 := mem[reg5 + 0]
reg4 := reg4 + reg6
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4

Data Flow Analysis

reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4

Function Recovery

long f( long reg2 ) {
  int reg3 = 0;
  int reg4 = 0;
loop:
  reg4 = reg4 + mem[reg2 + (reg3 << 1)];
  reg3 = reg3 + 1;
  if (reg3 < 10) goto loop;
  return reg4;
}

Control Structure Recovery

long f( long reg2 ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += mem[reg2 + (reg3 << 1)];
  }
  return reg4;
}

Array Recovery

long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += array[reg3];
  }
  return reg4;
}

The original C code and the array-recovered code are almost identical representations.

[Figure: flow from binary through decompilation, synthesis, and binary updater to std. HW binary and FPGA binary]
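The claim that the recovered code is almost identical to the source can be illustrated with a small check. The sketch below places a source-level version next to a version written the way the control-structure-recovery step leaves it, with an explicit byte address `base + (reg3 << 1)`. It assumes `short` is 2 bytes, which is what the shift by 1 encodes; the function names and the use of `memcpy` for the 16-bit load are choices made here, not part of the slides.

```c
#include <string.h>

_Static_assert(sizeof(short) == 2, "the shift by 1 assumes 2-byte shorts");

/* Source-level version, with the accumulator initialized. */
long f_source(const short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += a[i];
    }
    return accum;
}

/* Decompiled-style version: memory is a byte array and each element is
 * fetched from address base + (reg3 << 1), as in the recovered code. */
long f_recovered(const unsigned char *base) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        short value;
        memcpy(&value, base + (reg3 << 1), sizeof value);
        reg4 += value;
    }
    return reg4;
}
```

Array recovery amounts to noticing that the address pattern `base + (index << 1)` with a 2-byte load is an index into an array of `short`, at which point the two functions differ only in variable names.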

Decompilation Results vs. C

Compared with synthesis from C
  Synthesis after decompilation often quite similar
  Almost identical performance, small area overhead

                 Synthesis from C Code           Synthesis after Decompiling Binary
Example          Cycles   ClkFrq  Time   Area    Cycles   ClkFrq  Time    Area    %TimeOverhead  %AreaOverhead
bit_correlator   258      118     2.19   15      258      118     2.186   15      0%             0%
fir              129      125     1.03   359     129      125     1.032   371     0%             3%
udiv8            281      190     1.48   398     281      190     1.479   398     0%             0%
prewitt          64516    123     525    2690    64516    123     524.5   4250    0%             58%
mf9              258      57      4.5    1048    258      57      4.503   1048    0%             0%
moravec          195072   66      2951   680     195072   70      2791    676     -6%            -1%
Avg:                                                                              -1%            10%

FPGA 2005
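The overhead columns of the results table can be reproduced from the area columns with a few lines of C. The helper names below are chosen here. A side observation, which is an inference rather than something the slide states: dividing Cycles by ClkFrq reproduces the Time column for most rows if the clock is in MHz and the time in microseconds.

```c
/* Percentage by which the decompiled-binary result exceeds the
 * synthesis-from-C result. */
double overhead_percent(double from_c, double from_binary) {
    return 100.0 * (from_binary - from_c) / from_c;
}

/* Arithmetic mean of n values. */
double average(const double *values, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += values[i];
    }
    return sum / n;
}
```

For example, `overhead_percent(2690, 4250)` gives about 58 for prewitt, and averaging the six area overheads gives about 10, matching the table's last row.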

Decompilation Results on Optimized H.264
In-depth Study with Freescale

Used highly-optimized benchmark
Results: Binary approach competitive
  Speedups compared to ARM9 software: Binary: 2.48, C: 2.53
  Decompilation recovered nearly all high-level information needed for partitioning and synthesis

[Figure: two line charts, "Speedup from C Partitioning" and "Speedup from Binary Partitioning"; x-axis: Number of Functions in Hardware (1 to 51); y-axis: Speedup (0 to 10)]

Function Name                    Instr  %Time   Cumulative Speedup
MotionComp_00                    33     6.8%    1.1
InvTransform4x4                  63     12.5%   1.1
FindHorizontalBS                 47     16.7%   1.2
GetBits                          51     20.8%   1.3
FindVerticalBS                   44     24.7%   1.3
MotionCompChromaFullXFullY       24     28.6%   1.4
FilterHorizontalLuma             557    32.5%   1.5
FilterVerticalLuma               481    35.8%   1.6
FilterHorizontalChroma           133    39.0%   1.6
CombineCoefsZerosInvQuantScan    69     42.0%   1.7
memset                           20     44.9%   1.8
MotionCompensate                 167    47.7%   1.9
FilterVerticalChroma             121    50.3%   2.0
MotionCompChromaFracXFracY       48     53.0%   2.1
ReadLeadingZerosAndOne           56     55.6%   2.3
DecodeCoeffTokenNormal           93     57.5%   2.4
DeblockingFilterLumaRow          272    59.4%   2.5
DecodeZeros                      79     61.3%   2.6
MotionComp_23                    279    63.0%   2.7
DecodeBlockCoefLevels            56     64.6%   2.8
MotionComp_21                    281    66.2%   3.0
FindBoundaryStrengthPMB          44     67.7%   3.1
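The speedup column in the H.264 function table looks consistent with an Amdahl's-law bound: if the listed %Time values are read as the cumulative fraction of execution time moved to hardware, and hardware is assumed to take zero time, the speedup is 1 / (1 - fraction). That reading is an inference from the numbers, not something the slide states, and the function names below are chosen here.

```c
/* Upper bound on speedup when a fraction of software time moves to
 * hardware that takes zero time (Amdahl's law with infinite speedup). */
double ideal_speedup(double hw_fraction) {
    return 1.0 / (1.0 - hw_fraction);
}

/* Same law when the hardware runs that fraction hw_speedup times
 * faster than software, rather than infinitely fast. */
double amdahl_speedup(double hw_fraction, double hw_speedup) {
    return 1.0 / ((1.0 - hw_fraction) + hw_fraction / hw_speedup);
}
```

With the last row's 67.7%, `ideal_speedup(0.677)` is about 3.1, matching the table. The second function shows why measured speedups stay below that bound: any non-zero hardware time adds back to the denominator.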

Tangent: Simple Coding Guidelines Bring Speedups Closer to Ideal

Interesting discovery during H264 study – C style limited speedup
  Orthogonal to binary vs. C issue – coding style hurt both
  Developed simple coding guidelines
  Rewritten software: 20 minutes, and only ~3% slower than original
  New speedups: Binary: 6.55, C: 6.56
Binary still competitive with C
Following guidelines not required, but helps any approach targeting FPGAs

[Figure: line chart; x-axis: number of functions in hardware (1 to 51); y-axis: speedup (0 to 10); series: Ideal Speedup (Zero-time Hw Execution), Speedup After Rewrite (C Partitioning), Speedup After Rewrite (Binary Partitioning), Speedup from C Partitioning, Speedup from Binary Partitioning]

Warp Processing Challenges

Two key challenges
  Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?

[Figure: µP with I$ and D$, FPGA, Profiler, and On-chip CAD; decompilation, synthesis, and binary updater producing the std. HW binary, FPGA binary, and micropr. binary]

JIT FPGA Compilation

Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA
  e.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at a cost of 30% longer critical path. Similar results for synthesis and placement
  Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
  Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

DAC’04

Tool                                        Time    Memory
Xilinx ISE                                  9.1 s   60 MB
Riverside JIT FPGA tools                    0.2 s   3.6 MB
Riverside JIT FPGA tools on a 75MHz ARM7    1.4 s   3.6 MB

[Figure: flow from binary through decompilation, synthesis, and binary updater to std. HW binary and FPGA binary]

Overall Warp Processing Results
Performance Speedup (Most Frequent Kernel Only)

Average kernel speedup of 41, vs. 21 for Virtex-E
Simpler FPGA fabric yields faster HW circuits
Currently prototyping our simpler FPGA fabric with Intel, scheduled for Q3 shuttle
Overall application speedup average is 7.4

[Figure: bar chart of speedup relative to SW only execution (y-axis 0 to 80; off-scale bars labeled 191, 113, and 130) for Warp Proc. and Xilinx Virtex-E]

FPGA Ubiquity via Obscurity

Warp processing hides FPGA from languages and tools
  ANY microprocessor platform extendible with FPGA
  Maintains "ecosystem": application, tool, and architecture developers
New platforms with FPGAs appearing

[Figure: a standard SW binary passes through a translator to processor and FPGA; ecosystem of applications, tools, and architectures joined by standard binaries; new processor platforms with FPGA evolving]

FPGA Standard Binaries?

Microprocessor binary represents one form of a "standard binary for FPGAs"
  Missing is explicit concurrency
    Parallelism, pipelining, queues, etc.
As FPGAs appear in more platforms, might a more general FPGA binary evolve?

[Figure: SystemC? passes through a standard FPGA compiler to a standard FPGA binary?, shown next to the standard SW binary and translator for processor and FPGA; an ecosystem for FPGAs (applications, tools, and architectures joined by standard FPGA binaries) is presently sorely missing]

FPGA Standard Binaries?

Translator makes best use of existing FPGA resources
Can even add FPGA, like adding memory, to improve performance
  Add more FPGA to your PDA to implement compute-intensive application?

[Figure: one FPGA binary (a circuit of multipliers feeding an adder tree) passes through a translator on each platform; low-end PDA: 100 sec; high-end PDA: 1 sec]

Summary

FPGAs may be the new software
Hiding FPGA via warp processing is feasible
  Decompilation can recover high-level constructs to yield speedups competitive with source-level
  JIT FPGA compilation can be made sufficiently lean
Future: Standard binaries for FPGAs?
  Extensive work to be done

Publications can be found at: http://www.cs.ucr.edu/~vahid/pubs