
Reiner Hartenstein (keynote address): "From Organic Computing to Reconfigurable Computing"; The 8th Workshop on Parallel Systems and Algorithms (PASA 2006), co-located with the 19th Int'l Conf. on Architecture of Computing Systems (ARCS 2006), Frankfurt/Main, Germany, March 13-16, 2006

Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de

From Organic Computing to Reconfigurable Computing

Reiner Hartenstein

TU Kaiserslautern

PASA, Frankfurt, March 16, 2006

© 2005, http://hartenstein.de


Reconfigurable Computing (RC) and FPGA* in the media

Design Starts until 2010: from 80,000 to 110,000

[Dataquest]

June 2005

fastest growing segment of the semiconductor market:

~6 billion US-$ [Dataquest]

*) Field-Programmable Gate Array

Google: 10 million hits



The Pervasiveness of RC

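[Figure: # of hits by Google for searches of the form “FPGA and ….” (search terms not shown). Hit counts in the first panel: 162,000; 127,000; 158,000; 113,000; 171,000; 194,000. Hit counts in the second panel: 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000.]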



>> Outline <<

• Reconfigurable Computing Paradox

• Von Neumann losing its dominance

• Software vs. Configware

• The dual paradigm approach

• Coarse-grained Reconfigurable Devices

• Conclusions

http://www.uni-kl.de



The RC Paradox

Effective integration density much worse than the Gordon Moore curve: by a factor of more than 10,000

„very power-hungry“ [Rick Kornfeld*]

*) personal communication

application development: until recently still Logic Design on a very strange platform

The awful technology of FPGAs:

FPGAs run at lower clock frequencies, draw more power and are more expensive.



fine-grained RC: low effective integration density

immense area inefficiency

reconfigurability overhead

routing congestion

wiring overhead

overhead: > 10,000

[Figure: transistors / microchip (10^0 to 10^9, log scale) versus year (1980 to 2010); density curves for FPGA physical, FPGA routed, and FPGA logical.]


published speed-up factors

http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

[Figure: relative performance (10^0 to 10^9, log scale) versus year (1980 to 2010), with the microprocessor curve from the 8080 to the P4 for reference and a level of 100 000 marked.]

Los Alamos traffic simulation: 47
real-time face detection: 6000
video-rate stereo vision: 900
pattern recognition: 730
SPIHT wavelet-based image compression: 457
Smith-Waterman pattern matching: 288
BLAST: 52
protein identification: 40
molecular dynamics simulation: 88
Reed-Solomon Decoding: 2400
Viterbi Decoding: 400
FFT: 100
MAC: 1000
Grid-based DRC (no FPGA: DPLA on MoM by TU-KL): 2000
2-D FIR filter (no FPGA: DPLA by TU-KL): 39.4
Lee Routing (DPLA by TU-KL): 160
Grid-based DRC („fair comparison“): 15000
GRAPE: 20

Application areas shown: DSP and wireless; Image processing, Pattern matching, Multimedia; Bioinformatics; Astrophysics; crypto



HeHon‘s Law

[Figure: MOPS / milliWatt (1 to 1000, log scale) versus feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); curves for RISC and FPGA.]



However ....

Application migration [from supercomputer] resulting in performance increase up to 4 orders of magnitude

Reducing electricity bill by an order of magnitude

Hits the memory wall from a different direction

People think that high-performance must mean expensive



why the RC paradigm shift is so important

Move the stool or the grand piano?

by Software

by Configware



Cray XD1

vN paradigm losing its dominance

Xilinx inside !


von Neumann is not the common model

[Figure: the von Neumann instruction-stream-based machine: a CPU (program counter, DPU) and RAM memory, joined by the von Neumann bottleneck. Mainframe age: CPU (software). Microprocessor age: instruction-stream-based CPU (software) plus data-stream-based accelerator co-processors (hardware).]

the tail is wagging the dog

vN paradigm dominance ?



Here is the common model

[Figure: the common model: a CPU (program counter, DPU) with RAM memory, plus data-stream-based accelerator co-processors. Mainframe age: CPU (software). Microprocessor age: CPU (software) plus hardwired accelerator (hardware). Configware age: CPU (software) plus reconfigurable accelerator (morphware).]



Here is the common model

[Figure: the common model across the mainframe age, the microprocessor age, and the configware age. In the configware age, a software/configware co-compiler serves both the CPU (software) and the reconfigurable accelerator (morphware).]



Fundamentally different mind set

no program counter

non-von-Neumann

completely different OS principles

no instruction fetch at run time

it’s configware: definitely it is not software



Compilation: Software vs. Configware

Software Engineering: source program → software compiler → software code

Configware Engineering: source „program“ → configware compiler:
mapper (placement & routing) → configware code
data scheduler → flowware code

source languages: C, FORTRAN, MATLAB


Nick Tredennick’s Paradigm Shifts explain the differences

CPU:
resources: fixed
algorithm: variable (software)
1 programming source needed

reconfigurable:
resources: variable (configware)
algorithm: variable (flowware)
2 programming sources needed



Co-Compilation

source (C, FORTRAN, MATLAB) → Software / Configware Co-Compiler:
automatic SW / CW partitioner
software compiler → software code
configware compiler:
mapper → configware code
data scheduler → flowware code



Organic Computing ? Bio-inspired use of FPGAs

• evolvable „hardware“ community:

• crossover of chromosomes

• In love with genetic algorithms: Darwinian path to fitness through generations of populations

• inefficient, but unexpected results possible

• simulated annealing (genetic morphing) - fitness by synthesis: highly efficient



Software / Configware Co-Compilation

C language source → Partitioner, guided by Analyzer / Profiler and by Resource Parameters (supporting different platforms)

SW compiler → SW code (“vN" machine paradigm)

CW compiler → CW Code and FW Code (Kress/Kung machine paradigm)

Juergen Becker’s CoDe-X, 1996
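The partitioning step in this flow can be illustrated with a toy sketch in Python. It is not CoDe-X's algorithm: the greedy rule, the `Region` record, the profile numbers, and the region names are all assumptions made up for illustration. It only shows the idea that profiler output and resource parameters together decide which parts go to the SW compiler and which to the CW compiler.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str              # hypothetical code region, e.g. a loop nest
    runtime_share: float   # fraction of run time (assumed profiler output)
    rdpus_needed: int      # assumed mapper estimate; 0 means not mappable

def partition(regions, rdpus_available):
    """Greedy split: the hottest mappable regions go to configware
    for as long as reconfigurable resources remain."""
    sw, cw = [], []
    for region in sorted(regions, key=lambda r: r.runtime_share, reverse=True):
        if 0 < region.rdpus_needed <= rdpus_available:
            cw.append(region.name)
            rdpus_available -= region.rdpus_needed
        else:
            sw.append(region.name)
    return sw, cw

regions = [
    Region("fir_loop", 0.60, 48),
    Region("io_setup", 0.05, 0),
    Region("histogram", 0.25, 130),
    Region("control", 0.10, 0),
]
sw_parts, cw_parts = partition(regions, rdpus_available=160)
print("software:", sw_parts)
print("configware:", cw_parts)
```

With these made-up numbers the hot, mappable loop goes to configware, while the second loop stays in software because the remaining array is too small for it. A real partitioner would also weigh communication cost between the two sides.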



Co-Compiler for Hardwired Kress/Kung Machine [e. g. Brodersen]

source → Software / Flowware Co-Compiler:
automatic SW / CW partitioner
software compiler → software code
flowware compiler:
data scheduler → flowware code



The dual paradigm approach

von Neumann paradigm: CPU

Kress-Kung paradigm: ASM



DPA (pipe network)

[Figure: a DPA surrounded by ASMs. Input data streams and output data streams are drawn as grids over time and port #, with data items (x) and empty slots (-).]

Data streams (flowware)

Flowware defines: ... which data item at which time at which port

algebraic synthesis algorithms: H. T. Kung paradigm (systolic array)

ASM: Auto-Sequencing Memory (RAM with GAG)

ASM implemented by distributed memory
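The statement that flowware defines which data item arrives at which time at which port can be made concrete with a small Python sketch. The one-step skew per port is an assumption chosen because it gives the diagonal wavefront familiar from systolic arrays; the item names are hypothetical.

```python
def skewed_schedule(rows):
    """Return sorted (time, port, item) triples.
    Port p starts p time steps late (assumed skew)."""
    schedule = []
    for port, items in enumerate(rows):
        for k, item in enumerate(items):
            schedule.append((k + port, port, item))
    return sorted(schedule)

def render(schedule, n_ports):
    """Draw the schedule as a time-by-port grid; '-' marks an empty slot."""
    slots = {(t, p): item for t, p, item in schedule}
    last = max(t for t, _, _ in schedule)
    return [
        " ".join(str(slots.get((t, p), "-")) for p in range(n_ports))
        for t in range(last + 1)
    ]

rows = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
schedule = skewed_schedule(rows)
for line in render(schedule, n_ports=3):
    print(line)
```

Each printed row is one time step and each column one port, which mirrors the time / port # grids of x and - on the slide.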



DSP platform FPGA [courtesy Xilinx Corp.]

• 500 MHz Flexible Soft Logic Architecture, 200K Logic Cells
• 500 MHz Programmable DSP Execution Units
• 0.6-11.1 Gbps Serial Transceivers
• 500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit
• 1 Gbps Differential I/O
• 500 MHz multi-port Distributed 10 Mb SRAM
• 500 MHz DCM Digital Clock Management



Generalization of the systolic array ....

discard algebraic synthesis methods

[Rainer Kress]

use optimization algorithms instead

for example: simulated annealing

the achievement: non-linear and non-uniform pipes, and even wilder pipe structures, become possible

now reconfigurability makes sense

remedy?



Coarse grain is about computing, not logic

Example: mapping onto rDPA by DPSS: based on simulated annealing

SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]

array size: 10 x 16 = 160 rDPUs

rDPU: reconfigurable function block, e. g. 32 bits wide

no CPU

Legend: rDPU not used; rDPU used for routing only (route-through); operator and routing; port location marker; backbus connect
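Placement by simulated annealing can be sketched in a few lines of Python. This is not DPSS: the cost function (total Manhattan wire length), the linear cooling schedule, the move set, and the eight-operator pipe are assumptions picked to keep the example short.

```python
import math
import random

def wire_length(placement, nets):
    """Total Manhattan distance over all connected operator pairs."""
    return sum(
        abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
        for a, b in nets
    )

def anneal(operators, nets, rows, cols, steps=5000, t0=5.0, seed=0):
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    placement = dict(zip(operators, cells))   # operator -> rDPU cell
    free = cells[len(operators):]             # unused rDPUs
    cost = wire_length(placement, nets)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling (assumed)
        op = rng.choice(operators)
        old = placement[op]
        if free and rng.random() < 0.5:
            # move the operator to a free cell
            i = rng.randrange(len(free))
            placement[op], free[i] = free[i], old
            other = None
        else:
            # swap cells with another operator
            other = rng.choice(operators)
            placement[op], placement[other] = placement[other], old
        new_cost = wire_length(placement, nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        elif other is None:
            free[i], placement[op] = placement[op], free[i]   # undo the move
        else:
            placement[other], placement[op] = placement[op], placement[other]  # undo the swap
    return best, best_cost

operators = [f"op{i}" for i in range(8)]
nets = [(operators[i], operators[i + 1]) for i in range(7)]   # a linear pipe
best, best_cost = anneal(operators, nets, rows=4, cols=4)
print(best_cost)
```

Because the optimizer only sees a cost function, nothing forces the resulting pipe to be linear or uniform, which is the point made on the slide about generalizing the systolic array.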


coarse-grained RC: high integration density

[Figure: transistors / microchip (10^0 to 10^9, log scale) versus year (1980 to 2010); the FPGA routed density curve with its > 10,000 gap is shown for comparison.]

The Reconfigurable Computing Paradox



Claassen‘s Law

+ Hartenstein‘s Amendment

[Figure: MOPS / milliWatt (0.001 to 1000, log scale) versus feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); DSP curve.]



commercial rDPA example: PACT XPP - XPU128

XPP128 rDPA

• Full 32 or 24 Bit Design working silicon
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator

[Figure: (r)DPA made of rDPUs; each rDPU is a PAE core with ALU and Ctrl, alongside CFG blocks; buses not shown.]

© PACT AG, http://pactcorp.com



Conclusions

RC is reducing cost without loss of performance and flexibility.

FPGAs may be configured like a micro-processor executing C/C++ code.

An FPGA can perform a specific algorithm at very high speed.

Using a high-level language, the FPGA can be programmed for a wide variety of algorithms without any deep knowledge of the underlying architecture.

RC is reducing the electricity bill and the required building floor area

Speed-up factors of up to 4 orders of magnitude have been reported

Compared to ASICs, prototyping time is on the order of hours rather than months, with a cost less than a tenth of that for an ASIC.

The personal supercomputer is near



Conclusions (2)

We urgently need Reconfigurable Computing Education

An Update of CS curricula is overdue


END



thank you



The first archetype machine model

Software Industry’s Secret of Success: simple basic Machine Paradigm

main frame CPU

personalization: RAM-based

procedural personalization: compile or assemble

instruction-stream-based mind set: “von Neumann”



An Archetype Common Model needed

Guidance for organizing efficient solutions

Make the project manageable

Allow sharing of lessons between applications and between application areas

Useful simple archetype not widely accepted

Archetype common model should provide ....

Progress stalled by the software/configware chasm

Configware Industry from the



The 2nd archetype machine model

Configware Industry’s Secret of Success: simple basic Machine Paradigm

reconfigurable accelerator

personalization: RAM-based

structural personalization: compile

data-stream-based mind set: “Kress-Kung”



configware solution: computing in space

[Figure: array of rDPUs; for demo: a tiny section of the pipe network, with an adder (+) producing S.]

inter-rDPU-communication: no memory cycles needed


Compare it to software solution on CPU

on a very simple CPU

S = R + (if C then A else B endif);

C = 1

Steps, each counted in memory cycles and nano seconds:

if C then read A: read instruction, instruction decoding, read operand*, operate & register transfers

if not C then read B: read instruction

add & store: read instruction, operate & register transfers, store result

total

[Figure: inputs A, B, R, C; a block labelled =1 and an adder (+) produce S; Clock 200.]



hypothetical branching example to illustrate software-to-configware migration

simple conservative CPU example, C = 1

step                          memory cycles   nano seconds
if C then read A
  read instruction            1               100
  read operand*               1               100
  operate & reg. transfers
if not C then read B
add & store
  operate & reg. transfers
  store result                1               100
total                         5               500

*) if no intermediate storage in register file

[Figure: section of a major pipe network on rDPU: inputs A, B, R, C; a block labelled =1 and an adder (+) produce S; clock 200 MHz (5 nanosec).]
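The hypothetical branching example can be sketched in Python to show both the dataflow view and the arithmetic behind the comparison. The constants come from the slide. The assumption that the pipe section delivers one result per clock is mine, as are the function names and the sample streams.

```python
MEMORY_CYCLE_NS = 100    # from the slide: 1 memory cycle = 100 ns
CPU_MEMORY_CYCLES = 5    # from the slide: total for the C = 1 path
CLOCK_MHZ = 200          # from the slide: 200 MHz, i.e. 5 ns per clock

def pipe_section(a, b, r, c):
    """Dataflow view of S = R + (if C then A else B): A and B are both
    present, C steers one of them into the adder. No instruction is
    fetched to take the decision."""
    selected = a if c else b
    return r + selected

def stream(a_items, b_items, r_items, c_items):
    """One result per data item across the four input streams."""
    return [pipe_section(a, b, r, c)
            for a, b, r, c in zip(a_items, b_items, r_items, c_items)]

cpu_ns = CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS   # time per result on the CPU
clock_ns = 1000 / CLOCK_MHZ                    # clock period of the pipe
ratio = cpu_ns / clock_ns                      # assumes one result per clock

print(stream([1, 2], [10, 20], [100, 200], [1, 0]))
print(cpu_ns, clock_ns, ratio)
```

Under that assumption, the 500 ns of memory cycles on the CPU compare with a 5 ns clock period in the pipe network, even though no instruction stream is involved at all.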



The wrong mind set ....


„but you can‘t implement decisions!“

[Figure: section of a very large pipe network: inputs A, B, R, C, a block labelled =1, and an adder (+).]

not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm



The hardware / software chasm

If I use the term "software", a variety of images might appear in the engineering audience's mind.

Still we have "hardware" engineers and "software" engineers that go to different schools, attend different conferences, avoid each other's cocktail parties, and almost never play on the same volleyball teams at the company picnic. System designers begin to plan their creations around the skill sets and development processes of hardware engineers and software engineers. The two become oil and water.




Blurred line between hardware and software

The line between "hardware" and "software" is rapidly blurring and even becoming irrelevant from a system design perspective. As this happens, the traditional roles and skillsets of hardware and software engineers are being challenged, and a new generation of designers is emerging as a result.

the obfuscation caused by the pervasiveness of softness.



We need Reconfigurable Computing Education

We need a unification in dealing with problems, which are shared across many different application domains

There is an urgent need to cure severe qualification deficiencies of our graduates.

We need new curricula in CS and CE for providing an integrating dual paradigm mind set instead of vN-only


Terminology clean-up

Software: for scheduling instruction streams

Flowware: for scheduling data streams

Configware: for configuring morphware

Programming sources:

von Neumann

primarily

non-von Neumann



Why coarse grain

much more MOPS/milliWatt

reconfigurable Data Path Unit (e. g. rALU)

mind set close to classical computing background

instead of rLB (~1 bit wide) use rDPU (e. g. 32 bits wide)

instead of FPGA use rDPA

[Figure: 4 x 4 array of rDPUs, labelled Reconfigurable Computing (RC).]

much more area-efficient

much less reconfigurability overhead



„data stream“: an ambiguous definition

Reconfigurable Computing is not instruction-stream-based

it‘s data-stream-based

it‘s different from the operation of the (indeterministic) „dataflow machine“

other definition also from multimedia area

usable definition from systolic array area



>> Outline <<

• Reconfigurable Devices


• Data-stream-based Computing

• The contemporary Common Model

• Reconfigurable Supercomputing




Why the speed-up ...

... although the FPGA clock is slower by a factor of 3 or even more (most know-how from the „high level synthesis“ discipline)

decisions without memory cycles or clock cycles

most „data fetch“ without memory cycle



data moved around by software

i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall

P&R: move locality of operation, not data !

stolen from Bob Colwell


Replace Caches by ...

stolen from Bob Colwell

caches

… by 16 x 16 reconfigurable data path array (rDPA)

which fits on the same chip



Similarly skilled

With hardware description languages, hardware engineers had to adopt the methodologies and techniques of software engineers. Increased softness has an impact even on our products themselves.

The required skills for your respective jobs are converging (against the grain in an age of increased specialization) and you'll soon be working with (and competing against) a new generation of embedded engineers that are similarly skilled in both disciplines.



Using FPGAs

Reducing cost without loss of performance and flexibility.

It may be configured like a general flexible micro-processor executing conventional C/C++ code. The programmability of FPGAs distinguishes them from ASICs.

An FPGA can perform a specific algorithm at very high speed. Compared to ASICs, prototyping time is on the order of hours rather than months, with a cost less than a tenth of that for an ASIC.

Using a high-level language, the FPGA can be programmed for a wide variety of algorithms without any deep knowledge of the underlying architecture.

Field-programmable FPGAs



Co-Compiler Enabling Technology

is available from academia

only a small team needed for commercial re-implementation

on the road map to the Personal Supercomputer



Conclusions (1)

We need a unification in dealing with problems, which are shared across many different application domains.

RC suffers from fragmentation into different cultures of the many application domains.

CS is the only domain qualified for such an effort



Conclusions (2)

IEEE Computer Society should advocate improved application development methodologies

and a common educational approach useful for the wide variety of application domains

inside IEEE Computer Society, a TC on RC should lobby for more


Conclusions (3)

reverse the downtrend in CS enrolment

educate not only students …

increase membership

make CS more fascinating

Strategic issue for entire IEEE Computer Society



Conclusions (4)

The personal supercomputer is near, not only for the desktop, but also as a new road map to large-scale supercomputing at performance levels unthinkable until now.

IEEE-CS should accept this fascinating challenge, by spearheading the paradigm shift.

IEEE-CS is needed as a translator to explain the impact to managers and to a wide public.



RC education last week at Karlsruhe

Attendees declared themselves ready to work for a task force

35 submissions from Australia, Brazil, India, USA, and throughout Europe

But education is just one of several facets ……



However ....

“What did you say again that your company does?” my father asked. “Gate arrays,” I replied. “They’re chips used to…”

“Oh yes, that’s right, Gatorade.” ….. “I used to give that to my marching band members so they wouldn’t get dehydrated on hot days. Don’t remember it coming in chip form …..”

Explain to your grandmother what it means if you’re one of the world’s leading experts on optical proximity correction (OPC) for nanometer-scale semiconductor lithography?

Could you perhaps relate it to some difficulty she has with needlepoint and her cataracts?

Even those with a scientific or technical background often won’t understand precisely what we do. A PhD in molecular biology won’t help to understand VHDL and Verilog synthesis for FPGAs.

Trying to relate DNA sequences to LUT truth tables might offer a starting point, but somebody has to be able to bridge the technology and terminology gap, even to initiate that analogy.

Try explaining FPGAs with the consumer electronics approach. “People tend to relate when you tell them what your part goes into. Today, finally, ‘chip’ seems universally understood. I never get people asking about potato chips anymore.”



However ....

Abstract. Google’s jaw-dropping hit rates illustrate the pervasiveness of Reconfigurable Computing (RC), which has been mainstream in embedded systems for years and is now being adopted by supercomputing (Cray, SGI, etc.). From FPGA usage as accelerators, speed-up factors of up to two orders of magnitude are reported, as well as floor space requirements and electricity bills reduced by one order of magnitude. About 3 orders of magnitude and more are obtained by using coarse-grained reconfigurable datapath arrays (rDPAs) available from a number of start-ups. This is astonishing, since FPGAs and rDPAs have a substantially lower clock speed than microprocessors. Algorithmic cleverness is the secret of success, based on software to configware migration mechanisms, striving away from memory-cycle-hungry instruction-stream-based computing paradigms. The main benefit of RC platforms, which have replaced the use of hardwired accelerators, is their flexibility through non-procedural programmability. This also contributes to those concepts of Organic Computing which rely on processes of evolution, self-organization, adaptation and fault tolerance. The main hurdles on the way to heart-stopping new horizons of cheap highest performance are CS-related educational deficits, which cause the configware / software chasm, and a methodology fragmentation between the different cultures of application domains. Current CS curricula do not sufficiently meet their transdisciplinary responsibility. The talk gives a survey of fundamental issues in RC and of new directions in CS-related curricula, focused on a dual paradigm organic computing approach.



However ....

Application migration [from supercomputer] resulting in performance increase up to 4 orders of magnitude

„Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“

[Herb Riley, R. Associates]

Reducing electricity bill by an order of magnitude

Hits the memory wall from a different direction
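As a rough plausibility check of the quoted saving, the Python lines below convert it into energy and average power. The dollar figure, the price per kWh, and the 64 processors come from the quote. Continuous operation for 8760 hours a year is my assumption, and since the quote says "more than", the results are lower bounds.

```python
SAVINGS_USD_PER_YEAR = 10_000   # from the quote ("more than")
PRICE_USD_PER_KWH = 0.07        # from the quote: 7 cents / kWh
PROCESSORS_PER_RACK = 64        # from the quote
HOURS_PER_YEAR = 24 * 365       # assumption: the rack runs all year

saved_kwh = SAVINGS_USD_PER_YEAR / PRICE_USD_PER_KWH
avg_saved_kw = saved_kwh / HOURS_PER_YEAR
avg_saved_w_per_processor = 1000 * avg_saved_kw / PROCESSORS_PER_RACK

print(f"energy saved per year: {saved_kwh:.0f} kWh")
print(f"average power saved per rack: {avg_saved_kw:.1f} kW")
print(f"average power saved per processor: {avg_saved_w_per_processor:.0f} W")
```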


Conclusions

IEEE Computer Society should advocate the introduction of a dual paradigm approach, away from the monopoly of the vN mind set.

IEEE Computer Society should advocate a common model useful for the wide variety of application domains.



Conclusions

We need a unification in dealing with problems, which are shared across many different application domains.

RC suffers from fragmentation into different cultures of the many application domains.

Each domain uses its own trick box. We should teach the world to think outside the box

CS is the only domain qualified for this unification



An Archetype Common Model needed

Configware Industry from the

IEEE Computer Society should advocate the introduction of dual paradigm transdisciplinary education, using Configware Engineering as the counterpart of Software Engineering. New curricula in CS and CE should provide an integrating dual paradigm mind set that supports a unified treatment of problems shared across many different application domains, to cure severe qualification deficiencies of our graduates.
