The von Neumann Syndrome
Reiner Hartenstein
TU Kaiserslautern
TU Delft, Sept 28, 2007
http://hartenstein.de
(v.2)
von Neumann Syndrome
This term was coined by “RAM” (C. V. Ramamoorthy, emeritus, UC Berkeley).
The first Reconfigurable Computer
•prototyped in 1884 by Herman Hollerith
•a century before FPGA introduction
•data-stream-based
•60 years later the von Neumann (vN) model took over
•instruction-stream-based
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
The Spirit of the Mainframe Age
•For decades, we’ve trained programmers to think sequentially, breaking complex parallelism down into atomic instruction steps …
•Even in “hardware” courses (the unloved child of CS departments) we often teach von Neumann machine design, deepening this tunnel view …
•… finally tending toward code sizes of astronomic dimensions
•1951: Hardware Design going von Neumann (Microprogramming)
von Neumann: array of massive overhead phenomena
overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU                    instruction stream
i/o to/from off-chip RAM        instruction stream
… other overhead                instruction stream
… piling up to code sizes of astronomic dimensions
von Neumann: array of massive overhead phenomena
overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU                    instruction stream
i/o to/from off-chip RAM        instruction stream
… other overhead                instruction stream

… piling up to code sizes of astronomic dimensions
[R.H. 1975]: the universal bus considered harmful
[Dijkstra 1968]: the “go to” considered harmful
Temptations by von Neumann style software engineering lead to massive communication congestion.
Backus, 1978: Can Programming Be Liberated from the von Neumann Style?
Arvind et al., 1983: A Critique of Multiprocessing von Neumann Style
von Neumann overhead: just one example

overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU                    instruction stream
i/o to/from off-chip RAM        instruction stream
… other overhead                instruction stream
[1989]: in an image processing example, 94% of the computation load went only into moving this (sliding) window.
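To make this overhead concrete, here is a minimal C sketch (my illustration, not the 1989 system; sizes and names are assumptions): a 3 x 3 sliding-window mean filter in which moving the window costs index arithmetic and memory traffic on every pixel, all driven by the instruction stream at run time, while the useful arithmetic is only a handful of adds per pixel.

    /* Minimal sketch (assumed sizes and names): a 3x3 sliding-window
       mean filter. On a von Neumann machine every pixel visit costs
       address computation and load/store traffic, executed instruction
       by instruction at run time. */
    #include <stdio.h>

    #define W 8
    #define H 8

    static void mean3x3(const unsigned char in[H][W], unsigned char out[H][W])
    {
        for (int y = 1; y < H - 1; y++) {
            for (int x = 1; x < W - 1; x++) {
                int sum = 0;
                /* moving the window: 9 address computations per pixel,
                   all executed as instructions at run time */
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        sum += in[y + dy][x + dx];
                out[y][x] = (unsigned char)(sum / 9);
            }
        }
    }

    int main(void)
    {
        unsigned char in[H][W], out[H][W] = {{0}};
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                in[y][x] = (unsigned char)(x + y);
        mean3x3(in, out);
        printf("out[1][1] = %d\n", out[1][1]);
        return 0;
    }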
the Memory Wall
[Chart: Dave Patterson’s Law, the “Performance” Gap: µProc performance grows ~60%/yr, DRAM only ~7%/yr (1980-2005); the gap grows ~50%/yr and reaches ~1000x by 2005, where the 60%/yr curve ends.]
Instruction-stream code of astronomic size needs off-chip RAM, which fully hits the memory wall.
CPU clock speed ≠ performance: a processor’s silicon is mostly cache.
Better: compare off-chip vs. fast on-chip memory.
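A quick sanity check of the gap arithmetic (my numbers, derived from the slide’s growth rates): a 60%/yr processor curve against a 7%/yr DRAM curve widens the gap by roughly a factor of 1.5 per year, and three orders of magnitude accumulate in about 17 years:

\[
\frac{1.60}{1.07} \approx 1.50,
\qquad
1.50^{17} \approx 986 \approx 10^{3}.
\]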
Benchmarked Computational Density
[BWRC, UC Berkeley, 2004]
[Chart, stolen from Bob Colwell: SPECfp2000 per MHz per billion transistors, 1990-2005, for DEC Alpha, SUN, HP and IBM processors. Computational density falls steeply: Alpha down by ~100x in 6 years, IBM down by ~20x in 6 years.]
CPU caches: clock speed ≠ performance; a processor’s silicon is mostly cache.
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
The Manycore Future
• we are embarking on a new computing age -- the age of massive parallelism [Burton Smith]
• multiple von Neumann CPUs on the same µprocessor chip lead to exploding (vN) instruction stream overhead [R.H.]
• Even mobile devices will exploit multicore processors, also to extend battery life [B.S.]
• everyone will have multiple parallel computers [B.S.]
Several overhead phenomena
The instruction-stream-based parallel von Neumann approach: the watering pot model [Hartenstein] has several von Neumann overhead phenomena per CPU!
[Figure: a manycore chip drawn as an array of CPUs, one watering pot spout per CPU.]
Explosion of overhead by von Neumann parallelism
overhead                          von Neumann machine
monoprocessor (local) overhead:
instruction fetch                 instruction stream
state address computation         instruction stream
data address computation          instruction stream
data meet PU                      instruction stream
i/o to/from off-chip RAM          instruction stream
… other overhead                  instruction stream
parallel (global) overhead:
inter-PU communication            instruction stream
message passing                   instruction stream

The local overhead grows proportionately to the number of processors; the global overhead grows disproportionately to the number of processors.
[R.H. 2006]: MPI considered harmful
Rewriting Applications
•more processors means rewriting applications
•we need to map an application onto different size manycore configurations
•most applications are not readily mappable onto a regular array.
[Figure: a manycore CPU array contrasted with a reconfigurable datapath array (rDPUs).]
•Mapping is much less problematic with Reconfigurable Computing
Disruptive Development
•Computer industry is probably going to be disrupted by some very fundamental changes. [Iann Barron]
•I don‘t agree: we have a model.
•A parallel [vN] programming model for manycore machines will not emerge for five to ten years. [experts from Microsoft Corp.]
•We must reinvent computing. [Burton J. Smith]
•Reconfigurable Computing: technology is ready, users are not.
•It‘s mainly an education problem: the Education Wall.
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
The Reconfigurable Computing Paradox
•The spirit from the Mainframe Age is collapsing under the von Neumann syndrome.
•There is something fundamentally wrong in using the von Neumann paradigm.
•Migration to FPGAs yields up to 4 orders of magnitude speedup and tremendously slashes the electricity bill …
•… despite unfavorable FPGA technology: reconfigurability overhead, wiring overhead, routing congestion, slow clock speed.
•The reason for this paradox?
Parallelism beyond von Neumann
We need an approach like this: not the instruction-stream-based von Neumann approach (the watering pot model [Hartenstein], with several von Neumann overhead phenomena per CPU!), but data-stream-based RC*.
*) “RC” = Reconfigurable Computing
von Neumann overhead vs. Reconfigurable Computing
overhead                        von Neumann machine        hardwired anti machine    reconfigurable anti machine
                                (using program counter)    (using data counters)     (using reconfigurable data counters)
instruction fetch               instruction stream         none*                     none*
state address computation       instruction stream         none*                     none*
data address computation        instruction stream         none*                     none*
data meet PU + other overhead   instruction stream         none*                     none*
i/o to/from off-chip RAM        instruction stream         none*                     none*
inter-PU communication          instruction stream         none*                     none*
message passing overhead        instruction stream         none*                     none*

*) configured before run time
[Figure: rDPA, a reconfigurable datapath array (coarse-grained reconfigurable): no instruction fetch at run time.]
von Neumann overhead vs. Reconfigurable Computing

overhead                        von Neumann machine        hardwired anti machine    reconfigurable anti machine
                                (using program counter)    (using data counters)     (using reconfigurable data counters)
instruction fetch               instruction stream         none*                     none*
state address computation       instruction stream         none*                     none*
data address computation        instruction stream         none*                     none*
data meet PU + other overhead   instruction stream         none*                     none*
i/o to/from off-chip RAM        instruction stream         none*                     none*
inter-PU communication          instruction stream         none*                     none*
message passing overhead        instruction stream         none*                     none*

*) configured before run time
**) just by the reconfigurable address generator
[1989]: x 17 speedup by the GAG** (image processing example)
[1989]: x 15,000 total speedup from this migration project
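A minimal sketch (my illustration; the field names are assumptions, not Hartenstein’s actual GAG design) of the data-counter idea behind the GAG: configure once, before run time, then step through a 2D scan window with no per-address instruction fetch.

    /* Minimal sketch (assumed names): a "data counter" configured
       before run time, which then emits a 2D scan-window address
       stream with no per-address instruction fetch. */
    #include <stdio.h>

    typedef struct {          /* configuration, written before run time */
        int base, width;      /* image base address and row width       */
        int x0, y0, x1, y1;   /* scan window bounds                     */
        int x, y;             /* data counter state                     */
    } gag_t;

    static void gag_config(gag_t *g, int base, int width,
                           int x0, int y0, int x1, int y1)
    {
        g->base = base; g->width = width;
        g->x0 = x0; g->y0 = y0; g->x1 = x1; g->y1 = y1;
        g->x = x0; g->y = y0;
    }

    /* one step of the address stream; returns -1 when the scan is done */
    static int gag_next(gag_t *g)
    {
        if (g->y > g->y1) return -1;
        int addr = g->base + g->y * g->width + g->x;
        if (++g->x > g->x1) { g->x = g->x0; g->y++; }
        return addr;
    }

    int main(void)
    {
        gag_t g;
        gag_config(&g, 0, 8, 1, 1, 3, 2);   /* 3x2 window in an 8-wide image */
        for (int a = gag_next(&g); a >= 0; a = gag_next(&g))
            printf("%d ", a);               /* prints: 9 10 11 17 18 19 */
        putchar('\n');
        return 0;
    }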
Reconfigurable Computing means …
• Reconfigurable Computing means moving overhead from run time to compile time**.
• For HPC, run time is more precious than compile time.
• Reconfigurable Computing replaces “looping”* at run time …
http://www.tnt-factory.de/videos_hamster_im_laufrad.htm
… by configuration before run time.
*) e. g. complex address computation
**) or loading time
Data meeting the Processing Unit (PU): we have 2 choices
• by Software: routing the data to the PU by memory-cycle-hungry instruction streams through shared memory
• by Configware, data-stream-based: placement* of the execution locality, via a pipe network generated by configware compilation
… explaining the RC advantage
*) before run time
What pipe network? A pipe network organized at compile time: a generalization* of the systolic array [R. Kress, 1995].
rDPA = rDPU array, i. e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter).
[Figure: two rDPA arrays; ports at the array edge receive or send data streams.]
*) supporting non-linear pipes on free-form heterogeneous arrays, depending on the connect fabric
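For a flavor of what such a pipe computes, a minimal C sketch (my illustration; names and structure are assumed, not taken from the talk): a linear pipe of datapath stages, the systolic-array special case of an rDPA pipe network, where data stream through configured multiply-accumulate stages with no instruction fetch at run time.

    /* Minimal sketch (assumed names): a 4-stage FIR filter as a
       linear pipe of datapath units. Each stage holds one coefficient
       configured before run time; data stream through the pipe. */
    #include <stdio.h>

    #define TAPS 4

    typedef struct {
        int coeff;   /* configured before run time          */
        int x;       /* datum currently held by this stage  */
    } stage_t;

    /* one pipe step: data advance one stage, then every stage fires;
       the sequential C loops only model what the stages do in parallel */
    static int pipe_step(stage_t s[TAPS], int in)
    {
        for (int i = TAPS - 1; i > 0; i--)
            s[i].x = s[i - 1].x;
        s[0].x = in;
        int acc = 0;
        for (int i = 0; i < TAPS; i++)
            acc += s[i].coeff * s[i].x;
        return acc;
    }

    int main(void)
    {
        stage_t s[TAPS] = { {1, 0}, {2, 0}, {3, 0}, {4, 0} };  /* configuration */
        const int samples[] = { 1, 0, 0, 0, 5 };
        for (int t = 0; t < 5; t++)
            printf("y[%d] = %d\n", t, pipe_step(s, samples[t]));
        return 0;
    }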
Migration benefit by on-chip RAM
• Some RC chips have hundreds of on-chip RAM blocks, orders of magnitude faster than off-chip RAM.
• Multiple on-chip RAM blocks are the enabling technology for ultra-fast anti machine solutions …
• … so that the drastic code size reduction by software-to-configware migration can beat the memory wall.
[Figure: an rDPA surrounded by ASMs. ASM = auto-sequencing memory (GAG + data counter + RAM); the GAGs inside the ASMs generate the data streams. GAG = generic address generator. rDPA = rDPU array, i. e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter).]
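A minimal sketch (my illustration, assumed names) of the auto-sequencing idea: an ASM couples a fast on-chip RAM block with a GAG-style data counter, so that once configured it delivers a data stream on its own, with no per-word address instructions.

    /* Minimal sketch (assumed names): an auto-sequencing memory (ASM)
       = RAM block + address generator. Configured once before run
       time, it then delivers a data stream word by word. */
    #include <stdio.h>

    typedef struct {
        const int *ram;        /* fast on-chip RAM block             */
        int next, stride, end; /* counter state, configured before run */
    } asm_t;

    /* pop one stream element; returns 0 when the stream is exhausted */
    static int asm_pop(asm_t *m, int *out)
    {
        if (m->next >= m->end) return 0;
        *out = m->ram[m->next];
        m->next += m->stride;
        return 1;
    }

    int main(void)
    {
        const int block[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };
        asm_t m = { block, 0, 2, 8 };   /* stream every 2nd word */
        for (int v; asm_pop(&m, &v); )
            printf("%d ", v);           /* prints: 10 12 14 16 */
        putchar('\n');
        return 0;
    }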
Coarse-grained Reconfigurable Array example
Image processing: SNN filter (mainly a pipe network), compiled by Nageldinger‘s KressArray Xplorer (Juergen Becker‘s CoDe-X inside).
Note: a kind of software perspective, but without instruction streams: data streams + pipelining.
Array size: 10 x 16 = 160 such rDPUs; rDPUs are 32 bits wide and mesh-connected (for exceptions see the figure); 3 x 3 fast on-chip RAM.
[Figure legend: rDPU not used / used for routing only / operator and routing / port location marker / backbus connect / route-thru only.]
Coming close to the programmer‘s mind set (much closer than an FPGA).
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
Software / Configware Co-Compilation
Reconfigurable Computing: technology is ready -- users are not?
But we need a dual paradigm approach, to run legacy software together with configware.
Apropos compilation: the CoDe-X co-compiler [Juergen Becker, 1996].
[Diagram: C language source -> analyzer/profiler -> partitioner; SW code -> SW compiler (“vN” machine paradigm); CW code -> CW compiler (anti machine paradigm) -> FW code.]
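As a purely illustrative sketch (this is not CoDe-X; the heuristic and all names are invented for illustration), the co-compilation idea in a few lines: a partitioner routes loop-heavy statements to the configware path and everything else to the software path.

    /* Toy sketch (not CoDe-X): a partitioner sends loop nests to the
       configware (anti machine) path; the rest stays software (vN). */
    #include <stdio.h>
    #include <string.h>

    static const char *partition(const char *stmt)
    {
        /* toy heuristic: loop nests become configware */
        return strstr(stmt, "for") ? "CW code (anti machine)"
                                   : "SW code (vN machine)";
    }

    int main(void)
    {
        const char *src[] = {
            "init();",
            "for (i...) for (j...) pixel[i][j] = filter(...);",
            "report();",
        };
        for (int i = 0; i < 3; i++)
            printf("%-50s -> %s\n", src[i], partition(src[i]));
        return 0;
    }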
Curricula from the mainframe age
[Figure, a brain metaphor (this is not a lecture on brain regions): the procedural hemisphere is served; the structural hemisphere, needed for non-von-Neumann accelerators, is disabled.]
• non-von-Neumann accelerators: no common model, not really taught -- the education wall
• the main problem: the common model is ready, but users are not
We need a twin paradigm education
Brain usage: both hemispheres, procedural and structural (this is not a lecture on brain regions).
Each side needs its own common model.
RCeducation 2008
http://fpl.org/RCeducation/
The 3rd International Workshop on Reconfigurable Computing Education
April 10, 2008, Montpellier, France
teaching RC ?
We need new courses
“We urgently need a Mead-&-Conway-like text book” [R. H., Dagstuhl Seminar 03301, Germany, 2003]
We need undergraduate lab courses with HW / CW / SW partitioning
We need new courses with extended scope on parallelism and algorithmic cleverness for HW / CW / SW co-design
2007: Here it is!
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
Conclusions
•Von-Neumann-type instruction streams considered harmful [R.H.]
•But we still need von Neumann for some small code sizes, old legacy software, etc. …
•Data streaming is the key model of parallel computation – not vN
•The twin paradigm approach is inevitable, also in education [R.H.]
•We need to increase the population of HPC-competent people [B.S.]
•We need to increase the population of RC-competent people [R.H.]
An Open Question
Please reply to:
• Coarse-grained arrays: technology ready*, users not ready.
• Much closer to the programmer’s mind set: really much closer than FPGAs**.
• Which effect is delaying the breakthrough?
*) offered by startups (PACT Corp. and others)
**) “FPGAs? Do we need to learn hardware design?”
Disruptive Development
The way the industry has grown up writing software -- the languages we chose, the models of synchronization and orchestration -- does not lead toward uncovering parallelism or allowing large-scale composition of big systems. [Iann Barron]
Dual paradigm mind set: an old hat
Software mind set, instruction-stream-based: flow chart -> control instructions.
Mapped into a hardware mind set: action box = flip-flop, decision box = (de)multiplexer (mapping from the procedural to the structural domain).
[Figure, 1967 / 1972: a flow chart mapped onto flip-flops (FF); a token bit evokes each action box.]
W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.
C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans. C-21/5, May 1972
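A minimal sketch (my illustration, assumed encoding) of this old mapping: one-hot “token bit” flip-flops evoke the action boxes, and the decision box acts as a demultiplexer routing the token.

    /* Minimal sketch (assumed encoding): flow chart mapped to hardware,
       simulated in C. One-hot token bits stand for flip-flops evoking
       action boxes; the decision box is a demultiplexer for the token. */
    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
        bool ff_a = true, ff_b = false, ff_c = false;  /* token at A */
        int x = 2;

        for (int cycle = 0; cycle < 4; cycle++) {
            bool next_a = false, next_b = false, next_c = false;
            if (ff_a) {                /* action box A: x = x - 1     */
                x--;
                /* decision box: demultiplexer routes the token      */
                if (x > 0) next_b = true; else next_c = true;
            }
            if (ff_b) next_a = true;   /* action box B: loop back     */
            if (ff_c) printf("done, x = %d\n", x);
            ff_a = next_a; ff_b = next_b; ff_c = next_c;
        }
        return 0;
    }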