The von Neumann Syndrome
Reiner Hartenstein
TU Kaiserslautern
TU Delft, Sept 28, 2007
http://hartenstein.de
(v.2)
von Neumann Syndrome
This term was coined by “RAM” (C. V. Ramamoorthy, emeritus, UC Berkeley).
The first Reconfigurable Computer
•prototyped in 1884 by Herman Hollerith
•a century before FPGA introduction
•data-stream-based
•60 years later the von Neumann (vN) model took over
•instruction-stream-based
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
The Spirit of the Mainframe Age
•For decades, we’ve trained programmers to think sequentially, breaking complex parallelism down into atomic instruction steps …
•Even in “hardware” courses (the unloved child of CS departments) we often teach von Neumann machine design, deepening this tunnel view …
•… finally tending toward code sizes of astronomic dimensions
•1951: Hardware Design going von Neumann (Microprogramming)
von Neumann: array of massive overhead phenomena
overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU                    instruction stream
i/o to/from off-chip RAM        instruction stream
… other overhead                instruction stream
… piling up to code sizes of astronomic dimensions
von Neumann: array of massive overhead phenomena
overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU                    instruction stream
i/o to/from off-chip RAM        instruction stream
… other overhead                instruction stream

… piling up to code sizes of astronomic dimensions
[R.H. 1975]: the universal bus considered harmful
[Dijkstra 1968]: the “go to” considered harmful
Temptations by von Neumann style software engineering lead to massive communication congestion.
Backus, 1978: Can Programming Be Liberated from the von Neumann Style?
Arvind et al., 1983: A Critique of Multiprocessing von Neumann Style
von Neumann overhead: just one example

overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU                    instruction stream
i/o to/from off-chip RAM        instruction stream
… other overhead                instruction stream
[1989]: in an image processing example, 94% of the computation load went only into moving this (sliding) window.
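To make this overhead concrete, here is a minimal C sketch (my illustration, not the 1989 system; sizes and names are assumptions): a 3 x 3 sliding-window mean filter in which moving the window costs index arithmetic and memory traffic on every pixel, all driven by the instruction stream at run time, while the useful arithmetic is only a handful of adds per pixel.

    /* Minimal sketch (assumed sizes and names): a 3x3 sliding-window
       mean filter. On a von Neumann machine every pixel visit costs
       address computation and load/store traffic, executed instruction
       by instruction at run time. */
    #include <stdio.h>

    #define W 8
    #define H 8

    static void mean3x3(const unsigned char in[H][W], unsigned char out[H][W])
    {
        for (int y = 1; y < H - 1; y++) {
            for (int x = 1; x < W - 1; x++) {
                int sum = 0;
                /* moving the window: 9 address computations per pixel,
                   all executed as instructions at run time */
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        sum += in[y + dy][x + dx];
                out[y][x] = (unsigned char)(sum / 9);
            }
        }
    }

    int main(void)
    {
        unsigned char in[H][W], out[H][W] = {{0}};
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                in[y][x] = (unsigned char)(x + y);
        mean3x3(in, out);
        printf("out[1][1] = %d\n", out[1][1]);
        return 0;
    }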
the Memory Wall
[Chart: Dave Patterson’s Law, the “Performance” Gap: µProc performance grows ~60%/yr, DRAM only ~7%/yr (1980-2005); the gap grows ~50%/yr and reaches ~1000x by 2005, where the 60%/yr curve ends.]
Instruction-stream code of astronomic size needs off-chip RAM, which fully hits the memory wall.
CPU clock speed ≠ performance: a processor’s silicon is mostly cache.
Better: compare off-chip vs. fast on-chip memory.
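A quick sanity check of the gap arithmetic (my numbers, derived from the slide’s growth rates): a 60%/yr processor curve against a 7%/yr DRAM curve widens the gap by roughly a factor of 1.5 per year, and three orders of magnitude accumulate in about 17 years:

\[
\frac{1.60}{1.07} \approx 1.50,
\qquad
1.50^{17} \approx 986 \approx 10^{3}.
\]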
Benchmarked Computational Density
[BWRC, UC Berkeley, 2004]
[Chart, stolen from Bob Colwell: SPECfp2000 per MHz per billion transistors, 1990-2005, for DEC Alpha, SUN, HP and IBM processors. Computational density falls steeply: Alpha down by ~100x in 6 years, IBM down by ~20x in 6 years.]
CPU caches: clock speed ≠ performance; a processor’s silicon is mostly cache.
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
The Manycore Future
• we are embarking on a new computing age -- the age of massive parallelism [Burton Smith]
• multiple von Neumann CPUs on the same µprocessor chip lead to exploding (vN) instruction stream overhead [R.H.]
• Even mobile devices will exploit multicore processors, also to extend battery life [B.S.]
• everyone will have multiple parallel computers [B.S.]
Several overhead phenomena
The instruction-stream-based parallel von Neumann approach: the watering pot model [Hartenstein] has several von Neumann overhead phenomena per CPU!
[Figure: a manycore chip drawn as an array of CPUs, one watering pot spout per CPU.]
Explosion of overhead by von Neumann parallelism
overhead                          von Neumann machine
monoprocessor (local) overhead:
instruction fetch                 instruction stream
state address computation         instruction stream
data address computation          instruction stream
data meet PU                      instruction stream
i/o to/from off-chip RAM          instruction stream
… other overhead                  instruction stream
parallel (global) overhead:
inter-PU communication            instruction stream
message passing                   instruction stream

The local overhead grows proportionately to the number of processors; the global overhead grows disproportionately to the number of processors.
[R.H. 2006]: MPI considered harmful
Rewriting Applications
•more processors means rewriting applications
•we need to map an application onto different size manycore configurations
•most applications are not readily mappable onto a regular array.
[Figure: a manycore CPU array contrasted with a reconfigurable datapath array (rDPUs).]
•Mapping is much less problematic with Reconfigurable Computing
Disruptive Development
•Computer industry is probably going to be disrupted by some very fundamental changes. [Iann Barron]
•I don‘t agree: we have a model.
•A parallel [vN] programming model for manycore machines will not emerge for five to ten years. [experts from Microsoft Corp.]
•We must reinvent computing. [Burton J. Smith]
•Reconfigurable Computing: technology is ready, users are not.
•It‘s mainly an education problem: the Education Wall.
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
The Reconfigurable Computing Paradox
•The spirit from the Mainframe Age is collapsing under the von Neumann syndrome.
•There is something fundamentally wrong in using the von Neumann paradigm.
•Migration to FPGAs yields up to 4 orders of magnitude speedup and tremendously slashes the electricity bill …
•… despite unfavorable FPGA technology: reconfigurability overhead, wiring overhead, routing congestion, slow clock speed.
•The reason for this paradox?
Parallelism beyond von Neumann
We need an approach like this: not the instruction-stream-based von Neumann approach (the watering pot model [Hartenstein], with several von Neumann overhead phenomena per CPU!), but data-stream-based RC*.
*) “RC” = Reconfigurable Computing
von Neumann overhead vs. Reconfigurable Computing
overhead                        von Neumann machine        hardwired anti machine    reconfigurable anti machine
                                (using program counter)    (using data counters)     (using reconfigurable data counters)
instruction fetch               instruction stream         none*                     none*
state address computation       instruction stream         none*                     none*
data address computation        instruction stream         none*                     none*
data meet PU + other overhead   instruction stream         none*                     none*
i/o to/from off-chip RAM        instruction stream         none*                     none*
inter-PU communication          instruction stream         none*                     none*
message passing overhead        instruction stream         none*                     none*

*) configured before run time
[Figure: rDPA, a reconfigurable datapath array (coarse-grained reconfigurable): no instruction fetch at run time.]
von Neumann overhead vs. Reconfigurable Computing

overhead                        von Neumann machine        hardwired anti machine    reconfigurable anti machine
                                (using program counter)    (using data counters)     (using reconfigurable data counters)
instruction fetch               instruction stream         none*                     none*
state address computation       instruction stream         none*                     none*
data address computation        instruction stream         none*                     none*
data meet PU + other overhead   instruction stream         none*                     none*
i/o to/from off-chip RAM        instruction stream         none*                     none*
inter-PU communication          instruction stream         none*                     none*
message passing overhead        instruction stream         none*                     none*

*) configured before run time
**) just by the reconfigurable address generator
[1989]: x 17 speedup by the GAG** (image processing example)
[1989]: x 15,000 total speedup from this migration project
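A minimal sketch (my illustration; the field names are assumptions, not Hartenstein’s actual GAG design) of the data-counter idea behind the GAG: configure once, before run time, then step through a 2D scan window with no per-address instruction fetch.

    /* Minimal sketch (assumed names): a "data counter" configured
       before run time, which then emits a 2D scan-window address
       stream with no per-address instruction fetch. */
    #include <stdio.h>

    typedef struct {          /* configuration, written before run time */
        int base, width;      /* image base address and row width       */
        int x0, y0, x1, y1;   /* scan window bounds                     */
        int x, y;             /* data counter state                     */
    } gag_t;

    static void gag_config(gag_t *g, int base, int width,
                           int x0, int y0, int x1, int y1)
    {
        g->base = base; g->width = width;
        g->x0 = x0; g->y0 = y0; g->x1 = x1; g->y1 = y1;
        g->x = x0; g->y = y0;
    }

    /* one step of the address stream; returns -1 when the scan is done */
    static int gag_next(gag_t *g)
    {
        if (g->y > g->y1) return -1;
        int addr = g->base + g->y * g->width + g->x;
        if (++g->x > g->x1) { g->x = g->x0; g->y++; }
        return addr;
    }

    int main(void)
    {
        gag_t g;
        gag_config(&g, 0, 8, 1, 1, 3, 2);   /* 3x2 window in an 8-wide image */
        for (int a = gag_next(&g); a >= 0; a = gag_next(&g))
            printf("%d ", a);               /* prints: 9 10 11 17 18 19 */
        putchar('\n');
        return 0;
    }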
Reconfigurable Computing means …
• Reconfigurable Computing means moving overhead from run time to compile time**.
• For HPC, run time is more precious than compile time.
• Reconfigurable Computing replaces “looping”* at run time …
http://www.tnt-factory.de/videos_hamster_im_laufrad.htm
… by configuration before run time.
*) e. g. complex address computation
**) or loading time
Data meeting the Processing Unit (PU): we have 2 choices
• by Software: routing the data to the PU by memory-cycle-hungry instruction streams through shared memory
• by Configware, data-stream-based: placement* of the execution locality, via a pipe network generated by configware compilation
… explaining the RC advantage
*) before run time
What pipe network? A pipe network organized at compile time: a generalization* of the systolic array [R. Kress, 1995].
rDPA = rDPU array, i. e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter).
[Figure: two rDPA arrays; ports at the array edge receive or send data streams.]
*) supporting non-linear pipes on free-form heterogeneous arrays, depending on the connect fabric
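For a flavor of what such a pipe computes, a minimal C sketch (my illustration; names and structure are assumed, not taken from the talk): a linear pipe of datapath stages, the systolic-array special case of an rDPA pipe network, where data stream through configured multiply-accumulate stages with no instruction fetch at run time.

    /* Minimal sketch (assumed names): a 4-stage FIR filter as a
       linear pipe of datapath units. Each stage holds one coefficient
       configured before run time; data stream through the pipe. */
    #include <stdio.h>

    #define TAPS 4

    typedef struct {
        int coeff;   /* configured before run time          */
        int x;       /* datum currently held by this stage  */
    } stage_t;

    /* one pipe step: data advance one stage, then every stage fires;
       the sequential C loops only model what the stages do in parallel */
    static int pipe_step(stage_t s[TAPS], int in)
    {
        for (int i = TAPS - 1; i > 0; i--)
            s[i].x = s[i - 1].x;
        s[0].x = in;
        int acc = 0;
        for (int i = 0; i < TAPS; i++)
            acc += s[i].coeff * s[i].x;
        return acc;
    }

    int main(void)
    {
        stage_t s[TAPS] = { {1, 0}, {2, 0}, {3, 0}, {4, 0} };  /* configuration */
        const int samples[] = { 1, 0, 0, 0, 5 };
        for (int t = 0; t < 5; t++)
            printf("y[%d] = %d\n", t, pipe_step(s, samples[t]));
        return 0;
    }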
Migration benefit by on-chip RAM
• Some RC chips have hundreds of on-chip RAM blocks, orders of magnitude faster than off-chip RAM.
• Multiple on-chip RAM blocks are the enabling technology for ultra-fast anti machine solutions …
• … so that the drastic code size reduction by software-to-configware migration can beat the memory wall.
[Figure: an rDPA surrounded by ASMs. ASM = auto-sequencing memory (GAG + data counter + RAM); the GAGs inside the ASMs generate the data streams. GAG = generic address generator. rDPA = rDPU array, i. e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter).]
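A minimal sketch (my illustration, assumed names) of the auto-sequencing idea: an ASM couples a fast on-chip RAM block with a GAG-style data counter, so that once configured it delivers a data stream on its own, with no per-word address instructions.

    /* Minimal sketch (assumed names): an auto-sequencing memory (ASM)
       = RAM block + address generator. Configured once before run
       time, it then delivers a data stream word by word. */
    #include <stdio.h>

    typedef struct {
        const int *ram;        /* fast on-chip RAM block             */
        int next, stride, end; /* counter state, configured before run */
    } asm_t;

    /* pop one stream element; returns 0 when the stream is exhausted */
    static int asm_pop(asm_t *m, int *out)
    {
        if (m->next >= m->end) return 0;
        *out = m->ram[m->next];
        m->next += m->stride;
        return 1;
    }

    int main(void)
    {
        const int block[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };
        asm_t m = { block, 0, 2, 8 };   /* stream every 2nd word */
        for (int v; asm_pop(&m, &v); )
            printf("%d ", v);           /* prints: 10 12 14 16 */
        putchar('\n');
        return 0;
    }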
Coarse-grained Reconfigurable Array example
Image processing: SNN filter (mainly a pipe network), compiled by Nageldinger‘s KressArray Xplorer (Juergen Becker‘s CoDe-X inside).
Note: a kind of software perspective, but without instruction streams: data streams + pipelining.
Array size: 10 x 16 = 160 such rDPUs; rDPUs are 32 bits wide and mesh-connected (for exceptions see the figure); 3 x 3 fast on-chip RAM.
[Figure legend: rDPU not used / used for routing only / operator and routing / port location marker / backbus connect / route-thru only.]
Coming close to the programmer‘s mind set (much closer than an FPGA).
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
Software / Configware Co-Compilation
Reconfigurable Computing: technology is ready -- users are not?
But we need a dual paradigm approach, to run legacy software together with configware.
Apropos compilation: the CoDe-X co-compiler [Juergen Becker, 1996].
[Diagram: C language source -> analyzer/profiler -> partitioner; SW code -> SW compiler (“vN” machine paradigm); CW code -> CW compiler (anti machine paradigm) -> FW code.]
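As a purely illustrative sketch (this is not CoDe-X; the heuristic and all names are invented for illustration), the co-compilation idea in a few lines: a partitioner routes loop-heavy statements to the configware path and everything else to the software path.

    /* Toy sketch (not CoDe-X): a partitioner sends loop nests to the
       configware (anti machine) path; the rest stays software (vN). */
    #include <stdio.h>
    #include <string.h>

    static const char *partition(const char *stmt)
    {
        /* toy heuristic: loop nests become configware */
        return strstr(stmt, "for") ? "CW code (anti machine)"
                                   : "SW code (vN machine)";
    }

    int main(void)
    {
        const char *src[] = {
            "init();",
            "for (i...) for (j...) pixel[i][j] = filter(...);",
            "report();",
        };
        for (int i = 0; i < 3; i++)
            printf("%-50s -> %s\n", src[i], partition(src[i]));
        return 0;
    }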
Curricula from the mainframe age
[Figure, a brain metaphor (this is not a lecture on brain regions): the procedural hemisphere is served; the structural hemisphere, needed for non-von-Neumann accelerators, is disabled.]
• non-von-Neumann accelerators: no common model, not really taught -- the education wall
• the main problem: the common model is ready, but users are not
We need a twin paradigm education
Brain usage: both hemispheres, procedural and structural (this is not a lecture on brain regions).
Each side needs its own common model.
RCeducation 2008
http://fpl.org/RCeducation/
The 3rd International Workshop on Reconfigurable Computing Education
April 10, 2008, Montpellier, France
teaching RC ?
We need new courses
“We urgently need a Mead-&-Conway-like text book” [R. H., Dagstuhl Seminar 03301, Germany, 2003]
We need undergraduate lab courses with HW / CW / SW partitioning
We need new courses with extended scope on parallelism and algorithmic cleverness for HW / CW / SW co-design
2007: Here it is!
Outline
• von Neumann overhead hits the memory wall
• The manycore programming crisis
• Reconfigurable Computing is the solution
• We need a twin paradigm approach
• Conclusions
Conclusions
•Von-Neumann-type instruction streams considered harmful [R.H.]
•But we still need von Neumann for some small code sizes, old legacy software, etc. …
•Data streaming is the key model of parallel computation – not vN
•The twin paradigm approach is inevitable, also in education [R.H.]
•We need to increase the population of HPC-competent people [B.S.]
•We need to increase the population of RC-competent people [R.H.]
An Open Question
Please reply to:
• Coarse-grained arrays: technology ready*, users not ready.
• Much closer to the programmer’s mind set: really much closer than FPGAs**.
• Which effect is delaying the breakthrough?
*) offered by startups (PACT Corp. and others)
**) “FPGAs? Do we need to learn hardware design?”
Disruptive Development
The way the industry has grown up writing software -- the languages we chose, the models of synchronization and orchestration -- does not lead toward uncovering parallelism or allowing large-scale composition of big systems. [Iann Barron]
Dual paradigm mind set: an old hat
Software mind set, instruction-stream-based: flow chart -> control instructions.
Mapped into a hardware mind set: action box = flip-flop, decision box = (de)multiplexer (mapping from the procedural to the structural domain).
[Figure, 1967 / 1972: a flow chart mapped onto flip-flops (FF); a token bit evokes each action box.]
W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.
C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans. C-21/5, May 1972
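A minimal sketch (my illustration, assumed encoding) of this old mapping: one-hot “token bit” flip-flops evoke the action boxes, and the decision box acts as a demultiplexer routing the token.

    /* Minimal sketch (assumed encoding): flow chart mapped to hardware,
       simulated in C. One-hot token bits stand for flip-flops evoking
       action boxes; the decision box is a demultiplexer for the token. */
    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
        bool ff_a = true, ff_b = false, ff_c = false;  /* token at A */
        int x = 2;

        for (int cycle = 0; cycle < 4; cycle++) {
            bool next_a = false, next_b = false, next_c = false;
            if (ff_a) {                /* action box A: x = x - 1     */
                x--;
                /* decision box: demultiplexer routes the token      */
                if (x > 0) next_b = true; else next_c = true;
            }
            if (ff_b) next_a = true;   /* action box B: loop back     */
            if (ff_c) printf("done, x = %d\n", x);
            ff_a = next_a; ff_b = next_b; ff_c = next_c;
        }
        return 0;
    }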