[email protected] 10 March 2014
Keynote, FPGA forum; 2-3 March 2010, Trondheim, Norway 1
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
We Need to Reinvent Computing to Avoid its
Future Unaffordable Electricity Consumption
Reiner Hartenstein
1 © 2010, [email protected] http://hartenstein.de
This talk is about HPRC
Googling "HPRC": the first of 167,000 hits
Better: googling "NSF center for HPRC"
www.chrec.org: the first of 48,400 hits
pronounced "shreck"
Hetero Systems
Not only at CHREC, the term HPRC means heterogeneous systems including both:
• Instruction Stream Parallelism (on manycore etc.)
• Data Stream Parallelism (on accelerators; on FPGAs)
A qualified programmer population does not exist
Outline
• Coming unaffordable electricity bill
• Massively saving energy by FPGAs
• What’s the problem with FPGAs ?
• The parallelism crisis
• We have to re-invent computing
• Conclusions
Crude oil prices per barrel
March 2010: >82 US-$
January 2011: >92 US-$
Never run out of energy?
[Figure: typical oil field operation, production (%) from 0 to 100; energy sources shown: oil, gas, coal, hydro, nuclear; labels "> 30 %" and "~ 55 %"]
2007: 80% of crude oil comes from fields in decline
natural gas: similar situation
„6 more Saudi Arabias needed for demand predicted for 2030“
[Fatih Birol, Chief Economist IEA]. https://www.theoildrum.com/
Beyond peak oil
Power Consumption of Computers
Energy cost may overtake IT equipment cost in the near future [Albert Zomaya]
... has become an industry-wide issue: incremental improvements are on track, „we may ultimately need revolutionary new solutions“ [Horst Simon, LBNL, Berkeley]
Power consumption by the internet: x30 by 2030 if trends continue [G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8–11 Sep 2008]
Server farm at Dallas [Randy Katz: IEEE Spectrum, Febr. 2009]
Electricity Bill: a Key Issue
„The possibility of computer equipment power consumption spiraling out of control could have serious consequences for the overall affordability of computing.” [L. A. Barroso, Google]
• As early as 2005, Google’s electricity bill was higher than the value of its equipment.
• Cost of a Google data center is determined only by its monthly power bill
• Patent for water-based data centers
• Google going to sell electricity
Supercomputers are Scientific Instruments
„In my opinion, the largest supercomputers at any time, including the first exaflops, should not be thought of as computers. They are strategic scientific instruments that happen to be built from computer technology. Their usage patterns and scientific impact are closer to major research facilities such as CERN, ITER, or Hubble.“ [Andrew Jones, vice president, Numerical Algorithms Group]
No reason to solve the power problem?
Overall affordability of computing
Exa-scale (10^18 computations/second): expected by 2018
Power estimations for a single supercomputer: 250 MW – 10 GW (twice New York City with 16 million people) [several sources]
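To put the quoted power range into perspective, the sketch below converts a constant power draw into annual energy and an annual bill. It assumes round-the-clock operation, and the price of 0.10 US-$ per kWh is a purely hypothetical parameter; neither assumption comes from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h, leap years ignored


def annual_energy_gwh(power_mw: float) -> float:
    """Energy in GWh consumed per year at a constant power draw given in MW."""
    return power_mw * HOURS_PER_YEAR / 1000.0


def annual_cost_musd(power_mw: float, usd_per_kwh: float = 0.10) -> float:
    """Yearly bill in million US-$; the default price is an assumption."""
    kwh_per_year = power_mw * 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh / 1e6


if __name__ == "__main__":
    # Lower and upper end of the power estimations quoted on the slide.
    for power_mw in (250.0, 10_000.0):
        print(power_mw, "MW:",
              annual_energy_gwh(power_mw), "GWh/year,",
              round(annual_cost_musd(power_mw)), "million US-$/year")
```

Under these assumptions the lower estimate alone corresponds to roughly 2.2 TWh per year.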
Why Computers are important
Business Information Systems … without computers: Lufthansa anno 1960
Potential of RC
Reconfigurable Computing offers an overwhelming reduction of electricity consumption as well as massive speed-up factors, both by up to several orders of magnitude.
Only Reconfigurable Computing can prevent the running of our infrastructures from becoming unaffordable in the future.
Speed-up factors are not new
by avoiding the von Neumann paradigm
[Figure: speed-up factor per application, logarithmic axis from 1 to 1,000,000]
DSP and wireless: FFT 100; Reed-Solomon decoding 2400; Viterbi decoding 400; MAC 1000
Bioinformatics: molecular dynamics simulation 88; BLAST 52; protein identification 40; Smith-Waterman pattern matching 288; DNA seq. 8723
Astrophysics: GRAPE 20
Image processing, pattern matching, multimedia: SPIHT wavelet-based image compression 457; real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; CT imaging 3000
Crypto: crypto 1000; DES breaking 28500
PISA project: >15000
Speed-up in the mid-80s
PISA project: one DPLA* replacing 256 early Xilinx FPGAs
*) fabricated by the E.I.S. project
Energy saving factors: ~10% of speedup
[Figure: the applications and speed-up factors of the preceding chart (DSP and wireless, bioinformatics, astrophysics, image processing, crypto), logarithmic axis from 10^0 to 10^6, annotated with the power save factors obtained]
Low Power Circuit Design: PowerOpt™ (ChipVision Design Systems) divides power consumption by up to 4
GPGPU and x86 multicore: no energy saving data available
RC*: Demonstrating the intensive Impact
SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster
[Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]

Application                   Speed-up factor   Savings: Power   Cost   Size
DNA and Protein sequencing    8723              779              22     253
DES breaking                  28514             3439             96     1116

massively saving energy; much less equipment needed
*) RC = Reconfigurable Computing
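The rule of thumb that energy saving factors are about 10% of the speed-up can be checked against the two table rows above. The following sketch only divides the figures quoted in the table; the function and variable names are illustrative.

```python
# (speed-up factor, power saving factor) as quoted in the table above
RESULTS = {
    "DNA and Protein sequencing": (8723, 779),
    "DES breaking": (28514, 3439),
}


def power_to_speedup_ratio(speedup: float, power_saving: float) -> float:
    """Power saving factor expressed as a fraction of the speed-up factor."""
    return power_saving / speedup


def ratios() -> dict:
    """Ratio per application, computed from the quoted figures only."""
    return {app: power_to_speedup_ratio(s, p) for app, (s, p) in RESULTS.items()}


if __name__ == "__main__":
    for app, ratio in ratios().items():
        print(f"{app}: {ratio:.1%} of the speed-up factor")
```

Both ratios come out between 8% and 13%, which is consistent with the "~10%" estimate.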
Drastically less Equipment needed
For instance: a hangar full of racks replaced by a single rack (or ½ rack) without air conditioning
Hetero HPC: SGI® RASC™
SGI® RASC™ Module (Version 1)
• Xilinx Virtex II-6000 FPGA
• 16MB QDR SRAM
• Rack-mountable
• Dual NUMAlink™ 4 ports
• Seamless direct attach to server's shared memory fabric
SGI® RASC™ RC100 Blade
• Dual Virtex 4 LX200 FPGAs
• 80MB QDR SRAM or 20GB DDR2 SDRAM
• Blade or rack-mountable form factor
• Dual NUMAlink™ 4 ports
• Seamless direct attach to server's shared memory fabric
The Reconfigurability Paradox
• Routing congestion
• Lower clock speed
• Reconfigurability overhead
• Wiring overhead
because of the von Neumann Syndrome
All but ALU is overhead: x20 inefficiency
(figure label: data cache)
x20 inefficiency: just one of several overhead layers
[R. Hameed et al.: Understanding Sources of Inefficiency in General-Purpose Chips; 37th ISCA, June 19-23, 2010, St. Malo, France]
Massive Overhead Phenomena
(figure annotations: some overheads are proportionate, others overproportionate to the number of processors)

overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU + other overh.     instruction stream
i/o to/from off-chip RAM        instruction stream
Inter PU communication          instruction stream
message passing overhead        instruction stream
transactional memory overh.     instruction stream
multithreading overhead etc.    instruction stream
Critique of the von Neumann Model
Critique of von Neumann is not new:
• Dijkstra 1968: Go To Statement Considered Harmful
• R. Hartenstein, G. Koch 1975: The universal Bus considered harmful
• Backus 1978: Can programming be liberated from the von Neumann style?
• Arvind et al. 1983: A critique of Multiprocessing the von Neumann Style
• Peter G. Neumann 1985-2003: 216x “Inside Risks“ (18 years on the inside back cover of Comm. ACM)
• Brad Cox 1990: Planning the Software Industrial Revolution
• L. Savain 2006: Why Software is bad
“von Neumann Syndrome” [C.V. Ramamoorthy; UC Berkeley]: overhead piles up to code sizes of astronomic dimensions
Nathan’s Law [Nathan Myhrvold]: Software is a gas. It expands to fill all its containers ...
Wirth’s Law: “software is slowing faster than hardware is accelerating“
>40 years Software Crisis
• F. L. Bauer 1968: coined „Software Crisis“
• N. N. 1995: The Standish Group Report
• Robert N. Charette 2005: Why Software Fails; IEEE Spectrum, Sep 2005
• Anthony Berglas 2008: Why it is Important that Software Projects Fail
The size of bureaucracy is independent of the amount of real work to be done. [Parkinson; The Economist, Nov 19th 1955; Oct 1957]
In 1955, Parkinson could not have foreseen the impact of software.
von Neumann overhead vs. Reconfigurable Computing

overhead                        von Neumann machine    datastream machine
instruction fetch               instruction stream     none*
state address computation       instruction stream     none*
data address computation        instruction stream     none*
data meet PU + other overh.     instruction stream     none*
i/o to/from off-chip RAM        instruction stream     none*
Inter PU communication          instruction stream     none*
message passing overhead        instruction stream     none*
transactional memory overh.     instruction stream     none*
multithreading overhead etc.    instruction stream     none*
Fab Line Cost
FPGA to ASIC design start ratio
97% FPGA, 3% ASIC
Dataquest, March 25, 2009: Most ASIC design in the world has stopped
Revenue down and up
“Pure” FPGA abandoned
The programmable logic industry abandoned “pure” FPGAs (a big field of programmable fabric surrounded by I/O).
Magic mixture of FPGA fabric and hard logic on the same die
Instead of FPGA fabric in custom SoC designs, we get custom SoCs in FPGAs: devices with a narrower application focus
Super-flexible ASSP-like devices: optimized hard-core design mixed with FPGA flexibility …
… to capture huge segments of the standard parts / ASSP market.
Tools, IP and support
Xilinx and Altera started FPGA synthesis development
A commercially viable design flow for FPGA fabric requires years of development and customer experience
Synthesis and place-and-route are NP-complete: monstrously complex software, no “magic bullet”.
Developing a robust synthesis and place-and-route tool suite requires years of testing and fine-tuning
FPGA synthesis advantages
Advantages of FPGA companies:
1) focus on optimizing their tools only for their own FPGAs
2) their FPGA synthesis teams influence future FPGA architectures (which EDA synthesis teams could not)
3) earliest possible access to the most detailed information about their company’s FPGA architectures
4) regular access to EDA companies’ tools for benchmarking: a known, measurable target to work against (remark: not intuitive)
Will EDA companies survive ?
Routing delays, not the logic, are the dominant timing factor
Only FPGA firms’ synthesis and place-and-route tools survive?
Big EDA firms are wrestling with these complexities
If EDA abandons FPGA synthesis, smaller FPGA firms are in deep trouble.
FPGA firms interfaced synthesis to routing delay estimations
Feedback from a huge variety of design projects world-wide
larger-than-EDA staffing levels
Performance Growth by Multicore?
[Figure: relative performance by year, 1994 to 2030, marking the beginning of the multicore era & massive programmer productivity problems]
„Multicore shifts the burden of Performance from Chip Designer to Software Developers.“ [J. Larus: Spending Moore's Dividend; C_ACM, May 2009]
Multimedia in the Multicore Era
[Figure: relative performance by year, 1994 to 2030, marking the beginning of the multicore era]
Multimedia Performance Needs
application performance needs up to: Audio 800 MIPS; Graphics 11 GOPS; Video 160 GOPS; Digital TV 900 GOPS [Pierre Paulin, MPSoC’09]
needed performance growing faster than Moore‘s law
[Figure: MIPS needed per wireless standard: GSM, GPRS, EDGE, UMTS, next standard; courtesy E. Sanchez]
vN passes into history
• Foundational change will disrupt traditional habits throughout the discipline – [Michael Wrinn]
• Suddenly, All Computing Is Parallel: Seizing Opportunity Amid the Clamor - [Michael Wrinn]
• The proud era of von Neumann architecture passes into history - [Michael Wrinn]
The Parallel Programming Problem
The parallel programming problem has been addressed, in HPC, for at least 25 years. The result: only a small number of specialized developers write parallel code. With multicore becoming ubiquitous, there is some hope that “if you build it, they will come” [T. Mattson, M. Wrinn]
The growing core counts are racing ahead of programming paradigms and programmer productivity
Also see the list „dead supercomputer society“
The Programmability Crisis
The vast majority of HPC or supercomputing applications were originally written for a single processor with direct access to main memory.
But the first petascale supercomputers employ more than 100,000 processor cores each, and distributed memory.
They hope that dozens of applications are inherently parallel enough to be laboriously decomposed, sliced and diced, for mapping onto HPC
Large applications are only modestly scalable: >50% of apps don’t scale beyond 8 cores, only 6% can exploit >128 PE cores, and only a tiny fraction 100,000 or more cores.
Some programming languages
Some languages for parallelism
Are new languages necessary?
“Humans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations.” [Sutter and Larus]
Language wars are religious wars
Concurrency in software is difficult because of the abstractions that have been chosen.
Just adding new features to existing languages?
Absurdly incomprehensible
Threads make programs absurdly incomprehensible because of their wildly nondeterministic nature [E. A. Lee]
Object-orientation limits the visibility of data
Concurrency models can operate at the component architecture level rather than at the programming language level [E. A. Lee]
E. A. Lee: Are new languages necessary for multicore? 2007
E. A. Lee: The problem with threads. Computer, 2006.
We need a new Textbook
having an impact like Mead & Conway: "The book that changed everything“ [Electronic Design News, Feb. 11, 2009]
We have to re-invent computing before writing it
We have to re-write anyway
We have to re-write software anyway (because of manycore)
We have to re-write part of the software into configware (because of the power wall)
For both we have to learn locality awareness
We have to re-invent programmer education
We need a tool flow to support a twin-paradigm approach and locality awareness
(Outline)
4. We have to re-invent computing
• Re-write anyway
• Hetero with strong FPGA involvement
• Need locality awareness -> model-based
• We need une levée en masse
A Clean Terminology, please

program source    result
Software          instruction streams
Flowware          data streams
Configware        datapath structures configured
Cray-XD1 Architecture features
The Cray-XD1 allows the Opteron µP to access the FPGA's internal registers, internal and external memory.
It provides several transfer modes between the µP and the FPGA:
• The µP can read from / write to the FPGA local memory space (i.e. internal registers, internal BRAMs, and external memory).
• The FPGA can read from / write to the µP local memory space.
The most bandwidth-efficient transfer mode is write-only mode (the producer initiates the transfer), either burst (for large amounts of data) or non-burst.
HLL programming models
What is best for locality awareness?
Too many HDLs
Some more hardware description languages
term              controlled by                  execution triggered by    paradigm
CPU               program counter (at ALU)       instruction fetch         instruction stream
DPU**, rDPU**     data counter(s) (at memory)    data arrival*             data-stream-based
*) “transport-triggered”
**) does not have a program counter - no instruction fetch
The single paradigm (from the mainframe age) is obsolete
Twin paradigm: + New Machine Model for FPGAs

term              controlled by                  execution triggered by    paradigm
CPU               program counter (at ALU)       instruction fetch         instruction stream
DPU**, rDPU**     data counter(s) (at memory)    data arrival*             data-stream
*) “transport-triggered”
**) does not have a program counter - no instruction fetch
Imperative Languages Twins

language category | Software Languages (Computer Languages) | Flowware Languages (Languages f. Anti Machine)
both deterministic | procedural sequencing: traceable, checkpointable | procedural sequencing: traceable, checkpointable
operation sequence driven by | read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting; no parallel loops, escapes, instruction stream branching | read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
state register | program counter | data counter(s)
address computation | massive memory cycle overhead | overhead avoided
instruction fetch | memory cycle overhead | overhead avoided
parallel memory bank access | interleaving only | no restrictions
language features | control flow + data manipulation | data streams only (no data manipulation)
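The contrast between a program counter and data counters can be illustrated with a small sketch. It is not taken from the talk: the names `data_counter`, `dpu_multiply_accumulate` and `run_flowware` are hypothetical, and since Python itself runs on a CPU, the code only models the separation of roles, with address sequences generated at the memory side and a fixed datapath that fires whenever data arrive.

```python
def data_counter(base, stride, count):
    """Generic address generator: yields an address sequence fixed before run time."""
    for i in range(count):
        yield base + i * stride


def dpu_multiply_accumulate(stream_a, stream_b):
    """Stand-in for a configured datapath: it has no program counter and
    simply consumes one pair of data items per step as they arrive."""
    acc = 0
    for a, b in zip(stream_a, stream_b):
        acc += a * b
    return acc


def run_flowware(memory, n):
    """Two data counters scan two vectors stored back to back in memory."""
    stream_a = (memory[addr] for addr in data_counter(base=0, stride=1, count=n))
    stream_b = (memory[addr] for addr in data_counter(base=n, stride=1, count=n))
    return dpu_multiply_accumulate(stream_a, stream_b)


if __name__ == "__main__":
    # Vectors (1, 2, 3) and (4, 5, 6) stored contiguously.
    print(run_flowware([1, 2, 3, 4, 5, 6], n=3))
```

The flowware part (the two `data_counter` calls) only describes data streams, while the data manipulation sits entirely in the datapath, mirroring the "language features" row of the table.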
A Heliocentric CS Model needed
Twin Paradigm Dual Dichotomy Approach.
PE: Program Engineering, the Generalization of Software Engineering
• SE: Software Engineering (CPU)
• FE: Flowware* Engineering (asM: auto-sequencing Memory)
• CE: Configware Engineering (structures; pipe network model; rDPU: reconfigurable Data-Path Unit; rDPA: reconfigurable Data-Path Array)
time to space mapping issue
*) do not confuse with „dataflow“!
POIIP: Loop turns into Pipeline [1979]
rDPU: (reconfigurable) DataPath Unit
loop (CPU, memory) -> pipeline of rDPUs, one rDPU per loop body
complex loop body -> complex rDPU or pipe network inside rDPU
nested loops -> complex pipe network
[Figure, source: MIT StreamIt: pipe network with FMDemod, Split, LPF1 to LPF3, HPF1 to HPF3, Gather, Adder, Speaker]
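How a loop body turns into a pipeline can be sketched with Python generators. This is only an analogy under stated assumptions: the three arithmetic stages are made up for illustration, and generators model the data flow between stages, not the parallel hardware execution of real rDPUs.

```python
def loop_version(samples):
    """Time domain: one processor executes the whole loop body per item."""
    out = []
    for x in samples:
        x = x * 2          # first operation of the loop body
        x = x + 1          # second operation
        out.append(x * x)  # third operation
    return out


def stage(fn, upstream):
    """One pipeline stage: stands in for a DPU configured with operation fn."""
    for item in upstream:
        yield fn(item)


def pipeline_version(samples):
    """Space domain: each operation of the loop body becomes its own stage,
    and the data stream flows through the chain of stages."""
    s1 = stage(lambda x: x * 2, samples)
    s2 = stage(lambda x: x + 1, s1)
    s3 = stage(lambda x: x * x, s2)
    return list(s3)


if __name__ == "__main__":
    data = [1, 2, 3]
    print(loop_version(data), pipeline_version(data))
```

Both versions compute the same results; the difference lies in where the sequencing lives: in the instruction stream for the loop, in the wiring between stages for the pipeline.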
Paradigm Dichotomy: a very old hat
HDL scene ~1970: „decision box turns into demultiplexer“
[Figure: decision box with input CONDITION and exits 0 -> B0, 1 -> B1; demultiplexer with inputs ENABLE and CONDITION and outputs B0, B1]
“That’s so simple! Why did it take 30 years to find out?”
1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.
1972: C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972
1973: RTM available as DEC product
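The step from decision box to demultiplexer can be written down in a few lines. The sketch assumes that condition 0 selects B0 and condition 1 selects B1, as the figure labels suggest; the function names are illustrative only.

```python
def decision_box(condition: bool) -> str:
    """Time domain: control flow picks exactly one branch to execute next."""
    if condition:
        return "B1"
    return "B0"


def demultiplexer(enable: bool, condition: bool) -> tuple:
    """Space domain: the ENABLE signal is routed to output B0 or B1.
    Returns the pair of output signals (b0, b1)."""
    b0 = enable and not condition
    b1 = enable and condition
    return b0, b1


if __name__ == "__main__":
    for condition in (False, True):
        print(decision_box(condition), demultiplexer(True, condition))
```

The branch taken by the decision box corresponds to the single active output of the demultiplexer; without ENABLE, neither output is active.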
Key issues
The massive power consumption of computers.
Hetero systems needed (vN + non-vN accelerators).
2 scenes: eRC (embedded) + HPRC (supercomputing)
Productivity programming methodology missing.
Productive programmer population missing.
Accelerators: hardwired + RC + (soon?) ANN
Potential of RC
Reconfigurable Computing offers an overwhelming reduction of electricity consumption as well as massive speed-up factors, both by up to several orders of magnitude.
Only Reconfigurable Computing can prevent the running of our infrastructures from becoming unaffordable in the future.
We have to re-invent computing as soon as possible
We need „une levée en masse“
END
Taxonomy of Twin Paradigm Programming Flows (HPRC)
E. El-Araby et al.: Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology And Empirical Study; Proc. SPL2007, Mar del Plata, Argentina, Febr. 2007
[courtesy Richard Newton]
„The nroff of EDA“ [R. N.]
Locality awareness is essential for flowware
How data are moved:
Software: by addresses, read from instructions (here locality is less relevant)
Flowware: by wire, configured before run time (the relation to configware calls for locality awareness)
Reinvent? (final remark)
avoid traditional tunnel views
to obtain new perspectives
rediscovery and revival of old ideas
rearrange and teach them properly
to reach promising new horizons
Amid the Clamor ?
Time to space mapping
time domain (procedure domain) -> space domain (structure domain)
program loop: n time steps, 1 CPU -> pipeline: 1 time step, n DPUs
Bubble Sort (time algorithm): n x k time steps, 1 „conditional swap“ unit -> Shuffle Sort (space/time algorithm): k time steps, n „conditional swap“ units
[Figure: a single conditional swap unit with inputs x and y; an array of conditional swap units]
Architecture instead of synchro
Example: „Shuffle Sort“
[Figure: direct time to space mapping yields an array of conditional swap units with accessing conflicts; the modification with shuffle function needs fewer units]
Better architecture instead of complex synchronisation: half the number of blocks + up and down of data (shuffle function): no von Neumann syndrome!
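The idea of trading time steps for conditional swap units can be sketched as follows. The slides do not spell out the Shuffle Sort algorithm, so this sketch substitutes the well-known odd-even transposition sort as a stand-in: about n/2 conditional swap units work in parallel in every time step, and the alternating pair offset plays the role of moving the data up and down. All names are illustrative, and the real Shuffle Sort may differ in detail.

```python
def conditional_swap(x, y):
    """The basic hardware block: outputs the pair in ascending order."""
    return (x, y) if x <= y else (y, x)


def shuffle_sort_sketch(data):
    """Odd-even transposition sort as a stand-in for Shuffle Sort.

    Returns (sorted_values, parallel_time_steps, swap_unit_activations).
    On hardware, all activations within one time step would run in parallel;
    on a single swap unit they would have to run one after another.
    """
    values = list(data)
    n = len(values)
    activations = 0
    for step in range(n):                 # n time steps suffice for n items
        offset = step % 2                 # alternate the pairing: "up and down"
        for i in range(offset, n - 1, 2):
            values[i], values[i + 1] = conditional_swap(values[i], values[i + 1])
            activations += 1
    return values, n, activations


if __name__ == "__main__":
    print(shuffle_sort_sketch([5, 3, 8, 1, 9, 2, 7, 4]))
```

For eight items, the sketch needs 8 parallel time steps, whereas a single conditional swap unit would need 28 sequential activations for the same schedule.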
Transformations since the 70s
time domain (procedure domain) -> space domain (structure domain)
program loop (time algorithm): n x k time steps, 1 CPU -> pipeline (space/time algorithm): k time steps, n DPUs
Strip Mining Transformation
loop transformations: rich methodology published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]
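Strip mining can be sketched as splitting a loop of n x k iterations into k strips of n items each, so that each strip can be handled by n DPUs in one time step. The code below is a minimal model under that assumption; `strip_mine` and `run_strips` are hypothetical names, and the inner iteration over a strip is sequential in Python although it stands for parallel units.

```python
def strip_mine(items, n_units):
    """Split the iteration space into strips of at most n_units items."""
    return [items[i:i + n_units] for i in range(0, len(items), n_units)]


def run_strips(items, n_units, body):
    """Process one strip per time step; returns (results, time_steps).

    Within a strip, every item would be handled by its own DPU in parallel.
    """
    results = []
    time_steps = 0
    for strip in strip_mine(items, n_units):
        results.extend(body(x) for x in strip)  # conceptually parallel
        time_steps += 1
    return results, time_steps


if __name__ == "__main__":
    # 12 iterations on 4 units: 3 time steps instead of 12.
    print(run_strips(list(range(12)), n_units=4, body=lambda x: x * x))
```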
Outline
• Never run out of energy ?
• Energy consumption: unaffordable soon?
• The many-core crisis
• Rescue by Reconfigurable Computing?
• We need to Reinvent Computing
• Conclusions
Development with VHDL is expensive
FPGAs' Achilles’ heel is their long development time: low-level HDLs (VHDL/Verilog) are still dominant
Unlike software, FPGAs do not offer forward/backward compatibility
“FPGAs: low technology maturity; small user base, compared to software” [Khaled Benkrid, Univ. of Edinburgh]
Variety of Tool Flows: FPGA, ASIC, eASIC
Survey urgently needed: Manpower required!
Complicated IP Core Scene
BDTI High-Level Synthesis Tool Certification Program™
Grant Martin, Gary Smith: “High-Level Synthesis: Past, Present, and Future,” IEEE Design and Test of Computers, July/August 2009.
TSMC creates a 'soft' IP cores collaboration program to improve soft IP readiness for EDA and IP suppliers including Arteris, Atrenta, Cadence, Chips & Media, Imagination Technologies, Intrinsic-ID, MIPS Technologies, Sonics, Synopsys and Vivante.
Productivity vs. Efficiency
82
[courtesy Richard Newton]
„The nroff of EDA“ [R. N.]
how to hide the ugliness from the user [Herman Schmit]
HDLs: zero ease of use !!!
Understanding Complex Hetero Systems
Layers of abstraction and automatic parallelization hide critical sources of, and limits to, efficient parallel execution
Efficient distribution of tasks is memory-limited
Internode communication reduces computational efficiency
We must change how programmers think
Essential: awareness of locality, focusing on memory mapping issues and transfer modes to detect overhead and bottlenecks
Understanding of streams through complex fabrics is needed
Processor inside FPGA vs. FPGA inside Processor: EPP
Xilinx: Extensible Processing Platform™
-> totally changed concept
device more like heterogeneous SoCs: significant benefits for HPC applications
FPGAs became software-centric, not hardware-centric
-> EDUCATION !!!!
TU Kaiserslautern Structured ASICs
Structured ASICs like eASIC are mostly based on an FPGA-like architecture with a special configuration mechanism that is programmed at mask level: not re-programmable (more performance for less cost).
configware on ROM
RTL Programming to ASIC / ASSP
Platform-Language of Silicon: attractive for IP providers
Tools-Path to ASICs/ASSPs (Application-Specific Standard Products), across FPGAs
RTL is inherently parallel; mapped applications are automatically and optimally parallelized by CAD tools.
now ESL (Electronic System Level) bridging HDL and ANSI C/C++ (at industrial level)
Battle of FPGAs vs. MPSoC: it is RTL vs. software programming.
Graphical / Dataflow Languages
E. El-Araby et al.: Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology And Empirical Study; Proc. SPL2007 Symp., Mar del Plata, Argentina, Febr. 2007
We should have a look at DSPlogic
Evaluation Metrics
Architecture visible to the programmer
The software programming model: Which hardware architecture parts are visible and under the programmer’s direct control?
The RC programming model: how the programmer can control data transfers between FPGA, onboard memory, the microprocessor and its memory
>60 NoC Projects
Jih-Sheng Shen, Pao-Ann Hsiung (editors): Dynamic Reconfigurable Network-on-Chip Designs: Innovations for Computational Processing and Communication; Information Science Reference, Hershey, USA, April 1, 2010
Industry faces 'platform collision'
Which platform technology will win ? ASIC, ASSP, FPGA, MCU or IP core?
NoC research: worldwide >60 projects
"It's not clear, all may coexist" [Brad Howe, VP IC, Altera]
Battles will get even more interesting if/when the parallel programming crisis is over
Intel going FPGA ?
If Intel, for example, wanted to enter the FPGA market, they’d have to
acquire one of the existing big FPGA companies, specifically to acquire the
core technology of synthesis and place-and-route.
If FPGAs, or programmable logic technology in general, are a long-term
important component of digital electronics, then synthesis and
place-and-route have put just two companies, Xilinx and Altera, in
the driver’s seat for the massive profit and growth that are possible in
this market.
(Outline)
4. We have to re-invent computing
• we need a twin paradigm approach (hetero with strong FPGA involvement)
• we need locality awareness (& model-based)
• we need a levée en masse (a mass mobilization)
Re-write anyway ->
Locality Awareness is essential
[Figure: the data stream machine. An rDPA surrounded by asM blocks that feed data streams into and out of the array.]
asM: Auto-Sequencing Memory
use data counters, no program counter
implemented by distributed on-chip memory
programmed by Flowware
reconfigurable address generator (GAG) inside asM
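A minimal sketch of the data-counter idea, assuming a simple nested-counter address generator; `address_stream`, `read_stream` and the 10-word-wide example memory are hypothetical names and values chosen for illustration, not the actual GAG or Flowware design. The point is that the access order comes from counters configured before run time, with no instruction stream computing addresses.

```python
def address_stream(base, limits, strides):
    """Yield addresses from nested data counters (outermost level first).

    limits[i] is the iteration count of counter i and strides[i] its address
    step. This is an illustrative model, not the real address generator.
    """
    counters = [0] * len(limits)
    while True:
        yield base + sum(c * s for c, s in zip(counters, strides))
        # Advance the innermost counter and carry into the outer ones.
        level = len(counters) - 1
        while level >= 0:
            counters[level] += 1
            if counters[level] < limits[level]:
                break
            counters[level] = 0
            level -= 1
        if level < 0:
            return  # every counter wrapped around: the scan is complete

def read_stream(memory, addresses):
    """Deliver memory words in the order dictated by the address stream."""
    for addr in addresses:
        yield memory[addr]

# A 10-word-wide "image" stored linearly; each value equals its own address.
WIDTH = 10
memory = list(range(100))

# Scan a 3x3 window whose top-left corner is at row 1, column 2.
window_addresses = list(
    address_stream(base=1 * WIDTH + 2, limits=[3, 3], strides=[WIDTH, 1])
)
window_sum = sum(read_stream(memory, window_addresses))
print(window_addresses)
print(window_sum)
```

Changing `limits` and `strides` reconfigures the scan pattern (row scan, column scan, window scan) without touching the code that consumes the stream, which loosely mirrors the split between address generation and datapath.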
Some functional languages
Some datastream languages
The Language and Tool Disaster
Software people don't speak VHDL
Hardware people don't speak MPI or OpenMP
Bad quality application development tools
86% of designers hate their tools [FCCM’98]
progress stalled by qualification problems in industry and academia
Comprehensibility barrier between procedural and structural mindsets
N. Conner et al.: FPGAs for Dummies; Wiley, 2008
Productivity Semantic Gap
Some Acceleration Mechanisms
parallelism by multi-bank memory architecture
auxiliary hardware for address calculation
address calculation before run time
avoiding multiple accesses to the same data
avoiding memory cycles for address computation
optimization by storage scheme transformations
optimization by memory architecture transformations
Accelerate tasks by streaming
Can achieve a 100x improvement in the use of memory.
MISD-structured computation: streaming computations across a long array before storing results in memory.
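A minimal sketch of the streaming idea, assuming a toy memory that counts its own accesses; `CountingMemory`, the three operator stages and the array length are hypothetical and chosen only to make the access counts visible. It does not reproduce the 100x figure; it only shows the mechanism of passing each data item through all stages before a result is stored.

```python
class CountingMemory:
    """Toy memory that counts reads and writes (illustrative model only)."""

    def __init__(self, data):
        self.cells = list(data)
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.cells[addr]

    def write(self, addr, value):
        self.writes += 1
        self.cells[addr] = value

# Three arbitrary operators standing in for a chain of datapath units.
STAGES = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

def stage_by_stage(mem, n):
    """Every stage reads the whole array and stores it back before the next stage."""
    for op in STAGES:
        for addr in range(n):
            mem.write(addr, op(mem.read(addr)))

def streamed(mem, n):
    """Each item flows through all stages; only the final result is stored."""
    for addr in range(n):
        value = mem.read(addr)
        for op in STAGES:
            value = op(value)
        mem.write(addr, value)

N = 1000
mem_a = CountingMemory(range(N))
mem_b = CountingMemory(range(N))
stage_by_stage(mem_a, N)
streamed(mem_b, N)
print("stage by stage:", mem_a.reads + mem_a.writes, "memory accesses")
print("streamed:      ", mem_b.reads + mem_b.writes, "memory accesses")
```

Both variants compute identical results. In this model the saving grows with the number of chained stages, since intermediate results never travel to memory in the streamed version.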