How to cope with the Power Wall
Reiner Hartenstein, TU Kaiserslautern
IEEE fellow, FPL fellow, SDPS fellow
http://hartenstein.de
PATMOS 2015, the 25th International Workshop on Power and Timing Modeling, Optimization and Simulation;
Salvador, Bahia, Brazil, Sept 1-5, 2015
downloadable from http://xputer.de
© 2015, [email protected] http://xputer.de
TU Kaiserslautern
>> Outline <<
http://www.uni-kl.de
•The Power Wall
•“Dataflow” Computing
•Reconfigurable Computing
•Time to Space Mapping
•The Xputer Paradigm
•Conclusions
The Workshop Series
Oldest conference series on power efficiency
Power efficiency is going to become an industry-wide issue
Some incremental improvements are on track, at all abstraction levels;
however, there is still a lot to be done
http://hartenstein.de/PATMOS/
spin-off from the PATMOS project:
Project leader: Reiner Hartenstein
Partner leader: Antonio Núñez
Partner leader: Francis Jutand
Power-Efficient Computing
•Power-efficient Microchip Design
•Power-efficient Computer Architectures
•Power-efficient Languages and Compilers
•Power-efficient Software Implementation
•(Power-efficient Memory)
•Power-efficient Machine Paradigm
tutorial by Jan Rabaey
my presentation
Data Center at Dallas (© New York Times)
Columbia river
Power consumption by internet: x30 until 2030 if trends continue
It‘s more than the entire world‘s total power consumption today!
G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008
ICT infrastructures
Three tectonic shifts (© Hewlett Packard):
•the energy-constrained world
•from internet of people to internet of (every)thing
•the end of scaling as we know it
Terminology Problems
Stressing the differences to „Control-Flow“ computers, an area called „Dataflow“ computers was started in the mid-1970s at MIT
Xputer area people are forced to sidestep by using terms like „data-driven“ or „data streams“…
… although the „Dataflow“ scene is „I-Structure“-centered
(a more fitting name would be „Tagged Token Flow“)
A Second Opinion
D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982
the subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future."
( Still active workshops …, e. g.: 5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA )
However, the power efficiency breakthrough did not happen here
Speed-ups by vN Software to FPGA Migrations
[Chart: speedup factor (logarithmic scale) over the years 1985 to 2010, with a trend line of x2/yr; values recoverable from the chart labels:]
•DSP and wireless: FFT 100; Reed-Solomon Decoding 2400; Viterbi Decoding 400; MAC 1000
•Bioinformatics: Smith-Waterman pattern matching 288; molecular dynamics simulation 88; BLAST 52; protein identification 40; DNA & protein sequencing 8723 (power saving 779)
•Astrophysics: GRAPE 20
•Image processing, Pattern matching, Multimedia: SPIHT wavelet-based image compression 457; real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; CT imaging 3000
•Crypto 1000; DES breaking 28514 (power saving 3439)
•Pre-FPGA solutions (Xputer): Design Rule Check accelerator PISA 15000 („fair comparison“); no FPGA: DPLA on MoM by TU-KL*, fabricated by E.I.S. Multi University Project Chip; 1984 (>15 years earlier): 1 DPLA replaces 256 FPGAs
Saving Power: power saving is ~10 % of speed-up
*) TU Kaiserslautern
http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf
http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html
The Reconfigurable Computing Paradox
Massive speed-ups and power savings are obtained although the effective integration density of FPGAs is 4 orders of magnitude behind the Gordon Moore curve, because of:
•wiring overhead
•reconfigurability overhead
•routing congestion
why?
“von Neumann Syndrome” (C.V. Ramamoorthy)
von Neumann: an extremely power-inefficient paradigm
Reinvent Computing
Obstacles to widespread FPGA adoption go well beyond the required skill set
- Workshop at FPL_2015
http://reconfigurablecomputing4themasses.net/
What about Acceleration by Graphics Processors ?
•Power saving mostly not documented
•R. Vuduc et al.*: „ … adding a GPU is equivalent to adding one more multicore CPU socket …”
•Drastically smaller speed-ups, if any (still von Neumann)
*) R. Vuduc, J. Choi, M. Guney, A. Shringarpure: On the Limits of GPU Acceleration; Proc. HotPar'10, 2nd USENIX workshop on Hot Topics in Parallelism, June 14 - 15, 2010, Berkeley, CA, USA, USENIX Assoc. Berkeley, CA, USA http://newport.eecs.uci.edu/~amowli/resources/papers/vuduc2010-hotpar.pdf
Dual paradigm mind set: an old hat – but was ignored
time to space mapping: procedural to structural (loop to pipe mapping):
•1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.
•C. G. Bell et al: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972
•PDP-16 RTMs (1971)
[Figure: token bit passed along a chain of flip-flops (FF) by evoke signals; demultiplexer]
„This is so simple …“
Tunnel Vision! Why did it take 25 years to find out?
The Systolic Arrays (1)
1980: no instruction streams needed
M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips ... IEEE 7th ISCA, La Baule, France, May 6-8, 1980
[Figure: Kung‘s example (algebra): an input data stream enters a DPA (pipe network) and leaves as output data streams; each stream is drawn over time and port #]
DPA: DataPath Array (array of DPUs)
„data streams“ define: ... which data item at which time at which port
execution transport-triggered
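The idea that a data stream defines "which data item at which time at which port" can be illustrated with a small simulation. The following is only a minimal sketch, not Kung's original example: it assumes a linear pipe of DPUs computing a matrix-vector product, and the names `skewed_stream` and `systolic_matvec` as well as the skewing rule (element a[i][j] arrives at port j at time i + j) are illustrative assumptions.

```python
def skewed_stream(matrix):
    """Schedule (assumed): element a[i][j] arrives at port j at time i + j."""
    return {(i + j, j): a
            for i, row in enumerate(matrix)
            for j, a in enumerate(row)}


def systolic_matvec(matrix, x):
    """Compute y = A x on a linear pipe of len(x) DPUs; DPU j holds x[j].

    Assumes x is non-empty and every matrix row has len(x) entries.
    """
    n = len(x)
    stream = skewed_stream(matrix)
    pipe = [None] * n              # pipe[j]: partial sum leaving DPU j
    results = []
    for t in range(len(matrix) + n - 1):
        nxt = [None] * n
        for j in range(n):
            a = stream.get((t, j))
            if a is None:
                continue           # nothing arrived at port j: DPU j stays idle
            partial = 0 if j == 0 else pipe[j - 1]
            nxt[j] = partial + a * x[j]
        pipe = nxt
        if pipe[n - 1] is not None:
            results.append(pipe[n - 1])   # a finished row sum leaves the pipe
    return results


y = systolic_matvec([[1, 2], [3, 4]], [5, 6])
```

No DPU fetches an instruction here: a DPU fires only when a data item arrives at its port, which is the transport-triggered behaviour named on the slide.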
Systolic Arrays (2)
M. J. Foster and H. T. Kung: “The Design of Special-Purpose VLSI Chips ... “
only special purpose? (Tunnel Vision)
Karl Steinbuch: „It is not sufficient to invent something. You need to recognize that you have invented something.“
http://karl-steinbuch.org
What Synthesis Method?
H. T. Kung*: „of course algebraic!“ (linear projection)
*) a mathematician
•supports only very special applications with strictly regular data dependencies
•only linear pipes
My student Rainer Kress replaced it by simulated annealing: this also supports pipe networks of any irregular & wild shape
The super-systolic array: a generalization of the systolic array: KressArray [ASP-DAC-1995]
http://kressarray.de/
now a general purpose methodology !
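How simulated annealing can place an irregular pipe network onto an array of DPUs may be sketched as follows. This is a generic textbook-style annealer under stated assumptions, not the KressArray mapper itself: the cost function (total Manhattan wire length), the cooling schedule and all names (`wire_length`, `anneal_placement`, the parameters) are hypothetical choices made for illustration.

```python
import math
import random


def wire_length(placement, edges):
    """Total Manhattan distance over all producer/consumer pairs."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in edges)


def anneal_placement(nodes, edges, rows, cols,
                     steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Place operator nodes on a rows x cols DPU grid (illustrative sketch)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    if len(nodes) > len(cells):
        raise ValueError("more operators than DPUs")
    rng.shuffle(cells)
    place = dict(zip(nodes, cells))          # random initial placement
    free = cells[len(nodes):]                # unused DPUs
    cost = wire_length(place, edges)
    best, best_cost = dict(place), cost
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)
        a = rng.choice(nodes)
        if free and rng.random() < 0.5:      # move a to a free DPU
            k = rng.randrange(len(free))
            place[a], free[k] = free[k], place[a]
            undo = ("move", a, k)
        else:                                # swap two operators
            b = rng.choice(nodes)
            place[a], place[b] = place[b], place[a]
            undo = ("swap", a, b)
        new_cost = wire_length(place, edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                  # accept (sometimes even uphill)
            if cost < best_cost:
                best, best_cost = dict(place), cost
        elif undo[0] == "move":              # reject: restore previous state
            place[undo[1]], free[undo[2]] = free[undo[2]], place[undo[1]]
        else:
            place[undo[1]], place[undo[2]] = place[undo[2]], place[undo[1]]
    return best, best_cost


pipe_nodes = ["a", "b", "c", "d"]
pipe_edges = [("a", "b"), ("b", "c"), ("c", "d")]
best, best_cost = anneal_placement(pipe_nodes, pipe_edges, rows=3, cols=3)
```

Because the annealer only needs a cost function over an arbitrary edge list, it does not care whether the data dependencies are regular, which is the point of replacing the algebraic (linear projection) synthesis method.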
Who generates* the data streams?
*) or receives
H. T. Kung: “It’s not our job”
another Tunnel Vision Symptom
without a sequencer: missed the chance to invent a new machine paradigm (the Xputer)
The Xputer Machine Paradigm
is the TU-KL‘s Symbiosis of Time to Space Mapping and Reconfigurable Computing!
obtained by adding auto-sequencing memory (ASM)
With data counters instead of a program counter
GAG: Generic Address Generator
[Figure: several ASM blocks; each ASM contains a data counter, a GAG and RAM]
Xputer literature
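A data counter driven by an address generator can be sketched in a few lines. The following is a simplified software model under assumptions, not the actual GAG hardware: it supposes that a scan pattern is described by a base address, two counts and two strides, and the names `generic_address_generator` and `auto_sequencing_memory` are hypothetical.

```python
def generic_address_generator(base, x_count, y_count, x_stride=1, y_stride=None):
    """Yield the addresses of a 2-D scan pattern (assumed parameterisation).

    The data counter steps through the pattern by itself; no address
    arithmetic is left to the datapath that consumes the data.
    """
    if y_stride is None:
        y_stride = x_count * x_stride
    for y in range(y_count):
        for x in range(x_count):
            yield base + y * y_stride + x * x_stride


def auto_sequencing_memory(ram, addresses):
    """Turn a RAM block plus an address sequence into a data stream."""
    for address in addresses:
        yield ram[address]


ram = list(range(100, 116))                      # 4 x 4 block, row-major
window = generic_address_generator(base=5, x_count=2, y_count=2, y_stride=4)
stream = list(auto_sequencing_memory(ram, window))
```

Here a 2 x 2 window inside a 4 x 4 block is streamed out in scan order; changing only the generator parameters yields a different data stream from the same memory.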
Duality of procedural Languages
Software Languages (control-procedural):
•read next instruction
•goto (instruction address)
•jump to (instruction address)
•instruction loop
•instruction loop nesting
•instruction loop escape
•instruction stream branching
•no: no internally parallel loops
•state register: program counter
Flowware Languages (data-procedural), e. g. MoPL:
•read next data item
•goto (data address)
•jump to (data address)
•data loop
•data loop nesting
•data loop escape
•data stream branching
•yes: internally parallel loops
•state register(s): data counter(s)
But there is an Asymmetry: Flowware languages are more simple (no ALU tasks) and easy to learn
Xputer literature
Xputer pages
Compilation: Software vs. Configware and Flowware
Software Engineering (procedural: time domain):
source program (C, etc.) → software compiler → software code
Configware Engineering (time to space mapping):
source „program“ → configware compiler:
•mapper → configware code (space domain)
•scheduler → flowware code (time domain), which schedules the data
Xputer literature
Heterogeneous: Co-Compilation
Software / Configware Co-Compiler: CoDe-X (Jürgen Becker‘s Ph.D. thesis)
C, or other high level language → automatic SW / CW partitioner:
•software compiler → software code
•configware compiler (mapper, scheduler) → configware code, flowware code (data)
important: why?
Xputer pages
Illustrating the Paradigm Trap (1)
the watering can model [Hartenstein]
The von Neumann single core Approach: The Memory Wall ( crippled watering can )
The von Neumann Manycore Approach: many von Neumann bottle-necks ( many crippled watering cans )
Illustrating the Paradigm Trap (2)
The “Dataflow“ Computer
extremely complicated: no watering can model !
(a power efficiency breakthrough did not happen here)
Xputer: the only massively power-efficient Paradigm
The Xputer Paradigm
the watering can model [Hartenstein]
has no von Neumann bottle-neck
fully supporting Reconfigurable Computing
We need a Seismic Shift …
… to avoid future unaffordable ICT power consumption cost
It’s an extremely massive challenge !
The software from more than half a century sits squarely on top
For many more years we must work under a heterogeneous triple-paradigm mind set: Configware, Flowware, and still even Software
that’s why heterogeneous is important
Reconfigurable Computing (RC): the intensive Impact
[Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]
SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster:
•much less equipment needed
•much less memory and bandwidth needed
•massively saving energy
Speed-ups by von Neumann to RC Migrations
Application                   Speed-up factor   Savings divisor: Power   Cost   Size
DNA and Protein sequencing    8723              779                      22     253
DES breaking                  28514             3439                     96     1116
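The relation between speed-up and power saving in these two rows can be checked with simple arithmetic. The sketch below only divides the figures quoted on the slide; the dictionary layout and the helper name `power_share_of_speedup` are illustrative assumptions.

```python
# Figures as quoted on the slide (El-Ghazawi et al., IEEE COMPUTER, Febr. 2008).
MIGRATIONS = {
    "DNA and Protein sequencing": {"speedup": 8723, "power": 779, "cost": 22, "size": 253},
    "DES breaking": {"speedup": 28514, "power": 3439, "cost": 96, "size": 1116},
}


def power_share_of_speedup(entry):
    """Power savings divisor expressed as a fraction of the speed-up factor."""
    return entry["power"] / entry["speedup"]


shares_percent = {name: 100 * power_share_of_speedup(entry)
                  for name, entry in MIGRATIONS.items()}
```

Both shares come out in the neighbourhood of a tenth, which is consistent with the rule of thumb that the power saving is roughly 10 % of the speed-up factor.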
Taxonomy
•Flynn‘s taxonomy: von Neumann only
•Diana‘s taxonomy: Reconfigurable Computing
•Reiner‘s taxonomy: heterogeneous systems
•Reiner‘s 2nd taxonomy: Xputers only
[Figure: array of LUTs (look-up tables)]
First FPGA available 1984 from Xilinx
Transformations since the 1970s
time domain: procedure domain
space domain: structure domain
Strip Mining Transformation (time to time/space mapping):
•time algorithm: program loop, n x k time steps, 1 CPU
•space/time algorithm: pipeline, k time steps, n DPUs
Loop Transformations: rich methodology published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]
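The step counts of the two variants can be made concrete with a small simulation. This is a minimal sketch assuming that the loop body consists of n operations applied to each of k data items; the function names are hypothetical, and the pipeline model counts the n - 1 fill steps explicitly, which the slide's "k time steps" presumably neglects.

```python
EMPTY = object()   # marks a pipeline register that holds no data yet


def run_sequential(ops, items):
    """Program loop on 1 CPU: every operation costs one time step."""
    steps, results = 0, []
    for item in items:
        value = item
        for op in ops:
            value = op(value)
            steps += 1
        results.append(value)
    return results, steps


def run_pipeline(ops, items):
    """One DPU per operation; all DPUs work in the same time step."""
    n = len(ops)
    regs = [EMPTY] * n                  # regs[s]: output register of DPU s
    results, ticks, fed = [], 0, 0
    while len(results) < len(items):
        for s in range(n - 1, 0, -1):   # back to front: each value moves one stage
            regs[s] = EMPTY if regs[s - 1] is EMPTY else ops[s](regs[s - 1])
        regs[0] = ops[0](items[fed]) if fed < len(items) else EMPTY
        fed += 1
        ticks += 1
        if regs[-1] is not EMPTY:
            results.append(regs[-1])
    return results, ticks


ops = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]   # n = 3
items = [1, 2, 3, 4]                                        # k = 4
seq_results, seq_steps = run_sequential(ops, items)
pipe_results, pipe_ticks = run_pipeline(ops, items)
```

With n = 3 and k = 4 the loop needs n x k steps, while the pipe needs k + n - 1; for long data streams (k much larger than n) this approaches the k time steps stated above.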
The Reconfigurable Computing Paradox
Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing.
Reinvent Computing
although the effective integration density of FPGAs is 4 orders of magnitude behind the Gordon Moore curve, because of:
•wiring overhead
•reconfigurability overhead
•routing congestion
•etc.
ASM Data stream generator usage
Xputer machine paradigm (Xputer pages)
[Figure: ASM blocks around an rDPA (pipe network) generate its input data streams and receive its output data streams]
•ASM: Auto-Sequencing Memory (data counter, GAG, RAM), implemented by distributed on-chip memory
•use data counters, no program counter
•GAG: Generic Address Generators (reconfigurable: to avoid memory-cycle-hungry address computation)
•rDPA: general purpose, reconfigurable; also more irregular data routes supported
Paradigm Shift Consequences
von Neumann: Software Engineering, 1 programming source needed
•software: algorithm: variable, resources: fixed
•CPU with Program Counter (PC)
Xputer: Configware Engineering, 2 programming sources needed
•configware: resources: variable; Configuration Code (CC)
•flowware: algorithm: variable; Data Counters (DCs), sequencing code (e. g. see MoPL language)
Heterogeneous Engineering (Xputer and vN): all 3 programming sources needed
Xputer literature
Pipelining through DPU Arrays: the TU-KL Xputer principle
DPA = DPU array, DPU = Data Path Unit
[Figure: input data streams flow through a DPA and leave as output data streams]
•DPU operation is transport-triggered
•no instruction streams
•no message passing, nor thru common memory
•massively avoiding memory cycles
•no memory wall
Von Neumann Syndrome
Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; ßetascript publishing 2011
Shifting to the dominance of von-Neumann-only caused a cumulated damage of at least trillions of Dollars, if not quadrillions ….
Computing Paradigms
term   program counter      execution triggered by             paradigm
CPU    yes                  instruction fetch                  instruction-stream-based (von Neumann)
RPU    no (data counters)   data arrival**                     data-driven or data-stream-based (Xputer, Reconfigurable Computing)
DFC*   no                   complicated I-structure handling   “Dataflow” Computer
*) based on tagged token “I-Structure”
**) “transport-triggered”
[Figure: CPU = program counter + DPU; RPU = data counters + rDPU; DFC = I-structure + DPUs]
MIT Tagged Token Dataflow Architecture*
(I would call it Tagged Token Flow Architecture)
*) source: Jurij Silc
[Figure: PEs, each with its I-Structure Storage, connected by a Communication Network]
PE: Wait-Match Unit & Waiting Token Store; Instruction Fetch Unit; Program Store & Constant Store; ALU & Form Tag; Form Token Unit; Token Queue; RU, SU: to/from the Communication Network
•no Program Counter
•no updateable global store
•„instruction is executed even if some of its operands are not yet available“ ?
Problems with such “Dataflow” Architectures
•“each update consumes the structure and the value produces a new data structure”
•“awkward or even impossible to implement”
Solution: I-Structure concept* [source: Jurij Silc]
•“can be read but not written”
•“at least one read request has been deferred”
•“read request deferred but write operation is allowed”
very complex data structures !
*) „I-Structure Flow“ instead of „Dataflow“ ? I would call it „Tagged Token Flow“ (= Tagged Token Structure)
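The three quoted conditions can be read as the states of a single write-once storage element. The following is a minimal sketch under that reading, not the MIT implementation: the class `IStructureCell` and the state names "absent", "waiting" and "present" are assumptions introduced here for illustration.

```python
class IStructureCell:
    """One write-once element with deferred reads (illustrative model).

    Assumed state names:
      "absent"  : nothing written yet; a read is deferred, a write is allowed
      "waiting" : at least one read request has been deferred
      "present" : value available; can be read but not written
    """

    def __init__(self):
        self.state = "absent"
        self.value = None
        self.deferred = []              # consumers waiting for the value

    def read(self, consumer):
        if self.state == "present":
            consumer(self.value)        # value is there: deliver immediately
        else:
            self.deferred.append(consumer)
            self.state = "waiting"      # remember the reader until the write

    def write(self, value):
        if self.state == "present":
            raise RuntimeError("I-structure element may be written only once")
        self.value = value
        self.state = "present"
        pending, self.deferred = self.deferred, []
        for consumer in pending:        # satisfy all deferred read requests
            consumer(value)


cell = IStructureCell()
received = []
cell.read(received.append)    # too early: the read request is deferred
cell.write(42)                # the write releases the deferred read
cell.read(received.append)    # later reads are served at once
```

Even this toy version needs per-element state and a queue of deferred readers, which hints at why the slide calls the resulting data structures very complex.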
I-Structures (I = incremental) - part 1
Jurij Silc: Dataflow Architectures, http://csd.ijs.si/courses/dataflow/
I-Structures - part 2
MIT Tagged-Token Dataflow Architecture (PE)
[Figure: I-structure select, I-structure assign]
Jurij Silc: Dataflow Architectures, http://csd.ijs.si/courses/dataflow/