
How to cope with the Power Wall

Reiner Hartenstein, TU Kaiserslautern

IEEE fellow, FPL fellow, SDPS fellow

http://hartenstein.de

PATMOS 2015, the 25th International Workshop on Power and Timing Modeling, Optimization and Simulation; Salvador, Bahia, Brazil, Sept 1-5, 2015

downloadable from http://xputer.de

© 2015, [email protected] http://xputer.de


>> Outline <<

http://www.uni-kl.de

•The Power Wall

•“Dataflow” Computing

•Reconfigurable Computing

•Time to Space Mapping

•The Xputer Paradigm

•Conclusions


The Workshop Series

Oldest conference series on power efficiency

Power efficiency is going to become an industry-wide issue

Some incremental improvements are on track, at all abstraction levels

However, there is still a lot to be done

http://hartenstein.de/PATMOS/

Spin-off from the PATMOS project:

•Project leader: Reiner Hartenstein
•Partner leader: Antonio Núñez
•Partner leader: Francis Jutand

Power-Efficient Computing

•Power-efficient Microchip Design
•Power-efficient Computer Architectures
•Power-efficient Languages and Compilers
•Power-efficient Software Implementation
•(Power-efficient Memory)
•Power-efficient Machine Paradigm

tutorial by Jan Rabaey

my presentation

ICT infrastructures

Data Center at Dallas (© New York Times)

Columbia river

Power consumption by internet: x30 until 2030 if trends continue

It’s more than the entire world’s total power consumption today!

G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008

Three tectonic shifts (© Hewlett Packard):

•the energy-constrained world
•from internet of people to internet of (every)thing
•the end of scaling as we know it

>> “Dataflow” Computing <<

Terminology Problems

Stressing the differences to „Control-Flow“ computers, an area called „Dataflow“ computers was started in the mid-70ies at MIT

… although the „Dataflow“ scene is „I-Structure“-centered, or „Tagged Token Flow“-centered

Xputer area people are forced to sidestep by using terms like „data-driven“ or „data streams“ …

A Second Opinion

D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982

The subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future."

(Workshops are still active, e. g.: 5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA)

However, the power efficiency breakthrough did not happen here

>> Reconfigurable Computing <<

Speed-ups by vN Software to FPGA Migrations

[Figure: speedup factor (log scale, up to 10^6) over the years 1985 to 2010; trend line x2/yr]

DSP and wireless: FFT 100; Reed-Solomon Decoding 2400; Viterbi Decoding 400; MAC 1000

Bioinformatics: Smith-Waterman pattern matching 288; molecular dynamics simulation 88; BLAST 52; protein identification 40; DNA & protein sequencing 8723 (power saving: 779)

Astrophysics: GRAPE 20

Image processing, Pattern matching, Multimedia: SPIHT wavelet-based image compression 457; real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; CT imaging 3000

Crypto: 1000; DES breaking 28514 (power saving: 3439)

Pre-FPGA solutions (Xputer), 1984, more than 15 years earlier: no FPGA, DPLA on MoM by TU-KL*: Design Rule Check accelerator PISA 15000 („fair comparison“); 1 DPLA replaces 256 FPGAs; fabricated by E.I.S. Multi University Project Chip

*) TU Kaiserslautern

Saving Power: power saving is ~10 % of the speed-up

http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf

http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html

The Reconfigurable Computing Paradox

These speed-ups are obtained although the effective integration density of FPGAs is 4 orders of magnitude behind the Gordon Moore curve, because of:

•wiring overhead
•reconfigurability overhead
•routing congestion

Why? The “von Neumann Syndrome” (C. V. Ramamoorthy)

von Neumann: an extremely power-inefficient paradigm

Reinvent Computing


Obstacles to widespread FPGA adoption go well beyond the required skill set

- Workshop at FPL_2015

http://reconfigurablecomputing4themasses.net/

What about Acceleration by Graphics Processors?

•Power saving mostly not documented
•R. Vuduc et al.*: „ … adding a GPU is equivalent to adding one more multicore CPU socket …”
•Drastically smaller speed-ups, if any at all (von Neumann)

*) R. Vuduc, J. Choi, M. Guney, A. Shringarpure: On the Limits of GPU Acceleration; Proc. HotPar'10, 2nd USENIX workshop on Hot Topics in Parallelism, June 14 - 15, 2010, Berkeley, CA, USA, USENIX Assoc. Berkeley, CA, USA http://newport.eecs.uci.edu/~amowli/resources/papers/vuduc2010-hotpar.pdf

>> Time to Space Mapping <<

Dual paradigm mind set: an old hat – but was ignored

Time to space mapping: procedural to structural

1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.

1972: C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972

PDP-16 RTMs:

[Figure: flip-flops (FF) passing a token bit that evokes the next step; a demultiplexer]

1971: loop to pipe mapping

„This is so simple …“ Why did it take 25 years to find out?

Tunnel Vision!

Note (Reiner): HDL scene ~1970: „decision box turns into demultiplexer“. “That’s so simple! Why did it take 30 years to find out?” Reductionists’ tunnel view. RTM available as a DEC product: 1973
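
The remark that a decision box turns into a demultiplexer can be pictured with a small Python sketch. It is only an illustration under assumptions: the function names `decide_in_time` and `demux_in_space` and the two-port layout are hypothetical and are not taken from the RTM literature.

```python
from typing import Callable, List, Tuple


def decide_in_time(items: List[int], cond: Callable[[int], bool]) -> Tuple[List[int], List[int]]:
    """Procedural view (time domain): a decision box branches the control flow."""
    taken: List[int] = []
    not_taken: List[int] = []
    for x in items:
        if cond(x):              # decision box: control flow takes one of two paths
            taken.append(x)
        else:
            not_taken.append(x)
    return taken, not_taken


def demux_in_space(items: List[int], cond: Callable[[int], bool]) -> Tuple[List[int], List[int]]:
    """Structural view (space domain): a demultiplexer routes each data item
    to one of two output ports; the select line is derived from the data."""
    ports: Tuple[List[int], List[int]] = ([], [])   # hypothetical ports 0 and 1
    for x in items:
        select = int(cond(x))    # select line of the demultiplexer
        ports[select].append(x)  # the item leaves through port `select`
    return ports[1], ports[0]


def is_even(x: int) -> bool:
    return x % 2 == 0
```

Both functions sort the same items into the same two groups; the difference is only whether the decision is read as a branch in time or as a routing choice in space.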

The Systolic Arrays (1)

1980: no instruction streams needed

M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips ... IEEE 7th ISCA, La Baule, France, May 6-8, 1980

[Figure: H. T. Kung’s example (algebra): input data streams enter a DPA (pipe network) and leave it as output data streams; axes: time and port #]

DPA: DataPath Array (array of DPUs), a pipe network

„Data streams“ define: ... which data item at which time at which port

Execution is transport-triggered
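
The definition of a data stream (which data item at which time at which port) can be written down as a small schedule. The following Python sketch is an assumption-laden illustration, not Kung's synthesis method: the one-step skew per port and the names `skewed_stream_schedule` and `items_at` are hypothetical.

```python
from typing import Dict, List, Tuple


def skewed_stream_schedule(rows: List[List[int]]) -> Dict[Tuple[int, int], int]:
    """Map every data item to a (time, port) slot.

    Assumed convention for this sketch: row i enters through port i, and
    each port starts one time step later than the previous port.
    """
    schedule: Dict[Tuple[int, int], int] = {}
    for port, row in enumerate(rows):
        for position, item in enumerate(row):
            schedule[(position + port, port)] = item  # time = position + skew
    return schedule


def items_at(schedule: Dict[Tuple[int, int], int], time: int) -> List[int]:
    """Items entering the array at one time step, ordered by port number."""
    slots = sorted(schedule.items(), key=lambda entry: entry[0][1])
    return [item for (t, _port), item in slots if t == time]


streams = skewed_stream_schedule([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

With this schedule no instruction stream is involved: the table of (time, port) slots alone determines what each port sees at each step.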

Systolic Arrays (2)

M. J. Foster and H. T. Kung: “The Design of Special-Purpose VLSI Chips ... “

Only special purpose? (Tunnel Vision)

Karl Steinbuch: It is not sufficient to invent something. You need to recognize that you have invented something.

http://karl-steinbuch.org

What Synthesis Method?

H. T. Kung*: „of course algebraic!“ (linear projection)

*) a mathematician

This supports only very special applications with strictly regular data dependencies: only linear pipes

Note: Mathematicians caught by their own paradigm trap: using only uniform linear pipes

My student Rainer Kress replaced it by simulated annealing: this also supports pipe networks of any irregular and wild shape

The super-systolic array: a generalization of the systolic array: KressArray [ASP-DAC-1995]

http://kressarray.de/

Now a general purpose methodology!

Who generates* the data streams?

*) or receives

H. T. Kung: “It’s not our job”

Another Tunnel Vision symptom: without a sequencer, the chance to invent a new machine paradigm (the Xputer) was missed

>> The Xputer Paradigm <<

The Xputer machine Paradigm

is the TU-KL’s symbiosis of Time to Space Mapping and Reconfigurable Computing!

obtained by adding auto-sequencing memory (ASM)

with data counters instead of a program counter

[Figure: several ASM blocks, each consisting of a GAG (data counter) and a RAM]

GAG: Generic Address Generator

Xputer literature
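
How a data counter can sequence a memory without a program counter may be sketched in a few lines of Python. This is a minimal software model under assumptions, not the TU-KL hardware: the names `gag_scan` and `asm_stream` and the stride parameters are hypothetical.

```python
from typing import Iterable, Iterator, List


def gag_scan(base: int, rows: int, cols: int, row_stride: int, col_stride: int = 1) -> Iterator[int]:
    """Hypothetical generic address generator: yields the address sequence
    of a nested data loop (row-major scan over a 2-D window)."""
    for r in range(rows):        # outer data loop
        for c in range(cols):    # inner data loop
            yield base + r * row_stride + c * col_stride


def asm_stream(ram: List[int], addresses: Iterable[int]) -> List[int]:
    """Auto-sequencing memory model: the RAM is read out in the order the
    data counter dictates, producing a data stream."""
    return [ram[a] for a in addresses]


# A 4 x 4 array stored row by row; scan a 2 x 2 window starting at address 5.
ram = list(range(100, 116))
window_addresses = list(gag_scan(base=5, rows=2, cols=2, row_stride=4))
window_stream = asm_stream(ram, window_addresses)
```

The address pattern is fixed by a handful of parameters, so no instruction fetch is needed to compute each address; in this sketch that is the role attributed to the GAG.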


Duality of procedural Languages

Software Languages (control-procedural):

•read next instruction
•goto (instruction address)
•jump to (instruction address)
•instruction loop
•instruction loop nesting
•instruction loop escape
•instruction stream branching
•no: no internally parallel loops

Flowware Languages (data-procedural), e. g. MoPL; easy to learn:

•read next data item
•goto (data address)
•jump to (data address)
•data loop
•data loop nesting
•data loop escape
•data stream branching
•yes: internally parallel loops

But there is an Asymmetry:

•state register(s): program counter (Software) vs. data counter(s) (Flowware)
•Flowware is more simple: no ALU tasks

Xputer literature

Xputer pages

Compilation: Software vs. Configware and Flowware

Software Engineering (procedural: time domain):

source program (C, etc.) → software compiler → software code

Configware Engineering (time to space mapping):

source „program“ → configware compiler:

•mapper → configware code (space domain)
•scheduler → flowware code (data; time domain)

Xputer literature

Heterogeneous: Co-Compilation

Software / Configware Co-Compiler: CoDe-X (Jürgen Becker’s Ph.D. thesis)

C, or other high level language → automatic SW / CW partitioner:

•software compiler → software code
•configware compiler: mapper → configware code; scheduler → flowware code (data)

Xputer pages

important: why?

>> Conclusions <<

Illustrating the Paradigm Trap (1)

the watering can model [Hartenstein]

The von Neumann single core Approach: The Memory Wall (crippled watering can)

The von Neumann Manycore Approach: many von Neumann bottle-necks (many crippled watering cans)

Illustrating the Paradigm Trap (2)

The “Dataflow“ Computer: extremely complicated, no watering can model!

(a power efficiency breakthrough did not happen here)

The Xputer Paradigm

Xputer: the only massively power-efficient Paradigm

the watering can model [Hartenstein]: has no von Neumann bottle-neck

fully supporting Reconfigurable Computing

We need a Seismic Shift …

… to avoid future unaffordable ICT power consumption cost

It’s an extremely massive challenge!

The software from more than half a century sits squarely on top

For many more years we must work under a heterogeneous triple-paradigm mind set: Configware, Flowware, and still even Software

that’s why heterogeneous is important

thank you !

END

Backup for discussion:

Reconfigurable Computing (RC): the intensive Impact

Speed-ups by von Neumann to RC Migrations: SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster [Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]

Application                   Speed-up factor   Savings divisor: Power   Cost   Size
DNA and Protein sequencing    8723              779                      22     253
DES breaking                  28514             3439                     96     1116

•massively saving energy
•much less equipment needed, i. e. instead of a hangar full of racks: just one or half a rack without air conditioning
•much less memory and bandwidth needed, i. e. the data mostly fits in RAM on board of the processor chip: orders of magnitude less bandwidth
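
How the power savings relate to the speed-ups can be checked with a few lines of arithmetic. The Python sketch below only uses the speed-up and power figures quoted above for the two applications; the helper name `power_share` is hypothetical.

```python
# Speed-up factors and power-saving factors as quoted for the two migrations.
migrations = {
    "DNA and protein sequencing": {"speedup": 8723, "power": 779},
    "DES breaking": {"speedup": 28514, "power": 3439},
}


def power_share(speedup: float, power_saving: float) -> float:
    """Power-saving factor expressed as a fraction of the speed-up factor."""
    return power_saving / speedup


shares = {
    name: power_share(entry["speedup"], entry["power"])
    for name, entry in migrations.items()
}

for name, share in shares.items():
    print(f"{name}: power saving is {share:.1%} of the speed-up")
```

For both rows the power-saving factor comes out at roughly a tenth of the speed-up factor.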

Taxonomy

•Flynn’s taxonomy: von Neumann only
•Diana’s taxonomy: Reconfigurable Computing
•Reiner’s taxonomy: heterogeneous systems
•Reiner’s 2nd taxonomy: Xputers only

[Figure: array of LUTs (LUT = look-up table)]

First FPGA available 1984 from Xilinx

Transformations since the 70ies

time domain (procedure domain) → space domain (structure domain)

time algorithm: program loop, n x k time steps, 1 CPU

space/time algorithm (Strip Mining Transformation): pipeline, k time steps, n DPUs

(time to time/space mapping)

Loop Transformations: rich methodology published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]

Note (Reiner): The theory of relativity deals with the structure of space and time. The special theory of relativity deals with the relativity of space and time.
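
The step counts quoted for the strip mining transformation (n x k time steps on 1 CPU versus k time steps on n DPUs) can be reproduced with a small Python model. It is a sketch under assumptions: the DPUs of one strip are simulated one after another, pipeline fill latency is ignored, and the names `run_on_cpu` and `run_strip_mined` are hypothetical.

```python
from typing import Callable, List, Tuple


def run_on_cpu(data: List[int], op: Callable[[int], int]) -> Tuple[List[int], int]:
    """Time domain: one CPU executes one operation per time step."""
    result: List[int] = []
    steps = 0
    for x in data:
        result.append(op(x))
        steps += 1
    return result, steps


def run_strip_mined(data: List[int], op: Callable[[int], int], n_dpus: int) -> Tuple[List[int], int]:
    """Time/space domain: the loop is cut into strips of n_dpus items and
    each strip is counted as one time step on n_dpus DPUs side by side."""
    result: List[int] = []
    steps = 0
    for start in range(0, len(data), n_dpus):
        strip = data[start:start + n_dpus]
        result.extend(op(x) for x in strip)  # conceptually parallel in space
        steps += 1
    return result, steps


def square(x: int) -> int:
    return x * x


n, k = 4, 3
cpu_result, cpu_steps = run_on_cpu(list(range(n * k)), square)
dpa_result, dpa_steps = run_strip_mined(list(range(n * k)), square, n_dpus=n)
```

Both runs compute the same values; only the number of time steps differs, because part of the loop has been moved from time into space.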

The Reconfigurable Computing Paradox

Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing.

Reinvent Computing

although the effective integration density of FPGAs is 4 orders of magnitude behind the Gordon Moore curve, because of:

•wiring overhead
•reconfigurability overhead
•routing congestion
•etc.

ASM Data stream generator usage

Xputer machine paradigm (Xputer pages)

[Figure: an rDPA (pipe network) surrounded by ASM blocks that generate its input data streams and receive its output data streams]

ASM: Auto-Sequencing Memory (datacounter, GAG, RAM): use data counters, no program counter; implemented by distributed on-chip memory

GAG: Generic Address Generators (reconfigurable: to avoid memory-cycle-hungry address computation)

general purpose, reconfigurable

also more irregular data routes supported

Paradigm Shift Consequences

von Neumann (Software Engineering): 1 programming source needed

•CPU with Program Counter (PC)
•resources: fixed
•algorithm: variable (software)

Xputer (Configware Engineering): 2 programming sources needed

•resources: variable (configware): Configuration Code (CC)
•algorithm: variable (flowware): Data Counters (DCs), sequencing code (e. g. see MoPL language)

Xputer and vN (Heterogeneous Engineering): all 3 programming sources needed

Xputer literature

Pipelining through DPU Arrays: the TU-KL Xputer principle

DPA = DPU array; DPU = Data Path Unit

[Figure: input data streams enter the DPA and leave it as output data streams]

•DPU operation is transport-triggered
•no instruction streams
•no message passing, nor thru common memory
•massively avoiding memory cycles
•no memory wall

Von Neumann Syndrome

Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; ßetascript publishing 2011

Shifting to the dominance of von-Neumann-only caused a cumulative damage of at least trillions of dollars, if not quadrillions …

Computing Paradigms

term             program counter         execution triggered by             paradigm
CPU              yes (program counter)   instruction fetch                  instruction-stream-based (von Neumann)
rDPU / RPU       no (data counters)      data arrival**                     data-driven or data-stream-based (Xputer, Reconfigurable Computing)
DPUs of a DFC*   no                      complicated I-structure handling   “Dataflow” Computer (I-structure)

*) DFC: “Dataflow” Computer, based on tagged token “I-Structure”

**) “transport-triggered”

MIT Tagged Token Dataflow Architecture*

I would call it Tagged Token Flow Architecture

*) source: Jurij Silc

[Figure: PEs and I-Structure Storage units connected by a Communication Network]

PE: Wait-Match Unit & Waiting Token Store; Instruction Fetch Unit; Program Store & Constant Store; ALU & Form Tag; Form Token Unit; Token Queue; RU and SU to/from the Communication Network

•no Program Counter
•no updateable global store
•„instruction is executed even if some of its operands are not yet available“ (?)

Problems with such “Dataflow” Architectures

very complex data structures!

“each update consumes the structure and the value produces a new data structure”

“awkward or even impossible to implement”

Solution: I-Structure concept*

*) „I-Structure Flow“ instead of „Dataflow“ (I-Structure = Tagged Token Structure; I would call it „Tagged Token Flow“)

States of an I-Structure element [source: Jurij Silc]:

•“can be read but not written”
•“read request deferred but write operation is allowed”
•“at least one read request has been deferred”
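
The three quoted element states can be made concrete with a small Python model of one write-once element with deferred reads. This is a sketch under assumptions, not the MIT implementation: the class name `IStructureCell`, the state labels `absent`, `waiting` and `present`, and the callback interface are choices of this illustration.

```python
from typing import Any, Callable, List


class IStructureCell:
    """One write-once storage element with deferred reads (illustrative model).

    absent  : empty, a read is deferred but a write is allowed
    waiting : empty, at least one read request has been deferred
    present : written, can be read but not written
    """

    def __init__(self) -> None:
        self.state = "absent"
        self.value: Any = None
        self._deferred: List[Callable[[Any], None]] = []

    def read(self, consumer: Callable[[Any], None]) -> None:
        if self.state == "present":
            consumer(self.value)             # value available: deliver at once
        else:
            self._deferred.append(consumer)  # defer the read request
            self.state = "waiting"

    def write(self, value: Any) -> None:
        if self.state == "present":
            raise RuntimeError("element is write-once")
        self.value = value
        self.state = "present"
        for consumer in self._deferred:      # serve all deferred reads
            consumer(value)
        self._deferred.clear()
```

Even this toy version needs a state machine and a queue per storage element, which hints at why such data structure handling is called complicated here.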

I-Structures (I = incremental) - part 1

[Figure from Jurij Silc: Dataflow Architectures, http://csd.ijs.si/courses/dataflow/]

I-Structures - part 2

MIT Tagged-Token Dataflow Architecture (PE): I-structure select, I-structure assign

[Figure from Jurij Silc: Dataflow Architectures, http://csd.ijs.si/courses/dataflow/]

Power Efficiency of Programming Languages (an example)

How big is big ?

© Hewlett Packard