The Shape of Things to Come: Future Trends in HPC Architectures
Peter M. Kogge
Associate Dean for Research
McCourtney Prof. of CS & Engr
University of Notre Dame
IBM Fellow (retired)
My View: Future HPC Evolutionary Paths Are Multiplying
• Today: "Killer Micros" are becoming "physics-limited", very hungry multi-core monsters
• Maturing multi-threading & tiling are providing more nimble systems
• Is there an alternative evolutionary path we've ignored?
My Concern: We're Focused on the Wrong Aspect of the Wall
What about bandwidth?
[Chart: today vs. future trend of the memory wall for the Trilinos application; the computation-only view shows just a ~7% performance difference. NOTE: accounts only for computation (not MPI)! Chart courtesy Richard Murphy, SNL]
And Perhaps Missing Another Wall
Does supplying energy and getting rid of heat dominate area?
It Also Bothers Me That:
• Modern microprocessor state is growing as Moore's Law
  – Regardless of the number of computational units
• Memory is as dumb as it was 50 years ago
• We insist on giving persistent names to the tarballs representing the physical cores
• And go to great extremes to separate the persistent names of memory from its location
• Newer classes of apps "visit" data irregularly
  – Where "caching" copies is wasted energy
The Way We Were
The Historical Top 10
[Chart: GFlops vs. time, 1/1/72 through 1/1/20 (log scale, 1.E-01 to 1.E+10); series: Historical Rmax, Rmax, Rmax Leading Edge, Rpeak Leading Edge; CAGR = 1.9]
Clock Rates
[Chart 1: Clock (MHz) vs. year, 1975-2020; series: Historical, ITRS Max Clock Rate (12 inverters)]
[Chart 2: Clock (GHz) vs. date, 1/1/93 through 12/29/06; series: Top 10, Top System, Microprocessors]
Processor Parallelism
[Chart: processor parallelism vs. date, 1/1/93 through 12/29/06 (log scale, 1.E+00 to 1.E+06); series: Top 10, Top System]
Concurrency: Flops per Cycle
[Chart: total concurrency vs. date, 1/1/93 through 12/29/06 (log scale, 1.E+00 to 1.E+07); series: Top 10, Top System, Top 1 Trend; CAGR = 1.65]
The Moore's Law We Know & Love (Knew)
• Goal: 4X functionality every 3 years
• Underlying technology improvement:
  – Growth in transistor density [Yes]
  – Growth in transistor switching speed [Yes]
  – Growth in size of producible die [Not in commercial volumes]
• Microprocessors: Functionality = IPS
  – ~1/2 from higher clock rate [No: heat]
  – ~1/2 from more complex microarchitectures [No: complexity]
• Memory: Functionality = storage capacity
  – ~2X from smaller transistors [Yes]
  – Shrinkage in architecture of basic bit cell [Yes, but ..]
  – Increase in die size [Not at commercially viable prices]
• And it is silent on inter-chip I/O
(A quick growth-rate check follows.)
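To put the deck's growth rates on one footing, here is a quick back-of-the-envelope check (my arithmetic, using the CAGR figures from the Top-10 charts above): 4X functionality every 3 years is a CAGR of about 1.59, below the Rmax CAGR of 1.9 and near the concurrency CAGR of 1.65.

```python
import math

# Growth-rate arithmetic; the 1.9 and 1.65 CAGRs are from the charts above.
def cagr(factor: float, years: float) -> float:
    """Compound annual growth rate implied by `factor` growth over `years`."""
    return factor ** (1.0 / years)

def doubling_years(rate: float) -> float:
    """Years to double at a given CAGR."""
    return math.log(2.0) / math.log(rate)

print(f"Moore's-Law CAGR (4X / 3 yr): {cagr(4.0, 3.0):.2f}")           # ~1.59
print(f"Rmax doubling time:           {doubling_years(1.90):.2f} yr")  # ~1.08
print(f"Concurrency doubling time:    {doubling_years(1.65):.2f} yr")  # ~1.38
```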
The Darwinian Multi-Core Evolution
[Figure: contrasting single-core designs up to ~2002 with multi-core designs now]
Area Scaling Alone Reveals the Rationale for Multi-Core
[Chart: single-core projected die size (sq. mm) vs. year, 1970-2020, against the ITRS projected economic die size. Each line represents the scaling of a unique real microprocessor chip from its inception.]
How Many Can We Fit on a cm2?
Assume we scale the entire current single-core chip & replicate it to fill a 280 sq. mm die.
[Chart: number of uP per square centimeter vs. year, 1970-2020]
Answer: potentially 1000s!
And a Flood Tide of Recent Announcements
[Chart: # of new multi-core announcements per year, 1992-2006 (0 to 30)]
They are ALL multi-core now.
And Not Just "Twosies"
[Chart: # of cores/die in new announcements vs. year, 1991-2006 (log scale, 1 to 10,000)]
The Classical Limiting Factors for Microprocessor Chips: Power & Contacts
Peak Logic Clock Rates
[Charts: Clock (MHz) vs. year (1975-2020) and vs. feature size; series: Historical, ITRS Max Clock Rate (12 inverters); both show the historical curve departing from the classical Moore's Law trend and flattening near 3 GHz]
The 2005 projection was for 5.2 GHz, and we didn't make it in production. Further, we're still stuck at 3+ GHz in production.
Why the Clock Flattening? POWER
[Charts: watts per die (1 to 1000) and watts per square cm (0.1 to 1000) vs. year, 1976-2006, with reference points for a light bulb, an iron, and a rocket nozzle. Hot, Hot, Hot!]
Because Vdd Is No Longer Declining
[Chart: Vdd vs. year, 1970-2020 (0 to 6 V), flattening out]
Multi-core Power and Clock

Chip_Power = Cap_per_device * Devices_per_core * Cores_per_chip * Clock * Vdd^2

• Chip_Power: the max limit will grow only slightly, reaching an asymptotic limit
• Cap_per_device: decreasing ~linearly with technology
• Devices_per_core: assume constant for multi-core
• Cores_per_chip: increasing as the square with technology
• Vdd: no longer declining (previous slide)
• Clock: the max rate grows rapidly with technology, but it is the ONLY KNOB to balance the equation!!
Rewriting for Clock

Clock = Max_chip_power(T) * Reduction_in_core_area / (Cap_per_device * Vdd^2)

This now governs core frequency. Not faster transistors!!!
(A numeric sketch follows.)
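A minimal numeric sketch of this relation, assuming illustrative 2004-ish parameters and scaling factors (none of these numbers are from the deck): with Vdd stuck and device count growing faster than per-device capacitance shrinks, the power-limited clock must fall.

```python
# Illustrative power-limited clock, solving P = C * N * f * V^2 for f.
# All parameter values are assumptions for demonstration, not ITRS data.

def power_limited_clock(max_power_w: float, cap_per_device_f: float,
                        switching_devices: float, vdd_v: float) -> float:
    """Clock frequency (Hz) that just balances the chip power equation."""
    return max_power_w / (cap_per_device_f * switching_devices * vdd_v ** 2)

# Assumed baseline: 100W budget, 1 fF/device, 2e7 activity-weighted
# switching devices, Vdd = 1.2V.
f0 = power_limited_clock(100.0, 1.0e-15, 2.0e7, 1.2)
# One node later: 2x the devices, capacitance only ~0.7x, Vdd flat.
f1 = power_limited_clock(100.0, 0.7e-15, 4.0e7, 1.2)
print(f"baseline ~ {f0/1e9:.2f} GHz, next node ~ {f1/1e9:.2f} GHz")
# Doubling devices while capacitance shrinks only 0.7x forces the
# power-limited clock down by ~0.71x: the clock is the only knob.
```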
Relative Change In Factors
[Chart: factors relative to 2004, 2004-2020 (0.0 to 1.4); series: Max Power (N), Area (N), Cap per Device (D), Vdd (D), and the resulting Power-Limited Clock]
What Kind of Core Should We Replicate?
[Surface plots over relative clock (1.0-4.0) and issue width (1-10):
 1. Relative IPS: complex designs give the most performance
 2. Relative area: but also take the largest area
 3. IPS per unit area: but simpler designs give better performance/area]
Simpler is Better
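A toy model of the tradeoff those surfaces show (the exponents are my illustrative assumptions, not the data behind the plots): if IPS grows sub-linearly with issue width while area grows roughly quadratically with it, performance per unit area peaks at narrow issue widths.

```python
# Toy core-replication tradeoff. Exponents are illustrative assumptions:
# IPS ~ clock * width^0.5 (diminishing ILP returns),
# area ~ clock * width^2  (wide issue is area-hungry).

def relative_ips(clock: float, width: float) -> float:
    return clock * width ** 0.5

def relative_area(clock: float, width: float) -> float:
    return clock * width ** 2

for width in (1, 2, 4, 8):
    ips = relative_ips(2.0, width)
    area = relative_area(2.0, width)
    print(f"width={width}: IPS={ips:.2f}, area={area:.1f}, IPS/area={ips/area:.3f}")
# The widest issue wins on raw IPS but loses badly on IPS per unit area,
# which is the metric that matters once you are replicating cores.
```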
What About Memory Bus Clocks?
[Chart: MHz vs. year, 1990-2020; series: historical Intel memory bus rate, historical Intel CPU clock, ITRS projected max on-chip and off-chip clocks, the clock for constant power density, 0.3x of the power-limited clock, and an assumed projected memory rate]
Does Logic Performance Match Off-chip Bandwidth Potential?
[Chart: growth factor over 2004, 2004-2020 (log scale, 1 to 1,000); series: ITRS signal pads per high-performance uP, ball-bond contacts per sq. cm, signal pads * modified off-chip clock, transistor density * power-limited clock]
A Growing Mismatch!
The Multi-Core Family Tree
This May Be the Architecture You Think of for Multi-Core
[Diagrams of three multi-core organizations:
 (a) Hierarchical designs: cores with private caches sharing a cache/memory
 (b) Pipelined designs: a chain of cores streaming past memory into a cache/memory
 (c) Array designs: replicated core + cache/memory tiles under common interconnect & control]
• Hierarchical: Intel Core Duo, IBM Power5, AMD Opteron, Sun Niagara, ...
  – External bandwidth = sum of escapes from the cores
• Pipelined: IBM Cell, most router chips, many video chips
• Array: Terasys, Execube, Yukon, Intel Teraflop
But There's at Least One Approach with Lower Bandwidth Needs
[Same three diagrams, here highlighting (b) pipelined designs]
• Hierarchical: Intel Core Duo, IBM Power5, Sun Niagara, ...
• Pipelined: most router chips, many video chips, some aspects of IBM Cell
  – External bandwidth largely independent of the # of cores
• Array: Terasys, Execube, Yukon, Intel Teraflop
And Then There Are Array Approaches That Provide Significant Internal Memory
[Same three diagrams, here highlighting (c) array designs]
• Hierarchical: Intel Core Duo, IBM Power5, Sun Niagara, ...
• Pipelined: IBM Cell, most router chips, many video chips
• Array: Terasys, Execube, Yukon, Intel Teraflop, some aspects of Cell
  – Particularly effective for weak-scaling apps
And Today's Memory Architecture Is Evolving to Feed the Beast
[Diagram: microprocessor(s) attached through a North Bridge memory controller to the memory interface; now multiple microprocessors share the path]
State of the art peak aggregate bandwidth: ~6.4 GB/s
... But Not to Reduce Latency
[Diagram: a daisy chain of AMB (Advanced Memory Buffer) chips between the controller and the DRAMs]
We've introduced 16 extra chip crossings!
... And at ~2X power increase
A Simple Case Study
A Modern HPC System
Computational board:
• 4 PE nodes; each PE node:
  – Dual-core Opteron @ 2.6 GHz
  – 4 DDR2 2GB DIMMs
• 4 routers per board
Key ratios (all "peak"):
• 2 flops per cycle per core
• 1.5 B per flop
• 1.25 B/s of memory BW per flop per core
• 0.25 B/s of link BW per flop per PE
• 0.06-0.25 B/s of bisection BW per flop
(A quick arithmetic check follows.)
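A quick sanity check of these ratios (my arithmetic; interpreting "per flop" against per-core peak flops, and taking the ~6.4 GB/s aggregate memory bandwidth from the earlier slide):

```python
# Sanity check of the "peak" ratios above (my arithmetic; the 6.4 GB/s
# node memory bandwidth is the aggregate figure quoted a few slides back).
CLOCK_GHZ, FLOPS_PER_CYCLE, CORES = 2.6, 2, 2
MEM_GB = 4 * 2                 # four 2GB DDR2 DIMMs per PE node
MEM_BW_GBS = 6.4               # assumed peak aggregate bandwidth per node

gf_per_core = CLOCK_GHZ * FLOPS_PER_CYCLE                    # 5.2 GFlop/s
print(f"peak per node: {gf_per_core * CORES:.1f} GFlop/s")   # 10.4
print(f"B per flop:    {MEM_GB / gf_per_core:.2f}")          # ~1.54 -> "1.5"
print(f"B/s per flop:  {MEM_BW_GBS / gf_per_core:.2f}")      # ~1.23 -> "1.25"
```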
What Are We Doing with the Total System Silicon?
Silicon area distribution: Memory 86%, Processors 3%, Routers 3%, Random 8%
Power distribution: Memory 9%, Processors 56%, Routers 33%, Random 2%
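One implication worth making explicit (my arithmetic from the two distributions above): the processors dissipate over half the power in 3% of the silicon, so their power density is roughly 180X the memory's.

```python
# Power density implied by the two distributions above (relative units,
# where 1.0 is the system-wide average).
area  = {"memory": 0.86, "processors": 0.03, "routers": 0.03, "random": 0.08}
power = {"memory": 0.09, "processors": 0.56, "routers": 0.33, "random": 0.02}

density = {k: power[k] / area[k] for k in area}
print(density)  # processors ~18.7, memory ~0.10
print(f"processor/memory power density ratio: "
      f"{density['processors'] / density['memory']:.0f}x")   # ~178x
```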
What Is the Board Space Utilization Like?
Board space distribution: Memory 10%, Processors 24%, Routers 8%, Random 8%, White Space 50%
A Dual Core Processor Chip
http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg
Die area breakdown: 2 cores 56.2%, 1 memory controller 19.4%, 3 HT links 5.4%, other 19.0%
[Chart: area (sq. mm) vs. year, 2004-2020; series: single core area, single HT area, single memory controller]
Some Projections
• Off-chip memory controls performance
• IPC/core is more sensitive to latency than to bandwidth
• "Flat" off-chip physical latency => relative latency grows with the clock
[Chart: single-core performance factors relative to 2004, 2004-2020; series: clock growth, relative IPC change, relative IPS; annotated values: 3.08X increase, 82% increase, 73% drop, 48% drop, against a 1.0 baseline. Ack. R. Murphy, SNL]
Where Does This Lead Us?
• Use the density increase to replicate cores
• Keep the clock flat to minimize power
• Still need additional I/O for both bandwidth & latency management (reduce queuing delays by multiple banks)
[Chart: number of cores vs. year, 2002-2022 (0 to 120); series: just cores vs. cores with HT & DDR controllers; the gap between them is unproductive silicon]
(A rough sketch of such a projection follows.)
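A rough sketch of how such a projection can be generated (every parameter below is an illustrative assumption, not the deck's data): hold die size and the per-core design fixed, scale density with the technology node, and reserve a fixed slice of the die for HT links and memory controllers.

```python
# Illustrative core-count projection under pure area scaling.
# Every number here is an assumption for demonstration, not ITRS data.
DIE_MM2 = 280.0          # economic die size used earlier in the deck
CORE_MM2_2004 = 60.0     # assumed area of one 2004 core (incl. its cache)
IO_MM2 = 40.0            # assumed fixed budget for HT links + DDR controllers

for year in range(2004, 2021, 4):
    shrink = 0.5 ** ((year - 2004) / 3.0)   # assume 2x density every 3 years
    core_mm2 = CORE_MM2_2004 * shrink
    just_cores = int(DIE_MM2 / core_mm2)
    with_io = int((DIE_MM2 - IO_MM2) / core_mm2)
    print(f"{year}: just cores {just_cores:4d}, with HT & DDR ctl {with_io:4d}")
```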
So What May This Mean to the Top 500?
[Chart: GFlops vs. time, 1/1/72 through 1/1/20 (log scale, 1.E-01 to 1.E+10); series: Historical Rmax, Rmax, Rmax Leading Edge, Rpeak Leading Edge, Evolutionary Heavy-Node Projection]
The Emergence of More Organized Architectures
Tiling & Local Memory Regularizes Layout, Lowers Latency, Reduces Off-Chip Bandwidth Needs
• Works well with partitionable algorithms
• Good fit for applications that support weak scaling
• Inter-core communication DOES NOT USE CONTACTS
• Compiling problem: placement of kernels AND data structures to minimize inter-core bandwidth
• Problems with global synchronization
Multi-Threading
• Provides explicit latency hiding
• Permits simpler cores with more efficient use of data flow
• Increases the potential for memory references "in flight"
• Shares the path to memory
• But still doesn't help "single thread" performance in terms of chained memory references
• Nor does it reduce off-chip bandwidth (and contacts)
(A back-of-the-envelope latency-hiding estimate follows.)
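To make "latency hiding" concrete, here is a standard Little's-law estimate (the numbers are my own, for illustration): the references a core must keep in flight equal the memory latency times the off-chip request rate it wants to sustain.

```python
# Little's law: outstanding_refs = latency * throughput.
# Parameter values below are illustrative assumptions.
LATENCY_NS = 100.0            # assumed round-trip memory latency
CLOCK_GHZ = 3.0
MISSES_PER_1000_INSTR = 10    # assumed off-chip reference rate
IPC_TARGET = 1.0

refs_per_ns = CLOCK_GHZ * IPC_TARGET * MISSES_PER_1000_INSTR / 1000.0
outstanding = LATENCY_NS * refs_per_ns
print(f"references that must be in flight: {outstanding:.0f}")   # 3
# A single thread with one or two outstanding misses stalls; with N
# threads per core the hardware can keep N references in flight instead.
```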
A Brief History of Multi-threaded Processors
[Timeline: relevant features (0 to 7) vs. year, 1960-2010; entries: CDC 6600, Space Shuttle IOP, HEP, J-Machine, Horizon, MTA, HTMT, PIM Lite, Hyper-Threading (P5, U4), Niagara, Eldorado]
Sun's Niagara
• 8 four-way multi-threaded, single-issue cores
• 3MB, 12-bank shared L2
• 4 DDR2 memory interfaces
• Measured 5.76 IPC vs. a peak of 8 on the Java Business benchmark
• 63W @ 90nm (2W cores)
Die area breakdown: cores 37%, L2 21%, FPU 2%, crossbar 3%, DDR2 interfaces 11%, other functions 3%, remainder 23%
[Chart: area (sq. mm) vs. year, 2004-2020; series: single core area, entire L2 area, single DDR2 I/F area, crossbar]
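Two numbers fall straight out of this slide (my arithmetic): measured IPC is 72% of peak, and the eight 2W cores draw only about a quarter of the 63W budget, the rest going to the L2, interfaces, and I/O.

```python
# Derived directly from the Niagara figures above.
print(f"IPC utilization:  {5.76 / 8:.0%}")    # 72% of the 8-IPC peak
print(f"core power share: {8 * 2 / 63:.0%}")  # 8 cores x 2W out of 63W, ~25%
```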
Cray’s XMT
Supports 128 Threads/core
John Feo, David Harper, Simon Kahan, Petr Konecny, “Eldorado”, Computing Frontiers, 2005
Some Interesting Comparisons

Core        L1   FPU   Area    pJ
Niagara-I   24   No    11.92   1719
Niagara-II  24   yes   23.85   2364
MIPS64      64   yes    9.59
MIPS64      40   No             436

So multi-threading is not free.
Problems Still Remain
• Programming models not changed
• States still very heavy
• Compiling to specific cores
• Data partitioning
• Problems with coherency
• Doesn't address barriers, sync points, ...
• Doesn't help emerging low-reuse apps
  – AMR
  – Data mining
  – Graph traversals
  – Non-numeric solvers such as SAT
Are We Ready for a Mutation?
Ideas
• Ultra-lightweight "butterflies" take functions to the data "flowers"
  – A memory reference becomes a "traveling threadlet"
• But, like flowers, data can respond to the touch of the butterfly
  – Add a small amount of metadata to each word
• Finally, it's the "flowers" whose location is important
Adding Metadata to the Memory
• "Special values"
  – Uninitialized, error code, null
• Full/empty bits
  – And multiple flavors of "empty"
  – Esp. "empty pending outstanding value"
  – Greatly simplifies producer/consumer
• Forwarding
• Locked
• Traps
• Especially interesting when aliased to thread-state registers
(A sketch of these semantics follows.)
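A minimal sketch of full/empty-bit producer/consumer semantics, assuming a three-state word (full, empty, empty-pending-outstanding-value); the class and method names are mine for illustration, and real hardware does this per memory word, in the memory system itself rather than with locks.

```python
# Toy full/empty-bit word (illustrative only; state names are assumptions).
import threading

class FEWord:
    FULL, EMPTY, EMPTY_PENDING = "full", "empty", "empty-pending"

    def __init__(self):
        self.state, self.value = self.EMPTY, None
        self.cv = threading.Condition()

    def write_full(self, v):
        """Producer: store a value and mark the word full."""
        with self.cv:
            self.value, self.state = v, self.FULL
            self.cv.notify_all()

    def read_to_empty(self):
        """Consumer: atomically read a full word and mark it empty.
        The empty-pending state records that a reader is waiting."""
        with self.cv:
            while self.state != self.FULL:
                self.state = self.EMPTY_PENDING
                self.cv.wait()
            self.state = self.EMPTY
            return self.value

w = FEWord()
threading.Thread(target=lambda: w.write_full(42)).start()
print(w.read_to_empty())  # blocks until the producer writes, then prints 42
```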
Full/Empty Bits & MPI
Ack. A. Rodrigues, SNL
One Step Further: Allowing the Threads to Travel
• "Overprovision" memory with huge numbers of anonymous execution sites
  – Place at the bottom of, or near, memory
• Reduce the state of a thread to a memory reference
• Make creating a new thread "near" some memory a cheap operation
• Allow a thread to "move" to a new site when locality demands
• Don't require the target to maintain code
Latency reduced by huge factors
"Piglet" Processing at the Base of Memory
Threadlet format: target address, operands & working registers, code, PC, additional data payload
[Diagram: a "classical" host CPU node (heavyweight-ISA processing, cache, network interconnect) alongside memory nodes, each pairing address management and piglet processing with every memory bank]
(A sketch of this format as a record follows.)
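The threadlet format above maps naturally onto a record; this is a hypothetical encoding of exactly the fields the slide lists (field names and sizes are mine, not an actual parcel format).

```python
# Hypothetical encoding of the threadlet fields listed above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Threadlet:
    target_address: int            # where the threadlet travels next
    registers: List[int]           # operands & working registers
    code: bytes                    # the (tiny) program it carries
    pc: int = 0                    # program counter into `code`
    payload: List[int] = field(default_factory=list)  # additional data

    def total_state_bytes(self) -> int:
        """A threadlet's entire state is all that crosses the network."""
        return 8 + 8 * len(self.registers) + len(self.code) + 8 * len(self.payload)

t = Threadlet(target_address=0x1000, registers=[0, 0, 0], code=b"\x01\x02")
print(t.total_state_bytes())   # a few tens of bytes, vs. kilobytes of core state
```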
Types of Piglet Programs
• Classical memory operations
• Atomic Memory Operations
• Short Vector to Memory
• “Object-oriented” method evaluation at the object
• Small slices of programs
Example: AMO
• AMO = Atomic Memory Operation
  – Update some memory location
  – With guaranteed no interference
  – And return the result
• Parcel registers: A = Address, D = Data, R = Return Address
• Sample code:

        MOVE                  ; travel to the memory at address A
    L1: LOCK & LOAD           ; atomic update "at the memory"
        OP
        STORE & RELEASE L1
        SWAP R, A             ; the return address becomes the new target
        MOVE                  ; travel back
        STORE                 ; return the result
        QUIT

Bottom line: 2 network transactions rather than up to 6!
Vector Add (Z[I] = X[I] + Y[I]) via Threadlets
[Diagram: Type 1 threadlets stride through Q elements of X memory, accumulating the Q X values in their payload, then spawn Type 2 threadlets; a Type 2 threadlet fetches the Q matching Y values, adds them to the X values, saves the sums in its payload, and stores the Q Z values (Type 3)]
Transaction reduction factor (a counting model follows):
• 1.66X (Q=1)
• 10X (Q=6)
• up to 50X (Q=30)
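The quoted reduction factors are consistent with a simple counting model (my reconstruction, so the per-element constants are assumptions): roughly five network transactions per element conventionally (read-X request and reply, read-Y request and reply, write-Z) versus three parcel hops per Q-element block.

```python
# Counting model that reproduces the slide's reduction factors
# (assumed constants: 5 conventional transactions/element, 3 hops/block).
def reduction(q: int) -> float:
    conventional = 5 * q     # per-element loads/stores with their replies
    threadlet_hops = 3       # X-region hop, X->Y hop, Y->Z hop per block
    return conventional / threadlet_hops

for q in (1, 6, 30):
    print(f"Q={q:2d}: {reduction(q):.2f}x")   # 1.67x, 10x, 50x
```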
Conclusions
Conclusions
• (Hierarchical) multi-core has taken over
  – But clock rate will be limited by power
  – And the # of useable cores by contacts
• Simpler cores are more area/energy efficient
  – But we can't use all of them in hierarchical architectures
• Latency will stifle single-thread performance
• Multi-threading provides better utilization
  – But at an energy cost
• Pipelined/array chips reduce the need for off-chip bandwidth
  – But then run into the power-limited clock problem
  – And require 2D partitioning of data and code
• Are there alternatives that don't fix code to cores?

BEST HPC architecture != best commodity architecture
A Personal Goal
[Diagram: "PIM DIMMs" built from grids of PIM chips, grouped into "PIM clusters" joined by an interconnection network to a "host" and I/O]
• Huge increase in silicon per board
• Level out power dissipation
The Future
Will We Design Like This? Or This?