The Shape of Things to Come: Future Trends in HPC Architectures
Peter M. Kogge
Associate Dean for Research
McCourtney Prof. of CS & Engr
University of Notre Dame
IBM Fellow (retired)
My View: Future HPC Evolutionary Paths Are Multiplying
• Today: "Killer Micros" are becoming "physics-limited", very hungry multi-core monsters
• Maturing multi-threading & tiling are providing more nimble systems
• Is there an alternative evolutionary path we've ignored?
My Concern: We're Focused on the Wrong Aspect of the Wall
What about bandwidth?
[Chart: today vs. future trend of the memory wall for the Trilinos application; the computation-only view shows just a ~7% performance difference. NOTE: accounts only for computation (not MPI)! Chart courtesy Richard Murphy, SNL]
And Perhaps Missing Another Wall
Does supplying energy and getting rid of heat dominate area?
It Also Bothers Me That:
• Modern microprocessor state is growing as Moore's Law
  – Regardless of the number of computational units
• Memory is as dumb as it was 50 years ago
• We insist on giving persistent names to the tarballs representing the physical cores
• And go to great extremes to separate the persistent names of memory from its location
• Newer classes of apps "visit" data irregularly
  – Where "caching" copies is wasted energy
The Way We Were
The Historical Top 10
[Chart: GFlops vs. time, 1/1/72 through 1/1/20 (log scale, 1.E-01 to 1.E+10); series: Historical Rmax, Rmax, Rmax Leading Edge, Rpeak Leading Edge; CAGR = 1.9]
Clock Rates
[Chart 1: Clock (MHz) vs. year, 1975-2020; series: Historical, ITRS Max Clock Rate (12 inverters)]
[Chart 2: Clock (GHz) vs. date, 1/1/93 through 12/29/06; series: Top 10, Top System, Microprocessors]
Processor Parallelism
[Chart: processor parallelism vs. date, 1/1/93 through 12/29/06 (log scale, 1.E+00 to 1.E+06); series: Top 10, Top System]
Concurrency: Flops per Cycle
[Chart: total concurrency vs. date, 1/1/93 through 12/29/06 (log scale, 1.E+00 to 1.E+07); series: Top 10, Top System, Top 1 Trend; CAGR = 1.65]
The Moore's Law We Know & Love (Knew)
• Goal: 4X functionality every 3 years
• Underlying technology improvement:
  – Growth in transistor density [Yes]
  – Growth in transistor switching speed [Yes]
  – Growth in size of producible die [Not in commercial volumes]
• Microprocessors: Functionality = IPS
  – ~1/2 from higher clock rate [No: heat]
  – ~1/2 from more complex microarchitectures [No: complexity]
• Memory: Functionality = storage capacity
  – ~2X from smaller transistors [Yes]
  – Shrinkage in architecture of basic bit cell [Yes, but ..]
  – Increase in die size [Not at commercially viable prices]
• And it is silent on inter-chip I/O
(A quick growth-rate check follows.)
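To put the deck's growth rates on one footing, here is a quick back-of-the-envelope check (my arithmetic, using the CAGR figures from the Top-10 charts above): 4X functionality every 3 years is a CAGR of about 1.59, below the Rmax CAGR of 1.9 and near the concurrency CAGR of 1.65.

```python
import math

# Growth-rate arithmetic; the 1.9 and 1.65 CAGRs are from the charts above.
def cagr(factor: float, years: float) -> float:
    """Compound annual growth rate implied by `factor` growth over `years`."""
    return factor ** (1.0 / years)

def doubling_years(rate: float) -> float:
    """Years to double at a given CAGR."""
    return math.log(2.0) / math.log(rate)

print(f"Moore's-Law CAGR (4X / 3 yr): {cagr(4.0, 3.0):.2f}")           # ~1.59
print(f"Rmax doubling time:           {doubling_years(1.90):.2f} yr")  # ~1.08
print(f"Concurrency doubling time:    {doubling_years(1.65):.2f} yr")  # ~1.38
```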
The Darwinian Multi-Core Evolution
[Figure: contrasting single-core designs up to ~2002 with multi-core designs now]
Area Scaling Alone Reveals the Rationale for Multi-Core
[Chart: single-core projected die size (sq. mm) vs. year, 1970-2020, against the ITRS projected economic die size. Each line represents the scaling of a unique real microprocessor chip from its inception.]
How Many Can We Fit on a cm2?
Assume we scale the entire current single-core chip & replicate it to fill a 280 sq. mm die.
[Chart: number of uP per square centimeter vs. year, 1970-2020]
Answer: potentially 1000s!
And a Flood Tide of Recent Announcements
[Chart: # of new multi-core announcements per year, 1992-2006 (0 to 30)]
They are ALL multi-core now.
And Not Just "Twosies"
[Chart: # of cores/die in new announcements vs. year, 1991-2006 (log scale, 1 to 10,000)]
The Classical Limiting Factors for Microprocessor Chips: Power & Contacts
Peak Logic Clock Rates
[Charts: Clock (MHz) vs. year (1975-2020) and vs. feature size; series: Historical, ITRS Max Clock Rate (12 inverters); both show the historical curve departing from the classical Moore's Law trend and flattening near 3 GHz]
The 2005 projection was for 5.2 GHz, and we didn't make it in production. Further, we're still stuck at 3+ GHz in production.
Why the Clock Flattening? POWER
[Charts: watts per die (1 to 1000) and watts per square cm (0.1 to 1000) vs. year, 1976-2006, with reference points for a light bulb, an iron, and a rocket nozzle. Hot, Hot, Hot!]
Because Vdd Is No Longer Declining
[Chart: Vdd vs. year, 1970-2020 (0 to 6 V), flattening out]
Multi-core Power and Clock

Chip_Power = Cap_per_device * Devices_per_core * Cores_per_chip * Clock * Vdd^2

• Chip_Power: the max limit will grow only slightly, reaching an asymptotic limit
• Cap_per_device: decreasing ~linearly with technology
• Devices_per_core: assume constant for multi-core
• Cores_per_chip: increasing as the square with technology
• Vdd: no longer declining (previous slide)
• Clock: the max rate grows rapidly with technology, but it is the ONLY KNOB to balance the equation!!
Rewriting for Clock

Clock = Max_chip_power(T) * Reduction_in_core_area / (Cap_per_device * Vdd^2)

This now governs core frequency. Not faster transistors!!!
(A numeric sketch follows.)
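A minimal numeric sketch of this relation, assuming illustrative 2004-ish parameters and scaling factors (none of these numbers are from the deck): with Vdd stuck and device count growing faster than per-device capacitance shrinks, the power-limited clock must fall.

```python
# Illustrative power-limited clock, solving P = C * N * f * V^2 for f.
# All parameter values are assumptions for demonstration, not ITRS data.

def power_limited_clock(max_power_w: float, cap_per_device_f: float,
                        switching_devices: float, vdd_v: float) -> float:
    """Clock frequency (Hz) that just balances the chip power equation."""
    return max_power_w / (cap_per_device_f * switching_devices * vdd_v ** 2)

# Assumed baseline: 100W budget, 1 fF/device, 2e7 activity-weighted
# switching devices, Vdd = 1.2V.
f0 = power_limited_clock(100.0, 1.0e-15, 2.0e7, 1.2)
# One node later: 2x the devices, capacitance only ~0.7x, Vdd flat.
f1 = power_limited_clock(100.0, 0.7e-15, 4.0e7, 1.2)
print(f"baseline ~ {f0/1e9:.2f} GHz, next node ~ {f1/1e9:.2f} GHz")
# Doubling devices while capacitance shrinks only 0.7x forces the
# power-limited clock down by ~0.71x: the clock is the only knob.
```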
Relative Change In Factors
[Chart: factors relative to 2004, 2004-2020 (0.0 to 1.4); series: Max Power (N), Area (N), Cap per Device (D), Vdd (D), and the resulting Power-Limited Clock]
What Kind of Core Should We Replicate?
[Surface plots over relative clock (1.0-4.0) and issue width (1-10):
 1. Relative IPS: complex designs give the most performance
 2. Relative area: but also take the largest area
 3. IPS per unit area: but simpler designs give better performance/area]
Simpler is Better
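A toy model of the tradeoff those surfaces show (the exponents are my illustrative assumptions, not the data behind the plots): if IPS grows sub-linearly with issue width while area grows roughly quadratically with it, performance per unit area peaks at narrow issue widths.

```python
# Toy core-replication tradeoff. Exponents are illustrative assumptions:
# IPS ~ clock * width^0.5 (diminishing ILP returns),
# area ~ clock * width^2  (wide issue is area-hungry).

def relative_ips(clock: float, width: float) -> float:
    return clock * width ** 0.5

def relative_area(clock: float, width: float) -> float:
    return clock * width ** 2

for width in (1, 2, 4, 8):
    ips = relative_ips(2.0, width)
    area = relative_area(2.0, width)
    print(f"width={width}: IPS={ips:.2f}, area={area:.1f}, IPS/area={ips/area:.3f}")
# The widest issue wins on raw IPS but loses badly on IPS per unit area,
# which is the metric that matters once you are replicating cores.
```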
What About Memory Bus Clocks?
[Chart: MHz vs. year, 1990-2020; series: historical Intel memory bus rate, historical Intel CPU clock, ITRS projected max on-chip and off-chip clocks, the clock for constant power density, 0.3x of the power-limited clock, and an assumed projected memory rate]
Does Logic Performance Match Off-chip Bandwidth Potential?
[Chart: growth factor over 2004, 2004-2020 (log scale, 1 to 1,000); series: ITRS signal pads per high-performance uP, ball-bond contacts per sq. cm, signal pads * modified off-chip clock, transistor density * power-limited clock]
A Growing Mismatch!
The Multi-Core Family Tree
This May Be the Architecture You Think of for Multi-Core
[Diagrams of three multi-core organizations:
 (a) Hierarchical designs: cores with private caches sharing a cache/memory
 (b) Pipelined designs: a chain of cores streaming past memory into a cache/memory
 (c) Array designs: replicated core + cache/memory tiles under common interconnect & control]
• Hierarchical: Intel Core Duo, IBM Power5, AMD Opteron, Sun Niagara, ...
  – External bandwidth = sum of escapes from the cores
• Pipelined: IBM Cell, most router chips, many video chips
• Array: Terasys, Execube, Yukon, Intel Teraflop
But There's at Least One Approach with Lower Bandwidth Needs
[Same three diagrams, here highlighting (b) pipelined designs]
• Hierarchical: Intel Core Duo, IBM Power5, Sun Niagara, ...
• Pipelined: most router chips, many video chips, some aspects of IBM Cell
  – External bandwidth largely independent of the # of cores
• Array: Terasys, Execube, Yukon, Intel Teraflop
And Then There Are Array Approaches That Provide Significant Internal Memory
[Same three diagrams, here highlighting (c) array designs]
• Hierarchical: Intel Core Duo, IBM Power5, Sun Niagara, ...
• Pipelined: IBM Cell, most router chips, many video chips
• Array: Terasys, Execube, Yukon, Intel Teraflop, some aspects of Cell
  – Particularly effective for weak-scaling apps
And Today's Memory Architecture Is Evolving to Feed the Beast
[Diagram: microprocessor(s) attached through a North Bridge memory controller to the memory interface; now multiple microprocessors share the path]
State of the art peak aggregate bandwidth: ~6.4 GB/s
... But Not to Reduce Latency
[Diagram: a daisy chain of AMB (Advanced Memory Buffer) chips between the controller and the DRAMs]
We've introduced 16 extra chip crossings!
... And at ~2X power increase
A Simple Case Study
A Modern HPC System
Computational board:
• 4 PE nodes; each PE node:
  – Dual-core Opteron @ 2.6 GHz
  – 4 DDR2 2GB DIMMs
• 4 routers per board
Key ratios (all "peak"):
• 2 flops per cycle per core
• 1.5 B per flop
• 1.25 B/s of memory BW per flop per core
• 0.25 B/s of link BW per flop per PE
• 0.06-0.25 B/s of bisection BW per flop
(A quick arithmetic check follows.)
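A quick sanity check of these ratios (my arithmetic; interpreting "per flop" against per-core peak flops, and taking the ~6.4 GB/s aggregate memory bandwidth from the earlier slide):

```python
# Sanity check of the "peak" ratios above (my arithmetic; the 6.4 GB/s
# node memory bandwidth is the aggregate figure quoted a few slides back).
CLOCK_GHZ, FLOPS_PER_CYCLE, CORES = 2.6, 2, 2
MEM_GB = 4 * 2                 # four 2GB DDR2 DIMMs per PE node
MEM_BW_GBS = 6.4               # assumed peak aggregate bandwidth per node

gf_per_core = CLOCK_GHZ * FLOPS_PER_CYCLE                    # 5.2 GFlop/s
print(f"peak per node: {gf_per_core * CORES:.1f} GFlop/s")   # 10.4
print(f"B per flop:    {MEM_GB / gf_per_core:.2f}")          # ~1.54 -> "1.5"
print(f"B/s per flop:  {MEM_BW_GBS / gf_per_core:.2f}")      # ~1.23 -> "1.25"
```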
What Are We Doing with the Total System Silicon?
Silicon area distribution: Memory 86%, Processors 3%, Routers 3%, Random 8%
Power distribution: Memory 9%, Processors 56%, Routers 33%, Random 2%
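One implication worth making explicit (my arithmetic from the two distributions above): the processors dissipate over half the power in 3% of the silicon, so their power density is roughly 180X the memory's.

```python
# Power density implied by the two distributions above (relative units,
# where 1.0 is the system-wide average).
area  = {"memory": 0.86, "processors": 0.03, "routers": 0.03, "random": 0.08}
power = {"memory": 0.09, "processors": 0.56, "routers": 0.33, "random": 0.02}

density = {k: power[k] / area[k] for k in area}
print(density)  # processors ~18.7, memory ~0.10
print(f"processor/memory power density ratio: "
      f"{density['processors'] / density['memory']:.0f}x")   # ~178x
```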
What Is the Board Space Utilization Like?
Board space distribution: Memory 10%, Processors 24%, Routers 8%, Random 8%, White Space 50%
A Dual Core Processor Chip
http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg
Die area breakdown: 2 cores 56.2%, 1 memory controller 19.4%, 3 HT links 5.4%, other 19.0%
[Chart: area (sq. mm) vs. year, 2004-2020; series: single core area, single HT area, single memory controller]
Some Projections
• Off-chip memory controls performance
• IPC/core is more sensitive to latency than to bandwidth
• "Flat" off-chip physical latency => relative latency grows with the clock
[Chart: single-core performance factors relative to 2004, 2004-2020; series: clock growth, relative IPC change, relative IPS; annotated values: 3.08X increase, 82% increase, 73% drop, 48% drop, against a 1.0 baseline. Ack. R. Murphy, SNL]
Where Does This Lead Us?
• Use the density increase to replicate cores
• Keep the clock flat to minimize power
• Still need additional I/O for both bandwidth & latency management (reduce queuing delays by multiple banks)
[Chart: number of cores vs. year, 2002-2022 (0 to 120); series: just cores vs. cores with HT & DDR controllers; the gap between them is unproductive silicon]
(A rough sketch of such a projection follows.)
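A rough sketch of how such a projection can be generated (every parameter below is an illustrative assumption, not the deck's data): hold die size and the per-core design fixed, scale density with the technology node, and reserve a fixed slice of the die for HT links and memory controllers.

```python
# Illustrative core-count projection under pure area scaling.
# Every number here is an assumption for demonstration, not ITRS data.
DIE_MM2 = 280.0          # economic die size used earlier in the deck
CORE_MM2_2004 = 60.0     # assumed area of one 2004 core (incl. its cache)
IO_MM2 = 40.0            # assumed fixed budget for HT links + DDR controllers

for year in range(2004, 2021, 4):
    shrink = 0.5 ** ((year - 2004) / 3.0)   # assume 2x density every 3 years
    core_mm2 = CORE_MM2_2004 * shrink
    just_cores = int(DIE_MM2 / core_mm2)
    with_io = int((DIE_MM2 - IO_MM2) / core_mm2)
    print(f"{year}: just cores {just_cores:4d}, with HT & DDR ctl {with_io:4d}")
```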
So What May This Mean to the Top 500?
[Chart: GFlops vs. time, 1/1/72 through 1/1/20 (log scale, 1.E-01 to 1.E+10); series: Historical Rmax, Rmax, Rmax Leading Edge, Rpeak Leading Edge, Evolutionary Heavy-Node Projection]
The Emergence of More Organized Architectures
Tiling & Local Memory Regularizes Layout, Lowers Latency, Reduces Off-Chip Bandwidth Needs
• Works well with partitionable algorithms
• Good fit for applications that support weak scaling
• Inter-core communication DOES NOT USE CONTACTS
• Compiling problem: placement of kernels AND data structures to minimize inter-core bandwidth
• Problems with global synchronization
Multi-Threading
• Provides explicit latency hiding
• Permits simpler cores with more efficient use of data flow
• Increases the potential for memory references "in flight"
• Shares the path to memory
• But still doesn't help "single thread" performance in terms of chained memory references
• Nor does it reduce off-chip bandwidth (and contacts)
(A back-of-the-envelope latency-hiding estimate follows.)
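To make "latency hiding" concrete, here is a standard Little's-law estimate (the numbers are my own, for illustration): the references a core must keep in flight equal the memory latency times the off-chip request rate it wants to sustain.

```python
# Little's law: outstanding_refs = latency * throughput.
# Parameter values below are illustrative assumptions.
LATENCY_NS = 100.0            # assumed round-trip memory latency
CLOCK_GHZ = 3.0
MISSES_PER_1000_INSTR = 10    # assumed off-chip reference rate
IPC_TARGET = 1.0

refs_per_ns = CLOCK_GHZ * IPC_TARGET * MISSES_PER_1000_INSTR / 1000.0
outstanding = LATENCY_NS * refs_per_ns
print(f"references that must be in flight: {outstanding:.0f}")   # 3
# A single thread with one or two outstanding misses stalls; with N
# threads per core the hardware can keep N references in flight instead.
```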
A Brief History of Multi-threaded Processors
[Timeline: relevant features (0 to 7) vs. year, 1960-2010; entries: CDC 6600, Space Shuttle IOP, HEP, J-Machine, Horizon, MTA, HTMT, PIM Lite, Hyper-Threading (P5, U4), Niagara, Eldorado]
Sun's Niagara
• 8 four-way multi-threaded, single-issue cores
• 3MB, 12-bank shared L2
• 4 DDR2 memory interfaces
• Measured 5.76 IPC vs. a peak of 8 on the Java Business benchmark
• 63W @ 90nm (2W cores)
Die area breakdown: cores 37%, L2 21%, FPU 2%, crossbar 3%, DDR2 interfaces 11%, other functions 3%, remainder 23%
[Chart: area (sq. mm) vs. year, 2004-2020; series: single core area, entire L2 area, single DDR2 I/F area, crossbar]
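Two numbers fall straight out of this slide (my arithmetic): measured IPC is 72% of peak, and the eight 2W cores draw only about a quarter of the 63W budget, the rest going to the L2, interfaces, and I/O.

```python
# Derived directly from the Niagara figures above.
print(f"IPC utilization:  {5.76 / 8:.0%}")    # 72% of the 8-IPC peak
print(f"core power share: {8 * 2 / 63:.0%}")  # 8 cores x 2W out of 63W, ~25%
```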
Cray’s XMT
Supports 128 Threads/core
John Feo, David Harper, Simon Kahan, Petr Konecny, “Eldorado”, Computing Frontiers, 2005
Some Interesting Comparisons

Core        L1   FPU   Area    pJ
Niagara-I   24   No    11.92   1719
Niagara-II  24   yes   23.85   2364
MIPS64      64   yes    9.59
MIPS64      40   No             436

So multi-threading is not free.
Problems Still Remain
• Programming models not changed
• States still very heavy
• Compiling to specific cores
• Data partitioning
• Problems with coherency
• Doesn't address barriers, sync points, ...
• Doesn't help emerging low-reuse apps
  – AMR
  – Data mining
  – Graph traversals
  – Non-numeric solvers such as SAT
Are We Ready for a Mutation?
Ideas
• Ultra-lightweight "butterflies" take functions to the data "flowers"
  – A memory reference becomes a "traveling threadlet"
• But, like flowers, data can respond to the touch of the butterfly
  – Add a small amount of metadata to each word
• Finally, it's the "flowers" whose location is important
Adding Metadata to the Memory
• "Special values"
  – Uninitialized, error code, null
• Full/empty bits
  – And multiple flavors of "empty"
  – Esp. "empty pending outstanding value"
  – Greatly simplifies producer/consumer
• Forwarding
• Locked
• Traps
• Especially interesting when aliased to thread-state registers
(A sketch of these semantics follows.)
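A minimal sketch of full/empty-bit producer/consumer semantics, assuming a three-state word (full, empty, empty-pending-outstanding-value); the class and method names are mine for illustration, and real hardware does this per memory word, in the memory system itself rather than with locks.

```python
# Toy full/empty-bit word (illustrative only; state names are assumptions).
import threading

class FEWord:
    FULL, EMPTY, EMPTY_PENDING = "full", "empty", "empty-pending"

    def __init__(self):
        self.state, self.value = self.EMPTY, None
        self.cv = threading.Condition()

    def write_full(self, v):
        """Producer: store a value and mark the word full."""
        with self.cv:
            self.value, self.state = v, self.FULL
            self.cv.notify_all()

    def read_to_empty(self):
        """Consumer: atomically read a full word and mark it empty.
        The empty-pending state records that a reader is waiting."""
        with self.cv:
            while self.state != self.FULL:
                self.state = self.EMPTY_PENDING
                self.cv.wait()
            self.state = self.EMPTY
            return self.value

w = FEWord()
threading.Thread(target=lambda: w.write_full(42)).start()
print(w.read_to_empty())  # blocks until the producer writes, then prints 42
```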
Full/Empty Bits & MPI
Ack. A. Rodrigues, SNL
One Step Further: Allowing the Threads to Travel
• "Overprovision" memory with huge numbers of anonymous execution sites
  – Place at the bottom of, or near, memory
• Reduce the state of a thread to a memory reference
• Make creating a new thread "near" some memory a cheap operation
• Allow a thread to "move" to a new site when locality demands
• Don't require the target to maintain code
Latency reduced by huge factors
"Piglet" Processing at the Base of Memory
Threadlet format: target address, operands & working registers, code, PC, additional data payload
[Diagram: a "classical" host CPU node (heavyweight-ISA processing, cache, network interconnect) alongside memory nodes, each pairing address management and piglet processing with every memory bank]
(A sketch of this format as a record follows.)
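The threadlet format above maps naturally onto a record; this is a hypothetical encoding of exactly the fields the slide lists (field names and sizes are mine, not an actual parcel format).

```python
# Hypothetical encoding of the threadlet fields listed above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Threadlet:
    target_address: int            # where the threadlet travels next
    registers: List[int]           # operands & working registers
    code: bytes                    # the (tiny) program it carries
    pc: int = 0                    # program counter into `code`
    payload: List[int] = field(default_factory=list)  # additional data

    def total_state_bytes(self) -> int:
        """A threadlet's entire state is all that crosses the network."""
        return 8 + 8 * len(self.registers) + len(self.code) + 8 * len(self.payload)

t = Threadlet(target_address=0x1000, registers=[0, 0, 0], code=b"\x01\x02")
print(t.total_state_bytes())   # a few tens of bytes, vs. kilobytes of core state
```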
Types of Piglet Programs
• Classical memory operations
• Atomic Memory Operations
• Short Vector to Memory
• “Object-oriented” method evaluation at the object
• Small slices of programs
Example: AMO
• AMO = Atomic Memory Operation
  – Update some memory location
  – With guaranteed no interference
  – And return the result
• Parcel registers: A = Address, D = Data, R = Return Address
• Sample code:

        MOVE                  ; travel to the memory at address A
    L1: LOCK & LOAD           ; atomic update "at the memory"
        OP
        STORE & RELEASE L1
        SWAP R, A             ; the return address becomes the new target
        MOVE                  ; travel back
        STORE                 ; return the result
        QUIT

Bottom line: 2 network transactions rather than up to 6!
Vector Add (Z[I] = X[I] + Y[I]) via Threadlets
[Diagram: Type 1 threadlets stride through Q elements of X memory, accumulating the Q X values in their payload, then spawn Type 2 threadlets; a Type 2 threadlet fetches the Q matching Y values, adds them to the X values, saves the sums in its payload, and stores the Q Z values (Type 3)]
Transaction reduction factor (a counting model follows):
• 1.66X (Q=1)
• 10X (Q=6)
• up to 50X (Q=30)
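The quoted reduction factors are consistent with a simple counting model (my reconstruction, so the per-element constants are assumptions): roughly five network transactions per element conventionally (read-X request and reply, read-Y request and reply, write-Z) versus three parcel hops per Q-element block.

```python
# Counting model that reproduces the slide's reduction factors
# (assumed constants: 5 conventional transactions/element, 3 hops/block).
def reduction(q: int) -> float:
    conventional = 5 * q     # per-element loads/stores with their replies
    threadlet_hops = 3       # X-region hop, X->Y hop, Y->Z hop per block
    return conventional / threadlet_hops

for q in (1, 6, 30):
    print(f"Q={q:2d}: {reduction(q):.2f}x")   # 1.67x, 10x, 50x
```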
Conclusions
Conclusions
• (Hierarchical) multi-core has taken over
  – But clock rate will be limited by power
  – And the # of useable cores by contacts
• Simpler cores are more area/energy efficient
  – But we can't use all of them in hierarchical architectures
• Latency will stifle single-thread performance
• Multi-threading provides better utilization
  – But at an energy cost
• Pipelined/array chips reduce the need for off-chip bandwidth
  – But then run into the power-limited clock problem
  – And require 2D partitioning of data and code
• Are there alternatives that don't fix code to cores?

BEST HPC architecture != best commodity architecture
A Personal Goal
[Diagram: "PIM DIMMs" built from grids of PIM chips, grouped into "PIM clusters" joined by an interconnection network to a "host" and I/O]
• Huge increase in silicon per board
• Level out power dissipation
The Future
Will We Design Like This? Or This?