SC12
11/14/12 1
Titan: World’s #1 Open Science Supercomputer
18,688 Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
17.59 Petaflops Sustained Performance on Linpack
Titan
18,688 Kepler GK110 GPUs
27 PF peak (90% from GPUs)
17.6 PF HPL (Linpack)
2.12 GF/W
GK110 alone is 7 GF/W
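These headline numbers are mutually consistent. A quick check, assuming NVIDIA's published 1.31 TFLOPS double-precision peak per K20X (a figure not stated on the slide):

```python
# Cross-check the Titan numbers above. Assumption: 1.31 TFLOPS
# double-precision peak per K20X (NVIDIA's published spec, not on the slide).
num_gpus = 18_688
k20x_dp_tflops = 1.31

gpu_pf = num_gpus * k20x_dp_tflops / 1000      # PFLOPS from the GPUs
print(f"GPU peak: {gpu_pf:.1f} PF")            # GPU peak: 24.5 PF
print(f"share of 27 PF: {gpu_pf / 27:.0%}")    # share of 27 PF: 91%
```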
The Road to Exascale

              2012 (You are Here)   2020
Performance   20 PF                 1,000 PF (50×)
Nodes         18,000 GPUs           72,000 HCNs (4×)
Power         10 MW                 20 MW (2×)
Efficiency    2 GFLOPS/W            50 GFLOPS/W (25×)
Threads       ~10⁷                  ~10⁹ (100×)
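A quick arithmetic check of the table (a sketch; the per-node implication is my inference, not stated on the slide):

```python
# Sanity-check the 2012 -> 2020 targets in the table above.
perf_2012, perf_2020 = 20e15, 1000e15      # FLOPS
power_2012, power_2020 = 10e6, 20e6        # W
eff_2012, eff_2020 = 2e9, 50e9             # FLOPS/W
nodes_2012, nodes_2020 = 18_000, 72_000

# Perf = Power x Efficiency holds in both columns.
assert perf_2012 == power_2012 * eff_2012   # 20 PF
assert perf_2020 == power_2020 * eff_2020   # 1000 PF

# 50x performance from only 4x more nodes means each node
# must deliver 12.5x the throughput.
per_node_growth = (perf_2020 / perf_2012) / (nodes_2020 / nodes_2012)
print(per_node_growth)   # 12.5
```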
Technical Challenges on the Road to Exascale

(2012 → 2020 scaling as on the previous slide)

1. Energy Efficiency
2. Parallel Programmability
3. Resilience
Energy Efficiency
Moore’s Law to the Rescue?
Moore, Electronics 38(8) April 19, 1965
Unfortunately Not!
C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
Chips are now power limited, not area limited

[Figure: two dies, P = 150 W vs P = 5 W]

Perf (ops/s) = P (W) × Eff (ops/J)

Process improves Eff by 15–25% per node: 2–3× in 8 years
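The formula can be checked against Titan's numbers from the opening slides (assuming the 2.12 GF/W figure is measured at the sustained Linpack rate):

```python
# Perf (ops/s) = P (W) x Eff (ops/J), rearranged to estimate power draw.
linpack_flops = 17.59e15   # Titan sustained HPL, from the opening slide
eff_flops_per_w = 2.12e9   # Titan efficiency, from the Titan slide

power_mw = linpack_flops / eff_flops_per_w / 1e6
print(f"implied power: {power_mw:.1f} MW")   # implied power: 8.3 MW
```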
We need 25× energy efficiency
2–3× will come from process
10× must come from architecture and circuits
Energy Efficiency: 2 GFLOPS/W → 50 GFLOPS/W (25×)
- 4 process nodes: 3× → 6 GFLOPS/W
- Integrate host: 2× → 12 GFLOPS/W
The High Cost of Data Movement

Fetching operands costs more than computing on them.

[Figure: energy per operation in a 28 nm chip, 20 mm on a side, 256-bit buses]
- 64-bit DP FLOP: 20 pJ
- 256-bit access to 8 kB SRAM: 50 pJ
- 256-bit on-chip bus: 26 pJ (local), 256 pJ (across the 20 mm chip)
- Off-chip access: 1 nJ; efficient off-chip link: 500 pJ
- DRAM Rd/Wr: 16 nJ
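Taking ratios against the 20 pJ FLOP makes the imbalance concrete (energies from the figure; a sketch):

```python
# Energy per 256-bit operand fetch, relative to one 64-bit DP FLOP,
# using the 28 nm numbers from the figure above.
E_FLOP = 20e-12    # J, 64-bit DP FLOP
fetch_costs = {
    "8 kB SRAM access": 50e-12,
    "bus across 20 mm chip": 256e-12,
    "DRAM Rd/Wr": 16e-9,
}
for name, energy in fetch_costs.items():
    print(f"{name}: {energy / E_FLOP:.1f}x a DP FLOP")
# The DRAM line shows 800.0x: one DRAM fetch costs 800 FLOPs of energy.
```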
Low-Energy Signaling
Energy cost with efficient signaling

[Figure: the same 28 nm chip, 20 mm on a side, with low-energy signaling]
- 64-bit DP FLOP: 20 pJ
- 256-bit access to 8 kB SRAM: 50 pJ
- 256-bit on-chip bus: 3 pJ (local), 30 pJ (across the 20 mm chip)
- Off-chip access: 100 pJ; efficient off-chip link: 100 pJ
- DRAM Rd/Wr: 1 nJ
Energy Efficiency: 2 GFLOPS/W → 50 GFLOPS/W (25×)
- 4 process nodes: 3× → 6 GFLOPS/W
- Integrate host: 2× → 12 GFLOPS/W
- Advanced signaling: 2× → 24 GFLOPS/W
(Slide: Milad Mohammadi, 4/11/11)
An out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add)
SM Lane Architecture

[Figure: SM lane block diagram. Datapath: main RF plus per-thread ORFs feeding two FP/Int units and an LS/BR unit; L0/L1 address stages and a network connect to LM banks 0–3 and the LD/ST path. Control path: thread PCs, active PCs, L0 I$, and a scheduler.]

- 64 threads, 4 active threads
- 2 DFMAs (4 FLOPS/clock)
- ORF bank: 16 entries (128 bytes)
- L0 I$: 64 instructions (1 KB)
- LM bank: 8 KB (32 KB total)
Energy Efficiency: 2 GFLOPS/W → 50 GFLOPS/W (25×)
- 4 process nodes: 3× → 6 GFLOPS/W
- Integrate host: 2× → 12 GFLOPS/W
- Advanced signaling: 2× → 24 GFLOPS/W
- Efficient microarchitecture: 2× → 48 GFLOPS/W
Energy Efficiency: 2 GFLOPS/W → 50 GFLOPS/W (25×)
- 4 process nodes: 3× → 6 GFLOPS/W
- Integrate host: 2× → 12 GFLOPS/W
- Advanced signaling: 2× → 24 GFLOPS/W
- Efficient microarchitecture: 2× → 48 GFLOPS/W
- Optimized voltages, efficient memory, locality enhancement: ??×
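Multiplying out the waterfall shows how the listed steps compose, and how much the final "??×" bucket must still cover (a sketch of the slide's arithmetic):

```python
# Cumulative energy-efficiency waterfall from the slide.
eff = 2.0   # GFLOPS/W starting point (2012)
steps = [("4 process nodes", 3), ("integrate host", 2),
         ("advanced signaling", 2), ("efficient microarchitecture", 2)]
for name, factor in steps:
    eff *= factor
    print(f"after {name}: {eff:.0f} GFLOPS/W")    # ends at 48 GFLOPS/W
print(f"gap left for voltages/memory/locality: {50 / eff:.2f}x")  # 1.04x
```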
Parallel Programmability
Parallel programming is not inherently any more difficult than serial programming.
However, we can make it a lot more difficult.
A simple parallel program

forall molecule in set {                     // launch a thread array
  forall neighbor in molecule.neighbors {    // nested
    forall force in forces {                 // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}
Why is this easy?

forall molecule in set {                     // launch a thread array
  forall neighbor in molecule.neighbors {    // nested
    forall force in forces {                 // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

- No machine details
- All parallelism is expressed
- Synchronization is semantic (in the reduction)
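The same pattern drops straight into any data-parallel framework. A minimal Python sketch (the molecule set and force functions are toy stand-ins, not from the slide) that keeps the three properties above: no machine details, all parallelism expressed, synchronization only inside the reduction:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for the slide's set / neighbors / forces.
molecules = [{"neighbors": [1, 2], "force": 0.0} for _ in range(4)]
force_fns = [lambda m, n: 0.5, lambda m, n: 1.0]   # toy force models

def total_force(m):
    # The doubly nested forall collapses into one reduction (reduce_sum).
    return sum(f(m, nb) for nb in m["neighbors"] for f in force_fns)

with ThreadPoolExecutor() as pool:             # launch a "thread array"
    for m, f in zip(molecules, pool.map(total_force, molecules)):
        m["force"] = f

print([m["force"] for m in molecules])         # [3.0, 3.0, 3.0, 3.0]
```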
We could make it hard

pid = fork();                   // explicitly managing threads
lock(struct.lock);              // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);
code = send(pid, tag, &msg);    // partition across nodes
Programmers, tools, and architecture need to play their positions.

[Figure: triangle of Programmer, Tools, and Architecture]
Programmers, tools, and architecture need to play their positions.

- Programmer: the algorithm; all of the parallelism; abstract locality
- Tools: combinatorial optimization; mapping; selection of mechanisms
- Architecture: fast mechanisms; exposed costs
Programmers, tools, and architecture need to play their positions.

Programmer:

forall molecule in set {                     // launch a thread array
  forall neighbor in molecule.neighbors {    // nested
    forall force in forces {                 // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools:
- Map foralls in time and space
- Map molecules across memories
- Stage data up/down the hierarchy
- Select mechanisms

Architecture:
- Exposed storage hierarchy
- Fast comm/sync/thread mechanisms
Fundamental and Incidental Obstacles to Programmability

Fundamental:
- Expressing 10⁹-way parallelism
- Expressing locality to deal with >100:1 global:local energy
- Balancing load across 10⁹ cores

Incidental:
- Dealing with multiple address spaces
- Partitioning data across nodes
- Aggregating data to amortize message overhead
Parallel Programmability: 10⁷ threads → 10⁹ threads
- Abstract parallelism and locality
- Autotuning mapper
- Fast communication, synchronization, and thread management
- Exposed storage hierarchy
NVIDIA Exascale Architecture
System Sketch
Echelon Chip Floorplan

[Figure: die plan, 17 mm on a side, 10 nm process, 290 mm². A tiled array of SMs (each SM built from 8 lanes; four SMs per NOC stop) interleaved with LOC tiles, with the L2 banks, crossbar (XBAR), and NOC in the center and DRAM I/O and network (NW) I/O around the periphery.]
The fundamental problems are hard
enough. We must eliminate the
incidental ones.
Parallel Roads to Exascale

Energy efficiency: 2 GFLOPS/W → 50 GFLOPS/W (25×)
- 4 process nodes: 3× → 6 GFLOPS/W
- Integrate host: 2× → 12 GFLOPS/W
- Advanced signaling: 2× → 24 GFLOPS/W
- Efficient microarchitecture: 2× → 48 GFLOPS/W
- Optimized voltages, efficient memory, locality enhancement: ??×

Parallel programmability: 10⁹ threads
- Abstract parallelism and locality
- Autotuning mapper
- Fast communication, synchronization, and thread management
- Exposed storage hierarchy