Exascale Computing Will Enable Transformational Science
Climate
Comprehensive Earth-system model at 1 km scale, enabling modeling of cloud convection and ocean eddies.
Combustion
First-principles simulation of combustion for new high-efficiency, low-emission engines.
Biology
Coupled simulation of entire cells at molecular, genetic, chemical and biological levels.
Astrophysics
Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models.
Exascale Computing Will Enable Transformational Science
High-Performance Computers are Scientific Instruments
Titan: World’s #1 Open Science Supercomputer
18,688 NVIDIA Tesla K20X GPUs
27 petaflops peak, with 90% of performance from the GPUs
17.59 petaflops sustained performance on Linpack
Titan & Kepler
18,688 NVIDIA Kepler GK110 GPUs
27 PF peak (90% from GPUs)
17.6 PF on HP Linpack
2.12 GF/W
GK110 alone is 7 GF/W
The Road to Exascale
2012 (you are here): 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPS/W, ~10⁷ threads
2020: 1000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPS/W (25x), ~10¹⁰ threads (1000x)
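The implied efficiency target, spelled out from the figures above (simple arithmetic, not on the slide):

$$\frac{1000\ \mathrm{PF}}{20\ \mathrm{PF}} = 50\times \quad\text{at}\quad \frac{20\ \mathrm{MW}}{10\ \mathrm{MW}} = 2\times \;\Rightarrow\; \frac{50}{2} = 25\times \text{ efficiency, i.e. } 2 \to 50\ \mathrm{GFLOPS/W}.$$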
Technical Challenges on The Road to Exascale
2012 (you are here): 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPS/W, ~10⁷ threads
2020: 1000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPS/W (25x), ~10¹⁰ threads (1000x)
1. Energy Efficiency
2. Parallel Programmability
3. Resilience
50x performance in 8 years, Moore’s Law will take care of that, right?
Wrong!
Moore's Law gives us transistors, which we used to turn into scalar performance.
Moore, Electronics 38(8), April 19, 1965
But ILP was ‘mined out’ in 2000
[Plot: processor performance (ps per instruction), 1980-2020, with trend segments of 52%/year, 74%/year, and 19%/year and cumulative gains of 30:1, 1,000:1, and 30,000:1.]
Dally et al. “The Last Classical Computer”, ISAT Study, 2001
And L³ energy scaling ended in 2005
Moore, ISSCC Keynote, 2003
Result: The End of Historic Scaling
C. Moore, "Data Processing in ExaScale-Class Computer Systems," Salishan Conference, April 2011
Historic scaling is at an end!
Continuing the performance scaling of computer systems of all sizes requires addressing two challenges:
Power and Parallelism
Much of the economy depends on this
The Power Challenge
In the past we had constant-field (Dennard) scaling:
  L' = L/2    V' = V/2
  E' = CV² = E/8    f' = 2f
  D' = 1/L² = 4D    P' = P
Halve L and get 8x the capability for the same power.
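Worked through, the constant-field numbers multiply out to constant power: density, frequency, and energy per op combine as

$$P' = D'\,f'\,E' = (4D)(2f)\!\left(\frac{E}{8}\right) = D\,f\,E = P.$$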
Now voltage is held nearly constant:
  L' = L/2    V' = V
  E' = CV² = E/2    f' = 2f*
  D' = 1/L² = 4D    P' = 4P
Halve L and get 2x the capability for the same power, in ¼ the area.
*f no longer scales as 1/L, but it doesn't matter; we couldn't power it if it did.
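The same product under constant voltage shows where the power problem comes from: power per unit area quadruples, so at a fixed power budget only a quarter of the scaled chip can be used,

$$P' = D'\,f'\,E' = (4D)(2f)\!\left(\frac{E}{2}\right) = 4\,D\,f\,E = 4P.$$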
Performance = Efficiency
Efficiency = Locality
Locality
The High Cost of Data Movement
Fetching operands costs more than computing on them
[Figure: energy costs on a 20 mm die in a 28 nm process]
  64-bit DP operation: 20 pJ
  256-bit access to an 8 kB SRAM: 50 pJ
  256-bit bus, short hop: 26 pJ; across the die: 256 pJ
  Efficient off-chip link: 500 pJ
  Off-chip to a neighboring chip: ~1 nJ
  DRAM read/write: 16 nJ
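To make the gap concrete (a ratio derived from the figures above): one DRAM access costs as much energy as roughly 800 double-precision operations,

$$\frac{16\ \mathrm{nJ}}{20\ \mathrm{pJ}} = 800.$$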
Scaling makes locality even more important
It's not about the FLOPS
It's about data movement
Algorithms should be designed to perform more work per unit data movement.
Programming systems should further optimize this data movement.
Architectures should facilitate this by providing an exposed hierarchy and efficient communication.
Move Bits More Efficiently
Move Fewer Bits
forall cells in set {
  compute_x_flux(cell);
}
forall cells in set {
  compute_y_flux(cell);
}
forall cells in set {
  compute_z_flux(cell);
}
forall cells in set {
  compute_p(cell);
}
Move Fewer Bits
forall cells in set {
  compute_x_flux(cell);
  compute_y_flux(cell);
  compute_z_flux(cell);
  compute_p(cell);
}
Move Fewer Bits
forall blocks in set {    // hierarchically
  localize(block);
  forall cells in block {
    compute_x_flux(cell);
    compute_y_flux(cell);
    compute_z_flux(cell);
    compute_p(cell);
  }
}
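In CUDA terms, the same progression might look like the sketch below: four separate kernels would round-trip every cell through DRAM four times, while one fused kernel loads and stores each cell once. This is only a sketch; the flux math and all names are illustrative stand-ins, not from the slides.

// Unfused (one of four similar kernels): a DRAM round trip per cell per kernel.
__global__ void x_flux(float4* cells, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) cells[i].x = 0.5f * cells[i].x * cells[i].x;  // stand-in flux
}
// ...y_flux, z_flux, and compute_p kernels would be analogous...

// Fused: each cell crosses the memory hierarchy once.
__global__ void fused_update(float4* cells, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float4 c = cells[i];          // one load
    c.x = 0.5f * c.x * c.x;       // x flux
    c.y = 0.5f * c.y * c.y;       // y flux
    c.z = 0.5f * c.z * c.z;       // z flux
    c.w = c.x + c.y + c.z;        // "compute_p" from the fluxes
    cells[i] = c;                 // one store
  }
}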
System Sketch
Echelon Chip Floorplan
[Figure: Echelon chip floorplan. L2 banks and crossbar (XBAR) at the center; an array of SM clusters, four SMs per NOC, plus eight latency-optimized LOC cores; each SM comprises eight lanes; DRAM I/O and network I/O ring the perimeter. Die: 17 mm on a side, 290 mm² in a 10 nm process.]
Overhead
An out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add).
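The overhead ratios, made explicit from the numbers above:

$$\frac{2\ \mathrm{nJ}}{25\ \mathrm{pJ}} = 80\times, \qquad \frac{2\ \mathrm{nJ}}{0.5\ \mathrm{pJ}} = 4000\times.$$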
SM Lane Architecture
[Figure: SM lane datapath and control path. Operand register files (ORFs) feed the FP/Int units and the LS/BR unit; L0/L1 address paths reach the LM banks; control includes an L0 I$, thread PCs, active PCs, and a scheduler.]
  64 threads, 4 active threads
  2 DFMAs (4 FLOPS/clock)
  ORF bank: 16 entries (128 bytes)
  L0 I$: 64 instructions (1 kB)
  LM bank: 8 kB (32 kB total)
Solving the Power Challenge – 1, 2, 3
Solving the ExaScale Power Problem
Parallelism
Parallel programming is not inherently any more difficult than serial programming
However, we can make it a lot more difficult
A simple parallel program
forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}
Why is this easy?
No machine details
All parallelism is expressed
Synchronization is semantic (in the reduction)
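One plausible CUDA rendering of this program, for concreteness: the outer forall becomes the thread array (grid), the nested loops run inside each thread, and the reduction is a per-thread sum. Molecule, pair_force, and the fixed neighbor count are hypothetical placeholders, not from the slides.

struct Molecule {
  float force;
  int   neighbors[16];   // illustrative fixed-size neighbor list
  int   n_neighbors;
};

__device__ float pair_force(const Molecule& m, int neighbor_id) {
  return 1.0f / (1.0f + neighbor_id);   // stand-in for a real force law
}

__global__ void compute_forces(Molecule* set, int n) {
  int m = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per molecule
  if (m >= n) return;
  float sum = 0.0f;
  for (int j = 0; j < set[m].n_neighbors; ++j)    // the nested forall
    sum += pair_force(set[m], set[m].neighbors[j]);
  set[m].force = sum;                             // the reduce_sum result
}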
We could make it hard
pid = fork();                   // explicitly managing threads

lock(struct.lock);              // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);    // partition across nodes
Programmers, tools, and architecture need to play their positions
Programmer / Tools / Architecture
Programmer: the algorithm; all of the parallelism; abstract locality
Tools: combinatorial optimization; mapping; selection of mechanisms
Architecture: fast mechanisms; exposed costs
For the molecular-dynamics program above:
Tools: map the foralls in time and space; map molecules across memories; stage data up and down the hierarchy; select mechanisms
Architecture: exposed storage hierarchy; fast comm/sync/thread mechanisms
Abstract description of Locality – not mapping

compute_forces::inner(molecules, forces) {
  tunable N;
  set part_molecules[N];
  part_molecules = subdivide(molecules, N);
  forall (i in 0:N-1) {
    compute_forces(part_molecules[i]);
  }
}

The autotuner picks the number and size of partitions, recursively.
No need to worry about "ghost molecules": with a global address space, it just works.
Autotuning Search Spaces
T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle. "Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation." IEEE PACT, pages 237-248, 2000.
[Figure: execution time of matrix multiplication as a function of tile size and unroll factor.]
Architecture enables simple and effective autotuning
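A minimal sketch of what such autotuning can look like in CUDA: a toy stencil kernel (illustrative, not from the talk) is timed over candidate thread-block sizes, the analogue of the tunable N above, and the fastest is kept.

#include <cstdio>

__global__ void stencil(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i > 0 && i < n - 1) out[i] = 0.5f * (in[i - 1] + in[i + 1]);
}

int main() {
  const int n = 1 << 20;
  float *in, *out;
  cudaMalloc(&in,  n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));

  int best = 0; float best_ms = 1e30f;
  for (int tile = 32; tile <= 1024; tile *= 2) {   // the search space
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    stencil<<<(n + tile - 1) / tile, tile>>>(in, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);             // measure, don't model
    if (ms < best_ms) { best_ms = ms; best = tile; }
    cudaEventDestroy(t0); cudaEventDestroy(t1);
  }
  printf("best tile size: %d (%.3f ms)\n", best, best_ms);

  cudaFree(in); cudaFree(out);
  return 0;
}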
Performance of Auto-tuner
                      Conv2D   SGEMM   FFT3D   SUmb
Cell            Auto   96.4    129     57      10.5
                Hand   85      119     54      n/a
Cluster         Auto   26.7    91.3    5.5     1.65
                Hand   24      90      5.5     n/a
Cluster of PS3s Auto   19.5    32.4    0.55    0.49
                Hand   19      30      0.23    n/a

Measured raw performance of the benchmarks, auto-tuner vs. hand-tuned, in GFLOPS.
For FFT3D, performance is with fusion of leaf tasks.
SUmb is too complicated to be hand-tuned.
Fundamental and Incidental Obstacles to Programmability
Fundamental:
  Expressing 10⁹-way parallelism
  Expressing locality to deal with a >100:1 global:local energy ratio
  Balancing load across 10⁹ cores
Incidental:
  Dealing with multiple address spaces
  Partitioning data across nodes
  Aggregating data to amortize message overhead
The fundamental problems are hard enough. We must eliminate the incidental ones.
Execution Model
[Figure: execution model. A global address space accessed by load/store, an abstract memory hierarchy, threads and objects, an active message from thread A to thread B, and bulk transfers.]
Thread array creation, messages, block transfers, collective operations – at the “speed of light”
Kepler
Hardware thread-array creation
Fast __syncthreads()
Shared memory
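For concreteness, a small kernel exercising all three mechanisms (a sketch, not code from the talk): the grid launch is the hardware thread-array creation, __shared__ is the scratchpad, and __syncthreads() is the fast barrier. Assumes a power-of-two block size of at most 256, launched e.g. as block_sum<<<n / 256, 256>>>(in, out).

__global__ void block_sum(const float* in, float* out) {
  __shared__ float tile[256];                     // shared memory
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = in[i];
  __syncthreads();                                // fast barrier
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
    if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}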
Scalar ISAs don’t matter
Parallel ISAs – the mechanisms for threads, communication, and synchronization make a huge difference.
A Prescription
Research
Need a research vehicle (an experimental system): co-design architecture, programming system, and applications
Productive parallel programming: express all the parallelism and locality; compiler and run-time map it to the target machine; leverage an existing ecosystem
Mechanisms for threads, communication, and synchronization: eliminate 'incidental' programming issues; enable fine-grain execution
Power: locality (an exposed memory hierarchy and software to use it); overhead (move scheduling to the compiler)
Others are investing; if we don't invest, we will be left behind.
Education
We need parallel programmers, but we are training serial programmers and serial thinkers.
Parallelism belongs throughout the CS curriculum:
  Programming
  Algorithms: parallel algorithms; analysis focused on communication, not on counting ops
  Systems: models need to include locality
A Bright Future from Supercomputers to Cellphones
Eliminate overhead and exploit locality to get 100x power efficiency
Easy parallelism with a coordinated team: programmer, tools, architecture
[Figure: mobile SoC block diagram with CPU cores 0-4, GPU, HD video decoder and encoder, audio, ISP, security engine, display, HDMI, and memory I/O.]
More Fundamentally
Both supercomputers and cellphones:
  are power limited
  get performance from parallelism
  need a 100x performance increase in 10 years
Granularity
Number of threads is increasing faster than problem size
[Chart: thread count growing faster than problem size over time, annotated with the weak-scaling and strong-scaling regimes.]
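For reference, the standard definitions (not from the slide): writing $T(n,p)$ for the time to solve a size-$n$ problem on $p$ threads, weak scaling grows the problem with the machine while strong scaling holds it fixed,

$$E_{\text{weak}}(p) = \frac{T(n,1)}{T(np,\,p)}, \qquad S_{\text{strong}}(p) = \frac{T(n,1)}{T(n,p)}.$$

Strong scaling is what shrinks the per-thread sub-problem.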
Smaller sub-problem per thread
More frequent comm, sync, and thread operations
This fine-grain parallelism is multi-level and irregular
To support this requires fast mechanisms for:
Thread arrays: create, terminate, suspend, resume; hardware allocation of resources (threads, registers, shared memory) to a thread array, with locality
Communication: data movement up and down the hierarchy; fast active messages (message-driven computing)
Synchronization: collective operations (e.g., barrier, reduce); pairwise producer-consumer (a sketch follows below)
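A minimal sketch of the pairwise case on a current GPU, assuming both blocks are co-resident (true for a two-block launch on any modern part); all names here are illustrative, not from the talk.

__device__ volatile int   ready = 0;   // flag in global memory
__device__ volatile float mailbox;     // the produced datum

__global__ void pair(float* out) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {         // producer
    mailbox = 42.0f;
    __threadfence();                                 // publish data before flag
    ready = 1;
  } else if (blockIdx.x == 1 && threadIdx.x == 0) {  // consumer
    while (ready == 0) { }                           // spin on the flag
    __threadfence();                                 // order flag before data read
    *out = mailbox;
  }
}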
Execution Model
[Figure: the same execution-model diagram as above: global address space, abstract memory hierarchy, active messages, bulk transfers.]
J-Machine Speedup with Strong Scaling
[Plot: speedup under strong scaling, down to 2 characters per node.]
Noakes et al., "The J-Machine Multicomputer: An Architectural Evaluation," ISCA 1993, pp. 224-235.