Bill Dally (NVIDIA) — SC12, November 14, 2012



Titan: World’s #1 Open Science Supercomputer

18,688 Tesla K20X GPUs

27 Petaflops Peak: 90% of Performance from GPUs

17.59 Petaflops Sustained Performance on Linpack


Titan

18,688 Kepler GK110 GPUs

27 PF peak (90% from GPUs)

17.6 PF HPL (Linpack)

2.12 GF/W

GK110 alone is 7 GF/W


The Road to Exascale

2012 (you are here): 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPS/W, ~10^7 threads

2020: 1,000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPS/W (25x), ~10^9 threads (100x)


Technical Challenges on the Road to Exascale

Same 2012 → 2020 targets as above, with three challenges:

1. Energy Efficiency
2. Parallel Programmability
3. Resilience


Energy Efficiency


Moore’s Law to the Rescue?

Moore, Electronics 38(8) April 19, 1965


Unfortunately Not!

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011


Chips are now power, not area, limited

[Figure: two chips at different power budgets, P = 150 W and P = 5 W]

Perf (Ops/s) = P (W) × Eff (Ops/J)

Process is improving Eff by 15-25% per node, i.e., 2-3x over 8 years
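As a quick sanity check of that relation, the sketch below plugs in the Titan numbers quoted earlier (17.59 PF sustained, 2.12 GF/W); the implied power is derived from those slide values, not a separate measurement.

// Back-of-the-envelope check of Perf = Power x Efficiency using the
// Titan numbers from the earlier slides. Illustrative only.
#include <cstdio>

int main() {
    const double perf_gflops = 17.59e6;    // 17.59 PF expressed in GFLOPS
    const double eff_gflops_per_w = 2.12;  // GFLOPS per watt (slide value)

    const double power_w = perf_gflops / eff_gflops_per_w;  // P = Perf / Eff
    std::printf("Implied system power: %.1f MW\n", power_w / 1e6);  // ~8.3 MW

    // 2020 target from the roadmap slide: 1,000 PF in 20 MW
    const double target_eff = 1000e6 / 20e6;  // = 50 GFLOPS/W
    std::printf("2020 target efficiency: %.0f GFLOPS/W\n", target_eff);
    return 0;
}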


We need 25x energy efficiency

2-3x will come from process

10x must come from

Architecture and Circuits


Energy Efficiency

2 GFLOPS/W today → 50 GFLOPS/W needed (25x)

4 process nodes (3x): 6 GFLOPS/W
Integrate host (2x): 12 GFLOPS/W


The High Cost of Data Movement

Fetching operands costs more than computing on them (28 nm, 20 mm die, 256-bit buses):

64-bit DP FLOP: 20 pJ
256-bit bus, local: 26 pJ
256-bit bus, across the 20 mm chip: 256 pJ
256-bit access of an 8 kB SRAM: 50 pJ
Off-chip: ~1 nJ (500 pJ with an efficient off-chip link)
DRAM read/write: 16 nJ
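The architectural consequence is to keep operands in on-chip storage and reuse them. The CUDA sketch below is not from the talk; it is a minimal example of staging data in shared memory so each value loaded from DRAM is reused by several cheap FLOPs instead of being refetched.

// Minimal CUDA sketch (not from the talk): stage data in on-chip shared
// memory so each DRAM load feeds several 20 pJ FLOPs instead of paying
// the ~16 nJ DRAM cost again for every operand.
#define RADIUS 3
#define BLOCK  256

__global__ void stencil1d(const double* in, double* out, int n) {
    __shared__ double tile[BLOCK + 2 * RADIUS];    // on-chip SRAM, tens of pJ per access
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + RADIUS;

    tile[l] = (g < n) ? in[g] : 0.0;               // one DRAM load per element
    if (threadIdx.x < RADIUS) {                    // halo loads at the block edges
        tile[l - RADIUS] = (g >= RADIUS)   ? in[g - RADIUS] : 0.0;
        tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK]  : 0.0;
    }
    __syncthreads();

    if (g < n) {
        double sum = 0.0;
        for (int k = -RADIUS; k <= RADIUS; ++k)    // 2*RADIUS+1 reuses from shared memory
            sum += tile[l + k];
        out[g] = sum;
    }
}

// Launch with blockDim.x == BLOCK, e.g.:
//   stencil1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);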


Low-Energy Signaling


Energy cost with efficient signaling (same 28 nm, 20 mm die, 256-bit buses):

64-bit DP FLOP: 20 pJ
256-bit bus, local: 3 pJ
256-bit bus, across the 20 mm chip: 30 pJ
256-bit access of an 8 kB SRAM: 50 pJ
Off-chip, with an efficient off-chip link: 100 pJ
DRAM read/write: 1 nJ
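Comparing the two charts in units of a 20 pJ FLOP makes the point concrete; the pairings below are read off the two figures and are illustrative only.

// Data-movement energy relative to a 20 pJ DP FLOP, before and after
// efficient signaling (numbers read off the two figures above).
#include <cstdio>

int main() {
    const double flop_pj = 20.0;
    struct { const char* what; double before, after; } moves[] = {
        {"local 256-bit bus",        26.0,     3.0},
        {"20 mm across chip",       256.0,    30.0},
        {"efficient off-chip link", 500.0,   100.0},
        {"DRAM read/write",       16000.0,  1000.0},
    };
    for (auto& m : moves)
        std::printf("%-24s %7.1fx -> %6.1fx the cost of a FLOP\n",
                    m.what, m.before / flop_pj, m.after / flop_pj);
    return 0;
}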


Energy Efficiency

2 GFLOPS/W today → 50 GFLOPS/W needed (25x)

4 process nodes (3x): 6 GFLOPS/W
Integrate host (2x): 12 GFLOPS/W
Advanced signaling (2x): 24 GFLOPS/W


An out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add)
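The ratios implied by those numbers are what motivate the throughput-oriented lane design on the next slide; the snippet below just multiplies them out.

// Scheduling overhead implied by the slide's numbers: the 2 nJ issue cost
// dwarfs the arithmetic it schedules.
#include <cstdio>

int main() {
    const double schedule_pj = 2000.0;  // 2 nJ to schedule one instruction
    const double fmul_pj     = 25.0;    // double-precision multiply
    const double iadd_pj     = 0.5;     // integer add
    std::printf("overhead vs FMUL: %.0fx\n", schedule_pj / fmul_pj);  //   80x
    std::printf("overhead vs IADD: %.0fx\n", schedule_pj / iadd_pj);  // 4000x
    return 0;
}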


SM Lane Architecture

[Figure: one lane's datapath and control path — a main register file (RF) and operand register files (ORFs) feeding two FP/Int units and an LS/BR unit; L0/L1 address paths out to LD/ST and the network; four LM banks; a control path with an L0 I$, thread PCs, active PCs, and a scheduler]

64 threads
4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 bytes)
L0 I$: 64 instructions (1 kB)
LM bank: 8 KB (32 KB total)
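A small consistency check of those parameters, with per-SM totals under the assumption (taken from the Echelon floorplan slide later) that each SM is built from 8 such lanes.

// Per-lane bookkeeping from the parameters above, plus per-SM totals
// assuming 8 lanes per SM (per the floorplan slide).
#include <cstdio>

int main() {
    const int orf_entries   = 16;        // 64-bit entries per ORF bank
    const int lm_banks      = 4;         // LM banks per lane
    const int lm_bank_bytes = 8 * 1024;  // 8 KB per bank
    const int dfma_per_clk  = 2;         // 2 DFMAs = 4 FLOPs/clock

    std::printf("ORF bank: %d bytes\n", orf_entries * 8);                   // 128 B
    std::printf("LM per lane: %d KB\n", lm_banks * lm_bank_bytes / 1024);   // 32 KB
    std::printf("FLOPs/clock per lane: %d\n", dfma_per_clk * 2);            // 4

    const int lanes_per_sm = 8;          // assumption, per the floorplan
    std::printf("FLOPs/clock per SM: %d\n", lanes_per_sm * dfma_per_clk * 2);  // 32
    return 0;
}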


Energy Efficiency

2 GFLOPS/W today → 50 GFLOPS/W needed (25x)

4 process nodes (3x): 6 GFLOPS/W
Integrate host (2x): 12 GFLOPS/W
Advanced signaling (2x): 24 GFLOPS/W
Efficient microarchitecture (2x): 48 GFLOPS/W
Optimized voltages, efficient memory, locality enhancement (??x)
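Multiplying out the named steps shows how little headroom the last bucket must cover; the check below uses only the waterfall's own factors.

// Cumulative gain from the named steps in the waterfall above.
#include <cstdio>

int main() {
    const double start  = 2.0;                   // GFLOPS/W in 2012
    const double target = 50.0;                  // GFLOPS/W needed in 2020
    const double steps  = 3.0 * 2.0 * 2.0 * 2.0; // process x host x signaling x uarch
    std::printf("after named steps: %.0f GFLOPS/W (%.0fx)\n", start * steps, steps);
    std::printf("still needed from voltages/memory/locality: %.2fx\n",
                target / (start * steps));       // ~1.04x
    return 0;
}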


Parallel Programmability


Parallel programming is not inherently any more difficult than serial programming.

However, we can make it a lot more difficult.


A simple parallel program

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}


Why is this easy?

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

No machine details
All parallelism is expressed
Synchronization is semantic (in reduction)
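For concreteness, here is a minimal CUDA sketch (not from the talk) of one possible mapping of that forall nest: one thread per molecule, the nested foralls becoming loops, and the reduce_sum accumulated in a register. The Molecule layout, force_term, and all names are hypothetical.

// Minimal CUDA sketch of one mapping of the forall nest (hypothetical types).
struct Molecule {
    double pos;            // 1-D position, just to have something to compute on
    double force;          // accumulated force
    int    neighbors[32];  // indices of neighboring molecules
    int    num_neighbors;
};

#define NUM_FORCE_TERMS 2

__device__ double force_term(int k, const Molecule& a, const Molecule& b) {
    double d = a.pos - b.pos;
    return (k == 0) ? d / (1.0 + d * d)    // illustrative term 0
                    : -0.1 * d;            // illustrative term 1
}

__global__ void compute_forces(Molecule* set, int n) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;        // forall molecule
    if (m >= n) return;
    double sum = 0.0;                                     // reduce_sum accumulator
    for (int j = 0; j < set[m].num_neighbors; ++j) {      // forall neighbor
        const Molecule& nb = set[set[m].neighbors[j]];
        for (int f = 0; f < NUM_FORCE_TERMS; ++f)         // forall force
            sum += force_term(f, set[m], nb);
    }
    set[m].force = sum;
}

// Launch with one thread per molecule, e.g.:
//   compute_forces<<<(n + 255) / 256, 256>>>(d_set, n);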


We could make it hard

pid = fork();                  // explicitly managing threads

lock(struct.lock);             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);   // partition across nodes


Programmers, tools, and architecture need to play their positions

[Figure: Programmer, Tools, and Architecture as the three corners of a triangle]


Programmers, tools, and architecture need to play their positions

Programmer: algorithm, all of the parallelism, abstract locality
Tools: combinatorial optimization, mapping, selection of mechanisms
Architecture: fast mechanisms, exposed costs


Programmers, tools, and architecture need to play their positions

Programmer:

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools: map foralls in time and space, map molecules across memories, stage data up/down the hierarchy, select mechanisms

Architecture: exposed storage hierarchy, fast comm/sync/thread mechanisms


Fundamental and Incidental Obstacles to Programmability

Fundamental:
  Expressing 10^9-way parallelism
  Expressing locality to deal with a >100:1 global:local energy ratio
  Balancing load across 10^9 cores

Incidental:
  Dealing with multiple address spaces
  Partitioning data across nodes
  Aggregating data to amortize message overhead


Parallel Programmability: from ~10^7 threads to ~10^9 threads

Abstract parallelism and locality
Autotuning mapper
Fast communication, synchronization, and thread management
Exposed storage hierarchy


NVIDIA Exascale Architecture


System Sketch


Echelon Chip Floorplan

[Figure: die floorplan — L2 banks and a crossbar (XBAR) in the center; rows of SM tiles (each SM built from 8 lanes) grouped around NOC routers; LOC (latency-optimized core) tiles alongside the SM rows; DRAM I/O and NW I/O around the perimeter]

17 mm die, 10 nm process, 290 mm^2


The fundamental problems are hard enough. We must eliminate the incidental ones.


Parallel Roads to Exascale

Energy efficiency: 2 GFLOPS/W → 50 GFLOPS/W (25x)
  4 process nodes (3x): 6 GFLOPS/W
  Integrate host (2x): 12 GFLOPS/W
  Advanced signaling (2x): 24 GFLOPS/W
  Efficient microarchitecture (2x): 48 GFLOPS/W
  Optimized voltages, efficient memory, locality enhancement (??x)

Programmability: ~10^9 threads
  Abstract parallelism and locality
  Autotuning mapper
  Fast communication, synchronization, and thread management
  Exposed storage hierarchy