Bill Dally (NVIDIA) — SC12, November 14, 2012



Titan: World’s #1 Open Science Supercomputer

18,688 Tesla K20X GPUs

27 Petaflops Peak: 90% of Performance from GPUs

17.59 Petaflops Sustained Performance on Linpack


Titan

18,688 Kepler GK110 GPUs

27 PF peak (90% from GPUs)

17.6 PF HPL (Linpack)

2.12 GF/W

GK110 alone is 7 GF/W


The Road to Exascale

2012 (you are here): 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPS/W, ~10^7 threads

2020: 1,000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPS/W (25x), ~10^9 threads (100x)


Technical Challenges on the Road to Exascale

Same 2012 → 2020 targets as above, with three challenges:

1. Energy Efficiency
2. Parallel Programmability
3. Resilience


Energy Efficiency


Moore’s Law to the Rescue?

Moore, Electronics 38(8) April 19, 1965


Unfortunately Not!

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011


Chips are now power, not area, limited

[Figure: two chips at different power budgets, P = 150 W and P = 5 W]

Perf (Ops/s) = P (W) × Eff (Ops/J)

Process is improving Eff by 15-25% per node, i.e., 2-3x over 8 years
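As a quick sanity check of that relation, the sketch below plugs in the Titan numbers quoted earlier (17.59 PF sustained, 2.12 GF/W); the implied power is derived from those slide values, not a separate measurement.

// Back-of-the-envelope check of Perf = Power x Efficiency using the
// Titan numbers from the earlier slides. Illustrative only.
#include <cstdio>

int main() {
    const double perf_gflops = 17.59e6;    // 17.59 PF expressed in GFLOPS
    const double eff_gflops_per_w = 2.12;  // GFLOPS per watt (slide value)

    const double power_w = perf_gflops / eff_gflops_per_w;  // P = Perf / Eff
    std::printf("Implied system power: %.1f MW\n", power_w / 1e6);  // ~8.3 MW

    // 2020 target from the roadmap slide: 1,000 PF in 20 MW
    const double target_eff = 1000e6 / 20e6;  // = 50 GFLOPS/W
    std::printf("2020 target efficiency: %.0f GFLOPS/W\n", target_eff);
    return 0;
}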


We need 25x energy efficiency

2-3x will come from process

10x must come from

Architecture and Circuits


Energy Efficiency

2 GFLOPS/W today → 50 GFLOPS/W needed (25x)

4 process nodes (3x): 6 GFLOPS/W
Integrate host (2x): 12 GFLOPS/W


The High Cost of Data Movement

Fetching operands costs more than computing on them (28 nm, 20 mm die, 256-bit buses):

64-bit DP FLOP: 20 pJ
256-bit bus, local: 26 pJ
256-bit bus, across the 20 mm chip: 256 pJ
256-bit access of an 8 kB SRAM: 50 pJ
Off-chip: ~1 nJ (500 pJ with an efficient off-chip link)
DRAM read/write: 16 nJ
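The architectural consequence is to keep operands in on-chip storage and reuse them. The CUDA sketch below is not from the talk; it is a minimal example of staging data in shared memory so each value loaded from DRAM is reused by several cheap FLOPs instead of being refetched.

// Minimal CUDA sketch (not from the talk): stage data in on-chip shared
// memory so each DRAM load feeds several 20 pJ FLOPs instead of paying
// the ~16 nJ DRAM cost again for every operand.
#define RADIUS 3
#define BLOCK  256

__global__ void stencil1d(const double* in, double* out, int n) {
    __shared__ double tile[BLOCK + 2 * RADIUS];    // on-chip SRAM, tens of pJ per access
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + RADIUS;

    tile[l] = (g < n) ? in[g] : 0.0;               // one DRAM load per element
    if (threadIdx.x < RADIUS) {                    // halo loads at the block edges
        tile[l - RADIUS] = (g >= RADIUS)   ? in[g - RADIUS] : 0.0;
        tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK]  : 0.0;
    }
    __syncthreads();

    if (g < n) {
        double sum = 0.0;
        for (int k = -RADIUS; k <= RADIUS; ++k)    // 2*RADIUS+1 reuses from shared memory
            sum += tile[l + k];
        out[g] = sum;
    }
}

// Launch with blockDim.x == BLOCK, e.g.:
//   stencil1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);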


Low-Energy Signaling


Energy cost with efficient signaling (same 28 nm, 20 mm die, 256-bit buses):

64-bit DP FLOP: 20 pJ
256-bit bus, local: 3 pJ
256-bit bus, across the 20 mm chip: 30 pJ
256-bit access of an 8 kB SRAM: 50 pJ
Off-chip, with an efficient off-chip link: 100 pJ
DRAM read/write: 1 nJ
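Comparing the two charts in units of a 20 pJ FLOP makes the point concrete; the pairings below are read off the two figures and are illustrative only.

// Data-movement energy relative to a 20 pJ DP FLOP, before and after
// efficient signaling (numbers read off the two figures above).
#include <cstdio>

int main() {
    const double flop_pj = 20.0;
    struct { const char* what; double before, after; } moves[] = {
        {"local 256-bit bus",        26.0,     3.0},
        {"20 mm across chip",       256.0,    30.0},
        {"efficient off-chip link", 500.0,   100.0},
        {"DRAM read/write",       16000.0,  1000.0},
    };
    for (auto& m : moves)
        std::printf("%-24s %7.1fx -> %6.1fx the cost of a FLOP\n",
                    m.what, m.before / flop_pj, m.after / flop_pj);
    return 0;
}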


Energy Efficiency

2 GFLOPS/W today → 50 GFLOPS/W needed (25x)

4 process nodes (3x): 6 GFLOPS/W
Integrate host (2x): 12 GFLOPS/W
Advanced signaling (2x): 24 GFLOPS/W


An out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add)
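The ratios implied by those numbers are what motivate the throughput-oriented lane design on the next slide; the snippet below just multiplies them out.

// Scheduling overhead implied by the slide's numbers: the 2 nJ issue cost
// dwarfs the arithmetic it schedules.
#include <cstdio>

int main() {
    const double schedule_pj = 2000.0;  // 2 nJ to schedule one instruction
    const double fmul_pj     = 25.0;    // double-precision multiply
    const double iadd_pj     = 0.5;     // integer add
    std::printf("overhead vs FMUL: %.0fx\n", schedule_pj / fmul_pj);  //   80x
    std::printf("overhead vs IADD: %.0fx\n", schedule_pj / iadd_pj);  // 4000x
    return 0;
}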


SM Lane Architecture

[Figure: one lane's datapath and control path — a main register file (RF) and operand register files (ORFs) feeding two FP/Int units and an LS/BR unit; L0/L1 address paths out to LD/ST and the network; four LM banks; a control path with an L0 I$, thread PCs, active PCs, and a scheduler]

64 threads
4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 bytes)
L0 I$: 64 instructions (1 kB)
LM bank: 8 KB (32 KB total)
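A small consistency check of those parameters, with per-SM totals under the assumption (taken from the Echelon floorplan slide later) that each SM is built from 8 such lanes.

// Per-lane bookkeeping from the parameters above, plus per-SM totals
// assuming 8 lanes per SM (per the floorplan slide).
#include <cstdio>

int main() {
    const int orf_entries   = 16;        // 64-bit entries per ORF bank
    const int lm_banks      = 4;         // LM banks per lane
    const int lm_bank_bytes = 8 * 1024;  // 8 KB per bank
    const int dfma_per_clk  = 2;         // 2 DFMAs = 4 FLOPs/clock

    std::printf("ORF bank: %d bytes\n", orf_entries * 8);                   // 128 B
    std::printf("LM per lane: %d KB\n", lm_banks * lm_bank_bytes / 1024);   // 32 KB
    std::printf("FLOPs/clock per lane: %d\n", dfma_per_clk * 2);            // 4

    const int lanes_per_sm = 8;          // assumption, per the floorplan
    std::printf("FLOPs/clock per SM: %d\n", lanes_per_sm * dfma_per_clk * 2);  // 32
    return 0;
}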


Energy Efficiency

2 GFLOPS/W today → 50 GFLOPS/W needed (25x)

4 process nodes (3x): 6 GFLOPS/W
Integrate host (2x): 12 GFLOPS/W
Advanced signaling (2x): 24 GFLOPS/W
Efficient microarchitecture (2x): 48 GFLOPS/W
Optimized voltages, efficient memory, locality enhancement (??x)
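Multiplying out the named steps shows how little headroom the last bucket must cover; the check below uses only the waterfall's own factors.

// Cumulative gain from the named steps in the waterfall above.
#include <cstdio>

int main() {
    const double start  = 2.0;                   // GFLOPS/W in 2012
    const double target = 50.0;                  // GFLOPS/W needed in 2020
    const double steps  = 3.0 * 2.0 * 2.0 * 2.0; // process x host x signaling x uarch
    std::printf("after named steps: %.0f GFLOPS/W (%.0fx)\n", start * steps, steps);
    std::printf("still needed from voltages/memory/locality: %.2fx\n",
                target / (start * steps));       // ~1.04x
    return 0;
}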


Parallel Programmability


Parallel programming is not inherently any more difficult than serial programming.

However, we can make it a lot more difficult.


A simple parallel program

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}


Why is this easy?

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

No machine details
All parallelism is expressed
Synchronization is semantic (in reduction)
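For concreteness, here is a minimal CUDA sketch (not from the talk) of one possible mapping of that forall nest: one thread per molecule, the nested foralls becoming loops, and the reduce_sum accumulated in a register. The Molecule layout, force_term, and all names are hypothetical.

// Minimal CUDA sketch of one mapping of the forall nest (hypothetical types).
struct Molecule {
    double pos;            // 1-D position, just to have something to compute on
    double force;          // accumulated force
    int    neighbors[32];  // indices of neighboring molecules
    int    num_neighbors;
};

#define NUM_FORCE_TERMS 2

__device__ double force_term(int k, const Molecule& a, const Molecule& b) {
    double d = a.pos - b.pos;
    return (k == 0) ? d / (1.0 + d * d)    // illustrative term 0
                    : -0.1 * d;            // illustrative term 1
}

__global__ void compute_forces(Molecule* set, int n) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;        // forall molecule
    if (m >= n) return;
    double sum = 0.0;                                     // reduce_sum accumulator
    for (int j = 0; j < set[m].num_neighbors; ++j) {      // forall neighbor
        const Molecule& nb = set[set[m].neighbors[j]];
        for (int f = 0; f < NUM_FORCE_TERMS; ++f)         // forall force
            sum += force_term(f, set[m], nb);
    }
    set[m].force = sum;
}

// Launch with one thread per molecule, e.g.:
//   compute_forces<<<(n + 255) / 256, 256>>>(d_set, n);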


We could make it hard

pid = fork();                  // explicitly managing threads

lock(struct.lock);             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);   // partition across nodes


Programmers, tools, and architecture need to play their positions

[Figure: Programmer, Tools, and Architecture as the three corners of a triangle]


Programmers, tools, and architecture need to play their positions

Programmer: algorithm, all of the parallelism, abstract locality
Tools: combinatorial optimization, mapping, selection of mechanisms
Architecture: fast mechanisms, exposed costs


Programmers, tools, and architecture need to play their positions

Programmer:

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools: map foralls in time and space, map molecules across memories, stage data up/down the hierarchy, select mechanisms

Architecture: exposed storage hierarchy, fast comm/sync/thread mechanisms


Fundamental and Incidental Obstacles to Programmability

Fundamental:
  Expressing 10^9-way parallelism
  Expressing locality to deal with a >100:1 global:local energy ratio
  Balancing load across 10^9 cores

Incidental:
  Dealing with multiple address spaces
  Partitioning data across nodes
  Aggregating data to amortize message overhead


Parallel Programmability: from ~10^7 threads to ~10^9 threads

Abstract parallelism and locality
Autotuning mapper
Fast communication, synchronization, and thread management
Exposed storage hierarchy


NVIDIA Exascale Architecture


System Sketch


Echelon Chip Floorplan

[Figure: die floorplan — L2 banks and a crossbar (XBAR) in the center; rows of SM tiles (each SM built from 8 lanes) grouped around NOC routers; LOC (latency-optimized core) tiles alongside the SM rows; DRAM I/O and NW I/O around the perimeter]

17 mm die, 10 nm process, 290 mm^2


The fundamental problems are hard enough. We must eliminate the incidental ones.


Parallel Roads to Exascale

Energy efficiency: 2 GFLOPS/W → 50 GFLOPS/W (25x)
  4 process nodes (3x): 6 GFLOPS/W
  Integrate host (2x): 12 GFLOPS/W
  Advanced signaling (2x): 24 GFLOPS/W
  Efficient microarchitecture (2x): 48 GFLOPS/W
  Optimized voltages, efficient memory, locality enhancement (??x)

Programmability: ~10^9 threads
  Abstract parallelism and locality
  Autotuning mapper
  Fast communication, synchronization, and thread management
  Exposed storage hierarchy