BillDally_NVIDIA_SC12 (SC12, 11/14/12)
Transcript
Page 1: BillDally_NVIDIA_SC12

Page 2: BillDally_NVIDIA_SC12

Titan: World’s #1 Open Science Supercomputer

18,688 Tesla K20X GPUs

27 Petaflops Peak: 90% of Performance from GPUs

17.59 Petaflops Sustained Performance on Linpack

Page 3: BillDally_NVIDIA_SC12

Titan

18,688 Kepler GK110

27 PF peak (90% from GPUs)

17.6 PF HPL (High-Performance Linpack)

2.12 GF/W for the system

GK110 alone is 7 GF/W
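
A quick check of what these figures imply (arithmetic only; Titan's power draw is not on the slide): the system-level efficiency corresponds to a Linpack power of about

\[ P \approx \frac{17.59\ \text{PF}}{2.12\ \text{GF/W}} \approx 8.3\ \text{MW} \]

The gap between 2.12 GF/W for the system and 7 GF/W for the GK110 itself is presumably everything around the GPU: host CPUs, memory, network, power delivery, and cooling.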

Page 4: BillDally_NVIDIA_SC12

The Road to Exascale

2012 (you are here): 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPs/W, ~10^7 threads
2020:                1000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPs/W (25x), ~10^9 threads (100x)
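
The required factors follow directly from the two rows: performance must grow 50x on only 2x the power, so efficiency (and, with it, parallelism) has to cover the rest:

\[ \frac{1000\ \text{PF}}{20\ \text{PF}} = 50\times, \qquad \frac{20\ \text{MW}}{10\ \text{MW}} = 2\times, \qquad \frac{50\times}{2\times} = 25\times \ \text{in GFLOPs/W} \]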

Page 5: BillDally_NVIDIA_SC12

Technical Challenges on the Road to Exascale

2012: 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPs/W, ~10^7 threads
2020: 1000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPs/W (25x), ~10^9 threads (100x)

1. Energy Efficiency

Page 6: BillDally_NVIDIA_SC12

Technical Challenges on the Road to Exascale

2012: 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPs/W, ~10^7 threads
2020: 1000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPs/W (25x), ~10^9 threads (100x)

1. Energy Efficiency

2. Parallel Programmability

Page 7: BillDally_NVIDIA_SC12

Technical Challenges on the Road to Exascale

2012: 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPs/W, ~10^7 threads
2020: 1000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPs/W (25x), ~10^9 threads (100x)

1. Energy Efficiency

2. Parallel Programmability

3. Resilience

Page 8: BillDally_NVIDIA_SC12

Energy Efficiency

Page 9: BillDally_NVIDIA_SC12

Moore’s Law to the Rescue?

G. E. Moore, Electronics 38(8), April 19, 1965

Page 10: BillDally_NVIDIA_SC12

Unfortunately Not!

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Page 11: BillDally_NVIDIA_SC12

Chips are now power-limited, not area-limited

[Figure: two chips, one at P = 150 W and one at P = 5 W]

Perf (ops/s) = P (W) * Eff (ops/J)

Process scaling improves Eff by only 15-25% per node: 2-3x over 8 years
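
The 2-3x figure is just the per-node gain compounded over the period; as a rough check, assuming about five node transitions in those eight years (the node count is my assumption, not the slide's):

\[ 1.15^{5} \approx 2.0, \qquad 1.25^{5} \approx 3.1 \]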

Page 12: BillDally_NVIDIA_SC12

We need 25x energy efficiency

2-3x will come from process

~10x must come from architecture and circuits
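
That split is just the required total divided by what process alone is expected to deliver (taking ~2.5x as the midpoint of 2-3x):

\[ \frac{25\times}{\sim 2.5\times} \approx 10\times \]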

Page 13: BillDally_NVIDIA_SC12

Energy Efficiency: 2 GFLOPs/W -> 50 GFLOPs/W (25x)

  4 process nodes:   6 GFLOPs/W (3x)
  Integrate host:   12 GFLOPs/W (2x)

Page 14: BillDally_NVIDIA_SC12

The High Cost of Data Movement

Fetching operands costs more than computing on them

[Figure: energy per operation in a 28 nm process, 256-bit buses, 20 mm die]
  64-bit DP operation:               20 pJ
  256-bit on-chip bus, short hop:    26 pJ
  256-bit bus across the 20 mm die:  256 pJ
  256-bit access of an 8 kB SRAM:    50 pJ
  Efficient off-chip link:           500 pJ
  Off-chip transfer:                 1 nJ
  256-bit DRAM Rd/Wr:                16 nJ
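
To make the point concrete, here is a small illustrative script (mine, not the slide's) that prices one double-precision FMA under these figures, assuming every 256-bit operand group is fetched fresh with no reuse; the operand-source labels follow my reading of the figure above.

# Energy per FLOP for a DP FMA (2 FLOPs, 3 source operands), using the
# per-operation costs above. A 256-bit transfer carries four 64-bit values,
# so the per-value fetch cost is the 256-bit cost divided by 4.
DP_OP_PJ     = 20      # 64-bit DP operation
SRAM_PJ      = 50      # 256-bit access of an 8 kB SRAM
CROSS_DIE_PJ = 256     # 256-bit bus across the 20 mm die
DRAM_PJ      = 16000   # 256-bit DRAM read/write

def energy_per_flop(operand_cost_pj, operands=3, flops=2):
    fetch = operands * operand_cost_pj / 4   # pJ spent moving the operands
    return (DP_OP_PJ + fetch) / flops        # pJ per delivered FLOP

for label, cost in [("local SRAM", SRAM_PJ),
                    ("across the die", CROSS_DIE_PJ),
                    ("DRAM", DRAM_PJ)]:
    print(f"operands from {label:15s}: {energy_per_flop(cost):7.1f} pJ/FLOP")

Even out of local SRAM the operand traffic (about 37 pJ) already exceeds the 20 pJ of the arithmetic itself; out of DRAM it dominates by several hundred times.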

Page 15: BillDally_NVIDIA_SC12

Low-Energy Signaling

Page 16: BillDally_NVIDIA_SC12

Energy cost with efficient signaling

[Figure: same diagram, 28 nm process, 256-bit buses, 20 mm die, with low-energy signaling]
  64-bit DP operation:               20 pJ
  256-bit on-chip bus, short hop:    3 pJ
  256-bit bus across the 20 mm die:  30 pJ
  256-bit access of an 8 kB SRAM:    50 pJ
  Efficient off-chip link:           100 pJ
  Off-chip transfer:                 100 pJ
  256-bit DRAM Rd/Wr:                1 nJ
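
Compared with the previous slide, the two explicitly labeled costs fall by

\[ \frac{500\ \text{pJ}}{100\ \text{pJ}} = 5\times \ \text{(efficient off-chip link)}, \qquad \frac{16\ \text{nJ}}{1\ \text{nJ}} = 16\times \ \text{(DRAM Rd/Wr)} \]

while the arithmetic (20 pJ) and SRAM access (50 pJ) are unchanged; the savings come from signaling, not from the compute.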

Page 17: BillDally_NVIDIA_SC12

Energy Efficiency: 2 GFLOPs/W -> 50 GFLOPs/W (25x)

  4 process nodes:       6 GFLOPs/W (3x)
  Integrate host:       12 GFLOPs/W (2x)
  Advanced signaling:   24 GFLOPs/W (2x)

Page 18: BillDally_NVIDIA_SC12

An out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add)
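
The ratios are the striking part:

\[ \frac{2\ \text{nJ}}{25\ \text{pJ}} = 80\times \ \text{the FMUL energy}, \qquad \frac{2\ \text{nJ}}{0.5\ \text{pJ}} = 4000\times \ \text{the integer add} \]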

Page 19: BillDally_NVIDIA_SC12

SM Lane Architecture

[Block diagram: SM lane data path (FP/Int and LS/BR units, operand register files (ORFs), main RF, L0/L1 address units, network ports, LM banks 0-3, LD/ST interface) and control path (L0 I$, thread PCs, active PCs, scheduler)]

64 threads

4 active threads

2 DFMAs (4 FLOPS/clock)

ORF bank: 16 entries (128 Bytes)

L0 I$: 64 instructions (1KByte)

LM Bank: 8KB (32KB total)
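
A quick consistency check on these figures (the 64-bit ORF entry size is inferred from the stated 128-byte capacity):

\[ 4 \times 8\ \text{KB} = 32\ \text{KB total LM}, \qquad 16 \times 8\ \text{B} = 128\ \text{B per ORF bank}, \qquad 2\ \text{DFMAs} \times 2\ \text{FLOPs} = 4\ \text{FLOPs/clock} \]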

Page 20: BillDally_NVIDIA_SC12

Energy Efficiency: 2 GFLOPs/W -> 50 GFLOPs/W (25x)

  4 process nodes:               6 GFLOPs/W (3x)
  Integrate host:               12 GFLOPs/W (2x)
  Advanced signaling:           24 GFLOPs/W (2x)
  Efficient microarchitecture:  48 GFLOPs/W (2x)

Page 21: BillDally_NVIDIA_SC12

Energy Efficiency: 2 GFLOPs/W -> 50 GFLOPs/W (25x)

  4 process nodes:               6 GFLOPs/W (3x)
  Integrate host:               12 GFLOPs/W (2x)
  Advanced signaling:           24 GFLOPs/W (2x)
  Efficient microarchitecture:  48 GFLOPs/W (2x)
  Optimized voltages, efficient memory, locality enhancement: (??x)
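
A small illustrative script (mine, not the slide's) that walks this waterfall and shows how little headroom the quantified steps leave for the remaining, unquantified ones:

# Cumulative energy-efficiency waterfall from the slide's estimates.
baseline = 2.0    # GFLOPs/W (2012 system)
target   = 50.0   # GFLOPs/W (2020 goal)

steps = [
    ("4 process nodes",             3.0),
    ("integrate host",              2.0),
    ("advanced signaling",          2.0),
    ("efficient microarchitecture", 2.0),
]

eff = baseline
for name, factor in steps:
    eff *= factor
    print(f"{name:28s} x{factor:.0f} -> {eff:4.0f} GFLOPs/W")

print(f"left for voltages/memory/locality: x{target / eff:.2f}")

The quantified steps multiply out to 48 GFLOPs/W, leaving roughly a 1.04x gap to the 50 GFLOPs/W target on these numbers.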

Page 22: BillDally_NVIDIA_SC12

Parallel Programmability

Page 23: BillDally_NVIDIA_SC12

Parallel programming is not inherently any more difficult than serial programming.

However, we can make it a lot more difficult.

Page 24: BillDally_NVIDIA_SC12

A simple parallel program

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

Page 25: BillDally_NVIDIA_SC12

Why is this easy?

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

No machine details

All parallelism is expressed

Synchronization is semantic (in reduction)
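
As a concrete, purely illustrative rendering of the same structure in ordinary code (the molecule objects and force functions are hypothetical stand-ins, not an NVIDIA API), the whole computation is a triply nested loop whose only synchronization point is the reduction:

# Sketch of the slide's forall program in Python. Each molecule is assumed
# to carry a .neighbors list and a .force field; each entry of `forces` is
# a function returning a scalar contribution.
def compute_forces(molecules, forces):
    for molecule in molecules:                     # forall molecule in set
        molecule.force = sum(                      # reduce_sum
            force(molecule, neighbor)
            for neighbor in molecule.neighbors     # nested forall
            for force in forces                    # doubly nested forall
        )

Every iteration is independent, so all three loop levels could be handed to parallel hardware by the tools; nothing in the code says how.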

Page 26: BillDally_NVIDIA_SC12

We could make it hard

pid = fork() ;                  // explicitly managing threads
lock(struct.lock) ;             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock) ;
code = send(pid, tag, &msg) ;   // partition across nodes

Page 27: BillDally_NVIDIA_SC12

Programmers, tools, and architecture need to play their positions

[Diagram: Programmer, Tools, Architecture]

Page 28: BillDally_NVIDIA_SC12

Programmers, tools, and architecture need to play their positions

Programmer: the algorithm, all of the parallelism, abstract locality
Tools: combinatorial optimization, mapping, selection of mechanisms
Architecture: fast mechanisms, exposed costs

Page 29: BillDally_NVIDIA_SC12

Programmers, tools, and architecture need to play their positions

Programmer:
  forall molecule in set {                    // launch a thread array
    forall neighbor in molecule.neighbors {   // nested
      forall force in forces {                // doubly nested
        molecule.force =
          reduce_sum(force(molecule, neighbor))
      }
    }
  }

Tools: map foralls in time and space, map molecules across memories, stage data up/down the hierarchy, select mechanisms

Architecture: exposed storage hierarchy; fast comm/sync/thread mechanisms

Page 30: BillDally_NVIDIA_SC12

Fundamental and Incidental Obstacles to Programmability

Fundamental:
  Expressing 10^9-way parallelism
  Expressing locality to deal with >100:1 global:local energy (see the check after this list)
  Balancing load across 10^9 cores

Incidental:
  Dealing with multiple address spaces
  Partitioning data across nodes
  Aggregating data to amortize message overhead
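
The >100:1 figure is consistent with the data-movement slide earlier in the deck if "global" is read as a DRAM access and "local" as a nearby SRAM access (my reading, not spelled out on the slide):

\[ \frac{16\ \text{nJ}}{50\ \text{pJ}} = 320 : 1 \]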

Page 31: BillDally_NVIDIA_SC12

Parallel Programmability: ~10^7 threads -> ~10^9 threads

  Autotuning mapper
  Abstract parallelism and locality
  Fast communication, synchronization, and thread management
  Exposed storage hierarchy

Page 32: BillDally_NVIDIA_SC12

NVIDIA Exascale Architecture

Page 33: BillDally_NVIDIA_SC12

System Sketch

Page 34: BillDally_NVIDIA_SC12

Echelon Chip Floorplan

[Floorplan: rows of SM tiles (8 lanes each) grouped with LOCs and NOC routers, central L2 banks and crossbar (XBAR), DRAM I/O and NW I/O around the chip edge]

17 mm on a side, 10 nm process, 290 mm^2

Page 35: BillDally_NVIDIA_SC12

The fundamental problems are hard enough. We must eliminate the incidental ones.

Page 36: BillDally_NVIDIA_SC12

Parallel Roads to Exascale

Energy Efficiency: 2 GFLOPs/W -> 50 GFLOPs/W (25x)

  4 process nodes:               6 GFLOPs/W (3x)
  Integrate host:               12 GFLOPs/W (2x)
  Advanced signaling:           24 GFLOPs/W (2x)
  Efficient microarchitecture:  48 GFLOPs/W (2x)
  Optimized voltages, efficient memory, locality enhancement: (??x)

Parallel Programmability: ~10^9 threads

  Autotuning mapper
  Abstract parallelism and locality
  Fast communication, synchronization, and thread management
  Exposed storage hierarchy