
Jeremy Appleyard July 2016 PASCAL AND CUDA 8.0
Transcript
Page 1

Jeremy Appleyard

July 2016

PASCAL AND CUDA 8.0

Page 2

TESLA P100
New GPU Architecture to Enable the World’s Fastest Compute Node

• Pascal Architecture: Highest Compute Performance
• NVLink: GPU Interconnect for Maximum Scalability
• CoWoS HBM2: Unifying Compute & Memory in Single Package
• Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory Space

[Diagram: Unified Memory shared between CPU and Tesla P100]

Page 3

GIANT LEAPS IN EVERYTHING

[Four bar charts comparing K40, M40 and P100]

• PASCAL ARCHITECTURE: Teraflops (FP32/FP16), with P100 shown for both FP32 and FP16. 21 Teraflops of FP16 for Deep Learning
• NVLINK: Bidirectional BW (GB/sec). 5x GPU-GPU Bandwidth
• CoWoS HBM2 Stacked Mem: Bandwidth (GB/s). 3x Higher for Massive Data Workloads
• PAGE MIGRATION ENGINE: Addressable Memory (GB), logarithmic axis. Virtually Unlimited Memory Space

Page 4

HIGHEST ABSOLUTE PERFORMANCE DELIVERED
NVLink for Max Scalability, More than 45x Faster with 8x P100

[Bar chart: speed-up vs dual socket Haswell, axis 0x to 50x, for Caffe/Alexnet, VASP, HOOMD-Blue, COSMO, MILC, Amber and HACC. Series: 2x K80 (M40 for Alexnet), 2x P100, 4x P100, 8x P100. Chart note: 2x Broadwell CPU]

Page 5

PASCAL ARCHITECTURE

Page 6

TESLA P100 GPU: GP100

56 SMs

3584 CUDA Cores

5.3 TF Double Precision

10.6 TF Single Precision

21.2 TF Half Precision

16 GB HBM2

720 GB/s Bandwidth

Page 7

GPU PERFORMANCE COMPARISON

Metric                     | P100 | M40          | K40
Double Precision (TFlop/s) | 5.3  | 0.2          | 1.4
Single Precision (TFlop/s) | 10.6 | 7.0          | 4.3
Half Precision (TFlop/s)   | 21.2 | NA           | NA
Memory Bandwidth (GB/s)    | 720  | 288          | 288
Memory Size                | 16GB | 12GB, 24GB   | 12GB
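The comparison table can be turned into ratios with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table, while the dictionary layout and the helper names `ratio` and `flops_per_byte` are made up for this example.

```python
# Figures copied from the comparison table (TFlop/s and GB/s).
SPECS = {
    "P100": {"fp64_tflops": 5.3, "fp32_tflops": 10.6, "bandwidth_gbs": 720},
    "M40": {"fp64_tflops": 0.2, "fp32_tflops": 7.0, "bandwidth_gbs": 288},
    "K40": {"fp64_tflops": 1.4, "fp32_tflops": 4.3, "bandwidth_gbs": 288},
}


def ratio(metric, new="P100", old="K40"):
    """How many times larger `metric` is on `new` than on `old`."""
    return SPECS[new][metric] / SPECS[old][metric]


def flops_per_byte(gpu, metric="fp64_tflops"):
    """Peak arithmetic per byte of memory traffic (a rough machine balance)."""
    flops = SPECS[gpu][metric] * 1e12
    bytes_per_second = SPECS[gpu]["bandwidth_gbs"] * 1e9
    return flops / bytes_per_second


if __name__ == "__main__":
    for metric in ("fp64_tflops", "fp32_tflops", "bandwidth_gbs"):
        print(f"P100 vs K40, {metric}: {ratio(metric):.2f}x")
    for gpu in SPECS:
        print(f"{gpu}: {flops_per_byte(gpu):.2f} FP64 flop per byte")
```

The ratios are peak-to-peak only; delivered application speed-ups depend on the workload.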

Page 8

GP100 SM

Per-SM resource | GP100
CUDA Cores      | 64
Register File   | 256 KB
Shared Memory   | 64 KB
Active Threads  | 2048
Active Blocks   | 32

Page 9

[Diagram: block layouts of a Maxwell SM and a P100 SM, showing Cores, FP64 units, LD/ST, SFU, Registers, Warps and Shared Mem]

P100 SM

More resources per core

• 2x Registers
• 1.33x Shared Memory Capacity
• 2x Shared Memory Bandwidth
• 2x Warps

Higher Instruction Throughput

Page 10

IEEE 754 FLOATING POINT ON GP100
3 sizes, 3 speeds, all fast

Feature           | Half precision   | Single precision | Double precision
Layout            | s5.10            | s8.23            | s11.52
Issue rate        | pair every clock | 1 every clock    | 1 every 2 clocks
Subnormal support | Yes              | Yes              | Yes
Atomic Addition   | Yes              | Yes              | Yes
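As a rough cross-check, the issue rates above can be combined with the SM and core counts from the GP100 slide to reproduce the quoted peak throughput. The sketch below rests on two assumptions that are not in the slides: a boost clock of 1.48 GHz, and each issued operation being a fused multiply-add counted as 2 flops. All names are illustrative.

```python
SM_COUNT = 56            # from the GP100 slide
FP32_CORES_PER_SM = 64   # from the GP100 SM slide
BOOST_CLOCK_GHZ = 1.48   # ASSUMPTION: not stated in the slides
FLOPS_PER_FMA = 2        # ASSUMPTION: one multiply plus one add per issue

# Results issued per clock per CUDA core, read off the issue-rate row.
ISSUE_RATE = {"half": 2.0, "single": 1.0, "double": 0.5}


def peak_tflops(precision):
    """Peak TFlop/s for the given precision under the assumptions above."""
    cores = SM_COUNT * FP32_CORES_PER_SM
    gflops = cores * ISSUE_RATE[precision] * FLOPS_PER_FMA * BOOST_CLOCK_GHZ
    return gflops / 1000.0


if __name__ == "__main__":
    for name in ("double", "single", "half"):
        print(f"{name}: {peak_tflops(name):.1f} TFlop/s")
```

Under these assumptions the results round to the 5.3, 10.6 and 21.2 TFlop/s figures quoted in the slides.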

Page 11

HALF-PRECISION FLOATING POINT (FP16)

• 16 bits

• 1 sign bit, 5 exponent bits, 10 fraction bits

• 2^40 dynamic range

• Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15

• Subnormals at full speed: 1024 values from 2^-24 to 2^-15

• Special values

• ± Infinity, Not-a-number

[Bit layout: sign | exp. | frac.]
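The 1/5/10 bit layout can be explored on the host with Python's `struct` module, which supports IEEE 754 half precision through the `e` format character. This is a minimal sketch for illustration; the helper names are invented here and nothing in it runs on the GPU.

```python
import struct


def half_from_bits(bits):
    """Interpret a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def split_fields(bits):
    """Split a 16-bit pattern into (sign, exponent, fraction) fields."""
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits
    fraction = bits & 0x3FF          # 10 fraction bits
    return sign, exponent, fraction


smallest_subnormal = half_from_bits(0x0001)  # exponent 0, fraction 1
smallest_normal = half_from_bits(0x0400)     # exponent 1, fraction 0
largest_finite = half_from_bits(0x7BFF)      # exponent 30, fraction all ones
positive_infinity = half_from_bits(0x7C00)   # exponent 31, fraction 0

# Bit patterns between 1.0 (0x3C00) and 2.0 (0x4000): one per fraction value.
values_per_power_of_two = 0x4000 - 0x3C00

if __name__ == "__main__":
    print(smallest_subnormal, smallest_normal, largest_finite)
    print(split_fields(0x7BFF), values_per_power_of_two)
```

The smallest subnormal is 2^-24 and the largest finite value is 65504, just under 2^16, which is where the 2^40 dynamic range comes from.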

USE CASES

Deep Learning Training

Radio Astronomy

Sensor Data

Image Processing

Page 12

NVLINK

Page 13

NVLINK

P100 supports 4 NVLinks

Up to 94% bandwidth efficiency

Supports read/writes/atomics to peer GPU

Supports read/write access to NVLink-enabled CPU

Links can be ganged for higher bandwidth

[Diagram: NVLink on Tesla P100, four links at 40 GB/s each]
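The link figures above lend themselves to a quick calculation of what ganging buys. The sketch below uses the 4 links, 40 GB/s per link and 94% figures from the slide; treating the efficiency as a simple multiplier on peak bandwidth is an assumption made here, and the function names are illustrative.

```python
LINKS_PER_GPU = 4        # from the slide
LINK_BW_GBS = 40.0       # GB/s per link, from the slide
MAX_EFFICIENCY = 0.94    # "up to 94% bandwidth efficiency"


def ganged_bandwidth(num_links):
    """Peak bandwidth in GB/s when `num_links` links are ganged together."""
    if not 0 <= num_links <= LINKS_PER_GPU:
        raise ValueError("a P100 exposes at most 4 NVLinks")
    return num_links * LINK_BW_GBS


def best_case_payload(num_links):
    """ASSUMPTION: efficiency scales peak bandwidth linearly."""
    return ganged_bandwidth(num_links) * MAX_EFFICIENCY


if __name__ == "__main__":
    for n in range(1, LINKS_PER_GPU + 1):
        print(f"{n} link(s): {ganged_bandwidth(n):.0f} GB/s peak, "
              f"about {best_case_payload(n):.1f} GB/s best-case payload")
```

With all four links ganged this gives the 160 GB/s per-GPU figure used elsewhere in the deck.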

Page 14

NVLINK - GPU CLUSTER

Two fully connected quads, connected at corners

160 GB/s per GPU bidirectional to peers

Load/store access to Peer Memory

Full atomics to Peer GPUs

High speed copy engines for bulk data copy

PCIe to/from CPU
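One way to read "two fully connected quads, connected at corners" is as an 8-GPU graph in which each GPU links to the three others in its quad and to one counterpart in the other quad. The sketch below builds that graph and checks the link count per GPU and the worst-case hop count. The GPU numbering and the pairing of corners are assumptions made for illustration, not taken from the slide.

```python
from collections import deque
from itertools import combinations

QUADS = [(0, 1, 2, 3), (4, 5, 6, 7)]  # hypothetical GPU numbering
LINK_BW_GBS = 40                      # GB/s per NVLink, from the slides


def build_links(quads=QUADS):
    """Return the set of undirected GPU-to-GPU links."""
    links = set()
    for quad in quads:
        links.update(combinations(quad, 2))  # fully connect each quad
    links.update(zip(*quads))                # ASSUMPTION: corner i pairs with i + 4
    return links


def adjacency(links):
    """Map each GPU to the set of GPUs it has a direct link to."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def max_hops(adj):
    """Longest shortest path between any two GPUs, via breadth-first search."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


links = build_links()
adj = adjacency(links)
links_per_gpu = {gpu: len(peers) for gpu, peers in adj.items()}
peer_bw_gbs = {gpu: n * LINK_BW_GBS for gpu, n in links_per_gpu.items()}

if __name__ == "__main__":
    print(len(links), "links;", links_per_gpu, "; max hops:", max_hops(adj))
```

Under this reading every GPU uses all four of its links for peer traffic, which is consistent with the 160 GB/s per-GPU figure on the slide.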

Page 15

NVLINK TO CPU

Fully connected quad

120 GB/s per GPU bidirectional for peer traffic

40 GB/s per GPU bidirectional to CPU

Direct Load/store access to CPU Memory

High Speed Copy Engines for bulk data movement

Page 16

UNIFIED MEMORY

Page 17

PAGE MIGRATION ENGINE
Support Virtual Memory Demand Paging

49-bit Virtual Addresses

Sufficient to cover 48-bit CPU address + all GPU memory

GPU page faulting capability

Can handle thousands of simultaneous page faults

Up to 2 MB page size

Better TLB coverage of GPU memory
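The address-space and page-size claims above reduce to simple arithmetic, sketched below. The 49-bit, 48-bit and 2 MB figures come from the slide. Reading "16 GB" and "2 MB" as binary units, and using a 4 KiB page as the point of comparison, are assumptions made for illustration.

```python
GPU_VA_BITS = 49   # from the slide
CPU_VA_BITS = 48   # from the slide
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40

total_space_tib = 2**GPU_VA_BITS // TIB
cpu_space_tib = 2**CPU_VA_BITS // TIB
left_for_gpu_tib = total_space_tib - cpu_space_tib


def pages_needed(memory_bytes, page_bytes):
    """Number of pages (and so translations) needed to map `memory_bytes`."""
    return -(-memory_bytes // page_bytes)  # ceiling division


P100_MEMORY = 16 * GIB                                   # ASSUMPTION: binary units
large_pages = pages_needed(P100_MEMORY, 2 * MIB)
small_pages = pages_needed(P100_MEMORY, 4 * KIB)         # ASSUMPTION: 4 KiB baseline

if __name__ == "__main__":
    print(f"49-bit space: {total_space_tib} TiB, CPU part: {cpu_space_tib} TiB, "
          f"remainder: {left_for_gpu_tib} TiB")
    print(f"16 GiB needs {large_pages} pages of 2 MiB "
          f"or {small_pages} pages of 4 KiB")
```

Half of the 49-bit space covers the full 48-bit CPU range and the other half is left for GPU memory, while larger pages cut the number of translations needed by a factor of 512.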


Page 18

KEPLER/MAXWELL UNIFIED MEMORY

Performance Through Data Locality

• Migrate data to accessing processor
• Guarantee global coherency
• Still allows explicit hand tuning

Simpler Programming & Memory Model

• Single allocation, single pointer, accessible anywhere
• Eliminate need for explicit copy
• Greatly simplifies code porting

Allocate Up To GPU Memory Size

[Diagram: Unified Memory shared between Kepler GPU and CPU, CUDA 6+]

Page 19

PASCAL UNIFIED MEMORY
Large datasets, simple programming, High Performance

Allocate Beyond GPU Memory Size

Enable Large Data Models

• Oversubscribe GPU memory
• Allocate up to system memory size

Tune Unified Memory Performance

• Usage hints via cudaMemAdvise API
• Explicit prefetching API

Simpler Data Access

• CPU/GPU Data coherence
• Unified memory atomic operations

[Diagram: Unified Memory shared between Pascal GPU and CPU, CUDA 8]

Page 20

GPU OVERSUBSCRIPTION
HPGMG: high-performance multi-grid

[Chart: series for Tesla K40 (12 GB) and Tesla P100 (16 GB)]

*Tesla P100 performance figures are very early modelling results

Page 21

TESLA P100
New GPU Architecture to Enable the World’s Fastest Compute Node

More P100 Features: compute preemption, new instructions, larger L2 cache, more…

Find out more at http://devblogs.nvidia.com/parallelforall/inside-pascal

• Pascal Architecture: Highest Compute Performance
• NVLink: GPU Interconnect for Maximum Scalability
• CoWoS HBM2: Unifying Compute & Memory in Single Package
• Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory Space

[Diagram: Unified Memory shared between CPU and Tesla P100]

Page 22

CUDA 8.0

Page 23

CUDA 8

Pascal Support

• New Architecture, Stacked Memory, NVLINK

Unified Memory

• Simple Parallel Programming with large virtual memory

Libraries

• nvGRAPH: library for accelerating graph analytics apps
• FP16 computation to boost Deep Learning workloads

Developer Tools

• Critical Path Analysis to speed overall app tuning
• OpenACC profiling to optimize directive performance
• Single GPU debugging on Pascal

Page 24

GTC EUROPE 2016

• GPU Technology Conference in Europe

• Call for speakers closes on August 21st

• https://www.gputechconf.eu

Thanks for listening!

[email protected]

Amsterdam, 28–29 September

