
Jeremy Appleyard

July 2016

PASCAL AND CUDA 8.0

TESLA P100
New GPU Architecture to Enable the World’s Fastest Compute Node

Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
CoWoS HBM2: Unifying Compute & Memory in Single Package
Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory Space

[Diagram: Unified Memory shared by CPU and Tesla P100]


GIANT LEAPS IN EVERYTHING

[Figure: four bar charts comparing K40, M40 and P100]

PASCAL ARCHITECTURE: Teraflops (FP32/FP16), with P100 shown for both FP32 and FP16. 21 Teraflops of FP16 for Deep Learning

NVLINK: Bi-directional BW (GB/Sec). 5x GPU-GPU Bandwidth

CoWoS HBM2 Stacked Mem: Bandwidth (GB/s). 3x Higher for Massive Data Workloads

PAGE MIGRATION ENGINE: Addressable Memory (GB). Virtually Unlimited Memory Space

HIGHEST ABSOLUTE PERFORMANCE DELIVERED
NVLink for Max Scalability, More than 45x Faster with 8x P100

[Chart: Speed-up vs Dual Socket Haswell (0x to 50x) for Caffe/Alexnet, VASP, HOOMD-Blue, COSMO, MILC, Amber and HACC. Series: 2x K80 (M40 for Alexnet), 2x P100, 4x P100, 8x P100. Chart label: 2x Broadwell CPU]


PASCAL ARCHITECTURE


TESLA P100 GPU: GP100

56 SMs

3584 CUDA Cores

5.3 TF Double Precision

10.6 TF Single Precision

21.2 TF Half Precision

16 GB HBM2

720 GB/s Bandwidth
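The headline throughput figures can be reproduced with simple arithmetic. The sketch below is a hedged back-of-the-envelope check, not an official formula: the SM and core counts come from the slide, while the 1.48 GHz boost clock and the convention of counting a fused multiply-add as 2 FLOPs are assumptions that do not appear in this deck. The FP64 and FP16 rates are taken as half and twice the FP32 rate, matching the issue rates listed in the floating-point table later in the deck.

```python
# Peak-throughput arithmetic for GP100.
# From the slide: 56 SMs, 64 CUDA cores per SM.
# Assumptions (not in the deck): 1.48 GHz boost clock, 2 FLOPs per FMA.
SMS = 56
FP32_CORES_PER_SM = 64
BOOST_CLOCK_GHZ = 1.48
FLOPS_PER_FMA = 2


def peak_tflops(lanes_per_sm, ops_per_lane=1):
    """Peak TFLOP/s given the lanes per SM and values processed per lane."""
    gflops = SMS * lanes_per_sm * ops_per_lane * FLOPS_PER_FMA * BOOST_CLOCK_GHZ
    return gflops / 1000.0


cuda_cores = SMS * FP32_CORES_PER_SM
fp32 = peak_tflops(FP32_CORES_PER_SM)                  # 1 issue every clock
fp64 = peak_tflops(FP32_CORES_PER_SM // 2)             # 1 issue every 2 clocks
fp16 = peak_tflops(FP32_CORES_PER_SM, ops_per_lane=2)  # a pair every clock

print(f"CUDA cores: {cuda_cores}")
print(f"FP64 {fp64:.1f} TF, FP32 {fp32:.1f} TF, FP16 {fp16:.1f} TF")
```

Under these assumptions the results round to the 5.3, 10.6 and 21.2 TF quoted on the slide.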

GPU PERFORMANCE COMPARISON

                              P100     M40             K40
Double Precision (TFLOP/s)    5.3      0.2             1.4
Single Precision (TFLOP/s)    10.6     7.0             4.3
Half Precision (TFLOP/s)      21.2     NA              NA
Memory Bandwidth (GB/s)       720      288             288
Memory Size                   16 GB    12 GB, 24 GB    12 GB

GP100 SM

                  GP100
CUDA Cores        64
Register File     256 KB
Shared Memory     64 KB
Active Threads    2048
Active Blocks     32

P100 SM

[Diagram: Maxwell SM compared with P100 SM. Each shows blocks of Cores, FP64 units, LD/ST, SFU, Registers and Warps, plus Shared Mem]

More resources per core

• 2x Registers
• 1.33x Shared Memory Capacity
• 2x Shared Memory Bandwidth
• 2x Warps

Higher Instruction Throughput
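The per-core ratios above can be checked from SM-level totals. This is a sketch under stated assumptions: the GP100 numbers come from the GP100 SM slide (with 2048 active threads taken as 64 warps of 32 threads), whereas the Maxwell numbers (128 cores, 256 KB registers, 96 KB shared memory, 64 warps per SM) are not in this deck and are assumed from NVIDIA's published Maxwell specifications. Shared memory bandwidth is left out because the deck gives no figures for it.

```python
# Per-core SM resources, GP100 vs Maxwell.
# GP100 values are from the slide; Maxwell values are assumptions.
gp100 = {"cores": 64, "register_kb": 256, "shared_kb": 64, "warps": 2048 // 32}
maxwell = {"cores": 128, "register_kb": 256, "shared_kb": 96, "warps": 64}


def per_core(sm, key):
    """Amount of a resource available to each CUDA core in one SM."""
    return sm[key] / sm["cores"]


ratios = {
    key: per_core(gp100, key) / per_core(maxwell, key)
    for key in ("register_kb", "shared_kb", "warps")
}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x per core")
```

The SM totals barely change between the two designs; the gain comes from halving the number of cores that share them.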

IEEE 754 FLOATING POINT ON GP100
3 sizes, 3 speeds, all fast

Feature              Half precision      Single precision    Double precision
Layout               s5.10               s8.23               s11.52
Issue rate           pair every clock    1 every clock       1 every 2 clocks
Subnormal support    Yes                 Yes                 Yes
Atomic Addition      Yes                 Yes                 Yes


HALF-PRECISION FLOATING POINT (FP16)

• 16 bits

• 1 sign bit, 5 exponent bits, 10 fraction bits

• 2^40 dynamic range

• Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15

• Subnormals at full speed: 1024 values from 2^-24 to 2^-15

• Special values

• +- Infinity, Not-a-number

Bit layout: s | exp | frac
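The layout described above can be made concrete by decoding a 16-bit pattern by hand. The following is a minimal sketch of IEEE 754 binary16 decoding; the function name `decode_half` is hypothetical, and the cross-check uses the `"e"` (half-precision) format of Python's standard `struct` module.

```python
import math
import struct


def decode_half(bits):
    """Decode a 16-bit pattern as IEEE 754 half precision (s5.10)."""
    sign = (bits >> 15) & 0x1
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0x1F:                       # all-ones exponent: Inf or NaN
        value = float("nan") if frac else float("inf")
    elif exp == 0:                        # subnormal: (frac / 1024) * 2^-14
        value = frac * 2.0 ** -24
    else:                                 # normalized: implicit leading 1, bias 15
        value = (1 + frac / 1024) * 2.0 ** (exp - 15)
    return -value if sign else value


smallest_subnormal = decode_half(0x0001)  # 2^-24
smallest_normal = decode_half(0x0400)     # 2^-14
largest = decode_half(0x7BFF)             # (2 - 2^-10) * 2^15 = 65504

# Cross-check one value against the standard library's half-precision format.
reference = struct.unpack("<e", struct.pack("<H", 0x7BFF))[0]

print(smallest_subnormal, smallest_normal, largest, reference)
print("dynamic range is about 2^%d" % round(math.log2(largest / smallest_subnormal)))
```

The ratio of the largest finite value to the smallest subnormal is just under 2^40, which is where the dynamic range figure comes from.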

USE CASES

Deep Learning Training

Radio Astronomy

Sensor Data

Image Processing


NVLINK


P100 supports 4 NVLinks

Up to 94% bandwidth efficiency

Supports reads/writes/atomics to peer GPU

Supports read/write access to NVLink-enabled CPU

Links can be ganged for higher bandwidth

[Diagram: NVLink on Tesla P100, four links at 40 GB/s each]


NVLINK - GPU CLUSTER

Two fully connected quads, connected at corners

160 GB/s per GPU bidirectional to peers

Load/store access to Peer Memory

Full atomics to Peer GPUs

High speed copy engines for bulk data copy

PCIe to/from CPU


NVLINK TO CPU

Fully connected quad

120 GB/s per GPU bidirectional for peer traffic

40 GB/s per GPU bidirectional to CPU

Direct Load/store access to CPU Memory

High Speed Copy Engines for bulk data movement
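The per-GPU bandwidth figures for both topologies follow from how the four links are divided. The sketch below assumes each link contributes 40 GB/s bidirectional, as on the NVLink slide, and that links simply add when ganged; the helper `split_bandwidth` is a hypothetical name used only for illustration.

```python
# NVLink bandwidth budget per P100, using figures from the slides.
LINKS_PER_GPU = 4
LINK_BW_GBS = 40        # bidirectional GB/s per link
EFFICIENCY_PCT = 94     # "up to" figure from the slide


def split_bandwidth(links_to_peers, links_to_cpu):
    """Bidirectional GB/s to peer GPUs and to the CPU for a given link split."""
    if links_to_peers + links_to_cpu > LINKS_PER_GPU:
        raise ValueError("a P100 has only 4 NVLinks")
    return {
        "peer": links_to_peers * LINK_BW_GBS,
        "cpu": links_to_cpu * LINK_BW_GBS,
    }


gpu_cluster = split_bandwidth(links_to_peers=4, links_to_cpu=0)  # CPU reached over PCIe
cpu_quad = split_bandwidth(links_to_peers=3, links_to_cpu=1)     # NVLink-enabled CPU
best_case_per_link = LINK_BW_GBS * EFFICIENCY_PCT / 100

print(gpu_cluster, cpu_quad)
print(f"best-case payload per link: {best_case_per_link:.1f} GB/s")
```

In the NVLink-to-CPU configuration, one link per GPU is given up for CPU traffic, which is why peer bandwidth drops from 160 GB/s to 120 GB/s.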


UNIFIED MEMORY


PAGE MIGRATION ENGINE
Support Virtual Memory Demand Paging

49-bit Virtual Addresses

Sufficient to cover 48-bit CPU address + all GPU memory

GPU page faulting capability

Can handle thousands of simultaneous page faults

Up to 2 MB page size

Better TLB coverage of GPU memory
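The address-width and page-size claims can be worked through numerically. This sketch uses the 49-bit and 48-bit widths and the 2 MB page size from the slide; the 4 KB comparison page size and the reading of "16 GB" as 16 GiB are assumptions added for illustration.

```python
# Address-space and page-count arithmetic for the Page Migration Engine.
TIB = 2 ** 40
GIB = 2 ** 30
MIB = 2 ** 20
KIB = 2 ** 10

gpu_virtual_space = 2 ** 49          # 49-bit virtual addresses
cpu_virtual_space = 2 ** 48          # 48-bit CPU address space
left_for_gpu_memory = gpu_virtual_space - cpu_virtual_space


def pages_needed(memory_bytes, page_bytes):
    """Number of pages (and so translations) needed to map a memory region."""
    return -(-memory_bytes // page_bytes)   # ceiling division


small_pages = pages_needed(16 * GIB, 4 * KIB)   # assumed 4 KB comparison size
large_pages = pages_needed(16 * GIB, 2 * MIB)   # 2 MB pages from the slide

print(f"GPU virtual space: {gpu_virtual_space // TIB} TiB")
print(f"left after mapping the CPU space: {left_for_gpu_memory // TIB} TiB")
print(f"pages to map 16 GiB: {small_pages} at 4 KB, {large_pages} at 2 MB")
```

With 512 times fewer pages to track, each TLB entry covers far more memory, which is the coverage benefit the slide refers to.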


KEPLER/MAXWELL UNIFIED MEMORY

Simpler Programming & Memory Model

• Single allocation, single pointer, accessible anywhere

• Eliminate need for explicit copy

• Greatly simplifies code porting

Performance Through Data Locality

• Migrate data to accessing processor

• Guarantee global coherency

• Still allows explicit hand tuning

Allocate Up To GPU Memory Size

[Diagram: Kepler GPU and CPU sharing Unified Memory, CUDA 6+]

PASCAL UNIFIED MEMORY
Large datasets, simple programming, High Performance

Allocate Beyond GPU Memory Size

Enable Large Data Models

• Oversubscribe GPU memory

• Allocate up to system memory size

Tune Unified Memory Performance

• Usage hints via cudaMemAdvise API

• Explicit prefetching API

Simpler Data Access

• CPU/GPU Data coherence

• Unified memory atomic operations

[Diagram: Pascal GPU and CPU sharing Unified Memory, CUDA 8]
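To illustrate why oversubscription works but can still benefit from hints and prefetching, here is a toy demand-paging model. It is not how the CUDA driver behaves: the least-recently-used eviction policy, the page counts and the function name `count_faults` are all assumptions chosen to keep the sketch small.

```python
from collections import OrderedDict


def count_faults(accesses, gpu_pages):
    """Count page faults for an access sequence with LRU eviction (toy model)."""
    resident = OrderedDict()            # pages currently in GPU memory
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
            continue
        faults += 1                     # page must be migrated in on demand
        if len(resident) >= gpu_pages:
            resident.popitem(last=False)  # evict the least recently used page
        resident[page] = True
    return faults


# Three sweeps over a data set that fits in GPU memory: faults only on first touch.
fits = count_faults(list(range(8)) * 3, gpu_pages=8)

# Three sweeps over a data set 1.5x larger than GPU memory: it still runs,
# but under LRU this cyclic pattern faults on every access.
oversubscribed = count_faults(list(range(12)) * 3, gpu_pages=8)

print(fits, oversubscribed)
```

In this model the oversubscribed run completes correctly but pays for a migration on every access, which suggests why access-pattern hints and explicit prefetching are offered alongside oversubscription.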

GPU OVERSUBSCRIPTION
HPGMG: high-performance multi-grid

[Chart: Tesla K40 (12 GB) compared with Tesla P100 (16 GB)]

*Tesla P100 performance figures are very early modelling results

TESLA P100
New GPU Architecture to Enable the World’s Fastest Compute Node

More P100 Features: compute preemption, new instructions, larger L2 cache, more…

Find out more at http://devblogs.nvidia.com/parallelforall/inside-pascal



CUDA 8.0

CUDA 8

Pascal Support

• New Architecture, Stacked Memory, NVLINK

Unified Memory

• Simple Parallel Programming with large virtual memory

Libraries

• nvGRAPH: library for accelerating graph analytics apps

• FP16 computation to boost Deep Learning workloads

Developer Tools

• Critical Path Analysis to speed overall app tuning

• OpenACC profiling to optimize directive performance

• Single GPU debugging on Pascal

GTC EUROPE 2016

• GPU Technology Conference in Europe

• Amsterdam, 28–29 September

• Call for speakers closes on August 21st

• https://www.gputechconf.eu

Thanks for listening!
