Parallel programming: Introduction to GPU architecture
Sylvain Collange, Inria Rennes – Bretagne Atlantique
GPU internals
What makes a GPU tick?
NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering!
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
The free lunch era... was yesterday
1980's to 2002: Moore's law, Dennard scaling, micro-architecture improvements
Exponential performance increase
Software compatibility preserved
Do not rewrite software, buy a new machine!
Hennessy, Patterson. Computer Architecture: A Quantitative Approach. 4th ed., 2006.
Computer architecture crash course
How does a processor work?
Or rather, how it worked in the 1980s to 1990s: modern processors are much more complicated!
An attempt to sum up 30 years of research in 15 minutes
Machine language: instruction set
Registers
State for computations
Keeps variables and temporaries
Instructions
Perform computations on registers, move data between registers and memory, branch…
Instruction word
Binary representation of an instruction
Assembly language
Readable form of machine language
Examples
Which instruction set does your laptop/desktop run?
Your cell phone?
Example: the instruction word 01100111 encodes ADD R1, R3; the registers are R0, R1, R2, R3, ... R31.
The Von Neumann processor
Let's look at it step by step
[Diagram: fetch unit with PC (+1), decoder, register file, arithmetic and logic unit, load/store unit, branch unit, state machine, memory, and result bus]
Step by step: Fetch
The processor maintains a Program Counter (PC)
Fetch: read the instruction word pointed to by PC in memory
[Diagram: the fetch unit reads instruction word 01100111 from memory at the address in PC]
Decode
Split the instruction word to understand what it represents
Which operation? → ADD
Which operands? → R1, R3
[Diagram: the decoder splits instruction word 01100111 into operation ADD and operands R1, R3]
Read operands
Get the value of registers R1, R3 from the register file
[Diagram: the register file returns the values 42 and 17 for operands R1, R3]
Execute operation
Compute the result: 42 + 17
[Diagram: the arithmetic and logic unit computes 42 + 17 = 59]
Write back
Write the result back to the register file
[Diagram: the result 59 is written back to register R1 over the result bus]
Increment PC
[Diagram: PC is incremented (+1) to point to the next instruction]
Load or store instruction
Can read and write memory from a computed address
[Diagram: the load/store unit sits between the register file and memory]
Branch instruction
Instead of incrementing PC, set it to a computed value
[Diagram: the branch unit can overwrite PC with a computed target]
What about the state machine?
The state machine controls everybody
Sequences the successive steps
Sends signals to units depending on the current state
At every clock tick, switches to the next state
The clock is a periodic signal (as fast as possible)
[State diagram: fetch-decode → read operands → execute and write back → increment PC]
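To make the fetch, decode, execute, write back, increment-PC cycle concrete, here is a minimal software sketch of such a machine (not from the original slides; the 3-byte instruction encoding and opcodes are invented for illustration):

    #include <cstdint>
    #include <cstdio>

    // Toy Von Neumann machine: 32 registers, instructions stored in memory.
    // Invented encoding: opcode byte, destination register, source register.
    enum Opcode : uint8_t { ADD = 0, MUL = 1, HALT = 2 };

    int main() {
        uint8_t memory[] = { ADD, 1, 3,    // R1 <- R1 + R3
                             MUL, 2, 3,    // R2 <- R2 * R3
                             HALT, 0, 0 };
        int32_t reg[32] = {0};
        reg[1] = 42; reg[2] = 2; reg[3] = 17;

        unsigned pc = 0;                          // program counter
        for (;;) {
            // Fetch: read the instruction word pointed to by PC
            uint8_t op = memory[pc], rd = memory[pc + 1], rs = memory[pc + 2];
            if (op == HALT) break;
            // Decode, read operands, execute, write back
            if (op == ADD) reg[rd] = reg[rd] + reg[rs];
            else           reg[rd] = reg[rd] * reg[rs];
            pc += 3;                              // increment PC
        }
        printf("R1 = %d\n", reg[1]);              // prints R1 = 59
        return 0;
    }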
Recap
We can build a real processor
As it was in the early 1980's
How did processors become faster?
Reason 1: faster clock
Progress in semiconductor technology allows higher frequencies (frequency scaling)
But this is not enough!
Outline recap. Next: Exploiting instruction-level parallelism
Going faster using ILP: pipeline
Idea: we do not have to wait until instruction n has finished to start instruction n+1
Like a factory assembly line
Or the bandejão (the university cafeteria line)
Pipelined processor
Independent instructions can follow each other
Exploits ILP to hide instruction latency
Program:
  1: add r1, r3
  2: mul r2, r3
  3: load r3, [r1]
[Diagram, three successive cycles: instruction 1 (add) advances through the Fetch, Decode, Execute, Writeback stages, followed one cycle later by instruction 2 (mul), then instruction 3 (load)]
Superscalar execution
Multiple execution units in parallel
Independent instructions can execute at the same time
Exploits ILP to increase throughput
[Diagram: a single Fetch stage feeding several parallel Decode, Execute, Writeback pipelines]
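As an aside (example code, not from the original slides), here is what the difference looks like at the source level; a superscalar core can overlap the three independent operations, but must serialize the dependent chain:

    // Dependent chain: each line needs the previous result,
    // so the pipeline stalls between them (little ILP).
    int chain(int b, int c, int e, int g) {
        int a = b + c;
        int d = a * e;      // waits for a
        return d - g;       // waits for d
    }

    // Independent operations: no dependencies between the three lines,
    // so a superscalar core can issue them in the same cycle (lots of ILP).
    int independent(int b, int c, int e, int h, int g, int k) {
        int x = b + c;
        int y = e * h;
        int z = g - k;
        return x + y + z;
    }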
Locality
Time to access main memory: ~200 clock cycles
One memory access every few instructions
Are we doomed?
Fortunately: principle of locality
~90% of memory accesses on ~10% of data
Accessed locations are often the same
Temporal locality: access the same location at different times
Spatial locality: access locations close to each other
Caches
Large memories are slower than small memories
The computer theorists lied to you: in the real world, access in an array of size n costs O(log n), not O(1)!
Think about looking up a book in a small or huge library
Idea: put frequently-accessed data in a small, fast memory
Can be applied recursively: hierarchy with multiple levels of cache
L1 cache: capacity 64 KB, access time 2 ns
L2 cache: 1 MB, 10 ns
L3 cache: 8 MB, 30 ns
Memory: 8 GB, 60 ns
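As a sketch of how much this matters in practice (example code, not from the original slides): both functions below compute the same sum over an n×n row-major matrix, but the first walks memory with stride 1 (spatial locality, mostly cache hits), while the second walks with stride n and misses far more often for large n:

    #include <vector>

    float sum_row_order(const std::vector<float>& m, int n) {
        float s = 0.f;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                s += m[i * n + j];   // stride 1: cache-friendly
        return s;
    }

    float sum_col_order(const std::vector<float>& m, int n) {
        float s = 0.f;
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                s += m[i * n + j];   // stride n: cache-hostile
        return s;
    }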
Branch prediction
What if we have a branch?
We do not know the next PC to fetch from until the branch executes
Solution 1: wait until the branch is resolved
Problem: programs have 1 branch every 5 instructions on average
We would spend most of our time waiting
Solution 2: predict (guess) the most likely direction
If correct, we have bought some time
If wrong, just go back and start over
Modern CPUs can correctly predict over 95% of branches
World record holder: 1.691 mispredictions / 1000 instructions
General concept: speculation
P. Michaud and A. Seznec. "Pushing the branch predictability limits with the multi-poTAGE+SC predictor." JWAC-4: Championship Branch Prediction, 2014.
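A classic way to feel the cost of mispredictions (illustrative sketch, not from the original slides): the loop below runs much faster when v is sorted, because the branch then follows a long regular pattern that the predictor gets almost always right, while on shuffled data it mispredicts about half the time:

    #include <vector>

    long count_large(const std::vector<int>& v) {
        long count = 0;
        for (int x : v)
            if (x >= 128)   // the branch being predicted
                ++count;
        return count;
    }
    // Same data, same instruction count: sorting v first typically makes
    // this loop several times faster, purely from better branch prediction.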
Example CPU: Intel Core i7 Haswell
Up to 192 instructions in flight
Maybe 48 predicted branches ahead
Up to 8 instructions/cycle executed out of order
About 25 pipeline stages at ~4 GHz
Quiz: how far does light travel during the 0.25 ns of a clock cycle?
Too complex to explain in 1 slide, or even 1 lecture
David Kanter. Intel's Haswell CPU architecture. RealWorldTech, 2012. http://www.realworldtech.com/haswell-cpu/
Recap
Many techniques to run sequential programs as fast as possible
Discovers and exploits parallelism between instructions
Speculates to remove dependencies
Works on existing binary programs, without rewriting or re-compiling
Upgrading hardware is cheaper than improving software
Extremely complex machine
Outline recap. Next: GPU, many-core: why, what for?
Technology evolution
Memory wall
Memory speed does not increase as fast as computing speed
More and more difficult to hide memory latency
Power wall
Power consumption of transistors does not decrease as fast as density increases
Performance is now limited by power consumption
ILP wall
Law of diminishing returns on instruction-level parallelism
Pollack's rule: cost ≃ performance²
[Graphs: the widening gap between compute and memory performance over time; transistor density, per-transistor power, and total power over time; serial performance vs. cost]
Usage changes
New applications demand parallel processing
Computer games: 3D graphics
Search engines, social networks… “big data” processing
New computing devices are power-constrained
Laptops, cell phones, tablets…
Small, light, battery-powered
Datacenters
High power supply and cooling costs
Latency vs. throughput
Latency: time to solution
CPUs: minimize time, at the expense of power
Throughput: quantity of tasks processed per unit of time
GPUs: assume unlimited parallelism, minimize energy per operation
Amdahl's law
Bounds the speedup attainable on a parallel machine
S = 1 / ((1 − P) + P / N)
where S is the speedup, P the ratio of parallel portions, and N the number of processors
[Graph: speedup S vs. N; total time splits into time to run sequential portions and time to run parallel portions]
G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.
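A quick worked example of the formula (illustrative code, not from the original slides): with P = 90% of the work parallel, even infinitely many processors cannot do better than 10×:

    #include <cstdio>

    // Amdahl's law: S = 1 / ((1 - P) + P / N)
    double amdahl(double P, double N) {
        return 1.0 / ((1.0 - P) + P / N);
    }

    int main() {
        printf("P=0.9, N=8:   S = %.2f\n", amdahl(0.9, 8));    // ~4.71
        printf("P=0.9, N=1e9: S = %.2f\n", amdahl(0.9, 1e9));  // ~10.00
        return 0;
    }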
Why heterogeneous architectures?
S = 1 / ((1 − P) + P / N)
Latency-optimized multi-core (CPU)
Low efficiency on parallel portions: spends too many resources
Throughput-optimized multi-core (GPU)
Low performance on sequential portions
Heterogeneous multi-core (CPU+GPU)
Use the right tool for the right job
Allows aggressive optimization for latency or for throughput
[Diagram: time to run sequential portions vs. time to run parallel portions for each design]
M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
Example: System on Chip for smartphone
Big cores for applications
Small cores for background activity
GPU
Special-purpose accelerators
Lots of interfaces
Outline recap. Next: From graphics to general purpose
The (simplest) graphics rendering pipeline
Vertices → Vertex shader → Primitives (triangles…) → Clipping, Rasterization, Attribute interpolation → Fragments → Fragment shader (reads Textures) → Z-Compare, Blending (uses the Z-Buffer) → Pixels → Framebuffer
Vertex and fragment shaders are programmable stages; clipping/rasterization and Z-compare/blending are parametrizable stages
How much performance do we need
… to run 3DMark 11 at 50 frames/second?

Element        Per frame   Per second
Vertices       12.0M       600M
Primitives     12.6M       630M
Fragments      180M        9.0G
Instructions   14.4G       720G

Intel Core i7 2700K: 56 Ginsn/s peak
We need to go 13× faster
Make a special-purpose accelerator
Source: Damien Triolet, Hardware.fr
Beginnings of GPGPU
[Timeline, 2000 to 2010. Microsoft DirectX: 7.x, 8.0, 8.1, 9.0, 9.0a/b/c, 10.0, 10.1, 11. NVIDIA: NV10, NV20, NV30, NV40, G70, G80-G90, GT200, GF100. ATI/AMD: R100, R200, R300, R400, R500, R600, R700, Evergreen. Milestones: programmable shaders, dynamic control flow, FP 16, FP 24, FP 32, FP 64, unified shaders, SIMT, CTM, CAL, CUDA, growing GPGPU traction]
Today: what do we need GPUs for?
1. 3D graphics rendering for games
Complex texture mapping, lighting computations…
2. Computer Aided Design workstations
Complex geometry
3. GPGPU
Complex synchronization, data movements
One chip to rule them all
Find the common denominator
Outline recap. Next: Forms of parallelism, how to exploit them
Little's law: data in flight = throughput × latency
[Chart: throughput (GB/s) vs. latency (ns, from ~3 to ~350) for the cache/memory hierarchies (L1, L2, DRAM) of an Intel Core i7 920 and an NVIDIA GeForce GTX 580; the GTX 580's DRAM (~350 ns, ~190 GB/s) requires far more data in flight than any level of the CPU hierarchy]
J. Little. A proof for the queuing formula L = λW. Operations Research, 1961.
Hiding memory latency with pipelining
Memory throughput: 190 GB/s; memory latency: 350 ns
At 1 GHz: 190 bytes/cycle, 350 cycles to wait
Data in flight = throughput × latency = 66,500 bytes
[Diagram: a continuous stream of memory requests over time; issuing one request per cycle keeps ~65 KB in flight]
Consequence: more parallelism
GPU vs. CPU
8× more parallelism to feed more units (throughput)
8× more parallelism to hide longer latency
64× more total parallelism
How to find this parallelism?
[Diagram: request streams replicated ×8 in space (more units) and stretched ×8 in time (longer latency)]
Sources of parallelism
ILP: Instruction-Level Parallelism
Between independent instructions in a sequential program, e.g.:
  add r3 ← r1, r2
  mul r0 ← r0, r1   (independent of the add: can execute in parallel)
  sub r1 ← r3, r0
TLP: Thread-Level Parallelism
Between independent execution contexts (threads), e.g. Thread 1 runs add while Thread 2 runs mul, in parallel
DLP: Data-Level Parallelism
Between elements of a vector: same operation on several elements, e.g. vadd r ← a, b computes r1 = a1+b1, r2 = a2+b2, r3 = a3+b3 in parallel
Example: X ← a×X
In-place scalar-vector product: X ← a×X
Sequential (ILP):  for i = 0 to n-1: X[i] ← a * X[i]
Threads (TLP):     launch n threads, each computing X[tid] ← a * X[tid]
Vector (DLP):      X ← a * X
Or any combination of the above (see the CUDA sketch below)
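For the TLP variant, a minimal CUDA sketch (kernel name and launch configuration are illustrative): each of the n threads handles one element.

    #include <cuda_runtime.h>

    // One thread per element: thread tid computes X[tid] <- a * X[tid]
    __global__ void scal(float a, float* X, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)                   // guard the tail of the array
            X[tid] = a * X[tid];
    }

    void scal_on_gpu(float a, float* devX, int n) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scal<<<blocks, threadsPerBlock>>>(a, devX, n);
    }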
Uses of parallelism
“Horizontal” parallelism for throughput: more units working in parallel
“Vertical” parallelism for latency hiding: pipelining keeps units busy when waiting for dependencies, memory
[Diagram: units A, B, C, D side by side along the throughput axis; along the latency axis, instructions A, B, C, D overlap across cycles 1 to 4]
How to extract parallelism?

       Horizontal          Vertical
ILP    Superscalar         Pipelined
TLP    Multi-core, SMT     Interleaved / switch-on-event multithreading
DLP    SIMD / SIMT         Vector / temporal SIMT

We have seen the first row: ILP
We will now review techniques for the next rows: TLP, DLP
Outline recap. Next: Let's design a GPU!
Sequential processor
Focuses on instruction-level parallelism
Exploits ILP vertically (pipelining) and horizontally (superscalar)
Source code:
  for i = 0 to n-1: X[i] ← a * X[i]
Machine code:
    move i ← 0
  loop:
    load t ← X[i]
    mul t ← a × t
    store X[i] ← t
    add i ← i+1
    branch i<n? loop
[Diagram: a sequential CPU (Fetch, Decode, Execute, Memory) working through mul, store X[17], add i ← 18]
The incremental approach: multi-core
Several processors on a single chip sharing one memory space
Area: benefits from Moore's law
Power: extra cores consume little when not in use
e.g. Intel Turbo Boost
[Die photo: Intel Sandy Bridge. Source: Intel]
Homogeneous multi-core
Horizontal use of thread-level parallelism
Improves peak throughput
[Diagram: two identical cores (Fetch, Decode, Execute, Load/Store unit) sharing memory; thread T0 runs mul, store X[17], add i ← 18 on one core while thread T1 runs mul, store X[49], add i ← 50 on the other]
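On the host side, the same X ← a×X example maps onto a homogeneous multi-core as one chunk per thread (a sketch, assuming num_threads matches the core count):

    #include <algorithm>
    #include <thread>
    #include <vector>

    void scal_chunk(float a, float* X, int begin, int end) {
        for (int i = begin; i < end; ++i)
            X[i] = a * X[i];
    }

    // Horizontal TLP: split the array into one contiguous chunk per core
    void scal_multicore(float a, float* X, int n, int num_threads) {
        std::vector<std::thread> workers;
        int chunk = (n + num_threads - 1) / num_threads;
        for (int t = 0; t < num_threads; ++t) {
            int begin = t * chunk;
            int end = std::min(n, begin + chunk);
            if (begin < end)
                workers.emplace_back(scal_chunk, a, X, begin, end);
        }
        for (auto& w : workers) w.join();   // wait for all cores
    }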
Example: Tilera Tile-GX
Grid of (up to) 72 tiles
Each tile: 3-way VLIW processor, 5 pipeline stages, 1.2 GHz
[Diagram: 9×8 grid of tiles, from Tile (1,1) to Tile (9,8)]
Interleaved multi-threading
Vertical use of thread-level parallelism
Hides latency thanks to explicit parallelism; improves achieved throughput
[Diagram: one pipeline (Fetch, Decode, Execute, load-store unit, Memory) interleaving instructions from threads T0 to T3 cycle by cycle: mul, add i ← 50, load X[89], store X[72], load X[17], store X[49], add i ← 73, …]
Example: Oracle Sparc T5
16 cores / chip
Core: out-of-order superscalar, 8 threads
15 pipeline stages, 3.6 GHz
[Diagram: cores 1 to 16, each running threads 1 to 8]
Clustered multi-core
For each individual unit, select between
Horizontal replication
Vertical time-multiplexing
Examples
Sun UltraSparc T2, T3
AMD Bulldozer
IBM Power 7
Area-efficient tradeoff
Blurs boundaries between cores
[Diagram: threads T0 to T3 share replicated fetch/decode units and time-multiplexed execute and load/store units, grouped into Cluster 1 and Cluster 2]
Implicit SIMD
In NVIDIA-speak
SIMT: Single Instruction, Multiple Threads
Convoy of synchronized threads: warp
Extracts DLP from multi-threaded applications
Factorization of the fetch/decode and load-store units:
Fetch 1 instruction on behalf of several threads
Read 1 memory location and broadcast to several registers
[Diagram: threads T0 to T3 share a single Fetch/Decode stage; four execute units run (0) mul, (1) mul, (2) mul, (3) mul in lockstep; the (0-3) load and (0-3) store are factorized into one memory access]
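From the programmer's point of view, SIMT looks like plain multi-threading; an illustrative CUDA sketch (not from the original slides):

    // Threads 0..31 of a block form a warp sharing one instruction stream.
    // A uniform path costs one fetched instruction for 32 threads; a
    // data-dependent branch that splits the warp serializes both paths.
    __global__ void simt_example(const float* in, float* out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float x = 2.0f * in[tid];   // uniform: full SIMT efficiency
        if (x < 0.0f)               // divergent if the warp disagrees:
            x = -x;                 // the two paths run one after the other
        out[tid] = x;
    }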
Explicit SIMD
Single Instruction, Multiple Data
Horizontal use of data-level parallelism
Examples
Intel MIC (16-wide)
AMD GCN GPU (16-wide × 4-deep)
Most general-purpose CPUs (4-wide to 8-wide)
Machine code:
  loop:
    vload T ← X[i]
    vmul T ← a × T
    vstore X[i] ← T
    add i ← i+4
    branch i<n? loop
[Diagram: a SIMD CPU (Fetch, Decode, Execute, Memory) working through vmul, vstore X[16..19], add i ← 20]
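For comparison, the same loop written with explicit SIMD on the CPU side (a sketch using 8-wide AVX intrinsics; assumes AVX support and that n is a multiple of 8):

    #include <immintrin.h>

    void scal_avx(float a, float* X, int n) {
        __m256 va = _mm256_set1_ps(a);            // broadcast a to 8 lanes
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(X + i);   // vload T <- X[i..i+7]
            vx = _mm256_mul_ps(vx, va);           // vmul  T <- a * T
            _mm256_storeu_ps(X + i, vx);          // vstore X[i..i+7] <- T
        }
    }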
Quiz: link the words
Parallelism
ILP
TLP
DLP
Use
Horizontal: more throughput
Vertical: hide latency
Architectures
Superscalar processor
Homogeneous multi-core
Multi-threaded core
Clustered multi-core
Implicit SIMD
Explicit SIMD
(The taxonomy table a few slides ahead gives the answers.)
Outline recap. Next: Putting it all together
Hierarchical combination
Both CPUs and GPUs combine these techniques
Multiple cores
Multiple threads/core
SIMD units
Example CPU: Intel Core i7
Is a wide superscalar, but also has
Multicore: 4 CPU cores
Multi-threading per core: Simultaneous Multi-Threading, 2 threads
SIMD units: 256-bit AVX
Up to 117 operations/cycle from 8 threads
Example GPU: NVIDIA GeForce GTX 580
SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM
Up to 512 operations per cycle from 24,576 threads in flight
[Diagram: SM1 through SM16; within an SM, cores 1-16 and cores 17-32 time-multiplex warps 1 to 48 over time]
Taxonomy of parallel architectures

       Horizontal           Vertical
ILP    Superscalar / VLIW   Pipelined
TLP    Multi-core, SMT      Interleaved / switch-on-event multithreading
DLP    SIMD / SIMT          Vector / temporal SIMT
Classification: multi-core
[Chart plotting horizontal vs. vertical use of ILP, TLP, and DLP for Oracle Sparc T5 (16 cores × 8 threads), Intel Haswell (4 cores × 2-way Hyperthreading, SIMD (AVX)), IBM Power 8 (12 cores × 8 threads), and Fujitsu SPARC64 X (16 cores × 2 threads)]
General-purpose multi-cores: balance ILP, TLP and DLP
Sparc T: focus on TLP
Classification: GPU and many small-core
[Chart plotting horizontal vs. vertical use of ILP, TLP, and DLP for Intel MIC (60 cores × 4 threads, 16-wide SIMD), Nvidia Kepler (16×4 cores×units, 32 warps, 32-wide SIMT), AMD GCN (20×4 cores×units, 40 wavefronts, 16-wide SIMD), Kalray MPPA-256 (17×16 cores, 5-way VLIW), and Tilera Tile-GX (72 cores, 3-way VLIW)]
GPU: focus on DLP and TLP, horizontal and vertical
Many small-core: focus on horizontal TLP
Takeaway
All processors use hardware mechanisms to turn parallelism into performance
GPUs focus on Thread-level and Data-level parallelism
Outline recap. Next: Architecture of current GPUs: cores, memory
Computation cost vs. memory cost
Power measurements on NVIDIA GT200

Operation                        Energy/op (nJ)   Total power (W)
Instruction control              1.8              18
Multiply-add on a 32-wide warp   3.6              36
Load 128B from DRAM              80               90

With the same amount of energy:
Load 1 word from external memory (DRAM), or
Compute 44 flops
Must optimize memory accesses first!
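Concretely, on the GPU this means making each warp's accesses coalesce into as few DRAM transactions as possible; an illustrative CUDA sketch (kernel names invented):

    // Coalesced: consecutive threads touch consecutive addresses, so the
    // 32 loads of a warp merge into a few wide DRAM transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = in[tid];
    }

    // Strided: consecutive threads touch addresses 'stride' apart, so each
    // load pays for a whole DRAM burst and wastes most of the bytes fetched.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = in[(long long)tid * stride % n];
    }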
External memory: discrete GPU
Classical CPU-GPU model
Split memory spaces
Highest bandwidth from GPU memory
Transfers to main memory are slower
[Diagram: CPU with 8 GB main memory (26 GB/s), connected through PCI Express (16 GB/s) to a GPU with 3 GB graphics memory (290 GB/s). Example: Intel Core i7 4770, Nvidia GeForce GTX 780]
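In CUDA terms, the split memory spaces show up as explicit allocations and transfers (a sketch; error checking omitted):

    #include <cuda_runtime.h>

    void run_on_discrete_gpu(const float* hostX, float* hostY, size_t n) {
        float* devX;
        cudaMalloc(&devX, n * sizeof(float));
        // Host -> device crosses PCI Express: the slow path, do it rarely
        cudaMemcpy(devX, hostX, n * sizeof(float), cudaMemcpyHostToDevice);
        // ... launch kernels that reuse devX from fast graphics memory ...
        cudaMemcpy(hostY, devX, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(devX);
    }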
External memory: embedded GPU
Most GPUs today
Same memory
May support memory coherence
GPU can read directly from CPU caches
More contention on external memory
[Diagram: CPU and GPU on one chip, sharing a cache and 8 GB of main memory at 26 GB/s]
GPU: on-chip memory
Conventional wisdom: cache area in CPU vs. GPU, according to the NVIDIA CUDA Programming Guide
But... if we include registers, total on-chip memory (register files + caches):
  NVIDIA GM204 GPU: 8.3 MB
  AMD Hawaii GPU: 15.8 MB
  Intel Core i7 CPU: 9.3 MB
GPU/accelerator internal memory exceeds desktop CPUs
Registers: CPU vs. GPU
Registers keep the contents of local variables
Typical values:

                     CPU       GPU
Registers/thread     32        32
Registers/core       256       65536
Read / Write ports   10R/5W    2R/1W

GPU: many more registers, but made of simpler memory
Internal memory: GPU
Cache hierarchy
Keep frequently-accessed data
Reduce throughput demand on main memory
Managed by hardware (L1, L2) or software (shared memory)
[Diagram: each core has a private L1 cache; a crossbar connects the cores to L2 cache slices and then to external memory (290 GB/s); aggregate on-chip bandwidth ~2 TB/s, on-chip capacities on the order of 1 to 6 MB]
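The software-managed level is exposed in CUDA as shared memory; a minimal sketch (assumes the block size is 256 and n is a multiple of it):

    // Each block stages a tile of its input in fast on-chip memory,
    // then reuses it: here, reversing the tile within the block.
    __global__ void block_reverse(const float* in, float* out, int n) {
        __shared__ float tile[256];        // on-chip, per-block storage
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];         // one global-memory read each
        __syncthreads();                   // wait until the tile is loaded
        out[blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x)]
            = tile[threadIdx.x];           // served from on-chip memory
    }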
Caches: CPU vs. GPU
On CPU, caches are designed to avoid memory latency
Throughput reduction is a side effect
On GPU, multi-threading deals with memory latency
Caches are used to improve throughput (and energy)

              CPU                   GPU
Latency       Caches, prefetching   Multi-threading
Throughput                          Caches
GPU: thousands of cores?

NVIDIA GPUs   G80/G92 (2006)  GT200 (2008)  GF100 (2010)  GK104 (2012)  GK110 (2012)  GM204 (2014)
Exec. units   128             240           512           1536          2688          2048
SMs           16              30            16            8             14            16

AMD GPUs      R600 (2007)  R700 (2008)  Evergreen (2009)  NI (2010)  SI (2012)  VI (2013)
Exec. units   320          800          1600              1536       2048       2560
SIMD-CUs      4            10           20                24         32         40

Computational resources grow, but the number of clients in the interconnection network (cores) stays limited
Takeaway
Result of many tradeoffs
Between locality and parallelism
Between core complexity and interconnect complexity
GPU optimized for throughput
Exploits primarily DLP, TLP
Energy-efficient on parallel applications with regular behavior
CPU optimized for latency
Exploits primarily ILP
Can use TLP and DLP when available
Next time
Next Tuesday, 1:00pm, room 2014: CUDA
Execution model
Programming model
API
Thursday, 1:00pm, room 2011: lab work (what is my GPU and when should I use it?)
There may be available seats even if you are not enrolled