How a GPU Works - Kayvon Fatahalian


Kayvon Fatahalian

15-462 (Fall 2011)

How a GPU Works


Today

1. Review: the graphics pipeline

2. History: a few old GPUs

3. How a modern GPU works (and why it is so fast)

4. Closer look at a real GPU design

Part 1:

The graphics pipeline


Vertex processing

[Figure: six vertices, v0 through v5]

Vertices are transformed into screen space.

Each vertex is transformed independently.
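A minimal Python sketch of what "independently" means here. The 4x4 row-major matrix, the translation values, and the helper names `transform_vertex` and `transform_all` are assumptions made for illustration; none of them come from the slides.

```python
# Hypothetical helpers; the slides do not define an API.

def transform_vertex(m, v):
    """Multiply a 4x4 row-major matrix by a homogeneous vertex (x, y, z, w)."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

def transform_all(m, vertices):
    # Each call reads only its own vertex, so the iterations are independent.
    return [transform_vertex(m, v) for v in vertices]

# Assumed example transform: translate by (2, 3, 0).
TRANSLATE = [
    [1, 0, 0, 2],
    [0, 1, 0, 3],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

vertices = [(0, 0, 0, 1), (1, 0, 0, 1), (0, 1, 0, 1)]
forward = transform_all(TRANSLATE, vertices)
# Processing the vertices in reverse order gives the same per-vertex results.
backward = transform_all(TRANSLATE, vertices[::-1])[::-1]
```

Because no vertex depends on another, the work could be done in any order or all at once, which is the property the rest of the lecture builds on.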

Primitive processing

[Figure: vertices v0 through v5 grouped into primitives]

Then organized into primitives that are clipped and culled.

Rasterization

Primitives are rasterized into pixel fragments.
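One common way to turn a triangle into fragments is to test each pixel center against the triangle's three edges. The sketch below assumes that approach, a counter-clockwise triangle in a y-up coordinate system, and sampling at pixel centers; the slides do not say which rasterization method is used, and the names `edge` and `rasterize` are hypothetical.

```python
def edge(a, b, p):
    """Signed area test: positive when p lies to the left of the edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centers fall inside a counter-clockwise triangle."""
    fragments = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            if edge(v0, v1, p) >= 0 and edge(v1, v2, p) >= 0 and edge(v2, v0, p) >= 0:
                fragments.append((x, y))
    return fragments

# A right triangle covering the lower-left half of a 4x4 pixel grid.
frags = rasterize((0, 0), (4, 0), (0, 4), 4, 4)
```

Each pixel test is independent of the others, so fragment generation parallelizes in the same way vertex processing does.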

Fragment processing

Fragments are shaded to compute a color at each pixel.

Pixel operations

Fragments are blended into the frame buffer at their pixel locations (z-buffer determines visibility).
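A small sketch of the z-buffer visibility test, under two assumptions that are not stated on the slide: smaller z means closer to the camera, and fragments are opaque, so the winner overwrites the pixel instead of blending. The function name `resolve` is hypothetical.

```python
def resolve(fragments, width, height, far=float("inf")):
    """Keep the nearest fragment color at each pixel (smaller z = closer)."""
    depth = [[far] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:      # depth test against the z-buffer
            depth[y][x] = z
            color[y][x] = c      # opaque overwrite; a real pipeline may blend here
    return color

# Three fragments land on pixel (0, 0); only the nearest one survives.
frame = resolve(
    [(0, 0, 0.8, "red"), (0, 0, 0.3, "blue"), (1, 0, 0.5, "green"), (0, 0, 0.6, "white")],
    2, 1,
)
```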

Pipeline entities

[Figure: vertices v0 through v5 shown at successive pipeline stages. Labels: Vertices, Primitives, Fragments]

Graphics pipeline

[Pipeline diagram]

Stages: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation

Streams: vertex stream, primitive stream, fragment stream

Memory buffers: vertex data buffers, textures

Entities: vertices, primitives

Part 2:

Graphics architecture


What's so important about independent computations?

Silicon Graphics RealityEngine (1

[Pipeline diagram: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation]

graphics supercomputer

Pre-1999 PC 3D graphics accelerator

3dfx Voodoo

NVIDIA RIVA TNT

[Pipeline diagram: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation. Labels: CPU, Clip/cull/rasterize, Tex, Tex]

GPU* circa 1999

[Pipeline diagram: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation. Labels: CPU, GPU]

Direct3D 10 programmability: 20

[Pipeline diagram: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation]

NVIDIA GeForce 8800

[Block diagram labels: Core (x16), Tex, Pixel op (x6), Clip/Cull/Rast, Scheduler]

Part 3:

How a shader core works


GPUs are fast

Intel Core i7 Quad Core

~100 GFLOPS peak, 730 million transistors

AMD Radeon HD

~2.7 TFLOPS peak, 2.2 billion transistors

A diffuse reflectance shader

Shader programming model:

Fragments are processed independently, but there is no explicit parallel programming.

Independent logical sequence of control per fragment.

[Figure: Big Guy, lookin' diffuse]
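The shader source itself did not survive in this transcript. As a rough sketch only, a diffuse reflectance shader scales a surface color by the clamped dot product of the light direction and the surface normal. In the Python below, the surface color `kd` is passed in directly (an assumption: a real shader would typically fetch it from a texture), and all function names are hypothetical.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diffuse_shader(kd, normal, light_dir):
    """Shade one fragment: scale its surface color kd by clamp(N . L, 0, 1)."""
    intensity = clamp(dot(light_dir, normal), 0.0, 1.0)
    return tuple(k * intensity for k in kd) + (1.0,)  # RGBA, alpha fixed at 1

# Surface facing the light, then facing directly away from it.
lit = diffuse_shader((1.0, 0.5, 0.25), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
away = diffuse_shader((1.0, 0.5, 0.25), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
```

The function describes one fragment with no mention of parallelism, which matches the programming model above: the hardware decides how many copies run at once.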

Compile shader

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

Execute shader

[Figure: the compiled diffuseShader listing above, repeated on each slide]

CPU-style cores

[Core diagram: Fetch/Decode, ALU (Execute), Execution Context, Out-of-order control logic, Fancy branch predictor, Memory pre-fetcher, Data cache (a big one)]

Slimming down

[Core diagram: Fetch/Decode, ALU (Execute), Execution Context]

Idea #1: Remove components that help a single instruction stream run fast.

Two cores (two fragments in parallel)

[Figure: two cores, each with Fetch/Decode, ALU (Execute), and Execution Context, each running the compiled shader's instruction stream]

Four cores (four fragments in parallel)

[Figure: four cores, each with Fetch/Decode, ALU (Execute), and Execution Context]

Sixteen cores (sixteen fragments in parallel)

16 cores = 16 simultaneous instruction streams

Instruction stream sharing

But ... many fragments should be able to share an instruction stream!

[Figure: the compiled shader listing, the same instruction stream for every fragment]

[Core diagram: Fetch/Decode, ALU (Execute), Execution Context]

Add ALUs

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs.

SIMD processing

[Core diagram: one Fetch/Decode unit, ALU 1 to ALU 4, two rows of four Ctx blocks, Shared Ctx Data]

Modifying the shader

[Core diagram: Fetch/Decode, two rows of four Ctx blocks, Shared Ctx Data]

[Figure: the compiled shader listing from the earlier slides]

New compiled shader:

Processes eight fragments using vector ops on vector registers

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov o3, l(1.0)

[Figure: fragments 1 through 8 mapped onto the eight ALUs of one core]
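To make the VEC8_ idea concrete, here is a Python sketch in which each helper stands in for one vector instruction applied to eight lanes at once. The helper names (`vec_mul`, `vec_madd`, `vec_clmp`, `vec8_n_dot_l`) and the use of Python lists as vector registers are illustrative assumptions, not the actual instruction set.

```python
WIDTH = 8  # lanes per vector register, matching the VEC8_ prefix

def vec_mul(a, b):
    return [x * y for x, y in zip(a, b)]

def vec_madd(a, b, c):
    return [x * y + z for x, y, z in zip(a, b, c)]

def vec_clmp(a, lo, hi):
    return [max(lo, min(hi, x)) for x in a]

def vec8_n_dot_l(nx, ny, nz, light):
    """One instruction stream computes clamp(N . L, 0, 1) for eight fragments."""
    lx, ly, lz = light                     # scalar constants shared by all lanes
    r3 = vec_mul(nx, [lx] * WIDTH)         # plays the role of VEC8_mul
    r3 = vec_madd(ny, [ly] * WIDTH, r3)    # plays the role of VEC8_madd
    r3 = vec_madd(nz, [lz] * WIDTH, r3)    # plays the role of VEC8_madd
    return vec_clmp(r3, 0.0, 1.0)          # plays the role of VEC8_clmp

# Eight fragments with different normals, one light pointing along +z.
nz = [1.0, 0.5, 0.0, -1.0, 1.0, 0.25, 0.75, -0.5]
out = vec8_n_dot_l([0.0] * WIDTH, [0.0] * WIDTH, nz, (0.0, 0.0, 1.0))
```

Four instructions are fetched and decoded once, yet eight fragments' worth of arithmetic is done, which is the amortization Idea #2 describes.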

16 cores = 128 ALUs, 16 simultaneous instruction streams

128 [vertices / fragments / primitives / OpenCL work items / CUDA threads] in parallel

But what about branches?

[Figure: fragments 1 through 8 on ALU 1 through ALU 8, plotted against time (clocks)]

Coherent execution*** (admittedly fuzzy definition): when processing of different entities is similar, and thus can share resources for efficient execution

- Instruction stream coherence: different fragments follow same sequence of logic

- Memory access coherence:

  - Different fragments access similar data (avoid memory transactions by reusing data in cache)

  - Different fragments simultaneously access contiguous data (enables efficient, bulk granularity memory transactions)

Divergence: lack of coherence

- Usually used in reference to instruction streams (divergent execution does not make full use of SIMD processing)

*** Do not confuse this use of term "coherence" with cache coherence protocols
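A sketch of how a branch can cost SIMD lanes their usefulness. It assumes the hardware handles an if/else by running each side in turn with a per-lane mask, and that both sides cost one pass each; the slides do not spell out the mechanism, and `run_branch` is a hypothetical name.

```python
def run_branch(xs):
    """Execute an if/else over SIMD lanes with a mask; return results and ALU utilization."""
    mask = [x > 0 for x in xs]
    out = [None] * len(xs)
    passes = 0
    if any(mask):                       # "then" side: only the masked lanes are active
        passes += 1
        for i, x in enumerate(xs):
            if mask[i]:
                out[i] = x * 2
    if not all(mask):                   # "else" side: only the remaining lanes are active
        passes += 1
        for i, x in enumerate(xs):
            if not mask[i]:
                out[i] = 0
    utilization = len(xs) / (passes * len(xs))  # useful lane-slots / issued lane-slots
    return out, utilization

coherent = run_branch([1, 2, 3, 4, 5, 6, 7, 8])        # all lanes take the same side
divergent = run_branch([1, -2, 3, -4, 5, -6, 7, -8])   # lanes split across both sides
```

When every fragment in the group takes the same side, one pass suffices and all lanes do useful work; when they split, both sides are issued and half of the lane-slots are wasted in this example.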

GPUs share instruction streams across many fragments

In modern GPUs: 16 to 64 fragments share an instruction stream.

Stalls!

Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.

Recall: diffuse reflectance shader

Texture access: latency of 100s of cycles

Recall: CPU-style core

[Core diagram: Fetch/Decode, ALU, Execution Context, OOO exec logic, Branch predictor, Data cache (a big one: several MB)]

CPU-style memory hierarchy

CPU cores run efficiently when data is resident in cache (caches reduce latency, provide high bandwidth).

[Diagram: processing core (several cores per chip) with Fetch/Decode, ALU, Execution contexts, OOO exec logic, Branch predictor, L1 cache (32 KB), and L2 cache (256 KB); L3 cache (8 MB) shared across cores]

Stalls!

Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.

Texture access latency = 100s to 1000s of cycles

We've removed the fancy caches and logic that helps avoid stalls.

But we have LOTS of independent fragments. (Way more fragments to process than ALUs.)

Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high latency operations.

Hiding shader stalls

[Figure: timeline in clocks for one core with eight ALUs holding four groups of fragments, numbered 1 to 4: Frag 1-8, Frag 9-16, Frag 17-24, Frag 25-32. When a group stalls, the core switches to a runnable group; each group alternates between Stall and Runnable from Start until it is Done.]

Throughput!

Increase run time of one group to increase throughput of many groups.
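The effect can be sketched with a toy scheduler. The model below is an assumption made for illustration: each group alternates a fixed compute phase with a fixed stall, the core always picks the first runnable group, and switching is free. The name `simulate` and the cycle counts are hypothetical.

```python
def simulate(num_groups, alu_cycles, stall_cycles, rounds):
    """Return (total_clocks, busy_clocks) for one core interleaving num_groups groups.

    Each group repeats `rounds` times: alu_cycles of compute, then a stall of
    stall_cycles (for example a texture fetch) during which it cannot run.
    """
    remaining = [rounds] * num_groups      # compute phases left per group
    ready_at = [0] * num_groups            # clock at which each group is runnable again
    clock = busy = 0
    while any(remaining):
        runnable = [g for g in range(num_groups) if remaining[g] and ready_at[g] <= clock]
        if runnable:
            g = runnable[0]
            clock += alu_cycles            # run this group's compute phase
            busy += alu_cycles
            remaining[g] -= 1
            ready_at[g] = clock + stall_cycles
        else:
            clock += 1                     # every group is stalled: the ALUs sit idle
    return clock, busy

one = simulate(1, 10, 30, 4)    # a single group: the core idles during every stall
four = simulate(4, 10, 30, 4)   # four groups: other groups fill the stall time
```

With one group the ALUs are busy for 40 of 130 clocks; with four groups they are busy for all 160 clocks while doing four times the work.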

Storing contexts

[Core diagram: Fetch/Decode, ALUs, and a pool of context storage (128 KB)]

Eighteen small contexts (maximal latency hiding ability)

Twelve medium contexts

Four large contexts (low latency hiding ability)
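The trade-off is simple division: a fixed pool split among contexts. The sketch below assumes the pool is divided evenly and that a context's size is measured in whole bytes; the slide only gives the pool size and the three context counts, so the per-context figures are derived, not quoted.

```python
POOL_BYTES = 128 * 1024  # context storage pool size from the slide

def max_contexts(bytes_per_context):
    """How many execution contexts fit in the fixed pool."""
    return POOL_BYTES // bytes_per_context

def per_context_budget(num_contexts):
    """Largest whole-byte context size that lets num_contexts fit in the pool."""
    return POOL_BYTES // num_contexts

# Budgets for the three configurations shown on the slides.
budgets = {n: per_context_budget(n) for n in (18, 12, 4)}
```

A shader that needs more state per fragment group leaves room for fewer groups, and therefore fewer alternatives to switch to during a stall.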

[Core diagram: Fetch/Decode with four contexts, numbered 1 to 4]

My chip!

16 cores

8 mul-add ALUs per core (128 total)

16 simultaneous instruction streams

64 concurrent (but interleaved) instruction streams

512 concurrent fragments

= 256 GFLOPs (@ 1 GHz)

My enthusiast chip!

32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
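The peak figures for both hypothetical chips follow from counting a mul-add as two floating-point operations per ALU per clock. A short check, with `peak_gflops` as an illustrative helper name:

```python
def peak_gflops(cores, alus_per_core, clock_ghz, flops_per_alu_per_clock=2):
    """Peak throughput; a mul-add counts as 2 floating-point operations."""
    return cores * alus_per_core * flops_per_alu_per_clock * clock_ghz

my_chip = peak_gflops(16, 8, 1.0)        # 16 cores x 8 ALUs
enthusiast = peak_gflops(32, 16, 1.0)    # 32 cores x 16 ALUs
concurrent_fragments = 64 * 8            # 64 interleaved instruction streams x 8 fragments each
```

The enthusiast chip works out to 1024 GFLOPS, which the slide rounds to 1 TFLOP.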

Summary: three key ideas for high-throughput execution

1. Use many slimmed down cores, run them in parallel

2. Pack cores full of ALUs (by sharing instruction stream overhead across groups of fragments)

- Option 1: Explicit SIMD vector instructions

- Option 2: Implicit sharing managed by hardware

3. Avoid latency stalls by interleaving execution of many groups of fragments

- When one group stalls, work on another group

Putting the three ideas into practice:

A closer look at a real GPU

NVIDIA GeForce GTX 480

NVIDIA GeForce GTX 480 (Fermi)

NVIDIA-speak:

- 480 stream processors (CUDA cores)

- SIMT execution

Generic speak:

- 15 cores

- 2 groups of 16 SIMD functional units per core

NVIDIA GeForce GTX 480 core

[Core diagram legend: each box = SIMD function unit, control shared across 16 units (1 MUL-ADD per clock). Also shown: Fetch/Decode, Execution contexts (128 KB), Shared scratchpad memory (16+48 KB)]

Groups of 32 fragments share an instruction stream

Up to 48 groups are simultaneously interleaved

Up to 1536 individual contexts can be stored

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

[Core diagram: two Fetch/Decode units, SIMD function units, Execution contexts (128 KB), scratchpad memory (16+48 KB)]

The core contains 32 functional units

Two groups are selected each clock (decode, fetch, and execute two instruction streams in parallel)



NVIDIA GeForce GTX 480 SM

[SM diagram legend: each box = CUDA core. Also shown: two Fetch/Decode units, Execution contexts (128 KB), scratchpad memory (16+48 KB)]

The SM contains 32 CUDA cores

Two warps are selected each clock (decode, fetch, and execute two warps in parallel)

Up to 48 warps are interleaved, totaling 1536 CUDA threads



NVIDIA GeForce GTX 480

There are 15 of these things on the GTX 480.

That's 23,000 fragments! (or 23,000 CUDA threads!)
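The 23,000 figure can be checked from the per-core numbers given earlier in the deck (32 fragments per group, up to 48 interleaved groups per core, 15 cores). The constant names below are illustrative:

```python
SMS = 15                 # cores ("SMs") on the chip
WARP_SIZE = 32           # fragments / CUDA threads sharing one instruction stream
WARPS_PER_SM = 48        # groups interleaved per core

threads_per_sm = WARP_SIZE * WARPS_PER_SM
threads_on_chip = SMS * threads_per_sm
cuda_cores = SMS * 2 * 16   # 2 groups of 16 SIMD functional units per core
```

This gives 1536 contexts per core, 23,040 across the chip, and the 480 "CUDA cores" of the product name.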

Looking Forward

Current and future: GPU architectures

Bigger and faster (more cores, more FLOPS)

- 2 TFLOPs today, and counting

Addition of (select) CPU-like features

- More traditional caches

Tight integration with CPUs (CPU+GPU hybrids)

- See AMD Fusion

What fixed-function hardware should remain?

Recent trends

Support for alternative programming interfaces

- Accelerate non-graphics applications using GPU (CUDA, OpenCL)

How does graphics pipeline abstraction change to enable more advanced real-time graphics?

- Direct3D 11 adds three new pipeline stages

Global illumination algorithms

Credit: Bratincevic

Credit: NVIDIA

Ray tracing: for accurate reflections, shadows

Credit: Ingo Wald

Alternative shading structures (e.g., deferred shading)

For more efficient scaling to many lights (1000 lights, [Andersson 09])

Simulation

Cinematic scene complexity

Motion blur

Thanks!

Relevant CMU Courses for students interested in high performance graphics:

15-869: Graphics and Imaging Architectures (my special topics course)

15-668: Advanced Parallel Graphics (Treuille)

15-418: Parallel Architecture and Programming (spring semester)
