CS 380 - GPU and GPGPU Programming
Lecture 6: GPU Architecture 5

Markus Hadwiger, KAUST

Reading Assignment #3 (until Feb. 16)

Read (required):

• Programming Massively Parallel Processors book, Chapter 1 (Introduction)

• Programming Massively Parallel Processors book, Appendix B (GPU Compute Capabilities)

• OpenGL 4.0 Shading Language Cookbook, Chapter 2

Read (optional):

• OpenGL 4.0 Shading Language Cookbook, Chapter 1

• GLSL book, Chapter 7 (OpenGL Shading Language API)

NVIDIA G80/GT200 Architecture

• Streaming Processor (SP)

• Streaming Multiprocessor (SM)

• Texture/Processing Cluster (TPC)

Courtesy AnandTech

NVIDIA G80/GT200 Architecture

• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs

• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs

• Arithmetic intensity has increased (more ALUs per texture unit: 24 SPs per TPC instead of 16)

G80 / G92 vs. GT200 (courtesy AnandTech)
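
The SP totals above follow directly from the cluster hierarchy. As a minimal sketch (the function name `total_sps` and its parameters are made up for illustration, not part of any NVIDIA tool), the arithmetic can be checked like this:

```python
def total_sps(tpcs, sms_per_tpc, sps_per_sm=8):
    """Total streaming processors = TPCs x SMs per TPC x SPs per SM."""
    return tpcs * sms_per_tpc * sps_per_sm

# Figures taken from the slide: G80/G92 has 8 TPCs with 2 SMs each,
# GT200 has 10 TPCs with 3 SMs each; every SM has 8 SPs.
g80 = total_sps(tpcs=8, sms_per_tpc=2)
gt200 = total_sps(tpcs=10, sms_per_tpc=3)

# SPs per TPC, a rough proxy for ALUs per texture block, since the
# texture hardware is attached to the TPC on these chips.
g80_sps_per_tpc = g80 // 8
gt200_sps_per_tpc = gt200 // 10

print(g80, gt200)                          # 128 240
print(g80_sps_per_tpc, gt200_sps_per_tpc)  # 16 24
```

The growth from 16 to 24 SPs per TPC is one way to read the "arithmetic intensity has increased" remark.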

Example: GeForce 8

NVIDIA Fermi / GF100 Features

Names

• Compute: Fermi; product: Tesla-20 series

• Graphics: GF100 (product: GeForce GTX 480, 580, ...)

Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x

• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf

• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf

L1 and L2 caches

More CUDA cores (up to 512)

Faster double-precision floating-point performance, faster atomics, floating-point atomics

DirectX 11 and OpenGL 4 functionality

• New shader types, scatter writes to images, ...

NVIDIA Fermi / GF100 Stats

Streaming Multiprocessor

Streaming processors are now called CUDA cores

32 CUDA cores per Fermi streaming multiprocessor (SM)

16 SMs = 512 CUDA cores

CPU-like cache hierarchy

• L1 cache / shared memory

• L2 cache

Texture units and caches are now inside each SM (in GT200 they belonged to the TPC, i.e., were shared by multiple SMs)

Dual Warp Schedulers

Graphics Processor Clusters (GPC)

(instead of TPC on GT200)

4 streaming multiprocessors (SMs) per GPC

32 CUDA cores / SM

4 SMs / GPC = 128 cores / GPC

Decentralized rasterization and geometry

• 4 raster engines

• 16 "PolyMorph" engines

NVIDIA Fermi / GF100 Structure

Full size

• 4 GPCs

• 4 SMs each

• 6 64-bit memory controllers (= 384 bit)
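
The full-size GF100 numbers quoted on the preceding slides can be tied together with a few multiplications. This is only a sketch using the figures from the slides; the constant names are chosen here for illustration:

```python
# Figures from the slides for a full-size Fermi / GF100 chip.
GPCS = 4                  # graphics processor clusters
SMS_PER_GPC = 4           # streaming multiprocessors per GPC
CORES_PER_SM = 32         # CUDA cores per SM
MEM_CONTROLLERS = 6       # memory controllers
BITS_PER_CONTROLLER = 64  # width of each controller in bits

cores_per_gpc = SMS_PER_GPC * CORES_PER_SM
total_sms = GPCS * SMS_PER_GPC
total_cores = total_sms * CORES_PER_SM
bus_width_bits = MEM_CONTROLLERS * BITS_PER_CONTROLLER

print(cores_per_gpc)   # 128
print(total_sms)       # 16
print(total_cores)     # 512
print(bus_width_bits)  # 384
```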

NVIDIA Fermi / GF100 Die

Full size

• 4 GPCs

• 4 SMs each

GF100 Graphics Pipeline

• Input Assembler

• Vertex Shader

• Hull Shader

• Tessellator

• Domain Shader

• Geometry Shader / Stream Output

• Rasterizer

• Pixel Shader

• Output Merger

Compute Capability 2.0

• 1024 threads / block

• More threads / SM (1536, up from 1024 on compute capability 1.2 / 1.3)

• 32K registers / SM

• New synchronization functions
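
The register file size and the thread limits interact: the more threads are resident on an SM, the fewer registers each one can use. A rough sketch of that budget follows, using the figures from this slide. The helper name `max_regs_per_thread` is hypothetical, and the sketch ignores the hardware's per-thread register cap and allocation granularity:

```python
REGISTERS_PER_SM = 32 * 1024   # 32K registers per SM (from the slide)
MAX_THREADS_PER_BLOCK = 1024   # block size limit (from the slide)

def max_regs_per_thread(resident_threads, registers_per_sm=REGISTERS_PER_SM):
    """Upper bound on registers per thread if all threads are resident at once."""
    return registers_per_sm // resident_threads

# One full-size block resident on the SM:
print(max_regs_per_thread(MAX_THREADS_PER_BLOCK))  # 32
# Half as many resident threads leaves twice the budget per thread:
print(max_regs_per_thread(512))                    # 64
```

A kernel that needs more registers per thread than this bound cannot keep that many threads resident, which is one reason register usage limits occupancy.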

L1 Cache vs. Shared Memory

Two different configs (on Fermi and Kepler; NOT on Maxwell!)

• 64KB total

• 16KB shared, 48KB L1 cache

• 48KB shared, 16KB L1 cache

• Set per kernel
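
Since the split is set per kernel, a natural rule is to pick the configuration with the largest L1 cache that still satisfies the kernel's shared-memory requirement. The sketch below models only the two splits listed on this slide. `pick_config` is a hypothetical helper for illustration, not a CUDA API (in CUDA itself the preference is requested through the runtime, e.g. with `cudaFuncSetCacheConfig`):

```python
KB = 1024

# The two splits of the 64KB on-chip memory named on the slide,
# ordered by increasing shared-memory size.
CONFIGS = [
    {"shared": 16 * KB, "l1": 48 * KB},
    {"shared": 48 * KB, "l1": 16 * KB},
]

def pick_config(shared_bytes_needed):
    """Return the split with the largest L1 that still fits the shared-memory need."""
    for cfg in CONFIGS:
        if shared_bytes_needed <= cfg["shared"]:
            return cfg
    raise ValueError("kernel needs more shared memory than any config offers")

small = pick_config(8 * KB)    # fits in 16KB shared, so keep 48KB of L1
large = pick_config(20 * KB)   # needs the 48KB shared configuration
print(small["l1"] // KB, large["shared"] // KB)  # 48 48
```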

Global Memory Access

Cached on Fermi

L1 cache per SM

Global L2 cache

Compile-time flag can choose:

• Caching in both L1 and L2

• Caching only in L2

Cache line size (L1, L2):

• 128 bytes
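
The 128-byte line size determines how many memory transactions a warp's access generates. As an illustrative sketch, assuming a warp of 32 threads that each read one element at a fixed byte stride (the function `cache_lines_touched` and its parameters are hypothetical, and real hardware behavior has more detail than this model), one can count the distinct cache lines touched:

```python
CACHE_LINE_BYTES = 128  # L1/L2 cache line size from the slide
WARP_SIZE = 32          # assumption: 32 threads per warp

def cache_lines_touched(base_addr, stride_bytes, elem_bytes=4, threads=WARP_SIZE):
    """Count distinct 128-byte lines hit when thread t reads elem_bytes
    starting at base_addr + t * stride_bytes."""
    lines = set()
    for t in range(threads):
        first = base_addr + t * stride_bytes
        last = first + elem_bytes - 1
        for line in range(first // CACHE_LINE_BYTES, last // CACHE_LINE_BYTES + 1):
            lines.add(line)
    return len(lines)

print(cache_lines_touched(0, 4))    # 1: contiguous, aligned 4-byte reads
print(cache_lines_touched(4, 4))    # 2: same pattern, shifted off alignment
print(cache_lines_touched(0, 128))  # 32: one line per thread
```

Contiguous, aligned accesses fill exactly one line per warp, while a 128-byte stride pulls in a full line for every 4 bytes actually used.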


CUDA 6.5


CUDA 7.0

NVIDIA Kepler Architecture

Three different versions

• Compute capability 3.0 (GK104)

– GeForce GTX 680, …

– Quadro K5000

– Tesla K10

• Compute capability 3.5 (GK110)

– GeForce GTX 780 / Titan / Titan Black

– Quadro K6000

– Tesla K20, Tesla K40

• Compute capability 3.7 (GK210)

– Tesla K80

– Very new (~end of 2014)


GK104 SMX

• 192 CUDA cores

• 32 LD/ST units

• 32 SFUs

• 16 texture units


GK110 SMX

• 192 CUDA cores

• 64 DP units

• 32 LD/ST units

• 32 SFUs

• 16 texture units

New read-only data cache (48KB)


NVIDIA Kepler / GK104 Structure

Full size

• 4 GPCs

• 2 SMXs each

= 8 SMXs, 1536 CUDA cores

NVIDIA Kepler / GK110 Structure (1)

Full size

• 15 SMXs (Titan Black; Titan: 14)

• 2880 CUDA cores (Titan Black; Titan: 2688)

• 5 GPCs of 3 SMXs each

NVIDIA Kepler / GK110 Structure (2)

Titan (not Black)

• 14 SMXs

• 2688 CUDA cores

• 5 GPCs with 3 SMXs or 2 SMXs each
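
The Kepler core counts on the last few slides all come from 192 CUDA cores per SMX. The sketch below recomputes them and also works out how 14 SMXs can be distributed over 5 GPCs when each GPC holds either 3 or 2 SMXs. The helper names `cuda_cores` and `gpc_split` are invented for this example:

```python
CORES_PER_SMX = 192  # CUDA cores per Kepler SMX (from the slides)

def cuda_cores(smx_count):
    """Total CUDA cores for a chip with the given number of active SMXs."""
    return smx_count * CORES_PER_SMX

def gpc_split(total_smx, gpcs=5, full=3, reduced=2):
    """Return (GPCs with `full` SMXs, GPCs with `reduced` SMXs), or None
    if no such split adds up to total_smx."""
    for n_full in range(gpcs + 1):
        n_reduced = gpcs - n_full
        if n_full * full + n_reduced * reduced == total_smx:
            return n_full, n_reduced
    return None

print(cuda_cores(8))   # 1536 (GK104: 4 GPCs x 2 SMXs)
print(cuda_cores(15))  # 2880 (GK110, Titan Black)
print(cuda_cores(14))  # 2688 (GK110, Titan)
print(gpc_split(15))   # (5, 0): every GPC has 3 SMXs
print(gpc_split(14))   # (4, 1): one GPC has only 2 SMXs
```

Under these assumptions, the Titan's 14 SMXs correspond to four full GPCs and one GPC with a single SMX disabled.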

Compute Capabilities 2.0 – 3.5


Maxwell (GM) Architecture

Multiprocessor: SMM

4 partitions inside the SMM

• 32 CUDA cores each

• 128 CUDA cores in total

• Each has its own warp scheduler, dispatch units, register file

Shared memory and L1 cache are now separate!

• L1 cache is combined with the texture cache

• Shared memory is its own space



First gen.

GM107

(GTX 750 Ti)

5 SMMs

(640 CUDA cores in total)



Second gen.

GM204

(GTX 980)

16 SMMs

(2048 CUDA cores in total)

4 GPCs of 4 SMMs
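
The Maxwell totals follow the same pattern as before, now with 4 partitions of 32 cores per SMM. A small sketch using only the numbers from these slides (constant names are illustrative):

```python
PARTITIONS_PER_SMM = 4    # partitions inside each SMM
CORES_PER_PARTITION = 32  # CUDA cores per partition

cores_per_smm = PARTITIONS_PER_SMM * CORES_PER_PARTITION

# First generation: GM107 (GTX 750 Ti) with 5 SMMs.
gm107_cores = 5 * cores_per_smm

# Second generation: GM204 (GTX 980) with 4 GPCs of 4 SMMs each.
gm204_smms = 4 * 4
gm204_cores = gm204_smms * cores_per_smm

print(cores_per_smm)  # 128
print(gm107_cores)    # 640
print(gm204_smms)     # 16
print(gm204_cores)    # 2048
```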


Maxwell (GM) vs. Kepler (GK) Architecture

GK107 vs. GM107


Maxwell (GM) vs. Kepler (GK) Architecture

GK107 vs. GM204


Compute Capability 5.x (Part 1)

Maxwell

• GM107: 5.0

• GM204: 5.2

Compute Capability 5.x (Part 2)

Maxwell

• GM107: 5.0

• GM204: 5.2

Thank you.
