Date post: 05-Oct-2020
CS 380 - GPU and GPGPU Programming Lecture 6: GPU Architecture 5 Markus Hadwiger, KAUST
Page 1: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32

CS 380 - GPU and GPGPU Programming Lecture 6: GPU Architecture 5

Markus Hadwiger, KAUST

Reading Assignment #3 (until Feb. 16)

Read (required):

• Programming Massively Parallel Processors book, Chapter 1 (Introduction)

• Programming Massively Parallel Processors book, Appendix B (GPU Compute Capabilities)

• OpenGL 4.0 Shading Language Cookbook, Chapter 2

Read (optional):

• OpenGL 4.0 Shading Language Cookbook, Chapter 1

• GLSL book, Chapter 7 (OpenGL Shading Language API)

NVIDIA G80/GT200 Architecture

• Streaming Processor (SP)

• Streaming Multiprocessor (SM)

• Texture/Processing Cluster (TPC)

Courtesy AnandTech

NVIDIA G80/GT200 Architecture

• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs

• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs

• Arithmetic intensity has increased (ALUs vs. texture units)

Figure panels: G80 / G92, GT200. Courtesy AnandTech
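
The SP totals above follow directly from the TPC / SM / SP hierarchy. A minimal sketch of that arithmetic, where the helper name `total_sps` is an illustrative assumption and not part of the lecture:

```python
# Illustrative helper (hypothetical name): total streaming processors (SPs)
# from the TPC -> SM -> SP hierarchy given on the slide.
def total_sps(tpcs: int, sms_per_tpc: int, sps_per_sm: int = 8) -> int:
    """Return TPCs * (SMs per TPC * SPs per SM)."""
    return tpcs * sms_per_tpc * sps_per_sm


g80_sps = total_sps(tpcs=8, sms_per_tpc=2)     # G80 / G92
gt200_sps = total_sps(tpcs=10, sms_per_tpc=3)  # GT200

print(g80_sps, gt200_sps)  # 128 240
```

Going from 2 to 3 SMs per TPC is what raises the ratio of ALUs to texture units, since the texture hardware is attached to the TPC.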

Example: GeForce 8

NVIDIA Fermi / GF100 Features

Names

• Compute: Fermi; product: Tesla-20 series

• Graphics: GF100 (product: GeForce GTX 480, 580, ...)

Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x

• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf

• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf

L1 and L2 caches

More CUDA cores (up to 512)

Faster double precision float performance, faster atomics, float atomics

DirectX 11 and OpenGL 4 functionality

• New shader types, scatter writes to images, ...

NVIDIA Fermi / GF100 Stats

Streaming Multiprocessor

Streaming processors are now CUDA cores

32 CUDA cores per Fermi streaming multiprocessor (SM)

16 SMs = 512 CUDA cores

CPU-like cache hierarchy

• L1 cache / shared memory

• L2 cache

Texture units and caches now in SM (instead of with TPC = multiple SMs in GT200)

Dual Warp Schedulers

Graphics Processing Clusters (GPC)

(instead of TPC on GT200)

4 streaming multiprocessors (SMs) / GPC

32 CUDA cores / SM

4 SMs / GPC = 128 cores / GPC

Decentralized rasterization and geometry

• 4 raster engines

• 16 "PolyMorph" engines

NVIDIA Fermi / GF100 Structure

Full size

• 4 GPCs

• 4 SMs each

• 6 64-bit memory controllers (= 384 bit)
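
The full-size GF100 numbers can be cross-checked with a small model. This is only a sketch: the `ChipLayout` class and its field names are assumptions made for illustration, while the values come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipLayout:
    """Hypothetical container for the per-chip counts quoted on the slides."""
    gpcs: int
    sms_per_gpc: int
    cores_per_sm: int
    memory_controllers: int
    bits_per_controller: int

    @property
    def sms(self) -> int:
        return self.gpcs * self.sms_per_gpc

    @property
    def cuda_cores(self) -> int:
        return self.sms * self.cores_per_sm

    @property
    def memory_bus_bits(self) -> int:
        return self.memory_controllers * self.bits_per_controller


# Full-size Fermi / GF100 as described above.
gf100 = ChipLayout(gpcs=4, sms_per_gpc=4, cores_per_sm=32,
                   memory_controllers=6, bits_per_controller=64)

print(gf100.sms, gf100.cuda_cores, gf100.memory_bus_bits)  # 16 512 384
```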

NVIDIA Fermi / GF100 Die

Full size

• 4 GPCs

• 4 SMs each

GF100 Graphics Pipeline

Pipeline stages (Direct3D 11 naming), in pipeline order:

• Input Assembler

• Vertex Shader

• Hull Shader

• Tessellator

• Domain Shader

• Geometry Shader (with Stream Output)

• Rasterizer

• Pixel Shader

• Output Merger

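Compute Capability 2.0

• 1024 threads / block

• More threads / SM (1536 resident threads, up from 1024 on compute capability 1.2/1.3)

• 32K 32-bit registers / SM

• New synchronization functions (__syncthreads_count(), __syncthreads_and(), __syncthreads_or())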
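
These per-SM limits bound how many thread blocks can be resident at once. The following is a rough sketch under stated assumptions: the slide gives 1024 threads per block and 32K registers per SM, while the 1536 resident threads and 8 resident blocks per SM are compute capability 2.x values taken from the CUDA programming guide, not from the slide. It ignores shared memory usage and register allocation granularity, so real occupancy can be lower.

```python
MAX_THREADS_PER_BLOCK = 1024   # from the slide
REGISTERS_PER_SM = 32 * 1024   # from the slide (32-bit registers)
MAX_THREADS_PER_SM = 1536      # assumption: CC 2.x value, not on the slide
MAX_BLOCKS_PER_SM = 8          # assumption: CC 2.x value, not on the slide


def resident_blocks(threads_per_block: int, regs_per_thread: int) -> int:
    """Upper bound on resident blocks per SM from thread and register limits."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be in 1..1024")
    if regs_per_thread < 1:
        raise ValueError("regs_per_thread must be positive")
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    by_registers = REGISTERS_PER_SM // (threads_per_block * regs_per_thread)
    return min(by_threads, by_registers, MAX_BLOCKS_PER_SM)


print(resident_blocks(256, 20))  # 6: limited by threads and registers alike
print(resident_blocks(256, 32))  # 4: limited by registers
print(resident_blocks(64, 16))   # 8: limited by the block cap
```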

L1 Cache vs. Shared Memory

Two different configs (on Fermi and Kepler; NOT on Maxwell!)

• 64KB total

• 16KB shared, 48KB L1 cache

• 48KB shared, 16KB L1 cache

• Set per kernel (in CUDA, e.g. with cudaFuncSetCacheConfig())
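
One way to think about the per-kernel choice is as a lookup over the two splits. The sketch below is a hypothetical selection policy (pick the split with the most L1 that still fits the kernel's shared memory needs); the function name and the policy are assumptions, not how the CUDA runtime decides.

```python
# (shared memory KB, L1 cache KB) splits listed above, ordered by shared size.
CONFIGS_KB = ((16, 48), (48, 16))


def pick_split(shared_kb_needed: float) -> tuple:
    """Return the (shared, L1) split with the largest L1 that still fits."""
    for shared_kb, l1_kb in CONFIGS_KB:
        if shared_kb_needed <= shared_kb:
            return shared_kb, l1_kb
    raise ValueError("kernel needs more shared memory than any split offers")


print(pick_split(12))  # (16, 48)
print(pick_split(20))  # (48, 16)
```

A kernel that uses little shared memory would typically prefer the larger L1, while a kernel that tiles heavily in shared memory needs the 48KB shared split.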

Global Memory Access

Cached on Fermi

L1 cache per SM

Global L2 cache

Compile time flag can choose (nvcc: -Xptxas -dlcm=ca or -Xptxas -dlcm=cg):

• Caching in both L1 and L2 (-dlcm=ca)

• Caching only in L2 (-dlcm=cg)

Cache line size (L1, L2):

• 128 bytes
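
With 128-byte cache lines, the cost of a warp's global memory access depends on how many distinct lines its 32 threads touch. A minimal sketch of that count, assuming each thread reads `access_bytes` bytes at `base_addr + lane * stride_bytes` (the function and parameter names are illustrative, and this is a simplified model, not the exact hardware transaction logic):

```python
LINE_BYTES = 128  # cache line size from the slide
WARP_SIZE = 32


def lines_touched(base_addr: int, stride_bytes: int, access_bytes: int = 4) -> int:
    """Count distinct 128-byte lines touched by one warp-wide access."""
    lines = set()
    for lane in range(WARP_SIZE):
        first = base_addr + lane * stride_bytes
        last = first + access_bytes - 1
        # An access can straddle a line boundary, so add every line it covers.
        lines.update(range(first // LINE_BYTES, last // LINE_BYTES + 1))
    return len(lines)


print(lines_touched(0, 4))    # 1: aligned, consecutive 4-byte words
print(lines_touched(64, 4))   # 2: same pattern, misaligned by 64 bytes
print(lines_touched(0, 128))  # 32: every thread hits its own line
```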

Global Memory Access

CUDA 6.5

Global Memory Access

CUDA 7.0

NVIDIA Kepler Architecture

Three different versions

• Compute capability 3.0 (GK104)

– GeForce GTX 680, …

– Quadro K5000

– Tesla K10

• Compute capability 3.5 (GK110)

– GeForce GTX 780 / Titan / Titan Black

– Quadro K6000

– Tesla K20, Tesla K40

• Compute capability 3.7 (GK210)

– Tesla K80

– Very new (~end of 2014)

GK104 SMX

• 192 CUDA cores

• 32 LD/ST units

• 16 SFUs

• 16 texture units

GK110 SMX

• 192 CUDA cores

• 64 DP units

• 32 LD/ST units

• 16 SFUs

• 16 texture units

New read-only data cache (48KB)

NVIDIA Kepler / GK104 Structure

Full size

• 4 GPCs

• 2 SMXs each

= 8 SMXs, 1536 CUDA cores

NVIDIA Kepler / GK110 Structure (1)

Full size

• 15 SMXs (Titan Black; Titan: 14)

• 2880 CUDA cores (Titan Black; Titan: 2688)

• 5 GPCs of 3 SMXs each

NVIDIA Kepler / GK110 Structure (2)

Titan (not Black)

• 14 SMXs

• 2688 CUDA cores

• 5 GPCs with 3 SMXs or 2 SMXs each
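
The Kepler core counts all derive from 192 CUDA cores per SMX, and the Titan's GPC layout follows from its 14 SMXs. A short sketch of both calculations, with the helper names `cuda_cores` and `gpc_split` being assumptions for illustration:

```python
CORES_PER_SMX = 192  # from the GK104 / GK110 SMX slides


def cuda_cores(smx_count: int) -> int:
    """Total CUDA cores for a Kepler chip with the given number of SMXs."""
    return smx_count * CORES_PER_SMX


def gpc_split(total_smx: int, gpcs: int = 5, big: int = 3, small: int = 2) -> tuple:
    """Return (GPCs with `big` SMXs, GPCs with `small` SMXs) summing to total_smx."""
    for n_big in range(gpcs + 1):
        if n_big * big + (gpcs - n_big) * small == total_smx:
            return n_big, gpcs - n_big
    raise ValueError("no split of GPCs gives that SMX count")


print(cuda_cores(4 * 2))  # 1536: GK104, 4 GPCs of 2 SMXs
print(cuda_cores(15))     # 2880: full GK110 (Titan Black)
print(cuda_cores(14))     # 2688: Titan
print(gpc_split(14))      # (4, 1): four GPCs with 3 SMXs, one with 2
```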

Compute Capabilities 2.0 – 3.5

Maxwell (GM) Architecture

Multiprocessor: SMM

4 partitions inside the SMM

• 32 CUDA cores each

• 128 CUDA cores in total

• Each has its own warp scheduler, dispatch units, register file

Shared memory and L1 cache now separate!

• L1 cache shares with texture cache

• Shared memory is its own space

Maxwell (GM) Architecture

First gen.

GM107

(GTX 750 Ti)

5 SMMs

(640 CUDA cores in total)

Maxwell (GM) Architecture

Second gen.

GM204

(GTX 980)

16 SMMs

(2048 CUDA cores in total)

4 GPCs of 4 SMMs
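
The Maxwell totals can be reproduced from the SMM partitioning: 4 partitions of 32 cores per SMM. A minimal sketch, where `maxwell_cores` is a hypothetical helper name:

```python
PARTITIONS_PER_SMM = 4
CORES_PER_PARTITION = 32
CORES_PER_SMM = PARTITIONS_PER_SMM * CORES_PER_PARTITION  # 128


def maxwell_cores(smms: int) -> int:
    """Total CUDA cores for a Maxwell chip with the given number of SMMs."""
    return smms * CORES_PER_SMM


gm107_cores = maxwell_cores(5)      # first gen., GTX 750 Ti
gm204_cores = maxwell_cores(4 * 4)  # second gen., GTX 980: 4 GPCs of 4 SMMs

print(CORES_PER_SMM, gm107_cores, gm204_cores)  # 128 640 2048
```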

Maxwell (GM) vs. Kepler (GK) Architecture

GK107 vs. GM107

Maxwell (GM) vs. Kepler (GK) Architecture

GK107 vs. GM204

Compute Capab. 5.x (Part 1)

Maxwell

• GM107: 5.0

• GM204: 5.2

Compute Capab. 5.x (Part 2)

Maxwell

• GM107: 5.0

• GM204: 5.2
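
The chip-to-compute-capability pairs named in this lecture can be kept in a small lookup table. This is a sketch only: the dictionary covers just the Kepler and Maxwell chips listed on the slides, and the `supports` helper is a hypothetical name for a feature-gating check.

```python
# Compute capability (major, minor) per chip, as listed on the slides.
COMPUTE_CAPABILITY = {
    "GK104": (3, 0),
    "GK110": (3, 5),
    "GK210": (3, 7),
    "GM107": (5, 0),
    "GM204": (5, 2),
}


def supports(chip: str, required: tuple) -> bool:
    """True if the chip's compute capability is at least `required`."""
    return COMPUTE_CAPABILITY[chip] >= required


print(supports("GK104", (3, 5)))  # False
print(supports("GM204", (3, 5)))  # True
```

Tuples compare element by element, so (5, 0) correctly ranks above (3, 7).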

Thank you.

