Page 1

Introduction to Computing with Graphics Processors

Cris Cecka
Computational Mathematics, Stanford University

Lecture 1: Introduction to Massively Parallel Computing

Page 2

Tutorial Goals

• Learn the architecture and computational environment of GPU computing
– Massively parallel
– Hierarchical threading and memory spaces
– Principles and patterns of parallel programming
– Processor architecture features and constraints
– Scalability across future generations

• Introduction to programming in CUDA
– Programming API, tools, and techniques
– Functionality and maintainability

Page 3

Moore’s Law (paraphrased)

“The number of transistors on an integrated circuit doubles every two years.”

– Gordon E. Moore

Page 4

Moore’s Law (Visualized)

[Figure: transistor counts over time, with the GF100 GPU highlighted. Data credit: Wikipedia]

Page 5

Serial Performance Scaling is Over

• Cannot continue to scale processor frequencies
– no 10 GHz chips

• Cannot continue to increase power consumption
– can't melt the chip

• Can continue to increase transistor density
– as per Moore's Law

Page 6

Why Massively Parallel Processing?

• A quiet revolution and potential build-up
– Computation: TFLOPs vs. 100 GFLOPs
– GPU in every PC – massive volume & potential impact

[Figure: peak floating-point throughput over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]

Page 7

Why Massively Parallel Processing?

• A quiet revolution and potential build-up
– Bandwidth: ~10x
– GPU in every PC – massive volume & potential impact

[Figure: memory bandwidth over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]

Page 8

The “New” Moore’s Law

• Computers no longer get faster, just wider

• You must re-think your algorithms to be parallel!
– Not only parallel, but hierarchically parallel...

• Data-parallel computing is the most scalable solution
– Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores, ...
– You will always have more data than cores – build the computation around the data

Page 9

Enter the GPU

• Highly scalable

• Massively parallel
– Hundreds of cores
– Thousands of threads

• Cheap

• Available

• Programmable

Page 10

GPU Evolution

• High throughput computation
– GeForce GTX 280: 933 GFLOP/s

• High bandwidth memory
– GeForce GTX 280: 140 GB/s

• High availability to all
– 180+ million CUDA-capable GPUs in the wild

[Timeline, 1995–2010: RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), "Fermi" (3B)]

Page 11

Graphics Legacy

• Throughput is paramount
– Must paint every pixel within frame time
– Scalability

• Create, run, & retire lots of threads very rapidly
– Measured 14.8 Gthread/s on an increment() kernel (see the sketch below)

• Use multithreading to hide latency
– 1 stalled thread is OK if 100 are ready to run
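
A minimal sketch of what an increment()-style kernel could look like (illustrative; the exact kernel behind the 14.8 Gthread/s measurement is not shown in the slides):

// Illustrative increment kernel: each thread bumps one array element.
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] += 1;
}

// Launched with enough blocks to cover n elements, e.g.:
//   increment<<<(n + 255) / 256, 256>>>(d_data, n);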

Page 12

Why is this different from a CPU?

• CPU: minimize latency experienced by 1 thread
– big on-chip caches
– sophisticated control logic

• GPU: maximize throughput of all threads
– # of threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
– multithreading can hide latency => skip the big caches

• Different goals produce different designs
– GPU assumes the workload is highly parallel
– CPU must be good at everything, parallel or not

Page 13

NVIDIA GPU Architecture

Fermi GF100

[Die diagram: Host I/F, GigaThread engine, six DRAM I/F blocks, shared L2 cache]

Page 14

SM Multiprocessor

[SM diagram: instruction cache; two schedulers with dispatch units; register file; 32 cores; 16 load/store units; 4 special function units; interconnect network; 64K configurable cache/shared memory; uniform cache]

• 32 CUDA cores per SM (512 total)

• 8x peak FP64 performance
– 50% of peak FP32 performance

• Direct load/store to memory
– Usual linear sequence of bytes
– High bandwidth (hundreds of GB/s)

• 64KB of fast, on-chip RAM
– Software- or hardware-managed
– Shared amongst CUDA cores
– Enables thread communication
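
On Fermi, the split of that 64KB between L1 cache and shared memory can be requested per kernel. A minimal sketch using the CUDA runtime call cudaFuncSetCacheConfig (myKernel is a placeholder kernel; the 48KB shared / 16KB L1 split is the Fermi option being requested):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only to illustrate the cache-configuration call.
__global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }

int main()
{
    // Request the 48KB shared / 16KB L1 split for this kernel;
    // cudaFuncCachePreferL1 would request the opposite split.
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));
    return 0;
}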

Page 15

Key Architectural Ideas

• GPU serves as a coprocessor to the CPU
– has its own device memory on the card

• SIMT (Single Instruction, Multiple Thread) execution
– threads run in groups of 32 called warps
– threads in a warp share an instruction unit (IU)
– HW automatically handles divergence (see the sketch below)

• Hardware multithreading
– HW resource allocation & thread scheduling
– HW relies on threads to hide latency

• Threads have all resources needed to run
– any warp not waiting for something can run
– context switching is (basically) free
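
A small illustration of divergence (an illustrative kernel, not from the slides): when threads of the same warp take different sides of a branch, the hardware serializes the two paths for that warp and re-converges afterwards.

// Threads in a warp diverge on this branch; the hardware runs the two
// paths one after the other, masking off inactive threads, then re-converges.
__global__ void divergent(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            out[i] = 2 * i;   // even-numbered threads take this path
        else
            out[i] = i + 1;   // odd-numbered threads take this path
    }
}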


Page 16

Enter CUDA

• A compiler and toolkit for programming NVIDIA GPUs

• Minimal extensions to the familiar C/C++ environment
– lets programmers focus on parallel algorithms (see the sketch below)

• Scalable parallel programming model
– express parallelism and control the hierarchy of memory spaces
– but also provides a high-level abstraction from the hardware

• Provides a straightforward mapping onto hardware
– good fit to the GPU architecture
– maps well to multi-core CPUs too
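
A minimal sketch of what those extensions look like (an illustrative SAXPY kernel, not taken from the slides): __global__ marks a function that runs on the GPU, built-in variables give each thread its index, and the <<<blocks, threads>>> syntax launches the kernel over a grid of threads.

// y[i] = a * x[i] + y[i], one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host-side launch covering n elements with 256-thread blocks:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);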

Page 17

Motivation

[Figure: reported speedups of GPU implementations over CPU implementations across applications – 13–457x overall, with individual results of 17x, 30x, 35x, 45x, 100x, and 110–240x]

Page 18

Compute Environment

• Threads are executed by SPs (streaming processors)
– each executes the same sequential kernel
– on-chip registers, off-chip local memory

• Thread blocks are executed by SMs
– threads in the same block can cooperate (see the sketch after the diagram)
– on-chip shared memory
– synchronization

• Grids of blocks are executed by the device
– off-chip global memory
– no synchronization between blocks

[Diagram: a thread t; a block b of threads t0, t1, ..., tB; a grid of several such blocks]
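
A minimal sketch of block-level cooperation (illustrative, assuming 256-thread blocks): threads in a block share data through __shared__ memory and synchronize with __syncthreads(); there is no equivalent synchronization across the blocks of a grid.

// Each block reverses its own 256-element chunk of the array using shared memory.
__global__ void reverse_within_block(int *data)
{
    __shared__ int tile[256];            // visible to all threads in this block
    int tx = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[tx] = data[base + tx];          // cooperative load from global memory
    __syncthreads();                     // all loads complete before any read

    data[base + tx] = tile[blockDim.x - 1 - tx];
}

// Launch with blockDim.x == 256 and an array of numBlocks * 256 elements:
//   reverse_within_block<<<numBlocks, 256>>>(d_data);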

Page 19

CUDA Model of Parallelism

• CUDA virtualizes the physical hardware
– a thread is a virtualized scalar processor (registers, PC, state)
– a block is a virtualized multiprocessor (threads, shared memory)

• Scheduled onto physical hardware without pre-emption
– threads/blocks launch & run to completion
– blocks should be independent

[Diagram: several blocks, each with its own block memory, connected to a common global memory]

Page 20

NOT: Flat Multiprocessor

• Global synchronization isn't cheap

• Global memory access times are expensive

• cf. PRAM (Parallel Random Access Machine) model

[Diagram: a flat array of processors sharing one global memory]

Page 21

NOT: Distributed Processors

• Distributed computing is a different setting

• cf. BSP (Bulk Synchronous Parallel) model, MPI

[Diagram: processor–memory pairs connected by an interconnection network]

Page 22

A Common Programming Strategy

• Global memory resides in device memory (DRAM)
– much slower access than shared memory

• Tile data to take advantage of fast shared memory:
– generalize from the adjacent_difference example
– divide and conquer


Page 24

A Common Programming Strategy

Partition data into subsets that fit into shared memory

Page 25

A Common Programming Strategy

Handle each data subset with one thread block

Page 26

A Common Programming Strategy

Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism

Page 27

A Common Programming Strategy

Perform the computation on the subset from shared memory

Page 28

A Common Programming Strategy

Copy the result from shared memory back to global memory
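
Putting the steps together, a minimal sketch of a tiled shared-memory kernel in the spirit of the adjacent_difference example mentioned earlier (illustrative, not the exact code from the lecture; BLOCK_SIZE is an assumed block size):

#define BLOCK_SIZE 256   // assumed block size; launch with <<<grid, BLOCK_SIZE>>>

// Illustrative tiled kernel: result[i] = input[i] - input[i-1].
__global__ void adj_diff(int *result, const int *input, int n)
{
    __shared__ int s_data[BLOCK_SIZE];   // this block's tile, staged in shared memory

    int i  = blockIdx.x * blockDim.x + threadIdx.x;
    int tx = threadIdx.x;

    if (i < n)
        s_data[tx] = input[i];           // cooperative load from global memory

    __syncthreads();                     // the whole tile must be loaded before use

    if (i > 0 && i < n) {
        // Most threads read their left neighbor from shared memory; the first
        // thread of each block still fetches it from global memory.
        int left = (tx > 0) ? s_data[tx - 1] : input[i - 1];
        result[i] = s_data[tx] - left;
    }
}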

