Page 1: NVIDIA  Kepler  Architecture

NVIDIA Kepler Architecture

Paul Bissonnette
Rizwan Mohiuddin
Ajith Herga

Page 2: NVIDIA  Kepler  Architecture

Compute Unified Device Architecture

• Hybrid CPU/GPU code
• Low latency code is run on the CPU
  – Result immediately available
• High latency, high throughput code is run on the GPU
  – Result arrives over the bus
  – The GPU has many more cores than the CPU
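
A minimal sketch of this hybrid model in CUDA C (the vector-add kernel, sizes, and names are illustrative, not from the slides): the latency-sensitive setup runs on the CPU, the throughput-heavy loop is offloaded to the GPU as a kernel across many cores, and the result comes back over the bus.

#include <cstdio>
#include <cuda_runtime.h>

// High-throughput work: one GPU thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Low-latency setup runs on the CPU.
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Copy inputs across the bus, launch the kernel, copy the result back.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // result arrives over the bus
    printf("c[0] = %f\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}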

Page 3: NVIDIA  Kepler  Architecture

CPU/GPU Code

[Diagram: a CUDA program is split into GPU routines and CPU routines; NVCC compiles the GPU routines into a GPU object, GCC compiles the CPU routines into a CPU object, and the linker combines the two into a CUDA binary.]
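
A hedged sketch of how that split looks from the programmer's side, assuming the usual nvcc workflow (the file name and flags are illustrative): one .cu file holds both the GPU and CPU routines, nvcc compiles the device code itself and hands the host code to GCC, and the two objects are linked into a single CUDA binary.

// example.cu -- one source file holding both sides of the program.
// Illustrative build command:
//   nvcc -arch=sm_35 example.cu -o example
// nvcc compiles the __global__ routine for the GPU, passes the host code
// to gcc, and the linker combines both objects into one CUDA binary.

__global__ void gpuRoutine(float *x) { x[threadIdx.x] += 1.0f; }   // GPU routine

void cpuRoutine(float *x, int n) {                                  // CPU routine
    for (int i = 0; i < n; ++i) x[i] += 1.0f;
}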

Page 4: NVIDIA  Kepler  Architecture

Execution Model (Overview)

[Diagram: execution alternates between the CPU and the GPU. The CPU makes RPC-like kernel launches to the GPU, intermediate results flow back, and the final result is returned to the CPU.]

Page 5: NVIDIA  Kepler  Architecture

Execution Model (GPU)

[Diagram: threads are grouped into thread blocks; each thread block runs on a single streaming multiprocessor; all of the blocks together form a thread grid, which runs across the whole graphics card.]

Page 6: NVIDIA  Kepler  Architecture

Execution Model (GPU)

• Each procedure runs as a “kernel”
• An instance of a kernel runs on a thread block
  – A thread block executes on a single streaming multiprocessor
• All instances of a particular kernel form a thread grid
  – A thread grid executes on a single graphics card, across several streaming multiprocessors
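
A brief hedged sketch of those terms in CUDA C (the kernel name and sizes are illustrative): the kernel is a __global__ function, the launch configuration picks the thread-block size and the number of blocks in the grid, and each thread derives its element from blockIdx and threadIdx.

// A kernel: every thread in the grid runs this function on one element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // position within the grid
    if (i < n) data[i] *= factor;
}

// Launch: 256 threads per block (each block runs on one streaming
// multiprocessor); enough blocks to cover n elements form the thread grid,
// which is spread across the card's multiprocessors.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);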

Page 7: NVIDIA  Kepler  Architecture

Thread Cooperation

• Multiple levels of sharing

• Thread blocks are similar to MPI groups
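
One concrete level of that sharing is per-block shared memory. A hedged sketch (kernel name and block size illustrative): threads of a block stage values in __shared__ memory, synchronize with __syncthreads(), and then read values written by other threads in the same block.

#define BLOCK 256   // launch with <<<numBlocks, BLOCK>>>; data length a multiple of BLOCK

// Threads within one block cooperate through on-chip shared memory,
// loosely analogous to processes in an MPI group exchanging data.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;

    tile[t] = data[g];               // each thread contributes its element
    __syncthreads();                 // wait until the whole block has written

    data[g] = tile[BLOCK - 1 - t];   // read an element written by another thread
}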

Page 8: NVIDIA  Kepler  Architecture

GPU Execution of Kernels

• In Kepler, threads can spawn new thread blocks/grids
• Less time spent on the CPU
• More natural recursion
• Parent completion depends on its child grids
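
A hedged sketch of that spawning (kernel names illustrative), assuming a compute-capability 3.5 device and compilation with -rdc=true plus the device runtime library; the device-side cudaDeviceSynchronize() used here belongs to the Kepler-era dynamic parallelism API and is deprecated in recent toolkits.

// Build (illustrative): nvcc -arch=sm_35 -rdc=true file.cu -lcudadevrt

__global__ void childKernel(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

__global__ void parentKernel(int *out, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // A GPU thread spawns a new grid without returning to the CPU.
        childKernel<<<(n + 255) / 256, 256>>>(out, n);
        cudaDeviceSynchronize();   // parent waits for its child grid (Kepler-era API)
    }
    // The parent grid does not complete until all of its child grids finish.
}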

Page 9: NVIDIA  Kepler  Architecture

CUDA Languages

• CUDA C/C++ and CUDA Fortran
• Scientific computing
• Highly parallel applications
• NVIDIA-specific (unlike OpenCL)
• Specialized for specific tasks
  – Highly optimized single-precision floating point
  – Specialized data-sharing instructions within thread blocks
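
The “specialized data sharing instructions” presumably refers to the warp shuffle (SHFL) instructions introduced with Kepler. A hedged sketch of a warp-level sum using the modern __shfl_down_sync spelling (Kepler-era toolkits used __shfl_down):

// Warp-level reduction with shuffle instructions: the 32 lanes of a warp
// exchange register values directly, without going through shared memory.
__inline__ __device__ float warpSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up holding the sum of all 32 lanes
}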

Page 10: NVIDIA  Kepler  Architecture

Hyper-Q

Without Hyper-Q:

• Only one hardware work queue is available, so the GPU can receive work from only one queue at a time.

• It is difficult for a single CPU core to keep the GPU busy.

Page 11: NVIDIA  Kepler  Architecture

• With Hyper-Q:
  – Allows connections from multiple CUDA streams, Message Passing Interface (MPI) processes, or multiple threads of the same process.
  – 32 concurrent work queues, so the GPU can receive work from up to 32 processes or threads at the same time.
  – Up to 3x performance increase over Fermi.

Page 12: NVIDIA  Kepler  Architecture

• Removes false dependencies between streams that arise when all streams share a single hardware work queue.
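
A hedged sketch of feeding independent work to the GPU through separate CUDA streams (the kernel and the chunked data layout are illustrative): on Kepler, Hyper-Q gives each stream its own hardware work queue, so these launches can overlap rather than serialize behind a single queue as on Fermi.

__global__ void busyKernel(float *x, int n) {         // illustrative stand-in work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i]) + 1.0f;
}

void launchIndependentWork(float **d_chunks, int nChunks, int n) {
    cudaStream_t streams[32];                          // Kepler: up to 32 connections
    for (int s = 0; s < nChunks && s < 32; ++s) {
        cudaStreamCreate(&streams[s]);
        // Independent kernels in separate streams map to separate work queues.
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_chunks[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < nChunks && s < 32; ++s) cudaStreamDestroy(streams[s]);
}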

Page 13: NVIDIA  Kepler  Architecture

Dynamic Parallelism

• Without Dynamic Parallelism:
  – Data travels back and forth between the CPU and GPU many times.
  – This is because the GPU cannot create more work for itself depending on the data.

Page 14: NVIDIA  Kepler  Architecture

• With Dynamic Parallelism:
  – The GPU can generate work for itself based on intermediate results, without CPU involvement.
  – Permits dynamic run-time decisions.
  – Leaves the CPU free to do other work, conserving power.
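
A hedged sketch of such a run-time decision (kernel names and threshold are illustrative; requires a compute-capability 3.5 device and -rdc=true): one GPU thread inspects an intermediate result and launches further work only if it is needed, with no round trip to the CPU.

__global__ void refineKernel(float *field, int n) {    // illustrative refinement pass
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] *= 0.5f;
}

__global__ void refineIfNeeded(float *field, float *error, int n) {
    // Dynamic run-time decision made entirely on the GPU.
    if (threadIdx.x == 0 && blockIdx.x == 0 && *error > 1e-3f) {
        refineKernel<<<(n + 255) / 256, 256>>>(field, n);
    }
}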

Page 15: NVIDIA  Kepler  Architecture

• Application Example: Adaptive Grid Simulation

Page 16: NVIDIA  Kepler  Architecture

• Application Example: Quicksort

[Diagram: the CPU launches the initial quicksort kernel; each partition step spawns child kernels in new streams, so streams spawn streams.]
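
A hedged sketch of that pattern, loosely modeled on NVIDIA's cdpSimpleQuicksort sample (the single-threaded partition and launch shapes are simplifications for clarity; requires a compute-capability 3.5 device, -rdc=true, and the device runtime library): each launch sorts one segment and hands its sub-segments to child grids on device-created streams, so streams spawn streams without CPU involvement.

__global__ void quicksort(int *data, int left, int right) {
    if (left >= right) return;

    // Partition around the rightmost element (one thread, for clarity only).
    int pivot = data[right], i = left;
    for (int j = left; j < right; ++j) {
        if (data[j] < pivot) { int t = data[i]; data[i] = data[j]; data[j] = t; ++i; }
    }
    int tmp = data[i]; data[i] = data[right]; data[right] = tmp;

    // Recurse by launching child grids on non-blocking device-side streams.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    if (left < i - 1)  quicksort<<<1, 1, 0, s1>>>(data, left, i - 1);
    if (i + 1 < right) quicksort<<<1, 1, 0, s2>>>(data, i + 1, right);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}

// Host side: the CPU only launches the initial call.
//   quicksort<<<1, 1>>>(d_data, 0, n - 1);
//   cudaDeviceSynchronize();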

Page 17: NVIDIA  Kepler  Architecture

CPU-GPU Stack Exchange

[Diagram: host-side control loop. It runs on the CPU, loops based on intermediate results, checks whether the GPU has returned any more intermediate results, and spawns a new stream to be computed on the GPU for each one.]
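
A hedged sketch of that host-side loop (the work-item representation, the kernel, and the rule for producing follow-up items are all illustrative): the CPU keeps a stack of pending items, spawns a stream of GPU work for each one, and pushes new items whenever the returned intermediate result calls for more work.

#include <stack>
#include <cuda_runtime.h>

// Illustrative kernel: "processes" a value and reports how many follow-up
// items it produced (here: half the input, purely as a stand-in).
__global__ void processItem(int value, int *d_numChildren) {
    if (threadIdx.x == 0) *d_numChildren = value / 2;
}

void cpuDrivenLoop(int initialItem) {
    std::stack<int> pending;            // work items kept on the CPU
    pending.push(initialItem);

    int *d_numChildren;
    cudaMalloc(&d_numChildren, sizeof(int));

    while (!pending.empty()) {          // looping based on intermediate results
        int item = pending.top(); pending.pop();

        // CPU spawns a stream of GPU work for this item.
        cudaStream_t s;
        cudaStreamCreate(&s);
        processItem<<<1, 1, 0, s>>>(item, d_numChildren);

        // Check what the GPU returned; new intermediate results become new work.
        int numChildren = 0;
        cudaMemcpyAsync(&numChildren, d_numChildren, sizeof(int),
                        cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);

        for (int c = 0; c < numChildren; ++c) pending.push(item / 2);
    }
    cudaFree(d_numChildren);
}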

Page 18: NVIDIA  Kepler  Architecture

Memory Organization

Page 19: NVIDIA  Kepler  Architecture

Memory Organization

Page 20: NVIDIA  Kepler  Architecture

Core Stream

Page 21: NVIDIA  Kepler  Architecture

Stream Processor

Page 22: NVIDIA  Kepler  Architecture

Kepler Architecture

Page 23: NVIDIA  Kepler  Architecture

Scheduling

Page 24: NVIDIA  Kepler  Architecture

Warp Scheduler

Page 25: NVIDIA  Kepler  Architecture
Page 26: NVIDIA  Kepler  Architecture

Thread Block and Grid Scheduling

Page 27: NVIDIA  Kepler  Architecture

References

• NVIDIA Whitepapers
  – http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
  – http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf
• NVIDIA Keynote Presentation
  – http://www.youtube.com/watch?v=TxtZwW2Lf-w
• Georgia Tech Presentation
  – http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
• http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/4
• http://gpuscience.com/code-examples/tesla-k20-gpu-quicksort-with-dynamic-parallelism

