
Graphics Processing Units

References:

• Computer Architecture: A Quantitative Approach, 5th Edition, Hennessy and Patterson, 2012
• http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
• http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1
• http://www.moderngpu.com/intro/performance.html
• http://heather.cs.ucdavis.edu/parprocbook

CPU vs. GPU

• CPU: only a small fraction of the chip is used for arithmetic
(die photo: http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html)

CPU vs. GPU

• Intel Haswell: 170 GFlops (quad-core at 3.4 GHz)
• AMD Radeon R9 290: 4800 GFlops (at 0.95 GHz)
• Nvidia GTX 970: 5000 GFlops (at 1.05 GHz)
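These peak numbers come from cores × FLOPs per cycle × clock. As a hedged check for the R9 290 (the 2560 stream processors and 2-FLOP fused multiply-add come from AMD's published specs, not from this slide):

    2560 cores × 2 FLOPs/cycle (FMA) × 0.95 GHz ≈ 4860 GFlops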

GPGPU

• General-purpose GPU programming: massively parallel
• Scientific computing, brain simulations, etc.
• In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including the 2nd and 6th places

OpenCL vs CUDA

• Both target GPGPU
• OpenCL: open standard; supported on AMD, NVIDIA, Intel, Altera, …
• CUDA: proprietary (Nvidia); losing ground to OpenCL?
• Similar performance

CUDA

Programming on Parallel Machines, Norm Matloff, Chapter 5

http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Uses a thread hierarchy:

• Thread
• Block
• Grid

Thread

• Executes an instance of a kernel (program)
• Has a ThreadID (within its block), program counter, registers, private memory, input and output parameters
• Private memory is used for register spills, function calls, and array variables

Nvidia Fermi Whitepaper pg 6

Block

• Set of concurrently executing threads
• Cooperate via barrier synchronization and shared memory (fast but small); a sketch follows below
• Has a BlockID (within its grid)

Nvidia Fermi Whitepaper pg 6
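As a minimal sketch of block-level cooperation (a hypothetical kernel, not from the slides): each thread stages one element in __shared__ memory, the block meets at a __syncthreads() barrier, and only then does any thread read a value another thread wrote.

__global__ void reverse32(int *d) {
    __shared__ int s[32];   // fast on-chip shared memory, one copy per block
    int t = threadIdx.x;
    s[t] = d[t];            // each thread stages one element
    __syncthreads();        // barrier: all writes above are now visible block-wide
    d[t] = s[31 - t];       // safe to read an element written by another thread
}

Launched as reverse32<<<1,32>>>(d); without the barrier, a thread could read s[31-t] before its writer stored it.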

Grid

• Array of thread blocks running the same kernel
• Blocks read and write global memory (slow: hundreds of cycles)
• Synchronization happens between dependent kernel calls (see the launch sketch below)

Nvidia Fermi Whitepaper pg 6
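A hedged sketch of the hierarchy from the host side (the kernels step1 and step2 and the pointer d_data are hypothetical): dim3 configures a grid of blocks, and dependent kernels communicate through global memory.

dim3 threads(16, 16);                  // 256 threads per block
dim3 blocks((w + 15)/16, (h + 15)/16); // enough blocks to cover a w-by-h domain
step1<<<blocks, threads>>>(d_data);    // writes its results to global memory
cudaDeviceSynchronize();               // host waits until step1's output is complete
step2<<<blocks, threads>>>(d_data);    // dependent kernel reads what step1 wrote

On a single stream the two launches are already ordered; the explicit synchronize also makes completion visible to the host.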

Hardware Mapping

• GPU executes one or more kernel (program) grids
• Streaming Multiprocessor (SM) executes one or more thread blocks
• CUDA core executes a thread

Fermi Architecture

• Debuted in 2010
• 512 CUDA cores, each executing one FP or integer instruction per cycle
• 32 CUDA cores per SM, 16 SMs per GPU
• Six 64-bit memory ports
• PCI-Express interface to the CPU
• GigaThread scheduler distributes blocks to SMs; each SM has a hardware thread scheduler with fast context switch
• 3 billion transistors

Nvidia Fermi Whitepaper pg 7

CUDA core

• Pipelined integer and FP units
• FP unit: IEEE 754-2008 fused multiply-add
• Integer unit: boolean, shift, move, compare, ...

Nvidia Fermi Whitepaper pg 8
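The daxpy update shown later, y[i] = a*x[i] + y[i], is exactly the shape this unit computes in one instruction with a single rounding. For comparison, the same operation is available on the host via C's math library:

#include <math.h>

double r = fma(a, x, y); // a*x + y, rounded once (IEEE 754-2008)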

Streaming Multiprocessor (SM)

• 32 CUDA cores
• 16 load/store units that calculate source/destination addresses
• 4 Special Function Units: sin, cosine, reciprocal, sqrt

Nvidia Fermi Whitepaper pg 8
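As a hedged illustration (a hypothetical kernel), CUDA's fast-math intrinsics map onto these SFUs, trading a little accuracy for single-instruction throughput:

__global__ void sfu_demo(float *out, const float *in, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) * rsqrtf(in[i] + 1.0f); // SFU-backed approximations
}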

Warps

• 32 threads from a block are bundled into a warp, which executes the same instruction each cycle
• The warp is therefore the minimum unit of SIMD data
• Warps are implicitly synchronized; if threads branch in different directions, the warp steps through both paths using predicated instructions (see the sketch after this list)
• Two warp schedulers each select one instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
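A hedged illustration of divergence (a hypothetical kernel): when the lanes of one warp take both sides of a branch, the warp steps through both paths with inactive lanes predicated off, so the two halves run serially.

__global__ void diverge(int *d) {
    int t = threadIdx.x;
    if (t % 2 == 0)   // even lanes execute this while odd lanes idle...
        d[t] *= 2;
    else              // ...then odd lanes execute this while even lanes idle
        d[t] += 1;
}

Branching at warp granularity instead (e.g., on threadIdx.x / 32) keeps every warp on a single path and avoids the penalty.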

Maxwell Architecture

• Debuted in 2014
• 16 streaming multiprocessors × 128 cores/SM = 2048 cores

Programming CUDA

C code:

daxpy(n, 2.0, x, y); // invoke

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

Programming CUDA

CUDA code:

__host__
int nblocks = (n+511)/512;            // grid size: enough blocks to cover n
daxpy<<<nblocks,512>>>(n, 2.0, x, y); // launch with 512 threads/block

__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x; // global thread index
    if (i < n)
        y[i] = a*x[i] + y[i];
}

n=8192, 512 threads/block:

grid
  block0
    warp0:  Y[0]=A*X[0]+Y[0] ... Y[31]=A*X[31]+Y[31]
    ...
    warp15: Y[480]=A*X[480]+Y[480] ... Y[511]=A*X[511]+Y[511]
  ...
  block15
    warp0:  Y[7680]=A*X[7680]+Y[7680] ... Y[7711]=A*X[7711]+Y[7711]
    ...
    warp15: Y[8160]=A*X[8160]+Y[8160] ... Y[8191]=A*X[8191]+Y[8191]
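The mapping above is plain integer arithmetic on the global index; a sketch for this launch (512 threads/block, 32 threads/warp):

int i     = blockIdx.x*blockDim.x + threadIdx.x; // global index, 0..8191
int block = i / 512;         // Y[7680] -> block 15
int warp  = (i % 512) / 32;  // Y[7680] -> warp 0 of its block
int lane  = i % 32;          // position within the warp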

Moving data between host and GPU

int main() {
    double *x, *y, a, *dx, *dy;
    x = (double *)malloc(sizeof(double)*n);
    y = (double *)malloc(sizeof(double)*n);
    // initialize x and y
    …
    cudaMalloc((void**)&dx, n*sizeof(double)); // allocate device copies
    cudaMalloc((void**)&dy, n*sizeof(double));
    cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
    …                                          // dy is filled from y the same way
    daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);    // kernel works on the device copies
    cudaThreadSynchronize();                   // wait for the kernel to finish
    cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy);                // cudaFree, not cudaMemFree
    free(x); free(y);
}
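Every runtime call above returns a status that the sketch ignores. A hedged addition (the wrapper name check is hypothetical), since a failed cudaMalloc or cudaMemcpy otherwise goes unnoticed:

#include <stdio.h>
#include <stdlib.h>

void check(cudaError_t e) {
    if (e != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));
        exit(1);
    }
}

// usage:
check(cudaMalloc((void**)&dx, n*sizeof(double)));
check(cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice));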