Intermediate GPGPU Programming in CUDA
CSC 469/585, Winter 2011-12, Louisiana Tech U
12/19/11
Page 1

NVIDIA  Hardware  Architecture  

(Figure: NVIDIA hardware architecture, including host memory.)

Page 2

Recall  

• 5 steps for CUDA programming
– Initialize device
– Allocate device memory
– Copy data to device memory
– Execute kernel
– Copy data back from device memory

Initialize Device Calls

• To select the device associated with the host thread
– cudaSetDevice(device)
– This function must be called before any __global__ function is launched; otherwise device 0 is automatically selected.

• To get the number of devices
– cudaGetDeviceCount(&devicecount)

• To retrieve a device's properties
– cudaGetDeviceProperties(&deviceProp, device)
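The three calls above fit together as follows; a minimal host-side sketch (error checking omitted, device 0 assumed):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);         // how many CUDA devices are visible
    printf("Found %d device(s)\n", deviceCount);

    cudaSetDevice(0);                         // must precede any kernel launch;
                                              // otherwise device 0 is selected anyway

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);  // fill the property struct for device 0
    printf("Device 0: %s, %d SM(s)\n",
           deviceProp.name, deviceProp.multiProcessorCount);
    return 0;
}
```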

Page 3

Hello World Example

• Allocate host and device memory

Hello World Example

• Host code

Page 4

Hello World Example

• Kernel code
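The slide's code listing did not survive extraction; a minimal sketch of what a Hello World host + kernel pair might look like (names are hypothetical; device-side printf requires compute capability 2.0 or later):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel code: each thread prints its block and thread ID
__global__ void hello(void)
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();         // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();   // wait for the kernel (and flush its printf output)
    return 0;
}
```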

To Try CUDA Programming
• SSH to 138.47.102.165
• Set environment variables in .bashrc in your home directory
  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK

• Compile the following directories
– NVIDIA_GPU_Computing_SDK/shared/
– NVIDIA_GPU_Computing_SDK/C/common/

• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

Page 5

Demo

• Hello World
– Print out block and thread IDs

• Vector Add
– C = A + B

CUDA Language Concepts

• CUDA programming model
• CUDA memory model

Page 6

Some Terminology

• Device = GPU = set of stream multiprocessors
• Stream Multiprocessor (SM) = set of processors & shared memory

• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel

• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

CUDA Programming Model

• Parallel code (kernel) is launched and executed on a device by many threads

• Threads are grouped into thread blocks
• Parallel code is written for a thread

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

Page 7

Thread Hierarchy

• Threads launched for a parallel section are partitioned into thread blocks

• Thread block = a group of threads that can:
– Synchronize their execution
– Communicate via a low-latency shared memory

• Grid = all thread blocks for a given launch

Page 8

IDs and Dimensions

• Threads
– 3D IDs
– Unique within a block
– Two threads from two different blocks cannot cooperate

• Blocks
– 2D and 3D IDs (depending on the hardware)
– Unique within a grid

• Dimensions are set at launch time
– Can be unique for each section

• Built-in variables:
– threadIdx, blockIdx
– blockDim, gridDim
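A common use of these built-in variables is computing a global thread index so the grid can cover an array larger than one block; a sketch (vector length N assumed):

```cuda
__global__ void vecAddN(float* A, float* B, float* C, int N)
{
    // blockIdx.x selects the block, blockDim.x is the threads per block,
    // threadIdx.x is the thread's position inside its block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)               // guard: the last block may be only partly full
        C[i] = A[i] + B[i];
}
```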

(Figure: the host launches Kernel 1 on Grid 1, a 3x2 array of blocks (0,0)-(2,1), and Kernel 2 on Grid 2. Block (1,1) of Grid 2 is expanded to show a 5x3 array of threads (0,0)-(4,2).)

Page 9

Page 10

CUDA Memory Model

• Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory

(Figure: CUDA memory model. Each block has its own shared memory; each thread within a block has its own registers and local memory; all blocks in the grid share the global, constant, and texture memories.)

• The host can R/W global, constant, and texture memories
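The memory spaces above map onto CUDA declaration qualifiers; a sketch of how each is declared (array sizes arbitrary):

```cuda
__constant__ float coeffs[16];       // per-grid constant memory (read-only on device)

__global__ void scale(float* gdata)  // gdata points into per-grid global memory
{
    __shared__ float tile[64];       // per-block shared memory
    int i = threadIdx.x;             // per-thread scalars like i live in registers

    tile[i] = gdata[i];              // global -> shared
    __syncthreads();                 // make the writes visible to the whole block
    gdata[i] = tile[i] * coeffs[0];  // shared + constant -> global
}
```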

Page 11

Device DRAM

• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads

• Texture and constant memories
– Constants initialized by host
– Contents visible to all threads

CUDA Device Memory Allocation

• cudaMalloc(pointer, memsize)
– Allocates an object in device global memory
– pointer = address of a pointer to the allocated object
– memsize = size of the allocated object

• cudaFree(pointer)
– Frees the object from device global memory
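Put together, allocation and deallocation of an N-float array might look like this (error checking omitted):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;
    float* d_A = NULL;   // the device pointer itself lives on the host

    // pointer = &d_A, memsize = N * sizeof(float)
    cudaMalloc((void**)&d_A, N * sizeof(float));

    // ... launch kernels that read/write d_A here ...

    cudaFree(d_A);       // release the global-memory object
    return 0;
}
```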

Page 12

CUDA Host-Device Data Transfer

• cudaMemcpy()
– Memory data transfer
– Requires four parameters
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type of transfer: host to host, host to device, device to host, device to device
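A sketch of a round trip through device memory using the four parameters above:

```cuda
#include <string.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 256;
    float h_A[N], h_B[N];
    memset(h_A, 0, sizeof(h_A));

    float* d_A;
    cudaMalloc((void**)&d_A, N * sizeof(float));

    // (destination, source, byte count, type of transfer)
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    // ... kernel launches operating on d_A would go here ...
    cudaMemcpy(h_B, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    return 0;
}
```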

CUDA Function Declaration

• __global__ defines a kernel function
– Must return void

Declaration                       Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void KernelFunc()      device             host
__host__ float HostFunc()         host               host
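A sketch illustrating the three qualifiers from the table (function names hypothetical):

```cuda
// Runs on the device, callable only from device code
__device__ float square(float x)
{
    return x * x;
}

// Kernel: runs on the device, launched from the host, must return void
__global__ void squareAll(float* data)
{
    int i = threadIdx.x;
    data[i] = square(data[i]);   // __device__ function called from a kernel
}

// Runs on the host, callable from the host (__host__ is the default)
__host__ float squareOnHost(float x)
{
    return x * x;
}
```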

Page 13

CUDA Function Call Restrictions

• __device__ functions cannot have their address taken

• For functions executed on the device:
– No recursion
– No static variable declarations inside the function
– No variable number of arguments

Calling a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes, Streams >>>(...);
– DimGrid = dimension and size of the grid
– DimBlock = dimension and size of each block
– SharedMemBytes specifies the number of bytes in shared memory (optional)
– Streams specifies the associated stream (optional)
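For example, launching a vector-add kernel over N elements with 256-thread blocks; when the optional arguments are omitted they default to 0 shared bytes and the default stream:

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(float* A, float* B, float* C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

void launch(float* d_A, float* d_B, float* d_C, int N)
{
    dim3 dimBlock(256);                               // DimBlock: threads per block
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x);  // DimGrid: enough blocks to cover N
    vecAdd<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);  // SharedMemBytes, Streams omitted
}
```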

Page 14

NVIDIA Hardware Architecture

(Figure: device-level view, with host memory attached to the GPU.)

NVIDIA Hardware Architecture

(Figure: streaming multiprocessor (SM) level view.)

Page 15

Specifications of a Device

• For more details
– deviceQuery in CUDA SDK
– Appendix F in Programming Guide 4.0

Specifications        Compute Capability 1.3   Compute Capability 2.0
Warp size             32                       32
Max threads/block     512                      1024
Max blocks/grid       65535                    65535
Shared mem            16 KB/SM                 48 KB/SM

Demo

• deviceQuery
– Shows hardware specifications in detail

Page 16

Memory Optimizations

• Reduce the time of memory transfer between host and device
– Use asynchronous memory transfer (CUDA streams)
– Use zero copy

• Reduce the number of transactions between on-chip and off-chip memory
– Memory coalescing

• Avoid bank conflicts in shared memory

Reduce Time of Host-Device Memory Transfer

• Regular memory transfer (synchronous)

Page 17

Reduce Time of Host-Device Memory Transfer

• CUDA streams
– Allow overlapping between kernel execution and memory copy

CUDA Streams Example
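The example listing did not survive extraction; a minimal sketch of the usual pattern, splitting the work into two streams so one stream's copy can overlap the other's kernel (pinned host memory required for async copies; the kernel `work` is hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void work(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20, HALF = N / 2;
    float *h, *d;
    cudaHostAlloc((void**)&h, N * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaMalloc((void**)&d, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int k = 0; k < 2; ++k) {
        int off = k * HALF;
        // operations in the same stream run in order;
        // operations in different streams may overlap
        cudaMemcpyAsync(d + off, h + off, HALF * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        work<<<HALF / 256, 256, 0, s[k]>>>(d + off, HALF);
        cudaMemcpyAsync(h + off, d + off, HALF * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();   // wait for both streams to finish

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```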

Page 18

CUDA Streams Example

GPU Timers

• CUDA events
– An API
– Timed using the GPU clock
– Accurate for timing kernel executions

• CUDA timer calls
– Libraries implemented in CUDA SDK

Page 19

CUDA  Events  Example  
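The example listing is missing here; a sketch of the standard event-timing pattern around a kernel launch (kernel name hypothetical):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void work(float* d) { d[threadIdx.x] += 1.0f; }

int main(void)
{
    float* d;
    cudaMalloc((void**)&d, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);    // record in the default stream
    work<<<1, 256>>>(d);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // block until the stop event is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time on the GPU clock
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```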

Demo  

•  simpleStreams  

Page 20

Reduce Time of Host-Device Memory Transfer

• Zero copy
– Allows device pointers to access page-locked host memory directly
– Page-locked host memory is allocated by cudaHostAlloc()
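A sketch of the zero-copy pattern (requires a device that supports mapped host memory; the kernel is hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void inc(float* p) { p[threadIdx.x] += 1.0f; }

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);       // must precede context creation

    float *h, *d;
    cudaHostAlloc((void**)&h, 256 * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d, h, 0);  // device alias for the host buffer

    inc<<<1, 256>>>(d);        // kernel reads/writes host memory directly
    cudaDeviceSynchronize();   // results now visible in h, no cudaMemcpy needed

    cudaFreeHost(h);
    return 0;
}
```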

Demo  

•  Zero  copy  

Page 21

Reduce Number of On-chip and Off-chip Memory Transactions

• Threads in a warp access global memory
• Memory coalescing
– Copies a batch of adjacent words in a single transaction

Memory Coalescing

• Threads in a warp access global memory in a straightforward way

Page 22

Memory Coalescing

• Memory addresses are aligned in the same segment, but the accesses are not sequential

Memory Coalescing

• Memory addresses are not aligned in the same segment
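The access patterns above can be contrasted in kernel code; a sketch (buffers assumed large enough, and the exact transaction counts depend on compute capability):

```cuda
__global__ void accessPatterns(float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: consecutive threads touch consecutive, aligned addresses
    out[i] = in[i];

    // Aligned to the same segment but permuted: still served efficiently on CC 2.x
    out[i ^ 1] = in[i ^ 1];

    // Strided: a warp's accesses span multiple segments,
    // so several transactions are issued
    out[i * 2] = in[i * 2];
}
```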

Page 23

Shared Memory

• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x

• Helps utilize memory coalescing
• Bank conflicts may occur
– Two or more threads access the same bank
– In compute capability 1.x, no broadcast
– In compute capability 2.x, the same data is broadcast to the many threads that request it

Bank Conflicts

(Figure: left, threads 0-3 each access a distinct bank 0-3: no bank conflict. Right, two threads map to the same bank: 2-way bank conflict.)
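In code, the conflict-free and 2-way-conflict patterns of the figure look like this (32 banks of 4-byte words assumed, as on compute capability 2.x):

```cuda
__global__ void bankDemo(float* out)
{
    __shared__ float s[64];
    int t = threadIdx.x;

    // No conflict: thread t hits bank (t % 32), all distinct within a warp
    s[t] = (float)t;
    __syncthreads();

    // 2-way conflict: thread t hits bank ((2*t) % 32), so threads t and t+16
    // of the same warp land on the same bank
    float x = s[(2 * t) % 64];

    out[t] = x;
}
```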

Page 24

Matrix Multiplication Example

Matrix Multiplication Example

• Reduce accesses to global memory
– A is read (B.width/BLOCK_SIZE) times from global memory
– B is read (A.height/BLOCK_SIZE) times from global memory
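A sketch of the shared-memory tiling the slide refers to (square matrices of width N assumed, with N a multiple of BLOCK_SIZE):

```cuda
#define BLOCK_SIZE 16

__global__ void matMul(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float acc = 0.0f;

    // March tile-by-tile along the shared dimension
    for (int t = 0; t < N / BLOCK_SIZE; ++t) {
        // each element is loaded from global memory once per tile pass
        As[threadIdx.y][threadIdx.x] = A[row * N + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();                      // tiles fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)  // reuse the tiles from shared memory
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // done with these tiles
    }
    C[row * N + col] = acc;
}
```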

Page 25

Demo

• Matrix multiplication
– With and without shared memory
– Different block sizes

Control Flow

• if, switch, do, for, while
• Branch divergence in a warp
– Threads in a warp take different instruction paths

• Different execution paths will be serialized
• Increases the number of instructions executed in that warp
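A sketch of a divergent branch, and a common way to avoid it by branching on a warp-aligned quantity (warp size 32 assumed):

```cuda
__global__ void divergent(float* d)
{
    int t = threadIdx.x;
    // Divergent: even and odd lanes of the same warp take different paths,
    // so the two paths execute serially
    if (t % 2 == 0) d[t] *= 2.0f;
    else            d[t] += 1.0f;
}

__global__ void warpUniform(float* d)
{
    int t = threadIdx.x;
    // All 32 threads of a warp share the same (t / 32) value,
    // so each warp takes exactly one path: no divergence
    if ((t / 32) % 2 == 0) d[t] *= 2.0f;
    else                   d[t] += 1.0f;
}
```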

Page 26

Branch  Divergence  

Summary

• 5 steps for CUDA programming
• NVIDIA hardware architecture
– Memory hierarchy: global memory, shared memory, register file
– Specifications of a device: block, warp, thread, SM

Page 27

Summary

• Memory optimization
– Reduce overhead due to host-device memory transfer with CUDA streams, zero copy
– Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
– Try to avoid bank conflicts in shared memory

• Control flow
– Try to avoid branch divergence in a warp

References

• CUDA C Programming Guide
• CUDA Best Practices Guide
• http://www.developer.nvidia.com/cuda-toolkit

Page 28

