
The “New” Moore’s Law - UMIACSramani/cmsc828e_gpusci/Luebke_Maryland.… · MOTIVATION 146X...

Date post: 03-Oct-2020
Page 1: The “New” Moore’s Law - UMIACSramani/cmsc828e_gpusci/Luebke_Maryland.… · MOTIVATION 146X Interactive visualization of volumetric white matter connectivity 36X Ionic placement
Page 2

© NVIDIA Corporation 2009

The “New” Moore’s Law

Computers no longer get faster, just wider

You must re-think your algorithms to be parallel!

Data-parallel computing is the most scalable solution

Page 3

Enter the GPU

Massive economies of scale

Massively parallel

Page 4

Enter CUDA

Scalable parallel programming model

Minimal extensions to familiar C/C++ environment

Heterogeneous serial-parallel computing

Page 5

Sound Bite

GPUs + CUDA = The Democratization of Parallel Computing

Massively parallel computing has become a commodity technology

Page 6

MOTIVATION

[Figure: Peak GFLOP/s (0 to 1000) over time (Sep-02 to Feb-08), NVIDIA GPU vs. Intel CPU]

Page 7

MOTIVATION

146X: Interactive visualization of volumetric white matter connectivity
36X: Ionic placement for molecular dynamics simulation on GPU
19X: Transcoding HD video stream to H.264
17X: Fluid mechanics in Matlab using .mex file CUDA function
100X: Astrophysics N-body simulation
149X: Financial simulation of LIBOR model with swaptions
47X: GLAME@lab: an M-script API for GPU linear algebra
20X: Ultrasound medical imaging for cancer diagnostics
24X: Highly optimized object oriented molecular dynamics
30X: Cmatch exact string matching to find similar proteins and gene sequences

Page 8

Motivation: NVIDIA

Supercomputing Performance
960 cores. 4 TeraFLOPS
250x the performance of a desktop

Personal
One researcher, one supercomputer
Plugs into standard power strip

Accessible
Program in C for Windows, Linux
Available now under $10,000

Page 9

Accelerating Time to Insight

CPU Only vs. Heterogeneous with Tesla GPU:

4.6 Days vs. 27 Minutes
2.7 Days vs. 30 Minutes
8 Hours vs. 13 Minutes
3 Hours vs. 16 Minutes

Faster is not “just faster” - David Kirk, NVIDIA Chief Scientist

Page 10

CUDA: ‘C’ FOR PARALLELISM

Standard C Code

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
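
The launch arithmetic on this slide can be checked on the host without a GPU. The following is a minimal sketch in plain C that emulates the grid serially; it is not how CUDA schedules threads, and the helper names `blocks_for` and `saxpy_emulated` are made up for illustration. It shows why `(n + 255) / 256` rounds the block count up and why the `if (i < n)` guard is needed for the surplus threads in the last block.

```c
/* Number of blocks needed to cover n elements: ceiling of n / block_size. */
int blocks_for(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

/* Serial emulation of saxpy_parallel<<<nblocks, block_size>>>(n, a, x, y).
   Returns how many emulated threads had an index outside [0, n). */
int saxpy_emulated(int n, float a, const float *x, float *y, int block_size)
{
    int idle = 0;
    int nblocks = blocks_for(n, block_size);
    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < block_size; ++thread) {
            /* same formula as blockIdx.x*blockDim.x + threadIdx.x */
            int i = block * block_size + thread;
            if (i < n)
                y[i] = a * x[i] + y[i];
            else
                ++idle;   /* surplus thread in the last block does nothing */
        }
    }
    return idle;
}
```

For n = 1000 and 256 threads per block, `blocks_for` gives 4 blocks (1024 threads), so 24 emulated threads fail the guard and touch no memory.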

Page 11

Hierarchy of concurrent threads

Parallel kernels composed of many threads

all threads execute the same sequential program

Threads are grouped into thread blocks

threads in the same block can cooperate

Threads/blocks have unique IDs

[Diagram: thread t; block b made of threads t0, t1, … tB; kernel foo() made of many blocks]

Page 12

Hierarchical organization

Thread: per-thread local memory

Block: per-block shared memory, local barrier

Kernel 0, Kernel 1, . . .: per-device global memory, global barrier between kernels

Page 13

Heterogeneous Programming

CUDA = serial program with parallel kernels, all in C

Serial C code executes in a CPU thread

Parallel kernel C code executes in thread blocks across multiple processing elements

Serial Code

. . .

. . .

Parallel Kernel

foo<<< nBlk, nTid >>>(args);

Serial Code

Parallel Kernel

bar<<< nBlk, nTid >>>(args);

Page 14

Thread = virtualized scalar processor

Independent thread of execution

has its own PC, variables (registers), processor state, etc.

no implication about how threads are scheduled

Page 15

Block = virtualized multiprocessor

Provides programmer flexibility

freely choose processors to fit data

freely customize for each kernel launch

Thread block = a (data) parallel task

all blocks in kernel have the same entry point

but may execute any code they want

Thread blocks of kernel must be independent tasks

program valid for any interleaving of block executions
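
The independence requirement can be illustrated with a small host-side experiment. The following is a sketch in plain C, assuming a vector-add workload; the names `vec_add_block` and `run_blocks` are hypothetical, and a real GPU runs blocks concurrently rather than one after another. Each emulated block writes only its own slice, so any execution order gives the same output.

```c
/* Work of one emulated thread block: c = a + b on the block's own slice. */
void vec_add_block(int block, int block_size, int n,
                   const float *a, const float *b, float *c)
{
    for (int thread = 0; thread < block_size; ++thread) {
        int i = block * block_size + thread;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

/* Execute the blocks listed in order[0..nblocks-1], one after another. */
void run_blocks(const int *order, int nblocks, int block_size, int n,
                const float *a, const float *b, float *c)
{
    for (int k = 0; k < nblocks; ++k)
        vec_add_block(order[k], block_size, n, a, b, c);
}
```

Calling `run_blocks` with the orders {0, 1, 2} and {2, 0, 1} on the same inputs fills `c` identically. A kernel in which one block waited for a value produced by another block would not have this property.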

Page 16

Scalable Execution Model

Kernel launched by host

[Diagram: a row of multiprocessors, each with an MT IU, SPs, and shared memory, all connected to a common device memory]

Blocks Run on Multiprocessors

Page 17

Synchronization & Cooperation

Threads within block may synchronize with barriers
… Step 1 …
__syncthreads();
… Step 2 …

Blocks coordinate via atomic memory operations
e.g., increment shared queue pointer with atomicInc()

Implicit barrier between dependent kernels
vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);

Page 18

Using per-block shared memory

Variables shared across block
__shared__ int *begin, *end;

Scratchpad memory
__shared__ int scratch[blocksize];
scratch[threadIdx.x] = begin[threadIdx.x];
// … compute on scratch values …
begin[threadIdx.x] = scratch[threadIdx.x];

Communicating values between threads
scratch[threadIdx.x] = begin[threadIdx.x];
__syncthreads();
int left = scratch[threadIdx.x - 1];  // only valid for threadIdx.x > 0
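
The neighbor exchange above depends on every thread finishing its write before any thread reads. A serial sketch in plain C makes the two phases explicit; the function name `left_neighbors`, the fixed `BLOCK_SIZE`, and the choice of 0 as thread 0's "left" value are assumptions for illustration, not part of the slide.

```c
#define BLOCK_SIZE 8

/* Emulates one block: each thread t publishes begin[t] to the scratchpad,
   then reads the value published by thread t - 1. */
void left_neighbors(const int *begin, int *left)
{
    int scratch[BLOCK_SIZE];

    /* Phase 1: every thread writes its own slot. */
    for (int t = 0; t < BLOCK_SIZE; ++t)
        scratch[t] = begin[t];

    /* The end of phase 1 plays the role of __syncthreads(). */

    /* Phase 2: every thread reads its left neighbor's slot.
       Thread 0 has no left neighbor, so it gets 0 here. */
    for (int t = 0; t < BLOCK_SIZE; ++t)
        left[t] = (t > 0) ? scratch[t - 1] : 0;
}
```

Without the barrier, a thread on the GPU could read `scratch[t - 1]` before thread t - 1 has written it. The split into two loops is what the barrier guarantees.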

Page 19

Summing Up

CUDA = C + a few simple extensions

makes it easy to start writing basic parallel programs

Three key abstractions:

1. hierarchy of parallel threads

2. corresponding levels of synchronization

3. corresponding memory spaces

Supports massive parallelism of manycore GPUs

Page 20

SOME FINAL THOUGHTS

We should teach parallel computing in CS 1 or CS 2

Remember: computers don’t get faster, just wider

Heapsort and mergesort

Both O(n lg n)

One parallel-friendly, one not

Students need to understand this early

Page 21

Conclusion

GPUs are massively parallel manycore computers

Ubiquitous - most successful parallel processor in history

Useful - users achieve huge speedups on real problems

CUDA is a powerful parallel architecture and programming model

Heterogeneous - mixed serial-parallel programming

Scalable - hierarchical thread execution model

Accessible – e.g. minimal but expressive changes to C

They provide tremendous scope for innovative research

Page 22

Questions?

Page 23

Example: Vector Add w/ Host Code

Page 24

© NVIDIA Corporation 2008

Example: Vector Addition Kernel

Device Code

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    // (covers every element only when N is a multiple of 256)
    vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}

Page 25

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

Host Code

int main()
{
    // Run N/256 blocks of 256 threads each
    // (covers every element only when N is a multiple of 256)
    vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}

Page 26

Example: Host code for vecAdd

// allocate and initialize host (CPU) memory

float *h_A = …, *h_B = …;

// allocate device (GPU) memory

float *d_A, *d_B, *d_C;

cudaMalloc( (void**) &d_A, N * sizeof(float));

cudaMalloc( (void**) &d_B, N * sizeof(float));

cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device

cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );

cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each

vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

Page 27

Example: Reduction

Page 28

Example: Parallel Reduction

Summing up a sequence with 1 thread:

int sum = 0;
for(int i=0; i<N; ++i) sum += x[i];

Parallel reduction builds a summation tree

each thread holds 1 element

stepwise partial sums

N threads need log N steps

one possible approach: Butterfly pattern

Page 30

Parallel Reduction for 1 Block

// INPUT: Thread i holds value x_i

int i = threadIdx.x;

__shared__ int sum[blocksize];

// One thread per element

sum[i] = x_i; __syncthreads();

for(int bit=blocksize/2; bit>0; bit/=2)

{

int t=sum[i]+sum[i^bit]; __syncthreads();

sum[i]=t; __syncthreads();

}

// OUTPUT: Every thread now holds sum in sum[i]
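
The butterfly pattern can be traced on the host. The following is a sketch in plain C that emulates one block serially, under the assumption that `BLOCKSIZE` is a power of two; `butterfly_sum` is a made-up name. The temporary array `t` stands in for each thread's private `t`, and the gap between the two inner loops stands in for the first `__syncthreads()`.

```c
#define BLOCKSIZE 16

/* Serial emulation of the one-block butterfly reduction.
   On return every element of sum[] holds the total of the inputs. */
void butterfly_sum(int sum[BLOCKSIZE])
{
    int t[BLOCKSIZE];
    for (int bit = BLOCKSIZE / 2; bit > 0; bit /= 2) {
        /* Every thread reads its own value and its partner's (index i^bit). */
        for (int i = 0; i < BLOCKSIZE; ++i)
            t[i] = sum[i] + sum[i ^ bit];
        /* All reads are complete before any thread overwrites sum[]. */
        for (int i = 0; i < BLOCKSIZE; ++i)
            sum[i] = t[i];
    }
}
```

Because partners are paired by XOR, each step doubles the number of inputs folded into every slot, so after log2(BLOCKSIZE) = 4 steps all 16 slots hold the full sum. If the read and the write were not separated, a thread could pick up a partner value that had already been updated in the same step.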

Page 31

Reduction tree redux

Input (shared memory):
10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2

Step 1: active threads 0 1 2 3 4 5 6 7, x[i] += x[i+8];
8 -2 10 6 0 9 3 7 -2 -3 2 7 0 11 0 2

Step 2: active threads 0 1 2 3, x[i] += x[i+4];
8 7 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2

Step 3: active threads 0 1, x[i] += x[i+2];
21 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2

Step 4: active thread 0, x[i] += x[i+1];
41 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2

Final result: 41 (in x[0])
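
The trace on this slide can be reproduced with a few lines of host code. This is a serial sketch in plain C, assuming the element count is a power of two; the function names are invented for the example. In each step the active threads are 0 to stride-1, and thread i adds x[i+stride] into x[i].

```c
/* One step of sequential addressing: threads 0..stride-1 each do
   x[i] += x[i + stride]. Reads and writes never touch the same slot. */
void reduce_step_sequential(int *x, int stride)
{
    for (int i = 0; i < stride; ++i)
        x[i] += x[i + stride];
}

/* Full reduction of n elements (n a power of two); the total lands in x[0]. */
int reduce_sequential(int *x, int n)
{
    for (int stride = n / 2; stride > 0; stride /= 2)
        reduce_step_sequential(x, stride);
    return x[0];
}
```

Run on the slide's 16 inputs, the first step leaves 8 -2 10 6 0 9 3 7 in the lower half and the full reduction returns 41, matching the rows above.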

Page 32

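Compare to interleaved addressing:

Input (shared memory):
10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2

Step 1: active threads 0 1 2 3 4 5 6 7, stride 1 (thread t adds x[2t+1] into x[2t]):
11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2

Step 2: active threads 0 1 2 3, stride 2 (thread t adds x[4t+2] into x[4t]):
18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2

Step 3: active threads 0 1, stride 4 (thread t adds x[8t+4] into x[8t]):
24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2

Step 4: active thread 0, stride 8 (adds x[8] into x[0]):
41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2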
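
For comparison, the interleaved scheme can be emulated the same way. This is a serial sketch in plain C, assuming a power-of-two element count; `reduce_interleaved` is a hypothetical name. The stride starts at 1 and doubles, and the partial sums stay spread across the array instead of being packed at the front.

```c
/* Interleaved addressing: at each step, thread t works on index
   2*stride*t and adds in the element stride positions to its right. */
int reduce_interleaved(int *x, int n)
{
    for (int stride = 1; stride < n; stride *= 2) {
        int active = n / (2 * stride);      /* number of active threads */
        for (int t = 0; t < active; ++t) {
            int i = 2 * stride * t;
            x[i] += x[i + stride];
        }
    }
    return x[0];
}
```

On the slide's input this returns 41 and leaves the array as 41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2, the last row of the trace. The active threads touch strided rather than contiguous locations, which is the contrast with the sequential-addressing version.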

Page 33

OpenCL

Page 34

•© NVIDIA Corporation 2007

CUDA: An Architecture for Massively Parallel Computing

ATI’s Compute “Solution”

Page 35

OpenCL vs. C for CUDA

Shared back-end compiler & optimization technology

OpenCL: entry point for developers who want low-level API

C for CUDA: entry point for developers who prefer high-level C

[Diagram: OpenCL and C for CUDA both compile to PTX, which targets the GPU]

Page 36

FFT Kernel Example

OPENCL

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,

__local float *sMemx, __local float *sMemy)

{

int tid = get_local_id(0); int blockIdx = get_group_id(0) * 1024 + tid;

float2 data[16];

in = in + blockIdx; out = out + blockIdx;

globalLoads(data, in, 64); // coalesced global reads

fftRadix16Pass(data); // in-place radix-16 pass

twiddleFactorMul(data, tid, 1024, 0);

localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));

fftRadix16Pass(data); // in-place radix-16 pass

twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication

localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

fftRadix4Pass(data);

fftRadix4Pass(data + 4); // four radix-4 function calls

fftRadix4Pass(data + 8);

fftRadix4Pass(data + 12);

globalStores(data, out, 64); // coalesced global writes

}

C for CUDA (Written by Vasily Volkov, © UC

__global__ void FFT1024_device( float2 *dst, float2 *src )

{

int tid = threadIdx.x; int iblock = blockIdx.y * gridDim.x + blockIdx.x;

int index = iblock * 1024 + tid; src += index; dst += index;

int hi4 = tid>>4; int lo4 = tid&15; int hi2 = tid>>4; int mi2 = (tid>>2)&3; int lo2 = tid&3;

float2 a[16];

__shared__ float smem[69*16];

load<16>( a, src, 64 );

FFT16( a );

twiddle<16>( a, tid, 1024 );

int il[] = {0,1,2,3, 16,17,18,19, 32,33,34,35, 48,49,50,51};

transpose<16>( a, &smem[lo4*65+hi4], 4, &smem[lo4*65+hi4*4], il );

FFT4x4( a );

twiddle4x4( a, lo4 );

transpose4x4( a, &smem[hi2*17 + mi2*4 + lo2], 69, &smem[mi2*69*4 + hi2*69 + lo2*17 ], 1, 0xE );

FFT16( a );

store<16>( a, dst, 64 );

}

Calculate Index

Load Data

FFT Kernel

Page 37

Different Host Code Styles

Calling a C function in nvcc

extern "C" void FFT1024( float2 *work, int batch )

{

FFT1024_device<<< grid2D(batch), 64 >>>( work, work );

}

OpenCL API-style programming

// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL);

// create the compute program
program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");

// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);

// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);

Source:

SIGGraph sneak preview

A Munshi, Apple Computer

NVIDIA’s PTX layer manages kernel

resources and execution

Page 38

Sparse Linear Algebra Results

Page 39

Sparse Matrix-Vector Multiplication (SpMV) on CUDA

Experimented with several data structures

CSR: Compressed Sparse Row

HYB: Hybrid of ELLPACK (ELL) and Coordinate (COO) formats

HYB gave best results

Speed of ELL with flexibility of COO

Benchmarked against matrices from

“Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, S. Williams et al, Supercomputing 2007
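
For readers unfamiliar with the CSR layout named above, here is a minimal serial sketch of SpMV over a CSR matrix in plain C. It is not the kernel benchmarked in the talk; the function and array names (`spmv_csr`, `row_ptr`, `col_idx`, `values`) follow common convention and are assumptions here. The ELL, COO, and HYB formats are not shown.

```c
/* y = A*x for a matrix A with n_rows rows stored in CSR form.
   The nonzeros of row r are values[row_ptr[r]] .. values[row_ptr[r+1]-1],
   and col_idx[j] is the column of nonzero j. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const float *values, const float *x, float *y)
{
    /* Rows are independent, so a GPU version could give each row
       to its own thread (one possible mapping among several). */
    for (int row = 0; row < n_rows; ++row) {
        float dot = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            dot += values[j] * x[col_idx[j]];
        y[row] = dot;
    }
}
```

For the 3x3 matrix with rows (1 0 2), (0 3 0), (4 0 5), the arrays are `row_ptr = {0, 2, 3, 5}`, `col_idx = {0, 2, 1, 0, 2}`, and `values = {1, 2, 3, 4, 5}`. Rows with different nonzero counts do different amounts of work, which is one reason the choice of format matters on a GPU.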

Page 40

Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA

CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007

Page 41

Double Precision

Page 42

T10 Double Precision Floating Point

Feature | T10 double precision
Precision | IEEE 754
Rounding modes for FADD and FMUL | All 4 IEEE: round to nearest, zero, inf, -inf
Denormal handling | Full speed
NaN support | Yes
Overflow and Infinity support | Yes
Flags | No
FMA | Yes
Square root | Software with low-latency FMA-based convergence
Division | Software with low-latency FMA-based convergence
Reciprocal estimate accuracy | 24 bit
Reciprocal sqrt estimate accuracy | 23 bit
log2(x) and 2^x estimates accuracy | 23 bit

Page 43

Double Precision Floating Point

Page 44

Single Precision Floating Point

Feature | G80 | SSE | IBM Altivec | Cell SPE
Precision | IEEE 754 | IEEE 754 | IEEE 754 | IEEE 754
Rounding modes for FADD and FMUL | Round to nearest and round to zero | All 4 IEEE: round to nearest, zero, inf, -inf | Round to nearest only | Round to zero/truncate only
Denormal handling | Flush to zero | Supported, 1000’s of cycles | Supported, 1000’s of cycles | Flush to zero
NaN support | Yes | Yes | Yes | No
Overflow and Infinity support | Yes, only clamps to max norm | Yes | Yes | No, infinity
Flags | No | Yes | Yes | Some
Square root | Software only | Hardware | Software only | Software only
Division | Software only | Hardware | Software only | Software only
Reciprocal estimate accuracy | 24 bit | 12 bit | 12 bit | 12 bit
Reciprocal sqrt estimate accuracy | 23 bit | 12 bit | 12 bit | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit | No | 12 bit | No

Page 45

Products

Page 46

Tesla S1070 1U System

4 Teraflops (single precision)

800 watts (typical power)

Page 47

Tesla C1060 Board

957 Gigaflops (single precision)

160 Watts (typical power)

Page 48

Building a 100TF datacenter

CPU 1U Server: 4 CPU cores, 0.07 Teraflop, $2000, 400 W

Tesla 1U System: 4 GPUs (960 cores), 4 Teraflops, $8000, 800 W

CPU-only datacenter: 1429 CPU servers, $3.1 M, 571 kW

Tesla datacenter: 25 CPU servers + 25 Tesla systems, $0.31 M, 27 kW

Result: 10x lower cost, 21x lower power
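
The system counts on this slide follow from ceiling division, which the sketch below reproduces in plain C. Throughput is expressed in hundredths of a teraflop so the arithmetic stays in integers. The helper names are made up, and the sketch does not try to reproduce the slide's cost totals or the 27 kW figure, since the slide does not break those down per host.

```c
/* Smallest number of systems whose combined throughput reaches the target.
   Both arguments are in hundredths of a teraflop (0.07 TF -> 7). */
int systems_needed(int target_ctf, int per_system_ctf)
{
    return (target_ctf + per_system_ctf - 1) / per_system_ctf;
}

/* Total power draw in watts for a number of identical systems. */
long total_watts(int systems, int watts_each)
{
    return (long)systems * watts_each;
}
```

With a 100 TF target, `systems_needed(10000, 7)` gives the 1429 CPU servers and `systems_needed(10000, 400)` gives the 25 Tesla systems. `total_watts(1429, 400)` is 571,600 W, in line with the 571 kW shown for the CPU-only case, and the 25 Tesla systems alone draw 20,000 W before their host servers are counted.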

Page 49

Tesla Personal Supercomputer

Supercomputing Performance
Massively parallel CUDA Architecture
960 cores. 4 TeraFlops
250x the performance of a desktop

Personal
One researcher, one supercomputer
Plugs into standard power strip

Accessible
Program in C for Windows, Linux
Available now worldwide under $10,000

Page 50

C-for-CUDA SDK

NVIDIA C Compiler

NVIDIA Assembly for Computing

CPU Host Code

Integrated CPU and GPU C Source Code

Libraries: FFT, BLAS, CuDPP…

Example Source Code

CUDA Driver

Debugger

Profiler

Standard C Compiler

GPU CPU

Page 51

Quotes

Page 52

GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems.

Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.

Jack Dongarra

Professor, University of Tennessee

Author of Linpack

Page 53

We’ve all heard ‘desktop supercomputer’ claims in the past, but this time it’s for real: NVIDIA and its partners will be delivering outstanding performance and broad applicability to the mainstream marketplace.

Heterogeneous computing is what makes such a breakthrough possible.

Burton Smith

Technical Fellow, Microsoft

Formerly, Chief Scientist at Cray

