2011.02.18 marco parenzan - modelli di programmazione per le gpu

Date post: 04-Aug-2015
Upload: marco-parenzan
Transcript
Page 1: 2011.02.18   marco parenzan - modelli di programmazione per le gpu

15 April 2023 - slide 1 - MOSE – University of Trieste

Programming languages and compilers for GPUs

Marco Parenzan

GPU@UniTS

Page 2: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


WARNING!

Programming languages and compilers for GPUs

Page 3: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


WARNING!

There is a choice to make now: Nvidia/CUDA… or not

GPU computing is not only Nvidia/CUDA computing
  Nvidia is the most advanced as a product
  Nvidia is the only one with a specific scientific product on sale now
  Fermi was late (1 year: scheduled for 2009, released in 2010)

So if CUDA is the main choice… tomorrow…?
  Intel is too late now, but Intel is Intel…

GPUs can be personal, not only in clusters
  Not only simulation
  Broadly available video cards are now powered by GPUs

We have GPUs thanks to games!

Page 4: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Why GPU Computing…

Over the past few years, the GPU has evolved from a fixed-function special-purpose processor into a full-fledged parallel programmable processor with additional fixed-function special-purpose functionality

Fixed Function Pipeline: lack of generality
More fully featured instruction set
Unified Shader Model
Increased programmability
Programmable engine surrounded by supporting fixed-function units

Page 5: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Products

Products you can choose, so you can try
  There is a BIG range of products
  There is a LITTLE range of scientific products

Main products
  Nvidia Fermi
  ATI Radeon
  Intel Larrabee
    Again, not yet released
    So late…!

Page 6: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Nvidia Cards

Compute capability | GPUs | Cards
1.0 | G80 | GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800
2.0 | GF100, GF110 | GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex7000, GTX570, GTX580
2.1 | GF108, GF106, GF104, GF114 | GT 420/30/40, GTS 450, GTX 460, 500M

Legend (color-coded on the slide): Professional Series, Computing Series, Consumer Series, Current Series

Page 7: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


ATI Cards

Retail/card series name | Chip series
R8500, R9000-R9250 | R200
R9500-R9800, x300-x600, x1050 | R300
X700-850 | R420
X1300-1950 | R520
HD2000-HD3000 | R600
HD4000 | R700
HD5000 | R800/Evergreen
HD6000 | R900/Northern Islands
HD7000 | R1000/Southern Islands

Legend (color-coded on the slide): Consumer Series, Current Series

Page 8: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


REFERENCE ARCHITECTURE

Programming languages and compilers for GPUs

Page 9: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


An Asymmetric Multi-Processor System

A GPU-enabled system is an asymmetric multi-processor system

Different abilities
  The GPU is designed for a particular class of applications
  GPUs have optimized instruction sets for parallel data computing

Different performance
  GPUs have optimized memory access and wide bandwidth for parallel data access

Page 10: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


An Asymmetric Multi-Processor System

Computational requirements are large
Parallelism is substantial
Throughput is more important than latency

[Diagram: CPU (50 GFlops) with CPU RAM (4-6 GB); GPU (1 TFlop) with GPU RAM (1 GB); link bandwidths labeled 10 GB/s, 100 GB/s and 1 GB/s]

Page 11: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


CPU vs GPU

CPU | GPU
Low latency memory | High bandwidth memory
Random accesses | Sequential accesses
20 GB/s bandwidth | 100 GB/s bandwidth
0.1 TFlop compute | 1 TFlop compute
1 GFlops/watt | 10 GFlops/watt
Well known programming model | Niche programming model
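
As a rough reading of these figures, the sketch below estimates how much arithmetic per transferred byte is needed before moving work to the GPU pays off. It assumes (the slides do not say this) that input data first has to cross a CPU-GPU link of about 1 GB/s, and it uses the 0.1 TFlop and 1 TFlop compute figures. The function name and the simple cost model are illustrative only.

```python
# Rough cost model (an assumption, not taken from the slides):
#   CPU time = flops / cpu_rate
#   GPU time = bytes / link_rate + flops / gpu_rate
CPU_FLOPS = 0.1e12        # 0.1 TFlop, from the CPU vs GPU comparison
GPU_FLOPS = 1.0e12        # 1 TFlop, from the CPU vs GPU comparison
LINK_BYTES_PER_S = 1.0e9  # assumed CPU-GPU transfer rate (1 GB/s)


def break_even_flops_per_byte(cpu_flops, gpu_flops, link_bytes_per_s):
    """Flops per transferred byte at which GPU time equals CPU time."""
    time_saved_per_flop = 1.0 / cpu_flops - 1.0 / gpu_flops
    return (1.0 / link_bytes_per_s) / time_saved_per_flop


intensity = break_even_flops_per_byte(CPU_FLOPS, GPU_FLOPS, LINK_BYTES_PER_S)
print(f"break-even: about {intensity:.0f} flops per transferred byte")
```

Under these assumptions the break-even point is roughly 111 flops per byte, which is one way to read "throughput is more important than latency": the GPU only wins when each transferred byte is reused in a lot of computation.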

Page 12: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


How to use a GPU

A GPU can execute only parts of our algorithm
  Parts that can handle independent blocks of data that can be managed in parallel…
  …and only for some tasks, compatible with the GPU instruction set

Loops, for example, are not GPU friendly
Think more functional
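
A minimal sketch of what "think more functional" can mean in practice, written in plain Python for illustration (the function and data are hypothetical, not from the talk). The per-element work is expressed as a function that depends only on its own input, so no iteration order is implied.

```python
def scale_and_offset(x):
    """Per-element work: depends only on its own input element."""
    return 2.0 * x + 1.0


data = [0.0, 1.0, 2.0, 3.0]

# Loop style: the iteration order and the growing result list are explicit.
looped = []
for x in data:
    looped.append(scale_and_offset(x))

# Functional style: "apply this function to every element". No order is
# implied, so each element could in principle go to a separate GPU thread.
mapped = list(map(scale_and_offset, data))
```

Both forms compute the same values; the second simply states less about how the work is scheduled, which is the property a GPU kernel relies on.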

Page 13: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


PROGRAMMING MODELS

Programming languages and compilers for GPUs

Page 14: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Programming Models

CUDA
  Specific to Nvidia hardware
  Multiplatform (Windows, Linux, Mac)

OpenCL
  Generic for multi-vendor hardware, including non-GPU hardware
  Multiplatform (Windows, Linux, Mac)

DirectCompute
  Generic for multi-vendor GPU hardware, including non-GPU hardware
  Single platform (Windows)

Page 15: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Pros & Cons

     | CUDA | OpenCL | DirectCompute
Pros | Most advanced; Mature; Multiplatform | Multiplatform; Standard as OpenGL | Success of DirectX family; Acceptance/Support by ATI
Cons | Proprietary | Not mature | Proprietary

Page 16: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


How to program: download toolkits

Nvidia
  http://developer.nvidia.com/object/cuda_3_2_downloads.html
  CUDA + OpenCL support

ATI
  http://developer.amd.com/gpu/amdappsdk/pages/default.aspx
  Accelerated Parallel Processing (aka Stream) API + OpenCL support

DirectCompute
  DirectX SDK (June 2010)
  http://www.microsoft.com/downloads/en/details.aspx?displaylang=en&FamilyID=3021d52b-514e-41d3-ad02-438a3ba730ba

In any case, download updated drivers for your GPU/graphics card
  Pay attention to mobile drivers... always in beta!

Page 17: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


How to program

Every platform has its own proprietary language
As GPUs are just specialized CPUs, the primary programming tool is a C-derived dialect
  CUDA: C for CUDA (PathScale Open64 C compiler)
  OpenCL: C99 dialect
  DirectCompute: HLSL (High Level Shading Language; "shading" as in shading of pixels, from its consumer background)

Page 18: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Kernel for SubBlock Matrix Multiplication in C for CUDA

////////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width
//! BLOCK_SIZE is assumed to be a compile-time constant defined elsewhere
////////////////////////////////////////////////////////////////////////////////
__global__ void
matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Shared-memory tiles of A and B for the current step
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Each thread loads one element of each tile
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Wait until both tiles are fully loaded
        __syncthreads();

        // Multiply the two tiles; each thread accumulates one element
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Wait before the tiles are overwritten in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
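
To make the index arithmetic of the kernel above easier to follow, here is a sequential CPU sketch in Python that walks the same blocks, threads and tiles. It is not part of the original slides; `matmul_tiled` is a hypothetical name, matrices are assumed to be flat row-major lists, and `BLOCK_SIZE` is assumed to divide every matrix dimension.

```python
BLOCK_SIZE = 2  # tile edge; assumed to divide all matrix dimensions


def matmul_tiled(A, B, hA, wA, wB):
    """C = A * B on flat row-major lists, mimicking the kernel's tiling."""
    C = [0.0] * (hA * wB)
    for by in range(hA // BLOCK_SIZE):          # block row  (blockIdx.y)
        for bx in range(wB // BLOCK_SIZE):      # block col  (blockIdx.x)
            for ty in range(BLOCK_SIZE):        # thread row (threadIdx.y)
                for tx in range(BLOCK_SIZE):    # thread col (threadIdx.x)
                    a_begin = wA * BLOCK_SIZE * by
                    b_begin = BLOCK_SIZE * bx
                    csub = 0.0
                    # Walk the tiles of A left to right, of B top to bottom
                    for t in range(wA // BLOCK_SIZE):
                        a = a_begin + t * BLOCK_SIZE
                        b = b_begin + t * BLOCK_SIZE * wB
                        for k in range(BLOCK_SIZE):
                            # As[ty][k] * Bs[k][tx], read directly from A and B
                            csub += A[a + wA * ty + k] * B[b + wB * k + tx]
                    c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx
                    C[c + wB * ty + tx] = csub
    return C
```

On the GPU the four outer loops disappear: every (block, thread) combination runs at once, and the tiles are staged in shared memory instead of being read from `A` and `B` each time.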

Page 19: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Kernel for SubBlock Matrix Multiplication in C for OpenCL

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void floatMatrixMultLocals(__global float * MResp,
                           __global float * M1,
                           __global float * M2,
                           __global int * q)
{
    //Identification of this workgroup
    int i = get_group_id(0);
    int j = get_group_id(1);

    //Identification of work-item
    int idX = get_local_id(0);
    int idY = get_local_id(1);

    //matrixes dimensions
    int p = get_global_size(0);
    int r = get_global_size(1);
    int qq = q[0];

    //Number of submatrixes to be processed by each worker (Q dimension)
    int numSubMat = qq/BLOCK_SIZE;

    float4 resp = (float4)(0,0,0,0);
    __local float A[BLOCK_SIZE][BLOCK_SIZE];
    __local float B[BLOCK_SIZE][BLOCK_SIZE];

    for (int k=0; k<numSubMat; k++)
    {
        //Copy submatrixes to local memory. Each worker copies one element
        //Notice that A[i,k] accesses elements starting from M[BLOCK_SIZE*i, BLOCK_SIZE*j]
        A[idX][idY] = M1[BLOCK_SIZE*i + idX + p*(BLOCK_SIZE*k+idY)];
        B[idX][idY] = M2[BLOCK_SIZE*k + idX + qq*(BLOCK_SIZE*j+idY)];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k2 = 0; k2 < BLOCK_SIZE; k2+=4)
        {
            float4 temp1=(float4)(A[idX][k2],A[idX][k2+1],A[idX][k2+2],A[idX][k2+3]);
            float4 temp2=(float4)(B[k2][idY],B[k2+1][idY],B[k2+2][idY],B[k2+3][idY]);
            resp += temp1 * temp2;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    MResp[BLOCK_SIZE*i + idX + p*(BLOCK_SIZE*j+idY)] = resp.x+resp.y+resp.z+resp.w;
}

Page 20: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Kernel for Matrix Multiplication in C for OpenCL

__kernel void floatMatrixMult(__global float * MResp,
                              __global float * M1,
                              __global float * M2,
                              __global int * q)
{
    // Vector element index
    int i = get_global_id(0);
    int j = get_global_id(1);

    int p = get_global_size(0);
    int r = get_global_size(1);

    MResp[i + p * j] = 0;
    int QQ = q[0];
    for (int k = 0; k < QQ; k++)
    {
        MResp[i + p * j] += M1[i + p * k] * M2[k + QQ * j];
    }
}
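
The flat indexing in this kernel implies that the matrices are stored column by column: element (i, k) of `M1` sits at `i + p * k`. The Python sketch below, which is not part of the original slides, emulates the kernel sequentially, one (i, j) "work-item" per output element. `float_matrix_mult` is a hypothetical name, and the local accumulator is a small illustrative change from the kernel, which adds into `MResp` directly.

```python
def float_matrix_mult(M1, M2, p, q, r):
    """Emulate the kernel on flat column-major lists.

    M1 is p x q, M2 is q x r, and the result is p x r.
    """
    MResp = [0.0] * (p * r)
    for j in range(r):          # plays the role of get_global_id(1)
        for i in range(p):      # plays the role of get_global_id(0)
            acc = 0.0           # accumulate locally, write the result once
            for k in range(q):
                acc += M1[i + p * k] * M2[k + q * j]
            MResp[i + p * j] = acc
    return MResp


# [[1, 2], [3, 4]] and [[5, 6], [7, 8]] stored column by column
product = float_matrix_mult([1.0, 3.0, 2.0, 4.0], [5.0, 7.0, 6.0, 8.0], 2, 2, 2)
```

No work-item reads a value written by another one, which is why the two outer loops can be replaced by a 2D range of work-items on the device.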

Page 21: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Kernel for Matrix Multiplication in HLSL for DirectCompute

[numthreads(GROUP_SIZE_X, GROUP_SIZE_Y, 1)]
void matrixMul( uint3 DTid : SV_DispatchThreadID )
{
    if (DTid.x < WidthB && DTid.y < HeightA)
    {
        float sum = 0;
        for (uint i = 0; i < WidthA; i++)
        {
            uint addrA = DTid.y * WidthA + i;
            uint addrB = DTid.x + i * WidthB;

            sum += MatrixA[addrA].val * MatrixB[addrB];
        }
        Output[DTid.y * WidthOut + DTid.x] = sum;
    }
}

Page 22: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Hosting GPUs

The asymmetric model needs a host language
  Executed on the CPU
  Coordinates tasks on the GPU

C/C++ embedding technology
  The GPU compiler strips the GPU-specific code from the source
  The CPU compiler compiles the remaining code, which executes the GPU-compiled code

Non-C/C++ language bindings
  GPU code is compiled via a sort of «compiler as a service»

Page 23: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Host Side API Code (example in DirectCompute)

Create device and context
Matrix A – structured buffer and shader resource view
Matrix B – float buffer and shader resource view
Output Matrix – float buffer and unordered access view
Create constant buffer
Compile and create shader
Execute
Read back

Page 24: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Host Side API Code (example in DirectCompute)

Page 25: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Bindings (example for CUDA)

Fortran - FORTRAN CUDA, PGI CUDA Fortran Compiler
Lua - KappaCUDA
IDL - GPULib
Mathematica - CUDALink
MATLAB - Jacket
.NET - CUDA.NET
Perl - KappaCUDA
Python - PyCUDA, KappaCUDA
Ruby - KappaCUDA
Java - jCUDA, JCuda, JCublas, JCufft

Page 26: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Example of PyCUDA code

import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule

# The CUDA kernel is passed to PyCUDA as a string and compiled at run time
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)

# One block of 400 threads: each thread multiplies one pair of elements
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400, 1, 1), grid=(1, 1))

# Difference from the CPU result
print(dest - a * b)

Page 27: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Custom Libraries

CUDA, OpenCL and DirectCompute are the building blocks
  The smallest «bricks» available to build our software

Are there any «bigger» bricks?
  Libraries that achieve specific results while targeting a GPU platform

Example
  Microsoft Research Accelerator
  http://research.microsoft.com/en-us/projects/Accelerator/

// ...
using Microsoft.ParallelArrays;
// ...

namespace AccelleratorDemo
{
  class Program
  {
    static void Main(string[] args)
    {
      // ....

      // Build a computation that calculates average of neighbors
      var input = new FloatParallelArray(nums);
      var sum =
        ParallelArrays.Shift(input, 1) + input +
        ParallelArrays.Shift(input, -1);
      var output = sum / 3.0f;

      // Run the computation
      var target = new DX9Target();
      var res = target.ToArray1D(output);

      // Output the original data and calculated 'blurred' result
      Action<float[]> WriteArray = (vals) =>
        Console.WriteLine(vals.Aggregate("", (str, f) => str + Math.Round(f) + ", "));

      // ...
    }
  }
}
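
For readers without the Accelerator library, the same "average with the two neighbours" computation can be sketched in plain Python. This is an illustration only: `shift` and `blur` are hypothetical helpers, and clamping at the array edges is an assumption of this sketch rather than a documented property of `ParallelArrays.Shift`.

```python
def shift(values, offset):
    """Return a list where result[i] = values[i + offset].

    Indices that fall outside the list are clamped to the nearest edge
    (an assumption made for this sketch).
    """
    last = len(values) - 1
    return [values[min(max(i + offset, 0), last)] for i in range(len(values))]


def blur(values):
    """Average each element with its previous and next neighbour."""
    following = shift(values, 1)
    preceding = shift(values, -1)
    return [(p + v + n) / 3.0 for p, v, n in zip(preceding, values, following)]


blurred = blur([3.0, 6.0, 9.0])
```

The point of the library version is that the whole expression is described on arrays, not on elements, so the library is free to turn it into GPU code.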

Page 28: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Metaprogramming

What is metaprogramming?
  Metaprogramming is the writing of computer programs that write or manipulate other programs (or themselves) as their data, or that do part of the work at compile time that would otherwise be done at runtime. (http://en.wikipedia.org/wiki/Metaprogramming)

Page 29: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


GPU Metaprogramming

The code executed on the GPU is not in the host code
The REFERENCE code is in the host code
  Metaprogramming techniques make it possible to generate the real executable code from the source code, which is used as a template, as a declaration of what is needed
  In many cases, this allows programmers to get more done in the same amount of time as they would take to write all the code manually, or it gives programs greater flexibility to efficiently handle new situations without recompilation.

An example in .NET
  GPU.NET (http://www.tidepowerd.com/)

All metaprogrammable languages are potential tools for this approach
  The GPU.NET code works in C#, Visual Basic, F#...
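
A minimal sketch of the idea in Python, not tied to GPU.NET or to any GPU toolkit: source code is kept as a template, specialized at run time, compiled, and only then executed. The template, `build_scale` and the generated `scale` function are all hypothetical names chosen for this illustration.

```python
from string import Template

# The "reference" code: a source template with a hole for a constant.
KERNEL_TEMPLATE = Template(
    "def scale(values):\n"
    "    return [x * $factor for x in values]\n"
)


def build_scale(factor):
    """Generate specialized source, compile it, and return the function."""
    source = KERNEL_TEMPLATE.substitute(factor=repr(factor))
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["scale"], source


triple, generated_source = build_scale(3)
tripled = triple([1, 2, 3])
```

A GPU toolchain does the same thing with a different back end: instead of Python source, it emits and compiles code for the device.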

Page 30: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Sample in GPU.NET

[Kernel(CustomFallbackMethod = "AddCpu")]
private static void AddGpu(float[] a, float[] b, float[] c)
{
    // Get the thread id and total number of threads
    int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
    int TotalThreads = BlockDimension.X * GridDimension.X;

    // Loop over the vectors 'a' and 'b', adding them pairwise and storing the sums in 'c'
    for (int ElementIndex = ThreadId; ElementIndex < a.Length; ElementIndex += TotalThreads)
    {
        c[ElementIndex] = a[ElementIndex] + b[ElementIndex];
    }
}

GPU code is generated at run time and loaded into the GPU from native .NET methods decorated with [Kernel]
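
The loop in `AddGpu` is what is often called a grid-stride loop: each thread starts at its own id and advances by the total number of threads. The Python sketch below, an illustration that is not part of the original slides, runs the kernel body once per emulated thread to show that every element is covered whatever the launch size. `add_gpu_emulated` and its parameters are hypothetical.

```python
def add_gpu_emulated(a, b, block_dim, grid_dim):
    """Run the kernel body once per emulated thread, sequentially."""
    c = [None] * len(a)
    total_threads = block_dim * grid_dim
    for block_index in range(grid_dim):
        for thread_index in range(block_dim):
            thread_id = block_dim * block_index + thread_index
            # Same shape as the loop in AddGpu: start at the thread id,
            # then step by the total number of threads.
            for i in range(thread_id, len(a), total_threads):
                c[i] = a[i] + b[i]
    return c


a = list(range(10))
b = [10 * x for x in a]
few_threads = add_gpu_emulated(a, b, block_dim=2, grid_dim=2)    # 4 threads
many_threads = add_gpu_emulated(a, b, block_dim=8, grid_dim=4)   # 32 threads
```

With 4 threads each one handles two or three elements; with 32 threads most handle none. In both cases the result is the same, so the kernel does not depend on the launch configuration.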

Page 31: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


CONCLUSIONS

Programming languages and compilers for GPUs

Page 32: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Conclusions

GPUs give us big opportunities
  The market and research have understood it
  The developer community too

Now we have to watch the market grow
  And so the tools
  And so the languages

2011 will be the year of APUs
  CPU + GPU on a single die

Page 33: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Which programming model?

No one knows

Again, CUDA is the most efficient and advanced
  ...but it runs only on Nvidia cards
  No one knows if ATI will adopt it

OpenCL is not so advanced
  But its relation to the others is the same as OpenGL's
  ...and Nvidia knows it.

DirectCompute?
  On Windows hosts, with DirectX experience, it's another big opportunity

I think....
  Specific programming models (CUDA) for ISVs which can write drivers...
  OpenCL or DirectCompute for custom/lab/«home» development

Page 34: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Attention!

The GPU is the cheapest computing power, but...
  You can execute only streamable code (data processing)
  What about more generic code?

Page 35: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


TRENDS

Programming languages and compilers for GPUs

Page 36: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Parallel

Parallel thinks about data
  Distinct blocks of data handled in the same way
  Distinct blocks of data handled at the same time

Parallel is all about distributing the same computation task across different computation resources at the same time
  The result is divided into blocks
  Result blocks must be merged into one general result

Software can be parallel
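
A small sketch of this split, process and merge pattern using only the Python standard library. The helper names and the block size are hypothetical; the thread pool is there to show the structure, not to demonstrate a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(data, block_size):
    """Cut the data into distinct blocks of at most block_size elements."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def sum_of_squares(block):
    """The same computation, applied to every block."""
    return sum(x * x for x in block)


def parallel_sum_of_squares(data, block_size=4, workers=4):
    blocks = split_into_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, blocks))  # one result per block
    return sum(partials)  # merge the block results into one general result


total = parallel_sum_of_squares(list(range(10)))
```

The merge step is the only place where blocks meet; everything before it can run on as many resources as are available.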

Page 37: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Asynchronous

Async is not Parallel
  Parallel thinks about data (and probably just 1 task)
  Async thinks about tasks

Not all tasks are sequential
  Some tasks are sequential
  Some tasks are parallel
  Non-sequential tasks have to be coordinated

Software has to be more async
  Programmers don't think async
  Programmers are not trained to think async
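
A minimal sketch of task coordination with Python's `asyncio`, given as an illustration only: the task names and delays are made up. Two independent tasks are started together, and a third step waits for both.

```python
import asyncio


async def load(name, delay):
    """Stand-in for an I/O-bound task (the delay is simulated)."""
    await asyncio.sleep(delay)
    return f"{name} loaded"


async def main():
    # Two independent tasks run concurrently and are then coordinated...
    first, second = await asyncio.gather(load("mesh", 0.02), load("params", 0.01))
    # ...while this step is sequential: it needs both results.
    return f"simulate({first}, {second})"


result = asyncio.run(main())
```

Here the unit of thought is the task and its dependencies, not the data block, which is the distinction the slide draws between async and parallel.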

Page 38: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Functional

Imperative languages have no primitives to handle async or parallel
  Task coordination has to be adaptive
  There is no single way to do it
  Coordination has to be "simplified"

Async or parallel tasks are better coded in a declarative language
  Let the compiler do the infrastructure and optimization work

Functional is a special kind of declarative language
  The metaphor is the function as a first-class citizen in the language
  A function is treated as a value
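
A short Python illustration of "a function is treated as a value": functions are stored in variables, passed to other functions and combined into new ones. `compose` is a hypothetical helper written for this sketch.

```python
from functools import reduce


def compose(*functions):
    """Return a function applying the given functions from right to left."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(functions), x)


def square(x):
    return x * x


def increment(x):
    return x + 1


# Functions are passed around and combined like any other value.
square_then_increment = compose(increment, square)
pipeline_result = list(map(square_then_increment, [1, 2, 3]))
```

Because the pipeline says what to compute for each element and nothing about ordering, a compiler or runtime is free to decide how to schedule it.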

Page 39: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Cloud

Pros:
  On-demand scalability
  Total Cost of Ownership (?)
  Different levels of service (SaaS, PaaS, XaaS)

Cons:
  Broadband (speed)
  Privacy (?)

Page 40: 2011.02.18   marco parenzan - modelli di programmazione per le gpu


Links

Marco Parenzan

blog: http://blog.codeisvalue.com/
email: [email protected]
web: http://www.codeisvalue.com/
twitter: @marco_parenzan
slideshare: http://www.slideshare.com/marco.parenzan
facebook: http://www.facebook.com/parenzan.marco
linked-in: http://it.linkedin.com/in/marcoparenzan

