YaCF: The accULL Compiler
Undergraduate Thesis Project

Juan Jose Fumero Alfonso
Universidad de La Laguna

June 22, 2012
Outline
1 Introduction
2 YaCF
3 Experiments
4 Conclusions
5 Future Work
Moore’s Law
The number of transistors on a chip doubles approximately every 18 months.
Parallel Architectures Today
Parallel Architectures
The solution
• More processors
• More cores per processor
Parallel Architectures
Modern systems are hybrid, combining all of these options.
OpenMP: Shared Memory Programming
• An API that supports shared-memory (SMP) programming.
• Multi-platform.
• A directive-based approach.
• A set of compiler directives, library routines and environment variables for parallel programming.
OpenMP example
#pragma omp parallel
{
    #pragma omp master
    {
        nthreads = omp_get_num_threads();
    }
    #pragma omp for private(x) reduction(+:sum) schedule(runtime)
    for (i = 0; i < NUM_STEPS; ++i) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    #pragma omp master
    {
        pi = step * sum;
    }
}
MPI: Message Passing Interface
• A language-independent communications protocol used to program parallel applications.
• MPI's goals are high performance, scalability and portability.
MPI example
MPI_Comm_size(MPI_COMM_WORLD, &MPI_NUMPROCESSORS);
MPI_Comm_rank(MPI_COMM_WORLD, &MPI_NAME);
w = 1.0 / N;
for (i = MPI_NAME; i < N; i += MPI_NUMPROCESSORS) {
    local = (i + 0.5) * w;
    pi_mpi = pi_mpi + 4.0 / (1.0 + local * local);
}
MPI_Allreduce(&pi_mpi, &gpi_mpi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
High Performance Computing
• The most powerful computers available today.
• Systems with a massive number of processors, containing thousands of cores.
• Very high calculation speed.
• Very expensive systems that consume a huge amount of energy.
TOP500: High Performance Computing
• The TOP500 project ranks and details the 500 most powerful (non-distributed) computer systems in the world.
• The project publishes an updated list of the supercomputers twice a year.
Accelerators Era
Languages for Heterogeneous Programming

CUDA
Developed by NVIDIA.
• Pros: high performance; easier to use than OpenCL.
• Con: works only with NVIDIA hardware.
CUDA example
__global__ void mmkernel(float *a, float *b, float *c,
                         int n, int m, int p)
{
    int i = blockIdx.x * 32 + threadIdx.x;
    int j = blockIdx.y;
    float sum = 0.0f;
    for (int k = 0; k < p; ++k)
        sum += b[i + n * k] * c[k + p * j];
    a[i + n * j] = sum;
}
Languages for Heterogeneous Programming
OpenCL
A framework developed by the Khronos Group.
• Pros: can be used with any device; it is a standard.
• Cons: more complex than CUDA; still immature.
OpenCL example
__kernel void matvecmul(__global float *a,
                        const __global float *b, const __global float *c,
                        const uint N) {
    float R;
    int k;
    int xid = get_global_id(0);
    int yid = get_global_id(1);
    if (xid < N) {
        if (yid < N) {
            R = 0.0;
            for (k = 0; k < N; k++)
                R += b[xid * N + k] * c[k * N + yid];
            a[xid * N + yid] = R;
        }
    }
}
Languages for Heterogeneous Programming
Pros
1 The programmer can use all of the machine's devices.
2 The GPU and CPU can work in parallel.
Languages for Heterogeneous Programming
Cons
1 The programmer needs to know low-level details of the architecture.
2 Source code needs to be rewritten:
  • One version for OpenMP/MPI.
  • A different version for the GPU.
3 Good performance requires a great effort in parameter tuning.
4 These languages (CUDA/OpenCL) are complex and new for non-experts.
GPGPU (General-Purpose GPU) Computing

Can we use GPUs for parallel computing? Is this efficient?
The NBody Problem
• The simulation numerically approximates the evolution of a system of bodies.
• Each body continuously interacts with all the other bodies.
• Example application: fluid flow simulations.
NBody description
Acceleration

$$a_i = \frac{F_i}{m_i}$$

$$a_i \approx G \cdot \sum_{1 \le j \le N} \frac{m_j \, r_{ij}}{\left(\lVert r_{ij} \rVert^2 + \varepsilon^2\right)^{3/2}}$$
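For illustration, a minimal C sketch of this softened pairwise interaction; the helper type and names here are my own, not thesis code (the CUDA kernels below use an analogous bodyBodyInteraction device function):

#include <math.h>

typedef struct { float x, y, z; } vec3;

#define G    6.674e-11f  /* gravitational constant */
#define EPS2 1e-9f       /* softening factor epsilon^2 */

/* Accumulate onto `acc` the acceleration that body j (mass mj, position pj)
   induces on body i (position pi), using the softened formula above. */
vec3 body_body_interaction(vec3 pi, vec3 pj, float mj, vec3 acc)
{
    vec3 r = { pj.x - pi.x, pj.y - pi.y, pj.z - pi.z };     /* r_ij */
    float dist2 = r.x * r.x + r.y * r.y + r.z * r.z + EPS2; /* ||r||^2 + eps^2 */
    float inv_dist3 = 1.0f / sqrtf(dist2 * dist2 * dist2);  /* (...)^(-3/2) */
    float s = G * mj * inv_dist3;
    acc.x += r.x * s;
    acc.y += r.y * s;
    acc.z += r.z * s;
    return acc;
}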
CUDA implementation
• The method is Particle-to-Particle.
• Its computational complexity is O(n²).
• It evaluates all pair-wise interactions, so it is exact.
CUDA implementation: blocks and grids
CUDA Kernel: Tile calculation
__device__ float3 tile_calculation(float4 myPos, float3 accel)
{
    extern __shared__ float4 sharedPos[];
    unsigned long i = 0;

    for (unsigned int counter = 0; counter < blockDim.x; counter++)
    {
        /* SX(i) indexes the shared-memory position buffer (NBody SDK macro). */
        accel = bodyBodyInteraction(accel, SX(i++), myPos);
    }
    return accel;
}
CUDA Kernel: calculate forces
__global__ void calculate_forces(float4 *globalX, float4 *globalA)
{
    /* A shared memory buffer to store the body positions. */
    extern __shared__ float4 shPosition[];
    int i, tile;
    float3 acc = {0.0f, 0.0f, 0.0f};
    /* Global thread ID (the unique body index in the simulation). */
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    /* The position of the body we are computing the acceleration for. */
    float4 myPosition = globalX[gtid];
    /* N: total number of bodies (assumed defined elsewhere). */
    for (i = 0, tile = 0; i < N; i += blockDim.x, tile++)
    {
        int idx = tile * blockDim.x + threadIdx.x;
        shPosition[threadIdx.x] = globalX[idx];
        __syncthreads();
        acc = tile_calculation(myPosition, acc);
        __syncthreads();
    }
    /* Save the accumulated acceleration in global memory. */
    float4 acc4 = {acc.x, acc.y, acc.z, 0.0f};
    globalA[gtid] = acc4;
}
Results

• Tesla C1060 (compute capability 1.3).
• Sequential source code: Intel Core i7 930.
• NBody SDK.
• CUDA Runtime / CUDA Driver: 4.0.
• 400000 bodies.
• 200 iterations.

Device         Cores  Memory  Performance (GFLOPS)
Tesla C1060    240    4 GB    933 (single), 78 (double)
Intel Core i7  4      4 GB    44.8 (11.2 per core)
• Sequential code: ≈ 147202512.40 ms ≈ 41 hours.
• Parallel CUDA code: 1392029.6 ms ≈ 23.2 minutes.
• The speedup is about 105×.
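The reported speedup follows directly from the two timings:

$$S = \frac{t_{\text{seq}}}{t_{\text{CUDA}}} = \frac{147202512.4\ \text{ms}}{1392029.6\ \text{ms}} \approx 105.7$$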
At the Present Time
• Some applications are accelerated with GPUs.
• The user needs to learn new programming languages and tools.
• The CUDA model and its architecture have to be understood.
• Non-expert users have to write programs for a new model.
GPGPU Languages
OpenACC: introduced last November at SuperComputing 2011
A directive-based language.

• Aims to be a standard.
• Supported by Cray, NVIDIA, PGI and CAPS.
• One single source code for all versions.
• Platform independent.
• Easier for beginners.
OpenACC example
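As a minimal illustration (my own sketch mirroring the earlier OpenMP example, not thesis code), the same π computation can be offloaded with a single OpenACC directive:

#include <stdio.h>

#define NUM_STEPS 1000000

int main(void)
{
    double step = 1.0 / NUM_STEPS;
    double sum = 0.0;
    int i;

    /* One directive: the compiler generates the device kernel and
       manages data movement; the pragma is ignored on plain C compilers. */
    #pragma acc parallel loop reduction(+:sum)
    for (i = 0; i < NUM_STEPS; ++i) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi = %f\n", step * sum);
    return 0;
}

The same annotated source compiles for the CPU, an NVIDIA GPU, or an OpenCL device, which is exactly the portability argument made above.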
A New Dimension for HPC
accULL: our OpenACC Implementation
accULL = compiler + runtime library.
accULL = YaCF + Frangollo.
Initial Objectives of this Project
• To integrate C99 in the YaCF project.
• To implement a new class hierarchy for new YaCF Frontends.
• To implement an OpenACC Frontend.
• To complete the OpenMP grammar with the OpenMP 3.0 directives.
• To test the new C99 interface.
Source-to-source Compilers
• Rose Compiler Framework.
• Cetus Compiler.
• Mercurium.
YaCF: Yet Another Compiler Framework
YaCF
• A source-to-source compiler that translates C code with OpenMP, llc and OpenACC annotations into code with Frangollo calls.
• Integrates code analysis tools.
• Completely written in Python.
• Based on widely known object-oriented software patterns.
• Based on the pycparser Python module.
• Implementing a code transformation is only a matter of writing a few lines of code.
YaCF: Architecture
YaCF: Preprocessor
YaCF: Statistics
• 20683 lines of Python code.
• 2158 functions and methods.
• My contribution amounts to about 25% of the YaCF project.
Experiments
• ScaLAPACK benchmark: testing C99.
• Block Matrix Multiplication in accULL.
• Three different problems from the Rodinia Benchmark:
  • HotSpot.
  • SRAD.
  • Needleman–Wunsch.
ScaLAPACK
• ScaLAPACK (Scalable LAPACK) is a library that includes a subset of LAPACK routines redesigned for distributed-memory MIMD parallel computers.
• ScaLAPACK is designed for heterogeneous computing.
• It is portable to any computer that supports MPI.
• Scalability depends on PBLAS operations.
ScaLAPACK: results in YaCF
Directory          Total C files  Success  Failures
PBLAS/SRC          123            123      0
REDIST/SRC         21             21       0
PBLAS/SRC/PTOOLS   102            101      1
PBLAS/TESTING      2              1        1
PBLAS/TIMING       2              1        1
REDIST/TESTING     10             0        10
SRC                9              9        0
TOOLS              2              2        0
Total              271            258      13

95% of the ScaLAPACK C files are correctly parsed by YaCF.
Platforms
• Garoe: a desktop computer with an Intel Core i7 930 processor (2.80 GHz), with 1 MB of L2 cache and 8 MB of L3 cache shared by the four cores. The system has 4 GB of RAM and an attached Tesla C2050 with 4 GB of memory.
• Drago: a second cluster node. It is a shared-memory system with 4 Intel Xeon E7 processors, each with 10 cores. In this case, the accelerator platform is the Intel OpenCL SDK 1.5, which runs on the CPU.
MxM in accULL
• MxM is a basic kernel frequently used to showcase the peak performance of GPU computing.
• We compare the performance of the accULL implementation with that of:
  • OpenMP.
  • CUDA.
  • OpenCL.
MxM OpenACC code
#pragma acc kernels name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
{
    #pragma acc loop private(i, j) collapse(2)
    for (i = 0; i < L; i++)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterate over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterate inside a block */
                #pragma acc loop collapse(2) private(i, j, k)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i*L + j] += b[i*L + k] * c[k*M + j];
            }
}
MxM in accULL (Garoe)
MxM in accULL (Drago)
SRAD: an Image Filtering Code
SRAD (Garoe)
The CUDA code generated through Frangollo performs better than the native CUDA version.
SRAD (Drago)
NW: Needleman-Wunsch, a Sequence Alignment Code
NW (Garoe)
Poor results, although still better than OpenMP on 4 cores.
NW (Drago)
HotSpot: a Thermal Simulation Tool for Estimating Processor Temperature
HotSpot (Garoe)
As good as native versions.
HotSpot (Drago)
Conclusions: Compiler Technologies
• Compiler technology tends to rely on source-to-source compilers to generate, optimize and transform source code.
• It is easier to parallelize source code with AST transformations.
• AST transformations enable programmers to easily generate code for any platform.
Conclusions: Programming Model
• Directive-based programming languages allow non-expert programmers to abstract away architectural details and write programs more easily.
• The OpenACC standard is a starting point for heterogeneous systems programming.
• Future versions of the OpenMP standard will include support for accelerators.
• The results we are obtaining with accULL, our early OpenACC implementation, are promising.
References I
Ruyman Reyes, Ivan Lopez, Juan J. Fumero, F. de Sande.
accULL: An OpenACC Implementation with CUDA and OpenCL Support.
International European Conference on Parallel and Distributed Computing, 2012.

Ruyman Reyes, Ivan Lopez, Juan J. Fumero, F. de Sande.
Directive-based Programming for GPUs: A Comparative Study.
The 14th IEEE International Conference on High Performance Computing and Communications.

Ruyman Reyes, Ivan Lopez, Juan J. Fumero, F. de Sande.
accULL: A User-directed Approach to Heterogeneous Programming.
The 10th IEEE International Symposium on Parallel and Distributed Processing with Applications.
Future Work
• Add support for MPI combined with CUDA and OpenCL.
• Perform new experiments with OpenACC.
• Compare our accULL approach with PGI OpenACC and CAPS HMPP.
• Add support for vectorization.
• Explore FPGAs in combination with CUDA and OpenCL.
• Introduce the LLVM Compiler Framework in the Frontend.
Thank you for your attention
Juan Jose Fumero [email protected]