Www.cineca.it Optimization techniques Carlo Cavazzoni, HPC department, CINECA.

transcript

www.cineca.it

Optimization techniques

Carlo Cavazzoni, HPC department, CINECA

www.cineca.it

Modern node architecture

cacheI, D

Small & fast

www.cineca.it

Hierarchy register L1 L2 L3 RAML1: Instruction and dataSize: L1 … LnSpeed: L1 … Ln

CPU looks for data in L1, if it is there (L1 cache hit), if not (L1 cache miss) and looks in L2 …

cache miss penaly in terms of clock cycle

www.cineca.it

CACHE Direct Mapped

32 Kbyte

128 Kcache

www.cineca.it

Cache set associative

32 Kbyte

32 Kbyte16 Kbyte

32 Kbyte

128 K16 Kbyte

Es. 2-ways

LastRecentlyUsed

Round Robin

Random

www.cineca.it

Loop optimization

www.cineca.it

Loop fusionLocality in time

do i=1, n

a(i) = b(i) + 1.0

enddodo i=2, n c(i) = sqrt(a(i-1))enddo

do i=2, n

a(i) = b(i) + 1.0

c(i) = sqrt(a(i-1))enddoa(1) = b(1) + 1.0

if n is big enough, a is loaded, offloaded and loaded again into cache

Reuse the a(i) loaded into cache

www.cineca.it

Loop interchangeLocality in space

do i=1, n

do j=1, n a(i,j) = b(i,j) + 1.0 enddoenddo

Load elements into cache lines and use only one before replacing them with new elements

Load elements into cache and use all of them before replacing them with new elements

do j=1, n

do i=1, n a(i,j) = b(i,j) + 1.0 enddoenddo

0x000x010x020x03

www.cineca.it

Cache thrashingreal, dimension (1024) :: a,b

COMMON /my_com/ a, b

do i=1, 1024 a(i) = b(i) + 1.0enddo

offset shift matrixes w.r.t. cache no more problems

Avoid power of 2 for matrix dimensions

integer offset = (linea_cache)/SIZE(REAL)

real, dimension (1024+offset) :: a,b

COMMON /my_com/ a, b

do i=1, 1024 a(i) = b(i) + 1.0enddo

Padding

size cache = 4*1024, direct mapped, a, b contiguous cache thrashing

array size = multiple of cache size possible source of cache thrashing

Set Associative help reducing thrashing problems

www.cineca.it

Loop unrolling

do j=1, n

do i=1, (n-1) a(i,j)= b(i,j)+b(i+1,j)+1.0 enddoenddo

Equivalent Loops.

Fewer jump.

Fewer dependencies.

Fill pipelines and vector units.

do j=1, n

do i=1, (n-1), 2 a(i,j) = b(i,j) +b(i+1,j)+1.0 a(i+1,j) = b(i+1,j)+b(i+2,j)+1.0 enddoenddo

www.cineca.it

Optimize with numerical libraries

Less coding

Tested and (almost) bug free

Standard

Efficient implementation

Optimized

www.cineca.it

BLASBasic Linear Algebra Subprogram, Parallel BLAS and Basic Linear Algebra Communication Subsystem (www.netlib.org)• Level 1 BLAS: Vector-Vector operations

(scalar only).

• Level 2 BLAS, PBLAS: Vector-Matrix operations (scalar and parallel).

• Level 3 BLAS, PBLAS: Matrix-Matrix operation (scalar and parallel).

• Level 1 and 2 BLACS: vector reduction, vector and matrics communications.

www.cineca.it

Lapack and ScalapackLinear Algebra Package and Scalable Lapack (www.netlib.org)

Matrix decomposition. Solution of Linear Systems. Eigenvalues and Eigenvetors Linear Least Square solutions

www.cineca.it

MKLESSLACMLCUBLASMAGMAPLASMA

www.cineca.it

MASS (IBM)• Accelerated version of SQRT, SIN, COS,

EXP, LOG, ecc… Scalar and vector

www.cineca.it

Equivalent to MASS (vector version only) For Intel processors

Accelerated version of:sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y

www.cineca.it

do i = 1, n r = r + sin( a(i) )end do

call vdsin( n, a, y )do i = 1, n r = r + y( i )end do

CALL vml_subroutine( n, a, y )

www.cineca.it

BLASMatrix multiplication

DGEMM (transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc)

c = alpha op( a ) * op( b ) + beta c

Clm = n Aln Bnm + Clm Clm = n ATln Bnm + Clm

Clm = n Aln BTnm + Clm Clm = n AT

ln BTnm + Clm

real*8 a(lda,*), b(ldb,*), c(ldc,*)

www.cineca.it

Profileing with gprof

Compiler flag “-pg” or “-p” (depend on the compiler)

gcc -pg –c mio.c ./a.out

gmon.out gprof

www.cineca.it

gcc -pg -funroll-loops –O2 dotprod.c -static

[cineca@rfxoff1 Carlo]$ ./a.out d = 1000000.000000

% cumulative self self total time seconds seconds calls us/call us/call name 68.57 0.05 0.05 2 23437.50 23437.50 set_vector 31.43 0.07 0.02 1 21484.38 21484.38 dot_product 0.00 0.07 0.00 1 0.00 68359.38 main

www.cineca.it

Profileing “by hand”

Find “hot spot” in your application

Use temporization functions

CALL SYSTEM_CLOCK(iclk1, count_rate=nclk)

CALL critical_subroutine( …… )

CALL SYSTEM_CLOCK(iclk2)

PRINT *,REAL(iclk2-iclk1)/nclk

t1 = cclock()

t2 = cclock()

PRINT *, (t2-t1)

CALL CPU_TIME( t3 )

CALL CPU_TIME( t4 )

PRINT *, (t4-t3)

www.cineca.it

Mesure performances

#include<stdio.h>#include<time.h>#include<ctype.h>#include<sys/types.h>#include<sys/time.h>

double cclock_(){

/* Restituisce il valore del CLOCK di sistema in secondi */

struct timeval tmp; double sec; gettimeofday( &tmp, (struct timezone *)0 ); sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0; return sec;

www.cineca.it

PROGRAM test_dgemm

IMPLICIT NONE

INTEGER, PARAMETER :: dim = 1000 REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:) INTEGER :: i,j,k REAL*8 :: t1, t2 REAL*8 :: cclock EXTERNAL :: cclock ALLOCATE( x( dim, dim ), y( dim, dim ) ) ALLOCATE( z( dim, dim ) ) y = 1.0d0 z = 1.0d0 / DBLE( dim ) x = 0.0d0 t1 = cclock( ) do j = 1, dim do i = 1, dim do k = 1, dim x(i,j) = x(i,j) + y(i,k) * z(k,j) end do end do end do t2 = cclock() write(*,*) ' Matrix sum = ', sum(x) write(*,*) ' tempo (secondi) ', t2-t1 DEALLOCATE( x, y, z )

END PROGRAM

PROGRAM test_dgemm

IMPLICIT NONE

INTEGER, PARAMETER :: dim = 1000 REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:) INTEGER :: i,j,k REAL*8 :: t1, t2 REAL*8 :: cclock EXTERNAL :: cclock ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) )

y = 1.0d0 z = 1.0d0 / DBLE( dim ) x = 0.0d0 t1 = cclock()

! x = matmul( y, z ) call dgemm('N', 'N', dim, dim, dim, 1.0d0, y, c dim, z, dim,0.0d0, x, dim)

t2 = cclock() write(*,*) ' Matrix sum = ', sum(x) write(*,*) ' tempo (secondi) ', t2-t1 DEALLOCATE( x, y, z )

END PROGRAM

www.cineca.it

BLAS compatible

Automatically Tuned Linear Algebra Software

http://sourceforge.net/http://math-atlas.sourceforge.net/devel/

www.cineca.it

http://www.fftw.org

Fast Fourier Trasform

FFT complex to complex

FFT complex to real

Parallel FFT

Moulti-thread FFT

www.cineca.it

Advanced techniques

www.cineca.it

do i=1,n do j=1,m y(j,i) = x(i,j) enddoenddo

Case Study: matrix transposition

Think Fortran: Consecutive elements in memory

www.cineca.it

Suppose 2-way set associative

For each value of x I need to load into cache a whole line.1) X “allocate” the 2nd way.2) Risk of thrashing3) When the cache is full, the proc. Start to overwrite cache lines

data mapped in cachey “allocate” the 1st way

What happens with the cache

www.cineca.it

As before for each value of x I need to load into cache a whole cache line.

data mapped in cachey “allocate” the 1st way

What happens with the cache, cont.

We can see that a lot of data are loaded into the cache but they are not used!

www.cineca.it

Block Algorithm

Load a block of data into cache

Swap data in cache

Write data back to memory

www.cineca.it 31

do i=1,n do j=1,m y(j,i) = x(i,j) enddoenddo

do ib = 1, nb ioff = (ib-1) * bsiz do jb = 1, mb joff = (jb-1) * bsiz do j = 1, bsiz do i = 1, bsiz buf(i,j) = x(i+ioff, j+joff) enddo enddo do j = 1, bsiz do i = 1, j-1 bswp = buf(i,j) buf(i,j) = buf(j,i) buf(j,i) = bswp enddo enddo do i=1,bsiz do j=1,bsiz y(j+joff, i+ioff) = buf(j,i) enddo enddo enddoenddo

bsiz = block sizenb = n / bsizmb = m / bsiz

You need to handle: MOD(n / bsiz) /= 0 ORMOD(m / bsiz) /= 0

Solution: Block algorithm

www.cineca.it

Whole block transpose do ib = 1, nb ioff = (ib-1) * bsiz do jb = 1, mb joff = (jb-1) * bsiz do j = 1, bsiz do i = 1, bsiz buf(i,j) = x(i+ioff,j+joff) enddo enddo do j = 1, bsiz do i = 1, j-1 bswp = buf(i,j) buf(i,j) = buf(j,i) buf(j,i) = bswp enddo enddo do i=1,bsiz do j=1,bsiz y(j+joff,i+ioff) = buf(j,i) enddo enddo enddoenddo

IF( min( 1, MOD(n,bsiz) ) .GT. 0 ) THEN ioff = nb * bsiz do jb = 1, mb joff = (jb-1) * bsiz do j = 1, bsiz do i = 1, MIN(bsiz, n-ioff) buf(i,j) = x(i+ioff, j+joff) enddo enddo do i = 1, MIN(bsiz, n-ioff) do j = 1, bsiz y(j+joff,i+ioff) = buf(i,j) enddo enddo enddoEND IF

IF( MIN(1, MOD(m, bsiz)) .GT. 0 ) THEN joff = mb * bsiz do ib = 1, nb ioff = (ib-1) * bsiz do j = 1, MIN(bsiz, m-joff) do i = 1, bsiz buf(i,j) = x(i+ioff, j+joff) enddo enddo do i = 1, bsiz do j = 1, MIN(bsiz, m-joff) y(j+joff,i+ioff) = buf(i,j) enddo enddo enddoEND IF

IF( MIN(1,MOD(n,bsiz)).GT.0 .AND. & & MIN(1,MOD(m,bsiz)).GT.0 ) THEN joff = mb * bsiz ioff = nb * bsiz do j = 1, MIN(bsiz, m-joff) do i = 1, MIN(bsiz, n-ioff) buf(i,j) = x(i+ioff, j+joff) enddo enddo do i = 1, MIN(bsiz, n-ioff) do j = 1, MIN(bsiz, m-joff) y(j+joff,i+ioff) = buf(i,j) enddo enddoEND IF

www.cineca.it 33

Performance tuning and analysis: user codes

Matrix TraspositionMatrix size: 2048x2048

0 20 40 60 80 100 120

block size

Straightforward implementation

Block implementation

www.cineca.it

DO l=1,nphase IF(au1(l,l) /= 0.D0) THEN lp1=l+1 div=1.D0/au1(l,l) DO lj=lp1,nphase au1(l,lj)=au1(l,lj)*div END DO bu1(l)=bu1(l)*div au1(l,l)=0.D0 DO li=1,nphase amul=au1(li,l) DO lj=lp1,nphase au1(li,lj)=au1(li,lj)-amul*au1(l,lj) END DO bu1(li)=bu1(li)-amul*bu1(l) END DO END IF END DO

IF( a(1,1) /= 0.D0 ) THEN div = 1.D0 / a(1,1) a(1,2) = a(1,2) * div a(1,3) = a(1,3) * div b(1) = b(1) * div a(1,1) = 0.D0 !li=2 amul = a(2,1) a(2,2) = a(2,2) - amul * a(1,2) a(2,3) = a(2,3) - amul * a(1,3) b(2) = b(2) - amul * b(1) !li=3 amul = a(3,1) a(3,2) = a(3,2) - amul * a(1,2) a(3,3) = a(3,3) - amul * a(1,3) b(3) = b(3) - amul * b(1)END IF

IF( a(2,2) /= 0.D0 ) THEN div=1.D0/a(2,2) a(2,3)=a(2,3)*div b(2)=b(2)*div a(2,2)=0.D0 !li=1 amul=a(1,2) a(1,3)=a(1,3)-amul*a(2,3) b(1)=b(1)-amul*b(2) !li=3 amul=a(3,2) a(3,3)=a(3,3)-amul*a(2,3) b(3)=b(3)-amul*b(2)END IF

IF( a(3,3) /= 0.D0 ) THEN div=1.D0/a(3,3) b(3)=b(3)*div a(3,3)=0.D0 !li=1 amul=a(1,3) b(1)=b(1)-amul*b(3) !li=2 amul=a(2,3) b(2)=b(2)-amul*b(3)END IF

Parameter Dependent Code & Unrolling

per un dato set di parametri(di input), riesco ad eliminareogni loop, ottimizzando cachee pipe di esecuzione

www.cineca.it

Debugging (post mortem)

program hello_bug real(kind=8) :: a( 10 ) call clearv( a, 10000 ) print *, SUM( a )end program

subroutine clearv( a, n) real(kind=8) :: a( * ) integer :: n integer :: i do i = 1, n a( n ) = 0.0 end doend subroutine

gfortran –g hello_bug.f90

Remove core size limitulimit –c unlimited

./a.out Segmentation fault (core dumped)

gdb ./a.out core

www.cineca.it

Debugging (in vivo)gfortran -g hello_bug.f90

gdb ./a.out

GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5)Copyright (C) 2009 Free Software Foundation, Inc.License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>This is free software: you are free to change and redistribute it.There is NO WARRANTY, to the extent permitted by law. Type "show copying"and "show warranty" for details.This GDB was configured as "x86_64-redhat-linux-gnu".For bug reporting instructions, please see:<http://www.gnu.org/software/gdb/bugs/>...Reading symbols from /plx/userinternal/acv0/a.out...done.

(gdb) runStarting program: /plx/userinternal/acv0/a.out warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000

Program received signal SIGSEGV, Segmentation fault.0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:1212 a( n ) = 0.0

www.cineca.it

#include<sys/types.h>#include<sys/time.h>

double cclock_()

/* Restituisce il valore del CLOCK di sistema in secondi */

struct timeval tmp; double sec; gettimeofday( &tmp, (struct timezone *)0 ); sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0; return sec;

} program mat_mul integer, parameter :: n = 100 real*8 :: a(n,n), b(n,n), c(n,n) real*8 :: t1, t2 real*8 :: cclock external cclock a = 1.0d0 b = 1.0d0 t1 = cclock() call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n ) t2 = cclock() write(*,*) SUM(c), t2-t1end program

1) gcc –c cclock.c

2) f95 matmul_prof.f90 cclock.o -L. -lblas

Link Fortran and C

www.cineca.it

Link Fortran and C

program rand real(kind=8) :: a external crand call crand( a ) print *,'this is random ', aend program

#include<stdlib.h>#include<time.h>

void crand_( double * x ){ (*x) = ( (double)random()/(double)RAND_MAX );}

Link a C subroutine with a Fortran program

rand.f90 crand.c

Fortran passes arguments by reference, C passes them by value

www.cineca.it

Link a Fortran subroutine with a C program

#include<stdio.h>

int main(){ int n; double a[10], d;

n = 10; d = 1.0; setv_( a, &d, &n ); printf("%lf\n", a[0]); }

subroutine setv( a, d, n ) real(kind=8) :: a( * ) real(kind=8) :: d integer :: n integer :: i do i = 1, n a( i ) = d end doend subroutine

cvec.c vset.f90

Link Fortran and C

www.cineca.it

Make CommandIf a code is large and/or it shares subroutines withother codes, it is useful to split the source in many files that could be placed in different directories.

In F90 there are dependencies among program units,i.e. modules must be compiled before than any other program units.Therefore there is a well defined order for compiling source files

To avoid compiling by hands the sources in the proper order,the make command could be used

www.cineca.it

Make Command

The make command can be programmed to do the job for youusing a file containing instruction and dircetive.

By default the make command looks in the present directoryfor a file colled Makefile or makefile

www.cineca.it

A simple makefile# this is a comment within the makefile

myprog.x : modules.o main.of90 –o myprog.x modules.o main.o

modules.o : modules.f90f90 –c modules.f90

main.o : modules.o main.f90f90 –c main.f90

this tell to the make commandthat myprog.x depend frommodules.o and main.o

make execute the command onlywhen modules.o and main.ohave been built

to compile the code, from the console the programmer issuethe command:> make

www.cineca.it

A less simple makefile# this is a comment within the makefile

myprog.x : modules.o main.of90 –o myprog.x modules.o main.o

main.o : modules.o

.f90.of90 –c $<

this is an implicit dependency, it state that allfiles “.o” depend and should be generated fromthe corresponding “.f90” files

this is a make macro, and it is expandend with the proper “.f90” filename

In the above example, make try to built myprog.x but it realizes that main.o and modules.o should begenerated first. Then it starts looking for a rule to make the “.o”, and it finds that main.o depend onmodules.o, and thern make build an internal hierarchy for compilation in which modules.o come beforemain.o . Finally make finds the implicit rule and starts compiling the sources.

Www.cineca.it Optimization techniques Carlo Cavazzoni, HPC department, CINECA.

Documents