
Optimizing Lattice Boltzmann and Spin Glass codes

S. F. Schifano

University of Ferrara and INFN-Ferrara

PRACE Summer School

Enabling Applications on Intel MIC based Parallel Architectures

July 8-11, 2013

Casalecchio di Reno, Bologna, Italy

S. F. Schifano (Univ. and INFN of Ferrara) Optimizing LB and SG codes July 8-11, 2013 1 / 41

The Lattice Boltzmann Method

The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods.

It simulates a synthetic dynamics described by the discrete Boltzmann equation, instead of the Navier-Stokes equations.

The key idea:

- a set of virtual particles, called populations, is arranged at the edges of a discrete and regular grid

- interacting by propagation and collision, they reproduce (after appropriate averaging) the dynamics of fluids.

Relevant features:

“Easy” to implement complex physics.

Good computational efficiency on MPAs.


The D2Q37 Lattice Boltzmann Model

Correct treatment of:

- Navier-Stokes equations of motion

- heat transport equations

- perfect gas state equation (P = ρT)

D2 model with 37 components of velocity

Suitable to study the behaviour of compressible gases and fluids, optionally in the presence of combustion effects [1].

[1] Chemical reactions turning a cold mixture of reactants into a hot mixture of burnt products.


LBM Computational Scheme

foreach time-step
  foreach lattice-point
    propagate();
    collide();
  endfor
endfor

Embarrassing parallelism: all sites can be processed in parallel, applying propagate and collide in sequence.

Challenge: an efficient implementation on computing systems that exploits a large fraction of peak performance.


D2Q37 propagation scheme

Gather 37 populations from 37 different lattice-sites.

D2Q37 propagation

applies to each lattice-cell,

requires access to cells at distance 1, 2, and 3,

gathers at the center point the populations found at the edges of the arrows,

performs memory accesses with sparse addressing patterns.
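The gather described above can be sketched as follows. This is only an illustration under stated assumptions: the names `offset_t` and `propagate_site`, the AoS index formula, and the idea of passing the velocity set as a table are hypothetical, not taken from the original code. The 37 D2Q37 offsets themselves are not listed on the slides, so the table is left as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: displacement (in lattice units) of one population. */
typedef struct { int dx, dy; } offset_t;

/* Gather step for the single site (x,y) of an nx-by-ny lattice stored as
 * an array of sites with nq populations each. Population q of the new
 * site is read from the site it streams from, (x - dx[q], y - dy[q]).
 * x wraps around periodically; the caller must keep y far enough from
 * the top and bottom borders that y - dy[q] stays inside the lattice. */
void propagate_site(const double *prv, double *nxt,
                    int nx, int ny, int x, int y,
                    const offset_t *c, int nq)
{
    for (int q = 0; q < nq; q++) {
        int xs = ((x - c[q].dx) % nx + nx) % nx;   /* periodic along x */
        int ys = y - c[q].dy;
        assert(ys >= 0 && ys < ny);
        nxt[((size_t)x * ny + y) * nq + q] =
            prv[((size_t)xs * ny + ys) * nq + q];
    }
}
```

Each of the `nq` reads touches a different source site, which is where the sparse addressing pattern comes from; the writes to `nxt` for one site are contiguous.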


D2Q37: boundary-conditions

we simulate a 2D lattice with periodic boundaries along the x-direction

at the top and the bottom, boundary conditions are enforced:

- to adjust some values at sites y = 0 . . . 2 and y = Ny − 3 . . . Ny − 1

- e.g. set the vertical velocity to zero

This step (bc) is computed before the collision step.


D2Q37 collision

collision is computed at each lattice-cell

computationally intensive: for the D2Q37 model it requires > 7600 DP operations

completely local: arithmetic operations require only the populations associated with the site


Xeon-Phi 5110P co-processor

#cores              61
frequency           1090 MHz
memory              8 GB GDDR5
L1-cache / core     32 KB
L2-cache / core     512 KB
Peak Perf. SP/DP    ≈ 2/1 TFlops
Peak Memory Bw      320 GB/s

yet another accelerator: PCIe 16x Gen2 card (8 GB/s)

61 Pentium-based CPUs

in-order execution, 1-4 threads per core

512 KB L2-cache blocks, one per core, are shared among the cores (about 30 MB in total)

vector FPU, 512-bit Advanced Vector Extensions (AVX)


MIC Programming Model

native: log in to the Linux OS running on the card and run a program built with: icc -mmic pippo.c -o pippo

offload: use appropriate pragmas to mark code that will be transparently executed on the MIC board

Programming is well integrated with many languages:

OpenMP

TBB

Cilk

. . .


Offload programming (fine-grain)

#include <stdlib.h>

#define N 1717

/* fill A and B with random values, clear C */
void vinit (double *A, double *B, double *C) {
  int i;
  for (i = 0; i < N; i++) {
    A[i] = drand48(); B[i] = drand48(); C[i] = 0.0;
  }
}

int main () {
  double A[N], B[N], C[N];
  double s;
  int i;

  srand48(1);   /* seed value chosen arbitrarily */

  vinit(A, B, C);

  s = drand48();

  /* copy A, B and s to the card, run the loop there, copy C back */
  #pragma offload target(mic:-1) in(A, B : length(N)) in(s) inout(C : length(N))
  {
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++)
      C[i] = s * A[i] + B[i];
  }

  . . .

}


Offload programming (coarse-grain)

offload a code that spawns several threads

use either Pthreads or OpenMP

/* Pthreads version */
for (t = 0; t < NTHREAD; t++) {
  pthread_create(&threads[t], NULL, threadFunc, (void *) &tData[t]);
}

for (t = 0; t < NTHREAD; t++) {
  pthread_join(threads[t], NULL);
}

/* OpenMP version */
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  threadFunc((void *) &targv[tid]);
}

Our implementation uses this approach.


Relevant optimizations for performance

Applications running on Xeon-Phi can approach peak performance if the code exploits the relevant hardware features:

core parallelism: all cores have to be kept active and working in parallel, e.g. running different functions or working on different data-sets (MIMD/multi-task or SPMD parallelism);

hyper-threading: each core has to execute at least 2, and up to 4, threads to keep the hardware pipelines busy and hide memory-access latency;

vector programming: each core has to process its data-set using vector (streaming) instructions (SIMD parallelism); on Xeon-Phi each vector instruction can process up to 8 double-precision values.


AVX512 Memory Copy Benchmark

#include <immintrin.h>
#include <omp.h>

#define _vload_pd _mm512_load_pd

#if defined USE_STORENR
#define _vstore_pd _mm512_storenr_pd
#elif defined USE_STORENRNGO
#define _vstore_pd _mm512_storenrngo_pd
#elif defined USE_STORE
#define _vstore_pd _mm512_store_pd
#endif

__m512d A[N], B[N];

#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  th_func((void *) &targv[tid]);
}

// code executed by each thread
void th_func (void *targv) {
  int i;
  for (i = 0; i < L; i++)
    _vstore_pd(B+i, _vload_pd(A+i));   // copy one 512-bit vector
}

The portions of the vectors used by each thread are kept at a minimum distance of 4096 bytes (1 page), so that two threads never access the same TLB entry.


AVX512 Memory Copy Benchmark: Native VS Offload

Possible TLB conflicts? Try setting the PHI_USE_2MB_BUFFERS environment variable.

Memory Copy: Scalar VS Vector

Vector copy and the generated loop body:

#pragma omp parallel for private(i)
for (i = 0; i < (NTHR*S); i++) {
  __asm("#BEGIN");
  _vcpy(d+i, a+i);
  __asm("#END");
}

# Begin ASM
#BEGIN
# End ASM
..B2.50:
        movq      (%r14), %rdx
        movq      (%r15), %rcx
        vmovapd   (%rdx,%rbx), %zmm0
        nop
        vmovapd   %zmm0, (%rcx,%rbx)
..B2.49:
# Begin ASM
#END
# End ASM

Scalar copy and the generated loop body:

#pragma omp parallel for private(i)
for (i = 0; i < (NTHR*S*8); i++) {
  __asm("#BEGIN");
  d[i] = a[i];
  __asm("#END");
}

# Begin ASM
#BEGIN
# End ASM
..B2.52:
        movq      (%r14), %rax
        movq      (%r12), %rcx
        movq      (%rax,%r13,8), %rdx
        movq      %rdx, (%rcx,%r13,8)
..B2.51:
# Begin ASM
#END
# End ASM


Implementation: memory layout AoS vs SoA

lattice stored as AoS: exploits the cache-locality of the populations associated with each site; relevant for computing the collision.

typedef struct {
  double p0;   // population 1
  . . .
  double p36;  // population 37
} pop_t;

pop_t lattice2D[SIZEX*SIZEY];

lattice stored as SoA: exploits the data-locality of corresponding populations of different sites.

typedef struct {
  double p0[SIZEX*SIZEY];   // array of population 1
  . . .
  double p36[SIZEX*SIZEY];  // array of population 37
} pop_t;

pop_t lattice2D;

We have used the AoS scheme, and two copies of the lattice are kept in memory: each step reads from prv and writes onto nxt.
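To make the difference between the two layouts concrete, the sketch below computes the position of population `p` of a given site when the lattice is viewed as one flat array of doubles. The helper names `idx_aos` and `idx_soa` are hypothetical and only illustrate the addressing; they are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

#define NPOP 37   /* populations per site in D2Q37 */

/* AoS: the 37 populations of one site are contiguous in memory. */
static size_t idx_aos(size_t site, size_t p)
{
    assert(p < NPOP);
    return site * NPOP + p;
}

/* SoA: population p of all the sites is contiguous in memory. */
static size_t idx_soa(size_t site, size_t p, size_t nsites)
{
    assert(p < NPOP && site < nsites);
    return p * nsites + site;
}
```

In AoS the stride between two populations of the same site is 1 double, which suits collide; in SoA the stride between the same population of two neighbouring sites is 1 double, which suits streaming over sites.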


Implementation: code optimizations

core parallelism:

- each core runs several threads

- 60-240 threads are run, depending on the size of the lattice

thread parallelism:

- the lattice is split over the threads along the X-dimension

- each thread processes a portion of the lattice

vector parallelism:

- each thread processes 8 lattice-sites in parallel

- exploiting AVX vector instructions
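One possible way to split the X-dimension over the threads is sketched below. The helper `thread_slice` and its remainder policy are assumptions made for illustration; the slides do not say how the original code handles lattice widths that are not a multiple of the thread count.

```c
#include <assert.h>

/* Hypothetical helper: thread tid (0 <= tid < nthr) gets the columns
 * [*first, *first + *count) of a lattice that is lx columns wide.
 * The remainder lx % nthr is spread over the first threads, which
 * receive one extra column each. */
void thread_slice(int lx, int nthr, int tid, int *first, int *count)
{
    assert(nthr > 0 && tid >= 0 && tid < nthr);
    int base = lx / nthr;
    int rem  = lx % nthr;
    *count = base + (tid < rem ? 1 : 0);
    *first = tid * base + (tid < rem ? tid : rem);
}
```

When sites are additionally packed 8 per vector, the same helper could be applied to the number of vector-columns instead of the number of sites, so that no vector straddles two threads.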


Implementation: core parallelism

for (step = 0; step < MAXSTEP; step++) {

  if (tid == 0 || tid == NTHR-1) {
    comm();       // exchange borders
    propagate();  // apply propagate to left- and right-border
  } else {
    propagate();  // apply propagate to the inner part
  }

  pthread_barrier_wait(...);

  if (tid == 0)
    bc();         // apply bc() to the three upper row-cells

  if (tid == 1)
    bc();         // apply bc() to the three lower row-cells

  pthread_barrier_wait(...);

  collide();      // compute collide()

  pthread_barrier_wait(...);
}


Implementation: vector programming

Components of 8 lattice-cells are packed in an AVX vector of 8 doubles

typedef struct {
  __m512d vp0;
  __m512d vp1;
  __m512d vp2;
  . . .
  __m512d vp36;
} vpop_t;

vpop_t lattice[LX][LY];

Intrinsics: d = a × b + c  ⇒  d = _mm512_fmadd_pd(a, b, c)


Results: Propagate Performance

STORENRNGO: stores an AVX vector directly to memory, without first loading the line into cache.


Results: Collide Performance


Comparisons

Lattice size: ≈ 4 M-cells

                    C2050   2-WS   2-SB   KNC
propagate (GB/s)    84      17.5   60     94
E                   58%     29%    70%    29%
collide (GF/s)      205     88     220    394
E                   41%     55%    63%    37%
ξ (collide)         NA      1.19   1.27   0.76

NVIDIA Tesla C2050, ≈ 500 GF DP, ≈ 144 GB/s peak (PARCFD’11)

2-WS: Intel dual 6-core (Westmere), ≈ 160 GF DP, ≈ 60 GB/s peak (ICCS’11)

2-SB: Intel dual 8-core (Sandybridge), ≈ 345 GF DP, ≈ 85.3 GB/s peak (CCP’12)

$$\xi = \frac{P}{N_c \times v \times f}$$
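The slides do not define the symbols in ξ. A reading that reproduces the tabulated value for the 2-SB system, and is therefore assumed here, is: P is the sustained collide performance in GF/s, Nc the number of cores, v the vector width in doubles, and f the clock in GHz. The 2.7 GHz clock and the vector width of 4 used in the example are assumptions too; they are consistent with the ≈ 345 GF peak quoted for 2-SB (16 cores × 8 flops/cycle × 2.7 GHz).

```c
#include <assert.h>

/* xi = P / (Nc * v * f): sustained floating-point operations per core,
 * per vector lane and per clock cycle.
 * p_gflops: sustained performance in GF/s; f_ghz: clock in GHz. */
double xi(double p_gflops, int ncores, int vlen, double f_ghz)
{
    assert(ncores > 0 && vlen > 0 && f_ghz > 0.0);
    return p_gflops / (ncores * vlen * f_ghz);
}
```

Under these assumptions `xi(220.0, 16, 4, 2.7)` evaluates to 220 / 172.8 ≈ 1.27, matching the 2-SB column of the table.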


Simulation of the Rayleigh-Taylor (RT) Instability

Instability at the interface of two fluids of different densities, triggered by gravity.

A cold, dense fluid lying over a less dense and warmer fluid triggers an instability that mixes the two fluid regions (until equilibrium is reached).


Acknowledgments

Joint work of a team of physicists and computer scientists:

Luca Biferale, Mauro Sbragaglia, Patrizio Ripesi
University of Tor Vergata and INFN Roma, Italy

Andrea Scagliarini
University of Barcelona, Spain

Filippo Mantovani
University of Regensburg, Germany

Gianluca Crimi, Marcello Pivanti, Sebastiano Fabio Schifano, Raffaele Tripiccione
University and INFN of Ferrara, Italy

Federico Toschi
Eindhoven University of Technology, The Netherlands, and CNR-IAC, Roma, Italy

This work was performed in the framework of the INFN COKA and SUMA projects.

We would like to thank CINECA, INFN-CNAF and JSC institutes for access to their systems.


Spin-Glass

The spin glass is a statistical model used to study some behaviours of complex macroscopic systems, such as disordered magnetic materials.

An apparently trivial generalization of the ferromagnet model.


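Spin-Glass Models

Ising Model

$$E(\{S\}) = -J \sum_{\langle ij \rangle} s_i \cdot s_j, \quad J > 0, \quad s_i, s_j \in \{-1,+1\}$$

Edwards Anderson Model (Binary)

$$E(\{S\}) = \sum_{\langle ij \rangle} J_{ij} \cdot s_i \cdot s_j, \quad J_{ij}, s_i, s_j \in \{-1,+1\}$$

Edwards Anderson Model (Gaussian)

$$E(\{S\}) = \sum_{\langle ij \rangle} J_{ij} \cdot s_i \cdot s_j, \quad J_{ij} \in \mathbb{R}, \quad s_i, s_j \in \{-1,+1\}$$

Heisenberg Model

$$E(\{S\}) = \sum_{\langle ij \rangle} J_{ij} \cdot \vec{s}_i \cdot \vec{s}_j, \quad J_{ij} \in \mathbb{R}, \quad \vec{s}_i, \vec{s}_j \in \mathbb{R}^3$$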


The Edwards-Anderson (EA) Model

The system variables are spins (±1), arranged in a D-dimensional (usually D = 3) lattice of size L.

Each spin $s_i$ interacts only with its nearest neighbours

Each pair of neighbouring spins $(s_i, s_j)$ shares a coupling term $J_{ij}$

The energy of a configuration {S} is computed as:

$$E(\{S\}) = \sum_{\langle ij \rangle} J_{ij} s_i s_j$$

Each configuration {S} has a probability given by the Boltzmann factor:

$$P(\{S\}) \propto e^{-E(\{S\})/kT}$$

Averages of macroscopic observables (e.g. the magnetization) are defined as:

$$\langle M \rangle = \sum_{\{S\}} M(\{S\}) P(\{S\}) \quad \text{where} \quad M(\{S\}) = \sum_i s_i$$


Spin Glass Monte Carlo Algorithms

A lattice of size L has $2^{L^3}$ different configurations (e.g. L = 80 ⇒ $2^{80^3}$)

it is practically impossible to generate all configurations

not all configurations have the same probability, nor are they equally important.

Monte Carlo algorithms, like Metropolis and Heatbath, are adopted:

configurations are generated according to their probability

observable averages are computed as unweighted sums over the Monte Carlo generated configurations:

$$\langle M \rangle \sim \sum_i M(\{S^{MC}_i\})$$


Metropolis Algorithm for EA

Require: set of {S} and {J}
loop   // loop on Monte Carlo steps
  for all s_i ∈ {S} do
    s'_i = (s_i == 1) ? −1 : 1   // tentatively flip the value of s_i
    ΔE = Σ_⟨ij⟩ [ (J_ij · s'_i · s_j) − (J_ij · s_i · s_j) ]   // compute energy change
    if ΔE ≤ 0 then
      s_i = s'_i   // accept new value of s_i
    else
      ρ = rnd()   // compute a random number 0 ≤ ρ ≤ 1, ρ ∈ Q
      if ρ < e^(−βΔE) then   // β = 1/T, T = temperature
        s_i = s'_i   // accept new value of s_i
      end if
    end if
  end for
end loop
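A plain scalar C version of the pseudocode above might look like the sketch below, for the binary EA model on a small periodic 3D lattice. Everything not in the pseudocode is an assumption made for illustration: the names `ea_t`, `local_field`, `energy` and `metropolis_sweep`, the lattice side of 8, the storage of one coupling per site and positive direction, the use of `rand()`, and the acceptance table `pacc` that replaces the explicit exponential (with ±1 couplings and six neighbours, ΔE can only be a multiple of 4 between −12 and 12).

```c
#include <assert.h>
#include <stdlib.h>

#define LSIDE  8                          /* illustrative lattice side */
#define NSITES (LSIDE * LSIDE * LSIDE)
/* site index with periodic wrap-around; valid for coordinates in [-1, LSIDE] */
#define IDX(x, y, z) \
    ((((((x) + LSIDE) % LSIDE) * LSIDE) + (((y) + LSIDE) % LSIDE)) * LSIDE \
     + (((z) + LSIDE) % LSIDE))

/* s[i] in {-1,+1}; J[d][i] in {-1,+1} couples site i to its neighbour along +d */
typedef struct { signed char s[NSITES]; signed char J[3][NSITES]; } ea_t;

/* sum over the six nearest neighbours j of J_ij * s_j */
static int local_field(const ea_t *m, int x, int y, int z)
{
    int i = IDX(x, y, z);
    int xm = IDX(x - 1, y, z), ym = IDX(x, y - 1, z), zm = IDX(x, y, z - 1);
    return m->J[0][i] * m->s[IDX(x + 1, y, z)] + m->J[0][xm] * m->s[xm]
         + m->J[1][i] * m->s[IDX(x, y + 1, z)] + m->J[1][ym] * m->s[ym]
         + m->J[2][i] * m->s[IDX(x, y, z + 1)] + m->J[2][zm] * m->s[zm];
}

/* E = sum over bonds of J_ij s_i s_j, each bond counted once */
int energy(const ea_t *m)
{
    int e = 0;
    for (int x = 0; x < LSIDE; x++)
        for (int y = 0; y < LSIDE; y++)
            for (int z = 0; z < LSIDE; z++) {
                int i = IDX(x, y, z);
                e += m->s[i] * (m->J[0][i] * m->s[IDX(x + 1, y, z)]
                              + m->J[1][i] * m->s[IDX(x, y + 1, z)]
                              + m->J[2][i] * m->s[IDX(x, y, z + 1)]);
            }
    return e;
}

/* One sequential sweep over all spins. pacc[k] must hold exp(-beta * 4k)
 * for k = 1..3 (pacc[0] is never read). */
void metropolis_sweep(ea_t *m, const double pacc[4])
{
    for (int x = 0; x < LSIDE; x++)
        for (int y = 0; y < LSIDE; y++)
            for (int z = 0; z < LSIDE; z++) {
                int i = IDX(x, y, z);
                /* flipping s_i changes E by (s'_i - s_i) * sum_j J_ij s_j */
                int dE = -2 * m->s[i] * local_field(m, x, y, z);
                assert(dE % 4 == 0 && dE >= -12 && dE <= 12);
                if (dE <= 0 || rand() / (RAND_MAX + 1.0) < pacc[dE / 4])
                    m->s[i] = (signed char)-m->s[i];   /* accept the flip */
            }
}
```

A caller would fill `pacc[k]` with `exp(-beta * 4 * k)` once per temperature and then call `metropolis_sweep` repeatedly; setting the three entries to zero gives a zero-temperature sweep that never raises the energy.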


Spin Glass Simulation is Computationally Challenging

$$E(\{S\}) = -\sum_{\langle ij \rangle} J_{ij} s_i s_j, \quad s_i, s_j \in \{+1,-1\}, \quad J_{ij} \in \{+1,-1\}$$

Frustration effects make:

the energy function landscape corrugated

the approach to thermal equilibrium a slowly converging process.


Spin Glass is Computationally Challenging

To bring a lattice of size L = 48 . . . 128 to thermal equilibrium, the typical steps of a state-of-the-art simulation campaign are:

simulation of hundreds (thousands) of systems, called samples, with different initial values of spins and couplings,

for each sample the simulation is repeated 2-4 times with different initial spin values (coupling values kept fixed); these copies are called replicas.

Each simulation may require $10^{12} \ldots 10^{13}$ Monte Carlo update steps.

$80^3 \times 10\ \text{ns} \times 10^{11}$ MC-steps ≈ 16 years

Exploiting parallelism is necessary.


Parallel Simulation of Spin Glass

Several levels of parallelism can be exploited in Monte Carlo Spin Glass simulations.

The lattice can be divided in a checkerboard scheme: the algorithm is first applied to all white spins, and then to all black ones (the order is irrelevant).

SIMD instructions can be used to update up to $V \le L^3/2$ (white or black) spins in parallel (internal parallelism).

The lattice can be divided into several sub-lattices allocated to different cores. Boundaries need to be updated after updating the bulk (internal parallelism).

Several lattices (samples or replicas) can be simulated in parallel using the multispin-coding approach (external parallelism).
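The checkerboard scheme works because spins of the same colour never interact with each other, so all of them can be updated at once. The sketch below uses the usual parity colouring and checks that property on a periodic cubic lattice; the function names are hypothetical and the parity rule is an assumption, since the slides do not state how colours are assigned.

```c
#include <assert.h>

/* 0 = white, 1 = black: parity of the coordinate sum. */
static int colour(int x, int y, int z)
{
    return (x + y + z) & 1;
}

/* Returns 1 if every site of an l*l*l periodic lattice has nearest
 * neighbours of the opposite colour only, 0 otherwise. Checking the +1
 * neighbour in each direction for all sites also covers the -1 ones. */
int checkerboard_ok(int l)
{
    assert(l > 1);
    for (int x = 0; x < l; x++)
        for (int y = 0; y < l; y++)
            for (int z = 0; z < l; z++) {
                int c = colour(x, y, z);
                if (colour((x + 1) % l, y, z) == c ||
                    colour(x, (y + 1) % l, z) == c ||
                    colour(x, y, (z + 1) % l) == c)
                    return 0;
            }
    return 1;
}
```

With periodic boundaries the parity colouring is only consistent for even lattice sizes, because for odd sizes the wrap-around links two sites of the same colour.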


Multispin Encoding (1)

Multispin encoding (for the EA model) allows several systems to be simulated in parallel.

Assuming the simulation runs on a k-bit architecture (k = 32, 64, 128, 256, 512):

spins and couplings are represented by binary values {0,1}

a k-bit architectural word hosts k spins of k different systems

the Metropolis update procedure can be coded bit-wise (no conditional statements, only bit-wise operations)

Require: ρ pseudo-random number
Require: ψ = int(−(1/4β) log ρ), encoded on two bits
Require: η = (not X_i), encoded on two bits

c1 = (ψ[0] and η[0])
c2 = (ψ[1] and η[1]) or ((ψ[1] or η[1]) and c1)

s'_i = s_i xor (c2 or not X_i[2])   // update value of spin s_i
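The slides do not show how the multi-bit quantity X_i is obtained. A common technique in multispin coding, assumed here purely for illustration, is to compute one bit per bond with XOR and then add the six neighbour bits with a bit-sliced counter, so that 64 independent systems are processed by each 64-bit operation. The names `bond` and `bitslice_count6`, and the mapping of bit b to the spin value (−1)^b, are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per system: with bit b standing for the value (-1)^b, the
 * result bit is 1 exactly where the product J_ij * s_i * s_j is -1. */
static uint64_t bond(uint64_t si, uint64_t sj, uint64_t jij)
{
    return si ^ sj ^ jij;
}

/* Bit-sliced sum of six 1-bit inputs per lane: for each of the 64 lanes,
 * the bits (x[2] x[1] x[0]) form the 3-bit count of ones among in[0..5].
 * Only bit-wise operations are used, no conditional on the data. */
void bitslice_count6(const uint64_t in[6], uint64_t x[3])
{
    x[0] = x[1] = x[2] = 0;
    for (int n = 0; n < 6; n++) {
        uint64_t c0 = x[0] & in[n];   /* carry out of bit 0 */
        x[0] ^= in[n];
        uint64_t c1 = x[1] & c0;      /* carry out of bit 1 */
        x[1] ^= c0;
        x[2] ^= c1;                   /* the count is at most 6: no overflow */
    }
}
```

A caller would fill `in[n]` with `bond(...)` for each of the six neighbours of a site and then combine the three resulting bit-planes with the random bits, along the lines of the c1, c2 expressions above.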


Multispin Encoding (2)

We enhanced the multispin encoding approach by combining it with SIMD instructions, to exploit both internal and external parallelism.

the 512-bit SIMD-word is divided in V = 8 . . . 512 slots

each slot hosts one spin-value per system

each slot hosts w spin-values, one for each of w different lattices.

V = internal-parallelism degree, w = external-parallelism degree.


Random Number Generation

At each MC-step V (pseudo-)random numbers are needed.

The same random value can be shared among the w lattice-replicas.

The Parisi-Rapuano generator is a popular choice for Spin Glass simulations:

WHEEL[K] = WHEEL[K-24] + WHEEL[K-55]

ρ = WHEEL[K] ⊕ WHEEL[K-61]

WHEEL is an array of unsigned integers

SIMD instructions can be used to generate several random numbers in parallel.
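A scalar sketch of the recurrence above is given below. The 256-entry wheel indexed by an 8-bit counter, the type and function names, and in particular the seeding (a 64-bit linear congruential generator used as a placeholder) are assumptions: the slides specify only the two update formulas.

```c
#include <assert.h>
#include <stdint.h>

/* Wheel of 256 entries indexed modulo 256 by an 8-bit counter, so that
 * K-24, K-55 and K-61 wrap around automatically. */
typedef struct { uint32_t wheel[256]; uint8_t k; } pr_rng_t;

/* Placeholder seeding, NOT taken from the slides: fills the wheel with
 * the high 32 bits of a 64-bit linear congruential generator. */
void pr_init(pr_rng_t *g, uint64_t seed)
{
    assert(g != 0);
    for (int i = 0; i < 256; i++) {
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
        g->wheel[i] = (uint32_t)(seed >> 32);
    }
    g->k = 0;
}

/* WHEEL[K] = WHEEL[K-24] + WHEEL[K-55]; rho = WHEEL[K] xor WHEEL[K-61] */
uint32_t pr_next(pr_rng_t *g)
{
    uint8_t k = g->k++;
    g->wheel[k] = g->wheel[(uint8_t)(k - 24)] + g->wheel[(uint8_t)(k - 55)];
    return g->wheel[k] ^ g->wheel[(uint8_t)(k - 61)];
}
```

A value in [0, 1) can be obtained by multiplying the returned integer by 2^-32. A SIMD version would keep V such wheels side by side, one per vector lane, and apply the same two operations to whole vectors.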


Spin Glass Simulation on MIC

The lattice is split into C (number of cores) sub-lattices of contiguous planes, and each one (of L × L × L/C sites) is mapped on a different core.

each core first updates all the white spins and then all the black ones

w/b spins are stored in half-plane data-structures (of $L^2/2$ spins)

update the boundary half-planes (indexes (0) and ((L^3/C) − 1))
for all i ∈ [1 .. ((L^3/C) − 2)] do
  update half-plane (i)
end for
exchange half-plane (0) with the previous core and half-plane ((L^3/C) − 1) with the next core

This approach requires only data exchange across the cores.


Results

3D Ising spin-glass model, SMSC (ps/spin)

L     Janus-SP   2NH         CBE          C1060   2SB          GPU1   GPU2   Xeon-Phi
16    16         980 (8T)    830 (08T)    1170    330 (08T)    –      –      –
32    16         260 (8T)    260 (16T)    1240    220 (16T)    780    770    310 (60T)
48    16         340 (8T)    250 (16T)    1100    160 (16T)    570    390    250 (180T)
64    16         200 (8T)    150 (16T)    720     70 (16T)     430    230    52 (240T)
80    16         340 (8T)    820 (08T)    880     120 (16T)    450    230    110 (180T)
96    –          200 (8T)    410 (16T)    860     60 (16T)     420    200    65 (180T)
128   –          200 (8T)    120 (16T)    640     60 (16T)     420    200    25 (240T)
160   –          –           –            –       70 (16T)     370    160    65 (180T)
192   –          –           –            –       59 (16T)     410    180    41 (180T)
224   –          –           –            –       70 (16T)     420    200    60 (240T)
256   –          –           –            –       180 (16T)    380    160    24 (240T)

