Optimizing Lattice Boltzmann and Spin Glass codes
S. F. Schifano
University of Ferrara and INFN-Ferrara
PRACE Summer School
Enabling Applications on Intel MIC based Parallel Architectures
July 8-11, 2013
Casalecchio di Reno, Bologna, Italy
S. F. Schifano (Univ. and INFN of Ferrara) Optimizing LB and SG codes July 8-11, 2013 1 / 41
The Lattice Boltzmann Method
The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods.
It simulates a synthetic dynamics described by the discrete Boltzmann equation, instead of the Navier-Stokes equations.
The key idea:
- a set of virtual particles, called populations, is arranged at the edges of a discrete and regular grid
- interacting by propagation and collision, they reproduce, after appropriate averaging, the dynamics of fluids.
Relevant features:
“Easy” to implement complex physics.
Good computational efficiency on massively parallel architectures (MPAs).
The D2Q37 Lattice Boltzmann Model
Correct treatment of:
- Navier-Stokes equations of motion
- heat transport equations
- perfect gas state equation (P = ρT)
D2 model with 37 components of velocity
Suitable to study the behaviour of compressible gases and fluids, optionally in the presence of combustion effects (chemical reactions turning a cold mixture of reactants into a hot mixture of burnt products).
LBM Computational Scheme
foreach time-step
  foreach lattice-point
    propagate();
    collide();
  endfor
endfor
Embarrassing parallelism: all sites can be processed in parallel, applying propagate and collide in sequence.
Challenge: an efficient implementation on computing systems, able to exploit a large fraction of peak performance.
D2Q37 propagation scheme
Gather 37 populations from 37 different lattice-sites.
D2Q37 propagation
applies to each lattice-cell,
requires access to cells at distance 1, 2, and 3,
gathers at the center point the populations located at the edges of the arrows,
performs memory accesses with sparse addressing patterns.
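The gather pattern described above can be sketched in C on a toy model. This is a minimal sketch under stated assumptions, not the code used for D2Q37: the 5-entry velocity set (`CX`, `CY`), the lattice size `LX` x `LY`, the type `site_t` and the helper `idx` are all hypothetical, and periodic wrap-around is applied in both directions only to keep the example short. The real D2Q37 stencil has 37 entries reaching up to distance 3.

```c
#include <stddef.h>

#define NPOP 5   /* hypothetical toy stencil; D2Q37 has 37 populations */
#define LX   8   /* hypothetical lattice size */
#define LY   8

typedef struct { double p[NPOP]; } site_t;

/* Hypothetical velocity set (rest + 4 nearest neighbours), NOT the D2Q37 one. */
static const int CX[NPOP] = { 0, 1, -1, 0,  0 };
static const int CY[NPOP] = { 0, 0,  0, 1, -1 };

/* Linear index of site (x, y); y runs fastest in memory. */
static size_t idx(int x, int y) { return (size_t)x * LY + (size_t)y; }

/* Gather step: population k of site (x, y) is read from the site it
 * streams from, (x - CX[k], y - CY[k]). Each population comes from a
 * different site, hence the sparse addressing pattern. */
void propagate(const site_t *prv, site_t *nxt)
{
    for (int x = 0; x < LX; x++)
        for (int y = 0; y < LY; y++)
            for (int k = 0; k < NPOP; k++) {
                int xs = (x - CX[k] + LX) % LX;   /* periodic wrap (assumption) */
                int ys = (y - CY[k] + LY) % LY;
                nxt[idx(x, y)].p[k] = prv[idx(xs, ys)].p[k];
            }
}
```

Reading from `prv` and writing to `nxt` keeps every site update independent of the others, so the loops can be distributed freely over threads.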
D2Q37: boundary-conditions
we simulate a 2D lattice with periodic boundaries along the x-direction
at the top and the bottom, boundary conditions are enforced:
- to adjust some values at sites y = 0 . . . 2 and y = Ny − 3 . . . Ny − 1
- e.g. set vertical velocity to zero
This step (bc) is computed before the collision step.
D2Q37 collision
collision is computed at each lattice-cell
computationally intensive: for the D2Q37 model it requires > 7600 DP operations per lattice-cell
completely local: arithmetic operations require only the populations associated with the site
Xeon-Phi 5110P co-processor
#cores               61
frequency            1090 MHz
memory               8 GB GDDR5
L1-cache / core      32 KB
L2-cache / core      512 KB
Peak Perf. SP/DP     ≈ 2/1 TFlops
Peak Memory Bw       320 GB/s
yet another accelerator: PCIe 16x Gen2 card (8 GB/s)
61 Pentium-based CPUs
in-order execution, 1-4 threads per core
256 KB L2-cache blocks are shared among the cores, total 8 GB
vector FPU with 512-bit SIMD instructions (AVX-like; the Knights Corner instruction set is formally called IMCI)
MIC Programming Model
native: log in to the Linux OS running on the card and run a program built with: icc -mmic pippo.c -o pippo
offload: use appropriate pragmas to mark the code that will be transparently executed on the MIC board
Programming is well integrated with many languages:
OpenMP
TBB
Cilk
. . .
Offload programming (fine-grain)
#include <stdlib.h>

#define N 1717

/* fill A and B with random values, clear C */
void vinit(double *A, double *B, double *C) {
  int i;
  for (i = 0; i < N; i++) {
    A[i] = drand48(); B[i] = drand48(); C[i] = 0.0;
  }
}

int main() {
  double A[N], B[N], C[N];
  double s;
  int i;

  srand48(1);      /* seed value chosen for illustration */
  vinit(A, B, C);
  s = drand48();

  /* copy A, B and s to the card, run the loop there, copy C back */
  #pragma offload target(mic:-1) in(A, B : length(N)) in(s) inout(C : length(N))
  {
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++)
      C[i] = s * A[i] + B[i];
  }
  . . .
}
Offload programming (coarse-grain)
offload a code that spawns several threads
use either Pthreads or OpenMP

/* Pthreads version */
for (t = 0; t < NTHREAD; t++) {
  pthread_create(&threads[t], NULL, threadFunc, (void *) &tData[t]);
}
for (t = 0; t < NTHREAD; t++) {
  pthread_join(threads[t], NULL);
}

/* OpenMP version */
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  threadFunc((void *) &targv[tid]);
}
Our implementation uses this approach.
Relevant optimizations for performance
Applications running on Xeon-Phi can approach peak performance if the code exploits the relevant hardware features:
core parallelism: all cores have to be kept active and working in parallel, e.g. running different functions or working on different data-sets (MIMD/multi-task or SPMD parallelism);
hyper-threading: each core has to execute at least 2, and up to 4, threads to keep the hardware pipelines busy and hide memory-access latency;
vector programming: each core has to process its data-set using vector (streaming) instructions (SIMD parallelism); on Xeon-Phi up to 8 double-precision values can be processed by each vector instruction.
AVX512 Memory Copy Benchmark
#define _vload_pd _mm512_load_pd

#if defined USE_STORENR
#define _vstore_pd _mm512_storenr_pd
#elif defined USE_STORENRNGO
#define _vstore_pd _mm512_storenrngo_pd
#elif defined USE_STORE
#define _vstore_pd _mm512_store_pd
#endif

__m512d A[N], B[N];

#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  th_func((void *) &targv[tid]);
}

// code executed by each thread
void th_func(void *targv) {
  for (i = 0; i < L; i++)
    _vstore_pd(B+i, _vload_pd(A+i));   // load/store variants selected by the macros above
}
The portions of the vectors used by different threads are kept at a minimum distance of 4096 bytes (1 page), so that two threads never access the same TLB entry.
AVX512 Memory Copy Benchmark: Native VS Offload
Possible TLB conflicts? Try to set the PHI_USE_2MB_BUFFERS env var.
Memory Copy: Scalar VS Vector
Vector copy and the generated assembly:

#pragma omp parallel for private(i)
for (i = 0; i < (NTHR*S); i++) {
  __asm("#BEGIN");
  _vcpy(d+i, a+i);
  __asm("#END");
}

# Begin ASM
#BEGIN
# End ASM
..B2.50:
        movq      (%r14), %rdx
        movq      (%r15), %rcx
        vmovapd   (%rdx,%rbx), %zmm0
        nop
        vmovapd   %zmm0, (%rcx,%rbx)
..B2.49:
# Begin ASM
#END
# End ASM

Scalar copy and the generated assembly:

#pragma omp parallel for private(i)
for (i = 0; i < (NTHR*S*8); i++) {
  __asm("#BEGIN");
  d[i] = a[i];
  __asm("#END");
}

# Begin ASM
#BEGIN
# End ASM
..B2.52:
        movq      (%r14), %rax
        movq      (%r12), %rcx
        movq      (%rax,%r13,8), %rdx
        movq      %rdx, (%rcx,%r13,8)
..B2.51:
# Begin ASM
#END
# End ASM
Implementation: memory layout AoS vs SoA
lattice stored as AoS: exploits cache-locality of the populations associated with each site; relevant for computing the collision;

typedef struct {
  double p0;    // population 1
  . . .
  double p36;   // population 37
} pop_t;

pop_t lattice2D[SIZEX*SIZEY];

lattice stored as SoA: exploits data-locality of corresponding populations of different sites.

typedef struct {
  double p0[SIZEX*SIZEY];    // array of population 1
  . . .
  double p36[SIZEX*SIZEY];   // array of population 37
} pop_t;

pop_t lattice2D;

We have used the AoS scheme, and two copies of the lattice are kept in memory: each step reads from prv and writes to nxt.
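The difference between the two layouts shows up in the memory stride between consecutive accesses. The sketch below computes byte offsets for both schemes; the function names `aos_offset` and `soa_offset` are hypothetical, and it assumes 37 tightly packed `double` populations per site.

```c
#include <stddef.h>

#define NPOP 37   /* populations per site in D2Q37 */

/* AoS: the 37 populations of one site are contiguous in memory. */
size_t aos_offset(size_t site, size_t k)
{
    return (site * NPOP + k) * sizeof(double);
}

/* SoA: population k of all sites is contiguous in memory. */
size_t soa_offset(size_t site, size_t k, size_t nsites)
{
    return (k * nsites + site) * sizeof(double);
}
```

In AoS, moving to the next population of the same site advances by 8 bytes, while moving to the same population of the next site advances by 37 x 8 = 296 bytes; in SoA the two strides are swapped in role. This is why AoS suits collide, which touches all populations of one site, while SoA suits accesses to the same population across neighbouring sites.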
Implementation: code optimizations
core parallelism:
- each core runs several threads
- 60-240 threads are run, depending on the size of the lattice
thread parallelism:
- the lattice is split over the threads along the X-dimension
- each thread processes a portion of the lattice
vector parallelism:
- threads process 8 lattice-sites in parallel
- exploiting AVX vector instructions
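Splitting the lattice along X amounts to giving each thread a contiguous range of columns. The following is a minimal sketch of one possible balanced split; the type `range_t` and the function `thread_range` are hypothetical names, not taken from the original code.

```c
/* Half-open range [begin, end) of X-columns owned by one thread. */
typedef struct { int begin; int end; } range_t;

/* Split lx columns over nthr threads as evenly as possible:
 * the first (lx % nthr) threads receive one extra column. */
range_t thread_range(int tid, int nthr, int lx)
{
    int base = lx / nthr;
    int rem  = lx % nthr;
    range_t r;
    r.begin = tid * base + (tid < rem ? tid : rem);
    r.end   = r.begin + base + (tid < rem ? 1 : 0);
    return r;
}
```

For example, 10 columns over 4 threads give the ranges [0,3), [3,6), [6,8) and [8,10). With vector parallelism on top, each range would typically be chosen as a multiple of the 8 sites processed per vector instruction.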
Implementation: core parallelism
for (step = 0; step < MAXSTEP; step++) {

  if (tid == 0 || tid == NTHR-1) {
    comm();        // exchange borders
    propagate();   // apply propagate to left- and right-border
  } else {
    propagate();   // apply propagate to the inner part
  }

  pthread_barrier_wait(...);

  if (tid == 0)
    bc();          // apply bc() to the three upper row-cells

  if (tid == 1)
    bc();          // apply bc() to the three lower row-cells

  pthread_barrier_wait(...);

  collide();       // compute collide()

  pthread_barrier_wait(...);
}
Implementation: vector programming
Components of 8 lattice-cells are packed in an AVX vector of 8 doubles

typedef struct {
  __m512d vp0;
  __m512d vp1;
  __m512d vp2;
  . . .
  __m512d vp36;
} vpop_t;

vpop_t lattice[LX][LY];

Intrinsics: d = a × b + c  ⟹  d = _mm512_fmadd_pd(a, b, c)
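The effect of the fused multiply-add intrinsic on a packed vector can be illustrated with a portable stand-in. The sketch below is an assumption-laden emulation, not the intrinsic itself: `vec8d` and `vec_fmadd` are hypothetical names, the loop runs lane by lane in scalar code, and the multiply and add are not guaranteed to be fused into a single rounding as they are in hardware.

```c
#define VLEN 8   /* doubles per 512-bit vector */

/* Portable stand-in for a 512-bit vector: lane l holds the value
 * belonging to the l-th of the 8 lattice-cells packed together. */
typedef struct { double v[VLEN]; } vec8d;

/* Lane-wise d = a * b + c, the operation performed by the
 * fused multiply-add intrinsic on all 8 lanes at once. */
vec8d vec_fmadd(vec8d a, vec8d b, vec8d c)
{
    vec8d d;
    for (int l = 0; l < VLEN; l++)
        d.v[l] = a.v[l] * b.v[l] + c.v[l];
    return d;
}
```

Because every lane belongs to a different lattice-cell, the collide arithmetic written for one cell carries over unchanged: each scalar operation becomes one vector operation acting on 8 cells.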
Results: Propagate Performance
STORENRNGO: stores an AVX vector directly to memory, avoiding loading it into cache.
Results: Collide Performance
Comparisons
Lattice size: ≈ 4 M-cells
                    C2050   2-WS   2-SB   KNC
propagate (GB/s)      84    17.5     60    94
E                    58%     29%    70%   29%
collide (GF/s)       205      88    220   394
E                    41%     55%    63%   37%
ξ (collide)           NA    1.19   1.27  0.76
NVIDIA Tesla C2050, ≈ 500 GF DP, ≈ 144 GB/s peak (PARCFD’11)
2-WS: Intel dual 6-core (Westmere), ≈ 160 GF DP, ≈ 60 GB/s peak (ICCS’11)
2-SB: Intel dual 8-core (Sandybridge), ≈ 345 GF DP, ≈ 85.3 GB/s peak (CCP’12)
$\xi = \dfrac{P}{N_c \times v \times f}$
Simulation of the Rayleigh-Taylor (RT) Instability
Instability at the interface of two fluids of different densities, triggered by gravity.
A cold, dense fluid lying over a less dense, warmer fluid triggers an instability that mixes the two fluid regions (until equilibrium is reached).
Acknowledgments
Joint work of a team of physicists and computer scientists:
Luca Biferale, Mauro Sbragaglia, Patrizio Ripesi: University of Tor Vergata and INFN Roma, Italy
Andrea Scagliarini: University of Barcelona, Spain
Filippo Mantovani: University of Regensburg, Germany
Gianluca Crimi, Marcello Pivanti, Sebastiano Fabio Schifano, Raffaele Tripiccione: University and INFN of Ferrara, Italy
Federico Toschi: Eindhoven University of Technology, The Netherlands, and CNR-IAC, Roma, Italy
This work was performed in the framework of the INFN COKA and SUMA projects.
We would like to thank the CINECA, INFN-CNAF and JSC institutes for access to their systems.
Spin-Glass
The spin glass is a statistical model used to study some behaviours of complex macroscopic systems, such as disordered magnetic materials.
An apparently trivial generalization of the ferromagnet model.
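Spin-Glass Models
Ising Model
$E(\{S\}) = -J \sum_{\langle ij \rangle} s_i \cdot s_j, \quad J > 0, \quad s_i, s_j \in \{-1, +1\}$
Edwards Anderson Model (Binary)
$E(\{S\}) = \sum_{\langle ij \rangle} J_{ij} \cdot s_i \cdot s_j, \quad J_{ij}, s_i, s_j \in \{-1, +1\}$
Edwards Anderson Model (Gaussian)
$E(\{S\}) = \sum_{\langle ij \rangle} J_{ij} \cdot s_i \cdot s_j, \quad J_{ij} \in \mathbb{R}, \quad s_i, s_j \in \{-1, +1\}$
Heisenberg Model
$E(\{S\}) = \sum_{\langle ij \rangle} J_{ij} \cdot \vec{s}_i \cdot \vec{s}_j, \quad J_{ij} \in \mathbb{R}, \quad \vec{s}_i, \vec{s}_j \in \mathbb{R}^3$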
The Edwards-Anderson (EA) Model
The system variables are spins (±1), arranged in a D-dimensional (usually D = 3) lattice of size L.
Each spin $s_i$ interacts only with its nearest neighbours.
Each pair of neighbouring spins $(s_i, s_j)$ shares a coupling term $J_{ij}$.
The energy of a configuration {S} is computed as:
$E(\{S\}) = \sum_{\langle ij \rangle} J_{ij} s_i s_j$
Each configuration {S} has a probability given by the Boltzmann factor:
$P(\{S\}) \propto e^{-E(\{S\})/kT}$
Averages of macroscopic observables (e.g. the magnetization) are defined as:
$\langle M \rangle = \sum_{\{S\}} M(\{S\}) P(\{S\})$ where $M(\{S\}) = \sum_i s_i$
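The energy and magnetization sums above translate directly into loops over the lattice. The sketch below is illustrative only: it assumes a small periodic 3D lattice of side `LSIZE`, stores for each site the three couplings towards its +x, +y and +z neighbours (so every bond is counted once), and uses the sign convention of the energy formula as written on this slide. All names (`LSIZE`, `site`, `ea_energy`, `magnetization`) are hypothetical.

```c
#include <stddef.h>

#define LSIZE 4   /* hypothetical lattice side, periodic boundaries */

/* Linear index of site (x, y, z). */
static size_t site(int x, int y, int z)
{
    return ((size_t)x * LSIZE + (size_t)y) * LSIZE + (size_t)z;
}

/* E = sum over nearest-neighbour bonds of J_ij * s_i * s_j.
 * jx[i], jy[i], jz[i] couple site i with its +x, +y, +z neighbour. */
int ea_energy(const signed char *s, const signed char *jx,
              const signed char *jy, const signed char *jz)
{
    int e = 0;
    for (int x = 0; x < LSIZE; x++)
        for (int y = 0; y < LSIZE; y++)
            for (int z = 0; z < LSIZE; z++) {
                size_t i = site(x, y, z);
                e += jx[i] * s[i] * s[site((x + 1) % LSIZE, y, z)];
                e += jy[i] * s[i] * s[site(x, (y + 1) % LSIZE, z)];
                e += jz[i] * s[i] * s[site(x, y, (z + 1) % LSIZE)];
            }
    return e;
}

/* M = sum over all sites of s_i. */
int magnetization(const signed char *s)
{
    int m = 0;
    for (size_t i = 0; i < (size_t)LSIZE * LSIZE * LSIZE; i++)
        m += s[i];
    return m;
}
```

A lattice of side 4 has 3 x 64 = 192 bonds, so with all spins and all couplings equal to +1 the function returns 192; flipping a single spin changes the sign of its six bonds and lowers the result by 12.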
Spin Glass Monte Carlo Algorithms
A lattice of size L has $2^{L^3}$ different configurations (e.g. L = 80 ⇒ $2^{80^3}$)
it is practically impossible to generate all configurations
not all configurations have the same probability, so they are not equally important.
Monte Carlo algorithms, like Metropolis and Heatbath, are adopted:
configurations are generated according to their probability
averages of observables are computed as unweighted sums over the Monte Carlo generated configurations:
$\langle M \rangle \sim \sum_i M(\{S_i^{MC}\})$
Metropolis Algorithm for EA
Require: set of {S} and {J}
loop                                    // loop on Monte Carlo steps
  for all s_i ∈ {S} do
    s'_i = (s_i == 1) ? −1 : 1          // tentatively flip the value of s_i
    ΔE = ∑⟨ij⟩ (J_ij · s'_i · s_j) − (J_ij · s_i · s_j)   // compute energy change
    if ΔE ≤ 0 then
      s_i = s'_i                        // accept new value of s_i
    else
      ρ = rnd()                         // compute a random number 0 ≤ ρ ≤ 1, ρ ∈ Q
      if ρ < e^(−βΔE) then              // β = 1/T, T = Temperature
        s_i = s'_i                      // accept new value of s_i
      end if
    end if
  end for
end loop
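The algorithm above can be sketched in C on a deliberately small system. This is an illustration under stated assumptions, not the production code: it uses a 1D ring of n spins instead of a 3D lattice, the sign convention E = ∑ J_ij s_i s_j of the pseudocode, the C library `rand()` as random source, and a precomputed acceptance probability `p_accept4` (equal to e^(−4β)) passed in by the caller, since on a ring the only positive energy change is ΔE = 4. The names `ring_energy` and `metropolis_sweep` are hypothetical.

```c
#include <stdlib.h>

/* Energy of a ring of n spins; J[i] couples site i with site (i+1) % n. */
int ring_energy(const int *s, const int *J, int n)
{
    int e = 0;
    for (int i = 0; i < n; i++)
        e += J[i] * s[i] * s[(i + 1) % n];
    return e;
}

/* One Metropolis sweep: every spin is visited once.
 * p_accept4 is exp(-4 * beta), precomputed by the caller. */
void metropolis_sweep(int *s, const int *J, int n, double p_accept4)
{
    for (int i = 0; i < n; i++) {
        int left  = (i + n - 1) % n;
        int right = (i + 1) % n;
        /* local field: sum of J_ij * s_j over the two neighbours */
        int h  = J[left] * s[left] + J[i] * s[right];
        int dE = -2 * s[i] * h;            /* energy change if s_i flips */
        if (dE <= 0) {
            s[i] = -s[i];                  /* accept: energy does not grow */
        } else {
            double rho = rand() / ((double)RAND_MAX + 1.0);  /* 0 <= rho < 1 */
            if (rho < p_accept4)
                s[i] = -s[i];              /* accept with Boltzmann probability */
        }
    }
}
```

Passing `p_accept4 = 0.0` corresponds to zero temperature, where a sweep can never raise the energy. Precomputing the few possible acceptance probabilities in a table, instead of calling `exp` for every spin, is a common choice when ΔE takes only a handful of discrete values.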
Spin Glass Simulation is Computationally Challenging
$E(\{S\}) = -\sum_{\langle ij \rangle} J_{ij} s_i s_j, \quad s_i, s_j \in \{+1, -1\}, \quad J_{ij} \in \{+1, -1\}$
Frustration effects make:
the energy function landscape corrugated
the approach to thermal equilibrium a slowly converging process.
Spin-glass is Computationally Challenging
To bring a lattice of size L = 48 . . . 128 to thermal equilibrium, the typical steps of a state-of-the-art simulation campaign are:
simulation of hundreds (thousands) of systems, called samples, with different initial values of spins and couplings,
for each sample the simulation is repeated 2-4 times with different initial spin-values (coupling values kept fixed); these copies are called replicas.
Each simulation may require $10^{12} \ldots 10^{13}$ Monte Carlo update steps.
$80^3 \times 10\ \mathrm{ns} \times 10^{11}$ MC-steps ≈ 16 years
Exploiting parallelism is necessary.
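The 16-year estimate can be checked step by step. The worked arithmetic below assumes that 10 ns is the time to update a single spin and that one MC-step updates every spin of the lattice once; both readings are inferred from the product on the slide rather than stated explicitly.

```latex
\begin{align*}
N_{\text{spins}} &= 80^3 = 5.12 \times 10^{5} \\
t_{\text{step}}  &= N_{\text{spins}} \times 10\,\text{ns} = 5.12 \times 10^{-3}\,\text{s} \\
t_{\text{total}} &= t_{\text{step}} \times 10^{11} = 5.12 \times 10^{8}\,\text{s} \\
                 &\approx \frac{5.12 \times 10^{8}\,\text{s}}{3.15 \times 10^{7}\,\text{s/year}} \approx 16\,\text{years}
\end{align*}
```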
Parallel Simulation of Spin Glass
Several levels of parallelism can be exploited in Monte Carlo Spin Glass simulations.
The lattice can be divided in a checkerboard scheme: the algorithm is first applied to all white spins, and then to all black ones (the order is irrelevant).
SIMD instructions can be used to update up to $V \le L^3/2$ (white or black) spins in parallel (internal parallelism).
The lattice can be divided into several sub-lattices allocated to different cores. Boundaries need to be updated after updating the bulk (internal parallelism).
Several lattices (samples or replicas) can be simulated in parallel using the multispin-coding approach (external parallelism).
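The checkerboard scheme works because all nearest neighbours of a white site are black and vice versa, so spins of one colour can be updated simultaneously without reading a value that is being modified. A minimal sketch of the colouring, with hypothetical function names and a periodic cubic lattice of side n, is:

```c
/* Checkerboard colour of a site: 0 = white, 1 = black. */
int site_colour(int x, int y, int z)
{
    return (x + y + z) & 1;
}

/* Returns 1 if all six nearest neighbours of (x, y, z) on a periodic
 * n x n x n lattice have the opposite colour, 0 otherwise. */
int neighbours_opposite(int x, int y, int z, int n)
{
    int c  = site_colour(x, y, z);
    int xp = (x + 1) % n, xm = (x + n - 1) % n;
    int yp = (y + 1) % n, ym = (y + n - 1) % n;
    int zp = (z + 1) % n, zm = (z + n - 1) % n;
    return site_colour(xp, y, z) != c && site_colour(xm, y, z) != c
        && site_colour(x, yp, z) != c && site_colour(x, ym, z) != c
        && site_colour(x, y, zp) != c && site_colour(x, y, zm) != c;
}
```

The property holds on periodic lattices only when n is even: with an odd side the wrap-around bond joins two sites of the same colour, which is one reason even lattice sizes are convenient for this scheme.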
Multispin Encoding (1)
Multispin encoding (for the EA model) allows several systems to be simulated in parallel.
Assuming the simulation runs on a k-bit architecture (k = 32, 64, 128, 256, 512):
spins and couplings are represented by binary values {0,1}
a k-bit architectural word hosts k spins of k different systems
the Metropolis update procedure can be bit-wise coded (no conditional statements, only bit-wise operations)

Require: ρ pseudo-random number
Require: ψ = int(−(1/4β) log ρ), encoded on two bits
Require: η = (not X_i), encoded on two bits
c1 = (ψ[0] and η[0])
c2 = (ψ[1] and η[1]) or ((ψ[1] or η[1]) and c1)
s'_i = s_i xor (c2 or not X_i[2])     // update value of spin s_i
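A building block of bit-wise coded updates is counting, for many systems at once, how many of the bonds around a spin are unsatisfied. The sketch below shows one way to do it with 64-bit words, where bit lane b of every word belongs to system b. It rests on assumptions not stated on the slide: the value +1 is encoded as bit 0 and −1 as bit 1, so that the product J_ij · s_i · s_j becomes the XOR of the three bits, and the six per-bond bits are summed into a 3-bit counter held in three bit-planes. Whether this counter corresponds to the quantity X_i of the slide is not confirmed by the source; the names `bond_bits`, `count3_t`, `bitsliced_sum6` and `lane_count` are hypothetical.

```c
#include <stdint.h>

/* One bit per system: 1 where the product J_ij * s_i * s_j is -1
 * (encoding assumed here: +1 -> 0, -1 -> 1, product -> XOR). */
uint64_t bond_bits(uint64_t si, uint64_t sj, uint64_t jij)
{
    return si ^ sj ^ jij;
}

/* 3-bit counter per lane, stored as three bit-planes. */
typedef struct { uint64_t b0, b1, b2; } count3_t;

/* Adds six 1-bit-per-lane words into a 3-bit-per-lane counter using
 * only bit-wise operations (ripple-carry increment, no branches). */
count3_t bitsliced_sum6(const uint64_t x[6])
{
    count3_t c = { 0, 0, 0 };
    for (int k = 0; k < 6; k++) {
        uint64_t carry0 = c.b0 & x[k];
        c.b0 ^= x[k];
        uint64_t carry1 = c.b1 & carry0;
        c.b1 ^= carry0;
        c.b2 ^= carry1;
    }
    return c;
}

/* Extracts the counter value (0..6) of one lane, for checking only. */
int lane_count(count3_t c, int lane)
{
    return (int)((c.b0 >> lane) & 1)
         + 2 * (int)((c.b1 >> lane) & 1)
         + 4 * (int)((c.b2 >> lane) & 1);
}
```

The same sequence of AND and XOR operations processes all 64 lanes, which is the essence of multispin coding: the cost of one update is shared among as many systems as there are bits in the word.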
Multispin Encoding (2)
We enhanced the multispin encoding approach by combining it with SIMD instructions, to exploit both internal and external parallelism.
the 512-bit SIMD-word is divided in V = 8 . . . 512 slots
each slot hosts one spin-value of a system
each slot hosts w spin-values of different lattices.
V = internal-parallelism degree, w = external-parallelism degree.
Random Number Generation
At each MC-step V (pseudo-)random numbers are needed.
The same random value can be shared among the w lattice-replicas.
The Parisi-Rapuano generator is a popular choice for Spin Glass simulations:
WHEEL[K] = WHEEL[K-24] + WHEEL[K-55]
ρ = WHEEL[K] ⊕ WHEEL[K-61]
WHEEL is an array of unsigned integers
SIMD instructions can be used to generate several random numbers in parallel.
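The two recurrences above map onto a few lines of C. The sketch below is a scalar illustration with several assumptions: the wheel is a circular buffer of 256 32-bit words indexed by an 8-bit counter, so that K-24, K-55 and K-61 wrap around automatically, and the wheel is filled at start-up by a simple linear congruential generator chosen here only for the example. The names `pr_rng_t`, `pr_seed`, `pr_next` and `pr_uniform` are hypothetical.

```c
#include <stdint.h>

#define WHEEL_SIZE 256   /* assumed size; 8-bit indices wrap modulo 256 */

typedef struct {
    uint32_t wheel[WHEEL_SIZE];
    uint8_t  k;              /* current position K on the wheel */
} pr_rng_t;

/* Fill the wheel with a linear congruential generator
 * (seeding scheme assumed for this example only). */
void pr_seed(pr_rng_t *g, uint32_t seed)
{
    uint32_t x = seed;
    for (int i = 0; i < WHEEL_SIZE; i++) {
        x = 1664525u * x + 1013904223u;
        g->wheel[i] = x;
    }
    g->k = 0;
}

/* WHEEL[K] = WHEEL[K-24] + WHEEL[K-55];  rho = WHEEL[K] xor WHEEL[K-61] */
uint32_t pr_next(pr_rng_t *g)
{
    uint8_t k   = g->k;
    uint8_t k24 = (uint8_t)(k - 24);
    uint8_t k55 = (uint8_t)(k - 55);
    uint8_t k61 = (uint8_t)(k - 61);
    g->wheel[k] = g->wheel[k24] + g->wheel[k55];   /* wraps modulo 2^32 */
    uint32_t rho = g->wheel[k] ^ g->wheel[k61];
    g->k = (uint8_t)(k + 1);
    return rho;
}

/* Uniform double in [0, 1). */
double pr_uniform(pr_rng_t *g)
{
    return pr_next(g) / 4294967296.0;
}
```

A SIMD version would keep one independent wheel per vector lane and apply the same add and XOR to all lanes at once, since the recurrence contains no data-dependent branches.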
Spin Glass Simulation on MIC
The lattice is split in C (number of cores) sub-lattices of contiguous planes, and each one (of L × L × L/C sites) is mapped on a different core.
each core first updates all the white spins and then all the black ones
w/b spins are stored in half-plane data-structures (of L^2/2 spins)

update the boundary half-planes (indexes (0) and ((L^3/C) − 1))
for all i ∈ [1 .. ((L^3/C) − 2)] do
  update half-plane (i)
end for
exchange half-plane (0) with the previous core and half-plane ((L^3/C) − 1) with the next core

This approach requires only data exchange across the cores.
Results
3D Ising spin-glass model, SMSC (ps/spin)

  L   Janus-SP   2NH        CBE         C1060   2SB         GPU1   GPU2   Xeon-Phi
 16   16         980 (8T)   830 (8T)    1170    330 (8T)    –      –      –
 32   16         260 (8T)   260 (16T)   1240    220 (16T)   780    770    310 (60T)
 48   16         340 (8T)   250 (16T)   1100    160 (16T)   570    390    250 (180T)
 64   16         200 (8T)   150 (16T)    720     70 (16T)   430    230     52 (240T)
 80   16         340 (8T)   820 (8T)     880    120 (16T)   450    230    110 (180T)
 96   –          200 (8T)   410 (16T)    860     60 (16T)   420    200     65 (180T)
128   –          200 (8T)   120 (16T)    640     60 (16T)   420    200     25 (240T)
160   –          –          –            –       70 (16T)   370    160     65 (180T)
192   –          –          –            –       59 (16T)   410    180     41 (180T)
224   –          –          –            –       70 (16T)   420    200     60 (240T)
256   –          –          –            –      180 (16T)   380    160     24 (240T)