Parallel Programming Trends in Extremely Scalable Architectures
Carlo Cavazzoni, HPC department, CINECA
CINECA
CINECA is a non-profit consortium, made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).
CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.
Why parallel programming?
Solve larger problems
Run memory demanding codes
Solve problems with greater speed
Modern Parallel Architectures
Two basic architectural schemes:
Distributed Memory
Shared Memory
Now most computers have a mixed architecture
+ accelerators -> hybrid architectures
Distributed Memory
[Figure: six nodes, each with its own CPU and local memory, connected by a network]
Shared Memory
[Figure: several CPUs connected to a single shared memory]
Real Shared
[Figure: CPUs attached through a system bus to shared memory banks]
Virtual Shared
[Figure: six nodes, each with a CPU, a HUB, and local memory, connected by a network; the HUBs provide a single virtual shared memory across the nodes]
Mixed Architectures
[Figure: three nodes, each with two CPUs sharing local memory, connected by a network]
Most Common Networks
• Cube, hypercube, n-cube
• Torus in 1, 2, ..., N dimensions
• Switched: Fat Tree
HPC Trends
Top500
[Chart: number of cores of the no. 1 system in the Top500 list, June 1993 to June 2011; the core count rises into the hundreds of thousands]

Paradigm Change in HPC
What about applications?
The next HPC system installed at CINECA will have 200,000 cores.
Roadmap to Exascale (architectural trends)
Dennard Scaling law (MOSFET)
Classic Dennard scaling:
L' = L / 2
V' = V / 2
F' = F * 2
D' = 1 / L^2 = 4 * D
P' = P

These relations do not hold anymore: the power crisis!

L' = L / 2
V' ≈ V
F' ≈ F * 2
D' = 1 / L^2 = 4 * D
P' = 4 * P
Core frequency and performance no longer grow following Moore's law.

CPU + Accelerator: the way to keep the evolution of architectures on the Moore's law track.
Programming crisis!
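A worked sketch of where the 4x comes from (standard CMOS dynamic-power reasoning, added here, not from the original slides): per-device dynamic power scales as P_dev ∝ C * V^2 * F, with capacitance C ∝ L. Under classic Dennard scaling (L/2, V/2, F*2) each device burns (1/2) * (1/4) * 2 = 1/4 of the power, so at 4x the density the chip power stays constant (P' = P). Once voltage stops scaling (V' ≈ V), each device burns (1/2) * 1 * 2 = 1x the power, and at 4x the density the chip power grows to P' = 4 * P.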
Where are the Watts burnt?

Today (at 40 nm), moving the three 64-bit operands needed to compute a 64-bit floating-point FMA takes 4.7x the energy of the FMA operation itself.

D = A + B * C

Extrapolating down to 10 nm integration, the energy required to move the data becomes 100x!
MPP System
When? 2012
Peak performance: > 2 PFlop/s
Power: > 1 MWatt
Cores: > 150,000
Threads: > 500,000
Architecture: option for BG/Q
Accelerator

A set (one or more) of very simple execution units that can perform a few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni)
[Figure: integration path from CPU + ACC as separate devices, to physical integration (CPU and ACC on one package), to architectural integration (CPU & ACC); moving along the path trades single-thread performance for throughput]
nVIDIA GPU
The Fermi implementation packs 512 processor cores.
ATI FireStream, AMD GPU
2012: the new Graphics Core Next ("GCN") architecture, with a new instruction set and a new SIMD design.
Intel MIC (Knights Ferry)
What about parallel App?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.

With 1,000,000 cores this requires P = 0.999999, i.e. a serial fraction of 0.000001.
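For reference, the full Amdahl formula (standard form, added here for clarity) gives the speedup on N cores as S(N) = 1 / ( (1 − P) + P / N ). Worked example: for P = 0.999999 and N = 1,000,000, S = 1 / (0.000001 + 0.000001) ≈ 500,000, so even a one-in-a-million serial fraction halves the ideal million-fold speedup.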
Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
• Next-generation programming languages and models: Chapel, X10, Fortress
• Languages and paradigms for hardware accelerators: CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
Trends

[Figure: evolution of programming paradigms, from scalar and vector applications through distributed-memory and shared-memory codes to hybrid codes: MPP systems with message passing (MPI), multi-core nodes (OpenMP), accelerators (GPGPU, FPGA) with CUDA and OpenCL]
Message Passing: domain decomposition
[Figure: the global domain is partitioned across six nodes (CPU + local memory), connected by an internal high-performance network]
Ghost Cells - Data exchange
[Figure: a 2D grid split between Processor 1 and Processor 2; each point (i,j) is updated using its neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1); a layer of ghost cells along the sub-domain boundary is exchanged between the processors at every update]
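A minimal sketch in C of the exchange described above, assuming a 1D decomposition in which each rank owns nloc rows of ny points plus one ghost row above and below (the function name, the row layout, and the neighbour ranks up/down are illustrative assumptions, not from the slides):

#include <mpi.h>

/* f is (nloc+2) x ny, row-major; row 0 and row nloc+1 are ghost rows. */
void exchange_ghost_rows( double *f, int nloc, int ny,
                          int up, int down, MPI_Comm comm )
{
    /* Send my first real row up; receive my lower ghost row from below. */
    MPI_Sendrecv( &f[ 1 * ny ],          ny, MPI_DOUBLE, up,   0,
                  &f[ (nloc + 1) * ny ], ny, MPI_DOUBLE, down, 0,
                  comm, MPI_STATUS_IGNORE );
    /* Send my last real row down; receive my upper ghost row from above. */
    MPI_Sendrecv( &f[ nloc * ny ], ny, MPI_DOUBLE, down, 1,
                  &f[ 0 ],         ny, MPI_DOUBLE, up,   1,
                  comm, MPI_STATUS_IGNORE );
}

Boundary ranks can pass MPI_PROC_NULL as the missing neighbour, which turns the corresponding send/receive into a no-op.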
Message Passing: MPI
Main characteristics:
• Library
• Coarse grain
• Inter-node parallelization (few real alternatives)
• Domain partition
• Distributed memory
• Almost all HPC parallel apps

Open issues:
• Latency
• OS jitter
• Scalability
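A minimal, self-contained sketch of this style in C (illustrative, not from the slides): each rank works only on its own partition of the domain, and partial results are combined with an explicit message-passing call.

#include <mpi.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    const int N = 1 << 20;        /* global domain size (assumed; divisible by size) */
    int local_n = N / size;       /* this rank's partition */
    double local_sum = 0.0, global_sum = 0.0;

    /* Distributed memory: each rank touches only its own sub-domain. */
    for ( int i = rank * local_n; i < ( rank + 1 ) * local_n; ++i )
        local_sum += 1.0;         /* stand-in for real work */

    /* Partial results travel explicitly over the network. */
    MPI_Reduce( &local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                0, MPI_COMM_WORLD );
    if ( rank == 0 ) printf( "sum = %.0f\n", global_sum );

    MPI_Finalize();
    return 0;
}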
Shared memory
[Figure: one node with four CPUs sharing one memory; threads 0-3 access shared variables x and y]
Shared Memory: OpenMP
Main characteristics:
• Compiler directives
• Medium grain
• Intra-node parallelization (pthreads)
• Loop or iteration partition
• Shared memory
• Many HPC apps

Open issues:
• Thread creation overhead
• Memory/core affinity
• Interface with MPI
OpenMP

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
Accelerator/GPGPU
Example: element-wise sum of two 1D arrays
CUDA sample

void CPUCode( int* input1, int* input2, int* output, int length ) {
  for ( int i = 0; i < length; ++i ) {
    output[ i ] = input1[ i ] + input2[ i ];
  }
}

__global__
void GPUCode( int* input1, int* input2, int* output, int length ) {
  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  if ( idx < length ) {
    output[ idx ] = input1[ idx ] + input2[ idx ];
  }
}
Each thread executes one loop iteration.
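A host-side launch sketch for the kernel above (the device pointers d_input1, d_input2, d_output and the block size of 256 are assumptions, not from the original slides): launch enough blocks to cover every element; the guard idx < length discards the surplus threads.

int threads = 256;
int blocks  = ( length + threads - 1 ) / threads;  /* round up */
GPUCode<<< blocks, threads >>>( d_input1, d_input2, d_output, length );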
CUDA, OpenCL

Main characteristics:
• Ad-hoc compiler
• Fine grain
• Offload parallelization (GPU)
• Single-iteration parallelization
• Ad-hoc memory
• Few HPC apps

Open issues:
• Memory copies
• Standards
• Tools
• Integration with other languages
Hybrid (MPI + OpenMP + CUDA + … + Python)

• Takes the positives of all models
• Exploits the memory hierarchy
• Many HPC applications are adopting this model
• Mainly due to developer inertia: it is hard to rewrite millions of source lines
Hybrid parallel programming
MPI: domain partition
OpenMP: external loop partition
CUDA: inner-loop iterations assigned to GPU threads
Python: ensemble simulations

Example application: Quantum ESPRESSO (http://www.qe-forge.org/)
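A minimal sketch in C of the first two layers (assumed code, not from Quantum ESPRESSO): MPI partitions the domain across nodes, OpenMP partitions the external loop across the cores of a node; the body of the loop is where a CUDA kernel launch would sit in the full model.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    int provided, rank;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread( &argc, &argv, MPI_THREAD_FUNNELED, &provided );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    int nslices = 8;   /* slices of this rank's sub-domain (assumed) */

    /* OpenMP: external loop partitioned across the node's cores. */
    #pragma omp parallel for
    for ( int i = 0; i < nslices; ++i ) {
        /* In the full model, the inner-loop iterations of slice i
           would be assigned to GPU threads via a CUDA kernel. */
        printf( "rank %d, thread %d: slice %d\n",
                rank, omp_get_thread_num(), i );
    }

    MPI_Finalize();
    return 0;
}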
Storage I/O
• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O
• On-the-fly analysis and statistics
• Disk only for archiving
• Scratch on non-volatile memory ("close to RAM")
PRACE
PRACE Research Infrastructure (www.prace-ri.eu): the top level of the European HPC ecosystem.
The vision of PRACE is to enable and support European global leadership in public and private research and development.
CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system.
[Figure: the European HPC ecosystem pyramid: Tier 0, European systems (PRACE); Tier 1, national systems (CINECA today); Tier 2, local systems]
FERMI @ CINECA
PRACE Tier-0 System

Architecture: 10 BG/Q frames
Model: IBM BG/Q
Processor type: IBM PowerA2, 1.6 GHz
Computing cores: 163,840
Computing nodes: 10,240
RAM: 1 GByte/core
Internal network: 5D torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s
The ISCRA & PRACE calls for projects are now open!
Conclusion
• Exploit millions of ALUs
• Hybrid hardware
• Hybrid codes
• Memory hierarchy
• Flops/Watt (more than Flops/sec)
• I/O subsystem
• Non-volatile memory
• Fault tolerance!
Parallel programming trends in extremely scalable architectures