Parallel Programming Trends in Extremely Scalable Architectures
Carlo Cavazzoni, HPC department, CINECA
CINECA
CINECA is a non-profit consortium, made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).
CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.
Why parallel programming?
Solve larger problems
Run memory demanding codes
Solve problems with greater speed
Modern Parallel Architectures
Two basic architectural schemes:
Distributed Memory
Shared Memory
Now most computers have a mixed architecture
+ accelerators -> hybrid architectures
Distributed Memory
[Figure: six nodes, each with its own CPU and local memory, connected by a network]
Shared Memory
[Figure: several CPUs connected to a single shared memory]
Real Shared
[Figure: CPUs attached through a system bus to shared memory banks]
Virtual Shared
[Figure: six nodes, each with a CPU, a HUB, and local memory, connected by a network; the HUBs provide a single virtual shared memory across the nodes]
Mixed Architectures
[Figure: three nodes, each with two CPUs sharing local memory, connected by a network]
Most Common Networks
• Cube, hypercube, n-cube
• Torus in 1, 2, ..., N dimensions
• Switched: Fat Tree
HPC Trends
Top500
[Chart: number of cores of the no. 1 system in the Top500 list, June 1993 to June 2011; the core count rises into the hundreds of thousands]

Paradigm Change in HPC
What about applications?
The next HPC system installed at CINECA will have 200,000 cores.
Roadmap to Exascale (architectural trends)
Dennard Scaling law (MOSFET)
Classic Dennard scaling:
L' = L / 2
V' = V / 2
F' = F * 2
D' = 1 / L^2 = 4 * D
P' = P

These relations do not hold anymore: the power crisis!

L' = L / 2
V' ≈ V
F' ≈ F * 2
D' = 1 / L^2 = 4 * D
P' = 4 * P
Core frequency and performance no longer grow following Moore's law.

CPU + Accelerator: the way to keep the evolution of architectures on the Moore's law track.
Programming crisis!
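A worked sketch of where the 4x comes from (standard CMOS dynamic-power reasoning, added here, not from the original slides): per-device dynamic power scales as P_dev ∝ C * V^2 * F, with capacitance C ∝ L. Under classic Dennard scaling (L/2, V/2, F*2) each device burns (1/2) * (1/4) * 2 = 1/4 of the power, so at 4x the density the chip power stays constant (P' = P). Once voltage stops scaling (V' ≈ V), each device burns (1/2) * 1 * 2 = 1x the power, and at 4x the density the chip power grows to P' = 4 * P.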
Where are the Watts burnt?

Today (at 40 nm), moving the three 64-bit operands needed to compute a 64-bit floating-point FMA takes 4.7x the energy of the FMA operation itself.

D = A + B * C

Extrapolating down to 10 nm integration, the energy required to move the data becomes 100x!
MPP System
When? 2012
Peak performance: > 2 PFlop/s
Power: > 1 MWatt
Cores: > 150,000
Threads: > 500,000
Architecture: option for BG/Q
Accelerator

A set (one or more) of very simple execution units that can perform a few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni)
[Figure: integration path from CPU + ACC as separate devices, to physical integration (CPU and ACC on one package), to architectural integration (CPU & ACC); moving along the path trades single-thread performance for throughput]
nVIDIA GPU
The Fermi implementation packs 512 processor cores.
ATI FireStream, AMD GPU
2012: the new Graphics Core Next ("GCN") architecture, with a new instruction set and a new SIMD design.
Intel MIC (Knights Ferry)
What about parallel App?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.

With 1,000,000 cores this requires P = 0.999999, i.e. a serial fraction of 0.000001.
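For reference, the full Amdahl formula (standard form, added here for clarity) gives the speedup on N cores as S(N) = 1 / ( (1 − P) + P / N ). Worked example: for P = 0.999999 and N = 1,000,000, S = 1 / (0.000001 + 0.000001) ≈ 500,000, so even a one-in-a-million serial fraction halves the ideal million-fold speedup.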
Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
• Next-generation programming languages and models: Chapel, X10, Fortress
• Languages and paradigms for hardware accelerators: CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
Trends

[Figure: evolution of programming paradigms, from scalar and vector applications through distributed-memory and shared-memory codes to hybrid codes: MPP systems with message passing (MPI), multi-core nodes (OpenMP), accelerators (GPGPU, FPGA) with CUDA and OpenCL]
Message Passing: domain decomposition
[Figure: the global domain is partitioned across six nodes (CPU + local memory), connected by an internal high-performance network]
Ghost Cells - Data exchange
[Figure: a 2D grid split between Processor 1 and Processor 2; each point (i,j) is updated using its neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1); a layer of ghost cells along the sub-domain boundary is exchanged between the processors at every update]
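A minimal sketch in C of the exchange described above, assuming a 1D decomposition in which each rank owns nloc rows of ny points plus one ghost row above and below (the function name, the row layout, and the neighbour ranks up/down are illustrative assumptions, not from the slides):

#include <mpi.h>

/* f is (nloc+2) x ny, row-major; row 0 and row nloc+1 are ghost rows. */
void exchange_ghost_rows( double *f, int nloc, int ny,
                          int up, int down, MPI_Comm comm )
{
    /* Send my first real row up; receive my lower ghost row from below. */
    MPI_Sendrecv( &f[ 1 * ny ],          ny, MPI_DOUBLE, up,   0,
                  &f[ (nloc + 1) * ny ], ny, MPI_DOUBLE, down, 0,
                  comm, MPI_STATUS_IGNORE );
    /* Send my last real row down; receive my upper ghost row from above. */
    MPI_Sendrecv( &f[ nloc * ny ], ny, MPI_DOUBLE, down, 1,
                  &f[ 0 ],         ny, MPI_DOUBLE, up,   1,
                  comm, MPI_STATUS_IGNORE );
}

Boundary ranks can pass MPI_PROC_NULL as the missing neighbour, which turns the corresponding send/receive into a no-op.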
Message Passing: MPI
Main characteristics:
• Library
• Coarse grain
• Inter-node parallelization (few real alternatives)
• Domain partition
• Distributed memory
• Almost all HPC parallel apps

Open issues:
• Latency
• OS jitter
• Scalability
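A minimal, self-contained sketch of this style in C (illustrative, not from the slides): each rank works only on its own partition of the domain, and partial results are combined with an explicit message-passing call.

#include <mpi.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    const int N = 1 << 20;        /* global domain size (assumed; divisible by size) */
    int local_n = N / size;       /* this rank's partition */
    double local_sum = 0.0, global_sum = 0.0;

    /* Distributed memory: each rank touches only its own sub-domain. */
    for ( int i = rank * local_n; i < ( rank + 1 ) * local_n; ++i )
        local_sum += 1.0;         /* stand-in for real work */

    /* Partial results travel explicitly over the network. */
    MPI_Reduce( &local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                0, MPI_COMM_WORLD );
    if ( rank == 0 ) printf( "sum = %.0f\n", global_sum );

    MPI_Finalize();
    return 0;
}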
Shared memory
[Figure: one node with four CPUs sharing one memory; threads 0-3 access shared variables x and y]
Shared Memory: OpenMP
Main characteristics:
• Compiler directives
• Medium grain
• Intra-node parallelization (pthreads)
• Loop or iteration partition
• Shared memory
• Many HPC apps

Open issues:
• Thread creation overhead
• Memory/core affinity
• Interface with MPI
OpenMP

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
Accelerator/GPGPU
Example: element-wise sum of two 1D arrays
CUDA sample

void CPUCode( int* input1, int* input2, int* output, int length ) {
  for ( int i = 0; i < length; ++i ) {
    output[ i ] = input1[ i ] + input2[ i ];
  }
}

__global__
void GPUCode( int* input1, int* input2, int* output, int length ) {
  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  if ( idx < length ) {
    output[ idx ] = input1[ idx ] + input2[ idx ];
  }
}
Each thread executes one loop iteration.
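A host-side launch sketch for the kernel above (the device pointers d_input1, d_input2, d_output and the block size of 256 are assumptions, not from the original slides): launch enough blocks to cover every element; the guard idx < length discards the surplus threads.

int threads = 256;
int blocks  = ( length + threads - 1 ) / threads;  /* round up */
GPUCode<<< blocks, threads >>>( d_input1, d_input2, d_output, length );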
CUDA, OpenCL

Main characteristics:
• Ad-hoc compiler
• Fine grain
• Offload parallelization (GPU)
• Single-iteration parallelization
• Ad-hoc memory
• Few HPC apps

Open issues:
• Memory copies
• Standards
• Tools
• Integration with other languages
Hybrid (MPI + OpenMP + CUDA + … + Python)

• Takes the positives of all models
• Exploits the memory hierarchy
• Many HPC applications are adopting this model
• Mainly due to developer inertia: it is hard to rewrite millions of source lines
Hybrid parallel programming
MPI: domain partition
OpenMP: external loop partition
CUDA: inner-loop iterations assigned to GPU threads
Python: ensemble simulations

Example application: Quantum ESPRESSO (http://www.qe-forge.org/)
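A minimal sketch in C of the first two layers (assumed code, not from Quantum ESPRESSO): MPI partitions the domain across nodes, OpenMP partitions the external loop across the cores of a node; the body of the loop is where a CUDA kernel launch would sit in the full model.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    int provided, rank;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread( &argc, &argv, MPI_THREAD_FUNNELED, &provided );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    int nslices = 8;   /* slices of this rank's sub-domain (assumed) */

    /* OpenMP: external loop partitioned across the node's cores. */
    #pragma omp parallel for
    for ( int i = 0; i < nslices; ++i ) {
        /* In the full model, the inner-loop iterations of slice i
           would be assigned to GPU threads via a CUDA kernel. */
        printf( "rank %d, thread %d: slice %d\n",
                rank, omp_get_thread_num(), i );
    }

    MPI_Finalize();
    return 0;
}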
Storage I/O
• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O
• On-the-fly analysis and statistics
• Disk only for archiving
• Scratch on non-volatile memory ("close to RAM")
PRACE
PRACE Research Infrastructure (www.prace-ri.eu): the top level of the European HPC ecosystem.
The vision of PRACE is to enable and support European global leadership in public and private research and development.
CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system.
[Figure: the European HPC ecosystem pyramid: Tier 0, European systems (PRACE); Tier 1, national systems (CINECA today); Tier 2, local systems]
FERMI @ CINECA
PRACE Tier-0 System

Architecture: 10 BG/Q frames
Model: IBM BG/Q
Processor type: IBM PowerA2, 1.6 GHz
Computing cores: 163,840
Computing nodes: 10,240
RAM: 1 GByte/core
Internal network: 5D torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s
The ISCRA & PRACE calls for projects are now open!
Conclusion
• Exploit millions of ALUs
• Hybrid hardware
• Hybrid codes
• Memory hierarchy
• Flops/Watt (more than Flops/sec)
• I/O subsystem
• Non-volatile memory
• Fault tolerance!
Parallel programming trends in extremely scalable architectures