Page 1: Parallel Programming Trends in Extremely Scalable Architectures

Parallel Programming Trends in Extremely Scalable Architectures

Carlo Cavazzoni, HPC department, CINECA

Page 2: Parallel Programming Trends in Extremely Scalable Architectures

CINECA

CINECA is a non-profit consortium made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).

CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.

Page 3: Parallel Programming Trends in Extremely Scalable Architectures

Why parallel programming?

Solve larger problems

Run memory demanding codes

Solve problems with greater speed

Page 4: Parallel Programming Trends in Extremely Scalable Architectures

Modern Parallel Architectures

Two basic architectural schemes:

Distributed Memory

Shared Memory

Now most computers have a mixed architecture

+ accelerators -> hybrid architectures

Page 5: Parallel Programming Trends in Extremely Scalable Architectures

Distributed Memory

[Diagram: six nodes, each with its own CPU and local memory, connected by a network]

Page 6: Parallel Programming Trends in Extremely Scalable Architectures

Shared Memory

[Diagram: several CPUs attached to a single shared memory]

Page 7: Parallel Programming Trends in Extremely Scalable Architectures

Real Shared

[Diagram: CPUs connected to the memory banks through a single system bus]

Page 8: Parallel Programming Trends in Extremely Scalable Architectures

Virtual Shared

[Diagram: six nodes, each with a CPU, local memory and a HUB, connected by a network that makes the distributed memories behave as one virtually shared memory]

Page 9: Parallel Programming Trends in Extremely Scalable Architectures

Mixed Architectures

[Diagram: three nodes, each with two CPUs sharing local memory, connected by a network]

Page 10: Parallel Programming Trends in Extremely Scalable Architectures

Most Common Networks

Cube, hypercube, n-cube

Torus in 1, 2, ..., N dimensions

Switched fat tree

Page 11: Parallel Programming Trends in Extremely Scalable Architectures

HPC Trends

Page 12: Parallel Programming Trends in Extremely Scalable Architectures

Top500

[Chart: number of cores of the no. 1 system in the Top500 list, June 1993 to June 2011 (y-axis: 0 to 600,000 cores)]

Paradigm Change in HPC

What about applications?

The next HPC system installed at CINECA will have 200,000 cores.

Page 13: Parallel Programming Trends in Extremely Scalable Architectures

Roadmap to Exascale (architectural trends)

Page 14: Parallel Programming Trends in Extremely Scalable Architectures

Dennard scaling law (MOSFET):

L' = L / 2
V' = V / 2
F' = F * 2
D' = 1 / L² = 4 * D
P' = P

It does not hold anymore!

The power crisis!

L' = L / 2
V' = ~V
F' = ~F * 2
D' = 1 / L² = 4 * D
P' = 4 * P

Core frequency and performance no longer grow following Moore's law.

CPU + accelerator: the way to maintain the architecture evolution in line with Moore's law.

Programming crisis!

Page 15: Parallel Programming Trends in Extremely Scalable Architectures

Where are Watts burnt?

Today (at 40 nm), moving the three 64-bit operands needed to compute a 64-bit floating-point FMA

D = A + B * C

takes 4.7x the energy of the FMA operation itself.

Extrapolating down to 10 nm integration, the energy required to move the data becomes 100x!

Page 16: Parallel Programming Trends in Extremely Scalable Architectures

MPP System

When: 2012
Peak performance: > 2 PFlop/s
Power: > 1 MWatt
Cores: > 150,000
Threads: > 500,000
Architecture: option for BG/Q

Page 17: Parallel Programming Trends in Extremely Scalable Architectures

Accelerator

A set (one or more) of very simple execution units that can perform few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), it can accelerate the “nominal” speed of a system. (Carlo Cavazzoni)

[Diagram: evolution from separate CPU and accelerator, to physical integration (CPU + ACC), to architectural integration (CPU & ACC), trading single-thread performance for throughput]

Page 18: Parallel Programming Trends in Extremely Scalable Architectures

nVIDIA GPU

Fermi implementation will pack 512 processor cores

Page 19: Parallel Programming Trends in Extremely Scalable Architectures

ATI FireStream, AMD GPU

2012: new Graphics Core Next (“GCN”) architecture, with a new instruction set and a new SIMD design

Page 20: Parallel Programming Trends in Extremely Scalable Architectures

Intel MIC (Knights Ferry)

Page 21: Parallel Programming Trends in Extremely Scalable Architectures

What about parallel App?

In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.

For example, to scale on 1,000,000 cores the parallel fraction must reach P = 0.999999, i.e. a serial fraction of only 0.000001.
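As a back-of-the-envelope check, here is a minimal C sketch (not from the slides) that evaluates the usual Amdahl formula speedup(N) = 1 / ((1 − P) + P / N) for these numbers; the function name amdahl_speedup is purely illustrative.

#include <stdio.h>

/* Amdahl's law: speedup(N) = 1 / ((1 - P) + P / N) */
static double amdahl_speedup(double parallel_fraction, double n_cores) {
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores);
}

int main(void) {
    double p = 0.999999;     /* parallel fraction from the slide */
    double n = 1000000.0;    /* one million cores */
    printf("speedup on %.0f cores: %.0f\n", n, amdahl_speedup(p, n));
    printf("upper bound 1/(1-P):   %.0f\n", 1.0 / (1.0 - p));
    return 0;
}

Even with P = 0.999999, the achieved speedup on one million cores is about 500,000, half of the 1 / (1 − P) bound.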

Page 22: Parallel Programming Trends in Extremely Scalable Architectures

Programming Models

• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space Programming (PGAS): UPC, Coarray Fortran, Titanium
• Next-generation programming languages and models: Chapel, X10, Fortress
• Languages and paradigms for hardware accelerators: CUDA, OpenCL

• Hybrid: MPI + OpenMP + CUDA/OpenCL

Page 23: Parallel Programming Trends in Extremely Scalable Architectures

Trends

Scalar applications → vector → distributed memory → shared memory → hybrid codes

Hybrid codes combine:

MPP systems, message passing: MPI

Multi-core nodes: OpenMP

Accelerators (GPGPU, FPGA): CUDA, OpenCL

Page 24: Parallel Programming Trends in Extremely Scalable Architectures

Message Passing: domain decomposition

[Diagram: six nodes, each with its own CPU and memory, connected by an internal high-performance network; the domain is partitioned across the nodes]

Page 25: Parallel Programming Trends in Extremely Scalable Architectures

Ghost Cells - Data exchange

[Diagram: a 2D grid of cells (i,j) partitioned between Processor 1 and Processor 2; ghost cells along the sub-domain boundaries are exchanged between processors at every update]
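To make the exchange concrete, below is a minimal MPI sketch in C of a 1D halo (ghost-cell) exchange between neighbouring processes; it only illustrates the pattern in the figure, and the local array size and fill values are made up for the example.

#include <mpi.h>
#include <stdio.h>

#define NLOC 8   /* interior cells per process (illustrative size) */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local array with one ghost cell on each side: u[0] and u[NLOC+1] */
    double u[NLOC + 2];
    for (int i = 1; i <= NLOC; ++i) u[i] = (double) rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my last interior cell to the right neighbour,
       receive my left ghost cell from the left neighbour */
    MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 0,
                 &u[0],    1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* and the symmetric exchange in the other direction */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  1,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost = %g, right ghost = %g\n", rank, u[0], u[NLOC + 1]);
    MPI_Finalize();
    return 0;
}

In a real 2D domain decomposition the same pattern is applied along each direction, exchanging whole boundary rows or columns instead of single values.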

Page 26: Parallel Programming Trends in Extremely Scalable Architectures

Message Passing: MPI

Main characteristics:
• Library
• Coarse grain
• Inter-node parallelization (few real alternatives)
• Domain partition
• Distributed memory
• Almost all HPC parallel apps

Open issues:
• Latency
• OS jitter
• Scalability

Page 27: Parallel Programming Trends in Extremely Scalable Architectures

Shared memory

[Diagram: one node with four CPUs sharing a single memory; threads 0 to 3 operate on shared data x and y]

Page 28: Parallel Programming Trends in Extremely Scalable Architectures

Shared Memory: OpenMP

Main characteristics:
• Compiler directives
• Medium grain
• Intra-node parallelization (pthreads)
• Loop or iteration partition
• Shared memory
• Many HPC apps

Open issues:
• Thread creation overhead
• Memory/core affinity
• Interface with MPI

Page 29: Parallel Programming Trends in Extremely Scalable Architectures

OpenMP

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
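The listing above is schematic pseudocode from the slide. Below is a minimal compilable C sketch of the pattern used in the first loop (independent 1D transforms over slices); the fft_1d_z routine and the array layout are placeholders, not the real FFT code.

#include <omp.h>
#include <stdio.h>

#define NSL   16   /* number of z-slices (illustrative) */
#define SLICE 64   /* elements per slice (illustrative) */

/* placeholder for a 1D FFT along z on one slice */
static void fft_1d_z(double *slice, int n) {
    for (int k = 0; k < n; ++k) slice[k] *= 2.0;  /* stand-in for the real transform */
}

int main(void) {
    static double f[NSL * SLICE];

    /* each iteration transforms an independent slice, so the loop parallelizes cleanly */
    #pragma omp parallel for
    for (int i = 0; i < NSL; ++i)
        fft_1d_z(&f[i * SLICE], SLICE);

    printf("transformed %d slices with up to %d threads\n", NSL, omp_get_max_threads());
    return 0;
}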

Page 30: Parallel Programming Trends in Extremely Scalable Architectures

Accelerator/GPGPU

Element-wise sum of two 1D arrays

Page 31: Parallel Programming Trends in Extremely Scalable Architectures

CUDA sample

void CPUCode(int* input1, int* input2, int* output, int length) {
    for (int i = 0; i < length; ++i) {
        output[i] = input1[i] + input2[i];
    }
}

__global__
void GPUCode(int* input1, int* input2, int* output, int length) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < length) {
        output[idx] = input1[idx] + input2[idx];
    }
}

Each thread executes one loop iteration.

Page 32: Parallel Programming Trends in Extremely Scalable Architectures

CUDA, OpenCL

Main characteristics:
• Ad-hoc compiler
• Fine grain
• Offload parallelization (GPU)
• Single-iteration parallelization
• Ad-hoc memory
• Few HPC apps

Open issues:
• Memory copy
• Standards
• Tools
• Integration with other languages

Page 33: Parallel Programming Trends in Extremely Scalable Architectures

Hybrid (MPI + OpenMP + CUDA + … + Python)

Take the best of all models.
Exploit the memory hierarchy.
Many HPC applications are adopting this model, mainly due to developer inertia: it is hard to rewrite millions of source lines.

Page 34: Parallel Programming Trends in Extremely Scalable Architectures

Hybrid parallel programming

MPI: Domain partition

OpenMP: External loop partition

CUDA: assign inner-loop iterations to GPU threads

Quantum ESPRESSO: http://www.qe-forge.org/

Python: Ensemble simulations
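As an illustration of how the MPI and OpenMP layers combine (the CUDA and Python layers are omitted), here is a minimal hybrid C sketch; it is not code from Quantum ESPRESSO, and the loop body is only a placeholder for per-cell work.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;

    /* request thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI level: each rank owns one sub-domain;
       OpenMP level: threads split the local loop */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000; ++i)
        local_sum += 1.0;   /* placeholder for work on local cells */

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %g\n", global_sum);

    MPI_Finalize();
    return 0;
}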

Page 35: Parallel Programming Trends in Extremely Scalable Architectures

Storage I/O

• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O
• On-the-fly analysis and statistics
• Disk only for archiving
• Scratch on non-volatile memory (“close to RAM”)

Page 36: Parallel Programming Trends in Extremely Scalable Architectures

PRACE

PRACE Research Infrastructure (www.prace-ri.eu): the top level of the European HPC ecosystem.

The vision of PRACE is to enable and support European global leadership in public and private research and development.

CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system.

[Diagram: the European HPC ecosystem pyramid; Tier 0: European systems (PRACE), Tier 1: national systems (CINECA today), Tier 2: local systems]

Page 37: Parallel Programming Trends in Extremely Scalable Architectures

FERMI @ CINECA: PRACE Tier-0 System

Architecture: 10 BG/Q frames
Model: IBM BG/Q
Processor type: IBM PowerA2, 1.6 GHz
Computing cores: 163,840
Computing nodes: 10,240
RAM: 1 GByte per core
Internal network: 5D torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s

ISCRA & PRACE call for projects now open!

Page 38: Parallel Programming Trends in Extremely Scalable Architectures

Conclusion

• Exploit millions of ALUs
• Hybrid hardware
• Hybrid codes
• Memory hierarchy
• Flops/Watt (more than Flops/sec)
• I/O subsystem
• Non-volatile memory
• Fault tolerance!

Parallel programming trends in extremely scalable architectures

