Page 1: Serge G. Petiton June 23rd,  2008

What Programming Paradigms and algorithms for Petascale Scientific Computing, a Hierarchical Programming Methodology Tentative

Serge G. Petiton

June 23rd, 2008

Japan-French Informatic Laboratory (JFIL)

Page 2: Serge G. Petiton June 23rd,  2008


Outline

1. Introduction

2. Present Petaflops, on the Road to Future Exaflops

3. Experimentations, toward models and extrapolations

4. Conclusion

Page 3: Serge G. Petiton June 23rd,  2008


Outline

1. Introduction

2. Present Petaflops, on the Road to Future Exaflops

3. Experimentations, toward models and extrapolations

4. Conclusion

Page 4: Serge G. Petiton June 23rd,  2008


Introduction

The Petaflop frontier was crossed during the night of May 25-26 (Top500)

A sustained Petaflop would soon be reached by a large number of computers

As anticipated since the 90s, no real technological gap had to be crossed to reach Petaflops computers

Languages and tools have not changed much since the first SMPs

What about languages, tools, and methods for a sustained 10 Petaflops?

The Exaflop would probably call for new technological advances and new ecosystems

On the road toward Exaflops, we will soon face difficult challenges, and we have to anticipate new problems around the 10-Petaflop frontier.

Page 5: Serge G. Petiton June 23rd,  2008


Outline

1. Introduction

2. Present Petaflops, on the Road to Future Exaflops

3. Experimentations, toward models and extrapolations

4. Conclusion

Page 6: Serge G. Petiton June 23rd,  2008


Hyper Large Scale Hierarchical Distributed Parallel Architectures

Many-cores ask for new programming paradigms, such as data parallelism,

Message passing would be efficient for gangs of clusters,

Workflow and Grid-like programming may be a solution for the higher-level programming,

Accelerators, vector computing,

Energy consumption optimization,

Optical networks, "inter" and "intra" (chip, cluster, gang, ...) communications,

Distributed/shared-memory computers on a chip.
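As a toy illustration of such a two-level hierarchy (not from the talk), a minimal Python sketch: process-level parallelism stands in for message passing between gangs, and vectorized NumPy operations stand in for the data-parallel many-core level. All names here are hypothetical.

# Hypothetical two-level sketch (illustrative, not from the talk):
# the outer process level stands in for message passing between
# "gangs"; the inner NumPy reduction stands in for the data-parallel
# many-core level.
import numpy as np
from multiprocessing import Pool

def inner_data_parallel(block: np.ndarray) -> float:
    # Inner level: one vectorized (data-parallel) reduction per block.
    return float(np.sum(block * block))

def outer_gang(seed: int) -> float:
    # Outer level: each process owns its own data, message-passing style.
    rng = np.random.default_rng(seed)
    block = rng.standard_normal((1024, 1024))
    return inner_data_parallel(block)

if __name__ == "__main__":
    with Pool(4) as pool:                       # four "gangs"
        partials = pool.map(outer_gang, range(4))
    print("global reduction:", sum(partials))   # combine gang results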

Page 7: Serge G. Petiton June 23rd,  2008


On the road from Petaflop toward Exaflop

Multi-programming and execution paradigms,

Technological and software challenges: compilers, systems, middleware, schedulers, fault tolerance, ...

New applications and numerical methods,

Arithmetic and elementary functions (multiple precision and hybrid),

Data distributed on networks and grids,

Education challenges: we have to educate scientists.

Page 8: Serge G. Petiton June 23rd,  2008


and the road would be difficult....

Multi-level programming paradigms,

Component technologies,

Mixed data migration and computing, with large-instrument control,

We have to use end-users' expertise,

Non-deterministic distributed computing, component dependence graphs,

Middleware and platform independence,

"Time to solution" minimization, new metrics,

We have to allow end-users to propose scheduler assistance and to give advice to anticipate data migration.

Page 9: Serge G. Petiton June 23rd,  2008


Outline

1. Introduction

2. Present Petaflops, on the Road to Future Exaflops

3. Experimentations, toward models and extrapolations

4. Conclusion

Page 10: Serge G. Petiton June 23rd,  2008


Front end: depends only on the applications

Back end: depends on the middleware.

Examples: XtremWeb (France), OmniRPC (Japan), and Condor (USA).

http://yml.prism.uvsq.fr/

YML Language

Page 11: Serge G. Petiton June 23rd,  2008


Components/Tasks Graph Dependency

Begin node

End node

Graph node

Dependence

par
  compute tache1(..); signal(e1);
//
  compute tache2(..); migrate matrix(..); signal(e2);
//
  wait(e1 and e2);
  par
    compute tache3(..); signal(e3);
  //
    compute tache4(..); signal(e4);
  //
    compute tache5(..); control robot(..); signal(e5); visualize mesh(...);
  end par
//
  wait(e3 and e4 and e5);
  compute tache6(..);
  compute tache7(..);
end par

Generic component node

Result A
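To make the par / "//" / signal / wait semantics above concrete, here is a rough Python rendering of the same graph, assuming "//" separates parallel branches and signal/wait synchronize on events; the task bodies are placeholders.

# Rough Python rendering of the YML-like graph above: threads stand in
# for parallel branches ("//"), threading.Event for signal/wait.
import threading

e1, e2, e3, e4, e5 = (threading.Event() for _ in range(5))

def compute(name):                               # placeholder for "compute tacheN(..)"
    print("computing", name)

def branch1():
    compute("tache1"); e1.set()                  # signal(e1)

def branch2():
    compute("tache2")                            # migrate matrix(..) omitted
    e2.set()                                     # signal(e2)

def branch3():
    e1.wait(); e2.wait()                         # wait(e1 and e2)
    subs = [threading.Thread(target=t) for t in (
        lambda: (compute("tache3"), e3.set()),
        lambda: (compute("tache4"), e4.set()),
        lambda: (compute("tache5"), e5.set()),   # control/visualize omitted
    )]
    for s in subs: s.start()
    for s in subs: s.join()                      # inner end par

def branch4():
    e3.wait(); e4.wait(); e5.wait()              # wait(e3 and e4 and e5)
    compute("tache6"); compute("tache7")

threads = [threading.Thread(target=b) for b in (branch1, branch2, branch3, branch4)]
for t in threads: t.start()
for t in threads: t.join()                       # outer end par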

Page 12: Serge G. Petiton June 23rd,  2008


LAKe Library (Nahid Emad, UVSQ)

Page 13: Serge G. Petiton June 23rd,  2008


YML/LAKe

Page 14: Serge G. Petiton June 23rd,  2008


Block Gauss-Jordan, 101-processor cluster, Grid 5000; YML versus YML/OmniRPC (with Maxime Hugues (TOTAL and LIFL))

Block size = 1500

Block number | Task number | Overhead (%)
2x2          |   8         |  22.41
3x3          |  27         |  14.78
4x4          |  64         |  28.37
5x5          | 125         |  40.82
6x6          | 216         |  65.60
7x7          | 343         |  97.01
8x8          | 612         | 138.24

We optimize the "Time to Solution"; several middlewares may be chosen.

[Figure: execution time versus number of blocks]

Page 15: Serge G. Petiton June 23rd,  2008


GRID 5000, BGJ, 10 and 101 nodes, YML versus YML/OmniRPC

Block size = 1500

Block number | Overhead (%) 101 nodes | Overhead (%) 10 nodes
2x2          |  22.41                 |  21.67
3x3          |  14.78                 |  11.57
4x4          |  28.37                 |  12.12
5x5          |  40.82                 |  22.60
6x6          |  65.60                 |  50.00
7x7          |  97.01                 |  63.98
8x8          | 138.24                 | 133.69

Page 16: Serge G. Petiton June 23rd,  2008


BGJ, YML/OmniRPC versus YML

Block size = 1500

Block number | Overhead (%) 101 nodes Grid5000 | Overhead (%) Cluster of clusters
2x2          |  22.41                          | 17.58
3x3          |  14.78                          | 14.22
4x4          |  28.37                          | 25.17
5x5          |  40.82                          | 24.64
6x6          |  65.60                          | 62.86
7x7          |  97.01                          | 40.12
8x8          | 138.24                          | 99.79

Page 17: Serge G. Petiton June 23rd,  2008


Asynchronous Restarted Iterative Methods on multi-node computers

With Guy Bergère, Zifan Li, and Ye Zhang (LIFL)
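For background, a minimal restarted-GMRES driver in SciPy, shown as a plain synchronous restart loop; the asynchronous hybrid methods studied here go further by letting concurrently running co-methods exchange information between restarts. The test matrix below is illustrative only.

# Minimal sketch (not the talk's code): restarted GMRES on a sparse
# tridiagonal test problem. restart=30 is the Krylov subspace size
# kept between restarts (a memory/convergence trade-off).
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres

n = 1000
A = diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = gmres(A, b, restart=30, maxiter=1000)
print("converged" if info == 0 else f"info={info}",
      "| residual norm:", np.linalg.norm(b - A @ x))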

Page 18: Serge G. Petiton June 23rd,  2008


Convergence on GRID 5000

[Figure: residual norm versus time (seconds); pure GMRES compared with hybrid runs for nG = 2, 5, 8, and 10]

Page 19: Serge G. Petiton June 23rd,  2008


One or two distributed sites, same number of processors, communication overlap

[Figure: iteration residual norm versus execution time, on one site versus two sites]

Page 20: Serge G. Petiton June 23rd,  2008


Cell/GPU CEA/DEN: with Christophe Calvin and Jérôme Dubois (CEA/DEN Saclay)

MINOS/APOLLO3 solver

Neutronic transport problem

Power method to compute the dominant eigenvalue

Slow convergence

Large number of floating-point operations

Experimentations on:

Bi-Xeon quad-core 2.83 GHz (45 GFlops)

Cell blade (CINES, Montpellier) (400 GFlops)

GPU Quadro FX 4600 (240 GFlops)
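For reference, a minimal NumPy sketch of the power method named above (illustrative only; the actual solver applies it to neutron-transport operators, and the small diagonal test matrix here is a placeholder):

# Minimal power-method sketch: repeated matrix-vector products converge
# to the dominant eigenpair; convergence is slow when the two largest
# eigenvalues are close in magnitude (the "slow convergence" above).
import numpy as np

def power_method(A, tol=1e-10, maxiter=10_000):
    x = np.random.default_rng(0).standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    lam = 0.0
    for _ in range(maxiter):
        y = A @ x                       # one matvec per iteration
        lam_new = x @ y                 # Rayleigh-quotient estimate
        x = y / np.linalg.norm(y)
        if abs(lam_new - lam) < tol * abs(lam_new):
            break
        lam = lam_new
    return lam_new, x

A = np.diag([10.0, 9.5, 3.0, 1.0])      # close eigenvalues: slow convergence
lam, v = power_method(A)
print("dominant eigenvalue:", lam)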

Page 21: Serge G. Petiton June 23rd,  2008


Power Method: Performance

[Figure: Power method performance, GFlops versus matrix size (32 to 8192 rows), for GPU, CPU, and Cell]

Page 22: Serge G. Petiton June 23rd,  2008

Power Method: Arithmetic Accuracy

[Figure: measured difference versus iterations; differences up to about 3.5e-5]
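One plausible way to reproduce such an accuracy measurement, assuming the measured difference compares single-precision accelerator runs against a double-precision reference (an assumption on my part, not the talk's stated protocol):

# Hedged sketch: track the per-iteration gap between a float32 power
# iteration (stand-in for single-precision accelerators) and a float64
# reference on the same matrix.
import numpy as np

rng = np.random.default_rng(2)
A64 = rng.standard_normal((512, 512))
A64 = A64 @ A64.T                       # symmetric test matrix
A32 = A64.astype(np.float32)

x64 = np.ones(512)
x32 = x64.astype(np.float32)
for k in range(50):
    x64 = A64 @ x64; x64 /= np.linalg.norm(x64)
    x32 = A32 @ x32; x32 /= np.linalg.norm(x32)
    lam64 = x64 @ (A64 @ x64)           # double-precision estimate
    lam32 = x32 @ (A32 @ x32)           # single-precision estimate
    if k % 10 == 0:
        print(k, abs(lam64 - float(lam32)))   # measured difference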

Page 23: Serge G. Petiton June 23rd,  2008

Arnoldi Projection: Performance

[Figure: Arnoldi projection performance, GFlops versus matrix size (32 to 7680 rows), for CPU, Cell, GPU, and Cell/LS]
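For context, a minimal NumPy sketch of the Arnoldi projection itself; the final print measures exactly the orthogonality deviation that the next slide reports for GPU and Cell runs. The random test matrix is a placeholder.

# Minimal Arnoldi sketch: builds an orthonormal Krylov basis V and a
# small Hessenberg matrix H with A @ V[:, :m] ~ V[:, :m+1] @ H.
import numpy as np

def arnoldi(A, v0, m):
    n = v0.size
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                  # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:                 # happy breakdown
            return V[:, :j + 1], H[:j + 1, :j]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

A = np.random.default_rng(1).standard_normal((200, 200))
V, H = arnoldi(A, np.ones(200), 20)
print("orthogonality deviation:",
      np.linalg.norm(V.T @ V - np.eye(V.shape[1])))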

Page 24: Serge G. Petiton June 23rd,  2008

Arnoldi Projection: Arithmetic Accuracy

[Figure: deviation of the orthogonal basis (orthogonalization degradation) across basis vectors v1 through v8, GPU error versus Cell error; deviations up to about 3.5e-3]

Page 25: Serge G. Petiton June 23rd,  2008


Outline

1. Introduction

2. Present Petaflops, on the Road to Future Exaflops

3. Experimentations, toward models and extrapolations

4. Conclusion

Page 26: Serge G. Petiton June 23rd,  2008


Conclusion

We plan to extrapolate, from Grid5000 and our multi-core experimentations, some behaviors of future hierarchical large petascale computers, using YML for the higher level,

We need to propose new high-level languages to program large Petaflop computers, to be able to minimize "Time to Solution" and energy consumption, with system and middleware independence,

Other important codes would still be carefully "hand-optimized",

Several programming paradigms, with respect to the different levels, have to be mixed. The interfaces have to be well specified; MPI would probably be very difficult to dethrone,

End-users have to be able to contribute expertise to help middleware management, such as scheduling, and to choose libraries,

New asynchronous hybrid methods have to be introduced.

