
Parallel Matrix Algorithms and Applications (PMAA’10)

PROGRAMME AND ABSTRACTS

6th International Workshop on Parallel Matrix Algorithms and Applications (PMAA’10)

http://www.pmaa10.unibas.ch/

Department of Computer Science, University of Basel, Switzerland
June 29 - July 02, 2010

Address:
University of Basel
Kollegienhaus
Petersplatz 1
CH-4051 Basel
Switzerland

Computer Science Department, University of Basel, Switzerland I


PMAA10 Co-Chairs

Peter Arbenz (Switzerland), Erricos J. Kontoghiorghes (Cyprus), Yousef Saad (USA), Ahmed Sameh (USA), and Olaf Schenk (Switzerland)

PMAA10 Programme committee:

M.F. Adams (USA), P. D’Ambra (IT), M. Arioli (UK), A. Basermann (DE), C. Bekas (CH), M.W. Berry (USA), M. Bollhofer (DE), E. Boman (USA), M. Dayde (FR), T. Drummond (USA), N. Emad (FR), E. Gallopoulos (GR), M. van Gijzen (NL), L. Giraud (FR), L. Grigori (FR), T. Huckle (DE), A. Karaivanova (BG), R. Krause (CH), D. Kressner (CH), S. Margenov (BG), N. Missirlis (GR), M. Neytcheva (SE), E.G. Ng (USA), G. Oksa (SK), S. Petiton (FR), J.E. Roman (ES), D. di Serafino (IT), D.B. Szyld (USA), M. Tuma (CZ), M. Vajtersic (AT), P. Vasconcelos (PT), R. Vuduc (USA), Z. Zlatev (DK).

Local organizing committee:

Peter Arbenz, Costas Bekas, Olaf Schenk


Dear Friends and Colleagues,

Welcome to the 6th International Workshop on Parallel Matrix Algorithms and Applications (PMAA’10). The workshop Co-chairs are happy to host this international conference in Basel. The first two workshops, PMAA00 and PMAA02, took place in Neuchatel, the workshops PMAA04 and PMAA06 in France, while PMAA08 was again organized in Neuchatel.

The workshop aims to be a forum for the exchange of ideas, insights and experiences in different areas of parallel computing in which matrix algorithms are employed. The workshop will bring together experts and practitioners from diverse disciplines with a common interest in matrix computation. The PMAA10 programme consists of 20 regular sessions, 4 plenary talks and around 80 presentations. There are over 100 participants.

Peer-reviewed papers presented at PMAA10 will be considered for publication in a special issue of the Parallel Computing journal.

The Co-chairs have endeavored to provide a balanced and stimulating programme that will appeal to the diverse interests of the participants. The local organizing committee hopes that the conference venue will provide the appropriate environment to enhance your contacts and to establish new ones.

The conference is a collective effort of many individuals and organizations. The Co-chairs, the scientific programme committee, the local organizing committee and volunteers have contributed substantially to the organization of the conference. We gratefully acknowledge the support of the host Department of Computer Science, University of Basel, and the Swiss Speedup Society.

We hope that you enjoy the conference and your stay in Basel.

The conference Co-chairs: Peter Arbenz, Erricos John Kontoghiorghes, Yousef Saad, Ahmed Sameh, and Olaf Schenk.
The local organizers: Peter Arbenz, Costas Bekas, and Olaf Schenk.


SCHEDULE

All lectures take place at the Kollegienhaus, Petersplatz 1, CH-4051 Basel, Switzerland.

Tuesday, 29th June 2010

18:30 - 20:00 Welcome Reception (Wildt’sches Haus)

Wednesday, 30th June 2010

09:10 - 09:15 Opening
09:15 - 10:15 Plenary Talk (Prof. George Biros)
10:15 - 10:45 Coffee Break
10:45 - 12:25 Parallel Sessions 1.1 A-B
12:25 - 14:00 Lunch Break (not organized)
14:00 - 15:40 Parallel Sessions 1.2 A-B
15:40 - 16:10 Poster Session - Coffee Break
16:10 - 17:10 Plenary Talk (Prof. Laura Grigori)

Thursday, 1st July 2010

08:40 - 10:20 Parallel Sessions 2.1 A-C
10:20 - 10:50 Coffee Break
10:50 - 12:30 Parallel Sessions 2.1 D-F
12:30 - 14:00 Lunch Break (not organized)
14:00 - 15:40 Parallel Sessions 2.2 A-C
15:40 - 16:10 Coffee Break
16:10 - 17:50 Parallel Sessions 2.2 D-F
19:00 Conference Dinner

Friday, 2nd July 2010

09:15 - 10:15 Plenary Talk (Prof. Jim Demmel)
10:15 - 10:45 Coffee Break
10:45 - 12:25 Parallel Sessions 3.1 A-B
12:25 - 14:00 Lunch Break (not organized)
14:00 - 15:15 Parallel Sessions 3.2 A-B
15:15 - 15:45 Poster Session - Coffee Break
15:45 - 16:45 Plenary Talk (Prof. Ahmed Sameh)
16:45 - 16:50 Closing Remarks (Prof. Erricos J. Kontoghiorghes)

SOCIAL EVENTS

• The coffee breaks will last 30 minutes each. Weather permitting, the coffee breaks will take place on the terrace by the cafeteria of the Kollegienhaus; otherwise they will take place on the first and second floors of the Kollegienhaus in front of the lecture halls.

• Welcome Reception, Tuesday 29th June, 6:30pm to approx. 8:00pm. The reception is open to all registrants and will take place in the Wildt’sche Haus, which is close to the workshop venue. It is the first official event of the conference and gives you the opportunity to meet the other workshop attendees. Opening speeches will be given by Dr. Hans-Peter Wessels, the Basel cantonal councilor and former director of the Swiss Center of Scientific Computing (CSCS), and by Prof. Ahmed Sameh (Purdue University). The welcome reception is included in the conference registration fee and the accompanying persons programme fee. You must have your conference badge in order to attend the reception.

• Lunches are not organized. Various choices are available at the Kollegienhaus cafeteria and at the restaurants of various shopping centres, a five-minute walk from the venue.

• Conference Dinner, Thursday 1st July, 7:00pm. The Conference Dinner will take place at the Restaurant Papiermuehle (St. Alban-Tal 35, CH-4052 Basel, Tel. +41 61 272 48 48). The restaurant is a 25-30 minute walk from the Kollegienhaus and the town centre (detailed information will be available at the conference registration desk; a map is given at the end of this Book of Abstracts). The conference dinner is organized for all conference attendees. You must have your conference badge in order to attend the conference dinner.


GENERAL INFORMATION

Lecture Rooms

The paper presentations will take place at the Kollegienhaus, University of Basel. There will be signs indicating the location of the lecture rooms. Please ask for assistance and directions at the registration desk.

The plenary talks will take place in lecture room 102 (Kollegienhaus) and will last 60 minutes including questions. Chairs are requested to keep the session on schedule. Papers should be presented in the order in which they are listed in the programme, for the convenience of attendees who may wish to switch rooms mid-session to hear particular papers. In the case of a no-show, please use the extra time for a break or a discussion so that the remaining papers stay on schedule.

Presentation instructions

The lecture rooms will be equipped with a PC, a computer projector and, in most cases, an overhead projector. The session chairs should collect copies of the talks on a USB memory stick before the session starts (use the lecture room as the meeting place), or obtain the talks by email prior to the beginning of the conference. Presenters must deliver their presentation files in PDF (Acrobat) or PPT (PowerPoint) format on a USB memory stick to the session chair ten minutes before each session.

The PC in the lecture rooms should be used for presentations. The session chairs should havea laptop for backup.

Swiss plugs/power outlets are different from those in the rest of Europe, including Germany. We cannot provide adapters, so please do not forget to bring your own adapter if needed.

Internet

Wireless Internet access is freely available at the Kollegienhaus.

Messages

You may leave messages for each other on the bulletin board by the registration desks.

SUPPORTERS

Department of Computer Science, University of Basel, Switzerland

Journal of Parallel Computing

The SPEEDUP Society: The Swiss Forum for High Performance Computing


PUBLICATIONS OUTLETS

Journal of Parallel Computing

Papers containing a strong parallel computing component will be considered for publication in a special peer-reviewed issue of the Parallel Computing journal (PARCO). The guest editors of the special issue are Peter Arbenz, Yousef Saad, Ahmed Sameh, and Olaf Schenk.

The deadline for paper submissions is the 30th of September 2010.

For further information please contact: Peter Arbenz (E-mail: [email protected]).

Papers will go through the usual review procedures and will be accepted or rejected based on the recommendations of the editors and referees. However, the review process will be streamlined in every way possible to facilitate the timely publication of the papers. As always, papers will be considered for publication under the assumption that they contain original unpublished work and that they are not being submitted for publication elsewhere.


Contents

General Information
  Welcome
  Schedule and Social Events
  Equipment and Sponsors
  Publications Outlets
  Conference Map with Important Locations
  Conference Building Plan with Room Numbers

Detailed Parallel Sessions Schedule

Plenary Talks (Lecture Room: 102)
  Plenary talk 1 (George Biros, Georgia Institute of Technology, Atlanta, USA), Wednesday, 30.06.2010, 9:15-10:15: A parallel adaptive fast-multipole method on heterogeneous architectures
  Plenary talk 2 (Laura Grigori, INRIA, France), Wednesday, 30.06.2010, 16:10-17:10: Minimizing Communication in Linear Algebra, Part 1
  Plenary talk 3 (Jim Demmel, UC Berkeley, USA), Friday, 02.07.2010, 9:15-10:15: Minimizing Communication in Linear Algebra, Part 2
  Plenary talk 4 (Ahmed Sameh, Purdue University, USA), Friday, 02.07.2010, 15:45-16:45: A Scalable Parallel Sparse Linear System Solver

Parallel Session 1.1 (Wednesday 30.06.2010, 10:45-12:25)
  1.1A: Sparse Matrix Computations on GPUs (Lecture Room 102)
  1.1B: Large Dense Eigenvalue Problems (Lecture Room 001)

Parallel Session 1.2 (Wednesday 30.06.2010, 14:00-15:40)
  1.2A: Sparse Matrix Computations on GPUs (Lecture Room 102)
  1.2B: Dense Matrix Computations (Lecture Room 001)

Parallel Session 2.1 (Thursday 01.07.2010, 08:40-10:20)
  2.1A: Combinatorial Scientific Computing (Lecture Room 120)
  2.1B: Parallel Monte Carlo (Lecture Room 117)
  2.1C: Multilevel Methods (Lecture Room 116)

Parallel Session 2.1 (Thursday 01.07.2010, 10:50-12:30)
  2.1D: Combinatorial Scientific Computing (Lecture Room 120)
  2.1E: Sparse Matrices and Applications (Lecture Room 116)
  2.1F: Recent Advances in Eigenvalues and Least-Squares Computation (Lecture Room 117)

Parallel Session 2.2 (Thursday 01.07.2010, 14:00-15:40)
  2.2A: Voxel-Based Computations (Lecture Room 120)
  2.2B: Hybrid Solver for Fluid Flow (Lecture Room 117)
  2.2C: Miscellaneous A (Lecture Room 116)

Parallel Session 2.2 (Thursday 01.07.2010, 16:10-17:50)
  2.2D: Voxel-Based Computations (Lecture Room 120)
  2.2E: Miscellaneous B (Lecture Room 116)
  2.2F: Helmholtz and Maxwell Solvers (Lecture Room 117)

Parallel Session 3.1 (Friday 02.07.2010, 10:45-12:25)
  3.1A: Autotuning (Lecture Room 102)
  3.1B: Accelerating the Solution of Linear Systems and Eigenvalue Problems on Heterogeneous Computing Environments (Lecture Room 001)

Parallel Session 3.2 (Friday 02.07.2010, 14:00-15:15)
  3.2A: Autotuning (Lecture Room 102)
  3.2B: Accelerating the Solution of Linear Systems and Eigenvalue Problems on Heterogeneous Computing Environments (Lecture Room 001)

Author Index


Important Locations

Conference Building: Kollegienhaus, Petersplatz 1

Thursday Dinner: Restaurant Papiermuehle, Sankt Alban-Tal 37, Tel: +41 (0)61 272 48 48

Hotel: Bildungszentrum 21, Missionsstrasse 21, Tel: +41 (0)61 260 21 21

Hotel: Rochat, Petersgraben 23, Tel: +41 (0)61 261 81 40

Welcome Reception: Wildt’sche Haus, Petersplatz 13


Conference Building Plan

Kollegienhaus, ground floor: Lecture Room 001.
Kollegienhaus, 2nd floor: Lecture Rooms 102, 116, 117, 120.


Detailed Schedule

Tuesday, June 29th, 2010
6:30PM Welcome Reception, Wildt’sche Haus (Petersplatz 13)

Wednesday, June 30th, 2010
9:00AM Conference Opening: Peter Arbenz, Olaf Schenk
9:15AM Plenary Talk — Lecture Room 102, Session Chair: Olaf Schenk
  Prof. George Biros: A parallel adaptive fast-multipole method on heterogeneous architectures
10:15AM Coffee and refreshment break
10:45AM Minisymposium 1.1.A — Lecture Room 102, Session Chair: Michael Garland
  Sparse Matrix Computations on GPUs
  10:45AM Vasily Volkow, UC Berkeley, Use registers and multiple outputs per thread on GPU
  11:10AM Jee Choi, Georgia Tech, Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs
  11:35AM John Owens, UC Davis, Tridiagonal GPU Solver
  12:00PM Weichung Wang, National Taiwan University, On the GPU-accelerated Multifrontal
10:45AM Minisymposium 1.1.B — Lecture Room 001, Session Chair: Michael Bader
  Large Dense Eigenvalue Problems
  10:45AM M. Petschow, Paolo Bientinesi, RWTH Aachen, The MRRR algorithm for multi-core processors
  11:10AM K. Waldherr, TU Munchen, Large eigenvalue problems: computation of ground states
  11:35AM T. Auckenthaler, TU Munchen, A two-step tridiagonalization for the parallel symmetric eigenproblem
  12:00PM Bruno Lang, Bergische Universitat Wuppertal, Partial Eigensystems of Symmetric Tridiagonals
12:25PM Lunch (on your own)
2:00PM Minisymposium 1.2.A — Lecture Room 102, Session Chair: Olaf Schenk
  Sparse Matrix Computations on GPUs
  2:00PM Michael Garland, NVIDIA, Sparse Matrix Computation on GPUs
  2:25PM Robert Strzodka, MPI Saarbruecken, Germany, GPU-Multigrid Solvers with Strong Smoothers
  2:50PM Tim Warburton, Rice University, GPU Accelerated Discontinuous Galerkin Methods
  3:15PM Dimitry Komatitsch, University of Pau, France, A spectral-element seismic wave propagation algorithm on a cluster of 192 GPUs
2:00PM Minisymposium 1.2.B — Lecture Room 001, Session Chair: Martin Becka
  Dense Matrix Computations
  2:00PM Inge Gutheil, Julich Supercomputing Centre, Performance Evaluation of ScaLAPACK Eigensolver Routines on two HPC Systems
  2:25PM Zvonimir Bujanovic, University of Zagreb, A hybrid m-Hessenberg reduction algorithm
  2:50PM Jurgen Gotze, TU Dortmund, Parallel Jacobi Methods on Nanoscale Integrated Circuits
  3:15PM Martin Becka, Slovak Academy of Sciences, New ordering for the parallel one-sided block-Jacobi SVD algorithm
3:40PM Poster Session - Coffee and refreshment break
4:10PM Plenary Talk — Lecture Room 102, Session Chair: Ahmed Sameh
  Prof. Laura Grigori: Minimizing Communication in Linear Algebra, Part 1

Thursday, July 1st, 2010
8:40AM Minisymposium 2.1A — Lecture Room 120, Session Chair: Laura Grigori
  Combinatorial Scientific Computing
  8:40AM Johannes Langguth, U Bergen, A parallel distributed memory algorithm for bipartite matchings
  9:05AM Madan Sathe, U Basel, Parallel Exact Matching in Massively-Parallel Large-Scale Nonlinear Optimization
  9:30AM Francois Pellegrini, U Bordeaux, Current challenges for parallel graph partitioning and static mapping
  9:55AM Rob H. Bisseling, U Utrecht, Sparse matrix partitioning, ordering, and visualizing by Mondriaan 3.0
8:40AM Minisymposium 2.1B — Lecture Room 117, Session Chair: Aneta Karaivanova
  Parallel Monte-Carlo
  8:40AM Yasunori Futamura, University of Tsukuba, Japan, Parallel stochastic estimation method for matrix eigenvalue distribution
  9:05AM Andrew Gilpin, Carnegie Mellon Univ., Speeding Up Modern Gradient-Based Algorithms for Large Sequential Games
  9:30AM Yongji Zhou, University of Leeds, Matrix Aided Task Scheduling — A Scheduling Scheme with Mathematical Representation of Task Dependency
  9:55AM Aneta Karaivanova, Bulgarian Academy of Sciences, GPU-based quasi-Monte Carlo algorithms for matrix computations
8:40AM Minisymposium 2.1C — Lecture Room 116, Session Chair: Yvan Notay
  Multilevel Methods
  8:40AM Yavor Vutov, Bulgarian Academy of Sciences, Large Scale Finite Element Modeling on Unstructured Grids


  9:05AM Pascal Henon, INRIA, Improve ILU preconditioners by recursive solves
  9:30AM Vit Vondrak, VSB-Technical University of Ostrava, MatSol — parallel implementation of scalable algorithms for large multibody contact problems
  9:55AM Yvan Notay, Universite Libre de Bruxelles, A parallel algebraic multigrid method
10:20AM Coffee and refreshment break
10:50AM Minisymposium 2.1D — Lecture Room 120, Session Chair: Olaf Schenk
  Combinatorial Scientific Computing
  10:50AM Bora Ucar, ENS Lyon, A parallel preprocessing for the optimal assignment problem based on diagonal scaling
  11:15AM Cedric Chevalier, CEA, Combinatorial models for mesh partitioning
  11:40AM Meisam Sharify, INRIA, A parallel preprocessing for the optimal assignment problem based on diagonal scaling
  12:05PM Giorgos Kollias, Purdue Univ., Parallel Algorithms for Graph Similarity
10:50AM Minisymposium 2.1E — Lecture Room 116, Session Chair: Pascal Henon
  Sparse Matrices and Applications
  10:50AM Gergana Bencheva, Bulgarian Academy of Sciences, Parallel Algorithms for Solution of a Chemotaxis System in Haematology
  11:15AM Urban Borstnik, University of Zurich, Parallel Sparse Matrix Library and Preconditioner Construction for Quantum Chemical Calculations of Large Systems
  11:40AM Nao Kuroiwa, Keio University, A parallel space-time finite difference solver for the steady-state shallow-water equation
  12:05PM Pasqua D’Ambra, Institute for High-Performance Computing and Networking (ICAR), High-Performance Preconditioners for the Solution of Pressure Systems in the LES of Turbulent Channel Flows
10:50AM Minisymposium 2.1F — Lecture Room 117, Session Chairs: Jose E. Roman and P. Vasconcelos
  Recent Advances in Eigenvalues and Least-Squares Computation
  10:50AM Marc Baboulin, Universidade de Coimbra, Computational issues in least squares conditioning
  11:15AM Rui Ralha, Universidade do Minho, Mixed precision computation of eigenvalues
  11:40AM Andres Tomas, Universidad Politecnica de Valencia, Efficient Gram-Schmidt orthogonalization with CUDA for iterative eigensolvers
  12:05PM Jose E. Roman, Universidad Politecnica de Valencia, Parallel linearization-based solvers for the quadratic eigenproblem
12:30PM Lunch (on your own)
2:00PM Minisymposium 2.2A — Lecture Room 120, Session Chair: Peter Arbenz
  Voxel-Based Computations
  2:00PM Dieter H. Pahr, Vienna University of Technology, Virtual Diagnosis — Voxel-based Simulations of Human Bones
  2:25PM Cyril Flaig, ETH Zurich, A Memory Efficient Multigrid Preconditioner for Micro-Finite Element Analysis based on Voxel Images
  2:50PM Nikola Kosturski, Bulgarian Academy of Sciences, Efficient Parallel Solution Algorithms for µFEM Elasticity Systems
  3:15PM Michael Bader, TU Munchen, Memory-Efficient Sierpinski-order Traversals on Dynamically Adaptive, Recursively Structured Triangular Grids
2:00PM Minisymposium 2.2B — Lecture Room 117, Session Chair: Achim Basermann
  Hybrid Solver for Fluid Flow
  2:00PM Luc Giraud, INRIA, Parallel scalability and complexity analysis of sparse hybrid linear solvers
  2:25PM Jonas Thies, University of Groningen, NL, A robust parallel hybrid solver for fluid flow problems
  2:50PM Achim Basermann, German Aerospace Center, Distributed Schur Complement Solvers for Real and Complex Block-Structured CFD Problems
2:00PM Minisymposium 2.2C — Lecture Room 116, Session Chair: Jennifer Scott
  Miscellaneous A
  2:00PM Tamas Kurics, Eotvos Lorand University, Equivalent operator preconditioning for elliptic problems with nonhomogeneous mixed boundary conditions
  2:25PM T.B. Jonsthovel, TU Delft, On the Performance and Implementation of the Parallel Deflated Preconditioned Conjugate Gradient method
  2:50PM Krzysztof Rojek, Czestochowa University of Technology, Model-Driven Adaptation of Double-Precision Matrix Multiplication to the Cell Processor Architecture
  3:15PM Jennifer Scott, Rutherford Appleton Laboratory, Designing sparse direct solvers for multicore architectures
3:40PM Coffee and refreshment break
4:10PM Minisymposium 2.2D — Lecture Room 120, Session Chair: Maya Neytcheva
  Voxel-Based Computations
  4:10PM Fumihiko Ino, Osaka University, Accelerating Iterative Stencil Computations on the GPU
  4:35PM Miriam Mehl, TU Munchen, Parallel Multigrid Methods on Octree-Like Grids in the Peano Framework


  5:00PM Vojtech Sokol, Institute of Geonics AS CR, Voxel based analysis of geomaterials with Schwarz-type parallel solvers
  5:25PM Yves Ineichen, ETH/IBM/PSI, A Fast Parallel Poisson Solver on Irregular Domains
4:10PM Minisymposium 2.2E — Lecture Room 116, Session Chair: Pasqua D’Ambra
  Miscellaneous B
  4:10PM Tetsuya Sakurai, University of Tsukuba, Japan, A hierarchical parallel eigenvalue solver: parallelism on top of multicore linear solvers
  4:35PM Julien Callant, Universite Libre de Bruxelles, Towards a robust algorithm for computing the rightmost eigenvalue
  5:00PM Matous Sedlacek, TU Munchen, Smoothing and Regularization with Modified Sparse Approximate Inverses
  5:25PM Mikolaj Szydlarski, IFP France, Algebraic Optimized Schwarz Domain Decomposition Methods
4:10PM Minisymposium 2.2F — Lecture Room 117, Session Chair: Rob H. Bisseling
  Helmholtz and Maxwell Solvers
  4:10PM Dan Gordon, The Technion, Israel, A robust and efficient, highly scalable parallel solution of the Helmholtz equation with large wave numbers
  4:35PM Rafael Lago, CERFACS, Solution of three-dimensional heterogeneous Helmholtz problems in geophysics
  5:00PM Christian Wieners, KIT, A Multigrid Method for Maxwell’s Equations with Parallel Nearly Direct Coarse Grid Solver
  5:25PM Giorgos Kollias, Purdue University, Asynchronous Row Projections
7:00PM Conference Dinner

Friday, July 2nd, 2010
9:15AM Plenary Talk — Lecture Room 102, Session Chair: Costas Bekas
  Prof. Jim Demmel: Minimizing Communication in Linear Algebra, Part 2
10:15AM Coffee and refreshment break
10:45AM Minisymposium 3.1A — Lecture Room 102, Session Chair: Rich Vuduc
  Autotuning
  10:45AM Emmanuel Agullo, INRIA, Autotuning dense linear algebra libraries on multicore architectures
  11:10AM Cedric Augonnet, INRIA, Auto-tuned performance models to improve task scheduling on accelerator-based platforms
  11:35AM Matthias Christen, Univ. of Basel, Code Generation and Autotuning for Stencil Codes on Multi- and Manycore Architectures
  12:00PM Diego Fabregat Traver, RWTH Aachen, Linear Algebra Algorithms for Automatic Differentiation
10:45AM Minisymposium 3.1B — Lecture Room 001, Session Chair: Costas Bekas
  Accelerating the Solution of Linear Systems and Eigenvalue Problems on Heterogeneous Computing Environments
  10:45AM Francisco D. Igual, Universitat Jaume I, Accelerating PLAPACK on Hybrid CPU-GPU Clusters
  11:10AM Lars Karlsson, Umea University, Fast Reduction to Hessenberg Form on Multicore Architectures
  11:35AM Daniel Kressner, ETH Zurich, Iterative Refinement of Spectral Decompositions in Heterogeneous Computing Environments
  12:00PM Costas Bekas, IBM Zurich Research Lab, Multicore Computers Can Protect Your Bones: Rendering the Computerized Diagnosis of Osteoporosis a Routine Clinical Practice
12:25PM Lunch (on your own)
2:00PM Minisymposium 3.2A — Lecture Room 102, Session Chair: Emmanuel Agullo
  Autotuning
  2:00PM P. Sadayappan, Ohio State Univ., Parametric Tiling for Auto Tuning
  2:25PM Reiji Suda, University of Tokyo, Online Automatic Tuning of Parallel Sparse Matrix Computations
  2:50PM Emmanuel Agullo, Univ. of Tennessee, Autotuning dense linear algebra libraries on GPUs
2:00PM Minisymposium 3.2B — Lecture Room 001, Session Chair: Daniel Kressner
  Accelerating the Solution of Linear Systems and Eigenvalue Problems on Heterogeneous Computing Environments
  2:00PM Julien Langou, University of Colorado, Choosing a Reduction Tree for Communication Avoiding QR
  2:25PM Eric Polizzi, University of Massachusetts, The FEAST eigenvalue parallel algorithm and solver
  2:50PM Jonathan Hogg, STFC, The challenge of the solve phase of a multicore solver
3:15PM Poster Session - Coffee and refreshment break
3:45PM Plenary Talk — Lecture Room 102, Session Chair: Peter Arbenz
  Prof. Ahmed Sameh: A Scalable Parallel Sparse Linear System Solver
4:45PM Closing Remarks: Erricos J. Kontoghiorghes


Plenary Talks

Wednesday, 30.06.2010 9:15-10:15 Lecture Room: 102 Plenary talk 1

A parallel adaptive fast-multipole method on heterogeneous architectures
Speaker: George Biros, Georgia Institute of Technology, Atlanta, USA
Chair: Olaf Schenk

The fast multipole method (FMM) is an efficient algorithm for what is known as “N-body problems”. I will present a newscalable algorithm and a new implementation of the kernel-independent fast multipole method, in which both distributedmemory parallelism (via MPI) and shared memory/SIMD parallelism (via GPU acceleration) are employed. I will concludemy talk by discussing the direct numerical simulation of blood flow in the Stokes regime using the FMM. I will describesimulations with 200 million red blood cells, an improvement of four orders of magnitude over previous results.
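For readers unfamiliar with N-body problems, the direct summation that FMM-type methods accelerate can be sketched in a few lines (an illustrative toy, not the speaker's implementation; the points, charges and 1/r kernel are invented for the example):

```python
# Direct N-body summation: the O(N^2) computation that fast multipole
# methods reduce to roughly O(N) by approximating far-field interactions.
def direct_sum(positions, charges):
    """Potential at each point due to all other points, 1/r kernel."""
    n = len(positions)
    potentials = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip self-interaction
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            r = (dx * dx + dy * dy) ** 0.5
            potentials[i] += charges[j] / r
    return potentials

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
phi = direct_sum(points, [1.0, 1.0, 1.0])
```

The nested loop touches every pair, which is what becomes prohibitive at the scale of 200 million red blood cells; FMM groups distant particles into multipole expansions so each target interacts with only O(1) aggregates.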

Wednesday, 30.06.2010 16:10-17:10 Lecture Room: 102 Plenary talk 2

Minimizing Communication in Linear Algebra, Part 1
Speaker: Laura Grigori, INRIA, France
Chair: A. Sameh

Algorithms have two kinds of costs: arithmetic and communication, by which we mean moving data either between levels of a memory hierarchy (in the sequential case) or between processors over a network (in the parallel case). Communication costs can already exceed arithmetic costs by orders of magnitude, and the gap is growing exponentially over time, so our goal is to design linear algebra algorithms that minimize communication.

In this first of two related talks on this topic (the second is by James Demmel), we will discuss new direct factorization algorithms for dense and sparse matrices that provably minimize communication. In the dense case we will discuss LU, QR and rank-revealing QR (RRQR) factorizations. Both LU and RRQR require new, numerically stable pivoting schemes. We show large speedups on multicore and clusters of multicore machines compared to conventional algorithms in the LAPACK, ScaLAPACK, MKL and ESSL libraries. In the sparse case, for matrices arising from discretizations on 2D and 3D regular grids, we present communication-optimal sequential and parallel sparse Cholesky algorithms.

Friday, 02.07.2010 9:15-10:15 Lecture Room: 102 Plenary talk 3

Minimizing Communication in Linear Algebra, Part 2
Speaker: Jim Demmel, UC Berkeley, USA
Chair: Costas Bekas

Algorithms have two kinds of costs: arithmetic and communication, by which we mean moving data either between levels of a memory hierarchy (in the sequential case) or between processors over a network (in the parallel case). Communication costs can already exceed arithmetic costs by orders of magnitude, and the gap is growing exponentially over time, so our goal is to design linear algebra algorithms that minimize communication.

In this second of two related talks on this topic (the first is by Laura Grigori), we will discuss the following ideas. First, we show how to extend known communication lower bounds for O(n^3) dense matrix multiplication to all direct linear algebra, i.e. for solving linear systems, least squares problems, eigenproblems and the SVD, for dense or sparse matrices, and for sequential or parallel machines. Second, we discuss communication-optimal algorithms for eigenvalue problems and the SVD. Third, we show how to minimize communication in Krylov-subspace methods for solving sparse linear systems and eigenproblems, and demonstrate new algorithms with significant speedups. Finally, we discuss extensions to Strassen-like algorithms.
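For reference, the bound being extended here is standard in the communication-avoiding literature: for conventional O(n^3) matrix multiplication on a machine whose fast memory holds M words, the number of words W moved between fast and slow memory satisfies

```latex
W \;=\; \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right).
```

In the parallel case M is the memory per processor; with P processors each storing O(n^2/P) words, this gives Omega(n^2/sqrt(P)) words communicated per processor.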

Friday, 02.07.2010 15:45-16:45 Lecture Room: 102 Plenary talk 4

A Scalable Parallel Sparse Linear System Solver
Speaker: Ahmed Sameh, Purdue University, USA
Chair: Peter Arbenz

Achieving high parallel scalability of sparse linear system solvers implemented on large-scale computing platforms comprised of tens of thousands of multicore processors is a task that offers many challenges. Towards achieving such a solver, we build on the success of the SPIKE family of parallel solvers for banded (dense within the band) linear systems. In this paper, we present a generalization of the SPIKE family of schemes for handling general sparse linear systems. Using a parallel reordering scheme to help in extracting an effective "banded" (sparse within the "band") preconditioner for an outer Krylov subspace method, we realize a hybrid sparse solver. Solving the sparse systems involving the preconditioner in each outer iteration is handled using a specialized version of the direct solver Pardiso. We show that our resulting hybrid sparse solver P-SPIKE (for Pardiso-Spike) is more scalable than current parallel direct solvers, and more robust than approximate LU-factorization or algebraic multigrid-based preconditioned Krylov subspace methods.


Wednesday 30.06.2010 10:45-12:25 Parallel Session 1.1

1.1A Lecture Room 102 SPARSE MATRIX COMPUTATIONS ON GPUS Chair: Michael Garland

#1: Use registers and multiple outputs per thread on GPU
Presenter: Vasily Volkov ([email protected]), UC Berkeley

I discuss a few novel techniques for efficient management of storage and communication resources on GPUs that I found useful in achieving superior performance in a range of numerical kernels such as FFT, matrix multiplication and stencils. I emphasize: (i) offloading storage from the scarce shared memory to the larger register file; (ii) computing multiple outputs per thread. The resulting codes achieve substantially higher performance despite using more registers per thread and running at lower occupancy.

#2: Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs
Presenter: Jee Choi ([email protected]), Georgia Institute of Technology
Co-authors: Amik Singh, Richard W. Vuduc

We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8x and 1.5x for single and double precision, respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify implementations that achieve within 15% of those found through exhaustive search.

GPUs are most often coupled with general-purpose CPUs, and it therefore makes sense to utilize both processing units to maximize performance. However, effectively dividing work between the two disparate processing elements is very difficult. For the third part of this study, we create a method for autotuning iterative solvers on CPU-GPU systems that finds the most effective way of dividing the workload between the CPU and the GPU, by predicting the execution time of the CPU kernels in the context of the overall solver using a probabilistic model for cache hits and misses, and by using a variation of the model from the second part of the study to predict the execution time of the counterpart GPU kernels.

#3: Efficient GPU Tridiagonal Solvers
Presenter: John D. Owens ([email protected]), UC Davis
Co-authors: Yao Zhang, Jonathan Cohen, Andrew Davidson

Tridiagonal matrix solvers are used in a broad range of numerical applications, including spectral Poisson solvers, cubic spline approximations, numerical ocean models, alternating-direction-implicit (ADI) methods, and preconditioners for iterative linear solvers. The most common application scenario in computer graphics, in our experience, is solving a large number of modestly sized tridiagonal systems. Such systems are useful in real-time graphics for depth-of-field and shallow-water simulations.

Traditional tridiagonal solvers are serial (and efficient). On the GPU, our previous work has demonstrated the use of a parallel-friendly cyclic reduction formulation of a tridiagonal solver. The work I will discuss in this talk expands the tridiagonal toolbox to include two new algorithms on the GPU, parallel cyclic reduction and recursive doubling. We analyze our implementation of these algorithms in detail in order to understand their advantages and disadvantages; our analysis relies on performance-analysis techniques that we have developed in the course of this project, which I will also describe. I will then demonstrate our implementation of novel hybrids of these algorithms that have performance superior to any of the three algorithms alone. The key to the performance of these hybrids lies in the switch between work-efficient and step-efficient algorithms to best take advantage of the GPU's architecture, work we hope will also apply to other similar problems in numerical and other domains.

I also expect to discuss some of the self-tuning work that we are in the process of performing, which we hope can further optimize our tridiagonal implementations. Future work includes solving sets of larger parallel tridiagonal systems that are each larger than shared memory, and solving one very large tridiagonal system.
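For readers unfamiliar with the baseline: the "traditional" serial solver referred to above is typically the Thomas algorithm, Gaussian elimination specialized to tridiagonal systems at O(n) cost. A minimal plain-Python sketch (function and variable names are mine, not from the talk; cyclic reduction and recursive doubling restructure this inherently sequential recurrence to expose parallelism):

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Thomas algorithm: solve a tridiagonal system in O(n).

    sub[i]  couples row i to x[i-1] (sub[0] is unused),
    diag[i] is the main diagonal,
    sup[i]  couples row i to x[i+1] (sup[-1] is unused).
    """
    n = len(diag)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = sup[0] / diag[0]
    dp[0] = rhs[0] / diag[0]
    for i in range(1, n):                      # forward elimination
        denom = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / denom if i < n - 1 else 0.0
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Note the loop-carried dependence in both sweeps: each step needs the previous one, which is exactly what the parallel reformulations in the talk eliminate.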

This is joint work with Yao Zhang (UC Davis), who led the project; Jonathan Cohen (NVIDIA); and Andrew Davidson (UC Davis). The work was first presented by Yao at PPoPP 2010.

#4: On the GPU-accelerated Multifrontal
Presenter: Weichung Wang ([email protected]), Department of Mathematics, National Taiwan University


The multifrontal method is an efficient direct method for solving large-scale sparse linear systems. The method transforms a large sparse matrix factorization into a sequence of factorizations involving smaller dense frontal matrices. Some of the dense operations can be performed simultaneously and thus introduce the possibility of parallelism. We will discuss how such parallelism can be implemented on graphics processing units (GPUs) to take advantage of the hundreds of lightweight computing cores on the GPU.

1.1B Lecture Room 001 LARGE DENSE EIGENVALUE PROBLEMS Chair: Michael Bader

#5: The MRRR algorithm for multi-core processors
Presenter: Matthias Petschow ([email protected]), RWTH Aachen
Co-authors: Paolo Bientinesi

Most direct methods for computing the eigenvalues and eigenvectors of a dense Hermitian matrix reduce the input matrix to real symmetric tridiagonal form. The eigenvalues and eigenvectors of the tridiagonal matrix are then computed. The algorithm of Multiple Relatively Robust Representations (MRRR or MR3) computes k eigenvalues and eigenvectors of a tridiagonal matrix of size n in O(nk) arithmetic operations. For most matrices it is the fastest tridiagonal eigensolver available.

While parallel implementations especially suited to distributed-memory computer systems exist for large matrices, small to medium-size problems rely on LAPACK's implementation xSTEMR. However, xSTEMR does not take advantage of today's multi-core and future many-core architectures, as it is optimized for single-core CPUs. We present a design of the MRRR algorithm (MR3-SMP) specifically tailored for multi-core processors. Our design uses a fine grain of parallelism to exploit the features of these architectures, thereby avoiding any form of redundant computation, expensive communication and load-balancing issues.

We show that MR3-SMP obtains close to optimal speedups for a wide range of matrices, including those of modest size. We tested our algorithm against all the tridiagonal eigensolvers contained in LAPACK and Intel's Math Kernel Library (MKL) on matrices arising in applications; not only is MR3-SMP the fastest, it is also the algorithm that obtains the best speedups.

#6: Large eigenvalue problems: computation of ground states
Presenter: Konrad Waldherr ([email protected]), TU München

The computation of the ground state (i.e. the eigenvector corresponding to the smallest eigenvalue) is an important task when dealing with quantum systems. As the dimension of the underlying vector space grows exponentially in the number of particles, one has to consider appropriate subsets promising convenient approximation properties. The variational ansatz for this numerical approach leads to the minimization of the Rayleigh quotient over the chosen subset. The Alternating Least Squares technique is thereby applied to break down the eigenvector computation to problems of appropriate size.
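The variational ansatz mentioned above rests on the standard bound: minimizing the Rayleigh quotient over a restricted subset S of states can only overestimate the ground-state energy of the Hamiltonian H,

```latex
\lambda_{\min}(H) \;=\; \min_{x \neq 0} \frac{x^{\ast} H x}{x^{\ast} x}
\;\leq\; \min_{x \in S,\; x \neq 0} \frac{x^{\ast} H x}{x^{\ast} x},
```

so the quality of the approximation is governed entirely by how well the chosen subset can represent the true ground state.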

For reasons of efficiency, we consider convenient representations for vectors which allow practical calculations both of the matrix-vector product and of the inner product of two decomposed vectors; these could also be done in parallel.

Altogether, we present and analyze different techniques for the computation of ground states of huge systems. These schemes provide an opportunity for parallelization and can be executed in polynomial time complexity.

#7: A two-step tridiagonalization for the parallel symmetric eigenproblem
Presenter: Thomas Auckenthaler ([email protected]), Technische Universität München

A widespread approach to compute all eigenvalues and eigenvectors of a symmetric matrix consists of three phases. First the symmetric matrix is reduced to tridiagonal form, then the tridiagonal eigenproblem is solved, and in a last step the eigenvectors of the tridiagonal matrix are back-transformed to the eigenvectors of the original matrix. The current state-of-the-art algorithms for the parallel symmetric eigenproblem (e.g. ScaLAPACK) are reaching the limits of scalability, especially for the tridiagonalization of the matrix. One promising approach to improve the scalability and to break the memory wall is a two-step tridiagonalization: the symmetric matrix is first reduced to a symmetric banded matrix, and in a second step the banded matrix is brought to tridiagonal form.

#8: Partial Eigensystems of Symmetric Tridiagonals
Presenter: Bruno Lang ([email protected]), Bergische Universität Wuppertal
Co-authors: Paul Willems

The MRRR algorithm is one of the methods of choice for computing partial eigensystems of symmetric tridiagonal matrices, mainly for two reasons. First, the complexity is roughly proportional to kn, where n and k denote the matrix dimension and the number of requested eigenpairs, respectively; second, the algorithm allows for simple parallelization. However, the quality of the resulting eigensystem, in particular the orthogonality of the vectors, is somewhat inferior to other methods such as divide and conquer, and the method is quite difficult to grasp due to the several techniques necessary to achieve stability.

In this talk we will present the MRRR algorithm in a different way that allows us to separate its "algorithmic core" from the handling of the recursion, that is, the details of re-factoring a shifted matrix. This also paves the way for improving on the current implementation by using a different representation in the "twisted factorizations", or by switching to block factorizations, thus in fact leading to a whole framework of MRRR-type algorithms. Numerical experiments show that these modifications can improve the stability and effectiveness of the MRRR algorithm.

We will also comment on the possibility of using a divide-and-conquer-type algorithm for computing partial eigensystems.

Wednesday 30.06.2010 2:00-3:40 Parallel Session 1.2

1.2A Lecture Room 102 SPARSE MATRIX COMPUTATIONS ON GPUS Chair: Olaf Schenk

#9: Sparse Matrix Computations on GPUs
Presenter: Michael Garland ([email protected]), NVIDIA Research
Co-authors: Nathan Bell

Modern microprocessors are becoming increasingly parallel devices, and GPUs are at the leading edge of this trend. Designing parallel algorithms for manycore chips like the GPU can present interesting challenges, particularly for computations on sparse data structures. One particularly common example is the collection of sparse matrix solvers and combinatorial graph algorithms that form the core of many physical simulation techniques. Although seemingly irregular, these operations can be implemented efficiently on GPU processors.

In this talk, I will focus on the problem of sparse matrix-vector multiplication (SpMV), discussing data structures and algorithms that are well suited to throughput-oriented architectures like the GPU. The techniques I will describe exploit several common sparsity classes, ranging from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row.
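As a baseline for the formats under discussion, here is a minimal CSR (compressed sparse row) SpMV in plain Python (illustrative only; names are mine, not from the talk). GPU variants such as ELLPACK pad rows to a uniform length so threads can read memory in a coalesced pattern, which is why the per-row imbalance mentioned above matters:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x with A in compressed sparse row (CSR) format.

    Row i owns the nonzeros values[row_ptr[i]:row_ptr[i+1]], whose column
    indices are stored in col_idx.  On a GPU the outer i-loop is mapped to
    threads, so rows with very different nonzero counts cause imbalance.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# Example: A = [[1, 0, 2],
#               [0, 3, 0],
#               [4, 0, 5]], x = [1, 1, 1]
y = csr_spmv([1.0, 2.0, 3.0, 4.0, 5.0], [0, 2, 1, 0, 2], [0, 2, 3, 5],
             [1.0, 1.0, 1.0])
# y == [3.0, 3.0, 9.0]
```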

Our techniques are efficient, in the sense that they successfully utilize large percentages of peak bandwidth. Furthermore, they also deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.

#10: GPU-Multigrid Solvers with Strong Smoothers
Presenter: Robert Strzodka ([email protected]), Max Planck Institut Informatik

Neither solvers with best numerical convergence nor solvers with best parallel efficiency are the best choice for the fast solution of PDE problems in practice. The fastest solvers require a delicate balance between their numerical and hardware characteristics, and the talk will discuss this balance on all levels of current hardware architectures: SIMD vectorization, thread block parallelization, intra-node parallelism and inter-node parallelism in a cluster. Finally, we will look at ways of abstracting the technical complexities for everyday use in computational science.

#11: GPU Accelerated Discontinuous Galerkin Methods
Presenter: Timothy Warburton ([email protected]), Rice University

There has been recent interest in using general-purpose graphics processing units to accelerate implementations of a wide range of numerical partial differential equation discretizations for a varied range of applications [2]. In particular, it has proved possible to obtain substantial speedup of discontinuous Galerkin based solvers for simulations of time-domain electromagnetics by porting the computational kernels to the CUDA programming language [3] and computing on several graphics processing units [4]. Further accelerations can be achieved by applying metaprogramming techniques to automatically implement and tune the computational kernels [5],[6]. In this talk I will discuss issues related to obtaining high GPGPU-accelerated computational performance for implementations of discontinuous Galerkin methods for Maxwell's equations of electromagnetics and also the compressible Navier-Stokes equations [8]. The computational architecture of the modern GPGPU favors problems that can be partitioned into small but algebraically intense operations on a subset of the overall problem data, with weak coupling to a small halo subset of the remaining data. On the other hand, operations involving thinly populated sparse matrices (i.e. matrices with no sparse block dense fill-in) are dominantly memory bandwidth bound [9]. Global vector reduction operations, for instance inner products, also typically yield relatively low performance [1]. To accommodate these limitations we have considered using an adaptive explicit time stepping method specifically designed for somewhat stiff ordinary differential equations [7]. This choice helps reduce the total number of inner products when compared with Krylov subspace based iterative solvers, coincidentally avoids some possibility of convergence stalls due to the use of single-precision floating point operations on the GPU, and is easily adapted to handle time-dependent artificial viscosity.

The GPGPU model of computing shifts the ratio of memory versus compute power firmly towards the latter, a trend that is increasing in time. I will describe how we substantially reduced the memory footprint when using curvilinear elements. We designed the low-storage curvilinear discontinuous Galerkin method to avoid the need to compute and store custom mass matrices for each individually curved element. Computational results and the performance characteristics of these methods will be discussed. This work was done in collaboration with Andreas Klockner, Nigel Nunn, and Nico Godel.

[1] N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. Technical Report NVR-2008-004, NVIDIA Corporation, 2008.

[2] NVIDIA Corp. The CUDA Zone. http://www.nvidia.com/cuda.


[3] NVIDIA Corp. Compute Unified Device Architecture Programming Guide. NVIDIA: Santa Clara, CA, 2007.

[4] N. Goedel, T. Warburton, and M. Clemens. GPU Accelerated Discontinuous Galerkin FEM for Electromagnetic Radio Frequency Problems. 2009.

[5] A. Klockner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. PyCUDA: GPU Run-Time Code Generation for High-Performance Computing. Arxiv preprint arXiv:0911.3456, 2009.

[6] A. Klockner, T. Warburton, J. Bridge, and J.S. Hesthaven. Nodal discontinuous Galerkin methods on graphics processors. Journal of Computational Physics, 228(21):7863–7882, 2009.

[7] V.I. Lebedev. How to solve stiff systems of differential equations by explicit methods. Numerical Methods and Applications, pages 45–80, 1994.

[8] H. Riedmann. Efficient numerical treatment of the compressible Navier-Stokes equations with nodal discontinuous Galerkin methods on graphics processors. Technical Report 2009-32, Scientific Computing Group, Brown University, Providence, RI, USA, October 2009.

[9] S. Sengupta, M. Harris, Y. Zhang, and J.D. Owens. Scan primitives for GPU computing. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, page 106. Eurographics Association, 2007.

#12: A spectral-element seismic wave propagation algorithm on a cluster of 192 GPUs
Presenter: Dimitri Komatitsch ([email protected]), University of Pau, France
Co-authors: Gordon Erlebacher, Dominik Goddeke and David Michea

We use the Spectral Element Method (SEM) to simulate numerically the propagation of seismic waves resulting from active seismic acquisition experiments in the oil industry or from earthquakes in the Earth. The SEM is a high-order finite-element method that solves the variational form of the elastic wave equation in the time domain on a non-structured mesh of elements, called spectral elements, in order to compute the displacement vector of any point of the medium under study. To represent the displacement field in an element, the SEM uses Lagrange polynomials of degree 4 to 10, typically, for the interpolation of functions. These Lagrange polynomials are defined in terms of control points that are chosen to be the Gauss-Lobatto-Legendre (GLL) points because they lead to an exactly diagonal mass matrix, i.e., no resolution of a large linear system is needed.

Our goal is to port this application to a large cluster of GPUs, using the Message-Passing Interface (MPI) between compute nodes to exchange information between the GPUs. There are several challenges to address in mapping SEM computations to a GPU cluster. The elements that compose the mesh slices are in contact through a common face, edge or point. To allow for overlap of communication between cluster nodes with calculations on the GPUs, we compute the outer elements first. Once these computations have been completed, we copy the associated data to MPI buffers and issue a non-blocking MPI call, which initiates the communication and returns immediately. While the messages are traveling across the interconnect, we compute the inner elements. Achieving effective overlap requires that the ratio of the number of inner to outer elements be sufficiently large, which is the case for large enough mesh slices. Under these conditions, the MPI data transfer will statistically likely complete before the completion of the computation of the inner elements. We note that to achieve effective overlap on a cluster of GPUs, this ratio must be larger than for classical CPU clusters, due to the speedup obtained by the GPUs. The elementwise contributions need to be assembled at each global point of the mesh. Each such point receives contributions from a varying number of elements, which calls for an atomic summation, i.e., an order-independent sequential accumulation. We decouple these dependencies, which do not parallelize in a straightforward manner, by using a mesh coloring scheme to create sets of independent elements in the mesh (Komatitsch et al., 2009). This results in one CUDA kernel call per color, and total parallelism inside each color.
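The coloring idea can be illustrated with a small greedy routine (plain Python, a sketch under the assumption that adjacency lists "element shares a mesh point with element" are available; the actual scheme of Komatitsch et al., 2009 is mesh-specific). Elements sharing a global point are "adjacent" and must receive different colors, so that all elements of one color can be assembled in a single kernel launch without atomic updates:

```python
def greedy_color(adjacency):
    """Give each element the smallest color not used by an
    already-colored neighbor.  adjacency maps element -> list of
    elements it shares at least one mesh point with."""
    color = {}
    for elem, neighbors in adjacency.items():
        used = {color[n] for n in neighbors if n in color}
        c = 0
        while c in used:
            c += 1
        color[elem] = c
    return color

# Three mutually touching elements plus one that only touches element 2:
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coloring = greedy_color(adjacency)
# No two adjacent elements share a color, so each color class is one
# conflict-free kernel launch.
```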

The machine we use is a cluster of 48 Tesla S1070 units at CCRT/CEA/GENCI in Paris, France; each Tesla S1070 has four GT200 GPUs and two PCI Express-2 buses (i.e., two GPUs share a PCI Express-2 bus). The GT200 cards have 4 GB of memory. We use mesh slices of 446,080 spectral elements each. Each slice contains approximately 29.6 million unique grid points, i.e., 88.8 million degrees of freedom, corresponding to 3.6 GB (out of 4 GB) of memory used per GPU. The largest possible problem size, using all 192 GPUs in the cluster, thus has 17.05 billion unknowns. We analyzed the average elapsed time per time step of the SEM algorithm for simulations using between 4 and 192 GPUs (i.e., the whole machine), in steps of four GPUs; each PCIe-2 bus is shared by two GPUs. As expected from our overlap of communications with computations, the weak scaling obtained is very good; communication is essentially free.

1.2B Lecture Room 001 DENSE MATRIX COMPUTATIONS Chair: Martin Becka

#13: Performance Evaluation of ScaLAPACK Eigensolver Routines on two HPC Systems
Presenter: Inge Gutheil ([email protected]), Jülich Supercomputing Centre

The computation of all or about 10% of the eigenvalues of large dense matrices plays an important role in several scientific applications. Thus the performance, and especially the scaling, of dense eigensolvers is a critical part of such applications.


We investigate the performance and scaling of the eigensolvers PDSYEV and PDSYEVX from ScaLAPACK, in the manually compiled version on the IBM BlueGene/P JUGENE and in the Intel MKL version on the Intel Nehalem cluster JUROPA at Forschungszentrum Jülich.

On JUGENE, if 10% of the eigenvalues and eigenvectors of matrices of size N = 10000 are to be computed with PDSYEVX, the speedups decrease from about 1.8 (from 64 to 128 and from 128 to 256 processors) to less than 1.1 (from 1024 to 2048 processors). Examining the routine with the performance analysis tool Scalasca, we found that in the above-mentioned case about 90% of the time is spent in the reduction routine PDSYNTRD, with 1024 as well as with 2048 processors. The difference can be found in PDSYTTRD, which takes about 60% of the total time with 1024 processors and about 70% with 2048 processors. Almost half the time here is spent in the broadcasts along processor rows and columns, whereas with 1024 processors only a quarter of the time is spent in the broadcasts. This can be either because the problem was too small for 2048 processors or because broadcasts become rather expensive on BlueGene with an increasing number of processors.

On JUROPA with the ScaLAPACK routines from MKL, the situation is even stranger. For small processor numbers JUROPA is faster than JUGENE, as can be expected from the peak performance of a single processor. For 1024 processors the performance of both computers is almost the same, and for 2048 processors JUROPA is faster again. For the same task the speedup from 64 to 128 processors is about 1.8, from 128 to 256 processors it is 2.5, from 256 to 512 it is about 1.3, and from 512 to 1024 there is a slow-down of a factor of three to four. A first look at the program with Scalasca also suggests that most of the time is spent in broadcast operations.

We would like to discuss whether there are chances to improve the performance and scaling of the ScaLAPACK dense eigensolvers on computers with several thousands of processors.

#14: A hybrid m-Hessenberg reduction algorithm
Presenter: Zvonimir Bujanovic ([email protected]), University of Zagreb, Dept. of Mathematics

The m-Hessenberg form (also known as banded Hessenberg form with lower bandwidth m) of a given matrix A ∈ R^{n×n} is a matrix H ∈ R^{n×n}, orthogonally similar to A, such that H_{i,j} = 0 for all i, j with i > j + m. The need for reducing a matrix to m-Hessenberg form occurs e.g. in control theory as the "reduction to the controller Hessenberg form", and also in some implementations of the block Arnoldi algorithm for computing the eigenvalues of a large sparse matrix. In this talk we analyze and improve the implementation of this reduction. The improvement introduces blocking with a sophisticated performance boost for larger values of m. We also incorporate a hybrid CPU+GPU evaluation and analyze the performance benefits.
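The defining sparsity pattern translates directly into a predicate (plain Python, illustrative only; names are mine):

```python
def is_m_hessenberg(H, m):
    """Check the m-Hessenberg pattern: H[i][j] == 0 whenever i > j + m,
    i.e. at most m nonzero sub-diagonals below the main diagonal."""
    n = len(H)
    return all(H[i][j] == 0
               for i in range(n)
               for j in range(n)
               if i > j + m)

# m = 1 recovers the ordinary upper Hessenberg form (one sub-diagonal):
H = [[1, 2, 3],
     [4, 5, 6],
     [0, 7, 8]]
```

Here is_m_hessenberg(H, 1) holds, while is_m_hessenberg(H, 0), which would require an upper triangular matrix, does not.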

The author is an SNF/SCOPES-supported student.

#15: Parallel Jacobi Methods on Nanoscale Integrated Circuits
Presenter: Jurgen Gotze ([email protected]), TU Dortmund
Co-authors: Chi-Chia Sun

Modern Very Large Scale Integration (VLSI) manufacturing technology has kept shrinking down to Very Deep Sub-Micron (VDSM) feature sizes at a very fast pace, and Moore's law is expected to hold for the next 10 years. As VLSI manufacturing shrinks to 32nm or even below in the near future, advanced many-core design will make it possible to realize a fully parallel EVD/SVD computing array based on Jacobi's method. Although Jacobi's method is considered slow compared to bidiagonalization-based QR methods, it is highly suited for parallel implementation [1] and yields small singular values with high relative accuracy [2].

The convergence of Jacobi's method is very robust to modifications of the rotation computation. This fact is very important from a VLSI implementation point of view: it allows simplifying the computational complexity of the rotations significantly while still maintaining convergence. This results in a simplified design, where each processor executes a simplified two-sided rotation (two-dimensional array). As a consequence, the number of processors that can be implemented on an integrated circuit is increased not only by the tremendous advance in VLSI technology, but also by the possible simplifications of the rotation computation.

However, simplified rotations usually increase the number of iterations required for convergence, i.e. the processor array has to execute its operations (sweeps) more often, which means that the number of data transfers over the interconnections also increases. It therefore becomes a trade-off problem between the performance/complexity of the hardware, the load/throughput of the interconnects, and the overall energy/power consumption due to the behavior of the iterative algorithm; i.e., how to implement the algorithm in VLSI changes with technology and design objectives.

In [3-4], we discussed the impact of VDSM design issues on the implementation of a two-sided Jacobi EVD parallel array. Based on the CORDIC method, which is an iterative algorithm for the rotation computation, a simplified approximate rotation computation was used. The proposed µ-CORDIC reduces the required 32 iterations of the full CORDIC (assuming a 32-bit word length) to only one iteration and can be implemented very efficiently.
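For readers unfamiliar with CORDIC, a plain-Python sketch of the standard micro-rotation iteration follows. Note that simply truncating to one iteration, as below, is not the µ-CORDIC selection rule of [3-4]; it only illustrates the accuracy gap between a full and a single micro-rotation:

```python
import math

def cordic_rotate(x, y, theta, iters):
    """Rotate (x, y) by angle theta using `iters` CORDIC micro-rotations."""
    for k in range(iters):
        d = 1.0 if theta >= 0 else -1.0          # direction of micro-rotation k
        x, y = x - d * y * 2**-k, y + d * x * 2**-k
        theta -= d * math.atan(2**-k)            # angle still to be rotated
    # undo the accumulated CORDIC gain
    gain = math.prod(math.sqrt(1 + 4**-k) for k in range(iters))
    return x / gain, y / gain

# Full CORDIC (32 micro-rotations) versus a single micro-rotation:
x32, y32 = cordic_rotate(1.0, 0.0, math.pi / 5, 32)
x1, y1 = cordic_rotate(1.0, 0.0, math.pi / 5, 1)
```

With 32 micro-rotations the result matches (cos 36°, sin 36°) to roughly 2^-31; the one-iteration variant is cheap but only approximate, which is exactly the inner/outer iteration trade-off studied in the talk.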

In this paper, we discuss the impact of exchanging inner iterations (iterations required for the CORDIC rotation computation) and outer iterations (sweeps of the Jacobi method) for Jacobi's EVD/SVD methods. We suggest an efficient strategy for parallel Jacobi EVD/SVD methods to fit the VDSM design criteria in terms of area, timing delay and power/energy consumption. Simulation and VLSI implementation results are provided. Furthermore, the concept of exchanging inner and outer iterations can be extended to other iterative algorithms and their nanoscale implementation. This topic of future research is also briefly discussed.

Computer Science Department, University of Basel, Switzerland 7

Page 19: 6th International Workshop on Parallel Matrix Algorithms ...Peter Arbenz, Yousef Saad, Ahmed Sameh, and Olaf Schenk. The deadline for paper submissions is the 30th of September 2010.

Thursday 01.07.2010 08:40-10:20 Parallel Matrix Algorithms and Applications (PMAA’10) Parallel Session 2.1

[1] P. P. M. De Rijk, "A One-Sided Jacobi Algorithm for Computing the Singular Value Decomposition on a Vector Computer," SIAM Journal on Scientific Computing, vol. 10, no. 2, pp. 359-371, 1989.

[2] Z. Drmac and K. Veselic, "New Fast and Accurate Jacobi SVD Algorithm: I," SIAM J. Matrix Anal. Appl., vol. 29, pp. 1322-1342, 2008.

[3] Chi-Chia Sun and Jurgen Gotze, "A VLSI Design Concept for Parallel Iterative Algorithms," Advances in Radio Science, vol. 7, pp. 95-100, 2009.

[4] Chi-Chia Sun and Jurgen Gotze, "VLSI Circuit Design Concepts for Parallel Iterative Algorithms in Nanoscale," The 9th International Symposium on Communications and Information Technologies (ISCIT), Incheon, Korea, Sep. 28-30, 2009.

#16: New ordering for the parallel one-sided block-Jacobi SVD algorithm
Presenter: Martin Becka ([email protected]), Institute for Mathematics, Slovak Academy of Sciences
Co-authors: Gabriel Oksa, Marian Vajtersic

As is well known, the one-sided block-Jacobi SVD algorithm is based on the mutual orthogonalization of block columns. When computing the SVD in parallel on p processors, this orthogonalization is usually organized in some cyclic manner, i.e., the pairs of block columns that are orthogonalized in a given parallel iteration step are chosen according to some prescribed, fixed list. However, this approach does not take into account the actual status of any two chosen block columns and can converge rather slowly.

We propose a dynamic approach that chooses, at the beginning of each iteration step, those pairs of block columns that are most mutually inclined. The inclination of two blocks is measured by their smallest principal angles. Instead of computing these angles by moving the blocks around processors at the beginning of each parallel iteration step, we propose to estimate them by a small number of Lanczos iterations implemented in parallel. Such an approach substantially decreases the communication complexity, which would otherwise be huge, since one would need to communicate all blocks through all processors. In our approach the blocks remain stationary inside the processors. After finishing the Lanczos iterations, one chooses the p pairs of block columns that are most inclined to each other. This is done by applying a maximum-weight perfect matching algorithm to the complete edge-weighted graph whose number of vertices equals the number of blocks and whose edge weights are sums of the largest cosines describing the smallest principal angles between any two blocks. Hence, this dynamic ordering is an extension of our original dynamic approach for the two-sided block-Jacobi method.
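The quantity driving this ordering can be illustrated with dense tools (the abstract estimates it cheaply with parallel Lanczos iterations; the full SVD below is only a sketch of what is being estimated):

```python
import numpy as np

def cos_principal_angles(X, Y):
    """Cosines of the principal angles between span(X) and span(Y)."""
    QX, _ = np.linalg.qr(X)
    QY, _ = np.linalg.qr(Y)
    return np.linalg.svd(QX.T @ QY, compute_uv=False)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))                      # one block column
Y = X @ rng.standard_normal((4, 4)) \
    + 1e-3 * rng.standard_normal((100, 4))             # strongly inclined block
Z = rng.standard_normal((100, 4))                      # an unrelated block
print(cos_principal_angles(X, Y))  # cosines near 1: orthogonalize this pair first
print(cos_principal_angles(X, Z))  # markedly smaller cosines
```

Pairs whose cosines are close to 1 (small principal angles) are far from mutually orthogonal, so orthogonalizing them first makes the most progress per sweep.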

We provide a detailed description of our new parallel ordering and report first numerical results. This work has been supported by the Grant APVV-0532-07 from the Grant Agency for Science and Research, Slovak Republic.

Thursday 01.07.2010 08:40-10:20 Parallel Session 2.1

2.1A Lecture Room 120 COMBINATORIAL SCIENTIFIC COMPUTING Chair: Laura Grigori

#17: A parallel distributed memory algorithm for bipartite matchings
Presenter: Johannes Langguth ([email protected]), U Bergen
Co-authors: Fredrik Manne, Md. Mostofa Ali Patwary

It is a well-established result that improved pivoting in linear solvers can be achieved by computing a bipartite matching between matrix rows and positions on the main diagonal. With the availability of increasingly faster linear solvers, the speed of these bipartite matching computations should keep up in order to avoid slowing down the main computation. Furthermore, the size of many instances arising in practical problems often exceeds the local memory of available machines, thus requiring the application of distributed memory methods.

Fast sequential algorithms for bipartite matching have been studied for a long time and have been used successfully for computing pivoting strategies. However, these approaches are based on finding augmenting paths and thus exhibit inherent sequentiality; success in parallelizing them has been very limited. A more promising approach is the push-relabel strategy, which allows multiple operations to be performed independently and has been used to design parallel bipartite matching algorithms on shared memory machines. Our goal is to adapt this approach to the distributed memory model, thereby allowing the computation of matchings on massive graphs. Assuming an edge-based partitioning, our main tool is the detailed modelling of connectors, i.e. vertices shared by multiple processors, in order to translate operations on the underlying graph to the partitioned graph and vice versa. Furthermore, we change the edge-based distance measure used in the push-relabel strategy to a processor-connection-based measure, which more accurately reflects running time costs in a distributed memory setting.
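For contrast with the push-relabel approach pursued in the talk, the classical sequential augmenting-path idea fits in a few lines (a minimal sketch of the textbook algorithm, not the authors' code):

```python
def max_bipartite_matching(adj, n_left, n_right):
    """Augmenting-path matching; adj[u] lists right-vertices of left-vertex u."""
    match_r = [-1] * n_right          # match_r[v] = left vertex matched to v

    def augment(u, seen):
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                # v is free, or its current partner can be rematched elsewhere
                if match_r[v] == -1 or augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False

    size = sum(augment(u, [False] * n_right) for u in range(n_left))
    return size, match_r

# 3x3 example admitting a perfect matching:
adj = [[0, 1], [0, 2], [1]]
print(max_bipartite_matching(adj, 3, 3))  # → (3, [0, 2, 1])
```

Each augmenting search walks a path through the current matching and flips it; since later searches depend on the matching left by earlier ones, the procedure is hard to parallelize, which is the sequentiality bottleneck described above.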

We present a parallel algorithm for distributed memory machines that implements these ideas and show some experimental results, on randomly generated and real-world instances, regarding its speed and scalability. We also discuss extensions to the weighted matching problem and its application in matrix computations, as well as heuristic initialization considerations, which have been shown to have a significant effect on performance.

#18: Parallel Exact Matching in Massively-Parallel Large-Scale Nonlinear Optimization
Presenter: Madan Sathe ([email protected]), Dept. CS, U Basel


Co-authors: Olaf Schenk, Helmar Burkhart

Current and upcoming hardware developments provide a highly parallel environment which can greatly support the design of parallel algorithms for very large-scale problems. The demands on the algorithm side are to scale up to thousands of processors, to be robust, and to be fast, while ensuring the quality of the solution.

Combinatorial scientific computing is an interdisciplinary research field which combines algorithmic computer science with scientific computing. One of the main focuses in this field is graph problems such as graph partitioning, shortest paths, or graph matching. The emphasis of this talk is placed on the use of exact bipartite matching algorithms in a parallel nonlinear optimization framework.

We propose to use the interior point optimizer IPOPT to deal with large-scale PDE-constrained nonlinear optimization problems as they appear in biomedical applications. Interior point methods require solving a linear system Ax = b at each iteration, where A is the KKT matrix, which is very sparse, symmetric, highly ill-conditioned, and indefinite. The linear system is solved by the massively-parallel hybrid solver PSPIKE – a combination of a direct and an iterative solver – using preconditioning techniques.

Two kernel routines in the preconditioner are crucial for the convergence of the solver: on the one hand, the scaling of matrix entries such that all entries are less than or equal to one in modulus; on the other hand, the permutation of large entries onto the diagonal. Both goals can be achieved optimally by solving the maximum weighted matching problem.
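The effect of the second routine can be sketched with dense tools (a toy illustration using scipy's assignment solver, not the parallel auction algorithm of this talk): maximizing the sum of log|a_ij| over a perfect matching picks one entry per row and column, and the induced column permutation carries the largest entries onto the diagonal.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy matrix whose largest entries sit off the diagonal.
A = np.array([[0.1, 5.0, 0.0],
              [4.0, 0.2, 0.1],
              [0.0, 0.3, 2.0]])

# Maximum weighted matching on log|a_ij|; a large negative weight forbids zeros.
W = np.full(A.shape, -1e9)
nz = A != 0
W[nz] = np.log(np.abs(A[nz]))
rows, cols = linear_sum_assignment(W, maximize=True)

# Column permutation moving the matched entries onto the diagonal.
P = np.zeros_like(A)
P[cols, rows] = 1.0
print(np.diag(A @ P))  # → [5. 4. 2.]
```

The same objective underlies well-known sequential codes for this task; the talk's contribution is computing such a matching in parallel via auctions.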

Although the problem can be solved optimally with cubic complexity in the sequential case, the runtime is still too high when the size of the matrix grows to millions or even billions. Such large systems typically arise when tackling 3D PDE-constrained nonlinear optimization problems with a finer discretization.

We will present a parallel exact matching algorithm based on the concepts of auctions which scales well up to 32 MPI processes for the sparse matrices in our preconditioner. As evaluation, we compare the parallel auction algorithm with a state-of-the-art sequential matching implementation. The performance is measured by the number of iterations of the nonlinear optimizer and the hybrid solver, and certainly by the runtime of the whole optimization process.

#19: Current challenges for parallel graph partitioning and static mapping
Presenter: Francois Pellegrini ([email protected]), INRIA

Graph partitioning is a ubiquitous technique with applications in many fields of computer science and engineering. It is mostly used to help solve domain-dependent optimization problems modeled in terms of weighted or unweighted graphs, where finding good solutions amounts to computing, possibly recursively in a divide-and-conquer framework, small vertex or edge cuts that evenly balance the weights of the graph parts. Because problem sizes keep increasing, large problem graphs cannot fit in the memory of sequential computers and cost too much to partition, leading to the development of parallel graph partitioning tools such as ParMeTiS or ParJostle. The Scotch project, carried out within the Bacchus team of INRIA Bordeaux – Sud-Ouest, is yet another attempt to address this problem.

The advent of massively parallel NUMA machines represents a new challenge for software designers: partitioning tools must scale up to hundreds of thousands of processing elements. The purpose of this talk is to present the key issues which we are considering, within the Scotch project, for the development of scalable parallel graph partitioning and static mapping algorithms.

While the sequential version of the Scotch library makes very few assumptions on the nature of the graphs to be handled, the only limitation being that vertex and edge weights must be strictly positive integers, the parallel version required some early design decisions which condition the ability of the software to handle some types of graphs. In particular, we assumed that distributed graphs are of reasonably small degree, that is, that graph adjacency matrices have sparse rows and columns.

Also, we wanted our software to run on any number of processes P and produce any number of parts k, irrespective of the values of k and P. Tackling these issues in PT-Scotch required data structures that decouple the target and execution architectures: a process can handle several target domains, and conversely a given target domain can be managed concurrently by several processes.

In order for parallel partitioning algorithms to scale to hundreds of thousands of processing elements, graph data must never be duplicated, across all of the processing elements, by more than a (small) factor of the vertex and edge set sizes |V| and |E|. Algorithms which handle graphs must never use data structures of size k|V|, k², kP, P|V| or P², as these can lead to memory shortage.

Static mapping aims at improving the locality of communications on the target machine by taking into account the topology of the latter when partitioning the problem graph. This additional constraint poses a problem for the computation of initial mappings. While initial k-way partitions can easily be computed in parallel by means of recursive bipartitioning, this is not possible for recursive bi-mapping, because every bipartition at some level must take into account the shapes of neighboring partitions. This information is available in a sequential context, but not in parallel, when k ≈ |V|.

Also, the increasing number of processing elements hinders the convergence of the partition refinement algorithms which are commonly used in k-way multilevel schemes. The more processing elements there are, the more vertices can be moved independently by each of them from overloaded domains to a presumably underloaded neighboring domain, overloading it even more than its neighbors. Computing diffusion matrices exactly prior to data movement may not always be possible, as these structures are of size k². Using iterative diffusion-based algorithms may also lead to imbalance, due to rounding artifacts when deciding which domain owns a given vertex.


Future parallel static mapping software will therefore have to rely on a combination of these methods to preserve partition quality.

#20: Sparse matrix partitioning, ordering, and visualising by Mondriaan 3.0
Presenter: Rob H. Bisseling ([email protected]), Utrecht University
Co-authors: Bas O. Fagginger Auer, Albert-Jan N. Yzelman

This talk presents two combinatorial problems encountered in scientific computations on today's high-performance architectures, such as parallel computers with many processors and several cores on each processor, and with sophisticated memory hierarchies and multiple levels of cache.

For parallelism, the most important problem is to partition sparse matrices, graphs, or hypergraphs into nearly equal-sized parts while trying to reduce inter-processor communication. For better cache performance, the problem is to reorder sparse matrices by suitable row and column permutations. Common approaches to such problems involve multilevel methods based on coarsening and uncoarsening (hyper)graphs, matching of similar vertices, searching for good separator sets and good splittings, and two-dimensional matrix splitting methods such as those incorporated in the software package Mondriaan.

We will discuss new algorithms and features included in version 3.0 of the Mondriaan package, to be released soon. Using this package and its subprograms MondriaanPlot and MondriaanMovie, we can visualise the partitioning process of a sparse matrix by various algorithms; we can also do this in Matlab. Mondriaan has now been made into a library that can be called from other programs, such as Matlab or Mathematica, or used as a standalone program. New reordering methods have been included, such as Separated Block Diagonal (SBD), along with well-known methods such as Bordered Block Diagonal. Doubly separated and doubly bordered versions are also included.

2.1B Lecture Room 117 PARALLEL MONTE CARLO Chair: Aneta Karaivanova

#21: Parallel stochastic estimation method for matrix eigenvalue distribution
Presenter: Yasunori Futamura ([email protected]), University of Tsukuba, Japan
Co-authors: Hiroto Tadano, Tetsuya Sakurai, Jun-Ichi Iwata

Some kinds of eigensolvers for interior eigenproblems require the specification of parameters based on a rough estimate of the desired eigenvalues. In this talk, we propose a stochastic estimation method for the eigenvalue distribution that is based on a stochastic estimator of the trace of a symmetric matrix. When the contour integral of the trace of the shift-inverted matrix is computed, one obtains the eigenvalue count in the integration domain by the residue theorem. To estimate the trace, linear systems involving a shifted matrix have to be solved by an iterative method in the non-factorizable case. The eigenvalue count needs only a few significant digits, so the stopping criterion of the iterative method can be set loosely, and only a few matrix-vector multiplications are required. Since both the contour integrations for the different domains and the linear solves for the independent sample vectors at each quadrature point can be performed simultaneously, the method can be executed on GPU clusters at small communication cost. We discuss the accuracy of the estimated eigenvalue count by stochastic and quadrature error analysis, and evaluate the performance of the method by applying it to matrices derived from real-space density functional calculations.

#22: Speeding Up Modern Gradient-Based Algorithms for Large Sequential Games
Presenter: Andrew Gilpin ([email protected]), Carnegie Mellon University
Co-authors: Tuomas Sandholm

Game theory is the mathematical study of rational behavior in strategic environments. In many settings, most notably two-person zero-sum games, it provides particularly strong and appealing solution concepts. Furthermore, these solutions are efficiently computable in the complexity-theoretic sense. For these reasons, game theory is increasingly serving as the foundation on which many successful game-playing agents are based. This is perhaps most noticeable in the development of agents for Texas Hold'em poker, where virtually all of the best poker-playing programs are based on game theory and feature automatically computed strategies. Due to the large size of the poker game tree, the game-theoretic analysis can only be performed after a substantial reduction in the size of the game tree using abstraction. This usually results in a model of the game that is missing some of the strategically relevant features. Consequently, strategies that are computed based on these models will not be optimal, and generally the further the abstracted game is from the real game, the worse the quality of the resulting strategies. Thus, enhancing the ability of equilibrium-finding algorithms to solve ever larger abstracted games has a direct impact on the performance of game-theoretic agents in practice. This demand for solving large game trees has recently spurred the development of algorithms for finding equilibria in large sequential two-person zero-sum games of imperfect information. Novel first-order gradient-based methods have recently become important tools in the construction of game-theory-based agents, and can now solve games with 10^12 leaves in the game tree. The computation time per iteration is typically dominated by matrix-vector product operations involving the game's payoff matrix A and the players' strategy vectors. In this paper, we describe two techniques for scaling up this operation.
The first technique involves randomly sampling only a few components of the payoff matrix A when performing the matrix-vector computation. The basic idea of a gradient-based algorithm, as applied to convex optimization problems, is to estimate a good search direction based on information that is local to the current solution. The algorithm then takes a step in a direction related to the gradient and repeats the process from the new solution. Hence, an approximation of the matrix-vector product yields an approximation of the step direction. We randomly sample (in a structured way) non-zero entries of A to construct a sparser matrix Ã. An approximate equilibrium for the original game is then found by computing an approximate equilibrium of the sampled game. We


experimentally evaluate both static and dynamic sampling, in which the level of sampling is dynamically adjusted during the computation, and demonstrate performance improvements over the non-sampling version of the algorithm. The second technique we describe is an algorithm for performing the matrix-vector product on a cache-coherent Non-Uniform Memory Access (ccNUMA) architecture. Fully taking advantage of this hardware requires the algorithm designer to take special care in the design of the algorithm's memory management. We describe an algorithm that does exactly this, and we present experimental results on a ccNUMA machine using 64 cores. Our algorithm distributes different parts of the payoff matrix to the local memory banks of the different cores. We measured the time needed for computing a matrix-vector product for an abstracted instance of Texas Hold'em poker, comparing our algorithm against a standard parallelization approach that does not take into account the unique physical characteristics of the ccNUMA architecture. Our new approach is always faster, and at 64 cores it is more than twice as fast. The two techniques can be applied together or separately, and each leads to an algorithm that significantly outperforms the fastest prior gradient-based method.

#23: Matrix Aided Task Scheduling — A Scheduling Scheme with Mathematical Representation of Task Dependency

Presenter: Yongji Zhou ([email protected]), University of Leeds
Co-authors: Steven Freear, T.X. Mei

Task scheduling plays an important role in parallel computing. In the development of new scheduling algorithms, a program (a set of tasks) is conventionally modeled using graph-based methods, e.g. a Directed Acyclic Graph, with nodes denoting tasks and edges denoting precedence relations between tasks. However, the graph-based representation of task dependency is difficult to store and manipulate in computers. This paper presents a new scheme, MATS (Matrix-Aided Task Scheduling), which provides a mathematical representation of task dependency using matrices.

In the MATS scheme, the precedence relation of a set of n tasks is represented by an n-by-n matrix consisting of binary elements '0' and '1'. A '1' element in the i-th row and the j-th column (i, j <= n) of the matrix denotes that task i precedes task j, whereas a '0' element denotes that there is no precedence relation between the two tasks. There are a few key features: 1) the resulting matrix representation of a particular set of tasks is unique; 2) for static scheduling, the matrix is predefined with the determined task dependency before scheduling; 3) for dynamic scheduling, the matrix is updated according to the changes of task dependency during scheduling; 4) the size of the matrix can be increased dynamically to include additional tasks and the implied task dependency; 5) only legal schedules are produced, by searching in the matrix for the tasks without predecessors.
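Feature 5) amounts to repeatedly selecting the zero columns of the precedence matrix. A small numpy sketch of this idea (our own illustration, with a hypothetical four-task example):

```python
import numpy as np

# Precedence matrix for 4 tasks: M[i, j] == 1 means task i precedes task j.
M = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

schedule = []
remaining = list(range(4))
while remaining:
    # Ready tasks: columns with no '1' from any still-unscheduled task.
    ready = [j for j in remaining if all(M[i, j] == 0 for i in remaining)]
    schedule.extend(ready)
    remaining = [t for t in remaining if t not in ready]
print(schedule)  # → [0, 1, 2, 3]
```

Because only tasks without unscheduled predecessors are ever emitted, every schedule produced this way respects the precedence relation, which is the legality guarantee claimed above.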

The proposed matrix-based representation of task dependency is easy to store and manipulate in computers with current matrix computation tools (e.g. MATLAB). The proposed MATS scheme does not stand alone and requires collaboration with other scheduling algorithms. In this paper, the MATS scheme is developed and implemented to aid a particular GA (genetic algorithm) scheduler in solving dynamic scheduling problems, as a case study. The MATS scheme can be embedded in a wide range of scheduling algorithms and used in solving different scheduling problems.

#24: GPU-based quasi-Monte Carlo algorithms for matrix computations
Presenter: Aneta Karaivanova ([email protected]), IPP-BAS
Co-authors: E. Atanassov, S. Ivanovska, M. Durchova

Quasi-Monte Carlo methods (QMCMs) are powerful tools for accelerating the convergence of the ubiquitous Monte Carlo methods (MCMs). For problems in linear algebra, QMCMs give not only better but also more stable convergence as the length of the walks increases, while MCMs and QMCMs have the same computational complexity. The disadvantage of quasi-Monte Carlo is the difficulty of obtaining practical error estimates. This disadvantage can be overcome by scrambling the low-discrepancy sequences; scrambling also provides a natural way to parallelize the streams. In this paper we study matrix-vector computations using scrambled quasirandom sequences, which are the basic building block of various MC and QMC algorithms for linear algebra problems. We use GPU-based generation of Sobol and Halton sequences with Owen-type scramblings to obtain fully GPU-based algorithms for the approximate calculation of matrix-vector products. The use of GPU computing allows for a highly efficient implementation of the algorithms using all the available levels of parallelism: all the cores of the graphics cards, and MPI for synchronizing the computations across several computing nodes.
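As background, the plain Monte Carlo matrix-vector estimate that such scrambled sequences accelerate can be sketched as follows (pseudorandom column sampling here; the paper's algorithms replace the generator with GPU-generated scrambled Sobol/Halton points):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Unbiased MC estimate of A @ x: sample column j with probability p_j and
# average A[:, j] * x[j] / p_j over the samples.
p = np.abs(x) / np.abs(x).sum()        # importance sampling on |x_j|
N = 20000
idx = rng.choice(n, size=N, p=p)
est = (A[:, idx] * (x[idx] / p[idx])).mean(axis=1)

err = np.linalg.norm(est - A @ x) / np.linalg.norm(A @ x)
print(err)  # small relative error, shrinking like 1/sqrt(N) for plain MC
```

Replacing the pseudorandom index stream by a scrambled low-discrepancy sequence improves the convergence rate while, via the scrambling, retaining a practical error estimate, which is the point made in the abstract.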

2.1C Lecture Room 116 MULTILEVEL METHODS Chair: Yvan Notay

#25: Large Scale Finite Element Modeling on Unstructured Grids
Presenter: Yavor Vutov ([email protected]), Bulgarian Academy of Sciences

The challenges in large-scale finite element modeling on 3D unstructured grids will be discussed in this talk. Large-scale problems imply the use of parallel hardware and software. We will give an insight into the steps of the computational process: mesh generation, mesh partitioning, mesh refinement, renumbering, discretization, and the solution.

As an illustration we shall solve a scalar second-order boundary value problem. For the mesh partitioning we use the ParMETIS library; for the solution of the linear system of equations, a PCG algorithm with the BoomerAMG preconditioner. BoomerAMG is a parallel implementation of algebraic multigrid, part of the library Hypre.

Detailed timings of all steps of the solution on an IBM BlueGene/P computer will be presented. Special attention will be paid to the mapping between the communication graph of the partitioning and the underlying interconnect.

#26: Improve ILU preconditioners by recursive solves


Presenter: Pascal Henon ([email protected]), INRIA

A popular method to solve large sparse linear systems is to use an iterative method preconditioned by an ILU factorization.An ILU preconditioner needs to solve two triangular systems at each step of the iterative method. In this presentation wewill remind how a recursive solve based on a partition of the unknown can improve the preconditioner and we will proposesome heuristic to efficiently exploit this property. Indeed, if you split the unknowns in two sets B and C then the incompletefactorization of a matrix A can be written as :

A = \begin{pmatrix} A_B & A_{BC} \\ A_{CB} & A_C \end{pmatrix} \approx \begin{pmatrix} L_B & \\ L_{CB} & L_S \end{pmatrix} \begin{pmatrix} U_B & U_{BC} \\ & U_S \end{pmatrix},

where A_B \approx L_B U_B is an incomplete factorization of A_B, L_{CB} \approx A_{CB} U_B^{-1}, U_{BC} \approx L_B^{-1} A_{BC}, and L_S U_S is some incomplete factorization of an approximation to the Schur complement matrix S = A_C - (A_{CB} U_B^{-1})(L_B^{-1} A_{BC}).

Given a right-hand side y = (y_B, y_C)^T, the solution x = (x_B, x_C)^T of the two triangular solves can be written as

\begin{cases} x_C = U_S^{-1} L_S^{-1} (y_C - L_{CB} L_B^{-1} y_B), \\ x_B = U_B^{-1} (L_B^{-1} y_B - U_{BC} x_C). \end{cases} \qquad (1)

As observed in [1] or [2], an interesting remark is that since L_{CB} and U_{BC} are only approximations, L_{CB} \approx A_{CB} U_B^{-1} and U_{BC} \approx L_B^{-1} A_{BC}, they can advantageously be replaced by their explicit products in (1):

\begin{cases} x_C = U_S^{-1} L_S^{-1} (y_C - A_{CB} U_B^{-1} L_B^{-1} y_B), \\ x_B = U_B^{-1} L_B^{-1} (y_B - A_{BC} x_C). \end{cases} \qquad (2)

Furthermore, using equations (2) instead of (1), we do not need to store L_{CB} and U_{BC}. Another aspect is that the computational cost differs between (1) and (2); this depends on the fill-in and on the splitting of the system.
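A minimal dense sketch of the solve (2), with exact LU factors standing in for the incomplete ones (so that the recursion becomes an exact block solve; purely illustrative, not the ILU setting of the talk):

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(0)
n, nb = 9, 5                      # unknowns split: B = first nb, C = the rest
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant
y = rng.standard_normal(n)

AB, ABC = A[:nb, :nb], A[:nb, nb:]
ACB, AC = A[nb:, :nb], A[nb:, nb:]

# Exact LU factors stand in for the incomplete ones; lu_solve performs
# the forward/backward triangular solves of equations (2).
luB = sla.lu_factor(AB)
S = AC - ACB @ np.linalg.solve(AB, ABC)           # Schur complement
luS = sla.lu_factor(S)

yB, yC = y[:nb], y[nb:]
xC = sla.lu_solve(luS, yC - ACB @ sla.lu_solve(luB, yB))   # first line of (2)
xB = sla.lu_solve(luB, yB - ABC @ xC)                      # second line of (2)
x = np.concatenate([xB, xC])
```

Note that only the blocks A_{CB}, A_{BC} and the factors of A_B and S appear, which is the storage saving the abstract points out.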

Thus, we will present some algorithms that try to maximize the benefit of a recursive solve. We will show some results on large sparse systems to illustrate these algorithms.

[1] Y. Saad and B. Suchomel, ARMS: An algebraic recursive multilevel solver for general sparse linear systems. Numer. Linear Algebra Appl., 9 (2002), pp. 359–378.

[2] P. Henon and Y. Saad, A Parallel Multistage ILU Factorization based on a Hierarchical Graph Decomposition. SIAM J. Sci. Comput., 28 (2006), pp. 2266–2293.

#27: MatSol — parallel implementation of scalable algorithms for large multibody contact problems
Presenter: Vit Vondrak ([email protected]), VSB-Technical University of Ostrava
Co-authors: Tomas Kozubek, Alexandros Markopoulos, Tomas Brzobohaty, Zdenek Dostal

In our contribution we first briefly review the TFETI-based domain decomposition methodology adapted to the solution of 2D and 3D multibody contact problems of elasticity [1], and present our, in a sense optimal, algorithms [2] for the solution of the resulting quadratic programming problems. The unique feature of these algorithms is their capability to solve the class of quadratic programming problems with spectrum in a given positive interval in O(1) iterations. The theory yields error bounds that are independent of the conditioning of the constraints, and the results are valid even for linearly dependent equality constraints. The main part of our talk concerns implementation details, such as the stable inversion of singular stiffness matrices and the parallel implementation. The algorithms were implemented in the MatSol library [5], developed in the Matlab environment, and tested on the solution of 2D and 3D contact problems. As a parallel programming environment we use the Matlab Distributed Computing Engine and the Matlab Parallel Computing Toolbox, which support both parallel and distributed programming. As a hardware platform we use a cluster of 9 computational nodes with 36 CPU cores interconnected with a high-speed InfiniBand fabric. Finally, we give results of numerical experiments with the parallel solution of contact problems discretized by more than 10 million nodal variables to demonstrate that the scalability can be observed in computational practice. The power of the results is also demonstrated by the solution of difficult real-world problems, such as the analysis of the roller bearing of a wind turbine. This research has been supported by grant GACR 101/08/0574 and by the Ministry of Education of the Czech Republic, grant No. MSM6198910027.

[1] Z. Dostal, D. Horak, R. Kucera: Total FETI - an easier implementable variant of the FETI method for numerical solution of elliptic PDE. Comm. Num. Meth. Eng., 22, 2006, pp. 1155-1162.

[2] Z. Dostal: Optimal Quadratic Programming Algorithms, with Applications to Variational Inequalities, 1st edition, Springer US, NY 2009, SOIA 23.

[3] Z. Dostal, T. Kozubek, V. Vondrak, T. Brzobohaty, A. Markopoulos: Scalable TFETI algorithm for the solution of multibody contact problems of elasticity, DOI: 10.1002/nme.2807, 2009.

[4] Z. Dostal, T. Kozubek, P. Horyl, T. Brzobohaty, A. Markopoulos: Scalable TFETI algorithm for two-dimensional multibody contact problems with friction, submitted to Journal of Computational and Applied Mathematics, 2009.




[5] T. Kozubek, A. Markopoulos, T. Brzobohaty, R. Kucera, V. Vondrak, Z. Dostal: MatSol - MATLAB efficient solvers for problems in engineering, http://www.am.vsb.cz/matsol, 2009.

#28: A parallel algebraic multigrid method
Presenter: Yvan Notay ([email protected]), Universite Libre de Bruxelles

We consider the iterative solution of large sparse n×n linear systems Au = b arising from the discretization of second-order elliptic PDEs. In this context, multigrid methods are among the most efficient solution techniques. In particular, algebraic multigrid (AMG) methods are set up using only the information present in the system matrix; that is, they neither require information from the underlying discretization nor require that the latter be regular. They can therefore be used in a black-box fashion, in much the same way as a direct solver. However, using the classical AMG paradigm, the parallel implementation of these techniques raises some nontrivial issues.

In this talk, we show how an easily parallelizable algorithm is obtained by exchanging the classical AMG approach for an approach based on the aggregation of the unknowns. The latter can be used for a matrix that is already partitioned, e.g., resulting from a parallelized discretization. Moreover, it can also serve as a partitioning tool in the case of a matrix that has not been partitioned before calling the linear system solver.
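The core of aggregation-based coarsening can be illustrated in a few lines: group unknowns into aggregates, build the piecewise-constant prolongation P, and form the Galerkin coarse operator PᵀAP. This is a toy sketch with a trivial pairwise aggregation rule, not the aggregation algorithm of the talk.

```python
import numpy as np

# 1D Poisson matrix as a model elliptic problem.
n = 12
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Trivial aggregation rule: group consecutive pairs of unknowns.
agg = np.arange(n) // 2                  # aggregate index of each unknown
nc = agg.max() + 1
P = np.zeros((n, nc))
P[np.arange(n), agg] = 1.0               # piecewise-constant prolongation

Ac = P.T @ A @ P                         # Galerkin coarse-level operator
```

Here Ac is again a symmetric Poisson-like matrix of half the size. On a distributed matrix, aggregates can be formed locally on each process, which is what makes this coarsening comparatively easy to parallelize.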

Sequentially, the resulting method is efficient and, for instance, is faster than the best direct solvers even on 2D problems of moderate size. In parallel, good scalability is obtained with respect to both the number of unknowns and the number of processors.

Thursday 01.07.2010 10:50-12:30 Parallel Session 2.1

2.1D Lecture Room 120 COMBINATORIAL SCIENTIFIC COMPUTING Chair: Olaf Schenk

#29: On finding dense submatrices of a sparse matrix
Presenter: Bora Ucar ([email protected]), ENS Lyon
Co-authors: Kamer Kaya and Yves Robert

We consider a family of problems exemplified by the following one: given an m×n matrix A and an integer k ≤ min{m,n}, find a set of row indices R = {r_1, r_2, ..., r_k} and a set of column indices C = {c_1, c_2, ..., c_k} such that the number of nonzeros in the submatrix indexed by R and C, i.e., A(R,C) in Matlab notation, is maximized. This is equivalent to finding a k×k submatrix S of A with entries S_{ij} = A_{r_i,c_j} containing the maximum number of nonzeros among all k×k submatrices of A. We show that this problem is NP-complete, and then propose and analyze heuristic approaches to it. Some of our results and heuristics extend to the following three variations of the problem:

• Given an m×n matrix A and an integer k ≤ min(m,n), find a k×k submatrix S such that ∑_{i,j} |s_{ij}| is maximized.

• Given an m×n matrix A, an n×m matrix B, and an integer k ≤ min(m,n), find two sets of indices R = {r_1, ..., r_k} and C = {c_1, ..., c_k}, with r_i ≤ m and c_j ≤ n, such that ∑_{r∈R, c∈C} |A_{rc}| + ∑_{r∈R, c∈C} |B_{cr}| is maximized.

• The input is the same as in the second item, but the objective is to maximize the number of nonzeros contained in the two submatrices.

The latter two problems arise in the PSPIKE family of hybrid solvers [1].
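Since the problem is NP-complete, practical approaches are heuristic. A hypothetical greedy heuristic (for illustration only; not necessarily one of the authors' heuristics) picks the k densest columns and then the k rows with most nonzeros within those columns:

```python
import numpy as np

def greedy_dense_submatrix(A, k):
    """Greedy heuristic for the dense k x k submatrix problem:
    choose the k columns with most nonzeros, then the k rows with
    most nonzeros restricted to those columns. Illustrative only."""
    nz = A != 0
    cols = np.argsort(nz.sum(axis=0))[::-1][:k]
    rows = np.argsort(nz[:, cols].sum(axis=1))[::-1][:k]
    return np.sort(rows), np.sort(cols)

A = np.zeros((5, 5))
A[1:4, 1:4] = 1.0                 # a dense 3x3 block hidden in a sparse matrix
rows, cols = greedy_dense_submatrix(A, 3)
```

On this toy input the heuristic recovers the planted dense block; in general, greedy choices of this kind give no optimality guarantee, which is why the analysis in the talk matters.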

[1] M. Manguoglu, A. Sameh, and O. Schenk. PSPIKE: Parallel sparse linear system solver. In Proc. Euro-Par 2009 Parallel Processing, pages 797–808, 2009.

#30: Combinatorial models for mesh partitioning
Presenter: Cedric Chevalier ([email protected]), CEA

Numerical simulations of physical phenomena are essential for most applications in scientific computing. As they are getting larger and larger in terms of both computation and memory usage, it is necessary to use parallel, distributed-memory implementations on very large scale systems (up to nearly a million cores). In this context, the distribution of the data is very important, as better locality means less communication and may also significantly improve the running time, network speeds being slow compared to computing capabilities. Usual approaches to this distribution represent the data and their relationships with combinatorial entities such as graphs or hypergraphs, which can be partitioned to determine a good way to distribute the problem. Graph and hypergraph partitioning problems are well studied, and efficient sequential and parallel software exists. However, the choice of the model used to construct the graphs or hypergraphs from the original problem is not obvious, and can have a dramatic impact on the resulting distribution. We will describe several combinatorial models, using graphs and hypergraphs, in the context of mesh-based applications, and we present and discuss some practical results we have obtained. This work was initiated at the Scalable Algorithms Department, Sandia National Laboratories, Albuquerque, NM. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under Contract DE-AC04-94AL85000. This work now also takes place at CEA/DAM in France.



#31: A parallel preprocessing for the optimal assignment problem based on diagonal scaling
Presenter: Meisam Sharify ([email protected]), INRIA, Ecole Polytechnique
Co-authors: Stephane Gaubert, Laura Grigori

One of the most classical problems in computer science is the optimal assignment problem. An important application of this problem in the solution of very large linear systems of equations is to permute entries that are large in magnitude onto the diagonal of a matrix. This problem can be formulated as finding a permutation σ of n elements maximizing the product

∏_{1≤i≤n} a_{iσ(i)},

where a_{ij} denotes the absolute value of the (i,j) entry in the matrix of the n×n linear system. This is equivalent to solving the classical optimal assignment problem with weights log a_{ij}. The latter can be solved by efficient algorithms such as the Hungarian method, the auction algorithm, or various network flow algorithms. However, these classical methods, which are inherently sequential, are not adapted to situations in which the system is very large.
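The reduction to a classical assignment problem with weights log a_{ij} can be checked with an off-the-shelf sequential solver (assuming all entries are positive; zero entries would correspond to weight −∞):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

A = np.array([[1.0, 10.0, 1.0],
              [8.0,  1.0, 1.0],
              [1.0,  1.0, 5.0]])

# Maximizing prod_i a_{i,sigma(i)} == maximizing sum_i log a_{i,sigma(i)}.
rows, sigma = linear_sum_assignment(np.log(A), maximize=True)
product = np.prod(A[rows, sigma])
```

Here the optimal permutation picks the entries 10, 8 and 5, with product 400; such sequential solvers are exactly what becomes impractical at the very large scales targeted by the talk.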

We propose two iterative algorithms for the optimal assignment problem which can be performed efficiently in parallel. The idea of these algorithms is to think of the optimal assignment problem as the limit of a deformation of an entropy maximization problem. More precisely, consider the following entropy maximization problem, which consists in finding an n×n bistochastic matrix X = (x_{ij}) maximizing the relative entropy

J_p(X) := − ∑_{1≤i,j≤n} x_{ij} ( log(x_{ij} / a_{ij}^p) − 1 ).   (3)

We show that when p goes to infinity, the solution of the entropy maximization problem converges to a point in the convex hull of the matrices representing optimal permutations. Writing X(p) for the solution of (3) for a given value of p, and X(∞) for the solution as p → ∞, we prove that

|x_{ij}(p) − x_{ij}(∞)| = O(exp(−cp))

for some c > 0. This shows exponential convergence to the optimal solution as p increases. The connection between these two problems leads us to develop two algorithms for solving the optimal assignment problem using Sinkhorn iteration.

The first algorithm is a preprocessing step for the optimal assignment problem, based on an iterative process that eliminates the entries not belonging to an optimal assignment. This reduces the initial problem to a much smaller one in terms of memory requirements. The idea of this algorithm is to take p large enough, then apply Sinkhorn iteration to A(p) until convergence to a bistochastic matrix X, and finally delete the small entries of X. We shall show that it is possible to implement this iteration in a numerically stable way by using logarithmic coordinates. We show experimentally that this method can be used efficiently in numerical examples.
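The preprocessing idea can be sketched in plain (non-logarithmic) coordinates, which is only usable for moderate p; the numerically stable log-coordinate variant is the one proposed in the talk.

```python
import numpy as np

def sinkhorn(M, iters=500):
    """Alternate row/column normalization until M is (nearly) bistochastic."""
    M = M.copy()
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)   # scale rows to sum 1
        M /= M.sum(axis=0, keepdims=True)   # scale columns to sum 1
    return M

A = np.array([[1.0, 10.0, 1.0],
              [8.0,  1.0, 1.0],
              [1.0,  1.0, 5.0]])
p = 20.0
X = sinkhorn(A ** p)
# Entries of X not belonging to an optimal assignment are (nearly) zero;
# deleting them shrinks the problem before an exact assignment solver runs.
support = X > 1e-6
```

For this small example the surviving support is exactly the optimal permutation (entries (0,1), (1,0), (2,2)); both normalization sweeps are embarrassingly parallel, which is the point of the approach.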

The second algorithm is based on a modification of the Sinkhorn scaling algorithm. We prove that this iteration, which we refer to as the deformed-Sinkhorn iteration, converges to a matrix in which the entries belonging to the optimal permutations are nonzero, while all other entries are zero. The values of the nonzero entries are 1/m, where m is the number of elements in each row or column belonging to the optimal permutation. A theoretical estimate of the rate of convergence is also presented.

#32: Parallel Algorithms for Graph Similarity
Presenter: Giorgos Kollias ([email protected]), Purdue University
Co-authors: Ananth Grama

With the widespread availability of graph-structured data from sources ranging from social networks to biochemical processes, there is increasing impetus for efficient and scalable graph analysis techniques. An important problem in the analysis of graph databases is the computation of node-wise similarity across graphs (or within the same graph). These node similarity scores can be used to quantify aggregate similarity, or as seeds for identifying similar (conserved) subgraphs.

The similarity of two nodes can be formally quantified recursively by the similarity of their neighbors [1,4]. This definition is closely related to the recursive definition of node ranking in popular approaches like PageRank [3] or HITS [2]. Node similarity extends this notion to two graphs, and can be viewed as a ranking calculation over their product graph. The output of a node similarity computation is typically a similarity matrix X, with element x_{ij} representing the relative similarity score between nodes i and j of the respective graphs. If one is interested in further extracting the best matching pairs, a maximum weight bipartite matching algorithm can be used to post-process X. The computation of the similarity matrix X can become expensive (hours using the state-of-the-art IsoRank software [4]) for large graphs (thousands of nodes, tens of thousands of edges), since it potentially involves products with dense matrices. In this paper, we present a new algorithm for uncoupling and decomposing the computation of the similarity matrix. Uncoupling enables us to process each graph independently. Decomposing the similarity matrix enables us to independently compute, and subsequently combine, contributions from different identifiable link patterns occurring in the two graphs. Our method works for both directed and undirected graphs, and it treats quite naturally cases where pairings of specific node pairs should be favored. Our method is also highly parallelizable and scalable, since it builds a distributed similarity matrix with minimal communication overhead. Numerical experiments with various cluster configurations and graph instances (protein-protein interaction (PPI) networks) demonstrate that (i) our algorithm is orders of magnitude faster than the baseline IsoRank implementation, and (ii) parallel formulations of our methods yield high efficiency and excellent scalability.
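The basic similarity iteration such methods build on is a power iteration on the product graph, in the spirit of [1,4]; the sketch below is this plain iteration, not the authors' uncoupled and decomposed algorithm:

```python
import numpy as np

def similarity(A, B, iters=200):
    """Node-similarity scores between graphs with adjacencies A and B:
    the update X <- A X B^T (renormalized) is a power iteration on the
    Kronecker product of the two graphs."""
    X = np.ones((A.shape[0], B.shape[0]))
    for _ in range(iters):
        X = A @ X @ B.T
        X /= np.linalg.norm(X)
    return X

# A triangle with a pendant vertex, compared against itself: the hub
# (node 2, adjacent to all other nodes) gets the highest similarity score.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = similarity(A, A)
```

Each step costs two matrix products, which is where the dense-product expense mentioned above comes from and what the decomposition in the talk avoids.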

[1] Glen Jeh and Jennifer Widom. SimRank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538-543, Edmonton, Alberta, Canada, 2002. ACM.

[2] J.M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. J. ACM, 46:604-632, 1999.

[3] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford University, 1998.

[4] R. Singh, J. Xu, and B. Berger. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences, 105(35):12763, 2008.

2.1E Lecture Room 116 SPARSE MATRICES AND APPLICATIONS Chair: Pascal Henon

#33: Parallel Algorithms for Solution of a Chemotaxis System in Haematology
Presenter: Gergana Bencheva ([email protected]), Bulgarian Academy of Sciences

The therapy of various pathological blood diseases consists mainly of two steps: chemotherapy and whole-body irradiation to eradicate the patient's haematopoietic system, followed by transplantation of haematopoietic stem cells (HSCs) obtained from the mobilized peripheral blood of a donor. After transplantation, HSCs have to find their way to the stem cell niche in the bone marrow and afterwards multiply rapidly to regenerate the blood system. Adequate computer models for the processes after transplantation would help medical doctors to shorten the period during which the patient lacks an effective immune system.

The chemotactic movement of HSCs is modelled by a nonlinear system of chemotaxis equations coupled with an ordinary differential equation on the boundary of the domain, in the presence of nonlinear boundary conditions. The unknowns of the system are the concentrations of HSCs, of the chemoattractant SDF-1 produced by stroma cells in the bone marrow stem cell niches, and of the stem cells bound to the stroma cells. Various classical numerical methods applied directly to a general chemotaxis system, and in particular to the HSC migration model, may lead to numerical instabilities and loss of the positivity of the solution. In our investigations the PDE system is discretized in space using a finite-volume method based on a second-order positivity-preserving central-upwind scheme, proposed by A. Chertock and A. Kurganov in 2008. A strong-stability-preserving method for ODEs (e.g. a Runge-Kutta or multistep one) is applied for the time integration of the resulting semi-discrete problem.
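A strong-stability-preserving time step of the kind mentioned above can be illustrated with the classical two-stage SSP Runge-Kutta (Shu-Osher) scheme, written as convex combinations of forward-Euler steps; the toy right-hand side below is an assumption for illustration, not the chemotaxis system of the talk.

```python
import numpy as np

def ssp_rk2_step(L, u, dt):
    """One step of the optimal 2-stage SSP Runge-Kutta (Shu-Osher) method:
    a convex combination of forward-Euler steps, so it preserves
    positivity whenever forward Euler does."""
    u1 = u + dt * L(u)
    return 0.5 * u + 0.5 * (u1 + dt * L(u1))

# Toy semi-discrete problem: linear decay u' = -u, exact solution exp(-t).
L = lambda u: -u
u = np.array([1.0])
dt, nsteps = 0.01, 100
for _ in range(nsteps):
    u = ssp_rk2_step(L, u, dt)
```

The convex-combination structure is what transfers the positivity of the central-upwind spatial operator to the fully discrete scheme.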

This study is focused on the parallel solution of the resulting nonsymmetric algebraic systems. Several approaches are proposed and their properties are theoretically and experimentally analyzed. The partitionings of the data and computations are based on functional or domain decomposition techniques. The related communication structures are investigated using divide-and-conquer to uncover concurrency. The algorithms are implemented in C/C++ using MPI, and the presented numerical tests are performed on an IBM Blue Gene/P.

#34: Parallel Sparse Matrix Library and Preconditioner Construction for Quantum Chemical Calculations of Large Systems
Presenter: Urban Borstnik ([email protected]), Physical Chemistry Institute, University of Zurich
Co-authors: Valery Weber, Jurg Hutter

Quantum chemical calculations can be practical for large systems only if the computational time increases at most linearly with the number of atoms in the system.

Sparse matrices are an essential data structure for these calculations because the number of nonzero elements, and therefore their storage size, is linearly proportional to the number of atoms. These matrices then enable linear algebra operations, such as the multiplication of two sparse matrices, to be performed in linear time. In addition, all operations should run efficiently on massively parallel computers. For this, a balance must be found between memory usage, load balancing, and minimizing the communication cost.
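The linear-scaling argument can be illustrated with off-the-shelf compressed sparse row storage (SciPy here, purely as a stand-in for the authors' library): with O(n) nonzeros per matrix, both storage and the sparse-sparse product involve work proportional to the matching nonzeros, not to n².

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sprandom

n = 1000
# Random sparse matrices with ~5 nonzeros per row: O(n) storage.
A = csr_matrix(sprandom(n, n, density=5.0 / n, random_state=0))
B = csr_matrix(sprandom(n, n, density=5.0 / n, random_state=1))

C = A @ B                  # sparse-sparse product
nnz_per_row = C.nnz / n    # stays O(1) as n grows, so the product is O(n)
```

Doing the same product distributed across many nodes, while keeping the per-node fill bounded, is the balance between memory, load and communication that the library addresses.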

We present a library for sparse matrix storage, manipulation, and the operations needed for quantum chemical calculations on distributed-memory parallel computers. The library also supports contemporary hybrid architectures featuring shared-memory nodes. The library has been developed to support linear scaling in the CP2K quantum chemistry program.

We present an efficient way to construct sparse preconditioners using the sparse matrix library. Such sparse preconditioners are important in electronic structure minimization algorithms. We also present several quantum chemical calculations that are made possible using the preconditioners, the sparse matrix library, and its applications.

#35: A parallel space-time finite difference solver for the steady-state shallow-water equation
Presenter: Nao Kuroiwa ([email protected]), Department of Mathematics, Keio University
Co-authors: Peter Arbenz, Dominik Obrist

We discuss the parallel implementation of a finite difference (FD) solver for the two-dimensional shallow-water equation in non-conservative form. We are interested only in the time-periodic steady state imposed by a time-periodic force field. Therefore, we discretize the PDE in space and time to get a three-dimensional simulation domain. The FD discretization leads to very large sparse linear systems of equations. A particularity of this approach is that it admits parallelization not only in space but also in the time direction. We use the Trilinos framework for the parallelization of the iterative solver. We discuss the performance of our solver on a cluster of multi-core processors.

#36: High-Performance Preconditioners for the Solution of Pressure Systems in the LES of Turbulent Channel Flows

Presenter: Pasqua D’Ambra ([email protected]), Institute for High-Performance Computing and Networking (ICAR)Co-authors: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniela di Serafino, Salvatore Filippone

The application of projection methods to the Navier-Stokes (N-S) equations for incompressible flows leads to the solution of elliptic equations, known as pressure equations. We consider an approximate projection method applied to filtered N-S equations for the Large Eddy Simulation (LES) of wall-bounded turbulent flows. In this context, a non-uniform discretization of the pressure equation, coupled with periodic and Neumann boundary conditions, produces sparse linear systems which are singular but compatible, with symmetric nonzero pattern but nonsymmetric values. A large part of the computational cost of the simulations, when high Reynolds numbers are considered, is spent in the solution of these systems, thus requiring high-performance and scalable solvers. We focus on the application of the parallel preconditioners included in MLD2P4 (MultiLevel Domain Decomposition Parallel Preconditioners Package based on PSBLAS) to obtain a reliable and efficient solution of the above systems. MLD2P4 is a modular and portable software library for building and applying parallel algebraic multilevel preconditioners with the Krylov solvers provided by PSBLAS (Parallel Sparse Basic Linear Algebra Subprograms). It uses Schwarz domain decomposition methods as smoothers, and smoothed aggregation as the coarsening technique. A generalization of the classical smoothed aggregation method, proposed in [M. Sala and R.S. Tuminaro, SISC, 31 (1), 2008], has recently been included in the package for efficiently dealing with nonsymmetric matrices. The generalization is based on a Petrov-Galerkin approach for building the coarse matrices, in which a new smoothed restriction operator and easily computable local damping parameters for local energy minimization in the transfer operators are obtained by using the Frobenius norm, without relying on the symmetry of the matrices.

We discuss numerical results obtained using various MLD2P4 preconditioners, coupled with GMRES, in the LES of bi-periodic channel flows, for different Reynolds numbers and grid sizes. In this context, we observed that while the classical smoothed aggregation shows a significant influence of the aggregation threshold on the convergence behaviour of the preconditioned GMRES, the generalized smoothed aggregation exhibits threshold-independent convergence and leads to improvements in parallel performance and scalability.

2.1F Lecture Room 117 RECENT ADVANCES IN EIGENVALUES AND LEAST-SQUARES COMPUTATION

Chair: Jose E. Roman

#37: Computational issues in least squares conditioning
Presenter: Marc Baboulin ([email protected]), Universidade de Coimbra, Portugal

We are interested in error analysis for the full-rank overdetermined linear least squares problem, which can be formulated as b = Ax + ε, A ∈ R^{m×n}, b ∈ R^m, rank(A) = n, m > n. To achieve this, we use condition numbers, which measure the effect on the solution of small changes in the data. These perturbations can be measured either "globally", using classical norms and resulting in so-called normwise condition numbers, or using metrics that take into account the structure of the matrix, such as sparsity or scaling, resulting in componentwise condition numbers. For each type of perturbation, we provide computable formulas for the condition number and we explain how they can be computed using the standard parallel libraries. We also address the case where, contrary to the linear statistical model, random errors affect not only the observation vector b but also A, which is more realistic for some physical applications (a model referred to as the Errors-In-Variables model). The corresponding linear algebra problem is called the Total Least Squares (TLS) problem. We propose a formula and an algorithm for computing the condition number of the TLS problem using normwise error analysis, as well as a practical way to obtain it using some elements of the singular value decomposition of A and [A,b].

#38: Mixed precision computation of eigenvalues
Presenter: Rui Ralha (r [email protected]), Dep. Matematica, Universidade do Minho

There are processors now coming to market which carry out floating-point operations much faster in single precision than in double precision. This new technological paradigm is likely to have a significant impact on the design of fast algorithms, namely in the area of numerical linear algebra. Even when one wishes to produce highly accurate results, the opportunity to exploit the fast single precision mode is not to be discarded. Iterative algorithms adapt well to this paradigm of mixed-precision arithmetic: single precision may be used to get close enough to the solutions, and double precision in the last iterations, when convergence is usually faster. There is already a significant amount of work following this line of research for the solution of linear systems. We take this approach in the context of a bisection-like algorithm for computing the eigenvalues of symmetric tridiagonal matrices. Suppose that two arithmetic units are available, one working in single precision and one in double precision. It is natural to consider using these facilities in succession: first get an interval one unit wide (or slightly bigger) in single precision, and then give each narrow interval to the double precision unit. If the binary word lengths were 24 and 53, and if single precision were infinitely fast, then the cost would be (roughly) 53 − 24 double precision steps: a significant improvement in efficiency.
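The two-phase idea can be sketched with a Sturm-count bisection on a symmetric tridiagonal matrix; the small guard interval `pad` below is an assumption of this sketch, anticipating exactly the single-precision reliability issue the abstract goes on to discuss.

```python
import numpy as np

def count_below(a, b, x, dtype):
    """Sturm count: number of eigenvalues of the symmetric tridiagonal
    matrix T(a, b) smaller than x, read off from the signs of the pivots
    of the LDL^T factorization of T - x I."""
    a = a.astype(dtype); b = b.astype(dtype); x = dtype(x)
    count = 0
    d = dtype(1.0)
    tiny = np.finfo(dtype).tiny
    for i in range(len(a)):
        off = b[i - 1] ** 2 if i > 0 else dtype(0.0)
        d = (a[i] - x) - off / d
        if d == 0.0:
            d = tiny              # standard safeguard against zero pivots
        if d < 0.0:
            count += 1
    return count

def bisect(a, b, k, lo, hi, dtype, tol):
    """Bisection for the k-th smallest eigenvalue inside [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(a, b, mid, dtype) >= k + 1:
            hi = mid
        else:
            lo = mid
    return lo, hi

# 1D Laplacian: eigenvalues are 2 - 2 cos(j*pi/(n+1)), known exactly.
n = 20
a = 2.0 * np.ones(n)
b = -np.ones(n - 1)

lo, hi = bisect(a, b, 0, 0.0, 4.0, np.float32, 1e-3)   # cheap first phase
pad = 1e-5   # guard: single-precision counts may misplace the boundary
lo, hi = bisect(a, b, 0, lo - pad, hi + pad, np.float64, 1e-12)
lam = 0.5 * (lo + hi)
```

Without the guard, the double precision phase may inherit an interval that no longer contains the eigenvalue, which is precisely the subtlety analysed in the talk.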



Thursday 01.07.2010 2:00-3:40 Parallel Matrix Algorithms and Applications (PMAA’10) Parallel Session 2.2

This simple picture is flawed for a subtle, often overlooked, reason: some eigenvalues may not be defined to full accuracy by the data (i.e., the entries of the tridiagonal). Single precision rounding errors can perturb some eigenvalues by more than a single precision variation at each bisection step. On top of that, it is not at all rare for a single precision interval, passed on to double precision arithmetic, to be empty. There is an easy remedy that is justifiable but not fully satisfactory. For symmetric matrices all eigenvalues are perfectly conditioned with respect to the norm of the matrix. Thus, a stopping criterion of the form

interval length ≤ O(ε) · ‖T‖,   (4)

where ε denotes the rounding error unit (adjusted for single and double precision), would be safe. The blemish here is that ‖T‖/|λ| may be huge, and λ may sometimes be defined to high relative accuracy; in that case more bisection steps could have been performed in single precision. However, one should not carry out single precision bisection steps (at least, not too many) on intervals that are not guaranteed to contain a desired eigenvalue. The main purpose of our work is to use perturbation theory results to approach, as closely as possible, the optimal point in the sequence of iterations for switching from single to double precision arithmetic.

#39: Efficient Gram-Schmidt orthogonalization with CUDA for iterative eigensolvers
Presenter: Andres Tomas ([email protected]), Universidad Politecnica de Valencia
Co-authors: Vicente Hernandez

The Gram-Schmidt orthogonalization procedure is widely used by iterative eigensolvers, either for building a Krylov subspace or for deflating against already converged eigenvectors. The classical Gram-Schmidt variant with refinement is usually preferred because it can be easily implemented with matrix-vector products (BLAS level 2). The throughput of these operations is mainly limited by memory access speed on current computer architectures. Therefore, this procedure can become the most time-consuming step in the eigensolver, even when computing a small fraction of all the solutions of a large eigenproblem. Graphics processors are very interesting for this task because they offer higher memory bandwidth than CPUs. However, currently available implementations of the matrix-vector product (such as the CUBLAS library) show poor performance with the rectangular matrices used in the Gram-Schmidt procedure. In this work, we propose a GPGPU parallelization scheme for the matrix-vector product well suited for rectangular matrices with many more rows than columns. This scheme is implemented in CUDA for the thick-restart Lanczos method available in SLEPc (Scalable Library for Eigenvalue Problem Computations). A performance comparison between different matrix-vector product implementations in this context will be presented and discussed.

#40: Parallel linearization-based solvers for the quadratic eigenproblem
Presenter: . . . . . . . . . . . . . . . . . . Jose E. Roman ([email protected]), Universidad Politecnica de Valencia
Co-authors: . . . . . . . . . . . . . . . . . . Andres Tomas

For computing a partial solution of the quadratic eigenvalue problem, (λ²M + λC + K)x = 0, where M, C, K are large sparse matrices of order n, it is possible to choose between two approaches: (i) apply a linearization scheme that turns the problem into a generalized eigenvalue problem (A − λB)x = 0 of order 2n, or (ii) apply a subspace projection technique that operates directly on the n-dimensional space. In SLEPc, the Scalable Library for Eigenvalue Problem Computations, we plan to include functionality for both approaches. In this work, we focus on the linearization strategy, for which we can leverage the industrial-strength eigensolvers already available in SLEPc. We consider different linearization schemes and their suitability for different problem types. We also discuss some implementation details that can make the resulting solver more robust or efficient. Finally, we discuss some parallelization issues and analyze how the performance of the developed codes scales to large numbers of processors.
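For illustration, approach (i) can be sketched in a few lines. The snippet below is our own toy example, not SLEPc code: it builds the first companion linearization A = [[0, I], [−K, −C]], B = [[I, 0], [0, M]] for a small dense problem and checks that each eigenpair of (A, B) solves the quadratic problem.

```python
import numpy as np
from scipy.linalg import eig

# Toy data; M, C, K are small dense stand-ins for the large sparse matrices.
rng = np.random.default_rng(0)
n = 5
M = np.diag(rng.uniform(1.0, 2.0, n))
C = rng.standard_normal((n, n))
K = rng.standard_normal((n, n))

# First companion linearization of (lambda^2 M + lambda C + K) x = 0:
#   A = [[0, I], [-K, -C]],  B = [[I, 0], [0, M]],  unknown v = [x; lambda*x]
In, Zn = np.eye(n), np.zeros((n, n))
A = np.block([[Zn, In], [-K, -C]])
B = np.block([[In, Zn], [Zn, M]])

lam, V = eig(A, B)          # 2n eigenpairs of the generalized problem

# The leading n-block of each eigenvector solves the original quadratic problem.
for j in range(2 * n):
    x = V[:n, j]
    r = (lam[j] ** 2 * M + lam[j] * C + K) @ x
    assert np.linalg.norm(r) < 1e-8 * (1 + abs(lam[j]) ** 2)
```

SLEPc works instead with large sparse M, C, K and computes only a partial spectrum; the dense `eig` call above only illustrates the algebraic equivalence of the linearized and the original problem.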

Thursday 01.07.2010 2:00-3:40 Parallel Session 2.2

2.2A Lecture Room 120 VOXEL-BASED COMPUTATIONS Chair: Peter Arbenz

#41: Virtual Diagnosis - Voxel-based Simulations of Human Bones
Presenter: . . . . . . . . . . . . . . . . . . Dieter H. Pahr ([email protected]), Vienna University of Technology
Co-authors: . . . . . . . . . . . . . . . . . . Philippe K. Zysset

Bone is a living tissue and changes continuously over a lifetime. Osteoporosis is a disease of unbalanced bone resorption and formation, which affects every third woman and every seventh man. This bone disease leads to an increased risk of fracture. The goal of virtual diagnosis is to predict such changes from medical images in order to start a proper treatment at the right time. So-called micro-finite element models are currently the gold standard in this field. In such models CT voxels are directly converted into hexahedral finite elements. Hundreds of millions of elements lead to an accurate but, on the other hand, very expensive model. Usually, supercomputers and PCG-AMG solvers are needed to get results in a reasonable CPU time. Currently only a few finite element packages are available for this task. In this study the non-linear solver "Olympus" by Mark Adams (2002) and the linear matrix-free solver "ParFE" by Peter Arbenz et al. (2008) will be used. The presentation will focus on the challenges that appear when such codes are applied in the field of virtual diagnosis. Pros and cons of available linear and non-linear FEM frameworks will be discussed from the model generation, solving, and result visualization points of view. Finally, some application examples of human bones will be presented.

#42: A Memory Efficient Multigrid Preconditioner for Micro-Finite Element Analysis based on Voxel Images

Computer Science Department, University of Basel, Switzerland 17


Presenter: . . . . . . . . . . . . . . . . . . Cyril Flaig ([email protected]), ETH Zurich
Co-authors: . . . . . . . . . . . . . . . . . . Peter Arbenz

According to the World Health Organization (WHO), the lifetime risk of a fracture caused by osteoporosis is close to 40% for women. To improve bone fracture prediction, a precise estimation of the bone's elastic properties is required. Based on high-resolution micro-computed tomography (micro-CT), a prediction can be provided by micro-structural finite element (micro-FE) analysis, a technique that has recently made possible the in-vivo assessment of the trabecular bone micro-structure.

To represent the structure of trabecular bone in a valid way, very high resolution CT scans are required. This results in models with a huge number of voxels (3D pixels), entailing very demanding computations with enormous numbers of degrees of freedom. We developed a fully parallel simulation tool based on the conjugate gradient method with an aggregation-based algebraic multigrid preconditioner. In this field it is the most memory-efficient code available, although the geometric properties of the CT image are exploited only on the finest level. The largest realistic bone model solved so far had a size of about 1.5 billion degrees of freedom. The simulation was done on an IBM BlueGene BG/L supercomputer.

For clinical usage the application must be much more memory efficient. In this talk we present a new approach based on geometric multigrid. This new approach enables us to better exploit the typical geometry underlying CT scans. The geometric multigrid makes it possible to use voxel-based computations on all levels. An element-by-element matrix multiplication avoids the assembly of any matrix. The usage of a parallel polynomial smoother enables full parallelization of the code.

On each grid level the information is held in a few vectors only. The memory efficiency increases by more than a factor of ten compared to the previous AMG code. However, this factor is very sensitive to the sparsity of the trabecular bone.
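The element-by-element, matrix-free operator application described above can be illustrated with a deliberately tiny 1D analogue (our sketch, not the actual voxel code): the global stiffness matrix is never assembled, and y = A·x is accumulated from identical local element matrices.

```python
import numpy as np

# Tiny 1D analogue of the element-by-element idea (ours, not the voxel code):
# the global stiffness matrix is never assembled; y = A @ x is accumulated
# from identical local element matrices Ke, scattered element by element.
Ke = np.array([[1.0, -1.0], [-1.0, 1.0]])       # local stiffness of one element
n_elem = 8
active = np.ones(n_elem, dtype=bool)            # "voxel image": elements with bone

def apply_operator(x):
    y = np.zeros_like(x)
    for e in range(n_elem):
        if active[e]:
            dofs = [e, e + 1]                   # element-to-DOF map
            y[dofs] += Ke @ x[dofs]             # scatter local product
    return y

# Cross-check against the assembled matrix (feasible only at this toy size).
A = np.zeros((n_elem + 1, n_elem + 1))
for e in range(n_elem):
    A[e:e + 2, e:e + 2] += Ke
x = np.arange(n_elem + 1, dtype=float)
assert np.allclose(apply_operator(x), A @ x)
```

Because every active voxel contributes the same local matrix, only the boolean image and a few vectors per level need to be stored, which is the source of the memory savings claimed above.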

The simple data structures of our approach are suited not only for general purpose computer architectures but will also allow exploiting novel architectures like the Cell BE processor or GPGPUs.

#43: Efficient Parallel Solution Algorithms for µFEM Elasticity Systems
Presenter: . . . . . . . . . . . . . . . . . . Nikola Kosturski ([email protected]), Bulgarian Academy of Sciences
Co-authors: . . . . . . . . . . . . . . . . . . Yavor Vutov

We analyze the performance of a parallel preconditioned conjugate gradient (PCG) solver for linear systems arising from applying the micro finite element method (µFEM) to elasticity problems. The implementation is tested on the IBM Blue Gene/P massively parallel computer. We consider the problem of homogenization of trabecular bone micro-structures. The studied trabecular bone tissue has a strongly heterogeneous micro-structure, composed of solid and fluid phases. In this study, the fluid phase, located in the pores of the solid skeleton, is considered as an almost incompressible linear elastic material. The considered numerical homogenization scheme is based on isotropic linear elasticity models at the micro and macro levels. The voxel representation of the reference volume element (RVE) is obtained from a high resolution computer tomography (CT) image. As a preconditioner for the PCG, we use BoomerAMG, a parallel algebraic multigrid implementation from the package Hypre, developed at LLNL, Livermore. The IBM Blue Gene/P computer is equipped with a three-dimensional torus interconnect. Our parallelization approach is to divide the computational domain in the three spatial directions in order to match the interconnect topology. This extends a previous algorithm in which the domain was divided in one direction only, which led to limited parallelism.

#44: Memory-Efficient Sierpinski-order Traversals on Dynamically Adaptive, Recursively Structured Triangular Grids
Presenter: . . . . . . . . . . . . . . . . . . Michael Bader ([email protected]), Technische Universitat Munchen
Co-authors: . . . . . . . . . . . . . . . . . . Csaba Vigh, Kaveh Rahnema

Grid generation packages that provide parallel adaptive mesh refinement and iterative traversals of unknowns on such adaptive grids are fundamental building blocks for PDE solvers. We discuss a respective integrated approach for grid refinement and processing of unknowns that is based on recursively structured triangular grids and space-filling element orders. In earlier work, the approach was demonstrated to be highly memory- and cache-efficient. In this work, we quantify the cache efficiency of the traversal algorithms using the I/O model. Further, we discuss how the nested recursive traversal algorithms can be efficiently implemented. For that purpose, we compare the memory throughput of the respective implementations with simple stream benchmarks.
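The Sierpinski-order idea can be illustrated with a minimal recursion (our sketch, not the authors' package): each right triangle is bisected at the midpoint of its longest edge, and the two children are visited in a fixed order so that consecutive leaf triangles stay adjacent, which is what makes the element order stream- and cache-friendly.

```python
# Minimal recursion (ours, not the authors' package) that lists the leaf
# triangles of a uniformly refined right triangle in Sierpinski order: every
# triangle is bisected at the midpoint m of its longest edge (v0, v2), and
# the two children are visited so that consecutive leaves stay adjacent.
def sierpinski(v0, v1, v2, depth, out):
    if depth == 0:
        out.append((v0, v1, v2))
        return
    m = ((v0[0] + v2[0]) / 2.0, (v0[1] + v2[1]) / 2.0)
    sierpinski(v0, m, v1, depth - 1, out)       # child containing v0 first
    sierpinski(v1, m, v2, depth - 1, out)       # then the child containing v2

leaves = []
sierpinski((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), 4, leaves)
assert len(leaves) == 2 ** 4                    # bisection doubles the count per level
# Consecutive leaves in the traversal always share a vertex, so unknowns can
# be processed in a stream-like, cache-friendly fashion.
assert all(set(a) & set(b) for a, b in zip(leaves, leaves[1:]))
```

In the actual framework the same recursion also drives refinement and the storage order of unknowns; the sketch only shows where the element order comes from.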

Our motivation for voxel-based analysis comes from the investigation of geocomposites, i.e. materials with a soil or rock matrix filled by a grouting medium. This investigation is important for the assessment of both mechanical and hydraulic properties, which depend on the grout and the level or quality of fill-in. The investigation uses homogenization and sensitivity analysis, which requires repeated solution of the voxel finite element models.

The voxel finite element models use regular grids and a material distribution obtained by tomography scanning of samples of geocomposites. The arising models are large scale, with possibly high heterogeneity of the material microstructure and high coefficient jumps (especially in the case of hydraulic properties).

Therefore, the voxel finite element analysis needs efficient and parallelizable solvers for the arising linear systems. We seek such solvers in the class of Schwarz-type overlapping domain decomposition methods with an additional coarse problem. Especially, we shall consider Schwarz methods with layered decomposition and a coarse problem created by selective aggregation, which means that aggregated elements are stiffer than the surrounding non-aggregated ones. Assuming accurate solution of the subproblems, this type of Schwarz method with aggregation can be both efficient and robust with respect to the coefficient jumps. The layered decomposition enables the accurate and efficient solution of the local problems by direct solvers. The details of the technique will be discussed and some numerical results will be shown. (*) SNF/SCOPES-supported student.



2.2B Lecture Room 117 HYBRID SOLVER FOR FLUID FLOW Chair: Achim Basermann

#45: Parallel scalability and complexity analysis of sparse hybrid linear solvers
Presenter: . . . . . . . . . . . . . . . . . . L. Giraud ([email protected]), INRIA
Co-authors: . . . . . . . . . . . . . . . . . . E. Agullo, A. Guermouche, A. Haidar, J. Roman

In many large-scale numerical simulations the innermost and most time-consuming kernel is the solution of a sparse linear system. Sparse direct solvers have for years been the methods of choice because of their reliable numerical behaviour. However, it is nowadays admitted that such approaches are not scalable in terms of computational complexity or memory for large problems such as those arising from 3D modelling. Iterative methods, on the other hand, generate sequences of approximations to the solution. These methods have the advantage that the memory requirements are small. Also, they tend to be easier to parallelize than direct methods. However, the main problem with this class of methods is the rate of convergence, which depends on the properties of the matrix. One way to improve the convergence rate is through preconditioning, which is another difficult problem. Our approach to the high-performance, scalable solution of large sparse linear systems in parallel scientific computing is to combine direct and iterative methods. Such a hybrid approach exploits the advantages of both direct and iterative methods. The iterative component allows us to use a small amount of memory and provides a natural way for parallelization. The direct part provides its favourable numerical properties.
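The direct/iterative combination can be illustrated with a minimal Schur complement sketch (ours, not the authors' solver): subdomain interiors are eliminated with a sparse direct factorization, and the remaining interface system is solved iteratively through a matrix-free operator.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import splu

# Toy hybrid sketch (ours, not the authors' solver): a 2D Laplacian is split by
# an interface column; interiors are eliminated with a sparse direct
# factorization and the interface Schur complement is solved by plain CG.
n = 8                                                 # n x n grid, 5-point stencil
N = n * n
A = lil_matrix((N, N))
def idx(i, j): return i * n + j
for i in range(n):
    for j in range(n):
        A[idx(i, j), idx(i, j)] = 4.0
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= i + di < n and 0 <= j + dj < n:
                A[idx(i, j), idx(i + di, j + dj)] = -1.0
A = A.tocsr()

sep = [idx(i, n // 2) for i in range(n)]              # interface: middle column
intr = [p for p in range(N) if p not in set(sep)]
Aib = A[intr][:, sep].toarray()
Abi = A[sep][:, intr].toarray()
Abb = A[sep][:, sep].toarray()
lu = splu(A[intr][:, intr].tocsc())                   # direct part: interiors

def S_mv(v):                                          # S v = (Abb - Abi Aii^-1 Aib) v
    return Abb @ v - Abi @ lu.solve(Aib @ v)

b = np.ones(N)
bi, bb = b[intr], b[sep]
rhs = bb - Abi @ lu.solve(bi)                         # reduced right-hand side

# Iterative part: CG on the SPD Schur complement, written out for brevity.
xb = np.zeros(len(sep)); r = rhs.copy(); p = r.copy(); rr = r @ r
for _ in range(100):
    if np.sqrt(rr) < 1e-12 * np.linalg.norm(rhs):
        break
    Sp = S_mv(p)
    a = rr / (p @ Sp)
    xb += a * p; r -= a * Sp
    rr_new = r @ r
    p = r + (rr_new / rr) * p
    rr = rr_new

xi = lu.solve(bi - Aib @ xb)                          # back-substitute interiors
full = np.empty(N); full[intr] = xi; full[sep] = xb
assert np.linalg.norm(A @ full - b) < 1e-8 * np.linalg.norm(b)
```

Shifting work between the direct factorization and the iterative interface solve is exactly the trade-off whose computational and memory cost the talk models.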

Sparse hybrid solvers are a trade-off between direct methods and iterative methods. Part of the computation is first performed with a direct method in order to ensure numerical robustness; the algorithm then switches to an iterative method to alleviate the computational complexity and memory usage. The convergence and number of iterations in the second step depend on the amount of computation performed in the direct part. In this talk, we study the impact of the proportion of computation performed with a direct method on the number of iterations and the computation time. We propose a model to compute the subsequent computational and memory costs, depending on the ratio of computation performed in the direct part. We apply this model to several classes of academic problems and illustrate the possible trade-offs in terms of computational and memory complexity. This is ongoing work, and the present extended abstract sets up the framework on top of which the talk will be developed.

#46: A robust parallel hybrid solver for fluid flow problems
Presenter: . . . . . . . . . . . . . . . . . . Jonas Thies ([email protected]), University of Groningen, NL
Co-authors: . . . . . . . . . . . . . . . . . . Fred Wubs

We present a parallel hybrid direct/iterative solver for the fully coupled solution of fluid flow problems. The algorithm is based on nested dissection combined with an iterative solver for the Schur complement on the interfaces. A robust incomplete factorization based on orthogonal transformations and dropping by position is used as preconditioner for the Schur complement. The method can be applied recursively and exposes parallelism on each level while preserving important properties of the original problem such as structure, symmetry and definiteness.

The advantage over existing solvers of this type is that the new preconditioning strategy leads to grid-independent convergence rates and is robust at high Reynolds numbers.

Using a Trilinos implementation of the method, we demonstrate linear computational complexity and good parallel performance for the Poisson problem and the Jacobian of the steady incompressible Navier-Stokes equations on a structured grid.

#47: Distributed Schur Complement Solvers for Real and Complex Block-Structured CFD Problems
Presenter: . . . . . . . . . . . . . . . . . . Achim Basermann ([email protected]), German Aerospace Center
Co-authors: . . . . . . . . . . . . . . . . . . Hans-Peter Kersken

At the Institute for Propulsion Technology of the German Aerospace Center (DLR), the parallel simulation system TRACE (Turbo-machinery Research Aerodynamic Computational Environment) has been developed specifically for the calculation of internal turbo-machinery flows. The finite volume approach with block-structured grids requires the parallel, iterative solution of large, sparse, real and complex systems of linear equations. For convergence acceleration of the iteration, Distributed Schur Complement (DSC) preconditioners for real and complex matrix problems have been investigated.

The DSC method requires adequate partitioning of the matrix problem, since the order of the approximate Schur complement system to be solved depends on the number of couplings between the sub-domains. Graph partitioning with ParMETIS from the University of Minnesota is suitable, since a minimization of the number of edges cut in the adjacency graph of the matrix corresponds to a minimization of the number of coupling variables between the sub-domains. The latter determine the order of the approximate Schur complement system used for preconditioning. Since even the matrix pattern is non-symmetric for block-structured TRACE problems, it has to be symmetrized so that the corresponding matrix adjacency graph becomes undirected and ParMETIS can be applied.

Matrix permutations like Reverse Cuthill-McKee (RCM) and Minimum Degree (MD) are employed per sub-domain in order to reduce fill-in in the incomplete LU factorizations which are part of the DSC preconditioner.

Numerical and performance results of these methods are discussed for typical TRACE problems on multi-core architectures, together with an analysis of the pros and cons of the complex problem formulation, e.g. regarding the ratio of calculation operations to memory accesses. The results show that matrix permutations are crucial for DSC preconditioner as well as iterative solver performance. The DSC preconditioned iterative solvers for the complex problem formulation distinctly outperform the solvers for the real formulation. Reasons are that the complex formulation results in lower problem



order and a more advantageous matrix structure, and has higher data locality and a better ratio of computation to memory access.

2.2C Lecture Room 116 MISCELLANEOUS A Chair: Jennifer Scott

#48: Equivalent operator preconditioning for elliptic problems with nonhomogeneous mixed boundary conditions
Presenter: . . . . . . . . . . . . . . . . . . Tamas Kurics ([email protected]), Eotvos Lorand University

The numerical solution of linear elliptic partial differential equations often involves finite element discretization, where the discretized system is usually solved by some conjugate gradient method. The crucial point in the solution of the obtained discretized system is reliable preconditioning, that is, ensuring the boundedness of the condition number of the systems, no matter how the mesh parameter is chosen. The PCG method is applied to solving convection-diffusion equations with nonhomogeneous mixed boundary conditions. Using the approach of equivalent and compact-equivalent operators in Hilbert space, it is shown that for a wide class of elliptic problems the superlinear convergence of the obtained preconditioned CGM is mesh independent under FEM discretization.

Here the main difficulty arises from the proper definition of the unbounded operator L corresponding to the elliptic PDE. The presence of inhomogeneous mixed boundary conditions involves the operator-pair approach, i.e. L has to be defined as a pair of operators, where one acts on the domain and the other acts on the Neumann boundary. Using the theory of compact-equivalent operators, it can be shown that preconditioning with a symmetric elliptic operator having the same principal part as L provides mesh-independent superlinear convergence. This property holds under the usual smoothness and coercivity assumptions on the coefficients of the PDE.

These results can be extended to elliptic systems with mixed boundary conditions. Using the equivalent operator idea, one can define decoupled (that is, independent) Helmholtz operators as preconditioners. The resulting discrete systems in the PCG algorithms have block diagonal structure, allowing efficient parallelization of the solution of the auxiliary equations. I thank the SNF/SCOPES grant for making my participation at PMAA'10 possible.

#49: On the Performance and Implementation of the Parallel Deflated Preconditioned Conjugate Gradient method
Presenter: . . . . . . . . . . . . . . . . . . T.B. Jonsthovel ([email protected]), TU Delft
Co-authors: . . . . . . . . . . . . . . . . . . M.B. van Gijzen, C. Vuik, A. Scarpas

Finite element computations are indispensable for the simulation of material behavior. Recent developments in visualization and meshing software give rise to high-quality but very large meshes. As a result, large systems with millions of degrees of freedom need to be solved. In our application, the finite element stiffness matrix is symmetric positive definite, and therefore the Preconditioned Conjugate Gradient (PCG) method is our method of choice. The PCG method is well suited for parallelization, which is needed in practical applications.

Many finite element computations involve the simulation of inhomogeneous materials. These materials lead to large jumps in the entries of the stiffness matrix. We have shown in (Jonsthovel et al., CMES 2009) that these jumps slow down the convergence of the PCG method and that, by decoupling those regions with a deflation technique, a more robust PCG method can be constructed: the Deflated Preconditioned Conjugate Gradient (DPCG) method.
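The deflation idea can be sketched with the standard deflated-CG construction (our toy example, not the authors' code): for a 1D problem with a large coefficient jump, piecewise-constant vectors over the two material regions serve as deflation vectors Z, and preconditioned CG runs on the projected operator PA with P = I − AZ(ZᵀAZ)⁻¹Zᵀ.

```python
import numpy as np

# Sketch of the standard deflated preconditioned CG construction (our toy
# example, not the authors' code): 1D diffusion with a 1e6 coefficient jump,
# a Jacobi preconditioner, and piecewise-constant deflation vectors Z that
# decouple the two material regions.
n = 32
k = np.where(np.arange(n + 1) < n // 2, 1.0, 1e6)     # jumping coefficient
A = np.diag(k[:n] + k[1:]) - np.diag(k[1:n], 1) - np.diag(k[1:n], -1)
b = np.ones(n)
Minv = 1.0 / np.diag(A)                               # Jacobi preconditioner

Z = np.zeros((n, 2))                                  # one indicator per region
Z[: n // 2, 0] = 1.0
Z[n // 2:, 1] = 1.0
E = Z.T @ A @ Z
def P(v):                                             # P = I - A Z E^{-1} Z^T
    return v - A @ (Z @ np.linalg.solve(E, Z.T @ v))

# Preconditioned CG on the deflated system P A xh = P b.
xh = np.zeros(n); r = P(b); z = Minv * r; p = z.copy(); rz = r @ z
for _ in range(500):
    if np.linalg.norm(r) < 1e-10 * np.linalg.norm(b):
        break
    w = P(A @ p)
    a = rz / (p @ w)
    xh += a * p; r -= a * w
    z = Minv * r
    rz_new = r @ z
    p = z + (rz_new / rz) * p
    rz = rz_new

# Recover the solution of A x = b:  x = Z E^{-1} Z^T b + P^T xh.
sol = Z @ np.linalg.solve(E, Z.T @ b) + xh - Z @ np.linalg.solve(E, Z.T @ (A @ xh))
assert np.linalg.norm(A @ sol - b) < 1e-6 * np.linalg.norm(b)
```

In the authors' setting Z instead contains the rigid body modes of element aggregates and the preconditioner is more sophisticated, but the projection algebra is the same.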

This paper extends the results we reported in (Jonsthovel et al., CMES 2009). The DPCG method uses deflation vectors that contain the rigid body modes of sets of elements with similar properties. We will derive a cheap and generally applicable method to compute those rigid body modes. We also provide a mathematical justification of our approach. We will show that the DPCG method is parallel by nature when used with domain decomposition and can be easily and efficiently implemented in any existing parallel FE code. Finally, we will discuss numerical experiments on composite materials to validate our results and to show that parallel DPCG scales well for an increasing number of domains (CPUs) and problem sizes.

#50: Model-Driven Adaptation of Double-Precision Matrix Multiplication to the Cell Processor Architecture
Presenter: . . . . . . . . . . . . . . . . . . Krzysztof Rojek ([email protected]), Czestochowa University of Technology
Co-authors: . . . . . . . . . . . . . . . . . . Lukasz Szustak, Roman Wyrzykowski

This paper presents an approach to the adaptation of double-precision matrix multiplication to the architecture of systems based on two types of Cell/B.E. processors.

The hierarchic algorithm used for the adaptation consists of three levels. The first level is a micro-kernel, which is responsible for the block matrix multiplication and is implemented on the register file of an SPE core. The second level is the kernel of the algorithm, responsible for matrix multiplication of size 64 by 64. This level uses micro-kernel operations and is implemented on a single SPE core with its local storage. For each type of Cell/B.E. processor, two kernels were implemented. One of them is based on traditional operations of the type C=C+A*B, while the second is based on multiplications C=A*B. The third level of the algorithm is executed on all 16 SPE cores of the IBM Blade Center. It is responsible for the data management in the main memory, as well as communication between the main memory and the local storages of the SPE cores.
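The hierarchic structure can be illustrated with a minimal sketch (ours, not the Cell code): a fixed-size kernel multiplies blocks, and an outer level loops over block tiles of the operands, accumulating C = C + A·B tile by tile.

```python
import numpy as np

# Minimal sketch of the hierarchic idea (our illustration, not the Cell code):
# a fixed-size kernel multiplies blocks, and an outer level loops over block
# tiles of the operands, accumulating C = C + A * B tile by tile.
NB = 64                                       # kernel block size, as in the paper

def kernel(Cb, Ab, Bb):
    Cb += Ab @ Bb                             # stands in for the SPE micro-kernel

def blocked_matmul(C, A, B, nb=NB):
    n = C.shape[0]                            # square matrices, n a multiple of nb
    for i in range(0, n, nb):
        for j in range(0, n, nb):
            for l in range(0, n, nb):
                kernel(C[i:i + nb, j:j + nb],
                       A[i:i + nb, l:l + nb],
                       B[l:l + nb, j:j + nb])

rng = np.random.default_rng(1)
A = rng.standard_normal((128, 128))
B = rng.standard_normal((128, 128))
C = np.zeros((128, 128))
blocked_matmul(C, A, B)
assert np.allclose(C, A @ B)
```

On the Cell, the inner `kernel` is hand-tuned for the SPE register file and the tile loop is distributed over the 16 SPE cores; the sketch only shows the blocking structure that the performance models optimize.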

Our approach is based on two performance models. The purpose of the first model is the optimization of computations within a single SPE core. It is constructed as a function of the block matrix size. The model accounts for such factors as the size of local storage, the number of registers, properties of double-precision operations, balance between pipelines, etc. This performance model allows for selecting "the best" size of the micro-kernel, which is used for the adaptation.

The purpose of the second performance model is the optimization of communication across all 16 SPE cores of the IBM



Blade Center, including the main memory. To efficiently distribute data blocks to SPEs, there are two key issues: data distribution in the main memory, and data transfer between the main memory and the local storages of the SPEs. The data distribution is implemented using the NUMA library, which allows the programmer to allocate main memory on the same processor on which the current thread runs. The double buffering technique is used to implement the data transfer.

For the IBM QS21 system, which uses two Cell/B.E. processors of the first generation, the proposed adaptation and optimizations allow for achieving 27.24 GFLOPS, which is 93.1% of the peak performance. This result was obtained for matrices of size 4096 by 4096. For the IBM QS22 system, based on PowerXCell 8i processors, the double-precision performance is much higher, so 184.3 GFLOPS were achieved, i.e. 90% of the peak performance. This result was reported for the matrix multiplication of size 15 872 by 15 872. Krzysztof Rojek is an SNF/SCOPES-supported student.

#51: Designing sparse direct solvers for multicore architectures
Presenter: . . . . . . . . . . . . . . . . . . Jennifer Scott ([email protected]), Rutherford Appleton Laboratory
Co-authors: . . . . . . . . . . . . . . . . . . Jonathan Hogg

The rapid emergence of multicore machines has led to the need to design new algorithms that are efficient on these architectures. In this talk, we consider the design and development of a new direct solver, HSL MA87, for the efficient solution of large sparse symmetric linear systems on multicore architectures. We were motivated by the successful division of the computation in the dense positive-definite case into tasks on blocks, and the use of a task manager to exploit all the parallelism that is available between these tasks, whose dependencies may be represented by a directed acyclic graph (DAG). Our algorithm is built on the assembly tree and subdivides the work at each node into tasks on blocks, whose dependencies may again be represented by a DAG. To limit memory requirements, updates of blocks are performed directly.

For portability and maintainability, HSL MA87 is written in Fortran 95 plus OpenMP; it is available as part of the software library HSL. The first release of HSL MA87 was for positive-definite sparse systems. We have recently extended its functionality to the harder but very important case of symmetric indefinite systems. We highlight the extra challenges such systems pose, describe how we have accommodated numerical pivoting and, using problems arising from a range of practical applications, illustrate its performance.

[1] J.D. Hogg, J.K. Reid and J.A. Scott (2009). Design of a multicore sparse Cholesky factorization using DAGs. Technical Report RAL-TR-2009-027, Rutherford Appleton Laboratory.

[2] J.D. Hogg and J.A. Scott (2010). An indefinite sparse direct solver for multicore machines. Technical Report RAL-TR-2010-0xx, Rutherford Appleton Laboratory.
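The block-task decomposition underlying such DAG-based solvers can be illustrated on the dense case (a sequential sketch of ours; the actual solver works on a sparse assembly tree and executes the tasks in parallel):

```python
import numpy as np

# Sequential sketch (ours) of the block-task decomposition used by DAG-based
# solvers such as HSL_MA87: a dense blocked Cholesky in which each block
# operation (factorize, solve, update) is one task; a task manager would run
# independent tasks in parallel once their DAG dependencies are satisfied.
def blocked_cholesky(A, nb):
    n = A.shape[0]                                  # n must be a multiple of nb
    W = A.copy()
    for kk in range(0, n, nb):
        k = slice(kk, kk + nb)
        W[k, k] = np.linalg.cholesky(W[k, k])       # task: factorize(k)
        for ii in range(kk + nb, n, nb):
            i = slice(ii, ii + nb)                  # task: solve(i, k)
            W[i, k] = np.linalg.solve(W[k, k], W[i, k].T).T
        for ii in range(kk + nb, n, nb):
            i = slice(ii, ii + nb)
            for jj in range(kk + nb, ii + nb, nb):
                j = slice(jj, jj + nb)              # task: update(i, j, k)
                W[i, j] -= W[i, k] @ W[j, k].T
    return np.tril(W)

rng = np.random.default_rng(3)
X = rng.standard_normal((96, 96))
A = X @ X.T + 96.0 * np.eye(96)                     # SPD test matrix
L = blocked_cholesky(A, 32)
assert np.allclose(L @ L.T, A)
```

The dependencies are visible in the loop structure: `solve(i, k)` needs `factorize(k)`, and `update(i, j, k)` needs `solve(i, k)` and `solve(j, k)`; tasks for different blocks with satisfied dependencies may run concurrently, which is what the DAG scheduler exploits.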

Thursday 01.07.2010 4:10-5:50 Parallel Session 2.2

2.2D Lecture Room 120 VOXEL-BASED COMPUTATIONS Chair: Maya Neytcheva

#52: Accelerating Iterative Stencil Computations on the GPU
Presenter: . . . . . . . . . . . . . . . . . . Fumihiko Ino ([email protected]), Osaka University
Co-authors: . . . . . . . . . . . . . . . . . . Toshihiro Matsuda, Kenichi Hagihara

The stencil computation is a memory-bound operation that performs nearest-neighbor computations on structured grids. This operation frequently appears in solving the partial differential equations (PDEs) used in a wide range of scientific applications. One emerging platform for solving this problem is the graphics processing unit (GPU), which offers rapidly increasing memory bandwidth together with programmability. In this talk, we will present a GPU-based method for accelerating stencil computations. We apply the method to total variation minimization, namely an image filtering algorithm. The proposed method differs from previous methods in that it decomposes the kernel into two parts. This decomposition leads to more synchronization on the GPU. However, it simplifies the memory access pattern resulting from function composition, and thus reduces the branches inherent in memory access. Furthermore, we select the appropriate shape and size of the thread blocks to maximize the effective bandwidth of video memory. Experimental results show that our method is 30% faster than a previous method implemented with a single kernel. We also find that the performance increases by 4%, depending on the shape of the thread blocks. The frame rate reaches 46 frames per second for 1024×1024-pixel time-series images, demonstrating real-time image denoising and visualization.

#53: Parallel Multigrid Methods on Octree-Like Grids in the Peano Framework
Presenter: . . . . . . . . . . . . . . . . . . Miriam Mehl ([email protected]), TU Munchen
Co-authors: . . . . . . . . . . . . . . . . . . Tobias Weinzierl, Marion Bendig

Although multigrid methods are obviously the best choice of solver even (or in particular) on highly parallel computing architectures, their efficient implementation is far from trivial and much more complex than, for example, CG solvers with single-level preconditioners. This complexity is mainly due to the increased data dependencies caused by the inter-level interactions. Dynamic grid adaptivity, which is crucial to further minimize the computational costs, worsens the difficulties by preventing the use of simple data structures such as arrays and matrices, as well as of trivial balanced domain partitioning methods.

Thursday 01.07.2010 4:10-5:50 Parallel Session 2.2

We propose the implementation of multigrid methods in our PDE framework Peano, which is based on octree-like computational grids. The structuredness of such grids allows for a memory-saving storage of the grid and, if used in combination with cell-wise operator evaluation, of the discretisation stencils, even in adaptively refined grids. In contrast to other similar approaches, Peano uses space-filling curves not only for balanced, dynamic, and parallel domain partitioning but also as a tool to optimize data storage and data access. This yields a code with full dynamic grid adaptivity, quasi-optimal domain partitioning, a very low memory footprint, and a nearly 100% cache-hit rate. As the domain partitioning handles the whole tree of grid cells over all refinement levels, the parallelization of additive multigrid methods is very straightforward in Peano: one additive V-cycle requires only one sweep over the whole cell tree. Data at partition boundaries are automatically stored in the same order in neighbouring processes, such that communication reduces to copying or merging the respective data streams. Multiplicative multigrid methods of course require several grid traversals (up to the currently active grid level), as in any other code, but are a quite natural extension that maintains all storage and cache-related advantages of Peano. As all underlying concepts of Peano work in arbitrary dimensions, even time-space adaptive grids in combination with 4D multigrid solvers become feasible and are currently being implemented.
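As an illustration of how a space-filling curve turns balanced domain partitioning into a one-dimensional problem, here is a sketch using the simpler Morton (Z-order) curve; Peano itself uses the Peano curve, and `morton_index`/`partition` are illustrative names, not part of the framework.

```python
def morton_index(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of (x, y) to get the cell's position
    along a Z-order (Morton) space-filling curve."""
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (2 * b)
        idx |= ((y >> b) & 1) << (2 * b + 1)
    return idx

def partition(cells, nproc):
    """Balanced partitioning: sort cells along the curve, then cut
    the resulting 1D sequence into nproc contiguous chunks."""
    order = sorted(cells, key=lambda c: morton_index(*c))
    chunk = -(-len(order) // nproc)  # ceiling division
    return [order[i * chunk:(i + 1) * chunk] for i in range(nproc)]
```

Because cells close on the curve tend to be close in space, each contiguous chunk is a geometrically compact subdomain, which is what makes this a cheap partitioning tool.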

This presentation will show the general parallel multigrid concepts of Peano together with simulation and performance results for Poisson, Navier-Stokes, and heat equations on 2D, 3D, and 4D grids.

#54: Voxel based analysis of geomaterials with Schwarz-type parallel solvers
Presenter: Vojtech Sokol (vojtech.sokol@ugn.cas.cz), Institute of Geonics AS CR, Czech Republic
Co-authors: Radim Blaheta

Our motivation for voxel-based analysis comes from the investigation of geocomposites, i.e. materials with a soil or rock matrix filled by a grouting medium. This investigation is important for the assessment of both mechanical and hydraulic properties, which depend on the grout and on the level or quality of the fill-in. The investigation uses homogenization and sensitivity analysis, which requires repeated solution of the voxel finite element models.

The voxel finite element models use regular grids and a material distribution obtained by tomography scanning of samples of geocomposites. The arising models are large scale, with possibly high heterogeneity of the material microstructure and high coefficient jumps (especially in the case of hydraulic properties).

Therefore, the voxel finite element analysis needs efficient and parallelizable solvers for the arising linear systems. We seek such solvers in the class of Schwarz-type overlapping domain decomposition methods with an additional coarse problem. In particular, we shall consider Schwarz methods with a layered decomposition and a coarse problem created by selective aggregation, which means that aggregated elements are stiffer than the surrounding non-aggregated ones. Assuming accurate solution of the subproblems, this type of Schwarz method with aggregation can be both efficient and robust with respect to the coefficient jumps. The layered decomposition enables the accurate and efficient solution of the local problems by direct solvers. The details of the technique will be discussed and some numerical results will be shown. (*) SNF/SCOPES-supported student.

#55: A Fast Parallel Poisson Solver on Irregular Domains
Presenter: Yves Ineichen (ineichen@inf.ethz.ch), ETH/IBM/PSI
Co-authors: Andreas Adelmann, Peter Arbenz

We discuss the scalable parallel solution of the Poisson equation on irregularly shaped domains discretized by finite differences. Depending on the treatment of the Dirichlet boundary, the resulting system of equations is symmetric or 'mildly' nonsymmetric positive definite. In all cases, the system is solved by the preconditioned conjugate gradient algorithm with smoothed aggregation (SA) based algebraic multigrid (AMG) preconditioning. We investigate variants of the implementation of SA-AMG that lead to considerable improvements in the execution times. We demonstrate good scalability of the solver on distributed-memory parallel processors with up to 2048 processors.
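The solver structure described above can be sketched as preconditioned conjugate gradients; for brevity this sketch uses a 1D model problem and a simple Jacobi (diagonal) preconditioner as stand-ins for the irregular-domain discretization and the SA-AMG preconditioner.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, maxit=1000):
    """Preconditioned conjugate gradients; M_inv applies the
    preconditioner (a stand-in here for SA-AMG)."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# 1D Poisson matrix as a simple stand-in for the finite-difference
# discretization; Jacobi (diagonal) scaling as the preconditioner.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = pcg(A, b, lambda r: r / np.diag(A))
```

The only solver component that changes between this toy and the real code is `M_inv`: replacing the diagonal scaling with an SA-AMG V-cycle gives the method the abstract describes.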

2.2E Lecture Room 116 MISCELLANEOUS B Chair: Pasqua D’Ambra

#56: A hierarchical parallel eigenvalue solver: parallelism on top of multicore linear solvers
Presenter: Tetsuya Sakurai (sakurai@cs.tsukuba.ac.jp), University of Tsukuba, Japan
Co-authors: Hiroto Tadano, Tsutomu Ikegami

Recently, we have developed a contour-integral-based eigenvalue solver for computing internal eigenvalues of nonlinear eigenvalue problems. In this talk, we present a parallel implementation and performance evaluation of our eigenvalue solver on a cluster machine in which each node has shared-memory multicore processors. One of the major advantages of our eigenvalue solver is that it does not require outer/inner loops, which allows a variety of parallel programming models. The contour paths can be processed independently in a coarse-grained way, which forms the top level of the hierarchy. The evaluation of the quadrature requires solving a set of linear equations, which forms the second level of the hierarchy. Furthermore, at the bottom level of the hierarchy, each system of linear equations is processed by multicore linear solvers. Some numerical experiments demonstrate the performance of the presented method.

#57: Towards a robust algorithm for computing the rightmost eigenvalue
Presenter: Julien Callant (jcallant@ulb.ac.be), Université Libre de Bruxelles
Co-authors: Yvan Notay

We consider the computation of the rightmost eigenvalue of a real, non-Hermitian matrix A. A standard approach uses IRAM: a combination of the Arnoldi method with Rayleigh-Ritz extractions at each reduction of the subspace dimension. However, this method converges too slowly for large matrices. In this talk, we propose to accelerate it by applying a filter based on the Cauchy integral formula. Considering a vector expanded in the basis of eigenvectors of A, ideally this filter would cancel all components corresponding to eigenvalues outside the region of interest. In practice, some errors are left because the integrals are approximated by Gauss-Legendre quadrature. It also means that the filter amounts in practice to a weighted sum of shift-and-invert matrices; that is, its application to a vector or a block of vectors is straightforward to parallelize. Note that the filter requires some knowledge of the spectrum to define the region of interest. This knowledge is dynamically improved as the convergence proceeds; that is, the filter accelerates IRAM, whose convergence, in turn, allows the filter to be defined better.

#58: Smoothing and Regularization with Modified Sparse Approximate Inverses
Presenter: Matous Sedlacek (sedlacek@in.tum.de), Technische Universität München

Sparse Approximate Inverses M which satisfy min_M ‖AM − I‖_F have been shown to be an attractive alternative to classical smoothers like Jacobi or Gauss-Seidel. The static and dynamic computation of a SAI and SPAI, respectively, comes along with advantages like inherent parallelism and robustness, with equal smoothing properties.
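The Frobenius-norm minimization decouples into one independent least-squares problem per column of M, which is the source of the inherent parallelism mentioned above. A minimal dense sketch (illustrative only; production SPAI codes work with sparse data structures and may adapt the pattern dynamically):

```python
import numpy as np

def spai(A, pattern):
    """min_M ||A M - I||_F decouples column by column: for column k,
    solve min ||A[:, Jk] m - e_k||_2 over the allowed sparsity
    pattern Jk -- each column can be computed in parallel."""
    n = A.shape[0]
    M = np.zeros((n, n))
    for k in range(n):
        Jk = pattern[k]                      # allowed nonzero rows of column k
        e_k = np.zeros(n)
        e_k[k] = 1.0
        m, *_ = np.linalg.lstsq(A[:, Jk], e_k, rcond=None)
        M[Jk, k] = m
    return M

# Toy problem: 1D Laplacian-like matrix with a tridiagonal pattern.
n = 20
A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
tri = [[j for j in (k - 1, k, k + 1) if 0 <= j < n] for k in range(n)]
M = spai(A, tri)
```

Even this crude tridiagonal pattern already gives a smaller residual ‖AM − I‖_F than the Jacobi choice M = D⁻¹, since Jacobi's pattern is a subset of each column's least-squares search space.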

We are interested in preconditioners that can incorporate probing conditions for improving the approximation relative to high- or low-frequency subspaces. For discretizations of the constant-coefficient Laplace operator we are able to present analytically derived optimal smoothers. On this basis we introduce individual as well as global probing conditions in the generalized Modified SPAI (MSPAI) approach.

In the second part we transfer our techniques to the domain of ill-posed problems for recovering original information from blurred signals. Using the probing facility of MSPAI, we force the preconditioner to act as zero on the noise subspace. In combination with an iterative regularization method it thus becomes possible to reconstruct the original information more accurately, close to the reconstruction quality of analytically derived optimal smoothing preconditioners. Moreover, we achieve flat and smooth convergence curves and are able to deal with instabilities within signals using local corrections in the subspace approximations.

#59: Algebraic Optimized Schwarz Domain Decomposition Methods
Presenter: Mikolaj SZYDLARSKI (mikolaj.szydlarski@ifp.fr), IFP, France
Co-authors: Frederic Nataf

Parallel computers are increasingly used in scientific computing. They enable large-scale computations to be performed. New algorithms that are well suited to such architectures have to be designed. Domain decomposition methods are a very natural way to take advantage of multiprocessor computers. Let us mention that such algorithms are very useful on monoprocessor computers as well.

The idea is to decompose the computational domain into smaller subdomains. Each subdomain is assigned to one processor. The equations are solved on each subdomain. In order to enforce the matching of the local solutions, interface conditions have to be written on the boundary between subdomains. These conditions are imposed iteratively. The convergence rate is very sensitive to these interface conditions. The Schwarz method is based on the use of Dirichlet boundary conditions. It is slow and requires overlapping decompositions. In order to improve the convergence and to be able to use non-overlapping decompositions, it has been proposed to use more general boundary conditions. It is even possible to optimize them with respect to the efficiency of the method. In recent years, there have been many works on this question at the level of the continuous problem (PDE level).

We present here an algebraic way to construct interface conditions that accelerate the convergence of the domain decomposition method. The method is not based on a priori knowledge of the PDE to be solved. In an adaptive way, we modify the interface conditions at the algebraic level. Theoretical and numerical results are given. We also compare with the patch method, which is another algebraic way to build coupling conditions. Results will be presented on problems with discontinuous coefficients (up to 6 orders of magnitude) and anisotropic coefficients. The effect of the partition of the unknowns will be studied.
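For reference, the classical Dirichlet-based overlapping Schwarz iteration that optimized interface conditions aim to improve can be sketched on a 1D model problem; `schwarz_1d` and its parameters are illustrative, not the authors' code.

```python
import numpy as np

def schwarz_1d(n=40, overlap=4, sweeps=50):
    """Classical alternating Schwarz for -u'' = 1 on (0,1) with
    homogeneous Dirichlet boundary, two overlapping subdomains,
    and Dirichlet interface conditions."""
    h = 1.0 / (n + 1)
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    f = np.full(n, h * h)                  # scaled right-hand side
    u = np.zeros(n)
    mid = n // 2
    subdomains = (slice(0, mid + overlap), slice(mid - overlap, n))
    for _ in range(sweeps):
        for dom in subdomains:
            # local solve; values of u outside the subdomain enter
            # through the residual as Dirichlet interface data
            r = f - A @ u
            u[dom] += np.linalg.solve(A[dom, dom], r[dom])
    return u, np.linalg.solve(A, f)        # iterate and direct solution
```

Shrinking `overlap` toward zero makes this iteration stall, which is exactly the motivation for replacing the Dirichlet interface conditions by the more general (optimized) ones discussed in the abstract.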

2.2F Lecture Room 117 HELMHOLTZ AND MAXWELL SOLVERS Chair: Rob H. Bisseling

#60: A robust and efficient, highly scalable parallel solution of the Helmholtz equation with large wave numbers
Presenter: Dan Gordon (gordon@cs.haifa.ac.il), Dept. of Aerospace Engineering, The Technion, Haifa, Israel
Co-authors: Rachel Gordon

Numerical solution of the Helmholtz equation is a challenging computational task, particularly when the wave number is large. Recent years have seen great improvements in the finite difference approach to this problem through enhancements of the shifted Laplacian preconditioner, originally introduced in [1]. For a recent survey and some new results, see [4]. In some cases, this approach may be difficult to implement due to the fact that each iteration of the solver uses a multigrid solution of the preconditioner. More recently, a new approach, based on an algebraic multilevel preconditioner, was introduced in [3]. This approach uses symmetric maximum weight matchings and an inverse-based pivoting strategy.

This work presents numerical experiments with the block-parallel CARP-CG algorithm, applied to the Helmholtz equation with large wave numbers. CARP-CG was recently shown to be a very robust and efficient parallel solver of linear systems with very large off-diagonal elements and discontinuous coefficients, obtained from strongly convection-dominated elliptic PDEs in heterogeneous media [7]. CARP-CG is simple to implement even on unstructured grids, or when subdomain boundaries are complicated. CARP-CG is a CG acceleration of CARP [5], which is a block-parallel version of the Kaczmarz algorithm (which is SOR on the normal equations). On one processor, CARP-CG is identical to the CGMN algorithm [2,6].
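The building block underneath CARP and CARP-CG, a relaxed Kaczmarz sweep (SOR on the normal equations), can be sketched as follows; the diagonally dominant test matrix is an illustrative stand-in, not the Helmholtz systems studied here.

```python
import numpy as np

def kaczmarz_sweep(A, b, x, relax=1.7):
    """One forward Kaczmarz sweep: successively project x onto each
    row hyperplane a_i . x = b_i; relax=1.7 matches the fixed
    relaxation parameter used in the experiments above."""
    for i in range(A.shape[0]):
        a = A[i]
        x = x + relax * (b[i] - a @ x) / (a @ a) * a
    return x

# Consistent test system (well-conditioned, for a quick demonstration).
rng = np.random.default_rng(0)
n = 30
A = 3 * np.eye(n) + 0.1 * rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true
x = np.zeros(n)
for _ in range(200):
    x = kaczmarz_sweep(A, b, x)
```

CARP parallelizes this by letting each block run such sweeps on its own rows and then component-averaging the shared unknowns; a CG acceleration of the resulting iteration gives CARP-CG.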


Numerical experiments were carried out on the domain [0,1] × [0,1], with wave numbers k = 75, 150, 300. A Dirichlet boundary condition with a discontinuity at (0.5, 0) was prescribed on the side y = 0, and a first-order absorbing boundary condition was prescribed on the other sides. A second-order finite difference discretization scheme was used, leading to a complex, nonsymmetric and indefinite linear system. For each wave number, the domain was discretized so as to obtain 6, 8, and 10 grid points per wavelength. Three different convergence goals were prescribed for the relative residual: 10⁻⁴, 10⁻⁷, and 10⁻¹⁰. Experiments were carried out with 1, 4, 8, 16 and 32 processors. A fixed relaxation parameter of 1.7 was used in all cases.

The results demonstrate the robustness and runtime efficiency of CARP-CG on the tested problems. A most significant result of these experiments is that as k increases, the scalability of CARP-CG improves: for k = 300, the number of iterations on 32 processors was only about 15% more than required on one processor, and there was very little variance in this result when the convergence goal or the number of grid points per wavelength was changed. This places CARP-CG as a very viable parallel solver for the Helmholtz equation with large wave numbers.

[1] A. Bayliss, C. I. Goldstein and E. Turkel. An iterative method for the Helmholtz equation. J. of Computational Physics 49, 443-457, 1983.

[2] A. Bjorck and T. Elfving. Accelerated projection methods for computing pseudoinverse solutions of systems of linear equations. BIT 19, 145-163, 1979.

[3] M. Bollhofer, M. J. Grote and O. Schenk. Algebraic multilevel preconditioner for the Helmholtz equation in heterogeneous media. SIAM J. on Scientific Computing 31, 3781-3805, 2009.

[4] Y. A. Erlangga. Advances in iterative methods and preconditioners for the Helmholtz equation. Archives of Computational Methods in Engineering 15, 37-66, 2008.

[5] D. Gordon and R. Gordon. Component-averaged row projections: A robust, block-parallel scheme for sparse linear systems. SIAM J. on Scientific Computing 27, 1092-1117, 2005.

[6] D. Gordon and R. Gordon. CGMN revisited: robust and efficient solution of stiff linear systems derived from elliptic partial differential equations. ACM Trans. on Mathematical Software 35, 18:1-18:27, 2008.

[7] D. Gordon and R. Gordon. Solution methods for nonsymmetric linear systems with large off-diagonal elements and discontinuous coefficients. CMES-Computer Modeling in Engineering & Sciences, in press, 2010.

#61: Solution of three-dimensional heterogeneous Helmholtz problems in geophysics
Presenter: Rafael Lago (lago@cerfacs.fr), CERFACS
Co-authors: Henri Chalandra, Serge Gratton, Xavier Pinel, Xavier Vasseur

We discuss the solution of a geophysical problem related to wave propagation in seismics. At a given frequency, a source is triggered at a certain position on the Earth's surface, propagating pressure waves. These waves are propagated back to the surface when they encounter discontinuities. Repeating this process for several source locations, we intend to be able to detect the location and thickness of these reflective layers.

This phenomenon is modelled by the three-dimensional Helmholtz equation written in the frequency domain

−∆u − k²u = s

where u denotes the wave pressure, k the wavenumber, and s a given source term. In order to avoid reflections near the boundaries, we use absorbing boundary conditions known as the Perfectly Matched Layer (PML) [1]. We discretize this system with the 7-point stencil, and we must set h (the distance between neighbouring grid points) such that a stability condition is satisfied:

h = 2π / (nλ k)

where nλ is the number of points per wavelength (usually nλ = 12). We remark that k is related to the frequency, meaning that the larger the frequency, the larger the linear system. This discretization results in a linear system Ax = b where b is the given source and A is the coefficient matrix. However, it is desirable to be able to solve this problem for multiple right-hand sides (multiple sources) and even multiple coefficient matrices (different frequencies), leading to a block system.
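A small worked example of the stability condition, showing how the grid (and hence the linear system) grows with the wavenumber; `helmholtz_grid` is an illustrative helper, not part of the authors' code.

```python
import math

def helmholtz_grid(k, n_lambda=12, length=1.0):
    """Grid spacing from the stability condition h = 2*pi / (n_lambda * k),
    and the resulting number of grid points per dimension for a domain
    of the given length."""
    h = 2 * math.pi / (n_lambda * k)
    n = math.ceil(length / h)
    return h, n

# Doubling the frequency (hence k) roughly doubles the points per
# dimension, i.e. roughly an 8x larger linear system in 3D.
h1, n1 = helmholtz_grid(k=50.0)
h2, n2 = helmholtz_grid(k=100.0)
```

This cubic growth of the system size with frequency in 3D is why the abstract emphasizes scalable iterative solvers and block methods for multiple frequencies.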

We implemented an algorithm for this strategy to solve heterogeneous problems. We use an iterative method based on a flexible Krylov subspace method known as FGMRES [2]. This system is preconditioned by a two-level scheme, using Krylov methods both as a smoother and as an approximate solver on the coarse level. Since it is possible to implement deflation for block FGMRES, this strategy is also found very suitable for our specific application [3].

We show numerical results of our implementation for problems of order 9 × 10⁹, up to 30 Hz, performed on a Blue Gene machine. We point out the sensitivity of this implementation to the performance of the coarse level, and we test both the weak and the strong scalability of this implementation.

We also study a 27-point stencil, which is able to greatly reduce the size of the problem by reducing nλ. We also show that this strategy might not work well when using iterative methods, and that the performance of the coarse grid level decreases drastically. To work around this, we attempt to increase nλ, comparing the performance to the 7-point stencil. We also performed tests with only a one-level method and compared these results. We test the scalability of the 27-point scheme both for the one- and two-level preconditioners, and we show the results obtained on a Blue Gene machine with up to 4096 cores.

We study alternative discretization schemes for this problem, aiming to obtain higher accuracy, for example O(h⁴), although such schemes in general reduce the sparsity of the coefficient matrix.

With this work we expect to obtain a more accurate and faster solution of the given problem by tuning the preconditioning techniques and discretization schemes. Later we expect to apply this implementation to the inverse problem and also to include visco-elasticity in our physical model, for a more general application.

[1] J.P. Berenger. A perfectly matched layer for absorption of electromagnetic waves. J. Comp. Phys., 114:185-200, 1994.

[2] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Scientific and Statistical Computing, 14:461-469, 1993.

[3] X. Pinel. A perturbed two-level preconditioner for the solution of three-dimensional heterogeneous Helmholtz problems with application to geophysics. PhD thesis, CERFACS, 2010.

#62: A Multigrid Method for Maxwell's Equations with Parallel Nearly Direct Coarse Grid Solver
Presenter: Christian Wieners (wieners@kit.edu), KIT
Co-authors: Daniel Maurer

We consider parallel solution methods for the Maxwell cavity problem and the Maxwell eigenvalue problem. In both cases we use a standard curl-conforming discretization with Nédélec finite elements, and for the linear solver we use a multigrid preconditioner with a hybrid smoother as introduced by Hiptmair. Here, alternating smoothing steps for the edge modes and for vertex-based gradient fields are applied.

For the Maxwell cavity problem we have to solve a strongly indefinite system. This requires a sufficiently fine coarse mesh in the multigrid preconditioner. Here, the coarse problem is solved with a parallel direct solver. This solver follows the idea of nested dissection and uses a recursive Schur complement reduction of the skeleton system. The dense Schur complement problems are partly approximated with low-rank matrices and solved in parallel with distributed rows.

For the Maxwell eigenvalue problem we use the LOBPCG method introduced by Knyazev, modified by including an additional step in order to project out the large kernel of the Maxwell operator. This requires a quite accurate solution of a Laplace problem. Again, the method is preconditioned with the parallel multigrid method. Together, this allows several eigenvalues and eigenmodes to be approximated simultaneously.

All algorithms are realized within the software M++, which allows for a flexible and transparent handling of degrees of freedom on the processor interfaces. The efficiency of both methods is demonstrated for various examples with up to 512 processors and 50 million degrees of freedom.

#63: Asynchronous Row Projections
Presenter: Giorgos Kollias (gkollias@cs.purdue.edu), Purdue University
Co-authors: Ananth Grama

Row projection methods have been shown to be robust for solving large sparse linear systems, particularly when the systems are non-symmetric [2]. In parallel row-projection methods, each process repeatedly projects its current iterate on its local coefficient vectors to produce an updated approximate solution. This approximate solution, or a part thereof, is communicated to other processes for subsequent updates by projecting onto subspaces spanned by remote coefficient vectors. The precise mechanisms associated with local projections (traversal of the coefficient vectors for computing the necessary inner products) and the composition of remote projections into local updates distinguish various parallel row projection methods. To the best of our knowledge, all parallel row projection schemes include some form of synchronization across processes between iterations. In scalable environments, or on platforms with large latencies, this synchronization step presents a significant performance bottleneck.

In this paper we investigate the impact of removing global synchronization [1] on the performance of row projection schemes. Specifically, we investigate component-averaged row projections (CARP) [3] over various cluster configurations (using machines from two heterogeneous clusters) under synchronous and asynchronous scenarios. We explore the space of projection parameters and sets of inner and outer iterations, and observe several interesting phenomena:

- The asynchronous CARP version converges in all cases tested. Note that a formal proof of convergence exists only for the synchronous case.

- Good speed-ups are observed for both the synchronous and the asynchronous versions. For an identical number of local iterations, the asynchronous parallel implementation outperforms the synchronous one (by 10-20%). However, the resulting residual norm is higher for the asynchronous case, as expected.

As the number of processors is increased (or network latency increases), the performance benefit from asynchrony increases. Depending on the system being solved, the algorithmic overhead of the asynchronous method is expected to be dominated by its increased parallel efficiency. Beyond this inflection point, the performance of the asynchronous version dominates that of its synchronous counterpart. Further investigation of the impact of asynchrony in row projection methods is needed to address various parameters associated with row projection schemes and their implementation on different hardware platforms.

[1] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation. Prentice Hall, Englewood Cliffs, NJ, 1989.


[2] R. Bramley and A. Sameh. Row projection methods for large nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 13(1), 1992.

[3] D. Gordon and R. Gordon. Component-averaged row projections: A robust, block-parallel scheme for sparse linear systems. SIAM Journal on Scientific Computing, 27(3):1092-1117, 2006.

Friday 02.07.2010 10:45-12:25 Parallel Session 3.1

3.1A Lecture Room 102 AUTOTUNING Chair: Rich Vuduc

#64: Autotuning dense linear algebra libraries on multicore architectures
Presenter: Emmanuel Agullo (emmanuel.agullo@inria.fr), INRIA
Co-authors: Jack Dongarra, Rajib Nath, Stanimire Tomov

Hardware trends have changed dramatically in the last few years. The frequency of processors has stabilized, or even sometimes slightly decreased, whereas the degree of parallelism has increased at an unprecedented exponential scale. This new hardware paradigm implies that applications must be able to exploit parallelism at that same exponential pace. Applications must also be able to exploit a reduced bandwidth (per core) and a smaller amount of memory (available per core). Numerical libraries, which are a critical component in the stack of high-performance applications, must in particular take advantage of the potential of these new architectures. As long as library developers could depend on ever-increasing clock speeds and instruction-level parallelism, they could also settle for incremental improvements in the scalability of their algorithms. But to deliver on the promise of tomorrow's petascale systems, library designers must find methods and algorithms that can effectively exploit levels of parallelism that are orders of magnitude greater than most of today's systems offer. In the dense linear algebra community, several projects have tackled this challenge on different hardware architectures. On multicore architectures, a new project called Parallel Linear Algebra Software for Multicore Architectures (PLASMA) has been developed. PLASMA is a redesign of LAPACK and ScaLAPACK for shared-memory computer systems based on multi-core processor architectures. To achieve high performance on this type of architecture, PLASMA relies on tile algorithms. Basically, they aim at providing fine granularity and high asynchronicity to fit multicore constraints. The common characteristic of all these approaches is that they need intensive tuning to fully benefit from the potential of the hardware.
Indeed, the increased degree of parallelism has induced a more and more complex memory hierarchy, and CPUs have several complex levels of cache, possibly with Non-Uniform Memory Access (NUMA). Tuning a library consists in finding the parameters that maximize a certain metric (most of the time, performance) in a given environment. In general, the term "parameter" has to be considered in its broad meaning, possibly including a variant of an algorithm. The search space, corresponding to the possible set of values of the tunable parameters, can be very large in practice. Depending on the context, on the purpose, and on the complexity of the search space, different approaches may be employed. Vendors can afford dedicated machines for delivering highly tuned libraries and thus have limited constraints in terms of time spent exploring the search space. On the other side of the spectrum, some libraries aim at being portable and efficient on a wider range of architectures and cannot afford virtually unlimited time for tuning. Instead, empirical tuning is performed at installation time. We follow this latter approach to automatically tune dense linear algebra operations on multicore architectures.
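A generic empirical-tuning loop in this spirit might look as follows; the blocked matrix multiply and its single tile-size parameter are illustrative stand-ins, not PLASMA's actual kernels or search space.

```python
import time
import numpy as np

def best_time(kernel, arg, reps=3):
    """Empirical timing: run the kernel a few times, keep the best run."""
    t = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel(arg)
        t = min(t, time.perf_counter() - t0)
    return t

def tune_tile_size(n=128, candidates=(16, 32, 64)):
    """Sweep a tiny search space -- the tile size of a blocked matrix
    multiply -- and return the empirically fastest value, in the spirit
    of empirical tuning performed at installation time."""
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    def blocked_matmul(nb):
        C = np.zeros((n, n))
        for i in range(0, n, nb):
            for p in range(0, n, nb):
                for j in range(0, n, nb):
                    C[i:i+nb, j:j+nb] += A[i:i+nb, p:p+nb] @ B[p:p+nb, j:j+nb]
        return C

    timings = {nb: best_time(blocked_matmul, nb) for nb in candidates}
    return min(timings, key=timings.get), timings

best, timings = tune_tile_size()
```

Real autotuners differ mainly in how they prune this sweep (models, search heuristics, prior runs) when the search space has many interacting parameters rather than one.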

We illustrate our discussion with the QR factorization implemented in the PLASMA library. We propose new efficient methods for tuning the search space. We show that the time for tuning is acceptable and that a performance close to the optimum is almost consistently obtained.

#65: Auto-tuned performance models to improve task scheduling on accelerator-based platforms
Presenter: Cedric Augonnet (cedric.augonnet@inria.fr), INRIA
Co-authors: Samuel Thibault, Raymond Namyst

Multicore machines equipped with accelerators are becoming increasingly popular. To fully tap the potential of these hybrid platforms, pure offloading approaches, in which the core of the application runs on regular processors and offloads specific parts of the computation to accelerators, are not sufficient. Instead, the real challenge is to design applications that spread across the entire machine, that is, where parallel tasks are dynamically scheduled over all available processing units.

In this presentation, we introduce StarPU, a runtime system for accelerator-based multicore machines which provides a task scheduler along with a data management library. In order to cope with the heterogeneous nature of the processing units, StarPU relies on auto-tuned performance models, which we illustrate with dense linear algebra problems.

First, we improve our load balancing facilities by predicting task durations. We will show how such models can be obtained automatically. These models are constructed at runtime, either by collecting the performance of a task during its previous executions, and/or by means of linear-regression-based models for algorithms that are less regular.
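The two kinds of models mentioned above can be sketched in a few lines. The history data, coefficients, and task sizes below are synthetic placeholders, not StarPU's actual calibration data; the point is only the shape of the two predictors.

```python
import numpy as np

# Synthetic execution history mimicking an O(n^3) kernel measured at runtime:
# t ≈ c * n^3 plus a little measurement noise (all values are made up).
rng = np.random.default_rng(1)
sizes = np.array([256.0, 384, 512, 640, 768, 896, 1024])
times = 2e-10 * sizes**3 * (1 + 0.02 * rng.standard_normal(sizes.size))

# Regression-based model: least-squares fit of t = alpha + beta * n^3,
# usable for task sizes that were never executed before.
X = np.column_stack([np.ones_like(sizes), sizes**3])
(alpha, beta), *_ = np.linalg.lstsq(X, times, rcond=None)

def predict(n):
    """Predicted duration of a task of size n under the fitted model."""
    return alpha + beta * n**3

# History-based model: average past runs of the very same task size.
history = {}
def record(n, t):
    history.setdefault(n, []).append(t)
def predict_from_history(n):
    return sum(history[n]) / len(history[n])

record(512, 0.027)
record(512, 0.025)
```

The history model is exact once a task shape has been seen a few times; the regression model covers the long tail of unseen sizes.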

As data movements are critical for overall performance, and since StarPU takes care of performing all data transfers, we also provide the scheduler with a prediction of the data transfer overhead. This overhead is derived from the bus performance, which is benchmarked during the first execution of an application using StarPU.

After discussing the impact of prediction inaccuracies, we will show how StarPU's scheduler takes advantage of these predictions. In the case of LU decomposition, we transparently obtain significant speed improvements and a major reduction of PCI bus activity. This results in better scalability in the case of multiple accelerators.
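How duration and transfer predictions combine in a scheduler can be illustrated with a toy greedy policy. This is not StarPU's actual (more sophisticated) scheduling code; the unit names, task names, and all numbers are made up for illustration.

```python
def schedule(tasks, units, duration, transfer):
    """Greedy scheduler driven by performance models: each task goes to the
    unit minimizing its predicted completion time, where the prediction adds
    a data-transfer overhead whenever the task's data was last touched on a
    different unit."""
    ready_at = {u: 0.0 for u in units}   # predicted time each unit becomes free
    location = {}                        # data_id -> unit currently holding it
    placement = {}
    for task, data in tasks:
        def finish(u):
            move = transfer if location.get(data, u) != u else 0.0
            return ready_at[u] + move + duration[u][task]
        best = min(units, key=finish)
        ready_at[best] = finish(best)
        location[data] = best
        placement[task] = best
    return placement, ready_at

units = ["cpu", "gpu"]
duration = {"cpu": {"t0": 4.0, "t1": 4.0},      # predicted task durations
            "gpu": {"t0": 1.0, "t1": 1.0}}
placement, ready_at = schedule([("t0", "A"), ("t1", "A")], units, duration,
                               transfer=0.5)
```

Because both tasks touch the same data "A", the model steers the second task to the unit already holding the data, avoiding the transfer cost; this is the mechanism behind the reduced PCI bus activity mentioned above.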

Computer Science Department, University of Basel, Switzerland 26


Friday 02.07.2010 10:45-12:25 Parallel Matrix Algorithms and Applications (PMAA’10) Parallel Session 3.1

#66: Code Generation and Autotuning for Stencil Codes on Multi- and Manycore Architectures
Presenter: Matthias Christen ([email protected]), U Basel, CS Dept.
Co-authors: Olaf Schenk, Helmar Burkhart

Stencil calculations constitute an important class of kernels in many scientific computing applications, ranging from simple PDE solvers to constituent kernels in multigrid methods, as well as image processing applications. As the performance of stencil kernels is typically bandwidth limited, stencils greatly benefit from the high bandwidth offered by current GPUs. However, naively coded kernels typically achieve only a fraction of the theoretical performance limit. We present a code generation and auto-tuning framework for stencils targeted at multi- and manycore processors, such as multicore CPUs and GPUs, which makes it possible to generate optimized compute kernels from a mathematical description of the stencil operation and a description of the parallelization strategy.
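A toy analogue of generating a kernel from a mathematical stencil description might look as follows. This is not the authors' framework (which emits optimized C/CUDA); it only shows the idea of turning a {offset: coefficient} description into an executable kernel, assuming offsets within a one-cell halo on a 2D grid.

```python
import numpy as np

def make_stencil(coeffs):
    """Build a 2D stencil kernel from its mathematical description: a mapping
    {(di, dj): coefficient} with |di|, |dj| <= 1. Returns a function applying
    the stencil to the interior of a grid (boundary left untouched)."""
    def apply(u):
        n, m = u.shape
        out = u.copy()
        acc = np.zeros_like(u[1:-1, 1:-1])
        for (di, dj), c in coeffs.items():
            acc = acc + c * u[1 + di:n - 1 + di, 1 + dj:m - 1 + dj]
        out[1:-1, 1:-1] = acc
        return out
    return apply

# Classic 5-point Laplacian stencil, written as a description.
laplace = make_stencil({(0, 0): -4.0, (1, 0): 1.0, (-1, 0): 1.0,
                        (0, 1): 1.0, (0, -1): 1.0})
```

A real generator would additionally emit blocking, vectorization, and GPU thread mappings for the same description, which is where the auto-tuning enters.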

Preliminary results from a biomedical real-world application, namely the simulation of the temperature distribution within the human body as a result of hyperthermia cancer treatment, have shown that substantial speedups compared to naive implementations of the solver can be achieved using code optimization and auto-tuning. Studies have been carried out on multicore CPUs, NVIDIA GPUs, and the Cell Broadband Engine (Sony, Toshiba, IBM).

#67: Linear Algebra Algorithms for Automatic Differentiation
Presenter: Diego Fabregat Traver ([email protected]), RWTH Aachen
Co-authors: Paolo Bientinesi

We consider the problem of generating high-performance algorithms for linear algebra kernels that arise when computing the derivative of matrix operations. Depending on the number of differentiation variables and the functional dependencies between them and the input operands, a variety of different kernels are needed. The challenge is twofold: on the one hand, many different kernels are required for each input operation; on the other hand, the computation of such kernels generally entails dealing with three-dimensional objects. Operations with these objects can normally be mapped onto BLAS in multiple ways. An additional level of complexity comes from the fact that, in order to attain high performance over multiple architectures and settings, not just one but a family of loop-based algorithms has to be generated and optimized. In this talk we discuss how, from a high-level description of a target operation, it is possible to handle all the aforementioned requirements automatically, generating a family of high-performance algorithms.
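The simplest instance of such a derived kernel is the derivative of a matrix product, where the product rule maps directly onto extra GEMM calls. The sketch below (with made-up matrices depending linearly on a scalar x) is only an illustration of that mapping, not the authors' generator.

```python
import numpy as np

# Forward-mode differentiation of C(x) = A(x) @ B(x): the product rule
# dC = dA @ B + A @ dB turns the derivative into two extra GEMMs.
rng = np.random.default_rng(2)
A0, dA = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
B0, dB = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))

def A(x): return A0 + x * dA
def B(x): return B0 + x * dB

def dC(x):
    """Analytic derivative of C(x) = A(x) @ B(x)."""
    return dA @ B(x) + A(x) @ dB

# Central finite difference as an independent check of the product rule.
x, h = 0.3, 1e-6
fd = (A(x + h) @ B(x + h) - A(x - h) @ B(x - h)) / (2 * h)
```

With several differentiation variables, dA and dB become stacks of matrices, which is exactly the three-dimensional-object difficulty the abstract describes.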

3.1B Lecture Room 001 ACCELERATING THE SOLUTION OF LINEAR SYSTEMS AND EIGENVALUE PROBLEMS ON HETEROGENEOUS COMPUTING ENVIRONMENTS

Chair: Costas Bekas

#68: Accelerating PLAPACK on Hybrid CPU-GPU Clusters
Presenter: Francisco D. Igual ([email protected]), Universitat Jaume I
Co-authors: Manuel Fogue, Enrique S. Quintana-Ortí, Robert A. van de Geijn

While existing libraries like ScaLAPACK and PLAPACK provide efficient codes for large-scale dense linear algebra operations on clusters of computers, the incorporation of hardware accelerators into these platforms is a new challenge for library developers. Hardware accelerators are increasingly becoming a cheap alternative to boost applications that are intensive in floating-point arithmetic. It is thus natural that many platforms, from medium-sized clusters to supercomputers, will follow this trend in the near future.

Programming dense linear algebra algorithms for message-passing parallel architectures is a task for experts. Moreover, the use of hardware accelerators to build hybrid clusters may exacerbate the complexity of programming these platforms by introducing a new, separate memory space in the accelerator (the device), different from that of the host (i.e., the main memory of the node). How to "rewrite" message-passing dense linear algebra libraries, like ScaLAPACK and PLAPACK, to deal with hybrid clusters is the question we address in this talk.

A naive approach to retargeting codes from existing dense linear algebra libraries to hybrid clusters is to hide the presence of the hardware accelerators inside the routines that perform the "local" computations. In this approach, matrices are kept in host memory most of the time. When a computation is offloaded to the device, only the data that are strictly necessary are transferred there, exactly at that moment, and the results are retrieved back to host memory immediately after the computation is completed. Data transfers between the memories of host and device can thus be easily encapsulated within wrappers to the BLAS and LAPACK routines that perform the local computations.

While straightforward, this approach can incur a significant number of data transfers between the memories of the host and the device, degrading performance. To tackle this problem, our approach to porting dense linear algebra codes to hybrid clusters places the data in device memory most of the time. A data chunk is brought back to host memory only when it is to be sent to a different node, or when it is involved in an operation that is to be computed by the host. Following the PLAPACK object-based approach, all these data movements are encapsulated within PLAPACK copy and reduce communication operations, thus also leading to an almost transparent port of the contents of this library to hybrid clusters.

By implementing and evaluating common linear algebra operations, we will demonstrate how programmability and high performance can be easily combined without intervention of the programmer or deep modifications of the underlying libraries. We will analyze two different strategies and illustrate their benefits and problems by evaluating the resulting codes on a cluster equipped with 16 computing nodes, each with two NVIDIA Quadro 5800 GPUs.

#69: Fast Reduction to Hessenberg Form on Multicore Architectures
Presenter: Lars Karlsson ([email protected]), Umea University



Friday 02.07.2010 2:00-3:15 Parallel Matrix Algorithms and Applications (PMAA’10) Parallel Session 3.2

Co-authors: Bo Kagstrom

We consider the reduction of an n×n matrix A with real entries to upper Hessenberg form H = Q^T A Q, where Q is orthogonal. This reduction appears as a preprocessing step in several important numerical linear algebra algorithms. For example, the nonsymmetric QR algorithm reduces a Hessenberg matrix to Schur form and thereby reveals all the eigenvalues of A. Another example is solving a linear system (A − σI)x = b for many different shifts σ. The Hessenberg decomposition transforms the problem into a Hessenberg system (H − σI)(Q^T x) = Q^T b, which can be solved using O(n^2) operations instead of the O(n^3) operations that a general matrix A requires.
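The reduce-once, solve-for-many-shifts pattern described above can be checked with a small textbook Householder reduction (this is the classical unblocked algorithm, not the authors' two-stage method):

```python
import numpy as np

def hessenberg_reduce(A):
    """Reduce A to upper Hessenberg form H = Q^T A Q with Householder
    reflections (classical unblocked algorithm)."""
    n = A.shape[0]
    H, Q = A.copy(), np.eye(n)
    for k in range(n - 2):
        x = H[k + 1:, k].copy()
        alpha = np.linalg.norm(x)
        if alpha == 0.0:
            continue
        v = x
        v[0] += np.copysign(alpha, x[0])          # stable sign choice
        v /= np.linalg.norm(v)
        H[k + 1:, :] -= 2.0 * np.outer(v, v @ H[k + 1:, :])   # left update
        H[:, k + 1:] -= 2.0 * np.outer(H[:, k + 1:] @ v, v)   # right update
        Q[:, k + 1:] -= 2.0 * np.outer(Q[:, k + 1:] @ v, v)   # accumulate Q
    return H, Q

rng = np.random.default_rng(3)
n = 30
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
H, Q = hessenberg_reduce(A)            # one O(n^3) reduction ...

for sigma in (0.1, 0.7, -2.5):         # ... reused for every shift
    # (H - sigma*I) is Hessenberg, so a specialized solver needs only O(n^2)
    # work per shift; the generic solve below just checks the algebra.
    x = Q @ np.linalg.solve(H - sigma * np.eye(n), Q.T @ b)
    assert np.allclose((A - sigma * np.eye(n)) @ x, b)
```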

Blocked and parallel algorithms that stably reduce a matrix to Hessenberg form are typically based on Householder reflections applied using the compact WY representation. In particular, the Hessenberg reduction routines in the (Sca)LAPACK libraries are based on such an approach. The performance of these routines is limited, however, since 20% of the floating-point operations are part of large matrix–vector multiplications. These are inefficient and scale poorly on multicore architectures due to the gap between memory bandwidth and processor speed. Consequently, there is a need for alternative Hessenberg reduction algorithms.

By offloading the matrix–vector multiplications, and more, to a GPU and executing the less parallel operations on the host CPU, Tomov and Dongarra recently demonstrated that substantial performance increases are possible on heterogeneous systems. We are currently investigating the complementary approach of abandoning the standard one-stage algorithm in favor of a two-stage approach, which first reduces A to block Hessenberg form and from there to Hessenberg form. An analogous two-stage approach has previously been applied to the more general Hessenberg-triangular reduction by Adlerborn, Dackland, Kressner, and Kagstrom.

Our implementation of the first stage generalizes the standard Hessenberg algorithm and has the following features: (i) two levels of blocking, (ii) multithreading with dynamic scheduling, (iii) panel reductions using recursive QR factorization, and (iv) a mix of coarse- and fine-grained tasks. In particular, our scheduler partially solves the problem of overpartitioning, which is often associated with dynamic scheduling, by careful priority-based scheduling of a mix of coarse-grained tasks, derived from the dominant operations, and fine-grained tasks, stemming from less critical operations. It is possible to further accelerate the first stage on heterogeneous systems by extending the work of Tomov and Dongarra.

The second stage consists of a bulge-chasing procedure similar to the Rutishauser–Schwartz symmetric band reduction algorithm. We recently discovered how this Householder-based algorithm can be reorganized to exploit a memory hierarchy. Each iteration of our new algorithm consists of two steps: (i) determine a set of Householder reflections without updating more than a band around the main diagonal, and (ii) execute the remaining operations. The first step is inefficient and not amenable to parallelization, but cheap in terms of the number of operations. The second step, which contains most of the operations, has plenty of data reuse and is suitable for parallel execution on multicore architectures.

Preliminary performance results show more than a 50% reduction in execution time compared to both LAPACK linked with multithreaded BLAS and ScaLAPACK with one process per core, on a dual-socket Intel Xeon L5420 quad-core machine.

#70: Iterative Refinement of Spectral Decompositions in Heterogeneous Computing Environments
Presenter: Daniel Kressner ([email protected]), ETH Zurich

Assume that the spectral decomposition of a symmetric matrix A is available at a considerably lower machine precision, but the underlying application demands higher accuracy. Examples of this setting of current interest include mixed single/double precision computations on Cell processors and GPUs, as well as mixed double/quad precision on standard CPUs. In both cases, the operations performed in high precision are significantly more costly and should therefore be limited to the bare minimum. This talk will discuss refinement procedures for entire decompositions that make it possible to obtain a highly accurate spectral decomposition, essentially at the cost of the low-precision computation. Particular attention will be paid to situations where some of the eigenvalues of A are clustered. This is partly joint work with Christian Schroeder, TU Berlin.

#71: Multicore Computers Can Protect Your Bones: Rendering the Computerized Diagnosis of Osteoporosis a Routine Clinical Practice
Presenter: Costas Bekas ([email protected]), IBM Zurich Research Lab
Co-authors: A. Curioni

Coupling recent imaging capabilities with microstructural finite element analysis offers a powerful tool to determine bone stiffness and strength. It shows high potential to improve individual fracture risk prediction, a tool much needed in the diagnosis and treatment of osteoporosis, which is, according to the World Health Organization, second only to cardiovascular disease as a leading health care problem. We present a high-performance computational framework that aims to render the computerized diagnosis of osteoporosis an everyday, routine clinical practice. In our setting, the CAT scanner is fitted with a high-performance multicore computing system supported by hardware accelerators such as GPUs. The goal is for images from the scanner to be fed into the computational model, so that several load conditions can be simulated in a matter of minutes. Low time to solution is achieved by means of a novel low-complexity mixed-precision iterative refinement linear solver that can take great advantage of hardware accelerators. The result is an accurate strength profile of the bone.

Friday 02.07.2010 2:00-3:15 Parallel Session 3.2

3.2A Lecture Room 102 AUTOTUNING
Chair: Emmanuel Agullo




#72: Parametric Tiling for Auto Tuning
Presenter: P. Sadayappan ([email protected]), Ohio State University
Co-authors: M. Baskaran, A. Hartono, L.-N. Pouchet, J. Ramanujam, and S. Tavarageri

Tiling is a key transformation for data locality optimization and coarse-grained parallelization. Since the performance of tiled code often varies significantly with tile size, and robust, general analytical approaches for optimal tile size selection do not exist, auto-tuning is an attractive approach for tile size optimization.
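What "parametric" tiling means can be shown on a tiny loop nest: the tile sizes are ordinary runtime parameters rather than constants baked in at compile time, so one generated code version can be timed at many tile sizes. The transpose below is only an illustration (the paper's system transforms general affine C loop nests).

```python
import numpy as np

def transpose_tiled(a, ti, tj):
    """Parametrically tiled loop nest: tile sizes ti and tj are runtime
    parameters, so an autotuner can sweep them without regenerating code."""
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for ii in range(0, n, ti):                      # inter-tile loops
        for jj in range(0, m, tj):
            for i in range(ii, min(ii + ti, n)):    # intra-tile loops
                for j in range(jj, min(jj + tj, m)):
                    out[j, i] = a[i, j]
    return out

a = np.arange(35.0).reshape(5, 7)
t = transpose_tiled(a, 2, 3)   # any tile sizes give the same result
```

The `min(...)` guards handle partial tiles at the boundaries, which is exactly what makes arbitrary runtime tile sizes legal.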

In this paper, we describe an automatic source-to-source tiling transformation system. It can be used to convert untiled input C code (for affine imperfectly nested loops) into parametrically tiled code whose tile sizes can be tuned by auto-tuning systems. Experimental results on several platforms with different compilers are presented that characterize the performance of the generated code.

#73: Online Automatic Tuning of Parallel Sparse Matrix Computations
Presenter: Reiji Suda ([email protected]), Department of Computer Science, University of Tokyo

We have researched mathematical methods for automatic tuning. In our abstraction, automatic tuning is an optimization problem, where a cost-related objective function is to be minimized by optimal choices of tuning parameters under given conditions. A user's a priori knowledge, such as performance models, can make empirical optimization efficient. We have proposed using Bayesian modeling for the quantitative treatment of perturbations in the performance measurements and of inaccuracy in the a priori knowledge. Defining the objective function as the total execution cost, including trial executions for performance evaluation, we propose online automatic tuning, where a suboptimal setting of the tuning parameter is given by Bayesian sequential experimental design. Online automatic tuning is advantageous over offline automatic tuning in that it attains asymptotically optimal performance and that it optimizes the parameter for the conditions of actual usage. For example, if a user treats small data frequently, then the parameter is tuned for small data; frequently used routines are well tuned, and infrequently used routines are not tuned much.
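The flavor of online tuning, where every real execution doubles as a measurement, can be illustrated with a simple confidence-bound rule. This is a generic sketch, not the authors' Bayesian sequential design, and the candidate parameters and costs are invented.

```python
import math
import random

class OnlineTuner:
    """Minimal online tuner: pick the candidate with the smallest optimistic
    (lower-confidence-bound) cost estimate, so exploration decays as evidence
    accumulates and frequently used configurations end up well tuned."""
    def __init__(self, candidates):
        self.stats = {c: [0, 0.0] for c in candidates}   # run count, total time
        self.t = 0
    def choose(self):
        self.t += 1
        def lcb(c):
            n, total = self.stats[c]
            if n == 0:
                return float("-inf")                 # try every candidate once
            return total / n - math.sqrt(2 * math.log(self.t) / n)
        return min(self.stats, key=lcb)
    def record(self, c, elapsed):
        self.stats[c][0] += 1
        self.stats[c][1] += elapsed

# Simulated kernel whose (made-up) true cost is minimized at parameter 32.
random.seed(0)
true_cost = {16: 2.0, 32: 1.0, 64: 1.8}
tuner = OnlineTuner([16, 32, 64])
for _ in range(200):
    c = tuner.choose()
    tuner.record(c, true_cost[c] + random.gauss(0.0, 0.05))
```

After a short exploration phase, nearly all executions use the best parameter, mirroring the property that total execution cost (including trials) is what gets minimized.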

There are two types of automatic tuning for parallel programs. One is local tuning, which tunes the program run on each processor. The other is global tuning, which concerns a tuning parameter that affects all threads running in parallel. Both types of tuning are indispensable for high-performance parallel computing, and this paper focuses on the former, that is, local tuning. We propose two methodologies for online automatic tuning of local tuning parameters: parallel experiments and parallel trials. The method of parallel experiments distributes the candidate values of the tuning parameter over the threads to collect performance information efficiently. The method of parallel trials dedicates one of the processors to performance evaluation. Both methods assume performance correlations among parallel tasks. In this work we extend our Bayesian framework to parallel experiments and parallel trials. We will show that preemption is crucial for efficiency in these methods. Results of applying these methods to the tuning of data structures of parallel sparse matrix computations will be presented. We will also explain an extension of our method to the automatic tuning of sparse iterative solvers, where convergence is not necessarily guaranteed.

#74: Autotuning dense linear algebra libraries on GPUs
Presenter: Emmanuel Agullo ([email protected]), University of Tennessee
Co-authors: Stanimire Tomov, Rajib Nath, Jack Dongarra

Automatic performance tuning (optimization), or autotuning for short, is a technique that has been used intensively on CPUs to automatically generate near-optimal numerical libraries. For example, ATLAS and PhiPAC are used to generate highly optimized BLAS. Given the success of autotuning techniques in generating highly optimized DLA kernels on CPUs, it is interesting to see how the idea can be used to generate near-optimal DLA kernels on modern GPUs. Indeed, work in the area has already shown that autotuning for GPUs is a very practical solution to easily port existing algorithmic solutions to quickly evolving GPU architectures and to substantially speed up even highly hand-tuned kernels.

There are two core components in a complete autotuning system: a code generator and a heuristic search engine. The code generator produces code variants according to a set of predefined, parametrized templates/algorithms, and also applies certain state-of-the-art optimization techniques. The code generator is very application and algorithm specific. For example, in the case of a GEMM (C = AB + C) kernel autotuner, five parameters can critically impact performance and are therefore incorporated in a GEMM code generator. This choice can be extended and enhanced with various optimization techniques: tuning the number of threads per thread block, tuning the choice of whether to put matrix A or B in shared memory, tuning the size and layout of the shared memory to avoid bank conflicts, prefetching into registers, and applying other state-of-the-art loop optimization techniques such as circular loop skewing to avoid partition camping in GPU global memory. The code generator also takes the precision as a parameter, in order to apply the same mechanism to various precisions. The heuristic search engine runs the variants produced by the code generator and finds the best one using a feedback loop; e.g., the performance results of previously evaluated variants are used as guidance for the search over currently unevaluated variants.

One way to implement autotuning is to generate a small number of variants for some matrices of typical sizes at installation time, and to choose the best one at run time, depending on the input matrix size and the high-level DLA algorithm.

There are several success stories for the autotuner. For a variation of DGEMM on square matrices, the autotuner was able to find a kernel which runs 15% faster than the CUBLAS 2.3 DGEMM. For another variation, SGEMM, CUBLAS suffers performance deterioration for certain problem sizes (e.g., by up to 50% of the total performance). Interestingly, the autotuner was successful in finding a better kernel by applying the circular loop skewing optimization. Also, in the case of a Level 2 BLAS, in




particular SSYMV, with a new recursive-block-based algorithm the autotuner was able to attain slightly above 100 GFlop/s on a GTX280, as compared to 2 GFlop/s with CUBLAS.

Moreover, in the area of DLA it is very important to have high-performance GEMMs involving rectangular matrices. This is dictated by the fact that the algorithms are blocked, and the blocking size is in general small, resulting in GEMMs involving rectangular matrices. In this case, the autotuner found kernels that significantly outperform (by up to 27%) the DGEMM from CUBLAS 2.3.

These results support experiences and observations by others on how sensitive the performance of a GPU is to the formulation of a kernel, and that an enormous amount of well-thought-out experimentation and benchmarking is needed in order to optimize performance.

3.2B Lecture Room 001 ACCELERATING THE SOLUTION OF LINEAR SYSTEMS AND EIGENVALUE PROBLEMS ON HETEROGENEOUS COMPUTING ENVIRONMENTS

Chair: Daniel Kressner

#75: Choosing a Reduction Tree for Communication Avoiding QR
Presenter: Julien Langou ([email protected]), University of Colorado Denver

We re-explain Communication-Avoiding QR and the importance of the choice of the reduction tree in the factorization. The reduction tree influences the number of communications, the type of communications, and the parallelization of the algorithm. A flat tree is best for the sequential algorithm and minimizes communication from main memory to the computing unit. A binary tree minimizes communication for tall and skinny matrices and enables great parallelism for them. A flat tree enables great parallelism for square matrices. Depending on the shape of the matrix and the underlying architecture, we are led to consider a variety of different trees.

#76: The FEAST eigenvalue parallel algorithm and solver
Presenter: Eric Polizzi ([email protected]), University of Massachusetts, Amherst

The FEAST algorithm for solving the symmetric eigenvalue problem on a given search interval takes its inspiration from the density-matrix representation and contour integration techniques in quantum mechanics. In contrast to other traditional eigenvalue algorithms, it is free from orthogonalization procedures, and its main computational tasks consist of solving very few independent inner complex linear systems with multiple right-hand sides and one reduced eigenvalue problem orders of magnitude smaller than the original one. The FEAST algorithm offers many important capabilities for achieving high performance, robustness, accuracy, and scalability on parallel architectures. A general-purpose FEAST solver package has been developed and released to the public; it includes both reverse communication interfaces and ready-to-use predefined interfaces for dense, banded, and sparse systems. The current version of the FEAST package focuses on solving symmetric eigenvalue problems (real symmetric or complex Hermitian systems) on a shared-memory architecture (e.g., one multicore node). An efficient parallel implementation of FEAST can be addressed at three different levels: (i) many search intervals can be run independently; (ii) each linear system can be solved independently; (iii) the linear systems can be solved in parallel. In this work, we propose to combine points (ii) and (iii) above into a hybrid parallel implementation of FEAST using a distributed outer layer over shared-memory direct or iterative system solvers. Numerical simulations and scalability results for electronic structure and time-dependent quantum problems will be presented and discussed.

#77: The challenge of the solve phase of a multicore solver
Presenter: Jonathan Hogg ([email protected]), STFC
Co-authors: Jennifer Scott

A classical problem in scientific computing is the efficient and accurate solution of large sparse linear systems. A direct solver for such systems normally comprises four phases:

Ordering: determine an a priori pivot order that attempts to minimize fill in the factors.
Analysis: perform a symbolic factorization to set up the data structures and prepare for the numerical factorization.
Factorize: perform the numerical factorization.
Solve: perform the forward and backward substitutions using the computed factors.

In many applications, the ordering and analysis phases are performed once, but multiple matrices with the same pattern are factorized, and each factorization is used for several solves. Traditionally, the factorize phase has been the focus of attention, taking ten to one hundred times as much time as the solve phase and being computation bound. The solve phase has received relatively little attention, as it is perceived as simple and is memory bound. With the advent of multicore chips, characterized by a large number of computation cores sharing memory resources, this balance has changed. While the factorize phase can be significantly sped up, the solve phase presents a greater challenge.

In this talk, we illustrate the problem through experiments and explore a number of different approaches that try to improve the performance of the solve phase on multicore machines. In particular, we consider extending the DAG-based approach recently used for the factorize phase on multicore machines [1] to the solve phase, and using in-memory compression techniques to limit memory bandwidth. Numerical experiments are performed on large-scale problems from practical applications using the HSL sparse Cholesky multicore solver HSL MA87.

[1] J.D. Hogg, J.K. Reid and J.A. Scott (2009). Design of a multicore sparse Cholesky factorization using DAGs. Technical Report RAL-TR-2009-027, Rutherford Appleton Laboratory.
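One standard way to expose parallelism in the solve phase's dependency DAG is level-set scheduling: rows whose dependencies all lie in earlier levels can be processed simultaneously. The sketch below (a generic illustration on a dense-stored triangle, not HSL MA87's implementation) computes the levels of a forward substitution and solves level by level.

```python
import numpy as np

def levels(L):
    """Dependency levels of the forward-substitution DAG of a lower
    triangular matrix: row i depends on every row j < i with L[i, j] != 0.
    Rows sharing a level are mutually independent."""
    n = L.shape[0]
    lev = np.zeros(n, dtype=int)
    for i in range(n):
        deps = [lev[j] for j in range(i) if L[i, j] != 0]
        lev[i] = 1 + max(deps) if deps else 0
    return lev

def solve_by_levels(L, b):
    """Forward substitution scheduled level by level (a sequential stand-in
    for executing each level's rows in parallel)."""
    n = L.shape[0]
    x = np.zeros(n)
    lev = levels(L)
    for l in range(lev.max() + 1):
        for i in np.where(lev == l)[0]:       # independent within a level
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x
```

The number of levels, not the matrix dimension, bounds the critical path; sparse triangular factors with few, wide levels parallelize well, while long thin level structures are exactly where the solve phase remains memory bound and hard to speed up.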



Parallel Matrix Algorithms and Applications (PMAA’10) Authors Index

Author index

Adelmann, A., 22
Agullo, E., 18, 26, 29
Arbenz, P., 15, 17, 22
Atanassov, E., 11
Auckenthaler, T., 4
Auer, B.O., 9
Augonnet, C., 27

Baboulin, M., 16
Bader, M., 18
Basermann, A., 19
Baskaran, M., 29
Becka, M., 8
Bekas, C., 29
Bell, N., 5
Bencheva, G., 15
Benedig, M., 21
Bientinesi, P., 4, 27
Bisseling, R. H., 9
Blaheta, R., 22
Borstnik, U., 15
Brzobohaty, T., 22
Bujanovic, Z., 7
Burkhart, H., 8, 27

Callant, J., 23
Chalandra, H., 24
Chevalier, C., 13
Choi, J., 3
Christen, M., 27
Cohen, J., 3
Curioni, A., 29

D’Ambra, P., 15
Davidson, A., 3
Dongarra, J., 26, 29, 31
Dostal, Z., 22
Douglas, C., 12
Durchova, M., 11

Erlebacher, G., 6

Filippone, S., 15
Flaig, C., 17
Fogue, M., 27
Freear, S., 11
Futamura, Y., 10

Goddeke, D., 6
Gotze, J., 7
Garland, M., 5
Gaubert, S., 13
Geijn, R. A., 27
Gijzen, M.B., 20
Gilpin, A., 10
Giraud, L., 18
Gordon, D., 24
Gordon, R., 24
Grama, A., 14, 25
Gratton, S., 24

Grigori, L., 13
Guermouche, A., 18
Gutheil, I., 6

Henon, P., 11
Haase, G., 12
Hagihara, K., 21
Haidar, A., 18
Hartono, A., 29
Hernandez, V., 17
Hogg, J., 21, 31
Hutter, J., 15

Igual, F.D., 27
Ineichen, Y., 22
Ino, F., 21
Ivanovska, S., 11
Iwata, J., 10

Jonsthovel, T.B., 20

Kagstrom, B., 28
Kamer, K., 13
Karaivanova, A., 11
Karlsson, L., 28
Kersken, H.P., 19
Kollias, G., 14, 25
Komatitsch, D., 6
Kosturski, N., 18
Kozubek, T., 22
Kressner, D., 28
Kurics, T., 19
Kuroiwa, N., 15

Lago, R., 24
Lang, B., 4
Langguth, J., 8
Langou, J., 30
Liebmann, M., 12

Manne, F., 8
Markopoulos, A., 22
Matsuda, T., 21
Maurer, D., 25
Mechea, D., 6
Mehl, M., 21
Mei, T.X., 11

Namyst, R., 27
Nataf, F., 23
Nath, R., 26, 29, 31
Neic, A., 12
Notay, Y., 12, 23

Obrist, D., 15
Oksa, G., 8
Owens, J. D., 3

Pahr, D. H., 17
Patwary, M., 8
Pellegrini, F., 9
Petschow, M., 4




Pinel, X., 24
Polizzi, E., 30
Pouchet, L.-N., 29

Quintana-Ortí, E.S., 27

Rahnema, K., 18
Ralha, R., 16
Ramanujam, J., 29
Robert, Y., 13
Rojek, K., 20
Roman, J., 18
Roman, J. E., 17

Sadayappan, P., 29
Sakurai, T., 10
Sandholm, T., 10
Sathe, M., 8
Scarpas, A., 20
Schenk, O., 8, 27
Scott, J., 21, 31
Sedlacek, M., 23
Serafino, D., 15
Sharify, M., 13
Singh, A., 3
Sokol, V., 22
Strzodka, R., 5
Suda, R., 29
Sun, C., 7
Szustak, L., 20
Szydlarski, M., 23

Tadano, H., 10
Tavarageri, S., 29
Thibault, S., 27
Thies, J., 19
Tomas, A., 17
Tomov, S., 26, 29, 31
Traver, D.F., 27

Ucar, B., 13

Vajtersic, M., 8
Vasseur, X., 24
Vigh, C., 18
Volkov, V., 3
Vondrak, V., 22
Vuduc, R., 3
Vuik, C., 20
Vutov, Y., 11, 18

Waldherr, K., 4
Wang, W., 3
Warburton, T., 5
Weber, V., 15
Weinzierl, T., 21
Wieners, C., 25
Willems, P., 4
Wubs, F., 19
Wyrzykowski, R., 20

Yzelman, A. N., 9

Zhang, Y., 3
Zhou, Y., 11
Zysset, P., 17


