Date post: | 26-Dec-2015 |
High Performance Computing - CISC 811
Dr Rob Thacker
Dept of Physics (308A)
thacker@physics
Today’s Lecture
  Part 1: Motivations and benefits, serial libraries
  Part 2: Parallel libraries, ACTS collection
  Part 3: Netlib, HPL
HPC Libraries
Part 1: HPC Libraries
  Motivations, benefits
  Serial HPC libraries
Is our current programming model viable?
"We need to move away from a coding style suited for serial machines, where every macrostep of an algorithm needs to be thought about and explicitly coded, to a higher-level style, where the compiler and library tools take care of the details. And the remarkable thing is, if we adopt this higher-level approach right now, even on today's machines, we will see immediate benefits in our productivity."
W. H. Press and S. A. Teukolsky, 1997, "Numerical Recipes: Does This Paradigm Have a Future?"
Motivations, concerns
In developing large applications, three significant issues must be addressed:
  Productivity
    Time to the first solution (prototype) and time to solution (production)
  Complexity
    Increasingly sophisticated models, may need to link to other solvers
  Performance
    Increasingly complex algorithms, architectures
What strategies should be applied?
  Some appear mutually exclusive: best performance would reduce productivity if you tailor every single part of the code
Unavoidable tension
[Figure: spectrum from low level programming to high level programming]
  Scientists frequently need highest performance
  Algorithms have long lifetimes (longer than hardware)
Library approach
Why not use libraries? (provided they suit your problem)
  Optimization: many library functions are assembly optimized
  Well tested: libraries are used by far more people than your local research group
  Support: commercial packages frequently come with online forums or email support
Main drawback: loss of understanding of the code's inner workings
  Is this really an issue? 99.9% of the software you use you didn’t write
  Also you are forced into using the library interface, but usually this is not a significant concern
Secondary drawback may be cost
Library ownership
Three main possibilities
  Public Domain
    Most common for numerical software
  Commercial
    Becoming more common as universities attempt to gain from intellectual property
  Vendor Specific
    Many of the big vendors release platform specific optimized versions of the larger public domain packages
Potential benefits of libraries
  Allows easier collaboration (provided library is freely available to everyone!)
  Software using GPL’d libraries can be released publicly as source-code
    You can contribute back improvements to the user community
  Source based libraries can be adapted to your needs
  Bottom line is that your time to solution is reduced!
Bugs are a serious issue…
On June 4, 1996, an Ariane 5 rocket launched by the European Space Agency exploded just forty seconds after its lift-off from Kourou, French Guiana. The rocket was on its first voyage, after a decade of development costing $7 billion. The problem was a software error in the inertial reference system. Specifically, a 64 bit floating point number relating to the horizontal velocity of the rocket with respect to the platform was converted to a 16 bit signed integer. The value exceeded 32,767, the largest 16 bit signed integer, so the conversion failed.
On August 23, 1991, the first concrete base structure for the Sleipner A platform sprang a leak and sank under a controlled ballasting operation during preparation for deck mating in Gandsfjorden outside Stavanger, Norway. The post accident investigation traced the error to an inaccurate finite element approximation of the linear elastic model of the tricell (using the popular finite element program NASTRAN). The shear stresses were underestimated by 47%, leading to insufficient design. In particular, certain concrete walls were not thick enough.
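The Ariane 5 failure above comes down to a narrowing conversion that was not range checked. A minimal C sketch of a guarded conversion is given here; the name checked_to_int16 is hypothetical and nothing in it is taken from the flight software.

```c
#include <stdint.h>
#include <math.h>

/* Hypothetical helper: convert a double to a 16 bit signed integer only
   when the value fits. Returns 1 on success, 0 if x is NaN or out of range. */
int checked_to_int16(double x, int16_t *out)
{
    if (isnan(x) || x < (double)INT16_MIN || x > (double)INT16_MAX)
        return 0;              /* caller must handle the failure explicitly */
    *out = (int16_t)x;         /* truncates toward zero, now known to be safe */
    return 1;
}
```

The point of the design is that the out-of-range case becomes a return value the caller has to look at, rather than undefined behaviour or an unhandled exception.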
Something to think about…
~ 20 years ago: 1x10^6 Floating Point Ops/sec (Mflop/s)
  Scalar based
~ 10 years ago: 1x10^9 Floating Point Ops/sec (Gflop/s)
  Vector & shared memory computing, bandwidth aware
  Block partitioned, latency tolerant
~ Today: 1x10^12 Floating Point Ops/sec (Tflop/s)
  Highly parallel, distributed processing, message passing, network based
  Data decomposition, communication/computation
Coming soon: 1x10^15 Floating Point Ops/sec (Pflop/s)
  Many more levels of memory hierarchy, combination of grids & HPC
  More adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
Application codes will need to address these issues
The Evolving Performance Gap
Peak performance is skyrocketing
  In 1990s, peak performance increased 100x; in 2000s, it will increase 1000x
But
  Efficiency for many science applications declined from 40-50% on the vector supercomputers of 1990s to as little as 5-10% on parallel supercomputers of today
Need research on
  Mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
  More efficient programming models for massively parallel supercomputers
[Figure: peak performance and real performance in Teraflops (0.1 to 1,000) for 1996, 2000 and 2004, with the performance gap marked between the two curves]
We don’t want everyone working on the same problem though!
Notable Public Domain Numerical Libraries
  LAPACK: Linear equations, eigenproblems
  BLAS: Fast linear algebra kernels
  LINPACK: Linear equation solving (now incorporated in LAPACK)
  ODEPACK: Ordinary d.e. solving (see also the DASSL toolkit)
  QUADPACK: Numerical quadrature
  ITPACK: Sparse problems
  PIM: Linear systems
Check out mathtools.net for a vast list of libraries
Basic Linear Algebra Subprograms (BLAS)
FORTRAN library of simple subroutines which can be used to build more sophisticated LA programs (dates back to 1970’s)
BLAS is divided into four types and three levels
  Single, double, complex and double complex
  Level 1 (vector-vector operations)
  Level 2 (matrix-vector operations)
  Level 3 (matrix-matrix operations)
Functions are prefixed with the type of the variables:
  s, d, c, or z for single, double, complex, or double complex (z)
BLAS routines
Some of the BLAS 1 subprograms are:
  xCOPY - copy one vector to another
  xSWAP - swap two vectors
  xSCAL - scale a vector by a constant
  xAXPY - add a multiple of one vector to another
  xDOT - inner product
  xASUM - 1-norm of a vector
  xNRM2 - 2-norm of a vector
  IxAMAX - find the index of the entry of largest absolute value in a vector
Levels 2 & 3
Some of the BLAS 2 subprograms are:
  xGEMV - general matrix-vector multiplication
  xGER - general rank-1 update
  xSYR2 - symmetric rank-2 update
  xTRSV - solve a triangular system of equations
Some of the BLAS 3 subprograms are:
  xGEMM - general matrix-matrix multiplication
  xSYMM - symmetric matrix-matrix multiplication
  xSYRK - symmetric rank-k update
  xSYR2K - symmetric rank-2k update
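To make the Level 1 operations concrete, here is a minimal plain-C sketch of what DAXPY and DDOT compute. The names ref_daxpy and ref_ddot are hypothetical, the sketch assumes unit stride, and it is unoptimized reference code rather than anything from a BLAS distribution.

```c
#include <stddef.h>

/* y <- alpha*x + y, the operation performed by DAXPY (unit stride only) */
void ref_daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* returns the inner product x . y, the operation performed by DDOT
   (unit stride only) */
double ref_ddot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

A tuned BLAS computes the same results but blocks, unrolls and vectorizes these loops for the target machine, which is where the speed advantage comes from.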
Tuning Advantages
[Figure: PHiPAC performance for the matrix multiply C = A*B]
Linear algebra is always faster using an optimized library!
BLAS and C
CBLAS is a C version of the libraries
  Available from Netlib
However, you can still call FORTRAN versions from C
  You will need to declare the involved BLAS routine as “extern”:
  extern void dgemv_(char *trans, int *m, int *n, double *alpha, double *a, int *lda, double *x, int *incx, double *beta, double *y, int *incy);
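Two things in the dgemv_ declaration tend to trip up C programmers: every argument is passed by pointer, and the matrix is expected in column-major (FORTRAN) order. The sketch below is a hypothetical plain-C stand-in, colmajor_dgemv, that mimics those conventions for the no-transpose case with positive increments. It is not the library routine, and the trailing underscore in the real symbol name depends on the FORTRAN compiler.

```c
/* Hypothetical plain-C stand-in for dgemv_ with trans = 'N':
   y <- alpha*A*x + beta*y, with A stored column by column (FORTRAN order).
   Every argument is a pointer, as in the FORTRAN calling convention.
   Only positive increments are handled in this sketch. */
void colmajor_dgemv(const int *m, const int *n, const double *alpha,
                    const double *a, const int *lda,
                    const double *x, const int *incx,
                    const double *beta, double *y, const int *incy)
{
    for (int i = 0; i < *m; i++) {
        double sum = 0.0;
        for (int j = 0; j < *n; j++)
            sum += a[i + j * (*lda)] * x[j * (*incx)];   /* element A(i,j) */
        y[i * (*incy)] = (*alpha) * sum + (*beta) * y[i * (*incy)];
    }
}
```

For example, the 2 x 2 matrix with rows (1, 2) and (3, 4) has to be stored as {1, 3, 2, 4} before it is handed to a FORTRAN routine.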
VSIPL
www.vsipl.org
Vector Signal and Image Processing Library
Origins in defence contracts to produce an API for embedded programming
Developed in C, bindings for C++ under development
Main functionality
  Vector based frequency domain analysis routines
LAPACK
BLAS is used as the building block for the Linear Algebra Package, LAPACK
Website describing and distributing a portable version of the library: http://www.netlib.org/lapack/
  Includes online manual http://www.netlib.org/lapack/lug/index.html
  Vendors frequently distribute their own assembly level optimized versions of the library (e.g. Intel MKL and AMD ACML)
This library consists of a set of higher level linear algebra functions with interface described at: http://www.netlib.org/lapack/individualroutines.html
LAPACK
There are a very large number of linear algebra subroutines available in LAPACK
All follow an XYYZZZ format, where X denotes the datatype, YY the type of matrix and ZZZ describes the computation performed. For example:
  dgetrf is used to compute the LU factorization of a matrix (d=double, ge=general, trf=triangular factorization)
  dgetrs uses an LU factorization from dgetrf to solve a system
  dgetri uses the LU above to compute the inverse of a matrix
  dgesv is essentially a combined call to dgetrf and dgetrs
  dgeev computes the eigenvalues of a matrix
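The factor-then-solve split between dgetrf and dgetrs can be illustrated with a small unblocked LU factorization with partial pivoting. This is only a sketch of the idea: lu_factor and lu_solve are hypothetical names, the matrix is row-major with zero-based pivots, and LAPACK itself uses column-major storage and blocked algorithms built on BLAS.

```c
#include <math.h>

/* In-place LU factorization with partial pivoting of an n x n row-major
   matrix a, so that P*A = L*U. piv[k] records the row swapped with row k.
   Returns 0 on success, -1 if a zero pivot is found (singular matrix). */
int lu_factor(int n, double *a, int *piv)
{
    for (int k = 0; k < n; k++) {
        int p = k;                          /* find largest pivot in column k */
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i * n + k]) > fabs(a[p * n + k]))
                p = i;
        if (a[p * n + k] == 0.0)
            return -1;
        piv[k] = p;
        if (p != k)                         /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
        for (int i = k + 1; i < n; i++) {   /* eliminate below the pivot */
            a[i * n + k] /= a[k * n + k];
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}

/* Solve A*x = b using the factors from lu_factor; b is overwritten by x. */
void lu_solve(int n, const double *a, const int *piv, double *b)
{
    for (int k = 0; k < n; k++) {           /* apply the row swaps to b */
        double t = b[k];
        b[k] = b[piv[k]];
        b[piv[k]] = t;
    }
    for (int i = 1; i < n; i++)             /* forward substitution, unit L */
        for (int j = 0; j < i; j++)
            b[i] -= a[i * n + j] * b[j];
    for (int i = n - 1; i >= 0; i--) {      /* back substitution with U */
        for (int j = i + 1; j < n; j++)
            b[i] -= a[i * n + j] * b[j];
        b[i] /= a[i * n + i];
    }
}
```

Factoring once and calling the solve step for each new right-hand side is the same economy that a dgetrf call followed by repeated dgetrs calls offers; a combined driver in the style of dgesv would simply call both.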
Gnu Scientific Library
http://www.gnu.org/software/gsl/
GSL is a numerical library for C and C++ programmers
Free software, available under GNU GPL
The library provides a wide range of mathematical routines
  e.g. random number generators, special functions, least-squares fitting
  There are over 1000 functions in total
The project was conceived in 1996 by Dr M. Galassi and Dr J. Theiler of Los Alamos National Laboratory.
GSL Features
The library uses an object-oriented design
  Different algorithms can be plugged-in easily or changed at run-time without recompiling the program
It is intended for ordinary scientific users
  Users with a knowledge of C programming will be able to use the library quickly
Interface is designed to be simple to link into very high-level languages, such as GNU Guile or Python
Library is thread-safe
Many of the routines are C reimplementations of FORTRAN routines (e.g. FFTPACK)
  Modern coding conventions and optimizations have been applied
Full list of functions
  Complex Numbers, Roots of Polynomials, Special Functions
  Vectors and Matrices, Permutations, Sorting
  BLAS Support, Linear Algebra, Eigensystems
  Fast Fourier Transforms, Quadrature, Random Numbers
  Quasi-Random Sequences, Random Distributions, Statistics
  Histograms, N-Tuples, Monte Carlo Integration
  Simulated Annealing, Differential Equations, Interpolation
  Numerical Differentiation, Chebyshev Approximation, Series Acceleration
  Discrete Hankel Transforms, Root-Finding, Minimization
  Least-Squares Fitting, Physical Constants, IEEE Floating-Point
Compiling and Linking
The library header files are installed in their own `gsl' directory
  Include statements need `gsl/' directory prefix:
  #include <gsl/gsl_math.h>
Compile objects first: gcc -c myprog.c
Then link: gcc myprog.o -lgsl -lgslcblas -lm
FFTW
http://www.fftw.org
“Fastest Fourier Transform in the West”
Authored by Frigo and Johnson at MIT
C subroutine library for discrete Fourier transforms
  Portable
  Multiple dimensions
  Arbitrary input sizes, real and complex transforms
    Small prime factors are best though
  Discrete cosine and sine transforms
  Parallel versions available (both shared memory (pthreads) and distributed memory (MPI))
  C and FORTRAN API
  Supports SIMD extensions (e.g. SSE)
Self-tuning
  Contains many different FFT algorithms and the optimal one is chosen at runtime
Has undergone a number of evolutions, and is now at version 3.0
Won 1999 J. H. Wilkinson Prize for Numerical Software
Using FFTW
Need to include header files
  #include <fftw3.h> (C) or include “fftw3.f” (FORTRAN)
Must also link to libraries
  -lfftw3 -lm, but may also need to specify path (installation dependent)
Having created arrays (“in” and “out”), must create a “plan”
  C: plan=fftw_plan_dft_1d(N,in,out,FFTW_FORWARD,FFTW_ESTIMATE)
  FORTRAN: call dfftw_plan_dft_1d(plan,N,in,out,FFTW_FORWARD,FFTW_ESTIMATE)
  Precise plan routine will depend upon the FFT operation you wish to perform
  Call to plan allows system to evaluate architecture and transform and then optimize the algorithm to be used in the FFT
Having created the plan the transform is executed by calling fftw_execute(plan)
See the fftw website for precise details
Parallel FFTW: Shared Memory
FFTW includes a pthreads based SMP library and can also be compiled with OpenMP support on platforms where it is available
  On HPCVL it is compiled with OpenMP support
Threaded version requires additional memory
  Call fftw_init_threads() before using the threaded version
SMP parallel plans require knowledge of how many threads are going to be used
  Call fftw_plan_with_nthreads(nthreads)
  Note that since plans are specific to the number of threads, if you change the number of threads you must create a new plan
When work is completed you must call fftw_cleanup_threads() to deallocate memory for threads
At linking stage must also include parallel library
  -lfftw3_threads
Note only fftw_execute is using a parallel region
Parallel FFTW: MPI
Only available for older 2.x libraries which have a different API
MPI data decomposition is “slab” based
  For 3d arrays this is potentially limiting: you can use at most L processors if you have an L^3 array
  However, communication costs are high, so this is not often a significant barrier
Uses MPI_Alltoall primitive which can occasionally lead to poor performance (depends on MPI implementation)
Must enable support when FFTW is compiled and must also link to -lfftw_mpi
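The limit of L processors for an L^3 array follows directly from handing out whole planes. The sketch below shows one simple way to split planes into contiguous slabs; slab_range is a hypothetical helper and the even-split rule is an assumption, so real code should ask the library for its own local sizes rather than rely on this.

```c
/* Hypothetical helper: split nx planes of an nx x ny x nz array into
   contiguous slabs over nprocs ranks. Ranks beyond nx receive no planes. */
void slab_range(int nx, int nprocs, int rank, int *start, int *count)
{
    int base = nx / nprocs;     /* planes every rank receives */
    int extra = nx % nprocs;    /* first 'extra' ranks receive one more */
    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}
```

With nx = 4 and nprocs = 8, ranks 4 to 7 receive a count of zero, which is the idle-processor situation the slide warns about.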
Example code
#include <fftw_mpi.h>

int main(int argc, char **argv)
{
    const int NX = ..., NY = ...;
    fftwnd_mpi_plan plan;
    fftw_complex *data;

    MPI_Init(&argc,&argv);

    /* create a plan for a forward 2d transform distributed over all processes */
    plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, NX, NY, FFTW_FORWARD, FFTW_ESTIMATE);

    ...allocate and initialize data...

    /* execute the transform on the local slab of data */
    fftwnd_mpi(plan, 1, data, NULL, FFTW_NORMAL_ORDER);

    ...

    fftwnd_mpi_destroy_plan(plan);
    MPI_Finalize();
}
(Old) Performance results on T3D for MPI transform (3d, complex)
AMD Core Math Library (ACML)
Developed in collaboration with Numerical Algorithms Group (NAG)
  Latest version = 3.1
Distribution via registration (but they have never sent me spam!)
32 bit (Athlon) and 64 bit (Opteron) versions
Cannot be linked with Intel 8.1 compiler though: turf war!
  Forces you to use Intel MKL
Exploits knowledge of cache architecture to improve execution speed
ACML Components: Linear Algebra, FFTs
Basic Linear Algebra Subprograms (BLAS)
  Level 1 (vector-vector operations)
  Level 2 (matrix-vector operations)
  Level 3 (matrix-matrix operations)
  Plus routines for sparse vectors
Linear Algebra PACKage (LAPACK)
  28 (threaded) routines
  Use BLAS to perform complex operations
Scalable LAPACK (ScaLAPACK, MPI parallel LAPACK) also included
  Must provide your own MPI implementation (see part 2)
FFTs
  1D, 2D, single and double precision plus all combinations of real-to-complex etc
C and FORTRAN APIs
Intel Math Kernel Library
Version 9.0 recently released
Free for non-commercial use
  Students come under this banner, but faculty do not!
  Graduate students are becoming a grey area…
Online support forum
Library functions:
  Linear Algebra - BLAS and LAPACK
  Linear Algebra - PARDISO Sparse Solver
  Discrete Fourier Transforms
  Vector Math Library
  Vector Statistical Library
    random number generators
Cluster Math Kernel Library
Adds ScaLAPACK and parallel BLAS routines to MKL
Roughly 20% performance improvement over Netlib distribution of ScaLAPACK.
PESSL, SCSL & CXML
SGI provide their SCSL library free
  “Scientific Computing Software Library”
  Provides same basic features as ACML (linear algebra)
  Ported to Altix systems, but need to compare speed to Intel MKL before using
PESSL is IBM’s parallel library
  “Parallel Engineering and Scientific Subroutine Library”
  Again, same basic features as ACML, and also includes random number generator
CXML is Compaq’s library for the Alphaserver
Random Number Generators
Numerical Recipes RAN2 and RAN3 are both reasonable RNGs
  Note RAN3 does fail some of the more esoteric tests
GSL library provides over 40 different generators
  Includes Knuth’s algorithms
  Mersenne Twister as well
Mersenne Twister
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
Developed by Matsumoto and Nishimura
Period is 2^19937-1 (~10^6000)
  623-dimensional equidistribution property is assured
Fast generation
  The C rand() algorithm has since been replaced, so there is now not much difference in speed
Efficient use of the memory
  The implemented C-code mt19937.c consumes only 624 words of working area
Currently the generator of choice for most problems (except crypto)
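The 624 words of working area are easy to see in code. The following is a compact sketch of the 32 bit MT19937 recurrence and tempering; the names mt_seed and mt_next are hypothetical, and while the constants follow the published algorithm, this is not the authors' mt19937.c.

```c
#include <stdint.h>

#define MT_N 624                 /* state size: the 624 words of working area */
#define MT_M 397

static uint32_t mt_state[MT_N];
static int mt_index = MT_N;      /* MT_N means the state must be regenerated */

/* Seed the state with the standard initialization recurrence.
   Call this before the first call to mt_next. */
void mt_seed(uint32_t seed)
{
    mt_state[0] = seed;
    for (int i = 1; i < MT_N; i++)
        mt_state[i] = 1812433253u * (mt_state[i - 1] ^ (mt_state[i - 1] >> 30))
                      + (uint32_t)i;
    mt_index = MT_N;
}

/* Return the next 32 bit output, regenerating all 624 words when exhausted. */
uint32_t mt_next(void)
{
    if (mt_index >= MT_N) {
        for (int k = 0; k < MT_N; k++) {
            /* combine the top bit of word k with the low 31 bits of word k+1 */
            uint32_t y = (mt_state[k] & 0x80000000u)
                       | (mt_state[(k + 1) % MT_N] & 0x7fffffffu);
            mt_state[k] = mt_state[(k + MT_M) % MT_N] ^ (y >> 1)
                        ^ ((y & 1u) ? 0x9908b0dfu : 0u);
        }
        mt_index = 0;
    }
    uint32_t y = mt_state[mt_index++];
    y ^= y >> 11;                        /* tempering */
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```

For real work prefer a maintained implementation such as the one shipped with GSL. The global state here also makes the sketch unsuitable for threaded code.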
Summary Part 1
Libraries offer a number of benefits
  Optimization
  Robustness
  Portability
  Time to solution improvements
Part 2: Parallel Libraries
  BLACS
  ACTS collection
  ScaLAPACK
BLACS
Basic Linear Algebra Communication Subprograms
Conceptual aid in design and coding (design tool)
Associate widely known mnemonic names with communication
  Improved readability and provides standard interface
  “Self documentation”
BLACS data decomposition
2d processor grid (process numbers, with grid row and column indices):
         col 0  col 1  col 2
  row 0    0      1      2
  row 1    3      4      5
  row 2    6      7      8
Types of BLACS routines: point-to-point communication, broadcast, combine operations and support routines.
Communication modes:
  All processes in row
  All processes in column
  All grid processes
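The relationship between a process number and its position in the 2d processor grid can be written down in a few lines. The sketch below assumes a row-major grid, as in the 3 x 3 example; grid_coords and grid_pnum are hypothetical names, and a real code would query BLACS for this mapping instead of computing it by hand.

```c
/* Hypothetical helper for a row-major process grid with npcol columns:
   process number p sits at row p / npcol, column p % npcol. */
void grid_coords(int p, int npcol, int *myrow, int *mycol)
{
    *myrow = p / npcol;
    *mycol = p % npcol;
}

/* Inverse mapping: process number of grid position (row, col). */
int grid_pnum(int row, int col, int npcol)
{
    return row * npcol + col;
}
```

In a 3 x 3 grid, for instance, grid position (1, 0) corresponds to process 3 under this mapping.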
Communication Routines
Send/Receive
  Send (sub)matrix from one process to another:
    _xxSD2D(ICTXT, [UPLO,DIAG], M, N, A, LDA, RDEST, CDEST)
    _xxRV2D(ICTXT, [UPLO,DIAG], M, N, A, LDA, RSRC, CSRC)
  _ denotes datatype:
    I (integer), S (single), D (double), C (complex), Z (double complex)
  xx denotes matrix type
    GE = general, TR = trapezoidal
Point-to-Point example
Process (0,0) sends the 5-element vector X to process (1,0), which receives it into Y:
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )

      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         CALL DGESD2D( ICTXT, 5, 1, X, 5, 1, 0 )
      ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
         CALL DGERV2D( ICTXT, 5, 1, Y, 5, 0, 0 )
      END IF
Contexts
The concept of a communicator is embedded within BLACS as a “context”
Contexts are thus the mechanism by which you:
  Create arbitrary groups of processes upon which to execute
  Create an indeterminate number of overlapping or disjoint grids
  Isolate each grid so that grids do not interfere with each other
Initialization routines return a context (integer) which is then passed to the communication routines
  Equivalent to specifying COMM in MPI calls
ID less communication
Messages with BLACS are tagless
  Message IDs are generated internally within the library
Why is this an issue?
  If tags are not unique it is possible to create non-deterministic behaviour (race conditions on message arrival)
BLACS allows the user to specify what range of IDs the BLACS can use
  This ensures it can be used with other packages
ACTS Collection
“Advanced CompuTational Software”
Set of software tools
US Department of Energy program, run in conjunction with NSF and DARPA
  Extended support for experimental software
  Provide technical support ([email protected])
  Maintain ACTS information center (http://acts.nersc.gov)
  Coordinate efforts with US supercomputing centers
  Enable large scale scientific applications
  Educate and train
Unclear how far support extends beyond US borders, although there are registered users across the globe
ACTS Motivation
Large Scientific Codes: A Common Programming Practice
[Diagram: an application code made up of algorithmic implementations, application data layout, control, I/O, and tuned, machine-dependent modules. Annotations attached to the components:]
• New architecture or S/W: extensive tuning; may require new programming paradigms; difficult to maintain!
• New architecture: extensive rewriting. New or extended physics: extensive rewriting or increased overhead
• New architecture: may or may not need rewriting. New developments: difficult to compare
• New architecture: minimal to extensive rewriting
The ACTS (“ideal”) Approach
[Diagram: the user's application code (main control) on top, with the algorithmic implementations, application data layout, I/O, and tuned, machine-dependent modules each supplied by available libraries and packages.]
ACTS Tools and functions
Category | Tool | Functionalities
Numerical | Aztec | Algorithms for the iterative solution of large sparse linear systems.
Numerical | Hypre | Algorithms for the iterative solution of large sparse linear systems, intuitive grid-centric interfaces, and dynamic configuration of parameters.
Numerical | PETSc | Tools for the solution of PDEs that require solving large-scale, sparse linear and nonlinear systems of equations.
Numerical | OPT++ | Object-oriented nonlinear optimization package.
Numerical | SUNDIALS | Solvers for the solution of systems of ordinary differential equations, nonlinear algebraic equations, and differential-algebraic equations.
Numerical | ScaLAPACK | Library of high performance dense linear algebra routines for distributed-memory message-passing.
Numerical | SuperLU | General-purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations.
Numerical | TAO | Large-scale optimization software, including nonlinear least squares, unconstrained minimization, bound constrained optimization, and general nonlinear optimization.
Code Development | Global Arrays | Library for writing parallel programs that use large arrays distributed across processing nodes and that offers a shared-memory view of distributed arrays.
Code Development | Overture | Object-oriented tools for solving computational fluid dynamics and combustion problems in complex geometries.
Code Execution | CUMULVS | Framework that enables programmers to incorporate fault-tolerance, interactive visualization and computational steering into existing parallel programs.
Code Execution | Globus | Services for the creation of computational Grids and tools with which applications can be developed to access the Grid.
Code Execution | PAWS | Framework for coupling parallel applications within a component-like model.
Code Execution | SILOON | Tools and run-time support for building easy-to-use external interfaces to existing numerical codes.
Code Execution | TAU | Set of tools for analyzing the performance of C, C++, Fortran and Java programs.
Library Development | ATLAS and PHiPAC | Tools for the automatic generation of optimized numerical software for modern computer architectures and compilers.
Library Development | PETE | Extensible implementation of the expression template technique (C++ technique for passing expressions as function arguments).
ATLAS
• Automatically Tuned Linear Algebra Software
• Another University of Tennessee project!
  • Largely an unsupported project though
  • http://math-atlas.sourceforge.net/
• Provides a subset of both BLAS and LAPACK functionality
  • Provided foundation for work on BLAS and LAPACK in AMD's ACML
• Takes the optimization step further by giving the computer itself possibilities for optimization at compile time
  • “AEOS”: Automated Empirical Optimization of Software
  • Similar motivation to FFTW
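A minimal sketch of the empirical-optimization idea, assuming a tiled matrix multiply whose tile size is the only tunable parameter. It is not ATLAS code: the routine names, the matrix size `N`, and the candidate list are all illustrative choices. The point is only that the machine times each variant and keeps the fastest one.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define N 64  /* matrix dimension for the trial runs (arbitrary choice) */

/* C += A*B on row-major N x N matrices, using square tiles of side nb. */
void matmul_blocked(const double *A, const double *B, double *C, int nb)
{
    assert(nb > 0 && N % nb == 0);   /* tile size must divide N */
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb; i++)
                    for (int k = kk; k < kk + nb; k++) {
                        double a = A[i * N + k];
                        for (int j = jj; j < jj + nb; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

/* Time every candidate tile size and return the fastest one. */
int pick_block_size(const int *candidates, size_t ncand)
{
    static double A[N * N], B[N * N], C[N * N];
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    int best = candidates[0];
    double best_time = -1.0;
    for (size_t c = 0; c < ncand; c++) {
        for (int i = 0; i < N * N; i++) C[i] = 0.0;
        clock_t t0 = clock();
        for (int rep = 0; rep < 20; rep++)
            matmul_blocked(A, B, C, candidates[c]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || t < best_time) {
            best_time = t;
            best = candidates[c];
        }
    }
    return best;
}
```

The winning tile size depends on the cache hierarchy of the machine running the trials, which is why the search has to be repeated on each target system.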
ATLAS Benchmarks
DGEMM performance:
Arch | ATLAS version | Compiler | % Peak | Peak (Gflop/s)
900 MHz Itanium2 | 3.6.0 | icc | 90% | 3.6
1.6 GHz Opteron | 3.6.0 | gcc | 88% | 3.2
1062 MHz UltraSPARC III | 3.7.8 | gcc 3.3 | 82% | 2.124
600 MHz Athlon | 3.5.7 | gcc 2.95.3 | 80% | 1.2
2.8 GHz Pentium4E | 3.7.3 | gcc 3.3.2 | 77% | 5.6
2.6 GHz Pentium4 | 3.6.0 | gcc | 77% | 5.2
1 GHz PentiumIII | 3.7.7 | gcc 2.95.3 | 76% | 1
1 GHz Efficeon | 3.7.7 | gcc 3.2 | 60% | 2
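To turn the table into sustained rates, multiply the fraction of peak by the peak figure. The sketch below does that arithmetic. The second helper assumes that peak is clock rate times floating-point operations per cycle; the per-cycle counts are inferred from the table (for example 3.6 / 0.9 = 4 for the Itanium2) and are not stated in the slides.

```c
#include <assert.h>

/* Sustained rate in Gflop/s from the fraction of peak (0..1) and the peak rate. */
double sustained_gflops(double fraction_of_peak, double peak)
{
    assert(fraction_of_peak >= 0.0 && fraction_of_peak <= 1.0);
    return fraction_of_peak * peak;
}

/* Peak rate in Gflop/s, assuming peak = clock (GHz) x flops per cycle. */
double theoretical_peak(double clock_ghz, double flops_per_cycle)
{
    assert(clock_ghz > 0.0 && flops_per_cycle > 0.0);
    return clock_ghz * flops_per_cycle;
}
```

For the first row, `sustained_gflops(0.90, 3.6)` gives roughly 3.24 Gflop/s of DGEMM throughput.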
PETSc
• Portable, Extensible Toolkit for Scientific Computation
  • http://www-unix.mcs.anl.gov/petsc/petsc-as/
  • Argonne lab development
• Suite of data structures and routines for the scalable (parallel) solution of PDEs
  • Intended for use in large-scale application projects
  • Not a black box solution though
• Easily interfaces with solvers written in C, FORTRAN and C++
• All components are designed to be interoperable
• Works in distributed memory environment using MPI
Levels of Abstraction in Mathematical Software
• Application-specific interface
  • Programmer manipulates objects associated with the application
• High-level mathematics interface
  • Programmer manipulates mathematical objects
    • Weak forms, boundary conditions, meshes
• Algorithmic and discrete mathematics interface
  • Programmer manipulates mathematical objects
    • Sparse matrices, nonlinear equations
  • Programmer manipulates algorithmic objects
    • Solvers
• Low-level computational kernels
  • BLAS-type operations
  • FFT
[Slide annotation: a “PETSc emphasis” marker indicates the levels PETSc concentrates on]
Features
• Parallel vectors
  • scatters
  • gathers
• Parallel matrices
  • several sparse storage formats
  • easy, efficient assembly
• Scalable parallel preconditioners
• Krylov subspace methods
• Parallel Newton-based nonlinear solvers
• Parallel timestepping (ODE) solvers
• Complete documentation
• Automatic profiling of floating point and memory usage
• Consistent interface
• Intensive error checking
• Portable to UNIX and Windows
• Over one hundred examples
• PETSc is supported and will be actively enhanced for the next several years
Structure of PETSc – Layered Approach
[Diagram, layers from top to bottom:]
• PETSc PDE application codes
• ODE integrators; visualization interface
• Nonlinear solvers
• Linear solvers: preconditioners + Krylov methods
• Object-oriented matrices, vectors, indices; grid management
• Profiling interface
• Computation and communication kernels: MPI, MPI-IO, BLAS, LAPACK
Functionality example: selected vector operations
Function Name | Operation
VecAXPY(Scalar *a, Vec x, Vec y) | y = y + a*x
VecAYPX(Scalar *a, Vec x, Vec y) | y = x + a*y
VecWAXPY(Scalar *a, Vec x, Vec y, Vec w) | w = a*x + y
VecScale(Scalar *a, Vec x) | x = a*x
VecCopy(Vec x, Vec y) | y = x
VecPointwiseMult(Vec x, Vec y, Vec w) | w_i = x_i * y_i
VecMax(Vec x, int *idx, double *r) | r = max x_i
VecShift(Scalar *s, Vec x) | x_i = s + x_i
VecAbs(Vec x) | x_i = |x_i|
VecNorm(Vec x, NormType type, double *r) | r = ||x||
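The three look-alike update operations are easy to confuse, so here they are written out as plain serial loops over C arrays. This is only a sketch of the arithmetic listed in the table. The function names are illustrative, and nothing here uses PETSc or distributes the vectors.

```c
#include <assert.h>
#include <stddef.h>

/* AXPY: y = y + a*x */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* AYPX: y = x + a*y */
void aypx(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) y[i] = x[i] + a * y[i];
}

/* WAXPY: w = a*x + y (x and y are left untouched) */
void waxpy(size_t n, double a, const double *x, const double *y, double *w)
{
    for (size_t i = 0; i < n; i++) w[i] = a * x[i] + y[i];
}

/* Max: returns max x_i and stores its position in *idx. */
double vec_max(size_t n, const double *x, size_t *idx)
{
    assert(n > 0 && idx != NULL);
    *idx = 0;
    for (size_t i = 1; i < n; i++)
        if (x[i] > x[*idx]) *idx = i;
    return x[*idx];
}
```

In the parallel library each process applies the same loop to its local slice. Only reductions such as the maximum or a norm need communication.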
A Complete PETSc Program
    #include "petscvec.h"
    int main(int argc,char **argv)
    {
      Vec         x;
      int         n = 20,ierr;
      PetscTruth  flg;
      PetscScalar one = 1.0, dot;

      PetscInitialize(&argc,&argv,0,0);
      PetscOptionsGetInt(PETSC_NULL,"-n",&n,PETSC_NULL);  /* optional -n <size> on the command line */
      VecCreate(PETSC_COMM_WORLD,&x);
      VecSetSizes(x,PETSC_DECIDE,n);                      /* global length n, local sizes left to PETSc */
      VecSetFromOptions(x);
      VecSet(&one,x);                                     /* x_i = 1 */
      VecDot(x,x,&dot);                                   /* dot = x.x, which equals n here */
      PetscPrintf(PETSC_COMM_WORLD,"Vector length %d\n",(int)dot);
      VecDestroy(x);
      PetscFinalize();
      return 0;
    }
TAO
• Toolkit for Advanced Optimization
  • http://www-unix.mcs.anl.gov/tao/
  • Another Argonne project
• Aimed at the solution of large-scale optimization problems on high-performance architectures
  • Suitable for both single-processor and massively-parallel architectures
  • Object oriented approach
• Interoperable with other toolkits (PETSc for example)
Functionality
• Systems of nonlinear equations
• Nonlinear least squares
• Bound-constrained optimization
• Linear and quadratic programming
• Nonlinearly constrained optimization
• Combinatorial optimization
• Stochastic optimization
• Global optimization
Example program
    TAO tao;                      /* optimization solver */
    Mat H;                        /* Hessian matrix */
    Vec x, g;                     /* solution and gradient vectors */
    double f;                     /* function to minimize */
    int n;                        /* number of variables */
    ApplicationCtx usercontext;   /* user-defined context */

    MatCreate(MPI_COMM_WORLD,n,n,&H);
    VecCreate(MPI_COMM_WORLD,n,&x);
    VecDuplicate(x,&g);

    TaoCreate(MPI_COMM_WORLD,&tao);
    TaoSetFunction(tao,x,EvaluateFunction,usercontext);
    TaoSetGradient(tao,g,EvaluateGradient,usercontext);
    TaoSetHessian(tao,H,EvaluateHessian,usercontext);
    TaoSolve(tao);
    TaoDestroy(tao);
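The fragment above follows a callback pattern: the application hands the solver routines for the gradient and Hessian, plus an opaque user context. The sketch below shows that pattern in plain serial C with a one-variable Newton iteration. It does not use TAO. `newton_minimize`, `app_ctx`, and the test function f(x) = (x - c)^4 + (x - c)^2 are all illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Callback type: evaluates something at x using an opaque user context. */
typedef double (*eval_fn)(double x, void *ctx);

/* Hypothetical user-defined context: holds the minimiser location c. */
typedef struct { double centre; } app_ctx;

/* Gradient of f(x) = (x - c)^4 + (x - c)^2 */
double evaluate_gradient(double x, void *ctx)
{
    double d = x - ((app_ctx *)ctx)->centre;
    return 4.0 * d * d * d + 2.0 * d;
}

/* Hessian (second derivative) of the same function */
double evaluate_hessian(double x, void *ctx)
{
    double d = x - ((app_ctx *)ctx)->centre;
    return 12.0 * d * d + 2.0;
}

/* Newton iteration x <- x - g/h until |g| < tol or max_iter is reached. */
double newton_minimize(double x0, eval_fn grad, eval_fn hess,
                       void *ctx, double tol, int max_iter)
{
    assert(grad != NULL && hess != NULL);
    double x = x0;
    for (int it = 0; it < max_iter; it++) {
        double g = grad(x, ctx);
        if ((g < 0.0 ? -g : g) < tol) break;
        double h = hess(x, ctx);
        assert(h > 0.0);          /* convex here, so the step is a descent step */
        x -= g / h;
    }
    return x;
}
```

The solver never needs to know what the context contains, so the same driver can serve any application that supplies matching callbacks.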
ScaLAPACK
• Scalable LAPACK
• Development team
  • University of Tennessee
  • University of California at Berkeley
  • ORNL, Rice U., UCLA, UIUC etc.
• Support in Commercial Packages
  • NAG Parallel Library (including Intel MKL and AMD ACML)
  • IBM PESSL
  • CRAY Scientific Library and SGI SCSL
  • VNI IMSL
  • Fujitsu, HP/Convex, Hitachi, NEC
Important details
• Web page: http://www.netlib.org/scalapack
  • Includes ScaLAPACK User's Guide
• Language: Fortran
• Dense Matrix Problem Solvers
  • Linear Equations
  • Least Squares
  • Eigenvalue
[Diagram: package dependencies. ScaLAPACK sits on PBLAS and LAPACK; PBLAS sits on BLAS and BLACS; LAPACK sits on BLAS; BLACS sits on a message-passing layer (MPI, PVM, ...).]
Components of the API
• Drivers
  • Solve a complete problem
• Computational Components
  • Perform tasks: LU factorization, etc.
• Auxiliary Routines
  • Scaling, matrix norm, etc.
• Matrix Redistribution/Copy Routine
  • Matrix on PE grid 1 -> matrix on PE grid 2
API (cont.)
• LAPACK names with P prefix: PXYYZZZ
  • X: data type
  • YY: matrix type
  • ZZZ: computation performed
Data type | real | double | cmplx | dble cmplx
X | S | D | C | Z
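As a worked reading of the naming scheme, the sketch below splits a routine name into its parts. The helper functions are illustrative and are not part of ScaLAPACK. The example name PDGESV reads as P (parallel), D (double), GE (general matrix), SV (solve).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Describe the data-type letter X of a PXYYZZZ name; NULL if not recognised. */
const char *scalapack_data_type(const char *name)
{
    assert(name != NULL);
    if (strlen(name) < 2 || name[0] != 'P') return NULL;
    switch (name[1]) {
    case 'S': return "real";
    case 'D': return "double";
    case 'C': return "cmplx";
    case 'Z': return "dble cmplx";
    default:  return NULL;
    }
}

/* Copy the two-letter matrix-type code YY into out. Returns 0 on success. */
int scalapack_matrix_code(const char *name, char out[3])
{
    assert(name != NULL && out != NULL);
    if (strlen(name) < 4 || name[0] != 'P') return -1;
    out[0] = name[2];
    out[1] = name[3];
    out[2] = '\0';
    return 0;
}

/* Return a pointer to the computation code ZZZ (the rest of the name). */
const char *scalapack_computation_code(const char *name)
{
    assert(name != NULL);
    if (strlen(name) < 5 || name[0] != 'P') return NULL;
    return name + 4;
}
```

Dropping the leading P gives the name of the corresponding serial LAPACK routine, here DGESV.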
TAU
• Tuning and Analysis Utilities
  • University of Oregon development
  • http://www.cs.uoregon.edu/research/paracomp/tau/tautools/
• Program and performance analysis tool framework for high-performance parallel and distributed computing
• TAU provides a suite of tools for the analysis of C, C++, FORTRAN 77/90, Python, High Performance FORTRAN, and Java programs
Usage
• Instrument the program by inserting TAU macros into the program (this can be done automatically).
• Run the program. Files containing information about the program performance are automatically generated.
• View the results with TAU's pprof, the TAU visualizer racy (or paraprof), or a third-party visualizer (such as VAMPIR)
Additional facilities
• TAU collects much more information than what is available through prof or gprof, the standard Unix utilities. Also available through TAU are:
  • Per-process, per-thread and per-host information (supports pthreads)
  • Inclusive and exclusive function times
  • Profiling groups that allow you to organize data collection
  • Access to hardware counters on some systems
  • Per-class and per-instance information
  • Separate data for each template instantiation
  • Start/stop timers for profiling arbitrary sections of code
  • Support for collection of statistics on user-defined events
• TAU is designed so that when you turn off profiling (by disabling TAU macros) there is no overhead
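The "no overhead when disabled" property comes from doing the instrumentation with preprocessor macros. The sketch below shows the general idea with hypothetical `PROF_START`/`PROF_STOP` macros and a hand-rolled timer struct. These are not TAU's own macros or data structures.

```c
#include <assert.h>
#include <time.h>

#define PROFILING_ENABLED 1   /* set to 0 and the timers compile to nothing */

typedef struct {
    const char *name;     /* label for the timed region             */
    long        calls;    /* number of times the region was entered */
    clock_t     total;    /* accumulated CPU time in clock ticks    */
    clock_t     started;  /* tick count at the most recent start    */
} prof_timer;

#if PROFILING_ENABLED
#define PROF_START(t) do { (t).calls++; (t).started = clock(); } while (0)
#define PROF_STOP(t)  do { (t).total += clock() - (t).started; } while (0)
#else
#define PROF_START(t) ((void)0)
#define PROF_STOP(t)  ((void)0)
#endif

prof_timer sum_timer = { "sum_squares", 0, 0, 0 };

/* Instrumented routine: returns the sum of squares of x[0..n-1]. */
double sum_squares(const double *x, int n)
{
    assert(n >= 0);
    PROF_START(sum_timer);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * x[i];
    PROF_STOP(sum_timer);
    return s;
}
```

With `PROFILING_ENABLED` set to 0 the two macros expand to empty statements, so the compiled routine is identical to an uninstrumented one.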
CACTUS
• http://www.cactuscode.org/
• Developed as a response to the needs of large scale projects (initially developed for General Relativity calculations, which have a large computation to communication ratio)
• Numerical/computational infrastructure to solve PDEs
• Freely available, Open Source community framework
• Cactus is divided into “Flesh” (core) and “Thorns” (modules or collections of subroutines)
• Multilingual: user apps in Fortran, C, C++; automated interface between them
• Abstraction: Cactus Flesh provides API for virtually all CS type operations
  • Storage, parallelization, communication between processors, etc
  • Interpolation, Reduction
  • IO (traditional, socket based, remote viz and steering…)
  • Checkpointing, coordinates
• “Grid Computing”: Cactus team and many collaborators worldwide, especially NCSA, Argonne/Chicago, LBL
Modularity of Cactus...
[Diagram: Application 1, Application 2, ..., a sub-app, Legacy App 2 and a symbolic manipulation app plug into the Cactus Flesh. The user selects the desired functionality and the code is created. Behind the abstractions sit interchangeable modules: AMR (GrACE, etc), unstructured..., MPI layer 3, I/O layer 2, Remote Steer 2, MDS/Remote Spawn, and Globus Metacomputing Services.]
Cactus & the Grid
[Diagram, layers from top to bottom:]
• Cactus Application Thorns: initial data, evolution, analysis, etc. Distribution information hidden from programmer
• Grid Aware Application Thorns: drivers for parallelism, IO, communication, data mapping. PUGH: parallelism via MPI (MPICH-G2, grid enabled message passing library)
• Grid Enabled Communication Library: MPICH-G2 implementation of MPI, can run MPI programs across heterogeneous computing resources
• Standard MPI
• Single Proc
The Flesh
• Abstract API
  • Evolve the same PDE with unigrid, AMR (MPI or shared memory, etc) without having to change any of the application code.
• Interfaces
  • Set of data structures that a thorn exports to the world (global), to its friends (protected) and to nobody (private), and how these are inherited.
• Implementations
  • Different thorns may implement e.g. the evolution of the same PDE, and we select the one we want at runtime.
• Scheduling
  • Call the routines of every thorn in a certain order and handle their interdependencies.
• Parameters
  • Many types of parameters, all checked for essential consistency before running
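Selecting one of several implementations at runtime can be sketched with a table of function pointers. This is an analogy in plain C, not Cactus code: the `thorn` struct, the registry, and the two integrators for du/dt = -u are hypothetical stand-ins for thorns that evolve the same equation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Common interface every implementation must provide. */
typedef void (*evolve_fn)(double *u, int n, double dt);

typedef struct {
    const char *name;    /* name used to select the implementation */
    evolve_fn   evolve;  /* routine that advances u by one step    */
} thorn;

/* Implementation 1: forward Euler step for du/dt = -u */
void evolve_euler(double *u, int n, double dt)
{
    for (int i = 0; i < n; i++) u[i] += dt * (-u[i]);
}

/* Implementation 2: explicit midpoint step for the same equation */
void evolve_midpoint(double *u, int n, double dt)
{
    for (int i = 0; i < n; i++) {
        double half = u[i] + 0.5 * dt * (-u[i]);
        u[i] += dt * (-half);
    }
}

const thorn registry[] = {
    { "euler",    evolve_euler    },
    { "midpoint", evolve_midpoint },
};

/* Look up an implementation by name, e.g. from a parameter file. */
const thorn *select_thorn(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof registry / sizeof registry[0]; i++)
        if (strcmp(registry[i].name, name) == 0) return &registry[i];
    return NULL;   /* unknown name: caller should report a parameter error */
}
```

The calling code only ever sees `evolve_fn`, so swapping the integrator is a change to a parameter string rather than to the application source.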
Summary Part 2
• ACTS is a collection of software for HPC that includes a number of useful tools
  • Numerical libraries
  • Code development software
  • Profiling software
• ScaLAPACK extends LAPACK to distributed memory architectures
  • Built on top of PBLAS which uses BLACS
Part 3: Odds and ends
• Netlib and other useful websites
• HPL library
• VTK
Netlib
• The Netlib repository contains freely available software, documents, and databases of interest to the numerical & scientific computing communities
• The repository is maintained by
  • AT&T Bell Laboratories
  • University of Tennessee
  • Oak Ridge National Laboratory
• The collection is mirrored at several sites around the world
  • Kept synchronized
• Effective search engine to help locate software of potential use
High Performance LINPACK
• Portable and freely available implementation of the LINPACK Benchmark, used for Top500 ranking
• Developed at UTK Innovative Computing Laboratory
  • A. Petitet, R. C. Whaley, J. Dongarra, A. Cleary
• HPL solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers
  • Requires MPI 1.1 to be installed
  • Also requires an implementation of either the BLAS or the Vector Signal Image Processing Library (VSIPL)
• Provides a testing and timing program
  • Quantifies the accuracy of the obtained solution as well as the time it took to compute it
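One way to quantify accuracy is a scaled residual, which is small when the computed x satisfies Ax = b to within rounding error. The serial sketch below uses the form ||Ax - b||_inf / (eps (||A||_inf ||x||_inf + ||b||_inf) n). Treat the exact formula and the use of `DBL_EPSILON` for eps as assumptions, since the slides do not say which check HPL applies.

```c
#include <assert.h>
#include <float.h>

double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Scaled residual for a dense, row-major n x n system A x = b:
   ||Ax-b||_inf / (eps * (||A||_inf * ||x||_inf + ||b||_inf) * n) */
double scaled_residual(int n, const double *A, const double *x, const double *b)
{
    assert(n > 0);
    double r_norm = 0.0, a_norm = 0.0, x_norm = 0.0, b_norm = 0.0;
    for (int i = 0; i < n; i++) {
        double row_sum = 0.0, ax = 0.0;
        for (int j = 0; j < n; j++) {
            row_sum += abs_d(A[i * n + j]);   /* for the infinity norm of A */
            ax      += A[i * n + j] * x[j];   /* i-th entry of A*x          */
        }
        if (row_sum > a_norm)          a_norm = row_sum;
        if (abs_d(ax - b[i]) > r_norm) r_norm = abs_d(ax - b[i]);
        if (abs_d(x[i]) > x_norm)      x_norm = abs_d(x[i]);
        if (abs_d(b[i]) > b_norm)      b_norm = abs_d(b[i]);
    }
    double scale = a_norm * x_norm + b_norm;
    assert(scale > 0.0);               /* all-zero systems are not handled */
    return r_norm / (DBL_EPSILON * scale * n);
}
```

A value of order one means the residual is at the level of rounding error. A value many orders of magnitude larger indicates a wrong solution.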
Rice University HPC software
• Center for High Performance Software Research (HiPerSoft) established in October 1998
  • http://www.hipersoft.rice.edu/
• Rice has a strong history of innovative HPC tools
• HPCToolkit is an open-source suite of multi-platform tools for profile-based performance analysis of applications
HPCToolkit
• The toolkit components include:
  • hpcrun: a tool for profiling executions of unmodified application binaries using statistical sampling of hardware performance counters.
  • hpcprof & xprof: tools for interpreting sample-based execution profiles and relating them back to program source lines.
  • bloop: a tool for analyzing application binaries to recover program structure; namely, to identify where loops are present and what program source lines they contain.
  • hpcview: a tool for correlating program structure information, multiple sample-based performance profiles, and program source code to produce a performance database.
  • hpcviewer: a Java-based GUI for exploring databases consisting of performance information correlated with program source.
• Supported platforms: Pentium+Linux, Opteron+Linux, Athlon+Linux, Itanium+Linux, Alpha+Tru64 and MIPS+Irix.
• HPCToolkit is open-source software released with a BSD-like license.
CALGO
• Collected algorithms of the ACM
  • http://www.acm.org/pubs/calgo/
• All software is refereed for originality, accuracy, robustness, completeness, portability, and lasting value
• Use of ACM Algorithms is subject to the ACM Software Copyright and License Agreement
• Available on CD
MGnet
• www.mgnet.org
• Site devoted to multigrid and adaptive mesh refinement algorithms
  • Run by Craig Douglas
• Has links to a number of packages for multigrid
  • Some are public domain
  • Others are copyrighted
• Very useful resource for MG methods
NCSA
• National Center for Supercomputing Applications
  • www.ncsa.uiuc.edu
• Their application repository is a very useful guide to what software is available in a given field
NHSE
• National HPC Software Exchange
  • www.nhse.org
• Numerous reports, libraries
• Unfortunately it has been suspended for lack of funding (2004)
• Access to the meta-repository is still available (and links therein)
VTK
• The Visualization Toolkit
  • http://public.kitware.com/VTK/what-is-vtk.php
• Portable open-source software system for 3D computer graphics, image processing, and visualization
  • Object-oriented approach
• VTK is at a higher level of abstraction than rendering libraries like OpenGL
• VTK applications can be written directly in C++, Tcl, Java, or Python
• Large user community
  • Many source code contributions
Summary Part 3
• When looking for a library, the first place to stop is Netlib!