Date post: | 14-Dec-2015 |
Upload: | darren-mitchell |
Upon completion of this module, you will be able to:
Performance Features
Using the Library
MKL addresses:
◦ Solvers (BLAS, LAPACK)
◦ Eigenvector/eigenvalue solvers (BLAS, LAPACK)
◦ Some quantum chemistry needs (dgemm)
◦ PDEs, signal processing, seismic, solid-state physics (FFTs)
◦ General scientific, financial: vector transcendental functions (VML) and vector random number generators (VSL)
Software Construction
Geometric Transformation
Don’t use Intel® Math Kernel Library (Intel® MKL) on …
◦ Don’t use Intel® MKL on “small” counts.
◦ Don’t call vector math functions on small n.
  ◦ But you could use Intel® Performance Primitives.
BLAS (Basic Linear Algebra Subprograms)
◦ Level 1 BLAS – vector-vector operations
  ◦ 15 function types
  ◦ 48 functions
◦ Level 2 BLAS – matrix-vector operations
  ◦ 26 function types
  ◦ 66 functions
◦ Level 3 BLAS – matrix-matrix operations
  ◦ 9 function types
  ◦ 30 functions
◦ Extended BLAS – level 1 BLAS for sparse vectors
  ◦ 8 function types
  ◦ 24 functions
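To make the three BLAS levels concrete, here is a minimal sketch of one representative operation per level, written as plain reference loops. The names `ref_axpy`, `ref_gemv` and `ref_gemm` are hypothetical and are not Intel MKL entry points; the real library routines take more parameters (strides, transposition flags, scaling factors) and are heavily optimized.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): y = alpha*x + y, roughly 2n flops. */
void ref_axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y = A*x for an n-by-n row-major A, roughly 2n^2 flops. */
void ref_gemv(size_t n, const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Level 3 (matrix-matrix): C = A*B for n-by-n row-major matrices, roughly 2n^3 flops. */
void ref_gemm(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

These loops only show what each level computes; an application would call the library routine of the same family instead of writing them by hand.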
LAPACK (Linear Algebra PACKage)
◦ Solvers and eigensolvers; many hundreds of routines in total
◦ More than 1000 user-callable and support routines in total
Discrete Fourier Transforms (DFT)
◦ Mixed radix, multi-dimensional transforms
◦ Multithreaded
VML (Vector Math Library)
◦ Set of vectorized transcendental functions
◦ Most libm functions, but faster
VSL (Vector Statistics Library)
◦ Set of vectorized random number generators
◦ BLAS and LAPACK* are both Fortran
  ◦ Legacy of high performance computation
◦ VSL and VML have Fortran and C interfaces
◦ DFTs have Fortran 95 and C interfaces
◦ cblas interface: more convenient for a C/C++ programmer who needs to call BLAS
◦ Supports 32-bit and 64-bit Intel® processors
◦ Large set of examples and tests
◦ Extensive documentation
The goal of all optimization is maximum speed.
Resource-limited optimization – exhaust one or more resources of the system:
◦ CPU: register use, FP units
◦ Cache: keep data in cache as long as possible; deal with cache interleaving
◦ TLBs: maximally use the data on each page
◦ Memory bandwidth: minimally access memory
◦ Computer: use all the processors available, using threading
◦ System: use all the nodes available (cluster software)
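The cache item above is usually addressed with blocking (tiling). The sketch below is a generic textbook illustration, not Intel MKL's actual kernel: `blocked_gemm` and its tile size parameter `bs` are hypothetical names, and a tuned library would pick tile sizes per processor and add vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the smaller of two sizes; used to clip a tile at the matrix edge. */
static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* C += A*B for n-by-n row-major matrices, processed in bs-by-bs tiles so that
   each tile of A, B and C is reused while it is still in cache.
   The caller zeroes C beforehand to obtain C = A*B. */
void blocked_gemm(size_t n, size_t bs, const double *a, const double *b, double *c)
{
    assert(bs > 0 && a != NULL && b != NULL && c != NULL);
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_size(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_size(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_size(jj + bs, n);
                /* Multiply one tile of A by one tile of B into one tile of C. */
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

The result is the same as an untiled triple loop; only the order of memory accesses changes, which is the point of the technique.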
Most of Intel MKL could be threaded, but:
◦ The limiting resource is memory bandwidth
◦ Threading level 1 and level 2 BLAS is mostly ineffective (O(n) and O(n²) work, respectively)
There are numerous opportunities for threading:
◦ Level 3 BLAS (O(n³))
◦ LAPACK* (O(n³))
◦ FFTs (O(n log(n)))
◦ VML, VSL? Depends on processor and function
All threading is via OpenMP*.
All of Intel MKL is designed and compiled for thread safety.
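One way to see why memory bandwidth limits the lower BLAS levels is to compare floating-point operations with the data that must move. The sketch below uses simplified textbook counts for dense n-by-n operands and assumes ideal cache reuse for level 3; the function names are hypothetical and the counts are estimates, not measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Flops per double moved for y = alpha*x + y:
   2n flops over 3n doubles (read x, read y, write y). */
double intensity_level1(size_t n)
{
    assert(n > 0);
    return (2.0 * (double)n) / (3.0 * (double)n);
}

/* Flops per double moved for y = A*x:
   2n^2 flops over n^2 + 2n doubles (read A, read x, write y). */
double intensity_level2(size_t n)
{
    assert(n > 0);
    double dn = (double)n;
    return (2.0 * dn * dn) / (dn * dn + 2.0 * dn);
}

/* Flops per double moved for C = A*B:
   2n^3 flops over 3n^2 doubles (read A, read B, write C), assuming ideal reuse. */
double intensity_level3(size_t n)
{
    assert(n > 0);
    double dn = (double)n;
    return (2.0 * dn * dn * dn) / (3.0 * dn * dn);
}
```

Under these assumptions the level 1 and level 2 ratios stay below 2 regardless of n, so extra threads mostly wait on memory, while the level 3 ratio grows with n and leaves real work to share among processors.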
Scenario 1: ifort, BLAS, IA-32 processor:
    ifort myprog.f mkl_c.lib
Scenario 2: CVF, LAPACK, IA-32 processor:
    f77 myprog.f mkl_s.lib
Scenario 3: Statically link a C program with DLL linked at runtime:
    link myprog.obj mkl_c_dll.lib
Note: Optimal binary code will execute at run time based on processor.
Most important LAPACK optimizations:
◦ Threading – effectively uses multiple CPUs
◦ Recursive factorization
  ◦ Reduces scalar time (Amdahl’s law: t = t_scalar + t_parallel/p)
  ◦ Extends blocking further into the code
◦ No runtime library support required
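The Amdahl's law expression quoted above can be evaluated directly. This is a small sketch with hypothetical function names; the timings passed in are whatever the reader measures or assumes, in any consistent unit.

```c
#include <assert.h>

/* Run time on p processors: t = t_scalar + t_parallel / p. */
double amdahl_time(double t_scalar, double t_parallel, unsigned p)
{
    assert(p > 0 && t_scalar >= 0.0 && t_parallel >= 0.0);
    return t_scalar + t_parallel / (double)p;
}

/* Speedup relative to running the same work on one processor. */
double amdahl_speedup(double t_scalar, double t_parallel, unsigned p)
{
    assert(t_scalar + t_parallel > 0.0);
    return (t_scalar + t_parallel) / amdahl_time(t_scalar, t_parallel, p);
}
```

For example, with 10 units of scalar time and 90 units of parallel time, 9 processors give a run time of 20 units (a speedup of 5). Halving the scalar part to 5 units brings the run time down to 15 units, which is why reducing scalar time matters as much as adding processors.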
◦ One-dimensional, two-dimensional, three-dimensional
◦ Multithreaded
◦ Mixed radix
◦ User-specified scaling, transform sign
◦ Transforms on embedded matrices
◦ Multiple one-dimensional transforms on a single call
◦ Strides
◦ C and F90 interfaces
Basically a three-step process
◦ Create a descriptor
  ◦ Status = DftiCreateDescriptor(MDH, …)
◦ Commit the descriptor (instantiates it)
  ◦ Status = DftiCommitDescriptor(MDH)
◦ Perform the transform
  ◦ Status = DftiComputeForward(MDH, X)
◦ Optionally free the descriptor
Vector Math Library: vectorized transcendental functions – like libm but better (faster)
◦ Interface: both Fortran and C interfaces
◦ Multiple accuracies
  ◦ High accuracy (< 1 ulp)
  ◦ Lower accuracy, faster (< 4 ulps)
◦ Special value handling: √(-a), sin(0), and so on
◦ Error handling – cannot duplicate libm here
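The accuracy levels above are stated in ulps (units in the last place). As a rough illustration of what that unit measures, the sketch below estimates the error of a single-precision result against a double-precision reference. The helper names are hypothetical, the code is not part of VML, and it assumes IEEE 754 single precision with a positive, finite, normal reference value below the largest float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Gap between x and the next larger float, found by incrementing the bit
   pattern. Valid for positive, finite, normal x below the largest float. */
static float ulp_of(float x)
{
    uint32_t bits;
    float next;
    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);
    bits += 1u;
    memcpy(&next, &bits, sizeof next);
    return next - x;
}

/* Error of a float result, in ulps of the (rounded) reference value. */
double ulp_error(float approx, double exact)
{
    double diff = (double)approx - exact;
    if (diff < 0.0)
        diff = -diff;
    float ref = (float)(exact < 0.0 ? -exact : exact);
    return diff / (double)ulp_of(ref);
}
```

A result that is off by one representable float from the reference scores 1.0; under the figures quoted above, the high-accuracy mode would stay below that and the faster mode below 4.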
◦ It is important for financial codes (Monte Carlo simulations)
  ◦ Exponentials, logarithms
◦ Other scientific codes depend on transcendental functions
  ◦ Error functions can be big time sinks in some codes
◦ Set of random number generators (RNGs)
◦ Numerous non-uniform distributions
◦ VML used extensively for transformations
◦ Parallel computation support – some functions
◦ User can supply own BRNG or transformations
◦ Five basic RNGs (BRNGs) – bits, integer, FP
  ◦ MCG31, R250, MRG32, MCG59, WH
◦ Gaussian (two methods)
◦ Exponential
◦ Laplace
◦ Weibull
◦ Cauchy
◦ Rayleigh
◦ Lognormal
◦ Gumbel
Basically a three-step process
◦ Create a stream pointer.
  ◦ VSLStreamStatePtr stream;
◦ Create a stream.
  ◦ vslNewStream(&stream, VSL_BRNG_MCG31, seed);
◦ Generate a set of random numbers.
  ◦ vsRngUniform(0, stream, size, out, start, end);
◦ Delete a stream (optional).
  ◦ vslDeleteStream(&stream);
◦ Compare the performance of C source code (RAND function) and VSL.
◦ Exercise control of the threading capabilities in MKL/VSL.
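For the first lab item, the C-library side of the comparison could look like the sketch below: fill a buffer with uniform values using `rand()` and time the fill with `clock()`. The function names and the choice of `clock()` are assumptions made for illustration; the VSL side of the comparison would time the library's uniform generator over the same buffer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill out[0..n) with uniform values in [a, b) using the C library rand(). */
void fill_rand_uniform(double *out, size_t n, double a, double b, unsigned seed)
{
    assert(out != NULL && a < b);
    srand(seed);
    for (size_t i = 0; i < n; ++i) {
        /* Dividing by RAND_MAX + 1 keeps u strictly below 1. */
        double u = (double)rand() / ((double)RAND_MAX + 1.0);
        out[i] = a + (b - a) * u;
    }
}

/* CPU seconds spent in one fill, measured with clock(). */
double time_fill_rand(double *out, size_t n, double a, double b, unsigned seed)
{
    clock_t t0 = clock();
    fill_rand_uniform(out, n, a, b, seed);
    clock_t t1 = clock();
    return (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
}
```

For a meaningful comparison the buffer should be large enough (millions of values) that the measured time is well above the resolution of `clock()`.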
◦ Intel® Math Kernel Library is a broad scientific/engineering math library.
◦ It is optimized for Intel® processors.
◦ It is threaded for effective use on SMP machines.