
HiCMA: Hierarchical Computations on Manycore Architectures

Hatem Ltaief, Extreme Computing Research Center
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

NVIDIA GTC - San Jose, April 5th, 2016

Outline

Motivations

QR-based Dynamically Weighted Halley for SVD

Level 3 BLAS

H-Matrices

HiCMA in a nutshell


Students/Collaborators/Support

Academic:

- Extreme Computing Research Center @ KAUST: W. Boukaram, A. Charara, G. Chavez, D. Keyes, D. Sukkari and G. Turkiyyah
- Tokyo Institute of Technology: R. Yokota
- Institut Polytechnique de Bordeaux - INRIA Bordeaux: M. Faverge
- Innovative Computing Laboratory - UTK

Industry:


Outline

Motivations

QR-based Dynamically Weighted Halley for SVD

Level 3 BLAS

H-Matrices

HiCMA in a nutshell


Hardware/Software Trends

- Flops are free
- On/off-chip network bandwidth is limited
- Increasing gap between flops and bandwidth
- Data movement is the most energy-consuming operation
- Synchronization-reducing algorithms
- Communication-reducing algorithms

Going hierarchical all the way down the software stack

- Recursive formulation (increases data locality)
- An old concept!
- Tree structure (depth-first vs. breadth-first tree traversal)
- Reduce vertical/horizontal data motion
- Without compromising concurrency
- Trade-off between data reuse and parallelism

Outline

Motivations

QR-based Dynamically Weighted Halley for SVD

Level 3 BLAS

H-Matrices

HiCMA in a nutshell


Standard SVD solver

One-stage reduction:

Figure: Computational stages of the standard SVD algorithm: (a) bidiagonal reduction, (b) bidiagonal SVD solver and (c) back transformation.

Two-stage SVD solver

Two-stage reduction:

Figure: Reduction of a general dense matrix to bidiagonal form using a two-stage approach.

QDWH-SVD [1,2] solver

Three computational stages:

- Polar decomposition A = U_p H: iterative procedure using the matrix-inversion-free formulation based on QR/Cholesky factorizations
- Symmetric eigensolver H = V Σ V^T to calculate the singular values and the right singular vectors
- Matrix-matrix multiplication U = U_p V to calculate the left singular vectors

[1] Y. Nakatsukasa and N. J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SISC, 35 (2013), pp. A1325-A1349.
[2] D. Sukkari, H. Ltaief and D. Keyes, A High Performance QDWH-SVD Solver Using Hardware Accelerators, submitted to Trans. on Math. Soft., 2015.
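The three stages can be sketched as follows (illustrative only: SciPy's direct `polar` routine stands in for the QR/Cholesky-based QDWH iteration that computes the polar factor in practice):

```python
import numpy as np
from scipy.linalg import polar

def svd_via_polar(A):
    """Sketch of the QDWH-SVD stages: polar decomposition, symmetric
    eigensolve, then a matrix-matrix multiply for the left vectors."""
    # Stage 1: polar decomposition A = Up @ H (QDWH iterates here)
    Up, H = polar(A)
    # Stage 2: symmetric eigensolver H = V @ diag(sigma) @ V.T
    # (eigenvalues of the PSD factor H are the singular values, ascending)
    sigma, V = np.linalg.eigh(H)
    # Stage 3: left singular vectors U = Up @ V
    U = Up @ V
    return U, sigma, V  # A == U @ diag(sigma) @ V.T
```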


Divide-and-Conquer

Figure: The recursive QDWH-SVD algorithm. The matrix A_{i,j} corresponds to the submatrix indexed j at the i-th level of recursion.

Performance results

[Plot: log time (s) vs. matrix size (1024-15360), comparing MKL-DGESVD, MKL-DGESDD and MAGMA-QDWH-SVD; MAGMA-QDWH-SVD is up to 2.3x faster.]

(a) Ill-conditioned matrix.

[Plot: log time (s) vs. matrix size (1024-15360), comparing MKL-DGESVD and MAGMA-QDWH-SVD; up to 3.5x faster.]

(b) Well-conditioned matrix.

Figure: Performance comparisons of MAGMA-QDWH-SVD (GPU) against Intel MKL (CPU).

Performance results

[Plot: log time (s) vs. matrix size (1024-15360), comparing MAGMA-DGESVD, MAGMA-DGESDD and MAGMA-QDWH-SVD; MAGMA-QDWH-SVD is within 18% of the best.]

(a) Ill-conditioned matrix.

[Plot: log time (s) vs. matrix size (1024-15360), comparing MAGMA-DGESVD and MAGMA-QDWH-SVD; up to 2.1x faster.]

(b) Well-conditioned matrix.

Figure: Performance comparisons against existing MAGMA SVD solvers (GPU).

Outline

Motivations

QR-based Dynamically Weighted Halley for SVD

Level 3 BLAS

H-Matrices

HiCMA in a nutshell


Recursive formulation

- Usually used for Level 2 BLAS algorithms (e.g., panel factorization)
- Increases data locality
- Runs at cache-level speed
- Again, not new; the literature is quite rich
- And it pays off for Level 3 BLAS too!

Triangular matrix-matrix multiplication (TRMM)

[Diagram: the operand B (M rows) is split column-wise into (Bl, Br) of widths N1 and N2, and the triangular matrix A is split conformally; operations 1-RecTRMM, 2-GEMM and 3-RecTRMM are performed according to their numbering.]

Figure: Illustrating a Right-Lower-NonTranspose-NonUnit recursive TRMM, splitting along the vertical direction. Operations are performed according to their numbering.
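A minimal sketch of this recursion in NumPy (the blocking threshold `nb` is illustrative; the real kernel runs the base case in GPU registers/shared memory):

```python
import numpy as np

def rec_trmm(A, B, nb=4):
    """In-place B := B @ A for lower-triangular A
    (Right, Lower, NonTranspose, NonUnit), splitting B column-wise."""
    n = A.shape[0]
    if n <= nb:                       # base case: plain triangular multiply
        B[:] = B @ np.tril(A)
        return
    n1 = n // 2
    # 1-RecTRMM: Bl := Bl @ A11
    rec_trmm(A[:n1, :n1], B[:, :n1], nb)
    # 2-GEMM:    Bl += Br @ A21  (Br still holds its original values)
    B[:, :n1] += B[:, n1:] @ A[n1:, :n1]
    # 3-RecTRMM: Br := Br @ A22
    rec_trmm(A[n1:, n1:], B[:, n1:], nb)
```

Performing the operations in this order is what keeps the update in place: the GEMM consumes the still-unmodified Br before the second recursive TRMM overwrites it.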


Triangular matrix-matrix multiplication

[Diagram: recursion tree whose internal nodes are GEMM operations and whose leaves are TRMM operations.]

Figure: A hypothetical tree representing the operations executed by the recursive algorithm. Operations are executed by traversing the tree in depth-first order.

Performance results on NVIDIA GPUs

[Plot: performance (GFlop/s) vs. matrix dimension (x1024, 1-14): Theo-Peak, DGEMM, cuBLAS_DTRMM (OOP), KBLAS_DTRMM (IP), cuBLAS_DTRMM (IP).]

Figure: Performance comparisons of KBLAS DTRMM against IP and OOP cuBLAS DTRMM (integrated into CUDA 8.0).

Performance results on higher DLA computations using GPUs

[Plot: performance (GFlop/s) vs. matrix dimension (x1024, 1-14): Theo-Peak, DGEMM, DPOTRI + KBLAS_TRMM, DPOTRI + cuBLAS_TRMM.]

Figure: Performance speedup of matrix inversion in the MAGMA library (DPOTRI) using KBLAS DTRMM vs. cuBLAS DTRMM.

Outline

Motivations

QR-based Dynamically Weighted Halley for SVD

Level 3 BLAS

H-Matrices

HiCMA in a nutshell


Low Rank Approximation using H-Matrices

- Introduced by E. Tyrtyshnikov and revisited later by W. Hackbusch [1,2]
- Low-rank blocks are stored in factorized form: R = U X V^T

[1] W. Hackbusch, A Sparse Matrix Arithmetic based on H-Matrices (Part I), Computing, 62(2), pp. 89-108, 1999.
[2] W. Hackbusch and B. Khoromskij, A Sparse H-Matrix Arithmetic (Part II): Application to multi-dimensional problems, Computing, 64(1), pp. 21-47, 2000.
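A block R can be compressed into this U X V^T form with a truncated SVD (a NumPy sketch; the function name and the fixed rank k are illustrative, and adaptive schemes would pick k from a tolerance instead):

```python
import numpy as np

def compress_block(R, k):
    """Return (U, X, V) with R ~ U @ X @ V.T, keeping the k largest
    singular values of the block."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k].T
```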


Nice Properties of H-Matrices

- Memory footprint savings: from O(n^2) to O(k n log n)
- Linear arithmetic complexity:
  MVM: from O(n^2) to O(k n log n)
  MMM and A^-1: from O(n^3) to O(k^2 n log^2 n)

Examples of H-Matrices

[Block structure of an H-matrix; each block is labeled with its numerical rank (values 3-9), with low ranks away from the diagonal.]

Figure: Example of H-matrix approximation for BEM.

Examples of H-Matrices

[Block structure of an H-matrix; each block is labeled with its numerical rank (values 9-32), with ranks decreasing away from the diagonal.]

Figure: Example of H-matrix approximation for a covariance matrix.

Tree Structure


H-MVM


Implementation Details

Dense MVM:

- Calculate the products V^T x for the leaves in the tree: batch MVM operation

Upsweep:

- Sweep up the column basis tree, calculating the products of the inner nodes from the products of their children: block SpMV (BSR)

Mult:

- Also a block SpMV (BSR) per level of the tree

Downsweep:

- Transpose operation of the upsweep phase: block SpMV (BSR)

Pipelining:

- Overlapping computations is possible within the Dense MVM / Upsweep / Mult phases. Downsweep, however, requires a sync point!
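Collapsed to a single level, the overall structure looks like the following toy NumPy version (the block lists, index convention and uniform block size `bs` are assumptions; the batched/BSR kernels and the tree sweeps become plain loops here):

```python
import numpy as np

def h_mvm(n, bs, dense_blocks, lowrank_blocks, x):
    """y = A @ x for a one-level H-matrix: dense_blocks holds (i, D) for
    dense diagonal blocks, lowrank_blocks holds (i, j, U, V) for blocks
    A_ij ~ U @ V.T, all of uniform size bs."""
    y = np.zeros(n)
    for i, D in dense_blocks:              # dense MVM phase (leaves)
        y[i*bs:(i+1)*bs] += D @ x[i*bs:(i+1)*bs]
    for i, j, U, V in lowrank_blocks:
        z = V.T @ x[j*bs:(j+1)*bs]         # project into the column basis
        y[i*bs:(i+1)*bs] += U @ z          # expand back into the output
    return y
```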


Performance Results

Figure: Performance (GB/s) of H-MVM using k = 8 and n_min = 32.

Advanced Hierarchical Matrix Operations

Context:

- Very small sizes!
- Batch operation execution at each level of the tree
- (Usually) fixed sizes
- Recursive formulation, stressing register usage
- State-of-the-art implementations are either not well optimized for this scope or lack support
- NVIDIA K40 GPU (single GPU)

Advanced Hierarchical Matrix Operations

H-Matrix compression:

- Batch QR factorizations (square and tall-and-skinny)
- Batch SVD

H-Matrix computations:

- Level 3 BLAS: SYRK, TRSM
- Factorizations: POTRF
- Solves: POTRS, POSV
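As a reference point for what the factorization/solve kernels compute, here is the batched Cholesky pattern in plain NumPy (a sequential stand-in; the batched GPU kernels process all the small matrices concurrently):

```python
import numpy as np

def potrf_batch(As):
    """Batch POTRF: Cholesky factor L (A = L @ L.T) of each small SPD matrix."""
    return np.array([np.linalg.cholesky(A) for A in As])

def potrs_batch(Ls, Bs):
    """Batch POTRS: solve A x = b via the two triangular solves
    L y = b and L.T x = y, given the Cholesky factors."""
    return np.array([np.linalg.solve(L.T, np.linalg.solve(L, b))
                     for L, b in zip(Ls, Bs)])
```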


Performance Results (preliminary)

[Plot: performance (GFlop/s, log2) vs. matrix size (8-256). Batch QR (square matrix): KBLAS_10000, KBLAS_1000, CUBLAS_10000, CUBLAS_1000, MAGMA_10000, MAGMA_1000.]


Performance Results (preliminary)

[Plot: performance (GFlop/s, log2) vs. matrix size (8-512). DSYRK_Batch: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

Performance Results (preliminary)

[Plot: performance (GFlop/s, log2) vs. matrix size (8-512). DTRSM_Batch: KBLAS_10240, KBLAS_1024, MAGMA_IP_10240, MAGMA_IP_1024, CUBLAS_10240, CUBLAS_1024.]

Performance Results (preliminary)

[Plot: performance (GFlop/s, log2) vs. matrix size (8-256). DPOTRF_Batch: KBLAS_1024, KBLAS_10240, MAGMA_1024, MAGMA_10240.]

Performance Results (preliminary)

[Plot: performance (GFlop/s, log2) vs. matrix size (8-512). DPOTRS_Batch: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

Performance Results (preliminary)

[Plot: performance (GFlop/s, log2) vs. matrix size (8-512). DPOSV_Batch: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

Outline

Motivations

QR-based Dynamically Weighted Halley for SVD

Level 3 BLAS

H-Matrices

HiCMA in a nutshell


HiCMA’s Scope

The Hierarchical Computations on Manycore Architectures library aims to:

- Develop high-performance numerical solvers: dense / data-sparse (H)
- Increase data reuse thanks to a recursive/hierarchical formulation
- Exploit high levels of concurrency
- Perform asynchronous execution
- Target various architectures: shared/distributed-memory, accelerators/co-processors, ARM

HiCMA Software Stack

HiCMA's Backbone

HiCMA's Horsepower

HiCMA's MoC