Page 1: Solving Linear Algebra Algorithms Using Multiple GPU ...developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF… · Hybrid Algorithms Two-sided factorizations (to bidiagonal,

Solving Challenging Numerical Linear Algebra Algorithms Using Multiple GPU Accelerators

Hatem Ltaief

KAUST Supercomputing Laboratory

Stanimire Tomov

University of Tennessee, Knoxville

Outline

MAGMA: LAPACK for GPUs

Methodology overview

— Use both GPUs and multicore CPUs

MAGMA: from single to multiGPU support

— One-sided factorizations and linear solvers

— Two-sided factorizations and eigensolvers

Dynamic scheduling approaches to DLA

MAGMA algorithms with dynamic scheduling

Conclusions

MAGMA: LAPACK for GPUs

MAGMA

— Matrix Algebra on GPU and Multicore Architectures

— To provide LAPACK/ScaLAPACK on hybrid architectures

— http://icl.cs.utk.edu/magma/

MAGMA BLAS

— A subset of BLAS for GPUs, highly optimized for NVIDIA GPGPUs

— Fast GEMM for Fermi (CUBLAS3.2) [IJHPCA’10]

MAGMA developers & collaborators

— UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)

— Community effort, similar to LAPACK/ScaLAPACK

A New Generation of DLA Software

MAGMA

Hybrid Algorithms

(heterogeneity friendly)

Rely on

- hybrid scheduler

- hybrid kernels

MAGMA Software Stack

MAGMA 1.1

50+ hybrid LAPACK algorithms have been developed (total of 200+ routines)

Every algorithm is in 4 precisions (s/c/d/z)

There are 3 mixed precision algorithms (zc & ds)

These are hybrid algorithms, expressed in terms of BLAS

MAGMA BLAS: a subset of GPU BLAS, optimized for Tesla and Fermi GPUs

MAGMA Methodology

A methodology to use all available resources:

MAGMA uses a HYBRIDIZATION methodology based on

– Representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES among them

– Properly SCHEDULING tasks' execution over multicore and GPU hardware components

Successfully applied to fundamental linear algebra algorithms

– One- and two-sided factorizations and solvers

– Iterative linear and eigen-solvers

Productivity

– 1) High-level; 2) Leveraging prior developments; 3) Outperforming homogeneous solutions

Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs)

Hybrid Algorithms

One-sided factorizations (LU, QR, Cholesky)

Hybridization

– Panels (Level 2 BLAS) are factored on CPU using LAPACK

– Trailing matrix updates (Level 3 BLAS) are done on the GPU using “look-ahead”
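The panel/update split above can be sketched in plain C. This is only an illustration under simplifying assumptions: it is a right-looking blocked LU without pivoting (MAGMA's LU does pivot), everything runs on the CPU, and look-ahead is not shown. The names `panel_factor`, `trailing_update` and `blocked_lu_nopiv` are hypothetical, not MAGMA routines. The comments mark which part a hybrid code would place on which device.

```c
#include <assert.h>

/* Column-major element access; relies on local names a and lda. */
#define ELEM(i, j) a[(i) + (j) * lda]

/* Unblocked LU (no pivoting) of the panel made of columns k..k+kb-1,
   rows k..n-1. Level 2 BLAS style work: the part a hybrid code would
   keep on the CPU. */
static void panel_factor(double *a, int lda, int n, int k, int kb)
{
    for (int j = k; j < k + kb; j++) {
        for (int i = j + 1; i < n; i++)
            ELEM(i, j) /= ELEM(j, j);              /* column of L */
        for (int c = j + 1; c < k + kb; c++)       /* update rest of panel */
            for (int i = j + 1; i < n; i++)
                ELEM(i, c) -= ELEM(i, j) * ELEM(j, c);
    }
}

/* Trailing matrix update for the columns right of the panel.
   Level 3 BLAS style work: the part a hybrid code would give to the GPU. */
static void trailing_update(double *a, int lda, int n, int k, int kb)
{
    int k2 = k + kb;
    /* U12 = L11^{-1} * A12 (unit lower triangular solve) */
    for (int c = k2; c < n; c++)
        for (int j = k; j < k2; j++)
            for (int i = j + 1; i < k2; i++)
                ELEM(i, c) -= ELEM(i, j) * ELEM(j, c);
    /* A22 = A22 - L21 * U12 (matrix-matrix multiply) */
    for (int c = k2; c < n; c++)
        for (int j = k; j < k2; j++)
            for (int i = k2; i < n; i++)
                ELEM(i, c) -= ELEM(i, j) * ELEM(j, c);
}

/* In-place LU of an n x n column-major matrix with block size nb.
   On return the strict lower triangle holds L (unit diagonal implied)
   and the upper triangle holds U. Assumes all pivots are nonzero. */
void blocked_lu_nopiv(double *a, int lda, int n, int nb)
{
    assert(nb > 0 && lda >= n);
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;
        panel_factor(a, lda, n, k, kb);      /* CPU in a hybrid code */
        trailing_update(a, lda, n, k, kb);   /* GPU in a hybrid code */
    }
}
```

With look-ahead, the update of the next panel's columns would be issued first so that the CPU can start factoring that panel while the GPU finishes the rest of the trailing matrix.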

A Hybrid Algorithm Example

Left-looking hybrid Cholesky factorization in MAGMA 1.0

The difference with LAPACK – the 3 additional lines in red

Line 10 (done on CPU) is overlapped with work on the GPU

LU Factorization (Single GPU)

From single- to multiple-GPU support

Data distribution

— 1-D block-cyclic distribution

Algorithm

— The GPU holding the current panel sends it to the CPU

— All updates are done in parallel on the GPUs

— Look-ahead is done with the GPU holding the next panel

[Figure: matrix split into column blocks of width nb, assigned in turn to GPU 0, GPU 1, GPU 2, GPU 0, ...]
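A 1-D block-cyclic distribution of columns can be written down as a pair of index maps. The sketch below is a minimal illustration; the helper names (`owner_gpu`, `local_col`, `local_ncols`) are hypothetical and not part of MAGMA's interface.

```c
#include <assert.h>

/* GPU that owns global column j when column blocks of width nb
   are dealt round-robin to ngpu devices. */
int owner_gpu(int j, int nb, int ngpu)
{
    assert(j >= 0 && nb > 0 && ngpu > 0);
    return (j / nb) % ngpu;
}

/* Index of global column j inside its owner's local storage. */
int local_col(int j, int nb, int ngpu)
{
    assert(j >= 0 && nb > 0 && ngpu > 0);
    int block = j / nb;               /* global block index             */
    int local_block = block / ngpu;   /* position among owner's blocks  */
    return local_block * nb + j % nb;
}

/* Number of columns of an n-column matrix stored on device gpu. */
int local_ncols(int n, int nb, int ngpu, int gpu)
{
    assert(n >= 0 && nb > 0 && ngpu > 0 && gpu >= 0 && gpu < ngpu);
    int count = 0;
    for (int j = 0; j < n; j++)
        if (owner_gpu(j, nb, ngpu) == gpu)
            count++;
    return count;
}
```

For example, with nb = 2 and three GPUs, columns 0-1 land on GPU 0, columns 2-3 on GPU 1, columns 4-5 on GPU 2, and columns 6-7 wrap around to GPU 0. Because consecutive panels live on different devices, the GPU holding the next panel is the natural place for the look-ahead update.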

LU factorization (multiGPUs)

LU factorization (multiGPUs)

Matrix out of GPU memory

Out of GPU Memory Algorithms

Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using existing algorithms)

The rest of the matrix stays on the CPU

Left-looking versions minimize writing on the CPU

[Figure: factored sub-matrix A1 on CPU; to-be-factored sub-matrix A2 on GPU]

1) Copy A2 to the GPU

2) Update A2 using A1 (a panel of A1 at a time)

3) Factor the updated A2 using existing hybrid code

4) Copy factored A2 to the CPU

Trivially extended to multiGPUs: A2 is “larger” with 1-D block-cyclic distribution, again reusing existing algorithms
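The four steps can be turned into a small planning sketch that sizes the sub-matrices and counts how many columns cross the CPU-GPU link. Everything here is an assumption made for illustration: the function names are hypothetical, the sizing rule only accounts for the A2 block itself, and the traffic inside step 3 (the hybrid factorization) is not counted.

```c
#include <assert.h>
#include <stddef.h>

/* Widest column block of an n-row double-precision matrix that fits
   in gpu_bytes of device memory (capped at n columns). */
int max_block_cols(int n, size_t gpu_bytes)
{
    assert(n > 0);
    size_t cols = gpu_bytes / ((size_t)n * sizeof(double));
    return cols > (size_t)n ? n : (int)cols;
}

typedef struct {
    long cols_to_gpu;   /* columns copied CPU -> GPU */
    long cols_to_cpu;   /* columns copied GPU -> CPU */
    int  nblocks;       /* number of sub-matrices A2 */
} oom_traffic;

/* Walk an n-column matrix in blocks of at most w columns, following
   the left-looking scheme: each block is brought in, updated with all
   previously factored columns, factored, and written back once. */
oom_traffic out_of_memory_plan(int n, int w)
{
    oom_traffic t = {0, 0, 0};
    assert(n >= 0 && w > 0);
    for (int s = 0; s < n; s += w) {
        int wb = (n - s < w) ? n - s : w;
        t.cols_to_gpu += wb;   /* 1) copy A2 to the GPU                  */
        t.cols_to_gpu += s;    /* 2) stream the s factored columns (A1)  */
                               /* 3) factor A2 on the GPU (not counted)  */
        t.cols_to_cpu += wb;   /* 4) copy factored A2 to the CPU         */
        t.nblocks++;
    }
    return t;
}
```

In this count every column is written back to the CPU exactly once (`cols_to_cpu` equals n), which is the sense in which the left-looking variant minimizes writing on the CPU; the price is that already factored columns are re-read for each later block.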

Hybrid Algorithms

Two-sided factorizations (to bidiagonal, tridiagonal, and upper Hessenberg forms) for eigen- and singular-value problems

Hybridization

– Trailing matrix updates (Level 3 BLAS) are done on the GPU (similar to the one-sided factorizations)

– Panels (Level 2 BLAS) are hybrid

  – Operations with memory footprint restricted to the panel are done on the CPU

  – The time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU

Hybrid Two-Sided Factorizations

MultiGPUs Two-Sided Factorizations

Performance of DSYMV on multi M2090s

Need high-performance multiGPU Level 2 BLAS

T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical report, 03/2012.

Tridiagonalization

50% of the flops are in SYMV

Memory bound, i.e. does not scale well on multicore CPUs

Use the GPU’s high memory bandwidth and optimized SYMV

8x speedup over 12 Intel cores (X5660 @ 2.8 GHz)

Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)

Performance of MAGMA tridiagonalization in DP
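To make the memory-bound nature of SYMV concrete, here is a minimal reference version in C. It is a plain CPU sketch, not the optimized GPU kernel from the cited report, and the name `symv_lower` is hypothetical. Only the lower triangle is read, and each stored element is loaded once and used for two multiply-adds.

```c
#include <assert.h>

/* y = A * x for a symmetric n x n matrix A stored column-major with
   leading dimension lda. Only the lower triangle of A is referenced. */
void symv_lower(int n, const double *a, int lda, const double *x, double *y)
{
    assert(n >= 0 && lda >= n);
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int j = 0; j < n; j++) {
        double tmp = 0.0;
        y[j] += a[j + j * lda] * x[j];      /* diagonal element        */
        for (int i = j + 1; i < n; i++) {
            double aij = a[i + j * lda];    /* loaded once ...         */
            y[i] += aij * x[j];             /* ... used as A(i,j)      */
            tmp  += aij * x[i];             /* ... and as A(j,i)       */
        }
        y[j] += tmp;
    }
}
```

That is roughly four floating-point operations per 8-byte matrix element read, so throughput is set by how fast the matrix can be streamed from memory rather than by arithmetic peak, which is why the GPU's memory bandwidth matters here.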

From Static to Dynamic Scheduling…

Static scheduling may stall in situations where work is available

Hand tuned optimizations

Hardware heterogeneity

Kernel heterogeneity

Separation of concerns

Dynamic Runtime System

Punch Lines

Productivity!

Productivity!

Productivity!

Oh… Did I say Productivity?

Block Algorithms

Panel-Update Sequence

Transformations are blocked/accumulated within the Panel (Level 2 BLAS)

Transformations applied at once on the trailing submatrix (Level 3 BLAS)

Parallelism hidden inside the BLAS

Fork-join Model

Block QR Factorization

Fork-Join Paradigm

Leveraging Block Algorithms…

[Figure: column-major data layout vs. tile data layout]
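The difference between the two layouts can be sketched as a copy routine. This is a minimal illustration assuming a square matrix whose order is a multiple of the tile size; the function name `colmajor_to_tile` and the choice to order tiles column by column are assumptions, not PLASMA's or MAGMA's actual translation code.

```c
#include <assert.h>
#include <stddef.h>

/* Copy an n x n column-major matrix a (leading dimension n) into tile
   layout t: nb x nb tiles stored contiguously, tiles ordered column by
   column, each tile itself column-major. Assumes nb divides n. */
void colmajor_to_tile(const double *a, double *t, int n, int nb)
{
    assert(nb > 0 && n % nb == 0);
    int nt = n / nb;                       /* tiles per row/column */
    for (int tj = 0; tj < nt; tj++)
        for (int ti = 0; ti < nt; ti++) {
            double *tile = t + ((size_t)tj * nt + ti) * nb * nb;
            for (int j = 0; j < nb; j++)
                for (int i = 0; i < nb; i++)
                    tile[i + j * nb] =
                        a[(ti * nb + i) + (size_t)(tj * nb + j) * n];
        }
}
```

After the copy, every tile occupies one contiguous chunk of nb*nb values, so a task that works on a tile touches a compact block of memory instead of nb strided column segments.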

Lessons Learnt from PLASMA

PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures:

http://icl.cs.utk.edu/plasma/

Tile Algorithms on homogeneous x86 cores

Parallelism is brought to the fore

May require the redesign of linear algebra algorithms

Tile data layout translation

Remove unnecessary synchronization points between Panel-Update sequences

DAG execution where nodes represent tasks and edges define dependencies between them

Dynamic runtime system environment QUARK

Example: Tile QR Factorization

First panel factorization and corresponding updates

DAG for a matrix of 4x4 tiles

Let’s go crazy!

DAG of 20x20 tile QR Factorization

Dynamic Scheduling

Conceptually similar to out-of-order processor scheduling because it has:

— Dynamic runtime DAG scheduler

— Out-of-order execution flow of fine-grained tasks

— Task scheduling as soon as dependencies are satisfied

— Producer-Consumer

Data Flow Programming Model: a five-decade-old concept

— Think "how things connect" rather than "how things happen"

— Assembly line

— Inherently parallel
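The rule "schedule a task as soon as its dependencies are satisfied" can be sketched with a dependency counter per task and a ready queue. This is a toy, sequential illustration with hypothetical names (`dag`, `schedule`); it is not how QUARK or StarPU are implemented, and a real runtime would hand ready tasks to worker threads or GPUs instead of appending them to a list.

```c
#include <assert.h>

#define MAX_TASKS 64

/* A task graph given as an edge list: from[e] must finish before to[e]. */
typedef struct {
    int ntasks;
    int nedges;
    int from[4 * MAX_TASKS];
    int to[4 * MAX_TASKS];
} dag;

/* Fill order[] with an execution order in which every task appears
   only after all of its predecessors. Returns the number of tasks
   scheduled (fewer than ntasks means the graph contains a cycle). */
int schedule(const dag *g, int *order)
{
    int pending[MAX_TASKS] = {0};   /* unsatisfied dependencies per task */
    int ready[MAX_TASKS];           /* FIFO queue of runnable tasks      */
    int head = 0, tail = 0, done = 0;

    assert(g->ntasks <= MAX_TASKS);
    for (int e = 0; e < g->nedges; e++)
        pending[g->to[e]]++;
    for (int t = 0; t < g->ntasks; t++)
        if (pending[t] == 0)
            ready[tail++] = t;

    while (head < tail) {
        int t = ready[head++];
        order[done++] = t;          /* a runtime would launch t here */
        /* completing t releases its successors (producer-consumer) */
        for (int e = 0; e < g->nedges; e++)
            if (g->from[e] == t && --pending[g->to[e]] == 0)
                ready[tail++] = g->to[e];
    }
    return done;
}
```

As a usage example, the first step of a tile QR on 2x2 tiles has four tasks: the diagonal factorization (0), the update of the tile to its right (1) and the factorization of the tile below (2), both of which depend on 0, and the final update (3), which depends on 1 and 2. Tasks 1 and 2 become ready at the same moment, which is exactly the parallelism a fork-join execution would hide.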

Matrices Over Runtime Systems at Exascale (MORSE)

Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems"

Runtime challenges due to the ever-growing hardware complexity

Algorithmic challenges to exploit the hardware capabilities to the fullest

Integrated into MAGMA software stack

MAGMA-MORSE: x86 + MultiGPUs

Lessons Learned from PLASMA!

CUDA-based hybrid systems

New high performance numerical kernels

StarPU Runtime System (Augonnet et al., INRIA Bordeaux)

Both: x86 and GPUs => Hybrid Computations

Similar to LAPACK in functionality

Achieving High Level of Productivity

From Sequential Nested-Loop Code to Parallel Execution

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k;k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k;k], A[k;n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k;k], A[m;k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m;k], A[k;n], A[m;n], ...);
    }
}
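To get a feel for how many tasks this loop nest generates, the same loops can be run with counters in place of the kernels. The struct and function names below are hypothetical; the loop bounds are copied from the tile QR pseudocode.

```c
#include <assert.h>

/* Number of calls to each tile QR kernel. */
typedef struct {
    long geqrt;
    long unmqr;
    long tsqrt;
    long tsmqr;
} qr_task_count;

/* Walk the tile QR loop nest for an MT x NT tile matrix and count
   kernel invocations instead of executing them. */
qr_task_count count_tile_qr_tasks(int MT, int NT)
{
    assert(MT >= 0 && NT >= 0);
    qr_task_count c = {0, 0, 0, 0};
    int kmax = MT < NT ? MT : NT;
    for (int k = 0; k < kmax; k++) {
        c.geqrt++;                         /* panel tile (k,k)        */
        for (int n = k + 1; n < NT; n++)
            c.unmqr++;                     /* update tile (k,n)       */
        for (int m = k + 1; m < MT; m++) {
            c.tsqrt++;                     /* panel tile (m,k)        */
            for (int n = k + 1; n < NT; n++)
                c.tsmqr++;                 /* update tiles (k,n),(m,n) */
        }
    }
    return c;
}
```

A 4x4 tile matrix yields 4 + 6 + 6 + 14 = 30 tasks, and a 20x20 tile matrix yields 2870, the large majority of them `tsmqr` updates. That dominance is why placing the update kernel on the fastest device matters most.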

Achieving High Level of Productivity

From Sequential Nested-Loop Code to Parallel Execution

for (k = 0; k < min(MT, NT); k++) {
    starpu_Insert_Task(&cl_zgeqrt, k, k, ...);
    for (n = k+1; n < NT; n++)
        starpu_Insert_Task(&cl_zunmqr, k, n, ...);
    for (m = k+1; m < MT; m++) {
        starpu_Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            starpu_Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}

Hybrid Architecture Targeted

⇒ PCI interconnect, 16x, 64 Gb/s: a very thin pipe!

⇒ Fermi C2050: 448 CUDA cores, 515 Gflop/s

Performance Charts

Cholesky

QR

LU

Symmetric Matrix Inversion

Cholesky Performance 8 Intel x86 cores + 3 Tesla GPUs

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, Software for GPUs, GPU Computing Gems, vol. 2, 2011.

QR Performance

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.

8 AMD x86 cores + 4 Tesla GPUs

QR Performance

+~200 Gflop/s but 12 cores = ~150 Gflop/s

8 AMD x86 cores + 4 Tesla GPUs

Performance Breakdown

Task distribution observed on StarPU:

— sgeqrt : 20% of tasks on GPUs

— stsmqr : 92.5% of tasks on GPUs

Taking advantage of heterogeneity!

— Only do what you are good at

— Don’t do what you are not good at

Kernel    CPU           GPU           Speedup
sgeqrt    9 Gflop/s     60 Gflop/s    ~6
stsqrt    12 Gflop/s    67 Gflop/s    ~6
sormqr    8.5 Gflop/s   227 Gflop/s   ~27
stsmqr    10 Gflop/s    285 Gflop/s   ~27

LU Performance

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief and S. Tomov, ACS/IEEE International Conference on Computer Systems and Applications (best paper award), 2011.

8 Intel x86 cores + 3 Tesla GPUs

Symmetric Matrix Inversion

A^-1, Seriously???

YES!

Critical component of the variance-covariance matrix computation in statistics

Three steps:

— Cholesky factorization (DPOTRF)

— Inverting the Cholesky factor (DTRTRI)

— Calculating the product of the inverted Cholesky factor with its transpose (DLAUUM)
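The three steps follow directly from the Cholesky identity. Written out for the lower-triangular variant (an assumption; the upper variant is analogous with $A = U^{T}U$), where $W$ is just a name introduced here for the inverted factor:

```latex
\begin{aligned}
A      &= L L^{T}                                   && \text{(DPOTRF)} \\
W      &= L^{-1}                                    && \text{(DTRTRI)} \\
A^{-1} &= (L L^{T})^{-1} = L^{-T} L^{-1} = W^{T} W  && \text{(DLAUUM)}
\end{aligned}
```

Each step works on a triangular matrix in place, so the whole inversion can overwrite the triangle of $A$ that was passed in.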

Built on previous work from E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou, and L. Rosenberg

Scheduling Algorithms as DAGs

48 cores

POTRF, TRTRI and LAUUM.

The matrix is 4000 x 4000

Tile size is 200 x 200

A^-1 Performance

H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC'11, India

8 Intel x86 cores + 2 Fermi GPUs

Summary and Future Directions

Two methodologies for solving challenging DLA

Static Scheduling (performance)

Dynamic Scheduling (productivity)

LAPACK compliant API

Source codes freely available in MAGMA

What’s next?

— Extended numerical functionality

— Distributed Memory Systems

Collaborators / Support

MAGMA [Matrix Algebra on GPU and Multicore Architectures] team

http://icl.cs.utk.edu/magma/

PLASMA [Parallel Linear Algebra for Scalable Multicore Architectures] team

http://icl.cs.utk.edu/plasma

Collaborating partners

University of Tennessee, Knoxville

University of California, Berkeley

University of Colorado, Denver

INRIA, France

KAUST, Saudi Arabia

Questions?

