MAGMA: Matrix Algebra on GPU and Multicore Architectures

Mark Gates
March 2012

Page 2: Hardware trends

• Scale the number of cores instead of the clock speed
• A hardware issue became a software issue
• Multicore
• Hybrid

[Figure: "Moore's Law is Alive and Well" (Berkeley ParLab): transistor counts in thousands, 1970 to 2010, log scale. Figure from Kathy Yelick, "Ten Ways to Waste a Parallel Computer"; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.]

Page 3: Hardware trends (continued)

[Figure: "But Clock Frequency Scaling Has Been Replaced by Scaling Cores / Chip" (Berkeley ParLab): transistors in thousands, frequency in MHz, and core counts, 1970 to 2010, log scale. Same figure and data sources as Page 2.]

Page 4: Hardware trends (continued)

[Figure: "Performance Has Also Slowed, Along with Power (the Root Cause of All This)" (Berkeley ParLab): transistors in thousands, frequency in MHz, power in W, performance, and core counts, 1970 to 2010, log scale. Same figure and data sources as Page 2.]

Page 5: Future systems

• Most likely hybrid designs
• Multicore + GPU accelerators
• Today: accelerators are attached (e.g., over PCIe)
• Future: accelerators are diverse and integrated
  • Intel's MIC (Knights Corner)
  • AMD's Fusion
  • Nvidia's Project Denver

Page 6: Challenges of GPUs

• High levels of parallelism
• Hybrid architecture
  • Small, non-parallelizable tasks on the CPU
  • Large, parallel tasks on the GPU
• Compute vs. communication gap is growing
  • A Tesla C2070 delivers 515 Gflop/s but only 8 GB/s over PCIe, i.e., roughly 500 floating-point operations in the time it takes to transfer a single double

Excerpt from the NVIDIA CUDA C Programming Guide, Version 4.0, Chapter 1 (Introduction):

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation – exactly what graphics rendering is about – and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1-2.

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations – the same program is executed on many data elements in parallel – with high arithmetic intensity – the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

1.2 CUDA™: a General-Purpose Parallel Computing Architecture. In November 2006, NVIDIA introduced CUDA™, a general-purpose parallel computing architecture – with a new parallel programming model and instruction set architecture – that leverages the parallel compute engine in NVIDIA GPUs to …

[Figure 1-2: "The GPU Devotes More Transistors to Data Processing": schematic comparing a CPU (large control logic and cache, few ALUs, DRAM) to a GPU (many ALUs, minimal cache and control logic, DRAM), connected over PCIe. Figure from the NVIDIA CUDA C Programming Guide.]

Page 7: Software generations

• LINPACK (1970s): vector operations, built on Level-1 BLAS
• LAPACK (1980s): blocked, cache friendly, built on Level-3 BLAS
• ScaLAPACK (1990s): distributed memory, built on PBLAS and message passing
• PLASMA: tiled, multicore, built on a DAG + scheduler
• MAGMA: hybrid, built on hybrid kernels

Page 8: Nvidia's CUBLAS

• Level 1: O(n) operations, memory bound
• Level 2: O(n²) operations, memory bound
  • cublasDgemv: matrix-vector multiply
• Level 3: O(n³) operations, compute bound
  • cublasDgemm: matrix-matrix multiply (see the sketch below)
• Copy between CPU ↔ GPU
  • cublasSetMatrix / cublasGetMatrix
• Concurrent operations
  • cublasSetStream
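A minimal sketch of these calls, assuming the legacy (handle-free) CUBLAS API that was current when this presentation was written; error checking and the problem values are mine, not from the slides. It copies two matrices to the GPU, multiplies them with cublasDgemm, and copies the result back:

// Compile with: nvcc example.c -lcublas
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main( void )
{
    int n = 4;  // small example size (assumed for illustration)
    double *A = (double*) malloc( n * n * sizeof(double) );
    double *B = (double*) malloc( n * n * sizeof(double) );
    double *C = (double*) malloc( n * n * sizeof(double) );
    for ( int i = 0; i < n*n; ++i ) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    cublasInit();

    // allocate matrices on the GPU
    double *dA, *dB, *dC;
    cublasAlloc( n*n, sizeof(double), (void**) &dA );
    cublasAlloc( n*n, sizeof(double), (void**) &dB );
    cublasAlloc( n*n, sizeof(double), (void**) &dC );

    // copy CPU -> GPU
    cublasSetMatrix( n, n, sizeof(double), A, n, dA, n );
    cublasSetMatrix( n, n, sizeof(double), B, n, dB, n );

    // C = 1.0 * A * B + 0.0 * C  (Level 3, compute bound)
    cublasDgemm( 'N', 'N', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n );

    // copy GPU -> CPU
    cublasGetMatrix( n, n, sizeof(double), dC, n, C, n );
    printf( "C[0] = %.1f (expect %.1f)\n", C[0], 2.0 * n );

    cublasFree( dA );  cublasFree( dB );  cublasFree( dC );
    free( A );  free( B );  free( C );
    cublasShutdown();
    return 0;
}

Note that cublasSetMatrix and cublasGetMatrix take leading dimensions for both source and destination, so submatrices of a larger array can be transferred directly.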

Page 9: MAGMA 1.1

• LAPACK column-major layout
• LAPACK-like C and Fortran interfaces
• CPU and GPU interfaces

Page 10: MAGMA 1.1 (continued)

• 50+ hybrid LAPACK algorithms
• 4 precisions (single, double, single complex, double complex)
• 3 mixed-precision algorithms
• MAGMA BLAS
  • Supplements CUBLAS
  • Improves some routines

Page 11: Linear solvers

• Solve the system AX = B

Type          | Routine | Mixed-precision routine | CPU interface | GPU interface
General       | dgesv   | dsgesv                  | ✓             | ✓
SPD           | dposv   | dsposv                  | ✓             | ✓
Least squares | dgeqrs  | dsgeqrsv                |               | ✓

Page 12: Eigenvalue decomposition

• Eigenvalue problem AX = XΛ
• Generalized eigenvalue problem AX = BXΛ
• Singular value decomposition A = UΣVᵀ

Matrix type | Operation   | Routine           | CPU interface | GPU interface
General     | SVD         | dgesvd            | ✓             |
General     | Eigenvalues | dgeev             | ✓             |
Symmetric   | Eigenvalues | dsyevd* / zheevd* | ✓             | ✓
Symmetric   | Generalized | dsygvd* / zhegvd* | ✓             |

* Additional variants; complete list at http://icl.eecs.utk.edu/magma/

Page 13: Computational routines

• Solve one part of a problem

Matrix type | Operation         | Routine         | CPU interface | GPU interface
General     | LU                | dgetrf          | ✓             | ✓
            | Solve (given LU)  | dgetrs          |               | ✓
            | Inverse           | dgetri          |               | ✓
SPD         | Cholesky          | dpotrf          | ✓             | ✓
            | Solve (given LLᵀ) | dpotrs          |               | ✓
            | Inverse           | dpotri          |               | ✓
General     | QR                | dgeqrf          | ✓             |
            | Generate Q        | dorgqr / zungqr | ✓             | ✓
            | Multiply by Q     | dormqr / zunmqr | ✓             | ✓

Selected routines; complete list at http://icl.eecs.utk.edu/magma/

Page 14: Naming scheme: magma_zgesv_gpu

• magma_ or magmablas_ prefix
• Precision: s (single), d (double), c (single complex), z (double complex); mixed precision: ds and zc
• Matrix type: general, symmetric, Hermitian, positive definite, orthogonal, unitary, triangular
• Operation: sv (solve), trf (triangular factorization), ev (eigenvalue problem), gv (generalized eigenvalue problem)
• Interface: _gpu suffix for the GPU interface

For example, magma_zgesv_gpu is MAGMA's solver (sv) for a general (ge) matrix in double complex (z) precision, with the GPU interface (_gpu).

Page 15: One-sided factorization

• LU, Cholesky, and QR factorizations for solving linear systems
• Panel factorization: Level 2 BLAS, on the CPU
• Trailing matrix update (A = QA): Level 3 BLAS, on the GPU
• Look-ahead: the next panel is factored early, overlapping with the rest of the trailing matrix update

[Figure: DAG of panel, look-ahead, and trailing-matrix tasks.]

Pages 16-20: [Animation builds of the Page 15 slide, progressively adding panel, look-ahead, and trailing-matrix tasks to the DAG; the panel factorizations form the critical path.]

Page 21: Usage: CPU interface

#include <magma.h>
...
// allocate matrices. lda is leading dimension, lda >= n
float* A      = (float*) malloc( lda * n    * sizeof(float) );
float* B      = (float*) malloc( lda * nrhs * sizeof(float) );
int*   pivots = (int*)   malloc( lda * sizeof(int) );

// set A and B
...

// solve AX = B with magma
magma_sgesv( n, nrhs, A, lda, pivots, B, n, &info );

// solve AX = B with lapack
sgesv_( &n, &nrhs, A, &lda, pivots, B, &n, &info );
Page 22: Usage: GPU interface

#include <magma.h>
...
// allocate matrices on the GPU
float *dA, *dB;
err = cudaMalloc( (void**) &dA, lda * n    * sizeof(float) );
err = cudaMalloc( (void**) &dB, lda * nrhs * sizeof(float) );

// allocate pivots on the CPU
int* pivots = (int*) malloc( lda * sizeof(int) );

// set A and B
...

// solve AX = B with magma
magma_sgesv_gpu( n, nrhs, dA, lda, pivots, dB, n, &info );
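For completeness, here is a self-contained sketch of the CPU-interface fragment above, assuming the magma_sgesv signature shown on the slide; the 2x2 test system is mine, and depending on the MAGMA version an initialization call (e.g., cublasInit or magma_init) may be required first:

#include <stdio.h>
#include <stdlib.h>
#include <magma.h>

int main( void )
{
    int n = 2, nrhs = 1, lda = 2, info = 0;

    // column-major: A = [ 2 0; 0 4 ], B = [ 2; 8 ], so X = [ 1; 2 ]
    float* A      = (float*) malloc( lda * n    * sizeof(float) );
    float* B      = (float*) malloc( lda * nrhs * sizeof(float) );
    int*   pivots = (int*)   malloc( n * sizeof(int) );
    A[0] = 2;  A[1] = 0;  A[2] = 0;  A[3] = 4;
    B[0] = 2;  B[1] = 8;

    // CPU interface: matrices start and end in CPU memory;
    // MAGMA moves data to and from the GPU internally
    magma_sgesv( n, nrhs, A, lda, pivots, B, n, &info );

    printf( "info = %d, X = [ %.1f, %.1f ]\n", info, B[0], B[1] );

    free( A );  free( B );  free( pivots );
    return 0;
}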

Page 23: [Performance figure; caption not recovered.]

Page 24: [Performance figure]

Keeneland system, using one node:
3 Nvidia GPUs (M2070 @ 1.1 GHz, 5.4 GB)
2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)

Page 25: Mixed-precision solvers

• Factor in single precision
• Iterative refinement yields double-precision accuracy (see the sketch after the figure)

[Figure: "MAGMA solve": Gflop/s vs. matrix size (960 to 13120), comparing single precision, double precision, and iterative refinement.]
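A schematic sketch of the idea, not MAGMA's implementation: the O(n³) factorization is done once in fast single precision, then a few O(n²) refinement steps, with residuals accumulated in double, recover double-precision accuracy. This toy uses an unpivoted LU on a small, diagonally dominant system (assumed for illustration):

#include <stdio.h>

#define N 3

// single-precision, in-place LU without pivoting (fine for this
// diagonally dominant demo matrix; real codes pivot)
static void lu_factor( float A[N][N] )
{
    for ( int k = 0; k < N; ++k )
        for ( int i = k+1; i < N; ++i ) {
            A[i][k] /= A[k][k];
            for ( int j = k+1; j < N; ++j )
                A[i][j] -= A[i][k] * A[k][j];
        }
}

// forward and back substitution in single precision, in place
static void lu_solve( float A[N][N], float x[N] )
{
    for ( int i = 1; i < N; ++i )
        for ( int k = 0; k < i; ++k )
            x[i] -= A[i][k] * x[k];
    for ( int i = N-1; i >= 0; --i ) {
        for ( int k = i+1; k < N; ++k )
            x[i] -= A[i][k] * x[k];
        x[i] /= A[i][i];
    }
}

int main( void )
{
    double A[N][N] = { {4,1,0}, {1,4,1}, {0,1,4} };
    double b[N]    = { 6, 12, 14 };            // exact solution x = (1,2,3)
    double x[N]    = { 0, 0, 0 };

    // factor in single precision: the O(n^3) cost
    float LU[N][N];
    for ( int i = 0; i < N; ++i )
        for ( int j = 0; j < N; ++j )
            LU[i][j] = (float) A[i][j];
    lu_factor( LU );

    // iterative refinement: O(n^2) per step, residual in double
    for ( int iter = 0; iter < 5; ++iter ) {
        float r[N];
        for ( int i = 0; i < N; ++i ) {
            double ri = b[i];
            for ( int j = 0; j < N; ++j )
                ri -= A[i][j] * x[j];
            r[i] = (float) ri;                 // residual rounded to single
        }
        lu_solve( LU, r );                     // correction solve in single
        for ( int i = 0; i < N; ++i )
            x[i] += r[i];                      // update kept in double
    }

    printf( "x = %.15f %.15f %.15f\n", x[0], x[1], x[2] );
    return 0;
}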

Page 26: Two-sided factorization

• Hessenberg and tridiagonal factorizations for eigenvalue problems
• Panel: Level 2 BLAS; factoring column aᵢ on the CPU, with the large matrix-vector products yᵢ = Avᵢ on the GPU
• Trailing matrix update (A = QᵀAQ): Level 3 BLAS on the GPU

Page 27: [Figure: MAGMA Hessenberg in double precision, Gflop/s vs. matrix size]

Keeneland system, using one node:
3 Nvidia GPUs (M2070 @ 1.1 GHz, 5.4 GB)
2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
Page 28: Autotuning kernels

• Parameterize the code, e.g., by block size (see the sketch below)
• Test 100 to 500 kernel variants
• Improved zgemm from 308 to 341 Gflop/s
• Improved performance up to 2x on specific rectangular shapes

[Figure: MAGMA BLAS matrix multiply performance; axis labels not recoverable.]
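A toy illustration of the approach, not MAGMA's autotuner: sweep one tuning parameter (the thread-block size) and time each variant with CUDA events. The kernel here is a trivial vector scale standing in for a real gemm kernel, and the candidate sizes are assumptions:

// Compile with: nvcc tune.cu
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale( float* x, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) x[i] *= 2.0f;
}

int main( void )
{
    int n = 1 << 20;
    float* dx;
    cudaMalloc( (void**) &dx, n * sizeof(float) );

    cudaEvent_t t0, t1;
    cudaEventCreate( &t0 );
    cudaEventCreate( &t1 );

    // sweep the tuning parameter; a real autotuner sweeps many
    // parameters (blocking in each dimension, unrolling, etc.)
    int blocks[] = { 64, 128, 192, 256, 512 };
    for ( int k = 0; k < 5; ++k ) {
        int nb   = blocks[k];
        int grid = (n + nb - 1) / nb;
        float ms;

        cudaEventRecord( t0, 0 );
        scale<<< grid, nb >>>( dx, n );
        cudaEventRecord( t1, 0 );
        cudaEventSynchronize( t1 );
        cudaEventElapsedTime( &ms, t0, t1 );
        printf( "block size %3d: %.3f ms\n", nb, ms );
    }

    cudaEventDestroy( t0 );
    cudaEventDestroy( t1 );
    cudaFree( dx );
    return 0;
}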

Page 29: Scheduling DAGs

[Figure: execution traces over time on 48 cores; the matrix is 4000 x 4000 with 200 x 200 tiles.]

Page 30: Dynamic scheduling

• Parallelism using a DAG-based runtime scheduler
• One-sided factorizations and solvers using StarPU (a runnable toy of the sequential version follows)

// Sequential Tile Cholesky
for k = 1 .. ntiles
    dpotrf( Akk )
    for i = k+1 .. ntiles
        dtrsm( Akk, Aik )
    for i = k+1 .. ntiles
        dsyrk( Aik, Aii )
        for j = i+1 .. ntiles
            dgemm( Ajk, Aik, Aij )

// Hybrid Tile Cholesky
for k = 1 .. ntiles
    Insert_Task( dpotrf, ... )
    for i = k+1 .. ntiles
        Insert_Task( dtrsm, ... )
    for i = k+1 .. ntiles
        Insert_Task( dsyrk, ... )
        for j = i+1 .. ntiles
            Insert_Task( dgemm, ... )
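To make the loop structure concrete, here is a runnable toy of the sequential version with 1x1 "tiles", so each tile kernel collapses to a scalar operation; MAGMA and PLASMA of course use real blocked kernels on b x b tiles, and the test matrix is an assumption:

#include <stdio.h>
#include <math.h>

#define NT 3   // number of tiles (here: scalars)

int main( void )
{
    // symmetric positive definite matrix; lower triangle is stored
    double A[NT][NT] = { { 4, 0, 0 },
                         { 2, 5, 0 },
                         { 0, 1, 3 } };

    for ( int k = 0; k < NT; ++k ) {
        A[k][k] = sqrt( A[k][k] );               // dpotrf( Akk )
        for ( int i = k+1; i < NT; ++i )
            A[i][k] /= A[k][k];                  // dtrsm( Akk, Aik )
        for ( int i = k+1; i < NT; ++i ) {
            A[i][i] -= A[i][k] * A[i][k];        // dsyrk( Aik, Aii )
            for ( int j = i+1; j < NT; ++j )
                A[j][i] -= A[j][k] * A[i][k];    // dgemm: update tile (j,i)
        }
    }

    // print the Cholesky factor L, where A = L * L^T
    for ( int i = 0; i < NT; ++i ) {
        for ( int j = 0; j <= i; ++j )
            printf( "%8.4f ", A[i][j] );
        printf( "\n" );
    }
    return 0;
}

Each assignment above reads only tiles produced by earlier iterations, which is exactly the dependence information the Insert_Task version hands to the runtime scheduler as a DAG.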

Page 31: Documentation

• Nvidia guides: http://developer.nvidia.com/nvidia-gpu-computing-documentation
• MAGMA: http://icl.eecs.utk.edu/magma
• PLASMA: http://icl.eecs.utk.edu/plasma
• Routines index: http://web.eecs.utk.edu/~mgates3/docs/lapack.html

Page 32: Collaborators / Support

• University of Tennessee, Knoxville
• University of California, Berkeley
• University of Colorado, Denver
• INRIA, France
• KAUST, Saudi Arabia
