
MAGMA: Matrix Algebra on GPU and Multicore Architectures

Mark Gates
March 2012

Hardware trends

• Scale # cores instead of clock speed
• Hardware issue became software issue
• Multicore
• Hybrid

[Figure, shown in three builds from Kathy Yelick, “Ten Ways to Waste a Parallel Computer”: “Moore’s Law is Alive and Well”, “But Clock Frequency Scaling Has Been Replaced by Scaling Cores / Chip”, and “Performance Has Also Slowed, Along with Power (the Root Cause of All This)”. Log-scale plot over 1970–2010 of transistors (in thousands), frequency (MHz), power (W), performance, and cores. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.]


Future systems

• Most likely hybrid design
• Multicore + GPU accelerators
• Today, accelerators are attached (e.g., over PCIe)
• Future accelerators will be diverse and integrated:
  • Intel’s MIC (Knights Corner)
  • AMD’s Fusion
  • Nvidia’s Project Denver

Challenges of GPUs

• High levels of parallelism
• Hybrid architecture
  • Small, non-parallelizable tasks on CPU
  • Large, parallel tasks on GPU
• Compute vs. communication gap growing
  • Tesla C2070 has 515 Gflop/s, but 8 GB/s PCIe
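To make that gap concrete: 515 Gflop/s against 8 GB/s of PCIe bandwidth means roughly 515/8 ≈ 64 floating-point operations per byte transferred, i.e., about 500 operations per 8-byte double, just to break even on the transfer. Only operations with high arithmetic intensity, such as Level-3 BLAS, can amortize that cost.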

Excerpt from the NVIDIA CUDA C Programming Guide, Version 4.0, Chapter 1 (Introduction):

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation – exactly what graphics rendering is about – and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1-2.

[Figure 1-2. The GPU Devotes More Transistors to Data Processing: schematic CPU vs. GPU dies; the CPU spends much of its area on control logic and cache with a few ALUs, while the GPU fills its area with ALUs. Each processor has its own DRAM; the slide annotates the PCIe link between them.]

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations – the same program is executed on many data elements in parallel – with high arithmetic intensity – the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

1.2 CUDA™: a General-Purpose Parallel Computing Architecture. In November 2006, NVIDIA introduced CUDA™, a general purpose parallel computing architecture – with a new parallel programming model and instruction set architecture – that leverages the parallel compute engine in NVIDIA GPUs to …


Figure from NVIDIA CUDA C Programming Guide
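To make the data-parallel model concrete (this example is not from the original slides), here is a minimal CUDA C kernel, assuming device arrays dx and dy of length n: the same program runs on many data elements in parallel, one thread per element, with a simple index guard in place of sophisticated flow control.

__global__ void saxpy( int n, float a, const float* x, float* y )
{
    // global thread index: same program, different data element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )                     // guard the partial last block
        y[i] = a * x[i] + y[i];
}

...
// launch with enough 256-thread blocks to cover all n elements
int threads = 256;
int blocks  = (n + threads - 1) / threads;
saxpy<<< blocks, threads >>>( n, 2.0f, dx, dy );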

Software generations

LINPACK (70s): vector operations → Level-1 BLAS
LAPACK (80s): blocked, cache friendly → Level-3 BLAS
ScaLAPACK (90s): distributed memory → PBLAS, message passing
PLASMA: tiled, multicore → DAG + scheduler
MAGMA: hybrid → hybrid kernels

Nvidia’s CUBLAS

• Level 1: O(n), memory bound
• Level 2: O(n²), memory bound
  • cublasDgemv, matrix-vector multiply
• Level 3: O(n³), compute bound
  • cublasDgemm, matrix-matrix multiply
• Copy between CPU ↔ GPU
  • cublasSetMatrix / cublasGetMatrix
• Concurrent operations
  • cublasSetStream
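A minimal sketch of this workflow, using the legacy cublas.h API current at the time (declarations of the host matrices A, B, C and of n, lda, alpha, beta are elided, as in the slides; error checking omitted):

#include <cublas.h>
...
double *dA, *dB, *dC;
cublasInit();

// allocate n x n matrices on the GPU
cublasAlloc( n*n, sizeof(double), (void**) &dA );
cublasAlloc( n*n, sizeof(double), (void**) &dB );
cublasAlloc( n*n, sizeof(double), (void**) &dC );

// copy host matrices (leading dimension lda) to the GPU
cublasSetMatrix( n, n, sizeof(double), A, lda, dA, n );
cublasSetMatrix( n, n, sizeof(double), B, lda, dB, n );
cublasSetMatrix( n, n, sizeof(double), C, lda, dC, n );

// C = alpha*A*B + beta*C (Level 3, compute bound)
cublasDgemm( 'N', 'N', n, n, n, alpha, dA, n, dB, n, beta, dC, n );

// copy result back to the host
cublasGetMatrix( n, n, sizeof(double), dC, n, C, lda );

cublasFree( dA );  cublasFree( dB );  cublasFree( dC );
cublasShutdown();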

MAGMA 1.1

• LAPACK column-major data layout
• LAPACK-like C and Fortran interfaces
• CPU and GPU interfaces

MAGMA 1.1

• 50+ hybrid LAPACK algorithms
• 4 precisions (single, double, single complex, double complex)
• 3 mixed-precision algorithms
• MAGMA BLAS
  • Supplements CUBLAS
  • Improves some routines

Linear solvers

• Solve system AX = B

Type            Routine   Mixed-precision routine   CPU interface   GPU interface
General         dgesv     dsgesv                    ✓               ✓
SPD             dposv     dsposv                    ✓               ✓
Least squares   dgeqrs    dsgeqrsv                                  ✓

Eigenvalue decomposition

• Eigenvalue problem AX = XΛ
• Generalized eigenvalue problem AX = BXΛ
• Singular value decomposition A = UΣVᵀ

Matrix type   Operation     Routine             CPU interface   GPU interface
General       SVD           dgesvd              ✓
General       Eigenvalues   dgeev               ✓
Symmetric     Eigenvalues   dsyevd* / zheevd*   ✓               ✓
Symmetric     Generalized   dsygvd* / zhegvd*   ✓

* Additional variants; complete list at http://icl.eecs.utk.edu/magma/

Computational routines

• Solve one part of a problem

Matrix type   Operation           Routine           CPU interface   GPU interface
General       LU                  dgetrf            ✓               ✓
              Solve (given LU)    dgetrs                            ✓
              Inverse             dgetri                            ✓
SPD           Cholesky            dpotrf            ✓               ✓
              Solve (given LLᵀ)   dpotrs                            ✓
              Inverse             dpotri                            ✓
General       QR                  dgeqrf            ✓
              Generate Q          dorgqr / zungqr   ✓               ✓
              Multiply by Q       dormqr / zunmqr   ✓               ✓

Selected routines; complete list at http://icl.eecs.utk.edu/magma/

Naming: magma_zgesv_gpu

• magma_ or magmablas_ prefix
• Precision: s single, d double, c single complex, z double complex; mixed precision: ds and zc
• Matrix type: ge general, sy symmetric, he Hermitian, po positive definite, or orthogonal, un unitary, tr triangular
• Operation: sv solve, trf triangular factorization, ev eigenvalue problem, gv generalized eigenvalue problem
• Interface: _gpu suffix for the GPU interface; no suffix for the CPU interface

For example, magma_zgesv_gpu is the double-complex (z) general-matrix (ge) solve (sv) with the GPU interface.

One-sided factorization

• LU, Cholesky, QR factorizations for solving linear systems

[Figure, shown in several builds: at each step the panel is factored with Level-2 BLAS on the CPU, while the trailing matrix update A = QA runs with Level-3 BLAS on the GPU; a look-ahead panel is updated first so the CPU can start factoring the next panel early. The corresponding DAG shows the panel factorizations on the critical path.]
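In the same pseudocode style as the tile Cholesky later in these slides, a sketch of the hybrid pattern with look-ahead (illustrative, not MAGMA’s exact code):

// Hybrid one-sided factorization with look-ahead (sketch)
for j = 1 .. npanels
    factor panel j on CPU                   // Level-2 BLAS, on the critical path
    copy panel j to GPU
    update panel j+1 on GPU                 // look-ahead, Level-3 BLAS
    copy panel j+1 to CPU                   // CPU can start the next panel early
    update rest of trailing matrix on GPU   // Level-3 BLAS, overlaps CPU work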

Usage: CPU interface

#include <magma.h>
...
magma_int_t info;

// allocate matrices on the CPU. lda is the leading dimension, lda >= n
float* A = (float*) malloc( lda * n    * sizeof(float) );
float* B = (float*) malloc( lda * nrhs * sizeof(float) );
int*   pivots = (int*) malloc( n * sizeof(int) );

// set A and B
...

// solve AX = B with magma
magma_sgesv( n, nrhs, A, lda, pivots, B, lda, &info );

// solve AX = B with lapack
sgesv_( &n, &nrhs, A, &lda, pivots, B, &lda, &info );

Usage: GPU interface

#include <magma.h>
...
magma_int_t info;
cudaError_t err;

// allocate matrices on the GPU
float *dA, *dB;
err = cudaMalloc( (void**) &dA, lda * n    * sizeof(float) );
err = cudaMalloc( (void**) &dB, lda * nrhs * sizeof(float) );

// allocate pivots on the CPU
int* pivots = (int*) malloc( n * sizeof(int) );

// set A and B
...

// solve AX = B with magma
magma_sgesv_gpu( n, nrhs, dA, lda, pivots, dB, lda, &info );
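The solution X overwrites dB on the device. A minimal sketch of retrieving it and cleaning up, assuming a host array B as in the CPU-interface example (cublasGetMatrix as on the earlier CUBLAS slide; cudaMemcpy would work equally well):

// copy the n x nrhs solution back to the host
cublasGetMatrix( n, nrhs, sizeof(float), dB, lda, B, lda );

// clean up
cudaFree( dA );
cudaFree( dB );
free( pivots );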

[Performance plots. Keeneland system, using one node: 3 Nvidia GPUs (M2070 @ 1.1 GHz, 5.4 GB); 2 × 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).]

Mixed-precision solvers

• Factor in single precision
• Iterative refinement yields double-precision accuracy (see the sketch after the plot)

[Plot: MAGMA solve, Gflop/s vs. matrix size from 960 to 13120; curves for single precision, double precision, and iterative refinement.]
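In the slides’ pseudocode style, the idea behind the mixed-precision solvers such as dsgesv (a sketch; it assumes A is well-conditioned enough for the refinement to converge):

// Mixed-precision iterative refinement (sketch)
factor single(A) = L*U                // O(n^3) work in fast single precision
x = solve( L*U, single(b) )           // initial solution, single precision
repeat
    r = b - A*x                       // residual in double precision, O(n^2)
    z = solve( L*U, single(r) )       // correction from the existing factors
    x = x + z                         // update in double precision
until ||r|| is small enough           // yields double-precision accuracy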

Two-sided factorization

• Hessenberg, tridiagonal factorizations for eigenvalue problems

[Figure: for each column aᵢ of the panel, the CPU runs Level-2 BLAS to compute the column, the GPU runs Level-2 BLAS for the matrix-vector products yᵢ = Avᵢ, and the trailing matrix update A = QᵀAQ runs as Level-3 BLAS on the GPU.]
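A matching sketch for the two-sided case (illustrative):

// Hybrid Hessenberg reduction (sketch)
for each panel
    for each column a_i of the panel
        compute Householder vector v_i on CPU   // Level-2 BLAS
        y_i = A * v_i on GPU                    // large Level-2 BLAS product
    update trailing matrix A = Q^T A Q on GPU   // Level-3 BLAS

Unlike the one-sided case, each column step needs a full matrix-vector product with the trailing matrix, which is why this Level-2 BLAS work is itself offloaded to the GPU.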

[Plot: MAGMA Hessenberg in double precision, Gflop/s vs. matrix size. Keeneland system, using one node: 3 Nvidia GPUs (M2070 @ 1.1 GHz, 5.4 GB); 2 × 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).]

Autotuning kernels

• Parameterize code, e.g., blocksize (see the sweep sketch after the plot)
• Test 100 – 500 kernel variants
• Improved zgemm from 308 to 341 Gflop/s
• Improved up to 2× on specific rectangular shapes

[Plot: MAGMA BLAS matrix multiply performance vs. matrix size.]
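In the same pseudocode style, the shape of such an autotuning sweep (illustrative; the parameter set is hypothetical):

// Autotuning sweep (sketch)
for each blocksize nb in { 16, 32, 64, 96, 128, ... }
    instantiate the gemm kernel with blocksize nb
    for each representative matrix shape (square, tall, wide)
        time the kernel
record the fastest configuration per shape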

Scheduling DAGs

[Execution traces over time on 48 cores; matrix is 4000 × 4000, tile size is 200 × 200.]

Dynamic scheduling

• Parallelism using DAG-based runtime scheduler
• One-sided factorizations and solvers using StarPU (a sketch of the corresponding StarPU calls follows the pseudocode)

// Sequential Tile Cholesky
for k = 1 .. ntiles
    dpotrf( Akk )
    for i = k+1 .. ntiles
        dtrsm( Akk, Aik )
    for i = k+1 .. ntiles
        dsyrk( Aik, Aii )
        for j = i+1 .. ntiles
            dgemm( Ajk, Aik, Aij )

// Hybrid Tile Cholesky
for k = 1 .. ntiles
    Insert_Task( dpotrf, ... )
    for i = k+1 .. ntiles
        Insert_Task( dtrsm, ... )
    for i = k+1 .. ntiles
        Insert_Task( dsyrk, ... )
        for j = i+1 .. ntiles
            Insert_Task( dgemm, ... )
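A minimal sketch of what one Insert_Task call corresponds to in StarPU. Assumptions, not from the slides: kernel wrappers named potrf_cpu / potrf_cuda, one registered nb × nb tile with leading dimension ld, and a reasonably recent StarPU API; error handling omitted.

#include <starpu.h>

// hypothetical kernel wrappers; buffers[0] is the registered tile
void potrf_cpu ( void* buffers[], void* cl_arg ) { /* call LAPACK dpotrf on the tile */ }
void potrf_cuda( void* buffers[], void* cl_arg ) { /* call a GPU Cholesky on the tile */ }

struct starpu_codelet potrf_cl = {
    .cpu_funcs  = { potrf_cpu },   // runtime picks a CPU or GPU implementation
    .cuda_funcs = { potrf_cuda },
    .nbuffers   = 1,
    .modes      = { STARPU_RW },   // access modes let the runtime infer DAG edges
};

...
starpu_init( NULL );

// register the tile so StarPU can track its location and dependencies
starpu_data_handle_t Akk_handle;
starpu_matrix_data_register( &Akk_handle, STARPU_MAIN_RAM,
                             (uintptr_t) Akk, ld, nb, nb, sizeof(double) );

// Insert_Task( dpotrf, ... ) becomes an asynchronous task submission:
starpu_insert_task( &potrf_cl, STARPU_RW, Akk_handle, 0 );

starpu_task_wait_for_all();
starpu_data_unregister( Akk_handle );
starpu_shutdown();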

Documentation

• Nvidia guides: http://developer.nvidia.com/nvidia-gpu-computing-documentation
• MAGMA: http://icl.eecs.utk.edu/magma
• PLASMA: http://icl.eecs.utk.edu/plasma
• Routines index: http://web.eecs.utk.edu/~mgates3/docs/lapack.html

Collaborators / Support

• University of Tennessee, Knoxville

• University of California, Berkeley

• University of Colorado, Denver

• INRIA, France

• KAUST, Saudi Arabia
