Page 1: …higham/talks/essam18.pdf · 2018. 6. 2.

Research Matters

February 25, 2009

Nick Higham, Director of Research

School of Mathematics

Multiprecision Algorithms

Nick Higham, School of Mathematics

The University of Manchester

http://www.maths.manchester.ac.uk/~higham/
@nhigham, nickhigham.wordpress.com

Mathematical Modelling, Numerical Analysis and Scientific Computing, Kácov, Czech Republic

May 27–June 1, 2018.

Page 2:

Outline

Multiprecision arithmetic: floating point arithmetic supporting multiple, possibly arbitrary, precisions.

Applications of & support for low precision.

Applications of & support for high precision.

How to exploit different precisions to achieve faster algorithms with higher accuracy.

Focus on iterative refinement for Ax = b and the matrix logarithm.

Download these slides from http://bit.ly/kacov18

Nick Higham Multiprecision Algorithms 2 / 95

Page 3:

Lecture 1

Floating-point arithmetic.
Hardware landscape.
Low precision arithmetic.

Page 5:

Floating Point Number System

Floating point number system F ⊂ R:

y = ±m × β^(e−t), 0 ≤ m ≤ β^t − 1.

Base β (β = 2 in practice), precision t, exponent range emin ≤ e ≤ emax.

Assume normalized: m ≥ β^(t−1).

Floating point numbers are not equally spaced.

If β = 2, t = 3, emin = −1, and emax = 3, the elements of F lie between 0 and 7, with the spacing doubling at each power of 2.
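The toy system can be enumerated directly; a short Python sketch (the variable names are chosen here for illustration):

```python
# Enumerate the toy system: beta = 2, t = 3, emin = -1, emax = 3.
# Normalized numbers have significand 4 <= m <= 7 and value m * 2**(e-3).
beta, t, emin, emax = 2, 3, -1, 3
F = sorted({m * float(beta) ** (e - t)
            for e in range(emin, emax + 1)
            for m in range(beta ** (t - 1), beta ** t)})
print(F[:4])   # smallest normalized numbers, spaced 1/16
print(F[-4:])  # largest numbers 4.0, 5.0, 6.0, 7.0, spaced 1
# The spacing doubles at each power of 2: the numbers are not equally spaced.
```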


Page 7:

Subnormal Numbers

0 ≠ y ∈ F is normalized if m ≥ β^(t−1). Unique representation.

Subnormal numbers have the minimum exponent and are not normalized:

y = ±m × β^(emin−t), 0 < m < β^(t−1).

They carry fewer digits of precision than the normalized numbers.

Subnormal numbers fill the gap between β^(emin−1) and 0 and are equally spaced. Including subnormals in our toy system fills in the gap between 0 and 0.25 with equally spaced points.
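The same behaviour is easy to observe in IEEE fp16 (next slide); a Python sketch, assuming numpy is available:

```python
import numpy as np

# In fp16 the smallest normalized number is 2**-14 and the smallest
# subnormal is 2**-24.
print(np.finfo(np.float16).tiny)     # smallest normalized, ~6.10e-5
sub = np.float16(2.0 ** -24)
print(sub)                           # ~5.96e-8, representable as a subnormal
# 2**-25 is exactly halfway between 0 and the smallest subnormal, so
# round-to-nearest-even flushes it to zero:
print(np.float16(2.0 ** -25))        # 0.0
# Subnormals are equally spaced, with gap 2**-24:
print(np.float16(3 * 2.0 ** -24) - np.float16(2 * 2.0 ** -24))
```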

Page 8:

IEEE Standard 754-1985 and 2008 Revision

Type       Size      Range      u = 2^−t
half       16 bits   10^±5      2^−11 ≈ 4.9 × 10^−4
single     32 bits   10^±38     2^−24 ≈ 6.0 × 10^−8
double     64 bits   10^±308    2^−53 ≈ 1.1 × 10^−16
quadruple  128 bits  10^±4932   2^−113 ≈ 9.6 × 10^−35

Arithmetic operations (+, −, ∗, /, √) are performed as if first calculated to infinite precision, then rounded.
Default: round to nearest, round to even in case of a tie.
Half precision is a storage format only.
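The u column can be checked from numpy's format metadata (numpy has no quadruple type on most platforms, so only the first three rows are covered); eps is the gap from 1 to the next float, so u = eps/2 = 2^−t:

```python
import numpy as np

# Check u = eps/2 = 2**-t for half, single and double precision.
for ftype, t in [(np.float16, 11), (np.float32, 24), (np.float64, 53)]:
    fi = np.finfo(ftype)
    u = float(fi.eps) / 2
    print(f"{ftype.__name__}: {fi.bits} bits, u = {u:.2e}")
    assert u == 2.0 ** -t
```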

Page 9:

Relative Error

If x̂ ≈ x ∈ R^n, the relative error is

‖x − x̂‖ / ‖x‖.

The absolute error ‖x − x̂‖ is scale dependent.

A common error is to fail to normalize errors and residuals.

Page 10:

Rounding

For x ∈ R, fl(x) is an element of F nearest to x, and the transformation x → fl(x) is called rounding (to nearest).

Theorem. If x ∈ R lies in the range of F then

fl(x) = x(1 + δ), |δ| ≤ u.

u := (1/2)β^(1−t) is the unit roundoff, or machine precision.

The machine epsilon, ε_M = β^(1−t), is the spacing between 1 and the next larger floating point number (eps in MATLAB).
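In Python, whose floats are IEEE doubles, the distinction between u = 2^−53 and ε_M = 2^−52 can be seen directly:

```python
import sys

# Python floats are IEEE doubles: beta = 2, t = 53, so u = 2**-53 and
# machine epsilon eps = 2**-52.
eps = sys.float_info.epsilon
u = eps / 2
print(eps == 2.0 ** -52)                 # True
# 1 + u lies exactly halfway between 1 and 1 + eps; the tie rounds to
# even, i.e. back to 1, so fl(1 + u) == 1:
print(1.0 + u == 1.0)                    # True
# Anything strictly beyond the halfway point rounds up:
print(1.0 + u * (1 + 2.0 ** -10) > 1.0)  # True
```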

Page 11:

Model vs Correctly Rounded Result

y = x(1 + δ) with |δ| ≤ u does not imply y = fl(x).

With β = 10 and t = 2, so that u = (1/2) × 10^(1−t) = 5.00e-2:

x      y    |x − y|/x
9.185  8.7  5.28e-2
9.185  8.8  4.19e-2
9.185  8.9  3.10e-2
9.185  9.0  2.01e-2
9.185  9.1  9.25e-3
9.185  9.2  1.63e-3
9.185  9.3  1.25e-2
9.185  9.4  2.34e-2
9.185  9.5  3.43e-2
9.185  9.6  4.52e-2
9.185  9.7  5.61e-2

Page 12:

Model for Rounding Error Analysis

For x, y ∈ F,

fl(x op y) = (x op y)(1 + δ), |δ| ≤ u, op = +, −, ∗, /.

Also holds for op = √.

Sometimes it is more convenient to use

fl(x op y) = (x op y)/(1 + δ), |δ| ≤ u, op = +, −, ∗, /.

The model is weaker than fl(x op y) being correctly rounded.


Page 14:

Precision versus Accuracy

fl(abc) = ab(1 + δ1) · c(1 + δ2), |δi| ≤ u,
        = abc(1 + δ1)(1 + δ2)
        ≈ abc(1 + δ1 + δ2).

Precision = u.
Accuracy ≈ 2u.

Accuracy is not limited by precision.


Page 16:

Page 17:

Fused Multiply-Add Instruction

A multiply-add instruction with just one rounding error:

fl(x + y ∗ z) = (x + y ∗ z)(1 + δ), |δ| ≤ u.

With an FMA:

The inner product x^T y can be computed with half the rounding errors.

In the IEEE 2008 standard.

Supported by much hardware, including the NVIDIA Pascal and Volta architectures (P100, V100) at fp16.

Page 18:

Fused Multiply-Add Instruction (cont.)

The algorithm of Kahan

1  w = b ∗ c
2  e = w − b ∗ c     (with an FMA, e is exactly the rounding error in w)
3  x = (a ∗ d − w) + e

computes x = det([a b; c d]) = ad − bc with high relative accuracy.

But

What does a*d + c*b mean?

The product (x + iy)(x − iy) = x^2 + y^2 + i(xy − yx) may evaluate to non-real with an FMA.

b^2 − 4ac can evaluate negative even when b^2 ≥ 4ac.
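A sketch of Kahan's algorithm in single precision, with the FMA simulated through double arithmetic (exact for these operands, though a simulated FMA can differ from hardware in rare double-rounding cases); the matrix entries are contrived so that ad − bc = 1 with severe cancellation:

```python
import numpy as np

def fma32(a, b, c):
    # Simulated fp32 fused multiply-add: a*b is exact in double
    # (24 + 24 significand bits), so one double add plus one rounding
    # to fp32 behaves like a hardware FMA for these operands.
    return np.float32(float(a) * float(b) + float(c))

def kahan_det(a, b, c, d):
    w = np.float32(b) * np.float32(c)  # w = fl(b*c)
    e = fma32(-b, c, w)                # e = w - b*c, the rounding error in w
    return fma32(a, d, -w) + e         # x = (a*d - w) + e ~ ad - bc

# n**2 - (n+1)(n-1) = 1, with all four entries exact in fp32 (< 2**24):
a = d = np.float32(12345678.0)
b, c = np.float32(12345679.0), np.float32(12345677.0)
naive = a * d - b * c                  # plain fp32: catastrophic cancellation
print(naive, kahan_det(a, b, c, d))    # Kahan's result is exactly 1.0
```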

Page 19:

References for Floating-Point

Handbook of Floating-Point Arithmetic, second edition, by Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol and Serge Torres.

Page 20:

ARM NEON

Page 21:

NVIDIA Tesla P100 (2016), V100 (2017)

"The Tesla P100 is the world's first accelerator built for deep learning, and has native hardware ISA support for FP16 arithmetic."

V100 tensor cores do a 4 × 4 matrix multiplication in one clock cycle.

TFLOPS   double  single  half/tensor
P100     4.7     9.3     18.7
V100     7       14      112

Page 22:

AMD Radeon Instinct MI25 GPU (2017)

"24.6 TFLOPS FP16 or 12.3 TFLOPS FP32 peak GPU compute performance on a single board . . . Up to 82 GFLOPS/watt FP16 or 41 GFLOPS/watt FP32 peak GPU compute performance"

Page 23:

Low Precision in Machine Learning

Widespread use of low precision, for training and inference:

single precision (fp32)  32 bits
half precision (fp16)    16 bits
integer (INT8)           8 bits
ternary                  {−1, 0, 1}
binary                   {0, 1}

plus other newly proposed floating-point formats.

"We find that very low precision is sufficient not just for running trained networks but also for training them." — Courbariaux, Bengio & David (2015)

No rigorous rounding error analysis exists (yet).

Papers are usually experimental, using particular data sets.

Page 24:

Why Does Low Precision Work in ML?

We're solving the wrong problem (Scheinberg, 2016), so we don't need an accurate solution.

Low precision provides regularization.

Low precision encourages flat minima to be found.

Page 25:

Deep Learning for Java

Page 26:

Climate Modelling

T. Palmer, More reliable forecasts with less precise computations: a fast-track route to cloud-resolved weather and climate simulators?, Phil. Trans. R. Soc. A, 2014:

"Is there merit in representing variables at sufficiently high wavenumbers using half or even quarter precision floating-point numbers?"

T. Palmer, Build imprecise supercomputers, Nature, 2015.

Page 27:

Fp16 for Communication Reduction

ResNet-50 training on ImageNet.

Solved in 60 minutes on 256 Tesla P100s at Facebook (2017).

Solved in 15 minutes on 1024 Tesla P100s at Preferred Networks, Inc. (2017) using ChainerMN (Takuya Akiba, SIAM PP18):

"While computation was generally done in single precision, in order to reduce the communication overhead during all-reduce operations, we used half-precision floats . . . In our preliminary experiments, we observed that the effect from using half-precision in communication on the final model accuracy was relatively small."


Page 29:

Preconditioning with Adaptive Precision

Anzt, Dongarra, Flegar, Higham & Quintana-Ortí (2018):

For sparse A and an iterative Ax = b solver, execution time and energy are dominated by data movement.

Block Jacobi preconditioning: D = diag(Di), where Di = Aii. Solve D^(−1)Ax = D^(−1)b.

All computations are at fp64.

Compute D^(−1) and store each D_i^(−1) in fp16, fp32 or fp64, depending on κ(Di).

Simulations and energy modelling show this can outperform a fixed precision preconditioner.
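A minimal sketch of the adaptive idea in Python with numpy; the condition-number thresholds below are illustrative placeholders, not those used by Anzt et al.:

```python
import numpy as np

def choose_dtype(kappa):
    # Illustrative thresholds only (not those of Anzt et al. 2018):
    if kappa < 1e2:
        return np.float16
    if kappa < 1e6:
        return np.float32
    return np.float64

rng = np.random.default_rng(0)
# Diagonally dominant blocks standing in for the Di = Aii:
blocks = [rng.standard_normal((4, 4)) + 5 * np.eye(4) for _ in range(3)]
stored = []
for Di in blocks:
    inv = np.linalg.inv(Di)                                   # inverted in fp64...
    stored.append(inv.astype(choose_dtype(np.linalg.cond(Di))))  # ...stored adaptively
print([S.dtype.name for S in stored])
```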

Page 30:

Range Parameters

r_min^(s) = smallest positive (subnormal) number,
r_min = smallest normalized positive number,
r_max = largest finite number.

       r_min^(s)       r_min           r_max
fp16   5.96 × 10^−8    6.10 × 10^−5    65504
fp32   1.40 × 10^−45   1.18 × 10^−38   3.40 × 10^38
fp64   4.94 × 10^−324  2.22 × 10^−308  1.80 × 10^308

Page 31:

Example: Vector 2-Norm in fp16

Evaluate ‖x‖2 for x = [α, α]^T as sqrt(x1^2 + x2^2) in fp16.

Recall u_h = 4.88 × 10^−4 and r_min = 6.10 × 10^−5.

α            Relative error  Comment
10^−4        1               Underflow to 0.
3.3 × 10^−4  4.7 × 10^−2     Subnormal range.
5.5 × 10^−4  7.1 × 10^−3     Subnormal range.
1.1 × 10^−2  1.4 × 10^−4     Perfect rel. err.
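The first row of the table is reproducible with numpy's fp16 type; scaling by the largest component, a standard remedy, avoids the underflow:

```python
import numpy as np

# alpha = 1e-4: alpha**2 ~ 1e-8 lies below half the smallest fp16
# subnormal (2**-25 ~ 2.98e-8), so both squares underflow to zero.
alpha = np.float16(1e-4)
naive = np.sqrt(alpha * alpha + alpha * alpha)
print(naive)                        # 0.0, a relative error of 1
# Scaling by the largest component first avoids the underflow:
m = alpha
scaled = m * np.sqrt((alpha / m) ** 2 + (alpha / m) ** 2)
print(scaled)                       # close to sqrt(2) * 1e-4
```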

Page 32:

A Simple Loop

x = pi; i = 0;
while x/2 > 0
    x = x/2; i = i+1;
end
for k = 1:i
    x = 2*x;
end

Precision  i     |x − π|
Double     1076  0.858
Single     151   0.858
Half       26    0.858

Why these large errors?
Why the same error for each precision?
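Pure Python floats are IEEE doubles, so the loop can be replayed for the Double row:

```python
import math

# Halving drives x down through the subnormals to 2**-1074, shedding
# significand bits on the way; doubling back cannot restore them, so x
# returns as a power of two and the error ~0.858 is |4 - pi|.
x, i = math.pi, 0
while x / 2 > 0:
    x = x / 2
    i += 1
for k in range(i):
    x = 2 * x
print(i, x, abs(x - math.pi))   # compare with the Double row of the table
```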

Page 34:

Error Analysis in Low Precision (1)

For an inner product x^T y of n-vectors the standard error bound is

| fl(x^T y) − x^T y | ≤ γ_n |x|^T |y|, γ_n = nu/(1 − nu), nu < 1.

It can also be written as

| fl(x^T y) − x^T y | ≤ nu |x|^T |y| + O(u^2).

In half precision, u ≈ 4.9 × 10^−4, so nu = 1 for n = 2048.

What happens when nu > 1?
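An illustrative check in Python with numpy: accumulate an inner product entirely in fp16 and compare the error with the first-order bound nu|x|^T|y| (random data, so the actual error sits far below the worst case):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.random(n).astype(np.float16)
y = rng.random(n).astype(np.float16)

# Accumulate entirely in fp16 (np.dot may use a wider accumulator,
# hence the explicit loop).
s = np.float16(0)
for xi, yi in zip(x, y):
    s = s + xi * yi                 # fp16 multiply and fp16 add

exact = float(x.astype(np.float64) @ y.astype(np.float64))
u = 2.0 ** -11
bound = n * u * float(np.abs(x).astype(np.float64) @ np.abs(y).astype(np.float64))
err = abs(float(s) - exact)
print(err, bound)                   # err is well below the worst-case bound
```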

Page 36:

Error Analysis in Low Precision (2)

Rump & Jeannerod (2014) prove that in a number of standard rounding error bounds, γ_n = nu/(1 − nu) can be replaced by nu provided that round to nearest is used.

The analysis is nontrivial. Only a few core algorithms have been analyzed.

The backward error bound for Ax = b is now (3n − 2)u + (n^2 − n)u^2 instead of γ_3n.

Cannot replace γ_n by nu in all algorithms (e.g., pairwise summation).

Once nu ≥ 1 the bounds cannot guarantee any accuracy, maybe not even a correct exponent!


Page 38:

Simulating fp16 Arithmetic

Simulation 1. Convert the operands to fp32 or fp64, carry out the operation in fp32 or fp64, then round the result back to fp16.

Simulation 2. Scalar fp16 operations as in Simulation 1, but carry out matrix multiplication and matrix factorization in fp32 or fp64, then round the result back to fp16.

Page 39:

MATLAB fp16 Class (Moler)

The Cleve Laboratory fp16 class uses Simulation 2 for mtimes (called in lu) and mldivide.
http://mathworks.com/matlabcentral/fileexchange/59085-cleve-laboratory

function z = plus(x,y)
    z = fp16(double(x) + double(y));
end
function z = mtimes(x,y)
    z = fp16(double(x) * double(y));
end
function z = mldivide(x,y)
    z = fp16(double(x) \ double(y));
end
function [L,U,p] = lu(A)
    [L,U,p] = lutx(A);
end

Page 40:

Is Simulation 2 Too Accurate?

For matrix multiplication, the standard error bound is

|Ĉ − C| ≤ γ_n |A||B|, γ_n = nu/(1 − nu).

The error bound for Simulation 2 has no n factor.

For triangular solves Tx = b, the error should be bounded by cond(T, x)γ_n, but we actually get an error of order u.

Simulation 1 is preferable, but too slow unless the problem is fairly small. Operator overloading has large overheads in any language.

Page 41:

Lecture 2

Rounding error analysis.
Higher precision.
Multiprecision algorithm for log A.

Download these slides from http://bit.ly/kacov18

Page 42:

James Hardy Wilkinson (1919–1986)

Page 43:

Page 44:

Summation

Compute sn = Σ_{i=1}^n xi. Recursive summation:

1  s1 = x1
2  for i = 2:n
3      si = si−1 + xi
4  end

s2 = fl(x1 + x2) = (x1 + x2)(1 + δ2), |δ2| ≤ u.

s3 = fl(s2 + x3) = (s2 + x3)(1 + δ3), |δ3| ≤ u
   = (x1 + x2)(1 + δ2)(1 + δ3) + x3(1 + δ3).

Etc.

Page 45:

General Summation Algorithm

Algorithm (to compute sn = Σ_{i=1}^n xi)

1  Let X = {x1, . . . , xn}.
2  while X contains more than one element
3      Remove two numbers y and z from X and put their sum y + z back in X.
4  end
5  Assign the remaining element of X to sn.

Write the ith execution of the loop as pi = yi + zi. Then the computed pi satisfies

pi = (yi + zi)/(1 + δi), |δi| ≤ u, i = 1 : n − 1,

where u is the unit roundoff.

Page 46:

General Summation Algorithm (cont.)

The error in forming pi, namely yi + zi − pi, is therefore δi pi. Summing the individual errors, we get the overall error

en := sn − ŝn = Σ_{i=1}^{n−1} δi pi,

where ŝn denotes the computed sum, which is bounded by

|en| ≤ u Σ_{i=1}^{n−1} |pi|.

The bound shows we should minimize the size of the intermediate partial sums.
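An illustrative consequence in fp16 (a Python sketch with numpy; the data are contrived): adding the small terms first keeps the partial sums exact, while adding the large term first loses every subsequent increment:

```python
import numpy as np

def recsum16(v):
    # Recursive summation carried out entirely in fp16.
    s = np.float16(0)
    for t in v:
        s = s + np.float16(t)
    return float(s)

vals = [1000.0] + [0.25] * 1000
exact = 1250.0
inc = recsum16(sorted(vals))                 # small terms first
dec = recsum16(sorted(vals, reverse=True))   # large term first
# Near 1000 the fp16 spacing is 0.5, so 1000 + 0.25 ties and rounds
# back to 1000 on every step of the decreasing-order sum.
print(inc, dec, exact)
```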

Page 47:

The Difficulty of Rounding Error Analysis

Deciding what to try to prove.

Type of analysis:
  componentwise backward error
  normwise forward error
  mixed backward–forward error

Knowing what model to use.

Keep or discard second order terms?

Ignore the possibility of underflow and overflow?

Assume real data?

Page 48:

Multiprecision Rounding Error Analysis

Increasing motivation to mix different precisions: half, single, double, even quarter and quadruple.

The NVIDIA V100 tensor core takes fp16 matrices as input and computes the product using FMAs at fp32; the result is returned at fp16 or fp32.

We need parametrized error analyses that allow different precisions to be used in different components of an algorithm. Hard to cover all possibilities!

See the analysis in the third lecture on iterative refinement (3 precisions).

Page 49:

Need for Higher Precision

He and Ding, Using Accurate Arithmetics to Improve Numerical Reproducibility and Stability in Parallel Applications, 2001.

Bailey, Barrio & Borwein, High-Precision Computation: Mathematical Physics & Dynamics, 2012.

Khanna, High-Precision Numerical Simulations on a CUDA GPU: Kerr Black Hole Tails, 2013.

Beliakov and Matiyasevich, A Parallel Algorithm for Calculation of Determinants and Minors Using Arbitrary Precision Arithmetic, 2016.

Ma and Saunders, Solving Multiscale Linear Programs Using the Simplex Method in Quadruple Precision, 2015.

Page 50:

Aside: Rounding Errors in Digital Imaging

An 8-bit RGB image is an m × n × 3 array of integers aijk ∈ {0, 1, . . . , 255}.

Every editing operation (levels, curves, colour balance, . . . ) aijk ← round(fijk(aijk)) incurs a rounding error.

Should we edit in 16-bit?


Page 52:

16-bit vs. 8-bit editing

SLR cameras generate the image at 12–14 bits internally.

Levels:
8-bit   0 : 1 : 255
16-bit  0 : 3.9 × 10^−3 : 255

Controversial: it is hard to find a realistic image where using 16 bits makes any practical difference in quality.

The relevant metric is not normwise relative error!


Page 55:

Increasing the Precision

y = e^(π√163) evaluated at t digit precision:

t    y
20   262537412640768744.00
25   262537412640768744.0000000
30   262537412640768743.999999999999

Is the last digit before the decimal point 4?

t    y
35   262537412640768743.99999999999925007
40   262537412640768743.9999999999992500725972

So no, it's 3!


Page 57:

Another Example

Consider the evaluation in precision u = 2^−t of

y = x + a sin(bx), x = 1/7, a = 10^−8, b = 2^24.

[Figure: error versus t for t = 10 to 40; the error ranges between about 10^−14 and 10^−4 and does not decrease monotonically as the precision t increases.]

Page 58:

Myth
Increasing the precision at which a computation is performed increases the accuracy of the answer.

Page 59:

Going to Higher Precision

If we have quadruple or higher precision, how can we modify existing algorithms to exploit it?

To what extent are existing algorithms precision-independent?

Newton-type algorithms: just decrease the tolerance?

How little higher precision can we get away with?

Gradually increase the precision through the iterations?

For Krylov methods the number of iterations can depend on the precision, so lower precision might not give the fastest computation!


Page 61: Nick Higham Research Matters School of Mathematics The …higham/talks/essam18.pdf · 2018. 6. 2. · “The Tesla P100 is the world’s first accelerator built for deep learning,

Availability of Multiprecision in Software

Maple, Mathematica, PARI/GP, Sage.

MATLAB: Symbolic Math Toolbox, MultiprecisionComputing Toolbox (Advanpix).

Julia: BigFloat.

Mpmath and SymPy for Python.

GNU MP Library.

GNU MPFR Library.

(Quad only): some C, Fortran compilers.

Gone, but not forgotten:

Numerical Turing: Hull et al., 1985.


Cost of Quadruple Precision

How fast is quadruple precision arithmetic? Compare:

MATLAB double precision;
Symbolic Math Toolbox, VPA arithmetic, digits(34);
Multiprecision Computing Toolbox (Advanpix), mp.Digits(34), optimized for quad.

Ratios of times (Intel Broadwell-E Core i7-6800K @3.40GHz, 6 cores):

                         mp/double   vpa/double   vpa/mp
LU, n = 250                     98       25,000      255
eig, nonsymm, n = 125           75        6,020       81
eig, symm, n = 200              32       11,100      342


Matrix Functions

(Inverse) scaling and squaring-type algorithms for e^A, log A, cos A, A^t use Padé approximants.

Padé degree and algorithm parameters chosen to achieve double precision accuracy, u = 2^-53.

Change u and the algorithm logic needs changing!

Open questions, even for scalar elementary functions?

Now focus on matrix logarithm.


Existing Software

MATLAB has built-in expm, logm, sqrtm, funm.
We have written cosm, sinm, signm, powerm, lambertwm, unwindm, ...
Julia has expm, logm, sqrtm.
NAG Library has 42+ f(A) codes.

H & Deadman, A Catalogue of Software for Matrix Functions (2016).


Log Tables

Henry Briggs (1561–1630): Arithmetica Logarithmica (1624).
Logarithms to base 10 of 1–20,000 and 90,000–100,000 to 14 decimal places.

"Briggs must be viewed as one of the great figures in numerical analysis."
—Herman H. Goldstine, A History of Numerical Analysis (1977)

Name          Year   Range       Decimal places
R. de Prony   1801   1–10,000    19
Edward Sang   1875   1–20,000    28


The Scalar Logarithm: Comm. ACM

J. R. Herndon (1961). Algorithm 48: Logarithm of a complex number.

A. P. Relph (1962). Certification of Algorithm 48: Logarithm of a complex number.

M. L. Johnson and W. Sangren (1962). Remark on Algorithm 48: Logarithm of a complex number.

D. S. Collens (1964). Remark on remarks on Algorithm 48: Logarithm of a complex number.

D. S. Collens (1964). Algorithm 243: Logarithm of a complex number: Rewrite of Algorithm 48.


Principal Logarithm and pth Root

Let A ∈ C^(n×n) have no eigenvalues on R^-.

Principal log: X = log A denotes the unique X such that

e^X = A,    −π < Im λ(X) < π for every eigenvalue λ of X.

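These two defining conditions are easy to check numerically; a small sketch using SciPy's logm and expm (the choice of test matrix is arbitrary):

```python
import numpy as np
from scipy.linalg import expm, logm

# A real matrix with no eigenvalues on the closed negative real axis.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

X = logm(A)   # principal logarithm

assert np.allclose(expm(X), A)                            # e^X = A
assert np.all(np.abs(np.linalg.eigvals(X).imag) < np.pi)  # -pi < Im(lambda(X)) < pi
```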

The Average Eye

First order character of optical system characterized by transference matrix

T = [ S  δ ]
    [ 0  1 ]  ∈ R^(5×5),

where S ∈ R^(4×4) is symplectic:

S^T J S = J = [ 0     I_2 ]
              [ −I_2  0   ].

The average m^(-1) Σ_{i=1}^m T_i is not a transference matrix.

Harris (2005) proposes the average exp(m^(-1) Σ_{i=1}^m log T_i).

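Harris's average is straightforward to prototype; a sketch with SciPy's logm/expm on generic stand-in matrices (real transference matrices carry the symplectic structure above, which this toy example ignores):

```python
import numpy as np
from scipy.linalg import expm, logm

def log_euclidean_mean(mats):
    """Harris's average exp(m^-1 sum_i log T_i) of the matrices T_i."""
    return expm(sum(logm(T) for T in mats) / len(mats))

# Unit upper triangular stand-ins (real transference matrices are 5x5
# with a symplectic block; that structure is ignored in this toy example).
T1 = np.array([[1.0, 0.3],
               [0.0, 1.0]])
T2 = np.array([[1.0, -0.1],
               [0.0, 1.0]])

M = log_euclidean_mean([T1, T2])
# Averaging copies of a single matrix must reproduce that matrix.
assert np.allclose(log_euclidean_mean([T1, T1]), T1)
```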


Inverse Scaling and Squaring

log A = log((A^(1/2^s))^(2^s)) = 2^s log(A^(1/2^s))

Algorithm: transformation-free inverse scaling and squaring
1: s ← 0
2: while ‖A − I‖_2 ≥ 1 do
3:   A ← A^(1/2)
4:   s ← s + 1
5: end while
6: Find Padé approximant r_km to log(1 + x)
7: return 2^s r_km(A − I)

What degree for the Padé approximants?
Should we take more square roots once ‖A − I‖_2 < 1?


Inverse Scaling and Squaring

log A = log((A^(1/2^s))^(2^s)) = 2^s log(A^(1/2^s))

Algorithm: Schur–Padé inverse scaling and squaring
1: s ← 0
2: Compute the Schur decomposition A := QTQ*
3: while ‖T − I‖_2 ≥ 1 do
4:   T ← T^(1/2)
5:   s ← s + 1
6: end while
7: Find Padé approximant r_km to log(1 + x)
8: return 2^s Q r_km(T − I) Q*

What degree for the Padé approximants?
Should we take more square roots once ‖T − I‖_2 < 1?

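The transformation-free variant can be sketched directly in NumPy/SciPy; here a truncated Taylor series for log(I + X) stands in for the Padé approximant r_km, and the threshold 1/4 and degree 25 are arbitrary choices, not the tuned parameters of the algorithm:

```python
import numpy as np
from scipy.linalg import sqrtm, logm

def logm_iss(A, taylor_terms=25):
    """Transformation-free inverse scaling and squaring sketch:
    take square roots until ||A - I||_2 < 1/4, approximate
    log(I + X) by a truncated Taylor series (a stand-in for the
    Pade approximant r_km), then scale back by 2^s."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    I = np.eye(n)
    s = 0
    while np.linalg.norm(A - I, 2) >= 0.25:
        A = sqrtm(A)
        s += 1
    X = A - I
    L = np.zeros_like(X)
    P = I.copy()
    for k in range(1, taylor_terms + 1):   # log(I+X) = X - X^2/2 + X^3/3 - ...
        P = P @ X
        L += ((-1) ** (k + 1) / k) * P
    return (2 ** s) * L

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
assert np.allclose(logm_iss(A), logm(A), atol=1e-8)
```

For this symmetric positive definite test matrix the sketch agrees with scipy.linalg.logm to roughly double precision.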

Bounding the Error

Kenney & Laub (1989) derived the absolute f'err bound

‖log(I + X) − r_km(X)‖ ≤ |log(1 − ‖X‖) − r_km(−‖X‖)|.

Al-Mohy, H & Relton (2012, 2013) derive a b'err bound. The strategy requires symbolic and high precision computation (done off-line) and obtains parameters dependent on u.

Fasi & H (2018) use a relative f'err bound based on ‖log(I + X)‖ ≈ ‖X‖, since ‖X‖ ≪ 1 is possible.

Strategy: use the f'err bound to minimize s subject to the f'err being bounded by u (arbitrary) for suitable m.


Sharper Error Bound for Nonnormal Matrices

Al-Mohy & H (2009) note that

‖ Σ_{i=ℓ}^∞ c_i A^i ‖ ≤ Σ_{i=ℓ}^∞ |c_i| ‖A‖^i

is too weak. Instead can use

‖A^4‖ ≤ ‖A^2‖^2 = (‖A^2‖^(1/2))^4,
‖A^12‖ ≤ ‖A^3‖^4 = (‖A^3‖^(1/3))^12, ...

Fasi & H (2018) refine Kenney & Laub's bound using

α_p(X) = max(‖X^p‖^(1/p), ‖X^(p+1)‖^(1/(p+1)))

in place of ‖X‖. Note that ρ(A) ≤ α_p(A) ≤ ‖A‖.

Even sharper bounds derived by Nadukandi & H (2018).

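The gap between α_p(A) and ‖A‖ can be large for nonnormal matrices; a small NumPy illustration (the matrix and p = 2 are arbitrary choices):

```python
import numpy as np

def alpha_p(A, p):
    """alpha_p(A) = max(||A^p||^(1/p), ||A^(p+1)||^(1/(p+1))) in the 2-norm."""
    Ap = np.linalg.matrix_power(A, p)
    Ap1 = np.linalg.matrix_power(A, p + 1)
    return max(np.linalg.norm(Ap, 2) ** (1 / p),
               np.linalg.norm(Ap1, 2) ** (1 / (p + 1)))

# Highly nonnormal: large off-diagonal entry, small spectral radius.
A = np.array([[0.5, 100.0],
              [0.0, 0.5]])

rho = max(abs(np.linalg.eigvals(A)))   # spectral radius = 0.5
nrm = np.linalg.norm(A, 2)             # about 100
a2 = alpha_p(A, 2)                     # about 10

assert rho <= a2 <= nrm + 1e-9         # rho(A) <= alpha_p(A) <= ||A||
assert a2 < nrm / 5                    # much sharper than ||A|| here
```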

64 Digits

[Plot: relative errors of logm mct, logm agm, logm tayl, logm tfree tayl, logm pade and logm tfree pade over the test matrices at 64 digits, with reference line κ_log(A)u.]


256 Digits

[Plot: relative errors of the logm variants over the test matrices at 256 digits, with reference line κ_log(A)u.]


Conclusions

For matrix log at high precision, since u is now a variable:

Error bound computed at run-time, not fixed.
Relative error bound needed.

Tests exposed a weakness in an existing toolbox at high precisions (since fixed).


Lecture 3

Iterative Refinement


Accelerating the Solution of Ax = b

A ∈ R^(n×n) nonsingular.

Standard method for solving Ax = b: factorize A = LU, solve LUx = b, all at working precision.

Can we solve Ax = b faster and/or more accuratelyby exploiting multiprecision arithmetic?


Iterative Refinement for Ax = b (classic)

Solve Ax_0 = b by LU factorization in double precision.

r = b − Ax_0      (quad precision)
Solve Ad = r      (double precision)
x_1 = x_0 + d     (double precision)

(x_0 ← x_1 and iterate as necessary.)

Programmed in J. H. Wilkinson, Progress Report on the Automatic Computing Engine (1948).
Popular up to the 1970s, exploiting cheap accumulation of inner products.

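The structure of the scheme is easy to sketch in NumPy/SciPy; NumPy has no quadruple precision type, so here single-precision factors and double-precision residuals stand in for the precision split:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine(A, b, steps=3):
    """Iterative refinement sketch: LU factorization in single precision,
    residual and update in double (standing in for the double/quad split,
    since NumPy has no quadruple precision type)."""
    fac = lu_factor(A.astype(np.float32))         # low-precision factorization
    x = lu_solve(fac, b.astype(np.float32)).astype(np.float64)
    for _ in range(steps):
        r = b - A @ x                             # residual in higher precision
        d = lu_solve(fac, r.astype(np.float32))   # reuse the cheap factors
        x = x + d                                 # update in working precision
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
x_true = np.ones(50)
b = A @ x_true

x = refine(A, b)
# A few refinement steps recover close to double precision accuracy
# even though the factorization was done in single.
assert np.linalg.norm(x - x_true) / np.linalg.norm(x_true) < 1e-8
```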

Iterative Refinement (1970s, 1980s)

Solve Ax_0 = b by LU factorization.
r = b − Ax_0
Solve Ad = r
x_1 = x_0 + d

Everything in double precision.

Skeel (1980).
Jankowski & Woźniakowski (1977) for a general solver.


Iterative Refinement (2000s)

Solve Ax_0 = b by LU factorization in single precision.

r = b − Ax_0      (double precision)
Solve Ad = r      (single precision)
x_1 = x_0 + d     (double precision)

Dongarra et al. (2006).
Motivated by single precision being at least twice as fast as double.


Iterative Refinement in Three Precisions

A, b given in precision u.

Solve Ax_0 = b by LU factorization in precision u_f.

r = b − Ax_0        (precision u_r)
Solve Ad = r        (precision u_f)
x_1 = fl(x_0 + d)   (precision u)

Three previous usages are special cases.
Choose precisions from half, single, double, quadruple subject to u_r ≤ u ≤ u_f.
Can we compute more accurate solutions faster?


Very Ill-Conditioned Problems

Allow

κ(A) = ‖A‖ ‖A^(-1)‖ ≳ u^(-1).

Applications include:
Reference solutions for testing solvers.
Radial basis functions.
Ill-conditioned FE geomechanical problems.

Base precision may be half or single.
In low precision almost every problem is ill conditioned!


Existing Rounding Error Analysis

Wilkinson (1963): fixed-point arithmetic.
Moler (1967): floating-point arithmetic.
Higham (1997, 2002): more general analysis for arbitrary solver.
Dongarra et al. (2006): lower precision LU.

At most two precisions and require κ(A)u < 1.

New analysis:
Applies to any solver.
Covers b'err and f'err. Focus mainly on f'err.
Allows κ(A)u ≳ 1.


New Analysis

Assume the computed solution to Ad_i = r_i has normwise relative b'err O(u_f) and satisfies

‖d̂_i − d_i‖_∞ / ‖d_i‖_∞ ≤ u_f θ_i < 1.

Define µ_i by

‖A(x − x_i)‖_∞ = µ_i ‖A‖_∞ ‖x − x_i‖_∞,

and note that

κ_∞(A)^(-1) ≤ µ_i ≤ 1.


Condition Numbers

|A| = (|a_ij|).

cond(A, x) = ‖ |A^(-1)| |A| |x| ‖_∞ / ‖x‖_∞,

cond(A) = cond(A, e) = ‖ |A^(-1)| |A| ‖_∞,

κ_∞(A) = ‖A‖_∞ ‖A^(-1)‖_∞.

1 ≤ cond(A, x) ≤ cond(A) ≤ κ_∞(A).

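These quantities are cheap to compute explicitly for small matrices; a NumPy sketch (the example matrix and vector are chosen so that all three inequalities are strict):

```python
import numpy as np

def cond_skeel(A, x):
    """cond(A, x) = || |A^-1| |A| |x| ||_inf / ||x||_inf."""
    Ainv = np.abs(np.linalg.inv(A))
    return (np.linalg.norm(Ainv @ (np.abs(A) @ np.abs(x)), np.inf)
            / np.linalg.norm(x, np.inf))

A = np.array([[1.0, 1e3],
              [0.0, 1.0]])
x = np.array([1.0, 1e-3])

cA_x = cond_skeel(A, x)            # cond(A, x), here about 3
cA = cond_skeel(A, np.ones(2))     # cond(A) = cond(A, e), here about 2001
kappa = (np.linalg.norm(A, np.inf)
         * np.linalg.norm(np.linalg.inv(A), np.inf))   # kappa_inf(A), ~1e6

assert 1 <= cA_x <= cA <= kappa    # 1 <= cond(A,x) <= cond(A) <= kappa_inf(A)
```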

Convergence Result

Theorem (Carson & H, 2018)

For IR in precisions u_r ≤ u ≤ u_f, if

φ_i = 2 u_f min(cond(A), κ_∞(A) µ_i) + u_f θ_i

is sufficiently less than 1, the forward error is reduced on the ith iteration by a factor ≈ φ_i until an iterate x̂ satisfies

‖x − x̂‖_∞ / ‖x‖_∞ ≲ 4n u_r cond(A, x) + u.

The analogous standard bound would have µ_i = 1, u_f θ_i = κ(A) u_f.


Backward Error Result

Theorem (Carson & H, 2018)

For IR in precisions u_r ≤ u ≤ u_f, if

φ = (c_1 κ_∞(A) + c_2) u_f

is sufficiently less than 1, then the residual is reduced on each iteration by a factor approximately φ until

‖b − Ax_i‖_∞ ≲ n u (‖b‖_∞ + ‖A‖_∞ ‖x_i‖_∞).


Precision Combinations

H = half, S = single, D = double, Q = quad. Combinations written "u_f u u_r":

Traditional: SSD, DDQ, HHS, HHD, HHQ, SSQ

1970s/1980s: SSS, DDD, HHH, QQQ

2000s: SDD, HSS, DQQ, HDD, HQQ, SQQ

3 precisions: HSD, HSQ, HDQ, SDQ


Results for LU Factorization (1)

                           Backward error
u_f  u  u_r   κ_∞(A)     norm   comp    Forward error
H    S  S     10^4       S      S       cond(A, x) · S
H    S  D     10^4       S      S       S
H    D  D     10^4       D      D       cond(A, x) · D
H    D  Q     10^4       D      D       D
S    S  S     10^8       S      S       cond(A, x) · S
S    S  D     10^8       S      S       S
S    D  D     10^8       D      D       cond(A, x) · D
S    D  Q     10^8       D      D       D


Results (2): HSD vs. SSD

                           Backward error
u_f  u  u_r   κ_∞(A)     norm   comp    Forward error
H    S  S     10^4       S      S       cond(A, x) · S
H    S  D     10^4       S      S       S
H    D  D     10^4       D      D       cond(A, x) · D
H    D  Q     10^4       D      D       D
S    S  S     10^8       S      S       cond(A, x) · S
S    S  D     10^8       S      S       S
S    D  D     10^8       D      D       cond(A, x) · D
S    D  Q     10^8       D      D       D

Can we get the benefit of "HSD" while allowing a larger range of κ_∞(A)?


Extending the Range of Applicability

Recall that the convergence condition is

φ_i = 2 u_f min(cond(A), κ_∞(A) µ_i) + u_f θ_i ≪ 1.

We need both terms to be smaller than κ_∞(A) u_f.

Recall that

‖d̂_i − d_i‖_∞ / ‖d_i‖_∞ ≤ u_f θ_i,

µ_i ‖A‖_∞ ‖x − x_i‖_∞ = ‖A(x − x_i)‖_∞ = ‖b − Ax_i‖_∞ = ‖r_i‖_∞.


Bounding µi (1)

For the 2-norm, can show that

µ_i ≤ (‖r_i‖_2 / ‖P_k r_i‖_2) (σ_{n+1−k} / σ_1),

where A = UΣV^T is an SVD and P_k = U_k U_k^T with U_k = [u_{n+1−k}, ..., u_n].

Expect µ_i ≪ 1 when r_i contains a significant component in the subspace span(U_k) for any k such that σ_{n+1−k} ≈ σ_n (e.g., k = 1).


Bounding µi (2)

For a stable solver, in the early stages we expect

‖r_i‖ / (‖A‖ ‖x_i‖) ≈ u ≪ ‖x − x_i‖ / ‖x‖,

or equivalently µ_i ≪ 1. But close to convergence

‖r_i‖ / (‖A‖ ‖x_i‖) ≈ u ≈ ‖x − x_i‖ / ‖x‖,

or µ_i ≈ 1.

Conclude: µ_i ≪ 1 initially and µ_i → 1 as the iteration converges.


Bounding θi

u_f θ_i bounds the relative error in the solution of Ad_i = r_i.
We need u_f θ_i ≪ 1.

Standard solvers cannot achieve this for very ill conditioned A!

Empirically observed by Rump (1990) that if L̂ and Û are the computed LU factors of A from GEPP, then

κ(L̂^(-1) A Û^(-1)) ≈ 1 + κ(A)u,

even for κ(A) ≫ u^(-1).

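Rump's observation is easy to reproduce; a NumPy/SciPy sketch with a synthetic matrix of 2-norm condition number about 10^10 and LU factors computed in single precision (the factors are applied on the left here, which is a choice of convenience):

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(1)
n = 50
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q1 @ np.diag(np.logspace(0, -10, n)) @ Q2.T   # kappa_2(A) ~ 1e10

P, L, U = lu(A.astype(np.float32))                # LU factors in single precision
L, U = L.astype(np.float64), U.astype(np.float64)

# Preconditioned matrix U^-1 L^-1 P^T A, formed in double precision.
M = np.linalg.solve(U, np.linalg.solve(L, P.T @ A))

assert np.linalg.cond(A) > 1e9                    # A is numerically singular in single
assert np.linalg.cond(M) < np.linalg.cond(A) / 1e3   # dramatically better conditioned
```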

Preconditioned GMRES

To compute the updates d_i we apply GMRES to

Ã d_i ≡ U^(-1) L^(-1) A d_i = U^(-1) L^(-1) r_i.

Ã is applied in twice the working precision.

κ(Ã) ≪ κ(A) typically.

Rounding error analysis shows we get an accurate d_i even for numerically singular A.
Call the overall alg GMRES-IR.

GMRES cgce rate not directly related to κ(Ã).

Cf. Kobayashi & Ogita (2015), who explicitly form Ã.

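A matrix-free sketch of GMRES-IR with SciPy's gmres; double precision stands in for both the working precision and the extra precision in which Ã is applied (NumPy has no quad type), and the GMRES tolerance is left at its default rather than tuned:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(2)
n = 30
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q1 @ np.diag(np.logspace(0, -8, n)) @ Q2.T   # kappa_2(A) ~ 1e8
x_true = np.ones(n)
b = A @ x_true

fac32 = lu_factor(A.astype(np.float32))          # low (single) precision LU
fac = (fac32[0].astype(np.float64), fac32[1])    # apply the factors in double

# Atilde = U^-1 L^-1 A, applied matrix-free in double precision.
At = LinearOperator((n, n), matvec=lambda v: lu_solve(fac, A @ v))

x = np.zeros(n)                                  # x0 = 0
for _ in range(5):                               # refinement loop
    r = b - A @ x                                # residual
    d, info = gmres(At, lu_solve(fac, r))        # preconditioned update solve
    x = x + d

assert np.linalg.norm(x - x_true) / np.linalg.norm(x_true) < 1e-4
```

Here the single-precision factors alone would be useless as a direct solver (κ(A)u_f > 1), but as a preconditioner inside GMRES they still give convergent refinement.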

Benefits of GMRES-IR

Recall H = 10^-4, S = 10^-8, D = 10^-16, Q = 10^-34.

                                Backward error
           u_f  u  u_r   κ_∞(A)     nrm   cmp    F'error
LU         H    D  Q     10^4       D     D      D
LU         S    D  Q     10^8       D     D      D
GMRES-IR   H    D  Q     10^12      D     D      D
GMRES-IR   S    D  Q     10^16      D     D      D

How many GMRES iterations are required?

Some tests with 100 × 100 matrices ...


Test 1: LU-IR, (u_f, u, u_r) = (S, D, D)

κ_∞(A) ≈ 10^10, σ_i = α^i. Divergence!

[Plot: forward error (ferr), normwise backward error (nbe) and componentwise backward error (cbe) against refinement step.]


Test 1: LU-IR, (u_f, u, u_r) = (S, D, Q)

κ_∞(A) ≈ 10^10, σ_i = α^i. Divergence!

[Plot: ferr, nbe and cbe against refinement step.]


Test 1: LU-IR, (u_f, u, u_r) = (S, D, Q)

κ_∞(A) ≈ 10^4, σ_i = α^i. Convergence.

[Plot: ferr, nbe and cbe against refinement step.]


Test 1: GMRES-IR, (u_f, u, u_r) = (S, D, Q)

κ_∞(A) ≈ 10^10, σ_i = α^i, GMRES its (2, 3). Convergence.

[Plot: ferr, nbe and cbe against refinement step.]


Test 2: GMRES-IR, (u_f, u, u_r) = (H, D, Q)

κ_∞(A) ≈ 10^2, 1 small σ_i, GMRES its (8, 8, 8). Convergence.

[Plot: ferr, nbe and cbe against refinement step.]


Test 3: GMRES-IR, (u_f, u, u_r) = (H, D, Q)

κ_∞(A) ≈ 10^12, σ_i = α^i, GMRES its (100, 100).
Take x_0 = 0 because of overflow! Convergence.

[Plot: ferr, nbe and cbe against refinement step.]


Other Mixed Precision Work

Yamazaki, Tomov & Dongarra (2015): compute the QR factorization of A via Cholesky of A^T A, using twice the working precision to overcome instability for large κ(A).

Petschow, Quintana-Ortí & Bientinesi (2004): multiple relatively robust representations (MRRR) for the symmetric tridiagonal eigenproblem. Extra precision provides improved accuracy.

Bini and Robol (2014): multiprecision alg for polynomial rootfinding. fp64 + GNU MP Library.


Conclusions (1)

Both low and high precision floating-point arithmetic becoming more prevalent, in hardware and software.

Need better understanding of behaviour of algs in low precision arithmetic.

Judicious use of a little high precision can bring major benefits.


Conclusions (2)

Showed that using three precisions in iter ref brings major benefits.

LU in lower precision ⇒ twice as fast as trad. IR.
GMRES-IR cges when trad. IR doesn't, thanks to the preconditioned GMRES solution of Ã d_i = r_i.
GMRES-IR handles κ_∞(A) ≈ u^(-1). Further work: tune cgce tol, alternative preconditioners etc.

Matrix functions at high precision need algs with (at least) different logic.


How to Do Research

Develop a new metric for a solved problem, then develop a new alg (or tweak an existing one) so it does best on that metric. E.g. measure communication costs as well as arithmetic.

Find hidden assumptions in an existing method and remove them. E.g.: Strassen realized use of inner products was not necessary for matrix multiplication.

Find a simpler proof of an existing result.

Prove a (weaker version of an) existing result under weaker assumptions.

Develop an alg that is faster and more accurate than existing ones.

Develop an alg that can solve problems other algs can't (e.g., GMRES-IR for κ(A) ≫ u^(-1)).


References

H. Anzt, J. Dongarra, G. Flegar, N. J. Higham, and E. S. Quintana-Ortí. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency Computat.: Pract. Exper., 2018.

D. H. Bailey, R. Barrio, and J. M. Borwein. High-precision computation: Mathematical physics and dynamics. Appl. Math. Comput., 218(20):10106–10121, 2012.

G. Beliakov and Y. Matiyasevich. A parallel algorithm for calculation of determinants and minors using arbitrary precision arithmetic. BIT, 56(1):33–50, 2016.


D. A. Bini and L. Robol. Solving secular and polynomial equations: A multiprecision algorithm. J. Comput. Appl. Math., 272:276–292, 2014.

E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.

E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.


M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications, 2015. ArXiv preprint 1412.7024v5.

M. Fasi and N. J. Higham. Multiprecision algorithms for computing the matrix logarithm. SIAM J. Matrix Anal. Appl., 39(1):472–491, 2018.

W. F. Harris. The average eye. Ophthal. Physiol. Opt., 24:580–585, 2005.


Y. He and C. H. Q. Ding. Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications. J. Supercomputing, 18(3):259–277, 2001.

N. J. Higham. Iterative refinement for linear systems and LAPACK. IMA J. Numer. Anal., 17(4):495–509, 1997.

N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002. ISBN 0-89871-521-0. xxx+680 pp.


G. Khanna. High-precision numerical simulations on a CUDA GPU: Kerr black hole tails. J. Sci. Comput., 56(2):366–380, 2013.

Y. Kobayashi and T. Ogita. A fast and efficient algorithm for solving ill-conditioned linear systems. JSIAM Lett., 7:1–4, 2015.


J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems). In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, Nov. 2006.

D. Ma and M. Saunders. Solving multiscale linear programs using the simplex method in quadruple precision. In M. Al-Baali, L. Grandinetti, and A. Purnama, editors, Numerical Analysis and Optimization, number 134 in Springer Proceedings in Mathematics and Statistics, pages 223–235. Springer-Verlag, Berlin, 2015.


S. Markidis, W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter. NVIDIA tensor core programmability, performance & precision. ArXiv preprint 1803.04014, Mar. 2018.

J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. Handbook of Floating-Point Arithmetic. Birkhäuser, Boston, MA, USA, second edition, 2018. ISBN 978-3-319-76525-9. xxv+627 pp.


P. Nadukandi and N. J. Higham. Computing the wave-kernel matrix functions. MIMS EPrint 2018.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Feb. 2018. 23 pp.

M. Petschow, E. Quintana-Ortí, and P. Bientinesi. Improved accuracy and parallelism for MRRR-based eigensolvers—A mixed precision approach. SIAM J. Sci. Comput., 36(2):C240–C263, 2014.


A. Roldao-Lopes, A. Shahzad, G. A. Constantinides, and E. C. Kerrigan. More flops or more precision? Accuracy parameterizable linear equation solvers for model predictive control. In 17th IEEE Symposium on Field Programmable Custom Computing Machines, pages 209–216, Apr. 2009.

S. M. Rump and C.-P. Jeannerod. Improved backward error bounds for LU and Cholesky factorizations. SIAM J. Matrix Anal. Appl., 35(2):684–698, 2014.


K. Scheinberg. Evolution of randomness in optimization methods for supervised machine learning. SIAG/OPT Views and News, 24(1):1–8, 2016.

I. Yamazaki, S. Tomov, and J. Dongarra. Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs. SIAM J. Sci. Comput., 37(3):C307–C330, 2015.
