Demmel Chapter 1

1 Introduction

1.1. Basic Notation

In this course we will refer frequently to matrices, vectors, and scalars. A matrix will be denoted by an upper case letter such as A, and its (i, j)th element will be denoted by $a_{ij}$. If the matrix is given by an expression such as A + B, we will write $(A + B)_{ij}$. In detailed algorithmic descriptions we will sometimes write A(i, j) or use the Matlab™ 1 [184] notation A(i : j, k : l) to denote the submatrix of A lying in rows i through j and columns k through l. A lower-case letter like x will denote a vector, and its ith element will be written $x_i$. Vectors will almost always be column vectors, which are the same as matrices with one column. Lower-case Greek letters (and occasionally lower-case letters) will denote scalars. $\mathbb{R}$ will denote the set of real numbers; $\mathbb{R}^n$, the set of n-dimensional real vectors; and $\mathbb{R}^{m \times n}$, the set of m-by-n real matrices. $\mathbb{C}$, $\mathbb{C}^n$, and $\mathbb{C}^{m \times n}$ denote complex numbers, vectors, and matrices, respectively. Occasionally we will use the shorthand $A_{m \times n}$ to indicate that A is an m-by-n matrix. $A^T$ will denote the transpose of the matrix A: $(A^T)_{ij} = a_{ji}$. For complex matrices we will also use the conjugate transpose $A^*$: $(A^*)_{ij} = \bar{a}_{ji}$. $\Re z$ and $\Im z$ will denote the real and imaginary parts of the complex number z, respectively. If A is m-by-n, then $|A|$ is the m-by-n matrix of absolute values of entries of A: $(|A|)_{ij} = |a_{ij}|$. Inequalities like $|A| \le |B|$ are meant componentwise: $|a_{ij}| \le |b_{ij}|$ for all i and j. We will also use this absolute value notation for vectors: $(|x|)_i = |x_i|$. Ends of proofs will be marked by $\square$, and ends of examples by $\diamond$. Other notation will be introduced as needed.

1.2. Standard Problems of Numerical Linear Algebra

We will consider the following standard problems:

1 Matlab is a registered trademark of The MathWorks, Inc., 24 Prime Park Way, Natick, MA 01760, USA, tel. 508-647-7000, fax 508-647-7001, [email protected], http://www.mathworks.com.



• Linear systems of equations: Solve Ax = b. Here A is a given n-by-n nonsingular real or complex matrix, b is a given column vector with n entries, and x is a column vector with n entries that we wish to compute.

• Least squares problems: Compute the x that minimizes $\|Ax - b\|_2$. Here A is m-by-n, b is m-by-1, x is n-by-1, and $\|y\|_2 = \sqrt{\sum_i |y_i|^2}$ is called the two-norm of the vector y. If m > n so that we have more equations than unknowns, the system is called overdetermined. In this case we cannot generally solve Ax = b exactly. If m < n, the system is called underdetermined, and we will have infinitely many solutions.

• Eigenvalue problems: Given an n-by-n matrix A, find an n-by-1 nonzero vector x and a scalar $\lambda$ so that $Ax = \lambda x$.

• Singular value problems: Given an m-by-n matrix A, find an n-by-1 nonzero vector x and scalar $\lambda$ so that $A^T A x = \lambda x$. We will see that this special kind of eigenvalue problem is important enough to merit separate consideration and algorithms.

We choose to emphasize these standard problems because they arise so often in engineering and scientific practice. We will illustrate them throughout the book with simple examples drawn from engineering, statistics, and other fields. There are also many variations of these standard problems that we will consider, such as generalized eigenvalue problems $Ax = \lambda Bx$ (section 4.5) and "rank-deficient" least squares problems $\min_x \|Ax - b\|_2$, whose solutions are nonunique because the columns of A are linearly dependent (section 3.5).

We will learn the importance of exploiting any special structure our problem may have. For example, solving an n-by-n linear system costs $\frac{2}{3}n^3$ floating point operations if we use the most general form of Gaussian elimination. If we add the information that the system is symmetric and positive definite, we can save half the work by using another algorithm called Cholesky. If we further know the matrix is banded with semibandwidth $\sqrt{n}$ (i.e., $a_{ij} = 0$ if $|i - j| > \sqrt{n}$), then we can reduce the cost further to $O(n^2)$ by using band Cholesky. If we say quite explicitly that we are trying to solve Poisson's equation on a square using a 5-point difference approximation, which determines the matrix nearly uniquely, then by using the multigrid algorithm we can reduce the cost to O(n), which is nearly as fast as possible, in the sense that we use just a constant amount of work per solution component (section 6.4).

1.3. General Techniques

There are several general concepts and techniques that we will use repeatedly:

1. matrix factorizations;

2. perturbation theory and condition numbers;


3. effects of roundoff error on algorithms, including properties of floating point arithmetic;

4. analysis of the speed of an algorithm;

5. engineering numerical software.

We discuss each of these briefly below.

1.3.1. Matrix Factorizations

A factorization of the matrix A is a representation of A as a product of several "simpler" matrices, which make the problem at hand easier to solve. We give two examples.

EXAMPLE 1.1. Suppose that we want to solve Ax = b. If A is a lower triangular matrix,

$$\begin{pmatrix} a_{11} & & & \\ a_{21} & a_{22} & & \\ \vdots & & \ddots & \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}$$

is easy to solve using forward substitution:

for i = 1 to n
    x_i = (b_i − Σ_{k=1}^{i−1} a_ik x_k) / a_ii
end for
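In code, the loop looks like the following NumPy sketch (mine, not the book's; the function name and the arbitrary well-conditioned test matrix are illustrative):

```python
import numpy as np

def forward_substitution(A, b):
    """Solve Ax = b for lower triangular A by the loop above."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # x_i = (b_i - sum_{k<i} a_ik * x_k) / a_ii
        x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
    return x

A = np.tril(np.random.rand(4, 4)) + 4 * np.eye(4)  # lower triangular, safely nonsingular
b = np.random.rand(4)
print(np.allclose(forward_substitution(A, b), np.linalg.solve(A, b)))  # True
```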

An analogous idea, back substitution, works if A is upper triangular. To use this to solve a general system Ax = b we need the following matrix factorization, which is just a restatement of Gaussian elimination.

THEOREM 1.1. If the n-by-n matrix A is nonsingular, there exist a permutation matrix P (the identity matrix with its rows permuted), a nonsingular lower triangular matrix L, and a nonsingular upper triangular matrix U such that A = P · L · U. To solve Ax = b, we solve the equivalent system PLUx = b as follows:

$LUx = P^{-1}b = P^T b$ (permute entries of b),
$Ux = L^{-1}(P^T b)$ (forward substitution),
$x = U^{-1}(L^{-1}P^T b)$ (back substitution).

We will prove this theorem in section 2.3. $\diamond$
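As a hedged illustration of the theorem's three-step solve (not the book's code; the random A and b are placeholders), SciPy's scipy.linalg.lu returns P, L, U with A = P · L · U:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.random.rand(5, 5)
b = np.random.rand(5)

P, L, U = lu(A)                            # A = P @ L @ U
y = P.T @ b                                # permute entries of b
z = solve_triangular(L, y, lower=True)     # forward substitution
x = solve_triangular(U, z)                 # back substitution (upper triangular)
print(np.allclose(A @ x, b))               # True
```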

EXAMPLE 1.2. The Jordan canonical factorization $A = VJV^{-1}$ exhibits the eigenvalues and eigenvectors of A. Here V is a nonsingular matrix, whose columns include the eigenvectors, and J is the Jordan canonical form of A,


a special triangular matrix with the eigenvalues of A on its diagonal. We will learn that it is numerically superior to compute the Schur factorization $A = UTU^*$, where U is a unitary matrix (i.e., U's columns are orthonormal) and T is upper triangular with A's eigenvalues on its diagonal. The Schur form T can be computed faster and more accurately than the Jordan form J. We discuss the Jordan and Schur factorizations in section 4.2. $\diamond$

1.3.2. Perturbation Theory and Condition Numbers

The answers produced by numerical algorithms are seldom exactly correct. There are two sources of error. First, there may be errors in the input data to the algorithm, caused by prior calculations or perhaps measurement errors. Second, there are errors caused by the algorithm itself, due to approximations made within the algorithm. In order to estimate the errors in the computed answers from both these sources, we need to understand how much the solution of a problem is changed (or perturbed) if the input data are slightly perturbed.

EXAMPLE 1.3. Let f(x) be a real-valued differentiable function of a real variable x. We want to compute f(x), but we do not know x exactly. Suppose instead that we are given $x + \delta x$ and a bound on $\delta x$. The best that we can do (without more information) is to compute $f(x + \delta x)$ and to try to bound the absolute error $|f(x + \delta x) - f(x)|$. We may use a simple linear approximation to f to get the estimate $f(x + \delta x) \approx f(x) + \delta x f'(x)$, and so the error is $|f(x + \delta x) - f(x)| \approx |\delta x| \cdot |f'(x)|$. We call $|f'(x)|$ the absolute condition number of f at x. If $|f'(x)|$ is large enough, then the error may be large even if $\delta x$ is small; in this case we call f ill-conditioned at x. $\diamond$

We say absolute condition number because it provides a bound on the absolute error $|f(x + \delta x) - f(x)|$ given a bound on the absolute change $|\delta x|$ in the input. We will also often use the following essentially equivalent expression to bound the error:

$$\frac{|f(x + \delta x) - f(x)|}{|f(x)|} \approx \frac{|\delta x|}{|x|} \cdot \frac{|f'(x)| \cdot |x|}{|f(x)|}.$$

This expression bounds the relative error $|f(x + \delta x) - f(x)|/|f(x)|$ as a multiple of the relative change $|\delta x|/|x|$ in the input. The multiplier, $|f'(x)| \cdot |x| / |f(x)|$, is called the relative condition number, or often just condition number for short.

The condition number is all that we need to understand how error in the input data affects the computed answer: we simply multiply the condition number by a bound on the input error to bound the error in the computed solution.

For each problem we consider, we will derive its corresponding condition number.
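A small Python sketch (my example, not the book's) can check the definitions in Example 1.3 against an actual perturbation; tan near π/2 is a convenient ill-conditioned choice:

```python
import numpy as np

def rel_cond(f, fprime, x):
    """Relative condition number |f'(x)| * |x| / |f(x)|."""
    return abs(fprime(x)) * abs(x) / abs(f(x))

x = np.pi / 2 - 1e-6          # tan is ill-conditioned near pi/2
kappa = rel_cond(np.tan, lambda t: 1.0 / np.cos(t) ** 2, x)

dx = 1e-9 * x                 # a small relative perturbation of the input
actual = abs(np.tan(x + dx) - np.tan(x)) / abs(np.tan(x))
predicted = kappa * abs(dx) / abs(x)
print(kappa)                  # ~1.6e6: about 6 digits of accuracy lost
print(actual, predicted)      # the two estimates agree to leading order
```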


1.3.3. Effects of Roundoff Error on Algorithms

To continue our analysis of the error caused by the algorithm itself, we need to study the effect of roundoff error in the arithmetic, or simply roundoff for short. We will do so by using a property possessed by most good algorithms: backward stability. We define it as follows.

If alg(x) is our algorithm for f(x), including the effects of roundoff, we call alg(x) a backward stable algorithm for f(x) if for all x there is a "small" $\delta x$ such that $\mathrm{alg}(x) = f(x + \delta x)$. $\delta x$ is called the backward error. Informally, we say that we get the exact answer ($f(x + \delta x)$) for a slightly wrong problem ($x + \delta x$).

This implies that we may bound the error as

$$\text{error} = |\mathrm{alg}(x) - f(x)| = |f(x + \delta x) - f(x)| \approx |f'(x)| \cdot |\delta x|,$$

the product of the absolute condition number $|f'(x)|$ and the magnitude of the backward error $|\delta x|$. Thus, if alg(·) is backward stable, $|\delta x|$ is always small, so the error will be small unless the absolute condition number is large. Thus, backward stability is a desirable property for an algorithm, and most of the algorithms that we present will be backward stable. Combined with the corresponding condition numbers, we will have error bounds for all our computed solutions.

Proving that an algorithm is backward stable requires knowledge of the roundoff error of the basic floating point operations of the machine and how these errors propagate through an algorithm. This is discussed in section 1.5.

1.3.4. Analyzing the Speed of Algorithms

In choosing an algorithm to solve a problem, one must of course consider its speed (which is also called performance) as well as its backward stability. There are several ways to estimate speed. Given a particular problem instance, a particular implementation of an algorithm, and a particular computer, one can of course simply run the algorithm and see how long it takes. This may be difficult or time consuming, so we often want simpler estimates. Indeed, we typically want to estimate how long a particular algorithm would take before implementing it.

The traditional way to estimate the time an algorithm takes is to count the flops, or floating point operations, that it performs. We will do this for all the algorithms we present. However, this is often a misleading time estimate on modern computer architectures, because it can take significantly more time to move the data inside the computer to the place where it is to be multiplied, say, than it does to actually perform the multiplication. This is especially true on parallel computers but also is true on conventional machines such as workstations and PCs. For example, matrix multiplication on the IBM RS6000/590 workstation can be sped up from 65 Mflops (millions of floating point operations per second) to 240 Mflops, nearly four times faster, by judiciously reordering the operations of the standard algorithm (and using the correct compiler optimizations). We discuss this further in section 2.6.


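A rough modern analogue of this experiment is sketched below (mine, not the book's; in Python much of the gap is interpreter overhead rather than data movement alone, but the point stands: the two versions perform identical flops at wildly different speeds):

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Textbook triple loop: identical flop count to A @ B, poor data reuse."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

n = 200
A, B = np.random.rand(n, n), np.random.rand(n, n)
t0 = time.perf_counter(); C1 = naive_matmul(A, B); t1 = time.perf_counter()
C2 = A @ B;                                        t2 = time.perf_counter()
print(f"naive loop: {t1 - t0:.3f}s, optimized BLAS: {t2 - t1:.6f}s")
print(np.allclose(C1, C2))   # same answer, very different speed
```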

If an algorithm is iterative, i.e., produces a series of approximations converging to the answer rather than stopping after a fixed number of steps, then we must ask how many steps are needed to decrease the error to a tolerable level. To do this, we need to decide if the convergence is linear (i.e., the error decreases by a constant factor 0 < c < 1 at each step so that $\|error_i\| \le c \cdot \|error_{i-1}\|$) or faster, such as quadratic ($\|error_i\| \le c \cdot \|error_{i-1}\|^2$). If two algorithms are both linear, we can ask which has the smaller constant c. Iterative linear equation solvers and their convergence analysis are the subject of Chapter 6.

1.3.5. Engineering Numerical Software

Three main issues in designing or choosing a piece of numerical software are ease of use, reliability, and speed. Most of the algorithms covered in this book have already been carefully programmed with these three issues in mind. If some of this existing software can solve your problem, its ease of use may well outweigh any other considerations such as speed. Indeed, if you need only to solve your problem once or a few times, it is often easier to use general purpose software written by experts than to write your own more specialized program.

There are three programming paradigms for exploiting other experts' software. The first paradigm is the traditional software library, consisting of a collection of subroutines for solving a fixed set of problems, such as solving linear systems, finding eigenvalues, and so on. In particular, we will discuss the LAPACK library [10], a state-of-the-art collection of routines available in Fortran and C. This library, and many others like it, are freely available in the public domain; see NETLIB on the World Wide Web. 2 LAPACK provides reliability and high speed (for example, making careful use of matrix multiplication, as described above) but requires careful attention to data structures and calling sequences on the part of the user. We will provide pointers to such software throughout the text.

The second programming paradigm provides a much easier-to-use environment than libraries like LAPACK, but at the cost of some performance. This paradigm is provided by the commercial system Matlab [184], among others. Matlab provides a simple interactive programming environment where all variables represent matrices (scalars are just 1-by-1 matrices), and most linear algebra operations are available as built-in functions. For example, "C = A * B" stores the product of matrices A and B in C, and "A = inv(B)" stores the inverse of matrix B in A. It is easy to quickly prototype algorithms in Matlab and to see how they work. But since Matlab makes a number of algorithmic decisions automatically for the user, it may perform more slowly than a carefully chosen library routine.

2 Recall that we abbreviate the URL prefix http://www.netlib.org to NETLIB in the text.



The third programming paradigm is that of templates, or recipes for assembling complicated algorithms out of simpler building blocks. Templates are useful when there are a large number of ways to construct an algorithm but no simple rule for choosing the best construction for a particular input problem; therefore, much of the construction must be left to the user. An example of this may be found in Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods [24]; a similar set of templates for eigenproblems is currently under construction.

1.4. Example: Polynomial Evaluation

We illustrate the ideas of perturbation theory, condition numbers, backward stability, and roundoff error analysis with the example of polynomial evaluation:

$$p(x) = \sum_{i=0}^{d} a_i x^i.$$

Horner's rule for polynomial evaluation is

p = a_d
for i = d − 1 down to 0
    p = x · p + a_i
end for

Let us apply this to $p(x) = (x - 2)^9 = x^9 - 18x^8 + 144x^7 - 672x^6 + 2016x^5 - 4032x^4 + 5376x^3 - 4608x^2 + 2304x - 512$. In the bottom of Figure 1.1, we see that near the zero x = 2 the value of p(x) computed by Horner's rule is quite unpredictable and may justifiably be called "noise." The top of Figure 1.1 shows an accurate plot.
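The noise is easy to reproduce; the following sketch (mine, using the coefficients above) compares Horner's rule against direct evaluation of (x − 2)^9 on 8000 points:

```python
import numpy as np

coeffs = [1, -18, 144, -672, 2016, -4032, 5376, -4608, 2304, -512]  # (x - 2)^9, leading coeff first

def horner(a, x):
    """Evaluate the polynomial whose coefficients a run from highest degree down."""
    p = a[0]
    for c in a[1:]:
        p = x * p + c
    return p

xs = np.linspace(1.92, 2.08, 8000)
noisy = np.array([horner(coeffs, x) for x in xs])
exact = (xs - 2.0) ** 9
print(np.max(np.abs(noisy - exact)))                       # ~1e-10: pure roundoff "noise"
print(np.sum(np.sign(noisy[:-1]) != np.sign(noisy[1:])))   # many spurious sign changes near x = 2
```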

To understand the implications of this figure, let us see what would happen if we tried to find a zero of p(x) using a simple zero finder based on Bisection, shown below in Algorithm 1.1.

Bisection starts with an interval [xlow, xhigh] in which p(x) changes sign (p(xlow) · p(xhigh) < 0) so that p(x) must have a zero in the interval. Then the algorithm computes p(xmid) at the interval midpoint xmid = (xlow + xhigh)/2 and asks whether p(x) changes sign in the bottom half interval [xlow, xmid] or top half interval [xmid, xhigh]. Either way, we find an interval of half the original length containing a zero of p(x). We can continue bisecting until the interval is as short as desired.

So the decision between choosing the top half interval or bottom half interval depends on the sign of p(xmid). Examining the graph of p(x) in the bottom half of Figure 1.1, we see that this sign varies rapidly from plus to minus as x varies. So changing xlow or xhigh just slightly could completely change the sequence of sign decisions and also the final interval. Indeed, depending on the initial choices of xlow and xhigh, the algorithm could converge anywhere inside the "noisy region" from 1.95 to 2.05 (see Question 1.21).


Fig. 1.1. Plot of $y = (x - 2)^9 = x^9 - 18x^8 + 144x^7 - 672x^6 + 2016x^5 - 4032x^4 + 5376x^3 - 4608x^2 + 2304x - 512$ evaluated at 8000 equispaced points, using $y = (x - 2)^9$ (top) and using Horner's rule (bottom).


To explain this fully, we return to properties of floating point arithmetic.

ALGORITHM 1.1. Finding zeros of p(x) using Bisection.

proc bisect (p, xlow, xhigh, tol)
    /* find a root of p(x) = 0 in [xlow, xhigh], assuming p(xlow) · p(xhigh) < 0 */
    /* stop if zero found to within ±tol */
    plow = p(xlow)
    phigh = p(xhigh)
    while xhigh − xlow > 2 · tol
        xmid = (xlow + xhigh)/2
        pmid = p(xmid)
        if plow · pmid < 0 then /* there is a root in [xlow, xmid] */
            xhigh = xmid
            phigh = pmid
        else if pmid · phigh < 0 then /* there is a root in [xmid, xhigh] */
            xlow = xmid
            plow = pmid
        else /* xmid is a root */
            xlow = xmid
            xhigh = xmid
        end if
    end while
    root = (xlow + xhigh)/2
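A direct Python transcription of Algorithm 1.1 (mine) makes the sensitivity easy to observe; applied to the Horner evaluation of (x − 2)^9, slightly different starting intervals can return noticeably different "roots" inside the noisy region:

```python
def bisect(p, xlow, xhigh, tol):
    """Algorithm 1.1 transcribed directly; assumes p(xlow) * p(xhigh) < 0."""
    plow, phigh = p(xlow), p(xhigh)
    while xhigh - xlow > 2 * tol:
        xmid = (xlow + xhigh) / 2
        pmid = p(xmid)
        if plow * pmid < 0:          # root in [xlow, xmid]
            xhigh, phigh = xmid, pmid
        elif pmid * phigh < 0:       # root in [xmid, xhigh]
            xlow, plow = xmid, pmid
        else:                        # xmid is a root
            xlow = xhigh = xmid
    return (xlow + xhigh) / 2

def p(x):
    """(x - 2)^9 via Horner's rule."""
    result = 0.0
    for c in [1, -18, 144, -672, 2016, -4032, 5376, -4608, 2304, -512]:
        result = x * result + c
    return result

print(bisect(p, 1.5, 3.0, 1e-12))   # both answers lie in the noisy region near 2,
print(bisect(p, 1.0, 2.6, 1e-12))   # but generally differ from each other
```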

1.5. Floating Point Arithmetic

The number −3.1416 may be expressed in scientific notation as follows:

$$-.31416 \times 10^1,$$

where the leading minus is the sign, .31416 is the fraction, 10 is the base, and 1 is the exponent.

Computers use a similar representation called floating point, but generally the base is 2 (with exceptions, such as 16 for IBM 370 and 10 for some spreadsheets and most calculators). For example, $.10101_2 \times 2^3 = 5.25_{10}$.

A floating point number is called normalized if the leading digit of the fraction is nonzero. For example, $.10101_2 \times 2^3$ is normalized, but $.010101_2 \times 2^4$ is not. Floating point numbers are usually normalized, which has two advantages:


each nonzero floating point value has a unique representation as a bit string, and in binary the leading 1 in the fraction need not be stored explicitly (because it is always 1), leaving one extra bit for a longer, more accurate fraction.

The most important parameters describing floating point numbers are the base; the number of digits (bits) in the fraction, which determines the precision; and the number of digits (bits) in the exponent, which determines the exponent range and thus the largest and smallest representable numbers. Different floating point arithmetics also differ in how they round computed results, what they do about numbers that are too near zero (underflow) or too big (overflow), whether $\pm\infty$ is allowed, and whether useful nonnumbers (sometimes called NaNs, indefinites, or reserved operands) are provided. We discuss each of these below.

First we consider the precision with which numbers can be represented. For example, $.31416 \times 10^1$ has five decimal digits, so any information less than $.5 \times 10^{-4}$ may have been lost. This means that if x is a real number whose best five-digit approximation is $.31416 \times 10^1$, then the relative representation error in $.31416 \times 10^1$ is

$$\frac{|x - .31416 \times 10^1|}{.31416 \times 10^1} \le \frac{.5 \times 10^{-4}}{.31416 \times 10^1} \approx 1.6 \times 10^{-5}.$$

The maximum relative representation error in a normalized number occurs for $.10000 \times 10^1$, which is the most accurate five-digit approximation of all numbers in the interval from .999995 to 1.00005. Its relative error is therefore bounded by $.5 \times 10^{-4}$. More generally, the maximum relative representation error in a floating point arithmetic with p digits and base $\beta$ is $.5 \times \beta^{1-p}$. This is also half the distance between 1 and the next larger floating point number, $1 + \beta^{1-p}$.

Computers have historically used many different choices of base, number of digits, and range, but fortunately the IEEE standard for binary arithmetic is now most common. It is used on Sun, DEC, HP, and IBM workstations and all PCs. IEEE arithmetic includes two kinds of floating point numbers: single precision (32 bits long) and double precision (64 bits long).

IEEE single precision stores, from left to right, a 1-bit sign, an 8-bit exponent, and a 23-bit fraction (the binary point precedes the fraction).

If s, e, and f < 1 are the 1-bit sign, 8-bit exponent, and 23-bit fraction in the IEEE single precision format, respectively, then the number represented is $(-1)^s \cdot 2^{e-127} \cdot (1 + f)$. The maximum relative representation error is $2^{-24} \approx 6 \times 10^{-8}$, and the range of positive normalized numbers is from $2^{-126}$ (the underflow threshold) to $2^{127} \cdot (2 - 2^{-23}) \approx 2^{128}$ (the overflow threshold), or about $10^{-38}$ to $10^{38}$. The positions of these floating point numbers on the real number line are shown in Figure 1.2 (where we use a 3-bit fraction for ease of presentation).


Fig. 1.2. Real number line with floating point numbers indicated by solid tick marks. The range shown is correct for IEEE single precision, but a 3-bit fraction is assumed for ease of presentation so that there are only $2^3 - 1 = 7$ floating point numbers between consecutive powers of 2, not $2^{23} - 1$. The distance between consecutive tick marks is constant between powers of 2 and doubles/halves across powers of 2 (among the normalized floating point numbers). $+2^{128}$ and $-2^{128}$, which are one unit in the last place larger in magnitude than the overflow threshold (the largest finite floating point number, $2^{127} \cdot (2 - 2^{-23})$), are shown as dotted tick marks. The figure is symmetric about 0; +0 and −0 are distinct floating point bit strings but compare as numerically equal. Division by zero is the only binary operation that gives different results, $+\infty$ and $-\infty$, for different signed zero arguments.

IEEE double precision stores, from left to right, a 1-bit sign, an 11-bit exponent, and a 52-bit fraction (the binary point precedes the fraction).

If s, e, and f < 1 are the 1-bit sign, 11-bit exponent, and 52-bit fraction in IEEE double precision format, respectively, then the number represented is $(-1)^s \cdot 2^{e-1023} \cdot (1 + f)$. The maximum relative representation error is $2^{-53} \approx 10^{-16}$, and the exponent range is $2^{-1022}$ (the underflow threshold) to $2^{1023} \cdot (2 - 2^{-52}) \approx 2^{1024}$ (the overflow threshold), or about $10^{-308}$ to $10^{308}$.
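These parameters can be checked with NumPy's finfo (a verification sketch of mine, not something from the text):

```python
import numpy as np

for t in (np.float32, np.float64):
    fi = np.finfo(t)
    # fi.eps is the gap between 1 and the next float, 2^(1-p); the maximum
    # relative representation error (the book's epsilon) is half of that.
    print(t.__name__, fi.eps / 2, fi.tiny, fi.max)
# float32: 5.96e-08 (~2^-24), 1.18e-38 (2^-126), 3.40e+38
# float64: 1.11e-16 (~2^-53), 2.23e-308 (2^-1022), 1.80e+308
```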

When the true value of a computation $a \odot b$ (where $\odot$ is one of the four binary operations +, −, ∗, and /) cannot be represented exactly as a floating point number, it must be approximated by a nearby floating point number before it can be stored in memory or a register. We denote this approximation by $\mathrm{fl}(a \odot b)$. The difference $(a \odot b) - \mathrm{fl}(a \odot b)$ is called the roundoff error. If $\mathrm{fl}(a \odot b)$ is a nearest floating point number to $a \odot b$, we say that the arithmetic rounds correctly (or just rounds). IEEE arithmetic has this attractive property. (IEEE arithmetic breaks ties, when $a \odot b$ is exactly halfway between two adjacent floating point numbers, by choosing $\mathrm{fl}(a \odot b)$ to have its least significant bit zero; this is called rounding to nearest even.) When rounding correctly, if $a \odot b$ is within the exponent range (otherwise we get overflow or underflow), then we can write


$$\mathrm{fl}(a \odot b) = (a \odot b)(1 + \delta), \qquad (1.1)$$

where $\delta$ is bounded by $\varepsilon$, which is called variously machine epsilon, machine precision, or macheps. Since we are rounding as accurately as possible, $\varepsilon$ is equal to the maximum relative representation error $.5 \cdot \beta^{1-p}$. IEEE arithmetic also guarantees that $\mathrm{fl}(\sqrt{a}) = \sqrt{a}(1 + \delta)$, with $|\delta| \le \varepsilon$. This is the most common model for roundoff error analysis and the one we will use in this book. A nearly identical formula applies to complex floating point arithmetic; see Question 1.12. However, formula (1.1) does ignore some interesting details.
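Formula (1.1) can be spot-checked in Python (my sketch): the fractions module gives exact rational arithmetic, and every double is a rational number, so the roundoff $\delta$ can be computed exactly:

```python
import random
from fractions import Fraction

eps = 2.0 ** -53   # the book's epsilon for IEEE double (max relative representation error)

for _ in range(1000):
    a = random.uniform(-1e6, 1e6)
    b = random.uniform(-1e6, 1e6)
    computed = a * b                    # fl(a * b), correctly rounded by the hardware
    exact = Fraction(a) * Fraction(b)   # exact product of the two doubles
    delta = (Fraction(computed) - exact) / exact
    assert abs(delta) <= eps            # formula (1.1): fl(a * b) = (a * b)(1 + delta)
print("formula (1.1) held for 1000 random products")
```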

1.5.1. Further Details

IEEE arithmetic also includes subnormal numbers, i.e., unnormalized floating point numbers with the minimum possible exponent. These represent tiny numbers between zero and the smallest normalized floating point number; see Figure 1.2. Their presence means that a difference $\mathrm{fl}(x - y)$ can never be zero because of underflow, yielding the attractive property that the predicate x = y is true if and only if $\mathrm{fl}(x - y) = 0$. To incorporate errors caused by underflow into formula (1.1) one would change it to

$$\mathrm{fl}(a \odot b) = (a \odot b)(1 + \delta) + \eta,$$

where $|\delta| \le \varepsilon$ as before, and $|\eta|$ is bounded by a tiny number equal to the largest error caused by underflow ($2^{-150} \approx 10^{-45}$ in IEEE single precision and $2^{-1075} \approx 10^{-324}$ in IEEE double precision).

IEEE arithmetic includes the symbols $\pm\infty$ and NaN (Not a Number). $\pm\infty$ is returned when an operation overflows, and behaves according to the following arithmetic rules: $x/(\pm\infty) = 0$ for any finite floating point number x, $x/0 = \pm\infty$ for any nonzero floating point number x, $+\infty + \infty = +\infty$, etc. An NaN is returned by any operation with no well-defined finite or infinite result, such as $\infty - \infty$, $\frac{\infty}{\infty}$, $\frac{0}{0}$, $\sqrt{-1}$, NaN $\odot$ x, etc.

Whenever an arithmetic operation is invalid and so produces an NaN, or overflows or divides by zero to produce $\pm\infty$, or underflows, an exception flag is set and can later be tested by the user's program. These features permit one to write both more reliable programs (because the program can detect and correct its own exceptions, instead of simply aborting execution) and faster programs (by avoiding "paranoid" programming with many tests and branches to avoid possible but unlikely exceptions). For examples, see Question 1.19, the comments following Lemma 5.3, and [81].

The most expensive error known to have been caused by an improperly handled floating point exception is the crash of the Ariane 5 rocket of the European Space Agency on June 4, 1996. See HOME/ariane5rep.html for details.

Not all machines use IEEE arithmetic or round carefully, although nearly all do. The most important modern exceptions are those machines produced by Cray Research, 3 although future generations of Cray machines may use IEEE arithmetic. 4


Since the difference between $\mathrm{fl}(a \odot b)$ computed on a Cray machine and $\mathrm{fl}(a \odot b)$ computed on an IEEE machine usually lies in the 14th decimal place or beyond, the reader may wonder whether the difference is important. Indeed, most algorithms in numerical linear algebra are insensitive to details in the way roundoff is handled. But it turns out that some algorithms are easier to design, or more reliable, when rounding is done properly. Here are two examples.

When the Cray C90 subtracts 1 from the next smaller floating point number, it gets $-2^{-47}$, which is twice the correct answer, $-2^{-48}$. Getting even tiny differences to high relative accuracy is essential for the correctness of the divide-and-conquer algorithm for finding eigenvalues and eigenvectors of symmetric matrices, currently the fastest algorithm available for the problem. This algorithm requires a rather nonintuitive modification to guarantee correctness on Cray machines (see section 5.3.3).

The Cray machine may also yield an error when computing $\arccos(x/\sqrt{x^2 + y^2})$ because excessive roundoff causes the argument of arccos to be larger than 1. This cannot happen in IEEE arithmetic (see Question 1.17).

To accommodate error analysis on a Cray C90 or other Cray machines we may instead use the model $\mathrm{fl}(a \pm b) = a(1 + \delta_1) \pm b(1 + \delta_2)$, $\mathrm{fl}(a * b) = (a * b)(1 + \delta_3)$, and $\mathrm{fl}(a/b) = (a/b)(1 + \delta_4)$, with $|\delta_i| \le \varepsilon$, where $\varepsilon$ is a small multiple of the maximum relative representation error.

Briefly, we can say that correct rounding and other features of IEEE arithmetic are designed to preserve as many mathematical relationships used to derive formulas as possible. It is easier to design algorithms knowing that (barring overflow or underflow) $\mathrm{fl}(a - b)$ is computed with a small relative error (otherwise divide-and-conquer can fail), and that $-1 \le c = \mathrm{fl}(x/\sqrt{x^2 + y^2}) \le 1$ (otherwise $\arccos(c)$ can fail). There are many other such mathematical relationships that one relies on (often unwittingly) to design algorithms. For more details about IEEE arithmetic and its relationship to numerical analysis, see [159, 158, 81].

3 We include machines such as the NEC SX-4, which has a "Cray mode" in which it performs arithmetic the same way. We exclude the Cray T3D and T3E, which are parallel computers built from DEC Alpha processors, which use IEEE arithmetic very nearly (underflows are flushed to zero for speed's sake).

4 Cray Research was purchased by Silicon Graphics in 1996.

Given the variability in floating point across machines, how does one write portable software that depends on the arithmetic? For example, iterative algorithms that we will study in later chapters frequently have loops such as

repeat
    update e
until "e is negligible compared to f"



where e > 0 is some error measure, and f > 0 is some comparison value (see section 4.4.5 for an example). By negligible we mean "is $e \le c \cdot \varepsilon \cdot f$?," where $c \ge 1$ is some modest constant, chosen to trade off accuracy and speed of convergence. Since this test requires the machine-dependent constant $\varepsilon$, it has in the past often been replaced by the apparently machine-independent test "is $e + c \cdot f = c \cdot f$?" The idea here is that adding e to $c \cdot f$ and rounding will yield $c \cdot f$ again if $e \le c \cdot \varepsilon \cdot f$ or perhaps a little smaller. But this test can fail (by requiring e to be much smaller than necessary, or than attainable), depending on the machine and compiler used (see the next paragraph). So the best test indeed uses $\varepsilon$ explicitly. It turns out that with sufficient care one can compute $\varepsilon$ in a machine-independent way, and software for this is available in the LAPACK subroutines slamch (for single precision) and dlamch (for double precision). These routines also compute or estimate the overflow threshold (without overflowing!), the underflow threshold, and other parameters. Another portable program that uses these explicit machine parameters is discussed in Question 1.19.
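The classic portable loop for computing $\varepsilon$ is sketched below (mine; slamch/dlamch are far more careful, and on machines with extended-precision registers a compiled version of this loop is exactly the kind of test that can mislead):

```python
def macheps():
    """Smallest power of 2 with fl(1 + eps) > 1, found by repeated halving."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

# Returns 2^-52 ~ 2.2e-16 for IEEE double: the gap between 1 and the next
# float, i.e. twice the book's epsilon = .5 * beta^(1-p) = 2^-53.
print(macheps())
```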

Sometimes one needs higher precision than is available from IEEE single or double precision. For example, higher precision is of use in algorithms such as iterative refinement for improving the accuracy of a computed solution of Ax = b (see section 2.5.1). So IEEE defines another, higher precision called double extended. For example, all arithmetic operations on an Intel Pentium (or its predecessors going back to the Intel 8086/8087) are performed in 80-bit double extended registers, providing 64-bit fractions and 15-bit exponents. Unfortunately, not all languages and compilers permit one to declare and compute with double-extended precision variables.

Few machines offer anything beyond double-extended arithmetic in hardware, but there are several ways in which more accurate arithmetic may be simulated in software. Some compilers on DEC Vax and DEC Alpha, Sun Sparc, and IBM RS6000 machines permit the user to declare quadruple precision (or real*16 or double double precision) variables and to perform computations with them. Since this arithmetic is simulated using shorter precision, it may run several times slower than double. Cray's single precision is similar in precision to IEEE double, and so Cray double precision is about twice IEEE double; it too is simulated in software and runs relatively slowly. There are also algorithms and packages available for simulating much higher precision floating point arithmetic, using either integer arithmetic [20, 21] or the underlying floating point (see Question 1.18) [204, 218].

Finally, we mention interval arithmetic, a style of computation that automatically provides guaranteed error bounds. Each variable in an interval computation is represented by a pair of floating point numbers, one a lower bound and one an upper bound. Computation proceeds by rounding in such a way that lower bounds and upper bounds are propagated in a guaranteed fashion. For example, to add the intervals $a = [a_l, a_u]$ and $b = [b_l, b_u]$, one rounds $a_l + b_l$ down to the nearest floating point number, $c_l$, and rounds $a_u + b_u$ up to the nearest floating point number, $c_u$. This guarantees that the interval $c = [c_l, c_u]$ contains the sum of any pair of variables from a and from b. Unfortunately, if one naively takes a program and converts all floating point variables and operations to interval variables and operations, it is most likely that the intervals computed by the program will quickly grow so wide (such as $[-\infty, +\infty]$) that they provide no useful information at all. (A simple example is to repeatedly compute x = x − x when x is an interval; instead of getting x = 0, the width $x_u - x_l$ of x doubles at each subtraction.) It is possible to modify old algorithms or design new ones that do provide useful guaranteed error bounds [4, 140, 162, 190], but these are often several times as expensive as the algorithms discussed in this book. The error bounds that we present in this book are not guaranteed in the same mathematical sense that interval bounds are, but they are reliable enough in almost all situations. (We discuss this in more detail later.) We will not discuss interval arithmetic further in this book.


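For concreteness, here is a minimal sketch (mine) of outward-rounded interval addition, widening each endpoint by one ulp via math.nextafter (Python 3.9+) as a conservative substitute for true directed rounding, together with the x = x − x example:

```python
import math

def iadd(a, b):
    """Interval sum [al + bl, au + bu], pushed outward by one ulp on each side
    as a conservative stand-in for true round-down/round-up."""
    return (math.nextafter(a[0] + b[0], -math.inf),
            math.nextafter(a[1] + b[1], math.inf))

def isub(a, b):
    return iadd(a, (-b[1], -b[0]))

x = (0.1, 0.2)
for _ in range(5):
    x = isub(x, x)          # naive interval x = x - x
    print(x[1] - x[0])      # width doubles each step instead of shrinking to 0
```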

1.6. Polynomial Evaluation Revisited

Let us now apply roundoff model (1.1) to evaluating a polynomial with Horner's rule. We take the original program,

p = a_d
for i = d − 1 down to 0
    p = x · p + a_i
end for

Then we add subscripts to the intermediate results so that we have a unique symbol for each one ($p_0$ is the final result):

p_d = a_d
for i = d − 1 down to 0
    p_i = x · p_{i+1} + a_i
end for

Then we insert a roundoff term $(1 + \delta_i)$ at each floating point operation to get

p_d = a_d
for i = d − 1 down to 0
    p_i = ((x · p_{i+1})(1 + δ_i) + a_i)(1 + δ′_i), where |δ_i|, |δ′_i| ≤ ε
end for

Expanding, we get the following expression for the final computed value of the polynomial:

$$p_0 = \sum_{i=0}^{d-1}\left[(1 + \delta_i')\prod_{j=0}^{i-1}(1 + \delta_j)(1 + \delta_j')\right]a_i x^i + \left[\prod_{j=0}^{d-1}(1 + \delta_j)(1 + \delta_j')\right]a_d x^d.$$


This is messy, a typical result when we try to keep track of every rounding error in an algorithm. We simplify it using the following upper and lower bounds:

$$(1 + \delta_1)\cdots(1 + \delta_j) \le (1 + \varepsilon)^j \le \frac{1}{1 - j\varepsilon} = 1 + j\varepsilon + O(\varepsilon^2),$$
$$(1 + \delta_1)\cdots(1 + \delta_j) \ge (1 - \varepsilon)^j \ge 1 - j\varepsilon.$$

These bounds are correct, provided that $j\varepsilon < 1$. Typically, we make the reasonable assumption that $j\varepsilon \ll 1$ ($j \ll 10^7$ in IEEE single precision) and make the approximations

$$1 - j\varepsilon \le (1 + \delta_1)\cdots(1 + \delta_j) \le 1 + j\varepsilon.$$

This lets us write

$$p_0 = \sum_{i=0}^{d}(1 + \bar{\delta}_i)a_i x^i \equiv \sum_{i=0}^{d}\bar{a}_i x^i, \quad \text{where } |\bar{\delta}_i| \le 2d\varepsilon.$$

So the computed value $p_0$ of p(x) is the exact value of a slightly different polynomial with coefficients $\bar{a}_i$. This means that evaluating p(x) is "backward stable," and the "backward error" is $2d\varepsilon$ measured as the maximum relative change of any coefficient of p(x).

Using this backward error bound, we bound the error in the computed polynomial:

$$|p_0 - p(x)| = \left|\sum_{i=0}^{d}(1 + \bar{\delta}_i)a_i x^i - \sum_{i=0}^{d}a_i x^i\right| = \left|\sum_{i=0}^{d}\bar{\delta}_i a_i x^i\right| \le \sum_{i=0}^{d}|\bar{\delta}_i| \cdot |a_i x^i| \le 2d\varepsilon \sum_{i=0}^{d}|a_i x^i|.$$

Note that $\sum_i |a_i x^i|$ bounds the largest value that we could compute if there were no cancellation from adding positive and negative numbers, and the error bound is $2d\varepsilon$ times smaller. This is also the case for computing dot products and many other polynomial-like expressions.

By choosing $\bar{\delta}_i = \varepsilon \cdot \mathrm{sign}(a_i x^i)$, we see that the error bound is attainable to within the modest factor 2d. This means that we may use

$$\frac{\sum_{i=0}^{d}|a_i x^i|}{\left|\sum_{i=0}^{d}a_i x^i\right|}$$


as the relative condition number for polynomial evaluation.

We can easily compute this error bound, at the cost of doubling the number of operations:

p = a_d, bp = |a_d|
for i = d − 1 down to 0
    p = x · p + a_i
    bp = |x| · bp + |a_i|
end for
error bound = bp = 2d · ε · bp

so the true value of the polynomial is in the interval [p − bp, p + bp], and the number of guaranteed correct decimal digits is $-\log_{10}|bp/p|$. These bounds are plotted in the top of Figure 1.3 for the polynomial discussed earlier, $(x - 2)^9$. (The reader may wonder whether roundoff errors could make this computed error bound inaccurate. This turns out not to be a problem and is left to the reader as an exercise.)
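The doubled loop translates directly into Python (my sketch, using the IEEE double $\varepsilon = 2^{-53}$):

```python
import numpy as np

def horner_with_bound(a, x, eps=np.finfo(float).eps / 2):
    """Horner's rule plus the running error bound above.
    a[i] is the coefficient of x^i; returns (p, bp) with |p - p(x)| <= bp."""
    d = len(a) - 1
    p, bp = a[d], abs(a[d])
    for i in range(d - 1, -1, -1):
        p = x * p + a[i]
        bp = abs(x) * bp + abs(a[i])
    return p, 2 * d * eps * bp

a = [-512, 2304, -4608, 5376, -4032, 2016, -672, 144, -18, 1]  # (x - 2)^9, lowest degree first
p, bp = horner_with_bound(a, 1.99)
print(p, bp)                   # bp ~ 5e-10 dwarfs the true value (-0.01)^9 = -1e-18
print(-np.log10(bp / abs(p)))  # typically negative near the zero: no digits guaranteed
```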

The graph of $-\log_{10}|bp/p|$ in the bottom of Figure 1.3, a lower bound on the number of correct decimal digits, indicates that we expect difficulty computing p(x) to high relative accuracy when p(x) is near 0. What is special about p(x) = 0? An arbitrarily small error $\epsilon$ in computing p(x) = 0 causes an infinite relative error $\epsilon/p(x) = \epsilon/0$. In other words, our relative error bound $2d\varepsilon \sum_{i=0}^{d}|a_i x^i| / |\sum_{i=0}^{d}a_i x^i|$ is infinite.

DEFINITION 1.1. A problem whose condition number is infinite is called ill-posed. Otherwise it is called well-posed. 5

There is a simple geometric interpretation of the condition number: it tells us how far p(x) is from a polynomial which is ill-posed.

DEFINITION 1.2. Let $p(z) = \sum_{i=0}^{d} a_i z^i$ and $q(z) = \sum_{i=0}^{d} b_i z^i$. Define the relative distance d(p, q) from p to q as the smallest value satisfying $|a_i - b_i| \le d(p, q) \cdot |a_i|$ for $0 \le i \le d$. (If all $a_i \ne 0$, then we can more simply write $d(p, q) = \max_{0 \le i \le d} \frac{|a_i - b_i|}{|a_i|}$.)

Note that if $a_i = 0$, then $b_i$ must also be zero for d(p, q) to be finite.

5 This definition is slightly nonstandard, because ill-posed problems include those whose solutions are continuous as long as they are nondifferentiable. Examples include multiple roots of polynomials and multiple eigenvalues of matrices (section 4.3). Another way to describe an ill-posed problem is one in which the number of correct digits in the solution is not always within a constant of the number of digits used in the arithmetic in the solution. For example, multiple roots of polynomials tend to lose half or more of the precision of the arithmetic.


Fig. 1.3. Plot of error bounds on the value of $y = (x - 2)^9$ evaluated using Horner's rule.


THEOREM 1.2. Suppose that $p(z) = \sum_{i=0}^{d} a_i z^i$ is not identically zero.

$$\min\{d(p, q) \text{ such that } q(x) = 0\} = \frac{\left|\sum_{i=0}^{d} a_i x^i\right|}{\sum_{i=0}^{d}|a_i x^i|}.$$

In other words, the distance from p to the nearest polynomial q whose condition number at x is infinite, i.e., q(x) = 0, is the reciprocal of the condition number of p(x).

Proof. Write $q(z) = \sum b_i z^i = \sum (1 + \epsilon_i)a_i z^i$ so that $d(p, q) = \max_i |\epsilon_i|$. Then q(x) = 0 implies $|p(x)| = |q(x) - p(x)| = |\sum_{i=0}^{d} \epsilon_i a_i x^i| \le \max_i |\epsilon_i| \cdot \sum_i |a_i x^i|$, which in turn implies $d(p, q) = \max_i |\epsilon_i| \ge |p(x)| / \sum_i |a_i x^i|$. To see that there is a q this close to p, choose
$$\epsilon_i = \frac{-p(x)}{\sum_j |a_j x^j|} \cdot \mathrm{sign}(a_i x^i). \qquad \square$$

This simple reciprocal relationship between condition number and distance to the nearest ill-posed problem is very common in numerical analysis, and we shall encounter it again later.

At the beginning of the introduction we said that we would use canonical forms of matrices to help solve linear algebra problems. For example, knowing the exact Jordan canonical form makes computing exact eigenvalues trivial. There is an analogous canonical form for polynomials, which makes accurate polynomial evaluation easy: $p(x) = a_d \prod_{i=1}^{d}(x - r_i)$. In other words, we represent the polynomial by its leading coefficient $a_d$ and its roots $r_1, \ldots, r_d$. To evaluate p(x) we use the obvious algorithm

p = a_d
for i = 1 to d
    p = p · (x − r_i)
end for

It is easy to show the computed $\hat{p} = p(x) \cdot (1 + \delta)$, where $|\delta| \le 2d\varepsilon$; i.e., we always get p(x) with high relative accuracy. But we need the roots of the polynomial to do this!

1.7. Vector and Matrix Norms

Norms are used to measure errors in matrix computations, so we need to understand how to compute and manipulate them.

Missing proofs are left as problems at the end of the chapter.

DEFINITION 1.3. Let $\mathcal{B}$ be a real (complex) linear space $\mathbb{R}^n$ (or $\mathbb{C}^n$). It is normed if there is a function $\|\cdot\| : \mathcal{B} \to \mathbb{R}$, which we call a norm, satisfying all of the following:


1) $\|x\| \ge 0$, and $\|x\| = 0$ if and only if x = 0 (positive definiteness),
2) $\|\alpha x\| = |\alpha| \cdot \|x\|$ for any real (or complex) scalar $\alpha$ (homogeneity),
3) $\|x + y\| \le \|x\| + \|y\|$ (the triangle inequality).

EXAMPLE 1.4. The most common norms are $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$ for $1 \le p < \infty$, which we call p-norms, as well as $\|x\|_\infty = \max_i |x_i|$, which we call the $\infty$-norm or infinity-norm. Also, if $\|x\|$ is any norm and C is any nonsingular matrix, then $\|Cx\|$ is also a norm. $\diamond$

We see that there are many norms that we could use to measure errors; it is important to choose an appropriate one. For example, let $x_1 = [1, 2, 3]^T$ in meters and $x_2 = [1.01, 2.01, 2.99]^T$ in meters. Then $x_2$ is a good approximation to $x_1$ because the relative error $\|x_1 - x_2\|_\infty / \|x_1\|_\infty \approx .0033$, and $x_3 = [10, 2.01, 2.99]^T$ is a bad approximation because $\|x_1 - x_3\|_\infty / \|x_1\|_\infty = 3$. But suppose the first component is measured in kilometers instead of meters. Then in this norm $x_1$ and $x_3$ look close:
$$x_1 = \begin{pmatrix} .001 \\ 2 \\ 3 \end{pmatrix}, \quad x_3 = \begin{pmatrix} .01 \\ 2.01 \\ 2.99 \end{pmatrix}, \quad \text{and} \quad \frac{\|x_1 - x_3\|_\infty}{\|x_1\|_\infty} \approx .0033.$$
To compare $x_1$ and $x_3$, we should use
$$\|x\|_S = \left\|\begin{pmatrix} 1000 & & \\ & 1 & \\ & & 1 \end{pmatrix} x\right\|_\infty$$
to make the units the same or so that equally important errors make the norm equally large.
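The numbers in this example are easy to verify with NumPy (a check of mine, not the book's code):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])        # meters
x2 = np.array([1.01, 2.01, 2.99])
x3 = np.array([10.0, 2.01, 2.99])

inf = np.inf
print(np.linalg.norm(x1 - x2, inf) / np.linalg.norm(x1, inf))  # ~0.0033: good
print(np.linalg.norm(x1 - x3, inf) / np.linalg.norm(x1, inf))  # 3.0: bad

# first component in kilometers: x1 and x3 now look close in the plain inf-norm
y1 = np.array([0.001, 2.0, 3.0])
y3 = np.array([0.01, 2.01, 2.99])
print(np.linalg.norm(y1 - y3, inf) / np.linalg.norm(y1, inf))  # ~0.0033

# the scaled norm ||diag(1000, 1, 1) x||_inf restores the right verdict
S = np.diag([1000.0, 1.0, 1.0])
print(np.linalg.norm(S @ (y1 - y3), inf) / np.linalg.norm(S @ y1, inf))  # 3.0
```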

Now we define inner products, which are a generalization of the standard dot product $\sum_i x_i y_i$, and arise frequently in linear algebra.

DEFINITION 1.4. Let $\mathcal{B}$ be a real (complex) linear space. $\langle \cdot, \cdot \rangle : \mathcal{B} \times \mathcal{B} \to \mathbb{R}$ ($\mathbb{C}$) is an inner product if all of the following apply:
1) $\langle x, y \rangle = \langle y, x \rangle$ (or $\overline{\langle y, x \rangle}$),
2) $\langle x, y + z \rangle = \langle x, y \rangle + \langle x, z \rangle$,
3) $\langle \alpha x, y \rangle = \alpha \langle x, y \rangle$ for any real (or complex) scalar $\alpha$,
4) $\langle x, x \rangle \ge 0$, and $\langle x, x \rangle = 0$ if and only if x = 0.

EXAMPLE 1.5. Over $\mathbb{R}$, $\langle x, y \rangle = y^T x = \sum_i x_i y_i$, and over $\mathbb{C}$, $\langle x, y \rangle = y^* x = \sum_i x_i \bar{y}_i$ are inner products. (Recall that $y^* = \bar{y}^T$ is the conjugate transpose of y.) $\diamond$

DEFINITION 1.5. x and y are orthogonal if (x, y) = 0.


The most important property of an inner product is that it satisfies the Cauchy–Schwartz inequality. This can be used in turn to show that $\sqrt{\langle x, x \rangle}$ is a norm, one that we will frequently use.

LEMMA 1.1. Cauchy–Schwartz inequality. $|\langle x, y \rangle| \le \sqrt{\langle x, x \rangle \cdot \langle y, y \rangle}$.

LEMMA 1.2. $\sqrt{\langle x, x \rangle}$ is a norm.

There is a one-to-one correspondence between inner products and symmetric (Hermitian) positive definite matrices, as defined below. These matrices arise frequently in applications.

DEFINITION 1.6. A real symmetric (complex Hermitian) matrix A is positive definite if $x^T A x > 0$ ($x^* A x > 0$) for all $x \ne 0$. We abbreviate symmetric positive definite to s.p.d., and Hermitian positive definite to h.p.d.

LEMMA 1.3. Let $\mathcal{B} = \mathbb{R}^n$ (or $\mathbb{C}^n$) and $\langle \cdot, \cdot \rangle$ be an inner product. Then there is an n-by-n s.p.d. (h.p.d.) matrix A such that $\langle x, y \rangle = y^T A x$ ($y^* A x$). Conversely, if A is s.p.d. (h.p.d.), then $y^T A x$ ($y^* A x$) is an inner product.

The following two lemmas are useful in converting error bounds in terms of one norm to error bounds in terms of another.

LEMMA 1.4. Let $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ be two norms on $\mathbb{R}^n$ (or $\mathbb{C}^n$). There are constants $c_1, c_2 > 0$ such that, for all x, $c_1\|x\|_\alpha \le \|x\|_\beta \le c_2\|x\|_\alpha$. We also say that norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ are equivalent with respect to constants $c_1$ and $c_2$.

LEMMA 1.5.
$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2,$$
$$\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty,$$
$$\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty.$$

In addition to vector norms, we will also need matrix norms to measure errors in matrices.

DEFINITION 1.7. $\|\cdot\|$ is a matrix norm on m-by-n matrices if it is a vector norm on m · n dimensional space:
1) $\|A\| \ge 0$ and $\|A\| = 0$ if and only if A = 0,
2) $\|\alpha A\| = |\alpha| \cdot \|A\|$,
3) $\|A + B\| \le \|A\| + \|B\|$.

EXAMPLE 1.6. $\max_{ij} |a_{ij}|$ is called the max norm, and $(\sum_{ij} |a_{ij}|^2)^{1/2} = \|A\|_F$ is called the Frobenius norm. $\diamond$


The following definition is useful for bounding the norm of a product of matrices, something we often need to do when deriving error bounds.

DEFINITION 1.8. Let $\|\cdot\|_{m \times n}$ be a matrix norm on m-by-n matrices, $\|\cdot\|_{n \times p}$ be a matrix norm on n-by-p matrices, and $\|\cdot\|_{m \times p}$ be a matrix norm on m-by-p matrices. These norms are called mutually consistent if $\|A \cdot B\|_{m \times p} \le \|A\|_{m \times n} \cdot \|B\|_{n \times p}$, where A is m-by-n and B is n-by-p.

DEFINITION 1.9. Let A be m-by-n, $\|\cdot\|_m$ be a vector norm on $\mathbb{R}^m$, and $\|\cdot\|_n$ be a vector norm on $\mathbb{R}^n$. Then
$$\|A\|_{m \times n} \equiv \max_{0 \ne x \in \mathbb{R}^n} \frac{\|Ax\|_m}{\|x\|_n}$$
is called an operator norm or induced norm or subordinate matrix norm.

The next lemma provides a large source of matrix norms, ones that we will use for bounding errors.

LEMMA 1.6. An operator norm is a matrix norm.

Orthogonal and unitary matrices, defined next, are essential ingredients of nearly all our algorithms for least squares problems and eigenvalue problems.

DEFINITION 1.10. A real square matrix Q is orthogonal if $Q^{-1} = Q^T$. A complex square matrix is unitary if $Q^{-1} = Q^*$.

All rows (or columns) of orthogonal (or unitary) matrices have unit 2-norms and are orthogonal to one another, since $QQ^T = Q^TQ = I$ ($QQ^* = Q^*Q = I$).

The next lemma summarizes the essential properties of the norms and matrices we have introduced so far. We will use these properties later in the book.

LEMMA 1.7.
1. $\|Ax\| \le \|A\| \cdot \|x\|$ for a vector norm and its corresponding operator norm, or the vector two-norm and matrix Frobenius norm.
2. $\|AB\| \le \|A\| \cdot \|B\|$ for any operator norm or for the Frobenius norm. In other words, any operator norm (or the Frobenius norm) is mutually consistent with itself.
3. The max norm and Frobenius norm are not operator norms.
4. $\|QAZ\| = \|A\|$ if Q and Z are orthogonal or unitary, for the Frobenius norm and for the operator norm induced by $\|\cdot\|_2$. This is really just the Pythagorean theorem.
5. $\|A\|_\infty \equiv \max_{x \ne 0} \frac{\|Ax\|_\infty}{\|x\|_\infty} = \max_i \sum_j |a_{ij}|$ = maximum absolute row sum.


6. $\|A\|_1 \equiv \max_{x \ne 0} \frac{\|Ax\|_1}{\|x\|_1} = \|A^T\|_\infty = \max_j \sum_i |a_{ij}|$ = maximum absolute column sum.
7. $\|A\|_2 \equiv \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} = \sqrt{\lambda_{\max}(A^*A)}$, where $\lambda_{\max}$ denotes the largest eigenvalue.
8. $\|A\|_2 = \|A^T\|_2$.
9. $\|A\|_2 = \max_i |\lambda_i(A)|$ if A is normal, i.e., $AA^* = A^*A$.
10. If A is n-by-n, then $n^{-1/2}\|A\|_2 \le \|A\|_1 \le n^{1/2}\|A\|_2$.
11. If A is n-by-n, then $n^{-1/2}\|A\|_2 \le \|A\|_\infty \le n^{1/2}\|A\|_2$.
12. If A is n-by-n, then $n^{-1}\|A\|_\infty \le \|A\|_1 \le n\|A\|_\infty$.
13. If A is n-by-n, then $\|A\|_2 \le \|A\|_F \le n^{1/2}\|A\|_2$.

Proof. We prove part 7 only and leave the rest to Question 1.16. Since $A^*A$ is Hermitian, there exists an eigendecomposition $A^*A = Q\Lambda Q^*$, with Q a unitary matrix (the columns are eigenvectors), and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, a diagonal matrix containing the eigenvalues, which must all be real. Note that all $\lambda_i \ge 0$ since if one, say $\lambda$, were negative, we would take q as its eigenvector and get the contradiction $0 \le \|Aq\|_2^2 = q^*A^*Aq = q^*\lambda q = \lambda\|q\|_2^2 < 0$. Therefore
$$\|A\|_2 = \max_{x \ne 0}\frac{\|Ax\|_2}{\|x\|_2} = \max_{x \ne 0}\frac{(x^*A^*Ax)^{1/2}}{\|x\|_2} = \max_{x \ne 0}\frac{(x^*Q\Lambda Q^*x)^{1/2}}{\|x\|_2} = \max_{x \ne 0}\frac{((Q^*x)^*\Lambda(Q^*x))^{1/2}}{\|Q^*x\|_2} = \max_{y \ne 0}\frac{(y^*\Lambda y)^{1/2}}{\|y\|_2} = \max_{y \ne 0}\sqrt{\frac{\sum_i \lambda_i |y_i|^2}{\sum_i |y_i|^2}} = \sqrt{\lambda_{\max}},$$
which is attainable by choosing y to be the appropriate column of the identity matrix. $\square$

1.8. References and Other Topics for Chapter 1

At the end of each chapter we will list the references most relevant to that chapter. They are also listed alphabetically in the bibliography at the end. In addition we will give pointers to related topics not discussed in the main text.

The most modern comprehensive work in this area is by G. Golub and C. Van Loan [121], which also has an extensive bibliography. A recent undergraduate level or beginning graduate text in this material is by D. Watkins [252]. Another good graduate text is by L. Trefethen and D. Bau [243]. A classic work that is somewhat dated but still an excellent reference is by J. Wilkinson [262]. An older but still excellent book at the same level as Watkins is by G. Stewart [235].


More detailed information on error analysis can be found in the recent book by N. Higham [149]. Older but still good general references are by J. Wilkinson [261] and W. Kahan [157].

"What every computer scientist should know about floating point arith-metic" by D. Goldberg is a good recent survey [119]. IEEE arithmetic is de-scribed formally in [11, 12, 159] as well as in the reference manuals publishedby computer manufacturers. Discussion of error analysis with IEEE arithmeticmay be found in [54, 70, 159, 158] and the references cited therein.

A more general discussion of condition numbers and the distance to the nearest ill-posed problem is given by the author in [71] as well as in a series of papers by S. Smale and M. Shub [219, 220, 221, 222]. Vector and matrix norms are discussed at length in [121, sects. 2.2, 2.3].

1.9. Questions for Chapter 1

QUESTION 1.1. (Easy; Z. Bai) Let A be an orthogonal matrix. Show that $\det(A) = \pm 1$. Show that if B also is orthogonal and $\det(A) = -\det(B)$, then A + B is singular.

QUESTION 1.2. (Easy; Z. Bai) The rank of a matrix is the dimension of the space spanned by its columns. Show that A has rank one if and only if $A = ab^T$ for some column vectors a and b.

QUESTION 1.3. (Easy; Z. Bai) Show that if a matrix is orthogonal and triangular, then it is diagonal. What are its diagonal elements?

QUESTION 1.4. (Easy; Z. Bai) A matrix is strictly upper triangular if it is upper triangular with zero diagonal elements. Show that if A is strictly upper triangular and n-by-n, then $A^n = 0$.

QUESTION 1.5. (Easy; Z. Bai) Let $\|\cdot\|$ be a vector norm on $\mathbb{R}^m$ and assume that $C \in \mathbb{R}^{m \times n}$. Show that if $\mathrm{rank}(C) = n$, then $\|x\|_C \equiv \|Cx\|$ is a vector norm.

QUESTION 1.6. (Easy; Z. Bai) Show that if $0 \ne s \in \mathbb{R}^n$ and $E \in \mathbb{R}^{n \times n}$, then

$$\left\| E \left( I - \frac{ss^T}{s^Ts} \right) \right\|_F^2 = \|E\|_F^2 - \frac{\|Es\|_2^2}{s^Ts}.$$

QUESTION 1.7. (Easy; Z. Bai) Verify that $\|xy^H\|_F = \|xy^H\|_2 = \|x\|_2\|y\|_2$ for any $x, y \in \mathbb{C}^n$.


[Figure: $y = \log(1+x)/x$ computed two ways. Left column: the formula $\log(1+x)/x$; right column: $\log(1+x)/((1+x)-1)$. Top row: $x \in [-1, 1]$; bottom row: $x \in [-8, 8] \times$ macheps.]


QUESTION 1.8. (Medium) One can identify the degree $d$ polynomials $p(x) = \sum_{i=0}^{d} a_i x^i$ with $\mathbb{R}^{d+1}$ via the vector of coefficients. Let $x$ be fixed. Let $S_x$ be the set of polynomials with an infinite relative condition number with respect to evaluating them at $x$ (i.e., they are zero at $x$). In a few words, describe $S_x$ geometrically as a subset of $\mathbb{R}^{d+1}$. Let $S_x(\kappa)$ be the set of polynomials whose relative condition number is $\kappa$ or greater. Describe $S_x(\kappa)$ geometrically in a few words. Describe how $S_x(\kappa)$ changes geometrically as $\kappa \to \infty$.

QUESTION 1.9. (Medium) Consider the figure above. It plots the function $y = \log(1+x)/x$ computed in two different ways. Mathematically, $y$ is a smooth function of $x$ near $x = 0$, equaling 1 at 0. But if we compute $y$ using this formula, we get the plots on the left (shown in the ranges $x \in [-1, 1]$ on the top left and $x \in [-10^{-15}, 10^{-15}]$ on the bottom left). This formula is clearly unstable near $x = 0$. On the other hand, if we use the algorithm

d = 1 + x
if d = 1 then
   y = 1
else
   y = log(d)/(d - 1)
end if

we get the two plots on the right, which are correct near $x = 0$. Explain this phenomenon, proving that the second algorithm must compute an accurate answer in floating point arithmetic. Assume that the log function returns an accurate answer for any argument. (This is true of any reasonable implementation of logarithm.) Assume IEEE floating point arithmetic if that makes your argument easier. (Both algorithms can malfunction on a Cray machine.)
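For experimentation, here is one possible Matlab rendering of the second algorithm (the function name log1px is our own label, not from the text):

function y = log1px(x)
% Evaluate log(1+x)/x stably near x = 0, following the algorithm above.
% The rounding error made in forming d = 1+x reappears in the exact
% denominator d-1, so the errors in numerator and denominator track
% each other and cancel in the quotient.
d = 1 + x;
if d == 1
   y = 1;                 % the limiting value of log(1+x)/x at x = 0
else
   y = log(d)/(d - 1);
end
end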


QUESTION 1.10. (Medium) Show that, barring overflow or underflow,
$$fl\left(\sum_{i=1}^{d} x_i y_i\right) = \sum_{i=1}^{d} x_i y_i (1 + \delta_i), \quad \text{where } |\delta_i| \le d\varepsilon.$$
Use this to prove the following fact. Let $A^{m \times n}$ and $B^{n \times p}$ be matrices, and compute their product in the usual way. Barring overflow or underflow, show that $|fl(A \cdot B) - A \cdot B| \le n \cdot \varepsilon \cdot |A| \cdot |B|$. Here the absolute value of a matrix $|A|$ means the matrix with entries $(|A|)_{ij} = |a_{ij}|$, and the inequality is meant componentwise.

The result of this question will be used in section 2.4.2, where we analyze the roundoff errors in Gaussian elimination.
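The bound is easy to test numerically. In the following Matlab experiment (our own illustration), the data are first rounded to single precision so they are exactly representable, the product is formed in single precision, and the double precision product serves as a nearly exact reference:

% Check |fl(A*B) - A*B| <= n*eps*|A|*|B| componentwise.
m = 50; n = 40; p = 30;
A = double(single(randn(m, n)));   % entries exactly representable in single
B = double(single(randn(n, p)));
computed = double(single(A)*single(B));   % product rounded in single
reference = A*B;                          % double precision, nearly exact
bound = n*eps('single')*abs(A)*abs(B);
max(max(abs(computed - reference)./bound))   % should be at most about 1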

QUESTION 1.11. (Medium) Let $L$ be a lower triangular matrix and solve $Lx = b$ by forward substitution. Show that, barring overflow or underflow, the computed solution $\hat{x}$ satisfies $(L + \delta L)\hat{x} = b$, where $|\delta l_{ij}| \le n\varepsilon|l_{ij}|$ and $\varepsilon$ is the machine precision. This means that forward substitution is backward stable. Argue that backward substitution for solving upper triangular systems satisfies the same bound.

The result of this question will be used in section 2.4.2, where we analyze the roundoff errors in Gaussian elimination.
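For concreteness, here is a minimal Matlab version of forward substitution (our own sketch; the function name is our own label):

function x = forward_subst(L, b)
% Solve L*x = b for lower triangular L by forward substitution.
n = length(b);
x = zeros(n, 1);
for i = 1:n
   % subtract off the already-computed unknowns, then divide
   x(i) = (b(i) - L(i, 1:i-1)*x(1:i-1)) / L(i, i);
end
end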

QUESTION 1.12. (Medium) In order to analyze the effects of rounding errors, we have used the following model (see equation (1.1)):
$$fl(a \odot b) = (a \odot b)(1 + \delta),$$
where $\odot$ is one of the four basic operations $+$, $-$, $*$, and $/$, and $|\delta| \le \varepsilon$. To show that our analyses also work for complex data, we need to prove an analogous formula for the four basic complex operations. Now $\delta$ will be a tiny complex number bounded in absolute value by a small multiple of $\varepsilon$. Prove that this is true for complex addition, subtraction, multiplication, and division. Your algorithm for complex division should successfully compute $a/a \approx 1$, where $|a|$ is either very large (larger than the square root of the overflow threshold) or very small (smaller than the square root of the underflow threshold). Is it true that both the real and imaginary parts of the complex product are always computed to high relative accuracy?
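One standard way to meet the scaling requirement for complex division is Smith's algorithm, sketched below in Matlab (a well-known technique; this particular formulation is ours, not prescribed by the text):

function [e, f] = complex_div(a, b, c, d)
% Compute (a + b*i)/(c + d*i) = e + f*i by Smith's scaled formulas,
% which never form c^2 + d^2 and so avoid gratuitous over/underflow.
if abs(c) >= abs(d)
   r = d/c;               % |r| <= 1, so r cannot overflow
   t = 1/(c + d*r);
   e = (a + b*r)*t;
   f = (b - a*r)*t;
else
   r = c/d;
   t = 1/(c*r + d);
   e = (a*r + b)*t;
   f = (b*r - a)*t;
end
end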

QUESTION 1.13. (Medium) Prove Lemma 1.3.

QUESTION 1.14. (Medium) Prove Lemma 1.5.

QUESTION 1.15. (Medium) Prove Lemma 1.6.

QUESTION 1.16. (Medium) Prove all parts except 7 of Lemma 1.7. Hint for part 8: Use the fact that if $X$ and $Y$ are both $n$-by-$n$, then $XY$ and $YX$ have the same eigenvalues. Hint for part 9: Use the fact that a matrix is normal if and only if it has a complete set of orthonormal eigenvectors.


QUESTION 1.17. (Hard; W. Kahan) We mentioned that on a Cray machine the expression $\arccos(x/\sqrt{x^2 + y^2})$ caused an error, because roundoff caused $x/\sqrt{x^2 + y^2}$ to exceed 1. Show that this is impossible using IEEE arithmetic, barring overflow or underflow. Hint: You will need to use more than the simple model $fl(a \odot b) = (a \odot b)(1 + \delta)$ with $|\delta|$ small. Think about evaluating $\sqrt{x^2}$, and show that, barring overflow or underflow, $fl(\sqrt{fl(x^2)}) = |x|$ exactly; in numerical experiments done by A. Liu, this failed about 5% of the time on a Cray YMP. You might try some numerical experiments and explain them. Extra credit: Prove the same result using correctly rounded decimal arithmetic. (The proof is different.) This question is due to W. Kahan, who was inspired by a bug in a Cray program of J. Sethian.

QUESTION 1.18. (Hard) Suppose that $a$ and $b$ are normalized IEEE double precision floating point numbers, and consider the following algorithm, running with IEEE arithmetic:

if (|a| < |b|), swap a and b
s1 = a + b
s2 = (a - s1) + b

Prove the following facts:

1. Barring overflow or underflow, the only roundoff error committed in running the algorithm is computing $s_1 = fl(a + b)$. In other words, both the difference $a - s_1$ and the sum $(a - s_1) + b$ are computed exactly.

2. $s_1 + s_2 = a + b$, exactly. This means that $s_2$ is actually the roundoff error committed when rounding the exact value of $a + b$ to get $s_1$.

Thus, this program in effect simulates quadruple precision arithmetic, representing the true sum $a + b$ as the higher-order bits ($s_1$) and the lower-order bits ($s_2$).

Using this and similar tricks in a systematic way, it is possible to efficiently simulate all four basic floating point operations in arbitrary precision arithmetic, using only the underlying floating point instructions and no "bit-fiddling" [204]. 128-bit arithmetic is implemented this way on the IBM RS6000 and Cray (but much less efficiently on the Cray, which does not have IEEE arithmetic).
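A minimal Matlab illustration of the algorithm (the chosen values of a and b are our own):

% Two-sum: recover the exact rounding error of a floating point addition.
a = 1.0; b = 3*eps/4;      % |a| >= |b|, so no swap is needed
s1 = a + b;                % rounded sum
s2 = (a - s1) + b;         % exact rounding error committed in s1
% s1 + s2 equals a + b exactly; s2 holds the low-order bits of b
% that were rounded away in forming s1.
fprintf('s1 = %.17g, s2 = %.17g\n', s1, s2)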

QUESTION 1.19. (Hard; Programming) This question illustrates the challenges in engineering highly reliable numerical software. Your job is to write a program to compute the two-norm $s = \|x\|_2 = (\sum_{i=1}^{n} x_i^2)^{1/2}$ given $x_1, \ldots, x_n$. The most obvious (and inadequate) algorithm is

s = 0
for i = 1 to n
   s = s + x_i^2


end for
s = sqrt(s)

This algorithm is inadequate because it does not have the following desirable properties:

1. It must compute the answer accurately (i.e., nearly all the computed digits must be correct) unless $\|x\|_2$ is (nearly) outside the range of normalized floating point numbers.

2. It must be nearly as fast as the obvious program above in most cases.

3. It must work on any "reasonable" machine, possibly including ones not running IEEE arithmetic. This means it may not cause an error condition, unless $\|x\|_2$ is (nearly) larger than the largest floating point number.

To illustrate the difficulties, note that the obvious algorithm fails when $n = 1$ and $x_1$ is larger than the square root of the largest floating point number (in which case $x_1^2$ overflows, and the program returns $+\infty$ in IEEE arithmetic and halts in most non-IEEE arithmetics) or when $n = 1$ and $x_1$ is smaller than the square root of the smallest normalized floating point number (in which case $x_1^2$ underflows, possibly to zero, and the algorithm may return zero). Scaling the $x_i$ by dividing them all by $\max_i |x_i|$ does not have property 2, because division is usually many times more expensive than either multiplication or addition. Multiplying by $c = 1/\max_i |x_i|$ risks overflow in computing $c$, even when $\max_i |x_i| > 0$.

This routine is important enough that it has been standardized as a Basic Linear Algebra Subroutine, or BLAS, which should be available on all machines [169]. We discuss the BLAS at length in section 2.6.1, and documentation and sample implementations may be found at NETLIB/blas. In particular, see NETLIB/cgi-bin/netlibget.pl/blas/snrm2.f for a sample implementation that has properties 1) and 3) but not 2). These sample implementations are intended to be starting points for implementations specialized to particular architectures (an easier problem than producing a completely portable one, as requested in this problem). Thus, when writing your own numerical software, you should think of computing $\|x\|_2$ as a building block that should be available in a numerical library on each machine.

For another careful implementation of $\|x\|_2$, see [35].

You can extract test code from NETLIB/blas/sblat1 to see if your implementation is correct; all implementations turned in must be thoroughly tested as well as timed, with times compared to the obvious algorithm above on those cases where both run. See how close to satisfying the three conditions you can come; the frequent use of the word "nearly" in conditions (1), (2), and (3) shows where you may compromise in attaining one condition in order to more


nearly attain another. In particular, you might want to see how much easier the problem is if you limit yourself to machines running IEEE arithmetic.

Hint: Assume that the values of the overflow and underflow thresholds are available to your algorithm. Portable software for computing these values is available (see NETLIB/cgi-bin/netlibget.pl/lapack/util/slamch.f).
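One possible shape for such a routine is the classic one-pass scaled accumulation used by the reference BLAS implementation. The Matlab sketch below illustrates the idea only; it is not a solution satisfying all three properties (in particular, the divisions violate property 2):

function s = twonorm(x)
% Scaled two-norm: maintain the running sum of squares of x(1:i)
% divided by scale^2, where scale = max over j <= i of |x(j)|, so the
% accumulator ssq always lies between 1 and i and cannot overflow.
scale = 0; ssq = 1;
for i = 1:length(x)
   xi = abs(x(i));
   if xi > 0
      if scale < xi
         ssq = 1 + ssq*(scale/xi)^2;   % rescale to the new maximum
         scale = xi;
      else
         ssq = ssq + (xi/scale)^2;
      end
   end
end
s = scale*sqrt(ssq);
end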

QUESTION 1.20. (Easy; Medium) We will use a Matlab program to illustrate how sensitive the roots of a polynomial can be to small perturbations in the coefficients. The program is available at HOMEPAGE/Matlab/polyplot.m. (Recall that we abbreviate the URL prefix of the class homepage to HOMEPAGE in the text.) Polyplot takes an input polynomial specified by its roots r and then adds random perturbations to the polynomial coefficients, computes the perturbed roots, and plots them. The inputs are

r = vector of roots of the polynomial,
e = maximum relative perturbation to make to each coefficient of the polynomial,
m = number of random polynomials to generate, whose roots are plotted.

1. (Easy) The first part of your assignment is to run this program for the following inputs. In all cases choose m high enough that you get a fairly dense plot but don't have to wait too long. m = a few hundred or perhaps 1000 is enough. You may want to change the axes of the plot if the graph is too small or too large.

• r = (1:10); e = 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8.

• r = (1:20); e = 1e-9, 1e-11, 1e-13, 1e-15.

• r = [2, 4, 8, 16, ..., 1024]; e = 1e-1, 1e-2, 1e-3, 1e-4.

Also try your own example with complex conjugate roots. Which roots are most sensitive?

2. (Medium) The second part of your assignment is to modify the program to compute the condition number c(i) for each root. In other words, a relative perturbation of e in each coefficient should change root r(i) by at most about e*c(i). Modify the program to plot circles centered at r(i) with radii e*c(i), and confirm that these circles enclose the perturbed roots (at least when e is small enough that the linearization used to derive the condition number is accurate). You should turn in a few plots with circles and perturbed eigenvalues, and some explanation of what you observe.

3. (Medium) In the last part, notice that your formula for c(i) "blows up" if $p'(r(i)) = 0$. This condition means that r(i) is a multiple root of $p(x) = 0$. We can still expect some accuracy in the computed value of a multiple root, however, and in this part of the question, we will ask how sensitive a multiple root can be: First, write $p(x) = q(x) \cdot (x - r(i))^m$, where $q(r(i)) \ne 0$ and $m$ is the multiplicity of the root r(i). Then compute the $m$ roots nearest r(i) of the slightly perturbed polynomial $p(x) - q(x)\varepsilon$, and show that they differ from r(i) by $|\varepsilon|^{1/m}$. So if $m = 2$, for instance, the root r(i) is perturbed by $\varepsilon^{1/2}$, which is much larger than $\varepsilon$ if $\varepsilon \ll 1$. Higher values of $m$ yield even larger perturbations. If $\varepsilon$ is around machine epsilon and represents rounding errors in computing the root, this means an $m$-tuple root can lose all but $1/m$-th of its significant digits.
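A quick Matlab check of the $m = 2$ case (our own example): perturbing the constant coefficient of $(x - 2)^2$ by $\varepsilon$ moves the double root by about $\sqrt{\varepsilon}$:

% p(x) = (x-2)^2 has coefficients [1 -4 4]; perturb the constant term.
epsilon = 1e-10;
p = [1 -4 4];
p(3) = p(3) - epsilon;
r = roots(p);              % roots are 2 +/- sqrt(epsilon) = 2 +/- 1e-5
max(abs(r - 2))            % about 1e-5, vastly larger than epsilon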

QUESTION 1.21. (Medium) Apply Algorithm 1.1, Bisection, to find the roots of $p(x) = (x - 2)^9 = 0$, where $p(x)$ is evaluated using Horner's rule. Use the Matlab implementation in HOMEPAGE/Matlab/bisect.m, or else write your own. Confirm that changing the input interval slightly changes the computed root drastically. Modify the algorithm to use the error bound discussed in the text to stop bisecting when the roundoff error in the computed value of $p(x)$ gets so large that its sign cannot be determined.
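For reference, Horner's rule on the expanded coefficients of $(x - 2)^9$ takes only a few lines of Matlab (our own fragment); evaluating near $x = 2$ shows the computed sign oscillating because of roundoff:

% Evaluate p(x) = (x-2)^9 by Horner's rule on its expanded coefficients.
coeffs = poly(2*ones(9, 1));     % coefficients of (x-2)^9, highest first
for xi = linspace(1.92, 2.08, 9)
   p = 0;
   for c = coeffs                % Horner: p = (...(c1*xi + c2)*xi + ...)
      p = p*xi + c;
   end
   fprintf('x = %5.2f   p(x) = % .3e\n', xi, p)
end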
