
Numerical Methods Orals

Travis Askham

April 4, 2012

Contents

1 Floating Point Arithmetic, Conditioning, and Stability
1.1 Previously Asked Questions
1.2 Numerical Methods Class Problems
1.3 Questions that Seem Reasonable
1.4 The IEEE Standard (Double and Single)

2 Conditioning, Stability, and Accuracy
2.1 Definitions
2.2 A Theorem
2.3 An Example

3 Solution of Linear Systems (Direct)
3.1 LU Decomposition
3.2 Cholesky's Method

4 QR and SVD Factorizations and Least Squares
4.1 QR Decomposition by Modified Gram-Schmidt
4.2 QR Decomposition by Householder Reflections
4.3 SVD Overview
4.4 Least Squares
4.5 Best Low Rank Approximation

5 Eigenvalue Algorithms
5.1 Why Iterative?
5.2 Power Method
5.3 The QR Algorithm
5.4 Jacobi's Method by Givens Rotations
5.5 Bisection for Tridiagonal Systems
5.6 Methods Described in other Parts of the Notes

6 Iterative Methods for Linear Systems
6.1 Classical Iterative Methods
6.2 Steepest Descent
6.3 Krylov Subspaces
6.4 Arnoldi Iteration
6.5 GMRES
6.6 Lanczos Iteration
6.7 Conjugate Gradient
6.8 Preconditioning

7 Interpolation by Polynomials
7.1 Existence
7.2 Divided Differences
7.3 Piecewise Approximations

8 Numerical Integration
8.1 Using Equidistant Points
8.2 Quadrature Rules
8.3 Adaptive Methods
8.4 Issues and Other Considerations

9 Nonlinear Equations and Newton's Method
9.1 The Bisection Idea
9.2 Newton's Method
9.3 The Secant Method

10 Numerical ODE
10.1 The Set-Up
10.2 It's Almost Always this Idea
10.3 A Note on Conditioning
10.4 Richardson Extrapolation
10.5 Modified Equation
10.6 Splitting Schemes
10.7 Linear Multi-Step Methods (LMM)
10.8 Runge-Kutta
10.9 Stiff Problems
10.10 Stability Analysis (Especially for Stiff Problems)

11 Numerical PDE
11.1 Finite Differences for Elliptic PDE
11.2 Finite Differences for Equations of Evolution (Parabolic and Hyperbolic PDE)
11.3 Stability, etc.
11.4 Finite Element Methods for Elliptic PDE

12 Fast Algorithms in Potential Theory and Linear Algebra
12.1 Linear Elliptic PDE and the FMM: A Paradigm
12.2 Fast Multipole Methods
12.3 Hierarchical Matrix Compression
12.4 The Fast Fourier Transform

13 References

1 Floating Point Arithmetic, Conditioning, and Stability

1.1 Previously Asked Questions

• Why are normal equations bad for solving least squares problems? (Ans: condition number squared) What would you use instead?

• What does the condition number of a matrix tell you? (Ans: I talked about error bounds for solving systems of equations and convergence rates of iterative methods.) What does the error bound tell you in the case that the solution is zero? (It turns out that this is something of a trick question.)

1.2 Numerical Methods Class Problems

• Find three double precision IEEE floating point numbers a, b, and c for which the relative error of a + b + c is very large. Try to make it as bad as possible and explain your reasoning. (A small sketch appears after this list.)

• What is the smallest positive integer which is not exactly represented as a single precision IEEE floating point number? What is the largest finite integer which is part of the double precision IEEE floating point system?

• Find the IEEE single and double precision floating point representations of the numbers 4, 100, 1/100, 2^100, 2^200, and 2^1050.
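A minimal sketch for the first problem (these particular values are my own choice): left-to-right summation absorbs the small term, so the computed value of a + b + c has relative error 1.

# Assumed example values: the exact sum is 1e-17, but the computed sum is 0.
a, b, c = 1.0, 1e-17, -1.0
print(a + b + c)      # 0.0: b is lost when a + b is rounded to 1.0
print((a + c) + b)    # 1e-17: reordering recovers the exact sum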

1.3 Questions that Seem Reasonable

• What is the approximate value of machine epsilon in the IEEE double-format floating-point standard?


1.4 The IEEE Standard (Double and Single)

A double-format number is represented as

± a1a2 . . . a11 b1b2 . . . b52

The value given can be figured out from the following table

If exponent string is:              Then the value is:
(00000000000)2 = (0)10              ±(0.b1 b2 . . . b52)2 × 2^−1022
(00000000001)2 = (1)10              ±(1.b1 b2 . . . b52)2 × 2^−1022
(00000000010)2 = (2)10              ±(1.b1 b2 . . . b52)2 × 2^−1021
...
(01111111111)2 = (1023)10           ±(1.b1 b2 . . . b52)2 × 2^0
(10000000000)2 = (1024)10           ±(1.b1 b2 . . . b52)2 × 2^1
...
(11111111101)2 = (2045)10           ±(1.b1 b2 . . . b52)2 × 2^1022
(11111111110)2 = (2046)10           ±(1.b1 b2 . . . b52)2 × 2^1023
(11111111111)2 = (2047)10           ±∞ if bi = 0 ∀i, NaN otherwise

A single-format number is represented as

± a1a2 . . . a8 b1b2 . . . b23

The value given can be figured out from the following table

If exponent string is:          Then the value is:
(00000000)2 = (0)10             ±(0.b1 b2 . . . b23)2 × 2^−126
(00000001)2 = (1)10             ±(1.b1 b2 . . . b23)2 × 2^−126
(00000010)2 = (2)10             ±(1.b1 b2 . . . b23)2 × 2^−125
...
(01111111)2 = (127)10           ±(1.b1 b2 . . . b23)2 × 2^0
(10000000)2 = (128)10           ±(1.b1 b2 . . . b23)2 × 2^1
...
(11111101)2 = (253)10           ±(1.b1 b2 . . . b23)2 × 2^126
(11111110)2 = (254)10           ±(1.b1 b2 . . . b23)2 × 2^127
(11111111)2 = (255)10           ±∞ if bi = 0 ∀i, NaN otherwise

Note that for either format, the numbers corresponding to the exponent zero are interpreted differently. The leading zero allows these "subnormal" numbers to take very small values, but at the cost of significant digits.
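A small Python sketch (my own illustration) that pulls apart the fields of the double format described in the table above:

import struct

# Decode sign, exponent string, and fraction bits of a double.
def fields(x):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # the 11-bit exponent string
    fraction = bits & ((1 << 52) - 1)      # the 52 fraction bits b1...b52
    return sign, exponent, fraction

print(fields(1.0))     # (0, 1023, 0): +(1.00...0)_2 x 2^0
print(fields(5e-324))  # (0, 0, 1): a subnormal, exponent string all zeros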


The value of machine epsilon is determined by the smallest significant digit that can be stored; the values are

• Double-format: ε = 2^−52 ≈ 2.2 × 10^−16
• Single-format: ε = 2^−23 ≈ 1.2 × 10^−7

With IEEE correctly rounded arithmetic, the results of arithmetical operations have nice error bounds and behavior. Let round be the function that represents a number as its nearest floating point approximation and a circled operation signify its floating point equivalent. Assume x, y are floating point numbers.

• x ⊕ y = round(x + y) = (x + y)(1 + δ)
• x ⊖ y = round(x − y) = (x − y)(1 + δ)
• x ⊗ y = round(x × y) = (x × y)(1 + δ)
• x ⊘ y = round(x/y) = (x/y)(1 + δ)
• x ⊗ 1 = x
• x ⊖ y = 0 ⇒ x = y

However, it's not all good. Indeed, it is possible that

(x ⊕ y) ⊖ z ≠ round(x + y − z)

for x, y, and z floating-point numbers. Consider x = z = 1 and y = 2^−25 in single-format.
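A quick check of this example with numpy's single-precision type (an assumed illustration):

import numpy as np

x = np.float32(1.0); y = np.float32(2.0 ** -25); z = np.float32(1.0)
print((x + y) - z)                         # 0.0: y is absorbed when x + y is rounded
print(np.float32(1.0 + 2.0 ** -25 - 1.0))  # 2^-25, the correctly rounded result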

2 Conditioning, Stability, and Accuracy

2.1 Definitions

2.1.1 Absolute Condition Number

For a problem f : X → Y .

κ = lim_{δ→0} sup_{‖δx‖≤δ} ‖δf‖/‖δx‖

(here δf = f(x + δx) − f(x)). If f is differentiable,

κ = ‖J(x)‖

where the norm on the Jacobian is the induced norm from the norms on X and Y.


2.1.2 Relative Condition Number

For a problem f : X → Y,

κ = lim_{δ→0} sup_{‖δx‖≤δ} (‖δf‖/‖f(x)‖) / (‖δx‖/‖x‖)

For f differentiable, this gives

κ = ‖J(x)‖ ‖x‖ / ‖f(x)‖

2.1.3 Condition Number of a Matrix

For an invertible matrix A

κ(A) = ‖A‖‖A−1‖

If we are using the 2-norm, we get

κ(A) = σ1/σm

the ratio of the largest to smallest singular value of A. Note: the condition numbers of the following problems are bounded by κ(A): matrix-vector multiplication and solving a system Ax = b.
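A quick numpy check of the 2-norm formula (an assumed illustration; the matrix is arbitrary):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
s = np.linalg.svd(A, compute_uv=False)   # singular values, decreasing
print(s[0] / s[-1])                      # sigma_1 / sigma_m
print(np.linalg.cond(A, 2))              # agrees with numpy's 2-norm kappa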

2.1.4 Algorithms

Problems can be viewed as maps f : X → Y and algorithms can be viewed as another map f̃ : X → Y. Let c(x) mean the computer representation of x. Then, in particular, we have

x → c(x) → program → f̃(x)

2.1.5 Accuracy

If we have that

‖f̃(x) − f(x)‖ / ‖f(x)‖ = O(εmach)

then the algorithm f̃ is accurate. We note that for an ill-conditioned problem, an accurate algorithm in the above sense is unlikely. For instance, rounding error alone could lead to poor accuracy in the case of an ill-conditioned problem.


2.1.6 Stability

An algorithm f̃ is "stable" if for each x ∈ X,

‖f̃(x) − f(x̃)‖ / ‖f(x̃)‖ = O(εmach)

for some x̃ with

‖x̃ − x‖ / ‖x‖ = O(εmach)

"A stable algorithm gives nearly the right answer to nearly the right question."

2.1.7 Backward Stable

Often, numerical linear algebra algorithms satisfy a stronger condition:

f̃(x) = f(x̃)

for some x̃ with

‖x̃ − x‖ / ‖x‖ = O(εmach)

"A backward stable algorithm gives exactly the right answer to nearly the right question."

2.2 A Theorem

The accuracy of a backward stable algorithm is given by

‖f̃(x) − f(x)‖ / ‖f(x)‖ = O(κ(x) εmach)

Proving this theorem is straightforward: write f̃(x) = f(x̃) for some nearby x̃ and apply the definition of the relative condition number.

2.3 An Example

3 Solution of Linear Systems (Direct)

3.1 LU Decomposition

3.1.1 Explanation

This is the decomposition you obtain when you perform Gaussian elimination on a matrix. You successively introduce zeros below the diagonal of your matrix, one column at a time (from left to right). The resulting decomposition is A = LU for L a lower triangular (often unit diagonal) matrix and U an upper triangular matrix. The proof of the existence is easy.
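A minimal numpy sketch of the factorization (my own illustration; no pivoting, so it is only for exposition):

import numpy as np

# LU by Gaussian elimination without pivoting.
def lu(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):                   # zero out column k below the diagonal
        L[k+1:, k] = A[k+1:, k] / A[k, k]    # store the multipliers in L
        A[k+1:, :] -= np.outer(L[k+1:, k], A[k, :])
    return L, np.triu(A)

A = np.array([[2.0, 1.0], [4.0, 5.0]])
L, U = lu(A)
print(np.allclose(L @ U, A))  # True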


3.1.2 Stability

Without pivoting, the method can be quite unstable. The following is a standard example. Perform Gaussian elimination on

A = ( 10^−20  1 )
    ( 1       1 )

You get

A = ( 1      0 ) ( 10^−20  1         ) ≈ ( 1      0 ) ( 10^−20  1      )
    ( 10^20  1 ) ( 0       1 − 10^20 )   ( 10^20  1 ) ( 0       −10^20 )

where the ≈ indicates the floating point representation: the digit 1 in 1 − 10^20 is lost to rounding, and with it we lose information. Indeed

( 1      0 ) ( 10^−20  1      ) = ( 10^−20  1 )
( 10^20  1 ) ( 0       −10^20 )   ( 1       0 )

So the backwards error is order 1. When you perform the elimination with pivoting you do not have this problem. Further, we note that the matrix A is well-conditioned here. While Gaussian elimination with partial pivoting is indeed backward stable, the worst case can be quite bad. The standard example is

(  1                 1 )
( −1   1             1 )
( −1  −1   1         1 )
( −1  −1  −1   1     1 )
( −1  −1  −1  −1     1 )

If you row-reduce – even with pivoting – the growth factor is 2^m, which corresponds to a loss on the order of m bits (a friggin' disaster). However, in practice, this peculiarity is rarely (read: never) seen.
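The 2 × 2 example above is easy to reproduce in numpy (a hedged check; one elimination step done by hand):

import numpy as np

A = np.array([[1e-20, 1.0], [1.0, 1.0]])
m = A[1, 0] / A[0, 0]                # multiplier 10^20
U = A.copy()
U[1, :] -= m * U[0, :]               # U[1,1] = 1 - 10^20 rounds to -10^20
L = np.array([[1.0, 0.0], [m, 1.0]])
print(L @ U)                         # [[1e-20, 1], [1, 0]]: the 1 in A[2,2] is gone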

3.1.3 Work

On the k-th step, you have about (n − k)^2 multiplications and additions. This gives that there is O(n^3) work. More precisely, it's about (2/3)n^3 flops. For banded matrices, the work can be significantly decreased. For instance, a tri-diagonal matrix can be row-reduced in linear time.

3.1.4 Heuristics

Using the LU decomposition is fairly straightforward. If A = LU then solving Ax = b is equivalent to performing two triangular solves (stable and O(n^2)), Ly = b and Ux = y. If you have multiple right-hand sides, you only have to compute the LU decomposition once.

There is an interesting existence/uniqueness result for nonsingular matrices: A has an LU factorization if and only if each leading principal minor of A is nonsingular. Further, this factorization is unique.

In practice, Gaussian elimination with partial pivoting is quite stable.

3.2 Cholesky’s Method

3.2.1 Explanation

A Cholesky factorization is a decomposition of the form A = R∗R for R upper-triangular andRjj > 0. Every Hermitian positive definite matrix has a unique Cholesky factorization. Theexistence and uniqueness can be proven by construction. The basic step of the induction used is towrite

A = ( a11  w* ) = ( α    0 ) ( 1  0             ) ( α  w*/α )
    ( w    K  )   ( w/α  I ) ( 0  K − w w*/a11  ) ( 0  I    )

where α = √a11. This incrementally transforms the middle matrix into the identity.
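A minimal numpy sketch of this construction (my own illustration, real symmetric case):

import numpy as np

# Cholesky by peeling off one row/column at a time, as in the induction above.
def cholesky(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    R = np.zeros_like(A)
    for k in range(n):
        alpha = np.sqrt(A[k, k])
        R[k, k] = alpha
        R[k, k+1:] = A[k, k+1:] / alpha
        A[k+1:, k+1:] -= np.outer(R[k, k+1:], R[k, k+1:])  # K - w w*/a11
    return R

A = np.array([[4.0, 2.0], [2.0, 3.0]])
R = cholesky(A)
print(np.allclose(R.T @ R, A))  # True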

3.2.2 Stability

This algorithm is always stable. Intuitively, we have that ‖R‖2 = ‖R*‖2 = ‖A‖2^{1/2} by the SVD, so R can never grow.

3.2.3 Work

It is easy to see by the construction that there are O((n − k)^2) operations at each step, giving O(n^3) total work. More precisely, it is about (1/3)n^3 total flops. This is an improvement by a factor of 2 over Gaussian elimination.

3.2.4 Heuristics

In comparison to Gaussian elimination, the Cholesky factorization requires less work and is more stable. It is, however, limited to positive definite Hermitian matrices. Once obtained, the decomposition is used in the same manner as Gaussian elimination.


4 QR and SVD Factorizations and Least Squares

4.1 QR Decomposition by Modified Gram-Schmidt

4.1.1 Explanation

A QR decomposition allows you to write a matrix A as A = QR where Q is an orthogonal matrix and R is upper-triangular. To be brief, the regular Gram-Schmidt method is exactly what you think it is. You take the columns of your matrix and then Gram-Schmidt them (storing the inner products and norms as you go). At step j,

vj = vj − 〈q1, vj〉q1 − · · · − 〈qj−1, vj〉qj−1,    qj = vj/‖vj‖

That is, you update the current vector using the previous vectors. The modified Gram-Schmidt method, instead, normalizes the current vector and then uses it to modify the future vectors. This process is mathematically equivalent; however, it has numerical implications. At step j,

qj = vj/‖vj‖
vj+1 ← vj+1 − 〈qj, vj+1〉qj
...
vn ← vn − 〈qj, vn〉qj

Thus, at each step the imperfections of the previous steps are again orthogonalized with respect to qj. To see this, a fixed vector vj undergoes:

vj ↦ vj − 〈q1, vj〉q1 ↦ vj − 〈q1, vj〉q1 − 〈vj − 〈q1, vj〉q1, q2〉q2 ↦ · · ·

This is the key difference.
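A minimal numpy sketch of modified Gram-Schmidt (my own illustration):

import numpy as np

# Normalize the current column, then orthogonalize the *future* columns against it.
def mgs(A):
    V = A.astype(float).copy()
    m, n = V.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(V[:, j])
        Q[:, j] = V[:, j] / R[j, j]
        for k in range(j + 1, n):
            R[j, k] = Q[:, j] @ V[:, k]
            V[:, k] -= R[j, k] * Q[:, j]
    return Q, R

A = np.random.rand(5, 3)
Q, R = mgs(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))  # True True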

4.1.2 Work

For an m × n matrix, at step j you perform n − j inner products, each with O(m) flops, and n − j vector subtractions, again about O(m) each. Thus, summing over j you get O(mn^2) total work. More precisely, you have ≈ 2mn^2 flops.

4.1.3 Heuristics

Gram-Schmidt is a nice method because you obtain the orthonormal vectors sequentially.

If you perform the algorithm as described, you actually get a reduced QR factorization, i.e.

A = QR

where Q is an m × n matrix with orthonormal columns and R is an n × n upper-triangular matrix. You can obtain a full QR decomposition by extending the columns of Q to a full basis and padding R with zeros. However, I'm not sure how (or if) this is done in practice. Perhaps by taking random vectors and orthonormalizing them?

A way to summarize the process of Gram-Schmidt is “triangular orthogonalization”.

4.2 QR Decomposition by Householder Reflections

4.2.1 Explanation

The idea of using Householder reflections to obtain the QR factorization is to successively introduce zeros below the diagonal of each column by means of an orthogonal transformation. It is easy to see how the first step of the process can be repeated to get the full decomposition. Let x = (x1, . . . , xm)^T be the first column of your matrix A. We would like to reflect this vector so that it is zero in every entry below the first. Let the matrix of this reflection be Q1. Because reflections are unitary, we know that Q1 x = ±‖x‖e1. We see that v = ±‖x‖e1 + x is a suitable vector to reflect across (easy to see if you draw it). Because subtracting numbers of similar sizes is unstable, it is best to choose v = sign(x1)‖x‖e1 + x. Finally, Q1 is then given by

F = I − 2vv*/(v*v)

Repeating this process, we get Qn · · · Q1 A = R. So setting

Q = Q1* · · · Qn*

we have a QR factorization.
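A minimal numpy sketch (my own illustration) that stores only the reflection vectors and R, as suggested in the heuristics below:

import numpy as np

def householder_qr(A):
    R = A.astype(float).copy()
    m, n = R.shape
    vs = []
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.sign(x[0]) * np.linalg.norm(x)      # v = sign(x1)||x|| e1 + x
        v /= np.linalg.norm(v)
        vs.append(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])  # apply F = I - 2vv*/(v*v)
    return vs, np.triu(R)

A = np.random.rand(5, 3)
vs, R = householder_qr(A)
Q = np.eye(5)                        # rebuild Q = Q1 ... Qn only to check the result
for k in reversed(range(3)):
    Q[k:, :] -= 2.0 * np.outer(vs[k], vs[k] @ Q[k:, :])
print(np.allclose(Q @ R, A))         # True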

4.2.2 Stability

QR decomposition by Householder reflections is stable because multiplication by orthogonal matrices is stable.

4.2.3 Work

Let A be m × n. At step k in the algorithm, you compute the reflection vector v, which is about O(m − k) flops. Then you must apply the matrix F to the submatrix A_{k:m,k:n}. This is about O((m − k)(n − k)) = O(mn − (m + n)k + k^2) work. Summing in k we get about O(mn^2). A more careful analysis gives that it is in fact ≈ 2mn^2 − 2n^3/3 flops (a slight improvement over Gram-Schmidt).


4.2.4 Heuristics

We note that this method always gives a full QR factorization, in contrast with Gram-Schmidt. Further, if you stop the algorithm before the final step, you do not have any of the vectors in your basis, so you don't gain them sequentially like you do in Gram-Schmidt.

The algorithm should really only store the reflection vectors and R. Forming the explicit Q from these vectors involves a decent amount of work. Further, evaluating Q*b or Qx using the explicit Q is O(mn^2) work whereas you can multiply by the Qi or Qi* in the correct order to get these results in O(mn) work. We also note that Qi = Qi*. As an example, Q*b can be calculated by

for k = 1 to n
    b_k:m = b_k:m − 2 v_k (v_k* b_k:m)

4.3 SVD Overview

4.3.1 Explanation

Every matrix A ∈ C^{m×n} has an SVD (singular value decomposition). It consists of two orthogonal matrices U ∈ C^{m×m} and V ∈ C^{n×n} as well as a diagonal matrix Σ ∈ C^{m×n}. The singular values are those found on the diagonal of Σ. They are non-negative and decrease from top-left to bottom-right. The singular values are uniquely determined, and if they are distinct, then the right and left singular vectors corresponding to them are unique up to complex signs. We note that if A = UΣV*, we have

AV = UΣ

So Avi = σi ui. This leads to the geometric interpretation of the SVD. The image of the unit sphere under any matrix is a hyper-ellipse. The right singular vectors vi are the pre-images of the ellipse's axes found on the unit ball, the σi the lengths of those axes, and the left singular vectors ui the directions of those axes. The SVD has many applications, in particular to least squares problems as outlined in the next subsection.

4.3.2 Stability

We omit many of the details, but the process described in the sketch of the computation found below is stable.

4.3.3 Work

The reduction to bidiagonal form requires O(mn^2) work for an m × n matrix. The iterative step takes about O(n) iterations and each requires about O(n) work, resulting in O(n^2) total work. Thus, the reduction to bidiagonal form is often the more expensive step.


4.3.4 Computation Sketch

The matrix is reduced to bidiagonal form by applying Householder reflections on both right and left. This process can be sped up by first computing a QR factorization for part of the matrix (choosing when to perform this step to get the least amount of work is subtle). A variant of the QR algorithm (described in the eigenvalues section) is then used to find the SVD of the bidiagonal matrix. Really, you should never code up an SVD algorithm yourself, as the industry standard will likely destroy anything you come up with (the same philosophy applies for many numerical linear algebra algorithms).

4.4 Least Squares

Least-squares refers to minimizing some quantity with respect to the 2-norm.

4.4.1 Overdetermined System

A common least squares problem is to show that there is an optimal approximation to an overdetermined system of linear equations. As a numerical problem, you would compute the minimizer. The problem can be stated as follows. Let A ∈ C^{m×n} for m > n be full rank. Find the vector x ∈ C^n which minimizes

‖Ax − b‖2

The existence of a QR factorization and the fact that orthogonal matrices preserve the 2-norm greatly simplify this problem:

min_x ‖Ax − b‖ = min_x ‖QRx − b‖ = min_x ‖Rx − Q*b‖

As R is upper-triangular, we see that the entries of Q*b beyond the n-th entry will never be cancelled out. Further, as A was full rank, the triangular system corresponding to the first n entries can be solved exactly (and stably).
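In numpy, the whole recipe is a few lines (a hedged illustration):

import numpy as np

A = np.random.rand(8, 3); b = np.random.rand(8)
Q, R = np.linalg.qr(A)             # reduced QR
x = np.linalg.solve(R, Q.T @ b)    # solve the triangular system Rx = Q*b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True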

4.4.2 Underdetermined Systems

Another common least squares problem is to minimize the 2-norm of a vector that solves an underdetermined system. The problem can be stated as follows. Let A ∈ C^{m×n} for m < n be full rank. Find the vector x ∈ C^n which solves Ax = b and minimizes the 2-norm of x. This problem is greatly simplified by considering the SVD of A. Write A = UΣV*. Then

b = Ax
b = UΣV*x
U*b = Σy

where y = V*x ∈ C^n and ‖y‖ = ‖x‖. We note that the first m components of y are determined by the diagonal part of Σ and that, because A is full rank, these are sufficient to satisfy U*b = Σy. Setting the rest of the entries of y to zero, we obtain the vector of minimum 2-norm. Finally, x = V y.
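The same steps in numpy (a hedged illustration):

import numpy as np

A = np.random.rand(3, 6); b = np.random.rand(3)
U, s, Vh = np.linalg.svd(A)
y = np.zeros(6)
y[:3] = (U.T @ b) / s              # first m entries of y; the rest stay zero
x = Vh.T @ y                       # x = V y
print(np.allclose(A @ x, b), np.allclose(x, np.linalg.pinv(A) @ b))  # True True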

4.4.3 Example of a Least Squares Problem That’s Not Quite the Same

“Best plane approximation”

Consider n > 3 points (xi, yi, zi), 1 ≤ i ≤ n, in three-dimensional, real Euclidean space. The points lie close to a common plane. Determine a plane which goes through (x̄, ȳ, z̄), where x̄ = ∑_{i=1}^n xi/n, etc., which provides the best least squares fit to the data. Also, show how the normal to the plane can be obtained readily by using the singular value decomposition.

Solution:

We note that the equation of a plane going through (x̄, ȳ, z̄)^t is given by

a(x − x̄) + b(y − ȳ) + c(z − z̄) = 0

for n = (a, b, c)^t ≠ 0. We note that the scale of n is irrelevant as αn determines the same plane for any non-zero α. Thus, we normalize so ‖n‖ = 1. We let vi = (xi, yi, zi)^t − (x̄, ȳ, z̄)^t, for i = 1, . . . , n. Let A be the n × 3 matrix whose i-th row is vi. Then, the least squares fit is given by the n that produces the minimum below

min_{‖n‖=1} ‖An‖

Let A = UΣV* be the SVD decomposition of A. This gives

min_{‖n‖=1} ‖An‖ = min_{‖n‖=1} ‖UΣV*n‖ = min_{‖n‖=1} ‖ΣV*n‖

Let w = V*n. Then ‖w‖ = 1. We note that the rows of Σ are zero beyond the third row, so only the top 3 × 3 diagonal block of Σ matters. This gives

min_{‖n‖=1} ‖An‖ = min_{‖w‖=1} ‖Σw‖

Because the singular values are in decreasing magnitude down the diagonal, we see that the minimizing w is w = (0, 0, 1)^t. Thus, the minimizing n is given by n = V(0, 0, 1)^t, i.e., the third column of V.
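A numpy sketch of this recipe (my own illustration on synthetic near-planar points):

import numpy as np

pts = np.random.rand(20, 3)
pts[:, 2] = 0.5 * pts[:, 0] - pts[:, 1] + 0.01 * np.random.randn(20)
A = pts - pts.mean(axis=0)   # rows are v_i = (x_i, y_i, z_i) - (xbar, ybar, zbar)
_, _, Vh = np.linalg.svd(A)
n = Vh[-1]                   # third column of V: the fitted unit normal
print(n)                     # close to a multiple of (0.5, -1, -1)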

4.5 Best Low Rank Approximation

The SVD allows you to construct the best low-rank approximation (in the 2-norm) to a matrix A of any rank (less than that of the matrix itself). The best approximation to A ∈ C^{m×n} of rank k < dim range A is given by

B = (u1 · · · uk) Σ_{1:k,1:k} (v1 · · · vk)*

that is, B = σ1 u1 v1* + · · · + σk uk vk*. Also, you can use the SVD to see that this is indeed the best approximation.
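A truncated-SVD sketch in numpy (my own illustration); the 2-norm error of the best rank-k approximation is σ_{k+1}:

import numpy as np

def best_rank_k(A, k):
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vh[:k]

A = np.random.rand(6, 5)
B = best_rank_k(A, 2)
s = np.linalg.svd(A, compute_uv=False)
print(np.allclose(np.linalg.norm(A - B, 2), s[2]))  # error equals sigma_3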

5 Eigenvalue Algorithms

5.1 Why Iterative?

Any root finding problem can be recast as an eigenvalue problem, and vice versa. Thus, the following theorem from Galois theory provides some insight into eigenvalue problems.

For any m ≥ 5 there is a polynomial p(x) of degree m with rational coefficients and a root z such that z cannot be expressed in terms of the coefficients using addition, subtraction, multiplication, division, and radicals.

Thus, there cannot be a finite time algorithm (using the above operations) to find the eigenvalues of an arbitrary matrix.

Further, we note that the preferred method for calculating eigenvalues is by an eigenvalue revealing method instead of working with the characteristic polynomial. This is because certain polynomial root finding problems are extremely ill-conditioned with respect to the coefficients. The following example due to James Wilkinson,

w(x) = ∏_{i=1}^{20} (x − i),

illustrates this issue with conditioning. Expanded in terms of the monomials, we have w(x) = x^20 − 210x^19 + · · · . If we perturb the coefficient of x^19 by ε then the value of w(20) is changed by ε · 20^19. This is especially troubling because the eigenvalue problem is a well-conditioned one (for symmetric matrices; this can be seen by considering Gershgorin's circle theorem, see Dahlquist Section 5.8). With these issues in mind, the preferred approach is to transform the given matrix into an upper-triangular (Schur decomposition for any matrix) or diagonal (for normal and Hermitian matrices) form via orthogonal transformations.

5.2 Power Method

5.2.1 Idea

The power method is based on the idea that if you repeatedly apply a matrix to a vector, the direction of the eigenvector corresponding to the largest eigenvalue will dominate. We assume that we have a full basis vi of eigenvectors for our matrix A and that our starting vector v = ∑ αi vi has non-zero components in the direction of each eigenvector. Then

A^k v = λ1^k α1 v1 + · · · + λn^k αn vn

Suppose that λ1 is the strictly largest magnitude eigenvalue; then

A^k v / ‖A^k v‖ → v1
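A minimal sketch (my own illustration); normalizing at each step avoids overflow in A^k v:

import numpy as np

def power_method(A, iters=200):
    v = np.random.rand(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v, v @ A @ v          # eigenvector estimate and Rayleigh quotient

A = np.array([[2.0, 1.0], [1.0, 3.0]])
v, lam = power_method(A)
print(lam)                       # about (5 + sqrt(5))/2, the largest eigenvalue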

5.2.2 Pitfalls

The convergence can be quite slow. The error is dominated by (|λ2|/|λ1|)^k where λ2 is the second greatest eigenvalue in magnitude.

The method is also limited to finding the eigenvector corresponding to the largest magnitude eigenvalue.

5.2.3 Other Uses

The method is used to approximate the spectral norm of a matrix (which coincides with the induced 2-norm). In this case, you alternately apply A and A* to the vector.

5.3 The QR Algorithm

5.3.1 Description

The QR algorithm is an extremely simple one to describe, though it can appear a bit mysterious. We start with A^(0) = A. Then the later steps are determined by the following:

Q^(k) R^(k) = A^(k−1)    (this is obtained from a QR decomposition algorithm)
A^(k) = R^(k) Q^(k)      (simply multiply these together)
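The unshifted iteration is a few lines of numpy (my own illustration; practical versions tridiagonalize and shift, as described below):

import numpy as np

def qr_algorithm(A, iters=500):
    Ak = A.astype(float).copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q               # similar to A; drives A^(k) toward triangular form
    return np.diag(Ak)

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(np.sort(qr_algorithm(A)))  # about [1.382, 3.618]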


5.3.2 Understanding QR in Terms of Simultaneous Iteration

Simultaneous iteration can be understood as applying the power iteration method to multiple vectors and orthonormalizing the result each time:

Z = A Q̄^(k−1)        (simply multiply these together)
Q̄^(k) R^(k) = Z      (again from a QR decomposition algorithm)

If we let R̄^(k) = R^(k) · · · R^(1), then A^k = Q̄^(k) R̄^(k) and A^(k) = (Q̄^(k))^T A Q̄^(k), where A^(k) is the k-th iterate of the QR Algorithm. Further, Q̄^(k) = Q^(1) · · · Q^(k) where the Q^(i) are those from the QR Algorithm. In this way it is clear why the QR Algorithm produces eigenvalues.

5.3.3 Convergence

By comparison with the power method, we see that we get linear convergence to the eigenvectors that depends on the ratios of the eigenvalues. The convergence of the eigenvalues can be deduced by considering the Rayleigh quotients of the eigenvector approximations. In particular, we note that for an eigenvector v, the Rayleigh quotient r(v) = v^T A v / (v^T v) satisfies ∇r(v) = 0. This gives us that the first order terms are 0, so we have that the convergence of the eigenvalues is quadratic.

5.3.4 Shifted QR and the Inverse Iteration Idea

If we have a good guess as to an eigenvalue, we can perform inverse iteration, in which we utilize the fact that (A − µI)^−1 has the same eigenvectors as A and – in particular – an eigenvector with eigenvalue 1/(λ − µ). So if our guess is close, applying (A − µI)^−1 to a vector will strongly favor the eigenvector corresponding to eigenvalue λ ≈ µ.

To utilize this idea, we can choose a shift at each step of the QR Algorithm. In particular, we can perform

Q^(k) R^(k) = A^(k−1) − µ^(k) I    (obtained from a QR decomposition algorithm)
A^(k) = R^(k) Q^(k) + µ^(k) I

We still have that A^(k) = (Q̄^(k))^T A Q̄^(k). Thus, to get a good eigenvalue estimate, we can consider a Rayleigh quotient for an eigenvector approximation of A. We note that

A^(k)_mm = (q_m^(k))^T A q_m^(k)

So µ^(k) = A^(k)_mm is a decent choice of shift. However, there can be issues with symmetry of eigenvalues.

A = ( 0  1 )
    ( 1  0 )

The A above has eigenvalues ±1. The shift suggested above is 0, and with that shift, A is left unchanged by the QR algorithm. Wilkinson's shift instead uses the eigenvalue of A^(k)_{m−1:m, m−1:m} closest to A^(k)_mm (or either one in the case of a tie) for the shift. This breaks the symmetry problem. In either case, we get cubic convergence.

5.3.5 Actual Implementation

The actual implementations of the QR algorithm normally tridiagonalize the matrix first and then perform the algorithm with some sort of shift. Further, the algorithms often "deflate" the matrix when eigenvalues are found, a process I will not describe here.

5.4 Jacobi’s Method by Givens Rotations

5.4.1 Description

The idea of Jacobi's method is quite simple. It is to diagonalize a matrix by repeatedly annihilating off-diagonal entries. It applies only to symmetric matrices. We note that a symmetric 2 × 2 matrix can be diagonalized easily by a rotation. In particular, if we let

A = ( a  b )
    ( b  d )

then we can diagonalize using the rotation matrix

J = ( c   s )
    ( −s  c )

In particular, D = J^T A J if we let θ = (1/2) arctan[2b/(d − a)], c = cos θ, and s = sin θ. This process can then be used to zero out a specific off-diagonal entry of a matrix by using rotations of the form

    ( 1                          )
    (   . . .                    )
    (       c   . . .   s        )
J = (       .   1       .        )
    (       .     . . . .        )
    (       .         1 .        )
    (       −s  . . .   c        )
    (                     . . .  )
    (                         1  )

(the dots between the c's and s's are only there to show how they line up). The matrix above only affects the rows and columns that have c's and s's; call these rows i and j (the affected columns have the same indices). Then, we have that (J^T A J)_ij = 0. While this introduces a zero in the (i, j)-th entry, it can remove other zeros (note that this is clear, otherwise we'd have a finite step algorithm). However, the idea with the Jacobi algorithm is that the total of the off-diagonal elements goes down and that if you continuously sweep through the matrix and zero out elements, you will eventually converge.
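A small numpy sketch of a Jacobi sweep (my own illustration; it forms full rotation matrices for clarity, which a real implementation would not):

import numpy as np

def jacobi_eigenvalues(A, tol=1e-12):
    A = A.astype(float).copy()
    n = A.shape[0]
    while np.sum(A ** 2) - np.sum(np.diag(A) ** 2) > tol:  # off-diagonal mass
        for i in range(n):
            for j in range(i + 1, n):
                if A[i, j] == 0.0:
                    continue
                theta = 0.5 * np.arctan2(2 * A[i, j], A[j, j] - A[i, i])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[i, i] = c; J[i, j] = s
                J[j, i] = -s; J[j, j] = c
                A = J.T @ A @ J                            # zeros out A[i, j]
    return np.sort(np.diag(A))

A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]])
print(jacobi_eigenvalues(A))
print(np.sort(np.linalg.eigvalsh(A)))  # agrees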

5.4.2 Convergence

The convergence of the algorithm can be easily proved for the case where you annihilate the largest off-diagonal element each time. In this case the sum of the squares of the off-diagonal entries decreases by a factor of 1 − 2/(m² − m) at each step. The proof of this fact is below.

First we note that the Frobenius norm is invariant under unitary multiplication. Thus, for J a Jacobi rotation, we have

‖J^T A J‖_F = ‖A‖_F

Now, suppose A_{j,k} is the largest off-diagonal element (for k > j). Then

|A_{j,k}|² ≥ 2 ∑_{i>h} |A_{h,i}|² / (m(m − 1)) = S(A)/(m(m − 1))

where S(A) is the sum of the squares of the off-diagonal elements. The effect of the product J^T A J on the entries of A is easily understood for the submatrix

( A_{j,j}  A_{j,k} )
( A_{k,j}  A_{k,k} )

Indeed, this submatrix gets replaced with

( cos θ  −sin θ ) ( A_{j,j}  A_{j,k} ) ( cos θ   sin θ )  =  ( ≠0   0 )
( sin θ   cos θ ) ( A_{k,j}  A_{k,k} ) ( −sin θ  cos θ )     ( 0   ≠0 )

for appropriately chosen θ. As the 2 × 2 rotation matrix in the above is unitary, it preserves the Frobenius norm of this submatrix. In particular, the sum of the squares of the diagonal entries increases by 2|A_{j,k}|². As the other diagonal entries are unaffected by the Jacobi rotation, we have

∑_i (J^T A J)²_{i,i} ≥ ∑_i A²_{i,i} + 2S(A)/(m(m − 1))

Because the Frobenius norms of J^T A J and A are the same, we then have

S(J^T A J) = S(A) + ∑_i A²_{i,i} − ∑_i (J^T A J)²_{i,i} ≤ S(A) (1 − 2/(m(m − 1)))

5.4.3 Some Heuristics

Rather than searching for the largest off-diagonal entry at each step (which is costly), most algorithms simply sweep through the off-diagonal entries, eliminating them sequentially. In contrast with the QR algorithm, there is no benefit to first tri-diagonalizing the system. That is because the Jacobi rotations would introduce non-zeros outside of the sub, main, and super-diagonals, destroying that structure. Finally, the process can be parallelized, as it only affects and depends on the rows and columns corresponding to the element to be zeroed out.

5.5 Bisection for Tridiagonal Systems

5.5.1 Description

Bisection is quite distinct from the other methods, as it allows you to find the eigenvalues in a specific range. It applies to symmetric tridiagonal matrices. The key fact used is that the eigenvalues of the principal minors of such matrices interlace. That is, if λ^(k)_i are the strictly increasing eigenvalues of the k-th principal minor of A, then λ^(k+1)_i < λ^(k)_i < λ^(k+1)_{i+1}. If we have

A = ( a1  b1            )
    ( b1  a2  b2        )
    (     b2  a3  . . . )
    (         . . . . . )

then there is a simple recursion formula for the characteristic polynomials of the principal minors. We have p^(−1)(x) = 0, p^(0)(x) = 1, and

p^(k)(x) = (ak − x) p^(k−1)(x) − b²_{k−1} p^(k−2)(x)

We say p^(k)(x) → p^(k+1)(x) has a sign change if it changes sign or goes from zero to negative or positive (but not if it goes from positive or negative to zero). If we evaluate the above sequence for a fixed value of x and we count the sign changes, the total gives the number of eigenvalues in (−∞, x).
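A small sketch of the sign-change count (my own illustration; real codes work with ratios to avoid overflow in p^(k)):

import numpy as np

def count_below(a, b, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix in (-inf, x)."""
    pm2, pm1 = 0.0, 1.0              # p^(-1)(x) and p^(0)(x)
    sign_prev, changes = 1.0, 0
    for k in range(len(a)):
        off = b[k - 1] ** 2 if k > 0 else 0.0
        p = (a[k] - x) * pm1 - off * pm2
        pm2, pm1 = pm1, p
        if p != 0.0:                 # nonzero -> zero is not a sign change
            if p * sign_prev < 0:
                changes += 1
            sign_prev = p
    return changes

a = np.array([2.0, 2.0, 2.0]); b = np.array([1.0, 1.0])
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
print(count_below(a, b, 2.5))               # 2
print(np.sum(np.linalg.eigvalsh(T) < 2.5))  # 2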

5.5.2 Heuristics

While Wilkinson's polynomial suggests that bisection is an unstable approach, it does not actually experience such difficulties. This is because the method simply evaluates the characteristic polynomial without considering its coefficients and therefore avoids the issue of ill-conditioning.

Bisection is especially good for situations in which you need only a few eigenvalues. This is because it requires only O(m) flops for each evaluation of the sequence p^(k)(x).

5.6 Methods Described in other Parts of the Notes

The Arnoldi and Lanczos algorithms are related to eigenvalues and the details of these are given in the Iterative Methods for Linear Systems section.

6 Iterative Methods for Linear Systems

6.1 Classical Iterative Methods

6.1.1 Linear One-Step Stationary Schemes

We consider the case of solving Ax = b via a one-step stationary scheme, that is, a scheme in which you obtain iterates by

x_{k+1} = H x_k + v

We note that this scheme converges to a bounded limit x if and only if ρ(H) < 1 (a straightforward calculation). Further, if ρ(H) < 1 then x is the correct solution if and only if v = (I − H)A^{−1}b. We note that as of now the scheme seems impractical. To get v, it seems we have to calculate A^{−1}b. However, there are ways to get around this.

6.1.2 Regular Splitting

One can split A = P − N where P is non-singular and consider the scheme

P x_{k+1} = N x_k + b

We generally choose a P which is simpler to invert (e.g. has an easily computed LU factorization). A sufficient condition for convergence of this method is that both the matrices A and P^T + P − A are symmetric positive definite.

6.1.3 Jacobi, Gauss-Seidel, SOR

• Jacobi iteration is a regular splitting scheme in which P = D, the diagonal of A, and N = −(A − D).

• Gauss-Seidel iteration is a regular splitting scheme in which P = D + L, the diagonal and strictly lower triangular part of A, and N = −(A − D − L) = −U, the negative of the strictly upper triangular part of A.

• Successive over-relaxation (SOR) is a regular splitting scheme in which P = (1/ω)D + L and N = −(A − (1/ω)D − L) = ((1/ω) − 1)D − U for some ω ∈ [1, 2). We note that Gauss-Seidel is SOR with ω = 1. (A small sketch of these three splittings follows below.)
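A small numpy sketch of the three splittings (my own illustration; P is solved densely here, though it is diagonal or triangular in practice):

import numpy as np

def splitting_solve(A, b, method="jacobi", omega=1.5, iters=200):
    D = np.diag(np.diag(A)); L = np.tril(A, -1)
    if method == "jacobi":
        P = D
    elif method == "gauss-seidel":
        P = D + L
    else:                                  # SOR
        P = D / omega + L
    N = P - A                              # so that P x_{k+1} = N x_k + b
    x = np.zeros_like(b)
    for _ in range(iters):
        x = np.linalg.solve(P, N @ x + b)
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
print(splitting_solve(A, b, "gauss-seidel"))
print(np.linalg.solve(A, b))               # agrees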

6.1.4 As Linear Equations

Each of the above methods can be written as a system of linear equations. For l = 1, . . . , n we have for Jacobi iteration:

∑_{j=1}^{l−1} alj x^k_j + all x^{k+1}_l + ∑_{j=l+1}^{n} alj x^k_j = bl

For Gauss-Seidel:

∑_{j=1}^{l−1} alj x^{k+1}_j + all x^{k+1}_l + ∑_{j=l+1}^{n} alj x^k_j = bl

For SOR:

ω ∑_{j=1}^{l−1} alj x^{k+1}_j + all x^{k+1}_l + (ω − 1) all x^k_l + ω ∑_{j=l+1}^{n} alj x^k_j = ω bl

These equations highlight the differences between the methods. Gauss-Seidel uses all of the new entries of x^{k+1} that have been computed up to that point. This leads to obvious storage savings, as the new entries can be simply overwritten into the old vector. Further, if we rearrange, we have

x^{k+1}_l = (1/all) [ bl − ∑_{j=1}^{l−1} alj x^{k+1}_j − ∑_{j=l+1}^{n} alj x^k_j ]

for Gauss-Seidel and

x^{k+1}_l = (1 − ω) x^k_l + (ω/all) [ bl − ∑_{j=1}^{l−1} alj x^{k+1}_j − ∑_{j=l+1}^{n} alj x^k_j ]

for SOR. If we compare these, we note that Gauss-Seidel exactly solves the linear equation for x^{k+1}_l, while SOR actually moves further in the direction of the solution (past what is optimal if we are only changing x^{k+1}_l). The convergence of SOR is often faster than Gauss-Seidel and depends on the value of ω. The optimal value of ω can be chosen, but the methods for doing so are left out of these notes.

6.1.5 A Different Viewpoint

If we have a minimization problem

min_{x∈R^n} f(x1, . . . , xn)

the Gauss-Seidel minimization process would be to cycle through the coordinates of x and then minimize f with respect to each coordinate, holding the other coordinates constant. After each iteration (and indeed after each step of each iteration), the value of f improves or stays the same. In this way, we see that the Gauss-Seidel iteration will always converge to a local minimum.

6.1.6 Some Heuristics

Gauss-Seidel converges when the matrix is diagonally-dominant. We note that Gauss-Seidel tends to smooth out high frequency errors (large eigenvalue errors) quickly; this is the basis for the multigrid method (see Iserles for a reference).

6.2 Steepest Descent

For the method of steepest descent, we assume A is symmetric positive definite and recast the problem Ax = b as a minimization problem. Consider the quadratic form

f(x) = (1/2) x^T A x − b^T x

We note that if x = A^{−1}b, then for y ≠ x, we have

f(y) = (1/2) y^T A y − b^T y
     = (1/2) y^T A y − y^T A x
     = (1/2)(y − x)^T A (y − x) − (1/2) x^T A x
     = (1/2)(y − x)^T A (y − x) + f(x)
     > f(x)

Thus, the solution of Ax = b is the minimizer of f(x). The method of steepest descent attempts to solve this minimization problem iteratively, moving from one approximation to the next in the opposite direction of the gradient of f (which points in the direction of greatest increase). The gradient of f is given by

f′(x) = (1/2) A^T x + (1/2) A x − b = A x − b

Thus, if we move opposite the gradient (in the direction of "steepest descent") we move in the direction of the residual b − Ax. The method then calls for us to minimize f in this direction. We have

x_{i+1} = x_i + α(b − A x_i) = x_i + α r_i

Thus, we have

f(x_{i+1}) = (1/2)(x_i + α r_i)^T A (x_i + α r_i) − b^T (x_i + α r_i)
           = f(x_i) + α x_i^T A r_i + (1/2) α² r_i^T A r_i − α b^T r_i

(d/dα) f(x_{i+1}) = x_i^T A r_i + α r_i^T A r_i − b^T r_i
0 = −r_i^T r_i + α r_i^T A r_i
α = r_i^T r_i / (r_i^T A r_i)

The above choice for α gives the minimizer. We note that we can use the chain rule to get

0 = (d/dα) f(x_{i+1}) = f′(x_{i+1})^T (d/dα) x_{i+1} = f′(x_{i+1})^T r_i

so we see that the new residual r_{i+1} = −f′(x_{i+1}) is orthogonal to the previous residual r_i. This fact limits the method of steepest descent. In particular, for 2 dimensions, we will switch between two search directions. So if f has elliptical cross sections, the method of steepest descent will generally zig-zag towards the solution, which could take many iterations to converge. The algorithm goes as follows:

r_i = b − A x_i
α_i = r_i^T r_i / (r_i^T A r_i)
x_{i+1} = x_i + α_i r_i

We note that after the initial iteration, the matrix-vector multiplication in the first step can be avoided because we have

x_{i+1} = x_i + α_i r_i
r_{i+1} = r_i − α_i A r_i

and we have already computed A r_i. We point out that updating in this fashion can cause a build-up of round-off error over time. One possible solution is to occasionally update r_i using x_i. At any rate, the work of the steepest descent method is dominated by the matrix-vector multiplication A r_i in the second step. The error analysis of this method can be found in Shewchuk's CG paper (see references). The basic idea is to write the initial error in terms of the eigenvectors of A and then see what the method does to the error over time in terms of the norm ‖e‖²_A = e^T A e. The convergence goes like

‖e_i‖_A ≤ ((κ − 1)/(κ + 1))^i ‖e_0‖_A

where κ is the condition number of A. An interesting feature of the method of steepest descent and CG is that from one step to the next, the error in a certain eigenvector direction may actually increase. This is in contrast with methods such as Gauss-Seidel iteration which decrease each component (called smoothers). This is why steepest descent and CG are sometimes referred to as "roughers".
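A minimal numpy sketch of the method (my own illustration), using the residual recurrence so there is one matrix-vector product per step:

import numpy as np

def steepest_descent(A, b, iters=100):
    x = np.zeros_like(b)
    r = b - A @ x
    for _ in range(iters):
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)
        x = x + alpha * r
        r = r - alpha * Ar           # r_{i+1} = r_i - alpha_i A r_i
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, 1.0])
print(steepest_descent(A, b))
print(np.linalg.solve(A, b))         # agrees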

6.3 Krylov Subspaces

If we are working with a matrix A, then the Krylov subspaces generated by a vector b are given by Kn = span〈b, Ab, . . . , A^{n−1}b〉. The Arnoldi and Lanczos iteration methods come up with increasing orthonormal bases Qn for Kn which satisfy

A Qn = Q_{n+1} Hn    (Arnoldi)        A Qn = Q_{n+1} Tn    (Lanczos)

where Hn is upper-Hessenberg for Arnoldi and Tn is tridiagonal for Lanczos. Two of their applications are to finding eigenvalues and iteratively solving the system Ax = b. The following statements are much more heuristic than they are precise.


• Finding eigenvalues. Let Pn be the space of monic polynomials of degree n. Then Pn(A)b ⊂ Kn+1. Let pn ∈ Pn be the minimizing polynomial for ‖pn(A)b‖. We note that if pn has zeros near the eigenvalues of A, then this value will likely be small. To see this, consider the case of a diagonal matrix A ∈ C^{m×m} which has n ≪ m non-zero eigenvalues. Then the minimal polynomial p* of A gives p*(A) ≡ 0, so pn = p* would have zeros given by the eigenvalues of A. Thus, the minimizing polynomial pn ∈ Pn has zeros which approximate the eigenvalues of A. Interestingly, this minimizing polynomial is given by the characteristic polynomial of Hn or Tn. Then, to approximate the eigenvalues of A one can run the Arnoldi or Lanczos algorithm for n steps and then send Hn or Tn to an eigenvalue algorithm. These approximations (sometimes called Ritz values) tend to find eigenvalues located at the extremes of the spectrum of A, and when it finds them it converges linearly (geometrically).

• Solving Ax = b. Krylov subspaces are at the heart of the GMRES and Conjugate Gradient algorithms described below.

6.4 Arnoldi Iteration

Let A be a general square matrix. As mentioned above, the Arnoldi iteration finds increasing orthonormal bases for Kn which satisfy

A Qn = Q_{n+1} Hn

for Hn ∈ C^{(n+1)×n} upper-Hessenberg. This can be viewed in a different light. Suppose we wished to find a similarity transformation giving

A = QHQ*

for H upper-Hessenberg. If we used a process like Gram-Schmidt to find the vectors of Q successively, we would end up with the formula for Qn and Hn above, where Qn was the first n columns of Q and Hn was the upper (n + 1) × n block of H. This Gram-Schmidt like process is the Arnoldi iteration. Using the relation for Qn and Hn above, we see that

A qn = h_{1,n} q1 + · · · + h_{n,n} qn + h_{n+1,n} q_{n+1}

Many of the coefficients h_{j,n} can be found by taking the inner product of both sides with qj. For j ≤ n, we have

h_{j,n} = 〈qj, A qn〉

Then, q_{n+1} is determined and given by the normalization of

w_{n+1} = A qn − (h_{1,n} q1 + · · · + h_{n,n} qn)

and h_{n+1,n} = ‖w_{n+1}‖. In pseudocode, we have


b = arbitrary, q_1 = b/|b|
for n = 1, 2, ...
    v = Aq_n
    for j = 1 to n
        h_jn = <q_j, v>
        v = v - h_jn q_j
    end
    h_n+1,n = |v|
    q_n+1 = v/|v|

We note that the above is a variant of modified Gram-Schmidt. Another popular way to make the process stable is by taking the inner products again to ensure orthogonality (called double Gram-Schmidt). In pseudocode

b = arbitrary, q_1 = b/|b|
for n = 1, 2, ...
    w = Aq_n
    for j = 1 to n
        f_jn = <q_j, Aq_n>
        w = w - f_jn q_j
    end
    v = w
    for j = 1 to n
        g_jn = <q_j, w>
        v = v - g_jn q_j
        h_jn = f_jn + g_jn
    end
    h_n+1,n = |v|
    q_n+1 = v/|v|

We note that there are subtle differences between the above and the original algorithm. The original is akin to a single round of modified Gram-Schmidt whereas this algorithm is like two rounds of classical Gram-Schmidt. I'm not entirely sure of the benefits of choosing one over the other, though the second algorithm appears to be potentially parallelizable.

The work at each iteration of either algorithm is dominated by the matrix-vector multiplication Aq_n. This step can be sped up using a black-box matrix multiplication process such as a sparse matrix class (obviously for the case of sparse matrices), a Fast Multipole Method, a compressed form of the matrix, etc.
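The first pseudocode variant in numpy (my own illustration; it assumes no breakdown, i.e. h_{n+1,n} ≠ 0):

import numpy as np

def arnoldi(A, b, n):
    m = A.shape[0]
    Q = np.zeros((m, n + 1)); H = np.zeros((n + 1, n))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(n):
        v = A @ Q[:, k]
        for j in range(k + 1):
            H[j, k] = Q[:, j] @ v
            v -= H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        Q[:, k + 1] = v / H[k + 1, k]
    return Q, H

A = np.random.rand(6, 6); b = np.random.rand(6)
Q, H = arnoldi(A, b, 3)
print(np.allclose(A @ Q[:, :3], Q @ H))  # A Q_n = Q_{n+1} H_n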


6.5 GMRES

6.5.1 Description

The idea of the algorithm is to choose an x ∈ Kn which minimizes ‖b − Ax‖. If Qn are the basis matrices gained from the Arnoldi iteration, then x = Qn y ∈ Kn for some y ∈ C^n. This gives that

b − Ax = ‖b‖ q1 − A Qn y = Q_{n+1}(e − Hn y)

where e = (‖b‖, 0, . . . , 0)^T ∈ C^{n+1}. Because the columns of Q_{n+1} are orthonormal, we see that minimizing ‖b − Ax‖ in x is equivalent to minimizing ‖e − Hn y‖ in y. The advantage is that this minimization problem is in a much lower dimension for n ≪ m. We are then left with a least squares problem to find y. This can be accomplished by a QR decomposition. Naïvely, if we recompute the QR decomposition of Hn anew at each step, this is about O(n^3) work per step. However, the Hn are closely related and the new QR decomposition can be obtained from the previous one in a more efficient manner (via Givens rotations; see for reference the Wikipedia page on GMRES). Further, even the work of the back-substitution step for the least squares (finding R^{−1}) can be shortened by virtue of the relationship between the Hn.
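A rough GMRES sketch (my own illustration): it repeats the Arnoldi loop inline and solves the small least squares problem densely instead of with incremental Givens rotations.

import numpy as np

def gmres(A, b, n):
    m = A.shape[0]
    Q = np.zeros((m, n + 1)); H = np.zeros((n + 1, n))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(n):                        # Arnoldi, as in the previous section
        v = A @ Q[:, k]
        for j in range(k + 1):
            H[j, k] = Q[:, j] @ v
            v -= H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        Q[:, k + 1] = v / H[k + 1, k]         # no breakdown handling in this sketch
    e = np.zeros(n + 1); e[0] = np.linalg.norm(b)
    y = np.linalg.lstsq(H, e, rcond=None)[0]  # min over y of ||e - H_n y||
    return Q[:, :n] @ y                       # x = Q_n y

A = np.eye(8) + 0.1 * np.random.rand(8, 8); b = np.random.rand(8)
print(np.linalg.norm(A @ gmres(A, b, 8) - b))  # small residual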

6.5.2 Convergence

Why is it a good idea to choose xn from the Krylov subspace Kn? As seen above, this reduces the dimensionality of the problem. We will consider a different characterization of the problem to see how good of an approximation we get for x. We note that the residual after n iterations is

rn = b − A xn = (I − A p_{n−1}(A)) b = fn(A) b

where fn is the degree n polynomial fn(x) = 1 − x p_{n−1}(x) and p_{n−1} is the polynomial which minimizes ‖fn(A)b‖ (this is what GMRES does). Thus, fn is the minimizing polynomial of ‖fn(A)b‖ with fn(0) = 1. Thus, we have

‖rn‖/‖b‖ ≤ inf_{fn∈Fn} ‖fn(A)‖

where Fn = {fn : fn(0) = 1, deg(fn) ≤ n} are polynomials. Suppose A is diagonalizable with A = V Λ V^{−1}. Then

‖rn‖/‖b‖ ≤ inf_{fn∈Fn} ‖fn(A)‖ ≤ κ(V) inf_{fn∈Fn} [ sup_{x∈σ(A)} |fn(x)| ]

This gives an idea of when we can expect fast convergence. If κ(V) is not too large (i.e. A is not too far from being normal) and if polynomials which are 1 at the origin can be well bounded on the spectrum of A, then we could expect good convergence. For instance, the latter condition is satisfied for eigenvalues which are clustered away from the origin. We note that this is a bound on the value of ‖rn‖ rather than ‖en‖ where en = x − xn. Thus, it represents a kind of backward accuracy (as opposed to forward).

We note that if the method is able to continue and produces a full basis for C^m, it is clear that the method converges after m steps (for A ∈ C^{m×m}). If we are unable to produce a new vector for the basis at step n (but were able at step n − 1), i.e., if

A qn − (〈q1, A qn〉 q1 + · · · + 〈qn, A qn〉 qn) = 0

then we see that A qn ∈ span〈q1, . . . , qn〉 = span〈b, . . . , A^{n−1}b〉. Thus, there is a non-trivial combination

c0 b + c1 A b + · · · + cn A^n b = 0

We see that c0 ≠ 0, otherwise we have c1 b + · · · + cn A^{n−1}b = 0 (if A is invertible) and the iteration would have stopped a step earlier. Thus, if we choose xn = −(c1/c0) b − · · · − (cn/c0) A^{n−1}b ∈ Kn then A xn = b and we must have found a solution at step n. Thus, we see that we always get a solution by the m-th step (of course we hope to get a solution earlier than that).

6.6 Lanczos Iteration

The Lanczos iteration is a specialization of the Arnoldi iteration to the case where A is Hermitian (we will assume, further, that A is real symmetric). We would like to find increasing orthonormal bases for Kn which satisfy

A Qn = Q_{n+1} Tn

for Tn ∈ C^{(n+1)×n} tridiagonal. This can be viewed in a different light. Suppose we wished to find a similarity transformation giving

A = Q T Q^T

for T tridiagonal. To see why this matrix should be tridiagonal, we note that A qn ∈ span〈q1, . . . , q_{n+1}〉. Further, we have

Tij = qi^T A qj

so Tij = 0 for i > j + 1. This gives that T is necessarily upper-Hessenberg. Because A is symmetric, we have T^T = (Q^T A Q)^T = Q^T A^T (Q^T)^T = Q^T A Q = T, so T is symmetric and thus tridiagonal. If we used a process like Gram-Schmidt to find the vectors of Q successively, we would end up with the formula for Qn and Tn above, where Qn was the first n columns of Q and Tn was the upper (n + 1) × n block of T. This Gram-Schmidt like process is the Lanczos iteration. Write Tn as

Tn = ( α1  β1                       )
     ( β1  α2  β2                   )
     (     β2  α3  . . .            )
     (         . . .  . . .  β_{n−1} )
     (              β_{n−1}  αn     )
     (                       βn     )

Using the relation for Qn and Tn above, we see that

A qn = β_{n−1} q_{n−1} + αn qn + βn q_{n+1}

As before, αn can be found by taking the inner product of both sides with qn, and βn is simply found to normalize q_{n+1}. Thus, the algorithm is greatly simplified in this case. We have

beta_0 = 0; q_0 = 0; b given;
q_1 = b/|b|;
for n = 1, 2, ...
    v = A q_n;
    alpha_n = (q_n, v);
    v = v - beta_n-1 q_n-1 - alpha_n q_n;
    beta_n = |v|;
    q_n+1 = v/beta_n;
endfor

It can be seen from the algorithm above that we only need the last two iterates to compute the next. Thus, the phrase “three-term recurrence” is often used for the Lanczos iteration.
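For concreteness, here is a minimal NumPy sketch of the iteration above (the function name and interface are ours, not from a library). In floating point the computed q_j gradually lose orthogonality, so practical codes reorthogonalize; this sketch omits that.

import numpy as np

def lanczos(A, b, k):
    # k steps of Lanczos on symmetric A with starting vector b; returns
    # Q with orthonormal columns and the tridiagonal entries alpha, beta
    m = len(b)
    Q = np.zeros((m, k + 1))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    Q[:, 0] = b / np.linalg.norm(b)
    for n in range(k):
        v = A @ Q[:, n]
        alpha[n] = Q[:, n] @ v
        v = v - alpha[n] * Q[:, n]
        if n > 0:
            v = v - beta[n - 1] * Q[:, n - 1]
        beta[n] = np.linalg.norm(v)
        Q[:, n + 1] = v / beta[n]
    return Q, alpha, beta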

When the Lanczos iteration is used in a GMRES-like algorithm, we have MINRES.

6.7 Conjugate Gradient

Conjugate Gradient (CG) is an iterative method for solving the system Ax = b when A is symmetric positive definite.

6.7.1 As a Modification of the Steepest Descent Idea

• Conjugate Directions. Again, we let

f(x) = (1/2) x^T A x − x^T b

We note that in the method of steepest descent, we often take many steps in the same direction. Wouldn't it be nice to get the correct step length for each direction the first time we move in that direction? (Yes, it would.) If we have orthogonal search directions d_0, . . . , d_{n−1}, then to ensure we've gone the correct distance, we need that the error e_{i+1} = x − x_{i+1} for x_{i+1} = x_i + α_i d_i should be perpendicular to the search direction d_i:

d_i^T e_{i+1} = 0

d_i^T (e_i − α_i d_i) = 0

α_i = d_i^T e_i / (d_i^T d_i)

However, we cannot compute the values α_i because we do not know the error e_i, as we don't know x. Instead, we require that the search directions be A-orthogonal and that e_{i+1} be A-orthogonal to d_i. This is equivalent to finding the minimum of f with respect to α. We have

(d/dα) f(x_{i+1}) = 0

f′(x_{i+1})^T (d/dα) x_{i+1} = 0

−r_{i+1}^T d_i = 0

d_i^T A e_{i+1} = 0

Further, with this choice we have the following equation for α_i,

α_i = d_i^T r_i / (d_i^T A d_i)

which can be evaluated. If the search direction were the residual, this would be exactly the method of steepest descent.

To see that this procedure computes x in n steps, we note that the vectors d_i form a basis for the whole space and thus we can write

e_0 = Σ_{j=0}^{n−1} δ_j d_j


if we take the A-inner-product of each side of the equation with d_k we get

d_k^T A e_0 = Σ_{j=0}^{n−1} δ_j d_k^T A d_j

δ_k = d_k^T A e_0 / (d_k^T A d_k)
    = d_k^T A (e_k + Σ_{i=0}^{k−1} α_i d_i) / (d_k^T A d_k)
    = d_k^T A e_k / (d_k^T A d_k)
    = α_k

where we used e_k = e_0 − Σ_{i=0}^{k−1} α_i d_i, the A-orthogonality of the d_i, and, in the last line, r_k = A e_k together with the formula for α_k.

Thus, we see that we cut down the error at each step. In particular

e_k = Σ_{j=k}^{n−1} δ_j d_j

If we take the A-inner-product of either side of this equation with d_i for i < k we get

d_i^T A e_k = d_i^T r_k = 0

We note that, as with the method of steepest descent, the number of matrix-vector multiplications can be reduced to one by noting that

r_{i+1} = A e_{i+1} = r_i − α_i A d_i

and that A d_i has already been computed.

• Optimality. How good is the error term? If D_i = span⟨d_0, . . . , d_{i−1}⟩, suppose we'd like to minimize the value of e_i, taken from e_0 + D_i, with respect to the A-norm. We have

‖e_i‖_A² = Σ_{j=i}^{n−1} Σ_{k=i}^{n−1} δ_j δ_k d_j^T A d_k = Σ_{j=i}^{n−1} δ_j² d_j^T A d_j


Thus, with our choice of α_i, the error we obtain involves only search directions which are not yet available (not in D_i). This implies that the error for x (the forward error) is minimal in this norm. We note that this is in contrast to GMRES and MINRES, in which we minimize the Euclidean norm of the residual over the spaces K_i.

• Finding the Search Directions. The search directions could be found by starting with an arbitrary basis {u_j} of the space and using a conjugated form of Gram-Schmidt on that basis. However, this process involves quite a bit of work. If we instead choose the particular basis made of the residuals, u_i = r_i, there are a number of advantages. As shown above, the residual is orthogonal to the previous search directions. Thus, unless the residual is zero, we can get a new search direction (if the residual is zero, we don't need a new search direction). Choosing the r_i as the basis vectors has a further advantage. First, we note that if D_i = span⟨d_0, . . . , d_{i−1}⟩ and R_i = span⟨r_0, . . . , r_{i−1}⟩ then we necessarily have that D_i = R_i. Further, by the equation

r_{i+1} = r_i − α_i A d_i

we have that

D_i = span⟨d_0, A d_0, . . . , A^{i−1} d_0⟩ = span⟨r_0, A r_0, . . . , A^{i−1} r_0⟩

Thus, D_i is a Krylov subspace generated by r_0. Because A D_i ⊂ D_{i+1} and r_{i+1} ⊥ D_{i+1}, we have that r_{i+1} is A-orthogonal to D_i, i.e., r_{i+1}^T A d_j = 0 for j ≤ i − 1. Thus, the Gram-Schmidt conjugation process is greatly simplified because r_{i+1} is already A-orthogonal to the previous directions other than d_i. To highlight this simplification, we write out what happens for the Gram-Schmidt conjugation process

d_i = r_i + Σ_{k=0}^{i−1} β_{i,k} d_k

d_j^T A d_i = d_j^T A r_i + Σ_{k=0}^{i−1} β_{i,k} d_j^T A d_k

0 = d_j^T A r_i + β_{i,j} d_j^T A d_j      (for j < i)

β_{i,j} = − d_j^T A r_i / (d_j^T A d_j)

which gives

β_{i,j} = − d_{i−1}^T A r_i / (d_{i−1}^T A d_{i−1})   for j = i − 1
β_{i,j} = 0                                           for j < i − 1


The matrix-vector multiplications in the top and bottom of this fraction can be removed by observing that A d_{i−1} = −(r_i − r_{i−1})/α_{i−1} and that r_i^T r_j = 0 for i ≠ j. This gives

β_{i,j} = (1/α_{i−1}) r_i^T r_i / (d_{i−1}^T A d_{i−1})   for j = i − 1
β_{i,j} = 0                                               for j < i − 1

Finally, we use the definition of α_{i−1} to get

β_{i,j} = r_i^T r_i / (r_{i−1}^T r_{i−1})   for j = i − 1
β_{i,j} = 0                                 for j < i − 1

• The CG Algorithm. Now, we write β_i = β_{i,i−1}. We then have the following algorithm (in pseudocode)

// initial direction
d_0 = r_0 = b - A x_0;

for i = 0, 1, 2, ...
    // find the minimizing step in the direction d_i
    alpha_i = r_i^T r_i / (d_i^T A d_i);

    // take that step
    x_i+1 = x_i + alpha_i d_i;

    // update the residual recursively (note that this can be unstable)
    r_i+1 = r_i - alpha_i A d_i;

    // find the new direction
    beta_i+1 = r_i+1^T r_i+1 / (r_i^T r_i);
    d_i+1 = r_i+1 + beta_i+1 d_i;
endfor

6.7.2 Krylov Subspace Characterization (Minimizing Polynomial)

To analyze the convergence of the conjugate gradient method, we will outline the relation between Krylov subspaces and minimizing polynomials. We have

D_i = span⟨r_0, A r_0, . . . , A^{i−1} r_0⟩ = span⟨A e_0, A² e_0, . . . , A^i e_0⟩

As noted in the section on the optimality of CG, the method chooses x_i which minimizes e_i in the A-norm over the set e_0 + D_i, i.e., the method chooses the polynomial p(z) with p(0) = 1 which minimizes ‖p(A) e_0‖_A. Call the degree-i polynomial obtained at the i-th step P_i. We write e_i = P_i(A) e_0. Write

e_0 = Σ_{j=1}^{n} ξ_j v_j

where the v_j are a set of orthonormal eigenvectors of A with eigenvalues λ_j > 0. We have

e_i = Σ_j ξ_j P_i(λ_j) v_j

A e_i = Σ_j ξ_j P_i(λ_j) λ_j v_j

e_i^T A e_i = Σ_j ξ_j² (P_i(λ_j))² λ_j

‖e_i‖_A² ≤ min_{P_i} max_{λ ∈ σ(A)} (P_i(λ))² Σ_j ξ_j² λ_j

‖e_i‖_A² ≤ min_{P_i} max_{λ ∈ σ(A)} (P_i(λ))² ‖e_0‖_A²

We note that the P_i must satisfy P_i(0) = 1. A polynomial of degree i can be chosen to fit i + 1 points, so we can satisfy P_i(0) = 1 and P_i(λ) = 0 at i eigenvalues. This is another way to see that CG converges in n steps. This also suggests that CG is quicker when eigenvalues are duplicated.

6.7.3 Convergence

Following the ideas above, we can gain a bound for the error e_i in the A-norm if we come up with a specific example of a degree-i polynomial which satisfies P_i(0) = 1. Instead of trying to fit P_i to be exactly zero at the eigenvalues, we choose P_i to be small on the range of the eigenvalues of A, i.e., on [λ_min, λ_max]. To accomplish this, we consider the Chebyshev polynomials T_i. They satisfy two nice properties. On [−1, 1] we have that |T_i(x)| ≤ 1. Further, outside of [−1, 1] the value of |T_i(x)| is maximal among all polynomials which satisfy the first property. We then set

P_i(x) = T_i( (λ_max + λ_min − 2x) / (λ_max − λ_min) ) / T_i( (λ_max + λ_min) / (λ_max − λ_min) )

The scaling in the numerator ensures that the numerator takes values in [−1, 1] for x ∈ [λ_min, λ_max]. The value chosen for the denominator ensures that P_i(0) = 1. We then have that

‖e_i‖_A ≤ max_{λ ∈ σ(A)} |P_i(λ)| ‖e_0‖_A ≤ [ T_i( (λ_max + λ_min)/(λ_max − λ_min) ) ]^{−1} ‖e_0‖_A = [ T_i( (κ + 1)/(κ − 1) ) ]^{−1} ‖e_0‖_A

where κ = λ_max/λ_min is the condition number of A.

We then employ the following explicit formula for T_i

T_i(x) = (1/2) [ (x + √(x² − 1))^i + (x − √(x² − 1))^i ]

This gives

‖e_i‖_A ≤ 2 [ ( (√κ + 1)/(√κ − 1) )^i + ( (√κ − 1)/(√κ + 1) )^i ]^{−1} ‖e_0‖_A ≤ 2 ( (√κ − 1)/(√κ + 1) )^i ‖e_0‖_A

where the last line is the most common error estimate used for CG. This gives the following estimate on the number of iterations required to reduce the A-norm of the error by a factor of ε. We need about

i ≈ (√κ / 2) ln(2/ε)

6.8 Preconditioning

Preconditioning of linear systems is a rather large subject, so we only summarize it briefly here. The idea of preconditioning is to approximate A^{−1} by the inverse of a matrix that is easier to invert. For instance, the Jacobi preconditioner takes the diagonal part of A, call it D, and inverts this. We get the resulting equivalent system (as long as D is invertible)

D^{−1} A x = D^{−1} b

The idea is that D^{−1} A has a lower condition number and thus will behave better in the iterative methods described above. The Jacobi preconditioner is not particularly sophisticated but is very easy to compute. The main goal of any preconditioner is to balance the effectiveness of the preconditioner with the difficulty of computing it. For instance, taking A itself is a perfect preconditioner, but we haven't accomplished anything in that we must be able to compute (or apply) the inverse of A (which was the goal to begin with).
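A small sketch of the idea (the test matrix here is our own invented example): for an SPD matrix with a wildly varying diagonal, scaling by D^{−1} can reduce the condition number substantially.

import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.diag(rng.uniform(1.0, 100.0, n))  # SPD, rough diagonal
b = rng.standard_normal(n)

d = np.diag(A)               # D = diagonal part of A
B = A / d[:, None]           # D^{-1} A: row i is scaled by 1/d_i
c = b / d                    # D^{-1} b

print(np.linalg.cond(A), np.linalg.cond(B))  # compare condition numbers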

7 Interpolation by Polynomials

7.1 Existence

If we seek a polynomial p_n(x) = a_0 + a_1 x + · · · + a_n x^n taking prescribed values y_i at distinct points x_i, i = 0, . . . , n, we can find a_i such that p_n(x_i) = y_i. One can write this as a Vandermonde matrix problem, but we can also observe that p_n(x) = Σ y_i l_i(x) is a solution to the problem, where

l_i(x) = Π_{k≠i} (x − x_k) / Π_{k≠i} (x_i − x_k)

These are called the Lagrange polynomials. They directly show that the Vandermonde matrix is invertible for distinct x_i.
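A direct (if O(n²)-per-evaluation) sketch of the Lagrange form, with names of our choosing:

def lagrange_eval(x_nodes, y_vals, x):
    # evaluate p(x) = sum_i y_i l_i(x) directly from the definition
    p = 0.0
    for i, xi in enumerate(x_nodes):
        li = 1.0
        for k, xk in enumerate(x_nodes):
            if k != i:
                li *= (x - xk) / (xi - xk)  # l_i is 1 at x_i, 0 at the others
        p += y_vals[i] * li
    return p

print(lagrange_eval([0.0, 1.0, 2.0], [1.0, 2.0, 5.0], 1.5))  # fits 1 + x^2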

7.2 Divided Differences

To find the coefficients one could try to invert the Vandermonde matrix, which takes O(n³) time and seems too slow. Solving for the coefficients directly is okay, but requires that you start over if you add a new point. Polynomials can always be rewritten for a different center

p_n(x) = a′_0 + a′_1 (x − c) + · · · + a′_n (x − c)^n

A useful observation is that one can write the polynomial expanded about several different centers, called the Newton form

p_n(x) = b_0 + b_1 (x − c_1) + b_2 (x − c_1)(x − c_2) + · · · + b_n (x − c_1) · · · (x − c_n)
       = b_0 + (x − c_1)(b_1 + b_2 (x − c_2) + · · ·)
       = b_0 + (x − c_1)(b_1 + (x − c_2)(b_2 + b_3 (x − c_3) + · · ·))

where the last two lines suggest an efficient way to evaluate the polynomial at a particular x. There is an efficient way to find the coefficients of the Newton form where c_i = x_{i−1}. We have

p_n(x) = A_0 + A_1 (x − x_0) + · · · + A_n (x − x_0) · · · (x − x_{n−1})

and p_n(x_i) = f_i. Plugging in x_0 and x_1, we see

A_0 = f_0

A_1 = (f_1 − f_0)/(x_1 − x_0)

These coefficients can be found recursively. We let

A_k = f[x_0, . . . , x_k] = ( f[x_1, . . . , x_k] − f[x_0, . . . , x_{k−1}] ) / (x_k − x_0)

We note that if we are interpolating over four points, this gives


x_0   f[x_0]
               f[x_0, x_1]
x_1   f[x_1]                 f[x_0, x_1, x_2]
               f[x_1, x_2]                      f[x_0, x_1, x_2, x_3]
x_2   f[x_2]                 f[x_1, x_2, x_3]
               f[x_2, x_3]
x_3   f[x_3]

From this, we see that adding a new point, say the n-th point, only requires about n new difference calculations. Thus, we can find the interpolating polynomials one point at a time for a total amount of work of about n²/2 up to the n-th step. If we want to interpolate using function values and derivatives, we can modify the scheme (in the limit ε → 0, the first difference of two coincident nodes becomes the derivative):

x_0       f(x_0)
                    f′(x_0)
x_0 + ε   f(x_0)                  f[x_0, x_0, x_1]
                    f[x_0, x_1]                      f[x_0, x_0, x_1, x_1]
x_1       f(x_1)                  f[x_0, x_1, x_1]
                    f′(x_1)
x_1 + ε   f(x_1)
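A minimal sketch of the divided-difference table (for plain, non-Hermite interpolation) and the nested evaluation of the Newton form; function names are ours:

import numpy as np

def newton_coeffs(x, f):
    # column-by-column divided differences; O(n^2) work total
    a = np.array(f, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(x)
    for k in range(1, n):
        # after this pass, a[i] holds f[x_{i-k}, ..., x_i]
        a[k:] = (a[k:] - a[k - 1:-1]) / (x[k:] - x[:-k])
    return a  # a[k] = f[x_0, ..., x_k]

def newton_eval(a, x_nodes, t):
    # nested (Horner-like) evaluation of the Newton form
    p = a[-1]
    for k in range(len(a) - 2, -1, -1):
        p = a[k] + (t - x_nodes[k]) * p
    return p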

To figure out the error in the interpolation, we consider adding a new point x. This gives

p_{n+1}(x) = p_n(x) + f[x_0, . . . , x_n, x] Π_{j=0}^{n} (x − x_j)

We note

f(x) = p_{n+1}(x) = p_n(x) + f[x_0, . . . , x_n, x] Π_{j=0}^{n} (x − x_j)

So the error is

e_n(x) = f(x) − p_n(x) = f[x_0, . . . , x_n, x] Π_{j=0}^{n} (x − x_j)

We claim that f[x_0, . . . , x_n, x] = f^{(n+1)}(ζ)/(n + 1)! for some ζ ∈ [min x_i, max x_i]. To see this, we note that f − p_{n+1} vanishes at at least n + 2 points. Thus f′ − p′_{n+1} vanishes at at least n + 1 points, and repeating this argument, f^{(n+1)}(ζ) = p^{(n+1)}_{n+1}(ζ) for some ζ. Then we note that because

p_{n+1} = lower order terms + f[x_0, . . . , x_n, x] Π_{j=0}^{n} (x − x_j)


we have

f^{(n+1)}(ζ) = p^{(n+1)}_{n+1}(ζ) = (n + 1)! f[x_0, . . . , x_n, x]

7.3 Piecewise Approximations

There are serious limitations to piecewise linear approximations. In particular, the derivatives cannot be made continuous in general. The preferred method is piecewise cubic splines. If you know the derivative values, then you can use the modified divided differences method of the previous section. This gives

P_3(x) = f_0 + f′_0 (x − x_0) + ( (f[x_0, x_1] − f′_0)/(x_1 − x_0) ) (x − x_0)² + ( (f′_1 − 2 f[x_0, x_1] + f′_0)/(x_1 − x_0)² ) (x − x_0)² (x − x_1)

for the interval [x_0, x_1]. If you don't have the derivative values, you can leave them as unknowns (let's call them S_i) and then require that the second derivatives match at the interpolation points on the interior. If you are interpolating on x_0 < x_1 < · · · < x_n, this gives a tridiagonal system of n − 1 equations for the n + 1 values S_i. The formula is then

P_{3,i}(x) = f_i + S_i (x − x_i) + ( (f[x_i, x_{i+1}] − S_i)/(x_{i+1} − x_i) ) (x − x_i)² + ( (S_{i+1} − 2 f[x_i, x_{i+1}] + S_i)/(x_{i+1} − x_i)² ) (x − x_i)² (x − x_{i+1})

on each interval [x_i, x_{i+1}]. The tridiagonal system of equations then results from setting

(d²/dx²) P_{3,i−1}(x_i) = (d²/dx²) P_{3,i}(x_i)

for i = 1, . . . , n − 1. To get two more equations there are several options for handling the endpoints. You can set the derivatives at the endpoints to prescribed values or to an approximation of the derivative. You could require that the third derivatives match up at x_1 and x_{n−1}. You could require that P″(x_0) = P″(x_n) = 0.
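In practice one rarely assembles the tridiagonal system by hand; for instance, SciPy's CubicSpline exposes the endpoint choices above directly (a minimal sketch):

import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0.0, 2.0 * np.pi, 10)
y = np.sin(x)
s = CubicSpline(x, y, bc_type='natural')   # p''(x_0) = p''(x_n) = 0
# bc_type='clamped' instead prescribes zero first derivatives at the ends
print(abs(s(1.0) - np.sin(1.0)))           # interpolation error at a point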

8 Numerical Integration

8.1 Using Equidistant Points

The typical approach is to find an interpolating polynomial for the given function and exactly integrate that polynomial. We consider the problem

I(f) ≈ ∫_a^b f(x) dx


In general, suppose we interpolate with a k-th degree polynomial on k + 1 points, p_k(x_i) = f(x_i) for i = 0, . . . , k. From the previous section, we had

f(x) = p_k(x) + f[x_0, . . . , x_k, x] ψ(x)

where

ψ(x) = Π_{j=0}^{k} (x − x_j)

If E(f) is the error in the integral approximation, we have

E(f) = ∫_a^b f[x_0, . . . , x_k, x] ψ(x) dx

8.1.1 Trapezoidal Rule

For interpolation with two points, we have

ψ_1(x) = (x − a)(x − b) ≤ 0

This function always has the same sign in [a, b]. From the previous section, we also had

f[a, b, x] = f″(ζ_x)/2!

We note that

∫_a^b (x − a)(x − b) dx = (x − a)(x − b)²/2 |_a^b − (x − b)³/6 |_a^b = −(b − a)³/6

Thus, if we let

I(f) = ∫_a^b [ f(a) + ( (f(b) − f(a))/(b − a) ) (x − a) ] dx = ( (f(a) + f(b))/2 ) (b − a)

then we have

∫_a^b f dx − I(f) = E(f) = −( f″(ζ)/12 ) (b − a)³

Thus, the trapezoidal method overestimates for concave up functions and underestimates for concave down functions. If we are integrating over a uniform grid, we have

∫_{x_0}^{x_N} f(x) dx ≈ (h/2) Σ_{k=0}^{N−1} ( f(x_{k+1}) + f(x_k) ) = ( (b − a)/(2N) ) ( f(x_0) + 2f(x_1) + · · · + 2f(x_{N−1}) + f(x_N) )


The convergence of the method is often very fast for smooth periodic functions. One can expect this because a periodic function should spend roughly the same time concave up as concave down, and the errors cancel. For a more rigorous explanation of this fact, one turns to the Euler-Maclaurin formulae.

8.1.2 Midpoint Rule

This is interpolation by a constant at the midpoint. You have

I(f) = f( (a + b)/2 ) (b − a)

Because it is centered, you again achieve third order accuracy with an even slightly better bound than trapezoidal

E(f) = ( f″(ζ)/24 ) (b − a)³

8.1.3 Simpson’s Rule

Simpson's is a 3-point rule that interpolates at the endpoints and midpoint. If you choose the constants A, B, C such that

I(f) = A f(a) + B f((a + b)/2) + C f(b)

is exact for polynomials of degree less than or equal to 2, you get Simpson's rule. It turns out that the rule is exact up to degree 3 polynomials. The rule is

I(f) = (b − a) ( (1/6) f(a) + (2/3) f((a + b)/2) + (1/6) f(b) )

The error is

∫_a^b f dx − I(f) = E(f) = −f^{(4)}(η) ( (b − a)/2 )⁵ / 90

for η ∈ (a, b).

8.1.4 Rules Using Hermite Cubics

Hermite cubics are the interpolating polynomials of degree 3 which match the function value and first derivative at the endpoints. How to find these polynomials is outlined in the previous section. Using these, we get a rule

I(f) = ( (x_{i+1} − x_i)/2 ) ( f(x_i) + f(x_{i+1}) ) + ( (x_{i+1} − x_i)²/12 ) ( f′(x_i) − f′(x_{i+1}) )


which is exact for cubics. Further, if you chain this rule over consecutive intervals, the derivative terms cancel out in the middle. Thus, you need only the function evaluations at all the points, plus derivative evaluations at the two endpoints of the whole interval.

8.2 Quadrature Rules

Given m quadrature points, it is possible to come up with a scheme that is exact for polynomials of degree 2m − 1 or less. We consider integrating against a weight function w, that is, we wish to approximate

∫_a^b f(x) w(x) dx

using m points. We have the following theorem: if the points x_0, . . . , x_{m−1} are chosen as the zeros of the polynomial ϕ_m(x) of degree m in the family of orthogonal polynomials associated with w(x), then

∫_a^b f(x) w(x) dx ≈ A_0 f_0 + · · · + A_{m−1} f_{m−1}

is exact for polynomials up to degree 2m − 1. The coefficients are given by

A_i = ∫_a^b L_i(x) w(x) dx

where L_i is the Lagrange polynomial taking the value 1 at x_i and 0 at all other x_j (this choice should be clear). The proof of the theorem follows by writing a degree 2m − 1 polynomial f as f = q ϕ_m + r with q and r of degree less than or equal to m − 1. Then you use orthogonality and the fact that the choice of coefficients guarantees exact integration for degree m − 1 polynomials.
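A quick numerical check of the theorem using NumPy's built-in Gauss-Legendre nodes and weights (w = 1 on [−1, 1]); the test integrand is our own choice:

import numpy as np

m = 5
nodes, weights = np.polynomial.legendre.leggauss(m)  # zeros of P_m and the A_i
f = lambda x: x**8 + x**3          # degree 8 <= 2m - 1 = 9, so exact
approx = weights @ f(nodes)
exact = 2.0 / 9.0                  # integral of x^8 over [-1, 1]; x^3 is odd
print(approx, exact)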

8.2.1 Optimality of Gaussian Quadrature

From the above, we see that Gaussian quadrature gives you a method which is exact for polynomials up to degree 2m − 1. This is optimal, as there is no quadrature scheme with m points that is exact for polynomials of degree 2m. Let x_i be the quadrature nodes. Consider the polynomial

p(x) = Π_{i=1}^{m} (x − x_i)²

The integral of this polynomial is strictly positive over any interval. However, the quadrature rule gives

∫_a^b p(x) dx ≈ Σ_{i=1}^{m} A_i p(x_i) = 0

for any choice of coefficients A_i.


8.3 Adaptive Methods

The error bounds for the schemes above depend heavily on the length of the interval and the size of the derivatives of the function we're trying to integrate. Thus, for a fixed-length interval, we could obtain a more accurate solution by dividing the interval into lots of smaller subintervals uniformly. However, in doing so, we may end up doing more work than is necessary. For instance, we can use a much coarser grid on sections of the interval over which f is relatively flat. To save the method from unnecessary evaluations, we can use adaptive methods. We'll use a concrete example of an adaptive method for Simpson's rule to elaborate. Suppose we want to approximate the integral

∫_a^b f(x) dx

up to a given tolerance ε using Simpson's rule. We have the following formula for the error

E(f) = −f^{(4)}(η) ( (b − a)/2 )⁵ / 90

Let S be the approximation over the whole interval and S′ be the approximation obtained by splitting the interval in half, using Simpson's rule on each subinterval, and adding the results together. We note that |S − S′| = |E − E′| where E and E′ are the errors for S and S′ respectively. From the formula for the error, we see that E ≈ 16E′. Thus, if |S − S′| = |E − E′| ≤ 15ε, then |E′| ≲ ε and the approximation S′ is acceptable. We then define the integral over an interval (c, d) recursively as follows. Let ε be the desired precision (tolerance) on (c, d). Split (c, d) in half and compute the Simpson's rule approximation on each subinterval. Use the above to determine if this approximation is good enough. If not, do the same for each subinterval with the tolerance ε/2, and so on (note: you should have some sort of maximum depth set).

If you program an adaptive method like this using a literal recursion, you can avoid doing extra function evaluations (when intervals have the same endpoints, etc.) by passing function values from level to level. If you unroll the recursion in your code, you can still save on function evaluations by storing the evaluations you've already performed in a stack.
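A minimal recursive sketch of the scheme just described (interface ours); function values are passed down so no point is evaluated twice:

def adaptive_simpson(f, a, b, eps, depth=50):
    def simpson(fa, fm, fb, a, b):
        return (b - a) * (fa + 4.0 * fm + fb) / 6.0

    def recurse(a, b, fa, fm, fb, S, eps, depth):
        m = (a + b) / 2.0
        fl, fr = f((a + m) / 2.0), f((m + b) / 2.0)
        S_left = simpson(fa, fl, fm, a, m)
        S_right = simpson(fm, fr, fb, m, b)
        S2 = S_left + S_right
        if depth <= 0 or abs(S2 - S) <= 15.0 * eps:
            return S2 + (S2 - S) / 15.0   # cheap Richardson-style correction
        return (recurse(a, m, fa, fl, fm, S_left, eps / 2.0, depth - 1)
                + recurse(m, b, fm, fr, fb, S_right, eps / 2.0, depth - 1))

    fa, fm, fb = f(a), f((a + b) / 2.0), f(b)
    return recurse(a, b, fa, fm, fb, simpson(fa, fm, fb, a, b), eps, depth)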

8.4 Issues and Other Considerations

• Singularities: the following examples can be found in Dahlquist’s book.

– By substitution: If we wish to calculate

A = ∫_0^1 x^{−1/2} eˣ dx

we can use the substitution x = t², giving

A = ∫_0^1 2 e^{t²} dt

which can be approximated using the methods described above.

– Using integration by parts: for the same integral as above, we can integrate by parts, giving

∫_0^1 x^{−1/2} eˣ dx = [2 x^{1/2} eˣ]_0^1 − 2 ∫_0^1 x^{1/2} eˣ dx
                    = 2e − 2 [ (2/3) x^{3/2} eˣ ]_0^1 + (4/3) ∫_0^1 x^{3/2} eˣ dx
                    = (2/3) e + (4/3) ∫_0^1 x^{3/2} eˣ dx

and so on. We note that after the first integration by parts, we don't have a singularity. However, the derivative of the function is singular and therefore the methods will not converge rapidly. Generally, the more continuous derivatives we can take, the better.

– Simple comparison problem: We consider the problem of calculating the integral

I = ∫_{0.1}^1 x^{−3} eˣ dx

whose integrand is very large near the left endpoint. We could instead write the integral as

I = ∫_{0.1}^1 x^{−3} (1 + x + x²/2) dx + ∫_{0.1}^1 x^{−3} (eˣ − 1 − x − x²/2) dx

The first integral above can be evaluated analytically, and the second integral has an integrand which is well behaved (bounded with bounded derivatives) and can therefore be evaluated numerically to high precision.

– Special integration formula: in many cases, if a function has a certain kind of singularity (e.g., f has a log singularity at zero), then near the singularity the function is actually equal to the type of singularity times a smooth function plus a smooth function (e.g., f(x) = h(x) log(x) + g(x) for h, g smooth). Thus, coming up with a quadrature that works to integrate

∫_0^h h(x) log(x) dx

can have a wide range of applications. These quadratures can be found by using undetermined coefficients to integrate polynomials up to a certain degree exactly.


• Infinite interval of integration: analogous versions of the methods above for singular integrals can be effective for infinite intervals of integration. Here are a few other thoughts.

– Slowly decaying integrands: when the integrand decays slowly, one can use numerical integration on a modest-length interval (say [0, R]) and use some analysis for the tail. To evaluate the integral on [R, ∞), one can expand the function in powers of x^{−1}, integrate these analytically, and sum the totals numerically.

– Oscillating integrands: if the integrand decays slowly and oscillates as x → ∞, then the approach is quite similar to that for evaluating alternating series efficiently. Namely, one can split the integral into an alternating series of integrals over the intervals where the integrand is alternately positive and negative. Then, one can apply the technique of repeated averaging to speed the convergence of the sum. We briefly describe repeated averaging below.

– Smooth, rapidly decreasing integrands: if the integrand is smooth and has rapidly decreasing values and derivatives, then the trapezoidal rule gives a very good approximation to the integral over large intervals (e.g., consider integrating ∫_{−∞}^{∞} e^{−x⁴} dx by the trapezoidal rule over [−R, R]). The quality of this scheme can be explained by two facts. One is that

∫_R^∞ e^{−x⁴} dx ≤ ∫_R^∞ (x³/R³) e^{−x⁴} dx = (1/R³) (1/4) e^{−R⁴}

which decays rapidly. Let f(x) = e^{−x⁴}. The other fact is that the Euler-Maclaurin formula gives us that if I is the trapezoidal rule approximation of the integral over [−R, R] with step size h, we have

I − ∫_{−R}^{R} f(x) dx = (h²/12)[f′(R) − f′(−R)] − (h⁴/720)[f‴(R) − f‴(−R)] + · · ·

As f and its derivatives vanish quickly as R → ∞, we see that the trapezoidal rule gives a good approximation. Further, we could try to correct the error using the formula above to get a higher order scheme.

• Repeated Averaging: for slowly converging, alternating series, one can employ repeated averaging to speed up the convergence. We include the following example from Dahlquist and Björck. We have the following formula

π/4 = 1 − 1/3 + 1/5 − 1/7 + · · ·

which is a very slowly converging sum (after 500 terms the value can still change in the third digit). Consider the following scheme


n    S_n       M1        M2        M3     M4     M5     M6
5    .744012
               .782474
6    .820935             .785308
               .787602             · · ·
7    .754268             .785641          · · ·
               .783680             · · ·         · · ·
8    .813092             .785228          · · ·         .785398
               .786776             · · ·         · · ·
9    .760460             .785523          · · ·
               .784720             · · ·
10   .808079             .785305
               .786340
11   .764601

In the above, we start with a column of the partial sums, and each following column is obtained by averaging each entry's northwest and southwest neighbors. Note that the values oscillate in each column. Generally, it can be shown that if the absolute value of the j-th term (as a function of j) has a k-th derivative which approaches zero monotonically, then the values in column M_k will alternately be above and below the limit of the series. Here, the value in M6 is actually correct to 6 digits.
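A compact sketch of the averaging scheme applied to the series above:

import numpy as np

N = 12
terms = np.array([(-1.0) ** k / (2 * k + 1) for k in range(N)])
col = np.cumsum(terms)              # partial sums S_1, ..., S_N
for _ in range(6):                  # build columns M1, ..., M6
    col = (col[:-1] + col[1:]) / 2  # average neighboring entries
print(col[-1], np.pi / 4)           # compare with pi/4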

• Euler-Maclaurin formulae: Let T(h) be the trapezoidal sum for a function f on the interval [a, b]. If f is sufficiently smooth, then

T(h) = ∫_a^b f(x) dx + (h²/12)[f′(b) − f′(a)] − (h⁴/720)[f‴(b) − f‴(a)] + · · · + c_{2r} h^{2r} [f^{(2r−1)}(b) − f^{(2r−1)}(a)] + O(h^{2r+2})

The constants c_{2r} have the generating function

1 + c_2 h² + c_4 h⁴ + · · · = (h/2) (e^h + 1)/(e^h − 1)

Proving the formula is a bit tedious, but we'll mention that it involves repeated integration by parts. Finding the generating function for the constants is simpler: one simply applies the formula to f(x) = eˣ. The formula has many uses.

– It gives an asymptotic expansion for the error in a trapezoidal integration rule. In particular, this means that Richardson extrapolation (see the ODE section) can be used with the trapezoidal rule to get higher order schemes (called Romberg's method).

– It has the application mentioned above for the integral of e^{−x⁴}, and generally applies when you know a formula for the derivatives at the endpoints.

– It shows that the convergence as h → 0 of the trapezoidal rule applied to smooth periodic functions (note f^{(k)}(b) = f^{(k)}(a) in this case) is super-algebraic (faster than h^p for any p).

– It can also be used for the inverse problem, i.e., evaluating sums when the integral is simpler to compute.

9 Nonlinear Equations and Newton’s Method

9.1 The Bisection Idea

Given a continuous function f, one can use the intermediate value theorem to come up with the following scheme. Given a < b such that f(a) < 0 and f(b) > 0, we come up with a_k, b_k iteratively by

m_k+1 = (a_k + b_k)/2
if f(m_k+1) > 0
    b_k+1 = m_k+1
    a_k+1 = a_k
if f(m_k+1) < 0
    a_k+1 = m_k+1
    b_k+1 = b_k

We see that the intervals (a_k, b_k) always contain a root and shrink in size by a factor of 2 at each step. This is slow: it takes one iteration to gain a binary digit, and a little over 3 iterations to gain a decimal digit. There are better methods that take better advantage of the values of f (note that bisection only considers the sign) and its derivatives. These methods often need a decent starting position to be effective, and bisection can provide this starting position.

9.2 Newton’s Method

The idea behind Newton's method is to obtain a sequence x_n iteratively using a difference formula for the derivative of the function f. If the next step is close to a root, we assume f(x_{n+1}) ≈ 0. Then

f′(x_n) ≈ ( f(x_n) − f(x_{n+1}) ) / (x_n − x_{n+1})

f′(x_n) ≈ f(x_n) / (x_n − x_{n+1})

x_{n+1} ≈ x_n − f(x_n)/f′(x_n)


Let α be a simple root of a function f (i.e., f(α) = 0 and f′(α) ≠ 0). If your starting x_0 is near enough to the root and we have suitable bounds on f″ and f′, then Newton's method is quadratically convergent. This can be seen by using a Taylor expansion. Let e_n = α − x_n and ζ ∈ (x_n, α). Then

0 = f(α) = f(x_n) + e_n f′(x_n) + (e_n²/2) f″(ζ)

0 = f(x_n)/f′(x_n) + e_n + e_n² f″(ζ) / (2 f′(x_n))

0 = e_{n+1} + e_n² f″(ζ) / (2 f′(x_n))

|e_{n+1}| ≤ |e_n|² |f″(ζ)| / (2 |f′(x_n)|)

Let M = max |f″(x)| on the interval (α − |e_0|, α + |e_0|) and assume that m = max_x M/(2|f′(x)|) < 1/|e_0| on that interval. Then we see that

|e_n| ≤ |e_{n−1}|² m

m|e_n| ≤ (m|e_{n−1}|)²

m|e_n| ≤ (m|e_{n−2}|)^{2²}
...
m|e_n| ≤ (m|e_0|)^{2ⁿ}

giving quadratic convergence. We note that quadratic convergence is significantly better than linear convergence: each iteration doubles the number of significant digits in the computation. Thus, to get about the same number of digits as a method with linear convergence that took n steps, we only need to do about log₂(n) steps.
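A minimal sketch (names ours); running it on x² − 2 shows the number of correct digits roughly doubling per iteration:

import math

def newton(f, fprime, x0, tol=1e-14, maxiter=50):
    x = x0
    for _ in range(maxiter):
        step = f(x) / fprime(x)   # Newton correction
        x = x - step
        if abs(step) < tol:
            break
    return x

# sqrt(2) as the root of x^2 - 2
print(newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0) - math.sqrt(2.0))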

There are many ways in which the method can go wrong. For the function f(x) = x^{1/3}, we note that the derivative at zero is infinite. Otherwise we have f′(x) = (1/3) x^{−2/3}. Say our starting point is x_0. We then have

x_1 = x_0 − x_0^{1/3} / ( (1/3) x_0^{−2/3} ) = x_0 − 3 x_0 = −2 x_0

x_2 = −2 x_1 = 4 x_0

x_3 = −2 x_2 = −8 x_0


So we see that for any starting guess, the method diverges away from the zero. For the function f(x) = x² we have a double root at zero. We note that f′(x) = 2x and

x_{n+1} = x_n − f(x_n)/f′(x_n) = x_n − x_n²/(2 x_n) = x_n/2

so we see that the method still converges to zero, but at a rate as slow as the bisection method.

9.3 The Secant Method

If the process of evaluating the derivative of f is impossible or prohibitively expensive, one can instead use the secant method. It requires two starting values but only involves one new function evaluation per step. The main idea is to use Newton's method but to approximate the value of f′(x_n) by a difference formula. You get

x_{n+1} = x_n − f(x_n) (x_n − x_{n−1}) / ( f(x_n) − f(x_{n−1}) )

You can again get an idea of the rate of convergence of the method by considering a Taylor expansion. If e_n is the error at each step and you make similar assumptions on f as for Newton's method, you can find |e_{n+1}| ≤ K|e_n||e_{n−1}|. Interestingly, one can find that the order of convergence is given by the golden ratio (1 + √5)/2.

10 Numerical ODE

10.1 The Set-Up

For the following, we assume that we are dealing with a system of first-order ODEs. This is general enough to cover higher order ODEs because, given a higher-order problem, we can convert it into a first-order system as follows. Let

y^{(n)} = f(t, y, y′, . . . , y^{(n−1)})

Then we introduce new variables v_1, . . . , v_n with v_i = y^{(i−1)}. We have

d/dt (v_1, . . . , v_{n−1}, v_n)^T = (v_2, . . . , v_n, f(t, v_1, . . . , v_n))^T


Further, we often assume that the equation is autonomous (no explicit t dependence), as t can be easily absorbed into the system as follows. Let

y′ = f(t, y)

Then if we set v_1 = y and v_2 = t we have

d/dt (v_1, v_2)^T = (f(v_2, v_1), 1)^T

10.2 It's Almost Always this Idea

To show the order (and hence convergence) of many ODE methods, one can generally follow these steps:

• Estimate the local truncation error, i.e., the error that is accrued if you start at the correct location. The standard way to find this error is through a Taylor expansion.

• Derive a relation between the error at the i-th step and the (i − 1)-st step. This is often accomplished by assuming some sort of Lipschitz continuity on the data.

• Solve the relation for the error at the i-th step in terms of the starting error.

As an example, we perform the above steps for Euler's method. Euler's method for the ODE ẋ = f(x) makes the approximation

x_{i+1} = x_i + Δt f(x_i)

Let y(t) be a solution to the ODE. If we assume that we start at y_i = y(t_i), then the local truncation error can be found by Taylor expanding y(t_{i+1}). We have

y(t_{i+1}) = y(t_i) + Δt y′(t_i) + (Δt²/2) y″(ξ) = y_i + Δt f(y_i) + (Δt²/2) y″(ξ) = y_{i+1} + (Δt²/2) y″(ξ)

giving the local truncation error

τ_i = y(t_{i+1}) − y_{i+1} = (Δt²/2) y″(ξ)

Next, we derive a difference relation for the error. Let x_i be the computed solution. We have

y(t_i) = y(t_{i−1}) + Δt f(t_{i−1}, y(t_{i−1})) + τ_{i−1}

x_i = x_{i−1} + Δt f(t_{i−1}, x_{i−1})


y(t_i) − x_i = y(t_{i−1}) − x_{i−1} + Δt [ f(t_{i−1}, y(t_{i−1})) − f(t_{i−1}, x_{i−1}) ] + τ_{i−1}

|e_i| ≤ |e_{i−1}| + Δt |f(t_{i−1}, y(t_{i−1})) − f(t_{i−1}, x_{i−1})| + |τ_{i−1}|
     ≤ (1 + ΔtL) |e_{i−1}| + |τ_{i−1}|

where the last line assumes a Lipschitz condition on f with constant L (note this is stronger than necessary but makes the calculation easier). Finally, we solve this difference relation in terms of the starting error.

|e_i| ≤ (1 + ΔtL) |e_{i−1}| + |τ_{i−1}|
     ≤ (1 + ΔtL)² |e_{i−2}| + (1 + ΔtL) |τ_{i−2}| + |τ_{i−1}|
     ...
     ≤ (1 + ΔtL)^i |e_0| + Σ_{p=0}^{i−1} (1 + ΔtL)^p |τ_{i−p−1}|

Now, we assume that the second derivative of the solution is bounded, say by 2M. Let N = (T − t_0)/Δt. Then

|e_N| ≤ (1 + ΔtL)^N |e_0| + Σ_{p=0}^{N−1} (1 + ΔtL)^p |τ_{N−p−1}|
     ≤ e^{ΔtLN} |e_0| + M Δt² Σ_{p=0}^{N−1} (1 + ΔtL)^p
     ≤ e^{L(T−t_0)} |e_0| + M Δt² ( (1 + ΔtL)^N − 1 ) / (ΔtL)
     ≤ e^{L(T−t_0)} |e_0| + C Δt

Thus, if the starting error is O(Δt), we have a first order approximation.

10.3 A Note on Conditioning

ODEs often have a family of solution curves depending on the initial conditions. From this vantage point, we can describe the conditioning of an ODE as follows: if the curves in the family of solutions depart from each other rapidly, then the initial-value problem is ill-conditioned; otherwise, the problem is well-conditioned. We note that this definition depends only on the ODE and does not consider conditioning issues specific to the algorithm used to solve the ODE. Also, it should be clear that this definition of conditioning is distinct from that given for numerical linear algebra problems. To get a more quantitative idea of this definition, we consider the ODE


ẋ = f(x)

and we let F(t, x_0) be the solution operator for initial condition x(0) = x_0. If (d/dx) f ≤ L for some L (which can be negative), we have

F(t, x_0) − F(t, x_0 + h) = x_0 + ∫_0^t f(F(s, x_0)) ds − x_0 − h − ∫_0^t f(F(s, x_0 + h)) ds

F(t, x_0) − F(t, x_0 + h) ≤ |h| + ∫_0^t [ f(F(s, x_0)) − f(F(s, x_0 + h)) ] ds
                          ≤ |h| + L ∫_0^t [ F(s, x_0) − F(s, x_0 + h) ] ds

F(t, x_0) − F(t, x_0 + h) ≤ |h| e^{Lt}

where the last line follows by Gronwall's inequality. Similarly, we have

F(t, x_0 + h) − F(t, x_0) ≤ |h| e^{Lt}

|F(t, x_0) − F(t, x_0 + h)| ≤ |h| e^{Lt}

Thus, for well-conditioned ODEs (small L or L < 0), we have that the solutions for nearby starting conditions stay close together (in the above sense).

10.4 Richardson Extrapolation

10.4.1 Explanation

The set-up for Richardson extrapolation is that we have an algorithm that computes A(h), depending on the step size h. Further, this algorithm is designed to compute the value a, and we assume it has the following asymptotic expansion

A(h) = a + a_1 h^{p_1} + a_2 h^{p_2} + a_3 h^{p_3} + · · ·

for increasing p_i. We then build better approximations using A(h) as follows. Let A_1(h) = A(h) and

A_{k+1}(h) = A_k(h) + ( A_k(h) − A_k(2h) ) / (2^{p_k} − 1)

Then A_n(h) has the form


A_n(h) = a + c_n h^{p_n} + c_{n+1} h^{p_{n+1}} + · · ·

We include the following example problem to give a fuller understanding of the process.

Explain how to construct high-order initial values for an Adams method using the first order forward Euler method and Richardson extrapolation. Does this significantly increase the cost of integrating an ODE up to some time T?

Solution:

We note that the forward Euler method has an asymptotic error expansion. Let A denote the solution to the ODE at time t = h and let A(h) be the numerical approximation to A using a time step of length h. We can write

A(h) = A + A_1 h + A_2 h² + A_3 h³ + · · ·

because forward Euler is a first order method. We can combine the results of two of these first order approximations to obtain a second-order approximation by

A^{(2)}(h) = ( 2 A(h/2) − A(h) ) / (2^{2−1} − 1) = A − (1/2) A_2 h² + higher order terms

It is clear that this process can be continued to get higher order approximations defined iteratively.

A^{(p)}(h) = ( 2^{p−1} A^{(p−1)}(h/2) − A^{(p−1)}(h) ) / (2^{p−1} − 1) = A + C_p h^p + higher order terms

If this recursion is unrolled, we see that it is simply a linear combination of values given by the original Euler method for p different step sizes: A(h), A(h/2), A(h/2²), . . . , A(h/2^{p−1}). The amount of work to evaluate A(h/2ⁿ) at time t = h involves 2ⁿ steps of forward Euler, so you have 2ⁿ function evaluations (and additions). Thus, the total work to get these values is

Σ_{n=0}^{p−1} 2ⁿ = (2^p − 1)/(2 − 1) = 2^p − 1

A p-th order Adams-Bashforth method needs p initial values, so given the initial value and using the above steps for the remaining values we have

(p − 1)(2^p − 1)

extra function evaluations. To compare, the Adams-Bashforth method will require p function evaluations per step; however, only one of these will be new. Thus, to get an approximation at time T this will require about


T/h

function evaluations. Thus, if (p − 1)(2^p − 1) is significantly smaller than T/h, the extra work for the initial values will not have much effect on the total amount of work. We note that as h → 0 this will be true.

10.4.2 Uses

As outlined above, the Richardson extrapolation process can be used to gain higher-order approximations from a lower-order method. Further, a programmer can approximate the order of convergence that their code is achieving by using a similar idea to Richardson extrapolation. If we knew the actual answer, we could look at

( A(2h) − a ) / ( A(h) − a ) = ( 2^{p_1} h^{p_1} a_1 + 2^{p_2} h^{p_2} a_2 + · · · ) / ( h^{p_1} a_1 + h^{p_2} a_2 + · · · ) ≈ 2^{p_1} for small h

If we don't know the answer, we can consider instead

( A(4h) − A(2h) ) / ( A(2h) − A(h) ) ≈ 2^{p_1} for small h

This fact is useful for debugging.
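A minimal sketch of this debugging trick for forward Euler on ẋ = −x (test problem ours):

import numpy as np

def euler_at_T(f, x0, T, dt):
    x, n = x0, int(round(T / dt))
    for _ in range(n):
        x = x + dt * f(x)
    return x

f = lambda x: -x
h = 0.01
A_h, A_2h, A_4h = (euler_at_T(f, 1.0, 1.0, s * h) for s in (1, 2, 4))
print(np.log2((A_4h - A_2h) / (A_2h - A_h)))  # ~1, consistent with p_1 = 1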

10.4.3 A Word of Warning

The above does not merely depend on having a method of a certain order; rather, it depends strongly on the fact that the method has an asymptotic error expansion.

10.5 Modified Equation

One can gain insight about the qualitative nature of the solution obtained by a particular numerical method by computing what is called the modified equation. The modified equation is an ODE that your computed solution satisfies to higher order than the original ODE. We give an example. If we use forward Euler to approximate the solution of ẋ = f, i.e.,

x_{n+1} = x_n + Δt f(x_n)

which ODE does x_n satisfy to second order (i.e., third order local truncation error)? We assume that we can write

ẋ = f(x) + Δt f_1(x) + · · ·

Then we integrate


x(t_{n+1}) = x(t_n) + ∫_{t_n}^{t_{n+1}} [ f(x(t)) + Δt f_1(x(t)) + O(Δt²) ] dt

           = x(t_n) + ∫_{t_n}^{t_{n+1}} [ f(x(t_n)) + (t − t_n) f′(x(t_n)) f(x(t_n)) + Δt f_1(x(t_n)) + O(Δt²) ] dt

           = x_n + Δt f(x_n) + (Δt²/2) f′(x_n) f(x_n) + Δt² f_1(x_n) + O(Δt³)

Thus, if f_1(x(t)) = −(1/2) f′(x(t)) f(x(t)), then we have

x(t_{n+1}) = x_n + Δt f(x_n) + O(Δt³) = x_{n+1} + O(Δt³)

and we see that forward Euler for ẋ = f is a second order method for

ẋ = f(x) − (Δt/2) f′(x) f(x)

We then apply this to a specific problem to see what we can expect in terms of the error. Consider the harmonic oscillator

d/dt (x_1, x_2)^T = (x_2, −x_1)^T

We then have that

f′f = [ 0 1; −1 0 ] (x_2, −x_1)^T = (−x_1, −x_2)^T

So forward Euler solves

d/dt (x_1, x_2)^T = (x_2, −x_1)^T + (Δt/2) (x_1, x_2)^T

to second order. Normally the harmonic oscillator starting at (x_1, x_2) = (0, 1) has solution (sin t, cos t). However, the modified equation has solution e^{Δt·t/2} (sin t, cos t). Thus, we can qualitatively expect the error of the solution obtained by forward Euler to grow outward from the correct solution (while staying relatively in phase).

10.6 Splitting Schemes

Sometimes, it is desirable to split the right hand side of a differential equation into two parts, giving two ODEs, each of which is easier to solve by a numerical method. However, it is not immediately obvious that we could combine the solutions to the two new equations in a way that gives a solution to the original equation. Strang splitting provides a second order way of doing so.

Suppose the ODE ẋ = f(x) has solution operator x(t) = F(t, x(0)). Suppose that ẋ = g(x) has solution operator x(t) = G(t, x(0)). Consider the combined ODE

ẋ = f(x) + g(x)

Consider the “Strang splitting” scheme

y_{n+1/2} = F(Δt/2, x_n)
z_{n+1/2} = G(Δt, y_{n+1/2})
x_{n+1} = F(Δt/2, z_{n+1/2})

This produces a second order accurate approximation, which can be seen by taking Taylor expansions. We note that if we have approximate solution operators F̃ and G̃ instead of exact solution operators (e.g., numerical methods), the Strang splitting solution is second order if F̃ and G̃ are second order. At first glance, the method seems a little inefficient in that F (or F̃) is evaluated twice per step. However, we note the following. If the ODE has unique solutions (e.g., f is Lipschitz), then we have F(Δt/2, F(Δt/2, x)) = F(Δt, x). To see this, let x(t) be the solution starting at x. Then

x(Δt) = F(Δt/2, x(Δt/2)) = F(Δt/2, F(Δt/2, x))

x(Δt) = F(Δt, x)

So we have equality. Then we note that as we run the algorithm, we really only need to keep track of the y_{n+1/2} and z_{n+1/2} variables (until we calculate x_N at the last step). We see that

y_{n+3/2} = F(Δt/2, x_{n+1}) = F(Δt/2, F(Δt/2, z_{n+1/2})) = F(Δt, z_{n+1/2})

Thus, aside from the beginning and end iterations, the algorithm can go as

y_{n+1/2} = F(Δt, z_{n−1/2})
z_{n+1/2} = G(Δt, y_{n+1/2})

which eliminates one of the evaluations of F per time-step.
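A tiny sketch with exact sub-operators (the problem is our own choice; note f and g commute here, so the splitting is exact up to roundoff, which makes the bookkeeping easy to verify):

import numpy as np

F = lambda dt, x: np.exp(-dt) * x       # solves x' = -x  (decay)
G = lambda dt, x: np.exp(1j * dt) * x   # solves x' = i x (rotation)

def strang(x0, T, N):
    dt, x = T / N, x0
    x = F(dt / 2, x)                    # half step at the start
    for _ in range(N - 1):
        x = F(dt, G(dt, x))             # fused interior steps
    x = F(dt / 2, G(dt, x))             # final half step
    return x

x = strang(1.0 + 0j, 1.0, 100)
print(abs(x - np.exp((-1 + 1j) * 1.0)))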

An example of a situation in which it is desirable to use splitting arises in numerical PDE. Because of the spectral properties of the discrete Laplacian, it is a good idea to use an implicit method for the time-stepping. However, the matrix for the full Laplacian has a rather large bandwidth. In contrast, if we split the Laplacian into its ∂_xx and ∂_yy parts, we have (perhaps after reordering) two tridiagonal systems. These are much easier to invert, which we would have to do in the case of an implicit method (e.g., the trapezoidal rule).

10.7 Linear Multi-Step Methods (LMM)

Linear multi-step methods store previously computed values of x_n and f(x_n) and use these to form higher order approximations to the next step than might have been possible by only storing the previous value. They can be either implicit or explicit.

10.7.1 General Explicit LMM

An explicit linear multistep method with p lags can be written as

x_{n+1} = α_0 x_n + · · · + α_p x_{n−p} + Δt ( β_0 f(x_n) + · · · + β_p f(x_{n−p}) )

They are called linear because they are linear in f. By contrast, Runge-Kutta methods are not linear in f.

10.7.2 Consistency, Zero-Stability, and the Root Condition

There is a quick, simple check for consistency. In the case where f ≡ 0, we should hope that x = c, a constant, satisfies the recurrence of the method. We have

c = α_0 c + · · · + α_p c = c Σ_{i=0}^{p} α_i

so Σ_i α_i = 1 is a quick test of consistency.

The standard way to check the stability of an LMM is to consider its zero-stability. That is, you check what happens when you apply the method to ẋ = 0. The first observation in this case is that the x_i satisfy a linear recurrence relation

The standard way to check the stability of a LMM is to consider its zero-stability. That is, youcheck what happens when you apply the method to x = 0. The first observation in this case is thatthe xi satisfy a linear recurrence relation

x_{n+1} = α_0 x_n + · · · + α_p x_{n−p}

We then seek solutions to the recurrence of the form x_n = zⁿ. This gives

z^{n+1} = α_0 zⁿ + · · · + α_p z^{n−p}

z^{p+1} = α_0 z^p + · · · + α_p

We then define

f(z) = z^{p+1} − α_0 z^p − · · · − α_p


to be the “characteristic function”. The solutions of the recurrence correspond to roots of the characteristic polynomial. It is immediate from the consistency condition that, for a consistent method, z = 1 is a root of f. We see that if |z| > 1 for some root, then x_n = zⁿ is a solution which grows without bound (a sign of instability). If |z| ≤ 1, there are two cases. Simple roots are fine and lead to bounded solutions x_n. If you have a double root z_0, then

f(z_0) = 0,  f′(z_0) = 0

so we have that x_n = n z_0ⁿ is a solution. To see this,

x_{n+1} − α_0 x_n − · · · − α_p x_{n−p} = (n+1) z_0^{n+1} − n α_0 z_0ⁿ − · · · − (n−p) α_p z_0^{n−p}
  = (n−p) z_0^{n−p} f(z_0) + [ (p+1) z_0^{n+1} − p α_0 z_0ⁿ − · · · − 1 · α_{p−1} z_0^{n−p+1} ]
  = 0 + z_0^{n−p+1} ( (p+1) z_0^p − p α_0 z_0^{p−1} − · · · − α_{p−1} )
  = z_0^{n−p+1} f′(z_0) = 0

So we see that multiple roots are a problem when |z_0| = 1. Further, higher multiplicity roots give rise to solutions similar to the above, so we see that |z_0| < 1 is fine for multiple roots. These notions give rise to the following definition.

The Root Condition: A linear recurrence satisfies the root condition if each root z of the characteristic polynomial satisfies either |z| < 1, or |z| = 1 and z is simple.

10.7.3 Convergence

The ideas above can be used to establish the convergence of an LMM. The result we will use is that a matrix A is power bounded (‖Aⁿ‖ ≤ C independent of n) if and only if its minimal polynomial satisfies the root condition (a statement about the Jordan form of A; in particular, no Jordan blocks of size bigger than one for |λ| = 1 and no eigenvalues of magnitude larger than 1). The converse direction is proved by constructing a norm ‖·‖_∗ under which A is a contraction and then using the equivalence of norms to get power boundedness. The following is taken directly from a homework problem for Numerical Methods II at Courant:

Use the result to prove convergence of linear multistep methods. More precisely, suppose the characteristic polynomial of the multistep method satisfies the root condition. Use a norm in which the companion matrix is a contraction. Let X_n = (x_n, . . . , x_{n−p}). Then X_{n+1} = 𝒜 X_n + Δt F(X_n), where 𝒜 consists of d “copies” of the companion matrix A, one for each component of x. You will be able to show that if Y_{n+1} = 𝒜 Y_n + Δt F(Y_n) + Δt R_n, then

‖Y_n − X_n‖_∗ ≤ C(t_n) ‖X_0 − Y_0‖_∗ + Δt Σ_{k≤n} C(t_n − t_k) ‖R_k‖_∗


where ‖·‖_∗ is the contraction norm for 𝒜. Use this to conclude that if the method has formal order of accuracy q and is stable by the root condition criterion, then it actually is accurate with order q, provided the initial steps are done accurately enough.

Solution:

We calculate the characteristic polynomial of the matrix A. We have

A − zI =
⎡ a_0 − z   a_1   a_2   · · ·   a_{p−1}   a_p ⎤
⎢ 1         −z    0     · · ·   0         0   ⎥
⎢ 0         1     −z                      ⋮   ⎥
⎢            ⋱     ⋱     ⋱                0   ⎥
⎢                  1     −z               0   ⎥
⎣                        1                −z  ⎦

Expanding along the first row, this gives that

det(A − zI) = (a_0 − z)(−z)^p + Σ_{k=1}^{p} (−1)^k a_k (−z)^{p−k}

            = (−1)^{p+1} z^{p+1} + (−1)^p Σ_{k=0}^{p} a_k z^{p−k}

Thus, we have that the characteristic polynomial of A is (up to a sign) equivalent to the characteristic polynomial of the linear recurrence. Thus, if the characteristic polynomial of the linear recurrence satisfies the root condition (roots are either simple or have magnitude less than 1), then the eigenvalues of A have magnitude less than or equal to 1. Further, as the minimal polynomial of a matrix must divide the characteristic polynomial, we see that the roots with magnitude one are simple for the minimal polynomial. In particular, the size of the largest Jordan block for such a root must be 1. Thus, the matrix A satisfies the root condition for matrices. From problem 1, we see that there is a norm ‖·‖₁ such that A is a contraction. Let X^{(j)}_n be the vector consisting of the j-th element in each of the vectors x_n, . . . , x_{n−p}. For each j ∈ {1, . . . , d} we have X^{(j)}_{n+1} = A X^{(j)}_n. Let ‖X‖_∗ = (‖X^{(1)}‖₁² + · · · + ‖X^{(d)}‖₁²)^{1/2}. We note that with this norm, we have

‖𝒜X‖_∗ = ( ‖A X^{(1)}‖₁² + · · · + ‖A X^{(d)}‖₁² )^{1/2} ≤ ( ‖X^{(1)}‖₁² + · · · + ‖X^{(d)}‖₁² )^{1/2} = ‖X‖_∗

so 𝒜 is a contraction under ‖·‖_∗. Next, we assume that F satisfies some suitable Lipschitz condition, say ‖F(Y) − F(X)‖_∗ ≤ M ‖X − Y‖_∗ for M < ∞. Then we have


‖Y_{n+1} − X_{n+1}‖_∗ = ‖𝒜(Y_n − X_n) + Δt(F(Y_n) − F(X_n)) + Δt R_n‖_∗
                     ≤ ‖𝒜(Y_n − X_n)‖_∗ + Δt ‖F(X_n) − F(Y_n)‖_∗ + Δt ‖R_n‖_∗
                     ≤ (1 + MΔt) ‖Y_n − X_n‖_∗ + Δt ‖R_n‖_∗

We use the semi-group approach. Define S_{n,k} for n ≥ k by

S_{k,k} = 1
S_{n+1,k} = (1 + MΔt) S_{n,k}

It is clear that S_{n,k} ≤ e^{M(t_n − t_k)}. Assume the initial data is (x_0, . . . , x_{−p}). Let R = max_{k≤n−1} ‖R_k‖_∗. We then have

‖Y_n − X_n‖_∗ ≤ S_{n,0} ‖X_0 − Y_0‖_∗ + Σ_{k≤n−1} Δt S_{n,k} ‖R_k‖_∗
             ≤ e^{Mt_n} ‖X_0 − Y_0‖_∗ + Σ_{k≤n−1} Δt e^{M(t_n − t_k)} ‖R_k‖_∗
             ≤ e^{Mt_n} ‖X_0 − Y_0‖_∗ + ∫_0^{t_n} e^{M(t_n − t)} R dt
             = e^{Mt_n} ‖X_0 − Y_0‖_∗ + (1/M)(e^{Mt_n} − 1) R

Say R = O(Δt^q). We see that if ‖X_0 − Y_0‖_∗ is O(Δt^q), then the method is O(Δt^q).

10.7.4 Maximum Order but Unstable Example and The Dahlquist Barriers

We begin with an example. The LMM described by

x_{n+1} = −4 x_n + 5 x_{n−1} + Δt ( 4 f(x_n) + 2 f(x_{n−1}) )

has a local truncation error of order Δt⁴ (so it has formal third order accuracy). However, its recurrence polynomial is z² + 4z − 5, which has z = −5 as a root. This method is quite unstable.

Thus, it seems that even though we have the degrees of freedom to come up with a higher order scheme, it won't necessarily be stable. There is a theorem that summarizes this idea.

The First Dahlquist Barrier: an explicit LMM with p lags has order less than or equal to p + 1, or is unstable.

We will mention the Second Dahlquist Barrier in the section on stability analysis for stiff equations.


10.7.5 Adaptive LMM

Adaptive time stepping can be very beneficial when it comes to speeding up an algorithm. If the right hand side of an ODE is smooth over a long region, it is often possible to have an accurate approximation with long time steps. Similarly, in highly oscillatory or steep regions, a shorter time step is often required. Thus, we would like the ability to adjust the time step of an algorithm as necessary. It is a simple matter to make LMMs adaptive; however, you do have to allow for the flexibility in the time steps when you derive the methods. The factors by which you allow yourself to adjust the step up (say by γ) and down (say by δ) affect the constant in the error bound you ultimately achieve. In class, it was mentioned that a good rule of thumb is to not adjust up by more than a factor of 2 or down by a factor smaller than 1/2.

10.7.6 Named Methods

• Modified Midpoint (a Nystrom method). This method is similar to the midpoint method that arises from Runge-Kutta. Its formula is

x_{n+1} = x_{n−1} + 2 Δt f(x_n)

It is second order (a result of symmetry). As it is a Nystrom method, the time step must be uniform.

• Adams-Bashforth. These methods use the value of x_n (and no other previous x_j values) and the values of f(x_n), . . . , f(x_{n−p}). They are derived by considering

x(t_{n+1}) = x(t_n) + ∫_{t_n}^{t_{n+1}} f(x(t)) dt

For the approximation, f(x(t)) is replaced by the interpolating polynomial p(t) that goes through the points f(x_n), . . . , f(x_{n−p}) at times t_n, . . . , t_{n−p}. The A-B formula with two points is

x_{n+1} = x_n + (Δt/2) ( 3 f(x_n) − f(x_{n−1}) )

and is second order; a runnable sketch appears after this list. (In general you get order p with p points, or p − 1 lags.)

• Adams-Moulton. Adams-Moulton methods take a similar form to Adams-Bashforth but also use the value of f(x_{n+1}). The coefficients are then chosen to get the highest order possible (which, in general, is one higher than Adams-Bashforth).

• Nystrom. Nystrom methods are of the form x_{n+1} = x_{n−1} + Δt Σ β_j f(x_{n−j}). They are derived by integrating over the interval [t_{n−1}, t_{n+1}] (twice as long as A-B). Nystrom methods are often higher order than Adams methods, but they lack the flexibility: the higher order of Nystrom depends on a fixed time step, and thus they are not commonly used.


• Backward Differentiation Formula (BDF). These methods are quite different from Adams-Bashforth in that they use many of the previous x_j values and evaluate f at the next step. Therefore they are implicit. They look like

x_{n+1} = α_0 x_n + · · · + α_p x_{n−p} + β Δt f(x_{n+1})

Backward Euler is an example.
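As promised, a minimal sketch of the two-step Adams-Bashforth method from the list above, bootstrapped with one forward Euler step (test problem ours):

import numpy as np

def ab2(f, x0, T, N):
    dt = T / N
    x_prev, f_prev = x0, f(x0)
    x = x0 + dt * f_prev              # forward Euler bootstrap step; in
                                      # practice use a higher order start
    for _ in range(N - 1):
        f_cur = f(x)                  # the only new evaluation per step
        x, x_prev = x + dt * (3.0 * f_cur - f_prev) / 2.0, x
        f_prev = f_cur
    return x

print(abs(ab2(lambda x: -x, 1.0, 1.0, 200) - np.exp(-1.0)))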

10.7.7 Heuristics

Implicit schemes often have a higher order of accuracy than explicit schemes (in addition to being more stable). A way to gain the accuracy of an implicit scheme without having to solve the implicit equation for f at the current time is to use a “predictor-corrector” scheme. The idea is to use the result of a p-th order explicit scheme as the value of f at the current time in the implicit scheme. The result is often the same order of accuracy as the original implicit scheme.

In contrast with Runge-Kutta, an advantage of LMMs is that they require only one new function evaluation per step. Another difference between the schemes is that Runge-Kutta doesn't really have the same zero-stability considerations. That is, there isn't an equivalent of the root condition for Runge-Kutta.

10.8 Runge-Kutta

Runge-Kutta schemes are another popular family of ODE solvers. They are based on the idea of sampling multiple tangents; that is, you move along the curve to a new spot and sample the tangent there for another approximation to the derivative over the next time step. We make this idea clearer in the next section.

10.8.1 Definition

The stages of a Runge-Kutta method refer to how many of these tangents we sample. A general s stage Runge-Kutta method is defined by

c_1 | a_11  a_12  · · ·  a_1s
c_2 | a_21  a_22  · · ·  a_2s
 ⋮  |  ⋮     ⋮            ⋮
c_s | a_s1  a_s2  · · ·  a_ss
----+------------------------
    | b_1   b_2   · · ·  b_s

or, in compact form,

c | A
--+----
  | b^T

The above mnemonic is known as a Butcher tableau. It specifies the scheme as follows


K_i = f(t_n + c_i ∆t, y_n + ∆t Σ_{j=1}^{s} a_{ij} K_j)    for each i = 1, . . . , s

y_{n+1} = y_n + ∆t Σ_{i=1}^{s} b_i K_i

We see immediately that if a_{ij} ≠ 0 for some j ≥ i then the scheme is implicit. In this case, one might have to solve a system of nonlinear equations in the K_i. We'll get back to implicit schemes later. If A is strictly lower triangular (i.e. if a_{ij} = 0 for j ≥ i), then the scheme is explicit. It is easier to see in this case what the method does. We have

K_i = f(t_n + c_i ∆t, y_n + ∆t Σ_{j=1}^{i−1} a_{ij} K_j)

If we think of K_i as the i-th tangent we sample, then we see that the method uses the previous tangents to step a certain distance along the curve and evaluates f at that point along the curve at the time t_n + c_i ∆t. With this characterization, the consistency condition is clear: c_i = Σ_j a_{ij}.
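
A minimal sketch (my own, with illustrative argument names) of a general explicit Runge-Kutta step driven by a Butcher tableau:

    import numpy as np

    def explicit_rk_step(f, t, y, dt, A, b, c):
        # One step of an explicit RK scheme for y' = f(t, y); A must be
        # strictly lower triangular, b the weights, c the nodes.
        s = len(b)
        K = np.zeros((s,) + np.shape(y))
        for i in range(s):
            # previous tangents only, since A is strictly lower triangular
            y_stage = y + dt * sum(A[i][j] * K[j] for j in range(i))
            K[i] = f(t + c[i] * dt, y_stage)
        return y + dt * sum(b[i] * K[i] for i in range(s))

    # usage: the classical RK4 tableau (see the examples section) on y' = y
    A = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]]
    b, c = [1/6, 1/3, 1/3, 1/6], [0, 0.5, 0.5, 1]
    y = np.array([1.0])
    for n in range(100):
        y = explicit_rk_step(lambda t, y: y, n * 0.01, y, 0.01, A, b, c)
    print(y[0], np.exp(1.0))  # agree to high accuracy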

10.8.2 Implicit Methods

“Dirk, a kind of dagger used in the highlands of Scotland”

Implicit methods allow for any form of A. We note that for arbitrary A this gives a system of implicit equations in the K_i, which is rather undesirable. One simplification is to limit your choices to implicit methods whose matrix A is lower triangular, i.e., a_{ij} = 0 for j > i. These are called diagonally implicit Runge-Kutta (DIRK) methods. In this case, we have an implicit equation for K_1 alone. After we solve for K_1, there is an implicit equation for K_2 alone, and so on. In this situation, finding the K_i is much simpler. To highlight this, we consider the (comparatively rare) case in which the implicit equations are linear. If the spatial dimension is n, then for a general implicit method we must solve a system of linear equations in sn unknowns; with Gaussian elimination this requires s³n³ work. For DIRK, we must solve s linear systems in n unknowns each; with Gaussian elimination this is sn³ work, which is quite a bit better. These counts are in contrast to implicit LMM, where you have only one system of n unknowns for spatial dimension n.
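
Here is a sketch (my own illustration) of the DIRK stage-by-stage solve in the linear case y' = Ly, where each stage reduces to a single n-by-n linear system:

    import numpy as np

    def dirk_step(L, y, dt, A, b):
        # One DIRK step for y' = L*y. Since A is lower triangular, stage i
        # satisfies (I - dt*a_ii*L) K_i = L*(y + dt*sum_{j<i} a_ij*K_j),
        # one linear solve per stage rather than one big sn-by-sn solve.
        s, n = len(b), len(y)
        I = np.eye(n)
        K = np.zeros((s, n))
        for i in range(s):
            rhs = L @ (y + dt * sum(A[i][j] * K[j] for j in range(i)))
            K[i] = np.linalg.solve(I - dt * A[i][i] * L, rhs)
        return y + dt * sum(b[i] * K[i] for i in range(s))

    # usage: backward Euler is the one-stage DIRK with A = [[1]], b = [1]
    print(dirk_step(np.array([[-100.0]]), np.array([1.0]), 0.1, [[1.0]], [1.0]))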

There also exist implicit schemes of very high order. These are the Gauss-Legendre methods, and they use Gaussian quadrature ideas to achieve order 2s with an s stage scheme. They are also A-stable, which will be elaborated on in the section on ODE stability.


10.8.3 Adaptive Methods

We see from the equations that Runge-Kutta is especially easy to set up as an adaptive method, because the previous time steps are not involved in the formulation. The time step is often adjusted adaptively by approximating the error accrued at each step and adjusting the step size up or down based on some threshold for the error. To approximate the error, one can use an embedded Runge-Kutta method. The idea behind this approach is to include two Runge-Kutta schemes in the same tableau (so there are no extra function evaluations). Often, one of the schemes is of a higher order than the other and, for the purposes of the approximation, is taken to give an exact answer. This allows you to approximate the error and adjust the step size on the fly. A famous example of such a method is the Runge-Kutta-Fehlberg method, or RK45. It uses a fourth and a fifth order method given in the following tableau

0     |
1/4   | 1/4
3/8   | 3/32        9/32
12/13 | 1932/2197   −7200/2197   7296/2197
1     | 439/216     −8           3680/513     −845/4104
1/2   | −8/27       2            −3544/2565   1859/4104     −11/40
------+--------------------------------------------------------------------
      | 25/216      0            1408/2565    2197/4104     −1/5      0
      | 16/135      0            6656/12825   28561/56430   −9/50     2/55

where the first row of b_j gives the coefficients of the fourth order method and the second row gives the fifth order method. A simpler example involves forward Euler and Heun's method, which are first and second order respectively. Its extended tableau is

0 |
1 | 1
--+-----------
  | 1      0
  | 1/2    1/2
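
A minimal adaptive stepping sketch (my own illustration) using the Euler/Heun pair above; the safety factor 0.9 and the clipping of the rescale factor are conventional but arbitrary control choices:

    import numpy as np

    def adaptive_euler_heun(f, t0, y0, t_end, dt0, tol=1e-6):
        # Embedded pair: forward Euler (order 1) and Heun (order 2) share
        # the same stages; their difference estimates the local error.
        t, y, dt = t0, np.asarray(y0, dtype=float), dt0
        while t < t_end:
            dt = min(dt, t_end - t)
            k1 = f(t, y)
            k2 = f(t + dt, y + dt * k1)
            y_low = y + dt * k1                 # forward Euler
            y_high = y + dt / 2 * (k1 + k2)     # Heun
            err = np.linalg.norm(y_high - y_low)
            if err <= tol:
                t, y = t + dt, y_high           # accept the step
            # the pair's error estimate behaves like O(dt^2), hence sqrt
            dt *= float(np.clip(0.9 * np.sqrt(tol / max(err, 1e-16)), 0.2, 5.0))
        return t, y

    print(adaptive_euler_heun(lambda t, y: -y, 0.0, [1.0], 1.0, 0.1)[1], np.exp(-1.0))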

10.8.4 Runge-Kutta Examples

The following are some common examples of Runge-Kutta methods

• Explicit Schemes

– The midpoint method uses a tangent that is approximately given at the middle of the interval. It is given by

x_{n+1} = x_n + ∆t f(x_n + (∆t/2) f(x_n))


– A very popular scheme is Runge-Kutta 4, or RK4. It is a fourth order scheme with only 4 stages (the significance of this is explained in the heuristics section). Its tableau is given by:

0   |
1/2 | 1/2
1/2 | 0    1/2
1   | 0    0    1
----+-------------------
    | 1/6  1/3  1/3  1/6

Interestingly, if f has no x dependence, then RK4 is simply Simpson's rule for numerical integration.

• Implicit Schemes

– Backward Euler is the simplest example of an implicit Runge-Kutta scheme (note that it is also a BDF and an A-M method). It is given by

x_{n+1} = x_n + ∆t f(x_{n+1})

– The trapezoidal rule uses tangents sampled at both endpoints. It is quite similar to the trapezoidal rule from numerical integration (and in fact, if f has no x dependence, it is the same). Its Butcher tableau is given as

0 | 0    0
1 | 1/2  1/2
--+----------
  | 1/2  1/2

It is not immediately obvious that this is equivalent to the trapezoidal rule you get from Adams-Moulton. We have

K_1 = f(t_n, y_n)
K_2 = f(t_{n+1}, y_n + (∆t/2) f(y_n) + (∆t/2) K_2)

However, we note that this expression can be simplified by considering the equation for the next step. We have

y_{n+1} = y_n + (∆t/2) f(y_n) + (∆t/2) K_2

Comparing with the equation for K_2, we see that

K_2 = f(t_{n+1}, y_{n+1})

and thus we have the familiar formula

y_{n+1} = y_n + (∆t/2)(f(y_n) + f(y_{n+1}))
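
A quick numerical check (my own illustration) of this equivalence on the linear model problem y' = λy, where both forms reduce to multiplication by (1 + z/2)/(1 − z/2) with z = λ∆t:

    import numpy as np

    lam, dt, y = -2.0, 0.1, 1.0
    z = lam * dt
    # RK form: K1 = lam*y and K2 solves K2 = lam*(y + dt/2*K1 + dt/2*K2)
    K1 = lam * y
    K2 = lam * (y + dt / 2 * K1) / (1 - lam * dt / 2)
    y_rk = y + dt / 2 * (K1 + K2)
    # Adams-Moulton form, solved exactly for y_{n+1}
    y_am = (1 + z / 2) / (1 - z / 2) * y
    print(y_rk, y_am)  # identical up to rounding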

10.8.5 Heuristics

We note that it is generally possible to get an implicit s stage scheme of much higher order than the highest order explicit s stage scheme (for consistent schemes). There are implicit methods based on Gaussian quadrature which are of order 2s with s stages. On the other hand, the following maximum orders are known for explicit schemes:

number of stages | 1  2  3  4  5  6  7  8  9  10
maximum order    | 1  2  3  4  4  5  6  6  7  7

10.9 Stiff Problems

10.9.1 The Model Problem

Suppose we are trying to solve ẏ = λy for λ a complex scalar, with initial condition y(0) = y_0. Then we have the solution

y(t) = y_0 e^{λt}

If ℜ(λ) < 0 then lim_{t→∞} |y(t)| = 0, and if ℜ(λ) > 0 then lim_{t→∞} |y(t)| = ∞. Further, for purely imaginary λ, we have an oscillatory solution for which |y(t)| is bounded for all t. Thus, to get qualitatively correct solutions, we hope that our numerical methods exhibit similar behavior when applied to the model problem; especially, we want bounded solutions (stability) for λ such that ℜ(λ) ≤ 0. This is of particular importance for stiff problems, which we define below.

10.9.2 Definition/ Intuition

If your problem is given by ẋ = Ax for A a matrix, or by ẋ = F(x), then we define the notion of a stiff problem as follows. Consider the eigenvalues of A or of [JF] (the Jacobian) with negative real part; if the ratio of the largest magnitude eigenvalue to the smallest is large, then you have a stiff problem. Why is a problem with a single large magnitude eigenvalue with negative real part not necessarily stiff? Consider the problem ẏ = −500y, which has solution y = e^{−500t}. This function decays extremely rapidly, and thus we are likely to be concerned with the value of the function only over short time intervals. Thus, taking small time steps is not quite as big of an issue for this problem. If, however, there are relatively large and small eigenvalues with negative real part, then we could be concerned with the long time behavior of the components of the solution (generally speaking, in the direction of the eigenvector) for the smaller eigenvalue, while desiring to correctly resolve the components for the larger eigenvalue. Thus, we would end up using short time steps over a long interval. This gives the stiffness.


A related characterization of stiffness is then: a problem is stiff on the interval [t_0, T] if there exists λ_k such that

(T − t_0) ℜ(λ_k) ≪ −1

where λ_k is an eigenvalue of [JF].

A final characterization of stiffness is that the requirements on the time step to have a small truncation error are weaker than the requirements for the local numerical trajectories to have the same behavior as the local trajectories of the exact solution. In particular, if ℜ(λ) = 0 the desired qualitative behavior is a bounded solution; if ℜ(λ) < 0 the desired qualitative behavior is a decaying solution.

10.9.3 Some Examples of Stiff Problems

An artificial stiff problem is given by

A = [ −100    0    0 ]
    [    0   −1   −1 ]
    [    0    1   −1 ]

which has eigenvalues −100, −1 + i, −1 − i. This has a solution which decays extremely quickly in the direction (1, 0, 0) and which oscillates and decays slowly in the other directions.

Stiff problems arise in PDEs. In particular, we note that the matrix associated with the discrete Laplacian is stiff. We consider the interval [0, 1] with homogeneous Dirichlet boundary conditions in 1-D; the centered differencing discrete Laplacian is given by

∆_h u_j = (u_{j+1} − 2u_j + u_{j−1}) / ∆x²

so the matrix we get for the semi-discretization scheme (see numerical PDE section) is given by

A = (1/∆x²) [ −2    1    0   · · ·   0 ]
            [  1   −2    1           ⋮ ]
            [  0    1    ⋱    ⋱        ]
            [  ⋮         ⋱    ⋱      1 ]
            [  0   · · ·      1     −2 ]  =  (1/∆x²) B

We will consider the eigenvalues of B, as the ratios of the eigenvalues will be the same as the ratios for A. It is immediate by the Gershgorin circle theorem that B is negative semi-definite. Thus, it has real, non-positive eigenvalues. Because finite difference operators are translation invariant,


we consider discrete Fourier transform basis functions for diagonalizing them. Let these vectors be given by v_j and suppose that there are m + 2 points in our discretization (counting the boundary). Because of the boundary condition, we note that we would like (v_j)_0 = (v_j)_{m+1} = 0. This leads us to choose functions of the form

(v_j)_k = sin(πjk/(m + 1))

Now, we check what the discrete operator does to these functions. Writing (v_j)_k = (e^{πijk/(m+1)} − e^{−πijk/(m+1)})/(2i), we have

∆_h (v_j)_k = [(v_j)_{k+1} − 2(v_j)_k + (v_j)_{k−1}] / ∆x²
            = (1/∆x²) [ (e^{πi(k+1)j/(m+1)} − e^{−πi(k+1)j/(m+1)})/(2i) − 2(v_j)_k + (e^{πi(k−1)j/(m+1)} − e^{−πi(k−1)j/(m+1)})/(2i) ]
            = (1/∆x²) [ e^{πij/(m+1)} (v_j)_k + e^{−πij/(m+1)} (v_j)_k − 2(v_j)_k ]
            = (1/∆x²) [ 2 cos(πj/(m + 1)) − 2 ] (v_j)_k
            = −(1/∆x²) 4 sin²(πj/(2(m + 1))) (v_j)_k

So we see that v_j is an eigenvector of B with eigenvalue

−4 sin²(πj/(2(m + 1)))

for j = 1, . . . , m. (Note that for j = 0, m + 1 the vector v_j is zero and hence not an eigenvector.) Further, our matrix is m × m, so we have a complete basis of eigenvectors and we have a formula for each eigenvalue. Now, we note that the largest magnitude eigenvalue is given by j = m and is approximately −4 for large enough m. The smallest magnitude eigenvalue is given by j = 1 and is of magnitude O(1/m²) for large enough m. Thus, their ratio is of order m², making this matrix quite stiff for finer discretizations.
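
A quick check (my own illustration) that the computed spectrum of B matches this formula and that the stiffness ratio grows like m²:

    import numpy as np

    m = 50
    B = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
         + np.diag(np.ones(m - 1), -1))
    eigs = np.sort(np.linalg.eigvalsh(B))
    j = np.arange(1, m + 1)
    formula = np.sort(-4 * np.sin(np.pi * j / (2 * (m + 1)))**2)
    print(np.max(np.abs(eigs - formula)))           # agreement to rounding
    print(np.abs(eigs).max() / np.abs(eigs).min())  # stiffness ratio, O(m^2)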

10.10 Stability Analysis (Especially for Stiff Problems)

10.10.1 The Stability Region

We define the stability region of a numerical method based on the model problem described above for stiff equations. Let the numerical method be applied to the problem ẏ = λy with y(0) = y_0 and step size h. The set R ⊂ C given by

R = { z = λh : ‖y_k‖ is bounded }

where y_k are the iterates of the numerical method, is the region of absolute stability. The concept is clearer with examples (see below).


10.10.2 A-Stability and A-α Stability

A numerical method is called A-stable if { z : ℜ(z) < 0 } ⊂ R for its stability region R. These methods are desirable because there is no time-step restriction for λ∆t to be in the stability region if ℜ(λ) ≤ 0. Thus, these methods are useful for stiff equations.

A related notion is A-α stability. A numerical method is A-α stable if { z = re^{iθ} : |θ + π| < α } ⊂ R; that is, if the wedge that makes an angle of α above and below the negative real axis in the left half-plane is a subset of the stability region. These methods are especially useful for symmetric stiff problems (e.g. solving the heat equation), as the eigenvalues are real and negative, so the value of λ∆t will be in the region for any time step.

Here are some facts without proof:

• There are no A-stable explicit methods for either LMM or Runge-Kutta. This property will be clear for Runge-Kutta after the discussion below: the stability regions of these methods are determined by the sets where a polynomial is bounded by 1, which is necessarily a bounded set for non-constant polynomials.

• The s stage Gauss-Legendre methods (which are implicit Runge-Kutta) are A-stable and of order 2s. Thus, there are implicit Runge-Kutta methods of arbitrarily high order. By contrast:

• The Second Dahlquist Barrier: there are no A-stable explicit linear multistep methods, and the implicit linear multistep methods which are A-stable have order of convergence at most 2. The trapezoidal rule has the smallest error constant amongst the A-stable linear multistep methods of order 2.

10.10.3 Runge Kutta Stability Regions

Because of the form of explicit Runge-Kutta methods, their stability regions are defined by a set where a polynomial is bounded. For instance, if we apply forward Euler to the model problem, we have

y_k = y_{k−1} + λh y_{k−1} = (1 + λh) y_{k−1}

Thus, we see that the stability region is given by the z such that |1 + z| ≤ 1, which is the disc of radius 1 centered at −1.

For the midpoint method, we have

y_k = y_{k−1} + λh (y_{k−1} + (h/2) λ y_{k−1}) = (1 + λh + (λh)²/2) y_{k−1}

Thus, the stability region is given by the z such that |1 + z + z²/2| ≤ 1. The higher order Runge-Kutta methods have ever larger stability regions, some of which contain portions of the imaginary axis (which is useful for skew-symmetric problems).
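
Since explicit RK stability regions are just sublevel sets of a polynomial, they are easy to compute on a grid; a sketch (my own illustration, suitable for feeding to a contour plotter):

    import numpy as np

    def stability_region_mask(coeffs, x, y):
        # Boolean mask of {z : |R(z)| <= 1} where R(z) = sum_k coeffs[k]*z^k
        # is the stability polynomial of an explicit RK method.
        Z = x[None, :] + 1j * y[:, None]
        R = sum(c * Z**k for k, c in enumerate(coeffs))
        return np.abs(R) <= 1.0

    x, y = np.linspace(-3, 1, 400), np.linspace(-2, 2, 400)
    euler = stability_region_mask([1, 1], x, y)          # R(z) = 1 + z
    midpoint = stability_region_mask([1, 1, 0.5], x, y)  # R(z) = 1 + z + z^2/2
    print(euler.sum(), midpoint.sum())  # the midpoint region is larger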


As mentioned above, there are implicit Runge-Kutta methods which are A-stable. An example is the trapezoidal rule. When applied to the model problem, we have

y_{n+1} = y_n + (∆t/2)(λ y_n + λ y_{n+1})
(1 − ∆tλ/2) y_{n+1} = (1 + ∆tλ/2) y_n
y_{n+1} = [(1 + ∆tλ/2)/(1 − ∆tλ/2)] y_n

Thus, the stability region is given by those z such that

|(1 + z/2)/(1 − z/2)| ≤ 1

which is precisely the left half-plane.

10.10.4 LMM Stability Regions

Finding the stability region of an LMM is related to satisfying the root condition. We see that for the model problem

x_{n+1} = α_0 x_n + · · · + α_p x_{n−p} + ∆t (β_{−1} λ x_{n+1} + β_0 λ x_n + · · · + β_p λ x_{n−p})

Let z = λ∆t. We again have a recursion polynomial, but this time it depends on z. Let

ρ(ζ) = ζ^{p+1} − α_0 ζ^p − · · · − α_p
σ(ζ) = β_{−1} ζ^{p+1} + β_0 ζ^p + · · · + β_p

Then the recursion polynomial is

p(ζ) = ρ(ζ) − z σ(ζ)

We see then that the stability region is given by

{ z : p(ζ) satisfies the root condition }

This concept is much clearer when considering an example method. We will calculate the stability region of the modified midpoint method

x_{n+1} = x_{n−1} + 2∆t f(x_n)

In this case, we have p = 1 and


ρ(ζ) = ζ² − 1,    σ(ζ) = 2ζ

This gives that the recurrence polynomial is

p(ζ) = ζ² − 2zζ − 1

This polynomial has roots given by

ζ = (2z ± √(4z² + 4))/2 = z ± √(z² + 1)

For the root condition to be satisfied, we need ℜ(z) = 0: the product of the roots is −1, so if they do not both lie on the unit circle, one of them has magnitude larger than one, and this happens whenever ℜ(z) ≠ 0. For similar reasons, it is clear that any purely imaginary z in the stability region must have magnitude less than or equal to 1. If z = ±i, we have a double root of magnitude 1, so the root condition is not satisfied. Thus, the stability region is the purely imaginary interval i · (−1, 1). From this, we see that the modified midpoint rule might be suited to the case where we have a skew-symmetric matrix (as it will have purely imaginary eigenvalues) on the right hand side of our ODE. On the other hand, it is not well-suited for stiff problems at all.
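
The root condition test is mechanical enough to automate; here is a sketch (my own illustration, with a crude tolerance-based test for repeated roots on the unit circle):

    import numpy as np

    def in_stability_region(z, rho, sigma, tol=1e-9):
        # rho, sigma: coefficients of the LMM polynomials, highest degree
        # first. z is in the stability region iff the roots of
        # p(zeta) = rho(zeta) - z*sigma(zeta) lie in the closed unit disc,
        # with any roots on the circle being simple.
        p = np.polysub(np.asarray(rho, dtype=complex),
                       z * np.asarray(sigma, dtype=complex))
        roots = np.roots(p)
        if np.any(np.abs(roots) > 1 + tol):
            return False
        boundary = roots[np.abs(np.abs(roots) - 1) <= tol]
        for i in range(len(boundary)):
            for r in boundary[i + 1:]:
                if abs(boundary[i] - r) <= tol:  # repeated boundary root
                    return False
        return True

    # modified midpoint: rho = zeta^2 - 1, sigma = 2*zeta
    print(in_stability_region(0.5j, [1, 0, -1], [0, 2, 0]))  # True
    print(in_stability_region(-0.1, [1, 0, -1], [0, 2, 0]))  # False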

Backwards Differentiation Formulas (BDF) often have nice stability regions (though, as noted above, they are necessarily not A-stable for order greater than 2 by the second Dahlquist barrier). A simple example is backward Euler. It is given by

y_{n+1} = y_n + ∆t f(y_{n+1})

We have the following

ρ(ζ) = ζ − 1,    σ(ζ) = ζ

So the recurrence polynomial is given by

p(ζ) = (1 − z)ζ − 1

which has a root given by ζ = 1/(1 − z). We would like this root to be bounded by 1; thus the stability region is given by

{ z : |1 − z| ≥ 1 }

That is, it is the region outside of the disc of radius 1 centered at 1 (the stability region includes the boundary of this disc). This includes the left half-plane, and thus backward Euler is A-stable. The shape of this


region is typical of the stability regions for BDF: they are defined outside of some bounded shape (which lies mostly in the right half-plane). However, for BDF of order 3 and higher, these shapes creep into the left half-plane, and thus these BDF schemes are not A-stable. They are, however, A-α stable. This property makes BDF an attractive alternative to implicit Runge-Kutta schemes for symmetric problems, as you can have a high order scheme with just one implicit equation to solve that still has a suitable stability region.

11 Numerical PDE

11.1 Finite Differences for Elliptic PDE

The prototypical elliptic PDE is Poisson's equation −∆u = f. The standard approach for solving this problem with finite differences is to discretize the problem on a uniform grid and use a stencil at the interior points. For the 1-D case, we have

∆_h u_j = (u_{j+1} − 2u_j + u_{j−1}) / h²

and for the 2-D case we have the 5 point stencil

∆_h u_{i,j} = (u_{i,j+1} + u_{i+1,j} − 4u_{i,j} + u_{i,j−1} + u_{i−1,j}) / h²

Ignoring boundary conditions, this process gives us a matrix equation to solve for u. We have

Au = f

where A is the matrix corresponding to the discrete Laplacian ∆_h. The tools for solving this problem have been discussed above, but we will summarize the options. First, we note that, as calculated above, the matrix is very poorly conditioned. In the 1-D case we calculated that the condition number was on the order O(m²), where m is the number of points in the discretization. Thus, the convergence of methods like Gauss-Seidel can be quite slow (especially the asymptotic rate of convergence; the initial rate of convergence can be quite good, and this fact is exploited in multigrid solvers). Because we have an explicit expression for the eigenvalues, it is possible to determine an optimal choice of ω for the SOR algorithm, which can be quite efficient here. However, if we are being serious, a full multigrid scheme or a preconditioned conjugate gradient scheme is likely the best choice.

Dealing with boundary conditions is quite tedious and technical and is thus ignored here.

11.2 Finite Differences for Equations of Evolution (Parabolic and Hyperbolic PDE)

We will discuss the approach for equations of evolution in terms of semi-discretization schemes. The idea is that we come up with a spatial differencing scheme for the spatial derivatives to get a


system of ODEs. For example, consider the case of the heat equation

u_t = ∆u

If we discretize the spatial derivatives using a stencil (say we are in 2-D), we have

u_t = Au

where A is the matrix corresponding to the discrete Laplacian ∆_h seen in the elliptic section above. This approach of using a spatial discretization to turn the PDE into a system of ODEs is ubiquitous in numerical PDE; a sketch is given below.
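
A small method-of-lines sketch (my own illustration) for the 1-D heat equation with homogeneous Dirichlet conditions, time stepped with backward Euler; a dense solve is used for brevity, where a tridiagonal or sparse solver would be the realistic choice:

    import numpy as np

    def heat_mol_backward_euler(u0, dx, dt, n_steps):
        # Semi-discretize u_t = u_xx with the standard stencil to get
        # u' = A u, then step with backward Euler: (I - dt*A) u^{n+1} = u^n.
        m = len(u0)
        A = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
             + np.diag(np.ones(m - 1), -1)) / dx**2
        M = np.eye(m) - dt * A
        u = np.asarray(u0, dtype=float)
        for _ in range(n_steps):
            u = np.linalg.solve(M, u)
        return u

    # usage: the mode sin(pi*x) decays like exp(-pi^2 t)
    m = 99
    x = np.linspace(0, 1, m + 2)[1:-1]
    u = heat_mol_backward_euler(np.sin(np.pi * x), x[1] - x[0], 1e-3, 100)
    print(u.max(), np.exp(-np.pi**2 * 0.1))  # close, up to O(dt) + O(dx^2)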

As another example, we can use similar ideas to solve the transport equation u_t + s u_x = 0. We consider a few spatial discretizations (we assume s > 0):

(u_j)_x ≈ (u_{j+1} − u_{j−1}) / (2∆x)    "centered difference"
(u_j)_x ≈ (u_j − u_{j−1}) / ∆x           "upwinding scheme"
(u_j)_x ≈ (u_{j+1} − u_j) / ∆x           "downwinding scheme"

As we will see in the following sections, the choice of spatial discretization can have a serious impact on the choice of our ODE solver and the success of the method. Often, stability concerns dictate the choice of ODE solver used for a problem.
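
As a small illustration (my own, with a periodic boundary for simplicity), here is the upwind semi-discretization time stepped with forward Euler; it is stable under the CFL-type restriction s∆t/∆x ≤ 1:

    import numpy as np

    def advect_upwind(u0, s, dx, dt, n_steps):
        # u_t + s*u_x = 0 with s > 0, periodic boundary; the upwind
        # difference (u_j - u_{j-1})/dx respects the flow direction.
        u = np.asarray(u0, dtype=float)
        for _ in range(n_steps):
            u = u - s * dt / dx * (u - np.roll(u, 1))
        return u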

Our final example is the wave equation u_tt − u_xx = 0. We note that we could change this into a coupled system of transport equations for u_t and u_x, but it seems much more desirable to compute u itself. The following approach can be found in Iserles. We consider, in addition to u, a dummy function v. We set up a coupled system of advection equations

u_t + v_x = 0,    v_t + u_x = 0

And we note that u_tt = (u_t)_t = (−v_x)_t = (−v_t)_x = u_xx. This coupled system can then be solved by diagonalizing and using a solver for each advection problem. Another approach (whose effectiveness I'm unsure of) is to view this as a second order semi-discretization problem and solve it in the obvious way, i.e., setting v = u_t and getting the coupled system v̇ = u_xx, u̇ = v. I suspect that the smoothing properties of the heat equation make this a bad approach.

Nonlinear problems present difficulties beyond the scope of these notes.


11.3 Stability, etc.

11.3.1 For Semi-Discretization

If we have a semi-discretization for our PDE and the matrix given by it is well understood, we can use the considerations for the model problem and stiff ODEs to determine which ODE solver to choose. We use some examples:

• Advection equation with centered differencing in space: we note that the matrix obtained from centered differencing looks like

(1/2∆x) [  0   1         ]
        [ −1   0   1     ]
        [     −1   0   ⋱ ]
        [          ⋱   ⋱ ]

and is thus skew-symmetric. These matrices have purely imaginary eigenvalues. Thus, we see that using forward Euler as the ODE solver for the time stepping is a bad choice (the step size can never be small enough for stability). Interestingly, the leap-frog (or modified midpoint) method is suitable for sufficiently short time steps (as its stability region includes part of the imaginary axis). Finally, most implicit methods are a good choice here.

• Heat equation in 1-D with the typical stencil. As noted in the other sections, this matrix is symmetric and has negative eigenvalues which are O(N²) and O(1). Thus, the problem is quite stiff. Because the eigenvalues of the matrix are all real, an A-α stable method is sufficient for stability. We note then that the BDF methods are a good choice if we want a higher order method (higher order implicit RK methods have good stability regions as well, but their implicit equations are more difficult to solve than those given by BDF).

11.3.2 Von Neumann Analysis

11.3.3 CFL Conditions

11.3.4 Lax Equivalence

11.4 Finite Element Methods for Elliptic PDE

A finite element method is generally formulated according to the following outline:

• Come up with the variational formulation of the problem.

• Discretization of the problem using finite elements: construction of the finite dimensional space V_h.

• Solution of the discrete problem (often, writing a linear system which can be solved).


• Implementation of the method on a computer.

There are a number of advantages to the FEM when compared to traditional finite difference methods. In particular, it is easier to handle complicated geometries, general boundary conditions, and variable or nonlinear material properties.

11.4.1 The Model Problem

We consider the problem

(D)    −u″ = f,    u(0) = u(1) = 0

and we will follow the basic steps of an FEM for solving this problem.

• Minimization/ Variational Formulation.

We note that (D) has solutions. This can be seen by integrating the equation twice. Let

V = { v : v ∈ C[0, 1], v′ piecewise continuous and bounded on [0, 1], v(0) = v(1) = 0 }

and let

F(v) = (1/2)(v′, v′) − (f, v)

where (·, ·) is the inner product on L²[0, 1]. Then we define the following two problems:

(M) find u ∈ V s.t. F(u) ≤ F(v) for all v ∈ V
(V) find u ∈ V s.t. (u′, v′) = (f, v) for all v ∈ V

We see that a solution u of (D) is also a solution of (V) by integration by parts:

(−u″, v) = (f, v) ⇒ (u′, v′) = (f, v)

Further, we note that (V) and (M) have the same solutions. This is simple to verify. It is also easy to verify that solutions of (V) are unique. If we have a function u which solves (V) and has a continuous second derivative, then we can integrate by parts to see that u solves (D). Finally, if u solves (V), then u″ is continuous. Hence, (V), (M), and (D) are equivalent and have unique solutions.


• Discretization of the Problem.

If we were to discretize the minimization problem (M), we would have what is called a Ritz method. We will give an example of discretizing the variational problem, called a Galerkin method. The idea is to come up with a finite dimensional space V_h which approximates the original space V. In particular, we take V_h ⊂ V. The first space V_h which we will consider consists of piecewise linear functions. We let

0 = x_0 < x_1 < · · · < x_M < x_{M+1} = 1

be a partition of the interval into subintervals I_j = (x_{j−1}, x_j); these discrete sub-domains are the elements of the finite element method. We now let V_h be the set of functions which are linear on each interval, continuous, and satisfy the Dirichlet boundary conditions. These functions are completely determined by their values at the points x_1, . . . , x_M. This guides us in finding a set of basis functions. Let ϕ_j(x) be the "tent" function which satisfies ϕ_j(x_i) = δ_{ij}. [Figure: the tent function ϕ_3 on a grid where the x-axis takes the value x_i at the point i.]

Any function v ∈ V_h can then be written as

v(x) = Σ_{i=1}^{M} η_i ϕ_i(x)

In particular, V_h has dimension M.

• Solve the discrete problem.

Because the ϕj form a basis of Vh, we can solve the discrete variational problem


(V_h) find u ∈ V_h s.t. (u′, v′) = (f, v) for all v ∈ V_h

by solving the equivalent problem

(V_h′) find u ∈ V_h s.t. (u′, ϕ_j′) = (f, ϕ_j) for each ϕ_j

This problem is equivalent by linearity. Further, we can write u = Σ ξ_j ϕ_j, and we see that u is completely determined by its coefficients. We then note that the above problem gives us a linear system of M equations for the ξ_j. Let ξ = (ξ_1, . . . , ξ_M)^T; then we have Aξ = b where

A = [ (ϕ_1′, ϕ_1′)  · · ·  (ϕ_1′, ϕ_M′) ]        b = [ (f, ϕ_1) ]
    [      ⋮                    ⋮       ]            [    ⋮     ]
    [ (ϕ_M′, ϕ_1′)  · · ·  (ϕ_M′, ϕ_M′) ]            [ (f, ϕ_M) ]

If we assume a uniform grid spacing h, the matrix A has a simple form. We note that ϕ_j′ = 1/h on [x_{j−1}, x_j] and ϕ_j′ = −1/h on [x_j, x_{j+1}]; the derivative is zero otherwise. This gives that

(ϕ_j′, ϕ_j′) = ∫_{x_{j−1}}^{x_{j+1}} 1/h² dx = 2/h
(ϕ_j′, ϕ_{j−1}′) = ∫_{x_{j−1}}^{x_j} −1/h² dx = −1/h
(ϕ_j′, ϕ_{j+1}′) = ∫_{x_j}^{x_{j+1}} −1/h² dx = −1/h
(ϕ_j′, ϕ_i′) = 0 for |j − i| > 1

So A has the form (shown here for M = 5)

A = (1/h) [  2  −1   0   0   0 ]
          [ −1   2  −1   0   0 ]
          [  0  −1   2  −1   0 ]
          [  0   0  −1   2  −1 ]
          [  0   0   0  −1   2 ]

By the Gershgorin circle theorem, it is clear that A is positive semi-definite. For u = Σ ξ_i ϕ_i, we have that

0 ≤ (u′, u′) = ξ^T A ξ

with equality only when u ≡ 0. Thus, A is positive definite and Aξ = b can be solved.


• Computer implementation.

We will not get too far into the details of implementing this specific problem, as it is quite simple. The computer implementation consists of populating the matrix A and the vector b and using a numerical linear algebra algorithm to solve for ξ. Because the matrix is tridiagonal, an LU decomposition can be done in O(M) work and is quite easy to implement; a sketch follows.
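
Here is a sketch of that implementation (my own illustration; the load integrals (f, ϕ_j) are approximated by h·f(x_j), with per-element Gaussian quadrature being the more careful choice, and a dense solve stands in for the O(M) tridiagonal LU):

    import numpy as np

    def fem_1d_poisson(f, M):
        # Piecewise linear FEM for -u'' = f on [0,1], u(0) = u(1) = 0,
        # on a uniform grid with M interior nodes and h = 1/(M+1).
        h = 1.0 / (M + 1)
        x = np.linspace(h, 1 - h, M)                  # interior nodes
        A = (np.diag(2.0 * np.ones(M)) - np.diag(np.ones(M - 1), 1)
             - np.diag(np.ones(M - 1), -1)) / h
        b = h * f(x)                                  # approximate (f, phi_j)
        return x, np.linalg.solve(A, b)

    # usage: f = pi^2*sin(pi*x) gives exact solution u = sin(pi*x)
    x, u = fem_1d_poisson(lambda x: np.pi**2 * np.sin(np.pi * x), 99)
    print(np.max(np.abs(u - np.sin(np.pi * x))))      # small, O(h^2)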

11.4.2 Error Estimates for FEM

• For the model problem.

We derive an error estimate for the model problem using the discretization above. Let u be the solution of (D) and u_h be the solution of (V_h). Because V_h ⊂ V, we have that

((u − u_h)′, v′) = (f, v) − (f, v) = 0

for all v ∈ V_h. We now claim that for any v ∈ V_h we have

‖(u − u_h)′‖ ≤ ‖(u − v)′‖

To establish this fact, we use that ((u − u_h)′, w′) = 0 for all w ∈ V_h. Let w = u_h − v ∈ V_h. Then

‖(u − u_h)′‖² = ((u − u_h)′, (u − u_h)′) + ((u − u_h)′, w′)
             = ((u − u_h)′, (u − v)′)
             ≤ ‖(u − u_h)′‖ ‖(u − v)′‖    by Cauchy-Schwarz

and dividing by ‖(u − u_h)′‖ gives the claim. Thus, we can come up with an upper bound on the error by considering the error for a particular choice of function v_h ∈ V_h. If we choose v_h to be the linear interpolant of u, then by a standard Taylor expansion argument one can see that we have the following bounds

|u′(x) − v_h′(x)| ≤ h max_{y∈[0,1]} |u″(y)|

|u(x) − v_h(x)| ≤ (h²/8) max_{y∈[0,1]} |u″(y)|

pointwise where the derivative of v_h is defined. Thus, the bound on ‖(u − u_h)′‖ gives us

‖(u − u_h)′‖ ≤ ‖(u − v_h)′‖ ≤ h max_{y∈[0,1]} |u″(y)|

and thus by integration we have


|u(x) − u_h(x)| ≤ h max_{y∈[0,1]} |u″(y)|

We note that this bound is not as good as the one given for v_h. It can be shown that in this case, the FEM solution is actually O(h²) accurate. At any rate, we have convergence if u″ is bounded.

11.4.3 What about Higher Dimensions?

We discuss the case of the 2D Poisson equation here. We would like to solve

−∆u = f in Ω,    u = 0 on Γ = ∂Ω

Following much the same route as in the 1D case, we arrive at the following variational formulation. Let V = { v : v is continuous and piecewise differentiable with v|_Γ = 0 }. For any v ∈ V, we have

∫_Ω −∆u v dx = ∫_Ω f v dx
∫_Ω ∇u · ∇v dx − ∫_Γ v ∇u · n dS = ∫_Ω f v dx
∫_Ω ∇u · ∇v dx = ∫_Ω f v dx
a(u, v) = (f, v)

where a(u, v) and (f, v) are given by the left and right hand sides of the line above, respectively, and the boundary integral vanishes because v = 0 on Γ. Thus, the variational formulation of our problem is to find u ∈ V such that a(u, v) = (f, v) for all v ∈ V.

For the discretization, we need to define what it is to be a triangulation. Let T_h = {K_1, . . . , K_M} be a set of non-overlapping triangles. We say that T_h is a triangulation of Ω if Ω = ⋃ K_n and no vertex of one triangle lies on the edge of another. Often, one denotes the mesh parameter

h = max_{K∈T_h} (length of the longest side of K)

We define our finite dimensional subspace V_h as

V_h = { v ∈ V : v|_K is linear for each K ∈ T_h }


Because v = 0 on the boundary for v ∈ V_h, a function v ∈ V_h is then determined by its values on the interior nodes, call them N_j. We again define tent functions φ_j for each node N_j; in this case, they look like a 2D "tent" over the node (image from http://boxc.sourceforge.net/examples/boxer/fem.html).

The support of φ_j is then restricted to the triangles with common node N_j. As before, we must set up a matrix with entries A_{ij} = a(φ_i, φ_j). We note that

a(φ_i, φ_j) = Σ_{K∈T_h} a_K(φ_i, φ_j)

where

a_K(φ_i, φ_j) = ∫_K ∇φ_i · ∇φ_j dx

In 2D, the values of a(φ_i, φ_j) are normally found this way. We note that a_K(φ_i, φ_j) ≠ 0 if and only if N_i and N_j are vertices of K. Thus, we can compute a "stiffness" matrix for each triangle that is 3 × 3, because a triangle has 3 vertices. Then adding the contributions from each of these element matrices gives the full system.

We note that there is a new difficulty in higher dimensions when it comes to getting good error bounds. We will not get into it here, but in order to show that the linear interpolant of u, which is in V_h, has the desired accuracy, we require that the triangles do not get too thin.


11.4.4 Abstract Existence/Uniqueness of the FEM

11.4.5 Finite Element Examples

12 Fast Algorithms in Potential Theory and Linear Algebra

12.1 Linear Elliptic PDE and the FMM: A Paradigm

12.2 Fast Multipole Methods

12.3 Hierarchical Matrix Compression

12.4 The Fast Fourier Transform

12.4.1 The DFT

12.4.2 Relation to Fourier Series (Aliasing)

12.4.3 Chebyshev Polynomials

12.4.4 The Cooley-Tukey Algorithm

12.4.5 Applications

• Theoretical: The DFT can be used to explain the spectral convergence attained by the trapezoidal rule applied to periodic, analytic functions. For simplicity, we take f to be analytic and periodic on the interval [0, 1]. Because f is analytic, there is an open ball about each point of [0, 1] in which f is equal to its Fourier series. These balls form an open cover of the interval, which is compact, so there is a finite subcover. Taking the minimum radius of convergence over these balls, we find that f can be extended to an analytic function on C which converges in a region that includes a buffer of positive width about [0, 1]. We know that the Fourier series of f converges rapidly to f. Its coefficients are given by

f_α = ∫_0^1 e^{−2πiαt} f(t) dt

By a use of Cauchy's integral theorem and the periodicity of f, we deform the contour of the integral above or below the real axis to get that

|f_α| ≤ M e^{−c|α|}

Finally, we note that the trapezoidal rule gives

∫_0^1 f(t) dt ≈ (1/N) Σ_{n=0}^{N−1} f(n/N) = f_0^N


By the aliasing rule we have

f_0^N = f_0 + Σ_{k∈Z∖{0}} f_{kN}

This gives that |f_0^N − f_0| ≤ C_2 e^{−cN}. Finally, we note that f_0 is equal to the integral we are evaluating, so we have established that the trapezoidal rule is spectrally accurate for this function.
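
A quick numerical demonstration (my own illustration) of this spectral accuracy; the integral of e^{sin 2πt} over one period is the modified Bessel value I_0(1):

    import numpy as np

    f = lambda t: np.exp(np.sin(2 * np.pi * t))   # periodic and analytic
    exact = 1.2660658777520084                    # I_0(1)
    for N in [4, 8, 16, 32]:
        approx = np.mean(f(np.arange(N) / N))     # trapezoidal rule on [0,1]
        print(N, abs(approx - exact))             # error decays exponentially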

• Convolutions and Circulant Matrices. Circulant matrices are diagonalized by the DFT, so a circulant matrix can be applied to a vector in O(N log N) operations with the FFT; this is the same computation as a periodic convolution (see the sketch after this list). Further, a Toeplitz matrix (which has the same value on each diagonal) can be embedded into a circulant matrix of double the size.

• PDE with periodic boundary conditions

• Spectral Integration and Two-Point Boundary Value Problems
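
As referenced in the circulant matrices item above, here is a sketch (my own illustration) of applying a circulant matrix via the FFT; the matvec is a periodic convolution with the first column:

    import numpy as np

    def circulant_matvec(col, x):
        # C x where C is circulant with first column col: a periodic
        # convolution, diagonalized by the DFT.
        return np.real(np.fft.ifft(np.fft.fft(col) * np.fft.fft(x)))

    col = np.array([4.0, 1.0, 0.0, 1.0])   # first column of a 4x4 circulant
    x = np.array([1.0, 2.0, 3.0, 4.0])
    C = np.array([[col[(i - j) % 4] for j in range(4)] for i in range(4)])
    print(C @ x, circulant_matvec(col, x))  # identical up to rounding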

13 References

I used the following sources in creating these notes:

• Numerical Linear Algebra by Trefethen and Bau

• Numerical Analysis of Differential Equations by Arieh Iserles

• Numerical Computing with IEEE Arithmetic by Michael Overton

• Numerical Methods by Germund Dahlquist and Ake Bjorck

• Numerical Solution of Partial Differential Equations by the Finite Element Method by Claes Johnson

• Multilevel Compression of Linear Operators: Descendants of Fast Multipole Methods and Calderon-Zygmund Theory by Per-Gunnar Martinsson and Mark Tygert

• Iterative Methods for Sparse Linear Systems by Yousef Saad, http://www.stanford.edu/class/cme324/saad.pdf

• An Introduction to the Conjugate Gradient Method Without the Agonizing Pain by Jonathan Richard Shewchuk, http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf

• A Short Course on Fast Multipole Methods by Rick Beatson and Leslie Greengard

• Spectral Integration and Two-Point Boundary Value Problems by Leslie Greengard


• Course Notes for Numerical Methods I at CIMS, taught by Olof Widlund

• Course Notes for Numerical Methods II at CIMS, taught by Jonathan Goodman

• Course Notes for Adv. Top. in Numerical Analysis: Fast Algorithms at CIMS, taught by Mark Tygert

• Course Notes for Adv. Top. in Numerical Analysis: Computational EM at CIMS, taught by Leslie Greengard

• Course Notes for Advanced Numerical Analysis at UCLA, taught by Chris Anderson

• Course Notes for Advanced Numerical Analysis at UCLA, taught by Luminita Vese

• I’ll be honest ... I did look at Wikipedia a few times.
