
Gustaf Soderlind

Numerical Methods for Differential Equations

An Introduction to Scientific Computing

November 29, 2017

Springer

Contents

Part I Scientific Computing: An Orientation

1 Why numerical methods?
  1.1 Concepts and problems in analysis
  1.2 Concepts and problems in algebra
  1.3 Computable problems

2 Principles of numerical analysis
  2.1 The First Principle: Discretization
  2.2 The Second Principle: Polynomials and linear algebra
  2.3 The Third Principle: Iteration
  2.4 The Fourth Principle: Linearization
  2.5 Correspondence Principles
  2.6 Accuracy, residuals and errors

3 Differential equations and applications
  3.1 Initial value problems
  3.2 Two-point boundary value problems
  3.3 Partial Differential Equations

4 Summary: Objectives and methodology

Part II Initial Value Problems

5 First Order Initial Value Problems
  5.1 Existence and uniqueness
  5.2 Stability

6 Stability theory
  6.1 Linear stability theory
  6.2 Logarithmic norms
  6.3 Inner product norms
  6.4 Matrix categories
  6.5 Nonlinear stability theory
  6.6 Stability in discrete systems

7 The Explicit Euler method
  7.1 Convergence
  7.2 Alternative bounds
  7.3 The Lipschitz assumption

8 The Implicit Euler Method
  8.1 Convergence
  8.2 Numerical stability
  8.3 Stiff problems
  8.4 Simple methods of order 2
  8.5 Conclusions

9 Runge–Kutta methods
  9.1 An elementary explicit RK method
  9.2 Explicit Runge–Kutta methods
  9.3 Taylor series expansions and elementary differentials
  9.4 Order conditions and convergence
  9.5 Implicit Runge–Kutta methods
  9.6 Stability of Runge–Kutta methods
  9.7 Embedded Runge–Kutta methods
  9.8 Adaptive step size control
  9.9 Stiffness detection

10 Linear Multistep methods
  10.1 Adams methods
  10.2 BDF methods
  10.3 Operator theory
  10.4 General theory of linear multistep methods
  10.5 Stability and convergence
  10.6 Stability regions
  10.7 Adaptive multistep methods

11 Special Second Order Problems
  11.1 Standard methods for the harmonic oscillator
  11.2 The symplectic Euler method
  11.3 Hamiltonian systems
  11.4 The Stormer–Verlet method
  11.5 Time symmetry, reversibility and adaptivity

Part III Boundary Value Problems


12 Two-point Boundary Value Problems
  12.1 The Poisson equation in 1D. Boundary conditions
  12.2 Existence and uniqueness
  12.3 Notation: Spaces and norms
  12.4 Integration by parts and ellipticity
  12.5 Self-adjoint operators
  12.6 Sturm–Liouville eigenvalue problems

13 Finite difference methods
  13.1 FDM for the 1D Poisson equation
  13.2 Toeplitz matrices
  13.3 The Lax Principle
  13.4 Other types of boundary conditions
  13.5 FDM for general 2pBVPs
  13.6 Higher order methods. Cowell’s difference correction
  13.7 Higher order problems. The biharmonic equation
  13.8 Singular problems and boundary layer problems
  13.9 Nonuniform grids

14 Finite element methods
  14.1 The weak form
  14.2 The cG(1) finite element method in 1D
  14.3 Convergence
  14.4 Neumann conditions
  14.5 cG(1) FEM on nonuniform grids

Part I Scientific Computing: An Orientation

Chapter 1 Why numerical methods?

Numerical computing is the continuation of mathematics by other means

Science and engineering rely on both qualitative and quantitative aspects of mathematical models. Qualitative insight is usually gained from simple model problems that may be solved using analytical methods. Quantitative insight, on the other hand, typically requires more advanced models and numerical solution techniques. With increasingly powerful computers, ever larger and more complex mathematical models can be studied. Results are often analyzed by visualizing the solution, sometimes for a large number of cases defined by varying some problem parameter.

Scientific computing is the systematic use of highly specialized numerical methods for solving specific classes of mathematical problems on a computer. But are numerical methods different from just solving the mathematical problem, and then inserting the data to evaluate the solution? The answer is yes. Most problems of interest do not have a “closed form solution” at all. There is no formula to evaluate. The problems are often nonlinear and almost always too complex to be solved by analytical techniques. In such cases numerical methods allow us to use the powers of a computer to obtain quantitative results. All important problems in science and engineering are solved in this manner.

It is important to note that a numerical solution is approximate. As we cannot obtain an exact solution to our problem, we construct an approximating problem that is amenable to automatic computation. The construction and analysis of computational methods is a mathematical science in its own right. It deals with questions such as how to obtain accurate results, and whether they can be computed efficiently.

This cannot be taken for granted. As a simple example, let us consider the problem of solving a linear system of equations, $Ax = b$, on a computer using standard Gaussian elimination. Let us assume that $A$ is large, with (say) $N = 10{,}000$ equations, and that $A$ is a dense matrix. Because Gaussian elimination has an operation count of $O(N^3)$, the total number of operations in solving the problem is on the order of $10^{12}$. Not only do we need a fast computer and a large memory, but we might also ask whether it is at all possible to obtain any accuracy. After all, in standard IEEE arithmetic, operations are “only” carried out to 16-digit precision. Can we trust the computed result, given that our computational sequence is so long?
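The orders of magnitude in this example can be checked with a few lines of arithmetic. The sketch below assumes the standard leading-order operation count $2N^3/3$ for dense Gaussian elimination and 8 bytes per matrix entry (IEEE 754 double precision); both are assumptions added here for illustration.

```python
# Rough cost estimate for dense Gaussian elimination with N = 10,000.
N = 10_000

flops = 2 * N**3 / 3            # assumed leading-order operation count, O(N^3)
entries = N * N                 # number of entries in the dense matrix A
memory_gb = entries * 8 / 1e9   # assumed 8 bytes per IEEE 754 double

print(f"operations ~ {flops:.1e}")
print(f"matrix storage ~ {memory_gb:.1f} GB")
```

The operation count lands just below $10^{12}$, and storing the matrix alone takes a little under a gigabyte.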


This is a nontrivial issue, and the answer depends both on the problem’s mathematical properties and on the numerical algorithms used to solve it. It typically requires a high level of mathematical and numerical skill to deal with such problems successfully. Nevertheless, it can be done on a routine basis, provided that one masters the central notions and techniques in scientific computing.

Mathematical problems can roughly be divided into four categories. On the one hand, we distinguish problems in algebra from problems in analysis. On the other hand, we distinguish between linear problems and nonlinear problems; see Table 1.1.

The distinction between these categories is very important. In algebra, all constructs are finite, but in analysis, one is allowed to use transfinite constructs such as limits. Thus analysis involves infinite processes, and whether these converge or not. In scientific computing this presents an issue, because on a computer we can only carry out finite computations. Therefore all computational methods are based on algebraic methodology, and it becomes a central issue whether we can devise algebraic problems that (in one sense or another) have solutions that approximate the solutions of analysis problems.

Table 1.1 Categories of mathematical problems. Only problems in linear algebra are computable.

Category     Algebra           Analysis
Linear       computable        not computable
Nonlinear    not computable    not computable

In a similar way, the distinction between linear and nonlinear problems is of importance. Most classical mathematical techniques deal with linear problems, which share a lot of structure, while nonlinear problems usually have to be approached on a case-by-case basis, if they can be solved by analytical techniques at all.

Borrowing a notion from computer science, we say that a solution is computable if it can be obtained by a finite algorithm. However, a computer is limited to finite combinations of the four arithmetic operations, +, −, ×, ÷, and logical operations. Therefore, in general, only problems in the linear algebra category are computable, with the standard examples being linear systems of equations, linear least squares problems, linear programming problems in optimization, and some problems in discrete Fourier analysis.

There are a few problems in analysis or nonlinear problems that can be solved by analytical techniques, but the vast majority cannot. For such problems, the only way to obtain quantitative results is by using numerical methods to obtain approximate results. This is where scientific computing plays its central role, and as we shall see, computational methods almost invariably work by, in one way or another, approximating the problem at hand by (a sequence of) problems in linear algebra, since this is where we possess finite, or terminating, algorithms.


However, in practice it is also impossible to solve linear systems exactly, unless they are very small. It is not uncommon to encounter large linear systems, perhaps having millions or even billions of unknowns. In such cases it is again only possible to obtain approximate results. This may perhaps sound disheartening to the mathematician, but the interesting challenge for the numerical analyst is to construct algorithms that have the capacity of computing approximations to any prescribed accuracy. Better still, if we can construct fast converging algorithms, such accurate approximations can be obtained with a moderate computational effort. Needless to say, there is a trade-off; more accuracy will almost always come at a cost, although on a good day, a good numerical analyst may occasionally enjoy a free lunch.

This lays out some of the interesting issues in scientific computing. The objective is to solve complex mathematical problems on a computer, to construct reliable mathematical software for given classes of problems, and to make sure that quantitative results can be obtained with both accuracy and efficiency. In order to further outline some basic thoughts, we have to investigate what mathematical concepts we have to deal with in the different problem categories above.

1.1 Concepts and problems in analysis

The central ideas of analysis that distinguish it from algebra are limits and convergence. The two most important concepts from analysis are derivatives and integrals. Let us start with the derivative, classically defined by the limit

$$\frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \tag{1.1}$$

If the function $f$ is differentiable, the difference quotient converges to the limit $f'(x)$ as $h$ goes to zero. Later, we will consider several alternative expressions to that of (1.1) for approximating derivatives. In practical computations, mere convergence is rarely enough; we would also like fast convergence, so that the trade-off between accuracy and computational effort becomes favorable.

The derivatives of elementary functions, such as polynomials, trigonometric functions and exponentials, are well known. Nevertheless, if a function is complex enough, or an explicit expression of the function is unknown, the derivative may be impossible to compute analytically. However, it can always be approximated. At first, this may not seem to be an important application, but as we shall see later, it is important in scientific computing to approximate derivatives to a high order of accuracy, as this is key to solving differential equations.

Let us consider the problem of computing an “algebraic” approximation to (1.1). Since we cannot compute the limit in a finite process, we consider a finite difference approximation,

$$\frac{df}{dx} \approx \frac{f(x+\Delta x) - f(x)}{\Delta x}, \tag{1.2}$$

where $\Delta x > 0$ is small, but non-zero. Thus the finite difference approximation simply skips taking the limit in (1.1), replacing the limit by (1.2). Let us see how accurate this approximation is.

Fig. 1.1 Finite difference approximation to a derivative. The graph plots the relative error $r$ in the finite difference approximation (1.3) vs. $\Delta x$ in a log-log diagram (horizontal axis: $\Delta x$; vertical axis: relative error). The blue straight line on the right represents $r$ as described by (1.4). A nearby dashed reference line of slope $+1$ shows that $r = O(\Delta x)$. For $\Delta x < 10^{-8}$, however, roundoff becomes dominant, as demonstrated by the red erratic part of the graph on the left. Roundoff error grows like $O(\Delta x^{-1})$, as indicated by the second dashed reference line on the left, of slope $-1$. Thus the maximum accuracy that can be obtained by (1.3) is $\sim 10^{-8}$.

Example Consider the function $f(x) = e^x$ with derivative $f'(x) = e^x$, and consider computing a finite difference approximation to $f'(1)$, i.e.,

$$f'(1) \approx \frac{f(1+\Delta x) - f(1)}{\Delta x}. \tag{1.3}$$

Since $f'(1) = e$, the relative error in the approximation is, by Taylor series expansion,

$$r = \frac{f(1+\Delta x) - f(1)}{e\,\Delta x} - 1 = \frac{e^{\Delta x} - 1}{\Delta x} - 1 = \frac{\Delta x}{2} + O(\Delta x^2). \tag{1.4}$$

Thus we can expect to obtain an accuracy proportional to $\Delta x$. For example, in order to obtain six-digit accuracy, we should take $\Delta x \approx 10^{-6}$. In order to check this result, we compute the difference quotient (1.3) for a large number of values of $\Delta x$ and compare to the exact result, $f'(1) = e$, to evaluate the error.
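A minimal sketch of such an experiment is given below, for a handful of step sizes rather than a large number of them. The helper names `forward_difference` and `relative_error` are illustrative choices, not taken from the text.

```python
import math

def forward_difference(f, x, dx):
    """Forward difference quotient, as in (1.3)."""
    return (f(x + dx) - f(x)) / dx

def relative_error(approx, exact):
    """Relative error of an approximation to a nonzero exact value."""
    return abs(approx - exact) / abs(exact)

exact = math.e  # f'(1) for f(x) = exp(x)
for k in range(1, 7):
    dx = 10.0 ** (-k)
    r = relative_error(forward_difference(math.exp, 1.0, dx), exact)
    # Compare the observed error with the leading term dx/2 of (1.4).
    print(f"dx = {dx:.0e}   r = {r:.3e}   dx/2 = {dx / 2:.3e}")
```

For these moderate step sizes the observed error should follow the leading term $\Delta x/2$ closely, with the agreement improving as $\Delta x$ decreases.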

It is important to note that the approximation (1.2) is convergent, as shown by (1.4). Convergence means that the error can be made arbitrarily small by reducing $\Delta x$. However, in the numerical experiment in Figure 1.1, we see that we cannot obtain arbitrarily high accuracy. How is this discrepancy consistent with the claim that the approximation (1.2) is convergent?

The answer is that convergence can only take place on a dense set of numbers, such as the set of real numbers, which allows us to use the standard “epsilon-delta” arguments of analysis. However, the real number system cannot be implemented in computer arithmetic (since it requires an infinite amount of information), and in numerical computation we have to make do with computer representable numbers, usually those of the IEEE 754 standard. Because this is a finite set of numbers, the notion of convergence does not exist in computer arithmetic, which explains why there will be a breakdown, sooner or later, if we take $\Delta x$ too small. Even so, the set of computer numbers is dense enough for almost all practical purposes, and in the right part of Figure 1.1, we can clearly see “the beginning of convergence,” as the error behaves exactly as expected according to (1.4). Thus we can in practice usually observe the correct order of convergence. In this case the order of convergence is $p = 1$, meaning that the error is $r = O(\Delta x^p)$.
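The breakdown can be provoked directly. The sketch below assumes IEEE 754 double precision (the format of Python's `float`) and picks the illustrative step $\Delta x = 10^{-17}$, which is smaller than the spacing between representable numbers near 1.

```python
import math
import sys

eps = sys.float_info.epsilon   # spacing between 1.0 and the next double, about 2.2e-16
dx = 1e-17                     # well below that spacing

# The step is lost entirely when added to 1.0.
print(1.0 + dx == 1.0)

# Both function values coincide, so the numerator is exactly zero.
quotient = (math.exp(1.0 + dx) - math.exp(1.0)) / dx
print(quotient)
```

With the difference quotient evaluating to zero, the relative error is 100 percent, no matter what the convergence theory says about the limit.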

In this example we have also seen that the results were plotted in a log-log diagram. This is a standard technique used whenever the plotted quantity obeys a power law. A power law is a relation of the form

$$\eta = C \cdot \xi^p.$$

Assuming that $\xi$ and $\eta$ are positive and taking logarithms, we obtain

$$\log \eta = \log C + p \cdot \log \xi.$$

This is the equation of a straight line. Thus $\log \eta$ is a linear function of $\log \xi$, with an easily identifiable slope of $p$, which is the “power” in the power law. In our example, $\eta$ is the error $r$, and $\xi$ is the “step size” $\Delta x$. Consequently, we have the power law

$$r \approx C \cdot \Delta x^p,$$

where we have neglected the higher order terms in the expansion (1.4). We plot $\log r$ as a function of $\log \Delta x$ to be able to identify $p$. In this case we have $p = 1$ and $C = 1/2$, which represents the leading term of the error. If $\Delta x$ is not too large, the higher order terms can be neglected and the order of convergence clearly observed.
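Instead of reading the slope off a plot, $p$ and $C$ can be estimated by fitting a straight line to the logarithms. The following sketch uses an ordinary least-squares fit; the function name `fit_power_law` and the chosen range of step sizes are illustrative assumptions.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(C) + p*log(x); returns (C, p)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    # Slope of the regression line in the log-log diagram.
    p = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    # Intercept log(C), transformed back to C.
    C = math.exp(my - p * mx)
    return C, p

# Relative errors of the forward difference quotient for f(x) = exp(x) at x = 1,
# using step sizes that stay clear of the roundoff-dominated regime.
steps = [10.0 ** (-k) for k in range(2, 7)]
errors = [abs((math.exp(1.0 + dx) - math.exp(1.0)) / dx - math.e) / math.e
          for dx in steps]

C, p = fit_power_law(steps, errors)
print(f"p ~ {p:.3f}, C ~ {C:.3f}")
```

The fitted values should come out close to $p = 1$ and $C = 1/2$; small deviations reflect the neglected higher order terms.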

A first order approximation, such as the one above, is rarely satisfactory in advanced scientific computing. We are therefore interested in whether it is possible to improve accuracy, and, in particular, to improve the order of convergence. This will play a central role in our analysis later, because it will allow us to construct highly accurate methods that require only a moderate computational effort. That sounds like a “free lunch,” but we shall see that if we develop our techniques well, it is indeed possible to obtain a vastly improved performance.

Example We shall again consider the function $f(x) = e^x$ with derivative $f'(x) = e^x$. This time, however, we are going to use a symmetric finite difference approximation to $f'(1)$, of the form

$$f'(1) \approx \frac{f(1+\Delta x) - f(1-\Delta x)}{2\Delta x}. \tag{1.5}$$


Fig. 1.2 Symmetric finite difference approximation. Relative error $r$ in the symmetric difference quotient (1.5) is plotted vs. $\Delta x$ in a log-log diagram (horizontal axis: $\Delta x$; vertical axis: relative error). The green straight line on the right represents $r$ as described by (1.6). A nearby dashed reference line of slope $+2$ shows that $r = O(\Delta x^2)$. For $\Delta x < 10^{-5}$, roundoff becomes dominant, as demonstrated by the red erratic part of the graph on the left. Roundoff error grows like $O(\Delta x^{-1})$, as indicated by the second dashed reference line on the left. The maximum accuracy that can be obtained by (1.5) is $\sim 10^{-11}$.

The limit, as $\Delta x \to 0$, equals the derivative, but as we shall see both theoretically and practically, the convergence is much faster. By expanding both $f(1+\Delta x)$ and $f(1-\Delta x)$ in a Taylor series around $x = 1$, we find the relative error

$$r = \frac{f(1+\Delta x) - f(1-\Delta x)}{2e\,\Delta x} - 1 = \frac{e^{\Delta x} - e^{-\Delta x}}{2\Delta x} - 1 = \frac{\Delta x^2}{6} + O(\Delta x^4). \tag{1.6}$$

Once again we would observe a power law, but this time $p = 2$, which makes the symmetric finite difference approximation convergent of order $p = 2$. Again, we compare with real computations, and obtain the result in Figure 1.2.

Evidently we can obtain considerably higher accuracy, while we still only use a finite difference quotient requiring two function evaluations. It is of some interest to make a closer comparison, and by superimposing the two plots we obtain Figure 1.3.
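The same comparison can be sketched numerically. The helper names below are illustrative, and the step sizes are chosen large enough that roundoff plays no visible role.

```python
import math

def forward_difference(f, x, dx):
    """First order quotient, as in (1.3)."""
    return (f(x + dx) - f(x)) / dx

def symmetric_difference(f, x, dx):
    """Second order symmetric quotient, as in (1.5)."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)

exact = math.e  # f'(1) for f(x) = exp(x)
for dx in (1e-1, 1e-2, 1e-3):
    r1 = abs(forward_difference(math.exp, 1.0, dx) - exact) / exact
    r2 = abs(symmetric_difference(math.exp, 1.0, dx) - exact) / exact
    print(f"dx = {dx:.0e}   first order r = {r1:.2e}   second order r = {r2:.2e}")
```

Each reduction of $\Delta x$ by a factor of 10 should reduce the first order error by about a factor of 10 and the second order error by about a factor of 100, in line with (1.4) and (1.6). Both quotients cost two function evaluations.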

This shows that a higher order approximation will produce much higher accuracy, and obtain that higher accuracy at a relatively large value of $\Delta x$. We also see that the roundoff error is largely unaffected. To see how the roundoff error occurs, we assume that the function $f(x)$ can be evaluated to a relative accuracy of $\varepsilon \approx 10^{-16}$, which approximately equals the IEEE 754 roundoff unit. This means that we obtain an approximate value $\tilde{f}(x)$ satisfying

$$\left| \frac{\tilde{f}(x) - f(x)}{f(x)} \right| \le \varepsilon.$$


Fig. 1.3 Comparison of finite difference approximations. Relative errors in first and second order finite difference approximations are compared as functions of $\Delta x$ in a log-log diagram (horizontal axis: $\Delta x$; vertical axis: relative error). The straight lines correspond to order $p = 1$ (blue) and $p = 2$ (green), respectively. At $\Delta x = 10^{-5}$, the second order approximation is more than five orders of magnitude more accurate. Roundoff errors remain similar for both approximations.

As a consequence, when we evaluate the difference quotient (1.5) we obtain a perturbed value,

$$f'(1) \approx \frac{\tilde{f}(1+\Delta x) - \tilde{f}(1-\Delta x)}{2\Delta x} = \frac{f(1+\Delta x) - f(1-\Delta x)}{2\Delta x} + \rho, \tag{1.7}$$

where we will have a maximum roundoff error of

$$\rho \le \bar{\rho} = \frac{\varepsilon f(1)}{\Delta x} = \frac{\varepsilon e}{\Delta x}.$$

Likewise, the maximum roundoff in (1.3) is $2\bar{\rho}$. Upon close inspection, the graphs also reveal that the roundoff is slightly larger in (1.3). However, in both cases $\rho = O(\Delta x^{-1})$, meaning that the effect of the roundoff is inversely proportional to $\Delta x$ when $\Delta x \to 0$. In other words, the approximation eventually deteriorates when $\Delta x$ is reduced, explaining the negative unit slope observed in the three plots above.
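The attainable accuracy quoted in the figure captions can be recovered from a simple error model. The sketch below is an illustration under stated assumptions, not part of the original analysis: it adds the leading truncation terms of (1.4) and (1.6) to the roundoff bounds above, expressed as relative errors (that is, divided by $e$) with $\varepsilon = 10^{-16}$, and minimizes the sum over a logarithmic grid of step sizes.

```python
EPS = 1e-16  # relative accuracy of function values, as assumed in the text

def forward_bound(dx):
    # Truncation dx/2 plus relative roundoff bound 2*EPS/dx for the forward quotient.
    return dx / 2 + 2 * EPS / dx

def symmetric_bound(dx):
    # Truncation dx^2/6 plus relative roundoff bound EPS/dx for the symmetric quotient.
    return dx**2 / 6 + EPS / dx

# Logarithmic grid of step sizes from 1e-1 down to 1e-15.
grid = [10.0 ** (-k / 10) for k in range(10, 151)]

for name, bound in (("forward", forward_bound), ("symmetric", symmetric_bound)):
    best_dx = min(grid, key=bound)
    print(f"{name}: best dx ~ {best_dx:.1e}, error bound ~ {bound(best_dx):.1e}")
```

Under this model the best achievable errors are of the order of $10^{-8}$ and $10^{-11}$, respectively, and the symmetric quotient reaches its optimum at a much larger step size.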

The brief glimpse above of problems in analysis demonstrates that computations (and problem solving) require approximate techniques. These are constructed in such a way that the approximating quantities converge, in a mathematical sense, to the original analysis quantities. However, because convergence itself is a notion from analysis, the successive improvement of the approximations is limited by the accuracy of the computer number system. This means that, while it is possible to obtain highly accurate approximations, one must be aware of the shortcomings of finite computations. Some difficulties can be overcome by constructing fast converging approximations, but the bottom line is still that the approximations will be in error, and that it is important to be able to estimate and bound the remaining errors. This requires considerable mathematical skill as well as a thorough knowledge of the main principles of scientific computing.

1.2 Concepts and problems in algebra

The key concepts in linear algebra are vectors and matrices. The central problems relate vectors and matrices, and the most important problem is that of solving linear systems of equations,

$$Ax = b, \tag{1.8}$$

where $A$ is an $n \times n$ matrix, $x$ is the unknown $n$-vector, and $b$ is a given $n$-vector. In mathematics the solution (if it exists) is often expressed as

$$x = A^{-1}b, \tag{1.9}$$

where $A^{-1}$ is the inverse of $A$.

In practice the inverse is rarely computed. Instead we employ some computational technique, such as Gaussian elimination, to solve the problem. This standard numerical method obtains the exact solution in a finite number of steps, i.e., the solution is computable. More precisely, it takes about $2n^3/3$ arithmetic operations to compute the solution in this way. Although the expression $A^{-1}b$ is frequently used in theoretical discussions, Gaussian elimination avoids computing the inverse, which is more expensive to compute.

In fact, the elimination procedure is equivalent to a matrix factorization, $A = LU$, where $L$ and $U$ are lower and upper triangular matrices. This transforms the original system to

$$LUx = b, \tag{1.10}$$

which is solved in two steps: first the triangular system $Ly = b$ is solved for $y$, and then the system $Ux = y$ is solved for $x$. This procedure is used for the simple reason that it is faster to solve triangular systems than full systems, thus economizing the total work for obtaining $x$. In summary, Gaussian elimination first factorizes the matrix $A = LU$ (this is the expensive operation), after which the equation solving is reduced to solving triangular systems. In scientific computing it is very common that one has to solve a sequence of systems,

$$Ax^k = b^k, \tag{1.11}$$

where the right hand sides $b^k$ often depend on the previous solution $x^{k-1}$. Here the superscript denotes the recursion index, not to be confused with the vector component index. In such situations the LU factorization of $A$ is only carried out once, and the factors $L$ and $U$ are then used to solve each system (1.11) cheaply. These forward and back substitutions only have an operation count of $2n^2$ arithmetic operations.
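The factor-once, solve-many pattern can be sketched in plain Python as follows. This is a minimal illustration, not production code: it adds partial pivoting (row interchanges, which the text does not discuss) for robustness, and the function names `lu_factor` and `lu_solve`, the test matrix and the rule generating the right hand sides are all hypothetical choices.

```python
def lu_factor(A):
    """Factorize a square matrix with partial pivoting, P A = L U.

    Returns (LU, perm): LU stores U on and above the diagonal and the
    multipliers of the unit lower triangular L below it; perm records
    the row permutation.
    """
    n = len(A)
    LU = [row[:] for row in A]   # work on a copy, A is left untouched
    perm = list(range(n))
    for k in range(n):
        # Choose the largest available pivot in column k.
        pivot = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if LU[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        LU[k], LU[pivot] = LU[pivot], LU[k]
        perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]                  # multiplier, entry of L
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]   # eliminate below the pivot
    return LU, perm

def lu_solve(LU, perm, b):
    """Solve A x = b from stored factors: L y = P b, then U x = y."""
    n = len(LU)
    y = [b[p] for p in perm]
    for i in range(n):                            # forward substitution
        y[i] -= sum(LU[i][j] * y[j] for j in range(i))
    x = y[:]
    for i in reversed(range(n)):                  # back substitution
        x[i] = (x[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x

A = [[4.0, 3.0, 0.0], [3.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
LU, perm = lu_factor(A)                # expensive step, carried out once
x = [0.0, 0.0, 0.0]
for k in range(3):                     # cheap solves of A x^k = b^k
    b = [1.0 + xi for xi in x]         # right hand side depends on the previous solution
    x = lu_solve(LU, perm, b)
print(x)
```

Only the two substitution loops are repeated inside the sequence of solves, which is where the saving comes from.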

Most computational algorithms in linear algebra are built on some matrix factorization technique, chosen to provide suitable benefits. For example, in linear least squares problems it is standard procedure to factorize the matrix $A = QR$, where $Q$ is an orthogonal matrix and $R$ is upper triangular. This again reduces the amount of work involved, and the use of orthogonal matrices is beneficial to maintaining stability and accuracy in long computations.

For very large systems (say problems involving millions of unknowns) the computational complexity of matrix factorization can become prohibitive. It is then common to employ iterative methods for solving linear systems. This means that we sacrifice obtaining an exact solution (up to roundoff), in favor of obtaining an approximate solution at a much reduced computational effort. In fact, large-scale problems in linear partial differential equations may be too large for factorization methods, leaving us with no choice other than iterative methods.

Another important standard problem in linear algebra is to find the eigenvalues and eigenvectors of a matrix $A$. They satisfy the equation

$$Ax = \lambda x. \tag{1.12}$$

As both $\lambda$ and $x$ are unknowns, this problem is technically nonlinear, although it is referred to as the linear eigenvalue problem. This designation stems from the fact that $A$ is a linear operator.

The eigenvalues are the (complex) roots $\lambda$ of the characteristic equation

$$\det(A - \lambda I) = 0. \tag{1.13}$$

If $A$ is $n \times n$, then $P(\lambda) := \det(A - \lambda I)$ is a polynomial of degree $n$, so we seek the $n$ roots of the polynomial equation $P(\lambda) = 0$. Polynomial equations are obviously nonlinear whenever the degree of the polynomial is 2 or higher. The major complication is, however, that there are in general no closed form expressions for the roots of (1.13) if $n > 4$. For higher order matrices, therefore, numerical methods are necessary. In practice eigenvalues are always computed by iterative methods, producing a sequence of approximations $\lambda_k$ converging to the eigenvalue $\lambda$. These computational techniques are non-trivial, and usually employ various matrix factorizations as part of the computation. If the matrix $A$ is large, the computational effort can easily become overwhelming, and most computational techniques for such problems offer the possibility of only computing a few (dominant) eigenvalues.
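As an illustration of an iteration producing a sequence $\lambda_k$, the sketch below uses power iteration. This particular method is not named in the text; it is chosen here only because it is among the simplest iterative eigenvalue methods. It assumes that $A$ has a single real eigenvalue of strictly largest magnitude, and the function name and the test matrix are hypothetical.

```python
import math

def power_iteration(A, num_iter=200):
    """Approximate the dominant eigenvalue of A by repeated multiplication."""
    n = len(A)
    x = [1.0] * n                       # starting vector (assumed not orthogonal
                                        # to the dominant eigenvector)
    lam = 0.0
    for _ in range(num_iter):
        # Multiply by A and normalize to unit Euclidean length.
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
        # Rayleigh quotient x^T A x (with unit-length x) as the eigenvalue estimate.
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(xi * ai for xi, ai in zip(x, Ax))
    return lam, x

A = [[2.0, 1.0], [1.0, 3.0]]
lam, x = power_iteration(A)
print(lam, (5 + math.sqrt(5)) / 2)      # estimate vs. the exact dominant eigenvalue
```

For this $2 \times 2$ matrix the characteristic equation can still be solved by hand, which makes it a convenient check; for large matrices only the iteration remains practical.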

Thus we see that convergence becomes an issue also for computational methods in some standard linear algebra problems. Although this may appear counterintuitive, it shows that numerical methods are necessary in the vast majority of mathematical problems. Computational methods are constructive approximations, and are not a matter of merely inserting data into a mathematical formula in order to obtain the solution to the problem.


It is fair to say that almost every single problem in applied mathematics involves linear algebra problems at some level. Therefore linear algebra techniques are at the core of scientific computing, although we will see that real problems in science and engineering often have to be approached by a number of preparatory techniques before the task has been brought to a linear algebra problem where standard techniques may be employed.

1.3 Computable problems

We have seen that mathematical problems can be divided into problems from algebra and analysis, and that these two categories can further be divided into linear and nonlinear problems. That leaves us with four categories of problems. We say that a problem is computable if its exact solution can be obtained in a finite number of operations. Unfortunately, only some of the problems in linear algebra are computable.

It is important to note that the notion of computable problems presupposes exact arithmetic, whereas computer arithmetic is inexact. For example, Gaussian elimination will in theory solve a linear system exactly in a finite number of operations, but on the computer it is subject to roundoff errors, since it is impossible to implement the real number system on a computer. Instead we have to make do with the IEEE 754 computer arithmetic. This system only contains a (small) subset of the rational numbers, and further, each arithmetic operation will in general incur additional errors, since the set of computer representable numbers is not closed under the standard arithmetical operations.
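Both effects can be observed in a few lines, assuming IEEE 754 double precision as used by Python's `float`. The sketch uses the standard `fractions` module to inspect the exact rational values behind the stored numbers; the values 0.1 and 0.2 are arbitrary illustrative choices.

```python
from fractions import Fraction

a, b = 0.1, 0.2

# The decimal number 0.1 is not representable; a nearby rational is stored instead.
print(Fraction(a))
print(Fraction(a) == Fraction(1, 10))

# The rounded sum differs from the double nearest to 0.3.
print(a + b == 0.3)

# The exact sum of the two stored values is not itself representable,
# so the addition incurs an additional rounding error.
exact_sum = Fraction(a) + Fraction(b)
print(exact_sum == Fraction(a + b))
```

All three comparisons evaluate to `False`: the input is already perturbed when it is stored, and the single addition perturbs the result once more.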

As a consequence, not even computable problems are necessarily solved exactly on a computer. This is why numerical methods are necessary, even for the smallest of problems. They are not an alternative to analytical techniques, but rather the only way to obtain quantitative results.

In spite of the necessity of using computational methods in almost all of applied mathematics, the analytical tools of pure mathematics are no less important in numerical analysis. The construction and analysis of computational methods rely on basic as well as advanced concepts in mathematics, which the numerical analyst must master in order to devise stable, accurate and efficient computational methods.

Numerical analysis is the continuation of mathematics by other means. Its goal is to construct computational methods and provide software for the efficient approximate solution of mathematical problems of all kinds, using the computer as its main tool. The aim is to compute approximations to a prescribed accuracy, preferably in tandem with error bounds or error estimates. In other words, we want to design methods that produce accurate approximations that converge as fast as possible to the exact solution. This convergence will only be observed in part, as the exact solution is generally both unknown and non-computable.

The design of such computational methods is a nontrivial task. Although their aim is to address the most challenging problems in nonlinear analysis, computational methods are usually assessed by applying them to well known problems from applied mathematics, where there are known analytical expressions for the exact solution. It is of course not necessary to solve such problems, but they remain the best benchmarks for new computational methods. Thus, a method which cannot excel at solving a standard problem in applied mathematics is almost bound to fail on real-life problems.

Chapter 2
Principles of numerical analysis

The subject of numerical analysis revolves around how to construct computable problems that approximate the mathematical problem of interest. There is a very large number of computational methods, superficially having rather little in common. But it is important to realize that all computational methods are constructed from, and rely on, only four basic principles of numerical analysis. These are:

• The principle of discretization
• Linear algebra, including polynomial spaces
• The principle of iteration
• The principle of linearization.

Because the computable problems are essentially those of linear algebra, almost all numerical methods will at the core work with linear algebra techniques. The purpose of the other three principles listed above is to construct various approximations that bring the original problem to computable form. Below we shall outline what these principles entail and why these ideas are so important. They will be encountered throughout the book in many different shapes.

2.1 The First Principle: Discretization

The purpose of discretization is to reduce the amount of information in a problem to a finite set, making way for algebraic computations. It is used as a general technique for converting analysis problems into (approximating) algebra problems, see Table 1.1, and is the key to numerical methods for differential equations.

Consider an interval $\Omega = [0,1]$ and select from it a subset $\Omega_N = \{x_0, x_1, \dots, x_N\}$ of distinct points, ordered so that $x_0 < x_1 < \dots < x_N$. The set $\Omega_N$ is called discrete, to distinguish it from the "continuous set" (or rather the continuum) $\Omega$. We refer to $\Omega_N$ as the grid.

Let a continuous function $f$ be defined on $\Omega$. Discretization (of $f$) means that we convert $f$ into a vector, by the map $f \mapsto F = f(\Omega_N)$, i.e.,

$$
F = \begin{pmatrix} f(x_0) \\ \vdots \\ f(x_N) \end{pmatrix}. \tag{2.1}
$$

Fig. 2.1 Discretization of a function. A continuous function $f$ on $[0,1]$ (top). A grid $\Omega_N$ is selected on the x-axis and samples from $f$ are drawn (center). The grid function $F$ (bottom) is a vector with a finite number of components (red), plotted vs. the grid points $\Omega_N$ (black)

The function $f$ has the independent variable $x \in [0,1]$, meaning that $f(x)$ is a particular value of that function. Likewise, the vector $F$ has an index as its independent variable, and $F_k = f(x_k)$ is a particular value of the vector, with $0 \le k \le N$. As $F$ is only defined on the grid, it is also called a grid function, see Figure 2.1. As the grid function is obtained by drawing equidistant samples of the function $f$, we may in effect think of the process as an analog-to-digital conversion, akin to recording an analog audio signal to a digital format.

For theoretical reasons, one may also wish to consider the case $N \to \infty$. In such situations the grid as well as grid functions are sequences rather than vectors. In practical computations, however, the number is always finite, meaning that computational methods work with vectors as discrete representations of functions.
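In MATLAB, discretizing a function amounts to evaluating it on a vector of grid points. The lines below are a minimal sketch; the sample function and the value of $N$ are assumptions made for illustration.

```matlab
% Discretization: sample a function f on an equidistant grid.
N = 10;
x = linspace(0, 1, N+1)';        % grid points x_0,...,x_N as a column vector
f = @(x) 1 + 0.8*sin(2*pi*x);    % assumed sample function
F = f(x);                        % grid function, F_k = f(x_k)
disp([x F])                      % grid points and samples side by side
```

Note that MATLAB indices start at 1, so the component $F_k$ is stored in `F(k+1)`.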

To see how discretization can be used in differential equations, consider the simple radioactive decay problem

$$
\dot u = \alpha u; \qquad u(0) = u_0, \tag{2.2}
$$

with exact solution $u(t) = e^{\alpha t}u_0$, and suppose we want to solve this equation on $[0,T]$. To make the problem computable, we will turn it into a linear algebra problem by using a discrete approximation to the derivative.

Introduce a grid $\Omega_N = \{t_0, \dots, t_N\}$ with $t_k = k\Delta t$ and $N\Delta t = T$. Noting that

$$
\dot u(t) = \lim_{\Delta t\to 0} \frac{u(t+\Delta t)-u(t)}{\Delta t}, \tag{2.3}
$$

we introduce a grid function $U$ approximating $u$, i.e., $U_k \approx u(t_k)$. We then have

$$
\dot u(t_k) \approx \frac{u(t_k+\Delta t)-u(t_k)}{\Delta t} \approx \frac{U_{k+1}-U_k}{\Delta t}. \tag{2.4}
$$

This is referred to as a finite difference approximation. Next, we will use this to replace the derivative in (2.2). Thus the original, non-computable problem is replaced by a computable, discrete problem,

$$
\frac{U_{k+1}-U_k}{\Delta t} = \alpha U_k. \tag{2.5}
$$

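This can be rewritten as a linear system of equations,

$$
\begin{pmatrix}
1 & 0 & \dots & & 0 \\
-1-\alpha\Delta t & 1 & & & \\
0 & -1-\alpha\Delta t & 1 & & \vdots \\
\vdots & & \ddots & \ddots & \\
0 & \dots & & -1-\alpha\Delta t & 1
\end{pmatrix}
\begin{pmatrix} U_0 \\ U_1 \\ \vdots \\ U_N \end{pmatrix}
=
\begin{pmatrix} u_0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \tag{2.6}
$$

showing that the approximating problem is indeed a problem of linear algebra.

As the matrix is lower triangular, the system is easily solved using forward substitution, meaning that we can compute successive approximations to $u(\Delta t), u(2\Delta t), \dots$ by repetitive use of the formula $U_{k+1} = (1+\alpha\Delta t)U_k$. We then find that

$$
u(k\Delta t) \approx U_k = (1+\alpha\Delta t)^k u_0.
$$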
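As a check, the system (2.6) can be assembled and solved directly, and the result compared with the closed-form recursion. This is a sketch under assumed test data ($\alpha = -1$, $u_0 = 1$, $T = 1$, $N = 10$); the variable names are illustrative.

```matlab
% Assemble the lower bidiagonal system (2.6) and compare with the recursion.
alpha = -1; u0 = 1; T = 1; N = 10;      % assumed test data
dt = T/N;
% Ones on the diagonal, -(1 + alpha*dt) on the first subdiagonal
A = eye(N+1) - (1 + alpha*dt)*diag(ones(N,1), -1);
b = [u0; zeros(N,1)];                   % right-hand side of (2.6)
U = A\b;                                % solve the triangular system
Urec = u0*(1 + alpha*dt).^(0:N)';       % U_k = (1 + alpha*dt)^k u0
maxdiff = max(abs(U - Urec));
fprintf('max difference: %.3e\n', maxdiff);
```

In practice one would never form the matrix for this problem; the recursion is cheaper and needs no storage beyond the solution itself.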

All approximations are obtained using elementary arithmetic operations. If, in particular, we want to find an approximation to $u(T)$ in $N$ steps, we take $\Delta t = T/N$, and get

$$
u(T) \approx U_N = \left(1+\frac{\alpha T}{N}\right)^N u_0.
$$

From analysis we recognize the limit

$$
\lim_{N\to\infty}\left(1+\frac{\alpha T}{N}\right)^N = e^{\alpha T},
$$


so apparently the method is convergent: the numerical approximation approaches the exact solution $u(T) = e^{\alpha T}u_0$ as $N \to \infty$. Formally, the method needs an infinite number of steps to generate the exact solution.

Although this may at first seem to be a drawback, it is the very idea of convergence that saves the situation. Thus, for every $\varepsilon > 0$, an $\varepsilon$-accurate approximation is computable in $N(\varepsilon)$ steps. This means that we can obtain a solution to any desired accuracy with a finite effort. Although $N(\varepsilon)$ is finite for every $\varepsilon > 0$, it need not be bounded as $\varepsilon \to 0$; the numerical problem is still computable, although the original analysis problem is not.
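The growth of $N(\varepsilon)$ can be observed experimentally. The sketch below searches, by repeated doubling, for a number of steps that meets a given tolerance at $t = T$. The test data and tolerances are assumptions, and doubling only gives an upper estimate of the smallest admissible $N$.

```matlab
% Find, by doubling, a step count N that meets each tolerance at t = T.
alpha = -1; T = 1; u0 = 1;              % assumed test data
uT = exp(alpha*T)*u0;                   % exact value u(T)
tol = [1e-1 1e-2 1e-3 1e-4];            % assumed tolerances
Neps = zeros(size(tol));
for j = 1:numel(tol)
    N = 1;
    while abs((1 + alpha*T/N)^N*u0 - uT) > tol(j)
        N = 2*N;                        % double the number of steps
    end
    Neps(j) = N;
end
disp([tol; Neps]')                      % tolerance and step count, row by row
```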

Now, above we note that the discretization is built on the finite difference approximation (2.4), which incidentally is the same as the approximation (1.2) we used to approximate a derivative numerically. The resulting method, (2.5), is known as the explicit Euler method. Since the finite difference approximation (2.4) of the derivative is of first order, we might expect that the resulting method for the differential equation is also first order convergent. Indeed, one can show that

$$
|U_N - u(T)| \le C\cdot N^{-1} = O(\Delta t).
$$

Because the error is proportional to $\Delta t^p$ with $p = 1$, the method is first order convergent. This is a slow convergence; if we want ten times higher accuracy, we will have to use a ten times shorter time step $\Delta t$, or, equivalently, ten times more steps (work) to reach the end point $T$, since $\Delta t = T/N$.

This is demonstrated below using a MATLAB implementation of the function eulertest(alpha, u0, t0, tf), where alpha is the parameter $\alpha$ in the problem, u0 is the initial value $u(0)$, and t0 and tf are the initial time and terminal time of the integration. The problem was run with $\alpha = -1$ on the time interval $[0,1]$ with initial value $u(0) = 1$. It was further run for $N = 10, 100, 1000$ and $10000$, corresponding to step sizes $\Delta t = 1/N$. The results are found in Figures 2.2 and 2.3.

function eulertest(alpha, u0, t0, tf)
% Test of explicit Euler method. Written (c) 2017-09-06

for k = 1:4
    N = 10^(5-k);                       % Number of steps
    dt = (tf-t0)/N;                     % Time step
    u = u0;                             % Initial value
    t = t0;
    sol = u;
    time = t;

    for i = 1:N                         % Time stepping loop
        u = u + dt*alpha*u;             % A forward Euler step
        t = t + dt;                     % Update time
        sol = [sol u];                  % Collect solution data
        time = [time t];
    end

    uexact = exp(alpha*(time - t0))*u0; % Exact solution
    err = sol - uexact;                 % Numerical error
    h(k) = dt;                          % Save step size
    r(k) = abs(err(end));               % Save endpoint error

    figure(1)
    subplot(2,1,1)
    plot(time,sol,'r')                  % Numerical solution (red)
    hold on
    plot(time,uexact,'b')               % Exact solution (blue)
    grid on
    axis([0 1 0.3 1])
    xlabel('t')
    ylabel('u(t)')
    hold off

    subplot(2,1,2)
    semilogy(time,abs(err))             % Error vs. time
    grid on
    hold on
    axis([0 1 1e-6 1e-1])
    xlabel('t')
    ylabel('error')
end

figure(2)
loglog(h,r,'b')                         % Endpoint error vs dt
xlabel('dt')
ylabel('error')
grid on
hold on
xref = [1e-4 1e-1];
yref = [1e-5 1e-2];
loglog(xref,yref,'k--')                 % Reference line of slope 1
hold off
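The convergence order can also be estimated numerically rather than read off a plot. The following sketch assumes the same test problem as above and uses the closed form $U_N = (1+\alpha\Delta t)^N u_0$ instead of time stepping; it is an illustration, not part of eulertest.

```matlab
% Estimate the observed order p from endpoint errors at successive N.
alpha = -1; u0 = 1; T = 1;              % same test problem as above
N = 10.^(1:4);                          % N = 10, 100, 1000, 10000
r = zeros(size(N));
for j = 1:numel(N)
    dt = T/N(j);
    UN = u0*(1 + alpha*dt)^N(j);        % explicit Euler value at t = T
    r(j) = abs(UN - exp(alpha*T)*u0);   % endpoint error
end
% error ~ C*dt^p  implies  p ~ log(r1/r2)/log(dt1/dt2)
p = log(r(1:end-1)./r(2:end)) ./ log(N(2:end)./N(1:end-1));
disp(p)                                 % values close to 1 indicate first order
```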

The explicit Euler method performs as expected. Although the results are textbook perfect, the convergence is slow and the accuracy is less impressive. Even at $N = 10^4$ we do not obtain more than four-digit accuracy. However, when approximating derivatives, we also used a second order approximation, (1.5). It would then be of interest to try out that approximation too, for the differential equation $\dot u = \alpha u$. This leads to a different discretization,

$$
\frac{V_{k+1}-V_{k-1}}{2\Delta t} = \alpha V_k, \tag{2.7}
$$

where $V_k \approx u(t_k)$ as before, and $t_k = k\cdot\Delta t$. By Taylor series expansion, we find that

$$
\frac{u(t_{k+1})-u(t_{k-1})}{2\Delta t} = \frac{u(t_k+\Delta t)-u(t_k-\Delta t)}{2\Delta t} = \dot u(t_k) + O(\Delta t^2),
$$

so we would expect the method (2.7) to be of second order and therefore more accurate than the explicit Euler method. The method is known as the explicit midpoint method. We note that there is one minor complication using this method: it requires two starting values instead of one, since we need both $V_0$ and $V_1$ to get the recursion

$$
V_{k+1} = V_{k-1} + 2\alpha\Delta t\,V_k \tag{2.8}
$$

started. We shall take $V_0 = u(0)$ and $V_1 = e^{\alpha\Delta t}u(0)$. This corresponds to taking initial values from the exact solution.

Fig. 2.2 Test of the explicit Euler method. Top panel shows the exact solution (blue) and the numerical solution (red) using $N = 10$ steps. The step size is coarse enough to make the error readily visible. Bottom panel shows the error $|U_k - u(t_k)|$ along the solution in a lin-log diagram. From top to bottom, the individual graphs correspond to $N = 10, 100, 1000$ and $N = 10000$. At each point in time, the error decreases by a factor of 10 when $N$ increases by a factor of 10, demonstrating that the error is $O(\Delta t)$

The previous code can easily be modified to test the explicit midpoint method. When this was done, the code was tested in a slightly more challenging setting, by taking $\alpha = -4.5$ but otherwise solving the same problem as before. A wider range of step sizes was chosen, by taking $N = 10, 32, 100, \dots, 32000, 100000$, so that $\Delta t$ varies between $0.1$ and $10^{-5}$.
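The modified code itself is not listed. A minimal sketch of the time stepping loop for the recursion (2.8) might look as follows, here for a single, assumed step count $N = 100$ and without the plotting commands; the variable names are illustrative.

```matlab
% Explicit midpoint method for u' = alpha*u, one value of N only.
alpha = -4.5; u0 = 1; T = 1; N = 100;   % N is one of the step counts above
dt = T/N;
t = (0:N)*dt;                           % grid points t_0,...,t_N
V = zeros(1, N+1);
V(1) = u0;                              % V_0 = u(0)
V(2) = exp(alpha*dt)*u0;                % V_1 taken from the exact solution
for k = 2:N
    V(k+1) = V(k-1) + 2*alpha*dt*V(k);  % recursion (2.8), shifted to 1-based indexing
end
err = abs(V - exp(alpha*t)*u0);         % error along the solution
fprintf('endpoint error for N = %d: %.3e\n', N, err(end));
```

Wrapping these lines in a loop over $N$, as in eulertest, gives the data needed for an error plot.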

This time the results are not in line with expectations. As we see in Figures 2.4 and 2.5, the method appears to be second order convergent, but the error is not as regular as one would have hoped for. This turns out to depend on the construction of the method, and it illustrates that advanced methods in scientific computing cannot, in general, be constructed by intuitive techniques. The loss of performance for this method is due to some stability issues; in fact, this method is unsuitable for the radioactive decay problem, although it excels for other classes of problems, such as problems in celestial mechanics and in hyperbolic partial differential equations. Thus, one needs a deep understanding of both the mathematical problem and the discretization method in order to match the two and obtain reliable results. For the time being, and until the computational behavior of the explicit midpoint method has been sorted out and analyzed, we have to consider the method a potential failure, in spite of the unquestionable success we observed before, when using exactly the same difference quotient for approximating derivatives.

Fig. 2.3 Test of the explicit Euler method. The endpoint error $|U_N - u(1)|$ is plotted in a log-log diagram for $N = 10, 100, 1000$ and $N = 10000$ (blue). A dashed reference line of slope 1 shows that the error is $O(\Delta t)$, i.e., the method is first order convergent

We have seen above that a discretization method only computes approximations at discrete points to a continuous function. In the simplest case, we approximate derivatives at distinct points, and in the more advanced cases we compute grid functions (vectors) that approximate a continuous function solving a differential equation. The distinctive feature is that discretization uses algebraic computations to approximate problems from analysis. Thus the computations can be carried out in finite time, at the price of being approximate. Even so, we have seen that it is a nontrivial task to find such approximations; intuitive techniques do not always produce the desired results. For this reason, we need to carefully examine how approximate methods are constructed, and how the approximate methods differ in character from the exact solutions to problems of analysis.


Fig. 2.4 Test of the explicit midpoint method. Top panel shows the exact solution (blue) and the numerical solution (red) using $N = 10$ steps. The numerical solution has an undesirable oscillatory behavior of growing amplitude, indicating instability. The method eventually produces negative values, in spite of the exact solution always remaining positive, being an exponential. Bottom panel shows the error $|V_k - u(t_k)|$ along the solution in a lin-log diagram. From top to bottom, the individual graphs correspond to $N = 10, 320, 1000, \dots, 100000$. Initially, the error is $O(\Delta t^2)$, but here, too, we observe an oscillatory behavior, and a faster error growth when $t$ grows

The examples above are very simple. In particular, the differential equation is linear. It is chosen simply to illustrate the errors of the approximate methods. Linear problems usually have solutions that can be expressed in terms of elementary functions. By contrast, most nonlinear differential equations of interest cannot be solved in terms of elementary functions. For example, the van der Pol equation,

$$
\begin{aligned}
u' &= v \\
v' &= \mu\cdot(1-u^2)v - u
\end{aligned}
$$

with $u(0) = 2$ and $v(0) = 0$, is a system of ordinary differential equations, which is nonlinear if $\mu \ne 0$. An analytical solution can only be found for $\mu = 0$; then $u(t) = 2\cos t$ and $v(t) = -2\sin t$. Nonlinearity is therefore an added difficulty, and for most nonlinear problems, there is no alternative to computing approximate numerical solutions.
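The explicit Euler method applies to this system without change, one component at a time. The sketch below uses $\mu = 0$ so that the result can be compared with the known solution; the step count and end time are assumptions, and setting `mu` to a nonzero value gives the nonlinear case, for which no such reference is available.

```matlab
% Explicit Euler for the van der Pol system, checked against mu = 0.
mu = 0;                                 % mu = 0: linear case with known solution
T = 1; N = 10000; dt = T/N;             % assumed end time and step count
u = 2; v = 0;                           % initial values u(0) = 2, v(0) = 0
for k = 1:N
    unew = u + dt*v;                        % u' = v
    vnew = v + dt*(mu*(1 - u^2)*v - u);     % v' = mu*(1 - u^2)*v - u
    u = unew;
    v = vnew;
end
err_u = abs(u - 2*cos(T));              % compare with u(t) = 2 cos t
err_v = abs(v + 2*sin(T));              % compare with v(t) = -2 sin t
fprintf('errors at t = %g: %.3e, %.3e\n', T, err_u, err_v);
```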

In a linear problem the unknowns enter only in their first power. There are no other powers, like $u^2$ or $v^{-1/2}$, nor any nonlinear functions, such as $u^2$, $\sin u$ or $\log v$, occurring in the equation. In the van der Pol equation above, we see that the cubic term $u^2v$ occurs. Note that a nonlinearity can take the form of a product of first-degree terms, such as $uv$, which is a quadratic term.

Fig. 2.5 Test of the explicit midpoint method. The endpoint error $|V_N - u(1)|$ is plotted in a log-log diagram for $N = 10, \dots, 100000$ (blue). A dashed reference line of slope 2 shows that the error is $O(\Delta t^2)$, but only if $\Delta t < 3\cdot 10^{-4}$ (left part of graph, corresponding to $N > 3200$). Thus the method is second order convergent, but for larger $\Delta t$, the error grows rapidly and the proper convergence order is no longer observed (right part of graph)

Discretization methods are in general aimed at solving nonlinear problems, provided that the original problem has a unique solution. A necessary condition for solving nonlinear problems successfully is that we are able to solve linear problems in an accurate and robust way. In addition, many classical problems in applied mathematics are linear, but still have to be solved numerically due to various complications, such as problem size or complex geometry. Discretization therefore remains one of the most important principles in scientific computing.

2.2 The Second Principle: Polynomials and linear algebra

Linear algebra is synonymous with matrix problems of various kinds, such as solving linear systems of equations and eigenvalue problems. But to fully appreciate the role of linear algebra in numerical analysis, it must be recognized that most computational problems that give rise to algebraic computations do not come ready-made in matrix–vector form.


Problems in analysis typically involve continuous functions, for instance the computation of integrals. It may sometimes be difficult to solve such problems analytically, as one cannot always find primitive functions in terms of elementary functions. But polynomials are different. It is straightforward to work with polynomials in analysis: the primitive function of a polynomial is again a polynomial, and conversely, the derivative of a polynomial is also a polynomial. This makes them attractive and convenient tools in numerical analysis, and a large number of numerical methods are therefore based on polynomial approximation.

The simplest (but far from the best) representation of a polynomial is

$$
P(x) = c_0 + c_1x + c_2x^2 + \dots + c_Nx^N. \tag{2.9}
$$

Thus every polynomial is completely defined by a finite set of information, the coefficient vector, $(c_0, c_1, \dots, c_N)^{\mathrm T}$. Consequently, many problems involving polynomials lead directly to matrix–vector computations, or, in other words, linear algebra.

The interpolation problem is one of the most important uses of polynomials in numerical analysis. In interpolation, given a grid function $F$ on a grid $\Omega_N$, one constructs a polynomial $P$ on the interval $\Omega$, such that $P(\Omega_N) = F$. The purpose is the opposite of discretization. Interpolation aims at generating a continuous function from discrete data. If we think of discretization as an analog-to-digital conversion, interpolation is the opposite, a digital-to-analog conversion.

Example. Let us consider approximating the function $f(x) = 1 + 0.8\sin 2\pi x$ by a polynomial $P$ on the interval $[0,1]$, such that the polynomial reproduces the correct values of $f$ at some selected points, say at $x_0 = 0$, $x_1 = 1/4$, $x_2 = 1/2$, $x_3 = 3/4$ and $x_4 = 1$. At those points, $f(x)$ takes the values 1, 1.8, 1, 0.2 and 1, respectively. Using the ansatz (2.9) with $N = 4$ and imposing the five conditions

$$
c_0 + c_1x_k + c_2x_k^2 + c_3x_k^3 + c_4x_k^4 = f(x_k); \qquad k = 0, 1, \dots, 4
$$

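we obtain the linear system of equations

$$
\begin{pmatrix}
1 & x_0 & x_0^2 & x_0^3 & x_0^4 \\
1 & x_1 & x_1^2 & x_1^3 & x_1^4 \\
1 & x_2 & x_2^2 & x_2^3 & x_2^4 \\
1 & x_3 & x_3^2 & x_3^3 & x_3^4 \\
1 & x_4 & x_4^2 & x_4^3 & x_4^4
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ f(x_3) \\ f(x_4) \end{pmatrix}. \tag{2.10}
$$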

This determines the unknown coefficients $c_0, \dots, c_4$ that uniquely characterize the interpolation polynomial. Inserting the data and solving for the coefficient vector $c$ gives

$$
P(x) = 1 + \frac{128}{15}x - \frac{128}{5}x^2 + \frac{256}{15}x^3 + 0\cdot x^4.
$$
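The coefficients above can be reproduced with a few lines of MATLAB. This is a sketch that builds the matrix of (2.10) column by column and solves the system with the backslash operator; the variable names are illustrative.

```matlab
% Solve the interpolation system (2.10) for the example data.
x = (0:4)'/4;                           % interpolation points 0, 1/4, ..., 1
f = 1 + 0.8*sin(2*pi*x);                % data f(x_k)
A = [x.^0, x, x.^2, x.^3, x.^4];        % matrix of (2.10), monomial basis
c = A\f;                                % coefficients c_0,...,c_4
cref = [1; 128/15; -128/5; 256/15; 0];  % coefficients stated in the text
fprintf('max deviation from stated coefficients: %.3e\n', max(abs(c - cref)));
```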


Fig. 2.6 Interpolation of a function. A grid function $F$, indicated by blue markers, represents samples of a trigonometric function $f$ on $[0,1]$, indicated by dashed curve (top panel). The grid function is interpolated by a polynomial $P$ of degree 3 (solid blue curve, center panel). As $P$ deviates visibly from $f$, interpolation does not recover the original function $f$ from the grid function $F$. The error $P(x) - f(x)$ is plotted as a function of $x \in [0,1]$ (red curve, lower panel). The error is zero only at the interpolation points, indicated by red markers

We note that there is no fourth degree term due to the "partial symmetry" of the function $f$. In general, having to interpolate at five points, we would expect the polynomial to have five coefficients, i.e., $P$ would have to be a degree four polynomial. Here, however, the interpolant is a polynomial of degree 3.

Let us leave aside the question of whether this is a good approximation. The point here is rather to recognize that although both $f(x)$ and the approximating polynomial $P(x)$ are nonlinear functions, the interpolation problem is linear and therefore computable. The unknown coefficients $c_k$ enter the problem linearly, and we obtained the linear system of equations (2.10). The interpolation problem therefore falls in the linear algebra category.

As the example in Figure 2.6 shows, interpolation generates a continuous function from a grid function, but it is not the inverse operation of discretization. Thus, discretization maps a function $f$ to a grid function $F$, and interpolation maps $F$ to a polynomial $P$. This does not recover $f$ unless $f$ originally was identical to $P$. Therefore, in general, it holds that $f(x) \ne P(x)$, except at the interpolation points.

In scientific computing it is often preferable to use more general functions than the standard polynomial (2.9). Just like linear algebra uses basis vectors, it is common to choose a set of basis functions, $\varphi_0(x), \varphi_1(x), \dots, \varphi_N(x)$. These are often, but not always, polynomials that are carefully chosen for their computational advantages, or to reflect some special properties of the problem at hand.

Given a function $f(x)$, we can then try to find the best approximation among all linear combinations of the basis functions $\varphi_k(x)$. In other words, we want to find the best choice of coefficients $c_k$ such that

$$
\sum_k c_k\varphi_k(x) \approx f(x), \tag{2.11}
$$

for $x$ in some set. This could either be an interval, say $\Omega = [0,1]$, or a grid $\Omega_N = \{x_0, x_1, \dots, x_N\}$. The interpolation problem above was a particular case, where we used the well-known monomial basis,

$$
\varphi_k(x) = x^k; \qquad k = 0, \dots, N \tag{2.12}
$$

together with the grid $\Omega_N$. With general basis functions, that system would become

$$
\begin{pmatrix}
\varphi_0(x_0) & \varphi_1(x_0) & \dots & \varphi_N(x_0) \\
\varphi_0(x_1) & \varphi_1(x_1) & & \varphi_N(x_1) \\
\vdots & & \ddots & \vdots \\
\varphi_0(x_N) & \dots & & \varphi_N(x_N)
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_N \end{pmatrix}
=
\begin{pmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix}. \tag{2.13}
$$

The important observation here is that the basis functions enter the matrix as its column vectors, and that the system remains linear.

A natural question is whether the interpolation is "improved" by letting $N \to \infty$. In general this is not the case; it depends in a complicated way on the choice of basis functions, as well as on the grid points $\Omega_N$ and the regularity of $f$. Because the grid points are often defined by the application (e.g. in image processing the grid points correspond to pixels with fixed locations), the basis functions usually play the more important role. An example is the monomial basis (2.12), a choice that is fraught with prohibitive shortcomings for large $N$. It is often better to use piecewise low-degree interpolation, a technique frequently used in connection with differential equations and the Finite Element Method, where $N$ is often very large.

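Let $\varphi_k$ and $F$ denote the grid functions

$$
\varphi_k = \begin{pmatrix} \varphi_k(x_0) \\ \vdots \\ \varphi_k(x_N) \end{pmatrix}, \qquad
F = \begin{pmatrix} f(x_0) \\ \vdots \\ f(x_N) \end{pmatrix}. \tag{2.14}
$$

Then (2.13) can be written

$$
\sum_{k=0}^{N} c_k\varphi_k = F. \tag{2.15}
$$

It expresses the vector $F$ as a linear combination of the basis vectors $\varphi_k$, while (2.11) expresses the function $f(x)$ as a linear combination of the basis functions $\varphi_k(x)$.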

The least-squares problem is quite closely related to the interpolation problem. There the number of basis functions, $M+1$, is small compared to the number of elements $N+1$ in $\Omega_N$ (not to mention the case when $\Omega$ is an interval). For example, fitting a straight line (a first-degree polynomial) to a data set means that $M = 1$, i.e., we have but two coefficients, $c_0$ and $c_1$, to determine, while the data set, defined by $f$-values on $\Omega_N$, may be arbitrarily large. The system

$$
\sum_{k=0}^{M} c_k\varphi_k = F \tag{2.16}
$$

is then an overdetermined system; it has more equations than unknowns. This corresponds to (2.13) but with a rectangular matrix containing only a few columns. In such a situation, we form the scalar product of (2.16) with any $\varphi_j$, to get

$$
\sum_{k=0}^{M} c_k\varphi_j^{\mathrm T}\varphi_k = \varphi_j^{\mathrm T}F; \qquad j = 0, \dots, M. \tag{2.17}
$$

This is an $(M+1)\times(M+1)$ linear system of equations, referred to as the normal equations. It can be written $Ac = g$, with matrix elements $a_{jk} = \varphi_j^{\mathrm T}\varphi_k$, and where $g_j = \varphi_j^{\mathrm T}F$. If the basis vectors $\varphi_k$ are linearly independent, it can be solved for the vector $c$ of unknown coefficients $c_k$.
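For a straight-line fit ($M = 1$) the normal equations form a $2\times 2$ system. The sketch below sets them up for assumed data, a line with a small perturbation, and compares the result with MATLAB's built-in least-squares solve; the data and variable names are illustrative.

```matlab
% Straight-line least-squares fit via the normal equations (2.17).
x = linspace(0, 1, 21)';                % grid with N + 1 = 21 points (assumed)
F = 2 + 3*x + 0.01*cos(40*x);           % assumed data: a line plus a small perturbation
Phi = [ones(size(x)), x];               % columns: basis vectors phi_0 and phi_1
A = Phi'*Phi;                           % a_jk = phi_j' * phi_k
g = Phi'*F;                             % g_j  = phi_j' * F
c = A\g;                                % solve the normal equations
cls = Phi\F;                            % backslash on the rectangular system, for comparison
fprintf('c0 = %.4f, c1 = %.4f\n', c(1), c(2));
```

For larger or nearly dependent bases the normal equations can be poorly conditioned, which is one motivation for the orthogonal bases discussed next.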

An even better approach is to first select the basis functions $\varphi_k(x)$ so that the vectors $\varphi_k$ are orthogonal on $\Omega_N$, which means that $\varphi_j^{\mathrm T}\varphi_k = 0$ if $j \ne k$. Then the sum on the left hand side of (2.17) has only one term; the system reads $c_j\varphi_j^{\mathrm T}\varphi_j = \varphi_j^{\mathrm T}F$. Thus we can immediately solve for the coefficient $c_j$, to get

$$
c_j = \frac{\varphi_j^{\mathrm T}F}{\varphi_j^{\mathrm T}\varphi_j}. \tag{2.18}
$$

This means that the coefficient vector is easily computed in terms of scalar products alone, one of the basic techniques in numerical linear algebra. A most important special case is Fourier analysis, which is exclusively based on this technique, making it very useful and efficient in practical computation. The coefficients $c_j$ in (2.18) are commonly referred to as Fourier coefficients. In Fourier analysis, the basis functions are typically chosen as trigonometric functions. By Euler's formula, $e^{i\omega n} = \cos\omega n + i\sin\omega n$. Therefore, trigonometric functions are polynomials in the variable $x = e^{i\omega}$, motivating the commonly used term "trigonometric polynomials."
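Formula (2.18) can be tried out on a small trigonometric basis. The sketch below makes two assumptions that are not in the text: the grid is equidistant on $[0,1)$ with the right endpoint omitted, which makes the three sampled basis functions mutually orthogonal, and the data are samples of $1 + 0.8\sin 2\pi x$.

```matlab
% Fourier-type coefficients from scalar products alone, formula (2.18).
n = 16;
x = (0:n-1)'/n;                         % equidistant grid on [0,1), endpoint omitted
F = 1 + 0.8*sin(2*pi*x);                % assumed data
Phi = [ones(n,1), cos(2*pi*x), sin(2*pi*x)];   % columns: basis vectors
G = Phi'*Phi;                           % Gram matrix; diagonal for an orthogonal basis
offdiag = max(max(abs(G - diag(diag(G)))));    % size of the off-diagonal part
c = (Phi'*F) ./ diag(G);                % c_j = (phi_j' * F)/(phi_j' * phi_j)
fprintf('off-diagonal size: %.1e, c = [%.4f %.4f %.4f]\n', offdiag, c);
```

No linear system is solved here; each coefficient costs one scalar product and one division.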

More interestingly, but less obviously, the Finite Element Method (FEM) for differential equations is based on similar ideas, using scalar products to construct the linear system that needs to be solved. In effect, the best approximation is then found by solving a system of the form (2.17), although we then have $M = N$. There are many different FEM approaches, consisting in choosing the basis functions $\varphi_j$ to carefully represent the desired properties of the solution, in terms of the degree of the polynomial basis functions as well as in order to satisfy boundary and continuity conditions in a proper way. All these methods lead to linear systems of special structure, and provide strong examples of how numerical analysis prefers to work in various polynomial settings to make efficient use of linear algebra for the determination of the best linear combination of the chosen basis functions. Thus, it is fair to say that approximation by polynomials is a cornerstone of scientific computing, allowing us to employ the full range of linear algebra techniques in the process of finding approximate solutions to problems from analysis.

An added difficulty is that in many applications, not least in FEM, the linear systems are extremely large and sparse. It is not uncommon to have millions of unknowns or more. In such cases it may no longer be possible to use standard factorization methods to solve the system. Instead, other approximate techniques have to be considered. These are iterative.

2.3 The Third Principle: Iteration

Nonlinear problems can rarely be solved by analytical techniques. The computational techniques used are iterative. This means that they are "repetitive," in the sense that the computational method generates a sequence of successively improved approximations that converge to the solution of the nonlinear problem. Since convergence is usually an infinite process, the iteration will in practice be terminated after a finite number of steps, but if convergence is observed it is possible to terminate the process when an acceptable accuracy has been achieved.

The very simplest example is the problem of computing the square root of a number, say $\sqrt2$. As this is an irrational number, we only have numerical approximations to it, although today's software allows us to compute such approximations at the touch of a button.

The square root of 2, symbolized by $\sqrt2$, is the positive root of the nonlinear equation $x^2 = 2$. The root is obviously greater than 1 and less than 2, so a simple guess is $x \approx 1.5$. Since the root is now in the interval $[1, 1.5]$, one could repeat halving the interval to improve the accuracy. This process, called bisection, is however unacceptably slow. Much better and more general techniques are needed.

Two thousand years ago, Heron of Alexandria noted that if an approximation $x$ was a bit too large (like 1.5), then $2/x$ would be a bit too small, and vice versa. He thus suggested that one take the average of $x$ and $2/x$ to obtain a better approximation. This looks deceptively similar to bisection, but it was an enormous advance, which 1,700 years later became known as Newton's method. If iterated, the computation is

$$
x^{k+1} = \frac12\left(x^k + \frac{2}{x^k}\right),
$$

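where the superscript $k$ is the iteration index. To verify that $\lim x^k = \sqrt2$, and to analyze the iterative process, let $\varepsilon^k = (x^k - \sqrt2)/\sqrt2$ denote the relative error of the approximation $x^k$. Then, by expanding in a Taylor series, we get

$$
\begin{aligned}
x^{k+1} &= \frac12\left(x^k + \frac{2}{x^k}\right) = \frac12\left(\sqrt2\,(1+\varepsilon^k) + \frac{\sqrt2}{1+\varepsilon^k}\right) \\
&\approx \frac{1}{\sqrt2}\left(1 + \varepsilon^k + \bigl(1 - \varepsilon^k + (\varepsilon^k)^2 - \dots\bigr)\right) \approx \sqrt2 + \frac{(\varepsilon^k)^2}{\sqrt2}.
\end{aligned}
$$

Hence it follows that the relative error in $x^{k+1}$ is

$$
\varepsilon^{k+1} = \frac{x^{k+1} - \sqrt2}{\sqrt2} \approx \frac{(\varepsilon^k)^2}{2}.
$$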

This means that the number of correct digits more than doubles in each iteration! If the relative error of $x^0$ is $10^{-2}$ (or 1%, corresponding to two correct digits, as in $x^0 = 1.4$), then the relative error of $x^1$ is $0.5\cdot 10^{-4}$. In fact, the third iterate, $x^3$, is correct to 16 digits, implying that full IEEE precision has been attained. By contrast, simple bisection would require 50 iterations to achieve the same.
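These numbers are easily checked. The sketch below runs three steps of the iteration from $x^0 = 1.4$ and records the relative errors; the bisection count is a rough estimate that assumes a starting interval $[1,2]$ and a target interval width of $10^{-15}$.

```matlab
% Heron's iteration for sqrt(2), starting from x^0 = 1.4.
x = 1.4;
relerr = abs(x - sqrt(2))/sqrt(2);      % relative error of x^0
for k = 1:3
    x = (x + 2/x)/2;                    % average of x and 2/x
    relerr(k+1) = abs(x - sqrt(2))/sqrt(2);
end
fprintf('%.3e\n', relerr);              % errors of x^0, x^1, x^2, x^3

% Bisection halves the interval in each step: starting from [1,2],
% an interval of width 1e-15 (an assumed target) needs this many steps.
nbis = ceil(log2((2 - 1)/1e-15));
fprintf('bisection steps: %d\n', nbis);
```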

This shows the power of well-designed iterative methods. In scientific computing there are two basic types of iterative methods for solving nonlinear equations. These are fixed-point iteration and Newton's method. The iteration demonstrated above is an example of both methods. In fact, they solve slightly different problems. Newton's method solves problems of the form $f(x) = 0$, while fixed-point iteration solves problems of the form $x = g(x)$. The functions $f$ and $g$ are both nonlinear. We shall leave Newton's method for the next section, and only analyze fixed-point iteration here.

Given a nonlinear function $g : D \subset \mathbb{R}^m \to \mathbb{R}^m$, fixed-point iteration starts from an initial approximation $x^0$ and computes the next approximation from $x^1 = g(x^0)$. The iteration becomes

$$
x^{k+1} = g(x^k). \tag{2.19}
$$

The name fixed-point iteration comes from the fact that if the iteration converges, we have $\lim x^k = x^*$, where

$$
x^* = g(x^*), \tag{2.20}
$$

i.e., $x^*$ is left unchanged by the map $g$. Thus $x^*$ maps to itself, and is termed a fixed point of $g$. Whether there exist fixed points and whether the iteration converges depend on the function $g$.

Subtracting (2.20) from (2.19), we get

$$
x^{k+1} - x^* = g(x^k) - g(x^*). \tag{2.21}
$$

Taking norms (for the time being, any vector norm will do), we find that

$$
\|x^{k+1} - x^*\| = \|g(x^k) - g(x^*)\|.
$$


Next, we shall assume that the map $g$ is Lipschitz continuous.

Definition 2.1. Let $g : D \subset \mathbb{R}^m \to \mathbb{R}^m$. The Lipschitz constant of $g$ on $D$ is defined by

$$
L[g] = \sup_{u \ne v} \frac{\|g(u) - g(v)\|}{\|u - v\|} \tag{2.22}
$$

for all $u, v \in D$.

Using this definition, it holds that

$$
\|x^{k+1} - x^*\| = \|g(x^k) - g(x^*)\| \le L[g]\cdot\|x^k - x^*\|.
$$

Letting $\varepsilon^k = \|x^k - x^*\|$ denote the norm of the absolute error in $x^k$, we note that

$$
\varepsilon^{k+1} \le L[g]\cdot\varepsilon^k.
$$

Hence, if L[g]< 1, the error decreases in every iteration, and, in fact, εk→ 0. A mapwith L[g] is called a contraction map, as the distance between two points (here xk

and x∗) decreases under the map g. For contraction maps the fixed point iteration istherefore convergent, provided that x0,x∗ ∈D. The fixed point theorem (also knownas the contraction mapping theorem) is one of the most important theorems innonlinear analysis, and it states the following:

Theorem 2.1. (Fixed point theorem) Let $D$ be a closed set and assume that $g$ is a Lipschitz continuous map satisfying $g : D \subset \mathbb{R}^m \to D$. Then there exists a fixed point $x^* \in D$. If, in addition, $L[g] < 1$ on $D$, then the fixed point $x^*$ is unique, and the fixed point iteration (2.19) converges to $x^*$ for every starting value $x^0 \in D$.

We shall not give a complete proof. The key points here are that $g$ maps the closed set $D$ into itself (existence) and that $g$ is a contraction, $L[g] < 1$ (uniqueness). Proving existence is the hard part (20th century mathematics), while uniqueness and convergence are quite simple. Naturally, existence is always of fundamental importance, but in computational practice it is the contraction that matters most, since the theorem offers a constructive way of obtaining successively improved approximations, converging to the proper limit $x^*$.

It is worth noting that Lipschitz continuity is quite close to differentiability. In fact, we can rewrite (2.21) as

$$x^{k+1} - x^* = g(x^k) - g(x^*) = g(x^* + (x^k - x^*)) - g(x^*) \approx g'(x^*) \cdot (x^k - x^*), \tag{2.23}$$

provided that $g$ is differentiable. Thus

$$\|x^{k+1} - x^*\| \lesssim \|g'(x^*)\| \cdot \|x^k - x^*\|.$$

This indicates that the error is diminishing provided that $\|g'(x^*)\| < 1$. In fact, if $g$ is differentiable and the set $D$ is convex, then one can show that

$$L[g] = \sup_{x \in D} \|g'(x)\|.$$

Returning to the classical (but trivial) problem of computing the square root of 2, we note that

$$g(x) = \frac{1}{2}\left(x + \frac{2}{x}\right).$$

It follows that

$$g'(x) = \frac{1}{2}\left(1 - \frac{2}{x^2}\right).$$

Therefore, $g'(x^*) = g'(\sqrt{2}) = 0$. This means that $g$ has a very small Lipschitz constant in a neighborhood of the root, explaining the fast convergence of Heron’s formula.

In scientific computing, it is always of interest to find error estimates. Again, we can rewrite (2.21) as

$$x^{k+1} - x^* = g(x^k) - g(x^*) = g(x^k) - g(x^{k+1}) + g(x^{k+1}) - g(x^*). \tag{2.24}$$

By the triangle inequality, we have

$$\varepsilon^{k+1} \le \|g(x^k) - g(x^{k+1})\| + \|g(x^{k+1}) - g(x^*)\| \le L[g] \cdot \|x^k - x^{k+1}\| + L[g] \cdot \varepsilon^{k+1}.$$

Hence we have

$$(1 - L[g])\,\varepsilon^{k+1} \le L[g] \cdot \|x^k - x^{k+1}\|.$$

Solving for $\varepsilon^{k+1}$, we obtain the error bound

$$\varepsilon^{k+1} \le \frac{L[g]}{1 - L[g]} \cdot \|x^k - x^{k+1}\|. \tag{2.25}$$

Thus, while the true error remains unknown, it can be bounded in terms of the computable quantity $\|x^k - x^{k+1}\|$, provided that we know $L[g]$. Unfortunately, the Lipschitz constant is rarely known, but a rough lower estimate can be obtained during the iteration process from

$$L[g] \approx \frac{\|g(x^k) - g(x^{k-1})\|}{\|x^k - x^{k-1}\|} = \frac{\|x^{k+1} - x^k\|}{\|x^k - x^{k-1}\|}.$$

This makes it possible to compute a reasonably accurate error estimate from (2.25).
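The estimate can be sketched as follows for a scalar problem. The function name and the test problem $g(x) = \cos x$ are our assumptions; because the Lipschitz constant is only estimated from below, the result should be read as an error estimate rather than a guaranteed bound.

```python
import math

def fixed_point_with_estimate(g, x0, steps):
    """Run `steps` >= 2 fixed point iterations and estimate the final error.

    Returns (x, L_est, estimate), where x = x^{steps},
    L_est    = |x^{k+1} - x^k| / |x^k - x^{k-1}|   (estimate of L[g]),
    estimate = L_est / (1 - L_est) * |x^k - x^{k+1}|   (from (2.25)).
    """
    if steps < 2:
        raise ValueError("need at least two steps to estimate L[g]")
    x_old, x = x0, g(x0)
    for _ in range(steps - 1):
        x_new = g(x)
        L_est = abs(x_new - x) / abs(x - x_old)
        x_old, x = x, x_new
    if L_est >= 1.0:
        raise RuntimeError("estimated Lipschitz constant is not below 1")
    estimate = L_est / (1.0 - L_est) * abs(x_old - x)
    return x, L_est, estimate

x, L_est, estimate = fixed_point_with_estimate(math.cos, 1.0, 20)
print(x, L_est, estimate)
```

For this particular example the exact fixed point is known to many digits, so the estimate can be compared with the true error.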

While iterative methods are necessary for nonlinear problems, iterative methods are also of interest for large-scale linear systems. For example, linear systems arising in partial differential equations may have hundreds of millions of equations, which excludes the use of conventional “direct methods” such as Gaussian elimination. The remaining option is then iterative methods. If the problem is $Ax = b$, one usually tries to split the matrix so that $A = M - N$, and rewrite the system

$$Mx = Nx + b.$$

If one can find a splitting such that $M$ is inexpensive to invert (for example if $M$ is diagonal), we can use a fixed point iteration

$$x^{k+1} = M^{-1}(Nx^k + b).$$

Here, if $x^k = x + \delta x^k$, with $\delta x^k$ representing the absolute error, we see that

$$\delta x^{k+1} = M^{-1}N\,\delta x^k.$$

This iteration will be convergent ($\delta x^k \to 0$) if all eigenvalues of $M^{-1}N$ are located strictly inside the unit circle; this then becomes a measure of a “well designed” iterative method. Interestingly, whether this can be achieved depends on the choice of discretization as well as on the properties of the differential equation.
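A minimal sketch of such a splitting is obtained by taking $M$ to be the diagonal of $A$, which is commonly known as Jacobi iteration. The small diagonally dominant test matrix below is our own example; the text does not prescribe a particular splitting.

```python
def jacobi(A, b, x0, iterations):
    """Fixed point iteration x <- M^{-1}(N x + b) with M = diag(A), N = M - A."""
    n = len(b)
    x = list(x0)
    for _ in range(iterations):
        x_new = []
        for i in range(n):
            # (N x)_i = -sum_{j != i} a_ij x_j, since N = M - A has zero diagonal.
            off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
            # Inverting the diagonal M is just a division.
            x_new.append((b[i] - off_diag) / A[i][i])
        x = x_new
    return x

A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
x = jacobi(A, b, [0.0, 0.0, 0.0], 50)
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
print(x, residual)
```

For this matrix the exact solution is $(1, 2, 3)$, and the diagonal dominance keeps the eigenvalues of $M^{-1}N$ well inside the unit circle.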

It is easily seen that the iteration above can be rewritten

$$x^{k+1} = x^k - M^{-1}(Ax^k - b).$$

Here the quantity $r^k = Ax^k - b$ is known as the residual. In many large-scale problems it is easy to compute the residual, but the matrix $A$ itself may not be available for separate processing. The iterative method then attempts to successively improve the approximation $x^k$ by only evaluating the residual, but changing its direction by the matrix $M$ in order to speed up convergence. For such a technique to be successful, much work goes into the construction of $M$, using all available knowledge of the problem at hand. Depending on the construction of the iterative method, the matrix $M$ is often referred to as a preconditioner.

The “ideal” choice of $M$ would be to take $M = A$, but this choice requires that $A$ is inverted or factorized by conventional techniques. Therefore $M$ is always some approximation, for example an “incomplete” factorization of $A$.

As mentioned before, there are many other kinds of problems that require iterative methods, e.g. eigenvalue problems. It is impossible to give a comprehensive overview in a limited space, but suffice it to say that without iterative methods, scientific computing would not be able to address important classes of problems in applied mathematics.

2.4 The Fourth Principle: Linearization

The last principle that is a key element in many numerical methods is linearization. This simply means that one converts a nonlinear problem to a linear one. Often, this implies that one considers small variations around a given point, using differentiation as an approximation.

Consider a differentiable map $f : \mathbb{R}^m \to \mathbb{R}^n$, mapping $x \in \mathbb{R}^m$ to $y \in \mathbb{R}^n$, so that $y = f(x)$. Analysis then allows us to write

$$dy = f'(x)\,dx. \tag{2.26}$$

This expresses that the differential $dy$ is a linear function of the differential $dx$. It is valid for any differentiable function $f$, and the “derivative” $f'(x)$ can be a scalar (when $n = m = 1$), a gradient (when $n = 1$ with $m$ arbitrary but finite), a Jacobian matrix (with both $n$ and $m$ arbitrary, but finite), or a linear operator (both $n$ and $m$ possibly infinite), as the case may be; the formalism is always the same.

In numerical analysis, it is common to approximate the infinitesimal differentials by finite differences, such as (1.2). We then write

$$\Delta y \approx f'(x) \cdot \Delta x. \tag{2.27}$$

This is again expressing (small) variations in $y$ as a linear function of $\Delta x$. That is, we consider the effect $\Delta y$, due to small variations $\Delta x$ in the independent variable $x$, to be proportional to $\Delta x$. This is the essential idea of linearization.

To take this further, let $f$ be a differentiable nonlinear map $f : \mathbb{R}^m \to \mathbb{R}^m$. Given any fixed vector $x$ and a small, varying offset $\Delta x$, we can expand $f$ in a Taylor series about $x$, to obtain

$$f(x + \Delta x) \approx f(x) + f'(x) \cdot \Delta x, \tag{2.28}$$

by retaining the first two terms. This is equivalent to linearization because it approximates $f(x + \Delta x)$ by a linear function of $\Delta x$; the approximation on the right hand side of (2.28) only contains $\Delta x$ to its first power.

Note that if the problem is scalar ($m = 1$) this is a standard Taylor series, with all quantities involved being scalar. One can then graph the right hand side of (2.28) and get a straight line with slope $f'(x)$. In the vector case, however, $x$ is an $m$-vector, $f(x)$ is an $m$-vector, and each component of $f(x)$ depends on all components of $x$, i.e.,

$$f_i(x) = f_i(x_1, \dots, x_j, \dots, x_m).$$

Hence each component $f_i$ can be differentiated with respect to every component of $x$, to obtain its gradient $f_i'(x)$. In other words,

$$f_i'(x) = \operatorname{grad}_x f_i(x) = \left( \frac{\partial f_i}{\partial x_1}, \dots, \frac{\partial f_i}{\partial x_j}, \dots, \frac{\partial f_i}{\partial x_m} \right). \tag{2.29}$$

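If we write down the derivatives of all components of $f$ simultaneously, we obtain the $m \times m$ Jacobian matrix,

$$f'(x) = \begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_m} \\
\vdots & & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \cdots & \dfrac{\partial f_m}{\partial x_m}
\end{pmatrix}. \tag{2.30}$$

The Jacobian matrix enters as $f'(x)$ in the Taylor series expansion (2.28), which remains a linear approximation.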

Let us now turn to the problem of solving a nonlinear equation $f(x) = 0$. If one has some approximation $x^0$ of the solution, we can expand around $x^0$ to get

$$f(x) = f\big(x^0 + (x - x^0)\big) \approx f(x^0) + f'(x^0) \cdot (x - x^0),$$

retaining the first two terms of the expansion, corresponding to a linear approximation. Because nonlinear problems are not computable but linear ones are, we may consider the linear system of equations

$$f(x^0) + f'(x^0) \cdot (x - x^0) = 0, \tag{2.31}$$

as an approximation to the nonlinear system $f(x) = 0$. In (2.31), $f(x^0)$ is an $m$-vector, $f'(x^0)$ an $m \times m$-matrix, so this is a linear system that can be used to determine $x$. The formal solution is

$$x = x^0 - \big(f'(x^0)\big)^{-1} f(x^0),$$

where all expressions on the right hand side are computable, provided that the Jacobian matrix is available and nonsingular. It is obvious, however, that this is not the solution to the problem $f(x) = 0$, as that problem was replaced by a linear approximation. If the linearization is in good agreement with the nonlinear function, we may obtain an improved solution compared to $x^0$. Naturally, this process can be repeated, leading to the iterative method known as Newton’s method,

$$x^{k+1} = x^k - \big(f'(x^k)\big)^{-1} f(x^k), \tag{2.32}$$

where the superscript is used as the iteration index as before, to distinguish it from the vector component subscript index used above.
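To make (2.32) concrete, here is a minimal sketch for a system of two equations. The test problem (a circle of radius 2 intersected with the line $x_1 = x_2$), the helper `solve_2x2` and the stopping criterion are our own assumptions. Note that the code solves the linear system $f'(x^k)\,d = f(x^k)$ rather than forming the inverse.

```python
import math

def f(x):
    """Hypothetical test problem: circle of radius 2 intersected with x1 = x2."""
    x1, x2 = x
    return [x1 * x1 + x2 * x2 - 4.0, x1 - x2]

def jacobian(x):
    """Jacobian matrix f'(x) of the test problem above."""
    x1, x2 = x
    return [[2.0 * x1, 2.0 * x2],
            [1.0, -1.0]]

def solve_2x2(J, r):
    """Solve J d = r for a 2x2 matrix J by Cramer's rule."""
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    if det == 0.0:
        raise ZeroDivisionError("singular Jacobian")
    d1 = (r[0] * J[1][1] - J[0][1] * r[1]) / det
    d2 = (J[0][0] * r[1] - r[0] * J[1][0]) / det
    return [d1, d2]

def newton(f, jacobian, x0, tol=1e-14, max_iter=20):
    """Newton's method: solve f'(x^k) d = f(x^k), then x^{k+1} = x^k - d."""
    x = list(x0)
    for k in range(1, max_iter + 1):
        d = solve_2x2(jacobian(x), f(x))
        x = [x[0] - d[0], x[1] - d[1]]
        if max(abs(d[0]), abs(d[1])) <= tol:
            return x, k
    raise RuntimeError("no convergence within max_iter iterations")

root, n_iter = newton(f, jacobian, [1.0, 2.0])
print(root, n_iter)
```

For larger systems the 2x2 solver would of course be replaced by a general linear solver; the structure of the iteration stays the same.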

As we have seen above, Newton’s method is based on approximating the original problem $f(x) = 0$ by a sequence of linear, computable problems, whose solutions are expected to produce successively better approximations, converging to the true solution. We encountered Newton’s method already in the previous section, applied to the quadratic equation $x^2 - 2 = 0$. Thus, taking $f(x) = x^2 - 2$, we have $f'(x) = 2x$, and Newton’s method becomes

$$x^{k+1} = x^k - \frac{(x^k)^2 - 2}{2x^k} = \frac{x^k}{2} + \frac{1}{x^k} = \frac{1}{2}\left(x^k + \frac{2}{x^k}\right).$$

Thus we have obtained Heron’s formula for computing square roots. Newton’s method is far more general, however, and is one of the most frequently used methods for nonlinear equations, of equal importance to the fixed point iteration of the previous section. It is by far the most important example of the principle of linearization.

Nevertheless, the convergence of Newton’s method remains an extremely difficult matter. To summarize, it will converge if $f'(x)$ is nonsingular in the vicinity of the solution $x^*$ to $f(x) = 0$. This is equivalent to requiring that $\|(f'(x))^{-1}\| \le C_1$ in a neighborhood $B(r) = \{x : \|x - x^*\| \le r\}$ of $x^*$. In addition, we have to require that the second derivative (a 3-tensor) is bounded, i.e., $\|f''(x)\| \le C_2$ for $x \in B(r)$. Finally we need an initial approximation $x^0 \in B(r)$ and $\|f(x)\| \le C_0$ on $B(r)$. Success depends on further conditions on $f$ and on the relation between the bounds $C_0$, $C_1$ and $C_2$.

In practice, it is impossible to verify these conditions mathematically, and as a consequence, one often experiences considerable difficulties or even failures in computational practice. This is not necessarily due to any shortcomings of the method; it is more of an indication that strong skills in mathematics and numerical analysis are needed. On the positive side, if one is successful, Newton’s method is quadratically convergent, as we already saw in the case of Heron’s formula. Thus, when all conditions of convergence are fulfilled, the error behaves like

$$\varepsilon^{k+1} = O\big((\varepsilon^k)^2\big),$$

which means that Newton’s method offers extraordinarily fast convergence when all conditions are in place. By contrast, fixed point iteration is in general only linearly convergent, i.e.,

$$\varepsilon^{k+1} = O(\varepsilon^k),$$

thus calling for a much larger number of iterations before an acceptable approximation can be obtained. These issues are not much noticed for small systems, but many mathematical models in science and engineering lead to large-scale systems, and to considerable difficulties.

As an illustration of what Newton’s method does in other special cases, consider solving the problem $f(x) = 0$, with $f(x) = Ax - b$. (This means that we are going to use Newton’s method to solve a system of linear equations.) Then $f'(x) = A$, and Newton’s method becomes (cf. (2.32))

$$x^{k+1} = x^k - A^{-1}(Ax^k - b) = x^k - x^k + A^{-1}b = A^{-1}b.$$

Hence Newton’s method solves the problem in a single iteration. This comes as no surprise; the idea of linearization is “wasted” on a linear problem, as the linear problem is its own linearization and the method is designed to solve the linear problem exactly. However, we also note that Newton’s method is “expensive” to use, as it requires the Jacobian matrix and is supposed to work with a full linear solver. This again means that if the nonlinear problem is very large, a conventional factorization method is not affordable, and Newton’s method will be modified to some less expensive variant, e.g.

$$x^{k+1} = x^k - M^{-1} f(x^k), \tag{2.33}$$

where $M \approx f'(x^k)$ is inexpensive to invert. The matrix $M$ is often kept constant for many successive iterations to further reduce computational effort. This is referred to as modified Newton iteration. While each iteration is cheaper to carry out, the savings come at a price: the fast, quadratic convergence is lost, and convergence (if it converges at all) will be reduced to linear convergence. Naturally, (2.33) may be viewed as a fixed point iteration, with

$$g : x \mapsto x - M^{-1} f(x)$$

and Jacobian matrix

$$g'(x) = I - M^{-1} f'(x).$$

In the fixed point iteration, we need $\|g'(x)\| \ll 1$ in a neighborhood of the solution $x^*$. This obviously requires $M^{-1} f'(x) \approx I$, i.e., $M^{-1}$ must approximate $(f'(x))^{-1}$ well enough. Needless to say, finding such an $M$, which is also cheap to invert, is a tall order. Even so, it is a standard task for the numerical analyst who works with advanced, large-scale problems in applied mathematics.
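The loss of quadratic convergence is easy to observe on the scalar problem $x^2 - 2 = 0$. In the sketch below, $M$ is frozen at $f'(x^0)$ with $x^0 = 2$; the starting value, the tolerance and the helper name `iterate` are our assumptions for illustration.

```python
def iterate(step, x0, tol=1e-12, max_iter=200):
    """Apply x <- step(x) until two successive iterates differ by at most tol."""
    x = x0
    for k in range(1, max_iter + 1):
        x_new = step(x)
        if abs(x_new - x) <= tol:
            return x_new, k
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")

def f(x):
    return x * x - 2.0

def df(x):
    return 2.0 * x

x0 = 2.0
M = df(x0)  # derivative frozen at the starting point (modified Newton)

# Newton's method (2.32): the derivative is re-evaluated in every iteration.
newton_root, newton_iters = iterate(lambda x: x - f(x) / df(x), x0)
# Modified Newton (2.33): the same constant M is reused in every iteration.
modified_root, modified_iters = iterate(lambda x: x - f(x) / M, x0)

print("Newton:  ", newton_root, newton_iters)
print("modified:", modified_root, modified_iters)
```

Both variants reach the same root, but the modified iteration needs noticeably more steps, in line with $g'(x^*) = 1 - f'(x^*)/M \neq 0$.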

2.5 Correspondence Principles

Out of the four principles outlined above, three aim to turn non-computable problems into computable problems, or linear algebra. Discretization is the most important principle; it reduces the information in an analysis problem into a finite set and enables computations to be completed in finite time. Linearization is a tool for approximating a nonlinear problem by a linear problem; because the latter is not the same as the original problem, it must be combined with iteration to create a sequence of problems, whose solutions converge to the solution of the original problem. Finally, the “principle” of linear algebra and polynomials represents core methodology, allowing us to approximate and compute in finite time.

On top of this methodology, it is important to understand that analysis problems are set in the “continuous world” of functions, derivatives and integrals. By contrast, the algebraic problems are set in the “discrete world” of vectors, polynomials and linear systems.

There is a strong correspondence between the continuous world and the discrete world. Both have rich theories that are largely parallel, with similar structure. The numerical analyst must be well acquainted with both worlds, and be able to move freely between them, as all numerical methods work with finite data sets but are intended to solve problems from analysis.

As an example, consider a linear first-order initial value problem

$$\dot{u} = Au. \tag{2.34}$$

Its discrete-world counterpart is

$$v_{k+1} = Bv_k. \tag{2.35}$$

While the first problem is a differential equation, the second problem is a difference equation. If we discretize the differential equation (2.34) in a way similar to (2.5) and let $v_k$ denote an approximation to $u(t_k)$, where $t_k = k\Delta t$, we get the difference equation

$$v_{k+1} = (I + \Delta t A)v_k, \tag{2.36}$$

corresponding to taking $B = I + \Delta t A$. The latter equation links the two equations above. Now, we are often interested in questions such as whether $u(t) \to 0$ as $t \to \infty$. This will happen if all eigenvalues of $A$ are located strictly inside the left half plane.

But if we have discretized the system, we are also interested in knowing whether $v_k \to 0$ as $k \to \infty$. This will happen in (2.35) if all eigenvalues of $B$ are located strictly inside the unit circle. Thus we see that the conditions we arrive at in the continuous and discrete worlds are different, although the two theories are largely analogous.

The real problem in numerical analysis, however, is to relate the two. Thus we would like our discretization (2.36) to behave in a way that replicates the behavior of the continuous problem (2.34). As $B = I + \Delta t A$ we can find out whether this is the case. Let $x$ be an eigenvector of $A$, i.e., $Ax = \lambda x$. We then have

$$Bx = (I + \Delta t A)x = x + \Delta t Ax = (1 + \Delta t \lambda)x,$$

so we conclude that the eigenvalues of $B$ are $1 + \Delta t \lambda$.

Now, if $\operatorname{Re} \lambda < 0$, will $|1 + \Delta t \lambda|$ be less than 1? Interestingly, this puts a condition on $\Delta t$, which must be chosen small enough. In a broader context, it implies that not every discretization will do. Thus, in order to find out how to solve a problem successfully, we need to have a strong foundation both in the classical continuous world and in the discrete world, as well as knowing how problems, concepts and theorems from one world correspond to those of the other.
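The condition on $\Delta t$ can be explored numerically. The sketch below uses the scalar case with the hypothetical value $\lambda = -10$; for a real negative eigenvalue, $|1 + \Delta t \lambda| < 1$ holds exactly when $0 < \Delta t < 2/|\lambda|$, here $\Delta t < 0.2$. The function names are our own.

```python
def amplification(lam, dt):
    """Modulus of the eigenvalue 1 + dt*lam of B = I + dt*A."""
    return abs(1.0 + dt * lam)

def iterate_norm(lam, dt, steps):
    """|v_steps| for the scalar recursion v_{k+1} = (1 + dt*lam) v_k with v_0 = 1."""
    v = 1.0 + 0.0j  # complex arithmetic, so that complex eigenvalues also work
    for _ in range(steps):
        v = (1.0 + dt * lam) * v
    return abs(v)

lam = -10.0  # Re(lam) < 0, so the continuous solution decays
for dt in (0.05, 0.19, 0.25):
    print(f"dt={dt}: |1 + dt*lam| = {amplification(lam, dt):.2f}, "
          f"|v_100| = {iterate_norm(lam, dt, 100):.3e}")
```

With the largest step the discrete solution grows although the continuous solution decays, which is precisely the mismatch between the two worlds discussed above.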

It is of particular importance to learn to recognize the structural similarity of the continuous and discrete worlds. Above we have seen the similarity and connections between $\dot{u} = Au$ and $v_{k+1} = Bv_k$. The beginner in numerical analysis is probably better acquainted with the analysis side, but one quickly recognizes how the discrete world works. In fact, for almost every continuous principle, there is a discrete principle, and vice versa.

For example, the usual scalar product $f^{\mathrm{T}} g$ between two vectors $f$ and $g$ is

$$f^{\mathrm{T}} g = \sum_{k=0}^{N} f_k g_k.$$

Is there a “continuous” scalar product? The answer is yes, and we will use it in particular in connection with the finite element method. Thus the corresponding operation in the continuous case is

$$\langle f, g \rangle = \int_0^1 f(x)g(x)\,dx,$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product of two functions. Here we recognize the pointwise product of the two functions, integrated (“summed”) over the entire range of the independent variable, $x$.

Many operations in algebra have similar counterparts in analysis. Another important example is linear systems of equations, $Ax = b$, which, when written in component form, are

$$\sum_j a_{i,j} x_j = b_i.$$

There are many different types of linear operators in analysis. A direct counterpart to the linear system above is the integral equation

$$\int_0^1 k(x,y)u(y)\,dy = v(x).$$

Here we see that a function $u$ on $[0,1]$ is mapped to another function $v$. Again this happens by multiplying the function $u$ at the point $y$ by a function of two independent variables, $x$ and $y$, and integrating (“summing”) over the entire range of $y$, leaving us with a function of the remaining independent variable, $x$. This operation is often written, just like in the linear algebra case, as $Ku = v$. If the operator $K$ and the function $v$ are given, we need to solve the integral equation $Ku = v$. Such a problem can be solved using discretization, which takes us back to a linear algebraic system of equations.

The simplest example of an integral equation, as well as a differential equation, is an equation of the form

$$\dot{x} = f(t), \quad x(0) = 0 \quad \Rightarrow \quad x(t) = \int_0^t f(\tau)\,d\tau.$$

Obviously, this is a problem from analysis, and we would approximate its solution in numerical analysis by using a discretization method. For example, we may choose a grid $\Omega$ and sample the function $f$ on $\Omega$. Because it may in general be impossible to find an elementary primitive function of $f$, we convert it to a grid function $F$. This, in turn, is reconverted to an approximating polynomial $P$, with $|f(t) - P(t)| \le \varepsilon$ on the entire interval.

Replacing $f$ in the integral by the polynomial $P$, we are now ready to find a numerical approximation to the integral. We have obtained a computable problem. As primitive functions of polynomials are polynomials, the integral can easily be computed. Thus

$$x(t) \approx \int_0^t P(\tau)\,d\tau.$$

In the section on linear algebra and polynomials, we have seen that an interpolation polynomial is obtained by solving a linear system of equations, where the discrete samples $F$ of the continuous function $f$ form the data vector. As a result, the interpolation polynomial is a linear combination of the data samples, i.e., it can be written

$$P(t) = \sum_k F_k \varphi_k(t),$$

where the basis functions $\varphi_k(t)$ are also polynomials. Therefore,

$$\int_0^t P(\tau)\,d\tau = \int_0^t \sum_k F_k \varphi_k(\tau)\,d\tau = \sum_k F_k \int_0^t \varphi_k(\tau)\,d\tau.$$

By introducing the notation $w_k = \int_0^t \varphi_k(\tau)\,d\tau$, we arrive at

$$\int_0^t f(\tau)\,d\tau \approx \int_0^t P(\tau)\,d\tau = \sum_k F_k w_k.$$

This means that the integral in the continuous world can be approximated by a weighted sum in the discrete world; the latter is easily computable. In fact, we recognize this computation as a simple scalar product of the grid function vector $F$ and the weight vector $w$.

We further note that if $|f(t) - P(t)| \le \varepsilon$, then

$$\left| \int_0^t \big(f(\tau) - P(\tau)\big)\,d\tau \right| \le \int_0^t |f(\tau) - P(\tau)|\,d\tau \le \varepsilon t.$$

This means that not only can we compute an approximation in the discrete world, but we can also obtain an error bound, provided that we master the interpolation problem.
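The weights $w_k$ can be computed explicitly for a small grid. The sketch below assumes that the basis functions $\varphi_k$ are the Lagrange basis polynomials on three equidistant nodes in $[0,1]$, and uses $f(t) = e^t$ as a hypothetical integrand; all function names are our own. Polynomials are stored as coefficient lists.

```python
import math

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists [c0, c1, ...]."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def lagrange_basis(nodes, k):
    """Coefficients of phi_k, the polynomial with phi_k(nodes[j]) = 1 if j == k else 0."""
    p = [1.0]
    for j, tj in enumerate(nodes):
        if j != k:
            denom = nodes[k] - tj
            # Multiply by the linear factor (tau - tj) / (nodes[k] - tj).
            p = poly_mul(p, [-tj / denom, 1.0 / denom])
    return p

def integrate_poly(p, t):
    """Integral of the polynomial p from 0 to t."""
    return sum(c * t ** (i + 1) / (i + 1) for i, c in enumerate(p))

def quadrature_weights(nodes, t):
    """Weights w_k = integral of phi_k from 0 to t."""
    return [integrate_poly(lagrange_basis(nodes, k), t) for k in range(len(nodes))]

nodes = [0.0, 0.5, 1.0]
w = quadrature_weights(nodes, 1.0)
F = [math.exp(tk) for tk in nodes]               # grid function: samples of f
approx = sum(Fk * wk for Fk, wk in zip(F, w))    # scalar product of F and w
print(w, approx, math.e - 1.0)
```

For these three nodes the weights are $1/6$, $2/3$ and $1/6$, i.e., the weights of Simpson's rule.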

In a similar way, it is of fundamental importance to master the four basic principles of numerical analysis, the correspondence between the discrete and the continuous worlds, and how well we can approximate solutions to problems in analysis by problems in algebra. In this book, we are going to focus on differential equations, and therefore the principle of discretization is of primary importance. The quality of a computed solution is a matter of two questions: first stability (which we have not yet been able to explore), and second accuracy, which is usually a matter of how fine we make the discretization.

Naturally there is a trade-off. The finer the discretization, the higher the computational cost. On the other hand, a finer discretization offers better accuracy. Can we obtain high accuracy at a low cost? The answer is yes, provided that we can construct stable methods of a high order of convergence. This is not an easy task, but it is exactly what makes scientific computing an interesting and challenging field of study.

For reasons of efficiency, as well as for computability, we need to stay away from infinity. However, accuracy is obtained “near infinity.” We obviously need to strike a balance.


2.6 Accuracy, residuals and errors

Chapter 3
Differential equations and applications

Differential equations are perhaps the most common type of mathematical model in science and engineering. Our objective is to give an introduction to the basic principles of computational methods for differential equations. This entails the construction and analysis of such methods, as well as many other aspects linked to computing. Thus, one needs to have an understanding of

• the application
• the mathematical model and what it represents
• the mathematical properties of the problem
• what properties a computational method needs to solve the problem
• how to verify method properties
• qualitative discrepancies between numerical and exact solutions
• how to obtain accuracy and estimate errors
• how to interpret computational results.

We are going to consider three different but basic types of problems. The first is initial value problems (IVP) of the form

$$\dot{y} = f(t, y),$$

with initial condition $y(0) = y_0$. In general this is a system of equations, and the dot represents differentiation with respect to the independent variable $t$, which is interpreted as “time.” Thus the differential operator of interest is

$$\frac{d}{dt}. \tag{3.1}$$

The second type of problem is a boundary value problem (BVP) of the form

$$-u'' = f(x),$$

with boundary conditions $u(0) = u_L$ and $u(1) = u_R$. Here prime denotes differentiation with respect to the independent variable $x$, which is interpreted as “space.” The differential operator of interest is now

$$\frac{d^2}{dx^2}. \tag{3.2}$$

In connection with BVPs, we will also consider other types of boundary conditions. Occasionally, we will also consider first order derivatives, i.e., the operator $d/dx$.

We will find that there are very significant differences in properties and computational techniques between dealing with initial and boundary value problems. Interestingly, these theories are combined in the third type of problem we will consider: time-dependent partial differential equations (PDE). This is a very large field, and we limit ourselves to some simple standard problems in a single space dimension.

Thus, we are going to combine the differential operators (3.1) and (3.2). The simplest example is the diffusion equation,

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2},$$

which requires initial as well as boundary conditions. We will follow standard notation in PDE theory, and rewrite this equation as

$$u_t = u_{xx},$$

where the subscript $t$ denotes partial derivative with respect to time $t$, and the subscript $x$ denotes partial derivative with respect to space $x$. Both initial conditions and boundary conditions are needed to have a well-defined solution, and the solution $u$ is a function of two independent variables, $t$ and $x$.

Apart from combining $\partial/\partial t$ with $\partial^2/\partial x^2$, we shall also be interested in combining it with the first order space derivative $\partial/\partial x$, as in the advection equation,

$$u_t = u_x.$$

The two PDEs above have completely different properties. Both are standard model problems in mathematical physics. The advection equation has wave solutions, while the diffusion equation is dissipative with strong damping. The two problems cannot be solved using the same numerical methods, since the methods must be constructed to match the special properties of each problem.

We will also see that there are many variants of the equations above, and we will consider various nonlinear versions of the equations, as well as other types of boundary conditions. In addition, we will consider eigenvalue problems for the differential operators mentioned above, and build a comprehensive theory from the elementary understanding that can be derived from the IVP and BVP ordinary differential equations.

3.1 Initial value problems

As mentioned above, the general form of a first order IVP is

$$\dot{y} = f(t, y),$$

with $f : \mathbb{R} \times \mathbb{R}^m \to \mathbb{R}^m$, and an initial condition $y(0) = y_0$. The task is to compute the evolution of the dependent variable, $y(t)$, for $t > 0$. In practice, the computational problem is always set on a finite (compact) interval, $[0, t_{\mathrm{end}}]$.

Almost all software for initial value problems addresses this problem. This means that one has to provide the function $f$ and the initial condition in order to solve the problem numerically. Most problems of interest have a nonlinear function $f$, but even in the linear case numerical methods are usually necessary, due to the size of the problem.

A simple example of an IVP is a second-order equation, describing the motion of a pendulum of length $L$, subject to gravitation, as characterized by the constant $g$,

$$\ddot{\varphi} + \frac{g}{L} \sin\varphi = 0, \qquad \varphi(0) = \varphi_0, \quad \dot{\varphi}(0) = \omega_0.$$

Here $\varphi$ represents the angle of the pendulum. The problem is nonlinear, since the term $\sin\varphi$ occurs in the equation. Another issue is that the equation is second order, modeling Newtonian mechanics. Since standard software solves first order problems, we need to rewrite the system in first order form by a variable transformation. Thus we introduce the additional variable $\omega = \dot{\varphi}$, representing the angular velocity. Then we have

$$\begin{aligned}
\dot{\varphi} &= \omega \\
\dot{\omega} &= -\frac{g}{L} \sin\varphi.
\end{aligned}$$

This is a first order system of initial value problems. The general principle is that any scalar differential equation of order $d$ can be transformed in a similar manner to a system of $d$ first order differential equations.
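In code, the first order form amounts to writing a function that returns the right hand side for the state vector $y = (\varphi, \omega)$. The sketch below pairs it with a very simple explicit Euler time stepper; the parameter values $g = 9.81$ and $L = 1$, the step count and the function names are our assumptions, and the stepper is only meant to show how the right hand side function is used.

```python
import math

def pendulum_rhs(t, y, g=9.81, L=1.0):
    """Right hand side of the first order pendulum system, with y = [phi, omega]."""
    phi, omega = y
    return [omega, -(g / L) * math.sin(phi)]

def explicit_euler(f, y0, t_end, n_steps):
    """Take n_steps explicit Euler steps y <- y + dt*f(t, y) from t = 0 to t_end."""
    dt = t_end / n_steps
    t, y = 0.0, list(y0)
    for _ in range(n_steps):
        dy = f(t, y)
        y = [yi + dt * dyi for yi, dyi in zip(y, dy)]
        t += dt
    return y

# Small initial angle, released at rest. For small angles the motion is nearly
# harmonic, so after about half a period the angle should be close to -phi0.
phi0 = 0.01
t_half = math.pi * math.sqrt(1.0 / 9.81)
y_half = explicit_euler(pendulum_rhs, [phi0, 0.0], t_half, 20000)
print(y_half)
```

The same right hand side function could be passed unchanged to any standard first order IVP solver.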

Another nonlinear system of equations is the predator–prey model

$$\begin{aligned}
\dot{y}_1 &= k_1 y_1 - k_2 y_1 y_2 \\
\dot{y}_2 &= k_3 y_1 y_2 - k_4 y_2,
\end{aligned}$$

where $y_1$ represents the prey population and $y_2$ the predator species. The coefficients $k_i$ are supposed to be positive, and the variables $y_1$ and $y_2$ are non-negative. This equation is a classical model known as the Lotka–Volterra equation, and it exhibits an interesting behavior with periodic solutions. The problem is nonlinear, because it contains the quadratic terms $y_1 y_2$. The model can be used to model the interaction of rabbits ($y_1$) and foxes ($y_2$).

Let us consider the case where there are no foxes, i.e., $y_2 = 0$. The problem then reduces to $\dot{y}_1 = k_1 y_1$, which has exponentially growing solutions. Thus, without predators, the rabbits multiply and the population grows exponentially. On the other hand, if there are no rabbits ($y_1 = 0$), the system reduces to $\dot{y}_2 = -k_4 y_2$. The solution is then a negative exponential, going to zero, representing the fact that the foxes starve without food supply.

If there are both rabbits and foxes, the product term $y_1 y_2$ represents the chance of an encounter between a fox and a rabbit. The term $-k_2 y_1 y_2$ in the first equation is negative, representing the fact that the encounter has negative consequences for the rabbit. Likewise, the positive term $k_3 y_1 y_2$ in the second equation models the fact that the same encounter has positive consequences for the fox. When these interactions are accounted for, one gets an interesting dynamical system, with periodic variations over time in the fox and rabbit populations. The system does not reach an equilibrium.
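The right hand side of the model is easily written down, and the special cases discussed above can be checked directly. The coefficient values in the sketch are hypothetical and chosen only for illustration. We also evaluate the right hand side at the point $(k_4/k_3,\, k_1/k_2)$, where both derivatives vanish; this is the stationary point around which the periodic solutions circle.

```python
def lotka_volterra_rhs(t, y, k1=1.0, k2=0.5, k3=0.2, k4=0.8):
    """Right hand side of the predator-prey model; y = [prey, predators]."""
    y1, y2 = y
    return [k1 * y1 - k2 * y1 * y2,
            k3 * y1 * y2 - k4 * y2]

# No foxes: the rabbit population grows at the rate k1*y1.
no_foxes = lotka_volterra_rhs(0.0, [3.0, 0.0])
# No rabbits: the fox population decays at the rate k4*y2.
no_rabbits = lotka_volterra_rhs(0.0, [0.0, 3.0])
# Stationary point (k4/k3, k1/k2): both derivatives vanish.
stationary = lotka_volterra_rhs(0.0, [0.8 / 0.2, 1.0 / 0.5])
print(no_foxes, no_rabbits, stationary)
```

Starting anywhere else with both populations positive, a time stepping method applied to this function would trace out the periodic variations described in the text.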

Initial value problems have applications in a large number of areas. Some well-known examples are

• Mechanics, where Newton’s second law is $M\ddot{q} = F(q)$. Here $M$ is a mass matrix, and $F$ represents external forces. The dependent variable $q$ represents position, and its second derivative represents acceleration. Thus mass times acceleration equals force. This is a second order equation, and it is usually transformed to a first order system before it is solved numerically. Applications range from celestial mechanics to vehicle dynamics and robotics.

• Electrical circuits, where Kirchhoff’s law is $C\dot{v} = -I(v)$, and where $C$ is a capacitance matrix, $v$ represents nodal voltages, and $I$ represents currents. Applications are found in circuit design and VLSI simulation.

• Chemical reaction kinetics, where the mass action law reads $\dot{c} = f(c)$. Here the vector $c$ represents the concentration of the reactants in a perfectly mixed reaction vessel, and the function $f$ represents the actual reaction interactions. The same type of equation is used in biochemistry as well as in chemical process industry.

We note, in the last case, that many other applications lead to equations that are structurally similar to the reaction equations. Thus the Lotka–Volterra predator–prey model has a similar structure.

Other examples of this type of equation are epidemiological models, where the spreading of an infectious disease has a similar dynamical form. The interactions describe the number of infected people, given that some are susceptible, while others are infected and some have recovered (or are vaccinated) to be immune to further infections. This is the classical “SIR model,” developed by Kermack and McKendrick in 1927; it was a breakthrough in the understanding of epidemiology and vaccination.

Equations such as those above are all important for modeling complex, nonlinear interactions, whose outcome cannot be foreseen by intuitive or analytical techniques.

3.2 Two-point boundary value problems

The two-point boundary value problems (2p-BVP) we will consider are mostly sec-ond order ordinary differential equations. A classical problem takes the form

−u′′ = f (x), (3.3)

on [0,1], with Dirichlet boundary conditions u(0) = uL and u(1) = uR. Here thefunction f is a source term, which only depends on the independent variable x.Occasionally one of the boundary conditions will be replaced by a Neumann con-dition, u′(0) = u′L. The Neumann condition can also be imposed at the right endpointof the interval, but in order to have a unique solution, at least one boundary conditionhas to be of Dirichlet type.

The task is to compute (an approximation to) a function u ∈C2[0,1] that satisfiesthe boundary conditions and the differential equation on [0,1]. Because we cannotcompute functions in general, we will have to introduce a discretization, to computea discrete approximation to u(x).

The equation above is a one-dimensional version of the Poisson equation. Understanding this 2p-BVP is key to understanding the Laplace and Poisson equations in 2D or 3D. The latter cases are certainly more complicated, but many basic properties concerning solvability and error bounds are built on similar theories.
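As a minimal sketch of what such a discretization can look like, the following Python code replaces −u′′ by the standard second order central difference on a uniform grid and solves the resulting linear system. The function name solve_poisson_1d, the grid size, and the test problem with solution u(x) = sin(πx) are illustrative assumptions, not taken from the text.

```python
import numpy as np

def solve_poisson_1d(f, u_left, u_right, n):
    """Approximate -u'' = f(x) on [0, 1] with u(0) = u_left, u(1) = u_right.

    Uses n interior grid points and the central difference
    (-u[i-1] + 2*u[i] - u[i+1]) / h**2 as an approximation to -u''.
    """
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)            # interior grid points only
    # Tridiagonal matrix representing -d^2/dx^2 on the interior grid
    T = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
    rhs = np.array(f(x), dtype=float)
    rhs[0] += u_left / h**2                   # known boundary values move to the right-hand side
    rhs[-1] += u_right / h**2
    return x, np.linalg.solve(T, rhs)

# Assumed test problem: u(x) = sin(pi x), hence f(x) = pi^2 sin(pi x)
x, u = solve_poisson_1d(lambda x: np.pi**2 * np.sin(np.pi * x), 0.0, 0.0, n=99)
max_error = np.max(np.abs(u - np.sin(np.pi * x)))
print(f"max error with n = 99: {max_error:.2e}")
```

A dense matrix is used here only to keep the sketch short; since the matrix is tridiagonal, a practical solver would exploit that structure.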

An example of a fourth order problem is the beam equation

\[ -M'' = f(x), \qquad -u'' = \frac{M}{EI}, \]

where M represents the bending moment of a slender beam supported at both ends, subject to a distributed load f(x) on [0,1]. If the supports do not sustain any bending moment, the boundary conditions for the first equation are of homogeneous Dirichlet type, M(0) = M(1) = 0. In the second equation, which is structurally identical to the first, u represents the deflection of the beam, under the bending moment M. Here, too, the boundary conditions are Dirichlet conditions, u(0) = u(1) = 0. Finally, E is a material constant and I (which may depend on x) is the cross-section moment of inertia of the beam, which depends on the shape of the beam’s cross section.

This is a linear problem, as the dependent variables M and u only enter linearly. Many solvers for 2p-BVPs are written for second order problems, and often designed especially for Poisson-type equations. In this particular example, one first solves the moment equation, and once the moment has been computed, it is used as data for solving the second, deflection equation.
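The two-step procedure can be sketched as follows, assuming a uniform load f(x) = 1, a constant stiffness EI = 1, and a second order central difference discretization; these choices and the helper name second_difference_matrix are illustrative assumptions. For this particular load the exact deflection is u(x) = (x⁴ − 2x³ + x)/24, which gives a reference value to compare against.

```python
import numpy as np

def second_difference_matrix(n):
    """Matrix of -d^2/dx^2 on n interior points of [0, 1], zero Dirichlet data."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

n = 99                                    # interior points; x[49] is the midpoint 0.5
x = np.linspace(0.0, 1.0, n + 2)[1:-1]
T = second_difference_matrix(n)

EI = 1.0                                  # assumed constant stiffness (illustrative)
f = np.ones(n)                            # assumed uniform load f(x) = 1

M = np.linalg.solve(T, f)                 # step 1: -M'' = f,    M(0) = M(1) = 0
u = np.linalg.solve(T, M / EI)            # step 2: -u'' = M/EI, u(0) = u(1) = 0

u_exact = (x**4 - 2.0 * x**3 + x) / 24.0  # exact deflection for this load and EI = 1
print(f"midpoint deflection: {u[n // 2]:.6f}, exact 5/384 = {5 / 384:.6f}")
```

Both steps reuse the same matrix, which reflects the remark that the two equations are structurally identical.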

Whether such an approach is possible or not depends on the boundary conditions. Thus, there is a variant of the beam equation, where the beam is “clamped” at both ends. This means that the supports do sustain bending moments, and that the boundary conditions take the form u′(0) = u′(1) = 0 and u(0) = u(1) = 0. In this case, one cannot solve the problem in two steps, because the problem is a genuine fourth order equation, known as the biharmonic equation. If I is constant, the equation is equivalent to

\[ u^{IV} = \frac{f(x)}{EI}, \]

and will require its own dedicated numerical solver.

In connection with 2p-BVPs, we will also consider problems containing first order derivatives, e.g.

−u′′ + u′ + u = f(x),

or nonlinear problems, such as

−u′′+uu′ = f (x),

or

−u′′ + u = f(u).

In the last case, the function f is a function of the solution u, not just a source term. The structure of these equations will become clear in connection with PDEs, as these operators will correspond to the spatial operators of time-dependent PDEs.

Another very important, and distinct, type of BVP is eigenvalue problems for differential operators. The simplest example is

−u′′ = λu, (3.4)

with homogeneous boundary conditions u(0) = u(1) = 0. Since an algebraic eigenvalue problem is usually written Au = λu, where λ corresponds to the eigenvalues of the matrix A, we see that (3.4) is in fact an “analytical” eigenvalue problem for the differential operator −d²/dx². Problems of this type are usually referred to as Sturm–Liouville problems. Here we will construct discretization methods that approximate this analysis problem by a linear algebra problem; in other words, we will be able to approximate the eigenvalues of (3.4) by solving a linear algebra eigenvalue problem.
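A minimal sketch of this idea, assuming a second order central difference discretization on a uniform grid (the grid size is an arbitrary choice), compares the smallest matrix eigenvalues with the exact eigenvalues λk = (kπ)² of (3.4), whose eigenfunctions are sin(kπx).

```python
import numpy as np

n = 200                                   # number of interior grid points (assumed)
h = 1.0 / (n + 1)
# Central difference matrix for -d^2/dx^2 with u(0) = u(1) = 0
T = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

approx = np.linalg.eigvalsh(T)[:3]        # three smallest eigenvalues of the matrix
exact = np.array([(k * np.pi) ** 2 for k in (1, 2, 3)])  # eigenvalues of (3.4)

for k, (a, e) in enumerate(zip(approx, exact), start=1):
    print(f"k = {k}: discrete {a:.4f}, continuous {e:.4f}")
```

Only the low end of the matrix spectrum approximates the differential operator well; the agreement deteriorates for eigenvalues whose index is comparable to the grid size.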

Two-point boundary value problems occur in a large number of applications. A short list of examples includes

• Materials science and structural analysis, as exemplified by the beam equation above.

• Microphysics, as exemplified by the (stationary) Schrödinger equation

\[ -\frac{\hbar^2}{2m}\,\psi'' + V(x)\,\psi = E\psi, \]

where ψ is the wave function, V(x) is a “potential well,” and E is the energy of a particle in the state ψ.

• Eigenmode analysis, such as in −u′′ = λu, which may describe buckling of structures; eigenfrequency oscillation modes of e.g. a bridge or a musical instrument.

The last two examples are Sturm–Liouville problems, which in many cases are directly related to Fourier analysis. Thus, we find new links between computational methods and advanced analysis.

One of the great insights of applied mathematics is that there are applications in vastly differing areas that still share the same structure of their equations. For example, it is by no means obvious that buckling problems, first solved by Euler in the middle of the 18th century, satisfy the same (or nearly the same) equation as Schrödinger’s “particle-in-box” problems of the 20th century, or could be used in the design of musical instruments. The applications in macroscopic material science and in quantum mechanics appear to have nothing in common, but mathematics tells us otherwise. This makes it possible to develop common techniques in scientific computing, with only minor changes in details. The overall methodology will still be the same, having an impact in all of these areas.

3.3 Partial Differential Equations

Partial differential equations are characterized by having two or more independent variables. Here we shall limit ourselves to the simplest case, where the two independent variables are time and space (in 1D). This is simply the combination of initial and boundary value problems, without having to approach the difficulty of representing geometry in space.

While IVPs and BVPs have their own special interests, it is in PDEs that differential equations and scientific computing get really exciting. The particular difficulties of initial and boundary value problems are present simultaneously, and conspire to create a whole new range of difficulties, often far harder than one would have expected.

As mentioned above, we are mainly going to consider two equations, the parabolic diffusion equation

ut = uxx, (3.5)

and the hyperbolic advection equation,

ut = ux. (3.6)

The latter is also often referred to as the convection equation. The space operators can be combined, and the equation

ut = uxx +ux (3.7)


is usually referred to as the convection–diffusion equation. It has two differential operators in space, ∂²/∂x² (giving rise to diffusion), and ∂/∂x (creating convection/advection). Convection and advection are transport phenomena, usually associated with some material flow, like in hydrodynamics or gas dynamics, and model wave propagation. By contrast, diffusion does not require a mass flow, and (3.5) is the standard model for heat transfer. The combined equation, possibly including further terms, is the simplest model equation for problems in heat and mass transfer.

We have previously seen that in IVPs, equations of the form u̇ = f(u) can be used to model chemical reactions. For this reason, if we add such a term to the diffusion equation (3.5), we obtain

ut = uxx + f (u),

known as the reaction–diffusion equation. Here f depends on the dependent variable u. If it depended only on the independent variable x, it would be a source term, and the equation would become

ut = uxx + f (x).

This is a plain diffusion equation with source term. Many time-dependent problems involving diffusion have stationary solutions. This means that after a long time, equilibrium is reached, and there is no longer any time evolution. The stationary solution is independent of time, implying that ut = 0. In the last equation, the equilibrium state therefore satisfies

0 = uxx + f (x).

We already encountered this equation in the context of 2p-BVPs. Thus

−u′′ = f (x),

is the Poisson equation (3.3). While the time dependent problems considered above are either parabolic (both ut and uxx are present) or hyperbolic (ut is present, but the highest order derivative in space is ux), the stationary equation is different; the Poisson equation is elliptic. We will later have a closer look at the classification of these equations. For the time being, let us just note that there are elliptic, parabolic and hyperbolic equations, and that the 2p-BVP (3.3) represents elliptic problems, so long as there is only one space dimension.

With the exception of special equations such as Poisson’s, the names of the equations we consider usually only list the terms included in the right hand side, provided that the left hand side only contains the time derivative ut. For example, by including the first derivative ux in the diffusion equation, we obtained the convection–diffusion equation. Likewise, if we also include the reaction term f(u), we get

ut = uxx +ux + f (u),

known as the convection–diffusion–reaction equation. Such names are practical, as the different terms often put different requirements on the solution techniques. Thus the name of the equation tells us about what difficulties we expect to encounter.

Elliptic problems have characteristics that are similar to plain 2p-BVPs. Parabolic and hyperbolic equations, on the other hand, are both time- and space-dependent, and are more complicated. Parabolic equations are perhaps somewhat simpler, while hyperbolic equations are particularly difficult because they represent conservation laws. They have wave solutions without damping, and one of the main difficulties is to create discretization methods that replicate the conservation properties. As very few numerical methods conserve energy, qualitative differences between the exact and numerical solutions soon become obvious.

There are also second order equations in time, with the most obvious example being the classical wave equation,

utt = uxx.

This equation is closely related to the advection equation, ut = ux, and both equations are hyperbolic. The most interesting problems, however, are nonlinear. A famous test problem is the inviscid Burgers equation,

ut +uux = 0,

which again is a hyperbolic conservation law with wave solutions. However, due to the nonlinearity, this equation may develop discontinuous solutions, corresponding to shock waves (cf. “sonic booms”). Such solutions are extremely difficult to approximate, and we shall have to introduce a new notion, of weak solutions. These do not have to be differentiable at all points in time and space, and therefore only satisfy the equation in a generalized sense.

The situation is somewhat relaxed if diffusion is also present, such as in the viscous Burgers equation,

ut +uux = uxx.

The diffusion term on the right will then introduce dissipation, representing viscosity, meaning that wave energy will dissipate over distance. (The inviscid Burgers equation has no viscosity term.) The solution may initially be discontinuous, but over time it becomes increasingly regular and eventually becomes smooth. The viscous Burgers equation is a simple model in several applications, for example modeling seismic waves. Because of the presence of the diffusion term, this equation is no longer hyperbolic but parabolic. However, if the diffusion is weak, wave phenomena will still be observed over a considerable range in time and space. For this reason, the viscous Burgers equation is sometimes referred to by the oxymoron parabolic wave equation.

PDEs form an extremely rich field of study, with applications e.g. in materials science, fluid flow, electromagnetics, potential problems, field theory, and multiphysics. Multiphysics is any area that combines two or more fields of application, as has been suggested by the many combinations of terms above. One example is fluid–structure interaction, such as air flow over an elastic wing, or blood flow inside the heart. Because of the different nature of the equations, there is typically a need for dedicated numerical methods, even though many components of such methods are common for special classes of problems. The area is a highly active field of research.

From the vast variety of applications, it is clear that some difficulties will be encountered in the numerical solution of differential equations. Although basic principles covering e.g. discretization and polynomial approximation apply, it is still necessary to construct methods with a special focus on the class of problems we want to address. Even so, the basic principles are few, and the challenge is to find the right combination of techniques in each separate case.

Chapter 4
Summary: Objectives and methodology

One of the pioneers in scientific computing, Peter D Lax, has said that

“The computer is to mathematics what the telescope is to astronomy, and the microscope is to biology.”

Thus ever faster computers, with ever larger memory capacity, allow us to consider more and more advanced mathematical models, and it is possible to explore by numerical simulation how complex mathematical models behave. Naturally, this leads to large-scale computations that cannot be carried out by analytical methods. The role of scientific computing is to bridge the gap between mathematics proper and computable approximations.

Although our focus is on differential equations, the previous chapters have introduced a variety of basic principles in scientific computing, and outlined why they are needed in order to construct approximate solutions to mathematical problems. In particular, we noted that only problems in linear algebra are computable, in the sense that an exact solution can be constructed in a finite number of steps. By contrast, nonlinear problems, and problems from analysis (such as differential equations) can only be solved by approximate methods.

The purpose of scientific computing is

• to construct and analyze methods for solving specific problems in applied mathematics

• to construct software implementing such methods for solving applied mathematics problems on a computer.

Scientific computing is a separate field of study because conventional, analytical mathematical techniques have a very limited range. Few problems of interest can be solved analytically. Instead, we must use approximate methods. This can be justified by taking great care to prove that the computational methods are convergent, and therefore (at least in principle) able to find approximate solutions to any prescribed accuracy. This extends the analytical techniques by a systematic use of numerical computing, which, from the scientific point of view, is subject to no less rigorous standards than conventional mathematics.

The methodology of scientific computing is

• based on conventional mathematics, with the usual proof requirements
• built on classical results in continuous as well as discrete mathematics.

A few principles stand out as the main building blocks in all computational methods. These are

• Standard techniques in linear algebra
• A systematic use of polynomials, including trigonometric polynomials
• The principle of discretization, which brings a problem from analysis to a problem in algebra
• The principle of linearization, which approximates a nonlinear problem locally by a linear problem
• The principle of iteration, which constructs a sequence of approximations converging to the final result.

All numerical methods are constructed by using elements of these principles. This does not mean that the methods are similar, or that a “trick” that works in one problem context also works in another. In fact, the variety of techniques used is very large; without recognizing the basic principles, one can easily be overwhelmed by the technicalities and miss the broader patterns that are characteristic of a general approach.

On top of the approximation techniques mentioned above, all computations are carried out in finite precision arithmetic, usually defined by the IEEE 754 standard. This will lead to roundoff errors. It is a common misunderstanding about scientific computing that the results are erroneous due to roundoff. However, most computational algorithms are little affected by roundoff, as other errors, such as discretization errors, linearization errors and iteration errors typically dominate. The only problems that are truly affected by roundoff are problems from linear algebra solved by finite algorithms. Such problems are unaffected by discretization, linearization and iteration errors, meaning that only roundoff remains.

Apart from computing approximate solutions, scientific computing is also concerned with error bounds and error estimates. Thus we are not only concerned with obtaining an approximation, but also the accuracy of the computation. Finally we are interested in obtaining these results reasonably fast. In order of importance, computational methods must be

1. Stable
2. Accurate
3. Reliable
4. Efficient
5. Easy to use.

Stability is priority number one, as unstable algorithms are unable to obtain any accuracy at all; instability will ruin the results. If stable algorithms can be devised, we are interested in obtaining accuracy. Accurate algorithms should also be reliable and robust, and be able to solve broad classes of problems; high accuracy on a few selected test problems is not good enough. As part of the reliability, we usually also want reliable error estimates, indicating the accuracy obtained.

Once these criteria have been met, we want to address efficiency. Efficiency is a very broad issue, ranging from data structures allowing efficient memory use, to adaptivity, which means that the software “tunes” itself automatically to the mathematical properties of the problem at hand. Finally, when efficient software has been constructed, we want ease of use, meaning that various types of interfaces are needed to set up problems (entering or generating data, e.g. the geometry of the computational domain), as well as postprocessing tools, such as visualization software.

In this introduction to scientific computing, we focus less on a rigorous mathematical treatment of method construction and analysis, and more on an understanding of elementary computational methods, and how they interact with standard problems in applied mathematics. This entails understanding what the original equations represent, meaning that we will emphasize the links from physics, via mathematics, to methods and actual computational results.

These will also be investigated in model implementations. We need to understand the basic mathematics of the problems, the properties of our computational methods, and how to assess their performance. Assessing performance is a difficult matter. It is invariably done by trying out the method on various well-known test problems and benchmarks, but the assessment is never complete. Ideally, we want to infer some general conclusions, but have to keep in mind that any numerical test only yields results for that particular problem.

Since we are going to use standard test problems and benchmarks, which may often have a known analytical solution, we work in an idealized setting. In real computational problems, the exact solution is never known. However, unless we can demonstrate that standard benchmarks can be solved correctly, with both stability and accuracy, there is little hope that the method would work for the advanced problems.

Many of our benchmarks will use simple tools of visualization. For example, we typically want to demonstrate that an implementation is correct by verifying its theoretical convergence order. This is usually done in graphs, like those used in the section on discretization above. There we saw the use of lin-lin graphs, lin-log graphs and log-log graphs. This may appear to be a simple remark, but part of the skill in scientific computing is in visualizing the results in a proper way. Therefore it is important to master these techniques, and carefully try out the best tools in each situation. As we have remarked previously, log-log diagrams are used for power laws. Likewise, lin-log diagrams are preferred for exponentials, but there are numerous other situations where scaling is key to revealing the relevant information.
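As an illustration of verifying a convergence order, the sketch below measures the error of the central difference approximation to a derivative for a sequence of step sizes and fits a straight line to log(error) versus log(h); the slope of that line is the observed order, which is what one reads off a log-log graph. The test function sin(x), the evaluation point x = 1 and the step sizes are assumptions chosen for illustration.

```python
import numpy as np

# Step sizes h = 2^-2, ..., 2^-8 (an arbitrary, illustrative choice)
hs = np.array([2.0 ** (-k) for k in range(2, 9)])

# Central difference approximation of d/dx sin(x) at x = 1; exact value cos(1)
approximations = (np.sin(1.0 + hs) - np.sin(1.0 - hs)) / (2.0 * hs)
errors = np.abs(approximations - np.cos(1.0))

# A power law error = C * h^p is a straight line with slope p in log-log scale
order, log_constant = np.polyfit(np.log(hs), np.log(errors), 1)
print(f"observed convergence order: {order:.3f}")
```

The fitted slope should be close to 2, the theoretical order of the central difference. For much smaller step sizes roundoff would eventually dominate and the straight line would break down.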

Part II
Initial Value Problems

Chapter 5
First Order Initial Value Problems

Initial value problems occur in a large number of applications, from rigid-body mechanics, via electrical circuits to chemical reactions. They describe the time evolution of some process, and the task is usually to predict how a given system will evolve. Most standard software is written for initial value problems of the form

ẏ = f(t,y); y(0) = y0, (5.1)

where f : R×R^d → R^d is a given function, and where the initial condition y0 is also given.

Before constructing computational methods, it is often a good idea to verify, through mathematical means, whether there exists a solution and whether the solution is unique. A somewhat more ambitious approach is to verify that the initial value problem is well posed. This means that we need to demonstrate that there is a unique solution, which depends continuously on the data. This usually means that a small change in the initial value, or a small change in a forcing function, will only have a minor effect on the solution to the problem.

While such a theory exists for initial value problems in ordinary differential equations, much less is known in partial differential equations. Thus, fortunately, the theory of initial value problems is quite comprehensive compared to other areas in differential equations.

5.1 Existence and uniqueness

The classical result on existence and uniqueness is built on the continuity properties of the function f.

Definition 5.1. Let f : R×R^d → R^d be a given function, and define its Lipschitz constant with respect to the second argument by

\[ L[f] = \sup_{u \neq v} \frac{\| f(t,u) - f(t,v) \|}{\| u - v \|}. \qquad (5.2) \]

Obviously, the Lipschitz constant depends on t, and we shall also assume that the function f is continuous with respect to time t. The classical existence and uniqueness result on (5.1) is the following.

Theorem 5.1. Let f : R×R^d → R^d be a given function, and assume that it is continuous with respect to time t, and that it is Lipschitz continuous with respect to y, with L[f] < ∞. Then there is a unique solution to (5.1) for t ≥ 0.

If f is a linear, constant coefficient system, i.e., ẏ = Ay, then

\[ L[f] = \sup_{u \neq v} \frac{\| Au - Av \|}{\| u - v \|} = \sup_{x \neq 0} \frac{\| Ax \|}{\| x \|} = \| A \|, \qquad (5.3) \]

where ‖A‖ is the norm of the matrix A induced by the vector norm ‖ · ‖. Thus, for a linear map A, the Lipschitz constant is just the operator norm. But nonlinear functions are harder to deal with. In general, very few nonlinear maps satisfy a Lipschitz condition on all of R^d. If f satisfies such a condition on a bounded domain D, then one can guarantee the existence of a solution up to the point where the solution y reaches the boundary of D. A classical example is the following.

Example Consider the IVP

\[ \dot y = y^2; \quad y(0) = y_0 > 0, \]

with analytical solution

\[ y(t) = \frac{y_0}{1 - t y_0}. \]

The function f(t,y) = y² obviously does not satisfy a Lipschitz condition on all of R, but only on a finite domain. In fact, inspecting the solution, we see that the solution remains bounded only up to t = 1/y0, when the solution “blows up.” Initial value problems having this property are said to have “finite escape time.” Thus, no matter how large the region D is where the Lipschitz condition holds, the solution will reach the boundary of D in finite time and escape.
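The blow-up can be made concrete with a few evaluations of the analytical solution. The sketch below assumes the initial value y0 = 2 (so that the escape time is 1/y0 = 0.5) and also checks, by a central difference with an arbitrarily chosen step, that the formula indeed satisfies ẏ = y²; the function name exact_solution is introduced here for illustration only.

```python
def exact_solution(t, y0):
    """Exact solution y(t) = y0 / (1 - t*y0) of y' = y^2, valid for t < 1/y0."""
    return y0 / (1.0 - t * y0)

y0 = 2.0                                  # assumed initial value
t_escape = 1.0 / y0                       # finite escape time

# The solution grows without bound as t approaches the escape time
for fraction in (0.9, 0.99, 0.999):
    t = fraction * t_escape
    print(f"t = {t:.4f}: y = {exact_solution(t, y0):.1f}")

# Check y' = y^2 at one point with a central difference (step size is arbitrary)
t0, dt = 0.2, 1e-5
dydt = (exact_solution(t0 + dt, y0) - exact_solution(t0 - dt, y0)) / (2 * dt)
residual = dydt - exact_solution(t0, y0) ** 2
print(f"residual of y' - y^2 at t = {t0}: {residual:.1e}")
```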

There are also other examples, when the Lipschitz condition does not hold.

Example Consider the IVP

ẏ = −√(−y); y(0) = 0.

This problem does not have a unique solution. The solution may be chosen as y(t) = 0 on the interval t ∈ [0,2τ], with τ ≥ 0 an arbitrary number, followed by y(t) = −(t/2 − τ)² for t > 2τ. While this might seem like a contrived problem, it can actually arise, largely due to poor mathematical modeling. Thus, assume that we want to model the free fall of a particle of mass m. Its potential energy is U = mgy, where g is the gravitational acceleration, and its kinetic energy is T = mẏ²/2. The total energy is the sum of the two, i.e.,

\[ E = mgy + \frac{m \dot y^2}{2}. \]

Let us assume (a matter of normalization) that E = 0. Then, obviously

ẏ² = −2gy,

and, because ẏ ≤ 0, ẏ = −√(−2gy). To obtain the original equation, let us choose units so that g = 1/2. Then ẏ = −√(−y).

The model is, at least in principle, correct. However, it is “unfortunate”, since we choose the initial condition y(0) = 0 and the constant E = 0. Although this choice may seem natural, it just happens to be at the singularity of the right-hand side function f of the differential equation, and we do not necessarily get the expected solution, y(t) = −t²/4.

Thus, when f(y) = −√(−y), we have f′(y) = 1/(2√(−y)). Hence f′(y) is not defined at the initial value y = 0, and since the Lipschitz constant L[f] ≥ max |f′(y)|, the Lipschitz condition is not satisfied at the starting point either. Therefore, we are not guaranteed a unique solution. What this example shows is that even if one uses “sensible” mathematical modeling, one may end up with a poor mathematical description, lacking unique solutions. This appears to be less understood; even so, one needs to be aware that not all models, and not all normalizations of coordinate systems and initial values, will lead to proper models.
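The lack of uniqueness can be checked directly. The sketch below evaluates the family of solutions described in this example for a few values of τ (chosen arbitrarily) and confirms that each member satisfies both the initial condition y(0) = 0 and the differential equation, by computing the residual ẏ + √(−y) on a grid. The helper names y and y_dot are introduced here for illustration; the derivative is computed by hand from the formula.

```python
import numpy as np

def y(t, tau):
    """Candidate solution: zero up to t = 2*tau, then -(t/2 - tau)^2."""
    return np.where(t <= 2 * tau, 0.0, -((t / 2 - tau) ** 2))

def y_dot(t, tau):
    """Time derivative of the candidate solution, computed by hand."""
    return np.where(t <= 2 * tau, 0.0, -(t / 2 - tau))

t = np.linspace(0.0, 10.0, 1001)
residuals = {}
for tau in (0.0, 1.0, 2.5):
    # Residual of y' = -sqrt(-y); it should vanish for every choice of tau
    residuals[tau] = np.max(np.abs(y_dot(t, tau) + np.sqrt(-y(t, tau))))
    print(f"tau = {tau}: max residual = {residuals[tau]:.1e}, y(0) = {y(0.0, tau)}")
```

Since every τ ≥ 0 gives a valid solution of the same initial value problem, a numerical method applied to it cannot be expected to single out the “physical” one.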

Example A variant of the same problem is obtained by modeling a leaky bucket. Assuming that a cylindrical bucket initially contains water up to a level y(0) = y0 and that the water runs out of a hole in the bottom, we apply Bernoulli’s law, according to which the flow rate v out of the hole satisfies v² ∼ p, where p is the pressure at the opening. Because of the cylindrical shape of the bucket, the pressure is proportional to the level y of water. Likewise, if the flow rate is v, then ẏ ∼ v. Dropping all proportionality constants, it follows that ẏ² = y, so that

ẏ = −√y; y(0) = y0.

Even though this equation is similar to the former, we do not face the same problem, since we start at a different initial condition. The problem can be solved analytically, and

\[ y(t) = \left( \sqrt{y_0} - \frac{t}{2} \right)^2. \]

Thus we find that the bucket will be empty (y = 0) at time t = 2√y0 (where, again, proportionality constants have been neglected).

This problem has an interesting connection to the former. The previous problem, of modeling the free fall of a mass, leads to essentially the same equation. While the free fall problem did not have a unique solution, the leaky bucket problem does. However, the relation between the two problems is that the free fall problem is equivalent to the leaky bucket problem in reverse time. Thus, if the bucket is currently empty, we cannot answer the question of when it was last full. Such a question is not well posed. This is evident in the leaky bucket problem, but less so in the free fall problem.

As the examples above demonstrate, questions of existence, uniqueness and well-posedness are of great importance not only to mathematics, but also to the applied mathematician and the numerical analyst. In general it is of importance to obtain a good understanding of whether the problems we want to solve have unique solutions that depend continuously on the data. If this is not the case, we can hardly expect our computational methods to be successful.

Even so, it is often impossible to verify that Lipschitz conditions are satisfied on sufficiently large domains. In fact, this is often neglected in practice, but one must then be aware of the risk of the odd surprise. Failures to solve problems are not uncommon, in which case it is also important to understand why there is a failure. Is it because of problem properties, or because of an unsuitable choice of computational method? Mastering theory is never a waste of time. Nothing is as practical as a good theory.

5.2 Stability

Stability is a key concept in all types of differential equations. It will appear numerous times in this book. It will refer to the stability of the original mathematical problem, as well as to the stability of the discrete problem, and, which is more complex, to the numerical method itself. By and large, these concepts are broadly related in the following way: if the mathematical problem is stable, then the discrete problem will be stable as well, provided that the numerical method is stable. This relation will, however, not hold without qualifications. For this reason, we will see that stability plays the central role in the numerical solution of differential equations.

For initial value problems, stability refers to the question whether two neighboring solutions will remain close when t → ∞. Here our original IVP is

\[ \frac{d}{dt}\, y = f(t,y); \quad y(0) = y_0 \]

and we will consider a perturbed problem

\[ \frac{d}{dt}\,(y + \Delta y) = f(t, y + \Delta y); \quad \Delta y(0) = \Delta y_0. \]

It then follows that the perturbation satisfies

\[ \frac{d}{dt}\,\Delta y = f(t, y + \Delta y) - f(t,y); \quad \Delta y(0) = \Delta y_0. \]

In classical Lyapunov stability theory, we ask whether for every ε > 0 there is a δ > 0 such that

‖∆y0‖ ≤ δ ⇒ ‖∆y(t)‖ ≤ ε,

for all t > 0. If this holds, the solution to the IVP obviously depends continuously on the data, i.e., the initial value. The interpretation is that, even if we perturb the initial value by a small amount, the perturbed solution will never deviate by more than a small amount from the solution of the original problem, even when t → ∞. We then say that the solution y(t) is stable. If, in addition, ‖∆y(t)‖ → 0 as t → ∞, then we say that y(t) is asymptotically stable. And if ‖∆y(t)‖ ≤ K · e^{−αt} for some positive constants K and α, then y(t) is exponentially stable. There are also further qualifications that offer additional stability notions.
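As a simple worked illustration of these notions (an example added here under the assumption of a scalar linear test problem, f(t,y) = λy with a real constant λ), the perturbation equation can be solved in closed form:

```latex
\[
\frac{d}{dt}\,\Delta y = \lambda\,\Delta y
\quad\Longrightarrow\quad
\Delta y(t) = e^{\lambda t}\,\Delta y_0 ,
\qquad
|\Delta y(t)| = e^{\lambda t}\,|\Delta y_0| .
\]
```

For λ ≤ 0 one may take δ = ε, so every solution is stable. For λ < 0 the solutions are in addition asymptotically and exponentially stable, with K = |∆y0| and α = −λ. For λ > 0 the perturbation grows without bound, and no δ can be found.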

Chapter 6
Stability theory

Because stability plays such a central role in the numerical treatment of differential equations, we shall devote a chapter to lay down some foundations of stability theory. The details vary between initial value problems and boundary value problems; between differential equations and their discretizations (usually some form of difference equations); and between linear systems and nonlinear systems. However, the stability notions have something in common: the solutions to our problem must have a continuous dependence on the data, even though what constitutes “data” also varies.

The continuous data dependence would not be such a special issue if it did not also include some transfinite process. Thus we are interested in how solutions behave as t → ∞ in the initial value case, or in how a discretization behaves as the number of discretization points N → ∞. As long as we study ordinary differential equations, we are interested in single parameter limits, but in partial differential equations we face multiparametric limits, which makes stability an intricate matter.

As we cannot deal with all these issues at once, this chapter will focus on the immediate needs for first order initial value problems, but we will nevertheless develop tools that allow extensions also to other applications.

6.1 Linear stability theory

By considering a linear, constant coefficient problem, the stability notions become clearer, since linear systems have a much simpler behavior. Thus, if

\[ \frac{d}{dt}\, y = Ay; \quad y(0) = y_0 \]

and

\[ \frac{d}{dt}\,(y + \Delta y) = A(y + \Delta y); \quad \Delta y(0) = \Delta y_0, \]

it follows that

\[ \frac{d}{dt}\,\Delta y = A\,\Delta y; \quad \Delta y(0) = \Delta y_0. \]

We note that the perturbed differential equation is identical to the original problem. Thus we may consider the original linear system directly, and whether it has solutions depending continuously on y0. In particular, no matter how we choose y0, the solution y(t) must remain bounded. Since y = 0 is a solution of the problem (for y0 = 0), we say that the zero solution is stable if y(t) remains bounded for all t > 0. But as this applies to every solution, we can speak of a stable linear system.

We begin by considering a linear system with constant coefficients,

ẏ = Ay; y(0) = y0.

Does the solution grow or decay? The exact solution is

\[ y(t) = e^{tA} y_0, \]

and it follows that ‖y(t)‖ ≤ ‖e^{tA}‖ ‖y0‖. Therefore, ‖y(t)‖ → 0 as ‖y0‖ → 0, provided that, for all t ≥ 0, it holds that

\[ \| e^{tA} \| \le C. \qquad (6.1) \]

This is the crucial stability condition, and the question is: for what matrices A is e^{tA} bounded?

The answer is well known, and is formulated in terms of the eigenvalues of A. Let us therefore make a brief excursion into eigenvalue theory, which will play a central role not only for initial value problems but also in boundary value problems and partial differential equations.

Definition 6.1. The set of all eigenvalues of $A \in \mathbb{R}^{d\times d}$ will be denoted by

\[ \lambda[A] = \{\lambda \in \mathbb{C} : Au = \lambda u \ \text{for some}\ u \neq 0\}, \]

and is referred to as the spectrum of $A$. Whenever the eigenvalues and eigenvectors are numbered, we write

\[ A u_k = \lambda_k[A] \cdot u_k \]

for $k = 1 : d$, where $u_k$ is the $k$th eigenvector, associated with the $k$th eigenvalue $\lambda_k[A]$.

For $A \in \mathbb{R}^{d\times d}$ there are always $d$ eigenvalues, counted with multiplicity. If the eigenvalues are distinct, then there are also $d$ linearly independent eigenvectors, and the matrix can be diagonalized. Thus, writing the eigenproblem

\[ AU = U\Lambda, \]

with $\Lambda = \mathrm{diag}\,\lambda_k$, and the eigenvectors arranged as the $d$ columns of the $d\times d$ matrix $U$, we have $U^{-1}AU = \Lambda$.

Now, if $Au = \lambda u$, it follows that $A^2 u = \lambda A u = \lambda^2 u$, and in general,

\[ \lambda_k[A^p] = (\lambda_k[A])^p, \]

for every power $p \ge 0$. This motivates that we write $\lambda[A^p] = \lambda^p[A]$. From this it follows that if $P$ is any polynomial, we have $\lambda[P(A)] = P(\lambda[A])$. Hence:

Lemma 6.1. Let $P$ be a polynomial of any degree, and let $\lambda[A]$ denote the spectrum of a matrix $A \in \mathbb{R}^{d\times d}$. Then

\[ \lambda[P(A)] = P(\lambda[A]). \tag{6.2} \]

If the matrix is diagonalizable, i.e., $U^{-1}AU = \Lambda$, then $U^{-1}A^pU = \Lambda^p$.

For the exponential function, which is defined by

\[ e^{A} = \sum_{n=0}^{\infty} \frac{A^n}{n!}, \]

it holds that $\lambda[e^{A}] = e^{\lambda[A]}$, even though the “polynomial” is a power series. This follows from

\[ e^{A} u_k = \sum_{n=0}^{\infty} \frac{A^n u_k}{n!} = \sum_{n=0}^{\infty} \frac{\lambda_k^n[A]\, u_k}{n!} = e^{\lambda_k[A]} u_k. \]

Although this result does not rely on $A$ being diagonalizable, it also holds that if $U^{-1}AU = \Lambda$, then $U^{-1}e^{A}U = e^{U^{-1}AU} = e^{\Lambda}$. In fact, if $f(z)$ is any analytic function, then $U^{-1} f(A) U = f(U^{-1}AU) = f(\Lambda)$. More generally, we have

Lemma 6.2. Let $A \in \mathbb{R}^{d\times d}$, and let $\lambda[A]$ denote the spectrum of a matrix. Then $\lambda[e^{A}] = e^{\lambda[A]}$, and

\[ |\lambda[e^{tA}]| = |e^{t\lambda[A]}| = e^{t\,\mathrm{Re}\,\lambda[A]}. \tag{6.3} \]

It immediately follows that $e^{tA}$ is bounded as $t \to \infty$ if $\lambda[A] \in \mathbb{C}^-$, i.e., if the eigenvalues of $A$ are located in the left half plane. Eigenvalues with zero real part are also acceptable if they are simple.

The result of Lemma 6.1 can also be extended. We note that if $A$ is nonsingular, it holds that

\[ Au = \lambda u \ \Rightarrow\ \lambda^{-1} u = A^{-1} u. \]

Therefore $\lambda_k[A^p] = (\lambda_k[A])^p$ holds also for negative powers, if only $A$ is invertible, so that no eigenvalue is zero. It follows that if a polynomial $Q(z)$ has the property that $\lambda[Q(A)] = Q(\lambda[A]) \neq 0$, then $Q^{-1}(A)$ exists. This implies that Lemma 6.1 can be extended to rational functions. This will be useful in connection with Runge–Kutta methods. Thus we have the following extension:

Lemma 6.3. Let $R(z) = P(z)/Q(z)$ be a rational function, where the degrees of the polynomials $P$ and $Q$ are arbitrary. Let $A \in \mathbb{R}^{d\times d}$, and let $\lambda[A]$ denote the spectrum of a matrix. If $Q(\lambda[A]) \neq 0$, then

\[ \lambda[R(A)] = R(\lambda[A]), \tag{6.4} \]

where $R(A) = P(A)\cdot Q^{-1}(A) = Q^{-1}(A)\cdot P(A)$.

Although a linear system $\dot y = Ay$ is stable whenever the eigenvalues are located in the left half plane, this eigenvalue theory does not extend to non-autonomous linear systems $\dot y = A(t)y$ and to nonlinear problems $\dot y = f(t,y)$. As more powerful tools are needed, the standard approach is to analyze stability in terms of some norm condition on the vector field.

6.2 Logarithmic norms

Let us begin by considering the linear constant coefficient system $\dot y = Ay$ once more, and find an equation for the time evolution of $\|y\|$. It satisfies

\[
\begin{aligned}
\frac{d\|y\|}{dt^+} &= \limsup_{h\to 0^+} \frac{\|y(t+h)\| - \|y(t)\|}{h} \\
&= \lim_{h\to 0^+} \frac{\|y(t) + hAy(t) + O(h^2)\| - \|y(t)\|}{h} \\
&\le \lim_{h\to 0^+} \frac{\|I + hA\| - 1}{h}\, \|y(t)\|,
\end{aligned}
\]

where $d/dt^+$ denotes the right-hand derivative. This is used as we are interested in the forward time evolution of $\|y(t)\|$.

Definition 6.2. Let $A \in \mathbb{R}^{d\times d}$. The upper logarithmic norm of $A$ is defined by

\[ M[A] = \lim_{h\to 0^+} \frac{\|I + hA\| - 1}{h}. \tag{6.5} \]

This limit can be shown to exist for all matrices. From the derivation above, we have obtained the differential inequality

\[ \frac{d\|y\|}{dt^+} \le M[A]\cdot\|y\|. \tag{6.6} \]

This differential inequality is easily solved. Note that

\[ \frac{d}{dt^+}\left( \|y\|\, e^{-tM[A]} \right) = e^{-tM[A]} \left( \frac{d\|y\|}{dt^+} - M[A]\cdot\|y\| \right) \le 0. \]

Hence $\|y(t)\|\, e^{-tM[A]} \le \|y(0)\|$, and it follows that $\|y(t)\| \le e^{tM[A]}\|y(0)\|$ for all $t \ge 0$. Since $y(t) = e^{tA}y(0)$, we have:

Theorem 6.1. For every $A \in \mathbb{R}^{d\times d}$, for any vector norm, and for $t \ge 0$, it holds that

\[ \|e^{tA}\| \le e^{tM[A]}. \tag{6.7} \]

The reason why this is of interest is that we have the following result on stability:

Corollary 6.1. Let $A \in \mathbb{R}^{d\times d}$ and assume that $M[A] \le 0$. Then

\[ \|e^{tA}\| \le 1, \tag{6.8} \]

for all $t \ge 0$.

Therefore, if the logarithmic norm is non-positive, the system is stable. Note that the logarithmic norm is not a “norm” in the proper sense of the word. Unlike a true norm, the logarithmic norm can be negative, which makes it especially interesting in connection with stability theory. Table 6.1 shows how the logarithmic norm is calculated for the most common norms. Note that it is easily computed for the norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$, but that it is harder to compute it for the standard Euclidean norm, $\|\cdot\|_2$. We then need the spectral radius $\rho[\cdot]$, and the spectral abscissa, $\alpha[\cdot]$, defined by

\[ \rho[A] = \max_k |\lambda_k[A]|, \qquad \alpha[A] = \max_k \mathrm{Re}\,\lambda_k[A]. \tag{6.9} \]

| Vector norm | Matrix norm | Log norm $M[A]$ |
|---|---|---|
| $\lVert x\rVert_1 = \sum_i \lvert x_i\rvert$ | $\max_j \sum_i \lvert a_{ij}\rvert$ | $\max_j \left[\mathrm{Re}\,a_{jj} + \sum'_i \lvert a_{ij}\rvert\right]$ |
| $\lVert x\rVert_2 = \sqrt{\sum_i \lvert x_i\rvert^2}$ | $\sqrt{\rho[A^*A]}$ | $\alpha[(A+A^*)/2]$ |
| $\lVert x\rVert_\infty = \max_i \lvert x_i\rvert$ | $\max_i \sum_j \lvert a_{ij}\rvert$ | $\max_i \left[\mathrm{Re}\,a_{ii} + \sum'_j \lvert a_{ij}\rvert\right]$ |

Table 6.1 Computation of matrix and logarithmic norms. The functions $\rho[\cdot]$ and $\alpha[\cdot]$ refer to the spectral radius and spectral abscissa, respectively. The matrix $A^*$ is the (possibly complex conjugate) transpose of $A$. The primed sums omit the diagonal term
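The formulas of Table 6.1 are straightforward to evaluate. The following is a minimal sketch using only the Python standard library; the function names and the test matrix are illustrative assumptions, and the Euclidean case is restricted to real 2×2 matrices so that the largest eigenvalue of the symmetric part has a closed form.

```python
import math

def lognorm_1(A):
    """M_1[A]: max over columns j of Re a_jj + sum of |a_ij| for i != j."""
    d = len(A)
    return max(A[j][j].real + sum(abs(A[i][j]) for i in range(d) if i != j)
               for j in range(d))

def lognorm_inf(A):
    """M_inf[A]: max over rows i of Re a_ii + sum of |a_ij| for j != i."""
    d = len(A)
    return max(A[i][i].real + sum(abs(A[i][j]) for j in range(d) if j != i)
               for i in range(d))

def lognorm_2_real2x2(A):
    """M_2[A] for a real 2x2 matrix: largest eigenvalue of (A + A^T)/2."""
    a, c = A[0][0], A[1][1]
    b = (A[0][1] + A[1][0]) / 2
    return (a + c) / 2 + math.hypot((a - c) / 2, b)

# Illustrative matrix (an assumption, not taken from the text).
A = [[-3.0, 1.0],
     [0.5, -2.0]]
print(lognorm_1(A), lognorm_inf(A), lognorm_2_real2x2(A))
```

For this matrix all three logarithmic norms are negative, so Corollary 6.1 applies in each of the three norms.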

In order to use norms and logarithmic norms in the analysis that follows, we recall the properties of these norms.

Definition 6.3. A vector norm $\|\cdot\|$ satisfies the following axioms:

1. $\|x\| \ge 0$; $\|x\| = 0 \Leftrightarrow x = 0$
2. $\|\gamma x\| = |\gamma|\cdot\|x\|$
3. $\|x + y\| \le \|x\| + \|y\|$.

Definition 6.4. The operator norm induced by the vector norm $\|\cdot\|$ is defined by

\[ \|A\| = \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|}. \tag{6.10} \]

The operator norm is a matrix norm, and hence satisfies the same rules as the vector norm. However, being an operator norm, it has an additional property. By construction, it satisfies

\[ \|Ax\| \le \|A\|\cdot\|x\|, \tag{6.11} \]

from which it follows that the operator norm is submultiplicative. This means that

\[ \|AB\| \le \|A\|\cdot\|B\|. \tag{6.12} \]

This follows directly from $\|ABx\| \le \|A\|\cdot\|Bx\| \le \|A\|\cdot\|B\|\cdot\|x\|$.

The logarithmic norm has already been defined above by (6.5), by the limit

\[ M[A] = \lim_{h\to 0^+} \frac{\|I + hA\| - 1}{h}. \]

Thus it is defined in terms of the operator norm, which in turn is defined in terms of the vector norm. All three are thus connected, and the computation of these quantities is linked as shown in Table 6.1.

The logarithmic norm has a wide range of applications in both initial value and boundary value problems, as well as in algebraic equations. Later on, we shall also see that it can be extended to nonlinear maps and to differential operators. Like the operator norm, it has a number of useful properties that play an important role in deriving error and perturbation bounds.

Theorem 6.2. The upper logarithmic norm $M[A]$ of a matrix $A \in \mathbb{R}^{d\times d}$ has the following properties:

1. $M[A] \le \|A\|$
2. $M[A + zI] = M[A] + \mathrm{Re}\,z$
3. $M[\gamma A] = \gamma\, M[A]$, $\gamma \ge 0$
4. $M[A + B] \le M[A] + M[B]$
5. $\|e^{tA}\| \le e^{tM[A]}$, $t \ge 0$

It is also easily demonstrated that the operator norm and the logarithmic norm are related to the spectral bounds (6.9). Thus

\[ \alpha[A] \le M[A]; \qquad \rho[A] \le \|A\|. \tag{6.13} \]

Consequently, if $M[A] < 0$, then all eigenvalues $\lambda[A] \in \mathbb{C}^-$ and $A$ is invertible.

We are now in a position to address more important stability issues. Let us consider a perturbed linear constant coefficient problem,

\[ \dot y = Ay + p(t); \quad y(0) = y_0, \tag{6.14} \]

where $p$ is a perturbation function. This satisfies the differential inequality

\[ \frac{d\|y\|}{dt^+} \le M[A]\cdot\|y\| + \|p\|, \]

with solution

\[ \|y(t)\| \le e^{tM[A]}\|y_0\| + \int_0^t e^{(t-\tau)M[A]}\,\|p(\tau)\|\,d\tau. \]

We have already treated the case $p \equiv 0$, so let us instead consider the case $p \neq 0$ and $y_0 = 0$, and answer the question of how large $\|y(t)\|$ can become given the perturbation $p$. Thus, letting

\[ \|p\|_\infty = \sup_{t\ge 0}\|p(t)\|, \]

we have

\[ \|y(t)\| \le \|p\|_\infty \int_0^t e^{(t-\tau)M[A]}\,d\tau. \]

Assuming that $M[A] \neq 0$ and evaluating the integral, we find the perturbation bound

\[ \|y(t)\| \le \|p\|_\infty\, \frac{e^{tM[A]} - 1}{M[A]}. \tag{6.15} \]

In case $M[A] > 0$ the bound grows exponentially. More interestingly, if $M[A] < 0$, the exponential term decays, and

\[ \|y\|_\infty \le -\frac{\|p\|_\infty}{M[A]}. \tag{6.16} \]

Then $y(t)$ can never exceed the bound given by (6.16). We summarize in a theorem:

Theorem 6.3. Let $\dot y = Ay + p(t)$ with $y(0) = 0$. Let $\|p\|_\infty = \sup_{t\ge 0}\|p(t)\|$. Assume that $M[A] \neq 0$. Then

\[ \|y(t)\| \le \|p\|_\infty\, \frac{e^{tM[A]} - 1}{M[A]}; \quad t \ge 0. \tag{6.17} \]

If $M[A] = 0$, then

\[ \|y(t)\| \le \|p\|_\infty \cdot t; \quad t \ge 0. \tag{6.18} \]

If $M[A] < 0$ it holds that

\[ \|y\|_\infty \le -\frac{\|p\|_\infty}{M[A]}. \tag{6.19} \]

Note that this theorem allows an exponentially growing bound in (6.17), a linearly growing bound in (6.18), and a uniform upper bound in (6.19). The latter is the limit in (6.17) as $t \to \infty$ if $M[A] < 0$. Note that the different bounds are primarily distinguished by the sign of $M[A]$, which governs stability.
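The uniform bound (6.19) can be observed numerically. The sketch below is an illustration under stated assumptions: the matrix $A$, the forcing $p(t)$, the step size, and the use of the classical fourth order Runge–Kutta method as integrator are all choices made here, not taken from the text. The max norm is used, for which $M_\infty[A] = -1.5$ by the row formula of Table 6.1, and $\|p\|_\infty = 1$.

```python
import math

A = [[-3.0, 1.0],
     [0.5, -2.0]]
M_INF = -1.5   # upper logarithmic norm of A in the max norm (Table 6.1, row formula)

def p(t):
    # Illustrative perturbation; its max norm never exceeds 1 and equals 1 at t = 0.
    return [math.sin(t), math.cos(2.0 * t)]

def rhs(t, y):
    pt = p(t)
    return [A[0][0] * y[0] + A[0][1] * y[1] + pt[0],
            A[1][0] * y[0] + A[1][1] * y[1] + pt[1]]

def rk4_step(t, y, h):
    # One step of the classical Runge-Kutta method (assumed integrator).
    k1 = rhs(t, y)
    k2 = rhs(t + h / 2, [y[i] + h / 2 * k1[i] for i in range(2)])
    k3 = rhs(t + h / 2, [y[i] + h / 2 * k2[i] for i in range(2)])
    k4 = rhs(t + h, [y[i] + h * k3[i] for i in range(2)])
    return [y[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) for i in range(2)]

h, steps = 1e-3, 10000
t, y = 0.0, [0.0, 0.0]          # y(0) = 0 as in Theorem 6.3
largest = 0.0
for _ in range(steps):
    y = rk4_step(t, y, h)
    t += h
    largest = max(largest, abs(y[0]), abs(y[1]))

uniform_bound = -1.0 / M_INF    # right-hand side of (6.19) with ||p||_inf = 1
print(largest, uniform_bound)
```

The largest observed value of $\|y(t)\|_\infty$ on $[0, 10]$ stays below the bound $-\|p\|_\infty / M_\infty[A] = 2/3$.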

Let us simplify the problem further and assume that $p$ is constant in (6.14). Then, as $M[A] < 0$ implies that the exponential goes to zero (since $\lambda[A] \in \mathbb{C}^-$ by (6.13)), there must be a unique stationary solution $\bar y$ to (6.14), satisfying

\[ 0 = A\bar y + p \ \Rightarrow\ \bar y = -A^{-1}p. \]

Thus

\[ \|\bar y\| = \|A^{-1}p\| \le -\frac{\|p\|}{M[A]}, \]

and it follows that

\[ \frac{\|A^{-1}p\|}{\|p\|} \le -\frac{1}{M[A]}. \]

Because $p$ is an arbitrary constant vector,

\[ \sup_{p\neq 0} \frac{\|A^{-1}p\|}{\|p\|} = \|A^{-1}\| \le -\frac{1}{M[A]}. \]

Thus we have derived the following important, but less obvious result:

Theorem 6.4. Let $A \in \mathbb{R}^{d\times d}$ and assume that $M[A] < 0$. Then $A$ is invertible, with

\[ \|A^{-1}\| \le -\frac{1}{M[A]}. \tag{6.20} \]

This result will be used numerous times in various situations, both in initial value problems and in boundary value problems. It is the simplest version of a more general result, known as the uniform monotonicity theorem.
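As a quick sanity check of (6.20), the following sketch compares $\|A^{-1}\|_\infty$ with $-1/M_\infty[A]$ for a small matrix. The matrix and the helper names are illustrative assumptions; the explicit inverse is written out for the 2×2 case only.

```python
def inv2x2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det],
            [-c / det, a / det]]

def norm_inf(A):
    """Operator norm induced by the max norm: largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

def lognorm_inf(A):
    """Upper logarithmic norm in the max norm (row formula of Table 6.1)."""
    d = len(A)
    return max(A[i][i] + sum(abs(A[i][j]) for j in range(d) if j != i)
               for i in range(d))

A = [[-3.0, 1.0],
     [0.5, -2.0]]              # illustrative matrix with M_inf[A] = -1.5 < 0
inverse_norm = norm_inf(inv2x2(A))
bound = -1.0 / lognorm_inf(A)
print(inverse_norm, bound)
```

Here the bound $2/3$ is only slightly larger than the actual norm of the inverse, $3.5/5.5$.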

The theory above collects some of the most important results in linear stability theory, both in terms of eigenvalues and in terms of norms and logarithmic norms. The theory is general and is valid for all norms. However, if we specialize to inner product norms (Hilbert space) we obtain stronger results, in the sense that they can be directly extended beyond elementary matrix theory.

6.3 Inner product norms

Inner products generate (and generalize) the Euclidean norm. They are defined as follows.

Definition 6.5. An inner product is a bilinear form $\langle\cdot,\cdot\rangle : \mathbb{C}^d \times \mathbb{C}^d \to \mathbb{C}$ satisfying

1. $\langle u,u\rangle \ge 0$; $\langle u,u\rangle = 0 \Leftrightarrow u = 0$
2. $\langle u,v\rangle = \overline{\langle v,u\rangle}$
3. $\langle u,\alpha v\rangle = \alpha\langle u,v\rangle$
4. $\langle u,v+w\rangle = \langle u,v\rangle + \langle u,w\rangle$,

generating the Euclidean norm by $\langle u,u\rangle = \|u\|_2^2$.

Above, the bar denotes complex conjugate. In most cases we will only consider real inner products, in which case the bar can be neglected. However, the complex notation above enables us to also discuss operators with complex eigenvalues.

An inner product generalizes the notion of scalar product. Apart from the properties listed above, it has a few more essential properties. One of the most important is the following:

Theorem 6.5. (Cauchy–Schwarz inequality) For all $u,v \in \mathbb{C}^d$, it holds that

\[ -\|u\|_2\cdot\|v\|_2 \le \mathrm{Re}\,\langle u,v\rangle \le \|u\|_2\cdot\|v\|_2. \]

For $u,v \in \mathbb{R}^d$, it holds that

\[ -\|u\|_2\cdot\|v\|_2 \le \langle u,v\rangle \le \|u\|_2\cdot\|v\|_2. \]

This distinguishes between the case of real and complex vector spaces. Note that when we have operators with complex conjugate eigenvalues, the corresponding eigenvectors are also complex, which more or less necessitates the use of complex vector spaces. Whether the vector space is real or complex, however, we always have the following:

Definition 6.6. The operator norm associated with $\langle\cdot,\cdot\rangle$ is

\[ \|A\|_2^2 = \sup_{u\neq 0} \frac{\langle Au,Au\rangle}{\langle u,u\rangle} = \sup_{u\neq 0} \frac{\|Au\|_2^2}{\|u\|_2^2}. \]

For vectors in finite dimensions, we may denote the inner product by $\langle u,v\rangle = u^*v$, where $u^*$ denotes the transpose, or, in the complex case, the complex conjugate transpose. With this notation, we have $u^*u = \|u\|_2^2$, from which it follows that $\langle Au,Au\rangle \le \|A\|_2^2\cdot\|u\|_2^2$. This is easily seen to be equivalent to the standard definition of the operator norm for a general choice of norm. For the logarithmic norm, the situation is similar, but we give an alternative definition for an inner product norm, since this is not only convenient, but also turns out to allow the logarithmic norm to be defined for a somewhat wider class of vector fields. In addition, we obtain a natural definition of a lower as well as the upper logarithmic norm.

Definition 6.7. The lower and upper logarithmic norms associated with the inner product $\langle\cdot,\cdot\rangle$ are defined as

\[ m_2[A] = \inf_{u\neq 0} \frac{\mathrm{Re}\,\langle u,Au\rangle}{\|u\|_2^2}, \qquad M_2[A] = \sup_{u\neq 0} \frac{\mathrm{Re}\,\langle u,Au\rangle}{\|u\|_2^2}. \tag{6.21} \]

This implies that $m_2[A] = -M_2[-A]$, and that we have the following bounds,

\[ m_2[A]\cdot\|u\|_2^2 \le \mathrm{Re}\,\langle u,Au\rangle \le M_2[A]\cdot\|u\|_2^2. \tag{6.22} \]

Again, if we write the inner product as $\langle u,v\rangle = u^*v$, we find that

\[ \|A\|_2^2 = \sup_{u\neq 0} \frac{u^*A^*Au}{u^*u} \]

and

\[ M_2[A] = \sup_{u\neq 0} \frac{\mathrm{Re}\,u^*Au}{u^*u}. \]

Thus the norm of a matrix, as well as the lower and upper logarithmic norms, are extrema of two quadratic forms, albeit with different matrices, $A^*A$ and $A$, respectively. It is therefore in order to investigate how these extrema are found.

Let $C$ be a given matrix, and let

\[ q(u) = \frac{\mathrm{Re}\,u^*Cu}{u^*u} \]

denote the Rayleigh quotient formed by $C$ and the vector $u$. We will find its extrema by finding its stationary points, i.e., by solving the equation $\mathrm{grad}_u\, q = 0$. Now,

\[
\begin{aligned}
\mathrm{grad}_u\, q &= \frac{\mathrm{Re}\left[u^*u\cdot\mathrm{grad}_u(u^*Cu) - u^*Cu\cdot\mathrm{grad}_u(u^*u)\right]}{u^*u\cdot u^*u} \\
&= \frac{\mathrm{Re}\left[u^*u\cdot(u^*C + u^*C^*) - u^*Cu\cdot(2u^*)\right]}{u^*u\cdot u^*u} := 0,
\end{aligned}
\]

which, upon (conjugate) transposition, gives the equation

\[ \frac{C + C^*}{2}\, u = q(u)\cdot u \]

for the determination of the stationary points. This is obviously an eigenvalue problem, where $q(u)$ is an eigenvalue of the symmetric matrix $(C + C^*)/2$. Thus, in case we take $C = A^*A$, we obtain the eigenvalue problem

\[ A^*Au = \lambda u, \]

showing that $\|A\|_2^2$ is the largest eigenvalue of $A^*A$. This eigenvalue is real and positive and $\sigma^2 := \lambda_{\max}[A^*A]$ is the square of the maximal singular value of $A$.

On the other hand, if we take $C = A$ (which is not a priori symmetric), we still end up with a symmetric eigenvalue problem for the stationary points,

\[ \frac{A + A^*}{2}\, u = \lambda\cdot u. \]

The eigenvalues of $(A + A^*)/2$ are real, but they are not necessarily positive. In fact, we have just demonstrated that the logarithmic norm is given by

\[ M_2[A] = \max_k \lambda_k\!\left[\frac{A + A^*}{2}\right], \]

as was indicated in Table 6.1. The Euclidean norm is sometimes referred to as the spectral norm, as operator norms and logarithmic norms are determined by the spectrum of symmetrized operators associated with $A$. We summarize:

Theorem 6.6. In terms of the spectral radius and spectral abscissa, it holds that

\[ \|A\|_2 = \sqrt{\rho[A^*A]}; \qquad M_2[A] = \alpha\!\left[\frac{A + A^*}{2}\right]. \tag{6.23} \]
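Theorem 6.6 reduces both quantities to symmetric eigenvalue problems, which have a closed form solution for real 2×2 matrices. The sketch below is restricted to that case; the helper names and the example matrix are illustrative assumptions. The matrix is chosen to be strongly non-normal: its eigenvalues are both $-1$, so $\alpha[A] = -1 < 0$, and yet $M_2[A] > 0$.

```python
import math

def sym2x2_eigs(a, b, c):
    """Eigenvalues (smallest, largest) of the real symmetric matrix [[a, b], [b, c]]."""
    mean, radius = (a + c) / 2, math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius

def norm_2(A):
    """Euclidean operator norm: square root of the largest eigenvalue of A^T A."""
    (a, b), (c, d) = A
    # A^T A = [[a^2 + c^2, ab + cd], [ab + cd, b^2 + d^2]]
    return math.sqrt(sym2x2_eigs(a * a + c * c, a * b + c * d, b * b + d * d)[1])

def lognorms_2(A):
    """(m_2[A], M_2[A]): extreme eigenvalues of the symmetric part (A + A^T)/2."""
    (a, b), (c, d) = A
    return sym2x2_eigs(a, (b + c) / 2, d)

A = [[-1.0, 10.0],
     [0.0, -1.0]]               # upper triangular, both eigenvalues equal -1
m2, M2 = lognorms_2(A)
print(m2, M2, norm_2(A))
```

This is consistent with (6.13): $\alpha[A] \le M_2[A]$ and $\rho[A] \le \|A\|_2$, but neither inequality needs to be sharp when $A$ is far from normal.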

It remains to show that the alternative definition (6.21) is compatible with the previous definition (6.5) whenever $\|A\|_2 < \infty$ (which corresponds to a Lipschitz condition). Note that, as $h \to 0^+$,

\[
\begin{aligned}
\|I + hA\|_2 &= \sqrt{\sup_{u\neq 0} \frac{\|(I + hA)u\|_2^2}{\|u\|_2^2}} \\
&= \sup_{u\neq 0} \sqrt{\frac{u^*(I + hA)^*(I + hA)u}{u^*u}} \\
&= \sup_{u\neq 0} \sqrt{\frac{u^*u + h\,u^*(A + A^*)u + O(h^2)}{u^*u}} \\
&= \sup_{u\neq 0} \sqrt{1 + h\,\frac{u^*(A + A^*)u}{u^*u} + O(h^2)} \\
&= 1 + h \sup_{u\neq 0} \frac{u^*(A + A^*)u}{2u^*u} + O(h^2) \\
&= 1 + h\,M_2[A] + O(h^2).
\end{aligned}
\]

Hence it follows that (6.21) and (6.5) represent the same limit, in case $\|A\|_2 < \infty$. However, we shall see later that (6.21) applies also in the case when $A$ is an unbounded operator.

A more important aspect of using inner products is that, since $\|u\|_2^2 = u^*u$ is differentiable,

\[ \|u\|_2\,\frac{d\|u\|_2}{dt} = \frac{1}{2}\frac{d\|u\|_2^2}{dt} = \frac{1}{2}\frac{d(u^*u)}{dt} = \frac{\dot u^*u + u^*\dot u}{2} = \mathrm{Re}\,(u^*\dot u). \]

Therefore, if we consider the linear system $\dot u = Au$, we can assess stability by investigating the projection of the derivative $\dot u$ on $u$, i.e.,

\[ \mathrm{Re}\,u^*\dot u = \mathrm{Re}\,u^*Au \le M_2[A]\cdot\|u\|_2^2 \]

and it follows that

\[ m_2[A]\cdot\|u\|_2 \le \frac{d\|u\|_2}{dt} \le M_2[A]\cdot\|u\|_2. \tag{6.24} \]

The upper bound is the same differential inequality as we had before, when the concept was introduced for general norms. The reason why the technique above is of special importance is that it is a standard technique in partial differential equations, when $A$ represents a differential operator in space.

6.4 Matrix categories

Although general norms have their place in matrix theory and in the analysis of differential equations, inner product norms are particularly useful. The mathematics of Hilbert space plays a central role in most of applied mathematics, and will be the preferred setting in this book.

Inner products allow the notion of orthogonality. Thus two vectors are orthogonal if $\langle u,v\rangle = 0$. In line with the notation used above, we will write this $u^*v = 0$. Orthogonality is the key idea behind some of the best known methods in applied mathematics, such as the least squares method, and, in the present context, the finite element method. These methods find a best approximation by requiring that the residual is orthogonal to the span of the basis functions; hence the residual cannot be made any smaller in the inner product norm.

Although there are several ways of constructing inner product norms, we will let $\|u\|_2^2 = u^*u$ denote the associated norm unless a special construction is emphasized. In this section, however, the norm refers to the standard Euclidean norm.

Just like there is a (conjugate) transpose of a vector, there is a (conjugate) transpose of a matrix. The definition is

\[ \langle u,Av\rangle = \langle A^*u,v\rangle, \]

for all vectors $u,v$. Now, since $\langle u,v\rangle = u^*v$, we have

\[ u^*Av = \langle u,Av\rangle = \langle A^*u,v\rangle = (A^*u)^*v, \]

so $(A^*u)^* = u^*A$, and $(A^*)^* = A$. We shall return to this in connection with differential operators, where $A^*$ is known as the adjoint of $A$.

Within this framework, there are several important classes of matrices that we will encounter many times. Below in Table 6.2 these classes are characterized. In addition, for each class, the spectral properties and logarithmic norms are given.

| Property | Name | $\lambda_k[A]$ | $m_2[A]$ | $M_2[A]$ | $\lVert A\rVert_2$ |
|---|---|---|---|---|---|
| $A^* = A$ | symmetric | real | $-\alpha[-A]$ | $\alpha[A]$ | $\rho[A]$ |
| $A^* = -A$ | skew-symmetric | $i\omega_k$ | $0$ | $0$ | $\rho[A]$ |
| $A^* = A^{-1}$ | orthogonal | $e^{i\varphi_k}$ | $-\alpha[-A]$ | $\alpha[A]$ | $1$ |
| $A^*A = AA^*$ | normal | complex | $-\alpha[-A]$ | $\alpha[A]$ | $\rho[A]$ |
| | positive definite | − | $> 0$ | − | − |
| | negative definite | − | − | $< 0$ | − |
| | indefinite | − | $< 0$ | $> 0$ | − |
| | contraction | − | − | − | $< 1$ |

Table 6.2 Matrix categories and logarithmic norms. $A^*$ is the (conjugate) transpose of $A$. All listed categories of matrices have (or can be arranged to have) orthogonal eigenvectors. The most general class is normal matrices; all categories above are normal

The names of the classes of matrices vary depending on the context. The names given in Table 6.2 refer to standard terminology for real matrices, $A \in \mathbb{R}^{d\times d}$. For complex matrices $A \in \mathbb{C}^{d\times d}$, the corresponding terms are, respectively, Hermitian; skew-Hermitian; unitary; and normal. For more general linear operators, such as linear differential operators, the terms are self-adjoint; anti-selfadjoint; unitary; and normal.

For example, in a linear system $\dot u = Au$, where $A$ is skew-symmetric, we have

\[ \frac{d}{dt}\log\|u\|_2 = \frac{1}{2\|u\|_2^2}\,\frac{d\|u\|_2^2}{dt} = \frac{\mathrm{Re}\,\langle u,\dot u\rangle}{\langle u,u\rangle} = \frac{\mathrm{Re}\,\langle u,Au\rangle}{\langle u,u\rangle} = \frac{\mathrm{Re}\,u^*Au}{u^*u} = 0. \]

Thus it follows that $\|u(t)\|_2$ remains constant in such a system; problems of this type are referred to as conservation laws, and occur e.g. in transport equations in partial differential equations. They require special numerical methods that replicate the conservation law, keeping the norm of the solution constant. For other classes of matrices, there may be similar concerns whether we can construct methods that have a proper qualitative behavior.
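To see why conservation calls for care in the choice of method, consider the skew-symmetric matrix $A = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$. The sketch below compares two time-stepping schemes on $\dot u = Au$. Both the choice of matrix and the second scheme, the implicit midpoint rule $u_{n+1} = (I - hA/2)^{-1}(I + hA/2)\,u_n$, are assumptions made for this illustration; the explicit Euler recursion $u_{n+1} = (I + hA)\,u_n$ is used for contrast.

```python
import math

h, steps = 0.1, 100
a = h / 2

def euler_step(u):
    # u_{n+1} = (I + hA) u_n with A = [[0, 1], [-1, 0]]
    return [u[0] + h * u[1], u[1] - h * u[0]]

def midpoint_step(u):
    # u_{n+1} = (I - hA/2)^{-1} (I + hA/2) u_n, written out for this particular A
    s = 1 + a * a
    return [((1 - a * a) * u[0] + 2 * a * u[1]) / s,
            (-2 * a * u[0] + (1 - a * a) * u[1]) / s]

u_euler = u_midpoint = [1.0, 0.0]
for _ in range(steps):
    u_euler, u_midpoint = euler_step(u_euler), midpoint_step(u_midpoint)

norm_euler = math.hypot(*u_euler)
norm_midpoint = math.hypot(*u_midpoint)
print(norm_euler, norm_midpoint)
```

For this matrix each explicit Euler step multiplies the Euclidean norm by $\sqrt{1 + h^2}$, so the computed norm grows steadily, whereas the midpoint step is an orthogonal map and keeps the norm at its initial value up to rounding.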

By contrast, if $M_2[A] < 0$, it follows that $\|u(t)\|_2 \to 0$. Thus the magnitude of the solution will decrease as $t \to \infty$, and $\|u(t)\|_2 \le \|u(0)\|_2$ for all $t \ge 0$.

We also see that definiteness can be characterized in terms of the logarithmic norms. Thus a positive definite matrix has $m_2[A] > 0$, corresponding to

\[ 0 < m_2[A] = \inf_{u\neq 0} \frac{\mathrm{Re}\,u^*Au}{u^*u}, \]

so the quadratic form only takes values in the right half-plane. Likewise, a negative definite matrix is characterized by $M_2[A] < 0$. The upper and lower logarithmic norms provide additional quantitative information, however, as the actual values of the logarithmic norms tell us how positive (or negative) definite a matrix is; this allows us to find stability margins. Note, however, that there are matrices that are neither positive nor negative definite.

6.5 Nonlinear stability theory

The stability of nonlinear systems is considerably more complicated than for linear systems. Yet there are strong similarities, even though the stability of a solution must be considered on a case-by-case basis. Let $u$ and $v$ be two solutions to the same IVP, with initial conditions $u(0) = u_0$ and $v(0) = v_0$. Then $u - v$ satisfies

\[ \frac{d}{dt}(u - v) = f(t,u) - f(t,v). \]

Taking the inner product with $u - v$, we find the differential inequality

\[ \frac{1}{2}\frac{d}{dt}\|u - v\|_2^2 = \langle u - v, f(t,u) - f(t,v)\rangle \le M_2[f]\cdot\|u - v\|_2^2, \]

where the upper logarithmic norm of $f(t,\cdot)$ is

\[ M_2[f] = \sup_{u\neq v} \frac{\langle u - v, f(t,u) - f(t,v)\rangle}{\|u - v\|_2^2}, \tag{6.25} \]

where $u,v$ are in the domain of $f(t,\cdot)$. In a similar way, taking the infimum instead, we obtain the lower logarithmic norm $m_2[f]$. Consequently, letting $\Delta u = u - v$ denote the difference between $u$ and $v$, we have the differential inequalities

\[ m_2[f]\cdot\|\Delta u\|_2 \le \frac{d}{dt}\|\Delta u\|_2 \le M_2[f]\cdot\|\Delta u\|_2, \]

which are completely analogous to those we obtained in the linear case. Thus we can bound the growth rate of $\|\Delta u\|_2$ in terms of $M_2[f]$.

In fact, if $f(t,\cdot)$ is Lipschitz with respect to its second argument, we have

\[ L_2^2[f] = \sup_{u\neq v} \frac{\langle f(t,u) - f(t,v), f(t,u) - f(t,v)\rangle}{\|u - v\|_2^2} = \sup_{u\neq v} \frac{\|f(t,u) - f(t,v)\|_2^2}{\|u - v\|_2^2}. \]

We can easily verify that $L[\cdot]$ is an operator (semi) norm; in fact, if $f(t,u) = Au$ is a linear map, we find that $L_2[A] = \|A\|_2$. This part of the theory is easily extended to general norms, so that we can define

\[ M[f] = \lim_{h\to 0^+} \frac{L[I + hf(t,\cdot)] - 1}{h}. \]

In case of the Euclidean norm, the logarithmic norm defined above is identical to the expression (6.25) above.

No matter what norm we choose, we obtain differential inequalities and perturbation bounds of the same structure as in the linear case. This extends linear theory to nonlinear systems. Let us therefore state two results of great importance in the analysis that follows.

Theorem 6.7. Let $\dot u = f(u) + p(t)$ and $\dot v = f(v)$ with $u(0) - v(0) = 0$. Let $\|p\|_\infty = \sup_{t\ge 0}\|p(t)\|$. Assume that $f : \mathbb{R}^d \to \mathbb{R}^d$ with $M[f] \neq 0$ and let $\Delta u = u - v$. Then

\[ \|\Delta u(t)\| \le \|p\|_\infty\, \frac{e^{tM[f]} - 1}{M[f]}; \quad t \ge 0. \tag{6.26} \]

If $M[f] = 0$, then

\[ \|\Delta u(t)\| \le \|p\|_\infty \cdot t; \quad t \ge 0. \tag{6.27} \]

If $M[f] < 0$ it holds that

\[ \|\Delta u\|_\infty \le -\frac{\|p\|_\infty}{M[f]}. \tag{6.28} \]

We note that this is a nonlinear version of the bounds (6.17)–(6.19). Here $v(t)$ represents the unperturbed solution, and $u(t)$ the solution obtained when a forcing perturbation term $p(t)$ drives the solution away from $v(t)$. Whether the perturbed solution grows or not is primarily governed by $M[f]$. We note that this result is valid for any norm, even though we will give preference to the Euclidean norm.

In the linear case we saw that if $M[A] < 0$, then any stationary solution is stable. In the nonlinear case we have a similar result. First we note that if $p = 0$ then $u = v$; this means that the solution to the system is unique. Now, in the theorem above, assume that $M[f] < 0$ and that $p \neq 0$ is constant. Then there is a unique stationary solution $\bar u$, satisfying

\[ 0 = f(\bar u) + p. \]

We shall see that we can write $\bar u = f^{-1}(-p)$, i.e., we want to show that the inverse map $f^{-1}$ exists. To this end, note that, by the Cauchy–Schwarz inequality,

\[ -\|f(u) - f(v)\|_2\,\|u - v\|_2 \le \langle u - v, f(u) - f(v)\rangle \le M_2[f]\cdot\|u - v\|_2^2 < 0 \]

for any distinct vectors $u,v \in \mathbb{R}^d$. Simplifying, we have

\[ -\|f(u) - f(v)\|_2 \le M_2[f]\cdot\|u - v\|_2 < 0. \]

This means that if $u \neq v$, then necessarily $f(u) \neq f(v)$. Hence $f$ is one-to-one, and we may write $f(u) = x$ and $f(v) = y$, with $u = f^{-1}(x)$ and $v = f^{-1}(y)$. It follows that

\[ \frac{\|f^{-1}(x) - f^{-1}(y)\|_2}{\|x - y\|_2} \le -\frac{1}{M_2[f]} \]

holds for all $x \neq y$. By taking the supremum of the left hand side we arrive at the following theorem:

Theorem 6.8. (Uniform Monotonicity Theorem) Assume that $f : \mathbb{R}^d \to \mathbb{R}^d$ with $M_2[f] < 0$. Then $f$ is invertible on $\mathbb{R}^d$ with a Lipschitz inverse, and

\[ L_2[f^{-1}] \le -\frac{1}{M_2[f]}. \tag{6.29} \]

The derivation above is only for inner product norms, but the result also holds for general norms. Likewise, it holds if we replace the condition $M[f] < 0$ by $m[f] > 0$. This can be compared to the linear case. Thus, if $A$ is a positive definite matrix (i.e., $m_2[A] > 0$) it has a bounded inverse, with $\|A^{-1}\|_2 \le 1/m_2[A]$. Similarly, a negative definite matrix has a bounded inverse. The uniform monotonicity theorem above generalizes those classical results to nonlinear maps.

An interesting consequence of the uniform monotonicity theorem is the following:

Corollary 6.2. Let $h > 0$ and assume that $f : \mathbb{R}^d \to \mathbb{R}^d$ with $M_2[hf] < 1$. Then $I - hf$ is invertible on $\mathbb{R}^d$ with a Lipschitz inverse, and

\[ L_2[(I - hf)^{-1}] \le \frac{1}{1 - M_2[hf]}. \tag{6.30} \]

Proof. This result is obtained from the elementary properties of $M[\cdot]$ in Theorem 6.2. Thus we note that

\[ M_2[hf - I] = M_2[hf] - 1. \]

By assumption $M_2[hf - I] < 0$, so $(I - hf)^{-1}$ is Lipschitz, with the constant given by (6.30).

This corollary will be seen to guarantee existence and uniqueness of solutions in implicit time-stepping methods for initial value problems.
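The role of Corollary 6.2 can be sketched for one step of an implicit method of the form $y_{n+1} = y_n + h f(y_{n+1})$, which requires solving $(I - hf)(y_{n+1}) = y_n$. The example below is an illustration under stated assumptions: the scalar function $f(y) = -y - y^3$, the step size, and the use of Newton's method as solver are choices made here. For this $f$ we have $f'(y) \le -1$, so $M[f] = -1$ and $M[hf] = -h < 1$ for every $h > 0$.

```python
def f(y):
    return -y - y ** 3          # f'(y) = -1 - 3y^2 <= -1, hence M[f] = -1

def implicit_step(y_n, h, tol=1e-14, max_iter=50):
    """Solve z - h*f(z) = y_n for z by Newton's method (scalar case)."""
    z = y_n
    for _ in range(max_iter):
        residual = z - h * f(z) - y_n
        slope = 1.0 + h * (1.0 + 3.0 * z * z)   # derivative of z - h*f(z), always >= 1 + h
        step = residual / slope
        z -= step
        if abs(step) < tol:
            break
    return z

h = 0.5
y0, z0 = 2.0, 1.0                               # two different starting values
y1, z1 = implicit_step(y0, h), implicit_step(z0, h)
lipschitz_bound = 1.0 / (1.0 - h * (-1.0))      # 1 / (1 - M[hf]) from (6.30)
print(abs(y1 - z1), lipschitz_bound * abs(y0 - z0))
```

The equation has a unique solution for each starting value, and the distance between the two results is smaller than $|y_0 - z_0| / (1 + h)$, as (6.30) predicts.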

6.6 Stability in discrete systems

There is a strong analogy between differential equations and difference equations. Corresponding to the linear and nonlinear initial value problems

\[
\begin{aligned}
\dot y &= Ay \\
\dot y &= f(y),
\end{aligned}
\]

we have the discrete systems

\[
\begin{aligned}
y_{n+1} &= Ay_n \\
y_{n+1} &= f(y_n).
\end{aligned}
\]

Beginning with the linear system, we saw that in the continuous case, stability was governed by having all eigenvalues in the left half-plane, $\alpha[A] < 0$, or, if norms were used, by having a non-positive upper logarithmic norm, $M[A] \le 0$. In the discrete linear case, the stability conditions require that we have the eigenvalues in the unit circle, $\rho[A] < 1$, or, in terms of norms, that $\|A\| \le 1$.

Thus the role of left half-plane is “replaced by” the unit circle in the discrete case; the spectral abscissa by the spectral radius; and the logarithmic norm by the matrix norm. In the discrete, linear case, the solution is $y_n = A^n y_0$, and the solution is bounded (stable) if

\[ \|A^n\| \le C \]

for all $n \ge 1$. A matrix satisfying this condition is called power bounded. Power boundedness is necessary for stability, but as it depends on the spectrum of $A$ as well as the eigenvectors, it may often be difficult to establish. On the other hand, using the submultiplicativity of the operator norm, we have

\[ \|A^n\| \le \|A\|^n. \]

Therefore $A$ is power bounded if $\|A\| \le 1$, and the latter condition is often much easier to establish. Since $A$ may be power bounded even if $\|A\| > 1$, the condition $\|A\| \le 1$ is sufficient for stability, but not necessary.
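A small experiment makes the gap between the two conditions concrete. In the sketch below the matrix is an illustrative assumption: it has $\rho[A] = 0.9 < 1$ but $\|A\|_\infty = 1.9 > 1$, so the norm condition fails while the powers nevertheless remain bounded.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def norm_inf(A):
    """Operator norm induced by the max norm: largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

A = [[0.9, 1.0],
     [0.0, 0.9]]               # double eigenvalue 0.9, not diagonalizable

P = [[1.0, 0.0], [0.0, 1.0]]
power_norms = []
for n in range(1, 201):
    P = matmul(P, A)           # P now holds A^n
    power_norms.append(norm_inf(P))

C = max(power_norms)           # smallest constant bounding ||A^n|| for n = 1..200
print(norm_inf(A), C, power_norms[-1])
```

The norms $\|A^n\|_\infty$ first grow beyond $\|A\|_\infty$, peak, and then decay towards zero, so $A$ is power bounded with a constant $C > 1$.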

In an analogous way, the solution of the continuous system is $y(t) = e^{tA}y(0)$, and the solution is bounded (stable) if

\[ \|e^{tA}\| \le C. \]

This, too, is more difficult to establish than the result obtained by using norms. Thus we have seen, by using differential inequalities, that

\[ \|e^{tA}\| \le e^{tM[A]} \]

for $t \ge 0$, and it follows that the system is stable if $M[A] \le 0$. Again, the latter is a sufficient but not a necessary condition.

In the nonlinear continuous case, we need to investigate the difference of two solutions, $u$ and $v$, and whether they remain close as $t \to \infty$. This is the case if $M[f] \le 0$. The situation is similar in the discrete case. We then have

\[
\begin{aligned}
u_{n+1} &= f(u_n) \\
v_{n+1} &= f(v_n).
\end{aligned}
\]

The difference between the solutions satisfies

\[ \|u_{n+1} - v_{n+1}\| = \|f(u_n) - f(v_n)\| \le L[f]\cdot\|u_n - v_n\|, \]

where $L[f]$ is the Lipschitz constant of $f$. Thus, if $L[f] < 1$, the distance between the solutions decreases. We have already seen in Theorem 2.1 that, provided that $f : D \to f(D) \subset D$ and $L[f] < 1$, there is a unique fixed point $\bar u$ solving the equation

\[ \bar u = f(\bar u). \]

This is then a “stationary solution” to the discrete dynamical system. In addition, we saw in (2.25) that $(I - f)^{-1}$ exists and is Lipschitz. In fact, in view of Corollary 6.2, the inverse exists under a slightly relaxed condition, $M[f] < 1$, and the error estimate (2.25) can be sharpened. We shall derive an improved bound, but restrict the derivation to inner product norms. Note that

\[ u_{n+1} - \bar u = f(u_n) - f(u_{n+1}) + f(u_{n+1}) - f(\bar u). \]

Hence

\[
\begin{aligned}
\|u_{n+1} - \bar u\|_2^2 &= \langle u_{n+1} - \bar u, u_{n+1} - \bar u\rangle \\
&= \langle u_{n+1} - \bar u, f(u_n) - f(u_{n+1})\rangle + \langle u_{n+1} - \bar u, f(u_{n+1}) - f(\bar u)\rangle \\
&\le \|u_{n+1} - \bar u\|_2\,\|f(u_n) - f(u_{n+1})\|_2 + M_2[f]\cdot\|u_{n+1} - \bar u\|_2^2 \\
&\le L_2[f]\cdot\|u_{n+1} - \bar u\|_2\,\|u_n - u_{n+1}\|_2 + M_2[f]\cdot\|u_{n+1} - \bar u\|_2^2.
\end{aligned}
\]

Simplifying, we obtain

\[ \|u_{n+1} - \bar u\|_2 \le L_2[f]\cdot\|u_n - u_{n+1}\|_2 + M_2[f]\cdot\|u_{n+1} - \bar u\|_2. \]

Note that $M_2[f] \le L_2[f]$ for all $f$, and that we have assumed $L_2[f] < 1$. Therefore $M_2[f] < 1$ and

\[ \|u_{n+1} - \bar u\|_2 \le \frac{L_2[f]}{1 - M_2[f]}\cdot\|u_{n+1} - u_n\|_2. \]

This bound is always preferable to (2.25) since $M_2[f] \le L_2[f]$. In particular, if $M_2[f] \le 0$, it follows that $\|u_{n+1} - \bar u\|_2 \le \|u_{n+1} - u_n\|_2$, expressing that the error is less than the computable difference between the last two iterates. Such a bound cannot be obtained using the Lipschitz constant alone, as in (2.25). We restate the fixed point theorem in improved form, for general norms, even though only the Euclidean norm was used above:

Theorem 6.9. (Fixed point theorem) Let $D$ be a closed set and assume that $f$ is a Lipschitz continuous map satisfying $f : D \subset \mathbb{R}^d \to D$. Then there exists a fixed point $\bar u \in D$. If, in addition, $L[f] < 1$ on $D$, then the fixed point $\bar u$ is unique, and the fixed point iteration converges to $\bar u$ for every starting value $u_0 \in D$, with the error estimate

\[ \|u_{n+1} - \bar u\| \le \frac{L[f]}{1 - M[f]}\cdot\|u_{n+1} - u_n\|. \tag{6.31} \]

Thus, in connection with discrete dynamical systems, there are also links to iterative methods for solving equations; such iterations are also discrete time dynamical systems, to which stability and contractivity applies. Returning to the stability interpretation, we have collected some elementary stability conditions in Table 6.3. While the conditions on the eigenvalues (spectrum) only apply to linear constant coefficient systems, the norm conditions apply to linear as well as nonlinear systems, but are only sufficient conditions; a system could be stable and yet fail to fulfill the norm condition.

| System type | $\dot y = Ay$ | $\dot y = f(y)$ | $y_{n+1} = Ay_n$ | $y_{n+1} = f(y_n)$ |
|---|---|---|---|---|
| Spectral condition | $\alpha[A] < 0$ | − | $\rho[A] < 1$ | − |
| Norm condition | $M[A] \le 0$ | $M[f] \le 0$ | $\lVert A\rVert \le 1$ | $L[f] \le 1$ |

Table 6.3 Stability conditions for linear and nonlinear systems. Elementary stability conditions are given in terms of the spectrum and norms in the linear constant coefficient case. The spectrum may reach the boundary of the left half plane or the unit circle, provided that a multiple eigenvalue has a full set of eigenvectors. The norm conditions are sufficient, but not necessary
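The error estimate (6.31) can be tried out on a scalar fixed point iteration. The sketch below rests on several assumptions made for illustration: the map $f(u) = \cos u$ on $D = [0, 1]$, the starting value, and the form $L/(1-L)\cdot\|u_{n+1} - u_n\|$ taken for the Lipschitz-only estimate referred to as (2.25). On $D$ we have $L[f] = \sin 1$, while $M[f] = \sup f'(u) = \sup(-\sin u) = 0$.

```python
import math

def f(u):
    return math.cos(u)          # maps D = [0, 1] into itself

L = math.sin(1.0)               # Lipschitz constant on D: sup |f'(u)|
M = 0.0                         # logarithmic norm on D: sup f'(u) = sup(-sin u)

# Reference fixed point, obtained by iterating far longer than the loop below.
u_bar = 1.0
for _ in range(200):
    u_bar = f(u_bar)

u, history = 1.0, []
for n in range(20):
    u_next = f(u)
    error = abs(u_next - u_bar)
    improved = L / (1.0 - M) * abs(u_next - u)     # estimate (6.31)
    classical = L / (1.0 - L) * abs(u_next - u)    # assumed form of the Lipschitz-only estimate
    history.append((error, improved, classical))
    u = u_next

for error, improved, classical in history[:5]:
    print(error, improved, classical)
```

In every step the true error lies below the estimate (6.31), which in turn is several times smaller than the estimate based on the Lipschitz constant alone.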

Another analogy is found in linear differential and difference equations.

Example Consider the linear differential equation

\[ \ddot y + 3\dot y + y = 0 \]

with suitable initial condition. The standard procedure is to insert the ansatz $y = e^{t\lambda}$ into the equation, to obtain the characteristic equation from which the possible values of $\lambda$ are determined. Thus

\[ \lambda^2 + 3\lambda + 1 = 0, \tag{6.32} \]

and it follows that $\lambda_{1,2} = (-3 \pm \sqrt{9 - 4})/2 = (-3 \pm \sqrt{5})/2$. The general solution is

\[ y(t) = C_1 e^{t\lambda_1} + C_2 e^{t\lambda_2}, \]

where the constants of integration $C_1$ and $C_2$ are determined by the initial condition. Obviously $y(t) = 0$ is a solution. It is stable, since both roots $\lambda_{1,2} \in \mathbb{C}^-$.

Example The corresponding example in difference equations is

\[ y_{n+2} + 3y_{n+1} + y_n = 0. \]

This time, however, we make the ansatz $y_n = \lambda^n$, which is an “exponential function” in discrete time. Upon insertion, we get

\[ \lambda^{n+2} + 3\lambda^{n+1} + \lambda^n = 0, \]

leading to the same characteristic equation (6.32) as before, since $\lambda \neq 0$ when we seek a nonzero solution. Naturally, we have the same roots $\lambda_{1,2}$. The general solution is

\[ y_n = C_1\lambda_1^n + C_2\lambda_2^n, \]

where $C_1$ and $C_2$ are determined by the initial conditions. The zero solution $y_n = 0$ is now unstable, since one root is outside the unit circle:

\[
\begin{aligned}
|\lambda_1| &= |-3 + \sqrt{5}|/2 \approx 0.382 < 1 \\
|\lambda_2| &= |-3 - \sqrt{5}|/2 \approx 2.618 > 1.
\end{aligned}
\]
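The instability is easy to observe by running the recursion $y_{n+2} = -3y_{n+1} - y_n$ directly. In the sketch below the initial values are arbitrary assumptions; any choice with $C_2 \neq 0$ shows the same behavior.

```python
import math

lam1 = (-3 + math.sqrt(5)) / 2     # root inside the unit circle
lam2 = (-3 - math.sqrt(5)) / 2     # root outside the unit circle

y_prev, y_curr = 1.0, 1.0          # arbitrary initial values y_0, y_1
for n in range(30):
    # advance (y_n, y_{n+1}) to (y_{n+1}, y_{n+2})
    y_prev, y_curr = y_curr, -3.0 * y_curr - y_prev

ratio = y_curr / y_prev            # ratio of consecutive iterates
print(abs(lam1), abs(lam2), ratio, y_curr)
```

The iterates grow rapidly in magnitude while alternating in sign, and the ratio of consecutive iterates approaches the dominant root $\lambda_2$.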

Thus, in linear differential and difference equations, stability is once more deter-mined by having the roots in the left half-plane, or in the unit circle. This is in factthe same result as we saw before:

Example By putting y = z in the second order problem above, it is transformed into asystem of first order equations. Thus

y = z

z =−y−3z,

with initial conditions y(0) = y0 and z(0) = y(0) = y0. In matrix–vector form we have

ddt

(yz

)=

(0 1−1 −3

)·(

yz

).

The matrix thus obtained is referred to as the companion matrix of the differential equation, and for stability its eigenvalues must be located in the left half plane. To determine its eigenvalues $\lambda_k[A]$, we need the characteristic equation,

$$\det(A - \lambda I) = \det\begin{pmatrix} -\lambda & 1 \\ -1 & -3-\lambda \end{pmatrix} = \lambda(\lambda + 3) + 1 = \lambda^2 + 3\lambda + 1 := 0.$$

This is the same characteristic equation as before, showing that spectral stability conditions are identical to those derived directly for higher order differential or difference equations.
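As a quick numerical confirmation, the sketch below (assuming NumPy; names are illustrative) compares the eigenvalues of the companion matrix with the roots of (6.32).

```python
import numpy as np

# Companion matrix of y'' + 3y' + y = 0, written as a first order system
A = np.array([[0.0, 1.0],
              [-1.0, -3.0]])

# Both eigenvalues are real here, so taking the real part loses nothing
eigenvalues = np.sort(np.linalg.eigvals(A).real)
char_roots = np.sort(np.roots([1.0, 3.0, 1.0]).real)

# Both computations should give (-3 -/+ sqrt(5))/2
print(eigenvalues, char_roots)
```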

Chapter 7
The Explicit Euler method

The first discrete method for solving initial value problems was devised by Euler in the mid 18th century. One of the greatest mathematicians of all times, Euler realized that many of the emerging problems of analysis could only be solved approximately. In the problem $\dot{y} = f(t,y)$, the “issue” is the derivative. Thus we have already noted that derivatives need to be approximated by finite differences in order to construct computable approximations.

In the differential equation

$$\dot{y} = f(t,y), \tag{7.1}$$

we start from the simplest approximation. We will compute a sequence of approximations, $y_n \approx y(t_n)$, such that

$$\frac{y_{n+1} - y_n}{\Delta t} = f(t_n, y_n). \tag{7.2}$$

This follows the pattern from the standard definition of the derivative. Since

$$\lim_{\Delta t \to 0} \frac{y(t_n + \Delta t) - y(t_n)}{\Delta t} = \dot{y}(t_n),$$

the finite difference approximation (7.2) is obtained by replacing the derivative in (7.1), using a finite time step $\Delta t > 0$. From (7.2) we create a recursion,

$$y_{n+1} = y_n + \Delta t \cdot f(t_n, y_n), \tag{7.3}$$

starting from the initial value $y_0 = y(0)$. This is the Explicit Euler method. It is the original time-stepping method, and all other types of time-stepping method constructions include the explicit Euler method as the simplest case.
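The recursion (7.3) translates almost literally into code. The following is a minimal sketch for scalar problems, assuming NumPy; the function name `explicit_euler` and its argument order are choices made here for illustration and are not taken from the text.

```python
import numpy as np

def explicit_euler(f, y0, T, N):
    """Take N explicit Euler steps of size dt = T/N for y' = f(t, y), y(0) = y0."""
    dt = T / N
    t = np.linspace(0.0, T, N + 1)
    y = np.empty(N + 1)
    y[0] = y0
    for n in range(N):
        # Recursion (7.3): step along the tangent sampled at (t_n, y_n)
        y[n + 1] = y[n] + dt * f(t[n], y[n])
    return t, y

# Illustrative use on y' = -y, y(0) = 1, whose exact solution is exp(-t)
t, y = explicit_euler(lambda t, y: -y, 1.0, 1.0, 100)
print(y[-1], np.exp(-1.0))
```

For a system of equations the same loop applies with `y` stored as a two-dimensional array, one row per time point.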


Fig. 7.1 Leonhard Euler (1707–1783). Portrait by J.E. Handmann (1753), Kunstmuseum Basel

7.1 Convergence

The Euler recursion implies that we sample the vector field $f(t,y)$ at the current point of approximation, i.e., at $(t_n, y_n)$, and then take one step of size $\Delta t$ in the direction of the tangent. Naturally, the exact solution will not satisfy this recursion. As before, we let $y(t_n)$ denote the exact solution of the differential equation at time $t_n$. Inserting the exact solution into the recursion (7.3), we obtain

$$y(t_{n+1}) = y(t_n) + \Delta t \cdot f(t_n, y(t_n)) - r_n, \tag{7.4}$$

where the local residual $r_n \neq 0$ signifies that the exact solution does not satisfy the recursion.

The first question is, how large is $r_n$? Let us assume that the exact solution $y$ is twice continuously differentiable on $[0,T]$. Expanding in a Taylor series, we obtain

$$y(t_{n+1}) = y(t_n) + \Delta t \cdot \dot{y}(t_n) + \Delta t^2 \cdot \frac{\ddot{y}(\theta_n)}{2},$$

for some $\theta_n \in [t_n, t_{n+1}]$. Since $\dot{y}(t_n) = f(t_n, y(t_n))$, we can compare to (7.4) and conclude that

$$r_n = -\Delta t^2 \cdot \frac{\ddot{y}(\theta_n)}{2}. \tag{7.5}$$

Hence $\|r_n\| = O(\Delta t^2)$ as $\Delta t \to 0$, provided that $y \in C^2[0,T]$.
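The quadratic scaling of the local residual can be observed directly. The sketch below evaluates $r_n$ from (7.4) for the illustrative problem $\dot{y} = -y$ with exact solution $e^{-t}$ (a test case chosen here, not taken from the text) and compares two step sizes; halving $\Delta t$ should reduce the residual by a factor close to 4.

```python
import math

def local_residual(dt, tn=0.0):
    """Residual r_n from (7.4) for y' = -y with exact solution y(t) = exp(-t)."""
    y_exact = lambda t: math.exp(-t)
    f = lambda t, y: -y
    return y_exact(tn) + dt * f(tn, y_exact(tn)) - y_exact(tn + dt)

r_coarse = local_residual(0.1)
r_fine = local_residual(0.05)
ratio = r_coarse / r_fine   # close to 4 if r_n = O(dt^2)
print(r_coarse, r_fine, ratio)
```

The residual is negative here because $\ddot{y} > 0$, in agreement with the sign in (7.5).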


Lemma 7.1. Let $y(t_n)$ denote the exact solution to (7.1) at the time points $t_n$. Assume that $y \in C^2[0,T]$, and that $f$ is Lipschitz. When the exact solution is inserted into the explicit Euler recursion (7.3), the local residual satisfies

$$\|r_n\| \le \Delta t^2 \cdot \max_{t \in [0,T]} \frac{\|\ddot{y}(t)\|}{2}. \tag{7.6}$$

This means that the difference equation is consistent with the differential equation, and that the approximation improves as $\Delta t \to 0$. We will return to this notion later; for the time being, we say that the method has order of consistency $p$, if

$$\|r_n\| = O(\Delta t^{p+1})$$

as $\Delta t \to 0$ (or, equivalently, $N \to \infty$).

Now, because the exact solution does not satisfy the recursion, it follows that the numerical solution will deviate from the exact solution. We introduce the following definition.

Definition 7.1. Let the sequence $y_n$ denote the numerical solution generated by the Euler method (7.3) and let $y(t_n)$ denote the exact solution to (7.1) at the time points $t_n$. Then the difference

$$e_n = y_n - y(t_n)$$

is called the global error at time $t_n$.

The next question is, therefore, how large is $e_n$? Now, the objective of all time stepping methods is to generate a numerical solution $y_n$ whose global error can be bounded. In fact, we want more: the method must be convergent. This means that given any prescribed error tolerance $\varepsilon$, we must be able to choose the step size $\Delta t$ accordingly, so that the numerical solution attains the prescribed accuracy $\varepsilon$.

Let us see how this is done. The local residual is related to the global error. Subtracting (7.4) from (7.3), we get

$$e_{n+1} = e_n + \Delta t \cdot f(t_n, y(t_n) + e_n) - \Delta t \cdot f(t_n, y(t_n)) + r_n. \tag{7.7}$$

This is a recursion for the global error, where the local residual is the forcing function. It should be noted that the terminology varies in the literature, and that $r_n$ is often called the local truncation error, or the local error. The reason for this discrepancy will become clear later, in connection with implicit methods.

Taking norms in (7.7), using the triangle inequality and the Lipschitz condition for $f$, yields

$$\|e_{n+1}\| \le \|e_n\| + \Delta t\, L[f] \cdot \|e_n\| + \|r_n\| = (1 + \Delta t\, L[f])\|e_n\| + \|r_n\|. \tag{7.8}$$


This is a difference inequality, and from it we are going to derive a bound on the global error. To this end, we need the following lemma.

Lemma 7.2. Let the sequence $\{u_n\}_{n=0}^{\infty}$ satisfy $u_{n+1} \le (1+\mu)u_n + v_n$, with $u_0 = 0$, $v_n > 0$ and $\mu > -1$, but $\mu \neq 0$. Then

$$u_n \le \max_{k<n} v_k \cdot \frac{e^{n\mu} - 1}{\mu}. \tag{7.9}$$

In case $\mu = 0$, it holds that $u_n \le n \cdot \max_{k<n} v_k$.

Proof. The case $\mu = 0$ is trivial. For $\mu \neq 0$, we first prove by induction that $u_n \le U_n$, where

$$U_n = \sum_{k=0}^{n-1} (1+\mu)^{n-k-1} v_k. \tag{7.10}$$

This obviously holds for $n = 1$. Assume that $u_n \le U_n$ holds for a given $n$. Then

$$u_{n+1} \le (1+\mu)u_n + v_n \le (1+\mu)U_n + v_n = \sum_{k=0}^{n} (1+\mu)^{n-k} v_k = U_{n+1},$$

so $u_n \le U_n$ holds for all $n \ge 1$. From (7.10), it now follows that

$$U_n \le \max_{k<n} v_k \sum_{k=0}^{n-1} e^{(n-k-1)\mu} = \max_{k<n} v_k \cdot \frac{e^{n\mu} - 1}{e^{\mu} - 1} \le \max_{k<n} v_k \cdot \frac{e^{n\mu} - 1}{\mu},$$

since $1 + \mu \le e^{\mu}$ and the sum is a geometric series. The proof is complete.
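A small numerical experiment can make the lemma concrete. The sketch below restricts itself to $\mu > 0$, which is the case used in the convergence proof where $\mu = \Delta t\, L[f]$; the value of $\mu$ and the random sequence $v_n$ are illustrative assumptions. It builds the worst-case sequence with equality in the hypothesis and checks it against the right-hand side of (7.9).

```python
import math
import random

def lemma_bound(v, mu, n):
    """Right-hand side of (7.9): max_{k<n} v_k * (exp(n*mu) - 1) / mu."""
    return max(v[:n]) * (math.exp(n * mu) - 1.0) / mu

random.seed(0)
mu = 0.05                                     # illustrative value, mu > 0
v = [random.uniform(0.1, 1.0) for _ in range(50)]

u = [0.0]                                     # u_0 = 0
for n in range(50):
    # Worst case of the hypothesis: equality in u_{n+1} <= (1 + mu) u_n + v_n
    u.append((1.0 + mu) * u[n] + v[n])

bound_holds = all(u[n] <= lemma_bound(v, mu, n) for n in range(1, 51))
print(bound_holds)
```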

We are now ready to construct a global error bound for the explicit Euler method, by applying (7.9) to the error recursion (7.8). Thus we obtain the following result.

Theorem 7.1. Let the initial value problem $\dot{y} = f(t,y)$ be given with $y(0) = y_0$ on a compact interval $[0,T]$. Assume that $L[f] < \infty$ for $y \in \mathbb{R}^d$, and that the solution $y \in C^2[0,T]$. When this problem is solved, taking $N$ steps with the explicit Euler method using step size $\Delta t = T/N$, the global error at $t = T$ is bounded by

$$\|y_N - y(T)\| \le \Delta t \cdot \max_{t \in [0,T]} \frac{\|\ddot{y}\|}{2} \cdot \frac{e^{T L[f]} - 1}{L[f]}. \tag{7.11}$$

Proof. The result follows immediately from identifying $\mu = \Delta t\, L[f]$ in (7.8), and

$$v_k = \|r_k\| = \frac{\Delta t^2 \|\ddot{y}(\theta_k)\|}{2}$$

in (7.5), and noting that $n \cdot \Delta t = t_n$ in general, and $N \cdot \Delta t = T$ in particular.


This classical result proves that the explicit Euler method is convergent, i.e., by choosing the step size $\Delta t = T/N$ small enough, the method can generate approximations that are arbitrarily accurate. Thus the structure of the bound (7.11) is

$$\|y_N - y(T)\| \le C(T) \cdot \Delta t. \tag{7.12}$$

An alternative formulation of (7.12) uses $\Delta t = T/N$, to express the bound as

$$\|y_N - y(T)\| \le \frac{K(T)}{N}, \tag{7.13}$$

with $K(T) = T \cdot C(T)$.

Definition 7.2. If a time stepping method generates a sequence of approximations $y_n \approx y(t_n)$ at the time points $t_n = n \cdot \Delta t$ with $\Delta t = T/N$, and the exact solution $y(t)$ is sufficiently smooth, the method is said to be of convergence order $p$, if there is a constant $C$ and an $N^*$ such that

$$\|y_N - y(T)\| \le C \cdot N^{-p}$$

for all $N > N^*$.

Here we note that, by Theorem 7.1, the explicit Euler method is of convergence order $p = 1$. This implies that the global error is $\|e_N\| = O(\Delta t)$, and that given any accuracy requirement $\varepsilon$, we can pick $\Delta t$ (or $N$) such that $\|e_N\| \le \varepsilon$.

Example The theory is easily illustrated on a simple test problem. We choose

$$\dot{y} = \lambda(y - g(t)) + \dot{g}(t); \quad y(0) = y_0. \tag{7.14}$$

The exact solution is

$$y(t) = e^{\lambda t}(y_0 - g(0)) + g(t),$$

and is composed of a homogeneous solution (the exponential) and the particular solution $g(t)$. We will choose $g(t) = \sin \pi t$ and $y_0 = 0$ so that the exact solution is $y(t) = \sin \pi t$, and we will solve the problem on $[0,1]$. We deliberately choose $N = 5$ and $N = 10$ steps, to obtain large errors, visible to the naked eye. Such a computational setup is obviously exaggerated for the purpose of creating insight.

The results are seen in Figure 7.2, where we have taken $\lambda = -0.2$. In spite of using so few steps, we still observe the expected behavior, with a global error $O(\Delta t)$ and local residuals $O(\Delta t^2)$. This corroborates the first order convergence.
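The experiment is small enough to reproduce in a few lines. The sketch below is one possible implementation (the function name `euler_test_problem` is an illustrative choice); it solves (7.14) with $g(t) = \sin \pi t$, $y_0 = 0$ and $\lambda = -0.2$, and reports the global error at $t = 1$ for $N = 5$ and $N = 10$.

```python
import math

def euler_test_problem(lam, N, T=1.0):
    """Explicit Euler for (7.14) with g(t) = sin(pi t), y(0) = 0; returns y_N."""
    dt = T / N
    y = 0.0
    for n in range(N):
        t = n * dt
        f = lam * (y - math.sin(math.pi * t)) + math.pi * math.cos(math.pi * t)
        y = y + dt * f
    return y

# The exact solution is y(t) = sin(pi t), so the global error at T = 1 is |y_N - sin(pi)|
err5 = abs(euler_test_problem(-0.2, 5) - math.sin(math.pi))
err10 = abs(euler_test_problem(-0.2, 10) - math.sin(math.pi))
print(err5, err10, err5 / err10)
```

The error ratio comes out close to 2 when the number of steps is doubled, as expected of a first order method.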

The convergence proof for the explicit Euler method is key to a broader understanding of time stepping methods. We shall elaborate on the numerical tests by varying the parameters in the test problem, and also compare the results to those we obtain for the implicit Euler method, to be studied next. However, before we go on, it is important to discuss the interpretation of the convergence proof, as well as some important, critical remarks on what has been achieved so far.


[Figure 7.2: two panels, N = 5 (top) and N = 10 (bottom); horizontal axis t from 0 to 1]

Fig. 7.2 Demonstration of the Explicit Euler method. The simple test problem (7.14) is solved on $[0,1]$ with $N = 5$ steps (top) and then $N = 10$ steps (bottom). The exact solution $y(t) = \sin \pi t$ is indicated by the emphasized blue curve, while the explicit Euler method generates the red, polygonal, discrete solution. Each step is taken in the tangential directions of local solutions to the differential equation (green intermediate curves). The global error at the endpoint is approximately 0.6 for $\Delta t = 1/5$ and half as large, 0.3, for $\Delta t = 1/10$, in agreement with an $O(\Delta t)$ global error. The local residuals approximately correspond to the distance between the green curves. Since there are twice as many green curves in the lower graph, with only half the distance between them, the local residual is $O(\Delta t^2)$

Remark 1 (Convergence) The notion of a convergent method applies to a general class of Lipschitz continuous problems whose solutions are smooth enough, in this case with $y \in C^2[0,T]$. Thus convergence is a nominal method property, and the convergence order is the best generic performance the method will exhibit.

However, there are exceptions. For example, if the solution $y$ is a polynomial of degree 1 (a straight line), then $\ddot{y} \equiv 0$ and the local residual vanishes. The explicit Euler method then generates the exact solution. Conversely, if a given problem fails to satisfy the Lipschitz condition, or if the solution is not in $C^2[0,T]$, the convergence order may drop, or the method may fail altogether. In practice, one rarely verifies the theoretical assumptions, and occasional failures are encountered. Using a convergent method does not guarantee unconditional success.

Finally, convergence is a notion from analysis, requiring a dense vector space, such as $\mathbb{R}^d$. In computer arithmetic, there is no “true convergence,” since machine representable numbers are few and far between. Even so, it is rare to experience difficulties due to roundoff, and in most cases the nominal convergence order is observed. In the explicit Euler case, this means that if $\Delta t$ is reduced by a factor of 10, we will typically observe a global error that is 10 times smaller.


Remark 2 (Consistency and stability imply convergence) The error bound has the form

$$\|y_N - y(T)\| \le C(T) \cdot \Delta t,$$

and it has two essential components. First, the single power of $\Delta t$ is due to consistency. Second, $C(T)$ must be bounded; this is referred to as stability, and implies that the bound depends continuously on $\Delta t$. Here $C(T)$ is sometimes referred to as the stability constant. Let us have a closer look at where these concepts were employed. We used the inequality

$$u_n \le \sum_{k=0}^{n-1} (1+\mu)^{n-k-1} v_k,$$

and we need $u_N \to 0$ as $N \to \infty$. We then need $(1+\mu)^N$ to be bounded. This appears to require $\mu \le 0$, but in fact we have a little bit of leeway. In our case

$$(1+\mu)^N = (1 + \Delta t \cdot L[f])^N = \left(1 + \frac{T L[f]}{N}\right)^N \to e^{T L[f]},$$

so the exponential term is bounded for fixed $T$, even though $\mu > 0$. That is where stability entered. Without stability, $C(T)$ would not have been bounded. Thus stability is necessary.

Second, for the error to go to zero, we needed $v_k \to 0$. This is where consistency entered. Above we saw that $v_k \le O(\Delta t^2) \to 0$, since the local residual is

$$\|r_n\| \le O(\Delta t^2).$$

Without consistency, the upper bound of the global error $\|y_N - y(T)\| \le C(T) \cdot \Delta t$ would not have contained the factor $\Delta t$. Thus consistency is necessary, too.

Remark 3 (What is a stability constant?) The convergence proof above is a mathematical proof. It is “sharp” in the sense that equality could be attained, but it is far too weak for numerical purposes. Consider a plain initial value problem

$$\dot{y} = -10y; \quad y(0) = 1,$$

to be solved on $[0,T]$ with $T = 10$. The exact solution is $y(t) = e^{-10t}$, with $\ddot{y} = 100y$, implying that $\max |\ddot{y}| = 100$. The Lipschitz constant is $L[f] = 10$.

Inserting these data into (7.11), we get

$$\|y_N - y(T)\| \le \Delta t \cdot \max_{t \in [0,T]} \frac{\|\ddot{y}\|}{2} \cdot \frac{e^{T L[f]} - 1}{L[f]} = \Delta t \cdot 50 \cdot \frac{e^{100} - 1}{10} \approx 1.3 \cdot 10^{44}\, \Delta t.$$

Thus $C(T)$ is stupendous; such “constants” do not belong in numerical analysis. In real computations, stability constants must have a reasonable magnitude, keeping in mind that computations are carried out in finite precision, and need to finish in finite time.

Fortunately, the real error is much smaller. Suppose we take $\Delta t = 0.01$ and $N = 10^3$ steps to complete the integration from $t = 0$ to $T = 10$. Because $y(t)$ is convex, it is easily seen that $0 < y_n < y(t_n)$ during the entire integration. Therefore the error at $T$ is certainly less than $y(T)$. Now,

$$y(T) = e^{-10T} = e^{-100} = 3.7 \cdot 10^{-44},$$

so $\|y_N - y(T)\| \approx 3.7 \cdot 10^{-44}$, whereas the bound evaluates to $1.3 \cdot 10^{42}$ for this step size. Thus the error bound overestimates the error by some 85 orders of magnitude. This is unacceptable, especially when the differential equation poses no special problems at all.
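The numbers in this remark can be reproduced with a few lines of Python. The sketch below evaluates the classical bound (7.11) and the actual explicit Euler error for $\dot{y} = -10y$; it uses the fact that, for this linear problem, the recursion (7.3) gives $y_N = (1 - 10\Delta t)^N$ in closed form. Variable names are illustrative.

```python
import math

L = 10.0                 # Lipschitz constant of f(y) = -10 y
T = 10.0
max_ddy = 100.0          # max |y''| on [0, T] for y(t) = exp(-10 t)
dt = 0.01
N = 1000                 # N * dt = T

# Classical bound (7.11): dt * max|y''|/2 * (exp(T L) - 1) / L
C = 0.5 * max_ddy * (math.exp(T * L) - 1.0) / L
bound = C * dt

# Explicit Euler for y' = -10 y gives y_N = (1 - 10 dt)^N
y_N = (1.0 - 10.0 * dt) ** N
error = abs(y_N - math.exp(-10.0 * T))

print(f"C(T)  = {C:.2e}")
print(f"bound = {bound:.2e}")
print(f"error = {error:.2e}")
```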


To summarize the analysis above, we note that stability and consistency are two distinct necessary conditions for convergence. Later on, we will simplify the convergence proofs, reducing them to a matter of verifying the order of consistency, and stability. Proving consistency is usually easy, only requiring a Taylor series expansion. Stability is more difficult, but once established, it holds that the order of convergence equals the order of consistency.

Bridging the gap from consistency to convergence, stability plays the key role. It will turn up in many different forms depending on the problem type and method construction. Because the pattern remains the same, we will discover that the Lax Principle is the most important idea in numerical analysis: consistency and stability imply convergence.

7.2 Alternative bounds

The convergence proof derived above is only a “mathematical” proof, and we need to find better estimates. The weakness of the derivation above stems from two things: a reckless use of the triangle inequality and, as a consequence, a damaging use of the Lipschitz constant. As a result, we obtained a stability constant $C(T)$ which is so large as to suggest that accurate numerical solution of differential equations is impossible over reasonably long time intervals.

However, this is not the case. By using logarithmic norms, we will be able to derive much improved error bounds that support the observation from computational practice, that most initial value problems can be solved to a very high accuracy.

Going back to the recursion (7.7) for the global error, we had

$$e_{n+1} = e_n + \Delta t \cdot f(t_n, y(t_n) + e_n) - \Delta t \cdot f(t_n, y(t_n)) + r_n.$$

Again, we take norms and use the triangle inequality, without splitting the first three terms. Thus we get

$$\|e_{n+1}\| \le L[I + \Delta t f]\,\|e_n\| + \|r_n\| \approx (1 + \Delta t\, M[f])\|e_n\| + \|r_n\|,$$

where the approximation is derived from

$$L[I + \Delta t f] = 1 + \Delta t\, M[f] + O(\Delta t^2),$$

in accordance with Definition 6.2. Dropping the $O(\Delta t^2)$ term, we have effectively just replaced the Lipschitz constant $L[f]$ in (7.8) by the logarithmic norm $M[f]$. Otherwise, everything remains the same. The convergence proof now only offers an approximate global error bound, but it is much improved due to the fact that $M[f] \le L[f]$.

Theorem 7.2. Let the initial value problem $\dot{y} = f(t,y)$ be given with $y(0) = y_0$ on a compact interval $[0,T]$. Assume that $L[f] < \infty$ for $y \in \mathbb{R}^d$, and that the solution $y \in C^2[0,T]$. When this problem is solved, taking $N$ steps with the explicit Euler method using step size $\Delta t = T/N$, the global error at $t = T$ is bounded by

$$\|y_N - y(T)\| \lesssim \Delta t \cdot \max_{t \in [0,T]} \frac{\|\ddot{y}\|}{2} \cdot \frac{e^{T M[f]} - 1}{M[f]}. \tag{7.15}$$

Remark 1 (Perturbation bound) We note that the error bound (7.15) has the same structure as the perturbation bound (6.26). While the latter was derived for the differential equation, the global error bound was derived for the discretization. The shared structure of the bounds shows that the error accumulation in the discrete recursion is similar to the effect of a continuous perturbation $p(t)$ in the differential equation.

Remark 2 (The stability constant, revisited) Let us again consider the problem

$$\dot{y} = -10y; \quad y(0) = 1,$$

to be solved on $[0,T]$ with $T = 10$. The exact solution is $y(t) = e^{-10t}$, with $\ddot{y} = 100y$, implying that $\max |\ddot{y}| = 100$. While the Lipschitz constant is $L[f] = 10$, the logarithmic norm is $M[f] = -10$.

This gives a completely different error bound. Inserting the data into (7.15), we get

$$\|y_N - y(T)\| \lesssim \Delta t \cdot \max_{t \in [0,T]} \frac{\|\ddot{y}\|}{2} \cdot \frac{e^{T M[f]} - 1}{M[f]} = \Delta t \cdot 50 \cdot \frac{e^{-100} - 1}{-10} \approx 5\,\Delta t.$$

Thus the stability constant $C(T) \approx 5$ is moderate in size. The error bound is still an overestimate, but the new error bound shows that the numerical method can achieve realistic accuracy.

Even with the logarithmic norm, the error bound usually overestimates the error. The main difference, however, is that the Lipschitz constant is never negative, and therefore cannot pick up any information on the stability of the solutions of the equation. By contrast, the logarithmic norm distinguishes between forward and reverse time integration, and therefore contains some information about stability. This is necessary in order to have realistic error bounds.
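The two stability constants can be compared side by side in code. The sketch below assumes the Euclidean norm, for which the logarithmic norm of a real matrix equals the largest eigenvalue of its symmetric part $(A + A^T)/2$; the helper name `log_norm_2` is an illustrative choice. The scalar problem $\dot{y} = -10y$ is treated as a $1 \times 1$ system.

```python
import numpy as np

def log_norm_2(A):
    """Euclidean logarithmic norm: largest eigenvalue of (A + A^T)/2."""
    return float(np.max(np.linalg.eigvalsh(0.5 * (A + A.T))))

A = np.array([[-10.0]])                  # f(y) = -10 y as a 1x1 system
L = float(np.linalg.norm(A, 2))          # Lipschitz constant, here 10
M = log_norm_2(A)                        # logarithmic norm, here -10

T, max_ddy = 10.0, 100.0
# Stability constants of (7.11) and (7.15); expm1(x) = exp(x) - 1
C_lipschitz = 0.5 * max_ddy * np.expm1(T * L) / L
C_lognorm = 0.5 * max_ddy * np.expm1(T * M) / M
print(C_lipschitz, C_lognorm)
```

The same helper applies to larger matrices, where the gap between the norm and the logarithmic norm is typically what separates a useless bound from a useful one.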

7.3 The Lipschitz assumption

While a much better error bound could be obtained when the logarithmic norm replaced the Lipschitz constant in the derivation, the classical assumption for establishing existence of solutions on some compact interval $[0,T]$ is still that the vector field is Lipschitz with respect to its second argument, i.e.,

$$L[f(t, \cdot)] < \infty.$$

Noting that it always holds that $M[f] \le L[f]$ (see the basic properties in Theorem 6.2, which apply also to nonlinear maps), the Lipschitz assumption automatically implies that $M[f]$ exists and is bounded. Therefore it is always possible to work with the logarithmic norm instead of the Lipschitz constant in the estimates, although this occasionally leads to approximate upper bounds.

More importantly, it may happen that $M[f] \ll L[f]$, implying that vastly improved error bounds can be obtained with the logarithmic norm. This is of particular importance in connection with stiff differential equations, which will be studied later. Since the error bounds typically contain a factor $e^{T L[f]}$, which can be replaced by $e^{T M[f]}$ (see Theorem 6.7), the difference is enormous. In case $T$ is also large, the classical bound, based on the Lipschitz constant, loses its computational relevance altogether. Bounds and estimates have to be as tight as possible.

Unlike the Lipschitz constant, the logarithmic norm may be negative. Thus, in cases where $M[f] < 0$, we can obtain uniform error bounds, also when $T \to \infty$, which is otherwise impossible in the classical setting. In fact, stiff problems have $T L[f] \gg 1$ but $T M[f]$ small or even negative. Such problems cannot be dealt with in a meaningful way without using the logarithmic norm. Typical examples are found in parabolic partial differential equations, such as in the diffusion equation.

For this reason, we shall in the sequel start our derivations from the (weaker) assumption that

$$M[f(t, \cdot)] < \infty,$$

keeping in mind that this is compatible with classical existence theory for ordinary differential equations, no matter how large $L[f(t, \cdot)]$ is.

Chapter 8
The Implicit Euler Method

The explicit Euler method is

$$\frac{y_{n+1} - y_n}{\Delta t} = f(t_n, y_n),$$

and is obtained from the finite difference approximation to the derivative,

$$\frac{y(t_n + \Delta t) - y(t_n)}{\Delta t} \approx \dot{y}(t_n).$$

In the Implicit Euler Method, we instead interpret the difference quotient as an approximation to $\dot{y}(t_{n+1})$, which is the right endpoint of the interval $[t_n, t_{n+1}]$ rather than the left. This leads to the discretization

$$\frac{y_{n+1} - y_n}{\Delta t} = f(t_{n+1}, y_{n+1}).$$

The method is implicit, because given the point $(t_n, y_n)$, we cannot compute $y_{n+1}$ directly. To use this method, we have to solve the (nonlinear) equation

$$y_{n+1} - \Delta t\, f(t_{n+1}, y_{n+1}) = y_n \tag{8.1}$$

on every single step. This leads to a number of questions:

1. Under what conditions can this equation be solved?
2. Which method should be used to solve this equation?
3. Can the added cost of equation solving be justified?

Let us start with existence of solutions. In ordinary differential equations we generally assume that $f$ is Lipschitz with respect to the second argument. For simplicity, let us assume that $L[f(t, \cdot)] \le L[f] < \infty$ on all of $\mathbb{R}^d$. This implies that $M[f] \le L[f] < \infty$. By Corollary 6.2 we have the following result:

$$M[\Delta t f] < 1 \;\Rightarrow\; L[(I - \Delta t f)^{-1}] \le \frac{1}{1 - \Delta t\, M[f]}.$$


Throughout the analysis, we shall assume that $M[\Delta t f] < 1$. We note that this is a considerably weaker assumption than assuming $L[\Delta t f] < 1$, in which case the fixed point theorem would apply. Thus, a solution to (8.1) exists for (possibly) much larger step sizes $\Delta t$ than the fixed point theorem would indicate.

This brings us to the second question. If we were to use step sizes $\Delta t$ such that $M[\Delta t f] < 1$ but $L[\Delta t f] \gg 1$, then obviously we cannot solve the equation by fixed point iteration. We will see that these conditions are typical. For this reason, we need to use Newton’s method, which may converge in the operative conditions defined by $M[\Delta t f] < 1$.

As for the third question, being implicit, the implicit Euler method is going to be more expensive per step than the explicit method. But using Newton’s method is going to make it far more expensive per step. Can this extra cost be justified? There are two possible benefits: if the method is more accurate or has improved stability properties, it might be possible to employ larger steps $\Delta t$ than in the explicit method. Then the implicit method would compensate for the inefficiency of the explicit method. It turns out that the advantage is in improved stability, and that there are cases when the implicit Euler method can use step sizes $\Delta t$ that are orders of magnitude greater than those of the explicit method. These conditions depend on the differential equation, and do not violate the solvability issues raised in the first question. Whenever these conditions are at hand, the implicit method easily makes up for its more expensive computations. The issue is not the cost per step, but the cost per integrated unit of time, often referred to as the cost per unit step.
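To make the second question concrete, the following is a minimal sketch of the implicit Euler method for a scalar problem, with equation (8.1) solved by Newton's method on every step. The function name, the stopping criterion, the user-supplied derivative `dfdy` and the nonlinear test problem $\dot{y} = -y^3$ are all assumptions made here for illustration; they are not taken from the text.

```python
import math

def implicit_euler(f, dfdy, y0, T, N, tol=1e-12, max_iter=50):
    """Scalar implicit Euler; each step solves (8.1) by Newton's method."""
    dt = T / N
    y = y0
    for n in range(N):
        t_next = (n + 1) * dt
        z = y                                   # initial guess: previous value
        for _ in range(max_iter):
            G = z - dt * f(t_next, z) - y       # residual of equation (8.1)
            dG = 1.0 - dt * dfdy(t_next, z)     # its derivative w.r.t. z
            step = G / dG
            z -= step
            if abs(step) <= tol * (1.0 + abs(z)):
                break
        y = z
    return y

# Illustrative nonlinear problem: y' = -y^3, y(0) = 1, exact y(t) = 1/sqrt(1 + 2t)
y_end = implicit_euler(lambda t, y: -y**3, lambda t, y: -3.0 * y**2, 1.0, 1.0, 200)
print(y_end, 1.0 / math.sqrt(3.0))
```

For systems, the scalar division `G / dG` is replaced by the solution of a linear system with the matrix $I - \Delta t\, J$, where $J$ is the Jacobian of $f$; this is where most of the extra cost per step arises.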

8.1 Convergence

Let us begin by investigating consistency. We follow standard procedure and insert the exact solution $y(t)$ into the discretization, to obtain

$$y(t_{n+1}) = y(t_n) + \Delta t \cdot f(t_{n+1}, y(t_{n+1})) - r_n, \tag{8.2}$$

where we want to find the magnitude of the local residual $r_n$. We assume that the exact solution $y$ is twice continuously differentiable and expand in a Taylor series. Here we note that we need to expand both $y(t_{n+1})$ and $\dot{y}(t_{n+1}) = f(t_{n+1}, y(t_{n+1}))$ around $t_n$. Thus we have

$$y(t_{n+1}) = y(t_n) + \Delta t \cdot \dot{y}(t_n) + \Delta t^2 \cdot \frac{\ddot{y}(t_n)}{2} + O(\Delta t^3)$$

$$\dot{y}(t_{n+1}) = \dot{y}(t_n) + \Delta t \cdot \ddot{y}(t_n) + O(\Delta t^2).$$

Inserting into (8.2) we conclude that

$$r_n = \Delta t^2 \cdot \frac{\ddot{y}(t_n)}{2} + O(\Delta t^3). \tag{8.3}$$


Hence the order of consistency of the implicit Euler method is $p = 1$, just like for the explicit method. The only difference is that the local residual in the implicit Euler method has the opposite sign from that of the explicit Euler method.

Lemma 8.1. Let $y(t_n)$ denote the exact solution to (7.1) at the time points $t_n$. Assume that $y \in C^2[0,T]$, and that $f$ is Lipschitz. When the exact solution is inserted into the implicit Euler recursion (8.2), the local residual satisfies

$$\|r_n\| \le \Delta t^2 \cdot \max_{t \in [0,T]} \frac{\|\ddot{y}(t)\|}{2} \tag{8.4}$$

as $\Delta t \to 0$.

As for the global error and convergence, the analysis is now slightly more complicated because the method is implicit. Assuming that $\Delta t\, M[f] < 1$, the inverse map $(I - \Delta t \cdot f)^{-1}$ exists and is Lipschitz on account of Theorem 6.2. We now have

$$(I - \Delta t \cdot f)(y_{n+1}) = y_n$$

$$(I - \Delta t \cdot f)(y(t_{n+1})) = y(t_n) - r_n.$$

Inverting $I - \Delta t f$, subtracting and taking norms, it follows that

$$\|e_{n+1}\| \le L[(I - \Delta t f)^{-1}] \cdot \|e_n + r_n\|, \tag{8.5}$$

where $e_n = y_n - y(t_n)$ is the global error at $t_n$. Again, by Theorem 6.2, we obtain

$$\|e_{n+1}\| \le \frac{\|e_n + r_n\|}{1 - \Delta t\, M[f]} \le \frac{\|e_n\| + \|r_n\|}{1 - \Delta t\, M[f]} \approx (1 + \Delta t\, M[f])\|e_n\| + \|r_n\| + O(\Delta t^3). \tag{8.6}$$

Thus we have the approximate difference inequality

$$\|e_{n+1}\| \lesssim (1 + \Delta t\, M[f]) \cdot \|e_n\| + \|r_n\|,$$

conforming to Lemma 7.2. Identifying $\mu = \Delta t\, M[f]$ and $v_k = \|r_k\|$, we obtain the following standard convergence result.

Theorem 8.1. Let the initial value problem $\dot{y} = f(t,y)$ be given with $y(0) = y_0$ on a compact interval $[0,T]$. Assume that $\Delta t\, M[f] < 1$ for $y \in \mathbb{R}^d$, and that the solution $y \in C^2[0,T]$. When this problem is solved, taking $N$ steps with the implicit Euler method using step size $\Delta t = T/N$, the global error at $t = T$ is bounded by

$$\|y_N - y(T)\| \lesssim \Delta t \cdot \max_{t \in [0,T]} \frac{\|\ddot{y}\|}{2} \cdot \frac{e^{T M[f]} - 1}{M[f]}. \tag{8.7}$$

While this conforms to the bound obtained for the explicit method, there appears to be little new to learn from the implicit method. Before we run some tests, comparing the explicit and implicit Euler methods, let us note that there is another way of deriving the error estimate above. Thus, starting all over, we note that if the method in a single step starts from a point $y(t_n)$ on the exact solution, it produces an approximation

$$\hat{y}_{n+1} = y(t_n) + \Delta t \cdot f(t_{n+1}, \hat{y}_{n+1}). \tag{8.8}$$

Because $\hat{y}_{n+1} \neq y(t_{n+1})$, it is warranted to introduce a special notation for the discrepancy. Thus we introduce the local error $l_{n+1}$, defined by

$$\hat{y}_{n+1} = y(t_{n+1}) + l_{n+1}. \tag{8.9}$$

The local error is, naturally, related to the local residual. Subtracting (8.2) from (8.8), we have

$$l_{n+1} = \Delta t \cdot f(t_{n+1}, y(t_{n+1}) + l_{n+1}) - \Delta t \cdot f(t_{n+1}, y(t_{n+1})) + r_n.$$

Let us for simplicity assume that $f(t,y) = Jy$, i.e., that the vector field $f$ is a linear constant coefficient system. Then

$$l_{n+1} = (I - \Delta t\, J)^{-1} r_n. \tag{8.10}$$

The inverse of the matrix exists, since we assumed $\Delta t\, M[f] < 1$, and $M[J] = M[f]$ if $f = J$. Obviously, if $\|\Delta t\, J\| \to 0$ as $\Delta t \to 0$, we have $l_{n+1} \to r_n$. Thus, if the step sizes are small enough to make $\|\Delta t\, J\| \ll 1$, it holds that $l_{n+1} \approx r_n$. However, the point in using the implicit Euler method is to employ large step sizes for which $\|\Delta t\, J\| \gg 1$, while it still holds that $\Delta t\, M[J] \ll 1$. This is the case in stiff differential equations, where the local error $l_{n+1}$ can be much smaller than the local residual $r_n$.

For the explicit Euler method, the local residual and the local error are identical; thus $l_{n+1} \equiv r_n$. For implicit methods, however, especially in realistic operational conditions, the difference is significant. In practical computations, it is therefore more important to control the magnitude of the local error than the local residual. For this reason, we emphasize the local error perspective in the sequel.
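The difference between local error and local residual is easy to quantify on the scalar test problem (7.14). The sketch below uses the stiff parameter choice $\lambda = -1000$, $\Delta t = 0.1$ and the starting point $t_n = 0.3$, all of which are illustrative assumptions. Because $f$ is affine in $y$, the implicit step can be solved in closed form, and the relation $l_{n+1} = r_n / (1 - \Delta t\, \lambda)$ holds exactly.

```python
import math

lam, omega = -1000.0, math.pi      # stiff illustrative choice; not from the text
dt, tn = 0.1, 0.3

g = lambda t: math.sin(omega * t)            # exact solution of (7.14) when y0 = g(0)
dg = lambda t: omega * math.cos(omega * t)
f = lambda t, y: lam * (y - g(t)) + dg(t)

t1 = tn + dt
# Local residual (8.2): insert the exact solution into the implicit Euler formula
r = g(tn) + dt * f(t1, g(t1)) - g(t1)

# One implicit Euler step from the exact value g(tn); f is affine in y,
# so the implicit equation can be solved in closed form
y_hat = (g(tn) + dt * (-lam * g(t1) + dg(t1))) / (1.0 - dt * lam)
l = y_hat - g(t1)                            # local error (8.9)

print(r, l, r / l)                           # r / l is close to 1 - dt*lam
```

With these parameters the local error is about two orders of magnitude smaller than the local residual.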

Let us now turn to comparing the computational behavior of the explicit and implicit Euler methods. This will be done by considering a few simple test problems that illustrate both stability and accuracy. It is important to note that stability is the highest priority; without stability no accuracy can be obtained.

Example We shall compare the explicit and implicit Euler methods using the same test problem (7.14) as before, specifically

$$\dot{y} = \lambda(y - \sin \omega t) + \omega \cos \omega t; \quad y(0) = y_0, \tag{8.11}$$

with exact solution

$$y(t) = e^{\lambda t} y_0 + \sin \omega t.$$

We shall take $y_0 = 0$ so that the homogeneous solution is not present in the exact solution $y(t) = \sin \omega t$, but the exponential term will show up in the local solutions passing through the points $y_n$ generated by the numerical methods. We will take $\omega = \pi$ and solve the problem on $[0,1]$, and just like before we will use $N = 5$ and 10 steps, respectively. The prime motivation for the test is to investigate the effect of varying $\lambda$, which controls the damping of the exponential. We will use three different values, $\lambda = -0.2$, $-2$ and $-20$, and otherwise use the same computational setup for both methods.

[Figure 8.1: four panels, "Explicit Euler, N=5" and "Implicit Euler, N=5" (top), "Explicit Euler, N=10" and "Implicit Euler, N=10" (bottom); horizontal axis t from 0 to 1]

Fig. 8.1 Comparing the Euler methods. Test problem (8.11) is solved for $\lambda = -0.2$ with $N = 5$ (top) and $N = 10$ steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right panels). Blue curve is the exact solution $y(t)$; red polygons represent the numerical solutions; green curves represent local solutions through numerically computed points. The local residuals have opposite signs, since explicit solutions proceed above $y(t)$, and below it in the implicit case. Going from 5 to 10 steps, the global error is $O(\Delta t)$ for both methods, with local residuals $O(\Delta t^2)$

The results are shown in Figures 8.1 to 8.3. For $\lambda = -0.2$ the damping is weak and local solutions almost run “parallel” to the exact solution. The test verifies that the global error is of the same magnitude for both methods.

When $\lambda = -2$ there is more exponential damping. This has the interesting effect that the global error becomes smaller, due to the fact that the stability constant $C(T)$ is smaller when the damping rate increases. In addition, local solutions now show a moderate damping, but we still observe how the global error is similar in both methods, and still $O(\Delta t)$.

For $\lambda = -20$, there is strong exponential damping, as is evident from the fast contracting local solutions. The big surprise, however, is that for $N = 5$ (or, equivalently, $\Delta t = 0.2$), the explicit method goes unstable, with a numerical solution exhibiting growing oscillations diverging from the exact solution, even though the initial value was taken on the exact solution. This effect is due to numerical instability, and will be investigated in detail.

The instability may at first seem surprising, since we do have a convergence proof for the method, but we note that $\lambda \Delta t = -4$, which is too large in magnitude for the method to remain stable. By contrast, there is no sign of instability in the implicit method, which remains stable. Nor is there any instability in the explicit method when $N = 10$ and $\Delta t = 0.1$ put $\lambda \Delta t$ at $-2$.
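The comparison for $\lambda = -20$ can be reproduced with the following sketch. The function names `run` and `end_error` are illustrative. Since the vector field in (8.11) is affine in $y$, the implicit equation (8.1) is solved in closed form here instead of by Newton's method.

```python
import math

def run(method, lam, N, T=1.0, omega=math.pi):
    """Solve (8.11) with y(0) = 0 by explicit or implicit Euler; return all y_n."""
    dt = T / N
    y = [0.0]
    for n in range(N):
        t0, t1 = n * dt, (n + 1) * dt
        if method == "explicit":
            f = lam * (y[n] - math.sin(omega * t0)) + omega * math.cos(omega * t0)
            y.append(y[n] + dt * f)
        else:
            # f is affine in y, so the implicit equation (8.1) is solved directly
            rhs = y[n] + dt * (-lam * math.sin(omega * t1) + omega * math.cos(omega * t1))
            y.append(rhs / (1.0 - dt * lam))
    return y

def end_error(method, lam, N):
    """Global error at t = 1, where the exact solution is sin(pi)."""
    return abs(run(method, lam, N)[-1] - math.sin(math.pi))

for N in (5, 10):
    print(N, end_error("explicit", -20.0, N), end_error("implicit", -20.0, N))
```

For $N = 5$ the explicit error at the endpoint is far larger than the implicit one, while for $N = 10$ both errors are small.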


[Figure 8.2: four panels, explicit Euler (left) and implicit Euler (right), N = 5 (top) and N = 10 (bottom); horizontal axis t from 0 to 1]

Fig. 8.2 Comparing the Euler methods. Test problem (8.11) is solved for $\lambda = -2$ with $N = 5$ (top) and $N = 10$ steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right panels). When the exponential damping increases, the global error decreases, but otherwise the results remain similar, with the exception of local solutions clearly displaying a faster damping rate

However, we have seen that in the convergence proof stability depends on $(1 + \Delta t\, L[f])^N$ being bounded as $N \to \infty$ and $\Delta t \to 0$. While this is still true, we need to recognize that in a practical computation, we fix a $\Delta t > 0$ and then take a large number of steps with that step size. In such a case we have a new situation; the successive powers $(1+\mu)^n$ will naturally grow unless $|1 + \mu| \le 1$. We will see that this condition has been violated in our situation when the method goes unstable.
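For the linear test problem (8.11), the error recursion (7.7) has $\mu = \lambda \Delta t$, so the condition $|1 + \mu| \le 1$ can be checked directly for the two step sizes used above. The helper name `growth_factor` in this short sketch is an illustrative choice.

```python
lam = -20.0

def growth_factor(lam, dt):
    """|1 + mu| with mu = lam*dt: factor by which explicit Euler scales errors per step."""
    return abs(1.0 + lam * dt)

factor_N5 = growth_factor(lam, 1.0 / 5)     # lam*dt = -4
factor_N10 = growth_factor(lam, 1.0 / 10)   # lam*dt = -2

# After n steps a perturbation is multiplied by factor**n
print(factor_N5, factor_N5 ** 5)
print(factor_N10, factor_N10 ** 10)
```

With $N = 5$ every step magnifies an existing error threefold and flips its sign, which is what produces growing oscillations; with $N = 10$ the factor sits exactly on the boundary $|1 + \mu| = 1$.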

The main discovery in the comparison of the explicit and implicit Euler methods is that the explicit method suddenly goes unstable when the product of the step size $\Delta t$ and the problem parameter $\lambda$ is too large. This means that we need to develop a stability theory for the methods. It is not sufficient to investigate the stability of the mathematical problem and the discrete problem; we need to establish under what conditions stability of the mathematical problem carries over to the discrete problem. This will pose (possibly) restrictive conditions on the magnitude of the step size $\Delta t$.

[Figure 8.3: four panels, explicit Euler (left) and implicit Euler (right), N = 5 (top) and N = 10 (bottom); horizontal axis t from 0 to 1]

Fig. 8.3 Comparing the Euler methods. Test problem (8.11) is solved for $\lambda = -20$ with $N = 5$ (top) and $N = 10$ steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right panels). For $N = 5$, the explicit method shows numerical instability (top left) as indicated by growing oscillations, diverging from the exact solution. When the step size is shortened ($N = 10$, lower left), the method regains stability. In the implicit method, there is no instability at all; the implicit Euler method can use larger steps than the explicit Euler method. Finally, due to the strong exponential damping, the global error is very small whenever the computation is stable

8.2 Numerical stability

Numerical stability is investigated in terms of the linear test equation,

\[
\dot x = \lambda x, \qquad x(0) = 1, \tag{8.12}
\]

where $\lambda \in \mathbb{C}$. This requires a special motivation.

Motivation Consider a linear constant coefficient system

\[
\dot y = Ay, \tag{8.13}
\]

where $A$ is diagonalizable by the transformation $U^{-1}AU = \Lambda$, and $\Lambda$ is the diagonal matrix containing the eigenvalues of $A$. The transformation $y = Ux$ then implies $\dot y = U\dot x$, and

\[
U\dot x = AUx \quad\Rightarrow\quad \dot x = U^{-1}AUx = \Lambda x.
\]

Since $A$ may have $\lambda[A] \in \mathbb{C}$, we take $\lambda \in \mathbb{C}$ in (8.12).

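If (8.13) is discretized by (say) the explicit Euler method, we obtain

\[
y_{n+1} = (I + \Delta t\,A)\,y_n.
\]

\[
\begin{array}{ccc}
\dot y = Ay & \xrightarrow{\ \text{Euler discretization}\ } & y_{n+1} = (I + hA)\,y_n \\[4pt]
\Big\downarrow {\scriptstyle\ \text{diagonalization}} & & \Big\downarrow {\scriptstyle\ \text{diagonalization}} \\[4pt]
\dot x = \Lambda x & \xrightarrow{\ \text{Euler discretization}\ } & x_{n+1} = (I + h\Lambda)\,x_n
\end{array}
\]

Fig. 8.4 Commutative diagram. Diagonalization $U^{-1}AU = \Lambda$ of the vector field $A$ commutes with the Euler discretization, justifying the linear test equation for investigating numerical stability. In the diagram, $h$ denotes the step size $\Delta t$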

Putting $y_n = Ux_n$, we get

\[
Ux_{n+1} = (I + \Delta t\,A)\,Ux_n \quad\Rightarrow\quad x_{n+1} = U^{-1}(I + \Delta t\,A)\,Ux_n = (I + \Delta t\,\Lambda)\,x_n.
\]

But this is the explicit Euler method applied to the diagonalized problem $\dot x = \Lambda x$. Thus, diagonalization and discretization commute; it does not matter in which order these operations are carried out, see Figure 8.4.

Therefore (8.13) can be analyzed eigenvalue by eigenvalue; this is what the linear test equation (8.12) does, with $\lambda \in \lambda[A]$. This justifies the interest in (8.12) as a standard test problem.
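The commutation of diagonalization and Euler discretization can be verified numerically on a small example. This is a sketch only: the matrix $A$, the step size and the initial vector are hypothetical choices, and $A$ is taken symmetric so that the eigenvector matrix $U$ can be written down by hand and satisfies $U^{-1} = U$.

```python
import math


def matvec(M, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


# Hypothetical example: A has eigenvalues -2 and -4 with eigenvectors
# (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
A = [[-3.0, 1.0], [1.0, -3.0]]
s = 1.0 / math.sqrt(2.0)
U = [[s, s], [s, -s]]      # orthogonal and symmetric, so U^{-1} = U
eigs = [-2.0, -4.0]
dt = 0.1


def euler_step_y(y):
    """One explicit Euler step for y' = A y."""
    Ay = matvec(A, y)
    return [y[i] + dt * Ay[i] for i in range(2)]


def euler_step_x(x):
    """One explicit Euler step for the diagonalized system x' = Lambda x."""
    return [(1.0 + dt * eigs[i]) * x[i] for i in range(2)]


y0 = [1.0, 2.0]
path1 = matvec(U, euler_step_y(y0))    # discretize, then diagonalize
path2 = euler_step_x(matvec(U, y0))    # diagonalize, then discretize
max_gap = max(abs(a - b) for a, b in zip(path1, path2))
print("difference between the two paths:", max_gap)
```

Both paths around the diagram should agree up to rounding error.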

Now, let us consider the mathematical stability of (8.12). Since the solution is

\[
x(t) = e^{t\lambda},
\]

it follows that

\[
|x(t)| = |e^{t\lambda}| = e^{t\,\mathrm{Re}\,\lambda}.
\]

Hence $|x(t)|$ remains bounded for $t \ge 0$ if $\mathrm{Re}\,\lambda \le 0$. For a system $\dot y = Ay$, this corresponds to having $\lambda[A] \in \mathbb{C}^-$, or equivalently $\alpha[A] \le 0$. This is the stability condition for the differential equation.

Checking the stability of the discretization, we start by investigating the explicit Euler method. For the linear test equation (8.12) we obtain

\[
x_{n+1} = (1 + \Delta t\,\lambda)\,x_n,
\]

and it follows that $|x_{n+1}| = |1 + \Delta t\,\lambda| \cdot |x_n|$. Thus the numerical solution is nonincreasing provided that

\[
|1 + \Delta t\,\lambda| \le 1. \tag{8.14}
\]

This is the condition for numerical stability. Unlike mathematical stability, it does not only depend on $\lambda$, but on the step size $\Delta t$ as well. More precisely, numerical stability depends on the product $\Delta t\,\lambda$, and does not automatically follow from mathematical stability. Instead, numerical stability must be examined method by method, establishing the unique step size limitations associated with each method.

[Figure 8.5: two panels in the complex plane, titled "Explicit Euler" (left) and "Implicit Euler" (right).]

Fig. 8.5 Stability regions of explicit and implicit Euler methods. Left panel shows the explicit Euler stability region $S_{EE}$ in the complex plane. The method is stable for $\Delta t\,\lambda$ inside the green disk. Right panel shows the stability region $S_{IE}$ of the implicit Euler method. The method is stable in $\mathbb{C}$, except inside the red disk, which corresponds to the region where the method is unstable. Thus the implicit Euler method is stable in the entire left half-plane, but also in most of the right half-plane, where the differential equation is unstable

In (8.14), $\lambda$ is a complex number. We put $z = \Delta t\,\lambda \in \mathbb{C}$, and rewrite (8.14) as

\[
|1 + z| \le 1.
\]

This is the interior of a circle in $\mathbb{C}$, with center at $z = -1$ and radius 1. It is referred to as the stability region of the explicit Euler method, formally defined as the disk

\[
S_{EE} = \{ z \in \mathbb{C} : |1 + z| \le 1 \}.
\]

The discrete problem is numerically stable for $\Delta t\,\lambda \in S_{EE}$.

A similar analysis for the implicit Euler method yields

\[
x_{n+1} = x_n + \Delta t\,\lambda\,x_{n+1} \quad\Rightarrow\quad x_{n+1} = (1 - \Delta t\,\lambda)^{-1} x_n.
\]

Hence the numerical solution remains bounded if $|1 - z|^{-1} \le 1$, and the stability region of the implicit Euler method is defined by

\[
S_{IE} = \{ z \in \mathbb{C} : |1 - z| \ge 1 \}.
\]

Numerical stability requires that $\Delta t\,\lambda \in S_{IE}$. The boundary of the stability region of the implicit Euler method is also a circle, but now with center at $z = 1$ and radius 1. The important difference is that while $S_{EE}$ is the interior of a circle, $S_{IE}$ is the exterior of a circle. In particular, we note that $\mathbb{C}^- \subset S_{IE}$. Thus, if $\Delta t\,\lambda \in \mathbb{C}^-$, the implicit Euler method is stable.

The implicit Euler method has a large stability region, covering the entire negative half plane, while the explicit method has a small stability region, putting strong restrictions on the step size $\Delta t$. Thus the explicit Euler method can only use short steps. By contrast, the implicit Euler method is stable whenever $\Delta t\,\lambda \in \mathbb{C}^-$. Since $\Delta t > 0$ is real, it follows that there is no restriction on the step size if $\lambda \in \mathbb{C}^-$. For this reason, the method is sometimes referred to as unconditionally stable. The stability regions of the explicit and implicit Euler methods are found in Figure 8.5.
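The two stability regions translate into simple membership tests. The sketch below uses hypothetical function names. It also computes the largest stable explicit Euler step for complex $\lambda$, which follows from expanding $|1 + \Delta t\,\lambda|^2 = 1 + 2\Delta t\,\mathrm{Re}\,\lambda + \Delta t^2|\lambda|^2 \le 1$, giving $\Delta t \le -2\,\mathrm{Re}\,\lambda/|\lambda|^2$.

```python
import math


def in_S_EE(z):
    """True if z = dt*lambda lies in the explicit Euler stability region."""
    return abs(1 + z) <= 1.0


def in_S_IE(z):
    """True if z = dt*lambda lies in the implicit Euler stability region."""
    return abs(1 - z) >= 1.0


def max_stable_dt_explicit_euler(lam):
    """Largest dt > 0 with |1 + dt*lam| <= 1 (0.0 if no such dt exists)."""
    lam = complex(lam)
    if lam == 0:
        return math.inf        # |1 + 0| = 1 for every step size
    if lam.real >= 0:
        return 0.0             # |1 + dt*lam| > 1 for every dt > 0
    return -2.0 * lam.real / abs(lam) ** 2


print(max_stable_dt_explicit_euler(-20.0))    # real lambda: 2/|lambda|
print(max_stable_dt_explicit_euler(-1 + 1j))  # complex lambda
print(in_S_IE(-1000.0), in_S_IE(0.5))
```

For real $\lambda < 0$ the bound reduces to $\Delta t \le 2/|\lambda|$, while the implicit Euler test is passed for every $z$ in the left half-plane.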

We can now analyze the results we obtained when testing the two methods.

Example The previous test problem, used to assess the properties of the explicit and implicit Euler methods, was

\[
\dot y = \lambda (y - g) + \dot g.
\]

This has a particular solution $y(t) = g(t)$ and exponential homogeneous solutions. Putting $x = y - g$, the test problem is transformed into the linear test equation $\dot x = \lambda x$. Thus, stability is only a function of $\Delta t\,\lambda$, and only depends on the homogeneous solutions and the method's ability to handle exponential solutions.

In the previous tests, we used two step sizes, $\Delta t = 0.2$ and $\Delta t = 0.1$. We also used three different values of $\lambda$, viz., $\lambda = -0.2$, $\lambda = -2$, and $\lambda = -20$. Since in all cases $\lambda \in \mathbb{C}^-$, the implicit Euler method is stable, no matter how the step size is chosen. This explains why no stability issues were observed.

For the explicit Euler method, we have the following table of parameter combinations:

\[
\begin{array}{l|ccc}
\text{Parameters} & \lambda = -0.2 & \lambda = -2 & \lambda = -20 \\ \hline
\Delta t = 0.1 & \Delta t\,\lambda = -0.02 & \Delta t\,\lambda = -0.2 & \Delta t\,\lambda = -2 \\
\Delta t = 0.2 & \Delta t\,\lambda = -0.04 & \Delta t\,\lambda = -0.4 & \Delta t\,\lambda = -4
\end{array}
\]

From this table we see that only one parameter combination, $\Delta t\,\lambda = -4$, is such that $\Delta t\,\lambda \notin S_{EE}$, causing numerical instability. Another combination, $\Delta t\,\lambda = -2$, is marginally stable. This is in full agreement with the tests, and for $\Delta t\,\lambda = -4$ the numerical solution became oscillatory and diverged from the exact solution, see Figure 8.3. This is the typical behavior when numerical instability is encountered.
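The table can be reproduced programmatically. In this sketch the function name `classify_explicit_euler` and the rounding tolerance `tol` are hypothetical; the classification is based only on the explicit Euler amplification factor $|1 + \Delta t\,\lambda|$.

```python
def classify_explicit_euler(z, tol=1e-12):
    """Classify z = dt*lambda by the explicit Euler growth factor |1 + z|."""
    growth = abs(1 + z)
    if growth > 1 + tol:
        return "unstable"
    if growth >= 1 - tol:
        return "marginal"
    return "stable"


step_sizes = (0.1, 0.2)
lambdas = (-0.2, -2.0, -20.0)

# One entry per parameter combination in the table.
table = {(dt, lam): classify_explicit_euler(dt * lam)
         for dt in step_sizes for lam in lambdas}

for dt in step_sizes:
    row = ", ".join(f"lambda={lam:g}: {table[(dt, lam)]}" for lam in lambdas)
    print(f"dt={dt:g} -> {row}")
```

Only the combination $\Delta t = 0.2$, $\lambda = -20$ is classified as unstable, and $\Delta t = 0.1$, $\lambda = -20$ sits on the boundary of the stability region.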

[Figure 8.6: a single panel showing solution curves plotted against $t \in [0,1]$.]

Fig. 8.6 Vector field and flow of a stiff problem. The test problem (8.11) is solved using the implicit Euler method with $N = 20$ steps. The exact solution $y(t) = \sin \pi t$ (blue) and the discrete solution (red markers) are plotted vs $t \in [0,1]$. Neighboring solutions (green) to the differential equation, for other initial conditions, illustrate the vector field away from the particular solution. At $\lambda = -30$, these solutions quickly converge on the particular solution. This is typical of stiff problems

8.3 Stiff problems

Stiff initial value problems require time stepping methods with stability regions covering (nearly) all of $\mathbb{C}^-$. The simplest example of such a method is the implicit Euler method. The simplest illustration of a stiff problem is the Prothero–Robinson test problem (8.11), i.e.,

\[
\dot y = \lambda (y - g) + \dot g,
\]

whose particular solution is $y(t) = g(t)$, and where the homogeneous solutions are exponentials $e^{t\lambda}$.

Stiffness is a question of how the numerical method interacts with the problem. If $\lambda \ll -1$, the homogeneous solutions decay very fast, after which the solution is near the particular solution, $y(t) \approx g(t)$, no matter what the initial condition was. This is demonstrated in Figure 8.6 for $\lambda = -30$. This is not a very stiff problem, but the parameter values are chosen to make the effect readily visible to the naked eye. By taking $\lambda \ll -30$ the neighboring flow becomes nearly vertical.

Example (Stiffness) In the Prothero–Robinson test problem, assume that $\lambda = -1000$, and that $g(t) = \sin t$. Analyzing the stability of a time stepping method for this problem is equivalent to considering the linear test equation (8.12) with the given $\lambda$.

Let us consider the numerical stability of the explicit Euler method. We then require $|1 + \Delta t\,\lambda| \le 1$, or, given that $\lambda < 0$,

\[
\Delta t \le \Delta t_S = \frac{2}{|\lambda|} = 2 \cdot 10^{-3},
\]

where $\Delta t_S$ is the maximum stable step size.

Meanwhile, approximating the particular solution $y(t) \approx \sin t$ using the explicit Euler method produces a local residual (equal to its local error)

\[
|r_n| \approx \frac{\Delta t^2\,|\ddot y|}{2} \le \frac{\Delta t^2}{2}.
\]

Let us assume that we need an accuracy specified by $|l_n| \le \mathrm{TOL} = 10^{-4}$, where TOL is a prescribed local error tolerance. Then, obviously, we can accept a step size

\[
\Delta t_{\mathrm{TOL}} = \sqrt{2\,\mathrm{TOL}} = 1.4 \cdot 10^{-2}.
\]

However, it will not be possible to use such a step size, because the method would go unstable. There is a conflict between the stability requirement and the accuracy requirement, since $\Delta t_S < \Delta t_{\mathrm{TOL}}$. This is the problem of stiffness; being restricted by having to maintain stability, an explicit method cannot reach its potential accuracy.

The problem is overcome by using an appropriate implicit method. For example, as the implicit Euler method is unconditionally stable, it has no stability restriction on $\Delta t$. Its local residual is the same as that of the explicit method, so it will be possible to use $\Delta t = 1.4 \cdot 10^{-2}$. In fact, one can use an even larger step size, as the local error is smaller. Thus

\[
l_{n+1} = \frac{r_n}{1 - \Delta t\,\lambda} \approx \frac{\Delta t^2}{2 + 2 \cdot 10^3\,\Delta t} \approx \frac{\Delta t}{2 \cdot 10^3} \le \mathrm{TOL},
\]

which requires $\Delta t \le 2 \cdot 10^3\,\mathrm{TOL} = 0.2$. This step size will produce the requested accuracy, and, since it is 100 times larger than the maximum stable step size $\Delta t_S$ available to the explicit Euler method, the implicit Euler method is likely going to be far more efficient, even though it is more expensive per step due to the necessary equation solving.

In real applications, one often encounters stiff problems where the ratio $\Delta t_S/\Delta t_{\mathrm{TOL}} \ll 1$ for any explicit method. The disparity between the two step sizes can be arbitrarily large, making it impossible to solve such stiff problems unless dedicated methods are used. On the other hand, when an appropriate implicit method is used, these problems can often be solved very quickly.
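The step sizes in the stiffness example can be recomputed in a few lines. This is a sketch with hypothetical variable names; it also solves the implicit Euler accuracy condition $\Delta t^2 / (2(1 + |\lambda|\Delta t)) \le \mathrm{TOL}$ exactly, as a quadratic in $\Delta t$, to compare with the approximation $\Delta t \le 2|\lambda|\,\mathrm{TOL}$ used in the text.

```python
import math

lam = -1000.0    # problem parameter from the example
tol = 1e-4       # local error tolerance TOL

# Explicit Euler: stability limit and accuracy limit.
dt_stab = 2.0 / abs(lam)
dt_acc = math.sqrt(2.0 * tol)

# Implicit Euler: approximate accuracy limit from the text ...
dt_ie_approx = 2.0 * abs(lam) * tol
# ... and the exact positive root of dt^2 - 2*TOL*|lam|*dt - 2*TOL = 0.
b = abs(lam) * tol
dt_ie_exact = b + math.sqrt(b * b + 2.0 * tol)


def implicit_local_error(dt):
    """Local error estimate r_n / (1 - dt*lam) with r_n = dt^2 / 2."""
    return 0.5 * dt ** 2 / (1.0 - dt * lam)


print(f"explicit Euler stability limit : {dt_stab:.3g}")
print(f"explicit Euler accuracy limit  : {dt_acc:.3g}")
print(f"implicit Euler limit (approx.) : {dt_ie_approx:.3g}")
print(f"implicit Euler limit (exact)   : {dt_ie_exact:.4g}")
print(f"step size gain over explicit   : {dt_ie_approx / dt_stab:.0f}x")
```

The exact root differs only marginally from 0.2, which indicates that the approximation in the text is adequate for this choice of parameters.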

To develop a comprehensive theory of stiffness, we may consider a system of nonlinear equations having a structure similar to the Prothero–Robinson example. Thus we will consider

\[
\dot y = f(y - g) + \dot g, \tag{8.15}
\]

where $f : \mathbb{R}^d \to \mathbb{R}^d$ is a nonlinear map having $f(0) = 0$. Then $g$ can be viewed as a "particular solution," as $y(t) = g(t)$ satisfies (8.15). But since the initial condition $y(0)$ may not be chosen to equal $g(0)$, there is also a nonlinear transient, corresponding to the "homogeneous solution."

Putting $u = y - g$, (8.15) is turned into the simpler nonlinear problem

\[
\dot u = f(u), \qquad u(0) = u_0. \tag{8.16}
\]


The central issue is now whether an explicit method would suffer severe step size restrictions when solving this problem. This depends on the mathematical stability properties of the system, and in particular on whether there are strongly damped solutions near the zero solution $u = 0$, corresponding to the linear Prothero–Robinson problem with $\lambda \ll -1$.

If the initial value $u(0)$ is small enough, the system (8.16) can be linearized around the equilibrium solution $u = 0$, to obtain

\[
\dot u \approx f'(0)\,u,
\]

where $f'(0)$ is the Jacobian matrix of $f$ evaluated at $u = 0$, as long as $\|u\|_2 \ll 1$. Since the matrix is constant, the linearized problem is a simple linear system,

\[
\dot u = Au,
\]

where $A \in \mathbb{R}^{d\times d}$. Unlike having a single, scalar $\lambda$ as before, we now have to deal with a matrix (and its spectrum), asking whether it will lead to stability restrictions on the step size $\Delta t$. This is done using the techniques of Chapter 6.

Investigating its mathematical stability, we take the inner product with $u$ to find the differential inequalities

\[
m_2[A] \cdot \|u\|_2 \le \frac{d}{dt}\|u\|_2 \le M_2[A] \cdot \|u\|_2.
\]

Here $m_2[A]$ and $M_2[A]$ are the lower and upper logarithmic norms, respectively, and the differential inequalities imply that

\[
e^{t\,m_2[A]} \le \|e^{tA}\|_2 \le e^{t\,M_2[A]}.
\]

Thus the matrix exponential is bounded above and below. The lower bound gives the maximum decay rate of homogeneous solutions, while the upper bound gives the maximum growth rate.

In the Prothero–Robinson example above, we saw that the stability restriction in an explicit method is caused by a fast decay rate. This happened when $\lambda \ll -1$. In a linear system, we have such a fast decay rate when $m_2[A] \ll -1$. This is characteristic of stiff problems: $m_2[A] \ll -1$ is necessary for stiffness. To give an alternative interpretation, if one would integrate the problem in reverse time, then we would solve $\dot u = -Au$, whose maximum growth rate is $M_2[-A] = -m_2[A] \gg 1$. Thus the reverse time problem has very unstable solutions.

Since mathematical and numerical stability are not equivalent, we again consider the explicit Euler method. We then see that a sufficient condition for numerical stability is the circle condition

\[
\|I + \Delta t\,A\|_2 \le 1.
\]

By the triangle inequality, $-1 + \|\Delta t\,A\|_2 \le \|I + \Delta t\,A\|_2 \le 1$, the circle condition implies $\|\Delta t\,A\|_2 \le 2$. However, $\|\Delta t\,A\|_2 \ge -m_2[\Delta t\,A]$. Therefore, if $m_2[\Delta t\,A] \ll -1$ the circle condition cannot possibly be satisfied, and instability is bound to happen. It follows that $m_2[\Delta t\,A] \ll -1$ is a necessary condition for stiffness.

However, a system of equations is more complicated than a scalar equation. The investigation of the scalar problem excluded growing solutions. In a system, it is possible that we have growing as well as decaying solutions. We therefore introduce the average decay rate,

\[
s[A] = \frac{m_2[A] + M_2[A]}{2}.
\]

The reason why the condition $m_2[A] \ll -1$ alone might not cause stiffness is that if $M_2[A] \gg 1$ at the same time, the system has both rapidly decaying and rapidly growing solutions. While decaying solutions put a stability restriction on the step size, the growing solutions put an accuracy restriction, in order to resolve the growing solutions.

Definition 8.1. Let $A \in \mathbb{R}^{d\times d}$. The stiffness indicator of $A$ is defined by

\[
s[A] = \frac{m_2[A] + M_2[A]}{2}. \tag{8.17}
\]

Here we note that if $A = \lambda \in \mathbb{R}$, then $s[\lambda] = \lambda$. The stiffness indicator is compatible with the previous discussion of scalar problems, and we can proceed to generalize the concept to systems. For scalar problems, $\lambda$ puts a restriction on the step size. More generally, we now need to relate $s[A]$ to a time scale $\tau$, which is not necessarily the same as the step size $\Delta t$.

Definition 8.2. Assume that $s[A] < 0$. Then the local reference time scale $\tau$ is defined by

\[
\tau = -\frac{1}{s[A]}. \tag{8.18}
\]

The reference time scale approximates the largest step size by which an explicit method can proceed, without losing numerical stability. Any desired time interval, be it the length of the integration interval or the preferred step size, can be related to the reference time scale.

Definition 8.3. Let $\tau$ be the local reference time scale. For a given step size $\Delta t$ the stiffness factor is defined by

\[
r(\Delta t) = \Delta t/\tau.
\]

Irrespective of whether an explicit or implicit method is used, stiffness is determined by whether a step size $\Delta t > \tau$ is desired or not. This depends on many factors, not least the accuracy requirement and error tolerance TOL. As a simple observation, if the problem is to be solved on $[0,T]$, stiffness cannot occur if $r(T) < 1$, since the step sizes will obviously be shorter than $\tau$ in such a case. However, it may very well occur that $r(T) \gg 1$, in which case the problem may turn out to be stiff.
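Definitions 8.1 to 8.3 can be illustrated for a $2\times 2$ matrix. The sketch below assumes the standard Euclidean characterization of the logarithmic norms, namely that $m_2[A]$ and $M_2[A]$ are the smallest and largest eigenvalues of the symmetric part $(A + A^T)/2$; the function names and the example matrix are hypothetical.

```python
import math


def log_norms_2x2(A):
    """Return (m2[A], M2[A]) as the extreme eigenvalues of (A + A^T)/2."""
    a, c = A[0][0], A[1][1]
    b = 0.5 * (A[0][1] + A[1][0])       # off-diagonal entry of the symmetric part
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


def stiffness_indicator(A):
    """s[A] = (m2[A] + M2[A]) / 2, as in (8.17)."""
    m2, M2 = log_norms_2x2(A)
    return 0.5 * (m2 + M2)


def reference_time_scale(A):
    """tau = -1 / s[A], as in (8.18); only defined for s[A] < 0."""
    s = stiffness_indicator(A)
    if s >= 0:
        raise ValueError("reference time scale requires s[A] < 0")
    return -1.0 / s


def stiffness_factor(dt, A):
    """r(dt) = dt / tau, as in Definition 8.3."""
    return dt / reference_time_scale(A)


A = [[-1000.0, 1.0], [0.0, -1.0]]       # hypothetical example: one fast, one slow mode
m2, M2 = log_norms_2x2(A)
print(f"m2 = {m2:.5f}, M2 = {M2:.5f}, s = {stiffness_indicator(A):.2f}")
print(f"tau = {reference_time_scale(A):.3e}, r(T=1) = {stiffness_factor(1.0, A):.1f}")
```

With this matrix the stiffness factor over an interval of length $T = 1$ is in the hundreds, so the problem may turn out to be stiff if step sizes much longer than $\tau$ are desired.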

We are now in a position to discuss stiffness in more general problems. Following the techniques outlined in Section 6.5, we consider two neighboring solutions to a nonlinear problem,

\[
\begin{aligned}
\dot u &= f(u), & u(0) &= u_0, \\
\dot v &= f(v), & v(0) &= u_0 + \Delta u_0,
\end{aligned}
\]

where we no longer require that $f(0) = 0$. Thus we can consider stiffness in terms of perturbations along a non-constant solution. The difference $\Delta u = v - u$ satisfies

\[
\frac{d}{dt}\Delta u = f(v) - f(u).
\]

As before, we will only consider small ("infinitesimal") perturbations $\Delta u$, allowing us to linearize the perturbed problem around the non-constant solution $u$. Taking the inner product with $\Delta u$, we obtain the differential inequalities

\[
m_2[J(u)] \cdot \|\Delta u\|_2 \le \frac{d}{dt}\|\Delta u\|_2 \le M_2[J(u)] \cdot \|\Delta u\|_2,
\]

where $J(u) = f'(u)$ is the Jacobian matrix of $f$, evaluated along the nominal solution $u(t)$. Thus the matrix is no longer constant but varies along the solution trajectory. Nevertheless, the same theory applies, and if $s[J(u)] < 0$, we obtain a reference time scale $\tau(u)$. Thus the stiffness factor too can vary along the solution. Stiffness occurs whenever we need to use a step size $\Delta t$ such that

\[
r(\Delta t) = \frac{\Delta t}{\tau(u)} \gg 1.
\]

By evaluating $s[J(u)]$ along a trajectory, stiffness can be assessed locally.

Remarks on stiffness For any nonlinear system $\dot u = f(u)$, with $f \in C^1$, stiffness is defined locally at any point $u$ of the vector field, in terms of $s[f'(u)]$. The stiffness indicator determines a local time scale $\tau(u)$. In case $f = A$ is a linear constant coefficient system, $s[A]$ and $\tau$ are constant on $[0,T]$.

1. Since $|s[J(u)]| \le L[f]$, a necessary condition for stiffness is that $T L[f] \gg 1$. However, the latter cannot be used as a characterization of stiffness. As $L[f] = L[-f]$, this quantity is independent of whether the problem is integrated in forward time or reverse time. One of the most typical characteristics of stiffness is that the problem has strong damping in forward time, and is strongly unstable in reverse time. This is reflected by $s[J(u)]$.

2. Depending on the requested error tolerance TOL, as well as on the choice of method, the step size $\Delta t$ may have to be chosen shorter than $\tau$; in such a case stiffness will not occur either. Likewise, should $s[J(u)]$ become positive during some subinterval, stiffness is no longer an issue. Unless $T \cdot s[J(u)] \ll -1$ stiffness will not occur; this means that any time stepping method can be used, without loss of efficiency.

3. It is not uncommon to encounter problems where $T \cdot s[J(u)]$ is of the order of $-10^6$, or much larger in magnitude. In some problems of practical interest $T \cdot s[J(u)]$ is of the order of $-10^{12}$ or more; in such cases, an explicit method will never finish the integration, as trillions of steps will be necessary. By contrast, there are stiff problems where a well designed implicit method solves the problem in $N$ steps, where $N$ is practically independent of $T s[J(u)]$, or of $T L[f]$.

4. Given the choice between an explicit and an implicit method, using the same error tolerance, the explicit method may be restricted to using step sizes $\Delta t \ll \tau$, while an "unconditionally stable" implicit method might be able to employ a step size $\Delta t \gg \tau$.

8.4 Simple methods of order 2

The explicit and implicit Euler methods are the simplest time stepping methods, but they illustrate the essential aspects of such methods. They are simple to analyze and can be understood both intuitively and theoretically. However, the methods are only first order convergent and therefore of little practical interest. In real computations we need higher order methods. Before we proceed to advanced methods, we shall consider a few simple methods of convergence order $p = 2$.

The construction of the Euler methods started from approximating the derivative by a finite difference quotient. If $y \in C^1[t_n, t_{n+1}]$, then, by the mean value theorem, there is a $\theta \in [0,1]$ such that

\[
\frac{y(t_{n+1}) - y(t_n)}{t_{n+1} - t_n} = \dot y\big((1-\theta)t_n + \theta t_{n+1}\big) = \dot y\big(t_n + \theta(t_{n+1} - t_n)\big).
\]

But this is only an existence theorem, not telling us the value of $\theta$. In the explicit Euler method, we used $\theta = 0$ to create a first order approximation. Likewise, in the implicit Euler method we used $\theta = 1$. However, there is a better choice. Thus, assuming that $y$ is sufficiently differentiable, expanding in Taylor series around $t_n$ yields, for the left hand side,

\[
\frac{y(t_{n+1}) - y(t_n)}{t_{n+1} - t_n} = \dot y(t_n) + \frac{t_{n+1} - t_n}{2}\,\ddot y(t_n) + O\big((t_{n+1} - t_n)^2\big);
\]

and for the right hand side,

\[
\dot y\big(t_n + \theta(t_{n+1} - t_n)\big) = \dot y(t_n) + \theta(t_{n+1} - t_n)\,\ddot y(t_n) + O\big((t_{n+1} - t_n)^2\big).
\]

Matching terms, we see that by taking $\theta = 1/2$, we have a second order approximation. This is the best that can be achieved in general, and corresponds to the approximation

\[
\frac{y(t_{n+1}) - y(t_n)}{t_{n+1} - t_n} \approx \dot y\left(\frac{t_n + t_{n+1}}{2}\right).
\]

We can transform this into a computational method for first order IVPs in two ways. For obvious reasons, both are referred to as the midpoint method; one is explicit and the other implicit.
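The effect of the choice of $\theta$ can be observed numerically. In this sketch, the function $y(t) = \sin t$, the point $t = 1$ and the helper names are hypothetical choices; the observed order is estimated by halving the step and comparing the errors of the difference quotient against $\dot y(t + \theta h)$.

```python
import math


def quotient_error(theta, h, t=1.0):
    """Error of the difference quotient of sin as an approximation of cos(t + theta*h)."""
    quotient = (math.sin(t + h) - math.sin(t)) / h
    return abs(quotient - math.cos(t + theta * h))


def observed_order(theta, h=1e-2):
    """Estimate the approximation order from the error ratio for h and h/2."""
    return math.log2(quotient_error(theta, h) / quotient_error(theta, h / 2))


for theta in (0.0, 0.5, 1.0):
    print(f"theta = {theta:3.1f}: observed order ~ {observed_order(theta):.2f}")
```

The endpoints $\theta = 0$ and $\theta = 1$ should show first order behavior, while the midpoint $\theta = 1/2$ should show second order.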


Beginning with the Implicit Midpoint method, we define the scheme

\[
y_{n+1} - y_n = \Delta t \cdot f\left(\frac{t_n + t_{n+1}}{2},\ \frac{y_n + y_{n+1}}{2}\right). \tag{8.19}
\]

The method is implicit since $y_{n+1}$ appears both in the left and right hand sides.

The explicit construction is just a matter of re-indexation. We use three consecutive equidistant points $t_{n-1}$, $t_n$ and $t_{n+1}$, all separated by a step size $\Delta t$. Then

\[
\frac{y(t_{n+1}) - y(t_{n-1})}{2\Delta t} = \dot y(t_n) + O(\Delta t^2).
\]

This leads to the Explicit Midpoint method, defined by

\[
y_{n+1} - y_{n-1} = 2\Delta t \cdot f(t_n, y_n). \tag{8.20}
\]

This is a two-step method, since it needs both $y_{n-1}$ and $y_n$ to compute $y_{n+1}$. On the other hand, the method is explicit and needs no equation solving.

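There is a third, less obvious, construction. Consider the approximation

\[
\dot y\left(\frac{t_n + t_{n+1}}{2}\right) \approx \frac{\dot y(t_n) + \dot y(t_{n+1})}{2}.
\]

This means that we approximate the derivative at the midpoint by the average of the derivatives at the endpoints. Expanding $\dot y(t_n)$ and $\dot y(t_{n+1})$ around the midpoint, we obtain

\[
\begin{aligned}
\dot y(t_n) &= \dot y(\bar t) - \frac{\Delta t}{2}\,\ddot y(\bar t) + O(\Delta t^2), \\
\dot y(t_{n+1}) &= \dot y(\bar t) + \frac{\Delta t}{2}\,\ddot y(\bar t) + O(\Delta t^2),
\end{aligned}
\]

where $\bar t = (t_n + t_{n+1})/2$ represents the midpoint. It follows that

\[
\frac{\dot y(t_n) + \dot y(t_{n+1})}{2} = \dot y(\bar t) + O(\Delta t^2)
\]

is a second order approximation. This leads to the Trapezoidal Rule,

\[
y_{n+1} - y_n = \Delta t \cdot \left(\frac{f(t_n, y_n) + f(t_{n+1}, y_{n+1})}{2}\right). \tag{8.21}
\]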

It is an implicit method, and it is obviously related to the implicit midpoint method. Thus, if we consider a linear constant coefficient problem $\dot y = Ay$, both methods produce the discretization

\[
y_{n+1} - y_n = \Delta t \cdot \left(\frac{Ay_n + Ay_{n+1}}{2}\right).
\]

Solving for $y_{n+1}$, we obtain

\[
y_{n+1} = \left(I - \frac{\Delta t\,A}{2}\right)^{-1}\left(I + \frac{\Delta t\,A}{2}\right) y_n.
\]

However, the two methods are no longer identical for nonlinear systems, or for linear systems with time dependent coefficients.
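The relation between the two implicit methods can be illustrated on a scalar linear problem $\dot y = a(t)\,y$, for which the implicit equations in (8.19) and (8.21) can be solved in closed form. The coefficient functions and step size below are hypothetical choices for this sketch.

```python
def implicit_midpoint_step(a, t, y, dt):
    """One implicit midpoint step for y' = a(t) y, solved exactly for y_{n+1}."""
    am = a(t + 0.5 * dt)
    return (1.0 + 0.5 * dt * am) / (1.0 - 0.5 * dt * am) * y


def trapezoidal_step(a, t, y, dt):
    """One trapezoidal rule step for y' = a(t) y, solved exactly for y_{n+1}."""
    return (1.0 + 0.5 * dt * a(t)) / (1.0 - 0.5 * dt * a(t + dt)) * y


def constant(t):
    return -2.0          # constant coefficient: the two methods coincide


def varying(t):
    return -2.0 * t      # time dependent coefficient: the two methods differ


dt, y0 = 0.1, 1.0
gap_constant = abs(implicit_midpoint_step(constant, 0.0, y0, dt)
                   - trapezoidal_step(constant, 0.0, y0, dt))
gap_varying = abs(implicit_midpoint_step(varying, 0.0, y0, dt)
                  - trapezoidal_step(varying, 0.0, y0, dt))
print("constant coefficient gap:", gap_constant)
print("time dependent gap      :", gap_varying)
```

With a constant coefficient both steps reduce to the same rational expression in $\Delta t\,\lambda$; with $a(t) = -2t$ the midpoint and endpoint evaluations differ, and so do the results.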

Among advanced methods, there are two dominating classes, Runge–Kutta (RK) methods, and linear multistep (LM) methods. While the latter may use an arbitrary number of steps to advance the solution, RK methods are one-step methods. We shall study both method classes below.

Of the three second order methods above, the explicit midpoint method is in the LM class, but not in RK. The implicit midpoint method is in the RK class, but not in LM. Finally, the trapezoidal rule, as well as the explicit and implicit Euler methods studied before, can be seen as (some of the simplest) members of both the LM and the RK class.

Let us now turn to comparing these methods. For simplicity, we take the scalar Prothero–Robinson test problem (8.11), which means that we can solve equations in the implicit methods exactly. In Figure 8.7 the trapezoidal rule is compared to the explicit Euler method. The test demonstrates the need for higher order methods. Thus, going from a first order to a second order method has a strong impact on accuracy. Although the step size is the same for both methods, the second order method can achieve several orders of magnitude higher accuracy. The same relative effect takes place each time we select a higher method order. Since it is possible to construct methods of very high convergence orders, it is in practice possible to solve many problems to full numerical accuracy at a reasonable computational cost. Modern codes typically implement methods of up to order $p = 6$, although there are standard solvers of even higher orders.

The test problem in Figure 8.7 is not stiff, but slightly unstable. It is not a particularly difficult problem. It is solved using a rather coarse step size, again to emphasize the differences in accuracy. In real computations the step size would be smaller, making the difference even more pronounced. To see the effect as a function of the step size $\Delta t$, we compare the explicit and implicit Euler methods to the trapezoidal rule in Figure 8.8. Here, the setting is moderately stiff, at $\lambda = -50$, so as to also demonstrate when the explicit method goes unstable. More importantly, we see that for smaller (but still relevant) step sizes, the second order method achieves several orders of magnitude better accuracy.

This is the central idea in discretization methods: in a convergent method, accuracy increases as $\Delta t \to 0$, but the smaller the step size, the more computational effort is needed. So how do we obtain high accuracy without a too large computational cost? The answer is by using high order methods. Then the error can be made extremely small, even without taking $\Delta t$ exceedingly small.
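The convergence orders can be estimated empirically. The following sketch assumes that the test problem is $\dot y = \lambda(y - g) + \dot g$ with $g(t) = \sin \pi t$ and $y(0) = g(0)$, so that the exact solution is $g$; the value $\lambda = -50$, the step counts and all function names are choices made for this illustration. Because the problem is linear and scalar, the implicit equations are solved in closed form.

```python
import math

LAM = -50.0


def g(t):
    return math.sin(math.pi * t)


def dg(t):
    return math.pi * math.cos(math.pi * t)


def f(t, y):
    """Right-hand side of the assumed Prothero-Robinson test problem."""
    return LAM * (y - g(t)) + dg(t)


def step_explicit_euler(t, y, dt):
    return y + dt * f(t, y)


def step_implicit_euler(t, y, dt):
    t1 = t + dt
    return (y + dt * (dg(t1) - LAM * g(t1))) / (1.0 - dt * LAM)


def step_trapezoidal(t, y, dt):
    t1 = t + dt
    rhs = y + 0.5 * dt * (f(t, y) + dg(t1) - LAM * g(t1))
    return rhs / (1.0 - 0.5 * dt * LAM)


def max_error(step, n_steps, t_end=1.0):
    """Largest global error over all grid points on [0, t_end]."""
    dt = t_end / n_steps
    y = g(0.0)
    worst = 0.0
    for n in range(n_steps):
        y = step(n * dt, y, dt)
        worst = max(worst, abs(y - g((n + 1) * dt)))
    return worst


def observed_order(step, n_steps=400):
    """Estimate the order from the error ratio for N and 2N steps."""
    return math.log2(max_error(step, n_steps) / max_error(step, 2 * n_steps))


for name, step in [("explicit Euler", step_explicit_euler),
                   ("implicit Euler", step_implicit_euler),
                   ("trapezoidal rule", step_trapezoidal)]:
    print(f"{name:17s}: error {max_error(step, 400):.2e}, "
          f"observed order ~ {observed_order(step):.2f}")
```

Halving the step size should roughly halve the Euler errors and reduce the trapezoidal rule error by about a factor of four.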

The only concern is that we have to make sure that the method remains stable, since stability is required in order to have convergence. Note that this is not a matter of the differential equation being unstable; such problems can be solved too, as demonstrated in Figure 8.7. Instead, it is a matter of whether the method as such is a stable discretization. To illustrate this point, we return to the test problem (8.11), and solve it using the explicit midpoint method. The results are found in Figure 8.9. Comparing the explicit midpoint method to the trapezoidal rule, both of second order, we find that there are substantial differences. Here we have returned to a nonstiff problem, with $\lambda = -1$, but even so, the explicit method soon develops unacceptable oscillations. These are due to instability, although the issue is less serious than before. Even so, the trapezoidal rule is far better, as the error plots demonstrate.

[Figure 8.7: two panels plotted against $t \in [0,4]$; the bottom panel is titled "Trapezoidal Rule, N=50".]

Fig. 8.7 The effect of second order convergence. The test problem (8.11) is solved using the explicit Euler method (top) and with the trapezoidal rule (bottom), both using $N = 50$ steps. The exact solution $y(t) = \sin \pi t$ (blue) and the discrete solution (red markers) are plotted vs $t \in [0,4]$, covering two full periods. The choice $\lambda = 0.1$ makes the problem slightly unstable, posing a greater challenge to the methods. The first order explicit Euler has a readily visible and growing error. By contrast, the second order trapezoidal rule offers much higher accuracy

This test suggests that stability, convergence and accuracy are delicate matters that require a deep understanding. Later, we shall return to these methods in connection with Hamiltonian problems, and find that the explicit midpoint method has a unique niche, in Hamiltonian problems (e.g. in celestial mechanics) and in hyperbolic conservation laws. This is due to the mathematical properties of such problems.


[Figure 8.8: three log-log panels titled "EE", "IE" and "TR", showing error (vertical axis, $10^{-9}$ to $10^{-3}$) against dt (horizontal axis, $10^{-4}$ to $10^{-2}$).]

Fig. 8.8 Global error vs. step size. The test problem (8.11) is solved using explicit Euler (left), implicit Euler (center) and the trapezoidal rule (right), over $[0,1]$ for $\lambda = -50$. Red graphs show global error magnitude at $t = 1$ as a function of $\Delta t$. The Euler methods are of order $p = 1$ as indicated by dashed reference line of slope 1. The trapezoidal rule is of order $p = 2$, indicated by reference line of slope 2. The error of the trapezoidal rule is 1000 times smaller at $\Delta t = 10^{-4}$, showing the impact of higher order methods. The explicit Euler error graph goes "through the roof" at $\Delta t \ge 4 \cdot 10^{-2}$ due to numerical instability, when $\Delta t\,\lambda$ is outside the method's stability region

To analyze the instability of the explicit midpoint method, we apply it to the linear test equation $\dot y = \lambda y$. We then obtain the recursion

\[
y_{n+1} - y_{n-1} = 2\Delta t\,\lambda\,y_n.
\]

Putting $q = \Delta t\,\lambda$, this is a linear difference equation

\[
y_{n+1} - 2q\,y_n - y_{n-1} = 0.
\]

This has stable solutions provided that the roots of the characteristic equation lie inside or on the unit circle, with any roots on the circle being simple. The characteristic equation is

\[
\kappa^2 - 2q\kappa - 1 = 0.
\]

Since this can be factorized into $(\kappa - \kappa_1)(\kappa - \kappa_2) = 0$, where $\kappa_1, \kappa_2$ are the roots of the characteristic equation, we see that $\kappa_1\kappa_2 = -1$. Thus, if one root is less than one in magnitude, the other is greater than 1. Stability therefore requires both roots to have unit modulus, and we can write

\[
\kappa_1 = e^{i\varphi}; \qquad \kappa_2 = -e^{-i\varphi}.
\]

[Figure 8.9: three panels plotted against $t \in [0,6]$, titled "Explicit midpoint method, N=75", "Trapezoidal rule, N=75" and "Error magnitudes" (logarithmic vertical axis).]

Fig. 8.9 Comparison of second order methods. The test problem (8.11) is solved using the explicit midpoint method (top) and the trapezoidal rule (center), over $[0,6]$ for $\lambda = -1$. Both methods have order $p = 2$ and use $N = 75$ steps. The explicit method develops growing oscillations over time, indicative of instability. No such issues are observed in the implicit method. Graphs of how the error evolves over time (bottom) show that while the trapezoidal rule (green) maintains an error never exceeding $5 \cdot 10^{-3}$, the explicit midpoint method has an exponentially growing error (red), as indicated by the straight trendline in the lin-log diagram

Now, since we must also have $\kappa_1 + \kappa_2 = 2q$, we obviously have

\[
2q = e^{i\varphi} - e^{-i\varphi} = 2i \sin\varphi.
\]

Consequently, $q = i \sin\varphi$, and it follows that

\[
\Delta t\,\lambda = i \sin\varphi.
\]

Since $\Delta t > 0$, the surprising result is that the method is only stable when $\lambda$ is on the imaginary axis. Writing $\lambda = i\omega$, we must therefore have

\[
\Delta t\,\omega \in (-1, 1).
\]

Note that we cannot allow $|\Delta t\,\omega| = 1$ since we would then have a double root, leading to unbounded solutions.
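The root condition can be checked numerically by solving the characteristic equation $\kappa^2 - 2q\kappa - 1 = 0$ for a few values of $q = \Delta t\,\lambda$. This is a minimal sketch with hypothetical function names; the value $q = -0.08$ corresponds to $\lambda = -1$ with $N = 75$ steps on $[0,6]$.

```python
import cmath


def midpoint_roots(q):
    """Roots of kappa^2 - 2*q*kappa - 1 = 0, i.e. kappa = q +/- sqrt(q^2 + 1)."""
    q = complex(q)
    d = cmath.sqrt(q * q + 1.0)
    return q + d, q - d


def largest_root_modulus(q):
    return max(abs(k) for k in midpoint_roots(q))


for q in (0.5j, -0.9j, -0.08, 0.08, 1j):
    k1, k2 = midpoint_roots(q)
    print(f"q = {q!s:>8}: |k1| = {abs(k1):.6f}, |k2| = {abs(k2):.6f}")
```

For purely imaginary $q$ with $|q| < 1$, both roots should have modulus 1; for real $q \ne 0$, one root lies outside the unit circle; for $q = \pm i$ the two roots coincide.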


[Figure 8.10: two panels in the complex plane, titled "Trapezoidal rule" (left) and "Explicit midpoint method" (right).]

Fig. 8.10 Stability regions. The stability region of the trapezoidal rule is the entire negative half-plane $\mathbb{C}^-$, as indicated by the green area in the left panel. The stability region of the explicit midpoint method, however, is very "small" (right panel) as it only consists of the open interval $i(-1,1)$. Thus the endpoints $\pm i$ are not included

The conclusion is that the method is only stable for $\Delta t\,\lambda$ on the open interval $i \cdot (-1,1)$ in the complex plane. This is simply a short strip of the imaginary axis. Since we used $\lambda = -1$ in our test, we chose $\lambda$ in the negative half plane. The method is therefore obviously unstable, no matter how we choose $\Delta t$. For this reason, the method is unsuitable for the test problem.

By contrast, the trapezoidal rule is stable, which explains why its performance is superior. No matter how $\lambda \in \mathbb{C}^-$ is chosen, the trapezoidal rule will not fail. To see this, we again consider the linear test equation, $\dot y = \lambda y$. We obtain the recursion

\[
y_{n+1} - y_n = \frac{\Delta t\,\lambda}{2}\,(y_{n+1} + y_n),
\]

resulting in

\[
y_{n+1} = \frac{1 + z/2}{1 - z/2}\,y_n = R(z)\,y_n,
\]

where $z = \Delta t\,\lambda$. Thus we need $|R(z)| \le 1$ for stability. This implies that

\[
|1 + z/2| \le |1 - z/2|.
\]

This requires that the distance from an arbitrary point $z/2 \in \mathbb{C}$ to $+1$ is at least as great as its distance to $-1$ in the complex plane. This obviously implies that $z \in \mathbb{C}^-$. The method's stability region is therefore $S_{TR} \equiv \mathbb{C}^-$. The stability regions of the explicit midpoint method and the trapezoidal rule are shown in Figure 8.10.
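The condition $|R(z)| \le 1$ can be sampled at a few points of the complex plane. The sample points in this sketch are hypothetical and serve only as a spot check, not as a proof.

```python
def R_trapezoidal(z):
    """Amplification factor R(z) = (1 + z/2) / (1 - z/2) of the trapezoidal rule."""
    z = complex(z)
    return (1.0 + z / 2.0) / (1.0 - z / 2.0)


# Sample points in the left half-plane, on the imaginary axis, and to the right.
left_samples = [complex(x, y) for x in (-50.0, -1.0, -0.01) for y in (-10.0, 0.0, 10.0)]
axis_samples = [3j, -0.5j]
right_samples = [0.5, complex(1.0, 4.0)]

for z in left_samples + axis_samples + right_samples:
    print(f"z = {z!s:>14}: |R(z)| = {abs(R_trapezoidal(z)):.6f}")
```

Points with negative real part should give $|R(z)| < 1$, points on the imaginary axis $|R(z)| = 1$, and points with positive real part $|R(z)| > 1$.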

This explains why two second order methods can produce such different results. In fact, the stability properties of the explicit midpoint method need to be qualified. We have already determined its stability region. Let us again assume that we solve ẏ = λy with λ ∈ R to obtain the linear difference equation

$$ y_{n+1} - 2q\,y_n - y_{n-1} = 0, $$

where q = ∆t λ, with characteristic equation κ² − 2qκ − 1 = 0. As noted above, the product of the roots is −1, and the sum is 2q. If λ is real, then q is real, so for stability the only possibility is that κ₁ = 1 and κ₂ = −1. The sum is then zero, implying λ = 0. Thus, if λ ≠ 0 is real, the method is necessarily unstable, with roots

$$ \kappa_1 \approx 1 + \Delta t\,\lambda, \qquad \kappa_2 \approx -\frac{1}{1 + \Delta t\,\lambda} \approx -1 + \Delta t\,\lambda. $$

It follows that the solution has the behavior

$$ y_n \approx C_1 (1 + \Delta t\,\lambda)^n + C_2 (-1)^n (1 - \Delta t\,\lambda)^n \approx C_1 e^{t_n \lambda} + C_2 (-1)^n e^{-t_n \lambda}, $$

where $t_n = n\Delta t$. Thus, there is a discrete oscillatory behavior, indicated by the factor $(-1)^n$. In case λ ∈ R−, the amplitude of this oscillation grows exponentially, as indicated by the factor $e^{-t_n \lambda}$. Although this is confirmed in Figure 8.4, in full agreement with theory, this undesirable behavior only evolves over time, since the coefficient $C_2$ is very small. The method does remain convergent, but it is not “stable” in the same sense as the other methods considered so far. The explicit midpoint method is weakly stable, and is only of practical value for problems where λ = iω is located on the imaginary axis.

The qualification of stability that is needed is the following. A method is called zero stable if it produces stable solutions when applied to the (trivial) problem ẏ = 0. The method is called strongly zero stable if the characteristic equation associated with this problem has a single root of unit modulus, κ = 1. In case there are other roots of unit modulus, but still no roots of multiplicity greater than one, the method is weakly zero stable. The explicit and implicit Euler methods, as well as the trapezoidal rule, only have the single root κ = 1, and are strongly zero stable. (All one-step methods are strongly zero stable, since they only have a single root.) The explicit midpoint method, however, is a two-step method having two unimodular roots, κ = 1 and κ = −1. Therefore it is weakly zero stable. It is the root κ = −1 that limits method performance.
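The effect of the parasitic root can be observed in a small experiment. The sketch below is illustrative only: it assumes the test problem ẏ = −y with y(0) = 1, and it takes the second starting value from one explicit Euler step, which is a choice made here and not prescribed by the text.

```python
import math

def midpoint_roots(q: float):
    """Roots of kappa^2 - 2*q*kappa - 1 = 0 for real q = dt * lambda."""
    s = math.sqrt(q * q + 1.0)
    return q + s, q - s

def explicit_midpoint(lam: float, dt: float, n_steps: int) -> float:
    """Apply y_{n+1} = y_{n-1} + 2*dt*lam*y_n to y' = lam*y, y(0) = 1.

    The starting value y_1 comes from one explicit Euler step (an assumption).
    Returns y_{n_steps}.
    """
    y_prev, y = 1.0, 1.0 + dt * lam
    for _ in range(n_steps - 1):
        y_prev, y = y, y_prev + 2.0 * dt * lam * y
    return y

k1, k2 = midpoint_roots(-0.01)
print(k1, k2)                                 # principal root and parasitic root
print(explicit_midpoint(-1.0, 0.01, 100))     # numerical solution at t = 1
print(explicit_midpoint(-1.0, 0.01, 2000))    # numerical solution at t = 20
```

For negative λ the parasitic root has modulus greater than one, so the oscillating component eventually dominates even though its initial amplitude is small.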


8.5 Conclusions

In a stable method, the global error has a more or less generic structure of the form

$$ \|e\| \lesssim C \cdot \Delta t^{p} \cdot \|y^{(p)}\|_\infty \cdot \frac{e^{T M[f]} - 1}{M[f]}. \qquad (8.22) $$

For convergence, stability is necessary. The necessary stability condition is rather modest: a method is only required to solve the linear test equation ẏ = λy in a stable manner for λ = 0, and for this reason, the condition is referred to as zero stability. It requires that the point 0 ∈ C is inside the method’s stability region. While this is the case also for the explicit midpoint method studied above, further qualifications are needed, and in most cases we require strong zero stability, which excludes the explicit midpoint method.

Apart from stability, which is crucial, the global error bound depends on

• A constant C, characteristic of the method, known as the error constant
• The step size ∆t
• The method order p
• The regularity of the solution y, as indicated by $\|y^{(p)}\|_\infty$
• The range of integration [0,T], or the interval on which the problem is solved
• The logarithmic norm M[f], determining the damping rate of perturbations.

Thus a strongly stable (convergent) method will produce any desired accuracy by choosing ∆t small enough. Choosing a method with a higher order of convergence p may often be preferable. The regularity of the solution, the damping rate, and the range of integration are parameters given by the problem, and little can be done about them. Nevertheless, it is of importance to understand that the final computational accuracy also depends on those parameters.

Returning to stability, there are three different notions involved:

• Stability of the problem, essentially governed by M[f]
• Stability of the discretization for a fixed ∆t > 0
• Stability of the method as ∆t → 0 for a given T.

Although this may at first seem like an overuse of the same term, they all refer to a continuous data dependence, as some variable goes to infinity.

In a stable problem, also referred to as mathematical stability, we are interested in whether a small perturbation of the differential equation only results in a small perturbation of the solution y(t) as t → ∞. The usual setting is that we consider a single solution, often an equilibrium, and whether small perturbations of the initial condition will increase or stay bounded. If they stay bounded, the solution is stable. One sufficient condition for mathematical stability is M[f] < 0, and in (8.22), we see that the stability of the problem affects the computational accuracy, in particular how fast the global error is accumulated.
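To get a feeling for how M[f] enters the bound, one can tabulate the last factor of (8.22). This is a minimal sketch; the function name and the sample values of T and M are illustrative assumptions.

```python
import math

def growth_factor(T: float, M: float) -> float:
    """The factor (exp(T*M) - 1) / M from (8.22); its limit as M -> 0 is T."""
    return T if M == 0.0 else math.expm1(T * M) / M

for M in (-10.0, -1.0, 0.0, 1.0):
    # For M < 0 the factor stays below 1/|M|; for M > 0 it grows exponentially in T.
    print(M, growth_factor(10.0, M))
```

For M[f] < 0 the factor is bounded by 1/|M[f]| regardless of T, whereas for M[f] > 0 it grows exponentially with the length of the integration interval.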

In a stable discretization, also referred to as numerical stability, we are interested in whether the discrete system has a similar property. The standard setting is that we take the given problem, fix a finite time step ∆t, and let the recursion index n → ∞. The question is whether the numerical solution $y_n$ remains bounded under perturbations, again usually in the initial value.

In the best of worlds, numerical stability would follow from mathematical stability. However, this requires more. Thus the method must be stable in order to have convergence. The setting is different; here we fix an interval [0,T], and consider solving the problem on that interval using N steps ∆t = T/N. The question is whether the numerical solution remains bounded as per (8.22) when N → ∞. In effect, we ask that the accumulated error at time T remains bounded, independent of how many steps N we use to reach T. This is key to convergence, since the bound (8.22) contains the factor $\Delta t^p$, allowing us to make the accumulated error arbitrarily small by choosing N large enough.

In all three cases above, we ask that perturbations remain bounded as some parameter tends to infinity. This is stability, and it keeps recurring in various guises throughout all of numerical analysis, explaining the importance of stability theory.

This concludes our analysis of elementary methods for first order initial value problems. Advanced methods work with similar notions, but require a different approach to the construction of the methods. The two main contenders are Runge–Kutta and linear multistep methods.

Chapter 9
Runge–Kutta methods

We have seen that higher order methods offer significantly higher accuracy at a moderate extra cost. Here we shall explore a systematic approach to the construction of high order one-step methods. In Runge–Kutta methods the key idea is to sample the vector field f at several points in a single step, combining the results to obtain a high order end result in a single step. This entails using the samples to match as many terms in a Taylor series expansion of the solution as possible.

9.1 An elementary explicit RK method

Let us consider one of the simplest explicit Runge–Kutta methods, achieving second order convergence by sampling the vector field at two points per step. For clarity, we shall make the simplifying assumption that the differential equation is autonomous, i.e., the vector field does not depend on time t, but has the form

$$ \dot y = f(y); \qquad y(0) = y_0. $$

The following computational procedure is a simple RK method. We first compute the stage derivatives,

$$ Y'_1 = f(y_n) $$
$$ Y'_2 = f(y_n + \Delta t\, Y'_1). $$

These are samples of the vector field f at two points $Y_1$ and $Y_2$ near the solution trajectory. They are not derivatives in the true sense, since they are not functions of time, but only evaluations of f. For this reason we use a prime to denote the stage derivatives, while a dot represents the derivative ẏ of the solution y, which is a differentiable function.

The points $Y_1$ and $Y_2$ are referred to as the stage values, and are defined by

$$ Y_1 = y_n $$
$$ Y_2 = y_n + \Delta t\, Y'_1. $$

Thus it holds that $Y'_i = f(Y_i)$. We finally update the solution according to

$$ y_{n+1} = y_n + \frac{\Delta t}{2}\,(Y'_1 + Y'_2). \qquad (9.1) $$

This is an explicit method, since there is no need to solve nonlinear equations in the process. We shall see that it is a second order convergent method, which can be viewed as an explicit method that “emulates” the trapezoidal rule.

We shall compare this method to the second order convergent trapezoidal rule, starting from the same point $y_n$. It computes an update $y_{n+1}$, defined by

$$ y_{n+1} = y_n + \frac{\Delta t}{2}\,\bigl( f(y_n) + f(y_{n+1}) \bigr). $$

This method is implicit, which makes it relatively expensive. However, the explicit second order RK method (9.1) above has been constructed along similar lines, by replacing the vector field sample $f(y_{n+1})$ by $Y'_2$. To justify this operation, note that

$$ Y_2 = y_n + \Delta t\, f(y_n). $$

Thus the stage value $Y_2$ is simply an explicit Euler step starting from $y_n$, and it follows that $Y_2 - y_{n+1} = O(\Delta t^2)$, corresponding to the local error of the explicit Euler method in a single step. It follows that

$$ Y'_2 = f(Y_2) = f(y_{n+1} + O(\Delta t^2)) = f(y_{n+1}) + O(\Delta t^2), $$

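provided that f is Lipschitz (the standard assumption). Therefore (9.1) computes

$$ y_{n+1}^{\mathrm{RK}} = y_n + \frac{\Delta t}{2}\,(Y'_1 + Y'_2) = y_{n+1} + O(\Delta t^3), $$

where $y_{n+1}^{\mathrm{RK}}$ denotes the result of (9.1) and $y_{n+1}$ on the right-hand side is the update computed by the trapezoidal rule.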

This implies that the RK method (9.1) is a second order explicit “workaround” producing nearly the same result as the trapezoidal rule. It is cheaper, but the benefit comes at a price: the explicit RK method does not have the excellent stability properties of the implicit trapezoidal rule.
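The second order convergence of (9.1) can be observed experimentally. The following is a minimal sketch, assuming the scalar test problem ẏ = −y² with y(0) = 1 (exact solution y(t) = 1/(1+t)); the problem and the helper names are illustrative choices, not taken from the text.

```python
import math

def heun_step(f, y, dt):
    """One step of (9.1): sample f at Y1 = y_n and Y2 = y_n + dt*Y1', then average."""
    dY1 = f(y)
    dY2 = f(y + dt * dY1)
    return y + 0.5 * dt * (dY1 + dY2)

def solve(f, y0, T, n):
    """Take n equal steps of size T/n from y0 and return the value at time T."""
    y, dt = y0, T / n
    for _ in range(n):
        y = heun_step(f, y, dt)
    return y

f = lambda y: -y * y              # assumed test problem y' = -y^2, y(0) = 1
exact = 1.0 / (1.0 + 1.0)         # exact solution y(t) = 1/(1+t) at T = 1
e1 = abs(solve(f, 1.0, 1.0, 50) - exact)
e2 = abs(solve(f, 1.0, 1.0, 100) - exact)
print(math.log2(e1 / e2))         # observed order of convergence
```

Halving the step size should reduce the error by roughly a factor of four, so the printed value is expected to be close to 2.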

Runge–Kutta methods are divided into explicit (ERK) and implicit (IRK) methods. The classical ERK methods date back to 1895, when Carl Runge and Wilhelm Kutta developed some of the ERK methods that are still in use today. Runge and Kutta were involved in mathematics and its applications in physics and fluid mechanics, and realized that the construction of accurate and stable computational methods for initial value problems required serious mathematical thought. The modern theory of RK methods, however, was largely initiated and developed by John C. Butcher of the University of Auckland, New Zealand, around 1965. Because of the complexity of RK theory, this area is still a lively field of research.


9.2 Explicit Runge–Kutta methods

So far we have based the construction of our methods on replacing the derivative by a finite difference. By contrast, the idea behind Runge–Kutta methods is closely related to interpolation theory. Integrating the differential equation ẏ = f(t,y) over the interval $[t_n, t_{n+1}]$, we obtain

$$ y(t_{n+1}) - y(t_n) = \int_{t_n}^{t_{n+1}} f(\tau, y(\tau))\,d\tau. \qquad (9.2) $$

This transforms the differential equation into an integral equation. Here we have returned to the nonautonomous formulation ẏ = f(t,y), although we will shortly go back to the autonomous formulation.

As the integral cannot be evaluated exactly, it needs to be approximated numerically. The standard approach to numerical integration is to sample the integrand f(τ,y(τ)) at a number of points, and approximate the integral by a weighted sum,

$$ \int_{t_n}^{t_{n+1}} f(\tau, y(\tau))\,d\tau \approx \Delta t \sum_{i=1}^{s} b_i Y'_i, $$

where $Y'_i = f(\tau_i, Y_i)$. The accuracy of this approximation depends on how we construct the stage values $Y_i$ and the corresponding stage derivatives $Y'_i$.

This means that (9.2) generates a numerical integration formula

$$ y_{n+1} - y_n = \Delta t \sum_{i=1}^{s} b_i Y'_i, $$

and the difficulty lies in the construction of the stage values $Y_i$, which generate the stage derivatives, $Y'_i = f(\tau_i, Y_i)$. There are many ways to choose the stage values, corresponding to the many different ways in which integrals can be approximated by discrete sums. However, because the stage values and derivatives have to be generated sequentially in an explicit computation, we must have

$$ Y'_1 = f(t_n, y_n). $$

Subsequent stage values are obtained by advancing a local solution based on linear combinations of previously computed stage derivatives. Thus

$$ Y'_2 = f(t_n + c_2\Delta t,\; y_n + a_{2,1}\Delta t\, Y'_1) $$
$$ Y'_3 = f(t_n + c_3\Delta t,\; y_n + a_{3,1}\Delta t\, Y'_1 + a_{3,2}\Delta t\, Y'_2) $$
$$ \ldots $$

This means that an explicit Runge–Kutta method for the problem ẏ = f(t,y) is given by the computational process

$$ Y'_i = f\Bigl(t_n + c_i\Delta t,\; y_n + \sum_{j=1}^{i-1} a_{i,j}\,\Delta t\, Y'_j\Bigr), \qquad (9.3) $$

together with the updating formula

$$ y_{n+1} = y_n + \Delta t \sum_{i=1}^{s} b_i Y'_i. \qquad (9.4) $$

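The method is determined by three sets of parameters, the nodes $c_i$, forming a vector c, the weights $b_i$ forming a vector b, and the matrix A with coefficients $a_{i,j}$. These are usually arranged in the Butcher tableau,

$$
\begin{array}{c|cccc}
0 & 0 & 0 & \cdots & 0 \\
c_2 & a_{2,1} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
c_s & a_{s,1} & a_{s,2} & \cdots & 0 \\ \hline
 & b_1 & b_2 & \cdots & b_s
\end{array}
\qquad \text{or} \qquad
\begin{array}{c|c}
c & A \\ \hline
 & b^T
\end{array}
$$

For explicit RK methods the coefficient matrix A is strictly lower triangular. In implicit RK methods, this is no longer the case.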

In the sequel, we shall use the simplifying assumption

$$ c_i = \sum_{j=1}^{s} a_{i,j}. \qquad (9.5) $$

This means that the nodes are determined by the row sums of the coefficient matrix A. We then only need to consider the autonomous initial value problem ẏ = f(y) to derive order and stability conditions. While it is possible to construct RK methods that do not satisfy the simplifying assumption, such methods are never used in practice, since important invariance properties are lost. Thus all state-of-the-art Runge–Kutta methods, including the original methods of 1895, satisfy the simplifying assumption.

With the simplifying assumption, we can describe the RK process as

1. Compute the i th stage value $Y_i = y_n + \Delta t \sum_{j=1}^{i-1} a_{i,j} Y'_j$
2. Sample the vector field to compute the i th stage derivative $Y'_i = f(Y_i)$
3. After computing all stage derivatives, update $y_{n+1} = y_n + \Delta t \sum_{i=1}^{s} b_i Y'_i$

In this process, we note that stage derivatives are always multiplied by the time step ∆t. Thus the process should be viewed as computing stage values $Y_i$ and scaled stage derivatives, $\Delta t\,Y'_i$. The latter quantity is then computed from the vector field by $\Delta t\,Y'_i = \Delta t\, f(Y_i)$.
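The three-step process translates almost literally into code. The following is a minimal sketch for a scalar autonomous problem, assuming the tableau is passed as a list of rows A and a list of weights b; the function name `erk_step` is a hypothetical choice made here.

```python
def erk_step(f, y, dt, A, b):
    """One step of an s-stage explicit RK method for autonomous y' = f(y).

    A is the strictly lower triangular coefficient matrix (list of rows),
    b the list of weights. Scalar y is assumed in this sketch.
    """
    s = len(b)
    dY = []                                            # stage derivatives Y_i'
    for i in range(s):
        # 1. stage value from previously computed stage derivatives
        Yi = y + dt * sum(A[i][j] * dY[j] for j in range(i))
        # 2. sample the vector field
        dY.append(f(Yi))
    # 3. update
    return y + dt * sum(b[i] * dY[i] for i in range(s))

# Coefficients of the two-stage method (9.1)
A = [[0.0, 0.0],
     [1.0, 0.0]]
b = [0.5, 0.5]
y1 = erk_step(lambda y: -y, 1.0, 0.1, A, b)
print(y1)
```

Because A is strictly lower triangular, each stage only uses stage derivatives that are already available, which is what makes the method explicit.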


9.3 Taylor series expansions and elementary differentials

In the derivation of RK methods, we need to match terms in the Taylor series expansions of the method’s updating formula and the expansion of the exact solution. Because RK methods by construction employ nested evaluations of the vector field f when the stage derivatives $Y'_i$ are computed, the Taylor series expansions are more complicated than otherwise. The standard approach is to express the Taylor series in terms of the function f and its derivatives, rather than in terms of the solution y and its derivatives.

Below, we shall use a short-hand notation for function values and their derivatives. Since all expansions are around $t_n$ in time and $y_n$ in “space,” we let $y, \dot y, \ddot y, \ldots$ denote the values $y(t_n), \dot y(t_n), \ddot y(t_n)$, etc.

Likewise, we let f denote f(y), while $f_y = \partial f/\partial y$ denotes the Jacobian matrix with elements $\partial f_i/\partial y_j$. Note that, due to the simplifying assumption, we only have to consider the differential equation ẏ = f(y); without that assumption, we would have had to consider ẏ = f(t,y), requiring two partial derivatives, $f_y$ and $f_t$. As will become clear, the simplifying assumption saves a lot of work, without losing generality.

We also need higher order derivatives, and $f_{yy}$ denotes the 3-tensor with elements $\partial^2 f_i/\partial y_j \partial y_k$. Having three indices, it is a multilinear operator producing a vector if applied to two vectors. Thus $f_{yy} f f$ is a vector, which can be computed successively, from $(f_{yy} f) f$, where the product of the 3-tensor $f_{yy}$ and the vector (1-tensor) f produces the matrix (2-tensor) $f_{yy} f$. This, in turn, can then multiply the vector f in the usual way; thus $(f_{yy} f) f$ is simply a matrix-vector multiply.

If this sounds complicated, the worst is yet to come. Now, since

$$ y(t + \Delta t) = y + \Delta t\,\dot y + \frac{\Delta t^2}{2}\,\ddot y + \frac{\Delta t^3}{6}\,\dddot y + \ldots $$

we will have to convert derivatives of y into derivatives of f using the differential equation ẏ = f. By the chain rule it follows that

$$ \ddot y = f_y \dot y = f_y f. $$

Then, again using the chain rule,

$$ \dddot y = \frac{d}{dt}\, f_y f = (f_{yy}\dot y) f + f_y f_y f = (f_{yy} f) f + f_y f_y f. $$

Before computing higher order derivatives, we introduce some simplifications and short-hand notation. First, in an expression of the type $f_{yy} g h$, the order of the two arguments g and h does not matter; thus $(f_{yy} g) h = (f_{yy} h) g$. Second, in an expression like $f_{yy} f f$, where the 3-tensor has two identical arguments f, we will allow the (slightly abusive) short-hand notation $f_{yy} f^2$, although $f^2$ does not represent a power (which has no meaning for a vector) but only that the argument occurs twice. Finally, in an expression of the type $f_y f_y f$ the Jacobian multiplies f twice, justifying the notation $f_y^2 f$; this is indeed a power of $f_y$. We then have

$$ \dddot y = f_{yy} f^2 + f_y^2 f. $$

From here it is all uphill. Thus, applying the chain rule to each term of the third derivative, observing the rules of the simplified notation, we have

$$ \ddddot y = f_{yyy} f^3 + f_{yy}(f_y f) f + (f_{yy} f) f_y f + (f_{yy} f) f_y f + f_y (f_{yy} f^2) + f_y^3 f. $$

Noting that three terms are identical, omitting superfluous parentheses, we collect terms to get

$$ \ddddot y = f_{yyy} f^3 + 3 f_{yy} f f_y f + f_y f_{yy} f^2 + f_y^3 f. $$

The terms appearing in the total derivatives are called elementary differentials, and every total derivative is composed of several elementary differentials. Unfortunately the number of elementary differentials grows exponentially with the order of the derivative, soon making the expressions very complicated. Nevertheless, collecting the expressions obtained so far, we have

$$
\begin{aligned}
\dot y &= f \\
\ddot y &= f_y f \\
\dddot y &= f_{yy} f^2 + f_y^2 f \\
\ddddot y &= f_{yyy} f^3 + 3 f_{yy} f f_y f + f_y f_{yy} f^2 + f_y^3 f,
\end{aligned}
$$

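so that the Taylor series is

$$
y(t + \Delta t) = y + \Delta t\, f + \frac{\Delta t^2}{2!}\, f_y f + \frac{\Delta t^3}{3!}\bigl( f_{yy} f^2 + f_y^2 f \bigr) + \frac{\Delta t^4}{4!}\bigl( f_{yyy} f^3 + 3 f_{yy} f f_y f + f_y f_{yy} f^2 + f_y^3 f \bigr) + \ldots
$$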
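As a sanity check of these expressions, consider a scalar example chosen here purely for illustration, $f(y) = y^2$, so that $f_y = 2y$, $f_{yy} = 2$ and $f_{yyy} = 0$. The elementary differentials then give

$$
\begin{aligned}
\dot y &= f = y^2, \\
\ddot y &= f_y f = 2y^3, \\
\dddot y &= f_{yy} f^2 + f_y^2 f = 2y^4 + 4y^4 = 6y^4, \\
\ddddot y &= f_{yyy} f^3 + 3 f_{yy} f f_y f + f_y f_{yy} f^2 + f_y^3 f = 0 + 12y^5 + 4y^5 + 8y^5 = 24y^5.
\end{aligned}
$$

This agrees with direct differentiation, since $\frac{d}{dt}(2y^3) = 6y^2\dot y = 6y^4$ and $\frac{d}{dt}(6y^4) = 24y^3\dot y = 24y^5$.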

The next step is to compare this Taylor series to that of the RK method’s updating formula. To exemplify, let us consider a two-stage ERK. Its Butcher tableau is

$$
\begin{array}{c|cc}
0 & 0 & 0 \\
c_2 & a_{21} & 0 \\ \hline
 & b_1 & b_2
\end{array}
$$

where $c_2 = a_{21}$. Thus a two-stage ERK has three free parameters, $a_{21}$, $b_1$ and $b_2$, which can be chosen to maximize the order of the method. The method advances the solution a single step by

$$ y_{n+1} = y_n + \Delta t\,\bigl( b_1 f(y_n) + b_2 f(y_n + a_{21}\Delta t\, f(y_n)) \bigr). \qquad (9.6) $$

We now need to select $a_{21}$, $b_1$ and $b_2$ so as to match as many terms as possible to the previous Taylor series expansion. To this end, we need to expand (9.6) in a Taylor series as well. Fortunately, since by assumption $f(y_n) = \dot y(t_n)$, there is only one term to expand. Noting that $a_{21}\Delta t\, f(y_n)$ is “small,” we have

$$ f(y_n + a_{21}\Delta t\, f(y_n)) = f + a_{21}\Delta t \cdot f_y f + O(\Delta t^2). $$

Thus we can assemble the Taylor series from (9.6) to obtain

$$ y_{n+1} = y + \Delta t\,(b_1 f + b_2 f) + b_2 a_{21}\Delta t^2 \cdot f_y f + O(\Delta t^3). $$

Matching terms, we achieve second order if

$$ b_1 + b_2 = 1 $$
$$ b_2 c_2 = \frac{1}{2}, $$

where we have preferred to let the parameter $c_2$ replace $a_{21}$. Now, since we have three parameters but only two equations, the solution is not unique; there is a one-parameter family of two-stage RK methods of order p = 2. Choosing $\beta = b_2$ as the free parameter, the family can be written

$$
\begin{array}{c|cc}
0 & 0 & 0 \\
\frac{1}{2\beta} & \frac{1}{2\beta} & 0 \\ \hline
 & 1-\beta & \beta
\end{array}
$$

where we typically choose $\beta \in [\tfrac{1}{2}, 1]$. The methods at the endpoints are perhaps the best known. Thus, at β = 1/2 we obtain Heun’s method,

$$
\begin{array}{c|cc}
0 & 0 & 0 \\
1 & 1 & 0 \\ \hline
 & 1/2 & 1/2
\end{array}
$$

corresponding to the simple ERK we introduced to emulate the trapezoidal rule by an explicit method. On the other hand, taking β = 1, we get the modified Euler method,

$$
\begin{array}{c|cc}
0 & 0 & 0 \\
1/2 & 1/2 & 0 \\ \hline
 & 0 & 1
\end{array}
$$

This procedure looks complicated, and it is. For higher order methods we “only” need to allow more stage values and parameters, and expand the Taylor series to include more terms. This quickly gets out of hand, and the construction of RK methods needs special techniques. Thus the elementary differentials are usually represented by graphs (“trees”), and order conditions are derived using group theory, combinatorics and symmetry properties. That is the easy part. The hard part is that, as we saw above, the equations for determining the method coefficients are nonlinear algebraic equations, which means that it is often difficult to solve for the method parameters, and that computer software (numerical and symbolic) is usually needed in this step. In the light of this, it is quite remarkable that Runge managed to derive methods of order p = 4.


9.4 Order conditions and convergence

To specify order conditions, we take the derivations one step further and consider a three-stage ERK with Butcher tableau

$$
\begin{array}{c|ccc}
0 & 0 & 0 & 0 \\
c_2 & a_{21} & 0 & 0 \\
c_3 & a_{31} & a_{32} & 0 \\ \hline
 & b_1 & b_2 & b_3
\end{array}
$$

Here we proceed in the same fashion as in the two-stage case, expanding the updating formula and comparing to a third order Taylor series. We now have six parameters, and as there are four elementary differentials for total derivatives not exceeding order three, we will again have a non-unique solution, now with two degrees of freedom (six parameters minus four conditions). Two different one-parameter families (therefore not exhaustive) are

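$$
\begin{array}{c|ccc}
0 & 0 & 0 & 0 \\
2/3 & 2/3 & 0 & 0 \\
2/3 & \frac{2}{3} - \frac{1}{4\beta} & \frac{1}{4\beta} & 0 \\ \hline
 & \frac{1}{4} & \frac{3}{4} - \beta & \beta
\end{array}
\qquad \text{and} \qquad
\begin{array}{c|ccc}
0 & 0 & 0 & 0 \\
2/3 & 2/3 & 0 & 0 \\
0 & -\frac{1}{4\beta} & \frac{1}{4\beta} & 0 \\ \hline
 & \frac{1}{4} - \beta & \frac{3}{4} & \beta
\end{array}
$$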

Of these two, the first is typically preferred since the $b_i$ coefficients are positive if β < 3/4. The best known method from this family is the Nystrom method,

$$
\begin{array}{c|ccc}
0 & 0 & 0 & 0 \\
2/3 & 2/3 & 0 & 0 \\
2/3 & 0 & 2/3 & 0 \\ \hline
 & 1/4 & 3/8 & 3/8
\end{array}
$$

A method of order p = 3, but not coming from any one of these two families, is the classical RK3 method

$$
\begin{array}{c|ccc}
0 & 0 & 0 & 0 \\
1/2 & 1/2 & 0 & 0 \\
1 & -1 & 2 & 0 \\ \hline
 & 1/6 & 2/3 & 1/6
\end{array}
$$

The procedure can be continued to find higher order methods, but it soon gets out of hand. More importantly, the degrees of freedom are lost, since the number of elementary differentials (order conditions) grows faster than the number of free parameters. In fact, already for order four, we have eight order conditions and ten parameters; hence still some degrees of freedom. But for order five, there are already 17 order conditions (elementary differentials), but only 15 parameters to choose in a five-stage explicit RK method. Unsurprisingly, there is no explicit Runge–Kutta method of order five with five stages; six stages are necessary.

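To give an impression of the increasing complexity, we consider a four-stage ERK with Butcher tableau

$$
\begin{array}{c|cccc}
0 & 0 & 0 & 0 & 0 \\
c_2 & a_{21} & 0 & 0 & 0 \\
c_3 & a_{31} & a_{32} & 0 & 0 \\
c_4 & a_{41} & a_{42} & a_{43} & 0 \\ \hline
 & b_1 & b_2 & b_3 & b_4
\end{array}
$$

After expanding in the relevant Taylor series, matching elementary differentials, we obtain the order conditions for order p = 1,

$$ f: \quad \sum_i b_i = 1. $$

In addition, for order p = 2,

$$ f_y f: \quad \sum_i b_i c_i = \frac{1}{2}. $$

For order p = 3, it must also hold that

$$
\begin{aligned}
f_{yy} f^2 &: \quad \sum_i b_i c_i^2 = \frac{1}{3} \\
f_y^2 f &: \quad \sum_{i,j} b_i a_{ij} c_j = \frac{1}{6}.
\end{aligned}
$$

For order p = 4, we further require

$$
\begin{aligned}
f_{yyy} f^3 &: \quad \sum_i b_i c_i^3 = \frac{1}{4} \\
f_{yy} f f_y f &: \quad \sum_{i,j} b_i c_i a_{ij} c_j = \frac{1}{8} \\
f_y f_{yy} f^2 &: \quad \sum_{i,j} b_i a_{ij} c_j^2 = \frac{1}{12} \\
f_y^3 f &: \quad \sum_{i,j,k} b_i a_{ij} a_{jk} c_k = \frac{1}{24}.
\end{aligned}
$$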

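By now, it is clear that the order conditions cannot be solved for the free coefficients without hard work. Even so, in 1895 Runge found the classical RK4 method,

$$
\begin{array}{c|cccc}
0 & 0 & 0 & 0 & 0 \\
1/2 & 1/2 & 0 & 0 & 0 \\
1/2 & 0 & 1/2 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\ \hline
 & 1/6 & 1/3 & 1/3 & 1/6
\end{array}
$$

corresponding to the computational scheme

$$
\begin{aligned}
Y'_1 &= f(t_n, y_n) \\
Y'_2 &= f(t_n + \Delta t/2,\; y_n + \Delta t\, Y'_1/2) \\
Y'_3 &= f(t_n + \Delta t/2,\; y_n + \Delta t\, Y'_2/2) \\
Y'_4 &= f(t_n + \Delta t,\; y_n + \Delta t\, Y'_3) \\
y_{n+1} &= y_n + \frac{\Delta t}{6}\bigl( Y'_1 + 2Y'_2 + 2Y'_3 + Y'_4 \bigr).
\end{aligned}
$$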

This method is still in wide use today, demonstrating how powerful it is. It is linked to Simpson’s rule for the computation of integrals. This is a fourth order method for computing integrals of a function of t. At considerable extra work, Runge was able to extend this idea to ordinary differential equations.
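The order conditions are easy to verify for a given tableau, even if they are hard to solve. The following is a minimal sketch using exact rational arithmetic; the function name `order_residuals` is a hypothetical choice, and the nodes are computed from the row sums of A as in the simplifying assumption (9.5).

```python
from fractions import Fraction as F

def order_residuals(A, b):
    """Residuals (lhs - rhs) of the eight order conditions up to p = 4.

    The nodes c_i are taken as the row sums of A, as in (9.5).
    """
    s = len(b)
    c = [sum(row) for row in A]
    R = range(s)
    return [
        sum(b) - 1,                                                          # f
        sum(b[i] * c[i] for i in R) - F(1, 2),                               # f_y f
        sum(b[i] * c[i] ** 2 for i in R) - F(1, 3),                          # f_yy f^2
        sum(b[i] * A[i][j] * c[j] for i in R for j in R) - F(1, 6),          # f_y^2 f
        sum(b[i] * c[i] ** 3 for i in R) - F(1, 4),                          # f_yyy f^3
        sum(b[i] * c[i] * A[i][j] * c[j] for i in R for j in R) - F(1, 8),   # f_yy f f_y f
        sum(b[i] * A[i][j] * c[j] ** 2 for i in R for j in R) - F(1, 12),    # f_y f_yy f^2
        sum(b[i] * A[i][j] * A[j][k] * c[k]
            for i in R for j in R for k in R) - F(1, 24),                    # f_y^3 f
    ]

rk4_A = [[0, 0, 0, 0], [F(1, 2), 0, 0, 0], [0, F(1, 2), 0, 0], [0, 0, 1, 0]]
rk4_b = [F(1, 6), F(1, 3), F(1, 3), F(1, 6)]
nystrom_A = [[0, 0, 0], [F(2, 3), 0, 0], [0, F(2, 3), 0]]
nystrom_b = [F(1, 4), F(3, 8), F(3, 8)]

rk4_res = order_residuals(rk4_A, rk4_b)
nystrom_res = order_residuals(nystrom_A, nystrom_b)
print(rk4_res)        # residuals of the classical RK4 method
print(nystrom_res)    # residuals of the three-stage Nystrom method
```

For the classical RK4 tableau all eight residuals vanish, while the Nystrom method satisfies the four conditions up to order three but not all of the order four conditions.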

There is a full theory for how to construct explicit Runge–Kutta methods, and today, there are ERK methods in use of orders up to p = 8. Such methods are difficult to construct, but due to their extremely high accuracy, they are useful for high precision computations. To illustrate how difficult it is to construct such methods, we note that for an s-stage ERK method, there are s(s+1)/2 coefficients, as seen in the following table.

Stages s        1   2   3   4   5   6   7   8   9  10  11
Coefficients    1   3   6  10  15  21  28  36  45  55  66

Table 9.1 Stages and coefficients in ERK methods. The number of free parameters in an s-stage ERK method is s(s+1)/2

But there is an overwhelming number of order conditions to achieve order p. Comparing a given order to the number of order conditions, and the minimum number of stages s required to achieve the requested order, we have the following table.

Order p        1   2   3   4   5   6   7    8    9    10
Conditions     1   2   4   8  17  37  85  200  486  1205
Min stages     1   2   3   4   6   7   9   11    ?     ?

Table 9.2 Necessary number of stages in ERK methods. The number of order conditions in ERK methods grows prohibitively fast

Thus it is currently not known how many stages are minimally needed to construct methods of orders 9 and 10. Naturally, this might not seem to be important. However, what is more surprising is that it has been possible to construct (say) 7-stage ERK methods of order p = 6; such methods are subject to no less than 37 order conditions due to the large number of elementary differentials, yet they only have 28 parameters. Thus 28 parameters must satisfy 37 equations; while this seems unlikely, it is nevertheless possible. It is even more remarkable that an order p = 8 method (with 200 order conditions) can be constructed using only 11 stages and 66 parameters (Butcher, 1985).

The one thing that is simple about Runge–Kutta methods is that every RK method satisfying the first order condition $\sum b_i = 1$ (consistency) is convergent. This follows from the methods being one-step methods. All consistent one-step methods are convergent, and have global error bounds similar to those derived for the Euler methods. The possible break-down of convergence only happens in multistep methods, or in connection with time-dependent partial differential equations.

The construction of explicit Runge–Kutta methods is far from trivial and requires considerable expertise. Luckily, there are several high performing methods to choose from, also with built-in error estimators. We will return to how these methods are made adaptive, using automatic step size control to meet a prescribed accuracy requirement.

9.5 Implicit Runge–Kutta methods

Implicit RK methods have a Butcher tableau

$$
\begin{array}{c|c}
c & A \\ \hline
 & b^T
\end{array}
$$

where the matrix A is no longer required to be strictly lower triangular. The order conditions are exactly the same as in the ERK case, as the parameters are again determined by matching coefficients in the Taylor series expansions of the solution y(t + ∆t) and the updating formula for $y_{n+1}$.

Let us consider a general two-stage IRK method. It has a Butcher tableau

$$
\begin{array}{c|cc}
c_1 & a_{11} & a_{12} \\
c_2 & a_{21} & a_{22} \\ \hline
 & b_1 & b_2
\end{array}
$$

This corresponds to the equations

$$
\begin{aligned}
Y'_1 &= f(y_n + \Delta t\,(a_{11} Y'_1 + a_{12} Y'_2)) \\
Y'_2 &= f(y_n + \Delta t\,(a_{21} Y'_1 + a_{22} Y'_2)) \\
y_{n+1} &= y_n + \Delta t\,(b_1 Y'_1 + b_2 Y'_2),
\end{aligned}
$$


and it becomes evident that the first two stage equations form a single nonlinear equation system which must be solved in order to advance the solution.

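The aim of using IRK methods is to increase the stability region so as to obtain methods useful for solving stiff differential equations, for which ∆t L[f] ≫ 1. However, because it is expensive to use a general IRK with a full A matrix, only a few such methods are ever used. Among them we find the well-known 3-stage order p = 5 Radau IIa method, also known as RADAU5. Its Butcher tableau is

$$
\begin{array}{c|ccc}
\frac{2}{5} - \frac{\sqrt{6}}{10} & \frac{11}{45} - \frac{7\sqrt{6}}{360} & \frac{37}{225} - \frac{169\sqrt{6}}{1800} & -\frac{2}{225} + \frac{\sqrt{6}}{75} \\[4pt]
\frac{2}{5} + \frac{\sqrt{6}}{10} & \frac{37}{225} + \frac{169\sqrt{6}}{1800} & \frac{11}{45} + \frac{7\sqrt{6}}{360} & -\frac{2}{225} - \frac{\sqrt{6}}{75} \\[4pt]
1 & \frac{4}{9} - \frac{\sqrt{6}}{36} & \frac{4}{9} + \frac{\sqrt{6}}{36} & \frac{1}{9} \\[4pt] \hline
 & \frac{4}{9} - \frac{\sqrt{6}}{36} & \frac{4}{9} + \frac{\sqrt{6}}{36} & \frac{1}{9}
\end{array}
$$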

Due to its structure, the computations can be arranged in a quite efficient way, and RADAU5 is probably the most powerful IRK method available today for solving stiff problems.

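For IRK methods, the maximum order when using s stages is p = 2s. This is achieved by the Gauss–Legendre methods. They are also useful for stiff problems. Due to symmetry their stability regions coincide with C−; thus the methods are A-stable. However, the computations associated with these methods are more complicated than for RADAU5. The latter method also has better damping when ∆t λ → −∞; this is often an advantage in practice. The Gauss–Legendre method of order six has the Butcher tableau:

$$
\begin{array}{c|ccc}
\frac{1}{2} - \frac{\sqrt{15}}{10} & \frac{5}{36} & \frac{2}{9} - \frac{\sqrt{15}}{15} & \frac{5}{36} - \frac{\sqrt{15}}{30} \\[4pt]
\frac{1}{2} & \frac{5}{36} + \frac{\sqrt{15}}{24} & \frac{2}{9} & \frac{5}{36} - \frac{\sqrt{15}}{24} \\[4pt]
\frac{1}{2} + \frac{\sqrt{15}}{10} & \frac{5}{36} + \frac{\sqrt{15}}{30} & \frac{2}{9} + \frac{\sqrt{15}}{15} & \frac{5}{36} \\[4pt] \hline
 & \frac{5}{18} & \frac{4}{9} & \frac{5}{18}
\end{array}
$$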

The methods discussed so far have a full A matrix. However, one can achieve sufficiently improved stability properties already when the matrix A is lower triangular, with a nonzero diagonal. Such methods are referred to as DIRK methods (diagonally implicit Runge–Kutta methods). A further restriction is to demand that the diagonal elements are all equal. Such methods are known as SDIRK methods (singly diagonally implicit RK), and have the Butcher tableau (in the 2-stage case)

$$
\begin{array}{c|cc}
\gamma & \gamma & 0 \\
c_2 & a_{21} & \gamma \\ \hline
 & b_1 & b_2
\end{array}
$$

with equations

$$
\begin{aligned}
Y'_1 &= f(y_n + \gamma\,\Delta t\, Y'_1) \\
Y'_2 &= f(y_n + a_{21}\Delta t\, Y'_1 + \gamma\,\Delta t\, Y'_2) \\
y_{n+1} &= y_n + \Delta t\,(b_1 Y'_1 + b_2 Y'_2).
\end{aligned}
$$

The first two equations form a system, but the equations are decoupled. Thus, they can be rewritten

$$
\begin{aligned}
(I - \gamma\,\Delta t\, f)(y_n + \gamma\,\Delta t\, Y'_1) &= y_n \\
(I - \gamma\,\Delta t\, f)(y_n + a_{21}\Delta t\, Y'_1 + \gamma\,\Delta t\, Y'_2) &= y_n + a_{21}\Delta t\, Y'_1,
\end{aligned}
$$

and we see that they will share the same Jacobian matrix, $I - \gamma\Delta t\, f_y$. After the first equation has been solved, the right-hand side of the second equation can be computed, and the second equation solved. Thus we now have two separate, sequential systems to solve, and this requires less work than simultaneously solving two coupled equations. This substantially reduces the computational effort and complexity compared to IRK methods with a full A matrix.
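The sequential structure can be illustrated with a scalar sketch in which each stage equation is solved by Newton's method. Everything specific here is an assumption made for illustration: the helper names, the test problem ẏ = −50y, and the coefficients γ = 1 − √2/2, a₂₁ = b₁ = 1 − γ, b₂ = γ, which are one common choice satisfying the order conditions for p = 2 and are not taken from the text.

```python
import math

def solve_stage(f, fy, rhs, gamma_dt, guess, tol=1e-12, max_iter=50):
    """Newton iteration for the scalar stage equation Y - gamma_dt * f(Y) = rhs."""
    Y = guess
    for _ in range(max_iter):
        # The derivative 1 - gamma_dt * f_y(Y) has the same form in both stages.
        delta = (Y - gamma_dt * f(Y) - rhs) / (1.0 - gamma_dt * fy(Y))
        Y -= delta
        if abs(delta) < tol:
            break
    return Y

def sdirk2_step(f, fy, y, dt, gamma, a21, b1, b2):
    """One step of a two-stage SDIRK method; the stages are solved one after the other."""
    Y1 = solve_stage(f, fy, y, gamma * dt, y)        # first stage value
    rhs2 = y + a21 * dt * f(Y1)                      # known once stage 1 is solved
    Y2 = solve_stage(f, fy, rhs2, gamma * dt, Y1)    # second stage value
    return y + dt * (b1 * f(Y1) + b2 * f(Y2))

gamma = 1.0 - math.sqrt(2.0) / 2.0                   # assumed coefficient choice
lam, dt = -50.0, 0.1                                 # assumed stiff-ish test problem
y1 = sdirk2_step(lambda y: lam * y, lambda y: lam, 1.0, dt,
                 gamma, 1.0 - gamma, 1.0 - gamma, gamma)
print(y1)
```

Each call to `solve_stage` involves only one unknown stage value, in contrast to a full IRK method where all stages would have to be solved for simultaneously.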

Let us now turn to some of the most elementary IRK methods. The implicit Euler method is a one-stage method with Butcher tableau

$$
\begin{array}{c|c}
1 & 1 \\ \hline
 & 1
\end{array}
$$

and equations

$$
\begin{aligned}
Y'_1 &= f(y_n + \Delta t\, Y'_1) \\
y_{n+1} &= y_n + \Delta t\, Y'_1.
\end{aligned}
$$

Here we make the important observation that $y_{n+1} = Y_1$, which means that we can rewrite the first equation above as $Y'_1 = f(y_{n+1})$ with updating formula $y_{n+1} = y_n + \Delta t\, f(y_{n+1})$. This is recognized as the implicit Euler method.

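The implicit midpoint method is also a one-stage method, with the Butcher tableau

$$
\begin{array}{c|c}
1/2 & 1/2 \\ \hline
 & 1
\end{array}
$$

together with the equations

$$
\begin{aligned}
Y'_1 &= f(y_n + \Delta t\, Y'_1/2) \\
y_{n+1} &= y_n + \Delta t\, Y'_1.
\end{aligned}
$$

The first equation now needs a minor transformation,

$$ y_n + \Delta t\, Y'_1/2 = y_n + \frac{\Delta t}{2}\, f(y_n + \Delta t\, Y'_1/2), $$

implying that $y_n + \Delta t\, Y'_1/2$ is the solution to the equation

$$ \Bigl(I - \frac{\Delta t}{2} f\Bigr)(y_n + \Delta t\, Y'_1/2) = y_n. $$

Hence $y_n + \Delta t\, Y'_1/2 = (I - \frac{\Delta t}{2} f)^{-1}(y_n)$. Since $y_{n+1} = y_n + \Delta t\, Y'_1 = 2(y_n + \Delta t\, Y'_1/2) - y_n$, the updating formula becomes

$$
\begin{aligned}
y_{n+1} &= 2\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n) - y_n \\
&= 2\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n) - \Bigl(I - \frac{\Delta t}{2} f\Bigr)\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n) \\
&= \Bigl(2I - \Bigl(I - \frac{\Delta t}{2} f\Bigr)\Bigr)\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n) \\
&= \Bigl(I + \frac{\Delta t}{2} f\Bigr)\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n).
\end{aligned}
$$

Hence the implicit midpoint method can be expressed as

$$ y_{n+1} = \Bigl(I + \frac{\Delta t}{2} f\Bigr)\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n). \qquad (9.7) $$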

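Let us now compare to the trapezoidal rule. This is a two-stage, one-step IRK method of order p = 2, whose Butcher tableau is

$$
\begin{array}{c|cc}
0 & 0 & 0 \\
1 & 1/2 & 1/2 \\ \hline
 & 1/2 & 1/2
\end{array}
$$

with equations

$$
\begin{aligned}
Y'_1 &= f(y_n) \\
Y'_2 &= f(y_n + \Delta t\, Y'_1/2 + \Delta t\, Y'_2/2) \\
y_{n+1} &= y_n + \frac{\Delta t}{2}\, Y'_1 + \frac{\Delta t}{2}\, Y'_2.
\end{aligned}
$$

Here we note that $Y'_2 = f(y_{n+1})$, so it immediately follows that

$$ y_{n+1} = y_n + \frac{\Delta t}{2}\,\bigl( f(y_n) + f(y_{n+1}) \bigr), $$

which is recognized as the trapezoidal rule in the form it was previously discussed. Moreover, the latter formula can be rearranged to obtain

$$ \Bigl(I - \frac{\Delta t}{2} f\Bigr)(y_{n+1}) = \Bigl(I + \frac{\Delta t}{2} f\Bigr)(y_n), $$

resulting in the formula

$$ y_{n+1} = \Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}\Bigl(I + \frac{\Delta t}{2} f\Bigr)(y_n). \qquad (9.8) $$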


Thus, comparing (9.8) to (9.7) we see that the difference between the trapezoidal rule and the implicit midpoint method is that the two operators, corresponding to one half-step explicit Euler method and one half-step implicit Euler method, are commuted. In the case of a linear constant coefficient system ẏ = Ay this does not matter since the factors commute, but in the nonlinear case there is a difference. This also makes a difference in terms of stability.
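The operator ordering in (9.7) and (9.8) can be made concrete in a scalar sketch. The implicit half-steps are solved here by plain fixed-point iteration, which is an illustrative choice that only works because the step size is small; the helper names and the two test vector fields are likewise assumptions.

```python
def fixed_point(g, x0, iters=200):
    """Plain fixed-point iteration x <- g(x); adequate here because dt is small."""
    x = x0
    for _ in range(iters):
        x = g(x)
    return x

def implicit_midpoint_step(f, y, dt):
    """(9.7): half-step implicit Euler first, then half-step explicit Euler."""
    Y1 = fixed_point(lambda Y: y + 0.5 * dt * f(Y), y)
    return Y1 + 0.5 * dt * f(Y1)

def trapezoidal_step(f, y, dt):
    """(9.8): half-step explicit Euler first, then half-step implicit Euler."""
    w = y + 0.5 * dt * f(y)
    return fixed_point(lambda Y: w + 0.5 * dt * f(Y), y)

lin = lambda y: -2.0 * y        # linear vector field: the two factors commute
non = lambda y: -y ** 3         # nonlinear vector field: they do not

d_lin = implicit_midpoint_step(lin, 1.0, 0.1) - trapezoidal_step(lin, 1.0, 0.1)
d_non = implicit_midpoint_step(non, 1.0, 0.1) - trapezoidal_step(non, 1.0, 0.1)
print(d_lin, d_non)
```

For the linear vector field the two methods agree up to rounding, while for the nonlinear one they produce slightly different results.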

All of the three elementary IRK methods above are SDIRK methods. In addition, the trapezoidal rule has a first explicit stage, and is therefore sometimes referred to as an ESDIRK method. It has one more property of significance; the second stage $Y_2$ is identical to the output $y_{n+1}$, which, of course, becomes the first stage on the next step. Methods with this property are called “first same as last” (FSAL). The FSAL property can be used to economize the computations.

9.6 Stability of Runge–Kutta methods

To assess stability, we use the linear test equation, y = λy, with y(0) = 1. Let usinvestigate the stability of the classical RK4 method, with equations

Y ′1 = f (tn,yn)

Y ′2 = f (tn +∆ t/2,yn +∆ tY ′1/2)Y ′3 = f (tn +∆ t/2,yn +∆ tY ′2/2)Y ′4 = f (tn +∆ t,yn +∆ tY ′3)

yn+1 = yn +∆ t6(Y ′1 +2Y ′2 +2Y ′3 +Y ′4

).

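Since $f(t,y) = \lambda y$, we obtain

$$
\begin{aligned}
\Delta t\, Y'_1 &= \Delta t\lambda\, y_n \\
\Delta t\, Y'_2 &= \Delta t\lambda\,(y_n + \Delta t\, Y'_1/2) = \Delta t\lambda\,(y_n + \Delta t\lambda\, y_n/2) = \bigl(\Delta t\lambda + (\Delta t\lambda)^2/2\bigr)\, y_n \\
\Delta t\, Y'_3 &= \Delta t\lambda\,(y_n + \Delta t\, Y'_2/2) = \cdots = \bigl(\Delta t\lambda + (\Delta t\lambda)^2/2 + (\Delta t\lambda)^3/4\bigr)\, y_n \\
\Delta t\, Y'_4 &= \Delta t\lambda\,(y_n + \Delta t\, Y'_3) = \cdots = \bigl(\Delta t\lambda + (\Delta t\lambda)^2 + (\Delta t\lambda)^3/2 + (\Delta t\lambda)^4/4\bigr)\, y_n.
\end{aligned}
$$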

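We now assemble these expressions in the updating formula, to get

$$ y_{n+1} = y_n + \frac{\Delta t}{6}\bigl(Y'_1 + 2Y'_2 + 2Y'_3 + Y'_4\bigr) = \Bigl(1 + \Delta t\lambda + \frac{(\Delta t\lambda)^2}{2} + \frac{(\Delta t\lambda)^3}{6} + \frac{(\Delta t\lambda)^4}{24}\Bigr)\, y_n. $$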

Thus, when the classical RK4 method is applied to the linear test equation, the updating formula is a polynomial of degree four, $y_{n+1} = P(\Delta t\lambda)\, y_n$, where

$$ P(z) = 1 + z + \frac{z^2}{2} + \frac{z^3}{6} + \frac{z^4}{24} \approx e^z. $$

Fig. 9.1 Stability region of the classical RK4 method. Colors indicate values $z \in \mathbb{C}$ where $|P(z)| \in [0,1]$. Dark blue areas reveal the location of the four zeros of the polynomial $P(z)$

This is no coincidence; since $\dot y = \lambda y$ implies $y(t+\Delta t) = e^{\Delta t\lambda}\, y(t)$, the polynomial $P(z)$ must necessarily approximate $e^z$ accurately.
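The derivation can be checked numerically. The following is a minimal sketch, assuming an arbitrary illustrative value of λ and a step size Δt = 0.1 (neither is taken from the text): one RK4 step applied to the test equation should reproduce P(Δtλ) up to rounding, and P(z) should be close to the exponential for small |z|.

```python
import cmath

def rk4_step(f, t, y, dt):
    """One step of the classical RK4 method for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt * k1 / 2)
    k3 = f(t + dt / 2, y + dt * k2 / 2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def stability_polynomial(z):
    """P(z) = 1 + z + z^2/2 + z^3/6 + z^4/24."""
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24

lam = complex(-1.0, 2.0)   # illustrative value of lambda (an assumption)
dt = 0.1                   # illustrative step size (an assumption)
z = dt * lam

# One step from y(0) = 1 on the linear test equation y' = lam * y.
y1 = rk4_step(lambda t, y: lam * y, 0.0, 1.0, dt)

diff_step_vs_poly = abs(y1 - stability_polynomial(z))
diff_poly_vs_exp = abs(stability_polynomial(z) - cmath.exp(z))
print(diff_step_vs_poly)   # rounding level
print(diff_poly_vs_exp)    # roughly |z|^5 / 120
```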

The same thing will happen for any explicit RK method. Thus for every ERK method the updating formula is of the form $y_{n+1} = P(\Delta t\lambda)\, y_n$, where the polynomial $P$ is characteristic for each method. Since

$$ |y_{n+1}| \le |P(\Delta t\lambda)| \cdot |y_n|, $$

it follows that the method is numerically stable if $|P(z)| \le 1$. For this reason, $P(z)$ is referred to as the stability function of the method. The stability region of the RK4 method is plotted in Figure 9.1.
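As an illustration of the stability condition, the point where the RK4 stability region ends on the negative real axis can be located by bisection on P(x) = 1. This is only a sketch; the bracket [-3, -2] is an assumption, chosen because P(-2) < 1 < P(-3).

```python
def P(z):
    """Stability polynomial of the classical RK4 method."""
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24

def real_axis_boundary(lo=-3.0, hi=-2.0, tol=1e-12):
    """Bisection for the negative real x where P(x) = 1.

    Assumes P(hi) < 1 (stable side) and P(lo) > 1 (unstable side).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if P(mid) > 1:
            lo = mid      # mid is outside the stability region
        else:
            hi = mid      # mid is inside the stability region
    return (lo + hi) / 2

x_star = real_axis_boundary()
print(x_star)             # end point of the stability interval on the real axis
```

On the real axis P(x) is positive for the RK4 method, so |P(x)| ≤ 1 reduces to P(x) ≤ 1, which is why a sign test on P(x) - 1 suffices here.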

A similar procedure can be followed for implicit RK methods, with the only difference being that the stability function is then a rational function, $R(z) = P(z)/Q(z)$, with $P$ and $Q$ polynomials such that $R(z) \to e^z$ as $z \to 0$. An implicit RK method is A-stable if

$$ \operatorname{Re} z \le 0 \;\Rightarrow\; |R(z)| \le 1. $$

Since $R(z)$ must be bounded in all of the left half plane, it follows that $\deg Q \ge \deg P$. Obviously, no explicit method can be A-stable, since for every nonconstant polynomial, $|P(z)| \to \infty$ as $z \to \infty$.


To check for A-stability, we first check that $R(z)$ has no poles in the left half plane ($Q$ has no zeros in $\mathbb{C}^-$). Then, invoking the maximum principle, one needs to verify that $|R(i\omega)| \le 1$ for all $\omega \in \mathbb{R}$, i.e., an IRK method without poles in $\mathbb{C}^-$ is A-stable if it is stable on the imaginary axis.
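The imaginary-axis test can be illustrated for two elementary methods. The sketch below assumes the standard stability functions R(z) = 1/(1 - z) for the implicit Euler method and R(z) = (1 + z/2)/(1 - z/2) for the trapezoidal rule; both have their only pole in the right half plane. Sampling a finite grid of ω is of course an illustration, not a proof.

```python
def R_implicit_euler(z):
    """Stability function of the implicit Euler method."""
    return 1 / (1 - z)

def R_trapezoidal(z):
    """Stability function of the trapezoidal rule."""
    return (1 + z / 2) / (1 - z / 2)

def max_modulus_on_imaginary_axis(R, omega_max=100.0, samples=2001):
    """Largest |R(i*omega)| on a uniform grid of omega in [-omega_max, omega_max]."""
    worst = 0.0
    for j in range(samples):
        omega = -omega_max + 2 * omega_max * j / (samples - 1)
        worst = max(worst, abs(R(complex(0.0, omega))))
    return worst

max_ie = max_modulus_on_imaginary_axis(R_implicit_euler)
max_tr = max_modulus_on_imaginary_axis(R_trapezoidal)
print(max_ie, max_tr)     # neither value should exceed 1 (up to rounding)
```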

9.7 Embedded Runge–Kutta methods

In general, the solution of an initial value problem changes character over time. For example, in the Prothero–Robinson problem

$$ \dot y = \lambda\,(y - \varphi(t)) + \dot\varphi(t) $$

there is a particular solution $\varphi(t)$ and homogeneous solutions $e^{\lambda t}$. These may be very different in character. Consequently, during the transient, the step size needs to be adapted to the homogeneous solution. But once the homogeneous solution has decayed and the solution is dominated by the particular solution $\varphi(t)$, the step size needs to be readapted to $\varphi(t)$. This is especially important in stiff problems, where the step size satisfying the error tolerance may vary by several orders of magnitude during the integration. Without adaptivity, the efficiency may suffer to the point of even making the integration impossible.

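To control the step size, we need an error estimate. This can be provided by embedded RK methods. A simple example is given by the classical RK3 and RK4 methods. Thus the RK3, with Butcher tableau

$$
\begin{array}{c|ccc}
0   & 0 & 0 & 0 \\
1/2 & 1/2 & 0 & 0 \\
1   & -1 & 2 & 0 \\
\hline
    & 1/6 & 2/3 & 1/6
\end{array}
$$

has the equations

$$
\begin{aligned}
Y'_1 &= f(y_n) \\
Y'_2 &= f(y_n + \Delta t\, Y'_1/2) \\
Z'_2 &= f(y_n - \Delta t\, Y'_1 + 2\Delta t\, Y'_2) \\
z_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y'_1 + 4Y'_2 + Z'_2\bigr).
\end{aligned}
$$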

This is a method of order p = 3. Noting that the first two stage derivatives are identical to those of the RK4 method, the two methods RK3 and RK4 can be embedded into the same computational scheme, based on the equations

$$
\begin{aligned}
Y'_1 &= f(y_n) \\
Y'_2 &= f(y_n + \Delta t\, Y'_1/2) \\
Z'_2 &= f(y_n - \Delta t\, Y'_1 + 2\Delta t\, Y'_2) \\
Y'_3 &= f(y_n + \Delta t\, Y'_2/2) \\
Y'_4 &= f(y_n + \Delta t\, Y'_3) \\
z_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y'_1 + 4Y'_2 + Z'_2\bigr) \\
y_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y'_1 + 2Y'_2 + 2Y'_3 + Y'_4\bigr).
\end{aligned}
$$

Thus, with one extra function evaluation for computing $Z'_2$, we can obtain both a result of order p = 3 (i.e., $z_{n+1}$) and the fourth order result $y_{n+1}$.

This embedded RK pair can be displayed in terms of a common Butcher tableau

$$
\begin{array}{c|ccccc}
    & Y'_1 & Y'_2 & Z'_2 & Y'_3 & Y'_4 \\
\hline
0   & 0 & 0 & 0 & 0 & 0 \\
1/2 & 1/2 & 0 & 0 & 0 & 0 \\
1   & -1 & 2 & 0 & 0 & 0 \\
1/2 & 0 & 1/2 & 0 & 0 & 0 \\
1   & 0 & 0 & 0 & 1 & 0 \\
\hline
    & 1/6 & 2/3 & 1/6 & 0 & 0 \\
    & 1/6 & 1/3 & 0 & 1/3 & 1/6
\end{array}
$$

where headlines in terms of stage derivatives have been included for clarity.

Now, we have seen that there is a dramatic difference in accuracy between a method of order p and one of order p+1. For practical purposes, the more accurate result can then be regarded as “exact” when compared to the lower order result. This means that we can here estimate the local error by the difference

$$ l_{n+1} = z_{n+1} - y_{n+1} = O(\Delta t^4), $$

where we remind the reader that the local error is of order $O(\Delta t^{p+1})$ for an order $p$ method. Thus, the statement above is that $z_{n+1}$ has a local error of magnitude $O(\Delta t^4)$.

This is then an estimate of the local error in the lower order result, i.e., in $z_{n+1}$. However, the most common approach, referred to as local extrapolation, only uses the error estimate $l_{n+1}$ for controlling the step size, while retaining the best available result (here of order p = 4) as the output. This implies that a somewhat questionable approach is used, regarding $l_{n+1}$ as a tool for adjusting the step size $\Delta t$.
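The embedded scheme and its error estimate can be sketched in a few lines. The function name `rk34_step`, the scalar test problem and the step sizes are illustrative assumptions. The step returns the fourth order result for output (local extrapolation) together with the difference between the two results; halving Δt should reduce that difference by a factor close to 2^4.

```python
def rk34_step(f, y, dt):
    """One step of the embedded RK34 pair for an autonomous problem y' = f(y).

    Returns (y_new, z_new, err): the order 4 result, the order 3 result,
    and the local error estimate err = z_new - y_new.
    """
    k1 = f(y)
    k2 = f(y + dt * k1 / 2)
    kz = f(y - dt * k1 + 2 * dt * k2)     # the one extra stage, Z'_2
    k3 = f(y + dt * k2 / 2)
    k4 = f(y + dt * k3)
    z_new = y + dt / 6 * (k1 + 4 * k2 + kz)
    y_new = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y_new, z_new, z_new - y_new

f = lambda y: -y                          # illustrative test problem (an assumption)
_, _, err_h = rk34_step(f, 1.0, 0.1)
_, _, err_h2 = rk34_step(f, 1.0, 0.05)
ratio = err_h / err_h2
print(ratio)                              # close to 16, consistent with O(dt^4)
```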

The embedded pair above is referred to as the RK34 method, meaning that it advances the solution by a method of order p = 4 (the RK4 method) and uses a built-in error estimator of order p = 3. This is a simple but relevant example. To see what advanced methods look like, we can consider the low order Shampine–Bogacki SB23 method,


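$$
\begin{array}{c|cccc}
0   & 0 & 0 & 0 & 0 \\
1/2 & 1/2 & 0 & 0 & 0 \\
3/4 & 0 & 3/4 & 0 & 0 \\
1   & 2/9 & 1/3 & 4/9 & 0 \\
\hline
    & 2/9 & 1/3 & 4/9 & 0 \\
    & 7/24 & 1/4 & 1/3 & 1/8
\end{array}
$$

where the first row of b coefficients corresponds to order p = 3 and the second to order p = 2.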

MATLAB offers the Dormand–Prince DOPRI45 solver ode45,

$$
\begin{array}{c|ccccccc}
0    & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5  & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
3/10 & 3/40 & 9/40 & 0 & 0 & 0 & 0 & 0 \\
4/5  & 44/45 & -56/15 & 32/9 & 0 & 0 & 0 & 0 \\
8/9  & 19372/6561 & -25360/2187 & 64448/6561 & -212/729 & 0 & 0 & 0 \\
1    & 9017/3168 & -355/33 & 46732/5247 & 49/176 & -5103/18656 & 0 & 0 \\
1    & 35/384 & 0 & 500/1113 & 125/192 & -2187/6784 & 11/84 & 0 \\
\hline
     & 5179/57600 & 0 & 7571/16695 & 393/640 & -92097/339200 & 187/2100 & 1/40 \\
     & 35/384 & 0 & 500/1113 & 125/192 & -2187/6784 & 11/84 & 0
\end{array}
$$

Here the first row of b coefficients corresponds to order p = 4 and the second to order p = 5. The 5th order method has the FSAL property.

Although there are further ERK methods, the current state-of-the-art method of a similar order is the Cash–Karp CK45 method,

$$
\begin{array}{c|cccccc}
0    & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5  & 1/5 & 0 & 0 & 0 & 0 & 0 \\
3/10 & 3/40 & 9/40 & 0 & 0 & 0 & 0 \\
3/5  & 3/10 & -9/10 & 6/5 & 0 & 0 & 0 \\
1    & -11/54 & 5/2 & -70/27 & 35/27 & 0 & 0 \\
7/8  & 1631/55296 & 175/512 & 575/13824 & 44275/110592 & 253/4096 & 0 \\
\hline
     & 2825/27648 & 0 & 18575/48384 & 13525/55296 & 277/14336 & 1/4 \\
     & 37/378 & 0 & 250/621 & 125/594 & 0 & 512/1771
\end{array}
$$

where the first row of b coefficients corresponds to order p = 4 and the second to order p = 5.

The coefficients of these methods clearly suggest that it is a nontrivial task to construct high performance ERK methods. IRK methods are extended in a similar way to embedded methods, but there are added difficulties obtaining good error estimators for IRK methods. For stiff problems, MATLAB offers the IRK solver ode23s.


9.8 Adaptive step size control

Let us consider a method of order $p$ taking a single step of size $\Delta t$ from a point $y_n = y(t_n)$, where $y(t)$ is a local solution to the differential equation. We then obtain

$$ R_p : y_n \mapsto y_{n+1}, $$

where the local error is

$$ l_{p+1,n+1} = y_{n+1} - y(t_{n+1}) = O(\Delta t^{p+1}). $$

In the analysis of the Euler methods, we saw that global error bounds at time $T$ are of the form

$$ \|e_N\| \lesssim C \cdot \max_n \frac{\|l_{p+1,n}\|}{\Delta t} \cdot \frac{e^{T M[f]} - 1}{M[f]}. $$

This bound shows that in order to control the magnitude of the global error, which is $O(\Delta t^p)$, we need to control the local error per unit step (EPUS) $\|l_{p+1,n}\|/\Delta t$, which is also $O(\Delta t^p)$. Thus, if step sizes can be varied along the solution trajectory so that $\|l_{p+1,n}\|/\Delta t \approx \mathrm{TOL}$, where TOL is the local error tolerance, then

$$ \|e_N\| \lesssim C \cdot \mathrm{TOL} \cdot \frac{e^{T M[f]} - 1}{M[f]}. $$

This means that an EPUS strategy will make the global error proportional to TOL, irrespective of how many steps are needed to reach the end-point $T$. A solver that achieves this is called tolerance proportional.

While EPUS is common in nonstiff solvers, it is not always used in connection with stiff problems, simply because stiffness typically requires step sizes to vary by several orders of magnitude. EPUS would then cause the solver to spend a disproportionate computational effort on resolving initial transients, especially since the system’s damping will make these errors decay rather than accumulate. Therefore, in stiff solvers, it is common to only control the error per step (EPS), which means that one attempts to adjust the step size along the trajectory so that $\|l_{p+1,n}\| \approx \mathrm{TOL}$. Due to damping, it is still possible to obtain highly accurate results.

Now, in order to control the local error in practice, we need a local error estimate obtained from an embedded method. Let us therefore assume that we have two methods, of orders $p$ and $p-1$, respectively, both starting from a point $y_n = y(t_n)$, producing two different results,

$$
\begin{aligned}
R_p &: y_n \mapsto y_{n+1} \\
R_{p-1} &: y_n \mapsto z_{n+1}.
\end{aligned}
$$

The local errors are

$$
\begin{aligned}
l_{p+1,n+1} &= y_{n+1} - y(t_{n+1}) = O(\Delta t^{p+1}) \\
l_{p,n+1} &= z_{n+1} - y(t_{n+1}) = O(\Delta t^{p}),
\end{aligned}
$$

and the embedded method produces a local error estimate

$$ l_{p,n+1} = z_{n+1} - y_{n+1} = O(\Delta t^{p}), $$

where the asymptotic behavior of the error estimate is evidently determined by the lower order method. In advancing the solution, however, it is common to choose the higher order method, $R_p$, so that no accuracy is “wasted.” Because this method has a global error $O(\Delta t^p)$, one could expect to achieve a performance similar to tolerance proportionality by controlling the magnitude of $\|l_{p,n+1}\|$ directly.

The basic assumption in step size control is that the norm of the local error estimate, $w_n = \|l_{p,n}\|$, is accurately represented by the asymptotic model

$$ w_n = \varphi_{n-1}\, \Delta t_{n-1}^{\,p}, $$

where $p$ is the order of the error estimator. (The power would change if EPUS is used instead of EPS.) The step-size indexing used above is defined by

$$ \Delta t_{n-1} = t_n - t_{n-1}, $$

and the principal error function $\varphi_{n-1}$ is assumed to vary slowly along the solution trajectory. The step size is continually adjusted to keep $w_n$ (approximately) constant, equal to the requested tolerance TOL. The objective is to keep the scaled control error

$$ c_n = \Bigl(\frac{\mathrm{TOL}}{w_n}\Bigr)^{1/p} \tag{9.9} $$

close to 1 for all $n$. This is achieved by selecting the next step-size $\Delta t_n$ as

$$ \Delta t_n = \rho_{n-1}\, \Delta t_{n-1}, \tag{9.10} $$

where $\rho_{n-1}$ is the step size ratio. Taking logarithms, this multiplicative recursion is transformed into a simple summation,

$$ \log \Delta t_n = \log \Delta t_{n-1} + \log \rho_{n-1}, $$

referred to as a discrete integrating controller, as it adjusts $\log \Delta t_n$ by a summation of step size ratios, which in turn are determined by past scaled control errors. The simplest choice is to take $\rho_{n-1} = c_n$. Inserted into (9.10), it yields

$$ \Delta t_n = \Bigl(\frac{\mathrm{TOL}}{w_n}\Bigr)^{1/p} \Delta t_{n-1}. \tag{9.11} $$

This elementary controller is used with various restrictions in most conventional codes. It is a single-input single-output feedback control system, known as a deadbeat controller as it immediately tries to compensate any deviation from the target tolerance in a single correction. Unfortunately, this approach has several shortcomings and is prone to overreact to variations in the error estimator, especially if the numerical method operates near instability.
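The deadbeat property can be seen on an idealized model. The sketch below assumes that the error estimate follows the asymptotic model w = φ·Δt^p exactly with a constant φ; the numerical values of φ, p and TOL are illustrative. Under that assumption a single application of (9.11) brings the error estimate onto the tolerance.

```python
def elementary_controller(dt_prev, w, tol, p):
    """Deadbeat step size update (9.11): dt_n = (TOL / w_n)^(1/p) * dt_{n-1}."""
    return (tol / w) ** (1.0 / p) * dt_prev

# Idealized error model w = phi * dt^p with constant phi (an assumption).
phi, p, tol = 3.0, 4, 1e-6
dt0 = 0.1
w0 = phi * dt0**p                       # estimate obtained with the old step
dt1 = elementary_controller(dt0, w0, tol, p)
w1 = phi * dt1**p                       # estimate obtained with the new step
print(dt1, w1 / tol)                    # w1 / tol is 1 up to rounding
```

In a real computation φ varies from step to step and the estimate is noisy, which is where the overreaction mentioned above comes from.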

However, control theory offers many alternatives, and a more advanced approach is to process the scaled control errors by applying a recursive digital filter for the construction of $\rho_{n-1}$. A filter is a linear difference equation (multistep method) designed to regularize the step size sequence, eliminating “noise,” while still making the step size track the error, which is kept close to the preset tolerance. The general multiplicative form of a two-step filter is

$$ \rho_{n-1} = c_n^{\gamma_1}\, c_{n-1}^{\gamma_2}\, \rho_{n-2}^{-\kappa_2}. \tag{9.12} $$

The filter coefficients $\gamma_1, \gamma_2$ and $\kappa_2$ are chosen to produce a smooth step size sequence while maintaining overall stability. The filter coefficients must satisfy specific order and stability conditions, and can be selected to match method and problem classes. A well designed controller keeps the step-size ratios $\rho_{n-1}$ close to 1. The same filter coefficients can be used irrespective of method order.

The full step size controller consists of the filter recursion (with its two-step character taking back values into account) followed by a simple integrator. Thus

$$
\begin{aligned}
\rho_{n-1} &= c_n^{\gamma_1}\, c_{n-1}^{\gamma_2}\, \rho_{n-2}^{-\kappa_2} \\
\Delta t_n &= \rho_{n-1}\, \Delta t_{n-1}
\end{aligned}
$$

controls the step size during the integration. By taking logarithms, we see that this step size control mechanism consists of two linear difference equations for the logarithmic quantities,

$$
\begin{aligned}
\log \rho_{n-1} + \kappa_2 \log \rho_{n-2} &= \gamma_1 \log c_n + \gamma_2 \log c_{n-1} \\
\log \Delta t_n &= \log \Delta t_{n-1} + \log \rho_{n-1}.
\end{aligned}
$$

This can therefore be analyzed by standard techniques from the theory of linear difference equations.

In computational practice, however, the multiplicative form is preferred. Thus the combination of the filter and the integrator can be assembled into a multiplicative form similar to that of the elementary controller, to yield

$$ \Delta t_n = \Bigl(\frac{\mathrm{TOL}}{w_n}\Bigr)^{\gamma_1/p} \Bigl(\frac{\mathrm{TOL}}{w_{n-1}}\Bigr)^{\gamma_2/p} \Bigl(\frac{\Delta t_{n-1}}{\Delta t_{n-2}}\Bigr)^{-\kappa_2} \Delta t_{n-1}. \tag{9.13} $$

Due to the similarity between (9.11) and (9.13), we see that it is straightforward to replace the elementary controller in an existing code by a specifically designed digital filter. The computations must, however, also be equipped with the usual protections against division by zero, etc.
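A direct transcription of (9.13) might look as follows. This is a sketch only: the function name, the argument order and the small floor `tiny` used as protection against division by zero are assumptions, and a production code would typically add limits on the step size ratio as well.

```python
def filtered_step(dt_prev, dt_prev2, w, w_prev, tol, p, gamma1, gamma2, kappa2):
    """Step size update (9.13) for a two-step controller.

    dt_prev, dt_prev2 : the two most recent step sizes
    w, w_prev         : the two most recent error estimate norms
    """
    tiny = 1e-300                                   # guard against w == 0 (an assumption)
    c_n = (tol / max(w, tiny)) ** (gamma1 / p)
    c_prev = (tol / max(w_prev, tiny)) ** (gamma2 / p)
    ratio_prev = (dt_prev / dt_prev2) ** (-kappa2)
    return c_n * c_prev * ratio_prev * dt_prev

# With (gamma1, gamma2, kappa2) = (1, 0, 0) the update reduces to (9.11).
dt_elem = filtered_step(0.1, 0.1, 3e-4, 1.0, 1e-6, 4, 1, 0, 0)

# With a filter having gamma1 = gamma2 = kappa2 = 1/4, the step size is left
# unchanged when the error estimate already sits on the tolerance.
dt_same = filtered_step(0.1, 0.1, 1e-6, 1e-6, 1e-6, 4, 0.25, 0.25, 0.25)
print(dt_elem, dt_same)
```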


Among the filter designs are standard integrating controllers and proportional-integral (PI) controllers. They also include PI lowpass filters that suppress step size oscillations. A list of selected filter coefficients of proven designs can be found in Table 9.3. The filter choice has little effect on the total number of steps as long as stability is maintained, but it does have a significant impact on computational stability. Thus, for ERK methods and nonstiff problems, the best choice is the PI3333 controller, while for IRK and stiff problems, the H211PI digital filter is preferable.

The reason why the same controller cannot be used with all methods is that in explicit methods the step size needs to have a stable, smooth behavior also when the step size becomes limited by numerical stability. For ERK methods, this can be shown to require $\gamma_2 < 0$. Meanwhile, in a lowpass filter suppressing $(-1)^n$ oscillations in the step size, it is necessary to have $\gamma_1 = \gamma_2$, and since the “integral gain” $\gamma_1 + \gamma_2$ must remain positive (although not too large) to make $w_n \to \mathrm{TOL}$, some of these criteria are conflicting. This means that we need different controllers to manage regularity and computational stability well. It also implies that little can be gained from varying the filter coefficients at random; the designs below are well tested and robust, yet still have to be properly selected with respect to methods and problem classes.

$$
\begin{array}{cccl}
\gamma_1 & \gamma_2 & \kappa_2 & \text{Designation} \\
\hline
1   & 0    & 0   & \text{Elementary} \\
1/3 & 0    & 0   & \text{Convolution} \\
2/3 & -1/3 & 0   & \text{PI3333} \\
1/6 & 1/6  & 0   & \text{H211PI} \\
1/b & 1/b  & 1/b & \text{H211b; } b \in [2,6]
\end{array}
$$

Table 9.3 Controllers and digital filters. A selection of filter coefficients for two-step step size controllers. The first controller is the elementary deadbeat controller, which tends to react too fast and induce step size oscillations. The next two are more benign and well suited to ERK methods and nonstiff problems, with a low integral gain of $\gamma_1 + \gamma_2 = 1/3$. The last two are digital filters suppressing step size oscillations ($\gamma_1 = \gamma_2$) in IRK methods applied to stiff or nonstiff problems

9.9 Stiffness detection

When discussing stiff problems, we defined the stiffness indicator in a linear system $\dot y = Jy$ as

$$ s[J] = \frac{m_2[J] + M_2[J]}{2}, $$

and we found that the problem is stiff if and only if $s[J] \ll -1$. As it is too expensive to compute this quantity along a solution we need a simpler stiffness detection procedure.


We recall that, for a given inner product and any two vectors $u$ and $v$, it holds that

$$ m_2[f] \le \frac{\langle u - v,\, f(u) - f(v)\rangle}{\|u - v\|_2^2} \le M_2[f]. $$

Rather than computing the average of the upper and lower bounds, we can obtain an inexpensive estimate by simply computing the inner product, provided that two different samples of the vector field can be computed at the same point in time. This is easily done in RK methods, since RK methods sample the vector field at several neighboring points.

To describe how this is done, let us consider the embedded RK34 pair,

$$
\begin{aligned}
Y'_1 &= f(t_n, y_n) \\
Y'_2 &= f(t_n + \Delta t/2,\; y_n + \Delta t\, Y'_1/2) \\
Z'_2 &= f(t_n + \Delta t,\; y_n - \Delta t\, Y'_1 + 2\Delta t\, Y'_2) \\
Y'_3 &= f(t_n + \Delta t/2,\; y_n + \Delta t\, Y'_2/2) \\
Y'_4 &= f(t_n + \Delta t,\; y_n + \Delta t\, Y'_3) \\
z_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y'_1 + 4Y'_2 + Z'_2\bigr) \\
y_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y'_1 + 2Y'_2 + 2Y'_3 + Y'_4\bigr).
\end{aligned}
$$

Putting $t_{n+1} = t_n + \Delta t_n$, we note that two of the five stage derivatives are

$$
\begin{aligned}
Z'_2 &= f(t_{n+1}, Z_2) \\
Y'_4 &= f(t_{n+1}, Y_4),
\end{aligned}
$$

where the stage values $Z_2$ and $Y_4$ are introduced to simplify the notation. Moreover, we note that the corresponding stage derivatives are both evaluated at time $t_{n+1}$. This allows us to estimate the stiffness indicator at time $t_{n+1}$ by

$$ s[f] \approx \frac{\langle Z_2 - Y_4,\, f(t_{n+1}, Z_2) - f(t_{n+1}, Y_4)\rangle}{\|Z_2 - Y_4\|_2^2} = \frac{\langle Z_2 - Y_4,\, Z'_2 - Y'_4\rangle}{\Delta t^2\, \|{-Y'_1} + 2Y'_2 - Y'_3\|_2^2}. $$

Since all quantities are generated in every single step taken by the RK34 method, the stiffness indicator can be estimated at a moderate extra cost of computing the inner product and norm as specified above.
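A minimal sketch of this estimate, using plain Python lists as vectors, is given below. The helper names and the diagonal test system with eigenvalues -1 and -1000 are illustrative assumptions. For a diagonal system the quotient is a weighted average of the eigenvalues, so the estimate must fall between them.

```python
def dot(u, v):
    """Euclidean inner product of two lists."""
    return sum(a * b for a, b in zip(u, v))

def axpy(y, *terms):
    """Return y + sum of c * v over the (c, v) pairs in terms, componentwise."""
    out = list(y)
    for c, v in terms:
        out = [o + c * vi for o, vi in zip(out, v)]
    return out

def stiffness_estimate(f, t, y, dt):
    """Estimate s[f] at t + dt from the RK34 stage values Z_2 and Y_4."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, axpy(y, (dt / 2, k1)))
    Z2 = axpy(y, (-dt, k1), (2 * dt, k2))
    kz = f(t + dt, Z2)                       # Z'_2
    k3 = f(t + dt / 2, axpy(y, (dt / 2, k2)))
    Y4 = axpy(y, (dt, k3))
    k4 = f(t + dt, Y4)                       # Y'_4
    du = [a - b for a, b in zip(Z2, Y4)]
    df = [a - b for a, b in zip(kz, k4)]
    return dot(du, df) / dot(du, du)

# Illustrative linear system y' = J y with J = diag(-1, -1000) (an assumption).
eigs = [-1.0, -1000.0]
f = lambda t, y: [lam * yi for lam, yi in zip(eigs, y)]
s = stiffness_estimate(f, 0.0, [1.0, 1.0], 1e-3)
print(s)                                     # dominated by the stiff eigenvalue
```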

Another, often better alternative, is to rearrange the computational sequence of the RK34 pair thus:

$$
\begin{aligned}
Y'_2 &= f(t_n + \Delta t/2,\; y_n + \Delta t\, Y'_1/2) \\
Z'_2 &= f(t_n + \Delta t,\; y_n - \Delta t\, Y'_1 + 2\Delta t\, Y'_2) \\
Y'_3 &= f(t_n + \Delta t/2,\; y_n + \Delta t\, Y'_2/2) \\
Y'_4 &= f(t_n + \Delta t,\; y_n + \Delta t\, Y'_3) \\
z_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y'_1 + 4Y'_2 + Z'_2\bigr) \\
y_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y'_1 + 2Y'_2 + 2Y'_3 + Y'_4\bigr) \\
Y'_1 &= f(t_{n+1}, y_{n+1}).
\end{aligned}
$$

Thus we now have the first stage derivative of the next step available, before we estimate the stiffness indicator, enabling us to form an approximation from $Y'_4$ and the new $Y'_1$, which are both computed at time $t_{n+1}$. Thus we obtain the estimate

$$ s[f] \approx \frac{\langle y_{n+1} - Y_4,\, f(t_{n+1}, y_{n+1}) - f(t_{n+1}, Y_4)\rangle}{\|y_{n+1} - Y_4\|_2^2} = \frac{\langle y_{n+1} - Y_4,\, Y'_1 - Y'_4\rangle}{\|y_{n+1} - Y_4\|_2^2}. $$

Similar approximations can be found (possibly after minor rearrangements of the computational process) for all embedded ERK methods.

The reason for computing the stiffness indicator is that if an adaptive ERK method is used and the differential equation starts exhibiting stiffness, the step size will be limited by stability. A well designed PI step size controller will then manage the situation, keeping the method stable. But the method will be inefficient, and it is valuable to have this situation diagnosed automatically. In a dissipative system (let us for simplicity consider $\dot y = Jy$ with $M_2[J] < 0$) the most common onset of stiffness is that the maximum negative eigenvalue $\lambda_k[\Delta t J]$ reaches the boundary of the stability region on the negative real axis. For example, in the RK4 method, we have seen that the stability region reaches out to −2.7 on the real axis. This will be identified by the stiffness detection, by observing that

$$ \Delta t \cdot s[J] \approx -2.7 $$

for several consecutive steps. By checking for this condition, an explicit solver can return the diagnostics that the problem ought to be solved using an implicit method intended for stiff problems instead.

In the construction of the estimate $s[f]$ it is desirable to use the “last” stage derivative as described above. Since RK methods use nested function evaluations, it is the last or the “most nested” function evaluation which gives rise to the elementary differential $f_y^4 f$ (in the case of four stages). In the linear case, this elementary differential will have the form $J^4 \cdot Jy = J^5 y$. Thus it is similar to a power iteration. As a power iteration converges to the dominant eigenvalue, which typically is the largest in magnitude negative eigenvalue, the stiffness detector will pick up the mode that is likely to limit the step size due to numerical stability.

Chapter 10
Linear Multistep methods

As opposed to Runge–Kutta methods, linear multistep methods include additional information about the vector field by approximating the differential equation by a difference equation. Two different approaches dominate.

For nonstiff problems, the differential equation $\dot y = f(t,y)$ is integrated to be represented by an integral equation,

$$ y(t_{n+k}) - y(t_{n+k-1}) = \int_{t_{n+k-1}}^{t_{n+k}} f(\tau, y(\tau))\, d\tau. $$

If an interpolation polynomial $\dot P(t) \approx f(t, y(t))$ can be constructed, the integral is approximated by a k-step numerical integration method, as a weighted sum. This yields schemes of the form

$$ y_{n+k} = y_{n+k-1} + \Delta t \cdot \sum_{j=0}^{k} \beta_j f(t_{n+j}, y_{n+j}), \tag{10.1} $$

where $\Delta t = t_{n+k} - t_{n+k-1}$ is a constant step size. The coefficients $\beta_j$ are determined by integrating polynomial basis functions, and the numerical solution is $y_{n+k} = P(t_{n+k})$. The approach is reminiscent of RK methods, which also approximate the integral by using several samples of the vector field $f$, but now these samples are “recycled” function values from previous steps, rather than obtained through nested evaluations of $f$ on the current step.

This corresponds to the Adams methods, introduced in the second half of the 19th century by Astronomer Royal John C. Adams. The methods were first used in the various attempts to locate an unknown planet (the discovery of Neptune in 1846). The methods proved so successful that they remain a standard computational technique to this very day. There are explicit and implicit Adams methods, known as Adams–Bashforth (AB) and Adams–Moulton (AM) methods, respectively. Both are only intended for nonstiff problems, although AM methods have better stability and significantly smaller errors. There are AM and AB methods of all orders.


For stiff problems, on the other hand, the differential equation is not converted into an integral equation. Instead, the derivative is discretized using a finite difference. Thus $\dot y = f(t,y)$ is replaced by a difference equation

$$ \sum_{j=0}^{k} \alpha_j y_{n+j} = \Delta t \cdot f(t_{n+k}, y_{n+k}), \tag{10.2} $$

where the step size $\Delta t = t_{n+k} - t_{n+k-1}$ is constant, and where

$$ \frac{1}{\Delta t} \sum_{j=0}^{k} \alpha_j y(t_{n+j}) = y'(t_{n+k}) + O(\Delta t^k). $$

These methods are known as backward differentiation formulas (BDF). They are designed to have large stability regions, covering most of the negative half plane, so as to make the BDF methods suitable for stiff problems. The methods were introduced in the early 1950s, and efficient software emerged in the early 1970s. Since then, these methods have remained among the most competitive approaches available for solving stiff initial value problems. The BDF family only contains methods of orders 1–6, but the 6th order method is not used in practice due to insufficient stability.

The Adams and BDF method families are the most common linear multistep methods, even though there are other methods that potentially could offer higher accuracy or better stability. One of the key advantages of the Adams and BDF methods is that both families are based on a local polynomial representation of the solution, making it comparatively simple to make the methods adaptive, using a varying time step $\Delta t$. Still, it is far more complicated than for RK methods.

To make the polynomial representation clear, Adams and BDF methods are distinguished as follows. In Adams methods we construct a polynomial $P(t)$ whose derivative $\dot P(t)$ interpolates past $f$ values. Since

$$ P(t_{n+k}) - P(t_{n+k-1}) = \int_{t_{n+k-1}}^{t_{n+k}} \dot P(t)\, dt, \tag{10.3} $$

we obtain a computable numerical approximation $y_{n+k} = P(t_{n+k})$.

In the BDF methods, on the other hand, we instead construct a polynomial $P(t)$ interpolating past $y$ values, requiring that this polynomial satisfies the differential equation at $t_{n+k}$, i.e.,

$$ \dot P(t_{n+k}) = f(t_{n+k}, P(t_{n+k})). \tag{10.4} $$

This is referred to as a collocation condition. Thus we see that both classes of methods discussed above are related to interpolation theory, which also makes it possible to extend the methods to variable step sizes.


10.1 Adams methods

Adams methods compute a sequence $y_n$ where $y_n \approx y(t_n)$, and where the corresponding samples of the vector field are $y'_n = f(t_n, y_n)$. These are denoted $y'_n$ instead of $\dot y_n$, since they are not derivatives of a function, but merely values of $f$ obtained near the exact solution trajectory.

The Adams–Bashforth methods are explicit. The simplest AB method is the one-step AB1, and is identical to the explicit Euler method. To derive the first distinct AB method, we turn to the two-step AB2 method. This is a method that takes $y'_n$ and $y'_{n+1}$ and computes

$$ y_{n+2} - y_{n+1} = \Delta t\,\bigl(\beta_1 y'_{n+1} + \beta_0 y'_n\bigr), $$

where we need to determine the two coefficients $\beta_0$ and $\beta_1$. This can be done in several different ways. In keeping with a polynomial representation, let

$$ \dot P(t) = y'_{n+1} \cdot \frac{t - t_n}{t_{n+1} - t_n} + y'_n \cdot \frac{t - t_{n+1}}{t_n - t_{n+1}} = y'_{n+1} \cdot \varphi_1(t) + y'_n \cdot \varphi_0(t), $$

where the basis functions $\varphi_j$ are Lagrange interpolation polynomials of degree 1, satisfying

$$ \varphi_j(t_{n+k}) = \delta_{jk} = \begin{cases} 1 & j = k \\ 0 & j \ne k. \end{cases} $$

Here $\delta_{jk}$ is the Kronecker delta. It follows that

$$ \dot P(t_n) = y'_n, \qquad \dot P(t_{n+1}) = y'_{n+1}, $$

showing that the polynomial $\dot P(t)$ interpolates the two derivative samples $y'_n$ (at $t_n$) and $y'_{n+1}$ (at $t_{n+1}$). We now have

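$$
\begin{aligned}
P(t_{n+2}) - P(t_{n+1}) &= \int_{t_{n+1}}^{t_{n+2}} \dot P(t)\, dt \\
&= y'_{n+1} \int_{t_{n+1}}^{t_{n+2}} \varphi_1(t)\, dt + y'_n \int_{t_{n+1}}^{t_{n+2}} \varphi_0(t)\, dt \\
&= \Delta t\,\bigl(\beta_1 y'_{n+1} + \beta_0 y'_n\bigr).
\end{aligned}
$$

Determining the coefficients, we have

$$ \beta_1 = \frac{1}{\Delta t} \int_{t_{n+1}}^{t_{n+2}} \varphi_1(t)\, dt = \frac{1}{\Delta t} \int_{t_{n+1}}^{t_{n+2}} \frac{t - t_n}{t_{n+1} - t_n}\, dt = \frac{1}{\Delta t^2} \left[\frac{(t - t_n)^2}{2}\right]_{t_{n+1}}^{t_{n+2}} = \frac{3}{2} $$

and

$$ \beta_0 = \frac{1}{\Delta t} \int_{t_{n+1}}^{t_{n+2}} \varphi_0(t)\, dt = \frac{1}{\Delta t} \int_{t_{n+1}}^{t_{n+2}} \frac{t - t_{n+1}}{t_n - t_{n+1}}\, dt = -\frac{1}{\Delta t^2} \left[\frac{(t - t_{n+1})^2}{2}\right]_{t_{n+1}}^{t_{n+2}} = -\frac{1}{2}. $$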


Relabeling the derivative samples $y'_n = f(t_n, y_n)$ etc., we have the constant step size form of the Adams–Bashforth AB2 method,

$$ y_{n+2} - y_{n+1} = \Delta t \left(\frac{3}{2} f(t_{n+1}, y_{n+1}) - \frac{1}{2} f(t_n, y_n)\right). $$
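The AB2 formula is easily turned into code. In the sketch below the function names, the test problem and the choice of taking the second starting value from the exact solution are illustrative assumptions. Comparing the global error at t = 1 for two step sizes gives a rough check that the method behaves as a second order method.

```python
import math

def ab2(f, t0, y0, y1, dt, n_steps):
    """Constant-step AB2 for y' = f(t, y), given y0 at t0 and y1 at t0 + dt."""
    f_prev = f(t0, y0)
    t, y = t0 + dt, y1
    for _ in range(n_steps):
        f_curr = f(t, y)
        y = y + dt * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr                 # recycle the function value on the next step
        t += dt
    return y

def error_at_one(n):
    """Global error at t = 1 for y' = -y, y(0) = 1, using n steps in total."""
    dt = 1.0 / n
    # Second starting value taken from the exact solution (an assumption).
    y_end = ab2(lambda t, y: -y, 0.0, 1.0, math.exp(-dt), dt, n - 1)
    return abs(y_end - math.exp(-1.0))

ratio = error_at_one(50) / error_at_one(100)
print(ratio)                            # close to 2^2 for a second order method
```

Note that only one new function evaluation is needed per step; the other sample of the vector field is reused from the previous step.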

The same process can be repeated for increasing polynomial degrees. The key is to interpolate the $f$-samples $y'_n, \ldots, y'_{n+k-1}$ by a polynomial $\dot P$ of degree $k-1$, constructed from the Lagrange basis $\varphi_0(t), \ldots, \varphi_{k-1}(t)$, and compute the method coefficients

$$ \beta_j = \frac{1}{\Delta t} \int_{t_{n+k-1}}^{t_{n+k}} \varphi_j(t)\, dt. \tag{10.5} $$

This results in the explicit k-step ABk method, of order p = k,

$$ y_{n+k} - y_{n+k-1} = \Delta t\,\bigl(\beta_{k-1} f(t_{n+k-1}, y_{n+k-1}) + \cdots + \beta_0 f(t_n, y_n)\bigr), \tag{10.6} $$

where the coefficients are given in Table 10.1.

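$$
\begin{array}{cc|ccccc}
\text{Steps } k & \text{Order } p & \beta_{k-1} & \beta_{k-2} & \beta_{k-3} & \beta_{k-4} & \beta_{k-5} \\
\hline
1 & 1 & 1 & & & & \\
2 & 2 & 3/2 & -1/2 & & & \\
3 & 3 & 23/12 & -16/12 & 5/12 & & \\
4 & 4 & 55/24 & -59/24 & 37/24 & -9/24 & \\
5 & 5 & 1901/720 & -2774/720 & 2616/720 & -1274/720 & 251/720
\end{array}
$$

Table 10.1 Adams–Bashforth method coefficients. The first five ABk methods are explicit, with coefficients $\beta_j$ normalized so that $\sum \beta_j = 1$. Note that AB1 is the explicit Euler method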
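The entries of Table 10.1 can be reproduced by evaluating (10.5) in exact rational arithmetic. The sketch below assumes a unit step size with nodes 0, 1, ..., k-1, so that the integration interval is [k-1, k]; the helper names are illustrative. Coefficients are returned in the order β_0, ..., β_{k-1}.

```python
from fractions import Fraction

def poly_mul(a, b):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_integral(a, lo, hi):
    """Exact integral of the polynomial a over [lo, hi]."""
    return sum(c * (Fraction(hi) ** (i + 1) - Fraction(lo) ** (i + 1)) / (i + 1)
               for i, c in enumerate(a))

def adams_bashforth(k):
    """Coefficients beta_0, ..., beta_{k-1} of ABk from (10.5), unit step size."""
    betas = []
    for j in range(k):
        basis = [Fraction(1)]               # Lagrange basis phi_j on nodes 0..k-1
        for m in range(k):
            if m != j:
                # multiply by the factor (t - m) / (j - m)
                basis = poly_mul(basis, [Fraction(-m, j - m), Fraction(1, j - m)])
        betas.append(poly_integral(basis, k - 1, k))
    return betas

print(adams_bashforth(2))
print(adams_bashforth(3))
```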

While explicit methods are inexpensive, implicit methods typically offer better stability. The procedure above can be used to construct the implicit Adams–Moulton methods. The simplest AM method is the implicit Euler method of order p = 1, but there is also another one-step AM1 method of order p = 2, the trapezoidal rule. The first multistep AM method is the two-step AM2 method,

$$ y_{n+2} - y_{n+1} = \Delta t\,\bigl(\beta_2 y'_{n+2} + \beta_1 y'_{n+1} + \beta_0 y'_n\bigr), $$

where we need to determine three coefficients $\beta_0, \beta_1$ and $\beta_2$. The method is implicit since the vector field sample $y'_{n+2} = f(t_{n+2}, y_{n+2})$ depends on $y_{n+2}$, which also appears in the left hand side.

For the AM2 method, we construct an interpolation polynomial

$$
\begin{aligned}
\dot P(t) &= y'_{n+2} \cdot \frac{(t - t_{n+1})(t - t_n)}{(t_{n+2} - t_{n+1})(t_{n+2} - t_n)} + y'_{n+1} \cdot \frac{(t - t_{n+2})(t - t_n)}{(t_{n+1} - t_{n+2})(t_{n+1} - t_n)} + y'_n \cdot \frac{(t - t_{n+2})(t - t_{n+1})}{(t_n - t_{n+2})(t_n - t_{n+1})} \\
&= y'_{n+2} \cdot \varphi_2(t) + y'_{n+1} \cdot \varphi_1(t) + y'_n \cdot \varphi_0(t),
\end{aligned}
$$

where the basis functions $\varphi_j$ are Lagrange interpolation polynomials of degree 2. The coefficients are determined in the same way as in the explicit case, by computing the integrals

$$ \beta_j = \frac{1}{\Delta t} \int_{t_{n+k-1}}^{t_{n+k}} \varphi_j(t)\, dt. $$

For example,

$$ \beta_2 = \frac{1}{\Delta t} \int_{t_{n+1}}^{t_{n+2}} \frac{(t - t_{n+1})(t - t_n)}{(t_{n+2} - t_{n+1})(t_{n+2} - t_n)}\, dt = \frac{5}{12}. $$

The form of the implicit k-step AMk method, of order p = k+1, is

$$ y_{n+k} - y_{n+k-1} = \Delta t\,\bigl(\beta_k f(t_{n+k}, y_{n+k}) + \cdots + \beta_0 f(t_n, y_n)\bigr), \tag{10.7} $$

where the coefficients are given in Table 10.2.

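$$
\begin{array}{cc|ccccc}
\text{Steps } k & \text{Order } p & \beta_k & \beta_{k-1} & \beta_{k-2} & \beta_{k-3} & \beta_{k-4} \\
\hline
1 & 1 & 1 & & & & \\
1 & 2 & 1/2 & 1/2 & & & \\
2 & 3 & 5/12 & 8/12 & -1/12 & & \\
3 & 4 & 9/24 & 19/24 & -5/24 & 1/24 & \\
4 & 5 & 251/720 & 646/720 & -264/720 & 106/720 & -19/720
\end{array}
$$

Table 10.2 Adams–Moulton method coefficients. The first five AMk methods are implicit, with coefficients $\beta_j$ normalized so that $\sum \beta_j = 1$. Note that there are two one-step methods, the implicit Euler method of order p = 1, and the trapezoidal rule of order p = 2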

Now, because the AM methods are implicit, we need to solve for $y_{n+k}$. In the AMk method, the equation for $y_{n+k}$ is

$$ y_{n+k} = \Delta t\, \beta_k f(t_{n+k}, y_{n+k}) + \psi, \tag{10.8} $$

where the vector $\psi$ only depends on past data,

$$ \psi = y_{n+k-1} + \Delta t\,\bigl(\beta_{k-1} f(t_{n+k-1}, y_{n+k-1}) + \cdots + \beta_0 f(t_n, y_n)\bigr). \tag{10.9} $$

The nonlinear equation (10.8) can in principle be solved using either fixed point iteration or Newton iteration. For fixed point iteration to converge, we need

$$ \Delta t\, \beta_k\, L[f] < 1, \tag{10.10} $$

where $L[f]$ is the Lipschitz constant of the vector field $f$ with respect to its second argument. Because the AMk methods are intended for nonstiff problems, the Lipschitz condition is usually acceptable. Selecting $\Delta t$ small enough to satisfy (10.10) is comparable to the step size restriction imposed by numerical stability, which also has to be satisfied.


But we still need an initial approximation $y^0_{n+k}$ to get the iteration started. This is obtained from the explicit ABk method; this is referred to as a predictor–corrector (PC) scheme, where the ABk method is the predictor (P), and the AMk is the corrector (C). The classical PC scheme is

$$
\begin{aligned}
y^0_{n+k} &= y_{n+k-1} + \Delta t\,\bigl(\beta_{k-1} f(t_{n+k-1}, y_{n+k-1}) + \cdots + \beta_0 f(t_n, y_n)\bigr) \\
y^{m+1}_{n+k} &= \Delta t\, \beta_k f(t_{n+k}, y^m_{n+k}) + \psi,
\end{aligned}
$$

where the $\beta_j$ coefficients in the first equation refer to the ABk predictor method, and $\beta_k$ to the AMk corrector method. The superscript $m$ refers to the iteration index, and the corrector is iterated “until convergence.” This means that the iteration is terminated when the remaining error in the AMk equation is acceptably small.
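For k = 2 the scheme can be sketched as follows, with the AB2 method as predictor and the AM2 method as corrector. The function name, the fixed number m of corrector iterations (instead of a convergence test), the test problem and the exact starting value are illustrative assumptions.

```python
import math

def pc_ab2_am2(f, t0, y0, y1, dt, n_steps, m=2):
    """AB2 predictor and AM2 corrector, with m fixed point iterations per step."""
    f_prev = f(t0, y0)
    t, y = t0 + dt, y1
    f_curr = f(t, y)                                            # E
    for step in range(n_steps):
        t_new = t + dt
        y_new = y + dt * (1.5 * f_curr - 0.5 * f_prev)          # P: AB2 predictor
        psi = y + dt * (8 * f_curr - f_prev) / 12               # past data of AM2
        for it in range(m):
            y_new = dt * 5 / 12 * f(t_new, y_new) + psi         # E, C: AM2 corrector
        f_prev, f_curr = f_curr, f(t_new, y_new)                # E at the accepted point
        t, y = t_new, y_new
    return y

def error_at_one(n):
    """Global error at t = 1 for y' = -y, y(0) = 1, using n steps in total."""
    dt = 1.0 / n
    y_end = pc_ab2_am2(lambda t, y: -y, 0.0, 1.0, math.exp(-dt), dt, n - 1)
    return abs(y_end - math.exp(-1.0))

ratio = error_at_one(50) / error_at_one(100)
print(ratio)                            # close to 2^3, the order of the AM2 corrector
```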

If the explicit method is used as a predictor, it first evaluates (E) the vector field $f(t_{n+k-1}, y_{n+k-1})$ to compute the predictor (P). If the explicit method is used to advance the solution, the procedure continues in this EP-EP-EP... mode. If the implicit method is used, however, the EP step of the predictor is followed by a function evaluation (E) and a subsequent correction (C) using the implicit method. The correction iteration is repeated $m$ times, and the scheme is denoted EP(EC)$^m$. The vector field $f$ is usually sampled also at the accepted point $y^{m+1}_{n+k}$, which is equivalent to the first evaluation (E) of the predictor for the next step. It is of importance to avoid evaluating the vector field at the accepted point in stiff computations, when (10.10) does not hold; then such an evaluation typically degrades accuracy.

Because convergence can be slow and each iteration requires a new function evaluation, the AMk equation (10.8) is sometimes solved using Newton’s method. This speeds up convergence, but at the cost of having to compute the Jacobian matrix $\partial f/\partial y$ and solving full linear systems of equations. In this case, too, we need an initial approximation, and the ABk method is still of use for this purpose.

10.2 BDF methods

The one-step BDF method, BDF1, is identical to the implicit Euler method. The two-step BDF2 is obtained by constructing a polynomial $P(t)$ of degree 2, interpolating the back values $y_n$ and $y_{n+1}$, and satisfying the differential equation at $t_{n+2}$. This is a set of three conditions, which determine the three coefficients of the polynomial:

$$
\begin{aligned}
P(t_n) &= y_n \\
P(t_{n+1}) &= y_{n+1} \\
\dot P(t_{n+2}) &= f(t_{n+2}, P(t_{n+2})).
\end{aligned}
$$

Writing the polynomial as

$$ P(t) = y_n\, \varphi_0(t) + y_{n+1}\, \varphi_1(t) + y_{n+2}\, \varphi_2(t), $$

where $\varphi_j(t)$ are the Lagrange basis polynomials of degree 2, the first two interpolation conditions are automatically satisfied. As for the third condition, we have

$$ y_n\, \dot\varphi_0(t_{n+2}) + y_{n+1}\, \dot\varphi_1(t_{n+2}) + y_{n+2}\, \dot\varphi_2(t_{n+2}) = f(t_{n+2}, y_{n+2}). $$

Let us introduce the following notation,

$$ \alpha_j = \Delta t \cdot \dot\varphi_j(t_{n+2}). $$

We then have a nonlinear equation for the determination of $y_{n+2}$,

$$ \alpha_2\, y_{n+2} - \Delta t\, f(t_{n+2}, y_{n+2}) = \psi, $$

where $\psi = -\alpha_0\, y_n - \alpha_1\, y_{n+1}$ only depends on past data. Thus, to work out the coefficients of the BDF2 method, we need to find the derivatives of the basis functions. (This is the opposite of the case with Adams methods, where we had to find the integrals of the basis functions.) To this end, we note that

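$$
\begin{aligned}
\varphi_0(t) &= \frac{(t - t_{n+1})(t - t_{n+2})}{(t_n - t_{n+1})(t_n - t_{n+2})} \\
\varphi_1(t) &= \frac{(t - t_n)(t - t_{n+2})}{(t_{n+1} - t_n)(t_{n+1} - t_{n+2})} \\
\varphi_2(t) &= \frac{(t - t_n)(t - t_{n+1})}{(t_{n+2} - t_n)(t_{n+2} - t_{n+1})}.
\end{aligned}
$$

Noting that the basis functions are of the form $a \cdot (t - b)(t - c)$, their derivatives are $a \cdot (2t - (b + c))$, so that, assuming that the step size $\Delta t$ is constant, we get

$$
\begin{aligned}
\dot\varphi_0(t_{n+2}) &= \frac{2t_{n+2} - (t_{n+1} + t_{n+2})}{(t_n - t_{n+1})(t_n - t_{n+2})} = \frac{\Delta t}{2\Delta t^2} = \frac{1}{2\Delta t} \\
\dot\varphi_1(t_{n+2}) &= \frac{2t_{n+2} - (t_n + t_{n+2})}{(t_{n+1} - t_n)(t_{n+1} - t_{n+2})} = -\frac{2\Delta t}{\Delta t^2} = -\frac{2}{\Delta t} \\
\dot\varphi_2(t_{n+2}) &= \frac{2t_{n+2} - (t_n + t_{n+1})}{(t_{n+2} - t_n)(t_{n+2} - t_{n+1})} = \frac{3\Delta t}{2\Delta t^2} = \frac{3}{2\Delta t}.
\end{aligned}
$$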

It follows that the constant step size BDF2 method is

32

yn+2−2yn+1 +12

yn = ∆ t f (tn+2,yn+2).

When this is extended to BDFk, the method takes the form

\[
\alpha_k y_{n+k} + \cdots + \alpha_0 y_n = \Delta t\, f(t_{n+k}, y_{n+k}),
\]

where the coefficients $\alpha_j$ are found in Table 10.3.

Steps k   Order p   α_k      α_{k−1}   α_{k−2}   α_{k−3}   α_{k−4}   α_{k−5}
1         1         1        −1
2         2         3/2      −2        1/2
3         3         11/6     −3        3/2       −1/3
4         4         25/12    −4        3         −4/3      1/4
5         5         137/60   −5        5         −10/3     5/4       −1/5

Table 10.3 BDF method coefficients. The first five BDFk methods are implicit, with coefficients $\alpha_j$ normalized so that $\beta_k = 1$. Note that the BDF1 method is equivalent to the implicit Euler method. The BDFk method is of order $p = k$ as long as $k \le 6$
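The Lagrange construction used for BDF2 extends to any $k$, and the tabulated coefficients can be reproduced with a short Python sketch in exact rational arithmetic. The sketch assumes $\Delta t = 1$ and nodes $t_{n+j} = j$, and computes $\alpha_j = \Delta t\,\dot\varphi_j(t_{n+k})$ by differentiating the product form of each basis polynomial. The function name `bdf_coefficients` is illustrative only.

```python
from fractions import Fraction

def bdf_coefficients(k):
    """Return [alpha_0, ..., alpha_k] for constant-step BDFk, with dt = 1 and nodes t_j = j."""
    nodes = range(k + 1)
    t_eval = k                          # differentiate at the newest node t_{n+k}
    alphas = []
    for j in nodes:
        others = [i for i in nodes if i != j]
        denom = Fraction(1)
        for i in others:
            denom *= (j - i)
        # derivative of prod_{i != j} (t - t_i): sum over the factor that is dropped
        deriv = Fraction(0)
        for m in others:
            term = Fraction(1)
            for i in others:
                if i != m:
                    term *= (t_eval - i)
            deriv += term
        alphas.append(deriv / denom)
    return alphas

# print the rows in the order of the table: alpha_k first
for k in range(1, 6):
    print(k, [str(a) for a in reversed(bdf_coefficients(k))])
```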

10.3 Operator theory

However, there is another, more interesting, way to derive these methods. For constant steps, this is based on operator theory. Thus we introduce the forward shift operator, acting on continuous functions $y(t)$, according to

\[
E^{\Delta t} : y(t) \mapsto y(t+\Delta t). \tag{10.11}
\]

The forward shift operator has the following properties:

1. $E^0 = 1$
2. $E^{\Delta t}E^{\Delta s} = E^{\Delta s}E^{\Delta t} = E^{\Delta t+\Delta s}$
3. $(E^{\Delta t})^{-1} = E^{-\Delta t}$.

The properties are reminiscent of an exponential function. This is no coincidence. Let us denote the usual differentiation operator by $D$, so that $Dy = \dot y$. For an analytic function Taylor's theorem now reads

\[
E^{\Delta t}y = y(t+\Delta t) = y + \Delta t\,Dy + \frac{(\Delta t\,D)^2}{2!}\,y + \frac{(\Delta t\,D)^3}{3!}\,y + \cdots = e^{\Delta t D}y,
\]

where the operator function $e^{\Delta t D}$ is interpreted as the operator series

\[
e^{\Delta t D} = \sum_{j=0}^{\infty} \frac{(\Delta t\,D)^j}{j!}. \tag{10.12}
\]

In operator calculus the actual convergence of this series is an issue. Here we shall refrain from investigating this aspect, as we will truncate every operator series we derive. In fact, this is what numerical computation is about: we always truncate infinite expansions, to consider the approximations that can be obtained from the first few terms. This may appear less rigorous, but even for a truncated series, the resulting expansion will be exact for polynomials up to a corresponding degree.

The result of the calculus above is that Taylor's theorem can formally be written

\[
E^{\Delta t} = e^{\Delta t D}, \tag{10.13}
\]

explaining the exponential character of the forward shift operator.

Since we need finite differences, we introduce the forward difference operator

\[
\Delta : y(t) \mapsto y(t+\Delta t) - y(t) \tag{10.14}
\]

as well as the backward difference operator

\[
\nabla : y(t) \mapsto y(t) - y(t-\Delta t). \tag{10.15}
\]

Obviously, $\Delta y = E^{\Delta t}y - y$, and $\nabla y = y - E^{-\Delta t}y$. Hence we can write

\[
\Delta = E^{\Delta t} - 1, \qquad \nabla = 1 - E^{-\Delta t}.
\]

Since "time" $t$ is a function in its own right, we may apply these operators to $y(t) = t$, obtaining the seemingly trivial result

\[
\Delta t = E^{\Delta t}t - t = (t+\Delta t) - t = \Delta t.
\]

This shows that a "time increment" $\Delta t$ is indeed only the forward difference operator acting on the function $y(t) = t$. Later, in connection with boundary value problems, the independent variable is "space" $x$, and a space discretization $\Delta x$ is then in a similar way a forward difference operator acting on the independent variable. With this notation, we have

\[
\frac{\Delta y}{\Delta t} = \frac{y(t+\Delta t) - y(t)}{\Delta t} \approx \frac{dy}{dt} = Dy.
\]

All operators introduced here are linear operators. They can be added and composed, with composition being identified by "multiplication." With these two operations, these operators satisfy the axioms of a ring, and form an operator algebra. By permitting infinite series (this needs further qualifications), we are working in a somewhat broader context of operator calculus. Let the game begin.

Since $\nabla = 1 - E^{-\Delta t}$ and $E^{\Delta t} = e^{\Delta t D}$, we have $e^{-\Delta t D} = 1 - \nabla$. Hence, taking logarithms, we get

\[
\Delta t\,D = -\log(1-\nabla),
\]

where the logarithm function is to be interpreted in terms of its power series expansion. Thus,

\[
\Delta t\,D = \nabla + \frac{\nabla^2}{2} + \frac{\nabla^3}{3} + \frac{\nabla^4}{4} + \cdots = \sum_{j=1}^{\infty} \frac{\nabla^j}{j}. \tag{10.16}
\]

Now consider a differential equation $\dot y = f(t,y)$. Then $\Delta t\,Dy = \Delta t\, f(t,y)$, and a linear multistep method can be obtained by replacing the operator $\Delta t\,D$ by a truncated operator series from (10.16). This yields a $k$-step finite difference method,

\[
\left(\nabla + \frac{\nabla^2}{2} + \cdots + \frac{\nabla^k}{k}\right) y_n = \Delta t\, f(t_n, y_n). \tag{10.17}
\]

This is simply the BDFk method in backward difference representation. If $k = 1$, we have the implicit Euler method in backward difference representation, as

\[
\nabla y_{n+1} = \Delta t\, f(t_{n+1}, y_{n+1}).
\]

More interestingly, for $k = 2$ we obtain the method

\[
\left(\nabla + \frac{\nabla^2}{2}\right) y_n = \Delta t\, f(t_n, y_n).
\]

Here the operators act on the discrete sequence $\{y_n\}$ rather than on functions, and $\nabla\{y_n\} = \{y_n - y_{n-1}\}$, which we represent in the informal shorthand notation

\[
\nabla y_n = y_n - y_{n-1}.
\]

Given that $y_n \approx y(t_n)$, we see that this is the same action as when the operator acts on continuous functions. Powers of the operator $\nabla$ are defined recursively, by

\[
\nabla^2 y_n = \nabla(\nabla y_n) = \nabla(y_n - y_{n-1}) = y_n - 2y_{n-1} + y_{n-2}.
\]

The same principle applies to higher order differences. Considering (10.17) with $k = 2$, we find that

\[
y_n - y_{n-1} + \frac{y_n - 2y_{n-1} + y_{n-2}}{2} = \Delta t\, f(t_n, y_n),
\]

which is simplified to

\[
\frac{3}{2}\,y_n - 2\,y_{n-1} + \frac{1}{2}\,y_{n-2} = \Delta t\, f(t_n, y_n).
\]

This is recognized as the BDF2 formula obtained from the Lagrange polynomial basis.
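The expansion of the truncated series (10.17) into coefficients of $y_n, y_{n-1}, \dots, y_{n-k}$ is mechanical, since $\nabla^j y_n = \sum_i (-1)^i \binom{j}{i} y_{n-i}$. A minimal Python sketch, with the illustrative function name `bdf_from_nabla_series`, carries out this expansion in exact arithmetic:

```python
from fractions import Fraction
from math import comb

def bdf_from_nabla_series(k):
    """Coefficients c_i of y_{n-i}, i = 0..k, in (nabla + nabla^2/2 + ... + nabla^k/k) y_n."""
    coeffs = [Fraction(0)] * (k + 1)
    for j in range(1, k + 1):
        # nabla^j y_n = sum_i (-1)^i * C(j, i) * y_{n-i}, weighted by 1/j in the series
        for i in range(j + 1):
            coeffs[i] += Fraction((-1) ** i * comb(j, i), j)
    return coeffs

print([str(c) for c in bdf_from_nabla_series(2)])
```

The list is ordered from the newest value $y_n$ to the oldest value $y_{n-k}$, that is, in the order $\alpha_k, \alpha_{k-1}, \dots, \alpha_0$.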

The use of operator series is very powerful so long as the step size is constant. (Variable step size multistep methods are very complicated, although necessary in advanced scientific computing.) In principle, any multistep method can be derived using difference operators on a uniform grid. Thus, noting that

\[
\nabla + \frac{\nabla^2}{2} + \frac{\nabla^3}{3} + \dots = \nabla\left(1 + \frac{\nabla}{2} + \frac{\nabla^2}{3} + \dots\right) = \nabla\cdot L(\nabla),
\]

the operator $L(\nabla)$ can be formally inverted, to construct a method

\[
\nabla y_n = L^{-1}(\nabla)\,\Delta t\, f(t_n, y_n).
\]

Expanding $L^{-1}(\nabla)$ in powers of $\nabla$ we obtain

\[
\nabla y_n = \left(1 - \frac{\nabla}{2} - \frac{\nabla^2}{12} - \frac{\nabla^3}{24} - \frac{19\nabla^4}{720} - \frac{3\nabla^5}{160} - \dots\right)\Delta t\, f(t_n, y_n).
\]

These are the AMk methods in backward difference representation, where we include as many powers of $\nabla$ as needed. If the highest power on the right hand side is $\nabla^k$ the method is the $k$-step AMk.

A similar technique can be used to derive the ABk methods as well as many other types of methods. Operator techniques are particularly convenient to derive constant step size multistep methods.
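The formal inversion of $L(\nabla) = 1 + \nabla/2 + \nabla^2/3 + \dots$ amounts to inverting a power series, which can be done term by term from the requirement $L \cdot L^{-1} = 1$. The following Python sketch, with the illustrative function name `inverse_log_series`, treats $\nabla$ as a formal variable $x$ and computes the leading coefficients exactly.

```python
from fractions import Fraction

def inverse_log_series(n_terms):
    """First n_terms coefficients of 1/L(x), where L(x) = 1 + x/2 + x^2/3 + ..."""
    a = [Fraction(1, j + 1) for j in range(n_terms)]   # coefficients of L
    b = [Fraction(0)] * n_terms                         # coefficients of 1/L
    b[0] = 1 / a[0]
    for n in range(1, n_terms):
        # the coefficient of x^n in L(x) * (1/L)(x) must vanish for n >= 1
        b[n] = -sum(a[i] * b[n - i] for i in range(1, n + 1)) / a[0]
    return b

print([str(c) for c in inverse_log_series(6)])
```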

10.4 General theory of linear multistep methods

To develop a general theory, the problem to be solved is $\dot y = f(t,y)$. Linear multistep methods approximate this problem by a difference equation of the form

\[
\sum_{j=0}^{k} \alpha_j y_{n+j} = \Delta t \sum_{j=0}^{k} \beta_j f(t_{n+j}, y_{n+j}). \tag{10.18}
\]

As the difference equation is of order $k$, the method (10.18) is referred to as a $k$-step method. The two sets of coefficients $\{\alpha_j\}_{j=0}^{k}$ and $\{\beta_j\}_{j=0}^{k}$ define the generating polynomials,

\[
\rho(\zeta) = \sum_{j=0}^{k} \alpha_j \zeta^j ; \qquad \sigma(\zeta) = \sum_{j=0}^{k} \beta_j \zeta^j, \tag{10.19}
\]

which are assumed to have no common factor. We also assume that $\alpha_k \neq 0$. The coefficients are normalized by requiring that $\sigma(1) = 1$. Then the pair $(\rho,\sigma)$ uniquely defines a linear $k$-step method, which can be written in terms of the forward shift operator as

\[
\rho(E^{\Delta t})\,y_n = \Delta t\,\sigma(E^{\Delta t})\, f(t_n, y_n).
\]

As $\alpha_k \neq 0$ the difference equation (10.18) can be written

\[
y_{n+k} - \Delta t\,\frac{\beta_k}{\alpha_k}\, f(t_{n+k}, y_{n+k}) = \psi, \tag{10.20}
\]

where $\psi$ only depends on past data. The method is explicit if $\beta_k = 0$ and implicit if $\beta_k \neq 0$. In the latter case (10.20) must be solved numerically to determine $y_{n+k}$.

Inserting any sufficiently differentiable function $y$ and its derivative $\dot y$ into (10.18) we find, by Taylor series expansion, that

\[
r_n[y] := \sum_{j=0}^{k} \alpha_j y(t_{n+j}) - \Delta t \sum_{j=0}^{k} \beta_j \dot y(t_{n+j}) = c_k\,\Delta t^{p+1} y^{(p+1)}(t_n + \theta k \Delta t), \tag{10.21}
\]

for some $\theta \in [0,1]$ as $\Delta t \to 0$, where the remainder term $r_n$ is the local residual. Here $c_k$ is the error constant, and $p$ is the order of consistency. The order of consistency is determined by inserting polynomials $y(t) = t^q$, with $\dot y(t) = q\,t^{q-1}$, since $r_n[t^q] \equiv 0$ for all $q \le p$.

Specifically for $q = 0$ the order condition is $\sum \alpha_j = \rho(1) = 0$. Hence $\rho(\zeta) = 0$ must always have one root $\zeta = 1$, known as the principal root. Taking $n = 0$ and $t_j = j\cdot\Delta t$, for $q \ge 1$, the order conditions are

\[
\sum_{j=0}^{k} \left(\alpha_j\, j^q - \beta_j\, q\, j^{q-1}\right) = 0; \qquad q = 1, \dots, p, \tag{10.22}
\]

where the $(p+1)$th condition fails. The first order condition ($q = 1$) can also be written $\rho'(1) = \sigma(1)$. The two conditions $\rho(1) = 0$ and $\rho'(1) = \sigma(1)$ are referred to as pre-consistency and consistency, respectively.

As a $k$-step method has $2k+1$ coefficients (one coefficient being lost to normalization), and the coefficients $\{\alpha_j\}_0^k$ and $\{\beta_j\}_0^k$ enter (10.22) linearly, it is possible to have $r_n[t^q] \equiv 0$ for $q = 0, \dots, 2k$. Thus the maximal order of consistency is $p = 2k$. However, for $k > 2$ such methods are unstable and fail to be convergent.

We shall see that zero-stability restricts the order, and we have to distinguish between order of consistency, and order of convergence. The latter is the important property, which means that the global error $e_n = y_n - y(t_n) = O(\Delta t^p)$ as $\Delta t \to 0$.
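The order conditions (10.22) are easy to check mechanically. The following Python sketch, with the illustrative function name `order_of_consistency`, takes the coefficient lists $[\alpha_0,\dots,\alpha_k]$ and $[\beta_0,\dots,\beta_k]$ and returns the largest $p$ for which all conditions $q = 0,\dots,p$ hold. Exact rational input is assumed, so that the residuals can be compared with zero.

```python
from fractions import Fraction

def order_of_consistency(alpha, beta, q_max=20):
    """Largest p such that the order conditions hold for q = 0, ..., p (or -1 if rho(1) != 0)."""
    if sum(alpha) != 0:                # q = 0: pre-consistency, rho(1) = 0
        return -1
    p = 0
    for q in range(1, q_max + 1):
        # condition (10.22); note that Python evaluates 0**0 as 1, as required for q = 1
        residual = sum(
            a * j**q - b * q * j**(q - 1)
            for j, (a, b) in enumerate(zip(alpha, beta))
        )
        if residual != 0:
            break
        p = q
    return p

# BDF2: alpha = (1/2, -2, 3/2), beta = (0, 0, 1)
print(order_of_consistency([Fraction(1, 2), Fraction(-2), Fraction(3, 2)], [0, 0, 1]))
```

The order does not depend on the normalization of the coefficients, so the check can also be applied to methods written without the scaling $\sigma(1) = 1$.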

10.5 Stability and convergence

Let us consider the simple problem $\dot y = f(t)$. Then (10.18) can be written

\[
\rho(E^{\Delta t})\,y_n = \Delta t\,\sigma(E^{\Delta t})\, f_n. \tag{10.23}
\]

The solution is $y_n = u_n + v_n$, where $u_n$ is a particular solution of the difference equation, and the homogeneous solutions $v_n$ satisfy $\rho(E^{\Delta t})v_n = 0$. The latter are determined by the roots $\zeta_\nu$ of the characteristic equation $\rho(\zeta) = 0$.

Thus the homogeneous solutions depend on the method but are unrelated to the given problem $\dot y = f(t)$. They must therefore remain bounded for all $n$ to avoid an unstable numerical solution, diverging from the particular solution $u_n$ which approximates the exact solution, $\int f(t)\,dt$.

The homogeneous solutions are unstable unless all roots $\zeta_\nu$ are inside or on the unit circle. Furthermore, $v_n$ also grows if any root on the unit circle is multiple. Thus it is necessary to impose the root condition,

\[
\begin{aligned}
\rho(\zeta) = 0 \;&\Rightarrow\; |\zeta_\nu| \le 1; \quad \nu = 1, \dots, k \\
|\zeta_\nu| = 1 \;&\Rightarrow\; \zeta_\nu \text{ is a simple root.}
\end{aligned}
\]

A method whose $\rho$ polynomial satisfies the root condition is zero stable. Zero stability is necessary for convergence, i.e., for the numerical solution to converge to the exact solution, $y_n \to y(t_n)$ as $\Delta t \to 0$.

If a zero-stable method $(\rho,\sigma)$ has order of consistency $p$, it is also convergent, with order of convergence $p$, i.e., $\|y_n - y(t_n)\| = O(\Delta t^p)$ as $\Delta t \to 0$. In particular, every consistent one-step method is convergent, since $\rho$ has but a single root. This is why zero stability is not an issue for RK methods.

Thus the bridge from order of consistency to order of convergence is stability. For a stable method, the order of convergence equals the order of consistency. We will see later that this is a general principle in numerical analysis.

In the case of linear multistep methods, the crucial stability notion (for convergence) is zero stability. Of the methods studied here, the Adams–Bashforth and Adams–Moulton methods have a single root $\zeta = 1$ and the remaining roots at $\zeta = 0$. Hence they satisfy the root condition; they are zero stable and convergent. In the present context, zero stability is only an issue for the BDF methods, due to the structure of their $\rho$ polynomials. The BDF1–6 are zero stable, but the BDF7 and higher are not. Thus, while the BDFk has order of consistency $p = k$ for all $k$, the methods are only convergent for $k \le 6$.
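The role of the root condition can be observed directly by iterating the homogeneous recursion $\rho(E^{\Delta t})v_n = 0$. The following Python sketch is an illustration only, not a rigorous test: it starts from $v_0 = \dots = v_{k-2} = 0$, $v_{k-1} = 1$ and records the largest $|v_n|$ encountered. The function name `homogeneous_growth` is illustrative, and the polynomial $\rho(\zeta) = \zeta^2 + 4\zeta - 5$, with roots $1$ and $-5$, is a hypothetical example of a violated root condition that does not appear in the text.

```python
def homogeneous_growth(alpha, n_steps=200):
    """Iterate rho(E) v_n = 0 from the start values v_0 = ... = v_{k-2} = 0, v_{k-1} = 1.

    alpha = [alpha_0, ..., alpha_k]; returns max |v_n| over the run.
    """
    k = len(alpha) - 1
    v = [0.0] * (k - 1) + [1.0]
    largest = 1.0
    for _ in range(n_steps):
        # solve alpha_k v_{n+k} + ... + alpha_0 v_n = 0 for the newest value
        new = -sum(a * x for a, x in zip(alpha[:-1], v[-k:])) / alpha[-1]
        v.append(new)
        largest = max(largest, abs(new))
    return largest

print(homogeneous_growth([0.5, -2.0, 1.5]))   # BDF2: roots 1 and 1/3
print(homogeneous_growth([-1, 0, 1]))         # explicit midpoint: roots +1 and -1
print(homogeneous_growth([1, -2, 1]))         # double root at 1
print(homogeneous_growth([-5, 4, 1]))         # roots 1 and -5 (hypothetical example)
```

The first two sequences stay bounded, the double root produces linear growth, and the root at $-5$ produces geometric growth.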

As only convergent methods are of interest, the maximum order needs to be re-examined, taking zero stability into account. According to the First Dahlquist Barrier Theorem, the maximal order of convergence of a $k$-step method is

\[
p_{\max} =
\begin{cases}
k & \text{explicit methods} \\
k+1 & \text{implicit methods with } k \text{ odd} \\
k+2 & \text{implicit methods with } k \text{ even.}
\end{cases}
\]

Implicit methods of order $p = k+2$ are weakly stable, meaning that $\rho(\zeta) = 0$ has two or more distinct roots on the unit circle. This usually results in a spurious, oscillatory error caused by undamped homogeneous solutions, a phenomenon that has already been demonstrated for the explicit midpoint method,

\[
y_{n+2} - y_n = 2\Delta t\, f(t_{n+1}, y_{n+1}).
\]

This method has $\rho(\zeta) = \zeta^2 - 1$ with zeros $\zeta = \pm 1$. Such methods are only used in exceptional cases or for special problems. In practice, the maximal order of an implicit method is $p = k+1$, as exemplified by the AMk methods. Similarly, the ABk methods are explicit, and of maximal order $p = k$. The implicit BDFk methods, on the other hand, are only of order $p = k$, as they sacrifice maximal order for improved stability, so that they can be used for stiff differential equations.


10.6 Stability regions

Stability regions are determined by applying the method to the linear test equation $\dot y = \lambda y$ for $\lambda \in \mathbb{C}$, with $y(0) = 1$. Since $y(t) = e^{t\lambda}$, the zero solution $y = 0$ is stable whenever $\operatorname{Re}\lambda \le 0$.

Applying $(\rho,\sigma)$ to the linear test equation leads to the homogeneous difference equation $\rho(E)y_n - \Delta t\,\lambda\,\sigma(E)y_n = 0$. Stability is then governed by the roots of the characteristic equation

\[
\rho(\zeta) - \Delta t\,\lambda\,\sigma(\zeta) = 0. \tag{10.24}
\]

As before, solutions are stable if and only if the $k$ roots of (10.24) satisfy $|\zeta_\nu(\Delta t\,\lambda)| \le 1$, with simple roots of unit modulus. The stability region is defined as

\[
S(\rho,\sigma) = \{\Delta t\,\lambda \in \mathbb{C} : \rho(\zeta) - \Delta t\,\lambda\,\sigma(\zeta) \text{ satisfies the root condition}\}. \tag{10.25}
\]

Note that zero stability can be expressed as $0 \in S(\rho,\sigma)$. This is a very modest requirement. In practice, a method needs a fairly large stability region.

Since few methods have the property that $S(\rho,\sigma) = \mathbb{C}^-$ we distinguish between numerical stability and the mathematical stability of the problem. Moreover, since $S(\rho,\sigma)$ is defined in terms of $\Delta t\,\lambda$, combining method and problem parameters $\Delta t$ and $\lambda$, there is in general a restriction on the step size $\Delta t$ in order to maintain numerical stability. This is referred to as conditional stability.

The stability regions of the Euler methods, the trapezoidal rule, the explicit midpoint method and RK methods have already been investigated. A method whose stability region covers the left half-plane, $\mathbb{C}^- \subset S(\rho,\sigma)$, is called A-stable. This leads to additional constraints on the method's coefficients. Thus, according to the Second Dahlquist Barrier theorem, the maximum order of an A-stable multistep method is $p = 2$, and of all A-stable second order multistep methods, the trapezoidal rule has the smallest error constant.

Although this appears to be a disappointing result, among higher order multistep methods, the BDF methods have stability regions covering most of $\mathbb{C}^-$. Otherwise, high order A-stable methods can be found among IRK methods. The stability regions of some of the AMk and BDFk methods are plotted in Figure 10.1.

The characteristic equation $\rho(\zeta) - \Delta t\,\lambda\,\sigma(\zeta) = 0$ implies that

\[
\Delta t\,\lambda = \frac{\rho(\zeta)}{\sigma(\zeta)}.
\]

To find the stability region, we note that on the boundary $\partial S(\rho,\sigma)$, there is at least one root of unit modulus, i.e., $\zeta_\nu(\Delta t\,\lambda) = e^{i\varphi}$. We therefore let $\zeta = e^{i\varphi}$, and consider

\[
z = \frac{\rho(e^{i\varphi})}{\sigma(e^{i\varphi})}.
\]

Fig. 10.1 Stability regions of AMk and BDFk methods. Adams–Moulton methods of orders $p = 3$ to $6$ (left) and BDF methods of orders $p = 2$ to $5$ (right) plotted in the complex $\Delta t\,\lambda$ plane. AMk methods are stable inside each closed curve. The stability regions shrink with increasing order, starting from $p = 3$. BDF methods are stable outside each closed curve, where the largest curve corresponds to $p = 5$. Here, too, stability is lost with increasing order

By plotting the image of the unit circle under the rational map $\rho/\sigma$, we obtain the boundary locus in the complex $\Delta t\,\lambda$ plane. The boundary of the stability region consists of (a subset of) this curve.
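The boundary locus is straightforward to compute. The following Python sketch samples $\zeta = e^{i\varphi}$ on a uniform grid and evaluates $z = \rho(\zeta)/\sigma(\zeta)$. The function name `boundary_locus` is illustrative, and the sketch assumes that $\sigma$ has no zero at the sampled points, as is the case for BDF2, where $\sigma(\zeta) = \zeta^2$.

```python
import cmath
import math

def boundary_locus(alpha, beta, n_points=360):
    """Points z = rho(e^{i phi}) / sigma(e^{i phi}) for phi on a uniform grid in [0, 2 pi)."""
    points = []
    for m in range(n_points):
        zeta = cmath.exp(1j * 2.0 * math.pi * m / n_points)
        rho = sum(a * zeta**j for j, a in enumerate(alpha))
        sigma = sum(b * zeta**j for j, b in enumerate(beta))
        points.append(rho / sigma)      # assumes sigma(zeta) != 0 at the sampled points
    return points

# BDF2: rho = 3/2 zeta^2 - 2 zeta + 1/2, sigma = zeta^2
locus = boundary_locus([0.5, -2.0, 1.5], [0.0, 0.0, 1.0])
print(min(z.real for z in locus))
```

For BDF2 the real part of the locus is $(1 - \cos\varphi)^2 \ge 0$, so the curve never enters the open left half-plane, in agreement with the A-stability of this second order method.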

10.7 Adaptive multistep methods

Most linear multistep methods are implemented as adaptive methods, meaning that order as well as step size are varied along the solution, to suit its local behavior. Such implementations are referred to as variable order – variable step size methods. They are very complex codes, since it is already difficult to construct multistep methods on nonuniform grids. Step size control then proceeds in a manner similar to what we have already seen in RK methods. Order control, on the other hand, is a difficult issue, since it requires that error estimates for different orders be available for comparison.

To advance one step, an implicit method requires that a nonlinear equation of the generic form

\[
y_n - \Delta t\,\frac{\beta_k}{\alpha_k}\, f(t_n, y_n) = \psi \tag{10.26}
\]

be solved on each step. The actual iteration can be carried out using fixed-point iteration or Newton iteration. As fixed-point iteration effectively requires $\Delta t\,\beta_k L[f]/\alpha_k < 1$ for convergence, the step size $\Delta t$ is restricted to $\Delta t \sim 1/L[f]$, making fixed-point iteration useless for stiff problems. In stiff problems, therefore, BDF methods are used with some Newton-type iteration to overcome step size restrictions.


Using Newton's method is relatively expensive, as it requires the Jacobian matrix of (10.26),

\[
J(y) = I - \frac{\Delta t\,\beta_k}{\alpha_k}\,\frac{\partial f}{\partial y}.
\]

To reduce the cost of evaluating and factorizing the Jacobian, it is common to use a modified Newton iteration, only recomputing $\partial f/\partial y$ when convergence slows down. For example, in a linear system with constant coefficients, the Jacobian is constant and needs no re-evaluation at all. In nonlinear systems, re-evaluation will usually be necessary, but a modified Newton iteration keeps the Jacobian unchanged over several iterations, and sometimes over several steps. Such strategies also become part of a well designed adaptive code.
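A modified Newton iteration for (10.26) can be sketched in a few lines of Python. The sketch is restricted to the scalar case, so that the Jacobian is a number and no linear algebra is needed; it evaluates the derivative once, at the initial guess, and then keeps it frozen. The function name `solve_implicit_step`, the argument `gamma_dt` (standing for $\Delta t\,\beta_k/\alpha_k$) and the convergence test are assumptions made for illustration, not part of any particular code.

```python
def solve_implicit_step(f, dfdy, t, psi, gamma_dt, y_guess, tol=1e-12, max_iter=50):
    """Solve y - gamma_dt * f(t, y) = psi by a modified Newton iteration (scalar case).

    gamma_dt stands for dt * beta_k / alpha_k. The derivative is evaluated once,
    at the initial guess, and then kept frozen.
    """
    jac = 1.0 - gamma_dt * dfdy(t, y_guess)   # scalar counterpart of J(y)
    y = y_guess
    for iteration in range(1, max_iter + 1):
        residual = y - gamma_dt * f(t, y) - psi
        step = residual / jac
        y -= step
        if abs(step) <= tol * max(1.0, abs(y)):
            return y, iteration
    raise RuntimeError("modified Newton iteration did not converge")

# stiff linear example (assumed): f = lambda * y with lambda = -1000 and gamma_dt = 0.1
y_new, n_iter = solve_implicit_step(lambda t, y: -1000.0 * y,
                                    lambda t, y: -1000.0,
                                    0.0, 1.0, 0.1, 1.0)
print(y_new, n_iter)
```

In this example $\Delta t\,\beta_k L[f]/\alpha_k = 100$, so fixed-point iteration would diverge, while the Newton-type iteration is unaffected by the stiffness.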

For very large systems it may not be affordable to use Newton iterations, as solving the linear equations by conventional Gaussian elimination may be too time-consuming. In such cases, there are other alternatives, e.g. matrix-free iterations based on conjugate gradient methods or Krylov subspace iteration.

Because multistep solvers are very complex, it is unrealistic to construct your own code and expect it to perform well. The existing efficient and reliable codes for stiff and nonstiff problems are time-tested pieces of software, based on decades of research and experience. Some well-known codes are VODE and LSODE (nonstiff and stiff problems), SUNDIALS (nonstiff, stiff and differential-algebraic problems), DASSL and DASPK (stiff and differential-algebraic) and MEBDF (modified extended BDF methods). MATLAB offers the solvers ode15s (stiff problems) and ode113 (nonstiff problems).

Apart from the problem types mentioned here, there are differential-algebraic equations (DAEs), stochastic differential equations (SDEs), delay differential equations (DDEs), and combinations of these. They go outside the standard setting of initial value problems in ODEs, and cannot be dealt with here. Nevertheless, these areas offer their own theories, methodologies and algorithms, often drawing on the elementary techniques for ODEs which have been presented above.

Chapter 11 Special Second Order Problems

There are special cases when the standard first-order formulation and its methods may not be the best choice. Special second order equations have the form

\[
\ddot y = f(y); \qquad y(0) = y_0, \quad \dot y(0) = \dot y_0.
\]

They are called special second order problems because they have a second order derivative but no first order derivative. Second order initial value problems are found in mechanics, and if the first order derivative is missing, the problem has no damping. Such problems occur for instance in celestial mechanics (describing planetary motion and orbits), and also in molecular dynamics. Without damping, the problems usually have some form of oscillatory behavior, often in combination with some conservation principle.

The simplest case would be an undamped pendulum,

\[
\ddot\varphi + \frac{g}{L}\sin\varphi = 0; \qquad \varphi(0) = \varphi_0, \quad \dot\varphi(0) = \dot\varphi_0, \tag{11.1}
\]

or its linearization, the harmonic oscillator

\[
\ddot\varphi = -\omega^2\varphi; \qquad \varphi(0) = \varphi_0, \quad \dot\varphi(0) = \dot\varphi_0. \tag{11.2}
\]

The linearization around $\varphi = 0$ is obtained by considering small amplitudes and approximating the nonlinear function, $\sin\varphi \approx \varphi$. Here, obviously, the angular frequency is

\[
\omega = \sqrt{\frac{g}{L}},
\]

where $g$ is the gravitational acceleration and $L$ is the length of the pendulum. The characteristic equation of (11.2) is $\lambda^2 = -\omega^2$, with solution $\lambda = \pm i\omega$. Consequently, the solution is

\[
\varphi(t) = A\sin\omega t + B\cos\omega t.
\]

For simplicity, let us assume that the initial conditions have been chosen so that $\varphi(t) = \sin\omega t$. This is a plain harmonic oscillation. It is undamped, and goes on indefinitely, with constant amplitude. The nonlinear pendulum (11.1) has the same behavior, except with an amplitude-dependent frequency.

11.1 Standard methods for the harmonic oscillator

In the harmonic oscillator, the amplitude remains unchanged, and no energy is lost. The question now is whether it is possible to find numerical methods that also preserve an undamped oscillation. This is the central issue in geometric integration, where we seek to find discretizations with conservation properties similar to those of the differential equation.

A first approach to this problem is to rewrite it as a system of two first order equations. For the harmonic oscillator we put $y = \varphi = \sin\omega t$ and $x = \dot\varphi/\omega = \cos\omega t$. Then we have the system

\[
\frac{d}{dt}\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} 0 & -\omega \\ \omega & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}
= K(\omega)\cdot\begin{pmatrix} x \\ y \end{pmatrix}, \tag{11.3}
\]

where the skew-symmetric matrix $K(\omega)$ has two complex conjugate eigenvalues, $\lambda_{1,2}[K(\omega)] = \pm i\omega$. These are obviously located on the imaginary axis. It is therefore immediately clear that the simplest methods like the explicit and implicit Euler methods will not do. In the explicit method, $i\omega\Delta t$ will always be outside the stability region, leading to numerical instability. In the implicit method, on the other hand, $i\omega\Delta t$ will fall strictly inside the stability region, leading to exponential damping. Thus neither of these methods has a qualitative behavior agreeing with that of the differential equation, see Figure 11.1.

Recalling linear stability theory, the Trapezoidal Rule has a stability region whose boundary coincides with the imaginary axis. Therefore this method should be "ideal," in the sense that the eigenvalues of the discretization lie on the unit circle. The trapezoidal method for a system $\dot z = Az$ is

\[
z_{n+1} = \left(I - \frac{\Delta t A}{2}\right)^{-1}\left(I + \frac{\Delta t A}{2}\right) z_n. \tag{11.4}
\]

Therefore, if $A = i\omega$ (corresponding to the linear test equation), we have

\[
z_{n+1} = \frac{1 + i\omega\Delta t/2}{1 - i\omega\Delta t/2}\; z_n,
\]

where

\[
\left|\frac{1 + i\omega\Delta t/2}{1 - i\omega\Delta t/2}\right|^2 = \frac{1 + \omega^2\Delta t^2/4}{1 + \omega^2\Delta t^2/4} \equiv 1
\]

for all $\omega\Delta t \in \mathbb{R}$.

Fig. 11.1 Euler methods applied to harmonic oscillator. When the explicit Euler method (left panel) and the implicit Euler method (right panel) are applied to the harmonic oscillator, the former method is unstable and forms an outward spiral. The implicit method, on the other hand, spirals inward, due to overstability. Neither method is satisfactory, in spite of both methods being convergent. In both cases, $\omega = 1$ and $N = 5000$ steps were taken to integrate the system on $[0, 20\pi]$. The unit circle is in both cases indicated by a blue circle

The analysis is only slightly more complicated in the vector case, when $A = K(\omega)$. Putting $a = \omega\Delta t/2$ and applying the trapezoidal method to (11.3) then yields

\[
\begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix}
= \frac{1}{1+a^2}\begin{pmatrix} 1-a^2 & -2a \\ 2a & 1-a^2 \end{pmatrix}\begin{pmatrix} x_n \\ y_n \end{pmatrix}
= T(a)\cdot\begin{pmatrix} x_n \\ y_n \end{pmatrix}, \tag{11.5}
\]

where we are interested in the eigenvalues of $T(a)$. These are

\[
\lambda_{1,2}[T(a)] = \frac{1 - a^2 \pm 2ai}{1 + a^2},
\]

implying that

\[
|\lambda_{1,2}[T(a)]|^2 = \left|\frac{1 - a^2 \pm 2ai}{1 + a^2}\right|^2 = \frac{1 - 2a^2 + a^4 + 4a^2}{1 + 2a^2 + a^4} \equiv 1
\]

for all $a \in \mathbb{R}$, and hence for all $\omega\Delta t \in \mathbb{R}$. Thus the discretization will also represent oscillations, and since the eigenvalues of $T(a)$ have unit modulus, there is no damping, meaning that the norm of the solution remains bounded above and below, for all $n > 1$. This is a desirable behavior, as it is similar to the behavior of the harmonic oscillator. However, in order to resolve the correct frequency of the oscillation, the time step $\Delta t$ has to be chosen short enough to have $|\omega\Delta t| < 1$, in accordance with the Nyquist–Shannon sampling theorem.

Fig. 11.2 Explicit midpoint and trapezoidal method applied to harmonic oscillator. When the explicit midpoint method (left panel) and the trapezoidal rule (right panel) are applied to the harmonic oscillator, both methods show an outstanding performance with excellent long-time behavior. In both cases, $\omega = 1$ and a mere $N = 200$ steps were taken to integrate the system on $[0, 20\pi]$. The unit circle is in both cases indicated by a blue circle, and the numerical solutions (red) only deviate negligibly from the true solution. In both cases $\omega\Delta t = \pi/10$ satisfies the Nyquist–Shannon sampling criterion
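The qualitative difference between the two Euler methods and the trapezoidal rule can be reproduced with a small Python sketch. It integrates (11.3) from $(x, y) = (1, 0)$ and returns the final distance from the origin, which is $1$ for the exact solution. The function name `final_radius` and the string labels for the methods are illustrative; the default parameters $\omega = 1$, $N = 5000$ and the interval $[0, 20\pi]$ follow the setup stated for Figure 11.1.

```python
import math

def final_radius(method, omega=1.0, t_end=20.0 * math.pi, n_steps=5000):
    """Integrate (11.3) from (x, y) = (1, 0) and return sqrt(x^2 + y^2) at t_end."""
    dt = t_end / n_steps
    x, y = 1.0, 0.0
    for _ in range(n_steps):
        if method == "explicit_euler":
            h = omega * dt
            x, y = x - h * y, y + h * x
        elif method == "implicit_euler":
            h = omega * dt
            # (I - dt K)^{-1} = [[1, -h], [h, 1]] / (1 + h^2)
            x, y = (x - h * y) / (1.0 + h * h), (y + h * x) / (1.0 + h * h)
        elif method == "trapezoidal":
            a = 0.5 * omega * dt
            # the matrix T(a) from (11.5)
            x, y = (((1.0 - a * a) * x - 2.0 * a * y) / (1.0 + a * a),
                    (2.0 * a * x + (1.0 - a * a) * y) / (1.0 + a * a))
        else:
            raise ValueError(method)
    return math.hypot(x, y)

for name in ("explicit_euler", "implicit_euler", "trapezoidal"):
    print(name, final_radius(name))
```

With $h = \omega\Delta t$, each explicit Euler step multiplies the radius by $\sqrt{1+h^2}$ and each implicit Euler step divides it by the same factor, while the trapezoidal rule leaves it unchanged up to rounding.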

Looking for an explicit alternative to the implicit trapezoidal method, we have previously encountered the two-step Explicit Midpoint method. When applied to a system $\dot z = Az$ it reads

\[
z_{n+1} = z_{n-1} + 2\Delta t\, A z_n.
\]

Again, let us take $A = i\omega$, corresponding to the linear test equation. For this two-step difference equation, we have the characteristic equation

\[
q^2 - 2\Delta t\, i\omega\, q - 1 = 0.
\]

Since the product of the roots is $-1$, the method is only stable if the roots are

\[
q_1 = e^{i\xi}, \qquad q_2 = -e^{-i\xi}
\]

with sum $q_1 + q_2 = 2i\sin\xi = 2\Delta t\, i\omega$. Hence

\[
\sin\xi = \omega\Delta t.
\]

Obviously, this requires $\omega\Delta t \in (-1,1)$, or $|\omega\Delta t| < 1$. Equality is excluded because if $\omega\Delta t = \pm 1$ we obtain a double root, $q_1 = q_2 = \pm i$, leading to instability.

By taking $A = K(\omega)$ this result is immediately generalized to the full recursion for (11.3). The scheme is

\[
\begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix}
= \begin{pmatrix} x_{n-1} \\ y_{n-1} \end{pmatrix}
+ 2\Delta t\cdot K(\omega)\cdot\begin{pmatrix} x_n \\ y_n \end{pmatrix}, \tag{11.6}
\]

and it is evidently stable whenever $|\lambda_{1,2}[\Delta t\, K(\omega)]| < 1$, or simply $|\omega\Delta t| < 1$.

Since there was no such stability restriction on $\Delta t$ for the trapezoidal method, one would think that the explicit midpoint method is at a disadvantage. However, in order to resolve the frequency of the actual oscillation, the step size $\Delta t$ has to fulfill a similar requirement according to the Nyquist–Shannon sampling theorem. Therefore, one can expect a similar performance with both methods. They are both second order convergent, and can use similar step sizes. The drawback is that the trapezoidal rule is implicit, and hence more expensive to use. On the other hand, it is more robust in case the system would also include a small damping, in which case the trapezoidal rule will still manage well, while the explicit midpoint method immediately goes unstable.
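The stability limit $|\omega\Delta t| < 1$ of the explicit midpoint recursion (11.6) is easy to observe numerically. The following Python sketch runs the two-step scheme and records the largest distance from the origin. The function name `midpoint_max_radius` is illustrative, and the second starting value is taken from the exact solution, which is an assumption of this sketch.

```python
import math

def midpoint_max_radius(omega_dt, n_steps=200):
    """Run the two-step recursion (11.6) and return the largest value of sqrt(x^2 + y^2)."""
    h = omega_dt
    x_prev, y_prev = 1.0, 0.0                    # exact solution at t_0
    x, y = math.cos(h), math.sin(h)              # exact solution at t_1 as second start value
    largest = 1.0
    for _ in range(n_steps):
        # z_{n+1} = z_{n-1} + 2 dt K z_n with dt K = [[0, -h], [h, 0]]
        x_next = x_prev - 2.0 * h * y
        y_next = y_prev + 2.0 * h * x
        x_prev, y_prev, x, y = x, y, x_next, y_next
        largest = max(largest, math.hypot(x, y))
    return largest

print(midpoint_max_radius(0.5))   # inside the stability limit
print(midpoint_max_radius(1.1))   # outside the stability limit
```

For $\omega\Delta t = 0.5$ both characteristic roots lie on the unit circle and the solution stays close to the unit circle; for $\omega\Delta t = 1.1$ one root has modulus greater than one and the recursion blows up.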

As the numerical tests in Figure 11.1 demonstrate, there is simply no comparison between conventional methods and methods designed for special second order equations. Even at coarse step sizes, the special methods beat conventional methods hands down, and accuracy far exceeds any standard expectations. The special methods are not perfect: they still have phase errors, even though they nearly conserve total energy over very long times. This demonstrates that a thorough knowledge of the problem at hand, together with an understanding of the special properties of numerical methods, can have a tremendous impact on the quality of the numerical solution.

11.2 The symplectic Euler method

As there are few other conventional methods that will prove conservative for the harmonic oscillator, we have to seek other alternatives. One possibility is to combine the Explicit and Implicit Euler methods in the following way,

\[
\begin{aligned}
x_{n+1} &= x_n - \Delta t\cdot\omega\, y_n \\
y_{n+1} &= y_n + \Delta t\cdot\omega\, x_{n+1}.
\end{aligned}
\]

Thus the first equation is discretized using the explicit Euler method, and the second equation by the implicit Euler method. The overall method is nevertheless explicit, because once $x_{n+1}$ has been computed from the first equation, it can be substituted into the second equation to compute $y_{n+1}$. This unusual method is known as the Symplectic Euler method.

In order to analyze the method, we can eliminate $y$. Thus, from the first equation we obtain

\[
y_n = -\frac{x_{n+1} - x_n}{\omega\Delta t}.
\]

Substituting into the second equation, eliminating the variable $y$, we obtain, after rearranging terms,

\[
\frac{x_{n+2} - 2x_{n+1} + x_n}{\Delta t^2} = -\omega^2 x_{n+1}. \tag{11.7}
\]

The difference quotient on the left hand side is a second order approximation to the derivative $\ddot x$ at time $t_{n+1}$, making (11.7) a two-step direct discretization of $\ddot x = -\omega^2 x$. This particular discretization will be used in connection with the wave equation in PDEs, where it is sometimes referred to under a nickname, the "leap-frog method." The method is related to the explicit midpoint method, and is sometimes a method of choice for problems in wave mechanics. We note, in passing, that the classical wave equation is $u_{tt} = u_{xx}$, i.e., there is no first order time derivative. This is therefore a "special second order equation" without damping, and it therefore benefits from special time-stepping methods.

Let us now analyze this method the way we evaluated the other methods. We note that (11.7) is a linear difference equation. Its characteristic equation is

\[
q^2 - \left(2 - (\omega\Delta t)^2\right) q + 1 = 0.
\]

Here, the product of the roots is $+1$, and their sum is $2 - (\omega\Delta t)^2$. Therefore, as long as the method is stable, we have

\[
q_1 = e^{i\xi}, \qquad q_2 = e^{-i\xi}
\]

with $q_1 + q_2 = 2\cos\xi = 2 - (\omega\Delta t)^2$. It follows that

\[
1 - \cos\xi = 2\sin^2\frac{\xi}{2} = \frac{\omega^2\Delta t^2}{2},
\]

which requires $|\omega\Delta t| < 2$. This stability condition is slightly less restrictive than the previous conditions, obtained for the trapezoidal and explicit midpoint methods. In addition, the symplectic Euler method has both roots on the unit circle, implying that the solution remains bounded above and below, which is desirable for a method simulating wave propagation.
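Both the method and the elimination leading to (11.7) can be checked with a short Python sketch. It runs the symplectic Euler method on the harmonic oscillator and evaluates the residual of the two-step recursion (11.7), multiplied through by $\Delta t^2$, which should vanish up to rounding. The function name `symplectic_euler` is illustrative, and the parameters $\omega = 1$, $N = 200$ steps on $[0, 20\pi]$ are chosen here for illustration.

```python
import math

def symplectic_euler(omega, dt, n_steps, x0=1.0, y0=0.0):
    """Return the lists of x_n and y_n produced by the symplectic Euler method."""
    xs, ys = [x0], [y0]
    for _ in range(n_steps):
        x_new = xs[-1] - dt * omega * ys[-1]     # explicit Euler step in x
        y_new = ys[-1] + dt * omega * x_new      # implicit Euler step in y, using x_{n+1}
        xs.append(x_new)
        ys.append(y_new)
    return xs, ys

omega, dt = 1.0, 20.0 * math.pi / 200
xs, ys = symplectic_euler(omega, dt, 200)
# residual of the two-step recursion (11.7), multiplied through by dt^2
residuals = [xs[n + 2] - 2.0 * xs[n + 1] + xs[n] + (omega * dt) ** 2 * xs[n + 1]
             for n in range(len(xs) - 2)]
print(max(abs(r) for r in residuals))
print(max(math.hypot(x, y) for x, y in zip(xs, ys)))
```

The second printed number shows that the numerical solution stays at a bounded distance from the origin, although the method is only first order accurate.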


11.3 Hamiltonian systems

A most important class of second order problems are Hamiltonian systems. Such a system is described by a scalar function, the Hamiltonian $H(q,p)$, where the two variables usually refer to position ($q$) and momentum ($p$). The time evolution of the system is described by the equations

\[
\begin{aligned}
\dot p &= -H_q \\
\dot q &= H_p,
\end{aligned}
\]

where $H_q = \partial H/\partial q$ and $H_p = \partial H/\partial p$.

An important special case is when the Hamiltonian is separable. It can then be written as a sum of two independent functions,

\[
H(q,p) = U(q) + T(p).
\]

This is often the case in mechanics, where $U(q)$ denotes a potential (depending only on the position $q$) and $T(p)$ denotes kinetic energy (depending only on the momentum $p$). For example, in a harmonic oscillator we could have

\[
H(q,p) = \frac{1}{2}kq^2 + \frac{p^2}{2m},
\]

where the parameters $m$ and $k$ represent mass and spring constant. In such a problem total energy is conserved, although there is energy transfer from potential to kinetic energy and vice versa. This means that both $p$ and $q$ evolve in time, while $H(q,p) = C$, independent of time. The computational challenge, then, is whether we can find numerical methods that solve the differential equations while keeping the total energy invariant.

In a more general setting, $U(q)$ is a potential depending on the position vector $q$, and the kinetic energy is $p^{\mathrm T}M^{-1}p/2$, where $p$ is the momentum vector, and $M$ is the mass matrix. In this case, $U_q = \nabla_q U$ is the gradient of the potential, and $T_p = M^{-1}p$. Since $H_q = U_q$ and $H_p = T_p$, the system takes the form

\begin{align}
\dot p &= -U_q \tag{11.8} \\
\dot q &= T_p. \tag{11.9}
\end{align}

For such problems, it is again possible to combine explicit and implicit time-stepping, because the right hand side of the first equation only depends on $q$ and the right hand side of the second equation only on $p$.

The methods we have considered so far are reasonably good contenders, but we will find that they do not keep $H(q,p)$ constant. Instead, they keep $H(q,p)$ nearly constant. Since $q$ and $p$ cannot be computed exactly, there will be a (possibly growing) phase error, but the deviation from constant energy will remain small over extremely long integration times. Thus, if the phase error can be tolerated, all the methods presented here will produce numerical solutions with a qualitative behavior in good agreement with that of the continuous system.

11.4 The Stormer–Verlet method

For the separable Hamiltonian system (11.8), a simple discretization is the follow-ing,

pn+1/2 = pn−∆ t2

Uq(qn)

qn+1 = qn +∆ t ·Tp(pn+1/2)

pn+1 = pn+1/2−∆ t2

Uq(qn+1).

This method, known as the Stormer–Verlet method in connection with Hamilto-nian problems, splits the integration into separate steps for p and q. The first stepadvances p half a step from tn to tn+1/2. Next, we advance q a full step from tn totn+1, making use of the fact that we can evaluate the right hand side at the midpoint,as pn+1/2 is available from the previous operation. To complete the integration, wetake the second half step in p, from tn+1/2 to tn, using the recently updated valueqn+1. The method is explicit and of second order. It is easy to use and is highlysuccessful in molecular dynamics and celestial mechanics.

After the three steps above have been completed, the integration continues by repeating the procedure. Considering two consecutive steps, we see that the splitting can then be rewritten as

$$\begin{aligned}
p_{n+1/2} &= p_{n-1/2} - \Delta t \cdot U_q(q_n) \\
q_{n+1} &= q_n + \Delta t \cdot T_p(p_{n+1/2}).
\end{aligned}$$

This means that the variable $q$ advances on a grid where the time points have an integer index, $t_n$, while the variable $p$ advances on a staggered grid, where the time points have fractional indices, with $t_{n+1/2} = (t_n + t_{n+1})/2$. Thus the method is a map

$$(p_{n-1/2}, q_n) \mapsto (p_{n+1/2}, q_{n+1}).$$

Whenever $p$ and $q$ are needed at a single point in time, they are easily computed according to the original scheme.

As a final test of the Störmer–Verlet method, we integrate the harmonic oscillator over a very long time interval, $[0, 200\pi]$, corresponding to one hundred full periods of oscillation, see Figure 11.4. We take $N = 10^4$ steps, corresponding to 100 steps per period. We compute the error of the Hamiltonian, which in this case means that we check how far the computed quantity $x^2 + y^2$ is from the value 1 given by the trigonometric identity. The remarkable feature of the Störmer–Verlet method is that there is no growth of this error; in fact, the error shows a strong periodic behavior, and one can integrate over extremely long times with a bounded error, as demonstrated in Figure 11.4.

Fig. 11.3 Symplectic Euler and Störmer–Verlet methods applied to harmonic oscillator. When the symplectic Euler method (left panel) and the Störmer–Verlet method (right panel) are applied to the harmonic oscillator, both methods show a periodic behavior. In both cases, $\omega = 1$ and $N = 200$ steps were taken to integrate the system on $[0, 20\pi]$. The unit circle is in both cases indicated by a blue circle. The symplectic Euler method shows a very visible error. By comparison, the Störmer–Verlet method has a far smaller error

Even if the energy remains near the invariant, there is still a phase error, i.e., there is a deviation between $(x(t), y(t))$ and the exact solution, $(\cos t, \sin t)$. The phase error initially grows linearly. Eventually, for very large times it will ruin the global accuracy of the solution, as the numerical solution goes completely out of phase, although the energy is still correct. This limits the time horizon over which simulations remain accurate. Both the energy error and the global error in the Störmer–Verlet method are of second order. The energy error is $O((\omega\Delta t)^2)$, while the global error, as long as it is small, has a behavior of the form $\sim O(t(\omega\Delta t)^2)$.
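The long-time experiment described above can be reproduced with a few lines of code. The following is a minimal sketch, not the author's program: it assumes the identification $p = x$, $q = y$, so that $\dot p = -q$, $\dot q = p$ with $p(0) = 1$, $q(0) = 0$ has the exact solution $(\cos t, \sin t)$, and the function names are chosen here for illustration only.

```python
import math

def stormer_verlet_step(p, q, dt, grad_U, grad_T):
    """One Stormer-Verlet step for a separable Hamiltonian H = T(p) + U(q)."""
    p_half = p - 0.5 * dt * grad_U(q)           # half step in p
    q_new = q + dt * grad_T(p_half)             # full step in q, using the midpoint value of p
    p_new = p_half - 0.5 * dt * grad_U(q_new)   # second half step in p
    return p_new, q_new

def run_oscillator(periods=100, steps_per_period=100):
    """Integrate p' = -q, q' = p from (p, q) = (1, 0); exact solution (cos t, sin t)."""
    dt = 2.0 * math.pi / steps_per_period
    p, q = 1.0, 0.0
    energy_err, global_err = [], []
    for n in range(1, periods * steps_per_period + 1):
        p, q = stormer_verlet_step(p, q, dt, lambda x: x, lambda x: x)
        t = n * dt
        energy_err.append(p * p + q * q - 1.0)                           # e = x^2 + y^2 - 1
        global_err.append(math.hypot(p - math.cos(t), q - math.sin(t)))  # distance to exact solution
    return energy_err, global_err

energy_err, global_err = run_oscillator()
print("max |energy error|   :", max(abs(e) for e in energy_err))
print("global error at t=100*pi:", global_err[len(global_err) // 2 - 1])
print("global error at t=200*pi:", global_err[-1])
```

With these settings the energy error should stay bounded at a level proportional to $\Delta t^2$, while the global error roughly doubles between $t = 100\pi$ and $t = 200\pi$, in line with the linear growth of the phase error.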

The special methods discussed here are not suitable (indeed often complete failures) in problems having damping, however small. In connection with conservative systems and wave phenomena, on the other hand, the methods excel, and have some unparalleled properties, with a unique ability to conserve energy and/or amplitude. Even so, the method comparisons demonstrated here show that methods have to be selected with great care also for special second order equations. In particular, Figure 11.3 shows that even if a method preserves symplectic structure (the symplectic Euler method) it might still not have a strong enough performance when compared to a dedicated method, such as the Störmer–Verlet method.

Fig. 11.4 Error in Störmer–Verlet method applied to harmonic oscillator. When the Störmer–Verlet method solves the harmonic oscillator with $\omega = 1$ over a long time interval $[0, 200\pi]$, the energy error $e = x^2 + y^2 - 1$ is plotted as a function of time $t$ (top panel). Using 100 steps per period, for a total of $N = 10^4$ steps, we note that this error does not increase with time, but remains bounded for all $t > 0$. Thus the solution remains close to the invariant $x^2 + y^2 = 1$ for extremely long times. However, there is also a phase error, which causes the global error to essentially grow linearly (lower panel). The global error is $\sqrt{(x - \cos t)^2 + (y - \sin t)^2}$ and measures the distance between the numerical solution and the exact solution

11.5 Time symmetry, reversibility and adaptivity

Many special second order problems are reversible, or symmetric with respect to time, in the sense that the problems of integrating in forward or reverse time are essentially the same. For example, simulating the solar system in reverse time (to find its state in the past) is not qualitatively different from simulating it in forward time (predicting its state in the future). In case damping had been present, these tasks would have been entirely different.

Consider a system $\dot z = F(z)$. Integrating it in reverse time is equivalent to making the variable change $t \leftrightarrow -t$, which implies the transformation $\mathrm{d}z/\mathrm{d}t \leftrightarrow -\mathrm{d}z/\mathrm{d}t$. Thus solving $\dot z = F(z)$ in reverse time is the same as solving $\dot z = -F(z)$ in forward time. But this transformation reverses the stability properties of the system. If the forward and reverse problems are to have the same stability properties, then the system must be neutrally stable.

For the Hamiltonian problem (11.8), reversibility is a slightly more technical issue. With $q$ representing position and $p$ momentum, we obtain the solution in reverse time by merely making the variable change $p \leftrightarrow -p$. While the position variables are unchanged, the “velocities” are reversed, and (11.8) takes the form

$$-\dot p = -U_q \tag{11.10}$$

$$\dot q = -T_p. \tag{11.11}$$

Meanwhile, if we only change the independent variable, $t \leftrightarrow -t$ in (11.8), we have

$$-\dot p = -U_q \tag{11.12}$$

$$-\dot q = T_p, \tag{11.13}$$

which is immediately seen to be the same system as (11.10). Combining the two changes $p \leftrightarrow -p$ and $t \leftrightarrow -t$ then produces the original system.

Let us specialize $\dot z = F(z)$ to a linear system $\dot z = Az$ with initial condition $z_0$. Advancing the solution $\Delta t$ units in time, we have $z_1 = e^{\Delta t A} z_0$. If, on the other hand, the solution at time $\Delta t$ is known, then $z_0$ is given by $z_0 = e^{-\Delta t A} z_1$, and

$$\left(e^{\Delta t A}\right)^{-1} = e^{-\Delta t A}.$$

A symmetric one-step method $\Phi_{\Delta t} : z_0 \mapsto z_1$ must have the corresponding property,

$$(\Phi_{\Delta t})^{-1} = \Phi_{-\Delta t}.$$

While conventional methods such as the Euler methods fail to have this property, we note that e.g. the trapezoidal rule is symmetric, since, by (11.4),

$$\left( \left(I - \frac{\Delta t}{2} A\right)^{-1} \left(I + \frac{\Delta t}{2} A\right) \right)^{-1} = \left(I - \frac{-\Delta t}{2} A\right)^{-1} \left(I + \frac{-\Delta t}{2} A\right).$$

Explicit methods cannot be symmetric without being two-step (like the explicit midpoint method and the Störmer–Verlet method), and such methods require a more elaborate theory. Symmetric methods play a key role in geometric integration, since they replicate the symmetry and reversibility of special second order problems. Usually, explicit methods are preferred. Implicit methods are more expensive, and require that the (nonlinear) equations are solved to full precision.
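The symmetry property $(\Phi_{\Delta t})^{-1} = \Phi_{-\Delta t}$ is easy to test on the scalar equation $\dot z = \lambda z$, where each method reduces to multiplication by a number. The sketch below is an illustration under that assumption (the value of $\lambda$ and the function names are arbitrary choices made here): a step forward followed by a step with $-\Delta t$ should return the initial value exactly when the method is symmetric.

```python
def euler_growth(h, lam):
    """Explicit Euler on z' = lam*z: z1 = (1 + h*lam) * z0."""
    return 1.0 + h * lam

def trapezoidal_growth(h, lam):
    """Trapezoidal rule on z' = lam*z: z1 = (1 + h*lam/2) / (1 - h*lam/2) * z0."""
    return (1.0 + 0.5 * h * lam) / (1.0 - 0.5 * h * lam)

h, lam = 0.1, -2.0 + 3.0j

# For a symmetric method, Phi_h * Phi_{-h} = 1, so the defect below vanishes.
trap_defect = abs(trapezoidal_growth(h, lam) * trapezoidal_growth(-h, lam) - 1.0)
euler_defect = abs(euler_growth(h, lam) * euler_growth(-h, lam) - 1.0)

print("trapezoidal defect:", trap_defect)
print("explicit Euler defect:", euler_defect)
```

For explicit Euler the defect equals $|\Delta t\,\lambda|^2$, which is nonzero for every nonzero step, whereas for the trapezoidal rule it vanishes up to rounding.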

Now, the real difficulty comes in making symmetric methods adaptive. The conventional way of changing the time step is multiplicative, as in

$$\Delta t_{n+1} = \theta_n \cdot \Delta t_n.$$

This is not symmetric, since it uses the back history of the solution to compute $\theta_n$ and decide on the next step size. (It is only symmetric for $\theta_n = \pm 1$, which fails to make the method adaptive.)

It is therefore necessary to reconsider adaptivity. We need to construct a step size control which is in itself a Hamiltonian problem. This can be obtained by transforming the independent variable $t$ so that $\mathrm{d}/\mathrm{d}t = \rho\,\mathrm{d}/\mathrm{d}\tau$, where $\rho(\tau)$ is a time rescaling function. The original system $\dot z = F(z)$ is transformed into

$$z' = F(z)/\rho, \tag{11.14}$$

where prime denotes derivative with respect to $\tau$. This equation is augmented by the control system

$$\rho' = G(z) \tag{11.15}$$

$$t' = 1/\rho, \tag{11.16}$$

where (11.15) generates $\rho$ from a chosen control function $G(z)$ and (11.16) recovers the original time $t$. The initial values are $\rho(0) = 1$ and $t(0) = 0$. By solving the augmented system numerically, the continuous function $\rho(\tau)$ will be represented by a discrete sequence $\rho_{n+1/2}$, where the varying time step $\Delta t$ is constructed from

$$\Delta t_{n+1/2} = \frac{\varepsilon}{\rho_{n+1/2}}. \tag{11.17}$$

The fixed parameter $\varepsilon$ is interpreted as the initial step size.

The target is to keep $\rho = Q(z)$, where $Q$ is a prescribed symmetric function of the solution $z$. We need to construct a function $G(z)$ to achieve this end. Defining $h(t,\rho) = \log[Q(z(t))/\rho]$, we construct a (separable) Hamiltonian system

$$\begin{aligned}
t' &= -h_\rho \\
\rho' &= h_t.
\end{aligned}$$

By construction, $h(t,\rho) = \log[Q(z(t))/\rho]$ is a first integral and will remain constant. Now, noting that $h_\rho = -1/\rho$, so that $t' = 1/\rho$ in agreement with (11.16), and

$$h_t = \nabla_z Q(z)^{\mathrm T} F(z)/Q(z),$$

we make the control system (11.15–11.16) Hamiltonian by taking

$$G(z) = \mathrm{d}(\log Q)/\mathrm{d}t = \nabla_z Q(z)^{\mathrm T} F(z)/Q(z).$$

As a result, $\rho$ will track the quantity $Q(z)$ along the solution $z(t)$.

In numerical computations, we use the explicit Störmer–Verlet method to solve (11.15), resulting in the discrete control law

$$\rho_{n+1/2} = \rho_{n-1/2} + \varepsilon\, \nabla_z Q(z_n)^{\mathrm T} F(z_n)/Q(z_n). \tag{11.18}$$


This generates a sequence $\rho_{n+1/2}$ such that $Q(z_n)/\rho_n$ remains nearly constant, and from which the step size is computed using (11.17). For separable Hamiltonian problems, the adaptive Störmer–Verlet method then becomes

$$\begin{aligned}
\rho_{n+1/2} &= \rho_{n-1/2} + \varepsilon\, \nabla_z Q(z_n)^{\mathrm T} F(z_n)/Q(z_n) \\
\Delta t &= \frac{\varepsilon}{\rho_{n+1/2}} \\
p_{n+1/2} &= p_n - \frac{\Delta t}{2}\, U_q(q_n) \\
q_{n+1} &= q_n + \Delta t \cdot T_p(p_{n+1/2}) \\
p_{n+1} &= p_{n+1/2} - \frac{\Delta t}{2}\, U_q(q_{n+1}) \\
t_{n+1} &= t_n + \Delta t.
\end{aligned}$$

Thus the method is made adaptive by simply adding the recursion for $\rho$.

There are many ways of choosing the tracking function $Q(z)$. However, it is important to note that the adaptive Störmer–Verlet method above uses a constant pseudo-time step $\varepsilon$, which is converted to the real time step through $\Delta t = \varepsilon/\rho$. The latter varies along the solution, since $\rho$ tracks $Q(z)$. However, this is not a controller in the usual sense, since $Q(z)$ is not an error estimate. Another important aspect is that the tracking control used here is not of the usual multiplicative type in conventional error control. The tracking control is additive and controls the inverse of $\Delta t$, and hence is a nonlinear control law.

Example We use the classical Kepler problem as an illustration. This is a special two-body problem, which models the trajectory of a light body in a highly eccentric orbit around a heavy body. This could represent a comet orbiting the Sun, or the “free” return trajectory of a manned space capsule around the back of the Moon. The structure of the problem is given by its Hamiltonian (with normalized constants)

$$H(p,q) = \frac{p^{\mathrm T} p}{2} - \frac{1}{\sqrt{q^{\mathrm T} q}}.$$

We take the initial conditions

$$\begin{aligned}
q(0) &= (1-e,\ 0)^{\mathrm T} \\
p(0) &= \left(0,\ \sqrt{(1+e)/(1-e)}\right)^{\mathrm T},
\end{aligned}$$

where $e$ is the eccentricity of the orbit. At eccentricity $e = 0$ the orbit is circular with constant velocity, with no need to vary the step size. The higher the value of the eccentricity $e$, the more dramatic is the change of velocity near the heavy mass. The solution is highly sensitive to perturbations near the heavy mass. Thus we need a short step size there, but far away accuracy is less crucial and the step size can be large.

Figure 11.5 shows the numerical solution of the Kepler problem with an eccentricity of $e = 0.8$. The Störmer–Verlet method is used as described, first with constant steps, and then in adaptive mode, with variable steps constructed by letting $\rho$ track the quantity $Q = 1/\sqrt{q^{\mathrm T} q}$. This means that $\rho$ is small (and $\Delta t$ is large) when $\|q\|$ is large (away from the central mass), and vice versa. In the adaptive computation this is achieved by taking $G = -p^{\mathrm T} q / q^{\mathrm T} q$.


Fig. 11.5 Thirty orbits of the Kepler problem. When the Störmer–Verlet method integrates the Kepler problem with eccentricity 0.8 using $N = 10^4$ constant steps (left), numerical precession (phase error) is significant. When reversible adaptivity is used with $N = 10^4$ variable steps (right), numerical precession is strongly suppressed, and the energy error is reduced by a factor of 30

Both the constant step size method and the adaptive method (nearly) conserve energy, but adaptivity significantly improves both accuracy and efficiency. In fact, the computational cost of varying the step size is negligible. For the same total work (the same number of steps), the global error is much reduced while energy is still nearly conserved.
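The adaptive scheme of this section can be tried on the Kepler example with a short program. The following is a sketch under stated assumptions rather than the code behind Figure 11.5: the pseudo-time step `eps = 0.004`, the number of steps, and the start-up rule (a half step for $\rho$ on the very first step, to obtain $\rho_{1/2}$ from $\rho(0) = 1$) are choices made here for illustration. It compares the largest energy deviation of the adaptive method with that of a constant step run over the same time interval with the same number of steps.

```python
import math

def grad_U(q):
    """Gradient of U(q) = -1/|q| for the normalized Kepler problem."""
    r3 = (q[0] * q[0] + q[1] * q[1]) ** 1.5
    return (q[0] / r3, q[1] / r3)

def energy(p, q):
    return 0.5 * (p[0] ** 2 + p[1] ** 2) - 1.0 / math.hypot(q[0], q[1])

def control(p, q):
    """G = -p^T q / q^T q, which lets rho track Q = 1/|q|."""
    return -(p[0] * q[0] + p[1] * q[1]) / (q[0] ** 2 + q[1] ** 2)

def verlet_step(p, q, dt):
    g = grad_U(q)
    ph = (p[0] - 0.5 * dt * g[0], p[1] - 0.5 * dt * g[1])   # half step in p
    q = (q[0] + dt * ph[0], q[1] + dt * ph[1])               # full step in q (T_p(p) = p)
    g = grad_U(q)
    p = (ph[0] - 0.5 * dt * g[0], ph[1] - 0.5 * dt * g[1])   # second half step in p
    return p, q

def integrate(n_steps, ecc=0.8, eps=None, dt_fixed=None):
    """Adaptive mode if eps is given, constant steps if dt_fixed is given."""
    q = (1.0 - ecc, 0.0)
    p = (0.0, math.sqrt((1.0 + ecc) / (1.0 - ecc)))
    h0 = energy(p, q)
    t, max_err, rho = 0.0, 0.0, 1.0          # rho(0) = 1, t(0) = 0
    for n in range(n_steps):
        if eps is not None:
            # assumed start-up: half step for rho first, then the recursion (11.18)
            rho += (0.5 * eps if n == 0 else eps) * control(p, q)
            dt = eps / rho                    # step size from (11.17)
        else:
            dt = dt_fixed
        p, q = verlet_step(p, q, dt)
        t += dt
        max_err = max(max_err, abs(energy(p, q) - h0))
    return t, max_err

n_steps = 1000
t_end, err_adaptive = integrate(n_steps, eps=0.004)
_, err_fixed = integrate(n_steps, dt_fixed=t_end / n_steps)
print("final time:", t_end)
print("max energy error, adaptive:", err_adaptive)
print("max energy error, constant:", err_fixed)
```

Both runs use the same number of force evaluations, so the comparison reflects equal work; only the distribution of the steps along the orbit differs.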

Part III Boundary Value Problems

Chapter 12 Two-point Boundary Value Problems

Boundary value problems (BVPs) occur in materials science, structural analysis, physics and similar applications. They are often connected to time dependent partial differential equations, where the BVP represents a stationary solution. The simplest BVP has one independent variable, $x$, usually interpreted as “space,” and one dependent variable, $u$, which is then a scalar function of $x$. The standard case is to consider a second order differential equation on a compact interval, say $x \in [0,1]$, with one boundary condition at each endpoint of the interval.

12.1 The Poisson equation in 1D. Boundary conditions

We begin by considering the simplest two-point boundary value problem (2pBVP),

$$-u'' = f(x); \quad u(0) = u_L, \quad u(1) = u_R. \tag{12.1}$$

Here $f$ is a given data function on $\Omega = [0,1]$, and the specified type of boundary conditions, defining the value of the solution on the boundary $\partial\Omega$, are referred to as Dirichlet conditions.

There are also other types of boundary conditions. Neumann conditions refer to

$$u(0) = u_L, \quad u'(1) = u'_R, \tag{12.2}$$

which combine prescribing the value of $u$ at one endpoint, and the derivative $u'$ at the other endpoint.

Robin conditions refer to a linear combination of the value of $u$ and its derivative at one of the endpoints, as in

$$u(0) = u_L, \quad \alpha u(1) + \beta u'(1) = \gamma, \tag{12.3}$$

where the numbers $\alpha$, $\beta$ and $\gamma$ are given, in addition to $u_L$.


In general, the terms Dirichlet condition, Neumann condition and Robin condition (in singular) are commonly used for any single condition having the structure indicated above. Such a condition can be prescribed at either one of the boundary points, but a Dirichlet condition is needed at the other endpoint.

The problem (12.1) may look over-simplified. It is linear, and solving the problem is (technically) only a matter of integrating the data function $f$ twice, determining the constants of integration from the boundary conditions. However, there is more to this problem; it can be viewed as the Poisson equation in 1D. The Poisson equation occurs frequently in applied mathematics, and is the foremost example of an elliptic problem. Thus we shall see that (12.1) is well worth the study. Some important insights gained from this problem will carry over to the 2D case, where the Poisson equation is a partial differential equation.

There are basically two different approaches to solve this problem – the finite difference method (FDM), and the finite element method (FEM). We shall study both, starting with the FDM, since it can be understood intuitively using elementary methodology. The FEM is theoretically much more advanced, but in the 1D case it, too, becomes easily accessible.

12.2 Existence and uniqueness

Just like with initial value problems, we need to investigate existence and uniqueness before we attempt to solve the boundary value problem numerically. In BVPs, existence and uniqueness are more intricate issues than in IVPs. However, in (12.1), these questions are simple. The problem is linear, and its solution consists of a homogeneous solution and a particular solution. The particular solution is only a matter of whether $f$ is twice integrable, and the homogeneous solution is simply a straight line, $u_H(x) = C_0 + C_1 x$. The constants are determined by the boundary conditions.

Recalling that the 1D Poisson equation is a special example of a 2pBVP, one needs to discuss existence and uniqueness for more general problems. The question is then more difficult. Let us consider 2pBVPs of the form

$$-u'' + au = f(x); \quad u(0) = u_L, \quad u(1) = u_R. \tag{12.4}$$

(Further terms could also be added.) Because the homogeneous solutions now have a different character, a more advanced theory is needed. The basic result is:

Theorem 12.1. (Fredholm alternative) Let $f \in L^2[0,1]$. The two-point BVP (12.4) either has a unique solution, or the homogeneous problem

$$-u'' + au = 0; \quad u(0) = 0, \quad u(1) = 0 \tag{12.5}$$

has nontrivial solutions.


This needs an explanation. The Fredholm alternative allows us to determine whether there is a unique solution by merely considering the homogeneous problem. Note that the homogeneous problem does not only have a zero right-hand side, but homogeneous boundary conditions as well. Naturally, the trivial solution to the homogeneous problem (12.5) is the zero solution, $u(x) \equiv 0$. Can there be nonzero solutions?

The answer is yes. For $a < 0$, the solutions to the homogeneous equation $-u'' + au = 0$ are

$$u(x) = C_0 \cos\sqrt{-a}\,x + C_1 \sin\sqrt{-a}\,x.$$

The left boundary condition $u(0) = 0$ implies $C_0 = 0$, while the right boundary condition $u(1) = 0$ implies

$$0 = C_1 \sin\sqrt{-a}.$$

The obvious solution is $C_1 = 0$, leading to the trivial solution $u(x) \equiv 0$. But there is another possibility: we could have $\sin\sqrt{-a} = 0$, which holds whenever

$$a = -k^2\pi^2,$$

for any positive integer $k$. Thus it could happen that the homogeneous problem has nontrivial solutions. Suppose that $u_P(x)$ is a particular solution to (12.4), and that $a = -k^2\pi^2$. Then

$$u(x) = u_P(x) + C_1 \sin k\pi x$$

is also a solution, for every $C_1$. Therefore, the solution is not unique. However, if $a \neq -k^2\pi^2$ for every positive integer $k$, then the homogeneous problem only has the trivial solution (for $a \ge 0$ the homogeneous solutions are not oscillatory and cannot vanish at both endpoints), and the solution is unique, all in accordance with the Fredholm alternative for (12.4).

The Fredholm alternative applies to more general equations, usually formulated as $\mathcal{L}u = f$, where $\mathcal{L}$ is a linear differential operator. A sufficient condition for establishing unique solutions is to demonstrate that the operator $\mathcal{L}$ is elliptic. This means that, for all nonzero functions $u$ satisfying $u(0) = u(1) = 0$, the operator must satisfy

$$\langle u, \mathcal{L}u \rangle > 0, \tag{12.6}$$

where the inner product is defined by

$$\langle u, v \rangle = \int_0^1 uv \,\mathrm{d}x. \tag{12.7}$$

12.3 Notation: Spaces and norms

In order to develop a theory and a methodology for 2pBVPs, notation is important, together with the choice of norms. We let the computational domain be denoted by $\Omega$, with boundary $\partial\Omega$. In most cases, we will take $\Omega = [0,1]$. In this standard setting, $\partial\Omega$ only consists of two points, $x = 0$ and $x = 1$, where the boundary conditions are imposed.

The space of square integrable functions on $\Omega$ is denoted by $L^2(\Omega)$, or, if the computational domain is made specific, by $L^2[0,1]$. We write $f \in L^2(\Omega)$ for any function defined on $\Omega$, such that

$$\|f\|^2_{L^2(\Omega)} = \int_\Omega |f(x)|^2 \,\mathrm{d}x < \infty.$$

The norm on $L^2(\Omega)$ is associated with the inner product $\langle \cdot, \cdot \rangle$. For $u, v \in L^2(\Omega)$ it is defined by

$$\langle u, v \rangle = \int_\Omega uv \,\mathrm{d}x,$$

and $\|u\|^2_{L^2(\Omega)} = \langle u, u \rangle$. In most cases, we will use the simplified notation $\|f\|_2$ for the $L^2$-norm of a function.

The space $C^p(\Omega)$ consists of all $p$ times continuously differentiable functions. Usually we need to be more specific. Thus we say that $u \in H^1(\Omega)$ if $u' \in L^2(\Omega)$. Because we also need to impose boundary conditions on the function $u$, we introduce the function space

$$H^1_0(\Omega) = \{\, u : u' \in L^2(\Omega) \text{ and } u = 0 \text{ on } \partial\Omega \,\}.$$

In the standard setting with $\Omega = [0,1]$, this corresponds to differentiable functions with $\|u'\|^2_{L^2[0,1]} < \infty$, satisfying homogeneous boundary conditions, $u(0) = u(1) = 0$. An example of a function in this space is $u(x) = \sin \pi x$.

If $f \in L^2[0,1]$ and we seek a solution to the 1D Poisson equation, $-u'' = f$ with $u(0) = u(1) = 0$, we are looking for a function $u \in H^1_0[0,1] \cap H^2[0,1]$. The space $H^1(\Omega)$ plays an important role. It is referred to as a Sobolev space, and is usually equipped with its own norm, defined as

$$\|u\|^2_{H^1(\Omega)} = \|u\|^2_{L^2(\Omega)} + \|u'\|^2_{L^2(\Omega)}.$$

For our purposes, however, it will be sufficient to consider the standard $L^2(\Omega)$ norm of functions, even though we will have to require that solutions satisfy $u \in H^1_0[0,1] \cap H^2[0,1]$.

12.4 Integration by parts and ellipticity

To show ellipticity, we need more advanced tools. Let us consider the operator

$$\mathcal{L} = -\frac{\mathrm{d}^2}{\mathrm{d}x^2},$$

and the 1D Poisson equation $\mathcal{L}u = f$ with homogeneous boundary conditions, $u(0) = u(1) = 0$. We shall prove that, for all sufficiently differentiable functions $u$ satisfying the boundary conditions, it holds that

$$0 < m_2[\mathcal{L}] \cdot \|u\|_2^2 \le \langle u, \mathcal{L}u \rangle,$$

where $m_2[\mathcal{L}]$ is the lower logarithmic norm of $\mathcal{L}$. Thus, even differential operators have logarithmic norms, and $m_2[\mathcal{L}] > 0$ implies that $\mathcal{L}$ is elliptic. By the uniform monotonicity theorem,

$$m_2[\mathcal{L}] > 0 \quad\Rightarrow\quad \|\mathcal{L}^{-1}\|_2 \le \frac{1}{m_2[\mathcal{L}]}.$$

Thus an elliptic operator has a bounded, continuous inverse, and for the problem $\mathcal{L}u = f$ it follows that

$$\|u\|_2 = \|\mathcal{L}^{-1} f\|_2 \le \frac{\|f\|_2}{m_2[\mathcal{L}]}.$$

As a result, $\|f\|_2 \to 0$ implies that $\|u\|_2 \to 0$, i.e., the solution is unique and depends continuously on the data $f$. This is traditionally expressed by saying that the problem is well posed. We also note that the bound breaks down if $m_2[\mathcal{L}] \to 0+$, which corresponds to loss of ellipticity. This explains why ellipticity is a key property in boundary value problems.

The discussion above is still sketchy, and we need to qualify and quantify the results. More specifically, we shall show that $\mathcal{L} = -\mathrm{d}^2/\mathrm{d}x^2$ is elliptic for all $u \in H^1_0[0,1] \cap H^2[0,1]$, with inner product defined by (12.7). As mentioned above, this is a Sobolev space of differentiable functions with $u' \in L^2[0,1]$, satisfying the boundary conditions $u(0) = u(1) = 0$.

Consider $u, v \in H^1_0[0,1] \cap H^1[0,1]$. By the Leibniz rule, $(uv)' = u'v + uv'$. Integration yields

$$\int_0^1 uv' \,\mathrm{d}x = [uv]_0^1 - \int_0^1 u'v \,\mathrm{d}x.$$

Due to the boundary conditions, we have $[uv]_0^1 = 0$. Consequently,

$$\int_0^1 uv' \,\mathrm{d}x = -\int_0^1 u'v \,\mathrm{d}x.$$

In terms of the inner product (12.7), integration by parts can then be written

$$\langle u, v' \rangle = -\langle u', v \rangle.$$

Integration by parts is a key technique in the analysis of boundary value problems and in finite element analysis.

To illustrate this, we consider $\mathcal{L} = -\mathrm{d}^2/\mathrm{d}x^2$ for functions $u \in H^1_0[0,1] \cap H^2[0,1]$. Integration by parts then yields

$$-\langle u, u'' \rangle = \langle u', u' \rangle = \|u'\|_2^2 > 0, \tag{12.8}$$

whenever $u$ is a nonzero function. Note that $\|u'\|_2 = 0$ is equivalent to $u' = 0$; then $u$ must be constant, and in fact $u(x) \equiv 0$, due to the homogeneous boundary conditions satisfied by all functions $u \in H^1_0[0,1] \cap H^2[0,1]$. Therefore the expressions in (12.8) are strictly positive whenever $u \neq 0$. This shows that $\mathcal{L} = -\mathrm{d}^2/\mathrm{d}x^2$ is elliptic.

However, we can say more. In (12.8), we have not yet determined $m_2[-\mathrm{d}^2/\mathrm{d}x^2]$. This requires that we find a lower bound of $\|u'\|_2^2$ for functions $u \in H^1_0[0,1] \cap H^2[0,1]$. To be precise, we have to find $m_2[-\mathrm{d}^2/\mathrm{d}x^2]$, such that

$$\|u'\|_2^2 \ge m_2[-\mathrm{d}^2/\mathrm{d}x^2] \cdot \|u\|_2^2.$$

We can determine the largest constant $m_2[-\mathrm{d}^2/\mathrm{d}x^2]$ for which this inequality holds. Since $u(0) = u(1) = 0$, any function $u \in H^1_0[0,1]$ can be written as a Fourier series

$$u = \sqrt{2} \sum_{k=1}^{\infty} c_k \sin k\pi x,$$

and by Parseval’s theorem, we have $\|u\|_2^2 = \sum_{k=1}^{\infty} c_k^2$. Likewise,

$$u' = \sqrt{2} \sum_{k=1}^{\infty} c_k k\pi \cos k\pi x,$$

and it follows that

$$\|u'\|_2^2 = \sum_{k=1}^{\infty} k^2\pi^2 c_k^2 = \pi^2 \sum_{k=1}^{\infty} k^2 c_k^2 \ge \pi^2 \|u\|_2^2.$$

This inequality is sharp, since equality holds for the function $u(x) = \sqrt{2}\sin \pi x$, for which $\|u\|_2^2 = 1$ and $\|u'\|_2^2 = \pi^2$. Thus we have obtained the following result:

Theorem 12.2. Let $u \in H^1_0[0,1]$. Then the Poincaré inequality

$$\|u'\|_2 \ge \pi \|u\|_2 \tag{12.9}$$

holds, and for the operator $-\mathrm{d}^2/\mathrm{d}x^2$ on $H^1_0[0,1] \cap H^2[0,1]$, it holds that

$$m_2\!\left[-\frac{\mathrm{d}^2}{\mathrm{d}x^2}\right] = \pi^2. \tag{12.10}$$

Thus we have shown that $\mathcal{L}$ is an elliptic operator. The Poincaré inequality is the simplest case of Sobolev inequalities, which offer lower bounds on derivatives in terms of the norm of $u$. Note that there is no upper bound, since $\mathrm{d}^2/\mathrm{d}x^2$ is an unbounded operator; no matter how small (but nonzero) the function $u$ is, $u'$ can be arbitrarily large.
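The Poincaré inequality can be checked numerically on concrete functions. The sketch below is an illustration only: the test function $u(x) = x(1-x)$ and the midpoint quadrature are choices made here. It computes the ratio $\|u'\|_2^2/\|u\|_2^2$, which for this polynomial equals $(1/3)/(1/30) = 10$, slightly above $\pi^2 \approx 9.87$, and repeats the computation for $\sqrt{2}\sin\pi x$, where the bound is attained.

```python
import math

def l2_norm_sq(func, n=20000):
    """Composite midpoint rule for the integral of func(x)**2 over [0, 1]."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) ** 2 for i in range(n))

def u(x):
    return x * (1.0 - x)            # satisfies u(0) = u(1) = 0

def du(x):
    return 1.0 - 2.0 * x            # derivative of u

def s(x):
    return math.sqrt(2.0) * math.sin(math.pi * x)

def ds(x):
    return math.sqrt(2.0) * math.pi * math.cos(math.pi * x)

ratio = l2_norm_sq(du) / l2_norm_sq(u)         # exact value 10
sharp_ratio = l2_norm_sq(ds) / l2_norm_sq(s)   # exact value pi^2

print("x(1-x):       ", ratio, ">=", math.pi ** 2)
print("sqrt(2)sin(pi x):", sharp_ratio)
```

The closeness of 10 to $\pi^2$ reflects that $x(1-x)$ is already a good approximation of the first sine mode.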


12.5 Self-adjoint operators

Let $u, v \in H^1_0[0,1] \cap H^2[0,1]$ and let $\mathcal{L}$ be a linear operator. Its adjoint operator $\mathcal{L}^*$ is defined by the operation

$$\langle v, \mathcal{L}u \rangle = \langle \mathcal{L}^* v, u \rangle.$$

This is an operator analogue of taking the transpose of a matrix in linear algebra. Thus, if $A \in \mathbb{R}^{d\times d}$ is a matrix, and $u, v \in \mathbb{R}^d$ are two vectors, with inner product $\langle v, u \rangle = v^* u$, where $v^*$ denotes the transposed vector, then

$$\langle v, Au \rangle = v^* A u = (A^* v)^* u = \langle A^* v, u \rangle,$$

where $A^*$ denotes the transpose of $A$. Hence the transpose $A^*$ is the adjoint of $A$.

However, a linear differential operator does not have a transpose, even though an adjoint operator exists. The adjoint plays a role similar to that of the transpose in algebra. When we solve (12.1) numerically by a finite difference method, our discretization of this linear problem results in a linear algebraic system,

$$L_N u = f, \tag{12.11}$$

where we want the $N \times N$ matrix $L_N$ to have properties reflecting the properties of $\mathcal{L}$.

We therefore need to find out what properties $\mathcal{L}$ has. Let us begin by considering $\mathcal{L} = -\mathrm{d}^2/\mathrm{d}x^2$ on $u, v \in H^1_0[0,1] \cap H^2[0,1]$. We then have, integrating by parts twice,

$$\langle v, \mathcal{L}u \rangle = -\langle v, u'' \rangle = \langle v', u' \rangle = -\langle v'', u \rangle = \langle \mathcal{L}^* v, u \rangle.$$

Thus $\mathcal{L}u = -u''$, and $\mathcal{L}^* v = -v''$, for all $u, v$. Therefore $\mathcal{L}^* = \mathcal{L}$.

Definition 12.1. A linear operator satisfying $\mathcal{L}^* = \mathcal{L}$ is called self-adjoint.

A self-adjoint operator is a counterpart to the notion of a symmetric matrix. This is a property we want to retain when we discretize our BVP. Thus, in the 1D Poisson equation with homogeneous boundary conditions, $-u'' = f$, we obtain the linear system (12.11), where we wish to have $L_N^* = L_N$, i.e., we want $L_N$ to be a symmetric matrix, since for the differential operator

$$\mathcal{L} = -\frac{\mathrm{d}^2}{\mathrm{d}x^2}$$

it holds that $\mathcal{L}^* = \mathcal{L}$.

It is important to note that not all operators are self-adjoint. For example, if we consider $\mathcal{L} = \mathrm{d}/\mathrm{d}x$ on $u, v \in H^1_0[0,1] \cap H^2[0,1]$, we integrate by parts to obtain

$$\langle v, \mathcal{L}u \rangle = \langle v, u' \rangle = -\langle v', u \rangle = \langle \mathcal{L}^* v, u \rangle.$$

Thus $\mathcal{L}u = u'$ and $\mathcal{L}^* v = -v'$. It follows that $\mathcal{L}^* = -\mathcal{L}$.


Definition 12.2. A linear operator satisfying $\mathcal{L}^* = -\mathcal{L}$ is called anti-selfadjoint.

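This is a counterpart to skew-symmetric matrices, which are defined by the property $A^* = -A$. The symmetry and anti-symmetry properties have important consequences. We have already computed the lower logarithmic norm $m_2[-\mathrm{d}^2/\mathrm{d}x^2]$, and we now wish to do the same for $\mathrm{d}/\mathrm{d}x$. To this end, we note that

$$m_2\!\left[\frac{\mathrm{d}}{\mathrm{d}x}\right] \cdot \|u\|_2^2 \le \langle u, u' \rangle \le M_2\!\left[\frac{\mathrm{d}}{\mathrm{d}x}\right] \cdot \|u\|_2^2. \tag{12.12}$$

Now, integrating by parts, we get $\langle u, u' \rangle = -\langle u', u \rangle$, since $\mathrm{d}/\mathrm{d}x$ is anti-selfadjoint on $H^1_0[0,1] \cap H^2[0,1]$. However, recalling that

$$\langle v, u \rangle = \int_0^1 vu \,\mathrm{d}x = \int_0^1 uv \,\mathrm{d}x = \langle u, v \rangle,$$

we have $\langle u, u' \rangle = -\langle u', u \rangle = -\langle u, u' \rangle$. It follows that $\langle u, u' \rangle = 0$ for all $u \in H^1_0[0,1] \cap H^2[0,1]$. Thus every differentiable function $u$ satisfying $u(0) = u(1) = 0$ is orthogonal to its derivative $u'$. As a consequence, putting $\langle u, u' \rangle = 0$ in (12.12), we find the logarithmic norms

$$m_2\!\left[\frac{\mathrm{d}}{\mathrm{d}x}\right] = M_2\!\left[\frac{\mathrm{d}}{\mathrm{d}x}\right] = 0. \tag{12.13}$$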
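The orthogonality of $u$ and $u'$ can also be verified directly, without appealing to the adjoint. The short computation below uses only the chain rule and the boundary conditions $u(0) = u(1) = 0$:

$$\langle u, u' \rangle = \int_0^1 u\,u' \,\mathrm{d}x = \int_0^1 \frac{\mathrm{d}}{\mathrm{d}x}\!\left(\frac{u^2}{2}\right) \mathrm{d}x = \frac{u(1)^2 - u(0)^2}{2} = 0.$$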

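With this information, we can consider more interesting operators. For example, if we consider the boundary value problem $\mathcal{L}u = f$ with homogeneous Dirichlet boundary conditions $u(0) = u(1) = 0$ and

$$\mathcal{L}u = -u'' + au' + bu,$$

we find, integrating by parts,

$$\langle u, \mathcal{L}u \rangle = \langle u, -u'' + au' + bu \rangle = \langle u, -u'' \rangle + a\langle u, u' \rangle + b\langle u, u \rangle \ge (\pi^2 + b) \cdot \|u\|_2^2.$$

Thus

$$m_2[\mathcal{L}] = m_2\!\left[-\frac{\mathrm{d}^2}{\mathrm{d}x^2} + a\frac{\mathrm{d}}{\mathrm{d}x} + b\right] = \pi^2 + b.$$

Therefore $\mathcal{L}$ is elliptic on $H^1_0[0,1] \cap H^2[0,1]$ as long as $b > -\pi^2$, independent of $a$. It follows that the problem

$$-u'' + au' + bu = f, \quad u(0) = u(1) = 0,$$

is well posed for $b > -\pi^2$ and for all $a \in \mathbb{R}$. The solution is unique, and depends continuously on the data, as

$$\|u\|_2 \le \frac{\|f\|_2}{\pi^2 + b}.$$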
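As a quick sanity check of this bound, consider the special case $a = 0$ with the data $f(x) = \sqrt{2}\sin \pi x$ (an illustrative choice made here, not taken from the text). Since $-f'' = \pi^2 f$, the function $u = f/(\pi^2 + b)$ satisfies the equation and the boundary conditions:

$$-u'' + bu = \frac{\pi^2 + b}{\pi^2 + b}\, f = f, \qquad \|u\|_2 = \frac{\|f\|_2}{\pi^2 + b} = \frac{1}{\pi^2 + b}.$$

In this case the bound holds with equality, so it cannot be improved in general. It also shows how the solution grows without bound as $b \to -\pi^2$, where ellipticity is lost.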


12.6 Sturm–Liouville eigenvalue problems

Eigenvalue problems play an important role in applied mathematics, with applications in structural analysis, quantum mechanics, wave problems and resonance analysis. Besides solving the plain 2pBVPs of the previous sections, we shall also develop methods for solving eigenvalue problems.

An eigenvalue problem for a differential operator is a BVP of the form

$$\mathcal{L}u = \lambda u, \tag{12.14}$$

where $\lambda$ is a scalar eigenvalue, and $u$ is the corresponding eigenfunction. The boundary conditions are homogeneous. Thus, in the 2pBVP case, the boundary conditions are either homogeneous Dirichlet conditions $u(0) = u(1) = 0$, or homogeneous Neumann conditions, corresponding to $u(0) = 0$ and $u'(1) = 0$, or $u'(0) = 0$ and $u(1) = 0$.

The reason for the homogeneous boundary conditions is that the entire eigenvalue problem is homogeneous. This means that if $u(x)$ is an eigenfunction, then so is $\alpha u(x)$ for any scalar $\alpha \neq 0$, since $\alpha u$ also satisfies (12.14) as well as the chosen boundary conditions. As a consequence, the eigenfunctions are only determined up to a constant factor, even though the eigenvalues are unique. An example is the vibration modes of a string; the frequency (corresponding to the eigenvalue $\lambda$) is well defined, but the amplitude (corresponding to $\alpha$) is not; one can have a large or small amplitude oscillation.

Example The simplest eigenvalue problem is

$$-u'' = \lambda u, \quad u(0) = u(1) = 0.$$

Its general solution is $A\sin\sqrt{\lambda}\,x + B\cos\sqrt{\lambda}\,x$. Imposing the Dirichlet condition $u(0) = 0$ implies $B = 0$. Imposing the second Dirichlet condition $u(1) = 0$ implies

$$A\sin\sqrt{\lambda} = 0.$$

Here we cannot take the amplitude $A$ to be zero, since the solution is then identically zero. Rather, we must have $\sin\sqrt{\lambda} = 0$. Hence the eigenvalues and eigenfunctions are

$$\lambda_k = k^2\pi^2, \quad k \in \mathbb{Z}^+ \tag{12.15}$$

$$u_k(x) = \sin k\pi x. \tag{12.16}$$

We note that there is an infinite sequence of eigenvalues and corresponding eigenfunctions. The amplitude $A$ of the eigenfunction is undetermined, but the eigenfunctions can be normalized to form an orthonormal basis, $\{u_k\}_{k=1}^{\infty}$ with $u_k(x) = \sqrt{2}\sin k\pi x$. The amplitude $\sqrt{2}$ is chosen so that

$$\langle u_i, u_j \rangle = \delta_{ij},$$

where $\delta_{ij}$ is the Kronecker delta. This orthonormal basis is, naturally, the standard basis for Fourier analysis on $H^1_0[0,1] \cap H^2[0,1]$.

Thus we have determined the basic properties of the operator $-\mathrm{d}^2/\mathrm{d}x^2$. It is self-adjoint, and it is elliptic with lower logarithmic norm $m_2[-\mathrm{d}^2/\mathrm{d}x^2] = \pi^2$.

The eigenvalue problem above is the simplest example of a Sturm–Liouville eigenvalue problem. These are eigenvalue problems for general self-adjoint operators, i.e., operators satisfying $\mathcal{L}^* = \mathcal{L}$. The general form is

$$-\frac{\mathrm{d}}{\mathrm{d}x}\!\left(p(x)\frac{\mathrm{d}u}{\mathrm{d}x}\right) + q(x)u = \lambda u, \tag{12.17}$$

together with homogeneous boundary conditions. In this equation, the scalar coefficients must satisfy $p(x) > 0$ and $q(x) \ge 0$. The simple worked example above has $p(x) \equiv 1$ and $q(x) \equiv 0$.

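Let us now show that $\mathcal{L}$, defined by

$$\mathcal{L}u = -(pu')' + qu,$$

is self-adjoint. To this end, we consider $\langle v, \mathcal{L}u \rangle$ and integrate by parts,

$$\begin{aligned}
\langle v, -(pu')' + qu \rangle &= \langle v, -(pu')' \rangle + \langle v, qu \rangle \\
&= \langle v', pu' \rangle + \langle qv, u \rangle \\
&= \langle pv', u' \rangle + \langle qv, u \rangle \\
&= \langle -(pv')', u \rangle + \langle qv, u \rangle \\
&= \langle -(pv')' + qv, u \rangle = \langle \mathcal{L}v, u \rangle.
\end{aligned}$$

By definition, it holds that $\langle v, \mathcal{L}u \rangle = \langle \mathcal{L}^* v, u \rangle$. Consequently, $\mathcal{L}^* = \mathcal{L}$, showing that $\mathcal{L}$ is self-adjoint.

We note that there is no first derivative appearing in the Sturm–Liouville operator (12.17). The reason is, naturally, that $\mathrm{d}/\mathrm{d}x$ is anti-selfadjoint, and would change the properties of the operator $\mathcal{L}$. In spite of this, one can also consider eigenvalue problems involving $\mathrm{d}/\mathrm{d}x$, although there will then be a loss of structure.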

Self-adjoint operators have several important properties. Among the most important are the following.

Theorem 12.3. Self-adjoint operators have real eigenvalues and orthogonal eigenfunctions.

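Proof. Let $\mathcal{L}$ be self-adjoint, and let $\mathcal{L}u = \lambda_1 u$ and $\mathcal{L}v = \lambda_2 v$. We first prove that $\lambda_1$ is real. As $\mathcal{L}^* = \mathcal{L}$, we have

$$\begin{aligned}
\langle u, \mathcal{L}u \rangle &= \langle u, \lambda_1 u \rangle = \lambda_1 \|u\|_2^2 \\
&= \langle \mathcal{L}^* u, u \rangle = \langle \mathcal{L}u, u \rangle = \langle \lambda_1 u, u \rangle = \lambda_1^* \|u\|_2^2,
\end{aligned}$$

and it follows that $\lambda_1^* = \lambda_1$, so $\lambda_1$ is real. The same applies to all other eigenvalues. Now consider the two eigenpairs $\lambda_1, u$ and $\lambda_2, v$, with $\lambda_1 \neq \lambda_2$. Then

$$\begin{aligned}
\langle v, \mathcal{L}u \rangle &= \langle v, \lambda_1 u \rangle = \lambda_1 \langle v, u \rangle \\
&= \langle \mathcal{L}v, u \rangle = \langle \lambda_2 v, u \rangle = \lambda_2 \langle v, u \rangle.
\end{aligned}$$

Hence $(\lambda_1 - \lambda_2)\langle v, u \rangle = 0$. Since $\lambda_1 \neq \lambda_2$, it follows that $\langle v, u \rangle = 0$, proving orthogonality.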

It is worthwhile noting that for an anti-selfadjoint operator ($\mathcal{L}^* = -\mathcal{L}$) there is a similar result – the eigenvalues are then purely imaginary, but the eigenfunctions remain orthogonal. This is in full agreement with the properties of skew-symmetric matrices. These have imaginary eigenvalues and orthogonal eigenvectors.

In fact, there is a more general result. An operator is normal if $\mathcal{L}^*\mathcal{L} = \mathcal{L}\mathcal{L}^*$. (This condition obviously holds if $\mathcal{L}^* = \pm\mathcal{L}$.) Such operators all have orthogonal eigenfunctions (hence the name “normal”), but the eigenvalues may now be complex. These results hold both in function spaces and in coordinate vector spaces.

In summary, a self-adjoint operator has real eigenvalues and orthogonal eigenfunctions. Since the regular Sturm–Liouville problem is self-adjoint, all such problems have this property. Above, in the worked example, we have seen that this applies to the operator $-\mathrm{d}^2/\mathrm{d}x^2$ with homogeneous Dirichlet boundary conditions, but a similar behavior holds for all Sturm–Liouville operators.

When constructing numerical methods for such problems, it is important that the discretization preserves these properties. All properties of normal operators are analogous in the discrete setting. Thus, if $\mathcal{L}$ is normal, the numerical method should replicate these properties, in order to recover the proper behavior of the original problem.

Chapter 13
Finite difference methods

Finite difference methods (FDM) are based on approximating derivatives by difference quotients. The basis is a forward difference,

\[
y'(x) = \frac{y(x+\Delta x) - y(x)}{\Delta x} + O(\Delta x) \tag{13.1}
\]

or a backward difference,

\[
y'(x) = \frac{y(x) - y(x-\Delta x)}{\Delta x} + O(\Delta x). \tag{13.2}
\]

Both are first order approximations. Taking the average of the two yields a symmetric difference quotient,

\[
y'(x) = \frac{y(x+\Delta x) - y(x-\Delta x)}{2\Delta x} + O(\Delta x^2). \tag{13.3}
\]

This is a second order approximation. Symmetric difference approximations are in general at least second order accurate. Therefore one can usually construct second order convergent FDMs.

In a similar way, we can approximate a second order derivative by the symmetric difference quotient

\[
y''(x) = \frac{1}{\Delta x}\left(\frac{y(x+\Delta x) - y(x)}{\Delta x} - \frac{y(x) - y(x-\Delta x)}{\Delta x}\right) + O(\Delta x^2)
= \frac{y(x+\Delta x) - 2y(x) + y(x-\Delta x)}{\Delta x^2} + O(\Delta x^2).
\]

The orders of these approximations are easily verified by Taylor series expansions.

Finite difference methods for 2pBVPs are constructed from these approximations and some further variants building on the same techniques. This converts the differential equation into an approximating algebraic problem, which can be solved with a finite computational effort.
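The orders stated above can also be observed numerically. The following is a minimal sketch, assuming the test function y(x) = sin x and the evaluation point x = 1 (both chosen only for illustration); it estimates the order p from the error ratio when ∆x is halved.

```python
import math

def forward_diff(y, x, dx):
    return (y(x + dx) - y(x)) / dx                     # (13.1)

def backward_diff(y, x, dx):
    return (y(x) - y(x - dx)) / dx                     # (13.2)

def central_diff(y, x, dx):
    return (y(x + dx) - y(x - dx)) / (2 * dx)          # (13.3)

def second_diff(y, x, dx):
    return (y(x + dx) - 2 * y(x) + y(x - dx)) / dx**2  # symmetric quotient for y''

def observed_order(approx, exact, x, dx):
    """Estimate p from err ~ C*dx^p by halving dx (test function: sin)."""
    e1 = abs(approx(math.sin, x, dx) - exact)
    e2 = abs(approx(math.sin, x, dx / 2) - exact)
    return math.log2(e1 / e2)

x0, dx0 = 1.0, 1e-2   # illustrative choices, not from the text
orders = {
    "forward": observed_order(forward_diff, math.cos(x0), x0, dx0),
    "backward": observed_order(backward_diff, math.cos(x0), x0, dx0),
    "central": observed_order(central_diff, math.cos(x0), x0, dx0),
    "second": observed_order(second_diff, -math.sin(x0), x0, dx0),
}
for name, p in orders.items():
    print(f"{name:9s} observed order ~ {p:.2f}")
```

The one-sided quotients should report an order close to 1 and the symmetric ones an order close to 2; for much smaller ∆x, round-off in the subtraction would start to contaminate the estimate.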


13.1 FDM for the 1D Poisson equation

The 1D Poisson equation, $-u'' = f$ on $\Omega = [0,1]$, with Dirichlet boundary conditions $u(0) = u_L$ and $u(1) = u_R$, is both the simplest 2pBVP and the simplest example of how to construct and apply a finite difference method. For this reason we take this model equation to describe the main techniques and ideas of finite difference methods for 2pBVPs. This includes proving that the FDM is convergent. Later, we shall extend the techniques to also discuss Neumann and other boundary conditions, as well as eigenvalue problems and non-selfadjoint boundary value problems.

The independent variable, $x \in \Omega$, is discretized by choosing an equidistant grid $\Omega_N = \{x_j\}_{j=0}^{N+1} \subset \Omega$, where the grid points are given by

\[
x_j = j \cdot \Delta x,
\]

and where the spacing between the grid points is referred to as the mesh width,

\[
\Delta x = \frac{1}{N+1}.
\]

This means that $x_0 = 0$ is the left endpoint, and $x_{N+1} = 1$ is the right endpoint. In between, there are N internal points in the interval [0,1]. In case one wants to solve the problem on an interval $\Omega = [0,L]$ the only change is that $\Delta x = L/(N+1)$. Below, we shall only describe the method on the unit interval.

The solution will be approximated numerically on this grid, by a vector $u = \{u_j\}_1^N$ of internal points, augmented by the boundary values, $u_0 = u_L$ and $u_{N+1} = u_R$. The vector u approximates the solution to the differential equation, according to $u_j \approx u(x_j)$. Now the second order derivative $-u''$ is approximated by a symmetric finite difference,

\[
-u'' \approx -\frac{u_{j-1} - 2u_j + u_{j+1}}{\Delta x^2}.
\]

We thus obtain a linear system of equations,

\[
\begin{aligned}
\frac{-u_L + 2u_1 - u_2}{\Delta x^2} &= f(x_1) \\
\frac{-u_{j-1} + 2u_j - u_{j+1}}{\Delta x^2} &= f(x_j), \qquad j = 2 : N-1 \\
\frac{-u_{N-1} + 2u_N - u_R}{\Delta x^2} &= f(x_N),
\end{aligned}
\]

where the first and last equations include the boundary conditions, and the middle equation is the generic equation, only involving internal points. Given that the boundary values are known, we retain the unknowns in the left hand side, to get

[Figure 13.1: two panels over x ∈ [0,1]. Top panel: “Data function f”. Bottom panel: “FDM solution to -u'' = f”.]

Fig. 13.1 FDM solution of $-u'' = f$. The continuous data function f on [0,1] is sampled on the grid (red markers, top). The FDM system $L_N u = f + b$ is then solved to produce an approximate solution (red markers, bottom) with inhomogeneous Dirichlet conditions indicated separately

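\[
\begin{aligned}
\frac{2u_1 - u_2}{\Delta x^2} &= f(x_1) + \frac{u_L}{\Delta x^2} \\
\frac{-u_{j-1} + 2u_j - u_{j+1}}{\Delta x^2} &= f(x_j), \qquad j = 2 : N-1 \\
\frac{-u_{N-1} + 2u_N}{\Delta x^2} &= f(x_N) + \frac{u_R}{\Delta x^2}.
\end{aligned}
\]

This linear system is conveniently represented in matrix-vector form, as

\[
\frac{1}{\Delta x^2}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 \\
-1 & 2 & -1 & \cdots & 0 \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
0 & \cdots & 0 & -1 & 2
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}
+ \frac{1}{\Delta x^2}
\begin{pmatrix} u_L \\ 0 \\ \vdots \\ 0 \\ u_R \end{pmatrix}.
\tag{13.4}
\]

In compact form, the FDM discretization can be written

\[
L_N u = f + b. \tag{13.5}
\]

Here the vector $b \in \mathbb{R}^N$ represents the boundary conditions; in case the Dirichlet conditions are homogeneous we have b = 0. The N×N tridiagonal matrix can be written

\[
L_N = \frac{1}{\Delta x^2}\, T \tag{13.6}
\]

with

\[
T = \mathrm{tridiag}(-1 \;\; 2 \;\; -1). \tag{13.7}
\]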

Thus the self-adjoint problem $-u'' = f$, in the form $\mathcal{L}u = f$ with $\mathcal{L}^* = \mathcal{L}$ and u(0) = u(1) = 0, is approximated by the linear algebraic system $L_N u = f$, where the matrix is symmetric, i.e., $L_N^* = L_N$. The FDM solution of an example problem with inhomogeneous Dirichlet conditions is illustrated in Figure 13.1.

We shall later prove that the method constructed above is convergent of order p = 2. As $u = L_N^{-1}(f + b)$, convergence will depend on whether we can show that the inverse $L_N^{-1}$ is continuous (bounded). This requires that we explore the special properties of $L_N$ and similar tridiagonal matrices.
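The construction of (13.4) and (13.5) can be sketched in a few lines. The code below is only an illustration under stated assumptions: the tridiagonal system is solved with the Thomas algorithm (Gaussian elimination without pivoting, a choice made here and not prescribed by the text), and the test problem u(x) = 1 + x + sin πx is a manufactured example. The names `solve_tridiagonal` and `poisson_dirichlet` are hypothetical.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system (no pivoting).

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    """
    n = len(diag)
    d, b = list(diag), list(rhs)
    for i in range(1, n):                 # forward elimination
        w = lower[i] / d[i - 1]
        d[i] -= w * upper[i - 1]
        b[i] -= w * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x

def poisson_dirichlet(f, uL, uR, N):
    """Solve -u'' = f on [0,1], u(0) = uL, u(1) = uR, with N interior points."""
    dx = 1.0 / (N + 1)
    x = [j * dx for j in range(1, N + 1)]
    rhs = [f(xj) for xj in x]
    rhs[0] += uL / dx**2                  # boundary vector b in (13.5)
    rhs[-1] += uR / dx**2
    lower = [-1.0 / dx**2] * N            # L_N = T / dx^2, T = tridiag(-1 2 -1)
    diag = [2.0 / dx**2] * N
    upper = [-1.0 / dx**2] * N
    return x, solve_tridiagonal(lower, diag, upper, rhs)

# Manufactured test problem (an assumption for illustration):
# u(x) = 1 + x + sin(pi x), hence f = pi^2 sin(pi x), uL = 1, uR = 2.
exact = lambda x: 1.0 + x + math.sin(math.pi * x)
f = lambda x: math.pi**2 * math.sin(math.pi * x)
x, u = poisson_dirichlet(f, 1.0, 2.0, N=19)
max_err = max(abs(uj - exact(xj)) for xj, uj in zip(x, u))
print(f"N = 19, max error = {max_err:.2e}")
```

The Thomas algorithm needs only O(N) operations; skipping pivoting is safe in this sketch because the matrix is symmetric positive definite, a property established in the next section.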

13.2 Toeplitz matrices

In the 1D Poisson problem $\mathcal{L}u = f$ the operator $\mathcal{L} = -d^2/dx^2$ is elliptic, allowing us to solve this problem for all $f \in L^2[0,1]$. As we shall see, the symmetric second order discretization above, $L_N u = f + b$, is also a solvable problem, as $L_N$ is symmetric and positive definite. The symmetry was already established in (13.6), and we now have to demonstrate that $L_N$ is positive definite.

This is related to the eigenvalues of $L_N$, and, in particular, to its lower logarithmic norm. Thus we have to show that $m_2[L_N] > 0$. To establish such a result, we turn to the algebraic eigenvalue problem

\[
L_N u = \lambda u, \tag{13.8}
\]

where $L_N$ is given by (13.6). We also noted that $L_N = T/\Delta x^2$, where the tridiagonal matrix T is a Toeplitz matrix.

Definition 13.1. A matrix $A = \{a_{i,j}\}$ is called Toeplitz if $a_{i,j} = a_{i-j}$.

This means that along diagonals (whether the main diagonal or super- or subdiagonals) the matrix elements are constant. For example, on the main diagonal, i = j, so $a_{i,j} = a_0$ for all i, j.

Toeplitz matrices occur frequently in applied mathematics. They typically occur when local operators (such as difference operators or digital filters) are represented on equidistant grids. Toeplitz matrices are discrete analogues of convolutions, since, if v = Au, then

\[
v_i = \sum_{j=1}^{N} a_{i,j} u_j = \sum_{j=1}^{N} a_{i-j} u_j.
\]

This is a discrete counterpart to a convolution integral of two functions a and u,

\[
v(x) = a * u = \int_0^1 a(x-t)\,u(t)\,dt.
\]

In our case, $T = \mathrm{tridiag}(-1 \;\; 2 \;\; -1)$. This is sometimes written

\[
T = \;]\, -1 \;\; \mathbf{2} \;\; -1 \,[\;, \tag{13.9}
\]

where the reversed brackets indicate the neighboring diagonals, where the main diagonal is emphasized in boldface. The bracket is usually referred to as the convolution kernel. The convolution kernel completely defines the matrix and its action.
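To illustrate that the kernel alone determines the action of a Toeplitz matrix, here is a small sketch. The dictionary representation of the kernel (diagonal offset i − j mapped to the entry $a_{i-j}$) and the function names are assumptions made for this example only.

```python
def toeplitz_apply(kernel, u):
    """Compute v = A u for a banded Toeplitz matrix given by its kernel.

    kernel maps the diagonal offset i - j to the entry a_{i-j};
    offsets that are not listed correspond to zero diagonals.
    """
    N = len(u)
    v = [0.0] * N
    for i in range(N):
        for offset, a in kernel.items():
            j = i - offset
            if 0 <= j < N:            # entries outside the matrix are dropped
                v[i] += a * u[j]
    return v

def toeplitz_dense(kernel, N):
    """Assemble the full N x N matrix from the same kernel, for comparison."""
    return [[kernel.get(i - j, 0.0) for j in range(N)] for i in range(N)]

kernel_T = {-1: -1.0, 0: 2.0, 1: -1.0}       # T = ] -1  2  -1 [
u = [1.0, 4.0, 9.0, 16.0, 25.0]              # samples of j^2, j = 1:5
v_kernel = toeplitz_apply(kernel_T, u)
T = toeplitz_dense(kernel_T, len(u))
v_dense = [sum(T[i][j] * u[j] for j in range(len(u))) for i in range(len(u))]
print(v_kernel)   # interior entries equal -2, the negated second difference of j^2
```

Both routes give the same vector; the kernel version never stores the matrix, which is how such operators are usually applied in practice.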

Much is known about Toeplitz matrices. For example, we can compute the Euclidean norm and logarithmic norm of tridiagonal Toeplitz matrices. This is key to proving that $L_N$ is positive definite, and that the FDM is convergent as N → ∞. Such a result is by no means trivial; the matrix $L_N$ is N×N and so grows in size, but moreover, its elements also grow in magnitude, since the diagonal element of $L_N$ is $2(N+1)^2 \to \infty$.

Now, in (13.8) we need to compute the eigenvalues of $L_N$. To this end, it is sufficient to find the eigenvalues of the N×N Toeplitz matrix T, since

\[
\lambda[L_N] = (N+1)^2 \cdot \lambda[T].
\]

Because we will also be interested in skew-symmetric and non-symmetric Toeplitz matrices, we will consider a general tridiagonal matrix. Since the diagonal only shifts the eigenvalues by a constant, the general result is obtained from considering a matrix with zero diagonal. The main result is the following:

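Theorem 13.1. Let K be an N×N tridiagonal Toeplitz matrix, with kernel

\[
K = \;]\, a \;\; \mathbf{0} \;\; c \,[\;, \tag{13.10}
\]

where $a, c \in \mathbb{R}$. The N eigenvalues λ[K] are given by

\[
\lambda_k = 2\sqrt{ac}\,\cos\frac{k\pi}{N+1}, \qquad k = 1 : N, \tag{13.11}
\]

and the kth eigenvector v, satisfying $Kv = \lambda_k v$, has components

\[
v_j = \left(\frac{a}{c}\right)^{j/2} \sin\frac{k\pi j}{N+1} = \left(\frac{a}{c}\right)^{j/2} \sin k\pi x_j, \qquad j = 1 : N, \tag{13.12}
\]

on the grid points $x_j \in \Omega_N \subset \Omega = [0,1]$.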

Proof. We first note that if a ≠ c, we can construct a diagonal symmetrizer D, such that

\[
DKD^{-1} = \sqrt{ac}\; S, \tag{13.13}
\]

where the tridiagonal matrix S is symmetric, with kernel

\[
S = \;]\, 1 \;\; \mathbf{0} \;\; 1 \,[\;. \tag{13.14}
\]

Assume that $Kv = \lambda_k v$. Then

\[
DKD^{-1}Dv = \lambda_k Dv,
\]

showing that $\sqrt{ac}\,Su = \lambda_k u$ with $u = Dv$. Thus, since (13.13) is a similarity transformation, the eigenvalues are preserved, i.e.,

\[
\lambda[K] = \lambda[DKD^{-1}] = \sqrt{ac}\;\lambda[S],
\]

while the eigenvectors are transformed according to u = Dv.

Starting with the symmetrizer, we illustrate the construction by taking N = 3. We then have

\[
DKD^{-1} =
\begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix}
\begin{pmatrix} 0 & c & 0 \\ a & 0 & c \\ 0 & a & 0 \end{pmatrix}
\begin{pmatrix} d_1^{-1} & 0 & 0 \\ 0 & d_2^{-1} & 0 \\ 0 & 0 & d_3^{-1} \end{pmatrix}
=
\begin{pmatrix} 0 & d_1 c/d_2 & 0 \\ d_2 a/d_1 & 0 & d_2 c/d_3 \\ 0 & d_3 a/d_2 & 0 \end{pmatrix}.
\]

Hence the subdiagonal element in the jth column is $d_{j+1}a/d_j$, while the superdiagonal element in the jth row is $d_j c/d_{j+1}$. These can be made equal, making the matrix $DKD^{-1}$ symmetric, by requiring

\[
\frac{d_{j+1}}{d_j}\,a = \frac{d_j}{d_{j+1}}\,c,
\]

from which it follows that $d_{j+1}/d_j = \sqrt{c/a}$. Consequently, the transformed sub- and superdiagonal elements are all equal to $\sqrt{ac}$, and $DKD^{-1} = \sqrt{ac}\,S$, with the symmetric matrix S given by (13.14). The diagonal matrix D is found from the recursion

\[
d_{j+1} = \sqrt{\frac{c}{a}}\; d_j, \qquad d_1 = \sqrt{c/a}.
\]

It follows that $d_j = (c/a)^{j/2}$. Note that if c/a < 0, the symmetrizer D is complex valued.

The remaining problem is to find λ[S]. For the eigenvalue problem Su = λu, we note that the jth equation reads

\[
u_{j+1} + u_{j-1} = \lambda u_j.
\]

This is a linear difference equation, with boundary conditions $u_0 = u_{N+1} = 0$. Its characteristic equation is

\[
\kappa^2 - \lambda\kappa + 1 = 0. \tag{13.15}
\]

The product of the two roots is 1, due to the last coefficient in the characteristic equation. Hence, denoting one root by κ, the other root is 1/κ, and the general solution to the difference equation can be written

\[
u_j = A\kappa^{j} + B\kappa^{-j}.
\]

Inserting $u_0 = 0$, we find A + B = 0, so the general solution is

\[
u_j = A(\kappa^{j} - \kappa^{-j}). \tag{13.16}
\]

Inserting the other boundary condition, $u_{N+1} = 0$, we find

\[
0 = A(\kappa^{N+1} - \kappa^{-(N+1)}).
\]

Since A ≠ 0 (otherwise we obtain the trivial solution u ≡ 0), we must have

\[
\kappa^{2(N+1)} = 1.
\]

Thus we need to find the 2(N+1)th roots of unity. Writing $1 = e^{2\pi i k}$, we obtain

\[
\kappa_k = e^{\frac{ik\pi}{N+1}}, \qquad k = 1 : N.
\]

(The values k = 0 and k = N+1 give κ = ±1 and hence u ≡ 0, while the remaining roots only reproduce the same eigenvectors up to sign.) Now, in the characteristic equation (13.15) we see that the sum of the two roots is λ. Therefore,

\[
\lambda_k[S] = \kappa_k + \frac{1}{\kappa_k} = e^{\frac{ik\pi}{N+1}} + e^{-\frac{ik\pi}{N+1}} = 2\cos\frac{k\pi}{N+1}.
\]

Thus we have found λ[S]. It follows that

\[
\lambda_k[K] = 2\sqrt{ac}\,\cos\frac{k\pi}{N+1}.
\]

We now have to construct the eigenvectors. We note in (13.16), taking A = 1/(2i), that the kth eigenvector u of S has components

\[
u_j = A\cdot(\kappa_k^{j} - \kappa_k^{-j}) = A\cdot\left(e^{\frac{ik\pi j}{N+1}} - e^{-\frac{ik\pi j}{N+1}}\right) = \sin\frac{k\pi j}{N+1} = \sin k\pi x_j,
\]

where the $x_j$ are the grid points. The transformation $v = D^{-1}u$ implies that the eigenvector v, satisfying $Kv = \lambda_k v$, has components

\[
v_j = (a/c)^{j/2} \sin k\pi x_j.
\]

This completes the proof.
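The formulas (13.11) and (13.12) are easy to check numerically. The sketch below assumes a, c > 0 (so that everything stays real) and uses the illustrative values a = 1, c = 4, N = 8; it applies K to each claimed eigenvector and measures the residual $Kv - \lambda_k v$. The helper names are hypothetical.

```python
import math

def apply_K(a, c, v):
    """Multiply by the tridiagonal Toeplitz matrix with kernel ]a 0 c[
    (a on the subdiagonal, zero diagonal, c on the superdiagonal)."""
    N = len(v)
    out = []
    for i in range(N):
        left = a * v[i - 1] if i > 0 else 0.0
        right = c * v[i + 1] if i < N - 1 else 0.0
        out.append(left + right)
    return out

def eigenpair(a, c, N, k):
    """Eigenvalue (13.11) and eigenvector (13.12), assuming a*c > 0."""
    lam = 2.0 * math.sqrt(a * c) * math.cos(k * math.pi / (N + 1))
    v = [(a / c) ** (j / 2) * math.sin(k * math.pi * j / (N + 1))
         for j in range(1, N + 1)]
    return lam, v

a, c, N = 1.0, 4.0, 8      # illustrative values, not from the text
max_residual = 0.0
for k in range(1, N + 1):
    lam, v = eigenpair(a, c, N, k)
    Kv = apply_K(a, c, v)
    residual = max(abs(Kv[i] - lam * v[i]) for i in range(N))
    max_residual = max(max_residual, residual)
print(f"largest residual over all k: {max_residual:.1e}")
```

The residual should be at round-off level for every k. For ac < 0 the same check would require complex arithmetic, which this sketch does not cover.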

Theorem 13.1 is central in 1D FDM and FEM theory, and useful in 2pBVPs as well as in parabolic and hyperbolic PDEs. Because we are not only interested in self-adjoint problems, the theorem also covers skew-symmetric and nonsymmetric Toeplitz matrices, since a and c may be different, and even have opposite signs, in which case the eigenvalues become imaginary and the eigenvectors complex.

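Applying Theorem 13.1 to the Poisson equation with Dirichlet conditions, which is a self-adjoint problem, and whose FDM discretization is symmetric, we have

Corollary 13.1. Let $L_N = (N+1)^2 T$ be an N×N tridiagonal Toeplitz matrix, where the kernel of T is given by (13.9). Then

\[
\lambda_k[L_N] = 4(N+1)^2 \sin^2\frac{k\pi}{2(N+1)}, \qquad k = 1 : N. \tag{13.17}
\]

The kth eigenvector u, satisfying $L_N u = \lambda_k u$, has vector components

\[
u_n = \sqrt{2}\,\sin k\pi x_n = \sqrt{2}\,\sin\frac{k\pi n}{N+1}, \qquad n = 1 : N \tag{13.18}
\]

on the uniform grid $\Omega_N = \{x_n\}_1^N$, with grid points $x_n = n\Delta x = n/(N+1)$.

Proof. Note that T = 2I + K, with a = c = −1 in Theorem 13.1, so that K = −S with S given by (13.14). The eigenvalues of K are therefore $-2\cos\frac{k\pi}{N+1}$, k = 1 : N, with the same eigenvectors as S. (This is the same set of values as in (13.11), since $\cos\frac{(N+1-k)\pi}{N+1} = -\cos\frac{k\pi}{N+1}$; only the ordering differs.) Hence

\[
\lambda_k[T] = 2 - 2\cos\frac{k\pi}{N+1} = 4\sin^2\frac{k\pi}{2(N+1)},
\]

and it follows that $\lambda_k[L_N] = (N+1)^2\lambda_k[T]$, see Figure 13.2. The eigenvectors are rescaled to conform to the orthonormal eigenfunctions of the continuous operator. The proof is complete.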

As a further application, we turn to the Sturm–Liouville eigenvalue problem

\[
-u'' = \lambda u, \qquad u(0) = u(1) = 0.
\]

For this problem, we already know that the eigenvalues and eigenfunctions are

\[
\begin{aligned}
\lambda_k\!\left[-\frac{d^2}{dx^2}\right] &= k^2\pi^2, \qquad k \in \mathbb{Z}^+ \\
u_k(x) &= \sqrt{2}\,\sin k\pi x.
\end{aligned}
\]

We employ the same discretization as for the 2pBVP above, to obtain the algebraic eigenvalue problem

\[
L_N u = \lambda u. \tag{13.19}
\]
We can now compare the eigenvalues and eigenfunctions of the continuous prob-lem, L u = λu, and those of the discrete problem, LNu = λku. Starting with the


Fig. 13.2 Eigenvalues of LN . The eigenvalues λk[LN ] = (N +1)2 ·(2+2cos kπ

N+1

)are projections

on the real axis of equally spaced angular increments. Here N = 19

eigenfunctions, we note that the kth eigenfunction of the continuous problem isuk(x) =

√2sinkπx. Sampled on the grid, we obtain

uk(xn) =√

2sinkπxn = un,

where u = unN1 is the kth eigenvector of the discrete problem. Thus the dis-

crete eigenvectors are exact samples of the continuous eigenfunctions, withoutdiscretization errors.

The first three eigenvectors (the “lowest modes”) are plotted on the grid $\Omega_N$ in Figure 13.3, and are seen to be accurate renditions of the continuous eigenfunctions, even when N is small. The last eigenvector (the “highest mode”), however, appears to bear little resemblance to the continuous eigenfunction, in spite of coinciding with the exact solution on every grid point, as seen in Figure 13.4. This illustrates the fact that the sampling of a highly oscillatory function must be dense enough in order to visually represent the function well, even though the Nyquist–Shannon sampling theorem is not violated in Figure 13.4.

The visual deterioration of the eigenvectors is perhaps better revealed by the eigenvalues, which are subject to discretization errors. We have already seen that $\lambda_k[\mathcal{L}] = k^2\pi^2$. For the discrete eigenvalues, we expand the first eigenvalues of (13.17) in a Taylor series as N → ∞, to obtain

\[
\lambda_k[L_N] = 4(N+1)^2 \sin^2\frac{k\pi}{2(N+1)} = k^2\pi^2 - \frac{k^4\pi^4}{12(N+1)^2} + O(N^{-4}).
\]

It follows that

\[
\lambda_k[L_N] = \lambda_k[\mathcal{L}] - \frac{k^4\pi^4}{12(N+1)^2} + O(N^{-4}), \tag{13.20}
\]

implying that the numerically computed eigenvalues converge to the eigenvalues of the differential operator with order p = 2, since the error is $O((N+1)^{-2}) = O(\Delta x^2)$. Thus a second order FDM will produce second order approximations to the eigenvalues. However, high accuracy will only be obtained for the first few eigenvalues. The relative error in the kth eigenvalue is

\[
\frac{\lambda_k[L_N] - \lambda_k[\mathcal{L}]}{\lambda_k[\mathcal{L}]} = -\frac{\lambda_k[\mathcal{L}]}{12(N+1)^2} + O(N^{-4}).
\]

[Figure 13.3: three panels over x ∈ [0,1], titled “First three modes”.]

Fig. 13.3 First three eigenvector modes of $L_N$. From top to bottom, the panels show the eigenvectors $u_1$, $u_2$ and $u_3$ of the N×N matrix $L_N$, for N = 19. The graphs are in close agreement with the eigenfunctions $u_k(x) = \sin k\pi x$ of $\mathcal{L} = -d^2/dx^2$ with u(0) = u(1) = 0, for k = 1, 2 and 3. Note that the boundary values (indicated separately) are not components of the eigenvectors

It increases with the magnitude of $\lambda_k[\mathcal{L}]$, which is proportional to $k^2$. At the same time, for a fixed k, the relative error decreases as $N^2$ increases, due to the second order convergence of the FDM. Therefore, as the relative error is $O(k^2/N^2)$, we obtain increased accuracy only if k grows “slower” than N. As a rule of thumb, one can obtain high accuracy for the first $k \sim \sqrt{N}$ eigenvalues. Likewise, we typically obtain an acceptable discrete rendition of the first $\sqrt{N}$ eigenfunctions.

The “mismatch” for higher eigenvalues and eigenvectors is typical of numerical eigenvalue computations. It is due to the fact that the continuous operator $\mathcal{L}$ has an infinite sequence of eigenvalues and eigenvectors, while the discrete FDM operator only has N eigenvalues and eigenvectors.
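The growth of the relative error with k can be tabulated directly from (13.17). The following sketch uses N = 19, as in the figures, and compares the lowest mode with the leading term of the expansion; the choice of which k to display is arbitrary.

```python
import math

def discrete_eigenvalue(k, N):
    """lambda_k[L_N] according to (13.17)."""
    return 4.0 * (N + 1) ** 2 * math.sin(k * math.pi / (2 * (N + 1))) ** 2

def relative_error(k, N):
    """(lambda_k[L_N] - lambda_k[L]) / lambda_k[L], with lambda_k[L] = (k pi)^2."""
    exact = (k * math.pi) ** 2
    return (discrete_eigenvalue(k, N) - exact) / exact

N = 19
rel = {k: relative_error(k, N) for k in (1, 4, 19)}
# Leading term of the relative error for k = 1: -pi^2 / (12 (N+1)^2)
predicted_k1 = -(math.pi ** 2) / (12 * (N + 1) ** 2)

for k, r in rel.items():
    print(f"k = {k:2d}: relative error = {r:+.4f}")
print(f"leading-term prediction for k = 1: {predicted_k1:+.4f}")
```

For k = 1 the error is a fraction of a percent, for k = 4 ≈ √N it is a few percent, and for the highest mode k = N the discrete eigenvalue is off by more than half of the exact value.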

Since $L_N^* = L_N$, the eigenvalues are real, as observed above, and the eigenvectors are orthogonal. Apart from calculating the eigenvalues and eigenvectors, we can also determine the Euclidean norms and logarithmic norms of $L_N$, as these values are determined by the extreme eigenvalues in symmetric (normal) matrices.

[Figure 13.4: three panels over x ∈ [0,1], titled “Highest mode”.]

Fig. 13.4 Highest eigenvector mode of $L_N$. Top panel shows the eigenvector $u_{19}$ of the N×N matrix $L_N$, corresponding to the Nth mode for N = 19. Center panel shows the eigenfunction $u_{19}(x) = \sin 19\pi x$ of $\mathcal{L} = -d^2/dx^2$ with u(0) = u(1) = 0. When both graphs are overlaid (bottom) we see that the eigenvector $u_{19}$ consists of exact samples of $u_{19}(x)$ on the grid

Theorem 13.2. Let $L_N = (N+1)^2 T$ be the N×N symmetric tridiagonal Toeplitz matrix with T given by (13.9). Then

\[
\begin{aligned}
m_2[L_N] &= \lambda_1[L_N] \approx \pi^2 + O(N^{-2}) \\
\|L_N\|_2 &= \lambda_N[L_N] \approx 4(N+1)^2 - O(1) \\
\|L_N^{-1}\|_2 &= 1/\lambda_1[L_N] \approx 1/\pi^2 + O(N^{-2}).
\end{aligned}
\]

Proof. The results follow directly from the fact that, for a symmetric matrix, the Euclidean norm and upper and lower logarithmic norms are determined by the extreme eigenvalues. Since the eigenvalues are all real, and are positive for the matrix $L_N$, it follows that $m_2[L_N] = \lambda_1[L_N]$ and $\|L_N\|_2 = \lambda_N[L_N]$. Further, since $\lambda[L_N^{-1}] = \lambda^{-1}[L_N]$ and $L_N^{-1}$, too, is symmetric, it follows that $\|L_N^{-1}\|_2 = 1/\lambda_1[L_N]$. Taylor series expansions as N → ∞ then provide the approximations.

We note that, since $m_2[L_N] > 0$, the matrix $L_N$ is symmetric positive definite. Thus the FDM discretization reflects the fact that $\mathcal{L}$ is self-adjoint and elliptic. This qualitative preservation of the properties of the original operator is very important for success in numerical computations.

It is worth noting that, by the uniform monotonicity theorem, it holds that

\[
m_2[L_N] > 0 \;\Rightarrow\; \|L_N^{-1}\|_2 \le \frac{1}{m_2[L_N]}. \tag{13.21}
\]

However, while this bound holds for general matrices, (13.21) now holds with equality, since the matrix $L_N$ and its inverse are both symmetric positive definite. Thus the Euclidean norm is “optimal” for symmetric positive definite systems.

We are now in a position to prove that the FDM is convergent for elliptic problems. This will follow the standard pattern of inserting the exact solution into the discretization to obtain a local error, followed by establishing stability, from which convergence results.

13.3 The Lax Principle

The Lax Principle is a “meta theorem” in numerical analysis. It states a general pattern in numerical analysis, establishing how consistency is related to convergence. The bridge between consistency and convergence is stability: without stability a method is not convergent.

Theorem 13.3. (Lax Principle) Consistency and Stability imply Convergence.

We have already seen this principle in action in initial value problems, proving that time stepping methods are convergent. We first proved the order of consistency, and then proved that the method was stable, to conclude that the method was convergent. Here, we shall do the same thing for the FDM discretization of the 1D Poisson equation. This is but a single example of 2pBVPs, and whenever the problem changes, or the boundary conditions change, we have to reconsider the proof. Nevertheless, we will see that all convergence proofs follow the same pattern: we first prove the order of consistency, then prove stability, and the order of convergence follows.

The present task is to prove that the FDM discretization of the 1D Poisson equation is convergent. We therefore consider (13.5), $L_N u = f + b$. For simplicity, but without loss of generality, we may consider homogeneous boundary conditions, i.e., b = 0. The generic equation of this discretization reads

\[
\frac{-u_{j-1} + 2u_j - u_{j+1}}{\Delta x^2} = f(x_j). \tag{13.22}
\]

Upon inserting the exact solution to $-u'' = f$, we have

\[
\frac{-u(x_{j-1}) + 2u(x_j) - u(x_{j+1})}{\Delta x^2} = f(x_j) - r_j, \tag{13.23}
\]

where $r_j$ is the local residual (often referred to as the local error in the literature). Expanding the exact solution around the grid point $x_j$, we find that

\[
\frac{-u(x_j - \Delta x) + 2u(x_j) - u(x_j + \Delta x)}{\Delta x^2} = -u''(x_j) - \frac{\Delta x^2}{12}\,u^{(4)}(x_j) + O(\Delta x^4).
\]

Since $-u'' = f$, it follows that

\[
r_j = \frac{\Delta x^2}{12}\,u^{(4)}(x_j) + O(\Delta x^4),
\]

so the method’s order of consistency is p = 2. Now, defining the global error as $e_j = u_j - u(x_j)$ and subtracting (13.23) from (13.22), we obtain

\[
\frac{-e_{j-1} + 2e_j - e_{j+1}}{\Delta x^2} = r_j.
\]

This is referred to as the error equation, as it defines the global error in terms of the local residual. Interestingly, the structure of the error equation is identical to that of (13.22). Thus, denoting the global error vector on $\Omega_N$ by e, and the local residual by r, we have

\[
L_N e = r. \tag{13.24}
\]

Because $L_N$ is symmetric positive definite, it is invertible, and therefore

\[
e = L_N^{-1} r. \tag{13.25}
\]

We now need to bound the error. To this end, we need to choose an appropriate norm. In the continuous problem $-u'' = f$ the standard choice is the $L^2$ norm, defined by

\[
\|u\|^2_{L^2(\Omega)} = \int_\Omega |u(x)|^2\,dx.
\]

We need a discrete counterpart. This norm is the root mean square norm (often referred to as the RMS norm, or the discrete $L^2$ norm), defined by

\[
\|u\|^2_{L^2(\Omega_N)} = \sum_{1}^{N} |u_j|^2\,\Delta x,
\]

where $\Delta x = 1/(N+1)$ in the case of Dirichlet boundary conditions.

The main reason for using this norm is that we need a norm such that the norm of the unit function, u(x) ≡ 1, is $\|u\|^2_{L^2(\Omega)} = 1$ not only in the continuous case, but also in the discrete case, independent of the number of grid points. This precludes the use of the standard Euclidean norm, since the norm of the unit function is then $\sqrt{N}$, and growing with the number of grid points. However, with the discrete $L^2$ norm, this problem is rectified. (In fact, the discrete $L^2(\Omega_N)$ norm is a second order approximation to the continuous $L^2(\Omega)$ norm.)

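To use the discrete $L^2(\Omega_N)$ norm, we need the corresponding operator norm. Given a matrix $A \in \mathbb{R}^{N\times N}$, we have

\[
\|A\|^2_{L^2(\Omega_N)} = \sup_{u\neq 0} \frac{\|Au\|^2_{L^2(\Omega_N)}}{\|u\|^2_{L^2(\Omega_N)}}
= \sup_{u\neq 0} \frac{\sum_j (Au)_j^2/(N+1)}{\sum_j u_j^2/(N+1)}
= \sup_{u\neq 0} \frac{\|Au\|_2^2}{\|u\|_2^2} = \|A\|_2^2.
\]

Definition 13.2. The discrete $L^2(\Omega_N)$ norm of an N-vector u is defined by

\[
\|u\|_{L^2(\Omega_N)} = \sqrt{\frac{1}{N+1}\sum_{1}^{N} |u_j|^2}. \tag{13.26}
\]

The induced operator norm of a matrix $A : L^2(\Omega_N) \to L^2(\Omega_N)$ is

\[
\|A\|^2_{L^2(\Omega_N)} = \|A\|_2^2. \tag{13.27}
\]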
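The difference between the two vector norms is easy to see on the sampled unit function. The following minimal sketch (function names are hypothetical) evaluates both norms for a few grid sizes.

```python
import math

def rms_norm(u):
    """Discrete L2 norm (13.26) for N interior points, with dx = 1/(N+1)."""
    N = len(u)
    return math.sqrt(sum(abs(x) ** 2 for x in u) / (N + 1))

def euclidean_norm(u):
    return math.sqrt(sum(abs(x) ** 2 for x in u))

norms = {}
for N in (9, 99, 999):
    ones = [1.0] * N                      # unit function sampled on the grid
    norms[N] = (euclidean_norm(ones), rms_norm(ones))
    print(f"N = {N:3d}: Euclidean = {norms[N][0]:7.3f}, RMS = {norms[N][1]:.4f}")
```

The Euclidean norm grows like √N, while the RMS norm equals √(N/(N+1)) and approaches 1 as the grid is refined.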

We note that the structure of the expression (13.26) explains the term “root, mean, square.” Now, by (13.25), we can bound the global error,

\[
\|e\|_{L^2(\Omega_N)} \le \|L_N^{-1}\|_2 \cdot \|r\|_{L^2(\Omega_N)}. \tag{13.28}
\]

The global error is bounded if the two factors on the right hand side are bounded.

Theorem 13.4. The local residual in the FDM discretization (13.5) of the 1D Poisson equation, with Dirichlet boundary conditions, is

\[
\|r\|_{L^2(\Omega_N)} = O(\Delta x^2),
\]

implying that the method’s order of consistency is p = 2.

This result has already been demonstrated above, using Taylor series expansions. Thus, with every component $r_j$ being $O(\Delta x^2)$, we have $\|r\|_{L^2(\Omega_N)} = O(\Delta x^2)$, showing that the method is 2nd order consistent. To translate this result into convergence, we need stability.
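Consistency can be checked by inserting a known solution into the discretization, exactly as in (13.23). The sketch below assumes the smooth test solution u(x) = sin πx (so that f = π² sin πx and $u^{(4)} = \pi^4 \sin \pi x$), which is an illustrative choice; it compares the computed residual with the leading term ∆x²u⁽⁴⁾/12 and estimates the order from two grids.

```python
import math

def local_residual(u, f, N):
    """r_j = f(x_j) - (-u(x_{j-1}) + 2u(x_j) - u(x_{j+1}))/dx^2, cf. (13.23)."""
    dx = 1.0 / (N + 1)
    r = []
    for j in range(1, N + 1):
        x = j * dx
        Lu = (-u(x - dx) + 2.0 * u(x) - u(x + dx)) / dx**2
        r.append(f(x) - Lu)
    return r

def rms_norm(v):
    return math.sqrt(sum(t * t for t in v) / (len(v) + 1))

# Assumed smooth test solution (not from the text): u = sin(pi x)
u = lambda x: math.sin(math.pi * x)
f = lambda x: math.pi**2 * math.sin(math.pi * x)      # f = -u''
u4 = lambda x: math.pi**4 * math.sin(math.pi * x)     # fourth derivative

N = 19
dx = 1.0 / (N + 1)
r = local_residual(u, f, N)
leading = [dx**2 / 12.0 * u4(j * dx) for j in range(1, N + 1)]
gap = max(abs(a - b) for a, b in zip(r, leading))     # expected to be O(dx^4)

# Halving dx (N = 19 -> N = 39) should reduce the RMS residual by about 4
order = math.log2(rms_norm(r) / rms_norm(local_residual(u, f, 39)))
print(f"max |r - dx^2 u''''/12| = {gap:.1e}, observed order = {order:.3f}")
```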

Definition 13.3. The FDM discretization (13.5) is stable, if there is a constant C, independent of N, such that

\[
\|L_N^{-1}\|_2 \le C. \tag{13.29}
\]

By Theorem 13.2 the stability condition is also satisfied, as $\|L_N^{-1}\|_2 \approx 1/\pi^2$. This implies that the global error can be bounded according to (13.28), proving the convergence of the finite difference method:

Theorem 13.5. The global error in the FDM discretization (13.5) of the 1D Poisson equation, with Dirichlet boundary conditions, is bounded by

\[
\|e\|_{L^2(\Omega_N)} \lesssim \frac{\|r\|_{L^2(\Omega_N)}}{\pi^2} = O(\Delta x^2), \tag{13.30}
\]

implying that the method’s order of convergence is p = 2.
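The order of convergence can be observed in a small grid refinement study. The sketch below makes several assumptions for illustration: homogeneous Dirichlet data, the manufactured solution u(x) = sin πx, and a Thomas-type elimination specialized to T = tridiag(−1 2 −1). The error is measured in the discrete L2 (RMS) norm of Definition 13.2.

```python
import math

def solve_poisson(f, N):
    """Solve L_N u = f (homogeneous Dirichlet data) by tridiagonal elimination."""
    dx = 1.0 / (N + 1)
    diag = [2.0] * N
    rhs = [dx**2 * f(j * dx) for j in range(1, N + 1)]   # scaled system: T u = dx^2 f
    for i in range(1, N):                                # forward elimination
        w = -1.0 / diag[i - 1]
        diag[i] += w                                     # 2 - 1/diag[i-1]
        rhs[i] -= w * rhs[i - 1]
    u = [0.0] * N
    u[-1] = rhs[-1] / diag[-1]
    for i in range(N - 2, -1, -1):                       # back substitution
        u[i] = (rhs[i] + u[i + 1]) / diag[i]
    return u

def rms_error(N):
    """Discrete L2 norm of the global error e_j = u_j - u(x_j)."""
    dx = 1.0 / (N + 1)
    f = lambda x: math.pi**2 * math.sin(math.pi * x)     # assumed test data
    u = solve_poisson(f, N)
    e = [u[j - 1] - math.sin(math.pi * j * dx) for j in range(1, N + 1)]
    return math.sqrt(sum(t * t for t in e) / (N + 1))

Ns = [9, 19, 39, 79]                                     # dx is halved each time
errors = [rms_error(N) for N in Ns]
orders = [math.log2(errors[i] / errors[i + 1]) for i in range(len(Ns) - 1)]
for N, err in zip(Ns, errors):
    print(f"N = {N:2d}: RMS error = {err:.3e}")
print("observed orders:", [round(p, 3) for p in orders])
```

Each halving of ∆x should reduce the error by a factor close to 4, in agreement with p = 2.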

This result is classical: consistency and stability imply convergence. It exemplifies the Lax Principle as stated above, also referred to as the fundamental theorem of numerical analysis. It is a recurring theme, and the pattern is found in most numerical approximation schemes, whether in initial value problems, boundary value problems, or in partial differential equations. It states that in a stable method, $\|r\|_{L^2(\Omega_N)} \to 0$ implies that $\|e\|_{L^2(\Omega_N)} \to 0$.

In FDMs for 2pBVPs, consistency means that the local residual r → 0 as N → ∞. Likewise, convergence means that the global error e → 0 as the discretization is made finer. The bridge from consistency to convergence is stability, which requires that $\|L_N^{-1}\|_2$ is uniformly bounded as N → ∞. Then (13.28) implies that the order of convergence is equal to the order of consistency.

It is important to note that the stability condition $\|L_N^{-1}\|_2 \le C$ is not a condition on a single matrix. It is a condition on the entire family of matrices $\{L_N^{-1}\}_{N=N_0}^{\infty}$, which must remain continuous (bounded) as N → ∞. For different N, the matrices have different dimensions, and, naturally, different elements. Yet they have to share a common constant, C, such that for any N, it holds that $\|L_N^{-1}\|_2 \le C$, where the stability constant C is independent of N.
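For the Dirichlet problem the whole family can be inspected at once, because Theorem 13.2 gives $\|L_N^{-1}\|_2 = 1/\lambda_1[L_N]$ in closed form. The sketch below tabulates this quantity for a few N; the candidate constant C = 1/8 is simply the value attained at N = 1 and is used here only as an illustration.

```python
import math

def inverse_norm(N):
    """||L_N^{-1}||_2 = 1/lambda_1[L_N], using (13.17) with k = 1."""
    lam1 = 4.0 * (N + 1) ** 2 * math.sin(math.pi / (2 * (N + 1))) ** 2
    return 1.0 / lam1

values = {N: inverse_norm(N) for N in (1, 10, 100, 1000, 10000)}
C = 1.0 / 8.0              # value at N = 1, used as a candidate stability constant
for N, v in values.items():
    print(f"N = {N:5d}: ||L_N^-1||_2 = {v:.8f}")
print(f"limit 1/pi^2 = {1.0 / math.pi**2:.8f}")
```

The tabulated values decrease from 1/8 towards 1/π² ≈ 0.1013, so one constant serves every matrix in the family.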

Remark 1 (Continuity and stability) On close inspection, we see that stability is just another term for continuity. Thus, since $e = L_N^{-1} r$, we see that if $L_N^{-1}$ is a continuous map, then r → 0 implies e → 0, see Figure 13.5. While we typically establish the consistency condition r → 0 by straightforward Taylor series expansions, the rest of the convergence proof “only” requires that we establish stability (continuity). This is usually the harder part of the proof.

Remark 2 (Ellipticity and stability) Above, we have established stability from ellipticity. Thus, by investigating the associated Sturm–Liouville eigenvalue problem, we demonstrated that the differential operator $\mathcal{L}$ is elliptic, and that the FDM preserves that property. As a consequence, $L_N$ is uniformly positive definite for all N, as demonstrated by $m_2[L_N] \approx \pi^2 > 0$. This means that ellipticity plays a central role. Because ellipticity also depends on the boundary conditions, it is typically necessary to establish ellipticity in each separate case. We note that ellipticity is a sufficient condition for demonstrating stability; thus there are operators that are not elliptic, but whose discretizations are still stable. While stability is always necessary, it is typically much harder to establish stability (and hence convergence) in the non-elliptic case.


[Figure 13.5: diagram with the spaces $H_0^2(\Omega)$ and $L^2(\Omega)$ (top row) and $H_0^2(\Omega_N)$ and $L^2(\Omega_N)$ (bottom row), connected by the maps $\mathcal{L}$, $\Gamma_N$ and $L_N^{-1}$. The points u, f, $u(\Omega_N)$ and f − r are marked, together with the differences e and r.]

Fig. 13.5 The Lax Principle. A finite difference method is applied to $\mathcal{L}u = f$ on Ω = [0,1] with homogeneous Dirichlet conditions. A grid $\Omega_N$ with N interior points is used to obtain the algebraic problem $L_N u = f$. When the exact solution is inserted into the discretization, we obtain a discrepancy, represented by the local residual −r. The corresponding global error is $e = L_N^{-1} r$. As r → 0 (consistency), it follows that e → 0 (convergence), provided that $\|L_N^{-1}\| \le C$ for all N (stability). Thus the stability condition is equivalent to the family of maps $L_N^{-1}$ being continuous

13.4 Other types of boundary conditions

The standard boundary conditions for a 2pBVP are Dirichlet conditions (already dealt with), Neumann conditions, Robin conditions, and periodic boundary conditions. Both Neumann and Robin conditions involve u′ on the boundary. Without loss of generality, we begin by assuming that the boundary condition at x = 0 is a Dirichlet condition, and that we have a Neumann condition at x = 1. This means that the simplest model problem is the 1D Poisson equation

\[
-u'' = f, \qquad u(0) = u_L, \qquad u'(1) = u'_R. \tag{13.31}
\]

There are (at least) three different ways to approach this problem. In all cases the condition on the derivative at x = 1 must be discretized, and to maintain 2nd order convergence, the boundary condition must also be discretized to 2nd order accuracy. The different approaches will be seen to involve the construction of the grid, so as to meet the specific requirements.

In the first approach, we construct a grid with N internal points, such that

\[
x_j = j\cdot\Delta x,
\]

with $\Delta x = 1/(N + 1/2)$. This means that $x_0 = 0$ and that all grid points are still spaced by ∆x, but that

\[
x_N = 1 - \frac{\Delta x}{2}, \qquad x_{N+1} = 1 + \frac{\Delta x}{2},
\]

with $x_{N+1}$ outside the [0,1] interval. Since $x_N$ and $x_{N+1}$ are symmetrically located around the boundary point x = 1, we have

\[
u'(1) = \frac{u(x_{N+1}) - u(x_N)}{\Delta x} + O(\Delta x^2).
\]

Conforming to this approximation, we define the numerical approximation $u_{N+1}$ by

\[
u'_R = \frac{u_{N+1} - u_N}{\Delta x}
\]

and solve for $u_{N+1}$ to obtain $u_{N+1} = u_N + \Delta x\cdot u'_R$. Inserting this expression into the FDM discretization of $-u'' = f$, we get

\[
\begin{aligned}
\frac{-u_L + 2u_1 - u_2}{\Delta x^2} &= f(x_1) \\
\frac{-u_{j-1} + 2u_j - u_{j+1}}{\Delta x^2} &= f(x_j), \qquad j = 2 : N-1 \\
\frac{-u_{N-1} + 2u_N - (u_N + \Delta x\cdot u'_R)}{\Delta x^2} &= f(x_N).
\end{aligned}
\]

Simplifying the first and last equations, we now have

\[
\begin{aligned}
\frac{2u_1 - u_2}{\Delta x^2} &= f(x_1) + u_L/\Delta x^2 \\
\frac{-u_{j-1} + 2u_j - u_{j+1}}{\Delta x^2} &= f(x_j), \qquad j = 2 : N-1 \\
\frac{-u_{N-1} + u_N}{\Delta x^2} &= f(x_N) + u'_R/\Delta x.
\end{aligned}
\]

In matrix-vector form, this system reads $L_N u = f + b$, with

\[
\frac{1}{\Delta x^2}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 \\
-1 & 2 & -1 & \cdots & 0 \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
0 & \cdots & 0 & -1 & 1
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}
+ \frac{1}{\Delta x^2}
\begin{pmatrix} u_L \\ 0 \\ \vdots \\ 0 \\ \Delta x\cdot u'_R \end{pmatrix}.
\tag{13.32}
\]

The main difference compared to the previous case, with Dirichlet conditions at both endpoints, is that $\Delta x = 1/(N + 1/2)$ and that the lower right element of $L_N$ is now 1 rather than 2; thus the matrix is no longer Toeplitz. As the matrix is still symmetric, we have $\|L_N^{-1}\|_2 = 1/\lambda_1[L_N] < C$. While one has to reconsider the invertibility of the matrix, and whether the FDM is convergent, this only requires that we study the Sturm–Liouville eigenvalue problem $-u'' = \lambda u$ with boundary conditions u(0) = u′(1) = 0, showing that its smallest eigenvalue is positive. Finally, to complete the approach, we need a numerical approximation to u(1), e.g. for plotting purposes. This, too, has to be 2nd order, and we make the symmetric approximation

\[
u(1) \approx \frac{u_{N+1} + u_N}{2} = u_N + \frac{\Delta x\cdot u'_R}{2}.
\]

This is computable as soon as the interior discrete solution, $u = \{u_j\}_1^N$, has been computed.

In the second approach, we construct a grid such that

\[
x_j = j\cdot\Delta x,
\]

with $\Delta x = 1/N$. This means that $x_0 = 0$ and $x_N = 1$, although the grid point spacing is still ∆x. We introduce an extra grid point outside the interval, $x_{N+1} = 1 + \Delta x$, and approximate the boundary condition by

\[
u'(1) = u'_R = \frac{u_{N+1} - u_{N-1}}{2\Delta x} + O(\Delta x^2).
\]

This approximation is still 2nd order, again because $x_{N-1}$ and $x_{N+1}$ are symmetrically located around the boundary point x = 1. Defining $u_{N+1}$ by

\[
\frac{u_{N+1} - u_{N-1}}{2\Delta x} = u'_R,
\]

we have $u_{N+1} = u_{N-1} + 2\Delta x\cdot u'_R$. When this approximation is inserted into the generic FDM discretization, we obtain

\[
\begin{aligned}
\frac{2u_1 - u_2}{\Delta x^2} &= f(x_1) + u_L/\Delta x^2 \\
\frac{-u_{j-1} + 2u_j - u_{j+1}}{\Delta x^2} &= f(x_j), \qquad j = 2 : N-1 \\
\frac{-2u_{N-1} + 2u_N}{\Delta x^2} &= f(x_N) + 2u'_R/\Delta x.
\end{aligned}
\]

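In matrix-vector form, this system reads

\[
\frac{1}{\Delta x^2}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 \\
-1 & 2 & -1 & \cdots & 0 \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
0 & \cdots & 0 & -2 & 2
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}
+ \frac{1}{\Delta x^2}
\begin{pmatrix} u_L \\ 0 \\ \vdots \\ 0 \\ 2\Delta x\cdot u'_R \end{pmatrix},
\tag{13.33}
\]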

with $\Delta x = 1/N$. As in the previous approach, the matrix is not Toeplitz, and it is no longer symmetric. Again invertibility has to be reconsidered in order to prove convergence. In this case, too, stability can be established. Comparing (13.32) to (13.33), the systems look very similar. The only difference is in the last row of the matrices, which have different elements, and in the location of the grid points, corresponding to $\Delta x = 1/(N + 1/2)$ in the first case and $\Delta x = 1/N$ in the latter. Finally, we note that in this approach, the discretization generates the internal approximations $u_1, \ldots, u_{N-1}$ as well as the numerical solution $u_N$ on the boundary. This explains why this N×N system corresponds to a discretization with $\Delta x = 1/N$.

In the third approach, we work with the standard grid $x_j = j\cdot\Delta x$ and mesh width $\Delta x = 1/(N+1)$. Again, $x_0 = 0$ and $x_{N+1} = 1$, but we do not employ any extra grid points outside the interval. Instead, we approximate the boundary condition $u'(1) = u'_R$ to 2nd order accuracy using the BDF2 method (see methods for initial value problems). This means that we use a “one-sided” difference approximation

\[
\frac{3}{2}u_{N+1} - 2u_N + \frac{1}{2}u_{N-1} = \Delta x\cdot u'_R.
\]

Solving for $u_{N+1}$ (which approximates the solution u(1) at the boundary), we get

\[
u_{N+1} = \frac{4}{3}u_N - \frac{1}{3}u_{N-1} + \frac{2}{3}\Delta x\cdot u'_R. \tag{13.34}
\]

As before, this is inserted into the standard discretization, again affecting only the last row (the Nth equation) of the system, which now reads

\[
\frac{-2u_{N-1} + 2u_N}{3\Delta x^2} = f(x_N) + \frac{2}{3\Delta x}\cdot u'_R. \tag{13.35}
\]

The matrix is neither Toeplitz nor symmetric, calling for a sperate analysis of theinvertibility to prove stability and convergence. Once the interior approximationsu = u jN

1 have been computed, the solution at the right boundary is approximatedby (13.34).
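The step from the one-sided formula to the modified last equation can be written out explicitly. The worked equation below only substitutes the expression for $u_{N+1}$ into the generic $N$th equation $(-u_{N-1}+2u_N-u_{N+1})/\Delta x^2 = f(x_N)$; it introduces no new symbols.

```latex
\begin{align*}
f(x_N) &= \frac{-u_{N-1} + 2u_N - u_{N+1}}{\Delta x^2} \\
       &= \frac{-u_{N-1} + 2u_N - \tfrac{4}{3}u_N + \tfrac{1}{3}u_{N-1}
                - \tfrac{2}{3}\Delta x\, u'_R}{\Delta x^2} \\
       &= \frac{-2u_{N-1} + 2u_N}{3\Delta x^2} - \frac{2}{3\Delta x}\,u'_R .
\end{align*}
```

Moving the boundary term to the right-hand side gives the last row of the system for the third approach.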

The three approaches above work on different grids, with $\Delta x = 1/(N+1/2)$, $\Delta x = 1/N$, and $\Delta x = 1/(N+1)$, respectively. Apart from the mesh width factor, the main change in the system is found in the last row of the matrix. Without affecting stability, the last equation can be rescaled (by multiplying it by a suitable factor) so that the last row is the same as in (13.32). Thus all approaches will be stable if the symmetric system (13.32) is stable. For the latter system, we will demonstrate later that $\lambda_1[L_N] \approx \pi^2/4$, which guarantees that ellipticity is preserved.

For Robin conditions, we employ the same techniques in any suitable combination. For example, if the boundary condition is $u_R + \alpha u'_R = \beta$ and we use the second approach above, with mesh width $\Delta x = 1/N$, we define

$$ u_N + \alpha\,\frac{u_{N+1}-u_{N-1}}{2\Delta x} = \beta, $$

and solve for $u_{N+1} = u_{N-1} + 2\Delta x(\beta - u_N)/\alpha$, eliminating this variable from the last equation. Again, we see that this will affect the last row of the matrix, as well as the right hand side of the system. If the non-Dirichlet condition is located at $x = 0$, the procedures are completely analogous.
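To make the effect on the last row concrete, the elimination can be carried out explicitly. The following worked equation assumes $\alpha \neq 0$ and inserts $u_{N+1} = u_{N-1} + 2\Delta x(\beta-u_N)/\alpha$ into the generic $N$th equation; nothing beyond the formulas above is used.

```latex
\begin{align*}
\frac{-u_{N-1} + 2u_N - u_{N+1}}{\Delta x^2} = f(x_N)
\quad\Longrightarrow\quad
\frac{-2u_{N-1} + \left(2 + 2\Delta x/\alpha\right)u_N}{\Delta x^2}
  = f(x_N) + \frac{2\beta}{\alpha\,\Delta x}.
\end{align*}
```

As a consistency check, letting $\alpha \to \infty$ with $\beta/\alpha = u'_R$ fixed recovers the Neumann row $(-2u_{N-1}+2u_N)/\Delta x^2 = f(x_N) + 2u'_R/\Delta x$.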


Now, let us return to a convergence proof in the case we have a Neumann condition. As noted above, it is sufficient to consider the first approach, where the matrix $L_N$ is defined in (13.32). That matrix is symmetric, and it is therefore sufficient to consider its smallest eigenvalue, since $\|L_N^{-1}\|_2 = 1/\lambda_1[L_N]$, provided that we can show that this eigenvalue is strictly positive for all $N$.

A prerequisite is that the Sturm–Liouville eigenvalue problem $-u'' = \lambda u$ with the boundary conditions $u(0) = 0$ and $u'(1) = 0$ is elliptic. As trigonometric functions are eigenfunctions of the second derivative, we note that

$$ u_k(x) = \sin\frac{(2k-1)\pi x}{2}, \qquad k \in \mathbb{Z}^+ $$

are eigenfunctions satisfying these boundary conditions, with eigenvalues

$$ \lambda_k = \frac{(2k-1)^2\pi^2}{4}. $$

Thus the eigenvalue sequence is $\pi^2/4, 9\pi^2/4, \dots$. Since we are using a 2nd order consistent discretization, we expect the matrix $L_N$ as defined in (13.32) to have similar eigenvalues, and, in particular, that the smallest eigenvalue is obtained for $k = 1$, with $\lambda_1[L_N] \approx \pi^2/4$. In order to demonstrate that this is in fact the case, we consider the FDM discretization

$$ \frac{-u_{j-1}+2u_j-u_{j+1}}{\Delta x^2} = \lambda u_j, $$

where $\Delta x = 1/(N+1/2)$ and $u_{N+1} = u_N$, as described in the first approach above. Simplifying, we can rewrite this difference equation as

$$ u_{j-1} + u_{j+1} = (2-\Delta x^2\lambda)u_j =: \mu u_j. \tag{13.36} $$

Thus we may consider the difference equation $u_{j-1} + u_{j+1} = \mu u_j$, the characteristic equation of which is

$$ z^2 - \mu z + 1 = 0. $$

Then the general solution is $u_j = Az^j + Bz^{-j}$, since the product of the two roots is 1. Inserting $u_0 = 0$, we find $A + B = 0$. Thus the general solution is $u_j = A(z^j - z^{-j})$. Upon inserting the second boundary condition, $u_{N+1} = u_N$, we get

$$ z^N(z-1) = z^{-N}\left(\frac{1}{z}-1\right), $$

from which it follows that $z^{2N+1} = -1 = e^{-i\pi}e^{i2k\pi}$. Thus $z_k = e^{i(2k-1)\pi/(2N+1)}$, and

$$ \mu_k = z_k + \frac{1}{z_k} = 2\cos\frac{(2k-1)\pi}{2N+1}, \qquad k = 1:N. $$

By (13.36) it follows that

$$ \lambda_k[L_N] = \frac{2-\mu_k}{\Delta x^2} = \left(N+\tfrac{1}{2}\right)^2\left(2 - 2\cos\frac{(2k-1)\pi}{2N+1}\right), \qquad k = 1:N. $$

Thus, expanding $\lambda_k[L_N]$ in a Taylor series, we find, for the first few eigenvalues

$$ \lambda_k[L_N] \approx (2k-1)^2\,\frac{\pi^2}{4}. \tag{13.37} $$

This is in perfect agreement with the eigenvalues of the differential operator, and since $L_N$ is symmetric, its smallest eigenvalue is $\lambda_1[L_N] \approx \pi^2/4$. Thus $L_N$ is positive definite (reflecting the fact that the differential operator is elliptic also for Neumann conditions), and

$$ \|L_N^{-1}\|_2 \approx \frac{4}{\pi^2} $$

as $N \to \infty$. This is the stability constant of the discrete problem, and it proves that the standard 2nd order FDM for the 1D Poisson equation is stable, and therefore convergent, also for Neumann conditions. Thus, once more, we have applied the Lax Principle to prove that an FDM is convergent, and once more it has turned out to depend on the ellipticity of the differential operator.
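The eigenvalue formula can be checked numerically. The MATLAB sketch below is an illustration under stated assumptions: it builds the matrix of the first approach as the standard tridiagonal matrix with the last diagonal element changed from 2 to 1 (the effect of $u_{N+1} = u_N$), uses $\Delta x = 1/(N+1/2)$, and compares `eig` with the closed-form expression. The value of `N` and the variable names are arbitrary choices made here.

```matlab
% Sketch: compare eig(L_N) with the closed-form eigenvalues for the
% Neumann problem, first approach, dx = 1/(N + 1/2).
N  = 100;
dx = 1/(N + 1/2);
e  = ones(N,1);
T  = full(spdiags([-e 2*e -e], -1:1, N, N));
T(N,N) = 1;                          % u_{N+1} = u_N modifies the last row
LN = T/dx^2;

lam = sort(eig(LN));                 % computed eigenvalues, ascending
k   = (1:N)';
lamFormula = (N + 1/2)^2*(2 - 2*cos((2*k-1)*pi/(2*N+1)));

relErr    = max(abs(lam - lamFormula)./lamFormula);
stabConst = 1/lam(1);                % to be compared with 4/pi^2
```

Here `relErr` should be at roundoff level, and `stabConst` should be close to $4/\pi^2 \approx 0.405$.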

In addition, as has been pointed out above, the other approaches of implementing the Neumann condition will yield a modified matrix which is directly turned into the matrix treated above, by rescaling the last equation of the FDM. Thus convergence is completely determined by the matrix $L_N$, no matter how we choose to represent the Neumann condition, provided that it is done so as to maintain the 2nd order consistency of the FDM.

13.5 FDM for general 2pBVPs

The principles for more general problems than the 1D Poisson equation with Dirichlet conditions follow similar lines. Here we shall examine general self-adjoint operators as well as problems involving first order derivatives, and nonlinear problems.

Let us begin by considering the linear problem $\mathcal{L}u = f$, with

$$ \mathcal{L}u = -(pu')' + qu, \tag{13.38} $$

where $p(x) > 0$ and $q(x) \ge 0$. This is a self-adjoint elliptic problem: multiplying by $u$ and integrating by parts, using homogeneous Dirichlet conditions on $\Omega = [0,1]$, one easily shows that

$$ m_2[\mathcal{L}] \ge \min_x |p(x)|\cdot\pi^2 + \min_x |q(x)| > 0. $$

We want to construct a discretization having similar properties. This means that we need to construct a system $L_N u = f + b$ where $L_N$ is symmetric and positive definite.


When discretizing this operator, we start with the expression $pu'$. On an interval $[x_i, x_{i+1}]$, we make the approximation

$$ (pu')(x_{i+1/2}) = \frac{p_{i+1/2}\cdot(u_{i+1}-u_i)}{\Delta x} + O(\Delta x^2), $$

where $p_{i+1/2} = p((x_{i+1}+x_i)/2)$ is the value of the function $p$ at the interval midpoint. Likewise, we have

$$ (pu')(x_{i-1/2}) = \frac{p_{i-1/2}\cdot(u_i-u_{i-1})}{\Delta x} + O(\Delta x^2). $$

We can next approximate $-(pu')'$ by a symmetric difference quotient. This can be obtained as

$$
\begin{aligned}
-(pu')' &= -\frac{(pu')(x_{i+1/2}) - (pu')(x_{i-1/2})}{x_{i+1/2}-x_{i-1/2}} + O(\Delta x^2) \\
&= -\frac{p_{i+1/2}\cdot(u_{i+1}-u_i) - p_{i-1/2}\cdot(u_i-u_{i-1})}{\Delta x^2} + O(\Delta x^2) \\
&= \frac{-p_{i-1/2}u_{i-1} + (p_{i-1/2}+p_{i+1/2})u_i - p_{i+1/2}u_{i+1}}{\Delta x^2} + O(\Delta x^2).
\end{aligned}
$$

Consequently, we obtain a second order consistent approximation to the self-adjoint operator $\mathcal{L}u = -(pu')' + qu$ through

$$ -(pu')' + qu = \frac{-p_{i-1/2}u_{i-1} + (p_{i-1/2}+p_{i+1/2})u_i - p_{i+1/2}u_{i+1}}{\Delta x^2} + q_iu_i + O(\Delta x^2). $$

We note that if $p(x) \equiv 1$ and $q(x) \equiv 0$, then

$$ -(pu')' = -u'' = \frac{-u_{i-1}+2u_i-u_{i+1}}{\Delta x^2} + O(\Delta x^2). $$

For the second order discretization of the operator $\mathcal{L}u = -(pu')' + qu$, we get a matrix whose central elements are

$$
L_N = \frac{1}{\Delta x^2}
\begin{pmatrix}
 & \vdots & \\
 & -p_{i-1/2} & \\
\cdots\;\; -p_{i-1/2} & p_{i-1/2}+p_{i+1/2}+q_i\Delta x^2 & -p_{i+1/2}\;\;\cdots \\
 & -p_{i+1/2} & \\
 & \vdots &
\end{pmatrix}.
\tag{13.39}
$$

Seeing that the matrix elements above and to the left of the diagonal element are the same, and that the elements below and to the right of the diagonal element are the same, we conclude that the matrix is symmetric, i.e., $L_N^* = L_N$. Thus the discretization above preserves the self-adjoint symmetry of $\mathcal{L}$.
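The assembly of this matrix can be sketched in MATLAB as follows. The coefficient functions $p(x) = 1 + x^2$ and $q(x) = e^x$, the value of `N`, and all variable names are hypothetical choices made for illustration only. The midpoint values $p_{i+1/2}$ are evaluated once and reused for the sub- and superdiagonals, which makes the symmetry exact in floating point.

```matlab
% Sketch: assemble L_N for -(p u')' + q u with homogeneous Dirichlet
% conditions. p and q below are hypothetical example coefficients.
N  = 50;
dx = 1/(N+1);
x  = (1:N)'*dx;                      % interior grid points
p  = @(x) 1 + x.^2;                  % p(x) > 0
q  = @(x) exp(x);                    % q(x) >= 0

xm   = ((0:N)' + 0.5)*dx;            % midpoints x_{1/2}, ..., x_{N+1/2}
pmid = p(xm);
pm   = pmid(1:N);                    % p_{i-1/2}, i = 1:N
pp   = pmid(2:N+1);                  % p_{i+1/2}, i = 1:N

LN = diag(pm + pp + q(x)*dx^2) ...
   - diag(pp(1:N-1),  1) ...         % superdiagonal: -p_{i+1/2}
   - diag(pm(2:N),   -1);            % subdiagonal:   -p_{i-1/2}
LN = LN/dx^2;

symErr = norm(LN - LN', 1);          % zero if L_N is symmetric
lamMin = min(eig(LN));               % smallest eigenvalue
```

For these coefficients one expects `symErr` to vanish and `lamMin` to exceed $\min p\cdot\pi^2 + \min q = \pi^2 + 1$, in line with the lower bound on $m_2[\mathcal{L}]$.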


Let us now consider a general linear 2pBVP,

$$ -u'' + au' + bu = f, \tag{13.40} $$

with Dirichlet boundary conditions $u(0) = u_L$ and $u(1) = u_R$. We have already dealt with the discretization of $u''$, so here it remains to discretize $u'$. We introduce a standard grid $x_j = j\cdot\Delta x$ on $\Omega = [0,1]$, with

$$ \Delta x = \frac{1}{N+1}. $$

The discretization of $-u''$ is given by $(N+1)^2\cdot Tu$. To maintain second order consistency, we use a symmetric discretization of $u'$, and represent it by

$$ u' \approx \frac{u_{j+1}-u_{j-1}}{2\Delta x}. $$

In matrix-vector form, this becomes $\tfrac{1}{2}(N+1)\cdot Su$, since $1/(2\Delta x) = (N+1)/2$. Here $S$ is a skew-symmetric Toeplitz matrix, with kernel

$$ S = \;]\; -1 \;\; 0 \;\; 1 \;[\,. \tag{13.41} $$

Skew-symmetry means that $S^* = -S$. For simplicity, we introduce the notation

$$ D_0^2 = (N+1)^2\cdot T, \qquad D_0^1 = \tfrac{1}{2}(N+1)\cdot S, $$

where the superscript refers to the order of the derivative, not to a power of the matrix. In this notation, a discretization of the linear problem (13.40) becomes a linear system of equations,

$$ (D_0^2 + aD_0^1 + bI)u = f + b. \tag{13.42} $$

Thus, the differential operators are linear operators, and upon discretization, they are represented by finite difference matrices, since linear operators on finite dimensional spaces can be represented by matrices. We may now introduce the notation

$$ L_N = D_0^2 + aD_0^1 + bI. $$

The solvability of the system, and the convergence of the discretization, depend on the boundedness of $L_N^{-1}$. We note that

$$ m_2[L_N] = m_2[D_0^2 + aD_0^1 + bI] \ge m_2[D_0^2] + m_2[aD_0^1] + b \approx \pi^2 + b, $$

since skew-symmetry implies that $m_2[aD_0^1] = 0$ for every $a$. Thus the presence of the first derivative does not affect solvability and the stability of the discretization. The problem is not symmetric (although the matrix $L_N$ is Toeplitz), but it is elliptic, as long as $\pi^2 + b > 0$. It follows that the problem is solvable, and that the FDM is stable and convergent if

$$ b \gtrsim -\pi^2. $$

Again, we see that if the problem is elliptic, the FDM discretization is convergent as long as we choose the mesh width $\Delta x$ small enough to keep $L_N$ positive definite.

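As for the eigenvalues of $L_N$, the matrix is Toeplitz, and therefore Theorem 13.1 applies. We note that the kernel of $L_N$ is

$$ L_N = \;\Big]\; -\frac{a}{2\Delta x} - \frac{1}{\Delta x^2} \quad\; \frac{2}{\Delta x^2} + b \quad\; \frac{a}{2\Delta x} - \frac{1}{\Delta x^2} \;\Big[\,, \tag{13.43} $$

and it follows that the eigenvalues are

$$ \lambda_k[L_N] = b + 2(N+1)^2\left(1 - \sqrt{1-\frac{(a\Delta x)^2}{4}}\,\cos\frac{k\pi}{N+1}\right), \qquad k = 1:N. \tag{13.44} $$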

Here we have chosen to represent the eigenvalues both in terms of $\Delta x$ and $N$, in spite of the relation $\Delta x = 1/(N+1)$. We note that the parameter $b$ only shifts the eigenvalues to the right in the complex plane, increasing the ellipticity. The parameter $a$, however, which does not affect ellipticity at all, has a different influence. Thus, as long as $|a\Delta x| < 2$ the eigenvalues are real, but if $|a\Delta x| > 2$ they become complex. We shall return to this problem later, to see that the continuous operator only has real eigenvalues, and that we therefore want to restrict $\Delta x$ so that $|a\Delta x| < 2$. This is an unexpected condition; it is not a stability condition, and does not matter to convergence, but it does matter to grid quality. In practical computations, we want to use the largest mesh width that produces acceptable results, and keeping $|a\Delta x| < 2$ will be seen to be necessary to resolve the solution on the grid. The quantity $|a\Delta x|$ is usually referred to as the mesh Péclet number.
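The transition from real to complex eigenvalues at $|a\Delta x| = 2$ is easy to observe numerically. The MATLAB sketch below is an illustration only: `N`, `b` and the two values of `a` are arbitrary choices, and the reference values are computed from the standard eigenvalue formula for tridiagonal Toeplitz matrices, $\lambda_k = \delta + 2\sqrt{\sigma\tau}\cos(k\pi/(N+1))$ for diagonal $\delta$ and off-diagonals $\sigma$, $\tau$, which is assumed here to be what Theorem 13.1 provides. The formula is written with a minus sign in front of the cosine, which gives the same set of values.

```matlab
% Sketch: eigenvalues of L_N = D2 + a*D1 + b*I for |a*dx| < 2 and > 2.
% N, b and the two values of a are arbitrary illustrative choices.
N  = 20;
dx = 1/(N+1);
b  = 1;
e  = ones(N,1);
T  = full(spdiags([-e 2*e -e], -1:1, N, N));   % kernel ]-1 2 -1[
S  = full(spdiags([-e 0*e e], -1:1, N, N));    % kernel ]-1 0 1[
k  = (1:N)';

aVals   = [10 100];                  % |a*dx| = 10/21 and 100/21
relErr  = zeros(1,2);
maxImag = zeros(1,2);
for m = 1:2
    a    = aVals(m);
    LN   = T/dx^2 + a*S/(2*dx) + b*eye(N);
    lam  = eig(LN);
    s    = sqrt(1 - (a*dx)^2/4);     % imaginary when |a*dx| > 2
    lamF = b + (2/dx^2)*(1 - s*cos(k*pi/(N+1)));
    d    = abs(sort(real(lam)) - sort(real(lamF))) ...
         + abs(sort(imag(lam)) - sort(imag(lamF)));
    relErr(m)  = max(d)*dx^2/2;      % error relative to 2/dx^2
    maxImag(m) = max(abs(imag(lam)));
end
```

For `a = 10` the computed spectrum should be real, while for `a = 100` all eigenvalues share the real part $b + 2/\Delta x^2$ and have nonzero imaginary parts.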

As the last example of general 2pBVPs, we shall consider nonlinear problems. By and large, what is meant by a nonlinear problem is an operator where the highest order derivative is linear, and lower order terms are nonlinear. In a second order differential equation, this means that the structure of the equation is

$$ u'' = f(x,u,u') $$

together with suitable boundary conditions. The FDM discretization proceeds with the same techniques as before, applied to each occurring derivative. Stability conditions will depend on the properties of $f$. For simplicity, let us consider a specific case,

$$ u'' - uu' = 0. \tag{13.45} $$

This equation will recur in connection with hyperbolic partial differential equations. It has an alternative formulation,

$$ u'' - \left(\frac{u^2}{2}\right)' = 0. \tag{13.46} $$

Using symmetric difference approximations, (13.45) yields

$$ \frac{u_{j-1}-2u_j+u_{j+1}}{\Delta x^2} - u_j\,\frac{u_{j+1}-u_{j-1}}{2\Delta x} = 0. \tag{13.47} $$

In (13.46) we instead obtain

$$ \frac{u_{j-1}-2u_j+u_{j+1}}{\Delta x^2} - \frac{u_{j+1}^2-u_{j-1}^2}{4\Delta x} = 0. \tag{13.48} $$

Because symmetry is maintained, both discretizations are second order consistent. Stability is now considerably more difficult, as is proving convergence. The two discretizations are not the same. In fact, there is a variant of (13.47) which is still 2nd order consistent, where we replace one factor to get

$$ \frac{u_{j-1}-2u_j+u_{j+1}}{\Delta x^2} - \frac{u_{j+1}+u_{j-1}}{2}\cdot\frac{u_{j+1}-u_{j-1}}{2\Delta x} = 0. \tag{13.49} $$

This is immediately seen to be identical to (13.48). As already mentioned, we shall return to these discretizations in connection with partial differential equations.
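For completeness, the identity behind the last remark is the difference of two squares applied to the nonlinear term; the worked equation below uses only the symbols already introduced.

```latex
\begin{align*}
\frac{u_{j+1}+u_{j-1}}{2}\cdot\frac{u_{j+1}-u_{j-1}}{2\Delta x}
  = \frac{(u_{j+1}+u_{j-1})(u_{j+1}-u_{j-1})}{4\Delta x}
  = \frac{u_{j+1}^2-u_{j-1}^2}{4\Delta x}.
\end{align*}
```

In other words, replacing the factor $u_j$ in (13.47) by the average of its two neighbours reproduces the conservative form (13.48) exactly.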

13.6 Higher order methods. Cowell’s difference correction

So far, we have only considered second order discretizations. Such methods are easily constructed by merely using symmetric difference approximations to derivatives, and, in case it is needed, the boundary conditions. Let us now consider the possibility of constructing higher order FDM techniques for 2pBVPs. Again, we limit ourselves to the Poisson equation, demonstrating the idea behind higher order methods.

In the Poisson equation $-u'' = f$ we have approximated the second order derivative by the finite difference quotient

$$ \frac{-u(x_{j-1})+2u(x_j)-u(x_{j+1})}{\Delta x^2} = -u''(x_j) - \frac{\Delta x^2}{12}u^{(4)}(x_j) - O(\Delta x^4). $$

Thus the asymptotically dominant term in the local residual is

$$ -\frac{\Delta x^2}{12}u^{(4)}(x_j) = \frac{\Delta x^2}{12}f''(x_j), $$

since $-u'' = f$. The dominating residual term can therefore be eliminated by the approximation

$$ \frac{\Delta x^2}{12}f''(x_j) = \frac{f(x_{j-1})-2f(x_j)+f(x_{j+1})}{12} + O(\Delta x^4). $$

With this estimate, we will consider the discretization

$$ \frac{-u_{j-1}+2u_j-u_{j+1}}{\Delta x^2} = f(x_j) + \frac{f(x_{j-1})-2f(x_j)+f(x_{j+1})}{12}, $$

which has order of consistency $p = 4$, provided that $u$ and $f$ are sufficiently differentiable. Simplifying, we obtain Cowell's method, where the dominant $O(\Delta x^2)$ residual in the standard 2nd order discretization has been eliminated by a difference correction. Cowell's method is

$$ \frac{-u_{j-1}+2u_j-u_{j+1}}{\Delta x^2} = \frac{f(x_{j-1})+10f(x_j)+f(x_{j+1})}{12}. \tag{13.50} $$

Since the method's consistency order is $p = 4$, we need to verify its convergence order. By the Lax Principle, this will only require that we establish stability. This is, however, a nearly trivial issue. Thus, writing the Cowell method in matrix–vector form, we have

$$ L_N u = M_N f + b, \tag{13.51} $$

where $L_N$ is exactly the same matrix as in the standard 2nd order FDM, and where the vector $b$ accounts for the Dirichlet boundary conditions in exactly the same way as before. The difference correction is carried out by the Toeplitz matrix $M_N$, which is an averaging operator, whose convolution kernel is

$$ M_N = \frac{1}{12}\;]\; 1 \;\; 10 \;\; 1 \;[\,. \tag{13.52} $$

We note that the row sum of elements in the matrix $M_N$ is one. The two function values surrounding $f(x_j)$ will modify the right-hand side, so as to annihilate the $O(\Delta x^2)$ term in the residual, leaving only fourth and higher order terms. The only thing to keep in mind is that in order to apply the difference correction, we cannot make do with only the interior points $\{f_j\}_1^N$, but we also have to include the samples of $f(x)$ on the boundary. Thus we have to consider a convolution of $f$-values from $f(x_0)$ to $f(x_{N+1})$.

To clarify this detail, we give the linear system, specifying the first and last equations, and the generic equation. This system is, for inhomogeneous Dirichlet conditions $u(0) = u_L$ and $u(1) = u_R$,

$$
\begin{aligned}
\frac{2u_1-u_2}{\Delta x^2} &= \frac{f_0+10f_1+f_2}{12} + \frac{u_L}{\Delta x^2} \\
\frac{-u_{j-1}+2u_j-u_{j+1}}{\Delta x^2} &= \frac{f_{j-1}+10f_j+f_{j+1}}{12}, \qquad j = 2:N-1 \\
\frac{-u_{N-1}+2u_N}{\Delta x^2} &= \frac{f_{N-1}+10f_N+f_{N+1}}{12} + \frac{u_R}{\Delta x^2}.
\end{aligned}
$$

Now, in order to verify stability, we note that in (13.51) the system matrix is $L_N$, and that stability only depends on the inverse $L_N^{-1}$ being uniformly bounded as $N \to \infty$. Since the matrix $L_N$ is identical to that of the standard 2nd order FDM, and we have already demonstrated that $\|L_N^{-1}\|_2 \approx 1/\pi^2$, Cowell's method, too, is stable.

Since the method has consistency order $p = 4$, it now follows from the Lax Principle that Cowell's method is convergent of order $p = 4$. Naturally, this requires the data function $f$ to be sufficiently smooth; the convergence order of any method is always the nominal order obtained when the problem data are smooth enough.

Example A simple MATLAB code implementing the standard second order FDM, as well as Cowell's 4th order method, is given below. In Figure 13.6 this code is tested for a simple problem constructed from an exact solution,

$$ u(x) = -2 + x + e^{-x} + \sin(5x+1/3), $$

so that the right hand side is

$$ f(x) = 25\sin(5x+1/3) - e^{-x}. $$

Thus we are able to compute the error and verify the convergence orders. This simple demonstration shows the very significant accuracy improvement achieved by the 4th order method. Already at $\Delta x = 10^{-2}$, the 4th order method produces a solution with an error four orders of magnitude smaller for this particular problem. It is to be noted that the computational effort is similar for the 2nd and 4th order methods. The only additional work needed in Cowell's method is the convolution of the right hand side $f$ to obtain the difference correction. Because the kernel corresponds to a tridiagonal Toeplitz matrix, the additional work is marginal, but delivers a vast gain in accuracy.

    function bvp24(N)
    % Written (c) 2017-11-17 for 1D Poisson problem -u'' = f.
    % Test of standard 2nd order FDM + 4th order Cowell method

    L = 1;
    dx = L/(N+1);                 % Mesh width
    xx = linspace(0,L,N+2)';      % Grid on [0,L]

    % Define RHS (any data function)
    ff = 1 + xx - exp(-xx) + sin(5*xx + 1/3);

    % Define Dirichlet boundary conditions
    uL = 0;
    uR = 1;

    % Assemble matrix
    a = zeros(N,1);
    a(1) = 2;
    a(2) = -1;
    LN = toeplitz(a)/dx^2;        % System matrix

    % Construct difference correction for 4th order method
    ker = [1 10 1]/12;            % Convolution kernel
    f4 = conv(ff,ker,'same');     % RHS for Cowell's method

    % Reduce right hand sides to interior points
    f4 = f4(2:end-1);             % 4th order Cowell
    f2 = ff(2:end-1);             % Standard 2nd order FDM

    % Select method (f=f2 or f=f4). Include boundary values
    f = f4;
    f(1) = f(1) + uL/dx^2;
    f(end) = f(end) + uR/dx^2;

    % Solve problem and append boundary conditions
    u = LN\f;
    uu = [uL; u; uR];

    % Plot solution
    plot(xx,uu,'b')
    xlabel('x')
    grid on

Fig. 13.6 Comparison of 2nd and 4th order methods (panels "Error vs x" and "Error vs dx"). The 1D Poisson equation $-u'' = f$ with Dirichlet conditions is solved using a standard 2nd order FDM and the 4th order Cowell method for $N = 49, 99, 199, 399$ and $799$. Left panel shows the absolute value of the global error vs. $x \in [0,1]$ for the 2nd order method (blue) as well as for Cowell's 4th order method (red). The sharp dip near $x = 0.65$ corresponds to a sign change in the global error. Right panel shows the discrete $L^2$ norm of the error vs. $\Delta x = 1/(N+1)$ for the 2nd order method (blue, top) and the 4th order Cowell method (red, bottom). The superior accuracy of the higher order method is visible in both panels.
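A convergence test in the spirit of Figure 13.6 can be sketched as a short script. The version below is not the code used for the figure; it is a minimal illustration that takes the exact solution $u(x) = -2 + x + e^{-x} + \sin(5x+1/3)$ and its right hand side from the example, imposes the corresponding Dirichlet values, and estimates the observed orders from successive halvings of $\Delta x$. The variable names and the choice of three grids are assumptions made here.

```matlab
% Sketch: observed convergence orders of the 2nd order FDM and of
% Cowell's method for -u'' = f with a known exact solution.
uex = @(x) -2 + x + exp(-x) + sin(5*x + 1/3);
f   = @(x) 25*sin(5*x + 1/3) - exp(-x);

Ns   = [49 99 199];
err2 = zeros(size(Ns));
err4 = zeros(size(Ns));
for m = 1:numel(Ns)
    N  = Ns(m);
    dx = 1/(N+1);
    xx = linspace(0,1,N+2)';         % grid including boundary points
    ff = f(xx);
    e  = ones(N,1);
    LN = spdiags([-e 2*e -e], -1:1, N, N)/dx^2;

    bc    = zeros(N,1);              % Dirichlet contributions
    bc(1) = uex(0)/dx^2;
    bc(N) = uex(1)/dx^2;

    f2 = ff(2:N+1);                                  % standard RHS
    f4 = (ff(1:N) + 10*ff(2:N+1) + ff(3:N+2))/12;    % Cowell RHS

    u2 = LN\(f2 + bc);
    u4 = LN\(f4 + bc);
    ue = uex(xx(2:N+1));
    err2(m) = sqrt(dx)*norm(u2 - ue);                % discrete L2 norm
    err4(m) = sqrt(dx)*norm(u4 - ue);
end
p2 = log2(err2(1:end-1)./err2(2:end));   % observed orders, 2nd order FDM
p4 = log2(err4(1:end-1)./err4(2:end));   % observed orders, Cowell
```

The entries of `p2` and `p4` should be close to 2 and 4, respectively, as long as the errors stay well above roundoff level.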

Later on, in connection with the Laplace and Poisson equations in 2D (or higher), we shall see that standard 2nd order FDMs can be enhanced to produce 4th order convergent results by applying a difference correction similar to the one employed above. Because the system to be solved has the same matrix as the second order method, stability is unaffected and the extra work is marginal. The gain in accuracy, however, going from second to fourth order convergence, is very substantial. The only drawback is the difficulty of applying this technique on nonuniform grids.

13.7 Higher order problems. The biharmonic equation

The boundary value problems discussed so far have involved derivatives of order two. Higher order problems refer to differential equations involving third order derivatives or higher. A typical application is in elasticity theory, and in particular the biharmonic equation, which is used in plate theory. Interestingly, both the formulation of the problem and the solution techniques are influenced by the boundary conditions.

Consider the beam equation, modeling a simply supported elastic beam. It consists of two second order problems,

$$ M'' = f $$

$$ u'' = M/(EI). $$

Here $f$ is an external load density on $[0,1]$, and $M$ is the resulting bending moment on the beam. The boundary conditions for the first equation are $M(0) = M(1) = 0$ (implying that the supported endpoints do not sustain any bending moment). In the second equation, $u$ is the deflection of the beam's center line under the bending moment $M$. The boundary conditions are $u(0) = u(1) = 0$ (reflecting firm supports that are level). Finally $E$ is a material constant, the modulus of elasticity, and $I$ is a shape parameter, the cross-sectional moment of inertia, which is determined by the beam's physical size and cross-section geometry.

Thus the beam equation consists of two coupled 1D Poisson problems with Dirichlet conditions, often referred to as articulated conditions. (A Neumann condition is referred to as a free boundary condition.) The sequence of two second order problems is mathematically (but not numerically) equivalent to a single fourth order equation. Let us for simplicity take $EI = 1$ to focus on the mathematical form of the problem. Eliminating $M$, we obtain the fourth order problem

$$ u'''' = f \tag{13.53} $$

together with the articulated boundary conditions,

$$ u''(0) = u''(1) = 0, \qquad u(0) = u(1) = 0. $$

Due to the boundary conditions, this problem can be decomposed into the two equations we started with. However, if the beam ends are clamped instead of articulated, the boundary conditions are

$$ u'(0) = u'(1) = 0, \qquad u(0) = u(1) = 0. $$

Because the boundary conditions now only involve $u$ and $u'$, the problem can no longer be decomposed into two second order problems. It is a genuine fourth order problem. This is the biharmonic equation.

In the beam equation with articulated supports, it is important to take advantage of its structure of being equivalent to two second order problems. The term "harmonic" comes from the fact that the eigenfunctions of $d^2/dx^2$ are trigonometric functions, leading to Fourier analysis or harmonic analysis. In higher dimensions the same holds for the Laplacian operator,

$$ \Delta = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}, $$

sometimes referred to as the harmonic operator. The biharmonic operator is simply

$$ \Delta^2 = \frac{\partial^4}{\partial x^4} + 2\frac{\partial^4}{\partial x^2\partial y^2} + \frac{\partial^4}{\partial y^4}, $$

which, in our 1D context above, simplifies to $d^4/dx^4$.

Using a simple second order FDM on an equidistant grid for the biharmonic and beam equations requires that we approximate $d^4/dx^4$ on the usual grid $\Omega_N \subset [0,1]$, with points $x_j = j\Delta x$ and $\Delta x = 1/(N+1)$, as follows:

$$
\begin{aligned}
u'''' &\approx \frac{1}{\Delta x^2}\left( -\frac{-u_{j-2}+2u_{j-1}-u_j}{\Delta x^2} + 2\,\frac{-u_{j-1}+2u_j-u_{j+1}}{\Delta x^2} - \frac{-u_j+2u_{j+1}-u_{j+2}}{\Delta x^2} \right) \\
&= \frac{u_{j-2}-4u_{j-1}+6u_j-4u_{j+1}+u_{j+2}}{\Delta x^4}.
\end{aligned}
$$

The boundary conditions $u(0) = u(1) = 0$, which are shared by the articulated and clamped cases, correspond to $u_0 = u_{N+1} = 0$. The clamping conditions are represented by

$$ u'(0) \approx \frac{-u_{-1}+u_1}{2\Delta x}, \qquad u'(1) \approx \frac{-u_N+u_{N+2}}{2\Delta x}. $$

Given that $u'(0) = u'(1) = 0$ the clamped conditions are

$$ u_{-1} = u_1, \qquad u_{N+2} = u_N, $$

allowing the elimination of the exterior variables and modifying the matrix elements accordingly.

If the clamped conditions are replaced by the articulated moment boundary conditions used in the beam equation, the boundary conditions are represented by

$$ u''(0) \approx \frac{u_{-1}-2u_0+u_1}{\Delta x^2}, \qquad u''(1) \approx \frac{u_N-2u_{N+1}+u_{N+2}}{\Delta x^2}. $$

Given that $u''(0) = u''(1) = 0$ and $u_0 = u_{N+1} = 0$, we use the articulated conditions

$$ u_{-1} = -u_1, \qquad u_{N+2} = -u_N, $$

which would be used as an alternative to the clamped conditions.

Collecting the information, we construct a system $P_N(\beta)u = f$, where the $N\times N$ pentadiagonal matrix $P_N(\beta)$ is given by (here exemplified for $N = 5$)

$$
P_N(\beta) = (N+1)^4\cdot
\begin{pmatrix}
\beta & -4 & 1 & 0 & 0 \\
-4 & 6 & -4 & 1 & 0 \\
1 & -4 & 6 & -4 & 1 \\
0 & 1 & -4 & 6 & -4 \\
0 & 0 & 1 & -4 & \beta
\end{pmatrix},
$$

where the boundary conditions only affect the top left and bottom right elements. For the biharmonic equation (clamped conditions), $\beta = 7$, while for the simply supported beam (articulated conditions), $\beta = 5$.

As noted above the beam problem can be factorized into two Poisson problems,

$$ M'' = f $$

$$ u'' = M $$

with $M(0) = M(1) = 0$ and $u(0) = u(1) = 0$. The standard second order discretization is then

$$ L_N M = f, \qquad L_N u = M, $$

where $L_N$ is described by the kernel $L_N = (N+1)^2\cdot\,]\,1\;\,{-2}\;\,1\,[$. It is easily verified that $P_N(5) = L_N^2$, reflecting the splitting into two consecutive Poisson problems. It is important to note that the simply supported beam problem should always be solved as two second order problems. The reason is that the condition numbers differ. Thus

$$ \kappa_2[L_N] \approx \frac{4(N+1)^2}{\pi^2}, $$

while

$$ \kappa_2[P_N(\beta)] = O((N+1)^4). $$

This means that the biharmonic equation is more sensitive to perturbations than the second order problem. When solving the biharmonic equation, the matrix $P_N(7)$ must be factorized numerically as is, while $P_N(5) = L_N^2$ is an exact factorization into two factors $L_N$, only requiring that $L_N$ be factorized numerically. In case one factorizes $P_N(5)$ numerically, perturbation sensitivity is comparable in both problems.
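The relation between the pentadiagonal matrix and the squared Poisson matrix, and the difference in condition numbers, can be checked with a few lines of MATLAB. The sketch below is an illustration only; the size `N = 8` and the helper name `P` are arbitrary choices made here.

```matlab
% Sketch: verify P_N(5) = L_N^2 and compare condition numbers.
N  = 8;
e  = ones(N,1);
LN = (N+1)^2*full(spdiags([e -2*e e], -1:1, N, N));  % kernel ]1 -2 1[

% Pentadiagonal matrix with corner elements beta (hypothetical helper)
P0     = full(spdiags([e -4*e 6*e -4*e e], -2:2, N, N));
corner = diag([1; zeros(N-2,1); 1]);
P      = @(beta) (N+1)^4*(P0 + (beta - 6)*corner);

diffP5 = norm(P(5) - LN^2, 1);   % zero when P_N(5) = L_N^2
kL  = cond(LN);                  % grows like (N+1)^2
kP5 = cond(P(5));                % equals kL^2
kP7 = cond(P(7));                % clamped case, O((N+1)^4)
```

Since all matrix entries are integers times a power of $N+1$, `diffP5` is expected to be exactly zero, and `kP5` should agree with `kL^2` up to roundoff.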

We further note that the biharmonic equation is elliptic. The eigenvalue problem

$$ u'''' = \lambda u $$

has eigenfunctions

$$ u_k(x) = \sin k\pi x $$

satisfying the articulated conditions $u(0) = u(1) = u''(0) = u''(1) = 0$, giving eigenvalues $\lambda_k = k^4\pi^4$. These are just the squares of the eigenvalues of $-d^2/dx^2$.

With clamped conditions, the eigenfunctions of the genuine biharmonic equation are more complicated. They can be written

$$ u(x) = (\cos\alpha - \cosh\alpha)(\sin\alpha x - \sinh\alpha x) - (\sin\alpha - \sinh\alpha)(\cos\alpha x - \cosh\alpha x), $$

where $\alpha = \lambda^{1/4}$. By construction, $u(x)$ satisfies the three conditions $u(0) = u(1) = 0$ and $u'(0) = 0$, and the parameter $\alpha$ is determined by the last clamped condition, $u'(1) = 0$. Since

$$
\begin{aligned}
u'(1) &= \alpha\cdot(\cos\alpha - \cosh\alpha)^2 + \alpha\cdot(\sin\alpha - \sinh\alpha)(\sin\alpha + \sinh\alpha) \\
&= \alpha\cdot(\cos^2\alpha - 2\cos\alpha\cosh\alpha + \cosh^2\alpha + \sin^2\alpha - \sinh^2\alpha) \\
&= 2\alpha - 2\alpha\cos\alpha\cosh\alpha,
\end{aligned}
$$

it follows that $\alpha$ must satisfy the nonlinear equation

$$ \cos\alpha = \frac{1}{\cosh\alpha}. \tag{13.54} $$

There is an infinite suite of solutions $\alpha_k$, for $k \in \mathbb{Z}^+$, as illustrated in Figure 13.7. The smallest positive root, $\alpha_1 \gtrsim 3\pi/2$, must be computed numerically, and as the corresponding eigenvalue of the differential operator is $\lambda_1 = \alpha_1^4$, we get

$$ \alpha_1 \approx 4.73004075, \qquad \lambda_1 \approx 500.564. $$
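The roots of the nonlinear equation can be computed with a standard scalar root finder. The MATLAB sketch below is one possible way to do this, assuming the equivalent form $\cos\alpha\cosh\alpha - 1 = 0$ and brackets of half-width 0.5 centred at $(2k+1)\pi/2$; the function handle `g` and the bracket width are choices made here, not prescribed by the text.

```matlab
% Sketch: first few roots of cos(alpha) = 1/cosh(alpha) using fzero.
g = @(a) cos(a).*cosh(a) - 1;        % same roots as cos(a) = 1/cosh(a)

alpha = zeros(3,1);
for k = 1:3
    c = (2*k+1)*pi/2;                % roots lie close to (2k+1)*pi/2
    alpha(k) = fzero(g, [c-0.5, c+0.5]);   % g changes sign on the bracket
end
lambda = alpha.^4;                   % eigenvalues of d^4/dx^4, clamped
```

The first entries, `alpha(1)` and `lambda(1)`, should reproduce the values $4.73004075$ and $500.564$ quoted above.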

Thus the biharmonic operator is selfadjoint and elliptic. Except for the first few eigenvalues, which are larger, the eigenvalues approach, for $k \in \mathbb{Z}^+$,

$$ \lambda_k\left[\frac{d^4}{dx^4}\right] \approx \frac{(2k+1)^4\pi^4}{2^4} \approx k^4\pi^4 $$

on the interval $[0,1]$ with clamped boundary conditions, $u(0) = u(1) = 0$ and $u'(0) = u'(1) = 0$. This is easily verified by computing the eigenvalues of $P_N(7)$, see Figure 13.7. They can also be compared to the eigenvalues of $P_N(5)$. The latter are smaller, corresponding to the fact that the clamped beam is stiffer than the simply supported beam. This is in particular noticeable in the first eigenvalue of the fourth order differential operator with articulated conditions, for which $\lambda_1 = \pi^4 \approx 97.409$, more than a factor of 5 smaller than the first eigenvalue of the operator with clamped conditions.

Fig. 13.7 Eigenvalues of biharmonic operator. Top panel shows left and right hand sides of the equation $\cos\alpha = 1/\cosh\alpha$ as functions of $\alpha$. Roots are found where the graphs intersect, indicated by blue markers. The first root $\alpha_1$ occurs at $\alpha \approx 3\pi/2$, and yields $\lambda_1 = \alpha_1^4 \approx 500$. Lower panel shows the first 30 eigenvalues of $P_N(7)$ (blue) vs. $k$ for $N = 999$, approximating the first 30 eigenvalues of $d^4/dx^4$ with clamped conditions. These are compared to the corresponding eigenvalues of $d^4/dx^4$ with articulated conditions (red). Asymptotically, for large $k$, the eigenvalues are almost the same.

Example Consider the problem $u'''' = -1$ on $\Omega = [0,1]$. With articulated boundary conditions $u(0) = u(1) = u''(0) = u''(1) = 0$, this equation represents a simply supported beam under a uniform load density of $f(x) = -1$. The operator is elliptic with $m_2[d^4/dx^4] = \pi^4$. Since the load is constant, the analytic solution can be found through four integrations; it is

$$ u(x) = \frac{1}{24}\left(-x^4 + 2x^3 - x\right). $$

With clamped conditions $u(0) = u(1) = u'(0) = u'(1) = 0$ the problem instead represents the biharmonic equation for a beam sustaining bending moments at the endpoints. The operator is elliptic, but the lower logarithmic norm has now increased to $m_2[d^4/dx^4] \gtrsim 81\pi^4/16$. The solution is

$$ u(x) = \frac{1}{24}\left(-x^4 + 2x^3 - x^2\right). $$

Fig. 13.8 Beam and biharmonic equations. The problem $u'''' = -1$ is solved on a grid with $N = 49$ interior points (top panel, solution vs. $x$), using clamped conditions (upper curve, red markers), and with articulated conditions (lower curve, blue markers). The clamped beam is stiffer than the simply supported beam, due to nonzero bending moments at the endpoints. Lower panel shows the discrete $L^2$ norm of the global error for the simply supported beam (top, blue) and the biharmonic problem (bottom, red) vs. mesh width $\Delta x = 1/(N+1)$. Second order convergence is clearly visible. For $\Delta x$ small, the error starts increasing due to roundoff in the factorization of $P_N(7)$.

Using a symmetric 2nd order consistent FDM on a uniform grid $\Omega_N \subset \Omega$ with $N$ interior points will in both cases produce a local residual of the form $C\cdot\Delta x^2 u''''(x)$. However, the global error will be different; it is bounded by

$$ \|e\|_{L^2(\Omega_N)} \le \frac{\|r\|_{L^2(\Omega_N)}}{m_2[P_N(\beta)]}, $$

where $\beta = 7$ for clamped conditions, and $\beta = 5$ for articulated conditions. Because the logarithmic norm is determined by the smallest eigenvalue of $d^4/dx^4$, and the FDM approximates these eigenvalues to second order accuracy, we have $m_2[P_N(5)] \approx \pi^4 \approx 100$, and $m_2[P_N(7)] \approx 81\pi^4/16 \approx 500$, so the error is expected to be smaller with clamped conditions. In both cases, the FDM is second order convergent, by virtue of the global error bound. Solving the problems, one obtains the solutions shown in Figure 13.8.
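The two beam problems in this example can be solved with the pentadiagonal matrix $P_N(\beta)$ in a few lines. The MATLAB sketch below is a minimal illustration, not the code behind Figure 13.8; it uses $N = 49$ interior points, the exact solutions stated in the example, and variable names chosen here.

```matlab
% Sketch: solve u'''' = -1 with articulated (beta = 5) and clamped
% (beta = 7) conditions and compare with the exact solutions.
N  = 49;
dx = 1/(N+1);
x  = (1:N)'*dx;
e  = ones(N,1);
P0     = full(spdiags([e -4*e 6*e -4*e e], -2:2, N, N));
corner = diag([1; zeros(N-2,1); 1]);
f  = -ones(N,1);                         % uniform load

uArt = ((P0 - corner)/dx^4)\f;           % P_N(5): simply supported beam
uCla = ((P0 + corner)/dx^4)\f;           % P_N(7): clamped beam

exArt = (-x.^4 + 2*x.^3 - x)/24;         % exact, articulated
exCla = (-x.^4 + 2*x.^3 - x.^2)/24;      % exact, clamped

errArt = sqrt(dx)*norm(uArt - exArt);    % discrete L2 norm of the error
errCla = sqrt(dx)*norm(uCla - exCla);
```

Both errors should be small compared with the deflections themselves, and the clamped deflection `uCla` should be visibly smaller in magnitude than `uArt`, reflecting the stiffer support.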


13.8 Singular problems and boundary layer problems

13.9 Nonuniform grids

In problems with singularities and boundary layers it is desirable to allocate the grid points to the subintervals where there is significant change in the solution. In boundary layer problems, this is at one of the boundaries, or possibly both. In FDM, derivatives are still approximated by difference quotients, but the mesh width $\Delta x$ varies.

For a nonuniform grid $\{x_j\}_0^{N+1}$ we define the mesh width locally, in terms of forward and backward differences, i.e.,

$$ \Delta x_j = x_{j+1} - x_j, \qquad \nabla x_j = x_j - x_{j-1}. $$

To approximate derivatives by divided differences, we use the forward and backward difference operators,

$$ \Delta y_j = y_{j+1} - y_j, \qquad \nabla y_j = y_j - y_{j-1}. $$

Left and right divided differences are now defined as

$$ D_-y_j = \frac{y_j - y_{j-1}}{x_j - x_{j-1}} = \frac{\nabla y_j}{\nabla x_j}, \qquad D_+y_j = \frac{y_{j+1} - y_j}{x_{j+1} - x_j} = \frac{\Delta y_j}{\Delta x_j}. $$

Using these difference quotients we approximate derivatives by the divided differences

$$ y'(x_j) \approx \frac{\nabla x_j\,D_+y_j + \Delta x_j\,D_-y_j}{\Delta x_j + \nabla x_j}, \qquad y''(x_j) \approx 2\,\frac{D_+y_j - D_-y_j}{\Delta x_j + \nabla x_j}. $$

This is 2nd order only on smooth grids, which means that the local mesh width must change slowly, with $\Delta x_j/\nabla x_j = 1 + O(N^{-1})$.
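The divided difference formulas are straightforward to apply on a concrete grid. The MATLAB sketch below is an illustration under assumptions made here: the grid is the smooth exponential map $x = (e^{2s}-1)/(e^2-1)$ of a uniform grid in $s$ (so that $\Delta x_j/\nabla x_j = e^{2/(N+1)} = 1 + O(N^{-1})$), and the test function is $y = e^x$. The observed orders are estimated from two grid sizes.

```matlab
% Sketch: first and second derivative approximations on a smooth
% nonuniform grid. Grid map and test function are illustrative choices.
Ns   = [100 200];
err1 = zeros(1,2);
err2 = zeros(1,2);
for m = 1:2
    N = Ns(m);
    s = linspace(0,1,N+2)';
    x = (exp(2*s) - 1)/(exp(2) - 1);     % grid clustered near x = 0
    y = exp(x);                          % test function, y' = y'' = exp(x)

    j   = (2:N+1)';                      % interior indices
    dxF = x(j+1) - x(j);                 % forward mesh width
    dxB = x(j) - x(j-1);                 % backward mesh width
    Dp  = (y(j+1) - y(j))./dxF;          % right divided difference
    Dm  = (y(j) - y(j-1))./dxB;          % left divided difference

    d1 = (dxB.*Dp + dxF.*Dm)./(dxF + dxB);   % approximates y'
    d2 = 2*(Dp - Dm)./(dxF + dxB);           % approximates y''

    err1(m) = max(abs(d1 - exp(x(j))));
    err2(m) = max(abs(d2 - exp(x(j))));
end
h  = 1./(Ns + 1);                        % uniform spacing in s
p1 = log(err1(1)/err1(2))/log(h(1)/h(2));
p2 = log(err2(1)/err2(2))/log(h(1)/h(2));
```

On this smooth grid both `p1` and `p2` should be close to 2.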

Chapter 14
Finite element methods

While the finite difference method (FDM) takes a linear operator equation

$$ \mathcal{L}u = f \tag{14.1} $$

on $\Omega = [0,1]$ and converts it into a linear algebraic equation

$$ Lu = f + b $$

on a grid $\Omega_N \subset \Omega$, where $L$ is a matrix and $u$ is a vector, the finite element method (FEM) leaves the operator $\mathcal{L}$ intact, instead representing the solution $u(x)$ by a piecewise polynomial,

$$ u_{\Delta x}(x) = \sum_{j=1}^{N} c_j\varphi_j(x). \tag{14.2} $$

Thus the approximant $u_{\Delta x}(x)$ is a linear combination of basis functions $\varphi_j(x)$, often referred to as shape functions. We will still use a grid for the construction of the functions $\varphi_j(x)$, which are also piecewise polynomials of compact support. The term "compact support" means that each basis function is nonzero only on a (small) compact interval. Usually the construction is such that a given basis function is nonzero only on two adjacent cells, where a "cell" refers to the interval between two neighboring grid points. The shape functions are often constructed to be a partition of unity, implying that

$$ \sum_{j=1}^{N}\varphi_j(x) = 1. $$

When $u_{\Delta x}(x)$ is inserted into the original operator equation (14.1), there will obviously be a residual,

$$ \mathcal{L}u_{\Delta x} = f + r. $$

The finite element method now needs to define the coefficients $c_j$. The criterion is to minimize the residual with respect to the $L^2(\Omega)$ norm. This is equivalent to using the least squares method. Given that the residual is

$$r = \mathcal{L}u_{\Delta x} - f,$$

we require that the residual is orthogonal to the set of basis functions, $\{\varphi_j\}_1^N$. This is referred to as the Galerkin method.

Since orthogonality is defined in terms of an inner product, the best approximation is characterized by

$$\langle \varphi_i, r\rangle = 0,$$

for all $i = 1:N$, or simply

$$\sum_{j=1}^{N} c_j\langle \varphi_i, \mathcal{L}\varphi_j\rangle = \langle \varphi_i, f\rangle; \quad i = 1:N,$$

where we have used the linearity of the operator $\mathcal{L}$, and the fact that the inner product is a bilinear form. This now results in a linear system of equations for determining the coefficients $c_j$.

The set of basis functions span a space $V_{\Delta x}$. Since $u_{\Delta x} \in V_{\Delta x}$, the orthogonality requirement above implies that $u_{\Delta x}$ is the best approximation to be found in $V_{\Delta x}$; any change in the coefficients will increase the norm of the residual, and therefore be worse. Thus, the discrete solution cannot be improved upon, without finding a "better space" $V_{\Delta x}$, e.g. by refining the mesh width $\Delta x$.

The procedure described above has a shortcoming, however. We will be interested in solving the Poisson equation $-u'' = f$, where the operator $\mathcal{L}$ is a second order derivative. The simplest basis functions are piecewise linear functions. But a piecewise linear function cannot be used, because its second derivative is zero everywhere, except at the grid points, where $u_{\Delta x}(x)$ is not two times differentiable. Therefore the finite element method needs a modification of the approach above. Thus we will reformulate the operator equation in its weak form, which is obtained by integration by parts. With this change, the FEM proceeds according to the pattern above.

The finite element method has many advantages over finite difference methods. There are numerous variations on the theme. For example, we can choose basis functions of high degrees to create high order methods, and with different continuity requirements between cells. In the simplest case mentioned above, the approximating function $u_{\Delta x}(x)$ is piecewise linear and continuous. That method is referred to as the continuous Galerkin method cG(1), where the number 1 refers to the degree of the piecewise polynomials. There are also other variants, where the approximant is not required to be continuous (favoring other properties, such as conservation principles). These are referred to as discontinuous Galerkin methods, dG(·). Further variants may impose the orthogonality condition differently. But the outstanding advantage of finite elements in 2D and 3D problems is that the grid and cells can easily be adapted to complex geometries. By contrast, finite difference methods have considerable difficulties in such situations.


14.1 The weak form

Let us begin by considering the 1D Poisson equation $-u'' = f$ on $\Omega = [0,1]$, with homogeneous Dirichlet boundary conditions, $u(0) = u(1) = 0$. This formulation is called the strong form. It requires that the differential equation is satisfied pointwise, for all $x \in \Omega$.

To obtain the weak form, let $v \in H_0^1(\Omega)$ be a function satisfying the boundary conditions. Construct the inner product of $v$ and the differential equation in the strong form to get

$$-\langle v, u''\rangle = \langle v, f\rangle.$$

In the left hand side, we integrate by parts to get

$$\langle v', u'\rangle = \langle v, f\rangle. \tag{14.3}$$

This is the weak form of the differential equation. It is "weak," because $u$ is now only required to be differentiable a single time, and (14.3) does not require that $-u'' = f$ holds pointwise, but only in an average sense. Thus, while the strong solution satisfies the weak formulation, the converse is not true; a solution to the weak formulation is an approximate solution to $-u'' = f$.
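The integration by parts behind (14.3) can be written out in one line. The sketch assumes that $\langle\cdot,\cdot\rangle$ is the standard $L^2(\Omega)$ inner product, $\langle v,u\rangle = \int_0^1 v(x)\,u(x)\,dx$, which is consistent with the $L^2(\Omega)$ norm used above but is not stated explicitly in this section:

```latex
-\langle v, u''\rangle
  = -\int_0^1 v\,u''\,dx
  = -\bigl[\,v\,u'\,\bigr]_0^1 + \int_0^1 v'\,u'\,dx
  = \langle v', u'\rangle ,
\qquad \text{since } v(0) = v(1) = 0 .
```

The boundary term vanishes only because the test function $v$ satisfies the homogeneous Dirichlet conditions; no condition on $u'$ at the boundary is needed.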

This approach is formalized in the following way.

Definition 14.1. (Energy norm) Let $u, v \in H_0^1(\Omega)$. The energy norm is the bilinear form $a : H_0^1(\Omega) \times H_0^1(\Omega) \to \mathbb{R}$, defined by

$$a(v,u) = \langle v', u'\rangle. \tag{14.4}$$

The energy norm derives its name from the fact that $a(u,u) = \|u'\|_2^2$, and that the potential energy of an elastic beam, whose deflection from the mean, equilibrium line is $u$, is proportional to $\|u'\|_2^2$. This is similar to the energy stored in a linear spring, which has been compressed a given distance. While this notion of energy is, at least in part, an intuitive concept, it is also related to a variational formulation of the problem. The equilibrium solution is found where the virtual work is zero. This principle is of key importance in finite element methods, where the (approximate) solution is found by a variational principle. In order to compute such an approximation, we need to formulate the equation that characterizes the optimal solution. This is done by using the energy norm.

Definition 14.2. (Weak form) The weak form of the Poisson equation $-u'' = f$ with homogeneous Dirichlet conditions is defined by requiring that

$$a(v,u) = \langle v, f\rangle \tag{14.5}$$

holds for all test functions $v \in H_0^1(\Omega)$. A function $u \in H_0^1(\Omega)$ satisfying (14.5) is called a weak solution to the Poisson equation.


We shall return to the question of whether there exists a solution to this problem. This follows from the Lax–Milgram lemma, which relies on ellipticity and some further properties of the problem.

14.2 The cG(1) finite element method in 1D

The cG(1) method refers to the continuous Galerkin method with linear elements. (A linear element is a polynomial of degree 1.) Its shape functions are piecewise linear functions, of compact support. Given a grid $\Omega_N = \{x_j\}_{j=0}^{N+1}$, the basis functions $\{\varphi_i(x)\}_{i=0}^{N+1}$ satisfy

$$\varphi_i(x_j) = \delta_{ij},$$

where $\delta_{ij}$ is the Kronecker delta. Thus, since the shape functions are piecewise linear, they satisfy

$$\begin{aligned}
\varphi_i(x) &= \frac{x - x_{i-1}}{x_i - x_{i-1}} = \frac{x - x_{i-1}}{\Delta x}, & x &\in [x_{i-1}, x_i]\\
\varphi_i(x) &= \frac{x_{i+1} - x}{x_{i+1} - x_i} = \frac{x_{i+1} - x}{\Delta x}, & x &\in [x_i, x_{i+1}].
\end{aligned}$$

This choice of basis functions is considered the simplest in FEM theory. They are often referred to as hat functions, perhaps motivated by their appearance when graphed, see Figure 14.1.
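The hat functions are easy to evaluate directly from the two-branch formula above. The following plain-Python sketch (the function name `hat` and the grid size are illustrative assumptions) checks the Kronecker delta property and the partition of unity, where the sum is taken over all $N+2$ functions including the two boundary ones:

```python
def hat(i, x, grid):
    """Evaluate the piecewise linear shape function phi_i at the point x."""
    xi = grid[i]
    if i > 0 and grid[i - 1] <= x <= xi:
        return (x - grid[i - 1]) / (xi - grid[i - 1])   # rising flank on [x_{i-1}, x_i]
    if i < len(grid) - 1 and xi <= x <= grid[i + 1]:
        return (grid[i + 1] - x) / (grid[i + 1] - xi)   # falling flank on [x_i, x_{i+1}]
    return 0.0                                          # outside the support


N = 9
grid = [j / (N + 1) for j in range(N + 2)]   # uniform grid x_0, ..., x_{N+1}

# phi_i(x_j) = delta_ij at every grid point.
delta_ok = all(
    hat(i, grid[j], grid) == (1.0 if i == j else 0.0)
    for i in range(N + 2)
    for j in range(N + 2)
)

# Sum of all shape functions at an arbitrary point of [0, 1].
unity = sum(hat(i, 0.37, grid) for i in range(N + 2))
```

The boundary functions $\varphi_0$ and $\varphi_{N+1}$ only have one flank inside $[0,1]$, which is why the two branches are guarded by index checks.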

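Now, let us consider the 1D Poisson equation $-u'' = f$ with inhomogeneous Dirichlet conditions $u(0) = u_L$ and $u(1) = u_R$. To apply the cG(1) method, we write

$$u_{\Delta x}(x) = u_L\varphi_0(x) + \sum_{j=1}^{N} c_j\varphi_j(x) + u_R\varphi_{N+1}(x), \tag{14.6}$$

and require that this ansatz satisfy the weak formulation,

$$\langle \varphi_i', u_{\Delta x}'\rangle = \langle \varphi_i, f\rangle, \quad i = 1:N. \tag{14.7}$$

Writing this as a system of equations, especially including the first and last equations to demonstrate the influence from the boundary conditions, we have

$$\begin{aligned}
u_L\langle \varphi_1', \varphi_0'\rangle + c_1\langle \varphi_1', \varphi_1'\rangle + c_2\langle \varphi_1', \varphi_2'\rangle &= \langle \varphi_1, f\rangle\\
c_{i-1}\langle \varphi_i', \varphi_{i-1}'\rangle + c_i\langle \varphi_i', \varphi_i'\rangle + c_{i+1}\langle \varphi_i', \varphi_{i+1}'\rangle &= \langle \varphi_i, f\rangle, \quad i = 2:N-1\\
c_{N-1}\langle \varphi_N', \varphi_{N-1}'\rangle + c_N\langle \varphi_N', \varphi_N'\rangle + u_R\langle \varphi_N', \varphi_{N+1}'\rangle &= \langle \varphi_N, f\rangle.
\end{aligned}$$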

Fig. 14.1 Shape functions for cG(1) FEM (two panels; horizontal axis $x$ from 0 to 1, vertical axis from 0 to 1). Piecewise linear basis functions are plotted on $\Omega_N \subset [0,1]$ for $\Delta x = 0.1$. Top panel shows a generic basis function (blue) at the center, and the two special boundary basis functions (red, green) used to match boundary conditions. Bottom panel shows three adjacent basis functions. Because each function has support only on two neighboring cells, only three consecutive basis functions overlap.

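This is a system of $N$ linear equations, for the $N$ unknowns $\mathbf{c} = \{c_j\}_1^N$,

$$\begin{aligned}
c_1\langle \varphi_1', \varphi_1'\rangle + c_2\langle \varphi_1', \varphi_2'\rangle &= \langle \varphi_1, f\rangle - u_L\langle \varphi_1', \varphi_0'\rangle\\
c_{i-1}\langle \varphi_i', \varphi_{i-1}'\rangle + c_i\langle \varphi_i', \varphi_i'\rangle + c_{i+1}\langle \varphi_i', \varphi_{i+1}'\rangle &= \langle \varphi_i, f\rangle, \quad i = 2:N-1\\
c_{N-1}\langle \varphi_N', \varphi_{N-1}'\rangle + c_N\langle \varphi_N', \varphi_N'\rangle &= \langle \varphi_N, f\rangle - u_R\langle \varphi_N', \varphi_{N+1}'\rangle.
\end{aligned}$$

It can be written

$$K_N\mathbf{c} = \mathbf{g} + \mathbf{b}, \tag{14.8}$$

where $K_N \in \mathbb{R}^{N\times N}$ is referred to as the stiffness matrix. It is tridiagonal, due to the fact that a given shape function only overlaps with its nearest neighbors. The matrix elements are

$$k_{ii} = \langle \varphi_i', \varphi_i'\rangle, \quad\text{and}\quad k_{ij} = \langle \varphi_i', \varphi_j'\rangle, \quad\text{for } j = i \pm 1.$$

Since $\langle \varphi_i', \varphi_{i+1}'\rangle = \langle \varphi_{i+1}', \varphi_i'\rangle$ it follows that $k_{ij} = k_{ji}$; in other words, the stiffness matrix is symmetric. This reflects the fact that the operator $-d^2/dx^2$ is selfadjoint.

To assemble the matrix on a uniform grid, we compute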


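$$k_{ii} = \langle \varphi_i', \varphi_i'\rangle = \int_{x_{i-1}}^{x_{i+1}} |\varphi_i'(x)|^2\,dx = 2\Delta x\cdot\frac{1}{\Delta x^2} = \frac{2}{\Delta x},$$

where the limits of the integral represent the subinterval of $\Omega = [0,1]$ where $\varphi_i'$ has support, cp. Figure 14.1. We also compute

$$k_{i,i+1} = \langle \varphi_i', \varphi_{i+1}'\rangle = \int_{x_i}^{x_{i+1}} \varphi_i'(x)\,\varphi_{i+1}'(x)\,dx = \Delta x\cdot\frac{-1}{\Delta x^2} = -\frac{1}{\Delta x},$$

where the integration interval corresponds to the overlapping of the two shape functions, where $\varphi_i'(x)\,\varphi_{i+1}'(x) \neq 0$. Having computed these inner products, the stiffness matrix can be assembled, and

$$K_N = \frac{1}{\Delta x}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0\\
-1 & 2 & -1 & \cdots & 0\\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1\\
0 & \cdots & 0 & -1 & 2
\end{pmatrix}. \tag{14.9}$$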

Here we recognize the usual symmetric positive definite tridiagonal Toeplitz matrix $T = \operatorname{tridiag}(-1,\ 2,\ -1)$ that we encountered when the 1D Poisson problem was solved using the finite difference method.
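In practice the stiffness matrix is often built cell by cell rather than entry by entry. The sketch below is one such assembly loop in plain Python; it is an illustration under the assumption that the slopes of the two hat functions on a cell of width $h$ are $-1/h$ and $+1/h$, and the function name is hypothetical. On a uniform grid it reproduces (14.9).

```python
def assemble_stiffness(grid):
    """Assemble K_N for the interior nodes 1..N by looping over the cells."""
    N = len(grid) - 2
    K = [[0.0] * N for _ in range(N)]
    for cell in range(N + 1):                  # cell = [x_cell, x_{cell+1}]
        h = grid[cell + 1] - grid[cell]
        # On this cell the left node's hat has slope -1/h, the right node's +1/h.
        local = ((cell, -1.0), (cell + 1, 1.0))
        for a, sa in local:
            for b, sb in local:
                if 1 <= a <= N and 1 <= b <= N:       # boundary nodes are not unknowns
                    K[a - 1][b - 1] += sa * sb / h    # integral of phi_a' * phi_b' over the cell
    return K


grid = [j / 5 for j in range(6)]    # N = 4 interior nodes, dx = 0.2
K = assemble_stiffness(grid)        # diagonal 2/dx = 10, off-diagonals -1/dx = -5
```

Because the loop only uses the local cell width, the same code accepts grids with unequal cells, although only the uniform case is derived in the text above.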

Since we have already computed $\langle \varphi_i', \varphi_{i+1}'\rangle$, we can also account for the boundary conditions, which involve the shape functions on the boundary. Thus the first and last elements of the vector $\mathbf{b}$ in (14.8) are

$$b_1 = \frac{u_L}{\Delta x}, \qquad b_N = \frac{u_R}{\Delta x}.$$

The rest of its elements are zero. This, too, conforms to the results for FDM. However, the difference between the cG(1) FEM and the standard 2nd order FDM becomes apparent when considering the vector $\mathbf{g}$. The issue is that one needs to compute the integrals $g_i = \langle \varphi_i, f\rangle$. This is in general not possible. Instead, these integrals must be approximated numerically. This is done as follows.

We sample the function $f$ on the grid $\Omega_N$, including its boundary points, since, in general, $f(x)$ is nonzero also on the boundary of $\Omega = [0,1]$. This gives us a vector $\mathbf{f} = \{f(x_j)\}_0^{N+1}$, allowing us to represent the function $f$ as a piecewise linear function,

$$f(x) \approx \sum_{j=0}^{N+1} f_j\varphi_j(x),$$

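where $f_j = f(x_j)$. This means that we use linear interpolation between the samples; it is a second order accurate approximation, provided that $f \in C^2[0,1]$. We can now compute the inner products

$$\langle \varphi_i, f\rangle \approx \sum_{j=0}^{N+1} f_j\langle \varphi_i, \varphi_j\rangle = \sum_{\nu=-1}^{1} f_{i+\nu}\langle \varphi_i, \varphi_{i+\nu}\rangle,$$

again due to the fact that the shape function only overlaps with its nearest neighbors. We now have

$$\langle \varphi_i, \varphi_i\rangle = \int_{x_{i-1}}^{x_i} \left(\frac{x - x_{i-1}}{\Delta x}\right)^2 dx + \int_{x_i}^{x_{i+1}} \left(\frac{x_{i+1} - x}{\Delta x}\right)^2 dx = \frac{2\Delta x}{3},$$

and

$$\langle \varphi_i, \varphi_{i+1}\rangle = \int_{x_i}^{x_{i+1}} \frac{(x_{i+1} - x)(x - x_i)}{\Delta x^2}\,dx = \frac{\Delta x}{6}.$$

This again yields a Toeplitz matrix, but it is $N \times (N+2)$. Computing the right hand side vector $\mathbf{g}$ corresponds to convolving the sampled function $\mathbf{f}$ with a kernel

$$M_N = \frac{\Delta x}{6}\operatorname{tridiag}(1,\ 4,\ 1). \tag{14.10}$$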

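This is an averaging operator, which can be compared to the difference correction used in the Cowell FDM, but its origin is different. Also, it does not increase the order of accuracy; it is merely a matter of calculating the load function $\mathbf{g}$ in the linear system. Thus the cG(1) equations for the 1D Poisson equation can be written

$$\frac{1}{\Delta x}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0\\
-1 & 2 & -1 & \cdots & 0\\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1\\
0 & \cdots & 0 & -1 & 2
\end{pmatrix}
\begin{pmatrix} c_1\\ c_2\\ \vdots\\ c_N \end{pmatrix}
= \frac{\Delta x}{6}
\begin{pmatrix}
4 & 1 & 0 & \cdots & 0\\
1 & 4 & 1 & \cdots & 0\\
 & \ddots & \ddots & \ddots & \\
 & & 1 & 4 & 1\\
0 & \cdots & 0 & 1 & 4
\end{pmatrix}
\begin{pmatrix} f_1\\ f_2\\ \vdots\\ f_N \end{pmatrix}
+
\begin{pmatrix} d_1\\ 0\\ \vdots\\ 0\\ d_N \end{pmatrix},$$

where the first and last elements of the vector $\mathbf{d}$ are

$$d_1 = \frac{f_0\,\Delta x}{6} + \frac{u_L}{\Delta x}, \qquad d_N = \frac{f_{N+1}\,\Delta x}{6} + \frac{u_R}{\Delta x}.$$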

In matrix–vector form the system reads

$$K_N\mathbf{c} = M_N\mathbf{f} + \mathbf{d}, \tag{14.11}$$

where $M_N$ is referred to as the mass matrix. The $N \times N$ mass matrix is a symmetric positive definite tridiagonal Toeplitz matrix. By solving (14.11), we obtain the vector $\mathbf{c}$, and therefore also the piecewise linear approximate solution $u_{\Delta x}(x)$ for $x \in \Omega = [0,1]$. Moreover, since $\varphi_j(x_i) = \delta_{ij}$, it follows from (14.6) that

$$\begin{aligned}
u_{\Delta x}(x_0) &= u_L\\
u_{\Delta x}(x_i) &= \sum_{j=1}^{N} c_j\varphi_j(x_i) = c_i, \quad i = 1:N\\
u_{\Delta x}(x_{N+1}) &= u_R.
\end{aligned}$$

Thus $c_i \approx u(x_i)$ approximates the exact solution on the grid, and the interpolant $u_{\Delta x}(x) \approx u(x)$ for all $x \in \Omega = [0,1]$.
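To make the system (14.11) concrete, here is a minimal plain-Python sketch of a cG(1) solver on a uniform grid. It is an illustration, not a reference implementation: the function names are hypothetical, and the tridiagonal system is solved with the Thomas algorithm, a standard elimination scheme that is not discussed in the text above. The test problem $u(x) = \sin \pi x$ is likewise an assumption chosen for convenience.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; row i reads lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1]."""
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):                      # forward elimination
        w = lower[i] / d[i - 1]
        d[i] -= w * upper[i - 1]
        r[i] -= w * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x


def cg1_poisson_dirichlet(f, N, uL=0.0, uR=0.0):
    """Solve -u'' = f on [0, 1] with u(0) = uL, u(1) = uR via K_N c = M_N f + d."""
    dx = 1.0 / (N + 1)
    x = [j * dx for j in range(N + 2)]
    fv = [f(xj) for xj in x]                   # samples f_0, ..., f_{N+1}
    # (dx/6)*(f_{i-1} + 4 f_i + f_{i+1}); for i = 1 and i = N this already
    # contains the f_0 and f_{N+1} parts of d_1 and d_N.
    rhs = [dx / 6.0 * (fv[i - 1] + 4.0 * fv[i] + fv[i + 1]) for i in range(1, N + 1)]
    rhs[0] += uL / dx                          # boundary part of d_1
    rhs[-1] += uR / dx                         # boundary part of d_N
    off = [-1.0 / dx] * N
    c = solve_tridiagonal(off, [2.0 / dx] * N, off, rhs)
    return x, [uL] + c + [uR]                  # nodal values of u_dx on the full grid


def max_nodal_error(N):
    """Largest nodal error for the test problem with exact solution sin(pi*x)."""
    x, u = cg1_poisson_dirichlet(lambda t: math.pi ** 2 * math.sin(math.pi * t), N)
    return max(abs(ui - math.sin(math.pi * xi)) for xi, ui in zip(x, u))


err_coarse = max_nodal_error(19)   # dx = 0.05
err_fine = max_nodal_error(39)     # dx = 0.025
```

Halving $\Delta x$ should reduce the nodal error by roughly a factor of four, in line with the second order accuracy discussed in this chapter.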

The cG(1) FEM can be compared to the standard FDM for the same problem. The latter produced the linear system $L_N\mathbf{u} = \mathbf{f} + \mathbf{b}$. The relation between $L_N$ and $K_N$ is trivial, as $K_N = \Delta x \cdot L_N$. The main difference is in the right hand side. While the boundary conditions enter in a similar way (scaled by $\Delta x$), the load function $f$ is treated differently, inasmuch as the FEM employs the mass matrix to create a local averaging of function values. In addition, the FEM also uses the values of $f$ on the boundary, $f(0)$ and $f(1)$. This is not unlike the 4th order Cowell FDM, but there the matrix differs from the mass matrix in the FEM. Later we shall also see that on nonuniform grids the differences are greater than they appear. There, we will find significant differences also between $L_N$ and $K_N$.

Naturally, the cG(1) FEM can be used also to solve eigenvalue problems. For further comparison with the FDM, we apply the FEM to the eigenvalue problem

$$-u'' = \lambda u, \quad u(0) = u(1) = 0. \tag{14.12}$$

Due to the homogeneous Dirichlet conditions, we have $u_L = u_R = 0$, so the ansatz (14.6) does not contain the boundary shape functions $\varphi_0(x)$ and $\varphi_{N+1}(x)$. This holds also in the right hand side, simplifying the problem. The construction of the system is otherwise identical to the derivation above, and one arrives at the generalized eigenvalue problem

$$K_N\mathbf{u} = \lambda M_N\mathbf{u}. \tag{14.13}$$

This is not a standard eigenvalue problem due to the appearance of the mass matrix in the right hand side. There are special computational methods that solve generalized eigenvalue problems directly, but here we note that (14.13) is mathematically equivalent to the problem

$$M_N^{-1}K_N\mathbf{u} = \lambda\mathbf{u}. \tag{14.14}$$

Since $M_N$ is symmetric positive definite it is invertible, with a symmetric positive definite inverse. Thus, in the cG(1) method we obtain discrete eigenvalues

$$\lambda_k[M_N^{-1}K_N].$$

These should be close to those produced by the standard 2nd order FDM, and the error should be similar since the cG(1) method is 2nd order. This is also seen to be the case in Figure 14.2, where the same problem is also solved using Cowell's 4th order method. The algebraic problem is again a generalized eigenvalue problem, like in the cG(1) method, but with a different weight matrix replacing the mass matrix.
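The discrete eigenvalues of (14.13) can be explored without a general eigenvalue solver. The sketch below relies on a standard fact that is not stated in the text: the sampled sines $v_j = \sin(k\pi x_j)$ are eigenvectors of every symmetric tridiagonal Toeplitz matrix, hence of both $K_N$ and $M_N$. Under that assumption the $k$-th eigenvalue is the Rayleigh quotient $\mathbf{v}^{\mathrm T}K_N\mathbf{v}/\mathbf{v}^{\mathrm T}M_N\mathbf{v}$, and the code also returns the residual of (14.13) as a check. The function name is hypothetical.

```python
import math


def cg1_eigenvalue(k, N):
    """Return the k-th cG(1) eigenvalue and the residual max|K v - lam * M v|."""
    dx = 1.0 / (N + 1)
    # Trial vector v_j = sin(k*pi*x_j), padded with the zero boundary values.
    v = [math.sin(k * math.pi * j * dx) for j in range(N + 2)]
    v[0] = v[N + 1] = 0.0
    # Apply the tridiagonal matrices K_N and M_N row by row.
    Kv = [(-v[j - 1] + 2.0 * v[j] - v[j + 1]) / dx for j in range(1, N + 1)]
    Mv = [dx / 6.0 * (v[j - 1] + 4.0 * v[j] + v[j + 1]) for j in range(1, N + 1)]
    inner = v[1:-1]
    lam = sum(a * b for a, b in zip(inner, Kv)) / sum(a * b for a, b in zip(inner, Mv))
    residual = max(abs(a - lam * b) for a, b in zip(Kv, Mv))
    return lam, residual


lam1, res1 = cg1_eigenvalue(1, 99)      # dx = 1e-2, compare with pi**2
lam10, res10 = cg1_eigenvalue(10, 99)   # compare with (10*pi)**2
rel1 = abs(lam1 / math.pi ** 2 - 1.0)
rel10 = abs(lam10 / (10 * math.pi) ** 2 - 1.0)
```

With $\Delta x = 10^{-2}$ the relative error for $k = 10$ comes out roughly a hundred times larger than for $k = 1$, consistent with the $O(k^2\Delta x^2)$ behaviour reported for Figure 14.2.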

Fig. 14.2 Relative error in eigenvalues (plot titled "Relative errors vs k"; logarithmic axes, $k$ from $10^0$ to $10^1$ and relative error from $10^{-9}$ to $10^{-2}$). The cG(1) and standard 2nd order FDM solve the Dirichlet eigenvalue problem $-u'' = \lambda u$ using $\Delta x = 10^{-2}$. Relative errors in $\lambda_k$ for the first ten discrete eigenvalues are practically indistinguishable (top curve, red markers), and the relative error in $\lambda_k$ is $O(k^2\Delta x^2)$. These results are compared to Cowell's 4th order FDM (lower curve, blue markers), which produces much higher accuracy, with a relative error in $\lambda_k$ of $O(k^4\Delta x^4)$.

14.3 Convergence

We shall only sketch a convergence proof for the cG(1) FEM applied to the 1D Poisson equation with Dirichlet conditions. Above, we have already noted that

$$u_{\Delta x}(x_i) = c_i.$$

Thus $c_i \approx u(x_i)$. The FEM equation (14.11) is obviously consistent of order $p = 2$ at the grid points $x_i$ provided that the function $f$ is smooth. Since $\|K_N^{-1}\|_2 \le C$ implies stability, it follows that $c_i = u(x_i) + O(\Delta x^2)$ as $\Delta x \to 0$, implying that the method is convergent, at the grid points, in the sense that $c_i \to u(x_i)$.

Between the grid points, the method is also 2nd order convergent, due to linear interpolation being 2nd order accurate. We have the following classical result.

Lemma 14.1. Let $v \in C^2[a,b]$ and assume that $x \in [a,b]$. Then the linear interpolant

$$P(x) = v(a) + (x-a)\,\frac{v(b) - v(a)}{b-a}$$

satisfies $P(a) = v(a)$ and $P(b) = v(b)$, with an error

$$v(x) - P(x) = \frac{v''(\xi)}{2!}(x-a)(x-b)$$

for some $\xi \in (a,b)$.

Note that $|(x-a)(x-b)| \le (b-a)^2/4$ for all $x \in [a,b]$ and that the maximum is attained at the midpoint, $x = (a+b)/2$. Suppose that $b - a = \Delta x$, so that the interpolant represents the piecewise linear interpolation on the grid. Then the interpolation error satisfies the estimate

$$|v(x) - P(x)| \le \frac{\Delta x^2}{8}\,|v''(\xi)|,$$

showing that linear interpolation is 2nd order accurate as $\Delta x \to 0$. Thus, at the grid points, $|u_{\Delta x}(x_i) - u(x_i)| = O(\Delta x^2)$, and since linear interpolation between the grid points is also second order accurate, the global error is

$$|u_{\Delta x}(x) - u(x)| = O(\Delta x^2) \tag{14.15}$$

for all $x \in \Omega = [0,1]$, provided that the solution $u(x)$ is twice continuously differentiable. In fact, if $u$ is a strong (pointwise) solution to $-u'' = f$, the error bound (14.15) holds whenever $f \in C^0[0,1]$.
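The interpolation estimate can be checked numerically. The following plain-Python sketch samples the difference between a smooth function and its piecewise linear interpolant on a uniform grid; the test function $v(x) = \sin \pi x$, for which $\max|v''| = \pi^2$, and the function name are illustrative assumptions.

```python
import math


def max_interp_error(v, N, samples_per_cell=20):
    """Largest sampled deviation between v and its piecewise linear interpolant on [0, 1]."""
    dx = 1.0 / (N + 1)
    worst = 0.0
    for i in range(N + 1):
        a, b = i * dx, (i + 1) * dx                  # current cell [a, b]
        for s in range(samples_per_cell + 1):
            x = a + (b - a) * s / samples_per_cell   # includes the cell midpoint
            p = v(a) + (x - a) * (v(b) - v(a)) / (b - a)   # P(x) from Lemma 14.1
            worst = max(worst, abs(v(x) - p))
    return worst


def v(x):
    return math.sin(math.pi * x)


e_coarse = max_interp_error(v, 9)     # dx = 0.1
e_fine = max_interp_error(v, 19)      # dx = 0.05
bound_coarse = 0.1 ** 2 / 8 * math.pi ** 2   # dx^2/8 * max|v''|
```

The sampled error stays below the bound $\Delta x^2 \max|v''|/8$ and drops by about a factor of four when the mesh width is halved.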

If $u_{\Delta x}(x) \to u(x)$ with a second order error, we generally lose an order for the derivative $u'_{\Delta x}(x)$. The derivative $u'_{\Delta x}(x)$ is piecewise constant and will deviate by an error $|u'_{\Delta x}(x) - u'(x)| = O(\Delta x)$ for an arbitrary $x \in \Omega = [0,1]$. However, by the mean value theorem, there is a $\xi_i \in (x_i, x_{i+1})$ such that $u'_{\Delta x}(\xi_i) = u'(\xi_i)$. As always, each $\xi_i$ is unknown, but a second order approximation is found at the midpoint of each cell. Thus it holds that

$$\left| u'_{\Delta x}\!\left(\frac{x_i + x_{i+1}}{2}\right) - u'\!\left(\frac{x_i + x_{i+1}}{2}\right) \right| = O(\Delta x^2).$$

As a consequence, in the finite element method there is a shift in importance from grid points to the cells (intervals), where the cG(1) method can produce a global $O(\Delta x^2)$ accurate solution, as well as a similar accuracy in the derivative, although only at the cell centers.
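The difference between cell endpoints and cell centers shows up already for the piecewise linear interpolant of a known function. The sketch below is a simplification: it uses exact nodal values of $u(x) = \sin x$ in place of a computed cG(1) solution, and the function name is hypothetical. It compares the constant slope on each cell with $u'$ at the left endpoint and at the midpoint.

```python
import math


def derivative_errors(N):
    """Max error of the cell slope versus u' at the left endpoint and at the midpoint."""
    dx = 1.0 / (N + 1)
    u, du = math.sin, math.cos        # illustrative smooth function and its derivative
    err_left = err_mid = 0.0
    for i in range(N + 1):
        a, b = i * dx, (i + 1) * dx
        slope = (u(b) - u(a)) / dx    # derivative of the linear interpolant on [a, b]
        err_left = max(err_left, abs(slope - du(a)))
        err_mid = max(err_mid, abs(slope - du(0.5 * (a + b))))
    return err_left, err_mid


left_coarse, mid_coarse = derivative_errors(9)    # dx = 0.1
left_fine, mid_fine = derivative_errors(19)       # dx = 0.05
```

Halving $\Delta x$ roughly halves the endpoint error but reduces the midpoint error by about a factor of four, which illustrates the first and second order behaviour described above.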

14.4 Neumann conditions

We have previously solved the Neumann problem for the 1D Poisson equation, using FDM. There we saw that there were many options for representing the boundary condition so as to achieve 2nd order accuracy. In the cG(1) method, the options are fewer, and second order convergence is obtained by constructing a grid with $N$ internal points, such that

$$x_j = j \cdot \Delta x,$$

with $\Delta x = 1/(N + 1/2)$. Hence $x_0 = 0$ and

$$x_N = 1 - \frac{\Delta x}{2}, \qquad x_{N+1} = 1 + \frac{\Delta x}{2},$$

with $x_{N+1}$ outside the $[0,1]$ interval. Since $x_N$ and $x_{N+1}$ are symmetrically located around the boundary point $x = 1$, the latter is the center of the cell $[x_N, x_{N+1}]$, where the derivative will be represented to second order accuracy. Using the standard piecewise linear shape functions, we have

$$\begin{aligned}
u_{\Delta x}(x_0) &= u_L\\
u_{\Delta x}(x) &= \sum_{j=1}^{N+1} c_j\varphi_j(x), \quad i = 1:N
\end{aligned}$$

where the last shape function $\varphi_{N+1}(x)$ is used to impose the Neumann boundary condition $u'_{\Delta x}(1) = u'_R$ at $x = 1$. The value of $u(1)$ is approximated by $u_{\Delta x}(1)$. Due to the piecewise linear construction, we have

$$u_{\Delta x}(1) = c_{N+1}\varphi_{N+1}(1) + c_N\varphi_N(1) = \frac{c_{N+1} + c_N}{2}.$$

Since

$$u'_{\Delta x}(1) = c_{N+1}\varphi'_{N+1}(1) + c_N\varphi'_N(1) = \frac{c_{N+1} - c_N}{\Delta x} = u'_R$$

it follows that $c_{N+1} = c_N + \Delta x\,u'_R$, and consequently

$$u_{\Delta x}(1) = \frac{c_{N+1} + c_N}{2} = c_N + \frac{\Delta x\,u'_R}{2} = u_{\Delta x}(1 - \Delta x/2) + \frac{\Delta x\,u'_R}{2}.$$

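Thus $c_{N+1}$ is determined by $c_N$ and the Neumann condition $u'(1) = u'_R$. The remaining coefficients $\{c_i\}_1^N$ are determined by the linear system

$$c_{N-1}\langle \varphi_N', \varphi_{N-1}'\rangle + c_N\langle \varphi_N', \varphi_N'\rangle + c_{N+1}\langle \varphi_N', \varphi_{N+1}'\rangle = \langle \varphi_N, f\rangle.$$

Using $c_{N+1} = c_N + \Delta x\,u'_R$, the system becomes

$$c_{N-1}\langle \varphi_N', \varphi_{N-1}'\rangle + c_N\bigl(\langle \varphi_N', \varphi_N'\rangle + \langle \varphi_N', \varphi_{N+1}'\rangle\bigr) = \langle \varphi_N, f\rangle - \Delta x\,u'_R\langle \varphi_N', \varphi_{N+1}'\rangle.$$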

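As a result, we get the system

$$\frac{1}{\Delta x}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0\\
-1 & 2 & -1 & \cdots & 0\\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1\\
0 & \cdots & 0 & -1 & 1
\end{pmatrix}
\begin{pmatrix} c_1\\ c_2\\ \vdots\\ c_N \end{pmatrix}
= \frac{\Delta x}{6}
\begin{pmatrix}
4 & 1 & 0 & \cdots & 0\\
1 & 4 & 1 & \cdots & 0\\
 & \ddots & \ddots & \ddots & \\
 & & 1 & 4 & 1\\
0 & \cdots & 0 & 1 & 4
\end{pmatrix}
\begin{pmatrix} f_1\\ f_2\\ \vdots\\ f_N \end{pmatrix}
+
\begin{pmatrix} d_1\\ 0\\ \vdots\\ 0\\ d_N \end{pmatrix}.$$

In matrix–vector form the system reads

$$K_N\mathbf{c} = M_N\mathbf{f} + \mathbf{d}, \tag{14.16}$$

where the first and last elements of the vector $\mathbf{d}$ are

$$d_1 = \frac{f_0\,\Delta x}{6} + \frac{u_L}{\Delta x}, \qquad d_N = \frac{f_{N+1}\,\Delta x}{6} + u'_R.$$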

In addition, the lower right element of the stiffness matrix has been changed from 2 to 1, and the mesh width is adjusted from $\Delta x = 1/(N+1)$ to $\Delta x = 1/(N+1/2)$. This problem representation is similar to the first approach used in the FDM analysis of the Neumann problem for the 1D Poisson problem.

Convergence is obvious, since the modified stiffness matrix is the same as the matrix in the FDM treatment of the eigenvalue problem with Neumann condition. Therefore the method is stable. Likewise consistency at the grid points follows if $f$ is regular. By the Lax principle the method is convergent of order $p = 2$ at the grid points, and the piecewise linear interpolation implies that the cG(1) solution $u_{\Delta x}(x)$ is globally 2nd order accurate.
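The Neumann variant (14.16) differs from the Dirichlet system only in the mesh width, the last diagonal entry and the last entry of $\mathbf{d}$. A minimal plain-Python sketch follows; the function names, the Thomas algorithm used for the tridiagonal solve, and the test problem $u(x) = \sin(\pi x/2)$ with $u'(1) = 0$ are all assumptions made for illustration.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; row i reads lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1]."""
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):
        w = lower[i] / d[i - 1]
        d[i] -= w * upper[i - 1]
        r[i] -= w * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x


def cg1_poisson_neumann(f, N, uL, duR):
    """Solve -u'' = f on [0, 1] with u(0) = uL and u'(1) = duR on the shifted grid."""
    dx = 1.0 / (N + 0.5)
    x = [j * dx for j in range(N + 2)]     # x_{N+1} = 1 + dx/2 lies outside [0, 1]
    fv = [f(xj) for xj in x]
    rhs = [dx / 6.0 * (fv[i - 1] + 4.0 * fv[i] + fv[i + 1]) for i in range(1, N + 1)]
    rhs[0] += uL / dx                      # Dirichlet part of d_1
    rhs[-1] += duR                         # Neumann part of d_N
    diag = [2.0 / dx] * N
    diag[-1] = 1.0 / dx                    # lower right element changed from 2 to 1
    off = [-1.0 / dx] * N
    c = solve_tridiagonal(off, diag, off, rhs)
    u_at_1 = c[-1] + 0.5 * dx * duR        # u_dx(1) = c_N + dx*u'_R/2
    return x[1:N + 1], c, u_at_1


def neumann_error(N):
    """Max nodal error for the test problem with exact solution sin(pi*x/2)."""
    w = math.pi / 2
    xs, c, _ = cg1_poisson_neumann(lambda t: w ** 2 * math.sin(w * t), N, 0.0, 0.0)
    return max(abs(ci - math.sin(w * xi)) for xi, ci in zip(xs, c))


err_20 = neumann_error(20)
err_40 = neumann_error(40)
```

Going from $N = 20$ to $N = 40$ changes $\Delta x^2$ by a factor of about 3.9, and the nodal error should shrink by a similar factor if the scheme is second order as claimed.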

14.5 cG(1) FEM on nonuniform grids

Let us consider a nonuniform grid constructed as a differentiable deformation of a uniform grid. This means that we choose a differentiable map $\Phi : [0,1] \to [0,1]$ such that $\Phi(0) = 0$ and $\Phi(1) = 1$, and such that a uniform grid

$$\xi_n = n/(N+1)$$

for $n = 0 : N+1$ is mapped to a nonuniform grid $x_n = \Phi(\xi_n)$.
