NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS

Kendall Atkinson, Weimin Han, David Stewart
University of Iowa
Iowa City, Iowa

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:

Numerical Solution of Ordinary Differential Equations / Kendall E. Atkinson . . . [et al.].
p. cm. -- (Wiley series in ???????)

"Wiley-Interscience."
Includes bibliographical references and index.
ISBN ????????????? (pbk.)
1. Numerical analysis. 2. Ordinary differential equations.

I. Atkinson, Kendall E. II. Series.

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

QA31.????.???? 2008
510.??????-???
Printed in the United States of America.

10 9 8 7 6 5 4 3 2 1

To Alice, Huidi, and Sue

Preface

This book is an expanded version of supplementary notes that we used for a course on ordinary differential equations for upper-division undergraduate students and beginning graduate students in mathematics, engineering, and sciences. The book introduces the numerical analysis of differential equations, describing the mathematical background for understanding numerical methods and giving information on what to expect when using them. One reason for studying numerical methods as a part of a more general course on differential equations is that many of the basic ideas of the numerical analysis of differential equations are tied closely to the theoretical behavior associated with the problem being solved. For example, the stability criteria for a numerical method are closely connected to the stability of the differential equation problem being solved.

This book can be used for a one-semester course on the numerical solution of differential equations, or it can be used as a supplementary text for a course on the theory and application of differential equations. In the latter case, we present more about numerical methods than would ordinarily be covered in a class on ordinary differential equations. This allows the instructor some latitude in choosing what to include, and it allows the students to read further into topics that may interest them. For example, the book discusses methods for solving differential algebraic equations (Chapter 10) and Volterra integral equations (Chapter 12), topics not commonly included in an introductory text on the numerical solution of differential equations.

We also include MATLAB® programs to illustrate many of the ideas that are introduced in the text. Much is to be learned by experimenting with the numerical solution of differential equations. The programs in the book can be downloaded from the following website.

http://www.math.uiowa.edu/NumericalAnalysisODE/

This site also contains graphical user interfaces for use in experimenting with Euler’s method and the backward Euler method. These are to be used from within the framework of MATLAB.

Numerical methods vary in their behavior, and the many different types of differential equation problems affect the performance of numerical methods in a variety of ways. An excellent book for “real world” examples of solving differential equations is that of Shampine, Gladwell, and Thompson [74].

The authors would like to thank Olaf Hansen, California State University at San Marcos, for his comments on reading an early version of the book. We also express our appreciation to John Wiley Publishers.

CONTENTS

Introduction

1 Theory of differential equations: An introduction
1.1 General solvability theory
1.2 Stability of the initial value problem
1.3 Direction fields
Problems

2 Euler’s method
2.1 Definition of Euler’s method
2.2 Error analysis of Euler’s method
2.3 Asymptotic error analysis
2.3.1 Richardson extrapolation
2.4 Numerical stability
2.4.1 Rounding error accumulation
Problems

3 Systems of differential equations
3.1 Higher-order differential equations
3.2 Numerical methods for systems
Problems

4 The backward Euler method and the trapezoidal method
4.1 The backward Euler method
4.2 The trapezoidal method
Problems

5 Taylor and Runge–Kutta methods
5.1 Taylor methods
5.2 Runge–Kutta methods
5.2.1 A general framework for explicit Runge–Kutta methods
5.3 Convergence, stability, and asymptotic error
5.3.1 Error prediction and control
5.4 Runge–Kutta–Fehlberg methods
5.5 MATLAB codes
5.6 Implicit Runge–Kutta methods
5.6.1 Two-point collocation methods
Problems

6 Multistep methods
6.1 Adams–Bashforth methods
6.2 Adams–Moulton methods
6.3 Computer codes
6.3.1 MATLAB ODE codes
Problems

7 General error analysis for multistep methods
7.1 Truncation error
7.2 Convergence
7.3 A general error analysis
7.3.1 Stability theory
7.3.2 Convergence theory
7.3.3 Relative stability and weak stability
Problems

8 Stiff differential equations
8.1 The method of lines for a parabolic equation
8.1.1 MATLAB programs for the method of lines
8.2 Backward differentiation formulas
8.3 Stability regions for multistep methods
8.4 Additional sources of difficulty
8.4.1 A-stability and L-stability
8.4.2 Time-varying problems and stability
8.5 Solving the finite-difference method
8.6 Computer codes
Problems

9 Implicit RK methods for stiff differential equations
9.1 Families of implicit Runge–Kutta methods
9.2 Stability of Runge–Kutta methods
9.3 Order reduction
9.4 Runge–Kutta methods for stiff equations in practice
Problems

10 Differential algebraic equations
10.1 Initial conditions and drift
10.2 DAEs as stiff differential equations
10.3 Numerical issues: higher index problems
10.4 Backward differentiation methods for DAEs
10.4.1 Index 1 problems
10.5 Runge–Kutta methods for DAEs
10.6 Index three problems from mechanics
10.6.1 Runge–Kutta methods for mechanical index 3 systems
10.7 Higher index DAEs
Problems

11 Two-point boundary value problems
11.1 A finite-difference method
11.1.1 Convergence
11.1.2 A numerical example
11.1.3 Boundary conditions involving the derivative
11.2 Nonlinear two-point boundary value problems
11.2.1 Finite difference methods
11.2.2 Shooting methods
11.2.3 Collocation methods
11.2.4 Other methods and problems
Problems

12 Volterra integral equations
12.1 Solvability theory
12.1.1 Special equations
12.2 Numerical methods
12.2.1 The trapezoidal method
12.2.2 Error for the trapezoidal method
12.2.3 General schema for numerical methods
12.3 Numerical methods: Theory
12.3.1 Numerical stability
12.3.2 Practical numerical stability
Problems

Appendix A. Taylor’s Theorem
Appendix B. Polynomial interpolation
References
Index

Introduction

Differential equations are among the most important mathematical tools used in producing models in the physical sciences, biological sciences, and engineering. In this text, we consider numerical methods for solving ordinary differential equations, that is, those differential equations that have only one independent variable.

The differential equations we consider in most of the book are of the form

$$ Y'(t) = f(t, Y(t)), $$

where $Y(t)$ is an unknown function that is being sought. The given function $f(t, y)$ of two variables defines the differential equation, and examples are given in Chapter 1. This equation is called a first-order differential equation because it contains a first-order derivative of the unknown function, but no higher-order derivative. The numerical methods for a first-order equation can be extended in a straightforward way to a system of first-order equations. Moreover, a higher-order differential equation can be reformulated as a system of first-order equations.

A brief discussion of the solvability theory of the initial value problem for ordinary differential equations is given in Chapter 1, where the concept of stability of differential equations is also introduced. The simplest numerical method, Euler’s method, is studied in Chapter 2. It is not an efficient numerical method, but it is an intuitive way to introduce many important ideas. Higher-order equations and systems of first-order equations are considered in Chapter 3, and Euler’s method is extended to such equations. In Chapter 4, we discuss some numerical methods with better numerical stability for practical computation. Chapters 5 and 6 cover more sophisticated and rapidly convergent methods, namely Runge–Kutta methods and the families of Adams–Bashforth and Adams–Moulton methods, respectively. In Chapter 7, we give a general treatment of the theory of multistep numerical methods. The numerical analysis of stiff differential equations is introduced in several early chapters, and it is explored at greater length in Chapters 8 and 9. In Chapter 10, we introduce the study and numerical solution of differential algebraic equations, applying some of the earlier material on stiff differential equations. In Chapter 11, we consider numerical methods for solving boundary value problems of second-order ordinary differential equations. The final chapter, Chapter 12, gives an introduction to the numerical solution of Volterra integral equations of the second kind, extending ideas introduced in earlier chapters for solving initial value problems. Appendices A and B contain brief introductions to Taylor polynomial approximations and polynomial interpolation.

CHAPTER 1

THEORY OF DIFFERENTIAL EQUATIONS: AN INTRODUCTION

For simple differential equations, it is possible to find closed form solutions. For example, given a function $g$, the general solution of the simplest equation

$$ Y'(t) = g(t) $$

is

$$ Y(t) = \int g(s)\,ds + c $$

with $c$ an arbitrary integration constant. Here, $\int g(s)\,ds$ denotes any fixed antiderivative of $g$. The constant $c$, and thus a particular solution, can be obtained by specifying the value of $Y(t)$ at some given point:

$$ Y(t_0) = Y_0. $$

Example 1.1 The general solution of the equation

$$ Y'(t) = \sin(t) $$

is $Y(t) = -\cos(t) + c$. If we specify the condition

$$ Y\left(\frac{\pi}{3}\right) = 2, $$

then, since $\cos(\pi/3) = 0.5$, it is easy to find $c = 2.5$. Thus the desired solution is

$$ Y(t) = 2.5 - \cos(t). $$

The more general equation

$$ Y'(t) = f(t, Y(t)) \tag{1.1} $$

is approached in a similar spirit, in the sense that usually there is a general solution dependent on a constant. To further illustrate this point, we consider some more examples that can be solved analytically. First, and foremost, is the first-order linear equation

$$ Y'(t) = a(t) Y(t) + g(t). \tag{1.2} $$

The given functions $a(t)$ and $g(t)$ are assumed continuous. For this equation, we obtain

$$ f(t, z) = a(t) z + g(t), $$

and the general solution of the equation can be found by the so-called method of integrating factors.

We illustrate the method of integrating factors through a particularly useful case,

$$ Y'(t) = \lambda Y(t) + g(t) \tag{1.3} $$

with $\lambda$ a given constant. Multiplying the linear equation (1.3) by the integrating factor $e^{-\lambda t}$, we can reformulate the equation as

$$ \frac{d}{dt}\left(e^{-\lambda t} Y(t)\right) = e^{-\lambda t} g(t). $$

Integrating both sides from $t_0$ to $t$, we obtain

$$ e^{-\lambda t} Y(t) = c + \int_{t_0}^{t} e^{-\lambda s} g(s)\,ds, $$

where

$$ c = e^{-\lambda t_0} Y(t_0). \tag{1.4} $$

So the general solution of (1.3) is

$$ Y(t) = e^{\lambda t}\left[c + \int_{t_0}^{t} e^{-\lambda s} g(s)\,ds\right] = c\,e^{\lambda t} + \int_{t_0}^{t} e^{\lambda (t-s)} g(s)\,ds. \tag{1.5} $$

This solution is valid on any interval on which $g(t)$ is continuous.

As we have seen from the discussions above, the general solution of the first-order equation (1.1) normally depends on an arbitrary integration constant. To single out a particular solution, we need to specify an additional condition. Usually such a condition is taken to be of the form

$$ Y(t_0) = Y_0. \tag{1.6} $$

In many applications of the ordinary differential equation (1.1), the independent variable $t$ plays the role of time, and $t_0$ can be interpreted as the initial time. So it is customary to call (1.6) an initial value condition. The differential equation (1.1) and the initial value condition (1.6) together form an initial value problem

$$ \begin{aligned} Y'(t) &= f(t, Y(t)), \\ Y(t_0) &= Y_0. \end{aligned} \tag{1.7} $$

For the initial value problem of the linear equation (1.3), the solution is given by the formulas (1.5) and (1.4). We observe that the solution exists on any open interval where the data function $g(t)$ is continuous. This is a property of linear equations. For the initial value problem of the general linear equation (1.2), its solution exists on any open interval where the functions $a(t)$ and $g(t)$ are continuous. As we will see next through examples, when the ordinary differential equation (1.1) is nonlinear, even if the right-side function $f(t, z)$ has derivatives of any order, the solution of the corresponding initial value problem may exist on only a smaller interval.

Example 1.2 By a direct computation, it is easy to verify that the equation

$$ Y'(t) = -[Y(t)]^2 + Y(t) $$

has a so-called trivial solution $Y(t) \equiv 0$ and a general solution

$$ Y(t) = \frac{1}{1 + c\,e^{-t}} \tag{1.8} $$

with $c$ arbitrary. Alternatively, this equation is a so-called separable equation, and its solution can be found by a standard method such as that described in Problem 4. To find the solution of the equation satisfying $Y(0) = 4$, we use the solution formula at $t = 0$:

$$ 4 = \frac{1}{1 + c}, \qquad c = -0.75. $$

So the solution of the initial value problem is

$$ Y(t) = \frac{1}{1 - 0.75\,e^{-t}}, \qquad t \ge 0. $$

With a general initial value $Y(0) = Y_0 \neq 0$, the constant $c$ in the solution formula (1.8) is given by $c = Y_0^{-1} - 1$. If $Y_0 > 0$, then $c > -1$, and the solution $Y(t)$ exists for $0 \le t < \infty$. However, for $Y_0 < 0$, the solution exists only on the finite interval $[0, \log(1 - Y_0^{-1}))$; the value $t = \log(1 - Y_0^{-1})$ is the zero of the denominator in the formula (1.8). Throughout this work, $\log$ denotes the natural logarithm.
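As a numerical illustration of the finite existence interval, the following MATLAB sketch evaluates formula (1.8) just to the left of $t = \log(1 - Y_0^{-1})$. The initial value $Y_0 = -1$ and the variable names are illustrative assumptions, not taken from the text.

```matlab
% Sketch: evaluate the solution formula (1.8) near its blow-up time.
% Y0 = -1 is an illustrative choice, not a value taken from the text.
Y0 = -1;
c = 1/Y0 - 1;                          % constant in (1.8) from Y(0) = Y0
Yexact = @(t) 1 ./ (1 + c*exp(-t));    % formula (1.8)
tstar = log(1 - 1/Y0);                 % zero of the denominator, here log(2)

% The denominator 1 + c*exp(-t) vanishes at t = tstar.
denom_at_tstar = 1 + c*exp(-tstar);

% Approach tstar from the left: the solution decreases without bound.
t = tstar - [1e-1, 1e-2, 1e-3];
disp([t; Yexact(t)]')
```

The printed values grow rapidly in magnitude as $t$ approaches `tstar`, which is consistent with the solution existing only on $[0, \log(1 - Y_0^{-1}))$.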

Example 1.3 Consider the equation

$$ Y'(t) = -[Y(t)]^2. $$

It has a trivial solution $Y(t) \equiv 0$ and a general solution

$$ Y(t) = \frac{1}{t + c} \tag{1.9} $$

with $c$ arbitrary. This can be verified by a direct calculation or by the method described in Problem 4. To find the solution of the equation satisfying the initial value condition $Y(0) = Y_0$, we distinguish several cases according to the value of $Y_0$. If $Y_0 = 0$, then the solution of the initial value problem is $Y(t) \equiv 0$ for any $t \ge 0$. If $Y_0 \neq 0$, then the solution of the initial value problem is

$$ Y(t) = \frac{1}{t + Y_0^{-1}}. $$

For $Y_0 > 0$, the solution exists for any $t \ge 0$. For $Y_0 < 0$, the solution exists only on the interval $[0, -Y_0^{-1})$. As a side note, observe that for $0 < Y_0 < 1$ with $c = Y_0^{-1} - 1$, the solution (1.8) increases for $t \ge 0$, whereas for $Y_0 > 0$, the solution (1.9) with $c = Y_0^{-1}$ decreases for $t \ge 0$.

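Example 1.4 The solution of

$$ Y'(t) = \lambda Y(t) + e^{-t}, \qquad Y(0) = 1 $$

is obtained from (1.5) and (1.4) as

$$ Y(t) = e^{\lambda t} + \int_0^t e^{\lambda (t-s)} e^{-s}\,ds. $$

If $\lambda \neq -1$, then

$$ Y(t) = e^{\lambda t}\left\{1 + \frac{1}{\lambda + 1}\left[1 - e^{-(\lambda+1)t}\right]\right\}. $$

If $\lambda = -1$, then

$$ Y(t) = e^{-t}(1 + t). $$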
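The two closed forms can be checked against formula (1.5) by evaluating its integral numerically. The following MATLAB sketch does this with the built-in `integral` function; the values $\lambda = 2$ and $t = 0.7$ are illustrative assumptions.

```matlab
% Sketch: compare the closed forms above with formula (1.5), evaluating
% the integral in (1.5) by numerical quadrature.
% lambda = 2 and t = 0.7 are illustrative choices.
lambda = 2;
t = 0.7;

% Formula (1.5) with t0 = 0, Y(0) = 1, g(s) = exp(-s); c = 1 by (1.4).
integrand = @(s) exp(lambda*(t - s)) .* exp(-s);
Yquad = exp(lambda*t) + integral(integrand, 0, t);

% Closed form for lambda ~= -1.
Yclosed = exp(lambda*t) * (1 + (1 - exp(-(lambda + 1)*t)) / (lambda + 1));

% Case lambda = -1: the integrand reduces to the constant exp(-t).
Yquad2 = exp(-t) + integral(@(s) exp(-(t - s)) .* exp(-s), 0, t);
Yclosed2 = exp(-t) * (1 + t);

fprintf('lambda =  2: quadrature %.12f, closed form %.12f\n', Yquad, Yclosed);
fprintf('lambda = -1: quadrature %.12f, closed form %.12f\n', Yquad2, Yclosed2);
```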

We remark that for a general right-side function $f(t, z)$, it is usually not possible to solve the initial value problem (1.7) analytically. One such example is the equation

$$ Y' = e^{-t\,Y^4}. $$

In such a case, numerical methods are the only plausible way to compute solutions. Moreover, even when a differential equation can be solved analytically, the solution formula, such as (1.5), usually involves integrations of general functions. The integrals mostly have to be evaluated numerically. As an example, it is easy to verify that the solution of the problem

$$ \begin{cases} Y' = 2\,t\,Y + 1, & t > 0, \\ Y(0) = 1 \end{cases} $$

is

$$ Y(t) = e^{t^2} \int_0^t e^{-s^2}\,ds + e^{t^2}. $$

For such a situation, it is usually more efficient to use numerical methods from the outset to solve the differential equation.
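To make the comparison concrete, the following MATLAB sketch solves the last problem directly with `ode45` and compares the result with the solution formula. It uses the identity $\int_0^t e^{-s^2}\,ds = (\sqrt{\pi}/2)\,\mathrm{erf}(t)$, so the integral is itself evaluated by a library routine. The interval $[0, 1]$ and the tolerances are illustrative assumptions.

```matlab
% Sketch: evaluate the solution formula using erf, and compare it with a
% direct numerical solution by ode45. Interval and tolerances are
% illustrative choices.
f = @(t, Y) 2*t.*Y + 1;                                   % right side of the equation
Yformula = @(t) exp(t.^2) .* (1 + (sqrt(pi)/2) * erf(t)); % solution formula

opts = odeset('RelTol', 1e-10, 'AbsTol', 1e-12);
[tout, Yout] = ode45(f, [0 1], 1, opts);

maxdiff = max(abs(Yout - Yformula(tout)));
fprintf('max |ode45 - formula| on [0,1]: %.2e\n', maxdiff);
```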

1.1 GENERAL SOLVABILITY THEORY

Before we consider numerical methods, it is useful to have some discussions on properties of the initial value problem (1.7). The following well-known result concerns the existence and uniqueness of a solution to this problem.

Theorem 1.5 Let $D$ be an open connected set in $\mathbb{R}^2$, let $f(t, y)$ be a continuous function of $t$ and $y$ for all $(t, y)$ in $D$, and let $(t_0, Y_0)$ be an interior point of $D$. Assume that $f(t, y)$ satisfies the Lipschitz condition

$$ |f(t, y_1) - f(t, y_2)| \le K\,|y_1 - y_2| \quad \text{for all } (t, y_1), (t, y_2) \text{ in } D \tag{1.10} $$

for some $K \ge 0$. Then there is a unique function $Y(t)$ defined on an interval $[t_0 - \alpha, t_0 + \alpha]$ for some $\alpha > 0$, satisfying

$$ \begin{aligned} Y'(t) &= f(t, Y(t)), \quad t_0 - \alpha \le t \le t_0 + \alpha, \\ Y(t_0) &= Y_0. \end{aligned} $$

The Lipschitz condition on $f$ is assumed throughout the text. The condition (1.10) is easily obtained if $\partial f(t, y)/\partial y$ is a continuous function of $(t, y)$ over $\overline{D}$, the closure of $D$, with $\overline{D}$ also assumed to be convex. (A set $D$ is called convex if for any two points in $D$ the line segment joining them is entirely contained in $D$. Examples of convex sets include circles, ellipses, triangles, parallelograms.) Then we can use

$$ K = \max_{(t, y) \in \overline{D}} \left| \frac{\partial f(t, y)}{\partial y} \right|, $$

provided this is finite. If not, then simply use a smaller $D$, say, one that is bounded and contains $(t_0, Y_0)$ in its interior. The number $\alpha$ in the statement of the theorem depends on the initial value problem (1.7). For some equations, such as the linear equation given in (1.3) with a continuous function $g(t)$, solutions exist for any $t$, and we can take $\alpha$ to be $\infty$. For many nonlinear equations, solutions can exist only in bounded intervals. We have seen such instances in Examples 1.2 and 1.3. Let us look at one more such example.

Example 1.6 Consider the initial value problem

$$ Y'(t) = 2t\,[Y(t)]^2, \qquad Y(0) = 1. $$

Here

$$ f(t, y) = 2ty^2, \qquad \frac{\partial f(t, y)}{\partial y} = 4ty, $$

and both of these functions are continuous for all $(t, y)$. Thus, by Theorem 1.5 there is a unique solution to this initial value problem for $t$ in a neighborhood of $t_0 = 0$. This solution is

$$ Y(t) = \frac{1}{1 - t^2}, \qquad -1 < t < 1. $$

This example illustrates that the continuity of $f(t, y)$ and $\partial f(t, y)/\partial y$ for all $(t, y)$ does not imply the existence of a solution $Y(t)$ for all $t$.
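The solution formula in this example can be recovered by separation of variables, the method outlined in Problem 4. The following worked steps are a sketch of that calculation:

```latex
\begin{aligned}
\frac{Y'(t)}{[Y(t)]^2} &= 2t
  && \text{(separate the variables)} \\
-\frac{1}{Y(t)} &= t^2 + c
  && \text{(integrate both sides)} \\
-1 &= c
  && \text{(apply } Y(0) = 1\text{)} \\
Y(t) &= \frac{1}{1 - t^2}
  && \text{(solve for } Y(t)\text{)}
\end{aligned}
```

The denominator vanishes at $t = \pm 1$, so the solution through $(0, 1)$ cannot be continued beyond the interval $-1 < t < 1$.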

1.2 STABILITY OF THE INITIAL VALUE PROBLEM

When numerically solving the initial value problem (1.7), we will generally assume that the solution $Y(t)$ is being sought on a given finite interval $t_0 \le t \le b$. In that case, it is possible to obtain the following result on stability. Make a small change in the initial value for the initial value problem, changing $Y_0$ to $Y_0 + \epsilon$. Call the resulting solution $Y_\epsilon(t)$,

$$ Y_\epsilon'(t) = f(t, Y_\epsilon(t)), \quad t_0 \le t \le b, \qquad Y_\epsilon(t_0) = Y_0 + \epsilon. \tag{1.11} $$

Then, under hypotheses similar to those of Theorem 1.5, it can be shown that for all small values of $\epsilon$, $Y(t)$ and $Y_\epsilon(t)$ exist on the interval $[t_0, b]$, and moreover,

$$ \|Y_\epsilon - Y\|_\infty \equiv \max_{t_0 \le t \le b} |Y_\epsilon(t) - Y(t)| \le c\,\epsilon \tag{1.12} $$

for some $c > 0$ that is independent of $\epsilon$. Thus small changes in the initial value $Y_0$ will lead to small changes in the solution $Y(t)$ of the initial value problem. This is a desirable property for a variety of very practical reasons.

Example 1.7 The problem

$$ Y'(t) = -Y(t) + 1, \quad 0 \le t \le b, \qquad Y(0) = 1 \tag{1.13} $$

has the solution $Y(t) \equiv 1$. The perturbed problem

$$ Y_\epsilon'(t) = -Y_\epsilon(t) + 1, \quad 0 \le t \le b, \qquad Y_\epsilon(0) = 1 + \epsilon $$

has the solution $Y_\epsilon(t) = 1 + \epsilon e^{-t}$. Thus

$$ Y(t) - Y_\epsilon(t) = -\epsilon e^{-t}, $$

$$ |Y(t) - Y_\epsilon(t)| \le |\epsilon|, \quad 0 \le t \le b. $$

The problem (1.13) is said to be stable.

Virtually all initial value problems (1.7) are stable in the sense specified in (1.12); but this is only a partial picture of the effect of small perturbations of the initial value $Y_0$. If the maximum error $\|Y_\epsilon - Y\|_\infty$ in (1.12) is not much larger than $\epsilon$, then we say that the initial value problem (1.7) is well-conditioned. In contrast, when $\|Y_\epsilon - Y\|_\infty$ is much larger than $\epsilon$ [i.e., the minimal possible constant $c$ in the estimate (1.12) is large], then the initial value problem (1.7) is considered to be ill-conditioned. Attempting to numerically solve such a problem will usually lead to large errors in the computed solution. In practice, there is a continuum of problems ranging from well-conditioned to ill-conditioned, and the extent of the ill-conditioning affects the possible accuracy with which the solution $Y$ can be found numerically, regardless of the numerical method being used.

Example 1.8 The problem

$$ Y'(t) = \lambda\,[Y(t) - 1], \quad 0 \le t \le b, \qquad Y(0) = 1 \tag{1.14} $$

has the solution $Y(t) = 1$, $0 \le t \le b$. The perturbed problem

$$ Y_\epsilon'(t) = \lambda\,[Y_\epsilon(t) - 1], \quad 0 \le t \le b, \qquad Y_\epsilon(0) = 1 + \epsilon $$

has the solution $Y_\epsilon(t) = 1 + \epsilon e^{\lambda t}$, $0 \le t \le b$. For the error, we obtain

$$ Y(t) - Y_\epsilon(t) = -\epsilon e^{\lambda t}, \tag{1.15} $$

$$ \max_{0 \le t \le b} |Y(t) - Y_\epsilon(t)| = \begin{cases} |\epsilon|, & \lambda \le 0, \\ |\epsilon|\,e^{\lambda b}, & \lambda \ge 0. \end{cases} $$

If $\lambda < 0$, the error $|Y(t) - Y_\epsilon(t)|$ decreases as $t$ increases. We see that (1.14) is well-conditioned when $\lambda \le 0$. In contrast, for $\lambda > 0$, the error $|Y(t) - Y_\epsilon(t)|$ increases as $t$ increases. And for $\lambda b$ moderately large, say $\lambda b \ge 10$, the change in $Y(t)$ is quite significant at $t = b$. The problem (1.14) is increasingly ill-conditioned as $\lambda$ increases.
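To see how quickly the conditioning of (1.14) deteriorates, the following MATLAB sketch tabulates the maximum error from (1.15) and its ratio to $|\epsilon|$. The values of `epsilon`, `b`, and `lambda` are illustrative assumptions.

```matlab
% Sketch: size of the perturbation for problem (1.14) on [0, b].
% The values of epsilon, b, and lambda are illustrative choices.
epsilon = 1e-6;
b = 1;
lambda = [-10, -1, 0, 1, 10, 20];

% From (1.15): the maximum over [0,b] of |Y - Y_eps| is |eps| for
% lambda <= 0 and |eps|*exp(lambda*b) for lambda >= 0.
maxerr = abs(epsilon) * max(1, exp(lambda*b));
amplification = maxerr / abs(epsilon);

for k = 1:numel(lambda)
    fprintf('lambda = %4d   max error = %.3e   amplification = %.3e\n', ...
            lambda(k), maxerr(k), amplification(k));
end
```

With $\lambda b = 10$ the perturbation is amplified by the factor $e^{10}$, which is more than $2 \times 10^4$.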

For the more general initial value problem (1.7) and the perturbed problem (1.11), one can show that

$$ Y(t) - Y_\epsilon(t) \approx -\epsilon \exp\left(\int_{t_0}^{t} g(s)\,ds\right) \tag{1.16} $$

with

$$ g(t) = \left.\frac{\partial f(t, y)}{\partial y}\right|_{y = Y(t)} $$

for $t$ sufficiently close to $t_0$. Note that this formula correctly predicts (1.15), since in that case

$$ f(t, y) = \lambda\,(y - 1), \qquad \frac{\partial f(t, y)}{\partial y} = \lambda, \qquad \int_0^t g(s)\,ds = \lambda t. $$

Then (1.16) yields

$$ Y(t) - Y_\epsilon(t) \approx -\epsilon e^{\lambda t}, $$

which agrees with the earlier formula (1.15).


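Example 1.9 The problem

$$ Y'(t) = -[Y(t)]^2, \qquad Y(0) = 1 \tag{1.17} $$

has the solution

$$ Y(t) = \frac{1}{t + 1}. $$

For the perturbed problem,

$$ Y_\epsilon'(t) = -[Y_\epsilon(t)]^2, \qquad Y_\epsilon(0) = 1 + \epsilon, \tag{1.18} $$

we use (1.16) to estimate $Y(t) - Y_\epsilon(t)$. First,

$$ f(t, y) = -y^2, \qquad \frac{\partial f(t, y)}{\partial y} = -2y, $$

$$ g(t) = -2Y(t) = -\frac{2}{t + 1}, $$

$$ \int_0^t g(s)\,ds = -2 \int_0^t \frac{ds}{s + 1} = -2 \log(1 + t) = \log\left[(1 + t)^{-2}\right], $$

$$ \exp\left[\int_0^t g(s)\,ds\right] = e^{\log\left[(t + 1)^{-2}\right]} = \frac{1}{(t + 1)^2}. $$

For $t \ge 0$ sufficiently small, substituting into (1.16) gives

$$ Y(t) - Y_\epsilon(t) \approx \frac{-\epsilon}{(1 + t)^2}. \tag{1.19} $$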
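For this problem the perturbed solution is available in closed form from (1.9), so the estimate (1.19) can be compared with the exact difference. The following MATLAB sketch does so; the value `epsilon = 1e-3` and the grid of $t$ values are illustrative assumptions.

```matlab
% Sketch: compare the exact perturbation for (1.17)-(1.18) with the
% estimate (1.19). epsilon = 1e-3 is an illustrative choice.
epsilon = 1e-3;
t = 0:0.5:5;

Y    = 1 ./ (t + 1);                    % solution of (1.17)
Yeps = 1 ./ (t + 1/(1 + epsilon));      % solution of (1.18), from (1.9)

exact_diff = Y - Yeps;                  % Y(t) - Y_eps(t)
estimate   = -epsilon ./ (1 + t).^2;    % right side of (1.19)

% Columns: t, exact difference, estimate (1.19).
disp([t; exact_diff; estimate]')
```

In this example the two columns agree closely over the whole range shown, and the difference never exceeds $|\epsilon|$ in size.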

DIRECTION FIELDS 11

−2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−3

−2

−1

0

1

2

3

Y=et

Y=−et

Figure 1.1 The direction field of the equationY ′ = Y and solutionsY = ±et

This indicates that (1.17) is a well-conditioned problem.

In general, if∂f(t, Y (t))

∂y≤ 0, t0 ≤ t ≤ b, (1.20)

then the initial value problem is generally considered to bewell-conditioned. Al-though this test depends onY (t) over the interval[t0, b], one can often show (1.20)without knowingY (t) explicitly; see Problems 5, 6.

1.3 DIRECTION FIELDS

Direction fields serve as a useful tool in understanding the behavior of solutions of a differential equation. We notice that the graph of a solution of the equation $Y' = f(t, Y)$ is such that at any point $(t, y)$ on the solution curve, the slope is $f(t, y)$. The slopes can be represented graphically in direction field diagrams. In MATLAB®, direction fields can be generated by using the meshgrid and quiver commands.

Example 1.10 Consider the equation $Y' = Y$. The slope of a solution curve at a point $(t, y)$ on the curve is $y$, which is independent of $t$. We generate a direction field diagram with the following MATLAB code. First draw the direction field:

[t,y] = meshgrid(-2:0.5:2,-2:0.5:2);
dt = ones(9);   % Generates a 9-by-9 matrix of 1's.
dy = y;         % Slope f(t,y) = y at each grid point.
quiver(t,y,dt,dy);

Then draw two solution curves:

hold on
t = -2:0.01:1;
y1 = exp(t); y2 = -exp(t);
plot(t,y1,t,y2)
text(1.1,2.8,'\itY=e^t','FontSize',14)
text(1.1,-2.8,'\itY=-e^t','FontSize',14)
hold off

The result is shown in Figure 1.1.

Example 1.11 Continuing Example 1.6, we use the following MATLAB M-file to generate a direction field diagram and the particular solution $Y = 1/(1 - t^2)$ in Figure 1.2.

[t,y] = meshgrid(-1:0.2:1,1:0.5:4);
dt = ones(7,11); dy = 2*t.*y.^2;   % Slope f(t,y) = 2*t*y^2 on the 7-by-11 grid.
quiver(t,y,dt,dy);
hold on
tt = -0.87:0.01:0.87;
yy = 1./(1-tt.^2);
plot(tt,yy)
hold off

Figure 1.2 The direction field of the equation $Y' = 2tY^2$ and the solution $Y = 1/(1 - t^2)$

Note that for large $y$ values, the arrows in the direction field diagram (Figure 1.2) point almost vertically. This suggests that a solution to the equation may exist only in a bounded interval of the $t$ axis, which, indeed, is the case.

PROBLEMS

1. In each of the following cases, show that the given function $Y(t)$ satisfies the associated differential equation. Then determine the value of $c$ required by the initial condition. Finally, with reference to the general format in (1.7), identify $f(t, z)$ for each differential equation.

(a) $Y'(t) = -Y(t) + \sin(t) + \cos(t)$, $Y(0) = 1$; $Y(t) = \sin(t) + c\,e^{-t}$.

(b) $Y'(t) = \left[Y(t) - Y(t)^2\right]/t$, $Y(1) = 2$; $Y(t) = t/(t + c)$, $t > 0$.

(c) $Y'(t) = \cos^2(Y(t))$, $Y(0) = \pi/4$; $Y(t) = \tan^{-1}(t + c)$.

(d) $Y'(t) = Y(t)[Y(t) - 1]$, $Y(0) = 1/2$; $Y(t) = 1/(1 + c\,e^{t})$.

2. Use MATLAB to draw direction fields for the differential equations listed in Problem 1.

3. Solve the following problems by using (1.5) and (1.4):

(a) $Y'(t) = \lambda Y(t) + 1$, $Y(0) = 1$.

(b) $Y'(t) = \lambda Y(t) + t$, $Y(0) = 3$.

4. Consider the differential equation

$$ Y'(t) = f_1(t)\,f_2(Y(t)) $$

for some given functions $f_1(t)$ and $f_2(z)$. This is called a separable differential equation, and it can be solved by direct integration. Write the equation as

$$ \frac{Y'(t)}{f_2(Y(t))} = f_1(t), $$

and find the antiderivative of each side:

$$ \int \frac{Y'(t)\,dt}{f_2(Y(t))} = \int f_1(t)\,dt. $$

On the left side, change the integration variable by letting $z = Y(t)$. Then the equation becomes

$$ \int \frac{dz}{f_2(z)} = \int f_1(t)\,dt. $$

After integrating, replace $z$ by $Y(t)$; then solve for $Y(t)$, if possible. If these integrals can be evaluated, then the differential equation can be solved. Do so for the following problems, finding the general solution and the solution satisfying the given initial condition.

(a) $Y'(t) = t/Y(t)$, $Y(0) = 2$.

(b) $Y'(t) = t\,e^{-Y(t)}$, $Y(1) = 0$.

(c) $Y'(t) = Y(t)[a - Y(t)]$, $Y(0) = a/2$, $a > 0$.

5. Check the conditioning of the initial value problems in Problem 1. Use the test (1.20).

6. Check the conditioning of the initial value problems in Problem 4(a), (b). Use the test (1.20).

7. Use (1.20) to discuss the conditioning of the problem

$$ Y'(t) = Y(t)^2 - 5\sin(t) - 25\cos^2(t), \qquad Y(0) = 6. $$

You do not need to know the true solution.

8. Consider the solutions $Y(t)$ of

$$ Y'(t) + aY(t) = d\,e^{-bt} $$

with $a, b, d$ constants and $a, b > 0$. Calculate

$$ \lim_{t \to \infty} Y(t). $$

Hint: Consider the cases $a \neq b$ and $a = b$ separately.

CHAPTER 2

EULER’S METHOD

Although it is possible to derive solution formulas for some ordinary differential equations, as is shown in Chapter 1, many differential equations arising in applications are so complicated that it is impractical to have solution formulas. Even when a solution formula is available, it may involve integrals that can be calculated only by using a numerical quadrature formula. In either situation, numerical methods provide a powerful alternative tool for solving the differential equation.

The simplest numerical method for solving the initial value problem is called Euler’s method. We first define it and give some numerical illustrations, and then we analyze it mathematically. Euler’s method is not an efficient numerical method, but many of the ideas involved in the numerical solution of differential equations are introduced most simply with it.

Before beginning, we establish some notation that will be used in the rest of this book. As before, $Y(t)$ denotes the true solution of the initial value problem with the initial value $Y_0$:

$$ \begin{aligned} Y'(t) &= f(t, Y(t)), \quad t_0 \le t \le b, \\ Y(t_0) &= Y_0. \end{aligned} \tag{2.1} $$

Numerical methods for solving (2.1) will find an approximate solution $y(t)$ at a discrete set of nodes,

$$ t_0 < t_1 < t_2 < \cdots < t_N \le b. \tag{2.2} $$

For simplicity, we will take these nodes to be evenly spaced:

$$ t_n = t_0 + nh, \quad n = 0, 1, \ldots, N. $$

The approximate solution will be denoted using $y(t)$, with some variations. The following notations are all used for the approximate solution at the node points:

$$ y(t_n) = y_h(t_n) = y_n, \quad n = 0, 1, \ldots, N. $$

To obtain an approximate solution $y(t)$ at points in $[t_0, b]$ other than those in (2.2), some form of interpolation must be used. We will not consider that problem here, although there are standard techniques from the theory of interpolation that can be easily applied. For an introduction to interpolation theory, see, e.g., [11, Chap. 3], [12, Chap. 4], [57, Chap. 8], [68, Chap. 8].

2.1 DEFINITION OF EULER’S METHOD

To derive Euler’s method, consider the standard derivative approximation from beginning calculus,

$$ Y'(t) \approx \frac{1}{h}\,[Y(t + h) - Y(t)]. \tag{2.3} $$

This is called a forward difference approximation to the derivative. Applying this to the initial value problem (2.1) at $t = t_n$,

$$ Y'(t_n) = f(t_n, Y(t_n)), $$

we obtain

$$ \frac{1}{h}\,[Y(t_{n+1}) - Y(t_n)] \approx f(t_n, Y(t_n)), $$

$$ Y(t_{n+1}) \approx Y(t_n) + h f(t_n, Y(t_n)). \tag{2.4} $$

Euler’s method is defined by taking this to be exact:

$$ y_{n+1} = y_n + h f(t_n, y_n), \quad 0 \le n \le N - 1. \tag{2.5} $$

For the initial guess, use $y_0 = Y_0$ or some close approximation of $Y_0$. Sometimes $Y_0$ is obtained empirically and thus may be known only approximately. Formula (2.5) gives a rule for computing $y_1, y_2, \ldots, y_N$ in succession. This is typical of most numerical methods for solving ordinary differential equations.
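Formula (2.5) translates directly into a short loop. The following MATLAB fragment is a minimal sketch, not the book's own program; the test problem $Y' = -Y$, $Y(0) = 1$, the interval $[0, 1]$, the choice $N = 10$, and all variable names are illustrative assumptions.

```matlab
% Sketch of Euler's method (2.5) on evenly spaced nodes.
% The test problem Y' = -Y, Y(0) = 1 and the names below are
% illustrative choices, not the book's own program.
f  = @(t, y) -y;      % right side f(t, y)
t0 = 0;  b = 1;       % interval [t0, b]
Y0 = 1;               % initial value
N  = 10;              % number of steps
h  = (b - t0) / N;    % stepsize

t = t0 + (0:N)' * h;  % nodes t_n = t0 + n*h
y = zeros(N + 1, 1);
y(1) = Y0;            % y_0 (MATLAB indices start at 1)
for n = 1:N
    y(n + 1) = y(n) + h * f(t(n), y(n));   % formula (2.5)
end

disp([t, y])          % columns: t_n, y_n
```

Because MATLAB arrays are indexed from 1, `y(n + 1)` holds the approximation $y_n$ at the node $t_n$.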

Some geometric insight into Euler’s method is given in Figure 2.1. The line $z = p(t)$ that is tangent to the graph of $z = Y(t)$ at $t_n$ has slope

$$ Y'(t_n) = f(t_n, Y(t_n)). $$

Using this tangent line to approximate the curve near the point $(t_n, Y(t_n))$, the value of the tangent line

$$ p(t) = Y(t_n) + f(t_n, Y(t_n))(t - t_n) $$

at $t = t_{n+1}$ is given by the right side of (2.4).

Figure 2.1 An illustration of Euler’s method derivation

Example 2.1 The true solution of the problem

$$Y'(t) = -Y(t), \qquad Y(0) = 1 \tag{2.6}$$

is $Y(t) = e^{-t}$. Euler’s method is given by

$$y_{n+1} = y_n - h y_n, \qquad n \ge 0 \tag{2.7}$$

with $y_0 = 1$ and $t_n = nh$. The solution $y(t)$ for three values of $h$ and selected values of $t$ is given in Table 2.1. To illustrate the procedure, we compute $y_1$ and $y_2$ when $h = 0.1$. From (2.7), we obtain

$$y_1 = y_0 - h y_0 = 1 - (0.1)(1) = 0.9, \qquad t_1 = 0.1,$$

$$y_2 = y_1 - h y_1 = 0.9 - (0.1)(0.9) = 0.81, \qquad t_2 = 0.2.$$

For the error in these values, we have

$$Y(t_1) - y_1 = e^{-0.1} - y_1 \doteq 0.004837,$$

$$Y(t_2) - y_2 = e^{-0.2} - y_2 \doteq 0.008731.$$
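The hand computation above can be reproduced with a short program. The following is a minimal Python sketch, not the book’s MATLAB routine; the function name `euler` and its argument order are illustrative assumptions. It applies (2.5) to problem (2.6) with $h = 0.1$ and prints the first two errors.

```python
import math

def euler(f, t0, y0, h, n_steps):
    """Apply Euler's method (2.5) for n_steps steps; return lists of t_n and y_n."""
    t, y = [t0], [y0]
    for n in range(n_steps):
        y.append(y[n] + h * f(t[n], y[n]))   # y_{n+1} = y_n + h f(t_n, y_n)
        t.append(t0 + (n + 1) * h)           # t_{n+1} = t_0 + (n + 1) h
    return t, y

# Problem (2.6): Y'(t) = -Y(t), Y(0) = 1, with true solution Y(t) = exp(-t)
t, y = euler(lambda t, y: -y, 0.0, 1.0, 0.1, 2)
errors = [math.exp(-tn) - yn for tn, yn in zip(t, y)]
for tn, yn, en in zip(t, y, errors):
    print(f"t = {tn:.1f}  y = {yn:.4f}  error = {en:.6f}")
```

Passing the right side as a function argument mirrors the way the MATLAB program later in this section takes `fcn`.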

Table 2.1 Euler’s method for (2.6)

h      t     y_h(t)       Error      Relative error
0.2    1.0   3.2768e-1    4.02e-2    0.109
       2.0   1.0738e-1    2.80e-2    0.207
       3.0   3.5184e-2    1.46e-2    0.293
       4.0   1.1529e-2    6.79e-3    0.371
       5.0   3.7779e-3    2.96e-3    0.439
0.1    1.0   3.4867e-1    1.92e-2    0.0522
       2.0   1.2158e-1    1.38e-2    0.102
       3.0   4.2391e-2    7.40e-3    0.149
       4.0   1.4781e-2    3.53e-3    0.193
       5.0   5.1538e-3    1.58e-3    0.234
0.05   1.0   3.5849e-1    9.39e-3    0.0255
       2.0   1.2851e-1    6.82e-3    0.0504
       3.0   4.6070e-2    3.72e-3    0.0747
       4.0   1.6515e-2    1.80e-3    0.0983
       5.0   5.9205e-3    8.17e-4    0.121

Example 2.2 Solve

$$Y'(t) = \frac{Y(t) + t^2 - 2}{t + 1}, \qquad Y(0) = 2 \tag{2.8}$$

whose true solution is

$$Y(t) = t^2 + 2t + 2 - 2(t+1)\log(t+1).$$

Euler’s method for this differential equation is

$$y_{n+1} = y_n + \frac{h\,(y_n + t_n^2 - 2)}{t_n + 1}, \qquad n \ge 0$$

with $y_0 = 2$ and $t_n = nh$. The solution $y(t)$ is given in Table 2.2 for three values of $h$ and selected values of $t$. A graph of the solution $y_h(t)$ for $h = 0.2$ is given in Figure 2.2. The node values $y_h(t_n)$ have been connected by straight line segments in the graph. Note that the horizontal and vertical scales are different.

Figure 2.2 Euler’s method for problem (2.8), $h = 0.2$. [Plot of the Euler solution $y_h(x)$ and the true solution $Y(x)$ for $0 \le x \le 6$; the vertical axis runs from 0 to 25.]

In both examples, observe the behavior of the error as $h$ decreases. For each fixed value of $t$, note that the errors decrease by a factor of about 2 when $h$ is halved. As an illustration, take Example 2.1 with $t = 5.0$. The errors for $h = 0.2$, $0.1$, and $0.05$, respectively, are

$$2.96 \times 10^{-3}, \qquad 1.58 \times 10^{-3}, \qquad 8.17 \times 10^{-4},$$

and these decrease by successive factors of 1.87 and 1.93. The reader should do the same calculation for other values of $t$, in both Examples 2.1 and 2.2. Also, note that the behavior of the error as $t$ increases may be quite different from the behavior of the relative error. In Example 2.2, the relative errors increase initially, and then they decrease with increasing $t$.
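The same ratio check can be automated for Example 2.2. The sketch below is a minimal Python illustration under stated assumptions: the helper names `f`, `Y`, and `euler_value` are hypothetical, and the nodes $t = 1, 3, 6$ are an arbitrary selection. It computes the error of Euler’s method for (2.8) at each node for $h = 0.2, 0.1, 0.05$ and prints the ratios of successive errors.

```python
import math

def f(t, y):
    # Right side of the differential equation (2.8)
    return (y + t * t - 2.0) / (t + 1.0)

def Y(t):
    # True solution of (2.8)
    return t * t + 2.0 * t + 2.0 - 2.0 * (t + 1.0) * math.log(t + 1.0)

def euler_value(h, t_end):
    """Euler approximation y_h(t_end) for (2.8), starting from y_0 = 2 at t = 0."""
    n_steps = round(t_end / h)
    y = 2.0
    for n in range(n_steps):
        y += h * f(n * h, y)
    return y

ratios = {}
for t_end in (1.0, 3.0, 6.0):
    err = [Y(t_end) - euler_value(h, t_end) for h in (0.2, 0.1, 0.05)]
    ratios[t_end] = (err[0] / err[1], err[1] / err[2])
    print(t_end, ["%.3e" % e for e in err], "ratios: %.2f %.2f" % ratios[t_end])
```

Ratios close to 2 are what one would expect from a method whose error is proportional to $h$.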

Table 2.2 Euler’s method for (2.8)

h      t     y_h(t)    Error      Relative error
0.2    1.0   2.1592    6.82e-2    0.0306
       2.0   3.1697    2.39e-1    0.0701
       3.0   5.4332    4.76e-1    0.0805
       4.0   9.1411    7.65e-1    0.0772
       5.0   14.406    1.09       0.0703
       6.0   21.303    1.45       0.0637
0.1    1.0   2.1912    3.63e-2    0.0163
       2.0   3.2841    1.24e-1    0.0364
       3.0   5.6636    2.46e-1    0.0416
       4.0   9.5125    3.93e-1    0.0397
       5.0   14.939    5.60e-1    0.0361
       6.0   22.013    7.44e-1    0.0327
0.05   1.0   2.2087    1.87e-2    0.00840
       2.0   3.3449    6.34e-2    0.0186
       3.0   5.7845    1.25e-1    0.0212
       4.0   9.7061    1.99e-1    0.0201
       5.0   15.214    2.84e-1    0.0183
       6.0   22.381    3.76e-1    0.0165

MATLAB® program. The following MATLAB program implements Euler’s method. The Euler method is also called the forward Euler method. The backward Euler method is discussed in Chapter 4.

function [t,y] = euler_for(t0,y0,t_end,h,fcn)
%
% function [t,y]=euler_for(t0,y0,t_end,h,fcn)
%
% Solve the initial value problem
%    y' = f(t,y), t0 <= t <= b, y(t0)=y0
% Use Euler's method with a stepsize of h. The user must
% supply a program to define the right side function of the
% differential equation. Use some name, say deriv, and a
% first line of the form
%    function ans=deriv(t,y)
% A sample call would be
%    [t,z]=euler_for(t0,z0,b,delta,'deriv')
%
% Output:
% The routine euler_for will return two vectors, t and y.
% The vector t will contain the node points
%    t(1)=t0, t(j)=t0+(j-1)*h, j=1,2,...,N
% with
%    t(N) <= t_end, t(N)+h > t_end
% The vector y will contain the estimates of the solution Y
% at the node points in t.
%
n = fix((t_end-t0)/h)+1;          % number of node points, including t0
t = linspace(t0,t0+(n-1)*h,n)';   % column vector of nodes
y = zeros(n,1);
y(1) = y0;
for i = 2:n
    % one Euler step: y_i = y_{i-1} + h*f(t_{i-1},y_{i-1})
    y(i) = y(i-1)+h*feval(fcn,t(i-1),y(i-1));
end

2.2 ERROR ANALYSIS OF EULER’S METHOD

The purpose of analyzing Euler’s method is to understand how it works, be able to predict the error when using it, and perhaps accelerate its convergence. Being able to do this for Euler’s method will also make it easier to answer the same questions for other, more efficient numerical methods.

For the error analysis, we assume that the initial value problem (1.7) has a unique solution $Y(t)$ on $t_0 \le t \le b$, and further, that this solution has a bounded second derivative $Y''(t)$ over this interval. We begin by applying Taylor’s theorem to approximate $Y(t_{n+1})$,

$$Y(t_{n+1}) = Y(t_n) + h Y'(t_n) + \tfrac12 h^2 Y''(\xi_n)$$

for some $t_n \le \xi_n \le t_{n+1}$. Using the fact that $Y(t)$ satisfies the differential equation,

$$Y'(t) = f(t, Y(t)),$$

our Taylor approximation becomes

$$Y(t_{n+1}) = Y(t_n) + h f(t_n, Y(t_n)) + \tfrac12 h^2 Y''(\xi_n). \tag{2.9}$$

The term

$$T_{n+1} = \tfrac12 h^2 Y''(\xi_n) \tag{2.10}$$

is called the truncation error for Euler’s method, and it is the error in the approximation

$$Y(t_{n+1}) \approx Y(t_n) + h f(t_n, Y(t_n)).$$

To analyze the error in Euler’s method, subtract

$$y_{n+1} = y_n + h f(t_n, y_n) \tag{2.11}$$

from (2.9), obtaining

$$Y(t_{n+1}) - y_{n+1} = Y(t_n) - y_n + h\,[f(t_n, Y(t_n)) - f(t_n, y_n)] + \tfrac12 h^2 Y''(\xi_n). \tag{2.12}$$

The error in $y_{n+1}$ consists of two parts: (1) the truncation error $T_{n+1}$, newly introduced at step $t_{n+1}$; and (2) the propagated error

$$Y(t_n) - y_n + h\,[f(t_n, Y(t_n)) - f(t_n, y_n)].$$

The propagated error can be simplified by applying the mean value theorem to $f(t, z)$, considering it as a function of $z$,

$$f(t_n, Y(t_n)) - f(t_n, y_n) = \frac{\partial f(t_n, \zeta_n)}{\partial y}\,[Y(t_n) - y_n] \tag{2.13}$$

for some $\zeta_n$ between $Y(t_n)$ and $y_n$. Let $e_k \equiv Y(t_k) - y_k$, $k \ge 0$, and then use (2.13) to rewrite (2.12) as

$$e_{n+1} = \left[1 + h\,\frac{\partial f(t_n, \zeta_n)}{\partial y}\right] e_n + \tfrac12 h^2 Y''(\xi_n). \tag{2.14}$$

These results can be used to give a general error analysis of Euler’s method for the initial value problem.

Let us first consider a special case that will yield some intuitive understanding of the error in Euler’s method. Consider using Euler’s method to solve the problem

$$Y'(t) = 2t, \qquad Y(0) = 0, \tag{2.15}$$

whose true solution is $Y(t) = t^2$. Here $\partial f/\partial y \equiv 0$ and $Y''(t) \equiv 2$, so the error formula (2.14) gives

$$e_{n+1} = e_n + h^2, \qquad e_0 = 0,$$

where we are assuming the initial value $y_0 = Y(0)$. This leads, by induction, to

$$e_n = n h^2, \qquad n \ge 0.$$

Since $nh = t_n$,

$$e_n = h t_n. \tag{2.16}$$

For each fixed $t_n$, the error at $t_n$ is proportional to $h$. The truncation error is $O(h^2)$, but the cumulative effect of these errors is a total error proportional to $h$.
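The identity (2.16) is easy to confirm numerically. The following minimal Python sketch (the function name `euler_error_at` is a hypothetical helper) runs Euler’s method on (2.15) and compares the error at $t_n = 1$ with $h t_n$.

```python
def euler_error_at(t_n, h):
    """Error Y(t_n) - y_n of Euler's method for Y' = 2t, Y(0) = 0, where Y(t) = t^2."""
    n = round(t_n / h)
    y = 0.0
    for k in range(n):
        y += h * 2.0 * (k * h)   # y_{k+1} = y_k + h f(t_k, y_k) with f(t, y) = 2t
    return t_n ** 2 - y

for h in (0.2, 0.1, 0.05):
    e = euler_error_at(1.0, h)
    print(f"h = {h:<5} error = {e:.6f}  h * t_n = {h * 1.0:.6f}")
```

Up to rounding in the floating-point sums, the two printed columns should agree.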

We now turn to a convergence analysis of Euler’s method for solving the general initial value problem on a finite interval $[t_0, b]$:

$$Y'(t) = f(t, Y(t)), \quad t_0 \le t \le b, \qquad Y(t_0) = Y_0. \tag{2.17}$$

For the complete error analysis, we begin with the following lemma. It is quite useful in the analysis of most numerical methods for solving the initial value problem.

Lemma 2.3 For any real $t$,

$$1 + t \le e^t,$$

and for any $t \ge -1$, any $m \ge 0$,

$$0 \le (1 + t)^m \le e^{mt}. \tag{2.18}$$

Proof. Using Taylor’s theorem yields

$$e^t = 1 + t + \tfrac12 t^2 e^{\xi}$$

with $\xi$ between $0$ and $t$. Since the remainder is never negative, the first result is proved. Formula (2.18) follows easily: for $t \ge -1$ we have $0 \le 1 + t \le e^t$, and raising each term to the power $m \ge 0$ preserves the inequalities.

For this and several of the following chapters, we assume that the derivative function $f(t, y)$ satisfies the following stronger Lipschitz condition: there exists $K \ge 0$ such that

$$|f(t, y_1) - f(t, y_2)| \le K\,|y_1 - y_2| \tag{2.19}$$

for $-\infty < y_1, y_2 < \infty$ and $t_0 \le t \le b$. Although stronger than necessary, it simplifies the proofs. In addition, given a function $f(t, y)$ satisfying the weaker condition (1.10) and a solution $Y(t)$ to the initial value problem, the function $f$ can be modified to satisfy (2.19) without changing the solution $Y(t)$ or the essential character of the initial value problem (2.17) and its numerical solution.

Theorem 2.4 Let $f(t, y)$ be a continuous function for $t_0 \le t \le b$ and $-\infty < y < \infty$, and further assume that $f(t, y)$ satisfies the Lipschitz condition (2.19). Assume that the solution $Y(t)$ of (2.17) has a continuous second derivative on $[t_0, b]$. Then the solution $\{y_h(t_n) \mid t_0 \le t_n \le b\}$ obtained by Euler’s method satisfies

$$\max_{t_0 \le t_n \le b} |Y(t_n) - y_h(t_n)| \le e^{(b - t_0)K}\,|e_0| + \left[\frac{e^{(b - t_0)K} - 1}{K}\right] \tau(h), \tag{2.20}$$

where

$$\tau(h) = \tfrac12 h\,\|Y''\|_\infty = \tfrac12 h \max_{t_0 \le t \le b} |Y''(t)| \tag{2.21}$$

and $e_0 = Y_0 - y_h(t_0)$. If, in addition, we have

$$|Y_0 - y_h(t_0)| \le c_1 h \quad \text{as } h \to 0 \tag{2.22}$$

for some $c_1 \ge 0$ (e.g., if $Y_0 = y_0$ for all $h$, then $c_1 = 0$), then there is a constant $B \ge 0$ for which

$$\max_{t_0 \le t_n \le b} |Y(t_n) - y_h(t_n)| \le Bh. \tag{2.23}$$

Proof. Let $e_n = Y(t_n) - y(t_n)$, $n \ge 0$. Let $N \equiv N(h)$ be the integer for which

$$t_N \le b, \qquad t_{N+1} > b.$$

Define

$$\tau_n = \tfrac12 h\,Y''(\xi_n), \qquad 0 \le n \le N(h) - 1,$$

based on the truncation error in (2.10). Easily, we obtain

$$\max_{0 \le n \le N-1} |\tau_n| \le \tau(h)$$

using (2.21). Recalling (2.12), we have

$$e_{n+1} = e_n + h\,[f(t_n, Y_n) - f(t_n, y_n)] + h \tau_n. \tag{2.24}$$

We are using the common notation $Y_n \equiv Y(t_n)$. Taking bounds using (2.19), we obtain

$$|e_{n+1}| \le |e_n| + hK\,|Y_n - y_n| + h\,|\tau_n|,$$

$$|e_{n+1}| \le (1 + hK)\,|e_n| + h \tau(h), \qquad 0 \le n \le N(h) - 1. \tag{2.25}$$

Apply this recursively to obtain

$$|e_n| \le (1 + hK)^n\,|e_0| + \left[1 + (1 + hK) + \cdots + (1 + hK)^{n-1}\right] h \tau(h).$$

Using the formula for the sum of a finite geometric series,

$$1 + r + r^2 + \cdots + r^{n-1} = \frac{r^n - 1}{r - 1}, \qquad r \ne 1, \tag{2.26}$$

we obtain

$$|e_n| \le (1 + hK)^n\,|e_0| + \left[\frac{(1 + hK)^n - 1}{K}\right] \tau(h). \tag{2.27}$$

Using Lemma 2.3, we obtain

$$(1 + hK)^n \le e^{nhK} = e^{(t_n - t_0)K} \le e^{(b - t_0)K},$$

and this with (2.27) implies the main result (2.20).

The remaining result (2.23) is a trivial corollary of (2.20) with the constant $B$ given by

$$B = c_1 e^{(b - t_0)K} + \frac12 \left[\frac{e^{(b - t_0)K} - 1}{K}\right] \|Y''\|_\infty.$$

The result (2.23) is consistent with the behavior observed in Tables 2.1 and 2.2 earlier in this chapter, and it agrees with (2.16) for the special case (2.15). When $h$ is halved, the bound $Bh$ is also halved, and that is the behavior in the error observed earlier. Euler’s method is said to converge with order 1, because that is the power of $h$ that occurs in the error bound. In general, if we have

$$|Y(t_n) - y_h(t_n)| \le c h^p, \qquad t_0 \le t_n \le b \tag{2.28}$$

for some constant $p \ge 0$, then we say that the numerical method is convergent with order $p$. Naturally, the higher the order $p$, the faster the convergence we can expect.

We emphasize that for the error bound (2.20) to hold, the true solution must be assumed to have a continuous second derivative $Y''(t)$ over $[t_0, b]$. This assumption is not always valid. When $Y(t)$ does not have such a continuous second derivative, the error bound (2.20) no longer holds. (See Problem 11.)

The error bound (2.20) is valid for a large family of initial value problems. However, it usually produces a very pessimistic numerical bound for the error, due to the presence of the exponential terms. Under certain circumstances, we can improve the result. Assume

$$\frac{\partial f(t, y)}{\partial y} \le 0, \tag{2.29}$$

$$K \equiv \sup_{\substack{t_0 \le t \le b \\ -\infty < y < \infty}} \left|\frac{\partial f(t, y)}{\partial y}\right| < \infty. \tag{2.30}$$

Note the relation of (2.29) to the stability condition (1.20) in Chapter 1. Also assume that $h$ has been chosen so small that

$$1 - hK \ge -1.$$

Returning to (2.14), we have

$$e_{n+1} = e_n + h\,\frac{\partial f(t_n, \zeta_n)}{\partial y}\,e_n + \tfrac12 h^2 Y''(\xi_n) \tag{2.31}$$

with $\zeta_n$ between $Y(t_n)$ and $y_n$. Using (2.29) and (2.30), we have

$$1 \ge 1 + h\,\frac{\partial f(t_n, \zeta_n)}{\partial y} \ge 1 - hK \ge -1.$$

When combined with (2.31), we have

$$|e_{n+1}| \le |e_n| + c h^2, \qquad t_0 \le t_n \le b, \tag{2.32}$$

where

$$c = \tfrac12 \|Y''\|_\infty = \tfrac12 \max_{t_0 \le t \le b} |Y''(t)|.$$

In addition, assume $e_0 = 0$. Applying (2.32) inductively, we obtain

$$|e_n| \le n c h^2 = c\,(t_n - t_0)\,h. \tag{2.33}$$

The error is bounded by a quantity proportional to $h$, and the coefficient of the $h$ term increases linearly with respect to the point $t_n$, in contrast to the exponential growth given in the bound (2.20).

The error bound in Theorem 2.4 is rigorous, and is useful in providing an insight to the convergence behavior of the numerical solution. However, it is rarely advisable to use (2.20) for an actual error bound, as the next example shows.

Example 2.5 The problem

$$Y'(t) = -Y(t), \qquad Y(0) = 1 \tag{2.34}$$

was solved earlier in this chapter, with the results given in Table 2.1. To apply (2.20), we have $\partial f(t, y)/\partial y = -1$, $K = 1$. The true solution is $Y(t) = e^{-t}$; thus

$$\max_{0 \le t \le b} |Y''(t)| = 1.$$

With $y_0 = Y_0 = 1$, the bound (2.20) becomes

$$\left|e^{-t_n} - y_h(t_n)\right| \le \tfrac12 h \left(e^{b} - 1\right), \qquad 0 \le t_n \le b. \tag{2.35}$$

As $h \to 0$, this shows that $y_h(t)$ converges to $e^{-t}$. However, this bound is excessively conservative. As $b$ increases, the bound increases exponentially. For $b = 5$, the bound is

$$\left|e^{-t_n} - y_h(t_n)\right| \le \tfrac12 h \left(e^{5} - 1\right) \approx 73.7h, \qquad 0 \le t_n \le 5.$$

This is far larger than the actual errors shown in Table 2.1, by several orders of magnitude. For the problem (2.34), the improved error bound (2.33) applies with $c = \tfrac12$ (see Problem 7). A more general approach for accurate error estimation is discussed in the following section.
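The gap between the two bounds and the actual error can be tabulated directly. The sketch below is a minimal Python illustration for problem (2.34) with $b = 5$; it assumes the closed form $y_k = (1 - h)^k$ that follows from (2.7), and the variable names are illustrative only.

```python
import math

b = 5.0
rows = []
for h in (0.2, 0.1, 0.05):
    n = round(b / h)
    # Euler's method (2.7) gives y_k = (1 - h)^k; take the largest error over the nodes
    actual = max(abs(math.exp(-k * h) - (1.0 - h) ** k) for k in range(n + 1))
    general = 0.5 * h * (math.exp(b) - 1.0)   # bound (2.35), obtained from (2.20)
    improved = 0.5 * b * h                    # bound (2.33) with c = 1/2, at t_n = b
    rows.append((h, actual, improved, general))
    print("h = %-5g max error = %.3e  (2.33): %.3e  (2.35): %.3e" % rows[-1])
```

In each row the actual maximum error should lie below the linear bound (2.33), which in turn lies well below the exponential bound (2.35).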

2.3 ASYMPTOTIC ERROR ANALYSIS

To obtain more accurate predictions of the error, we consider asymptotic error estimates. Assume that $Y$ is 3 times continuously differentiable and

$$\frac{\partial f(t, y)}{\partial y}, \qquad \frac{\partial^2 f(t, y)}{\partial y^2}$$

are both continuous for all values of $(t, y)$ near $(t, Y(t))$, $t_0 \le t \le b$. Then one can prove that the error in Euler’s method satisfies

$$Y(t_n) - y_h(t_n) = h D(t_n) + O(h^2), \qquad t_0 \le t_n \le b. \tag{2.36}$$

The term $O(h^2)$ denotes a quantity of maximal size proportional to $h^2$ over the interval $[t_0, b]$. More generally, the statement

$$F(h; t_n) = O(h^p), \qquad t_0 \le t_n \le b$$

for some constant $p$ means

$$\max_{t_0 \le t_n \le b} |F(h; t_n)| \le c\,h^p$$

for some constant $c$ and all sufficiently small values of $h$.

Assuming $y_0 = Y_0$, the usual case, the function $D(t)$ satisfies an initial value problem for a linear differential equation,

$$D'(t) = g(t) D(t) + \tfrac12 Y''(t), \qquad D(t_0) = 0, \tag{2.37}$$

where

$$g(t) = \left.\frac{\partial f(t, y)}{\partial y}\right|_{y = Y(t)}.$$

When $D(t)$ can be obtained explicitly, the leading error term $h D(t_n)$ from the formula (2.36) usually provides a quite good estimate of the true error $Y(t_n) - y_h(t_n)$, and the quality of the estimation improves with decreasing stepsize $h$.

Example 2.6 Consider again the problem (2.34). Then $D(t)$ satisfies

$$D'(t) = -D(t) + \tfrac12 e^{-t}, \qquad D(0) = 0.$$

The solution is

$$D(t) = \tfrac12 t e^{-t}.$$

Using (2.36), the error satisfies

$$Y(t_n) - y_h(t_n) \approx \tfrac12 h t_n e^{-t_n}. \tag{2.38}$$

We are neglecting the $O(h^2)$ term, since it should be substantially smaller than the term $h D(t)$ in (2.36), for all sufficiently small values of $h$. To check the accuracy of (2.38), consider $t_n = 5.0$ with $h = 0.05$. Then

$$\tfrac12 h t_n e^{-t_n} \doteq 0.000842.$$

From Table 2.1, the actual error is $0.000817$, which is quite close to our estimate of it.
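The comparison in Example 2.6 can be checked in a few lines. This minimal Python sketch again assumes the closed form $y_n = (1 - h)^n$ implied by (2.7); the variable names are illustrative.

```python
import math

h, t_n = 0.05, 5.0
n = round(t_n / h)
y_n = (1.0 - h) ** n                         # Euler solution of (2.34), from (2.7)
actual = math.exp(-t_n) - y_n                # true error Y(t_n) - y_h(t_n)
estimate = 0.5 * h * t_n * math.exp(-t_n)    # asymptotic estimate (2.38)
print(f"actual = {actual:.6f}, estimate = {estimate:.6f}")
```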

How do we obtain the result given in (2.36)? We sketch the main ideas but do not fill in all of the details. We begin by approximating the error equation (2.31) with

$$\hat e_{n+1} = \left[1 + h\,\frac{\partial f(t_n, Y(t_n))}{\partial y}\right] \hat e_n + \tfrac12 h^2 Y''(t_n). \tag{2.39}$$

We have used

$$\frac{\partial f(t_n, \zeta_n)}{\partial y} \approx \frac{\partial f(t_n, Y(t_n))}{\partial y}, \qquad Y''(\xi_n) \approx Y''(t_n).$$

This will cause an approximation error

$$e_n - \hat e_n = O(h^2), \tag{2.40}$$

although that may not be immediately evident. In addition, we may write

$$\hat e_n = h \delta_n, \qquad n = 0, 1, \ldots, \tag{2.41}$$

on the basis of (2.33); and for simplicity, assume $\delta_0 = 0$. Substituting (2.41) into (2.39) and then canceling $h$, we obtain

$$\delta_{n+1} = \left[1 + h\,\frac{\partial f(t_n, Y(t_n))}{\partial y}\right] \delta_n + \tfrac12 h\,Y''(t_n) = \delta_n + h \left[\frac{\partial f(t_n, Y(t_n))}{\partial y}\,\delta_n + \tfrac12 Y''(t_n)\right].$$

This is Euler’s method applied to (2.37). Applying the earlier convergence analysis for Euler’s method, we have

$$\max_{t_0 \le t_n \le b} |D(t_n) - \delta_n| \le Bh$$

for some constant $B > 0$. We then multiply by $h$ to get

$$\max_{t_0 \le t_n \le b} |h D(t_n) - \hat e_n| \le B h^2.$$

Combining this with (2.40) demonstrates (2.36), although we have omitted a number of details.

We comment that the function $D(t)$ defined by (2.37) is continuously differentiable. Then the error formula (2.36) allows us to use the divided difference

$$\frac{y_h(t_{n+1}) - y_h(t_n)}{h}$$

as an approximation to the derivative $Y'(t_n)$ (or $Y'(t_{n+1})$),

$$Y'(t_n) - \frac{y_h(t_{n+1}) - y_h(t_n)}{h} = O(h). \tag{2.42}$$

The proof of this is left as Problem 16.

2.3.1 Richardson extrapolation

It is not practical to try to find the function $D(t)$ from the problem (2.37), principally because it requires knowledge of the true solution $Y(t)$. The real power of the formula (2.36) is that it describes precisely the error behavior. We can use (2.36) to estimate the solution error and to improve the quality of the numerical solution, without an explicit knowledge of the function $D(t)$. For this purpose, we need two numerical solutions, say, $y_h(t)$ and $y_{2h}(t)$ over the interval $t_0 \le t \le b$.

Assume that $t$ is a node point with the stepsize $2h$, and note that it is then also a node point with the stepsize $h$. By the formula (2.36), we have

$$Y(t) - y_h(t) = h D(t) + O(h^2),$$

$$Y(t) - y_{2h}(t) = 2h D(t) + O(h^2).$$

Multiply the first equation by 2, and then subtract the second equation to eliminate $D(t)$, obtaining

$$Y(t) - [2 y_h(t) - y_{2h}(t)] = O(h^2). \tag{2.43}$$

This can also be written as

$$Y(t) - y_h(t) = y_h(t) - y_{2h}(t) + O(h^2). \tag{2.44}$$

We know from our earlier error analysis that $Y(t) - y_h(t) = O(h)$. By dropping the higher-order term $O(h^2)$ in (2.43), we obtain Richardson’s extrapolation formula

$$Y(t) \approx \tilde y_h(t) \equiv 2 y_h(t) - y_{2h}(t). \tag{2.45}$$

Table 2.3 Euler’s method with Richardson extrapolation

t     Y(t) − y_h(t)   y_h(t) − y_2h(t)   ỹ_h(t)          Y(t) − ỹ_h(t)
1.0   9.39e-3         9.81e-3            3.6829346e-1    -4.14e-4
2.0   6.82e-3         6.94e-3            1.3544764e-1    -1.12e-4
3.0   3.72e-3         3.68e-3            4.9748443e-2    3.86e-5
4.0   1.80e-3         1.73e-3            1.8249877e-2    6.58e-5
5.0   8.17e-4         7.67e-4            6.6872853e-3    5.07e-5

Dropping the higher-order term in (2.44), we obtain Richardson’s error estimate

$$Y(t) - y_h(t) \approx y_h(t) - y_{2h}(t). \tag{2.46}$$

With these formulas, we can estimate the error in Euler’s method and can also obtain a more rapidly convergent solution $\tilde y_h(t) = 2 y_h(t) - y_{2h}(t)$.

Example 2.7 Consider (2.34) with stepsize $h = 0.05$, $2h = 0.1$. Then Table 2.3 contains Richardson’s extrapolation results for selected values of $t$. Note that (2.46) is a fairly accurate estimator of the error, and that $\tilde y_h(t)$ is much more accurate than $y_h(t)$.

Using (2.43), we have

$$Y(t_n) - \tilde y_h(t_n) = O(h^2), \tag{2.47}$$

an improvement on the convergence order of Euler’s method. We will consider again this type of extrapolation for the methods introduced in later chapters. However, the actual formulas may be different from (2.45) and (2.46), and they will depend on the order of the method.
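Richardson’s error estimate and extrapolation are simple to compute once two Euler solutions are available. The following minimal Python sketch applies both to problem (2.34) with $h = 0.05$; the helper name `euler_decay` and the layout of `results` are illustrative assumptions.

```python
import math

def euler_decay(h, t):
    """Euler approximation y_h(t) for (2.34): Y' = -Y, Y(0) = 1."""
    y = 1.0
    for _ in range(round(t / h)):
        y += h * (-y)
    return y

h = 0.05
results = []
for t in (1.0, 2.0, 3.0, 4.0, 5.0):
    y_h, y_2h = euler_decay(h, t), euler_decay(2 * h, t)
    error_estimate = y_h - y_2h          # Richardson's error estimate: y_h - y_2h
    y_tilde = 2.0 * y_h - y_2h           # Richardson's extrapolation: 2 y_h - y_2h
    true_error = math.exp(-t) - y_h
    results.append((t, true_error, error_estimate, math.exp(-t) - y_tilde))
    print("t = %.1f  %.2e  %.2e  %.2e" % results[-1])
```

Each printed row lists $t$, the true error of $y_h$, the Richardson estimate of that error, and the error of the extrapolated value.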

2.4 NUMERICAL STABILITY

Recall the discussion of stability for the initial value problem given in Section 1.2. In particular, recall the result (1.12) bounding the change in the solution $Y(t)$ when the initial condition is perturbed by $\epsilon$. To perform a similar analysis for Euler’s method, we define a numerical solution $\{z_n\}$ by

$$z_{n+1} = z_n + h f(t_n, z_n), \qquad n = 0, 1, \ldots, N(h) - 1 \tag{2.48}$$

with $z_0 = y_0 + \epsilon$. This is analogous to looking at the solution $Y(t; \epsilon)$ to the perturbed initial value problem, in (1.11). We compare the two numerical solutions $\{z_n\}$ and $\{y_n\}$ as $h \to 0$.

Let $e_n = z_n - y_n$, $n \ge 0$. Then $e_0 = \epsilon$, and subtracting $y_{n+1} = y_n + h f(t_n, y_n)$ from (2.48), we obtain

$$e_{n+1} = e_n + h\,[f(t_n, z_n) - f(t_n, y_n)].$$

This has exactly the same form as (2.24), with $\tau_n$ set to zero. Using the same procedure as that following (2.24), we have

$$\max_{0 \le n \le N(h)} |z_n - y_n| \le e^{(b - t_0)K}\,|\epsilon|.$$

Consequently, there is a constant $c \ge 0$, independent of $h$, such that

$$\max_{0 \le n \le N(h)} |z_n - y_n| \le c\,|\epsilon|. \tag{2.49}$$

This is the analog to the result (1.12) for the original initial value problem. This says that Euler’s method is a stable numerical method for the solution of the initial value problem (2.17). We insist that all numerical methods for initial value problems possess this form of stability, imitating the stability of the original problem (2.17). In addition, we require other forms of stability, based on replicating additional properties of the initial value problem; these are introduced later.
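The stability bound can be observed numerically by running Euler’s method twice, once with a perturbed initial value. The sketch below is a minimal Python illustration; the test equation $Y' = -Y + 2\cos t$ (for which $K = 1$), the interval $[0, 5]$, and the perturbation size are assumptions chosen for the demonstration.

```python
import math

def euler(f, t0, y0, h, n_steps):
    """Return [y_0, ..., y_N] computed by Euler's method."""
    y = [y0]
    for n in range(n_steps):
        y.append(y[n] + h * f(t0 + n * h, y[n]))
    return y

f = lambda t, y: -y + 2.0 * math.cos(t)   # sample right side; Lipschitz constant K = 1
t0, b, K, eps = 0.0, 5.0, 1.0, 1e-3
checks = []
for h in (0.1, 0.05, 0.025):
    N = round((b - t0) / h)
    y = euler(f, t0, 1.0, h, N)
    z = euler(f, t0, 1.0 + eps, h, N)     # perturbed initial value z_0 = y_0 + eps
    max_diff = max(abs(zn - yn) for zn, yn in zip(z, y))
    bound = math.exp((b - t0) * K) * abs(eps)
    checks.append((h, max_diff, bound))
    print(f"h = {h:<6} max|z_n - y_n| = {max_diff:.3e}  bound = {bound:.3e}")
```

The bound does not depend on $h$, which is the point of the stability result: refining the mesh does not amplify the effect of the initial perturbation.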

2.4.1 Rounding error accumulation

The finite precision of computer arithmetic affects the accuracy in the numerical solution of a differential equation. To investigate this effect, consider Euler’s method (2.5). The simple arithmetic operations and the evaluation of $f(x_n, y_n)$ will usually contain errors due to rounding or chopping. For definitions of chopped and rounded floating-point arithmetic, see [12, p. 39]. Thus what is actually evaluated is

$$y_{n+1} = y_n + h f(x_n, y_n) + \delta_n, \qquad n \ge 0, \quad y_0 = Y_0. \tag{2.50}$$

The quantity $\delta_n$ will be based on the precision of the arithmetic, and its size is affected by that of $y_n$. To simplify our work, we assume simply

$$|\delta_n| \le c u \cdot \max_{x_0 \le x \le x_n} |Y(x)|, \tag{2.51}$$

where $u$ is the machine epsilon of the computer (see [12, p. 38]) and $c$ is a constant of magnitude 1 or larger. Using double precision arithmetic with a processor based on the IEEE floating-point arithmetic standard, $u \doteq 2.2 \times 10^{-16}$.

To compare $\{y_n\}$ to the true solution $Y(x)$, we begin by writing

$$Y(x_{n+1}) = Y(x_n) + h f(x_n, Y(x_n)) + \tfrac12 h^2 Y''(\xi_n), \tag{2.52}$$

which was obtained earlier in (2.9). Subtracting (2.50) from (2.52), we get

$$Y(x_{n+1}) - y_{n+1} = Y(x_n) - y_n + h\,[f(x_n, Y(x_n)) - f(x_n, y_n)] + \tfrac12 h^2 Y''(\xi_n) - \delta_n, \qquad n \ge 0 \tag{2.53}$$

with $Y(x_0) - y_0 = 0$. This equation is analogous to the error equation given earlier in (2.12), with the role of the truncation error $\tfrac12 h^2 Y''(\xi_n)$ in that earlier equation replaced by the term

$$\tfrac12 h^2 Y''(\xi_n) - \delta_n = h \left[\tfrac12 h\,Y''(\xi_n) - \frac{\delta_n}{h}\right]. \tag{2.54}$$

If the argument in the proof of Theorem 2.4 is applied to (2.53) rather than to (2.12), then the error result (2.20) generalizes to

$$|Y(x_n) - y_n| \le c_1 \left\{ \tfrac12 h \left[\max_{x_0 \le x \le b} |Y''(x)|\right] + \frac{cu}{h} \left[\max_{x_0 \le x \le b} |Y(x)|\right] \right\} \tag{2.55}$$

for $x_0 \le x_n \le b$, where

$$c_1 = \frac{e^{(b - x_0)K} - 1}{K},$$

and $K$ is the supremum of $|\partial f(x, y)/\partial y|$, defined in (2.30). The term in braces on the right side of (2.55) is obtained by bounding the term in brackets on the right side of (2.54) and using the assumption (2.51).

In essence, (2.55) says that

$$|Y(x_n) - y_n| \le \alpha_1 h + \frac{\alpha_2}{h}, \qquad x_0 \le x_n \le b$$

for appropriate choices of $\alpha_1$, $\alpha_2$. Note that $\alpha_2$ is generally small because $u$ is small. Thus the error bound will initially decrease as $h$ decreases; but at a critical value of $h$, call it $h^*$, the error bound will increase, because of the term $\alpha_2/h$. The same qualitative behavior turns out to apply also for the actual error $Y(x_n) - y_n$. Thus there is a limit on the attainable accuracy, and it is less than the number of digits available in the machine floating-point representation. This same analysis is valid for other numerical methods, with a term of the form

$$\frac{cu}{h} \left[\max_{x_0 \le x \le b} |Y(x)|\right]$$

to be included as part of the global error for the numerical method. With rounded floating-point arithmetic, this behavior can usually be improved on. But with chopped floating-point arithmetic, it is likely to be accurate in a qualitative sense: as $h$ is halved, the contribution to the error due to the chopped arithmetic will double.
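The critical stepsize $h^*$ of the bound can be located explicitly. Minimizing $\tfrac12 h \max|Y''| + (cu/h) \max|Y|$ over $h > 0$ by setting its derivative to zero gives $h^* = \sqrt{2cu \max|Y| / \max|Y''|}$; this minimization is an illustrative step that the text does not carry out. The Python sketch below evaluates it with placeholder values $c = 1$ and $\max|Y| = \max|Y''| = 1$, which are assumptions, and the function names are hypothetical.

```python
import math

def rounding_bound_terms(h, u, c=1.0, max_Y=1.0, max_Y2=1.0):
    """Truncation part (1/2) h max|Y''| and rounding part (c u / h) max|Y| of the bound."""
    return 0.5 * h * max_Y2, c * u * max_Y / h

def critical_stepsize(u, c=1.0, max_Y=1.0, max_Y2=1.0):
    """Stepsize minimizing (1/2) h max|Y''| + (c u / h) max|Y| over h > 0."""
    return math.sqrt(2.0 * c * u * max_Y / max_Y2)

u = 2.2e-16                        # IEEE double precision machine epsilon
h_star = critical_stepsize(u)
for h in (1e-4, 1e-6, h_star, 1e-10, 1e-12):
    trunc, rnd = rounding_bound_terms(h, u)
    print(f"h = {h:.2e}  truncation = {trunc:.2e}  rounding = {rnd:.2e}  sum = {trunc + rnd:.2e}")
```

At $h = h^*$ the two parts of the bound are equal; for smaller $h$ the rounding part dominates and the bound grows again.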

Example 2.8 Solve the problem

$$Y'(x) = -Y(x) + 2\cos(x), \qquad Y(0) = 1$$

using Euler’s method. The true solution is $Y(x) = \sin x + \cos x$. Use a four-digit decimal machine with chopped floating-point arithmetic, and then repeat the calculation with rounded floating-point arithmetic. The machine epsilon in this arithmetic is $u = 0.001$. Finally, give the results of Euler’s method with exact arithmetic. The results with decreasing $h$ are given in Table 2.4. The errors for the answers that are obtained by using floating-point chopped and/or rounded decimal arithmetic are based on the true answers rounded to four digits.

Table 2.4 Effects of rounding/chopping errors in Euler’s method. Each error column lists $Y(x) - y_h(x)$.

h      x   Chopped arithmetic   Rounded arithmetic   Exact arithmetic
0.04   1   -1.00e-2             -1.70e-2             -1.70e-2
       2   -1.17e-2             -1.83e-2             -1.83e-2
       3   -1.20e-3             -2.80e-3             -2.78e-3
       4   1.00e-2              1.60e-2              1.53e-2
       5   1.13e-2              1.96e-2              1.94e-2
0.02   1   7.00e-3              -9.00e-3             -8.46e-3
       2   4.00e-3              -9.10e-3             -9.13e-3
       3   2.30e-3              -1.40e-3             -1.40e-3
       4   -6.00e-3             8.00e-3              7.62e-3
       5   -6.00e-3             8.50e-3              9.63e-3
0.01   1   2.80e-2              -3.00e-3             -4.22e-3
       2   2.28e-2              -4.30e-3             -4.56e-3
       3   7.40e-3              -4.00e-4             -7.03e-4
       4   -2.30e-2             3.00e-3              3.80e-3
       5   -2.41e-2             4.60e-3              4.81e-3

Note that the errors with the chopped case are affected at $h = 0.02$, with the error at $x = 3$ larger than when $h = 0.04$ for that case. The increasing error is clear with the $h = 0.01$ case, at all points. In contrast, the errors using rounded arithmetic continue to decrease, although the $h = 0.01$ case is affected slightly, in comparison to the true errors when no rounding is present. The column with the errors for the case with exact arithmetic shows that the use of the rounded decimal arithmetic has less effect on the error than does the use of chopped arithmetic. But there is still an effect.

PROBLEMS

1. Solve the following problems using Euler’s method with stepsizes of $h = 0.2, 0.1, 0.05$. Compute the error and relative error using the true solution $Y(t)$. For selected values of $t$, observe the ratio by which the error decreases when $h$ is halved.

(a) $Y'(t) = [\cos(Y(t))]^2$, $0 \le t \le 10$, $Y(0) = 0$; $Y(t) = \tan^{-1}(t)$.

(b) $Y'(t) = \dfrac{1}{1 + t^2} - 2[Y(t)]^2$, $0 \le t \le 10$, $Y(0) = 0$; $Y(t) = \dfrac{t}{1 + t^2}$.

(c) $Y'(t) = \tfrac14 Y(t)\left[1 - \tfrac{1}{20} Y(t)\right]$, $0 \le t \le 20$, $Y(0) = 1$; $Y(t) = \dfrac{20}{1 + 19 e^{-t/4}}$.

(d) $Y'(t) = -[Y(t)]^2$, $1 \le t \le 10$, $Y(1) = 1$; $Y(t) = \dfrac{1}{t}$.

(e) $Y'(t) = t e^{-t} - Y(t)$, $0 \le t \le 10$, $Y(0) = 1$; $Y(t) = \left(1 + \tfrac12 t^2\right) e^{-t}$.

(f) $Y'(t) = \dfrac{t^3}{Y(t)}$, $0 \le t \le 10$, $Y(0) = 1$; $Y(t) = \sqrt{\tfrac12 t^4 + 1}$.

(g) $Y'(t) = \left(3t^2 + 1\right) Y(t)^2$, $0 \le t \le 10$, $Y(0) = -1$; $Y(t) = -\left(t^3 + t + 1\right)^{-1}$.

2. Compute the true solution to the problem

$$Y'(t) = -e^{-t}\,Y(t), \qquad Y(0) = 1.$$

Using Euler’s method, solve this equation numerically with stepsizes of $h = 0.2, 0.1, 0.05$. Compute the error and relative error using the true solution $Y(t)$.

3. Consider the linear problem

$$Y'(t) = \lambda Y(t) + (1 - \lambda)\cos(t) - (1 + \lambda)\sin(t), \qquad Y(0) = 1.$$

The true solution is $Y(t) = \sin(t) + \cos(t)$. Solve this problem using Euler’s method with several values of $\lambda$ and $h$, for $0 \le t \le 10$. Comment on the results.

(a) $\lambda = -1$; $h = 0.5, 0.25, 0.125$.

(b) $\lambda = 1$; $h = 0.5, 0.25, 0.125$.

(c) $\lambda = -5$; $h = 0.5, 0.25, 0.125, 0.0625$.

(d) $\lambda = 5$; $h = 0.125, 0.0625$.

4. As a special case in which the error of Euler’s method can be analyzed directly, consider Euler’s method applied to

$$Y'(t) = Y(t), \qquad Y(0) = 1.$$

The true solution is $e^t$.

(a) Show that the solution of Euler’s method can be written as

$$y_h(t_n) = (1 + h)^{t_n/h}, \qquad n \ge 0.$$

(b) Using L’Hospital’s rule from calculus, show that

$$\lim_{h \to 0} (1 + h)^{1/h} = e.$$

This then proves that for fixed $t = t_n$,

$$\lim_{h \to 0} y_h(t) = e^t.$$

(c) Let us do a more delicate convergence analysis. Use the property $a^b = e^{b \log a}$ to write

$$y_h(t_n) = e^{t_n \log(1 + h)/h}.$$

Then use the formula

$$\log(1 + h) = h - \tfrac12 h^2 + O(h^3)$$

and Taylor expansion of the natural exponential function to show that

$$Y(t_n) - y_h(t_n) = \tfrac12 h t_n e^{t_n} + O(h^2).$$

This shows that for $h$ small, the error is almost proportional to $h$, a phenomenon already observed from the numerical results given in Tables 2.1 and 2.2.

5. Repeat the general procedures of Problem 4, but do so for the initial value problem

$$Y'(t) = cY(t), \qquad Y(0) = 1$$

with $c \ne 0$ a given constant.

6. Check the accuracy of the error bound (2.35) for $b = 1, 2, 3, 4, 5$ and $h = 0.2, 0.1, 0.05$. Compute the error bound and compare it with Table 2.1.

7. Consider again the problem (2.34) of Example 2.5. Let us derive a more accurate error bound than the one given in Theorem 2.4. From (2.14) we have

$$e_{n+1} = (1 - h)\,e_n + \tfrac12 h^2 e^{-\xi_n}.$$

Using this formula with $0 < h \le 1$, and recalling $e_0 = 0$, show the error bound

$$|e_n| \le \tfrac12 h t_n.$$

Compare this error bound to the true errors in Table 2.1. Hint: $1 - h \le 1$ and $e^{-\xi_n} \le 1$.

8. Compute the error bound (2.20), assuming $y_0 = Y_0$, for the problem (2.8) given earlier in this chapter. Compare the bound with the actual errors given in Table 2.2, for $b = 1, 2, 3, 4, 5$ and $h = 0.2, 0.1, 0.05$.

9. Repeat Problem 8 for the equation in Problem 1(a).

10. For Problems 1(b)–(d), the constant $K$ in (2.19) will be infinite. To use the error bound (2.20) in such cases, let

$$K = 2 \cdot \max_{t_0 \le t \le b} \left|\frac{\partial f(t, Y(t))}{\partial y}\right|.$$

This can be shown to be adequate for all sufficiently small values of $h$. Then repeat Problem 8 for Problems 1(b)–(d).

11. Consider the initial value problem

$$Y'(t) = \alpha t^{\alpha - 1}, \qquad Y(0) = 0,$$

where $\alpha > 0$. The true solution is $Y(t) = t^{\alpha}$. When $\alpha$ is not an integer, the true solution is not infinitely differentiable. In particular, to have $Y$ twice continuously differentiable, we need $\alpha \ge 2$. Use the Euler method to solve the initial value problem for $\alpha = 2.5, 1.5, 1.1$ with stepsize $h = 0.2, 0.1, 0.05$. Compute the solution errors at the nodes, and determine numerically the convergence orders of the Euler method for these problems.

12. The solution of

$$Y'(t) = \lambda Y(t) + \cos(t) - \lambda \sin(t), \qquad Y(0) = 0$$

is $Y(t) = \sin(t)$. Find the asymptotic error formula (2.36) in this case. Also compute the Euler solution for $0 \le t \le 6$, $h = 0.2, 0.1, 0.05$, and $\lambda = 1, -1$. Compare the true errors with those obtained from the asymptotic estimate

$$Y(t_n) - y_n \approx h D(t_n).$$

13. Repeat Problem 12 for Problem 1(d). Compare for $1 \le t \le 6$, $h = 0.2, 0.1, 0.05$.

14. For the example (2.8), with the numerical results in Table 2.2, use Richardson’s extrapolation to estimate the error $Y(t_n) - y_h(t_n)$ when $h = 0.05$. Also, produce the Richardson extrapolate $\tilde y_h(t_n) = 2 y_h(t_n) - y_{2h}(t_n)$ and compute its error. Do this for $t_n = 1, 2, 3, 4, 5, 6$.

15. Repeat Problem 14 for Problems 1(a)–(d).

16. Use Taylor’s theorem to show the standard numerical differentiation method

$$Y'(t_{n+1}) = \frac{Y(t_{n+1}) - Y(t_n)}{h} + O(h).$$

Combine this with (2.36) to prove the error result (2.42).

CHAPTER 3

SYSTEMS OF DIFFERENTIAL EQUATIONS

Although some applications of differential equations involve only a single first-order equation, most applications involve a system of several such equations or higher-order equations. In this chapter, we consider systems of first-order equations, showing how Euler’s method applies to such systems. Numerical treatment of higher-order equations can be carried out by first converting them to equivalent systems of first-order equations.

To begin with a simple case, the general form of a system of two first-order differential equations is

$$\begin{aligned} Y_1'(t) &= f_1(t, Y_1(t), Y_2(t)), \\ Y_2'(t) &= f_2(t, Y_1(t), Y_2(t)). \end{aligned} \tag{3.1}$$

The functions $f_1(t, z_1, z_2)$ and $f_2(t, z_1, z_2)$ define the differential equations, and the unknown functions $Y_1(t)$ and $Y_2(t)$ are being sought. The initial value problem consists of solving (3.1), subject to the initial conditions

$$Y_1(t_0) = Y_{1,0}, \qquad Y_2(t_0) = Y_{2,0}. \tag{3.2}$$


Example 3.1

(a) The initial value problem

$$\begin{aligned} Y_1'(t) &= Y_1(t) - 2Y_2(t) + 4\cos(t) - 2\sin(t), & Y_1(0) &= 1, \\ Y_2'(t) &= 3Y_1(t) - 4Y_2(t) + 5\cos(t) - 5\sin(t), & Y_2(0) &= 2 \end{aligned} \tag{3.3}$$

has the solution

$$Y_1(t) = \cos(t) + \sin(t), \qquad Y_2(t) = 2\cos(t).$$

This example will be used later in a numerical example illustrating Euler’s method for systems.

(b) Consider the system

$$\begin{aligned} Y_1'(t) &= A\,Y_1(t)\,[1 - B\,Y_2(t)], & Y_1(0) &= Y_{1,0}, \\ Y_2'(t) &= C\,Y_2(t)\,[D\,Y_1(t) - 1], & Y_2(0) &= Y_{2,0} \end{aligned} \tag{3.4}$$

with constants $A, B, C, D > 0$. This is called the Lotka–Volterra predator–prey model. The variable $t$ denotes time, $Y_1(t)$ the number of prey (e.g., rabbits) at time $t$, and $Y_2(t)$ the number of predators (e.g., foxes). If there is only a single type of predator and a single type of prey, then this model is often a reasonable approximation of reality. The behavior of the solutions $Y_1$ and $Y_2$ is illustrated in Problem 8.

The initial value problem for a system of $m$ first-order differential equations has the general form

$$\begin{aligned} Y_1'(t) &= f_1(t, Y_1(t), \ldots, Y_m(t)), & Y_1(t_0) &= Y_{1,0}, \\ &\;\;\vdots \\ Y_m'(t) &= f_m(t, Y_1(t), \ldots, Y_m(t)), & Y_m(t_0) &= Y_{m,0}. \end{aligned} \tag{3.5}$$

We seek the functions $Y_1(t), \ldots, Y_m(t)$ on some interval $t_0 \le t \le b$. An example of a three-equation system is given later in (3.21).

The general form (3.5) is clumsy to work with, and it is not a convenient way to specify the system when using a computer program for its solution. To simplify the form of (3.5), represent the solution and the differential equations by using column vectors. Denote

$$\mathbf{Y}(t) = \begin{bmatrix} Y_1(t) \\ \vdots \\ Y_m(t) \end{bmatrix}, \qquad \mathbf{Y}_0 = \begin{bmatrix} Y_{1,0} \\ \vdots \\ Y_{m,0} \end{bmatrix}, \qquad \mathbf{f}(t, \mathbf{y}) = \begin{bmatrix} f_1(t, y_1, \ldots, y_m) \\ \vdots \\ f_m(t, y_1, \ldots, y_m) \end{bmatrix} \tag{3.6}$$

with $\mathbf{y} = [y_1, y_2, \ldots, y_m]^T$. Then (3.5) can be rewritten as

$$\mathbf{Y}'(t) = \mathbf{f}(t, \mathbf{Y}(t)), \qquad \mathbf{Y}(t_0) = \mathbf{Y}_0. \tag{3.7}$$

This resembles the earlier first-order single equation, but it is general as to the number of equations. Computer programs for solving systems will almost always refer to the system in this manner.

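Example 3.2 System (3.3) can be rewritten as

\[
\mathbf{Y}'(t) = A\mathbf{Y}(t) + \mathbf{G}(t), \qquad \mathbf{Y}(0) = \mathbf{Y}_0
\]

with

\[
\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}, \qquad
A = \begin{bmatrix} 1 & -2 \\ 3 & -4 \end{bmatrix}, \qquad
\mathbf{G}(t) = \begin{bmatrix} 4\cos(t) - 2\sin(t) \\ 5\cos(t) - 5\sin(t) \end{bmatrix}, \qquad
\mathbf{Y}_0 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.
\]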

In the notation of (3.6), we obtain

f(t,y) = Ay + G(t), y = [y1, y2]T.
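The matrix–vector form of Example 3.2 can be checked numerically. The sketch below builds f(t,y) = Ay + G(t) as a function handle and evaluates the residual Y'(t) − f(t,Y(t)) for the stated solution of (3.3) at a few sample times; the derivative dY is worked out by hand, and all names are illustrative.

```matlab
% Sketch: system (3.3) in the matrix-vector form of Example 3.2.
A = [1 -2; 3 -4];
G = @(t) [4*cos(t) - 2*sin(t); 5*cos(t) - 5*sin(t)];
f = @(t,y) A*y + G(t);                     % f(t,y) = A*y + G(t)

Y  = @(t) [cos(t) + sin(t); 2*cos(t)];     % stated true solution of (3.3)
dY = @(t) [-sin(t) + cos(t); -2*sin(t)];   % its derivative, computed by hand

% Residual Y'(t) - f(t,Y(t)) at a few sample times; it should vanish
% up to rounding error if Y really solves the system.
ts  = [0 0.5 2 7];
res = zeros(size(ts));
for k = 1:numel(ts)
    res(k) = norm(dY(ts(k)) - f(ts(k), Y(ts(k))), inf);
end
disp(res)
```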

The general theory in Chapter 1 for a single differential equation generalizes in an easy way to systems of first-order differential equations, once we have introduced appropriate notation and tools for (3.6). For example, the role of the partial derivative ∂f/∂y is replaced with the Jacobian matrix

\[
\mathbf{f}_{\mathbf{y}}(t,\mathbf{y}) = \left[ \frac{\partial f_i(t, y_1, \ldots, y_m)}{\partial y_j} \right]_{i,j=1}^{m}. \tag{3.8}
\]

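We replace the absolute value |·| with a vector norm. A convenient choice is the maximum norm:

\[
\|\mathbf{y}\|_\infty = \max_{1 \le i \le m} |y_i|, \qquad \mathbf{y} \in \mathbb{R}^m.
\]

With this, we can generalize the Lipschitz condition (2.19) to

\[
\|\mathbf{f}(t,\mathbf{y}) - \mathbf{f}(t,\mathbf{z})\|_\infty \le K \,\|\mathbf{y} - \mathbf{z}\|_\infty, \qquad \mathbf{y}, \mathbf{z} \in \mathbb{R}^m, \quad t_0 \le t \le b, \tag{3.9}
\]

where

\[
K = \max_{t_0 \le t \le b} \; \max_{1 \le i \le m} \; \sup_{\mathbf{y} \in \mathbb{R}^m} \; \sum_{j=1}^{m} \left| \frac{\partial f_i(t,\mathbf{y})}{\partial y_j} \right|.
\]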
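For a linear system f(t,y) = Ay + G(t), such as the one in Example 3.2, the partial derivatives ∂fi/∂yj are the constant entries of A, so the formula for K reduces to the largest absolute row sum of A. The following sketch (with illustrative variable names) computes this value two ways and checks (3.9) on one pair of vectors.

```matlab
% Sketch: Lipschitz constant K of (3.9) for f(t,y) = A*y + G(t) with the
% matrix A of Example 3.2. Here df_i/dy_j = A(i,j), independent of t and y.
A = [1 -2; 3 -4];

K_rowsum = max(sum(abs(A), 2));   % max over i of sum over j of |A(i,j)|
K_norm   = norm(A, inf);          % MATLAB's matrix infinity norm is the same

% Check (3.9) for one pair y, z. The G(t) terms cancel in f(t,y) - f(t,z),
% so only A*(y - z) remains.
y = [1; -1];  z = [0; 0];
lhs = norm(A*(y - z), inf);       % ||f(t,y) - f(t,z)||_inf
rhs = K_rowsum*norm(y - z, inf);  % K*||y - z||_inf
fprintf('K = %g, lhs = %g, rhs = %g\n', K_rowsum, lhs, rhs);
```

For this particular pair the two sides are equal, which shows that the row-sum value cannot be reduced for this matrix.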

3.1 HIGHER-ORDER DIFFERENTIAL EQUATIONS

In physics and engineering, the use of Newton’s second law of motion leads to systems of second-order differential equations, modeling some of the most important physical phenomena of nature. In addition, other applications also lead to higher-order equations. Higher-order equations can be studied either directly or through equivalent systems of first-order equations.


Figure 3.1 The schematic of pendulum (labels: mass m, length l, angle θ(t), centerline θ = 0, gravitational force mg)

As an example, consider the second-order equation

Y ′′(t) = f(t, Y (t), Y ′(t)), (3.10)

where f(t, y1, y2) is given. The initial value problem consists of solving (3.10) subject to the initial conditions

Y (t0) = Y0, Y ′(t0) = Y ′0 . (3.11)

To reformulate this as a system of first-order equations, denote

Y1(t) = Y (t), Y2(t) = Y ′(t).

Then Y1 and Y2 satisfy

\[
\begin{aligned}
Y_1'(t) &= Y_2(t), & Y_1(t_0) &= Y_0,\\
Y_2'(t) &= f(t, Y_1(t), Y_2(t)), & Y_2(t_0) &= Y_0'.
\end{aligned}
\tag{3.12}
\]

Also, starting from this system, it is straightforward to show that the solution Y1 of (3.12) will also have to satisfy (3.10) and (3.11), thus demonstrating the equivalence of the two formulations.

Example 3.3 Consider the pendulum shown in Figure 3.1, of mass m and length l. The motion of this pendulum about its centerline θ = 0 is modeled by a second-order differential equation derived from Newton’s second law of motion. If the pendulum is assumed to move back and forth with negligible friction at its vertex, then the motion is modeled fairly accurately by the equation

\[
m l \,\frac{d^2\theta}{dt^2} = -m g \sin(\theta(t)), \tag{3.13}
\]

where t is time and θ(t) is the angle between the vertical centerline and the pendulum. The description of the motion is completed by specifying the initial position θ(0) and initial angular velocity θ′(0). To convert this to a system of two first-order equations, we may write

Y1(t) = θ(t), Y2(t) = θ′(t).

Then (3.13) and the initial conditions can be rewritten as

\[
\begin{aligned}
Y_1'(t) &= Y_2(t), & Y_1(0) &= \theta(0),\\
Y_2'(t) &= -\frac{g}{l}\sin(Y_1(t)), & Y_2(0) &= \theta'(0).
\end{aligned}
\tag{3.14}
\]

This system is equivalent to the initial value problem for the original second-order equation (3.13).
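In code, the first-order form (3.14) amounts to a two-component right-hand side. The sketch below uses the values l = 1 and g = 32.2 from Problem 7; the initial angle π/4, the stepsize, and the variable names are assumptions made only for illustration, and just a single Euler step is shown.

```matlab
% Sketch: the pendulum system (3.14) as a vector function f(t,y), where
% y(1) = theta and y(2) = theta'. Values of g and l are those of Problem 7.
g = 32.2;  l = 1;
f = @(t,y) [y(2); -(g/l)*sin(y(1))];

y0 = [pi/4; 0];          % assumed initial angle and angular velocity
h  = 0.01;               % assumed stepsize
y1 = y0 + h*f(0, y0);    % one Euler step for the system
disp(y1')
```

Since the initial angular velocity is zero, the first step leaves the angle unchanged and only the velocity component moves.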

A general differential equation of order m can be written as

\[
\frac{d^m Y(t)}{dt^m} = f\!\left(t, Y(t), \frac{dY(t)}{dt}, \ldots, \frac{d^{m-1}Y(t)}{dt^{m-1}}\right), \tag{3.15}
\]

and the initial conditions needed to solve it are given by

\[
Y(t_0) = Y_0, \quad Y'(t_0) = Y_0', \quad \ldots, \quad Y^{(m-1)}(t_0) = Y_0^{(m-1)}. \tag{3.16}
\]

It is reformulated as a system of m first-order equations by introducing

\[
Y_1(t) = Y(t), \quad Y_2(t) = Y'(t), \quad \ldots, \quad Y_m(t) = Y^{(m-1)}(t).
\]

Then the equivalent initial value problem for a system of first-order equations is

\[
\begin{aligned}
Y_1'(t) &= Y_2(t), & Y_1(t_0) &= Y_0,\\
&\;\;\vdots & &\;\;\vdots\\
Y_{m-1}'(t) &= Y_m(t), & Y_{m-1}(t_0) &= Y_0^{(m-2)},\\
Y_m'(t) &= f(t, Y_1(t), \ldots, Y_m(t)), & Y_m(t_0) &= Y_0^{(m-1)}.
\end{aligned}
\tag{3.17}
\]

A special case of (3.15) is the order m linear differential equation

\[
\frac{d^m Y}{dt^m} = a_0(t)\,Y + a_1(t)\,\frac{dY}{dt} + \cdots + a_{m-1}(t)\,\frac{d^{m-1}Y}{dt^{m-1}} + b(t). \tag{3.18}
\]

This is reformulated as above, with

\[
Y_m' = a_0(t)\,Y_1 + a_1(t)\,Y_2 + \cdots + a_{m-1}(t)\,Y_m + b(t) \tag{3.19}
\]

replacing the last equation in (3.17).

Example 3.4 The initial value problem

\[
\begin{aligned}
&Y'''(t) + 3Y''(t) + 3Y'(t) + Y(t) = -4\sin(t),\\
&Y(0) = Y'(0) = 1, \quad Y''(0) = -1
\end{aligned}
\tag{3.20}
\]

is reformulated as

\[
\begin{aligned}
Y_1'(t) &= Y_2(t), & Y_1(0) &= 1,\\
Y_2'(t) &= Y_3(t), & Y_2(0) &= 1,\\
Y_3'(t) &= -Y_1(t) - 3Y_2(t) - 3Y_3(t) - 4\sin(t), & Y_3(0) &= -1.
\end{aligned}
\tag{3.21}
\]

The solution of (3.20) is Y(t) = cos(t) + sin(t), and the solution of (3.21) can be generated from it. This system will be solved numerically later in this chapter.
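The stated solution of Example 3.4 can be confirmed by direct substitution. The short derivation below differentiates Y(t) = cos(t) + sin(t) three times, checks (3.20), and reads off the solution components of (3.21).

```latex
\begin{aligned}
Y(t) &= \cos t + \sin t, \qquad
Y'(t) = \cos t - \sin t, \qquad
Y''(t) = -\cos t - \sin t, \qquad
Y'''(t) = \sin t - \cos t,\\[4pt]
Y''' + 3Y'' + 3Y' + Y
 &= (1 - 3 - 3 + 1)\sin t + (-1 - 3 + 3 + 1)\cos t = -4\sin t,\\[4pt]
Y(0) &= 1, \qquad Y'(0) = 1, \qquad Y''(0) = -1,\\[4pt]
Y_1(t) &= \cos t + \sin t, \qquad
Y_2(t) = \cos t - \sin t, \qquad
Y_3(t) = -\cos t - \sin t .
\end{aligned}
```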

3.2 NUMERICAL METHODS FOR SYSTEMS

Euler’s method and the numerical methods discussed in later chapters can be applied without change to the solution of systems of first-order differential equations. The numerical method should be applied to each equation in the system, or more simply, in a straightforward way to the system written in the matrix–vector format (3.7). The derivation of numerical methods for the solution of systems is essentially the same as is done for a single equation. The convergence and stability analyses are also done in the same manner.

To be more specific, we consider Euler’s method for the general system of two first-order equations that is given in (3.1). By following the derivation given for Euler’s method in obtaining (2.9), Taylor’s theorem gives

\[
\begin{aligned}
Y_1(t_{n+1}) &= Y_1(t_n) + h f_1(t_n, Y_1(t_n), Y_2(t_n)) + \frac{h^2}{2}\, Y_1''(\xi_n),\\
Y_2(t_{n+1}) &= Y_2(t_n) + h f_2(t_n, Y_1(t_n), Y_2(t_n)) + \frac{h^2}{2}\, Y_2''(\zeta_n)
\end{aligned}
\]

for some ξn, ζn in [tn, tn+1]. Dropping the error terms, we obtain Euler’s method for a system of two equations for n ≥ 0:

\[
\begin{aligned}
y_{1,n+1} &= y_{1,n} + h f_1(t_n, y_{1,n}, y_{2,n}),\\
y_{2,n+1} &= y_{2,n} + h f_2(t_n, y_{1,n}, y_{2,n}).
\end{aligned}
\tag{3.22}
\]


In matrix–vector format, this is

yn+1 = yn + hf(tn,yn), y0 = Y0. (3.23)

The convergence and stability theory of Euler’s method and of the other numerical methods also generalizes. The key is to use the matrix–vector notation introduced earlier in the chapter together with (3.8)–(3.9). This allows a straightforward imitation of the proofs given in earlier chapters for a single equation.

Let m = 2 as above, and consider Euler’s method (3.22) together with the exact initial values y1,0 = Y1,0, y2,0 = Y2,0. If Y1(t), Y2(t) are twice continuously differentiable, then it can be shown that

|Y1(tn) − y1,n| ≤ ch, |Y2(tn) − y2,n| ≤ ch

for all t0 ≤ tn ≤ b, for some constant c. In addition, the earlier asymptotic error formula (2.36) will still be valid; for j = 1, 2, we obtain

\[
Y_j(t_n) - y_{j,n} = D_j(t_n)\,h + O(h^2), \qquad t_0 \le t_n \le b.
\]

Thus Richardson’s extrapolation and error estimation formulas will still be valid. The functions D1(t), D2(t) satisfy a particular linear system of differential equations, but we omit it here. Stability results for Euler’s method generalize without any significant change. Thus in summary, the earlier work for Euler’s method generalizes without significant change to systems. The same is true of the other numerical methods given earlier, thus justifying our limitation to a single equation for introducing those methods.
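The first-order error behavior and Richardson's error estimate can be observed directly on system (3.3). The following is a minimal sketch, not the text's eulersys program: it applies the vector form (3.23) with stepsizes 2h = 0.1 and h = 0.05 up to t = 2 (values chosen to match the setting of Table 3.1), and all variable names are illustrative.

```matlab
% Sketch: Euler's method (3.23) on system (3.3) with two stepsizes, to
% observe the O(h) error and Richardson's error estimate at t = 2.
A = [1 -2; 3 -4];
G = @(t) [4*cos(t) - 2*sin(t); 5*cos(t) - 5*sin(t)];
f = @(t,y) A*y + G(t);
Ytrue = @(t) [cos(t) + sin(t); 2*cos(t)];

T  = 2;
hs = [0.1 0.05];                 % the stepsizes 2h and h
yT = zeros(2, 2);                % column k holds the Euler solution at t = T
for k = 1:2
    h = hs(k);
    N = round(T/h);              % number of steps needed to reach T
    y = [1; 2];                  % initial value Y0
    for n = 0:N-1
        y = y + h*f(n*h, y);     % y_{n+1} = y_n + h f(t_n, y_n)
    end
    yT(:,k) = y;
end

err2h = Ytrue(T) - yT(:,1);      % error with stepsize 2h
errh  = Ytrue(T) - yT(:,2);      % error with stepsize h
ratio = err2h ./ errh;           % close to 2 for a first-order method
estimate = yT(:,2) - yT(:,1);    % Richardson estimate of errh
disp([errh estimate ratio])
```

The printed rows (one per solution component) can be compared with the t = 2 entries of Table 3.1.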

MATLAB® program. The following is a MATLAB code eulersys implementing the Euler method to solve the initial value problem (3.7). It can be seen that the code eulersys is just a slight modification of the code euler for solving a single equation in Chapter 2. The program can automatically determine the number of equations in the system.

function [t,y] = eulersys(t0,y0,t_end,h,fcn)
%
% function [t,y]=eulersys(t0,y0,t_end,h,fcn)
%
% Solve the initial value problem of a system
% of first order equations
% y' = f(t,y), t0 <= t <= b, y(t0)=y0
% Use Euler's method with a stepsize of h.
% The user must supply a program to compute the
% right hand side function with some name, say
% deriv, and a first line of the form
% function z = deriv(t,y)
% A sample call would be
% [t,z]=eulersys(t0,z0,b,delta,'deriv')
%
% The program automatically determines the
% number of equations from the dimension of
% the initial value vector y0.
%
% Output:
% The routine eulersys will return a vector t
% and a matrix y. The vector t will contain the
% node points in [t0,t_end]:
% t(1)=t0, t(j)=t0+(j-1)*h, j=1,2,...,N
% The matrix y is of size N by m, with m the
% number of equations. The i-th row y(i,:) will
% contain the estimates of the solution Y
% at the node points in t(i).
%
m = length(y0);
% node points and their number
t = t0:h:t_end;
n = length(t);
y = zeros(n,m);
y(1,:) = y0;
for i = 2:n
  y(i,:) = y(i-1,:) + h*feval(fcn,t(i-1),y(i-1,:));
end

Table 3.1 Solution of (3.3) using Euler’s method

j   t    Yj(t)      Yj(t) − yj,2h(t)   Yj(t) − yj,h(t)   Ratio   yj,h(t) − yj,2h(t)
1   2    0.49315    −5.65e−2           −2.82e−2          2.0     −2.83e−2
    4    −1.41045   −5.64e−3           −2.72e−3          2.1     −2.92e−3
    6    0.68075    4.81e−2            2.36e−2           2.0     2.44e−2
    8    0.84386    −3.60e−2           −1.79e−2          2.0     −1.83e−2
    10   −1.38309   −1.81e−2           −8.87e−3          2.0     −9.40e−2
2   2    −0.83229   −3.36e−2           −1.70e−2          2.0     −1.66e−2
    4    −1.30729   5.94e−3            3.19e−3           1.9     2.75e−3
    6    1.92034    1.59e−2            7.69e−3           2.1     8.17e−3
    8    −0.29100   −2.08e−2           −1.05e−2          2.0     −1.03e−2
    10   −1.67814   1.26e−3            9.44e−4           1.3     3.11e−4


Example 3.5

(a) Solve (3.3) using Euler’s method. The numerical results are given in Table 3.1, along with Richardson’s error estimate

Yj(tn) − yj,h(tn) ≈ yj,h(tn) − yj,2h(tn), j = 1, 2.

In the table, h = 0.05, 2h = 0.1. It can be seen that this error estimate is quite accurate, except for the one case j = 2, t = 10. To get the numerical solution values and their errors at the specified node points t = 2, 4, 6, 8, 10, we used the following MATLAB commands, which can be included at the end of the program eulersys for this example.

n1 = (n-1)/5;
% errors at the node points t = 2, 4, 6, 8, 10
for i = n1+1:n1:n
  e(i,1) = cos(t(i))+sin(t(i))-y(i,1);
  e(i,2) = 2*cos(t(i))-y(i,2);
end
diary euler_sys1
fprintf(' h = %6.5f\n', h)
disp(' t y(1) e(1) y(2) e(2)')
for i = n1+1:n1:n
  fprintf('%2.0f%10.2e%10.2e%10.2e%10.2e\n', ...
          t(i), y(i,1),e(i,1),y(i,2),e(i,2))
end
diary off

The right-hand side function for this example is defined by the following.

function z = eulersys_fcn(t,y);
% right-hand side of system (3.3), returned as a row vector
z = zeros(1,2);
z(1) = y(1)-2*y(2)+4*cos(t)-2*sin(t);
z(2) = 3*y(1)-4*y(2)+5*cos(t)-5*sin(t);

(b) Solve the third-order equation in (3.20), using Euler’s method to solve the reformulated problem (3.21). The results for y(t) = Y1(t) = sin(t) + cos(t) are given in Table 3.2, for stepsizes 2h = 0.1 and h = 0.05. The Richardson error estimate is again quite accurate.

Other numerical methods apply to systems in the same straightforward manner. Also, by using the matrix form (3.7) for a system, there is no apparent change in the numerical method. For example, the Runge–Kutta method (5.20), given in Section 5.2 of Chapter 5, is

\[
y_{n+1} = y_n + \frac{h}{2}\,[\,f(t_n, y_n) + f(t_{n+1}, y_n + h f(t_n, y_n))\,], \qquad n \ge 0. \tag{3.24}
\]


Table 3.2 Solution of (3.20) using Euler’s method

t    y(t)       y(t) − y2h(t)   y(t) − yh(t)   Ratio   yh(t) − y2h(t)
2    0.49315    −8.78e−2        −4.25e−2       2.1     −4.53e−2
4    −1.41045   1.39e−1         6.86e−2        2.0     7.05e−2
6    0.68075    5.19e−2         2.49e−2        2.1     2.70e−2
8    0.84386    −1.56e−1        −7.56e−2       2.1     −7.99e−2
10   −1.38309   8.39e−2         4.14e−2        2.0     4.25e−2

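Interpret this for a system of two equations with

\[
\mathbf{y}_n = \begin{bmatrix} y_{1,n} \\ y_{2,n} \end{bmatrix}, \qquad
\mathbf{f}(t_n,\mathbf{y}_n) = \begin{bmatrix} f_1(t_n, y_{1,n}, y_{2,n}) \\ f_2(t_n, y_{1,n}, y_{2,n}) \end{bmatrix},
\]

\[
\mathbf{y}_{n+1} = \mathbf{y}_n + \tfrac{1}{2}h\,[\,\mathbf{f}(t_n,\mathbf{y}_n) + \mathbf{f}(t_{n+1},\mathbf{y}_n + h\,\mathbf{f}(t_n,\mathbf{y}_n))\,], \qquad n \ge 0. \tag{3.25}
\]

In component form, the method is

\[
\begin{aligned}
y_{j,n+1} = y_{j,n} + \tfrac{1}{2}h\,[\, & f_j(t_n, y_{1,n}, y_{2,n})\\
 & + f_j(t_{n+1},\; y_{1,n} + h f_1(t_n, y_{1,n}, y_{2,n}),\; y_{2,n} + h f_2(t_n, y_{1,n}, y_{2,n}))\,]
\end{aligned}
\tag{3.26}
\]

for j = 1, 2. The matrix–vector format (3.25) can be programmed very conveniently on a computer. We leave its illustration to the problems.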
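To indicate how little code the matrix–vector format (3.25) requires, here is a minimal sketch that applies it to system (3.3). The choice of test problem, the stepsize h = 0.05, the final time t = 2, and the names k1, k2 are assumptions made for illustration.

```matlab
% Sketch: the Runge-Kutta method (3.25) in matrix-vector form, applied
% to system (3.3) on 0 <= t <= 2 with stepsize h = 0.05.
A = [1 -2; 3 -4];
G = @(t) [4*cos(t) - 2*sin(t); 5*cos(t) - 5*sin(t)];
f = @(t,y) A*y + G(t);
Ytrue = @(t) [cos(t) + sin(t); 2*cos(t)];

h = 0.05;  T = 2;
N = round(T/h);
y = [1; 2];                       % initial value Y0
for n = 0:N-1
    tn = n*h;
    k1 = f(tn, y);                % f(t_n, y_n)
    k2 = f(tn + h, y + h*k1);     % f(t_{n+1}, y_n + h f(t_n, y_n))
    y  = y + (h/2)*(k1 + k2);     % formula (3.25)
end
err = Ytrue(T) - y;               % error vector at t = 2
disp(err')
```

The loop body is the same as it would be for a single equation; only the shapes of y, k1, and k2 change.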

PROBLEMS

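1. Let

\[
A = \begin{bmatrix} 1 & -2 \\ 2 & -1 \end{bmatrix}, \qquad
\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}, \qquad
\mathbf{G}(t) = \begin{bmatrix} -2e^{-t} + 2 \\ -2e^{-t} + 1 \end{bmatrix}, \qquad
\mathbf{Y}_0 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.
\]

Write out the two equations that make up the system

\[
\mathbf{Y}'(t) = A\mathbf{Y}(t) + \mathbf{G}(t), \qquad \mathbf{Y}(t_0) = \mathbf{Y}_0.
\]

The true solution is $\mathbf{Y}(t) = [e^{-t}, 1]^T$.

2. Express the system (3.21) in the general form of Problem 1, giving the matrix A.

3. Convert the following higher-order equations to systems of first-order equations.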


(a) $Y'''(t) + 4Y''(t) + 5Y'(t) + 2Y(t) = 2t^2 + 10t + 8$, $Y(0) = 1$, $Y'(0) = -1$, $Y''(0) = 3$.

The true solution is $Y(t) = e^{-t} + t^2$.

(b) $Y''(t) + 4Y'(t) + 13Y(t) = 40\cos(t)$, $Y(0) = 3$, $Y'(0) = 4$.

The true solution is $Y(t) = 3\cos(t) + \sin(t) + e^{-2t}\sin(3t)$.

4. Convert the following system of second-order equations to a larger system of first-order equations. This system arises from studying the gravitational attraction of one mass by another:

\[
x''(t) = \frac{-c\,x(t)}{r(t)^3}, \qquad y''(t) = \frac{-c\,y(t)}{r(t)^3}, \qquad z''(t) = \frac{-c\,z(t)}{r(t)^3}.
\]

Here c is a positive constant and $r(t) = [x(t)^2 + y(t)^2 + z(t)^2]^{1/2}$, with t denoting time.

5. Using Euler’s method, solve the system in Problem 1. Use stepsizes of h = 0.1, 0.05, 0.025, and solve for 0 ≤ t ≤ 10. Use Richardson’s error formula to estimate the error for h = 0.025.

6. Repeat Problem 5 for the systems in Problem 3.

7. Consider solving the pendulum equation (3.13) with l = 1 and g = 32.2 ft/s$^2$. For the initial values, choose 0 < θ(0) ≤ π/2, θ′(0) = 0. Use Euler’s method to solve (3.14), and experiment with various values of h so as to obtain a suitably small error in the computed solution. Graph t vs. θ(t), t vs. θ′(t), and θ(t) vs. θ′(t). Does the motion appear to be periodic in time?

8. Solve the Lotka–Volterra predator–prey model of (3.4) with the parameters A = 4, B = 1/2, C = 3, D = 1/3, and use eulersys to solve approximately this model for 0 ≤ t ≤ 5. Use stepsizes h = 0.001, 0.0005, 0.00025. Use the initial values x(0) = 3, y(0) = 5. Plot x and y as functions of t, and plot x versus y. Comment on your results. We return to this problem in later chapters when we have more efficient methods for its solution.

CHAPTER 4

THE BACKWARD EULER METHOD AND THE TRAPEZOIDAL METHOD

In Section 1.2 of Chapter 1, we discussed the stability property of the initial value problem (1.7). Roughly speaking, stability means that a small perturbation in the initial value of the problem leads to a small change in the solution. In Section 2.4 of Chapter 2, we showed that an analogous stability result was true for Euler’s method. In general, we want to work with numerical methods for solving the initial value problem that are numerically stable. This means that for any sufficiently small stepsize h, a small change in the initial value will lead to a small change in the numerical solution. Indeed, such a stability property is closely related to the convergence of the numerical method, a topic we discuss at length in Chapter 7. For another example of the relation between convergence and stability, we refer to Problem 16 for a numerical method that is neither convergent nor stable.

A stable numerical method is one for which the numerical solution is well behaved when considering small perturbations, provided that the stepsize h is sufficiently small. In actual computations, however, the stepsize h cannot be too small since a very small stepsize decreases the efficiency of the numerical method. As can be shown, the accuracy of the forward difference approximations, such as [Y(t + h) − Y(t)]/h to the derivative Y′(t), deteriorates when, roughly speaking, h is of the order of the square root of the machine epsilon. Hence, for actual computations, what matters is the performance of the numerical method when h is not assumed very small. We need to further analyze the stability of numerical methods when h is not assumed to be small.

Examining the stability question for the general problem

Y ′(t) = f(t, Y (t)), Y (t0) = Y0 (4.1)

is too complicated. Instead, we examine the stability of numerical methods for the model problem

Y′(t) = λY(t) + g(t), Y(0) = Y0 (4.2)

whose exact solution can be found from (1.5). Questions regarding stability and convergence are more easily answered for this problem, and the answers to these questions can be shown to usually be the answers to those same questions for the more general problem (4.1).

Let Y(t) be the solution of (4.2), and let $Y_\epsilon(t)$ be the solution with the perturbed initial data Y0 + ε:

\[
Y_\epsilon'(t) = \lambda Y_\epsilon(t) + g(t), \qquad Y_\epsilon(0) = Y_0 + \epsilon.
\]

Let $Z_\epsilon(t)$ denote the change in the solution

\[
Z_\epsilon(t) = Y_\epsilon(t) - Y(t).
\]

Then, subtracting (4.2) from the equation for $Y_\epsilon(t)$, we obtain

\[
Z_\epsilon'(t) = \lambda Z_\epsilon(t), \qquad Z_\epsilon(0) = \epsilon.
\]

The solution is $Z_\epsilon(t) = \epsilon e^{\lambda t}$.

Typically in applications, we are interested in the case that either λ is real and negative or λ is complex with a negative real part. In such a case, $Z_\epsilon(t)$ will go to zero as t → ∞ and, thus, the effect of the ε perturbation dies out for large values of t. (See a related discussion in Section 1.2 of Chapter 1.) We would like the same behavior to hold for the numerical method that is being applied to (4.2).

By considering the function $Z_\epsilon(t)/\epsilon$ instead of $Z_\epsilon(t)$, we obtain the following model problem that is generally used to test the performance of various numerical methods:

\[
Y' = \lambda Y, \quad t > 0, \qquad Y(0) = 1. \tag{4.3}
\]

In the following, when we refer to the model problem (4.3), we always assume that the constant λ < 0 or λ is complex and with Real(λ) < 0. The true solution of the problem (4.3) is

\[
Y(t) = e^{\lambda t}, \tag{4.4}
\]

which decays exponentially in t since the parameter λ has a negative real part.


The kind of stability property we would like for a numerical method is that when it is applied to (4.3), the numerical solution satisfies

yh(tn) → 0 as tn → ∞ (4.5)

for any choice of the stepsize h. The set of values hλ, considered as a subset of the complex plane, for which yn → 0 as n → ∞, is called the region of absolute stability of the numerical method. The use of hλ arises naturally from the numerical method, as we will see.

Let us examine the performance of the Euler method on the model problem (4.3).We have

yn+1 = yn + hλ yn = (1 + hλ) yn, n ≥ 0, y0 = 1.

By an inductive argument, it is not difficult to find

\[
y_n = (1 + h\lambda)^n, \qquad n \ge 0. \tag{4.6}
\]

Note that for a fixed node point tn = nh ≡ t, as n → ∞, we obtain

\[
y_n = \left(1 + \frac{\lambda t}{n}\right)^{n} \to e^{\lambda t}.
\]

The limiting behavior is obtained using L’Hospital’s rule from calculus. This confirms the convergence of the Euler method. We emphasize that this is an asymptotic property in the sense that it is valid in the limit as h → 0.
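For completeness, the L’Hospital step can be written out as follows. This is a sketch for real λ, using the substitution x = 1/n, which is introduced here only for the derivation.

```latex
\begin{aligned}
\ln y_n &= n \ln\!\left(1 + \frac{\lambda t}{n}\right)
         = \frac{\ln(1 + \lambda t\,x)}{x}, \qquad x = \frac{1}{n},\\[4pt]
\lim_{x \to 0^+} \frac{\ln(1 + \lambda t\,x)}{x}
        &= \lim_{x \to 0^+} \frac{\lambda t / (1 + \lambda t\,x)}{1}
         = \lambda t,\\[4pt]
\text{so}\quad \lim_{n \to \infty} y_n &= e^{\lambda t}.
\end{aligned}
```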

From formula (4.6), we see that yn → 0 as n → ∞ if and only if

|1 + hλ| < 1.

For λ real and negative, the condition becomes

−2 < hλ < 0. (4.7)

This sets a restriction on the range of h that we can take to apply Euler’s method, namely, 0 < h < −2/λ.
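For complex λ the same condition describes a disc in the hλ-plane. Writing hλ = x + iy (notation introduced only for this remark), the region of absolute stability of Euler’s method works out as follows.

```latex
\begin{aligned}
|1 + h\lambda| < 1
  \;&\Longleftrightarrow\; (1 + x)^2 + y^2 < 1, \qquad h\lambda = x + i y,\\[4pt]
  &\text{i.e., the open disc of radius } 1 \text{ centered at } h\lambda = -1 .\\[4pt]
\text{For } y = 0:\quad (1 + x)^2 < 1
  \;&\Longleftrightarrow\; -2 < x < 0, \quad \text{which is (4.7).}
\end{aligned}
```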

Example 4.1 Consider the model problem with λ = −100. Then the Euler method will perform well only when $h < 2 \times 100^{-1} = 0.02$. The true solution $Y(t) = e^{-100t}$ at t = 0.2 is $2.061 \times 10^{-9}$. Table 4.1 lists the Euler solution at t = 0.2 for several values of h.

4.1 THE BACKWARD EULER METHOD

Now we consider a numerical method that has the property (4.5) for any stepsize h when applied to the model problem (4.3). Such a method is said to be absolutely stable.


Table 4.1 Euler’s solution at t = 0.2 for Example 4.1

h       yh(0.2)
0.1     81
0.05    256
0.02    1
0.01    0
0.001   7.06e−10

In the derivation of the Euler method, we used the forward difference approximation

\[
Y'(t) \approx \frac{1}{h}\,[\,Y(t + h) - Y(t)\,].
\]

Let us use, instead, the backward difference approximation

\[
Y'(t) \approx \frac{1}{h}\,[\,Y(t) - Y(t - h)\,]. \tag{4.8}
\]

Then the differential equation Y′(t) = f(t, Y(t)) at t = tn is discretized as

\[
y_n = y_{n-1} + h f(t_n, y_n).
\]

Shifting the index by 1, we then obtain the backward Euler method

\[
\begin{cases}
y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}), & 0 \le n \le N - 1,\\
y_0 = Y_0.
\end{cases}
\tag{4.9}
\]

Like the Euler method, the backward Euler method is of first-order accuracy, and a convergence result similar to Theorem 2.4 holds. Also, an asymptotic error expansion of the form (2.36) is valid. The method of proof is a variation on that used for Euler’s method in Section 2.3 of Chapter 2.

Let us show that the backward Euler method has the desired property (4.5) on themodel problem (4.3). We have

\[
\begin{aligned}
y_{n+1} &= y_n + h\lambda\, y_{n+1},\\
y_{n+1} &= (1 - h\lambda)^{-1} y_n, \qquad n \ge 0.
\end{aligned}
\]

Using this together with y0 = 1, we obtain

\[
y_n = (1 - h\lambda)^{-n}. \tag{4.10}
\]

For any stepsize h > 0, we have |1 − hλ| > 1 and so yn → 0 as n → ∞.

Continuing with Example 4.1, in Table 4.2 we give numerical results for the backward Euler method. A comparison between Tables 4.1 and 4.2 reveals that the backward Euler method is substantially better than the Euler method on the model problem (4.3).
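The inequality |1 − hλ| > 1 used in the argument above can be checked directly. Writing λ = a + ib with a < 0 (notation introduced only for this remark), we have for every h > 0:

```latex
\begin{aligned}
|1 - h\lambda|^2 &= (1 - h a)^2 + (h b)^2 \;\ge\; (1 - h a)^2 \;>\; 1,
  \qquad \text{since } 1 - h a > 1 \text{ when } a < 0,\ h > 0,\\[4pt]
|y_n| &= |1 - h\lambda|^{-n} \to 0 \quad \text{as } n \to \infty .
\end{aligned}
```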


Table 4.2 Backward Euler solution at t = 0.2 for Example 4.1

h       yh(0.2)
0.1     8.26e−3
0.05    7.72e−4
0.02    1.69e−5
0.01    9.54e−7
0.001   5.27e−9
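The entries of Tables 4.1 and 4.2 follow directly from the closed forms (4.6) and (4.10). The sketch below evaluates both formulas for Example 4.1; the array names are illustrative.

```matlab
% Sketch: evaluate the closed forms (4.6) and (4.10) for Example 4.1
% (lambda = -100, t = 0.2) at the stepsizes used in Tables 4.1 and 4.2.
lambda = -100;
T  = 0.2;
hs = [0.1 0.05 0.02 0.01 0.001];

y_euler = zeros(size(hs));
y_back  = zeros(size(hs));
for k = 1:numel(hs)
    h = hs(k);
    n = round(T/h);                      % number of steps to reach t = 0.2
    y_euler(k) = (1 + h*lambda)^n;       % Euler, formula (4.6)
    y_back(k)  = (1 - h*lambda)^(-n);    % backward Euler, formula (4.10)
end

% One row per stepsize: h, Euler value, backward Euler value.
fprintf('%7.3f %12.3e %12.3e\n', [hs; y_euler; y_back]);
```

Only the stepsizes with |1 + hλ| < 1 give a decaying Euler solution, whereas the backward Euler values decay for every h.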

The major difference between the two methods is that for the backward Euler method, at each timestep, we need to solve a nonlinear algebraic equation

\[
y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}) \tag{4.11}
\]

for yn+1. Methods in which yn+1 must be found by solving a rootfinding problem are called implicit methods, since yn+1 is defined implicitly. In contrast, methods that give yn+1 directly are called explicit methods. Euler’s method is an explicit method, whereas the backward Euler method is an implicit method. Under the Lipschitz continuity assumption (2.19) on the function f(t, z), it can be shown that if h is small enough, the equation (4.11) has a unique solution.

Traditional rootfinding methods (e.g., Newton’s method, the secant method, the bisection method) can be applied to (4.11) to find its root yn+1; but often that is a very time-consuming process. Instead, (4.11) is usually solved by a simple iteration technique. Given an initial guess $y_{n+1}^{(0)} \approx y_{n+1}$, define $y_{n+1}^{(1)}$, $y_{n+1}^{(2)}$, etc., by

\[
y_{n+1}^{(j+1)} = y_n + h f(t_{n+1}, y_{n+1}^{(j)}), \qquad j = 0, 1, 2, \ldots. \tag{4.12}
\]

It can be shown that if h is sufficiently small, then the iterates $y_{n+1}^{(j)}$ will converge to yn+1 as j → ∞. Subtracting (4.12) from (4.11) gives us

\[
\begin{aligned}
y_{n+1} - y_{n+1}^{(j+1)} &= h\,[\,f(t_{n+1}, y_{n+1}) - f(t_{n+1}, y_{n+1}^{(j)})\,],\\
y_{n+1} - y_{n+1}^{(j+1)} &\approx h \cdot \frac{\partial f(t_{n+1}, y_{n+1})}{\partial y}\,[\,y_{n+1} - y_{n+1}^{(j)}\,].
\end{aligned}
\]

The last formula is obtained by applying the mean value theorem to f(tn+1, z), considered as a function of z. This formula gives a relation between the error in successive iterates. Therefore, if

\[
\left| h \cdot \frac{\partial f(t_{n+1}, y_{n+1})}{\partial y} \right| < 1, \tag{4.13}
\]

then the errors will converge to zero, as long as the initial guess $y_{n+1}^{(0)}$ is a sufficiently accurate approximation to yn+1.
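The role of condition (4.13) can be seen on the model problem, where ∂f/∂y = λ and the condition reads |hλ| < 1. The sketch below performs one backward Euler step from yn = 1 with λ = −100, using the Euler value as the initial guess; the two stepsizes, the iteration count of 20, and the variable names are assumptions chosen for illustration.

```matlab
% Sketch: the iteration (4.12) for one backward Euler step on the model
% problem y' = lambda*y, with h*|lambda| = 0.5 and h*|lambda| = 2.
lambda = -100;
f  = @(t,y) lambda*y;
yn = 1;  tn = 0;
hs = [0.005 0.02];

err = zeros(size(hs));
for k = 1:numel(hs)
    h = hs(k);
    exact = yn/(1 - h*lambda);        % exact root of (4.11) for this linear f
    z = yn + h*f(tn, yn);             % initial guess from an Euler step
    for j = 1:20
        z = yn + h*f(tn + h, z);      % fixed-point iteration (4.12)
    end
    err(k) = abs(z - exact);          % distance from the root after 20 iterates
end
fprintf('h = %5.3f: |iterate - root| = %9.2e\n', [hs; err]);
```

With h|λ| = 0.5 the error is halved at every iterate; with h|λ| = 2 it doubles, so the iteration moves away from the root even though the backward Euler solution itself is well defined.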


The preceding iteration method (4.12) and its analysis is a special case of the theory of fixed-point iteration for solving a nonlinear equation z = g(z). The iteration scheme is

\[
z_{j+1} = g(z_j), \qquad j = 0, 1, 2, \ldots \tag{4.14}
\]

with z0 an initial estimate of the solution being sought. Denote by α the solution we are seeking for the equation z = g(z). Assuming that g(z) is continuously differentiable in a neighborhood of α, we have that the iteration (4.14) will converge if

\[
|g'(\alpha)| < 1 \tag{4.15}
\]

and if the initial estimate z0 is chosen sufficiently close to α; see [11, §2.5], [12, §3.4], [68, §6.3]. Applying this notation to our iteration (4.12), α = yn+1 is the fixed point, and

\[
g(z) \equiv y_n + h f(t_{n+1}, z).
\]

The convergence condition (4.13) is simply the condition (4.15).

In practice, one uses a good initial guess $y_{n+1}^{(0)}$, and one chooses an h that is so small that the quantity in (4.13) is much less than 1. Then the error $y_{n+1} - y_{n+1}^{(j)}$ decreases rapidly to a small quantity as j increases, and often only one iterate needs to be computed. The usual choice of the initial guess $y_{n+1}^{(0)}$ for (4.12) is based on the Euler method

\[
y_{n+1}^{(0)} = y_n + h f(t_n, y_n). \tag{4.16}
\]

This is called a predictor formula, as it predicts the root of the implicit method.

For many equations, it is usually sufficient to do the iteration (4.12) once. Thus, a practical way to implement the backward Euler method is to do the following one-point iteration for solving (4.11) approximately:

\[
\begin{aligned}
\bar{y}_{n+1} &= y_n + h f(t_n, y_n),\\
y_{n+1} &= y_n + h f(t_{n+1}, \bar{y}_{n+1}).
\end{aligned}
\]

The resulting numerical method is then given by the formula

\[
y_{n+1} = y_n + h f(t_{n+1}, y_n + h f(t_n, y_n)). \tag{4.17}
\]

It can be shown that this method is still of first-order accuracy. However, it is no longer absolutely stable (see Problem 1).

MATLAB® program. We now turn to an implementation of the backward Euler method. At each step, with yn available from the previous step, we use the Euler method to compute an estimate of yn+1:

\[
y_{n+1}^{(1)} = y_n + h f(t_n, y_n).
\]

Then we carry out the iteration

\[
y_{n+1}^{(k+1)} = y_n + h f(t_{n+1}, y_{n+1}^{(k)})
\]

until the difference between successive values of the iterates is sufficiently small, indicating a sufficiently accurate approximation of the solution yn+1. To prevent an infinite loop of iteration, we require the iteration to stop if 10 iteration steps are taken without reaching a satisfactory solution; in this latter case, an error message will be displayed.

function [t,y] = euler_back(t0,y0,t_end,h,fcn,tol)
%
% function [t,y] = euler_back(t0,y0,t_end,h,fcn,tol)
%
% Solve the initial value problem
% y' = f(t,y), t0 <= t <= b, y(t0)=y0
% Use the backward Euler method with a stepsize of h.
% The user must supply an m-file to define the
% derivative f, with some name, say 'deriv.m', and a
% first line of the form
% function z = deriv(t,y)
% tol is the user supplied bound on the difference
% between successive values of the backward Euler
% iteration. A sample call would be
% [t,z]=euler_back(t0,z0,b,delta,'deriv',1.0e-3)
%
% Output:
% The routine euler_back will return two vectors,
% t and y. The vector t will contain the node points
% t(1)=t0, t(j)=t0+(j-1)*h, j=1,2,...,N
% with
% t(N) <= t_end, t(N)+h > t_end
% The vector y will contain the estimates of the
% solution Y at the node points in t.
%
% Initialize.
n = fix((t_end-t0)/h)+1;
t = linspace(t0,t0+(n-1)*h,n)';
y = zeros(n,1);
y(1) = y0;
i = 2;
% advancing
while i <= n
  %
  % forward Euler estimate
  %
  yt1 = y(i-1)+h*feval(fcn,t(i-1),y(i-1));
  % one-point iteration
  count = 0;
  diff = 1;
  while diff > tol && count < 10
    yt2 = y(i-1) + h*feval(fcn,t(i),yt1);
    diff = abs(yt2-yt1);
    yt1 = yt2;
    count = count +1;
  end
  if count >= 10
    disp('Not converging after 10 steps at t = ')
    fprintf('%5.2f\n', t(i))
  end
  y(i) = yt2;
  i = i+1;
end

4.2 THE TRAPEZOIDAL METHOD

One main drawback of both the Euler method and the backward Euler method is the low convergence order. Next we present a method that has a higher convergence order and in which, at the same time, the stability property (4.5) is valid for any stepsize h in solving the model problem (4.3).

We begin by introducing the trapezoidal rule for numerical integration:

\[
\int_a^b g(s)\,ds \approx \tfrac{1}{2}(b - a)\,[\,g(a) + g(b)\,]. \tag{4.18}
\]

This rule is illustrated in Figure 4.1. The graph of y = g(t) is approximated on [a, b] by the linear function y = p1(t) that interpolates g(t) at the endpoints of [a, b]. The integral of g(t) over [a, b] is then approximated by the integral of p1(t) over [a, b]. By using various approaches, we can obtain the more complete result

\[
\int_a^b g(s)\,ds = \tfrac{1}{2}(b - a)\,[\,g(a) + g(b)\,] - \tfrac{1}{12}(b - a)^3 g''(\xi) \tag{4.19}
\]

for some a ≤ ξ ≤ b.

We integrate the differential equation

\[
Y'(t) = f(t, Y(t))
\]

from tn to tn+1:

\[
Y(t_{n+1}) = Y(t_n) + \int_{t_n}^{t_{n+1}} f(s, Y(s))\,ds. \tag{4.20}
\]

Figure 4.1 Illustration of trapezoidal rule (axes t and y; curves y = g(t) and y = p1(t) on the interval [a, b])

Use the trapezoidal rule (4.18) to approximate the integral. Applying (4.19) to this integral, we obtain

\[
Y(t_{n+1}) = Y(t_n) + \tfrac{1}{2}h\,[\,f(t_n, Y(t_n)) + f(t_{n+1}, Y(t_{n+1}))\,] - \tfrac{1}{12}h^3\, Y^{(3)}(\xi_n) \tag{4.21}
\]

for some tn ≤ ξn ≤ tn+1. By dropping the final error term and then equating both sides, we obtain the trapezoidal method for solving the initial value problem (1.7):

\[
y_{n+1} = y_n + \tfrac{1}{2}h\,[\,f(t_n, y_n) + f(t_{n+1}, y_{n+1})\,], \qquad n \ge 0, \tag{4.22}
\]

with y0 = Y0.

The truncation error for the trapezoidal method is

\[
T_{n+1} = -\tfrac{1}{12}h^3\, Y^{(3)}(\xi_n). \tag{4.23}
\]

It can be shown that the trapezoidal method is of second-order accuracy. Assuming y0 = Y0, we can show

\[
\max_{t_0 \le t_n \le b} |Y(t_n) - y_h(t_n)| \le c\,h^2
\]

for all sufficiently small h, with c independent of h. The method of proof is a variation of that used for Euler’s method in Chapter 2. In addition, the trapezoidal method is absolutely stable. This higher order and its absolute stability has made the trapezoidal method an important tool when solving partial differential equations of parabolic type; see Section 8.1 in Chapter 8.
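The absolute stability claim can be checked on the model problem (4.3) in the same way as for the backward Euler method. The following short derivation applies (4.22) with f(t, y) = λy; the notation hλ = x + iy is introduced only for this remark.

```latex
\begin{aligned}
y_{n+1} &= y_n + \tfrac{1}{2}h\lambda\,(y_n + y_{n+1})
\quad\Longrightarrow\quad
y_{n+1} = \frac{1 + \tfrac{1}{2}h\lambda}{1 - \tfrac{1}{2}h\lambda}\; y_n,
\qquad
y_n = \left( \frac{1 + \tfrac{1}{2}h\lambda}{1 - \tfrac{1}{2}h\lambda} \right)^{\!n},\\[6pt]
\left| \frac{1 + \tfrac{1}{2}h\lambda}{1 - \tfrac{1}{2}h\lambda} \right|^2
 &= \frac{(1 + \tfrac{x}{2})^2 + \tfrac{y^2}{4}}{(1 - \tfrac{x}{2})^2 + \tfrac{y^2}{4}} < 1
\quad\Longleftrightarrow\quad x < 0, \qquad h\lambda = x + i y .
\end{aligned}
```

Hence yn → 0 as n → ∞ for every h > 0 whenever Real(λ) < 0.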

Notice that the trapezoidal method is an implicit method. In a general step, yn+1 is found from the equation

\[
y_{n+1} = y_n + \frac{h}{2}\,[\,f(t_n, y_n) + f(t_{n+1}, y_{n+1})\,], \tag{4.24}
\]

although this equation can be solved explicitly in only a relatively small number of cases. The discussion for the solution of the backward Euler equation (4.11) applies to the solution of the equation (4.24), with a slight variation. The iteration formula (4.12) is now replaced by

\[
y_{n+1}^{(j+1)} = y_n + \frac{h}{2}\,[\,f(t_n, y_n) + f(t_{n+1}, y_{n+1}^{(j)})\,], \qquad j = 0, 1, 2, \ldots. \tag{4.25}
\]

If $y_{n+1}^{(0)}$ is a sufficiently good estimate of yn+1 and if h is sufficiently small, then the iterates $y_{n+1}^{(j)}$ will converge to yn+1 as j → ∞. The convergence condition (4.13) is replaced by

\[
\left| \frac{h}{2} \cdot \frac{\partial f(t_{n+1}, y_{n+1})}{\partial y} \right| < 1. \tag{4.26}
\]

Note that the condition (4.26) is somewhat easier to satisfy than (4.13), indicating that the trapezoidal method is slightly easier to use than the backward Euler method.

The usual choice of the initial guess $y_{n+1}^{(0)}$ for (4.25) is based on the Euler method

\[
y_{n+1}^{(0)} = y_n + h f(t_n, y_n), \tag{4.27}
\]

or an Adams–Bashforth method of order 2 (see Chapter 6)

\[
y_{n+1}^{(0)} = y_n + \frac{h}{2}\,[\,3f(t_n, y_n) - f(t_{n-1}, y_{n-1})\,]. \tag{4.28}
\]

These are called predictor formulas. In either of these two cases for generating $y_{n+1}^{(0)}$, compute $y_{n+1}^{(1)}$ from (4.25) and accept it as the root yn+1. In the first step (n = 0), we use the Euler predictor formula rather than the predictor (4.28). With both methods of choosing $y_{n+1}^{(0)}$, it can be shown that the global error in the resulting solution {yh(tn)} is still $O(h^2)$. If the Euler predictor (4.27) is used to define $y_{n+1}^{(0)}$, and if we accept $y_{n+1}^{(1)}$ as the value of yn+1, then the resulting new scheme is

\[
y_{n+1} = y_n + \frac{h}{2}\,[\,f(t_n, y_n) + f(t_{n+1}, y_n + h f(t_n, y_n))\,], \tag{4.29}
\]

known as Heun’s method. The Heun method is still of second-order accuracy. However, it is no longer absolutely stable.


MATLAB program. In our implementation of the trapezoidal method, at each step, with yn available from the previous step, we use the Euler method to compute an estimate of yn+1:

\[
y_{n+1}^{(0)} = y_n + h f(t_n, y_n).
\]

Then we use the trapezoidal formula to do the iteration

\[
y_{n+1}^{(k+1)} = y_n + \frac{h}{2}\,[\,f(t_n, y_n) + f(t_{n+1}, y_{n+1}^{(k)})\,]
\]

until the difference between successive values of the iterates is sufficiently small, indicating a sufficiently accurate approximation of the solution yn+1. To prevent an infinite loop of iteration, we require the iteration to stop if 10 iteration steps are taken without reaching a satisfactory solution; and in this latter case, an error message will be displayed.

function [t,y] = trapezoidal(t0,y0,t_end,h,fcn,tol)
%
% function [t,y] = trapezoidal(t0,y0,t_end,h,fcn,tol)
%
% Solve the initial value problem
% y' = f(t,y), t0 <= t <= b, y(t0)=y0
% Use trapezoidal method with a stepsize of h. The
% user must supply an m-file to define the derivative
% f, with some name, say 'deriv.m', and a first line
% of the form
% function z = deriv(t,y)
% tol is the user supplied bound on the difference
% between successive values of the trapezoidal
% iteration. A sample call would be
% [t,z]=trapezoidal(t0,z0,b,delta,'deriv',1e-3)
%
% Output:
% The routine trapezoidal will return two vectors,
% t and y. The vector t will contain the node points
% t(1) = t0, t(j) = t0+(j-1)*h, j=1,2,...,N
% with
% t(N) <= t_end, t(N)+h > t_end
% The vector y will contain the estimates of the
% solution Y at the node points in t.
%
% Initialize.
n = fix((t_end-t0)/h)+1;
t = linspace(t0,t0+(n-1)*h,n)';
y = zeros(n,1);
y(1) = y0;
i = 2;
% advancing
while i <= n
  fyt = feval(fcn,t(i-1),y(i-1));
  %
  % Euler estimate
  %
  yt1 = y(i-1)+h*fyt;
  % trapezoidal iteration
  count = 0;
  diff = 1;
  while diff > tol && count < 10
    yt2 = y(i-1) + h*(fyt+feval(fcn,t(i),yt1))/2;
    diff = abs(yt2-yt1);
    yt1 = yt2;
    count = count +1;
  end
  if count >= 10
    disp('Not converging after 10 steps at t = ')
    fprintf('%5.2f\n', t(i))
  end
  y(i) = yt2;
  i = i+1;
end

Example 4.2 Consider the problem

\[
Y'(t) = \lambda Y(t) + (1 - \lambda)\cos(t) - (1 + \lambda)\sin(t), \qquad Y(0) = 1, \tag{4.30}
\]

whose true solution is Y(t) = sin(t) + cos(t). Euler’s method is used for the numerical solution, and the results for several values of λ and h are given in Table 4.3. Note that according to the formula (2.10) for the truncation error, we obtain

\[
T_{n+1} = \tfrac{1}{2}h^2\, Y''(\xi_n).
\]

The solution Y(t) does not depend on λ. But the actual global error depends strongly on λ, as illustrated in the table; and the behavior of the global error is directly linked to the size of λh and, thus, to the size of the stability region for Euler’s method. The error is small, provided that |λ|h is sufficiently small. The cases of an unstable and rapid growth in the error are exactly the cases in which |λ|h is outside the range (4.7). We then apply the backward Euler method and the trapezoidal method to the solution of the problem (4.30). The results are shown in Tables 4.4 and 4.5, with the stepsize h = 0.5. The error varies with λ, but there are no stability problems, in contrast to the Euler method. The solutions of the backward Euler method and the trapezoidal method for yn+1 were done exactly. This is possible because the differential equation is linear in Y. The fixed-point iterations (4.12) and (4.25) do not converge when |λ|h is large.
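Because (4.30) is linear in Y, the implicit equations (4.11) and (4.24) can be solved for yn+1 in closed form at each step. The sketch below does this for λ = −1 and h = 0.5 up to t = 2; the rearranged update formulas are derived here from (4.11) and (4.24) and are not taken from the text's programs, and the names g, yb, yt are illustrative.

```matlab
% Sketch: backward Euler and trapezoidal methods for (4.30), solving the
% implicit equation for y_{n+1} exactly at each step (possible because
% the equation is linear in Y). Case lambda = -1, h = 0.5, final t = 2.
lambda = -1;  h = 0.5;  T = 2;
g = @(t) (1 - lambda)*cos(t) - (1 + lambda)*sin(t);   % forcing term in (4.30)
Ytrue = @(t) sin(t) + cos(t);

N  = round(T/h);
yb = 1;                  % backward Euler solution, y_0 = Y(0) = 1
yt = 1;                  % trapezoidal solution,    y_0 = Y(0) = 1
for n = 0:N-1
    ta = n*h;  tb = (n+1)*h;
    % (4.11): y_{n+1} = y_n + h*(lambda*y_{n+1} + g(t_{n+1}))
    yb = (yb + h*g(tb)) / (1 - h*lambda);
    % (4.24): y_{n+1} = y_n + (h/2)*(lambda*y_n + g(t_n) + lambda*y_{n+1} + g(t_{n+1}))
    yt = ((1 + h*lambda/2)*yt + (h/2)*(g(ta) + g(tb))) / (1 - h*lambda/2);
end
err_back = Ytrue(T) - yb;
err_trap = Ytrue(T) - yt;
fprintf('t = %g: backward Euler error %9.2e, trapezoidal error %9.2e\n', ...
        T, err_back, err_trap);
```

The two errors can be compared with the λ = −1, t = 2 entries of Tables 4.4 and 4.5; changing lambda to −10 or −50 leaves both updates well defined for this stepsize.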



Table 4.3 Euler's method solution for (4.30)

  λ      t    Error (h = 0.5)   Error (h = 0.1)   Error (h = 0.01)
  -1     1    -2.46e-1          -4.32e-2          -4.22e-3
         2    -2.55e-1          -4.64e-2          -4.55e-3
         3    -2.66e-2          -6.78e-3          -7.22e-4
         4     2.27e-1           3.91e-2           3.78e-3
         5     2.72e-1           4.91e-2           4.81e-3
  -10    1     3.98e-1          -6.99e-3          -6.99e-4
         2     6.90e+0          -2.90e-3          -3.08e-4
         3     1.11e+2           3.86e-3           3.64e-4
         4     1.77e+3           7.07e-3           7.04e-4
         5     2.83e+4           3.78e-3           3.97e-4
  -50    1     3.26e+0           1.06e+3          -1.39e-4
         2     1.88e+3           1.11e+9          -5.16e-5
         3     1.08e+6           1.17e+15          8.25e-5
         4     6.24e+8           1.23e+21          1.41e-4
         5     3.59e+11          1.28e+27          7.00e-5

Table 4.4 Backward Euler solution for (4.30); h = 0.5

  t     Error (λ = -1)   Error (λ = -10)   Error (λ = -50)
  2      2.08e-1          1.97e-2           3.60e-3
  4     -1.63e-1         -3.35e-2          -6.94e-3
  6     -7.04e-2          8.19e-3           2.18e-3
  8      2.22e-1          2.67e-2           5.13e-3
  10    -1.14e-1         -3.04e-2          -6.45e-3

Equations with $\lambda$ negative but large in magnitude are examples of stiff differential equations. Their truncation error may be satisfactorily small for a value of $h$ that is not too small, but the large size of $|\lambda|$ may force $h$ to be much smaller in order that $\lambda h$ lies in the stability region. The backward Euler method and the trapezoidal method are therefore very desirable because their stability regions contain all $\lambda h$ where $\lambda$ is negative or $\lambda$ is complex with negative real part. For stiff differential equations, one must use a numerical method with a large region of absolute stability, or else $h$ must be chosen very small. The backward Euler method is preferred to the trapezoidal method when solving very stiff differential equations (see Problems 14, 15), although it is of lower order. There are other methods, of higher order, for approximating stiff differential equations (see [44], [72, Chap. 8]); this is an active area of research. More extensive discussions on numerically solving stiff differential equations can be found later in Chapters 8 and 9.

Table 4.5 Trapezoidal solution for (4.30); h = 0.5

  t     Error (λ = -1)   Error (λ = -10)   Error (λ = -50)
  2     -1.13e-2         -2.78e-3          -7.91e-4
  4     -1.43e-2         -8.91e-5          -8.91e-5
  6      2.02e-2          2.77e-3           4.72e-4
  8     -2.86e-3         -2.22e-3          -5.11e-4
  10    -1.79e-2         -9.23e-4          -1.56e-4

PROBLEMS

1. Show that the method defined by formula (4.17) is not absolutely stable.

2. Show that the trapezoidal method (4.22) is absolutely stable, but the scheme (4.29) is not.

3. Use backward Euler’s method to solve Problem 3 of Chapter 2.

4. Use the trapezoidal method to solve Problem 3 of Chapter 2.

5. Apply the backward Euler method to solve the initial value problem in Problem 11 of Chapter 2 for $\alpha = 2.5, 1.5, 1.1$, with $h = 0.2, 0.1, 0.05$. Compute the error in the solution at the nodes, determine the convergence orders numerically, and compare the results with those obtained by Euler's method.

6. Apply the trapezoidal method to solve the initial value problem in Problem 11 of Chapter 2 for $\alpha = 2.5, 1.5, 1.1$, with $h = 0.2, 0.1, 0.05$. Compute the error in the solution at the nodes, determine the convergence orders numerically, and compare the results with those of the Euler method and the backward Euler method.

7. Solve the equation

\[ Y'(t) = \lambda Y(t) + \frac{1}{1 + t^2} - \lambda \tan^{-1}(t), \quad Y(0) = 0; \]

$Y(t) = \tan^{-1}(t)$ is the true solution. Use Euler's method, the backward Euler method, and the trapezoidal method. Let $\lambda = -1, -10, -50$, and $h = 0.5, 0.1, 0.001$. Discuss the results. In implementing the backward Euler method and the trapezoidal method, note that the implicit equation for $y_{n+1}$ can be solved explicitly without iteration.

8. Apply the backward Euler method to the numerical solution of $Y'(t) = \lambda Y(t) + g(t)$ with $\lambda < 0$ and large in magnitude. Investigate how small $h$ must be chosen for the iteration

\[ y_{n+1}^{(j+1)} = y_n + h f\bigl(t_{n+1}, y_{n+1}^{(j)}\bigr), \quad j = 0, 1, 2, \ldots \]

to converge to $y_{n+1}$. Is this iteration practical for very large values of $|\lambda|$?

9. Repeat Problem 5 of Chapter 3 using the backward Euler method.

10. Determine whether the midpoint method

\[ y_{n+1} = y_n + h f\bigl(t_{n+1/2}, \tfrac{1}{2}(y_n + y_{n+1})\bigr), \]

where $t_{n+1/2} = (t_n + t_{n+1})/2$, is absolutely stable.

11. Let $\theta \in [0, 1]$ be a constant, and denote $t_{n+\theta} = (1 - \theta)\, t_n + \theta\, t_{n+1}$. Consider the generalized midpoint method

\[ y_{n+1} = y_n + h f\bigl(t_{n+\theta}, (1 - \theta)\, y_n + \theta\, y_{n+1}\bigr) \]

and its trapezoidal analog

\[ y_{n+1} = y_n + h \left[(1 - \theta) f(t_n, y_n) + \theta f(t_{n+1}, y_{n+1})\right]. \]

Show that the methods are absolutely stable when $\theta \in [1/2, 1]$. Determine the regions of absolute stability of the methods when $0 \le \theta < \tfrac{1}{2}$.

12. As a special case in which the error of the backward Euler method can be analyzed directly, we consider the model problem (4.3) again, with $\lambda$ an arbitrary real constant. The backward Euler solution of the problem is given by the formula (4.10). Following the procedure for solving Problem 4(c) in Chapter 2, show that

\[ Y(t_n) - y_h(t_n) = -\frac{\lambda^2 t_n e^{\lambda t_n}}{2}\, h + O(h^2). \]

13. Let $Y(t)$ be the solution, if it exists, to the initial value problem (1.7). By integrating, show that $Y$ satisfies

\[ Y(t) = Y_0 + \int_{t_0}^{t} f(s, Y(s))\, ds. \]

Conversely, show that if this equation has a continuous solution on the interval $t_0 \le t \le b$, then the initial value problem (1.7) has the same solution.


14. As in the previous problems, consider the model problem (4.3) with a real constant $\lambda < 0$. Show that the solution of the trapezoidal method is

\[ y_h(t_n) = \left( \frac{1 + \tfrac{1}{2}\lambda h}{1 - \tfrac{1}{2}\lambda h} \right)^{n}, \quad n \ge 0. \]

Rewrite the solution formula as

\[ y_h(t_n) = \exp\left( \frac{\log(1 + \tfrac{1}{2}\lambda h) - \log(1 - \tfrac{1}{2}\lambda h)}{h}\, t_n \right), \]

and use Taylor polynomial expansions of $\log(1 \pm u)$ about $u = 0$ to show that

\[ Y(t_n) - y_h(t_n) = -\tfrac{1}{12} h^2 \lambda^3 t_n e^{\lambda t_n} + O(h^4). \]

So for $h$ small, the error is almost proportional to $h^2$.

15. Use the formula (4.10) for the backward Euler method and the formula from Problem 14 for the trapezoidal method to show that the backward Euler method performs better than the trapezoidal method for problem (4.3) when $\lambda$ is negative and very large in magnitude.

16. In this exercise, we consider a method with third-order truncation errors, which is not convergent or stable.

(a) Given $Y(t)$ 3 times continuously differentiable, show that

\[ Y(t_{n+1}) = 3Y(t_n) - 2Y(t_{n-1}) + \tfrac{1}{2} h \left[ Y'(t_n) - 3Y'(t_{n-1}) \right] + \tfrac{7}{12} h^3 Y'''(t_n) + O(h^4). \tag{4.31} \]

Thus a numerical method for solving the differential equation

\[ Y'(t) = f(t, Y(t)) \]

is

\[ y_{n+1} = 3y_n - 2y_{n-1} + \tfrac{1}{2} h \left[ f(t_n, y_n) - 3f(t_{n-1}, y_{n-1}) \right], \quad n \ge 1. \]

This is a numerical method whose truncation error is $O(h^3)$. It is an example of a multistep method (see Chapter 6). To use the method, we need a value for $y_1$, called an artificial initial value, in addition to the initial value $y_0 = Y_0$.

Hint: To prove (4.31), use a quadratic Taylor expansion about the point $t_n$ for $Y(t)$, including an error term $R_3(t)$. Use this to evaluate $Y(t_{n-1})$ and $Y(t_{n+1})$, along with $Y'(t_{n-1})$. Substitute into

\[ Y(t_{n+1}) - \left\{ 3Y(t_n) - 2Y(t_{n-1}) + \tfrac{1}{2} h \left[ Y'(t_n) - 3Y'(t_{n-1}) \right] \right\} \]

to obtain the final term in (4.31).

(b) Now apply the method to solve the very simple initial value problem

\[ Y'(t) \equiv 0, \quad Y(0) = 1, \]

whose solution is $Y(t) \equiv 1$. Show that if the initial values are chosen to be $y_0 = 1$, $y_1 = 1 + h$, then the numerical solution is $y_n = 1 - h + h\, 2^n$. Note that $|y_1 - Y(h)| = h \to 0$ as $h \to 0$. Let $t_n = 1$. Show that $|Y(1) - y_n| \to \infty$ as $h \to 0$. Thus, the method is not convergent.

(c) A slight variant of the arguments of (b) can be used to show the instability of the method. Show that with the initial values $y_0 = y_1 = 1$, the numerical solution is $y_n = 1$ for all $n$, while if the initial values are perturbed to $y_{\epsilon,0} = 1$, $y_{\epsilon,1} = 1 + \epsilon$, then the numerical solution becomes $y_{\epsilon,n} = 1 - \epsilon + \epsilon\, 2^n$. Show that at any fixed node point $t_n = t > 0$, $|y_{\epsilon,n} - y_n| \to \infty$ as $h \to 0$. Hence, the method is unstable.

CHAPTER 5

TAYLOR AND RUNGE–KUTTA METHODS

To improve on the speed of convergence of Euler's method, we look for approximations to $Y(t_{n+1})$ that are more accurate than the approximation

\[ Y(t_{n+1}) \approx Y(t_n) + hY'(t_n), \]

which led to Euler's method. Since this is a linear Taylor polynomial approximation, it is natural to consider higher-order Taylor approximations. Doing this will lead to a family of methods, called the Taylor methods, depending on the order of the Taylor approximation being used.

In deriving a Taylor method, we need higher-order derivatives of the true solution, and we obtain them from the solution itself by differentiating the differential equation. Such expressions for higher-order derivatives are usually time-consuming to derive and to evaluate. The idea of Runge–Kutta methods is to use combinations of compositions of the right-side function of the equation to approximate the derivative terms to a required order. The resulting Runge–Kutta methods are among the most popular methods for solving initial value problems.


5.1 TAYLOR METHODS

To keep the initial explanations as intuitive as possible, we will develop a Taylor method for the problem

\[ Y'(t) = -Y(t) + 2\cos(t), \quad Y(0) = 1, \tag{5.1} \]

whose true solution is $Y(t) = \sin(t) + \cos(t)$. To approximate $Y(t_{n+1})$ by using information about $Y$ at $t_n$, use the quadratic Taylor approximation

\[ Y(t_{n+1}) \approx Y(t_n) + hY'(t_n) + \tfrac{1}{2} h^2 Y''(t_n). \tag{5.2} \]

Its truncation error is

\[ T_{n+1}(Y) = \tfrac{1}{6} h^3 Y'''(\xi_n), \quad \text{some } t_n \le \xi_n \le t_{n+1}. \tag{5.3} \]

To evaluate the right side of (5.2), we can obtain $Y'(t_n)$ directly from (5.1). For $Y''(t)$, differentiate (5.1) to get

\[ Y''(t) = -Y'(t) - 2\sin(t) = Y(t) - 2\cos(t) - 2\sin(t). \]

Then (5.2) becomes

\[ Y(t_{n+1}) \approx Y(t_n) + h\left[-Y(t_n) + 2\cos(t_n)\right] + \tfrac{1}{2} h^2 \left[Y(t_n) - 2\cos(t_n) - 2\sin(t_n)\right]. \]

By forcing equality, we are led to the numerical method

\[ y_{n+1} = y_n + h\left[-y_n + 2\cos(t_n)\right] + \tfrac{1}{2} h^2 \left[y_n - 2\cos(t_n) - 2\sin(t_n)\right], \quad n \ge 0 \tag{5.4} \]

with $y_0 = 1$. This should approximate the solution of the problem (5.1). Because the truncation error (5.3) contains a higher power of $h$ than was true for Euler's method [see (2.10)], we expect the method (5.4) to converge more rapidly.

Table 5.1 contains numerical results for (5.4) and for Euler's method, and it is clear that (5.4) is superior. In addition, if the results for stepsizes $h = 0.1$ and $0.05$ are compared, it can be seen that the errors decrease by a factor of approximately 4 when $h$ is halved. This can be justified theoretically, as is discussed later.
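The recursion (5.4) is short enough to try directly. The following is a minimal Python sketch (the book's own codes are in MATLAB); the names `taylor2` and `taylor2_error` are assumptions introduced for this illustration.

```python
import math

def taylor2(h, t_end):
    """Apply the second-order Taylor method (5.4) to problem (5.1) on [0, t_end]."""
    n_steps = round(t_end / h)
    y = 1.0  # y_0 = Y(0) = 1
    for n in range(n_steps):
        t = n * h
        d1 = -y + 2.0 * math.cos(t)                       # y'_n, from (5.1)
        d2 = y - 2.0 * math.cos(t) - 2.0 * math.sin(t)    # y''_n, from differentiating (5.1)
        y = y + h * d1 + 0.5 * h * h * d2
    return y

def taylor2_error(h, t_end):
    """Global error Y(t_end) - y_h(t_end), with Y(t) = sin(t) + cos(t)."""
    return math.sin(t_end) + math.cos(t_end) - taylor2(h, t_end)

if __name__ == "__main__":
    e1 = taylor2_error(0.1, 10.0)
    e2 = taylor2_error(0.05, 10.0)
    print(e1, e2, e1 / e2)  # the ratio should be near 4 for a second-order method
```

Halving $h$ and comparing the two errors gives a quick empirical check of the second-order behavior described above.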

Table 5.1 Example of second-order Taylor method (5.4)

  h      t      y_h(t)          Error       Euler Error
  0.1    2.0     0.492225829     9.25e-4    -4.64e-2
         4.0    -1.411659477     1.21e-3     3.91e-2
         6.0     0.682420081    -1.67e-3     1.39e-2
         8.0     0.843648978     2.09e-4    -5.07e-2
         10.0   -1.384588757     1.50e-3     2.83e-2
  0.05   2.0     0.492919943     2.31e-4    -2.30e-2
         4.0    -1.410737402     2.91e-4     1.92e-2
         6.0     0.681162413    -4.08e-4     6.97e-3
         8.0     0.843801368     5.68e-5    -2.50e-2
         10.0   -1.383454154     3.62e-4     1.39e-2

In general, to solve the initial value problem

\[ Y'(t) = f(t, Y(t)), \quad t_0 \le t \le b, \quad Y(t_0) = Y_0 \tag{5.5} \]

by the Taylor method, select a Taylor approximation of certain order and proceed as described above. For order $p$, write

\[ Y(t_{n+1}) \approx Y(t_n) + hY'(t_n) + \cdots + \frac{h^p}{p!} Y^{(p)}(t_n), \tag{5.6} \]

where the truncation error is

\[ T_{n+1}(Y) = \frac{h^{p+1}}{(p+1)!} Y^{(p+1)}(\xi_n), \quad t_n \le \xi_n \le t_{n+1}. \tag{5.7} \]

Find $Y''(t), \ldots, Y^{(p)}(t)$ by differentiating the differential equation in (5.5) successively, obtaining formulas that implicitly involve only $t_n$ and $Y(t_n)$. As an illustration, we have the following formulas:

\[ Y''(t) = f_t + f_y f, \tag{5.8} \]

\[ Y^{(3)}(t) = f_{tt} + 2 f_{ty} f + f_{yy} f^2 + f_y (f_t + f_y f), \tag{5.9} \]

where

\[ f_t = \frac{\partial f}{\partial t}, \quad f_y = \frac{\partial f}{\partial y}, \quad f_{ty} = \frac{\partial^2 f}{\partial t\, \partial y}, \]

and so on are partial derivatives, and together with $f$, they are evaluated at $(t, Y(t))$. The formulas for the higher derivatives rapidly become very complicated as the differentiation order is increased.

Substitute these formulas into (5.6) and then obtain a numerical method of the form

\[ y_{n+1} = y_n + h y_n' + \frac{h^2}{2} y_n'' + \cdots + \frac{h^p}{p!} y_n^{(p)} \tag{5.10} \]

by forcing (5.6) to be an equality. In the formula,

\[ y_n' = f(t_n, y_n), \quad y_n'' = (f_t + f_y f)(t_n, y_n), \]

and so on, using the pattern of (5.8)–(5.9).

If the solution $Y(t)$ and the derivative function $f(t, z)$ are sufficiently differentiable, then it can be shown that the method (5.10) will satisfy

\[ \max_{t_0 \le t_n \le b} |Y(t_n) - y_h(t_n)| \le c h^p \cdot \max_{t_0 \le t \le b} \left| Y^{(p+1)}(t) \right|. \tag{5.11} \]


The constant $c$ is similar to that appearing in the error formula (2.20) for Euler's method. A proof can be constructed along the same lines as that used for Theorem 2.4 in Chapter 2. In addition, there is an asymptotic error formula

\[ Y(t_n) - y_h(t_n) = h^p D(t_n) + O(h^{p+1}) \tag{5.12} \]

with $D(t)$ satisfying a certain linear differential equation. The result (5.11) shows that for any integer $p \ge 1$, a numerical method based on the Taylor approximation of order $p$ leads to a convergent numerical method with order of convergence $p$. The asymptotic result (5.12) justifies the use of Richardson's extrapolation to estimate the error and to accelerate the convergence (see Problems 3, 4).

Example 5.1 With $p = 2$, formula (5.12) leads to

\[ Y(t_n) - y_h(t_n) \approx \tfrac{1}{3} \left[ y_h(t_n) - y_{2h}(t_n) \right]. \tag{5.13} \]

Its derivation is left as Problem 3 for the reader. To illustrate the usefulness of the formula, use the entries from Table 5.1 with $t_n = 10$:

\[ y_{0.1}(10) \doteq -1.384588757, \quad y_{0.05}(10) \doteq -1.383454154. \]

From (5.13),

\[ Y(10) - y_{0.05}(10) \doteq \tfrac{1}{3} [0.001134603] \doteq 3.78 \times 10^{-4}. \]

This is a good estimate of the true error $3.62 \times 10^{-4}$, given in Table 5.1.
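The arithmetic of this example generalizes readily. The following minimal Python sketch assumes an asymptotic error of the form (5.12) with order $p$, for which the same elimination of $D(t_n)$ gives the factor $1/(2^p - 1)$; for $p = 2$ this is the factor $1/3$ of (5.13). The function name `richardson_estimate` is an assumption made for illustration.

```python
def richardson_estimate(y_h, y_2h, p):
    """Estimate Y - y_h from solutions computed with stepsizes h and 2h,
    assuming the error behaves like D * h**p as in (5.12)."""
    return (y_h - y_2h) / (2 ** p - 1)

# Entries of Table 5.1 at t = 10 (second-order Taylor method, p = 2)
y_h = -1.383454154    # h = 0.05
y_2h = -1.384588757   # 2h = 0.1
estimate = richardson_estimate(y_h, y_2h, 2)

if __name__ == "__main__":
    print(estimate)  # compare with the true error 3.62e-4 listed in Table 5.1
```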

5.2 RUNGE–KUTTA METHODS

The Taylor method is conceptually easy to work with, but as we have seen, it is tedious and time-consuming to have to calculate the higher-order derivatives. To avoid the need for the higher-order derivatives, the Runge–Kutta methods evaluate $f(t, y)$ at more points, while attempting to retain the accuracy of the Taylor approximation. The methods obtained are fairly easy to program, and they are among the most popular methods for solving the initial value problem.

We begin with Runge–Kutta methods of order 2, and later we consider some higher-order methods. The Runge–Kutta methods have the general form

\[ y_{n+1} = y_n + hF(t_n, y_n; h), \quad n \ge 0, \quad y_0 = Y_0. \tag{5.14} \]

The quantity $F(t_n, y_n; h)$ can be regarded as some kind of "average slope" of the solution on the interval $[t_n, t_{n+1}]$. But its construction is based on making (5.14) act like a Taylor method. For methods of order 2, we generally choose

\[ F(t, y; h) = b_1 f(t, y) + b_2 f(t + \alpha h, y + \beta h f(t, y)) \tag{5.15} \]


and determine the constants $\{\alpha, \beta, b_1, b_2\}$ so that when the true solution $Y(t)$ is substituted into (5.14), the truncation error

\[ T_{n+1}(Y) \equiv Y(t_{n+1}) - \left[ Y(t_n) + hF(t_n, Y(t_n); h) \right] \tag{5.16} \]

will satisfy

\[ T_{n+1}(Y) = O(h^3), \tag{5.17} \]

just as with the Taylor method of order 2.

To find the equations for the constants, we use Taylor expansions to compute the truncation error $T_{n+1}(Y)$. For the term $f(t + \alpha h, y + \beta h f(t, y))$, we first expand with respect to the second argument around $y$. Note that we need a remainder $O(h^2)$:

\[ f(t + \alpha h, y + \beta h f(t, y)) = f(t + \alpha h, y) + f_y(t + \alpha h, y)\, \beta h f(t, y) + O(h^2). \]

We then expand the terms with respect to the $t$ variable to obtain

\[ f(t + \alpha h, y + \beta h f(t, y)) = f + f_t \alpha h + f_y \beta h f + O(h^2), \]

where the functions are all evaluated at $(t, y)$. Also, recall from (5.8) that

\[ Y'' = f_t + f_y f. \]

Hence

\[ Y(t + h) = Y + hY' + \frac{h^2}{2} Y'' + O(h^3) = Y + hf + \frac{h^2}{2} (f_t + f_y f) + O(h^3). \]

Then

\[
\begin{aligned}
T_{n+1}(Y) &= Y(t + h) - \left[ Y(t) + hF(t, Y(t); h) \right] \\
&= Y + hf + \tfrac{1}{2} h^2 (f_t + f_y f) - \left[ Y + h b_1 f + b_2 h \left( f + \alpha h f_t + \beta h f_y f \right) \right] + O(h^3) \\
&= h (1 - b_1 - b_2) f + \tfrac{1}{2} h^2 \left[ (1 - 2 b_2 \alpha) f_t + (1 - 2 b_2 \beta) f_y f \right] + O(h^3).
\end{aligned}
\tag{5.18}
\]

The requirement (5.17) implies that the coefficients must satisfy the system

\[ 1 - b_1 - b_2 = 0, \quad 1 - 2 b_2 \alpha = 0, \quad 1 - 2 b_2 \beta = 0. \]

Therefore

\[ b_2 \ne 0, \quad b_1 = 1 - b_2, \quad \alpha = \beta = \frac{1}{2 b_2}. \tag{5.19} \]

Figure 5.1 An illustration of Runge–Kutta method (5.20); the slope of $L_1$ is $f(t, Y(t))$, that of $L_2$ is $f(t + h, Y(t) + hf(t, Y(t)))$, and those of $L_3$ and $L_4$ are the average $F(t, Y(t); h)$. [Plot omitted: the solution curve $z = Y(t)$ over $[t, t + h]$, the lines $L_1$ to $L_4$, and the value $Y(t) + hF(t, Y(t); h)$.]

Thus there is a family of Runge–Kutta methods of order 2, depending on the choice of $b_2$. The three favorite choices are $b_2 = \tfrac{1}{2}$, $\tfrac{3}{4}$, and $1$.

With $b_2 = \tfrac{1}{2}$, we obtain the numerical method

\[ y_{n+1} = y_n + \frac{h}{2} \left[ f(t_n, y_n) + f(t_n + h, y_n + hf(t_n, y_n)) \right], \quad n \ge 0. \tag{5.20} \]

This is also Heun's method (4.29) discussed in Chapter 4. The number $y_n + hf(t_n, y_n)$ is the Euler solution at $t_{n+1}$. Using it, we obtain an approximation to the derivative at $t_{n+1}$, namely,

\[ f(t_{n+1}, y_n + hf(t_n, y_n)). \]

This and the slope $f(t_n, y_n)$ are then averaged to give an "average" slope of the solution on the interval $[t_n, t_{n+1}]$, giving

\[ F(t_n, y_n; h) = \tfrac{1}{2} \left[ f(t_n, y_n) + f(t_n + h, y_n + hf(t_n, y_n)) \right]. \]

This is then used to predict $y_{n+1}$ from $y_n$, in (5.20). This definition is illustrated in Figure 5.1 for $F(t, Y(t); h)$ as an average slope of $Y'$ on $[t, t + h]$.

Another choice is to use $b_2 = 1$, resulting in the numerical method

\[ y_{n+1} = y_n + hf\left( t_n + \tfrac{1}{2} h,\; y_n + \tfrac{1}{2} hf(t_n, y_n) \right). \tag{5.21} \]


Table 5.2 Example of second-order Runge–Kutta method

  h      t      y_h(t)          Error
  0.1    2.0     0.491215673     1.93e-3
         4.0    -1.407898629    -2.55e-3
         6.0     0.680696723     5.81e-5
         8.0     0.841376339     2.48e-3
         10.0   -1.380966579    -2.13e-3
  0.05   2.0     0.492682499     4.68e-4
         4.0    -1.409821234    -6.25e-4
         6.0     0.680734664     2.01e-5
         8.0     0.843254396     6.04e-4
         10.0   -1.382569379    -5.23e-4

Example 5.2 Reconsider the problem (5.1):

\[ Y'(t) = -Y(t) + 2\cos(t), \quad Y(0) = 1. \]

Here $f(t, y) = -y + 2\cos(t)$.

The numerical results from using (5.20) are given in Table 5.2. They show that the errors in this Runge–Kutta solution are comparable in accuracy to the results obtained with the Taylor method (5.4). In addition, the errors in Table 5.2 decrease by a factor of approximately 4 when $h$ is halved, confirming the second-order convergence of the method.
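The whole second-order family determined by (5.19) can be coded with $b_2$ as a parameter. The following is a minimal Python sketch, not the book's MATLAB code; the names `rk2_solve` and `f_example` are assumptions for illustration. Setting `b2=0.5` gives method (5.20) and `b2=1.0` gives method (5.21).

```python
import math

def rk2_solve(f, t0, y0, h, n_steps, b2=0.5):
    """Second-order Runge-Kutta family (5.14)-(5.15) with coefficients from (5.19)."""
    b1 = 1.0 - b2
    alpha = 1.0 / (2.0 * b2)   # (5.19): alpha = beta = 1/(2*b2)
    y = y0
    for n in range(n_steps):
        t = t0 + n * h
        k1 = f(t, y)
        k2 = f(t + alpha * h, y + alpha * h * k1)
        y = y + h * (b1 * k1 + b2 * k2)
    return y

def f_example(t, y):
    # right side of problem (5.1)
    return -y + 2.0 * math.cos(t)

if __name__ == "__main__":
    exact = math.sin(10.0) + math.cos(10.0)
    for h in (0.1, 0.05):
        y = rk2_solve(f_example, 0.0, 1.0, h, round(10.0 / h))
        print(h, exact - y)
```

Keeping $b_2$ as an argument makes it easy to compare the three favorite choices on the same problem.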

5.2.1 A general framework for explicit Runge–Kutta methods

Runge–Kutta methods of higher order can also be developed. An explicit Runge–Kutta formula with $s$ stages has the following form:

\[
\begin{aligned}
z_1 &= y_n, \\
z_2 &= y_n + h a_{2,1} f(t_n, z_1), \\
z_3 &= y_n + h \left[ a_{3,1} f(t_n, z_1) + a_{3,2} f(t_n + c_2 h, z_2) \right], \\
&\;\;\vdots \\
z_s &= y_n + h \left[ a_{s,1} f(t_n, z_1) + a_{s,2} f(t_n + c_2 h, z_2) + \cdots + a_{s,s-1} f(t_n + c_{s-1} h, z_{s-1}) \right],
\end{aligned}
\tag{5.22}
\]

\[ y_{n+1} = y_n + h \left[ b_1 f(t_n, z_1) + b_2 f(t_n + c_2 h, z_2) + \cdots + b_{s-1} f(t_n + c_{s-1} h, z_{s-1}) + b_s f(t_n + c_s h, z_s) \right]. \tag{5.23} \]

Here $h = t_{n+1} - t_n$. The coefficients $\{c_i, a_{i,j}, b_j\}$ are given and they define the numerical method. The function $F$ of (5.14), defining a one-step method, is defined implicitly through the formulas (5.22)–(5.23).

More succinctly, we can write the formulas as

\[ z_i = y_n + h \sum_{j=1}^{i-1} a_{i,j} f(t_n + c_j h, z_j), \quad i = 1, \ldots, s, \tag{5.24} \]

\[ y_{n+1} = y_n + h \sum_{j=1}^{s} b_j f(t_n + c_j h, z_j). \tag{5.25} \]

The coefficients are often displayed in a table called a Butcher tableau (after J. C. Butcher):

\[
\begin{array}{c|ccccc}
0 = c_1 & & & & & \\
c_2 & a_{2,1} & & & & \\
c_3 & a_{3,1} & a_{3,2} & & & \\
\vdots & \vdots & & \ddots & & \\
c_s & a_{s,1} & a_{s,2} & \cdots & a_{s,s-1} & \\
\hline
 & b_1 & b_2 & \cdots & b_{s-1} & b_s
\end{array}
\tag{5.26}
\]

The coefficients $\{c_i\}$ and $\{a_{i,j}\}$ are usually assumed to satisfy the conditions

\[ \sum_{j=1}^{i-1} a_{i,j} = c_i, \quad i = 2, \ldots, s. \tag{5.27} \]

Example 5.3 We give two examples of well-known Runge–Kutta methods.

• The method (5.20) has the Butcher tableau

\[
\begin{array}{c|cc}
0 & & \\
1 & 1 & \\
\hline
 & 1/2 & 1/2
\end{array}
\]

• A popular classical method is the following fourth-order procedure.

\[
\begin{aligned}
z_1 &= y_n, \\
z_2 &= y_n + \tfrac{1}{2} h f(t_n, z_1), \\
z_3 &= y_n + \tfrac{1}{2} h f\left(t_n + \tfrac{1}{2} h, z_2\right), \\
z_4 &= y_n + h f\left(t_n + \tfrac{1}{2} h, z_3\right), \\
y_{n+1} &= y_n + \tfrac{1}{6} h \left[ f(t_n, z_1) + 2 f\left(t_n + \tfrac{1}{2} h, z_2\right) + 2 f\left(t_n + \tfrac{1}{2} h, z_3\right) + f(t_n + h, z_4) \right].
\end{aligned}
\tag{5.28}
\]

The Butcher tableau is

\[
\begin{array}{c|cccc}
0 & & & & \\
1/2 & 1/2 & & & \\
1/2 & 0 & 1/2 & & \\
1 & 0 & 0 & 1 & \\
\hline
 & 1/6 & 1/3 & 1/3 & 1/6
\end{array}
\tag{5.29}
\]

Following an extended calculation modeled on that in (5.18), we can show $T_{n+1} = O(h^5)$.

When the differential equation is simply $Y'(t) = f(t)$ with no dependence of $f$ on $Y$, this method reduces to Simpson's rule for numerical integration on $[t_n, t_{n+1}]$. The method (5.28) can be easily implemented using a computer or a programmable hand calculator, and it is generally quite accurate. A numerical example is given at the end of the next section.
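The formulas (5.24)–(5.25) translate almost line by line into a routine driven by the Butcher tableau. The following is a minimal Python sketch for a scalar equation, not the book's MATLAB code; the name `rk_step` and the list layout of the tableau (0-based indices, with `a[i]` holding the row $a_{i+1,1}, \ldots, a_{i+1,i}$) are assumptions made for this illustration.

```python
def rk_step(f, t, y, h, c, a, b):
    """One step of an s-stage explicit Runge-Kutta method, following (5.24)-(5.25)."""
    s = len(b)
    slopes = []  # slopes[j] = f(t + c[j]*h, z_j)
    for i in range(s):
        z_i = y + h * sum(a[i][j] * slopes[j] for j in range(i))
        slopes.append(f(t + c[i] * h, z_i))
    return y + h * sum(b[j] * slopes[j] for j in range(s))

# Tableau of Heun's method (5.20)
HEUN_C = [0.0, 1.0]
HEUN_A = [[], [1.0]]
HEUN_B = [0.5, 0.5]

# Tableau (5.29) of the classical fourth-order method (5.28)
RK4_C = [0.0, 0.5, 0.5, 1.0]
RK4_A = [[], [0.5], [0.0, 0.5], [0.0, 0.0, 1.0]]
RK4_B = [1.0 / 6.0, 1.0 / 3.0, 1.0 / 3.0, 1.0 / 6.0]

if __name__ == "__main__":
    # Y' = t^3 on [0, 1]: the step reduces to Simpson's rule, which is exact for cubics
    print(rk_step(lambda t, y: t ** 3, 0.0, 0.0, 1.0, RK4_C, RK4_A, RK4_B))
```

Storing only the strictly lower triangular rows keeps the tableau data in the same shape as (5.26), so a new explicit method needs only new lists, not new code.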

5.3 CONVERGENCE, STABILITY, AND ASYMPTOTIC ERROR

We want to examine the convergence of the one-step method

\[ y_{n+1} = y_n + hF(t_n, y_n; h), \quad n \ge 0, \quad y_0 = Y_0 \tag{5.30} \]

to the solution $Y(t)$ of the initial value problem

\[ Y'(t) = f(t, Y(t)), \quad t_0 \le t \le b, \qquad Y(t_0) = Y_0. \tag{5.31} \]

Using the truncation error of (5.16) for the true solution $Y$, we introduce

\[ \tau_n(Y) = \frac{1}{h} T_{n+1}(Y). \]

In order to show convergence of (5.30), we need to have $\tau_n(Y) \to 0$ as $h \to 0$. Since

\[ \tau_n(Y) = \frac{Y(t_{n+1}) - Y(t_n)}{h} - F(t_n, Y(t_n), h; f), \tag{5.32} \]

we require that

\[ F(t, Y(t), h; f) \to Y'(t) = f(t, Y(t)) \quad \text{as } h \to 0. \]

Accordingly, define

\[ \delta(h) = \sup_{\substack{t_0 \le t \le b \\ -\infty < y < \infty}} |f(t, y) - F(t, y, h; f)|, \tag{5.33} \]

and assume

\[ \delta(h) \to 0 \quad \text{as } h \to 0. \tag{5.34} \]

This is occasionally called the consistency condition for the one-step method (5.30). We can rewrite (5.32) in the form

\[ Y(t_{n+1}) = Y(t_n) + hF(t_n, Y(t_n), h; f) + h\tau_n(Y). \tag{5.35} \]

We then introduce

\[ \tau(h) = \max_{t_0 \le t_n \le b} |\tau_n(Y)|. \]

The condition (5.34) can be used to show $\tau(h) \to 0$ as $h \to 0$; or we may show this result by other means (e.g., see (5.17)).

We also need a Lipschitz condition on $F$, namely

\[ |F(t, y, h; f) - F(t, z, h; f)| \le L |y - z| \tag{5.36} \]

for all $t_0 \le t \le b$, $-\infty < y, z < \infty$, and all small $h > 0$. This is in analogy with the Lipschitz condition (1.10) for $f(t, z)$ of Chapter 1, which was used to guarantee the existence of a unique solution to the initial value problem for $Y' = f(t, Y)$. The condition (5.36) is usually proved by using the Lipschitz condition (1.10) on $f(t, y)$. For example, with method (5.21), we obtain

\[
\begin{aligned}
|F(t, y, h; f) - F(t, z, h; f)|
&= \left| f\left(t + \tfrac{1}{2} h, y + \tfrac{1}{2} h f(t, y)\right) - f\left(t + \tfrac{1}{2} h, z + \tfrac{1}{2} h f(t, z)\right) \right| \\
&\le K \left| y - z + \tfrac{1}{2} h \left[ f(t, y) - f(t, z) \right] \right| \\
&\le K \left(1 + \tfrac{1}{2} h K\right) |y - z|.
\end{aligned}
\]

The last two inequalities use the Lipschitz condition (1.10) for $f$. Choose $L = K(1 + \tfrac{1}{2} K)$ for $h \le 1$.

Theorem 5.4 Assume that the Runge–Kutta method (5.30) satisfies the Lipschitz condition (5.36). Then, for the initial value problem (5.31), the solution $\{y_n\}$ satisfies

\[ \max_{t_0 \le t_n \le b} |Y(t_n) - y_n| \le e^{(b - t_0)L} |Y_0 - y_0| + \left[ \frac{e^{(b - t_0)L} - 1}{L} \right] \tau(h), \tag{5.37} \]

where

\[ \tau(h) \equiv \max_{t_0 \le t_n \le b} |\tau_n(Y)|. \tag{5.38} \]

If the consistency condition (5.34) is also satisfied, then the numerical solution $\{y_n\}$ converges to $Y(t)$.

Proof. Subtract (5.30) from (5.35) to obtain

\[ e_{n+1} = e_n + h \left[ F(t_n, Y_n, h; f) - F(t_n, y_n, h; f) \right] + h\tau_n(Y) \tag{5.39} \]

in which $e_n = Y(t_n) - y_n$ and $Y_n = Y(t_n)$. Apply the Lipschitz condition (5.36) and use (5.38) to obtain

\[ |e_{n+1}| \le (1 + hL) |e_n| + h\tau(h), \quad t_0 \le t_N \le b. \tag{5.40} \]

As with the convergence proof in Theorem 2.4 for the Euler method, given in Section 2.2 of Chapter 2, this leads easily to the result (5.37).

In most cases, it is known by direct computation that $\tau(h) \to 0$ as $h \to 0$, and in that case, convergence of $\{y_n\}$ to $Y(t)$ is immediately proved. But all that we need to know is that (5.34) is satisfied. To see this, write

\[
\begin{aligned}
h\tau_n(Y) &= Y(t_{n+1}) - Y(t_n) - hF(t_n, Y(t_n), h; f) \\
&= hY'(t_n) + \frac{h^2}{2} Y''(\xi_n) - hF(t_n, Y(t_n), h; f), \\
h |\tau_n(Y)| &\le h\delta(h) + \frac{h^2}{2} \|Y''\|_\infty, \\
\tau(h) &\le \delta(h) + \frac{1}{2} h \|Y''\|_\infty.
\end{aligned}
\]

Thus $\tau(h) \to 0$ as $h \to 0$, completing the proof. The preceding examples are illustrations of the theorem.

The following result is an immediate consequence of (5.37).

Corollary 5.5 If the Runge–Kutta method (5.30) has a truncation error $T_n(Y) = O(h^{m+1})$, then the error in the convergence of $\{y_n\}$ to $Y(t)$ on $[t_0, b]$ is $O(h^m)$.

It is not too difficult to derive an asymptotic error formula for the Runge–Kutta method (5.30), provided one is known for the truncation error. Assume

\[ T_n(Y) = \varphi(t_n) h^{m+1} + O(h^{m+2}) \tag{5.41} \]

with $\varphi(t)$ determined by $Y(t)$ and $f(t, Y(t))$. As an example, see the result (5.18) to obtain this expansion for second-order Runge–Kutta methods. Strengthened forms of (5.34) and (5.36) are also necessary. Assume

\[ F(t, y, h; f) - F(t, z, h; f) = \frac{\partial F(t, y, h; f)}{\partial y} (y - z) + O((y - z)^2) \tag{5.42} \]

and also

\[ \delta_1(h) \equiv \sup_{\substack{t_0 \le t \le b \\ -\infty < y < \infty}} \left| \frac{\partial f(t, y)}{\partial y} - \frac{\partial F(t, y, h; f)}{\partial y} \right| \to 0 \quad \text{as } h \to 0. \tag{5.43} \]

In practice, both of these results are straightforward to confirm. With these assumptions, we can derive the formula

\[ Y(t_n) - y_h(t_n) = D(t_n) h^m + O(h^{m+1}), \tag{5.44} \]

with $D(t)$ satisfying the linear initial value problem

\[ D'(t) = f_y(t, Y(t)) D(t) + \varphi(t), \quad D(t_0) = 0. \tag{5.45} \]


Stability results can be obtained for Runge–Kutta methods in analogy with those for Euler's method as presented in Section 2.4 of Chapter 2. We omit any discussion here.

As with Taylor methods, Richardson's extrapolation can be justified for Runge–Kutta methods using (5.44), and the error can be estimated. For the second-order method (5.20), we obtain the error estimate

\[ Y(t_n) - y_h(t_n) \approx \tfrac{1}{3} \left[ y_h(t_n) - y_{2h}(t_n) \right], \]

just as we obtained it earlier for the second-order Taylor method; see Problem 3.

Example 5.6 Estimate the error for $h = 0.05$ and $t = 10$ in Table 5.2. Then

\[ Y(10) - y_{0.05}(10) \doteq \tfrac{1}{3} \left[ -1.382569379 - (-1.380966579) \right] \doteq -5.34 \times 10^{-4}. \]

This compares closely with the actual error of $-5.23 \times 10^{-4}$.

Example 5.7 Consider the problem

\[ Y' = \frac{1}{1 + x^2} - 2Y^2, \quad Y(0) = 0 \tag{5.46} \]

with the solution $Y = x/(1 + x^2)$. The method (5.28) was used with a fixed stepsize, and the results are shown in Table 5.3. The stepsizes are $h = 0.25$ and $2h = 0.5$. The asymptotic error formula (5.44) becomes

\[ Y(x) - y_h(x) = D(x) h^4 + O(h^5), \tag{5.47} \]

in this case, and this leads to the asymptotic error estimate

\[ Y(x) - y_h(x) = \tfrac{1}{15} \left[ y_h(x) - y_{2h}(x) \right] + O(h^5). \tag{5.48} \]

In the table the column labeled "Ratio" gives the ratio of the errors for corresponding node points as $h$ is halved. The last column is an example of formula (5.48). Because $T_n(Y) = O(h^5)$ for method (5.28), Theorem 5.4 implies that the rate of convergence of $y_h(x)$ to $Y(x)$ is $O(h^4)$. The theoretical value of "Ratio" is 16, and as $h$ decreases further, this value will be realized more closely.
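The computation behind this example can be sketched as follows. This is a minimal Python illustration of method (5.28) applied to (5.46) together with the estimate (5.48); it is not the book's MATLAB code, and the names `rk4_solve` and `f_riccati` are assumptions for this sketch.

```python
def rk4_solve(f, x0, y0, h, n_steps):
    """Classical fourth-order Runge-Kutta method (5.28) with fixed stepsize h."""
    y = y0
    for n in range(n_steps):
        x = x0 + n * h
        z1 = y
        z2 = y + 0.5 * h * f(x, z1)
        z3 = y + 0.5 * h * f(x + 0.5 * h, z2)
        z4 = y + h * f(x + 0.5 * h, z3)
        y = y + h / 6.0 * (f(x, z1) + 2.0 * f(x + 0.5 * h, z2)
                           + 2.0 * f(x + 0.5 * h, z3) + f(x + h, z4))
    return y

def f_riccati(x, y):
    # right side of problem (5.46)
    return 1.0 / (1.0 + x * x) - 2.0 * y * y

# Solve to x = 2 with stepsizes h = 0.25 and 2h = 0.5
y_h = rk4_solve(f_riccati, 0.0, 0.0, 0.25, 8)
y_2h = rk4_solve(f_riccati, 0.0, 0.0, 0.5, 4)
estimate = (y_h - y_2h) / 15.0          # Richardson estimate (5.48)
true_error = 2.0 / (1.0 + 4.0) - y_h    # Y(2) = 2/5

if __name__ == "__main__":
    print(y_h, true_error, estimate)  # compare with the x = 2.0 row of Table 5.3
```

Only the two numerical solutions are needed for the estimate; the true solution is used here solely to check it.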

5.3.1 Error prediction and control

The easiest way to predict the error $Y(t) - y_h(t)$ in a numerical solution $y_h(t)$ is to use Richardson's extrapolation. Solve the initial value problem twice on the given interval $[t_0, b]$, with stepsizes $2h$ and $h$. Then use Richardson's extrapolation to estimate $Y(t) - y_h(t)$ in terms of $y_h(t) - y_{2h}(t)$, as was done in (5.13) for a second-order method. The cost of estimating the error in this way is an approximately 50% increase in the amount of computation, as compared with the cost of computing just $y_h(t)$. This may seem a large cost, but it is generally worth paying except for the most time-consuming of problems.

Table 5.3 Example of Runge–Kutta method (5.28)

  x      y_h(x)       Y(x) - y_h(x)   Y(x) - y_2h(x)   Ratio   (1/15)[y_h(x) - y_2h(x)]
  2.0    0.39995699   4.3e-5          1.0e-3           24      6.7e-5
  4.0    0.23529159   2.5e-6          7.0e-5           28      4.5e-6
  6.0    0.16216179   3.7e-7          1.2e-5           32      7.7e-7
  8.0    0.12307683   9.2e-8          3.4e-6           36      2.2e-7
  10.0   0.09900987   3.1e-8          1.3e-6           41      8.2e-8

It would be desirable to have computer programs that would solve a differential equation on a given interval $[t_0, b]$ with an error less than a given error tolerance $\epsilon > 0$. Unfortunately, this is not possible with most types of numerical methods for the initial value problem. If at some point $t$ we discover that $Y(t) - y_h(t)$ is too large, then the error cannot be reduced by merely decreasing $h$ from that point onward in the computation. The error $Y(t) - y_h(t)$ depends on the cumulative effect of all preceding errors at points $t_n < t$. Thus, to decrease the error at $t$, it is necessary to repeat the solution of the equation from $t_0$, but with a smaller stepsize $h$. For this reason, most package programs for solving the initial value problem will not attempt to directly control the error, although they may try to monitor or bound it. Instead, they use indirect methods to affect the size of the error.

The error $Y(t_n) - y_h(t_n)$ is called the global error or total error at $t_n$. Rather than controlling this global error, we control another error. We introduce the following initial value problem:

\[ u_n'(t) = f(t, u_n(t)), \quad t \ge t_n, \qquad u_n(t_n) = y_n. \tag{5.49} \]

The solution $u_n(t)$ is called the local solution to the differential equation at the point $(t_n, y_n)$. Using it we introduce the local error

\[ LE_{n+1} = u_n(t_{n+1}) - y_{n+1}. \tag{5.50} \]

This is the error introduced into the solution at the point $t_{n+1}$ when assuming the solution $y_n$ at $t_n$ is the exact solution. Most computer programs that contain error control are based on estimating the local error and then controlling it by varying $h$ suitably. By so doing, they hope to keep the global error sufficiently small. If an error parameter $\epsilon > 0$ is given, the better programs choose the stepsize $h$ to ensure that the local error $LE_{n+1}$ is much smaller, usually satisfying something like

\[ |LE_{n+1}| \le \epsilon (t_{n+1} - t_n). \tag{5.51} \]

This is called controlling the error per unit stepsize, with which the global error is generally also kept small. For many differential equations, the global error will then be less than $\epsilon (t_{n+1} - t_0)$.


Table 5.4 Fehlberg coefficients α_i, β_ij

  i   α_i      β_i0        β_i1         β_i2         β_i3        β_i4
  1   1/4      1/4
  2   3/8      3/32        9/32
  3   12/13    1932/2197   -7200/2197   7296/2197
  4   1        439/216     -8           3680/513     -845/4104
  5   1/2      -8/27       2            -3544/2565   1859/4104   -11/40

For more detailed discussions of one-step methods, especially Runge–Kutta methods, see Shampine [72], Iserles [48, Chap. 3], and Deuflhard and Bornemann [33, Chaps. 4-6].

5.4 RUNGE–KUTTA–FEHLBERG METHODS

To estimate the local error (5.50), various techniques can be used, including Richardson's extrapolation. A novel technique was devised in the 1970s, and it has led to the currently most popular Runge–Kutta methods. Rather than computing with a method of fixed order, one simultaneously computes by using two methods of different orders. The two methods share most of the function evaluations of $f$ at each step from $t_n$ to $t_{n+1}$. Then the higher-order formula is used to estimate the error in the lower-order formula. These methods are often called Fehlberg methods; we give one such pair of methods, of orders 4 and 5.

Define six intermediate slopes in $[t_n, t_{n+1}]$ by

\[
\begin{aligned}
v_0 &= f(t_n, y_n), \\
v_i &= f\Bigl( t_n + \alpha_i h,\; y_n + h \sum_{j=0}^{i-1} \beta_{ij} v_j \Bigr), \quad i = 1, 2, 3, 4, 5.
\end{aligned}
\tag{5.52}
\]

Then the fourth- and fifth-order formulas are given by

\[ y_{n+1} = y_n + h \sum_{i=0}^{4} \gamma_i v_i, \tag{5.53} \]

\[ \hat{y}_{n+1} = y_n + h \sum_{i=0}^{5} \delta_i v_i. \tag{5.54} \]

The coefficients $\alpha_i$, $\beta_{ij}$, $\gamma_i$, $\delta_i$ are given in Tables 5.4 and 5.5.

The local error in the fourth-order formula (5.53) is estimated by

\[ LE_{n+1} \approx \hat{y}_{n+1} - y_{n+1}. \tag{5.55} \]


Table 5.5 Fehlberg coefficients γ_i, δ_i

  i      0         1     2            3             4       5
  γ_i    25/216    0     1408/2565    2197/4104     -1/5
  δ_i    16/135    0     6656/12825   28561/56430   -9/50   2/55

Table 5.6 Example of fourth-order Fehlberg formula (5.53)

  h       t      y_h(t)          Y(t) - y_h(t)   ŷ_h(t) - y_h(t)
  0.25    2.0     0.493156301    -5.71e-6        -9.49e-7
          4.0    -1.410449823     3.71e-6         1.62e-6
          6.0     0.680752304     2.48e-6        -3.97e-7
          8.0     0.843864007    -5.79e-6        -1.29e-6
          10.0   -1.383094975     2.34e-6         1.47e-6
  0.125   2.0     0.493150889    -2.99e-7        -2.35e-8
          4.0    -1.410446334     2.17e-7         4.94e-8
          6.0     0.680754675     1.14e-7        -1.76e-8
          8.0     0.843858525    -3.12e-7        -3.47e-8
          10.0   -1.383092786     1.46e-7         4.65e-8

It can be shown that this is a correct asymptotic result as $h \to 0$. By using this estimate, if $LE_{n+1}$ is too small or too large, the stepsize can be varied so as to give a value for $LE_{n+1}$ of acceptable size. Note that the two formulas (5.53) and (5.54) use the common intermediate slopes $v_0, \ldots, v_4$. At each step, we need to evaluate only six intermediate slopes. In a number of programs, the fifth-order solution $\hat{y}_{n+1}$ is actually the numerical solution used, even though the error is being controlled only for the fourth-order solution $y_{n+1}$.
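One step of the pair (5.52)–(5.55) can be sketched directly from Tables 5.4 and 5.5. The following is a minimal Python illustration for a scalar equation, not the book's MATLAB code and not a full adaptive solver; the names `rkf45_step`, `ALPHA`, `BETA`, `GAMMA`, and `DELTA` are assumptions, and no stepsize selection logic is included.

```python
import math

# Coefficients from Tables 5.4 and 5.5 (row i = 0 corresponds to v_0)
ALPHA = [0.0, 1 / 4, 3 / 8, 12 / 13, 1.0, 1 / 2]
BETA = [
    [],
    [1 / 4],
    [3 / 32, 9 / 32],
    [1932 / 2197, -7200 / 2197, 7296 / 2197],
    [439 / 216, -8.0, 3680 / 513, -845 / 4104],
    [-8 / 27, 2.0, -3544 / 2565, 1859 / 4104, -11 / 40],
]
GAMMA = [25 / 216, 0.0, 1408 / 2565, 2197 / 4104, -1 / 5]
DELTA = [16 / 135, 0.0, 6656 / 12825, 28561 / 56430, -9 / 50, 2 / 55]

def rkf45_step(f, t, y, h):
    """Return (y4, y5, estimate): the fourth-order value (5.53), the
    fifth-order value (5.54), and the local error estimate (5.55)."""
    v = []
    for i in range(6):
        z = y + h * sum(BETA[i][j] * v[j] for j in range(i))
        v.append(f(t + ALPHA[i] * h, z))           # slopes (5.52)
    y4 = y + h * sum(g * vi for g, vi in zip(GAMMA, v))   # uses v_0..v_4 only
    y5 = y + h * sum(d * vi for d, vi in zip(DELTA, v))   # uses v_0..v_5
    return y4, y5, y5 - y4

if __name__ == "__main__":
    f = lambda t, y: -y + 2.0 * math.cos(t)   # problem (5.56)
    print(rkf45_step(f, 0.0, 1.0, 0.25))
```

A driver built on this step would compare the returned estimate with a bound such as (5.51) and then shrink or enlarge $h$ accordingly.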

Example 5.8 Solve

\[ Y'(t) = -Y(t) + 2\cos(t), \quad Y(0) = 1 \tag{5.56} \]

whose true solution is $Y(t) = \sin(t) + \cos(t)$. Table 5.6 contains numerical results for $h = 0.25$ and $0.125$. Compare the global errors with those in Tables 5.1 and 5.2, where second-order methods are used. Also, it can be seen that the global errors in $y_h$ decrease by factors of 17 to 21, which are fairly close to the theoretical value of 16 for a fourth-order method. The truncation errors, estimated from (5.55), are included to show that they are quite different from the global error.

The method (5.52) to (5.55) uses $\hat{y}_{n+1}$ only for estimating the truncation error in the fourth-order method. In practice, $\hat{y}_{n+1}$ is kept as the numerical solution rather than $y_{n+1}$; thus $\hat{y}_n$ should replace $y_n$ on the right sides of (5.52) to (5.54). The quantity in (5.55) will still be the truncation error in the fourth-order method. Programs based on this will be fifth-order, but they will vary their stepsize $h$ to control the local error in the fourth-order method. This tends to make these programs very accurate with regard to global error.

Table 5.7 Example of fifth-order method (5.54)

  h       t      ŷ_h(t)          Y(t) - ŷ_h(t)
  0.25    2.0     0.493151148    -5.58e-7
          4.0    -1.410446359     2.43e-7
          6.0     0.680754463     3.26e-7
          8.0     0.843858731    -5.18e-7
          10.0   -1.383092745     1.05e-7
  0.125   2.0     0.493150606    -1.61e-8
          4.0    -1.410446124     8.03e-9
          6.0     0.680754780     8.65e-9
          8.0     0.843858228     1.53e-8
          10.0   -1.383092644     4.09e-9

Example 5.9 Repeat the last example, but use the fifth-order method described in the preceding paragraph. The results are given in Table 5.7. Note that the errors decrease by a factor of approximately 32 when $h$ is halved, consistent with a fifth-order method.
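The factor of 32 can be checked directly from the entries of Table 5.7. The following is a minimal sketch, not part of the original program suite: it copies the tabulated errors by hand and estimates the observed order from the relation ratio $= 2^p$. The variable names (err_h, err_h2, ratio, p_obs) are illustrative choices.

```matlab
% Global errors Y(t) - y(t) at t = 2, 4, 6, 8, 10, copied from Table 5.7.
err_h  = [-5.58e-7, 2.43e-7, 3.26e-7, -5.18e-7, 1.05e-7];  % h = 0.25
err_h2 = [-1.61e-8, 8.03e-9, 8.65e-9,  1.53e-8, 4.09e-9];  % h = 0.125

% Factor by which the error magnitude shrinks when h is halved.
ratio = abs(err_h) ./ abs(err_h2);

% Observed order p, from ratio = 2^p (illustrative variable name).
p_obs = log2(ratio);

fprintf('%6s %10s %10s\n', 't', 'ratio', 'order');
fprintf('%6.1f %10.1f %10.2f\n', [2:2:10; ratio; p_obs]);
```

Absolute values are used because the error changes sign between the two stepsizes at $t = 8$.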

5.5 MATLAB CODES

MATLAB® contains an excellent suite of programs for solving the initial value problem for systems of ordinary differential equations and related problems. The programs use a variety of methods, and in this text we introduce and illustrate a few of these programs. For a complete description of these programs and the various options that are available when using them, go to the documentation for MATLAB or to the excellent text by Shampine et al. [74]. Each such MATLAB program solves a given differential equation in such a manner that the estimated local error in each component of the solution satisfies a given error test. For a single equation the estimated local error in passing from $y(t_n)$ to $y(t_{n+1})$, call it $e(t_n)$, is to satisfy

$$|e(t_n)| \le \max\{\text{AbsTol},\ \text{RelTol} \cdot |y(t_n)|\}.$$

The error tolerances AbsTol and RelTol can be specified by having the user run the MATLAB program odeset; when left unspecified, the default tolerances are AbsTol $= 10^{-6}$, RelTol $= 10^{-3}$. For a discussion of the construction of this MATLAB suite for solving ordinary differential equations, see Shampine and Reichelt [73] or Shampine, Gladwell, and Thompson [74].
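As a small illustration of the error test, the sketch below sets the two tolerances with odeset and evaluates the inequality for two sample local errors. The anonymous function accept is a hypothetical helper written for this illustration; it is not part of MATLAB, and the sample values of the error and solution are made up.

```matlab
% Illustrative tolerances; odeset returns an options structure.
options = odeset('RelTol', 1e-4, 'AbsTol', 1e-6);

% Hypothetical helper (not part of MATLAB): the acceptance test
%   |e| <= max(AbsTol, RelTol*|y|)
accept = @(e, y, opts) abs(e) <= max(opts.AbsTol, opts.RelTol * abs(y));

% With |y| = 2 the bound is max(1e-6, 2e-4) = 2e-4.
ok_small = accept(5e-7, 2.0, options);   % 5e-7 is below the bound
ok_large = accept(5e-4, 2.0, options);   % 5e-4 exceeds the bound
disp([ok_small, ok_large])
```

For solution values of moderate size the relative tolerance dominates the test, while AbsTol takes over when the solution is near zero.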

Figure 5.2 The solution values to (5.56) obtained by ode45 are indicated by the symbol o. The curve is obtained by interpolating these solution values from ode45 using deval.

The code ode45 is an implementation of a method similar to the Runge–Kutta–Fehlberg method presented earlier. The program ode45 uses a pair of formulas of orders 4 and 5 by Dormand and Prince [34, cf. Table 2], again estimating the local error as in (5.55). We illustrate the use of ode45 with the following program test_ode45.

Example 5.10 We illustrate the use of ode45 by solving the earlier test equation (5.56). When calling test_ode45, we use $\lambda = -1$ and the error tolerances AbsTol $= 10^{-6}$, RelTol $= 10^{-4}$. In the program test_ode45, odeset is used to set parameter values that are used in ode45. For a complete description of these parameter values and for a more complete discussion of the varied options for using ode45, consult the MATLAB documentation. We note that in the call to program ode45, we specify the derivative function by giving as an input the function handle @deriv. The output soln from ode45 is a MATLAB structure, and it contains all of the information needed to obtain the solution and to interpolate the solution to other values of the independent variable. In our test program, we use the MATLAB program deval to carry out the interpolation on an evenly spaced grid. This could have been done directly when calling ode45, but we have chosen a more general approach to using ode45. Figures 5.2 and 5.3 contain, respectively, the interpolated numerical solution and the error in it.
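The direct alternative mentioned in the example can be sketched as follows. When the time argument passed to ode45 is a vector with more than two entries, the solver returns the solution at exactly those times, so no separate call to deval is needed. This is a minimal sketch assuming $\lambda = -1$ and the tolerances of Example 5.10; the names f, t_grid, and max_err are illustrative.

```matlab
lambda  = -1;
f       = @(t, y) lambda*y + (1 - lambda)*cos(t) - (1 + lambda)*sin(t);
options = odeset('RelTol', 1e-4, 'AbsTol', 1e-6);

% Evenly spaced output grid on [0, 20]; ode45 still chooses its own
% internal steps and only reports the solution at these points.
t_grid = linspace(0, 20, 201);
[t, y] = ode45(f, t_grid, 1, options);

% Maximum error on the grid against the true solution sin(t) + cos(t).
max_err = norm(y - (sin(t) + cos(t)), inf);
disp(['maximum of error = ', num2str(max_err)])
```

The structure output used in test_ode45 is the more general choice, since it allows interpolation at points decided after the solve.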


Figure 5.3 The errors in the solution to (5.56) obtained using ode45

The code described in Example 5.10 proceeds as follows.

function test_ode45(lambda,relerr,abserr)
%
% function test_ode45(lambda,relerr,abserr)
%
% This is a test program for the ode solver 'ode45'.
% The test is carried out for the single equation
%    y' = lambda*y + (1-lambda)*cos(t) - (1+lambda)*sin(t)
% The initial value at t=0 is y(0)=1. The true solution is
%    y = cos(t) + sin(t)
% The user can input the relative and absolute error
% tolerances to be used by ode45. These are incorporated
% using the initialization program 'odeset'.
% The program can be adapted easily to other equations and
% other parameter values.

% Initialize and solve
options = odeset('RelTol',relerr,'AbsTol',abserr);
t_begin = 0; t_end = 20;
y_initial = true_soln(t_begin);
num_fcn_eval = 0; % initialize count of derivative evaluations
soln = ode45(@deriv,[t_begin,t_end],y_initial,options);
% See below for function deriv.

% Produce the solution on a uniform grid using interpolation
% of the solution obtained by ode45. The points plotted with
% 'o' are for the node points returned by ode45.
h_plot = (t_end-t_begin)/200; t_plot = t_begin:h_plot:t_end;
y_plot = deval(soln,t_plot);
figure
plot(soln.x,soln.y,'o',t_plot,y_plot)
title(['Interpolated solution:',...
    ' points noted by ''o'' are at ode45 solution nodes'])
xlabel(['\lambda = ',num2str(lambda)])
disp('press on any key to continue')
pause

% Produce the error in the solution on the uniform grid.
% The points plotted with 'o' are for the solution values
% at the points returned by ode45.
y_true = true_soln(t_plot);
error = y_true - y_plot;
y_true_nodes = true_soln(soln.x);
error_nodes = y_true_nodes - soln.y;
figure
plot(soln.x,error_nodes,'o',t_plot,error)
title('Error in interpolated solution')
xlabel(['\lambda = ',num2str(lambda)])
norm_error = norm(error,inf);
disp(['maximum of error = ',num2str(norm_error)])
disp(['number of derivative evaluations = ',...
    num2str(num_fcn_eval)])

    function dy = deriv(t,y)
    % Define the derivative in the differential equation.
    % Nested function: shares lambda and num_fcn_eval with test_ode45.
    dy = lambda*y + (1-lambda)*cos(t) - (1+lambda)*sin(t);
    num_fcn_eval = num_fcn_eval + 1;
    end % deriv

    function true = true_soln(t)
    % Define the true solution of the initial value problem.
    true = sin(t) + cos(t);
    end % true_soln

end % test_ode45


5.6 IMPLICIT RUNGE–KUTTA METHODS

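Return to (5.24)–(5.25) for the definition of an $s$-stage Runge–Kutta (RK) method. An $s$-stage implicit Runge–Kutta method has the form

$$z_i = y_n + h \sum_{j=1}^{s} a_{i,j} f(t_n + c_j h, z_j), \qquad i = 1, \dots, s, \tag{5.57}$$

$$y_{n+1} = y_n + h \sum_{j=1}^{s} b_j f(t_n + c_j h, z_j). \tag{5.58}$$

It has the Butcher tableau

$$\begin{array}{c|ccc}
c_1 & a_{1,1} & \cdots & a_{1,s} \\
c_2 & a_{2,1} & \cdots & a_{2,s} \\
\vdots & \vdots & & \vdots \\
c_s & a_{s,1} & \cdots & a_{s,s} \\ \hline
 & b_1 & \cdots & b_s
\end{array} \tag{5.59}$$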

We give here a very brief introduction to implicit RK methods, referring to Chapter 9 for a more extensive discussion of the topic.

The equations (5.57) form a simultaneous system of $s$ nonlinear equations for the $s$ unknowns $z_1, \dots, z_s$; and if the equation $y' = f(t, y)$ is a system of $m$ differential equations, then (5.57) is a simultaneous system of $sm$ nonlinear scalar equations. Why does one want to consider such a complicated numerical method? The answer is that a number of such methods (5.57)–(5.58) have desirable numerical stability properties that are important in solving a variety of important classes of differential equations.

We introduce one approach to deriving many such methods. We begin by converting the differential equation

$$Y'(t) = f(t, Y(t))$$

into an integral equation. Integrating the equation over the interval $[t_n, t]$, we obtain

$$\int_{t_n}^{t} Y'(r)\, dr = \int_{t_n}^{t} f(r, Y(r))\, dr,$$

$$Y(t) = Y(t_n) + \int_{t_n}^{t} f(r, Y(r))\, dr. \tag{5.60}$$

Approximate the equation, first by replacing $Y(t_n)$ with $y_n$, and then by replacing the integrand with a polynomial interpolant of it. In particular, choose a set of parameters

$$0 \le \tau_1 < \cdots < \tau_s \le 1.$$

Let $p(r)$ be the unique polynomial of degree $< s$ that interpolates $f(r, Y(r))$ at the node points $\{t_{n,i} \equiv t_n + \tau_i h : i = 1, \dots, s\}$ on $[t_n, t_{n+1}]$; see Appendix B. Then (5.60) is approximated by

$$Y(t) \approx y_n + \int_{t_n}^{t} p(r)\, dr. \tag{5.61}$$

Using the Lagrange form of the interpolation polynomial [see (B.6) from Appendix B], we write

$$p(r) = \sum_{j=1}^{s} f(t_{n,j}, Y(t_{n,j}))\, l_j(r).$$

The Lagrange basis functions $\{l_j(r)\}$ can be obtained from (B.4). Then (5.61) becomes

$$Y(t) \approx y_n + \sum_{j=1}^{s} f(t_{n,j}, Y(t_{n,j})) \int_{t_n}^{t} l_j(r)\, dr. \tag{5.62}$$

We now determine approximate values for $\{Y(t_{n,j}) : j = 1, \dots, s\}$ by forcing equality in the expression (5.62) at the points $\{t_{n,j}\}$. Let $\{y_{n,j}\}$ denote these approximate values. They are to be determined by solving the nonlinear system

$$y_{n,i} = y_n + \sum_{j=1}^{s} f(t_{n,j}, y_{n,j}) \int_{t_n}^{t_{n,i}} l_j(r)\, dr, \qquad i = 1, \dots, s. \tag{5.63}$$

If $\tau_s = 1$, then we define $y_{n+1} = y_{n,s}$. Otherwise, we define

$$y_{n+1} = y_n + \sum_{j=1}^{s} f(t_{n,j}, y_{n,j}) \int_{t_n}^{t_{n+1}} l_j(r)\, dr. \tag{5.64}$$

The integrals in (5.63) and (5.64) are easily evaluated, and we will give a particular case below with $s = 2$.
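The evaluation of these integrals can be sketched numerically. Writing $r = t_n + sh$ and comparing (5.63)–(5.64) with (5.57)–(5.58), the Runge–Kutta coefficients are $a_{i,j} = \int_0^{\tau_i} l_j(s)\,ds$ and $b_j = \int_0^1 l_j(s)\,ds$, where $l_j$ is now the Lagrange basis function on the nodes $\tau_1, \dots, \tau_s$. The sketch below is an illustration under that normalization, using the built-in functions poly, polyint, and polyval; the choice of $\tau$ is the pair of Gauss points that appears later in (5.68), and the variable names are illustrative.

```matlab
% Collocation parameters: the two Gauss points of (5.68).
tau = [1/2 - sqrt(3)/6, 1/2 + sqrt(3)/6];
s   = numel(tau);
A   = zeros(s);            % A(i,j) = integral of l_j over [0, tau_i]
b   = zeros(1, s);         % b(j)   = integral of l_j over [0, 1]
for j = 1:s
    others = tau([1:j-1, j+1:s]);
    % Coefficients of the Lagrange basis polynomial l_j(s).
    lj = poly(others) / prod(tau(j) - others);
    % Antiderivative of l_j that vanishes at s = 0.
    Lj = polyint(lj);
    A(:, j) = polyval(Lj, tau(:));
    b(j)    = polyval(Lj, 1);
end
disp(A), disp(b)
```

Any other set of distinct parameters in $[0, 1]$ can be substituted for tau; for example, tau = [0, 1] should reproduce the coefficients of the trapezoidal method.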

The general method of forcing an approximating equation to be true at a given set of node points is called collocation, and the points $\{t_{n,i}\}$ at which equality is forced are called the collocation node points. We should note that some Runge–Kutta methods are not collocation methods. An example is the following implicit method given by Iserles [48, p. 44]:

$$\begin{array}{c|cc}
0 & 0 & 0 \\
2/3 & 1/3 & 1/3 \\ \hline
 & 1/4 & 3/4
\end{array} \tag{5.65}$$

5.6.1 Two-point collocation methods

Let $0 \le \tau_1 < \tau_2 \le 1$, and recall that $t_{n,1} = t_n + h\tau_1$ and $t_{n,2} = t_n + h\tau_2$. Then the interpolation polynomial is

$$p(r) = \frac{1}{h(\tau_2 - \tau_1)} \left[ (t_{n,2} - r)\, f(t_{n,1}, Y(t_{n,1})) + (r - t_{n,1})\, f(t_{n,2}, Y(t_{n,2})) \right]. \tag{5.66}$$


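Following calculation of the integrals, the system (5.63)–(5.64) has the Butcher tableau

$$\begin{array}{c|cc}
\tau_1 & \dfrac{\tau_2^2 - [\tau_2 - \tau_1]^2}{2[\tau_2 - \tau_1]} & \dfrac{-\tau_1^2}{2[\tau_2 - \tau_1]} \\[2ex]
\tau_2 & \dfrac{\tau_2^2}{2[\tau_2 - \tau_1]} & \dfrac{[\tau_2 - \tau_1]^2 - \tau_1^2}{2[\tau_2 - \tau_1]} \\[2ex] \hline
 & \dfrac{\tau_2^2 - [1 - \tau_2]^2}{2[\tau_2 - \tau_1]} & \dfrac{[1 - \tau_1]^2 - \tau_1^2}{2[\tau_2 - \tau_1]}
\end{array} \tag{5.67}$$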

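As a special case, note that when $\tau_1 = 0$ and $\tau_2 = 1$, the system (5.63) becomes

$$y_{n,1} = y_n,$$

$$y_{n,2} = y_n + \tfrac{1}{2} h \left[ f(t_n, y_{n,1}) + f(t_{n+1}, y_{n,2}) \right].$$

Substituting from the first equation into the second equation and using $y_{n+1} = y_{n,2}$, we have

$$y_{n+1} = y_n + \tfrac{1}{2} h \left[ f(t_n, y_n) + f(t_{n+1}, y_{n+1}) \right],$$

which is simply the trapezoidal method.

Another choice that has very good convergence and stability properties is to use

$$\tau_1 = \tfrac{1}{2} - \tfrac{1}{6}\sqrt{3}, \qquad \tau_2 = \tfrac{1}{2} + \tfrac{1}{6}\sqrt{3}. \tag{5.68}$$

The Butcher tableau is

$$\begin{array}{c|cc}
(3 - \sqrt{3})/6 & 1/4 & (3 - 2\sqrt{3})/12 \\
(3 + \sqrt{3})/6 & (3 + 2\sqrt{3})/12 & 1/4 \\ \hline
 & 1/2 & 1/2
\end{array} \tag{5.69}$$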

The associated nonlinear system is

$$y_{n,i} = y_n + h \sum_{j=1}^{2} a_{i,j} f(t_n + \tau_j h, y_{n,j}), \qquad i = 1, 2, \tag{5.70}$$

where we have used the implicit definition of $\{a_{i,j}\}$ that uses (5.59) to reference the elements in (5.69). Then

$$y_{n+1} = y_n + \frac{h}{2} \left[ f(t_{n,1}, y_{n,1}) + f(t_{n,2}, y_{n,2}) \right]. \tag{5.71}$$

This method, called the two-stage Gauss method, is exact for all polynomial solutions $Y(t)$ of degree $\le 4$. Showing that it has degree of precision 2 is straightforward, because the linear interpolation formula (5.66) is exact when $Y'(t) = f(t, Y(t))$ is linear. Proving that the degree of precision is 4 is a more substantial argument, and we refer the reader to [48, p. 46]. It can be shown that the truncation error for this method has size $O(h^5)$, and thus the convergence is $O(h^4)$. It also has desirable stability properties, some of which are taken up in Problem 15 and some of which are deferred to Chapter 9. A disadvantage of the method is the need to solve the nonlinear system in (5.70).
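To make the cost of solving (5.70) concrete, the following is a minimal sketch of the two-stage Gauss method applied to the test problem (5.56) on $[0, 10]$. It is an illustration only: the stage equations are solved by simple fixed-point iteration, which is a choice made here for brevity and is adequate only when $h$ times the Lipschitz constant of $f$ is small; it is not suitable for stiff problems. The iteration cap of 50 and the tolerance $10^{-14}$ are arbitrary assumptions.

```matlab
f     = @(t, y) -y + 2*cos(t);          % right side of (5.56)
Ytrue = @(t) sin(t) + cos(t);

% Coefficients from the tableau (5.69).
A = [1/4, (3 - 2*sqrt(3))/12; (3 + 2*sqrt(3))/12, 1/4];
b = [1/2, 1/2];
c = [(3 - sqrt(3))/6, (3 + sqrt(3))/6];

hs     = [0.25, 0.125];
maxerr = zeros(size(hs));
for k = 1:numel(hs)
    h = hs(k);  N = round(10/h);  t = 0;  y = 1;
    for n = 1:N
        z = [y; y];                      % initial guess for the stage values
        for it = 1:50                    % fixed-point iteration on (5.70)
            F    = [f(t + c(1)*h, z(1)); f(t + c(2)*h, z(2))];
            znew = y + h*(A*F);
            done = norm(znew - z, inf) < 1e-14;
            z    = znew;
            if done, break, end
        end
        F = [f(t + c(1)*h, z(1)); f(t + c(2)*h, z(2))];
        y = y + h*(b*F);                 % update (5.71)
        t = n*h;
        maxerr(k) = max(maxerr(k), abs(Ytrue(t) - y));
    end
end
ratio = maxerr(1)/maxerr(2);             % near 16 would indicate O(h^4)
disp(maxerr), disp(ratio)
```

In production codes the stage equations are normally solved with a Newton-type iteration, which is the subject taken up for stiff problems in Chapter 9.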


A number of other families of implicit Runge–Kutta methods are discussed in Chapter 9. These methods have stability properties that make them especially useful for solving stiff differential equations.

PROBLEMS

1. A Taylor method of order 3 for problem (5.1) can be obtained using the same procedure that led to (5.4). On the basis of the third-order Taylor approximation

$$Y(t_{n+1}) \approx Y(t_n) + hY'(t_n) + \frac{h^2}{2} Y''(t_n) + \frac{h^3}{6} Y'''(t_n),$$

derive the numerical method

$$y_{n+1} = y_n + h[-y_n + 2\cos(t_n)] + \frac{h^2}{2}[y_n - 2\cos(t_n) - 2\sin(t_n)] + \frac{h^3}{6}[-y_n + 2\sin(t_n)], \qquad n \ge 0. \tag{5.72}$$

Implement the numerical method (5.72) for solving the problem (5.1). Compute with stepsizes of $h = 0.1, 0.05$ for $0 \le t \le 10$. Compare to the values in Table 5.1, and also check the ratio by which the error decreases when $h$ is halved.

Hint: To simplify the programming, just modify the Euler program given in Chapter 2.

2. Compute solutions to the following problems with a second-order Taylor method. Use stepsizes $h = 0.2, 0.1, 0.05$.

(a) $Y'(t) = [\cos(Y(t))]^2$, $0 \le t \le 10$, $Y(0) = 0$; $Y(t) = \tan^{-1}(t)$.

(b) $Y'(t) = 1/(1 + t^2) - 2[Y(t)]^2$, $0 \le t \le 10$, $Y(0) = 0$; $Y(t) = t/(1 + t^2)$.

(c) $Y'(t) = \tfrac{1}{4} Y(t)[1 - \tfrac{1}{20} Y(t)]$, $0 \le t \le 20$, $Y(0) = 1$; $Y(t) = 20/(1 + 19e^{-t/4})$.

(d) $Y'(t) = -[Y(t)]^2$, $1 \le t \le 10$, $Y(1) = 1$; $Y(t) = 1/t$.

(e) $Y'(t) = -e^{-t} Y(t)$, $0 \le t \le 10$, $Y(0) = 1$; $Y(t) = \exp(e^{-t} - 1)$.

These were solved previously in Problems 1 and 2 of Chapter 2. Compare your results with those earlier ones.


3. Recall the asymptotic error for Taylor methods, given in (5.12). For second-order methods, this yields

$$Y(t_n) - y_h(t_n) = h^2 D(t_n) + O(h^3).$$

From this, derive the Richardson extrapolation formula

$$Y(t_n) = \tfrac{1}{3}[4y_h(t_n) - y_{2h}(t_n)] + O(h^3) \approx \tfrac{1}{3}[4y_h(t_n) - y_{2h}(t_n)] \equiv \widetilde{y}_h(t_n)$$

and the asymptotic error estimate

$$Y(t_n) - y_h(t_n) = \tfrac{1}{3}[y_h(t_n) - y_{2h}(t_n)] + O(h^3) \approx \tfrac{1}{3}[y_h(t_n) - y_{2h}(t_n)].$$

Hint: Consider the formula

$$Y(t_n) - y_{2h}(t_n) = 4h^2 D(t_n) + O(h^3)$$

and combine it suitably with the earlier formula for $Y(t_n) - y_h(t_n)$.

4. Repeat Problem 3 for methods of a general order $p \ge 1$. Derive the formulas

$$Y(t_n) \approx \frac{1}{2^p - 1}[2^p y_h(t_n) - y_{2h}(t_n)] \equiv \widetilde{y}_h(t_n)$$

with an error proportional to $h^{p+1}$, and

$$Y(t_n) - y_h(t_n) \approx \frac{1}{2^p - 1}[y_h(t_n) - y_{2h}(t_n)].$$

5. Use Problem 3 to estimate the errors in the results of Table 5.1, for $h = 0.05$. Also produce the Richardson extrapolate $\widetilde{y}_h(t_n)$ and calculate its error. Compare its accuracy to that of $y_h(t_n)$.

6. Derive the second-order Runge–Kutta methods (5.14) corresponding to $b_2 = \tfrac{3}{4}$ and $b_2 = 1$ in (5.15). For $b_2 = 1$, draw an illustrative graph analogous to that of Figure 5.1 for $b_2 = \tfrac{1}{2}$. Give the Butcher tableau for this method.

7. Give the Butcher tableau for each of the following methods.

(a) The second-order method (5.21).

(b) The Fehlberg formulas (5.53) and (5.54).

8. Solve the problem (5.1) with one of the formulas from Problem 6. Compare your results to those in Table 5.2 for formula (5.20) with $b_2 = \tfrac{1}{2}$.

9. Using (5.20), solve the equations in Problem 2. Estimate the error by using Problem 3, and compare it to the true error.


10. Implement the classical procedure (5.28), and apply it to the equation (5.1). Solve it with stepsizes of $h = 0.25$ and $0.125$. Compare with the results in Table 5.6, the fourth-order Fehlberg example.

Hint: Modify the Euler program of Chapter 2.

11. Use the program of Problem 10 to solve the equations in Problem 2.

12. Modify the Euler program of Chapter 3 to implement the Runge–Kutta method given in (3.26). With this program, repeat Problems 5 and 6 of Chapter 3.

13. Consider the predator-prey model of (3.4), with the particular constants $A = 4$, $B = 0.5$, $C = 3$, and $D = \tfrac{1}{3}$. Also, recall Problem 8 in Chapter 3.

(a) Show that there is a solution $Y_1(t) = C_1$, $Y_2(t) = C_2$, with $C_1$ and $C_2$ nonzero constants. What would be the physical interpretation of such a solution $Y(t)$?
Hint: What are $Y_1'(t)$ and $Y_2'(t)$ in this case?

(b) Solve this system (3.4) with $Y_1(0) = 3$, $Y_2(0) = 5$, for $0 \le t \le 4$, and use the Runge–Kutta method of Problem 12 with stepsizes of $h = 0.01$ and $0.005$. Examine and plot the values of the output in steps of $t$ of 0.1. In addition to these plots of $t$ vs. $Y_1(t)$ and $t$ vs. $Y_2(t)$, also plot $Y_1$ vs. $Y_2$.

(c) Repeat (b) for the initial values $Y_1(0) = 3$, $Y_2(0) = 1, 1.5, 1.9$ in succession. Comment on the relation of these solutions to one another and to the solution of part (a).

14. Show that the implicit Runge–Kutta method (5.65) has a truncation error of size $O(h^3)$. This can then be used to prove that the method has order of convergence 2.

15. Apply the implicit Runge–Kutta method (5.69) to the model problem

$$Y' = \lambda Y, \quad t \ge 0, \qquad Y(0) = 1.$$

(a) Show that the solution can be written as $y_n = [R(\lambda h)]^n$ with

$$R(z) = \frac{1 + \tfrac{1}{2} z + \tfrac{1}{12} z^2}{1 - \tfrac{1}{2} z + \tfrac{1}{12} z^2}.$$

(b) For any real $z < 0$ show that $|R(z)| < 1$. In fact, this bound is true for any complex $z$ with $\text{Real}(z) < 0$, and this implies that the method is absolutely stable.

16. Solve the equations of Problem 2 with the built-in ode45 function. Experiment with several choices of error tolerances, including absolute error tolerances of AbsTol $= 10^{-4}$ and $10^{-6}$, along with a relative error tolerance of RelTol $= 10^{-8}$.

17. Solve the equations of Problem 2 with the built-in ode23 function. Experiment with several choices of error tolerances, including absolute error tolerances of AbsTol $= 10^{-4}$ and $10^{-6}$, along with a relative error tolerance of RelTol $= 10^{-8}$.

18. Repeat Problem 13 using ode45.

19. Consider the motion of a particle of mass $m$ falling vertically under the earth's gravitational field, and suppose that the downward motion is opposed by a frictional force $p(v)$ dependent on the velocity $v(t)$ of the particle. Then the velocity satisfies the equation

$$m v'(t) = -mg + p(v), \qquad t \ge 0, \quad v(0) \text{ given}.$$

Let $m = 1$ kg, $g = 9.8$ m/s$^2$, and $v(0) = 0$. Solve the differential equation for $0 \le t \le 20$ and for the following choices of $p(v)$:

(a) $p(v) = -0.1v$, which is positive for a falling body.

(b) $p(v) = 0.1v^2$.

Find answers to at least three digits of accuracy. Graph the functions $v(t)$. Compare the solutions.

20. Consider solving the initial value problem

$$Y'(t) = t - Y(t)^2, \qquad Y(0) = 0$$

on the interval $0 \le t \le 20$. Create a Taylor series method of order 2. Implement it in MATLAB and use stepsizes of $h = 0.4$, $0.2$, and $0.1$ to solve for an approximation to $Y$. Estimate the error by using Problem 3. Graph the solution that you obtain.

21. Repeat Problem 20 with various initial values $Y(0)$. In particular, use $Y(0) = -0.2, -0.4, -0.6, -0.8$. Comment on your results.

22. Repeat Problems 20 and 21, but use a second-order Runge–Kutta method.

23. Repeat Problems 20 and 21, but use the MATLAB code ode45. Do not attempt to estimate the error since that is embedded in ode45.

24. Consider the problem

$$Y' = \frac{1}{t + 1} + c \cdot \tan^{-1}(Y(t)) - \frac{1}{2}, \qquad Y(0) = 0$$

with $c$ a given constant. Since $Y'(0) = \tfrac{1}{2}$, the solution $Y(t)$ is initially increasing as $t$ increases, regardless of the value of $c$. As best you can, show that there is a value of $c$, call it $c^*$, for which (1) if $c > c^*$, the solution $Y(t)$ increases indefinitely, and (2) if $c < c^*$, then $Y(t)$ increases initially, but then peaks and decreases. Using ode45, determine $c^*$ to within 0.00005, and then calculate the associated solution $Y(t)$ for $0 \le t \le 50$.

25. (a) Using the Runge–Kutta method (5.20), solve

$$Y'(t) = -Y(t) + t^{0.1}(1.1 + t), \qquad Y(0) = 0,$$

whose solution is $Y(t) = t^{1.1}$. Solve the equation on $[0, 5]$, printing the solution and the errors at $t = 1, 2, 3, 4, 5$. Use stepsizes $h = 0.1, 0.05, 0.025, 0.0125, 0.00625$. Calculate the ratios by which the errors decrease when $h$ is halved. How does this compare with the theoretical rate of convergence of $O(h^2)$? Explain your results as best you can.

(b) What difficulty arises in attempting to use a Taylor method of order $\ge 2$ to solve the equation of part (a)? What does it tell us about the solution?

26. Consider the three-stage Runge–Kutta formula

$$\begin{aligned}
z_1 &= y_n, \\
z_2 &= y_n + h a_{2,1} f(t_n, z_1), \\
z_3 &= y_n + h \left[ a_{3,1} f(t_n, z_1) + a_{3,2} f(t_n + c_2 h, z_2) \right], \\
y_{n+1} &= y_n + h \left[ b_1 f(t_n, z_1) + b_2 f(t_n + c_2 h, z_2) + b_3 f(t_n + c_3 h, z_3) \right].
\end{aligned}$$

Generalize the argument used in (5.14)–(5.19) for determining the two-stage Runge–Kutta formulas of order 2. Determine the set of equations that the coefficients $\{b_j, c_j, a_{ij}\}$ must satisfy if the formula given above is to be of order 3. Find a particular solution to these equations.

CHAPTER 6

MULTISTEP METHODS

Taylor methods and Runge–Kutta (RK) methods are known as single-step or one-step methods, since at a typical step $y_{n+1}$ is determined solely from $y_n$. In this chapter, we consider multistep methods in which the computation of the numerical solution $y_{n+1}$ uses the solution values at several previous nodes. We derive here two families of the most widely used multistep methods.

Reformulate the differential equation

$$Y'(t) = f(t, Y(t))$$

by integrating it over the interval $[t_n, t_{n+1}]$, obtaining

$$\int_{t_n}^{t_{n+1}} Y'(t)\, dt = \int_{t_n}^{t_{n+1}} f(t, Y(t))\, dt,$$

$$Y(t_{n+1}) = Y(t_n) + \int_{t_n}^{t_{n+1}} f(t, Y(t))\, dt. \tag{6.1}$$

We will develop numerical methods to compute the solution $Y(t)$ by approximating the integral in (6.1). There are many such methods, and we will consider only the most popular of them, the Adams–Bashforth (AB) and Adams–Moulton (AM) methods. These methods are the basis of some of the most widely used computer codes for solving the initial value problem. They are generally more efficient than the RK methods, especially if one wishes to find the solution with a high degree of accuracy or if the derivative function $f(t, y)$ is expensive to evaluate.

To evaluate the integral

$$\int_{t_n}^{t_{n+1}} g(t)\, dt, \qquad g(t) = Y'(t) = f(t, Y(t)), \tag{6.2}$$

we approximate $g(t)$ by using polynomial interpolation and then integrate the interpolating polynomial. For a given nonnegative integer $q$, the AB methods use the interpolation polynomial of degree $q$ at the points $\{t_n, t_{n-1}, \dots, t_{n-q}\}$, and the AM methods use the interpolation polynomial of degree $q$ at the points $\{t_{n+1}, t_n, t_{n-1}, \dots, t_{n-q+1}\}$.

6.1 ADAMS–BASHFORTH METHODS

We begin with the AB method based on linear interpolation ($q = 1$). The linear polynomial interpolating $g(t)$ at $\{t_n, t_{n-1}\}$ is

$$p_1(t) = \frac{1}{h}\left[ (t_n - t)\, g(t_{n-1}) + (t - t_{n-1})\, g(t_n) \right]. \tag{6.3}$$

From the theory of polynomial interpolation (Theorem B.3 in Appendix B),

$$g(t) - p_1(t) = \tfrac{1}{2}(t - t_n)(t - t_{n-1})\, g''(\zeta_n) \tag{6.4}$$

for some $t_{n-1} \le \zeta_n \le t_{n+1}$. Integrating over $[t_n, t_{n+1}]$, we obtain

$$\int_{t_n}^{t_{n+1}} g(t)\, dt \approx \int_{t_n}^{t_{n+1}} p_1(t)\, dt = \tfrac{1}{2} h \left[ 3g(t_n) - g(t_{n-1}) \right].$$
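The weights $\tfrac{3}{2}$ and $-\tfrac{1}{2}$ in this formula can be checked by a short calculation. The following worked step is added for illustration; it uses the substitution $t = t_n + sh$, $0 \le s \le 1$, so that $t_n - t = -sh$ and $t - t_{n-1} = (s + 1)h$ in (6.3).

```latex
\begin{aligned}
\int_{t_n}^{t_{n+1}} p_1(t)\, dt
  &= g(t_{n-1}) \int_0^1 \frac{-sh}{h}\, h\, ds
   + g(t_n) \int_0^1 \frac{(s+1)h}{h}\, h\, ds \\
  &= -\tfrac{1}{2} h\, g(t_{n-1}) + \tfrac{3}{2} h\, g(t_n)
   = \tfrac{1}{2} h \left[ 3 g(t_n) - g(t_{n-1}) \right].
\end{aligned}
```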

In fact, we can obtain the more complete result

$$\int_{t_n}^{t_{n+1}} g(t)\, dt = \tfrac{1}{2} h \left[ 3g(t_n) - g(t_{n-1}) \right] + \tfrac{5}{12} h^3 g''(\xi_n) \tag{6.5}$$

for some $t_{n-1} \le \xi_n \le t_{n+1}$; see Problem 4 for a derivation of a related but somewhat weaker result on the truncation error. Applying this to the relation (6.1) gives us

$$Y(t_{n+1}) = Y(t_n) + \tfrac{1}{2} h \left[ 3f(t_n, Y(t_n)) - f(t_{n-1}, Y(t_{n-1})) \right] + \tfrac{5}{12} h^3 Y'''(\xi_n). \tag{6.6}$$

Dropping the final term, the truncation error, we obtain the numerical method

$$y_{n+1} = y_n + \tfrac{1}{2} h \left[ 3f(t_n, y_n) - f(t_{n-1}, y_{n-1}) \right]. \tag{6.7}$$

Table 6.1 An example of the second-order Adams–Bashforth method

t     y_h(t)         Y(t) - y_2h(t)   Y(t) - y_h(t)   Ratio   (1/3)[y_h(t) - y_2h(t)]
2      0.49259722     2.13e-3          5.53e-4        3.9      5.26e-4
4     -1.41116963     2.98e-3          7.24e-4        4.1      7.52e-4
6      0.68174279    -3.91e-3         -9.88e-4        4.0     -9.73e-4
8      0.84373678     3.68e-4          1.21e-4        3.0      8.21e-5
10    -1.38398254     3.61e-3          8.90e-4        4.1      9.08e-4

With this method, note that it is necessary to have $n \ge 1$. Both $y_0$ and $y_1$ are needed in finding $y_2$, and $y_1$ cannot be found from (6.7). The value of $y_1$ must be obtained by another method. The method (6.7) is an example of a two-step method, since values at $t_{n-1}$ and $t_n$ are needed in finding the value at $t_{n+1}$. If we assume $y_0 = Y_0$, and if we can determine $y_1 \approx Y(t_1)$ with an accuracy $O(h^2)$, then the AB method (6.7) is of order 2, that is, its global error is of size $O(h^2)$,

$$\max_{t_0 \le t_n \le b} |Y(t_n) - y_h(t_n)| \le c h^2. \tag{6.8}$$

We must note that this result assumes $f(t, y)$ and $Y(t)$ are sufficiently differentiable, just as with all other similar convergence error bounds and asymptotic error results stated in this book. In this particular case (6.8), we would assume that $Y(t)$ is 3 times continuously differentiable on $[t_0, b]$ and that $f(t, y)$ satisfies the Lipschitz condition of (2.19) in Chapter 2. We usually omit the explicit statement as to the order of differentiability on $Y(t)$ being assumed, although it is usually apparent from the given error results.

Example 6.1 Use (6.7) to solve

$$Y'(t) = -Y(t) + 2\cos(t), \qquad Y(0) = 1 \tag{6.9}$$

with the solution $Y(t) = \sin(t) + \cos(t)$. For illustrative purposes only, we take $y_1 = Y(t_1)$. The numerical results are given in Table 6.1, using $h = 0.05$. Note that the errors decrease by a factor of approximately 4 when $h$ is halved, which is consistent with the numerical method being of order 2. The Richardson error estimate is also included in the table, using the formula (5.13) for second-order methods. Where the error is decreasing like $O(h^2)$, the error estimate is quite accurate.
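The last two columns of Table 6.1 can be reproduced from the two error columns alone, since $y_h - y_{2h} = [Y - y_{2h}] - [Y - y_h]$. The sketch below is illustrative: the values are copied by hand from Table 6.1 and the variable names are not from the original text.

```matlab
% Errors at t = 2, 4, 6, 8, 10, copied from Table 6.1.
err_2h = [2.13e-3, 2.98e-3, -3.91e-3, 3.68e-4, 3.61e-3];   % Y(t) - y_2h(t)
err_h  = [5.53e-4, 7.24e-4, -9.88e-4, 1.21e-4, 8.90e-4];   % Y(t) - y_h(t)

% Ratio of errors; a value near 4 is expected for a second-order method.
ratio = err_2h ./ err_h;

% Richardson estimate of Y - y_h for p = 2: (1/3)[y_h - y_2h].
estimate = (err_2h - err_h) / 3;

fprintf('%4d %8.1f %12.2e %12.2e\n', [2:2:10; ratio; estimate; err_h]);
```

The row for $t = 8$, where the ratio is about 3 rather than 4, is also the row in which the estimate is least accurate, as the example remarks.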

Adams methods are often considered to be "less expensive" than RK methods, and the main reason can be seen by comparing (6.7) with the second-order RK method in (5.20). The main task of both methods is to evaluate the derivative function $f(t, y)$. With second-order RK methods, there are two evaluations of $f$ for each step from $t_n$ to $t_{n+1}$. In contrast, the AB formula (6.7) uses only one evaluation per step, provided that past values of $f$ are reused. Other factors affect the choice of a numerical method, but the AB and AM methods are generally more efficient in the number of evaluations of $f$ that are needed for a given amount of accuracy.

A problem with multistep methods is the need to generate some of the initial values of the solution by using another method. For the second-order AB method in (6.7), we must obtain $y_1$; and since the global error in $y_h(t_n)$ is to be $O(h^2)$, we must ensure that $Y(t_1) - y_h(t_1)$ is also $O(h^2)$. There are two immediate possibilities, using methods from preceding chapters.

Case (1) Use Euler's method:

$$y_1 = y_0 + h f(t_0, y_0). \tag{6.10}$$

Assuming $y_0 = Y_0$, this has an error of

$$Y(t_1) - y_1 = \tfrac{1}{2} h^2 Y''(\xi_1)$$

based on (2.10) with $n = 0$. Thus (6.10) meets our error criteria for $y_1$. Globally, Euler's method has only $O(h)$ accuracy, but the error of a single step is $O(h^2)$.

Case (2) Use a second-order RK method, such as (5.20). Since only one step in $t$ is being used, $Y(t_1) - y_1$ will be $O(h^3)$, which is more than adequate.

Example 6.2 Combine (6.10) with (6.7) to solve the problem (6.9) from the last example. For $h = 0.05$ and $t = 10$, the error in the numerical solution turns out to be

$$Y(10) - y_h(10) \doteq 8.90 \times 10^{-4},$$

the same as before for the results in Table 6.1.
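Examples 6.1 and 6.2 can be sketched together in a few lines. The code below is a minimal illustration of (6.7) for the problem (6.9) with $h = 0.05$, run once with $y_1$ taken from the true solution and once with $y_1$ from Euler's method (6.10). Variable names such as starts and err10 are illustrative, and only one new evaluation of $f$ is made per step because the previous value is reused.

```matlab
f     = @(t, y) -y + 2*cos(t);           % right side of (6.9)
Ytrue = @(t) sin(t) + cos(t);
h = 0.05;  N = round(10/h);  t = (0:N)*h;

% Two choices of y_1: the true solution, and Euler's method (6.10).
starts = {Ytrue(h), 1 + h*f(0, 1)};
err10  = zeros(1, 2);
for k = 1:2
    y = zeros(1, N+1);
    y(1) = 1;                            % y_0 (MATLAB indices start at 1)
    y(2) = starts{k};                    % y_1
    fprev = f(t(1), y(1));               % f(t_{n-1}, y_{n-1})
    for n = 2:N
        fn     = f(t(n), y(n));          % the only new evaluation in this step
        y(n+1) = y(n) + (h/2)*(3*fn - fprev);   % AB method (6.7)
        fprev  = fn;
    end
    err10(k) = Ytrue(10) - y(end);       % error at t = 10
end
disp(err10)
```

The two entries of err10 can be compared with the value reported for $t = 10$ in Table 6.1 and in Example 6.2.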

Higher-order Adams–Bashforth methods are obtained by using higher-degree polynomial interpolation in the approximation of the integrand in (6.2). (For an introduction to polynomial interpolation, see Appendix B.) The next higher-order example following the linear interpolation of (6.3) uses quadratic interpolation. Let $p_2(t)$ denote the quadratic polynomial that interpolates $g(t)$ at $t_n, t_{n-1}, t_{n-2}$, and then use

$$\int_{t_n}^{t_{n+1}} g(t)\, dt \approx \int_{t_n}^{t_{n+1}} p_2(t)\, dt.$$

To be more explicit, we may write

$$p_2(t) = g(t_n)\ell_0(t) + g(t_{n-1})\ell_1(t) + g(t_{n-2})\ell_2(t) \tag{6.11}$$

with

$$\begin{aligned}
\ell_0(t) &= \frac{(t - t_{n-1})(t - t_{n-2})}{2h^2}, \\
\ell_1(t) &= -\frac{(t - t_n)(t - t_{n-2})}{h^2}, \\
\ell_2(t) &= \frac{(t - t_n)(t - t_{n-1})}{2h^2}.
\end{aligned} \tag{6.12}$$

For the error, we have

$$g(t) - p_2(t) = \tfrac{1}{6}(t - t_n)(t - t_{n-1})(t - t_{n-2})\, g'''(\zeta_n) \tag{6.13}$$

for some $t_{n-2} \le \zeta_n \le t_{n+1}$. It can be shown that

$$\int_{t_n}^{t_{n+1}} g(t)\, dt = \tfrac{1}{12} h \left[ 23g(t_n) - 16g(t_{n-1}) + 5g(t_{n-2}) \right] + \tfrac{3}{8} h^4 g'''(\xi_n)$$

for some $t_{n-2} \le \xi_n \le t_{n+1}$. Applying this to (6.1), the integral formulation of the differential equation, we obtain

$$Y(t_{n+1}) = Y(t_n) + \tfrac{1}{12} h \left[ 23f(t_n, Y(t_n)) - 16f(t_{n-1}, Y(t_{n-1})) + 5f(t_{n-2}, Y(t_{n-2})) \right] + \tfrac{3}{8} h^4 Y^{(4)}(\xi_n).$$

By dropping the last term, the truncation error, we obtain the third-order AB method

$$y_{n+1} = y_n + \tfrac{1}{12} h \left[ 23y'_n - 16y'_{n-1} + 5y'_{n-2} \right], \qquad n \ge 2, \tag{6.14}$$

where $y'_k \equiv f(t_k, y_k)$, $k \ge 0$. This is a three-step method, requiring $n \ge 2$. Thus $y_1, y_2$ must be obtained separately by other methods. We leave the implementation and illustration of (6.14) as Problem 2 for the reader.

In general, it can be shown that the AB method based on interpolation of degree $q$ will be a $(q + 1)$-step method, and its truncation error will be of the form

$$T_{n+1} = c_q h^{q+2} Y^{(q+2)}(\xi_n)$$

for some $t_{n-q} \le \xi_n \le t_{n+1}$. The initial values $y_1, \dots, y_q$ will have to be generated by other methods. If the errors in these initial values satisfy

$$Y(t_n) - y_h(t_n) = O(h^{q+1}), \qquad n = 1, 2, \dots, q, \tag{6.15}$$

then the global error in the $(q + 1)$-step AB method will also be $O(h^{q+1})$, provided that the true solution $Y$ is sufficiently differentiable. In addition, the global error will satisfy an asymptotic error formula

$$Y(t_n) - y_h(t_n) = D(t_n) h^{q+1} + O(h^{q+2}),$$

much as was true earlier for the Taylor and RK methods described in Chapter 5. Thus Richardson's extrapolation can be used to accelerate the convergence of the method and to estimate the error.

To generate the initial values $y_1, \dots, y_q$ for the $(q + 1)$-step AB method, and to have their errors satisfy the requirement (6.15), it is sufficient to use an RK method of order $q$. However, in many instances, people prefer to use an RK method of order $q + 1$, the same order as that of the $(q + 1)$-step AB method. Other procedures are used in the automatic computer programs for AB methods, and we discuss them later in this chapter.


Table 6.2 Adams–Bashforth methods

q   Order   Method                                                                     T. Error
0   1       $y_{n+1} = y_n + h y'_n$                                                   $\tfrac{1}{2} h^2 Y''(\xi_n)$
1   2       $y_{n+1} = y_n + \tfrac{h}{2}[3y'_n - y'_{n-1}]$                           $\tfrac{5}{12} h^3 Y'''(\xi_n)$
2   3       $y_{n+1} = y_n + \tfrac{h}{12}[23y'_n - 16y'_{n-1} + 5y'_{n-2}]$           $\tfrac{3}{8} h^4 Y^{(4)}(\xi_n)$
3   4       $y_{n+1} = y_n + \tfrac{h}{24}[55y'_n - 59y'_{n-1} + 37y'_{n-2} - 9y'_{n-3}]$   $\tfrac{251}{720} h^5 Y^{(5)}(\xi_n)$

Table 6.3 Example of the fourth-order Adams–Bashforth method

t     y_h(t)         Y(t) - y_2h(t)   Y(t) - y_h(t)   Ratio   (1/15)[y_h(t) - y_2h(t)]
2      0.49318680    -3.96e-4         -3.62e-5        10.9    -2.25e-5
4     -1.41037698    -1.25e-3         -6.91e-5        18.1    -7.37e-5
6      0.68067962     1.05e-3          7.52e-5        14.0     6.12e-5
8      0.84385416     3.26e-4          4.06e-6        80.0     2.01e-5
10    -1.38301376    -1.33e-3         -7.89e-5        16.9    -7.82e-5

The AB methods of orders 1 through 4 are given in Table 6.2; the column heading "T. Error" denotes "Truncation Error". The order 1 formula is simply Euler's method. In the table, $y'_k \equiv f(t_k, y_k)$.
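The weights in Table 6.2 can be regenerated from the construction described above: interpolate at $t_n, t_{n-1}, \dots, t_{n-q}$ and integrate each Lagrange basis polynomial over $[t_n, t_{n+1}]$. The sketch below is an illustration that works in the normalized variable $s = (t - t_n)/h$, so the nodes are $0, -1, \dots, -q$ and the integration interval is $[0, 1]$; it uses the built-in functions poly, polyint, and polyval, and the names denom and coeffs are illustrative.

```matlab
denom  = [1, 2, 12, 24];        % common denominators seen in Table 6.2
coeffs = cell(1, 4);            % coeffs{q+1} holds the weights for degree q
for q = 0:3
    nodes = 0:-1:-q;            % s = 0, -1, ..., -q  <->  t_n, ..., t_{n-q}
    beta  = zeros(1, q+1);
    for j = 1:q+1
        others = nodes([1:j-1, j+1:q+1]);
        lj = poly(others) / prod(nodes(j) - others);   % Lagrange basis l_j(s)
        Lj = polyint(lj);                              % antiderivative of l_j
        beta(j) = polyval(Lj, 1) - polyval(Lj, 0);     % integral over [0, 1]
    end
    coeffs{q+1} = beta;
    fprintf('q = %d: [%s] / %d\n', q, num2str(beta * denom(q+1)), denom(q+1));
end
```

Replacing the nodes by $1, 0, \dots, -q+1$ in the same loop gives the corresponding Adams–Moulton weights.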

Example 6.3 Solve the problem (6.9) by using the fourth-order AB method. Since we are illustrating the AB method, we simply generate the initial values $y_1, y_2, y_3$ by using the true solution,

$$y_i = Y(t_i), \qquad i = 1, 2, 3.$$

The results for $h = 0.125$ and $2h = 0.25$ are given in Table 6.3. Richardson's error estimate for a fourth-order method is given in the last column. For a fourth-order method, the error should decrease by a factor of approximately 16 when $h$ is halved. In those cases where this is true, Richardson's error estimate is accurate. In no case is the error badly underestimated.

Comparing these results with those in Table 5.6 for the fourth-order Fehlberg method, we see that the present errors appear to be very large. But note that the Fehlberg formula uses five evaluations of $f(t, y)$ for each step from $t_n$ to $t_{n+1}$, whereas the fourth-order AB method uses only one evaluation of $f$ per step, assuming that previous evaluations are reused. If this AB method is used with an $h$ that is only $\tfrac{1}{5}$ as large (for a comparable number of evaluations of $f$), then the present errors will decrease by a factor of approximately $5^4 = 625$. The AB errors will be mostly smaller than those of the Fehlberg method in Table 5.6, and the work will be comparable (measured by the number of evaluations of $f$).

Table 6.4 Example of the Adams–Moulton method of order 2

t     Y(t) - y_2h(t)   Y(t) - y_h(t)   Ratio   (1/3)[y_h(t) - y_2h(t)]
2     -4.59e-4         -1.15e-4        4.0     -1.15e-4
4     -5.61e-4         -1.40e-4        4.0     -1.40e-4
6      7.98e-4          2.00e-4        4.0      2.00e-4
8     -1.21e-4         -3.04e-5        4.0     -3.03e-4
10    -7.00e-4         -1.75e-4        4.0     -1.28e-4

6.2 ADAMS–MOULTON METHODS

As with the AB methods, we begin our presentation of AM methods by considering the method based on linear interpolation. Let $p_1(t)$ be the linear polynomial that interpolates $g(t)$ at $t_n$ and $t_{n+1}$,

$$p_1(t) = \frac{1}{h}\left[ (t_{n+1} - t)\, g(t_n) + (t - t_n)\, g(t_{n+1}) \right].$$

Using this equation to approximate the integrand in (6.2), we obtain the trapezoidal rule discussed in Chapter 4,

$$Y(t_{n+1}) = Y(t_n) + \tfrac{1}{2} h \left[ f(t_n, Y(t_n)) + f(t_{n+1}, Y(t_{n+1})) \right] - \tfrac{1}{12} h^3 Y'''(\xi_n). \tag{6.16}$$

Dropping the last term, the truncation error, we obtain the AM method

$$y_{n+1} = y_n + \tfrac{1}{2} h \left[ f(t_n, y_n) + f(t_{n+1}, y_{n+1}) \right], \qquad n \ge 0. \tag{6.17}$$

This is the trapezoidal method discussed in Section 4.2. It is a second-order method and has a global error of size $O(h^2)$. Moreover, it is absolutely stable.

Example 6.4 Solve the earlier problem (6.9) by using the AM method (6.17) (the trapezoidal method). The results are given in Table 6.4 for $h = 0.05$, $2h = 0.1$, and the Richardson error estimate for second-order methods is given in the last column. In this case, the $O(h^2)$ error behavior is very apparent, and the error estimation is very accurate.
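For the particular problem (6.9), the implicit equation (6.17) can be solved in closed form, because $f(t, y) = -y + 2\cos(t)$ is linear in $y$. Collecting the $y_{n+1}$ terms gives $(1 + \tfrac{1}{2}h)\, y_{n+1} = (1 - \tfrac{1}{2}h)\, y_n + h[\cos(t_n) + \cos(t_{n+1})]$. The sketch below uses this rearrangement, which is specific to this linear test equation and is not a general solver for (6.17); the variable names are illustrative.

```matlab
Ytrue = @(t) sin(t) + cos(t);
h = 0.05;  N = round(10/h);
y = 1;                                   % y_0 = Y(0)
for n = 0:N-1
    tn = n*h;
    % Trapezoidal method (6.17), solved exactly for y_{n+1} using the
    % linearity of f(t,y) = -y + 2cos(t) in y.
    y = ((1 - h/2)*y + h*(cos(tn) + cos(tn + h))) / (1 + h/2);
end
err10 = Ytrue(10) - y;                   % compare with the t = 10 row of Table 6.4
disp(err10)
```

For a nonlinear $f$, the same step would require an iteration such as the one described following (4.28) in Chapter 4.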

Example 6.5 Repeat Example 6.4, but using the procedure described following (4.28) in Chapter 4, with only one iterate being computed for each $n$. Then, the errors do not change significantly from those given in Table 6.4. For example, with $t = 10$ and $h = 0.05$, the error is

$$Y(10) - y_h(10) \doteq -2.02 \times 10^{-4}.$$

This is not very different from the value of $-1.75 \times 10^{-4}$ given in Table 6.4. The use of the iterate $y^{(1)}_{n+1}$ as the root $y_{n+1}$ will not affect significantly the accuracy of the solution for most differential equations. Stiff differential equations are a major exception.

Table 6.5 Adams–Moulton methods

q   Order   Method                                                                       T. Error
0   1       $y_{n+1} = y_n + h y'_{n+1}$                                                 $-\tfrac{1}{2} h^2 Y''(\xi_n)$
1   2       $y_{n+1} = y_n + \tfrac{h}{2}[y'_{n+1} + y'_n]$                              $-\tfrac{1}{12} h^3 Y'''(\xi_n)$
2   3       $y_{n+1} = y_n + \tfrac{h}{12}[5y'_{n+1} + 8y'_n - y'_{n-1}]$                $-\tfrac{1}{24} h^4 Y^{(4)}(\xi_n)$
3   4       $y_{n+1} = y_n + \tfrac{h}{24}[9y'_{n+1} + 19y'_n - 5y'_{n-1} + y'_{n-2}]$   $-\tfrac{19}{720} h^5 Y^{(5)}(\xi_n)$

By integrating the polynomial of degree $q$ that interpolates on the set of the nodes $\{t_{n+1}, t_n, \ldots, t_{n-q+1}\}$ to the function $g(t)$ of (6.2), we obtain the AM method of order $q + 1$. It will be an implicit method, but in other respects the theory is the same as for the AB methods described previously. The AM methods of orders 1 through 4 are given in Table 6.5, where $y'_k \equiv f(t_k, y_k)$. As in Table 6.2, the column heading "T. Error" denotes "Truncation Error". Note that the AM method of order 1 is the backward Euler method, and the AM method of order 2 is the trapezoidal method.

The effective cost of an AM method is two evaluations of the derivative $f(t, y)$ per step in most cases, assuming that previous function values of $f$ are reused. This includes one evaluation of $f$ to calculate an initial guess $y^{(0)}_{n+1}$, and then one evaluation of $f$ in the iteration formula for the AM method. For example, with the trapezoidal method this means using the calculation

$$\begin{aligned}
y^{(0)}_{n+1} &= y_n + \tfrac{1}{2}h\left[3f(t_n, y_n) - f(t_{n-1}, y_{n-1})\right], \\
y^{(1)}_{n+1} &= y_n + \tfrac{1}{2}h\left[f(t_n, y_n) + f(t_{n+1}, y^{(0)}_{n+1})\right],
\end{aligned} \tag{6.18}$$

or using some other predictor formula for $y^{(0)}_{n+1}$ with an equivalent accuracy. With this calculation, there is no significant gain in accuracy over the AB method of the same order when comparing methods of equivalent cost.
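The two-evaluation cost of the pair (6.18) is easy to see in code. The Python sketch below is a minimal illustration under stated assumptions: the function name `ab2_am2`, the evaluation counter, and the start-up step (an Euler predictor followed by one trapezoidal correction) are choices made here, not taken from the text.

```python
import math

def ab2_am2(f, t0, y0, t_end, h):
    """Predictor-corrector pair (6.18): AB2 predictor, one trapezoidal correction.

    Returns the nodes, the approximate solution, and the number of
    evaluations of f that were used.
    """
    n_steps = round((t_end - t0) / h)
    t = [t0 + i * h for i in range(n_steps + 1)]
    y = [y0]
    evals = 0
    f_prev = f(t[0], y[0]); evals += 1
    # Start-up step (assumed here): Euler predictor, one trapezoidal correction.
    y_pred = y[0] + h * f_prev
    y.append(y[0] + 0.5 * h * (f_prev + f(t[1], y_pred))); evals += 1
    for n in range(1, n_steps):
        f_curr = f(t[n], y[n]); evals += 1
        y_pred = y[n] + 0.5 * h * (3.0 * f_curr - f_prev)           # y^(0)_{n+1}
        y_corr = y[n] + 0.5 * h * (f_curr + f(t[n + 1], y_pred))    # y^(1)_{n+1}
        evals += 1
        y.append(y_corr)
        f_prev = f_curr   # reuse the stored derivative value
    return t, y, evals

def rhs(t, y):
    return -y + 2.0 * math.cos(t)     # test equation with solution sin t + cos t

t_a, y_a, evals_a = ab2_am2(rhs, 0.0, 1.0, 10.0, 0.1)
t_b, y_b, evals_b = ab2_am2(rhs, 0.0, 1.0, 10.0, 0.05)
err_a = math.sin(10.0) + math.cos(10.0) - y_a[-1]
err_b = math.sin(10.0) + math.cos(10.0) - y_b[-1]
print(evals_b, err_a / err_b)
```

Storing `f_prev` is what keeps the count at two evaluations per step; recomputing $f(t_{n-1}, y_{n-1})$ would raise it to three.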

Nonetheless, AM methods possess other properties that make them desirable for use in many types of differential equations. The desirable features relate to stability characteristics of numerical methods. Recall from Chapter 4, following (4.3), that we study the behavior of a numerical method when applied to the model problem

$$Y'(t) = \lambda Y(t), \quad t > 0, \qquad Y(0) = 1. \tag{6.19}$$

We always assume the constant $\lambda < 0$ or $\lambda$ is complex with $\mathrm{Real}(\lambda) < 0$. The true solution of the problem (6.19) is $Y(t) = e^{\lambda t}$, which decays exponentially in $t$ since the parameter $\lambda$ has a negative real part. The kind of stability property that we would like for a numerical method is that when it is applied to (6.19), the numerical solution satisfies

$$y_h(t_n) \to 0 \quad \text{as } t_n \to \infty \tag{6.20}$$

for any choice of stepsize $h$. With most numerical methods, this is not satisfied. The set of values $h\lambda$, considered as a subset of the complex plane, for which $y_n \to 0$ as $n \to \infty$, is called the region of absolute stability of the numerical method.

As seen in Chapter 4, the AM methods of orders 1 and 2 are absolutely stable, satisfying (6.20) for all values of $h$. Such methods are particularly suitable for solving stiff differential equations. In general, we prefer numerical methods with a larger region of absolute stability; the larger the region, the less restrictive the condition on $h$ needed to ensure satisfaction of (6.20) for the model problem (6.19). Thus a method with a large region of absolute stability is generally preferred over a method with a smaller region, provided that the accuracy of the two methods is similar. It can be shown that for AB and AM methods of equal order, the AM method will have the larger region of absolute stability; see Figures 8.1 and 8.2 in Chapter 8. Consequently, Adams–Moulton methods are generally preferred over Adams–Bashforth methods.

Example 6.6 Applying the AB method of order 2 to equation (6.19) leads to the finite difference equation

$$y_{n+1} = y_n + \tfrac{1}{2}h\lambda\,(3y_n - y_{n-1}), \quad n = 1, 2, \ldots \tag{6.21}$$

with $y_0$ and $y_1$ determined beforehand. Jumping ahead to (7.45) in Chapter 7, the solution to this finite difference equation is given by

$$y_h(t_n) = \gamma_0\,[r_0(h\lambda)]^n + \gamma_1\,[r_1(h\lambda)]^n, \quad n \ge 0 \tag{6.22}$$

with $r_0(h\lambda)$ and $r_1(h\lambda)$ the roots of the quadratic polynomial

$$r^2 = r + \tfrac{1}{2}h\lambda\,(3r - 1). \tag{6.23}$$

When $\lambda = 0$, one of the roots equals 1, and we arbitrarily denote that root by $r_0(h\lambda)$ in general: $r_0(0) = 1$. The constants $\gamma_0$ and $\gamma_1$ are determined from $y_0$ and $y_1$. In order to satisfy (6.20) for a given choice of $h\lambda$ and for any choice of $\gamma_0$ and $\gamma_1$, it is necessary to have

$$|r_0(h\lambda)| < 1, \qquad |r_1(h\lambda)| < 1. \tag{6.24}$$

Solving this pair of inequalities for the case that $\lambda$ is real, and looking only at the case that $\lambda < 0$, we obtain

$$-1 < h\lambda < 0 \tag{6.25}$$

as the region of absolute stability on the real axis. In contrast, the AM method of order 2 has $-\infty < h\lambda < 0$ on the real axis of its region of stability. There is no stability restriction on $h$ with this AM method.
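The interval (6.25) can be checked numerically by computing the two roots of (6.23) directly. The following Python sketch is a small illustration; the function names are hypothetical, and the trapezoidal amplification factor $(1 + z/2)/(1 - z/2)$ is obtained here by applying (6.17) to the model problem (6.19).

```python
import cmath

def ab2_roots(z):
    """Roots of r^2 - (1 + 3z/2) r + z/2 = 0, which is (6.23) with z = h*lambda."""
    b = -(1.0 + 1.5 * z)
    c = 0.5 * z
    disc = cmath.sqrt(b * b - 4.0 * c)
    return (-b + disc) / 2.0, (-b - disc) / 2.0

def ab2_is_stable(z):
    """True when both roots satisfy (6.24)."""
    return all(abs(r) < 1.0 for r in ab2_roots(z))

def trapezoid_factor(z):
    """y_{n+1} / y_n for the AM method of order 2 applied to (6.19)."""
    return (1.0 + 0.5 * z) / (1.0 - 0.5 * z)

for z in (-0.01, -0.5, -0.99, -1.01, -1.5):
    print(z, ab2_is_stable(z), abs(trapezoid_factor(z)) < 1.0)
```

For real negative $z$ the AB check should switch from stable to unstable as $z$ passes through $-1$, while the trapezoidal factor stays below 1 in magnitude throughout.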


6.3 COMPUTER CODES

Some of the most popular computer codes for solving the initial value problem are based on using AM and AB methods in combination, as suggested in the discussion preceding (6.18). These codes control the truncation error by varying both the stepsize $h$ and the order of the method. They are self-starting in terms of generating the initial values $y_1, \ldots, y_q$ needed with higher-order methods of order $q + 1$. To generate these values, they begin with first-order methods and a small stepsize $h$ and then increase the order to generate the starting values needed with higher-order methods. The possible order is allowed to be as large as 12 or more; this results in a very efficient numerical method when the solution $Y(t)$ has several continuous derivatives and is slowly varying. A comprehensive discussion of Adams' methods and an example of one such computer code is given in Shampine [72].

MATLAB® program. To facilitate the illustrative programming of the methods of this chapter, we present a modification of the Euler program of Chapter 2. The program implements the Adams–Bashforth formula of order 2, given in (6.7); and it uses Euler's method to generate the first value $y_1$ as in (6.10). We defer to the Problems section the experimental use of this program.

    function [t,y] = AB2(t0,y0,t_end,h,fcn)
    %
    % function [t,y]=AB2(t0,y0,t_end,h,fcn)
    %
    % Solve the initial value problem
    %    y' = f(t,y),  t0 <= t <= b,  y(t0)=y0
    % Use Adams-Bashforth formula of order 2 with
    % a stepsize of h. Euler's method is used for
    % the value y1. The user must supply a program for
    % the right side function defining the differential
    % equation. For some name, say deriv, use a first
    % line of the form
    %    function dy = deriv(t,y)
    % A sample call would be
    %    [t,z]=AB2(t0,z0,b,delta,'deriv')
    %
    % Output:
    % The routine AB2 will return two vectors, t and y.
    % The vector t will contain the node points
    %    t(1)=t0, t(j)=t0+(j-1)*h, j=1,2,...,N
    % with
    %    t(N) <= t_end,  t(N)+h > t_end
    % The vector y will contain the approximate solution
    % values at the node points in t.
    %
    % The next two statements are not legible in the source;
    % they are reconstructed so that t(N) <= t_end.
    n = fix((t_end-t0)/h)+1;
    t = t0+(0:n-1)'*h;
    y = zeros(n,1);
    y(1) = y0;
    ft1 = feval(fcn,t(1),y(1));       % f at the older node
    y(2) = y(1)+h*ft1;                % Euler start-up step
    for i = 3:n
      ft2 = feval(fcn,t(i-1),y(i-1)); % f at the newest node
      y(i) = y(i-1)+h*(3*ft2-ft1)/2;  % AB formula of order 2
      ft1 = ft2;                      % reuse the derivative value
    end

Figure 6.1 The solution values to (6.26) obtained by ode113 are indicated by the symbol o. The curve is obtained by interpolating these solution values from ode113 using deval.

6.3.1 MATLAB ODE codes

Built-in MATLAB programs based on multistep methods are ode113 and ode15s. These programs implement explicit and implicit linear multistep methods of various orders, respectively. The program ode113 is used to solve nonstiff ordinary differential equations, using the Adams–Bashforth and Adams–Moulton methods presented in this chapter. The code ode15s is for stiff ordinary differential equations, and it is based on yet another variable order family of multistep methods, one that is discussed in Chapter 8. The programs are used in precisely the same manner as the program ode45 discussed in Section 5.5 of Chapter 5; and the entire suite of MATLAB ode programs is discussed at length by Shampine and Reichelt [73]. Also, see Shampine [72] for a thorough study of one-step and multistep methods and of their implementation in computer software.

Figure 6.2 The errors in the solution to (6.26) obtained using ode113. The errors at the node points are indicated by the symbol o.

Example 6.7 We modify the program test_ode45 by replacing ode45 with ode113 throughout the code. The program ode113 is recommended for medium- to high-accuracy solutions, but we will illustrate its use with the same example as in Section 5.5 of Chapter 5 for the program ode45. As before, we solve the test equation

$$Y'(t) = -Y(t) + 2\cos(t), \qquad Y(0) = 1 \tag{6.26}$$

and we use AbsTol $= 10^{-6}$, RelTol $= 10^{-4}$. Figures 6.1 and 6.2 illustrate, respectively, the interpolated numerical solution and the error contained therein. Compare these results to those in Figures 5.2 and 5.3 of Chapter 5. There are 229 derivative evaluations when using ode45 for this problem, whereas ode113 uses 132 evaluations. This is a typical example for comparison of the number of derivative evaluations.

PROBLEMS

1. Use the MATLAB program for the AB method of order two to solve the equations in Problem 2 of Chapter 5. Include the Richardson error estimate for $y_h(t)$ when $h = 0.1$ and $0.05$.

2. Modify the MATLAB program of this chapter to use the third-order AB method. To calculate $y_1$ and $y_2$, use one of the second-order RK methods from Chapter 5. Then repeat Problem 1.

3. Use the program from Problem 2 to solve the continuing example problem (6.9).

4. To make the error term in (6.5) a bit more believable, prove

$$\int_0^h \gamma(s)\,ds - \tfrac{1}{2}h\,[3\gamma(0) - \gamma(-h)] = \tfrac{5}{12}h^3\gamma''(0) + O(h^4)$$

with $\gamma(s)$ a 3 times continuously differentiable function for $-h \le s \le h$. Hint: Expand $\gamma(s)$ as a quadratic Taylor polynomial about the origin, with an error term $R_3(s)$. Substitute that Taylor expansion into the left side of the equation above, and obtain the right side. For simplicity, we have changed the interval in (6.5) from $[t_n, t_{n+1}]$ to $[0, h]$. The result extends to (6.5) by means of a simple change of variable in (6.5), namely, $t = t_n + s$, $0 \le s \le h$. Also note that if $-h \le \xi \le h$, then

$$\begin{aligned}
\gamma''(\xi) &= \gamma''(0) + \xi\,\gamma'''(\zeta), \quad \text{some } \zeta \text{ between } 0 \text{ and } \xi \\
&= \gamma''(0) + O(h), \\
\tfrac{5}{12}h^3\gamma''(\xi) &= \tfrac{5}{12}h^3\gamma''(0) + O(h^4),
\end{aligned}$$

since $|\xi| \le h$. This argument assumes $\gamma(s)$ is 3 times continuously differentiable.

5. Repeat the type of argument given in Problem 4, extending it to the Adams–Bashforth method of order 3, given in Table 6.2.

6. Repeat the type of argument given in Problem 4, extending it to the Adams–Moulton method of order 2, given in Table 6.5.

7. Repeat the type of argument given in Problem 4, extending it to the Adams–Moulton method of order 3, given in Table 6.5.

8. Modify the MATLAB program of this chapter to use the AM method of order 2. For the predictor, use the AB method of order 2; for the first step $y_1$, use the Euler predictor. Iterate the formula (4.25) only once. Apply this to the solution of the equations considered in Problem 1, and produce the Richardson error estimate.

9. Use the MATLAB code ode113 to solve the equations in Problem 2 of Chapter 5. For error tolerances, use absolute error bounds AbsTol $= 10^{-4}$ and $\epsilon = 10^{-6}$, along with a relative error tolerance RelTol $= 10^{-8}$. Keep track of the number of evaluations of $f(t, y)$ that are used by the routine, and compare it to the number used in your own programs for the Adams–Bashforth and Adams–Moulton methods.


10. (a) Using the program of Problem 1 for the AB method of order 2, solve

$$Y'(t) = -50\,Y(t) + 51\cos(t) + 49\sin(t), \qquad Y(0) = 1$$

for $0 \le t \le 10$. The solution is $Y(t) = \sin(t) + \cos(t)$. Use stepsizes of $h = 0.1, 0.02, 0.01$. In each case, print the errors as well as the answers.

(b) Using the program of Problem 8 for the AM method of order 2, repeat part (a). Check the condition of (4.26).

(c) When the AM method of order 2 is applied to the equation in (a), the value of $y_{n+1}$ can be found directly. While doing so, repeat part (a). Compare your results.

11. The Adams–Bashforth and Adams–Moulton methods are based on (6.1) together with the integration over $[t_n, t_{n+1}]$ of a polynomial interpolating the integrand $Y'(t) = f(t, Y(t))$. As an alternative, consider integration over $[t_{n-1}, t_{n+1}]$, obtaining

$$Y(t_{n+1}) = Y(t_{n-1}) + \int_{t_{n-1}}^{t_{n+1}} f(t, Y(t))\,dt. \tag{6.27}$$

We can replace the integrand $f(t, Y(t))$ with an approximation based on interpolation. The simplest example is to use a constant interpolant; in particular,

$$\int_{t_{n-1}}^{t_{n+1}} f(t, Y(t))\,dt \approx \int_{t_{n-1}}^{t_{n+1}} f(t_n, Y(t_n))\,dt = 2h\,f(t_n, Y(t_n)).$$

This leads to the numerical method

$$y_{n+1} = y_{n-1} + 2h\,f(t_n, y_n), \quad n \ge 1. \tag{6.28}$$

This is called the midpoint method. As with the Adams–Bashforth method (6.7) of order 2, the value of $y_1$ must be obtained by other means. Using the type of argument given in Problem 4, show that

$$Y(t_{n+1}) - \left[Y(t_{n-1}) + 2h\,f(t_n, Y(t_n))\right] = \tfrac{1}{3}h^3\,Y'''(t_n) + O(h^4).$$

Hint: Expand $Y(t)$ as a quadratic Taylor polynomial about $t_n$, with an error term $R_3(t)$. Substitute that Taylor expansion into the left side of the equation above to obtain the right side.

12. Using the same arguments as in Problem 11, consider interpolating $Y'(t) = f(t, Y(t))$ with a quadratic polynomial. Have it interpolate $Y'(t) = f(t, Y(t))$ at the nodes $\{t_{n-1}, t_n, t_{n+1}\}$. Use this to obtain the numerical method

$$y_{n+1} = y_{n-1} + \tfrac{1}{3}h\left[f(t_{n-1}, y_{n-1}) + 4f(t_n, y_n) + f(t_{n+1}, y_{n+1})\right]. \tag{6.29}$$

As with the Adams–Moulton methods, this is an implicit method and the value of $y_{n+1}$ must be calculated by a rootfinding method. Also, the value of $y_1$ must be obtained by other means.

This is Simpson's parabolic rule for numerical integration, and when applied, as here, to solving differential equations, it is one part of Milne's method, which is mainly of historical interest, as the family of Adams methods has replaced it in modern codes. We return to Simpson's rule, however, when developing numerical methods for solving Volterra integral equations in Chapter 12.

13. As an alternative to (6.27), consider

$$Y(t_{n+1}) = Y(t_{n-3}) + \int_{t_{n-3}}^{t_{n+1}} f(t, Y(t))\,dt.$$

Using the same arguments as in Problem 12, consider interpolating $Y'(t) = f(t, Y(t))$ with a quadratic polynomial, but have it interpolate $Y'(t)$ at the nodes $\{t_{n-2}, t_{n-1}, t_n\}$. Use this to obtain the numerical method

$$y_{n+1} = y_{n-3} + \tfrac{4}{3}h\left[2f(t_{n-2}, y_{n-2}) - f(t_{n-1}, y_{n-1}) + 2f(t_n, y_n)\right]. \tag{6.30}$$

This is an explicit method, and historically it has been used to estimate an initial value $y^{(0)}_{n+1}$ for the iterative solution of equation (6.29) in Problem 12, thus forming the other half of Milne's method. The values of $y_1, y_2, y_3$ must be obtained by other means.

14. Repeat Problems 20 and 21 of Chapter 5 using the MATLAB code ode113. Do not attempt to estimate the error since that is embedded in ode113.

15. Repeat Problem 24 of Chapter 5 using the MATLAB code ode113.

CHAPTER 7

GENERAL ERROR ANALYSIS FOR MULTISTEP METHODS

We now present a general error analysis for multistep methods in solving the initial value problem of a single first-order equation. In addition to explaining the underlying behavior of the numerical methods, such a general error analysis allows us to design better numerical procedures for various classes of problems. We begin by considering the truncation error for multistep methods. Next, in Section 7.2, we look at a relatively simple error analysis that is similar to that given for Euler's method in Chapter 2; it is an error analysis that works for many popular multistep methods. In Section 7.3 we give a complete error analysis for all multistep methods, and we follow it with some examples.

As before, let $h > 0$ and define the nodes by $t_n = t_0 + nh$, $n \ge 0$. The general form of the multistep methods to be considered is

$$y_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} + h\sum_{j=-1}^{p} b_j f(t_{n-j}, y_{n-j}), \quad n \ge p. \tag{7.1}$$

The coefficients $a_0, \ldots, a_p, b_{-1}, b_0, \ldots, b_p$ are constants and $p \ge 0$. Assuming that $|a_p| + |b_p| \ne 0$, we consider this method a $(p+1)$-step method, because $p + 1$ previous solution values are being used to compute $y_{n+1}$. The values $y_1, \ldots, y_p$ must be obtained by other means, as was illustrated in Chapter 6 with the Adams methods. Euler's method is an example of a one-step method with $p = 0$ and

$$a_0 = 1, \qquad b_0 = 1, \qquad b_{-1} = 0.$$

If $b_{-1} = 0$, then $y_{n+1}$ occurs on only the left side of equation (7.1). Such formulas are called explicit methods. If $b_{-1} \ne 0$, then $y_{n+1}$ is present on both sides of (7.1), and the formula is called an implicit method. As was discussed following (4.12) in Chapter 4 for the backward Euler method, the solution $y_{n+1}$ can be computed by fixed point iteration,

$$y^{(i+1)}_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} + h\sum_{j=0}^{p} b_j f(t_{n-j}, y_{n-j}) + h\,b_{-1} f(t_{n+1}, y^{(i)}_{n+1}), \quad i = 0, 1, \ldots,$$

provided $h$ is chosen sufficiently small.
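A single step of (7.1), including the fixed point iteration for the implicit case, can be sketched as follows. This Python fragment is an illustration under assumptions made here: the function name, the argument layout (histories stored newest first), and the fixed iteration count are hypothetical choices, and no convergence test is performed.

```python
def multistep_step(f, t_next, h, a, b, b_minus1, y_hist, f_hist, y_guess, iters=3):
    """One step of the multistep method (7.1).

    a[j], b[j]     : coefficients a_j, b_j for j = 0..p
    b_minus1       : coefficient b_{-1} (zero for an explicit method)
    y_hist[j]      : y_{n-j};  f_hist[j] : f(t_{n-j}, y_{n-j})
    y_guess, iters : starting value y^(0)_{n+1} and number of iterations
    """
    explicit_part = (sum(aj * yj for aj, yj in zip(a, y_hist))
                     + h * sum(bj * fj for bj, fj in zip(b, f_hist)))
    if b_minus1 == 0:
        return explicit_part              # explicit method: nothing to iterate
    y = y_guess
    for _ in range(iters):
        # y^(i+1) = explicit part + h * b_{-1} * f(t_{n+1}, y^(i))
        y = explicit_part + h * b_minus1 * f(t_next, y)
    return y

# Usage: trapezoidal method (p = 0, a_0 = 1, b_0 = b_{-1} = 1/2) on y' = -y.
def decay(t, y):
    return -y

h = 0.1
y1 = multistep_step(decay, h, h, [1.0], [0.5], 0.5,
                    [1.0], [decay(0.0, 1.0)], y_guess=1.0, iters=20)
print(y1, (1.0 - h / 2) / (1.0 + h / 2))
```

In this usage the iteration contracts with factor $h\,|b_{-1}|\,K = 0.05$, where $K = 1$ is the Lipschitz constant of $f$, which is why a small $h$ is needed for convergence.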

Example 7.1

1. The midpoint method is defined by

$$y_{n+1} = y_{n-1} + 2h\,f(t_n, y_n), \quad n \ge 1 \tag{7.2}$$

and it is an explicit two-step method. We discuss this method in more detail later in the chapter.

2. The Adams–Bashforth and Adams–Moulton methods are all special cases of (7.1), with

$$a_0 = 1, \qquad a_j = 0 \quad \text{for } j = 1, \ldots, p.$$

Also, refer to the formulas for these methods in Tables 6.2 and 6.5 of Chapter 6.

7.1 TRUNCATION ERROR

For any differentiable function $Y(t)$, define the truncation error for integrating $Y'(t)$ by

$$T_n(Y) = Y(t_{n+1}) - \left[\sum_{j=0}^{p} a_j Y(t_{n-j}) + h\sum_{j=-1}^{p} b_j Y'(t_{n-j})\right] \tag{7.3}$$

for $n \ge p$. Define the function $\tau_n(Y)$ by

$$\tau_n(Y) = \frac{1}{h}\,T_n(Y). \tag{7.4}$$

In order to prove the convergence of the approximate solution $\{y_n : t_0 \le t_n \le b\}$ of (7.1) to the solution $Y(t)$ of the initial value problem

$$Y'(t) = f(t, Y(t)), \quad t \ge t_0, \qquad Y(t_0) = Y_0,$$

it is necessary to have

$$\tau(h) \equiv \max_{t_p \le t_n \le b} |\tau_n(Y)| \to 0 \quad \text{as } h \to 0. \tag{7.5}$$

This is often called the consistency condition for the method (7.1). The speed of convergence of the solution $\{y_n\}$ to the true solution $Y(t)$ is related to the speed of convergence in (7.5), and thus we need to know the conditions under which

$$\tau(h) = O(h^m) \tag{7.6}$$

for some desired choice of $m \ge 1$. We now examine the implications of (7.5) and (7.6) for the coefficients in (7.1).

Theorem 7.2 Let $m \ge 1$ be a given integer. For (7.5) to hold for all continuously differentiable functions $Y(t)$, that is, for the method (7.1) to be consistent, it is necessary and sufficient that

$$\sum_{j=0}^{p} a_j = 1, \tag{7.7}$$

$$-\sum_{j=0}^{p} j\,a_j + \sum_{j=-1}^{p} b_j = 1. \tag{7.8}$$

Further, for (7.6) to be valid for all functions $Y(t)$ that are $m + 1$ times continuously differentiable, it is necessary and sufficient that (7.7)–(7.8) hold and that

$$\sum_{j=0}^{p} (-j)^i a_j + i\sum_{j=-1}^{p} (-j)^{i-1} b_j = 1, \quad i = 2, \ldots, m. \tag{7.9}$$

Proof. Note that

$$T_n(\alpha Y + \beta W) = \alpha T_n(Y) + \beta T_n(W) \tag{7.10}$$

for all constants $\alpha, \beta$ and all differentiable functions $Y, W$. To examine the consequences of (7.5) and (7.6), expand $Y(t)$ about $t_n$ using Taylor's theorem to obtain

$$Y(t) = \sum_{i=0}^{m} \frac{1}{i!}(t - t_n)^i\,Y^{(i)}(t_n) + R_{m+1}(t), \tag{7.11}$$

$$R_{m+1}(t) = \frac{1}{m!}\int_{t_n}^{t} (t - s)^m\,Y^{(m+1)}(s)\,ds = \frac{(t - t_n)^{m+1}}{(m+1)!}\,Y^{(m+1)}(\xi_n) \tag{7.12}$$

with $\xi_n$ between $t$ and $t_n$ (see (A.4)–(A.6) in Appendix A). We are assuming that $Y(t)$ is $m + 1$ times continuously differentiable on the interval bounded by $t$ and $t_n$. Substituting into (7.3) and using (7.10), we obtain

$$T_n(Y) = \sum_{i=0}^{m} \frac{1}{i!}\,Y^{(i)}(t_n)\,T_n\big((t - t_n)^i\big) + T_n(R_{m+1}).$$

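It is necessary to calculate $T_n((t - t_n)^i)$ for $i \ge 0$.

• For $i = 0$,

$$T_n(1) = c_0 \equiv 1 - \sum_{j=0}^{p} a_j.$$

• For $i \ge 1$,

$$\begin{aligned}
T_n\big((t - t_n)^i\big) &= (t_{n+1} - t_n)^i - \left[\sum_{j=0}^{p} a_j (t_{n-j} - t_n)^i + h\sum_{j=-1}^{p} b_j\,i\,(t_{n-j} - t_n)^{i-1}\right] \\
&= c_i h^i,
\end{aligned}$$

$$c_i = 1 - \left[\sum_{j=0}^{p} (-j)^i a_j + i\sum_{j=-1}^{p} (-j)^{i-1} b_j\right], \quad i \ge 1. \tag{7.13}$$

This gives

$$T_n(Y) = \sum_{i=0}^{m} \frac{c_i}{i!}\,h^i\,Y^{(i)}(t_n) + T_n(R_{m+1}). \tag{7.14}$$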

From (7.12) it is straightforward that $T_n(R_{m+1}) = O(h^{m+1})$. If $Y$ is $m + 2$ times continuously differentiable, we may write the remainder $R_{m+1}(t)$ as

$$R_{m+1}(t) = \frac{1}{(m+1)!}(t - t_n)^{m+1}\,Y^{(m+1)}(t_n) + \cdots,$$

and then

$$T_n(R_{m+1}) = \frac{c_{m+1}}{(m+1)!}\,h^{m+1}\,Y^{(m+1)}(t_n) + O(h^{m+2}). \tag{7.15}$$

To obtain the consistency condition (7.5), assuming that $Y$ is an arbitrary twice continuously differentiable function, we need $\tau(h) = O(h)$ and this requires $T_n(Y) = O(h^2)$. Using (7.14) with $m = 1$, we must have $c_0 = c_1 = 0$, which gives the set of equations (7.7)–(7.8). In some texts, these equations are referred to as the consistency conditions. It can be further shown that (7.7)–(7.8) are the necessary and sufficient conditions for the consistency (7.5), even when $Y$ is only assumed to be continuously differentiable. To obtain (7.6) for some $m \ge 1$, we must have $T_n(Y) = O(h^{m+1})$. From (7.14) and (7.13), this will be true if and only if $c_i = 0$, $i = 0, 1, \ldots, m$. This proves the conditions (7.9) and completes the proof.

The largest value of $m$ for which (7.6) holds is called the order or order of convergence of the method (7.1).
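The conditions of Theorem 7.2 are mechanical to check. The following Python sketch evaluates the constants $c_i$ of (7.13) in exact rational arithmetic and reports the largest $m$ for which $c_0 = \cdots = c_m = 0$. The function names and the coefficient layout are assumptions of this illustration; the sample coefficients are those of the order 2 Adams–Bashforth method used in (6.18), the order 3 Adams–Moulton method of Table 6.5, and Simpson's rule (6.29).

```python
from fractions import Fraction as Fr

def c_coeff(i, a, b, b_minus1):
    """c_i of (7.13); a[j] = a_j and b[j] = b_j for j = 0..p, b_minus1 = b_{-1}."""
    if i == 0:
        return 1 - sum(a)
    s_a = sum(Fr(-j) ** i * aj for j, aj in enumerate(a))
    # The j = -1 term contributes (-j)^(i-1) * b_{-1} = b_{-1}.
    s_b = b_minus1 + sum(Fr(-j) ** (i - 1) * bj for j, bj in enumerate(b))
    return 1 - (s_a + i * s_b)

def order(a, b, b_minus1, max_order=12):
    """Largest m with c_0 = ... = c_m = 0; a value below 1 means not consistent."""
    m = 0
    while m <= max_order and c_coeff(m, a, b, b_minus1) == 0:
        m += 1
    return m - 1

ab2 = ([1, 0], [Fr(3, 2), Fr(-1, 2)], 0)
am3 = ([1, 0], [Fr(8, 12), Fr(-1, 12)], Fr(5, 12))
simpson = ([0, 1], [Fr(4, 3), Fr(1, 3)], Fr(1, 3))
for name, method in (("AB2", ab2), ("AM3", am3), ("Simpson", simpson)):
    m = order(*method)
    print(name, m, c_coeff(m + 1, *method))   # order and first nonzero c_i
```

By (7.15), the first nonzero constant $c_{m+1}$ divided by $(m+1)!$ is the coefficient of the leading truncation error term.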


Example 7.3 Find all second-order two-step methods. Formula (7.1) is

$$y_{n+1} = a_0 y_n + a_1 y_{n-1} + h\left[b_{-1} f(t_{n+1}, y_{n+1}) + b_0 f(t_n, y_n) + b_1 f(t_{n-1}, y_{n-1})\right], \quad n \ge 1. \tag{7.16}$$

The coefficients must satisfy (7.7)–(7.9) with $m = 2$:

$$a_0 + a_1 = 1, \qquad -a_1 + b_{-1} + b_0 + b_1 = 1, \qquad a_1 + 2b_{-1} - 2b_1 = 1.$$

Solving, we obtain

$$a_1 = 1 - a_0, \qquad b_{-1} = 1 - \tfrac{1}{4}a_0 - \tfrac{1}{2}b_0, \qquad b_1 = 1 - \tfrac{3}{4}a_0 - \tfrac{1}{2}b_0 \tag{7.17}$$

with $a_0, b_0$ indeterminate. The midpoint method is a special case in which $a_0 = 0$, $b_0 = 2$. For the truncation error, we have

$$T_n(R_3) = \tfrac{1}{6}c_3 h^3\,Y^{(3)}(t_n) + O(h^4), \tag{7.18}$$

$$c_3 = -4 + 2a_0 + 3b_0. \tag{7.19}$$

The coefficients $a_0, b_0$ can be chosen to improve the stability, give a small truncation error, give an explicit formula, or some combination of these. The conditions to ensure stability and convergence cannot be identified until the general theory for (7.1) has been given in the remainder of this chapter.
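As a quick consistency check of (7.17) and (7.19), the short Python sketch below builds the coefficients from a chosen pair $(a_0, b_0)$ and substitutes them back into the three conditions. The function names are hypothetical, and the sample pairs are assumptions chosen here because they reproduce the midpoint method, the order 2 Adams–Bashforth method, and the trapezoidal method.

```python
from fractions import Fraction as Fr

def two_step_family(a0, b0):
    """Coefficients (a0, a1, b_{-1}, b0, b1) of (7.16) given by (7.17)."""
    a0, b0 = Fr(a0), Fr(b0)
    a1 = 1 - a0
    b_m1 = 1 - a0 / 4 - b0 / 2
    b1 = 1 - 3 * a0 / 4 - b0 / 2
    return a0, a1, b_m1, b0, b1

def conditions(a0, a1, b_m1, b0, b1):
    """Left sides of (7.7)-(7.9) with m = 2; each should equal 1."""
    return (a0 + a1, -a1 + b_m1 + b0 + b1, a1 + 2 * b_m1 - 2 * b1)

def c3(a0, b0):
    """The constant (7.19) in the leading truncation error term (7.18)."""
    return -4 + 2 * Fr(a0) + 3 * Fr(b0)

for label, a0, b0 in (("midpoint", 0, 2),
                      ("AB2", 1, Fr(3, 2)),
                      ("trapezoidal", 1, Fr(1, 2))):
    coeffs = two_step_family(a0, b0)
    print(label, coeffs, conditions(*coeffs), c3(a0, b0) / 6)
```

The last printed value, $c_3/6$, is the coefficient of $h^3 Y^{(3)}(t_n)$ in (7.18); for the trapezoidal choice it equals $-\frac{1}{12}$, in agreement with (6.16).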

7.2 CONVERGENCE

We now give a convergence result for the numerical method (7.1). Although the theorem will not cover all the multistep methods that are convergent, it does include many methods of current interest, including those of Chapters 2, 4, and 6. Moreover, the proof is much easier than that of the more general Theorem 7.6 given in Section 7.3.

Theorem 7.4 Consider solving the initial value problem

$$Y'(t) = f(t, Y(t)), \quad t \ge t_0, \qquad Y(t_0) = Y_0 \tag{7.20}$$

using the multistep method (7.1). Assume that the derivative function $f(t, y)$ is continuous and satisfies the Lipschitz condition

$$|f(t, y_1) - f(t, y_2)| \le K\,|y_1 - y_2| \tag{7.21}$$

for all $-\infty < y_1, y_2 < \infty$, $t_0 \le t \le b$, and for some constant $K > 0$. Let the initial errors satisfy

$$\eta(h) \equiv \max_{0 \le i \le p} |Y(t_i) - y_h(t_i)| \to 0 \quad \text{as } h \to 0. \tag{7.22}$$

Assume that the solution $Y(t)$ is continuously differentiable and the method is consistent, that is, that it satisfies (7.5). Finally, assume that the coefficients $a_j$ are all nonnegative,

$$a_j \ge 0, \quad j = 0, 1, \ldots, p. \tag{7.23}$$

Then the method (7.1) is convergent and

$$\max_{t_0 \le t_n \le b} |Y(t_n) - y_h(t_n)| \le c_1 \eta(h) + c_2 \tau(h) \tag{7.24}$$

for suitable constants $c_1, c_2$. If the solution $Y(t)$ is $m + 1$ times continuously differentiable, the method (7.1) is of order $m$, and the initial errors satisfy $\eta(h) = O(h^m)$, then the order of convergence of the method is $m$; that is, the error is of size $O(h^m)$.

Proof. Rewrite (7.3), and use $Y'(t) = f(t, Y(t))$ to get

$$Y(t_{n+1}) = \sum_{j=0}^{p} a_j Y(t_{n-j}) + h\sum_{j=-1}^{p} b_j f(t_{n-j}, Y(t_{n-j})) + h\,\tau_n(Y).$$

Subtracting (7.1) from this equality and using the notation $e_i = Y(t_i) - y_i$, $Y_i = Y(t_i)$, we obtain

$$e_{n+1} = \sum_{j=0}^{p} a_j e_{n-j} + h\sum_{j=-1}^{p} b_j\left[f(t_{n-j}, Y_{n-j}) - f(t_{n-j}, y_{n-j})\right] + h\,\tau_n(Y).$$

Apply the Lipschitz condition (7.21) and the assumption (7.23) to obtain

$$|e_{n+1}| \le \sum_{j=0}^{p} a_j |e_{n-j}| + hK\sum_{j=-1}^{p} |b_j|\,|e_{n-j}| + h\,\tau(h).$$

Introduce the following error bounding function

$$f_n = \max_{0 \le i \le n} |e_i|, \quad n = 0, 1, \ldots, N(h).$$

Using this function, we have

$$|e_{n+1}| \le \sum_{j=0}^{p} a_j f_n + hK\sum_{j=-1}^{p} |b_j|\,f_{n+1} + h\,\tau(h),$$

and applying (7.7), we obtain

$$|e_{n+1}| \le f_n + hc\,f_{n+1} + h\,\tau(h), \qquad c = K\sum_{j=-1}^{p} |b_j|.$$

The right side is trivially a bound for $f_n$ and thus

$$f_{n+1} \le f_n + hc\,f_{n+1} + h\,\tau(h).$$

For $hc \le \frac{1}{2}$, which is true as $h \to 0$, we obtain

$$f_{n+1} \le \frac{f_n}{1 - hc} + \frac{h}{1 - hc}\,\tau(h) \le (1 + 2hc)\,f_n + 2h\,\tau(h).$$

Noting that $f_p = \eta(h)$, proceed as in the proof of Theorem 2.4 in Chapter 2, from (2.25) onward. Then

$$f_n \le e^{2c(b - t_0)}\,\eta(h) + \left[\frac{e^{2c(b - t_0)} - 1}{c}\right]\tau(h), \quad t_0 \le t_n \le b. \tag{7.25}$$

This completes the proof.

To obtain a rate of convergence of $O(h^m)$ for the method (7.1), it is necessary that each step have an error

$$T_n(Y) = O(h^{m+1}).$$

But the initial values $y_0, \ldots, y_p$ need to be computed with an accuracy of only $O(h^m)$, since $\eta(h) = O(h^m)$ is sufficient in (7.24).

The result (7.25) can be improved somewhat for particular cases, but the order of convergence will remain the same. As with Euler's method, a complete stability analysis can be given, yielding a result of the form (2.49) in Chapter 2. The analysis is a straightforward modification of that described in Section 2.4 of Chapter 2. Similarly, an asymptotic error analysis can also be given.

7.3 A GENERAL ERROR ANALYSIS

We begin with a few definitions. The concept of stability was introduced with Euler's method, and we now generalize it. Let $\{y_n : 0 \le n \le N(h)\}$ denote the solution of (7.1) with initial values $y_0, y_1, \ldots, y_p$ for some differential equation $Y'(t) = f(t, Y(t))$ and for all sufficiently small values of $h$, say $h \le h_0$. Recall that $N(h)$ denotes the largest subscript $N$ for which $t_N \le b$. For each $h \le h_0$, perturb the initial values $y_0, \ldots, y_p$ to new values $z_0, \ldots, z_p$ with

$$\max_{0 \le n \le p} |y_n - z_n| \le \epsilon. \tag{7.26}$$

Note that these initial values are allowed to depend on $h$. We say that the family of discrete numerical solutions $\{y_n : 0 \le n \le N(h)\}$, obtained from (7.1), is stable if there is a constant $c$, independent of $h \le h_0$ and valid for all sufficiently small $\epsilon$, for which

$$\max_{0 \le n \le N(h)} |y_n - z_n| \le c\,\epsilon, \quad 0 < h \le h_0. \tag{7.27}$$

Consider all differential equation problems

$$Y'(t) = f(t, Y(t)), \quad t \ge t_0, \qquad Y(t_0) = Y_0 \tag{7.28}$$

with the derivative function $f(t, z)$ continuous and satisfying the Lipschitz condition (7.21). Suppose further that the approximating solutions $\{y_n\}$ are all stable. Then we say that (7.1) is a stable numerical method.

To define convergence for a given problem (7.28), suppose that the initial values $y_0, \ldots, y_p$ satisfy

$$\eta(h) \equiv \max_{0 \le n \le p} |Y(t_n) - y_n| \to 0 \quad \text{as } h \to 0. \tag{7.29}$$

Then the solution $\{y_n\}$ is said to converge to $Y(t)$ if

$$\max_{t_0 \le t_n \le b} |Y(t_n) - y_n| \to 0 \quad \text{as } h \to 0. \tag{7.30}$$

If (7.1) is convergent for all problems (7.28) with the properties specified immediately following (7.28), then it is called a convergent numerical method. Convergence can be shown to imply consistency; consequently, we consider only methods satisfying (7.7)–(7.8). The necessity of the condition (7.7) follows from the assumption of convergence of (7.1) for the problem

$$Y'(t) \equiv 0, \qquad Y(0) = 1.$$

Just take $y_0 = \cdots = y_p = 1$, and observe the consequences of the convergence of $y_{p+1}$ to $Y(t) \equiv 1$. We leave the proof of the necessity of (7.8) as Problem 8.

The convergence and stability of (7.1) are linked to the roots of the polynomial

$$\rho(r) = r^{p+1} - \sum_{j=0}^{p} a_j r^{p-j}. \tag{7.31}$$

Note that $\rho(1) = 0$ from the consistency condition (7.7). Let $r_0, \ldots, r_p$ denote the roots of $\rho(r)$, repeated according to their multiplicity, and let $r_0 = 1$. The method (7.1) satisfies the root condition if

$$\text{(R1)} \quad |r_j| \le 1, \quad j = 0, 1, \ldots, p, \tag{7.32}$$

$$\text{(R2)} \quad |r_j| = 1 \implies \rho'(r_j) \ne 0. \tag{7.33}$$

The first condition requires all roots of $\rho(r)$ to lie in the closed unit disk $\{z : |z| \le 1\}$ in the complex plane. Condition (7.33) states that all roots on the boundary of the disk, that is, on the unit circle, are to be simple roots of $\rho(r)$.

7.3.1 Stability theory

All of the numerical methods presented in the preceding chapters have been stable, but we now give an example of a consistent unstable multistep method. This is to motivate the need to develop a general theory of stability.


Example 7.5 Consider the two-step method

$$y_{n+1} = 3y_n - 2y_{n-1} + \tfrac{1}{2}h\left[f(t_n, y_n) - 3f(t_{n-1}, y_{n-1})\right], \quad n \ge 1. \tag{7.34}$$

It can be shown to have the truncation error

$$T_n(Y) = \tfrac{7}{12}h^3\,Y^{(3)}(\xi_n), \quad t_{n-1} \le \xi_n \le t_{n+1}$$

and therefore, it is a consistent method. Consider solving the problem $Y'(t) \equiv 0$, $Y(0) = 0$, which has the solution $Y(t) \equiv 0$. Using $y_0 = y_1 = 0$, the numerical solution is clearly $y_n = 0$, $n \ge 0$. Perturb the initial data to $z_0 = \epsilon/2$, $z_1 = \epsilon$, for some $\epsilon \ne 0$. Then the corresponding numerical solution can be shown to be

$$z_n = \epsilon \cdot 2^{n-1}, \quad n \ge 0. \tag{7.35}$$

The reader should check this assertion. To see the effect of the perturbation on the original solution, observe that

$$\max_{t_0 \le t_n \le b} |y_n - z_n| = \max_{0 \le t_n \le b} |\epsilon|\,2^{n-1} = |\epsilon|\,2^{N(h)-1}.$$

Since $N(h) \to \infty$ as $h \to 0$, the deviation of $\{z_n\}$ from $\{y_n\}$ increases as $h \to 0$. The method (7.34) is unstable, and it should never be used. Also, note that the root condition is violated, since $\rho(r) = r^2 - 3r + 2$ has the roots $r_0 = 1$, $r_1 = 2$.
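The assertion (7.35) can be checked by running the recursion directly. The Python sketch below is a minimal illustration: the function name is hypothetical, and exact rational arithmetic is an assumption made here so that the comparison with $\epsilon \cdot 2^{n-1}$ is free of rounding error.

```python
from fractions import Fraction as Fr

def unstable_two_step(f, h, t0, y0, y1, n_steps):
    """Method (7.34): y_{n+1} = 3 y_n - 2 y_{n-1} + (h/2) [f_n - 3 f_{n-1}]."""
    y = [y0, y1]
    for n in range(1, n_steps):
        t_n, t_prev = t0 + n * h, t0 + (n - 1) * h
        y.append(3 * y[n] - 2 * y[n - 1]
                 + h / 2 * (f(t_n, y[n]) - 3 * f(t_prev, y[n - 1])))
    return y

def zero_rhs(t, y):
    return 0                      # the problem Y'(t) = 0

eps = Fr(1, 1000)
h = Fr(1, 10)
y = unstable_two_step(zero_rhs, h, 0, Fr(0), Fr(0), 10)     # unperturbed data
z = unstable_two_step(zero_rhs, h, 0, eps / 2, eps, 10)     # perturbed data
for n, z_n in enumerate(z):
    print(n, z_n, eps * Fr(2) ** (n - 1))
```

Halving $h$ on a fixed interval doubles the number of steps and therefore squares the growth factor of the perturbation, which is the instability described in the example.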

To investigate the stability of (7.1), we consider only the special equation

$$Y'(t) = \lambda Y(t), \quad t \ge 0, \qquad Y(0) = 1 \tag{7.36}$$

with the solution $Y(t) = e^{\lambda t}$; $\lambda$ is allowed to be complex. This is the model problem of (4.3), and its use was discussed in Chapter 4. The results obtained will transfer to the study of stability for a general differential equation problem. An intuitive reason for this is easily derived. Expand $Y'(t) = f(t, Y(t))$ about $(t_0, Y_0)$ to obtain

$$\begin{aligned}
Y'(t) &\approx f(t_0, Y_0) + f_t(t_0, Y_0)(t - t_0) + f_y(t_0, Y_0)(Y(t) - Y_0) \\
&= \lambda\,(Y(t) - Y_0) + g(t)
\end{aligned} \tag{7.37}$$

with $\lambda = f_y(t_0, Y_0)$ and $g(t) = f(t_0, Y_0) + f_t(t_0, Y_0)(t - t_0)$. This is a valid approximation if $|t - t_0|$ is sufficiently small. Introducing $V(t) = Y(t) - Y_0$,

$$V'(t) \approx \lambda V(t) + g(t). \tag{7.38}$$

The inhomogeneous term $g(t)$ will drop out of all derivations concerning numerical stability, because we are concerned with differences of solutions of the equation. Dropping $g(t)$ in (7.38), we obtain the model equation (7.36).


In the case that $\mathbf{Y}' = \mathbf{f}(t, \mathbf{Y})$ represents a system of $m$ differential equations, which is discussed in Chapter 3, the partial derivative $\mathbf{f}_y(t, \mathbf{y})$ becomes a Jacobian matrix,

$$[\mathbf{f}_y(t, \mathbf{y})]_{ij} = \frac{\partial f_i}{\partial y_j}, \quad 1 \le i, j \le m.$$

Thus the model equation becomes

$$\mathbf{y}' = \Lambda \mathbf{y} + \mathbf{g}(t), \tag{7.39}$$

a system of $m$ linear differential equations with $\Lambda = \mathbf{f}_y(t_0, \mathbf{Y}_0)$. It can be shown that in many cases, this system reduces to an equivalent system

$$z_i' = \lambda_i z_i + \gamma_i(t), \quad 1 \le i \le m \tag{7.40}$$

with $\lambda_1, \ldots, \lambda_m$ the eigenvalues of $\Lambda$ (see Problem 6). With (7.40), we are back to the simple model equation (7.36), provided we allow $\lambda$ to be complex in order to include all possible eigenvalues of $\Lambda$.

Applying (7.1) to the model equation (7.36), we obtain

$$y_{n+1} = \sum_{j=0}^{p} a_j y_{n-j} + h\lambda\sum_{j=-1}^{p} b_j y_{n-j}, \tag{7.41}$$

$$(1 - h\lambda b_{-1})\,y_{n+1} - \sum_{j=0}^{p} (a_j + h\lambda b_j)\,y_{n-j} = 0, \quad n \ge p. \tag{7.42}$$

This is a homogeneous linear difference equation of order $p + 1$, and the theory for its solvability is completely analogous to that of $(p+1)$-order homogeneous linear differential equations. As a general reference, see Henrici [45, pp. 210–215] or Isaacson and Keller [47, pp. 405–417].

We attempt to find a general solution by first looking for solutions of the special form

$$y_n = r^n, \quad n \ge 0.$$

If we can find $p + 1$ linearly independent solutions, then an arbitrary linear combination will give the general solution of (7.42).

Substituting $y_n = r^n$ into (7.42) and canceling $r^{n-p}$, we obtain

$$(1 - h\lambda b_{-1})\,r^{p+1} - \sum_{j=0}^{p} (a_j + h\lambda b_j)\,r^{p-j} = 0. \tag{7.43}$$

This is called the characteristic equation, and the left side is the characteristic polynomial. The roots are called characteristic roots. Define

$$\sigma(r) = b_{-1} r^{p+1} + \sum_{j=0}^{p} b_j r^{p-j},$$

and recall the definition (7.31) of $\rho(r)$. Then (7.43) becomes

$$\rho(r) - h\lambda\,\sigma(r) = 0. \tag{7.44}$$

Denote the characteristic roots by

$$r_0(h\lambda), \ldots, r_p(h\lambda),$$

which can be shown to depend continuously on the value of $h\lambda$. When $h\lambda = 0$, equation (7.44) becomes simply $\rho(r) = 0$, and we have $r_j(0) = r_j$, $j = 0, 1, \ldots, p$ for the earlier roots $r_j$ of $\rho(r) = 0$. Since $r_0 = 1$ is a root of $\rho(r)$, we let $r_0(h\lambda)$ be the root of (7.44) for which $r_0(0) = 1$. The root $r_0(h\lambda)$ is called the principal root for reasons that will become apparent later. If the roots $r_j(h\lambda)$ are all distinct, then the general solution of (7.42) is

$$y_n = \sum_{j=0}^{p} \gamma_j\,[r_j(h\lambda)]^n, \quad n \ge 0. \tag{7.45}$$

But if

$$r_j(h\lambda) = r_{j+1}(h\lambda) = \cdots = r_{j+\nu-1}(h\lambda)$$

is a root of multiplicity $\nu > 1$, then the following are $\nu$ linearly independent solutions of (7.42):

$$\{[r_j(h\lambda)]^n\}, \quad \{n\,[r_j(h\lambda)]^n\}, \quad \ldots, \quad \{n^{\nu-1}[r_j(h\lambda)]^n\}.$$

Moreover, in the formula (7.45), the part

$$\gamma_j\,[r_j(h\lambda)]^n + \cdots + \gamma_{j+\nu-1}\,[r_{j+\nu-1}(h\lambda)]^n$$

needs to be replaced by

$$[r_j(h\lambda)]^n\left(\gamma_j + \gamma_{j+1} n + \cdots + \gamma_{j+\nu-1} n^{\nu-1}\right). \tag{7.46}$$

These can be used with the solution arising from the other roots to generate a general solution for (7.42), comparable to (7.45).

In particular, for consistent methods it can be shown that

$$[r_0(h\lambda)]^n = e^{\lambda t_n} + O(h) \tag{7.47}$$

as $h \to 0$. The remaining roots $r_1(h\lambda), \ldots, r_p(h\lambda)$ are called parasitic roots of the numerical method. The term

$$\sum_{j=1}^{p} \gamma_j\,[r_j(h\lambda)]^n \tag{7.48}$$

is called a parasitic solution. It is a creation of the numerical method and does not correspond to any solution of the original differential equation being solved.
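For a two-step method the principal and parasitic roots can be written down explicitly. As an illustration, take the midpoint method (7.2), for which $\rho(r) = r^2 - 1$ and $\sigma(r) = 2r$, so that (7.44) becomes $r^2 - 2h\lambda\,r - 1 = 0$. The Python sketch below uses a hypothetical function name and the sample values $\lambda = -1$, $t_n = 1$, which are assumptions of this example; it compares $[r_0(h\lambda)]^n$ with $e^{\lambda t_n}$ as in (7.47) and prints the size of the parasitic root.

```python
import cmath
import math

def midpoint_roots(z):
    """Roots of r^2 - 2 z r - 1 = 0, i.e. (7.44) for the midpoint method, z = h*lambda."""
    s = cmath.sqrt(1.0 + z * z)
    return z + s, z - s     # principal root (equals 1 at z = 0), parasitic root

lam, t_n = -1.0, 1.0
for n in (10, 100, 1000):
    h = t_n / n
    r0, r1 = midpoint_roots(h * lam)
    print(n, abs(r0 ** n - math.exp(lam * t_n)), abs(r1))
```

The first printed difference shrinks as $h$ decreases, consistent with (7.47). The second column shows that the parasitic root of this particular method has magnitude slightly greater than 1 when $\lambda < 0$.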

Theorem 7.6 Assume the consistency conditions (7.7)–(7.8). Then the multistep method (7.1) is stable if and only if the root condition (7.32)–(7.33) is satisfied.

The proof makes essential use of the general solution (7.45) in the case of distinct roots $\{r_j(h\lambda)\}$, or the variant of (7.45) modified according to (7.46) when multiple roots are present. The reader is referred to [11, p. 398] for a partial proof and to [47, pp. 405–417] for a more complete development.


7.3.2 Convergence theory

The following result generalizes Theorem 7.4 from earlier in this chapter, giving necessary and sufficient conditions for the convergence of multistep methods.

Theorem 7.7 Assume the consistency conditions (7.7)–(7.8). Then the multistep method (7.1) is convergent if and only if the root condition (7.32)–(7.33) is satisfied.

Again, we refer the reader to [11, p. 401] for a partial proof and to [47, pp. 405–417] for a more complete development.

The following is a well-known result, and it is a trivial consequence of Theorems 7.6 and 7.7.

Corollary 7.8 Let (7.1) be a consistent multistep method. Then it is convergent if and only if it is stable.

Example 7.9 Return to the two-step methods of order 2, developed in Example 7.3. The polynomial $\rho(r)$ is given by

$$\rho(r) = r^2 - a_0 r - a_1, \qquad a_0 + a_1 = 1.$$

Then

$$\rho(r) = (r - 1)(r + 1 - a_0),$$

and the roots are

$$r_0 = 1, \qquad r_1 = a_0 - 1.$$

The root condition requires

$$-1 \le a_0 - 1 < 1, \qquad \text{that is,} \qquad 0 \le a_0 < 2,$$

to ensure convergence and stability of the associated two-step method in (7.16).
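The restriction on $a_0$ in Example 7.9 can be tabulated directly. The following MATLAB sketch is an illustration rather than part of the text; the sample values in `a0` and the variable names are assumptions chosen for the demonstration. It uses the closed form $r_1 = a_0 - 1$ instead of a numerical root finder, because the double root at $a_0 = 2$ is sensitive to rounding.

```matlab
% Root condition for rho(r) = r^2 - a0*r - (1 - a0), as in Example 7.9.
% The roots are r0 = 1 and r1 = a0 - 1.  The condition holds when
% -1 <= r1 < 1 (r1 = 1 would make r = 1 a double root on the unit circle).
a0 = [-0.5, 0, 0.5, 1, 1.5, 2];      % illustrative sample values
r1 = a0 - 1;                          % second characteristic root
ok = (r1 >= -1) & (r1 < 1);           % true where the root condition holds
for k = 1:numel(a0)
    fprintf('a0 = %5.2f   r1 = %5.2f   root condition: %d\n', ...
            a0(k), r1(k), ok(k));
end
```

Only the values with $0 \le a_0 < 2$ are flagged as satisfying the root condition.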

7.3.3 Relative stability and weak stability

Consider again the model equation (7.36) and its numerical solution (7.45). For a convergent numerical method, it can be shown that in the general solution (7.45), we obtain

$$\gamma_0 \to 1, \qquad \gamma_j \to 0, \quad j = 1, \ldots, p$$

as $h \to 0$. The parasitic solution (7.48) converges to zero as $h \to 0$, and the term $\gamma_0 [r_0(h\lambda)]^n$ converges to $Y(t) = e^{\lambda t}$ with $t_n = t$ fixed. However, for a fixed $h$ with increasing $t_n$, we also would like the parasitic solution to remain small relative to the principal part of the solution $\gamma_0 [r_0(h\lambda)]^n$. This will be true if the characteristic roots satisfy

$$|r_j(h\lambda)| \le r_0(h\lambda), \qquad j = 1, 2, \ldots, p \tag{7.49}$$

for all sufficiently small values of $h$. This leads us to the definition of relative stability. We say that the method (7.1) is relatively stable if the characteristic roots $r_j(h\lambda)$ satisfy (7.49) for all sufficiently small nonzero values of $|h\lambda|$. Further, the method is said to satisfy the strong root condition if

$$|r_j(0)| < 1, \qquad j = 1, 2, \ldots, p. \tag{7.50}$$

This condition is easy to check, and it implies relative stability. Just use the continuity of the roots $r_j(h\lambda)$ with respect to $h\lambda$ to verify that (7.50) implies (7.49). Relative stability does not imply the strong root condition, although they are equivalent for most methods. If a multistep method is stable but not relatively stable, then it will be called weakly stable.

Example 7.10

(1) For the midpoint method, we obtain

$$r_0(h\lambda) = 1 + h\lambda + O(h^2), \qquad r_1(h\lambda) = -1 + h\lambda + O(h^2). \tag{7.51}$$

For $\lambda < 0$, we have

$$|r_1(h\lambda)| > r_0(h\lambda)$$

for all small values of $h > 0$, and thus (7.49) is not satisfied. The midpoint method is not relatively stable; it is only weakly stable. We leave it as an exercise to show experimentally that the midpoint method has undesirable stability when $\lambda < 0$ for the model equation (7.28).

(2) The Adams–Bashforth and Adams–Moulton methods of Chapter 6 have the same characteristic polynomial when $h = 0$,

$$\rho(r) = r^{p+1} - r^p. \tag{7.52}$$

The roots are $r_0 = 1$, $r_j = 0$, $j = 1, 2, \ldots, p$; thus the strong root condition is satisfied and the Adams methods are relatively stable.
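The expansions in (7.51) can be checked numerically. The MATLAB sketch below is an illustration, not part of the text. It assumes the midpoint method in the form $y_{n+1} = y_{n-1} + 2hf(t_n, y_n)$, for which (7.43) reduces to $r^2 - 2h\lambda r - 1 = 0$; the sample value of $h\lambda$ and the variable names are hypothetical.

```matlab
% Characteristic roots of the midpoint method for the model equation.
% Assumption: (7.43) with p = 1, a0 = 0, a1 = 1, b0 = 2, b1 = 0, b_{-1} = 0,
% which gives r^2 - 2*(h*lambda)*r - 1 = 0.
hlam = -0.1;                          % sample value of h*lambda with lambda < 0
r = roots([1, -2*hlam, -1]);          % both characteristic roots
[~, k] = min(abs(r - 1));             % the principal root is the one near 1
r0 = r(k);
r1 = r(3 - k);                        % the remaining root is the parasitic one
fprintf('r0   = %.6f   (exp(h*lambda) = %.6f)\n', r0, exp(hlam));
fprintf('|r1| = %.6f\n', abs(r1));
```

For this negative value of $h\lambda$ the parasitic root has the larger magnitude, which is the failure of (7.49) described in item (1).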

PROBLEMS

1. Consider the two-step method

$$y_{n+1} = \tfrac{1}{2}(y_n + y_{n-1}) + \frac{h}{4}\left[4y'_{n+1} - y'_n + 3y'_{n-1}\right], \qquad n \ge 1$$

with $y'_n \equiv f(t_n, y_n)$. Show that it has order 2, and find the leading term in the truncation error, written as in (7.15).

2. Recall the midpoint method

$$y_{n+1} = y_{n-1} + 2hf(t_n, y_n), \qquad n \ge 1$$

from Problem 11 in Chapter 6.

(a) Show that the midpoint method has order 2, as noted earlier following (7.2).

(b) Show that the midpoint method is not relatively stable.

3. Write a program to solve $Y'(t) = f(t, Y(t))$, $Y(t_0) = Y_0$ using the midpoint rule of Problem 2. Use a fixed stepsize $h$. For the initial value $y_1$, use the Euler method with $y_0 = Y_0$,

$$y_1 = y_0 + hf(t_0, y_0).$$

Using the program, solve the problem

$$Y'(t) = -Y(t) + 2\cos(t), \qquad Y(0) = 1.$$

The true solution is $Y(t) = \cos(t) + \sin(t)$. Solve this problem on the interval $[0, 10]$, and use stepsizes of $h = 0.2, 0.1, 0.05$. Comment on your results. Produce a graph of the error.

4. Show that the two-step method

$$y_{n+1} = -y_n + 2y_{n-1} + h\left[\tfrac{5}{2}y'_n + \tfrac{1}{2}y'_{n-1}\right], \qquad n \ge 1$$

is of order 2 and unstable. Also, show directly that it need not converge when solving $Y'(t) = f(t, Y(t))$ by considering the special problem

$$Y'(t) = 0, \qquad Y(0) = 0.$$

For the numerical method, consider using the initial values

$$y_0 = h, \qquad y_1 = -2h.$$

Hint: Use the general formula (7.45), and examine the numerical solution for $t_n = nh = 1$.

5. Consider the general formula for all explicit two-step methods,

$$y_{n+1} = a_0 y_n + a_1 y_{n-1} + h\left[b_0 f(t_n, y_n) + b_1 f(t_{n-1}, y_{n-1})\right], \qquad n \ge 1.$$

(a) Consider finding all such two-step methods that are of order 2. Show that the coefficients must satisfy the equations

$$a_0 + a_1 = 1, \qquad -a_1 + b_0 + b_1 = 1, \qquad a_1 - 2b_1 = 1.$$

Solve for $\{a_1, b_0, b_1\}$ in terms of $a_0$.

(b) Find a formula for the leading term in the truncation error, written as in (7.15). It will depend on $a_0$.

(c) What are the restrictions on $a_0$ for this two-step method to be stable? To be convergent?


6. Consider the model equation (7.39) with $\Lambda$ a square matrix of order $m$. Assume $\Lambda = P^{-1}DP$ with $D$ a diagonal matrix with entries $\lambda_1, \ldots, \lambda_m$. Introduce the new unknown vector function $z = Py(t)$. Show that (7.39) converts to the form given in (7.40), demonstrating the reduction to the one-dimensional model equation.
Hint: In (7.39) replace $\Lambda$ with $P^{-1}DP$, and then introduce the new unknowns $z = Py$. Simplify to a differential equation for $z$.

7. For solving $Y'(t) = f(t, Y(t))$, consider the numerical method

$$y_{n+1} = y_n + \frac{h}{2}\left[y'_n + y'_{n+1}\right] + \frac{h^2}{12}\left[y''_n - y''_{n+1}\right], \qquad n \ge 0.$$

Here $y'_n = f(t_n, y_n)$,

$$y''_n = \frac{\partial f(t_n, y_n)}{\partial t} + f(t_n, y_n) \left.\frac{\partial f(t_n, y)}{\partial y}\right|_{y = y_n}$$

with this formula based on differentiating $Y'(t) = f(t, Y(t))$.

(a) Show that this is a fourth-order method with $T_n(Y) = O(h^5)$.
Hint: Use the Taylor approximation method used earlier in deriving the results of Theorem 7.2, modifying this procedure as necessary for analyzing this case.

(b) Show that the region of absolute stability contains the entire negative real axis of the complex $h\lambda$-plane.

8. Prove that (7.8) is necessary for the multistep numerical method (7.1) to be consistent.
Hint: Apply (7.1) to the initial value problem

$$Y'(t) = 1, \qquad Y(0) = 0$$

with exact initial conditions.

9. (a) Find all explicit fourth-order formulas of the form

$$y_{n+1} = a_0 y_n + a_1 y_{n-1} + a_2 y_{n-2} + h\left[b_0 y'_n + b_1 y'_{n-1} + b_2 y'_{n-2}\right], \qquad n \ge 2.$$

(b) Show that every such method is unstable.

10. (a) Consider methods of the form

$$y_{n+1} = y_{n-q} + h \sum_{j=-1}^{p} b_j f(x_{n-j}, y_{n-j})$$

with $q \ge 1$. Show that such methods do not satisfy the strong root condition. As a consequence, most such methods are only weakly stable.

(b) Find an example with $q = 1$ that is relatively stable.

11. For the polynomial $\rho(r) = r^{p+1} - \sum_{j=0}^{p} a_j r^{p-j}$, assume $a_j \ge 0$, $0 \le j \le p$, and $\sum_{j=0}^{p} a_j = 1$. Show that the roots of $\rho(r)$ will satisfy the root conditions (7.32) and (7.33). This shows directly that Theorem 7.4 is a corollary of Theorem 7.7.

CHAPTER 8

STIFF DIFFERENTIAL EQUATIONS

The numerical solution of stiff differential equations is a widely studied subject. Such equations (including systems of differential equations) appear in a wide variety of applications, in subjects as diverse as chemical kinetics, mechanical systems, and the numerical solution of partial differential equations. In this section, we sketch some of the main ideas about this subject, and we show its relation to the numerical solution of the simple heat equation from partial differential equations.

There are several definitions of the concept of stiff differential equation. The most important common feature of these definitions is that when such equations are being solved with standard numerical methods (e.g., the Adams–Bashforth methods of Chapter 6), the stepsize $h$ must be extremely small in order to maintain stability, far smaller than would appear to be necessary from a consideration of the truncation error. A numerical illustration for Euler's method is given in Table 4.3 as a part of Example 4.2 in Chapter 4.

Definitions and results related to the topic of stiff differential equations were introduced in Chapter 4 (see (4.3)–(4.5) and (4.10)) and Chapter 6 (see the discussion accompanying (6.19)–(6.20)). For convenience, we review those ideas here. As was discussed preceding (4.3) in Chapter 4, the following model problem is used to test the performance of numerical methods,

$$Y' = \lambda Y, \quad t > 0, \qquad Y(0) = 1. \tag{8.1}$$

Following (7.36) in Chapter 7, a derivation was given to show that (8.1) is useful in studying the stability of numerical methods for very general systems of nonlinear differential equations; we review this in more detail in a later paragraph.

When the constant $\lambda$ is real, we assume $\lambda < 0$; or more generally, when $\lambda$ is complex, we assume $\mathrm{Real}(\lambda) < 0$. This assumption about $\lambda$ is generally associated with stable differential equation problems (see Section 1.2). The true solution of the model problem is

$$Y(t) = e^{\lambda t}. \tag{8.2}$$

From our assumption on $\lambda$, we have

$$Y(t) \to 0 \quad \text{as } t \to \infty. \tag{8.3}$$

The kind of stability property we would prefer for a numerical method is that when it is applied to (8.1), the numerical solution satisfies

$$y_h(t_n) \to 0 \quad \text{as } t_n \to \infty \tag{8.4}$$

for any choice of the stepsize $h$. Such numerical methods are called absolutely stable or A-stable. For an arbitrary numerical method, the set of values $h\lambda$ for which (8.4) is satisfied, considered as a subset of the complex plane, is called the region of absolute stability of the numerical method. The dependence on the product $h\lambda$ is based on the general solution to the finite difference method for solving (8.1), given in (7.45) of Chapter 7.

Example 8.1 We list here the regions of absolute stability as derived in earlier chapters. Again, we consider only $\lambda$ satisfying our earlier assumption that $\mathrm{Real}(\lambda) < 0$.

• For Euler's method, it was shown following (4.5) that (8.4) is satisfied if and only if

$$|1 + h\lambda| = |h\lambda - (-1)| < 1. \tag{8.5}$$

Thus $h\lambda$ is in the region of absolute stability if and only if it is within a distance of 1 from the point $-1$ in the complex plane. The region of absolute stability is the interior of the circle of unit radius with center at $-1$. For real $\lambda$, this requires

$$-2 < h\lambda < 0.$$

• For the backward Euler method of (4.9), it was shown in and following (4.10) that (8.4) is satisfied for every value of $h\lambda$ in which $\mathrm{Real}(\lambda) < 0$. The backward Euler method is A-stable.

• For the trapezoidal method of (4.22), it was left to Problem 2 in Chapter 4 to show that (8.4) is satisfied for every value of $h\lambda$ in which $\mathrm{Real}(\lambda) < 0$. The trapezoidal method is A-stable.

• For the Adams–Bashforth method of order 2, namely

$$y_{n+1} = y_n + \frac{h}{2}\left[3y'_n - y'_{n-1}\right], \qquad n \ge 1 \tag{8.6}$$

(see Table 6.2), it was stated in Example 6.6 that the real part of the region of absolute stability is the interval

$$-1 < h\lambda < 0. \tag{8.7}$$

Why is this of interest? If a method is absolutely stable, then there are no restrictions on $h$ in order for the numerical method to be stable in the sense of (8.4). However, consider what happens to the stepsize $h$ if a method has a region of absolute stability that is bounded (and say, of moderate size). Suppose that the value of $\lambda$ has a real part that is negative and of very large magnitude. Then $h$ must be correspondingly small for $h\lambda$ to belong to the region of absolute stability of the method. Even if the truncation error is small, it is necessary that $h\lambda$ belong to the region of absolute stability to ensure that the error in the approximate solution $\{y_n\}$ is also small.
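A small experiment makes the stepsize restriction concrete. The MATLAB sketch below is an illustration, not part of the text; the values of `lambda`, `h`, and `N` are assumptions chosen so that $h\lambda = -5$ lies outside the interval $-2 < h\lambda < 0$ for Euler's method. Applied to (8.1), each method reduces to multiplication by a fixed factor per step.

```matlab
% Model problem Y' = lambda*Y, Y(0) = 1, with a large negative lambda.
lambda = -100;  h = 0.05;  N = 20;       % h*lambda = -5, outside (-2, 0)
y_euler  = 1;                            % Euler iterate
y_beuler = 1;                            % backward Euler iterate
for n = 1:N
    y_euler  = (1 + h*lambda) * y_euler;    % y_{n+1} = (1 + h*lambda) * y_n
    y_beuler = y_beuler / (1 - h*lambda);   % y_{n+1} = y_n / (1 - h*lambda)
end
fprintf('Euler:          %.3e\n', y_euler);
fprintf('backward Euler: %.3e\n', y_beuler);
fprintf('true solution:  %.3e\n', exp(lambda*N*h));
```

The Euler iterates grow in magnitude by a factor of 4 per step, while the backward Euler iterates decay, as (8.4) requires.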

Example 8.2 Recall Example 4.2 in Chapter 4, which illustrated the computational effects of regions of absolute stability for the Euler, backward Euler, and trapezoidal methods when solving the problem

$$Y'(t) = \lambda Y(t) + (1 - \lambda)\cos(t) - (1 + \lambda)\sin(t), \qquad Y(0) = 1. \tag{8.8}$$

The true solution is $Y(t) = \sin(t) + \cos(t)$. We augment those earlier calculations by giving results for the Adams–Bashforth method (8.6) when solving (8.8). For simplicity, we use $y_1 = Y(t_1)$. Numerical results for several values of $\lambda$ are given in Table 8.1. The values of $h$ are the same as those used in Table 4.3 for Euler's method in Example 4.2. The behavior of the error in the numerical results is consistent with the region of absolute stability given in (8.7).
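The computation behind Example 8.2 can be sketched in a few lines of MATLAB. This is an illustrative reconstruction, not the authors' program: the variable names are hypothetical, only two of the tabulated cases are run ($h = 0.1$ with $\lambda = -1$ and $\lambda = -50$), and the error is taken as $Y(t_n) - y_n$, which is an assumption about the sign convention.

```matlab
% Adams-Bashforth method of order 2, (8.6), applied to problem (8.8).
lambdas = [-1, -50];  h = 0.1;  T = 5;
Ytrue = @(t) sin(t) + cos(t);            % true solution of (8.8)
N = round(T/h);  t = (0:N)'*h;           % grid t_0, ..., t_N
err = zeros(size(lambdas));
for k = 1:numel(lambdas)
    lambda = lambdas(k);
    f = @(t,y) lambda*y + (1 - lambda)*cos(t) - (1 + lambda)*sin(t);
    y = zeros(N+1,1);
    y(1) = Ytrue(t(1));                  % y_0 = Y(0) = 1
    y(2) = Ytrue(t(2));                  % y_1 = Y(t_1), as in Example 8.2
    for n = 2:N
        y(n+1) = y(n) + (h/2)*(3*f(t(n),y(n)) - f(t(n-1),y(n-1)));
    end
    err(k) = Ytrue(t(end)) - y(end);     % error at t = T
    fprintf('lambda = %4d   error at t = %g: %.2e\n', lambda, T, err(k));
end
```

With $h\lambda = -0.1$ the error stays small, whereas with $h\lambda = -5$, which is outside (8.7), it grows without bound.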

Returning to the derivation following (7.36) in Chapter 7, we looked at the linearization of the system

$$\mathbf{Y}' = \mathbf{f}(t, \mathbf{Y}) \tag{8.9}$$

of $m$ differential equations, resulting in the approximating linear system

$$\mathbf{Y}' = \Lambda \mathbf{Y} + \mathbf{g}(t). \tag{8.10}$$

In this, $\Lambda = \mathbf{f}_y(t_0, \mathbf{Y}_0)$ is the $m \times m$ Jacobian matrix of $\mathbf{f}$ evaluated at $(t_0, \mathbf{Y}_0)$. As was explored in Problem 6 of Chapter 7, many such systems can be reduced to a set of $m$ independent scalar equations

$$Y'_i = \lambda_i Y_i + g_i(t), \qquad i = 1, \ldots, m.$$


Table 8.1 The Adams–Bashforth method (8.6) for solving (8.8)

   λ     t    Error (h = 0.5)   Error (h = 0.1)   Error (h = 0.01)
  -1     1       -2.39e-2          -7.58e-4          -7.24e-6
         2        4.02e-2           2.13e-3           2.28e-5
         3        1.02e-1           4.31e-3           4.33e-5
         4        8.50e-2           2.98e-3           2.82e-5
         5       -3.50e-3          -9.16e-4          -1.13e-5

 -10     1       -2.39e-2          -1.00e-4           6.38e-7
         2       -1.10e+0           3.75e-4           5.25e-6
         3       -5.23e+1           3.83e-4           5.03e-6
         4        2.46e+3          -8.32e-5           1.91e-7
         5       -1.16e+5          -5.96e-4          -4.83e-6

 -50     1       -2.39e-2          -1.57e+3           2.21e-7
         2       -3.25e+1          -3.64e+11          1.09e-6
         3        4.41e+4          -8.44e+19          9.60e-7
         4       -5.98e+7          -1.96e+28         -5.54e-8
         5       -8.12e+10         -4.55e+36         -1.02e-6

As was discussed following (7.36), this leads us back to the model equation (8.1) with $\lambda$ an eigenvalue of the Jacobian matrix $\mathbf{f}_y(t_0, \mathbf{Y}_0)$.

We say that the differential equation $\mathbf{Y}' = \mathbf{f}(t, \mathbf{Y})$ is stiff if some of the eigenvalues $\lambda_j$ of $\Lambda$, or more generally of $\mathbf{f}_y(t, \mathbf{Y})$, have a negative real part of very large magnitude. The question may arise as to how large the eigenvalue should be to be considered large. The magnitude of the eigenvalues might depend on the units of measurement used, for example, which has no impact on the amount of computation needed to accurately solve a particular problem. The crucial test is to consider the eigenvalue(s) associated with the slowest rates of change, and compare them with the eigenvalue(s) associated with the fastest rates of change. A simple test is to look at the ratio $\max_i |\lambda_i| / \min_i |\lambda_i|$. If this number is large, then the problem is stiff. For example, in the pendulum model (3.13), the two eigenvalues in the linearization have the same or similar magnitudes. So it is not a stiff problem. Most problems that we have seen so far are not stiff. Yet, stiff problems are common in practice. In the next section we see one very important example.

We study numerical methods for stiff equations by considering their effect on the model equation (8.1). This approach has its limitations, some of which we indicate later, but it does give us a means of rejecting unsatisfactory methods, and it suggests some possibly satisfactory methods. Before giving some higher-order methods that are suitable for solving stiff differential equations, we give an important practical example.


8.1 THE METHOD OF LINES FOR A PARABOLIC EQUATION

Consider the following parabolic partial differential equation problem:

$$U_t = U_{xx} + G(x, t), \qquad 0 < x < 1, \quad t > 0, \tag{8.11}$$

$$U(0, t) = d_0(t), \qquad U(1, t) = d_1(t), \qquad t \ge 0, \tag{8.12}$$

$$U(x, 0) = f(x), \qquad 0 \le x \le 1. \tag{8.13}$$

The unknown function $U(x, t)$ depends on the time $t$ and a spatial variable $x$, and $U_t = \partial U/\partial t$, $U_{xx} = \partial^2 U/\partial x^2$. The conditions (8.12) are called boundary conditions, and (8.13) is called an initial condition. The solution $U$ can be interpreted as the temperature of an insulated rod of length 1, with $U(x, t)$ the temperature at position $x$ and time $t$; thus (8.11) is often called the heat equation. The functions $G$, $d_0$, $d_1$, and $f$ are assumed given and smooth. For a development of the theory of (8.11)–(8.13), see Widder [78] or any standard introduction to partial differential equations. We give the method of lines for solving for $U$, a popular method for numerically solving linear and nonlinear partial differential equations of parabolic type. This numerical method also leads to the necessity of solving a stiff system of ordinary differential equations.

Let $m > 0$ be an integer, define $\delta = 1/m$, and define the spatial nodes

$$x_j = j\delta, \qquad j = 0, 1, \ldots, m.$$

We discretize (8.11) by approximating the spatial derivative $U_{xx}$ in the equation. Using a standard result in the theory of numerical differentiation,

$$U_{xx}(x_j, t) = \frac{U(x_{j+1}, t) - 2U(x_j, t) + U(x_{j-1}, t)}{\delta^2} - \frac{\delta^2}{12}\,\frac{\partial^4 U(\xi_j, t)}{\partial x^4} \tag{8.14}$$

for $j = 1, 2, \ldots, m - 1$, where each $\xi_j \equiv \xi_j(t)$ is some point between $x_{j-1}$ and $x_{j+1}$. For a derivation of this formula, see [11, p. 318] or [12, p. 237]. Substituting into (8.11), we obtain

$$U_t(x_j, t) = \frac{U(x_{j+1}, t) - 2U(x_j, t) + U(x_{j-1}, t)}{\delta^2} + G(x_j, t) - \frac{\delta^2}{12}\,\frac{\partial^4 U(\xi_j, t)}{\partial x^4}, \qquad 1 \le j \le m - 1. \tag{8.15}$$

Equation (8.11) is to be approximated at each interior node point $x_j$.

We drop the final term in (8.15), the truncation error in the numerical differentiation. Forcing equality in the resulting approximate equation, we obtain

$$u'_j(t) = \frac{1}{\delta^2}\left[u_{j+1}(t) - 2u_j(t) + u_{j-1}(t)\right] + G(x_j, t) \tag{8.16}$$

for $j = 1, 2, \ldots, m - 1$. The functions $u_j(t)$ are intended as approximations of $U(x_j, t)$, $1 \le j \le m - 1$. This is the method of lines approximation to (8.11), and it is a system of $m - 1$ ordinary differential equations. Note that $u_0(t)$ and $u_m(t)$, which are needed in (8.16) for $j = 1$ and $j = m - 1$, are given using (8.12):

$$u_0(t) = d_0(t), \qquad u_m(t) = d_1(t). \tag{8.17}$$

The initial condition for (8.16) is given by (8.13):

$$u_j(0) = f(x_j), \qquad 1 \le j \le m - 1. \tag{8.18}$$

The term method of lines comes from solving for $U(x, t)$ along the lines $(x_j, t)$, $t \ge 0$, $1 \le j \le m - 1$ in the $(x, t)$ plane.

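Under suitable assumptions on the functions $d_0$, $d_1$, $G$, and $f$, it can be shown that

$$\max_{\substack{0 \le j \le m \\ 0 \le t \le T}} |U(x_j, t) - u_j(t)| \le C_T\, \delta^2. \tag{8.19}$$

Thus to complete the solution process, we need only solve the system (8.16).

It is convenient to write (8.16) in matrix form. Introduce

$$\mathbf{u}(t) = [u_1(t), \ldots, u_{m-1}(t)]^T, \qquad \mathbf{u}_0 = [f(x_1), \ldots, f(x_{m-1})]^T,$$

$$\mathbf{g}(t) = \left[\frac{d_0(t)}{\delta^2} + G(x_1, t),\ G(x_2, t),\ \ldots,\ G(x_{m-2}, t),\ \frac{d_1(t)}{\delta^2} + G(x_{m-1}, t)\right]^T,$$

$$\Lambda = \frac{1}{\delta^2}
\begin{bmatrix}
-2 & 1 & 0 & \cdots & 0 \\
1 & -2 & 1 & & \vdots \\
 & \ddots & \ddots & \ddots & \\
\vdots & & 1 & -2 & 1 \\
0 & \cdots & 0 & 1 & -2
\end{bmatrix}.$$

The matrix $\Lambda$ is of order $m - 1$. In the definitions of $\mathbf{u}$ and $\mathbf{g}$, the superscript $T$ indicates matrix transpose, so that $\mathbf{u}$ and $\mathbf{g}$ are column vectors of length $m - 1$. Using these matrices, equations (8.16)–(8.18) can be rewritten as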

$$\mathbf{u}'(t) = \Lambda \mathbf{u}(t) + \mathbf{g}(t), \qquad \mathbf{u}(0) = \mathbf{u}_0. \tag{8.20}$$

If Euler's method is applied, we have the numerical method

$$\mathbf{V}_{n+1} = \mathbf{V}_n + h\left[\Lambda \mathbf{V}_n + \mathbf{g}(t_n)\right], \qquad \mathbf{V}_0 = \mathbf{u}_0 \tag{8.21}$$

with $t_n = nh$ and $\mathbf{V}_n \approx \mathbf{u}(t_n)$. This is a well-known numerical method for the heat equation, called the simple explicit method. We analyze the stability of (8.21) and some other methods for solving (8.20).

Equation (8.20) is in the form of the model equation (8.10), and therefore we need the eigenvalues of $\Lambda$ to examine the stiffness of the system. These eigenvalues are all real and are given by

$$\lambda_j = -\frac{4}{\delta^2} \sin^2\left(\frac{j\pi}{2m}\right), \qquad 1 \le j \le m - 1. \tag{8.22}$$


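A proof (which we omit here) can be obtained by showing a relationship between the characteristic polynomial for $\Lambda$ and Chebyshev polynomials. Directly examining (8.22), we have

$$\lambda_{m-1} \le \lambda_j \le \lambda_1, \tag{8.23}$$

with

$$\lambda_{m-1} = -\frac{4}{\delta^2} \sin^2\left(\frac{(m-1)\pi}{2m}\right) \approx -\frac{4}{\delta^2},$$

$$\lambda_1 = -\frac{4}{\delta^2} \sin^2\left(\frac{\pi}{2m}\right) \approx -\pi^2$$

with the approximations valid for larger $m$. As $\lambda_{m-1}/\lambda_1 \approx 4/(\pi\delta)^2$, it can be seen that (8.20) is a stiff system if $\delta$ is small.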
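The eigenvalue formula (8.22) and the resulting stiffness ratio are easy to confirm numerically. The MATLAB sketch below is an illustration, not part of the text; the choice $m = 16$ and the variable names are assumptions. It builds $\Lambda$ as a dense matrix, which is adequate for small $m$.

```matlab
% Compare eig(Lambda) with formula (8.22) and form the stiffness ratio.
m = 16;  delta = 1/m;
e = ones(m-1,1);
Lambda = (diag(-2*e) + diag(e(1:end-1),1) + diag(e(1:end-1),-1)) / delta^2;
lam_numeric = sort(eig(Lambda));                   % ascending order
j = (1:m-1)';
lam_formula = sort(-(4/delta^2) * sin(j*pi/(2*m)).^2);
max_diff = max(abs(lam_numeric - lam_formula));
ratio = lam_formula(1) / lam_formula(end);         % lambda_{m-1} / lambda_1
fprintf('largest discrepancy      = %.2e\n', max_diff);
fprintf('lambda_1 = %.4f,  -pi^2 = %.4f\n', lam_formula(end), -pi^2);
fprintf('ratio = %.2f,  4/(pi*delta)^2 = %.2f\n', ratio, 4/(pi*delta)^2);
```

Doubling $m$ roughly quadruples the ratio, which is the sense in which refining the spatial grid makes (8.20) stiffer.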

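Applying (8.23) and (8.5) to the analysis of stability in (8.21), we must have

$$|1 + h\lambda_j| < 1, \qquad j = 1, \ldots, m - 1.$$

Using (8.22), this leads to the equivalent statement

$$0 < \frac{4h}{\delta^2} \sin^2\left(\frac{j\pi}{2m}\right) < 2, \qquad 1 \le j \le m - 1.$$

This will be satisfied if $4h/\delta^2 \le 2$ or

$$h \le \tfrac{1}{2}\delta^2. \tag{8.24}$$

If $\delta$ is at all small, say $\delta = 0.01$, then the timestep $h$ must be quite small to ensure stability; for $\delta = 0.01$ the bound (8.24) is $h \le 5 \times 10^{-5}$.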
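The bound (8.24) can be checked against the amplification factors $1 + h\lambda_j$ of Euler's method. The following MATLAB sketch is an illustration, not part of the text; the choice $m = 16$ and the two trial timesteps are assumptions.

```matlab
% Largest amplification factor |1 + h*lambda_j| of (8.21) for two timesteps.
m = 16;  delta = 1/m;
j = (1:m-1)';
lam = -(4/delta^2) * sin(j*pi/(2*m)).^2;   % eigenvalues from (8.22)
h_ok  = 0.5*delta^2;                       % satisfies (8.24)
h_bad = delta^2;                           % twice the bound in (8.24)
amp_ok  = max(abs(1 + h_ok*lam));
amp_bad = max(abs(1 + h_bad*lam));
fprintf('h = delta^2/2:  max |1 + h*lambda_j| = %.4f\n', amp_ok);
fprintf('h = delta^2:    max |1 + h*lambda_j| = %.4f\n', amp_bad);
```

With $h = \frac{1}{2}\delta^2$ every factor has magnitude below 1; doubling the timestep pushes the factor for $j = m - 1$ above 1, so that component of the error grows at each step.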

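In contrast to the restriction (8.24) with Euler's method, the backward Euler method has no such restriction since it is A-stable. Applying the backward Euler method, our approximation to (8.20) is

$$\mathbf{V}_{n+1} = \mathbf{V}_n + h\left[\Lambda \mathbf{V}_{n+1} + \mathbf{g}(t_{n+1})\right], \qquad \mathbf{V}_0 = \mathbf{u}_0. \tag{8.25}$$

This is called the simple implicit method for solving the heat equation. To solve this linear problem for $\mathbf{V}_{n+1}$, we rewrite the equation as

$$(I - h\Lambda)\mathbf{V}_{n+1} = \mathbf{V}_n + h\,\mathbf{g}(t_{n+1}). \tag{8.26}$$

Solving for $\mathbf{V}_{n+1}$ gives

$$\mathbf{V}_{n+1} = (I - h\Lambda)^{-1}\left[\mathbf{V}_n + h\,\mathbf{g}(t_{n+1})\right]. \tag{8.27}$$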

Since all the eigenvalues $\lambda_i$ of $\Lambda$ are negative, the eigenvalues of $(I - h\Lambda)^{-1}$ are $1/(1 - h\lambda_i)$, which are all bounded by one. Because of this, the implicit Euler method for this problem is always stable; there is no limitation on the stepsize $h$, unlike the case for the explicit Euler method. Also, the linear system to be solved is a tridiagonal system, and there is a well-developed numerical analysis for such linear systems (e.g. see [11, p. 527] or [12, p. 287]). It can be solved very rapidly with approximately $5m$ arithmetic operations per timestep, excluding the cost of computing the right side in (8.26). The cost of solving the Euler method (8.21) is almost as large, and thus the solution of (8.26) is not especially time-consuming.

Table 8.2 The method of lines: Euler's method ($h = \frac{1}{2}\delta^2$)

   t     Error (m = 4)   Ratio    Error (m = 8)   Ratio    Error (m = 16)
  1.0      4.85e-2       4.096      1.18e-2       4.024      2.94e-3
  2.0      4.39e-2       4.096      1.07e-2       4.024      2.66e-3
  3.0      3.97e-2       4.096      9.69e-3       4.024      2.41e-3
  4.0      3.59e-2       4.096      8.77e-3       4.024      2.18e-3
  5.0      3.25e-2       4.096      7.93e-3       4.024      1.97e-3

Table 8.3 The method of lines: Backward Euler method ($h = 0.1$)

   t     Error (m = 4)   Error (m = 8)   Error (m = 16)
  1.0      4.85e-2         1.19e-2         2.99e-3
  2.0      4.39e-2         1.08e-2         2.70e-3
  3.0      3.98e-2         9.73e-3         2.45e-3
  4.0      3.60e-2         8.81e-3         2.21e-3
  5.0      3.25e-2         7.97e-3         2.00e-3

Example 8.3 Solve the partial differential equation problem (8.11)–(8.13) with the functions $G$, $d_0$, $d_1$, and $f$ determined from the known solution

$$U = e^{-0.1t} \sin(\pi x), \qquad 0 \le x \le 1, \quad t \ge 0. \tag{8.28}$$

Results for Euler's method (8.21) are given in Table 8.2, and results for the backward Euler method (8.25) are given in Table 8.3.

For Euler's method, we take $m = 4, 8, 16$, and to maintain stability, we take $h = \frac{1}{2}\delta^2$ from (8.24). This leads to the respective timesteps of $h \doteq 0.031,\ 0.0078,\ 0.0020$. From (8.19) and the error formula for Euler's method, we would expect the error to be proportional to $\delta^2$, since $h = \frac{1}{2}\delta^2$. This implies that the error should decrease by a factor of 4 when $m$ is doubled, and the results in Table 8.2 agree. In the table, the column "Error" denotes the maximum error at the node points $(x_j, t)$, $0 \le j \le m$, for the given value of $t$.

For the solution of (8.20) by the backward Euler method, there need no longer be any connection between the spatial stepsize $\delta$ and the timestep $h$. By observing the error formula (8.19) for the method of lines and the truncation error formula (8.33) (use $p = 1$) for the backward Euler method, we see that the error in solving the problem (8.11)–(8.13) will be proportional to $h + \delta^2$. For the unknown function $U$ of (8.28), there is a slow variation with $t$. Thus, for the truncation error associated with the time integration, we should be able to use a relatively large timestep $h$ as compared to the spatial stepsize $\delta$, for the two sources of error to be relatively equal in size. In Table 8.3, we use $h = 0.1$ and $m = 4, 8, 16$. Note that this timestep is much larger than that used in Table 8.2 for Euler's method, and thus the backward Euler method is much more efficient for this example.

For more discussion of the method of lines, see Aiken [1, pp. 124–148] and Schiesser [71].
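As a compact alternative to a hand-written tridiagonal solver, the backward Euler step (8.26) can be sketched with MATLAB's sparse matrices and the backslash operator. This sketch is an illustration, not the authors' program: it hard-codes the test solution (8.28), for which $d_0 = d_1 = 0$ and $G = (\pi^2 - 0.1)e^{-0.1t}\sin(\pi x)$, and the variable names are hypothetical.

```matlab
% Backward Euler for the method of lines system (8.20), using sparse solves.
m = 8;  delta = 1/m;  h = 0.1;  T = 1;
x = (1:m-1)'*delta;                               % interior nodes x_1..x_{m-1}
e = ones(m-1,1);
Lambda = spdiags([e, -2*e, e], -1:1, m-1, m-1) / delta^2;
A = speye(m-1) - h*Lambda;                        % coefficient matrix in (8.26)
G = @(x,t) (pi^2 - 0.1)*exp(-0.1*t).*sin(pi*x);   % source term for U in (8.28)
V = sin(pi*x);                                    % initial values f(x_j)
for n = 1:round(T/h)
    t_next = n*h;
    % Boundary contributions to g vanish because d0 = d1 = 0 for (8.28).
    V = A \ (V + h*G(x,t_next));
end
max_err = max(abs(exp(-0.1*T)*sin(pi*x) - V));
fprintf('m = %d, h = %g, maximum error at t = %g: %.2e\n', m, h, T, max_err);
```

The timestep $h = 0.1$ used here is far larger than the bound (8.24) would allow for Euler's method with the same $m$, yet the computation remains stable.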

8.1.1 MATLAB® programs for the method of lines

We give MATLAB programs for both the Euler method (8.21) and the backward Euler method (8.27).

Euler method code:

function [x,t,u] = MOL_Euler(d0,d1,f,G,T,h,m)
%
% function [x,t,u] = MOL_Euler(d0,d1,f,G,T,h,m)
%
% Use the method of lines to solve
%    u_t = u_xx + G(x,t), 0 < x < 1, 0 < t < T
% with boundary conditions
%    u(0,t) = d0(t), u(1,t) = d1(t)
% and initial condition
%    u(x,0) = f(x).
% Use Euler's method to solve the system of ODEs.
% For the discretization, use a spatial stepsize of
% delta=1/m and a timestep of h.
%
% For numerical stability, use a timestep of
% h = 1/(2*m^2) or smaller.

x = linspace(0,1,m+1)'; delta = 1/m; delta_sqr = delta^2;
t = (0:h:T)'; N = length(t);

% Initialize u.
u = zeros(m+1,N);
u(:,1) = f(x);
u(1,:) = d0(t); u(m+1,:) = d1(t);

% Solve for u using Euler's method.
for n=1:N-1
    g = G(x(2:m),t(n));
    u(2:m,n+1) = u(2:m,n) + (h/delta_sqr)*(u(1:(m-1),n) ...
        - 2*u(2:m,n) + u(3:(m+1),n)) + h*g;
end
u = u';

end % MOL_Euler

Test of Euler method code:

function [x,t,u,error] = Test_MOL_Euler(index_u,t_max,h,m)

% Try this test program with
% [x,t,u,error] = Test_MOL_Euler(2,5,1/128,8);

[x,t,u] = MOL_Euler(@d0,@d1,@f,@G,t_max,h,m);

% Graph numerical solution
[X,T] = meshgrid(x,t);
figure; mesh(X,T,u); shading interp
xlabel('x'); ylabel('t');
title(['Numerical solution u: index of u = ',...
    num2str(index_u)])
disp('Press any key to continue.'); pause

% Graph error in numerical solution
true_u = true_soln(X,T); error = true_u - u;
disp(['Maximum error = ',num2str(max(max(abs(error))))])
figure; mesh(X,T,error); shading interp
xlabel('x'); ylabel('t');
title(['Error in numerical solution u: index of u = ',...
    num2str(index_u)])
disp('Press any key to continue.'); pause

% Produce maximum errors over x as t varies.
maxerr_in_x = max(abs(error'));
figure; plot(t,maxerr_in_x); text(1.02*t_max,0,'t')
title('Maximum error for x in [0,1], as a function of t')

    function true_u = true_soln(z,s)
        switch index_u
            case 1
                true_u = s.^2 + z.^4;
            case 2
                true_u = exp(-0.1*s).*sin(pi*z);
        end
    end % true_soln

    function answer = G(z,s)
        % This routine assumes s is a scalar, while z can be a vector.
        switch index_u
            case 1
                answer = 2*s - 12*z.^2;
            case 2
                answer = (pi^2 - 0.1)*exp(-0.1*s).*sin(pi*z);
        end
    end % G

    function answer = d0(s)
        z = zeros(size(s));
        answer = true_soln(z,s);
    end % d0

    function answer = d1(s)
        z = ones(size(s));
        answer = true_soln(z,s);
    end % d1

    function answer = f(z)
        s = zeros(size(z));
        answer = true_soln(z,s);
    end % f

end % Test_MOL_Euler

Backward Euler method code:

function [x,t,u] = MOL_BEuler(d0,d1,f,G,T,h,m)
%
% function [x,t,u] = MOL_BEuler(d0,d1,f,G,T,h,m)
%
% Use the method of lines to solve
%    u_t = u_xx + G(x,t), 0 < x < 1, 0 < t < T
% with boundary conditions
%    u(0,t) = d0(t), u(1,t) = d1(t)
% and initial condition
%    u(x,0) = f(x).
% Use the backward Euler's method to solve the system of
% ODEs. For the discretization, use a spatial stepsize of
% delta=1/m and a timestep of h.

x = linspace(0,1,m+1)'; delta = 1/m; delta_sqr = delta^2;
t = (0:h:T)'; N = length(t);

% Initialize u.
u = zeros(m+1,N);
u(:,1) = f(x);
u(1,:) = d0(t); u(m+1,:) = d1(t);

% Create tridiagonal coefficient matrix.
a = -(h/delta_sqr)*ones(m-1,1); c = a;
b = (1+2*h/delta_sqr)*ones(m-1,1);
a(1) = 0; c(m-1) = 0; option = 0;

% Solve for u using the backward Euler's method.
for n=2:N
    g = G(x(2:m),t(n));
    g(1) = g(1) + (1/delta_sqr)*u(1,n);
    g(m-1) = g(m-1) + (1/delta_sqr)*u(m+1,n);
    f = u(2:m,n-1) + h*g;
    switch option
        case 0 % first time: factorize matrix
            [v,alpha,beta,message] = tridiag(a,b,c,f,m-1,option);
            option = 1;
        case 1 % other times: use available factorization
            v = tridiag(alpha,beta,c,f,m-1,option);
    end
    u(2:m,n) = v;
end
u = u';

end % MOL_BEuler

function [x, alpha, beta, message] = tridiag(a,b,c,f,n,option)
%
% function [x, alpha, beta, message] = tridiag(a,b,c,f,n,option)
%
% Solve a tridiagonal linear system M*x=f
%
% INPUT:
% The order of the linear system is given as n.
% The subdiagonal, diagonal, and superdiagonal of M are given
% by the arrays a,b,c, respectively. More precisely,
%    M(i,i-1) = a(i), i=2,...,n
%    M(i,i)   = b(i), i=1,...,n
%    M(i,i+1) = c(i), i=1,...,n-1
% option=0 means that the original matrix M is given as
%    specified above. We factorize M.
% option=1 means that the LU factorization of M is already
%    known and is stored in a,b,c. This will have been
%    accomplished by a previous call to this routine. In
%    that case, the vectors alpha and beta should have
%    been substituted for a and b in the calling sequence.
% All input values are unchanged on exit from the routine.
%
% OUTPUT:
% Upon exit, the LU factorization of M is already known and
% is stored in alpha,beta,c. The solution x is given as well.
% message=0 means the program was completed satisfactorily.
% message=1 means that a zero pivot element was encountered
%    and the solution process was abandoned. This case
%    happens only when option=0.

if option == 0
    alpha = a; beta = b;
    alpha(1) = 0;
    % Compute LU factorization of matrix M.
    for j=2:n
        if beta(j-1) == 0
            x = []; message = 1; return
        end
        alpha(j) = alpha(j)/beta(j-1);
        beta(j) = beta(j) - alpha(j)*c(j-1);
    end
    if beta(n) == 0
        x = []; message = 1; return
    end
end

% Compute solution x to M*x = f using LU factorization of M.
% Do forward substitution to solve lower triangular system.
if option == 1
    % The factorization was passed in through the arguments a and b.
    alpha = a; beta = b;
end
x = f; message = 0;

for j=2:n
    x(j) = x(j) - alpha(j)*x(j-1);
end

% Do backward substitution to solve upper triangular system.
x(n) = x(n)/beta(n);
for j=n-1:-1:1
    x(j) = (x(j) - c(j)*x(j+1))/beta(j);
end

end % tridiag

The test code for MOL_BEuler is essentially the same as that for MOL_Euler. In Test_MOL_Euler, simply replace the phrase MOL_Euler with MOL_BEuler throughout the code.

8.2 BACKWARD DIFFERENTIATION FORMULAS

The concept of a region of absolute stability is the initial tool used in studying the stability of a numerical method for solving stiff differential equations. We seek methods whose stability regions each contain the entire negative real axis and as much of the left half of the complex plane as possible. There are a number of ways to develop such methods, but we discuss only one of them in this chapter: obtaining the backward differentiation formulas (BDFs).

Let $P_p(t)$ denote the polynomial of degree $\le p$ that interpolates $Y(t)$ at the points $t_{n+1}, t_n, \ldots, t_{n-p+1}$ for some $p \ge 1$,

$$P_p(t) = \sum_{j=-1}^{p-1} Y(t_{n-j})\, l_{j,n}(t), \tag{8.29}$$

where $\{l_{j,n}(t) : j = -1, \ldots, p - 1\}$ are the Lagrange interpolation basis functions for the nodes $t_{n+1}, \ldots, t_{n-p+1}$ (see (B.4) in Appendix B). Use

$$P'_p(t_{n+1}) \approx Y'(t_{n+1}) = f(t_{n+1}, Y(t_{n+1})). \tag{8.30}$$

Combining (8.30) with (8.29) and solving for $Y(t_{n+1})$, we obtain

$$Y(t_{n+1}) \approx \sum_{j=0}^{p-1} \alpha_j Y(t_{n-j}) + h\beta f(t_{n+1}, Y(t_{n+1})). \tag{8.31}$$

The $p$-step BDF method is given by

$$y_{n+1} = \sum_{j=0}^{p-1} \alpha_j y_{n-j} + h\beta f(t_{n+1}, y_{n+1}). \tag{8.32}$$


Table 8.4 Coefficients of BDF method (8.32)

  p     β          α0          α1           α2          α3           α4         α5
  1     1          1
  2     2/3        4/3        -1/3
  3     6/11       18/11      -9/11        2/11
  4     12/25      48/25      -36/25       16/25      -3/25
  5     60/137     300/137    -300/137     200/137    -75/137      12/137
  6     60/147     360/147    -450/147     400/147    -225/147     72/147     -10/147

The coefficients for the cases of $p = 1, \ldots, 6$ are given in Table 8.4. The case $p = 1$ is simply the backward Euler method of (4.9) in Chapter 4. The truncation error for (8.32) can be obtained from the error formulas for numerical differentiation (e.g. see [11, (5.7.5)]),

$$T_n(Y) = -\frac{\beta}{p + 1}\, h^{p+1}\, Y^{(p+1)}(\xi_n) \tag{8.33}$$

for some $t_{n-p+1} \le \xi_n \le t_{n+1}$.

The regions of absolute stability for the formulas of Table 8.4 are given in Figure 8.3. To create these regions, we must find all values $h\lambda$ for which

$$|r_j(h\lambda)| < 1, \qquad j = 0, 1, \ldots, p, \tag{8.34}$$

where the characteristic roots $r_j(h\lambda)$ are the solutions of

$$r^p = \sum_{j=0}^{p-1} \alpha_j r^{p-1-j} + h\lambda\beta r^p. \tag{8.35}$$

It can be shown that for $p = 1$ and $p = 2$, the BDFs are A-stable, and that for $3 \le p \le 6$, the region of absolute stability becomes smaller as $p$ increases, although containing the entire negative real axis in each case. For $p \ge 7$, the regions of absolute stability are not acceptable for the solution of stiff problems. This is discussed in greater detail in the following section.
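The entries of Table 8.4 can be reproduced numerically. The MATLAB sketch below is an illustration and uses a different route from the Lagrange-basis derivation in the text: it assumes the equivalent characterization that (8.31) holds with equality for every polynomial $Y$ of degree $\le p$, and it solves the resulting $(p+1) \times (p+1)$ linear system with the scaling $t_{n+1} = 0$, $h = 1$. The variable names are hypothetical.

```matlab
% BDF coefficients from exactness of (8.31) on the monomials s^k, k = 0..p,
% where s = (t - t_{n+1})/h.  The node t_{n-j} corresponds to s = -(j+1).
p = 3;
s = -(1:p);                          % scaled nodes for j = 0, ..., p-1
A = zeros(p+1);  rhs = zeros(p+1,1);
for k = 0:p
    A(k+1,1:p) = s.^k;               % coefficients of alpha_j: Y(t_{n-j})
    A(k+1,p+1) = double(k == 1);     % coefficient of beta: h*Y'(t_{n+1})
    rhs(k+1)   = double(k == 0);     % left side: Y(t_{n+1})
end
c = A \ rhs;
alpha = c(1:p);  beta = c(p+1);
fprintf('beta  = %.6f\n', beta);
fprintf('alpha = %s\n', mat2str(alpha', 6));
```

For $p = 3$ the computed values agree with the row $\beta = 6/11$, $\alpha = 18/11, -9/11, 2/11$ of Table 8.4; changing `p` gives the other rows, although the linear system becomes less well conditioned as $p$ grows.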

8.3 STABILITY REGIONS FOR MULTISTEP METHODS

Recalling (7.1), all general multistep methods, including AB, AM, and BDF (and other) methods, can be represented as follows:

$$y_{n+1} = \sum_{j=0}^{p} a_j\, y_{n-j} + h \sum_{j=-1}^{p} b_j\, f(t_{n-j}, y_{n-j}). \tag{8.36}$$


[Figure 8.1: Stability regions for Adams–Bashforth methods AB1–AB4 in the complex $h\lambda$-plane (axes Re($h\lambda$) and Im($h\lambda$)). Note that AB1 is Euler's method.]

For the test equation $dY/dt = \lambda Y$, we have $f(t, Y) = \lambda Y$; and recalling (7.42), the characteristic polynomial for (8.36) is

$$0 = (1 - h\lambda b_{-1})\, r^{p+1} - \sum_{j=0}^{p} (a_j + h\lambda b_j)\, r^{p-j}. \tag{8.37}$$

The boundary of the stability region is where all roots of this characteristic equation have magnitude 1 or less, and at least one root has magnitude 1. We can find all the values of $h\lambda$ where one of the roots has magnitude 1. All roots with magnitude 1 can be represented as $r = e^{i\theta}$ with $i = \sqrt{-1}$. So we can find all $h\lambda$ where (8.37) holds with $r = e^{i\theta}$. Separating out $h\lambda$ gives

$$r^{p+1} - \sum_{j=0}^{p} a_j r^{p-j} = h\lambda \sum_{j=-1}^{p} b_j r^{p-j},$$

$$h\lambda = \left( r^{p+1} - \sum_{j=0}^{p} a_j r^{p-j} \right) \div \left( \sum_{j=-1}^{p} b_j r^{p-j} \right),$$

where $r = e^{i\theta}$ for $0 \le \theta \le 2\pi$ gives a set that includes the boundary of the stability region. With a little more care, we can identify which of the regions separated by this curve form the true stability region.

[Figure 8.2: Stability regions for Adams–Moulton methods AM1–AM4 in the complex $h\lambda$-plane (axes Re($h\lambda$) and Im($h\lambda$)); the region for AM1 lies outside the circle and the region for AM2 lies left of the line. Note that AM1 is the implicit Euler method, and AM2 is the trapezoidal method. Note the different scale on the axes as compared to Figure 8.1.]

Remark. From Section 7.3 of Chapter 7, the root condition (7.32)–(7.33) is necessary for convergence and stability of a multistep method. This form of stability is sometimes also called weak stability, as we ordinarily require additional stability conditions for a practical numerical method. Without the root condition, the method cannot be expected to produce numerical solutions that approach the true solution as $h \to 0$, regardless of the value of $\lambda$. The root condition sometimes fails for certain consistent multistep methods, but almost no one discusses those methods because they are useless except to explain the importance of stability! As a simple example of such a method, recall Example 7.34 from Section 7.3.
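The boundary-locus idea of this section, setting $r = e^{i\theta}$ and solving the characteristic equation for $h\lambda$, can be sketched in a few lines. This is an illustrative sketch only: the function name `boundary_locus_point` and its argument layout are assumptions, and the AB2 coefficients used are the standard ones for $y_{n+1} = y_n + \tfrac{h}{2}(3f_n - f_{n-1})$.

```python
import cmath
import math

def boundary_locus_point(a, b, b_minus1, theta):
    """Return h*lambda on the boundary-locus curve for r = exp(i*theta).

    a = [a_0, ..., a_p], b = [b_0, ..., b_p] and b_minus1 = b_{-1} are the
    coefficients of the general multistep method (8.36).
    """
    p = len(a) - 1
    r = cmath.exp(1j * theta)
    numerator = r ** (p + 1) - sum(a[j] * r ** (p - j) for j in range(p + 1))
    denominator = b_minus1 * r ** (p + 1) + sum(b[j] * r ** (p - j) for j in range(p + 1))
    return numerator / denominator

# Second-order Adams-Bashforth method written in the form (8.36) with p = 1.
ab2_a = [1.0, 0.0]
ab2_b = [1.5, -0.5]

# Sample the curve; plotting these points would trace the AB2 boundary.
curve = [boundary_locus_point(ab2_a, ab2_b, 0.0, 2 * math.pi * k / 200) for k in range(201)]

# The curve crosses the negative real axis at theta = pi.
ab2_crossing = boundary_locus_point(ab2_a, ab2_b, 0.0, math.pi)
euler_crossing = boundary_locus_point([1.0], [1.0], 0.0, math.pi)
```

For AB2 the crossing at $\theta = \pi$ is $h\lambda = -1$, and for Euler's method it is $h\lambda = -2$. As the text notes, the curve only contains the boundary; deciding which side is the stable region still requires checking the roots at a sample point.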

8.4 ADDITIONAL SOURCES OF DIFFICULTY

8.4.1 A-stability and L-stability

There are still problems with the BDF methods and with other methods that are chosen solely on the basis of their region of absolute stability. First, with the model equation $Y' = \lambda Y$, if $\mathrm{Real}(\lambda)$ is of large magnitude and negative, then the solution $Y(t)$ goes to zero very rapidly, and as $\mathrm{Real}(\lambda) \to -\infty$, the convergence to zero of $Y(t)$ becomes more rapid. We would like the same behavior to hold for the numerical solution $\{y_n\}$ of the model equation. To illustrate this idea, we show that the A-stable trapezoidal rule does not maintain this behavior.


[Figure 8.3: Stability regions for backward difference formula methods BDF1–BDF6 in the complex $h\lambda$-plane (axes Re($h\lambda$) and Im($h\lambda$)). Note that BDF1 is again the implicit Euler method. The labels are inside the stability region for the labeled method.]

Apply the trapezoidal method (4.22) to the model equation (8.1). Doing so leads to the numerical approximation

$$y_n = \left[ \frac{1 + \frac{1}{2}h\lambda}{1 - \frac{1}{2}h\lambda} \right]^n, \quad n \ge 0. \tag{8.38}$$

If $|\mathrm{Real}(\lambda)|$ is large, then the fraction inside the brackets is less than 1 in magnitude, but is nearly equal to $-1$; and thus $y_n$ decreases to 0 quite slowly. This suggests that the trapezoidal method may not be a completely satisfactory choice for stiff problems.

In comparison, the A-stable backward Euler method has the desired behavior. From (4.10) in Chapter 4, the solution of the model problem is

$$y_n = \left[ \frac{1}{1 - h\lambda} \right]^n, \quad n \ge 0.$$

As $|\lambda|$ increases, the sequence $\{y_n\}$ goes to zero more rapidly. Thus the backward Euler solution better reflects the behavior of the true solution of the model equation. An A-stable numerical method is called L-stable if at each fixed $t = t_n$, the numerical solution $y_n$ at $t_n$ satisfies $y_n \to 0$ as $\mathrm{Real}(\lambda) \to -\infty$. The trapezoidal rule is not L-stable, whereas the backward Euler method is L-stable. This material was explored earlier in Problems 14 and 15 of Chapter 4.
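The contrast between the two growth factors is easy to see numerically. The short sketch below evaluates the bracketed factors of the trapezoidal and backward Euler solutions for increasingly negative real $\lambda$; the function names and the choice $h = 0.1$ are assumptions made for illustration.

```python
def trapezoid_factor(z):
    """Growth factor of the trapezoidal method, the bracket in (8.38), with z = h*lambda."""
    return (1 + 0.5 * z) / (1 - 0.5 * z)

def backward_euler_factor(z):
    """Growth factor of the backward Euler method with z = h*lambda."""
    return 1 / (1 - z)

if __name__ == "__main__":
    h = 0.1  # assumed stepsize
    for lam in (-1e2, -1e4, -1e6):
        z = h * lam
        print(f"lambda = {lam:9.1e}  trapezoid: {trapezoid_factor(z): .6f}  "
              f"backward Euler: {backward_euler_factor(z): .3e}")
```

As $\lambda \to -\infty$ the trapezoidal factor approaches $-1$ (slowly decaying, sign-alternating iterates), while the backward Euler factor approaches 0.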


8.4.2 Time-varying problems and stability

A second problem with the use of stability regions to determine methods for stiff problems is that it is based on using constant $\lambda$ and linear problems. The linearization (8.10) is often valid, but not always. For example, consider the second-order linear problem

$$y'' + a y' + (1 + b \cos(2\pi t))\, y = g(t), \quad t \ge 0, \tag{8.39}$$

in which one coefficient is not constant. Convert it to the equivalent system

$$\begin{aligned} y_1' &= y_2, \\ y_2' &= -(1 + b \cos(2\pi t))\, y_1 - a y_2 + g(t). \end{aligned} \tag{8.40}$$

We assume $a > 0$, $|b| < 1$. The eigenvalues of the Jacobian matrix for this system are

$$\lambda = \frac{-a \pm \sqrt{a^2 - 4\,[1 + b \cos(2\pi t)]}}{2}. \tag{8.41}$$

These are either negative real numbers or complex numbers with negative real parts. On the basis of the stability theory for the constant coefficient (or constant $\Lambda$) case, we might be led to assume that the effect of all perturbations in the initial data for (8.40) would die away as $t \to \infty$. But in fact, the homogeneous part of (8.39) will have unbounded solutions. Thus there will be perturbations of the initial values that will lead to unbounded perturbed solutions in (8.39). This calls into question the validity of the use of the model equation $y' = \lambda y + g(t)$. Using the model equation (8.1) suggests methods that we may want to study further; but by itself, this approach is not sufficient to encompass the vast variety of linear and nonlinear problems. The example (8.39) is taken from Aiken [1, p. 269].

8.5 SOLVING THE FINITE-DIFFERENCE METHOD

We illustrate the difficulty in solving the finite difference equations by considering the backward Euler method,

$$y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}), \quad n \ge 0, \tag{8.42}$$

first for a single equation and then for a system of equations. For a single equation, we summarize the discussion involving (4.12)–(4.16) of Chapter 4. If the ordinary iteration formula

$$y_{n+1}^{(j+1)} = y_n + h f(t_{n+1}, y_{n+1}^{(j)}), \quad j \ge 0 \tag{8.43}$$

is used, then

$$y_{n+1} - y_{n+1}^{(j+1)} \approx h\, \frac{\partial f(t_{n+1}, y_{n+1})}{\partial y} \left[ y_{n+1} - y_{n+1}^{(j)} \right].$$

For convergence, we would need to have

$$\left| h\, \frac{\partial f(t_{n+1}, y_{n+1})}{\partial y} \right| < 1. \tag{8.44}$$

But with stiff equations, this would again force $h$ to be very small, which we are trying to avoid. Thus another rootfinding method must be used to solve for $y_{n+1}$ in (8.42).
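The convergence requirement can be seen directly on the model problem $f(t, y) = \lambda y$, for which each pass of the ordinary iteration multiplies the error by exactly $h\lambda$. The sketch below is illustrative only; the function name `fixed_point_errors` and the sample values of $h\lambda$ are assumptions.

```python
def fixed_point_errors(h_lambda, y_n=1.0, iterations=20):
    """Run the ordinary iteration (8.43) for f(t, y) = lambda*y.

    Returns the list of errors |y_{n+1} - y_{n+1}^{(j)}| for j = 1, ..., iterations,
    where y_{n+1} = y_n / (1 - h*lambda) is the exact backward Euler value.
    """
    exact = y_n / (1.0 - h_lambda)
    y = y_n  # crude initial guess y^{(0)} = y_n
    errors = []
    for _ in range(iterations):
        y = y_n + h_lambda * y
        errors.append(abs(exact - y))
    return errors

if __name__ == "__main__":
    print("h*lambda = -0.5:", fixed_point_errors(-0.5)[-1])  # contracts
    print("h*lambda = -5.0:", fixed_point_errors(-5.0)[-1])  # diverges
```

With $|h\lambda| < 1$ the errors shrink geometrically; with $|h\lambda| > 1$ they grow, even though the backward Euler equation itself has a perfectly good solution.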

The most popular methods for solving (8.42) are based on Newton's method and variants of it. For a single differential equation, Newton's method for finding $y_{n+1}$ is

$$y_{n+1}^{(j+1)} = y_{n+1}^{(j)} - \left[ 1 - h f_y(t_{n+1}, y_{n+1}^{(j)}) \right]^{-1} \left[ y_{n+1}^{(j)} - y_n - h f(t_{n+1}, y_{n+1}^{(j)}) \right] \tag{8.45}$$

for $j \ge 0$. A crude initial guess is $y_{n+1}^{(0)} = y_n$, although generally this can be improved on.

With a system of $m$ differential equations, as in (8.9), Newton's method is

$$\begin{aligned} \left[ I - h \mathbf{f}_y(t_{n+1}, \mathbf{y}_{n+1}^{(j)}) \right] \boldsymbol{\delta}_n^{(j)} &= \mathbf{y}_{n+1}^{(j)} - \mathbf{y}_n - h \mathbf{f}(t_{n+1}, \mathbf{y}_{n+1}^{(j)}), \\ \mathbf{y}_{n+1}^{(j+1)} &= \mathbf{y}_{n+1}^{(j)} - \boldsymbol{\delta}_n^{(j)}, \quad j \ge 0. \end{aligned} \tag{8.46}$$

This is a system of $m$ linear simultaneous equations for the vector $\boldsymbol{\delta}_n^{(j)} \in \mathbb{R}^m$, and such a linear system must be solved repeatedly at each step $t_n$. The matrix of coefficients changes with each iterate $\mathbf{y}_{n+1}^{(j)}$ and with each step $t_n$. This rootfinding procedure is usually costly to implement; consequently, we seek variants of Newton's method that require less computation time.

As one approach to decreasing the cost of (8.46), the matrix approximation

$$I - h \mathbf{f}_y(t_{n+1}, \mathbf{z}) \approx I - h \mathbf{f}_y(t_{n+1}, \mathbf{y}_{n+1}^{(j)}), \quad \text{some } \mathbf{z} \approx \mathbf{y}_n \tag{8.47}$$

is used for all $j$ and for a number of successive steps $t_n$. Thus Newton's method (8.46) is approximated by

$$\begin{aligned} \left[ I - h \mathbf{f}_y(t_{n+1}, \mathbf{z}) \right] \boldsymbol{\delta}_n^{(j)} &= \mathbf{y}_{n+1}^{(j)} - \mathbf{y}_n - h \mathbf{f}(t_{n+1}, \mathbf{y}_{n+1}^{(j)}), \\ \mathbf{y}_{n+1}^{(j+1)} &= \mathbf{y}_{n+1}^{(j)} - \boldsymbol{\delta}_n^{(j)}, \quad j \ge 0. \end{aligned} \tag{8.48}$$

This amounts to solving a number of linear systems with the same coefficient matrix. This can be done much more cheaply than when the matrix is being modified with each iteration and each new step $t_n$. The matrix in (8.47) will have to be updated periodically, but the savings will still be very significant when compared to an exact Newton method. For a further discussion of this topic, see Aiken [1, p. 7].
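A scalar version of the modified Newton iteration shows the idea with very little code. This is a minimal sketch under stated assumptions, not the implementation used by any production solver: the function name, the tolerance, and the test problem $y' = -100y^3$, $y(0) = 1$ are all hypothetical choices, and the derivative is frozen at $z = y_n$ for the whole step.

```python
def backward_euler_step_modified_newton(f, fy, t_next, y_n, h, tol=1e-12, max_iter=200):
    """One backward Euler step (8.42) for a scalar ODE.

    The nonlinear equation is solved by the modified Newton iteration
    (8.47)-(8.48), with the derivative frozen at z = y_n.
    Returns (y_{n+1}, number of iterations used).
    """
    frozen = 1.0 - h * fy(t_next, y_n)  # scalar analogue of I - h f_y(t_{n+1}, z)
    y = y_n                             # initial guess y^{(0)} = y_n
    for iteration in range(1, max_iter + 1):
        residual = y - y_n - h * f(t_next, y)
        delta = residual / frozen
        y -= delta
        if abs(delta) < tol:
            return y, iteration
    raise RuntimeError("modified Newton iteration did not converge")

# Hypothetical test problem (not from the text): y' = -100 y^3, y(0) = 1.
def f(t, y):
    return -100.0 * y ** 3

def fy(t, y):
    return -300.0 * y ** 2

y1, iterations_used = backward_euler_step_modified_newton(f, fy, 0.01, 1.0, 0.01)
```

Freezing the derivative makes each iteration cheap but reduces the convergence from quadratic to linear, so more iterations are needed than with the exact Newton method (8.45); this is the trade-off described in the text.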

8.6 COMPUTER CODES

The MATLAB program ode15s is used to solve stiff ordinary differential equations. It is based on a modification of the variable order family of BDF methods discussed earlier in the chapter. Details of the actual methods and their implementation can be found in Shampine and Reichelt [73, Section 2]. The nonlinear finite difference system (see (8.42) for the backward Euler method) at each timestep $t_n$ is solved by a variant of the modified Newton method of (8.47)–(8.48). The program ode15s is used in precisely the same manner as the program ode45 discussed in Chapter 5 and the program ode113 of Chapter 6; and the entire suite of MATLAB ODE programs is discussed at length in [73], [74].

A package of programs called Sundials [46] includes state-of-the-art programs for solving initial value problems for ordinary differential equations, including stiff equations, and differential algebraic equations. Included is an interface for use with MATLAB. The Sundials package is the latest in a sequence of excellent programs from the national energy laboratories (especially Lawrence-Livermore Laboratory and Sandia Laboratory) in the USA, developed over more than 30 years for use in solving ordinary differential equations.

A general presentation of the method of lines is given in Schiesser [71]. For some older "method of lines" codes to solve systems of nonlinear parabolic partial differential equations in one and two space variables, see Sincovec and Madsen [75] and Melgaard and Sincovec [63]. For use with MATLAB, the Partial Differential Equations Toolbox solves partial differential equations, and it contains a "method of lines" code to solve parabolic equations. It also makes use of the MATLAB suite of programs for solving ordinary differential equations.

Example 8.4 We modify the program test_ode45 of Section 5.5 by replacing ode45 with ode15s throughout the code. We illustrate the use of ode15s with the earlier example (8.8), solving it on $[0, 20]$ and using AbsTol $= 10^{-6}$, RelTol $= 10^{-4}$. We choose $\lambda$ to be negative, but allow it to have a large magnitude, as in Example 8.2 for the Adams–Bashforth method of order 2 (see Table 8.1). As a comparison to ode15s, we also give the results obtained using ode45 and ode113. We give the number of needed derivative evaluations with the three programs, and we also give the maximum error in the computed solution over $[0, 20]$. This maximum error is for the interpolated solution at the points defined in the test program test_ode15s. The results, shown in Table 8.5, indicate clearly that as the stiffness increases (or as $|\lambda|$ increases), the efficiencies of ode45 and ode113 decrease. In comparison, the code ode15s is relatively unaffected by the increasing magnitude of $|\lambda|$.

Table 8.5 Comparison of ode15s, ode45, and ode113 for the stiff equation (8.8)

                          ode15s     ode45      ode113
λ = −1
  Maximum error           5.44e−4    1.43e−4    3.40e−4
  Function evaluations    235        229        132
λ = −10
λ = −50
λ = −500

PROBLEMS

1. Derive the BDF method of order 2.

2. Consider the BDF method of order 2. Show that its region of absolute stability contains the negative real axis, $-\infty < h\lambda < 0$.

3. Using the BDF method of order 2, repeat the calculations in Example 8.2. Comment on your results.
Hint: Note that the linearity of the test equation (8.8) allows the implicit BDF equation for $y_{n+1}$ to be solved explicitly; iteration is unnecessary.

4. Implement MOL_Euler. Use it to experiment with various choices of $\delta$ and $h$ with the true solution $U = e^{-0.1t} \sin(\pi x)$. Use some values of $\delta$ and $h$ that satisfy (8.24) and others not satisfying it. Comment on your results.

5. Implement MOL_Euler and MOL_BEuler. Experiment as in Example 8.3. Use various values of $h$ and $\delta$. Do so for the following true solutions $U$ (note that the functions $d_0$, $d_1$, $f$, and $G$ are determined from the known test case $U$):

(a) $U = x^4 + t^2$.

(b) $U = (1 - e^{-t}) \cos(\pi x)$.

(c) $U = \exp(1/(t+1)) \cos(\pi x)$.

CHAPTER 9

IMPLICIT RK METHODS FOR STIFF DIFFERENTIAL EQUATIONS

Runge–Kutta methods were introduced in Chapter 5, and we now want to consider them as a means of solving stiff differential equations. When working with multistep methods in Chapter 8, we needed to use implicit methods in order to solve stiff equations; the same is true with Runge–Kutta methods. Also, as with multistep methods, we need to develop the appropriate stability theory and carefully analyze what happens when we apply these methods to stiff equations.

9.1 FAMILIES OF IMPLICIT RUNGE–KUTTA METHODS

Runge–Kutta methods can be used for stiff differential equations. However, we need implicit Runge–Kutta methods, which were introduced in Section 5.6 of Chapter 5. The general forms of these equations, for a method with $s$ stages, are as follows:

$$z_{n,i} = y_n + h \sum_{j=1}^{s} a_{i,j}\, f(t_n + c_j h, z_{n,j}), \quad i = 1, \ldots, s, \tag{9.1}$$

$$y_{n+1} = y_n + h \sum_{j=1}^{s} b_j\, f(t_n + c_j h, z_{n,j}). \tag{9.2}$$

Note that the equation for $z_{n,i}$ involves all the $z_{n,j}$ values. So for implicit Runge–Kutta methods we need to solve an extended system of equations. If each $y_n$ is a real number, then we have a system of $s$ equations in $s$ unknowns for each timestep. If each $y_n$ is a vector of dimension $N$, then we have a system of $Ns$ equations in $Ns$ unknowns. As in Chapter 5, we can represent implicit Runge–Kutta methods in terms of Butcher tableaus (see (5.26)),

$$\begin{array}{c|ccccc}
c_1 & a_{1,1} & a_{1,2} & \cdots & a_{1,s-1} & a_{1,s} \\
c_2 & a_{2,1} & a_{2,2} & \cdots & a_{2,s-1} & a_{2,s} \\
c_3 & a_{3,1} & a_{3,2} & \cdots & a_{3,s-1} & a_{3,s} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
c_s & a_{s,1} & a_{s,2} & \cdots & a_{s,s-1} & a_{s,s} \\
\hline
 & b_1 & b_2 & \cdots & b_{s-1} & b_s
\end{array}
\quad \text{or} \quad
\begin{array}{c|c}
c & A \\
\hline
 & b^T
\end{array} \tag{9.3}$$

Some implicit methods we have already seen are actually implicit Runge–Kutta methods, namely, the backward Euler method and the trapezoidal rule. Their Butcher tableaus are shown in Tables 9.1 and 9.2.

Table 9.1 Butcher tableau - backward Euler method

$$\begin{array}{c|c}
1 & 1 \\
\hline
 & 1
\end{array}$$

Table 9.2 Butcher tableau - trapezoidal method

$$\begin{array}{c|cc}
0 & 0 & 0 \\
1 & 1/2 & 1/2 \\
\hline
 & 1/2 & 1/2
\end{array}$$

These methods are also BDF methods. However, higher-order BDF methods require $y_{n-1}$ to compute $y_{n+1}$ and so they are not Runge–Kutta methods.

Higher-order Runge–Kutta methods have been developed, although the conditions that need to be satisfied for such Runge–Kutta methods to have order $p$ become very complex for large $p$. Nevertheless, a few families of Runge–Kutta methods with arbitrarily high-order accuracy have been created. One such family is the set of Gauss methods given in (5.63)–(5.64) of Chapter 5; they are closely related to Gauss–Legendre quadrature for approximating integrals. These have the property that the $c_i$ values are the roots of the Legendre polynomial

$$\frac{d^s}{dx^s} \left[ x^s (1 - x)^s \right].$$

The other coefficients of these methods can be determined from the $c_i$ values by means of the so-called simplifying assumptions of Butcher [23]:

$$B(p): \quad \sum_{i=1}^{s} b_i c_i^{k-1} = \frac{1}{k}, \quad k = 1, 2, \ldots, p, \tag{9.4}$$

$$C(q): \quad \sum_{j=1}^{s} a_{ij} c_j^{k-1} = \frac{c_i^k}{k}, \quad k = 1, 2, \ldots, q, \quad i = 1, 2, \ldots, s, \tag{9.5}$$

$$D(r): \quad \sum_{i=1}^{s} b_i c_i^{k-1} a_{ij} = \frac{b_j}{k} \left( 1 - c_j^k \right), \quad k = 1, 2, \ldots, r, \quad j = 1, 2, \ldots, s. \tag{9.6}$$

Condition $B(p)$ says that the quadrature formula

$$\int_t^{t+h} f(s)\, ds \approx h \sum_{i=1}^{s} b_i\, f(t + c_i h)$$

is exact for all polynomials of degree $< p$. If this condition is satisfied, we say that the Runge–Kutta method has quadrature order $p$. Condition $C(q)$ says that the corresponding quadrature formulas on $[t, t + c_i h]$, namely

$$\int_t^{t + c_i h} f(s)\, ds \approx h \sum_{j=1}^{s} a_{ij}\, f(t + c_j h),$$

are exact for all polynomials of degree $< q$. If this condition is satisfied, we say that the Runge–Kutta method has stage order $q$. The importance of these assumptions is demonstrated in the following theorem of Butcher [23, Thm. 7].

Theorem 9.1 If a Runge–Kutta method satisfies conditions $B(p)$, $C(q)$, and $D(r)$ with $p \le q + r + 1$ and $p \le 2q + 2$, its order of accuracy is $p$.

We can use this theorem to construct the Gauss methods. First we choose $\{c_i\}$, the quadrature points of the Gaussian quadrature. This can be done by looking up tables of these numbers, and then scaling and shifting them from the interval $[-1, +1]$ to $[0, 1]$. Alternatively, they can be computed as zeros of appropriate Legendre polynomials [11, Section 5.3]. We then choose the quadrature weights $b_i$ to make $B(p)$ true for as large a value of $p$ as possible. For the Gaussian quadrature points, this is $p = 2s$. Note that if condition $B(p)$ fails, then the method cannot have order $p$.

This leaves us with the $s^2$ coefficients $a_{ij}$ to find. These can be determined by applying conditions $C(q)$ and $D(r)$ with sufficiently large $q$ and $r$. Fortunately, there are some additional relationships between these conditions. It turns out that if $B(q + r)$ and $C(q)$ hold, then $D(r)$ holds as well. Also if $B(q + r)$ and $D(r)$ hold, then so does $C(q)$ [23, Thms. 3, 4, 5 & 6].

So we just need to satisfy $C(s)$ in addition. Then $B(2s)$ and $C(s)$ together imply $D(s)$; setting $q = r = s$ and $p = 2s$ in Theorem 9.1 gives us a method of order $2s$. Imposing condition $C(s)$ gives us exactly $s^2$ linear equations for the $a_{ij}$ values, which can be easily solved. Thus the order of the $s$-stage Gauss method is $2s$.

Some Gauss methods are shown in Tables 9.3–9.5. For the derivation of these formulas, refer back to Section 5.6 in Chapter 5. The two-point Gauss method was given in (5.70)–(5.71) of Section 5.6.1.
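The construction just described (choose the nodes, then solve linear systems for the weights and the $a_{ij}$) can be sketched for the two-stage case. The code below is an illustrative sketch: the helper names `solve` and `coefficients_from_nodes` are assumptions, and the linear solver is a plain Gaussian elimination written out so that the example needs only the standard library.

```python
import math

def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def coefficients_from_nodes(c):
    """Given nodes c_i, return (A, b) satisfying B(s) and C(s); see (9.4)-(9.5)."""
    s = len(c)
    # Row k-1 holds c_j^(k-1), so each condition is one linear equation.
    powers = [[c[j] ** (k - 1) for j in range(s)] for k in range(1, s + 1)]
    b = solve(powers, [1.0 / k for k in range(1, s + 1)])
    A = [solve(powers, [c[i] ** k / k for k in range(1, s + 1)]) for i in range(s)]
    return A, b

# Two-point Gauss nodes on [0, 1].
c = [(3 - math.sqrt(3)) / 6, (3 + math.sqrt(3)) / 6]
A, b = coefficients_from_nodes(c)
```

The weights come out as $b_1 = b_2 = 1/2$, and although only $B(2)$ was imposed, the sums $\sum_i b_i c_i^{k-1}$ also equal $1/k$ for $k = 3, 4$, which is the statement that $B(2s)$ holds at the Gauss points.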

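Table 9.3 Butcher tableau for Gauss method of order 2

$$\begin{array}{c|c}
1/2 & 1/2 \\
\hline
 & 1
\end{array}$$

Table 9.4 Butcher tableau for Gauss method of order 4

$$\begin{array}{c|cc}
(3 - \sqrt{3})/6 & 1/4 & (3 - 2\sqrt{3})/12 \\
(3 + \sqrt{3})/6 & (3 + 2\sqrt{3})/12 & 1/4 \\
\hline
 & 1/2 & 1/2
\end{array}$$

Table 9.5 Butcher tableau for Gauss method of order 6

$$\begin{array}{c|ccc}
(5 - \sqrt{15})/10 & 5/36 & 2/9 - \sqrt{15}/15 & 5/36 - \sqrt{15}/30 \\
1/2 & 5/36 + \sqrt{15}/24 & 2/9 & 5/36 - \sqrt{15}/24 \\
(5 + \sqrt{15})/10 & 5/36 + \sqrt{15}/30 & 2/9 + \sqrt{15}/15 & 5/36 \\
\hline
 & 5/18 & 4/9 & 5/18
\end{array}$$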

There are some issues that Gauss methods do not address, and so a number of closely related methods have been developed. The most important of these are the Radau methods, particularly the Radau IIA methods. For the Radau IIA methods the $c_i$ terms are roots of the polynomial

$$\frac{d^{s-1}}{dx^{s-1}} \left[ x^{s-1} (1 - x)^s \right].$$

In particular, we have $c_s = 1$, as we can see in Tables 9.6 and 9.7, which show the lower-order Radau IIA methods. The simplifying assumptions satisfied by the Radau IIA methods are $B(2s-1)$, $C(s)$, and $D(s-1)$, so that the order of a Radau IIA method is $2s - 1$. The order 1 Radau IIA method is just the implicit Euler method, given in Table 9.1. The derivation of these formulas is similar to that for the Gauss formulas, only now we are using Radau quadrature rules rather than Gauss–Legendre quadrature rules; see Section 5.6.

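Table 9.6 Butcher tableau for Radau method of order 3

$$\begin{array}{c|cc}
1/3 & 5/12 & -1/12 \\
1 & 3/4 & 1/4 \\
\hline
 & 3/4 & 1/4
\end{array}$$

Table 9.7 Butcher tableau for Radau method of order 5

$$\begin{array}{c|ccc}
(4 - \sqrt{6})/10 & (88 - 7\sqrt{6})/360 & (296 - 169\sqrt{6})/1800 & (-2 + 3\sqrt{6})/225 \\
(4 + \sqrt{6})/10 & (296 + 169\sqrt{6})/1800 & (88 + 7\sqrt{6})/360 & (-2 - 3\sqrt{6})/225 \\
1 & (16 - \sqrt{6})/36 & (16 + \sqrt{6})/36 & 1/9 \\
\hline
 & (16 - \sqrt{6})/36 & (16 + \sqrt{6})/36 & 1/9
\end{array}$$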

A third family of Runge–Kutta methods worth considering is the Lobatto IIIC methods; the $c_j$ values are the roots of the polynomial

$$\frac{d^{s-2}}{dx^{s-2}} \left[ x^{s-1} (1 - x)^{s-1} \right],$$

and we use the simplifying conditions $B(2s-2)$, $C(s-1)$, and $D(s-1)$. The Lobatto IIIC methods have $c_1 = 0$ and $c_s = 1$. The order of the $s$-stage Lobatto IIIC method is $2s - 2$.

Other Runge–Kutta methods have been developed to handle various other issues. For example, while general implicit Runge–Kutta methods with $s$ stages require the solution of a system of $Ns$ equations in $Ns$ unknowns, some implicit Runge–Kutta methods require the solution of a sequence of $s$ systems of $N$ equations in $N$ unknowns. This is often simpler than solving $Ns$ equations in $Ns$ unknowns. These methods are known as diagonally implicit Runge–Kutta methods (DIRK methods). For these methods we take $a_{i,j} = 0$ whenever $i < j$. Two examples of DIRKs are given in Table 9.8. The method of Alexander [2] is an order 3 method with three stages. The method of Crouzeix and Raviart [31] is an order 4 method with three stages. The constants in Alexander's method are

$$\begin{aligned}
\alpha &= \text{the root of } x^3 - 3x^2 + \tfrac{3}{2}x - \tfrac{1}{6} \text{ in } \left( \tfrac{1}{6}, \tfrac{1}{2} \right), \\
\tau_2 &= \tfrac{1}{2}(1 + \alpha), \\
b_1 &= -\tfrac{1}{4}(6\alpha^2 - 16\alpha + 1), \\
b_2 &= \tfrac{1}{4}(6\alpha^2 - 20\alpha + 5).
\end{aligned}$$

The constants in Crouzeix and Raviart's method are given by

$$\gamma = \frac{1}{\sqrt{3}} \cos\left( \frac{\pi}{18} \right) + \frac{1}{2}, \qquad \delta = \frac{1}{6(2\gamma - 1)^2}.$$

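There are a large number of DIRK methods, and some of them can be found, for example, in Hairer and Wanner's text [44].

Table 9.8 Butcher tableau for DIRK methods

(a) Method of Alexander

$$\begin{array}{c|ccc}
\alpha & \alpha & & \\
\tau_2 & \tau_2 - \alpha & \alpha & \\
1 & b_1 & b_2 & \alpha \\
\hline
 & b_1 & b_2 & \alpha
\end{array}$$

(b) Method of Crouzeix & Raviart

$$\begin{array}{c|ccc}
\gamma & \gamma & & \\
1/2 & 1/2 - \gamma & \gamma & \\
1 - \gamma & 2\gamma & 1 - 4\gamma & \gamma \\
\hline
 & \delta & 1 - 2\delta & \delta
\end{array}$$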
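The constants in Table 9.8 are defined implicitly, so it can be helpful to evaluate them numerically and confirm a few low-order conditions from (9.4). The sketch below is illustrative: the bisection helper and its tolerance are assumptions, not part of either published method.

```python
import math

def bisect(func, lo, hi, tol=1e-14):
    """Bisection for a root of func in [lo, hi]; func(lo) and func(hi) must differ in sign."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Constants of Alexander's method.
alpha = bisect(lambda x: x ** 3 - 3 * x ** 2 + 1.5 * x - 1.0 / 6.0, 1.0 / 6.0, 0.5)
tau2 = 0.5 * (1 + alpha)
b1 = -0.25 * (6 * alpha ** 2 - 16 * alpha + 1)
b2 = 0.25 * (6 * alpha ** 2 - 20 * alpha + 5)

# Constants of the Crouzeix-Raviart method.
gamma = math.cos(math.pi / 18) / math.sqrt(3) + 0.5
delta = 1.0 / (6 * (2 * gamma - 1) ** 2)

# Low-order quadrature checks from B(p): sum b_i = 1, sum b_i c_i = 1/2, sum b_i c_i^2 = 1/3.
alexander_sum_b = b1 + b2 + alpha
alexander_sum_bc = b1 * alpha + b2 * tau2 + alpha * 1.0
cr_sum_bc2 = delta * gamma ** 2 + (1 - 2 * delta) * 0.25 + delta * (1 - gamma) ** 2

if __name__ == "__main__":
    print("alpha =", alpha, " gamma =", gamma, " delta =", delta)
```

Note that the Crouzeix–Raviart method has $\gamma > 1$, so its first and third nodes lie outside $[t_n, t_{n+1}]$.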

9.2 STABILITY OF RUNGE–KUTTA METHODS

Implicit Runge–Kutta methods need the same kind of stability properties as found in multistep methods if they are to be useful in solving stiff differential equations. Fortunately, most of the stability aspects can be derived using some straightforward linear algebra.

Consider the model differential equation

Y ′ = λY.

Following (9.1)–(9.2), denote $z_n^T = [z_{n,1}, z_{n,2}, \ldots, z_{n,s}]$. Apply (9.1)–(9.2) to this differential equation:

$$\begin{aligned}
z_n &= y_n e + h\lambda A z_n, \\
y_{n+1} &= y_n + h\lambda b^T z_n.
\end{aligned}$$

Here $e^T = [1, 1, \ldots, 1]$ is the $s$-dimensional vector of all ones. Some easy algebra gives

$$y_{n+1} = \left[ 1 + h\lambda b^T (I - h\lambda A)^{-1} e \right] y_n = R(h\lambda)\, y_n.$$

The stability function is

$$R(\eta) = 1 + \eta\, b^T (I - \eta A)^{-1} e. \tag{9.7}$$

As before, the Runge–Kutta method is A-stable if $|R(\eta)| < 1$ for all complex $\eta$ with $\mathrm{Real}\, \eta < 0$.
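The stability function (9.7) can be evaluated directly from a Butcher tableau. The following sketch does this for small tableaus using only the standard library; the helper names `solve` and `stability_function` are assumptions, and the closed forms used for comparison are the familiar rational expressions for the backward Euler, trapezoidal, and two-stage Gauss methods.

```python
import math

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting; works for real or complex entries."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def stability_function(A, b, eta):
    """Evaluate R(eta) = 1 + eta * b^T (I - eta*A)^{-1} e from (9.7)."""
    s = len(b)
    system = [[(1.0 if i == j else 0.0) - eta * A[i][j] for j in range(s)] for i in range(s)]
    w = solve(system, [1.0] * s)  # w = (I - eta*A)^{-1} e
    return 1 + eta * sum(b[i] * w[i] for i in range(s))

# Tableaus from Tables 9.1, 9.2 and 9.4.
backward_euler = ([[1.0]], [1.0])
trapezoid = ([[0.0, 0.0], [0.5, 0.5]], [0.5, 0.5])
r3 = math.sqrt(3.0)
gauss2 = ([[0.25, (3 - 2 * r3) / 12], [(3 + 2 * r3) / 12, 0.25]], [0.5, 0.5])

eta = -3.0 + 2.0j
R_be = stability_function(*backward_euler, eta)
R_tr = stability_function(*trapezoid, eta)
R_g2 = stability_function(*gauss2, eta)
```

Sampling $|R(\eta)|$ over a grid in the left half-plane gives a quick numerical plausibility check of A-stability, although it is of course not a proof.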

All Gauss (Tables 9.3–9.5), Radau IIA (Tables 9.1, 9.6, 9.7), and some DIRK methods (Table 9.8) are A-stable, which makes them stable for any $\lambda$ with $\mathrm{Real}\, \lambda < 0$. However, this does not necessarily make them accurate. For more on this topic, see the following section on order reduction.

For nonlinear problems, there is another form of stability that is very useful, called B-stability. This is based on differential equations

$$Y' = f(t, Y), \quad Y(t_0) = Y_0,$$

where $f(t, y)$ satisfies only a one-sided Lipschitz condition:

$$(y - z)^T \left( f(t, y) - f(t, z) \right) \le \mu\, \|y - z\|^2.$$

If $f(t, y)$ is Lipschitz in $y$ with Lipschitz constant $L$ (see (1.10) in Chapter 1), then it automatically satisfies the one-sided Lipschitz condition with $\mu = L$. However, the reverse need not hold. For example, the system of differential equations (8.16) obtained for the heat equation in Section 8.1 satisfies the one-sided Lipschitz condition with $\mu = 0$, no matter how fine the discretization. The ordinary Lipschitz constant, however, is roughly proportional to $m^2$, where $m$ is the number of grid points chosen for the space discretization.

The importance of one-sided Lipschitz conditions is that they are closely related to stability of the differential equation. In particular, if

$$\begin{aligned}
Y' &= f(t, Y), \quad Y(t_0) = Y_0, \\
Z' &= f(t, Z), \quad Z(t_0) = Z_0,
\end{aligned}$$

and $f(t, y)$ satisfies the one-sided Lipschitz condition with constant $\mu$, then

$$\|Y(t) - Z(t)\| \le e^{\mu(t - t_0)}\, \|Y_0 - Z_0\|.$$

This can be seen by differentiating

$$m(t) = \|Y(t) - Z(t)\|^2 = (Y(t) - Z(t))^T (Y(t) - Z(t))$$

as follows:

$$\begin{aligned}
m'(t) &= 2\, (Y(t) - Z(t))^T (Y'(t) - Z'(t)) \\
&= 2\, (Y(t) - Z(t))^T \left[ f(t, Y(t)) - f(t, Z(t)) \right] \\
&\le 2\mu\, \|Y(t) - Z(t)\|^2 = 2\mu\, m(t).
\end{aligned}$$

Hence

$$m(t) \le e^{2\mu(t - t_0)}\, m(t_0),$$

and taking square roots gives

$$\|Y(t) - Z(t)\| \le e^{\mu(t - t_0)}\, \|Y_0 - Z_0\|.$$


The case where the one-sided Lipschitz constant $\mu$ is zero means that the differential equation is contractive; that is, different solutions cannot become further apart or separated. If we require that the numerical solution be also contractive ($\|y_{n+1} - z_{n+1}\| \le \|y_n - z_n\|$ for any two numerical solutions $y_k$ and $z_k$) whenever $\mu = 0$, then the method is called B-stable [24]. This condition seems very useful, but rather difficult to check. Fortunately, a simple and easy condition to test was found independently in [22] and [30]: namely, if

$$b_i \ge 0 \quad \text{for all } i \tag{9.8}$$

and

$$M = \left[ b_i a_{ij} + b_j a_{ji} - b_i b_j \right]_{i,j=1}^{s} \text{ is positive semidefinite} \tag{9.9}$$

(i.e., $w^T M w \ge 0$ for all vectors $w$), then the Runge–Kutta method is B-stable. Testing a matrix $M$ for being positive semidefinite is actually quite easy. One test is to compute the eigenvalues of $M$ if $M$ is symmetric. If all eigenvalues are $\ge 0$, then $M$ is positive semidefinite. For a nonsymmetric matrix $M$, it is positive semidefinite if all the eigenvalues of the matrix $(M + M^T)/2$ are nonnegative.
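For two-stage methods the test (9.8)–(9.9) can be carried out exactly with rational arithmetic, since a symmetric $2 \times 2$ matrix is positive semidefinite exactly when its diagonal entries and its determinant are nonnegative. The sketch below is illustrative; the function names are assumptions, and it is restricted to $s = 2$ to avoid an eigenvalue routine.

```python
from fractions import Fraction as F

def algebraic_stability_matrix(A, b):
    """Return M = [b_i a_ij + b_j a_ji - b_i b_j] from (9.9)."""
    s = len(b)
    return [[b[i] * A[i][j] + b[j] * A[j][i] - b[i] * b[j] for j in range(s)]
            for i in range(s)]

def passes_b_stability_test_2x2(A, b):
    """Check conditions (9.8)-(9.9) for a two-stage method.

    M is symmetric by construction, so for s = 2 it is positive semidefinite
    exactly when both diagonal entries and the determinant are nonnegative.
    """
    M = algebraic_stability_matrix(A, b)
    weights_ok = all(w >= 0 for w in b)
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    psd = M[0][0] >= 0 and M[1][1] >= 0 and det >= 0
    return weights_ok and psd

# Radau IIA of order 3 (Table 9.6) and the trapezoidal method (Table 9.2).
radau3_A = [[F(5, 12), F(-1, 12)], [F(3, 4), F(1, 4)]]
radau3_b = [F(3, 4), F(1, 4)]
trap_A = [[F(0), F(0)], [F(1, 2), F(1, 2)]]
trap_b = [F(1, 2), F(1, 2)]

radau3_ok = passes_b_stability_test_2x2(radau3_A, radau3_b)
trap_ok = passes_b_stability_test_2x2(trap_A, trap_b)
```

For the order 3 Radau IIA method, $M = \tfrac{1}{16}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$, which is positive semidefinite. For the trapezoidal method the entry $M_{11} = -1/4$ is negative, so the test does not certify it; since (9.8)–(9.9) are stated here as a sufficient condition, a failed test by itself is not a proof that a method lacks B-stability.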

If a method is B-stable, then it is A-stable. To see this, for a B-stable method we can look at the differential equation

$$Y' = \begin{bmatrix} \alpha & +\beta \\ -\beta & \alpha \end{bmatrix} Y,$$

which has the one-sided Lipschitz constant $\mu = 0$ if $\alpha \le 0$. The eigenvalues of this $2 \times 2$ matrix are $\alpha \pm i\beta$, which are in the left half of the complex plane if $\alpha < 0$. So if a method is B-stable, then $\alpha \le 0$ implies that the numerical solution is contractive, and thus the stability region includes the left half-plane; that is, the method is A-stable.

This test for B-stability quickly leads to the realization that a number of important families of implicit Runge–Kutta methods are B-stable, such as the Gauss methods, the Radau IA, and the Radau IIA methods. The DIRK method in Table 9.8 (part b) is, however, A-stable but not B-stable. What does this mean in practice? For strongly nonlinear problems, A-stability may not suffice to ensure good behavior of the numerical method, especially if we consider integration for long time periods. It also means that Gauss or Radau IIA methods are probably better than DIRK methods despite the extra computational cost of the Gauss and Radau methods.

9.3 ORDER REDUCTION

Stability is clearly necessary, but it is not sufficient to obtain accurate solutions to stiff systems of ordinary differential equations. A phenomenon that is commonly observed is that when applied to stiff problems, many implicit methods do not seem to achieve the order of accuracy that is expected for the method. This phenomenon is called order reduction [44, pp. 225–228].

Order reduction occurs for certain Runge–Kutta methods, but not for BDF methods.

[Figure 9.1: Error norms for the test equation (9.10). Log–log plot of the error at $t = 1$ against the number of steps $n$, with reference curves $10^{-1}/n^2$, $300/n^4$, and $2u\, n^{1/2}$.]

Example 9.2 Consider, for example, the fourth-order Gauss method with $s = 2$ (see Tables 9.3–9.5). Figure 9.1 shows how the error behaves for a test equation

$$Y' = D\,(Y - g(t)) + g'(t), \quad Y(0) = g(0). \tag{9.10}$$

For this particular example, $D$ is a $100 \times 100$ diagonal matrix with negative diagonal entries randomly generated in the range from $-2^{-20}$ to $-2^{+20} \approx -10^6$. The diagonal entries are exponentials of uniformly distributed pseudo-random values. The function $g(t)$ likewise involves pseudo-random numbers, but is a smooth function of $t$. The exact solution is $Y(t) = g(t)$, so we can easily compute errors in the numerical solution. For the function $g(t)$ we used $g(t) = \cos(t)\, z_1 - \exp(-t)\, z_2$ with $z_1$, $z_2$ randomly generated vectors using a normal distribution.

Note that the Gauss method with $s = 2$ is a fourth-order method, so that we expect the errors to be $O(h^4)$ as the stepsize becomes small. But this ignores two factors: (1) the hidden constant in the $O$ expression may be quite large because of the stiffness of the differential equation, and (2) asymptotic results like this are true provided $h$ is "small enough". How small is "small enough" depends on the problem, and for stiff differential equations, this can depend on how stiff the equation is. Make the stiffness go to infinity, and the limit for "small enough" may go to zero. If that happens, then the standard convergence theory may be meaningless for practical stiff problems.

As can be seen from Figure 9.1, the error for larger values of $h$ seems to behave more like $O(h^2)$ than $O(h^4)$. Also, for smaller values of $h$ we see $O(h^4)$ error behavior (as we might expect), but with a large value for the hidden constant inside the $O$. For very small $h$ and many steps, we see that roundoff error from floating-point arithmetic limits the accuracy possible with this method. (The quantity $u$ in the graph denotes the unit round of the floating-point arithmetic.) If we increase the stiffness of the problem as we reduce the size of $h$, we might only see the $O(h^2)$ behavior of the error. This is the effect of order reduction.

Order reduction can be explained in terms of the following simple version of the test differential equation (9.10),

$$Y'(t) = \lambda\,(Y - g(t)) + g'(t), \quad Y(t_0) = g(t_0). \tag{9.11}$$

The exact solution is $Y(t) = g(t)$ for all $t$. However, the numerical solution of this is not exact, particularly if $h\lambda$ is large. What we want to find out is the magnitude of the error in terms of $h$ independently of $\lambda h$. This can be different from the order of the error for fixed $\lambda$ as $h \to 0$. The Runge–Kutta equations are

$$z_{n,i} = y_n + h \sum_{j=1}^{s} a_{ij}\, f(t_n + c_j h, z_{n,j}), \quad i = 1, 2, \ldots, s.$$

From this formula, it seems that the intention is for $z_{n,i} \approx Y(t_n + c_i h)$. Consider for a moment the even simpler test problem

$$\frac{dY}{dt} = g'(t), \quad Y(t_0) = g(t_0).$$

The stage order of a Runge–Kutta method comes from the order of the error in the approximation $z_{n,i} \approx g(t_n + c_i h)$,

$$g(t_n + c_i h) = g(t_n) + h \sum_{j=1}^{s} a_{ij}\, g'(t_n + c_j h) + O(h^{q+1})$$

for all $i$, indicating a stage order of $q$. The quadrature order is the order of the final formula for this very simple test equation; the result

$$g(t_n + h) = g(t_n) + h \sum_{j=1}^{s} b_j\, g'(t_n + c_j h) + O(h^{p+1})$$

means that the quadrature order is $p$. Usually the stage order is of no concern for nonstiff differential equations, and only the quadrature order matters. This is important for explicit methods, since the first step of an explicit method is essentially a step of the explicit Euler method; this means that the stage order for explicit methods is one. Nevertheless, for nonstiff differential equations, we have Runge–Kutta methods of arbitrarily high order.

On the other hand, stiffness means that the stage order cannot be ignored. Going back to the test equation (9.11), write

$$\begin{aligned}
\Delta_{n,i} &= g(t_n + c_i h) - g(t_n) - h \sum_{j=1}^{s} a_{ij}\, g'(t_n + c_j h), \\
\Delta_n &= g(t_n + h) - g(t_n) - h \sum_{j=1}^{s} b_j\, g'(t_n + c_j h),
\end{aligned}$$

and let $\boldsymbol{\Delta}_n = [\Delta_{n,1}, \ldots, \Delta_{n,s}]^T$ denote the vector of stage errors. Then, after some calculation, we find that

$$y_{n+1} - g(t_{n+1}) = R(h\lambda)\, [y_n - g(t_n)] - h\lambda\, b^T (I - h\lambda A)^{-1} \boldsymbol{\Delta}_n - \Delta_n.$$

Clearly we still need $|R(h\lambda)| \le 1$ for stability. But we have to be careful about $\boldsymbol{\Delta}_n$ (the stage errors) as well as $\Delta_n$ (the quadrature error). In other words, our accuracy can be reduced by a low stage order as well as by a low quadrature order.

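Many Runge–Kutta methods for stiff differential equations are stiffly accurate. This simply means that the last row of $A$ is $b^T$; that is, $a_{si} = b_i$ for $i = 1, 2, \ldots, s$. An example is the trapezoidal rule:

$$y_{n+1} = y_n + \tfrac{1}{2} h \left[ f(t_n, y_n) + f(t_{n+1}, y_{n+1}) \right].$$

The quadrature order is 2 ($\Delta_n = O(h^3)$), which is the same order as the second stage ($\Delta_{n,2} = O(h^3)$). The order of the first stage is infinite: $\Delta_{n,1} = 0$, since $c_1 = 0$ and $g(t_n + 0h) = g(t_n) + 0$. For the test equation (9.11), we have

$$y_{n+1} - g(t_{n+1}) = R(h\lambda)\, [y_n - g(t_n)] - h\lambda\, b^T (I - h\lambda A)^{-1} \boldsymbol{\Delta}_n - \Delta_n$$

as before, where $\boldsymbol{\Delta}_n$ is the vector of stage errors $\Delta_{n,i}$ and $\Delta_n$ is the quadrature error. For this method

$$\begin{aligned}
-h\lambda\, b^T (I - h\lambda A)^{-1} \boldsymbol{\Delta}_n
&= \frac{2h\lambda}{h\lambda - 2} \left[ \tfrac{1}{2},\ \tfrac{1}{2} \right]
\begin{bmatrix} 1 - \tfrac{1}{2}h\lambda & 0 \\ \tfrac{1}{2}h\lambda & 1 \end{bmatrix}
\begin{bmatrix} 0 \\ O(h^3) \end{bmatrix} \\
&= \frac{h\lambda}{h\lambda - 2}\, O(h^3).
\end{aligned}$$

So the stiff order of the trapezoidal method is 2, the same as its "normal" order. This is a desirable trait, but it is not shared by most higher-order methods.

Consider, for example, the Gauss methods. The $s$-stage Gauss method has order $2s$. However, its stiff order is only $s$. A simple example is the $s = 1$ Gauss method, which is also known as the midpoint method, as shown in Table 9.3. Then

$$-h\lambda\, b^T (I - h\lambda A)^{-1} \boldsymbol{\Delta}_n = -\frac{2h\lambda}{2 - h\lambda}\, O(h^2).$$

So while the quadrature order of the midpoint rule is 2, its stiff order is 1. Further analysis for the other Gauss methods can be found in [44].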
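Order reduction for the Gauss methods can be observed on the scalar test equation (9.11). The sketch below applies the two-stage Gauss method of Table 9.4 with $g(t) = \cos t$ on $[0, 1]$ and estimates the observed order from the errors at two stepsizes. Everything specific here is an assumption made for illustration: the choice of $g$, the values $\lambda = -1$ and $\lambda = -10^6$, the step counts, and the function names. Because the problem is linear, the two stage equations are solved exactly by Cramer's rule instead of by Newton's method.

```python
import math

SQRT3 = math.sqrt(3.0)
C = [(3 - SQRT3) / 6, (3 + SQRT3) / 6]
A = [[0.25, (3 - 2 * SQRT3) / 12], [(3 + 2 * SQRT3) / 12, 0.25]]
B = [0.5, 0.5]

def gauss2_error(lam, n_steps, t_end=1.0):
    """Integrate Y' = lam*(Y - g) + g', Y(0) = g(0), with g = cos, by the
    two-stage Gauss method; return |y_N - g(t_end)|."""
    g = math.cos
    def dg(t):
        return -math.sin(t)
    h = t_end / n_steps
    y = g(0.0)
    for n in range(n_steps):
        t = n * h
        # f(t, z) = lam*z + r(t) with r(t) = -lam*g(t) + g'(t)
        r = [-lam * g(t + c * h) + dg(t + c * h) for c in C]
        # Stage equations: (I - h*lam*A) z = y*e + h*A*r, solved by Cramer's rule.
        m = [[(1.0 if i == j else 0.0) - h * lam * A[i][j] for j in range(2)]
             for i in range(2)]
        rhs = [y + h * (A[i][0] * r[0] + A[i][1] * r[1]) for i in range(2)]
        det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
        z = [(rhs[0] * m[1][1] - m[0][1] * rhs[1]) / det,
             (m[0][0] * rhs[1] - m[1][0] * rhs[0]) / det]
        y += h * sum(B[j] * (lam * z[j] + r[j]) for j in range(2))
    return abs(y - g(t_end))

def observed_order(lam, n_steps=10):
    """Estimate the order from the errors with n_steps and 2*n_steps steps."""
    return math.log2(gauss2_error(lam, n_steps) / gauss2_error(lam, 2 * n_steps))

if __name__ == "__main__":
    print("nonstiff, lambda = -1:   observed order", observed_order(-1.0))
    print("stiff,    lambda = -1e6: observed order", observed_order(-1.0e6))
```

Under these assumptions one would expect an observed order near 4 in the nonstiff case and near 2 in the stiff case, in line with the statement that the stiff order of the $s$-stage Gauss method is $s$.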


DIRK methods of any number of stages have stage order≤ 2, and so the stiff order(for arbitraryhλ) is≤ 2. Radau IIA methods withs stages have order2s− 1, but thestiff order (for arbitraryhλ) is s+ 1. In fact, the global error for Radau IIA methodsis O(hs+1/(hλ)). If we consider only the casehλ → ∞ andh → 0, we find that,because the Radau IIA methods are stiffly accurate, we again getO(h2s−1) globalerrorin the limitashλ→ ∞. This turns out to be very useful for differential algebraicequations, the topic discussed in Chapter 10. However, for solving problems such asthe heat equation (see Section 8.1), there are many eigenvaluesλ, some small andsome large. So we cannot assume thathλ→ ∞.

On the other hand, order reduction does not occur for BDF methods. While a complete answer is beyond the scope of this book, consider the differential equation

\[ Y' = \lambda\,(Y - g(t)) + g'(t), \qquad Y(t_0) = g(t_0). \]

The exact solution is $Y(t) = g(t)$ for all $t$. If we apply a BDF method to this equation, we get

\[ y_{n+1} = \sum_{j=0}^{p-1} a_j y_{n-j} + h\beta \left[ \lambda\,(y_{n+1} - g(t_{n+1})) + g'(t_{n+1}) \right]. \]

If $e_k = y_k - g(t_k)$ is the error at timestep $k$, then after some algebra we get

\[ (1 - h\beta\lambda)\, e_{n+1} - \sum_{j=0}^{p-1} a_j e_{n-j} = \sum_{j=0}^{p-1} a_j g(t_{n-j}) + h\beta\, g'(t_{n+1}) - g(t_{n+1}) = O(h^{p+1}), \]

since the BDF method has order $p$. For $h\lambda$ in the stability region, this means that $e_n = O(h^p)$; if $|h\lambda| \to \infty$ along the negative real axis, then $e_n = O(h^p / |h\lambda|)$.
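As a rough numerical illustration of this behavior, the sketch below applies the two-step BDF method to the test equation. The choices $g(t) = \sin t$, $\lambda = -10^4$, the final time, and the function name `bdf2_error` are assumptions made only for this example.

```python
import math

def bdf2_error(lam, n_steps, t_end=1.0):
    """Apply the 2-step BDF method to Y' = lam*(Y - g(t)) + g'(t) with g = sin.

    Returns |y_N - g(t_N)| at t_N = t_end, using exact starting values.
    """
    h = t_end / n_steps
    g, dg = math.sin, math.cos
    a0, a1, beta = 4.0 / 3.0, -1.0 / 3.0, 2.0 / 3.0   # BDF2: y_{n+1} = a0 y_n + a1 y_{n-1} + h beta f
    y_prev, y_curr = g(0.0), g(h)
    for n in range(1, n_steps):
        t_next = (n + 1) * h
        # The test equation is linear in y_{n+1}, so the implicit step has a closed form.
        rhs = a0 * y_curr + a1 * y_prev + h * beta * (-lam * g(t_next) + dg(t_next))
        y_next = rhs / (1.0 - h * beta * lam)
        y_prev, y_curr = y_curr, y_next
    return abs(y_curr - g(t_end))

e_coarse = bdf2_error(-1.0e4, 50)    # h = 0.02, h*lambda = -200
e_fine = bdf2_error(-1.0e4, 100)     # h = 0.01, h*lambda = -100
print(e_coarse, e_fine)
```

Both step sizes put $h\lambda$ far out on the negative real axis, yet the error stays small and decreases as $h$ is reduced.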

9.4 RUNGE–KUTTA METHODS FOR STIFF EQUATIONS IN PRACTICE

While a great many Runge–Kutta methods have been developed, for stiff differential equations the field narrows to a relatively small number of methods, all of which have the desirable characteristics of stability (especially B-stability) and accuracy (when order reduction is taken into account). The Radau IIA methods score well on just about every characteristic, as they are B-stable, are stiffly accurate, and have a high order, even after order reduction is taken into account.

The downside is that Radau methods, like Gauss methods, are expensive to implement. For stiff differential equations, we cannot expect to solve the Runge–Kutta equations by simple iteration. Some sort of nonlinear equation solver is needed. Newton’s method is the most common choice, but simplified versions of Newton’s method are often used in practice, as discussed in Section 8.5. For large-scale systems of differential equations, even implementing Newton’s method can be difficult, as large linear systems need to be solved. This can be done efficiently using the tools of numerical linear algebra. This is an exciting and interesting area in itself, but beyond the scope of this book.

Practical codes for a number of these methods, such as the three-stage, fifth-order Radau IIA method, have been carefully designed, implemented, and tested. Examples are the Radau and Radau5 codes of Hairer. For more details see p. 183. These codes are automatic methods that can adjust the stepsize to achieve a user-specified error tolerance.

PROBLEMS

1. Show that the Gauss methods with $s = 1$ and $s = 2$ stages have stiff order $s$.

2. Consider the following iterative method for solving the Runge–Kutta equations

\[ z_{n,i} = y_n + h \sum_{j=1}^{s} a_{ij}\, f(t_n + c_j h, z_{n,j}), \qquad i = 1, 2, \ldots, s. \]

We set

\[ z^{(k+1)}_{n,i} = y_n + h \sum_{j=1}^{s} a_{ij}\, f(t_n + c_j h, z^{(k)}_{n,j}), \qquad i = 1, 2, \ldots, s, \]

for $k = 0, 1, 2, \ldots$. Show that if $f(t, x)$ is Lipschitz in $x$ with Lipschitz constant $L$, then this method is a contractive iteration mapping provided

\[ hL \max_{1 \le i \le s} \sum_{j=1}^{s} |a_{ij}| < 1. \]

Is this method useful for stiff problems?

3. Show that the Gauss methods with $s = 1$ and $s = 2$ are B-stable using the algebraic condition (9.8)–(9.9).

4. Repeat Problem 3 for the Radau IIA methods for $s = 1$ and $s = 2$.

5. Show that the DIRK method in Table 9.8 is not B-stable.

6. Show that

\[ f(t, y) = \begin{bmatrix} \alpha & +\beta \\ -\beta & \alpha \end{bmatrix} y \]

satisfies a one-sided Lipschitz condition with $\mu \ge \alpha$. Use this to prove that B-stability implies A-stability. Hint: First show that the eigenvalues of the matrix defining $f$ are $\alpha \pm i\beta$.

7. The one-stage Gauss method is

\[ z_{n,1} = y_n + \tfrac{1}{2} h\, f(t_n + \tfrac{1}{2} h, z_{n,1}), \]
\[ y_{n+1} = y_n + h\, f(t_n + \tfrac{1}{2} h, z_{n,1}). \]

Find the Taylor series expansion of $\Delta_{n,1} = g(t_n + c_1 h) - g(t_n) - h\, a_{11}\, g'(t_n + c_1 h)$ ($c_1 = a_{11} = \tfrac{1}{2}$) to show that the stage order of this method is 1 while the quadrature order of the method is 2.

8. Derive the coefficients for the Lobatto IIIC method with three stages ($s = 3$, order $= 2s - 2 = 4$). The quadrature points are $c_1 = 0$, $c_2 = \tfrac{1}{2}$, and $c_3 = 1$. Use the simplifying conditions $B(2s - 2)$ to compute the $b_i$ values, and the simplifying conditions $C(s - 1)$ and one of the conditions in $D(s - 1)$ to compute the $a_{ij}$ matrix entries.

CHAPTER 10

DIFFERENTIAL ALGEBRAIC EQUATIONS

In Chapter 3 we considered the motion of a pendulum consisting of a mass $m$ at the end of a light rigid rod of length $l$; see Figure 3.1. Deriving the differential equation for the angle $\theta$ involved computing the torque about the pivot point. In simple systems like this, it is fairly easy to derive the differential equation from a good knowledge of mechanics. But with more complex systems it can become difficult just to obtain the differential equation to be solved.

Here we will consider a different way of handling this problem that makes it much easier to derive a mathematical model, but at a computational cost. These models contain not only differential equations but also “algebraic” equations. Here “algebraic” does not signify that only the usual operations of arithmetic ($+$, $-$, $\times$, and $/$) can appear; rather, it means that no derivatives or integrals of unknown quantities can appear in the equation. Differential and algebraic equations are collectively referred to as differential algebraic equations or by the acronym DAE. A number of texts deal specifically with DAEs, such as Ascher and Petzold [10] and Brenan et al. [19].

In this new framework, the position of the mass is given by coordinates $(x, y)$ relative to the pivot for the pendulum. There is a constraint due to the rigid rod: $\sqrt{x^2 + y^2} = l$. There are also two forces acting on the mass. One is gravitation, which acts downward with strength $mg$. The other is the force that the rod exerts on the mass to maintain the constraint. This force is in the direction of the rod; let its magnitude be $N$, so that the force itself is $(-Nx, -Ny)/\sqrt{x^2 + y^2}$. This provides a complete model for the pendulum:

\[ m \frac{d^2 x}{dt^2} = -N \frac{x}{\sqrt{x^2 + y^2}}, \tag{10.1} \]
\[ m \frac{d^2 y}{dt^2} = -N \frac{y}{\sqrt{x^2 + y^2}} - mg, \tag{10.2} \]
\[ 0 = l - \sqrt{x^2 + y^2}. \tag{10.3} \]

This second-order system can be rewritten as a first-order system:

\[ x' = u, \tag{10.4} \]
\[ y' = v, \tag{10.5} \]
\[ m u' = -N \frac{x}{\sqrt{x^2 + y^2}}, \tag{10.6} \]
\[ m v' = -N \frac{y}{\sqrt{x^2 + y^2}} - mg, \tag{10.7} \]
\[ 0 = l - \sqrt{x^2 + y^2}. \tag{10.8} \]

The unknowns are the coordinates $x(t)$, $y(t)$, their velocities $u(t)$ and $v(t)$, and the force exerted by the rod, $N(t)$. All in all, there are five equations and five unknown functions. However, only four of the equations are differential equations. The last is an “algebraic” equation. Also, there is no equation with $dN/dt$ in it, so $N$ is called an algebraic variable.

For simplicity, we will write $\lambda = N/(m\sqrt{x^2 + y^2})$ so that $du/dt = -\lambda x$ and $dv/dt = -\lambda y - g$. Also, the constraint equation will be replaced by

\[ 0 = l^2 - x^2 - y^2. \]

We can turn the differential algebraic equations into a pure system of differential equations. To do that, we need to differentiate the algebraic equation until we can obtain an expression for $d\lambda/dt$. Differentiating the constraint three times gives first

\[ 0 = \frac{d}{dt}\left( l^2 - x^2 - y^2 \right) = -2xu - 2yv, \tag{10.9} \]
\[ 0 = \frac{d^2}{dt^2}\left( l^2 - x^2 - y^2 \right) = -2(u^2 + v^2) + 2\lambda(x^2 + y^2) + 2yg, \tag{10.10} \]

and then

\[ 0 = \frac{d^3}{dt^3}\left( l^2 - x^2 - y^2 \right) = 2 \frac{d\lambda}{dt}\left( x^2 + y^2 \right) + 6gv. \tag{10.11} \]

The number of times that the algebraic equations of a DAE need to be differentiated in order to obtain differential equations for all of the algebraic variables is called the index of the DAE. Two differentiations allow us to find $\lambda$ in terms of $x$, $y$, $u$, and $v$. But three differentiations are needed to compute $d\lambda/dt$ in terms of these quantities. So our pendulum problem is an index 3 DAE.

Solving for $\lambda$ from the second derivative of the constraint gives

\[ \lambda = \frac{u^2 + v^2 - yg}{x^2 + y^2} = \frac{u^2 + v^2 - yg}{l^2}. \tag{10.12} \]

Substituting this expression gives a system of ordinary differential equations:

\[ x' = u, \tag{10.13} \]
\[ y' = v, \tag{10.14} \]
\[ u' = -\frac{u^2 + v^2 - yg}{l^2}\, x, \tag{10.15} \]
\[ v' = -\frac{u^2 + v^2 - yg}{l^2}\, y - g. \tag{10.16} \]

If, instead of substituting for $\lambda$, we differentiate the constraint a third time, we obtain a differential equation for $\lambda$:

\[ x' = u, \tag{10.17} \]
\[ y' = v, \tag{10.18} \]
\[ u' = -\lambda x, \tag{10.19} \]
\[ v' = -\lambda y - g, \tag{10.20} \]
\[ \lambda' = -\frac{3gv}{l^2}. \tag{10.21} \]

The general scheme for a system of differential algebraic equations is

\[ Y' = f(t, Y, Z), \qquad Y(t_0) = Y_0, \tag{10.22} \]
\[ 0 = g(t, Y, Z). \tag{10.23} \]

The $Y$ variables are the differential variables, while the $Z$ variables are the algebraic variables.

10.1 INITIAL CONDITIONS AND DRIFT

In the general scheme, the constraints $0 = g(t, Y, Z)$ must hold at time $t = t_0$, so that $g(t_0, Y_0, Z_0) = 0$, where $Z_0 = Z(t_0)$. So the algebraic variables must also have the right initial values. But the conditions do not stop there. In addition, differentiating the constraints once at $t = t_0$ gives

\[ \left. \frac{d}{dt}\, g(t, Y, Z) \right|_{t = t_0} = 0, \]

and differentiating twice gives

\[ \left. \frac{d^2}{dt^2}\, g(t, Y, Z) \right|_{t = t_0} = 0, \]

and so on. This gives a whole sequence of extra initial conditions that must be satisfied. Fortunately, the number of extra conditions is not infinite: the number of differentiations needed to obtain the needed extra conditions is one less than the index of the problem.

Consider, for example, the pendulum problem. Initially the position of the mass is constrained by the length of the rod: $x(t_0)^2 + y(t_0)^2 = l^2$. Differentiating the length constraint (10.8) at $t = t_0$ gives

\[ 0 = x(t_0)\,u(t_0) + y(t_0)\,v(t_0); \]

that is, the initial velocity must be tangent to the circle that the pendulum sweeps out. Finally, the initial force $N(t_0)$ (or equivalently $\lambda(t_0)$) must be set correctly in order for the solution to follow the circle $x^2 + y^2 = l^2$. This gives a total of three extra conditions to satisfy for the initial conditions, coming from the constraint function and its first and second derivatives.
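One way to generate initial data satisfying all three conditions is to parametrize the starting state by an angle and an angular velocity. The sketch below does this; the parametrization (angle measured from the downward vertical), the function names, and the value $g = 9.81$ are assumptions made for illustration.

```python
import math

def consistent_initial_values(theta0, omega0, l=1.0, g=9.81):
    """Return (x, y, u, v, lam) satisfying the constraint and its first two derivatives."""
    x, y = l * math.sin(theta0), -l * math.cos(theta0)                    # on x^2 + y^2 = l^2
    u, v = l * omega0 * math.cos(theta0), l * omega0 * math.sin(theta0)   # tangent velocity
    lam = (u * u + v * v - y * g) / (l * l)                               # from (10.12)
    return x, y, u, v, lam

def constraint_residuals(x, y, u, v, lam, l=1.0, g=9.81):
    """Residuals of the constraint and of (10.9) and (10.10)."""
    r0 = l * l - x * x - y * y
    r1 = -2.0 * (x * u + y * v)
    r2 = -2.0 * (u * u + v * v) + 2.0 * lam * (x * x + y * y) + 2.0 * y * g
    return r0, r1, r2

state0 = consistent_initial_values(0.7, 1.3)
print(constraint_residuals(*state0))   # all three should be at rounding-error level
```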

Note that the constraint and the subsequent conditions hold not only at the initial time, but also at any instant. Thus the solutions of the differential equations that have the algebraic constraint removed (such as (10.13)–(10.16) and (10.17)–(10.21)) must satisfy these additional conditions at all times. Numerical methods do not necessarily preserve these properties even though they are preserved in the differential equations. This is known as drift. In theory, if a numerical method for a differential equation or DAE is convergent, then as the stepsize $h$ goes to zero, the amount of drift will also go to zero on any fixed time interval. In practice, however, instabilities that may be introduced by the DAE or ODE formulation mean that extremely small stepsizes may be needed to keep the drift sufficiently small for meaningful answers.

Figure 10.1 shows plots of the trajectories for the pendulum problem using the formulation (10.13)–(10.16) and the Euler and Heun methods (see (4.29)) for its solution.
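Drift of this kind is easy to reproduce. The following sketch applies Euler's method to (10.13)–(10.16) and records how far $x^2 + y^2$ strays from $l^2$. The initial state (released from rest at 45 degrees), the value $g = 9.81$, the number of steps, and the function names are assumptions; they are not claimed to match the settings behind Figure 10.1.

```python
import math

def pendulum_rhs(state, l=1.0, g=9.81):
    """Right-hand side of the ODE system (10.13)-(10.16)."""
    x, y, u, v = state
    lam = (u * u + v * v - y * g) / (l * l)
    return (u, v, -lam * x, -lam * y - g)

def euler_max_drift(h, n_steps, state, l=1.0):
    """Integrate with Euler's method; return max |x^2 + y^2 - l^2| over the run."""
    worst = 0.0
    for _ in range(n_steps):
        rhs = pendulum_rhs(state, l)
        state = tuple(s + h * r for s, r in zip(state, rhs))
        worst = max(worst, abs(state[0] ** 2 + state[1] ** 2 - l * l))
    return worst

# Pendulum released from rest at 45 degrees from the downward vertical.
start = (math.sin(math.pi / 4), -math.cos(math.pi / 4), 0.0, 0.0)
drift = euler_max_drift(0.015, 200, start)
print(drift)
```

The exact solution keeps the drift at zero, so any nonzero value reported here is produced by the discretization.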

There are a number of ways of dealing with drift.

1. Project the current solution back to the constraints, either at every step, or occasionally. For the pendulum example, this means projecting not only the positions $(x, y)$ back to $x^2 + y^2 = l^2$, but also the velocities. Moreover, if $\lambda$ is computed via a differential equation, it, too, must be projected onto its constraints. Care must be taken in doing this, particularly for multistep methods where projecting just the current solution vector $z_n$ will introduce errors in the approximate solution. Instead, all solution vectors $z_{n-j}$ for $j = 0, 1, \ldots, p$ should be projected, where $p$ is the number of previous iterates used by the multistep method. Also, if the index is high, we should project not only the solution vector, but also the derivative and (if the index is high enough) higher-order derivatives as well onto the appropriate manifold.

Figure 10.1 Plots of trajectories for (10.13)–(10.16) showing drift for Euler and Heun’s methods: (a) Euler’s method ($h = 0.015$); (b) Heun’s method ($h = 0.1$), shown together with the circle $x^2 + y^2 = 1$. Axes: $x(t)$ and $y(t)$.


2. Modify the differential equation to make the constraint set stable, but otherwise do not change the trajectories. This technique has been used in a number of contexts, but it almost always has to be done separately for every new case. An example of this technique is the method of Baumgarte [15] for equality-constrained mechanical systems. This would replace the condition $g(t, Y, Z) = 0$ with a differential equation, such as $(d/dt)\,g(t, Y, Z) + \alpha\, g(t, Y, Z) = 0$ with $\alpha > 0$; that is

\[ 0 = g_t(t, Y, Z) + g_y(t, Y, Z)\, f(t, Y, Z) + g_z(t, Y, Z)\, Z' + \alpha\, g(t, Y, Z), \]

which can be solved to give a differential equation for $Z$. (Note that $g_y(t, Y, Z)$ is the Jacobian matrix of $g(t, Y, Z)$ with respect to $Y$. See Section 10.3 below.) For index 3 systems, such as those arising in mechanics, stable second-order equations must be used such as

\[ \left( \frac{d^2}{dt^2} + \alpha \frac{d}{dt} + \beta \right) g(t, Y, Z) = 0 \]

with suitable choices for $\alpha$ and $\beta$. These modifications need to be done with care to ensure that they really are stable, not just for the continuous problem but also for the numerical discretization. Since these stabilization methods have one or more free scaling parameter(s) $\alpha$ (and $\beta$), these must be chosen with care. For more information about dealing with these issues, see Ascher et al. [5].

3. Use a numerical method that explicitly respects the constraints. These methods treat the differential algebraic equations as differential algebraic equations. Instead of necessitating one or more differentiations in order to find differential or other equations for the “algebraic” variables, they are automatically computed by the method itself. These have been developed for general low-index DAEs. Petzold, who developed the first such methods, produced a package DASSL (see [19], [21], [65]) based on backward differentiation formulas (BDFs) for solving index 1 DAEs. Many other methods have been developed, but these tend to be limited in terms of the index that they can handle. All such methods are implicit, and so require the solution of a linear or nonlinear system of equations at each step.
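For the pendulum, the projection in approach 1 can be sketched as follows. This uses a plain Euclidean projection of the position onto the circle and of the velocity onto the tangent direction, which is one choice among several; the function name is hypothetical, and projecting $\lambda$ or earlier multistep values is not shown.

```python
import math

def project_pendulum_state(x, y, u, v, l=1.0):
    """Project (x, y) onto x^2 + y^2 = l^2 and (u, v) onto the tangent x*u + y*v = 0."""
    r = math.hypot(x, y)
    x, y = l * x / r, l * y / r               # nearest point on the circle
    radial = (x * u + y * v) / (l * l)        # velocity component along the rod
    u, v = u - radial * x, v - radial * y     # remove it, leaving a tangent velocity
    return x, y, u, v

# A state that has drifted slightly off the constraint manifold.
xp, yp, up, vp = project_pendulum_state(1.02, -0.03, 0.1, 2.0)
print(xp * xp + yp * yp, xp * up + yp * vp)
```

In a time-stepping loop this function would be called after each step (or every few steps) on the newly computed state.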

To summarize: methods 1 and 2 for handling DAEs have some problems. The projection method can work with some ODE methods. The Baumgarte stabilization method can also be made to work, but requires “tuning” the stabilization parameters; this method can run into trouble for stiff equations. Method 3, designing numerical methods that explicitly recognize the constraints, is the one that we focus on in the remainder of the chapter.

10.2 DAES AS STIFF DIFFERENTIAL EQUATIONS

Differential algebraic equations can be treated as the limit of ordinary differential equations. Note that $g(t, Y, Z) = 0$ if and only if $B\,g(t, Y, Z) = 0$ for any nonsingular square matrix $B$. Then the DAE (10.22)–(10.23) can be treated as the limit as $\epsilon \to 0$ of

\[ Y' = f(t, Y, Z), \qquad Y(t_0) = Y_0, \tag{10.24} \]
\[ \epsilon Z' = B(Y)\, g(t, Y, Z). \tag{10.25} \]

The matrix function $B(Y)$ should be chosen to make the differential equation in $Z$ (10.25) stable, so that the solution of (10.25), $Z(t)$, converges to the solution $Z = Z^*$, where $g(t, Y, Z^*) = 0$.

For $\epsilon$ small, these equations are stiff, so implicit methods are needed. Furthermore, since the order obtained in practice for an implicit method can differ from the order of the method for nonstiff problems, the order of an implicit method may deviate from the usual order when it is applied to differential algebraic equations.

But how do we apply a numerical method for stiff ODEs to a DAE? The simplest method to apply is the implicit Euler method. If we apply it to the stiff approximation (10.24)–(10.25) using step size $h$, we get

\[ y_{n+1} = y_n + h\, f(t_{n+1}, y_{n+1}, z_{n+1}), \tag{10.26} \]
\[ \epsilon z_{n+1} = \epsilon z_n + h\, B(y_{n+1})\, g(t_{n+1}, y_{n+1}, z_{n+1}). \tag{10.27} \]

Taking the limit as $\epsilon \to 0$ and recalling that $B(Y)$ is nonsingular, we get the equations

\[ y_{n+1} = y_n + h\, f(t_{n+1}, y_{n+1}, z_{n+1}), \tag{10.28} \]
\[ 0 = g(t_{n+1}, y_{n+1}, z_{n+1}). \tag{10.29} \]

This method will work for index 1 DAEs, but not in general for higher index DAEs.

An issue regarding accuracy is the stiff order of an ODE solver: the order of a method for solving stiff ODEs may be lower than that for solving a nonstiff ODE, as noted in Section 9.3. Since DAEs can be considered to be an extreme form of stiff ODEs, this can also affect DAE solvers. With some methods, some components of the solution (e.g., positions) can be computed more accurately than other components (e.g., forces).
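To make the implicit Euler scheme (10.28)–(10.29) concrete, here is a minimal sketch for a small index 1 DAE. The model problem $Y' = -Z$, $0 = Z^3 + Z - Y$ is a hypothetical example chosen because $g_z = 3Z^2 + 1$ is never zero; the function name, tolerance, and iteration cap are likewise assumptions.

```python
def implicit_euler_dae(y0, z0, h, n_steps, tol=1e-14, max_iter=50):
    """Implicit Euler for Y' = -Z, 0 = Z^3 + Z - Y, with Newton's method at each step."""
    y, z = y0, z0
    for _step in range(n_steps):
        y_old = y
        for _it in range(max_iter):
            # Residuals of y - y_old - h*f(y, z) = 0 (with f = -z) and g(y, z) = 0.
            r1 = y - y_old + h * z
            r2 = z ** 3 + z - y
            # Jacobian [[1, h], [-1, 3 z^2 + 1]]; solve the 2x2 system by Cramer's rule.
            j22 = 3.0 * z * z + 1.0
            det = j22 + h
            dy = (-r1 * j22 + h * r2) / det
            dz = (-r2 - r1) / det
            y, z = y + dy, z + dz
            if abs(dy) + abs(dz) < tol:
                break
    return y, z

# Consistent initial values: z0 = 1 gives y0 = z0^3 + z0 = 2.
y_end, z_end = implicit_euler_dae(2.0, 1.0, 0.1, 10)
print(y_end, z_end, z_end ** 3 + z_end - y_end)
```

The previous step's values serve as the Newton starting guess, and the algebraic variable is updated together with the differential variable rather than being eliminated first.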

10.3 NUMERICAL ISSUES: HIGHER INDEX PROBLEMS

Consider index 1 problems in standard form:

\[ Y' = f(t, Y, Z), \qquad Y(t_0) = Y_0, \]
\[ 0 = g(t, Y, Z). \]

Here $Y(t)$ is an $n$-dimensional vector and $Z(t)$ is an $m$-dimensional vector. The function

\[ g(t, Y, Z) = [g_1(t, Y, Z), g_2(t, Y, Z), \ldots, g_m(t, Y, Z)]^T \]

must have values that are $m$-dimensional vectors. For an index 1 problem, the Jacobian matrix of $g(t, Y, Z)$ with respect to $Z$, specifically

\[ g_z(t, Y, Z) = \left. \begin{bmatrix} \partial g_1/\partial z_1 & \partial g_1/\partial z_2 & \cdots & \partial g_1/\partial z_m \\ \partial g_2/\partial z_1 & \partial g_2/\partial z_2 & \cdots & \partial g_2/\partial z_m \\ \vdots & \vdots & \ddots & \vdots \\ \partial g_m/\partial z_1 & \partial g_m/\partial z_2 & \cdots & \partial g_m/\partial z_m \end{bmatrix} \right|_{(t, Y, Z)} \]

is nonsingular. So we can apply the implicit function theorem to show that whenever $g(t_0, y_0, z_0) = 0$, there is locally a smooth solution function $z = \varphi(t, y)$, where $z_0 = \varphi(t_0, y_0)$. With a numerical solution $(y_n, z_n)$, $n = 0, 1, 2, \ldots$, the error in $z_n$ should be of the same order as the error in $y_n$. This does not always happen, but requires some special properties of the numerical method. As we will see for Runge–Kutta methods, we need the method to be stiffly accurate. A method is stiffly accurate when the last row of the $A$ matrix in the Butcher tableau is the same as the bottom row $b^T$ of the Butcher tableau. Stiff accuracy is important for understanding Runge–Kutta methods for stiff differential equations, as was noted in Section 9.3.

Index 2 problems have a standard form:

\[ Y' = f(t, Y, Z), \qquad Y(t_0) = Y_0, \tag{10.30} \]
\[ 0 = g(t, Y), \tag{10.31} \]

where the product of Jacobian matrices $g_y(t, Y)\, f_z(t, Y, Z)$ is nonsingular. But now, to determine $Z(t)$, we need $dY/dt$. Thus numerical methods applied to index 2 problems will need to perform some kind of “numerical differentiation” in order to find $Z(t)$. This may result in a reduction of the order of accuracy in the numerical approximation of $Z(t)$, which can feed back into the equation (10.30) for $Y(t)$.

Index 3 problems, such as our pendulum problem, require more specialized treatment. These problems are discussed in Subsection 10.6.1. However, the same complication arises: different components of the solution can have different orders of convergence.

To illustrate this complication, consider the problem of the spherical pendulum. This is just like the ordinary planar pendulum, except that the mass is not constrained to a single vertical plane. This is sometimes called “Foucault’s pendulum”, and can be used to demonstrate the rotation of the earth, although our model will not include that effect. For this system, we use $q = [x, y, z]^T$ for the position of the mass $m$, which is subject to the constraint that $q^T q = \ell^2$ and a downward gravitational force of strength $mg$. Using the methods of Subsection 10.6.1, we obtain the following index 3 DAE:

\[ m v' = -\lambda q - mg\, k, \]
\[ q' = v, \]
\[ 0 = \tfrac{1}{2}\left( q^T q - \ell^2 \right), \]

where $k$ is the unit vector pointing upward. Note that the state vector for the DAE is $y^T = [q^T, v^T, \lambda]$.

Figure 10.2 Errors in solving the spherical pendulum problem using the three-stage Radau IIA method with an index 1 DAE. Error norm against number of steps ($n$), with curves for position, velocity, and force.

By differentiating the constraints as we did for the planar pendulum, we can obtain lower index DAEs. If we differentiate the constraint once, we obtain

\[ 0 = v^T q \]

to give an index 2 DAE. If we differentiate again, we obtain

\[ 0 = v^T v - \frac{\lambda}{m}\, q^T q - k^T q\, g \]

to give an index 1 DAE.

Using the Radau IIA method with three stages (which is normally fifth-order), we can solve each of these systems. Figures 10.2–10.4 show the numerical results for each of these DAEs with indices 1, 2 and 3. The specific parameter values used are $m = 2$ and $\ell = \tfrac{3}{2}$; the initial time was $t = 0$, and the errors were computed at $t = 1$. As can be clearly seen, for both index 2 and index 3 cases, the forces are computed considerably less accurately than are the other components, and the slope of the error line for the forces ($\lambda$) is substantially less than those for the other components. This indicates a lower order of convergence for the forces in the index 2 and index 3 versions of the problem. For the index 3 case, both the forces and velocities ($v$) appear to have a lower order of convergence than the positions ($q$). However, the order of convergence of the positions does not seem to be affected by the index of the DAE.


[Figures 10.3 and 10.4: error norm against number of steps ($n$) for the index 2 and index 3 versions of the spherical pendulum problem]

From these numerical results, the following question may arise: Why use high index DAEs? As noted above, one reason is that using the high index formulation can prevent drift in the main constraint $g(q) = 0$. Another reason is that the model of the spherical pendulum is most naturally given as an index 3 DAE. The lower index DAEs are constructed by differentiating the constraint function. While this is often the quickest approach for simple problems, for large problems this can become difficult to do, and might not be possible in practice for functions defined by some (complicated) piece of code.

10.4 BACKWARD DIFFERENTIATION METHODS FOR DAES

The first ODE methods to be applied to DAEs were the backward differentiation formula (BDF) methods. These work well for index 1 DAEs, and are the basis of the code DASSL [19], [65]. These implicit methods were introduced in Section 8.2 and have the form

\[ y_{n+1} = \sum_{j=0}^{p-1} a_j\, y_{n-j} + h\beta\, f(t_{n+1}, y_{n+1}). \]

The coefficients $a_j$ and $\beta$ are chosen so that

\[ y'(t_{n+1}) = \frac{1}{\beta h} \left[ y_{n+1} - \sum_{j=0}^{p-1} a_j\, y_{n-j} \right] + O(h^p), \]

giving a method of order $p$.

These methods, while not A-stable, are nevertheless very well behaved, at least for nonoscillatory problems, for $p \le 6$. If $p \ge 7$, part of the negative real axis lies outside the stability region, and the method can become unstable for $\lambda < 0$ large enough in magnitude to put $h\lambda$ in the unstable region. For this reason, we restrict $p \le 6$ for BDF methods.

10.4.1 Index 1 problems

For DAEs of the form

\[ Y' = f(Y, Z), \qquad Y(t_0) = Y_0, \tag{10.32} \]
\[ 0 = g(Y, Z), \tag{10.33} \]

the BDF method becomes

\[ y_{n+1} = \sum_{j=0}^{p-1} a_j\, y_{n-j} + h\beta\, f(y_{n+1}, z_{n+1}), \]
\[ 0 = g(y_{n+1}, z_{n+1}). \]

For index 1 DAEs, the equation $g(y, z) = 0$ gives $z$ implicitly as a function of $y$. If we write $z = \varphi(y)$ for this implicit function, the BDF method can be reduced to

\[ y_{n+1} = \sum_{j=0}^{p-1} a_j\, y_{n-j} + h\beta\, f(y_{n+1}, \varphi(y_{n+1})), \]

which is the result of applying the BDF method to the reduced equation

\[ Y' = f(Y, \varphi(Y)). \]

Thus the BDF method gives a numerical solution with the expected rate of convergence to the true solution.

10.4.2 Index 2 problems


BDF methods can be used for DAEs of index 2 as well as index 1, particularly for the semi-explicit index 2 DAEs:

\[ Y' = f(Y, Z), \qquad Y(t_0) = Y_0, \tag{10.34} \]
\[ 0 = g(Y). \tag{10.35} \]

Recall that $g(Y)$ is an $m$-dimensional vector for each $Y$, so that $g_y(Y)$ is an $m \times n$ matrix. On the other hand, $f(Y, Z)$ is an $n$-dimensional vector, so that $f_z(Y, Z)$ is an $n \times m$ matrix. The product $g_y(Y)\, f_z(Y, Z)$ is thus an $m \times m$ matrix. We assume that $g_y(Y)\, f_z(Y, Z)$ is nonsingular.

The DAE (10.34)–(10.35) is index 2 if we can (locally) solve for $Z(t)$ from $Y(t)$ using only one differentiation of the “algebraic” equation $g(Y) = 0$. Differentiating gives $0 = g_y(Y)\, dY/dt = g_y(Y)\, f(Y, Z)$. So for an index 2 DAE, the function $Z \mapsto g_y(Y)\, f(Y, Z)$ needs to be invertible so that we can find a smooth implicit function $Y \mapsto Z$. The usual requirement needed is that the Jacobian matrix of the map $Z \mapsto g_y(Y)\, f(Y, Z)$ be an invertible matrix on the exact solution. From the usual rules of calculus, this comes down to requiring that $g_y(Y(t))\, f_z(Y(t), Z(t))$ is an invertible matrix for all $t$ on the exact solution. Note that this implies that $g_y(Y)\, f_z(Y, Z)$ is invertible for any $(Y, Z)$ sufficiently near the exact solution as well.

Assuming that $g_y(Y)\, f_z(Y, Z)$ is nonsingular, we can show that the $p$-step BDF method for DAEs,

\[ y_{n+1} = \sum_{j=0}^{p-1} a_j\, y_{n-j} + h\beta\, f(y_{n+1}, z_{n+1}), \]
\[ 0 = g(y_{n+1}), \]

is convergent of order $p$ for $p \le 6$. Recall that for $p \ge 7$, the stability region for the $p$-step BDF method does not include all of the negative real axis, making it unsuitable for stiff ODEs or DAEs.


It should be noted that these methods are implicit, and therefore require the solution of a nonlinear system of equations. We can use Newton’s method or any number of variants thereof [55]. The system of equations to be solved has $n + m$ equations and $n + m$ unknowns.

For the $p$-step BDF method, we have

\[ y_n - Y(t_n) = O(h^p), \]
\[ z_n - Z(t_n) = O(h^p), \]

provided $y_j - Y(t_j) = O(h^{p+1})$ for $j = 0, 1, 2, \ldots, p - 1$ ([20], [40], [44], [60]). Note that we need one order higher accuracy in the initial values; this is necessary as our estimates for $z_j$, $j = 0, 1, \ldots, p - 1$, are essentially obtained by differentiating the data for $y_j$, $j = 0, 1, \ldots, p - 1$.

Note that it is particularly important to solve the equations $g(y_{n+1}) = 0$ accurately. Noise in the solution of these equations will be amplified by a factor of order $1/h$ to produce errors in $z_{n+1}$. This, in turn, will result in larger errors in $y_n$ over time.
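The $1/h$ amplification can be seen in a very small experiment. The sketch below applies the one-step BDF method (implicit Euler) to the hypothetical index 2 problem $Y' = Z$, $0 = Y - \sin t$, whose solution is $Z = \cos t$, and optionally perturbs the computed constraint solution by an alternating amount `noise`. The problem, the noise model, and the function name are assumptions for illustration only.

```python
import math

def max_z_error(h, n_steps, noise=0.0):
    """Implicit Euler for Y' = Z, 0 = Y - sin(t); return max |z_n - cos(t_n)|."""
    y = 0.0                       # Y(0) = sin(0)
    worst = 0.0
    for n in range(1, n_steps + 1):
        t = n * h
        # "Solve" the constraint, with an alternating perturbation mimicking solver noise.
        y_new = math.sin(t) + noise * (-1) ** n
        z_new = (y_new - y) / h   # follows from y_new = y + h * z_new
        worst = max(worst, abs(z_new - math.cos(t)))
        y = y_new
    return worst

clean = max_z_error(1e-3, 1000)
noisy = max_z_error(1e-3, 1000, noise=1e-6)
print(clean, noisy)
```

Here $z_{n+1}$ is a backward difference of the $y$ values, so a perturbation of size $10^{-6}$ in $y$ shows up in $z$ at a size of roughly $10^{-6}/h$.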

10.5 RUNGE–KUTTA METHODS FOR DAES

As for stiff equations, the Runge–Kutta methods used for DAEs need to be implicit methods. The way that a Runge–Kutta method is used for the index 1 DAE (10.32)–(10.33),

\[ Y' = f(Y, Z), \qquad Y(t_0) = Y_0, \tag{10.36} \]
\[ 0 = g(Y, Z), \tag{10.37} \]

is

\[ y_{n,i} = y_n + h \sum_{j=1}^{s} a_{ij}\, f(y_{n,j}, z_{n,j}), \tag{10.38} \]
\[ 0 = \sum_{j=1}^{s} a_{ij}\, g(y_{n,j}, z_{n,j}), \tag{10.39} \]
\[ y_{n+1} = y_n + h \sum_{j=1}^{s} b_j\, f(y_{n,j}, z_{n,j}), \tag{10.40} \]

for $i = 1, 2, \ldots, s$. Provided the matrix $A$ is invertible, (10.39) is equivalent to

\[ 0 = g(y_{n,i}, z_{n,i}), \qquad i = 1, 2, \ldots, s. \]

As for BDF methods, these are systems of nonlinear equations, and can be solved by Newton’s method or its variants [55]. Unlike the BDF methods, the number of equations to be solved is $s(M + N)$, with $s(M + N)$ unknowns, where $Y$ is a vector with $N$ components and $Z$ has $M$ components.


Also, the analysis of error in stiff problems in Section 9.3 shows that the stage order of the Runge–Kutta method essentially determines the order of the Runge–Kutta method for DAEs. For this to work well, we usually require that the method be stiffly accurate (such as Radau IIA methods); that is, $b^T$ must be the bottom row of $A$: $b_i = a_{s,i}$ for $i = 1, 2, \ldots, s$. This means that $y_{n+1} = y_{n,s}$, and we set $z_{n+1} = z_{n,s}$ so that $g(y_{n+1}, z_{n+1}) = 0$. As with stiff equations, the stability function $R(h\lambda) = 1 + h\lambda\, b^T (I - h\lambda A)^{-1} e$ (see (9.7)) gives crucial information about the behavior of the method. However, for DAEs, we are considering what happens as $h\lambda \to -\infty$. Since $R(h\lambda)$ is a rational function of $h\lambda$, the important quantity is $R(\infty) = R(-\infty) = 1 - b^T A^{-1} e$ for nonsingular $A$.
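The quantity $R(\infty) = 1 - b^T A^{-1} e$ is simple to evaluate for a given Butcher tableau. The sketch below does so in exact arithmetic for the two-stage Radau IIA method and the one-stage Gauss (midpoint) method, using their standard coefficients; the helper names `solve`, `r_infinity`, and `is_stiffly_accurate` are assumptions.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination (exact when entries are Fractions)."""
    n = len(M)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def r_infinity(A, b):
    """R(infinity) = 1 - b^T A^{-1} e for nonsingular A."""
    w = solve(A, [1] * len(b))               # w = A^{-1} e
    return 1 - sum(bi * wi for bi, wi in zip(b, w))

def is_stiffly_accurate(A, b):
    """True when the last row of A coincides with b^T."""
    return list(A[-1]) == list(b)

radau_A = [[F(5, 12), F(-1, 12)], [F(3, 4), F(1, 4)]]   # two-stage Radau IIA
radau_b = [F(3, 4), F(1, 4)]
gauss_A, gauss_b = [[F(1, 2)]], [F(1)]                   # one-stage Gauss (midpoint)

print(r_infinity(radau_A, radau_b), is_stiffly_accurate(radau_A, radau_b))
print(r_infinity(gauss_A, gauss_b), is_stiffly_accurate(gauss_A, gauss_b))
```

For a stiffly accurate method with nonsingular $A$, $b^T A^{-1}$ is the last unit vector, so $b^T A^{-1} e = 1$ and $R(\infty) = 0$.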


Consider index 1 problems of the form

\[ Y' = f(Y, Z), \qquad Y(t_0) = Y_0, \]
\[ 0 = g(Y, Z). \]

Let us suppose that we have an implicit function $\varphi$ for $g$, meaning that whenever $0 = g(y, z)$, then $z = \varphi(y)$. If we can do this, then the problem reduces to finding the solution of

\[ Y' = f(Y, \varphi(Y)), \qquad Y(t_0) = Y_0. \]

Note that if the Jacobian matrix $\nabla_z g(y^*, z^*)$ is nonsingular, then we can find a local implicit function $\varphi$ so that $\varphi(y^*) = z^*$ and $\varphi$ is smooth near $y^*$. Then in this case, $g(y_{n,i}, z_{n,i}) = 0$ implies that $z_{n,i} = \varphi(y_{n,i})$, and our Runge–Kutta equations imply that

\[ y_{n,i} = y_n + h \sum_{j=1}^{s} a_{ij}\, f(y_{n,j}, z_{n,j}) = y_n + h \sum_{j=1}^{s} a_{ij}\, f(y_{n,j}, \varphi(y_{n,j})). \]

For a stiffly accurate method, $y_{n+1} = y_{n,s}$ and $z_{n+1} = z_{n,s} = \varphi(y_{n,s}) = \varphi(y_{n+1})$. This is exactly what the Runge–Kutta method would give when applied to the ordinary differential equation

\[ Y' = f(Y, \varphi(Y)), \qquad Y(t_0) = Y_0. \]

So the order of accuracy is exactly what we would expect for smooth ordinary differential equations.

The case where the method is not stiffly accurate is a little more complex; the argument for the accuracy of $y_n \approx Y(t_n)$ is not changed, but the accuracy of the computed values $z_n \approx Z(t_n)$ is, and can depend on the value of $R(\infty)$. Recall that $p$ is the quadrature order of the method, and $q$ is the stage order. In terms of the simplifying conditions (9.4)–(9.6), conditions $B(p)$ and $C(q)$ hold. The error $z_n - Z(t_n) = O(h^r)$, where $r = \min(p, q + 1)$ if $-1 \le R(\infty) < 1$ and $r = \min(p - 1, q)$ if $R(\infty) = 1$; but $z_n - Z(t_n)$ diverges exponentially in $n$ if $|R(\infty)| > 1$. We show this below.

Suppose our Runge–Kutta method has stage order $q$ and quadrature order $p$, so that for a smooth function $\psi(\cdot)$, we obtain

\[ \psi(t_n + c_i h) = \psi(t_n) + h \sum_{j=1}^{s} a_{ij}\, \psi'(t_n + c_j h) + O(h^{q+1}), \qquad i = 1, \ldots, s, \tag{10.41} \]
\[ \psi(t_{n+1}) = \psi(t_n) + h \sum_{i=1}^{s} b_i\, \psi'(t_n + c_i h) + O(h^{p+1}). \tag{10.42} \]

The global order of this method for DAEs can be determined from the stage and quadrature orders depending on several cases: (1) the method is stiffly accurate, (2) $-1 \le R(\infty) < 1$, (3) $R(\infty) = 1$, or (4) $|R(\infty)| > 1$.

If the method is stiffly accurate, then (as we have seen) the accuracy for index 1 DAEs is the same as for smooth ordinary differential equations: $Y(t_n) - y_n = O(h^p)$, provided $t_n - t_0$ is bounded.

If the method is not stiffly accurate, then the stage order $q$ becomes important. If we write

\[ \Psi_n = [\psi(t_n + c_1 h), \psi(t_n + c_2 h), \ldots, \psi(t_n + c_s h)]^T, \]
\[ \Psi'_n = [\psi'(t_n + c_1 h), \psi'(t_n + c_2 h), \ldots, \psi'(t_n + c_s h)]^T, \]

then, from (10.41), we obtain

\[ \Psi_n = \psi(t_n)\, e + h A \Psi'_n + O(h^{q+1}), \]

so that for nonsingular $A$, we have

\[ \Psi'_n = h^{-1} A^{-1} \left( \Psi_n - e\, \psi(t_n) \right) + O(h^q). \]

Substituting this into (10.42) gives

\[ \psi(t_{n+1}) = \left( 1 - b^T A^{-1} e \right) \psi(t_n) + b^T A^{-1} \Psi_n + O(h^{q+1}) + O(h^{p+1}). \]

But $1 - b^T A^{-1} e = R(\infty)$. Thus

\[ \psi(t_{n+1}) = R(\infty)\, \psi(t_n) + b^T A^{-1} \Psi_n + O(h^{q+1}) + O(h^{p+1}). \]

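In particular, we can take $\psi(t) = Z(t)$ and $\psi(t) = Y(t)$, giving

\[ Z(t_{n+1}) = R(\infty)\, Z(t_n) + b^T A^{-1} \mathbf{Z}_n + O(h^{q+1}) + O(h^{p+1}), \tag{10.43} \]
\[ Y(t_{n+1}) = R(\infty)\, Y(t_n) + b^T A^{-1} \mathbf{Y}_n + O(h^{q+1}) + O(h^{p+1}), \]

with

\[ \mathbf{Z}_n = [Z(t_n + c_1 h), \ldots, Z(t_n + c_s h)]^T, \]
\[ \mathbf{Y}_n = [Y(t_n + c_1 h), \ldots, Y(t_n + c_s h)]^T. \]

Now $g(y_{n,i}, z_{n,i}) = 0$, so $z_{n,i} = \varphi(y_{n,i})$ as noted above. Let

\[ \widehat{\mathbf{Y}}_n = [y_{n,1}, y_{n,2}, \ldots, y_{n,s}]^T, \]
\[ \widehat{\mathbf{Z}}_n = [z_{n,1}, z_{n,2}, \ldots, z_{n,s}]^T. \]

Then the Runge–Kutta equations can be written (as we did with $\psi(t)$ above) as

\[ z_{n+1} = R(\infty)\, z_n + b^T A^{-1} \widehat{\mathbf{Z}}_n. \tag{10.44} \]

The error $\Delta z_{n+1} = Z(t_{n+1}) - z_{n+1}$ is given by subtracting the above equations (10.43) and (10.44), yielding

\[ \Delta z_{n+1} = R(\infty)\, \Delta z_n + b^T A^{-1} \left( \mathbf{Z}_n - \widehat{\mathbf{Z}}_n \right) + O(h^{q+1}) + O(h^{p+1}). \]

Note that $z_{n,i} = \varphi(y_{n,i})$ and $Z(t_n + c_i h) = \varphi(Y(t_n + c_i h))$. The stage order is $q$, so from the differential equation for $Y$ and the Runge–Kutta method,

\[ y_{n,i} - Y(t_n + c_i h) = y_n - Y(t_n) + h \sum_{j=1}^{s} a_{ij} \left( f(y_{n,j}, \varphi(y_{n,j})) - f(Y(t_n + c_j h), \varphi(Y(t_n + c_j h))) \right) + O(h^{q+1}). \]

Since $y_n = Y(t_n) + O(h^p)$, we get

\[ y_{n,i} = Y(t_n + c_i h) + O(h^{\min(p, q+1)}). \]

So $z_{n,i} - Z(t_n + c_i h) = \varphi(y_{n,i}) - \varphi(Y(t_n + c_i h)) = O(h^{\min(p, q+1)})$.

Therefore $\Delta z_{n+1} = R(\infty)\, \Delta z_n + O(h^{\min(p, q+1)})$.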

If $|R(\infty)| < 1$, then we obtain the expected global order for $z_n$. If $R(\infty) = 1$, the errors can accumulate, giving a convergence order of one less. If $|R(\infty)| > 1$, then the error in $z_n$ will grow exponentially in $n$. If $R(\infty) = -1$, then we need to do some more analysis to show that the hidden constant in the “$O(h^{\min(p, q+1)})$” is actually a smooth function of $t$. Then successive steps will cause cancellation of the error, and the global error for $z_n$ is $O(h^{\min(p, q+1)})$.
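The four cases can be illustrated by iterating the error recursion directly. The sketch below uses a constant local term `delta` as a crude stand-in for the $O(h^{\min(p,q+1)})$ contribution; that simplification, the function name, and the sample values are assumptions, and the constant term idealizes the smoothness needed for the cancellation when $R(\infty) = -1$.

```python
def accumulated_error(r_inf, delta, n_steps):
    """Iterate e_{k+1} = r_inf * e_k + delta from e_0 = 0 and return e_{n_steps}."""
    e = 0.0
    for _ in range(n_steps):
        e = r_inf * e + delta
    return e

delta, n = 1e-6, 100
for r_inf in (0.0, 0.5, 1.0, -1.0, 1.5):
    # Damped, damped, linear accumulation, cancellation, exponential growth.
    print(r_inf, accumulated_error(r_inf, delta, n))
```

With $R(\infty) = 1$ the error after $n$ steps is $n\,\delta$, which for $n$ proportional to $1/h$ costs one power of $h$; with $|R(\infty)| < 1$ it stays within a fixed multiple of $\delta$.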

To illustrate these theoretical results, consider again the numerical results shown in Figure 10.2 for the index 1 version of the spherical pendulum problem using the 3-stage 5th-order Radau IIA method. All components of the solution converge with roughly the same order of accuracy. In fact, the slopes of the straightest parts of the graphs in Figure 10.2 are $\approx -5.10$, $-5.04$, and $-5.05$ for the position, velocity, and force components of the solution, respectively. This indicates that the index 1 DAE is being solved with the full order of accuracy that the three-stage Radau IIA method can provide.


Here we consider index 2 problems of the form

\[ Y' = f(Y, Z), \]
\[ 0 = g(Y). \]

As in Subsection 10.4.2, we assume that $g_y(Y)\, f_z(Y, Z)$ is a square nonsingular matrix on the exact solution.

Index 2 problems are considerably harder to solve numerically than corresponding index 1 problems. In the index 1 case where the “algebraic” equations $g(Y, Z) = 0$ give $Z$ as a function of $Y$ ($Z = \varphi(Y)$), the result of solving this system of equations could be substituted into $dY/dt = f(Y, Z) = f(Y, \varphi(Y))$ to form a smooth ordinary differential equation. This is not possible in the index 2 case. Indeed, determining whether initial values $(y_0, z_0)$ are consistent (i.e., $g_y(y_0)\, f(y_0, z_0) = 0$) is a nontrivial task.

Runge–Kutta methods for index 2 problems have the form

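$$\begin{aligned}
y_{n,i} &= y_n + h \sum_{j=1}^{s} a_{ij}\, f(y_{n,j}, z_{n,j}), && i = 1, 2, \dots, s,\\
z_{n,i} &= z_n + h \sum_{j=1}^{s} a_{ij}\, \ell_{n,j}, && i = 1, 2, \dots, s,\\
y_{n+1} &= y_n + h \sum_{j=1}^{s} b_j\, f(y_{n,j}, z_{n,j}),\\
z_{n+1} &= z_n + h \sum_{j=1}^{s} b_j\, \ell_{n,j},\\
0 &= g(y_{n,i}), && i = 1, 2, \dots, s.
\end{aligned}$$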

Note that we have extra variables $\ell_{n,i}$ that are needed to solve the equations $g(y_{n,i}) = 0$. If $(y_n, z_n)$ is sufficiently close to being consistent, there exists $(y_{n+1}, z_{n+1})$ (as well as the $y_{n,j}$, $z_{n,j}$, and $\ell_{n,j}$) satisfying the Runge–Kutta equations, and $(y_{n+1}, z_{n+1})$ is also close to being consistent.

This non-linear system of equations can be solved using, for example, Newton’s method. Given currently computed values $y^{(k)}_{n,j}$, $z^{(k)}_{n,j}$, $\ell^{(k)}_{n,j}$ and $y_n$, $z_n$ from the previous step, we compute corrected values $y^{(k+1)}_{n,j} = y^{(k)}_{n,j} + \Delta y_{n,j}$, $z^{(k+1)}_{n,j} = z^{(k)}_{n,j} + \Delta z_{n,j}$, and $\ell^{(k+1)}_{n,j} = \ell^{(k)}_{n,j} + \Delta \ell_{n,j}$ by solving the linear system

$$\begin{aligned}
y^{(k)}_{n,i} + \Delta y_{n,i} &= y_n + h \sum_{j=1}^{s} a_{ij} \Bigl[ f(y^{(k)}_{n,j}, z^{(k)}_{n,j}) + f_y(y^{(k)}_{n,j}, z^{(k)}_{n,j})\,\Delta y_{n,j} + f_z(y^{(k)}_{n,j}, z^{(k)}_{n,j})\,\Delta z_{n,j} \Bigr], && i = 1, 2, \dots, s,\\
z^{(k)}_{n,i} + \Delta z_{n,i} &= z_n + h \sum_{j=1}^{s} a_{ij} \Bigl[ \ell^{(k)}_{n,j} + \Delta \ell_{n,j} \Bigr], && i = 1, 2, \dots, s,\\
0 &= g(y^{(k)}_{n,i}) + g_y(y^{(k)}_{n,i})\,\Delta y_{n,i}, && i = 1, 2, \dots, s.
\end{aligned}$$

Table 10.1 Order of accuracy for index 2 DAEs of the form (10.34)–(10.35) for methods with s stages

Method          y                              z
Gauss           s + 1 (s odd), s (s even)      s − 1 (s odd), s − 2 (s even)
Radau IIA       2s − 1                         s
Lobatto IIIC    2s − 2                         s − 1
DIRK (a)        2                              1

There are several implications of the theory of these problems for numerical methods, such as Runge–Kutta methods, for index 2 DAEs.

1. The orders of accuracy for the numerical solutions $z_n \approx Z(t_n)$ and $y_n \approx Y(t_n)$ are often different.

2. The non-linear systems are generally harder to solve for index 2 systems than for index 1 systems. More specifically, the condition number of the linear system for Newton’s method increases as $O(1/h)$ as the step size $h$ becomes small [44, § VII.4]. By comparison, the linear systems for Newton’s method for index 1 DAEs have bounded condition numbers as $h$ goes to zero.

3. Additional conditions are needed to obtain convergence of the numerical methods.
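The growth of the condition number in item 2 can be observed on a toy problem. The MATLAB sketch below is an illustration under our own assumptions, not the Runge–Kutta system of the text: it uses implicit Euler in the direct form $y_{n+1} = y_n + h f(y_{n+1}, z_{n+1})$ together with the constraint, applied to the scalar model $f(y,z) = z$, with $g(y) = y$ for index 2 and $g(y,z) = y + z$ for index 1.

```matlab
% Newton matrices for one implicit Euler step on scalar toy DAEs
% (assumed models: f(y,z) = z; index 2: g = y; index 1: g = y + z).
hs = 10.^(-(1:4));
cond2 = zeros(size(hs));
cond1 = zeros(size(hs));
for m = 1:numel(hs)
    h  = hs(m);
    J2 = [1, -h; 1, 0];   % [1 - h*f_y, -h*f_z; g_y, 0  ]  (index 2)
    J1 = [1, -h; 1, 1];   % [1 - h*f_y, -h*f_z; g_y, g_z]  (index 1)
    cond2(m) = cond(J2);
    cond1(m) = cond(J1);
end
disp([hs; cond2; cond1])
```

In this toy setting `cond2.*hs` stays close to a constant, so the index 2 condition number grows like $1/h$, while `cond1` remains bounded as $h$ decreases.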

Development of the theory for the order of convergence of these methods is beyond the scope of this book. However, we can present results for some families of Runge–Kutta methods, which are summarized in Table 10.1 ([42]). In the table, the DIRK method is taken from Table 9.8 (a) in Chapter 9 with $s = 3$.

Note that the Gauss methods suffer a strong loss of accuracy, obtaining only order $s + 1$ at best for $y$ (compared to $2s$ for ordinary differential equations), while Radau IIA methods keep the same order for $y$ as for solving ordinary differential equations. The order for $z$ is less for all methods listed, often quite substantially less. One reason for the good performance of Radau IIA methods is that they are stiffly accurate and have a high stage order ($q$) as well as a good quadrature order ($p$). The Lobatto IIIC methods, which are also stiffly accurate, have a good order of accuracy as well.

One of the most popular methods for solving DAEs is the 5th-order, 3-stage Radau IIA method (Table 9.7). This is the basis for some popular software for DAEs. For more information, see p. 183. Numerical results for this method (with a fixed stepsize) are shown in Figure 10.3 for the index 2 version of the spherical pendulum problem. The slopes of the graphs are $\approx -5.01$, $-4.98$, and $-2.85$ for the position, velocity, and force components, respectively. In this version, the force component plays the role of $Z$, while the position and velocity components play the role of $Y$. These results seem roughly consistent with the expected fifth-order convergence of $y_n$ to $Y(t)$, and third-order convergence of $z_n$ to $Z(t)$.

Some other Runge–Kutta-type methods have been developed for index 2 DAEs, such as that proposed by Jay [51], which uses separate methods for the $Y$ and $Z$ components of the solution.

10.6 INDEX THREE PROBLEMS FROM MECHANICS

Mechanics is a rich source of DAEs; the pendulum example of Figure 3.1 and (10.1)–(10.3) is a common example. For general mechanical systems, we need a more systematic way of deriving the equations of motion. There are two main ways of doing this: Lagrangian mechanics and Hamiltonian mechanics. Although closely related, they each have their own specific character. We will use the Lagrangian approach here.

For more information about this area, which is often called analytical mechanics, see Fowles [38] for a traditional introduction, and Arnold [4] or Marsden and Ratiu [61] for more mathematical treatments. A comprehensive approach can be found in Fasano and Marmi [37], which includes extensions to statistical mechanics and continuum mechanics as well as more traditional topics.

In the Lagrangian approach to mechanics, the main variables are the generalized coordinates $q = [q_1, q_2, \dots, q_n]^T$ and the generalized velocities $v = dq/dt$. Note that in this section $q$ is not the stage order. The generalized coordinates can be any convenient system of coordinates for representing the configuration of the system. For example, for a pendulum in the plane, we could use either the angle to the vertical $\theta$, or $x$ and $y$ coordinates for the center of mass. In the latter case we will need to include one (or more) constraints on the coordinates: $g(q) = 0$. Note that since the generalized coordinates could include angles, the generalized velocity vector could include angular velocities as well as ordinary velocities.

The function that defines the motion in Lagrangian mechanics is the Lagrangian function $L(q, v)$, a scalar function of the generalized coordinates and generalized velocities. For a system with no constraints on the coordinates, we have

L(q, v) = T (q, v) − V (q),


where $T(q, v)$ is the kinetic energy of the system and $V(q)$ is the potential energy of the system. Usually the kinetic energy is quadratic in the velocity:

$$T(q, v) = \tfrac{1}{2}\, v^T M(q)\, v.$$

Here $M(q)$ is the mass matrix, although since $v$ may contain quantities such as angular as well as ordinary velocities, the entries in $M(q)$ may include quantities such as moments of inertia as well as ordinary masses. If we have constraints on the coordinates¹, $g(q) = 0$, then these constraints can be incorporated into the Lagrangian function using Lagrange multipliers:

$$L(q, v, \lambda) = T(q, v) - V(q) - \lambda^T g(q).$$

The Lagrange multipliers can be regarded as generalized forces that ensure that theconstraints are satisfied. The equations of motion are obtained by means of theEuler–Lagrange equations

$$0 = \frac{d}{dt} L_v(q, v) - L_q(q, v),$$

where $L_v(q, v)$ is the gradient vector of $L(q, v)$ with respect to $v$, and $L_q(q, v)$ is the gradient vector of $L(q, v)$ with respect to $q$. If we have constraints $g(q) = 0$, the Euler–Lagrange equations become

$$0 = \frac{d}{dt} L_v(q, v, \lambda) - L_q(q, v, \lambda), \tag{10.45}$$

$$0 = g(q) = L_\lambda(q, v, \lambda). \tag{10.46}$$

For the pendulum example, let us use $q = [x, y]^T$ as the position of the mass, and $v = dq/dt = [dx/dt, dy/dt]^T$ as its velocity. The constraint is

$$g(q) = \frac{1}{2}\left(x^2 + y^2 - \ell^2\right) = 0.$$

The kinetic energy is just the energy of a mass moving with velocity $v$:

$$T(q, v) = \frac{1}{2}\, m \left[\left(\frac{dx}{dt}\right)^2 + \left(\frac{dy}{dt}\right)^2\right].$$

The potential energy is just the potential energy due to gravity: $V(q) = m g y$. The Lagrangian is then

$$L(q, dq/dt, \lambda) = \frac{m}{2}\left(\left(\frac{dx}{dt}\right)^2 + \left(\frac{dy}{dt}\right)^2\right) - m g y - \lambda\,\frac{1}{2}\left(x^2 + y^2 - \ell^2\right).$$

¹ Here we have constraints on the generalized coordinates alone: $g(q) = 0$. These are called holonomic constraints.


The Euler–Lagrange equations are then

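$$0 = \frac{d}{dt}\left( m \begin{bmatrix} dx/dt \\ dy/dt \end{bmatrix} \right) + \begin{bmatrix} 0 \\ m g \end{bmatrix} + \lambda \begin{bmatrix} x \\ y \end{bmatrix},$$

$$0 = \frac{1}{2}\left(x^2 + y^2 - \ell^2\right).$$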

This is essentially the pendulum DAE (10.1)–(10.3) rearranged.

Not only does this DAE have index 3, but all problems of this type have index 3 (or higher). In general, for mechanical systems, the Euler–Lagrange equations become

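$$M(q)\,\frac{dv}{dt} = k(q, v) - \nabla V(q) - \nabla g(q)^T \lambda, \tag{10.47}$$

$$\frac{dq}{dt} = v, \tag{10.48}$$

$$0 = g(q), \tag{10.49}$$

where

$$k_i(q, v) = \frac{1}{2} \sum_{j,k=1}^{n} \left( \frac{\partial m_{jk}}{\partial q_i} - \frac{\partial m_{ij}}{\partial q_k} - \frac{\partial m_{ik}}{\partial q_j} \right) v_j\, v_k, \qquad i = 1, 2, \dots, n.$$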

Differentiating $g(q) = 0$ gives $\nabla g(q)\, dq/dt = \nabla g(q)\, v = 0$; differentiating again gives

$$0 = \nabla_q\bigl(\nabla g(q)\, v\bigr)\,\frac{dq}{dt} + \nabla g(q)\,\frac{dv}{dt} = \nabla_q\bigl(\nabla g(q)\, v\bigr)\, v + \nabla g(q)\, M(q)^{-1}\bigl[ k(q, v) - \nabla V(q) - \nabla g(q)^T \lambda \bigr],$$

which can be solved for $\lambda$ in terms of $q$ and $v$ provided $\nabla g(q)\, M(q)^{-1}\, \nabla g(q)^T$ is nonsingular. So, provided $\nabla g(q)\, M(q)^{-1}\, \nabla g(q)^T$ is nonsingular, the system (10.47)–(10.49) is an index 3 DAE. Since $M(q)$ can usually be taken to be symmetric positive definite, all that is really needed is for $\nabla g(q)$ to have full row rank (i.e., the rows of $\nabla g(q)$ should be linearly independent).

Note that we need initial conditions to be consistent; that is, $g(q(t_0)) = 0$ and

$$\left.\frac{d}{dt}\, g(q(t))\right|_{t=t_0} = \nabla g(q(t_0))\, v(t_0) = 0.$$

Indeed, at every time $t$, we have $g(q(t)) = 0$ and $\nabla g(q(t))\, v(t) = 0$ for the true solution. We can obtain the consistency condition for $\lambda$ by differentiating $\nabla g(q(t))\, v(t) = 0$ once again.
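For the planar pendulum these consistency conditions can be checked directly. The MATLAB sketch below is an illustration with assumed values ($m = 1$, $\ell = 1$, gravitational constant 9.81, and a starting point at the bottom of the swing); the variable names are ours. It uses $\nabla g(q) = [x,\ y]$, $M = mI$, and $k(q,v) = 0$ because the mass matrix is constant, so that the twice-differentiated constraint reads $0 = v^T v + \nabla g\, M^{-1}(k - \nabla V - \nabla g^T \lambda)$.

```matlab
% Consistent initial data for the planar pendulum (illustrative values).
m = 1; ell = 1; grav = 9.81;
q0 = [0; -ell];            % position on the circle x^2 + y^2 = ell^2
v0 = [0.3; 0];             % velocity tangent to the circle at q0
G     = q0';               % grad g(q) = [x, y] for g = (x^2 + y^2 - ell^2)/2
M     = m*eye(2);          % constant mass matrix
gradV = [0; m*grav];       % gradient of V(q) = m*grav*y
k     = [0; 0];            % k(q,v) vanishes for constant M
res_pos = 0.5*(q0'*q0 - ell^2);   % g(q0), should be zero
res_vel = G*v0;                   % grad g(q0)*v0, should be zero
% Solve the twice-differentiated constraint for lambda:
%   0 = v'*v + G*inv(M)*(k - gradV - G'*lambda)
lambda0 = (G*(M\G')) \ (v0'*v0 + G*(M\(k - gradV)));
fprintf('g = %g, grad g * v = %g, lambda0 = %g\n', res_pos, res_vel, lambda0);
```

Here $\lambda_0 \ell$ equals $m\,\mathrm{grav} + m|v_0|^2/\ell$, the familiar tension in the rod at the bottom of the swing, which is a useful sanity check on the sign conventions.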

10.6.1 Runge–Kutta methods for mechanical index 3 systems

Apart from the index reduction techniques introduced at the start of this chapter, we can apply Runge–Kutta methods directly to the system (10.47)–(10.49). The Runge–Kutta equations are even harder to solve than those for index 2 problems (the condition number of the Jacobian matrix in Newton’s method grows like $O(h^{-2})$), but this can be done provided the computed generalized coordinates $q_n$ and generalized velocities $v_n$ are sufficiently close to being consistent ($g(q_n) \approx 0$ and $\nabla g(q_n)\, v_n \approx 0$), and the newly computed values $q_{n+1}$ and $v_{n+1}$ are also close to being consistent.

Table 10.2 Proven order of accuracy, with s ≤ 3 stages, for index 3 problems of the form (10.47)–(10.49)

Method          q          v          λ
Radau IIA       2s − 1     s          s − 1
Lobatto IIIC    s + 1      s − 1      s − 2

The order of accuracy is still not known in general for the Gauss, Radau IIA, and Lobatto IIIC families of Runge–Kutta methods. However, for no more than three stages, this is known for the Radau IIA and Lobatto IIIC methods, and is given in Table 10.2 ([42], [49]).

Again, the orders of accuracy of the different components (coordinates, velocities, and constraint forces) are different, and again the winner seems to be the Radau IIA methods (at least up to three stages). Indeed, the three-stage Radau IIA method has been implemented as a Fortran 77 code called Radau5, which is available from

http://www.unige.ch/~hairer/software.html

Also available from this website is Radau, another Fortran 77 code for Radau IIA methods that can switch between the methods of orders 5, 9, and 13 for DAEs and stiff ODEs.

Numerical results for a fixed stepsize, three-stage Radau IIA method are shown in Figure 10.4 for the index 3 version of the spherical pendulum problem. With $s = 3$ we expect fifth-order convergence for positions, third-order convergence for the velocities, and second-order convergence for the forces. Indeed, the slopes of the graphs in Figure 10.4 are $\approx -4.66$, $-3.04$, and $-2.05$ for the positions, velocities, and forces, respectively. This slight drop in the slope from 5 to 4.66 for the position errors is due mainly to the accuracy with which the Runge–Kutta equations are solved, which limits the overall accuracy of the numerical solutions. Otherwise, the theoretical expectations are confirmed by these numerical results.

Other approaches to Runge–Kutta methods for index 3 DAEs from mechanics can be found in [50] for constrained Hamiltonian systems using a pair of Runge–Kutta methods. Essentially one Runge–Kutta method is used for the momentum variables and another for the generalized coordinate variables. The optimal choice of methods for this approach is a combination of Lobatto IIIA and Lobatto IIIB methods.

10.7 HIGHER INDEX DAES

The theory and practice of DAEs become harder as the index increases. Beyond index 3, the complexity of establishing the order of convergence of a method (or if a method converges) becomes almost prohibitive for standard approaches such as Runge–Kutta methods. Approaches to these problems can be developed by means of symbolic as well as numerical computation. A survey of approaches to handling high-index DAEs can be found in [26]. Software techniques such as Automatic Differentiation [29], [69] can be used instead of symbolic computation (as carried out by Mathematica™, Maple™, Macsyma™, etc.). These approaches take us well outside the scope of this book, but may be useful in handling problems of this kind.

PROBLEMS

1. Obtain the Radau or Radau5 code, and use it to solve the pendulum DAE (10.4)–(10.8) as a DAE.

2. Repeat Problem 1 with the reduced index DAE (10.4)–(10.7) with the constraint $0 = xu + yv$. This is an index 2 DAE. In particular, check the drift, or how far $x^2 + y^2 - l^2$ is from zero.

3. Repeat Problem 1 with the ODE (10.13)–(10.16). As in Problem 2, check the drift in both $x^2 + y^2 - l^2$ and in $xu + yv$ from zero.

4. Repeat Problem 3 using the MATLAB® routine ode23t instead of Radau or Radau5.

5. Consider a system of chemical reactions

X + Y → Z,

Y + U ⇋ V.

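Assuming that these are simple reactions, the reaction rate of the first is proportional to the product of the concentrations of X and Y; that is, for the first reaction, we obtain

$$\frac{d[\mathrm{X}]}{dt} = -k_1 [\mathrm{X}]\,[\mathrm{Y}], \qquad \frac{d[\mathrm{Z}]}{dt} = +k_1 [\mathrm{X}]\,[\mathrm{Y}].$$

However, the second reaction is reversible:

$$\frac{d[\mathrm{V}]}{dt} = +k_2 [\mathrm{Y}]\,[\mathrm{U}] - k_3 [\mathrm{V}], \qquad \frac{d[\mathrm{U}]}{dt} = -k_2 [\mathrm{Y}]\,[\mathrm{U}] + k_3 [\mathrm{V}].$$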

Chemical species Y participates in both reactions:

$$\frac{d[\mathrm{Y}]}{dt} = -k_2 [\mathrm{Y}]\,[\mathrm{U}] + k_3 [\mathrm{V}] - k_1 [\mathrm{X}]\,[\mathrm{Y}].$$


Figure 10.5 Compound pendulum (masses $m_1$ and $m_2$ at $(x_1, y_1)$ and $(x_2, y_2)$, with angles $\theta_1$ and $\theta_2$)

Suppose that $k_2, k_3 \gg k_1$, enabling us to treat the second reaction as being very nearly in equilibrium. (Mathematically, consider the limit as $k_2, k_3 \to \infty$ but $k_2/k_3 \to c$.) Write down the resulting system of differential and algebraic equations (perhaps involving the initial concentrations $[\mathrm{Y}]_0$, $[\mathrm{U}]_0$, $[\mathrm{V}]_0$, etc.). Show that they form an index 1 DAE.

6. Derive the equations of motion of a compound pendulum as shown in Figure 10.5 as an index 3 DAE in terms of the coordinates of the centers of masses $(x_1, y_1)$ and $(x_2, y_2)$. This will entail the use of two constraints: $x_1^2 + y_1^2 = l_1^2$ and $(x_2 - x_1)^2 + (y_2 - y_1)^2 = l_2^2$. Compare this with the same derivation instead using just two generalized coordinates, $\theta_1$ and $\theta_2$. (Using $\theta_1$ and $\theta_2$ will give ugly expressions for the kinetic energy, but with fewer variables than using $x_1$, $y_1$, $x_2$, and $y_2$.)

CHAPTER 11

TWO-POINT BOUNDARY VALUE PROBLEMS

In Chapter 3 we saw that the initial value problem for the second-order equation

Y ′′ = f(t, Y, Y ′) (11.1)

can be reformulated as an initial value problem for a system of first-order equations, and that numerical methods for first-order initial value problems can then be applied to this system. In this chapter, we consider the numerical solution of another type of problem for the second-order equation (11.1), one where conditions on the solution $Y$ are given at two distinct $t$ values. Such a problem is called a two-point boundary value problem (or sometimes for brevity, a BVP). For simplicity, we begin our discussion with the following BVP for a second-order linear equation:

Y ′′(t) = p(t)Y ′(t) + q(t)Y (t) + r(t), a < t < b, (11.2)

Y (a) = g1, Y (b) = g2. (11.3)

The conditions $Y(a) = g_1$ and $Y(b) = g_2$ are called the boundary conditions. Boundary conditions involving the derivative of the unknown function are also common in applications, and we discuss them later in the chapter.

We assume the given functions $p$, $q$, and $r$ to be continuous on $[a, b]$. A standard theoretical result states that if $q(t) > 0$ for $t \in [a, b]$, then the boundary value problem (11.2)–(11.3) has a unique solution; see Keller [53, p. 11]. We will assume that the problem has a unique smooth solution $Y$.

We begin our discussion of the numerical solution of BVPs by introducing a finite-difference approximation to (11.2). Later we look at more general two-point BVPs for the more general nonlinear second-order equation (11.1), generalizing finite-difference approximations as well. We also introduce other numerical methods for these nonlinear BVPs.

11.1 A FINITE-DIFFERENCE METHOD

The main feature of the finite-difference method is to obtain discrete equations by replacing derivatives with appropriate finite divided differences. We derive a finite-difference system for the BVP (11.2)–(11.3) in three steps.

In the first step, we discretize the domain of the problem: the interval $[a, b]$. Let $N$ be a positive integer, and divide the interval $[a, b]$ into $N$ equal parts:

$$[a, b] = [t_0, t_1] \cup [t_1, t_2] \cup \cdots \cup [t_{N-1}, t_N],$$

where $a = t_0 < t_1 < \cdots < t_{N-1} < t_N = b$ are the grid (or node) points. Denote $h = (b - a)/N$, called the stepsize. Then the node points are given by

$$t_i = a + i\,h, \qquad 0 \le i \le N. \tag{11.4}$$

A nonuniform partition of the interval is also possible, and in fact this is preferable if the solution of the boundary value problem (11.2)–(11.3) changes much more rapidly in some parts of $[a, b]$ as compared to other parts of the interval. We restrict our presentation to the case of uniform partitions for simplicity of exposition. We use the notation $p_i = p(t_i)$, $q_i = q(t_i)$, $r_i = r(t_i)$, $0 \le i \le N$, and denote by $y_i$, $0 \le i \le N$, the numerical approximations of the true solution values $Y_i = Y(t_i)$, $0 \le i \le N$.

In the second step, we discretize the differential equation at the interior node points $t_1, \dots, t_{N-1}$. For this purpose, let us note the following difference approximation formulas

$$Y'(t_i) = \frac{Y_{i+1} - Y_{i-1}}{2h} - \frac{h^2}{6}\, Y^{(3)}(\eta_i), \tag{11.5}$$

$$Y''(t_i) = \frac{Y_{i+1} - 2Y_i + Y_{i-1}}{h^2} - \frac{h^2}{12}\, Y^{(4)}(\xi_i) \tag{11.6}$$

for some $t_{i-1} \le \xi_i, \eta_i \le t_{i+1}$, $i = 1, \dots, N - 1$. The errors can be obtained by using Taylor polynomial approximations to $Y(t)$. We leave this as an exercise for the reader; or see [11, §5.7], [12, §5.4]. Using these relations, the differential equation at $t = t_i$ becomes

$$\frac{Y_{i+1} - 2Y_i + Y_{i-1}}{h^2} = p_i\, \frac{Y_{i+1} - Y_{i-1}}{2h} + q_i Y_i + r_i + O(h^2). \tag{11.7}$$
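The $O(h^2)$ error terms in (11.5)–(11.6) are easy to observe numerically. The following MATLAB sketch is an illustration only: the test function $Y(t) = \log(1 + t^2)$, the point $t_0 = 0.5$, and the stepsizes are our own choices, and the derivatives are entered by hand.

```matlab
% Observe the second-order accuracy of the central differences (11.5)-(11.6)
% on an assumed test function Y(t) = log(1 + t^2) at t0 = 0.5.
Y   = @(t) log(1 + t.^2);
dY  = @(t) 2*t./(1 + t.^2);                  % exact Y'
d2Y = @(t) (2 - 2*t.^2)./(1 + t.^2).^2;      % exact Y''
t0 = 0.5;
hs = [0.1, 0.05, 0.025];
e1 = zeros(size(hs));
e2 = zeros(size(hs));
for m = 1:numel(hs)
    h = hs(m);
    e1(m) = abs(dY(t0)  - (Y(t0+h) - Y(t0-h))/(2*h));
    e2(m) = abs(d2Y(t0) - (Y(t0+h) - 2*Y(t0) + Y(t0-h))/h^2);
end
ratio1 = e1(1:end-1)./e1(2:end);   % error ratios when h is halved
ratio2 = e2(1:end-1)./e2(2:end);
disp([ratio1; ratio2])
```

Both sets of ratios should be close to 4, which is the factor expected for an error proportional to $h^2$ when $h$ is halved.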

Dropping the remainder term $O(h^2)$ and replacing $Y_i$ by $y_i$, we obtain the difference equations

$$\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} = p_i\, \frac{y_{i+1} - y_{i-1}}{2h} + q_i y_i + r_i, \qquad 1 \le i \le N - 1, \tag{11.8}$$

which can be rewritten as

$$-\left(1 + \tfrac{1}{2} h p_i\right) y_{i-1} + \left(2 + h^2 q_i\right) y_i + \left(\tfrac{1}{2} h p_i - 1\right) y_{i+1} = -h^2 r_i, \qquad 1 \le i \le N - 1. \tag{11.9}$$

The third step is devoted to the treatment of the boundary conditions. The difference equations (11.9) consist of $N - 1$ equations for $N + 1$ unknowns $y_0, y_1, \dots, y_N$. We need two more equations, and they come from discretization of the boundary conditions. For the model problem (11.2)–(11.3), the discretization of the boundary conditions is straightforward:

$$y_0 = g_1, \qquad y_N = g_2. \tag{11.10}$$

Equations (11.9) and (11.10) together form a linear system. Since the values of $y_0$ and $y_N$ are explicitly given in (11.10), we can eliminate $y_0$ and $y_N$ from the linear system. With $y_0 = g_1$, we can rewrite the equation in (11.9) with $i = 1$ as

$$\left(2 + h^2 q_1\right) y_1 + \left(\tfrac{1}{2} h p_1 - 1\right) y_2 = -h^2 r_1 + \left(1 + \tfrac{1}{2} h p_1\right) g_1. \tag{11.11}$$

Similarly, from the equation in (11.9) with $i = N - 1$, we obtain

$$-\left(1 + \tfrac{1}{2} h p_{N-1}\right) y_{N-2} + \left(2 + h^2 q_{N-1}\right) y_{N-1} = -h^2 r_{N-1} + \left(1 - \tfrac{1}{2} h p_{N-1}\right) g_2. \tag{11.12}$$

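So finally, the finite-difference system for the unknown numerical solution vector $y = [y_1, \cdots, y_{N-1}]^T$ is

$$A y = b, \tag{11.13}$$

where

$$A = \begin{bmatrix}
2 + h^2 q_1 & \tfrac{1}{2} h p_1 - 1 & & & \\
-\left(1 + \tfrac{1}{2} h p_2\right) & 2 + h^2 q_2 & \tfrac{1}{2} h p_2 - 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -\left(1 + \tfrac{1}{2} h p_{N-2}\right) & 2 + h^2 q_{N-2} & \tfrac{1}{2} h p_{N-2} - 1 \\
 & & & -\left(1 + \tfrac{1}{2} h p_{N-1}\right) & 2 + h^2 q_{N-1}
\end{bmatrix}$$

is the coefficient matrix and

$$b_i = \begin{cases}
-h^2 r_1 + \left(1 + \tfrac{1}{2} h p_1\right) g_1, & i = 1, \\
-h^2 r_i, & i = 2, \dots, N - 2, \\
-h^2 r_{N-1} + \left(1 - \tfrac{1}{2} h p_{N-1}\right) g_2, & i = N - 1.
\end{cases} \tag{11.14}$$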

The linear system (11.13) is tridiagonal, and the solution of tridiagonal linear systems is a very well-studied problem. Examples of programs for the efficient solution of tridiagonal linear systems can be found in LAPACK [3].


Table 11.1 Numerical errors $Y(x) - y_h(x)$ for solving (11.19)

t      h = 1/20    h = 1/40    Ratio    h = 1/80    Ratio    h = 1/160    Ratio
0.1    5.10e-5     1.27e-5     4.00     3.18e-6     4.00     7.96e-7      4.00
0.2    7.84e-5     1.96e-5     4.00     4.90e-6     4.00     1.22e-6      4.00
0.3    8.64e-5     2.16e-5     4.00     5.40e-6     4.00     1.35e-6      4.00
0.4    8.08e-5     2.02e-5     4.00     5.05e-6     4.00     1.26e-6      4.00
0.5    6.73e-5     1.68e-5     4.00     4.21e-6     4.00     1.05e-6      4.00
0.6    5.08e-5     1.27e-5     4.00     3.17e-6     4.00     7.94e-7      4.00
0.7    3.44e-5     8.60e-6     4.00     2.15e-6     4.00     5.38e-7      4.00
0.8    2.00e-5     5.01e-6     4.00     1.25e-6     4.00     3.13e-7      4.00
0.9    8.50e-6     2.13e-6     4.00     5.32e-7     4.00     1.33e-7      4.00

11.1.1 Convergence

It can be shown that if the true solution $Y(t)$ is sufficiently smooth, say, with continuous derivatives up to order 4, then the difference scheme (11.13)–(11.14) is a second-order method,

$$\max_{0 \le i \le N} |Y(t_i) - y_i| = O(h^2). \tag{11.15}$$

For a detailed discussion, see Ascher et al. [9, p. 189]. Moreover, if $Y(t)$ has six continuous derivatives, the following asymptotic error expansion holds:

$$Y(t_i) - y_h(t_i) = h^2 D(t_i) + O(h^4), \qquad 0 \le i \le N \tag{11.16}$$

for some function $D(t)$ independent of $h$. The Richardson extrapolation formula for this case is

$$\tilde{y}_h(t_i) = \tfrac{1}{3}\left[4\, y_h(t_i) - y_{2h}(t_i)\right], \tag{11.17}$$

and we have

$$Y(t_i) - \tilde{y}_h(t_i) = O(h^4). \tag{11.18}$$
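The step from (11.16) to (11.18) is a short cancellation argument, which can be written out as follows. Here $t_i$ is taken to be a node common to the grids with stepsizes $h$ and $2h$, and $\tilde{y}_h$ denotes the extrapolated value $\tfrac{1}{3}[4\,y_h - y_{2h}]$.

```latex
\begin{aligned}
Y(t_i) - y_h(t_i)    &= h^2 D(t_i) + O(h^4),\\
Y(t_i) - y_{2h}(t_i) &= 4h^2 D(t_i) + O(h^4),\\[4pt]
4\bigl[Y(t_i) - y_h(t_i)\bigr] - \bigl[Y(t_i) - y_{2h}(t_i)\bigr]
                     &= 3\,Y(t_i) - \bigl[4\,y_h(t_i) - y_{2h}(t_i)\bigr] = O(h^4),\\[4pt]
Y(t_i) - \tilde{y}_h(t_i)
                     &= Y(t_i) - \tfrac{1}{3}\bigl[4\,y_h(t_i) - y_{2h}(t_i)\bigr] = O(h^4).
\end{aligned}
```

The weights 4 and $-1$ are exactly those that eliminate the $h^2 D(t_i)$ term, which is why the function $D$ must be independent of $h$.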

11.1.2 A numerical example

We illustrate the finite-difference approximation (11.12), the error result (11.15), and the Richardson extrapolation results (11.16)–(11.18). The MATLAB® codes that we use for our calculations are given following the example.

Example 11.1 Consider the boundary value problem

$$\begin{aligned}
& Y'' = -\frac{2t}{1 + t^2}\, Y' + Y + \frac{2}{1 + t^2} - \log(1 + t^2), \qquad 0 < t < 1,\\
& Y(0) = 0, \qquad Y(1) = \log(2).
\end{aligned} \tag{11.19}$$

The true solution is $Y(t) = \log(1 + t^2)$. In Table 11.1, we report the finite-difference solution errors $Y - y_h$ at selected node points for several values of $h$. In Table 11.2, we report the errors of the extrapolated solutions $Y - \tfrac{1}{3}(4\,y_h - y_{2h})$ at the same node points and the associated ratios of the errors for different stepsizes. The column marked “Ratio” next to the column of the solution errors for a stepsize $h$ consists of the ratios of the solution errors for the stepsize $2h$ with those for the stepsize $h$. We clearly observe an error reduction of a factor of approximately 4 when the stepsize is halved, indicating a second-order convergence of the method as asserted in (11.15).

Table 11.2 Extrapolation errors $Y(t_i) - \tilde{y}_h(t_i)$ for solving (11.19)

t      h = 1/40      h = 1/80      Ratio     h = 1/160     Ratio
0.1    -9.23e-09     -5.76e-10     16.01     -3.60e-11     16.00
0.2    -1.04e-08     -6.53e-10     15.99     -4.08e-11     15.99
0.3    -6.60e-09     -4.14e-10     15.96     -2.59e-11     15.98
0.4    -1.18e-09     -7.57e-11     15.64     -4.78e-12     15.85
0.5     3.31e-09      2.05e-10     16.14      1.28e-11     16.06
0.6     5.76e-09      3.59e-10     16.07      2.24e-11     16.04
0.7     6.12e-09      3.81e-10     16.04      2.38e-11     16.03
0.8     4.88e-09      3.04e-10     16.03      1.90e-11     16.03
0.9     2.67e-09      1.67e-10     16.02      1.04e-11     16.03

There is a dramatic improvement in the solution accuracy through extrapolation. The extrapolated solution $\tilde{y}_h$ with $h = 1/40$ is much more accurate than the solution $y_h$ with $h = 1/160$. Note that the cost of obtaining $\tilde{y}_h$ with $h = 1/40$ is substantially smaller than that for $y_h$ with $h = 1/160$. Also observe that for the extrapolated solution $\tilde{y}_h$, the error decreases by a factor of approximately 16 when $h$ is halved. Indeed, it can be shown that if the true solution $Y(t)$ is 8 times continuously differentiable, then we can improve the asymptotic error expansion (11.16) to

$$Y(t_i) - y_h(t_i) = h^2 D_1(t_i) + h^4 D_2(t_i) + O(h^6). \tag{11.20}$$

Then (11.18) is replaced by

$$Y(t_i) - \tilde{y}_h(t_i) = -4 h^4 D_2(t_i) + O(h^6). \tag{11.21}$$

Therefore, we can also perform an extrapolation procedure on $\tilde{y}_h$ to get an even more accurate numerical solution through the following formula:

$$Y(t_i) - \tfrac{1}{15}\left[16\, \tilde{y}_h(t_i) - \tilde{y}_{2h}(t_i)\right] = O(h^6). \tag{11.22}$$

As an example, at $t_i = 0.5$, with $h = 1/80$, the doubly extrapolated solution has an error approximately equal to $-1.88 \times 10^{-12}$.
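The experiment of Example 11.1 can be reproduced in a few lines. The MATLAB sketch below is a self-contained illustration rather than the program listed in the text: it assembles the system (11.13)–(11.14) as a full matrix and solves it with the backslash operator instead of a dedicated tridiagonal solver, and all variable names are our own. It then forms one Richardson extrapolation from the $N = 20$ and $N = 40$ solutions.

```matlab
% Finite-difference solution of (11.19) and one Richardson extrapolation.
% Assumption: a dense matrix and backslash are used for brevity.
p = @(t) -2*t./(1 + t.^2);
q = @(t) ones(size(t));
r = @(t) 2./(1 + t.^2) - log(1 + t.^2);
Ytrue = @(t) log(1 + t.^2);
a = 0; b = 1; g1 = 0; g2 = log(2);
Ns  = [20 40 80];
err = zeros(size(Ns));
sol = cell(size(Ns));
for m = 1:numel(Ns)
    N = Ns(m); h = (b - a)/N;
    t  = a + h*(1:N-1)';            % interior node points
    lo = -(1 + h*p(t)/2);           % coefficients of y_{i-1} in (11.9)
    di = 2 + h^2*q(t);              % coefficients of y_i
    up = h*p(t)/2 - 1;              % coefficients of y_{i+1}
    A  = diag(di) + diag(lo(2:end), -1) + diag(up(1:end-1), 1);
    rhs = -h^2*r(t);
    rhs(1)   = rhs(1)   - lo(1)*g1;     % boundary term, as in (11.11)
    rhs(end) = rhs(end) - up(end)*g2;   % boundary term, as in (11.12)
    sol{m} = A\rhs;
    err(m) = max(abs(Ytrue(t) - sol{m}));
end
ratios = err(1:end-1)./err(2:end);      % expected to be close to 4
% The N = 20 nodes are every second interior node of the N = 40 grid.
tc   = a + ((b - a)/20)*(1:19)';
yext = (4*sol{2}(2:2:end) - sol{1})/3;  % extrapolation formula (11.17)
err_ext = max(abs(Ytrue(tc) - yext));
fprintf('errors: %8.2e %8.2e %8.2e, extrapolated: %8.2e\n', err, err_ext);
```

For larger $N$ a sparse or tridiagonal solver would be the natural choice; the dense matrix is used here only to keep the sketch short.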

MATLAB program. The following MATLAB code ODEBVP implements the difference method (11.13) for solving the problem (11.2)–(11.3).

function z = ODEBVP(p,q,r,a,b,ga,gb,N)
%
% function z = ODEBVP(p,q,r,a,b,ga,gb,N)
%
% A program to solve the two point boundary
% value problem
%    y"=p(t)y'+q(t)y+r(t), a<t<b
%    y(a)=g1, y(b)=g2
% Input
%    p, q, r: coefficient functions
%    a, b: the end-points of the interval
%    ga, gb: the prescribed function values
%            at the end-points
%    N: number of sub-intervals
% Output
%    z = [ tt yy ]: tt is an (N+1) column vector
%                   of the node points
%                   yy is an (N+1) column vector of
%                   the solution values
% A sample call would be
%    z=ODEBVP('p','q','r',a,b,ga,gb,100)
% The user must provide m-files to define the
% functions p, q, and r.
%
% The user must also supply a MATLAB program, called
% tridiag.m, for solving tridiagonal linear systems.
%
% Initialization
N1 = N+1;
h = (b-a)/N;
h2 = h*h;
tt = linspace(a,b,N1)';
yy = zeros(N1,1);
yy(1) = ga;
yy(N1) = gb;
% Define the sub-diagonal avec, main diagonal bvec,
% superdiagonal cvec
pp(2:N) = feval(p,tt(2:N));
avec(2:N-1) = -1-(h/2)*pp(3:N);
bvec(1:N-1) = 2+h2*feval(q,tt(2:N));
cvec(1:N-2) = -1+(h/2)*pp(2:N-1);
% Define the right hand side vector fvec
fvec(1:N-1) = -h2*feval(r,tt(2:N));
fvec(1) = fvec(1)+(1+h*pp(2)/2)*ga;
fvec(N-1) = fvec(N-1)+(1-h*pp(N)/2)*gb;
% Solve the tridiagonal system
yy(2:N) = tridiag(avec,bvec,cvec,fvec,N-1,0);
z = [tt'; yy']';

The following MATLAB code tridiag solves tridiagonal linear systems.

function [x, alpha, beta, message] = tridiag(a,b,c,f,n,option)
%
% function [x, alpha, beta, message] = tridiag(a,b,c,f,n,option)
%
% Solve a tridiagonal linear system M*x=f
%
% INPUT:
% The order of the linear system is given as n.
% The subdiagonal, diagonal, and superdiagonal of M are given
% by the arrays a,b,c, respectively. More precisely,
%     M(i,i-1) = a(i), i=2,...,n
%     M(i,i)   = b(i), i=1,...,n
%     M(i,i+1) = c(i), i=1,...,n-1
% option=0 means that the original matrix M is given as
%     specified above.
% option=1 means that the LU factorization of M is already
%     known and is stored in a,b,c. This will have been
%     accomplished by a previous call to this routine. In
%     that case, the vectors alpha and beta should have
%     been substituted for a and b in the calling sequence.
% All input values are unchanged on exit from the routine.
%
% OUTPUT:
% Upon exit, the LU factorization of M is already known and
% is stored in alpha,beta,c. The solution x is given as well.
% message=0 means the program was completed satisfactorily.
% message=1 means that a zero pivot element was encountered
%     and the solution process was abandoned. This case
%     happens only when option=0.
if option == 0
   % Start from copies of the subdiagonal and diagonal of M.
   alpha = a; beta = b;
   alpha(1) = 0;
   % Compute LU factorization of matrix M.
   for j=2:n
      if beta(j-1) == 0
         message = 1; x = []; return
      end
      alpha(j) = alpha(j)/beta(j-1);
      beta(j) = beta(j) - alpha(j)*c(j-1);
   end
   if beta(n) == 0
      message = 1; x = []; return
   end
end
% Compute solution x to M*x = f using LU factorization of M.
% Do forward substitution to solve lower triangular system.
if option == 1
   % The factorization was passed in through a and b.
   alpha = a; beta = b;
end
x = f; message = 0;
for j=2:n
   x(j) = x(j) - alpha(j)*x(j-1);
end
% Do backward substitution to solve upper triangular system.
x(n) = x(n)/beta(n);
for j=n-1:-1:1
   x(j) = (x(j) - c(j)*x(j+1))/beta(j);
end
end % tridiag

11.1.3 Boundary conditions involving the derivative

The treatment of boundary conditions involving the derivative of the unknown $Y(t)$ is somewhat involved. Assume that the boundary condition at $t = b$ is

$$Y'(b) + k\, Y(b) = g_2. \tag{11.23}$$

One obvious discretization is to approximate $Y'(b)$ by $(Y_N - Y_{N-1})/h$. However,

$$Y'(b) - \frac{Y_N - Y_{N-1}}{h} = O(h), \tag{11.24}$$

and the accuracy of this approximation is one order lower than the remainder term $O(h^2)$ in (11.7). As a result, the corresponding difference solution with the following discrete boundary condition

$$\frac{y_N - y_{N-1}}{h} + k\, y_N = g_2 \tag{11.25}$$

will have an accuracy of $O(h)$ only. To retain the second-order convergence of the difference solution, we need to approximate the boundary condition (11.23) more accurately. One such treatment is based on the formula

$$Y'(b) = \frac{3 Y_N - 4 Y_{N-1} + Y_{N-2}}{2h} + O(h^2). \tag{11.26}$$

Then the boundary condition (11.23) is approximated by

$$\frac{3 y_N - 4 y_{N-1} + y_{N-2}}{2h} + k\, y_N = g_2. \tag{11.27}$$

It can be shown that the resulting difference scheme is again second-order accurate.

A similar treatment can be given for more general boundary conditions that involve the derivatives $Y'(a)$ and $Y'(b)$. For a comprehensive introduction to this and to the general subject of the numerical solution of two-point boundary value problems, see Keller [53], Ascher et al. [9], or Ascher and Petzold [10, Chap. 6].
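The difference in accuracy between the one-sided formulas in (11.24) and (11.26) can be checked numerically. The MATLAB sketch below is an illustration under our own choices: the test function $Y(t) = e^t$, the endpoint $b = 1$, and the stepsizes are assumptions made only for this check.

```matlab
% Compare the two-point and three-point one-sided approximations of Y'(b)
% on an assumed test function Y(t) = exp(t) with b = 1.
Y  = @(t) exp(t);
dY = @(t) exp(t);
b  = 1;
hs = [0.02, 0.01, 0.005];
e_two   = zeros(size(hs));
e_three = zeros(size(hs));
for m = 1:numel(hs)
    h = hs(m);
    e_two(m)   = abs(dY(b) - (Y(b) - Y(b-h))/h);                      % as in (11.24)
    e_three(m) = abs(dY(b) - (3*Y(b) - 4*Y(b-h) + Y(b-2*h))/(2*h));   % as in (11.26)
end
r_two   = e_two(1:end-1)./e_two(2:end);       % about 2: first order
r_three = e_three(1:end-1)./e_three(2:end);   % about 4: second order
disp([r_two; r_three])
```

Halving $h$ roughly halves the error of the two-point formula but reduces the error of the three-point formula by a factor of about 4, consistent with the orders stated in (11.24) and (11.26).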

11.2 NONLINEAR TWO-POINT BOUNDARY VALUE PROBLEMS

Consider the two-point boundary value problem

$$\begin{aligned}
& Y'' = f(t, Y, Y'), \qquad a < t < b,\\
& A \begin{bmatrix} Y(a) \\ Y'(a) \end{bmatrix} + B \begin{bmatrix} Y(b) \\ Y'(b) \end{bmatrix} = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}.
\end{aligned} \tag{11.28}$$

The terms $A$ and $B$ denote given square matrices of order $2 \times 2$, and $\gamma_1$ and $\gamma_2$ are given constants. The theory for BVPs such as this one is more complex than that for the initial value problem.

The theory for the nonlinear problem (11.28) is more complicated than that for the linear problem (11.2). We give an introduction to that theory for the following more limited problem:

$$Y'' = f(t, Y, Y'), \qquad a < t < b, \tag{11.29}$$

$$a_0 Y(a) - a_1 Y'(a) = g_1, \qquad b_0 Y(b) + b_1 Y'(b) = g_2 \tag{11.30}$$

with $\{a_0, a_1, b_0, b_1, g_1, g_2\}$ as given constants. The function $f$ is assumed to satisfy the following Lipschitz condition,

$$\begin{aligned}
|f(t, u_1, v) - f(t, u_2, v)| &\le K\, |u_1 - u_2|,\\
|f(t, u, v_1) - f(t, u, v_2)| &\le K\, |v_1 - v_2|
\end{aligned} \tag{11.31}$$

for all points $(t, u_i, v)$, $(t, u, v_i)$, $i = 1, 2$, in the region

$$R = \{(t, u, v) \mid a \le t \le b,\ -\infty < u, v < \infty\}.$$

This is far stronger than needed, but it simplifies the statement of the following theorem; and although we do not give it here, it also simplifies the error analysis of numerical methods for (11.29)–(11.30).

Theorem 11.2 For the problem (11.29)–(11.30), assume $f(t, u, v)$ to be continuous on the region $R$ and that it satisfies the Lipschitz condition (11.31). In addition, assume that on $R$, $f$ satisfies

$$\frac{\partial f(t, u, v)}{\partial u} > 0, \qquad \left| \frac{\partial f(t, u, v)}{\partial v} \right| \le M \tag{11.32}$$

for some constant $M > 0$. For the boundary conditions of (11.30), assume

$$\begin{aligned}
& a_0 a_1 \ge 0, \qquad b_0 b_1 \ge 0,\\
& |a_0| + |a_1| \ne 0, \qquad |b_0| + |b_1| \ne 0, \qquad |a_0| + |b_0| \ne 0.
\end{aligned} \tag{11.33}$$

Then the BVP (11.29)–(11.30) has a unique solution.
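As a quick check of the hypotheses, the linear problem (11.2)–(11.3) can be viewed as a special case. The worked verification below is a sketch under the assumption, stated earlier in the chapter, that $p$, $q$, and $r$ are continuous on $[a, b]$; the bound $K$ is our own notation.

```latex
\begin{aligned}
f(t, u, v) &= p(t)\,v + q(t)\,u + r(t),
  \qquad \frac{\partial f}{\partial u} = q(t),
  \qquad \frac{\partial f}{\partial v} = p(t),\\[4pt]
\text{(11.31):}\quad & K = \max\Bigl\{ \max_{a \le t \le b} |q(t)|,\ \max_{a \le t \le b} |p(t)| \Bigr\},\\
\text{(11.32):}\quad & q(t) > 0 \ \text{on } [a, b], \qquad |p(t)| \le M = \max_{a \le t \le b} |p(t)|,\\
\text{(11.33):}\quad & a_0 = b_0 = 1,\ a_1 = b_1 = 0
  \ \Longrightarrow\ a_0 a_1 = 0 \ge 0,\quad b_0 b_1 = 0 \ge 0,\quad
  |a_0| + |a_1| = 1,\quad |b_0| + |b_1| = 1,\quad |a_0| + |b_0| = 2.
\end{aligned}
```

So when $q(t) > 0$ on $[a, b]$, the theorem gives the unique solvability of the linear problem, in agreement with the result quoted for (11.2)–(11.3).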

For a proof, see Keller [53, p. 9].

Although this theorem gives conditions for the BVP (11.29)–(11.30) to be uniquely solvable, in fact nonlinear BVPs may be nonuniquely solvable with only a finite number of solutions. This is in contrast to the situation for linear problems such as (11.2)–(11.3) in which nonuniqueness always implies an infinity of solutions. An example of such nonunique solvability for a nonlinear BVP is the second-order problem

$$\begin{aligned}
& \frac{d}{dt}\left[ I(t)\, \frac{dY}{dt} \right] + \lambda \sin(Y) = 0, \qquad 0 < t < 1,\\
& Y'(0) = Y'(1) = 0, \qquad |Y(t)| < \pi,
\end{aligned} \tag{11.34}$$

which arises in studying the buckling of a vertical column when a vertical force is applied. The unknown $Y(t)$ is related to the displacement of the column in the radial direction from its centerline. In the equation, $I(t)$ is a given function related to physical properties of the column, and the parameter $\lambda$ is proportional to the load on the column. When $\lambda$ exceeds a certain size, there is a solution to the problem (11.34) other than the zero solution. As $\lambda$ continues to increase, the BVP (11.34) has an increasing number of nonzero solutions, only one of which is the correct physical solution. For a detailed discussion of this problem, see Keller and Antman [54, p. 43].

As with the earlier material on initial value problems in Chapter 3, all boundary value problems for higher-order equations can be reformulated as problems for systems of first-order equations. The general form of a two-point BVP for a system of first-order equations is

$$\begin{aligned}
& \mathbf{Y}' = \mathbf{f}(t, \mathbf{Y}), \qquad a < t < b,\\
& A\,\mathbf{Y}(a) + B\,\mathbf{Y}(b) = \mathbf{g}.
\end{aligned} \tag{11.35}$$

This represents a system of $m$ first-order equations. The quantities $\mathbf{Y}(t)$, $\mathbf{f}(t, \mathbf{Y})$, and $\mathbf{g}$ are vectors with $m$ components, and $A$ and $B$ are matrices of order $m \times m$. There is a theory for such BVPs, analogous to that for the two-point problem (11.28), but we omit it here because of space limitations.

In the remainder of this section, we describe briefly the principal numerical methods for solving the two-point BVP (11.28). These methods generalize to first-order systems such as (11.35), but again, because of space limitations, we omit those results. Much of our presentation follows Keller [53], and a theory for first-order systems is given there. Unlike the situation with initial value problems, it is often advantageous to directly treat higher-order BVPs rather than to numerically solve their reformulation as a first-order system. The numerical methods for the two-point boundary value problem (11.28) are also less complicated to present, and therefore we have opted to discuss the second-order problem (11.28) rather than the system (11.35).

11.2.1 Finite difference methods

We consider the two-point BVP

$$
\begin{aligned}
&Y'' = f(t, Y, Y'), \quad a < t < b,\\
&Y(a) = g_1, \quad Y(b) = g_2,
\end{aligned}
\tag{11.36}
$$

with the true solution denoted by $Y(t)$. The boundary conditions are of the same form as used with our earlier finite-difference approximation for the linear problem (11.2)–(11.3). As before, in (11.4), introduce an equally spaced subdivision

$$
a = t_0 < t_1 < \cdots < t_N = b.
$$

At each interior node point $t_i$, $0 < i < N$, we approximate $Y''(t_i)$ and $Y'(t_i)$ as in (11.5)–(11.6). Dropping the final error terms in (11.5)–(11.6) and using these approximations in the differential equation, we are led to the approximating nonlinear system:

$$
\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} = f\left(t_i,\, y_i,\, \frac{y_{i+1} - y_{i-1}}{2h}\right), \quad i = 1, \dots, N-1. \tag{11.37}
$$

This is a system of $N-1$ nonlinear equations in the $N-1$ unknowns $y_1, \dots, y_{N-1}$; compare with the system (11.8). The values $y_0 = g_1$ and $y_N = g_2$ are known from the boundary conditions.

The analysis of the error in $\{y_i\}$ as compared to $\{Y(t_i)\}$ is too complicated to be given here, because it requires methods for analyzing the solvability of systems of nonlinear equations. In essence, if $Y(t)$ is 4 times differentiable, if the problem (11.36) is uniquely solvable for some region about the graph on $[a, b]$ of $Y(t)$, and if $f(t, u, v)$ is sufficiently differentiable, then there is a solution to (11.37), and it satisfies

$$
\max_{0 \le i \le N} |Y(t_i) - y_i| = O(h^2). \tag{11.38}
$$

For an analysis, see Keller [52, Sec. 3.2] or [53, Sec. 3.2]. Moreover, with additional assumptions on $f$ and the smoothness of $Y$, it can be shown that

$$
Y(t_i) - y_i = D(t_i)\,h^2 + O(h^4) \tag{11.39}
$$

with $D(t)$ independent of $h$. This can be used to justify Richardson extrapolation to obtain results that converge more rapidly, just as earlier in (11.16)–(11.18). (There are other methods for improving the convergence, based on correcting for the error in the central difference approximations of (11.5)–(11.6); e.g., see [27], [77].)
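As a brief sketch of why (11.39) justifies extrapolation, write $y_h(t_i)$ for the finite-difference value at the node $t_i$ computed with stepsize $h$ (this notation is an assumption made here for illustration), and suppose $t_i$ is a node of both the grid with stepsize $h$ and the grid with stepsize $h/2$. Applying (11.39) on each grid gives

$$
Y(t_i) - y_h(t_i) = D(t_i)\,h^2 + O(h^4), \qquad
Y(t_i) - y_{h/2}(t_i) = \tfrac{1}{4} D(t_i)\,h^2 + O(h^4).
$$

Multiplying the second relation by 4 and subtracting the first eliminates the $D(t_i)h^2$ term:

$$
Y(t_i) = \tfrac{1}{3}\left[ 4\,y_{h/2}(t_i) - y_h(t_i) \right] + O(h^4).
$$

The bracketed combination is therefore a fourth-order approximation built from two second-order ones.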

The system (11.37) can be solved in a variety of ways, some of which are simple modifications of Newton’s method for solving systems of nonlinear equations. We describe here the application of the standard Newton method.


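In matrix form, we have

$$
\frac{1}{h^2}
\begin{bmatrix}
-2 & 1 & 0 & \cdots & 0\\
1 & -2 & 1 & & \vdots\\
 & \ddots & \ddots & \ddots & \\
\vdots & & 1 & -2 & 1\\
0 & \cdots & 0 & 1 & -2
\end{bmatrix}
\begin{bmatrix}
y_1\\ y_2\\ \vdots\\ y_{N-1}
\end{bmatrix}
=
\begin{bmatrix}
f\left(t_1,\, y_1,\, \frac{1}{2h}(y_2 - g_1)\right)\\
f\left(t_2,\, y_2,\, \frac{1}{2h}(y_3 - y_1)\right)\\
\vdots\\
f\left(t_{N-1},\, y_{N-1},\, \frac{1}{2h}(g_2 - y_{N-2})\right)
\end{bmatrix}
-
\begin{bmatrix}
g_1/h^2\\ 0\\ \vdots\\ g_2/h^2
\end{bmatrix},
$$

which we denote by

$$
\frac{1}{h^2}\,T\mathbf{y} = \mathbf{f}(\mathbf{y}) + \mathbf{g}. \tag{11.40}
$$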

The matrix $T$ is both tridiagonal and nonsingular (see Problem 14). As was discussed earlier for the solution of (11.13) for the linear BVP (11.2)–(11.3), tridiagonal linear systems $T\mathbf{z} = \mathbf{b}$ are easily solvable. This can be used to show that (11.40) is solvable for all sufficiently small values of $h$; moreover, the solution is unique in a region of $\mathbb{R}^{N-1}$ corresponding to some neighborhood of the graph of the solution $Y(t)$ for the original BVP (11.36). Newton’s method (see [11, §2.11]) for solving (11.40) is given by

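$$
\mathbf{y}^{(m+1)} = \mathbf{y}^{(m)} - \left[ \frac{1}{h^2}T - F(\mathbf{y}^{(m)}) \right]^{-1} \left[ \frac{1}{h^2}T\mathbf{y}^{(m)} - \mathbf{f}(\mathbf{y}^{(m)}) - \mathbf{g} \right] \tag{11.41}
$$

with $F$ the Jacobian matrix for $\mathbf{f}$,

$$
F(\mathbf{y}) = \left[ \frac{\partial f_i}{\partial y_j} \right]_{i,j=1,\dots,N-1}.
$$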

This matrix simplifies considerably because of the special form of $\mathbf{f}(\mathbf{y})$,

$$
[F(\mathbf{y})]_{ij} = \frac{\partial}{\partial y_j}\, f\left(t_i,\, y_i,\, \frac{1}{2h}(y_{i+1} - y_{i-1})\right).
$$


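This is zero unless $j = i-1$, $i$, or $i+1$:

$$
\begin{aligned}
[F(\mathbf{y})]_{ii} &= f_2\left(t_i,\, y_i,\, \frac{1}{2h}(y_{i+1} - y_{i-1})\right), & 1 \le i \le N-1,\\
[F(\mathbf{y})]_{i,i-1} &= -\frac{1}{2h}\, f_3\left(t_i,\, y_i,\, \frac{1}{2h}(y_{i+1} - y_{i-1})\right), & 2 \le i \le N-1,\\
[F(\mathbf{y})]_{i,i+1} &= \frac{1}{2h}\, f_3\left(t_i,\, y_i,\, \frac{1}{2h}(y_{i+1} - y_{i-1})\right), & 1 \le i \le N-2,
\end{aligned}
$$

with $f_2(t, u, v)$ and $f_3(t, u, v)$ denoting partial derivatives of $f$ with respect to $u$ and $v$, respectively. Thus the matrix being inverted in (11.41) is tridiagonal. Letting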

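$$
B_m = \frac{1}{h^2}T - F(\mathbf{y}^{(m)}), \tag{11.42}
$$

we can rewrite (11.41) as

$$
\begin{aligned}
\mathbf{y}^{(m+1)} &= \mathbf{y}^{(m)} - \boldsymbol{\delta}^{(m)},\\
B_m \boldsymbol{\delta}^{(m)} &= \frac{1}{h^2}T\mathbf{y}^{(m)} - \mathbf{f}(\mathbf{y}^{(m)}) - \mathbf{g}.
\end{aligned}
\tag{11.43}
$$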

This linear system is easily and rapidly solvable, for example, using the MATLAB code of Subsection 11.1.2. The number of multiplications and divisions can be shown to equal approximately $5N$, a relatively small number of operations for solving a linear system of $N-1$ equations. Additional savings can be made by not varying $B_m$ or by changing it only after several iterations of (11.43). For an extensive survey and discussion of the solution of nonlinear systems that arise in connection with solving BVPs, see Deuflhard [32].

Example 11.3 Consider the two-point BVP:

$$
\begin{aligned}
&Y'' = -Y + \frac{2(Y')^2}{Y}, \quad -1 < x < 1,\\
&Y(-1) = Y(1) = (e + e^{-1})^{-1} \doteq 0.324027137.
\end{aligned}
\tag{11.44}
$$

The true solution is $Y(x) = (e^x + e^{-x})^{-1}$. We applied the preceding finite-difference procedure (11.37) to the solution of this BVP. The results are given in Table 11.3 for successive doublings of $N = 2/h$. The nonlinear system in (11.37) was solved using Newton’s method, as described in (11.43). The initial guess was

$$
y_h^{(0)}(x_i) = (e + e^{-1})^{-1}, \quad i = 0, 1, \dots, N,
$$

based on connecting the boundary values by a straight line. The quantity

$$
d_h = \max_{0 \le i \le N} \left| y_i^{(m+1)} - y_i^{(m)} \right|
$$

was computed for each iterate, and when the condition

$$
d_h \le 10^{-10}
$$

was satisfied, the iteration was terminated. In all cases, the number of iterates computed was 5 or 6. For the error, let

$$
E_h = \max_{0 \le i \le N} |Y(x_i) - y_h(x_i)|
$$

with $y_h$ the solution of (11.37) obtained with Newton’s method. According to (11.38) and (11.39), we should expect the values $E_h$ to decrease by a factor of approximately 4 when $h$ is halved, and that is what we observe in Table 11.3.

Table 11.3 Finite difference method for solving (11.44)

N = 2/h    E_h        Ratio
4          2.63e-2
8          5.87e-3    4.48
16         1.43e-3    4.11
32         3.55e-4    4.03
64         8.86e-5    4.01
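The Newton iteration (11.43) for this example can be sketched in a few lines. The following is a minimal illustration, not the authors' MATLAB code: the function names (`fd_newton`, `solve_tridiagonal`, `max_error`) are hypothetical, the tridiagonal solver is a plain elimination without pivoting, and the iteration cap of 50 is an arbitrary safeguard. It assembles the three diagonals of $B_m$ from $f_2$ and $f_3$ and stops with the test $d_h \le 10^{-10}$.

```python
import math

def f(t, u, v):
    # Right-hand side of (11.44): Y'' = -Y + 2 (Y')^2 / Y
    return -u + 2.0 * v * v / u

def f2(t, u, v):
    # Partial derivative of f with respect to u
    return -1.0 - 2.0 * (v / u) ** 2

def f3(t, u, v):
    # Partial derivative of f with respect to v
    return 4.0 * v / u

def solve_tridiagonal(sub, diag, sup, rhs):
    """Tridiagonal elimination without pivoting; sub[0] and sup[-1] are unused."""
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):
        w = sub[i] / d[i - 1]
        d[i] -= w * sup[i - 1]
        r[i] -= w * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - sup[i] * x[i + 1]) / d[i]
    return x

def fd_newton(N, a=-1.0, b=1.0, tol=1e-10, max_iter=50):
    g1 = g2 = 1.0 / (math.e + 1.0 / math.e)
    h = (b - a) / N
    t = [a + i * h for i in range(N + 1)]
    y = [g1] * N + [g2]          # constant initial guess; y[0], y[N] stay fixed
    for _ in range(max_iter):
        sub, diag, sup, res = [], [], [], []
        for i in range(1, N):
            v = (y[i + 1] - y[i - 1]) / (2.0 * h)
            d3 = f3(t[i], y[i], v)
            # Row i of B_m = T/h^2 - F(y), see (11.42)
            sub.append(1.0 / h**2 + d3 / (2.0 * h))
            diag.append(-2.0 / h**2 - f2(t[i], y[i], v))
            sup.append(1.0 / h**2 - d3 / (2.0 * h))
            # Residual (1/h^2) T y - f(y) - g, with boundary values folded in
            res.append((y[i + 1] - 2.0 * y[i] + y[i - 1]) / h**2 - f(t[i], y[i], v))
        delta = solve_tridiagonal(sub, diag, sup, res)
        for i in range(1, N):
            y[i] -= delta[i - 1]
        if max(abs(d) for d in delta) <= tol:
            break
    return t, y

def max_error(N):
    t, y = fd_newton(N)
    return max(abs(1.0 / (math.exp(ti) + math.exp(-ti)) - yi)
               for ti, yi in zip(t, y))
```

Calling `max_error(N)` for `N = 4, 8, 16, 32, 64` should give errors that shrink by a factor close to 4 per doubling, which is the behavior the table reports.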

Higher-order methods can be obtained in several ways.

1. Using higher-order approximations to the derivatives, improving (11.5)–(11.6).

2. Using Richardson extrapolation based on (11.39), as was done in Subsection 11.1.1 for the linear BVP (11.2)–(11.3). Richardson extrapolation can be used repeatedly to obtain methods of increasingly higher order. This was discussed in Subsection 11.1.2, yielding the formulas (11.20)–(11.22) for extrapolating twice.

3. The truncation errors in (11.5)–(11.6) can be approximated with higher-order differences using the calculated values of $y_h$. Using these values as corrections in (11.37), we can obtain a new, more accurate approximation to the differential equation in (11.36), leading to a more accurate solution. This is sometimes called the method of deferred corrections; for more recent work, see [27], [77].

All of these techniques have been used, and some have been implemented as quite sophisticated computer codes.


11.2.2 Shooting methods

Another popular approach to solving a two-point BVP is to reduce it to a problem in which a program for solving initial value problems can be used. We now develop such a method for the BVP (11.29)–(11.30).

Consider the initial value problem

$$
\begin{aligned}
&Y'' = f(t, Y, Y'), \quad a < t < b,\\
&Y(a) = a_1 s - c_1 g_1, \quad Y'(a) = a_0 s - c_0 g_1,
\end{aligned}
\tag{11.45}
$$

depending on the parameter $s$, where $c_0$ and $c_1$ are arbitrary (user-chosen) constants satisfying

$$
a_1 c_0 - a_0 c_1 = 1.
$$

Denote the solution of (11.45) by $Y(t; s)$. Then it is a straightforward calculation using the initial conditions in (11.45) to show that

$$
a_0 Y(a; s) - a_1 Y'(a; s) = g_1
$$

for all $s$ for which $Y$ exists. This shows that $Y(t; s)$ satisfies the first boundary condition in (11.30).
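For completeness, the calculation can be written out by substituting the two initial values from (11.45) and using the normalization $a_1 c_0 - a_0 c_1 = 1$:

$$
\begin{aligned}
a_0 Y(a; s) - a_1 Y'(a; s)
&= a_0 (a_1 s - c_1 g_1) - a_1 (a_0 s - c_0 g_1)\\
&= (a_0 a_1 - a_1 a_0)\, s + (a_1 c_0 - a_0 c_1)\, g_1\\
&= g_1.
\end{aligned}
$$

The terms in $s$ cancel, which is why the first boundary condition holds for every value of the shooting parameter.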

Since $Y$ is a solution of (11.29), all that is needed for it to be a solution of the BVP (11.29)–(11.30) is to have it satisfy the remaining boundary condition at $b$. This means that $Y(t; s)$ must satisfy

$$
\varphi(s) \equiv b_0 Y(b; s) + b_1 Y'(b; s) - g_2 = 0. \tag{11.46}
$$

This is a nonlinear equation for $s$. If $s^*$ is a root of $\varphi(s)$, then $Y(t; s^*)$ will satisfy the BVP (11.29)–(11.30). It can be shown that under suitable assumptions on $f$ and its boundary conditions, equation (11.46) will have a unique solution $s^*$; see Keller [53, p. 9]. We can use a rootfinding method for nonlinear equations to solve for $s^*$. This way of finding a solution to a BVP is called a shooting method. The name comes from ballistics, in which one attempts to determine the needed initial conditions at $t = a$ in order to obtain a certain value at $t = b$.

Most rootfinding methods can be applied to solving $\varphi(s) = 0$. Each evaluation of $\varphi(s)$ involves the solution of the initial value problem (11.45) over $[a, b]$, and consequently, we want to minimize the number of such evaluations. As a specific example of an important and rapidly convergent method, we look at Newton’s method:

$$
s_{m+1} = s_m - \frac{\varphi(s_m)}{\varphi'(s_m)}, \quad m = 0, 1, \dots. \tag{11.47}
$$

To calculate $\varphi'(s)$, differentiate the definition (11.46) to obtain

$$
\varphi'(s) = b_0\, \xi_s(b) + b_1\, \xi_s'(b), \tag{11.48}
$$

where

$$
\xi_s(t) = \frac{\partial Y(t; s)}{\partial s}. \tag{11.49}
$$


To find $\xi_s(t)$, differentiate the equation

$$
Y''(t; s) = f(t, Y(t; s), Y'(t; s))
$$

with respect to $s$. Then $\xi_s$ satisfies the initial value problem

$$
\begin{aligned}
&\xi_s''(t) = f_2(t, Y(t; s), Y'(t; s))\, \xi_s(t) + f_3(t, Y(t; s), Y'(t; s))\, \xi_s'(t),\\
&\xi_s(a) = a_1, \quad \xi_s'(a) = a_0.
\end{aligned}
\tag{11.50}
$$

The functions $f_2$ and $f_3$ denote the partial derivatives of $f(t, u, v)$ with respect to $u$ and $v$, respectively. The initial values are obtained from those in (11.45) and from the definition of $\xi_s$.

In practice, we convert the problems (11.45) and (11.50) to a system of four first-order equations with the unknowns $Y$, $Y'$, $\xi_s$, and $\xi_s'$. This system is solved numerically, say, with a method of order $p$ and stepsize $h$. Let $y_h(t; s)$ denote the approximation to $Y(t; s)$, with a similar notation for the remaining unknowns. From earlier results for solving initial value problems, it can be shown that these approximate solutions will be in error by $O(h^p)$. With suitable assumptions on the original problem (11.29)–(11.30), it can then be shown that the root $s_h^*$ obtained will also be in error by $O(h^p)$, and similarly for the approximate solution $y_h(t; s_h^*)$ when compared to the solution $Y(t; s^*)$ of the boundary value problem. For details of this analysis, see Keller [53, pp. 47–54].

Example 11.4 We apply the preceding shooting method to the solution of the BVP (11.44), used earlier to illustrate the finite-difference method. The initial value problem (11.45) for the shooting method is

$$
\begin{aligned}
&Y'' = -Y + \frac{2(Y')^2}{Y}, \quad -1 < x \le 1,\\
&Y(-1) = (e + e^{-1})^{-1}, \quad Y'(-1) = s.
\end{aligned}
\tag{11.51}
$$

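The associated problem (11.50) for $\xi_s(x)$ is

$$
\begin{aligned}
&\xi_s'' = \left[ -1 - 2\left( \frac{Y'}{Y} \right)^2 \right] \xi_s + 4\,\frac{Y'}{Y}\, \xi_s',\\
&\xi_s(-1) = 0, \quad \xi_s'(-1) = 1.
\end{aligned}
\tag{11.52}
$$

The equation for $\xi_s''$ uses the solution $Y(x; s)$ of (11.51). The function $\varphi(s)$ for computing $s^*$ is given by

$$
\varphi(s) \equiv Y(1; s) - (e + e^{-1})^{-1}.
$$

For use in defining Newton’s method, we have

$$
\varphi'(s) = \xi_s(1).
$$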


Table 11.4 Shooting method for solving (11.44)

n = 2/h    s* − s*_h    Ratio    E_h        Ratio
4          4.01e-3               2.83e-2
8          1.52e-3      2.64     7.30e-3    3.88
16         4.64e-4      3.28     1.82e-3    4.01
32         1.27e-4      3.64     4.54e-4    4.01
64         3.34e-5      3.82     1.14e-4    4.00

From the true solution $Y$ of (11.44) and the condition $Y'(-1) = s$ in (11.51), the desired root $s^*$ of $\varphi(s)$ is simply

$$
s^* = Y'(-1) = \frac{e - e^{-1}}{(e + e^{-1})^2} \doteq 0.246777174.
$$

To solve the initial value problem (11.51)–(11.52), we use a second-order Runge–Kutta method, such as (5.21), with a stepsize of $h = 2/n$. The results for several values of $n$ are given in Table 11.4. The numerical solution of (11.51) is denoted by $y_h(x; s)$, and the resulting root for

$$
\varphi_h(s) \equiv y_h(1; s) - (e + e^{-1})^{-1} = 0
$$

is denoted by $s_h^*$. For the error in $y_h(x; s_h^*)$, let

$$
E_h = \max_{0 \le i \le n} |Y(x_i) - y_h(x_i; s_h^*)|,
$$

where $\{x_i\}$ are the node points used in solving the initial value problem. The columns labeled “Ratio” give the factors by which the errors decreased when $n$ was doubled (or $h$ was halved). Theoretically these factors should approach 4 since the Runge–Kutta method has an error of $O(h^2)$. Empirically, the factors approach 4.0, as expected. For the Newton iteration (11.47), $s_0 = 0.2$ was used in each case. The iteration was terminated when the test

$$
|s_{m+1} - s_m| \le 10^{-10}
$$

was satisfied. With these choices, the Newton method needed six iterations in each case, except that of $n = 4$ (when seven iterations were needed). However, if $s_0 = 0$ was used, then 25 iterations were needed for the $n = 4$ case, showing the importance of a good choice of the initial guess $s_0$.
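The procedure of this example can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the computation behind Table 11.4: the Runge–Kutta formula (5.21) is not shown in this section, so Heun's method (one common second-order Runge–Kutta scheme) is used in its place, and the function names `rhs`, `heun`, and `shoot` are hypothetical. The digits obtained may therefore differ somewhat from the table.

```python
import math

G = 1.0 / (math.e + 1.0 / math.e)   # boundary value (e + 1/e)^(-1)

def rhs(x, z):
    """First-order system for z = (Y, Y', xi, xi') built from (11.51)-(11.52)."""
    Y, dY, xi, dxi = z
    q = dY / Y
    return [dY,
            -Y + 2.0 * dY * q,
            dxi,
            (-1.0 - 2.0 * q * q) * xi + 4.0 * q * dxi]

def heun(s, n):
    """Integrate from x = -1 to x = 1 with n steps of Heun's method (assumed RK2)."""
    h = 2.0 / n
    x = -1.0
    z = [G, s, 0.0, 1.0]             # initial values of Y, Y', xi, xi'
    for _ in range(n):
        k1 = rhs(x, z)
        z_pred = [zi + h * ki for zi, ki in zip(z, k1)]
        k2 = rhs(x + h, z_pred)
        z = [zi + 0.5 * h * (p + q) for zi, p, q in zip(z, k1, k2)]
        x += h
    return z

def shoot(n, s0=0.2, tol=1e-10, max_iter=50):
    """Newton iteration (11.47) on phi_h(s) = y_h(1; s) - G."""
    s = s0
    for _ in range(max_iter):
        Y_end, _dY_end, xi_end, _dxi_end = heun(s, n)
        s_new = s - (Y_end - G) / xi_end   # phi'(s) = xi_s(1)
        if abs(s_new - s) <= tol:
            return s_new
        s = s_new
    return s
```

For instance, `shoot(64)` should return a value close to $s^* = (e - e^{-1})/(e + e^{-1})^2$, and the difference should shrink roughly like $h^2$ as `n` is increased.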

A number of problems can arise with the shooting method. First, there is no general guess $s_0$ for the Newton iteration, and with a poor choice, the iteration may diverge. For this reason, a modified Newton method may be needed to force convergence. A second problem is that the choice of $y_h(t; s)$ may be very sensitive to $h$, $s$, and other characteristics of the boundary value problem. For example, if the linearization of the initial value problem (11.45) has large positive eigenvalues, then the choice of $Y(t; s)$ is likely to be sensitive to variations in $s$. For a thorough discussion of these and other problems, see Keller [53, Chap. 2], Ascher et al. [9], or Ascher and Petzold [10, Chap. 7]. Some of these problems are more easily examined for linear BVPs, as is done in Keller [53, Chap. 2].

11.2.3 Collocation methods

To simplify the presentation, we again consider only the differential equation

Y ′′ = f(t, Y, Y ′), a < t < b. (11.53)

Further simplifying the BVP, we consider only the homogeneous boundary conditions

Y (a) = 0, Y (b) = 0. (11.54)

It is straightforward to modify the nonhomogeneous boundary conditions of (11.36) to obtain a modified BVP having homogeneous boundary conditions; see Problem 16. The collocation methods are much more general than indicated by solving (11.53)–(11.54), but the essential ideas are more easily understood in this context.

We assume that the solution $Y(t)$ of (11.53)–(11.54) is approximable by a linear combination of $n$ given functions $\psi_1(t), \dots, \psi_n(t)$,

$$
Y(t) \approx y_n(t) = \sum_{j=1}^{n} c_j \psi_j(t), \quad a \le t \le b. \tag{11.55}
$$

The functions $\psi_j(t)$ are all assumed to satisfy the boundary conditions

ψj(a) = ψj(b) = 0, j = 1, . . . , n, (11.56)

and thus any linear combination (11.55) will also satisfy the boundary conditions. The coefficients $c_1, \dots, c_n$ are determined by requiring the differential equation (11.53) to be satisfied exactly at $n$ preselected points in $(a, b)$,

$$
y_n''(\xi_i) = f(\xi_i, y_n(\xi_i), y_n'(\xi_i)), \quad i = 1, \dots, n, \tag{11.57}
$$

with given points

$$
a < \xi_1 < \xi_2 < \cdots < \xi_n < b. \tag{11.58}
$$

The procedure of defining $y_n(t)$ implicitly through (11.57) is known as collocation, and the points $\{\xi_i\}$ are called collocation points.

Substituting from (11.55) into (11.57), we obtain

$$
\sum_{j=1}^{n} c_j \psi_j''(\xi_i) = f\left( \xi_i,\ \sum_{j=1}^{n} c_j \psi_j(\xi_i),\ \sum_{j=1}^{n} c_j \psi_j'(\xi_i) \right) \tag{11.59}
$$

for $i = 1, \dots, n$. This is a system of $n$ nonlinear equations in the $n$ unknowns $c_1, \dots, c_n$. In general, this system must be solved numerically, as is done with the finite-difference approximation (11.37) discussed earlier in Section 11.2.1.

In choosing a collocation method, we must do the following.

1. Choose the family of approximating functions $\{\psi_1(t), \dots, \psi_n(t)\}$, including the requirement (11.56) for the endpoint boundary conditions.

2. Choose the collocation node points $\{\xi_i\}$ of (11.58).

3. Choose a way to solve the nonlinear system (11.59). Included in this is choosing an initial guess for the method of solving the nonlinear system, and this may be difficult to find.

For a general survey of this area, see the text by Ascher et al. [9]; for collocation software, see [6], [7].

We describe briefly a particular collocation method that has been implemented as a high quality computer code. Let $m > 0$, $h = (b - a)/m$, and define breakpoints $\{t_j\}$ by

tj = a+ jh, j = 0, 1, . . . ,m.

Consider all functions $p(t)$ that satisfy the following conditions:

• $p(t)$ is continuously differentiable for $a \le t \le b$.

• $p(a) = p(b) = 0$.

• On each subinterval $[t_{j-1}, t_j]$, $p(t)$ is a polynomial of degree $\le 3$.

We use these functions as our approximations $y_n(t)$ in (11.57). There are a number of ways to write $y_n(t)$ in the form of (11.55), with $n = 2m$. A good way to choose the functions $\{\psi_j(t)\}$ is to use the standard basis functions for cubic Hermite interpolation on each subinterval $[t_{j-1}, t_j]$; see [11, p. 162].

For the collocation points, let $\rho_1 = -1/\sqrt{3}$, $\rho_2 = 1/\sqrt{3}$, which are the zeros of the Legendre polynomial of degree 2 on $[-1, 1]$. Using these, define

$$
\xi_{i,j} = \tfrac{1}{2}(t_{i-1} + t_i) + \tfrac{1}{2} h \rho_j, \quad j = 1, 2, \quad i = 1, \dots, m.
$$

This defines $n = 2m$ points $\xi_{i,j}$, and these will be the collocation points used in (11.57).
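The construction of the collocation points is simple enough to check directly. The short sketch below (the function name `gauss_collocation_points` is hypothetical) builds the $2m$ points from the breakpoints $t_j = a + jh$ and returns them in increasing order.

```python
import math

def gauss_collocation_points(a, b, m):
    """Return the 2m points xi_{i,j} = (t_{i-1} + t_i)/2 + (h/2) rho_j."""
    h = (b - a) / m
    rho = (-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0))  # zeros of the degree-2 Legendre polynomial
    points = []
    for i in range(1, m + 1):
        midpoint = 0.5 * ((a + (i - 1) * h) + (a + i * h))
        for r in rho:
            points.append(midpoint + 0.5 * h * r)
    return points
```

Because $|\rho_j| < 1$, each pair of points lies strictly inside its subinterval, so no collocation point coincides with a breakpoint or with an endpoint of $[a, b]$.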

With this choice for $y_n(t)$ and $\{\xi_{i,j}\}$, and assuming sufficient differentiability and stability in the solvability of the BVP (11.53)–(11.54), it can be shown that $y_n(t)$ satisfies the following:

$$
\max_{a \le t \le b} |Y(t) - y_n(t)| = O(h^4).
$$

An extensive discussion and generalizations of this method are given in [9].


11.2.4 Other methods and problems

Yet another approach to solving a boundary value problem is to solve an equivalent reformulation as an integral equation. There is much less development of such numerical methods, although they can be very effective in some situations. For an introduction to this approach, see Keller [53, Chap. 4].

There are also many other types of boundary value problems, some containing certain types of singular behavior, that we have not discussed here. An excellent general reference is the book by Ascher, Mattheij, and Russell [9]. In addition, see the research papers in the proceedings of Ascher and Russell [8], Aziz [13], Childs et al. [28], and Gladwell and Sayers [41]; see also Keller [52, Chap. 4] for singular problems. For discussions of software, see Childs et al. [28], Gladwell and Sayers [41], and Enright [35].

PROBLEMS

1. In general, the study of existence and uniqueness of a solution for boundary value problems is more complicated. Consider the boundary value problem

$$
\begin{cases}
Y''(t) = 0, & 0 < t < 1,\\
Y'(0) = g_1, \quad Y'(1) = g_2.
\end{cases}
$$

Show that the problem has no solution if $g_1 \ne g_2$, and infinitely many solutions when $g_1 = g_2$.

Hint: For the case $g_1 \ne g_2$, integrate the differential equation over $[0, 1]$.

2. As another example of solution non-uniqueness, verify that for any constant $c$, $Y(t) = c \sin(t)$ solves the boundary value problem

$$
\begin{cases}
Y''(t) + Y(t) = 0, & 0 < t < \pi,\\
Y(0) = Y(\pi) = 0.
\end{cases}
$$

3. Verify that any function of the form $Y(t) = c_1 e^{t} + c_2 e^{-t}$ satisfies the equation

$$
Y''(t) - Y(t) = 0.
$$

Determine $c_1$ and $c_2$ for the function $Y(t)$ to satisfy the following boundary conditions:

(a) Y (0) = 1, Y (1) = 0.

(b) Y (0) = 1, Y ′(1) = 0.

(c) Y ′(0) = 1, Y (1) = 0.

(d) Y ′(0) = 1, Y ′(1) = 0.

4. Assume that $Y$ is 3 times continuously differentiable. Use Taylor’s theorem to prove the formula (11.26).


5. Prove the formula (11.18) by using the asymptotic expansion (11.16).

6. Use the asymptotic error formula (11.16) with $D(t)$ twice continuously differentiable to show

$$
Y''(t_i) - \frac{1}{h^2}\left[ y_h(t_{i+1}) - 2 y_h(t_i) + y_h(t_{i-1}) \right] = O(h^2), \quad 1 \le i \le N-1.
$$

In other words, the second-order centered divided difference of the numerical solution is a second-order approximation of the second derivative of the true solution at any interior node point.

7. Verify that any function of the form $Y(t) = c_1 \sqrt{t} + c_2 t^4$ satisfies the equation

$$
t^2 Y''(t) - \tfrac{7}{2}\, t\, Y'(t) + 2 Y(t) = 0.
$$

Determine the solution of the equation with the boundary conditions

$$
Y(1) = 1, \quad Y(4) = 2.
$$

Use the MATLAB program ODEBVP to solve the boundary value problem for $h = 0.1$, $0.05$, $0.025$, and print the errors of the numerical solutions at $t = 1.2$, $1.4$, $1.6$, $1.8$. Comment on how errors decrease when $h$ is halved. Do the same for the extrapolated solutions.

8. The general solution of the equation

$$
t^2 Y'' - t(t + 2) Y' + (t + 2) Y = 0
$$

is $Y(t) = c_1 t + c_2 t e^{t}$. Determine the solution of the equation with the boundary conditions

$$
Y(1) = e, \quad Y(2) = 2e^2.
$$

Use the MATLAB program ODEBVP to solve the boundary value problem for $h = 0.1$, $0.05$, $0.025$, and print the errors of the numerical solutions at $t = 1.2$, $1.4$, $1.6$, and $1.8$. Comment on how errors decrease when $h$ is halved. Do the same for the extrapolated solutions.

9. The general solution of the equation

$$
t\, Y'' - (2t + 1) Y' + (t + 1) Y = 0
$$

is $Y(t) = c_1 e^{t} + c_2 t^2 e^{t}$. Find the solution of the equation with the boundary conditions

$$
Y'(1) = 0, \quad Y(2) = e^2.
$$

Write down a formula for a discrete approximation of the boundary condition $Y'(1) = 0$ similar to (11.27), which has an accuracy $O(h^2)$. Implement the method by modifying the program ODEBVP, and solve the problem with $h = 0.1$, $0.05$, $0.025$. Print the errors of the numerical solutions at $t = 1$, $1.2$, $1.4$, $1.6$, $1.8$, and comment on how errors decrease when $h$ is halved. Do the same for the extrapolated solutions.

10. Consider the boundary value problem (11.2) with $p$, $q$, and $r$ constant. Modify the MATLAB program so that the command feval does not appear. Use the modified program to solve the following boundary value problems.

(a) $Y'' = -Y$, $0 < t < \tfrac{\pi}{2}$,
$Y(0) = Y\left(\tfrac{1}{2}\pi\right) = 1$.
The true solution is $Y(t) = \sin t + \cos t$.

(b) $Y'' + Y = \sin t$, $0 < t < \tfrac{\pi}{2}$,
$Y(0) = Y\left(\tfrac{1}{2}\pi\right) = 0$.
The true solution is $Y(t) = -\tfrac{1}{2}\, t \cos t$.

11. Give a second-order scheme for the following boundary value problem.

Y ′′ = sin (tY ′) + 1, 0 < t < 1,

Y (0) = 0, Y (1) = 1.

12. Consider modifying the material of Section 11.1 to solve the BVP

$$
\begin{aligned}
&Y''(t) = p(t) Y'(t) + q(t) Y(t) + r(t), \quad a < t < b,\\
&Y(a) = g_1, \quad Y'(b) + k\, Y(b) = g_2.
\end{aligned}
$$

Do so with the first-order approximation given in (11.25). Give the analogs of the results (11.8)–(11.14).

13. Continuing with the preceding problem, modify ODEBVP to handle this new boundary condition. Apply it to the boundary value problem

$$
\begin{aligned}
&Y'' = -\frac{2t}{1 + t^2}\, Y' + Y + \frac{2}{1 + t^2} - \log(1 + t^2), \quad 0 < t < 1,\\
&Y(0) = 0, \quad Y'(1) + Y(1) = 1 + \log(2).
\end{aligned}
$$

The true solution is $Y(t) = \log(1 + t^2)$, just as with the earlier example (11.19). Repeat the calculations leading to Table 11.1. Check the assertion on the order of convergence given in Section 11.1.3 in the sentence containing (11.25).

14. Consider showing that the tridiagonal matrix $T$ of (11.40) is nonsingular. For simplicity, denote its order by $m \times m$. To show that $T$ is nonsingular, it is sufficient to show that the only solution $x \in \mathbb{R}^m$ of the homogeneous linear system $Tx = 0$ is the zero solution $x = 0$. Let $c = \max_{1 \le j \le m} |x_j|$. We want to show $c = 0$. Begin by assuming the contrary, namely that $c > 0$. Write the individual equations in the system $Tx = 0$. In particular, consider an equation corresponding to a component of $x$ that has magnitude $c$ (of which there must be at least one), and denote its index by $k$. Assume initially that $1 < k < m$. Show from equation $k$ that $x_{k+1}$ and $x_{k-1}$ must also have magnitude $c$. By induction, show that all components must have magnitude $c$; and then show from the first or last equation that this leads to a contradiction.

15. For each of the following BVPs for a second-order differential equation, consider converting it to an equivalent BVP for a system of first-order equations, as in (11.35). What are the matrices $A$ and $B$ of (11.35)?

(a) The linear BVP (11.2)–(11.3).

(b) The nonlinear BVP of (11.44).

(c) The nonlinear BVP (11.29)–(11.30).

(d) The following system of second-order equations: for $0 < t < 1$,

$$
m\, x''(t) = \frac{c\, x(t)}{\left( x(t)^2 + y(t)^2 \right)^{3/2}}, \qquad
m\, y''(t) = \frac{c\, y(t)}{\left( x(t)^2 + y(t)^2 \right)^{3/2}},
$$

with the boundary conditions

$$
x(0) = x(1), \quad y(0) = y(1), \quad x'(0) = x'(1), \quad y'(0) = y'(1).
$$

16. Consider converting nonzero boundary conditions to zero boundary conditions.

(a) Consider the two-point boundary value problem (11.36). To convert this to an equivalent problem with zero boundary conditions, write $Y(x) = z(x) + w(x)$ with $w(x)$ a straight line satisfying the following boundary conditions: $w(a) = g_1$, $w(b) = g_2$. Derive a new boundary value problem for $z(x)$.

(b) Generalize this procedure to problem (11.29). Obtain a new problem with zero boundary conditions. What assumptions, if any, are needed for the coefficients $a_0$, $a_1$, $b_0$, and $b_1$?

17. Using the shooting method of Subsection 11.2.2, solve the following boundary value problems. Study the convergence rate as $h$ is varied.

(a) $Y'' = -\dfrac{2}{x}\, Y Y'$, $1 < x < 2$; $Y(1) = \tfrac{1}{2}$, $Y(2) = \tfrac{2}{3}$.
True solution: $Y(x) = x/(1 + x)$.

(b) $Y'' = 2 Y Y'$, $0 < x < \tfrac{1}{4}\pi$; $Y(0) = 0$, $Y\left(\tfrac{1}{4}\pi\right) = 1$.
True solution: $Y(x) = \tan(x)$.

CHAPTER 12

VOLTERRA INTEGRAL EQUATIONS

In earlier chapters the initial value problem

$$
\begin{aligned}
&Y'(s) = f(s, Y(s)), \quad t_0 \le s \le b,\\
&Y(t_0) = Y_0
\end{aligned}
$$

was reformulated using integration. In particular, by integrating over the interval $[t_0, t]$, we obtain

$$
Y(t) = Y_0 + \int_{t_0}^{t} f(s, Y(s))\, ds, \quad t_0 \le t \le b.
$$

This is an integral equation of Volterra type. Motivated in part by this reformulation, we consider now the integral equation

$$
Y(t) = g(t) + \int_0^t K(t, s, Y(s))\, ds, \quad 0 \le t \le T. \tag{12.1}
$$

In this equation, the functions $K(t, s, u)$ and $g(t)$ are given; the function $Y(t)$ is unknown and is to be determined on the interval $0 \le t \le T$. This equation is called a Volterra integral equation of the second kind. Such integral equations occur in a variety of physical applications, and few of them can be reformulated easily as differential equation initial value problems. However, the numerical methods for such equations are linked to those for the initial value problem, and we consider such methods in this chapter.

12.1 SOLVABILITY THEORY

We begin by discussing some of the theory behind such equations, beginning with the linear equation

$$
Y(t) = g(t) + \int_0^t K(t, s)\, Y(s)\, ds, \quad 0 \le t \le T. \tag{12.2}
$$

The function $K(t, s)$ is called the “kernel function” of the integral operator, or simply the “kernel”. An important theoretical tool for studying this equation is the use of “successive approximations” or “Picard iteration”.

As an initial estimate of the solution, choose $Y_0(t) \equiv g(t)$. Then define a sequence of iterates $\{Y_\ell(t)\}$ by

$$
Y_{\ell+1}(t) = g(t) + \int_0^t K(t, s)\, Y_\ell(s)\, ds, \quad 0 \le t \le T,
$$

for $\ell = 0, 1, \dots$. To develop some intuition, we calculate $Y_2(t)$:

$$
\begin{aligned}
Y_2(t) &= g(t) + \int_0^t K(t, s)\, Y_1(s)\, ds\\
&= g(t) + \int_0^t K(t, s) \left[ g(s) + \int_0^s K(s, v)\, g(v)\, dv \right] ds\\
&= g(t) + \int_0^t K(t, s)\, g(s)\, ds + \int_0^t K(t, s) \int_0^s K(s, v)\, g(v)\, dv\, ds.
\end{aligned}
\tag{12.3}
$$

We then introduce a change in the order of integration,

$$
\int_0^t \int_0^s K(t, s)\, K(s, v)\, g(v)\, dv\, ds = \int_0^t g(v) \int_v^t K(t, s)\, K(s, v)\, ds\, dv, \tag{12.4}
$$

and define

$$
K_2(t, v) = \int_v^t K(t, s)\, K(s, v)\, ds, \quad 0 \le v \le t \le T.
$$


Then (12.3) becomes

$$
Y_2(t) = g(t) + \int_0^t K(t, s)\, g(s)\, ds + \int_0^t K_2(t, v)\, g(v)\, dv.
$$

This can be continued inductively to give

$$
Y_\ell(t) = g(t) + \sum_{j=1}^{\ell} \int_0^t K_j(t, s)\, g(s)\, ds \tag{12.5}
$$

for $\ell = 1, 2, \dots$. The kernel functions $K_j$ are defined by

$$
\begin{aligned}
K_1(t, s) &= K(t, s),\\
K_j(t, s) &= \int_s^t K(t, u)\, K_{j-1}(u, s)\, du, \quad j = 2, 3, \dots.
\end{aligned}
\tag{12.6}
$$

Much of the theory of solvability of the integral equation (12.2) can be developed by looking at the limit of (12.5) as $\ell \to \infty$. This, in turn, requires an examination of the kernel functions $\{K_j(t, s)\}_{j=1}^{\infty}$. Doing so yields the following theorem.
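As a small worked illustration (chosen here for simplicity, not taken from the text), let $K(t, s) \equiv 1$ and $g(t) \equiv 1$. The recursion (12.6) gives

$$
K_1(t, s) = 1, \qquad
K_2(t, s) = \int_s^t du = t - s, \qquad
K_3(t, s) = \int_s^t (u - s)\, du = \frac{(t - s)^2}{2},
$$

and in general $K_j(t, s) = (t - s)^{j-1}/(j-1)!$. Substituting into (12.5),

$$
Y_\ell(t) = 1 + \sum_{j=1}^{\ell} \int_0^t \frac{(t - s)^{j-1}}{(j-1)!}\, ds = \sum_{j=0}^{\ell} \frac{t^j}{j!},
$$

so the Picard iterates are the Taylor polynomials of $e^t$, and the limit $Y(t) = e^t$ is indeed the solution of $Y(t) = 1 + \int_0^t Y(s)\, ds$. The factorial decay of the iterated kernels seen here is what makes the series converge for every $t$.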

Theorem 12.1 Assume that $K(t, s)$ is continuous for $0 \le s \le t \le T$, and that $g(t)$ is continuous on $[0, T]$. Then (12.2) has a unique continuous solution $Y(t)$ on $[0, T]$, and

$$
|Y(t)| \le e^{Bt} \max_{0 \le s \le t} |g(s)|, \tag{12.7}
$$

where $B = \max_{0 \le s \le t \le T} |K(t, s)|$.

Some details of the proof are taken up in the problems.

A related approach can be used to prove the following theorem for the fully nonlinear equation (12.1). The Picard iteration is now

$$
Y_{\ell+1}(t) = g(t) + \int_0^t K(t, s, Y_\ell(s))\, ds, \quad 0 \le t \le T,
$$

for $\ell = 0, 1, \dots$.

Theorem 12.2 Assume that the function $K(t, s, u)$ satisfies the following two conditions:

(a) $K(t, s, u)$ is continuous for $0 \le s \le t \le T$ and $-\infty < u < \infty$.

(b) $K(t, s, u)$ satisfies a Lipschitz condition,

$$
|K(t, s, u_1) - K(t, s, u_2)| \le c\, |u_1 - u_2|, \quad 0 \le s \le t \le T,
$$

for all $-\infty < u_1, u_2 < \infty$, with some $c > 0$.

Assume further that $g(t)$ is continuous on $[0, T]$. Then equation (12.1) has a unique continuous solution $Y(t)$ on the interval $[0, T]$. In addition,

$$
|Y(t)| \le e^{ct} \max_{0 \le s \le t} |g(s)|. \tag{12.8}
$$


For a proof, see Linz [59, Chap. 4].

As with differential equations, it is important to examine the stability of the solution $Y(t)$ with respect to changes in the data of the equation, $K$ and $g$. We consider only the perturbation of the linear equation (12.2) by changing $g(t)$ to $g(t) + \varepsilon(t)$. Let $Y(t; \varepsilon)$ denote the solution of the perturbed equation,

$$
Y(t; \varepsilon) = g(t) + \varepsilon(t) + \int_0^t K(t, s)\, Y(s; \varepsilon)\, ds, \quad 0 \le t \le T. \tag{12.9}
$$

Subtracting (12.2), we have

$$
Y(t; \varepsilon) - Y(t) = \varepsilon(t) + \int_0^t K(t, s) \left[ Y(s; \varepsilon) - Y(s) \right] ds, \quad 0 \le t \le T. \tag{12.10}
$$

Applying (12.7) from Theorem 12.1, we have

$$
|Y(t; \varepsilon) - Y(t)| \le e^{Bt} \max_{0 \le s \le t} |\varepsilon(s)|. \tag{12.11}
$$

This shows stability of the solution with respect to perturbations in the function $g$ in (12.2). This is a conservative estimate because the multiplying factor $e^{Bt}$ increases very rapidly with $t$. The analysis of stability can be improved by examining (12.10) in greater detail, just as was done for differential equations in (1.16) of Section 1.2. We can also generalize these results to the nonlinear equation (12.1); see [59], [64].

12.1.1 Special equations

A model equation for studying the numerical solution of (12.1) is the simple linear equation

$$
Y(t) = g(t) + \lambda \int_0^t Y(s)\, ds, \quad t \ge 0. \tag{12.12}
$$

This can be reformulated as the initial value problem

Y ′(t) = λY (t) + g′(t), t ≥ 0, (12.13)

Y (0) = g(0),

which is the model equation used in earlier chapters for studying numerical methods for solving the initial value problem for ordinary differential equations. Using the solution of this simple linear initial value problem leads to

$$
Y(t) = g(t) + \lambda \int_0^t e^{\lambda (t - s)}\, g(s)\, ds, \quad t \ge 0. \tag{12.14}
$$
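The step from (12.13) to (12.14) can be sketched as follows, assuming $g$ is continuously differentiable. Solving the linear initial value problem (12.13) with the integrating factor $e^{-\lambda t}$ gives

$$
Y(t) = e^{\lambda t} g(0) + \int_0^t e^{\lambda (t - s)}\, g'(s)\, ds.
$$

Integrating by parts,

$$
\int_0^t e^{\lambda (t - s)}\, g'(s)\, ds
= \Big[ e^{\lambda (t - s)} g(s) \Big]_{s=0}^{s=t} + \lambda \int_0^t e^{\lambda (t - s)}\, g(s)\, ds
= g(t) - e^{\lambda t} g(0) + \lambda \int_0^t e^{\lambda (t - s)}\, g(s)\, ds,
$$

and the two terms $e^{\lambda t} g(0)$ cancel, leaving exactly (12.14).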

Recall from (1.20) of Section 1.2 that, usually, (12.13) is considered stable for $\lambda < 0$ and is considered unstable for $\lambda > 0$. Thus the same is true of the Volterra equation (12.12).


As another model Volterra integral equation, consider

$$
Y(t) = g(t) + \lambda \int_0^t e^{\beta (t - s)}\, Y(s)\, ds, \quad t \ge 0. \tag{12.15}
$$

This can be reduced to the form of (12.12), and this leads to the solution

$$
Y(t) = g(t) + \lambda \int_0^t e^{(\lambda + \beta)(t - s)}\, g(s)\, ds, \quad t \ge 0. \tag{12.16}
$$

Equations of the form

$$
Y(t) = g(t) + \lambda \int_0^t K(t - s)\, Y(s)\, ds, \quad t \ge 0 \tag{12.17}
$$

are said to be of “convolution type”, and the Laplace transform can often be used to obtain a solution. Discussion of the Laplace transform and its application in solving differential equations can be found in most undergraduate textbooks on ordinary differential equations; for example, see [16]. Let $\widehat{K}(\tau)$ denote the Laplace transform of $K(t)$, and let $L(t; \lambda)$ denote the inverse Laplace transform of

$$
\frac{\widehat{K}(\tau)}{1 - \lambda \widehat{K}(\tau)}.
$$

The solution of (12.17) is given by

$$
Y(t) = g(t) + \lambda \int_0^t L(t - s; \lambda)\, g(s)\, ds, \quad t \ge 0. \tag{12.18}
$$

Both (12.12) and (12.15) are special cases of (12.17).
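As a consistency check, write $\widehat{K}(\tau)$ for the Laplace transform of $K(t)$ (a hat notation assumed here). For (12.12) the kernel is $K(t) \equiv 1$, so $\widehat{K}(\tau) = 1/\tau$ and

$$
\frac{\widehat{K}(\tau)}{1 - \lambda \widehat{K}(\tau)} = \frac{1/\tau}{1 - \lambda/\tau} = \frac{1}{\tau - \lambda},
\qquad L(t; \lambda) = e^{\lambda t},
$$

which turns (12.18) into (12.14). For (12.15) the kernel is $K(t) = e^{\beta t}$, so $\widehat{K}(\tau) = 1/(\tau - \beta)$ and

$$
\frac{\widehat{K}(\tau)}{1 - \lambda \widehat{K}(\tau)} = \frac{1}{\tau - \beta - \lambda},
\qquad L(t; \lambda) = e^{(\lambda + \beta) t},
$$

which reproduces (12.16).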

12.2 NUMERICAL METHODS

Numerical methods for solving the Volterra integral equation

$$
Y(t) = g(t) + \int_0^t K(t, s, Y(s))\, ds, \quad 0 \le t \le T \tag{12.19}
$$

are similar to numerical methods for the initial value problem for ordinary differential equations. A set of grid points $\{t_i : i = 0, 1, \dots\}$ is chosen, and an approximation to $\{Y(t_i) : i = 0, 1, \dots\}$ is computed in a step-by-step procedure. For simplicity, we use an equally spaced grid,

$$
t_i = ih, \quad i = 0, 1, \dots, N_h,
$$

where $h N_h \le T$ and $h (N_h + 1) > T$. To aid in developing some intuition for this topic, we begin with an important special case, the trapezoidal method. Later a general scheme is given for the numerical approximation of (12.19). As with numerical methods for ordinary differential equations, let $y_n$ denote an approximation of $Y(t_n)$. From (12.19), take $y_0 = Y(0) = g(0)$.


12.2.1 The trapezoidal method

For $n > 0$, write

$$
Y(t_n) = g(t_n) + \int_0^{t_n} K(t_n, s, Y(s))\, ds.
$$

Using the trapezoidal numerical integration rule, we obtain

$$
\int_0^{t_n} K(t_n, s, Y(s))\, ds \approx h \sum_{j=0}^{n}{}'' \, K(t_n, t_j, Y(t_j)). \tag{12.20}
$$

In this formula, the double-prime superscript indicates that the first and last terms should be halved before being summed. Using this approximation leads to the numerical formula

$$
\begin{aligned}
Y(t_n) &\approx g(t_n) + h \sum_{j=0}^{n}{}'' \, K(t_n, t_j, Y(t_j)),\\
y_n &= g(t_n) + h \sum_{j=0}^{n}{}'' \, K(t_n, t_j, y_j), \quad n = 1, 2, \dots, N_h.
\end{aligned}
\tag{12.21}
$$

This equation defines $y_n$ implicitly, as earlier with the trapezoidal rule (4.22) of Section 4.2 for the initial value problem. Also, as before, when $h$ is sufficiently small, this can be solved for $y_n$ by simple fixed point iteration,

$$
y_n^{(k+1)} = g(t_n) + \frac{h}{2} K(t_n, t_0, y_0) + h \sum_{j=1}^{n-1} K(t_n, t_j, y_j) + \frac{h}{2} K\left(t_n, t_n, y_n^{(k)}\right), \quad k = 0, 1, \dots, \tag{12.22}
$$

with some given $y_n^{(0)}$. Newton’s method and other rootfinding methods can also be used. A MATLAB® program implementing (12.21)–(12.22) is given at the end of the section.

Example 12.3 Consider solving the equation

$$Y(t) = \cos t - \int_0^t Y(s)\, ds, \qquad t \ge 0 \qquad (12.23)$$

with the true solution

$$Y(t) = \tfrac{1}{2}\left(\cos t - \sin t + e^{-t}\right), \qquad t \ge 0.$$

Equation (12.23) is the model equation (12.12) with $\lambda = -1$ and $g(t) = \cos t$. Numerical results for the use of (12.21) are shown in Table 12.1 for varying stepsizes $h$. The ratios of successive errors are close to 4, so the error at each value of $t$ is of size $O(h^2)$.

Table 12.1 Numerical results for solving (12.23) using the trapezoidal method (12.21)

  t     Error, h = 0.2   Ratio   Error, h = 0.1   Ratio   Error, h = 0.05
 0.8       1.85e-4        4.03      4.66e-5        4.01       1.17e-5
 1.6       9.22e-4        4.03      2.31e-4        4.01       5.77e-5
 2.4       1.74e-3        4.03      4.36e-4        4.01       1.09e-4
 3.2       1.95e-3        4.03      4.88e-4        4.01       1.22e-4
 4.0       1.25e-3        4.04      3.11e-4        4.01       7.76e-5
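The convergence behaviour reported for the trapezoidal method on (12.23) can be reproduced with a few lines of MATLAB. The following is a minimal sketch, not the program given at the end of the section: because the kernel $K(t,s,u) = -u$ is linear in $u$, it solves the implicit equation (12.21) for $y_n$ directly instead of iterating. The variable names (`hs`, `err`, `ratios`, `known`) are illustrative choices.

```matlab
% Trapezoidal method (12.21) for Y(t) = cos(t) - int_0^t Y(s) ds on [0,4].
% The kernel is linear in Y, so y_n is solved for directly (no iteration).
Ytrue = @(t) 0.5*(cos(t) - sin(t) + exp(-t));   % true solution
hs  = [0.2 0.1 0.05];                           % stepsizes to compare
err = zeros(size(hs));                          % error at t = 4 for each h
for m = 1:numel(hs)
    h = hs(m);
    N = round(4/h);
    t = (0:N)*h;
    y = zeros(1, N+1);
    y(1) = cos(0);                              % y_0 = g(0)
    for n = 1:N
        % h * (half of the first term + interior terms); all are known.
        known = h*(y(1)/2 + sum(y(2:n)));
        % y_n = g(t_n) - known - (h/2)*y_n, solved for y_n.
        y(n+1) = (cos(t(n+1)) - known)/(1 + h/2);
    end
    err(m) = Ytrue(4) - y(end);
end
ratios = err(1:end-1)./err(2:end);              % should be near 4 for O(h^2)
fprintf('errors at t = 4: %10.2e %10.2e %10.2e\n', err);
fprintf('ratios:          %10.2f %10.2f\n', ratios);
```

Halving $h$ should reduce the error by a factor close to 4, which is the signature of second-order convergence.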

12.2.2 Error for the trapezoidal method

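To build some intuition for the behaviour of (12.21), we consider first the linear case (12.2),

$$y_n = g(t_n) + h \sideset{}{''}\sum_{j=0}^{n} K(t_n, t_j)\, y_j, \qquad n = 1, 2, \dots, N_h. \qquad (12.24)$$

Rewrite the original equation (12.2) using the trapezoidal numerical integration rule with its error formula,

$$Y(t_n) = g(t_n) + h \sideset{}{''}\sum_{j=0}^{n} K(t_n, t_j)\, Y(t_j) + Q_h(t_n), \qquad (12.25)$$

for $n = 1, 2, \dots, N_h$. The error term can be written in various forms:

$$Q_h(t_n) = -\sum_{j=1}^{n} \frac{h^3}{12} \left. \frac{\partial^2}{\partial s^2}\bigl[K(t_n, s)\, Y(s)\bigr] \right|_{s=\tau_{n,j}} \qquad (12.26)$$

$$\phantom{Q_h(t_n)} = -\frac{h^2 t_n}{12} \left. \frac{\partial^2}{\partial s^2}\bigl[K(t_n, s)\, Y(s)\bigr] \right|_{s=\tau_n} \qquad (12.27)$$

$$\phantom{Q_h(t_n)} \approx -\frac{h^2}{12} \left. \frac{\partial}{\partial s}\bigl[K(t_n, s)\, Y(s)\bigr] \right|_{s=0}^{s=t_n}. \qquad (12.28)$$

In (12.26), $\tau_{n,j}$ is some unknown point in $[t_{j-1}, t_j]$; and in (12.27), $\tau_n$ is an unknown point in $[0, t_n]$. These are standard error formulas for the trapezoidal quadrature rule; e.g., see [12, §5.2]. Subtract (12.24) from (12.25), obtaining

$$E_h(t_n) = h \sideset{}{''}\sum_{j=0}^{n} K(t_n, t_j)\, E_h(t_j) + Q_h(t_n) \qquad (12.29)$$

in which $E_h(t_n) = Y(t_n) - y_n$.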

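Example 12.4 As a simple particular case of (12.24), choose $K(t, s) \equiv \lambda$ and $Y(s) = s^2$. We are solving the equation (12.12) with a suitable choice of $g(t)$.

Using (12.27) with $K(t_n, s)\, Y(s) = \lambda Y(s)$ and noting that $E_h(t_0) = E_h(0) = 0$, (12.29) becomes

$$E_h(t_n) = \sum_{j=1}^{n-1} h\lambda E_h(t_j) + \frac{h\lambda}{2} E_h(t_n) - \frac{h^2 t_n}{12}\, \lambda Y''(\tau_n).$$

Because $Y''(s) \equiv 2$, this simplifies further to

$$E_h(t_n) = \sum_{j=1}^{n-1} h\lambda E_h(t_j) + \tfrac{1}{2} h\lambda E_h(t_n) - \tfrac{1}{6} \lambda h^2 t_n, \qquad (12.30)$$

for $n = 1, \dots, N_h$. This complicated expression can be solved explicitly. Write the same formula with $n - 1$ replacing $n$, and then subtract it from (12.30). This yields

$$E_h(t_n) - E_h(t_{n-1}) = h\lambda E_h(t_{n-1}) + \tfrac{1}{2} h\lambda E_h(t_n) - \tfrac{1}{2} h\lambda E_h(t_{n-1}) - \tfrac{1}{6} \lambda h^2 (t_n - t_{n-1}).$$

Solving for $E_h(t_n)$ and using $t_n - t_{n-1} = h$, we obtain

$$E_h(t_n) = \left(\frac{1 + \frac{1}{2} h\lambda}{1 - \frac{1}{2} h\lambda}\right) E_h(t_{n-1}) - \frac{1}{1 - \frac{1}{2} h\lambda}\, \frac{\lambda h^3}{6}, \qquad n \ge 1.$$

Using induction, this has the solution

$$E_h(t_n) = \left(\frac{1 + \frac{1}{2} h\lambda}{1 - \frac{1}{2} h\lambda}\right)^{n} E_h(t_0) - \sum_{j=0}^{n-1} \left(\frac{1 + \frac{1}{2} h\lambda}{1 - \frac{1}{2} h\lambda}\right)^{j} \frac{1}{1 - \frac{1}{2} h\lambda}\, \frac{\lambda h^3}{6}. \qquad (12.31)$$

The first term equals zero since $E_h(t_0) = 0$; and the second term involves a geometric series which sums to

$$\frac{\left(\dfrac{1 + \frac{1}{2} h\lambda}{1 - \frac{1}{2} h\lambda}\right)^{n} - 1}{\left(\dfrac{1 + \frac{1}{2} h\lambda}{1 - \frac{1}{2} h\lambda}\right) - 1} = \frac{2 - h\lambda}{2h\lambda} \left\{ \left[1 + \frac{h\lambda}{1 - \frac{1}{2} h\lambda}\right]^{n} - 1 \right\}.$$

Using this in (12.31),

$$E_h(t_n) = -\frac{h^2}{6} \left\{ \left[1 + \frac{h\lambda}{1 - \frac{1}{2} h\lambda}\right]^{n} - 1 \right\}.$$

For a fixed $t = t_n = nh$, as $h \to 0$, this can be manipulated to obtain the asymptotic formula

$$E_h(t_n) \approx -\frac{h^2}{6} \left(e^{\lambda t_n} - 1\right).$$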
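The manipulation behind this asymptotic formula is short. The following worked step is a sketch that uses only the standard expansion $\log\frac{1+x/2}{1-x/2} = x + O(x^3)$; it shows why the bracketed factor tends to $e^{\lambda t_n}$ when $t_n = nh$ is held fixed:

```latex
\[
\left[1 + \frac{h\lambda}{1-\tfrac12 h\lambda}\right]^{n}
= \left(\frac{1+\tfrac12 h\lambda}{1-\tfrac12 h\lambda}\right)^{n}
= \exp\!\Bigl( n \bigl[\, h\lambda + O(h^{3}) \bigr] \Bigr)
= \exp\!\bigl( \lambda t_n + O(h^{2}) \bigr)
= e^{\lambda t_n}\bigl(1 + O(h^{2})\bigr).
\]
```

Replacing the bracketed power by $e^{\lambda t_n}$ therefore changes the error expression only by a relative amount of size $O(h^2)$.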


For this special case, the numerical solution of (12.12) using the trapezoidal method has an error of size $O(h^2)$. This is of the same order in $h$ as the discretization error for the trapezoidal rule approximation in (12.20). Although this result has been shown for only a special solution, it turns out to be true in general for the trapezoidal method of (12.21). This is discussed in greater detail in Section 12.3, including a general convergence theorem that includes the trapezoidal rule being applied to the fully nonlinear equation (12.19).
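The error analysis of Example 12.4 can be checked numerically. The sketch below applies the trapezoidal method to $Y(t) = g(t) + \lambda \int_0^t Y(s)\,ds$ with $Y(s) = s^2$, so that $g(t) = t^2 - \lambda t^3/3$, and compares the observed error with a closed form. One point deserves care: the quadrature error (12.27) applies to the integrand $K(t_n,s)Y(s) = \lambda s^2$, whose second derivative is $2\lambda$, and the closed form used here, $-\tfrac{1}{6}h^2\{[1 + h\lambda/(1-\tfrac12 h\lambda)]^n - 1\}$, is the one obtained when that factor $\lambda$ is carried through the derivation. The values of `lambda`, `h` and `N` are arbitrary illustrative choices.

```matlab
% Check of Example 12.4: K(t,s) = lambda, Y(s) = s^2, g(t) = t^2 - lambda*t^3/3.
lambda = -1;  h = 0.1;  N = 20;           % illustrative values
t = (0:N)*h;
g = t.^2 - lambda*t.^3/3;
y = zeros(1, N+1);
y(1) = g(1);                              % y_0 = g(0)
for n = 1:N
    % Known part of the trapezoidal sum: half the first term plus interior terms.
    known = h*lambda*(y(1)/2 + sum(y(2:n)));
    % Linear kernel: solve y_n = g(t_n) + known + (h*lambda/2)*y_n directly.
    y(n+1) = (g(n+1) + known)/(1 - h*lambda/2);
end
E = t.^2 - y;                             % E_h(t_n) = Y(t_n) - y_n
r = 1 + h*lambda/(1 - h*lambda/2);
E_closed = -(h^2/6)*(r.^(0:N) - 1);       % closed form, factor lambda carried through
E_asym   = -(h^2/6)*(exp(lambda*t) - 1);  % its limiting form for small h
fprintf('max |E - E_closed| = %.2e\n', max(abs(E - E_closed)));
fprintf('max |E - E_asym|   = %.2e\n', max(abs(E - E_asym)));
```

Because $Y''$ is constant, the trapezoidal quadrature error is known exactly here, so the first difference should be at rounding level; the second is small but nonzero, reflecting the passage to the limit $h \to 0$.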

12.2.3 General schema for numerical methods

As a general approach to the numerical solution of the integral equation (12.19), consider replacing the integral term with an approximation based on numerical integration. Introduce the numerical integration

$$\int_0^{t_n} K(t_n, s, Y(s))\, ds \approx h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j, Y(t_j)). \qquad (12.32)$$

The quadrature weights $h w_{n,j}$ are allowed to vary with the grid point $t_n$, in contrast to the trapezoidal method. Equation (12.19) is approximated by

$$y_n = g(t_n) + h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j, y_j), \qquad n = 1, 2, \dots, N_h. \qquad (12.33)$$

As with the earlier trapezoidal method, if $w_{n,n} \ne 0$, then (12.33) must be solved for $y_n$ by some rootfinding method. For example, simple iteration has the form

$$y_n^{(k+1)} = g(t_n) + h \sum_{j=0}^{n-1} w_{n,j} K(t_n, t_j, y_j) + h w_{n,n} K\bigl(t_n, t_n, y_n^{(k)}\bigr), \qquad k = 0, 1, \dots \qquad (12.34)$$

for some given initial estimate $y_n^{(0)}$. Also, many such methods (12.33) require $n \ge p + 1$ for some small integer $p$; the values $y_1, \dots, y_p$ must be determined by some other "starting method".

There are many possible such schemes (12.33), and we investigate only one pair of such formulas, both based on Simpson's numerical integration formula. The simple Simpson rule has the form

$$\int_{\alpha}^{\alpha+2h} F(s)\, ds \approx \frac{h}{3} \bigl[F(\alpha) + 4F(\alpha + h) + F(\alpha + 2h)\bigr].$$

This classical quadrature formula is very popular, well-studied, and well-understood; e.g., see [12, Sections 5.1–5.2]. In producing the approximation of (12.32), consider first the case where $n$ is even. Then define

$$\int_0^{t_n} K(t_n, s, Y(s))\, ds = \sum_{j=1}^{n/2} \int_{t_{2j-2}}^{t_{2j}} K(t_n, s, Y(s))\, ds$$

$$\approx \frac{h}{3} \sum_{j=1}^{n/2} \bigl[ K(t_n, t_{2j-2}, Y(t_{2j-2})) + 4K(t_n, t_{2j-1}, Y(t_{2j-1})) + K(t_n, t_{2j}, Y(t_{2j})) \bigr]. \qquad (12.35)$$

This has an error of size $O(h^4)$.

Consider next the case where $n$ is odd and $n \ge 3$. Then the interval $[0, t_n]$ cannot be divided into a union of subintervals $[t_{2j-2}, t_{2j}]$; and thus Simpson's integration rule cannot be applied in the manner of (12.35). To maintain the accuracy implicit in using Simpson's rule, we use Newton's $\tfrac{3}{8}$ rule over one subinterval of length $3h$,

$$\int_{\alpha}^{\alpha+3h} F(s)\, ds \approx \frac{3h}{8} \bigl[F(\alpha) + 3F(\alpha + h) + 3F(\alpha + 2h) + F(\alpha + 3h)\bigr].$$

We then use Simpson's rule over the remaining subintervals of length $2h$. The interval $[0, t_n]$ can be subdivided in two convenient ways,

Scheme 1: $[0, t_n] = [0, t_3] \cup [t_3, t_5] \cup \cdots \cup [t_{n-2}, t_n]$; (12.36)

Scheme 2: $[0, t_n] = [0, t_2] \cup \cdots \cup [t_{n-5}, t_{n-3}] \cup [t_{n-3}, t_n]$. (12.37)

With the first scheme, we apply Newton's $\tfrac{3}{8}$ rule over $[0, t_3]$ and apply Simpson's rule over the subintervals $[t_3, t_5], \dots, [t_{n-2}, t_n]$. With the second scheme, we apply Newton's $\tfrac{3}{8}$ rule over $[t_{n-3}, t_n]$ and Simpson's rule over the remaining subintervals $[0, t_2], \dots, [t_{n-5}, t_{n-3}]$.

To be more precise, with the second scheme we begin by writing

$$\int_0^{t_n} K(t_n, s, Y(s))\, ds = \sum_{j=1}^{(n-3)/2} \int_{t_{2j-2}}^{t_{2j}} K(t_n, s, Y(s))\, ds + \int_{t_{n-3}}^{t_n} K(t_n, s, Y(s))\, ds.$$

Approximating the integrals as described above, we obtain

$$\int_0^{t_n} K(t_n, s, Y(s))\, ds \approx \frac{h}{3} \sum_{j=1}^{(n-3)/2} \bigl[ K(t_n, t_{2j-2}, Y(t_{2j-2})) + 4K(t_n, t_{2j-1}, Y(t_{2j-1})) + K(t_n, t_{2j}, Y(t_{2j})) \bigr]$$

$$\qquad + \frac{3h}{8} \bigl[ K(t_n, t_{n-3}, Y(t_{n-3})) + 3K(t_n, t_{n-2}, Y(t_{n-2})) + 3K(t_n, t_{n-1}, Y(t_{n-1})) + K(t_n, t_n, Y(t_n)) \bigr]. \qquad (12.38)$$

Using (12.36) leads to a similar formula, but with Newton's $\tfrac{3}{8}$ rule applied over $[0, t_3]$.

We denote by "Simpson method 2" the combination of (12.35) and (12.38); and we denote as "Simpson method 1" the combination of (12.35) and the analog of (12.38) for the subdivision of (12.36). Both methods require that the initial value $y_1$ be calculated by another method.

Both approximations have discretization errors of size $O(h^4)$, but method 2 turns out to be much superior to method 1 when solving (12.19). These methods are discussed and illustrated in Section 12.3.
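To make the two subdivisions concrete, the weights $w_{n,j}$ of (12.32) can be assembled explicitly for an odd index $n$. The following MATLAB sketch builds the weight vectors for Scheme 1 and Scheme 2 and checks that both integrate $F(s) = s^3$ exactly over $[0, t_n]$, as Simpson's rule and Newton's $\tfrac38$ rule should. The names `w1`, `w2`, `simp` and `s38` and the values of `n` and `h` are illustrative assumptions.

```matlab
% Quadrature weights w_{n,j}, j = 0..n, for the two Simpson-based schemes.
n = 7;                                  % an odd index, n >= 3
h = 0.1;
s38 = [3 9 9 3]/8;                      % Newton 3/8 weights over three subintervals

% Composite Simpson weights over m = n-3 subintervals (m is even here).
m = n - 3;
simp = zeros(1, m+1);
for j = 1:2:m-1                         % one Simpson panel per pair of subintervals
    simp(j:j+2) = simp(j:j+2) + [1 4 1]/3;
end

% Scheme 1 (12.36): 3/8 rule on [0,t_3], Simpson on [t_3,t_n].
w1 = zeros(1, n+1);
w1(1:4)   = s38;
w1(4:end) = w1(4:end) + simp;

% Scheme 2 (12.37): Simpson on [0,t_{n-3}], 3/8 rule on [t_{n-3},t_n].
w2 = zeros(1, n+1);
w2(1:m+1)   = simp;
w2(m+1:end) = w2(m+1:end) + s38;

% Both rules integrate cubics exactly: int_0^{t_n} s^3 ds = t_n^4/4.
t = (0:n)*h;
exact = t(end)^4/4;
q1 = h*sum(w1.*t.^3);
q2 = h*sum(w2.*t.^3);
fprintf('scheme 1 error: %.2e, scheme 2 error: %.2e\n', q1-exact, q2-exact);
```

The two weight vectors differ only in where the $\tfrac38$ panel sits: at the start of the interval for Scheme 1 and at the end for Scheme 2.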

MATLAB program. The following MATLAB program implements the trapezoidal method (12.21)–(12.22).

function soln = vie_trap(N_h,T,fcn_g,fcn_k)
%
% function soln = vie_trap(N_h,T,fcn_g,fcn_k)
%
% This solves the integral equation
%                  t
%   Y(t) = g(t) + Int k(t,s,Y(s))ds
%                  0
% ==INPUT==
% N_h: The number of subdivisions of [0,T].
% T: [0,T] is the interval for the solution function.
% fcn_g: The handle of the driver function g(t).
% fcn_k: The handle of the kernel function k(t,s,u).
% ==OUTPUT==
% soln: A structure with the following components.
% soln.t: The grid points at which the solution Y(t) is
%         approximated.
% soln.y: The approximation of Y(t) at the grid points.
% The implicit trapezoidal equation is solved by simple fixed
% point iteration at each grid point in t. For simplicity,
% the program uses a crude means of controlling the iteration.
% The iteration is executed a fixed number of times, controlled
% by 'loop'.
loop = 10; % This is much more than is usually needed.
h = T/N_h; t = linspace(0,T,N_h+1);
g_vec = fcn_g(t);
y_vec = zeros(size(t)); y_vec(1) = g_vec(1);
for n=1:N_h
    y_vec(n+1) = y_vec(n); % Initial estimate for the iteration.
    k_vec = fcn_k(t(n+1),t(1:n+1),y_vec(1:n+1));
    for j=1:loop
        y_vec(n+1) = g_vec(n+1) + h*(sum(k_vec(2:n)) ...
            + (k_vec(1) + k_vec(n+1))/2);
        k_vec(n+1) = fcn_k(t(n+1),t(n+1),y_vec(n+1));
    end
end
soln.t = t;
soln.y = y_vec;
end % vie_trap

The following program is a test program for the above vie_trap.

function test_vie_trap(lambda,N_h,T,output_step)
%
% function test_vie_trap(lambda,N_h,T,output_step)
%
% ==INPUT==
% lambda: Used in defining the integral equation.
% N_h: The number of subdivisions of [0,T].
% T: [0,T] is the interval for the solution function.
% output_step: The solution is output at the indices
%              v = 1:output_step:N_h+1
soln = vie_trap(N_h,T,@g_driver,@kernel);
t = soln.t; y = soln.y;
Y_true = true_soln(t);
err = Y_true - y;
format short e
v = 1:output_step:N_h+1;
disp([t(v)' y(v)' err(v)'])
%================================================
    function ans_g = g_driver(s)
        ans_g = (1-lambda)*sin(s) + (1+lambda)*cos(s) - lambda;
    end % g_driver
    function ans_true = true_soln(s)
        ans_true = cos(s) + sin(s);
    end % true_soln
    function ans_k = kernel(tau,s,u)
        % tau is a scalar, s and u vectors of the same dimension.
        ans_k = lambda*u;
    end % kernel
%================================================
end % test_vie_trap


12.3 NUMERICAL METHODS: THEORY

We begin by considering the convergence of methods

$$y_n = g(t_n) + h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j, y_j), \qquad n = p + 1, \dots, N_h \qquad (12.39)$$

with $y_0 = g(0)$ and with $y_1, \dots, y_p$ determined by another method. For example, the trapezoidal method has $p = 0$, and the two Simpson methods discussed in and following (12.35) have $p = 1$. Later we discuss the error requirements when computing such initial values $y_1, \dots, y_p$.

To analyze the error in using (12.39) to solve

$$Y(t) = g(t) + \int_0^t K(t, s, Y(s))\, ds, \qquad 0 \le t \le T, \qquad (12.40)$$

we proceed in analogy with the error equation (12.29) for the trapezoidal method. As in Section 12.1, we assume that $K(t, s, u)$ is continuous for $0 \le s \le t \le T$; further, we assume that $K(t, s, u)$ satisfies the Lipschitz condition

$$|K(t, s, u_1) - K(t, s, u_2)| \le c\, |u_1 - u_2|, \qquad 0 \le s \le t \le T \qquad (12.41)$$

for $-\infty < u_1, u_2 < \infty$. These are the assumptions used in Theorem 12.2.

Rewrite (12.40) using numerical integration and the associated error,

$$Y(t_n) = g(t_n) + h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j, Y(t_j)) + Q_h(t_n), \qquad n = p + 1, \dots, N_h. \qquad (12.42)$$

The quantity $Q_h(t_n)$ denotes the error in the quadrature approximation to the integral in (12.40). As an example of the quadrature error, recall (12.25)–(12.28) for the trapezoidal method.

Subtract (12.39) from (12.42), obtaining

$$E_h(t_n) = h \sum_{j=0}^{n} w_{n,j} \bigl[K(t_n, t_j, Y(t_j)) - K(t_n, t_j, y_j)\bigr] + Q_h(t_n) \qquad (12.43)$$

for $n = p + 1, \dots, N_h$, with $E_h(t_n) = Y(t_n) - y_n$. Applying the Lipschitz condition (12.41) to (12.43), we have

$$|E_h(t_n)| \le hc \sum_{j=0}^{n} |w_{n,j}|\, |E_h(t_j)| + |Q_h(t_n)|, \qquad n = p + 1, \dots, N_h. \qquad (12.44)$$

If we assume that $h$ is small enough that $hc\, |w_{n,n}| < 1$, then we can bound $|E_h(t_n)|$ in terms of preceding errors:

$$|E_h(t_n)| \le \frac{hc}{1 - hc\, |w_{n,n}|} \sum_{j=0}^{n-1} |w_{n,j}|\, |E_h(t_j)| + \frac{|Q_h(t_n)|}{1 - hc\, |w_{n,n}|}, \qquad (12.45)$$

for $n = p + 1, \dots, N_h$.

To further simplify this, we assume

$$\max_{0 \le i \le n \le N_h} |w_{n,i}| \le \gamma < \infty \qquad (12.46)$$

for all $0 < h \le h_0$ for some small value of $h_0$. Without any loss of generality when analyzing convergence as $h \to 0$, (12.46) permits the assumption that

$$hc\, |w_{n,n}| \le \tfrac{1}{2} \qquad (12.47)$$

is true for all $h$ and $n$ of interest. With (12.46) and (12.47), the inequality (12.45) becomes

$$|E_h(t_n)| \le 2\gamma c h \sum_{j=0}^{n-1} |E_h(t_j)| + 2\, |Q_h(t_n)|, \qquad n = p + 1, \dots, N_h. \qquad (12.48)$$

This can be solved to give a useful convergence result.

Theorem 12.5 In the Volterra integral equation (12.40), assume that the function $K(t, s, u)$ is continuous for $0 \le s \le t \le T$, $-\infty < u < \infty$, and further that it satisfies the Lipschitz condition (12.41). Assume that $g(t)$ is continuous on $[0, T]$. In the numerical approximation (12.39), assume (12.46). Introduce

$$\eta(h) \equiv \sum_{j=0}^{p} |E_h(t_j)|, \qquad (12.49)$$

$$\delta(t_n; h) \equiv \max_{p+1 \le j \le n} |Q_h(t_j)|.$$

Then

$$|E_h(t_n)| \le e^{2\gamma c t_n} \bigl[2\gamma c h\, \eta(h) + \delta(t_n; h)\bigr], \qquad n = p + 1, \dots, N_h. \qquad (12.50)$$

Proof. This bound is a consequence of (12.48), the following lemma, and the bound

$$(1 + 2\gamma c h)^{n-p-1} \le e^{2\gamma c (t_n - t_{p+1})} \le e^{2\gamma c t_n}, \qquad n \ge p + 1.$$

To show this bound, recall Lemma 2.3 from Section 2.2. A more complete proof is given in [59, Section 7.3].

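Lemma 12.6 Let the sequence $\{\varepsilon_0, \varepsilon_1, \dots\}$ satisfy

$$|\varepsilon_n| \le \alpha \sum_{j=0}^{n-1} |\varepsilon_j| + \beta_n, \qquad n = p + 1, \dots. \qquad (12.51)$$

Then

$$|\varepsilon_n| \le (1 + \alpha)^{n-p-1} \left[ \alpha \sum_{j=0}^{p} |\varepsilon_j| + \max_{p+1 \le j \le n} |\beta_j| \right]. \qquad (12.52)$$

Proof. This can be proved using mathematical induction, and we leave it as an exercise for the reader.

The bound (12.50) assures us of convergence provided $h\,\eta(h) \to 0$ and $\delta(t_n; h) \to 0$ as $h \to 0$.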

Example 12.7 Recall the trapezoidal method of (12.21). Then $p = 0$ and $\eta(h) = |Y(0) - y_0|$. For the purpose of analyzing convergence, we take $y_0 = Y(0)$ and $\eta(h) = 0$. Also, from (12.27), we can take

$$\delta(t_n; h) = \frac{h^2 t_n}{12} \max_{0 \le s \le t_n} \left| \frac{\partial^2}{\partial s^2} \bigl[K(t_n, s)\, Y(s)\bigr] \right|. \qquad (12.53)$$

From (12.50), we obtain

$$|E_h(t_n)| \le e^{2\gamma c t_n}\, \delta(t_n; h),$$

and this is of size $O(h^2)$ on each finite interval $0 \le t_n \le T$. Thus the trapezoidal method is convergent; and we say it is of order 2.

Example 12.8 Recall Simpson method 2 from (12.35), (12.38), and the associated Simpson method 1. Both methods require $p = 1$, and

$$\eta(h) = |E_h(t_0)| + |E_h(t_1)|.$$

Again, we take $|E_h(t_0)| = 0$. The quadrature error $\delta(t_n; h)$ can be shown to be of size $O(h^4)$ on each finite interval $[0, t_n]$. If we also have $h\,\eta(h) = O(h^4)$, then the overall error in both Simpson methods is of size $O(h^4)$ on each finite interval $[0, T]$.

If we use the simple trapezoidal method to generate $y_1$, then it can be shown that $\eta(h) = O(h^3)$ for this special case of a fixed finite number of errors (in particular, $E_h(t_1)$); this is sufficient to yield $h\,\eta(h) = O(h^4)$. We illustrate this using Simpson method 2 to solve

$$Y(t) = \cos t - \int_0^t Y(s)\, ds, \qquad t \ge 0 \qquad (12.54)$$

with the true solution

$$Y(t) = \tfrac{1}{2} \left(\cos t - \sin t + e^{-t}\right), \qquad t \ge 0, \qquad (12.55)$$

the same test equation as in Example 12.3. The numerical results with varying values of $h$ are given in Table 12.2. The values in the columns labeled "Ratio" approach 16 as $h$ decreases, and this is consistent with a convergence rate of $O(h^4)$.

Table 12.2 Numerical results for solving (12.54) using the Simpson method 2

  t     Error, h = 0.2   Ratio   Error, h = 0.1   Ratio   Error, h = 0.05   Ratio   Error, h = 0.025
 0.8       1.24e-6        10.2      1.23e-7        13.4      9.15e-9         14.8      6.16e-10
 1.6      -5.56e-7       -71.0      7.84e-9         6.4      1.23e-9         13.5      3.09e-11
 2.4      -1.90e-6        14.2     -1.34e-7        14.3     -9.37e-9         15.1     -6.22e-10
 3.2      -1.95e-6        10.4     -1.87e-7        13.6     -1.38e-8         14.9     -9.24e-10
 4.0      -7.10e-7         6.2     -1.15e-7        12.9     -8.95e-9         14.7     -6.07e-10

12.3.1 Numerical stability

In addition to being convergent, a numerical method must also be numerically stable. As with numerical methods for the initial value problem for differential equations, various meanings are given to the concept of "numerically stable". We begin with stability as discussed in (12.9)-(12.11) for the linear equation (12.2). This is in analogy with stability as discussed in Section 7.3 of Chapter 7 for multistep methods for the initial value problem for differential equations.

In the numerical method

$$y_n = g(t_n) + h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j, y_j), \qquad n = p + 1, \dots, N_h, \qquad (12.56)$$

consider perturbing the initial values $y_0, \dots, y_p$, say, by changing them to $y_j + \eta_{h,j}$, $j = 0, \dots, p$. Also, perturb $g(t_n)$ to $g(t_n) + \varepsilon_{h,n}$ for $n \ge p + 1$. We are interested in knowing how the perturbations $\{\eta_{h,j}\}$ and $\{\varepsilon_{h,n}\}$ affect the solution $\{y_n\}$, particularly for small perturbations and small values of $h$.

Let $\{\tilde{y}_n : 0 \le n \le N_h\}$ denote the numerical solution in this perturbed case,

$$\tilde{y}_n = g(t_n) + \varepsilon_{h,n} + h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j, \tilde{y}_j), \qquad n = p + 1, \dots, N_h,$$

$$\tilde{y}_j = y_j + \eta_{h,j}, \qquad j = 0, \dots, p. \qquad (12.57)$$

Subtracting (12.56) from (12.57), using the Lipschitz condition (12.41) and the bound (12.46) for the weights, we obtain

$$|\tilde{y}_n - y_n| \le |\varepsilon_{h,n}| + hc\gamma \sum_{j=0}^{n} |\tilde{y}_j - y_j|, \qquad p + 1 \le n \le N_h,$$

$$\tilde{y}_j - y_j = \eta_{h,j}, \qquad j = 0, \dots, p.$$

With assumption (12.47) and Lemma 12.6, we obtain

$$|\tilde{y}_n - y_n| \le e^{2\gamma c t_n} \left[ 2h\gamma c \sum_{j=0}^{p} |\eta_{h,j}| + \max_{p+1 \le j \le n} |\varepsilon_{h,j}| \right].$$

This simplifies as

$$|\tilde{y}_n - y_n| \le C\delta, \qquad p + 1 \le n \le N_h, \quad 0 < h \le h_0, \qquad (12.58)$$

where $C$ is a constant independent of $h$ and

$$\delta = \max_{0 < h \le h_0} \left\{ h \max_{0 \le j \le p} |\eta_{h,j}|, \; \max_{p+1 \le j \le N_h} |\varepsilon_{h,j}| \right\}.$$

The upper bound $h_0$ on $h$ is to be chosen so that for all $n$,

$$h_0 c\, |w_{n,n}| \le \tfrac{1}{2}.$$

The bound (12.58) says that the numerical solution $\{y_n : p + 1 \le n \le N_h\}$ varies continuously with the initial starting values $\{y_0, \dots, y_p\}$ and the function $g(t)$. This is true in a uniform sense for all sufficiently small values of $h$. The bound (12.58) is the numerical analogue of the stability result (12.11) for the linear equation (12.2).

The result (12.58) says that virtually all convergent quadrature schemes lead to numerical methods (12.56) that are numerically stable. In practice, however, a number of such methods remain very sensitive to perturbations in the starting values. In particular, experimental results imply that Simpson method 2 is numerically stable, whereas Simpson method 1 has practical stability problems. What is the explanation for this?

12.3.2 Practical numerical stability

In discussing practical stability difficulties when using numerical methods (12.39), we follow Linz [59, §7.4]. We consider only the linear equation

$$Y(t) = g(t) + \int_0^t K(t, s)\, Y(s)\, ds, \qquad 0 \le t \le T, \qquad (12.59)$$

although the results generalize to the fully nonlinear equation (12.40). The type of stability that is considered is related to the concept of "relative stability" from Subsection 7.3.3.

Consider the numerical method (12.39) as applied to (12.59),

$$y_n = g(t_n) + h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j)\, y_j, \qquad n = p + 1, \dots, N_h \qquad (12.60)$$

with $y_0 = g(0)$ and with $y_1, \dots, y_p$ obtained by other means. The true solution $Y(t)$ satisfies

$$Y(t_n) = g(t_n) + h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j)\, Y(t_j) + Q_h(t_n), \qquad (12.61)$$

for $n = p + 1, \dots, N_h$. Subtracting (12.60) from (12.61), we obtain

$$E_h(t_n) = h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j)\, E_h(t_j) + Q_h(t_n), \qquad (12.62)$$

for $n = p + 1, \dots, N_h$.

To aid in understanding the behavior of $E_h(t_n)$ as $t_n$ increases, the error is decomposed into two parts. First, let $\{E_h^Q(t_n)\}$ denote the solution of

$$E_h^Q(t_n) = h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j)\, E_h^Q(t_j) + Q_h(t_n), \qquad n = p + 1, \dots, N_h,$$

$$E_h^Q(t_j) = 0, \qquad j = 0, \dots, p. \qquad (12.63)$$

This error is due entirely to the quadrature errors $\{Q_h(t_n) : n \ge p + 1\}$ that occur in discretizing the integral equation (12.59); it assumes that there is no error in the initial values $y_0, \dots, y_p$. Second, consider the errors $E_h^S(t_n)$ obtained by solving

$$E_h^S(t_n) = h \sum_{j=0}^{n} w_{n,j} K(t_n, t_j)\, E_h^S(t_j), \qquad n = p + 1, \dots, N_h, \qquad (12.64)$$

$$E_h^S(t_j) = \eta_j, \qquad j = 0, \dots, p. \qquad (12.65)$$

The quantities $\{\eta_0, \dots, \eta_p\}$ are the errors in the starting values $\{y_0, \dots, y_p\}$ when using (12.60). The original error $E_h(t_n)$ is given by

$$E_h(t_n) = E_h^Q(t_n) + E_h^S(t_n), \qquad n = 0, 1, \dots, N_h.$$

Returning to (12.63), assume that the quadrature error has an expansion of the form

$$Q_h(t_n) = a(t_n)\, h^m + O(h^{m+1})$$

for some integer $m \ge 1$. For example, the trapezoidal method has

$$Q_h(t_n) = a(t_n)\, h^2 + O(h^3),$$

$$a(t) = -\frac{1}{12} \left. \frac{\partial}{\partial s} \bigl[K(t, s)\, Y(s)\bigr] \right|_{s=0}^{s=t}$$

(see (12.28)). Then it can be shown that $E_h^Q(t_n)$ has the asymptotic formula

$$E_h^Q(t_n) = b(t_n)\, h^m + O(h^{m+1}) \qquad (12.66)$$

with the function $b$ the solution of the integral equation

$$b(t) = a(t) + \int_0^t K(t, s)\, b(s)\, ds, \qquad 0 \le t \le T.$$

This equation is the formal limit of (12.63) after division by $h^m$, with the sign conventions $E_h = Y - y_n$ and $Q_h$ as in (12.61). For a derivation, see [59, Theorem 7.3]. The asymptotic formula (12.66) applies to virtually all quadrature schemes that are likely to be used in setting up the numerical scheme (12.56), and it forms the basis for numerical extrapolation schemes for error estimation.


The second error, $E_h^S(t_n)$, is more subtle to understand. To begin, consider the weights $\{w_{n,j}\}$ for the two Simpson methods.

• Simpson method 1:

$$n \text{ even:} \quad \tfrac{1}{3}, \tfrac{4}{3}, \tfrac{2}{3}, \tfrac{4}{3}, \cdots, \tfrac{2}{3}, \tfrac{4}{3}, \tfrac{1}{3};$$

$$n \text{ odd:} \quad \tfrac{3}{8}, \tfrac{9}{8}, \tfrac{9}{8}, \tfrac{3}{8} + \tfrac{1}{3}, \tfrac{4}{3}, \tfrac{2}{3}, \tfrac{4}{3}, \cdots, \tfrac{2}{3}, \tfrac{4}{3}, \tfrac{1}{3}; \qquad (12.67)$$

all being multiplied by $h$. The weights satisfy

$$w_{n+\rho,i} = w_{n,i}, \qquad i = 4, \dots, n$$

with $\rho = 2$, but not with $\rho = 1$. We say the weights have a repetition factor of 2.

• Simpson method 2:

$$n \text{ even:} \quad \tfrac{1}{3}, \tfrac{4}{3}, \tfrac{2}{3}, \tfrac{4}{3}, \cdots, \tfrac{2}{3}, \tfrac{4}{3}, \tfrac{1}{3};$$

$$n \text{ odd:} \quad \tfrac{1}{3}, \tfrac{4}{3}, \tfrac{2}{3}, \tfrac{4}{3}, \cdots, \tfrac{2}{3}, \tfrac{4}{3}, \tfrac{1}{3} + \tfrac{3}{8}, \tfrac{9}{8}, \tfrac{9}{8}, \tfrac{3}{8}; \qquad (12.68)$$

again, all being multiplied by $h$. The weights satisfy

$$w_{n+1,i} = w_{n,i}, \qquad i = 0, 1, \dots, n - 4.$$

These weights have a repetition factor of 1.

Both of these methods have an asymptotic formula for $E_h^S(t_n)$; see [59, Theorem 7.4]. In particular, for Simpson method 2 assume that the starting values $\{y_0, y_1\}$ satisfy

$$Y(t_i) - y_i = \delta_i h^3 + O(h^4). \qquad (12.69)$$

Then

$$E_h^S(t_n) = h^4 \bigl[\delta_0 C_0(t_n) + \delta_1 C_1(t_n)\bigr] + O(h^5) \qquad (12.70)$$

with $C_i(t)$ satisfying

$$C_i(t) = V_i K(t, t_i) + \int_0^t K(t, s)\, C_i(s)\, ds, \qquad i = 0, 1, 2.$$

The constants $V_i$ are derived as a part of the proof in [59, Theorem 7.4]. The functions $C_0(t)$ and $C_1(t)$ can be shown to be well behaved, and consequently, the same is true of the error in (12.70).

For Simpson method 1, there is an asymptotic formula for $E_h^S(t_n)$, but it is not as well behaved as is (12.70) for Simpson method 2. For Simpson method 1, it can be shown that

$$E_h^S(t_{2n}) = h\, x(t_{2n}) + O(h^2), \qquad (12.71)$$

$$E_h^S(t_{2n+1}) = h\, y(t_{2n+1}) + O(h^2) \qquad (12.72)$$

with $(x(t), y(t))$ the solution of a system of two Volterra integral equations. The functions $x(t)$ and $y(t)$ can be written in the form

$$x(t) = \tfrac{1}{2} \bigl(z_1(t) + z_2(t)\bigr), \qquad y(t) = \tfrac{1}{2} \bigl(z_1(t) - z_2(t)\bigr) \qquad (12.73)$$

with $z_1(t)$ and $z_2(t)$ the solutions of the Volterra integral equation

$$z_i(t) = g_i(t) + \int_0^t K(t, s)\, z_i(s)\, ds, \qquad 0 \le t \le T$$

for particular values of $g_i(t)$ that depend on both $K(t, s)$ and the constants $\{\delta_0, \delta_1\}$ of (12.69).

To develop some intuition from this, consider the special case $K(t, s) \equiv \lambda$. Then $z_1(t)$ and $z_2(t)$ have the forms

$$z_1(t) = A_1(t) + B_1(t)\, e^{\lambda t},$$

$$z_2(t) = A_2(t) + B_2(t)\, e^{-\lambda t/3}.$$

Recalling the special formulas of (12.12)–(12.14), the case $\lambda < 0$ is associated with stability in the Volterra integral equation and $\lambda > 0$ is associated with instability.

Considering only the case where $\lambda < 0$, the function $z_1(t)$ behaves "properly" as $t$ increases. In contrast, the function $z_2(t)$ is exponentially increasing as $t$ increases. Applying this to (12.73), we have that $x(t)$ and $y(t)$ will also increase exponentially, although with opposite signs depending on whether the index for $t_n$ is even or odd. Using this in (12.71)-(12.72), we find that the errors $E_h^S(t_n)$ should increase exponentially for larger values of $n$, and that there should be an oscillation in sign.

Example 12.9 Recall Example 12.8 in which we examined Simpson method 2 for the linear integral equation (12.54). We solve it again, now with both Simpson methods 1 and 2, doing so on $[0, 10]$ with $h = 0.1$. A plot of the error when using Simpson method 1 is given in Figure 12.1, and that for Simpson method 2 is given in Figure 12.2. The error with Simpson method 1 is as predicted from the above discussion: it increases rapidly with increasing $t$, and it is oscillatory in sign. With Simpson method 2 there is a much more regular and better behavior in the error, in this case of sinusoidal form, reflecting the sinusoidal form of the true solution $Y(t) = \tfrac{1}{2}(\cos t - \sin t + e^{-t})$. There are also some oscillations, but they are more minor and are imposed on the dominant form of the error.
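The comparison in Example 12.9 can be sketched in MATLAB as follows. This is an illustrative implementation under stated assumptions, not the authors' program: $y_1$ is generated by one trapezoidal step, the weights are assembled as in (12.35)–(12.38), and because the kernel $K(t,s,u) = -u$ is linear the implicit equation (12.33) is solved for $y_n$ directly. The names `err`, `simp`, `s38` and `known` are hypothetical.

```matlab
% Simpson methods 1 and 2 for Y(t) = cos(t) - int_0^t Y(s) ds on [0,10], h = 0.1.
h = 0.1;  N = 100;  t = (0:N)*h;
Ytrue = 0.5*(cos(t) - sin(t) + exp(-t));
s38 = [3 9 9 3]/8;                        % Newton 3/8 rule weights
err = zeros(2, N+1);                      % row k holds the error of method k
for method = 1:2
    y = zeros(1, N+1);
    y(1) = 1;                             % y_0 = g(0) = cos(0)
    y(2) = (cos(h) - h*y(1)/2)/(1 + h/2); % y_1 from one trapezoidal step
    for n = 2:N
        odd = mod(n,2) == 1;
        m = n - 3*odd;                    % number of subintervals covered by Simpson
        simp = zeros(1, m+1);
        for j = 1:2:m-1                   % one panel per pair of subintervals
            simp(j:j+2) = simp(j:j+2) + [1 4 1]/3;
        end
        w = zeros(1, n+1);
        if ~odd                           % n even: composite Simpson, (12.35)
            w = simp;
        elseif method == 1                % Scheme 1: 3/8 rule first, then Simpson
            w(1:4) = s38;
            w(4:end) = w(4:end) + simp;
        else                              % Scheme 2: Simpson first, then 3/8 rule
            w(1:m+1) = simp;
            w(m+1:end) = w(m+1:end) + s38;
        end
        % Linear kernel: y_n = g(t_n) - known - h*w_{n,n}*y_n, solved for y_n.
        known = h*sum(w(1:n).*y(1:n));
        y(n+1) = (cos(t(n+1)) - known)/(1 + h*w(n+1));
    end
    err(method,:) = Ytrue - y;
end
fprintf('max |error|, method 1: %.2e\n', max(abs(err(1,:))));
fprintf('max |error|, method 2: %.2e\n', max(abs(err(2,:))));
```

Plotting `err(1,:)` and `err(2,:)` against `t` gives a quick visual check of the growth and sign oscillation described for method 1 and of the smoother error of method 2.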

A very good introduction to the topic of numerical stability for solving Volterra integral equations is given by Linz [59, Section 7.4]. It also is a very good introduction to the general subject of the numerical solution of Volterra integral equations. An excellent, more recent, and more specialized treatment is given by Brunner [17].


[Plot of the error against $t$ for $0 \le t \le 10$; the vertical axis runs from $-3 \times 10^{-6}$ to $2 \times 10^{-6}$.]

Figure 12.1 The error in solving (12.54) using Simpson method 1

PROBLEMS

1. For the following Volterra integral equations of the second kind, show that the given function $Y(t)$ is the solution of the given equation.

(a) $$Y(t) = \cos(t) - \int_0^t (t - s) \cos(t - s)\, Y(s)\, ds,$$

$$Y(t) = \tfrac{2}{3} \cos(\sqrt{3}\, t) + \tfrac{1}{3}.$$

(b) $$Y(t) = t + \int_0^t \sin(t - s)\, Y(s)\, ds,$$

$$Y(t) = t + \tfrac{1}{6} t^3.$$

(c) $$Y(t) = \sinh(t) - \int_0^t \cosh(t - s)\, Y(s)\, ds,$$

$$Y(t) = \frac{2}{\sqrt{5}} \sinh\!\left(\frac{\sqrt{5}}{2}\, t\right) e^{-t/2}.$$

[Plot of the error against $t$ for $0 \le t \le 10$; the vertical axis runs from $-4 \times 10^{-7}$ to $4 \times 10^{-7}$.]

Figure 12.2 The error in solving (12.54) using Simpson method 2

2. Reduce equation (12.15) to (12.12) by introducing the new unknown function $Z(t) = e^{-\lambda t}\, Y(t)$. Use this transformation to obtain (12.16) from (12.14).

3. Demonstrate formula (12.4).

4. Using mathematical induction, show that the kernels $K_j(t, s)$ of (12.6) satisfy

$$|K_j(t, s)| \le \frac{(t - s)^{j-1}}{(j - 1)!}\, B^j, \qquad j \ge 1.$$

From this, show that

$$\left| \int_0^t K_j(t, s)\, g(s)\, ds \right| \le \frac{(tB)^j}{j!} \max_{0 \le s \le t} |g(s)|.$$

5. Using the result of Problem 4, and motivated by (12.5), show that the series

$$g(t) + \sum_{j=1}^{\infty} \int_0^t K_j(t, s)\, g(s)\, ds$$

is absolutely convergent. Note that it still remains necessary to show that this function satisfies (12.2). We refer to Linz [59, p. 30] for a proof, along with a proof of the uniqueness of the solution.

6. Assume that it has been shown, based on (12.5), that

$$Y(t) = g(t) + \sum_{j=1}^{\infty} \int_0^t K_j(t, s)\, g(s)\, ds$$

is an absolutely convergent series. Combine this with Problem 4 to show that $Y(t)$ satisfies (12.7).

7. Let $Y(t)$ be the continuous solution of (12.2).

(a) Assume that $K(t, s)$ is differentiable with respect to $t$ and that $\partial K(t, s)/\partial t$ is continuous for $0 \le s \le t \le T$. Assume further that $g(t)$ is continuously differentiable on $[0, T]$. Show that $Y(t)$ is differentiable and that

$$Y'(t) = g'(t) + K(t, t)\, Y(t) + \int_0^t \frac{\partial K(t, s)}{\partial t}\, Y(s)\, ds.$$

(b) Give a corresponding result that guarantees that $Y(t)$ is twice continuously differentiable on $[0, T]$.

8. Using the MATLAB program vie_trap, solve (12.23) on $[0, 12]$. Do so for stepsizes $h = 0.2, 0.1, 0.05$; then graph the errors over the full interval.

9. Apply the MATLAB program vie_trap to the equation

$$Y(t) = g(t) + \lambda \int_0^t Y(s)\, ds, \qquad t \ge 0,$$

$$g(t) = (1 - \lambda) \sin t + (1 + \lambda) \cos t - \lambda$$

over the interval $[0, 2\pi]$. The true solution is $Y(t) = \cos t + \sin t$. Do so for stepsizes of $h = 0.5, 0.25, 0.125$ and $\lambda = -1, 1$. Observe the decrease in the error as $h$ is halved. Comment on any differences observed between the cases of $\lambda = -1$ and $\lambda = 1$.

10. Using mathematical induction on $n$, prove Lemma 12.6.

11. In Example 12.8 it is asserted that $Y(t_1) - y_1 = O(h^3)$. Explain why this is true.

12. Write MATLAB programs for both Simpson methods 1 and 2. Generate $y_1$ using the trapezoidal method. After writing the program, use it to solve the linear integral equation (12.54), say on $[0, 10]$. Use a stepsize of $h = 0.2$ and graph the errors using MATLAB.

13. Using the programs of Problem 12, solve the equation given in Problem 9. Do so with both Simpson methods. Do so with both $\lambda = -1$ and $\lambda = 1$. Use $h = 0.2, 0.1$ and solve the equation on $[0, 10]$.

14. In analogy with the formulas (12.26)–(12.28) for the quadrature error for the trapezoidal rule, give the corresponding formulas for Simpson method 2. Note that this includes Newton's $\tfrac{3}{8}$ rule.

APPENDIX A

TAYLOR’S THEOREM

For a function with a number of derivatives at a specific point, Taylor's theorem provides a polynomial that is close to the function in a neighborhood of the point and an error formula for the difference between the function and the polynomial. Taylor's theorem is an important tool in developing numerical methods and deriving error bounds. We start with a review of the mean value theorem.

Theorem A.1 (Mean value theorem) Assume that $f(x)$ is continuous on $[a, b]$ and is differentiable on $(a, b)$. Then there is a point $c \in (a, b)$ such that

$$f(b) - f(a) = f'(c)\, (b - a). \qquad (A.1)$$

The number $c$ in (A.1) is usually unknown. There is an analogous form of the theorem for integrals. Assume that $f(x)$ is continuous on $[a, b]$, and that $w(x)$ is nonnegative and integrable on $[a, b]$. Then there exists $c \in (a, b)$ for which

$$\int_a^b f(x)\, w(x)\, dx = f(c) \int_a^b w(x)\, dx. \qquad (A.2)$$


Theorem A.2 (Taylor's theorem for functions of one real variable) Assume that $f(x)$ has $n + 1$ continuous derivatives for $a \le x \le b$, and let $x_0 \in [a, b]$. Then

$$f(x) = p_n(x) + R_n(x), \qquad a \le x \le b, \qquad (A.3)$$

where

$$p_n(x) = f(x_0) + (x - x_0)\, f'(x_0) + \frac{(x - x_0)^2}{2!}\, f''(x_0) + \cdots + \frac{(x - x_0)^n}{n!}\, f^{(n)}(x_0) = \sum_{j=0}^{n} \frac{(x - x_0)^j}{j!}\, f^{(j)}(x_0) \qquad (A.4)$$

is the Taylor polynomial of degree $n$ for the function $f(x)$ and the point of approximation $x_0$, and $R_n(x)$ is the remainder in approximating $f(x)$ by $p_n(x)$. We have

$$R_n(x) = \frac{1}{n!} \int_{x_0}^{x} (x - t)^n f^{(n+1)}(t)\, dt \qquad (A.5)$$

$$\phantom{R_n(x)} = \frac{(x - x_0)^{n+1}}{(n + 1)!}\, f^{(n+1)}(c_x) \qquad (A.6)$$

with $c_x$ an unknown point between $x_0$ and $x$.

The Taylor polynomial is constructed by requiring

$$p_n^{(j)}(x_0) = f^{(j)}(x_0), \qquad j = 0, 1, \dots, n.$$

Thus, we expect $p_n(x)$ is close to $f(x)$, at least for $x$ close to $x_0$. Two forms of the remainder $R_n(x)$ are given in the theorem. The form (A.6) is derived from (A.5) by an application of the integral form of the mean value theorem, (A.2). The remainder formula (A.5) does not involve an unknown point, and it is useful where a precise error bound is needed. In most contexts, the remainder formula (A.6) is sufficient.

Taylor's theorem can be proved by repeated application of the formula

$$g(x) = g(x_0) + \int_{x_0}^{x} g'(t)\, dt \qquad (A.7)$$

for a continuously differentiable function $g$. Evidently, this formula corresponds to Taylor's theorem with $n = 0$. As an example, we illustrate the derivation of (A.3) with $n = 1$; the derivation of (A.3) for $n > 1$ can be done similarly through an inductive argument. We apply (A.7) for $g = f'$:

$$f'(t) = f'(x_0) + \int_{x_0}^{t} f''(s)\, ds.$$


Thus,

$$f(x) = f(x_0) + \int_{x_0}^{x} f'(t)\, dt$$

$$\phantom{f(x)} = f(x_0) + \int_{x_0}^{x} \left[ f'(x_0) + \int_{x_0}^{t} f''(s)\, ds \right] dt$$

$$\phantom{f(x)} = f(x_0) + f'(x_0)\, (x - x_0) + \int_{x_0}^{x} \int_{x_0}^{t} f''(s)\, ds\, dt.$$

Interchanging the order of integration, we can rewrite the last term as

$$\int_{x_0}^{x} \int_{s}^{x} f''(s)\, dt\, ds = \int_{x_0}^{x} (x - s)\, f''(s)\, ds.$$

Changing $s$ into $t$, we have thus shown Taylor's theorem with $n = 1$.

In applying Taylor's theorem, we often need to choose a value for the nonnegative integer $n$. If we want to have a linear approximation of a twice continuously differentiable function $f(x)$ near $x = x_0$, then we take $n = 1$ and write

$$f(x) = f(x_0) + (x - x_0)\, f'(x_0) + \tfrac{1}{2} (x - x_0)^2 f''(c)$$

for some $c$ between $x$ and $x_0$. To show that $(f(x + h) - f(x))/h$ ($h > 0$, usually small) is a first-order approximation of $f'(x)$, we choose $n = 1$,

$$f(x + h) = f(x) + h\, f'(x) + \tfrac{1}{2} h^2 f''(c),$$

and so

$$\frac{f(x + h) - f(x)}{h} = f'(x) + \tfrac{1}{2} h\, f''(c).$$

As a further example, let us show that $(f(x + h) - f(x))/h$ is a second-order approximation of $f'(x + h/2)$. We choose $n = 2$, and write (here $x_0 = x + \tfrac{1}{2} h$)

$$f(x + h) = f(x + \tfrac{1}{2} h) + \tfrac{1}{2} h\, f'(x + \tfrac{1}{2} h) + \tfrac{1}{2} \left(\tfrac{1}{2} h\right)^2 f''(x + \tfrac{1}{2} h) + \tfrac{1}{6} \left(\tfrac{1}{2} h\right)^3 f'''(c_1),$$

$$f(x) = f(x + \tfrac{1}{2} h) - \tfrac{1}{2} h\, f'(x + \tfrac{1}{2} h) + \tfrac{1}{2} \left(\tfrac{1}{2} h\right)^2 f''(x + \tfrac{1}{2} h) - \tfrac{1}{6} \left(\tfrac{1}{2} h\right)^3 f'''(c_2)$$

for some $c_1 \in (x + \tfrac{1}{2} h, x + h)$ and $c_2 \in (x, x + \tfrac{1}{2} h)$. Thus,

$$\frac{f(x + h) - f(x)}{h} = f'(x + \tfrac{1}{2} h) + \tfrac{1}{48} h^2 \bigl[f'''(c_1) + f'''(c_2)\bigr],$$

showing $(f(x + h) - f(x))/h$ is a second-order approximation of $f'(x + \tfrac{1}{2} h)$. This result is usually expressed by saying that $(f(x + h) - f(x - h))/(2h)$ is a second-order approximation to $f'(x)$. Of course, in these preceding examples, we assume the function $f(x)$ has the required number of derivatives.
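The two orders of accuracy derived above are easy to observe numerically. The MATLAB sketch below is an illustration only; the test function $f(x) = e^x$, the point $x = 1$ and the stepsizes are arbitrary choices. It evaluates the difference quotient $(f(x+h) - f(x))/h$ and measures its error both as an approximation of $f'(x)$ and as an approximation of $f'(x + h/2)$.

```matlab
% Difference quotient (f(x+h) - f(x))/h viewed as an approximation of
% f'(x) (first order) and of f'(x + h/2) (second order).
f  = @(x) exp(x);               % test function (an arbitrary smooth choice)
df = @(x) exp(x);               % its exact derivative
x  = 1;
hs = 0.1 ./ 2.^(0:4);           % h = 0.1, 0.05, 0.025, ...
D  = (f(x + hs) - f(x)) ./ hs;  % difference quotients for each h
e1 = abs(D - df(x));            % error relative to f'(x)
e2 = abs(D - df(x + hs/2));     % error relative to f'(x + h/2)
r1 = e1(1:end-1) ./ e1(2:end);  % error ratios when h is halved (first order)
r2 = e2(1:end-1) ./ e2(2:end);  % error ratios when h is halved (second order)
disp([hs(2:end)' r1' r2'])
```

When $h$ is halved, the ratios in `r1` should approach 2 and those in `r2` should approach 4, matching the leading error terms $\tfrac12 h f''(c)$ and $\tfrac1{48} h^2 [f'''(c_1) + f'''(c_2)]$.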


Sample formulas resulted from Taylor’s theorem are

ex = 1 + x+x2

2!+ · · · + xn

n!+

xn+1

(n+ 1)!ec,

sin(x) = x− x3

3!+x5

5!− · · · + (−1)n−1 x2n−1

(2n− 1)!+ (−1)n x2n+1

(2n+ 1)!cos(c),

cos(x) = 1 − x2

2!+x4

4!− · · · + (−1)n x2n

(2n)!+ (−1)n+1 x2n+2

(2n+ 2)!cos(c),

log(1 − x) = −(x+

1

2x2 + · · · + 1

n+ 1xn+1

)−(

1

1 − c

)xn+2

n+ 2, −1 ≤ x < 1,

wherec is betweenx0 = 0 andx. The first three formulas are valid for any−∞ <x <∞.

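Theorem A.3 (Taylor’s theorem for functions of two real variables) Assume that $f(x, y)$ has continuous partial derivatives up to order $n + 1$ for $a \le x \le b$ and $c \le y \le d$, and let $x_0 \in [a, b]$, $y_0 \in [c, d]$. Then

$$f(x, y) = p_n(x, y) + R_n(x, y), \qquad a \le x \le b, \quad c \le y \le d, \tag{A.8}$$

where

$$p_n(x, y) = f(x_0, y_0) + \sum_{j=1}^{n} \frac{1}{j!} \left[ (x - x_0) \frac{\partial}{\partial x} + (y - y_0) \frac{\partial}{\partial y} \right]^j f(x_0, y_0), \tag{A.9}$$

$$R_n(x, y) = \frac{1}{(n+1)!} \left[ (x - x_0) \frac{\partial}{\partial x} + (y - y_0) \frac{\partial}{\partial y} \right]^{n+1} f(x_0 + \theta (x - x_0),\, y_0 + \theta (y - y_0)) \tag{A.10}$$

with an unknown number $\theta \in (0, 1)$.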

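In (A.9) and (A.10), the expression

$$\left[ (x - x_0) \frac{\partial}{\partial x} + (y - y_0) \frac{\partial}{\partial y} \right]^j f(x_0, y_0) = \sum_{i=0}^{j} \frac{j!}{i!\,(j - i)!} (x - x_0)^i (y - y_0)^{j-i} \frac{\partial^j}{\partial x^i \partial y^{j-i}} f(x_0, y_0)$$

is defined formally through the binomial expansion for numbers:

$$(a + b)^j = \sum_{i=0}^{j} \frac{j!}{i!\,(j - i)!} a^i b^{j-i}.$$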


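For example, with $j = 2$, we obtain

$$
\begin{aligned}
\left[ (x - x_0) \frac{\partial}{\partial x} + (y - y_0) \frac{\partial}{\partial y} \right]^2 f(x_0, y_0)
&= (x - x_0)^2 \frac{\partial^2}{\partial x^2} f(x_0, y_0) + 2 (x - x_0)(y - y_0) \frac{\partial^2}{\partial x \partial y} f(x_0, y_0) \\
&\quad + (y - y_0)^2 \frac{\partial^2}{\partial y^2} f(x_0, y_0).
\end{aligned}
$$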
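To see the $j = 1$ and $j = 2$ terms at work, the following Python sketch (an added illustration, not from the text) builds the second-degree Taylor polynomial $p_2(x, y)$ of (A.9) for the assumed test function $f(x, y) = e^x \sin y$ about the assumed point $(x_0, y_0) = (0.2, -0.3)$, with the partial derivatives coded by hand. Since the remainder $R_2$ in (A.10) involves third derivatives, the error at $(x_0 + h, y_0 + h)$ should shrink by a factor of about 8 each time $h$ is halved.

```python
import math

def f(x, y):
    # Illustrative test function (an assumption): f(x, y) = e^x sin(y)
    return math.exp(x) * math.sin(y)

def taylor_p2(x, y, x0, y0):
    """Second-degree Taylor polynomial p_2(x, y) of f about (x0, y0), as in (A.9)."""
    e, s, c = math.exp(x0), math.sin(y0), math.cos(y0)
    fx, fy = e * s, e * c                  # first partial derivatives at (x0, y0)
    fxx, fxy, fyy = e * s, e * c, -e * s   # second partial derivatives at (x0, y0)
    dx, dy = x - x0, y - y0
    first = dx * fx + dy * fy
    second = dx**2 * fxx + 2.0 * dx * dy * fxy + dy**2 * fyy
    return f(x0, y0) + first + 0.5 * second

x0, y0 = 0.2, -0.3
steps = [0.1, 0.05, 0.025]
errors = [abs(f(x0 + h, y0 + h) - taylor_p2(x0 + h, y0 + h, x0, y0)) for h in steps]

for h, err in zip(steps, errors):
    print(f"h = {h:<6} error = {err:.3e}  error/h^3 = {err / h**3:.4f}")

# The remainder is third order in h, so halving h should divide the error by about 8.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print("error ratios:", [round(r, 3) for r in ratios])
```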

Formula (A.8) with (A.9)–(A.10) can be proved by applying Taylor’s theorem for one real variable as follows. Define a function of one real variable

$$F(t) = f(x_0 + t (x - x_0),\, y_0 + t (y - y_0)).$$

Note that $F(0) = f(x_0, y_0)$, $F(1) = f(x, y)$. Applying formula (A.3) with (A.4) and (A.6), we obtain

$$F(1) = F(0) + \sum_{j=1}^{n} \frac{1}{j!} F^{(j)}(0) + \frac{1}{(n+1)!} F^{(n+1)}(\theta)$$

for some unknown number $\theta \in (0, 1)$. Using the chain rule, we can verify that

$$F^{(j)}(0) = \left[ (x - x_0) \frac{\partial}{\partial x} + (y - y_0) \frac{\partial}{\partial y} \right]^j f(x_0, y_0).$$

This argument is also valid when the function has $m$ ($m > 2$) real variables, leading to Taylor’s theorem for functions of $m$ real variables.

APPENDIX B

POLYNOMIAL INTERPOLATION

The problem of polynomial interpolation is the selection of a particular polynomial $p(x)$ from a given class of polynomials in such a way that the graph of $y = p(x)$ passes through a finite set of given data points. Polynomial interpolation theory has many important uses, but in this text we are interested in it primarily as a tool for developing numerical methods for solving ordinary differential equations.

Let $x_0, x_1, \dots, x_n$ be distinct real or complex numbers, and let $y_0, y_1, \dots, y_n$ be associated function values. We now study the problem of finding a polynomial $p(x)$ that interpolates the given data:

$$p(x_i) = y_i, \qquad i = 0, 1, \dots, n. \tag{B.1}$$

Does such a polynomial exist, and if so, what is its degree? Is it unique? What formula can we use to produce $p(x)$ from the given data?

By writing

$$p(x) = a_0 + a_1 x + \cdots + a_m x^m$$

for a general polynomial of degree $m$, we see that there are $m + 1$ independent parameters $a_0, a_1, \dots, a_m$. Since (B.1) imposes $n + 1$ conditions on $p(x)$, it is reasonable to first consider the case when $m = n$. Then we want to find $a_0, a_1, \dots, a_n$


such that

$$
\begin{aligned}
a_0 + a_1 x_0 + a_2 x_0^2 + \cdots + a_n x_0^n &= y_0, \\
&\ \,\vdots \\
a_0 + a_1 x_n + a_2 x_n^2 + \cdots + a_n x_n^n &= y_n.
\end{aligned}
\tag{B.2}
$$

This is a system of $n + 1$ linear equations in $n + 1$ unknowns, and solving it is completely equivalent to solving the polynomial interpolation problem. In vector–matrix notation, the system is

$$Xa = y$$

with

$$
X = \begin{bmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^n \\
1 & x_n & x_n^2 & \cdots & x_n^n
\end{bmatrix}, \tag{B.3}
$$

$$a = [a_0, a_1, \dots, a_n]^T, \qquad y = [y_0, \dots, y_n]^T.$$

The matrix $X$ is called a Vandermonde matrix, and its determinant is given by

$$\det(X) = \prod_{0 \le j < i \le n} (x_i - x_j).$$
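The linear system $Xa = y$ can be solved directly for small examples. The Python sketch below is an added illustration: the helper names (`vandermonde`, `solve`, `vandermonde_det`) are hypothetical, the data points $(0, 1)$, $(-1, 2)$, $(1, 3)$ are an assumed sample, and exact rational arithmetic is used only to keep the output clean. It builds the matrix of (B.3), solves for the coefficients, and evaluates the product formula for the determinant.

```python
from fractions import Fraction

def vandermonde(xs):
    """Rows [1, x_i, x_i^2, ..., x_i^n] as in (B.3)."""
    n = len(xs) - 1
    return [[Fraction(x) ** k for k in range(n + 1)] for x in xs]

def solve(A, b):
    """Gauss-Jordan elimination in exact rational arithmetic (illustrative helper)."""
    n = len(b)
    M = [row[:] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def vandermonde_det(xs):
    """Product of (x_i - x_j) over 0 <= j < i <= n."""
    det = Fraction(1)
    for i in range(len(xs)):
        for j in range(i):
            det *= Fraction(xs[i]) - Fraction(xs[j])
    return det

xs, ys = [0, -1, 1], [1, 2, 3]
coeffs = solve(vandermonde(xs), ys)
print("coefficients a_0, a_1, a_2:", [str(c) for c in coeffs])
print("det(X) by the product formula:", vandermonde_det(xs))
```

Because the nodes are distinct, every factor $x_i - x_j$ is nonzero, so the determinant is nonzero and the system has exactly one solution.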

Theorem B.1 Given $n + 1$ distinct points $x_0, \dots, x_n$ and $n + 1$ ordinates $y_0, \dots, y_n$, there is a polynomial $p(x)$ of degree $\le n$ that interpolates $y_i$ at $x_i$, $i = 0, 1, \dots, n$. This polynomial $p(x)$ is unique in the set of all polynomials of degree $\le n$.

Proof. There are a number of different proofs of this important result. We give a constructive proof that exhibits explicitly the interpolating polynomial $p(x)$ in a form useful for the applications in this text.

To begin, consider the special interpolation problem in which

$$y_i = 1, \qquad y_j = 0 \quad \text{for } j \ne i$$

for some $i$, $0 \le i \le n$. We want a polynomial of degree $\le n$ with the $n$ zeros $x_j$, $j \ne i$. Then

$$p(x) = c (x - x_0) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_n)$$

for some constant $c$. The condition $p(x_i) = 1$ implies

$$c = \left[ (x_i - x_0) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n) \right]^{-1}.$$

This special polynomial is written as

$$l_i(x) = \prod_{j \ne i} \left( \frac{x - x_j}{x_i - x_j} \right), \qquad i = 0, 1, \dots, n. \tag{B.4}$$


To solve the general interpolation problem (B.1), we can write

$$p(x) = y_0 l_0(x) + y_1 l_1(x) + \cdots + y_n l_n(x).$$

With the special properties of the polynomials $l_i(x)$, it is easy to show that $p(x)$ satisfies (B.1). Also, degree $p(x) \le n$ since all $l_i(x)$ have degree $n$.

To prove uniqueness, suppose that $q(x)$ is another polynomial of degree $\le n$ that satisfies (B.1). Define

$$r(x) = p(x) - q(x).$$

Then degree $r(x) \le n$ and

$$r(x_i) = p(x_i) - q(x_i) = y_i - y_i = 0, \qquad i = 0, 1, \dots, n.$$

Since $r(x)$ has $n + 1$ zeros, we must have $r(x) \equiv 0$. This proves $p(x) \equiv q(x)$.

The formula

$$p_n(x) = \sum_{i=0}^{n} y_i l_i(x) \tag{B.5}$$

is called Lagrange’s formula for the interpolating polynomial.

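Example B.2

$$p_1(x) = \frac{x - x_1}{x_0 - x_1}\, y_0 + \frac{x - x_0}{x_1 - x_0}\, y_1 = \frac{(x_1 - x) y_0 + (x - x_0) y_1}{x_1 - x_0},$$

$$p_2(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}\, y_0 + \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}\, y_1 + \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}\, y_2.$$

The polynomial of degree $\le 2$ that passes through the three points $(0, 1)$, $(-1, 2)$, and $(1, 3)$ is

$$
\begin{aligned}
p_2(x) &= \frac{(x + 1)(x - 1)}{(0 + 1)(0 - 1)} \cdot 1 + \frac{(x - 0)(x - 1)}{(-1 - 0)(-1 - 1)} \cdot 2 + \frac{(x - 0)(x + 1)}{(1 - 0)(1 + 1)} \cdot 3 \\
&= 1 + \tfrac{1}{2} x + \tfrac{3}{2} x^2.
\end{aligned}
$$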
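Lagrange’s formula translates almost line for line into code. The Python sketch below is an added illustration with hypothetical function names (`lagrange_basis`, `lagrange_interpolate`); it evaluates the basis polynomials of (B.4) and the sum in (B.5), and compares the result for the three data points of this example against the closed form $1 + \tfrac{1}{2} x + \tfrac{3}{2} x^2$. The extra evaluation points are assumed choices.

```python
def lagrange_basis(xs, i, x):
    """Evaluate l_i(x) from (B.4) for the nodes xs."""
    value = 1.0
    for j, xj in enumerate(xs):
        if j != i:
            value *= (x - xj) / (xs[i] - xj)
    return value

def lagrange_interpolate(xs, ys, x):
    """Evaluate p_n(x) = sum of y_i * l_i(x), as in (B.5)."""
    return sum(y * lagrange_basis(xs, i, x) for i, y in enumerate(ys))

def closed_form(x):
    """The quadratic obtained in Example B.2."""
    return 1.0 + 0.5 * x + 1.5 * x**2

# Data points of Example B.2: (0, 1), (-1, 2), (1, 3).
xs, ys = [0.0, -1.0, 1.0], [1.0, 2.0, 3.0]

for x in (-1.0, 0.0, 0.3, 1.0, 2.0):
    print(f"x = {x:5.2f}  Lagrange = {lagrange_interpolate(xs, ys, x):.6f}  closed form = {closed_form(x):.6f}")
```

The two columns should agree at every point, not only at the nodes, because the interpolating polynomial of degree $\le 2$ is unique.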

If a function $f(x)$ is given, then we can form an approximation to it using the interpolating polynomial

$$p_n(x; f) \equiv p_n(x) = \sum_{i=0}^{n} f(x_i)\, l_i(x). \tag{B.6}$$

This interpolates $f(x)$ at $x_0, \dots, x_n$. This polynomial formula is used at several points in this text.

The basic result used in analyzing the error of interpolation is the following theorem. As a notation, $H\{a, b, c, \dots\}$ denotes the smallest interval containing all of the real numbers $a, b, c, \dots$.


Theorem B.3 Let $x_0, x_1, \dots, x_n$ be distinct real numbers, and let $f$ be a real valued function with $n + 1$ continuous derivatives on the interval $I_t = H\{t, x_0, \dots, x_n\}$ with $t$ some given real number. Then there exists $\xi \in I_t$ with

$$f(t) - \sum_{j=0}^{n} f(x_j)\, l_j(t) = \frac{(t - x_0) \cdots (t - x_n)}{(n+1)!}\, f^{(n+1)}(\xi). \tag{B.7}$$

A proof of this result can be found in many numerical analysis textbooks; e.g., see [11, p. 135]. The theory and practice of polynomial interpolation represent a very large subject. Again, most numerical analysis textbooks contain a basic introduction, and we refer the interested reader to them.
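The error formula (B.7) can be checked on a concrete case. In the Python sketch below (an added illustration; the function $f = \sin$, the nodes $0, 0.5, 1, 1.5$, and the point $t = 0.8$ are all assumed choices), we have $n = 3$ and $f^{(4)} = \sin$. Dividing the observed error by $(t - x_0) \cdots (t - x_n)/(n+1)!$ should therefore return a value equal to $\sin(\xi)$ for some $\xi \in [0, 1.5]$, that is, a number between 0 and $\sin(1.5)$.

```python
import math

def lagrange_interpolate(xs, ys, t):
    """Evaluate the interpolating polynomial sum of y_j * l_j(t)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        li = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                li *= (t - xj) / (xi - xj)
        total += yi * li
    return total

# Illustrative setup (assumptions): f = sin on four equally spaced nodes.
xs = [0.0, 0.5, 1.0, 1.5]
ys = [math.sin(x) for x in xs]
n = len(xs) - 1
t = 0.8

error = math.sin(t) - lagrange_interpolate(xs, ys, t)

# omega = (t - x_0) * ... * (t - x_n), the node polynomial evaluated at t
omega = 1.0
for xj in xs:
    omega *= t - xj

# By (B.7), error = omega / (n+1)! * f^(n+1)(xi), and here f^(4) = sin.
implied = error * math.factorial(n + 1) / omega
bound = abs(omega) / math.factorial(n + 1)   # uses |sin| <= 1

print(f"interpolation error     = {error:.6e}")
print(f"bound |omega| / (n+1)!  = {bound:.6e}")
print(f"implied f^(4)(xi)       = {implied:.6f}")
```

Here the interval $I_t$ is $[0, 1.5]$ because $t$ lies between the extreme nodes; for $t$ outside the nodes, $I_t$ widens to include $t$.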

REFERENCES

1. R. Aiken (editor). Stiff Computation, Oxford University Press, Oxford, 1985.

2. R. Alexander. “Diagonally implicit Runge-Kutta methods for stiff ODE’s”, SIAM Journal on Numerical Analysis 14 (1977), pp. 1006–1021.

3. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorenson. LAPACK Users’ Guide, SIAM Pub., Philadelphia, 1992.

4. V. Arnold. Mathematical Methods of Classical Mechanics, Springer–Verlag, New York, 1974.

5. U. Ascher, H. Chin, and S. Reich. “Stabilization of DAEs and invariant manifolds”, Numerische Mathematik 67 (1994), pp. 131–149.

6. U. Ascher, J. Christiansen, and R. Russell. “Collocation software for boundary-value ODEs”, ACM Trans. Math. Soft. 7 (1981), pp. 209–222.

7. U. Ascher, J. Christiansen, and R. Russell. “COLSYS: Collocation software for boundary-value ODEs”, ACM Trans. Math. Soft. 7 (1981), pp. 223–229.

8. U. Ascher and R. Russell, eds. Numerical Boundary Value ODEs, Birkhäuser, Boston, MA, 1985.

9. U. Ascher, R. Mattheij, and R. Russell. Numerical Solution of Boundary Value Problems for Ordinary Differential Equations, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

10. U. Ascher and L. Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations, SIAM, Philadelphia, 1998.


11. K. Atkinson. An Introduction to Numerical Analysis, 2nd ed., John Wiley, New York, 1989.

12. K. Atkinson and W. Han. Elementary Numerical Analysis, 3rd ed., John Wiley, New York, 2004.

13. A. Aziz. Numerical Solutions of Boundary Value Problems for Ordinary Differential Equations, Academic Press, New York, 1975.

14. C. Baker. The Numerical Treatment of Integral Equations, Clarendon Press, Oxford, 1977.

15. J. Baumgarte. “Stabilization of constraints and integrals of motion in dynamical systems”, Computer Methods in Applied Mechanics and Engineering 1 (1972), pp. 1–16.

16. W. Boyce and R. DiPrima. Elementary Differential Equations, 7th edition, John Wiley & Sons, 2003.

17. H. Brunner. Collocation Methods for Volterra Integral and Related Functional Equations, Cambridge Univ. Press, 2004.

18. P. Bogacki and L. Shampine. “A 3(2) pair of Runge-Kutta formulas”, Appl. Math. Lett. 2 (1989), pp. 321–325.

19. K.E. Brenan, S.L. Campbell and L.R. Petzold. Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations, Number 14 in Classics in Applied Mathematics. SIAM Publ., Philadelphia, PA, 1996. Originally published by North Holland, 1989.

20. K.E. Brenan and B.E. Engquist. “Backward differentiation approximations of nonlinear differential/algebraic systems”, Mathematics of Computation 51 (1988), pp. 659–676.

21. P.N. Brown, A.C. Hindmarsh, and L.R. Petzold. “Using Krylov methods in the solution of large-scale differential-algebraic systems”, SIAM J. Scientific Computing 15 (1994), pp. 1467–1488.

22. K. Burrage and J.C. Butcher. “Stability criteria for implicit Runge-Kutta methods”, SIAM J. Numer. Anal. 16 (1979), pp. 46–57.

23. J.C. Butcher. “Implicit Runge-Kutta processes”, Math. Comp. 18 (1964), pp. 50–64.

24. J.C. Butcher. “A stability property of implicit Runge–Kutta methods”, BIT 15 (1975), pp. 358–361.

25. J.C. Butcher. “General linear methods”, Acta Numerica 15 (2006), Cambridge University Press.

26. S.L. Campbell, R. Hollenbeck, K. Yeomans and Y. Zhong. “Mixed symbolic-numerical computations with general DAEs. I. System properties”, Numerical Algorithms 19 (1998), pp. 73–83.

27. J. Cash. “On the numerical integration of nonlinear two-point boundary value problems using iterated deferred corrections. II. The development and analysis of highly stable deferred correction formulae”, SIAM J. Numer. Anal. 25 (1988), pp. 862–882.

28. B. Childs, E. Denman, M. Scott, P. Nelson, and J. Daniel, eds. Codes for Boundary-Value Problems in Ordinary Differential Equations, Lec. Notes in Comp. Sci. 76, Springer-Verlag, New York, 1979.

29. G.F. Corliss, A. Griewank, P. Henneberger, G. Kirlinger, F.A. Potra, and H.J. Stetter. “High-order stiff ODE solvers via automatic differentiation and rational prediction”, in Numerical Analysis and its Applications (Rousse, 1996), Lecture Notes in Computer Science 1196, pp. 114–125. Springer–Verlag, Berlin, 1997.


30. M. Crouzeix. “Sur la B-stabilité des méthodes de Runge-Kutta”, Numer. Math. 32 (1979), pp. 75–82.

31. M. Crouzeix and P.A. Raviart. “Approximation des problèmes d’évolution”, unpublished lecture notes, Université de Rennes, 1980.

32. P. Deuflhard. “Nonlinear equation solvers in boundary value problem codes”, in Codes for Boundary-Value Problems in Ordinary Differential Equations, B. Childs, M. Scott, J. Daniel, E. Denman, and P. Nelson, eds., Lec. Notes in Comp. Sci. 76, Springer-Verlag, New York, 1979, pp. 40–66.

33. P. Deuflhard and F. Bornemann. Scientific Computing with Ordinary Differential Equations, Springer-Verlag, 2002.

34. J. Dormand and P. Prince. “A family of embedded Runge-Kutta formulae”, J. Comp. Appl. Math. 6 (1980), pp. 19–26.

35. W. Enright. “Improving the performance of numerical methods for two-point boundary value problems”, in Numerical Boundary Value ODEs, U. Ascher and R. Russell, eds., Birkhäuser, Boston, MA, 1985, pp. 107–119.

36. K. Eriksson, D. Estep, P. Hansbo, and C. Johnson. “Introduction to adaptive methods for differential equations”, Acta Numerica 5 (1995), Cambridge University Press.

37. A. Fasano and S. Marmi. Analytical Mechanics: An Introduction. Oxford University Press, Oxford, 2006.

38. G.R. Fowles. Analytical Mechanics, Holt, Rinehart and Winston, 1962.

39. C.W. Gear. Numerical Initial Value Problems in Ordinary Differential Equations, Prentice-Hall, Englewood Cliffs, NJ, 1971.

40. C.W. Gear, B. Leimkuhler, and G.K. Gupta. “Automatic integration of Euler–Lagrange equations with constraints”, in Proceedings of the International Conference on Computational and Applied Mathematics (Leuven, 1984), Vol. 12/13 (1985), pp. 77–90.

41. I. Gladwell and D. Sayers. Computational Techniques for Ordinary Differential Equations, Academic Press, New York, 1980.

42. E. Hairer, C. Lubich, and M. Roche. The Numerical Solution of Differential-Algebraic Systems by Runge–Kutta Methods. Lecture Notes in Mathematics 1409 (1989), Springer–Verlag, Berlin.

43. E. Hairer, C. Lubich, and G. Wanner. “Geometric numerical integration illustrated by the Störmer-Verlet method”, Acta Numerica 12 (2003), Cambridge University Press.

44. E. Hairer and G. Wanner. Solving Ordinary Differential Equations. II. Stiff and Differential-Algebraic Problems, 2nd ed., Springer-Verlag, Berlin, 1996.

45. P. Henrici. Discrete Variable Methods in Ordinary Differential Equations, John Wiley, 1962.

46. A. Hindmarsh, P. Brown, K. Grant, S. Lee, R. Serban, D. Shumaker, and C. Woodward. SUNDIALS: Suite of Nonlinear and Differential/Algebraic Equation Solvers, ACM Transactions on Mathematical Software 31 (2005), pp. 363–396. Also, go to the URL https://computation.llnl.gov/casc/sundials/

47. E. Isaacson and H. Keller. Analysis of Numerical Methods, John Wiley, New York, 1966.

48. A. Iserles. A First Course in the Numerical Analysis of Differential Equations, Cambridge University Press, Cambridge, United Kingdom, 1996.


49. L. Jay. “Convergence of Runge-Kutta methods for differential-algebraic systems of index 3”, Applied Numerical Mathematics 17 (1995), pp. 97–118.

50. L. Jay. “Symplectic partitioned Runge-Kutta methods for constrained Hamiltonian systems”, SIAM Journal on Numerical Analysis 33 (1996), pp. 368–387.

51. L. Jay. “Specialized Runge-Kutta methods for index 2 differential-algebraic equations”, Mathematics of Computation 75 (2006), pp. 641–654.

52. H. Keller. Numerical Solution of Two-Point Boundary Value Problems, Regional Conf. Series in Appl. Maths. 24, SIAM Pub., Philadelphia, PA, 1976.

53. H. Keller. Numerical Methods for Two-Point Boundary Value Problems, Dover, New York, 1992 (corrected reprint of the 1968 edition, Blaisdell, Waltham, MA).

54. H. Keller and S. Antman, eds. Bifurcation Theory and Nonlinear Eigenvalue Problems, Benjamin, New York, 1969.

55. C.T. Kelley. Solving Nonlinear Equations with Newton’s Method, SIAM Pub., Philadelphia, 2003.

56. W. Kelley and A. Peterson. Difference Equations, 2nd ed., Academic Press, Burlington, Massachusetts, 2001.

57. R. Kress. Numerical Analysis, Springer-Verlag, New York, 1998.

58. J. Lambert. Computational Methods in Ordinary Differential Equations, John Wiley, New York, 1973.

59. P. Linz. Analytical and Numerical Methods for Volterra Equations, SIAM Pub., 1985.

60. P. Lötstedt and L. Petzold. “Numerical solution of nonlinear differential equations with algebraic constraints. I. Convergence results for backward differentiation formulas”, Mathematics of Computation 46 (1986), pp. 491–516.

61. J. Marsden and T. Ratiu. Introduction to Mechanics and Symmetry, Springer-Verlag, New York, 1999.

62. R. März. “Numerical methods for differential algebraic equations”, Acta Numerica 1992, Cambridge University Press, 1992.

63. D. Melgaard and R. Sincovec. “Algorithm 565: PDETWO/PSETM/GEARB: Solution of systems of two-dimensional nonlinear partial differential equations”, ACM Trans. Math. Software 7 (1981), pp. 126–135.

64. R. Miller. Nonlinear Volterra Integral Equations, Benjamin Pub., 1971.

65. L.R. Petzold. “A description of DASSL: A differential-algebraic system solver”, in R. S. Stepleman, editor, Scientific Computing, pp. 65–68. North-Holland, Amsterdam, 1983.

66. L. Petzold, L. Jay, and J. Yen. “Numerical solution of highly oscillatory ordinary differential equations”, Acta Numerica 6 (1997), Cambridge University Press.

67. E. Platen. “An introduction to numerical methods for stochastic differential equations”, Acta Numerica 8 (1999), Cambridge University Press.

68. A. Quarteroni, R. Sacco, and F. Saleri. Numerical Mathematics, Springer-Verlag, New York, 2000.

69. L.B. Rall and G.F. Corliss. “An introduction to automatic differentiation”, in Computational Differentiation (Santa Fe, NM, 1996), pp. 1–18. SIAM, Philadelphia, PA, 1996.


70. J. Sanz-Serna. “Symplectic integrators for Hamiltonian problems: an overview”, Acta Numerica 1992, Cambridge University Press, 1992.

71. W. Schiesser. The Numerical Method of Lines, Academic Press, San Diego, 1991.

72. L. Shampine. Numerical Solution of Ordinary Differential Equations, Chapman & Hall, New York, 1994.

73. L. Shampine and M. Reichelt. “The MATLAB ODE Suite”, SIAM Journal on Scientific Computing 18 (1997), pp. 1–22.

74. L. Shampine, I. Gladwell, and S. Thompson. Solving ODEs with MATLAB, Cambridge University Press, 2003.

75. R. Sincovec and N. Madsen. “Software for nonlinear partial differential equations”, ACM Trans. Math. Software 1 (1975), pp. 232–260.

76. A. Stuart. “Numerical analysis of dynamical systems”, Acta Numerica 1994, Cambridge University Press, 1994.

77. T. Van Hecke and M. Van Daele. “High-order convergent deferred correction schemes based on parameterized Runge-Kutta-Nyström methods for second-order boundary value problems. Advanced numerical methods for mathematical modelling”, J. Comput. Appl. Math. 132 (2001), pp. 107–125.

78. D. Widder. The Heat Equation, Academic Press, New York, 1975.

