Numerical Methods Fall 2010
Lecturer: conf. dr. Viorel Bostan
Office: 6-417
Telephone: 50-99-38
E-mail address: viorel bostan@mail.utm.md
Course web page: moodle.fcim.utm.md
Office hours: TBA. I will also be available at other
times. Just drop by my office, talk to me after the
class or send me an e-mail to make an appointment.
Prerequisites: A basic course on mathematical anal-
ysis (single and multivariable calculus), ordinary dif-
ferential equations and some knowledge of computer
programming.
Course outline: This is a fast-paced course that gives an in-depth introduction to the basic areas of numerical analysis. The main objective will be to have a clear understanding of the ideas and techniques underlying the numerical methods, results, and algorithms that will be presented, where error analysis plays an important role. You will then be able to use this knowledge to analyze the numerical methods and algorithms that you will encounter, and also to program them effectively on a computer. This knowledge will be useful in your future not only to solve problems with a numerical component, but also to develop numerical algorithms of your own.
Topics to be covered:
1. Computer representation of numbers. Errors: types, sources, propagation.
2. Solution of nonlinear equations. Root finding.
3. Interpolation by polynomials and spline functions.
4. Approximation of functions.
5. Numerical integration. Automatic differentiation.
6. Matrix computations and systems of linear equations.
7. Numerical methods for ODEs.
This course plan may be modified during the semester. Such modifications will be announced in advance during class periods. The student is responsible for keeping abreast of such changes.
Class procedure: The majority of each class period
will be lecture oriented. Some material will be handed
out during lectures, some material will be sent by e-mail.
I strongly advise you to attend lectures, do your home-
work, work consistently, and ask questions. Lecture
time is at a premium; you cannot be taught everything
in class. It is your responsibility to learn the material;
the instructor's job is to guide you in your learning.
During the semester, 10 homeworks and 4 program-
ming projects will be assigned. As a general rule, you
will find it necessary to spend approximately 2-3 hours
of study for each lecture/lab meeting, and additional
time will be needed for exam preparation. It is strongly
advised that you start working on this course from the
very beginning. The importance of doing the assigned
homeworks and projects cannot be overemphasized.
Programming projects: The predominant programming languages used in numerical analysis are Fortran and MATLAB. We will focus on MATLAB. Programs in other languages are also sometimes acceptable, but no programming assistance will be given in the use of such languages (i.e. C, C++, Java, Pascal). For students unacquainted with MATLAB, the following e-readings are suggested:
1. Ian Cavers, An Introductory Guide to MATLAB, 2nd Edition, Dept. of Computer Science, University of British Columbia, December 1998,
www.cs.ubc.ca/spider/cavers/MatlabGuide/guide.html
2. Paul Fackler, A MATLAB Primer, North Carolina State University,
www4.ncsu.edu/unity/users/p/pfackler/www/MPRIMER.htm
3. MATLAB Tutorials, Dept. of Mathematics, Southern Illinois University at Carbondale,
www.math.siu.edu/matlab/tutorials.html
4. Christian Roessler, MATLAB Basics, University of Melbourne, June 2004,
www.econphd.net/downloads/matlab.pdf
5. Kermit Sigmon, MATLAB Primer, 3rd edition, Dept. of Mathematics, University of Florida,
www.wiwi.uni-frankfurt.de/professoren/krueger/teaching/ws0506/macromh/matlabprimer.pdf
In your project report you should include:
1. The routines you have developed;
2. The results for your test cases in the form of tables, graphs, etc.;
3. Answers to all questions contained in the assignment;
4. Comments.
You should report your results in a way that is easy to read, communicates the problem and the results effectively, and can be reproduced by someone else who has not seen the problem before, but is technically knowledgeable. You should also give any justification or other reasons to believe the correctness of your results and code. Also, give conclusions on how effective your methods and routines appear to be, and report and comment on any "unusual behavior" in your results. Teamwork is allowed, but you should specify this in your report, as well as the tasks executed by each member of your team.
Grading policy: The final grade will be based on tests and homework/projects, as follows:
1. There will be one 3-hour written exam given after 8 weeks of classes at a time arranged later (presumably at the end of October). This midterm exam will count 25% of the course grade.
2. The final comprehensive exam will be given during the scheduled examination time at the end of the semester; it will cover all material, and it will count 35% of your final grade.
3. Homework and lab projects will count 20% of the grade each. Late homework and projects are not accepted!
4. You will need a scientific calculator during exams. Sharing of calculators will not be allowed. Make sure you have one.
The exams will be open notes, i.e. you will be allowed to use your class notes and class slides (no other material will be allowed).
Grading for homework and lab projects
The homework will be graded on a scale from 0 to 4, with the possibility of getting an extra bonus point on each homework. Grades will be given according to the following guidelines:

0 – no homework turned in;
1 – poor job;
2 – incomplete job;
3 – good job;
4 – very good job;
+1 for optional problems and/or an excellent/outstanding solution to one of the problems.
It is very important that you take the examinations at the scheduled times. Alternate exams will be scheduled only for those who have sufficiently compelling and convincing reasons.
Academic misconduct: Academic misconduct of any
kind will not be tolerated. If a situation arises
where you and your instructor disagree on some mat-
ter and cannot resolve the issue, you should see the
Dean. However, any problems concerning the course
should first be discussed with your instructor.
Readings:
1. Kendall Atkinson, An Introduction to Numerical Analysis, 2nd edition
2. Cleve Moler, Numerical Computing with MATLAB,
http://www.mathworks.com/moler/
3. Björck A., Dahlquist G., Numerical Mathematics and Scientific Computation
4. Steven E. Pav, Numerical Methods Course Notes, University of California at San Diego, 2005
5. Mathews J.H., Fink D.K., Numerical Methods Using MATLAB, 1999
6. Kincaid D., Cheney W., Numerical Analysis, 1991
7. Goldberg D., What Every Computer Scientist Should Know About Floating-Point Arithmetic, 1991
8. Hoffman J.D., Numerical Methods for Engineers and Scientists, 2001
9. Johnston R.L., Numerical Methods: A Software Approach, 1982
10. Carothers N.L., A Short Course on Approximation Theory, Course notes, Bowling Green State University
11. George W. Collins, Fundamental Numerical Methods and Data Analysis
12. Shampine L.F., Allen R.C., Pruess S., Fundamentals of Numerical Computing, 1997
Also, you should check the university library for available books.
Useful web-sites with on-line literature:
www.math.gatech.edu/~cain/textbooks/onlinebooks.html
www.econphd.net/notes.htm
Definition of Numerical Analysis by Kendall Atkinson, Professor, University of Iowa
Numerical analysis is the area of mathematics and computer science that creates, analyzes, and implements algorithms for solving numerically the problems of continuous mathematics.
Such problems originate generally from real-world applications of algebra, geometry and calculus, and they involve variables which vary continuously; these problems occur throughout the natural sciences, social sciences, engineering, medicine, and business.
During the past half-century, the growth in power and availability of digital computers has led to an increasing use of realistic mathematical models in science and engineering, and numerical analysis of increasing sophistication has been needed to solve these more detailed mathematical models of the world.
With the growth in importance of using computers to carry out numerical procedures in solving mathematical models of the world, an area known as scientific computing or computational science has taken shape during the 1980s and 1990s. This area looks at the use of numerical analysis from a computer science perspective. It is concerned with using the most powerful tools of numerical analysis, computer graphics, symbolic mathematical computations, and graphical user interfaces to make it easier for a user to set up, solve, and interpret complicated mathematical models of the real world.
Definition of Numerical Analysis by Lloyd N. Trefethen, Professor, Cornell University
Here is the wrong answer: Numerical analysis is the study of rounding errors.
Some other wrong or incomplete answers:
Webster's New Collegiate Dictionary: The study of quantitative approximations to the solutions of mathematical problems including consideration of the errors and bounds to the errors involved.
Chambers 20th Century Dictionary: The study of methods of approximation and their accuracy, etc.
The American Heritage Dictionary: The study of approximate solutions to mathematical problems taking into account the extent of possible errors.
The correct answer is: Numerical analysis is the study of algorithms for the problems of continuous mathematics.
NUMERICAL ANALYSIS: This refers to the analysis
of mathematical problems by numerical means, es-
pecially mathematical problems arising from models
based on calculus.
Effective numerical analysis requires several things:
• An understanding of the computational tool beingused, be it a calculator or a computer.
• An understanding of the problem to be solved.
• Construction of an algorithm which will solve the
given mathematical problem to a given desired
accuracy and within the limits of the resources
(time, memory, etc) that are available.
This is a complex undertaking. Numerous people
make this their life’s work, usually working on only
a limited variety of mathematical problems.
Within this course, we attempt to show the spirit of
the subject. Most of our time will be taken up with
looking at algorithms for solving basic problems such
as rootfinding and numerical integration; but we will
also look at the structure of computers and the impli-
cations of using them in numerical calculations.
We begin by looking at the relationship of numerical
analysis to the larger world of science and engineering.
SCIENCE
Traditionally, engineering and science had a two-sided
approach to understanding a subject: the theoretical
and the experimental. More recently, a third approach
has become equally important: the computational.
Traditionally we would build an understanding by build-
ing theoretical mathematical models, and we would
solve these for special cases. For example, we would
study the flow of an incompressible irrotational fluid
past a sphere, obtaining some idea of the nature of
fluid flow. But more practical situations could seldom
be handled by direct means, because the needed equa-
tions were too difficult to solve. Thus we also used
the experimental approach to obtain better informa-
tion about the flow of practical fluids. The theory
would suggest ideas to be tried in the laboratory, and
the experimental results would often suggest direc-
tions for a further development of theory.
[Diagram: Theoretical Science, Experimental Science, and Computational Science shown as three interlinked approaches]
With the rapid advance in powerful computers, we
now can augment the study of fluid flow by directly
solving the theoretical models of fluid flow as applied
to more practical situations; and this area is often re-
ferred to as “computational fluid dynamics”. At the
heart of computational science is numerical analysis;
and to effectively carry out a computational science
approach to studying a physical problem, we must un-
derstand the numerical analysis being used, especially
if improvements are to be made to the computational
techniques being used.
MATHEMATICAL MODELS
A mathematical model is a mathematical description
of a physical situation. By means of studying the
model, we hope to understand more about the physi-
cal situation. Such a model might be very simple. For
example,
A = 4π R_e^2,  R_e ≈ 6371 km
is a formula for the surface area of the earth. How
accurate is it? First, it assumes the earth is a sphere,
which is only an approximation. At the equator, the
radius is approximately 6,378 km; and at the poles,
the radius is approximately 6,357 km. Next, there is
experimental error in determining the radius; and in
addition, the earth is not perfectly smooth. Therefore,
there are limits on the accuracy of this model for the
surface area of the earth.
AN INFECTIOUS DISEASE MODEL
For rubella measles, we have the following model for
the spread of the infection in a population (subject to
certain assumptions).
ds/dt = −a s i
di/dt = a s i − b i
dr/dt = b i
In this, s, i, and r refer, respectively, to the propor-
tions of a total population that are susceptible, infec-
tious, and removed (from the susceptible and infec-
tious pool of people). All variables are functions of
time t. The constants can be taken as
a = 6.8/11,  b = 1/11

The same model works for some other diseases (e.g. flu), with a suitable change of the constants a and b. Again, this is an approximation of reality (and a useful one).
But it has its limits. Solving a bad model will not give
good results, no matter how accurately it is solved;
and the person solving this model and using the results
must know enough about the formation of the model
to be able to correctly interpret the numerical results.
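Since the course emphasizes MATLAB, here is a minimal sketch (our own illustration, not part of the original notes) of integrating this system with the built-in solver ode45; the initial proportions are assumed values:

    % Sketch: integrating the rubella model with ode45.
    a = 6.8/11;  b = 1/11;
    f = @(t, y) [-a*y(1)*y(2);            % ds/dt = -a*s*i
                  a*y(1)*y(2) - b*y(2);   % di/dt =  a*s*i - b*i
                  b*y(2)];                % dr/dt =  b*i
    y0 = [0.95; 0.05; 0];                 % assumed initial s, i, r
    [t, y] = ode45(f, [0, 100], y0);
    plot(t, y); legend('s', 'i', 'r'); xlabel('t')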
THE LOGISTIC EQUATION
This is the simplest model for population growth. Let
N(t) denote the number of individuals in a population
(rabbits, people, bacteria, etc). Then we model its
growth by
N'(t) = c N(t),  t ≥ 0,  N(t_0) = N_0
The constant c is the growth constant, and it usually
must be determined empirically. Over short periods of
time, this is often an accurate model for population
growth. For example, it accurately models the growth
of US population over the period of 1790 to 1860, with
c = 0.2975.
THE PREDATOR-PREY MODEL
Let F (t) denote the number of foxes at time t; and
let R(t) denote the number of rabbits at time t. A
simple model for these populations is called the Lotka-
Volterra predator-prey model:

dR/dt = a [1 − b F(t)] R(t)
dF/dt = c [−1 + d R(t)] F(t)
with a, b, c, d positive constants. If one looks carefully
at this, then one can see how it is built from the logis-
tic equation. In some cases, this is a very useful model
and agrees with physical experiments. Of course, we
can substitute other interpretations, replacing foxes
and rabbits with other predator and prey. The model
will fail, however, when there are other populations
that affect the first two populations in a significant
way.
NEWTON’S SECOND LAW
Newton’s second law states that the force acting on
an object is directly proportional to the product of its
mass and acceleration,
F ∝ ma
With a suitable choice of physical units, we usually
write this in its scalar form as
F = ma
Newton’s law of gravitation for a two-body situation,
say the earth and an object moving about the earth is
then
m d^2r(t)/dt^2 = −(G m m_e / |r(t)|^2) · (r(t)/|r(t)|)

with r(t) the vector from the center of the earth to
the center of the object moving about the earth. The
constant G is the gravitational constant, not depen-
dent on the earth; and m and m_e are the masses,
respectively, of the object and the earth.
This is an accurate model for many purposes. But
what are some physical situations under which it will
fail?
When the object is very close to the surface of the
earth and does not move far from one spot, we take
|r(t)| to be the radius of the earth. We obtain the new model

m d^2r(t)/dt^2 = −m g k
with k the unit vector directly upward from the earth’s
surface at the location of the object. The gravitational constant

g ≈ 9.8 meters/second^2
Again this is a model; it is not physical reality.
The Patriot Missile Failure
On February 25, 1991, during the Gulf War, an Amer-
ican Patriot Missile battery in Dharan, Saudi Arabia,
failed to intercept an incoming Iraqi Scud missile. The
Scud struck an American Army barracks and killed 28
soldiers.
A report of the General Accounting Office, GAO/IMTEC-92-26, entitled Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia, reported on the cause of the failure.
It turns out that the cause was an inaccurate calcula-
tion of the time since boot due to computer arithmetic
errors.
Specifically, the time in tenths of a second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24 bit fixed point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds.
The number 1/10 equals

1/10 = 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + ...
     = (0.0001100110011001100110011001100...)_2
Now the 24 bit register in the Patriot stored instead

(0.00011001100110011001100)_2

introducing an error of

(0.0000000000000000000000011001100...)_2

which converted to decimal is about

(0.000000095)_10
Multiplying by the number of tenths of a second in 100 hours gives:

0.000000095 × 100 × 60 × 60 × 10 = 0.34
A Scud travels at about 1676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.
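The 0.34 second figure is easy to reproduce. A quick MATLAB sketch (ours, not from the GAO report), chopping 1/10 to the 23 bits displayed above:

    tenth_chopped = floor(0.1 * 2^23) / 2^23;  % keep the 23 bits shown above
    err_per_tick  = 0.1 - tenth_chopped;       % about 9.5e-8
    ticks = 100 * 3600 * 10;                   % tenths of a second in 100 hours
    time_error = err_per_tick * ticks          % about 0.34 seconds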
The following paragraph is excerpted from the GAO report.
The range gate's prediction of where the Scud will next appear is a function of the Scud's known velocity and the time of the last radar detection. Velocity is a real number that can be expressed as a whole number and a decimal (e.g., 3750.2563... miles per hour). Time is kept continuously by the system's internal clock in tenths of seconds but is expressed as an integer or whole number (e.g., 32, 33, 34...). The longer the system has been running, the larger the number representing time. To predict where the Scud will next appear, both time and velocity must be expressed as real numbers. Because of the way the Patriot computer performs its calculations and the fact that its registers are only 24 bits long, the conversion of time from an integer to a real number cannot be any more precise than 24 bits. This conversion results in a loss of precision causing a less accurate time calculation. The effect of this inaccuracy on the range gate's calculation is directly proportional to the target's velocity and the length of time the system has been running. Consequently, performing the conversion after the Patriot has been running continuously for extended periods causes the range gate to shift away from the center of the target, making it less likely that the target, in this case a Scud, will be successfully intercepted.
CALCULATION OF FUNCTIONS
Using hand calculations, a hand calculator, or a computer, what are the basic operations of which we are capable? In essence, they are addition, subtraction, multiplication, and division (and even this will usually require a truncation of the quotient at some point). In addition, we can make logical decisions, such as deciding which of the following are true for two real numbers a and b:
a > b, a = b, a < b
Furthermore, we can carry out only a finite number of such operations. If we limit ourselves to just addition, subtraction, and multiplication, then in evaluating functions f(x) we are limited to the evaluation of polynomials:

p(x) = a_0 + a_1 x + ... + a_n x^n

In this, n is the degree (provided a_n ≠ 0) and {a_0, ..., a_n} are the coefficients of the polynomial. Later we will discuss the efficient evaluation of polynomials; but for now, we ask how we are to evaluate other functions such as e^x, cos x, log x, and others.
TAYLOR POLYNOMIAL APPROXIMATIONS
We begin with an example, that of f(x) = e^x from
the text. Consider evaluating it for x near to 0. We
look for a polynomial p(x) whose values will be the
same as those of ex to within acceptable accuracy.
Begin with a linear polynomial p(x) = a_0 + a_1 x. Then
to make its graph look like that of e^x, we ask that the
graph of y = p(x) be tangent to that of y = e^x at
x = 0. Doing so leads to the formula
p(x) = 1 + x
Continue in this manner looking next for a quadratic
polynomial
p(x) = a_0 + a_1 x + a_2 x^2
We again make it tangent; and to determine a2, we
also ask that p(x) and e^x have the same "curvature"
at the origin. Combining these requirements, we have
for f(x) = e^x that
p(0) = f(0),  p'(0) = f'(0),  p''(0) = f''(0)
This yields the approximation
p(x) = 1 + x + x^2/2
We continue this pattern, looking for a polynomial
p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n
We now require that
p(0) = f(0),  p'(0) = f'(0),  ...,  p^(n)(0) = f^(n)(0)
This leads to the formula
p(x) = 1 + x + x^2/2! + ... + x^n/n!
What are the problems when evaluating at points x that
are far from 0?
TAYLOR’S APPROXIMATION FORMULA
Let f(x) be a given function, and assume it has deriv-
atives around some point x = a (with as many deriv-
atives as we find necessary). We seek a polynomial
p(x) of degree at most n, for some non-negative inte-
ger n, which will approximate f(x) by satisfying the
following conditions:
p(a) = f(a)
p'(a) = f'(a)
p''(a) = f''(a)
...
p^(n)(a) = f^(n)(a)
The general formula for this polynomial is
p_n(x) = f(a) + (x − a) f'(a) + (x − a)^2 f''(a)/2! + ... + (x − a)^n f^(n)(a)/n!
Then f(x) ≈ pn(x) for x close to a.
TAYLOR POLYNOMIALS FOR f(x) = log x
In this case, we expand about the point x = 1, making
the polynomial tangent to the graph of f(x) = log x
at the point x = 1. For a general degree n ≥ 1, this
results in the polynomial
p_n(x) = (x − 1) − (x − 1)^2/2 + (x − 1)^3/3 − ... + (−1)^(n−1) (x − 1)^n/n
Note the graphs of these polynomials for varying n.
THE TAYLOR POLYNOMIAL ERROR FORMULA
Let f(x) be a given function, and assume it has deriv-
atives around some point x = a (with as many deriva-
tives as we find necessary). For the error in the Taylor
polynomial pn(x), we have the formulas
f(x) − p_n(x) = (x − a)^(n+1) f^(n+1)(c_x)/(n+1)!
             = (1/n!) ∫_a^x (x − t)^n f^(n+1)(t) dt
The point c_x is restricted to the interval bounded by x
and a, and otherwise c_x is unknown. We will use the
first form of this error formula, although the second
is more precise in that you do not need to deal with
the unknown point c_x.
Consider the special case of n = 0. Then the Taylor
polynomial is the constant function:
f(x) ≈ p_0(x) = f(a)
The first form of the error formula becomes
f(x) − p_0(x) = f(x) − f(a) = (x − a) f'(c_x)
with c_x between a and x. You have seen this in
your beginning calculus course, and it is called the
mean-value theorem. The error formula
f(x) − p_n(x) = (x − a)^(n+1) f^(n+1)(c_x)/(n+1)!
can be considered a generalization of the mean-value
theorem.
EXAMPLE: f(x) = e^x
For general n ≥ 0, and expanding e^x about x = 0, we have that the degree n Taylor polynomial approximation is given by

p_n(x) = 1 + x + x^2/2! + x^3/3! + ... + x^n/n!
For the derivatives of f(x) = e^x, we have
f^(k)(x) = e^x,  f^(k)(0) = 1,  k = 0, 1, 2, ...
For the error,
e^x − p_n(x) = x^(n+1) e^(c_x)/(n+1)!

with c_x located between 0 and x. Note that for x ≈ 0, we must have c_x ≈ 0 and

e^x − p_n(x) ≈ x^(n+1)/(n+1)!
This last term is also the final term in p_(n+1)(x), and thus

e^x − p_n(x) ≈ p_(n+1)(x) − p_n(x)
Consider calculating an approximation to e. Then let
x = 1 in the earlier formulas to get

p_n(1) = 1 + 1 + 1/2! + 1/3! + ... + 1/n!

For the error,

e − p_n(1) = e^(c_x)/(n+1)!,  0 ≤ c_x ≤ 1

To bound the error, we have e^0 ≤ e^(c_x) ≤ e^1, so that

1/(n+1)! ≤ e − p_n(1) ≤ e/(n+1)!
To have an approximation accurate to within 10^(−5), we choose n large enough to have

e/(n+1)! ≤ 10^(−5)

which is true if n ≥ 8. In fact,

e − p_8(1) ≤ e/9! ≈ 7.5 × 10^(−6)

Then calculate p_8(1) ≈ 2.71827877, and e − p_8(1) ≈ 3.06 × 10^(−6).
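This computation is easily checked in MATLAB (a sketch of ours):

    % Sketch: p_8(1) and its error, checking the bound e/9!
    n = 8;
    p = sum(1 ./ factorial(0:n));   % p_n(1) = 1 + 1 + 1/2! + ... + 1/n!
    err = exp(1) - p                % about 3.06e-6, below e/9! = 7.5e-6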
FORMULAS OF STANDARD FUNCTIONS
1/(1 − x) = 1 + x + x^2 + ... + x^n + x^(n+1)/(1 − x)

cos x = 1 − x^2/2! + x^4/4! − ... + (−1)^m x^(2m)/(2m)! + (−1)^(m+1) x^(2m+2)/(2m+2)! · cos c_x

sin x = x − x^3/3! + x^5/5! − ... + (−1)^(m−1) x^(2m−1)/(2m−1)! + (−1)^m x^(2m+1)/(2m+1)! · cos c_x
with c_x between 0 and x.
OBTAINING TAYLOR FORMULAS
Most Taylor formulas have been obtained by means other than direct use of the formula

p_n(x) = f(a) + (x − a) f'(a) + (x − a)^2 f''(a)/2! + ... + (x − a)^n f^(n)(a)/n!
because of the difficulty of obtaining the derivatives
f^(k)(x) for larger values of k. Actually, this is now
much easier, as we can use Maple or Mathematica.
Nonetheless, most formulas have been obtained by
manipulating standard formulas; and examples of this
are given in the text.
For example, use

e^t = 1 + t + t^2/2! + t^3/3! + ... + t^n/n! + t^(n+1) e^(c_t)/(n+1)!

in which c_t is between 0 and t. Let t = −x^2 to obtain

e^(−x^2) = 1 − x^2 + x^4/2! − x^6/3! + ... + (−1)^n x^(2n)/n! + (−1)^(n+1) x^(2n+2) e^(−ξ_x)/(n+1)!
Because c_t must be between 0 and −x^2, it must be negative. Thus we let c_t = −ξ_x in the error term, with 0 ≤ ξ_x ≤ x^2.
EVALUATING A POLYNOMIAL
Consider having a polynomial
p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n
which you need to evaluate for many values of x. How
do you evaluate it? This may seem a strange question,
but the answer is not as obvious as you might think.
The standard way, written in a loose algorithmic format:

poly = a_0
for j = 1:n
    poly = poly + a_j * x^j
end
To compare the costs of different numerical meth-
ods, we do an operations count, and then we compare
these for the competing methods. Above, the counts
are as follows:
additions: n
multiplications: 1 + 2 + 3 + ... + n = n(n+1)/2
This assumes each term a_j x^j is computed independently of the remaining terms in the polynomial.
Next, compute the powers x^j recursively:

x^j = x · x^(j−1)

Then computing x^2, x^3, ..., x^n will cost n − 1 multiplications. Our algorithm becomes

poly = a_0 + a_1 * x
power = x
for j = 2:n
    power = x * power
    poly = poly + a_j * power
end
The total operations cost is

additions: n
multiplications: n + (n − 1) = 2n − 1

When n is even moderately large, this is much less than for the first method of evaluating p(x). For example, with n = 20, the first method has 210 multiplications, whereas the second has 39 multiplications.
We now consider nested multiplication. As examples of particular degrees, write

n = 2:  p(x) = a_0 + x(a_1 + a_2 x)
n = 3:  p(x) = a_0 + x(a_1 + x(a_2 + a_3 x))
n = 4:  p(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + a_4 x)))

These contain, respectively, 2, 3, and 4 multiplications. This is fewer than the preceding method, which would have needed 3, 5, and 7 multiplications, respectively.
For the general case, write

p(x) = a_0 + x(a_1 + x(a_2 + ... + x(a_(n−1) + a_n x) ... ))

This requires n multiplications, which is only about half that for the preceding method. For an algorithm, write

poly = a_n
for j = n−1:−1:0
    poly = a_j + x * poly
end
With all three methods, the number of additions is n; but the number of multiplications can be dramatically different for large values of n.
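As a sketch (the function name horner_eval is our own, saved as horner_eval.m), the nested scheme in MATLAB, with coefficients stored as a(1) = a_0, ..., a(n+1) = a_n:

    function p = horner_eval(a, x)
        % Nested multiplication: n multiplications, n additions.
        n = length(a) - 1;
        p = a(n+1);                % start with a_n
        for j = n-1:-1:0
            p = a(j+1) + x*p;      % p = a_j + x*p
        end
    end

For example, horner_eval([1 2 3], 2) evaluates 1 + 2x + 3x^2 at x = 2, giving 17. MATLAB's built-in polyval performs essentially the same nested multiplication, but expects the coefficients in the opposite order (highest degree first).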
NESTED MULTIPLICATION
Imagine we are evaluating the polynomial
p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n
at a point x = z. Thus with nested multiplication
p(z) = a_0 + z(a_1 + z(a_2 + ... + z(a_(n−1) + a_n z) ... ))

We can write this as the following sequence of operations:

b_n = a_n
b_(n−1) = a_(n−1) + z b_n
b_(n−2) = a_(n−2) + z b_(n−1)
...
b_0 = a_0 + z b_1

The quantities b_(n−1), ..., b_0 are simply the quantities in parentheses, starting from the innermost and working outward.
Introduce
q(x) = b_1 + b_2 x + b_3 x^2 + ... + b_n x^(n−1)
Claim:
p(x) = b_0 + (x − z) q(x)    (*)

Proof: Simply expand

b_0 + (x − z)(b_1 + b_2 x + b_3 x^2 + ... + b_n x^(n−1))

and use the fact that

z b_j = b_(j−1) − a_(j−1),  j = 1, ..., n
With this result (*), we have
p(x)/(x − z) = b_0/(x − z) + q(x)

Thus q(x) is the quotient when dividing p(x) by x − z, and b_0 is the remainder.
If z is a zero of p(x), then b_0 = 0; and then

p(x) = (x − z) q(x)
For the remaining roots of p(x), we can concentrate
on finding those of q(x). In rootfinding for polynomi-
als, this process of reducing the size of the problem is
called deflation.
Another consequence of (*) is the following. Form
the derivative of (*) with respect to x, obtaining
p'(x) = (x − z) q'(x) + q(x)
p'(z) = q(z)

Thus to evaluate p(x) and p'(x) simultaneously at x = z, we can use nested multiplication for p(z), and we can use the intermediate steps of this to also evaluate p'(z). This is useful when doing rootfinding problems for polynomials by means of Newton's method.
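A sketch of this simultaneous evaluation (our own helper, here called horner2): each pass updates q(z) using the old value of p before p itself is updated:

    function [p, dp] = horner2(a, z)
        % Returns p(z) and p'(z) via nested multiplication,
        % with a(1) = a_0, ..., a(n+1) = a_n.
        n  = length(a) - 1;
        p  = a(n+1);                % b_n = a_n
        dp = 0;                     % accumulates q(z) = p'(z)
        for j = n-1:-1:0
            dp = p + z*dp;          % Horner on q, whose coefficients are the b_j
            p  = a(j+1) + z*p;      % b_j = a_j + z*b_{j+1}
        end
    end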
APPROXIMATING SF (x)
Define
SF(x) = (1/x) ∫_0^x (sin t)/t dt,  x ≠ 0
We use Taylor polynomials to approximate this func-
tion, to obtain a way to compute it with accuracy and
simplicity.
[Graph: y = SF(x) for −8 ≤ x ≤ 8, with y-axis ticks at 0.5 and 1.0]
As an example, begin with the degree 3 Taylor ap-
proximation to sin t, expanded about t = 0:
sin t = t − t^3/6 + (t^5/120) cos c_t

with c_t between 0 and t. Then

(sin t)/t = 1 − t^2/6 + (t^4/120) cos c_t

and

∫_0^x (sin t)/t dt = ∫_0^x [1 − t^2/6 + (t^4/120) cos c_t] dt
                   = x − x^3/18 + (1/120) ∫_0^x t^4 cos c_t dt
Dividing by x,

(1/x) ∫_0^x (sin t)/t dt = 1 − x^2/18 + R_2(x)

where

R_2(x) = (1/(120x)) ∫_0^x t^4 cos c_t dt
How large is the error in the approximation
SF(x) ≈ 1 − x^2/18

on the interval [−1, 1]? Since |cos c_t| ≤ 1, we have for x > 0 that

0 ≤ R_2(x) ≤ (1/(120x)) ∫_0^x t^4 dt = x^4/600

and the same result can be shown for x < 0. Then for |x| ≤ 1, we have

0 ≤ R_2(x) ≤ 1/600
To obtain a more accurate approximation, we can pro-
ceed exactly as above, but simply use a higher degree
approximation to sin t.
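As a sketch, the bound can be observed numerically in MATLAB; integral supplies the quadrature reference (the integrand is taken on [eps, x] to sidestep the 0/0 at t = 0):

    x = 0.5;
    approx = 1 - x^2/18;                        % our Taylor-based approximation
    SF = integral(@(t) sin(t)./t, eps, x) / x;  % accurate reference value
    abs(SF - approx)    % observed error, about 1.0e-4
    x^4/600             % the guaranteed bound, about 1.04e-4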
BINARY INTEGERS
A binary integer x is a finite sequence of the digits 0
and 1, which we write symbolically as
x = (a_m a_(m−1) ... a_2 a_1 a_0)_2

where I insert the parentheses with subscript ()_2 in order to make clear that the number is binary. The above has the decimal equivalent

x = a_m 2^m + a_(m−1) 2^(m−1) + ... + a_1 2^1 + a_0
For example, the binary integer x = (110101)_2 has the decimal value

x = 2^5 + 2^4 + 2^2 + 2^0 = 53
The binary integer x = (111...1)_2 with m ones has the decimal value

x = 2^(m−1) + ... + 2^1 + 1 = 2^m − 1
DECIMAL TO BINARY INTEGER CONVERSION
Given a decimal integer x, we write

x = (a_m a_(m−1) ... a_2 a_1 a_0)_2 = a_m 2^m + a_(m−1) 2^(m−1) + ... + a_1 2^1 + a_0

Divide x by 2, calling the quotient x_1. The remainder is a_0, and

x_1 = a_m 2^(m−1) + a_(m−1) 2^(m−2) + ... + a_1 2^0

Continue the process. Divide x_1 by 2, calling the quotient x_2. The remainder is a_1, and

x_2 = a_m 2^(m−2) + a_(m−1) 2^(m−3) + ... + a_2 2^0
After a finite number of such steps, we will obtain all
of the coefficients ai, and the final quotient will be
zero.
Try this with a few decimal integers.
EXAMPLE
The following shortened form of the above method is
convenient for hand computation. Convert (11)_10 to binary:

x_1 = ⌊11/2⌋ = 5,  a_0 = 1
x_2 = ⌊5/2⌋ = 2,   a_1 = 1
x_3 = ⌊2/2⌋ = 1,   a_2 = 0
x_4 = ⌊1/2⌋ = 0,   a_3 = 1

In this, the notation ⌊b⌋ denotes the largest integer ≤ b, and ⌊n/2⌋ is the quotient resulting from dividing n by 2. From the above calculation, (11)_10 = (1011)_2.
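The repeated-division scheme is a few lines of MATLAB (a sketch; dec2bin is the built-in check):

    x = 11; bits = [];
    while x > 0
        bits = [rem(x, 2), bits];   % remainder gives the next digit a_j
        x = floor(x / 2);           % quotient becomes the new x
    end
    % bits = [1 0 1 1], matching dec2bin(11)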
BINARY FRACTIONS
A binary fraction x is a sequence (possibly infinite) of
the digits 0 and 1:
x = (.a_1 a_2 a_3 ... a_m ...)_2 = a_1 2^(−1) + a_2 2^(−2) + a_3 2^(−3) + ...

For example, x = (.1101)_2 has the decimal value

x = 2^(−1) + 2^(−2) + 2^(−4) = .5 + .25 + .0625 = 0.8125
Recall the formula for the geometric series
Σ_{i=0}^{n} r^i = (1 − r^(n+1))/(1 − r),  r ≠ 1

Letting n → ∞ with |r| < 1, we obtain the formula

Σ_{i=0}^{∞} r^i = 1/(1 − r),  |r| < 1
Using this,
(.0101010101010...)_2 = 2^(−2) + 2^(−4) + 2^(−6) + ... = 2^(−2) (1 + 2^(−2) + 2^(−4) + ...)

which sums to the fraction 1/3.
Also,
(.11001100110011...)_2 = 2^(−1) + 2^(−2) + 2^(−5) + 2^(−6) + ...

and this sums to the decimal fraction 0.8 = 8/10.
DECIMAL TO BINARY FRACTION CONVERSION
In
x_1 = (.a_1 a_2 a_3 ... a_m ...)_2 = a_1 2^(−1) + a_2 2^(−2) + a_3 2^(−3) + ...

we multiply by 2. The integer part will be a_1; and after it is removed we have the binary fraction

x_2 = (.a_2 a_3 ... a_m ...)_2 = a_2 2^(−1) + a_3 2^(−2) + a_4 2^(−3) + ...

Again multiply by 2, obtaining a_2 as the integer part of 2x_2. After removing a_2, let x_3 denote the remaining number. Continue this process as far as needed.
For example, with x = 1/5 = 0.2, we have

x_1 = .2;  2x_1 = .4;  a_1 = 0 and x_2 = .4
2x_2 = .8;  a_2 = 0 and x_3 = .8
2x_3 = 1.6;  a_3 = 1 and x_4 = .6

Continue this to get the pattern

(.2)_10 = (.00110011001100...)_2
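The repeated-multiplication scheme, sketched in MATLAB for x = 0.2:

    x = 0.2; bits = zeros(1, 20);
    for j = 1:20
        x = 2*x;
        bits(j) = floor(x);     % integer part is the digit a_j
        x = x - bits(j);        % keep the fractional part
    end
    % bits begins 0 0 1 1 0 0 1 1 ..., i.e. (.00110011...)_2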
DECIMAL FLOATING-POINT NUMBERS
Floating point notation is akin to what is called scientific notation. For a nonzero number x, we can write it in the form

x = σ · x̄ · 10^e

where σ = ±1 is called the sign, the integer e is the exponent, and 1 ≤ x̄ < 10 is the significand or mantissa.

For example,

345.78 = 3.4578 × 10^2

where σ = +1, e = 2, x̄ = 3.4578.

On a decimal computer or calculator, we store x by instead storing σ, e, and x̄. We must restrict the number of digits in x̄ and the size of the exponent e. The number of digits in x̄ is called the precision.

For example, on an HP-15C calculator, the precision is 10, and the exponent is restricted to −99 ≤ e ≤ 99.
BINARY FLOATING-POINT NUMBERS
We now do something similar with the binary repre-
sentation of a number x. Write

x = σ · x̄ · 2^e

with 1 ≤ x̄ < (10)_2 = (2)_10 and e an integer. For example,

(0.1)_10 = (.000110011001100...)_2 = (1.10011001100...)_2 × 2^(−4)

with σ = +1, x̄ = (1.10011001100...)_2, and e = −4.

The number x is stored in the computer by storing σ, x̄, and e. On all computers, there are restrictions on the number of digits in x̄ and the size of e.
FLOATING POINT NUMBERS
When a number x outside a computer or calculator
is converted into a machine number, we denote it by
fl(x). On an HP calculator,

fl(.3333...) = (3.333333333)_10 × 10^(−1)

The decimal fraction of infinite length will not fit in the registers of the calculator, but the latter 10-digit number will fit. Some calculators actually carry more digits internally than they allow to be displayed. On a binary computer, we use a similar notation.

We will concentrate on a particular form of computer floating point number, called the IEEE floating point standard.
Example 1. Consider a binary floating point representation with precision 3, and e_min = −2 ≤ e ≤ 2 = e_max. All the numbers admitted by this representation are presented in the table:

x̄ \ e       −2          −1         0          1         2
(1.00)_2    (.25)_10    (.5)_10    (1)_10     (2)_10    (4)_10
(1.01)_2    (.3125)_10  (.625)_10  (1.25)_10  (2.5)_10  (5)_10
(1.10)_2    (.375)_10   (.75)_10   (1.5)_10   (3)_10    (6)_10
(1.11)_2    (.4375)_10  (.875)_10  (1.75)_10  (3.5)_10  (7)_10

[Number line showing these values spread between 0 and 7, with a gap around 0]
This representation can be extended to include smaller numbers called denormalized numbers. These numbers are obtained if e = e_min and the first digit of the significand is 0.

Example 2. The previous example plus denormalized numbers:

(0.01)_2 × 2^(−2) = 1/16 = (0.0625)_10
(0.10)_2 × 2^(−2) = 2/16 = (0.125)_10
(0.11)_2 × 2^(−2) = 3/16 = (0.1875)_10

[Number line as above, now including the denormalized values filling in near 0]
IEEE SINGLE PRECISION STANDARD
In IEEE single precision 32 bits are used to store num-
bers. A number is written as

x = σ · (1.a_1 a_2 ... a_23)_2 · 2^e

The significand x̄ = (1.a_1 a_2 ... a_23)_2 immediately satisfies 1 ≤ x̄ < 2.
What are the limits on e? To understand the limits on e and the number of binary digits chosen for x̄, we must look roughly at how the number x will be stored in the computer.
Basically, we store σ as a single bit, the significand x̄ as 24 bits (only 23 need be stored), and the exponent fills out 8 bits, including both negative and positive integers.
Roughly speaking, we have that e must satisfy
−(1111111)_2 ≤ e ≤ (1111111)_2

or in decimal

−127 ≤ e ≤ 127

In actuality, the limits are

−126 ≤ e ≤ 127

for reasons related to the storage of 0 and other numbers such as ±∞. In order to avoid a sign for the exponent, denote E = e + 127. Obviously, 1 ≤ E ≤ 254, with two additional special values: 0 and 255.
The 32 bits b_1 b_2 ... b_32 are laid out as: b_1 = σ (sign), b_2 ... b_9 = E (exponent), b_10 ... b_32 = the fraction bits of the significand x̄.

The number x = 0 is stored as E = 0, σ = 0, and b_10 b_11 ... b_32 = (00...0)_2.
E = (b_2 ... b_9)_2     e      x
(00000000)_2 = 0       −127   ±(0.b_10 ... b_32)_2 × 2^(−126)
(00000001)_2 = 1       −126   ±(1.b_10 ... b_32)_2 × 2^(−126)
(00000010)_2 = 2       −125   ±(1.b_10 ... b_32)_2 × 2^(−125)
...                     ...    ...
(01111111)_2 = 127      0     ±(1.b_10 ... b_32)_2 × 2^0
(10000000)_2 = 128      1     ±(1.b_10 ... b_32)_2 × 2^1
...                     ...    ...
(11111101)_2 = 253      126   ±(1.b_10 ... b_32)_2 × 2^126
(11111110)_2 = 254      127   ±(1.b_10 ... b_32)_2 × 2^127
(11111111)_2 = 255            ±∞ if all b_i = 0; NaN otherwise
IEEE DOUBLE PRECISION STANDARD

x = σ · (1.a_1 a_2 ... a_52)_2 · 2^e

Storage layout: b_1 = σ, b_2 ... b_12 = E, b_13 ... b_64 = fraction bits, where E = e + 1023.

E = (b_2 ... b_12)_2       e       x
(00000000000)_2 = 0       −1023   ±(0.b_13 ... b_64)_2 × 2^(−1022)
(00000000001)_2 = 1       −1022   ±(1.b_13 ... b_64)_2 × 2^(−1022)
(00000000010)_2 = 2       −1021   ±(1.b_13 ... b_64)_2 × 2^(−1021)
...                        ...     ...
(01111111111)_2 = 1023     0      ±(1.b_13 ... b_64)_2 × 2^0
(10000000000)_2 = 1024     1      ±(1.b_13 ... b_64)_2 × 2^1
...                        ...     ...
(11111111101)_2 = 2045     1022   ±(1.b_13 ... b_64)_2 × 2^1022
(11111111110)_2 = 2046     1023   ±(1.b_13 ... b_64)_2 × 2^1023
(11111111111)_2 = 2047            ±∞ if all b_i = 0; NaN otherwise
What is the connection of the 24 bits in the significand x̄ to the number of decimal digits in the storage of a number x into floating point form? One way of answering this is to find the integer M for which:

1. 0 < x ≤ M and x an integer implies fl(x) = x; and
2. fl(M + 1) ≠ M + 1.

This integer M is at least as big as

(11...1)_2  (24 ones) = (1.11...1)_2 × 2^23 = 2^23 + 2^22 + ... + 2^0 = 2^24 − 1

Also, 2^24 = (1.00...0)_2 × 2^24 will be stored exactly. The next integer, 2^24 + 1, cannot be stored exactly, since its significand would contain 24 + 1 binary digits:

2^24 + 1 = (1.00...01)_2 × 2^24  (with 23 zeros between the ones)

Therefore, for single precision M = 2^24. Any integer less than or equal to M will be stored exactly. So

M = 2^24 = 16777216

For the IEEE double precision standard we have

M = 2^53 ≈ 9.0 × 10^15
THE MACHINE EPSILON
Let y be the smallest number representable in the ma-
chine arithmetic that is greater than 1 in the machine.
The machine epsilon is η = y − 1. It is a widely used measure of the accuracy possible in representing num-
bers in the machine.
The number 1 has the simple floating point represen-
tation
1 = (1.00 · · · 0)2 · 20
What is the smallest number that is greater than 1?
It is
1 + 2^(−23) = (1.00...01)_2 × 2^0 > 1
and the machine epsilon in IEEE single precision float-
ing point format is η = 2^(−23) ≈ 1.19 × 10^(−7).
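MATLAB itself works in IEEE double precision, where the same construction gives η = 2^(−52); the built-in constant eps returns exactly this value:

    eps               % 2.2204e-16, equal to 2^-52
    (1 + eps) > 1     % true: 1 + eps is the next machine number after 1
    (1 + eps/4) > 1   % false: eps/4 'drops off the end' when added to 1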
THE UNIT ROUND
Consider the smallest number δ > 0 that is repre-
sentable in the machine and for which
1 + δ > 1
in the arithmetic of the machine.
For any number 0 < α < δ, the result of 1 + α is exactly 1 in the machine's arithmetic. Thus α 'drops off the end' of the floating point representation in the machine. The size of δ is another way of describing the accuracy attainable in the floating point representation of the machine. The machine epsilon has been replacing it in recent years.
It is not too difficult to derive δ. The number 1 has
the simple floating point representation
1 = (1.00 · · · 0)2 · 20
What is the smallest number which can be added to
this without disappearing? Certainly we can write
1 + 2^(−23) = (1.00...01)_2 × 2^0 > 1
Past this point, we need to know whether we are us-
ing chopped arithmetic or rounded arithmetic. We
will shortly look at both of these. With chopped arithmetic, δ = 2^(−23); and with rounded arithmetic, δ = 2^(−24).
ROUNDING AND CHOPPING
Let us first consider these concepts with decimal arith-
metic. We write a computer floating point number z
as
z = σ · ζ · 10^e ≡ σ · (a_1.a_2 ... a_n)_10 · 10^e

with a_1 ≠ 0, so that there are n decimal digits in the significand (a_1.a_2 ... a_n)_10.
Given a general number
x = σ · (a_1.a_2 ... a_n ...)_10 · 10^e,  a_1 ≠ 0

we must shorten it to fit within the computer. This
is done by either chopping or rounding. The floating
point chopped version of x is given by
fl(x) = σ · (a_1.a_2 ... a_n)_10 · 10^e
where we assume that e fits within the bounds re-
quired by the computer or calculator.
For the rounded version, we must decide whether to
round up or round down. A simplified formula is
fl(x) = σ · (a_1.a_2 ... a_n)_10 · 10^e                        if a_(n+1) < 5
fl(x) = σ · [(a_1.a_2 ... a_n)_10 + (0.0...1)_10] · 10^e       if a_(n+1) ≥ 5

The term (0.0...1)_10 denotes 10^(−n+1), giving the ordinary sense of rounding with which you are familiar.
In the single case
(0.0...0 a_(n+1) a_(n+2) ...)_10 = (0.0...0500...)_10

a more elaborate procedure is used so as to assure an
unbiased rounding.
CHOPPING/ROUNDING IN BINARY
Let
x = σ · (1.a_2 ... a_n ...)_2 · 2^e
with all ai equal to 0 or 1. Then for a chopped floating
point representation, we have
fl(x) = σ · (1.a_2 ... a_n)_2 · 2^e
For a rounded floating point representation, we have
fl(x) = σ · (1.a_2 ... a_n)_2 · 2^e                      if a_(n+1) = 0
fl(x) = σ · [(1.a_2 ... a_n)_2 + (0.0...1)_2] · 2^e      if a_(n+1) = 1
ERRORS
The error x − fl(x) = 0 when x needs no change to be put into the computer or calculator. Of more interest
is the case when the error is nonzero. Consider first
the case x > 0 (meaning σ = +1). The case with
x < 0 is the same, except for the sign being opposite.
With x ≠ fl(x), and using chopping, we have
fl(x) < x
and the error x − fl(x) is always positive. This has major consequences in extended numerical computations. With x ≠ fl(x) and rounding, the error
x− fl(x) is negative for half the values of x, and it is
positive for the other half of possible values of x.
We often write the relative error as
(x − fl(x))/x = −ε
This can be expanded to obtain
fl(x) = (1 + ε)x
Thus fl(x) can be considered as a perturbed value
of x. This is used in many analyses of the effects of
chopping and rounding errors in numerical computa-
tions.
For bounds on ε, we have
−2^(−n) ≤ ε ≤ 2^(−n)      (rounding)
−2^(−n+1) ≤ ε ≤ 0         (chopping)
IEEE ARITHMETIC
We are only giving the minimal characteristics of IEEE
arithmetic. There are many options available on the
types of arithmetic and the chopping/rounding. The
default arithmetic uses rounding.
Single precision arithmetic:
n = 24,  −126 ≤ e ≤ 127

This results in

M = 2^24 = 16777216
η = 2^(−23) ≈ 1.19 × 10^(−7)
Double precision arithmetic:
n = 53,  −1022 ≤ e ≤ 1023

What are M and η?
There is also an extended representation, having n =
69 digits in its significand.
MATLAB can be used to generate the binary floating point representation of a number.
Execute in MATLAB the command:
format hex
This will cause all subsequent numerical output to the
screen to be given in hexadecimal format (base 16).
For example, listing the number 7.125 results in an
output of
401c800000000000

The 16 hexadecimal digits are

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}
To obtain the binary representation, convert each hex-
adecimal digit to a four digit binary number according
to the table below
hex  binary    hex  binary
0    0000      8    1000
1    0001      9    1001
2    0010      a    1010
3    0011      b    1011
4    0100      c    1100
5    0101      d    1101
6    0110      e    1110
7    0111      f    1111
For the above number, we obtain the binary expansion

0100 0000 0001 1100 1000 0000 0000 ... 0000
 4    0    1    c    8    0    0   ...  0

Reading off the fields: the sign bit is b_1 = 0, so σ = +1; the exponent field is E = (b_2 ... b_12)_2 = (10000000001)_2 = 1025, so e = E − 1023 = 2; and the fraction bits b_13 b_14 ... b_64 = 1100100000...0 give the significand x̄ = (1.11001)_2. This provides us with the IEEE double precision representation of 7.125 = (1.11001)_2 × 2^2.
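The same inspection can be done without format hex: the built-in num2hex returns the 16 hexadecimal digits directly, and dec2bin expands them (a short sketch):

    num2hex(7.125)                  % returns '401c800000000000'
    dec2bin(hex2dec('401c'), 16)    % first 16 bits: 0100000000011100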
SOME DEFINITIONS
Let x_T denote the true value of some number, usually unknown in practice; and let x_A denote an approximation of x_T.

The error in x_A is

error(x_A) = x_T − x_A

The relative error in x_A is

rel(x_A) = error(x_A)/x_T = (x_T − x_A)/x_T

Example: x_T = e, x_A = 19/7. Then

error(x_A) = e − 19/7 ≈ 0.003996
rel(x_A) = 0.003996/e ≈ 0.00147
Relative error better represents the difference between the true value and the approximate one.
Example: Suppose the distance between two cities is D_T = 100 km, and let this distance be approximated with D_A = 99 km. In this case,

Err(D_A) = D_T − D_A = 1 km
Rel(D_A) = Err(D_A)/D_T = 0.01 = 1%

Now, suppose that a distance is d_T = 2 km and we estimate it with d_A = 1 km. Then

Err(d_A) = d_T − d_A = 1 km
Rel(d_A) = Err(d_A)/d_T = 0.5 = 50%
In both cases the error is the same. But, obviously, D_A is a better approximation of D_T than d_A is of d_T.
Numerical Analysis
conf. dr. Viorel Bostan
Fall 2010, Lecture 3
Sources of Error

The sources of error in the computation of the solution of a mathematical model for some physical situation can be roughly characterised as follows:

1. Modelling Error.
Consider the example of a projectile of mass m that is travelling through the earth's atmosphere. A simple and often used description of projectile motion is given by

m d^2r(t)/dt^2 = −m g k − b dr(t)/dt

with b ≥ 0. In this, r(t) is the vector position of the projectile, and the final term in the equation represents the friction force in air. If there is an error in this model of the physical situation, then the numerical solution of the equation is not going to improve the results.
2. Physical / Observational / Measurement Error.
The radius of an electron is given by

(2.81777 + ε) × 10^(−13) cm,  |ε| ≤ 0.00011

This error cannot be removed, and it must affect the accuracy of any computation in which it is used. We need to be aware of these effects and arrange the computation so as to minimize them.
3. Approximation Error.
This is also called "discretization error" and "truncation error", and it is the main source of error with which we deal in this course. Such errors generally occur when we replace a computationally unsolvable problem with a nearby problem that is more tractable computationally. For example, the Taylor polynomial approximation

e^x ≈ 1 + x + x^2/2

contains an "approximation error". The numerical integration

∫_0^1 f(x) dx ≈ (1/N) Σ_{j=1}^{N} f(j/N)

contains an approximation error.
4. Finiteness of Algorithm Error.
This is an error due to stopping an algorithm after a finite number of iterations. Even if theoretically an algorithm could run for an indefinite time, after a finite (usually specified) number of iterations the algorithm will be stopped.
5. Blunders.
In the pre-computer era, blunders were mostly arithmetic errors. In the earlier years of the computer era, the typical blunder was a programming bug. Present day "blunders" are still often programming errors, but now they are often much more difficult to find, as they are often embedded in very large codes which may mask their effect. Some simple rules to decrease the risk of having a bug in the code:

- Break programs into small testable subprograms;
- Run test cases for which you know the outcome;
- When running the full code, maintain a skeptical eye on the output, checking whether the output is reasonable or not.
6. Rounding/Chopping Error.
This is the main source of many problems, especially problems in solving systems of linear equations. We later look at the effects of such errors.
7. Finiteness of Precision Error.
All the numbers stored in computer memory are subject to the finiteness of the space allocated for their storage.
Pendulum Example

Original problem in engineering or in science to be solved:

[Figure: pendulum of length l at angle θ, with string tension T and weight mg]

Model this physical problem mathematically. Newton's second law provides us with

θ'' = −(g/l) sin θ

or, as a first-order system,

θ' = ω
ω' = −(g/l) sin θ
Problem of continuous mathematics:

θ' = ω
ω' = −(g/l) sin θ

Errors introduced at this stage: Modeling Errors, Physical Errors.
Mathematical algorithm (discretize with step h):

θ_(n+1) = θ_n + h ω_(n+1)
ω_(n+1) = ω_n − h (g/l) sin(θ_n)

Errors introduced at this stage: Discretisation Errors, Finiteness of Algorithm Errors.
Computer implementation:

for i = 1:Nmax
    Omega = Omega - H*g/L*sin(Theta);
    Theta = Theta + H*Omega;
end

Errors introduced at this stage: Rounding/Chopping Errors, Bugs in the Code, Finite Precision Errors.
Loss of significance errors

This can be considered a source of error or a consequence of the finiteness of calculator and computer arithmetic.

Example. Define

f(x) = x (√(x + 1) − √x)

and consider evaluating it on a 6-digit decimal calculator which uses rounded arithmetic.

x        Computed f(x)   True f(x)    Error
1        0.414221        0.414214      7.0e−06
10       1.54340         1.54347      −7.0e−05
100      4.99000         4.98756       0.0024
1000     15.8000         15.8074      −0.0074
10000    50.0000         49.9988       0.0012
100000   100.000         158.113     −58.1130
Example. Define

g(x) = (1 − cos x)/x^2

and consider evaluating it on a 10-digit decimal calculator which uses rounded arithmetic.

x         Computed g(x)   True g(x)       Error
0.1       0.4995834700    0.4995834722   −2.2e−09
0.01      0.4999960000    0.4999958333    1.667e−07
0.001     0.5000000000    0.4999999583    4.17e−08
0.0001    0.5000000000    0.4999999996    4.0e−10
0.00001   0.0             0.5000000000    0.5
Loss of signi�cance errors
Example. De�ne
g(x) =1� cos xx2
and consider evaluating it on a 10-digit decimal calculator whichuses rounded arithmetic.
x Computed f (x) True f (x) Error
0.1 0.4995834700 0.4995834722 �2.2000e � 0090.01 0.4999960000 0.4999958333 1.6670e � 0070.001 0.5000000000 0.4999999583 4.1700e � 0080.0001 0.5000000000 0.4999999996 4.0000e � 0100.00001 0.0 0.5000000000 0.5
38 / 83
Loss of significance errors

Consider one case, that of x = 0.001. Then on the calculator:

cos(0.001) = 0.9999994999
1 − cos(0.001) = 5.001 × 10⁻⁷
(1 − cos(0.001))/(0.001)² = 0.5001000000

The true answer is

g(0.001) = 0.4999999583.

The relative error in our answer is

(0.4999999583 − 0.5001)/0.4999999583 = −0.0001000417/0.4999999583 ≈ −0.0002.

There are only 3 significant digits in the answer. How can such a straightforward and short calculation lead to such a large error (relative to the accuracy of the calculator)?
Loss of significance errors

When two numbers are nearly equal and we subtract them, we suffer a "loss of significance error" in the calculation. In some cases, these errors can be quite subtle and difficult to detect. And even after they are detected, they may be difficult to fix.

The last example, fortunately, can be fixed in a number of ways. Easiest is to use a trigonometric identity:

cos(2θ) = 2 cos²(θ) − 1 = 1 − 2 sin²(θ).

Let x = 2θ. Then

g(x) = (1 − cos x)/x² = 2 sin²(x/2)/x² = (1/2) (sin(x/2)/(x/2))².

This latter formula, with x = 0.001, yields a computed value of 0.4999999584, nearly the true answer. We could also have used a Taylor polynomial for cos(x) around x = 0 to obtain a better approximation to g(x) for small values of x.
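The cancellation is easy to observe in ordinary double precision as well; the following is a small illustrative sketch, not part of the original example.

% Naive vs. rewritten evaluation of g(x) = (1 - cos(x))/x^2.
x = 10.^(-(1:9))';                        % x = 0.1, 0.01, ..., 1e-9
g_naive  = (1 - cos(x)) ./ x.^2;          % subtracts nearly equal numbers
g_stable = 0.5 * (sin(x/2) ./ (x/2)).^2;  % rewritten form: no cancellation
disp([x, g_naive, g_stable])              % the naive column degrades, then collapses to 0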
Another example

Evaluate e⁻⁵ using a Taylor polynomial approximation:

e⁻⁵ ≈ 1 + (−5)/1! + (−5)²/2! + (−5)³/3! + (−5)⁴/4! + (−5)⁵/5! + (−5)⁶/6! + ... + (−5)ⁿ/n!

With n = 25, the truncation error satisfies

|(−5)²⁶ e^c / 26!| ≤ 10⁻⁸   (for some c between −5 and 0).

Imagine calculating this polynomial using a computer with 4-digit decimal arithmetic and rounding. To make the point about cancellation more strongly, imagine that each of the terms in the above polynomial is calculated exactly and then rounded to the arithmetic of the computer. We add the terms exactly and then we round to four digits.
Another example

Degree   Term       Sum       |  Degree   Term          Sum
0         1.000      1.000    |  13      −0.1960       −0.04230
1        −5.000     −4.000    |  14       0.7001e−1     0.02771
2        12.50       8.500    |  15      −0.2334e−1     0.004370
3       −20.83     −12.33     |  16       0.7293e−2     0.01166
4        26.04      13.71     |  17      −0.2145e−2     0.009518
5       −26.04     −12.33     |  18       0.5958e−3     0.01011
6        21.70       9.370    |  19      −0.1568e−3     0.009957
7       −15.50      −6.130    |  20       0.3920e−4     0.009996
8         9.688      3.558    |  21      −0.9333e−5     0.009987
9        −5.382     −1.824    |  22       0.2121e−5     0.009989
10        2.691      0.8670   |  23      −0.4611e−6     0.009989
11       −1.223     −0.3560   |  24       0.9670e−7     0.009989
12        0.5097     0.1537   |  25      −0.1921e−7     0.009989

The true answer is 0.006738.
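The experiment behind this table can be approximated in a few lines of MATLAB; this sketch assumes R2014b or later for round(...,'significant'), and it mimics (rather than exactly reproduces) the 4-digit decimal arithmetic described above.

% Taylor series of exp(-5) with each term rounded to 4 significant digits.
chop4 = @(v) round(v, 4, 'significant');
s = 0;
for k = 0:25
    term = chop4((-5)^k / factorial(k));  % term computed exactly, then rounded
    s = s + term;                         % partial sums accumulated exactly
end
fprintf('computed: %.6f   true: %.6f\n', chop4(s), exp(-5))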
Another example

To understand more fully the source of the error, look at the numbers being added and their accuracy. For example,

(−5)³/3! = −125/6 ≈ −20.83

in the 4-digit decimal calculation, with an error of magnitude 0.00333. Note that this error in an intermediate step is of the same magnitude as the true answer 0.006738 being sought. Other similar errors are present in calculating the other terms, and thus they cause a major error in the final answer being calculated.

General principle

Whenever a sum is being formed in which the final answer is much smaller than some of the terms being combined, a loss of significance error is occurring.
Noise in function evaluation

Consider plotting the function

f(x) = (x − 1)³ = x³ − 3x² + 3x − 1 = −1 + x(3 + x(−3 + x)).

[Figure: plot of y = f(x) for 0 ≤ x ≤ 2, with −1 ≤ y ≤ 1; a smooth cubic passing through zero at x = 1.]

[Figure: the same function evaluated in floating point and plotted for 0.99998 ≤ x ≤ 1.00002; instead of a smooth curve, the computed values oscillate erratically on the scale of ±8 × 10⁻¹⁵.]
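The second plot can be reproduced directly; a minimal sketch (the grid of 401 points is an arbitrary choice):

% Evaluate (x-1)^3 through the expanded form near x = 1 and plot the noise.
x = linspace(1 - 2e-5, 1 + 2e-5, 401);
y = -1 + x.*(3 + x.*(-3 + x));  % cancellation among O(1) terms leaves rounding noise
plot(x, y, '.'), xlabel('x'), ylabel('y')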
Noise in function evaluation

Whenever a function f(x) is evaluated, there are arithmetic operations carried out which involve rounding or chopping errors.

This means that what the computer eventually returns as an answer contains noise.

This noise is generally "random" and small.

But it can affect the accuracy of other calculations which depend on f(x).
Underflow errors

Consider evaluating

f(x) = x¹⁰

for x near 0. When using IEEE single precision arithmetic, the smallest nonzero positive number expressible in normalized floating-point format is

m = 2⁻¹²⁶ ≈ 1.18 × 10⁻³⁸.

Thus f(x) will be set to zero if

x¹⁰ < m
|x| < m^(1/10)
|x| < 1.61 × 10⁻⁴
−0.000161 < x < 0.000161.
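This is easy to try in MATLAB; a small sketch. (One caveat: MATLAB keeps IEEE subnormal numbers, so single-precision results flush to zero only below roughly 1.4 × 10⁻⁴⁵, not already at the normalized limit quoted above.)

% Underflow in single precision.
realmin('single')         % smallest normalized single, about 1.1755e-38
x = single(1e-4); x^10    % about 1e-40: below realmin, stored as a subnormal
x = single(1e-5); x^10    % about 1e-50: underflows all the way to zero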
Overflow errors

Attempts to use numbers that are too large for the floating-point format will lead to overflow errors. These are generally fatal errors on most computers. With the IEEE floating-point format, overflow errors can be carried along as having a value of ±∞ or NaN, depending on the context. Usually an overflow error is an indication of a more significant problem or error in the program, and the user needs to be aware of such errors.

When using IEEE single precision arithmetic, the largest positive number expressible in normalized floating-point format is

M = 2¹²⁸ (1 − 2⁻²⁴) ≈ 3.40 × 10³⁸.

Thus, f(x) = x¹⁰ will overflow if

x¹⁰ > M
|x| > M^(1/10)
|x| > 7131.6.
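A quick single-precision check (the sample points 7000 and 8000 straddle the threshold 7131.6):

% Overflow in single precision.
x = single(7000); x^10    % about 2.8e38: still representable
x = single(8000); x^10    % about 1.1e39: overflows to Inf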
Numerical Analysis
conf. dr. Viorel Bostan
Fall 2010, Lecture 5
Loss of significance errors

Recall the example from the previous lecture. Define

f(x) = x(√(x+1) − √x)

and consider evaluating it on a 6-digit decimal calculator which uses rounded arithmetic.

x        Computed f(x)   True f(x)   Error
1        0.414221        0.414214     7.0000e−06
10       1.54340         1.54347     −7.0000e−05
100      4.99000         4.98756      0.0024
1000     15.8000         15.8074     −0.0074
10000    50.0000         49.9988      0.0012
100000   100.000         158.113     −58.113
Loss of significance errors

In order to localise the error, consider the case x = 100.

The calculator with 6 decimal digits will provide us with the following values:

√100 = 10,   √101 = 10.0499.

Then

√(x+1) − √x = √101 − √100 = 0.0499000,

while the exact value is 0.0498756. Three significant digits in √(x+1) = √101 have been lost in the subtraction of √x = √100.

The loss of precision is due to the form of the function f(x) and the finiteness of the precision of the 6-digit calculator.
Loss of significance errors

In this particular case, we can avoid the loss of precision by rewriting the function as follows:

f(x) = x(√(x+1) − √x) · (√(x+1) + √x)/(√(x+1) + √x) = x/(√(x+1) + √x).

Thus we avoid the subtraction of two nearly equal quantities. Doing so gives us

f(100) = 4.98756,

a value with 6 significant digits.
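The same comparison can be run in IEEE single precision; a brief illustrative sketch:

% Two algebraically equal forms of f(x) = x*(sqrt(x+1) - sqrt(x)).
x  = single(100);
f1 = x .* (sqrt(x+1) - sqrt(x));  % direct form: subtracts nearly equal numbers
f2 = x ./ (sqrt(x+1) + sqrt(x));  % rewritten form: no cancellation
fprintf('direct: %.7f   rewritten: %.7f\n', f1, f2)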
Propagation of errors
Propagation in arithmetic operations

Let ω denote an arithmetic operation such as +, −, ×, or /.

Let ω* denote the same arithmetic operation as it is actually carried out in the computer, including rounding or chopping error.

Let xA ≈ xT and yA ≈ yT.

We want to obtain xT ω yT, but we actually obtain xA ω* yA.

The error in xA ω* yA is given by

(xT ω yT) − (xA ω* yA).
Propagation of errors
Propagation in arithmetic operations

The error in xA ω* yA can be rewritten as

(xT ω yT) − (xA ω* yA) = (xT ω yT − xA ω yA) + (xA ω yA − xA ω* yA).

The final term is the error introduced by the inexactness of the machine arithmetic. For it, we usually assume

xA ω* yA = fl(xA ω yA).

This means that the quantity xA ω yA is computed exactly and is then rounded or chopped to fit the answer into the floating-point representation of the machine.
Propagation of errors
Propagation in arithmetic operations

The formula

xA ω* yA = fl(xA ω yA)

implies

xA ω* yA = (xA ω yA)(1 + ε),

since fl(x) = x(1 + ε), where limits for ε were given earlier. Then

Rel(xA ω* yA) = (xA ω yA − xA ω* yA)/(xA ω yA)
             = (xA ω yA − (xA ω yA)(1 + ε))/(xA ω yA)
             = −ε.
Propagation of errors
Propagation in arithmetic operations

With rounded binary arithmetic having n digits in the mantissa,

−2⁻ⁿ ≤ ε ≤ 2⁻ⁿ.

Coming back to the error formula,

(xT ω yT) − (xA ω* yA) = (xT ω yT − xA ω yA) + (xA ω yA − xA ω* yA),

the last bracket has relative error −ε, as shown above. The remaining term,

xT ω yT − xA ω yA,

is the propagated error. In what follows we examine it for particular cases.
Propagation of errors
Propagation in multiplication

Let ω = ×. Write

xT = xA + ξ,   yT = yA + η.

Then for the relative error in xA yA:

Rel(xA yA) = (xT yT − xA yA)/(xT yT)
           = (xT yT − (xT − ξ)(yT − η))/(xT yT)
           = (xT η + yT ξ − ξη)/(xT yT)
           = ξ/xT + η/yT − (ξ/xT)(η/yT)
           = Rel(xA) + Rel(yA) − Rel(xA) · Rel(yA).
Propagation of errors
Propagation in multiplication

Usually we have

|Rel(xA)| ≪ 1,   |Rel(yA)| ≪ 1;

therefore, we can neglect the last term Rel(xA) · Rel(yA), since it is much smaller than the previous two:

Rel(xA yA) = Rel(xA) + Rel(yA) − Rel(xA) · Rel(yA) ≈ Rel(xA) + Rel(yA).

Thus small relative errors in the arguments xA and yA lead to a small relative error in the product xA yA.

Also, note that there is some cancellation if these relative errors are of opposite sign.
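As a quick numerical check of this approximation (the sample values are arbitrary):

% Verify that Rel(xA*yA) is approximately Rel(xA) + Rel(yA).
xT = pi;     xA = 3.14159;  % 6-digit approximation of pi
yT = exp(1); yA = 2.71828;  % 6-digit approximation of e
RelX  = (xT - xA)/xT;
RelY  = (yT - yA)/yT;
RelXY = (xT*yT - xA*yA)/(xT*yT);
fprintf('RelX + RelY = %.3e   RelXY = %.3e\n', RelX + RelY, RelXY)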
Propagation of errors
Propagation in division

There is a similar result for division:

Rel(xA / yA) ≈ Rel(xA) − Rel(yA),

provided |Rel(yA)| ≪ 1.
Propagation of errors
Propagation in addition and subtraction

For ω equal to + or −, we have

[xT ± yT] − [xA ± yA] = [xT − xA] ± [yT − yA].

Thus the error in a sum is the sum of the errors in the original arguments, and similarly for subtraction.

However, there is a more subtle error occurring here.
Propagation of errors
Example

Suppose you are solving

x² − 26x + 1 = 0.

Using the quadratic formula, we have the true answers

r1,T = 13 + √168,   r2,T = 13 − √168.

From a table of square roots, we take √168 ≈ 12.961. Since this is correctly rounded to 5 digits, we have

|√168 − 12.961| ≤ 0.0005.

Then define

r1,A = 13 + 12.961 = 25.961,   r2,A = 13 − 12.961 = 0.039.
Propagation of errors
Example

Then for both roots,

|rT − rA| ≤ 0.0005.

For the relative errors, however,

|Rel(r1,A)| = |r1,T − r1,A| / r1,T ≤ 0.0005/25.9605 ≈ 1.93 × 10⁻⁵,

|Rel(r2,A)| = |r2,T − r2,A| / r2,T ≤ 0.0005/0.0385 ≈ 0.0130.

Why does r2,A have such poor accuracy in comparison to r1,A?
Propagation of errors
Example

The answer is the loss of significance error involved in the formula used to calculate r2,A. Instead, use the mathematically equivalent formula (valid since (13 − √168)(13 + √168) = 169 − 168 = 1):

r2,A = 1/(13 + √168) ≈ 1/25.961 ≈ 0.038519.

This results in a much more accurate answer, at the expense of an additional division.
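A minimal sketch of this comparison, with 12.961 playing the role of the 5-digit table entry:

% Two ways to compute the small root of x^2 - 26x + 1 = 0.
s   = 12.961;          % sqrt(168) correctly rounded to 5 digits
r2a = 13 - s;          % direct formula: severe cancellation
r2b = 1/(13 + s);      % rewritten formula: no cancellation
r2T = 13 - sqrt(168);  % accurate reference value
fprintf('direct: %.6f   rewritten: %.6f   reference: %.6f\n', r2a, r2b, r2T)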
Propagation of errors
Errors in function evaluation

Suppose we are evaluating a function f(x) in the machine. Then the result is generally not f(x), but rather an approximation of it, which we denote by f̃(x).

Now suppose that we have a number xA ≈ xT. We want to calculate f(xT), but instead we evaluate f̃(xA). What can we say about the error in this latter computed quantity,

f(xT) − f̃(xA)?
Propagation of errors
Errors in function evaluation

f(xT) − f̃(xA) = [f(xT) − f(xA)] + [f(xA) − f̃(xA)].

The quantity f(xA) − f̃(xA) is the "noise" in the evaluation of f(xA) in the computer, and we will return later to some discussion of it.

The quantity f(xT) − f(xA) is called the propagated error. It is the error that results from using perfect arithmetic in the evaluation of the function.

If the function f(x) is differentiable, then we can use the mean-value theorem to write

f(xT) − f(xA) = f′(ξ)(xT − xA)

for some ξ between xT and xA.
Propagation of errors
Errors in function evaluation

Since usually xT and xA are close together, we can say that ξ is close to either of them, and

f(xT) − f(xA) = f′(ξ)(xT − xA) ≈ f′(xT)(xT − xA) ≈ f′(xA)(xT − xA).
Propagation of errors
Example

Define f(x) = b^x, where b is a positive real number. Then the last formula yields

b^xT − b^xA ≈ (ln b) b^xT (xT − xA),

so

Rel(b^xA) ≈ (ln b) b^xT (xT − xA) / b^xT
          = (ln b)(xT − xA)
          = (xT ln b) · (xT − xA)/xT
          = K · Rel(xA),   with K = xT ln b.

Note that if K = 10⁴ and Rel(xA) = 10⁻⁷, then Rel(b^xA) ≈ 10⁻³.

This is a large decrease in accuracy, and it is independent of how we actually calculate b^x. The number K is called a condition number for the computation.
Summation

Let S be a sum with a relatively large number of terms

    S = a_1 + a_2 + ... + a_n        (1)

where a_j, j = 1, ..., n, are floating point numbers. The summation process consists of n - 1 consecutive additions

    S = (((...(a_1 + a_2) + a_3) + ... + a_{n-1}) + a_n.

Define

    S_2 = fl(a_1 + a_2)
    S_3 = fl(S_2 + a_3)
    S_4 = fl(S_3 + a_4)
    ...
    S_n = fl(S_{n-1} + a_n)

Recall the formula fl(x) = x(1 + ε).
Then

    S_2 = (a_1 + a_2)(1 + ε_2)
    S_3 = (S_2 + a_3)(1 + ε_3)
    S_4 = (S_3 + a_4)(1 + ε_4)
    ...
    S_n = (S_{n-1} + a_n)(1 + ε_n)

and therefore

    S_3 = (S_2 + a_3)(1 + ε_3)
        = ((a_1 + a_2)(1 + ε_2) + a_3)(1 + ε_3)
        ≈ (a_1 + a_2 + a_3) + a_1(ε_2 + ε_3) + a_2(ε_2 + ε_3) + a_3 ε_3,

dropping the products of the ε's, which are of second order.
Similarly,

    S_4 ≈ (a_1 + a_2 + a_3 + a_4) + a_1(ε_2 + ε_3 + ε_4) + a_2(ε_2 + ε_3 + ε_4) + a_3(ε_3 + ε_4) + a_4 ε_4

Finally,

    S_n ≈ (a_1 + a_2 + ... + a_n) + a_1(ε_2 + ... + ε_n) + a_2(ε_2 + ... + ε_n)
        + a_3(ε_3 + ... + ε_n) + a_4(ε_4 + ... + ε_n) + ... + a_n ε_n
We are interested in the error S - S_n:

    S - S_n ≈ -a_1(ε_2 + ... + ε_n) - a_2(ε_2 + ... + ε_n) - a_3(ε_3 + ... + ε_n)
            - a_4(ε_4 + ... + ε_n) - ... - a_n ε_n

From the last relation we can establish a strategy for summation that minimizes the error S - S_n: first rearrange the terms in increasing order,

    |a_1| ≤ |a_2| ≤ |a_3| ≤ ... ≤ |a_n|

In this case the smaller numbers a_1 and a_2 are multiplied by the longer error sums ε_2 + ... + ε_n, while the larger number a_n is multiplied only by the single ε_n.
Summation with chopping (SL = smallest terms added first, LS = largest terms added first)

    n      Exact value   SL      Error   LS      Error
    10     2.929         2.928   0.001   2.927   0.002
    25     3.816         3.813   0.003   3.806   0.010
    50     4.499         4.491   0.008   4.470   0.020
    100    5.187         5.170   0.017   5.142   0.045
    200    5.878         5.841   0.037   5.786   0.092
    500    6.793         6.692   0.101   6.569   0.224
    1000   7.486         7.284   0.202   7.069   0.417
Summation with rounding

    n      Exact value   SL       Error    LS       Error
    10     2.929         2.929     0       2.929     0
    25     3.816         3.816     0       3.817    -0.001
    50     4.499         4.500    -0.001   4.498     0.001
    100    5.187         5.187     0       5.187     0
    200    5.878         5.878     0       5.876     0.002
    500    6.793         6.794    -0.001   6.783     0.010
    1000   7.486         7.486     0       7.449     0.037
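The effect shown in these two tables can be reproduced in MATLAB. A minimal sketch, assuming the terms are a_j = 1/j (the 'Exact value' column coincides with the harmonic numbers H_n); IEEE single precision stands in here for the short machine arithmetic behind the tables, so the absolute errors are smaller, but smallest-first (SL) accumulation remains at least as accurate as largest-first (LS):

    % Compare smallest-first (SL) and largest-first (LS) accumulation
    % of S = 1/1 + 1/2 + ... + 1/n in single precision.
    n = 1000;
    a = single(1 ./ (1:n));           % terms 1/j, largest first
    exact = sum(1 ./ (1:n));          % double precision reference
    sSL = single(0);
    for j = n:-1:1                    % add smallest terms first
        sSL = sSL + a(j);
    end
    sLS = single(0);
    for j = 1:n                       % add largest terms first
        sLS = sLS + a(j);
    end
    fprintf('exact %.6f   SL %.6f   LS %.6f\n', exact, sSL, sLS)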
Numerical Analysis
conf. dr. Viorel Bostan
Fall 2010, Lecture 6
Rootfinding

We want to find the numbers x for which

    f(x) = 0,

with f : [a, b] → R a given real-valued function. Here, we denote such roots or zeroes by the Greek letter α, so

    f(α) = 0

Rootfinding problems occur in many contexts. Sometimes they are a direct formulation of some physical situation, but more often they are an intermediate step in solving a much larger problem.
Bisection method

Most methods for solving f(x) = 0 are iterative methods. This means that such a method, given an initial guess x_0, will provide us with a sequence of consecutively computed solutions x_1, x_2, x_3, ..., x_n, ... such that x_n → α.

We begin with the simplest of such methods, one which most people use at some time.

Suppose we are given a function f(x), and we assume we have an interval [a, b] containing the root, on which the function is continuous. We also assume we are given an error tolerance ε > 0, and we want an approximate root α̃ ∈ [a, b] for which

    |α - α̃| < ε
Bisection method

The bisection method is based on the following theorem:

Theorem
If f : [a, b] → R is a continuous function on the closed and bounded interval [a, b] and

    f(a) · f(b) < 0,

then there exists α ∈ [a, b] such that f(α) = 0.

Therefore, we further assume that the function f(x) changes sign on [a, b].
Bisection Algorithm: Bisect(f, a, b, ε)

Step 1: Define c = (a + b)/2.

Step 2: If b - c ≤ ε, accept c as our root, and then stop.

Step 3: If b - c > ε, then compare the sign of f(c) to that of f(a) and f(b). If

    sign(f(b)) · sign(f(c)) ≤ 0,

then replace a with c; otherwise, replace b with c. Return to Step 1.

Note that we prefer checking the condition sign(f(b)) · sign(f(c)) ≤ 0 instead of f(b) · f(c) ≤ 0, since the product f(b)f(c) can underflow or overflow.
[Figure: bisection method. The root α is bracketed by [a_1, b_1] with b_1 = b; the midpoint c_1 becomes a_2, and the halved interval is bisected again at c_2.]
Bisection method

Example
Consider the function

    f(x) = x^6 - x - 1

We want to find the largest root with an accuracy of ε = 0.001. It can be seen from the graph of the function that the root is located in [1, 2]. Also, note that the function is continuous. Let a = 1 and b = 2; then f(a) = -1 and f(b) = 61, so the function changes its sign and thus all conditions are satisfied.
Bisection method

    n    a_n       b_n       c_n       f(c_n)       b_n - c_n
    1    1.00000   2.00000   1.50000    8.891e+00   5.000e-01
    2    1.00000   1.50000   1.25000    1.565e+00   2.500e-01
    3    1.00000   1.25000   1.12500   -9.771e-02   1.250e-01
    4    1.12500   1.25000   1.18750    6.167e-01   6.250e-02
    5    1.12500   1.18750   1.15625    2.333e-01   3.125e-02
    6    1.12500   1.15625   1.14063    6.158e-02   1.563e-02
    7    1.12500   1.14063   1.13281   -1.958e-02   7.813e-03
    8    1.13281   1.14063   1.13672    2.062e-02   3.906e-03
    9    1.13281   1.13672   1.13477    4.268e-04   1.953e-03
    10   1.13281   1.13477   1.13379   -9.598e-03   9.766e-04
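A direct MATLAB transcription of the algorithm above (a minimal sketch; the function name bisect is ours). Called as bisect(@(x) x^6 - x - 1, 1, 2, 0.001), it reproduces the table, stopping at c_10 = 1.13379.

    function c = bisect(f, a, b, eps)
    % Bisection method: find a root of f in [a,b] to within eps.
    % Assumes f is continuous and changes sign on [a,b].
    if sign(f(a)) * sign(f(b)) > 0
        error('f must change sign on [a,b]');
    end
    while true
        c = (a + b) / 2;              % Step 1
        if b - c <= eps               % Step 2: interval small enough
            return
        end
        if sign(f(b)) * sign(f(c)) <= 0
            a = c;                    % Step 3: root lies in [c,b]
        else
            b = c;                    % Step 3: root lies in [a,c]
        end
    end
    end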
Error analysis for bisection method

Let a_n, b_n and c_n be the values produced by the bisection method at iteration n. Evidently,

    b_{n+1} - a_{n+1} = (1/2)(b_n - a_n),

and therefore

    b_n - a_n = (1/2)(b_{n-1} - a_{n-1}) = (1/2^2)(b_{n-2} - a_{n-2}) = ... = (1/2^{n-1})(b - a)

Since either α ∈ [a_n, c_n] or α ∈ [c_n, b_n], we have

    |α - c_n| ≤ c_n - a_n = b_n - c_n = (1/2)(b_n - a_n) = (1/2^n)(b - a)
Error analysis for bisection method

    |α - c_n| ≤ (1/2^n)(b - a)

This relation provides us with a stopping criterion for the bisection method. Moreover, it follows that c_n → α as n → ∞.

Suppose we want to estimate the number of iterations of the bisection method necessary to find the root with an error tolerance ε. From

    |α - c_n| ≤ (1/2^n)(b - a) ≤ ε

we get

    n ≥ ln((b - a)/ε) / ln 2

For the previous example we get

    n ≥ ln(1/0.001) / ln 2 ≈ 9.97,

so 10 iterations are needed, in agreement with the table.
Advantages and Disadvantages of Bisection method

Advantages:
1. It always converges.
2. You have a guaranteed error bound, and it decreases with each successive iteration.
3. You have a guaranteed rate of convergence. The error bound decreases by 1/2 with each iteration.

Disadvantages:
1. It is relatively slow when compared with other rootfinding methods we will study, especially when the function f(x) has several continuous derivatives about the root α.
2. The algorithm has no check to see whether ε is too small for the computer arithmetic being used.

We also assume the function f(x) is continuous on the given interval [a, b], but there is no way for the computer to confirm this.
Rootfinding

We want to find the root α of a given function f(x). Thus we want to find the point x at which the graph of y = f(x) intersects the x-axis. One of the principles of numerical analysis is the following.

Numerical Analysis Principle
If you cannot solve the given problem, then solve a "nearby problem".

How do we obtain a nearby problem for f(x) = 0? Begin by asking for types of problems which we can solve easily. At the top of the list should be that of finding where a straight line intersects the x-axis. Thus we seek to replace f(x) = 0 by the problem of solving p(x) = 0 for some linear polynomial p(x) that approximates f(x) in the vicinity of the root α.
[Figure: Newton's method. The tangent to y = f(x) at (x_0, f(x_0)) crosses the x-axis at x_1, closer to the root α.]
Newton's method

Let x_0 be an initial guess, sufficiently close to the root α. Consider the tangent line to the graph of f(x) at (x_0, f(x_0)). The tangent intersects the x-axis at x_1, a point closer to α. The tangent has the equation

    p_1(x) = f(x_0) + f'(x_0)(x - x_0)

Since p_1(x_1) = 0, we get

    f(x_0) + f'(x_0)(x_1 - x_0) = 0
    x_1 = x_0 - f(x_0)/f'(x_0)

Similarly, we get x_2:

    x_2 = x_1 - f(x_1)/f'(x_1)
Newton's method

Repeat this process to obtain the sequence x_1, x_2, x_3, ... that hopefully will converge to α. The general scheme for Newton's method: starting with the initial guess x_0, compute iteratively

    x_{n+1} = x_n - f(x_n)/f'(x_n),  n = 0, 1, 2, ...
Newton's method

Example
Apply Newton's method to

    f(x) = x^6 - x - 1,    f'(x) = 6x^5 - 1

to get

    x_{n+1} = x_n - (x_n^6 - x_n - 1)/(6x_n^5 - 1),  n ≥ 0

Use the initial guess x_0 = 1.5.
Newton's method

    n   x_n          f(x_n)      x_n - x_{n-1}   α - x_{n-1}
    0   1.50000000    8.89e+0
    1   1.30049088    2.54e+0    -2.00e-1        -3.65e-1
    2   1.18148042    5.38e-1    -1.19e-1        -1.66e-1
    3   1.13945559    4.92e-2    -4.20e-2        -4.68e-2
    4   1.13477763    5.50e-4    -4.68e-3        -4.73e-3
    5   1.13472415    7.11e-8    -5.35e-5        -5.35e-5
    6   1.13472414    1.55e-15   -6.91e-9        -6.91e-9

The true solution is α = 1.134724138.
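The table can be reproduced with a few lines of MATLAB (a minimal sketch using the same data, x_0 = 1.5):

    % Newton's method for f(x) = x^6 - x - 1 with x0 = 1.5.
    f  = @(x) x^6 - x - 1;
    df = @(x) 6*x^5 - 1;
    x = 1.5;
    for n = 1:6
        x = x - f(x)/df(x);                   % Newton step
        fprintf('%d  %.8f  %.2e\n', n, x, f(x));
    end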
Newton's method. Division example

Here we consider a division algorithm (based on Newton's method) implemented in some computers in the past. Say we are interested in computing a/b = a · (1/b), where 1/b is computed using Newton's method applied to

    f(x) ≡ b - 1/x = 0,

with b positive. The root of this equation is α = 1/b. Since

    f'(x) = 1/x^2,

Newton's method for this problem becomes

    x_{n+1} = x_n - (b - 1/x_n)/(1/x_n^2)

Simplifying,

    x_{n+1} = x_n(2 - b x_n),  n ≥ 0
Newton's method. Division example

The initial guess x_0 must be close enough to the true solution, and of course x_0 > 0. Consider the error

    α - x_{n+1} = 1/b - x_{n+1}
                = (1 - b x_{n+1})/b
                = (1 - b x_n(2 - b x_n))/b
                = (1 - b x_n)^2 / b

On the other hand,

    Rel(x_{n+1}) = (α - x_{n+1})/α = 1 - b x_{n+1}
Newton's method. Division example

It can be shown (try it!) that

    Rel(x_{n+1}) = (Rel(x_n))^2

In order to guarantee convergence x_n → α, we need

    |Rel(x_0)| < 1,   i.e.   0 < x_0 < 2/b

For example, suppose that |Rel(x_0)| = 0.1. Then

    Rel(x_1) = 10^{-2},  Rel(x_2) = 10^{-4},  Rel(x_3) = 10^{-8},  Rel(x_4) = 10^{-16}
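A short MATLAB sketch of the division iteration; b = 3 and x_0 = 0.3 are our illustrative choices, so that Rel(x_0) = 1 - b·x_0 = 0.1, as in the example above:

    % Newton iteration for 1/b without any division:
    % x_{n+1} = x_n (2 - b x_n).
    b = 3;
    x = 0.3;                          % 0 < x0 < 2/b, Rel(x0) = 0.1
    for n = 1:5
        x = x * (2 - b*x);
        fprintf('%d  %.16f  Rel = %.3e\n', n, x, 1 - b*x);
    end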
[Figure: the curve y = b - 1/x with root 1/b, the Newton iterates x_0, x_1 for the division example, and the convergence interval (0, 2/b).]
Error analysis for Newton's method

Let f(x) ∈ C^2[a, b] and α ∈ [a, b], and let f'(α) ≠ 0. Consider the Taylor formula for f(x) about x_n:

    f(x) = f(x_n) + (x - x_n) f'(x_n) + ((x - x_n)^2 / 2) f''(ξ_n),

where ξ_n is between x and x_n. Take x = α to get

    f(α) = f(x_n) + (α - x_n) f'(x_n) + ((α - x_n)^2 / 2) f''(ξ_n),

with ξ_n between α and x_n. Since f(α) = 0, dividing by f'(x_n) gives

    0 = f(x_n)/f'(x_n) + (α - x_n) + (α - x_n)^2 f''(ξ_n)/(2 f'(x_n)),

and, using the definition of x_{n+1},

    α - x_{n+1} = (α - x_n)^2 [ -f''(ξ_n)/(2 f'(x_n)) ]
Error analysis for Newton's method

For the previous example, f''(x) = 30x^4. We have

    -f''(ξ_n)/(2 f'(x_n)) ≈ -f''(α)/(2 f'(α)) = -30α^4/(2(6α^5 - 1)) ≈ -2.42

Therefore

    α - x_{n+1} ≈ -2.42 (α - x_n)^2

For example, if n = 3 we get α - x_3 ≈ -4.73e-03 and

    α - x_4 ≈ -2.42 (α - x_3)^2 ≈ -5.42e-05,

a result in accordance with the value presented in the table: α - x_4 ≈ -5.35e-05.
Error analysis for Newton's method

If the iterate x_n is close to α (and therefore ξ_n is also close to α), we have

    -f''(ξ_n)/(2 f'(x_n)) ≈ -f''(α)/(2 f'(α)) ≡ M,

so that

    α - x_{n+1} ≈ M (α - x_n)^2

Thus Newton's method is quadratically convergent, provided f'(α) ≠ 0 and f(x) is twice differentiable in the vicinity of the root α.

We can also use this to explore the 'interval of convergence' of Newton's method. Multiply both sides by M to get

    M(α - x_{n+1}) ≈ [M(α - x_n)]^2

and inductively

    M(α - x_n) ≈ [M(α - x_0)]^{2^n},  n ≥ 0

We want these quantities to decrease, and this suggests choosing x_0 so that

    |M(α - x_0)| < 1,   i.e.   |α - x_0| < 1/|M| = |2 f'(α)/f''(α)|

In other words, this is what is needed to guarantee the convergence of Newton's method. If |M| is very large, then we may need a very good initial guess in order to have the iterates x_n converge to α.
ADVANTAGES & DISADVANTAGES

Advantages of Newton's method:
1. It is rapidly convergent in most cases.
2. It is simple in its formulation, and therefore relatively easy to apply and program.
3. It is intuitive in its construction. This means it is easier to understand its behaviour: when it is likely to behave well and when it may behave poorly.

Disadvantages:
1. It may not converge.
2. It is likely to have difficulty if f'(α) = 0. This condition means the x-axis is tangent to the graph of y = f(x) at x = α.
3. It needs to know both f(x) and f'(x). Contrast this with the bisection method, which requires only f(x).
THE SECANT METHOD

Newton's method was based on using the line tangent to the curve y = f(x), with the point of tangency (x_0, f(x_0)). When x_0 ≈ α, the graph of the tangent line is approximately the same as the graph of y = f(x) around x = α. We then used the root of the tangent line to approximate α.

Consider instead using an approximating line based on 'interpolation'. We assume we have two estimates of the root α, say x_0 and x_1. Then we produce a linear function

    q(x) = a_0 + a_1 x

with

    q(x_0) = f(x_0),  q(x_1) = f(x_1)    (*)

This line is sometimes called a secant line. Its equation is given by

    q(x) = [(x_1 - x) f(x_0) + (x - x_0) f(x_1)] / (x_1 - x_0)
[Figures: two configurations of the secant line through (x_0, f(x_0)) and (x_1, f(x_1)) on y = f(x), crossing the x-axis at x_2 near the root α.]
This is linear in x, and by direct evaluation it satisfies the interpolation conditions (*). We now solve the equation q(x) = 0, denoting the root by x_2. This yields

    x_2 = x_1 - f(x_1) · (x_1 - x_0)/(f(x_1) - f(x_0))

We can now repeat the process. Use x_1 and x_2 to produce another secant line, and then use its root to approximate α. This yields the general iteration formula

    x_{n+1} = x_n - f(x_n) · (x_n - x_{n-1})/(f(x_n) - f(x_{n-1})),  n = 1, 2, 3, ...

This is called the secant method for solving f(x) = 0.
Example. We solve the equation

    f(x) ≡ x^6 - x - 1 = 0,

which was used previously as an example for both the bisection and Newton methods. The quantity x_n - x_{n-1} is used as an estimate of α - x_{n-1}. The iterate x_8 equals α rounded to nine significant digits. As with Newton's method for this equation, the initial iterates do not converge rapidly. But as the iterates become closer to α, the speed of convergence increases.

    n   x_n          f(x_n)      x_n - x_{n-1}   α - x_{n-1}
    0   2.0           61.0
    1   1.0          -1.0        -1.0
    2   1.01612903   -9.15E-1     1.61E-2         1.35E-1
    3   1.19057777    6.57E-1     1.74E-1         1.19E-1
    4   1.11765583   -1.68E-1    -7.29E-2        -5.59E-2
    5   1.13253155   -2.24E-2     1.49E-2         1.71E-2
    6   1.13481681    9.54E-4     2.29E-3         2.19E-3
    7   1.13472365   -5.07E-6    -9.32E-5        -9.27E-5
    8   1.13472414   -1.13E-9     4.92E-7         4.92E-7
It is clear from the numerical results that the secant method requires more iterates than the Newton method. But note that the secant method does not require knowledge of f'(x), whereas Newton's method requires both f(x) and f'(x).

Note also that the secant method can be considered an approximation of the Newton method

    x_{n+1} = x_n - f(x_n)/f'(x_n),

obtained by using the approximation

    f'(x_n) ≈ (f(x_n) - f(x_{n-1}))/(x_n - x_{n-1})
CONVERGENCE ANALYSIS

With a combination of algebraic manipulation and the mean-value theorem from calculus, we can show

    α - x_{n+1} = (α - x_n)(α - x_{n-1}) [ -f''(ξ_n)/(2 f'(ζ_n)) ],    (**)

with ξ_n and ζ_n unknown points. The point ξ_n is located between the minimum and maximum of x_{n-1}, x_n and α; and ζ_n is located between the minimum and maximum of x_{n-1} and x_n. Recall that the Newton iterates satisfied

    α - x_{n+1} = (α - x_n)^2 [ -f''(ξ_n)/(2 f'(x_n)) ],

which closely resembles (**) above.

Using (**), it can be shown that x_n converges to α, and moreover

    lim_{n→∞} |α - x_{n+1}| / |α - x_n|^r = |f''(α)/(2 f'(α))|^{r-1} ≡ c,

where r = (1 + √5)/2 ≈ 1.62. This assumes that x_0 and x_1 are chosen sufficiently close to α; how close will vary with the function f. In addition, the above result assumes f(x) has two continuous derivatives for all x in some interval about α.
The above says that when we are close to α,

    |α - x_{n+1}| ≈ c |α - x_n|^r

This looks very much like the Newton result

    α - x_{n+1} ≈ M(α - x_n)^2,    M = -f''(α)/(2 f'(α)),

and indeed c = |M|^{r-1}. Both the secant and Newton methods converge at faster than a linear rate, and they are called superlinear methods.

The secant method converges more slowly than Newton's method, but it is still quite rapid. It is rapid enough that we can prove

    lim_{n→∞} |x_{n+1} - x_n| / |α - x_n| = 1,

and therefore

    |α - x_n| ≈ |x_{n+1} - x_n|

is a good error estimator.
A note of warning: do not combine the secant formula and write it in the form

    x_{n+1} = (f(x_n) x_{n-1} - f(x_{n-1}) x_n) / (f(x_n) - f(x_{n-1}))

This has enormous loss-of-significance errors as compared with the earlier formulation.
COSTS OF SECANT & NEWTON METHODS

The Newton method

    x_{n+1} = x_n - f(x_n)/f'(x_n),  n = 0, 1, 2, ...

requires two function evaluations per iteration, that of f(x_n) and f'(x_n). The secant method

    x_{n+1} = x_n - f(x_n) · (x_n - x_{n-1})/(f(x_n) - f(x_{n-1})),  n = 1, 2, 3, ...

requires one function evaluation per iteration, following the initial step. For this reason, the secant method is often faster in time, even though more iterates are needed with it than with Newton's method to attain a similar accuracy.
ADVANTAGES & DISADVANTAGES

Advantages of the secant method:
1. It converges at faster than a linear rate, so that it is more rapidly convergent than the bisection method.
2. It does not require use of the derivative of the function, something that is not available in a number of applications.
3. It requires only one function evaluation per iteration, as compared with Newton's method, which requires two.

Disadvantages of the secant method:
1. It may not converge.
2. There is no guaranteed error bound for the computed iterates.
3. It is likely to have difficulty if f'(α) = 0. This means the x-axis is tangent to the graph of y = f(x) at x = α.
4. Newton's method generalizes more easily to new methods for solving simultaneous systems of nonlinear equations.
BRENT'S METHOD

Richard Brent devised a method combining the advantages of the bisection method and the secant method.
1. It is guaranteed to converge.
2. It has an error bound which will converge to zero in practice.
3. For most problems f(x) = 0, with f(x) differentiable about the root α, the method behaves like the secant method.
4. In the worst case, it is not too much worse in its convergence than the bisection method.

In MATLAB, it is implemented as fzero; and it is present in most Fortran numerical analysis libraries.
FIXED POINT ITERATION

We begin with a computational example. Consider solving the two equations

    E1: x = 1 + 0.5 sin x
    E2: x = 3 + 2 sin x

Graphs of these two equations are shown in the accompanying figures, with the solutions being

    E1: α = 1.49870113351785
    E2: α = 3.09438341304928

We are going to use a numerical scheme called 'fixed point iteration'. It amounts to making an initial guess x_0 and substituting it into the right side of the equation. The resulting value is denoted by x_1; then the process is repeated, this time substituting x_1 into the right side. This is repeated until convergence occurs or until the iteration is terminated.

In the above cases, we show the results of the first 10 iterations in the accompanying table. Clearly convergence is occurring with E1, but not with E2. Why?
[Figures: y = x plotted against y = 1 + 0.5 sin x and against y = 3 + 2 sin x; each intersection is the fixed point α.]
    n    E1: x_n             E2: x_n
    0    0.00000000000000    3.00000000000000
    1    1.00000000000000    3.28224001611973
    2    1.42073549240395    2.71963177181556
    3    1.49438099256432    3.81910025488514
    4    1.49854088439917    1.74629389651652
    5    1.49869535552190    4.96927957214762
    6    1.49870092540704    1.06563065299216
    7    1.49870112602244    4.75018861639465
    8    1.49870113324789    1.00142864236516
    9    1.49870113350813    4.68448404916097
    10   1.49870113351750    1.00077863465869
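The experiment is easy to rerun in MATLAB (a minimal sketch; the starting values 0 and 3 are those used in the table):

    % Fixed point iteration x_{n+1} = g(x_n) for E1 and E2.
    g1 = @(x) 1 + 0.5*sin(x);
    g2 = @(x) 3 + 2*sin(x);
    x1 = 0; x2 = 3;
    for n = 1:10
        x1 = g1(x1);                  % converges toward 1.49870...
        x2 = g2(x2);                  % wanders, no convergence
        fprintf('%2d  %.14f  %.14f\n', n, x1, x2);
    end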
The above iterations can be written symbolically as

    E1: x_{n+1} = 1 + 0.5 sin x_n
    E2: x_{n+1} = 3 + 2 sin x_n

for n = 0, 1, 2, ... Why does one of these iterations converge, but not the other? The graphs show similar behaviour, so why the difference? Consider one more example.

Suppose we are solving the equation

    x^2 - 5 = 0,

with exact root α = √5 ≈ 2.2361, using iterates of the form x_{n+1} = g(x_n). Consider four different iterations:

    I1: x_{n+1} = 5 + x_n - x_n^2
    I2: x_{n+1} = 5/x_n
    I3: x_{n+1} = 1 + x_n - (1/5) x_n^2
    I4: x_{n+1} = (1/2)(x_n + 5/x_n)

All of them, if they converge, will converge to α = √5 (just take the limit as n → ∞ of each relation).
    n    I1: x_n       I2: x_n   I3: x_n   I4: x_n
    0     1.0e+00      1.0       1.0       1.0
    1     5.0000e+00   5.0       1.8000    3.0000
    2    -1.5000e+01   1.0       2.1520    2.3333
    3    -2.3500e+02   5.0       2.2258    2.2381
    4    -5.5455e+04   1.0       2.2350    2.2361
    5    -3.0753e+09   5.0       2.2360    2.2361
    6    -9.4575e+18   1.0       2.2361    2.2361
    7    -8.9445e+37   5.0       2.2361    2.2361
    8    -8.0004e+75   1.0       2.2361    2.2361
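A sketch reproducing the four columns in MATLAB (all iterations started from x_0 = 1, as in the table):

    % Four fixed point iterations for x^2 - 5 = 0.
    g = {@(x) 5 + x - x^2, ...        % I1
         @(x) 5/x, ...                % I2
         @(x) 1 + x - x^2/5, ...      % I3
         @(x) (x + 5/x)/2};           % I4
    x = [1 1 1 1];
    for n = 1:8
        for k = 1:4
            x(k) = g{k}(x(k));
        end
        fprintf('%d  %11.4e  %6.4f  %6.4f  %6.4f\n', n, x);
    end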
As another example, note that the Newton method

    x_{n+1} = x_n - f(x_n)/f'(x_n)

is also a fixed point iteration, for the equation

    x = x - f(x)/f'(x)

In general, we are interested in solving equations

    x = g(x)

by means of fixed point iteration:

    x_{n+1} = g(x_n),  n = 0, 1, 2, ...

It is called 'fixed point iteration' because the root α is a fixed point of the function g(x), meaning that α is a number for which g(α) = α.
EXISTENCE THEOREM

We begin by asking whether the equation x = g(x) has a solution. For this to occur, the graphs of y = x and y = g(x) must intersect, as seen in the earlier graphs. There are several lemmas and theorems that give conditions under which we are guaranteed there is a fixed point α.

Lemma 1. Let g(x) be a continuous function on the interval [a, b], and suppose it satisfies the property

    a ≤ x ≤ b  ⇒  a ≤ g(x) ≤ b

Then the equation x = g(x) has at least one solution α in the interval [a, b].

The proof of this is fairly intuitive. Look at the function f(x) = x - g(x) on a ≤ x ≤ b. Evaluating at the endpoints, f(a) ≤ 0 and f(b) ≥ 0. The function f(x) is continuous on [a, b], and therefore it has a zero in the interval.
Theorem. Assume g(x) and g'(x) exist and are continuous on the interval [a, b]; and further, assume

    a ≤ x ≤ b  ⇒  a ≤ g(x) ≤ b    (#)

    λ ≡ max_{a ≤ x ≤ b} |g'(x)| < 1

Then:

S1. The equation x = g(x) has a unique solution α in [a, b].

S2. For any initial guess x_0 in [a, b], the iteration

    x_{n+1} = g(x_n),  n = 0, 1, 2, ...

will converge to α.

S3.

    |α - x_n| ≤ (λ^n/(1 - λ)) |x_1 - x_0|,  n ≥ 0

S4.

    lim_{n→∞} (α - x_{n+1})/(α - x_n) = g'(α)

Thus for x_n close to α,

    α - x_{n+1} ≈ g'(α)(α - x_n)
The proof is given in the text, and I go over only a portion of it here. For S2, note that from (#), if x_0 is in [a, b], then

    x_1 = g(x_0)

is also in [a, b]. Repeat the argument to show that x_2 = g(x_1) belongs to [a, b]. This can be continued by induction to show that every x_n belongs to [a, b].

We need the following general result. For any two points w and z in [a, b],

    g(w) - g(z) = g'(c)(w - z)

for some unknown point c between w and z. Therefore,

    |g(w) - g(z)| ≤ λ |w - z|

for any a ≤ w, z ≤ b.

For S3, subtract x_{n+1} = g(x_n) from α = g(α) to get

    α - x_{n+1} = g(α) - g(x_n) = g'(c_n)(α - x_n)    ($)

    |α - x_{n+1}| ≤ λ |α - x_n|    (*)

with c_n between α and x_n. From (*), the error is guaranteed to decrease by a factor of λ with each iteration. This leads to

    |α - x_n| ≤ λ^n |α - x_0|,  n ≥ 0

With some extra manipulation, we can obtain the error bound in S3.

For S4, use ($) to write

    (α - x_{n+1})/(α - x_n) = g'(c_n)

Since x_n → α and c_n is between α and x_n, we have g'(c_n) → g'(α).
The statement

    α - x_{n+1} ≈ g'(α)(α - x_n)

tells us that when near to the root α, the errors will decrease by a constant factor of g'(α). If this is negative, then the errors will oscillate between positive and negative, and the iterates will approach α from both sides. When g'(α) is positive, the iterates will approach α from only one side.

The statements

    α - x_{n+1} = g'(c_n)(α - x_n)
    α - x_{n+1} ≈ g'(α)(α - x_n)

also tell us a bit more of what happens when |g'(α)| > 1. Then the errors will increase as we approach the root, rather than decrease in size.
Look at the earlier examples

    E1: x = 1 + 0.5 sin x
    E2: x = 3 + 2 sin x

In the first case E1,

    g(x) = 1 + 0.5 sin x
    g'(x) = 0.5 cos x
    |g'(α)| ≤ 1/2

Therefore the fixed point iteration x_{n+1} = 1 + 0.5 sin x_n will converge for E1.

For the second case E2,

    g(x) = 3 + 2 sin x
    g'(x) = 2 cos x
    g'(α) = 2 cos(3.09438341304928) ≈ -1.998

Therefore the fixed point iteration x_{n+1} = 3 + 2 sin x_n will diverge for E2.
Consider the example x^2 - 5 = 0.

(I1) g(x) = 5 + x - x^2; g'(x) = 1 - 2x; g'(α) = 1 - 2√5 < -1. Thus x_n = g(x_{n-1}) does not converge to √5.

(I2) g(x) = 5/x; g'(x) = -5/x^2; g'(α) = -1. Therefore x_n = g(x_{n-1}) may be either convergent or divergent; the theorem does not decide, and the numerical results show divergence (the iterates oscillate between 1 and 5).

(I3) g(x) = 1 + x - (1/5)x^2; g'(x) = 1 - (2/5)x; g'(α) = 1 - (2/5)√5 ≈ 0.106. Thus x_n = g(x_{n-1}) converges to √5. Moreover, we have

    |α - x_{n+1}| ≈ 0.106 |α - x_n|

if x_n is sufficiently close to α. The errors decrease with a linear rate of 0.106.

(I4) g(x) = (1/2)(x + 5/x); g'(x) = (1/2)(1 - 5/x^2); g'(α) = 0. The sequence x_n = g(x_{n-1}) converges to √5 with an order of convergence bigger than 1.
Sometimes it is difficult to express the equation f(x) = 0 in the form x = g(x) such that the resulting iterates will converge. Such a process is presented in the following examples.

Example 1. Let x^4 - x - 1 = 0, rewritten as

    x = (1 + x)^{1/4},

which provides us with the iteration

    x_0 = 1,  x_{n+1} = (1 + x_n)^{1/4},  n ≥ 0

This sequence will converge to α ≈ 1.2207.

Example 2. Let x^3 + x - 1 = 0, rewritten as

    x = 1/(1 + x^2),

with fixed point iteration

    x_0 = 1,  x_{n+1} = 1/(1 + x_n^2),  n ≥ 0

This will converge to α ≈ 0.6823. The iterations are represented graphically in the following figure.
[Figure: the fixed point iterates x_0, x_1, x_2, x_3 of x_{n+1} = 1/(1 + x_n^2) approaching α = 0.6823 on the graphs of y = x and y = g(x).]

[Figures: cobweb plots of x_{n+1} = g(x_n) against y = x for the four cases 0 < g'(α) < 1, -1 < g'(α) < 0, g'(α) > 1, and g'(α) < -1: the iterates converge from one side, converge while oscillating about α, diverge from one side, and diverge while oscillating, respectively.]
Besides convergence, we would like to know how fast the sequence x_n = g(x_{n-1}) converges to the solution, in other words how fast the error α - x_n decreases. We say that the sequence {x_n}, n ≥ 0, converges to α with order of convergence p ≥ 1 if

    |α - x_{n+1}| ≤ c |α - x_n|^p,  n ≥ 0,

where c ≥ 0 is a constant. The cases p = 1, p = 2 and p = 3 are called linear, quadratic and cubic convergence. In the case of linear convergence, the constant c is called the rate of linear convergence, and we additionally require c < 1; otherwise the sequence of errors α - x_n can fail to converge to zero. Also, for linear convergence we can use the relation

    |α - x_n| ≤ c^n |α - x_0|,  n ≥ 0

Thus the bisection method is linearly convergent with rate 1/2, Newton's method is quadratically convergent, and the secant method has order of convergence p = (1 + √5)/2.
If |g'(α)| < 1, from the last theorem we have that the iterates xn are at least linearly convergent. If, in addition, g'(α) ≠ 0, then we have exactly linear convergence with rate |g'(α)|. In practice, the last theorem is rarely used, since it is quite difficult to find an interval [a, b] such that g([a, b]) ⊂ [a, b]. To simplify the usage of the theorem we consider the following corollary.
Corollary: Assume x = g(x) has a solution α, and
further assume that both g(x) and g'(x) are continuous
for all x in some interval about α. In addition, assume
    |g'(α)| < 1   (**)
Then for any sufficiently small number ε > 0, the interval
[a, b] = [α − ε, α + ε] will satisfy the hypotheses of
the preceding theorem.
This means that if (**) is true, and if we choose x0
sufficiently close to α, then the fixed point iteration
xn+1 = g(xn) will converge and the earlier results
S1–S4 will all hold. The corollary does not tell us how
close we need to be to α in order to have convergence.
NEWTON’S METHOD
For Newton's method
    xn+1 = xn − f(xn)/f'(xn)
we have that it is a fixed point iteration with
    g(x) = x − f(x)/f'(x)
Check its convergence by checking the condition (**):
    g'(x) = 1 − f'(x)/f'(x) + f(x)f''(x)/[f'(x)]^2 = f(x)f''(x)/[f'(x)]^2
    g'(α) = 0
Therefore the Newton method will converge if x0 is
chosen sufficiently close to α.
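As a concrete illustration (the test equation and starting point are assumptions, not from the text), a minimal Newton iteration in MATLAB:
    f  = @(x) x.^2 - 5;            % test function, an assumption
    df = @(x) 2*x;
    x = 2.5;                       % x0 chosen close to alpha = sqrt(5)
    for n = 1:6
        x = x - f(x)/df(x);        % xn+1 = xn - f(xn)/f'(xn)
    end
    fprintf('x = %.15f, error = %.2e\n', x, abs(x - sqrt(5)))
The error is roughly squared at each step, as the result g'(α) = 0 predicts.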
HIGHER ORDER METHODS
What happens when g'(α) = 0? We use Taylor's
theorem to answer this question.
Begin by writing
    g(x) = g(α) + g'(α)(x − α) + (1/2)g''(c)(x − α)^2
with c between x and α. Substitute x = xn and
recall that g(xn) = xn+1 and g(α) = α. Also assume
g'(α) = 0. Then
    xn+1 = α + (1/2)g''(cn)(xn − α)^2
    α − xn+1 = −(1/2)g''(cn)(xn − α)^2
with cn between α and xn. Thus if g'(α) = 0, the
fixed point iteration is quadratically convergent or better.
In fact, if g''(α) ≠ 0, then the iteration is exactly
quadratically convergent.
ANOTHER RAPID ITERATION
Newton's method is rapid, but requires use of the
derivative f'(x). Can we get by without it? The
answer is yes! Consider the method
    Dn = [f(xn + f(xn)) − f(xn)] / f(xn)
    xn+1 = xn − f(xn)/Dn
This is an approximation to Newton's method, with
f'(xn) ≈ Dn. To analyze its convergence, regard it
as a fixed point iteration with
    D(x) = [f(x + f(x)) − f(x)] / f(x)
    g(x) = x − f(x)/D(x)
Then we can, with some difficulty, show g'(α) = 0
and g''(α) ≠ 0. This proves this new iteration is
quadratically convergent.
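A sketch of this derivative-free iteration (a Steffensen-type method; the test function and starting point are assumptions):
    f = @(x) x.^2 - 5;
    x = 2.5;
    for n = 1:6
        fx = f(x);
        D  = (f(x + fx) - fx)/fx;  % Dn, the difference approximation to f'(xn)
        x  = x - fx/D;             % xn+1 = xn - f(xn)/Dn
    end
    fprintf('x = %.15f\n', x)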
FIXED POINT ITERATION: ERROR
Recall the result
    lim_{n→∞} (α − xn)/(α − xn−1) = g'(α)
for the iteration
    xn = g(xn−1),  n = 1, 2, ...
Thus
    α − xn ≈ λ(α − xn−1)   (***)
with λ = g'(α) and |λ| < 1.
If we were to know λ, then we could solve (***) for α:
    α ≈ (xn − λxn−1)/(1 − λ)
Usually, we write this as a modification of the currently
computed iterate xn:
    α ≈ (xn − λxn−1)/(1 − λ)
      = (xn − λxn)/(1 − λ) + (λxn − λxn−1)/(1 − λ)
      = xn + [λ/(1 − λ)][xn − xn−1]
The formula
    xn + [λ/(1 − λ)][xn − xn−1]
is said to be an extrapolation of the numbers xn−1 and xn. But what is λ?
From
    lim_{n→∞} (α − xn)/(α − xn−1) = g'(α)
we have
    λ ≈ (α − xn)/(α − xn−1)
Unfortunately this also involves the unknown root α
which we seek; and we must find some other way of
estimating λ.
To calculate λ consider the ratio
    λn = (xn − xn−1)/(xn−1 − xn−2)
To see this is approximately λ as xn approaches α, write
    (xn − xn−1)/(xn−1 − xn−2) = [g(xn−1) − g(xn−2)]/(xn−1 − xn−2) = g'(cn)
with cn between xn−1 and xn−2. As the iterates approach α,
the number cn must also approach α. Thus
λn approaches λ as xn → α.
We combine these results to obtain the estimate
    x̂n = xn + [λn/(1 − λn)][xn − xn−1],  λn = (xn − xn−1)/(xn−1 − xn−2)
We call x̂n the Aitken extrapolate of {xn−2, xn−1, xn}; and α ≈ x̂n.
We can also rewrite this as
    α − xn ≈ x̂n − xn = [λn/(1 − λn)][xn − xn−1]
This is called Aitken's error estimation formula.
The accuracy of these procedures is tied directly to
the accuracy of the formulas
    α − xn ≈ λ(α − xn−1),  α − xn−1 ≈ λ(α − xn−2)
If these are accurate, then so are the above extrapolation
and error estimation formulas.
EXAMPLE
Consider the iteration
xn+1 = 6.28 + sin(xn), n = 0, 1, 2, ...
for solving
x = 6.28 + sin x
Iterates are shown on the accompanying sheet, including
calculations of λn and the error estimate
    α − xn ≈ x̂n − xn = [λn/(1 − λn)][xn − xn−1]   (Estimate)
The latter is called "Estimate" in the table. In this
instance,
    g'(α) ≐ .9644
and therefore the convergence is very slow. This is
apparent in the table.
AITKEN’S ALGORITHM
Step 1: Select x0.
Step 2: Calculate
    x1 = g(x0),  x2 = g(x1)
Step 3: Calculate
    x3 = x2 + [λ2/(1 − λ2)][x2 − x1],  λ2 = (x2 − x1)/(x1 − x0)
Step 4: Calculate
    x4 = g(x3),  x5 = g(x4)
and calculate x6 as the extrapolate of {x3, x4, x5}.
Continue this procedure, ad infinitum.
Of course in practice we will have some kind of error
test to stop this procedure when we believe we have
sufficient accuracy.
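A compact sketch of this algorithm in MATLAB, using the iteration from the earlier example (the starting guess and the fixed number of passes are assumptions):
    g  = @(x) 6.28 + sin(x);
    x0 = 6.0;                                % starting guess (assumed)
    for pass = 1:5                           % each pass: two iterates, one extrapolation
        x1 = g(x0);  x2 = g(x1);
        lam = (x2 - x1)/(x1 - x0);           % lambda estimate from three iterates
        x0  = x2 + lam/(1 - lam)*(x2 - x1);  % Aitken extrapolate of {x0, x1, x2}
    end
    disp(x0)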
EXAMPLE
Consider again the iteration
xn+1 = 6.28 + sin(xn), n = 0, 1, 2, ...
for solving
x = 6.28 + sin x
Now we use the Aitken method, and the results are
shown in the accompanying table. With this we have
    α − x3 = 7.98 × 10^−4,  α − x6 = 2.27 × 10^−6
In comparison, the original iteration had
    α − x6 = 1.23 × 10^−2
GENERAL COMMENTS
Aitken extrapolation can greatly accelerate the con-
vergence of a linearly convergent iteration
xn+1 = g(xn)
This shows the power of understanding the behaviour
of the error in a numerical process. From that un-
derstanding, we can often improve the accuracy, thru
extrapolation or some other procedure.
This is a justification for using mathematical analyses
to understand numerical methods. We will see this
repeated at later points in the course, and it holds
with many different types of problems and numerical
methods for their solution.
MULTIPLE ROOTS
We study two classes of functions for which there is
additional difficulty in calculating their roots. The first
of these are functions in which the desired root has a
multiplicity greater than 1. What does this mean?
Let α be a root of the function f(x), and imagine
writing it in the factored form
    f(x) = (x − α)^m h(x)
with some integer m ≥ 1 and some continuous function
h(x) for which h(α) ≠ 0. Then we say that α
is a root of f(x) of multiplicity m. For example, the
function
    f(x) = e^{x^2} − 1
has x = 0 as a root of multiplicity m = 2. In particular,
define
    h(x) = (e^{x^2} − 1)/x^2
for x ≠ 0.
Using Taylor polynomial approximations, we can show
for x ≠ 0 that
    h(x) ≈ 1 + (1/2)x^2 + (1/6)x^4
and lim_{x→0} h(x) = 1.
This leads us to extend the definition of h(x) to
    h(x) = (e^{x^2} − 1)/x^2 for x ≠ 0,  h(0) = 1
Thus
    f(x) = x^2 h(x)
as asserted, and x = 0 is a root of f(x) of multiplicity
m = 2.
Roots for which m = 1 are called simple roots, and
the methods studied to this point were intended for
such roots. We now consider the case of m > 1.
If the function f(x) is m-times differentiable around
α, then we can differentiate
    f(x) = (x − α)^m h(x)
m times to obtain an equivalent formulation of what
it means for the root to have multiplicity m.
For an example, consider the case
    f(x) = (x − α)^3 h(x)
Then
    f'(x) = 3(x − α)^2 h(x) + (x − α)^3 h'(x) ≡ (x − α)^2 h2(x)
    h2(x) = 3h(x) + (x − α)h'(x),  h2(α) = 3h(α) ≠ 0
This shows α is a root of f'(x) of multiplicity 2.
Differentiating a second time, we can show
    f''(x) = (x − α)h3(x)
for a suitably defined h3(x) with h3(α) ≠ 0, and α is
a simple root of f''(x).
Differentiating a third time, we have
    f'''(α) = h3(α) ≠ 0
We can use this as part of a proof of the following: α
is a root of f(x) of multiplicity m = 3 if and only if
    f(α) = f'(α) = f''(α) = 0,  f'''(α) ≠ 0
In general, α is a root of f(x) of multiplicity m if and
only if
    f(α) = · · · = f^(m−1)(α) = 0,  f^(m)(α) ≠ 0
DIFFICULTIES OF MULTIPLE ROOTS
There are two main difficulties with the numerical cal-
culation of multiple roots (by which we mean m > 1
in the definition).
1. Methods such as Newton’s method and the se-
cant method converge more slowly than for the
case of a simple root.
2. There is a large interval of uncertainty in the pre-
cise location of a multiple root on a computer or
calculator.
The second of these is the more difficult to deal with,
but we begin with the first for the case of Newton’s
method.
Recall that we can regard Newton's method as a fixed
point method:
    xn+1 = g(xn),  g(x) = x − f(x)/f'(x)
Then we substitute
    f(x) = (x − α)^m h(x)
to obtain
    g(x) = x − (x − α)^m h(x) / [m(x − α)^(m−1) h(x) + (x − α)^m h'(x)]
         = x − (x − α)h(x) / [m h(x) + (x − α)h'(x)]
Then we can use this to show
    g'(α) = 1 − 1/m = (m − 1)/m
For m > 1, this is nonzero, and therefore Newton's
method is only linearly convergent:
    α − xn+1 ≈ λ(α − xn),  λ = (m − 1)/m
Similar results hold for the secant method.
There are ways of improving the speed of convergence
of Newton's method, creating a modified method that
is again quadratically convergent. In particular, consider
the fixed point iteration formula
    xn+1 = g(xn),  g(x) = x − m f(x)/f'(x)
in which we assume the multiplicity m of the root α
being sought to be known. Then, modifying the above
argument on the convergence of Newton's method,
we obtain
    g'(α) = 1 − m · (1/m) = 0
and the iteration method will be quadratically convergent.
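A sketch comparing the two iterations on a root of multiplicity m = 3 (the test polynomial is the one used in the example further below, and m is treated as known):
    f  = @(x) (x - 1.1).^3 .* (x - 2.1);         % alpha = 1.1, multiplicity m = 3
    df = @(x) 3*(x - 1.1).^2 .* (x - 2.1) + (x - 1.1).^3;
    m = 3;  x = 0.8;  xN = 0.8;
    for n = 1:8
        x  = x  - m*f(x)/df(x);                  % modified Newton: quadratic
        xN = xN - f(xN)/df(xN);                  % plain Newton: only linear here
    end
    fprintf('modified: %.10f   plain: %.10f\n', x, xN)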
But this is not the fundamental problem posed by
multiple roots.
NOISE IN FUNCTION EVALUATION
Recall the discussion of noise in evaluating a function
f(x), and in our case consider the evaluation for val-
ues of x near to α. In the following figures, the noise
as measured by vertical distance is the same in both
graphs.
[Figures: f(x) near a simple root and near a double root, with the same
vertical band of noise in both graphs; the interval of x-values where the
computed f(x) can change sign is much wider for the double root.]
Noise was discussed earlier, and as an example we used the function
    f(x) = x^3 − 3x^2 + 3x − 1 ≡ (x − 1)^3
Because of the noise in evaluating f(x), it appears from the graph that
f(x) has many zeros around x = 1, whereas the exact function outside of
the computer has only the root α = 1, of multiplicity 3. Any rootfinding
method to find a multiple root α that uses evaluations of f(x) is doomed
to having a large interval of uncertainty as to the location of the root.
If high accuracy is desired, then the only satisfactory solution is to
reformulate the problem as a new problem F(x) = 0 in which α is a simple
root of F. Then use a standard rootfinding method to calculate α. It is
important that the evaluation of F(x) not involve f(x) directly, as that
is the source of the noise and the uncertainty.
EXAMPLE
Consider finding the roots of
    f(x) = (x − 1.1)^3 (x − 2.1) = 2.7951 − 8.954x + 10.56x^2 − 5.4x^3 + x^4
This has a root of multiplicity 3 at 1.1. Newton's method gives:

    n   xn         f(xn)     α − xn     Rate
    0   0.800000   0.03510   0.300000
    1   0.892857   0.01073   0.207143   0.690
    2   0.958176   0.00325   0.141824   0.685
    3   1.00344    0.00099   0.09656    0.681
    4   1.03486    0.00029   0.06514    0.675
    5   1.05581    0.00009   0.04419    0.678
    6   1.07028    0.00003   0.02972    0.673
    7   1.08092    0.0       0.01908    0.642

From an examination of the rate of linear convergence of
Newton's method applied to this function, one can guess
with high probability that the multiplicity is m = 3. Then
form exactly the second derivative
    f''(x) = 21.12 − 32.4x + 12x^2
Applying Newton's method to this with a guess of x = 1
will lead to rapid convergence to α = 1.1.
In general, if we know the root α has multiplicity m > 1,
then replace the problem by that of solving
    f^(m−1)(x) = 0
since α is a simple root of this equation.
STABILITY
Generally we expect the world to be stable. By this,
we mean that if we make a small change in something,
then we expect this to lead to other correspondingly
small changes. In fact, if we think about this carefully,
then we know this need not be true. We now illustrate
this for the case of rootfinding.
Consider the polynomial
    f(x) = x^7 − 28x^6 + 322x^5 − 1960x^4 + 6769x^3 − 13132x^2 + 13068x − 5040
This has the exact roots {1, 2, 3, 4, 5, 6, 7}. Now consider
the perturbed polynomial
    F(x) = x^7 − 28.002x^6 + 322x^5 − 1960x^4 + 6769x^3 − 13132x^2 + 13068x − 5040
This is a relatively small change in one coefficient, of
relative error
    −.002/−28 = 7.14 × 10^−5
What are the roots of F(x)?

    Root of f(x)   Root of F(x)             Error
    1              1.0000028                −2.8E−6
    2              1.9989382                1.1E−3
    3              3.0331253                −0.033
    4              3.8195692                0.180
    5              5.4586758 + .54012578i   −.46 − .54i
    6              5.4586758 − .54012578i   −.46 + .54i
    7              7.2330128                −0.233
Why have some of the roots departed so radically from
the original values? This phenomenon goes under a
variety of names. We sometimes say this is an example
of an unstable or ill-conditioned rootfinding problem.
These words are often used in a casual manner, but
they also have a very precise meaning in many areas
of numerical analysis (and more generally, in all of
mathematics).
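The experiment is easy to reproduce with MATLAB's roots function, using the coefficients listed above:
    c  = [1 -28     322 -1960 6769 -13132 13068 -5040];   % f(x)
    cp = [1 -28.002 322 -1960 6769 -13132 13068 -5040];   % F(x), one coefficient perturbed
    disp(roots(c).')       % close to 1, 2, ..., 7
    disp(roots(cp).')      % two of the computed roots are complex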
A PERTURBATION ANALYSIS
We want to study what happens to the root of a func-
tion f(x) when it is perturbed by a small amount. For
some function g(x) and for all small ε, define a perturbed
function
    Fε(x) = f(x) + ε g(x)
The polynomial example would fit this if we use
    g(x) = x^6,  ε = −.002
Let α0 be a simple root of f(x). It can be shown (using
the implicit function theorem from calculus)
that if f(x) and g(x) are differentiable for x ≈ α0,
and if f'(α0) ≠ 0, then Fε(x) has a unique simple
root α(ε) near to α0 = α(0) for all small values of ε.
Moreover, α(ε) will be a differentiable function of ε.
We use this to estimate α(ε).
The linear Taylor polynomial approximation of α(ε) is
given by
    α(ε) ≈ α(0) + ε α'(0)
We need to find a formula for α'(0). Recall that
    Fε(α(ε)) = 0
for all small values of ε. Differentiate this as a function
of ε, using the chain rule. Then we obtain
    F'ε(α(ε)) = f'(α(ε))α'(ε) + g(α(ε)) + ε g'(α(ε))α'(ε) = 0
for all small ε. Substitute ε = 0, recall α(0) = α0,
and solve for α'(0) to obtain
    f'(α0)α'(0) + g(α0) = 0
    α'(0) = −g(α0)/f'(α0)
This then leads to
    α(ε) ≈ α(0) + ε α'(0) = α0 − ε g(α0)/f'(α0)   (*)
Example: In our earlier polynomial example, consider
the simple root α0 = 3. Then g(α0) = 3^6 and
f'(3) = 48, so
    α(ε) ≈ 3 − ε · 3^6/48 ≐ 3 − 15.2ε
With ε = −.002, we obtain
    α(−.002) ≈ 3 − 15.2 × (−.002) ≐ 3.0304
This is close to the actual root of 3.0331253.
However, the approximation (*) is not good at esti-
mating the change in the roots 5 and 6. By ob-
servation, the perturbation in the root is a complex
number, whereas the formula (*) predicts only a per-
turbation that is real. The value of ε is too large to
have (*) be accurate for the roots 5 and 6.
DISCUSSION
Looking again at the formula
    α(ε) ≈ α0 − ε g(α0)/f'(α0)
we have that the size of
    ε g(α0)/f'(α0)
is an indication of the stability of the solution α0.
If this quantity is large, then potentially we will have
difficulty. Of course, not all functions g(x) are equally
possible, and we need to look only at functions g(x)
that will possibly occur in practice.
One quantity of interest is the size of f'(α0). If it
is very small relative to ε g(α0), then we are likely to
have difficulty in finding α0 accurately.
INTERPOLATION
Interpolation is a process of finding a formula (often
a polynomial) whose graph will pass through a given
set of points (x, y).
As an example, consider defining
    x0 = 0,  x1 = π/4,  x2 = π/2
and
    yi = cos xi,  i = 0, 1, 2
This gives us the three points
    (0, 1),  (π/4, 1/√2),  (π/2, 0)
Now find a quadratic polynomial
    p(x) = a0 + a1 x + a2 x^2
for which
    p(xi) = yi,  i = 0, 1, 2
The graph of this polynomial is shown on the accompanying
graph. We later give an explicit formula.
[Figure: quadratic interpolation of cos(x) on [0, π/2], showing y = cos(x) and y = p2(x).]
PURPOSES OF INTERPOLATION
1. Replace a set of data points {(xi, yi)} with a function given analytically.
2. Approximate functions with simpler ones, usually
polynomials or ‘piecewise polynomials’.
Purpose #1 has several aspects.
• The data may be from a known class of functions.
Interpolation is then used to find the member of
this class of functions that agrees with the given
data. For example, data may be generated from
functions of the form
    p(x) = a0 + a1 e^x + a2 e^{2x} + · · · + an e^{nx}
Then we need to find the coefficients {aj} based
on the given data values.
• We may want to take function values f(x) given in a table for selected values of x, often equally
spaced, and extend the function to values of x
not in the table.
For example, given numbers from a table of loga-
rithms, estimate the logarithm of a number x not
in the table.
• Given a set of data points {(xi, yi)}, find a curve passing thru these points that is "pleasing to the
eye”. In fact, this is what is done continually with
computer graphics. How do we connect a set of
points to make a smooth curve? Connecting them
with straight line segments will often give a curve
with many corners, whereas what was intended
was a smooth curve.
Purpose #2 for interpolation is to approximate func-
tions f(x) by simpler functions p(x), perhaps to make
it easier to integrate or differentiate f(x). That will
be the primary reason for studying interpolation in this
course.
As an example of why this is important, consider the
problem of evaluating
    I = ∫₀¹ dx/(1 + x^10)
This is very difficult to do analytically. But we will
look at producing polynomial interpolants of the integrand;
and polynomials are easily integrated exactly.
We begin by using polynomials as our means of doing
interpolation. Later in the chapter, we consider more
complex ‘piecewise polynomial’ functions, often called
‘spline functions’.
LINEAR INTERPOLATION
The simplest form of interpolation is probably the
straight line, connecting two points by a straight line.
Let two data points (x0, y0) and (x1, y1) be given.
There is a unique straight line passing through these
points. We can write the formula for a straight line as
    P1(x) = a0 + a1 x
In fact, there are other more convenient ways to write
it, and we give several of them below.
    P1(x) = [(x − x1)/(x0 − x1)] y0 + [(x − x0)/(x1 − x0)] y1
          = [(x1 − x) y0 + (x − x0) y1] / (x1 − x0)
          = y0 + [(x − x0)/(x1 − x0)] [y1 − y0]
          = y0 + [(y1 − y0)/(x1 − x0)] (x − x0)
Check each of these by evaluating them at x = x0
and x = x1 to see if the respective values are y0 and y1.
Example. Following is a table of values for f(x) = tan x
for a few values of x.

    x      1        1.1      1.2      1.3
    tan x  1.5574   1.9648   2.5722   3.6021

Use linear interpolation to estimate tan(1.15). Then use
    x0 = 1.1,  x1 = 1.2
with corresponding values for y0 and y1. Then
    tan x ≈ y0 + [(x − x0)/(x1 − x0)] [y1 − y0]
    tan(1.15) ≈ 1.9648 + [(1.15 − 1.1)/(1.2 − 1.1)] [2.5722 − 1.9648] = 2.2685
The true value is tan 1.15 = 2.2345. We will want
to examine formulas for the error in interpolation, to
know when we have sufficient accuracy in our interpolant.
[Figures: y = tan(x) on [1, 1.3]; and y = tan(x) with the linear interpolant y = p1(x) on [1.1, 1.2].]
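MATLAB's interp1 carries out exactly this piecewise linear table interpolation; a quick check against the hand computation above:
    x = [1 1.1 1.2 1.3];
    y = [1.5574 1.9648 2.5722 3.6021];    % tan(x) values from the table
    interp1(x, y, 1.15)                    % returns 2.2685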
QUADRATIC INTERPOLATION
We want to find a polynomial
    P2(x) = a0 + a1 x + a2 x^2
which satisfies
    P2(xi) = yi,  i = 0, 1, 2
for given data points (x0, y0), (x1, y1), (x2, y2). One
formula for such a polynomial follows:
    P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)   (∗∗)
with
    L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
    L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
    L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]
The formula (∗∗) is called Lagrange's form of the
interpolation polynomial.
LAGRANGE BASIS FUNCTIONS
The functions
    L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
    L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
    L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]
are called 'Lagrange basis functions' for quadratic
interpolation. They have the properties
    Li(xj) = 1 if i = j,  0 if i ≠ j
for i, j = 0, 1, 2. Also, they all have degree 2. Their
graphs are on an accompanying page.
As a consequence of each Li(x) being of degree 2, we
have that the interpolant
    P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)
must have degree ≤ 2.
UNIQUENESS
Can there be another polynomial, call it Q(x), for which
    deg(Q) ≤ 2,  Q(xi) = yi,  i = 0, 1, 2?
That is, is the Lagrange formula P2(x) unique?
Introduce
    R(x) = P2(x) − Q(x)
From the properties of P2 and Q, we have deg(R) ≤ 2.
Moreover,
    R(xi) = P2(xi) − Q(xi) = yi − yi = 0
for all three node points x0, x1, and x2. How many
polynomials R(x) are there of degree at most 2 and
having three distinct zeros? The answer is that only
the zero polynomial satisfies these properties, and therefore
    R(x) = 0 for all x
    Q(x) = P2(x) for all x
SPECIAL CASES
Consider the data points
(x0, 1), (x1, 1), (x2, 1)
What is the polynomial P2(x) in this case?
Answer: The polynomial interpolant must be
    P2(x) ≡ 1,
meaning that P2(x) is the constant function. Why?
First, the constant function satisfies the property of
being of degree ≤ 2. Next, it clearly interpolates the
given data. Therefore, by the uniqueness of quadratic
interpolation, P2(x) must be the constant function 1.
Consider now the data points
    (x0, m x0),  (x1, m x1),  (x2, m x2)
for some constant m. What is P2(x) in this case? By
an argument similar to that above,
    P2(x) = mx for all x
Thus the degree of P2(x) can be less than 2.
HIGHER DEGREE INTERPOLATION
We consider now the case of interpolation by polynomials
of a general degree n. We want to find a polynomial
Pn(x) for which
    deg(Pn) ≤ n,  Pn(xi) = yi,  i = 0, 1, · · · , n   (∗∗)
with given data points
    (x0, y0), (x1, y1), · · · , (xn, yn)
The solution is given by Lagrange's formula
    Pn(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x)
The Lagrange basis functions are given by
    Lk(x) = [(x − x0) · · · (x − xk−1)(x − xk+1) · · · (x − xn)]
            / [(xk − x0) · · · (xk − xk−1)(xk − xk+1) · · · (xk − xn)]
for k = 0, 1, 2, ..., n. The quadratic case was covered earlier.
In a manner analogous to the quadratic case, we can
show that the above Pn(x) is the only solution to the
problem (∗∗).
In this formula we can see that each such function Lk(x)
is a polynomial of degree n. In addition,
    Lk(xi) = 1 if k = i,  0 if k ≠ i
Using these properties, it follows that the formula
    Pn(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x)
satisfies the interpolation problem of finding a solution to
    deg(Pn) ≤ n,  Pn(xi) = yi,  i = 0, 1, · · · , n
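A direct evaluation of Lagrange's formula, as a sketch (the function name is a hypothetical helper, vectorized in the evaluation points t):
    function p = lagrange_eval(x, y, t)
    % x, y: interpolation data; t: points at which to evaluate Pn(t)
    n = numel(x);
    p = zeros(size(t));
    for k = 1:n
        Lk = ones(size(t));                    % build the basis function Lk(t)
        for i = [1:k-1, k+1:n]
            Lk = Lk .* (t - x(i)) / (x(k) - x(i));
        end
        p = p + y(k) * Lk;
    end
    end
For instance, lagrange_eval([1 1.1 1.2 1.3], [1.5574 1.9648 2.5722 3.6021], 1.15) reproduces the cubic value P3(1.15) = 2.2296 from the example below.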
EXAMPLE
Recall the table

    x      1        1.1      1.2      1.3
    tan x  1.5574   1.9648   2.5722   3.6021

We now interpolate this table with the nodes
    x0 = 1,  x1 = 1.1,  x2 = 1.2,  x3 = 1.3
Without giving the details of the evaluation process,
we have the following results for interpolation with
degrees n = 1, 2, 3.

    n          1        2        3
    Pn(1.15)   2.2685   2.2435   2.2296
    Error      −.0340   −.0090   .0049

It improves with increasing degree n, but not at a very
rapid rate. In fact, the error becomes worse when n is
increased further. Later we will see that interpolation
of a much higher degree, say n ≥ 10, is often poorly
behaved when the node points {xi} are evenly spaced.
A FIRST ORDER DIVIDED DIFFERENCE
For a given function f(x) and two distinct points x0 and
x1, define
    f[x0, x1] = [f(x1) − f(x0)] / (x1 − x0)
This is called a first order divided difference of f(x). By
the mean value theorem,
    f(x1) − f(x0) = f'(c)(x1 − x0)
for some c between x0 and x1. Thus
    f[x0, x1] = f'(c)
and the divided difference is very much like the derivative,
especially if x0 and x1 are quite close together. In fact,
    f'((x1 + x0)/2) ≈ f[x0, x1]
is quite an accurate approximation of the derivative.
SECOND ORDER DIVIDED DIFFERENCES
Given three distinct points x0, x1, and x2, define
    f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)
This is called the second order divided difference of f(x).
By a fairly complicated argument, we can show
    f[x0, x1, x2] = (1/2) f''(c)
for some c intermediate to x0, x1, and x2. In fact, as we
will see,
    f''(x1) ≈ 2 f[x0, x1, x2]
in the case the nodes are evenly spaced,
    x1 − x0 = x2 − x1.
EXAMPLE
Consider the table

    x      1        1.1      1.2      1.3      1.4
    cos x  .54030   .45360   .36236   .26750   .16997

Let x0 = 1, x1 = 1.1, and x2 = 1.2. Then
    f[x0, x1] = (.45360 − .54030)/(1.1 − 1) = −.86700
    f[x1, x2] = (.36236 − .45360)/(1.2 − 1.1) = −.91240
    f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)
                  = (−.91240 − (−.86700))/(1.2 − 1.0) = −.22700
For comparison,
    f'((x1 + x0)/2) = −sin(1.05) = −.86742
    (1/2) f''(x1) = −(1/2) cos(1.1) = −.22680
GENERAL DIVIDED DIFFERENCES
Given n + 1 distinct points x0, ..., xn, with n ≥ 2, define
    f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)
This is a recursive definition of the nth-order divided
difference of f(x), using divided differences of order
n − 1. Its relation to the derivative is as follows:
    f[x0, ..., xn] = (1/n!) f^(n)(c)
for some c intermediate to the points {x0, ..., xn}. Let
I denote the interval
    I = [min{x0, ..., xn}, max{x0, ..., xn}]
Then c ∈ I, and the above result is based on the
assumption that f(x) is n-times continuously
differentiable on the interval I.
EXAMPLE
The following table gives divided differences for the data in

    x      1        1.1      1.2      1.3      1.4
    cos x  .54030   .45360   .36236   .26750   .16997

For the column headings, we use
    D^k f(xi) = f[xi, ..., xi+k]

    i   xi    f(xi)    Df(xi)   D^2 f(xi)   D^3 f(xi)   D^4 f(xi)
    0   1.0   .54030   −.8670   −.2270      .1533       .0125
    1   1.1   .45360   −.9124   −.1810      .1583
    2   1.2   .36236   −.9486   −.1335
    3   1.3   .26750   −.9753
    4   1.4   .16997

These were computed using the recursive definition
    f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)
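A sketch of this computation, filling the table column by column:
    x = [1 1.1 1.2 1.3 1.4];
    f = [.54030 .45360 .36236 .26750 .16997];   % cos(x) values from the table
    n = numel(x);
    D = zeros(n);  D(:,1) = f(:);
    for k = 2:n                                  % column k holds order k-1 differences
        for i = 1:n-k+1
            D(i,k) = (D(i+1,k-1) - D(i,k-1)) / (x(i+k-1) - x(i));
        end
    end
    disp(D)     % row i gives f[xi], f[xi,xi+1], ..., as in the table above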
ORDER OF THE NODES
Looking at f[x0, x1], we have
    f[x0, x1] = (f(x1) − f(x0))/(x1 − x0) = (f(x0) − f(x1))/(x0 − x1) = f[x1, x0]
The order of x0 and x1 does not matter. Looking at
    f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)
we can expand it to get
    f[x0, x1, x2] = f(x0)/[(x0 − x1)(x0 − x2)] + f(x1)/[(x1 − x0)(x1 − x2)] + f(x2)/[(x2 − x0)(x2 − x1)]
With this formula, we can show that the order of the
arguments x0, x1, x2 does not matter in the final value
of f[x0, x1, x2] we obtain. Mathematically,
    f[x0, x1, x2] = f[xi0, xi1, xi2]
for any permutation (i0, i1, i2) of (0, 1, 2).
We can show in general that the value of f[x0, ..., xn]
is independent of the order of the arguments {x0, ..., xn},
even though the intermediate steps in its calculation using
    f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)
are order dependent. We can show
    f[x0, ..., xn] = f[xi0, ..., xin]
for any permutation (i0, i1, ..., in) of (0, 1, ..., n).
COINCIDENT NODES
What happens when some of the nodes {x0, ..., xn}
are not distinct? Begin by investigating what happens
when they all come together as a single point x0.
For first order divided differences, we have
    lim_{x1→x0} f[x0, x1] = lim_{x1→x0} (f(x1) − f(x0))/(x1 − x0) = f'(x0)
We extend the definition of f[x0, x1] to coincident
nodes using
    f[x0, x0] = f'(x0)
For second order divided differences, recall
    f[x0, x1, x2] = (1/2) f''(c)
with c intermediate to x0, x1, and x2.
Then as x1 → x0 and x2 → x0, we must also have
that c → x0. Therefore,
    lim_{x1,x2→x0} f[x0, x1, x2] = (1/2) f''(x0)
We therefore define
    f[x0, x0, x0] = (1/2) f''(x0)
For the case of general f[x0, ..., xn], recall that
    f[x0, ..., xn] = (1/n!) f^(n)(c)
for some c intermediate to {x0, ..., xn}. Then
    lim_{x1,...,xn→x0} f[x0, ..., xn] = (1/n!) f^(n)(x0)
and we define
    f[x0, ..., x0] = (1/n!) f^(n)(x0)    (n + 1 identical arguments)
What do we do when only some of the nodes are
coincident? This too can be dealt with, although we
do so here only by examples.
    f[x0, x1, x1] = (f[x1, x1] − f[x0, x1]) / (x1 − x0) = (f'(x1) − f[x0, x1]) / (x1 − x0)
The recursion formula can be used in general in this
way to allow all possible combinations of possibly
coincident nodes.
LAGRANGE'S FORMULA FOR THE INTERPOLATION POLYNOMIAL
Recall the general interpolation problem: find a polynomial
Pn(x) for which
    deg(Pn) ≤ n
    Pn(xi) = yi,  i = 0, 1, ..., n
with given data points
    (x0, y0), (x1, y1), · · · , (xn, yn)
and with {x0, ..., xn} distinct points. The solution to
this problem is given by Lagrange's formula
    Pn(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x)
with {L0(x), ..., Ln(x)} the Lagrange basis polynomials.
Each Lj is of degree n and it satisfies
    Lj(xi) = 1 if j = i,  0 if j ≠ i
for i = 0, 1, ..., n.
THE NEWTON DIVIDED DIFFERENCE FORM
OF THE INTERPOLATION POLYNOMIAL
Let the data values for the problem
    deg(Pn) ≤ n,  Pn(xi) = yi,  i = 0, 1, · · · , n
be generated from a function f(x):
    yi = f(xi),  i = 0, 1, ..., n
Using the divided differences
    f[x0, x1], f[x0, x1, x2], ..., f[x0, ..., xn]
we can write the interpolation polynomials
    P1(x), P2(x), ..., Pn(x)
in a way that is simple to compute.
    P1(x) = f(x0) + f[x0, x1](x − x0)
    P2(x) = f(x0) + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1)
          = P1(x) + f[x0, x1, x2](x − x0)(x − x1)
For the case of the general problem, we have
    Pn(x) = f(x0) + f[x0, x1](x − x0)
            + f[x0, x1, x2](x − x0)(x − x1)
            + f[x0, x1, x2, x3](x − x0)(x − x1)(x − x2)
            + · · ·
            + f[x0, ..., xn](x − x0) · · · (x − xn−1)
From this we have the recursion relation
    Pn(x) = Pn−1(x) + f[x0, ..., xn](x − x0) · · · (x − xn−1)
in which Pn−1(x) interpolates f(x) at the points in {x0, ..., xn−1}.
Example: Recall the table

    i   xi    f(xi)    Df(xi)   D^2 f(xi)   D^3 f(xi)   D^4 f(xi)
    0   1.0   .54030   −.8670   −.2270      .1533       .0125
    1   1.1   .45360   −.9124   −.1810      .1583
    2   1.2   .36236   −.9486   −.1335
    3   1.3   .26750   −.9753
    4   1.4   .16997

with D^k f(xi) = f[xi, ..., xi+k], k = 1, 2, 3, 4. Then
    P1(x) = .5403 − .8670(x − 1)
    P2(x) = P1(x) − .2270(x − 1)(x − 1.1)
    P3(x) = P2(x) + .1533(x − 1)(x − 1.1)(x − 1.2)
    P4(x) = P3(x) + .0125(x − 1)(x − 1.1)(x − 1.2)(x − 1.3)
Using this table and these formulas, we have the following
table of interpolants for the value x = 1.05.
The true value is cos(1.05) = .49757105.

    n          1         2         3          4
    Pn(1.05)   .49695    .49752    .49758     .49757
    Error      6.20E−4   5.00E−5   −1.00E−5   0.0
EVALUATION OF THE DIVIDED DIFFERENCE
INTERPOLATION POLYNOMIAL
Let
    d1 = f[x0, x1]
    d2 = f[x0, x1, x2]
    ...
    dn = f[x0, ..., xn]
Then the formula
    Pn(x) = f(x0) + f[x0, x1](x − x0) + · · · + f[x0, ..., xn](x − x0) · · · (x − xn−1)
can be written as
    Pn(x) = f(x0) + (x − x0)(d1 + (x − x1)(d2 + · · · + (x − xn−2)(dn−1 + (x − xn−1)dn) · · · ))
Thus we have a nested polynomial evaluation, and
this is quite efficient in computational cost.
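A sketch of the nested evaluation, using the top row of the divided difference table from the example above:
    x = [1 1.1 1.2 1.3 1.4];
    d = [.54030 -.8670 -.2270 .1533 .0125];   % f(x0) and the divided differences
    t = 1.05;
    p = d(end);
    for k = numel(d)-1:-1:1
        p = d(k) + (t - x(k))*p;              % peel off one nested factor per step
    end
    fprintf('P4(1.05) = %.5f\n', p)           % about .49757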
ERROR IN LINEAR INTERPOLATION
Let P1(x) denote the linear polynomial interpolating
f(x) at x0 and x1, with f(x) a given function (e.g.
f(x) = cos x). What is the error f(x) − P1(x)?
Let f(x) be twice continuously differentiable on an
interval [a, b] which contains the points {x0, x1}. Then
for a ≤ x ≤ b,
    f(x) − P1(x) = [(x − x0)(x − x1)/2] f''(cx)
for some cx between the minimum and maximum of
x0, x1, and x.
If x1 and x are 'close to x0', then
    f(x) − P1(x) ≈ [(x − x0)(x − x1)/2] f''(x0)
Thus the error acts like a quadratic polynomial, with
zeros at x0 and x1.
EXAMPLE
Let f(x) = log10 x, and in line with typical tables of
log10 x, we take 1 ≤ x, x0, x1 ≤ 10. For definiteness,
let x0 < x1 with h = x1 − x0. Then
    f''(x) = −log10 e / x^2
    log10 x − P1(x) = [(x − x0)(x − x1)/2][−log10 e / cx^2]
                    = (x − x0)(x1 − x)[log10 e / (2cx^2)]
We usually are interpolating with x0 ≤ x ≤ x1; and
in that case, we have
    (x − x0)(x1 − x) ≥ 0,  x0 ≤ cx ≤ x1
and therefore
    (x − x0)(x1 − x)[log10 e / (2x1^2)] ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x)[log10 e / (2x0^2)]
For h = x1 − x0 small, we have for x0 ≤ x ≤ x1
    log10 x − P1(x) ≈ (x − x0)(x1 − x)[log10 e / (2x0^2)]
Typical high school algebra textbooks contain tables
of log10 x with a spacing of h = .01. What is the
error in this case? To look at this, we use
    0 ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x)[log10 e / (2x0^2)]
By simple geometry or calculus,
    max_{x0≤x≤x1} (x − x0)(x1 − x) ≤ h^2/4
Therefore,
    0 ≤ log10 x − P1(x) ≤ (h^2/4)[log10 e / (2x0^2)] ≐ .0543 h^2/x0^2
If we want a uniform bound for all points 1 ≤ x0 ≤ 10,
we have
    0 ≤ log10 x − P1(x) ≤ h^2 log10 e / 8 ≐ .0543 h^2
For h = .01, as is typical of the high school textbook
tables of log10 x,
    0 ≤ log10 x − P1(x) ≤ 5.43 × 10^−6
If you look at most tables, a typical entry is given to
only four decimal places to the right of the decimal
point, e.g.
    log10 5.41 ≐ .7332
Therefore the entries are in error by as much as .00005.
Comparing this with the interpolation error, we see the
latter is less important than the rounding errors in the
table entries.
From the bound
    0 ≤ log10 x − P1(x) ≤ h^2 log10 e / (8x0^2) ≐ .0543 h^2/x0^2
we see the error decreases as x0 increases, and it is
about 100 times smaller for points near 10 than for
points near 1.
AN ERROR FORMULA:
THE GENERAL CASE
Recall the general interpolation problem: find a polynomial
Pn(x) for which
    deg(Pn) ≤ n
    Pn(xi) = f(xi),  i = 0, 1, · · · , n
with distinct node points {x0, ..., xn} and a given
function f(x). Let [a, b] be a given interval on which
f(x) is (n + 1)-times continuously differentiable; and
assume the points x0, ..., xn, and x are contained in
[a, b]. Then
    f(x) − Pn(x) = [(x − x0)(x − x1) · · · (x − xn) / (n + 1)!] f^(n+1)(cx)
with cx some point between the minimum and maximum
of the points in {x, x0, ..., xn}.
As shorthand, introduce
    Ψn(x) = (x − x0)(x − x1) · · · (x − xn)
a polynomial of degree n + 1 with roots {x0, ..., xn}. Then
    f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(cx)
THE QUADRATIC CASE
For n = 2, we have
    f(x) − P2(x) = [(x − x0)(x − x1)(x − x2)/3!] f^(3)(cx)   (*)
with cx some point between the minimum and maximum
of the points in {x, x0, x1, x2}.
To illustrate the use of this formula, consider the case
of evenly spaced nodes:
    x1 = x0 + h,  x2 = x1 + h
Further suppose we have x0 ≤ x ≤ x2, as we would
usually have when interpolating in a table of given
function values (e.g. log10 x). The quantity
    Ψ2(x) = (x − x0)(x − x1)(x − x2)
can be evaluated directly for a particular x.
[Figure: graph of Ψ2(x) = (x + h)x(x − h), using (x0, x1, x2) = (−h, 0, h).]
In the formula (∗), however, we do not know cx, and
therefore we replace |f^(3)(cx)| with the maximum of
|f^(3)(x)| as x varies over x0 ≤ x ≤ x2. This yields
    |f(x) − P2(x)| ≤ [|Ψ2(x)|/3!] max_{x0≤x≤x2} |f^(3)(x)|   (**)
If we want a uniform bound for x0 ≤ x ≤ x2, we must
compute
    max_{x0≤x≤x2} |Ψ2(x)| = max_{x0≤x≤x2} |(x − x0)(x − x1)(x − x2)|
Using calculus,
    max_{x0≤x≤x2} |Ψ2(x)| = 2h^3/(3√3),  at x = x1 ± h/√3
Combined with (∗∗), this yields
    |f(x) − P2(x)| ≤ [h^3/(9√3)] max_{x0≤x≤x2} |f^(3)(x)|
for x0 ≤ x ≤ x2.
For f(x) = log10 x, with 1 ≤ x0 ≤ x ≤ x2 ≤ 10, this
leads to
    |log10 x − P2(x)| ≤ [h^3/(9√3)] · max_{x0≤x≤x2} [2 log10 e / x^3] = .05572 h^3/x0^3
For the case of h = .01, we have
    |log10 x − P2(x)| ≤ 5.57 × 10^−8 / x0^3 ≤ 5.57 × 10^−8
Question: How much larger could we make h so that
quadratic interpolation would have an error comparable
to that of linear interpolation of log10 x with
h = .01? The error bound for the linear interpolation
was 5.43 × 10^−6, and therefore we want the same to
be true of quadratic interpolation. Using a simpler
bound, we want to find h so that
    |log10 x − P2(x)| ≤ .05572 h^3 ≤ 5 × 10^−6
This is true if h = .04477. Therefore a spacing of
h = .04 would be sufficient. A table with this spacing
and quadratic interpolation would have an error
comparable to a table with h = .01 and linear interpolation.
For the case of general n,
    f(x) − Pn(x) = [(x − x0) · · · (x − xn)/(n + 1)!] f^(n+1)(cx)
                 = [Ψn(x)/(n + 1)!] f^(n+1)(cx)
    Ψn(x) = (x − x0)(x − x1) · · · (x − xn)
with cx some point between the minimum and maximum
of the points in {x, x0, ..., xn}. When bounding the error
we replace f^(n+1)(cx) with its maximum
over the interval containing {x, x0, ..., xn}, as we have
illustrated earlier in the linear and quadratic cases.
Consider now the function
    Ψn(x)/(n + 1)!
over the interval determined by the minimum and
maximum of the points in {x, x0, ..., xn}. For evenly
spaced node points on [0, 1], with x0 = 0 and xn = 1,
we give graphs for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9
on accompanying pages.
DISCUSSION OF ERROR
Consider the error
    f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(cx)
    Ψn(x) = (x − x0)(x − x1) · · · (x − xn)
as n increases and as x varies. As noted previously, we
cannot do much with f^(n+1)(cx) except to replace it
with a maximum value of |f^(n+1)(x)| over a suitable
interval. Thus we concentrate on understanding the
size of
    Ψn(x)/(n + 1)!
ERROR FOR EVENLY SPACED NODES
We consider first the case in which the node points
are evenly spaced, as this seems the ‘natural’ way to
define the points at which interpolation is carried out.
Moreover, using evenly spaced nodes is the case to
consider for table interpolation. What can we learn
from the given graphs?
The interpolation nodes are determined by using
    h = 1/n,  x0 = 0, x1 = h, x2 = 2h, ..., xn = nh = 1
For this case,
    Ψn(x) = x(x − h)(x − 2h) · · · (x − 1)
Our graphs are the cases of n = 2, ..., 9.
[Figures: graphs of Ψn(x) on [0, 1] for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9;
and a graph of Ψ6(x) = (x − x0) · · · (x − x6) with evenly spaced nodes.]
Using the following table,

    n   Mn        n    Mn
    1   1.25E−1   6    4.76E−7
    2   2.41E−2   7    2.20E−8
    3   2.06E−3   8    9.11E−10
    4   1.48E−4   9    3.39E−11
    5   9.01E−6   10   1.15E−12

we can observe that the maximum
    Mn ≡ max_{x0≤x≤xn} |Ψn(x)|/(n + 1)!
becomes smaller with increasing n.
From the graphs, there is enormous variation in the
size of Ψn(x) as x varies over [0, 1]; and thus there
is also enormous variation in the error as x so varies.
For example, in the n = 9 case,
    max_{x0≤x≤x1} |Ψn(x)|/(n + 1)! = 3.39 × 10^−11
    max_{x4≤x≤x5} |Ψn(x)|/(n + 1)! = 6.89 × 10^−13
and the ratio of these two errors is approximately 49.
Thus the interpolation error is likely to be around 49
times larger when x0 ≤ x ≤ x1 as compared to the
case when x4 ≤ x ≤ x5. When doing table interpolation,
the point x at which you are interpolating
should be centrally located with respect to the
interpolation nodes {x0, ..., xn} being used to define the
interpolation, if possible.
AN APPROXIMATION PROBLEM
Consider now the problem of using an interpolation
polynomial to approximate a given function f(x) on
a given interval [a, b]. In particular, take interpolation
nodes
a ≤ x0 < x1 < · · · < xn−1 < xn ≤ b
and produce the interpolation polynomial Pn(x) that
interpolates f(x) at the given node points. We would
like to have
    max_{a≤x≤b} |f(x) − Pn(x)| → 0 as n → ∞
Does it happen?
Recall the error bound
    max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [|Ψn(x)|/(n + 1)!] · max_{a≤x≤b} |f^(n+1)(x)|
We begin with an example using evenly spaced node
points.
RUNGE'S EXAMPLE
Use evenly spaced node points:
    h = (b − a)/n,  xi = a + ih for i = 0, ..., n
For some functions, such as f(x) = e^x, the maximum
error goes to zero quite rapidly. But the size of the
derivative term f^(n+1)(x) in
    max_{a≤x≤b} |f(x) − Pn(x)| ≤ [1/(n + 1)!] max_{a≤x≤b} |Ψn(x)| · max_{a≤x≤b} |f^(n+1)(x)|
can badly hurt or destroy the convergence in other cases.
In particular, we show the graph of
    f(x) = 1/(1 + x^2)
and Pn(x) on [−5, 5] for the case n = 10. It can
be proven that for this function, the maximum error on
[−5, 5] does not converge to zero. Thus the use of evenly
spaced nodes is not necessarily a good approach to
approximating a function f(x) by interpolation.
[Figure: Runge's example with n = 10, showing y = 1/(1 + x^2) and the interpolant y = P10(x).]
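A sketch reproducing the experiment (polyfit may warn about conditioning here, which is itself part of the story):
    n  = 10;
    xi = linspace(-5, 5, n+1);                 % evenly spaced nodes
    f  = @(x) 1./(1 + x.^2);
    c  = polyfit(xi, f(xi), n);                % degree 10 interpolant
    xx = linspace(-5, 5, 1001);
    max(abs(f(xx) - polyval(c, xx)))           % large error near the interval ends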
OTHER CHOICES OF NODES
Recall the general error bound
    max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [|Ψn(x)|/(n + 1)!] · max_{a≤x≤b} |f^(n+1)(x)|
There is nothing we can really do with the derivative term
for f; but we can examine the way of defining the
nodes {x0, ..., xn} within the interval [a, b]. We ask
how these nodes can be chosen so that the maximum
of |Ψn(x)| over [a, b] is made as small as possible.
This problem has quite an elegant solution, and it will be
considered in the next lecture. The node points {x0, ..., xn}
turn out to be the zeros of a particular polynomial Tn+1(x)
of degree n + 1, called a Chebyshev polynomial. These
zeros are known explicitly, and with them
    max_{a≤x≤b} |Ψn(x)| = [(b − a)/2]^(n+1) · 2^(−n)
This turns out to be smaller than for the evenly spaced case;
and although this polynomial interpolation does not work
for all functions f(x), it works for all differentiable
functions and more.
ANOTHER ERROR FORMULA
Recall the error formula
    f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(c)
    Ψn(x) = (x − x0)(x − x1) · · · (x − xn)
with c between the minimum and maximum of {x0, ..., xn, x}.
A second formula is given by
    f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
To show this is a simple, but somewhat subtle argument.
Let Pn+1(x) denote the polynomial of degree ≤ n + 1
which interpolates f(x) at the points {x0, ..., xn, xn+1}. Then
    Pn+1(x) = Pn(x) + f[x0, ..., xn, xn+1](x − x0) · · · (x − xn)
Substituting x = xn+1, and using the fact that Pn+1(x)
interpolates f(x) at xn+1, we have
    f(xn+1) = Pn(xn+1) + f[x0, ..., xn, xn+1](xn+1 − x0) · · · (xn+1 − xn)
In this formula, the number xn+1 is completely arbitrary,
other than being distinct from the points in
{x0, ..., xn}. To emphasize this fact, replace xn+1 by
x throughout the formula, obtaining
    f(x) = Pn(x) + f[x0, ..., xn, x](x − x0) · · · (x − xn)
         = Pn(x) + Ψn(x) f[x0, ..., xn, x]
provided x ≠ x0, ..., xn.
The formula is also easily seen to be true for x a node
point, since both sides are then zero; and provided f(x)
is differentiable, the divided difference f[x0, ..., xn, x]
remains defined when x coincides with a node point.
This shows
    f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
Compare the two error formulas
    f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
    f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(c)
Then
    Ψn(x) f[x0, ..., xn, x] = [Ψn(x)/(n + 1)!] f^(n+1)(c)
    f[x0, ..., xn, x] = f^(n+1)(c)/(n + 1)!
for some c between the smallest and largest of the
numbers in {x0, ..., xn, x}.
To make this somewhat symmetric in its arguments,
let m = n + 1, x = xn+1. Then
    f[x0, ..., xm−1, xm] = f^(m)(c)/m!
with c an unknown number between the smallest and
largest of the numbers in {x0, ..., xm}. This was given
in an earlier lecture where divided differences were
introduced.
PIECEWISE POLYNOMIAL INTERPOLATION
Recall the examples of higher degree polynomial
interpolation of the function f(x) = (1 + x^2)^(−1) on
[−5, 5]. The interpolants Pn(x) oscillated a great
deal, whereas the function f(x) was nonoscillatory.
To obtain interpolants that are better behaved, we
look at other forms of interpolating functions.
Consider the data

    x   0     1     2     2.5   3     3.5     4
    y   2.5   0.5   0.5   1.5   1.5   1.125   0

What are methods of interpolating this data, other
than using a degree 6 polynomial? Shown in the text
are the graphs of the degree 6 polynomial interpolant,
along with those of piecewise linear and piecewise
quadratic interpolating functions.
Since we only have the data to consider, we would
generally want to use an interpolant that had somewhat
the shape of that of the piecewise linear interpolant.
[Figures: the data points; the piecewise linear interpolation; the degree 6
polynomial interpolation; and the piecewise quadratic interpolation of the same data.]
PIECEWISE POLYNOMIAL FUNCTIONS
Consider being given a set of data points (x1, y1), ...,
(xn, yn), with
    x1 < x2 < · · · < xn
Then the simplest way to connect the points (xj, yj)
is by straight line segments. This is called a piecewise
linear interpolant of the data {(xj, yj)}. This graph
has "corners", and often we expect the interpolant to
have a smooth graph.
To obtain a somewhat smoother graph, consider using
piecewise quadratic interpolation. Begin by constructing
the quadratic polynomial that interpolates
    {(x1, y1), (x2, y2), (x3, y3)}
Then construct the quadratic polynomial that interpolates
    {(x3, y3), (x4, y4), (x5, y5)}
Continue this process of constructing quadratic
interpolants on the subintervals
    [x1, x3], [x3, x5], [x5, x7], ...
If the number of subintervals is even (and therefore
n is odd), then this process comes out fine, with the
last interval being [xn−2, xn]. This was illustrated
on the graph for the preceding data. If, however, n is
even, then the approximation on the last interval must
be handled by some modification of this procedure.
Suggest such a modification!
With piecewise quadratic interpolants, however, there
are "corners" on the graph of the interpolating function.
With our preceding example, they are at x3 and
x5. How do we avoid this?
Piecewise polynomial interpolants are used in many
applications. We will consider them later, to obtain
numerical integration formulas.
SMOOTH NON-OSCILLATORY
INTERPOLATION
Let data points (x1, y1), ..., (xn, yn) be given, and let
    x1 < x2 < · · · < xn
Consider finding functions s(x) for which the following
properties hold:
(1) s(xi) = yi, i = 1, ..., n
(2) s(x), s'(x), s''(x) are continuous on [x1, xn].
Then among such functions s(x) satisfying these
properties, find the one which minimizes the integral
    ∫_{x1}^{xn} |s''(x)|^2 dx
The idea of minimizing the integral is to obtain an
interpolating function for which the first derivative does
not change rapidly. It turns out there is a unique solution
to this problem, and it is called a natural cubic
spline function.
SPLINE FUNCTIONS
Let a set of node points {xi} be given, satisfying
    a ≤ x1 < x2 < · · · < xn ≤ b
for some numbers a and b. Often we use [a, b] =
[x1, xn]. A cubic spline function s(x) on [a, b] with
"breakpoints" or "knots" {xi} has the following properties:
1. On each of the intervals
    [a, x1], [x1, x2], ..., [xn−1, xn], [xn, b]
   s(x) is a polynomial of degree ≤ 3.
2. s(x), s'(x), s''(x) are continuous on [a, b].
In the case that we have given data points (x1, y1), ...,
(xn, yn), we say s(x) is a cubic interpolating spline
function for this data if
3. s(xi) = yi, i = 1, ..., n.
EXAMPLE
Define
    (x − α)^3_+ = (x − α)^3 for x ≥ α,  0 for x ≤ α
This is a cubic spline function on (−∞, ∞) with the
single breakpoint x1 = α.
Combinations of these form more complicated cubic
spline functions. For example,
    s(x) = 3(x − 1)^3_+ − 2(x − 3)^3_+
is a cubic spline function on (−∞, ∞) with the
breakpoints x1 = 1, x2 = 3.
Define
    s(x) = p3(x) + Σ_{j=1}^{n} aj (x − xj)^3_+
with p3(x) some cubic polynomial. Then s(x) is a
cubic spline function on (−∞, ∞) with breakpoints
{x1, ..., xn}.
Return to the earlier problem of choosing an interpolating
function s(x) to minimize the integral
    ∫_{x1}^{xn} |s''(x)|^2 dx
There is a unique solution to this problem. The solution
s(x) is a cubic interpolating spline function, and moreover,
it satisfies
    s''(x1) = s''(xn) = 0
Spline functions satisfying these boundary conditions
are called "natural" cubic spline functions, and the
solution to our minimization problem is a "natural cubic
interpolatory spline function". We will show a method
to construct this function from the interpolation data.
Motivation for these boundary conditions can be given
by looking at the physics of bending thin beams of
flexible materials to pass thru the given data. To the
left of x1 and to the right of xn, the beam is straight
and therefore the second derivatives are zero at the
transition points x1 and xn.
CONSTRUCTION OF THE
INTERPOLATING SPLINE FUNCTION
To make the presentation more specific, suppose we
have data
    (x1, y1), (x2, y2), (x3, y3), (x4, y4)
with x1 < x2 < x3 < x4. Then on each of the intervals
    [x1, x2], [x2, x3], [x3, x4]
s(x) is a cubic polynomial. Taking the first interval,
s(x) is a cubic polynomial and s''(x) is a linear
polynomial. Let
    Mi = s''(xi),  i = 1, 2, 3, 4
Then on [x1, x2],
    s''(x) = [(x2 − x)M1 + (x − x1)M2] / (x2 − x1),  x1 ≤ x ≤ x2
We can find s(x) by integrating twice:
    s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / [6(x2 − x1)] + c1 x + c2
We determine the constants of integration by using
    s(x1) = y1,  s(x2) = y2   (*)
Then
    s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / [6(x2 − x1)]
           + [(x2 − x)y1 + (x − x1)y2] / (x2 − x1)
           − [(x2 − x1)/6][(x2 − x)M1 + (x − x1)M2]
for x1 ≤ x ≤ x2.
Check that this formula satisfies the given interpolation
condition (*)!
We can repeat this on the intervals [x2, x3] and [x3, x4],
obtaining similar formulas.
For x2 ≤ x ≤ x3,
    s(x) = [(x3 − x)^3 M2 + (x − x2)^3 M3] / [6(x3 − x2)]
           + [(x3 − x)y2 + (x − x2)y3] / (x3 − x2)
           − [(x3 − x2)/6][(x3 − x)M2 + (x − x2)M3]
For x3 ≤ x ≤ x4,
    s(x) = [(x4 − x)^3 M3 + (x − x3)^3 M4] / [6(x4 − x3)]
           + [(x4 − x)y3 + (x − x3)y4] / (x4 − x3)
           − [(x4 − x3)/6][(x4 − x)M3 + (x − x3)M4]
We still do not know the values of the second derivatives
{M1, M2, M3, M4}. The above formulas guarantee that
s(x) and s''(x) are continuous for x1 ≤ x ≤ x4. For
example, the formula on [x1, x2] yields
    s(x2) = y2,  s''(x2) = M2
The formula on [x2, x3] also yields
    s(x2) = y2,  s''(x2) = M2
All that is lacking is to make s'(x) continuous at x2
and x3. Thus we require
    s'(x2 + 0) = s'(x2 − 0)
    s'(x3 + 0) = s'(x3 − 0)   (**)
This means
    lim_{x↓x2} s'(x) = lim_{x↑x2} s'(x)
and similarly for x3.
To simplify the presentation somewhat, I assume in
the following that our node points are evenly spaced:
    x2 = x1 + h,  x3 = x1 + 2h,  x4 = x1 + 3h
Then our earlier formulas simplify to
    s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / (6h)
           + [(x2 − x)y1 + (x − x1)y2] / h
           − (h/6)[(x2 − x)M1 + (x − x1)M2]
for x1 ≤ x ≤ x2, with similar formulas on [x2, x3] and
[x3, x4].
Without going thru all of the algebra, the conditions
(**) lead to the following pair of equations:
    (h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
    (h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h
This gives us two equations in four unknowns. The
earlier boundary conditions on s''(x) give us immediately
    M1 = M4 = 0
Then we can solve the linear system for M2 and M3.
EXAMPLE
Consider the interpolation data points

    x   1   2     3     4
    y   1   1/2   1/3   1/4

In this case, h = 1, and the linear system becomes
    (2/3)M2 + (1/6)M3 = y3 − 2y2 + y1 = 1/3
    (1/6)M2 + (2/3)M3 = y4 − 2y3 + y2 = 1/12
This has the solution
    M2 = 1/2,  M3 = 0
This leads to the spline function formula on each
subinterval.
On [1, 2],
    s(x) = [(2 − x)^3 · 0 + (x − 1)^3 (1/2)] / 6
           + [(2 − x) · 1 + (x − 1)(1/2)] / 1
           − (1/6)[(2 − x) · 0 + (x − 1)(1/2)]
         = (1/12)(x − 1)^3 − (7/12)(x − 1) + 1
Similarly, for 2 ≤ x ≤ 3,
    s(x) = −(1/12)(x − 2)^3 + (1/4)(x − 2)^2 − (1/3)(x − 2) + 1/2
and for 3 ≤ x ≤ 4,
    s(x) = −(1/12)(x − 4) + 1/4
[Figure: graph of y = 1/x and the natural cubic spline interpolant y = s(x)
for the data x = 1, 2, 3, 4; y = 1, 1/2, 1/3, 1/4.]
[Figure: interpolating natural cubic spline function for the data
x = 0, 1, 2, 2.5, 3, 3.5, 4; y = 2.5, 0.5, 0.5, 1.5, 1.5, 1.125, 0.]
ALTERNATIVE BOUNDARY CONDITIONS
Return to the equations
    (h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
    (h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h
Sometimes other boundary conditions are imposed on
s(x) to help in determining the values of M1 and
M4. For example, the data in our numerical example
were generated from the function f(x) = 1/x. With
it, f''(x) = 2/x^3, and thus we could use
    M1 = 2,  M4 = 1/32
With this we are led to a new formula for s(x), one
that approximates f(x) = 1/x more closely.
THE CLAMPED SPLINE
In this case, we augment the interpolation conditions
    s(xi) = yi,  i = 1, 2, 3, 4
with the boundary conditions
    s'(x1) = y'1,  s'(x4) = y'4   (#)
The conditions (#) lead to another pair of equations,
augmenting the earlier ones. Combined, these equations are
    (h/3)M1 + (h/6)M2 = (y2 − y1)/h − y'1
    (h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
    (h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h
    (h/6)M3 + (h/3)M4 = y'4 − (y4 − y3)/h
For our numerical example, it is natural to obtain
these derivative values from f'(x) = −1/x^2:
    y'1 = −1,  y'4 = −1/16
When combined with our earlier equations, we have
the system
    (1/3)M1 + (1/6)M2 = 1/2
    (1/6)M1 + (2/3)M2 + (1/6)M3 = 1/3
    (1/6)M2 + (2/3)M3 + (1/6)M4 = 1/12
    (1/6)M3 + (1/3)M4 = 1/48
This has the solution
    [M1, M2, M3, M4] = [173/120, 7/60, 11/120, 1/60]
We can now write the function s(x) for each of the
subintervals [x1, x2], [x2, x3], and [x3, x4]. Recall for
x1 ≤ x ≤ x2,
    s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / (6h)
           + [(x2 − x)y1 + (x − x1)y2] / h
           − (h/6)[(x2 − x)M1 + (x − x1)M2]
We can substitute in from the data

    x   1   2     3     4
    y   1   1/2   1/3   1/4

and the solutions {Mi}. Doing so, consider the error
f(x) − s(x). As an example,
    f(x) = 1/x,  f(3/2) = 2/3,  s(3/2) = .65260
This is quite a decent approximation.
THE GENERAL PROBLEM
Consider the spline interpolation problem with n nodes
    (x1, y1), (x2, y2), ..., (xn, yn)
and assume the node points {xi} are evenly spaced,
    xj = x1 + (j − 1)h,  j = 1, ..., n
We have that the interpolating spline s(x) on
xj ≤ x ≤ xj+1 is given by
    s(x) = [(xj+1 − x)^3 Mj + (x − xj)^3 Mj+1] / (6h)
           + [(xj+1 − x)yj + (x − xj)yj+1] / h
           − (h/6)[(xj+1 − x)Mj + (x − xj)Mj+1]
for j = 1, ..., n − 1.
To enforce continuity of s'(x) at the interior node
points x2, ..., xn−1, the second derivatives {Mj} must
satisfy the linear equations
    (h/6)Mj−1 + (2h/3)Mj + (h/6)Mj+1 = (yj−1 − 2yj + yj+1)/h
for j = 2, ..., n − 1. Writing them out,
    (h/6)M1 + (2h/3)M2 + (h/6)M3 = (y1 − 2y2 + y3)/h
    (h/6)M2 + (2h/3)M3 + (h/6)M4 = (y2 − 2y3 + y4)/h
    ...
    (h/6)Mn−2 + (2h/3)Mn−1 + (h/6)Mn = (yn−2 − 2yn−1 + yn)/h
This is a system of n − 2 equations in the n unknowns
{M1, ..., Mn}. Two more conditions must be imposed
on s(x) in order to have the number of equations equal
the number of unknowns, namely n. With the added
boundary conditions, this form of linear system can be
solved very efficiently.
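For the natural boundary conditions M1 = Mn = 0 and evenly spaced nodes, the interior system is tridiagonal and can be assembled and solved directly; a sketch using the data of the earlier example:
    x = 1:4;  y = 1./x;  h = 1;  n = numel(x);
    A = (2*h/3)*eye(n-2) + (h/6)*(diag(ones(n-3,1),1) + diag(ones(n-3,1),-1));
    r = (y(1:n-2) - 2*y(2:n-1) + y(3:n)).'/h;   % right-hand sides
    M = [0; A\r; 0]                              % gives M = [0; 1/2; 0; 0]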
BOUNDARY CONDITIONS
"Natural" boundary conditions:
    s''(x1) = s''(xn) = 0
Spline functions satisfying these conditions are called
"natural cubic splines". They arise out of the minimization
problem stated earlier. But generally they are not
considered as good as some other cubic interpolating splines.
"Clamped" boundary conditions: We add the conditions
    s'(x1) = y'1,  s'(xn) = y'n
with y'1, y'n given slopes for the endpoints of s(x) on
[x1, xn]. This has many quite good properties when
compared with the natural cubic interpolating spline;
but it does require knowing the derivatives at the endpoints.
"Not a knot" boundary conditions: This is more complicated
to explain, but it is the version of cubic spline
interpolation that is implemented in Matlab.
THE “NOT A KNOT” CONDITIONS
As before, let the interpolation nodes be
    (x1, y1), (x2, y2), ..., (xn, yn)
We separate these points into two categories. For
constructing the interpolating cubic spline function,
we use the points
    (x1, y1), (x3, y3), ..., (xn−2, yn−2), (xn, yn)
thus deleting two of the points. We now have n − 2
points, and the interpolating spline s(x) can be
determined on the intervals
    [x1, x3], [x3, x4], ..., [xn−3, xn−2], [xn−2, xn]
This leads to n − 4 equations in the n − 2 unknowns
M1, M3, ..., Mn−2, Mn. The two additional boundary
conditions are
    s(x2) = y2,  s(xn−1) = yn−1
These translate into two additional equations, and we
obtain a system of n − 2 linear simultaneous equations
in the n − 2 unknowns M1, M3, ..., Mn−2, Mn.
[Figure: interpolating cubic spline function with "not-a-knot" boundary
conditions for the data x = 0, 1, 2, 2.5, 3, 3.5, 4; y = 2.5, 0.5, 0.5, 1.5, 1.5, 1.125, 0.]
MATLAB SPLINE FUNCTION LIBRARY
Given data points
    (x1, y1), (x2, y2), ..., (xn, yn)
type in arrays containing the x and y coordinates:
    x = [x1 x2 ... xn]
    y = [y1 y2 ... yn]
    plot(x, y, 'o')
The last statement will draw a plot of the data points,
marking them with the letter 'oh'. To find the interpolating
cubic spline function and evaluate it at the
points of another array xx, say
    h = (xn − x1)/(10*n);  xx = x1:h:xn;
use
    yy = spline(x, y, xx);
    plot(x, y, 'o', xx, yy)
The last statement will plot the data points, as before,
and it will plot the interpolating spline s(x) as a
continuous curve.
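For instance, with the data used earlier in this section:
    x  = [0 1 2 2.5 3 3.5 4];
    y  = [2.5 0.5 0.5 1.5 1.5 1.125 0];
    xx = linspace(0, 4, 200);
    yy = spline(x, y, xx);                 % not-a-knot cubic spline
    plot(x, y, 'o', xx, yy)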
ERROR IN CUBIC SPLINE INTERPOLATION
Let an interval [a, b] be given, and then define
    h = (b − a)/(n − 1),  xj = a + (j − 1)h,  j = 1, ..., n
Suppose we want to approximate a given function
f(x) on the interval [a, b] using cubic spline
interpolation. Define
    yj = f(xj),  j = 1, ..., n
Let sn(x) denote the cubic spline interpolating this
data and satisfying the "not a knot" boundary conditions.
Then it can be shown that for a suitable constant c,
    En ≡ max_{a≤x≤b} |f(x) − sn(x)| ≤ c h^4
The corresponding bound for natural cubic spline
interpolation contains only a term of h^2 rather than h^4;
it does not converge to zero as rapidly.
EXAMPLE
Take f(x) = arctan x on [0, 5]. The following table
gives values of the maximum error En for various
values of n. The values of h are being successively halved.

    n    En        Ratio
    7    7.09E−3
    13   3.24E−4   21.9
    25   3.06E−5   10.6
    49   1.48E−6   20.7
    97   9.04E−8   16.4
BEST APPROXIMATION
Given a function f(x) that is continuous on a given
interval [a, b], consider approximating it by some
polynomial p(x). To measure the error in p(x) as an
approximation, introduce
    E(p) = max_{a≤x≤b} |f(x) − p(x)|
This is called the maximum error or uniform error of
approximation of f(x) by p(x) on [a, b].
With an eye towards efficiency, we want to find the
'best' possible approximation of a given degree n.
With this in mind, introduce the following:
    ρn(f) = min_{deg(p)≤n} E(p) = min_{deg(p)≤n} [max_{a≤x≤b} |f(x) − p(x)|]
The number ρn(f) will be the smallest possible uniform
error, or minimax error, when approximating f(x)
by polynomials of degree at most n. If there is a
polynomial giving this smallest error, we denote it by
mn(x); thus E(mn) = ρn(f).
Example. Let f(x) = e^x on [−1, 1]. In the following
table, we give the values of E(tn), with tn(x) the Taylor
polynomial of degree n for e^x about x = 0, and E(mn).

    Maximum error in:
    n   tn(x)     mn(x)
    1   7.18E−1   2.79E−1
    2   2.18E−1   4.50E−2
    3   5.16E−2   5.53E−3
    4   9.95E−3   5.47E−4
    5   1.62E−3   4.52E−5
    6   2.26E−4   3.21E−6
    7   2.79E−5   2.00E−7
    8   3.06E−6   1.11E−8
    9   3.01E−7   5.52E−10
Consider graphically how we can improve on the Taylor
polynomial
    t1(x) = 1 + x
as a uniform approximation to e^x on the interval [−1, 1].
The linear minimax approximation is
    m1(x) = 1.2643 + 1.1752x
[Figure: linear Taylor and minimax approximations to e^x on [−1, 1],
showing y = t1(x), y = m1(x), and y = e^x.]
[Figure: error in the cubic Taylor approximation to e^x, with maximum about 0.0516.]
[Figure: error in the cubic minimax approximation to e^x; the error oscillates
between −0.00553 and 0.00553.]
Accuracy of the minimax approximation:
    ρn(f) ≤ [(b − a)/2]^(n+1) / [(n + 1)! 2^n] · max_{a≤x≤b} |f^(n+1)(x)|
This error bound does not always become smaller with
increasing n, but it will give a fairly accurate bound
for many common functions f(x).
Example. Let f(x) = e^x for −1 ≤ x ≤ 1. Then
    ρn(e^x) ≤ e / [(n + 1)! 2^n]   (*)

    n   Bound (*)   ρn(f)
    1   6.80E−1     2.79E−1
    2   1.13E−1     4.50E−2
    3   1.42E−2     5.53E−3
    4   1.42E−3     5.47E−4
    5   1.18E−4     4.52E−5
    6   8.43E−6     3.21E−6
    7   5.27E−7     2.00E−7
CHEBYSHEV POLYNOMIALS
Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer n ≥ 0, define the function

Tn(x) = cos(n cos⁻¹ x),  −1 ≤ x ≤ 1   (1)

This may not appear to be a polynomial, but we will show it is a polynomial of degree n. To simplify the manipulation of (1), we introduce

θ = cos⁻¹(x)  or  x = cos(θ),  0 ≤ θ ≤ π   (2)

Then

Tn(x) = cos(nθ)   (3)

Example. For n = 0:  T0(x) = cos(0 · θ) = 1
For n = 1:  T1(x) = cos(θ) = x
For n = 2:  T2(x) = cos(2θ) = 2 cos²(θ) − 1 = 2x² − 1
[Figure: T0(x), T1(x), T2(x) on [−1, 1]]

[Figure: T3(x), T4(x) on [−1, 1]]
The triple recursion relation. Recall the trigonometric addition formulas,

cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β)

Let n ≥ 1, and apply these identities to get

Tn+1(x) = cos[(n + 1)θ] = cos(nθ + θ) = cos(nθ) cos(θ) − sin(nθ) sin(θ)
Tn−1(x) = cos[(n − 1)θ] = cos(nθ − θ) = cos(nθ) cos(θ) + sin(nθ) sin(θ)

Add these two equations, and then use (1) and (3) to obtain

Tn+1(x) + Tn−1(x) = 2 cos(nθ) cos(θ) = 2x Tn(x)
Tn+1(x) = 2x Tn(x) − Tn−1(x),  n ≥ 1   (4)

This is called the triple recursion relation for the Chebyshev polynomials. It is often used in evaluating them, rather than using the explicit formula (1).
Example. Recall

T0(x) = 1,  T1(x) = x
Tn+1(x) = 2x Tn(x) − Tn−1(x),  n ≥ 1

Let n = 2. Then

T3(x) = 2x T2(x) − T1(x) = 2x(2x² − 1) − x = 4x³ − 3x

Let n = 3. Then

T4(x) = 2x T3(x) − T2(x) = 2x(4x³ − 3x) − (2x² − 1) = 8x⁴ − 8x² + 1
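The recursion translates directly into a short MATLAB routine; the following is a minimal sketch (the function name chebvals is ours, not a built-in).

% Sketch: evaluate T_0(x), ..., T_n(x) at a row vector x via the
% triple recursion relation (4).
function T = chebvals(n, x)
  T = zeros(n+1, length(x));      % T(k+1,:) will hold T_k(x)
  T(1,:) = ones(size(x));         % T_0(x) = 1
  if n >= 1
      T(2,:) = x;                 % T_1(x) = x
  end
  for k = 2 : n
      T(k+1,:) = 2*x.*T(k,:) - T(k-1,:);   % T_k = 2x T_{k-1} - T_{k-2}
  end
end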
The minimum size property. Note that

|Tn(x)| ≤ 1,  −1 ≤ x ≤ 1   (5)

for all n ≥ 0. Also, note that

Tn(x) = 2^{n−1} x^n + lower degree terms,  n ≥ 1   (6)

This can be proven using the triple recursion relation and mathematical induction.

Introduce a modified version of Tn(x),

T̃n(x) = (1/2^{n−1}) Tn(x) = x^n + lower degree terms   (7)

From (5) and (6),

|T̃n(x)| ≤ 1/2^{n−1},  −1 ≤ x ≤ 1,  n ≥ 1   (8)
Example.

T̃4(x) = (1/8)(8x⁴ − 8x² + 1) = x⁴ − x² + 1/8

A polynomial whose highest degree term has a coefficient of 1 is called a monic polynomial. Formula (8) says the monic polynomial T̃n(x) has size 1/2^{n−1} on −1 ≤ x ≤ 1, and this becomes smaller as the degree n increases. In comparison,

max_{−1≤x≤1} |x^n| = 1

Thus x^n is a monic polynomial whose size does not change with increasing n.
Theorem. Let n ≥ 1 be an integer, and consider all
possible monic polynomials of degree n. Then the
degree n monic polynomial with the smallest maximum on [−1, 1] is the modified Chebyshev polynomial T̃n(x), and its maximum value on [−1, 1] is 1/2^{n−1}.
This result is used in devising applications of Cheby-
shev polynomials. We apply it to obtain an improved
interpolation scheme.
A NEAR-MINIMAX APPROXIMATION METHOD
Let f(x) be continuous on [a, b] = [−1, 1]. Consider approximating f by an interpolatory polynomial of degree at most n = 3. Let x0, x1, x2, x3 be interpolation node points in [−1, 1]; let c3(x) be of degree ≤ 3 and interpolate f(x) at {x0, x1, x2, x3}. The interpolation error is

f(x) − c3(x) = (ω(x)/4!) f^(4)(ξx),  −1 ≤ x ≤ 1   (1)
ω(x) = (x − x0)(x − x1)(x − x2)(x − x3)   (2)

with ξx in [−1, 1]. We want to choose the nodes {x0, x1, x2, x3} so as to minimize the maximum value of |f(x) − c3(x)| on [−1, 1].
From (1), the only general quantity, independent of f, is ω(x). Thus we choose {x0, x1, x2, x3} to minimize

max_{−1≤x≤1} |ω(x)|   (3)

Expand to get

ω(x) = x⁴ + lower degree terms

This is a monic polynomial of degree 4. From the theorem in the preceding section, the smallest possible value for (3) is obtained with

ω(x) = T̃4(x) = T4(x)/2³ = (1/8)(8x⁴ − 8x² + 1)   (4)

and the smallest value of (3) is 1/2³ in this case. The equation (4) defines implicitly the nodes {x0, x1, x2, x3}: they are the roots of T4(x).
In our case this means solving

T4(x) = cos(4θ) = 0,  x = cos(θ)
4θ = ±π/2, ±3π/2, ±5π/2, ±7π/2, ...
θ = ±π/8, ±3π/8, ±5π/8, ±7π/8, ...
x = cos(π/8), cos(3π/8), cos(5π/8), cos(7π/8), ...   (5)

using cos(−θ) = cos(θ). The first four values are distinct; the following ones are repetitive. For example,

cos(9π/8) = cos(7π/8)

The first four values are

{x0, x1, x2, x3} = {±0.382683, ±0.923880}   (6)
Example. Let f(x) = e^x on [−1, 1]. Use these nodes to produce the interpolating polynomial c3(x) of degree 3. From the interpolation error formula and the bound of 1/2³ for |ω(x)| on [−1, 1], we have

max_{−1≤x≤1} |f(x) − c3(x)| ≤ (1/2³)/4! · max_{−1≤x≤1} e^{ξx} ≤ e/192 ≈ 0.014158

By direct calculation,

max_{−1≤x≤1} |e^x − c3(x)| ≈ 0.00666

Interpolation data: f(x) = e^x
i | xi        | f(xi)     | f[x0, ..., xi]
0 |  0.923880 | 2.5190442 | 2.5190442
1 |  0.382683 | 1.4662138 | 1.9453769
2 | −0.382683 | 0.6820288 | 0.7047420
3 | −0.923880 | 0.3969760 | 0.1751757
[Figure: the error e^x − c3(x) on [−1, 1], varying between about −0.00624 and 0.00666]

For comparison, E(t3) ≈ 0.0142 and ρ3(e^x) ≈ 0.00553.
THE GENERAL CASE
Consider interpolating f(x) on [−1, 1] by a polynomial of degree ≤ n, with the interpolation nodes {x0, ..., xn} in [−1, 1]. Denote the interpolation polynomial by cn(x). The interpolation error on [−1, 1] is given by

f(x) − cn(x) = (ω(x)/(n + 1)!) f^(n+1)(ξx)   (7)
ω(x) = (x − x0) · · · (x − xn)

with ξx an unknown point in [−1, 1]. In order to minimize the interpolation error, we seek to minimize

max_{−1≤x≤1} |ω(x)|   (8)

The polynomial being minimized is monic of degree n + 1,

ω(x) = x^{n+1} + lower degree terms

From the theorem of the preceding section, this minimum is attained by the monic polynomial

T̃n+1(x) = (1/2^n) Tn+1(x)

Thus the interpolation nodes are the zeros of Tn+1(x); and by the procedure that led to (5), they are given by

xj = cos( (2j + 1)π/(2n + 2) ),  j = 0, 1, ..., n   (9)

The near-minimax approximation cn(x) of degree n is obtained by interpolating to f(x) at these n + 1 nodes on [−1, 1].
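In MATLAB, the nodes (9) and the resulting interpolant are easy to set up. The following is a sketch, not part of the original example; it uses the built-in polyfit, which returns the interpolating polynomial when the degree equals the number of data points minus one.

% Sketch: near-minimax (Chebyshev) interpolation of f(x) = exp(x) on [-1,1].
n  = 3;
j  = 0 : n;
xc = cos((2*j + 1)*pi/(2*n + 2));    % Chebyshev nodes, formula (9)
f  = @(x) exp(x);
p  = polyfit(xc, f(xc), n);          % the degree-n interpolant c_n(x)
xx = -1 : 0.01 : 1;
err = max(abs(f(xx) - polyval(p, xx)))   % about 0.00666, as in the example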
The polynomial cn(x) is sometimes called a Cheby-
shev approximation.
Example. Let f(x) = e^x. The following table contains the maximum errors in cn(x) on [−1, 1] for varying n. For comparison, we also include the corresponding minimax errors. These figures illustrate that for practical purposes, cn(x) is a satisfactory replacement for the minimax approximation mn(x).

n | max |e^x − cn(x)| | ρn(e^x)
1 | 3.72E−1           | 2.79E−1
2 | 5.65E−2           | 4.50E−2
3 | 6.66E−3           | 5.53E−3
4 | 6.40E−4           | 5.47E−4
5 | 5.18E−5           | 4.52E−5
6 | 3.80E−6           | 3.21E−6
THEORETICAL INTERPOLATION ERROR
For the error

f(x) − cn(x) = (ω(x)/(n + 1)!) f^(n+1)(ξx)

we have

max_{−1≤x≤1} |f(x) − cn(x)| ≤ [max_{−1≤x≤1} |ω(x)| / (n + 1)!] · max_{−1≤ξ≤1} |f^(n+1)(ξ)|

From the theorem of the preceding section,

max_{−1≤x≤1} |T̃n+1(x)| = max_{−1≤x≤1} |ω(x)| = 1/2^n

in this case. Thus

max_{−1≤x≤1} |f(x) − cn(x)| ≤ [1/((n + 1)! 2^n)] max_{−1≤ξ≤1} |f^(n+1)(ξ)|
OTHER INTERVALS
Consider approximating f(x) on the finite interval [a, b]. Introduce the linear change of variables

x = (1/2)[(1 − t)a + (1 + t)b]   (10)
t = (2/(b − a))[x − (b + a)/2]   (11)

Introduce

F(t) = f( (1/2)[(1 − t)a + (1 + t)b] ),  −1 ≤ t ≤ 1

The function F(t) on [−1, 1] is equivalent to f(x) on [a, b], and we can move between them via (10)-(11). We can now proceed to approximate f(x) on [a, b] by instead approximating F(t) on [−1, 1].

Example. Approximating f(x) = cos x on [0, π/2] is equivalent to approximating

F(t) = cos( (1 + t)π/4 ),  −1 ≤ t ≤ 1
NUMERICAL DIFFERENTIATION
There are two major reasons for considering numerical approximations of the differentiation process.

1. Approximation of derivatives in ordinary differential equations and partial differential equations. This is done in order to reduce the differential equation to a form that can be solved more easily than the original differential equation.

2. Forming the derivative of a function f(x) which is known only as empirical data {(xi, yi) | i = 1, ..., m}. The data generally is known only approximately, so that yi ≈ f(xi), i = 1, ..., m.

Recall the definition

f′(x) = lim_{h→0} [f(x + h) − f(x)]/h

This justifies using

f′(x) ≈ [f(x + h) − f(x)]/h ≡ Dh f(x)   (1)

for small values of h. The approximation Dh f(x) is called a numerical derivative of f(x) with stepsize h.

Example. Use Dh f(x) to approximate the derivative of f(x) = cos(x) at x = π/6. In the table below, the error is almost halved when h is halved.
h        | Dh f     | Error   | Ratio
0.1      | −0.54243 | 0.04243 |
0.05     | −0.52144 | 0.02144 | 1.98
0.025    | −0.51077 | 0.01077 | 1.99
0.0125   | −0.50540 | 0.00540 | 1.99
0.00625  | −0.50270 | 0.00270 | 2.00
0.003125 | −0.50135 | 0.00135 | 2.00
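The numbers in this table are easy to reproduce; a minimal sketch:

% Sketch: forward-difference approximation (1) for f(x) = cos(x) at x = pi/6.
f = @cos;  x = pi/6;  exact = -sin(x);    % true derivative is -sin(pi/6) = -0.5
h = 0.1 ./ 2.^(0:5)';                     % successive halvings of h
Dh = (f(x + h) - f(x)) ./ h;
disp([h, Dh, exact - Dh])                 % the error is roughly halved with h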
Error behaviour. Using Taylor's theorem,

f(x + h) = f(x) + h f′(x) + (1/2)h² f″(c)

with c between x and x + h. Evaluating (1),

Dh f(x) = (1/h){ [f(x) + h f′(x) + (1/2)h² f″(c)] − f(x) } = f′(x) + (1/2)h f″(c)

f′(x) − Dh f(x) = −(1/2)h f″(c)   (2)

Using a higher order Taylor expansion,

f′(x) − Dh f(x) = −(1/2)h f″(x) − (1/6)h² f‴(c)

and thus

f′(x) − Dh f(x) ≈ −(1/2)h f″(x)   (3)

for small values of h.

For f(x) = cos x,

f′(x) − Dh f(x) = (1/2)h cos(c),  c ∈ [π/6, π/6 + h]

In the preceding table, check the accuracy of the approximation (3) with x = π/6.
The formula (1),

f′(x) ≈ [f(x + h) − f(x)]/h ≡ Dh f(x)

is called a forward difference formula for approximating f′(x). In contrast, the approximation

f′(x) ≈ [f(x) − f(x − h)]/h,  h > 0   (4)

is called a backward difference formula for approximating f′(x). A similar derivation leads to

f′(x) − [f(x) − f(x − h)]/h = (h/2) f″(c)   (5)

for some c between x and x − h. The accuracy of the backward difference formula (4) is essentially the same as that of the forward difference formula (1). The motivation for this formula is in applications to solving differential equations.
DIFFERENTIATION USING INTERPOLATION
Let Pn(x) be the degree n polynomial that interpolates f(x) at n + 1 node points x0, x1, ..., xn. To calculate f′(x) at some point x = t, use

f′(t) ≈ Pn′(t)   (6)

Many different formulas can be obtained by varying n and by varying the placement of the nodes x0, ..., xn relative to the point t of interest.

Example. Take n = 2, and use evenly spaced nodes x0, x1 = x0 + h, x2 = x1 + h. Then

P2(x) = f(x0)L0(x) + f(x1)L1(x) + f(x2)L2(x)
P2′(x) = f(x0)L0′(x) + f(x1)L1′(x) + f(x2)L2′(x)

with

L0(x) = (x − x1)(x − x2)/[(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2)/[(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1)/[(x2 − x0)(x2 − x1)]

Forming the derivatives of these Lagrange basis functions and evaluating them at x = x1 gives

f′(x1) ≈ P2′(x1) = [f(x1 + h) − f(x1 − h)]/(2h) ≡ Dh f(x1)   (7)

For the error,

f′(x1) − [f(x1 + h) − f(x1 − h)]/(2h) = −(h²/6) f‴(c2)   (8)

with x1 − h ≤ c2 ≤ x1 + h.
A proof of this begins with the interpolation error formula

f(x) − P2(x) = Ψ2(x) f[x0, x1, x2, x]
Ψ2(x) = (x − x0)(x − x1)(x − x2)

Differentiate to get

f′(x) − P2′(x) = Ψ2(x) (d/dx) f[x0, x1, x2, x] + Ψ2′(x) f[x0, x1, x2, x]

With properties of the divided difference, we can show

f′(x) − P2′(x) = (1/24)Ψ2(x) f^(4)(c1,x) + (1/6)Ψ2′(x) f^(3)(c2,x)

with c1,x and c2,x between the smallest and largest of the values {x0, x1, x2, x}. Letting x = x1 and noting that Ψ2(x1) = 0, we obtain (8).
Example. Take f(x) = cos(x) and x1 = π/6. Then (7) is illustrated as follows.

h       | Dh f        | Error        | Ratio
0.1     | −0.49916708 | −0.0008329   |
0.05    | −0.49979169 | −0.0002083   | 4.00
0.025   | −0.49994792 | −0.00005208  | 4.00
0.0125  | −0.49998698 | −0.00001302  | 4.00
0.00625 | −0.49999674 | −0.000003255 | 4.00
Note the smaller errors and faster convergence as com-
pared to the forward difference formula (1).
UNDETERMINED COEFFICIENTS
Derive an approximation for f″(x) at x = t. Write

f″(t) ≈ Dh^(2) f(t) ≡ A f(t + h) + B f(t) + C f(t − h)   (9)

with A, B, and C unspecified constants. Use Taylor polynomial approximations

f(t − h) ≈ f(t) − h f′(t) + (h²/2) f″(t) − (h³/6) f‴(t) + (h⁴/24) f^(4)(t)
f(t + h) ≈ f(t) + h f′(t) + (h²/2) f″(t) + (h³/6) f‴(t) + (h⁴/24) f^(4)(t)   (10)

Substitute into (9) and rearrange:

Dh^(2) f(t) ≈ (A + B + C) f(t) + h(A − C) f′(t) + (h²/2)(A + C) f″(t)
            + (h³/6)(A − C) f‴(t) + (h⁴/24)(A + C) f^(4)(t)   (11)

To have

Dh^(2) f(t) ≈ f″(t)   (12)

for arbitrary functions f(x), require

A + B + C = 0 :      coefficient of f(t)
h(A − C) = 0 :       coefficient of f′(t)
(h²/2)(A + C) = 1 :  coefficient of f″(t)

Solution:

A = C = 1/h²,  B = −2/h²   (13)

This determines

Dh^(2) f(t) = [f(t + h) − 2f(t) + f(t − h)]/h²   (14)

For the error, substitute (13) into (11):

Dh^(2) f(t) ≈ f″(t) + (h²/12) f^(4)(t)

Thus

f″(t) − [f(t + h) − 2f(t) + f(t − h)]/h² ≈ −(h²/12) f^(4)(t)   (15)
Example. Let f(x) = cos(x), t = π/6; use (14) to calculate f″(t) = −cos(π/6).

h       | Dh^(2) f    | Error     | Ratio
0.5     | −0.84813289 | −1.789E−2 |
0.25    | −0.86152424 | −4.501E−3 | 3.97
0.125   | −0.86489835 | −1.127E−3 | 3.99
0.0625  | −0.86574353 | −2.819E−4 | 4.00
0.03125 | −0.86595493 | −7.048E−5 | 4.00
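Again, a minimal sketch reproducing the table:

% Sketch: central second-difference formula (14) for f(x) = cos(x) at t = pi/6.
f = @cos;  t = pi/6;  exact = -cos(t);
h = 0.5 ./ 2.^(0:4)';
D2 = (f(t + h) - 2*f(t) + f(t - h)) ./ h.^2;
disp([h, D2, exact - D2])     % errors decrease by a factor of about 4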
EFFECTS OF ERROR IN FUNCTION VALUES
Recall

Dh^(2) f(x1) = [f(x2) − 2f(x1) + f(x0)]/h² ≈ f″(x1)

with x2 = x1 + h, x0 = x1 − h. Assume the actual function values used in the computation contain data error, and denote these values by f̂0, f̂1, and f̂2. Introduce the data errors:

εi = f(xi) − f̂i,  i = 0, 1, 2   (16)

The actual quantity calculated is

D̂h^(2) f(x1) = [f̂2 − 2f̂1 + f̂0]/h²   (17)

For the error in this quantity, replace f̂j by f(xj) − εj, j = 0, 1, 2, to obtain the following:

f″(x1) − D̂h^(2) f(x1)
  = f″(x1) − { [f(x2) − ε2] − 2[f(x1) − ε1] + [f(x0) − ε0] }/h²
  = [ f″(x1) − (f(x2) − 2f(x1) + f(x0))/h² ] + (ε2 − 2ε1 + ε0)/h²
  ≈ −(1/12)h² f^(4)(x1) + (ε2 − 2ε1 + ε0)/h²   (18)

The last line uses (15).

The errors {ε0, ε1, ε2} are generally random in some interval [−δ, δ]. If {f̂0, f̂1, f̂2} are experimental data, then δ is a bound on the experimental error. If the {f̂j} are obtained by computing f(x) in a computer, then the errors εj are the combination of rounding or chopping errors, and δ is a bound on these errors.

In either case, (18) yields the approximate inequality

|f″(x1) − D̂h^(2) f(x1)| ≤ (h²/12)|f^(4)(x1)| + 4δ/h²   (19)

This suggests that as h → 0, the error will eventually increase, because of the final term 4δ/h².
Example. Calculate D̂h^(2) f(x1) for f(x) = cos(x) at x1 = π/6. To show the effect of rounding errors, the values f̂i are obtained by rounding f(xi) to six significant digits; the errors satisfy

|εi| ≤ 5.0 × 10⁻⁷ = δ,  i = 0, 1, 2

Other than these rounding errors, the formula D̂h^(2) f(x1) is calculated exactly. In this example, the bound (19) becomes

|f″(x1) − D̂h^(2) f(x1)| ≤ (h²/12) cos(π/6) + (4/h²)(5 × 10⁻⁷) ≈ 0.0722h² + (2 × 10⁻⁶)/h² ≡ E(h)

For h = 0.125, the bound E(h) ≈ 0.00126, which is not too far off from the actual error given in the table.
h          | D̂h^(2) f(x1) | Error
0.5        | −0.848128    | −0.017897
0.25       | −0.861504    | −0.004521
0.125      | −0.864832    | −0.001193
0.0625     | −0.865536    | −0.000489
0.03125    | −0.865280    | −0.000745
0.015625   | −0.860160    | −0.005865
0.0078125  | −0.851968    | −0.014057
0.00390625 | −0.786432    | −0.079593
The bound E(h) indicates that there is a smallest value of h, call it h*, below which the error bound will begin to increase. To find it, set E′(h) = 0; its root is h*. This leads to h* ≈ 0.0726, which is consistent with the behavior of the errors in the table.
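The optimal h can be checked with a one-line computation; a sketch with the constants from the bound above:

% Sketch: minimizer h* of E(h) = 0.0722 h^2 + 2e-6/h^2.
% E'(h) = 0 gives 0.1444 h = 4e-6/h^3, i.e. h^4 = 2e-6/0.0722.
hstar = (2e-6/0.0722)^(1/4)      % approximately 0.0726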
LINEAR SYSTEMS
Consider the following example of a linear system:
x1 + 2x2 + 3x3 = −5
−x1 + x3 = −3
3x1 + x2 + 3x3 = −3

Its unique solution is

x1 = 1,  x2 = 0,  x3 = −2

In general we want to solve n equations in n unknowns. For this, we need some simplifying notation. In particular we introduce arrays. We can think of these as means for storing information about the linear system in a computer. In the above case, we introduce

A = [ 1  2  3        b = [ −5        x = [  1
     −1  0  1              −3               0
      3  1  3 ]            −3 ]            −2 ]
These arrays completely specify the linear system and
its solution. We also know that we can give mean-
ing to multiplication and addition of these quantities,
calling them matrices and vectors. The linear system
is then written as
Ax = b
with Ax denoting a matrix-vector multiplication.
The general system is written as

a1,1 x1 + · · · + a1,n xn = b1
        ...
an,1 x1 + · · · + an,n xn = bn

This is a system of n linear equations in the n unknowns x1, ..., xn. It can be written in matrix-vector notation as

Ax = b

A = [ a1,1 · · · a1,n        b = [ b1        x = [ x1
       ...  . . .  ...             ...             ...
      an,1 · · · an,n ]           bn ]            xn ]
A TRIDIAGONAL SYSTEM
Consider the tridiagonal linear system

3x1 − x2 = 2
−x1 + 3x2 − x3 = 1
        ...
−xn−2 + 3xn−1 − xn = 1
−xn−1 + 3xn = 2

The solution is

x1 = · · · = xn = 1

This has the associated arrays

A = [  3 −1  0 · · ·  0        b = [ 2        x = [ 1
      −1  3 −1        0              1              1
           . . .                    ...            ...
      ...    −1  3 −1                1              1
       0 · · ·  −1  3 ]              2 ]            1 ]
SOLVING LINEAR SYSTEMS
Linear systems Ax = b occur widely in applied mathematics. They occur as direct formulations of "real world" problems; but more often, they occur as a part of the numerical analysis of some other problem. As examples of the latter, we have the construction of spline functions, the numerical solution of systems of nonlinear equations, ordinary and partial differential equations, integral equations, and the solution of optimization problems.
There are many ways of classifying linear systems.
Size: small, moderate, and large. This of course varies with the machine you are using.
For a matrix A of order n× n, it will take 8n2 bytes
to store it in double precision. Thus a matrix of order
8000 will need around 512 MB of storage. The latter
would be too large for most present day PCs, if the
matrix was to be stored in the computer’s memory,
although one can easily expand a PC to contain much
more memory than this.
Sparse vs. dense. Many linear systems have a matrix A in which almost all the elements are zero. These matrices are said to be sparse. For example, it is quite common to work with tridiagonal matrices

A = [ a1 c1  0  · · · 0
      b2 a2 c2        ...
      0  b3 a3 c3
          . . .
      0  · · ·  bn an ]

in which the order is 10^4 or much more. For such
matrices, it does not make sense to store the zero ele-
ments; and the sparsity should be taken into account
when solving the linear system Ax = b. Also, the
sparsity need not be as regular as in this example.
BASIC DEFINITIONS AND THEORY
A homogeneous linear system Ax = b is one for which the right-hand constants are all zero. Using vector notation, we say b is the zero vector for a homogeneous system. Otherwise the linear system is called non-homogeneous.

Theorem. The following are equivalent statements.

(1) For each b, there is exactly one solution x.
(2) For each b, there is a solution x.
(3) The homogeneous system Ax = 0 has only the solution x = 0.
(4) det(A) ≠ 0.
(5) The inverse matrix A⁻¹ exists.
EXAMPLE. Consider again the tridiagonal system

3x1 − x2 = 2
−x1 + 3x2 − x3 = 1
        ...
−xn−2 + 3xn−1 − xn = 1
−xn−1 + 3xn = 2

The homogeneous version is simply

3x1 − x2 = 0
−x1 + 3x2 − x3 = 0
        ...
−xn−2 + 3xn−1 − xn = 0
−xn−1 + 3xn = 0

Assume x ≠ 0, and therefore that x has nonzero components. Let xk denote a component of maximum size:

|xk| = max_{1≤j≤n} |xj|

Consider now equation k, and assume 1 < k < n. Then

−xk−1 + 3xk − xk+1 = 0
xk = (1/3)(xk−1 + xk+1)
|xk| ≤ (1/3)(|xk−1| + |xk+1|) ≤ (1/3)(|xk| + |xk|) = (2/3)|xk|

This implies xk = 0, and therefore x = 0. A similar proof is valid if k = 1 or k = n, using the first or the last equation, respectively.

Thus the original tridiagonal linear system Ax = b has a unique solution x for each right side b.
METHODS OF SOLUTION
There are two general categories of numerical methods
for solving Ax = b.
Direct Methods: These are methods with a finite
number of steps; and they end with the exact solution
x, provided that all arithmetic operations are exact.
The most used of these methods is Gaussian elimi-
nation, which we begin with. There are other direct
methods, but we do not study them here.
Iteration Methods: These are used in solving all types
of linear systems, but they are most commonly used
with large sparse systems, especially those produced
by discretizing partial differential equations. This is
an extremely active area of research.
MATRICES in MATLAB
Consider the matrices

A = [ 1 2 3        b = [ 1
      2 2 3              1
      3 3 3 ]            1 ]

In MATLAB, A can be created in any of the following ways:

A = [1 2 3; 2 2 3; 3 3 3];
A = [1, 2, 3; 2, 2, 3; 3, 3, 3];
A = [1 2 3
     2 2 3
     3 3 3];

Commas can be used to replace the spaces. The vector b can be created by

b = ones(3, 1);

Consider setting up the matrices for the system Ax = b with

Ai,j = max{i, j},  bi = 1,  1 ≤ i, j ≤ n

One way to set up the matrix A is as follows:

A = zeros(n, n);
for i = 1 : n
    A(i, 1 : i) = i;
    A(i, i + 1 : n) = i + 1 : n;
end

and set up the vector b by

b = ones(n, 1);
MATRIX ADDITION
Let A = [ai,j] and B = [bi,j] be matrices of order m × n. Then

C = A + B

is another matrix of order m × n, with

ci,j = ai,j + bi,j

EXAMPLE.

[ 1 2      [  1 −1      [ 2 1
  3 4   +   −1  1    =    2 5
  5 6 ]      1 −1 ]       6 5 ]
MULTIPLICATION BY A CONSTANT
c [ a1,1 · · · a1,n        [ c·a1,1 · · · c·a1,n
     ...  . . .  ...    =      ...   . . .   ...
    am,1 · · · am,n ]       c·am,1 · · · c·am,n ]

EXAMPLE.

5 [ 1 2      [  5 10
    3 4   =    15 20
    5 6 ]      25 30 ]

(−1) [ a b      [ −a −b
       c d ]  =   −c −d ]
THE ZERO MATRIX 0
Define the zero matrix of order m × n as the matrix of that order having all zero entries. It is sometimes written as 0_{m×n}, but more commonly as simply 0. Then for any matrix A of order m × n,

A + 0 = 0 + A = A

The zero matrix 0_{m×n} acts in the same role as does the number zero when doing arithmetic with real and complex numbers.

EXAMPLE.

[ 1 2      [ 0 0      [ 1 2
  3 4 ]  +   0 0 ]  =   3 4 ]
We denote by −A the solution of the equation

A + B = 0

It is the matrix obtained by taking the negative of all of the entries in A. For example,

[ a b      [ −a −b      [ 0 0
  c d ]  +   −c −d ]  =   0 0 ]

⇒  −[ a b      [ −a −b             [ a b
       c d ]  =   −c −d ]  =  (−1)   c d ]

−[ a1,1 a1,2      [ −a1,1 −a1,2
   a2,1 a2,2 ]  =   −a2,1 −a2,2 ]
MATRIX MULTIPLICATION
Let A = [ai,j] have order m × n and B = [bi,j] have order n × p. Then

C = AB

is a matrix of order m × p and

ci,j = Ai,* B*,j = ai,1 b1,j + ai,2 b2,j + · · · + ai,n bn,j

or equivalently, ci,j is the product of row i of A with column j of B:

ci,j = [ai,1 ai,2 · · · ai,n] [ b1,j     = ai,1 b1,j + · · · + ai,n bn,j
                                b2,j
                                ...
                                bn,j ]

EXAMPLES

[ 1 2 3   [ 1 2      [ 22 28
  4 5 6 ]   3 4   =    49 64 ]
            5 6 ]

[ 1 2                     [  9 12 15
  3 4   [ 1 2 3        =    19 26 33
  5 6 ]   4 5 6 ]           29 40 51 ]

[ a1,1 · · · a1,n   [ x1        [ a1,1 x1 + · · · + a1,n xn
   ...  . . .  ...    ...    =             ...
  an,1 · · · an,n ]   xn ]        an,1 x1 + · · · + an,n xn ]

Thus we write the linear system

a1,1 x1 + · · · + a1,n xn = b1
        ...
an,1 x1 + · · · + an,n xn = bn

as Ax = b.
THE IDENTITY MATRIX I
For a given integer n ≥ 1, define In to be the matrix of order n × n with 1's in all diagonal positions and zeros elsewhere:

In = [ 1 0 . . . 0
       0 1       0
       ... . . . ...
       0 . . .   1 ]

More commonly it is denoted by simply I. Let A be a matrix of order m × n. Then

A In = A,  Im A = A

The identity matrix I acts in the same role as does the number 1 when doing arithmetic with real and complex numbers.
THE MATRIX INVERSE
Let A be a matrix of order n × n for some n ≥ 1. We say a matrix B is an inverse for A if

AB = BA = I

It can be shown that if an inverse exists for A, then it is unique.

EXAMPLES. If ad − bc ≠ 0, then

[ a b ]⁻¹        1       [  d −b
[ c d ]    =  --------     −c  a ]
              ad − bc

[ 1 2 ]⁻¹     [ −1    1
[ 2 2 ]    =     1  −1/2 ]

[ 1   1/2 1/3 ]⁻¹     [   9  −36   30
[ 1/2 1/3 1/4 ]    =    −36  192 −180
[ 1/3 1/4 1/5 ]          30 −180  180 ]
Recall the earlier theorem on the solution of linear
systems Ax = b with A a square matrix.
Theorem. The following are equivalent statements.
1. For each b, there is exactly one solution x.
2. For each b, there is a solution x.
3. The homogeneous system Ax = 0 has only the
solution x = 0.
4. det(A) ≠ 0.

5. A⁻¹ exists.
EXAMPLE
det [ 1 2 3
      4 5 6   =  0
      7 8 9 ]

Therefore, the linear system

[ 1 2 3   [ x1      [ b1
  4 5 6     x2   =    b2
  7 8 9 ]   x3 ]      b3 ]

is not always solvable, the coefficient matrix does not have an inverse, and the homogeneous system Ax = 0 has a solution other than the zero vector, namely

[ 1 2 3   [  1      [ 0
  4 5 6     −2   =    0
  7 8 9 ]    1 ]      0 ]
PARTITIONED MATRICES
Matrices can be built up from smaller matrices; or conversely, we can decompose a large matrix into a matrix of smaller matrices. For example, consider

A = [ 1  2 0        [ B c
      2  1 1    =     d e ]
      0 −1 5 ]

B = [ 1 2      c = [ 0      d = [0 −1]      e = 5
      2 1 ]          1 ]

Matlab allows you to build up larger matrices out of smaller matrices in exactly this manner; and smaller matrices can be defined as portions of larger matrices.

We will often write an n × n square matrix in terms of its columns:

A = [A*,1, ..., A*,n]

For the n × n identity matrix I, we write

I = [e1, ..., en]

with ej denoting a column vector with a 1 in position j and zeros elsewhere.
ARITHMETIC OF PARTITIONED MATRICES
As with matrices, we can do addition and multiplication with partitioned matrices provided the individual constituent parts have the proper orders.

For example, let A, B, C, D be n × n matrices. Then

[ I A   [ I C      [ I + AD   C + A
  B I ]   D I ]  =   B + D    I + BC ]

Let A be n × n and x be a column vector of length n. Then

Ax = [A*,1, ..., A*,n] [ x1      = x1 A*,1 + · · · + xn A*,n
                         ...
                         xn ]

Compare this to

[ a1,1 · · · a1,n   [ x1       [ a1,1 x1 + · · · + a1,n xn
   ...  . . .  ...    ...   =             ...
  an,1 · · · an,n ]   xn ]       an,1 x1 + · · · + an,n xn ]
PARTITIONED MATRICES IN MATLAB
In MATLAB, matrices can be constructed using smaller matrices. For example, let

A = [1, 2; 3, 4]; x = [5, 6]; y = [7, 8]';

Then

B = [A, y; x, 9];

forms the matrix

B = [ 1 2 7
      3 4 8
      5 6 9 ]
SOLVING LINEAR SYSTEMS
We want to solve the linear system

a1,1 x1 + · · · + a1,n xn = b1
        ...
an,1 x1 + · · · + an,n xn = bn

This will be done by the method used in beginning algebra, by successively eliminating unknowns from equations, until eventually we have only one equation in one unknown. This process is known as Gaussian elimination. To put it onto a computer, however, we must be more precise than is generally the case in high school algebra.

We begin with the linear system

3x1 − 2x2 − x3 = 0    (E1)
6x1 − 2x2 + 2x3 = 6   (E2)
−9x1 + 7x2 + x3 = −1  (E3)

[1] Eliminate x1 from equations (E2) and (E3). Subtract 2 times (E1) from (E2); and subtract −3 times (E1) from (E3). This yields

3x1 − 2x2 − x3 = 0   (E1)
2x2 + 4x3 = 6        (E2)
x2 − 2x3 = −1        (E3)

[2] Eliminate x2 from equation (E3). Subtract 1/2 times (E2) from (E3). This yields

3x1 − 2x2 − x3 = 0   (E1)
2x2 + 4x3 = 6        (E2)
−4x3 = −4            (E3)

Using back substitution, solve for x3, x2, and x1, obtaining

x3 = x2 = x1 = 1
In the computer, we work on the arrays rather than on the equations. To illustrate this, we repeat the preceding example using array notation.

The original system is Ax = b, with

A = [  3 −2 −1        b = [  0
       6 −2  2               6
      −9  7  1 ]            −1 ]

We often write these in combined form as an augmented matrix:

[A | b] = [  3 −2 −1 |  0
             6 −2  2 |  6
            −9  7  1 | −1 ]

In step 1, we eliminate x1 from equations 2 and 3. We multiply row 1 by 2 and subtract it from row 2; and we multiply row 1 by −3 and subtract it from row 3. This yields

[ 3 −2 −1 |  0
  0  2  4 |  6
  0  1 −2 | −1 ]

In step 2, we eliminate x2 from equation 3. We multiply row 2 by 1/2 and subtract from row 3. This yields

[ 3 −2 −1 |  0
  0  2  4 |  6
  0  0 −4 | −4 ]

Then we proceed with back substitution as previously.
For the general case, we reduce

[A | b] = [ a1,1^(1) · · · a1,n^(1) | b1^(1)
              ...    . . .   ...    |  ...
            an,1^(1) · · · an,n^(1) | bn^(1) ]

in n − 1 steps to the form

[ a1,1^(1) · · ·      a1,n^(1) | b1^(1)
   0       . . .        ...    |  ...
   0       · · · 0   an,n^(n)  | bn^(n) ]

More simply, and introducing new notation, this is equivalent to the matrix-vector equation Ux = g:

[ u1,1 · · ·     u1,n   [ x1       [ g1
   0   . . .      ...     ...   =    ...
   0   · · · 0   un,n ]   xn ]       gn ]

This is the linear system

u1,1 x1 + u1,2 x2 + · · · + u1,n−1 xn−1 + u1,n xn = g1
        ...
un−1,n−1 xn−1 + un−1,n xn = gn−1
un,n xn = gn

We solve for xn, then xn−1, and backwards to x1. This process is called back substitution:

xn = gn / un,n
xk = ( gk − [uk,k+1 xk+1 + · · · + uk,n xn] ) / uk,k

for k = n − 1, ..., 1. What we have done here is simply a more carefully defined and methodical version of what you have done in high school algebra.
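Back substitution is only a few lines of MATLAB; a sketch (the function name backsub is ours):

% Sketch: back substitution for an upper triangular system U x = g.
function x = backsub(U, g)
  n = length(g);
  x = zeros(n, 1);
  x(n) = g(n)/U(n, n);
  for k = n-1 : -1 : 1
      x(k) = (g(k) - U(k, k+1:n)*x(k+1:n)) / U(k, k);
  end
end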
How do we carry out the conversion of

[ a1,1^(1) · · · a1,n^(1) | b1^(1)
    ...    . . .   ...    |  ...
  an,1^(1) · · · an,n^(1) | bn^(1) ]

to

[ a1,1^(1) · · ·      a1,n^(1) | b1^(1)
   0       . . .        ...    |  ...
   0       · · · 0   an,n^(n)  | bn^(n) ]

To help us keep track of the steps of this process, we will denote the initial system by [A^(1) | b^(1)].

Initially we will make the assumption that every pivot element will be nonzero; later we remove this assumption.
Step 1. We will eliminate x1 from equations 2 through n. Begin by defining the multipliers

mi,1 = ai,1^(1) / a1,1^(1),  i = 2, ..., n

Here we are assuming the pivot element a1,1^(1) ≠ 0. Then in succession, multiply mi,1 times row 1 (called the pivot row) and subtract the result from row i. This yields new matrix elements

ai,j^(2) = ai,j^(1) − mi,1 a1,j^(1),  j = 2, ..., n
bi^(2) = bi^(1) − mi,1 b1^(1)

for i = 2, ..., n.

Note that the index j does not include j = 1. The reason is that with the definition of the multiplier mi,1, it is automatic that

ai,1^(2) = ai,1^(1) − mi,1 a1,1^(1) = 0,  i = 2, ..., n

The augmented matrix now is

[A^(2) | b^(2)] = [ a1,1^(1) a1,2^(1) · · · a1,n^(1) | b1^(1)
                     0       a2,2^(2) · · · a2,n^(2) | b2^(2)
                     ...       ...    . . .   ...    |  ...
                     0       an,2^(2) · · · an,n^(2) | bn^(2) ]
Step k: Assume that for i = 1, ..., k − 1 the unknown xi has been eliminated from equations i + 1 through n. We have the augmented matrix

[A^(k) | b^(k)] = [ a1,1^(1) a1,2^(1) · · ·            a1,n^(1) | b1^(1)
                     0       a2,2^(2) · · ·            a2,n^(2) | b2^(2)
                               . . .                     ...    |  ...
                     ...     0      ak,k^(k) · · ·     ak,n^(k) | bk^(k)
                     ...     ...      ...    . . .       ...    |  ...
                     0       · · · 0 an,k^(k) · · ·    an,n^(k) | bn^(k) ]

We want to eliminate the unknown xk from equations k + 1 through n. Begin by defining the multipliers

mi,k = ai,k^(k) / ak,k^(k),  i = k + 1, ..., n

The pivot element is ak,k^(k), and we assume it is nonzero. Using these multipliers, we eliminate xk from equations k + 1 through n. Multiply mi,k times row k (the pivot row) and subtract from row i, for i = k + 1 through n:

ai,j^(k+1) = ai,j^(k) − mi,k ak,j^(k),  j = k + 1, ..., n
bi^(k+1) = bi^(k) − mi,k bk^(k)

for i = k + 1, ..., n. This yields the augmented matrix

[A^(k+1) | b^(k+1)] =
[ a1,1^(1) · · ·                          a1,n^(1)     | b1^(1)
   0       . . .                            ...        |  ...
           ak,k^(k) ak,k+1^(k) · · ·      ak,n^(k)     | bk^(k)
   ...     0        ak+1,k+1^(k+1) · · ·  ak+1,n^(k+1) | bk+1^(k+1)
   ...     ...        ...        . . .      ...        |  ...
   0       · · · 0  an,k+1^(k+1) · · ·    an,n^(k+1)   | bn^(k+1) ]

Doing this for k = 1, 2, ..., n − 1 leads to the upper triangular system with the augmented matrix

[ a1,1^(1) · · ·      a1,n^(1) | b1^(1)
   0       . . .        ...    |  ...
   0       · · · 0   an,n^(n)  | bn^(n) ]

We later remove the assumption

ak,k^(k) ≠ 0,  k = 1, 2, ..., n
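The elimination loop itself is short; a sketch (gauss_elim is our own name; there is no pivoting, so every pivot is assumed nonzero):

% Sketch: Gaussian elimination without pivoting, reducing Ax = b to Ux = g.
function [U, g] = gauss_elim(A, b)
  n = length(b);
  for k = 1 : n-1
      for i = k+1 : n
          m = A(i, k)/A(k, k);                 % multiplier m_{i,k}
          A(i, k:n) = A(i, k:n) - m*A(k, k:n);
          b(i) = b(i) - m*b(k);
      end
  end
  U = A;  g = b;
end

Combined with the backsub sketch given earlier, x = backsub(U, g) then solves Ax = b.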
QUESTIONS
• How do we remove the assumption on the pivot elements?

• How many operations are involved in this procedure?

• How much error is there in the computed solution due to rounding errors in the calculations?

• How does the machine architecture affect the implementation of this algorithm?
PARTIAL PIVOTING
Recall the reduction of [A^(1) | b^(1)] to [A^(2) | b^(2)] in the first elimination step. What if a1,1^(1) = 0? In that case we look for an equation in which x1 is present. To do this in such a way as to avoid zero pivots to the maximum extent possible, we do the following.

Look at all the elements in the first column,

a1,1^(1), a2,1^(1), ..., an,1^(1)

and pick the largest in size. Say it is

|ak,1^(1)| = max_{j=1,...,n} |aj,1^(1)|

Then interchange equations 1 and k, which means interchanging rows 1 and k in the augmented matrix [A^(1) | b^(1)]. Then proceed with the elimination of x1 from equations 2 through n as before.
Having obtained [A^(2) | b^(2)], what if a2,2^(2) = 0? Then we proceed as before. Among the elements

a2,2^(2), a3,2^(2), ..., an,2^(2)

pick the one of largest size:

|ak,2^(2)| = max_{j=2,...,n} |aj,2^(2)|

Interchange rows 2 and k. Then proceed as before to eliminate x2 from equations 3 through n, thus obtaining

[A^(3) | b^(3)] = [ a1,1^(1) a1,2^(1) a1,3^(1) · · · a1,n^(1) | b1^(1)
                     0       a2,2^(2) a2,3^(2) · · · a2,n^(2) | b2^(2)
                     0       0        a3,3^(3) · · · a3,n^(3) | b3^(3)
                     ...     ...        ...    . . .   ...    |  ...
                     0       0        an,3^(3) · · · an,n^(3) | bn^(3) ]

This is done at every stage of the elimination process. This technique is called partial pivoting, and it is a part of most Gaussian elimination programs (including the one in the text).
Consequences of partial pivoting. Recall the definition of the elements obtained in the process of eliminating x1 from equations 2 through n:

mi,1 = ai,1^(1) / a1,1^(1),  i = 2, ..., n
ai,j^(2) = ai,j^(1) − mi,1 a1,j^(1),  j = 2, ..., n
bi^(2) = bi^(1) − mi,1 b1^(1)

for i = 2, ..., n. By our definition of the pivot element a1,1^(1), we have

|mi,1| ≤ 1,  i = 2, ..., n

Thus in the calculation of ai,j^(2) and bi^(2), the elements do not grow rapidly in size. This is in comparison to what might happen otherwise, in which the multipliers mi,1 might have been very large. This property is true of the multipliers at every step of the elimination process:

|mi,k| ≤ 1,  i = k + 1, ..., n,  k = 1, ..., n − 1

This property leads to good error propagation properties in Gaussian elimination with partial pivoting. The only error in Gaussian elimination is that derived from the rounding errors in the arithmetic operations. For example, at the first elimination step (eliminating x1 from equations 2 through n),

ai,j^(2) = ai,j^(1) − mi,1 a1,j^(1),  j = 2, ..., n
bi^(2) = bi^(1) − mi,1 b1^(1)

The above property on the size of the multipliers prevents these numbers and the errors in their calculation from growing as rapidly as they might if no partial pivoting were used.

As an example of the improvement in accuracy obtained with partial pivoting, see the example on pages 262-263.
OPERATION COUNTS
One of the major ways in which we compare the efficiency of different numerical methods is to count the number of needed arithmetic operations. For solving the linear system

a1,1 x1 + · · · + a1,n xn = b1
        ...
an,1 x1 + · · · + an,n xn = bn

using Gaussian elimination, we have the following operation counts.

1. A → U, where we are converting Ax = b to Ux = g:
   Divisions:        n(n − 1)/2
   Additions:        n(n − 1)(2n − 1)/6
   Multiplications:  n(n − 1)(2n − 1)/6

2. b → g:
   Additions:        n(n − 1)/2
   Multiplications:  n(n − 1)/2

3. Solving Ux = g:
   Divisions:        n
   Additions:        n(n − 1)/2
   Multiplications:  n(n − 1)/2

On some machines, the cost of a division is much more than that of a multiplication; whereas on others there is not any important difference. We assume the latter; the operation costs are then as follows.

MD(A → U) = n(n² − 1)/3
MD(b → g) = n(n − 1)/2
MD(Find x) = n(n + 1)/2

AS(A → U) = n(n − 1)(2n − 1)/6
AS(b → g) = n(n − 1)/2
AS(Find x) = n(n − 1)/2

Thus the total number of operations is

Additions:                      (2n³ + 3n² − 5n)/6
Multiplications and divisions:  (n³ + 3n² − n)/3

Both are around n³/3, and thus the total operation count is approximately

(2/3)n³

What happens to the cost when n is doubled?

Solving Ax = b and Ax = c. What is the cost? Only the modification of the right side is different in these two cases. Thus the additional cost is

MD(b → g) + MD(Find x) = n²
AS(b → g) + AS(Find x) = n(n − 1)

The total is around 2n² operations, which is quite a bit smaller than (2/3)n³ when n is even moderately large, say n = 100.

Thus one can solve the linear system Ax = c at little additional cost to that for solving Ax = b. This has important consequences when it comes to estimation of the error in computed solutions.
CALCULATING THE MATRIX INVERSE
Consider finding the inverse of a 3 × 3 matrix

A = [ a1,1 a1,2 a1,3
      a2,1 a2,2 a2,3   = [A*,1, A*,2, A*,3]
      a3,1 a3,2 a3,3 ]

We want to find a matrix

X = [X*,1, X*,2, X*,3]

for which AX = I:

A [X*,1, X*,2, X*,3] = [e1, e2, e3]
[A X*,1, A X*,2, A X*,3] = [e1, e2, e3]

This means we want to solve

A X*,1 = e1,  A X*,2 = e2,  A X*,3 = e3

that is, three linear systems, all with the same matrix of coefficients A.
MATRIX INVERSE EXAMPLE
A = [ 1  1 −2
      1  1  1
      1 −1  0 ]

[ 1  1 −2 | 1 0 0
  1  1  1 | 0 1 0       m2,1 = 1,  m3,1 = 1
  1 −1  0 | 0 0 1 ]

↓

[ 1  1 −2 |  1 0 0
  0  0  3 | −1 1 0
  0 −2  2 | −1 0 1 ]

Interchange rows 2 and 3:

[ 1  1 −2 |  1 0 0
  0 −2  2 | −1 0 1
  0  0  3 | −1 1 0 ]

Then by using back substitution to solve for each column of the inverse, we obtain

A⁻¹ = [ 1/6   1/3   1/2
        1/6   1/3  −1/2
       −1/3   1/3    0  ]
COST OF MATRIX INVERSION
In calculating A⁻¹, we are solving for the matrix X = [X*,1, X*,2, ..., X*,n] where

A [X*,1, X*,2, ..., X*,n] = [e1, e2, ..., en]

and ej is column j of the identity matrix. Thus we are solving n linear systems

A X*,1 = e1,  A X*,2 = e2,  ...,  A X*,n = en   (1)

all with the same coefficient matrix. Returning to the earlier operation counts for solving a single linear system, we have the following.

Cost of triangulating A:   approximately (2/3)n³ operations
Cost of solving Ax = b:    2n² operations

Thus solving the n linear systems in (1) costs approximately

(2/3)n³ + n(2n²) = (8/3)n³ operations

It costs approximately four times as many operations to invert A as to solve a single system. With attention to the form of the right-hand sides in (1), this can be reduced to 2n³ operations.
MATLAB MATRIX OPERATIONS
To solve the linear system Ax = b in Matlab, use

x = A \ b

In Matlab, the command

inv(A)

will calculate the inverse of A.

There are many matrix operations built into Matlab, both for general matrices and for special classes of matrices. We do not discuss those here, but recommend the student to investigate them through the Matlab help options.
GAUSSIAN ELIMINATION - REVISITED
Consider solving the linear system

2x1 + x2 − x3 + 2x4 = 5
4x1 + 5x2 − 3x3 + 6x4 = 9
−2x1 + 5x2 − 2x3 + 6x4 = 4
4x1 + 11x2 − 4x3 + 8x4 = 2

by Gaussian elimination without pivoting. We denote this linear system by Ax = b. The augmented matrix for this system is

[A | b] = [  2  1 −1 2 | 5
             4  5 −3 6 | 9
            −2  5 −2 6 | 4
             4 11 −4 8 | 2 ]

To eliminate x1 from equations 2, 3, and 4, use multipliers

m2,1 = 2,  m3,1 = −1,  m4,1 = 2

This will introduce zeros into the positions below the diagonal in column 1, yielding

[ 2 1 −1 2 |  5
  0 3 −1 2 | −1
  0 6 −3 8 |  9
  0 9 −2 4 | −8 ]

To eliminate x2 from equations 3 and 4, use multipliers

m3,2 = 2,  m4,2 = 3

This reduces the augmented matrix to

[ 2 1 −1  2 |  5
  0 3 −1  2 | −1
  0 0 −1  4 | 11
  0 0  1 −2 | −5 ]

To eliminate x3 from equation 4, use the multiplier

m4,3 = −1

This reduces the augmented matrix to

[ 2 1 −1 2 |  5
  0 3 −1 2 | −1
  0 0 −1 4 | 11
  0 0  0 2 |  6 ]

Return this to the familiar linear system

2x1 + x2 − x3 + 2x4 = 5
3x2 − x3 + 2x4 = −1
−x3 + 4x4 = 11
2x4 = 6

Solving by back substitution, we obtain

x4 = 3,  x3 = 1,  x2 = −2,  x1 = 1
There is a surprising result involving matrices associated with this elimination process. Introduce the upper triangular matrix

U = [ 2 1 −1 2
      0 3 −1 2
      0 0 −1 4
      0 0  0 2 ]

which resulted from the elimination process. Then introduce the lower triangular matrix

L = [ 1    0    0    0        [  1 0  0 0
      m2,1 1    0    0    =      2 1  0 0
      m3,1 m3,2 1    0          −1 2  1 0
      m4,1 m4,2 m4,3 1 ]         2 3 −1 1 ]

This uses the multipliers introduced in the elimination process. Then A = LU:

[  2  1 −1 2        [  1 0  0 0   [ 2 1 −1 2
   4  5 −3 6    =      2 1  0 0     0 3 −1 2
  −2  5 −2 6          −1 2  1 0     0 0 −1 4
   4 11 −4 8 ]         2 3 −1 1 ]   0 0  0 2 ]

In general, when the process of Gaussian elimination without pivoting is applied to solving a linear system Ax = b, we obtain A = LU with L and U constructed as above.
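The factorization can be checked directly in Matlab:

% Sketch: verify A = L U for the example above.
A = [2 1 -1 2; 4 5 -3 6; -2 5 -2 6; 4 11 -4 8];
L = [1 0 0 0; 2 1 0 0; -1 2 1 0; 2 3 -1 1];
U = [2 1 -1 2; 0 3 -1 2; 0 0 -1 4; 0 0 0 2];
disp(norm(A - L*U))     % prints 0: the product reproduces A exactly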
For the case in which partial pivoting is used, we obtain the slightly modified result

LU = PA

where L and U are constructed as before and P is a permutation matrix. For example, consider

P = [ 0 0 1 0
      1 0 0 0
      0 0 0 1
      0 1 0 0 ]

Then

PA = [ 0 0 1 0   [ a1,1 a1,2 a1,3 a1,4        [ A3,*
       1 0 0 0     a2,1 a2,2 a2,3 a2,4    =     A1,*
       0 0 0 1     a3,1 a3,2 a3,3 a3,4          A4,*
       0 1 0 0 ]   a4,1 a4,2 a4,3 a4,4 ]        A2,* ]

The matrix PA is obtained from A by switching around rows of A. The result LU = PA means that the LU factorization is valid for the matrix A with its rows suitably permuted.
Consequences: If we have a factorization

A = LU

with L lower triangular and U upper triangular, then we can solve the linear system Ax = b in a relatively straightforward way. The linear system can be written as

LUx = b

Write this as a two stage process:

Lg = b,  Ux = g

The system Lg = b is a lower triangular system:

g1 = b1
ℓ2,1 g1 + g2 = b2
ℓ3,1 g1 + ℓ3,2 g2 + g3 = b3
        ...
ℓn,1 g1 + · · · + ℓn,n−1 gn−1 + gn = bn

We solve it by "forward substitution". Then we solve the upper triangular system Ux = g by back substitution.
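Forward substitution mirrors the back substitution sketch given earlier (forwsub is our own name; L is assumed to have ones on its diagonal, as produced by the elimination process):

% Sketch: forward substitution for L g = b with unit diagonal L.
function g = forwsub(L, b)
  n = length(b);
  g = zeros(n, 1);
  g(1) = b(1);
  for k = 2 : n
      g(k) = b(k) - L(k, 1:k-1)*g(1:k-1);
  end
end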
VARIANTS OF GAUSSIAN ELIMINATION
If no partial pivoting is needed, then we can look for a factorization

A = LU

without going through the Gaussian elimination process. For example, suppose A is 4 × 4. We write

[ a1,1 a1,2 a1,3 a1,4        [ 1    0    0    0    [ u1,1 u1,2 u1,3 u1,4
  a2,1 a2,2 a2,3 a2,4    =     ℓ2,1 1    0    0      0    u2,2 u2,3 u2,4
  a3,1 a3,2 a3,3 a3,4          ℓ3,1 ℓ3,2 1    0      0    0    u3,3 u3,4
  a4,1 a4,2 a4,3 a4,4 ]        ℓ4,1 ℓ4,2 ℓ4,3 1 ]    0    0    0    u4,4 ]

To find the elements {ℓi,j} and {ui,j}, we multiply the right side matrices L and U and match the results with the corresponding elements in A.

Multiplying the first row of L times all of the columns of U leads to

u1,j = a1,j,  j = 1, 2, 3, 4

Then multiplying rows 2, 3, 4 times the first column of U yields

ℓi,1 u1,1 = ai,1,  i = 2, 3, 4

and we can solve for {ℓ2,1, ℓ3,1, ℓ4,1}. We can continue this process, finding the second row of U and then the second column of L, and so on. For example, to solve for ℓ4,3, we need to solve for it in

ℓ4,1 u1,3 + ℓ4,2 u2,3 + ℓ4,3 u3,3 = a4,3

Why do this? A hint of an answer is given by this last equation. If we had an n × n matrix A, then we would find ℓn,n−1 by solving for it in the equation

ℓn,1 u1,n−1 + ℓn,2 u2,n−1 + · · · + ℓn,n−1 un−1,n−1 = an,n−1

ℓn,n−1 = ( an,n−1 − [ℓn,1 u1,n−1 + · · · + ℓn,n−2 un−2,n−1] ) / un−1,n−1

Embedded in this formula we have a dot product. This is in fact typical of this process, with the length of the inner products varying from one position to another. Recalling the discussion of dot products, we can evaluate this last formula by using a higher precision arithmetic and thus avoid many rounding errors.

This leads to a variant of Gaussian elimination in which there are far fewer rounding errors. With ordinary Gaussian elimination, the number of rounding errors is proportional to n³. This variant reduces the number of rounding errors, with the number now being proportional to only n². This can lead to major increases in accuracy, especially for matrices which are very sensitive to small changes.
TRIDIAGONAL MATRICES
A = [ b1 c1 0  0 · · · 0
      a2 b2 c2 0
      0  a3 b3 c3
          . . .
      ...   an−1 bn−1 cn−1
      0  · · ·   an   bn ]

These occur very commonly in the numerical solution of partial differential equations, as well as in other applications (e.g. computing interpolating cubic spline functions).

We factor A = LU, as before. But now L and U take very simple forms. Before proceeding, we note with an example that the same may not be true of the matrix inverse.

EXAMPLE

Define an n × n tridiagonal matrix

A = [ −1  1  0  0 · · · 0
       1 −2  1  0
       0  1 −2  1  ...
           . . .
      ...    1 −2     1
       0 · · ·  1  −(n−1)/n ]

Then A⁻¹ is given by

(A⁻¹)i,j = max{i, j}

Thus the sparse matrix A can (and usually does) have a dense inverse.
We factor A = LU, with

L = [ 1  0  0  0 · · · 0
      α2 1  0  0
      0  α3 1  0  ...
          . . .
      ... αn−1 1  0
      0  · · · αn  1 ]

U = [ β1 c1 0  0 · · · 0
      0  β2 c2 0
      0  0  β3 c3
          . . .
      ...  0  βn−1 cn−1
      0  · · ·  0   βn ]

Multiply these and match coefficients with A to find {αi, βi}.

To solve the linear system

Ax = f,  or  LUx = f

instead solve the two triangular systems

Lg = f,  Ux = g

Solving Lg = f:

g1 = f1
gj = fj − αj gj−1,  j = 2, ..., n

Solving Ux = g:

xn = gn/βn
xj = (gj − cj xj+1)/βj,  j = n − 1, ..., 1

By doing a few multiplications of rows of L times columns of U, we obtain the general pattern as follows:

β1 = b1 :  row 1 of LU
α2 β1 = a2,  α2 c1 + β2 = b2 :  row 2 of LU
        ...
αn βn−1 = an,  αn cn−1 + βn = bn :  row n of LU

These are straightforward to solve:

β1 = b1
αj = aj/βj−1,  βj = bj − αj cj−1,  j = 2, ..., n
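The two recursions combine into a complete tridiagonal solver; a sketch (trisolve is our own name; a, b, c hold the sub-, main-, and super-diagonals, with a(1) and c(n) unused):

% Sketch: solve the tridiagonal system A x = f via the recursions above.
function x = trisolve(a, b, c, f)
  n = length(b);
  alpha = zeros(n, 1);  beta = zeros(n, 1);
  g = zeros(n, 1);  x = zeros(n, 1);
  beta(1) = b(1);
  for j = 2 : n                        % factor A = L U
      alpha(j) = a(j)/beta(j-1);
      beta(j)  = b(j) - alpha(j)*c(j-1);
  end
  g(1) = f(1);
  for j = 2 : n                        % forward substitution: L g = f
      g(j) = f(j) - alpha(j)*g(j-1);
  end
  x(n) = g(n)/beta(n);
  for j = n-1 : -1 : 1                 % back substitution: U x = g
      x(j) = (g(j) - c(j)*x(j+1))/beta(j);
  end
end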
OPERATIONS COUNT
Factoring A = LU:
  Additions:       n − 1
  Multiplications: n − 1
  Divisions:       n − 1

Solving Lg = f and Ux = g:
  Additions:       2n − 2
  Multiplications: 2n − 2
  Divisions:       n

Thus the total number of arithmetic operations is approximately 3n to factor A; and it takes about 5n to solve the linear system using the factorization of A.

If we had A⁻¹ at no cost, what would it cost to compute x = A⁻¹f?

xi = Σ_{j=1}^{n} (A⁻¹)i,j fj,  i = 1, ..., n
MATLAB MATRIX OPERATIONS
To obtain the LU-factorization of a matrix, including
the use of partial pivoting, use the Matlab command
lu. In particular,

[L, U, P] = lu(X)

returns the lower triangular matrix L, upper triangular matrix U, and permutation matrix P so that

PX = LU
NUMERICAL INTEGRATION
How do you evaluate

I = ∫_a^b f(x) dx ?

From calculus, if F(x) is an antiderivative of f(x), then

I = ∫_a^b f(x) dx = F(x)|_a^b = F(b) − F(a)

However, in practice most integrals cannot be evaluated by this means. And even when this can work, an approximate numerical method may be much simpler and easier to use. For example, the integrand in

∫_0^1 dx/(1 + x⁵)

has an extremely complicated antiderivative; and it is easier to evaluate the integral by approximate means. Try evaluating this integral with Maple or Mathematica.
NUMERICAL INTEGRATION: A GENERAL FRAMEWORK

Returning to a lesson used earlier with rootfinding: if you cannot solve a problem, then replace it with a "near-by" problem that you can solve. In our case, we want to evaluate

I = ∫_a^b f(x) dx

To do so, many of the numerical schemes are based on choosing approximations of f(x). Calling one such approximation f̃(x), use

I ≈ ∫_a^b f̃(x) dx ≡ Ĩ

What is the error?

E = I − Ĩ = ∫_a^b [f(x) − f̃(x)] dx

|E| ≤ ∫_a^b |f(x) − f̃(x)| dx ≤ (b − a) ‖f − f̃‖∞

‖f − f̃‖∞ ≡ max_{a≤x≤b} |f(x) − f̃(x)|

We also want to choose the approximations f̃(x) of a form we can integrate directly and easily. Examples are polynomials, trig functions, piecewise polynomials, and others.

If we use polynomial approximations, then how do we choose them? At this point, we have two choices:

1. Taylor polynomials approximating f(x)
2. Interpolatory polynomials approximating f(x)
EXAMPLE

Consider evaluating

I = ∫_0^1 e^{x²} dx

Use

e^t = 1 + t + t²/2! + · · · + t^n/n! + t^{n+1} e^{c_t}/(n+1)!

e^{x²} = 1 + x² + x⁴/2! + · · · + x^{2n}/n! + x^{2n+2} e^{c_x}/(n+1)!

with 0 ≤ c_x ≤ x². Then

I = ∫_0^1 [1 + x² + x⁴/2! + · · · + x^{2n}/n!] dx + (1/(n+1)!) ∫_0^1 x^{2n+2} e^{c_x} dx

Taking n = 3, we have

I = 1 + 1/3 + 1/10 + 1/42 + E = 1.4571 + E

0 < E ≤ (e/24) ∫_0^1 x⁸ dx = e/216 ≈ 0.0126
USING INTERPOLATORY POLYNOMIALS
In spite of the simplicity of the above example, it is generally more difficult to do numerical integration by constructing Taylor polynomial approximations than by constructing polynomial interpolates. We therefore construct the function f̃ in

∫_a^b f(x) dx ≈ ∫_a^b f̃(x) dx

by means of interpolation. Initially, we consider only the case in which the interpolation uses evenly spaced node points.
LINEAR INTERPOLATION
The linear interpolant to f(x), interpolating at a and b, is given by

P1(x) = [(b − x) f(a) + (x − a) f(b)] / (b − a)

Using it, we obtain the approximation

∫_a^b f(x) dx ≈ ∫_a^b P1(x) dx = (1/2)(b − a)[f(a) + f(b)] ≡ T1(f)

The rule

∫_a^b f(x) dx ≈ T1(f)

is called the trapezoidal rule.

[Figure: y = f(x) and y = P1(x) on [a, b], illustrating I ≈ T1(f)]

Example.

∫_0^{π/2} sin x dx ≈ (π/4)[sin 0 + sin(π/2)] = π/4 ≈ 0.785398

Error = 0.215
HOW TO OBTAIN GREATER ACCURACY?
How do we improve our estimate of the integral

I = ∫_a^b f(x) dx ?

One direction is to increase the degree of the approximation, moving next to a quadratic interpolating polynomial for f(x). We first look at an alternative.

Instead of using the trapezoidal rule on the original interval [a, b], apply it to integrals of f(x) over smaller subintervals. For example:

I = ∫_a^c f(x) dx + ∫_c^b f(x) dx,  c = (a + b)/2
  ≈ [(c − a)/2][f(a) + f(c)] + [(b − c)/2][f(c) + f(b)]
  = (h/2)[f(a) + 2f(c) + f(b)] ≡ T2(f),  h = (b − a)/2

Example.

∫_0^{π/2} sin x dx ≈ (π/8)[sin 0 + 2 sin(π/4) + sin(π/2)] ≈ 0.948059

Error = 0.0519

[Figure: y = f(x) with nodes a = x0, x1, x2, b = x3, illustrating I ≈ T3(f)]
THE TRAPEZOIDAL RULE
We can continue as above by dividing [a, b] into even smaller subintervals and applying

∫_α^β f(x) dx ≈ [(β − α)/2][f(α) + f(β)]   (*)

on each of the smaller subintervals. Begin by introducing a positive integer n ≥ 1,

h = (b − a)/n,  xj = a + j h,  j = 0, 1, ..., n

Then

I = ∫_{x0}^{xn} f(x) dx = ∫_{x0}^{x1} f(x) dx + ∫_{x1}^{x2} f(x) dx + · · · + ∫_{xn−1}^{xn} f(x) dx

Use [α, β] = [x0, x1], [x1, x2], ..., [xn−1, xn], for each of which the subinterval has length h. Then applying (*), we have

I ≈ (h/2)[f(x0) + f(x1)] + (h/2)[f(x1) + f(x2)]
  + · · · + (h/2)[f(xn−2) + f(xn−1)] + (h/2)[f(xn−1) + f(xn)]

Simplifying,

I ≈ h[ (1/2)f(a) + f(x1) + · · · + f(xn−1) + (1/2)f(b) ] ≡ Tn(f)

This is called the "composite trapezoidal rule", or more simply, the trapezoidal rule.

Example. Again integrate sin x over [0, π/2]. Then we have

n   | Tn(f)      | Error   | Ratio
1   | .785398163 | 2.15E−1 |
2   | .948059449 | 5.19E−2 | 4.13
4   | .987115801 | 1.29E−2 | 4.03
8   | .996785172 | 3.21E−3 | 4.01
16  | .999196680 | 8.03E−4 | 4.00
32  | .999799194 | 2.01E−4 | 4.00
64  | .999949800 | 5.02E−5 | 4.00
128 | .999987450 | 1.26E−5 | 4.00
256 | .999996863 | 3.14E−6 | 4.00

Note that the errors are decreasing by a constant factor of 4. Why do we always double n?
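A vectorized implementation of Tn(f) is only a few lines; a sketch (trap is our own name):

% Sketch: composite trapezoidal rule T_n(f) on [a, b].
function T = trap(f, a, b, n)
  h = (b - a)/n;
  x = a + h*(0:n);
  w = [1/2, ones(1, n-1), 1/2];    % trapezoidal weights
  T = h*(w*f(x)');
end

For example, trap(@sin, 0, pi/2, 8) reproduces the table entry .996785172.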
USING QUADRATIC INTERPOLATION
We want to approximate I = ∫_a^b f(x) dx using quadratic interpolation of f(x). Interpolate f(x) at the points {a, c, b}, with c = (a + b)/2. Also let h = (b − a)/2. The quadratic interpolating polynomial is given by

P2(x) = [(x − c)(x − b)/(2h²)] f(a) + [(x − a)(x − b)/(−h²)] f(c) + [(x − a)(x − c)/(2h²)] f(b)

Replacing f(x) by P2(x), we obtain the approximation

∫_a^b f(x) dx ≈ ∫_a^b P2(x) dx = (h/3)[f(a) + 4f(c) + f(b)] ≡ S2(f)

This is called Simpson's rule.

[Figure: y = f(x) on [a, b] with interior node (a + b)/2, illustrating I ≈ S2(f)]

Example.

∫_0^{π/2} sin x dx ≈ (π/12)[sin 0 + 4 sin(π/4) + sin(π/2)] ≈ 1.00227987749221

Error = −0.00228
SIMPSON’S RULE
As with the trapezoidal rule, we can apply Simpson's rule on smaller subdivisions in order to obtain better accuracy in approximating

I = ∫_a^b f(x) dx

Again, Simpson's rule is given by

∫_α^β f(x) dx ≈ (δ/3)[f(α) + 4f(γ) + f(β)],  γ = (α + β)/2,  δ = (β − α)/2

Let n be a positive even integer, and

h = (b − a)/n,  xj = a + j h,  j = 0, 1, ..., n

Then write

I = ∫_{x0}^{xn} f(x) dx = ∫_{x0}^{x2} f(x) dx + ∫_{x2}^{x4} f(x) dx + · · · + ∫_{xn−2}^{xn} f(x) dx

Apply the basic Simpson rule to each of these subintegrals, with

[α, β] = [x0, x2], [x2, x4], ..., [xn−2, xn]

In all cases, (β − α)/2 = h. Then

I ≈ (h/3)[f(x0) + 4f(x1) + f(x2)]
  + (h/3)[f(x2) + 4f(x3) + f(x4)]
  + · · · + (h/3)[f(xn−4) + 4f(xn−3) + f(xn−2)]
  + (h/3)[f(xn−2) + 4f(xn−1) + f(xn)]

This can be simplified to

∫_a^b f(x) dx ≈ Sn(f) ≡ (h/3)[f(x0) + 4f(x1) + 2f(x2) + 4f(x3) + 2f(x4) + · · · + 2f(xn−2) + 4f(xn−1) + f(xn)]

This is called the "composite Simpson's rule" or, more simply, Simpson's rule.

EXAMPLE

Approximate ∫_0^{π/2} sin x dx. The Simpson rule results are as follows.

n   | Sn(f)            | Error     | Ratio
2   | 1.00227987749221 | −2.28E−3  |
4   | 1.00013458497419 | −1.35E−4  | 16.94
8   | 1.00000829552397 | −8.30E−6  | 16.22
16  | 1.00000051668471 | −5.17E−7  | 16.06
32  | 1.00000003226500 | −3.23E−8  | 16.01
64  | 1.00000000201613 | −2.02E−9  | 16.00
128 | 1.00000000012600 | −1.26E−10 | 16.00
256 | 1.00000000000788 | −7.88E−12 | 16.00
512 | 1.00000000000049 | −4.92E−13 | 15.99

Note that the ratios of successive errors have converged to 16. Why? Also compare this table with that for the trapezoidal rule. For example,

I − T4 = 1.29E−2,  I − S4 = −1.35E−4
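The composite Simpson rule is just as short to implement; a sketch (simpson is our own name; n must be even):

% Sketch: composite Simpson's rule S_n(f) on [a, b].
function S = simpson(f, a, b, n)
  h = (b - a)/n;
  x = a + h*(0:n);
  w = 2*ones(1, n+1);
  w(2:2:n) = 4;                   % weight pattern 1, 4, 2, 4, ..., 2, 4, 1
  w(1) = 1;  w(n+1) = 1;
  S = (h/3)*(w*f(x)');
end

For example, simpson(@sin, 0, pi/2, 4) reproduces 1.00013458497419 from the table.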
Example 1

I(1) = ∫_0^1 e^{−x²} dx ≈ 0.746824132812427
I(2) = ∫_0^4 dx/(1 + x²) = arctan(4) ≈ 1.32581766366803
I(3) = ∫_0^{2π} dx/(2 + cos x) = 2π/√3 ≈ 3.62759872846844

Table 1. Trapezoidal rule applied to Example 1.

      I(1)             I(2)              I(3)
n   | Error   | R    | Error   | R    | Error   | R
2   | 1.6E−2  |      | −1.3E−1 |      | −5.6E−1 |
4   | 3.8E−3  | 4.02 | −3.6E−3 | 37.0 | −3.8E−2 | 14.9
8   | 9.6E−4  | 4.01 | 5.6E−4  | −6.4 | −1.9E−4 | 195.0
16  | 2.4E−4  | 4.00 | 1.4E−4  | 3.9  | −5.2E−9 | 37600
32  | 6.0E−5  | 4.00 | 3.6E−5  | 4.00 |         |
64  | 1.5E−5  | 4.00 | 9.0E−6  | 4.00 |         |
128 | 3.7E−6  | 4.00 | 2.3E−6  | 4.00 |         |

Table 2. Simpson rule applied to Example 1.

      I(1)              I(2)             I(3)
n   | Error    | R    | Error   | R   | Error  | R
2   | −3.6E−4  |      | 8.7E−2  |     | −1.26  |
4   | −3.1E−5  | 11.4 | 3.9E−2  | 2.2 | 1.4E−1 | −9.2
8   | −2.0E−6  | 15.7 | 2.0E−3  | 20  | 1.2E−2 | 11.2
16  | −1.3E−7  | 15.9 | 4.0E−6  | 485 | 6.4E−5 | 191
32  | −7.8E−9  | 16.0 | 2.3E−8  | 172 | 1.7E−9 | 37600
64  | −4.9E−10 | 16.0 | 1.5E−9  | 16  |        |
128 | −3.0E−11 | 16.0 | 9.2E−11 | 16  |        |
TRAPEZOIDAL METHOD
ERROR FORMULA
Theorem. Let f(x) have two continuous derivatives on the interval a ≤ x ≤ b. Then

En^T(f) ≡ ∫_a^b f(x) dx − Tn(f) = −[h²(b − a)/12] f″(cn)

for some cn in the interval [a, b].

Later I will say something about the proof of this result, as it leads to some other useful formulas for the error.

The above formula says that the error decreases in a manner that is roughly proportional to h². Thus doubling n (and halving h) should cause the error to decrease by a factor of approximately 4. This is what we observed with a past example from the preceding section.
Example. Consider evaluating

I = ∫_0^2 dx/(1 + x²)

using the trapezoidal method Tn(f). How large should n be chosen in order to ensure that

|En^T(f)| ≤ 5 × 10⁻⁶ ?

We begin by calculating the derivatives:

f′(x) = −2x/(1 + x²)²,  f″(x) = (−2 + 6x²)/(1 + x²)³

From a graph of f″(x),

max_{0≤x≤2} |f″(x)| = 2

Recall that b − a = 2. Therefore,

En^T(f) = −[h²(b − a)/12] f″(cn)
|En^T(f)| ≤ [h²·2/12] · 2 = h²/3

We bound |f″(cn)| since we do not know cn, and therefore we must assume the worst possible case, that which makes the error formula largest. That is what has been done above.

When do we have

|En^T(f)| ≤ 5 × 10⁻⁶ ?   (1)

To ensure this, we choose h so small that

h²/3 ≤ 5 × 10⁻⁶

This is equivalent to choosing h and n to satisfy

h ≤ .003873,  n = 2/h ≥ 516.4

Thus n ≥ 517 will imply (1).
DERIVING THE ERROR FORMULA
There are two stages in deriving the error:

(1) Obtain the error formula for the case of a single subinterval (n = 1);
(2) Use this to obtain the general error formula given earlier.

For the trapezoidal method with only a single subinterval, we have

∫_α^{α+h} f(x) dx − (h/2)[f(α) + f(α + h)] = −(h³/12) f″(c)

for some c in the interval [α, α + h]. A sketch of the derivation of this error formula is given in the problems.

Recall that the general trapezoidal rule Tn(f) was obtained by applying the simple trapezoidal rule to a subdivision of the original interval of integration. Recall defining and writing

h = (b − a)/n,  xj = a + j h,  j = 0, 1, ..., n

I = ∫_{x0}^{xn} f(x) dx = ∫_{x0}^{x1} f(x) dx + ∫_{x1}^{x2} f(x) dx + · · · + ∫_{xn−1}^{xn} f(x) dx

I ≈ (h/2)[f(x0) + f(x1)] + (h/2)[f(x1) + f(x2)] + · · · + (h/2)[f(xn−2) + f(xn−1)] + (h/2)[f(xn−1) + f(xn)]

Then the error

En^T(f) ≡ ∫_a^b f(x) dx − Tn(f)

can be analyzed by adding together the errors over the subintervals [x0, x1], [x1, x2], ..., [xn−1, xn]. Recall

∫_α^{α+h} f(x) dx − (h/2)[f(α) + f(α + h)] = −(h³/12) f″(c)

Then on [xj−1, xj],

∫_{xj−1}^{xj} f(x) dx − (h/2)[f(xj−1) + f(xj)] = −(h³/12) f″(γj)

with xj−1 ≤ γj ≤ xj, but otherwise γj unknown. Then combining these errors, we obtain

En^T(f) = −(h³/12) f″(γ1) − · · · − (h³/12) f″(γn)

This formula can be further simplified, and we will do so in two ways.

Rewrite this error as

En^T(f) = −(h³n/12) [ (f″(γ1) + · · · + f″(γn)) / n ]

Denote the quantity inside the brackets by ζn. This number satisfies

min_{a≤x≤b} f″(x) ≤ ζn ≤ max_{a≤x≤b} f″(x)

Since f″(x) is a continuous function (by original assumption), there must be some number cn in [a, b] for which

f″(cn) = ζn

Recall also that hn = b − a. Then

En^T(f) = −(h³n/12) [ (f″(γ1) + · · · + f″(γn)) / n ] = −[h²(b − a)/12] f″(cn)

This is the error formula given on the first slide.
AN ERROR ESTIMATE
We now obtain a way to estimate the error En^T(f). Return to the formula

En^T(f) = −(h³/12) f″(γ1) − · · · − (h³/12) f″(γn)

and rewrite it as

En^T(f) = −(h²/12) [f″(γ1)h + · · · + f″(γn)h]

The quantity

f″(γ1)h + · · · + f″(γn)h

is a Riemann sum for the integral

∫_a^b f″(x) dx = f′(b) − f′(a)

By this we mean

lim_{n→∞} [f″(γ1)h + · · · + f″(γn)h] = ∫_a^b f″(x) dx

Thus

f″(γ1)h + · · · + f″(γn)h ≈ f′(b) − f′(a)

for larger values of n. Combining this with the earlier error formula, we have

En^T(f) ≈ −(h²/12)[f′(b) − f′(a)] ≡ Ẽn^T(f)

This is a computable estimate of the error in the numerical integration. It is called an asymptotic error estimate.
Example. Consider evaluating

I(f) = ∫_0^π e^x cos x dx = −(e^π + 1)/2 ≈ −12.070346

In this case,

f′(x) = e^x [cos x − sin x]
f″(x) = −2e^x sin x

max_{0≤x≤π} |f″(x)| = |f″(.75π)| = 14.921

Then

En^T(f) = −[h²(b − a)/12] f″(cn)
|En^T(f)| ≤ (h²π/12) · 14.921 = 3.906h²

Also

Ẽn^T(f) = −(h²/12)[f′(π) − f′(0)] = (h²/12)[e^π + 1] ≈ 2.012h²
I(f) − Tn(f) ≈ −(h²/12)[f′(b) − f′(a)]
I(f) ≈ Tn(f) − (h²/12)[f′(b) − f′(a)]
CTn(f) ≡ Tn(f) − (h²/12)[f′(b) − f′(a)]

This is the corrected trapezoidal rule. It is easy to obtain from the trapezoidal rule, and in most cases, it converges more rapidly than the trapezoidal rule.

Table 3. Asymptotic error estimate and corrected trapezoidal rule applied to integral I(1) from Example 1.

n  | I − Tn(f) | R | Ẽn(f)  | I − CTn(f) | R
2  | 1.6E−2    |   | 1.5E−2 | 1.3E−4     |
4  | 3.8E−3    | 4 | 3.8E−3 | 7.9E−6     | 15.8
8  | 9.6E−4    | 4 | 9.6E−4 | 4.9E−7     | 16
16 | 2.4E−4    | 4 | 2.4E−4 | 3.1E−8     | 16
32 | 5.9E−5    | 4 | 5.9E−5 | 2.0E−9     | 16
64 | 1.5E−5    | 4 | 1.5E−5 | 2.2E−10    | 16
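The corrected rule is a one-line modification of the trapezoidal rule; a sketch for I(1), reusing the trap function from the earlier sketch:

% Sketch: corrected trapezoidal rule CT_n(f) for I(1) = int_0^1 exp(-x^2) dx.
f  = @(x) exp(-x.^2);
fp = @(x) -2*x.*exp(-x.^2);       % f'(x)
a = 0;  b = 1;  n = 8;  h = (b - a)/n;
CT = trap(f, a, b, n) - (h^2/12)*(fp(b) - fp(a))   % compare Table 3, row n = 8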
SIMPSON’S RULE ERROR FORMULA
Recall the general Simpson's rule

∫_a^b f(x) dx ≈ Sn(f) ≡ (h/3)[f(x0) + 4f(x1) + 2f(x2) + 4f(x3) + 2f(x4) + · · · + 2f(xn−2) + 4f(xn−1) + f(xn)]

For its error, we have

En^S(f) ≡ ∫_a^b f(x) dx − Sn(f) = −[h⁴(b − a)/180] f^(4)(cn)

for some a ≤ cn ≤ b, with cn otherwise unknown. For an asymptotic error estimate,

∫_a^b f(x) dx − Sn(f) ≈ Ẽn^S(f) ≡ −(h⁴/180)[f‴(b) − f‴(a)]

DISCUSSION

For Simpson's error formula, both formulas assume that the integrand f(x) has four continuous derivatives on the interval [a, b]. What happens when this is not valid? We return later to this question.

Both formulas also say the error should decrease by a factor of around 16 when n is doubled.

Compare these results with those for the trapezoidal rule error formulas:

En^T(f) ≡ ∫_a^b f(x) dx − Tn(f) = −[h²(b − a)/12] f″(cn)
En^T(f) ≈ −(h²/12)[f′(b) − f′(a)] ≡ Ẽn^T(f)
EXAMPLE
Consider evaluating
I = ∫_0^2 dx/(1 + x²)
using Simpson’s rule S_n(f). How large should n be chosen in order to ensure that
|E_n^S(f)| ≤ 5 × 10^−6
Begin by noting that
f^(4)(x) = 24 (5x⁴ − 10x² + 1)/(1 + x²)⁵
max_{0≤x≤2} |f^(4)(x)| = f^(4)(0) = 24
Then
E_n^S(f) = −(h⁴(b − a)/180) f^(4)(cn)
|E_n^S(f)| ≤ (h⁴ · 2/180) · 24 = (4/15)h⁴
Then |E_n^S(f)| ≤ 5 × 10^−6 is true if
(4/15)h⁴ ≤ 5 × 10^−6
h ≤ .0658,  n ≥ 30.39
Therefore, choosing n ≥ 32 will give the desired error bound. Compare this with the earlier trapezoidal example in which n ≥ 517 was needed.
For the asymptotic error estimate, we have
f'''(x) = −24x (x² − 1)/(1 + x²)⁴
Ẽ_n^S(f) ≡ −(h⁴/180)[f'''(2) − f'''(0)] = (h⁴/180) · (144/625) = (4/3125)h⁴
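As a check on the bound, the following MATLAB sketch (the Simpson weight vector and names are our own) evaluates S_32 for this integral and compares |I − S_32| against (4/15)h⁴:

% Composite Simpson's rule for int_0^2 dx/(1+x^2), n = 32
f = @(x) 1./(1 + x.^2);
a = 0; b = 2; I = atan(2);               % true value
n = 32; h = (b - a)/n; x = a + h*(0:n);
w = 2*ones(1, n+1);                      % Simpson weights 1,4,2,4,...,2,4,1
w(2:2:n) = 4; w([1 n+1]) = 1;
Sn = (h/3)*sum(w.*f(x));
fprintf('|I - S_32| = %.2e, bound (4/15)h^4 = %.2e\n', abs(I - Sn), 4*h^4/15)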
INTEGRATING √x
Consider the numerical approximation of
∫_0^1 √x dx = 2/3
In the following table, we give the errors when using both the trapezoidal and Simpson rules.
n      E_n^T      Ratio   E_n^S      Ratio
2      6.311E−2           2.860E−2
4      2.338E−2   2.70    1.012E−2   2.82
8      8.536E−3   2.74    3.587E−3   2.83
16     3.085E−3   2.77    1.268E−3   2.83
32     1.108E−3   2.78    4.485E−4   2.83
64     3.959E−4   2.80    1.586E−4   2.83
128    1.410E−4   2.81    5.606E−5   2.83
The rate of convergence is slower because the function f(x) = √x is not sufficiently differentiable on [0, 1]. Both methods converge with a rate proportional to h^1.5.
ASYMPTOTIC ERROR FORMULAS
If we have a numerical integration formula,
∫_a^b f(x) dx ≈ Σ_{j=0}^n wj f(xj)
let E_n(f) denote its error,
E_n(f) = ∫_a^b f(x) dx − Σ_{j=0}^n wj f(xj)
We say another formula Ẽ_n(f) is an asymptotic error formula for this numerical integration if it satisfies
lim_{n→∞} Ẽ_n(f)/E_n(f) = 1
Equivalently,
lim_{n→∞} [E_n(f) − Ẽ_n(f)]/E_n(f) = 0
These conditions say that Ẽ_n(f) looks increasingly like E_n(f) as n increases, and thus
E_n(f) ≈ Ẽ_n(f)
Example. For the trapezoidal rule,
E_n^T(f) ≈ Ẽ_n^T(f) ≡ −(h²/12)[f'(b) − f'(a)]
This assumes f(x) has two continuous derivatives on the interval [a, b].
Example. For Simpson’s rule,
E_n^S(f) ≈ Ẽ_n^S(f) ≡ −(h⁴/180)[f'''(b) − f'''(a)]
This assumes f(x) has four continuous derivatives on the interval [a, b].
Note that both of these formulas can be written in an equivalent form as
Ẽ_n(f) = c/n^p
for an appropriate constant c and exponent p. With the trapezoidal rule, p = 2 and
c = −((b − a)²/12)[f'(b) − f'(a)]
and for Simpson’s rule, p = 4 with a suitable c.
The formula
Ẽ_n(f) = c/n^p     (2)
occurs for many other numerical integration formulas that we have not yet defined or studied. In addition, if we use the trapezoidal or Simpson rules with an integrand f(x) which is not sufficiently differentiable, then (2) may hold with an exponent p that is less than the ideal.
Example. Consider
I = ∫_0^1 x^β dx
in which −1 < β < 1, β ≠ 0. Then the convergence of the trapezoidal rule can be shown to have an asymptotic error formula
E_n ≈ Ẽ_n = c/n^(β+1)     (3)
for some constant c dependent on β. A similar result holds for Simpson’s rule, with −1 < β < 3, β not an integer. We can actually specify a formula for c; but the formula is often less important than knowing that (2) is valid for some c.
APPLICATION OF ASYMPTOTIC
ERROR FORMULAS
Assume we know that an asymptotic error formula
I − I_n ≈ c/n^p
is valid for some numerical integration rule denoted by I_n. Initially, assume we know the exponent p. Then imagine calculating both I_n and I_2n. With I_2n, we have
I − I_2n ≈ c/(2^p n^p)
This leads to
I − I_n ≈ 2^p [I − I_2n]
I ≈ (2^p I_2n − I_n)/(2^p − 1) = I_2n + (I_2n − I_n)/(2^p − 1)
The formula
I ≈ I_2n + (I_2n − I_n)/(2^p − 1)     (4)
is called Richardson’s extrapolation formula.
Example. With the trapezoidal rule and with the integrand f(x) having two continuous derivatives,
I ≈ T_2n + (1/3)[T_2n − T_n]
Example. With Simpson’s rule and with the integrand f(x) having four continuous derivatives,
I ≈ S_2n + (1/15)[S_2n − S_n]
We can also use the formula (2) to obtain error estimation formulas:
I − I_2n ≈ (I_2n − I_n)/(2^p − 1)     (5)
This is called Richardson’s error estimate. For example, with the trapezoidal rule,
I − T_2n ≈ (1/3)[T_2n − T_n]
These formulas are illustrated for the trapezoidal rule in an accompanying table, for
∫_0^π e^x cos x dx = −(e^π + 1)/2 ≐ −12.07034632
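In MATLAB, formulas (4) and (5) take only a few lines. The sketch below (the inline helper trap and all names are our own) applies them to the trapezoidal rule, where p = 2:

f = @(x) exp(x).*cos(x);  a = 0;  b = pi;
I = -(exp(pi) + 1)/2;
trap = @(n) ((b-a)/n)*(sum(f(a + (b-a)/n*(0:n))) - (f(a) + f(b))/2);
n = 8;  Tn = trap(n);  T2n = trap(2*n);
Rich   = T2n + (T2n - Tn)/3;      % Richardson extrapolation (4), p = 2
ErrEst = (T2n - Tn)/3;            % Richardson's error estimate (5) for T2n
fprintf('I - T2n = % .2e  estimate = % .2e  I - Rich = % .2e\n', ...
        I - T2n, ErrEst, I - Rich)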
AITKEN EXTRAPOLATION
In this case, we again assume
I − I_n ≈ c/n^p
But in contrast to previously, we do not know either c or p. Imagine computing I_n, I_2n, and I_4n. Then
I − I_n ≈ c/n^p
I − I_2n ≈ c/(2^p n^p)
I − I_4n ≈ c/(4^p n^p)
We can directly try to estimate I. Dividing,
(I − I_n)/(I − I_2n) ≈ 2^p ≈ (I − I_2n)/(I − I_4n)
Solving for I, we obtain
(I − I_2n)² ≈ (I − I_n)(I − I_4n)
I (I_n + I_4n − 2I_2n) ≈ I_n I_4n − I_2n²
I ≈ (I_n I_4n − I_2n²)/(I_n + I_4n − 2I_2n)
This can be improved computationally, to avoid loss-of-significance errors:
I ≈ I_4n + [(I_n I_4n − I_2n²)/(I_n + I_4n − 2I_2n) − I_4n]
  = I_4n − (I_4n − I_2n)²/[(I_4n − I_2n) − (I_2n − I_n)]
This is called Aitken’s extrapolation formula.
To estimate p, we use
(I_2n − I_n)/(I_4n − I_2n) ≈ 2^p
To see this, write
(I_2n − I_n)/(I_4n − I_2n) = [(I − I_n) − (I − I_2n)]/[(I − I_2n) − (I − I_4n)]
Then substitute from the following and simplify:
I − I_n ≈ c/n^p
I − I_2n ≈ c/(2^p n^p)
I − I_4n ≈ c/(4^p n^p)
Example. Consider the following table of numerical integrals. What is its order of convergence?
n     I_n             I_n − I_{n/2}   Ratio
2     .28451779686
4     .28559254576    1.075E−3
8     .28570248748    1.099E−4        9.78
16    .28571317731    1.069E−5        10.28
32    .28571418363    1.006E−6        10.62
64    .28571427643    9.280E−8        10.84
It appears
2^p ≐ 10.84,  p ≐ log₂ 10.84 = 3.44
We could now combine this with Richardson’s error formula to estimate the error:
I − I_n ≈ (1/(2^p − 1)) [I_n − I_{n/2}]
For example,
I − I_64 ≈ (1/9.84)(9.280E−8) = 9.43E−9
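The same computation is easy to script. This sketch feeds the n = 4, 8, 16 entries of the table above into the estimate of p and into Aitken's formula (variable names are ours):

In = .28559254576;  I2n = .28570248748;  I4n = .28571317731;
p  = log2((I2n - In)/(I4n - I2n));                    % estimate of the order
Ia = I4n - (I4n - I2n)^2/((I4n - I2n) - (I2n - In));  % Aitken extrapolation
fprintf('p = %.2f,  Aitken value = %.11f\n', p, Ia)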
PERIODIC FUNCTIONS
A function f(x) is periodic if the following condition is satisfied. There is a smallest real number τ > 0 for which
f(x + τ) = f(x),  −∞ < x < ∞     (6)
The number τ is called the period of the function f(x). The constant function f(x) ≡ 1 is also considered periodic, but it satisfies this condition with any τ > 0. Basically, a periodic function is one which repeats itself over intervals of length τ.
The condition (6) implies
f^(m)(x + τ) = f^(m)(x),  −∞ < x < ∞     (7)
for the mth derivative of f(x), provided there is such a derivative. Thus the derivatives are also periodic.
Periodic functions occur very frequently in applications of mathematics, reflecting the periodicity of many phenomena in the physical world.
PERIODIC INTEGRANDS
Consider the special class of integrals
I(f) = ∫_a^b f(x) dx
in which f(x) is periodic, with b − a an integer multiple of the period τ for f(x). In this case, the performance of the trapezoidal rule and other numerical integration rules is much better than that predicted by earlier error formulas.
To hint at this improved performance, recall
∫_a^b f(x) dx − T_n(f) ≈ Ẽ_n(f) ≡ −(h²/12)[f'(b) − f'(a)]
With our assumption on the periodicity of f(x), we have
f(a) = f(b),  f'(a) = f'(b)
Therefore, Ẽ_n(f) = 0, and we should expect improved performance in the convergence behaviour of the trapezoidal sums T_n(f).
If in addition to being periodic on [a, b], the integrand f(x) also has m continuous derivatives, then it can be shown that
I(f) − T_n(f) = c/n^m + smaller terms
By “smaller terms”, we mean terms which decrease to zero more rapidly than n^−m.
Thus if f(x) is periodic with b − a an integer multiple of the period τ for f(x), and if f(x) is infinitely differentiable, then the error I − T_n decreases to zero more rapidly than n^−m for any m > 0. For periodic integrands, the trapezoidal rule is an optimal numerical integration method.
Example. Consider evaluating
I = ∫_0^{2π} sin x dx/(1 + e^{sin x})
Using the trapezoidal rule, we have the results in the following table. In this case, the formulas based on Richardson extrapolation are no longer valid.
n     T_n                  T_n − T_{n/2}
2     0.0
4     −0.72589193317292    −7.259E−1
8     −0.74006131211583    −1.417E−2
16    −0.74006942337672    −8.111E−6
32    −0.74006942337946    −2.746E−12
64    −0.74006942337946    0.0
NUMERICAL INTEGRATION:
ANOTHER APPROACH
We look for numerical integration formulas
∫_{−1}^1 f(x) dx ≈ Σ_{j=1}^n wj f(xj)
which are to be exact for polynomials of as large a degree as possible. There are no restrictions placed on the nodes {xj} or the weights {wj} in working towards that goal. The motivation is that if it is exact for high degree polynomials, then perhaps it will be very accurate when integrating functions that are well approximated by polynomials.
There is no guarantee that such an approach will work. In fact, it turns out to be a bad idea when the node points {xj} are required to be evenly spaced over the interval of integration. But without this restriction on {xj} we are able to develop a very accurate set of quadrature formulas.
The case n = 1. We want a formula
w1 f(x1) ≈ ∫_{−1}^1 f(x) dx
The weight w1 and the node x1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. To do this we substitute f(x) = 1 and f(x) = x. The first choice leads to
w1 · 1 = ∫_{−1}^1 1 dx,  w1 = 2
The choice f(x) = x leads to
w1 x1 = ∫_{−1}^1 x dx = 0,  x1 = 0
The desired formula is
∫_{−1}^1 f(x) dx ≈ 2f(0)
It is called the midpoint rule.
The case n = 2. We want a formula
w1 f(x1) + w2 f(x2) ≈ ∫_{−1}^1 f(x) dx
The weights w1, w2 and the nodes x1, x2 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. We substitute and force equality for
f(x) = 1, x, x², x³
This leads to the system
w1 + w2 = ∫_{−1}^1 1 dx = 2
w1 x1 + w2 x2 = ∫_{−1}^1 x dx = 0
w1 x1² + w2 x2² = ∫_{−1}^1 x² dx = 2/3
w1 x1³ + w2 x2³ = ∫_{−1}^1 x³ dx = 0
The solution is given by
w1 = w2 = 1,  x1 = −1/√3,  x2 = 1/√3
This yields the formula
∫_{−1}^1 f(x) dx ≈ f(−1/√3) + f(1/√3)     (1)
We say it has degree of precision equal to 3 since it integrates exactly all polynomials of degree ≤ 3. We can verify directly that it does not integrate exactly f(x) = x⁴:
∫_{−1}^1 x⁴ dx = 2/5
f(−1/√3) + f(1/√3) = 2/9
Thus (1) has degree of precision exactly 3.
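The degree of precision can be verified numerically; the following sketch applies the rule (1) to the monomials x^k, k = 0, ..., 4, and compares with the exact integrals:

nodes = [-1 1]/sqrt(3);  w = [1 1];          % the 2-point Gauss rule (1)
for k = 0:4
    rule  = sum(w .* nodes.^k);
    exact = (1 - (-1)^(k+1))/(k + 1);        % int_{-1}^{1} x^k dx
    fprintf('k = %d:  rule = %8.5f  exact = %8.5f\n', k, rule, exact)
end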
EXAMPLE Integrate
∫_{−1}^1 dx/(3 + x) = log 2 ≐ 0.69314718
The formula (1) yields
1/(3 + x1) + 1/(3 + x2) = 0.69230769
Error = .000839
THE GENERAL CASE
We want to find the weights {wi} and nodes {xi} so as to have
∫_{−1}^1 f(x) dx ≈ Σ_{j=1}^n wj f(xj)
be exact for polynomials f(x) of as large a degree as possible. As unknowns, there are n weights wi and n nodes xi. Thus it makes sense to initially impose 2n conditions so as to obtain 2n equations for the 2n unknowns. We require the quadrature formula to be exact for the cases
f(x) = x^i,  i = 0, 1, 2, ..., 2n − 1
Then we obtain the system of equations
w1 x1^i + w2 x2^i + ··· + wn xn^i = ∫_{−1}^1 x^i dx
for i = 0, 1, 2, ..., 2n − 1. For the right sides,
∫_{−1}^1 x^i dx = 2/(i + 1) for i = 0, 2, ..., 2n − 2, and = 0 for i = 1, 3, ..., 2n − 1
The system of equations
w1 x1^i + ··· + wn xn^i = ∫_{−1}^1 x^i dx,  i = 0, ..., 2n − 1
has a solution, and the solution is unique except for re-ordering the unknowns. The resulting numerical integration rule is called Gaussian quadrature.
In fact, the nodes and weights are not found by solving this system. Rather, the nodes and weights have other properties which enable them to be found more easily by other methods. There are programs to produce them; and most subroutine libraries have either a program to produce them or tables of them for commonly used cases.
CHANGE OF INTERVAL
OF INTEGRATION
Integrals on other finite intervals [a, b] can be converted to integrals over [−1, 1], as follows:
∫_a^b F(x) dx = ((b − a)/2) ∫_{−1}^1 F((b + a + t(b − a))/2) dt
based on the change of integration variables
x = (b + a + t(b − a))/2,  −1 ≤ t ≤ 1
EXAMPLE Over the interval [0, π], use
x = (1 + t)π/2
Then
∫_0^π F(x) dx = (π/2) ∫_{−1}^1 F((1 + t)π/2) dt
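As an illustration, the 2-point rule (1) combined with this change of variable gives a very cheap approximation of the earlier integral ∫_0^π e^x cos x dx (a sketch, with our own names):

F = @(x) exp(x).*cos(x);
t = [-1 1]/sqrt(3);  w = [1 1];       % nodes and weights on [-1,1]
x = (1 + t)*pi/2;                     % transformed nodes on [0,pi]
I2 = (pi/2)*sum(w.*F(x));
fprintf('2-point Gauss: %.6f  (true %.6f)\n', I2, -(exp(pi) + 1)/2)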
AN ERROR FORMULA
The usual error formula for the Gaussian quadrature formula,
E_n(f) = ∫_{−1}^1 f(x) dx − Σ_{j=1}^n wj f(xj)
is not particularly intuitive. It is given by
E_n(f) = e_n f^(2n)(cn)/(2n)!
e_n = 2^(2n+1) (n!)⁴ / [(2n + 1)((2n)!)²]
for some −1 ≤ cn ≤ 1.
To help in understanding the implications of this error formula, introduce
M_k = max_{−1≤x≤1} |f^(k)(x)| / k!
With many integrands f(x), this sequence {M_k} is bounded or even decreases to zero. For example,
f(x) = cos x  ⇒  M_k ≤ 1/k!
f(x) = 1/(2 + x)  ⇒  M_k ≤ 1
Then for our error formula,
E_n(f) = e_n f^(2n)(cn)/(2n)!,  |E_n(f)| ≤ e_n M_{2n}     (2)
By other methods, we can show
e_n ≈ π/4^n
When combined with (2) and an assumption of uniform boundedness for {M_k}, we have that the error decreases by a factor of at least 4 with each increase of n to n + 1. Compare this to the convergence of the trapezoidal and Simpson rules for such functions, to help explain the very rapid convergence of Gaussian quadrature.
A SECOND ERROR FORMULA
Let f(x) be continuous for a ≤ x ≤ b, and let n ≥ 1. Then, for the Gaussian numerical integration formula
I ≡ ∫_a^b f(x) dx ≈ Σ_{j=1}^n wj f(xj) ≡ I_n
on [a, b], the error in I_n satisfies
|I(f) − I_n(f)| ≤ 2(b − a) ρ_{2n−1}(f)     (3)
Here ρ_{2n−1}(f) is the minimax error of degree 2n − 1 for f(x) on [a, b]:
ρ_m(f) = min_{deg(p)≤m} [max_{a≤x≤b} |f(x) − p(x)|],  m ≥ 0
EXAMPLE Let f(x) = e^{−x²}. Then the minimax errors ρ_m(f) are given in the following table.
m   ρ_m(f)      m    ρ_m(f)
1   5.30E−2     6    7.82E−6
2   1.79E−2     7    4.62E−7
3   6.63E−4     8    9.64E−8
4   4.63E−4     9    8.05E−9
5   1.62E−5     10   9.16E−10
Using this table, apply (3) to
I = ∫_0^1 e^{−x²} dx
For n = 3, (3) implies
|I − I_3| ≤ 2ρ_5(e^{−x²}) ≐ 3.24 × 10^−5
The actual error is 9.55E−6.
INTEGRATING
A NON-SMOOTH INTEGRAND
Consider using Gaussian quadrature to evaluate
I = ∫_0^1 √x dx = 2/3
n     I − I_n     Ratio
2     −7.22E−3
4     −1.16E−3    6.2
8     −1.69E−4    6.9
16    −2.30E−5    7.4
32    −3.00E−6    7.6
64    −3.84E−7    7.8
The column labeled Ratio is defined by
(I − I_{n/2})/(I − I_n)
It is consistent with I − I_n ≈ c/n³, which can be proven theoretically. In comparison, for the trapezoidal and Simpson rules, I − I_n ≈ c/n^1.5.
WEIGHTED GAUSSIAN QUADRATURE
Consider needing to evaluate integrals such as
∫_0^1 f(x) log x dx,  ∫_0^1 x^{1/3} f(x) dx
How do we proceed? Consider numerical integration formulas
∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^n wj f(xj)
in which f(x) is considered a “nice” function (one with several continuous derivatives). The function w(x) is allowed to be singular, but must be integrable. We assume here that [a, b] is a finite interval. The function w(x) is called a “weight function”, and it is implicitly absorbed into the definition of the quadrature weights {wi}. We again determine the nodes {xi} and weights {wi} so as to make the integration formula exact for f(x) a polynomial of as large a degree as possible.
The resulting numerical integration formula
∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^n wj f(xj)
is called a Gaussian quadrature formula with weight function w(x). We determine the nodes {xi} and weights {wi} by requiring exactness in the above formula for
f(x) = x^i,  i = 0, 1, 2, ..., 2n − 1
To make the derivation more understandable, we consider the particular case
∫_0^1 x^{1/3} f(x) dx ≈ Σ_{j=1}^n wj f(xj)
We follow the same pattern as used earlier.
The case n = 1. We want a formula
w1 f(x1) ≈ ∫_0^1 x^{1/3} f(x) dx
The weight w1 and the node x1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. Choosing f(x) = 1, we have
w1 = ∫_0^1 x^{1/3} dx = 3/4
Choosing f(x) = x, we have
w1 x1 = ∫_0^1 x^{1/3} · x dx = 3/7,  x1 = 4/7
Thus
∫_0^1 x^{1/3} f(x) dx ≈ (3/4) f(4/7)
has degree of precision 1.
The case n = 2. We want a formula
w1 f(x1) + w2 f(x2) ≈ ∫_0^1 x^{1/3} f(x) dx
The weights w1, w2 and the nodes x1, x2 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. We determine them by requiring equality for
f(x) = 1, x, x², x³
This leads to the system
w1 + w2 = ∫_0^1 x^{1/3} dx = 3/4
w1 x1 + w2 x2 = ∫_0^1 x · x^{1/3} dx = 3/7
w1 x1² + w2 x2² = ∫_0^1 x² x^{1/3} dx = 3/10
w1 x1³ + w2 x2³ = ∫_0^1 x³ x^{1/3} dx = 3/13
The solution is
x1 = 7/13 − (3/65)√35,  x2 = 7/13 + (3/65)√35
w1 = 3/8 − (3/392)√35,  w2 = 3/8 + (3/392)√35
Numerically,
x1 = .2654117024,  x2 = .8115113746
w1 = .3297238792,  w2 = .4202761208
The formula
∫_0^1 x^{1/3} f(x) dx ≈ w1 f(x1) + w2 f(x2)     (4)
has degree of precision 3.
EXAMPLE Consider evaluating the integral
∫_0^1 x^{1/3} cos x dx     (5)
In applying (4), we take f(x) = cos x. Then
w1 f(x1) + w2 f(x2) = 0.6074977951
The true answer is
∫_0^1 x^{1/3} cos x dx ≐ 0.6076257393
and our numerical answer is in error by E2 ≐ .000128.
This is quite a good answer involving very little computational effort (once the formula has been determined). In contrast, the trapezoidal and Simpson rules applied to (5) would converge very slowly because the first derivative of the integrand is singular at the origin.
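The computation in this example is a few lines of MATLAB (a sketch; the built-in integral is used only to produce a reference value, and on older releases quad can play that role):

x1 = 7/13 - (3/65)*sqrt(35);   x2 = 7/13 + (3/65)*sqrt(35);
w1 = 3/8  - (3/392)*sqrt(35);  w2 = 3/8  + (3/392)*sqrt(35);
I2 = w1*cos(x1) + w2*cos(x2);                     % formula (4) with f = cos
Iref = integral(@(x) x.^(1/3).*cos(x), 0, 1);     % reference value
fprintf('I2 = %.10f   reference = %.10f\n', I2, Iref)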
CHANGE OF VARIABLES
As a side note to the preceding example, we observe that the change of variables x = t³ transforms the integral (5) to
3 ∫_0^1 t³ cos(t³) dt
and both the trapezoidal and Simpson rules will perform better with this formula, although still not as well as our weighted Gaussian quadrature.
A change of the integration variable can often improve the performance of a standard method, usually by increasing the differentiability of the integrand.
EXAMPLE Using x = t^r for some r > 1, we have
∫_0^1 g(x) log x dx = r² ∫_0^1 t^{r−1} g(t^r) log t dt
(the two factors of r come from dx = r t^{r−1} dt and log x = r log t). The new integrand is generally smoother than the original one.
INTERPOLATION
Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).
As an example, consider defining
x0 = 0,  x1 = π/4,  x2 = π/2
and
yi = cos xi,  i = 0, 1, 2
This gives us the three points
(0, 1),  (π/4, 1/√2),  (π/2, 0)
Now find a quadratic polynomial
p(x) = a0 + a1 x + a2 x²
for which
p(xi) = yi,  i = 0, 1, 2
The graph of this polynomial is shown on the accompanying graph. We later give an explicit formula.
[Figure: quadratic interpolation of cos(x); y = cos(x) and y = p2(x) on [0, π/2]]
PURPOSES OF INTERPOLATION
1. Replace a set of data points {(xi, yi)} with a function given analytically.
2. Approximate functions with simpler ones, usually polynomials or ‘piecewise polynomials’.
Purpose #1 has several aspects.
• The data may be from a known class of functions. Interpolation is then used to find the member of this class of functions that agrees with the given data. For example, data may be generated from functions of the form
p(x) = a0 + a1 e^x + a2 e^{2x} + ··· + an e^{nx}
Then we need to find the coefficients {aj} based on the given data values.
• We may want to take function values f(x) given in a table for selected values of x, often equally spaced, and extend the function to values of x not in the table.
For example, given numbers from a table of logarithms, estimate the logarithm of a number x not in the table.
• Given a set of data points {(xi, yi)}, find a curve passing thru these points that is “pleasing to the eye”. In fact, this is what is done continually with computer graphics. How do we connect a set of points to make a smooth curve? Connecting them with straight line segments will often give a curve with many corners, whereas what was intended was a smooth curve.
Purpose #2 for interpolation is to approximate functions f(x) by simpler functions p(x), perhaps to make it easier to integrate or differentiate f(x). That will be the primary reason for studying interpolation in this course.
As an example of why this is important, consider the problem of evaluating
I = ∫_0^1 dx/(1 + x^10)
This is very difficult to do analytically. But we will look at producing polynomial interpolants of the integrand; and polynomials are easily integrated exactly.
We begin by using polynomials as our means of doing
interpolation. Later in the chapter, we consider more
complex ‘piecewise polynomial’ functions, often called
‘spline functions’.
LINEAR INTERPOLATION
The simplest form of interpolation is probably the straight line, connecting two points by a straight line.
Let two data points (x0, y0) and (x1, y1) be given. There is a unique straight line passing through these points. We can write the formula for a straight line as
P1(x) = a0 + a1 x
In fact, there are other more convenient ways to write it, and we give several of them below.
P1(x) = ((x − x1)/(x0 − x1)) y0 + ((x − x0)/(x1 − x0)) y1
      = [(x1 − x) y0 + (x − x0) y1]/(x1 − x0)
      = y0 + ((x − x0)/(x1 − x0)) [y1 − y0]
      = y0 + ((y1 − y0)/(x1 − x0)) (x − x0)
Check each of these by evaluating them at x = x0 and x1 to see if the respective values are y0 and y1.
Example. Following is a table of values for f(x) = tan x for a few values of x.
x       1        1.1      1.2      1.3
tan x   1.5574   1.9648   2.5722   3.6021
Use linear interpolation to estimate tan(1.15). Then use
x0 = 1.1,  x1 = 1.2
with corresponding values for y0 and y1. Then
tan x ≈ y0 + ((x − x0)/(x1 − x0)) [y1 − y0]
tan(1.15) ≈ 1.9648 + ((1.15 − 1.1)/(1.2 − 1.1)) [2.5722 − 1.9648] = 2.2685
The true value is tan 1.15 = 2.2345. We will want to examine formulas for the error in interpolation, to know when we have sufficient accuracy in our interpolant.
[Figure: y = tan(x) on [1, 1.3]]
[Figure: y = tan(x) and the linear interpolant y = P1(x) on [1.1, 1.2]]
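In MATLAB the computation is immediate (a sketch with the table values typed in):

x0 = 1.1;  x1 = 1.2;  y0 = 1.9648;  y1 = 2.5722;   % table values
x  = 1.15;
P1 = y0 + (x - x0)/(x1 - x0)*(y1 - y0);            % linear interpolation
fprintf('P1(1.15) = %.4f,  tan(1.15) = %.4f\n', P1, tan(1.15))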
QUADRATIC INTERPOLATION
We want to find a polynomial
P2(x) = a0 + a1 x + a2 x²
which satisfies
P2(xi) = yi,  i = 0, 1, 2
for given data points (x0, y0), (x1, y1), (x2, y2). One formula for such a polynomial follows:
P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)     (∗∗)
with
L0(x) = (x − x1)(x − x2)/[(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2)/[(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1)/[(x2 − x0)(x2 − x1)]
The formula (∗∗) is called Lagrange’s form of the interpolation polynomial.
LAGRANGE BASIS FUNCTIONS
The functions
L0(x) = (x − x1)(x − x2)/[(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2)/[(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1)/[(x2 − x0)(x2 − x1)]
are called ‘Lagrange basis functions’ for quadratic interpolation. They have the properties
Li(xj) = 1 if i = j,  0 if i ≠ j
for i, j = 0, 1, 2. Also, they all have degree 2. Their graphs are on an accompanying page.
As a consequence of each Li(x) being of degree 2, we
have that the interpolant
P2(x) = y0L0(x) + y1L1(x) + y2L2(x)
must have degree ≤ 2.
UNIQUENESS
Can there be another polynomial, call it Q(x), for which
deg(Q) ≤ 2
Q(xi) = yi,  i = 0, 1, 2
Thus, is the Lagrange formula P2(x) unique?
Introduce
R(x) = P2(x) − Q(x)
From the properties of P2 and Q, we have deg(R) ≤ 2. Moreover,
R(xi) = P2(xi) − Q(xi) = yi − yi = 0
for all three node points x0, x1, and x2. How many polynomials R(x) are there of degree at most 2 and having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore
R(x) = 0 for all x
Q(x) = P2(x) for all x
SPECIAL CASES
Consider the data points
(x0, 1), (x1, 1), (x2, 1)
What is the polynomial P2(x) in this case?
Answer: The polynomial interpolant is
P2(x) ≡ 1
meaning that P2(x) is the constant function. Why? First, the constant function satisfies the property of being of degree ≤ 2. Next, it clearly interpolates the given data. Therefore by the uniqueness of quadratic interpolation, P2(x) must be the constant function 1.
Consider now the data points
(x0, mx0), (x1, mx1), (x2, mx2)
for some constant m. What is P2(x) in this case? By an argument similar to that above,
P2(x) = mx for all x
Thus the degree of P2(x) can be less than 2.
HIGHER DEGREE INTERPOLATION
We consider now the case of interpolation by polynomials of a general degree n. We want to find a polynomial Pn(x) for which
deg(Pn) ≤ n
Pn(xi) = yi,  i = 0, 1, ..., n     (∗∗)
with given data points
(x0, y0), (x1, y1), ..., (xn, yn)
The solution is given by Lagrange’s formula
Pn(x) = y0 L0(x) + y1 L1(x) + ··· + yn Ln(x)
The Lagrange basis functions are given by
Lk(x) = [(x − x0) ··· (x − xk−1)(x − xk+1) ··· (x − xn)] / [(xk − x0) ··· (xk − xk−1)(xk − xk+1) ··· (xk − xn)]
for k = 0, 1, 2, ..., n. The quadratic case was covered earlier.
In a manner analogous to the quadratic case, we can show that the above Pn(x) is the only solution to the problem (∗∗).
In the formula for Lk(x) we can see that each such function is a polynomial of degree n. In addition,
Lk(xi) = 1 if k = i,  0 if k ≠ i
Using these properties, it follows that the formula
Pn(x) = y0 L0(x) + y1 L1(x) + ··· + yn Ln(x)
satisfies the interpolation problem of finding a solution to
deg(Pn) ≤ n,  Pn(xi) = yi,  i = 0, 1, ..., n
EXAMPLE
Recall the table
x       1        1.1      1.2      1.3
tan x   1.5574   1.9648   2.5722   3.6021
We now interpolate this table with the nodes
x0 = 1,  x1 = 1.1,  x2 = 1.2,  x3 = 1.3
Without giving the details of the evaluation process, we have the following results for interpolation with degrees n = 1, 2, 3.
n          1        2        3
Pn(1.15)   2.2685   2.2435   2.2296
Error      −.0340   −.0090   .0049
It improves with increasing degree n, but not at a very
rapid rate. In fact, the error becomes worse when n is
increased further. Later we will see that interpolation
of a much higher degree, say n ≥ 10, is often poorly
behaved when the node points {xi} are evenly spaced.
A FIRST ORDER DIVIDED DIFFERENCE
For a given function f(x) and two distinct points x0 and x1, define
f[x0, x1] = (f(x1) − f(x0))/(x1 − x0)
This is called a first order divided difference of f(x). By the Mean Value Theorem,
f(x1) − f(x0) = f'(c)(x1 − x0)
for some c between x0 and x1. Thus
f[x0, x1] = f'(c)
and the divided difference is very much like the derivative, especially if x0 and x1 are quite close together. In fact,
f'((x1 + x0)/2) ≈ f[x0, x1]
is quite an accurate approximation of the derivative (see §5.4).
SECOND ORDER DIVIDED DIFFERENCES
Given three distinct points x0, x1, and x2, define
f[x0, x1, x2] = (f[x1, x2] − f[x0, x1])/(x2 − x0)
This is called the second order divided difference of f(x).
By a fairly complicated argument, we can show
f[x0, x1, x2] = (1/2) f''(c)
for some c intermediate to x0, x1, and x2. In fact, as we investigate in §5.4,
f''(x1) ≈ 2 f[x0, x1, x2]
in the case the nodes are evenly spaced,
x1 − x0 = x2 − x1
EXAMPLE
Consider the table
x       1        1.1      1.2      1.3      1.4
cos x   .54030   .45360   .36236   .26750   .16997
Let x0 = 1, x1 = 1.1, and x2 = 1.2. Then
f[x0, x1] = (.45360 − .54030)/(1.1 − 1) = −.86700
f[x1, x2] = (.36236 − .45360)/(1.2 − 1.1) = −.91240
f[x0, x1, x2] = (f[x1, x2] − f[x0, x1])/(x2 − x0) = (−.91240 − (−.86700))/(1.2 − 1.0) = −.22700
For comparison,
f'((x1 + x0)/2) = −sin(1.05) = −.86742
(1/2) f''(x1) = −(1/2) cos(1.1) = −.22680
GENERAL DIVIDED DIFFERENCES
Given n + 1 distinct points x0, ..., xn, with n ≥ 2, define
f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1])/(xn − x0)
This is a recursive definition of the nth-order divided difference of f(x), using divided differences of order n − 1. Its relation to the derivative is as follows:
f[x0, ..., xn] = (1/n!) f^(n)(c)
for some c intermediate to the points {x0, ..., xn}. Let I denote the interval
I = [min{x0, ..., xn}, max{x0, ..., xn}]
Then c ∈ I, and the above result is based on the assumption that f(x) is n-times continuously differentiable on the interval I.
EXAMPLE
The following table gives divided differences for the data in
x       1        1.1      1.2      1.3      1.4
cos x   .54030   .45360   .36236   .26750   .16997
For the column headings, we use
D^k f(xi) = f[xi, ..., xi+k]
i   xi    f(xi)    Df(xi)   D²f(xi)   D³f(xi)   D⁴f(xi)
0   1.0   .54030   −.8670   −.2270    .1533     .0125
1   1.1   .45360   −.9124   −.1810    .1583
2   1.2   .36236   −.9486   −.1335
3   1.3   .26750   −.9753
4   1.4   .16997
These were computed using the recursive definition
f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1])/(xn − x0)
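The recursive definition translates directly into a short MATLAB sketch that rebuilds the table above column by column (the array layout D(i, k+1) = f[xi, ..., xi+k] is our own choice):

x = 1:0.1:1.4;  y = cos(x);  n = length(x);
D = zeros(n);  D(:,1) = y(:);              % column 1: f(x_i)
for k = 1:n-1
    for i = 1:n-k
        D(i,k+1) = (D(i+1,k) - D(i,k))/(x(i+k) - x(i));
    end
end
disp(D)       % row i holds f(x_i), Df(x_i), D^2 f(x_i), ...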
ORDER OF THE NODES
Looking at f[x0, x1], we have
f[x0, x1] = (f(x1) − f(x0))/(x1 − x0) = (f(x0) − f(x1))/(x0 − x1) = f[x1, x0]
The order of x0 and x1 does not matter. Looking at
f[x0, x1, x2] = (f[x1, x2] − f[x0, x1])/(x2 − x0)
we can expand it to get
f[x0, x1, x2] = f(x0)/[(x0 − x1)(x0 − x2)] + f(x1)/[(x1 − x0)(x1 − x2)] + f(x2)/[(x2 − x0)(x2 − x1)]
With this formula, we can show that the order of the arguments x0, x1, x2 does not matter in the final value of f[x0, x1, x2] we obtain. Mathematically,
f[x0, x1, x2] = f[xi0, xi1, xi2]
for any permutation (i0, i1, i2) of (0, 1, 2).
We can show in general that the value of f[x0, ..., xn] is independent of the order of the arguments {x0, ..., xn}, even though the intermediate steps in its calculation using
f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1])/(xn − x0)
are order dependent.
We can show
f[x0, ..., xn] = f[xi0, ..., xin]
for any permutation (i0, i1, ..., in) of (0, 1, ..., n).
COINCIDENT NODES
What happens when some of the nodes {x0, ..., xn} are not distinct? Begin by investigating what happens when they all come together as a single point x0.
For first order divided differences, we have
lim_{x1→x0} f[x0, x1] = lim_{x1→x0} (f(x1) − f(x0))/(x1 − x0) = f'(x0)
We extend the definition of f[x0, x1] to coincident nodes using
f[x0, x0] = f'(x0)
For second order divided differences, recall
f[x0, x1, x2] = (1/2) f''(c)
with c intermediate to x0, x1, and x2.
Then as x1 → x0 and x2 → x0, we must also have that c → x0. Therefore,
lim_{x1,x2→x0} f[x0, x1, x2] = (1/2) f''(x0)
We therefore define
f[x0, x0, x0] = (1/2) f''(x0)
For the case of general f[x0, ..., xn], recall that
f[x0, ..., xn] = (1/n!) f^(n)(c)
for some c intermediate to {x0, ..., xn}. Then
lim_{x1,...,xn→x0} f[x0, ..., xn] = (1/n!) f^(n)(x0)
and we define
f[x0, ..., x0] (with x0 repeated n + 1 times) = (1/n!) f^(n)(x0)
What do we do when only some of the nodes are coincident? This too can be dealt with, although we do so here only by example:
f[x0, x1, x1] = (f[x1, x1] − f[x0, x1])/(x1 − x0) = (f'(x1) − f[x0, x1])/(x1 − x0)
The recursion formula can be used in general in this way to allow all possible combinations of possibly coincident nodes.
LAGRANGE’S FORMULA FOR
THE INTERPOLATION POLYNOMIAL
Recall the general interpolation problem: find a polynomial Pn(x) for which
deg(Pn) ≤ n
Pn(xi) = yi,  i = 0, 1, ..., n
with given data points
(x0, y0), (x1, y1), ..., (xn, yn)
and with {x0, ..., xn} distinct points.
In §5.1, we gave the solution as Lagrange’s formula
Pn(x) = y0 L0(x) + y1 L1(x) + ··· + yn Ln(x)
with {L0(x), ..., Ln(x)} the Lagrange basis polynomials. Each Lj is of degree n and it satisfies
Lj(xi) = 1 if j = i,  0 if j ≠ i
for i = 0, 1, ..., n.
THE NEWTON DIVIDED DIFFERENCE FORM
OF THE INTERPOLATION POLYNOMIAL
Let the data values for the problem
deg(Pn) ≤ n,  Pn(xi) = yi,  i = 0, 1, ..., n
be generated from a function f(x):
yi = f(xi),  i = 0, 1, ..., n
Using the divided differences
f[x0, x1], f[x0, x1, x2], ..., f[x0, ..., xn]
we can write the interpolation polynomials
P1(x), P2(x), ..., Pn(x)
in a way that is simple to compute:
P1(x) = f(x0) + f[x0, x1](x − x0)
P2(x) = f(x0) + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1)
      = P1(x) + f[x0, x1, x2](x − x0)(x − x1)
For the case of the general problem, we have
Pn(x) = f(x0) + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1) + f[x0, x1, x2, x3](x − x0)(x − x1)(x − x2) + ··· + f[x0, ..., xn](x − x0) ··· (x − xn−1)
From this we have the recursion relation
Pn(x) = Pn−1(x) + f[x0, ..., xn](x − x0) ··· (x − xn−1)
in which Pn−1(x) interpolates f(x) at the points in {x0, ..., xn−1}.
Example: Recall the table
i   xi    f(xi)    Df(xi)   D²f(xi)   D³f(xi)   D⁴f(xi)
0   1.0   .54030   −.8670   −.2270    .1533     .0125
1   1.1   .45360   −.9124   −.1810    .1583
2   1.2   .36236   −.9486   −.1335
3   1.3   .26750   −.9753
4   1.4   .16997
with D^k f(xi) = f[xi, ..., xi+k], k = 1, 2, 3, 4. Then
P1(x) = .5403 − .8670(x − 1)
P2(x) = P1(x) − .2270(x − 1)(x − 1.1)
P3(x) = P2(x) + .1533(x − 1)(x − 1.1)(x − 1.2)
P4(x) = P3(x) + .0125(x − 1)(x − 1.1)(x − 1.2)(x − 1.3)
Using this table and these formulas, we have the following table of interpolants for the value x = 1.05. The true value is cos(1.05) = .49757105.
n          1         2         3          4
Pn(1.05)   .49695    .49752    .49758     .49757
Error      6.20E−4   5.00E−5   −1.00E−5   0.0
EVALUATION OF THE DIVIDED DIFFERENCE
INTERPOLATION POLYNOMIAL
Let
d1 = f[x0, x1]
d2 = f[x0, x1, x2]
...
dn = f[x0, ..., xn]
Then the formula
Pn(x) = f(x0) + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1) + ··· + f[x0, ..., xn](x − x0) ··· (x − xn−1)
can be written as
Pn(x) = f(x0) + (x − x0)(d1 + (x − x1)(d2 + ··· + (x − xn−2)(dn−1 + (x − xn−1) dn) ··· ))
Thus we have a nested polynomial evaluation, and this is quite efficient in computational cost.
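Continuing the divided-difference sketch from earlier (the same arrays x and D, with n points), the nested evaluation is one short loop:

xe = 1.05;                       % evaluation point
p  = D(1,n);                     % highest-order divided difference
for k = n-1:-1:1
    p = D(1,k) + (xe - x(k))*p;  % nested multiplication
end
fprintf('P%d(1.05) = %.8f,  cos(1.05) = %.8f\n', n-1, p, cos(1.05))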
ERROR IN LINEAR INTERPOLATION
Let P1(x) denote the linear polynomial interpolating f(x) at x0 and x1, with f(x) a given function (e.g. f(x) = cos x). What is the error f(x) − P1(x)?
Let f(x) be twice continuously differentiable on an interval [a, b] which contains the points {x0, x1}. Then for a ≤ x ≤ b,
f(x) − P1(x) = ((x − x0)(x − x1)/2) f''(cx)
for some cx between the minimum and maximum of x0, x1, and x.
If x1 and x are ‘close to x0’, then
f(x) − P1(x) ≈ ((x − x0)(x − x1)/2) f''(x0)
Thus the error acts like a quadratic polynomial, with zeros at x0 and x1.
EXAMPLE
Let f(x) = log10 x; and in line with typical tables of log10 x, we take 1 ≤ x, x0, x1 ≤ 10. For definiteness, let x0 < x1 with h = x1 − x0. Then
f''(x) = −(log10 e)/x²
log10 x − P1(x) = ((x − x0)(x − x1)/2) [−(log10 e)/cx²] = (x − x0)(x1 − x) [(log10 e)/(2cx²)]
We usually are interpolating with x0 ≤ x ≤ x1; and in that case, we have
(x − x0)(x1 − x) ≥ 0,  x0 ≤ cx ≤ x1
and therefore
(x − x0)(x1 − x) [(log10 e)/(2x1²)] ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x) [(log10 e)/(2x0²)]
For h = x1 − x0 small, we have for x0 ≤ x ≤ x1
log10 x − P1(x) ≈ (x − x0)(x1 − x) [(log10 e)/(2x0²)]
Typical high school algebra textbooks contain tables of log10 x with a spacing of h = .01. What is the error in this case? To look at this, we use
0 ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x) [(log10 e)/(2x0²)]
By simple geometry or calculus,
max_{x0≤x≤x1} (x − x0)(x1 − x) ≤ h²/4
Therefore,
0 ≤ log10 x − P1(x) ≤ (h²/4) [(log10 e)/(2x0²)] ≐ .0543 h²/x0²
If we want a uniform bound for all points 1 ≤ x0 ≤ 10, we have
0 ≤ log10 x − P1(x) ≤ h² (log10 e)/8 ≐ .0543h²
For h = .01, as is typical of the high school textbook tables of log10 x,
0 ≤ log10 x − P1(x) ≤ 5.43 × 10^−6
If you look at most tables, a typical entry is given to
only four decimal places to the right of the decimal
point, e.g.
log10 5.41 ≐ .7332
Therefore the entries are in error by as much as .00005.
Comparing this with the interpolation error, we see the
latter is less important than the rounding errors in the
table entries.
From the bound
0 ≤ log10 x − P1(x) ≤ h² (log10 e)/(8x0²) ≐ .0543 h²/x0²
we see the error decreases as x0 increases, and it is about 100 times smaller for points near 10 than for points near 1.
AN ERROR FORMULA:
THE GENERAL CASE
Recall the general interpolation problem: find a polynomial Pn(x) for which
deg(Pn) ≤ n
Pn(xi) = f(xi),  i = 0, 1, ..., n
with distinct node points {x0, ..., xn} and a given function f(x). Let [a, b] be a given interval on which f(x) is (n + 1)-times continuously differentiable; and assume the points x0, ..., xn, and x are contained in [a, b]. Then
f(x) − Pn(x) = [(x − x0)(x − x1) ··· (x − xn)/(n + 1)!] f^(n+1)(cx)
with cx some point between the minimum and maximum of the points in {x, x0, ..., xn}.
As shorthand, introduce
Ψn(x) = (x − x0)(x − x1) ··· (x − xn)
a polynomial of degree n + 1 with roots {x0, ..., xn}. Then
f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(cx)
THE QUADRATIC CASE
For n = 2, we have
f(x) − P2(x) = [(x − x0)(x − x1)(x − x2)/3!] f'''(cx)     (∗)
with cx some point between the minimum and maximum of the points in {x, x0, x1, x2}.
To illustrate the use of this formula, consider the case of evenly spaced nodes:
x1 = x0 + h,  x2 = x1 + h
Further suppose we have x0 ≤ x ≤ x2, as we would usually have when interpolating in a table of given function values (e.g. log10 x). The quantity
Ψ2(x) = (x − x0)(x − x1)(x − x2)
can be evaluated directly for a particular x.
[Figure: graph of Ψ2(x) = (x + h) x (x − h), using (x0, x1, x2) = (−h, 0, h)]
In the formula (∗), however, we do not know cx, and therefore we replace |f'''(cx)| with a maximum of |f'''(x)| as x varies over x0 ≤ x ≤ x2. This yields
|f(x) − P2(x)| ≤ (|Ψ2(x)|/3!) max_{x0≤x≤x2} |f'''(x)|     (∗∗)
If we want a uniform bound for x0 ≤ x ≤ x2, we must compute
max_{x0≤x≤x2} |Ψ2(x)| = max_{x0≤x≤x2} |(x − x0)(x − x1)(x − x2)|
Using calculus,
max_{x0≤x≤x2} |Ψ2(x)| = 2h³/(3√3),  at x = x1 ± h/√3
Combined with (∗∗), this yields
|f(x) − P2(x)| ≤ (h³/(9√3)) max_{x0≤x≤x2} |f'''(x)|
for x0 ≤ x ≤ x2.
For f(x) = log10 x, with 1 ≤ x0 ≤ x ≤ x2 ≤ 10, this leads to
|log10 x − P2(x)| ≤ (h³/(9√3)) max_{x0≤x≤x2} (2 log10 e)/x³ = .05572 h³/x0³
For the case of h = .01, we have
|log10 x − P2(x)| ≤ 5.57 × 10^−8/x0³ ≤ 5.57 × 10^−8
Question: How much larger could we make h so that quadratic interpolation would have an error comparable to that of linear interpolation of log10 x with h = .01? The error bound for the linear interpolation was 5.43 × 10^−6, and therefore we want the same to be true of quadratic interpolation. Using a simpler bound, we want to find h so that
|log10 x − P2(x)| ≤ .05572h³ ≤ 5 × 10^−6
This is true if h = .04477. Therefore a spacing of h = .04 would be sufficient. A table with this spacing and quadratic interpolation would have an error comparable to a table with h = .01 and linear interpolation.
For the case of general n,
f(x) − Pn(x) = [(x − x0) ··· (x − xn)/(n + 1)!] f^(n+1)(cx) = [Ψn(x)/(n + 1)!] f^(n+1)(cx)
Ψn(x) = (x − x0)(x − x1) ··· (x − xn)
with cx some point between the minimum and maximum of the points in {x, x0, ..., xn}. When bounding the error we replace f^(n+1)(cx) with its maximum over the interval containing {x, x0, ..., xn}, as we have illustrated earlier in the linear and quadratic cases.
Consider now the function
Ψn(x)/(n + 1)!
over the interval determined by the minimum and maximum of the points in {x, x0, ..., xn}. For evenly spaced node points on [0, 1], with x0 = 0 and xn = 1, we give graphs for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9 on accompanying pages.
DISCUSSION OF ERROR
Consider the error
f(x) − Pn(x) = [(x − x0) ··· (x − xn)/(n + 1)!] f^(n+1)(cx) = [Ψn(x)/(n + 1)!] f^(n+1)(cx)
Ψn(x) = (x − x0)(x − x1) ··· (x − xn)
as n increases and as x varies. As noted previously, we cannot do much with f^(n+1)(cx) except to replace it with a maximum value of |f^(n+1)(x)| over a suitable interval. Thus we concentrate on understanding the size of
Ψn(x)/(n + 1)!
ERROR FOR EVENLY SPACED NODES
We consider first the case in which the node points
are evenly spaced, as this seems the ‘natural’ way to
define the points at which interpolation is carried out.
Moreover, using evenly spaced nodes is the case to
consider for table interpolation. What can we learn
from the given graphs?
The interpolation nodes are determined by using
h = 1/n,  x0 = 0, x1 = h, x2 = 2h, ..., xn = nh = 1
For this case,
Ψn(x) = x(x − h)(x − 2h) ··· (x − 1)
Our graphs are the cases of n = 2, ..., 9.
[Figure: graphs of Ψn(x) on [0, 1] for n = 2, 3, 4, 5]
[Figure: graphs of Ψn(x) on [0, 1] for n = 6, 7, 8, 9]
[Figure: graph of Ψ6(x) = (x − x0)(x − x1) ··· (x − x6) with evenly spaced nodes]
Using the following table,
n   Mn        n    Mn
1   1.25E−1   6    4.76E−7
2   2.41E−2   7    2.20E−8
3   2.06E−3   8    9.11E−10
4   1.48E−4   9    3.39E−11
5   9.01E−6   10   1.15E−12
we can observe that the maximum
Mn ≡ max_{x0≤x≤xn} |Ψn(x)|/(n + 1)!
becomes smaller with increasing n.
From the graphs, there is enormous variation in the
size of Ψn(x) as x varies over [0, 1]; and thus there
is also enormous variation in the error as x so varies.
For example, in the n = 9 case,
max_{x0≤x≤x1} |Ψn(x)|/(n + 1)! = 3.39 × 10^−11
max_{x4≤x≤x5} |Ψn(x)|/(n + 1)! = 6.89 × 10^−13
and the ratio of these two errors is approximately 49. Thus the interpolation error is likely to be around 49 times larger when x0 ≤ x ≤ x1 as compared to the case when x4 ≤ x ≤ x5. When doing table interpolation, the point x at which you are interpolating should be centrally located with respect to the interpolation nodes {x0, ..., xn} being used to define the interpolation, if possible.
AN APPROXIMATION PROBLEM
Consider now the problem of using an interpolation
polynomial to approximate a given function f(x) on
a given interval [a, b]. In particular, take interpolation
nodes
a ≤ x0 < x1 < ··· < xn−1 < xn ≤ b
and produce the interpolation polynomial Pn(x) that interpolates f(x) at the given node points. We would like to have
max_{a≤x≤b} |f(x) − Pn(x)| → 0 as n → ∞
Does it happen?
Recall the error bound
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [|Ψn(x)|/(n + 1)!] · max_{a≤x≤b} |f^(n+1)(x)|
We begin with an example using evenly spaced node points.
RUNGE’S EXAMPLE
Use evenly spaced node points:
h = (b − a)/n,  xi = a + ih for i = 0, ..., n
For some functions, such as f(x) = e^x, the maximum error goes to zero quite rapidly. But the size of the derivative term f^(n+1)(x) in
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [|Ψn(x)|/(n + 1)!] · max_{a≤x≤b} |f^(n+1)(x)|
can badly hurt or destroy the convergence in other cases.
In particular, we show the graph of f(x) = 1/(1 + x²) and Pn(x) on [−5, 5] for the cases n = 8 and n = 12. The case n = 10 is in the text on page 127. It can be proven that for this function, the maximum error on [−5, 5] does not converge to zero. Thus the use of evenly spaced nodes is not necessarily a good approach to approximating a function f(x) by interpolation.
[Figure: Runge’s example with n = 10, showing y = P10(x) and y = 1/(1 + x²)]
OTHER CHOICES OF NODES
Recall the general error bound
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [|Ψn(x)|/(n + 1)!] · max_{a≤x≤b} |f^(n+1)(x)|
There is not much we can do with the derivative term for f; but we can examine the way of defining the nodes {x0, ..., xn} within the interval [a, b]. We ask how these nodes can be chosen so that the maximum of |Ψn(x)| over [a, b] is made as small as possible.
This problem has quite an elegant solution, and it is taken up in §4.6. The node points {x0, ..., xn} turn out to be the zeros of a particular polynomial Tn+1(x) of degree n + 1, called a Chebyshev polynomial. These zeros are known explicitly, and with them
max_{a≤x≤b} |Ψn(x)| = ((b − a)/2)^(n+1) · 2^−n
This turns out to be smaller than for evenly spaced cases; and although this polynomial interpolation does not work for all functions f(x), it works for all differentiable functions and more.
ANOTHER ERROR FORMULA
Recall the error formula
f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(c)
Ψn(x) = (x − x0)(x − x1) ··· (x − xn)
with c between the minimum and maximum of {x0, ..., xn, x}. A second formula is given by
f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
To show this is a simple, but somewhat subtle argument.
Let Pn+1(x) denote the polynomial of degree ≤ n + 1 which interpolates f(x) at the points {x0, ..., xn, xn+1}. Then
Pn+1(x) = Pn(x) + f[x0, ..., xn, xn+1](x − x0) ··· (x − xn)
Substituting x = xn+1, and using the fact that Pn+1(x) interpolates f(x) at xn+1, we have
f(xn+1) = Pn(xn+1) + f[x0, ..., xn, xn+1](xn+1 − x0) ··· (xn+1 − xn)
In this formula, the number xn+1 is completely arbitrary, other than being distinct from the points in {x0, ..., xn}. To emphasize this fact, replace xn+1 by x throughout the formula, obtaining
f(x) = Pn(x) + f[x0, ..., xn, x](x − x0) ··· (x − xn) = Pn(x) + Ψn(x) f[x0, ..., xn, x]
provided x ≠ x0, ..., xn.
The formula
f(x) = Pn(x) + f[x0, ..., xn, x](x − x0) ··· (x − xn) = Pn(x) + Ψn(x) f[x0, ..., xn, x]
is easily seen to hold when x is a node point as well, since both sides are then equal; and provided f(x) is differentiable, the divided difference f[x0, ..., xn, x] remains defined there by the coincident-node definition given earlier.
This shows
f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
Compare the two error formulas
f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(c)
Then
Ψn(x) f[x0, ..., xn, x] = [Ψn(x)/(n + 1)!] f^(n+1)(c)
f[x0, ..., xn, x] = f^(n+1)(c)/(n + 1)!
for some c between the smallest and largest of the numbers in {x0, ..., xn, x}.
To make this somewhat symmetric in its arguments, let m = n + 1, x = xm. Then
f[x0, ..., xm−1, xm] = f^(m)(c)/m!
with c an unknown number between the smallest and largest of the numbers in {x0, ..., xm}. This was given in an earlier lecture where divided differences were introduced.
PIECEWISE POLYNOMIAL INTERPOLATION
Recall the examples of higher degree polynomial interpolation of the function f(x) = (1 + x²)^−1 on [−5, 5]. The interpolants Pn(x) oscillated a great deal, whereas the function f(x) was nonoscillatory. To obtain interpolants that are better behaved, we look at other forms of interpolating functions.
Consider the data
x   0     1     2     2.5   3     3.5    4
y   2.5   0.5   0.5   1.5   1.5   1.125  0
What are methods of interpolating this data, other than using a degree 6 polynomial? Shown in the text are the graphs of the degree 6 polynomial interpolant, along with those of piecewise linear and piecewise quadratic interpolating functions.
Since we only have the data to consider, we would generally want to use an interpolant that had somewhat the shape of that of the piecewise linear interpolant.
[Figure: the data points]
[Figure: piecewise linear interpolation]
[Figure: polynomial interpolation]
[Figure: piecewise quadratic interpolation]
PIECEWISE POLYNOMIAL FUNCTIONS
Consider being given a set of data points (x1, y1), ..., (xn, yn), with
x1 < x2 < ··· < xn
Then the simplest way to connect the points (xj, yj) is by straight line segments. This is called a piecewise linear interpolant of the data {(xj, yj)}. This graph has “corners”, and often we expect the interpolant to have a smooth graph.
To obtain a somewhat smoother graph, consider using piecewise quadratic interpolation. Begin by constructing the quadratic polynomial that interpolates
{(x1, y1), (x2, y2), (x3, y3)}
Then construct the quadratic polynomial that interpolates
{(x3, y3), (x4, y4), (x5, y5)}
Continue this process of constructing quadratic interpolants on the subintervals
[x1, x3], [x3, x5], [x5, x7], ...
If the number of subintervals is even (and therefore n is odd), then this process comes out fine, with the last interval being [xn−2, xn]. This was illustrated on the graph for the preceding data. If, however, n is even, then the approximation on the last interval must be handled by some modification of this procedure. Suggest such!
With piecewise quadratic interpolants, however, there are “corners” on the graph of the interpolating function. With our preceding example, they are at x3 and x5. How do we avoid this?
Piecewise polynomial interpolants are used in many applications. We will consider them later, to obtain numerical integration formulas.
SMOOTH NON-OSCILLATORY
INTERPOLATION
Let data points (x1, y1), ..., (xn, yn) be given, and let
x1 < x2 < ··· < xn
Consider finding functions s(x) for which the following properties hold:
(1) s(xi) = yi,  i = 1, ..., n
(2) s(x), s'(x), s''(x) are continuous on [x1, xn].
Then among such functions s(x) satisfying these properties, find the one which minimizes the integral
∫_{x1}^{xn} |s''(x)|² dx
The idea of minimizing the integral is to obtain an interpolating function for which the first derivative does not change rapidly. It turns out there is a unique solution to this problem, and it is called a natural cubic spline function.
SPLINE FUNCTIONS
Let a set of node points {xi} be given, satisfying
a ≤ x1 < x2 < ··· < xn ≤ b
for some numbers a and b. Often we use [a, b] = [x1, xn]. A cubic spline function s(x) on [a, b] with “breakpoints” or “knots” {xi} has the following properties:
1. On each of the intervals
[a, x1], [x1, x2], ..., [xn−1, xn], [xn, b]
s(x) is a polynomial of degree ≤ 3.
2. s(x), s'(x), s''(x) are continuous on [a, b].
In the case that we have given data points (x1, y1), ..., (xn, yn), we say s(x) is a cubic interpolating spline function for this data if
3. s(xi) = yi,  i = 1, ..., n.
EXAMPLE
Define
(x − α)³₊ = (x − α)³ for x ≥ α,  0 for x ≤ α
This is a cubic spline function on (−∞, ∞) with the single breakpoint x1 = α.
Combinations of these form more complicated cubic spline functions. For example,
s(x) = 3(x − 1)³₊ − 2(x − 3)³₊
is a cubic spline function on (−∞, ∞) with the breakpoints x1 = 1, x2 = 3.
Define
s(x) = p3(x) + Σ_{j=1}^n aj (x − xj)³₊
with p3(x) some cubic polynomial. Then s(x) is a cubic spline function on (−∞, ∞) with breakpoints {x1, ..., xn}.
Return to the earlier problem of choosing an interpolating function s(x) to minimize the integral
∫_{x1}^{xn} |s''(x)|² dx
There is a unique solution to this problem. The solution s(x) is a cubic interpolating spline function, and moreover, it satisfies
s''(x1) = s''(xn) = 0
Spline functions satisfying these boundary conditions are called “natural” cubic spline functions, and the solution to our minimization problem is a “natural cubic interpolatory spline function”. We will show a method to construct this function from the interpolation data.
Motivation for these boundary conditions can be given
by looking at the physics of bending thin beams of
flexible materials to pass thru the given data. To the
left of x1 and to the right of xn, the beam is straight
and therefore the second derivatives are zero at the
transition points x1 and xn.
CONSTRUCTION OF THE
INTERPOLATING SPLINE FUNCTION
To make the presentation more specific, suppose we have data
(x1, y1), (x2, y2), (x3, y3), (x4, y4)
with x1 < x2 < x3 < x4. Then on each of the intervals
[x1, x2], [x2, x3], [x3, x4]
s(x) is a cubic polynomial. Taking the first interval, s(x) is a cubic polynomial and s''(x) is a linear polynomial. Let
Mi = s''(xi),  i = 1, 2, 3, 4
Then on [x1, x2],
s''(x) = [(x2 − x)M1 + (x − x1)M2]/(x2 − x1),  x1 ≤ x ≤ x2
We can find s(x) by integrating twice:
s(x) = [(x2 − x)³M1 + (x − x1)³M2]/(6(x2 − x1)) + c1 x + c2
We determine the constants of integration by using
s(x1) = y1,  s(x2) = y2     (∗)
Then
s(x) = [(x2 − x)³M1 + (x − x1)³M2]/(6(x2 − x1)) + [(x2 − x)y1 + (x − x1)y2]/(x2 − x1) − ((x2 − x1)/6)[(x2 − x)M1 + (x − x1)M2]
for x1 ≤ x ≤ x2.
Check that this formula satisfies the given interpolation condition (∗)!
We can repeat this on the intervals [x2, x3] and [x3, x4], obtaining similar formulas.
For x2 ≤ x ≤ x3,
s(x) = [(x3 − x)³M2 + (x − x2)³M3]/(6(x3 − x2)) + [(x3 − x)y2 + (x − x2)y3]/(x3 − x2) − ((x3 − x2)/6)[(x3 − x)M2 + (x − x2)M3]
For x3 ≤ x ≤ x4,
s(x) = [(x4 − x)³M3 + (x − x3)³M4]/(6(x4 − x3)) + [(x4 − x)y3 + (x − x3)y4]/(x4 − x3) − ((x4 − x3)/6)[(x4 − x)M3 + (x − x3)M4]
We still do not know the values of the second derivatives {M1, M2, M3, M4}. The above formulas guarantee that s(x) and s''(x) are continuous for x1 ≤ x ≤ x4. For example, the formula on [x1, x2] yields
s(x2) = y2,  s''(x2) = M2
The formula on [x2, x3] also yields
s(x2) = y2,  s''(x2) = M2
All that is lacking is to make s'(x) continuous at x2 and x3. Thus we require
s'(x2 + 0) = s'(x2 − 0)
s'(x3 + 0) = s'(x3 − 0)     (∗∗)
This means
lim_{x↓x2} s'(x) = lim_{x↑x2} s'(x)
and similarly for x3.
To simplify the presentation somewhat, I assume in the following that our node points are evenly spaced:
x2 = x1 + h,  x3 = x1 + 2h,  x4 = x1 + 3h
Then our earlier formulas simplify to
s(x) = [(x2 − x)³M1 + (x − x1)³M2]/(6h) + [(x2 − x)y1 + (x − x1)y2]/h − (h/6)[(x2 − x)M1 + (x − x1)M2]
for x1 ≤ x ≤ x2, with similar formulas on [x2, x3] and [x3, x4].
Without going thru all of the algebra, the conditions (∗∗) lead to the following pair of equations:
(h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h
This gives us two equations in four unknowns. The earlier boundary conditions on s''(x) give us immediately
M1 = M4 = 0
Then we can solve the linear system for M2 and M3.
EXAMPLE
Consider the interpolation data points
x   1   2     3     4
y   1   1/2   1/3   1/4
In this case, h = 1, and the linear system becomes
(2/3)M2 + (1/6)M3 = y3 − 2y2 + y1 = 1/3
(1/6)M2 + (2/3)M3 = y4 − 2y3 + y2 = 1/12
This has the solution
M2 = 1/2,  M3 = 0
This leads to the spline function formula on each subinterval.
On [1, 2],
s(x) = [(2 − x)³ · 0 + (x − 1)³(1/2)]/6 + [(2 − x) · 1 + (x − 1)(1/2)]/1 − (1/6)[(2 − x) · 0 + (x − 1)(1/2)]
     = (1/12)(x − 1)³ − (7/12)(x − 1) + 1
Similarly, for 2 ≤ x ≤ 3,
s(x) = −(1/12)(x − 2)³ + (1/4)(x − 2)² − (1/3)(x − 2) + 1/2
and for 3 ≤ x ≤ 4,
s(x) = −(1/12)(x − 4) + 1/4
[Figure: y = 1/x and the natural cubic spline interpolant y = s(x) on [1, 4]]
[Figure: interpolating natural cubic spline function for the data x = 0, 1, 2, 2.5, 3, 3.5, 4; y = 2.5, 0.5, 0.5, 1.5, 1.5, 1.125, 0]
ALTERNATIVE BOUNDARY CONDITIONS
Return to the equations
(h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h
Sometimes other boundary conditions are imposed on s(x) to help in determining the values of M1 and M4. For example, the data in our numerical example were generated from the function f(x) = 1/x. With it, f''(x) = 2/x³, and thus we could use
M1 = 2,  M4 = 1/32
With this we are led to a new formula for s(x), one that approximates f(x) = 1/x more closely.
THE CLAMPED SPLINE
In this case, we augment the interpolation conditions
s(xi) = yi,  i = 1, 2, 3, 4
with the boundary conditions
s'(x1) = y'1,  s'(x4) = y'4     (#)
The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are
(h/3)M1 + (h/6)M2 = (y2 − y1)/h − y'1
(h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h
(h/6)M3 + (h/3)M4 = y'4 − (y4 − y3)/h
For our numerical example, it is natural to obtain these derivative values from f'(x) = −1/x²:
y'1 = −1,  y'4 = −1/16
When combined with the earlier equations, we have the system
(1/3)M1 + (1/6)M2 = 1/2
(1/6)M1 + (2/3)M2 + (1/6)M3 = 1/3
(1/6)M2 + (2/3)M3 + (1/6)M4 = 1/12
(1/6)M3 + (1/3)M4 = 1/48
This has the solution
[M1, M2, M3, M4] = [173/120, 7/60, 11/120, 1/60]
We can now write the functions s(x) for each of the subintervals [x1, x2], [x2, x3], and [x3, x4]. Recall for x1 ≤ x ≤ x2,
s(x) = [(x2 − x)³M1 + (x − x1)³M2]/(6h) + [(x2 − x)y1 + (x − x1)y2]/h − (h/6)[(x2 − x)M1 + (x − x1)M2]
We can substitute in from the data
x   1   2     3     4
y   1   1/2   1/3   1/4
and the solutions {Mi}. Doing so, consider the error f(x) − s(x). As an example,
f(x) = 1/x,  f(3/2) = 2/3,  s(3/2) = .65260
This is quite a decent approximation.
THE GENERAL PROBLEM
Consider the spline interpolation problem with n nodes
(x1, y1), (x2, y2), ..., (xn, yn)
and assume the node points {xi} are evenly spaced,
xj = x1 + (j − 1)h,  j = 1, ..., n
We have that the interpolating spline s(x) on xj ≤ x ≤ xj+1 is given by
s(x) = [(xj+1 − x)³Mj + (x − xj)³Mj+1]/(6h) + [(xj+1 − x)yj + (x − xj)yj+1]/h − (h/6)[(xj+1 − x)Mj + (x − xj)Mj+1]
for j = 1, ..., n − 1.
To enforce continuity of s'(x) at the interior node points x2, ..., xn−1, the second derivatives {Mj} must satisfy the linear equations
(h/6)Mj−1 + (2h/3)Mj + (h/6)Mj+1 = (yj−1 − 2yj + yj+1)/h
for j = 2, ..., n − 1. Writing them out,
(h/6)M1 + (2h/3)M2 + (h/6)M3 = (y1 − 2y2 + y3)/h
(h/6)M2 + (2h/3)M3 + (h/6)M4 = (y2 − 2y3 + y4)/h
...
(h/6)Mn−2 + (2h/3)Mn−1 + (h/6)Mn = (yn−2 − 2yn−1 + yn)/h
This is a system of n − 2 equations in the n unknowns {M1, ..., Mn}. Two more conditions must be imposed on s(x) in order to have the number of equations equal the number of unknowns, namely n. With the added boundary conditions, this form of linear system can be solved very efficiently.
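For the natural boundary conditions discussed next, M1 = Mn = 0 and the system is tridiagonal. A minimal MATLAB sketch (our own assembly, checked against the earlier 4-point example):

x = [1 2 3 4];  y = [1 1/2 1/3 1/4];      % data of the earlier example
n = length(x);  h = x(2) - x(1);          % evenly spaced nodes
A = diag((2*h/3)*ones(n-2,1)) + diag((h/6)*ones(n-3,1), 1) ...
                              + diag((h/6)*ones(n-3,1), -1);
r = (y(1:n-2) - 2*y(2:n-1) + y(3:n)).'/h;
M = [0; A\r; 0];                          % natural conditions M_1 = M_n = 0
disp(M.')                                 % gives [0 0.5 0 0]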
BOUNDARY CONDITIONS
“Natural” boundary conditions.
s''(x1) = s''(xn) = 0
Spline functions satisfying these conditions are called “natural cubic splines”. They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.
“Clamped” boundary conditions. We add the conditions
s'(x1) = y'1,  s'(xn) = y'n
with y'1, y'n given slopes for the endpoints of s(x) on [x1, xn]. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.
“Not a knot” boundary conditions. This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in MATLAB.
THE “NOT A KNOT” CONDITIONS
As before, let the interpolation nodes be
(x1, y1), (x2, y2), ..., (xn, yn)
We separate these points into two categories. For constructing the interpolating cubic spline function, we use the points
(x1, y1), (x3, y3), ..., (xn−2, yn−2), (xn, yn)
thus deleting two of the points. We now have n − 2 points, and the interpolating spline s(x) can be determined on the intervals
[x1, x3], [x3, x4], ..., [xn−3, xn−2], [xn−2, xn]
This leads to n − 4 equations in the n − 2 unknowns M1, M3, ..., Mn−2, Mn. The two additional boundary conditions are
s(x2) = y2,  s(xn−1) = yn−1
These translate into two additional equations, and we obtain a system of n − 2 linear simultaneous equations in the n − 2 unknowns M1, M3, ..., Mn−2, Mn.
[Figure: interpolating cubic spline function with “not-a-knot” boundary conditions for the data x = 0, 1, 2, 2.5, 3, 3.5, 4; y = 2.5, 0.5, 0.5, 1.5, 1.5, 1.125, 0]
MATLAB SPLINE FUNCTION LIBRARY
Given data points
(x1, y1), (x2, y2), ..., (xn, yn)
type in arrays containing the x and y coordinates:
x = [x1 x2 ... xn]
y = [y1 y2 ... yn]
plot(x, y, 'o')
The last statement will draw a plot of the data points, marking them with the letter ‘oh’. To find the interpolating cubic spline function and evaluate it at the points of another array xx, say
h = (xn − x1)/(10*n); xx = x1:h:xn;
use
yy = spline(x, y, xx)
plot(x, y, 'o', xx, yy)
The last statement will plot the data points, as before, and it will plot the interpolating spline s(x) as a continuous curve.
ERROR IN CUBIC SPLINE INTERPOLATION
Let an interval [a, b] be given, and then define
h = (b − a)/(n − 1),  xj = a + (j − 1)h,  j = 1, ..., n
Suppose we want to approximate a given function f(x) on the interval [a, b] using cubic spline interpolation. Define
yj = f(xj),  j = 1, ..., n
Let sn(x) denote the cubic spline interpolating this data and satisfying the “not a knot” boundary conditions. Then it can be shown that for a suitable constant c,
En ≡ max_{a≤x≤b} |f(x) − sn(x)| ≤ ch⁴
The corresponding bound for natural cubic spline interpolation contains only a term of h² rather than h⁴; it does not converge to zero as rapidly.
EXAMPLE
Take f(x) = arctan x on [0, 5]. The following table gives values of the maximum error En for various values of n. The values of h are being successively halved.
n    En        Ratio
7    7.09E−3
13   3.24E−4   21.9
25   3.06E−5   10.6
49   1.48E−6   20.7
97   9.04E−8   16.4
BEST APPROXIMATION
Given a function f(x) that is continuous on a given interval [a, b], consider approximating it by some polynomial p(x). To measure the error in p(x) as an approximation, introduce
E(p) = \max_{a \le x \le b} |f(x) - p(x)|
This is called the maximum error or uniform error of approximation of f(x) by p(x) on [a, b].
With an eye towards efficiency, we want to find the 'best' possible approximation of a given degree n. With this in mind, introduce the following:
\rho_n(f) = \min_{\deg(p) \le n} E(p) = \min_{\deg(p) \le n} \left[ \max_{a \le x \le b} |f(x) - p(x)| \right]

The number ρ_n(f) will be the smallest possible uniform error, or minimax error, when approximating f(x) by polynomials of degree at most n. If there is a polynomial giving this smallest error, we denote it by m_n(x); thus E(m_n) = ρ_n(f).
Example. Let f(x) = e^x on [−1, 1]. In the following table, we give the values of E(t_n), where t_n(x) is the Taylor polynomial of degree n for e^x about x = 0, and E(m_n).
Maximum error in:
n    t_n(x)      m_n(x)
1    7.18E−1     2.79E−1
2    2.18E−1     4.50E−2
3    5.16E−2     5.53E−3
4    9.95E−3     5.47E−4
5    1.62E−3     4.52E−5
6    2.26E−4     3.21E−6
7    2.79E−5     2.00E−7
8    3.06E−6     1.11E−8
9    3.01E−7     5.52E−10
Consider graphically how we can improve on the Tay-
lor polynomial
t1(x) = 1 + x
as a uniform approximation to e^x on the interval [−1, 1].
The linear minimax approximation is
m1(x) = 1.2643 + 1.1752x
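These two uniform errors are easy to check numerically (a sketch; the 1001-point grid is an arbitrary choice of mine):

x = linspace(-1, 1, 1001);
Et1 = max(abs(exp(x) - (1 + x)));              % error of the Taylor line
Em1 = max(abs(exp(x) - (1.2643 + 1.1752*x)));  % error of the minimax line
fprintf('E(t1) = %.3f   E(m1) = %.3f\n', Et1, Em1)

The printed values, about 0.718 and 0.279, match the n = 1 row of the table above.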
[Figure: linear Taylor and minimax approximations to e^x on [−1, 1]]

[Figure: error in the cubic Taylor approximation to e^x; maximum error about 0.0516]

[Figure: error in the cubic minimax approximation to e^x; the error oscillates between ±0.00553]
Accuracy of the minimax approximation.
\rho_n(f) \le \frac{[(b-a)/2]^{n+1}}{(n+1)!\,2^n} \max_{a \le x \le b} \left| f^{(n+1)}(x) \right|

This error bound does not always become smaller with increasing n, but it will give a fairly accurate bound for many common functions f(x).
Example. Let f(x) = e^x for −1 ≤ x ≤ 1. Then

\rho_n(e^x) \le \frac{e}{(n+1)!\,2^n}    (*)

n    Bound (*)    ρ_n(f)
1    6.80E−1      2.79E−1
2    1.13E−1      4.50E−2
3    1.42E−2      5.53E−3
4    1.42E−3      5.47E−4
5    1.18E−4      4.52E−5
6    8.43E−6      3.21E−6
7    5.27E−7      2.00E−7
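The bound (*) itself is a one-line computation; a sketch:

for n = 1:7
    fprintf('n = %d   bound = %.2e\n', n, exp(1)/(factorial(n+1)*2^n));
end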
CHEBYSHEV POLYNOMIALS
Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer n ≥ 0, define the function

T_n(x) = \cos\left(n \cos^{-1} x\right), \quad -1 \le x \le 1    (1)
This may not appear to be a polynomial, but we will show it is a polynomial of degree n. To simplify the manipulation of (1), we introduce
θ = \cos^{-1}(x)  or  x = \cos(θ), \quad 0 \le θ \le π    (2)
Then
T_n(x) = \cos(nθ)    (3)
Example. For n = 0,

T_0(x) = \cos(0 \cdot θ) = 1

For n = 1,

T_1(x) = \cos(θ) = x

For n = 2,

T_2(x) = \cos(2θ) = 2\cos^2(θ) - 1 = 2x^2 - 1
[Figure: Chebyshev polynomials T_0(x), T_1(x), T_2(x) on [−1, 1]]

[Figure: Chebyshev polynomials T_3(x), T_4(x) on [−1, 1]]
The triple recursion relation. Recall the trigonomet-
ric addition formulas,
\cos(α \pm β) = \cos(α)\cos(β) \mp \sin(α)\sin(β)

Let n ≥ 1, and apply these identities to get

T_{n+1}(x) = \cos[(n+1)θ] = \cos(nθ + θ) = \cos(nθ)\cos(θ) - \sin(nθ)\sin(θ)
T_{n-1}(x) = \cos[(n-1)θ] = \cos(nθ - θ) = \cos(nθ)\cos(θ) + \sin(nθ)\sin(θ)

Add these two equations, and then use (1) and (3) to obtain

T_{n+1}(x) + T_{n-1}(x) = 2\cos(nθ)\cos(θ) = 2x\,T_n(x)
T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \quad n \ge 1    (4)
This is called the triple recursion relation for the Cheby-
shev polynomials. It is often used in evaluating them,
rather than using the explicit formula (1).
Example. Recall
T_0(x) = 1, \qquad T_1(x) = x
T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \quad n \ge 1
Let n = 2. Then

T_3(x) = 2x\,T_2(x) - T_1(x) = 2x(2x^2 - 1) - x = 4x^3 - 3x

Let n = 3. Then

T_4(x) = 2x\,T_3(x) - T_2(x) = 2x(4x^3 - 3x) - (2x^2 - 1) = 8x^4 - 8x^2 + 1
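The recursion translates directly into code. Here is a sketch (the function name cheb_eval is mine); it also accepts a vector of x values:

function T = cheb_eval(n, x)
% Evaluate the Chebyshev polynomial T_n at x using the triple
% recursion (4) instead of the cosine formula (1).
Tprev = ones(size(x));              % T_0(x) = 1
if n == 0, T = Tprev; return; end
Tcur = x;                           % T_1(x) = x
for k = 2:n
    Tnext = 2*x.*Tcur - Tprev;      % T_k = 2x T_{k-1} - T_{k-2}
    Tprev = Tcur;
    Tcur  = Tnext;
end
T = Tcur;
end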
The minimum size property. Note that
|T_n(x)| \le 1, \quad -1 \le x \le 1    (5)

for all n ≥ 0. Also, note that

T_n(x) = 2^{n-1} x^n + \text{lower degree terms}, \quad n \ge 1    (6)
This can be proven using the triple recursion relation
and mathematical induction.
Introduce a modified version of T_n(x),

\widetilde{T}_n(x) = \frac{1}{2^{n-1}} T_n(x) = x^n + \text{lower degree terms}    (7)

From (5) and (6),

\left| \widetilde{T}_n(x) \right| \le \frac{1}{2^{n-1}}, \quad -1 \le x \le 1, \quad n \ge 1    (8)
Example.

\widetilde{T}_4(x) = \frac{1}{8}\left(8x^4 - 8x^2 + 1\right) = x^4 - x^2 + \frac{1}{8}
A polynomial whose highest degree term has a coeffi-
cient of 1 is called a monic polynomial. Formula (8)
says the monic polynomial \widetilde{T}_n(x) has size 1/2^{n-1} on −1 ≤ x ≤ 1, and this becomes smaller as the degree n increases. In comparison,

\max_{-1 \le x \le 1} |x^n| = 1
Thus xn is a monic polynomial whose size does not
change with increasing n.
Theorem. Let n ≥ 1 be an integer, and consider all
possible monic polynomials of degree n. Then the
degree n monic polynomial with the smallest maximum on [−1, 1] is the modified Chebyshev polynomial \widetilde{T}_n(x), and its maximum value on [−1, 1] is 1/2^{n-1}.
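A quick numerical illustration of the theorem for n = 4 (a sketch; the grid size is my choice):

x = linspace(-1, 1, 10001);
fprintf('max |T~4(x)| = %.4f  (1/2^3 = %.4f)\n', ...
        max(abs(x.^4 - x.^2 + 1/8)), 1/8);       % the monic T~4 from (7)
fprintf('max |x^4|    = %.4f\n', max(abs(x.^4))); % the monic x^4, for comparison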
This result is used in devising applications of Cheby-
shev polynomials. We apply it to obtain an improved
interpolation scheme.