Notes for Numerical Analysis

Math 5465

by

S. Adjerid

Department of Mathematics

Virginia Tech

(A Rough Draft)


Contents

1 Error Analysis

1.1 Introduction

1.2 Representation of numbers and round-off errors

1.2.1 Representation of integers

1.2.2 Floating-point representation

1.2.3 Chopping versus rounding

1.2.4 Properties of Finite Precision Arithmetic

1.3 Error propagation, stability and conditioning

1.3.1 Error propagation

1.3.2 Conditioning of a problem

1.3.3 Numerical stability


Chapter 1

Error Analysis

1.1 Introduction

In science and engineering there are several sources of errors that affect scientific results. The most common ones include:

• Experimental errors: due to errors in measurements or to effects of external factors during experiments.

• Modeling errors: occur when a physical phenomenon is approximated by a mathematical model. In general, models are valid under certain assumptions; thus models are acceptable approximations of the real situation only for some ranges of parameters.

Example: A swinging pendulum in a vacuum can be modeled by the nonlinear second-order differential equation

\frac{d^2\theta}{dt^2} + \frac{mg}{L} \sin(\theta) = 0, \quad t > 0, \qquad (1.1)

subject to the initial conditions

\theta(0) = \theta_0, \quad \frac{d\theta}{dt}(0) = \theta_1, \qquad (1.2)

where L, m, g, and \theta, respectively, denote the pendulum length, the mass, the gravitational acceleration, and the angle.

A common approximation of equation (1.1) is the linear differential equation

\frac{d^2\theta}{dt^2} + \frac{mg}{L}\, \theta = 0, \quad t > 0, \qquad (1.3)


which is valid only for small angles \theta. Here, the modeling error is \theta - \bar{\theta}, where \bar{\theta} denotes the solution of the linear model (1.3).

• Discretization errors: introduced when a continuous problem is approximated by a discrete model. For instance, the rate of change of a function f at a given point x, which satisfies

f'(x) = \frac{f(x+h) - f(x)}{h} + \frac{h}{2} f''(\xi), \quad x < \xi < x + h, \; h > 0, \qquad (1.4)

can be approximated by the difference quotient

f'(x) \approx \frac{f(x+h) - f(x)}{h}. \qquad (1.5)

The discretization error is

\tau = \frac{h}{2} f''(\xi). \qquad (1.6)

• Round-off errors: generated by floating-point operations on computers. Each machine has its own finite set of numbers, and every real number outside this set is approximated by a machine number through rounding.

Scientific computations may involve one or more of these errors simultaneously. For instance, let us compute the ratio

\frac{f(x+h) - f(x)}{h}, \quad \text{for } h = 1/2^k, \; k = 0, 1, 2, \cdots,

for the function f(x) = \sin(x) at x = 0.3 using MATLAB with double precision and present the results in Table 1.1.

In order to study these errors, we denote the computed values of f by \bar{f}(x+h) = f(x+h) + \delta_1 and \bar{f}(x) = f(x) + \delta_2. Thus, the MATLAB results of Table 1.1 are given by the formula

\frac{\bar{f}(x+h) - \bar{f}(x)}{h}, \quad \text{for } h = 1/2^k, \; k = 0, 1, 2, \cdots.

On the other hand, the total approximation error, which consists of both discretization and rounding errors, can be written as

f'(x) - \frac{\bar{f}(x+h) - \bar{f}(x)}{h} = f'(x) - \frac{f(x+h) - f(x)}{h} - \frac{\delta_1 - \delta_2}{h} \qquad (1.7)

= \frac{h}{2} f''(\xi) - \frac{\delta_1 - \delta_2}{h}. \qquad (1.8)

This formula indicates that the total error consists of a discretization error, which is large for h \gg \delta, and a round-off error, which becomes large for h \le \delta, where \delta bounds |\delta_1| and |\delta_2|.
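The experiment in Table 1.1 can be reproduced directly. The notes use MATLAB; the following is an equivalent Python sketch (an illustration added here, not part of the original notes) that computes the same difference quotients for f(x) = sin(x) at x = 0.3 and shows the error first shrinking with h and then growing again once round-off dominates:

```python
import math

x = 0.3
exact = math.cos(x)  # sin'(0.3)

errors = []
for k in range(1, 40):
    h = 2.0 ** (-k)
    quotient = (math.sin(x + h) - math.sin(x)) / h
    errors.append(abs(exact - quotient))

# The error behaves like h/2 * |f''| for large h, then like eps/h for tiny h.
best = min(errors)
print(f"smallest error {best:.2e} at k = {errors.index(best) + 1}")
```

The smallest error occurs at an intermediate h, exactly as formula (1.8) predicts.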

In the next section we will study floating point arithmetic.


h        (f(0.3+h) − f(0.3))/h     Error

2^-1     0.84367176847637          0.11166472064924
2^-2     0.90866808907728          0.04666840004833
2^-3     0.93440460065668          0.02093188846892
2^-4     0.94548264606895          0.00985384305666
2^-5     0.95056387828479          0.00477261084081
2^-6     0.95298891221562          0.00234757690999
2^-7     0.95417240103413          0.00116408809148
2^-8     0.95475687241144          0.00057961671417
2^-9     0.95504728787907          0.00028920124654
2^-10    0.95519204031456          0.00014444881105
2^-11    0.95526430267751          0.00007218644810
2^-12    0.95530040539143          0.00003608373418
2^-13    0.95531844963125          0.00001803949436
2^-14    0.95532746997196          0.00000901915364
2^-15    0.95533197969780          0.00000450942780
2^-16    0.95533423444795          0.00000225467766
2^-17    0.95533536179573          0.00000112732987
2^-18    0.95533592546417          0.00000056366144
2^-19    0.95533620729111          0.00000028183449
2^-20    0.95533634821186          0.00000014091375
2^-21    0.95533641870134          0.00000007042427
2^-22    0.95533645385876          0.00000003526684
2^-23    0.95533647155389          0.00000001757171
2^-24    0.95533648040146          0.00000000872415
2^-25    0.95533648505807          0.00000000406754
2^-26    0.95533648878336          0.00000000034225
2^-27    0.95533648878336          0.00000000034225
2^-28    0.95533649623394          0.00000000710833
2^-29    0.95533651113510          0.00000002200950
2^-30    0.95533651113510          0.00000002200950
2^-31    0.95533657073975          0.00000008161414
2^-32    0.95533657073975          0.00000008161414
2^-33    0.95533657073975          0.00000008161414
2^-34    0.95533657073975          0.00000008161414
2^-35    0.95533752441406          0.00000103528846
2^-36    0.95533752441406          0.00000103528846
2^-37    0.95533752441406          0.00000103528846
2^-38    0.95533752441406          0.00000103528846
2^-39    0.95535278320312          0.00001629407752

Table 1.1: Total approximation errors for the derivative sin'(0.3)


1.2 Representation of numbers and round-off errors

In this section we discuss how numbers are represented on computers and how round-off errors are introduced in the calculations.

1.2.1 Representation of integers

Computers represent integer numbers using a finite number of digits 1 and 0 in the binary system (base 2). We start with a number in the decimal system (base 10):

2583 = 2× 103 + 5× 102 + 8× 101 + 3× 100

In other words, 2 is the number of thousands, 5 the number of hundreds, 8 the number of tens, and 3 the number of ones.

In the binary system (base 2) numbers are represented by a string of 1s and 0s, as illustrated in the following example:

(110110)2 = 1× 25 + 1× 24 + 0× 23 + 1× 22 + 1× 21 + 0× 20

An algorithm to convert an integer from the decimal system to the binary system proceeds by repeated division by 2: the remainders, read in reverse order, give the binary digits.

For A = 57, the algorithm takes the following steps

57 = 2 * 28 + 1

28 = 2 * 14 + 0

14 = 2 * 7 + 0

7 = 2 * 3 + 1

3 = 2 * 1 + 1

1 = 2 * 0 + 1

Thus we can write

(57)_{10} = (111001)_2.
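The repeated-division steps above translate into a few lines of code. Here is a small Python sketch (Python rather than the MATLAB used elsewhere in these notes, purely for illustration):

```python
def int_to_binary(d):
    """Binary digit string of a nonnegative integer, by repeated division by 2."""
    if d == 0:
        return "0"
    digits = []
    while d > 0:
        d, r = divmod(d, 2)       # d = 2*q + r; the remainder r is the next digit
        digits.append(str(r))
    return "".join(reversed(digits))  # remainders come out least significant first

print(int_to_binary(57))  # (57)_10 = (111001)_2
```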

Since computers use 32-bit words to represent integers, the number

A = (-1)^s (d_{30} 2^{30} + d_{29} 2^{29} + \cdots + d_2 2^2 + d_1 2^1 + d_0), \quad s = 0, 1,

is represented by the following string:

s \; d_{30} \; d_{29} \; \cdots \; d_2 \; d_1 \; d_0


function B = integerconvert10to2(D)
% convert an integer D from decimal to a 32-bit binary representation
% input:  D integer with |D| < 2^31
% output: B string of 0s and 1s (sign bit followed by 31 digits)
s = sign(D);
if s >= 0
    B(1) = 0;
else
    B(1) = 1;
end
%
D = abs(D);
n = 1;
ndigits = 32;
while (n < ndigits)
    B(ndigits+1-n) = mod(D,2);   % remainder gives the next binary digit
    D = fix(D/2);
    n = n+1;
end

The first bit contains the sign information, s = 0 for + and s = 1 for −; the next 31 bits contain the digits of the number.

Given the finite number of digits used to represent integers, we note that a number greater than 2^{30} + 2^{29} + \cdots + 2^2 + 2 + 1 = 2^{31} - 1 will cause overflow.

1.2.2 Floating-point representation

Numbers with a decimal part are written in a normalized floating-point form where only significant digits are stored. For instance, in the decimal system:

number        normalized form
00457.98      0.45798 × 10^3
0.00000389    0.389 × 10^{-5}

The general form of a normalized floating-point number in the decimal system is

\pm(0.d_1 d_2 d_3 \cdots d_n \cdots) \times 10^m = \pm(d_1 \times 10^{-1} + d_2 \times 10^{-2} + d_3 \times 10^{-3} + \cdots + d_n \times 10^{-n} + \cdots) \times 10^m,

where d_1 \ne 0. The digits d_1, d_2, \cdots are called significant digits.


In the binary system the general form is

A = \pm(0.b_1 b_2 b_3 \cdots b_n \cdots) \times 2^m = \pm(b_1 \times 2^{-1} + b_2 \times 2^{-2} + b_3 \times 2^{-3} + \cdots + b_n \times 2^{-n} + \cdots) \times 2^m,

where b_1 = 1. The digits b_1, b_2, \cdots are the significant digits.

Let int(A) = floor(A) denote the integer part of A and fr(A) = A − floor(A) denote the fractional part of A.

Remarks:

• Some numbers are represented by an infinite sequence of digits such as 1/3 = (0.3333333 · · · )10

• Some numbers are represented by a finite number of digits in one system and require an infinite sequence of digits in another system, as illustrated by

(0.7)10 = (0.101100110011001100110 · · · )2

Let us give more examples of normalized binary numbers

(1.00111)2 = (0.100111)2 × 2

(1011.011)2 = (0.1011011)2 × 24

(0.0000110)2 = (0.110)2 × 2−4

A Conversion Algorithm:

Next, we describe an algorithm that converts the fractional part of a floating-point number from the decimal system to the binary system. Let A be a number between 0 and 1 written as

A = (0.d_1 d_2 d_3 \cdots d_n)_{10},

which can be written in the binary system as

A = \frac{b_1}{2} + \frac{b_2}{2^2} + \frac{b_3}{2^3} + \cdots + \frac{b_n}{2^n} + \cdots

Multiplying by 2 gives

2A = b_1 + \frac{b_2}{2} + \frac{b_3}{2^2} + \cdots + \frac{b_n}{2^{n-1}} + \cdots

Therefore, we can write b_1 = floor(2A), b_2 = floor(2(2A - b_1)), and so on. Consult the MATLAB script below.
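The multiplication-by-2 recursion can also be sketched in Python (an illustration added here; exact rational arithmetic via the standard fractions module is a choice of this sketch, so the digits are not polluted by binary round-off):

```python
from fractions import Fraction

def fraction_to_binary(a, ndigits):
    """Binary digits of 0 < a < 1: b_n = floor(2*a), then recurse on the remainder."""
    a = Fraction(a)
    bits = []
    for _ in range(ndigits):
        a *= 2
        b = int(a)        # floor(2*a), either 0 or 1
        bits.append(b)
        a -= b            # keep only the fractional part
    return bits

print(fraction_to_binary(Fraction(7, 10), 12))    # 0.7 -> repeating 1011 0011 0011 ...
print(fraction_to_binary(Fraction(625, 1000), 4)) # 0.625 -> digits 1, 0, 1, 0
```

The first call reproduces the repeating expansion (0.7)_{10} = (0.10110011...)_2 noted above.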

A Conversion Algorithm in MATLAB


function B = floatconvert10to2(D)
% convert the fractional part of a floating-point number D from decimal
% to binary representation
% input:  D has to satisfy -1 < D < 1
% output: B string of 0s and 1s (sign bit followed by 24 digits)
s = sign(D);
if s >= 0
    B(1) = 0;
else
    B(1) = 1;
end
%
D = abs(D);
n = 1;
ndigits = 24;
while (n <= ndigits)
    B(n+1) = floor(2*D);   % next binary digit b_n = floor(2*A)
    D = 2*D - B(n+1);      % keep the fractional part
    n = n+1;
end

The general form of a floating-point representation in a system of base a with k significant digits is

A = \pm (0.d_1 d_2 d_3 \cdots d_k) \times a^t,

where 0.d_1 d_2 d_3 \cdots d_k is the mantissa and t is the characteristic (the exponent).

Standard IEEE representation: In the standard IEEE floating-point representation, which most modern computers use, a string of 32 binary digits is used for single precision and 64 binary digits are used for double precision, as described below.

Single precision (32 bits): The string of 32 bits

b_1 b_2 b_3 \cdots b_{32}

is interpreted as

(-1)^{b_1} \, 2^{(b_2 b_3 \cdots b_9)_2 - 127} \times 1.b_{10} b_{11} \cdots b_{32}.

If a number A can be written as

A = \pm 1.d_1 d_2 \cdots \times 2^m,


then one can find

b_1 = \begin{cases} 0, & \text{if } A \ge 0 \\ 1, & \text{otherwise,} \end{cases}

c = m + 127 = (b_2 \cdots b_9)_2,

and

b_{9+i} = d_i, \quad i = 1, 2, \cdots, 23.

The characteristic c = (b_2 b_3 \cdots b_9)_2 is bounded as 0 < c < (11111111)_2 = 255.

• c = 0 is reserved for zero

• c = 255 is reserved for \pm\infty

• c = 1 corresponds to the smallest computer number, 2^{-126} \approx 1.2 \times 10^{-38}

• c = 254 corresponds to the largest computer number, (2 - 2^{-23}) \, 2^{127} \approx 3.4 \times 10^{38}, where we used \sum_{n=0}^{23} 2^{-n} = 2 - 2^{-23}

• NaN, where c = 255 and the fraction f \ne 0, corresponds to 0/0, \infty - \infty, and x + NaN

• Very small negative numbers are represented by -0

• Most operations that underflow yield +0

• x/\infty yields +0

• x + \infty and \infty/x yield +\infty

Example: Find the IEEE binary representation of 31.625.

First, we note that (31)_{10} = (11111)_2, then convert the fractional part as follows.

Step 1: a_1 = 2 \times 0.625 = 1.250 \implies b_1 = floor(a_1) = 1.

Step 2: subtract b_1 from 1.250 to get 1.250 - 1 = 0.250


and obtain b_2 as

a_2 = 2 \times 0.250 = 0.5 \implies b_2 = floor(a_2) = 0.

Step 3:

a_3 = 2 \times (a_2 - b_2) = 1 \implies b_3 = floor(a_3) = 1.

At this point we note that a_k = 0 for all k > 3, and hence we can write

(0.625)_{10} = (0.101)_2.

Combining the previous results we write, in normalized form,

31.625 = (11111.101)_2 = (1.1111101)_2 \times 2^4.

Applying the shift to the characteristic we obtain

c = 4 + 127 = 131 = 2^7 + 2 + 1 = (10000011)_2,

which gives the IEEE standard representation

0 | 10000011 | 11111010000000000000000

(1 sign bit, 8 characteristic bits, 23 mantissa bits).
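The bit pattern derived above can be checked programmatically. This Python sketch (an illustration, not part of the notes) uses the standard struct module to extract the IEEE single precision bits of 31.625:

```python
import struct

# Pack 31.625 as a big-endian IEEE single precision float and print its 32 bits.
raw = struct.pack(">f", 31.625)
bits = "".join(f"{byte:08b}" for byte in raw)
sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
print(sign, exponent, mantissa)  # 0 10000011 11111010000000000000000
```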

Double or extended precision (64 bits): The string of 64 digits

b_1 | b_2 b_3 \cdots b_{12} | b_{13} \cdots b_{64}

is interpreted as

(-1)^{b_1} \, 2^{(b_2 b_3 \cdots b_{12})_2 - 1023} \times 1.b_{13} b_{14} \cdots b_{64}.

Here the shifted characteristic is c = (b_2 b_3 \cdots b_{12})_2, where 0 < c < (11111111111)_2 = 2047, and such that

• c = 0 is reserved for zero

• c = 1 corresponds to the smallest number, which is

2^{-1022} \approx 2.22 \times 10^{-308}


• c = 2046 corresponds to the largest number, which is

2^{1023} \times \sum_{i=0}^{52} \frac{1}{2^i} = 2^{1023} (2 - 2^{-52}) \approx 1.79 \times 10^{308}

• c = 2047 is reserved for \pm\infty

Example: Let us consider again the number (31.625)_{10} = (1.1111101)_2 \times 2^4. We shift m by 1023:

c = 4 + 1023 = 1027 = (10000000011)_2.

The standard IEEE representation is

0 | 10000000011 | 1111101000 \cdots 0

(1 sign bit, 11 characteristic bits, 52 mantissa bits).
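As in the single precision case, the double precision layout can be verified with the struct module (a Python illustration added here, not from the notes):

```python
import struct

raw = struct.pack(">d", 31.625)  # big-endian IEEE double precision bytes
bits = "".join(f"{byte:08b}" for byte in raw)
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]
print(sign, exponent, mantissa)  # 0 10000000011 1111101000...0
```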

1.2.3 Chopping versus rounding

There are two common ways to approximate floating-point numbers with a finite number of digits.

Decimal representation: For the purpose of illustration we introduce the two approximation methods, known as chopping and rounding, in the decimal system.

Chopping: numbers are chopped at a given location to keep a fixed number k of significant digits, as in

2/3 = (0.666666666666 \cdots)_{10} \approx (0.\underbrace{66 \cdots 6}_{k})_{10}.

In general, chopping the number

A = \pm(0.d_1 d_2 \cdots d_n \cdots) \times 10^t

yields

\bar{A} = \pm(0.d_1 d_2 \cdots d_k) \times 10^t.

The relative error can be estimated as

\frac{|A - \bar{A}|}{|A|} = \frac{0.d_{k+1} \cdots \times 10^{t-k}}{0.d_1 d_2 \cdots \times 10^t} < \frac{10^{t-k}}{10^{t-1}} = 10^{1-k}.


Rounding: decimal numbers are rounded to the nearest k-digit number according to the rule

\bar{A} = \begin{cases} (0.d_1 d_2 \cdots d_k) \times 10^t + 0.\underbrace{00 \cdots 01}_{k} \times 10^t, & \text{if } d_{k+1} \ge 5 \\ (0.d_1 d_2 \cdots d_k) \times 10^t, & \text{otherwise.} \end{cases}

Note: an equivalent procedure consists of adding 0.5 \times 10^{t-k} to A and then chopping A + 0.5 \times 10^{t-k}. The absolute rounding error can be estimated as

|A - \bar{A}| \le 0.5 \times 10^{t-k}.

Since |A| > 10^{t-1}, the relative rounding error satisfies

\frac{|A - \bar{A}|}{|A|} < \frac{0.5 \times 10^{t-k}}{10^{t-1}} = 0.5 \times 10^{1-k}.

We note that rounding errors are smaller than chopping errors.
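Both procedures are easy to mimic in code. In this hypothetical Python sketch (our own helpers, normalized to t = 0 for simplicity), each function returns the k retained significant digits as an integer:

```python
import math

def chop_digits(a, k):
    """First k significant decimal digits of 0.1 <= a < 1, by chopping."""
    return math.floor(a * 10**k)

def round_digits(a, k):
    """First k significant digits by rounding: add 0.5 * 10^(-k), then chop."""
    return math.floor(a * 10**k + 0.5)

print(chop_digits(2/3, 4), round_digits(2/3, 4))  # 6666 6667
```

Note how the rounding error (at most half a unit in the last digit) is smaller than the chopping error (up to a full unit).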

Binary representation:

Let

x = (0.b_1 b_2 \cdots b_k b_{k+1} \cdots)_2 \times 2^t.

Chopping with k significant digits we write

chop(x) = (0.b_1 b_2 \cdots b_k)_2 \times 2^t. \qquad (1.9)

Rounding x to the nearest machine number is performed by adding (0.1)_2 \times 2^{t-k} to x and chopping at k digits, which yields

round(x) = \begin{cases} (0.b_1 b_2 \cdots b_k)_2 \times 2^t, & \text{if } b_{k+1} = 0 \\ \left( (0.b_1 b_2 \cdots b_k)_2 + (0.\underbrace{00 \cdots 1}_{k})_2 \right) \times 2^t, & \text{otherwise.} \end{cases} \qquad (1.10)

In the following theorem, we state and prove bounds for chopping and rounding errors.

Theorem 1.2.1. Let x be approximated by chop(x) and round(x) defined in (1.9) and (1.10), respectively. Then the chopping and rounding relative errors can be bounded as

\frac{|x - chop(x)|}{|x|} < 2 \times 2^{-k} \qquad (1.11)

and

\frac{|x - round(x)|}{|x|} < 2^{-k}. \qquad (1.12)


Proof. First, we bound the chopping error as

\frac{|x - chop(x)|}{|x|} = \frac{(0.b_{k+1} \cdots)_2 \times 2^{-k} \times 2^t}{(0.b_1 b_2 \cdots)_2 \times 2^t} < \frac{2^{-k}}{1/2}.

Thus,

\frac{|x - chop(x)|}{|x|} < 2 \times 2^{-k} = 2^{1-k}.

Now, for the rounding error we first consider the case b_{k+1} = 0; then x - round(x) = (0.0 b_{k+2} b_{k+3} \cdots)_2 \times 2^{t-k}. Thus

|x - round(x)| < (0.01111 \cdots)_2 \times 2^{t-k} = \frac{1}{2} 2^{t-k} = 2^{t-k-1}.

Using the fact that (0.b_1 b_2 \cdots)_2 \ge (0.b_1 000 \cdots)_2 = (0.1)_2 = \frac{1}{2}, we prove (1.12).

Now, if b_{k+1} = 1, then round(x) = (0.b_1 \cdots b_k)_2 \times 2^t + 2^{-k} \times 2^t. Therefore,

x - round(x) = (0.1 b_{k+2} \cdots)_2 \times 2^{t-k} - 2^{t-k} = \left[ \frac{1}{2} + \frac{1}{2}(0.b_{k+2} \cdots)_2 \right] \times 2^{t-k} - 2^{t-k},

which can be reduced to

x - round(x) = \left[ \frac{1}{2} + \frac{1}{2}(0.b_{k+2} \cdots)_2 - 1 \right] \times 2^{t-k}.

Taking the absolute value and noting that 0 \le (0.b_{k+2} \cdots)_2 \le 1 we obtain

|x - round(x)| = \frac{1}{2}\left[ 1 - (0.b_{k+2} \cdots)_2 \right] \times 2^{t-k} < \frac{1}{2} 2^{t-k} = 2^{t-k-1}.

Finally, using the lower bound (0.b_1 b_2 \cdots)_2 \ge (0.b_1 000 \cdots)_2 = (0.1)_2 yields

\frac{|x - round(x)|}{|x|} < \frac{2^{-k-1} \times 2^t}{(1/2) \times 2^t} = 2^{-k}, \qquad (1.13)

which completes the proof.

The machine precision or unit round-off error is Eps = 2^{-k}, where k is the number of significant digits.

Floating-point representation and operations have been standardized using the IEEE floating-point representation. The IEEE standards are adopted on most modern computers, with a unit round-off error or machine epsilon given as follows:


IEEE single precision: for single precision IEEE floating-point arithmetic the number of significant digits is k = 24, which yields a machine epsilon or unit round-off error Eps = 2^{-24} \approx 6 \times 10^{-8}.

IEEE double precision: for double precision IEEE floating-point arithmetic we have k = 53 significant digits and Eps = 2^{-53} \approx 1.1 \times 10^{-16}.
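These two values of Eps are easy to confirm experimentally. The Python sketch below (an illustration added here) checks double precision directly and simulates single precision by passing numbers through a 32-bit format with the standard struct module:

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest IEEE single precision number."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Double precision: k = 53; adding 2^-53 to 1 is invisible, 2^-52 is not.
print(1.0 + 2.0**-53 == 1.0, 1.0 + 2.0**-52 > 1.0)

# Simulated single precision: k = 24; adding 2^-24 to 1 is invisible, 2^-23 is not.
print(f32(1.0 + 2.0**-24) == 1.0, f32(1.0 + 2.0**-23) > 1.0)
```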

Precision versus accuracy:

Precision: refers to the accuracy with which the basic arithmetic operations are performed. It is measured by the unit round-off error Eps.

Accuracy: refers to the absolute or relative error of an approximate quantity. Accuracy may be much worse than the precision.

Floating-point operations:

First, let F denote the set of computer numbers for a given representation, e.g., IEEE single or double precision.

As a result of (1.13), one can easily show that for every x \in R smaller in magnitude than the largest computer number there exist \varepsilon with |\varepsilon| \le Eps and fl(x) \in F such that

\frac{fl(x) - x}{x} = \varepsilon, \quad |\varepsilon| \le Eps,

which yields

fl(x) = x(1 + \varepsilon), \quad \text{where } |\varepsilon| \le Eps. \qquad (1.14)

The floating-point addition \oplus, multiplication \otimes, and division \oslash are defined on F as follows:

x \odot y = fl(x \cdot y) = (x \cdot y)(1 + \varepsilon), \qquad (1.15)

where \odot denotes any of the elementary operations mentioned above and \cdot the corresponding exact operation.

In words: we perform the regular elementary operation between two computer numbers in F and round the result to a computer number in F.

By combining (1.14) and (1.15) one can show that for every floating-point elementary operation we have


x \odot y = (x \cdot y)(1 + \varepsilon), \quad \text{where } |\varepsilon| < Eps. \qquad (1.16)

Thus all elementary operations acting on F are accurate with a relative error of order Eps.

1.2.4 Properties of Finite Precision Arithmetic

Here we consider a few examples to illustrate some of the properties of floating-point operations.

1. Loss of the associative property for \oplus, \otimes and \oslash

We illustrate this undesirable effect by an example where we compute the sum of three numbers,

a = 56.02, \quad b = 0.002, \quad c = 0.004,

and apply floating-point addition with k = 4 significant digits and rounding in the decimal system.

(1) Compute (a \oplus b) \oplus c:

a + b = 56.02 + 0.002 = 56.022, so a \oplus b = 56.02;
(a \oplus b) + c = 56.02 + 0.004 = 56.024, so (a \oplus b) \oplus c = 56.02.

(2) Compute a \oplus (b \oplus c):

b + c = 0.006, so b \oplus c = 0.006;
a + (b \oplus c) = 56.026, so a \oplus (b \oplus c) = 56.03.

We immediately see that (a \oplus b) \oplus c \ne a \oplus (b \oplus c). Thus, floating-point addition is not associative.
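The 4-digit computation above can be simulated by rounding every intermediate result to 4 significant digits. A Python sketch (the rounding helper fl4 is an invention of this example):

```python
def fl4(x):
    """Round x to 4 significant decimal digits, mimicking k = 4 finite precision."""
    return float(f"{x:.4g}")

a, b, c = 56.02, 0.002, 0.004

left = fl4(fl4(a + b) + c)    # (a (+) b) (+) c
right = fl4(a + fl4(b + c))   # a (+) (b (+) c)
print(left, right)  # 56.02 56.03
```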


2. Loss of uniqueness of the neutral element for \oplus

For the standard addition of real numbers, if 1 + x = 1 then x = 0; however, in finite precision arithmetic

1 \oplus x = 1 does not imply x = 0.

In the next example let us use finite precision arithmetic with k = 4 significant digits and the correct order of operations to find an accurate value of the sum

1 + \sum_{k=0}^{10^8} 0.00001.

It immediately becomes clear that if we add 0.00001 to 1 we obtain the incorrect sum 1. This is mainly due to the finite precision arithmetic, where 1 + 0.00001 evaluates to 1. Thus, this algorithm is unstable. We may avoid this problem by reorganizing the computations: group the numbers 0.00001, sum them within each group, and then add the group results to 1 to obtain a more accurate answer.

For IEEE single precision floating-point addition we can verify that for all x = 2^{-k}, k \ge 25, we have 1 \oplus x = 1. This problem is still present with IEEE double (extended) precision.
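The grouping remedy is easy to demonstrate in simulated single precision; the helper f32 (our own device for this sketch) rounds every intermediate sum to IEEE single precision:

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE single precision number."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

term, n = 2.0**-25, 1000   # each term is invisible next to 1 in single precision

# Naive left-to-right sum: every addition 1 (+) term evaluates back to 1.
naive = f32(1.0)
for _ in range(n):
    naive = f32(naive + term)

# Grouped sum: accumulate the small terms first, then add their total to 1.
grouped = f32(1.0 + f32(n * term))
print(naive, grouped)
```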

3. Loss of significance or cancellation

Another problem with finite precision arithmetic is the loss of significant digits, which mainly occurs when subtracting nearly equal quantities.

Example: Let us use k = 14 significant digits in the decimal system (roughly double precision) to compute cos(x) - cos(2x) for x = 0.0001:

cos(x) = 0.99999999500000
cos(2x) = 0.99999998000000

cos(x) - cos(2x) = 0.0000000149999999088379 = 0.149999999088379 \times 10^{-7}.

We observe a loss of 7 significant digits. This can be avoided, for instance, by using a Taylor series expansion.

Avoiding loss of significance: Loss of significance in the previous example may be avoided by using the Maclaurin series for cos(x) and cos(2x), where the leading terms in each series cancel each other.
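Another stable rewriting (our choice for this sketch; the notes suggest Maclaurin series) uses the identity cos x - cos 2x = 2 sin(3x/2) sin(x/2), which replaces the subtraction of nearly equal quantities by a product:

```python
import math

x = 1.0e-4

naive = math.cos(x) - math.cos(2 * x)               # subtracts values agreeing to ~8 digits
stable = 2 * math.sin(1.5 * x) * math.sin(0.5 * x)  # same quantity, written as a product

# For small x, cos x - cos 2x = (3/2) x^2 + O(x^4).
print(naive, stable, 1.5 * x**2)
```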


We further illustrate this effect on the expression

\sqrt{1 + x^4} - \sqrt{1 - x^2}.

If we use it in this form we will experience a loss of significant digits for |x| \ll 1. We can avoid this loss of significance by multiplying the numerator and denominator by the conjugate expression,

\frac{\left( \sqrt{1 + x^4} - \sqrt{1 - x^2} \right)\left( \sqrt{1 + x^4} + \sqrt{1 - x^2} \right)}{\sqrt{1 + x^4} + \sqrt{1 - x^2}},

to obtain the equivalent expression where no cancellation occurs:

\frac{x^4 + x^2}{\sqrt{1 + x^4} + \sqrt{1 - x^2}}.
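The same comparison in code, with the leading Taylor behavior x^2/2 for small x used as a sanity check (a detail added in this sketch, not from the notes):

```python
import math

x = 1.0e-4

naive = math.sqrt(1 + x**4) - math.sqrt(1 - x**2)   # cancellation for |x| << 1
stable = (x**4 + x**2) / (math.sqrt(1 + x**4) + math.sqrt(1 - x**2))

# Taylor expansion: sqrt(1 + x^4) - sqrt(1 - x^2) = x^2/2 + O(x^4).
print(naive, stable, x**2 / 2)
```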

Finally, we note that round-off errors and floating-point arithmetic can also have positive effects: for instance, they help avoid the need for special initial vectors to start the power method for finding matrix eigenvalues.

1.3 Error propagation, stability and conditioning

In the previous section we showed that floating-point operations and finite precision arithmetic introduce errors at several steps of a computation, which may destroy the accuracy. Here, we study the propagation of round-off errors and define the notions of conditioning of a problem and numerical stability of algorithms.

1.3.1 Error propagation

We investigate the propagation of round-off errors in a computation by first studying the elementary operations.

1. Multiplication:

x \otimes y = [x(1 + \varepsilon_x)][y(1 + \varepsilon_y)](1 + \varepsilon) \qquad (1.17)

= xy(1 + \varepsilon_x)(1 + \varepsilon_y)(1 + \varepsilon) \qquad (1.18)

= xy(1 + \delta), \qquad (1.19)

where

\delta = \varepsilon_x + \varepsilon_y + \varepsilon + \varepsilon_x \varepsilon_y + \varepsilon_x \varepsilon + \varepsilon \varepsilon_y + \varepsilon_x \varepsilon_y \varepsilon.


The relative error is approximated by

|δ| ≈ |εx + εy + ε| < 3Eps.

Thus, the floating-point multiplication is accurate to within machine epsilon.

2. Division:

x \oslash y = \frac{x(1 + \varepsilon_x)}{y(1 + \varepsilon_y)}(1 + \varepsilon) \qquad (1.20)

= \frac{x}{y}(1 + \varepsilon_x)(1 + \varepsilon) \sum_{k=0}^{\infty} (-1)^k \varepsilon_y^k \qquad (1.21)

= \frac{x}{y}(1 + \delta), \qquad (1.22)

where we used the Maclaurin series

\frac{1}{1 + t} = \sum_{k=0}^{\infty} (-1)^k t^k

and

\delta = (1 + \varepsilon_x)(1 + \varepsilon) \sum_{k=0}^{\infty} (-1)^k \varepsilon_y^k - 1 \qquad (1.23)

\approx \varepsilon_x - \varepsilon_y + \varepsilon. \qquad (1.24)

Thus, the relative error \delta may be bounded as |\delta| < 3 Eps. Hence, floating-point division is accurate to within machine epsilon.

3. Addition: If we assume that x + y \ne 0, then

x \oplus y = (x(1 + \varepsilon_x) + y(1 + \varepsilon_y))(1 + \varepsilon) \qquad (1.25)

= (x + y)\left( 1 + \frac{x}{x+y}\varepsilon_x + \frac{y}{x+y}\varepsilon_y + \frac{x}{x+y}\varepsilon_x \varepsilon + \frac{y}{x+y}\varepsilon_y \varepsilon + \varepsilon \right) \qquad (1.26)

= (x + y)(1 + \delta). \qquad (1.27)

If x and y have the same sign, then \delta can be bounded as |\delta| \le 3 Eps after neglecting the higher-order terms. Thus, addition is accurate to within machine epsilon.

However, if x and y are nearly equal with opposite signs, then x + y may be arbitrarily small. Hence, the error \delta cannot be bounded; the round-off errors are amplified and may destroy the accuracy.


Let us consider the following problem, which involves several elementary operations:

\phi : R^2 \to R, \quad (a, b) \to \phi(a, b) = a^2 - b^2,

and the following two algorithms for computing \phi(a, b).

Algorithm 1:
\phi^{(0)} : (a, b) \to (a, b, a - b)
\phi^{(1)} : (a, b, a - b) \to (a + b, a - b)
\phi^{(2)} : (a + b, a - b) \to (a + b)(a - b)

Thus, \phi = \phi^{(2)} \circ \phi^{(1)} \circ \phi^{(0)}, where each \phi^{(i)} consists of an elementary operation.

Algorithm 2:
\phi^{(0)} : (a, b) \to (a^2, b)
\phi^{(1)} : (a^2, b) \to (a^2, b^2)
\phi^{(2)} : (a^2, b^2) \to a^2 - b^2,

with \phi = \phi^{(2)} \circ \phi^{(1)} \circ \phi^{(0)}, where \phi^{(k)} denotes the elementary operation performed at the k-th step of the algorithm.
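The two algorithms really do differ under round-off. In this Python sketch (the f32 helper and the input values are our own devices) we run both in simulated IEEE single precision for inputs a and b that are large and close, where Algorithm 1, (a+b)(a−b), retains more accuracy than Algorithm 2, a² − b²:

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE single precision number."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

a, b = f32(10000.0), f32(9999.5)

alg1 = f32(f32(a + b) * f32(a - b))   # phi = (a + b)(a - b)
alg2 = f32(f32(a * a) - f32(b * b))   # phi = a^2 - b^2
exact = a * a - b * b                 # 9999.75, exact in double precision

print(alg1, alg2, exact)
```

Here Algorithm 1 returns the exact answer 9999.75, while Algorithm 2 loses the fractional part because b² cannot be stored exactly in single precision before the subtraction.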

In general, every problem \phi can be decomposed into r + 1 steps as

\phi = \phi^{(r)} \circ \phi^{(r-1)} \circ \cdots \circ \phi^{(1)} \circ \phi^{(0)},

where the remainder map is defined by

\psi^{(i)} = \phi^{(r)} \circ \phi^{(r-1)} \circ \cdots \circ \phi^{(i)}, \quad i = 0, 1, \cdots, r.

We note that \psi^{(0)} = \phi, and if x^{(i)} = \phi^{(i-1)} \circ \cdots \circ \phi^{(0)}(x^{(0)}), then

y = \psi^{(i)}(x^{(i)}), \quad i = 0, 1, \cdots, r.

The exact sequence of intermediate results and its computed (rounded) counterpart can be pictured as

x^{(0)} \xrightarrow{\phi^{(0)}} x^{(1)} \xrightarrow{\phi^{(1)}} x^{(2)} \cdots x^{(i-1)} \xrightarrow{\phi^{(i-1)}} x^{(i)} \cdots x^{(r)} \xrightarrow{\phi^{(r)}} x^{(r+1)}

\bar{x}^{(0)} \xrightarrow{\bar{\phi}^{(0)}} \bar{x}^{(1)} \xrightarrow{\bar{\phi}^{(1)}} \bar{x}^{(2)} \cdots \bar{x}^{(i-1)} \xrightarrow{\bar{\phi}^{(i-1)}} \bar{x}^{(i)} \cdots \bar{x}^{(r)} \xrightarrow{\bar{\phi}^{(r)}} \bar{x}^{(r+1)}


Applying the chain rule we write

D\psi^{(i)}(x^{(i)}) = D\phi^{(r)}(x^{(r)}) \, D\phi^{(r-1)}(x^{(r-1)}) \cdots D\phi^{(i)}(x^{(i)}).

In order to study the propagation of round-off errors at the i-th step, we denote the computed value after rounding by

\bar{x}^{(i+1)} = \bar{\phi}^{(i)}(\bar{x}^{(i)}) = round(\phi^{(i)}(\bar{x}^{(i)})).

Next, we write the absolute error at step i as

\Delta x^{(i+1)} = \bar{x}^{(i+1)} - x^{(i+1)} = \bar{\phi}^{(i)}(\bar{x}^{(i)}) - \phi^{(i)}(x^{(i)})

= [\bar{\phi}^{(i)}(\bar{x}^{(i)}) - \phi^{(i)}(\bar{x}^{(i)})] + [\phi^{(i)}(\bar{x}^{(i)}) - \phi^{(i)}(x^{(i)})]. \qquad (1.28)

If \phi^{(i)} is a vector function, then for each component j we have

\bar{\phi}^{(i)}_j(u) = \phi^{(i)}_j(u)(1 + \varepsilon_j), \quad |\varepsilon_j| \le Eps,

which leads to

\bar{\phi}^{(i)}(u) = (I + E_{i+1}) \phi^{(i)}(u),

where

E_{i+1} = diag(\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_{n_{i+1}}).

Now we can write

\bar{\phi}^{(i)}(\bar{x}^{(i)}) - \phi^{(i)}(\bar{x}^{(i)}) = E_{i+1} \phi^{(i)}(\bar{x}^{(i)}) \approx E_{i+1} \phi^{(i)}(x^{(i)}) = E_{i+1} x^{(i+1)} = \alpha_{i+1}.

Substituting the previous relation in (1.28) we obtain

\Delta x^{(i+1)} \approx \alpha_{i+1} + \phi^{(i)}(\bar{x}^{(i)}) - \phi^{(i)}(x^{(i)}),

which, by using differentials, leads to the recursion formula

\Delta x^{(i+1)} \approx \alpha_{i+1} + D\phi^{(i)}(x^{(i)}) \Delta x^{(i)}.

Now we are ready to express the forward error in terms of the round-off errors committed at previous steps as

\Delta x^{(r+1)} \approx D\phi(x) \Delta x^{(0)} + D\psi^{(1)}(x^{(1)}) \alpha_1 + \cdots + D\psi^{(r)}(x^{(r)}) \alpha_r + \alpha_{r+1},

where \Delta x^{(0)} = \bar{x}^{(0)} - x^{(0)} is the error in the data. Finally, the total forward error can be written as


\Delta x^{(r+1)} \approx D\phi(x) \Delta x^{(0)} + D\psi^{(1)}(x^{(1)}) E_1 x^{(1)} + \cdots + D\psi^{(r)}(x^{(r)}) E_r x^{(r)} + E_{r+1} x^{(r+1)},

where x^{(r+1)} = \phi(x) = y.

Remarks:

1. The effect of errors in the input data x^{(0)} is given by

D\phi(x) \Delta x^{(0)},

which indicates how sensitive the output is to changes in the input; it is completely independent of the algorithm used.

2. The total effect of round-off errors is

D\psi^{(1)}(x^{(1)}) E_1 x^{(1)} + \cdots + D\psi^{(r)}(x^{(r)}) E_r x^{(r)} + E_{r+1} x^{(r+1)}.

3. We may compare algorithms by comparing their total round-off errors. Thus, an algorithm A is said to be more trustworthy than an algorithm B if the total effect of round-off errors for A is less than that for B.

1.3.2 Conditioning of a problem

In this section we discuss the effect of errors in the input data. This leads to the definition of the conditioning of a problem, which describes how sensitive the output of the problem is to the input. The amplification factor of the input error is called the condition number of the problem; it is the ratio of the relative error in the output to the relative error in the input. As a consequence, ill-conditioning and well-conditioning of a problem are defined as follows:

1. If a problem has a large condition number, then small perturbations in the input data lead to large changes in the output. In this case, the problem is said to be ill-conditioned, and one should expect difficulty in solving it.

2. If a problem has a small condition number, then small perturbations in the input data yield small changes in the output. The problem is said to be well-conditioned.

In order to derive the condition number we apply differential calculus to relate changes in the input to changes in the output and to identify the amplification factor, which is the condition number.


To begin, let us assume we would like to compute f : R \to R, and let us perturb the input x by \Delta x and write the relative error

\frac{\Delta f(x)}{f(x)} = \frac{f(x + \Delta x) - f(x)}{f(x)}, \quad f(x) \ne 0.

Applying a Taylor series and neglecting higher-order terms we write

f(x + \Delta x) - f(x) \approx \Delta x \, f'(x),

which can be written as

\frac{\Delta f(x)}{f(x)} \approx \frac{x f'(x)}{f(x)} \cdot \frac{\Delta x}{x},

or

\frac{\Delta f(x)/f(x)}{\Delta x / x} \approx \frac{x f'(x)}{f(x)}.

Thus, the amplification factor \kappa(f)(x) = \frac{|x f'(x)|}{|f(x)|} is the condition number. It describes how large the relative error in f(x) is compared to the relative error in x.

The condition number of f is defined as

\kappa(f)(x) = \frac{|x f'(x)|}{|f(x)|}, \quad \text{for } x \ne 0 \text{ and } f(x) \ne 0.

If, for instance, \kappa(f)(x) = 10^t, then t significant digits are lost.
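As a quick numerical check of this definition (a sketch added here; f(x) = x^5 is our own example), the ratio of relative errors should approach κ(f)(x) = |x f'(x)/f(x)|, which for f(x) = x^5 equals 5 at every x ≠ 0:

```python
def relative_error_ratio(f, x, dx):
    """(relative change in f) / (relative change in x); tends to the condition number."""
    return abs((f(x + dx) - f(x)) / f(x)) / abs(dx / x)

f = lambda t: t**5   # kappa(f)(x) = |x * 5x^4 / x^5| = 5
print(relative_error_ratio(f, 2.0, 1e-8))
```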

The condition number in the cases when the relative error is not defined is given as follows:

• If x = 0 and f(0) \ne 0, then \kappa(f)(0) = \frac{|f'(0)|}{|f(0)|}.

• If x \ne 0, f(x) = 0, or x = 0, f(0) = 0, then \kappa(f)(x) = |f'(x)|.

Let us consider \phi(x), where x = (x_1, x_2, \cdots, x_n) and \phi is continuously differentiable. The relative error in x_i is

\varepsilon_i = \frac{\bar{x}_i - x_i}{x_i}, \quad \text{if } x_i \ne 0.

Let \bar{x} = (\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_n) and

\Delta x = \bar{x} - x.


Using differentials we write

φ(x̃) − φ(x) ≈ ∑_{i=1}^{n} (x̃i − xi) ∂φ(x)/∂xi = ∑_{i=1}^{n} εi xi ∂φ(x)/∂xi.

The relative error is

(φ(x̃) − φ(x))/φ(x) ≈ ∑_{i=1}^{n} (xi/φ(x)) (∂φ(x)/∂xi) εi.

Using the l2 norm, for instance, the previous equation yields

||φ(x̃) − φ(x)||2 / ||φ(x)||2 ≤ (||x||2 ||Dφ||2 / ||φ||2) · (||∆x||2 / ||x||2).

The condition number of this problem is

κ(φ)(x) = ||x||2 ||Dφ||2 / ||φ||2.

Example 1.3.1.

Let us consider φ(x) = x1 + x2 + x3. Then

|φ(x̃) − φ(x)| / |φ(x)| ≈ |x1/(x1 + x2 + x3) ε1 + x2/(x1 + x2 + x3) ε2 + x3/(x1 + x2 + x3) ε3|

≤ √(∑_{i=1}^{3} (xi/(x1 + x2 + x3))²) · √(ε1² + ε2² + ε3²),

by the Cauchy–Schwarz inequality. Thus, the condition number is

κ(φ)(x) = √(∑_{i=1}^{3} (xi/(x1 + x2 + x3))²).

This problem is well-conditioned if every xi is small compared to the sum x1 + x2 + x3. For instance, if all the xi have the same sign, then the problem is well-conditioned. Otherwise, it may be ill-conditioned.
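The effect is easy to observe numerically. The sketch below (the function name `kappa_sum` is ours) evaluates the condition number of summation for a same-sign triple and for a triple with near cancellation:

```python
import math

def kappa_sum(xs):
    """Condition number of summation: sqrt(sum((x_i / sum)^2))."""
    s = sum(xs)
    return math.sqrt(sum((x / s) ** 2 for x in xs))

# same sign: each x_i is comparable to the sum -> well-conditioned
k_good = kappa_sum([1.0, 2.0, 3.0])       # sqrt(14)/6, about 0.62

# opposite signs nearly cancel: the sum is tiny -> ill-conditioned
k_bad = kappa_sum([1.0, 1e8, -1e8])       # about 1.4e8
```

A relative error of 10⁻¹⁶ in the inputs of the second sum can therefore corrupt roughly eight of the sixteen significant digits of the result.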


Example 1.3.2. An ill-conditioned algebraic problem:

We consider the quadratic equation x² − 2x + 1 = 0, which has a double root x = 1. However, if we perturb the constant coefficient by ε, the perturbed equation x² − 2x + 1 − ε = 0 has two roots x = 1 ± √ε. Thus we immediately observe that a perturbation of order ε in the coefficient leads to a perturbation of order √ε in the output, where √ε >> ε when ε << 1.

Let us consider the quadratic equation x² − 2x + c = 0, where the input is c and the output is one of the roots x(c), given by

x(c) = (2 ± √(4 − 4c))/2 = 1 ± √(1 − c).

The condition number of the function x(c) is computed as

κ(x)(c) = |c x′(c)| / |x(c)| = |c| / (|x(c)| √(4 − 4c)).

For c = 1 − ε we have

κ(x)(c) = |1 − ε| / |2√ε (1 ± √ε)| ≈ 1/(2√ε) >> 1, when ε << 1.

Here, we note that κ(x)(1) = ∞, and for values of c near 1 the condition number gets large; hence, the problem is ill-conditioned near c = 1.
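A short numerical experiment (our own, not from the notes) confirms the √ε amplification: perturbing the constant coefficient of x² − 2x + 1 by ε = 10⁻¹² moves the double root by about 10⁻⁶, a factor of 10⁶ larger than the perturbation.

```python
import math

def perturbed_roots(eps):
    """Roots of x**2 - 2*x + (1 - eps) = 0, namely 1 -/+ sqrt(eps)."""
    d = math.sqrt(4.0 - 4.0 * (1.0 - eps))  # discriminant sqrt(4*eps)
    return (2.0 - d) / 2.0, (2.0 + d) / 2.0

eps = 1e-12
lo, hi = perturbed_roots(eps)
shift = hi - 1.0              # ~ sqrt(eps) = 1e-6, six orders larger than eps
amplification = shift / eps   # ~ 1e6
```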

Example 1.3.3. Linear systems of algebraic equations:

In this last example we study the linear system

Ax = b,

where A ∈ R^{n×n} and x, b ∈ R^n. Assume that the matrix A has an inverse A⁻¹ ∈ R^{n×n} such that AA⁻¹ = A⁻¹A = I, where I is the n × n identity matrix.

Let us define the lp vector norm

||x||p = (∑_i |xi|^p)^{1/p},   p = 1, 2, · · · ,   with ||x||∞ = max_i |xi|.


From this lp vector norm we define the induced lp matrix norm

||A||p = max_{x ∈ R^n, x ≠ 0} ||Ax||p / ||x||p,

where, for instance,

||A||∞ = max_{i=1,··· ,n} ∑_{j=1}^{n} |ai,j|  (maximum absolute row sum),

||A||1 = max_{j=1,··· ,n} ∑_{i=1}^{n} |ai,j|  (maximum absolute column sum).

Next are a few useful properties of matrix norms:

||Ax||p ≤ ||A||p ||x||p, using the same norm for vector and matrix,

||AB||p ≤ ||A||p ||B||p,   ||A + B||p ≤ ||A||p + ||B||p.

Now, we define the problem function φ : b → φ(b) by

x = φ(b) = A⁻¹b.

Since the Jacobian of φ is A⁻¹, the condition number of φ is

κ(φ)(b) = ||b||p ||A⁻¹||p / ||x||p.

Using b = Ax and ||b||p = ||Ax||p ≤ ||A||p ||x||p, we have

max_{b ∈ R^n} κ(φ)(b) = ||A||p ||A⁻¹||p.

Since this number does not depend on the right-hand side, it is called the condition number of the matrix and is denoted by

κp(A) = ||A||p ||A⁻¹||p.
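For a concrete matrix, both norms in this product can be computed by hand. The pure-Python sketch below (helper name `norm_inf` is ours) evaluates κ∞(A) for a 2 × 2 matrix whose inverse is known in closed form:

```python
def norm_inf(A):
    """Infinity matrix norm: maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

# A = [[1, 2], [3, 4]] has det(A) = -2, so A^{-1} = [[-2, 1], [1.5, -0.5]]
A = [[1.0, 2.0], [3.0, 4.0]]
A_inv = [[-2.0, 1.0], [1.5, -0.5]]

# ||A||_inf = 7 (row 2), ||A^{-1}||_inf = 3 (row 1), so kappa_inf(A) = 21
kappa = norm_inf(A) * norm_inf(A_inv)
```

A condition number of 21 means relative errors in b can be amplified by at most a factor of 21 in the solution x of Ax = b.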

Hilbert matrices are known to be ill-conditioned. The n × n Hilbert matrix, with entries (Hn)i,j = 1/(i + j − 1), is

Hn =
[ 1        1/2        · · ·   1/n      ]
[ 1/2      1/3        · · ·   1/(n+1)  ]
[ · · ·    · · ·      · · ·   · · ·    ]
[ 1/n      1/(n+1)    · · ·   1/(2n−1) ],

where

κ∞(H10) = 1.6 × 10^13,   κ∞(H20) = 2.45 × 10^28,   κ∞(H40) = 7.65 × 10^58.
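These values are easy to reproduce approximately. The NumPy sketch below (our own; function names are ours) builds Hn from its entries 1/(i + j − 1) and evaluates κ∞ = ||H||∞ ||H⁻¹||∞. Note that because Hn is itself ill-conditioned, the computed inverse, and hence the computed κ∞, is only reliable to an order of magnitude for larger n:

```python
import numpy as np

def hilbert(n):
    """n x n Hilbert matrix with entries 1/(i + j - 1) (1-based i, j)."""
    i, j = np.indices((n, n))      # 0-based index grids
    return 1.0 / (i + j + 1.0)

def kappa_inf(A):
    """Condition number ||A||_inf * ||A^{-1}||_inf."""
    return np.linalg.norm(A, np.inf) * np.linalg.norm(np.linalg.inv(A), np.inf)

k5 = kappa_inf(hilbert(5))     # already close to 1e6
k10 = kappa_inf(hilbert(10))   # order 1e13, consistent with the value quoted above
```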


1.3.3 Numerical stability

From the forward error analysis we note that every algorithm for computing y = φ(x) will give at least an error of the form

|∆0 y| = [ |Dφ(x)| |x| + |y| ] Eps,

which is often called the inherent error and is common to all algorithms.

Numerical stability consists of studying the propagation of round-off errors introduced at intermediate steps of an algorithm and their impact on the accuracy of the final result.

Numerical stability of an algorithm may be defined in words in more than one way. For instance, Stoer and Bulirsch [1] state the following definitions:

1. An algorithm is numerically stable if the total error has the same order as the inherent error.

2. An algorithm is numerically stable if all round-off errors are harmless.

Other definitions of numerical stability are given by Higham [2] and will be discussed here.

If we let φ̃(x) denote the computed value of φ(x), then we have the following definition from Trefethen and Bau [3], as illustrated in the diagram of Figure 1.1.

Big O and little o: Before stating the stability definitions we recall the notions of big O and little o. Consider two sequences ak and bk, k = 1, 2, 3, . . ..

Definition 1. We say that ak = O(bk) if and only if there exist constants C > 0 and N > 0 such that

|ak| ≤ C|bk|, for k > N.

Definition 2. We say that ak = o(bk) if and only if for every ε > 0 there exists N > 0 such that

|ak| ≤ ε|bk|, for k > N.

Definition 3. An algorithm that computes φ̃(x), an approximation of φ(x), is numerically stable if and only if for every x there exists x̃ with |x − x̃|/|x| = O(Eps) such that

|φ̃(x) − φ(x̃)| / |φ(x̃)| = O(Eps).


[Figure omitted: "Diagram for numerical stability", relating x, x̃, φ(x̃), and φ̃(x), with the backward error between x and x̃ and the forward error between φ̃(x) and φ(x̃).]

Figure 1.1: A diagram for a stable algorithm

In words: an algorithm is numerically stable if it gives nearly the right answer to nearly the right problem.

The accuracy of a stable algorithm can be established as follows:

(φ̃(x) − φ(x))/φ(x) = (φ̃(x) − φ(x̃))/φ(x) + (φ(x̃) − φ(x))/φ(x).   (1.29)

Taking absolute values we obtain the bound

|φ̃(x) − φ(x)| / |φ(x)| ≤ C1 Eps + C2 κ(φ)(x) Eps.   (1.30)

In the previous bound we have assumed that φ(x̃) ≈ φ(x); indeed, we may write

φ(x̃)/φ(x) = 1 + (x φ′(x)/φ(x)) · (x̃ − x)/x + · · · ,

which yields the bound

|φ(x̃) − φ(x)| / |φ(x)| ≤ κ(φ)(x) · |x̃ − x|/|x|.

Hence we can write the following:

"Forward error" ≤ C1 Eps + κ(φ) · "Backward error".   (1.31)


Backward error analysis:

Backward error analysis was first introduced by Wilkinson [4] and has proved very effective for many linear algebra problems. It consists of assuming an error in the output and finding the corresponding error in the input such that the exact solution with the perturbed input yields the perturbed output; see Figure 1.2.

Definition 4. We say that an algorithm that computes φ̃(x), an approximation of φ(x), is backward stable if for each x we can find x̃ such that

φ̃(x) = φ(x̃), where ||x̃ − x|| / ||x|| = O(Eps).

In words: a backward stable algorithm gives exactly the right answer to nearly the right question.

Next, we will show that the elementary floating-point operations of multiplication, division, and addition are backward stable.

Here we follow the backward error analysis in Trefethen and Bau, pages 108–109, which includes the rounding of the initial data as a step of the algorithm. We note that from the definition one may assume the initial data x and y to be machine numbers, i.e., εx = εy = 0.

Multiplication:

x ⊗ y = [x(1 + εx) · y(1 + εy)](1 + ε) = xy(1 + δ).

This can be rewritten as

x ⊗ y = x̃ ỹ,

where

x̃ = x(1 + εx)(1 + ε) = x(1 + δx),
ỹ = y(1 + εy) = y(1 + δy).

We note that the backward error in x, δx = εx + ε + ε εx, and the backward error in y, δy = εy, are O(Eps). Thus, floating-point multiplication is backward stable.
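This analysis can be checked on actual IEEE double-precision numbers. The sketch below (our own) takes two machine numbers x and y, computes their exact product with `fractions.Fraction`, and verifies that the relative error δ of the floating-point product is within the unit roundoff u = 2⁻⁵³:

```python
from fractions import Fraction

u = 2.0 ** -53                     # unit roundoff of IEEE double precision

x, y = 0.1, 0.3                    # machine numbers (already rounded inputs)
computed = x * y                   # fl(x * y) = x*y*(1 + delta), |delta| <= u
exact = Fraction(x) * Fraction(y)  # exact rational product of the machine numbers
delta = float((Fraction(computed) - exact) / exact)
```

Because `Fraction(float)` is exact, `delta` here is the true rounding error of the single multiplication, not an estimate.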


Addition:

x ⊕ y = (x(1 + εx) + y(1 + εy))(1 + ε) = x̃ + ỹ,

where

x̃ = x(1 + εx)(1 + ε) = x(1 + δx),
ỹ = y(1 + εy)(1 + ε) = y(1 + δy).

The relative backward errors in x and y, δx = εx + ε + εx ε and δy = εy + ε + εy ε, are O(Eps). Thus, floating-point addition is backward stable.

Division:

x ⊘ y = (x(1 + εx) / (y(1 + εy))) (1 + ε) = x̃ / ỹ,

where

x̃ = x(1 + εx)(1 + ε),   (1.32)
ỹ = y(1 + εy).   (1.33)

Again, the relative backward errors in x and y are O(Eps); thus, floating-point division is backward stable.
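The same exact-arithmetic check works for ⊕ and ⊘. The sketch below (the helper name `rel_err` is ours) confirms that one floating-point addition or division of machine numbers carries a relative error bounded by the unit roundoff:

```python
from fractions import Fraction

u = 2.0 ** -53                            # unit roundoff of IEEE double

def rel_err(computed, exact):
    """Exact relative error of a float against a rational reference value."""
    return abs(float((Fraction(computed) - exact) / exact))

x, y = 0.1, 0.7
err_add = rel_err(x + y, Fraction(x) + Fraction(y))
err_div = rel_err(x / y, Fraction(x) / Fraction(y))
```

IEEE 754 requires these basic operations to be correctly rounded, which is exactly the model fl(x op y) = (x op y)(1 + ε), |ε| ≤ u, used throughout this section.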

Accuracy of a backward stable algorithm:

In the following theorem we estimate the forward error in terms of the problem condition number and the backward error for a backward stable algorithm.

Theorem 1.3.1. Assume a backward stable algorithm is used to solve a problem y = φ(x), with φ ∈ C¹ and condition number κ(φ), on a computer using floating-point arithmetic. Then, the relative error satisfies

||φ̃(x) − φ(x)|| / ||φ(x)|| = O(κ(φ)(x) Eps).

Proof. By definition of backward stability there exists x̃ such that φ̃(x) = φ(x̃), with x̃ satisfying

||x̃ − x|| / ||x|| = O(Eps).


Since φ is differentiable, we write

||φ̃(x) − φ(x)|| / ||φ(x)|| = ||φ(x̃) − φ(x)|| / ||φ(x)||

= ||φ′(x)(x̃ − x) + O(||x̃ − x||²)|| / ||φ(x)||

≤ (||φ′(x)|| ||x̃ − x|| + O(||x̃ − x||²)) / ||φ(x)||

= (||x|| ||φ′(x)|| / ||φ(x)||) · ||x̃ − x||/||x|| + o(1) · ||x̃ − x||/||x||

= (κ(φ)(x) + o(1)) O(Eps),

where

κ(φ)(x) = ||x|| ||φ′(x)|| / ||φ(x)||.

This completes the proof of the theorem.

Remarks:

1. A backward stable algorithm yields an error equivalent to the inherent error.

2. The error in the output of a backward stable algorithm may be large, but this is due to errors in the input and to a large condition number of the problem, not to the algorithm's stability.

3. Backward error analysis does not include problem sensitivity and is therefore more appropriate for studying numerical stability.

4. A numerically backward stable algorithm gives results that reflect the condition number of the problem.

5. In a backward stable algorithm the forward error is bounded by the backward error amplified by the problem condition number:

"Forward error" ≤ κ(φ) · "Backward error".


[Figure omitted: "Diagram for backward stability", relating x, x̃, φ(x̃), and φ̃(x), with the backward error between x and x̃ and the forward error between φ̃(x) and φ(x̃).]

Figure 1.2: A diagram for a backward stable algorithm


Bibliography

[1] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer-Verlag, 3rd edition, New York, 2002.

[2] N.J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, 1996.

[3] L.N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, Philadelphia, 1997.

[4] J.H. Wilkinson. Rounding Errors in Algebraic Processes. Dover, New York, 1994.
