
Iterative Solution of Large Linear Systems

Lecture notes (W.Auzinger, SS 2011)

revised and extended version of a script by J.M.Melenk

Wien, 2011


Contents

1 Introduction; useful material
1.1 Vector norms and inner products
1.2 Types of matrices and matrix decompositions
1.3 Eigenvalues and matrix norms
1.4 Cayley-Hamilton Theorem

2 Sparse Storage
2.1 Coordinate format (COO)
2.2 Compressed Sparse Row [Column] formats (CSR, CSC)

3 Direct Solution Methods
3.1 Fill-in
3.2 Standard ordering strategies
3.3 Minimum degree ordering

4 A fast Poisson Solver based on FFT
4.1 The 1D case
4.2 The 2D case

5 Basic Iterative Methods
5.1 Convergence analysis of linear iterative methods
5.2 Splitting methods
5.3 Model problem analysis and consistent ordering

6 Chebyshev Acceleration and Semi-iterative Methods
6.1 Chebyshev polynomials
6.2 Chebyshev acceleration for σ(M) ⊂ (−1, 1)
6.3 Numerical example


7 Gradient Methods
7.1 The Method of Steepest Descent (SD) for SPD systems
7.2 Nonsymmetric steepest descent algorithms
7.3 Gradient methods as projection methods

8 The Conjugate Gradient (CG) Method for SPD Systems
8.1 Motivation
8.2 Introduction to the CG method
8.3 Derivation of the CG method
8.4 CG as a projection method and its relation to polynomial approximation
8.5 Convergence properties of the CG method
8.6 CG in Matlab: The function pcg
8.7 CGN: CG applied to the Normal Equations

9 General Approach Based on Orthogonalization of K_m. The Arnoldi/Lanczos Procedures
9.1 The Arnoldi procedure for A ∈ R^{n×n}
9.2 The MGS (Modified Gram-Schmidt) variant
9.3 The Lanczos procedure for symmetric A ∈ R^{n×n}
9.4 Arnoldi/Lanczos and polynomial approximation
9.5 The direct Lanczos method for symmetric systems (D-Lanczos)
9.6 From Lanczos to CG

10 General Krylov Subspace Methods, in particular GMRES
10.1 Computing the x_m
10.2 The GMRES (Generalized Minimal Residual) method
10.3 GMRES in Matlab: The function gmres
10.4 Convergence properties of the GMRES method


11 Methods Based on Biorthogonalization. BiCG
11.1 Lanczos biorthogonalization
11.2 BiCG
11.3 A brief look at CGS and BiCGStab

12 Preconditioning
12.1 General remarks; preconditioned GMRES
12.2 PCG: Preconditioned CG
12.3 Preconditioning in Matlab
12.4 Convergence behavior of PCG
12.5 Preconditioning techniques in general
12.6 Numerical examples

13 Multigrid Methods (MG)
13.1 Motivation: 1D elliptic model problem
13.2 A more detailed analysis of the Jacobi method. Error smoothing
13.3 The two-grid scheme (TG) in abstract formulation
13.4 The two-grid method for the 1D model problem
13.5 Analysis of the two-grid method for elliptic problems
13.6 Multigrid (MG)
13.7 Nested Iteration and Full Multigrid (FMG)
13.8 Nonlinear problems

14 Substructuring Methods
14.1 Subspace corrections
14.2 Additive Schwarz methods (ASM)
14.3 Multiplicative Schwarz methods (MSM)
14.4 Introduction to domain decomposition techniques

References

Appendix: excerpt from 'Numerik von Differentialgleichungen' (two chapters: elliptic problems, FEM)


1 Introduction; useful material

Numerical models are growing in size and complexity as increasing computational power becomes available. Many of these models ultimately reduce to solving a set of linear equations of large dimension¹, e.g., > 10^7. In this course we present a few of the methods devised that can be efficiently applied to problems of this size.

For the purposes of this course we will restrict ourselves to matrix-vector systems over the reals. For extensions to the complex case, refer to the following material. In addition, we will essentially assume serial implementation of these algorithms; efficient parallelization of these methods requires further considerations.

First we recall some of the most important definitions and results from linear algebra. For further reading on these basic topics, the texts [2], [13], [19], and [20] can be recommended. The books [12] and [16] are more specific to the topic of this lecture.

We will mainly be concerned with real systems, i.e., real mappings (matrices) A ∈ R^{n×n}. The natural extension to a mapping from C^n to C^n is considered whenever necessary (e.g., concerning complex spectra). Further material from linear algebra will be provided in later chapters, whenever necessary.

1.1 Vector norms and inner products

1. The Euclidean inner product of (column) vectors x, y ∈ R^n is denoted in two alternative ways:

(x, y) ≡ x^T y

The norm induced by the Euclidean inner product is the l_2-norm

∥x∥_2^2 = (x, x),

which belongs to the family of l_p norms defined via

∥x∥_p = ( Σ_{i=1}^n |x_i|^p )^{1/p}   for p ∈ [1,∞),    ∥x∥_∞ = max_{i=1...n} |x_i|   for p = ∞.

2. The more general M-inner product

(x, y)_M = x^T M y

will also play a prominent role. Here, M is a symmetric positive definite (SPD) matrix.² We note that, by symmetry of M,

(x, y)_M = (x, M y) = (M x, y) = (y, x)_M,

and (·,·)_M is definite, i.e., (x, x)_M > 0 for x ≠ 0. The M-inner product induces the M-norm

∥x∥_M^2 = (x, x)_M.

Since in many applications the quantity (1/2) x^T M x represents an energy, the M-norm ∥·∥_M is often called the energy norm.

¹ Facetious definition of 'large': the problem, by applying most tricks, barely fits into memory.
² M ∈ R^{n×n} is symmetric if M^T = M; it is positive semi-definite if x^T M x ≥ 0 for all x ∈ R^n; M is positive definite if x^T M x > 0 for all x ≠ 0.

Iterative Solution of Large Linear Systems Ed. 2011

Page 6: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

2 1 INTRODUCTION; USEFUL MATERIAL

3. Norms on a vector space V over R satisfy by definition, for all x, y ∈ V:

(a) ∥α x∥ = |α| ∥x∥ for all α ∈ R
(b) ∥x∥ ≥ 0, and ∥x∥ = 0 ⇔ x = 0
(c) ∥x + y∥ ≤ ∥x∥ + ∥y∥ (triangle inequality)

We recall that on a finite-dimensional vector space V any two norms ∥·∥_A, ∥·∥_B are equivalent in the sense that there exist constants c, C > 0 such that

c ∥x∥_A ≤ ∥x∥_B ≤ C ∥x∥_A   for all x ∈ V.

1.2 Types of matrices and matrix decompositions

1. Hessenberg matrices: A matrix A satisfying A_{i,j} = 0 for i > j+1 is called an upper Hessenberg matrix; if A_{i,j} = 0 for j > i+1 it is called a lower Hessenberg matrix. An upper Hessenberg matrix thus has non-zero entries only in the upper triangle and in the first subdiagonal. A symmetric Hessenberg matrix is tridiagonal.

2. Orthogonal matrices: A matrix A is orthogonal³ if A A^T = A^T A = I. This means that the columns of A are orthonormal to each other (the same is true for the rows of A). Note that this implies A^T = A^{-1}. An important further property of an orthogonal matrix A is that ∥Ax∥_2 = ∥x∥_2 for all x ∈ R^n.

The complex analog is a unitary matrix A ∈ C^{n×n}, characterized by A A^H = A^H A = I.

3. QR factorization: Let m ≥ n. Every matrix A ∈ R^{m×n} can be written as A = QR, where Q ∈ R^{m×n} has orthonormal columns and R ∈ R^{n×n} is upper triangular. If m = n and A is nonsingular, then Q is orthogonal.

4. Schur form: Every A ∈ C^{n×n} can be written as A = Q R Q^H, where Q is unitary and R is upper triangular. Furthermore, the eigenvalues of A appear on the diagonal of R.

5. Orthonormal diagonalization of symmetric matrices: If A ∈ R^{n×n} is symmetric, then there exist an orthogonal matrix Q (whose columns are eigenvectors of A) and a diagonal matrix D (whose entries are the eigenvalues of A) such that A = Q D Q^T.

6. Unitary diagonalization of normal matrices: If A ∈ R^{n×n} is normal, i.e., A^T A = A A^T, then there exist a unitary matrix Q ∈ C^{n×n} (whose columns are eigenvectors of A) and a diagonal matrix D ∈ C^{n×n} (whose entries are the eigenvalues of A) such that A = Q D Q^H.

Note that 4.–6. are similarity transformations which cannot be computed in a finite number of rational operations, since knowledge of the spectrum (the eigenvalues) of A is required. For symmetric matrices, the matrix R from 4. is exactly D from 6. We also note that, as is easy to see, the decompositions 5., 6. are 'characteristic properties', i.e., the 'If' can be replaced by 'Iff'.

3 A more appropriate terminology would be ‘orthonormal’.


1.3 Eigenvalues and matrix norms

1. Let A ∈ R^{n×n}. A scalar λ ∈ C is called an eigenvalue of A if there exists an eigenvector v ∈ C^n \ {0} such that A v = λ v. The set

σ(A) = { λ ∈ C : λ is an eigenvalue of A }

is called the spectrum of A. If A is symmetric, then the eigenvalues of A are real, and the extremal eigenvalues satisfy

λ_min = λ_n = min_{x≠0} (Ax, x)/(x, x),    λ_max = λ_1 = max_{x≠0} (Ax, x)/(x, x).

2. Each norm on R^n induces a matrix norm ∥·∥ : R^{n×n} → R on the vector space of n×n matrices via

∥A∥ = max_{x≠0} ∥Ax∥ / ∥x∥.

Matrix norms are norms on the vector space R^{n×n}. The induced matrix norms are submultiplicative,

∥AB∥ ≤ ∥A∥ ∥B∥   for all A, B ∈ R^{n×n}.

The case of the ∥·∥_2-norm is particularly important. We have

∥A^T∥_2 = ∥A∥_2    (1.1)

∥A∥_2^2 ≤ n · max_j Σ_{i=1}^n |A_{i,j}|^2   and   ∥A∥_2^2 = ∥A^T∥_2^2 ≤ n · max_i Σ_{j=1}^n |A_{i,j}|^2    (1.2)

3. The spectral radius ρ(A) of A ∈ R^{n×n} is defined by

ρ(A) = max{ |λ| : λ ∈ σ(A) }.

If A is symmetric (or, more generally, normal), then

∥A∥_2 = ρ(A).

4. The quantity

κ_p(A) = ∥A∥_p ∥A^{-1}∥_p   if A is nonsingular,    κ_p(A) = ∞   if A is singular,

is called the l_p-condition number of the matrix A. In particular, for SPD matrices we have

κ_2(A) = λ_max / λ_min.

The spectral radius ρ(A) of a matrix A will be important for the analysis of many iterative methods.

Theorem 1.1 Let A ∈ R^{n×n} or A ∈ C^{n×n}. Then:

(i) ρ(A^m) = ρ(A)^m for all m ∈ N

(ii) For any norm ∥·∥ on R^n (or on C^n if A ∈ C^{n×n}) we have ρ(A) ≤ ∥A∥

(iii) For every ε > 0 there exists a norm ∥·∥_ε on the vector space R^n (or C^n) such that

ρ(A) ≤ ∥A∥_ε ≤ ρ(A) + ε

(iv) For any norm ∥·∥ on R^n (or on C^n) we have ρ(A) = lim_{m→∞} ∥A^m∥^{1/m}


Proof: ad (i): From the Schur form of A we get A^m = Q R^m Q^H. Since R^m is again upper triangular with diagonal entries (R^m)_{ii} = (R_{ii})^m, we obtain the spectrum σ(A^m) = { λ^m : λ ∈ σ(A) }.

ad (ii): Let us first assume that A ∈ C^{n×n} and that ∥·∥ is a norm on C^n. Let λ ∈ σ(A) with |λ| = ρ(A). Then there exists an eigenvector x ∈ C^n such that A x = λ x. Thus |λ| = ∥λx∥/∥x∥ = ∥Ax∥/∥x∥ ≤ ∥A∥.

Let us now assume that A ∈ R^{n×n} and that ∥·∥ is a norm on R^n. On C^n we define the norm ∥·∥_* by ∥x + i y∥_*^2 = ∥x∥^2 + ∥y∥^2 for all x, y ∈ R^n. Note that ∥x∥ = ∥x∥_* for all x ∈ R^n. From the first step (the complex case applied to ∥·∥_*) we have

ρ(A) ≤ ∥A∥_*.

We conclude the argument by showing ∥A∥_* = ∥A∥. To see this, we note that A ∈ R^{n×n} implies

∥A∥ = sup_{x∈R^n} ∥Ax∥/∥x∥ = sup_{x∈R^n} ∥Ax∥_*/∥x∥_* ≤ sup_{x∈C^n} ∥Ax∥_*/∥x∥_* = ∥A∥_*.

On the other hand,

∥A∥_*^2 = sup_{x,y∈R^n} ∥A(x + i y)∥_*^2 / ∥x + i y∥_*^2 = sup_{x,y∈R^n} (∥Ax∥^2 + ∥Ay∥^2) / (∥x∥^2 + ∥y∥^2) ≤ sup_{x,y∈R^n} ∥A∥^2 (∥x∥^2 + ∥y∥^2) / (∥x∥^2 + ∥y∥^2) = ∥A∥^2.

ad (iii): We have just seen that for real matrices A the norms ∥A∥ and ∥A∥_* coincide. We may therefore assume without loss of generality that A ∈ C^{n×n} and ∥·∥ is a norm on C^n. We will also assume without loss of generality that A ≠ 0.

Let us motivate the construction of the norm ∥·∥_ε before proving (iii). One makes the ansatz ∥x∥_ε = ∥Z_ε x∥_2, where the invertible matrix Z_ε is constructed such that Z_ε A Z_ε^{-1} = D + R_ε, where D is a diagonal matrix containing the eigenvalues of A, and R_ε is small in the ∥·∥_2-norm. The reason for this is that a direct calculation then reveals ∥A∥_ε = ∥Z_ε A Z_ε^{-1}∥_2 = ∥D + R_ε∥_2 ≤ ∥D∥_2 + ∥R_ε∥_2. Since ∥D∥_2 = ρ(A), the requirement that ∥R_ε∥_2 be small then leads to the desired estimate.

Let A = Q R Q^H be a Schur decomposition of A, and define D = Diag(R) = Diag(λ_1, λ_2, ..., λ_n). Then

ρ(A) = ρ(D) = ∥D∥_2.

We define ξ = min{1, ε/(n ∥A∥_2)}. Defining X = Diag(1, ξ, ξ^2, ..., ξ^{n-1}) we compute

V = X^{-1} R X = D + R̃,   where   R̃_{i,j} = 0 for i ≥ j   and   R̃_{i,j} = ξ^{j-i} R_{i,j} for i < j.

We now define the norm ∥·∥_ε on C^n by ∥x∥_ε = ∥X^{-1} Q^H x∥_2. It is easy to see that this is a norm on C^n. Furthermore, we have

∥A∥_ε = ∥X^{-1} Q^H A Q X∥_2 = ∥X^{-1} R X∥_2 = ∥V∥_2 ≤ ∥D∥_2 + ∥R̃∥_2 = ρ(A) + ∥R̃∥_2.

It remains to show that ∥R̃∥_2 ≤ ε. To see this, apply (1.2) to get ∥R̃∥_2^2 ≤ ξ^2 n^2 max_{j>i} |R_{i,j}|^2 ≤ ξ^2 n^2 ∥R∥_2^2 = n^2 ξ^2 ∥A∥_2^2, which implies the desired bound. In these estimates we employed the simple bound |B_{i,j}| ≤ ∥B∥_2 for any index pair (i,j) and any matrix B, which follows from |B_{i,j}| = |e_i^H B e_j| ≤ ∥e_i∥_2 ∥B∥_2 ∥e_j∥_2 = ∥B∥_2, where e_i, e_j denote the i-th and j-th unit vectors.


ad (iv): First, we consider the case ρ(A) = 0. Then all eigenvalues of A vanish, i.e., the matrix R in the Schur form A = Q R Q^H is strictly upper triangular. An easy calculation shows R^m = 0 for all m ≥ n. Hence, A^m = Q R^m Q^H = 0 and thus ∥A^m∥^{1/m} = 0 = ρ(A) for all m ≥ n.

We now consider the case ρ(A) > 0. Define B = A/ρ(A). The assertion is equivalent to showing lim_{m→∞} ∥B^m∥^{1/m} = 1. To see this, we fix ε > 0 and let ∥·∥_ε be the norm constructed in part (iii) for B. Then 1 = ρ(B) = ρ(B^m)^{1/m} ≤ ∥B^m∥_ε^{1/m} ≤ ∥B∥_ε ≤ 1 + ε. Furthermore, since all norms are equivalent on the finite-dimensional space of n×n matrices, there exists a constant C_ε > 0 such that

C_ε^{-1} ∥E∥ ≤ ∥E∥_ε ≤ C_ε ∥E∥   for all E ∈ C^{n×n}.

This allows us to conclude (note: 1 = ρ(B) = (ρ(B^m))^{1/m} ≤ ∥B^m∥^{1/m}):

1 ≤ liminf_{m→∞} ∥B^m∥^{1/m} ≤ limsup_{m→∞} ∥B^m∥^{1/m} ≤ limsup_{m→∞} C_ε^{1/m} ∥B^m∥_ε^{1/m} ≤ limsup_{m→∞} C_ε^{1/m} (1 + ε) = 1 + ε.

Since ε > 0 is arbitrary, liminf and limsup both equal 1, and therefore lim_{m→∞} ∥B^m∥^{1/m} = 1.

Exercise 1.2 Let A ∈ R^{n×n} be normal. Then one can find a norm ∥·∥ on R^n such that ρ(A) = ∥A∥. Comment on the special cases of symmetric and skew-symmetric A.

Exercise 1.3 Show that for an orthogonal matrix Q ∈ R^{n×n} there holds ∥Qx∥_2 = ∥x∥_2 for all x ∈ R^n.

Exercise 1.4 Show that ∥A∥_2 = √(ρ(A^T A)).

Exercise 1.5 Show that ∥A∥_2^2 ≤ ∥A∥_1 ∥A∥_∞.

Exercise 1.6 Give an example of a matrix A ≠ 0 such that ρ(A) = 0.

1.4 Cayley-Hamilton Theorem

The Cayley-Hamilton theorem (see, e.g., [10, Thm. 2.8.4]) states that any square matrix A ∈ C^{n×n} satisfies its own characteristic equation, i.e., the characteristic polynomial χ : z ↦ det(zI − A), given by⁴

χ(z) = (z − λ_1) ··· (z − λ_n) = z^n + c_1 z^{n−1} + ··· + c_{n−1} z + c_n,    (1.3)

satisfies

χ(A) = A^n + c_1 A^{n−1} + ··· + c_{n−1} A + c_n I = 0.    (1.4)

A direct consequence of the Cayley-Hamilton theorem is that – provided A ∈ R^{n×n} is nonsingular – the inverse A^{-1} can be expressed as follows (after multiplying (1.4) by A^{-1}):

A^{-1} = − (1/c_n) A^{n−1} − (c_1/c_n) A^{n−2} − ··· − (c_{n−1}/c_n) I.

Note that c_n = (−1)^n det(A) ≠ 0 by assumption. This representation of A^{-1} in terms of a matrix polynomial of degree n − 1 (with coefficients depending on the spectrum of A) may be viewed as a motivation for the class of Krylov subspace methods which we will introduce later on.

⁴ The λ_i are the eigenvalues of A; each eigenvalue with algebraic multiplicity k occurs k times in (1.3).


Remark 1.7 For diagonalizable matrices A, the Cayley-Hamilton Theorem is easy to prove: A = X D X^{-1}, where D is a diagonal matrix with diagonal entries D_{ii} = λ_i. For any polynomial π we have π(A) = X π(D) X^{-1}, and a calculation reveals π(D) = Diag(π(λ_1), π(λ_2), ..., π(λ_n)). Applying this to π = χ gives χ(D) = 0 and hence the assertion of the Cayley-Hamilton Theorem.

In the general case, one transforms A to Jordan normal form: A = X J X^{-1}, where the matrix J is block diagonal and the diagonal blocks J_i ∈ C^{n_i×n_i}, i = 1...m, are upper triangular matrices with the eigenvalue λ_i on their diagonal. The size n_i is less than or equal to the multiplicity of the zero λ_i of the characteristic polynomial χ. Next, one observes that χ(A) = X χ(J) X^{-1} and that χ(J) is again block diagonal with diagonal blocks χ(J_i). Since for each i we can write χ(z) = π_i(z)(z − λ_i)^{n_i} for some polynomial π_i, it is now easy to see that for each diagonal block we have χ(J_i) = 0. Thus χ(J) = 0.
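The polynomial representation of A^{-1} above can be checked numerically; a minimal Matlab sketch, using poly (which returns the coefficients 1, c_1, ..., c_n of the characteristic polynomial) for a small, arbitrarily chosen nonsingular test matrix:

n = 5;
A = randn(n) + n*eye(n);           % some test matrix, nonsingular with high probability
c = poly(A);                       % c = [1, c_1, ..., c_n], cf. (1.3)
Ainv = zeros(n);
for k = 0:n-1
    % accumulate -(1/c_n) A^(n-1) - (c_1/c_n) A^(n-2) - ... - (c_(n-1)/c_n) I
    Ainv = Ainv - (c(k+1)/c(n+1)) * A^(n-1-k);
end
disp(norm(Ainv - inv(A)))          % small, up to rounding errors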

2 Sparse Storage

In the introduction we have pointed out that we are targeting very large problems. One consequence of considering systems of equations of this type is that the matrix of coefficients has a huge number of elements. Fortunately, the matrices arising in most numerical simulations, for example those based on Finite Difference (FD), Finite Element (FEM), or Finite Volume (FV) methods, are sparse, i.e., only a few matrix entries are non-zero. (In practice, a matrix A ∈ R^{n×n} is called sparse if the number of non-zero entries is O(n).)

The aim of sparse storage is to store only the non-zero elements of these matrices, but to do this in a manner that still enables efficient computations to be performed, especially matrix-vector products. Sparse storage promises large memory savings, as many matrices arising from FD, FEM and FV models are sparse due to the nature of the local approximations employed.

The following two examples show sparse matrices arising from FD or FEM methods.

Example 2.1 Let Ω = (0,1), h = 1/(n+1), and x_i = i h, i = 0...n+1. The FD discretization of the two-point boundary value problem (BVP)

−u″(x) = f(x) on Ω,   u(0) = u(1) = 0

is (u_i ≈ u(x_i), f_i = f(x_i))

−u_{i−1} + 2u_i − u_{i+1} = h^2 f_i,   i = 1...n,    (2.1)

where we set u_0 = u_{n+1} = 0. This is a system of linear equations of the form

A u = b,   A ∈ R^{n×n},   b ∈ R^n,

where the matrix A is tridiagonal and SPD with λ_max = 4 sin^2(nπ/(2(n+1))) = O(1) and λ_min = 4 sin^2(π/(2(n+1))) = O(h^2), such that the condition number is κ_2 = O(h^{-2}).

For this problem, all eigenvalues and eigenvectors (rather: 'discrete eigenfunctions') are explicitly known: For i = 1...n, the vectors w_i = ((w_i)_1, ..., (w_i)_n)^T with entries

(w_i)_j = sin(j i π/(n+1)),   j = 1...n,

are the eigenvectors associated with the eigenvalues λ_i = 4 sin^2(iπ/(2(n+1))).


Figure 2.1: Sparsity pattern of the stiffness matrix A of the 2D Poisson problem (n = 8, nz = 288)

Small eigenvalues are associated with slowly varying eigenvectors; larger eigenvalues are associated with increasingly oscillating eigenvectors. Clearly, the eigenvectors w_i are mutually orthogonal since A is symmetric.
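The matrix of Example 2.1 and its extreme eigenvalues are easily checked in Matlab; a minimal sketch (for an arbitrarily chosen n) using spdiags:

n = 50; h = 1/(n+1);
e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n);     % tridiag(-1, 2, -1), stored sparsely
ev = eig(full(A));                        % acceptable for this small test size
fprintf('lambda_min = %g (formula: %g)\n', min(ev), 4*sin(pi/(2*(n+1)))^2)
fprintf('lambda_max = %g (formula: %g)\n', max(ev), 4*sin(n*pi/(2*(n+1)))^2)
fprintf('kappa_2    = %g = O(h^-2)\n', max(ev)/min(ev))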

Example 2.2 Let Ω = (0,1)^2, h = 1/(n+1). Define the nodes P_{ij} = (x_i, y_j) = (ih, jh), i, j = 0...n+1. Consider the BVP (Dirichlet problem) for the 2D Poisson equation

−Δu(x,y) = f(x,y) on Ω,   u|_{∂Ω} = 0.    (2.2)

The FD approximations u_{ij} to the values u(x_i, y_j), based on the simplest 5-point discretization of the Laplacian ('5-point stencil'), are the solutions of

−u_{i−1,j} − u_{i+1,j} − u_{i,j−1} − u_{i,j+1} + 4u_{ij} = h^2 f_{ij},   i, j = 1...n,    (2.3)

with f_{ij} = f(x_i, y_j). The boundary conditions are enforced by setting u_{0,j} = u_{n+1,j} = 0 for j = 0...n+1 and u_{i,0} = u_{i,n+1} = 0 for i = 0...n+1. The system of equations (2.3) can be written in matrix form A u = b as follows: a standard numbering of the unknowns u_{ij}, i, j = 1...n, is the 'lexicographic' ordering, i.e., we set u_{(i−1)n+j} = u_{ij}. Then the matrix A ∈ R^{N×N}, with N = n^2, has at most 5 non-zero entries per row. An example of the sparsity pattern of A (for the case n = 8) can be seen in Fig. 2.1. The matrix A is again SPD, and the eigenvalues satisfy λ_min = 8 sin^2(πh/2) = O(h^2), λ_max = 8 cos^2(πh/2) = O(1), so that κ_2 = O(h^{-2}).

In order to describe the (mutually orthogonal) eigenvectors of A, it is convenient to use 'double index' notation, i.e., the matrix A has entries A_{ii′,jj′}, where i, i′, j, j′ ∈ {1,...,n}. The eigenvectors are then given by w_{ij} (i, j ∈ {1,...,n}), with entries

(w_{ij})_{i′j′} = sin(i i′π/(n+1)) sin(j j′π/(n+1)),   i′, j′ ∈ {1,...,n}.    (2.4)

The corresponding eigenvalues are λ_{ij} = 4 ( sin^2(iπ/(2(n+1))) + sin^2(jπ/(2(n+1))) ).
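The matrix of Example 2.2 can be assembled from the 1D matrix of Example 2.1 by Kronecker products; a minimal Matlab sketch using the same lexicographic ordering:

n = 8; h = 1/(n+1);
e = ones(n,1);
T = spdiags([-e 2*e -e], -1:1, n, n);     % 1D matrix of Example 2.1
I = speye(n);
A = kron(I,T) + kron(T,I);                % 5-point stencil, lexicographic ordering
spy(A)                                    % reproduces the pattern of Fig. 2.1
ev = eig(full(A));                        % compare with the eigenvalue formulas above
fprintf('[%g, %g] vs. [%g, %g]\n', min(ev), max(ev), 8*sin(pi*h/2)^2, 8*cos(pi*h/2)^2)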

A few of the more straightforward methods for storing more general sparse matrices are considered in the following.


2.1 Coordinate format (COO)

This consists of three arrays: 5

AA – an array containing the nonzero elements of A ∈ Rn×n

IR – an integer array containing their row indices

IC – an integer array containing their column indices

Example 2.3

A = [  2  −1   0   0   9
       0   0   0   0   0
       0  −2   4  −3   0
       0   0  −3   6  −2
       0   0   0  −2   4 ]

AA = [ 2 −1 9 −2 4 −3 −3 6 −2 −2 4 ]

IR = [ 1 1 1 3 3 3 4 4 4 5 5 ]

IC = [ 1 2 5 2 3 4 3 4 5 4 5 ]

Here, the entries are stored in row-wise order ('COOR'). We note that the array IR contains quite a lot of repeated entries; the format contains redundant information. A more economical way is the Compressed Sparse Row (CSR) or the Compressed Sparse Column (CSC) format discussed in the following.
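Before moving on, we note that Matlab's sparse function accepts exactly the three COO arrays; a minimal sketch reconstructing the matrix of Example 2.3:

AA = [ 2 -1  9 -2  4 -3 -3  6 -2 -2  4 ];
IR = [ 1  1  1  3  3  3  4  4  4  5  5 ];
IC = [ 1  2  5  2  3  4  3  4  5  4  5 ];
A  = sparse(IR, IC, AA, 5, 5);    % entries with repeated (row,column) pairs would be summed
full(A)                           % displays the 5x5 matrix of Example 2.3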

2.2 Compressed Sparse Row [Column] formats (CSR, CSC)

CSR consists of three arrays:

AA – an array containing the nonzero elements of A ∈ Rn×n

IC – an integer array containing the column indices of the non-zero entries of A

PR – an array of n+1 integer pointers:

— PR(i) is the position in AA and IC where the i-th row of A starts;
  the non-zero entries of the i-th row are AA([PR(i):PR(i+1)-1])
  and their column indices are IC([PR(i):PR(i+1)-1])

— if the i-th row of A contains only zeros, then PR(i) = PR(i+1)

— PR(n+1) = nnz+1, where nnz is the total number of non-zero entries of A

For A from the above example, we have

AA = [ 2 −1 9 −2 4 −3 −3 6 −2 −2 4 ]

PR = [ 1 4 4 7 10 12 ]

IC = [ 1 2 5 2 3 4 3 4 5 4 5 ]

This is useful as the matrix-vector product y = Ax is simple to express (n is the number of rows of A):

for i = 1:n
  k1 = PR(i)
  k2 = PR(i+1)-1
  y(i) = AA(k1:k2)*x(IC(k1:k2))'  % here, AA and x are row vectors;
                                  % IC(k1:k2) is an index vector (MATLAB syntax)
end

5 For rectangular matrices A ∈ Rm×n , the formats are defined in a completely analogous way.


In the example, for i = 1, we have k1 = 1, k2 = 3, AA(k1:k2) = (2, −1, 9), and IC(k1:k2) = (1, 2, 5), so that the first component y_1 of y = Ax is computed as

AA(k1:k2) * x(IC(k1:k2))' = (2, −1, 9) · (x_1, x_2, x_5)^T = 2x_1 − x_2 + 9x_5,

as required.
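Wrapped into a function (with the arrays AA, IC, PR as above), the CSR matrix-vector product reads, as a minimal sketch:

function y = csr_matvec(AA, IC, PR, x)
% y = A*x for A stored in the CSR format (arrays AA, IC, PR)
n = length(PR) - 1;                 % number of rows of A
y = zeros(n,1);
for i = 1:n
    k1 = PR(i);
    k2 = PR(i+1) - 1;               % k1 > k2 signals an all-zero row
    a  = AA(k1:k2);                 % non-zero entries of row i
    xi = x(IC(k1:k2));              % matching entries of x
    y(i) = a(:)' * xi(:);           % inner product, robust to row/column storage
end
end

With the arrays of the example and x = (1:5)', the call y = csr_matvec(AA, IC, PR, x) gives the same result as multiplying with the full matrix.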

The storage saving achieved with the CSR format increases with the size of the matrix. For a matrix of dimension n×n with p or fewer elements per row we are required to store at most p·n real values and (p+1)·n integers, i.e., a linear function of n. Therefore, for a 1,000×1,000 tridiagonal matrix the CSR format requires fewer than 7,000 stored numbers, while the full matrix has 1,000,000 elements.

Remark 2.4 Most iterative solution methods are based on the user's providing the matrix-vector multiplication x ↦ Ax by means of an appropriate procedure – the matrix A need not even be explicitly available for the algorithm to operate. The implication for the storage format is that the multiplication x ↦ Ax has to be performed efficiently. Typically, one expects the storage format to be such that the cost of x ↦ Ax is proportional to the number of non-zero entries of A. This is the case for CSR.

In special cases (e.g., constant coefficient problems), no storage of matrix entries at all is required for the evaluation of x ↦ Ax. Rather, this is realized by a simple loop incorporating the constant coefficients, as for instance for the FD discretization of the 1D and 2D Poisson equation with constant meshwidth h.
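For the 1D model matrix tridiag(−1, 2, −1) of Example 2.1, such a matrix-free realization of x ↦ Ax can be sketched as follows (no matrix entries are stored, only the constant stencil is applied):

function y = apply_poisson1d(x)
% matrix-free x -> A*x for A = tridiag(-1, 2, -1)
n = length(x);
y = 2*x;
y(1:n-1) = y(1:n-1) - x(2:n);       % superdiagonal contribution
y(2:n)   = y(2:n)   - x(1:n-1);     % subdiagonal contribution
end

Such a routine can be passed to Matlab's iterative solvers as a function handle in place of the matrix, e.g., pcg(@apply_poisson1d, b).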

Exercise 2.5 Design, analogously to the CSR format, the CSC ('Compressed Sparse Column') format. Formulate an algorithm that realizes the matrix-vector multiplication x ↦ Ax, where A is stored in the CSC format.

A wide variety of other sparse formats exist, often motivated by the particular structure of the problem under consideration. A classical example is the class of banded matrices.

Example 2.6 A matrix A ∈ R^{n×n} is said to be banded with upper bandwidth b_u and lower bandwidth b_l if A_{i,j} = 0 for j > i + b_u and for j < i − b_l. If b = b_u = b_l, then b is called the bandwidth of A. The storage requirement is only n(b_u + b_l + 1) real numbers. Typically, banded matrices are stored by (off-)diagonals (cf. the diag command in Matlab).
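In Matlab, storage by diagonals is conveniently handled with spdiags; a minimal sketch for a small constant-coefficient banded matrix with b_l = 2 and b_u = 1 (the entry values are arbitrary):

n = 6;
B = repmat([-1 -2 5 -3], n, 1);      % n x (bl+bu+1) array holding the diagonals
A = spdiags(B, [-2 -1 0 1], n, n);   % place the columns of B on diagonals -2,...,1
full(A)                              % only n*(bl+bu+1) numbers are stored in B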

Exercise 2.7 Design a data structure for data-sparse storage of a symmetric tridiagonal matrix and realize the matrix-vector multiplication.

Matlab includes many functions for use with sparse matrices. 6 The internal storage format is a compressed sparse column (CSC) format. For example, to preallocate storage for a sparse matrix A of dimension m×n with k non-zero elements, the function A = spalloc(m,n,k) is called. The non-zero elements of A can then be entered by indexing the elements of A as usual. The sparsity pattern of the matrix A can be viewed using the function spy(A). In addition, most Matlab functions can be used with sparse matrices, including the standard addition and multiplication operations. Sparse vectors are also available but less relevant in practice.

Example 2.8

A = gallery(’poisson’,8);

spy(A)

6 see help sparse


This retrieves the 64×64 matrix A obtained by discretizing the Poisson equation with the 5-point stencil (see Example 2.2) from the gallery of test matrices, and then plots the sparsity pattern shown in Fig. 2.1.

Unless stated otherwise, the algorithms discussed from now on will be assumed to use some appropriate sparse storage technique. 7

3 Direct Solution Methods

For 'small' problems, direct methods (i.e., variants of Gaussian elimination) are the method of choice. While there is no general rule as to what 'small' means, sparse matrices arising from FD or FEM discretizations with up to 100,000 unknowns are often treated by direct methods. Especially for 2D problems, direct methods are quite popular. We refer to [8, 5] for a good discussion of such methods. Currently popular direct solvers include UMFPACK, SuperLU, and Pardiso. These solvers are suitable for nonsymmetric problems and support parallel computer architectures.

Generally speaking, whereas Gaussian elimination of a (full) n×n matrix requires O(n^2) storage and O(n^3) operations, typical sparse matrices can be factored with work O(n^α) for some 1 < α < 3 (for 2D FEM applications, we expect α = 3/2). Since α = 1 is not really achievable, iterative methods have to be employed for very large problems.

The main issues in direct solvers are a) pivoting (to ensure numerical stability) and b) reordering strategies for the unknowns so as to keep fill-in small. These two requirements are usually incompatible, and a compromise has to be made, e.g., by allowing non-optimal pivot elements to some extent. Here we will consider the important special case of sparse SPD matrices, since these can be factored without pivoting in a numerically stable way via the Cholesky algorithm. 8 For SPD matrices, one may therefore concentrate on reordering strategies to minimize fill-in.

3.1 Fill-in

For an SPD matrix A ∈ R^{n×n} its envelope (or: profile) is defined as the set of double indices

Env(A) = { (i, j) : J_i(A) ≤ j < i },   where   J_i(A) = min{ j : A_{i,j} ≠ 0 }.

The envelope contains the indices of all non-zero entries of the (strictly) lower part of A. A key feature for sparse direct solvers is the observation that the Cholesky factor, too, has non-zero entries only within the envelope:

Theorem 3.1 Let A ∈ R^{n×n} be SPD, and let L ∈ R^{n×n} be its lower triangular Cholesky factor, i.e., L L^T = A. Then: L_{i,j} = 0 for j < i and (i, j) ∉ Env(A).

Proof: The proof (an exercise) follows from inspection of the way the Cholesky factorization is computed. An analogous result holds for LU-decompositions of general matrices (provided the LU-decomposition can be performed without pivoting).

7 For an overview of sparse matrix formats, see http://en.wikipedia.org/wiki/Sparse_matrix.
8 Remark: The Cholesky decomposition A = L L^T is based on a modified elimination procedure, i.e., A = L U with additional scaling of columns such that U = L^T and L_{i,i} > 0. For arbitrary symmetric matrices, a symmetric version of A = L U reads A = L D L^T with L_{i,i} = 1 and D_{i,i} not positive in general. For the case where A is SPD we have D > 0, and rewriting this as A = (L D^{1/2})(L D^{1/2})^T reproduces the Cholesky decomposition.


Figure 3.2: Top: arrowhead matrix A ∈ R^{7×7} and its Cholesky factor L = chol(A)'. Bottom: effect of reversing the numbering.

Remark 3.2 An inspection of the algorithm to compute the Cholesky factorization reveals the cost: Computing L requires (1/2) Σ_{i=1}^n ω_i(A)(ω_i(A) + 3) flops, where the i-th frontwidth ω_i(A) is the number of rows of the envelope Env(A) that intersect column i: ω_i(A) = #{ j > i : (j, i) ∈ Env(A) }.

Example 3.3 Banded matrices are a special case: Here, J_i(A) = max{1, i − b} for all i, where b is the bandwidth. The storage requirement is then O(n b). The factorization of A is done with work O(n b^2).

Theorem 3.1 shows that the sparsity pattern of a matrix A is roughly inherited by the Cholesky factor. More precisely: the envelope is inherited. Indices (i, j) ∈ Env(A) that satisfy L_{i,j} ≠ 0 but A_{i,j} = 0 are called fill-in. Since, generally speaking, the majority of the (i, j) ∈ Env(A) will be filled in during the factorization, many efficient sparse direct solvers aim at finding a reordering of the unknowns such that the envelope is small. In other words: they are based on finding a permutation matrix P such that |Env(P^T A P)| is small. The Reverse Cuthill-McKee (RCM) ordering (see Section 3.2) is a classical example.

Example 3.4 (fill-in) The 7×7 'arrowhead matrix' A shown in Fig. 3.2 has a 'good' ordering: since all entries of A within the envelope are non-zero, there is no fill-in. Reversing the order of equations and unknowns, which leads to the matrix P^T A P for the corresponding permutation matrix P, is a disaster: the envelope is the full lower part, and inspection of the Cholesky factor (Fig. 3.2, bottom) shows that complete fill-in has taken place.

A closer look at fill-in for SPD matrices.

Fill-in takes place at (i, j) if A_{i,j} = 0 and L_{i,j} ≠ 0. We have already seen that fill-in can only occur within the envelope of A. We now determine the fill-in more precisely by means of an inductive procedure. Let A ∈ R^{n×n} be an SPD matrix. Elementary calculations then show (with a_{11} ∈ R, a = A([2:n],1) ∈ R^{n−1}, Â = A([2:n],[2:n]) ∈ R^{(n−1)×(n−1)}):

A = A^(1) = [ a_{11}, a^T ; a, Â ] = L̃_1 · [ 1, 0 ; 0, A^(2) ] · L̃_1^T,    (3.1)

where

L̃_1 = [ √a_{11}, 0 ; a/√a_{11}, I_{n−1} ]   and   A^(2) = Â − a a^T / a_{11}.


In this way we have eliminated the first row and column, and the rest of the job consists in continuing this procedure by factoring the matrix⁹ A^(2) = Â − (1/a_{11}) a a^T ∈ R^{(n−1)×(n−1)} in an analogous way. This gives

A^(2) = L̃_2 · [ 1, 0 ; 0, A^(3) ] · L̃_2^T,

where A^(3) ∈ R^{(n−2)×(n−2)} is again SPD, and L̃_2 ∈ R^{(n−1)×(n−1)} has a structure similar to that of L̃_1. Thus,

A = A^(1) = L_1 L_2 · [ 1, 0, 0 ; 0, 1, 0 ; 0, 0, A^(3) ] · L_2^T L_1^T,   with   L_1 := L̃_1,   L_2 := [ 1, 0 ; 0, L̃_2 ].

Proceeding in this way, we obtain a factorization

A = L_1 L_2 ··· L_{n−1} · I · L_{n−1}^T ··· L_2^T L_1^T =: L L^T,   with   L = L_1 L_2 ··· L_{n−1}.

Remark 3.5 The factors L_k are lower triangular matrices with a special structure: (L_k)_{ii} = 1 for i ≠ k, and the only non-trivial column of L_k is column k. Clearly, L = L_1 ··· L_{n−1} is again lower triangular (lower triangular matrices form a subring of the ring of matrices); moreover, we have L(:,k) = L_k(:,k) (see Exercise 3.6). This is analogous to the case of the LU-decomposition, where the lower factor L is the recombination of elementary elimination matrices L_k with one non-trivial column k, see [2].

Exercise 3.6 Show by an induction argument that the columns of the Cholesky factor L below the diagonal are precisely the corresponding columns of the matrices L_k, or, equivalently, L(:,k) = L_k(:,k).

Constructing the Cholesky factorization in this way is known as the 'outer product method', since the updates for the A^(k) are expressed by outer (dyadic) vector products; see (3.1). This is formalized in Alg. 3.1, representing the explicit algorithmic outcome of the inductive process indicated above.

Algorithm 3.1 Cholesky factorization – outer product variant

% returns the Cholesky factor L ∈ R^{n×n} of an SPD matrix A = A^(1) ∈ R^{n×n}
% note: unusual choice of indices: the matrices A^(k) ∈ R^{(n−k+1)×(n−k+1)} are of the form (A^(k)_{i,j})_{i,j=k...n}
1: A^(1) = A, L = 0 ∈ R^{n×n}
2: for k = 1 ... n−1 do
3:   L_{k,k} = √(A^(k)_{k,k})
4:   L([k+1:n], k) = (1/L_{k,k}) · A^(k)([k+1:n], k)   % column vector
5:   A^(k+1)([k+1:n], [k+1:n]) = A^(k)([k+1:n], [k+1:n]) − L([k+1:n], k) · (L([k+1:n], k))^T   % outer product
6: end for
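A direct Matlab transcription of Alg. 3.1 may be sketched as follows (dense storage, no exploitation of sparsity; the loop runs up to k = n so that the last diagonal entry L_{n,n} is set as well):

function L = chol_outer(A)
% outer product Cholesky factorization of a dense SPD matrix A (cf. Alg. 3.1)
n = size(A,1);
L = zeros(n);
for k = 1:n
    L(k,k) = sqrt(A(k,k));                        % line 3 of Alg. 3.1
    L(k+1:n,k) = A(k+1:n,k) / L(k,k);             % line 4: scale the pivot column
    A(k+1:n,k+1:n) = A(k+1:n,k+1:n) ...           % line 5: rank-one (outer product)
                     - L(k+1:n,k) * L(k+1:n,k)';  %         update of the trailing block
end
end

For an SPD test matrix A, the result L = chol_outer(A) satisfies norm(L*L' - A) ≈ 0 and agrees with chol(A)' up to rounding errors.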

We see how fill-in arises: For example, the first column of the Cholesky factor L, with L([2:n],1) = (1/√a_{11}) · A([2:n],1), has non-zero entries only where the first column of A^(1) = A has non-zero entries, see (3.1). The second column of L has non-zero entries where the first column of the matrix A^(2) = Â − (1/a_{11}) a a^T has non-zero entries, and so on. In general, we expect L_{i,k} ≠ 0 if A^(k)_{i,k} ≠ 0. From the update formula for the matrices A^(k) (cf. lines 3–5 of Alg. 3.1), we have

A^(k+1)_{i,j} = A^(k)_{i,j} − (1/A^(k)_{k,k}) · A^(k)_{i,k} A^(k)_{k,j},   i, j ∈ {k+1, ..., n}.

9 The matrix A^(2) is again SPD since L̃_1^{-1} A L̃_1^{-T} is SPD, see (3.1).


Hence, for i, j ≥ k+1, we have A^(k+1)_{i,j} ≠ 0 if and only if¹⁰

A^(k)_{i,j} ≠ 0   or   A^(k)_{i,k} A^(k)_{k,j} ≠ 0.

Another way of putting it is: A^(k+1)_{i,j} ≠ 0 if either A^(k)_{i,j} ≠ 0, or if in A^(k) the indices i, j are connected to each other via the index k, i.e., A^(k)_{i,k} ≠ 0 together with A^(k)_{k,j} ≠ 0. Based on this observation, one can precisely determine the fill-in for the Cholesky factorization.

Fill-in from a graph theoretical view point.

An elegant way to study fill-in is with the aid of graphs. A graph G = (V, E) consists of a set V of nodes and a set of edges E ⊂ V×V. Edges are denoted as pairs (v, v′) ∈ V×V with two distinct elements.

The sparsity pattern of a general matrix A can be represented by a graph G = (V, E) with nodes V and edges E, its so-called adjacency graph. Here, the set V of nodes is simply the set of unknowns x_i, i = 1...n (or the corresponding indices i), and two distinct nodes x_i ≠ x_j are connected by an edge (x_i, x_j) ∈ E iff A_{i,j} ≠ 0, i.e., if equation i involves the unknown x_j. In the general setting of nonsymmetric matrices this gives a directed graph, where the vertices x_i and x_j are connected by a directed edge (x_i, x_j) if A_{i,j} ≠ 0. A directed edge (x_i, x_j) is visualized by an arrow pointing from node x_i to node x_j.

For a symmetric matrix A we have A_{i,j} = A_{j,i}, thus E is symmetric: (x_i, x_j) ∈ E iff (x_j, x_i) ∈ E, and this pair is represented by the undirected edge {x_i, x_j}. In other words: we can use undirected graphs. An (undirected) edge {x_i, x_j} is visualized by a line joining the nodes x_i and x_j.

The degree deg_G(v) of a node v ∈ V is the number of edges emanating from v.

Remark 3.7 We point out that our requirement that an edge consist of two distinct elements implies that the graph has no 'loops' connecting a node with itself. In other words: Information about the diagonal entries is not contained in the graph, because it is of no interest for the study of fill-in.

In the outer product variant of the Cholesky factorization, we denote by G^(k) = (V^(k), E^(k)) the adjacency graph of the matrix A^(k). The above discussion shows that the graph G^(k+1) for the matrix A^(k+1) is obtained from the graph G^(k) by

removing the 'pivot node' x_k, removing the edges emanating from x_k, and adding those edges {x_i, x_j} that connect nodes x_i, x_j with the property that {x_i, x_k} ∈ E^(k) together with {x_k, x_j} ∈ E^(k).

We formalize this in the graph transformation Algorithm 3.2. We will call the sequence (G^(k))_{k=1}^n the elimination history.

Algorithm 3.2 Elimination pattern via graph transformation

% input: graph G = (V, E)
% output: graph G′ = (V′, E′) obtained by eliminating node v ∈ V
1: V′ = V \ {v}
2: E′ = { {v_1, v_2} : v_1, v_2 ∈ V′, {v_1, v_2} ∈ E or ({v_1, v} ∈ E and {v, v_2} ∈ E) }

10 In a strict sense, this is not an 'if and only if' situation since cancellation can take place, i.e., A^(k)_{i,j} = (1/A^(k)_{k,k}) A^(k)_{i,k} A^(k)_{k,j}. This (unlikely) cancellation will be ignored.


Figure 3.3: Adjacency graphs G^(1) and G^(2) from Example 3.8

Example 3.8 (See Fig. 3.3.) Let

A = A^(1) = [ x 0 x x x 0
              0 x x 0 0 x
              x x x 0 0 0
              x 0 0 x 0 0
              x 0 0 0 x 0
              0 x 0 0 0 x ]

with G^(1) = (V^(1), E^(1)), V^(1) = {x_1, x_2, x_3, x_4, x_5, x_6}, E^(1) = { {x_1,x_3}, {x_1,x_4}, {x_1,x_5}, {x_2,x_3}, {x_2,x_6} }. Elimination of x_1 results in the 5×5 matrix

A^(2) = [ x x 0 0 x
          x x x x 0
          0 x x x 0
          0 x x x 0
          x 0 0 0 x ]

with G^(2) = (V^(2), E^(2)), V^(2) = {x_2, x_3, x_4, x_5, x_6}, E^(2) = { {x_2,x_3}, {x_2,x_6}, {x_3,x_4}, {x_3,x_5}, {x_4,x_5} }.

The above discussion shows that the elimination of node x_k produces column L(:,k) of the Cholesky factor L with the property that L_{i,k} ≠ 0 iff {x_i, x_k} ∈ E^(k). Hence, the number of non-zero entries in column L(:,k) is given by the degree of node x_k ∈ V^(k). The memory requirement to store the Cholesky factor L is therefore given by

Mem(L) = Σ_{k=1}^n Mem(L(:,k)) = n + Σ_{k=1}^{n−1} deg_{G^(k)}(x_k),

where n represents storage of the diagonal.


Elimination graph and reordering.

A reordering of the unknowns (and simultaneous reordering of the equations) corresponds to a permutation of the columns and rows of A, i.e., it leads to the SPD matrix P^T A P for some permutation matrix P. The graph G^(1) is (up to relabeling of nodes) independent of the permutation P. In terms of graphs, we note that if we eliminate nodes one by one (with Alg. 3.2), then we obtain a sequence of graphs (G^(k))_{k=1}^n that corresponds to the Cholesky factorization of the matrix P^T A P, where the permutation matrix P describes the order in which the nodes are eliminated. Hence, we arrive at the following result concerning the fill-in of the Cholesky factorization for any ordering:

Theorem 3.9 Let G^(1) = (V^(1), E^(1)) be the adjacency graph of an SPD matrix A. Eliminate sequentially nodes v_1, v_2, ..., v_n using Alg. 3.2 and denote by G^(k), k = 1...n, the graphs obtained in this process. The sequence (G^(k))_{k=1}^n represents the elimination history for the Cholesky factorization of P^T A P, where P is determined by the order in which the nodes are eliminated. The positions of the non-zero entries of the Cholesky factor L of P^T A P can be read off (G^(k))_{k=1}^n: for i > k, there holds L_{i,k} ≠ 0 iff {v_i, v_k} ∈ E^(k). The total memory requirement to store L is

n + Σ_{k=1}^{n−1} deg_{G^(k)}(v_k).    (3.2)

Graph theory terminology.

Neighbors and degree: A neighbor of a node v is another node that is connected to v by an edge. By Adj(v) we denote the set of all neighbors of v ∈ V.¹¹ Recall that the degree deg(v) of a node v ∈ V is the number of edges emanating from v, i.e., deg(v) = |Adj(v)|. More generally, for a subset V′ ⊂ V, the set of all nodes outside V′ that are connected by an edge with some node v′ ∈ V′ is denoted by

Adj(V′) = ( ∪_{v′∈V′} Adj(v′) ) \ V′.

Path: A path connecting a node v ∈ V to a node v′ ∈ V is a tuple of nodes v_i, i = 1...l+1, with v_1 = v and v_{l+1} = v′ such that each {v_i, v_{i+1}} ∈ E.

Connected graph, diameter of a graph: A graph is connected if any two nodes can be connected by a path. The distance d(v, v′) between two nodes v, v′ is the length of the shortest path connecting v with v′. The diameter of a connected graph G is diam(G) = max_{v,v′∈V} d(v, v′), i.e., the longest distance between two nodes.

Separator: A subset S ⊂ V is called a separator of G if the graph G′ obtained from G by removing the nodes of S and the edges emanating from S is not connected. That is, V \ S has the form V \ S = V_1 ∪ V_2, and every path connecting a node v_1 ∈ V_1 with a node v_2 ∈ V_2 intersects the separator S.

Eccentricity of a node v, peripheral node: The eccentricity e(v) of a node v ∈ V is defined as e(v) = max_{v′∈V} d(v, v′). A node v ∈ V with e(v) = diam(G) is called peripheral. A peripheral node is 'far outside' because there exists a path of maximal length emanating from it.

11 Note: since we have excluded {v, v} from the set of edges, v ∉ Adj(v).


Figure 3.4: Here, e.g., S = {4, 5} is a separator

3.2 Standard ordering strategies

As we have seen in Example 3.4, the order in which the unknowns are numbered can have a tremendous effect on the amount of fill-in, which in turn affects the storage requirement for the Cholesky factor and the time to compute it. Modern sparse direct solvers analyze the matrix A prior to factorization and aim at determining a good ordering of the unknowns. Since the problem of finding a permutation matrix P that minimizes the amount of fill-in is a very hard problem, various heuristic methods have been devised. Popular ordering methods are:

1. Reverse Cuthill-McKee: Realized in Matlab as symrcm. This ordering may be motivated by minimizing the bandwidth of the reordered matrix P^T A P.

2. Nested dissection: This ordering originates from FD/FEM applications with substructuring.

3. (Approximate) minimum degree: The approximate minimum degree ordering of [1] is currently the most popular choice (see also help symamd). It aims at minimizing the amount of fill-in.

We refer to [14, 5, 8] for good surveys on the topic.

Reverse Cuthill-McKee.

The Cuthill-McKee (CM) and the Reverse Cuthill-McKee (RCM) orderings can be viewed as attempts to minimize the bandwidth of a sparse matrix in a cheap way. The underlying heuristic idea is as follows: In order to obtain a small bandwidth, it is important to assign neighboring nodes in the graph G numbers that are close together. Hence, as soon as one node is assigned a number, all neighbors that have not been assigned a number yet should be numbered as quickly as possible – see Alg. 3.3.

This is a typical example of a 'greedy algorithm', based on a brute-force, locally optimal strategy with the 'hope' that the outcome is also nearly optimal in the global sense.

The choice of the starting node is, of course, important. One wishes to choose a peripheral node as the starting node. Since these are in practice difficult to find, one settles for a pseudo ('nearly') peripheral node, as described in [8].

Ed. 2011 Iterative Solution of Large Linear Systems

Page 21: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

3.2 Standard ordering strategies 17

Algorithm 3.3 Cuthill-McKee

1: choose a starting node v and put it into a FIFO   % 'first in – first out' – a queue, or pipe
2: while (FIFO ≠ ∅) do
3:   take the first element v of the FIFO and assign it a number
4:   let V′ ⊂ Adj(v) be those neighbors of v that have not been numbered yet;
5:   put them into the FIFO in ascending order of degree (ties are broken arbitrarily)
6: end while
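For a connected graph given by a symmetric sparse adjacency structure S, Alg. 3.3 can be sketched in Matlab as follows (nodes are marked as soon as they enter the queue, so that they are not enqueued twice):

function perm = cuthill_mckee(S, s)
% Cuthill-McKee ordering (cf. Alg. 3.3); S: adjacency structure, s: starting node
n = size(S,1);
S = logical(S) & ~logical(speye(n));     % ignore diagonal entries (no loops)
deg = full(sum(S,2));                    % node degrees
queued = false(n,1);
fifo = s; queued(s) = true;              % the FIFO, realized as a growing row vector
perm = zeros(1,n); next = 0;
while ~isempty(fifo)
    v = fifo(1); fifo(1) = [];           % take the first element of the FIFO ...
    next = next + 1; perm(next) = v;     % ... and assign it the next number
    nb = find(S(v,:) & ~queued');        % neighbors not yet numbered or queued
    [~,ix] = sort(deg(nb));              % ascending order of degree
    nb = nb(ix);
    fifo = [fifo, nb(:)'];               % append them to the FIFO
    queued(nb) = true;
end
end

Reversing the result, perm(end:-1:1), yields the RCM ordering of Alg. 3.4 (cf. the built-in function symrcm).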

Figure 3.5: Example of the Cuthill-McKee algorithm with non-peripheral starting node g. Peripheral nodes are, e.g., a, i, j. The table traces the algorithm (step i, node v taken from the FIFO, remaining content of the FIFO):

i    v    content of FIFO
1    g    h, e, b, f
2    h    e, b, f
3    e    b, f, c
4    b    f, c, j
5    f    c, j, a, d
6    c    j, a, d
7    j    a, d
8    a    d
9    d    i
10   i    –

It has been observed that better orderings are obtained by reversing the Cuthill-McKee ordering. In fact, it can be shown, [14, p. 119], that RCM is always at least as good as CM in the sense that |Env(P_RCM^T A P_RCM)| ≤ |Env(P_CM^T A P_CM)|. The RCM algorithm is shown in Alg. 3.4.

Algorithm 3.4 Reverse Cuthill-McKee

1. Choose as starting node v a peripheral or pseudo-peripheral node.
2. Determine the CM ordering using Alg. 3.3.
3. Reverse the CM ordering to get the RCM ordering.

Figure 3.6: Example of Reverse Cuthill-McKee ordering (example taken from [16]).


Nested Dissection.

The key idea of nested dissection is to find a separator S that is small. Recall that a separator S for the graph G partitions the set of nodes V into three disjoint sets

V = V_1 ∪ V_2 ∪ S

with the property that no edges exist that connect nodes of V_1 with nodes of V_2. If we number the nodes of V_1 first, then the nodes of V_2, and the nodes of S last, the matrix A has the following block structure:

A = [ A_{V1,V1}       0         A_{V1,S}
         0         A_{V2,V2}    A_{V2,S}
      A_{S,V1}     A_{S,V2}     A_{S,S}  ]

Theorem 3.1 tells us that the Cholesky factorization of A will inherit the two 0-blocks. Therefore it is highly desirable to choose the separator S to be small, because then these two 0-blocks are large. (In the extreme case S = ∅ the adjacency graph of A is not connected and A becomes block diagonal.)

We will recursively apply nested dissection to the sets V_1 and V_2 and therefore obtain a 'fork-like' structure for the envelope of A (see, e.g., Fig. 3.8), where the 'prongs' are the blocks created by the separators. The nested dissection algorithm is given in Alg. 3.5.

Algorithm 3.5 Nested Dissection

% input: adjacency graph G = (V, E) of A
% output: numbering of the vertices

1. Select a separator S and sets V_1, V_2 such that a) V = V_1 ∪ V_2 ∪ S; b) S is small; c) no node of V_1 shares an edge with any node of V_2.
2. Number the nodes of V by a) numbering those of V_1 first (i.e., recursive call of this algorithm with V_1 as input), b) numbering those of V_2 second (i.e., recursive call of this algorithm with V_2 as input), and c) those of S last.

The 'art' in nested dissection lies in finding good separators. We illustrate that this is a feasible task. In fact, for the 2D Poisson problem on a square (Example 2.2) the choice of good separators leads to very little fill-in, namely O(N log N):

Example 3.10 We consider the uniform mesh discussed in Example 2.2. For simplicity, we assume that n is a power of 2: n = 2^m. As indicated in Fig. 3.7, the unit square is split into 4 boxes. By our choice of n, the two lines that split the square into 4 squares are mesh lines.

In this example, it is better to separate the unknowns into 5 sets V_1, V_2, V_3, V_4, and the separator S. This splitting is done as indicated in Fig. 3.7: All nodes in the box V_1 are numbered first, then those of the box V_2, then those of the box V_3, then those of box V_4, and finally those that lie in the separator S. The system matrix A then has the block structure depicted in Fig. 3.7. Each of the matrices A_{V1,V1}, ..., A_{V4,V4} has size ≈ N/4 and is, up to the size, essentially the same as the original matrix A. The last block row, A_{S,V}, is of size O(√N) × N. Since the submatrices A_{Vi,Vi} are similar to A, we can repeat the procedure recursively. An idea of the structure of the resulting envelope can be obtained from Fig. 3.8, where the sparsity pattern of the Cholesky factor for the case n = 30 is plotted.

A careful analysis given in [7] shows for this model problem that with this ordering of the unknowns, the memory requirement for the Cholesky factor is O(N log N), and the number of arithmetic operations to compute it is O(N^{3/2}).


Figure 3.7: left: nested dissection of a square. right: block structure of the resulting matrix.

Thus the fill-in is considerably smaller than for the straightforward lexicographic ordering, where the memory requirement for the Cholesky factor is O(N^{3/2}).

Exercise 3.11 Using (3.2), show that the memory requirement for the Cholesky factorization in Example 3.10 is indeed O(N log N). Show that it takes O(N^{3/2}) floating point operations to compute the Cholesky factorization.

Example 3.12 We consider again the 2D Poisson problem on a square (Example 2.2) with the original numbering ('lexicographic' ordering). We observe that the bandwidth of the matrix A with this ordering is b = O(n) = O(√N). The size of the envelope is then O(N b) = O(N^{3/2}), which is significantly larger than that of nested dissection. Fig. 3.8 shows that virtually the full envelope is filled during the factorization.

3.3 Minimum degree ordering

RCM ordering aims at minimizing the bandwidth of a matrix A. The ultimate goal, however, is to minimize the fill-in rather than the bandwidth. This is the starting point of the minimum degree algorithm. Finding the ordering that really minimizes the fill-in is a hard problem; the minimum degree algorithm, [17], is a greedy algorithm that aims at minimizing the fill for each column L_{:,k} of the Cholesky factor separately.

The algorithm proceeds by selecting a starting node v_1 of V^{(1)} and computes G^{(2)}; then a node v_2 ∈ V^{(2)} is selected, G^{(3)} is computed, etc. From (3.2), we see that the choice of node v_k adds deg_{G^{(k)}}(v_k) to the memory requirement for the Cholesky factor. The ‘greedy’ strategy is to select v_k (given that v_1, ..., v_{k−1} have already been selected) such that deg_{G^{(k)}}(v_k) is minimized, i.e., we choose a node of minimum degree. This procedure is formalized in Alg. 3.6.




Figure 3.8: Fill-in for gallery('poisson',30) and various ordering strategies. Panels: A, Poisson problem (nz = 4380); Cholesky factor, lexicographic (nz = 27029); Cholesky factor, reverse Cuthill-McKee (nz = 19315); Cholesky factor, nested dissection (nz = 13853); Cholesky factor, minimum degree (nz = 10042).




Algorithm 3.6 Minimum degree ordering

1: set up the graph G^{(1)} for the matrix A^{(1)} = A
2: for k = 1:N do
3:   select a node of V^{(k)} with minimum degree and label it x_k
4:   determine the graph G^{(k+1)} obtained from G^{(k)} by eliminating node x_k
5: end for

Several comments are in order:

1. In practice, there may be several nodes of minimum degree, and a tie-breaker is necessary. It is recommended to proceed as follows: first, the nodes are numbered using RCM. Then, if tie-breaking is necessary, the node with the smallest RCM number is selected in step 2 of Alg. 3.6.

2. The minimum degree algorithm is quite costly – various cheaper variants such as the approximate minimum degree algorithm of [1] are used in practice.

Example 3.13 (fill-in) The matrix A ∈ R^{900×900} of Example 2.2 with n = 30 has 5 non-zero entries per row. Different orderings have a considerable effect on the amount of fill-in (see Fig. 3.8): whereas the lower part of A has 2,640 non-zero entries, the Cholesky factor of A has 27,029; using RCM ordering reduces this to 19,315. Nested dissection ordering is even better with 13,853 non-zero entries. The best is the approximate minimum degree ordering with 10,042 non-zero entries. Calculations are done in Matlab with A = gallery('poisson',30); p = symrcm(A); Lrcm = chol(A(p,p)); the (approximate) minimum degree ordering is obtained by setting p = symamd(A).
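The computations behind Fig. 3.8 can be reproduced along the following lines (a sketch; the nested dissection permutation is obtained here from the illustrative helper nd_order of Section 3.2, so the resulting count may deviate slightly from the figure).

   n = 30;  A = gallery('poisson',n);               % 2D Poisson matrix, N = n^2 = 900
   L0   = chol(A);                                  % lexicographic ordering
   prcm = symrcm(A);   Lrcm = chol(A(prcm,prcm));   % reverse Cuthill-McKee
   pamd = symamd(A);   Lamd = chol(A(pamd,pamd));   % approximate minimum degree
   pnd  = nd_order(reshape(1:n^2,n,n));             % nested dissection (sketch above)
   Lnd  = chol(A(pnd,pnd));
   [nnz(L0), nnz(Lrcm), nnz(Lnd), nnz(Lamd)]        % compare the fill-in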

4 A fast Poisson Solver based on FFT

For systems with special structure, fast direct solutions can be designed on the basis of the Fast Fourier Transform (FFT), a fast implementation of the Discrete Fourier Transform (DFT), see [2]. Classical examples are systems represented by circulant matrices. Another important class are Toeplitz matrices, with a constant value along each diagonal. The FD matrices A for the 1D and 2D Poisson equation (Examples 2.1 and 2.2) are Toeplitz matrices; moreover, their eigenvalues and eigenvectors are explicitly known. This information can be exploited to design fast solvers for such special problems.

4.1 The 1D case

The FD matrix from Example 2.1 reads

A = [  2  −1                ]
    [ −1   2  −1            ]
    [      ⋱    ⋱    ⋱      ]
    [          −1   2   −1  ]
    [              −1    2  ]     (4.1)

Since A is tridiagonal and SPD, the solution of Au = b can be performed in O(n) operations by straightforward elimination (Cholesky), i.e., with optimal computational effort.




Nevertheless, we consider another fast algorithm, a so-called spectral method, mainly as preparation for the 2D case. This is based on the spectral decomposition (i.e., the orthonormal diagonalization)

A = Q D Q^T

with^{12}

Q_{i,j} = (w_j)_i / ∥w_j∥_2 = c_n sin(jπx_i),   D_{i,i} = λ_i = 4 sin²(πx_i/2)     (4.2)

Here h = 1/(n+1) is the stepsize and x_i = ih is the i-th grid point, see p. 6. Thus, the solution of the system Au = b amounts to the evaluation of

u = A^{−1} b = Q D^{−1} Q^T b = Q D^{−1} Q b     (4.3)

(in this example, the orthogonal matrix Q is also symmetric). Here, two matrix-vector multiplications Q·v are involved, with effort O(n²) if performed in a naive manner. However, this can be related to the FFT in the following way. A fast implementation of the Discrete Sine Transform, explained in the sequel, leads to an evaluation algorithm for (4.3).

The Discrete Sine Transform (DST).

Consider a vector v = (v_1, ..., v_n)^T ∈ R^n and y = Q v = (y_1, ..., y_n)^T with Q from (4.2),

y_k = Σ_{j=1}^{n} Q_{k,j} v_j = c_n Σ_{j=1}^{n} sin(jπx_k) v_j,   k = 1...n     (4.4)

Here sin(jπx) = Im e^{ijπx} is the imaginary part of e^{ijπx}, hence

y_k = c_n Im( Σ_{j=1}^{n} e^{ijπx_k} v_j ),   k = 1...n

In the usual notation from the DFT (Discrete Fourier Transform), we write

Σ_{j=1}^{n} e^{ijπx_k} v_j = Σ_{j=1}^{n} ω^{kj} v_j,   ω = e^{πi/(n+1)} = e^{2πi/(2n+2)} = (2n+2)-th root of unity

This is precisely a ‘half part’ from the sum appearing in the inverse DFT (IDFT) of the extended vector

ṽ = (ṽ_0, ṽ_1, ..., ṽ_{2n+1})^T := (0, v_1, ..., v_n, 0, ..., 0)^T ∈ R^{2n+2}

i.e.,

Σ_{j=1}^{n} ω^{kj} v_j = Σ_{j=0}^{2n+1} ω^{kj} ṽ_j = (IDFT(ṽ))_k

With a fast O(n log n) [i]fft implementation of the [I]DFT available, we thus can compute the vector y = Qv in the following way:

12 In (4.2), the eigenvectors have been normalized such that Q is indeed an orthogonal matrix. Note that c_n = 1/∥w_1∥_2 = ... = 1/∥w_n∥_2 = √(2/(n+1)). It should also be noted that the λ_i/h² and the w_i are discrete approximations for the spectrum and the eigenfunctions of the differential operator −u″ on [0, 1] with homogeneous boundary conditions.




– Extend v = (v_1, ..., v_n) to ṽ = (ṽ_0, ṽ_1, ..., ṽ_{2n+1}) = (0, v_1, ..., v_n, 0, ..., 0)

– Compute ỹ = (ỹ_0, ỹ_1, ..., ỹ_{2n+1}) = ifft(ṽ), e.g. in Matlab (note that Matlab's ifft includes a factor 1/(2n+2), which has to be compensated)

– Set y = c_n Im(ỹ_1, ..., ỹ_n)

Of course, there is some potential for optimization to avoid unnecessary computations. The transformation (4.4) is also called the Discrete Sine Transform (DST), and direct FFT-like implementations are available, e.g., dst (and its inverse idst) in Matlab. 13
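As a sketch, the three steps above can be coded as follows; the helper name dst1 is an illustrative choice (to be placed in its own file dst1.m), used instead of a toolbox dst. Note that Matlab's ifft carries a factor 1/(2n+2), which is compensated here, and that the scaling c_n enters squared because Q is applied twice in (4.3) and c_n² = 2/(n+1).

   function Y = dst1(V)
   % unnormalized discrete sine transform, applied to each column of V:
   % Y(k,:) = sum_j sin(j*k*pi/(n+1)) * V(j,:), realized via the IDFT as in (4.4)
     [n,m] = size(V);
     W = (2*n+2) * ifft([zeros(1,m); V; zeros(n+1,m)]);  % extend to length 2n+2
     Y = imag(W(2:n+1,:));                               % keep the 'sine part'
   end

With this helper, the 1D system with matrix (4.1) is solved spectrally by

   n = 1023;  h = 1/(n+1);  x = (1:n)'*h;
   b = ones(n,1);                          % some right-hand side
   lam = 4*sin(pi*x/2).^2;                 % eigenvalues of A, cf. (4.2)
   u = (2/(n+1)) * dst1( dst1(b)./lam );   % u = Q D^{-1} Q b, since c_n^2 = 2/(n+1)
   % check: norm(gallery('tridiag',n,-1,2,-1)*u - b) should be at machine precision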

4.2 The 2D case

For the 1D case above, the spectral solver based on the DST is slightly less efficient than tridiagonal Cholesky. 14 For the 2D case (Example 2.2), the situation is different. Here, the FD matrix takes the block-sparse form (see Fig. 2.1)

A = [  T  −I                ]
    [ −I   T  −I            ]
    [      ⋱    ⋱    ⋱      ]
    [          −I   T   −I  ]
    [              −I    T  ]  ∈ R^{N×N},   N = n²     (4.5)

with the n×n blocks

T = [  4  −1                ]
    [ −1   4  −1            ]
    [      ⋱    ⋱    ⋱      ]
    [          −1   4   −1  ]
    [              −1    4  ]  ∈ R^{n×n},   I = I_{n×n}     (4.6)

The bandwidth of the large matrix A is n, and when applying Cholesky factorization a certain amount of fill-in is encountered, even when applying the reordering strategies from Section 3.

For deriving an alternative direct solution method, we write vectors v ∈ R^N = R^{n²} in analogously partitioned form,

v = (v^{(1)}, v^{(2)}, ..., v^{(n−1)}, v^{(n)})^T ∈ R^N,   with   v^{(i)} = (v^{(i)}_1, v^{(i)}_2, ..., v^{(i)}_{n−1}, v^{(i)}_n)^T ∈ R^n     (4.7)

13 Note that, analogously as for classical Fourier series, the usual computational definition of DFT etc. does not correspond to an orthonormal scaling. E.g., dst(b) delivers a scalar multiple of Qb. However, when applying idst this effect ‘cancels’; an explicit normalization of intermediate results is not necessary. For implementation, also take care of the fact that in Matlab all arrays have starting index 1, not 0.

DST may be called the ‘imaginary part’ of the DFT. The related Discrete Cosine Transform (DCT) is, e.g., also used in image compression algorithms (underlying the classical jpg format).

14 Numerical stability for large n may be expected to be better for the spectral method – a subject worth testing.




Then, the linear FD system Au = b can be written in the block form

−u^{(i−1)} + T u^{(i)} − u^{(i+1)} = b^{(i)},   i = 1...n     (4.8)

where we formally set u^{(0)} = u^{(n+1)} = 0. From the 1D case we see that T has the spectral decomposition

T = Q S Q^T

with (see (4.2))

Q_{i,j} = (w_j)_i / ∥w_j∥_2 = c_n sin(jπx_i),   S_{i,i} =: σ_i = 2 + 4 sin²(πx_i/2),   i = 1...n

With y^{(i)} = Q u^{(i)}, multiplying the systems (4.8) by Q gives

−y^{(i−1)} + S y^{(i)} − y^{(i+1)} = Q b^{(i)},   i = 1...n     (4.9)

with the diagonal matrix S = diag(σ_1, ..., σ_n). This system decouples by reordering the unknowns: Let us permute vectors v ∈ R^N = R^{n²} from (4.7) according to

v ↦ ṽ = (ṽ^{(1)}, ṽ^{(2)}, ..., ṽ^{(n−1)}, ṽ^{(n)})^T,   with   ṽ^{(i)} = (v^{(1)}_i, v^{(2)}_i, ..., v^{(n−1)}_i, v^{(n)}_i)^T ∈ R^n

i.e., by switching from lexicographic column-wise to lexicographic row-wise ordering of the unknowns, i.e., of the computational grid. Then, the set of linear systems (4.9) transforms into a set of independent tridiagonal systems of dimension n×n,

S̃_i ỹ^{(i)} = b̃^{(i)},   i = 1...n,   with   S̃_i = [ σ_i  −1              ]
                                                    [ −1   σ_i  −1         ]
                                                    [      ⋱    ⋱    ⋱     ]
                                                    [          −1   σ_i    ]  ∈ R^{n×n}     (4.10)

where b̃ denotes the correspondingly permuted right-hand side, i.e., b̃^{(i)} = ((Qb^{(1)})_i, ..., (Qb^{(n)})_i)^T.

(This corresponds to a column-wise permutation of the system represented by (4.9); the equations (rows) are not permuted.) The right-hand sides Qb^{(i)} can be evaluated via the 1D DST in n·O(n log n) operations, and the resulting tridiagonal systems are of the same type as for the 1D Poisson case and can be solved by tridiagonal Cholesky in n·O(n) operations (or the DST may be used again). Finally, the solution vectors ỹ^{(i)} are recombined into the original order, and the resulting y^{(i)} are transformed back using the 1D DST to yield the solution components u^{(i)} = Q y^{(i)}.

The overall computational and memory effort for this fast Poisson solver is O(n² log n) = O(N log N), nearly optimal in the problem dimension N = n². However, one should keep in mind that the reordering steps may be nontrivial to implement on parallel computer architectures, since global communication takes place for larger problems. (This challenge is also typical for all FFT-based computational techniques.)
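A compact Matlab sketch of this 2D solver, reusing the columnwise dst1 helper from the 1D case (again an illustrative assumption, not a toolbox routine), may look as follows; the matrix B holds the right-hand side blocks b^{(i)} as columns.

   n = 255;  h = 1/(n+1);  x = (1:n)'*h;
   B = ones(n,n);                            % right-hand side, B(:,i) = b^(i)
   sig = 2 + 4*sin(pi*x/2).^2;               % sigma_j from the decomposition of T
   Bt = dst1(B);                             % apply the sine transform to every block b^(i)
   Y  = zeros(n,n);
   for j = 1:n                               % decoupled tridiagonal systems (4.10)
     Tj = gallery('tridiag', n, -1, sig(j), -1);
     Y(j,:) = (Tj \ Bt(j,:).').';            % row j collects the permuted block
   end
   U = (2/(n+1)) * dst1(Y);                  % transform back: U(:,i) = u^(i)
   % check: norm(gallery('poisson',n)*U(:) - B(:)) should be at machine precision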

3D Poisson solvers of a similar type can also be designed. See also [11] for a discussion of related methods, e.g., for more general problem geometries.




5 Basic Iterative Methods

5.1 Convergence analysis of linear iterative methods

We abbreviate K = R or C . We consider solving linear systems of the form

Ax = b (5.1)

and assume that A is invertible. Many iterative methods 15 for solving (5.1) have the form of a fixed point iteration

xk+1 = Φ(xk; b) (5.2)

with some mapping Φ. Solutions x_* of the equation x_* = Φ(x_*; b) are called fixed points of (5.2). Since the iteration mapping Φ is a fixed one, this is also called a stationary iteration.

Definition 5.1 The fixed point iteration (5.2) is said to be

• consistent with the invertible matrix A, if for every b, the solution x_* = A^{−1}b is a fixed point of (5.2);

• convergent, if for every b there exists a vector x_* such that for every starting vector x_0 the sequence (x_k) = (x_k)_{k=0}^{∞} defined by (5.2) converges to x_*; 16

• linear, if Φ is an affine mapping of the form Φ(x; b) = M x + N b, i.e.,

x_{k+1} = M x_k + N b     (5.3)

M is called the iteration matrix. 17

The matrices M, N from (5.3) need to satisfy certain conditions in order for the iteration to be consistent with a given matrix A. We have:

Lemma 5.2 Let A be invertible. Then the fixed point iteration (5.3) is consistent with A if and only if

M = I −NA ⇔ N = (I −M)A−1 ⇔ A−1 = (I −M)−1N

Proof: Exercise.

We will see that the asymptotic convergence behavior of the iteration (5.3) is determined by the spectrum of the iteration matrix M. We start with the following special (linear) variant of the Banach Fixed Point Theorem:

Theorem 5.3 Let ρ(M) < 1 . Then there exists a unique fixed point x∗ of (5.3),

x∗ = (I −M)−1N b

and the iteration (5.3) converges to x∗ for any x0 ∈ Kn .

15 See also [2, Ch. 7].
16 We stress that we require a) convergence and b) a unique limit x_* – independent of the starting vector x_0.
17 Later we will see that in typical situations the matrix N should be invertible; N^{−1} is then called the preconditioner.




Proof: Since ρ(M) < 1, we conclude 1 ∉ σ(M), i.e., there exists a unique fixed point x_* = (I − M)^{−1} N b ∈ K^n. By e_k we denote the error of the k-th iterate: e_k = x_k − x_*. By subtracting the equations x_* = M x_* + N b and x_{k+1} = M x_k + N b, we obtain a recursion for the errors e_k:

e_{k+1} = M e_k = M² e_{k−1} = ··· = M^{k+1} e_0     (5.4)

By Theorem 1.1, we may find a norm ∥·∥_ε on K^n such that ∥M∥_ε ≤ ρ(M) + ε < 1. In this norm, we obtain from (5.4)

∥e_{k+1}∥_ε = ∥M^{k+1} e_0∥_ε ≤ ∥M^{k+1}∥_ε ∥e_0∥_ε ≤ ∥M∥_ε^{k+1} ∥e_0∥_ε

Since ∥M∥_ε < 1, we conclude that the sequence (e_k)_{k=0}^{∞} converges to 0 in the norm ∥·∥_ε. Since all norms on the finite-dimensional space K^n are equivalent, we conclude that the sequence (x_k)_{k=0}^{∞} converges to x_* in any norm.

Corollary 5.4 Let the fixed point iteration (5.3) be consistent with the invertible matrix A, and ρ(M) < 1. Then for every starting value x_0 the sequence (x_k)_{k=0}^{∞} converges to the solution x_* of A x_* = b.

Note that condition ρ(M) < 1 is essential for the convergence. We have:

Theorem 5.5 Let ρ(M) ≥ 1. Then the iteration (5.3) is not convergent; in other words: either there exist starting vectors x_0 and vectors b such that the sequence (x_k) of iterates does not converge, or the iterates corresponding to different starting vectors converge to different fixed points.

Proof: We choose b = 0 .

We start with the case K = C. Let λ ∈ σ(M) with |λ| = ρ(M) ≥ 1. Pick x_0 ≠ 0 as a corresponding eigenvector. Then the iterates x_k of (5.3) are x_k = M^k x_0 = λ^k x_0. We consider three different cases:

(i) |λ| > 1 : then |xk| → ∞ for k →∞ . Hence (xk) does not converge.

(ii) |λ| = 1 and λ ≠ 1: Then λ = e^{iφ} for some φ ∈ (0, 2π). The iterates x_k have the form x_k = e^{ikφ} x_0 and do not form a Cauchy sequence, because for every n ∈ N we have lim sup_{m→∞} ∥x_n − x_m∥ = lim sup_{m→∞} |e^{i(n−m)φ} − 1| ∥x_0∥ ≥ |e^{iφ} − 1| ∥x_0∥ > 0.

(iii) λ = 1: Then x_k = x_0 for all k ∈ N_0. This sequence converges trivially: lim_{k→∞} x_k = x_0. However, taking as starting vector x′_0 = −x_0 leads, by the same reasoning, to lim_{k→∞} x′_k = x′_0 = −x_0. Hence, the starting vectors x_0 and −x_0 lead to sequences converging to different limits.

We now consider the case K = R, with M ∈ R^{n×n}. With the above notation, let r_0 = Re x_0, i_0 = Im x_0. We then have x_k = M^k x_0 = M^k (r_0 + i i_0) = M^k r_0 + i M^k i_0 and therefore, since M ∈ R^{n×n}:

Re x_k = M^k r_0,   Im x_k = M^k i_0

We have seen above that in the case |λ| ≥ 1, λ ≠ 1, the sequence (x_k)_{k∈N_0} does not converge. Hence, at least one of the two real sequences (Re x_k)_{k=0}^{∞} = (M^k r_0)_{k=0}^{∞} and (Im x_k)_{k=0}^{∞} = (M^k i_0)_{k=0}^{∞} does not converge. If λ = 1, then the starting vectors r_0, −r_0, i_0, −i_0 all lead to trivially converging sequences, but their limits do not all coincide (note that at least one of r_0, i_0 is ≠ 0).




Remark 5.6 The natural interpretation of the matrix N is that of an approximate inverse, i.e., N ≈ A^{−1} in some specific sense – but it is essential that N (and M) can be efficiently evaluated. N = A^{−1} (together with M = 0) would solve the problem in one step (direct solution). For N ≈ A^{−1} the matrix M = I − NA will be ‘small’ (as required), due to consistency.

The invertibility of N is necessary for convergence: otherwise the matrix I − M = NA would have a nontrivial kernel, i.e., M would have the eigenvalue 1, which contradicts the convergence condition 1 ∉ σ(M).

A fixed point iteration (5.3) consistent with A can be formulated in three (equivalent) ways:

(i) First normal form (5.3): x_{k+1} = M x_k + N b, with M = I − NA

(ii) Second normal form: x_{k+1} = x_k + N(b − A x_k). Thus the correction added to x_k is a linear image of the residual b − A x_k =: r_k

(iii) Third normal form: W(x_{k+1} − x_k) = b − A x_k (i.e., W = N^{−1}; W is an approximation to A). Thus, the correction δ_k = x_{k+1} − x_k is obtained by solving the linear system W δ_k = r_k. 18

The third normal form reveals the heuristic idea behind a consistent iteration. For W ≈ A a sufficiently reasonable approximation to A, we may expect

W x_* − W x_k ≈ A x_* − A x_k = b − A x_k = r_k

which motivates the choice of x_{k+1}.
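The normal forms translate directly into code. The following Matlab sketch implements a generic linear iteration in second normal form; the function name lin_iter, the handle applyN realizing the action r ↦ N r, and the stopping rule are illustrative choices.

   function [x, res] = lin_iter(A, b, applyN, x, tol, maxit)
   % generic linear iteration  x_{k+1} = x_k + N (b - A x_k)  (second normal form);
   % applyN is a function handle realizing the action r -> N r
     res = zeros(maxit,1);
     for k = 1:maxit
       r = b - A*x;                     % residual r_k
       res(k) = norm(r);
       if res(k) < tol, res = res(1:k); return; end
       x = x + applyN(r);               % correction delta_k = N r_k
     end
   end

The splitting methods of the following subsection then differ only in the choice of applyN.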

For the error e_k = x_k − x_* we have

e_{k+1} = M e_k     (5.5)

and the residual r_k = b − A x_k satisfies r_k = −A e_k; consequently, the residuals obey r_{k+1} = (I − AN) r_k = A M A^{−1} r_k, i.e., they are propagated by a matrix similar to M. In second normal form we can also write

e_{k+1} = e_k + N r_k = e_k − N A e_k = (I − NA) e_k

If the iteration matrix M = I − NA satisfies ρ(M) < 1, then we call ρ(M) the (asymptotic) convergence factor of the iteration (5.3). The expression −log_{10} ρ(M) is called the (asymptotic) convergence rate.

Remark 5.7 The convergence result from Theorem 5.3 supports the expectation that the error is reduced by a factor q ∈ (0,1) in each step of the iteration. However, in general this is true only in an asymptotic ‘mean’ sense. We consider

q = lim sup_{k→∞} ( ∥e_k∥ / ∥e_0∥ )^{1/k}

where ∥·∥ is a norm in which we are interested, since (asymptotically) the error is reduced by a factor q in each step. From (5.4) we obtain

∥e_k∥ = ∥M^k e_0∥ ≤ ∥M^k∥ ∥e_0∥

and therefore

q = lim sup_{k→∞} ( ∥e_k∥ / ∥e_0∥ )^{1/k} ≤ lim sup_{k→∞} ( ∥M^k∥ ∥e_0∥ / ∥e_0∥ )^{1/k} = lim sup_{k→∞} ∥M^k∥^{1/k} = ρ(M)

by Theorem 1.1. We note that the precise choice of norm is immaterial for the asymptotic convergence behavior.19 The asymptotic convergence rate −log_{10} ρ(M) measures how fast the error decays: its reciprocal is (asymptotically) the number of steps needed to reduce the error by a factor of 10.

18 Note the formal similarity to a [quasi-] Newton method for solving a nonlinear system F(x) = b, with residual b − F(x_k).
19 We refer to [16, Section 4.2] for a more detailed discussion.




Remark 5.8 It should be noted, however, that in general for a given norm, e.g. for ∥·∥_2, the behavior of the error in a single step may have nothing to do with the size of ρ(M) < 1. For (highly) non-normal M, ∥M∥_2 may be > 1 (even ≫ 1). In other words: the spectral radius ρ(M) only determines the asymptotic convergence rate in general.

5.2 Splitting methods

Early iterative methods (including the classical Jacobi and Gauss-Seidel methods) are splitting methods for solving (5.1). A splitting method is determined by writing

A = G−H

and using the approximate inverse N = G^{−1} ≈ A^{−1}. Equivalently, this is obtained by rewriting equation (5.1) as

A x = b  ⇒  G x = b + H x  ⇒  x = G^{−1} b + G^{−1} H x

which has fixed point form (5.3) and is, by construction, consistent with the matrix A. The corresponding fixed point iteration reads

x_{k+1} = G^{−1}H x_k + G^{−1} b,   i.e.,   M = G^{−1}H,   N = G^{−1}

In practice, the approximate inverse N = G^{−1} is, of course, not computed; only the action z ↦ G^{−1}z needs to be realized computationally. For efficient methods, the action z ↦ G^{−1}z, or equivalently, the solution of the correction equation G δ_k = r_k needs to be ‘cheap’.

Richardson, Jacobi, Gauss-Seidel.

Richardson: We start with a trivial, basic case: A = I − (I− A) . This leads to the Richardson iteration:

xk+1 = xk + (b− Axk) (5.6)

Here, N = I is taken as an ‘approximate inverse’ of A .

Remark 5.9 We note that the second normal form of a (consistent) linear iteration shows that every linear iteration for a system Ax = b can be interpreted as the Richardson iteration applied to the transformed problem NAx = Nb. In this interpretation we call N a preconditioner; we need N ≈ A^{−1} to ensure convergence of the preconditioned Richardson iteration.

In the following we throughout write A = D + L + U, where D, L, U are the diagonal, the (strictly) lower, and the (strictly) upper part of A, respectively.

A = L + D + U   (schematic splitting: strictly lower part L, diagonal D, strictly upper part U)




Jacobi: Choose G = D and H = −(L+ U) . The Jacobi iteration is given by

x_{k+1} = −D^{−1}(L + U) x_k + D^{−1} b = x_k + D^{−1}(b − A x_k),   i.e., N = D^{−1}     (5.7)

The inversion of the diagonal matrix D (which must be invertible) is trivial. In component notation, the Jacobi iteration reads:

(x_{k+1})_i = (1/A_{i,i}) ( b_i − Σ_{j=1, j≠i}^{n} A_{i,j} (x_k)_j ),   i = 1...n     (5.8)
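With the lin_iter sketch from Section 5.1, the Jacobi method corresponds to the following choice of the approximate inverse (illustrative code, assuming A is stored as a sparse matrix):

   d = full(diag(A));                     % diagonal of A (must be nonzero)
   applyN_jac = @(r) r ./ d;              % Jacobi: N = D^{-1}
   [x, res] = lin_iter(A, b, applyN_jac, zeros(size(b)), 1e-8, 10000);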

(Forward) Gauss-Seidel: We choose G = (D + L) and H = −U , which leads to

x_{k+1} = (D + L)^{−1}(−U) x_k + (D + L)^{−1} b = x_k + (D + L)^{−1}(b − A x_k),   i.e., N = (D + L)^{−1}     (5.9)

Note that, provided D is invertible, the action z ↦ (D + L)^{−1} z is easily realized by forward substitution, since D + L is a lower triangular matrix. In component notation, the Gauss-Seidel iteration reads

(x_{k+1})_i = (1/A_{i,i}) ( b_i − Σ_{j=1}^{i−1} A_{i,j} (x_{k+1})_j − Σ_{j=i+1}^{n} A_{i,j} (x_k)_j )     (5.10)
           = (x_k)_i + (1/A_{i,i}) ( b_i − Σ_{j=1}^{i−1} A_{i,j} (x_{k+1})_j − Σ_{j=i}^{n} A_{i,j} (x_k)_j ),   i = 1...n

This means that the method works like Jacobi but, within each iteration step, the updates (x_{k+1})_i immediately replace the old values (x_k)_i. Completely analogous to the forward Gauss-Seidel iteration is the backward Gauss-Seidel iteration, which corresponds to the splitting G = (D + U), H = −L and therefore

x_{k+1} = (D + U)^{−1}(−L) x_k + (D + U)^{−1} b = x_k + (D + U)^{−1}(b − A x_k),   i.e., N = (D + U)^{−1}     (5.11)
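In the same spirit, the forward and backward Gauss-Seidel methods are realized by triangular solves (sketch; tril and triu include the diagonal here):

   applyN_gs  = @(r) tril(A) \ r;         % forward Gauss-Seidel:  N = (D+L)^{-1}
   applyN_bgs = @(r) triu(A) \ r;         % backward Gauss-Seidel: N = (D+U)^{-1}
   [x, res] = lin_iter(A, b, applyN_gs, zeros(size(b)), 1e-8, 10000);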

Remark 5.10 The Richardson and Jacobi methods are independent of the numbering of the unknowns; the forward and backward Gauss-Seidel methods are not. E.g., the backward Gauss-Seidel method is obtained from the forward version by reversing the numbering of the unknowns. One should note that the Jacobi method has much more potential for parallelization. On the other hand, the Gauss-Seidel versions are more economic concerning storage.

Damped Richardson, Jacobi, SOR, SSOR.

The second normal form of a linear iteration suggests that one can create for each method a damped version, where a given approximate inverse N is replaced by ωN,

xk+1 = xk + ωN(b− Axk) (5.12)

Here, ω ∈ R is the damping (or: relaxation) parameter. We recognize two special cases: N = I, which corresponds to the damped Richardson method:

xk+1 = xk + ω (b− Axk) (5.13)

and the damped Jacobi method,

x_{k+1} = x_k + ω D^{−1}(b − A x_k)     (5.14)




The so-called SOR method (successive overrelaxation 20) is obtained by introducing a relaxation factor ω in the component formulation of the Gauss-Seidel method (5.10),

(x_{k+1})_i = (x_k)_i + ω (1/A_{i,i}) ( b_i − Σ_{j=1}^{i−1} A_{i,j} (x_{k+1})_j − Σ_{j=i}^{n} A_{i,j} (x_k)_j ),   i = 1...n     (5.15)

In matrix notation, this reads

xk+1 = xk + ω (D + ω L)−1(b− Axk) (5.16)

Of course, instead of the forward Gauss-Seidel method, one could start from the backward Gauss-Seidel method – then, the matrix L is replaced by U in (5.16).

A disadvantage of the Gauss-Seidel and, more generally, the SOR methods is that the iteration matrix M is not symmetric even if the original matrix A is symmetric. This can be overcome by defining a symmetric version, the SSOR (Symmetric SOR) method, by applying first a step of SOR based on the forward Gauss-Seidel method and then an SOR step based on the backward Gauss-Seidel method:

x_{k+1/2} = x_k + ω (D + ω L)^{−1}(b − A x_k),
x_{k+1}   = x_{k+1/2} + ω (D + ω U)^{−1}(b − A x_{k+1/2})

This leads to

x_{k+1} = x_k + ω(2 − ω)(D + ω U)^{−1} D (D + ω L)^{−1}(b − A x_k)     (5.17)

with symmetric iteration matrix M for symmetric A, i.e., for U = L^T. If A is SPD, then the SSOR approximate inverse N = ω(2 − ω)(D + ωU)^{−1} D (D + ωL)^{−1} is also SPD for ω ∈ (0, 2), since

((D + ω L^T)^{−1} D (D + ω L)^{−1} x, x) = (D (D + ω L)^{−1} x, (D + ω L)^{−1} x) > 0   for x ≠ 0

due to D > 0 .
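For a given relaxation parameter ω, the SOR and SSOR corrections from (5.16) and (5.17) read as follows in terms of the lin_iter sketch (again illustrative code; D, L, U are the parts of A as above and the value of omega is just a sample):

   D = diag(diag(A));  L = tril(A,-1);  U = triu(A,1);  omega = 1.5;
   applyN_sor  = @(r) omega * ((D + omega*L) \ r);                     % SOR, cf. (5.16)
   applyN_ssor = @(r) omega*(2-omega) * ...
                 ((D + omega*U) \ (D * ((D + omega*L) \ r)));          % SSOR, cf. (5.17)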

We collect the iteration matrices M for these methods:

method                 iteration matrix M = I − NA                                     approximate inverse N
damped Richardson      M^{Rich}_ω = I − ωA                                             ω I
damped Jacobi          M^{Jac}_ω  = I − ω D^{−1}A                                      ω D^{−1}
forward Gauss-Seidel   M^{GS}     = I − (D+L)^{−1}A                                    (D+L)^{−1}
SOR                    M^{SOR}_ω  = I − ω(D+ωL)^{−1}A                                  ω(D+ωL)^{−1}
SSOR                   M^{SSOR}_ω = I − ω(2−ω)(D+ωU)^{−1}D(D+ωL)^{−1}A                 ω(2−ω)(D+ωU)^{−1}D(D+ωL)^{−1}

Exercise 5.11 Prove (5.16) and (5.17).

Exercise 5.12 Consider the damped Richardson iteration with complex damping parameter ω ∈ C.

a) Show: σ(M^{Rich}_ω) = {1 − ωλ : λ ∈ σ(A)}.

b) Show: The damped Richardson iteration is convergent for damping parameters ω ∈ C with the following property: the open disc B_{1/|ω|}(1/ω) ⊂ C contains σ(A) but does not include 0.

c) Give an example of a matrix for which no damped Richardson iteration is convergent.

d) Let σ(A) ⊂ {z ∈ C : Re z > 0}. Show: Taking the damping parameter ω real and sufficiently small leads to a convergent Richardson iteration.

e) Let σ(A) ⊂ [λ_min, λ_max] ⊂ (0, ∞). Show: the optimal damping parameter ω_opt ∈ R, i.e., the parameter minimizing ρ(M^{Rich}_ω), is given by ω_opt = 2/(λ_min(A) + λ_max(A)).

20 More precisely: the choice ω > 1 is called overrelaxation whereas the choice ω < 1 corresponds to underrelaxation. In the examples we will consider, overrelaxation is advantageous.




Jacobi, Gauss-Seidel, SOR in the SPD case.

For the case where A is SPD, a fairly general convergence theory can be established for these methods. First we require some properties of symmetric and/or SPD matrices.

For a symmetric matrix A ∈ R^{n×n} we write A ≥ 0 iff A is positive semidefinite (i.e., (Ax, x) ≥ 0 for all x ∈ R^n); we write A > 0 iff A is positive definite (i.e., (Ax, x) > 0 for all 0 ≠ x ∈ R^n). For two symmetric matrices A, B ∈ R^{n×n} we write A ≥ B iff A − B ≥ 0; we write A > B iff A − B > 0.

Lemma 5.13 Let A,B ∈ Rn×n be symmetric. Then:

(i) A > [≥] 0 ⇔ CT AC > [≥] 0 for all invertible C ∈ Rn×n .

(ii) A > [≥] B ⇔ CT AC > [≥] CT B C for all invertible C ∈ Rn×n .

(iii) A, B > [≥] 0 ⇒ A + B > [≥] 0

(iv) λI < A < ΛI ⇔ σ(A) ⊂ (λ, Λ) (analogous assertion for ≤)

(v) A > [≥] B > 0 ⇔ 0 < A−1 < [≤] B−1

Proof: We will only prove (v), the remaining cases being simple (exercise). If you, e.g., prove (i), observe the meaning of this assertion (a simple ‘change of coordinates’ argument).

For (v), we will only show that A ≥ B > 0 implies B^{−1} ≤ A^{−1}. Since B is SPD, we can define the SPD matrix B^{−1/2}. Then, by (i), we infer that A ≥ B implies X := B^{−1/2} A B^{−1/2} ≥ I. Hence, all eigenvalues of the symmetric matrix X are ≥ 1 (cf. (iv)). Thus, the eigenvalues of the symmetric matrix X^{−1} are all ≤ 1, i.e., I ≥ X^{−1} = B^{1/2} A^{−1} B^{1/2}. Multiplying from both sides by the symmetric matrix B^{−1/2} and recalling (i) gives B^{−1} ≥ A^{−1}.

For an SPD matrix A = Q Λ Q^T > 0 with Q orthogonal and Λ > 0 diagonal, the square root A^{1/2},

A^{1/2} = Q Λ^{1/2} Q^T   (A^{1/2} A^{1/2} = A)

and its inverse A^{−1/2} = Q Λ^{−1/2} Q^T are also SPD. Furthermore, 21 each SPD matrix A defines an ‘energy product’ and associated ‘energy norm’,

(x, y)_A = (A x, y) = (x̂, ŷ),   ∥x∥_A = (A x, x)^{1/2} = ∥A^{1/2} x∥_2 = ∥x̂∥_2     (5.18)

The corresponding matrix norm is

∥M∥_A = ∥A^{1/2} M A^{−1/2}∥_2 = ∥M̂∥_2     (5.19)

We note some further properties of general inner products (energy products) (u, u)_A = (Au, u), generalizing well-known identities from the Euclidean case:

Exercise 5.14

a) Let A ∈ R^{n×n} be SPD. Show: The adjoint M^A of a matrix M w.r.t. the (·,·)_A inner product is given by

M^A = A^{−1} M^T A     (5.20)

Note that M is A-selfadjoint iff M^T A = A M ⇔ A^{−1/2} M^T A^{1/2} = A^{1/2} M A^{−1/2}, i.e., if M̂ = A^{1/2} M A^{−1/2} is symmetric.

21 Vectors x and matrices M are understood in original coordinates in R^n. One could also use transformed coordinates; i.e., x = A^{−1/2} A^{1/2} x ∈ R^n has coordinate vector x̂ = A^{1/2} x with respect to the canonical A-orthogonal basis A^{−1/2}, and a linear mapping represented by M ∈ R^{n×n} is represented by M̂ = A^{1/2} M A^{−1/2} in these new coordinates. This notation is used in (5.18), (5.19) and in Exercise 5.14.




b) An ‘A-orthogonal’ pair of vectors x, y, satisfying

(x, y)_A = (A x, y) = (x̂, ŷ) = 0

is also called [A-]conjugate. A-orthogonality of a linear mapping represented by a matrix Q is defined in the usual way:

(Q x, Q y)_A ≡ (x, y)_A,   i.e.,   Q^T A Q = A

Show that this is equivalent to Q^A Q = I, i.e., (Q^A Q)^∧ = Q̂^T Q̂ = I.

c) A matrix P satisfying

P^T A P = I,   i.e., the columns p_i of P are pairwise A-conjugate: (p_i, p_j)_A = (A p_i, p_j) = (p̂_i, p̂_j) = δ_{i,j}

is called [A-]conjugate. Show that this is equivalent to P^A P = A^{−1}. Explain the difference between an A-orthogonal and an A-conjugate matrix.

The diagonal part D of an SPD matrix A satisfies D > 0 because D_{i,i} = (A e_i, e_i) > 0. In the following theorem, a stronger property in the sense of ‘diagonal dominance’ is involved.

Theorem 5.15 Let A ∈ Rn×n be SPD. Then,

0 < A < (2/ω) D  ⇔  ρ(M^{Jac}_ω) < 1

Proof: M^{Jac}_ω = I − ω D^{−1}A is not symmetric but A-selfadjoint (see Exercise 5.14):

(M^{Jac}_ω)^A = A^{−1} (M^{Jac}_ω)^T A = I − ω D^{−1}A = M^{Jac}_ω

Equivalently, A^{1/2} M^{Jac}_ω A^{−1/2} = I − ω A^{1/2} D^{−1} A^{1/2} is symmetric with real spectrum. We have

σ(M^{Jac}_ω) = σ(A^{1/2} M^{Jac}_ω A^{−1/2}) = σ(I − ω A^{1/2} D^{−1} A^{1/2})

From Lemma 5.13, (iv) we conclude

ρ(M^{Jac}_ω) < 1  ⇔  −I < I − ω A^{1/2} D^{−1} A^{1/2} < I  ⇔  0 < ω A^{1/2} D^{−1} A^{1/2} < 2I
⇔  0 < ω D^{−1} < 2 A^{−1}  ⇔  0 < A < (2/ω) D

Therefore, under the condition of the theorem, the damped Jacobi iteration is contractive in the energy norm: for the error e_k = x_k − x_* we have

∥A^{1/2} e_{k+1}∥_2 = ∥A^{1/2} M^{Jac}_ω e_k∥_2 = ∥A^{1/2} M^{Jac}_ω A^{−1/2} A^{1/2} e_k∥_2 ≤ ∥A^{1/2} M^{Jac}_ω A^{−1/2}∥_2 ∥A^{1/2} e_k∥_2

or equivalently,

∥e_{k+1}∥_A ≤ ∥M^{Jac}_ω∥_A ∥e_k∥_A,   with   ∥M^{Jac}_ω∥_A < 1

This strictly contractive behavior does, in general, not hold in the Euclidean norm. 22

Theorem 5.15 shows that, due to D > 0, choosing the relaxation parameter ω > 0 sufficiently small (depending on A) will guarantee that the damped Jacobi method converges for every SPD matrix A. Note, however, that an estimate for the rate of convergence, or contractivity factor, i.e., for ρ(M^{Jac}_ω) = ∥M^{Jac}_ω∥_A, is not available from the theorem. For ω ↓ 0 we have M^{Jac}_ω = I − ω D^{−1}A → I; in the limit, convergence is assured but must be expected to be slow. We will return to the question of convergence rates later on.

The regime of damping parameters for which the damped Jacobi method converges depends on the problem. This is in contrast to the SOR method, which converges for arbitrary SPD matrices for any value ω ∈ (0, 2):

22 Also for other classes of iterative methods to be discussed later, energy estimates are often more natural (and easier to derive) in the SPD case.




Theorem 5.16 Assume ω ∈ R and A ∈ R^{n×n} SPD. Then ρ(M^{SOR}_ω) < 1 iff ω ∈ (0, 2).

Proof: The key ingredient of the proof is that A and its diagonal D are positive definite, and the argument is based on a direct investigation of the spectrum of M^{SOR}_ω. First we extend the inner product (u, v) = u^T v (defined on R^n) in the standard way to C^n: (u, v) = u^H v.

Step 1. Claim: If B ∈ R^{n×n} is symmetric, then (Bz, z) ∈ R for all z ∈ C^n; furthermore, if B is SPD, then (Bz, z) > 0 for all 0 ≠ z ∈ C^n. To see this, write z = x + iy with real vectors x, y ∈ R^n. Then,

(Bz, z) = (B(x + iy), (x + iy)) = (Bx, x) + (By, y) + i((Bx, y) − (By, x)) = (Bx, x) + (By, y) ∈ R

by symmetry of B .

Step 2. Claim: If L ∈ R^{n×n}, then for every z ∈ C^n we have Re ((L − L^T)z, z) = 0. To see this, we note (L^T z, z) = (z, Lz) = conj((Lz, z)). Hence, ((L − L^T)z, z) = (Lz, z) − (L^T z, z) = (Lz, z) − conj((Lz, z)) ∈ iR.

Step 3. Let λ ∈ σ(M^{SOR}_ω) with corresponding eigenvector 0 ≠ z ∈ C^n. We rewrite the eigen-equation M^{SOR}_ω z = (I − ω(D + ωL)^{−1}A) z = λ z in the form

[(D + ωL) − ωA] z = λ (D + ωL) z

or equivalently,

[(1 − ω/2) D − (ω/2) A + (ω/2)(L − L^T)] z = λ [(1 − ω/2) D + (ω/2) A + (ω/2)(L − L^T)] z

Taking the inner product with z and abbreviating

a = (Az, z),   d = (Dz, z),   l = (1/i) ((L − L^T)z, z)

we obtain

(1 − ω/2) d − (ω/2) a + (ω/2) i l = λ [ (1 − ω/2) d + (ω/2) a + (ω/2) i l ]     (5.21)

Step 4. From Steps 1 and 2, we know that a, d, l appearing in (5.21) satisfy a, d > 0, l ∈ R. Hence we can compute

|λ|² = ( [(1 − ω/2) d − (ω/2) a]² + (ω/2)² l² ) / ( [(1 − ω/2) d + (ω/2) a]² + (ω/2)² l² )     (5.22)

and arrive at the conclusion

|λ| < 1  ⇔  |(1 − ω/2) d − (ω/2) a| < |(1 − ω/2) d + (ω/2) a|     (5.23)

Step 5. ‘⇐’: Let ω ∈ (0, 2). Then (1 − ω/2) d and (ω/2) a are positive. Then the right-hand side statement in (5.23) is true, i.e., the eigenvalue λ under consideration satisfies |λ| < 1. Since this is true for all eigenvalues, we conclude ρ(M^{SOR}_ω) < 1.

Step 6. ‘⇒’: We proceed by contraposition. Let ω ∈ R \ (0, 2). Then either (1 − ω/2) d ≥ 0 and (ω/2) a ≤ 0, or (1 − ω/2) d ≤ 0 and (ω/2) a ≥ 0. Since |α − β| ≥ |α + β| if α, β ∈ R have opposite signs or one of them vanishes, we conclude from (5.22) that all eigenvalues λ ∈ σ(M^{SOR}_ω) must satisfy |λ| ≥ 1. A fortiori, ρ(M^{SOR}_ω) ≥ 1.

Note that this proof does not provide an estimate for ∥M^{SOR}_ω∥ in some standard norm.




Exercise 5.17 Let A, N be SPD and consider the iteration x_{k+1} = x_k + N(b − A x_k). Set W = N^{−1} and let M = I − W^{−1}A be the iteration matrix. Assume

2W > A > 0     (5.24)

a) Show: ρ(M) = ∥M∥_A = ∥M∥_W < 1.

b) If for some 0 < λ ≤ Λ there holds

0 < λW ≤ A ≤ ΛW     (5.25)

then σ(M) ⊂ [1 − Λ, 1 − λ], and thus,

ρ(M) ≤ max{|1 − λ|, |Λ − 1|}

c) Show: for ω ∈ (0, 1], the damped SSOR method (5.17) converges for all SPD matrices A.

Convergence of SSOR(ω) .

Exercise 5.17 shows that SSOR(ω) converges for ω ∈ (0, 1]. In fact, it converges for all ω ∈ (0, 2), as the following exercises show.

Exercise 5.18 Let A be SPD and denote by M_ω := M^{SOR}_ω = I − ω(D + ωL)^{−1}A the iteration matrix of the (forward) SOR method and by M̃_ω = I − ω(D + ωL^T)^{−1}A the iteration matrix of the backward SOR method.

a) Show: the iteration matrix M^{SSOR}_ω of the damped SSOR method (see (5.17)) satisfies M^{SSOR}_ω = M̃_ω M_ω.

b) Show: M̃_ω is the adjoint of M_ω with respect to the (·,·)_A inner product, i.e., M̃_ω = M_ω^A. Conclude that σ(M^{SSOR}_ω) ⊂ R^+_0, i.e., the spectrum is non-negative.

c) Using a) and b), show: ∥M^{SSOR}_ω∥_A = ∥M_ω∥²_A.

Exercise 5.19

a) Let A be SPD and consider the iteration xk+1 = xk +N(b−Axk) . Let W = N−1 . Assume

W +W T > A (5.26)

Show: the iteration matrix M = I − NA satisfies ∥M∥A < 1 , i.e., monotone convergence in the energy norm.

Hint: Use (5.20), express N +NT by means of W +W T and show

∥Mx∥2A = (Mx,Mx)A = (MA Mx, x)A < (x, x)A = ∥x∥2A for all x ∈ Rn

b) Show: for ω ∈ (0, 2) , the iteration matrix MSORω of the damped Gauss-Seidel (SOR) method satisfies (5.26).

Conclude that ∥MSORω ∥A < 1 for all ω ∈ (0, 2) . Using Exercise 5.18, conclude that the SSOR(ω) method

converges for every ω ∈ (0, 2) .




Jacobi and Gauss-Seidel for diagonally dominant matrices.

Definition 5.20 ([ir]reducible matrix) A ∈ K^{n×n} is called reducible if there exists a permutation of the indices (i.e., a permutation matrix P) such that

P^T A P = [ A_1  A_2 ]
          [  0   A_3 ]     (5.27)

where A_1 and A_3 are square matrices with size(A_1) ≠ 0, size(A_3) ≠ 0. Matrices that cannot be transformed in this way are called irreducible.

We note that the simultaneous permutation of columns and rows of A, effected by forming P^T A P, implies that A_1 and A_3 necessarily are square matrices. Reducibility of A means that the system Ax = b can be split into two subsystems where the subsystem involving A_3 can be solved independently. On the other hand, irreducibility means that the system is ‘fully coupled’.

Definition 5.21 (directed adjacency graph); cf. Section 3. Let A ∈ K^{n×n}. The graph

G = G(A) = (V, E)   with   V = {1, ..., n}   and   E = {(i, j) : A_{i,j} ≠ 0}

is called the (directed) adjacency graph of A .

The index j is said to be adjacent to i if (i, j) ∈ E. The index j is said to be connected to i if there exists a chain of indices i_1, i_2, ..., i_k such that (i, i_1), (i_1, i_2), ..., (i_{k−1}, i_k), (i_k, j) ∈ E.

Lemma 5.22 A matrix A ∈ K^{n×n} is irreducible iff each index j ∈ {1, ..., n} is connected to each i ∈ {1, ..., n}.

Proof: Let A be reducible. Then, after relabeling (permutation) of the indices it takes the form as on the right-hand side of (5.27), with A_1 ∈ K^{|N_1|×|N_1|}, A_3 ∈ K^{|N_2|×|N_2|}, where 23 N_1 = {1, ..., n_1} and N_2 = {n_1+1, ..., n}. Let j ∈ N_1 and i ∈ N_2. We claim: j is not connected to i. Suppose otherwise. Then there exists a sequence (i, i_1), (i_1, i_2), ..., (i_k, j) ∈ E, i.e., every A_{i_l, i_{l+1}} ≠ 0 for l = 0...k (we set i_0 = i, i_{k+1} = j). Since i ∈ N_2, j ∈ N_1 and N_2 ∪ N_1 = {1, ..., n}, there must exist at least one pair (i_l, i_{l+1}) with i_l ∈ N_2 and i_{l+1} ∈ N_1. However, the structure (5.27) implies A_{i′,j′} = 0 for i′ ∈ N_2, j′ ∈ N_1, which leads to the desired contradiction.

Conversely, suppose the existence of j and i such that j is not connected to i. To show that A is reducible, we distinguish the cases j ≠ i and j = i.

– Case j ≠ i: We partition the index set: {1, ..., n} = N_1 ∪ N_2, where N_2 = {i} ∪ {i′ : i′ is connected to i} and N_1 = {1, ..., n} \ N_2. We claim: A_{i′,j′} = 0 for all i′ ∈ N_2 and j′ ∈ N_1. Suppose otherwise. Then there exist i′ ∈ N_2 and j′ ∈ N_1 such that A_{i′,j′} ≠ 0, i.e., (i′, j′) ∈ E; thus, since i′ is connected to i (or i′ = i), we conclude that j′ is connected to i. In other words: j′ ∈ N_2, which contradicts j′ ∈ {1, ..., n} \ N_2. Since j ∈ N_1 and i ∈ N_2, both sets N_1, N_2 are non-empty, and we may renumber the indices such that those of N_1 are listed first and then those of N_2 to obtain the desired structure (5.27).

23 Indices are already relabeled here.




– Case j = i: As in the case j ≠ i, we define the sets N_2 = {i′ : i′ is connected to i} and N_1 = {1, ..., n} \ N_2. Now i ∈ N_1, so N_1 ≠ ∅. If N_2 ≠ ∅, then we may reason as in the first case that A_{i′,j′} = 0 for all i′ ∈ N_2 and j′ ∈ N_1. It therefore remains to consider the case N_2 = ∅. This means A_{i,i′} = 0 for all i′ ∈ {1, ..., n}, i.e., the matrix A has a null row; renumbering the unknowns such that i appears last guarantees again that we obtain the structure (5.27), with A_1 ∈ K^{(n−1)×(n−1)} and A_3 = 0 ∈ K^{1×1}.

Definition 5.23 (strict and irreducible diagonal dominance)

A matrix A ∈ Kn×n is called strictly diagonally dominant if

Σ_{j=1, j≠i}^{n} |A_{i,j}| < |A_{i,i}|   ∀ i ∈ {1, ..., n}     (5.28)

A is called irreducibly diagonally dominant if

(i) A is irreducible

(ii) Σ_{j=1, j≠i}^{n} |A_{i,j}| ≤ |A_{i,i}|   ∀ i ∈ {1, ..., n}

(iii) ∃ i ∈ {1, ..., n} such that Σ_{j=1, j≠i}^{n} |A_{i,j}| < |A_{i,i}|

Exercise 5.24 Show that the matrices for the 1D and 2D Poisson problem (i.e., the matrices of Examples 2.1 and 2.2) are irreducibly diagonally dominant.

Theorem 5.25 Let A ∈ Kn×n be strictly diagonally dominant or irreducibly diagonally dominant. Then

ρ(MJac) < 1 and ρ(MGS) < 1

Proof: The proof is based on studying the error propagation e_{k+1} = M e_k and careful estimation of ∥M^{Jac}∥_∞ and ∥M^{GS}∥_∞, exploiting diagonal dominance (and irreducibility).

Step 1. We show ρ(M^{Jac}) < 1 for A strictly diagonally dominant. From M^{Jac} = I − D^{−1}A we get

(M^{Jac} e)_i = − Σ_{j=1, j≠i}^{n} (A_{i,j}/A_{i,i}) e_j,   i = 1...n

Next,

∥M^{Jac}∥_∞ = max_{i=1...n} Σ_{j=1}^{n} |M^{Jac}_{i,j}| = max_i Σ_{j=1, j≠i}^{n} |A_{i,j}/A_{i,i}| < 1     (5.29)

thus, ρ(MJac) ≤ ∥MJac∥∞ < 1 .

Step 2. We show ρ(M^{GS}) < 1 for A strictly diagonally dominant. Let λ be an eigenvalue of M^{GS} = I − (D + L)^{−1}A = −(D + L)^{−1} U with corresponding eigenvector x. Without loss of generality we assume ∥x∥_∞ = 1. Then, −Ux = λ(D + L)x, i.e.,

− Σ_{j>i} A_{i,j} x_j = λ A_{i,i} x_i + λ Σ_{j<i} A_{i,j} x_j,   i = 1...n     (5.30)




This implies

|λ| ≤ ( Σ_{j>i} |A_{i,j}| |x_j| ) / ( |A_{i,i}| |x_i| − Σ_{j<i} |A_{i,j}| |x_j| ),   i = 1...n

Considering now the index i with |xi| = ∥x∥∞ = 1 , we obtain

|λ| ≤ ( Σ_{j>i} |A_{i,j}| ) / ( |A_{i,i}| − Σ_{j<i} |A_{i,j}| )
    = ( Σ_{j>i} |A_{i,j}| ) / ( Σ_{j>i} |A_{i,j}| + [ |A_{i,i}| − Σ_{j>i} |A_{i,j}| − Σ_{j<i} |A_{i,j}| ] ) < 1     (5.31)

since the term in square brackets is positive due to strict diagonal dominance.

Step 3. This step is inserted mainly to illustrate how the irreducibility comes into play. An alternative, more compact argument based on reasoning by contradiction will be given for the Gauss-Seidel iteration in the following step.

We consider the Jacobi iteration for an irreducibly diagonally dominant matrix A. From e_{k+1} = M^{Jac} e_k we have

|(e_{k+1})_i| = | Σ_{j≠i} (A_{i,j}/A_{i,i}) (e_k)_j | ≤ ∥e_k∥_∞   ∀ i ∈ {1, ..., n}     (5.32)

Furthermore, there exists an i ∈ {1, ..., n} such that for some γ < 1:

|(e_{k+1})_i| = | Σ_{j≠i} (A_{i,j}/A_{i,i}) (e_k)_j | ≤ γ ∥e_k∥_∞     (5.33)

We now consider an i′ such that (i′, i) ∈ E, i.e., A_{i′,i} ≠ 0. Here,

|(e_{k+2})_{i′}| = | Σ_{j≠i′} (A_{i′,j}/A_{i′,i′}) (e_{k+1})_j |
  ≤ Σ_{j∉{i,i′}} |A_{i′,j}/A_{i′,i′}| |(e_{k+1})_j| + |A_{i′,i}/A_{i′,i′}| |(e_{k+1})_i|
  ≤ Σ_{j∉{i,i′}} |A_{i′,j}/A_{i′,i′}| ∥e_{k+1}∥_∞ + |A_{i′,i}/A_{i′,i′}| γ ∥e_k∥_∞
  ≤ ( Σ_{j≠i′} |A_{i′,j}/A_{i′,i′}| + (γ − 1) |A_{i′,i}/A_{i′,i′}| ) ∥e_k∥_∞ ≤ γ′ ∥e_k∥_∞

(here we used ∥e_{k+1}∥_∞ ≤ ∥e_k∥_∞, Σ_{j≠i′} |A_{i′,j}/A_{i′,i′}| ≤ 1, γ − 1 < 0, and |A_{i′,i}/A_{i′,i′}| > 0)

for a suitable γ′ < 1. We conclude that for all indices i′ such that i is adjacent to i′ we obtain |(e_{k+2})_{i′}| ≤ γ′ ∥e_k∥_∞. Continuing in this fashion, we can exploit irreducibility to conclude that after n steps (at the latest!) ∥e_{k+n}∥_∞ ≤ γ″ ∥e_k∥_∞ for some γ″ < 1 which depends only on A.

This shows that the Jacobi method converges for irreducibly diagonally dominant matrices. In particular, the arguments even show ∥(M^{Jac})^n∥_∞ < 1 and therefore ρ(M^{Jac}) = (ρ((M^{Jac})^n))^{1/n} < 1.

Step 4. We consider the Gauss-Seidel iteration for an irreducibly diagonally dominant matrix A. Here, the reasoning that led to (5.31) implies |λ| ≤ 1 for all eigenvalues of M^{GS}, i.e., ρ(M^{GS}) ≤ 1. We will show that the assumption ρ(M^{GS}) = 1 leads to a contradiction. Suppose the existence of an eigenvalue λ of M^{GS} with |λ| = 1. Let x be a corresponding eigenvector with ∥x∥_∞ = 1. Let i be such that ∥x∥_∞ = |x_i|. Then (5.30) implies for such i:

|A_{i,i}| = |λ| |A_{i,i}| |x_i| = | λ Σ_{j<i} A_{i,j} x_j + Σ_{j>i} A_{i,j} x_j | ≤ Σ_{j≠i} |A_{i,j}| |x_j| ≤ Σ_{j≠i} |A_{i,j}| ≤ |A_{i,i}|     (5.34)




We conclude that all ‘≤’ are in fact ‘=’. Thus: Σ_{j≠i} |A_{i,j}| |x_j| = Σ_{j≠i} |A_{i,j}|. This implies |x_j| = 1 for all j with A_{i,j} ≠ 0, i.e., all j with (i, j) ∈ E. By the irreducibility and a simple induction argument we then obtain |x_j| = 1 for all j ∈ {1, ..., n}. Thus, (5.34) is true for all i ∈ {1, ..., n}. In particular, the last equality implies Σ_{j≠i} |A_{i,j}| = |A_{i,i}| for all i ∈ {1, ..., n}, which contradicts the assumption that there exists an index i with Σ_{j≠i} |A_{i,j}| < |A_{i,i}|.

Remark 5.26 For the strictly diagonally dominant case, (5.29) provides an explicit estimate for ∥M^{Jac}∥_∞ in terms of the matrix entries, which depends on the ‘degree of diagonal dominance’.

A more detailed analysis shows that for diagonally dominant and irreducibly diagonally dominant matrices, we have ρ(M^{GS}) ≤ ρ(M^{Jac}) < 1, i.e., the Gauss-Seidel method converges faster than the Jacobi method; cf. e.g. [2, 10, 12]. For certain (other) types of matrices, this can even be quantified: for example, for consistently ordered matrices (see Section 5.3 below) one can show ρ(M^{GS}) = (ρ(M^{Jac}))² if ρ(M^{Jac}) < 1.

It is a good rule of thumb 24 to expect the Gauss-Seidel method to be superior to the Jacobi method.

5.3 Model problem analysis and consistent ordering

A shortcoming of our results so far is that they do not provide useful quantitative estimates for the convergence rate ρ(M^{GS}) of the Gauss-Seidel method. In fact, it is not so clear that the introduction of the relaxation parameter for the Gauss-Seidel method can improve things substantially. For a class of matrices with a special [block-]band structure we will see that it is possible to choose the relaxation parameter ω in such a way that the convergence is significantly sped up.

In the literature, matrices to which this special type of analysis applies are called consistently ordered (Def. 5.27 below). This definition is motivated by the special structure of the matrix A for the Poisson problem in 2D (see Example 2.2). In fact, the 2D Poisson problem provides the example par excellence of a consistently ordered matrix. This special property is usually not retained for more general finite difference matrices; in this sense, the theory based on consistent ordering is sort of a model problem analysis. From a historical point of view, this analysis was a significant step towards a deeper understanding of the convergence properties of SOR, at least for a highly relevant class of application problems.

In a nutshell, the results are:

1. The (by far) most prominent example of a consistently ordered matrix is the matrix arising from the 2D Poisson problem (Example 2.2).

2. For consistently ordered matrices, the Gauss-Seidel method (i.e., ω = 1) converges at twice the rate of the Jacobi method (if it converges): ρ(M^{GS}) = (ρ(M^{Jac}))² (see Corollary 5.32).

3. For consistently ordered matrices, the optimal relaxation parameter ω_opt for the SOR (relaxed Gauss-Seidel) method is available explicitly in terms of β = ρ(M^{Jac}) < 1 (see Thm. 5.33): ω_opt = 2/(1 + √(1 − β²)) and ρ(M^{SOR}_{ω_opt}) = ω_opt − 1. For the case β close to 1 one sees that the optimal relaxation leads to a significant improvement: let β = 1 − δ for some (small) δ > 0. Then ω_opt = 2 − O(√δ) and ρ(M^{SOR}_{ω_opt}) = 1 − O(√δ).

An illustration of the performance of the optimally relaxed SOR method for the 2D Poisson problem can be found below in this section. In particular, we will see that ρ(M^{Jac}) = 1 − O(h²), whereas ρ(M^{SOR}_{ω_opt}) = 1 − O(h).

24 Note: a rule of thumb is not a mathematical truth.




Consistent ordering.

Definition 5.27 (consistent ordering)

A matrix A ∈ K^{n×n} with diagonal D, strict lower part L and strict upper part U is said to be consistently ordered if the eigenvalues of A(z) = z D^{−1}L + (1/z) D^{−1}U are independent of z ∈ C \ {0}.

Example 5.28 A first class of matrices that are consistently ordered are block tridiagonal matrices of the form

[ D_1     T_{12}                          ]
[ T_{21}  D_2     T_{23}                  ]
[         T_{32}  D_3     ⋱               ]
[                 ⋱       ⋱     T_{p−1,p} ]
[                       T_{p,p−1}   D_p   ]

where the D_i are diagonal matrices. To see this, it suffices to see that D^{−1}L + D^{−1}U and z D^{−1}L + (1/z) D^{−1}U are similar matrices for all 0 ≠ z ∈ C:

z D^{−1}L + (1/z) D^{−1}U = X (D^{−1}L + D^{−1}U) X^{−1}   for   X = Diag(I, zI, z²I, ..., z^{p−1}I)

Tridiagonal matrices (e.g., those arising in Example 2.1) fit into the setting of Example 5.28. The 2D situation of Example 2.2 does not. However, the matrix of Example 2.2 is also consistently ordered, as the following example shows:

Example 5.29 Block tridiagonal matrices whose diagonal blocks T_i are tridiagonal matrices and whose off-diagonal blocks are diagonal matrices are consistently ordered. To see this, consider such a matrix in the form

B = [ T_1     D_{12}                          ]
    [ D_{21}  T_2     D_{23}                  ]
    [         D_{32}  T_3     ⋱               ]
    [                 ⋱       ⋱     D_{p−1,p} ]
    [                       D_{p,p−1}   T_p   ]

We proceed as above by similarity transformations, with X = Diag(I, zI, z²I, ..., z^{p−1}I). A calculation shows

X B X^{−1} = [ T_1       z^{−1}D_{12}                                   ]
             [ z D_{21}  T_2           z^{−1}D_{23}                     ]
             [           z D_{32}      T_3           ⋱                  ]
             [                         ⋱             ⋱   z^{−1}D_{p−1,p}]
             [                              z D_{p,p−1}       T_p       ]




So far, we have neither exploited the assumption that the off-diagonal blocks are diagonal nor that the diagonal blocks are tridiagonal. The next similarity transformation is again done by a block diagonal matrix, C = Diag(C_1, C_2, ..., C_p), where the matrices C_i (all of the same size n ∈ N) are given by C_i = Diag(1, z, z², ..., z^{n−1}). Observing that the off-diagonal blocks D_{ij} are diagonal matrices then allows us to conclude

C X B X^{−1} C^{−1} = [ C_1 T_1 C_1^{−1}  z^{−1}D_{12}                                           ]
                      [ z D_{21}          C_2 T_2 C_2^{−1}  z^{−1}D_{23}                         ]
                      [                   z D_{32}          C_3 T_3 C_3^{−1}   ⋱                 ]
                      [                                     ⋱        ⋱          z^{−1}D_{p−1,p}  ]
                      [                                         z D_{p,p−1}    C_p T_p C_p^{−1}  ]

Next, the fact that the diagonal blocks T_i are tridiagonal matrices implies that, upon writing T_i = D_i + L_i + U_i (diagonal, lower, and upper part), we have C_i T_i C_i^{−1} = D_i + (1/z) U_i + z L_i. From this, it is easy to see that the original matrix B is consistently ordered.

Lemma 5.30 Let A ∈ Kn×n be consistently ordered. Then,

σ(αD−1L+ βD−1U) = σ(±√αβ (D−1L+D−1U))

for all α, β ∈ C .

Proof: Let α, β ≠ 0 and set z = ±√(α/β). Then, α D^{−1}L + β D^{−1}U = ±√(αβ) (z D^{−1}L + (1/z) D^{−1}U). By consistent ordering of A, σ(z D^{−1}L + (1/z) D^{−1}U) = σ(D^{−1}L + D^{−1}U), and we conclude σ(α D^{−1}L + β D^{−1}U) = σ(±√(αβ) (D^{−1}L + D^{−1}U)).

For α = 0 (or: β = 0), α D^{−1}L + β D^{−1}U is strictly upper triangular (or: strictly lower triangular); hence σ(α D^{−1}L + β D^{−1}U) = {0} = σ(±√(αβ) (D^{−1}L + D^{−1}U)).

We note that the expression D^{−1}(L + U), which appears in the definition of consistent ordering (for z = 1), corresponds (up to the sign) to the iteration matrix M^{Jac} of the Jacobi method. It may therefore be not all that surprising to see that σ(M^{Jac}) appears in the following results. The basis for the Theorem of D. Young (Thm. 5.33) is the following result, which links σ(M^{SOR}_ω) to σ(M^{Jac}):

Lemma 5.31 Let A ∈ Kn×n be consistently ordered. Then:

λ ∈ σ(M^{SOR}_ω)  ⇔  there exists μ ∈ σ(M^{Jac}) such that λ, μ satisfy (5.36),     (5.35)

where

(λ + ω − 1)² = ω² λ μ²     (5.36)

Proof: We define L′ = D^{−1}L and U′ = D^{−1}U. Let λ ∈ σ(M^{SOR}_ω). A corresponding eigenvector x satisfies (I − ω(D + ωL)^{−1}A) x = λ x, i.e., (1 − ω − λ) D x = (ω U + λ ω L) x. Hence, 1 − ω − λ ∈ σ(ω U′ + λ ω L′). Appealing now to Lemma 5.30, we get 1 − ω − λ ∈ σ(ω U′ + λ ω L′) = σ(±√λ ω (L′ + U′)) = σ(±√λ ω M^{Jac}). Hence, for every λ ∈ σ(M^{SOR}_ω) there exists a μ ∈ σ(M^{Jac}) such that 1 − ω − λ = ±√λ ω μ, i.e., (5.36).

Conversely, let λ, μ satisfy (5.36). Since M^{Jac} = −(L′ + U′), by Lemma 5.30, −μ is also an eigenvalue of M^{Jac}; hence (5.36) implies that for both signs

λ_±(μ) = 1 − ω + (1/2) ω² μ² ± ω μ √(1 − ω + ω² μ²/4)     (5.37)




satisfy

λ + ω − 1 = ±√λ ω μ

for a suitably chosen sign. Since ∓√λ ω M^{Jac} = ±√λ ω (L′ + U′), we conclude with Lemma 5.30 that 1 − ω − λ = ∓√λ ω μ is an eigenvalue of ω U′ + λ ω L′, i.e., λ is an eigenvalue of M^{SOR}_ω.

A special case arises for ω = 1. Then, Lemma 5.31 implies for consistently ordered matrices ρ(M^{SOR}_1) = ρ(M^{GS}) = β² = ρ(M^{Jac})², i.e., the rate of convergence of the Gauss-Seidel method is twice that of the Jacobi method (if the Jacobi method converges):

Corollary 5.32 Let A be consistently ordered. Then, ρ(M^{GS}) = (ρ(M^{Jac}))². In particular, the Gauss-Seidel method converges iff the Jacobi method converges.

Proof: The assertion ρ(MGS) = (ρ(MJac))2 follows immediately from (5.36) for ω = 1 .

Theorem 5.33 [D.Young] Assume:

(i) ω ∈ (0, 2) ,

(ii) MJac has only real eigenvalues,

(iii) β = ρ(MJac) < 1 ,

(iv) A is consistently ordered.

Then: ρ(MSORω ) < 1 , with

ρ(M^{SOR}_ω) = 1 − ω + (1/2)ω²β² + ωβ √(1 − ω + ω²β²/4)   for 0 < ω ≤ ω_opt,
ρ(M^{SOR}_ω) = ω − 1                                        for ω_opt ≤ ω < 2     (5.38)

where ω_opt is given by

ω_opt = 2 / (1 + √(1 − β²))     (5.39)

The value ω_opt minimizes ρ(M^{SOR}_ω), which then takes the value ρ(M^{SOR}_{ω_opt}) = ω_opt − 1.

Proof: With Lemma 5.31 at hand, the proof of Theorem 5.33 is conceptually easy yet tedious. Denoting by λ_±(μ) the two solutions from (5.37), we will check that: (i) max_{μ∈[−β,β]} |λ_±(μ)| is bounded by the right-hand side rhs of (5.38), and (ii) the choice μ = ±β realizes the upper bound, i.e., |λ_±(β)| ≥ rhs. (Note that the assumption that σ(M^{Jac}) is real with ρ(M^{Jac}) = β implies σ(M^{Jac}) ⊂ [−β, β].) Several cases may occur, which are dealt with in turn.

Step 1. As a preparation, we note that the right-hand side rhs of (5.38) satisfies

|ω − 1| ≤ rhs

Instead of a formal argument, we refer the reader to Fig. 5.9, where the typical behavior of ω ↦ rhs is visible.




Figure 5.9: asymptotic contraction rate ρ(M^{SOR}_ω) of SOR in dependence of ω (the minimum value ω_opt − 1 is attained at ω = ω_opt).

Step 2. Let 1 − ω + ω²μ²/4 < 0. Then the solutions λ_±(μ) from (5.37) arise in the form of a pair of complex conjugate numbers. Hence

|λ_±|² = (1 − ω + (1/2)ω²μ²)² + | ωμ √(1 − ω + ω²μ²/4) |² = (1 − ω + (1/2)ω²μ²)² − ω²μ² (1 − ω + ω²μ²/4) = (ω − 1)²

independent of μ.

Step 3. Let 1 − ω + ω²μ²/4 ≥ 0. Then the two solutions λ_±(μ) are real, and

max{|λ_+(μ)|, |λ_−(μ)|} = 1 − ω + (1/2)ω²μ² + ω|μ| √(1 − ω + ω²μ²/4)

Step 4. From Steps 1–3 we conclude that ρ(M^{SOR}_ω) ≤ rhs. It remains to verify equality. This is achieved by choosing μ = β ∈ σ(M^{Jac}). The choice of ω_opt is such that ω ↦ 1 − ω + ω²β²/4 is positive for 0 < ω < ω_opt and negative for ω_opt < ω < 2. Hence, for 0 < ω ≤ ω_opt we conclude from Step 3 that ρ(M^{SOR}_ω) ≥ 1 − ω + (1/2)ω²β² + ωβ √(1 − ω + ω²β²/4). For ω_opt ≤ ω < 2 we infer from Step 2 that ρ(M^{SOR}_ω) ≥ |ω − 1| = ω − 1. Hence, we conclude ρ(M^{SOR}_ω) ≥ rhs.

Corollary 5.34 (SPD case) Let A be SPD and consistently ordered. Let ω ∈ (0, 2). Then σ(M^Jac) ⊂ ℝ and ρ(M^Jac) < 1, and therefore the assertions of Theorem 5.33 hold.

Proof: Theorem 5.16 implies ρ(M^GS) < 1. Corollary 5.32 then gives ρ(M^Jac) < 1. The spectrum of M^Jac is real, since σ(M^Jac) = σ(I − D^{-1}A) = σ(I − D^{-1/2} A D^{-1/2}) ⊂ ℝ, because I − D^{-1/2} A D^{-1/2} is a symmetric matrix.


Figure 5.10: 2D Poisson problem: Comparison of Jacobi, Gauss-Seidel, optimal SOR, and CG method (ℓ2-norm of the residual vs. iteration number, for problem sizes N = 100 and N = 10000).

Discussion of Young’s theorem.

By Example 5.29, the matrix for the 2D Poisson problem of Example 2.2 is consistently ordered. Hence, Young's theorem is applicable. We can compute σ(M^Jac) explicitly: According to Example 2.2, we have σ(A) = {4 sin²(iπ/(2(n+1))) + 4 sin²(jπ/(2(n+1))) : 1 ≤ i, j ≤ n}. Since the diagonal D of A is given by D = 4I, we easily compute

σ(M^Jac) = σ(I − D^{-1}A) = {1 − sin²(iπ/(2(n+1))) − sin²(jπ/(2(n+1))) : 1 ≤ i, j ≤ n}

From this we can conclude, with h = 1/(n+1) and some c > 0 independent of h (and thus of the problem size):

ρ(M^Jac) = 1 − c h² + O(h³)

This allows us to compute the optimal relaxation parameter

ω_opt = 2 / (1 + √(1 − (ρ(M^Jac))²)) = 2 − c′h + O(h²),   c′ > 0 suitable

We conclude that

ρ(M_{ω_opt}^SOR) = 1 − c′h + O(h²)

We note that the SOR-method with optimally chosen relaxation parameter leads to a significant improvement of the convergence rate: the contraction factor behaves like 1 − O(h) instead of 1 − O(h²).
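For concreteness, the following short Python sketch (our own illustration, not part of the original experiments; it merely evaluates the formulas derived above) computes ρ(M^Jac), ω_opt and the resulting SOR contraction rate for the model problem; the grid parameter n is free:

import numpy as np

def sor_parameters(n):
    """Spectral radii for the 2D Poisson model problem on an n x n interior grid."""
    h = 1.0 / (n + 1)
    beta = 1.0 - 2.0 * np.sin(np.pi * h / 2) ** 2        # rho(M_Jac) = cos(pi*h)
    omega_opt = 2.0 / (1.0 + np.sqrt(1.0 - beta ** 2))   # formula (5.39)
    rho_sor = omega_opt - 1.0                            # Theorem 5.33
    return beta, omega_opt, rho_sor

for n in (10, 100):
    beta, w, rho = sor_parameters(n)
    print(f"n={n:4d}: rho(Jac)={beta:.6f}  rho(GS)={beta**2:.6f}  "
          f"omega_opt={w:.4f}  rho(SOR)={rho:.6f}")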

Example 5.35 We consider the matrix A from Example 2.2 for the cases n = 10 (i.e., h ≈ 0.1) and n = 100 (i.e., h ≈ 0.01). The right-hand side b is chosen as b = (1, 1, . . . , 1)^T; x_0 = (1, 1, . . . , 1)^T. We compare the Jacobi, the Gauss-Seidel, the optimally relaxed SOR-method and the CG-method (discussed below).

Fig. 5.10 shows the residual (in the ∥·∥₂-norm) versus the iteration count. We note that the Jacobi and Gauss-Seidel methods converge (visible for n = 10, not visible for n = 100); indeed, the Gauss-Seidel method converges at twice the rate of the Jacobi method, and the optimally relaxed SOR-method is significantly faster. We also observe that the CG-method is vastly superior.


Remark 5.36 In practice, β is not known and therefore also ω_opt. A possible technique is as follows: choose ω < ω_opt (e.g., ω = 1). Perform a few SOR-steps and monitor the behavior of the iterates, ∥x_{k+1} − x_k∥₂. This gives an indication of ρ(M_ω^SOR) and therefore, using (5.38) (note: ω < ω_opt), of β. With the aid of (5.39) one can then get an improved estimate for ω_opt. As long as ω < ω_opt one can proceed in this fashion. Since (cf. Fig. 5.9) the function ω ↦ ρ(M_ω^SOR) has a very steep slope for ω < ω_opt, one should tend to choose ω slightly larger than an (estimated) ω_opt.

Classes of consistently ordered matrices; ‘Property A’.

The property of a matrix to be consistently ordered does depend on the ordering. It is therefore of interest to identify matrices A for which permutation matrices P exist such that P^T A P is consistently ordered. Examples of such matrices are those for which a suitable renumbering of the unknowns leads to block tridiagonal form, where the diagonal blocks are diagonal matrices (cf. Example 5.28); this is addressed by the term ‘Property A’.

Let us see under which conditions on the adjacency graph G such a structure can be achieved. To this end, let A be a block tridiagonal matrix with p diagonal blocks. This numbering of the unknowns corresponds to a partition of the set V of vertices of G into p pairwise disjoint sets S_i, i = 1 . . . p: V = S_1 ∪ S_2 ∪ · · · ∪ S_p. The fact that the diagonal blocks are diagonal matrices and that A is block tridiagonal implies:

1. No node of a set S_i is connected to another node within the same S_i.

2. Nodes of S_i are only connected to nodes of S_{i−1} or S_{i+1}.

Example 5.37 We illustrate the situation for the special case p = 2 in Fig. 5.11. In particular, we note that for the matrix A of the 2D Poisson problem, we can find a numbering which brings A to the desired form: The ‘red-black ordering’ (or: chequerboard ordering) as shown in the right part of Fig. 5.11 yields a partition of the indices that realizes the splitting S_1 ∪ S_2 with the desired property. The sparsity pattern of the reordered matrix is shown in Fig. 5.12; a construction of the corresponding permutation is sketched below. Since the reordered matrix is again consistently ordered, the convergence results related to Young's Theorem apply also to the reordered system, in particular for Gauss-Seidel and SOR.

We also note that, due to this reordering, the update steps for the (reordered) ‘black’ unknowns x_i in the first half step of the iteration can be computed independently, because the i-th equation involves only values of ‘red’ unknowns x_j, j ≠ i, available from the previous iteration step. In the second half step, the updates for the red unknowns are likewise independent of each other. Thus, red-black ordering also enables an efficient parallelization of the SOR steps for this example.

Figure 5.11: Left: illustration of Property A. Right: partitioning (‘red-black coloring’) for the model problem of Example 2.2 that realizes Property A.


Figure 5.12: Sparsity pattern for the 2D Poisson problem after red-black ordering (nz = 217).
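To make the reordering concrete, here is a small Python/SciPy sketch (our own illustration, with our own helper names) that sets up the 5-point Poisson matrix in lexicographic ordering and applies a red-black permutation; the diagonal blocks of the reordered matrix P^T A P are then diagonal matrices:

import numpy as np
import scipy.sparse as sp

def poisson2d(n):
    # 5-point stencil matrix on an n x n interior grid, lexicographic ordering
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()       # diagonal entries = 4

def red_black_permutation(n):
    idx = np.arange(n * n).reshape(n, n)
    colour = np.add.outer(np.arange(n), np.arange(n)) % 2   # 0 = red, 1 = black
    return np.concatenate([idx[colour == 0], idx[colour == 1]])

n = 7
A = poisson2d(n)
p = red_black_permutation(n)
A_rb = A[p, :][:, p]          # P^T A P, cf. Fig. 5.12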

Optimal damping parameters for Richardson and Jacobi (SPD case).

The choice of optimal damping parameters for the Richardson and Jacobi methods is much simpler. We start with the Richardson method (already considered in Exercise 5.12):

Lemma 5.38 Let A ∈ ℝ^{n×n} be SPD. Then the optimal damping parameter for the Richardson iteration (iteration matrix M_ω^Rich = I − ωA) is

ω_opt = 2 / (λ_max + λ_min),   with   ρ(M_{ω_opt}^Rich) = (λ_max − λ_min) / (λ_max + λ_min)

Proof: Since A is SPD, it is easy to calculate the spectrum of M_ω^Rich: σ(M_ω^Rich) = 1 − ω σ(A). Hence, for real ω, we get ρ(M_ω^Rich) = max{|1 − ω λ_max|, |1 − ω λ_min|}. Simple graphical considerations then show that ρ(M_ω^Rich) is minimal for ω = 2/(λ_min + λ_max).

One may proceed analogously for the Jacobi method. Assume that

λ D ≤ A ≤ Λ D

for some λ, Λ > 0. Then the optimal damping parameter for the Jacobi method is ω_opt = 2/(λ + Λ), leading to

ρ(M_{ω_opt}^Jac) = (Λ − λ) / (Λ + λ)
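As an illustration (our own sketch, not from the script), Lemma 5.38 can be checked numerically by comparing the formula for ω_opt with a brute-force search over ω for an arbitrary SPD test matrix:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)                  # some SPD test matrix (assumption)
lam = np.linalg.eigvalsh(A)
lmin, lmax = lam[0], lam[-1]

omega_opt = 2.0 / (lmin + lmax)
rho_opt = (lmax - lmin) / (lmax + lmin)

def rho_richardson(omega):                     # spectral radius of I - omega*A
    return np.max(np.abs(1.0 - omega * lam))

omegas = np.linspace(0.1 * omega_opt, 1.9 * omega_opt, 1000)
print(rho_opt, min(rho_richardson(w) for w in omegas))   # both values agree closely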


Block versions.

The Jacobi method and the Gauss-Seidel method can also be employed in block versions. Namely, let A be of the form

A = ( A_{1,1}  A_{1,2}  · · ·  A_{1,p} )
    ( A_{2,1}  A_{2,2}  · · ·  A_{2,p} )
    (   ...      ...    . . .    ...   )
    ( A_{p,1}  A_{p,2}  · · ·  A_{p,p} )

where the entries A_{i,j} are matrix blocks.

The block Jacobi method consists then in defining the block diagonal matrix

D = diag(A_{1,1}, A_{2,2}, . . . , A_{p,p})

and performing the iteration x_{k+1} = x_k + D^{-1}(b − A x_k). Of course, the block Gauss-Seidel and the block SOR methods can be defined in an analogous way. Convergence theories exist for these cases as well.

If A corresponds to a FD or FEM matrix like in the 2D Poisson example, with the original lexicographic ordering of the unknowns, then the diagonal blocks A_{i,i} are tridiagonal matrices, and inversion of D amounts to solving small tridiagonal systems, e.g. via Cholesky decomposition. The higher effort for such a block iteration usually pays off in form of a faster convergence rate.

In the context of FD methods, such a version of a block relaxation scheme is also called ‘line relaxation’, where the coupling of unknowns in one of the coordinate directions is retained. Another variant is ‘alternating direction line relaxation’, where the ordering of the unknowns is varied in an alternating fashion, similarly as in SSOR; cf. e.g. [16]. A sketch of one block Jacobi sweep is given below.
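A minimal Python/SciPy sketch of the block Jacobi iteration (our own illustration; it assumes the lexicographically ordered 2D Poisson matrix, so that the p = n diagonal blocks are the tridiagonal n×n matrices A_{i,i}, which are factorized once):

import scipy.sparse.linalg as spla

def block_jacobi(A, b, n, x0, num_iter=50):
    # A: sparse (n*n) x (n*n) Poisson matrix in CSR, x0: float array
    x = x0.copy()
    # pre-factorize the n tridiagonal diagonal blocks A_{i,i}
    blocks = [spla.splu(A[i*n:(i+1)*n, i*n:(i+1)*n].tocsc()) for i in range(n)]
    for _ in range(num_iter):
        r = b - A @ x                               # x_{k+1} = x_k + D^{-1}(b - A x_k)
        for i in range(n):
            x[i*n:(i+1)*n] += blocks[i].solve(r[i*n:(i+1)*n])
        # (a block Gauss-Seidel variant would recompute r block by block)
    return x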

6 Chebyshev Acceleration and Semi-iterative Methods

We have already pointed out some difficulties in choosing the optimal relaxation parameter ω_opt. Chebyshev acceleration and its variants are an alternative, not uncommon tool to accelerate the convergence of a sequence (x_k)_{k=0}^∞. We assume that this sequence is generated by the primary iteration

x_{k+1} = M x_k + N b      (6.1)

We assume ρ(M) < 1, such that the primary iteration (6.1) converges. We ask: Can we construct, for every k, an approximation y_k based on x_0, . . . , x_k such that the new sequence (y_k)_{k=0}^∞ features faster convergence towards the fixed point x^* of (6.1)? To this end, we make the ansatz for a secondary iteration

y_k = Σ_{m=0}^{k} a_{k,m} x_m      (6.2)

for some parameters a_{k,m} to be chosen. With the polynomials

p_k(t) = Σ_{m=0}^{k} a_{k,m} t^m


and

x_m = M^m x_0 + Σ_{ℓ=1}^{m} M^{m−ℓ} N b = M^m x_0 + q_m(M) N b

(with polynomials q_m ∈ P_m), the secondary iteration (6.2) can be written as

y_k = p_k(M) x_0 + q̃_k(M) N b,   q̃_k(M) = Σ_{m=0}^{k} a_{k,m} q_m(M)      (6.3)

with matrix polynomials p_k ∈ P_k. Of course we require that the parameters a_{k,m} be chosen such that a fixed point of (6.1) is reproduced, i.e., if we would consider (x_k)_{k=0}^∞ to be the constant sequence (x^*)_{k=0}^∞, then the sequence (y_k)_{k=0}^∞ defined by (6.2) should be the same constant sequence. That is, we require the consistency condition

1 = Σ_{m=0}^{k} a_{k,m} = p_k(1)   ∀ k ∈ ℕ_0

i.e., all y_k are weighted means of the x_m, m = 0 . . . k. Under this assumption, we can express the secondary error ẽ_k = y_k − x^* in terms of the primary error e_k = x_k − x^*:

ẽ_k = y_k − x^* = Σ_{m=0}^{k} a_{k,m} (x_m − x^*) = Σ_{m=0}^{k} a_{k,m} e_m = Σ_{m=0}^{k} a_{k,m} M^m e_0 = p_k(M) e_0

This formula for the error gives an indication of how the coefficients a_{k,m}, or, equivalently, the polynomials p_k should be chosen: we should choose p_k such that ∥p_k(M)∥ is small (minimal) in some norm of interest. Since sometimes information about the spectrum of M is available, we state:

• By Exercise 6.1, σ(p(M)) = p(σ(M)) for any matrix M and any polynomial p. Thus,

ρ(p(M)) = max{|p(λ)| : λ ∈ σ(M)}      (6.4)

• All normal matrices M satisfy ∥M∥₂ = ρ(M). Hence, for normal matrices M, where p(M) is also normal, there holds ρ(p(M)) = ∥p(M)∥₂ for all polynomials p.

• Let A be SPD and let M be A-selfadjoint. Then, analogously as for the symmetric case we have ρ(M) = ∥M∥_A, and ρ(p(M)) = ∥p(M)∥_A for all polynomials p.

These considerations suggest that a reasonable procedure is to seek p_k ∈ P_k (the linear space of polynomials of degree ≤ k) as the solution of the following minimization problem:

min_{p_k ∈ P_k, p_k(1)=1}  max_{λ ∈ σ(M)} |p_k(λ)|

Since this problem is still hard to solve, we will settle for less: If Γ ⊂ ℂ is a closed set such that σ(M) ⊂ Γ, then we could seek to solve the following minimization problem:

min_{p_k ∈ P_k, p_k(1)=1}  max_{z ∈ Γ} |p_k(z)|      (6.5)

Of course, this still requires some a priori knowledge about the location of the spectrum. Here we consider the case that σ(M) ⊂ Γ = [a, b], an interval on the real line. In this case, the minimization problem (6.5) can be solved explicitly (Corollary 6.3 ahead). As we will see in Section 6.2, the numerical realization can also be achieved in an efficient way.

Exercise 6.1 Let M be an arbitrary matrix, and let p be a polynomial. Show (e.g., using the Schur or Jordan form): σ(p(M)) = p(σ(M)).


Figure 6.13: Chebyshev polynomials of the first kind.

6.1 Chebyshev polynomials

The Chebyshev polynomials of the first kind, T_k ∈ P_k, are defined by the three-term recurrence

T_0(ξ) = 1,   T_1(ξ) = ξ,   T_{k+1}(ξ) = 2 ξ T_k(ξ) − T_{k−1}(ξ),   k ≥ 1      (6.6)

It can be checked (e.g., by induction) that these polynomials can be expressed in closed form:

T_k(ξ) = cos(k arccos(ξ)),                                  |ξ| ≤ 1
T_k(ξ) = ½ [ (ξ + √(ξ²−1))^k + (ξ + √(ξ²−1))^{−k} ],        |ξ| ≥ 1      (6.7)
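As a quick illustration (our own check, not part of the script), the recurrence (6.6) and the closed form (6.7) can be compared numerically:

import numpy as np

def cheb_recurrence(k, xi):
    t_old, t = np.ones_like(xi), xi.copy()
    if k == 0:
        return t_old
    for _ in range(k - 1):
        t_old, t = t, 2 * xi * t - t_old     # recurrence (6.6)
    return t

def cheb_closed(k, xi):
    out = np.empty_like(xi)
    inside = np.abs(xi) <= 1
    out[inside] = np.cos(k * np.arccos(xi[inside]))
    z = xi[~inside] + np.sqrt(xi[~inside] ** 2 - 1)
    out[~inside] = 0.5 * (z ** k + z ** (-k))
    return out

xi = np.linspace(-2, 2, 9)
print(np.max(np.abs(cheb_recurrence(5, xi) - cheb_closed(5, xi))))   # ~ 1e-15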

Among the numerous remarkable properties of Chebyshev polynomials, we note that they are the solutions of an optimization problem of the form considered in (6.5):

Theorem 6.2 Let [α, β] ⊂ ℝ be a non-empty interval, and let γ be any real scalar outside this interval. Then the minimum

min_{p ∈ P_k, p(γ)=1}  max_{t ∈ [α,β]} |p(t)|      (6.8)

is attained by the polynomial 25

p(t) = C_k(t) = T_k(1 + 2 (t−β)/(β−α)) / T_k(1 + 2 (γ−β)/(β−α))      (6.9)

Furthermore, the minimizer is unique.

25 For t ∈ [α, β] we have 1 + 2 (t−β)/(β−α) ∈ [−1, 1].

Ed. 2011 Iterative Solution of Large Linear Systems

Page 53: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

6.2 Chebyshev acceleration for σ(M) ⊂ (−1, 1) 49

Proof: See [2] or any textbook on numerical analysis. A detailed proof can also be found in [10, Section 7.3.3]. We sketch the existence argument (uniqueness follows along the same lines). By affine transformation (which does not change the L^∞-norm ∥·∥_∞), we may restrict ourselves to the standard case [α, β] = [−1, 1] and γ ∈ ℝ \ [−1, 1]. Then, C_k(t) = C T_k(t) with constant C = 1/T_k(γ). The Chebyshev polynomial T_k(ξ) = cos(k arccos(ξ)) attains the values ±1 at the points ξ_i = cos(iπ/k), i = 0 . . . k, and it alternates between 1 and −1 (i.e., T_k(ξ_i) and T_k(ξ_{i+1}) have opposite signs). Furthermore, ∥T_k∥_{L^∞(−1,1)} = 1, implying ∥C_k∥_{L^∞(−1,1)} = |C|.

Assume now the existence of π ∈ P_k such that ∥π∥_{L^∞(−1,1)} < ∥C_k∥_{L^∞(−1,1)} = |C| and π(γ) = 1. Then, the polynomial r = C_k − π changes sign k times in the interval [−1, 1], since sign r(ξ_i) = sign T_k(ξ_i) for i = 0 . . . k. Thus, r has at least k zeros in [−1, 1]. Additionally, r(γ) = 0. Hence, r ∈ P_k has at least k+1 zeros; thus, r ≡ 0, which leads to a contradiction.

It will be relevant to have quantitative bounds for the minimal value in (6.8):

Corollary 6.3 Under the assumptions of Theorem 6.2 and α < β < γ, we have

min_{p ∈ P_k, p(γ)=1}  max_{t ∈ [α,β]} |p(t)| = 1 / T_k(2 (γ−α)/(β−α) − 1) = 2c^k / (1 + c^{2k}),   c = (√κ − 1)/(√κ + 1),   κ = (γ − α)/(γ − β)

The same bound holds for the case γ < α < β, if κ is replaced by κ = (γ − β)/(γ − α).

Proof: The key is to observe that |T_k(ξ)| ≤ 1 for ξ ∈ [−1, 1]. This implies that the polynomial C_k from (6.9) satisfies

max_{t ∈ [α,β]} |C_k(t)| = 1 / |T_k(1 + 2 (γ−β)/(β−α))|

The assertion then follows from the explicit representation of T_k given in (6.7) and some manipulations (see, e.g., [10, Section 7.3.3] for details).

6.2 Chebyshev acceleration for σ(M) ⊂ (−1, 1)

We now assume σ(M) ⊂ (−1, 1) (convergent primary iteration with M having a real spectrum). Moreover, we assume that parameters −1 < α < β < 1 are known such that σ(M) ⊂ [α, β]. With these parameters α, β and γ = 1, we use the polynomials p_k(t) = C_k(t) explicitly given in Theorem 6.2 to define the secondary iteration (6.2). This results in a ‘Chebyshev-accelerated’ iteration scheme. Note that this is a consistent choice since C_k(1) = 1 for all k.

Improved convergence behavior of the Chebyshev iterates in self-adjoint cases.

To quantify the convergence behavior of the secondary Chebyshev iteration, we consider the case that the iteration matrix M is self-adjoint with respect to the energy product (·, ·)_B for some SPD matrix B, i.e., (Mx, y)_B ≡ (x, My)_B (M^B = M). Any such matrix has a real spectrum. Additionally, we now assume knowledge of ρ ∈ (0, 1) such that σ(M) ⊂ [−ρ, ρ]. We then take the polynomials p_k defining the secondary iteration (6.3) as p_k(t) = C_k(t) with α = −ρ, β = ρ and γ = 1, i.e.,

y_k = Σ_{m=0}^{k} a_{k,m} x_m = p_k(M) x_0 + d_k,   a_{k,m} = coefficients of p_k,      (6.10)

with

p_k(t) = C_k(t) = T_k(t/ρ) / T_k(1/ρ)      (6.11)


We then obtain from Corollary 6.3

ρ(p_k(M)) = max_{λ ∈ σ(M)} |p_k(λ)| ≤ max_{λ ∈ [−ρ,ρ]} |p_k(λ)| = 2c^k / (1 + c^{2k}) ≤ 2c^k,   c = (√κ − 1)/(√κ + 1),   κ = (1 + ρ)/(1 − ρ)

The assumption that M is B-selfadjoint implies ∥p(M)∥_B = ρ(p(M)) for all polynomials p. Hence, for the primary and secondary errors e_k, ẽ_k we have

(∥e_k∥_B / ∥e_0∥_B)^{1/k} ≤ ∥M^k∥_B^{1/k} ≤ ρ(M),     (∥ẽ_k∥_B / ∥e_0∥_B)^{1/k} ≤ ∥p_k(M)∥_B^{1/k} = (ρ(p_k(M)))^{1/k} ≤ 2^{1/k} c

In Remark 5.7, we have seen that the convergence factors

q_primary = limsup_{k→∞} (∥e_k∥_B / ∥e_0∥_B)^{1/k} ≤ ρ(M),     q_Cheb = limsup_{k→∞} (∥ẽ_k∥_B / ∥e_0∥_B)^{1/k} ≤ c

are good measures for the asymptotic behavior of the convergence speed. To compare q_primary with q_Cheb, let us assume that the parameter ρ is the best possible choice, i.e., ρ = ρ(M). The interesting case is ρ = 1 − δ for small δ > 0. We then have c = (1 − 1/√κ)/(1 + 1/√κ) and κ = (1 + ρ)/(1 − ρ). This leads to c = 1 − c′√δ with some c′ > 0. We therefore arrive at

q_primary ≤ 1 − δ,     q_Cheb ≤ 1 − c′√δ      (6.12)

In practice we have 26 q_primary = ρ, such that for small δ, Chebyshev acceleration will noticeably improve the convergence behavior.

Numerical realization.

At first glance, a drawback of the Chebyshev acceleration appears to be that the definition of y_k according to (6.10) requires knowledge of all primary iterates x_0, . . . , x_k. In view of storage restrictions, this may be difficult to realize in practice. However, a clever rewriting exploiting the three-term recurrence for the Chebyshev polynomials T_k removes this restriction.

Since the T_k(ξ) satisfy the three-term recurrence (6.6), so do the polynomials p_k from (6.11):

µ_{k+1} p_{k+1}(t) = (2/ρ) µ_k t p_k(t) − µ_{k−1} p_{k−1}(t),   k ≥ 1,   with µ_k = T_k(1/ρ)      (6.13)

with initial functions

p_0(t) = 1,   p_1(t) = T_1(t/ρ) / T_1(1/ρ) = (t/ρ) / (1/ρ) = t,

i.e., a_{0,0} = 1 and a_{1,0} = 0, a_{1,1} = 1. We also observe the important property

µ_{k+1} = (2/ρ) µ_k − µ_{k−1}      (6.14)

which can be seen directly from the properties of the Chebyshev polynomials, or by observing that our requirement p_k(1) = 1 for all k enforces this in view of (6.13).

26 Convince yourself that q_primary = ρ unless e_0 has no component in the invariant subspace associated with the dominant eigenvalue.


We are now ready to implement Chebyshev acceleration. With x^* = lim_{k→∞} x_k we obtain from the error equation ẽ_k = p_k(M) e_0:

y_{k+1} = x^* + ẽ_{k+1} = x^* + p_{k+1}(M) e_0 = x^* + (2µ_k/(ρ µ_{k+1})) M p_k(M) e_0 − (µ_{k−1}/µ_{k+1}) p_{k−1}(M) e_0
        = x^* + (2µ_k/(ρ µ_{k+1})) M ẽ_k − (µ_{k−1}/µ_{k+1}) ẽ_{k−1} = x^* + (2µ_k/(ρ µ_{k+1})) M (y_k − x^*) − (µ_{k−1}/µ_{k+1}) (y_{k−1} − x^*)
        = (2µ_k/(ρ µ_{k+1})) M y_k − (µ_{k−1}/µ_{k+1}) y_{k−1} + (1/µ_{k+1}) (µ_{k+1} I − (2/ρ) µ_k M + µ_{k−1} I) x^*

We now exploit the fact that x^* is a fixed point of the basic iteration, i.e., x^* = M x^* + N b. This together with (6.14) allows us to remove the appearance of x^* and to obtain a direct three-term recurrence for the y_k, without explicit use of the primary iterates x_k:

y_{k+1} = (2µ_k/(ρ µ_{k+1})) M y_k − (µ_{k−1}/µ_{k+1}) y_{k−1} + (2µ_k/(ρ µ_{k+1})) N b,   with y_0 = x_0, y_1 = x_1 = M x_0 + N b

We collect these findings in Alg. 6.1.

Algorithm 6.1 Chebyshev acceleration
% input: primary iteration x_{k+1} = M x_k + N b
% assumption: σ(M) ⊂ [−ρ, ρ] ⊂ (−1, 1)
1: Choose x_0 ∈ ℝ^n
2: y_0 = x_0, y_1 = x_1 = M x_0 + N b
3: µ_0 = 1, µ_1 = 1/ρ
4: for k = 1, 2, . . . do
5:   µ_{k+1} = (2/ρ) µ_k − µ_{k−1}   % use recursion for µ_k instead of definition µ_k = T_k(1/ρ)
6:   y_{k+1} = (2µ_k/(ρ µ_{k+1})) M y_k − (µ_{k−1}/µ_{k+1}) y_{k−1} + (2µ_k/(ρ µ_{k+1})) N b
7: end for
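A direct Python transcription of Alg. 6.1 might look as follows (our own sketch; the primary iteration is supplied through a function M_mv applying M and the constant vector Nb = N b, and rho is an available bound with σ(M) ⊂ [−ρ, ρ]):

def chebyshev_acceleration(M_mv, Nb, x0, rho, num_iter):
    y_old = x0.copy()
    y = M_mv(x0) + Nb                           # y1 = x1 = M x0 + N b
    mu_old, mu = 1.0, 1.0 / rho
    for _ in range(1, num_iter):
        mu_new = 2.0 / rho * mu - mu_old        # recursion (6.14)
        y_new = (2.0 * mu / (rho * mu_new)) * (M_mv(y) + Nb) \
                - (mu_old / mu_new) * y_old     # three-term recurrence for y_k
        y_old, y, mu_old, mu = y, y_new, mu, mu_new
    return y

# example usage for Jacobi as primary iteration on the Poisson matrix A (D = 4 I):
#   M_mv = lambda x: x - (A @ x) / 4.0 ;  Nb = b / 4.0 ;  rho ~ rho(M_Jac)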

6.3 Numerical example

We illustrate the Chebyshev acceleration again for the model problem of Example 2.2. Since we require the iteration matrix to have real spectrum, we would like it to be selfadjoint. For the model problem of Example 2.2, the iteration matrix M^Jac = I − D^{-1}A of the Jacobi method is indeed symmetric. 27 For Chebyshev acceleration of the Jacobi method we assume that the parameter ρ has been chosen as ρ = ρ(M^Jac).

We also employ Chebyshev acceleration for the Gauss-Seidel method. Since σ(M^GS) is not necessarily real, we consider its symmetric variant, i.e., SSOR with ω = 1. From Corollary 5.32 we know that ρ(M^GS) = ρ(M^SOR(1)) = (ρ(M^Jac))². For our calculations, we employ ρ = (ρ(M^Jac))^4, since we heuristically expect ρ(M^SSOR(1)) ≤ (ρ(M^GS))² (note: SSOR(1) is effectively two Gauss-Seidel steps). Fig. 6.14 shows the performance of various iterative methods including the Chebyshev accelerated versions of the Jacobi method and of SSOR(1). We observe that Chebyshev acceleration does indeed significantly improve the convergence.

27 For an arbitrary SPD matrix A , MJac is A - selfadjoint; cf. the proof of Theorem 5.15.


Figure 6.14: Chebyshev acceleration for the Poisson problem (Example 2.2) of the Jacobi method (ρ = ρ(M^Jac)) and of symmetric Gauss-Seidel (= SSOR(1)) with ρ = (ρ(M^Jac))^4; ℓ2-norm of the residual vs. iteration number, problem sizes N = 100 and N = 10000, compared with Jacobi, Gauss-Seidel, SOR(ω_opt) and CG.

A brief discussion is in order. We have already seen that ρ(M^Jac) = 1 − ch² + O(h³). Hence, from our discussion in (6.12) we infer that limsup_{k→∞} (ρ(p_k(M^Jac)))^{1/k} = 1 − c′h + O(h^{3/2}) for some suitable c′. For the Chebyshev acceleration based on the SSOR(1) method, we have chosen ρ = (ρ(M^Jac))^4; again we get ρ = 1 − c′′h² + O(h³) and thus conclude that the Chebyshev accelerated SSOR(1) has a contraction rate of 1 − c′′′h + O(h^{3/2}).

Note that, in practice, good estimates for the spectrum of the iteration matrix are required. To obtain a bound on the largest eigenvalue of the iteration matrix, one possibility is to perform a few steps of a simple vector iteration; i.e., the parameter ρ in Fig. 6.15 is estimated with 10 steps of a simple vector iteration with starting vector (1, 1, . . . , 1)^T.
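A possible realization of such an estimate (our own sketch; M_mv applies the iteration matrix, e.g. M_mv(x) = x − D^{-1}(A x) for Jacobi) is a simple normalized vector (power) iteration:

import numpy as np

def estimate_rho(M_mv, n, steps=10):
    v = np.ones(n)
    v /= np.linalg.norm(v)
    rho = 0.0
    for _ in range(steps):
        w = M_mv(v)
        rho = np.linalg.norm(w)     # norm-ratio estimate of |lambda|_max (since ||v|| = 1)
        v = w / rho
    return rho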


Figure 6.15: Chebyshev acceleration for the Poisson problem (Example 2.2) of symmetric Gauss-Seidel (= SSOR(1)) and SSOR(ω_opt) (ω_opt = optimal choice for SOR); problem sizes N = 400 and N = 10000, with ρ chosen as (ρ(M^Jac))², (ρ(M^Jac))^4, or estimated by vector iteration, and SOR(ω_opt) for comparison.

To illustrate the influence of the quality of available bounds for the spectrum of the iteration matrix, in Fig. 6.15 we show the performance of SSOR(1) with ρ = (ρ(M^Jac))² instead of ρ = (ρ(M^Jac))^4. We observe a significant deterioration of the performance, in spite of the fact that for h = 10^{-2} (i.e., N = 10000) the values (ρ(M^Jac))² and (ρ(M^Jac))^4 are rather close, suggesting that the Chebyshev acceleration is quite sensitive. We also note that, in the case N = 10000, the estimate obtained with merely 10 vector iterations is rather poor.

Figure 6.14 also contains the results for the Conjugate Gradient (CG) method introduced in the following chapter. The convergence behavior of CG observed for this example is more irregular, and typically of superlinear nature (with an acceleration effect in the later iteration steps). Figure 6.15 also contains the results for Chebyshev acceleration applied to SSOR(ω_opt), with ω_opt = optimal damping parameter for SOR according to Thm. 5.33. The combination of these techniques results in a very good performance, especially for a larger problem size.

In the numerical examples, we have used Alg. 6.1 to accelerate the SSOR iteration. Alg. 6.1 assumes σ(M) ⊂ [−ρ, ρ]. However, one could easily improve the performance of the accelerated SSOR method, since the spectrum in fact satisfies σ(M^SSOR(1)) ⊂ [0, 1), as the following exercise shows.

Exercise 6.4 Let A be SPD. Show: the spectrum σ(M^SSOR(1)) of the symmetric Gauss-Seidel method satisfies σ(M^SSOR(1)) ⊂ [0, 1). (Hint: Show A ≤ N^{-1} and appeal to Exercise 5.17.)

Exercise 6.5 Formulate the Chebyshev acceleration algorithm for the general case that the iteration matrix M satisfies σ(M) ⊂ [α, β]. Let M^SSOR be the iteration matrix for the symmetric Gauss-Seidel iteration applied to the matrix A for the 2D Poisson problem. Compare the convergence behavior of the SSOR(1) method with the accelerated version. For the latter, obtain an estimate ρ for ρ(M^SSOR) by a few steps of vector iteration. Note that Exercise 6.4 implies σ(M^SSOR) ⊂ [0, 1).


7 Gradient Methods

Motivated by the rather slow convergence of classical iterative methods, and in view of the sensitivity of acceleration with respect to estimated parameters, a variety of alternative methods have been proposed. We first consider methods applicable to SPD matrices A, which often arise as a result of the discretization of elliptic operators, e.g., the matrices of Examples 2.1, 2.2. Later we shall relax this condition and consider nonsymmetric equations. Recall that an SPD matrix A > 0 satisfies

x^T A x = (x, Ax) = (Ax, x) = (x, x)_A = ∥x∥²_A > 0   ∀ x ≠ 0

In particular, all eigenvalues of A are positive.

The aim of gradient methods is to minimize the quadratic functional ϕ : ℝ^n → ℝ,

ϕ(x) = ½ (Ax, x) − (b, x)      (7.1)

for some b ∈ ℝ^n. We compute the gradient of ϕ, 28 the negative residual,

∇ϕ(x) = Ax − b      (7.2)

Moreover, the Hessian matrix of ϕ is given by the Jacobian of ∇ϕ,

H(x) = J(∇ϕ(x)) ≡ A > 0

Thus, the functional ϕ has a unique minimum at x^*, the stationary point of ϕ, satisfying ∇ϕ(x^*) = Ax^* − b = 0. We conclude that

For A > 0 , solving Ax = b is equivalent to finding the minimum of ϕ(x) from (7.1).

Exercise 7.1 Let x^* be the solution of Ax = b, A > 0. Show that

ϕ(x) − ϕ(x^*) = ½ (x − x^*, x − x^*)_A = ½ ∥x − x^*∥²_A      (7.3)

and conclude again that ϕ has indeed a unique minimum.

Exercise 7.1 also shows that

Minimization of ϕ(x) over any subdomain D ⊂ ℝ^n is equivalent to minimization of the error e = x − x^* in the energy norm.

Remark 7.2 The equivalence of solving Ax = b with A > 0 and minimization of ϕ(x) from (7.1) is formally analogous to the variational formulation of an elliptic PDE. The simplest 2D example is the Poisson equation (2.2) with homogeneous Dirichlet boundary conditions, with the corresponding (energy) functional

ϕ(u) = ½ a(u, u) − (f, u)_{L²(Ω)},   a(u, v) = ∫_Ω ∇u · ∇v dx      (7.4)

Within the Sobolev space H¹₀(Ω), the unique minimum of ϕ(u) from (7.4) is attained for u^* = weak solution of the boundary value problem (2.2). In this context, ∇u may be considered as an analog of the discrete object A^{1/2}x, and ∥u∥_{H¹} = a(u, u)^{1/2} = ∥∇u∥_{L²(Ω)} is the corresponding energy norm. The integration-by-parts identity ∫_Ω (−∆u) v dx = ∫_Ω ∇u · ∇v dx = a(u, v) (for u, v ∈ H¹₀(Ω)) corresponds to the identity (Ax, y) = (A^{1/2}x, A^{1/2}y) = (x, y)_A.

28 For general A ∈ Rn×n we have ∇ϕ(x) = (ReA)x− b , where ReA = 12 (A+AT ) is the symmetric, or ‘real part’ of A .


Simple iterative schemes for minimizing ϕ from (7.1) are ‘descent methods’ and proceed as follows:

Starting from an initial vector x0 , the iteration is given by

xk+1 = xk + αk dk (7.5)

where the search direction d_k ∈ ℝ^n and the step length α_k ∈ ℝ are to be chosen. Typically, once a search direction d_k ≠ 0 is chosen, the step length α_k is taken as the minimizer of the one-dimensional minimization problem (‘line search’):

Find the minimizer α_k ∈ ℝ of α ↦ ϕ(x_k + α d_k).      (7.6)

This minimization problem is easily solved since it is quadratic and convex in α: Define ψ(α) = ϕ(x_k + α d_k). Then, the chain rule gives

ψ′(α) = ∇ϕ(x_k + α d_k)^T d_k = (A(x_k + α d_k) − b, d_k) = (−r_k + α A d_k, d_k) = α (A d_k, d_k) − (r_k, d_k)

where the residual r_k is, as before, defined as

r_k = −∇ϕ(x_k) = b − A x_k      (7.7)

Moreover, ψ(α) is, of course, convex: ψ″(α) ≡ (A d_k, d_k) > 0. The condition on a minimizer α_k of ψ is ψ′(α_k) = 0. Thus,

α_k = (d_k, r_k) / ∥d_k∥²_A      (7.8)

Exercise 7.3 Show that the choice of α_k in (7.8) leads to an approximation x_{k+1} = x_k + α_k d_k such that

ϕ(x_{k+1}) − ϕ(x_k) = −½ |(d_k, r_k)|² / ∥d_k∥²_A      (7.9)

Thus, if d_k is chosen such that (d_k, r_k) ≠ 0, then ϕ(x_{k+1}) < ϕ(x_k).

Remark 7.4 Some of the basic iterative methods are related to descent methods, or are descent methods ‘in disguise’:

For A ∈ ℝ^{n×n}, consider the case where the first n search directions d_0, . . . , d_{n−1} are chosen as the unit vectors, d_k = (0, . . . , 1, . . . , 0)^T with the 1 in the k-th position. Then we obtain from (7.8):

α_k = (d_k, r_k) / ∥d_k∥²_A = (r_k)_k / A_{k,k}

hence

x_{k+1} = x_k + ((r_k)_k / A_{k,k}) d_k

which exactly corresponds to the k-th update in the inner loop of a Gauss-Seidel step (5.9). Note that n of these updates results in a single Gauss-Seidel step, and further steps are obtained by repeating this procedure with cyclic choice of search directions (= unit vectors).

In the simplest version of the Richardson iteration (5.6) we simply take d_k = r_k and α_k = 1, but this does not minimize ϕ in the search direction r_k. The locally optimal choice (7.8) corresponds to the choice of the relaxation parameter α_k = ∥r_k∥²₂ / ∥r_k∥²_A and leads to the steepest descent method discussed in the sequel.


Figure 7.16: SD convergence path for a 2×2 matrix A with κ₂ = 3.

7.1 The Method of Steepest Descent (SD) for SPD systems

We need to specify the search direction d_k in the iteration (7.5). As shown in Exercise 7.3, most choices of d_k will lead to ϕ(x_{k+1}) − ϕ(x_k) < 0. The steepest descent method is a ‘greedy’ algorithm in that it chooses d_k as the local direction of steepest descent, which is given by

d_k = −∇ϕ(x_k) = r_k

This choice of search direction, together with the step length α_k given by (7.8), leads to the Steepest Descent Algorithm formulated in Alg. 7.1. 29

Algorithm 7.1 Steepest Descent
1: Choose x_0 ∈ ℝ^n
2: for k = 0, 1, . . . do
3:   r_k = b − A x_k
4:   α_k = (r_k, r_k) / (r_k, A r_k)
5:   x_{k+1} = x_k + α_k r_k
6: end for
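In Python, Alg. 7.1 can be transcribed directly; the only small deviation in the sketch below (our own) is that the residual is updated via r_{k+1} = r_k − α_k A r_k, which is equivalent in exact arithmetic and saves one matrix-vector product per step:

import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    x = x0.copy()
    r = b - A @ x
    for _ in range(max_iter):
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)       # step length (7.8) with d_k = r_k
        x = x + alpha * r
        r = r - alpha * Ar               # equivalent to recomputing r = b - A x
        if np.linalg.norm(r) < tol:
            break
    return x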

Orthogonal search directions.

A consequence of our choice for the step length αk in (7.8) is that

In SD, consecutive search directions are orthogonal to each other.

To see this, we observe

d_{k+1} = b − A x_{k+1} = b − A(x_k + α_k d_k) = r_k − α_k A d_k

Inserting the definition of α_k given by (7.8) gives, with d_k = r_k,

(d_{k+1}, d_k) = (r_k, d_k) − α_k (A d_k, d_k) = (r_k, r_k) − ((r_k, r_k)/(r_k, A r_k)) (r_k, A r_k) = 0      (7.10)

This characteristic can be seen in Fig. 7.16.

29 Basically, the idea of SD can be applied to any minimization problem.


Convergence of the SD method.

In order to quantify the speed of convergence of the steepest descent method, we use the Kantorovich inequality as technical tool. Let A be any real SPD matrix, and λ_max and λ_min its largest and smallest eigenvalues. Then, for all 0 ≠ x ∈ ℝ^n,

(Ax, x)(A^{-1}x, x) / (x, x)² ≤ (λ_min + λ_max)² / (4 λ_min λ_max)      (7.11)

For a proof see [16, Lemma 5.1]. 30

Now we study the magnitude of the error vectors e_k = x_k − x^* in the energy norm ∥·∥_A. Note that A e_k = −r_k, where r_k = b − A x_k is the k-th residual. From (7.3) and (7.9) we obtain with d_k = r_k:

½ ∥e_{k+1}∥²_A = ϕ(x_{k+1}) − ϕ(x^*) = (ϕ(x_{k+1}) − ϕ(x_k)) + (ϕ(x_k) − ϕ(x^*)) = −½ |(r_k, r_k)|² / ∥r_k∥²_A + ½ ∥e_k∥²_A      (7.12)

Now we use the Kantorovich inequality (7.11) and the identity A^{-1} r_k = −e_k to estimate

|(r_k, r_k)|² / ∥r_k∥²_A = |(r_k, r_k)|² / (A r_k, r_k) ≥ (4 λ_min λ_max / (λ_min + λ_max)²) (A^{-1} r_k, r_k) = (4 λ_min λ_max / (λ_min + λ_max)²) (e_k, A e_k) = (4 λ_min λ_max / (λ_min + λ_max)²) ∥e_k∥²_A

Together with (7.12) this yields

∥e_{k+1}∥²_A ≤ ∥e_k∥²_A (1 − 4 λ_min λ_max / (λ_min + λ_max)²) = ∥e_k∥²_A (λ_max − λ_min)² / (λ_max + λ_min)² = ((κ₂(A) − 1)/(κ₂(A) + 1))² ∥e_k∥²_A      (7.13)

with the condition number κ₂(A) = λ_max/λ_min. From this reasoning we obtain

Theorem 7.5 For the SD iteration applied to an SPD system Ax = b, the error in the energy norm is bounded by

∥e_k∥_A ≤ ((κ₂(A) − 1)/(κ₂(A) + 1))^k ∥e_0∥_A      (7.14)

I.e., the asymptotic convergence rate is bounded by (κ₂(A) − 1)/(κ₂(A) + 1).

Hence e_k → 0 as k → ∞. Evidently, the speed of convergence depends on the spectrum of A. In particular, when the condition number κ₂(A) is large, the contours of the functional ϕ, which are elliptic in shape, are long drawn-out, and the poor convergence rate suggested by (7.13) is graphically explained by ‘zig-zag’ paths similar to the one shown in Fig. 7.17. This illustrates a worst case, 31 which occurs for an initial error close to the eigenvector associated with λ_max.

30 The Kantorovich inequality is an example for a ‘strengthened Cauchy-Schwarz (CS) inequality’. Using CS together with ∥A∥₂ = λ_max, ∥A^{-1}∥₂ = 1/λ_min we would obtain the elementary, larger bound λ_max/λ_min = κ₂(A) on the right-hand side of (7.11). In (7.14), this would result in the larger factor ((κ₂(A) − 1)/κ₂(A))^{k/2}. For κ = κ₂(A) → ∞ the Kantorovich inequality gains a factor ≈ 1/4, and with ε = 1/κ this gives the following bounds for the asymptotic convergence rates: ≈ 1 − 2ε (Kantorovich) vs. 1 − ε/2 (CS).

31 The best case: For e_0 = any eigenvector of A, the SD iteration always finds the exact solution x^* in one step, independent of the problem dimension (simple proof!). This is, of course, in no way a practical situation. Moreover, as illustrated in Fig. 7.17, a small deviation from the eigenvector associated with λ_max will lead to a very poor convergence behavior for the case of large κ₂(A).


Figure 7.17: SD convergence path for a 2×2 matrix A with κ₂ ≈ 25.

Figure 7.18: Convergence paths for the Jacobi, Gauss-Seidel and SD methods for the Poisson problem from Example 2.2. Left: case N = 49. Right: case N = 100.

Example 7.6 The performance of the SD method can now be compared with that of the Jacobi and Gauss-Seidel methods. The convergence paths for these three methods are shown in Fig. 7.18. The system considered is again that of Example 2.2, i.e., the matrix A arises from discretizing the 2D Poisson equation with the 5-point finite difference stencil over an 8×8 mesh or an 11×11 mesh. The right-hand side b is taken as b = (1, 1, . . . , 1)^T and the starting vector is x_0 = b. Note that we are only plotting a 2D projection of the solution vector in Fig. 7.18, and therefore the steepest descent orthogonality property is not graphically observed. Note that the SD iteration slows down with increasing k.

The SD method does not prove to be a great improvement over the classical iterative methods. Nevertheless, it comes with a number of new concepts, including formulating the problem as a minimization procedure and considering the relationship between consecutive search directions. These concepts will be extended in the following to generate more successful iterative procedures.


7.2 Nonsymmetric steepest descent algorithms

In the steepest descent algorithm we have required A to be SPD in order for the functional ϕ to have a unique minimum at the solution of Ax = b. Variations on the steepest descent algorithm for nonsymmetric systems have also been developed, see [16]. The most general, but by far not computationally cheapest or most efficient variant requires only that A be nonsingular. Then, since A^T A is SPD, Alg. 7.1 can be applied to the normal equations

A^T A x = A^T b

This procedure is called the residual norm steepest descent method, and the functional being minimized in this case is

ψ(x) = ½ (Ax, Ax) − (x, A^T b)

This method minimizes the ℓ2-norm of the residual, ∥Ax − b∥²₂. However, in view of the convergence result (7.13), it is now the condition number of A^T A, which is typically much larger than that of A, that controls the convergence rate of the iteration. 32
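A sketch of the residual norm steepest descent method (our own illustration; A only needs to be nonsingular, and A^T A is never formed explicitly):

import numpy as np

def residual_norm_sd(A, b, x0, tol=1e-10, max_iter=10_000):
    x = x0.copy()
    for _ in range(max_iter):
        r = A.T @ (b - A @ x)            # residual of the normal equations
        if np.linalg.norm(r) < tol:
            break
        Ar = A @ r
        alpha = (r @ r) / (Ar @ Ar)      # (r, r) / (A^T A r, r) = ||r||^2 / ||A r||^2
        x = x + alpha * r
    return x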

7.3 Gradient methods as projection methods

One of the main characteristics of the SD method is that consecutive search directions (i.e., the residuals) are orthogonal, (7.10), which implies that the ℓ2-projection of the new residual onto the previous one is zero. Another way of putting it is: the approximation x_{k+1} is defined as the solution of

Find x_{k+1} ∈ x_k + span{r_k} such that r_{k+1} ⊥ r_k

This is a local condition concerning only consecutive residuals; the ‘search history’, i.e., the information about the previous search directions r_0, . . . , r_{k−1}, is not exploited. One may hope that including this information in the method leads to faster convergence.

Krylov subspace methods are based on this idea: the approximation x_{k+1} is constructed such that the residual r_{k+1} is ‘orthogonal’ (in some appropriate sense to be specified below) to all previous residuals, search directions or a related set of vectors.

Remark 7.7 Brief review on orthogonal projectors:

Let ℝ^n = K ⊕ K^⊥, with x ⊥ y for all x ∈ K and y ∈ K^⊥. This is an orthogonal subspace decomposition of ℝ^n. Let

K = span{u_1, . . . , u_m},   K^⊥ = span{v_1, . . . , v_{n−m}}

with (u_i, u_j) = δ_{i,j}, (v_i, v_j) = δ_{i,j}, and (u_i, v_j) = 0. The union of the u_i and v_j is an orthonormal basis of the full space ℝ^n. For each x ∈ ℝ^n we consider the corresponding Fourier expansion

x = Σ_{i=1}^{m} (u_i, x) u_i + Σ_{j=1}^{n−m} (v_j, x) v_j =: P x + Q x      (7.15)

This defines a pair (P,Q) of orthogonal projectors. P projects onto K along K⊥ , and vice versa.

32 (7.13) represents only an upper bound for the error. Nevertheless, the bound describes the overall convergence behavior quite realistically.


From (7.15) and with

U = (u_1 | u_2 | . . . | u_m) ∈ ℝ^{n×m},   V = (v_1 | v_2 | . . . | v_{n−m}) ∈ ℝ^{n×(n−m)},

we obtain the matrix representation for the projectors P, Q ∈ ℝ^{n×n}:

Σ_{i=1}^{m} u_i (u_i^T x) = Σ_{i=1}^{m} (u_i u_i^T) x = U U^T x = P x ∈ K

and analogously for the orthogonal complement,

Σ_{j=1}^{n−m} v_j (v_j^T x) = Σ_{j=1}^{n−m} (v_j v_j^T) x = V V^T x = Q x ∈ K^⊥

From the orthonormality relations U^T U = I_{m×m}, V^T V = I_{(n−m)×(n−m)}, and U^T V = 0_{m×(n−m)}, V^T U = 0_{(n−m)×m} we obtain the characterizing identities of a pair of orthogonal projectors:

P P = P = P^T,   Q Q = Q = Q^T,   P Q = Q P = 0_{n×n}

I.e., an orthogonal projector is idempotent (projector property) and symmetric. We also note the Pythagorean identity

∥x∥²₂ ≡ ∥P x∥²₂ + ∥Q x∥²₂
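These identities are easily checked numerically; the following small sketch (our own) builds P = U U^T from an orthonormal basis U of a random subspace K and verifies idempotency, symmetry and the Pythagorean identity:

import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
U, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal basis of K
P = U @ U.T
Q = np.eye(n) - P                                  # projector onto the complement

x = rng.standard_normal(n)
assert np.allclose(P @ P, P) and np.allclose(P, P.T)
assert np.allclose(P @ Q, np.zeros((n, n)))
assert np.isclose(x @ x, (P @ x) @ (P @ x) + (Q @ x) @ (Q @ x))   # Pythagoras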

Exercise 7.8 [See Exercise 5.14] 33

Consider a decomposition ℝ^n = K ⊕ K^{⊥_A} analogously as above, with (·, ·) throughout replaced by (·, ·)_A, and A-conjugate bases {u_1, . . . , u_m}, {v_1, . . . , v_{n−m}}, i.e., (u_i, u_j)_A = δ_{i,j}, (v_i, v_j)_A = δ_{i,j}, and (u_i, v_j)_A = 0. I.e., U and V are A-conjugate matrices, satisfying U^T A U = I_{m×m}, V^T A V = I_{(n−m)×(n−m)}, and U^T A V = 0_{m×(n−m)}, V^T A U = 0_{(n−m)×m}.

Show that the corresponding pair (P, Q) of ‘A-conjugate’ projectors onto K and K^{⊥_A} is given by

P = U U^T A,   Q = V V^T A

and P and Q satisfy

P P = P = P^A,   Q Q = Q = Q^A,   P Q = Q P = 0_{n×n}

where M^A is the adjoint of a matrix M ∈ ℝ^{n×n} w.r.t. (·, ·)_A, i.e.,

M^A = A^{-1} M^T A

Note the Pythagorean identity

∥x∥²_A ≡ ∥P x∥²_A + ∥Q x∥²_A

33 For x ∈ ℝ^n one may also denote x^A = x^T A = (Ax)^T, with (x, x)_A = x^A x, but this notation is not standard and we do not use it in the following.


8 The Conjugate Gradient (CG) Method for SPD Systems

8.1 Motivation

The Conjugate Gradient method is the example par excellence of a Krylov subspace method. The methodology behind these methods can, e.g., be motivated by the Cayley-Hamilton Theorem, which allowed us in Section 1.4 to construct the inverse of the matrix A as a polynomial in A. Let us write the solution x^* of Ax = b in the form

x^* = x_0 + A^{-1} r_0

with r_0 = b − A x_0. In principle, we can construct A^{-1} r_0 using the Cayley-Hamilton Theorem as

A^{-1} r_0 = (d_{n−1} A^{n−1} + d_{n−2} A^{n−2} + · · · + d_1 I) r_0 = q_n(A) r_0      (8.1)

where the coefficients are defined via the characteristic equation of A.

where the coefficients are defined via the characteristic equation of A . We note that this formula implies

A−1 r0 ∈ Kn = spanr0, A r0, . . . , An−1r0 (8.2)

and therefore the solution x∗ satisfies

x∗ ∈ x0 +Kn.

More generally, for m ≥ 1 we define the m - th Krylov space of A w.r.t. an initial residual r0 as

Km = Km(A, r0) = spanr0, A r0, A2r0, . . . , Am−1r0, with dim (Km) ≤ m. (8.3)

The hope behind Krylov subspace methods is that it is possible to find, for somem≪ n , an approximation

xm ∈ x0 +Km

which is close to the exact solution x∗ .

We start our considerations with the most prominent and historically earliest example, the Conjugate Gradient (CG) method for SPD systems, 34 developed by Hestenes, Stiefel, and Lanczos from 1950 on.

8.2 Introduction to the CG method

Let A ∈ ℝ^{n×n} be SPD. The CG method may be motivated and described in different ways (we will come back to this below). The basic idea is to try to proceed in a similar way as in the Steepest Descent method, but using search directions which are – in contrast to (7.10) – A-orthogonal 35 to each other, i.e.,

(d_{k+1}, d_k)_A = (d_{k+1}, A d_k) = 0,   k = 0, 1, . . .      (8.4)

with d_0 := r_0. Let us first motivate this idea.

34 The CG method was elected as one of the ‘Top 10 Algorithms of the 20th Century’ by a SIAM committee in 2000.
35 Instead of ‘A-orthogonal’ we will also use the term ‘A-conjugate’ or simply ‘conjugate’. ‘Orthogonal’ means ‘ℓ2-orthogonal’.


Figure 8.19: The CG method for n = 2 .

The special case n = 2 .

For n = 2, the contour lines C = C_{ϕ = const.} of ϕ(x) = ½(Ax, x) − (b, x) are ellipses 36 centered at x^*. Suppose x_1 ∈ C = C_{ϕ = const.}, with r_1 = b − A x_1. We do not know the error e_1 = x_1 − x^*, but we know the direction tangential to C: For a local parametrization x = x(s) of C (with x(0) = x_1),

0 ≡ (d/ds) ϕ(x(s)) = ∇ϕ(x(s)) · x′(s)

hence the tangential vector t_1 := x′(0) to C in x_1 satisfies

t_1 ⊥ A x_1 − b = A e_1,   i.e.,   t_1 ⊥ r_1,   t_1 ⊥_A e_1

Thus, t_1 is orthogonal to r_1 and A-conjugate to the error e_1. Now, evaluating r_1 we know the direction t_1, and a single line search along the direction A-conjugate to t_1 will yield the exact solution x^* for n = 2. We can realize this procedure in a more explicit way including a preparatory step x_0 → x_1 (cf. Fig. 8.19):

(i) Choose d_0 := r_0 and perform a line search along d_0 (first step as in the SD method),

x_1 = x_0 + α_0 d_0,   α_0 = (d_0, r_0) / ∥d_0∥²_A

and evaluate the new residual r_1 = b − A x_1 = r_0 − α_0 A d_0. We already know that r_1 ⊥ r_0 = d_0. This means that r_1 || t_0, the tangential direction at x_0. More importantly, we also have t_1 || r_0. With this preparatory step, we now can explicitly proceed as indicated above:

(ii) Instead of choosing r_1 as the new search direction as in SD, we apply one step of Gram-Schmidt orthogonalization to construct a search vector d_1 ⊥_A d_0, i.e.,

d_1 = r_1 + β_0 d_0,   with β_0 = −(d_0, r_1)_A / ∥d_0∥²_A

36 Note that, due to (7.3), ∥x − x^*∥_A = const. along each contour.


This is exactly what we need, since d_1 is indeed A-conjugate to the tangent t_1 at x_1:

d_1 ⊥_A t_1   due to   d_1 ⊥_A d_0 = r_0 || t_1

The exact solution x^* is now given by

x^* = x_2 = x_1 + α_1 d_1,   α_1 = (d_1, r_1) / ∥d_1∥²_A

Expressed in geometrical terms: d_1 ⊥_A d_0 means that the directions of d_0, d_1 correspond to conjugate diameter directions of the ellipses C = C_{ϕ = const.}. In particular, d_0 || t_1 is the tangential direction at x_1, and the conjugate direction d_1 points from x_1 to the center x^*.

Here, the solution x∗ is reconstructed in the form

x∗ = x0 + α0 d0 + α1 d1

where α_0 d_0 ∈ K_1(A, r_0), α_0 d_0 + α_1 d_1 ∈ K_2(A, r_0). The ingenious idea behind the CG method is that this procedure can be generalized in a very efficient way to arbitrary dimension n by means of a simple recursion. This results in a direct solution method for SPD systems which terminates at x_n at the latest. But we will also see that the x_m, m = 0, 1, . . . show a systematic convergence behavior vastly superior to the SD iterates.

Expansion with respect to conjugate directions.

For n = 2 we have realized requirement (8.4) in a constructive way, ending up with a direct solution procedure x_0 → x_1 → x_2 = x^*. Before we discuss the generalization of this procedure leading to the CG method, assume that we already know a pairwise conjugate basis

{d_0, d_1, . . . , d_{n−1}} in ℝ^n,   with d_j ≠ 0, d_j ⊥_A d_k for j ≠ k

Then the solution x^* of Ax = b can be written in form of its Fourier expansion,

x^* = Σ_{k=0}^{n−1} ((d_k, x^*)_A / (d_k, d_k)_A) d_k = Σ_{k=0}^{n−1} ((d_k, b) / (d_k, d_k)_A) d_k

Note that the use of the energy product enables a representation in terms of b and the d_k, without explicit reference to x^*, which would not be possible for a representation w.r.t. a basis orthogonal in ℓ2. Furthermore, from elementary Linear Algebra we know that, for each m ≤ n, the truncated expansion

x_m = Σ_{k=0}^{m−1} ((d_k, b) / (d_k, d_k)_A) d_k ∈ K_m = span{d_0, . . . , d_{m−1}}      (8.5)

is the unique minimizer in K_m for the error e_m = x_m − x^* in the energy norm, i.e.,

x_m = argmin_{x ∈ K_m} ∥x − x^*∥_A

In fact, x_m is the A-orthogonal projection of x^* onto K_m, x_m = P_m x^* with the corresponding projector P_m, and the error e_m = x_m − x^* satisfies

e_m = (P_m − I) x^* = (I − P_m) e_0 ⊥_A K_m


I.e., e_m is conjugate to K_m; e_0 = 0 − x^* is the error of 0 seen as an ‘initial approximation’, with residual r_0 = b. The magnitude of e_m will depend on the approximation quality of K_m w.r.t. x^*. For ∥d_k∥_A ≡ 1, the matrix representation of the projector P_m is given by P_m = D_m D_m^T A with D_m = (d_0 | . . . | d_{m−1}), see Exercise 7.8. Thus, x_m = D_m D_m^T b, which is nothing but (8.5).

Producing iterates x_m in this way may be called a ‘method of conjugate directions’, a simple and straightforward projection method. However, in view of practical realization we have to observe:

• We may not expect in general that an A - orthogonal basis is a priori available.

• Such a basis may, in principle, be generated by a Gram-Schmidt process starting from an arbitrary basis. 37 However, for larger values of m, storage requirements, in particular, will become restrictive, because all d_k must be stored in memory to perform the orthogonalization.

This raises two questions:

• Can we generate an A-orthogonal basis on the fly in course of an iteration x_0 → x_1 → . . . , e.g. from the successive residuals?

• If yes, can we limit the complexity (in terms of storage and flops) of the resulting iterative process?

For the CG method described below, both goals are achieved in a very satisfactory way for the case of SPD systems. In an iterative method like CG, starting from some initial x_0, with r_0 = b − A x_0, we will rather aim at projecting the error x_0 − x^* onto K_m, which results in a slightly modified procedure.

8.3 Derivation of the CG method

In its essence, the CG method may be viewed as a clever realization of the idea to expand w.r.t. conjugate directions. Here, starting from an initial guess x_0, the x_m are constructed as elements of the affine spaces x_0 + K_m, K_m = K_m(A, r_0) (cf. (8.2)) of increasing dimension. We will see that the K_m are spanned by the successive residuals (gradients) r_k, as well as by the successive ‘conjugated gradients’ d_k (the search directions) constructed from the r_k.

The start is the same as for n = 2, from x_0 with 0 ≠ r_0 = b − A x_0 ∈ K_1.

• First step x_0 → x_1: Choose d_0 = r_0 ∈ K_1 and perform line search,

x_1 = x_0 + α_0 d_0 ∈ x_0 + K_1,   α_0 = (d_0, r_0) / ∥d_0∥²_A = ∥r_0∥²₂ / ∥d_0∥²_A ≠ 0      (8.6)

and compute the new residual

r_1 = r_0 − α_0 A d_0 ∈ K_2      (8.7)

Stop if r_1 = 0. Otherwise, {r_0, r_1} is an orthogonal basis of K_2.

As a preparation for the next step, we compute the new search direction by the Gram-Schmidt orthogonalization step

d_1 = r_1 + β_0 d_0 ∈ K_2,   β_0 = −(d_0, r_1)_A / ∥d_0∥²_A = ∥r_1∥²₂ / ∥r_0∥²₂ ≠ 0      (8.8)

37 One might think about the question what it would mean to start from the Cartesian basis.


where the latter identity for β_0 follows from (d_0, r_1)_A = (1/α_0)(α_0 A d_0, r_1) = (1/α_0)(r_0 − r_1, r_1) together with r_1 ⊥ r_0. By construction, {d_0, d_1} is a conjugate basis of K_2.

• We now proceed by induction, which leads to a complete description of the algorithm and its essential properties. In a nutshell: Each iteration step is analogous to step (ii) for the case n = 2.

For m ≥ 2 we inductively assume that we have recursively computed x_k, r_k and d_k for k = 1 . . . m−1 by a line search in the same way as for the first step, and that the following identities hold true:

x_k = x_{k−1} + α_{k−1} d_{k−1} ∈ x_0 + K_k,   α_{k−1} = (d_{k−1}, r_{k−1}) / ∥d_{k−1}∥²_A = ∥r_{k−1}∥²₂ / ∥d_{k−1}∥²_A ≠ 0      (8.9)

with the residuals

0 ≠ r_k = r_{k−1} − α_{k−1} A d_{k−1} ∈ K_{k+1}      (8.10)

and the new search directions

0 ≠ d_k = r_k + β_{k−1} d_{k−1} ∈ K_{k+1},   β_{k−1} = −(d_{k−1}, r_k)_A / ∥d_{k−1}∥²_A = ∥r_k∥²₂ / ∥r_{k−1}∥²₂ ≠ 0      (8.11)

We also inductively assume that, for k = 1 . . . m − 1,

{r_0, r_1, . . . , r_k} is an orthogonal basis of K_{k+1}      (8.12)

{d_0, d_1, . . . , d_k} is a conjugate basis of K_{k+1}      (8.13)

which automatically implies that all the K_ℓ are of maximal dimension ℓ.

In particular, note that

r_{m−1} ⊥ K_{m−1},   d_{m−1} ⊥_A K_{m−1}      (8.14)

• Now we take a step m− 1 → m and verify that all the properties assumed inductively remain valid.

The next iterate is defined via the current search direction d_{m−1},

x_m = x_{m−1} + α_{m−1} d_{m−1} ∈ x_0 + K_m,   α_{m−1} = (d_{m−1}, r_{m−1}) / ∥d_{m−1}∥²_A = ∥r_{m−1}∥²₂ / ∥d_{m−1}∥²_A ≠ 0      (8.15)

The first expression for α_{m−1} comes from the line search (minimization of ϕ along x_{m−1} + α d_{m−1}). The second identity for α_{m−1} follows inductively from the definition of r_{m−1} and d_{m−1} together with r_{m−1} ⊥ d_{m−2} (orthogonality of the ‘new’ residual w.r.t. the ‘old’ search direction in line search):

(d_{m−1}, r_{m−1}) = (r_{m−1} + β_{m−2} d_{m−2}, r_{m−1}) = (r_{m−1}, r_{m−1})

The new residual evaluates to

rm = rm−1 − αm−1Adm−1 ∈ Km+1 (8.16)

If r_m = 0, the iteration stops with x_m = x^*. Otherwise, we again have 0 ≠ r_m ⊥ d_{m−1}, but even more:

Proposition: rm ⊥ Km .

Proof: Since, by construction, r_m ⊥ d_{m−1} and K_m = K_{m−1} ⊕ span{d_{m−1}}, it remains to show that r_m ⊥ K_{m−1}. The vector r_m is a linear combination of r_{m−1} and A d_{m−1}. From (8.14) we have

r_{m−1} ⊥ K_{m−1},   A d_{m−1} ⊥ K_{m−1}


Algorithm 8.1 CG algorithm
% input: A SPD, b, x_0
1: Compute r_0 = b − A x_0, d_0 = r_0
2: for k = 0, 1, . . . , until convergence do
3:   α_k = (r_k, r_k) / (A d_k, d_k)
4:   Compute x_{k+1} = x_k + α_k d_k
5:   Compute r_{k+1} = r_k − α_k A d_k
6:   if r_{k+1} = 0 then Stop
7:   β_k = (r_{k+1}, r_{k+1}) / (r_k, r_k)
8:   Compute d_{k+1} = r_{k+1} + β_k d_k
9: end for

which immediately yields rm ⊥ Km−1 , as proposed.

Now we perform the same Gram-Schmidt orthogonalization step as before to construct a search direction d_m which is conjugate to d_{m−1}:

d_m = r_m + β_{m−1} d_{m−1} ∈ K_{m+1},   β_{m−1} = −(d_{m−1}, r_m)_A / ∥d_{m−1}∥²_A = ∥r_m∥²₂ / ∥r_{m−1}∥²₂ ≠ 0      (8.17)

The latter identity for β_{m−1} follows from (d_{m−1}, r_m)_A = (1/α_{m−1})(α_{m−1} A d_{m−1}, r_m) = (1/α_{m−1})(r_{m−1} − r_m, r_m) together with r_m ⊥ r_{m−1}.

Proposition: dm ⊥A Km .

Proof: Since, by construction, d_m ⊥_A d_{m−1} and K_m = K_{m−1} ⊕ span{d_{m−1}}, it remains to show that d_m ⊥_A K_{m−1}. The vector d_m is a linear combination of r_m and d_{m−1}. From (8.14) we have d_{m−1} ⊥_A K_{m−1}. Concerning r_m, we exploit the symmetry of A and identity (8.10) to compute, for an arbitrary basis vector d_{k−1} of K_{m−1}, i.e., for k = 1 . . . m − 1:

(r_m, d_{k−1})_A = (A r_m, d_{k−1}) = (1/α_{k−1})(r_m, α_{k−1} A d_{k−1}) = (1/α_{k−1})(r_m, r_{k−1} − r_k) = 0,   k = 1 . . . m − 1,

since r_{k−1} − r_k ∈ K_m.

Together with rm ⊥ Km , which has been proved before, this yields dm ⊥A Km−1 , as proposed.

This completes the induction. In particular,

r0, r1, . . . , rm is an orthogonal basis of Km+1 (8.18)

d0, d1, . . . , dm is a conjugate basis of Km+1 (8.19)

Remark 8.1

• In the last step of the proof, the symmetry of A plays an essential role (apart from the fact that the SPD property of A ensures that the conjugation process is well-defined and does not break down unless a zero residual is encountered).

In a ‘magic’ way, the single orthogonalization step within each iteration automatically leads tocomplete orthogonal and conjugate bases of the Krylov spaces according to (8.18),(8.19). This is insharp contrast to a general Gram-Schmidt procedure, where all orthogonality relations have to beexplicitly enforced (a global process).

For n = 3 this may be explained by means of a figure, a 3D generalization of Fig. 8.19. Later we shall rather give a general, more transparent explanation of this outstanding feature of the CG iteration.


• ‘Conjugate Gradients’ is a misnomer (a translation error?): The gradients (residuals rm ) are notconjugate but orthogonal; they are conjugated to become A - conjugate (in form of the dm ).

• The CG algorithm is an amazingly simple recurrence formulated in Alg. 8.1. There exist alternative formulations of this iteration; e.g., the recurrence for the xm can be extracted from the recurrence for the rm without explicit reference to the dm ; cf. e.g. [16]. (A Matlab sketch of this recurrence is given at the end of this remark.)

• By construction, xm ∈ x0 + Km = x0 + span{r0, A r0, A²r0, . . . , Am−1r0} ; the successive powers emanate from the computation of the residual (8.16) in each step, which enters the definition of dm and xm+1 .

In the absence of roundoff error, there are two possibilities:

– If a vanishing residual rm = 0 is (‘luckily’) encountered in the course of the iteration, the exact solution xm = x∗ ∈ x0 + Km has been found.

– Otherwise, the Km form an increasing sequence of subspaces of increasing dimension m ,

xm ∈ x0 + Km ,    Km = span{r0, A r0, A²r0, . . . , Am−1r0} = span{r0, . . . , rm−1} = span{d0, . . . , dm−1}    (8.20)

This shows that, in principle, CG is a direct method: The exact solution x∗ is found after n steps at the latest. However, we will see that we can give an estimate for the error em = xm − x∗ after m steps, similarly as for the SD method (but with a significantly better quality); see Section 8.5 below.

• The computational complexity of CG is usually dominated by the residual evaluation in each step,in form of the computation of Adk . In many practical situations the cost for evaluating the residualis O(n) and comparable to the other vector operations involved.

The storage requirements are moderate, namely also O(n) ; only a few vectors have to be kept inmemory at the same time.

• As we have seen, the orthogonality/conjugacy relations (8.18),(8.19) are inherent to the CG iteration,but only if exact arithmetic is assumed. In practice, these relations will be contaminated by roundofferror, which affects the convergence behavior.
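For illustration, a minimal Matlab sketch of the recurrence of Alg. 8.1 (not part of the original script; the function name cg_simple and the stopping criterion are chosen here for illustration, no preconditioning):

function [x, resvec] = cg_simple(A, b, x0, tol, maxit)
% CG iteration following Alg. 8.1 (illustrative sketch, no preconditioning)
x = x0;  r = b - A*x;  d = r;
rho = r'*r;  resvec = sqrt(rho);
for k = 1:maxit
    Ad    = A*d;
    alpha = rho/(d'*Ad);            % alpha_k = (r_k,r_k)/(A d_k,d_k)
    x     = x + alpha*d;            % x_{k+1}
    r     = r - alpha*Ad;           % r_{k+1}
    rhonew = r'*r;
    resvec(end+1) = sqrt(rhonew);
    if sqrt(rhonew) <= tol*norm(b), break, end
    beta = rhonew/rho;              % beta_k = (r_{k+1},r_{k+1})/(r_k,r_k)
    d    = r + beta*d;              % d_{k+1}
    rho  = rhonew;
end
end

For example, [x, res] = cg_simple(A, b, zeros(size(b)), 1e-10, length(b)) returns the final iterate together with the residual history.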

8.4 CG as a projection method and its relation to polynomial approximation

Let us state the essential approximation properties of the CG iterates.

• In Section 7, our starting point was to interpret the solution x∗ as the minimizer of the quadratic form

ϕ(x) = ½ (Ax, x) − (b, x) = ϕ(x∗) + ½ ∥x − x∗∥²A

see (7.1),(7.3). The CG method realizes an iterative line search starting from x0 , locally minimizing ϕ(x) , i.e., minimizing ∥x − x∗∥A , along successive (conjugate) directions dk emanating from xk−1 , with span{d0, . . . , dm−1} = Km , resulting in (cf. (8.15)):

xm = x0 + ∑_{k=0}^{m−1} αk dk = x0 + ∑_{k=0}^{m−1} [(dk, rk)/(dk, dk)A] dk ∈ x0 + Km    (8.21)


• Now we compare (8.21) with the ‘global’ minimizer we are ultimately interested in, i.e., x̂m ∈ x0 + Km with the property

x̂m = argmin_{x ∈ x0+Km} ∥x − x∗∥A

With x̂m = x0 + ŷm , ŷm ∈ Km , i.e., x̂m − x∗ = e0 + ŷm , we can expand ŷm ∈ Km w.r.t. the conjugate directions dk , analogously as in Section 8.2, but with −e0 now playing the role of x∗ . This gives

x̂m = x0 + ŷm = x0 + argmin_{y ∈ Km} ∥e0 + y∥A = x0 + ∑_{k=0}^{m−1} [(dk, −e0)A/(dk, dk)A] dk = x0 + ∑_{k=0}^{m−1} [(dk, r0)/(dk, dk)A] dk    (8.22)

Note that ŷm = −Pm e0 , the A - conjugate projection of −e0 onto Km . Now, from the conjugacy of the dk and the recursion for the rk we have

(dk, rk) = (dk, rk−1) + (dk, rk − rk−1) = (dk, rk−1) − αk−1 (dk, A dk−1) = (dk, rk−1)
         = (dk, rk−2) + (dk, rk−1 − rk−2) = (dk, rk−2) − αk−2 (dk, A dk−2) = (dk, rk−2)
         = . . . = (dk, r0)
where the terms (dk, A dk−1) , (dk, A dk−2) , . . . vanish by conjugacy.

Comparing (8.21) with (8.22) we now see that x̂m is indeed identical with xm , i.e.,

xm = argmin_{x ∈ x0+Km} ∥x − x∗∥A    (8.23)

For the error em = xm − x∗ this implies

em = argmin_{e ∈ e0+Km} ∥e∥A    (8.24)

which also shows monotonic convergence in the energy norm.

• Thus we have identified CG as a projection method – which was essentially our goal formulated inSection 8.2 – with the A - conjugate projector Pm onto Km ,

xm = x0 − Pm(x0 − x∗) = x0 − Pm e0 ∈ x0 +Km, em = (I − Pm) e0 =: Qme0.

Note that with the normalized directions d̃k = dk/∥dk∥A we have

Pm = D̃m D̃mᵀ A    with    D̃m = [ d̃0 | . . . | d̃m−1 ]    (8.25)

see Section 7.3. 38

• By construction, we have em = e0 − Pm e0 ∈ e0 + Km . Thus we can write xm and em in terms of amatrix polynomial,

xm = x0 + pm−1(A) r0 ∈ x0 +Km, (8.26)

em = e0 + pm−1(A) r0 = (I − pm−1(A)A) e0 =: qm(A)e0 (8.27)

with pm−1 ∈ Pm−1 , hence qm ∈ Pm with qm(0) = 1 . The optimality property (8.24) can now be re-interpreted as a statement on the optimality of the polynomial qm(A) = (I − pm−1(A)A) .

38 In an analogous way, the ( ℓ2 -) orthogonal projector onto Km can be expressed in terms of the residuals rk .


Figure 8.20: CG as a projection method

We collect these findings in the following theorem.

Theorem 8.2 The error em = xm − x∗ of the CG iteration is the A - conjugate projection of the initial error e0 along Km = span{r0, A r0, . . . , Am−1r0} ,

em = Qm e0 = qm(A)e0 ⊥A Km (8.28)

and satisfies

∥em∥A = min_{e ∈ e0+Km} ∥e∥A = min_{q ∈ Pm, q(0)=1} ∥q(A) e0∥A    (8.29)

Consequently, the error is related to the distribution of the spectrum σ(A) via

∥em∥A / ∥e0∥A ≤ max_{λ ∈ σ(A)} |qm(λ)| ,    ∥em∥A / ∥e0∥A ≤ min_{q ∈ Pm, q(0)=1} max_{λ ∈ σ(A)} |q(λ)|    (8.30)

Note that, in course of the iteration, the matrix polynomial qm(A) realizing the projector Qm is neverexplicitly constructed in form of a matrix. 39 It is ‘inherent’ to the process; the CG iteration realizes theappropriate image of x0 . Explicit computation of the projection matrices is useless and would also benumerically expensive and unstable in general. The projection property of the CG iteration may also bestated in the following way, cf. Fig. 8.20.

Corollary 8.3 The m -th CG iterate xm is uniquely characterized as the solution of the projection problem

Find xm ∈ x0 +Km such that em = xm − x∗ ∈ e0 +Km satisfies (em, v)A = 0 ∀ v ∈ Km (8.31)

This is equivalent to

Find xm ∈ x0 +Km such that (b− Axm, v) = (rm, v) = 0 ∀ v ∈ Km (8.32)

39 Note that the projectors do not depend on x0 .
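As a small numerical illustration of Corollary 8.3 (not part of the original script; the test matrix, the seed and the value of m are chosen ad hoc), the following Matlab snippet runs m CG steps and checks that rm is orthogonal to Km up to roundoff:

% Check of the Galerkin orthogonality (8.32): after m CG steps, r_m _|_ K_m
n = 50;  rng(0);
B = randn(n);  A = B'*B + n*eye(n);    % a random SPD test matrix
b = randn(n,1);  x = zeros(n,1);
r = b - A*x;  d = r;  K = r;  m = 5;   % K collects r0, A r0, ..., A^(m-1) r0
for k = 1:m
    Ad = A*d;  alpha = (r'*r)/(d'*Ad);
    x = x + alpha*d;  rnew = r - alpha*Ad;
    d = rnew + ((rnew'*rnew)/(r'*r))*d;  r = rnew;
    if k < m, K = [K, A*K(:,end)]; end
end
disp(norm(K'*r))     % small (roundoff level): r_m is orthogonal to K_m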


In Section 9 we will derive the CG method in an alternative fashion, by requiring the optimality propertyformulated in Thm. 8.2 or Corollary 8.3, respectively, and algorithmically realizing the correspondingprojections, where, in a first step, orthogonal bases of the Krylov spaces are constructed. This procedurewill also lead to a better insight to the magical simplicity of the CG procedure, and it will also lead us tomore general projection methods to be studied in Section 10.

8.5 Convergence properties of the CG method

Now we are in a position to analyze the convergence of the CG method and to derive error estimates. To this end we may now choose any polynomial q ∈ Pm with q(0) = 1 to estimate the right hand side of (8.30) from above. A reasonable universal bound is obtained in a similar way as in Section 6: Assume σ(A) ⊂ [α, β] and seek a polynomial q(t) which attains

min_{q ∈ Pm, q(0)=1} max_{t ∈ [α,β]} |q(t)|

From Corollary 6.3 (with γ = 0 ) we know that this minimal q is given by a transformed Chebyshevpolynomial, and this results in the error bound

∥em∥A ≤ min_{q ∈ Pm, q(0)=1} max_{i=1...n} |q(λi)| ∥e0∥A ≤ min_{q ∈ Pm, q(0)=1} max_{λ ∈ [α,β]} |q(λ)| ∥e0∥A ≤ [2 cᵐ/(1 + c²ᵐ)] ∥e0∥A ,    c = (√κ − 1)/(√κ + 1) ,    κ = β/α

As a consequence, taking [α, β] = [λmin, λmax] , and with the condition number κ2(A) = λmax/λmin , weobtain

Theorem 8.4 For the CG iteration applied to an SPD system Ax = b , the error in the energy norm isbounded by

∥em∥A ≤ 2 ( (√κ2(A) − 1)/(√κ2(A) + 1) )ᵐ ∥e0∥A    (8.33)

This bound is similar to that obtained for the SD method, see Thm. 7.5, except that now the condition number of A is replaced by its square root! For large κ2(A) we have

(√κ2(A) − 1)/(√κ2(A) + 1) ∼ 1 − 2/√κ2(A)

and convergence to a specified tolerance may be expected after O(√κ2(A)) steps.
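For illustration (a rough sketch, not part of the original script; the values of kappa and tol are arbitrary), the iteration count suggested by the bound (8.33), i.e., the smallest m with 2 cᵐ ≤ tol , can be estimated in Matlab as follows:

kappa = 1e4;  tol = 1e-8;                      % example values
c = (sqrt(kappa) - 1)/(sqrt(kappa) + 1);
m = ceil(log(2/tol)/log(1/c));                 % smallest m with 2*c^m <= tol
fprintf('kappa = %.1e : about %d CG steps\n', kappa, m)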

There is another major difference between SD and CG: While (with the exception of trivial startingvectors) the convergence behavior of the steepest descent method is accurately described by the conditionnumber of the matrix (i.e., the ratio of the extremal eigenvalues), it is the whole spectrum of A whichdetermines the convergence behavior of the CG-algorithm. In particular, the bound (8.33) is often toopessimistic.


Theorem 8.5 If A has only m < n distinct eigenvalues then, for any x0 , the CG iteration converges inat most m steps.

Proof: Under the assumption of the theorem we can decompose the initial error e0 in terms of theeigenbasis of A in the form 40

e0 = ∑_{i=1}^{m} εi vi ∈ span{v1, . . . , vm}    (8.34)

where the vi are certain eigenvectors of A (normalized in ℓ2 ), with corresponding eigenvalues λi > 0 , i =1 . . .m . Thus, the initial residual satisfies

r0 = −A e0 = − ∑_{i=1}^{m} λi εi vi ∈ span{v1, . . . , vm}

and for all iterated residuals we have

Aᵏ r0 = − ∑_{i=1}^{m} λiᵏ⁺¹ εi vi ∈ span{v1, . . . , vm} ,    k = 0, 1, . . .

This shows
Kk = Kk(A, r0) ⊆ span{v1, . . . , vm}    for all k ≥ 0

In particular, the dimension of the Krylov spaces

Kk = span{r0, A r0, . . . , Aᵏ⁻¹r0} = span{r0, . . . , rk−1}

does not grow beyond m . This shows that (at the latest)

rm = 0 ⇒ em = 0

because otherwise we would have 0 ≠ rm ⊥ Km and dim(Km+1) = dim(span{r0, . . . , rm−1, rm}) = m + 1 , a contradiction.

We may also express this property in terms of the optimal polynomial qm(A) inherent to CG. Considerthe characteristic polynomial

χ(z) = (z − λ1)ν1 · · · (z − λm)νm

where νi is the multiplicity of λi . The minimal polynomialµ(z) is defined as the monic polynomial ofminimal degree which vanishes at the spectrum 41 of A . Here, µ(z) is given by the polynomial of degree m ,

µ(z) = (z − λ1) · · · (z − λm), µ(λi) = 0, i = 1 . . .m

This shows that under the assumption of Theorem 8.5, the right hand side of (8.30) attains its minimalvalue 0 for the rescaled version of µ(z) ,

q(λ) = (1 − λ/λ1) · · · (1 − λ/λm) ,    q(0) = 1

due to (8.34) and since (I − A/λi) vi = 0 , and thus we obtain em = 0 .

40 This is true because any linear combination of eigenvectors associated with an eigenvalue λi is again such an eigenvector(eigenspace associated with λi ).

41 Analogously to χ(A) = 0 (Cayley-Hamilton Theorem) we also have µ(A) = 0 .


Remark 8.6

• The argument in the proof of Theorem 8.5 also shows that the assertion remains true for arbitraryA > 0 if e0 is a linear combination of m < n eigenvectors of A , which is of course not a very practicalassumption.

• From the above considerations we conclude that the CG iteration will rapidly converge if either A is well-conditioned (untypical!) or, by a heuristic argument, if the eigenvalues of A are concentrated in a few ‘clusters’ (a very special situation).

• We also conclude that one way of improving the convergence behavior of the CG method is toreduce the condition number of the linear system by preconditioning (discussed in Section 12). Thisis by far the most popular method of improving the convergence behavior of the CG method. Apreconditioner that ‘bunches eigenvalues’ can also improve the performance; however, this requiresa much more detailed knowledge of the structure of the matrix under consideration.

8.6 CG in Matlab: The function pcg

The following is a printed copy of the help page for the CG method implemented in Matlab. Note that, alternatively to the matrix A , a function AFUN is sufficient which realizes the operation Ax . The function pcg also supports preconditioning in several variants, but the concrete preconditioner has to be provided by the user. As for A , the preconditioner may be specified in form of an evaluation function MFUN.

PCG Preconditioned Conjugate Gradients Method.

X = PCG(A,B) attempts to solve the system of linear equations A*X=B for

X. The N-by-N coefficient matrix A must be symmetric and positive

definite and the right hand side column vector B must have length N.

X = PCG(AFUN,B) accepts a function handle AFUN instead of the matrix A.

AFUN(X) accepts a vector input X and returns the matrix-vector product

A*X. In all of the following syntaxes, you can replace A by AFUN.

X = PCG(A,B,TOL) specifies the tolerance of the method. If TOL is []

then PCG uses the default, 1e-6.

X = PCG(A,B,TOL,MAXIT) specifies the maximum number of iterations. If

MAXIT is [] then PCG uses the default, min(N,20).

X = PCG(A,B,TOL,MAXIT,M) and X = PCG(A,B,TOL,MAXIT,M1,M2) use symmetric

positive definite preconditioner M or M=M1*M2 and effectively solve the

system inv(M)*A*X = inv(M)*B for X. If M is [] then a preconditioner

is not applied. M may be a function handle MFUN returning M\X.

X = PCG(A,B,TOL,MAXIT,M1,M2,X0) specifies the initial guess. If X0 is

[] then PCG uses the default, an all zero vector.

[X,FLAG] = PCG(A,B,...) also returns a convergence FLAG:

0 PCG converged to the desired tolerance TOL within MAXIT iterations

1 PCG iterated MAXIT times but did not converge.

2 preconditioner M was ill-conditioned.

3 PCG stagnated (two consecutive iterates were the same).

4 one of the scalar quantities calculated during PCG became too

small or too large to continue computing.

Ed. 2011 Iterative Solution of Large Linear Systems

Page 77: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

8.6 CG in Matlab: The function pcg 73

[X,FLAG,RELRES] = PCG(A,B,...) also returns the relative residual

NORM(B-A*X)/NORM(B). If FLAG is 0, then RELRES <= TOL.

[X,FLAG,RELRES,ITER] = PCG(A,B,...) also returns the iteration number

at which X was computed: 0 <= ITER <= MAXIT.

[X,FLAG,RELRES,ITER,RESVEC] = PCG(A,B,...) also returns a vector of the

estimated residual norms at each iteration including NORM(B-A*X0).

Example:

n1 = 21; A = gallery(’moler’,n1); b1 = A*ones(n1,1);

tol = 1e-6; maxit = 15; M = diag([10:-1:1 1 1:10]);

[x1,flag1,rr1,iter1,rv1] = pcg(A,b1,tol,maxit,M);

Or use this parameterized matrix-vector product function:

afun = @(x,n)gallery(’moler’,n)*x;

n2 = 21; b2 = afun(ones(n2,1),n2);

[x2,flag2,rr2,iter2,rv2] = pcg(@(x)afun(x,n2),b2,tol,maxit,M);

Class support for inputs A,B,M1,M2,X0 and the output of AFUN:

float: double

See also bicg, bicgstab, bicgstabl, cgs, gmres, lsqr, minres, qmr,

symmlq, tfqmr, ichol, function_handle.

Reference page in Help browser

doc pcg

Example 8.7 We apply the CG and SD algorithms to the diagonal SPD matrix A ∈ R100×100 with eigenvalues λ1 = 5 , λ2 = 4 , λ3 = 1.5 , λ4 = 1.4 , λ5 = 1.3 , λ6 = λ7 = · · · = λ100 = 1 . For exact solution x = (1, 1, . . . , 1)ᵀ and starting vector x0 = (0, 0, . . . , 0)ᵀ the convergence history (residual versus iteration count) is shown in Fig. 8.21. We see that the convergence history of the SD method follows the theoretical upper bound of C((κ−1)/(κ+1))ᵐ . The CG method, on the other hand, finds, up to round-off difficulties, the exact solution in step 6 .

[Semilogarithmic plot: residual size vs. iteration number for CG, SD, and the reference curve O(((κ−1)/(κ+1))ᵐ) ; κ(A) = 5 .]

Figure 8.21: comparison of CG with SD – see Example 8.7
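The CG part of this experiment can be reproduced, e.g., with Matlab's pcg (an illustrative sketch under the setting of Example 8.7; the SD curve is not reproduced here, and the tolerance value is ours):

n = 100;
lambda = ones(n,1);  lambda(1:5) = [5 4 1.5 1.4 1.3];
A = spdiags(lambda, 0, n, n);          % diagonal SPD matrix of Example 8.7
b = A*ones(n,1);                       % exact solution x = (1,...,1)^T
[x, flag, relres, iter, resvec] = pcg(A, b, 1e-12, 40);
semilogy(0:length(resvec)-1, resvec, 'o-')     % sharp drop around step 6
xlabel('iteration number'), ylabel('residual size')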


8.7 CGN: CG applied to the Normal Equations

As argued in Section 7.2 for the SD method, CG may in principle be applied to arbitrary systems via solution of the normal equations

ATAx = AT b

where AᵀA takes the role of A before. Again, this will not be a successful approach in general because √κ2(AᵀA) = κ2(A) . Moreover, the additional evaluations x ↦ Aᵀx make each iteration step more expensive – or even unfeasible, if an evaluation procedure for Aᵀ is not directly available.
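A minimal Matlab sketch of CGN via pcg and a function handle for x ↦ AᵀA x (the test matrix, seed and tolerances are chosen ad hoc for illustration; as argued above, this is only advisable if A is reasonably well-conditioned):

rng(1);  n = 200;
A = eye(n) + 0.1*randn(n);             % a well-conditioned nonsymmetric matrix
b = randn(n,1);
afun = @(x) A'*(A*x);                  % x -> A'*A*x, without forming A'*A
[x, flag, relres, iter] = pcg(afun, A'*b, 1e-10, 500);
disp(norm(A*x - b))                    % residual of the original system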

9 General Approach Based on Orthogonalization of Km .

The Arnoldi/Lanczos Procedures

In this section we study the construction of orthogonal bases in Krylov spaces. The basic procedure is a variation of the well-known Gram-Schmidt algorithm, leading to a ‘projected version’ of a matrix A ∈ Rn×n in form of a Hessenberg matrix. This construction is the basis for general Krylov subspace methods to be studied in the sequel.

9.1 The Arnoldi procedure for A ∈ Rn×n

For arbitrary A ∈ Rn×n , a given vector r0 ∈ Rn defines a sequence of Krylov subspaces 42

Km = Km(A, r0) = span{r0, A r0, . . . , Am−1r0}

We call

Km = [ r0 | A r0 | . . . | Am−1r0 ] ∈ Rn×m    (9.1)

the corresponding Krylov matrix.

Constructing an orthonormal basis v1, . . . , vm of Km is of interest on its own right; it is a basic techniquefor what follows, and for general algorithms based on Krylov sequences. In principle, we know how toconstruct v1, . . . , vm : Apply the Gram-Schmidt algorithm (or another equivalent orthonormalizationprocedure) to the Krylov vectors r0, A r0, . . . (the columns of Km ), which are not given a priori but arecomputed on the fly in course of the process.

The so-called Arnoldi iteration is a slight but clever modification of this procedure. The resulting or-thonormal vectors vj are called Arnoldi vectors:

Choose v1 = r0/∥r0∥2 . Then, for j = 1 . . .m , first multiply the current Arnoldi vector vj by A ,

and orthonormalize Avj against all previous Arnoldi vectors.

42 Assume for the moment that Km is of maximal dimension m ≤ n .


Algorithm 9.1 Arnoldi iteration (classical Gram-Schmidt variant)

1: v1 = r0/∥r0∥2
2: for j = 1, 2, . . . ,m do
3:   for i = 1, 2, . . . , j do
4:     Compute hij = (A vj , vi)
5:   end for
6:   Compute wj = A vj − ∑_{i=1}^{j} hij vi
7:   hj+1,j = ∥wj∥2
8:   if hj+1,j = 0 then Stop
9:   vj+1 = wj/hj+1,j
10: end for

This leads to the following iteration (with the intermediate unnormalized vectors wj ): v1 = r0/∥r0∥2 , and

wj = A vj − (A vj, v1) v1 − (A vj, v2) v2 − . . . − (A vj, vj) vj ;    vj+1 = wj/∥wj∥2 ,    j = 1, 2, . . .    (9.2)

This is also formulated in Alg. 9.1, where we define

hij = (Avj, vi), i ≤ j, and hj+1,j = ∥wj∥2 (9.3)

As long as wj ≠ 0 we have due to (9.2) (for which, by construction, wj ⊥ span{v1, . . . , vj} ):

h²j+1,j = (wj, wj) = (A vj + lin. comb. of v1, . . . , vj , wj) = (A vj, ∥wj∥2 vj+1) = hj+1,j (A vj, vj+1)

hence
∥wj∥2 = hj+1,j = (A vj, vj+1)    (9.4)

A breakdown occurs if a wj = 0 is encountered. Provided the iteration does not break down, i.e., as long as wj ≠ 0 , it generates the Arnoldi matrix

Vm = [ v1 | v2 | . . . | vm ] ∈ Rn×m    (9.5)

and we also define the rectangular upper Hessenberg matrix H̄m ∈ R(m+1)×m as

H̄m = (hij) ∈ R(m+1)×m ,    hij = (A vj , vi) for i ≤ j ,    hj+1,j = ∥wj∥2 = (A vj , vj+1) ,    hij = 0 for i > j+1

(cf. (9.3),(9.4)); i.e., H̄m is upper triangular apart from one additional nonzero subdiagonal.

If breakdown occurs in the m - th step, wm = 0 is still well-defined but not vm+1 , and the algorithm stops. In this case, the last row of H̄m is zero, hm+1,m = 0 .

Lemma 9.1 Assuming that Alg. 9.1 does not terminate prematurely, the vectors vj , j = 1 . . .m , form an orthonormal basis of the Krylov space Km , i.e., VmᵀVm = Im×m .

Furthermore, Pm = Vm Vmᵀ ∈ Rn×n is the orthogonal projector onto Km .


Proof: An exercise: A simple induction argument shows that the vectors vj, j = 1 . . .m are indeedorthonormal. A second simple induction argument reveals vj ∈ Kj for j = 1 . . .m .

Furthermore, Pm = Vm Vmᵀ is a surjection onto Km and satisfies Pm Pm = Pmᵀ = Pm , which is exactly the proposed projection property.

Since by Lemma 9.1, the Arnoldi vectors vj are orthonormal and since Avj ∈ spanv1, . . . , vj+1 (see(9.2)), each Avj can also be expressed in terms of its Fourier expansion (!) with j+1 terms. E.g., forj = 1 we have

A v1 = (A v1, v1) v1 + w1 = h11 v1 + w1  (!)=  h11 v1 + (A v1, v2) v2 = h11 v1 + h21 v2

thus, w1 satisfies the identity w1 = h21 v2 . For general j , by definition of H = (hij) = (Avj, vi) , we havethe identities

A vj = ∑_{i=1}^{j} hij vi + wj  (!)=  ∑_{i=1}^{j+1} hij vi ,    j = 1 . . .m−1    (9.6)

which also again shows wj = (Avj, vj+1) vj+1 = hj+1,j vj+1 , cf. (9.4). A special case occurs for j = m if, inthe last step, wm = 0 and vm+1 is not defined. But in any case, we have

A vm = ∑_{i=1}^{m} him vi + wm    (9.7)

from the last Arnoldi step in (9.2), with wm = hm+1,m vm+1 if vm+1 is well-defined.

In matrix notation, (9.6),(9.7) is equivalent to

A Vm = [ A v1 | A v2 | . . . | A vm ] = Vm+1 H̄m = Vm Hm + wm emᵀ    (9.8)

where the square upper Hessenberg matrix Hm ∈ Rm×m is obtained from H̄m by removing its last row, and em denotes the m - th unit vector.

We also conclude:

Theorem 9.2 The Arnoldi procedure generates a reduced QR factorization of the Krylov matrix Km (see (9.1)) in the form

Km = Vm Rm    (9.9)

with Vm from (9.5) satisfying VmᵀVm = Im×m , and with a triangular matrix Rm ∈ Rm×m .

Furthermore, with the m×m - upper Hessenberg matrix

Hm ∈ Rm×m , consisting of the first m rows of H̄m , i.e., hij = (A vj , vi) for i ≤ j and hij = 0 for i > j+1 ,

we have

VmᵀA Vm = Hm    (9.10)


Proof: The matrix Rm in (9.9) is implicitly defined by the orthogonalization process (9.2): It is easy to verify that the first j columns of Km are linear combinations of the first j columns of Vm , which is equivalent to (9.9).

Furthermore, left multiplication of identity (9.8) by Vmᵀ yields

VmᵀA Vm = VmᵀVm Hm + Vmᵀwm emᵀ = Hm ,

due to the orthogonality relations VmᵀVm = Im×m and Vmᵀwm = 0 .

Remark 9.3 Now we see what the Arnoldi process accomplishes: Apart from computing an orthogonalbasis of Km , it maps A to upper Hessenberg form Hm by an orthogonal transformation, see (9.10).

For m = n , the outcome would be identical with the well-known orthogonal similarity transformation toupper Hessenberg form (cf., e.g., [2]).

For the case m < n relevant here, this corresponds to a reduced version of such a decomposition, where the resulting ‘small’ matrix Hm = VmᵀA Vm ∈ Rm×m is a projected version of A acting in Km : Consider x ∈ Km and express it in the basis Vm , i.e., x = Vm u with corresponding coefficient vector u ∈ Rm . Applying A to x and projecting the result A x back to Km yields

Vm VmᵀA x = Vm VmᵀA Vm u = Vm Hm u    (9.11)

i.e., the coefficient vector of the result expressed in the basis Vm is given by Hm u . We can also write

Vm Hm Vmᵀ = Vm VmᵀA Vm Vmᵀ = Pm A Pm =: Am ∈ Rn×n

with the orthogonal projector Pm = Vm Vmᵀ onto Km .

The orthogonal basis delivered by the Arnoldi procedure will be used in Section 10 for the construction ofapproximate solutions xm for general linear systems Ax = b .

Exercise 9.4 Show by means of an induction argument with respect to powers of A :

p(A) r0 = Vm p(Hm) Vmᵀr0 = ∥r0∥2 Vm p(Hm) e1

for all p ∈ Pm−1 , where e1 is the first unit vector in Rm .

9.2 The MGS (Modified Gram-Schmidt) variant

In practice, the classical Gram-Schmidt algorithm is usually implemented in an alternative, numericallymore robust way; this is known as the Modified Gram-Schmidt algorithm (MGS). Let us describe thisvariant in the context of the Arnoldi procedure.

Consider the j - th step of the Arnoldi iteration (9.2),

wj = Avj − (Avj, v1) v1 − . . .− (Avj, vj) vj, vj+1 = wj/∥wj∥2 (9.12)

where the v1, . . . , vj are already orthonormal. We have

(A vj, vi) vi = (viᵀA vj) vi = vi (viᵀA vj) = (vi viᵀ) A vj = P̂i A vj    (9.13)


Algorithm 9.2 Arnoldi iteration (Modified Gram-Schmidt variant)

1: v1 = r0/∥r0∥2
2: for j = 1, 2, . . . ,m do
3:   Initialize wj = A vj
4:   for i = 1, 2, . . . , j do
5:     Compute hij = (wj , vi)
6:     Update wj = wj − hij vi
7:   end for
8:   hj+1,j = ∥wj∥2
9:   if hj+1,j = 0 then Stop
10:   vj+1 = wj/hj+1,j

11: end for

with the rank-1 projectors P̂i = vi viᵀ onto span{vi} . Thus, (9.12) is equivalent to

wj = (I − P̂1 − . . . − P̂j) A vj = (I − Pj) A vj =: Qj A vj    (9.14)

with the pair of orthogonal projectors

Pj = Vj Vjᵀ = v1 v1ᵀ + . . . + vj vjᵀ :  projector onto span{v1, . . . , vj} = Kj ,    Qj = I − Pj :  projector along Kj

The idea behind MGS is to organize the successive projections in a different fashion. Let

Q̂i = I − P̂i = I − vi viᵀ ,    i = 1 . . . j ,

denote the rank-(n−1) projectors onto the orthogonal complements span{vi}⊥ along span{vi} .

Due to the orthonormality of the vi it is easy to verify that

Q̂j Q̂j−1 · · · Q̂1 = (I − vj vjᵀ) (I − vj−1 vj−1ᵀ) · · · (I − v1 v1ᵀ) = I − vj vjᵀ − . . . − v1 v1ᵀ = Qj

With this representation for Qj , (9.2) can be rewritten in form of a recursion in terms of the Q̂j = (I − vj vjᵀ) :

v1 = r0/∥r0∥2
w1 = (I − v1 v1ᵀ) A v1 ,    v2 = w1/∥w1∥2
w2 = (I − v2 v2ᵀ)(I − v1 v1ᵀ) A v2 ,    v3 = w2/∥w2∥2
  ⋮                                                  (9.15)
wm = (I − vm vmᵀ) · · · (I − v1 v1ᵀ) A vm ,    vm+1 = wm/∥wm∥2

Together with the identity (I − vi viᵀ) x = x − (x, vi) vi this leads to Alg. 9.2, which is mathematically equivalent to Alg. 9.1 but usually less sensitive to cancellation effects.
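A minimal Matlab sketch of Alg. 9.2 (the function name arnoldi_mgs is ours; breakdown is only signalled by an error message):

function [V, H] = arnoldi_mgs(A, r0, m)
% Arnoldi iteration, MGS variant (Alg. 9.2): V is n x (m+1) with
% orthonormal columns, H is the (m+1) x m Hessenberg matrix of (9.8),
% so that A*V(:,1:m) = V*H (assuming no breakdown occurs).
n = size(A,1);
V = zeros(n, m+1);  H = zeros(m+1, m);
V(:,1) = r0/norm(r0);
for j = 1:m
    w = A*V(:,j);
    for i = 1:j
        H(i,j) = V(:,i)'*w;            % h_ij = (A v_j, v_i)
        w = w - H(i,j)*V(:,i);         % subtract the projection onto v_i
    end
    H(j+1,j) = norm(w);
    if H(j+1,j) == 0, error('breakdown: K_j is A-invariant'), end
    V(:,j+1) = w/H(j+1,j);
end
end

The identities of Lemma 9.1 and (9.8) can then be checked numerically via norm(V'*V - eye(m+1)) and norm(A*V(:,1:m) - V*H).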

Remark 9.5 The Arnoldi iteration can also be realized via successive Householder reflections, as in the classical, full orthogonal reduction of a square matrix to upper Hessenberg form. For details cf. [16].


Algorithm 9.3 Lanczos iteration

1: β1 = 0 , v0 = 0
2: v1 = r0/∥r0∥2
3: for j = 1, 2, . . . ,m do
4:   Initialize wj = A vj − βj vj−1
5:   Compute αj = (wj , vj)
6:   Update wj = wj − αj vj
7:   βj+1 = ∥wj∥2
8:   if βj+1 = 0 then Stop
9:   vj+1 = wj/βj+1
10: end for

9.3 The Lanczos procedure for symmetric A ∈ Rn×n

Assume A = AT ∈ Rn×n is symmetric, and apply the Arnoldi procedure for given r0 . By symmetry, weimmediately obtain from (9.10):

Hm = VmᵀA Vm = Hmᵀ    is also symmetric.

Since Hm ∈ Rm×m is upper Hessenberg (see Thm. 9.2), it must be tridiagonal. In this case we write

Hm =: Tm ,    the symmetric tridiagonal matrix with

αj = hjj = (A vj , vj)    on the diagonal,    βj = hj,j−1 = (A vj−1, vj) = (A vj , vj−1)    on both off-diagonals.    (9.16)

With this denotation, Alg. 9.2 specializes to the simple recursion formulated in Alg. 9.3. 43

Exercise 9.6 Use the tridiagonal structure of the matrices Tm to conclude that vj+1 ∈ span{vj−1, vj , A vj} ; specifically, verify the three-term recurrence

Avj = βj+1 vj+1 + αj vj + βj vj−1

Note: βj+1 can be obtained by normalizing wj = Avj − αjvj − βjvj−1 .

Remark: Due to this simple recursion, algorithms based on the Lanczos process can be organized in a way such that only 3 vectors need to be kept in memory at the same time.
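A minimal Matlab sketch of Alg. 9.3 (the function name lanczos is ours; no re-orthogonalization is performed, cf. the footnote):

function [V, alpha, beta] = lanczos(A, r0, m)
% Lanczos iteration (Alg. 9.3) for symmetric A: three-term recurrence.
% alpha(1:m) is the diagonal of T_m, beta(2:m) its off-diagonal entries.
n = size(A,1);
V = zeros(n, m+1);  alpha = zeros(m,1);  beta = zeros(m+1,1);
V(:,1) = r0/norm(r0);  vold = zeros(n,1);
for j = 1:m
    w = A*V(:,j) - beta(j)*vold;       % beta(1) = 0, v_0 = 0
    alpha(j) = w'*V(:,j);
    w = w - alpha(j)*V(:,j);
    beta(j+1) = norm(w);
    if beta(j+1) == 0, break, end
    vold = V(:,j);  V(:,j+1) = w/beta(j+1);
end
end

With T = diag(alpha) + diag(beta(2:m),1) + diag(beta(2:m),-1), the identity Tm = VmᵀA Vm can be checked via norm(V(:,1:m)'*A*V(:,1:m) - T).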

9.4 Arnoldi / Lanczos and polynomial approximation

Similarly as for the CG method (cf. Thm. 8.2), there is an intimate connection between the Arnoldi/Lanczos procedure and a polynomial approximation problem:

Arnoldi/Lanczos approximation problem:

Find a monic polynomial pm ∈ Pm such that ∥pm(A) r0∥2 becomes minimal. (9.17)

The solution of this problem is characterized by the following theorem:

43 In practice, the Lanczos iteration is quite sensitive to numerical loss of orthogonality due to round-off. Much research has been invested to fix this problem with reasonable effort by means of some form of re-orthogonalization; cf. [16].


Theorem 9.7 As long as the Arnoldi/Lanczos iteration does not break down (i.e., the Krylov matrix Km has full rank), (9.17) has a unique solution pm , which is precisely given by the characteristic polynomial pm(z) = χm(z) of Hm [Tm] .

Proof: Since Km = Km(r0) = span{r0, A r0, . . . , Am−1r0} and the columns of Vm are a basis of Km , for any monic polynomial p ∈ Pm the vector p(A) r0 can be written as

p(A) r0 = Am r0 − Vm u ∈ Am r0 ⊕Km

with some coefficient vector u ∈ Rm . Hence, (9.17) is equivalent to a linear least squares problem:

Find u ∈ Rm , i.e., Vm u ∈ Km such that ∥Vm u− Am r0∥2 becomes minimal. (9.18)

Under the assumption of the theorem, i.e., if Vm has full rank m , this least squares problem has a unique solution u = um , with pm(A) r0 = Am r0 − Vm um , characterized by the orthogonality relation 44

Vmᵀ( Am r0 − Vm um ) = Vmᵀpm(A) r0 = 0    ⇔    pm(A) r0 ⊥ Km    (9.19)

Now we consider the Arnoldi/Lanczos factorization Hm = VmᵀA Vm (see (9.10)). Due to r0, A r0, . . . , Am−1r0 ∈ Km and since Vm Vmᵀ projects onto Km , we have

VmᵀA r0 = VmᵀA Vm Vmᵀr0 = Hm Vmᵀr0
VmᵀA² r0 = VmᵀA Vm VmᵀA r0 = VmᵀA Vm VmᵀA Vm Vmᵀr0 = Hm² Vmᵀr0
. . .
VmᵀAᵐ r0 = . . . = Hmᵐ Vmᵀr0

hence 45

Vmᵀp(A) r0 = p(Hm) Vmᵀr0    for all p ∈ Pm    (9.20)

Now we consider the characteristic polynomial χm of Hm . χm ∈ Pm is monic and satisfies χm(Hm) = 0(Cayley-Hamilton). Together with (9.20) we conclude

0 = χm(Hm) Vmᵀr0 = Vmᵀχm(A) r0

Thus, pm = χm satisfies (9.19), and we already know that this solution is unique.

Remark 9.8 In the discussion of the GMRES method below (see Sec. 10.4), we will also identify an ‘optimal polynomial’ acting on Km , associated with an approximate solution xm with the property that the resulting residual rm becomes minimal.

Theorem 9.7 also provides a motivation for approximating σ(A) by σ(Hm) , e.g. by starting the Arnoldiiteration from some random r0 . The resulting eigenvalues of the Hm are also called Ritz values, and themethod may be considered as a generalization of the ordinary power iteration. For details see e.g. [9],[19].Convergence results are e.g., available for the symmetric case. Here, the distribution of the true spectrumand the quality of the starting vector r0 compared to the dominant eigenvector play a role.

Exercise 9.9 Consider the projected matrix 46

Am := Vm Hm Vmᵀ = Pm A Pm ∈ Rn×n ,    Pm = Vm Vmᵀ

Show that each eigenvalue of Hm is also an eigenvalue of Am , and all other eigenvalues of Am are zero.

44 (9.19) is the system of Gaussian normal equations for the least squares problem (9.18).
45 See Exercise 9.4 for a closely related identity.
46 See also Remark 9.3.


9.5 The direct Lanczos method for symmetric systems (D-Lanczos)

As in Section 9.3 we now assume that A is symmetric (at this point, not necessarily SPD). The case of ageneral matrix A will be considered in Section 10.

Since the columns of the Lanczos matrix Vm represent an orthonormal basis of Km , we can use this as thebasis for a projection method, in the spirit of the Galerkin orthogonality requirement 47

Find xm ∈ x0 +Km such that (b− Axm, v) = (rm, v) = 0 ∀ v ∈ Km (9.21)

which is identical with the characterization (8.32) obtained for the CG iterates in Corollary 8.3.

We now realize a procedure for computing xm . We make an ansatz for xm in terms of the basis Vm = [ v1 | . . . | vm ] delivered by the Lanczos procedure,

xm = x0 + Vm um (9.22)

where the coefficient vector um ∈ Rm is to be determined. With

Axm = Ax0 + AVm um, rm = b− Axm = r0 − AVm um

enforcing the orthogonality condition (9.21) leads to the requirement

r0 − A Vm um ⊥ Km ,    i.e.,    (r0 − A Vm um, Vm u) = 0 for all u ∈ Rm

which is equivalent to the system of normal equations

Vmᵀ(r0 − A Vm um) = 0    ⇔    VmᵀA Vm um = Tm um = Vmᵀr0 = β e1 ,    β = ∥r0∥2    (9.23)

with e1 = (1, 0, . . . , 0)ᵀ ∈ Rm , because v1 = r0/∥r0∥2 (cf. (9.2)). Thus, um is determined by the tridiagonal system

Tm um = β e1 (9.24)

which is a projected version of the original system Ax = b in the sense of (9.21) (cf. also (9.11)).

If the Lanczos iteration does not break down, the system (9.24) has a unique solution which can becomputed via LU -decomposition of Tm . Furthermore, combination of the Lanczos procedure with thiselimination process leads us to an iterative scheme for the solution of (9.24). First, we study the recursivestructure of the LU -decomposition of Tm in the following exercise.

Exercise 9.10 Assume that for m = 1 . . .M , the tridiagonal Lanczos matrices

Tm ∈ Rm×m (tridiagonal, with diagonal entries α1, . . . , αm and off-diagonal entries β2, . . . , βm )

from (9.16) admit LU -decompositions Tm = Lm Um . Show for m = 2 . . .M :

47 In the SPD case, (9.21) is equivalent to minimizing ∥em∥A over all possible error vectors em ∈ e0+Km , see Thm. 8.2. Thisis not equivalent to the (also reasonable) minimal residual requirement of minimizing ∥rm∥2 over all possible rm ∈ r0+AKm ,which will be considered later on.


(i) The LU -decomposition of Tm has the bidiagonal form given in (9.28) below.

(ii) Verify the following recursive formulas for the values λm , ωm , ηm in (9.28):

ωm = βm ,    λm = βm/ηm−1 ,    ηm = αm − λm ωm    (9.25)

Conclude that the matrices Lm and Um are recursively obtained from Lm−1 , Um−1 by adding one row andcolumn, i.e.,

Lm = [ Lm−1 , 0 ; (0, . . . , 0, λm) , 1 ] ,    Um = [ Um−1 , (0, . . . , 0, ωm)ᵀ ; 0 , ηm ]    (9.26)

(iii) Show: If the factors L,U ∈ Rm×m of the LU -decomposition of a matrix T ∈ Rm×m have the form

L = [ L′ , 0 ; ℓᵀ , 1 ] ,    U = [ U′ , u ; 0 , η ]

with ℓ, u ∈ Rm−1 , then L−1 and U−1 can be written as

L⁻¹ = [ L′⁻¹ , 0 ; −ℓᵀL′⁻¹ , 1 ] ,    U⁻¹ = [ U′⁻¹ , −(1/η) U′⁻¹u ; 0 , 1/η ]    (9.27)

This means that when the tridiagonal matrix is enlarged by one row and column, the inverses of its LU factors can also be obtained from the previous inverses L′⁻¹ , U′⁻¹ by simply adding one row and column.

We write the tridiagonal matrix Tm in terms of its LmUm - factorization as

Tm = Lm Um ,    where Lm is unit lower bidiagonal with subdiagonal entries λ2, . . . , λm , and Um is upper bidiagonal with diagonal entries η1, . . . , ηm and superdiagonal entries ω2, . . . , ωm .    (9.28)

Combining this with (9.22), (9.24) leads to the representation of the iterates xm in the form

xm = x0 + Vm Um⁻¹ Lm⁻¹ (β e1)

From the Lanczos iteration (Alg. 9.3) we have a short recurrence for the columns vj of Vm , which constitute an orthogonal basis of Km . The matrices Um⁻¹ and Lm⁻¹ can be computed according to Exercise 9.10. Hence, we expect to be able to find a short recurrence for the vectors xm in form of an iteration of Krylov type. To this end we introduce the substitutions

Dm = Vm Um⁻¹ ,    zm = Lm⁻¹ β e1

and obtain
xm = x0 + Dm zm    (9.29)

To derive a recurrence for the xm , we first consider the vectors zm : From Exercise 9.10 we obtain

zm = Lm⁻¹ β e1 = [ zm−1 ; ζm ]    (9.30)

since, by Exercise 9.10, Lm⁻¹ is obtained from Lm−1⁻¹ by appending one row (ℓmᵀ , 1) ,


where ζm ∈ R is given by ζm = β ℓmᵀe1 . Next, the explicit form of Um and its inverse allows us to infer from Exercise 9.10:

Dm = Vm Um⁻¹ = [ Vm−1 | vm ] Um⁻¹ = [ Dm−1 | dm ] ,    dm = Vm−1 ûm + (1/ηm) vm ,

where ûm denotes the first m−1 entries of the last column of Um⁻¹ (cf. (9.27)),

i.e., the matrix Dm is obtained from Dm−1 by simply adding one column. Inserting these findings aboutDm and zm into (9.29), we conclude

xm = x0 + Dm zm = x0 + [ Dm−1 | dm ] [ zm−1 ; ζm ] = x0 + Dm−1 zm−1 + ζm dm = xm−1 + ζm dm    (9.31)

Thus, (9.31) is a simple update formula for the approximations xm . We now analyze in more detail how the search directions dm and the values ζm can be computed efficiently.

dm is the m - th column of Dm = Vm Um⁻¹ . It can be determined by considering the m - th column of the product Dm Um = Vm , from which we note

Vj,m = ∑_{i=1}^{m} Dj,i Ui,m = Dj,m−1 ωm + Dj,m ηm    (9.32)

From (9.32) we can construct a three term recurrence involving two consecutive search directions and thebasis vectors vi :

vm = ωm dm−1 + ηm dm (9.33)

This formula is also correct for the special case m = 1 provided we set d0 = 0 (verify!). Similarly, zm isfound by forward substitution on the system

Lm zm = β e1

which yields

zm = [ zm−1 ; ζm ] ,    with ζm = −λm ζm−1    (9.34)

i.e., the new element at the end of the vector zm is just the last element from zm−1 multiplied by −λm . As a result of these observations, xm can be updated in each step by

xm = xm−1 + ζm dm (9.35)

The so-called D-Lanczos algorithm (see Alg. 9.4) is a concrete realization of these ideas, namely:

(i) the recurrence for the Lanczos vectors vm together with the definition of the scalars αm and βm ;

(ii) the formulas for the entries λm, ωm, ηm of the LU -decomposition of Tm (9.25);

(iii) the recurrence for the search directions dm (9.33);

(iv) the formula (9.34) for the value ζm ; and finally,

(v) the update formula (9.35) for the iterates xm .


Algorithm 9.4 D-Lanczos

1: Compute r0 = b − A x0 ; ζ1 = β = ∥r0∥2 ; v1 = r0/β
2: λ1 = β1 = 0 , d0 = 0
3: for m = 1, 2, . . . do
4:   Compute w = A vm − βm vm−1 and αm = (w, vm)
5:   if m > 1 then compute λm = βm/ηm−1 and ζm = −λm ζm−1
6:   ηm = αm − λm βm
7:   dm = (vm − βm dm−1)/ηm
8:   xm = xm−1 + ζm dm
9:   if xm has converged, then Stop
10:   w = w − αm vm ; βm+1 = ∥w∥2 ; vm+1 = w/βm+1

11: end for

Exercise 9.11 How many vectors of length n do you need to keep in memory at any given time? Using the characterization (9.21), show that ζm could alternatively be computed as

ζm = (rm−1, dm)/(dm, A dm)

Remark 9.12 The D-Lanczos algorithm relies on the symmetry of the matrix A , but it is not assumedthat A be SPD. However, the above derivation assumes the existence of an LU -decomposition of thematrices Tm . If A is SPD, then we shall see that this assumption is valid, and the D-Lanczos producesiterates identical to the CG iterates, as shown in the next section. If A is merely symmetric, then it ispossible that the D-Lanczos algorithm breaks down.

9.6 From Lanczos to CG

The D-Lanczos algorithm relies on linking recurrences for the orthogonal basis vectors vm and the searchdirections dm to each other by means of the entries of the LU -decomposition of the matrices Tm . Wenow show that this can be written in a simpler way, where the factorization is not explicitly required. ForSPD systems we will see that this is always possible, i.e., the iteration does not break down, and we showthat the result is exactly the CG algorithm from Section 8.

Lemma 9.13 Let A ∈ Rn×n be symmetric. Let xm,m = 0, 1 . . . be the sequence of approximations obtainedby the D-Lanczos Algorithm 9.4 – we assume that the algorithm does not break down. Let rm = b−Axmbe the sequence of residuals. Then:

(i) rm = σm vm+1 for some σm ∈ R ;

(ii) the residuals rm are pairwise orthogonal, i.e., (ri, rj) = 0 for i ≠ j ;

(iii) the search directions dm are pairwise conjugate ( A -orthogonal), i.e., (A di, dj) = 0 for i ≠ j .

Proof:

ad (i): Use em = (0, 0, . . . , 0, 1)ᵀ ∈ Rm and write T̄m ∈ R(m+1)×m for the matrix obtained from Tm by appending the row tm+1,m emᵀ (with tm+1,m = βm+1 ),

and compute the residual,

rm = b − A xm = b − A (x0 + Vm um) = r0 − A Vm um = β v1 − Vm+1 T̄m um
   = Vm (β e1 − Tm um) − vm+1 tm+1,m emᵀum = σm vm+1    (9.36)

(note that β e1 − Tm um = 0 by (9.24))

for some scalar σm . Here we have used identity (9.8) with H̄m = T̄m .


ad (ii): Since the residuals rm are multiples of the vm+1 , which are pairwise orthogonal, the residuals areorthogonal to each other as well.

ad (iii): Consider the matrix product

DmᵀA Dm = Um⁻ᵀ VmᵀA Vm Um⁻¹ = Um⁻ᵀ Tm Um⁻¹ = Um⁻ᵀ Lm

I.e., the lower triangular matrix Um⁻ᵀ Lm equals the symmetric matrix DmᵀA Dm , and therefore must be a diagonal matrix. This means that the vectors di form a conjugate set, i.e., (A di, dj) = 0 for i ≠ j .

We now assume that A is SPD. In this case,

Tm = VmᵀA Vm is clearly also SPD,

therefore the LU -decomposition Tm = LmUm is well-defined. 48 In particular, the ηm computed by theD-Lanczos algorithm do not vanish (cf. (9.25),(9.28)).

Lemma 9.13 describes the same orthogonality relationships which we established for the CG method inSection 8.3. This is not surprising, since the Galerkin orthogonality requirement (9.21) which led us toD-Lanczos is identical with (8.32) (cf. Corollary 8.3), which characterizes the CG iterates. Therefore, theresulting iterates xm must be identical. In particular, the D-Lanczos algorithm does not break down inthe SPD case.

This can also be verified in an explicit way. To this end we rewrite the D-Lanczos recurrence by directlyimposing the orthogonality conditions from Lemma 9.13: Recalling the update formula (9.35), we see thatxj+1 is of the form

xj+1 = xj + αj dj (9.37)

where dj is the new search direction. 49 Therefore, the residual vectors must satisfy the recurrence

rj+1 = rj − αj Adj (9.38)

Assume rj ≠ 0 , i.e., the exact solution has not yet been encountered. Condition rj+1 ⊥ rj therefore leads to (rj+1, rj) = (rj − αj A dj, rj) = 0 , hence 50

αj = (rj, rj)/(A dj, rj)    (9.39)

Note, in particular, that a scaling of dj (i.e., multiplication of dj by a factor σ ∈ R ) changes αj but notxj+1 .

Furthermore, relations (9.36) and (9.33) tell us that the new search direction dj+1 is a linear combination of the old search direction dj and vj+1 . Thus, assuming rj+1 ≠ 0 we have dj+1 ∈ span{rj+1, dj} . Hence, we make the ansatz

dj+1 = rj+1 + βj dj (9.40)

The conjugacy condition (dj+1, dj)A = 0 and (9.38),(9.39) yield

βj = −(rj+1, A dj)/(A dj, dj) = (rj+1, rj+1)/(rj, rj)    (9.41)

These recursive formulas are the heart of the CG-algorithm, as already known from Section 8:

48 Consider, e.g. the Cholesky decomposition of Tm which is well-defined for any SPD matrix (no pivoting!) and rescale –a simple exercise.

49 Here we are changing our notation: Now, the vectors dj correspond to the vectors dj+1 in D-Lanczos method. Thestarting index is j = 0 , as in our original formulation of the CG method.

50 Note that rj = 0 implies (Adj , rj) = 0 , i.e., αj is well-defined. Furthermore, it implies αj = 0 .

Iterative Solution of Large Linear Systems Ed. 2011

Page 90: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

86 10 GENERAL KRYLOV SUBSPACE METHODS, IN PARTICULAR GMRES

1. Compute xj+1 = xj + αj dj , where αj is given by (9.39).

2. Compute the new search direction dj+1 as dj+1 = rj+1 + βj dj , where βj is given by (9.41).

In practice the value αj is computed in slightly different way, using the conjugacy of the dj , we obtain

(Adj, rj) = (dj, dj − βj−1 dj−1)A = (dj, dj)A = (Adj, dj)

such that that αj =(rj ,rj)

(Adj ,dj). See Alg. 8.1 !

From the derivation, it is clear that the CG algorithm does not stop unless rj = 0 (the ‘lucky breakdown’).

Remark 9.14 The storage requirement for the CG algorithm 8.1 is: 4 vectors ( x, d, A d, r ) plus A (ora function representing evaluation of Ax ). The storage requirement for the (mathematically equivalent)D-Lanczos algorithm 9.4 is: 5 vectors ( vm, vm−1, w, d, x ), plus A .

10 General Krylov Subspace Methods, in particular GMRES

We now assume that A ∈ Rn×n is an arbitrary square matrix. In general, Krylov subspace approximations are defined via a projection property, which has the general form of a Petrov-Galerkin orthogonality requirement:

Find xm ∈ x0 +Km such that (b− Axm, v) = 0 ∀ v ∈ Lm (10.1)

Here, the ansatz space (or solution space) is Km = Km(A, r0) . Different Krylov methods differ in thechoice of the test space Lm . We will consider three choices:

(i) ‘Full orthogonalization methods’ (FOM; e.g., D-Lanczos) and conjugate gradient methods (CG):

We choose Lm = Km (cf. (9.21)):

Find xm ∈ x0 +Km such that (b− Axm, v) = 0 ∀ v ∈ Lm = Km (10.2)

With rm = b− Axm = −Aem this is equivalent to the requirement

rm ⊥ Km ⇔ Aem ⊥ Km (10.3)

in particular

em ⊥ AKm for A symmetric, em ⊥A Km for A SPD (10.4)

If A is SPD, we already know from Section 8.4 that this always has a unique solution xm satisfying

∥em∥A = ∥xm − x∗∥A = min_{x ∈ x0+Km} ∥x − x∗∥A = min_{e ∈ e0+Km} ∥e∥A    (10.5)

i.e., the energy norm of the error is minimized over e0+Km . In other words: Condition (10.4) is thecharacterization of the solution xm of (10.5) in the sense of ‘Galerkin orthogonality’.

Ed. 2011 Iterative Solution of Large Linear Systems

Page 91: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

87


Figure 10.22: The orthogonality conditions (10.6) and (10.2)

(ii) ‘[Generalized] minimal residual methods’ (MINRES, GMRES):

We choose Lm = AKm , i.e.,

Find xm ∈ x0 +Km such that (b− Axm, v) = 0 ∀ v ∈ Lm = AKm (10.6)

or equivalently
rm ⊥ Lm = AKm    ⇔    A em ⊥ AKm    (10.7)

As shown below, this is equivalent to

∥rm∥2 = ∥b − A xm∥2 = min_{x ∈ x0+Km} ∥b − A x∥2 = min_{r ∈ r0+AKm} ∥r∥2    (10.8)

which motivates the term ‘minimal residual’ method.

(iii) ‘Biorthogonal methods’:

Here, Lm = Km(Aᵀ, r̃0) for some r̃0 . This choice will lead to the Biconjugate Gradient method (BiCG) and its variants.

The orthogonality conditions (10.2), (10.6) are graphically illustrated in Fig. 10.22.

Lemma 10.1 Conditions (10.6)/(10.7) and (10.8) are equivalent.

Proof: This is related to the characterization of the solution of a linear least squares problem via itssystem of normal equations.

For given x0, r0 , consider an arbitrary x ∈ x0+Km , with residual vector r ∈ r0+AKm = r0+Lm . Writer = r0 + l with l ∈ Lm . Furthermore, we decompose the initial residual according to

r0 = Pm r0 +Qm r0 ∈ Lm ⊕ L⊥m

where Pm is the orthogonal projector onto Lm , and Qm = I − Pm . Thus, r has the unique decomposition

r = (l + Pm r0) +Qm r0 ∈ Lm ⊕ L⊥m,

Iterative Solution of Large Linear Systems Ed. 2011

Page 92: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

88 10 GENERAL KRYLOV SUBSPACE METHODS, IN PARTICULAR GMRES

with
∥r∥²2 = ∥l + Pm r0∥²2 + ∥Qm r0∥²2

Since Qm r0 is a priori fixed, this attains its minimal value for l = −Pm r0 , i.e., the minimum is attained at

rm = r0 − Pm r0 = Qm r0 ∈ Lm⊥ ,  i.e.,  rm ⊥ Lm ,

which is exactly the Petrov-Galerkin orthogonality requirement (10.7).

This argument can also be reversed, i.e., if r satisfies (10.7), then r = Qm r and then ∥r∥22 = ∥Qm r∥22 =∥Qm r0∥22 , which is the minimal attainable value.

For A invertible, the corresponding solution value xm with b − A xm = rm is uniquely defined: xm = A⁻¹(b − rm) . In the following we study the computation of xm by means of a Krylov subspace method. Similarly as the D-Lanczos iteration was derived from the Lanczos procedure in Section 9.5, our starting point is the general Arnoldi procedure for general matrices A , see Sections 9.1, 9.2.

Exercise 10.2

(i) Let A ∈ Rn×n be an invertible matrix. Use (8.2) to conclude that the approximation xn obtained in the n-th step according to criterion (10.8) is the exact solution x∗ of Ax = b .

(ii) Assume additionally that for some m ≤ n there holds Km = Kn . Show that then already xm = xm+1 =· · · = x∗ .

Hint: Show that Kk = Km = Kn also for all k > m .

10.1 Computing the xm

As for all Krylov subspace methods, the approximations xm, m = 0, 1, . . . are computed in turn until anapproximation xm is found which is sufficiently accurate. It is, of course, essential that the xm can becomputed efficiently from the orthogonality conditions (10.6) or (10.2). The general procedure is:

• Construct the n×m Arnoldi matrix Vm = [ v1 | . . . | vm ] . By construction, its columns vj form an orthogonal basis of Km .

• Construct the n×m matrix Wm = [ w1 | . . . | wm ] such that its columns wj form a basis of Lm . For Lm = AKm this is given by Wm = A Vm , which is generated in course of the Arnoldi procedure.

• Write the approximate solution as xm = x0 + Vm um , where um ∈ Rm is a vector of weights to be determined.

• Enforcing either of the orthogonality conditions (10.6),(10.2) leads to the system of normal equations

Wmᵀ(A xm − b) = 0    ⇔    WmᵀA Vm um = Wmᵀr0    (10.9)

from which the approximate solution xm follows in the form

xm = x0 + Vm um = x0 + Vm (WmᵀA Vm)⁻¹ Wmᵀr0    (10.10)

Note that the matrix W Tm AVm is only of size m×m ; therefore its inversion is affordable for m≪ n .


Remark 10.3 One may also think of proceeding in the following way, motivated by Exercise 9.4: For all p ∈ Pm−1 we have

Vm p(Hm) Vmᵀr0 = p(A) r0

Formally replacing p(A) by A⁻¹ would suggest the approximate identity

Vm Hm⁻¹ Vmᵀr0 ≈ A⁻¹ r0

which gives

xm := x0 + Vm Hm⁻¹ Vmᵀr0 ≈ x0 + A⁻¹ r0 = x∗

A look at (10.10) shows that this gives a FOM method, Lm = Km . (Recall that Hm = VmᵀA Vm .)

10.2 The GMRES (Generalized Minimal Residual) method

GMRES is the most popular Krylov subspace method applicable to any nonsingular matrix A , basedon the Petrov-Galerkin orthogonality (minimal residual) requirement (10.6), Lm = AKm . For given m ,GMRES computes an orthogonal basis v1, . . . , vm of Km = Km(A, r0) and solves a linear system ofdimension m equivalent to (10.10) in an efficient way.

The first step of the GMRES algorithm is to generate the set of basis vectors v1 . . . , vm by means of theArnoldi procedure, in the formulation based on Modified Gram-Schmidt, see Section 9.2 (Alg. 9.2).

Exercise 10.4 Assume r0 ≠ 0 and let H̄m ∈ R(m+1)×m be defined by the Arnoldi algorithm,

H̄m = (hij) ∈ R(m+1)×m upper Hessenberg as in Section 9.1, i.e., hij = (A vj , vi) for i ≤ j , hj+1,j = ∥wj∥2 = (A vj , vj+1) , and hij = 0 for i > j+1 .

Consider the subdiagonal entries of H̄m , and assume hj+1,j ≠ 0 for j = 1 . . .m−1 . Show:

(i) Km = span{v1, . . . , vm} and dim Km = m .

(ii) H̄m has full column rank: rank H̄m = m (cf. e.g. (9.8)).

(iii) If hm+1,m = 0 , then Am r0 ∈ Km (!). Conclude that Km = Km+1 = · · · = Kn .

Exercise 10.4 shows that as long as Alg. 9.2 does not break down, i.e., if hj+1,j ≠ 0 for j = 1 . . .m−1 , then the Arnoldi vectors v1, . . . , vm form an orthogonal basis of Km (this fact was already formulated in Lemma 9.1 above). Note that it is essential for Vm ∈ Rn×m to have full rank in order for the expression (WmᵀA Vm)⁻¹ in (10.10) to be meaningful. The case of a breakdown, i.e., hm+1,m = 0 with hj+1,j ≠ 0 for j = 1 . . .m−1 , is called a lucky breakdown 51 because, as we will see below, in this case we already hit the exact solution x∗ .

Let us assume the no-breakdown condition

hj+1,j ≠ 0 for j = 1 . . .m−1    (10.11)

51 GMRES has no ‘serious breakdown’ condition; however, a ‘computational breakdown’ may occur if the matrix H̄m to be stored in memory becomes too large. See Remark 10.7 below.


With β = ∥r0∥2 we have β v1 = r0 . Additionally, we observe

β Vm+1 e1 = β v1 = r0 (10.12)

with e1 = (1, 0, 0, . . . , 0)T ∈ Rm+1 .

The optimality criterion (10.8) for GMRES is the key to compute xm in the form xm = x0 + Vm um . Using (10.12) and A Vm = Vm+1 H̄m (see (9.8)) we can write the residual of a vector x = x0 + Vm u in the form

b − A x = b − A (x0 + Vm u) = r0 − A Vm u = β v1 − Vm+1 H̄m u = Vm+1 (β e1 − H̄m u)

Moreover, since the columns of Vm+1 are orthonormal, we have ∥b − A x∥2 = ∥Vm+1 (β e1 − H̄m u)∥2 = ∥β e1 − H̄m u∥2 . Thus, the optimality criterion (10.8) is equivalent to a least squares problem for u : Find u = um ∈ Rm for which the minimum

min_{x ∈ x0+Km} ∥b − A x∥2 = min_{u ∈ Rm} ∥β e1 − H̄m u∥2    (10.13)

is obtained. Since assumption (10.11) guarantees that H̄m has full rank, the problem (10.13) can be uniquely solved for u . One way to realize this is to set up and solve the normal equations

HTm Hm um = HT

m β e1 (10.14)

for the minimization problem (10.13). For this latter system, for example, Cholesky decomposition ofHT

m Hm could be employed (this would lead to a cost O(m3) ). However, in view of the fact that Hm hasupper Hessenberg form, it is customary to employ a QR factorization of Hm instead, as discussed below.

Exercise 10.5 Brief review of the solution of linear least squares problems (cf. e.g. [2]):

Let k ≤ n and B ∈ R^{n×k} have full (column) rank. Let b ∈ R^n. Show:

(i) The minimization problem

Find uopt ∈ R^k such that ∥b − B uopt∥2 = min_{u ∈ R^k} ∥b − B u∥2    (10.15)

has a unique solution uopt, given by the solution of the normal equations

B^T B u = B^T b

(ii) Let B be decomposed in the form B = QR, where Q ∈ R^{n×n} is orthogonal and R ∈ R^{n×k} has 'generalized upper triangular form',

R = [ R̂
      0 ]

where R̂ ∈ R^{k×k} is upper triangular. Show:

a) The assumption that B has full (column) rank implies that R̂ is invertible.

b) Set b̂ = Q^T b ∈ R^n and decompose it as b̂ = (b̂1^T, b̂2^T)^T, where b̂1 ∈ R^k and b̂2 ∈ R^{n−k}. Show: The solution uopt ∈ R^k of the minimization problem (10.15) is given by

uopt = R̂^{-1} b̂1

Hint: Since Q is orthogonal, ∥b − B u∥2 = ∥Q^T b − Q^T B u∥2 = ∥Q^T b − R u∥2.

c) Show that, actually, a reduced QR factorization is sufficient, i.e., only the first k columns of Q are required. These columns form an orthonormal basis of the column space of B. (In Matlab, the reduced (or 'economy size') QR factorization is generated by qr(B,0).)

d) Show that the minimizer uopt satisfies ∥b − B uopt∥2 = ∥b̂2∥2. Thus, the minimal residual norm can be computed independently of uopt.
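To make the recipe of Exercise 10.5 concrete, the following Matlab sketch (with generic data B, b chosen for illustration only) compares the QR-based solution of (10.15) with the normal equations and with Matlab's backslash:

% Least squares via reduced QR, normal equations, and backslash (sketch).
n = 200; k = 10;
B = randn(n,k);                 % full column rank with probability 1
b = randn(n,1);

[Q,Rhat] = qr(B,0);             % reduced ('economy size') QR: Q is n-by-k
u_qr = Rhat \ (Q'*b);           % u_opt = Rhat^{-1} * (Q^T b), back substitution

u_ne = (B'*B) \ (B'*b);         % normal equations (less stable for ill-conditioned B)
u_bs = B \ b;                   % Matlab's built-in least squares solver

res = norm(b - B*u_qr);         % minimal residual norm
disp([norm(u_qr-u_bs), norm(u_ne-u_bs), res])

All three approaches agree up to round-off for well-conditioned B; the QR route is the one exploited by GMRES below.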


The least squares problem (10.13) has the standard form considered in Exercise 10.5, i.e., um is the solution of the minimization problem

Find u ∈ R^m such that ∥β e1 − H̄m u∥2 is minimal.    (10.16)

In view of our assumption (10.11) we know that H̄m has full rank. The QR factorization of H̄m is easy to realize because the upper Hessenberg matrix H̄m ∈ R^{(m+1)×m} is already 'close to upper triangular': Only the m non-zero elements of H̄m below the diagonal need to be annihilated, e.g., using Householder reflections or Givens rotations (cf. e.g. [2]).

This QR factorization gives H̄m = Qm+1 R̄m, where Qm+1 ∈ R^{(m+1)×(m+1)} is orthogonal and R̄m ∈ R^{(m+1)×m} has the form

R̄m = [ Rm
        0  ]

where Rm ∈ R^{m×m} is upper triangular. As discussed in Exercise 10.5, Rm is invertible. Hence, the coefficient vector um ∈ R^m can be obtained via back substitution from

Rm u = g

where g consists of the first m components of β Qm+1^T e1 = (g^T, g')^T with g ∈ R^m and g' ∈ R. The pseudo-code for this basic form of the GMRES algorithm is given as Alg. 10.1. It consists of the Arnoldi iteration up to dimension m, followed by the solution of the resulting least squares problem (10.16).

Algorithm 10.1 GMRES (basic form)

1: Compute r0 = b − Ax0, β = ∥r0∥2, and v1 = r0/β
2: Define the (m+1)×m matrix H̄m and initialize elements hij to zero
3: for j = 1, 2, ..., m do
4:   Compute wj = A vj
5:   for i = 1, ..., j do
6:     hij = (wj, vi)
7:     wj = wj − hij vi
8:   end for
9:   hj+1,j = ∥wj∥2 ; if hj+1,j = 0 set m = j and goto 12
10:  vj+1 = wj/hj+1,j
11: end for
12: Compute um as the minimizer of ∥β e1 − H̄m u∥2 and set xm = x0 + Vm um
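For illustration, here is a minimal Matlab transcription of Alg. 10.1 (a sketch, not an optimized implementation: the least squares problem is solved with backslash only once at the end, and no restart or convergence test is included; the function name gmres_basic is ours):

function x = gmres_basic(A, b, x0, m)
% Basic GMRES (Alg. 10.1): Arnoldi with modified Gram-Schmidt,
% then solution of the (m+1)-by-m least squares problem for um.
n = length(b);
r0 = b - A*x0;  beta = norm(r0);
V = zeros(n, m+1);  H = zeros(m+1, m);
V(:,1) = r0/beta;
for j = 1:m
    w = A*V(:,j);
    for i = 1:j                       % modified Gram-Schmidt
        H(i,j) = w'*V(:,i);
        w = w - H(i,j)*V(:,i);
    end
    H(j+1,j) = norm(w);
    if H(j+1,j) == 0, m = j; break;   % lucky breakdown
    end
    V(:,j+1) = w/H(j+1,j);
end
e1 = zeros(m+1,1);  e1(1) = beta;
um = H(1:m+1,1:m) \ e1;               % least squares (backslash uses QR)
x  = x0 + V(:,1:m)*um;
end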

Exercise 10.6 Use the fact that computing the QR factorization of the upper Hessenberg matrix H̄m requires O(m^2) flops. Assume that the matrix A ∈ R^{n×n} is sparse and that the cost of a matrix-vector multiplication x ↦ Ax is O(n). Show that the cost of GMRES is O(m^2 n) flops. What is its memory requirement?

Remark 10.7 Some comments on Alg. 10.1:

• Our derivation of Alg. 10.1 assumed the no-breakdown condition (10.11). GMRES terminates upon finding the first j = m with hm+1,m = 0, i.e., hj+1,j ≠ 0 for j = 1...m−1 and hm+1,m = 0. This situation is called a 'lucky breakdown', since from Exercise 10.2 we know that then Km = Kn; furthermore, from Exercise 10.2 we get xm = x*, i.e., GMRES happens to hit exactly the desired solution x*. Concerning the computation of xm, we note that Exercise 10.4 shows dim Km = m and that H̄m has full rank, such that the computation of xm in line 12 of Alg. 10.1 is possible.


• In practice, the GMRES algorithm is implemented in a different way: The parameter m is not fixed a priori. Instead, a maximum number mmax is given, typically dictated by computational resources, in particular the memory capacity available.

• The vectors v1, v2, ... are computed successively together with the matrices H̄m. Note that, for computing the optimal um and extracting the approximate solution xm = x0 + Vm um at the end, the full matrix Vm ∈ R^{n×m} – which becomes large for increasing m – has to be kept in memory!^52 Once the vectors v1, ..., vm−1 and the matrix H̄m−1 have already been computed, it is sufficient to compute the next Arnoldi vector vm, and the matrix H̄m is obtained from H̄m−1 by adding one column and the entry hm+1,m. Also, the required QR factorizations can be efficiently updated from step to step.

• An appropriate termination condition (typically based on the size of the residual ∥rm∥2 = ∥b − Axm∥2) is employed to stop the iteration. Note that the algorithm does not automatically provide xm explicitly as long as only um has been computed. However, the residual norm

∥rm∥2 = ∥b − Axm∥2 = ∥β e1 − H̄m um∥2

is easily evaluated on the basis of Exercise 10.5, (ii d); see also [16].

• If the maximum number of iterations has been reached without triggering the termination condition, then a restart is performed, i.e., GMRES is started afresh with the last approximation xmmax as the initial guess. This is called restarted GMRES, GMRES(mmax), in the literature. After a restart, the old data are 'forgotten', and storage of the vj etc. begins to accumulate 'from scratch'.

• A more detailed description of the various practical implementation issues can be found in [16].

Remark 10.8 Faute de mieux, the residual ∥b − Axm∥2 is typically used as a stopping criterion. It should be noted that for matrices A with large κ2(A), the error may be large in spite of the residual being small:

∥xm − x*∥2 / ∥x*∥2 ≤ κ2(A) ∥b − Axm∥2 / ∥b∥2
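A quick (purely illustrative) Matlab experiment shows this effect for an ill-conditioned matrix:

% Small residual does not imply small error for ill-conditioned A (sketch).
n = 50;
A = hilb(n);                       % Hilbert matrix, notoriously ill-conditioned
xstar = ones(n,1);  b = A*xstar;
x = A \ b;                         % even the direct solve is affected
rel_res = norm(b - A*x)/norm(b);
rel_err = norm(x - xstar)/norm(xstar);
fprintf('rel. residual = %.1e, rel. error = %.1e\n', rel_res, rel_err);

The relative residual is at round-off level while the relative error can be many orders of magnitude larger, in accordance with the bound above.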

Exercise 10.9 The Krylov method analogous to GMRES for the special case of a symmetric, possibly indefinite matrix A is called the Minimal Residual Method (MINRES). Similarly to the D-Lanczos method (which is of FOM type), MINRES involves tridiagonal matrices H̄m = T̄m, which leads to compact update formulas similar to D-Lanczos. Explore this version referring to Section 6.6 in [16] ('Symmetric Lanczos Algorithm').

10.3 GMRES in Matlab: The function gmres

Example 10.10 Matlab has a robust version of restarted GMRES that can be used for experimentation. Applying this version of GMRES to the SPD matrix A ∈ R^{1806×1806} bcsstk14.mtx from the MatrixMarket collection, with exact solution x = (1, 1, ..., 1)^T, results in the convergence history plotted in Fig. 10.23. We note that the residual decays as the number of iterations increases. When the number of iterations reaches the problem size, the exact solution should be found. As this example shows, this does not happen exactly in practice due to round-off, but the final residual is quite small.

^52 For CG and D-Lanczos the situation is different because the vi can be reconstructed from a three-term recurrence; see Exercise 9.6.
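The experiment can be reproduced along the following lines (a sketch; it assumes that bcsstk14.mtx has been downloaded from the MatrixMarket site and that a Matrix Market reader such as mmread.m is available on the path):

% Sketch: GMRES convergence history for bcsstk14.mtx (assumes mmread.m).
A = mmread('bcsstk14.mtx');          % SPD test matrix, n = 1806
n = size(A,1);
xstar = ones(n,1);  b = A*xstar;
tol = 1e-10;  maxit = n;
[x,flag,relres,iter,resvec] = gmres(A, b, [], tol, maxit);  % unrestarted
loglog(resvec, 'LineWidth', 2)
xlabel('iteration'), ylabel('residual')
title('GMRES applied to bcsstk14.mtx; x = (1,1,...,1)^T')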


Figure 10.23: Convergence history of GMRES (A is SPD). [Plot: residual vs. iteration, loglog scale; title: 'GMRES applied to bcsstk14.mtx; x = (1,1,...,1)^T'.]

It should be noted that, generally speaking, GMRES is employed in connection with a suitable preconditioner. From this we expect a significant improvement of the convergence behavior. Preconditioning will be discussed in Section 12.

The following is a printed copy of the help page for the GMRES method implemented in Matlab. As for CG, as an alternative to the matrix A, a function AFUN which realizes the operation x ↦ Ax is sufficient. The function gmres also supports a restarting strategy and preconditioning in several variants, but the concrete preconditioner has to be provided by the user. As for A, the preconditioner may be specified in form of a function handle.

GMRES Generalized Minimum Residual Method.

X = GMRES(A,B) attempts to solve the system of linear equations A*X = B

for X. The N-by-N coefficient matrix A must be square and the right

hand side column vector B must have length N. This uses the unrestarted

method with MIN(N,10) total iterations.

X = GMRES(AFUN,B) accepts a function handle AFUN instead of the matrix

A. AFUN(X) accepts a vector input X and returns the matrix-vector

product A*X. In all of the following syntaxes, you can replace A by

AFUN.

X = GMRES(A,B,RESTART) restarts the method every RESTART iterations.

If RESTART is N or [] then GMRES uses the unrestarted method as above.

X = GMRES(A,B,RESTART,TOL) specifies the tolerance of the method. If

TOL is [] then GMRES uses the default, 1e-6.

X = GMRES(A,B,RESTART,TOL,MAXIT) specifies the maximum number of outer

iterations. Note: the total number of iterations is RESTART*MAXIT. If

MAXIT is [] then GMRES uses the default, MIN(N/RESTART,10). If RESTART

is N or [] then the total number of iterations is MAXIT.


X = GMRES(A,B,RESTART,TOL,MAXIT,M) and

X = GMRES(A,B,RESTART,TOL,MAXIT,M1,M2) use preconditioner M or M=M1*M2

and effectively solve the system inv(M)*A*X = inv(M)*B for X. If M is

[] then a preconditioner is not applied. M may be a function handle

returning M\X.

X = GMRES(A,B,RESTART,TOL,MAXIT,M1,M2,X0) specifies the first initial

guess. If X0 is [] then GMRES uses the default, an all zero vector.

[X,FLAG] = GMRES(A,B,...) also returns a convergence FLAG:

0 GMRES converged to the desired tolerance TOL within MAXIT iterations.

1 GMRES iterated MAXIT times but did not converge.

2 preconditioner M was ill-conditioned.

3 GMRES stagnated (two consecutive iterates were the same).

[X,FLAG,RELRES] = GMRES(A,B,...) also returns the relative residual

NORM(B-A*X)/NORM(B). If FLAG is 0, then RELRES <= TOL. Note with

preconditioners M1,M2, the residual is NORM(M2\(M1\(B-A*X))).

[X,FLAG,RELRES,ITER] = GMRES(A,B,...) also returns both the outer and

inner iteration numbers at which X was computed: 0 <= ITER(1) <= MAXIT

and 0 <= ITER(2) <= RESTART.

[X,FLAG,RELRES,ITER,RESVEC] = GMRES(A,B,...) also returns a vector of

the residual norms at each inner iteration, including NORM(B-A*X0).

Note with preconditioners M1,M2, the residual is NORM(M2\(M1\(B-A*X))).

Example:

n = 21; A = gallery(’wilk’,n); b = sum(A,2);

tol = 1e-12; maxit = 15; M = diag([10:-1:1 1 1:10]);

x = gmres(A,b,10,tol,maxit,M);

Or, use this matrix-vector product function

%-----------------------------------------------------------------%

function y = afun(x,n)

y = [0; x(1:n-1)] + [((n-1)/2:-1:0)’; (1:(n-1)/2)’].*x+[x(2:n); 0];

%-----------------------------------------------------------------%

and this preconditioner backsolve function

%------------------------------------------%

function y = mfun(r,n)

y = r ./ [((n-1)/2:-1:1)’; 1; (1:(n-1)/2)’];

%------------------------------------------%

as inputs to GMRES:

x1 = gmres(@(x)afun(x,n),b,10,tol,maxit,@(x)mfun(x,n));

Class support for inputs A,B,M1,M2,X0 and the output of AFUN:

float: double

See also bicg, bicgstab, bicgstabl, cgs, lsqr, minres, pcg, qmr, symmlq,

tfqmr, ilu, function_handle.

Reference page in Help browser

doc gmres


10.4 Convergence properties of the GMRES method

A convergence analysis for GMRES can be done along similar lines as for the CG method in Section 8.5.

Lemma 10.11 Let xm be the approximate solution obtained in the m-th step of the GMRES algorithm, with residual rm = b − Axm. Then, rm is of the form

rm = r0 − A pm−1(A) r0 = qm(A) r0,   qm(z) = 1 − z pm−1(z)

with a polynomial pm−1 ∈ Pm−1, and

∥rm∥2 = ∥(I − A pm−1(A)) r0∥2 = min_{p ∈ Pm−1} ∥(I − A p(A)) r0∥2

or equivalently, qm ∈ Pm with qm(0) = 1 and

∥rm∥2 = ∥qm(A) r0∥2 = min_{q ∈ Pm, q(0)=1} ∥q(A) r0∥2

Proof: By construction, xm minimizes the 2-norm of the residual in the affine space x0 + Km. Thus, the assertion follows from the fact that

x0 + Km = {x0 + p(A) r0 : p ∈ Pm−1}   ⇒   rm = b − Axm ∈ {r0 − A p(A) r0 : p ∈ Pm−1}

Lemma 10.11 shows that the GMRES iteration can be identified with a polynomial approximation problem similar to the 'Arnoldi/Lanczos approximation problem' (9.17), namely:

GMRES approximation problem:

Find a polynomial qm ∈ Pm with qm(0) = 1 such that ∥qm(A) r0∥2 becomes minimal.    (10.17)

Theorem 10.12 Assume that A is diagonalizable, A = X Λ X^{-1}, where Λ = Diag(λ1, λ2, ..., λn) is the diagonal matrix of eigenvalues. Let

ϵm = min_{q ∈ Pm, q(0)=1} max_{i=1...n} |q(λi)|    (10.18)

Then the norm of the m-th residual is bounded by

∥rm∥2 ≤ κ2(X) ϵm ∥r0∥2    (10.19)

with κ2(X) = ∥X∥2 ∥X^{-1}∥2.

Proof: Consider an arbitrary polynomial q ∈ Pm with q(0) = 1, and x ∈ x0 + Km such that b − Ax = q(A) r0. Then,

∥b − Ax∥2 = ∥q(A) r0∥2 = ∥X q(Λ) X^{-1} r0∥2 ≤ κ2(X) ∥q(Λ)∥2 ∥r0∥2

with

∥q(Λ)∥2 = max_{i=1...n} |q(λi)|

Since xm minimizes the residual norm over x0 + Km, taking the minimum over all such polynomials q we obtain

∥rm∥2 = min_{x ∈ x0+Km} ∥b − Ax∥2 ≤ κ2(X) ϵm ∥r0∥2

with ϵm from (10.18), as asserted.


The quantity ϵm is related to the question whether a polynomial qm exists which is 'small on the spectrum' of A, in the sense of (10.18). If appropriate a priori information about the spectrum is available, ϵm can be estimated in a similar way as in the convergence proof for the CG method (see Sec. 8.5).

Example 10.13 Consider an invertible matrix A with real spectrum contained in an interval [α, β] ⊂ R+. From Corollary 6.3 (which is based on the Chebyshev min-max Theorem 6.2), with γ = 0, we conclude

ϵm ≤ min_{q ∈ Pm, q(0)=1} max_{i=1...n} |q(λi)| ≤ min_{q ∈ Pm, q(0)=1} max_{λ ∈ [α,β]} |q(λ)| ≤ 2c^m / (1 + c^{2m}),   c = (√κ − 1)/(√κ + 1),   κ = β/α

However, if A is far from normal, κ = β/α is not related to the quantity κ2(A).

Moreover, the condition number κ2(X) of the matrix of eigenvectors appears in the estimate for ∥rm∥2. For normal A we have κ2(X) = 1. In general, however, κ2(X) is typically not known and can be very large, even if the condition number κ2(A) is of moderate size, e.g. if A is 'close to a non-diagonalizable matrix'. The simple convergence result from Thm. 10.12 is therefore of limited practical value.

Remark 10.14 For GMRES, the required effect of a preconditioner to accelerate convergence is less clear cut than, e.g., for the CG method. In general, 'bunching of eigenvalues' is still a good strategy, as ϵm can be bounded by a complex analog of the Chebyshev min-max Theorem; see [16]. However, any preconditioner applied to the system may also have an adverse effect on the condition number κ2(X).

Exercise 10.15 Let A be SPD. Show: the restarted version GMRES (m) converges for any m ≥ 1 .

Hint: For m = 1 consider the relation to steepest descent methods.

Exercise 10.16 Let A ∈ R^{n×n} be of the form

A = [ 1  c1
         1  c2
            ⋱   ⋱
               1  cn−1
                  1    ]

with ci ≠ 0, i.e., upper bidiagonal with unit diagonal and superdiagonal entries c1, ..., cn−1.

Show: The eigenvalue λ = 1 has geometric multiplicity 1. n is the smallest integer m for which (A − I)^m = 0, and the characteristic polynomial χ(z) of A is identical with its minimal polynomial, χ(z) = µ(z) = (z − 1)^n.

Try to explain how the GMRES iteration behaves when applied to this enfant terrible A. (You may also test.)
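A quick test along the lines suggested in the exercise might look as follows (a sketch; the choice ci = 1 reproduces the Jordan block used later in Example 11.10):

% Sketch: GMRES applied to a single Jordan block (c_i = 1).
n = 10;
A = eye(n) + diag(ones(n-1,1), 1);   % enfant terrible with c_i = 1
xstar = ones(n,1);  b = A*xstar;
[x,flag,relres,iter,resvec] = gmres(A, b, [], 1e-14, n);
semilogy(0:length(resvec)-1, resvec/norm(b), '-o')
xlabel('iteration number'), ylabel('relative residual')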


11 Methods Based on Biorthogonalization. BiCG

The disadvantage of GMRES is its high memory requirement. For symmetric matrices A, this can be circumvented by using CG, D-Lanczos or MINRES. The main reason for these savings in the symmetric case is that – in the notation of Section 10 – the matrix Wm^T A Vm becomes tridiagonal. In general, this property is also retained in the biorthogonalization methods which we now study.

Remark 11.1 Consider the Lanczos decomposition from Section 9.3 for symmetric A ∈ R^{n×n},

Vm^T A Vm = Tm ∈ R^{m×m} tridiagonal, symmetric    (11.1)

where the columns vi ∈ R^n of Vm are orthonormal. As discussed before, this can be viewed as a 'reduced version' of the full orthonormal tridiagonalization

V^T A V = T ∈ R^{n×n} tridiagonal, symmetric    (11.2)

where the vi form an orthonormal basis of the full space R^n. The decompositions (11.1) and (11.2) can be obtained in rational arithmetic in a finite number of steps, in contrast to the eigendecomposition

X^T A X = Λ ∈ R^{n×n} diagonal    (11.3)

where the columns xi of X form an orthonormal basis of eigenvectors of A.

The idea of biorthogonality is not so far off. Consider an arbitrary, nonsymmetric diagonalizable matrix A ∈ R^{n×n}, with eigendecomposition

X^{-1} A X = Λ ∈ R^{n×n} diagonal    (11.4)

Here, the columns xj of X form a (generally non-orthonormal) basis of eigenvectors of A; they are also called right eigenvectors of A. Transposing (11.4) gives

X^T A^T X^{-T} = Λ    (11.5)

This shows that the columns yi of X^{-T}, i.e., the rows of X^{-1}, are the (right) eigenvectors of A^T. They are also called left eigenvectors of A. Now, the identity I = X^{-1} X shows

(yi, xj) = δi,j    (11.6)

i.e., the pair of eigenbases (x1, ..., xn) and (y1, ..., yn) is biorthogonal. With Y = X^{-T} we can write (11.4) in the form

Y^T A X = Λ ∈ R^{n×n} diagonal    (11.7)

In view of (11.2), we may again ask for a rational, constructive modification of such a decomposition in the form

W^T A V = T ∈ R^{n×n} (unsymmetric) tridiagonal    (11.8)

with a biorthogonal matrix pair (V, W), i.e., W^T V = I, or, more generally, a reduced version analogous to (11.1),

Wm^T A Vm = Tm ∈ R^{m×m} (unsymmetric) tridiagonal    (11.9)

with Vm, Wm ∈ R^{n×m} satisfying Wm^T Vm = I_{m×m}. This is exactly what the Lanczos biorthogonalization procedure aims to realize.


11.1 Lanczos biorthogonalization

Let v1, w1 ∈ R^n be a pair of vectors, and consider the Krylov spaces Km(A, v1) and Km(A^T, w1). The classical algorithm to compute two sets of vectors v1, ..., vm and w1, ..., wm which span these two spaces and which are biorthonormal, i.e., (vi, wj) = δi,j, is Alg. 11.1 specified below.

This algorithm, the Lanczos biorthogonalization procedure, is a generalization of the Arnoldi/Lanczos procedure discussed in Section 9. We wish to find a pair of biorthogonal matrices Vm, Wm ∈ R^{n×m},

Vm = [ v1 | v2 | ... | vm ],   Wm = [ w1 | w2 | ... | wm ],   Wm^T Vm = I_{m×m}    (11.10)

The idea due to Lanczos is to use two simultaneous Lanczos processes for A and A^T. We make the following ansatz motivated by the Arnoldi/Lanczos procedure:

A Vm = Vm+1 T̄m,    (11.11)
A^T Wm = Wm+1 S̄m    (11.12)

with a pair of closely related tridiagonal matrices,

T̄m = [ α1  β2
       δ2  α2  β3
            ⋱   ⋱    ⋱
               δm−1 αm−1 βm
                     δm   αm
                          δm+1 ] ∈ R^{(m+1)×m},   S̄m = [ α1  δ2
                                                        β2  α2  δ3
                                                             ⋱   ⋱    ⋱
                                                                βm−1 αm−1 δm
                                                                      βm   αm
                                                                           βm+1 ] ∈ R^{(m+1)×m}    (11.13)

The desired relations (11.11) and (11.12) are equivalent to a pair of three-term recurrences,

A vj = βj vj−1 + αj vj + δj+1 vj+1,    (11.14)
A^T wj = δj wj−1 + αj wj + βj+1 wj+1,    (11.15)

for j = 1...m, with β1 = δ1 = 0, from which it will be possible to compute v2, v3, ... and w2, w3, ... as long as the β's and δ's do not vanish.

So far we have not used the biorthogonality requirement for the vectors vi and wj. If this is assured, together with (11.11) it will lead us to a decomposition of the desired form (11.9),

Wm^T A Vm = Wm^T Vm+1 T̄m = Tm    (11.16)

where the square tridiagonal matrix Tm ∈ R^{m×m} is defined from T̄m by removing its last row (or, equivalently, from S̄m^T by removing its last column),

Tm = [ α1  β2
       δ2  α2  β3
            ⋱   ⋱    ⋱
               δm−1 αm−1 βm
                     δm   αm ] ∈ R^{m×m}    (11.17)


Algorithm 11.1 Lanczos biorthogonalization procedure

1: Choose two vectors v1, w1 with (v1, w1) = 1
2: Set β1 = δ1 = 0, w0 = v0 = 0
3: for j = 1, 2, ... do
4:   αj = (Avj, wj)
5:   v̂j+1 = A vj − αj vj − βj vj−1
6:   ŵj+1 = A^T wj − αj wj − δj wj−1
7:   δj+1 = √|(v̂j+1, ŵj+1)| ; if δj+1 = 0 stop
8:   βj+1 = (v̂j+1, ŵj+1)/δj+1
9:   wj+1 = ŵj+1/βj+1
10:  vj+1 = v̂j+1/δj+1
11: end for
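A compact Matlab sketch of Alg. 11.1 (no look-ahead and no safeguards against serious breakdown; the function name bilanczos is ours):

function [V, W, T] = bilanczos(A, v1, w1, m)
% Lanczos biorthogonalization (Alg. 11.1), assuming (v1,w1) = 1.
% Returns V, W with W'*V = I and T = W'*A*V tridiagonal (in exact arithmetic).
n = size(A,1);
V = zeros(n,m); W = zeros(n,m); T = zeros(m,m);
V(:,1) = v1;  W(:,1) = w1;
vold = zeros(n,1); wold = zeros(n,1); beta = 0; delta = 0;
for j = 1:m
    alpha = W(:,j)'*(A*V(:,j));
    vhat = A*V(:,j)  - alpha*V(:,j) - beta*vold;
    what = A'*W(:,j) - alpha*W(:,j) - delta*wold;
    T(j,j) = alpha;
    s = vhat'*what;
    deltanew = sqrt(abs(s));
    if deltanew == 0, V = V(:,1:j); W = W(:,1:j); T = T(1:j,1:j); return; end
    betanew = s/deltanew;
    vold = V(:,j); wold = W(:,j);
    if j < m
        V(:,j+1) = vhat/deltanew;  W(:,j+1) = what/betanew;
        T(j+1,j) = deltanew;  T(j,j+1) = betanew;
    end
    beta = betanew; delta = deltanew;
end
end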

With this notation, (11.11) and (11.12) can be written as

A Vm = Vm Tm + δm+1 vm+1 em^T    (11.18)
A^T Wm = Wm Tm^T + βm+1 wm+1 em^T    (11.19)

with em = (0, 0, ..., 1)^T.

The algorithmic realization consists in implementing the pair of three-term recurrences (11.11), (11.12) and enforcing biorthogonality. This fixes the parameters αj, βj and δj in the following, recursive way:

Assume that, for j ≥ 1, v1, ..., vj and w1, ..., wj already satisfy biorthogonality. We wish to construct vj+1 and wj+1 according to (11.11), (11.12) in a way such that the extended sequences v1, ..., vj+1 and w1, ..., wj+1 are also biorthogonal. Let

v̂j+1 = A vj − αj vj − βj vj−1,
ŵj+1 = A^T wj − αj wj − δj wj−1

Consider the inner product (v̂j+1, wj) and enforce biorthogonality: Requiring

0 = (v̂j+1, wj) = (A vj, wj) − αj (vj, wj) − βj (vj−1, wj)

and using (vj, wj) = 1, (vj−1, wj) = 0 shows that αj has to be chosen as

αj = (A vj, wj)

Exactly the same choice for αj also leads to (ŵj+1, vj) = 0, as is easy to check. Furthermore, we rescale v̂j+1 and ŵj+1 such that the resulting vectors vj+1 = v̂j+1/δj+1 and wj+1 = ŵj+1/βj+1 satisfy (vj+1, wj+1) = 1. This fixes βj+1 and δj+1, but the choice is not unique – the only requirement is δj+1 βj+1 = (v̂j+1, ŵj+1).

Exercise 11.2 Show that the above construction automatically leads to a full biorthogonal sequence. In particular, we also have

0 = (vj+1, w1) = ... = (vj+1, wj−1),   0 = (wj+1, v1) = ... = (wj+1, vj−1)

Hint: Proof by induction, exploiting the structure of the three-term recurrences for the vj and wj.

The final outcome is Alg. 11.1. Here, βj+1 and δj+1 are chosen with |βj+1| = |δj+1|, such that (vj+1, wj+1) = 1, resulting in a pair of biorthonormal bases v1, ..., vm and w1, ..., wm of Km(A, v1) and Km(A^T, w1), respectively.


The main advantage over the Arnoldi procedure consists in the fact that only short recurrences are involved, which results in significant savings of computing costs and storage. The fact that A^T, or a corresponding evaluation procedure, is required may be a significant disadvantage in applications where this is not readily available or expensive to evaluate.

Note that a serious breakdown occurs if δj+1 as defined in line 7 of Alg. 11.1 vanishes, in particular if v̂j+1 ≠ 0 and ŵj+1 ≠ 0 but (v̂j+1, ŵj+1) vanishes or becomes very small. Such a breakdown is hardly predictable; in the literature, strategies have been developed where the freedom in the choice of the β's and δ's is exploited to avoid breakdowns, e.g., by look-ahead strategies. Still, this is a difficult topic; see [16] for some details.

Another way to derive the three-term recurrences for the vj and wj is the subject of the following exercise.

Exercise 11.3 Let v1, ..., vm and w1, ..., wm be biorthonormal bases of Km(A, v1) and Km(A^T, w1), respectively. Show: Seeking vm+1 ∈ Km+1(A, v1) and wm+1 ∈ Km+1(A^T, w1) in the form

vm+1 = A vm + Σ_{i=1}^m κi vi,   wm+1 = A^T wm + Σ_{i=1}^m λi wi

and requiring biorthogonality, the sums collapse to three-term recurrences:

vm+1 = A vm + κm vm + κm−1 vm−1,   wm+1 = A^T wm + λm wm + λm−1 wm−1

Summary of the properties of Lanczos biorthogonalization:

Theorem 11.4 Assume that Alg. 11.1 does not break down before step m. Then the vectors vj and wj, j = 1...m, are biorthogonal. Moreover, v1, ..., vm is a basis of Km(A, v1) and w1, ..., wm is a basis of Km(A^T, w1). The following identities hold true:

A Vm = Vm Tm + δm+1 vm+1 em^T    (11.20)
A^T Wm = Wm Tm^T + βm+1 wm+1 em^T    (11.21)
Wm^T A Vm = Tm    (11.22)

with Tm from (11.17).

Note that, in general, the bases v1, ..., vm and w1, ..., wm are not orthogonal.

Relations (11.20)–(11.22) can be interpreted as follows. The matrix Tm is a projected version of A corresponding to an oblique (nonorthogonal) projection onto Km(A, v1). More precisely: Consider the matrix

Vm Tm Wm^T = Vm Wm^T A Vm Wm^T ∈ R^{n×n}

By construction, Vm Wm^T is an oblique projector onto span{v1, ..., vm} = Km(A, v1) because, due to biorthogonality, Vm Wm^T Vm Wm^T = Vm Wm^T. An analogous interpretation can be given for Tm^T, which is a projected version of A^T (projection onto Km(A^T, w1)).

In the following section we will consider a Krylov subspace method, BiCG, which is derived from Alg. 11.1 in a similar way as D-Lanczos and GMRES were derived from the Lanczos and Arnoldi processes.


In Alg. 11.1, both mappings x ↦ Ax and x ↦ A^T x are involved, and similar operations are performed with them. This can be pursued further to see that, if two linear systems involving A and A^T have to be solved, methods based on Lanczos biorthogonalization appear to be natural. Otherwise, it may be favorable to avoid explicit evaluation of A^T; we will briefly discuss such techniques in Section 11.3.

11.2 BiCG

According to Section 10, case (iii), the approximation xm by a biorthogonalization method is defined via the Petrov-Galerkin condition

Find xm ∈ x0 + Km(A, r0) such that (b − Axm, w) = 0 ∀ w ∈ Lm = Km(A^T, w1)    (11.23)

We have:

Lemma 11.5 Assume that Alg. 11.1 does not break down before step m, and assume that Tm is invertible. Then, (11.23) has a unique solution of the form

xm = x0 + Vm um    (11.24)

where um is the solution of

Tm um = β e1,   β = ∥r0∥2    (11.25)

Proof: Exercise – essentially along the same lines as for D-Lanczos and GMRES.

Remark 11.6 The so-called QMR (Quasi-Minimal Residual) method is also based on Lanczos biorthogonalization. Here, in contrast to (11.25), um is determined by minimizing ∥β e1 − T̄m um∥2 with e1 = (1, 0, ..., 0)^T ∈ R^{m+1} and T̄m ∈ R^{(m+1)×m} from (11.13).

Remark 11.7 Consider the 'dual' linear system A^T x* = b*, where b* is a given vector. If x*0 is an initial guess and w1 = r*0 = b* − A^T x*0, then (under the assumptions of Lemma 11.5) the solution u*m of

Tm^T u*m = β* e1,   β* = ∥r*0∥2

leads to an approximation x*m = x*0 + Wm u*m which solves the Petrov-Galerkin problem (a dual version of (11.23))

Find x*m ∈ x*0 + Km(A^T, r*0) such that (b* − A^T x*m, v) = 0 ∀ v ∈ Km(A, v1)    (11.26)

This follows from the fact that, in the Lanczos biorthogonalization procedure, the matrices A and A^T occur pari passu. In other words: simply exchange the roles of A and A^T. Hence, by considering the Krylov spaces Km(A, r0) and Km(A^T, r*0), it is possible to solve Ax = b and A^T x* = b* simultaneously.

The procedure towards an algorithm now parallels that of our derivation of the D-Lanczos method (and the CG algorithm); we will be slightly more brief here. In order to simplify the notation, we assume that we wish to solve the systems Ax = b and A^T x* = b* simultaneously. We also assume that initial guesses x0 and x*0 are prescribed. These determine the initial residuals r0, r*0 and thus the vectors v1 and w1, which in turn determine the Krylov spaces Km(A, v1) and Km(A^T, w1).


We assume that the matrix Tm delivered by Lanczos biorthogonalization has an LU-decomposition Tm = Lm Um. Then the approximate solutions xm, x*m are given by

xm = x0 + Pm zm,   x*m = x*0 + P*m z*m

where Pm = (p0 | ... | pm−1) = Vm Um^{-1} and P*m = (p*0 | ... | p*m−1) = Wm Lm^{-T}, and

zm = Lm^{-1}(β e1),   z*m = Um^{-T}(β* e1)

Proceeding in exactly the same way as in our derivation of the D-Lanczos method, we conclude that the matrices Pm, P*m are obtained from Pm−1, P*m−1 by simply appending one column. Also, zm and z*m are obtained from zm−1, z*m−1 by adding one entry. As for the D-Lanczos method we therefore obtain

xm = xm−1 + ζm pm−1,   x*m = x*m−1 + ζ*m p*m−1

for suitable ζm, ζ*m. Furthermore, from Vm = Pm Um and Wm = P*m Lm^T we have

pm−1 ∈ span{pm−2, vm},   p*m−1 ∈ span{p*m−2, wm}

Now, instead of explicitly computing the update formulas for ζm, ζ*m, pm, p*m, we exploit (bi)orthogonality conditions in the same way as in the derivation of the CG algorithm from D-Lanczos:

(i) The search directions^53 pi, p*i form an A-biorthogonal system: This follows from

(P*m)^T A Pm = Lm^{-1} Wm^T A Vm Um^{-1} = Lm^{-1} Tm Um^{-1} = I

(ii) The residuals rm, r*m are multiples of vm+1 and wm+1, respectively. This follows as in the CG case: From the definition of xm and Theorem 11.4 we get

rm = b − Axm = b − A(x0 + Vm um) = r0 − A Vm um = β v1 − (Vm Tm + δm+1 vm+1 em^T) um
   = Vm(β e1 − Tm um) + δm+1 vm+1 (em^T um) = δm+1 vm+1 (em^T um)

An analogous reasoning shows that r*m is a multiple of wm+1. In particular, the residuals form a biorthogonal system.

We collect these findings:

xm+1 = xm + αm pm,   x*m+1 = x*m + α*m p*m

which implies

rm+1 = rm − αm A pm,   r*m+1 = r*m − α*m A^T p*m

and for the search directions:

pm+1 = rm+1 + βm pm,   p*m+1 = r*m+1 + β*m p*m

Using the above orthogonality conditions, we obtain^54

αm = α*m = (rm, r*m) / (A pm, p*m),   βm = β*m = (rm+1, r*m+1) / (rm, r*m)    (11.27)

This leads to Alg. 11.2. (Here we ignore the update for the x*m, which can be omitted if we are only interested in solving Ax = b.)

Remark 11.8 If a dual problem A^T x* = b* is also to be solved, then r*0 depends on the initial guess x*0, and the update for the x*m also needs to be explicitly carried out.

^53 These correspond to the di in our notation for CG.
^54 0 = (rm+1, r*m) implies (rm, r*m) = αm (A pm, r*m) = αm (A pm, p*m − β*m−1 p*m−1) = αm (A pm, p*m); an analogous computation shows α*m = αm. From 0 = (p*m, A pm+1) we obtain 0 = (p*m, A pm+1) = (p*m, A rm+1 + βm A pm) = (p*m, A rm+1) + βm (p*m, A pm) = (A^T p*m, rm+1) + βm (p*m, A pm) = (1/α*m)(r*m − r*m+1, rm+1) + βm (p*m, A pm) = −((A pm, p*m)/(rm, r*m)) (rm+1, r*m+1) + βm (p*m, A pm). This implies the formula for βm. An analogous computation shows β*m = βm.


Algorithm 11.2 BiCG

1: Compute p0 = r0 = b − Ax0
2: Choose r*0 such that (r0, r*0) ≠ 0, and set p*0 = r*0
3: for m = 0, 1, ... do
4:   αm = (rm, r*m)/(A pm, p*m)
5:   xm+1 = xm + αm pm
6:   rm+1 = rm − αm A pm
7:   r*m+1 = r*m − αm A^T p*m
8:   βm = (rm+1, r*m+1)/(rm, r*m)
9:   pm+1 = rm+1 + βm pm
10:  p*m+1 = r*m+1 + βm p*m
11: end for
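A direct Matlab transcription of Alg. 11.2 (a sketch without convergence tests or breakdown safeguards; the function name bicg_basic is ours, not to be confused with Matlab's built-in bicg):

function x = bicg_basic(A, b, x0, maxit)
% BiCG (Alg. 11.2): biorthogonal residuals, A-biorthogonal search directions.
x = x0;
r = b - A*x;  rs = r;          % common choice r*_0 = r_0
p = r;  ps = rs;
for m = 1:maxit
    Ap  = A*p;
    alpha = (r'*rs)/(Ap'*ps);
    x  = x + alpha*p;
    rnew  = r  - alpha*Ap;
    rsnew = rs - alpha*(A'*ps);
    beta = (rnew'*rsnew)/(r'*rs);
    p  = rnew  + beta*p;
    ps = rsnew + beta*ps;
    r = rnew;  rs = rsnew;
end
end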

11.3 A brief look at CGS and BiCGStab

In many applications the matrix A is not explicitly available, but merely a routine that provides the matrix-vector multiplication x ↦ Ax. In such a situation, the operation x ↦ A^T x is not necessarily available, and therefore the BiCG algorithm is not directly applicable. One of the reasons for developing the Conjugate Gradient Squared (CGS) and the BiConjugate Gradient Stabilized (BiCGStab) methods was to circumvent this difficulty. We will only try to illustrate the main ideas of the CGS method – we refer to [16] for the details of the algorithms. Working out all the details is a job involving the manipulation of Krylov recurrences.

Our starting point is the observation that the BiCG method produces residuals rm, r*m and search directions pm, p*m that are of the form

rm = φm(A) r0,   r*m = φm(A^T) r*0,   pm = πm(A) r0,   p*m = πm(A^T) r*0

where φm and πm are polynomials of degree m. With this notation, the αm, βm from (11.27) satisfy

αm = (φm(A) r0, φm(A^T) r*0) / (A πm(A) r0, πm(A^T) r*0) = (φm^2(A) r0, r*0) / (A πm^2(A) r0, r*0)

βm = (φm+1(A) r0, φm+1(A^T) r*0) / (φm(A) r0, φm(A^T) r*0) = (φm+1^2(A) r0, r*0) / (φm^2(A) r0, r*0)

We see: If we can derive a recursion for the vectors φm^2(A) r0 and πm^2(A) r0, then we can compute the parameters αm and βm in a 'transpose-free' way, i.e., without explicit reference to A^T. In the CGS approach one attempts to find approximations x̃m with residuals r̃m of the form

r̃m = φm^2(A) r0    (11.28)

We now show that recurrences for the iterates x̃m and the residuals r̃m can indeed be realized. This is achieved by appropriate manipulation of some formulas: From the recurrences for the residuals rm and the search directions pm in BiCG, we obtain

φm+1(t) = φm(t) − αm t πm(t),    (11.29)
πm+1(t) = φm+1(t) + βm πm(t)    (11.30)

Squaring yields

φm+1^2(t) = φm^2(t) − 2 αm t πm(t) φm(t) + αm^2 t^2 πm^2(t),    (11.31)
πm+1^2(t) = φm+1^2(t) + 2 βm φm+1(t) πm(t) + βm^2 πm^2(t)    (11.32)


From this we can obtain recurrences for the polynomials φm^2 and πm^2 by introducing the auxiliary function

ψm(t) = φm+1(t) πm(t)

since the remaining cross term φm(t) πm(t) can be computed from the functions φm^2(t), πm^2(t) and ψm−1(t) as

φm(t) πm(t) = φm(t) (φm(t) + βm−1 πm−1(t)) = φm^2(t) + βm−1 φm(t) πm−1(t) = φm^2(t) + βm−1 ψm−1(t)    (11.33)

Combining (11.29) with (11.31) and (11.33) yields a recurrence for the polynomials ψm:

ψm(t) = φm+1(t) πm(t) = (φm(t) − αm t πm(t)) πm(t) = −αm t πm^2(t) + φm(t) πm(t)
      = −αm t πm^2(t) + φm^2(t) + βm−1 φm(t) πm−1(t)

This also allows us to obtain a recurrence for the residuals r̃m from (11.28). In order to motivate the recurrence for the approximations x̃m, we make the ansatz x̃m+1 = x̃m + αm d̃m for some vector d̃m. This vector (search direction) has to satisfy

r̃m+1 = b − A x̃m+1 = b − A(x̃m + αm d̃m) = r̃m − αm A d̃m

Since we require r̃m = φm^2(A) r0, we obtain in view of (11.31)

−αm A d̃m = r̃m+1 − r̃m = φm+1^2(A) r0 − φm^2(A) r0 = A(−2 αm πm(A) φm(A) + αm^2 A πm^2(A)) r0

From this, we readily infer that the update d̃m is

d̃m = (2 πm(A) φm(A) − αm A πm^2(A)) r0    (11.34)

Collecting these findings, for the vectors

r̃m = φm^2(A) r0,   p̃m = πm^2(A) r0,   q̃m = φm+1(A) πm(A) r0

we obtain the following recursions (with x̃0 = x0, r̃0 = b − A x̃0, p̃0 = r̃0, and the convention β−1 = 0, q̃−1 = 0):

αm = (r̃m, r*0) / (A p̃m, r*0),
d̃m = 2 r̃m + 2 βm−1 q̃m−1 − αm A p̃m,
q̃m = r̃m + βm−1 q̃m−1 − αm A p̃m,
x̃m+1 = x̃m + αm d̃m,
r̃m+1 = r̃m − αm A d̃m,
βm = (r̃m+1, r*0) / (r̃m, r*0),
p̃m+1 = r̃m+1 + βm (2 q̃m + βm p̃m)
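As an illustration, these recursions translate into the following Matlab sketch (transpose-free; no breakdown safeguards; the function name cgs_basic is ours, not to be confused with Matlab's built-in cgs):

function x = cgs_basic(A, b, x0, maxit)
% CGS: transpose-free 'squared polynomial' variant of BiCG (sketch).
x = x0;
r = b - A*x;  rs0 = r;          % shadow residual r*_0 (common choice: r_0)
p = r;  qold = zeros(size(r));  betaold = 0;
for m = 1:maxit
    Ap = A*p;
    alpha = (r'*rs0)/(Ap'*rs0);
    d = 2*r + 2*betaold*qold - alpha*Ap;
    q = r + betaold*qold - alpha*Ap;
    x = x + alpha*d;
    rnew = r - alpha*(A*d);
    beta = (rnew'*rs0)/(r'*rs0);
    p = rnew + beta*(2*q + beta*p);
    r = rnew;  qold = q;  betaold = beta;
end
end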

Remark 11.9 We refer to [16] for a slightly different formulation of the CGS algorithm. From the above derivation of CGS it is not clear that it is indeed a Krylov subspace method. In the survey article [6] a more detailed discussion of the fact that CGS can be viewed as a Krylov subspace method is given. However, here the inner product with respect to which orthogonality is required is problem dependent.

It has been observed in numerical examples that the convergence of CGS is often 'erratic' and the residuals can become very large. In order to remedy this, the BiCGStab variant was introduced, which usually leads to a 'smoother' convergence behavior. We refer to [16] for a more detailed description of the algorithm.

Many other variants have been developed, trying to combine desired virtues such as non-erratic convergence behavior, look-ahead strategies to avoid breakdowns, and transpose-free formulation. This is still an active research area; 'the universal method' which solves all problems with best efficiency does not exist. Particular classes of problems are often better understood (the SPD case is a typical example).


Figure 11.24: Convergence history for Example 11.10. [Plot: relative residual vs. iteration number, loglog scale; curves for GMRES and BiCG; title: 'Example: "enfant terrible" (n = 10)'.]

This leads us to the next topic, preconditioning. Here one tries to combine a standard technique like CG or GMRES with a preprocessing step, aiming at solving an equivalent but 'less harmful' problem at some extra cost per step. The most successful preconditioning techniques are tailored to particular problem structures.

Example 11.10 In some sense, the matrix A from Exercise 10.16 may be called an enfant terrible from the point of view of Krylov subspace methods. Consider the case n = 10 with upper diagonal 1, i.e., a Jordan block of dimension 10:

A = [ 1  1
         1  1
            ⋱   ⋱
               1  1
                  1 ] ∈ R^{10×10}

with ones on the diagonal and the first superdiagonal. While direct elimination is trivial, all unpreconditioned methods available in Matlab are not able to find a reasonable solution before n steps. Figure 11.24 shows the convergence history for GMRES and BiCG; the only difference is that BiCG finds, up to round-off, the exact solution 'already' after n − 1 steps. For CGS and BiCGStab, a very similar behavior is observed.

Trying to give a heuristic interpretation of this effect, one may say that all these methods try to minimize the norm of some polynomial pm(A) over the spectrum of A. For a highly nonnormal matrix like this one, the location of the spectrum by no means contains complete information about the behavior of A. For all these methods, significant nonnormality is a critical issue which can, in general, only be coped with by appropriate preconditioning techniques.


12 Preconditioning

12.1 General remarks; preconditioned GMRES

Preconditioning transforms the original linear system Ax = b into an equivalent one which is (hopefully) easier to solve by an iterative technique. A good preconditioner M is an approximation for A which can be efficiently inverted, chosen in a way that using M^{-1}A or AM^{-1} instead of A leads to a better convergence behavior.

In fact, GMRES or CG are rarely used directly. In practice, they are almost always applied in combination with some preconditioner. Historically, the full acceptance of the CG method started around 1970 (20 years after its invention), when the first practicable preconditioning techniques were designed.

Note that, whereas GMRES or CG are 'general purpose' techniques (GMRES for general matrices, CG for SPD matrices), the preconditioner typically incorporates information about the specific problem under consideration. Thus, the general purpose solver serves as a 'template' which is adapted to a particular class of problems via preconditioning.

There are three general types of preconditioning.

– Left preconditioning by a matrix ML:

ML^{-1} A x = ML^{-1} b    (12.1)

Since ML^{-1} ≈ A^{-1}, the preconditioned residual r̂ = ML^{-1}(b − Ax) = ML^{-1} r can be interpreted as an approximation for the 'exact correction' A^{-1}(b − Ax) = x* − x = −e, the (negative) error of the current iterate x.

– Right preconditioning by a matrix MR:

A MR^{-1} u = b,   x = MR^{-1} u    (12.2)

This involves a substitution u for the original variable x.

– Split (two-sided) preconditioning:

ML^{-1} A MR^{-1} u = ML^{-1} b,   x = MR^{-1} u    (12.3)

Split preconditioning encompasses both the left and the right variants by setting MR = I or ML = I, respectively.

Naturally, an important feature of the preconditioning matrices will be that they are easily inverted in the sense that Mx = u can be solved relatively cheaply.^55 In the algorithms below, an evaluation M^{-1}u always refers to the solution of Mv = u.

The split preconditioned GMRES method is formulated in Alg. 12.1. It can be understood as simply applying the classical GMRES Alg. 10.1 to

ML^{-1} A MR^{-1} u* = ML^{-1} b    (12.4)

with a starting vector u0. Using the relations x* = MR^{-1} u*, x0 = MR^{-1} u0, and xm = MR^{-1}(u0 + Vm um) = x0 + MR^{-1} Vm um, one can write the GMRES iteration such that the auxiliary variables ui do not appear explicitly but only the original iterates xi, which are the ones of interest.

^55 Recall the 'rules': Iterative methods only use the matrix-vector product x ↦ Ax; a preconditioner M is typically only given via the 'action' x ↦ M^{-1}x; in fact, the matrix M is often not explicitly available.


Algorithm 12.1 Preconditioned GMRES

1: Compute r0 = ML^{-1}(b − Ax0), β = ∥r0∥2, and v1 = r0/β
2: Define the (m+1)×m matrix H̄m and set elements hij to zero
3: for j = 1, 2, ..., m do
4:   Compute wj = ML^{-1} A MR^{-1} vj
5:   for i = 1, ..., j do
6:     hij = (wj, vi)
7:     wj = wj − hij vi
8:   end for
9:   hj+1,j = ∥wj∥2
10:  if hj+1,j = 0 set m = j and goto 13   % lucky breakdown
11:  vj+1 = wj/hj+1,j
12: end for
13: Compute um as the minimizer of ∥β e1 − H̄m u∥2 and set xm = x0 + MR^{-1} Vm um

Remark 12.1 The left preconditioned GMRES minimizes the residual norm ∥ML^{-1}(b − Axm)∥2 over a suitable Krylov subspace. Right preconditioning, on the other hand, minimizes the original residual ∥b − Axm∥2.

A natural question is whether left, right, or even split preconditioning is to be preferred. In many cases, the convergence behavior is not significantly different. This is not completely unexpected in view of the fact that the spectra of AM^{-1} (corresponding to right preconditioning) and M^{-1}A (corresponding to left preconditioning) coincide.

Let M be a preconditioner for A. A standard version of split preconditioning is obtained by means of the LU-decomposition M = LU, choosing ML = L and MR = U. Many preconditioners used in application problems are constructed in this form; often, M is not directly computed but L and U are chosen appropriately, with LU ≈ A.^56

12.2 PCG: Preconditioned CG

We will discuss left preconditioning of the CG method, i.e., we consider solving

M^{-1} A x = M^{-1} b    (12.5)

where A is SPD. It will be natural and convenient to choose the preconditioner M to be SPD as well.

Note that the CG algorithm in its original form (Alg. 8.1) is not directly applicable to the new system (12.5), because the matrix M^{-1}A is not SPD in general. However, it is SPD with respect to a new inner product on R^n:

Exercise 12.2 Let A, M ∈ R^{n×n} be SPD. Define the M-inner product (·,·)M by (x, y)M = (Mx, y) = x^T M y. Show:

(i) M^{-1}A is symmetric with respect to the inner product (·,·)M, i.e., (M^{-1}Ax, y)M = (x, M^{-1}Ay)M for all x, y ∈ R^n.

(ii) M^{-1}A is positive definite with respect to (·,·)M, i.e., (M^{-1}Ax, x)M > 0 for all 0 ≠ x ∈ R^n.

^56 Note that this is a 'consistent' approach, since for A = LU we have L^{-1} A U^{-1} = I.


Hence we may apply the CG algorithm 8.1 to (12.5), replacing the standard inner product (·,·) by the new inner product (·,·)M. This leads to:

1: r̂0 = M^{-1}(b − Ax0), d0 = r̂0
2: for k = 0, 1, ... until convergence do
3:   αk = (r̂k, r̂k)M / (M^{-1}A dk, dk)M
4:   xk+1 = xk + αk dk
5:   r̂k+1 = r̂k − αk M^{-1}A dk
6:   βk = (r̂k+1, r̂k+1)M / (r̂k, r̂k)M
7:   dk+1 = r̂k+1 + βk dk
8: end for

The difficulty with this algorithm is that the evaluation of (·,·)M ostensibly requires a matrix-vector multiplication x ↦ Mx. However, the 'rule' of our preconditioning strategies is that only the matrix-vector multiplications x ↦ Ax and x ↦ M^{-1}x are available. It is possible to rewrite this algorithm to avoid matrix-vector multiplications x ↦ Mx by re-introducing the original residual rk = M r̂k. We note that r0 = b − Ax0 is explicitly available. Updating r̂k along with the vectors xk, rk, and dk then leads to the preconditioned CG algorithm formulated in Alg. 12.2.

Algorithm 12.2 Left-preconditioned Conjugate Gradient method

1: Compute r0 = b − Ax0, r̂0 = M^{-1} r0, d0 = r̂0
2: for k = 0, 1, ... until convergence do
3:   αk = (rk, r̂k)/(A dk, dk)
4:   xk+1 = xk + αk dk
5:   rk+1 = rk − αk A dk
6:   r̂k+1 = M^{-1} rk+1
7:   βk = (rk+1, r̂k+1)/(rk, r̂k)
8:   dk+1 = r̂k+1 + βk dk
9: end for
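For concreteness, a Matlab sketch of Alg. 12.2 (the preconditioner is passed as a function handle Minv realizing r ↦ M^{-1}r; the function name pcg_left is ours, not Matlab's built-in pcg):

function x = pcg_left(A, b, x0, Minv, maxit)
% Left-preconditioned CG (Alg. 12.2); Minv(r) returns M\r.
x = x0;
r = b - A*x;
rhat = Minv(r);
d = rhat;
for k = 1:maxit
    Ad = A*d;
    alpha = (r'*rhat)/(d'*Ad);
    x = x + alpha*d;
    rnew = r - alpha*Ad;
    rhatnew = Minv(rnew);
    beta = (rnew'*rhatnew)/(r'*rhat);
    d = rhatnew + beta*d;
    r = rnew;  rhat = rhatnew;
end
end

A typical call with the Jacobi preconditioner of Section 12.5 would be, e.g., x = pcg_left(A, b, zeros(n,1), @(r) r./diag(A), 100).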

Exercise 12.3 Consider the right-preconditioned system A M^{-1} u = b, where u = Mx. Show that the matrix A M^{-1} is symmetric positive definite with respect to the inner product (·,·)_{M^{-1}} defined by (x, y)_{M^{-1}} = (M^{-1}x, y). Formulate the (right-)preconditioned CG algorithm. Formulate it in such a way that the auxiliary variable u and the iterates uj = M xj do not explicitly appear in the algorithm, but only the original iterates xj = M^{-1} uj.

Exercise 12.4 Assume that the Cholesky factorization M = L L^T of an SPD preconditioner M is available. Consider the split preconditioned system with ML = L and MR = L^T, i.e., applying the CG algorithm to L^{-1} A L^{-T} u = L^{-1} b.

Show: The iterates xm = L^{-T} um coincide with those of the left-preconditioned CG method, i.e., Alg. 12.2 with preconditioner M. Furthermore, show that the iterates also coincide with the iterates obtained from right-preconditioned CG as developed in Exercise 12.3.

Note that the same argument holds for an SPD preconditioner M implicitly specified by any regular matrix C with M = C C^T.


Figure 12.25: Convergence history for Example 12.5. [Plot: relative residual vs. iteration count, loglog scale; curves for GMRES and GMRES with preconditioning; title: 'Example: "gallery(wilk)" (n = 21)'.]

12.3 Preconditioning in Matlab

The Matlab implementations, in particular pcg and gmres, support left preconditioning only; see Sections 8.6 and 10.3. In both cases, the preconditioner M = ML may be specified directly as a matrix or via two matrices M1 and M2 such that ML = M1 M2 (e.g., in form of an LU-decomposition for easy inversion). In both cases, applying the preconditioner means solving systems of the form My = r, see Alg. 12.1 and 12.2. Similarly as for A itself (function handle AFUN), the backsolve function r ↦ M^{-1}r may be specified in form of a function handle^57 MFUN.

Example 12.5 This is just for trying out the usage: We run the example from doc gmres with and without preconditioning; see Fig. 12.25. In this example, A is symmetric tridiagonal and 'not far from diagonally dominant'; M = diag(A) (with the zero diagonal entry replaced by 1) is therefore a 'reasonable' preconditioner.

n = 21;

A = gallery(’wilk’,n);

b = sum(A,2);

tol = 1e-12;

maxit = n;

[x1,flag1,relres1,iter1,resvec1] = gmres(A,b,n,tol,maxit);

M = diag([10:-1:1 1 1:10]);

[x2,flag2,relres2,iter2,resvec2] = gmres(A,b,n,tol,maxit,M);

%

% Or, use this matrix-vector product function

%

% function y = afun(x,n)

% y = [0; x(1:n-1)] + [((n-1)/2:-1:0)’; (1:(n-1)/2)’].*x+[x(2:n); 0];

%

% and this preconditioner backsolve function

%

^57 While AFUN corresponds to x ↦ Ax, MFUN corresponds to x ↦ M^{-1}x.


% function y = mfun(r,n)

% y = r ./ [((n-1)/2:-1:1)’; 1; (1:(n-1)/2)’];

%

% as inputs to GMRES:

%

[x2,flag2,relres2,iter2,resvec2] = gmres(@(x)afun(x,n),b,n,tol,maxit,@(x)mfun(x,n));

%

loglog([1:length(resvec1)+1],[resvec1./norm(b,2);norm(b-A*x1,2)/norm(b,2)],’--’,...

[1:length(resvec2)+1],[resvec2./norm(b,2);norm(b-A*x2,2)/norm(b,2)],’-o’,...

’LineWidth’,2);

set(gca,’FontSize’,16)

legend(’GMRES’,’GMRES with PC’,’Location’,’SouthEast’)

xlabel(’iteration count’)

ylabel(’relative residual’)

title(’Example: "gallery(wilk)" (n = 21)’)

set(gca,’XTick’,[1 10 100 1000 10^4 10^5])

12.4 Convergence behavior of PCG

We consider the PCG algorithm with left preconditioning described in Section 12.2. The convergence analysis follows along similar lines as in Section 8.5.

Essentially, the PCG algorithm coincides with the standard CG algorithm; only the 'standard' inner product (·,·) = (·,·)2 is replaced by the M-inner product (·,·)M. Recalling that the CG algorithm is just a Krylov subspace method for solving M^{-1}Ax = M^{-1}b, we conclude that the iterates xm are uniquely characterized by Galerkin orthogonality in the form

(M^{-1}(b − Axm), v)M = 0 ∀ v ∈ Km    (12.6)

where the Krylov space Km = Km(M^{-1}A, r̂0) is given by

Km = span{r̂0, ..., (M^{-1}A)^{m−1} r̂0},   r̂0 = M^{-1} r0 = M^{-1}(b − Ax0)

Evidently, the Galerkin orthogonality condition (12.6) is equivalent to the original Galerkin condition without preconditioning,

(b − Axm, v) = 0 ∀ v ∈ Km    (12.7)

Thus, we again have

∥xm − x*∥A = inf_{x ∈ x0+Km} ∥x − x*∥A

Exactly as in the case of the standard CG method we obtain, using Km = span{r̂0, ..., (M^{-1}A)^{m−1} r̂0}, that any x ∈ x0 + Km can be written in the form

x = x0 + pm−1(M^{-1}A) r̂0

for some pm−1 ∈ Pm−1. Thus, with e0 = x0 − x*,

x − x* = e0 + x − x0 = e0 + pm−1(M^{-1}A) r̂0 = e0 + pm−1(M^{-1}A) M^{-1} r0
       = e0 − pm−1(M^{-1}A) M^{-1}A e0 = (I − pm−1(M^{-1}A) M^{-1}A) e0

Thus, the above optimality condition gives

∥em∥A = ∥xm − x*∥A = min_{q ∈ Pm, q(0)=1} ∥q(M^{-1}A) e0∥A    (12.8)

where the minimum is attained by some optimal polynomial q = qm.


Due to the fact that A and M are SPD, the 'preconditioned spectrum'

σ(M^{-1}A) = σ(M^{-1/2} A M^{-1/2})

is positive. Now we express (12.8) more explicitly in terms of the spectrum σ(M^{-1}A). To this end we write

q(M^{-1}A) = M^{-1/2} q(M^{-1/2} A M^{-1/2}) M^{1/2}

Let vi, i = 1...n, be an orthonormal eigenbasis of the SPD matrix M^{-1/2} A M^{-1/2} with corresponding eigenvalues λi. We expand M^{1/2} e0 in this basis, M^{1/2} e0 = Σ_{i=1}^n κi vi. This gives

∥q(M^{-1}A) e0∥_A^2 = (A M^{-1/2} q(M^{-1/2} A M^{-1/2}) M^{1/2} e0, M^{-1/2} q(M^{-1/2} A M^{-1/2}) M^{1/2} e0)
                   = (M^{-1/2} A M^{-1/2} q(M^{-1/2} A M^{-1/2}) M^{1/2} e0, q(M^{-1/2} A M^{-1/2}) M^{1/2} e0)
                   = (Σ_{i=1}^n λi q(λi) κi vi, Σ_{i=1}^n q(λi) κi vi) = Σ_{i=1}^n λi |q(λi)|^2 |κi|^2

Together with

∥e0∥_A^2 = (A e0, e0) = (M^{-1/2} A M^{-1/2} M^{1/2} e0, M^{1/2} e0) = (Σ_{i=1}^n λi κi vi, Σ_{i=1}^n κi vi) = Σ_{i=1}^n λi |κi|^2

we arrive at

∥em∥A = min_{q ∈ Pm, q(0)=1} ∥q(M^{-1}A) e0∥A ≤ min_{q ∈ Pm, q(0)=1} max_{i=1...n} |q(λi)| ∥e0∥A    (12.9)

where the λi > 0 are the eigenvalues of M^{-1/2} A M^{-1/2} and also of M^{-1}A.

The outcome is the same as in Thm. 8.2 for the CG method, with A replaced by M^{-1}A. Thus, (12.9) shows:

The convergence of PCG depends on the spectrum of the preconditioned matrix M^{-1}A.

As in the case of the CG method, an overall (worst case) bound involves the ratio of the largest and the smallest eigenvalue of M^{-1}A. We formulate this as an exercise.

Exercise 12.6

(i) Show: The speed of convergence of the PCG iteration can be estimated by

∥em∥A ≤ 2 ((√κ − 1)/(√κ + 1))^m ∥e0∥A,   κ = κσ(M^{-1}A) = λmax/λmin

where λmax and λmin are the largest and smallest eigenvalues of M^{-1}A, and κσ(M^{-1}A) = λmax/λmin is the so-called spectral condition number of M^{-1}A (in general, this is not identical with κ2(M^{-1}A)).

(ii) Show that a characterization of λmin and λmax which may be easier to check is the following: λmin is the largest and λmax is the smallest number such that

λmin (Mx, x) ≤ (Ax, x) ∀ x ∈ R^n,
(Ax, x) ≤ λmax (Mx, x) ∀ x ∈ R^n

(iii) Show: If a, b > 0 can be found such that

a (Mx, x) ≤ (Ax, x) ∀ x ∈ R^n,
(Ax, x) ≤ b (Mx, x) ∀ x ∈ R^n

then a ≤ λmin, λmax ≤ b, and thus κσ(M^{-1}A) ≤ b/a.


12.5 Preconditioning techniques in general

As mentioned before, GMRES and CG are fairly general solution techniques. In actual practice, the preconditioner is chosen in dependence on properties of the matrix A and is the key component of the iterative solution method. A few general comments concerning the choice/design of a preconditioner are:

• M^{-1}A should be 'close' to the identity matrix or at least have a spectrum that is clustered.

• The operation x ↦ M^{-1}x should be cheap/easy to perform. Note that this condition depends also on the computer architecture employed. As we will see below, a Gauss-Seidel preconditioner may be better than a Jacobi preconditioner (in terms of iteration count), but a Jacobi preconditioner requires less communication and is often more natural and efficient to realize on a parallel computer.

A good preconditioner M typically depends on the problem under consideration. In this section, however, we will discuss a few preconditioning techniques that could be applied in many circumstances. We point out that these techniques are fairly general and may not always be very effective. However, since they are easy to use they are often worth an initial investigation. A more detailed discussion of these techniques can be found in [16].

Classical iterative schemes as preconditioners.

Every linear stationary iteration induces a preconditioner. To see this, we recall the second normal formG = I − NA where G is the iteration matrix of the scheme. 58 Convergent linear iterations satisfyρ(G) < 1 , i.e., G is ‘small’; in other words: NA is ‘close to’ I because N plays the role of an approximateinverse. Hence, the matrix W = N−1 (from the third normal form) may be used as a preconditioner – itapproximates A in some sense.

For the classical methods (Jacobi, Gauss-Seidel, SOR, SSOR) such a preconditionerW = N−1 can be readoff directly from the identity G = I − NA ; the formulas given in Section 5.2, p. 30. For example, withA = L+D + U ,

Jacobi preconditioner:                 MJ = D
Gauss-Seidel preconditioner:           MGS = D + L
Symm. Gauss-Seidel preconditioner:     MSGS = (D + L) D−1 (D + U)

Jacobi preconditioning may be interpreted as a 'diagonal rescaling' of the original matrix A as D−1A, resulting in a matrix with unit diagonal. Note that, if A already has unit diagonal, the Jacobi preconditioner has no effect, because, as is easy to see, κσ(D−1A) = κ2(A) in this case.

We stress that the action of these preconditioners can be easily realized since the matrices (D + L) and (D + U) have triangular structure. We also note that MSGS is symmetric if the underlying matrix A is. This allows us to employ it as a preconditioner for the CG algorithm.
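To illustrate how such a preconditioner is applied in practice, the following Matlab sketch realizes the action of MSGS by two triangular solves and passes it to pcg as a function handle; the test matrix, tolerance and all variable names are our own choices, not part of the lecture notes.

    % Sketch: symmetric Gauss-Seidel preconditioning for pcg
    A  = gallery('poisson', 32);            % SPD test matrix
    b  = ones(size(A,1), 1);
    D  = diag(diag(A));                     % diagonal part
    Lo = tril(A, -1);                       % strict lower part L
    Up = triu(A,  1);                       % strict upper part U
    % z = MSGS^(-1) r = (D+U)^(-1) * D * (D+L)^(-1) r  via two triangular solves
    applyMSGS = @(r) (D + Up) \ (D * ((D + Lo) \ r));
    [x, flag, relres, iter] = pcg(A, b, 1e-8, 500, applyMSGS);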

Remark 12.7 It is important to notice that for any linear stationary iteration scheme with iteration matrix G, the convergence criterion ρ(G) < 1 is rather 'strict'. For a preconditioned Richardson iteration, the preconditioner M = W = N−1 has to satisfy ρ(I − M−1A) < 1. Now assume ∥·∥ is a matrix norm for which also ∥I − M−1A∥ < 1 holds (such a norm always exists). Then, by the triangle inequality, 59

∥M−1A∥ ≤ 1 + ∥I − M−1A∥ ≤ 2

58 Notation G for the iteration matrix: different from Section 5 (where it was denoted by M); in the present context, M always denotes a preconditioner.

59 Note that ρ(·) itself is not a matrix norm; in particular, the triangle inequality does not hold in general.


from which one may argue that such an M must already be a very good preconditioner: the requirement ρ(I − M−1A) < 1 forces M−1A to be close to the identity. For a Krylov subspace method, a much weaker property suffices. In particular, something like

∥M−1A∥ ≤ C, ∥A−1M∥ ≤ C, C a moderate-sized constant

is sufficient to ensure a very satisfactory convergence behavior of a Krylov subspace method. Cf. also the related Exercise 12.6 for the case of PCG, where spectra rather than norms are involved.

This reasoning shows that the choice of an appropriate preconditioner for GMRES or CG is in some sense easier than searching for one that leads to a convergent linear iteration scheme. This is why the combination of Krylov methods with preconditioning has become so popular.

In practice, the linear system Ax = b to be solved is a member of a family of systems with increasing dimension n (with n → ∞ for some discretization parameter h → 0). In this case, a 'good' (indeed, a very good) preconditioner may be characterized by the property

∥M−1A∥ ≤ C, ∥A−1M∥ ≤ C,  C a moderate-sized constant independent of n,

to be achieved, of course, at moderate computational cost for the preconditioning step.

Incomplete factorizations: ILU and its variants.

A very popular class of preconditioners are incomplete factorizations. This type of preconditioning is also one of the historically earliest examples (around 1970) which made the PCG method become widely accepted. In Section 3 on direct methods we have seen that computing the LU-decomposition of a sparse matrix A may result in considerable fill-in. In view of the fact that a preconditioner just needs to be an approximation to A−1, one may consider computing an approximate factorization L U ≈ A, where L and U should also be sparse. Choosing M = L U leads to an efficient evaluation of x ↦ M−1x, since y = L−1x and M−1x = U−1y can easily be realized by forward and backward substitution.

Matlab has built-in functions to compute these incomplete factors as well as a few other variants, e.g., MILU (see the book [16] for further details). See: help ilu, help ichol.
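A minimal usage sketch of these built-ins (our own test matrix and solver parameters; GMRES is applied to the same SPD matrix here only for illustration):

    A = gallery('poisson', 32);  b = ones(size(A,1), 1);
    % SPD case: incomplete Cholesky (zero fill) combined with PCG
    L = ichol(A);                                   % M = L*L' approximates A
    [x1, fl1, rr1, it1] = pcg(A, b, 1e-8, 500, L, L');
    % general case: ILU (default setup: no fill-in) combined with GMRES
    [Lf, Uf] = ilu(A);
    [x2, fl2, rr2, it2] = gmres(A, b, [], 1e-8, 200, Lf, Uf);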

ILU(0).

Let NZ(A) denote the set of index pairs (i, j) with Ai,j ≠ 0, the 'sparsity pattern' of A. In the ILU(0) technique, the factors L and U are required to satisfy:

(i) L and U have the same sparsity pattern as A, i.e., NZ(L + U) = NZ(A)   (12.10a)

(ii) the non-zero entries of L and U are such that (L U)i,j = Ai,j for all index pairs (i, j) ∈ NZ(A)   (12.10b)

Algorithmically, this can be realized by modifying the standard LU-decomposition algorithm in such a way that only the non-zero entries of the factors L and U are computed; the remaining fill-in is simply ignored. This is executed by Alg. 12.3. The algorithm overwrites the matrix A with the factors L and U (as usual, the diagonal contains the diagonal of U since the diagonal of L has fixed entries 1).


Algorithm 12.3 ILU(0) – KIJ-variant

% overwrites A with (approx.) LU -decomposition; entries Lii = 1 not stored

1: for k = 1, 2, . . . , n−1 do
2:   for i = k+1 . . . n and (i, k) ∈ NZ(A) do
3:     aik = aik/akk
4:     for j = k+1 . . . n and (i, j) ∈ NZ(A) do
5:       aij = aij − aik akj
6:     end for
7:   end for
8: end for
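A direct Matlab transcription of Alg. 12.3 might look as follows (a sketch for illustration only; it uses plain indexing and does not exploit sparse storage):

    function A = ilu0_kij(A)
    % ILU(0), KIJ variant (cf. Alg. 12.3): overwrites A with the incomplete
    % factors; strict lower part = L (unit diagonal not stored), upper part = U.
    NZ = (A ~= 0);              % sparsity pattern NZ(A) of the original matrix
    n  = size(A,1);
    for k = 1:n-1
        for i = k+1:n
            if NZ(i,k)
                A(i,k) = A(i,k) / A(k,k);
                for j = k+1:n
                    if NZ(i,j)
                        A(i,j) = A(i,j) - A(i,k)*A(k,j);  % update only within NZ(A)
                    end
                end
            end
        end
    end
    end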

Comments concerning existence, uniqueness, and implementation of ILU(0):

• There is no guarantee that an incomplete factorization can be computed. For certain classes of matrices such as M-matrices it is known that an incomplete LU-decomposition exists, see [16, Theorem 10.1]. A matrix A is called an M-matrix if

  – Ai,i > 0 ∀ i
  – Ai,j ≤ 0 ∀ i, j with i ≠ j
  – A is invertible, with (A−1)i,j ≥ 0 ∀ i, j

In particular, the matrices A from the Poisson example 2.2 are M-matrices.

• The conditions (12.10) may not determine L and U uniquely – it is the concrete algorithmic realization which determines them.

• By construction, we have A = L U − R, where the 'residual matrix' R consists of the elements which are dropped in the course of the incomplete elimination process.

• In practice, instead of Alg. 12.3 a different variant of Gaussian elimination is employed. For sparse matrices A which are stored row by row (e.g., in the CSR-format) it is better to rearrange the three loops (over i, j, and k) of Alg. 12.3 so as to operate on whole rows of A. This leads to the so-called IKJ-variant of (incomplete) LU-decomposition.

Remark 12.8 ILU(0) employs the same sparsity pattern as the matrix A. One could employ other sparsity patterns for the factors L and U. One way of choosing them is realized in the ILU(p) methods (see [16]), where p ∈ N0 is a measure for the amount of fill-in we wish to accept.

Thresholding: ILUT.

A drawback of the ILU(0) strategy is that it ignores the actual size of the entries of the exact factors L and U. Thresholding tries to incorporate this information into the method. In this approach, however, the sparsity pattern of the factors L and U is determined on the fly during the factorization process, since the magnitude of the entries of L and U is not known from the beginning.

An example of a thresholding strategy is ILUT(p, τ) in Alg. 12.4, based on an IKJ-variant of ILU. Possible choices for the dropping rules in Alg. 12.4 are:


1. Line 5: wk is dropped (i.e., set to zero) if |wk| ≤ τ ∥Ai,∗∥2, where A is the original matrix and i is the row currently being processed.

2. Line 8: drop all entries of w with small value (e.g., |wk| ≤ τ ∥Ai,∗∥2). Then, keep only the p largest entries of w([1 : i−1]) for the L-part; in addition to the entry wi (which represents Ui,i and is always kept), keep the p largest entries of w([i+1 : n]) for the U-part.

This dropping strategy involves the parameters τ and p. The purpose of τ is to (hopefully) keep only the large entries of L and U; dropping small entries allows one to save computational time. The purpose of the parameter p is to keep the memory requirement under control.

Remark 12.9 Alg. 12.4 is based on the IKJ-variant of LU-decomposition, i.e., it is suitable for row-contiguous data formats for A. For simplicity of exposition, Alg. 12.4 is formulated as if A were a full matrix. The sparsity of A is, of course, exploited in practice. The functions for incomplete factorization in Matlab can only be applied to matrices in sparse data format.

Algorithm 12.4 ILUT(p, τ) – IKJ-variant

% overwrites A with (approx.) LU-decomposition; entries Lii = 1 not stored
% algorithm is formulated ignoring sparse storage formats!

1: for i = 1, 2, . . . do
2:   w = ai,∗  % grab a whole row
3:   for k = 1 . . . i−1 and wk ≠ 0 do
4:     wk = wk/akk
5:     apply a dropping rule to wk
6:     if wk ≠ 0 then w([k+1 : n]) = w([k+1 : n]) − wk · a(k, [k+1 : n])
7:   end for
8:   apply a dropping rule to w
9:   a(i, [1 : i−1]) = w([1 : i−1])  % i.e., L(i, [1 : i−1]) = w([1 : i−1])
10:  a(i, [i : n]) = w([i : n])  % i.e., U(i, [i : n]) = w([i : n])
11: end for

Incomplete Cholesky (ICC).

For SPD matrices A, it is common to use the Cholesky factorization LLT = A instead of the LU-decomposition. This sparked the development of incomplete Cholesky factorization techniques analogous to ILU and ILUT.

For classical Cholesky applied to an SPD matrix A, all occurring square roots are well-defined positive numbers, so that it is always guaranteed that the diagonal of L is positive. If we use a version of incomplete Cholesky factorization for preconditioning in PCG, we have to ensure that M = L LT is positive definite. For this purpose it may be necessary to 'strengthen' the diagonal entries, see Exercise 12.10. (This is not supported by the Matlab function ichol.)

Exercise 12.10 Formulate the Incomplete Cholesky Algorithm ICC(0) analogous to ILU(0), via adaptation of the classical Cholesky algorithm (see [2]). Modify it in such a way that, in case of failure (square root of a negative number!), the incomplete Cholesky factorization of a modified matrix Ã is computed instead, with Ã = A + D, D a positive diagonal matrix. (Implementation in Matlab.)


Approximate inverses.

For notational convenience and variety of presentation, we here consider the generation of right preconditioners.

Starting from the observation that the (right) preconditioner M should satisfy AM−1 ≈ I, approximate inverse preconditioners aim directly at a matrix M−1 =: X (hopefully invertible) such that

∥AX − I∥

is small in some norm. The matrix-vector multiplication x ↦ Xx is then taken as the action of the preconditioner. For this to be viable, the matrix X needs to be sparse. One possible way of generating X is to prescribe its sparsity pattern and then try to construct X as the minimizer of

min { ∥AX − I∥F : X ∈ Rn×n with prescribed sparsity pattern }   (12.11)

where the Frobenius matrix norm is given by ∥X∥F^2 = ∑_{i,j=1}^{n} X_{i,j}^2. (An alternative to prescribing the sparsity pattern of X would again be to employ some dropping strategies, e.g., to keep only the p largest entries of each column of X.)

The choice of the Frobenius norm is, of course, somewhat arbitrary, but is convenient since minimizing the objective functional

φ(X) := ∥AX − I∥F^2

is equivalent to minimizing the ∥·∥2-norm of the columns of AX − I, seen as one 'long vector'. The corresponding inner product is

(X, Y)F = ∑_{i,j=1}^{n} X_{i,j} Y_{i,j} = trace(Y^T X),   ∥X∥F^2 = (X, X)F

Thus, (12.11) is an unconstrained minimization problem, namely simply a standard least squares problem associated with the over-determined system AX = I of n^2 equations in m unknowns, where m is the number of prescribed non-zero entries for the solution X.

The only formal difference to a standard least squares problem consists in the fact that a matrix X is looked for. Instead of rewriting this in terms of 'long vectors' and proceeding in a standard way, it is more natural here to stick to the matrix formulation. At the desired solution X, the gradient of φ must vanish. To represent the gradient in the form of a matrix D = D(X) ∈ Rn×n (analogous to A and X), we consider the local linearization of φ at some X in the form

φ(X+H) = φ(X) + (D(X), H)F + O(∥H∥^2)

Then, D(X) is the matrix representation for the gradient (Fréchet derivative) of φ at X. The question is now to determine D(X).

Exercise 12.11 Expand φ(X+H) to conclude

φ(X+H) = φ(X) − 2 (A^T R, H)F + ∥AH∥F^2,  with the residual matrix R = I − AX   (12.12)

i.e., D(X) = −2 A^T R = −2 A^T (I − AX). Note that this expansion is exact because φ(X) is a quadratic functional.


(12.12) shows that the matrix representation for the gradient D = D(X) is given by

D(X) = −2 A^T R = 2 A^T (AX − I)

and the system

D(X) = 0  ⇔  A^T (AX − I) = 0

is nothing but the system of normal equations for the minimization problem. For the prescribed sparsity pattern of X with m entries, its dimension is m×m.

In practice (for large matrices A), direct solution of this system is often out of scope. As an alternative, an iterative technique like steepest descent (SD) may be used. The formulation of the corresponding algorithm is analogous to the SD algorithm from Section 7.1, but with respect to the inner product (X, Y)F = trace(Y^T X), with (a multiple of) the negative gradient as search direction, see Alg. 12.5.

However, in this approach the iterates X will tend to become denser at each step. Therefore it is essential to apply a numerical dropping strategy for the undesired elements. But then descent is no longer guaranteed, i.e., we do not necessarily have φ(Xnew) < φ(Xold). An alternative would be to apply numerical dropping to the search direction S before updating X. However, this does not directly control the amount of fill-in in the iterates X. See [16] for further remarks.

Algorithm 12.5 Global Steepest Descent algorithm

1: Choose an initial guess X with given sparsity pattern
2: Until convergence do
3:   R = I − AX,  S = A^T R
4:   α = ∥S∥F^2 / ∥AS∥F^2
5:   X = X + αS
6:   Apply numerical dropping to X
7: end do

Alternatively, it has been proposed to use the residual matrix R = I − AX as the search direction, see Alg. 12.6. Note that R ≠ 0 gives a direction of descent since (D(X), R)F = −2 (A^T R, R)F ≠ 0 (for invertible A).

Algorithm 12.6 Global Minimal Residual Descent algorithm

1: Choose an initial guess X with given sparsity pattern
2: Until convergence do
3:   R = I − AX
4:   α = trace(R^T A R) / ∥AR∥F^2
5:   X = X + αR
6:   Apply numerical dropping to X
7: end do

Exercise 12.12 Show that (apart from the dropping step) Alg. 12.5 is indeed the steepest descent algorithm for the problem under consideration. (Cf. the derivation of the SD algorithm in Section 7.1, but note the different inner product.) Furthermore, show that the formulation of Alg. 12.6 is correct.

In both algorithms, the residual matrix R has to be explicitly stored. The occurring scalar quantities, including ∥AS∥F^2 and trace(R^T A R), can be computed from the successive columns of AS (resp. AR), which can be computed, used, and discarded one at a time. Thus, the matrix products AS and R^T A R need not be stored explicitly.
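A compact Matlab sketch of Alg. 12.6 with a crude drop-by-threshold rule; the parameters nsteps and tau are ours, and, contrary to the remark above, the products A*X and A*R are formed explicitly here for simplicity:

    function X = approx_inverse_global(A, nsteps, tau)
    % Global minimal residual descent (Alg. 12.6) with simple numerical dropping
    n = size(A,1);
    X = speye(n);                                 % sparse initial guess X0 = I
    for m = 1:nsteps
        R  = speye(n) - A*X;                      % residual matrix
        AR = A*R;
        alpha = sum(sum(R .* AR)) / norm(AR,'fro')^2;   % trace(R'*A*R)/||AR||_F^2
        X = X + alpha*R;
        X = X .* (abs(X) > tau);                  % drop entries below threshold tau
    end
    end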


Column-oriented technique.

The objective functional φ(X) = ∥AX − I∥F^2 is nothing but

φ(X) = ∑_{j=1}^{n} ∥A xj − ej∥2^2

in column-wise notation. Since the j-th column xj of X occurs only in the j-th term, minimization of φ evidently decouples into n individual minimization problems

φj(X) := ∥A xj − ej∥2^2 → min!,  j = 1 . . . n   (12.13)

An attractive feature of this formulation is that the minimization can be done for all columns in parallel. 60

Each minimization can be performed by taking a sparse initial guess and solving approximately the n parallel linear subproblems (12.13) with a few steps of a nonsymmetric descent-type method. A basic version based on steepest descent in residual direction(s) is formulated in Alg. 12.7.

Algorithm 12.7 Approximate inverse via MR Iteration

1: X = X0  % initial guess for X, e.g., X = I
2: for each column j = 1 . . . n do
3:   set xj = X0 ej
4:   for k = 1 . . . until some stopping criterion is met do
5:     rj = ej − A xj
6:     αj = (rj, A rj) / (A rj, A rj)
7:     xj = xj + αj rj
8:     apply a dropping strategy for the entries of xj
9:   end for
10: end for

In [16], this class of preconditioning techniques is studied in more detail. Among others, this also includes 'factored approximate inverses', a more systematic approach of ILU type.

Polynomial preconditioners.

Another class of preconditioners – in the spirit of finding an approximate inverse – are polynomial preconditioners. From the Cayley-Hamilton Theorem we know that an invertible matrix A ∈ Rn×n can be represented as A−1 = p(A) for some p ∈ Pn−1. One may hope to find good approximations using polynomials of much smaller degree. The following choices are just two examples. In both cases, computing a (left-) preconditioned residual amounts to evaluating a term of the form p(A)x with some low degree polynomial p.

• Neumann polynomials: If A is SPD, then the system Ax = b is equivalent to ωAx = ωb (ω ≠ 0). If we choose the damping parameter 0 < ω < 2/∥A∥2, then ∥I − ωA∥2 < 1, i.e., the Neumann series

(ωA)−1 = ∑_{i=0}^{∞} (I − ωA)^i

is convergent. Hence, truncating this series, i.e., taking M−1 = ∑_{i=0}^{m} (I − ωA)^i for rather small m, may lead to a good preconditioner for the linear system ωAx = ωb.

60 A popular realization of these ideas is the so-called SPAI algorithm, see http://www.computational.unibas.ch/software/spai.


• Chebyshev polynomials: We seek a preconditioner M−1 ≈ A−1 in the form M−1 = pm(A) for some pm ∈ Pm. Aiming at minimizing ∥I − pm(A)A∥ in some norm we are led – as in our discussion of Chebyshev acceleration in Section 6 – to minimizing the spectral radius ρ(I − pm(A)A), i.e., to find pm ∈ Pm such that max{ |1 − λ pm(λ)| : λ ∈ σ(A) } is minimized.

In practice, the spectrum σ(A) is not known, but possibly an inclusion set E ⊂ C with σ(A) ⊂ E. In this case, we would seek p ∈ Pm such that max_{x∈E} |1 − x p(x)| is minimized. If, for example, E is an interval on the real line, then the minimizer pm ∈ Pm is given by the scaled Chebyshev polynomial described in Corollary 6.3. The preconditioner is then taken as x ↦ M−1x = pm(A)x.

Block preconditioners.

For most preconditioners also block variants exist, e.g., for a matrix A given in block form

A = ( A1,1  A1,2  · · ·  A1,k
      A2,1  A2,2  · · ·  A2,k
        ⋮     ⋮    ⋱      ⋮
      Ak,1  Ak,2  · · ·  Ak,k )

one may choose the preconditioner

MJ−1 = diag( A1,1−1, A2,2−1, . . . , Ak,k−1 )

which is the block Jacobi preconditioner. Here, of course, one assumes that the diagonal blocks Ai,i are cheaply invertible.
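A minimal sketch of the action of such a block Jacobi preconditioner, under the simplifying assumption of equally sized diagonal blocks (a careful implementation would factorize the blocks once and reuse the factors):

    function z = block_jacobi_apply(A, r, nb)
    % apply the block Jacobi preconditioner: solve with the nb diagonal blocks
    n  = size(A,1);  bs = n/nb;            % assumes n is divisible by nb
    z  = zeros(n,1);
    for i = 1:nb
        idx = (i-1)*bs + (1:bs);
        z(idx) = A(idx,idx) \ r(idx);      % solve with the i-th diagonal block
    end
    end

    % possible usage:  x = pcg(A, b, 1e-8, 400, @(r) block_jacobi_apply(A, r, 8));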

12.6 Numerical examples

Poisson problem.

The results obtained with the preconditioning methods described above, applied to CG and the SPD matrix 'poisson' from the Matlab gallery (cf. Example 2.2), are displayed in Table 12.1.

dimension N×N   N = 64    N = 256      N = 1,024    N = 4,096     N = 16,384
original          10      28 (0.06)    59 (0.38)    119 (2.31)    239 (17.36)
ICC(0)            11      19 (0.11)    30 (0.50)     55 (3.74)    100 (39.16)
SGS               11      19 (0.11)    34 (0.55)     60 (2.74)    118 (22.19)
diagonal          10      28           59           119           239

Table 12.1: Iteration counts and times (in seconds, shown in parentheses) obtained using Matlab's implementation of PCG and ICC on a 450 MHz P2 applied to the 'Poisson' matrix, with a convergence tolerance of 10−8.


[Figure 12.26: Eigenvalues of preconditioned versions of the matrix 'Poisson' from the Matlab gallery. Order 16×16 – original system: condition number 13.3333; incomplete Cholesky preconditioning: 2.569; symmetric Gauss-Seidel preconditioning: 3.3157; diagonal preconditioning: 13.3333. Order 64×64 – original system: 46.2952; incomplete Cholesky: 7.5299; symmetric Gauss-Seidel: 10.7953; diagonal: 46.2952.]

Note that, with the exception of diagonal preconditioning, a decrease in the number of iterations required to achieve convergence is obtained; however, the computational time is not reduced. This is probably due to the relatively cheap cost of a CG iteration for this pentadiagonal matrix compared to the setup of the preconditioning matrix and its inversion. The reduction in the number of iterations required corresponds to a reduction in the condition number of the preconditioned system.

For low values of n the eigenvalues of these systems are shown in Fig. 12.26. We note that although diagonal preconditioning rescales the eigenvalues, it does not reduce the condition number of the new system. This is due to the fact that the diagonal of A is constant, D = 4 I, which factors out of the evaluation of the condition number. For such a constant D, diagonal preconditioning has no effect.

In general, preconditioning by the diagonal generates some benefit, and due to the simplicity of this method it is often worth an initial investigation – it is simply a rescaling of the problem which might be useful in the case where the matrix elements vary strongly in size.

A sterner test for preconditioned CG can be constructed from one of the test matrices freely available from the large collection at the Matrix Market 61. The matrix selected is NOS6, a matrix resulting from discretizing Poisson's equation on an L-shaped domain. The matrix is SPD, of dimension 675×675, with an estimated condition number of 8×10^6. The results obtained are displayed in Table 12.2. For this problem we observe that the diagonal preconditioning did a relatively good job, mostly due to its computational cheapness.

61 http://math.nist.gov/MatrixMarket: A visual repository of test data for use in comparative studies of algorithms for numerical linear algebra, featuring nearly 500 sparse matrices from a variety of applications, as well as matrix generation tools and services.

             iteration count    time (seconds)
original          1408              2.917
ICC(0)              87              0.599
SGS                 42              0.277
diagonal           103              0.388

Table 12.2: Iteration counts and times obtained using Matlab's implementation of PCG and ICC on an old SUN workstation applied to the NOS6 matrix, with a convergence tolerance of 1×10−8.

An unsymmetric problem.

Computational results obtained with the preconditioned GMRES algorithm are displayed in Table 12.3 and Fig. 12.28. The matrix considered is PORES3 from the Matrix Market. PORES3 is an unsymmetric matrix resulting from modeling oil reservoirs, with dimension 532×532 and an estimated condition number of 6.6×10^5.

From Table 12.3 we see that GMRES is quite responsive to preconditioning: ILU(0) and ILUT(10−4) perform very well. Fig. 12.28 shows the convergence behavior (residual vs. iteration number) for preconditioned GMRES and restarted GMRES with different restart values. Note a difficulty of restarted GMRES which quite often arises: If the restart value is too small, then restarted GMRES does not converge.

                          iteration count     time (seconds)
original                  93 + (84×100)          123.593
ILU(τ), τ = 1×10−4              10                 0.125
ILU(0)                          29                 0.359
SGS                       31 + (11×100)            20.179

Table 12.3: Iteration counts and times obtained using Matlab's implementation of restarted GMRES (restart value 100) and ILU on an old SUN workstation applied to the PORES3 matrix, with a convergence tolerance of 1×10−8.

Exercise 12.13 Use the Matrix Market collection together with the Matlab built-in functions to reproduce all these results.

Although the preconditioning techniques presented here can be quite effective in accelerating the convergence rate of Krylov subspace methods, we typically find that, when applied to discretizations of PDEs, the number of iterations required to achieve convergence remains linked to the size of the problem under consideration.

Ideally, however, one would like to have preconditioners that perform well irrespective of the problem size. This goal, the Holy Grail of preconditioning, can be achieved for certain problem classes. For example, multigrid and the so-called BPX preconditioner achieve this goal for discretizations stemming from elliptic partial differential equations.


[Figure 12.27: Incomplete LU factorizations of the matrix PORES3, preserving the original sparsity (top) and using a dropping tolerance of 5×10−4 (bottom); the four sparsity plots contain nz = 1798, 2208, 2457, and 3090 nonzeros.]

[Figure 12.28: Performance of restarted preconditioned GMRES for PORES3 (N = 532): residual vs. iteration number for restart = 10, restart = 50, no restart, ILU(0), and ILUT(0.01).]


13 Multigrid Methods (MG)

The classical Krylov techniques such as CG and GMRES coupled with the simple preconditioners discussed above typically deteriorate as the problem size increases, i.e., the number of iterations required to reach a given accuracy increases as the problem size increases. We now show how multigrid techniques can overcome these difficulties. Multigrid can itself be viewed as a convergent 'stand alone' iterative scheme. But, in particular, it can be very successfully employed as a preconditioner for CG or GMRES.

13.1 Motivation: 1D elliptic model problem.

In order to illustrate the main MG idea, we consider solving the linear system arising from the 1D Poisson Example 2.1. It will be convenient to scale the tridiagonal matrix of Example 2.1 in a different way. We consider the sample problem

ANu = bN (13.1)

where the tridiagonal matrix AN ∈ R^(N−1)×(N−1) and the right-hand side bN ∈ R^(N−1) are given by

AN = (1/hN) tridiag(−1, 2, −1),   bN = hN (1, 1, . . . , 1)^T

Here the parameter hN is simply the mesh size given by

hN = 1/N   (13.2)

This corresponds to solving the boundary value problem −u″ = 1 on Ω = (0, 1), with Dirichlet boundary conditions u(0) = u(1) = 0. The exact solution is u(x) = ½ x(1−x).

The solution of the linear system (13.1) represents approximations to the nodal values of the function u(x). 62 It will be useful to identify these solution vectors with piecewise linear functions. More precisely, we identify a vector u ∈ R^(N−1) with a piecewise linear function u ∈ C[0, 1] via the relations

u(xi) = ui,  i = 1 . . . N−1,  xi = i hN

Additionally, we assume u(0) = u(1) = 0. In the sequel, we will frequently refer to this isomorphism,

R^(N−1) ∋ u ↔ u ∈ C[0, 1] (with zero boundary values).   (13.3)

Note that u = ∑_{i=1}^{N−1} ui ei, where ei denotes the piecewise linear i-th hat function satisfying ei(xj) = δij. These hat functions form a basis in the space of piecewise linear functions, and u plays the role of the coefficient vector of the representation of a function u in this basis.

Remark 13.1 The identification between vectors u ∈ R^(N−1) and piecewise linear functions is quite natural if one observes that the matrix AN is exactly the stiffness matrix obtained by a FEM discretization (with piecewise linear ansatz functions) of −u″ = 1 on (0, 1). In the present case of a uniform mesh, the finite difference matrix of Example 2.1 coincides, up to a scaling with hN, with the FEM matrix.

62 In our simple example where u(x) = ½ x(1−x), these nodal values are exactly reproduced by the discrete solution because the local discretization errors vanish.


[Figure 13.29: Left: Performance of the undamped Jacobi method for solving (13.1) for different problem sizes N (N = 4, 16, 64, 512): error in maximum norm vs. iteration number. Right: Varying, for fixed N, the damping parameter ω (ω = 0.5, 1, 1.5) does not improve convergence.]

Example 13.2 Let us denote by um the m-th iterate of the Jacobi method applied to (13.1), with u∗ = exact solution of (13.1). Fig. 13.29 shows the convergence history (∥um − u∗∥∞ versus the iteration number m, for u0 = 0) for different problem sizes N. Note in particular the well-known degradation of the convergence behavior as the problem size increases.

A more precise analysis of the convergence behavior of the Jacobi method applied to (13.1) can be based on the fact that the eigenvalues and the eigenvectors of the iteration matrix are explicitly known (cf. Sec. 2): The vectors wk ∈ R^(N−1), k = 1 . . . N−1, with

(wk)i = sin(i k π hN) = sin(k π xi),  i = 1 . . . N−1   (13.4)

are the eigenvectors of AN, with corresponding eigenvalues

λk = (4/hN) sin²(k π hN/2)   (13.5)

The eigenvectors wk can be identified with the piecewise linear functions wk plotted in Fig. 13.30 for different values of k. Whereas the 'low frequencies' wk for small values of k correspond to slowly ('smoothly') varying functions, the 'high frequencies' corresponding to large values of k are rapidly oscillating.

[Figure 13.30: Illustration of w1 (low frequency), w5, and wN−1 (high frequency); N = 64.]


[Figure 13.31: Left: Contraction property of the undamped (ω = 1) Jacobi method in dependence on the wave number k (number of iterations needed to reduce the error to 1%). Center: Contraction property of the damped (ω = 2/3) Jacobi method in dependence on the wave number k. Right: Convergence behavior of damped (ω = 2/3) Jacobi for initial error e0 = ∑_{k=1}^{N−1} wk.]

13.2 A more detailed analysis of the Jacobi method. Error smoothing

We study the performance of the damped Jacobi iteration, with damping factor ω ∈ (0, 2), for solving (13.1):

um+1 = um + ω DN−1 (bN − AN um)   (13.6)

where AN is given by (13.1) and DN is the diagonal of AN.

Let Gω = Gω^Jac = I − ω DN−1 AN denote the iteration matrix. For all k, wk from (13.4) is an eigenvector of the symmetric matrix Gω, with eigenvalue

γk(ω) = 1 − ω (hN/2) λk = 1 − 2ω sin²(k π hN/2)   (13.7)

Example 13.3 We consider the case N = 64 and analyze the case where the initial error e0 ∈ R^(N−1) is one of the eigenmodes, e0 = wk. The error in step m then satisfies

∥em∥2 = ∥(Gω)^m e0∥2 = ∥γk(ω)^m wk∥2 = |γk(ω)|^m = |1 − 2ω sin²(k π hN/2)|^m

As k varies between k = 1 and k = N−1, the contraction factor |γk(ω)| varies between 1 − O(hN²) and numbers that are very close to 0! For the case ω = 1, the left plot in Fig. 13.31 illustrates the value |γk(1)| by plotting m ∈ N over k such that |γk(1)|^m ≈ 0.01.

The center plot in Fig. 13.31 shows the behavior for the case ω = ωopt = 2/3, which gives 'optimal damping of high frequencies' in the sense of Exercise 13.4. Components corresponding to high frequencies are reduced quickly, whereas the low frequency components are not significantly reduced.

The right plot in Fig. 13.31 shows the convergence history for ω = ωopt = 2/3 for a case where all modes are equally present in e0. The error reduction rate is quite good at the beginning, due to the fact that the high frequency error modes are quickly damped out. Later on, as the error becomes smoother, the iteration begins to stagnate.

Example 13.3 shows that the error components of some of the 'frequencies' (or 'modes') are reduced very quickly in a few Jacobi steps; the error components of other modes are hardly reduced. The conclusion is that we should design iterative schemes that are based on two principles: Use the Jacobi method to reduce the error of some components; another procedure will have to be designed to effectively reduce the error in the remaining modes. The simplest (but not practicable) algorithm which realizes this idea is the two-grid method which we describe below.


[Figure 13.32: Error reduction of the optimally damped (ω = 2/3) Jacobi method with random initial vector x0; N = 32: the errors e0, e3, e6, e9 plotted as functions over (0, 1).]

Having accepted the fact that we cannot reduce all error components uniformly well, we will settle for efficiently reducing some of them. For reasons that will become clear later, we wish to reduce the high frequency components. The optimal value of the damping parameter ω then turns out to be ωopt = 2/3:

Exercise 13.4 Show: the minimizer ωopt of the function

ρ(Gω) = max_{N/2 ≤ k ≤ N−1} |γk(ω)| = max_{N/2 ≤ k ≤ N−1} |1 − 2ω sin²(k π hN/2)|

satisfies lim_{N→∞ (hN→0)} ωopt = 2/3. Also show that ρ(Gωopt) ≤ 1/3 for all hN > 0.

Thus we have identified the optimal damping parameter as ωopt = 2/3. This choice indeed leads to a quick reduction of the high frequency components corresponding to the upper half of the discrete spectrum.

Example 13.5 We again consider the model problem (13.1) and use the damped Jacobi method with damping parameter ωopt = 2/3. For N = 32 and a random starting vector x0 we plot the errors e0, e3, e6, and e9 in Fig. 13.32. We observe that the initial error e0, being randomly chosen, has high and low frequency components. The damped Jacobi method damps out the high frequency components quickly: the strong spikes of e0 are no longer present in e3, and the error e3 (and even more so e6) is rather slowly varying. Iterating further does not result in a significant error reduction because the contraction rate of the damped Jacobi method is close to 1 for the low frequency components.
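A short Matlab sketch reproducing the flavor of this experiment (parameter choices are ours); the maximum difference of neighboring error values serves as a crude measure of the high-frequency content, which drops quickly, while the error norm itself soon stagnates.

    N  = 32;  hN = 1/N;  e = ones(N-1,1);
    AN = (1/hN) * spdiags([-e 2*e -e], -1:1, N-1, N-1);
    bN = hN * e;  uN = AN \ bN;            % exact discrete solution
    omega = 2/3;  Dinv = hN/2;             % D = (2/hN)*I for this matrix
    u = rand(N-1,1);                       % random start
    for m = 1:9
        u = u + omega * Dinv * (bN - AN*u);        % damped Jacobi step (13.6)
        if mod(m,3) == 0
            err = u - uN;
            fprintf('m=%d:  |e|_inf = %.2e,  oscillation = %.2e\n', ...
                    m, norm(err,inf), norm(diff(err),inf));
        end
    end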

Another reasonable choice is ω = 1/2. In this case the 'medium frequencies' are damped more slowly than for ω = 2/3; the overall damping factor for the upper half of the spectrum is only 1/2 instead of 1/3. On the other hand, the highest frequencies are those for which the damping effect is most significant (also note that all eigenvalues of G1/2 are positive). This is illustrated in Fig. 13.33 (left = low frequencies; right = high frequencies).

[Figure 13.33: Eigenvalue portrait of Gω for three different values of ω (ω = 1, 1/2, 2/3).]


13.3 The two-grid scheme (TG) in abstract formulation

We have seen in Example 13.5 that the Jacobi method with damping parameter ωopt = 2/3 is quite effective at reducing high frequency components of the error. From Fig. 13.32 we see that, after a few steps, the error em resp. em is 'smooth' in the sense that, while possibly being large, it is not strongly varying. Thus, we may hope that we could get a good approximation of em on a coarser mesh, e.g. with mesh size 2hN. This is the idea of the two-grid method:

1. Smoothing step: Perform m steps of the damped Jacobi method for ANu = bN to yield um .

2. Coarse grid correction: Solve a related problem of size n < N (typically, n = N/2 in the 1D case) whose solution e approximates the error em = um − uN. Then correct um using e. The solution e will stem from solving a problem that is posed on a coarser grid with mesh size 2hN.

Let us first describe the coarse grid correction step in a purely formal way. The correction step is nothing but one step (or maybe more of them) of a linear iterative scheme, where the original matrix AN is approximated by its 'coarse counterpart' An. In addition, we need a restriction operator Rn,N and a prolongation (interpolation) operator PN,n acting between the solution spaces of dimension (essentially) N and n. With an appropriate choice for Rn,N and PN,n, the coarse grid correction step is realized in the usual way in terms of correction by a linear image of the residual of the smoothed-out approximation um,

um ↦ um^TG = um + PN,n An−1 Rn,N (bN − AN um)

i.e., PN,n An−1 Rn,N is used to approximate AN−1. This may be called the 'strong form' of the correction step and will be appropriate in the setting of FD methods. (In the FEM context we will use a 'weak' formulation.) Algorithmically, this amounts to

1. Compute the residual rm = bN − AN um
2. Restrict the residual to the 'coarser space': r̃m = Rn,N rm
3. Solve An δ = r̃m
4. Prolongate the correction δ and add it to um, i.e., compute um^TG = um + PN,n δ

The coarse grid correction 63 δ is an approximation for the error −em = uN − um. Note that the approximation PN,n An−1 Rn,N for AN−1 has reduced rank, and that the corresponding amplification matrix

G = IN − PN,n An−1 Rn,N AN

cannot be a contraction: Ge = e for every e with AN e ∈ ker(Rn,N), and such e are typically 'unsmooth' objects. This shows that coarse-grid correction only makes sense in cooperation with a preceding smoothing procedure. Additional smoothing steps following the coarse-grid correction are a further option.

Galerkin approximations on subspaces.

In principle, MG methods can be applied in the context of any discretization approach, e.g., finite difference (FD) methods. In view of application to systems arising from a FEM discretization, a natural, 'weak' formulation of coarse-grid correction is appropriate, involving Galerkin orthogonality. Let us describe this in a general, abstract form, considering a linear system

ANu = bN,  AN ∈ R^(N−1)×(N−1)

63 Here and in the sequel, uN denotes the exact solution of ANu = bN, and the letter δ is used to denote a quantity approximating the negative error e = u − uN of a given approximation u. Thus, u + δ is the new, corrected approximation.


Due to finite dimension this is equivalent to the following 'weak' formulation:

Find u ∈ R^(N−1) such that (ANu, w) = (bN, w) ∀ w ∈ R^(N−1)

Let Vn ⊂ R^(N−1) be a subspace of R^(N−1) of dimension n−1 < N−1. The corresponding 'Galerkin approximation' in the subspace is defined by:

Find v ∈ Vn such that (ANv, w) = (bN, w) ∀ w ∈ Vn   (13.8)

In order to formulate (13.8) as a linear system of equations, let v1, . . . , vn−1 be a basis of Vn. Let PN,n = [ v1 | . . . | vn−1 ] ∈ R^(N−1)×(n−1). Seeking v in the form v = PN,n y with coefficient vector y ∈ R^(n−1) allows us to rewrite (13.8) as

Find y ∈ R^(n−1) such that (AN PN,n y, PN,n z) = (bN, PN,n z) ∀ z ∈ R^(n−1)   (13.9)

which can be rewritten as

Compute the solution y ∈ R^(n−1) of P^T_{N,n} AN PN,n y = P^T_{N,n} bN   (13.10)

The matrix PN,n ∈ R^(N−1)×(n−1) is called the prolongation matrix, whereas its transpose Rn,N = P^T_{N,n} ∈ R^(n−1)×(N−1) plays the role of a restriction matrix.

Remark 13.6 This Galerkin approximation is closely related to the [Petrov-]Galerkin approach underlying Krylov methods. Indeed, by rearranging terms in (13.8), we see that it is equivalent to the Galerkin orthogonality requirement

Find v ∈ Vn such that (bN − ANv, w) = 0 ∀ w ∈ Vn

Recognizing this as the defining condition in (10.2), we can conclude as in the proof of (10.5) that, for SPD matrices A, v satisfies the best approximation property

∥v − u∗∥A = min_{w∈Vn} ∥w − u∗∥A   (13.11)

in the energy norm, where u∗ denotes the exact solution of ANu = bN.

The resulting 'Galerkin form' of one step of the abstract two-grid method amounts to the following procedure. After performing m steps of the damped Jacobi iteration for ANu = bN yielding um, the (Galerkin) coarse grid correction is computed in the following way, with the Galerkin coarse grid approximation of AN,

An = P^T_{N,n} AN PN,n ∈ R^(n−1)×(n−1)   (13.12)

defined via an appropriately chosen Galerkin pair PN,n (prolongation) and Rn,N = P^T_{N,n} (restriction):

1. Compute the residual rm = bN − AN um
2. Restrict the residual to the subspace Vn: r̃m = P^T_{N,n} rm
3. Find the Galerkin approximation by solving An δ = r̃m
4. Prolongate the correction δ and add it to um, i.e., compute um^TG = um + PN,n δ

Exercise 13.7 Check that the two-grid method is a linear iteration; in particular: If SN = Gω = I − ω DN−1 AN is the iteration matrix of the damped Jacobi method (or some other linear smoothing procedure), then the iteration matrix GTG of the two-grid method with m smoothing steps is given by

GTG = (I − PN,n An−1 P^T_{N,n} AN) SN^m   (13.13)

with An from (13.12). Conclude that (in any norm)

∥GTG∥ ≤ ∥AN−1 − PN,n An−1 P^T_{N,n}∥ ∥AN SN^m∥   (13.14)


13.4 The two-grid method for the 1D model problem

We return to the problem of designing the coarse grid correction for the 1D Poisson model problem. Our first goal is to choose an appropriate subspace Vn together with the appropriate prolongation and restriction operators.

It will be useful to think in terms of functions u and identify them with coefficient vectors u in a canonical way, in the spirit of (13.3). To this end we define the space of piecewise linear functions UN on the given (equidistant) mesh by

UN = { u ∈ C[0, 1] : u|Ii ∈ P1 for i = 0 . . . N−1 }

with the subintervals

Ii = (xi, xi+1),  i = 0 . . . N−1,  xi = i hN,  hN = 1/N

Note that the isomorphism (13.3), R^(N−1) ∋ u ↔ u ∈ UN, maps u ∈ R^(N−1) to an element u ∈ UN.

We choose n = N/2 and consider 64 the spaces UN and UN/2. Clearly, UN/2 ⊂ UN with the natural injection, which we denote by IN,N/2 : UN/2 → UN. This injection plays the role of our prolongation operator PN,N/2. It can also be expressed in matrix notation:

Exercise 13.8 [matrix analog of IN,N/2] Define IN,N/2 ∈ R^(N−1)×(N/2−1) as the matrix whose j-th column contains the entries 1/2, 1, 1/2 in rows 2j−1, 2j, 2j+1, and zeros elsewhere:

IN,N/2 = ( 1/2
            1
           1/2  1/2
                 1
                1/2  ·
                      ·
                     1/2 )

Let uN/2 ∈ UN/2 be associated with the vector uN/2 ∈ R^(N/2−1). Then uN/2 is 'canonically embedded' into UN since UN/2 ⊂ UN.

Show: The vector wN ∈ R^(N−1) corresponding to uN/2 (viewed as an element of UN) is then given by wN = IN,N/2 uN/2.

Hint: This is related to piecewise linear interpolation: for any uN/2 ∈ UN/2 and a point xi = ½(a+b) of the fine mesh lying between two neighboring coarse mesh points a and b, we have u(xi) = ½(u(a) + u(b)). Visualize this by means of a figure.

Furthermore, give an interpretation of the corresponding (Galerkin) restriction operator RN/2,N : UN → UN/2 represented by the matrix I^T_{N,N/2} ∈ R^(N/2−1)×(N−1). Note that this restriction is not the trivial, pointwise one.

64 For simplicity, we assume that N is even – in fact, later we will assume that N = 2L for some L ∈ N .
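For concreteness, the following Matlab sketch (our own) builds the matrix IN,N/2 column by column and checks numerically the Galerkin identity claimed in Exercise 13.10 below.

    N = 16;  n = N/2;
    P = zeros(N-1, n-1);                   % the prolongation matrix I_{N,N/2}
    for j = 1:n-1
        P(2*j-1, j) = 1/2;  P(2*j, j) = 1;  P(2*j+1, j) = 1/2;
    end
    P  = sparse(P);
    e  = ones(N-1,1);  en = ones(n-1,1);
    AN = N * spdiags([-e  2*e  -e ], -1:1, N-1, N-1);   % A_N   (h_N = 1/N)
    An = n * spdiags([-en 2*en -en], -1:1, n-1, n-1);   % A_{N/2}
    norm(full(P'*AN*P - An), inf)          % zero up to rounding (Exercise 13.10)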


In matrix notation, the natural injection IN,N/2 allows us to identify a subspace of R^(N−1) suitable for the coarse grid correction, namely the space Vn = IN,N/2 R^(N/2−1).

We are now ready to turn to the coarse-grid Galerkin approximation of the error em = um − uN for a given um (obtained after m smoothing steps), and formulate it in terms of the algebraic objects (matrices and coefficient vectors) involved in the numerical computation. The error satisfies AN(−em) = rm = bN − AN um. The Galerkin correction technique described above, with the subspace Vn = IN,N/2 R^(N/2−1) of dimension N/2 − 1, gives the coarse grid correction δN/2 as the solution of

A^Galerkin_{N/2} δN/2 = I^T_{N,N/2} rm   (13.15)

with the Galerkin coarse grid approximation matrix of AN defined by

A^Galerkin_{N/2} = I^T_{N,N/2} AN IN,N/2   (13.16)

which is a projected version of AN.

The outcome is exactly the procedure indicated before, with the canonical choice for the Galerkin pair, IN,N/2 (prolongation) and IN/2,N = I^T_{N,N/2} (restriction), and the corresponding matrix A^Galerkin_{N/2}.

In algorithmic form, this results in the two-grid method:

Algorithm 13.1 Two-grid method

1: choose initial guess u0^TG
2: for j = 1, 2, . . . until convergence do
3:   Smoothing: do m steps of the damped Jacobi method to obtain um
4:   Compute and restrict the residual: I^T_{N,N/2} rm = I^T_{N,N/2} (bN − AN um)
5:   Coarse grid correction: solve (13.15) for δN/2
6:   Prolongate and apply the correction: uj^TG = um + IN,N/2 δN/2
7: end for

Example 13.9 For the 1D model problem, the two-grid method has very good convergence properties, as is visible in Fig. 13.34. We note in particular that its performance (measured in terms of the amount of error reduction per iteration) is independent of the problem size.

Interpretation in terms of continuous functions.

In computational practice, the coarse grid correction is of course an algebraic process, as in Alg. 13.1. However, for a theoretical understanding in the context of FD and, in particular, FEM discretizations, it should rather be reformulated and viewed in terms of function spaces.

In the sequel we shall concentrate on the FEM context, and once more we recall our 'FEM isomorphism' u ∈ R^(N−1) ↔ u ∈ UN. We introduce the bilinear form a(·, ·) and the linear form f(·) by

a(u, v) = ∫_{Ω=(0,1)} u′ v′ dx,  f(v) = hN (bN, v)   (13.17)

It is easy to verify that

(ANu, w) ≡ a(u, w)


[Figure 13.34: Performance of the two-grid method for the 1D Poisson model problem: maximum norm of the residual (left) and maximum error in the nodes (right) vs. number of iterations, for h = 1/8, 1/16, 1/32, 1/64.]

and therefore the exact solution uN of ANu = bN corresponds to the solution uN of the weak formulation

a(uN, w) = f(w) ∀ w ∈ UN   (13.18)

Consider um ∈ R^(N−1) obtained after m smoothing steps and its associated function um ∈ UN, with error em = um − u∗. The two-grid method seeks an approximation uTG to u∗ of the form

uTG = um + δN/2

with δN/2 ∈ UN/2 ⊂ UN. It is natural to aim for δN/2 ∈ UN/2 to be the best approximation to the error −em, i.e., we seek δN/2 such that the error in the energy norm 65 is minimized:

Find δN/2 ∈ UN/2 such that ∥uTG − u∗∥A = ∥(um + δN/2) − u∗∥A ≤ ∥(um + w) − u∗∥A ∀ w ∈ UN/2

The minimizer δN/2 is the solution of the Galerkin (orthogonality) system

a((um + δN/2) − uN, w) = 0 ∀ w ∈ UN/2

Rearranging terms results in the equivalent formulation

Find δN/2 ∈ UN/2 such that a(δN/2, w) = a(−em, w) ∀ w ∈ UN/2   (13.19)

We note:

1. At first sight, problem (13.19) cannot be solved because the unknown exact solution uN appears on the right-hand side. However, from (13.18) we see that (13.19) can be rewritten as

   Find δN/2 ∈ UN/2 such that a(δN/2, w) = f(w) − a(um, w) ∀ w ∈ UN/2   (13.20)

   Here the right-hand side is the residual of um in the weak sense.

2. The solution δN/2 of (13.19) is the projection, w.r.t. the a(·, ·) inner product, of −em onto the coarse space UN/2.

65 For u ∈ R^(N−1) and the corresponding u ∈ UN we denote ∥u∥A = √a(u, u) = √(ANu, u) = ∥u∥A, i.e., the energy norm of the function coincides with the energy norm of its coefficient vector.


The functions δN/2, w ∈ UN/2 correspond to vectors δN/2, w ∈ R^(N/2−1) (via the FEM isomorphism). The canonical embedding UN/2 ⊂ UN is represented by the matrix IN,N/2 ∈ R^(N−1)×(N/2−1) (cf. Exercise 13.8). Hence, in matrix notation, (13.19) reads

Find δN/2 ∈ R^(N/2−1) such that (AN IN,N/2 δN/2, IN,N/2 w) = (rm, IN,N/2 w) ∀ w ∈ R^(N/2−1)

with the residual rm = bN − AN um. Rearranging terms, we see that this is nothing but (13.15). The derivation also suggests that the matrix I^T_{N,N/2} AN IN,N/2 is actually identical to AN/2:

Exercise 13.10 Show: For the 1D Poisson problem, I^T_{N,N/2} AN IN,N/2 = AN/2.

Remark 13.11 The matrices

PN,N/2 = IN,N/2  and  RN/2,N = I^T_{N,N/2}

play the role of Galerkin prolongation and restriction matrices, respectively. In particular, for an SPD problem the choice RN/2,N = P^T_{N,N/2} is natural because the coarse grid Galerkin approximation matrix (13.16),

A^Galerkin_{N/2} = RN/2,N AN PN,N/2 = P^T_{N,N/2} AN PN,N/2 = I^T_{N,N/2} AN IN,N/2   (13.21)

is also SPD.

However, in general, I^T_{N,N/2} AN IN,N/2 = AN/2 is not necessarily true, and general multigrid techniques work with AN/2 as directly given by the discretization on the coarse level. Different choices for PN,N/2 and RN/2,N are possible, and this choice is one of several parameters influencing the convergence behavior. (The choice of the smoothing procedure is, of course, also essential, and it is not straightforward in general.)

13.5 Analysis of the two-grid method for elliptic problems

The following convergence analysis of the two-grid method is based on Fourier analysis of the smoother and on 'elliptic regularity' to be explained below. The analysis can be understood as separately bounding the factors ∥AN−1 − PN,n An−1 P^T_{N,n}∥ and ∥AN S^m∥ in (13.14). We give details for the 1D Poisson problem; the style of analysis is such that it can be seen how the theory should extend to more general elliptic boundary value problems.

Instead of the damped Jacobi method, we consider the damped Richardson method as the smoother. (For the 1D Poisson model this is the same in the case of a uniform mesh.) 66

S = I − B−1A,  B = ωI   (13.22)

We assume

∥A∥2 = λmax(A) ≤ ω ≤ C′ λmax(A)   (13.23)

where C′ ≥ 1 is a parameter independent of the stepsize h, and λmax(A) of course depends on h. We refer to Lemma 13.16 below for (general) bounds on λmax(A).

66 Here we drop the subscript N for convenience of notation.


Remark 13.12 Assumption ω ≥ λmax(A) means that the damped Richardson preconditioner B = ωI satisfies A ≤ B. Note that, in the Richardson formulation, the damping parameter will have to be chosen depending on h.

For the Poisson 1D example, we have λmax(A) = (4/h) sin²((N−1) π h/2) ≈ 4/h (cf. (13.5)). Since the diagonal is constant, D = (2/h) I, a damped Jacobi step can also be interpreted as a damped Richardson step:

SωJac = I − ωJac D−1 A = I − ωJac (h/2) A = I − B−1 A,  with B = (2/(h ωJac)) I =: ωI

Thus, ω ≈ λmax(A) is equivalent to ωJac = 1/2, a reasonable choice for Jacobi damping.

We now formalize the smoothing property. To this end it is useful to define a family of 'generalized energy norms' which are well-defined for any SPD matrix A and arbitrary s ∈ R:

∥|u|∥_s^2 = (A^s u, u)   (13.24)

For s = 0, 1, 2 we have

∥|u|∥_0 = ∥u∥2,  ∥|u|∥_1 = ∥u∥A,  ∥|u|∥_2 = ∥Au∥2   (13.25)

One way of understanding the norms ∥|·|∥_s is to recognize that the larger s ≥ 0 is taken, the more 'weight' is put on the components corresponding to large eigenvalues of A: Let (wk) be an orthonormal eigenbasis of A with corresponding eigenvalues λk. Then, upon writing u = ∑_k ξk wk, we have

∥|u|∥_s^2 = ∑_k λ_k^s |ξk|^2

Furthermore, for elliptic problems like the Poisson equation, larger eigenvalues typically correspond to more oscillatory, unsmooth components wk. This will allow us to quantify the smoothing property which we have observed in Example 13.5 for the damped Jacobi method. Lemma 13.14 below makes this more precise.

The discrete norms ∥|·|∥_s can also be understood via their relation to Sobolev spaces. Assume that A is the FEM stiffness matrix for the Poisson problem on the uniform mesh with mesh size h = hN. We have

Lemma 13.13 (connection with Sobolev spaces) Let Ω = (0, 1). Identify a function u ∈ UN with its coefficient vector u ∈ R^(N−1). Then:

|u|²_{H1(Ω)} = (Au, u) = ∥|u|∥_1^2,   ∥u∥²_{L2(Ω)} ∼ h ∥u∥2^2 = h ∥|u|∥_0^2

where |u|_{H1(Ω)} is the Sobolev (semi-)norm defined by the energy product

a(u, v) = (u, v)_{H1(Ω)} = ∫_{Ω=(0,1)} u′ v′ dx

(cf. (13.17)).

Proof: Exercise. The key is to observe that, for piecewise linear u, v,

(Au, v) = a(u, v) for all u, v ∈ UN

This identity can e.g. be obtained by applying partial summation to (Au, v). This is a 'discrete counterpart' of the partial integration identity ∫_Ω (−∆u) v = a(u, v) which leads from the strong to the weak form of the 1D Poisson equation.


Note that in the space H1_0(Ω) with zero boundary values, |·|_{H1(Ω)} is indeed a norm. In general, it is only a seminorm on H1(Ω), and the H1-norm is e.g. defined as

∥w∥²_{H1(Ω)} = ∥w∥²_{L2(Ω)} + |w|²_{H1(Ω)}

If the Dirichlet boundary values are exactly satisfied by a discrete approximation u, the error e is an element of H1_0(Ω). For the error analysis of the Poisson model we can therefore work with the norm ∥·∥_{H1(Ω)}. Due to Lemma 13.13 it is identical with the energy norm ∥|·|∥_1. 67

Smoothing property.

We now show that the smoother S considered above has the smoothing property, i.e., we can control higher order norms of the iterates um.

Lemma 13.14 (Smoothing property) Let S be defined by the damped Richardson iteration (13.22) with damping parameter ω ≥ λmax(A). Consider the iteration em+1 = S em. Then,

∥|em|∥_2 ≤ C m^{−1} ∥|e0|∥_0,   C = ω/e   (13.26)

Proof: A is assumed to be SPD. Let (w_k)_{k=1}^{N−1} be an orthonormal eigenbasis of A, with corresponding eigenvalues λ_k, k = 1…N−1. Expand e_0 = Σ_{k=1}^{N−1} ξ_k w_k. The simple structure of the smoother S allows us to write e_m = Σ_{k=1}^{N−1} (1 − λ_k/ω)^m ξ_k w_k. By assumption on ω we have 0 ≤ 1 − λ_k/ω ≤ 1. We have 68

    ∥|e_m|∥₂² = Σ_{k=1}^{N−1} λ_k² ((1 − λ_k/ω)^m ξ_k)² = ω² Σ_{k=1}^{N−1} (λ_k/ω)² ((1 − λ_k/ω)^m)² ξ_k²
              ≤ ω² max_{0≤θ≤1} (θ²(1−θ)^{2m}) Σ_{k=1}^{N−1} ξ_k² = ω² max_{0≤θ≤1} (θ²(1−θ)^{2m}) ∥|e_0|∥₀²

Elementary considerations reveal that the function θ ↦ θ(1−θ)^m attains its maximum over [0, 1] at θ = 1/(1+m). Hence,

    max_{0≤θ≤1} θ(1−θ)^m = (1/(1+m)) (1 − 1/(1+m))^m = (1/m) (1 − 1/(1+m))^{m+1} ≤ 1/(m e)

since (1 − 1/(1+m))^{1+m} ≤ e^{−1}. Thus,

    max_{0≤θ≤1} θ²(1−θ)^{2m} ≤ 1/(m² e²)

which concludes the proof.
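The smoothing property is easily observed numerically. The following sketch (again assuming the 1D model matrix A = (1/h)·tridiag(−1,2,−1)) compares ∥A S^m∥₂ with the bound C m^{−1}, C = ω/e, for the damped Richardson smoother with ω = λmax(A):

    import numpy as np

    N = 64
    h = 1.0 / N
    A = (1.0 / h) * (2 * np.eye(N - 1) - np.eye(N - 1, k=1) - np.eye(N - 1, k=-1))
    omega = np.linalg.eigvalsh(A).max()
    S = np.eye(N - 1) - A / omega              # damped Richardson smoother

    for m in (1, 2, 4, 8, 16):
        lhs = np.linalg.norm(A @ np.linalg.matrix_power(S, m), 2)   # ||A S^m||_2
        print(m, lhs, omega / (np.e * m))                           # bound C m^{-1}, C = omega/e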

Remark 13.15 The proof works in a similar way for other pairs of norms. E.g., instead of (13.26) we also have

    ∥|e_m|∥₂ ≤ C m^{−1/2} ∥|e_0|∥₁   (13.27)

with some constant C depending on ω.

67 The content of Lemma 13.13 is that, up to scaling by a certain power of h, the discrete norms ∥|·|∥_s correspond to classical Sobolev norms. Indeed, this is even true for s ∈ [0, 1]. For s > 1 this is no longer correct; however, for s > 1 these discrete norms share many properties of the 'corresponding' Sobolev norms. For more general elliptic equations there is also a close relationship to Sobolev norms; however, it is not so 'direct'; in particular, the elliptic form a(w, w) is not identical with |w|²_{H¹(Ω)} in general.

68 In the proof, identify λ_k/ω with the continuous variable θ ∈ [0, 1].


For the Poisson problem, the h-dependence of the smoothing Richardson iteration is hidden in the parameter ω. We have:

Lemma 13.16 Let Ω = (0, 1). Then there exists C > 0 such that

    C^{−1} h^{−1} ≤ λmax(A) ≤ C h^{−1},   C h ≤ λmin(A) ≤ C^{−1} h

Proof: The eigenvalues of h−1A have been explicitly computed for the model problem in Example 2.1,and we know that λmin(A) = O(h) and λmax(A) = O(h−1) .

A proof which is independent of the explicit knowledge of the spectrum would follow along the lines of the following typical 'FEM argument': From the Rayleigh quotient

    (Au, u)/(u, u) = ∥|u|∥₁² / ∥|u|∥₀² ∼ |u|²_{H¹(Ω)} / (h^{−1} ∥u∥²_{L²(Ω)})

we conclude, together with the inverse estimate 69 |u|_{H¹(Ω)} ≤ C h^{−1} ∥u∥_{L²(Ω)}, which is valid for all u ∈ U_N, and the Poincaré inequality 70 ∥w∥_{L²(Ω)} ≤ C |w|_{H¹(Ω)}, which is valid for all w ∈ H¹₀(Ω),

    λmax(A) = sup_{u∈R^{N−1}} (Au, u)/(u, u) ≤ sup_{u∈U_N} |u|²_{H¹(Ω)} / (h^{−1} ∥u∥²_{L²(Ω)}) ≤ C h^{−1}

    λmin(A) = inf_{u∈R^{N−1}} (Au, u)/(u, u) ≥ inf_{u∈U_N} |u|²_{H¹(Ω)} / (h^{−1} ∥u∥²_{L²(Ω)}) ≥ C h

This gives upper and lower bounds for λmax(A) and λmin(A), respectively.

Lemma 13.14 shows

    ∥A S^m∥₂ ≤ C m^{−1}   (13.28)

Due to ω ≥ λmax(A), the constant on the right hand side is large (C = O(h^{−1})), but it is accompanied by the 'damping factor' 71 m^{−1}.

Approximation property.

We consider the 'coarse grid Galerkin approximation' u_{N/2} ∈ U_{N/2} of some u_N ∈ U_N, defined by the Galerkin orthogonality condition (see Sec. 13.4)

    a(u_{N/2}, w) = a(u_N, w) ∀ w ∈ U_{N/2}   (13.29)

A good performance of the coarse grid correction requires u_{N/2} ≈ u_N for sufficiently smooth u_N. The following lemmata provide the basis for an estimate of the quality of this approximation. Lemma 13.17 is a version of the 'Aubin-Nitsche Lemma', which can be formulated in a more general setting (e.g., for smooth functions from H²(Ω), where Ω ⊂ R^d is sufficiently regular, like convex or with smooth boundary). Here we formulate a special, finite-dimensional 1D version of this lemma which is sufficient for our purpose. It provides an estimate of the L²-error of a Galerkin approximation in terms of its H¹-norm, with an additional (small) factor h.

69 For the 1D Poisson problem this corresponds to (Au, u) ≤ C h^{−1}(u, u).
70 The proof of the Poincaré inequality can, e.g., be found in any textbook on FEM, e.g., [3].
71 Note that a 'naive' estimate would, typically, only yield ∥A S^m∥₂ ≤ ∥A∥₂ ∥S^m∥₂ ≤ C h^{−1}, without the damping factor.


Lemma 13.17 (Aubin-Nitsche Trick) Let Ω = (0, 1). Consider u_N ∈ U_N and its coarse-grid Galerkin approximation u_{N/2} ∈ U_{N/2}. Then,

    ∥u_{N/2} − u_N∥_{L²(Ω)} ≤ C h |u_{N/2} − u_N|_{H¹(Ω)}   (13.30)

Proof: Let e = u_{N/2} − u_N denote the error of the Galerkin approximation. The idea of the proof is to define an auxiliary function z_N ∈ U_N as the solution of the variational problem with right hand side e,

    Find z_N ∈ U_N such that a(z_N, w) = (e, w)_{L²(Ω)} ∀ w ∈ U_N

Furthermore, consider an arbitrary z_{N/2} ∈ U_{N/2}. Due to the Galerkin orthogonality (13.29) satisfied by u_{N/2} we have a(z_{N/2}, e) = 0. Together with the definition of z_N we have (with the choice w = e)

    ∥e∥²_{L²(Ω)} = (e, e)_{L²(Ω)} = a(z_N, e) = a(z_N − z_{N/2}, e) ≤ |z_{N/2} − z_N|_{H¹(Ω)} |e|_{H¹(Ω)}   (13.31)

To bound the right hand side of (13.31) we wish to choose z_{N/2} such that the error z_{N/2} − z_N is sufficiently small. To this end it is sufficient to consider the piecewise linear interpolant z_{N/2} ∈ U_{N/2} of z_N; its interpolation error can be estimated by

    |z_{N/2} − z_N|_{H¹(Ω)} ≤ C h ∥e∥_{L²(Ω)}   (13.32)

under quite general circumstances. 72 (For the 1D model problem, see Exercise 13.18 for an elementary argument.) Together with (13.31) this leads us to

    ∥e∥²_{L²(Ω)} ≤ C h |e|_{H¹(Ω)} ∥e∥_{L²(Ω)}

which concludes the proof.

Exercise 13.18 Show that (13.32) is satisfied for the 1D Poisson problem.

Hint: In this case the proof can be realized in a purely algebraic way, by considering the coefficient vectors z_N, z_{N/2} of z_N, z_{N/2}. Piecewise linear interpolation of z_N corresponds to the identity z_{N/2} = I^T_{N,N/2} z_N. The desired estimate can now be concluded by comparing the 'interpolation error' (I_{N,N/2} I^T_{N,N/2} − I) z_N with e = A z_N. (In fact, the matrix (I_{N,N/2} I^T_{N,N/2} − I) describing the interpolation error 'projects' onto non-smooth components.)

To 'understand' this estimate observe that, by the definition of z_N, −z″_N = e in the weak sense.

The Aubin-Nitsche Lemma enables us to give an explicit estimate for the Galerkin error.

Lemma 13.19 Let Ω = (0, 1). Then there exists C > 0 such that

    |u_{N/2} − u_N|_{H¹(Ω)} ≤ C h^{1/2} ∥|u_N|∥₂   (13.33)

72 Apart from technical details, (13.32) appears to be natural: Essentially, it means that the first derivative of the error in linear interpolation can be bounded by h times something related to the second derivative of the interpoland. Note that z_N ∈ H²(Ω) since e is continuous.


Proof: We make use of the Galerkin orthogonality (13.29) for w = u_{N/2}, i.e., a(u_{N/2}, u_{N/2}) = a(u_N, u_{N/2}), to conclude

    |u_{N/2} − u_N|²_{H¹(Ω)} = a(u_{N/2} − u_N, u_{N/2} − u_N) = a(u_{N/2} − u_N, u_{N/2}) − a(u_{N/2} − u_N, u_N)
                            = −(I_{N,N/2} u_{N/2} − u_N, A u_N) ≤ ∥|I_{N,N/2} u_{N/2} − u_N|∥₀ ∥|u_N|∥₂

where the first term in the middle expression vanishes. Using Lemma 13.13 we recognize ∥|I_{N,N/2} u_{N/2} − u_N|∥₀ ∼ h^{−1/2} ∥u_{N/2} − u_N∥_{L²(Ω)}. Together with (13.30) we obtain

    |u_{N/2} − u_N|²_{H¹(Ω)} ≤ C h h^{−1/2} |u_{N/2} − u_N|_{H¹(Ω)} ∥|u_N|∥₂

which shows (13.33).

Convergence of the two-grid method.

Making use of the above estimates, there are different ways to prove convergence of the two-grid method. Theorem 13.20 states a convergence result in the energy norm, and the proof is formulated from a 'function viewpoint'. The proof of Thm. 13.21 below is more 'algebraic'.

The formulations apply to the 1D case, Ω = (0, 1) ; we again drop the index N in our notation.

Theorem 13.20 [Convergence of the two-grid method, (i)] Let u_0 be an initial guess, and let the damping parameter ω satisfy λmax(A) ≤ ω ≤ C′ λmax(A) for some C′. Let u_TG denote the approximation obtained by the two-grid method with m smoothing steps.

There exists a constant C > 0 independent of h and m such that

    ∥u_TG − u∗∥_A ≤ C m^{−1/2} ∥u_0 − u∗∥_A

In particular, the two-grid method converges if the number of smoothing steps is sufficiently large.

Proof: Due to the smoothing property (13.27), the error e_m = u_m − u∗ after m smoothing steps satisfies

    ∥|e_m|∥₂ ≤ C m^{−1/2} ∥|e_0|∥₁   (13.34)

Now we again switch to the 'function viewpoint': Via the Galerkin formulation (13.19), the coarse grid correction δ_{N/2} ∈ U_{N/2} is an approximation to −e_m, defined by

    a(δ_{N/2}, w) = a(−e_m, w) ∀ w ∈ U_{N/2}

The error of the corrected approximation u_TG = u_m + δ_{N/2} is

    e_TG = u_TG − u∗ = δ_{N/2} − (−e_m)

This is precisely the situation described by Lemma 13.19, with −e_m, δ_{N/2} playing the role of u_N, u_{N/2}. Thus, the approximation property (13.33) from Lemma 13.19 gives

    |e_TG|_{H¹(Ω)} = |δ_{N/2} + e_m|_{H¹(Ω)} ≤ C h^{1/2} ∥|e_m|∥₂


Together with (13.34) we obtain

    |e_TG|_{H¹(Ω)} ≤ C h^{1/2} m^{−1/2} ω^{1/2} ∥|e_0|∥₁

Now, using the assumption λmax(A) ≤ ω ≤ C′ λmax(A) and Lemma 13.16 gives ω ≤ C′ λmax(A) ≤ C′C h^{−1}, and thus,

    |e_TG|_{H¹(Ω)} ≤ C h^{1/2} h^{−1/2} m^{−1/2} ∥|e_0|∥₁ = C m^{−1/2} ∥|e_0|∥₁

Since 73 ∥e∥_{H¹(Ω)} ∼ ∥|e|∥₁ = ∥e∥_A, the proof is complete.

Theorem 13.20 tells us that the iteration matrix G_TG of the two-grid method satisfies ∥G_TG∥_A ≤ C m^{−1/2} with some constant C independent of h. An alternative way of proving the convergence of the two-grid method is more algebraic in nature. In this way we may, e.g., prove that the iteration matrix satisfies ∥G_TG∥₂ ≤ C m^{−1} in the spectral norm:

Theorem 13.21 [Convergence of the two-grid method, (ii)] Under the assumptions of Theorem 13.20, we also have

    ∥u_TG − u∗∥₂ ≤ C m^{−1} ∥u_0 − u∗∥₂

with some constant C > 0.

Proof: We use the splitting of G_TG already mentioned to estimate

    ∥G_TG∥₂ = ∥(I − I_{N,N/2} A^{−1}_{N/2} I^T_{N,N/2} A) S^m∥₂ ≤ ∥A^{−1} − I_{N,N/2} A^{−1}_{N/2} I^T_{N,N/2}∥₂ ∥A S^m∥₂

and estimate the two factors:

(i) The smoothing property

    ∥A S^m∥₂ ≤ C m^{−1}   (13.35)

follows directly from Lemma 13.14, see (13.28).

(ii) Now we prove the approximation property in the form of an estimate for the norm of the difference between A^{−1} and its two-grid Galerkin approximation,

    ∥A^{−1} − I_{N,N/2} A^{−1}_{N/2} I^T_{N,N/2}∥₂ ≤ C h   (13.36)

To this end we note that, for any vector z, the coarse-grid Galerkin approximation to the solution u of Au = z is given by u_{N/2} = I_{N,N/2} A^{−1}_{N/2} I^T_{N,N/2} z. In this context, estimate (13.33) from Lemma 13.19 applies and can be restated in the form

    ∥u − u_{N/2}∥_A ≤ C h^{1/2} ∥Au∥₂

Furthermore, estimate (13.30) from the Aubin-Nitsche 74 Lemma 13.17 is equivalent to

    ∥u − u_{N/2}∥₂ ≤ C h^{1/2} ∥u − u_{N/2}∥_A

73 Recall that for the 1D Poisson model these norms are even identical, see Lemma 13.13.
74 Note that ∥v∥_{L²(Ω)} ∼ h^{1/2} ∥v∥₂ (Lemma 13.13). Observe that the Aubin-Nitsche estimate is nontrivial; for general vectors v we only have ∥v∥₂ ≤ C h^{−1/2} ∥v∥_A.


This shows

    ∥A^{−1} z − I_{N,N/2} A^{−1}_{N/2} I^T_{N,N/2} z∥₂ ≤ C h ∥z∥₂

Since z was arbitrarily chosen, this gives precisely the desired approximation property (13.36).

Together with the assumption on ω, estimates (13.35) and (13.36) yield the assertion of the theorem (analogously as in the proof of Theorem 13.20).

Remark 13.22 These results for the two-grid method can be generalized, e.g., to general FEM discretizations of elliptic problems, with an appropriate choice of algorithmic components. It should also be mentioned that the undamped Gauss-Seidel method also has good smoothing properties, which are, however, more difficult to analyze.

Another look at TG for SPD systems. TG as a preconditioner for CG.

In the following exercises we study a two-grid (TG) scheme with 'symmetric smoothing', i.e., the smoothing procedure applied after the coarse grid correction is the adjoint of the initial smoother. In particular, in Exercise 13.24 we consider TG as a preconditioner for CG. It should be noted that the results of these exercises can be extended to general multigrid schemes of Galerkin type (see Sec. 13.6 below).

Exercise 13.23 Let A = A_N ∈ R^{N×N} be SPD. The iteration matrix (amplification matrix) of a TG scheme with symmetric pre- and post-smoothing has the form

    G_TG = (I − H^T A)(I − CA)(I − HA) =: I − TA   (13.37)

with a smoother S = I − HA and its adjoint I − H^T A, and C = P A_n^{−1} P^T, where P ∈ R^{N×n} is a prolongation matrix and 0 < A_n = P^T A P ∈ R^{n×n} is the Galerkin approximation of A on a coarser level, i.e., in a subspace V_n of dimension n < N. (This may result from our context of 'geometric multigrid' (GMG) as introduced above, where we identify vectors with functions. However, it can also be seen in the context of 'algebraic multigrid' (AMG), which works directly with the matrix-vector formulation.) We assume that P ∈ R^{N×n} has full rank n, and V_n = image(P). Then, PP^T ∈ R^{N×N} with image(PP^T) = V_n.

Show:

(i) The A-adjoint of (I − HA) is (I − HA)^A = (I − H^T A).

(ii) I − CA is A-selfadjoint, and the same is true for G_TG = I − TA.

(iii) I − CA is non-expansive (in the A-norm), i.e., ρ(I − CA) = ∥I − CA∥_A = 1; hence ∥G_TG∥_A = ∥I − TA∥_A < 1, provided ∥I − HA∥_A < 1.
(∥G_TG∥_A 'significantly < 1' requires an appropriate smoother 75 H.)

Hint: The subspace (coarse grid) correction is a Galerkin approximation. With the error e = u − u∗ for given u we have r = b − Au = −Ae. The Galerkin subspace (two-grid) correction is given by δ = −P A_n^{−1} P^T A e, and this is nothing but the A-best approximation in V_n to the 'exact correction' −e. Check once more this fact, i.e., check the Galerkin orthogonality relation (δ + e) ⊥_A V_n by evaluating the inner product (δ + e, PP^T y)_A for arbitrary vectors y ∈ R^N. Conclude that I − CA is non-expansive (Pythagoras).

75 Besides the damped Jacobi method, SOR(1), i.e., undamped Gauss-Seidel, is also a good smoother for elliptic problems. Therefore you may, e.g., identify H with forward Gauss-Seidel and H^T with backward Gauss-Seidel; for damped Jacobi, H = H^T is a multiple of D^{−1}, D = diag(A).


(iv) Show that the Galerkin approximation operator

    CA = P A_n^{−1} P^T A

is the A-orthogonal projector onto V_n. (This property is equivalent to Galerkin (A-) orthogonality.)

Remark: The matrix PP^T is not a projector onto V_n. In the simple 1D case, for instance, P = I_{N,N/2} is associated with linear interpolation, and P^T = I_{N/2,N} is a local weighting operator. Thus, PP^T v ≠ v in general.

Exercise 13.24 Let A be SPD and assume that the TG amplification matrix G_TG = I − TA from (13.37) is contractive, i.e., ρ(G_TG) = ρ(I − TA) < 1 (see Exercise 13.23).

Show: The preconditioner T defined in this way is SPD, as required for CG preconditioning.

Hint: From Exercise 13.23, (ii) we see that T is symmetric, and ρ(I − TA) = ∥I − TA∥_A < 1 (explain). The desired property T > 0 then follows by means of a spectral argument.

Remark 13.25 Algorithmically, TG as a preconditioner for CG is realized as follows:

For a given iterate u, with error e = u − u∗ and residual r = b − Au = −Ae, we wish to approximate the 'exact correction' −e, which is the solution of the system Aε = r. To this end we approximate this solution by a TG step applied to Aε = r starting with ε_0 = 0, i.e. (in the notation from (13.37))

    ε = ε_0 + T (r − Aε_0) = T r ≈ −e

This means that T plays exactly the role of the preconditioner M^{−1}, as expected. If you use Matlab / pcg, specify a function MFUN(r) which performs such a TG step. The result is the preconditioned residual ε = T r ≈ −e.
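The same idea carries over to other PCG implementations. The following is a minimal sketch (not from the script) of wrapping one TG/MG cycle as a preconditioner for SciPy's CG; mg_cycle(eps0, r) is an assumed user routine performing one cycle for the system Aε = r with initial guess eps0 (cf. Alg. 13.2), and all names are illustrative only:

    import numpy as np
    import scipy.sparse.linalg as spla

    def make_mg_preconditioner(A, mg_cycle):
        n = A.shape[0]
        # preconditioner maps a residual r to T r = one MG cycle for A*eps = r, eps0 = 0;
        # this approximates the exact correction -e = A^{-1} r
        apply_T = lambda r: mg_cycle(np.zeros(n), r)
        return spla.LinearOperator((n, n), matvec=apply_T)

    # usage (A sparse SPD, mg_cycle as assumed above):
    #   M = make_mg_preconditioner(A, mg_cycle)
    #   u, info = spla.cg(A, b, M=M)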

13.6 Multigrid (MG)

Multigrid in 1D.

The two-grid method is mainly of theoretical interest, since implementation of Alg. 13.1 requires the exact solution of a problem of size N/2 in each step of the iteration. As we have discovered in Exercise 13.10, for the 1D Poisson model the Galerkin coarse grid approximation I^T_{N,N/2} A_N I_{N,N/2} is actually identical to 76 A_{N/2}.

The coarse grid problem to be solved is of the same type as the original problem. This suggests to proceed in a recursive fashion: Instead of solving exactly on the coarse grid, we treat it just like the fine grid problem by performing some smoothing steps and then move to an even coarser grid. In this way we proceed further until the coarse grid problem is sufficiently small to be solved by a direct method.

Assuming that the initial problem size satisfies

    N = 2^L

for some L ∈ N, this idea leads to Alg. 13.2, which realizes one 'cycle' (i.e., one iteration step) of a basic MG algorithm. It is called with some initial guess u_0; the numbers of 'pre-smoothing' steps mpre and of optional 'post-smoothing' steps mpost are up to the user. (Note that 0 is the natural initial choice for the correction δ at all coarser levels.)


Algorithm 13.2 Basic MG cycle (1D)

% calling sequence: u = MG(u0, b, N); input: initial guess u0; output: approximation u

1: if N is sufficiently small, compute u = A_N^{−1} b
2: else
3:   do mpre steps of smoothing (e.g., damped Jacobi) with initial guess u0 to obtain um
4:   rm = b − A_N um
5:   δ = MG(0, R_{N/2,N} rm, N/2)   % solve for correction recursively with initial guess 0
6:   u = um + P_{N,N/2} δ
7:   do mpost steps of smoothing (damped Jacobi) with initial guess u to obtain u
8: end if
9: return u
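For concreteness, the following is a compact NumPy sketch of Alg. 13.2 for the 1D model problem. It assumes the scaling A_N = (1/h)·tridiag(−1,2,−1) of size N−1 used in this section, linear-interpolation prolongation P (so that the Galerkin identity P^T A_N P = A_{N/2} holds) and restriction R = P^T; all function names are illustrative:

    import numpy as np

    def poisson_1d(N):
        h = 1.0 / N
        return (1.0 / h) * (2 * np.eye(N - 1) - np.eye(N - 1, k=1) - np.eye(N - 1, k=-1))

    def prolongation(N):               # maps level N/2 (N/2-1 points) to level N (N-1 points)
        P = np.zeros((N - 1, N // 2 - 1))
        for j in range(N // 2 - 1):
            P[2 * j, j] = 0.5
            P[2 * j + 1, j] = 1.0
            P[2 * j + 2, j] = 0.5
        return P

    def jacobi(A, u, b, m, omega=0.5): # m steps of damped Jacobi
        D = np.diag(A)
        for _ in range(m):
            u = u + omega * (b - A @ u) / D
        return u

    def mg(u0, b, N, m_pre=3, m_post=0):
        A = poisson_1d(N)
        if N <= 4:                      # coarsest level: solve exactly
            return np.linalg.solve(A, b)
        u = jacobi(A, u0, b, m_pre)
        r = b - A @ u
        P = prolongation(N)
        delta = mg(np.zeros(N // 2 - 1), P.T @ r, N // 2, m_pre, m_post)
        u = u + P @ delta
        return jacobi(A, u, b, m_post)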

[Figure 13.35: Convergence behavior of basic MG for h = 1/8, 1/16, 1/32, 1/64; error vs. number of iterations. Left: maximum nodal error ∥u_m − u∥_∞; right: maximum of residual ∥r_m∥_∞.]

As discussed above, in the 1D case the natural 'Galerkin choice' for the restriction and prolongation operators is

    R_{N/2,N} = I^T_{N,N/2},   P_{N,N/2} = I_{N,N/2}

Example 13.26 We illustrate the convergence behavior of the MG algorithm in Fig. 13.35. The plot shows the performance of the iteration u_{i+1} = MG(u_i, b, N) for different values of h = 1/N. Here, mpre = 3 and mpost = 0 were used.

The convergence behavior of the MG method in Example 13.26 is quite satisfactory. The following exercise shows that the complexity of the algorithm is also optimal, i.e., one cycle of the MG algorithm (i.e., one call of MG) has complexity O(N).

Exercise 13.27 Denote by C_MG(N) the cost of one MG cycle (Alg. 13.2) called with problem size N = 2^L.

Show: C_MG(N) ≤ C_MG(N/2) + c (1 + mpre + mpost) N for some constant c > 0. Conclude that the cost of a complete cycle is O(N).

76 This is the main reason for our choice of scaling in the beginning of Section 13.


[Figure 13.36: Three V-cycles (schematic; finest mesh at the top, coarsest mesh at the bottom).]

Multigrid in more generality.

Alg. 13.2 is formulated for the 1D model problem. A more general view, which is also applicable to problems in higher dimensions, is the following. We stick to a FEM-like terminology oriented towards elliptic problems. Suppose we are given a sequence of meshes T_l, l = 0, 1, …, with corresponding approximation spaces (in the standard case, spaces of piecewise linear functions). For simplicity we assume:

(i) The mesh size h_l of mesh T_l satisfies h_l ∼ 2^{−l}.

(ii) The spaces are nested: U_l ⊂ U_{l+1} for l = 0, 1, … We write N_l = dim U_l.

The spaces U_l are spanned by bases (e.g., the piecewise linear 'hat functions'); the natural embedding U_l ⊂ U_{l+1} then corresponds to a prolongation operator (matrix) P_{l+1,l} ∈ R^{N_{l+1}×N_l}. Its transpose R_{l,l+1} = P^T_{l+1,l} is the restriction operator. By A_l we denote the stiffness matrix arising from the underlying bilinear form a(·,·) and the choice of basis for the space U_l. In simple standard situations, P_{l+1,l} can be chosen in such a way that the Galerkin identity

    A_{l−1} = P^T_{l,l−1} A_l P_{l,l−1}

remains valid. We have seen that this facilitates the convergence analysis, but the formulation of the MG algorithm does not depend on this property; it is also not valid or desirable in all applications. Also a choice of the restriction operator R_{l−1,l} ≠ P^T_{l,l−1} may be reasonable.

The basic MG algorithm 13.2 is reformulated as Alg. 13.3. Note that the assumption h_l ∼ 2^{−l} guarantees that N_l ∼ 2^{l d}, where d ∈ N is the spatial dimension.

Alg. 13.3 is called the V-cycle; see Fig. 13.36. Instead of solving exactly on the coarser level, which corresponds to the two-grid algorithm, a single approximate solution step is performed by a recursive call at each level l.

Error amplification operator of the V-cycle.

For the convergence analysis, a MG cycle is interpreted in a recursive way as a perturbed TG cycle. For the TG cycle on level l, let us denote the iteration matrix (error amplification operator) by

    G^{TG}_l = S^{(post)}_l (I_l − C^{TG}_l A_l) S^{(pre)}_l,   C^{TG}_l = P_{l,l−1} A^{−1}_{l−1} R_{l−1,l}   (13.38)


Algorithm 13.3 Basic MG ('V-cycle')

% calling sequence: u = MG(u0, b, l); input: initial guess u0; output: approximation u

1: if level l is sufficiently small, compute u = A_l^{−1} b
2: else
3:   do mpre steps of smoothing (damped Jacobi) with initial guess u0 to obtain um
4:   rm = b − A_l um
5:   δ = MG(0, R_{l−1,l} rm, l−1)   % solve for correction recursively with initial guess 0
6:   u = um + P_{l,l−1} δ
7:   do mpost steps of smoothing (damped Jacobi) with initial guess u to obtain u
8: end if
9: return u

In the level-l MG version of the V-cycle, A^{−1}_{l−1} is replaced by its level-(l−1) MG approximation, which we denote by N^{(l−1)}_{l−1}. For the resulting MG analog of (13.38) we write

    G^{(l)}_l = S^{(post)}_l (I_l − C^{(l)}_l A_l) S^{(pre)}_l,   C^{(l)}_l = P_{l,l−1} N^{(l−1)}_{l−1} R_{l−1,l}   (13.39)

Here, the (approximate inverse) operator N^{(l−1)}_{l−1} is related to the corresponding level-(l−1) MG amplification operator by

    G^{(l−1)}_{l−1} = I_{l−1} − N^{(l−1)}_{l−1} A_{l−1},   i.e.,   N^{(l−1)}_{l−1} = A^{−1}_{l−1} − G^{(l−1)}_{l−1} A^{−1}_{l−1}

Thus, (13.39) can be written as

    G^{(l)}_l = S^{(post)}_l (I_l − C^{TG}_l A_l) S^{(pre)}_l + S^{(post)}_l P_{l,l−1} G^{(l−1)}_{l−1} A^{−1}_{l−1} R_{l−1,l} A_l S^{(pre)}_l   (13.40)

where the first term equals G^{TG}_l. Assuming ∥S^{(post)}_l∥ < 1, ∥S^{(pre)}_l∥ < 1, ∥P_{l,l−1}∥ ≤ C, and

    ∥A^{−1}_{l−1} R_{l−1,l} A_l S^{(pre)}_l∥ ≤ C,   C a uniform constant (typically ≥ 1),   (13.41)

this gives the recursive estimate

    ∥G^{(l)}_l∥ ≤ ∥G^{TG}_l∥ + C ∥G^{(l−1)}_{l−1}∥   (13.42)

with some constant C. With the abbreviations

    κ^{TG}_l = ∥G^{TG}_l∥,   κ^{(l)}_l = ∥G^{(l)}_l∥

this gives the recursion

    κ^{(2)}_2 = κ^{TG}_2,   κ^{(l)}_l ≤ κ^{TG}_l + C κ^{(l−1)}_{l−1},   l = 3, 4, …   (13.43)

and we see that, even for κ^{TG}_l < 1 on all levels l (i.e., uniform contractivity of the TG cycle), it is not possible to derive uniform contraction bounds for the V-cycle. Actually, existing convergence proofs for the V-cycle rely on a more or less explicit representation for G^{(l)}_l, which is rather cumbersome.

The W-cycle.

Viewing MG as a (linear) iteration scheme, it is natural to attempt to improve the approximation by repeated recursive calls. This leads to the so-called µ-cycle formulated in Alg. 13.4. The case µ = 1 corresponds to the V-cycle (Alg. 13.3); the case µ = 2 leads to the so-called W-cycle, visualized in Fig. 13.37.

[Figure 13.37: W-cycle (schematic).]

We can now modify the above recursive representation for the case of the µ-cycle. Again we start with the normal TG cycle (13.38): 77

    G^{TG}_l = S^{(post)}_l (I_l − C^{TG}_l A_l) S^{(pre)}_l,   C^{TG}_l = P_{l,l−1} A^{−1}_{l−1} R_{l−1,l}   (13.44)

In the level-l MG version of the µ-cycle, A^{−1}_{l−1} is replaced by its level-(l−1) µ-cycle approximation N^{(l−1)}_{l−1}. For the resulting µ-cycle MG analog of (13.44) we write

    G^{(l)}_l = S^{(post)}_l (I_l − C^{(l)}_l A_l) S^{(pre)}_l,   C^{(l)}_l = P_{l,l−1} N^{(l−1)}_{l−1} R_{l−1,l}   (13.45)

Now, the (approximate inverse) operator N^{(l−1)}_{l−1} is related to the µ-fold application of the corresponding level-(l−1) MG amplification operator:

    (G^{(l−1)}_{l−1})^µ = I_{l−1} − N^{(l−1)}_{l−1} A_{l−1},   i.e.,   N^{(l−1)}_{l−1} = A^{−1}_{l−1} − (G^{(l−1)}_{l−1})^µ A^{−1}_{l−1}

Thus, (13.45) can be written as

    G^{(l)}_l = S^{(post)}_l (I_l − C^{TG}_l A_l) S^{(pre)}_l + S^{(post)}_l P_{l,l−1} (G^{(l−1)}_{l−1})^µ A^{−1}_{l−1} R_{l−1,l} A_l S^{(pre)}_l   (13.46)

where the first term again equals G^{TG}_l. In a similar way as for the V-cycle, this gives the recursive estimate

    κ^{(2)}_2 = κ^{TG}_2,   κ^{(l)}_l ≤ κ^{TG}_l + C (κ^{(l−1)}_{l−1})^µ,   l = 3, 4, …   (13.47)

for the contraction rate of the µ-cycle.

Exercise 13.28 Assume µ = 2 and κ^{TG}_l ≤ ρ ≤ 1/(4C) on all levels l, where C ≥ 1 is the constant appearing in (13.47). (Thus, ρ ≤ 1/4 is necessarily assumed.) Show: The W-cycle contraction rate κ^{(l)}_l can be uniformly bounded by

    κ^{(l)}_l ≤ (1 − √(1 − 4Cρ)) / (2C) ≤ 2ρ ≤ 1/2

on all levels l = 2, 3, 4, …

77 This is slightly simplified: According to Alg. 13.4, we would start with µ TG coarse grid corrections; the difference between these versions is not essential.


Hint: The sequence (κ^{(l)}_l) is majorized by the monotonically increasing sequence defined by

    ξ_2 = ρ,   ξ_l = ρ + C ξ²_{l−1},   l = 3, 4, …

Consider the latter as a fixed point iteration.
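A quick numerical illustration of this fixed-point argument (the values of C and ρ are purely illustrative examples satisfying 4Cρ ≤ 1):

    # iterate xi_l = rho + C*xi_{l-1}^2 and compare with the claimed uniform bound
    C, rho = 1.5, 0.12                              # example values with 4*C*rho <= 1
    xi = rho
    bound = (1 - (1 - 4 * C * rho) ** 0.5) / (2 * C)
    for l in range(3, 12):
        xi = rho + C * xi ** 2
        print(l, xi, bound, 2 * rho)                # xi increases towards 'bound' <= 2*rho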

Algorithm 13.4 Multigrid (µ-cycle)

% calling sequence: u = MG(u0, b, l, µ); input: initial guess u0; output: approximation u
% µ = 1 → V-cycle; µ = 2 → W-cycle

1: if level l is sufficiently small, compute u = A_l^{−1} b
2: else
3:   do mpre steps of smoothing (damped Jacobi) with initial guess u0 to obtain um
4:   rm = b − A_l um
5:   δ^(0) = 0
6:   for ν = 1 to µ do δ^(ν) = MG(δ^(ν−1), R_{l−1,l} rm, l−1, µ)
7:   u = um + P_{l,l−1} δ^(µ)
8:   do mpost steps of smoothing (damped Jacobi) with initial guess u to obtain u
9: end if
10: return u
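A sketch of the corresponding modification of the earlier 1D V-cycle code (reusing the assumed helpers poisson_1d, prolongation and jacobi from that sketch): only the recursive correction step changes, the coarse-level correction now being improved by µ recursive calls.

    import numpy as np

    def mg_mu(u0, b, N, mu=2, m_pre=3, m_post=0):
        A = poisson_1d(N)                     # helpers as in the earlier V-cycle sketch
        if N <= 4:
            return np.linalg.solve(A, b)
        u = jacobi(A, u0, b, m_pre)
        r = b - A @ u
        P = prolongation(N)
        delta = np.zeros(N // 2 - 1)
        for _ in range(mu):                   # mu = 1: V-cycle, mu = 2: W-cycle
            delta = mg_mu(delta, P.T @ r, N // 2, mu, m_pre, m_post)
        u = u + P @ delta
        return jacobi(A, u, b, m_post)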

On the basis of the recursion (13.47), Exercise 13.28 provides a rather general convergence argument for the W-cycle. Still, we have assumed that (13.41) holds, i.e.,

    ∥A^{−1}_{l−1} R_{l−1,l} A_l S_l∥ ≤ C

which looks quite natural but needs to be argued. Direct evaluation for the 1D and 2D Poisson examples shows that, with respect to the energy norms involved, C is indeed a moderate-sized uniform constant ≥ 1. In the Galerkin context one may think of rewriting this as

    A^{−1}_{l−1} R_{l−1,l} A_l S_l = (A^{−1}_{l−1} R_{l−1,l} A_l P_{l,l−1}) R_{l−1,l} S_l + A^{−1}_{l−1} R_{l−1,l} A_l (I_l − P_{l,l−1} R_{l−1,l}) S_l

where the first factor in parentheses equals I_{l−1}, and try to estimate (in norm) the second term on the right hand side, which is still not straightforward. 78

Example 13.29 The two-grid method and the MG methods are linear iterations. For the 1D model problem, Table 13.4 shows estimates for the corresponding contraction rates in the energy norm, obtained by numerical experiment: We observe the error reduction from step 14 to step 15 of MG for a random initial vector u_0, with mpre = m presmoothing steps and mpost = 0 postsmoothing steps. As is to be expected, the two-grid method has the best contraction rate. The W-cycle (µ = 2) is very close to the two-grid method. We note in particular that the contraction rate is rather small even for µ = 1 (V-cycle).

78 In the literature, different versions of convergence arguments are given for the W-cycle (and also the V-cycle), mainlyin the FEM context.


V-cycle
  N      2^3    2^4    2^5    2^6    2^7    2^8    2^9    2^10
  m = 1  0.333  0.330  0.327  0.323  0.320  0.320  0.313  0.312
  m = 2  0.156  0.175  0.191  0.198  0.203  0.205  0.207  0.207
  m = 3  0.089  0.105  0.118  0.127  0.131  0.134  0.136  0.138

W-cycle
  N      2^3    2^4    2^5    2^6    2^7    2^8    2^9    2^10
  m = 1  0.333  0.329  0.312  0.320  0.330  0.329  0.328  0.327
  m = 2  0.116  0.116  0.114  0.116  0.116  0.115  0.115  0.114
  m = 3  0.073  0.077  0.077  0.078  0.076  0.077  0.077  0.077

Two-grid method
  N      2^3    2^4    2^5    2^6    2^7    2^8    2^9    2^10
  m = 1  0.333  0.314  0.326  0.321  0.329  0.326  0.327  0.325
  m = 2  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111
  m = 3  0.074  0.076  0.076  0.078  0.078  0.078  0.078  0.077

Table 13.4: Estimated contraction rates for V-cycle (µ = 1), W-cycle (µ = 2), and two-grid method.

13.7 Nested Iteration and Full Multigrid (FMG)

One of the basic questions of iterative solution techniques is finding a good starting vector. Nested iteration is a general technique for this. The basic idea is the following: A good starting vector u_0 for the fine grid problem A_N u = b_N might be the appropriately prolongated solution of a coarse grid problem. Since the coarse grid problem cannot be solved exactly either, we solve it iteratively and need a good starting vector for that iteration as well. Effectively, we start at the coarsest level, where an exact solution is available; then, we prolongate this solution to a finer level where an approximate solution technique such as MG can be used; the approximate solution obtained in this way is transferred to the next finer grid as a starting point for a MG iteration, etc., until we reach the finest level with a good initial guess for the final MG cycle.

[Figure 13.38: Full Multigrid (FMG) — schematic of one FMG pass.]


  N           2^4  2^5  2^6  2^7  2^8  2^9  2^10  2^11  2^12  2^13  2^14  2^15  2^16  2^17  2^18  2^19
  t_FMG/t_MG  3.3  3.2  3.9  4.6  5.0  4.6  4.8   4.3   4.0   3.4   2.9   2.6   2.5   2.5   2.5   2.5

Table 13.5: Ratio of CPU-time per iteration of FMG vs. MG (mpre = 3).

In Fig. 13.38, a single pass of this procedure is visualized (think of starting from the left, with the exact solution at the coarsest level, and 'going up'). But such a single pass does not necessarily yield a sufficiently accurate result, and the process is again iterated. To this end we use a recursive approach: For an intermediate approximation on level l, we compute the new approximation by coarse grid correction using FMG on level l−1, see Alg. 13.5, with 0 as the natural initial guess for the correction. In general, one pass of the FMG procedure uses µ′ MG cycles of the type µ-cycle (Alg. 13.4) at each level l.

Algorithm 13.5 Full Multigrid

% calling sequence: u = FMG(u0, b, l); input: initial guess u0; output: approximation u
% µ, µ′ ≥ 1 given

1: if l is sufficiently small, compute u = A_l^{−1} b
2: else
3:   rl = b − A_l u0
4:   δ = FMG(0, R_{l−1,l} rl, l−1)
5:   u^(0) = u0 + P_{l,l−1} δ
6:   for ν = 1 to µ′ do u^(ν) = MG(u^(ν−1), b, l, µ)
7:   u = u^(µ′)
8: end if
9: return u
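A sketch of one FMG pass for the 1D model, again reusing the assumed helpers poisson_1d, prolongation, jacobi and the µ-cycle routine mg_mu from the earlier sketches:

    import numpy as np

    def fmg(u0, b, N, mu=1, mu_prime=1):
        A = poisson_1d(N)
        if N <= 4:
            return np.linalg.solve(A, b)
        r = b - A @ u0
        P = prolongation(N)
        delta = fmg(np.zeros(N // 2 - 1), P.T @ r, N // 2, mu, mu_prime)
        u = u0 + P @ delta
        for _ in range(mu_prime):             # mu' additional MG cycles on the current level
            u = mg_mu(u, b, N, mu)
        return u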

Unless some initial approximation u_0 is available, the process is initiated by calling FMG(0, b, l), i.e., u_0 = 0 with initial residual b. Thus, FMG(0, R_{l−1,l} b, l−1) is called in order to obtain a good initial solution. Due to the recursion, this means that we 'go up' from the bottom level with FMG, interpolation and µ′ additional MG steps to obtain a first approximation u_1 on level l, with residual r_l = b − A u_1. In the second iteration, FMG(u_1, b, l) calls FMG(0, R_{l−1,l} r_l, l−1), and again we go up from the bottom level with FMG, interpolation and µ′ additional MG steps to obtain the next approximation u_2 on level l, etc.

Example 13.30 We illustrate the performance of the FMG algorithm for the 1D model problem in Fig. 13.39, i.e., we plot the errors ∥u_m − u∥_∞ and the energy norm errors ∥u_m − u∥_{A_L} of the iteration u_{m+1} = FMG(u_m, f, L) for different values of L (with N = 2^L). We note the considerable performance improvement due to the good initial guesses. Here, mpre = 3, mpost = 0 and µ = µ′ = 1.

The FMG method is of course more expensive. However, the cost of one cycle of FMG is still O(N), as for the standard MG cycle (Exercise 13.31). This is illustrated in Fig. 13.39.

Exercise 13.31 For problems of size N = 2^L denote by C_FMG(N) the cost (e.g., number of floating point operations) of one call of FMG(u, f, N). Show for the case d = 1 and µ = µ′ = 1 that for the model problem with N = 2^L the complexity of FMG is C_FMG(N) ≤ C N. In fact, the ratio C_FMG(N)/C_MG(N) should asymptotically be 2, as is illustrated in Table 13.5.


[Figure 13.39: Convergence behavior of FMG compared with MG for N = 2^10, 2^15, 2^20. Left: energy norm error vs. number of iterations; right: maximum nodal error vs. CPU time.]

Optimal convergence properties of FMG.

The numerical evidence of Example 13.30 shows that the advantage of FMG becomes more pronounced as the problem size N increases. The following Theorem 13.32 is one way of formalizing this observation. We consider again a sequence of meshes T_l with mesh sizes h_l ∼ 2^{−l}. We assume that the exact (Galerkin) approximation u_l ∈ U_l on level l to the exact solution u∗ of the original (PDE) problem satisfies (typically, in the energy norm ∥·∥ = ∥·∥_A)

    ∥u_l − u∗∥ ≤ K h^p_l   (13.48)

for some p > 0 and all l.

Theorem 13.32 Let c = sup_l h_{l−1}/h_l, and let κ = κ(µ) be the contraction rate of the multigrid µ-cycle. Let p be as in (13.48), and let µ′ be the number of µ-cycles used in the FMG algorithm 13.5. Assume c^p κ^{µ′} < 1. Then there exists a constant C′ > 0 such that one cycle of FMG results in an approximation ũ_l ∈ U_l which satisfies

    ∥ũ_l − u∗∥ ≤ C′ K h^p_l

Proof: We proceed by induction on l. For the approximations ũ_l (obtained by FMG) to the exact Galerkin solutions u_l on level l, we denote the 'algebraic' FMG error on level l by e_l = ũ_l − u_l. Clearly, e_0 = 0. FMG on level l consists of µ′ steps of classical MG (with contraction rate κ = κ(µ)), applied to an initial error ũ_{l−1} − u_l. Hence,

    ∥e_l∥ ≤ κ^{µ′} ∥ũ_{l−1} − u_l∥ ≤ κ^{µ′} (∥ũ_{l−1} − u_{l−1}∥ + ∥u_{l−1} − u_l∥)
          ≤ κ^{µ′} (∥ũ_{l−1} − u_{l−1}∥ + ∥u_{l−1} − u∗∥ + ∥u_l − u∗∥)
          ≤ κ^{µ′} (∥e_{l−1}∥ + K h^p_{l−1} + K h^p_l) ≤ κ^{µ′} (∥e_{l−1}∥ + (1 + c^p) K h^p_l)

Iterating this inequality, we obtain with e_0 = 0:

    ∥e_l∥ ≤ K (1 + c^p) κ^{µ′} (h^p_l + κ^{µ′} h^p_{l−1} + κ^{2µ′} h^p_{l−2} + … + κ^{lµ′} h^p_0)

The definition of c implies h_{l−i} ≤ c^i h_l. Hence, together with the assumption c^p κ^{µ′} < 1, we obtain

    ∥e_l∥ ≤ K (1 + c^p) κ^{µ′} h^p_l (1 + κ^{µ′} c^p + κ^{2µ′} c^{2p} + … + κ^{lµ′} c^{lp}) ≤ (K (1 + c^p) / (1 − κ^{µ′} c^p)) κ^{µ′} h^p_l


[Figure 13.40: Convergence behavior of MG and FMG (energy norm error vs. number of iterations) for µ = 1, µ′ = 1, mpost = 0, for N = 2^10, 2^15, 2^20. Left: mpre = 1. Right: mpre = 3. For comparison, the discretization errors (in energy norm) for l = 10, 15, 20 are: e_10 = 2.82·10^{−4}, e_15 = 8.81·10^{−6}, e_20 = 2.75·10^{−7}.]

Thus,

    ∥ũ_l − u∗∥ ≤ ∥e_l∥ + ∥u_l − u∗∥ ≤ C′ K h^p_l

with the appropriate constant C′.

Theorem 13.32 implies the following observations:

• Consider FMG for linear systems arising after discretization of elliptic boundary value problems. The exact solutions u_l ∈ U_l are approximations to the unknown solution u∗. Therefore, it suffices to obtain approximations ũ_l ≈ u_l up to the discretization error u_l − u∗ ∼ h^p_l. Theorem 13.32 shows that this can be achieved with one cycle of FMG. The cost for one cycle of FMG is O(N_L) by Exercise 13.31.

• Standard MG starts with the initial guess 0. Hence, with a level-independent contraction rate ρ < 1, the error after m steps of MG is O(ρ^m). Thus, to reach the level of the discretization error u_L − u∗ ∼ h^p_L, one needs O(|log h^p_L|) = O(L) steps. In terms of the problem size N_L ∼ h_L^{−d}, we need O(log N_L) steps; the total cost is therefore O(N_L log N_L).

• The condition c^p κ^{µ′} < 1 could, in principle, be enforced by increasing µ′ (note: µ′ enters only linearly in the complexity estimates for FMG), or by increasing the number of smoothing steps (recall the analysis of the two-grid method in Theorem 13.20).

Example 13.33 In Fig. 13.40 we demonstrate the behavior of the MG method compared with the FMG method for the 1D model problem −u″ = 1. The left figure shows the case mpre = 1, whereas the right figure shows the case mpre = 3. We note that FMG with mpre = 1 attains an approximation at the level of the discretization error after a single pass.


13.8 Nonlinear problems

Like other stationary iterative schemes, multigrid is not a priori a 'linear technique'. It can be adapted to nonlinear problems, with some reorganization concerning the correction steps, where the nonlinearity is taken into account.

Full approximation scheme (FAS).

The basic idea is rather general. Consider a system of equations in the form

    A_N(u) = b_N   (13.49)

with a nonlinear mapping A_N : R^N → R^N. We assume that the problem is well-posed (with, at least, a locally unique solution) and consider a (simpler) approximation Ã_N(u) ≈ A_N(u). The mapping Ã_N is an (in general, also nonlinear) preconditioner for A_N.

Assume that a first guess u_0 for the solution u∗ of (13.49) is available. For Ã_N(u) ≈ A_N(u) we may expect

    Ã_N(u∗) − Ã_N(u_0) ≈ A_N(u∗) − A_N(u_0) = b_N − A_N(u_0) = r_0   (13.50)

Here, the right hand side is the residual r_0 of u_0 w.r.t. (13.49). This suggests to compute a corrected approximation u_1 as the solution of

    Ã_N(u_1) = Ã_N(u_0) + r_0 = b_N + Ã_N(u_0) − A_N(u_0)   (13.51)

If Ã_N is linear, this can be written as the usual 'correction scheme' familiar from stationary iteration methods, where the correction δ is a linear image of the residual,

    Ã_N δ = r_0,   u_1 = u_0 + δ   (13.52)

In general, (13.51) is solved directly for the new approximation u_1. This is called a 'full approximation scheme' (FAS).

In the context of multigrid, Ã_N is defined via a coarse grid approximation A_n of A_N, of dimension n < N, and we have to be careful concerning the intergrid transfer. Assume that u_0 is an appropriately smoothed approximation on the finer grid. With our usual notation for the restriction and prolongation operators, the nonlinear FAS-type two-grid method is:

1. Restrict u_0 to the coarser space: ū_0 = R_{n,N} u_0

2. Compute the residual and restrict it to the coarser space: r̄_0 = R_{n,N} r_0 = R_{n,N} (b_N − A_N(u_0))

3. Solve A_n(v) = A_n(ū_0) + r̄_0

4. Compute the coarse grid correction: δ = v − ū_0

5. Prolongate the correction δ and add it to u_0, i.e., compute u_1 = u_0 + P_{N,n} δ

Note that the coarse grid solution v is not directly prolongated, but the coarse-grid correction δ obtainedby the FAS step is prolongated and added to u0 . In this way, the two-grid FAS scheme becomes equivalentto our two-grid ‘correction scheme’ (CS) formulation for the linear case.
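A generic sketch of this FAS two-grid correction (steps 1–5 above); A_fine and A_coarse are assumed callables for A_N(·) and A_n(·), coarse_solve an assumed nonlinear solver on the coarse level (e.g., of Newton type), and R, P the restriction and prolongation matrices:

    import numpy as np

    def fas_two_grid_correction(u0, b, A_fine, A_coarse, coarse_solve, R, P):
        u0_c = R @ u0                               # 1. restrict the current approximation
        r0_c = R @ (b - A_fine(u0))                 # 2. restrict the residual
        rhs_c = A_coarse(u0_c) + r0_c
        v_c = coarse_solve(A_coarse, rhs_c, u0_c)   # 3. solve A_n(v) = A_n(u0_c) + r0_c
        delta_c = v_c - u0_c                        # 4. coarse-grid correction
        return u0 + P @ delta_c                     # 5. prolongate and add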

The generalization to FAS-type multigrid is straightforward (V-cycle, µ -cycle, FMG). Note that themultigrid version requires the solution of a nonlinear problem only on the coarsest grid, e.g., by a procedureof Newton type. More implementation details and examples for applying FAS-type multigrid to nonlinearboundary value problems can be found in [4].


Exercise 13.34 Formulate the nonlinear FAS-type two-grid scheme in detail for a standard finite difference discretization of

    −u″ + ϕ(u) u = f(x)

with Dirichlet boundary conditions. (This problem is well-posed for ϕ(u) ≥ 0.)

The smoothing procedure can usually be chosen in a similar way as for the related linear(ized) problem.

Example 13.35 For the problem from Exercise 13.34, for instance, discretization leads to an algebraic system of the form

    (A_N + Φ_N(u)) u = b_N

with a nonlinear diagonal mapping Φ_N(u) u = (ϕ(u_1) u_1, …, ϕ(u_N) u_N). The corresponding damped Jacobi smoother is based on inverting the corresponding 'diagonal preconditioner'

    S_N(u) = (D_N + Φ_N(u)) u,   D_N = diag(A_N)

A standard Jacobi step can be defined in an FAS-type manner:

    u_0 ↦ u_1 = solution of S_N(u_1) = S_N(u_0) + r_0,   r_0 = b_N − (A_N + Φ_N(u_0)) u_0

In practice, this is realized by a (scalar) Newton procedure, with the Jacobian

    DS_N(u) = D_N + DΦ_N(u)

A single Newton step starting from u_0 takes the form

    u_1^{(1)} = u_0 + (DS_N(u_0))^{−1} r_0

or equivalently, u_1^{(1)} = u_0 + δ, where the correction δ is the solution of

    DS_N(u_0) δ = r_0

For damped Jacobi we set u_1^{(1)} = u_0 + ω δ, with an appropriate damping factor ω.
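A sketch of one such damped nonlinear Jacobi smoothing step, realized by a single scalar Newton step per component as described above; phi and dphi are assumed callables for ϕ and ϕ′ (applied componentwise), and omega is the damping factor:

    import numpy as np

    def nonlinear_jacobi_step(A, phi, dphi, u0, b, omega=0.5):
        D = np.diag(A)
        residual = b - (A @ u0 + phi(u0) * u0)       # r0 = b - (A_N + Phi_N(u0)) u0
        jac_diag = D + phi(u0) + dphi(u0) * u0       # diagonal Jacobian D S_N(u0)
        delta = residual / jac_diag                  # one scalar Newton step per component
        return u0 + omega * delta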

Naturally, the detailed choice for the algorithmic components is not obvious in the nonlinear case. Somedetailed case studies are presented in [4].

Nonlinear Galerkin and FAS type two-grid scheme.

The weakly nonlinear Poisson problem from Exercise 13.34 is an example of a nonlinear elliptic problem for which the linear theory can be extended in a rather straightforward manner. An introduction to the theory of nonlinear elliptic problems can be found in [20]. Here, we restrict ourselves to the abstract specification of the weak formulation, together with the Galerkin/FEM approximation and the corresponding two-grid procedure.

The weak formulation of a nonlinear elliptic problem takes the form

    a(u, v) = f(v) ∀ v ∈ V   (13.53)


with a form a(u, v), linear in v, which depends in a nonlinear way on u. (In the simplest case of a second order problem posed on a domain Ω with homogeneous Dirichlet boundary conditions, V = H¹₀(Ω).)

As in Section 13.4 (linear case), a Galerkin/FEM approximation is defined on a finite-dimensional subspace U_N ⊂ V via the (nonlinear) Galerkin conditions

    a(u, v) = f(v) ∀ v ∈ U_N   (13.54)

With the usual FEM isomorphism u ∈ R^{N−1} ↔ u ∈ U_N, this corresponds to a nonlinear algebraic system

    A_N(u) = b_N

Consider an approximation u_0 ∈ U_N for the solution u_N of (13.54) obtained after smoothing, 79 with error e_0 = u_0 − u_N. Error and residual are related via 80

    a(u_0 − e_0, v) − a(u_0, v) = f(v) − a(u_0, v) ∀ v ∈ U_N   (13.55)

(note that u_0 − e_0 = u_N).

In the nonlinear two-grid method we seek an approximation u_TG to u_N of the form

    u_TG = u_0 + δ_{N/2}

with δ_{N/2} ∈ U_{N/2} ⊂ U_N. As in the linear case, the coarse-grid correction δ_{N/2} is an approximation to the unknown 'exact correction' −e_0; from (13.55) we see that, at least formally, it is defined in a natural way by the Galerkin condition

    a(u_0 + δ_{N/2}, w) − a(u_0, w) = f(w) − a(u_0, w) =: r_0(w) ∀ w ∈ U_{N/2}   (13.56)

which simplifies to (13.20) from Section 13.4 in the linear case. However, the coarse grid correction step is realized within the smaller subspace U_{N/2}. Therefore, in the left-hand side of (13.56) we replace u_0 by an interpolant ũ_0 ∈ U_{N/2} (standard case: piecewise linear interpolation). This leads to:

    Find δ_{N/2} ∈ U_{N/2} such that a(ũ_0 + δ_{N/2}, w) − a(ũ_0, w) = r_0(w) ∀ w ∈ U_{N/2}   (13.57)

(with the abbreviation v_{N/2} := ũ_0 + δ_{N/2}).

This is realized by:

    Compute v_{N/2} ∈ U_{N/2} such that a(v_{N/2}, w) = a(ũ_0, w) + r_0(w) ∀ w ∈ U_{N/2}   (13.58)

followed by

    δ_{N/2} = v_{N/2} − ũ_0

and the new approximation u_1 is obtained in the form

    u_1 = u_0 + δ_{N/2}

We see that this process is nothing but a 'weak formulation' of the FAS coarse grid correction scheme from Section 13.8. In matrix/vector formulation it is the same, with an appropriate definition of the 'coarse grid Galerkin approximation' A_{N/2} = A_{N/2}(u).

Note that in the linear case our formulation of the FAS exactly reduces to the linear CS-type scheme.

79 Remark concerning notation: In the linear case we have used the notation u_m instead of u_0 (Section 13.4). All other denotations, in particular for the natural restriction and prolongation operators, are the same as in Section 13.4.
80 The right hand side of (13.55) is called the weak residual of u_0; more precisely: the weak residual of u_0 is the u_0-dependent functional r_0 : v ↦ f(v) − a(u_0, v).


14 Substructuring Methods

Multigrid techniques are relatively easy to implement on simple geometries and regular grids. If more complicated geometries are involved, it is often useful to use substructuring techniques, e.g., by partitioning the underlying domain into several subdomains and using some divide-and-conquer technique in the preconditioning process. A typical example is an L-shaped domain in R² partitioned into three rectangles. This approach is called domain decomposition. Domain decomposition may also be motivated by storage limitations (the solver needs to be based on smaller problems). Some of these techniques are also natural candidates for parallelization. Domain decomposition is also a natural approach if the underlying problem is of a different nature in different parts of the domain, or if fast direct solvers can be used for the subproblems.

Like multigrid methods, substructuring techniques are frequently used for preconditioning CG or GMRES.

14.1 Subspace corrections

In this section we introduce substructuring techniques using a general, abstract formulation. We will also see that many of the methods discussed before fit naturally into this framework. We consider an abstract variational problem, in weak formulation, on a finite-dimensional space V,

    Find u ∈ V such that a(u, v) = f(v) ∀ v ∈ V   (14.1)

with a bounded SPD 81 bilinear form a(·, ·) and associated energy norm ∥u∥A =√a(u, u) .

Assume that an approximation u0 for the exact solution u∗ of (14.1) is given, with weak residual r0 = r(u0) ,

r0(v) = f(v)− a(u0, v), v ∈ V (14.2)

Exact solution of the correction equation

    Find δ ∈ V such that a(δ, v) = r_0(v) ∀ v ∈ V   (14.3)

would result in the negative error, δ = −e_0, i.e., the exact correction such that u_0 + δ = u∗. For the construction of a preconditioner, we approximate (14.3) by means of subspace correction techniques. (Preconditioning by MG may be considered as a special case of such a technique.)

Let V_1 ⊂ V be a linear subspace of V. Analogously as for the two-grid method, the associated 'subspace correction', i.e., the solution δ_1 ∈ V_1 of

    a(δ_1, v_1) = r_0(v_1) ∀ v_1 ∈ V_1   (14.4)

is the projection, with respect to the a(·,·) inner product, of the negative error −e_0 = u∗ − u_0 onto the subspace V_1. Let

    u_1 = u_0 + δ_1   (14.5)

be the improved approximation, which is locally optimal in the following sense: It satisfies the fundamental Galerkin orthogonality relation

    δ_1 + e_0 ⊥_A V_1

which can be written in terms of the new residual r_1 = r(u_1),

    r_1(v_1) = f(v_1) − a(u_1, v_1) = 0 ∀ v_1 ∈ V_1   (14.6)

81Here, SPD means that a(·, ·) is symmetric and coercive, or ‘elliptic’, i.e., a(u, u) ≥ γ (u, u) > 0 uniformly for all u ∈ V .


This follows directly from (14.2)–(14.5), and it is equivalent to the best approximation property

    ∥u_1 − u∗∥_A = min_{v ∈ u_0 + V_1} ∥v − u∗∥_A   (14.7)

More generally, we consider a family of subspaces V_0, …, V_N which span V,

    Σ_{i=0}^N V_i = V   (14.8)

i.e., each v ∈ V can be written in the form v = Σ_{i=0}^N v_i with v_i ∈ V_i. The sum in (14.8) is not necessarily assumed to be a direct sum (⊕): Some of the V_i may be 'overlapping' (i.e., the intersections V_i ∩ V_j may have a positive dimension). In this case the representation v = Σ_{i=0}^N v_i is not unique.

Moreover, the subspace corrections analogous to (14.4) may be 'inexact' (similarly as in the case where an 'exact' coarse grid correction is replaced by a MG cycle). Therefore we assume that the subspace corrections are of a more general form, obtained by the solutions δ_i ∈ V_i of

    a_i(δ_i, v_i) = r_0(v_i) ∀ v_i ∈ V_i   (14.9)

where a_i(·,·) is an SPD bilinear form on V_i which is not necessarily identical with the action of the given form a(·,·) restricted to V_i. (Think, for instance, of a multigrid cycle playing the role of an approximate local solver.)

Different versions of substructuring methods are characterized by the way in which the corresponding subspace corrections are combined. We will see that, typically, the outcome is a preconditioner for (14.1) which may also be written in a 'weak form', analogous to (14.1),

    Find δ ∈ V such that b(δ, v) = r_0(v) ∀ v ∈ V   (14.10)

The bilinear form b(·,·) is an approximation for a(·,·) representing the preconditioner. It is usually not written down explicitly; rather, it is implicitly defined by the particular type of subspace correction algorithm used.

For studying the properties of b(·,·) it is not essential which particular functional appears on the right hand side in (14.10). Therefore, for the action of the preconditioner we now again write

    Find u ∈ V such that b(u, v) = f(v) ∀ v ∈ V   (14.11)

in analogy to (14.1), and we consider 'subspace solutions' instead of 'subspace corrections'.

14.2 Additive Schwarz methods (ASM)

The simplest version of a 'global' technique is to compute independent subspace solutions u_i ∈ V_i and add them up:

    u = Σ_{i=0}^N u_i   (14.12)

where the u_i are the solutions of

    a_i(u_i, v_i) = f(v_i) ∀ v_i ∈ V_i,   i = 0 … N   (14.13)

For historical reasons, this is called an additive Schwarz method (ASM). Naturally, the u_i can be computed in parallel because there is no data dependency between the subproblems. Note that even for the case of exact local solvers a_i = a|_{V_i}, and despite Σ_{i=0}^N V_i = V, u from (14.12) is not the exact solution of (14.1).

For a study of the properties of the ASM preconditioner it is favorable to refer to an explicit basis representation.


Representation of the ASM preconditioner.

We identify the n-dimensional space V with 82 R^n, and choose a basis x_1, …, x_n for V and arbitrary bases x_{i,1}, …, x_{i,n_i} for each subspace V_i, where n = dim(V) and n_i = dim(V_i). Let X = (x_1 | … | x_n) ∈ R^{n×n} and X_i = (x_{i,1} | … | x_{i,n_i}) ∈ R^{n×n_i} be the columnwise matrix representations of these bases. For the representations of vectors v, v_i with respect to these bases we use the notation 83 (with v ∈ V, v′ ∈ R^n, and v_i ∈ V_i, v″_i ∈ R^{n_i})

    v ≡ X v′,   v_i ≡ X_i v″_i

With respect to the global basis X, the stiffness matrix A ∈ R^{n×n}, i.e., the SPD matrix representation of a(·,·), is given by

    A = (a_{j,k}) = (a(x_j, x_k))

Let the A_i ∈ R^{n_i×n_i} be analogously defined as the local stiffness matrices of the bilinear forms a_i(·,·) with respect to the bases of the V_i, i.e.,

    A_i = ((a_i)_{j,k}) = (a_i(x_{i,j}, x_{i,k}))

By assumption on the a_i(·,·), the matrices A_i are SPD.

The linear functional f(v) is represented in a similar way: For v = X v′ we have f(v) = f′^T v′ (Riesz representation with some f′ ∈ R^n), and for v_i = X_i v″_i we have a representation f(v_i) = f″_i^T v″_i with f″_i ∈ R^{n_i}. In this notation, the original and subspace problems take the form

    A u′ = f′,   A_i u″_i = f″_i

Let E_i ∈ R^{n×n_i} denote the matrix representation of the embedding V_i ⊂ V, i.e., E_i is defined by X E_i = X_i, i.e.,

    E_i = X^{−1} X_i ∈ R^{n×n_i}   (14.14)

In other words, E_i transforms the i-th local coordinate representation into the global coordinate representation with respect to the basis X. In particular, for each fixed i and v_i ∈ V_i we have

    v_i ≡ X v′_i ≡ X_i v″_i   ⇒   v′_i = E_i v″_i   (14.15)

and for arbitrary v_i ∈ V_i:

    f(v_i) ≡ f′^T v′_i ≡ f″_i^T v″_i   ⇒   f″_i = E_i^T f′   (14.16)

We see that the E_i and E_i^T play the role of prolongation and restriction operators, respectively.

Now, each individual subspace solution u_i from (14.13) is represented as u_i = X_i u″_i, where u″_i is the solution of

    A_i u″_i = f″_i,   i = 0 … N   (14.17)

Thus, 84

    u′_i = E_i A_i^{−1} E_i^T f′ =: B_i f′,   i = 0 … N   (14.18)

In this way we obtain

82 As usual, this corresponds to a natural isomorphism, e.g., between functions and coordinate vectors in the FEM context.
83 For each i, the index ″ is used as a 'generic' symbol for the individual coordinate vectors w.r.t. the individual bases in the subspaces V_i.
84 The matrices B_i as well as B below refer to the global basis X.


Lemma 14.1 For any choice of bases X and X_i, the matrix representation of the preconditioner defined by (14.12), (14.13) is given by the SPD matrix

    B̃ := Σ_{i=0}^N B_i = Σ_{i=0}^N E_i A_i^{−1} E_i^T ∈ R^{n×n}   (14.19)

With B := B̃^{−1} we have u = X u′, where u′ is the solution of

    B u′ = f′   (14.20)

Proof: For arbitrary f′ ∈ R^n we have

    (Σ_{i=0}^N E_i A_i^{−1} E_i^T f′, f′) = Σ_{i=0}^N (E_i A_i^{−1} E_i^T f′, f′) = Σ_{i=0}^N (A_i^{−1} E_i^T f′, E_i^T f′)

with E_i^T f′ ∈ R^{n_i}. Since the A_i are SPD, each term in the sum is nonnegative; moreover, for f′ ≠ 0 not all of the E_i^T f′ can vanish (because the spaces V_i span V), so the sum is positive. This proves that the matrix on the right hand side of (14.19) is well-defined and SPD. Therefore B is well-defined and SPD.

Furthermore, the action of the ASM preconditioner on a right-hand side f is defined by summing up the individual subspace solutions u_i, cf. (14.12). With (14.18) we obtain

    u′ = Σ_{i=0}^N B_i f′ = B̃ f′ = B^{−1} f′

This completes the proof.

By (14.19), the SPD matrix B := B̃^{−1} is implicitly defined, and the solution of (14.20) represents the action of the ASM preconditioner w.r.t. the basis X. Clearly, the SPD matrix B represents an SPD bilinear form b(·,·) on V via the identity

    b(u, v) ≡ (B u′, v′)   (u = X u′, v = X v′)

Therefore the action of the ASM preconditioner is given by the solution of an approximate problem of the form

    Find u ∈ V such that b(u, v) = f(v) ∀ v ∈ V   (14.21)

This may also be written in the form

Find u ∈ V such that a(P−1ASM u, v) = f(v) ∀ v ∈ V (14.22)

where PASM is defined as the ‘projection-like’ operator PASM : V → V with matrix representation

PASM ! P ′ASM = B A =

N∑i=0

BiA (14.23)

The operator PASM can be equivalently be defined by the property

b(PASM u, v) = a(u, v) for all u, v ∈ V (14.24)

We also define the individual projection-like operators Pi : V → Vi with matrix representation

Pi ! P ′i = BiA = EiA

−1i ET

i A, i = 0 . . . N (14.25)

Ed. 2011 Iterative Solution of Large Linear Systems

Page 161: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.2 Additive Schwarz methods (ASM) 157

Consider arbitrary u = X u′ ∈ V and vi = X v′i and observe P ′i u

′ = Ei (A−1i ET

i A)u′ , i.e. (A−1

i ETi A)u

′ isthe representation of P ′

i u′ in the local coordinate system in Vi . This gives

ai(Pi u, vi) = (AiA−1i ET

i Au′, v′′i ) = (Au′, Ei v

′′i ) = (Au′, v′i) = a(u, vi) (14.26)

which yields the fundamental identity

ai(Pi u, vi) = a(u, vi) for all u ∈ V, vi ∈ Vi (14.27)

Lemma 14.2 The operators Pi, i = 0 . . . N and PASM are self-adjoint with respect to a(·, ·) .

Proof: Let u, v ∈ V . Using the symmetry of a(·, ·) , ai(·, ·) , and the fact that Pi u, Pi v ∈ Vi we obtain,making use of (14.27),

a(Pi u, v) = a(v, Pi u) = ai(Pi v, Pi u) = ai(Pi u, Pi v) = a(u, Pi v)

and for PASM the result follows by summation.

Remark 14.3 In the special case ai(·, ·) ≡ a(·, ·) we have Ai = ETi AEi , which corresponds to the

Galerkin approximation to A associated with the subspace Vi , and P ′i = EiA

−1i ET

i A is even an ( A-orthogonal) projector:

P ′iP

′i = EiA

−1i ET

i AEiA−1i ET

i A = P ′i

In the general case ai(·, ·) = a(·, ·) , we speak of ‘projection like’ operators.

Lemma 14.1 shows that B is SPD. Our goal is to estimate the constants γ,Γ in the estimate

γ B ≤ A ≤ ΓB

If these constants are available then, as in Exercise 12.6, we infer for the matrix P ′ASM = B−1A :

κ = κσ(P′ASM) =

λmax(P′ASM)

λmin(P ′ASM)

≤ Γ

γ(14.28)

In view of Exercise 12.6, the quantity κ allows us to assess the convergence behavior of the correspondingpreconditioned CG method (PCG). The goal is the design of preconditioners B that are cheap (i.e.,r 7→ B−1 r is simple to evaluate), while the ratio Γ/γ is as small as possible.

Example 14.4 Let A ∈ RN×N be SPD. Denote by ei , i = 1 . . . N , the Euclidean unit vectors in RN .Let V = RN ,Vi = spanei , and let the bilinear forms ai(·, ·) be obtained by restricting a(·, ·) to the(one-dimensional) spaces Vi , i.e., ai(u, v) = a(u, v) for u, v ∈ Vi . The corresponding subspace solutionui ∈ Vi = spanei is the solution of (14.13),

ai(ui, vi) = f(vi) ∀ v ∈ Vi

Each single correction affects the i -th solution component only, and all the corrections are added up. Inmatrix terminology (cf. Lemma 14.1) we have, with respect to the Euclidean basis,

Ei =(0, . . . , 0, 1, 0, . . . , 0

)T, Ai =

(a(ei, ei)

)=(ai,i)

Iterative Solution of Large Linear Systems Ed. 2011

Page 162: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

158 14 SUBSTRUCTURING METHODS

From (14.19) we obtain

B−1 = diag(a−11,1, . . . , a

−1n,n

)The corresponding ASM preconditioner is precisely the Jacobi preconditioner, i.e., the preconditioningmatrix B is the diagonal diag(A) of the stiffness matrix A .

In FEM terminology, for the Poisson equation with Dirichlet boundary conditions, discretized over meshpoints xi , this means that each single correction ui ∈ Vi is a multiple of the i -th basis function (hatfunction) over the local patch Ωi around xi , and ui is the solution of a local discrete Poisson problemwith homogeneous Dirichlet boundary condition on ∂Ωi . All these local solutions added up give rise tothe value of the Jacobi preconditioner. In terms of degrees of freedom, this is a ‘non-overlapping’ method;however, from a geometric point of view, the local subdomains (patches) Ωi are of course overlapping.

In this interpretation, Jacobi is a simple example of a ‘domain decomposition technique’. More advancedtechniques for domain decomposition are considered in Section 14.4.

This example shows that ASM is a generalization of the classical Jacobi preconditioner in the sense:‘Compute the actual residual, choose a family of subspaces, compute the Galerkin correction in eachindividual subspace, and sum up the corrections’. In general we also allow that the subproblems are onlyapproximately solved and that the subspaces may be overlapping.

Exercise 14.5 Provide an interpretation of Jacobi line relaxation (block relaxation, where the blocks of variables

are associated with grid lines of a regular mesh) in the spirit of Example 14.4.

Abstract theory for ASM.

In the following we characterize the ASM preconditioner B by three parameters: C0 , ρ(E) , and ω , whichenter via assumptions on the subspaces Vi and the bilinear forms ai(·, ·) (the approximate local problems).

Assumption 14.6 (stable decomposition)

There exists a constant C0 > 0 such that every u ∈ V admits a decomposition u =∑N

i=0 ui with ui ∈ Visuch that

N∑i=0

ai(ui, ui) ≤ C20 a(u, u) (14.29)

Assumption 14.7 (strengthened Cauchy-Schwarz inequality)

For i, j = 1 . . . N , let Ei,j = Ej,i ∈ [0, 1] be defined by the inequalities

|a(ui, uj)|2 ≤ Ei,j a(ui, ui) a(uj, uj) ∀ ui ∈ Vi, uj ∈ Vj (14.30)

By ρ(E) we denote the spectral radius of the symmetric matrix E = (Ei,j) ∈ RN×N . The particularassumption is that we have a nontrivial bound for ρ(E) to our disposal.

Note that due to Ei,j ≤ 1 (Cauchy-Schwarz inequality), the trivial bound ρ(E) = ∥E∥2 ≤√∥E∥1 ∥E∥∞ ≤ N

always holds; for particular Schwarz methods one usually aims at bounds for ρ(E) which are independentof N .

Ed. 2011 Iterative Solution of Large Linear Systems

Page 163: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.2 Additive Schwarz methods (ASM) 159

Assumption 14.8 (local stability)

There exists ω > 0 such that for all i = 1 . . . N :

a(ui, ui) ≤ ω ai(ui, ui) ∀ ui ∈ Vi (14.31)

Remark 14.9 The space V0 is not included in the definition of E ; as we will see below, this space isallowed to play a special role. Ei,j = 0 implies that the spaces Vi and Vj are orthogonal (in the a(·, ·) - innerproduct). We will see below that small ρ(E) is desirable. We will also see below that a small C0 is desirable.

The parameter ω represents a one-sided measure of the approximation properties of the approximatesolvers ai . If the local solver is of (exact) Galerkin type, i.e, ai(u, v) ≡ a(u, v) for u, v ∈ Vi , then ω = 1 .However, this does not necessarily imply that Assumptions 14.6 and 14.7 are satisfied.

Lemma 14.10 (P. L. Lions)

Let PASM be defined by (14.23) resp. (14.24). Then, under Assumption 14.6,

(i) PASM : V → V is a bijection, and

a(u, u) ≤ C20 a(PASM u, u) ∀ u ∈ V (14.32)

(ii) Characterization of b(u, u) :

b(u, u) = a(P−1ASM u, u) = min

N∑i=0

ai(ui, ui) : u =N∑i=0

ui, ui ∈ Vi

(14.33)

Proof: We make use of the fundamental identity (14.27) and Cauchy-Schwarz inequalites.

Proof of (i): Let u ∈ V and u =∑

i ui be a decomposition of the type guaranteed by Assumption 14.6.Then:

a(u, u) = a(u,∑

i ui) =∑

i a(u, ui) =∑

i ai(Pi u, ui) ≤∑

i

√ai(Pi u, Pi u) ai(ui, ui)

=∑

i

√a(u, Pi u) ai(ui, ui) ≤

√∑i a(u, Pi u)

√∑i ai(ui, ui)

=√a(u, PASM u)

√∑i ai(ui, ui) ≤

√a(u, PASM u)C0

√a(u, u)

This implies the estimate (14.32). In particular, it follows that PASM is injective, because with (14.32),PASM u = 0 implies a(u, u) = 0 , hence u = 0 . Due to finite dimension, we conclude that PASM is bijective.

Proof of (ii): We first show that the minimum on the right-hand side of (14.33) cannot be smaller thana(P−1

ASM u, u) . To this end, we consider an arbitrary decomposition u =∑

i ui with ui ∈ Vi and estimate

a(P−1ASM u, u) =

∑i a(P

−1ASM u, ui) =

∑i ai(PiP

−1ASM u, ui)

≤√∑

i ai(PiP−1ASM u, PiP

−1ASM u)

√∑i ai(ui, ui)

=√∑

i a(P−1ASM u, PiP

−1ASM u)

√∑i ai(ui, ui) =

√a(P−1

ASM u, u)√∑

i ai(ui, ui)

In order to see that a(P−1ASM u, u) is indeed the minimum of the right-hand side of (14.33), we define

ui = PiP−1ASM u . Obviously, ui ∈ Vi and

∑i ui = u . Furthermore,∑

i ai(ui, ui) =∑

i ai(PiP−1ASM u, PiP

−1ASM u) =

∑i a(P

−1ASM u, PiP

−1ASM u)

= a(P−1ASM u,

∑i PiP

−1ASM u) = a(P−1

ASM u, u)

This concludes the proof.

Iterative Solution of Large Linear Systems Ed. 2011

Page 164: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

160 14 SUBSTRUCTURING METHODS

The matrix P ′ASM = B−1A from (14.23) is the matrix representation of the operator PASM . Since PASM

is self-adjoint in the A -inner product (see Lemma 14.2), we can estimate the smallest and the largesteigenvalue of B−1A by:

λmin(B−1A) = inf

0 =u ∈V

a(PASM u, u)

a(u, u), λmax(B

−1A) = sup0 =u ∈V

a(PASM u, u)

a(u, u)(14.34)

Lemma 14.10, (i) in conjunction with Assumption 14.6 readily yields

λmin(B−1A) ≥ 1

C20

An upper bound for λmax(B−1A) is obtained with the help of the following lemma.

Lemma 14.11 Under Assumptions 14.7 and 14.8 we have

∥Pi∥A ≤ ω, i = 0 . . . N (14.35)

a(PASM u, u) ≤ ω (1 + ρ(E)) a(u, u) for all u ∈ V (14.36)

Proof: Again we make use of identity (14.27). We start with the proof of (14.35): From Assumption 14.8,(14.31) we infer for all u ∈ V :

∥Pi u∥2A = a(Pi u, Pi u) ≤ ω ai(Pi u, Pi u) = ω a(u, Pi u) ≤ ω ∥u∥A ∥Pi u∥A

which implies (14.35).

For the proof of (14.36), we observe that the space V0 is assumed to play a special role. We define

P =N∑i=1

Pi = PASM − P0

Assumption 14.7 then allows us to bound 85

a(P u, P u) =N∑

i,j=1

a(Pi u, Pju) ≤N∑

i,j=1

Ei,j√a(Pi u, Pi u)

√a(Pju, Pju)

≤ ωN∑

i,j=1

Ei,j√ai(Pi u, Pi u)

√ai(Pju, Pju) ≤ ω ρ(E)

N∑i=1

ai(Pi u, Pi u)

= ω ρ(E)N∑i=1

a(u, Pi u) = ω ρ(E)a(u, P u) ≤ ω ρ(E) ∥u∥A ∥P u∥A

from which we extract∥P u∥A ≤ ω ρ(E) ∥u∥A (14.37)

From (14.35) we have ∥P0 u∥A ≤ ω ∥u∥A . Combining this with (14.37) gives

a(PASM u, u) = a(P u, u) + a(P0 u, u) ≤ ∥P u∥A ∥u∥A + ∥P0 u∥A ∥u∥A ≤ ω (1 + ρ(E)) ∥u∥2A

which is the desired estimate. 85 Note that E is symmetric, such that ∥E∥2 = ρ(E) .

Ed. 2011 Iterative Solution of Large Linear Systems

Page 165: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.3 Multiplicative Schwarz methods (MSM) 161

Theorem 14.12 Let C0 , ω , ρ(E) be defined by Assumptions 14.6–14.8. Then,

λmin(P′ASM) ≥

1

C20

and λmax(P′ASM) ≤ ω (1 + ρ(E))

Proof: Follows from Lemmas 14.10 and 14.11 in conjunction with (14.34).

Theorem 14.12 permits us to bound the spectral condition number of B−1A! PASM (cf. (14.28)) in termsof the parameters C0, ω and ρ(E) ,

κ = κσ(P′ASM) ≤ C2

0 ω (1 + ρ(E))

The error amplification operator of the preconditioner (in the sense of one step of a stationary iterationscheme) is given by

GASM := I − PASM ! I − P ′ASM = I − B−1A

Note that the above results to not imply ∥GASM∥A < 1 , i.e., ASM as stationary iteration is not guaranteedto be convergent under the above assumptions.

14.3 Multiplicative Schwarz methods (MSM)

In multiplicative versions of Schwarz methods, subspace solutions (subspace corrections) are immediatelyapplied, and the new residual is evaluated before proceeding (as in the Gauss-Seidel method compared toJacobi). With the same denotation as for ASM methods, this gives the following preconditioning step,starting from some initial (weak) residual r0(·) = r(u0)(·) = f(·)− a(u0, ·) :

Find δ0 ∈ V0 such that a0(δ0, v0) = r0(v0) ∀ v0 ∈ V0Update approximation: u0,0 = u0 + δ0

Update residual: r0,0(·) = f(·)− a(u0,0, ·) = r0(·)− a(δ0, ·)

and continuing this over the subspaces V1, V2, . . . .

In coordinate representation, this leads us to the following iteration for i = 0, 1, . . . , starting with u′0,−1 =u′0 , r

′0,−1 = r′0 :

Solve Ai δ′′i = ET

i r′0,i−1

Let δ′i = Ei δ′′i , update u′0,i = u′0,i−1 + δ′i

Compute the new residual r′0,i = r′0 − Aδ′i

After N+1 steps, we see that the new approximation u′1 := u′0,N is given by

u′1 = u′0 +N∑i=0

δ′i

with

δ′i = EiA−1i ET

i r′0,i−1, i = 0, 1, . . .

r′0,i = r′0,i−1 − Aδ′i, i = 0, 1, . . .

Iterative Solution of Large Linear Systems Ed. 2011

Page 166: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

162 14 SUBSTRUCTURING METHODS

i.e.,

δ′0 = E0A−10 ET

0 r′0

δ′1 = E1A−11 ET

1 r′0,1

= E1A−11 ET

1 (r′0 − Aδ′0)

= E1A−11 ET

1 (I − E0A−10 ET

0 A) r′0

. . .

With P ′i = BiA = EiA

−1i ET

i A (the matrix representation of the projection-like operator (14.25)) thisleads us to the inductive representation

δ′i = EiA−1i ET

i (I − P ′i ) · · · (I − P ′

0) r′0

For the successive residuals we have, with Q′i := AEiA

−1i ET

i :

r′0,0 = (I − Q′0) r

′0

r′0,1 = (I − Q′1) r

′0,0 = (I − Q′

1)(I − Q′0) r

′0

. . .

This shows, after a complete sweep over all subspaces Vi ,

r′1 = r′0,N = (I − Q′N) · · · (I − Q′

0) r′0

For the error e′1 = −Ar′1 we obtain the multiplicative representation 86

e′1 = (I − P ′N) · · · (I − P ′

0) e′0 =: P ′

MSM e′0

which motivates the name ‘multiplicative method’. The operator GMSM : V → V ,

GMSM = (I − PN) · · · (I − P0), with Pi ! P ′i = BiA = EiA

−1i ET

i A (14.38)

represents error amplification operator of one sweep of MSM.

The abstract theory from the preceding section can be extended to the MSM case. In particular, in [18]it is shown that repeated application of MSM sweeps yields a convergent stationary iteration scheme,provided ω ∈ (0, 2) :

Theorem 14.13 Let C0 , ω , ρ(E) be defined in Assumptions 14.6–14.8, and suppose ω ∈ (0, 2) . Letω = max1, ω . Then, the MSM error amplification operator (14.38) satisfies

∥GMSM∥2A ≤ 1− 2− ωC2

0 (1 + 2 ω2ρ(E))< 1

Proof: Given in [18]. See also [18] for some historical remarks. Theorem 14.13 may be considered asan abstract generalization of the original proof by Schwarz on the convergence of the classical Schwarzalternating (domain decomposition) method for the Poisson equation.

86 Note that A−1(I − Q′N ) · · · (I − Q′

0)A = (I − P ′N ) · · · (I − P ′

0) – an exercise.

Ed. 2011 Iterative Solution of Large Linear Systems

Page 167: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.4 Introduction to domain decomposition techniques 163

In a sense, Thm. 14.13 is a stronger result as obtained above for the ASM case. It also implies a boundfor the condition number of PMSM := I − GMSM (the analog of PASM for the ASM case): Let q < 1 denotethe bound for ∥GMSM∥A from Theorem 14.13. Then,

∥PMSM∥A ≤ 1 + q < 2

and PMSM is invertible, with

∥P−1MSM∥A = ∥(I − GMSM)

−1∥A ≤1

1− qhence

κA(PMSM) ≤2

1− qNote, however, since PMSM is not symmetric, it cannot be used in a straightforward way as for precon-ditioning the CG method. A symmetric version can be obtained in the same way as for the symmetricGauss-Seidel method by repeating the multiplicative correction procedure, starting with VN , down to V0 .

Remark 14.14 Line relaxation techniques (block-Jacobi, block-Gauss-Seidel and similar techiques) natu-rally fit into the ASM/MSM framework, with appropriate subspaces Vi representing the respective blocking,i.e., the agglomeration of variables.

Multigrid methods can also be formulated in the context of subspace correction methods, based on hier-archical decompositions of the given space V .

14.4 Introduction to domain decomposition techniques

The most important practical realization of the abstract idea of additive or multiplicative Schwarz methodsis domain decomposition. The standard text in this topic is the book [18]. The abstract theory fromSections 14.2 and 14.3 provides a framework which has proven quite helpful in the design and analysis ofa number of old and new domain decomposition techniques.

Often, the approximate local solvers are realized in terms of MG cycles, or sometimes special direct solvers(e.g., based on FFT techniques). In this section we discuss two typical domain decomposition techniquesfor elliptic problems. In particular, we consider the 2D Poisson equation on a domain Ω ⊂ R2 as aprominent example.

An overlapping additive two-level method.

Let Th be a quasi-uniform mesh (consisting of triangles), with mesh size h , of Ω ⊂ R2 . We consider theDirichlet problem

−∆u = f on Ω , u = 0 on ∂Ω (14.39)

discretized by FEM with piecewise linear functions from the FEM space Vh ⊂ H10 (Ω) , giving rise to a

discrete system Ah u′h = f ′

h .

To generate a preconditioner for the resulting linear system we employ the ASM framework. To fix ideasand to keep the exposition simple, we assume that a second, coarse triangulation TH with mesh size H ischosen. The FEM space on this mesh is VH ⊂ H1

0 (Ω) . Furthermore we assume VH ⊂ Vh , i.e., the meshlines of TH are also mesh lines of Th .

Iterative Solution of Large Linear Systems Ed. 2011

Page 168: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

164 14 SUBSTRUCTURING METHODS

For simplicity of exposition we assume that the local solvers are exact, i.e., the FEM systems on thesubdomains are solved exactly, such that Assumption 14.8 is trivially satisfied with ω = 1 .

The space VH is associated with the special subspace V0 from our abstract ASM setting. Its purpose is toprovide an initial, coarse approximation which enables us to start the ASM procedure.

We assume that N subdomains Ωi, i = 1 . . . N , of Ω are given which consist of unions of elements(triangles). We set

Vi = Vh ∩H10 (Ωi), i = 1 . . . N

We assume that the subdomains Ωi satisfy:

• Ω =∪N

i=1Ωi

• The Ωi are assumed to be overlapping; in particular, for the ASM convergence theory the ‘amountof overlap’ is an essential parameter.

• Not more thanM subdomains are simultaneously overlapping, i.e., supi=1...N

cardj : Ωi∩Ωj = ∅ ≤M .

• The fine mesh Th is ‘compatible’ with the decomposition of Ω , giving rise to N local meshes Th,i onthe subdomains Ωi , such that Vh =

∑Ni=1 Vi . In particular, the boundaries ∂Ωi consist of mesh lines

from Th .

One step of our ASM preconditioner amounts to the following procedure:

1. Solve the FEM equations AH u′H = f ′

H on the coarse mesh TH , giving uH ∈ VH .

2. Embedding uH into the fine mesh gives rise to u0 ∈ V0 , where V0 ∈ Vh is the space of all interpolantsfrom VH to Vh (described by the range of the corresponding interpolation operator).

3. For i = 1 . . . N :

Solve the subproblems A′′h,iu

′′h,i = f ′′

h,i , where the local stiffness matrices Ah,i refer to the local meshesover the subdomains, with the appropriately restricted versions f ′′

h,i of fh and with homogeneousDirichlet boundary conditions on the boundaries ∂Ωi .

4. Extend (prolongate) the uh,i to functions ui defined on the overall grid Th (zero outside Ωi ), and set

u = u0 +∑N

i=1 ui .

A formally complete description involves the precise definition of prolongation operators Ei and theirtransposes ET

i , such that the method exactly fits into our abstract ASM framework.

One may also think of omitting the coarse space V0 . However, convergence theory will tell us that thishas an unfavorable effect on the condition number of the preconditioner for h → 0 , similarly as for theJacobi case. One should also not forget that, if the method is used as a preconditioner, e.g., for CG,then the right hand side of the given problem f is actually some intermediate residual. (However, for thedescription of the action of a preconditioner this is not essential.)

Usually, the cost for performing a preconditioning step is significantly smaller than solving the full systemexactly, at least if the coarser problem is also treated by some substructuring technique. The trade-offbetween the cost for the preconditioning step and its acceleration effect on the underlying iteration (PCG)is, however, difficult to evaluate in general.

Ed. 2011 Iterative Solution of Large Linear Systems

Page 169: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.4 Introduction to domain decomposition techniques 165

Apart from the acceleration effect, the most favorable features of such a preconditioner is given by thefact that complex geometries can be reduced to simpler subgeometries, and the fact that individual sub-domain corrections can be computed in parallel. This is the major advantage of additive compared tomultiplicative approaches, and this is more important for practice than the observation that multiplicativepreconditioners are usually more accurate. However, due to the overlapping domains this requires somelocal communication, e.g., if each domain is mapped to an individual processor. This motivated the searchfor non-overlapping techniques; see Section 14.4 for a typical example.

A bit of theory.

The assumption on the maximal amount of simultaneous overlap of the subdomains implies that thespectral radius ρ(E) appearing in Theorem 14.12 is bounded by M , the maximal number of simultaneousoverlaps. this can be shown with the help of the following lemma, which can be regarded as a sharpenedversion of the standard inequality ∥E∥2 ≤ ∥E∥F .

Lemma 14.15 Let E ∈ RN×N be a symmetric matrix with M ≤ N non-zero entries per column and row.Then,

∥E∥2 ≤√M max

j=1...N∥E j∥2

where Ej denotes the j -th column of E .

In particular, if |Ei,j| ≤ 1 for all i, j ∈ 1, . . . , N , then ∥E∥2 ≤M .

Proof: For each i , let J(i) ⊂ 1, . . . , N denote the set of indices j with Ei,j = 0 . Then, for any x ∈ RN

we have

∥Ex∥22 =∑i

∣∣∣ ∑j ∈ J(i)

Ei,j xj∣∣∣2 ≤ ∑

i

∑j ∈ J(i)

x2j∑

j ∈ J(i)

|Ei,j|2

≤(∑

i,jEi,j =0

x2j

)max

j∥Ej∥22 ≤M ∥x∥22 max

j∥Ej∥22

by assumption on E . This implies the first assertion, and the second assertion easily follows.

Now we consider the matrix

E =(Ei,j), with Ei,j = sup

0 =ui ∈Vi0 =uj ∈Vj

|a(ui, uj)|2

a(ui, ui) a(uj, uj)≤ 1

involved in Assumption 14.7. For Ωi∩Ωj = ∅ we have Ei,j = 0 because (using the notation of Section 14.2)

a(ui, uj) = (ETi Aiu

′′i , E

Tj Aju

′′j ) = (Aiu

′′i , E

iETj Aju

′′j ) with EiE

Tj = 0

Therefore, due to our assumption on the maximal simultaneous overlap, the matrix E satisfies the assump-tions of Lemma 14.15. This shows ρ(E) ≤M .

In order to apply Theorem 14.12 it remains to verify that the decomposition ist stable in the sense ofAssumption 14.6. This is the topic of the following theorem.

Iterative Solution of Large Linear Systems Ed. 2011

Page 170: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

166 14 SUBSTRUCTURING METHODS

Theorem 14.16 [ASM with ‘generous’ overlap] Given the above hypotheses, there exists C > 0 such thatany u ∈ Vh can be decomposed as u =

∑Ni=0 ui with ui ∈ Vi and

N∑i=0

∥ui∥2H1(Ω) ≤ C(1 +

H2

δ2

)∥u∥2H1(Ω)

where δ > 0 is a parameter characterizing the ‘amount of overlap’. ( δ can be defined in terms of thecharacteristic functions of the subdomains Ωi .)

Proof: See [18].

The boundedness and ellipticity of a(·, ·) implies a(ui, ui) ≤ C ∥ui∥2H1(Ω) and ∥u∥2H1(Ω) ≤ C a(u, u) . This

shows that Assumption 14.6 is satisfied with C20 = C

(1 + H2

δ2

).

Remark 14.17 At first sight, the two-level approach might appear unnatural, namely to add all the localsolutions to a coarse approximation u0 . In fact, in a ‘classical’ iterative Schwarz procedure this step wouldnot be included. However, this is nothing but an additional, global overlap, and the role of the coarsespace V0 is to ensure that information obtained by the subdomain solves is communicated to the othersubdomains. In particular, without this step Theorem 14.12 would not be valid.

Iterative application of the two-level ASM procedure means that we compute the residual r′h = f ′h −Ahu

and restart the process for the ‘correction problem’ Ahδ′h = r′h (with zero boundary conditions). In general,

we may not expect that this leads to a convergent stationary scheme; cf. Theorem 14.12 which providesa bound for the condition number of PASM but does not guarantee ∥GASM∥A < 1 . Here, again it has tobe stressed that the practical role of the ASM preconditioner is to accelerate an outer Krylov subspaceiteration. Here f ′

h plays the role of the current residual encountered in the outer iteration, and the ASMsolver maps is to the preconditioned residual B−1f ′

h .

Suppose the overall measure of overlap satisfies δ = O(H) . Then Theorem 14.12 asserts that the conditionnumber is independent of h and H . Since H is related to the number of subdomains, where in 2D weexpect for a reasonable choice subdomains that the number of subdomains N satisfies N ∼ H−2 , we seethat the condition number is independent of the number of subdomains.

Theorem 14.12 shows that the condition number of the ASM preconditioner grows at most like O(1 +H2/δ2) . Thus, the condition number is bounded uniformly in h . Furthermore, if the overlap is ‘generous’,i.e., δ ≥ cH , then the condition number is bounded uniformly in both h and H . A more careful analysisallows one reduce the upper bound O(1 + H2/δ2) to O(1 + H/δ) . This is of interest if one wishes toconsider the case of ‘large’ H but ‘small’ δ . To this end, however, additional conditions on the structureof the subdomains, are required; see [15],[18].

Exercise 14.18 Devise the multiplicative (MSM) analog of the two-level ASM procedure introduced above.

Hint: Compute the residual after each single step before proceeding.

Ed. 2011 Iterative Solution of Large Linear Systems

Page 171: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.4 Introduction to domain decomposition techniques 167

An non-overlapping additive two-level method.

We again consider the problem (14.39). We introduce a non-overlapping decomposition method involvingappropriate subdomain solvers. We will again assume ai ≡ a (‘exact solvers’). To fix ideas for the definitionof the subspaces, we assume:

• The triangulation Th is a refinement of a coarse triangulation TH (consisting of triangles).

• The subdomains 87 Ωi are taken exactly as the triangles of the coarse triangulation TH .

• The method works on the union of the ‘patches’ of neighboring triangles; we will consider a versionwhere these patches are chosen as Ωi,j = Ωi ∪ Ωj , all unions of pairs of neighboring triangles Ωi, Ωj

sharing a common edge Γi,j .

Before we proceed with the formal definition of the subspaces that comprise the splitting of Vh for thismethod, we recall that the unknowns (degrees of freedom) in the original discrete problem correspondin a one-to-one fashion to the nodes of the fine triangulation. We split the set of nodes N of the finetriangulation Th into:

• N0 : the nodes of the coarse triangulation TH ,

• Ni : the nodes in the interior of the subdomains Ωi (= triangles of the coarse triangulation),

• Ni,j : the nodes on the edges Γi,j = Ωi ∩ Ωj of the coarse triangulation.

Again, Vh is the FEM space associated with Th . We construct a non-overlapping splitting of Vh in theform

Vh = V0 +∑i

Vi +∑i,j

Vi,j (14.40)

Of course, the patches Ωi,j are overlapping to some extent; however, the term ‘non-overlapping’ refersto the fact that we design (14.40) as a direct sum. To this end we associate the subspaces N0 , Ni andNi,j , i.e., V0 is associated with N0 , each Vi is associated with the nodes of Ni (i.e., with the interior ofthe subdomain Ωi ), and each space Vi,j corresponds to the set Ni,j (i.e., to the edge Γi,j ). The factthat (14.40) is a direct sum will follow from the following fact: For each of these sets V0 , Vi , Vi,j we willconstruct a basis which has a ‘Kronecker δ -property’ for the nodes of the corresponding nodal set N0 ,Ni , or Ni,j .

In detail, the spaces that make up the splitting of Vh are defined as follows:

• V0 is the space of piecewise linears on TH :V0 = VH .

• For each subdomain Ωi we set Vi = Vh ∩H10 (Ωi) .

• For each edge Γi,j of the coarse triangulation ( Γi,j denoting the edge shared by the subdomains Ωi

and Ωj ), we define the ‘edge space’ Vi,j as the set of functions of Vh which are (i) supported by theedge patch Ωi,j = Ωi ∪ Ωj ∪ Γi,j , and which (ii) are discrete harmonic, i.e.,

u ∈ Vi,j ⇔

supp u ⊂ Ωi,j and

a(u, v) = 0 ∀ v ∈ Vi ⊕ Vj(14.41)

see Fig. 14.41. Note that dim(Vi,j) = |Nij| . In this way we obtain a splitting

Vh = V0 ⊕N∑i=1

Vi ⊕∑i,j

Γi,j = ∅

Vi,j (14.42)

87 The splitting of Vh will be based on the subdomains Ωi as well as so-called edge patches Ωi,j to be defined below.

Iterative Solution of Large Linear Systems Ed. 2011

Page 172: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

168 14 SUBSTRUCTURING METHODS

Figure 14.41: An edge patch ΩΓ = Ω1 ∪ Ω2 ∪ Γ , with nodes and mesh lines.

– a dimension argument shows that this is, in fact, a direct sum.

In the notation of Section 14.2, the non-overlapping ASM preconditioner is now defined as (cf. (14.19)):

B−1 =N∑i=0

EiA−1i ET

i +∑i,j

Ei,j A−1i,j E

Ti,j (14.43)

where Ei and Ei,j are the matrix representations of the embeddings Vi ⊂ Vh and Vi,j ⊂ Vh ; the matricesAi and Ai,j are the stiffness matrices for the subspaces Vi and Vi,j .

The stiffness matrices Ai and the matrices Ei corresponding to the problems based on the spaces Vi and thecoarse space V0 are defined in a straightforward way: As for the overlapping method from Section 14.4,A0 corresponds to the stiffness matrix for the approximation of the given problem on the coarse meshTH , and the subproblems to be solved on the Vi, i = 1 . . . N , are local discrete Dirichlet problems on thesubdomains Ωi , with homogeneous boundary conditions on ∂Ωi .

The appropriate interpretation of the stiffness matrices Ai,j for the edge spaces Vi,j is more complicated,however. These are defined over the edge patches Ωi,j = Ωi ∪ Ωj ∪ Γi,j with internal interface Γi,j . Thecorrect interpretation of these subproblems is essential for understanding the nature of the non-overlappingASM, which is defined by (14.43) in a purely formal way. For the case where the stiffness matrix Ah isnot formed explicitly but only the matrix-vector multiplication u′h 7→ Ahu

′h is available, this will also show

how the data for the local ‘edge problems’ have to be assembled and how these subproblems are to besolved.

The appropriate interpretation of the local edge problems is the topic of the following considerations,where we are ignoring discretization issues for a moment. To fix ideas, we consider a single edge Γ that isshared by two subdomains Ω1 and Ω2 , i.e., Γ is an interior interface between Ω1 and Ω2 , see Fig. 14.42.We denote the corresponding edge patch by Ω = ΩΓ = Ω1 ∪ Ω2 ∪ Γ .

• We interpret the edge subproblems in the following way: Consider the local solutions u1, u2 of−∆u = f on Ω1,Ω2 , with homogeneous Dirichlet boundary conditions on ∂Ω1, ∂Ω2 (including Γ !).These solutions are independent of each other. Evaluate the jump 88 [∂nuI ] := ∂n1u1 + ∂n2u2 acrossthe interface Γ , and consider the solution w of the ‘interface problem’

−∆w = 0 on ΩΓ , w = 0 on ∂ΩΓ , [∂nw] = gΓ := −[∂nuI ] on Γ (14.44)

88 [∂nuI ] = ∂n1u1 + ∂n2u2 denotes the jump of the denotes normal derivative over the interface Γ , where the ∂niui areoriented outward w.r.t. Ωi .

Ed. 2011 Iterative Solution of Large Linear Systems

Page 173: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.4 Introduction to domain decomposition techniques 169

Figure 14.42: A combined domain Ω1 ∪ Ω2 ∪ Γ with interface Γ .

(observe the minus sign with the jump term!). We call w the harmonic extension of its trace w|Γ toΩΓ . Then, the linear combination u = u1 + u2 + w is the solution of 89

−∆u = f on ΩΓ , u = 0 on ∂ΩΓ

with [∂nu] = 0 on Γ . In this way, we have solved the Dirichlet problem on ΩΓ by means of solvingtwo independent problems in Ω1, Ω2 and the additional interface problem (14.44).

On the discrete level, the solution of the subproblems on Ω1 and Ω2 is straightforward. We nowconsider the nature of the auxiliary interface problem (14.44), in order to understand how it can besolved on the discrete level.

• Problem (14.44) is of the type

−∆w = 0 on ΩΓ , w = 0 on ∂ΩΓ , [∂nw] = gΓ on Γ (14.45)

with a prescribed jump gΓ over the interface. (In (14.44) we have gΓ = −[∂nuI ] .)To obtain the weak form of (14.45) we use partial integration on both subdomains Ωi , with testfunctions v ∈ H1

0 (ΩΓ) : ∫Ω1

∆w v =

∫∂Ω1

∂n1w v −∫Ω1

∇w∇v

=

∫∂Ω1\Γ

∂nw v +

∫Γ

∂n1w v −∫Ω1

∇w∇v

= 0 +

∫Γ

∂n1w v −∫Ω1

∇w∇v

and analogously, ∫Ω2

∆w v = 0 +

∫Γ

∂n2w v −∫Ω2

∇w∇v

Adding up these identities yields∫ΩΓ

∆w v =

∫Γ

[∂nw] v −∫ΩΓ

∇w∇v

This leads us to the weak form of (14.45):

Find w ∈ H10 (ΩΓ) such that

∫ΩΓ

∇w∇v =

∫Γ

gΓ v ∀ v ∈ H10 (ΩΓ) (14.46)

89 Here u1 is extended by 0 to Ω2 , and vice versa.

Iterative Solution of Large Linear Systems Ed. 2011

Page 174: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

170 14 SUBSTRUCTURING METHODS

• Consider the discretized version of (14.46): 90 AΓ,Γ AΓ,I

AI,Γ AI,I

w′Γ

w′I

=

g′Γ0 ′

(14.47)

with the appropriately blocked local stiffness matrix over ΩΓ , and the appropriate coordinate vectorg′Γ . The indices Γ and I refer to the interface and interior components, respectively, and we havecombined the interior nodes from N1 and N2 into the set NI of all interior nodes. The block structureof the sparse stiffness matrix in (14.47) has to be interpreted accordingly. 91

Elimination of the variable w′I gives rise to a linear system for the edge component w′

Γ :

S w′Γ = g′Γ (14.48)

where S is the Schur complement of AΓ,Γ ,

S = AΓ,Γ − AΓ,I A−1I,I AI,Γ (14.49)

As soon as w′Γ is determined, the component w′

I is obtained as

w′I = −A−1

I,I AI,Γw′Γ (14.50)

– The action of S of has the nature of a (discrete) Dirichlet-to-Neumann map: It transfers Dirichlet(i.e., pointwise) data w′

Γ on Γ to Neumann data g′Γ on Γ .

– The action of S−1 has the nature of a (discrete) Neumann-to-Dirichlet map: The Dirichlet dataw′

Γ on the interface Γ are computed from the solution of (14.48) with a Neumann interface conditionon Γ .

– Furthermore, w′I is the discrete harmonic extension of w′

Γ to the interior of ΩΓ , as explained below.

In order to understand the computational realization of (14.48)–(14.50), the notion of discrete harmonicextension and the action of S and S−1 has to be understood more precisely. This is the topic of thefollowing considerations.

• Discrete harmonic extension: Let data w′Γ on Γ be given. We denote −A−1

I,I AI,Γ by E andconsider

w′I = E w′

γ = −A−1I,I AI,Γw

′Γ

w′I is the discrete harmonic extension of w′

Γ to the interior nodes in ΩΓ . To see what this means,insert w′

Γ and w′I into (14.47) to obtain AΓ,Γ AΓ,I

AI,Γ AI,I

w′Γ

w′I

=

S w′Γ

0 ′

(14.51)

This means that

w′ =

w′Γ

w′I

=

w′Γ

E w′Γ

(14.52)

can be interpreted as the solution of a problem of the form (14.47), the discrete version of (14.46),with Neumann data Sw′

Γ on Γ . Thus, w′ is the coordinate representation of an element w from theedge space VΓ := V1,2 (see (14.41) !). The interior component w′

I is uniquely determined by w′Γ . We

call E = −A−1I,I AI,Γ the (discrete) harmonic extension operator.

90 Think of a straightforward Galerkin/FEM approximation on ΩΓ with piecewise linear elements.91 Recall that, for any pair of nodes x and y in ΩΓ , the corresponding entry Ax,y in the stiffness matrix is given by∫

ΩΓ∇ϕx∇ϕy , where ϕx and ϕy are the nodal basis functions (hat functions) associated with these nodes.

Ed. 2011 Iterative Solution of Large Linear Systems

Page 175: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.4 Introduction to domain decomposition techniques 171

• Action ofS : For any given (discrete) Dirichlet data w′Γ on the interface Γ , consider its discrete

harmonic extension (14.52). Relation (14.51) leads to the following interpretation:

g′Γ = S w′Γ = a discretized version of the jump [∂nw] across the interface Γ (14.53)

i.e., a discrete Dirichlet-to-Neumann map.

• Action ofS−1 : For any given (discrete) Neumann g′Γ on the interface Γ , relation (14.51) leads tothe following interpretation:

w′Γ = S−1 g′Γ = solution of (14.47) evaluated at the interface Γ (14.54)

i.e., a discrete Neumann-to-Dirichlet map. The inner component w′I of the solution w

′ is the discreteharmonic extension of w′

Γ .

Note that for the solution of the original, continuous problem (14.45), a similar interpretion in terms of aharmonic extension can be given. The corresponding Neumann-to-Dirichlet map is called the Poincare-Steklov operator. For the precise theoretical foundation of this clever construction in the PDE context,see [15]; it involves a multi-domain formulation using local Poincare-Steklov operators and harmonicextensions.

Exercise 14.19 Show that, with appropriate subblocking,

AΓ,Γ AΓ,I

AI,Γ AI,I

=

AΓ,Γ AΓ,1 AΓ,2

A1,Γ A1,1 0

A2,Γ 0 A2,2

(14.55)

the Schur complement (14.49) can be expressed as

S = AΓ,Γ −AΓ,1A−11,1A1,Γ −AΓ,2A

−12,2A2,Γ (14.56)

Furthermore, note that AΓ,Γ can also be assembled from two contributions, AΓ,Γ = A(1)Γ,Γ +A

(2)Γ,Γ .

In the case of a single domain Ω = ΩΓ with two subdomains Ω1 and Ω2 , we see that, from the computationalpoint of view, the solution of the original problem

−∆u = f on ΩΓ , u = 0 on ∂ΩΓ (14.57)

amounts, after discretization, to the following steps:

• Assemble the stiffness matrices in (14.55).

• For i = 1, 2 , compute the inverse matrices A−1i,i , and determine the discrete approximations u′i of the

subdomain problems−∆ui = f on ΩI , u = 0 on ∂ΩI

• Extend u′1 by 0 to Ω2 and vice versa.

• Assemble the Schur complement (see (14.56))

S = AΓ,Γ − AΓ,1A−11,1A1,Γ − AΓ,2A

−12,2A2,Γ

• Solve the Schur complement system, i.e., the discrete interface problem (14.48),

S w′Γ = g′Γ

Iterative Solution of Large Linear Systems Ed. 2011

Page 176: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

172 14 SUBSTRUCTURING METHODS

• Extend w′Γ to w′ via discrete harmonic extension (14.50),

w′I = E w′

Γ = −A−1I,I AI,Γw

′Γ

• Determine the discrete solution u′ on ΩΓ as

u′ = u′1 + u′2 + w′

By construction, this is the exact discrete solution.

The preconditioner:

Finally, returning to the case of a general decomposition into an arbitrary number of subdomains Ωi , wecan now describe our non-overlapping ASM preconditioner, which is formally defined by (14.43), in precisedetail:

• Solve the given problem on the coarse triangulation TH , giving rise to u0 .

• Apply the local solution procedure, over all subdomains Ωi and Ωj and all edge patches Ωi,j . Thisgives local contributions

– ui defined over Ωi and extended by 0 outside Ωi , and

– ui,j defined over all edge patches Ωi,j , which are also extended by 0 outside Ωi,j .

• Adding up defines the action of the preconditioner (cf. 14.42)):

u = u0 +N∑i=1

ui +∑i,j

Γi,j = ∅

ui,j

In particular, to determine the ui,j , subproblem (14.57) is solved, in the way as described above, for all edgepatches ΩΓ = Ωi,j . All interior local subproblems are of the original type −∆u = f , with homogeneousboundary conditions as in the overlapping case. However, the edge subproblems are set up in a moresubtle way described above, using discrete harmonic extensions. This patching procedure realizes a moresubtle way of communication between local subdomains, compared with overlapping.

In fact, our non-overlapping method has an ‘overlapping taste’, because the edge patches Ωi,j have someoverlap. However, the method is non-overlapping in an ‘algebraic’ sense; it is more easily parallelized thanthe overlapping method from Section 14.4. In particular, with subblocking as in Exercise 14.55, assemblingthe Schur complement matrices for the edge systems can be readily parallelized over the subdomains.

Theory shows that the combined strategy involving a globally defined, coarse ‘skeleton’ approximation u0is important for the successful performance of the preconditioner; see [15],[18].

Remark 14.20 Concerning the concrete implementation of the preconditioner, realizing it by explicitlycomputing all the (rather small) local inverses Ai,i and using this for setting up the Schur complementmatrices (14.56) is a common technique. This can be interpreted in the sense that, in a first step of theprocess, all interior nodes are eliminated, which is called static condensation. However, from (14.56) wealso see that, actually, not the ‘complete’ inverses A−1

i,i are required but only A−1i,i Ai,Γ , which has some

potential for further optimization of the computational effort.

On the other hand, if iterative approximate local solvers are to be applied for the Schur complementsystems, explicit inversion of the Ai,i can be avoided: Each evaluation of a matrix-vector product u 7→ S uinvolves the invocation of two subdomain solvers, which can be performed in parallel.

Ed. 2011 Iterative Solution of Large Linear Systems

Page 177: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

14.4 Introduction to domain decomposition techniques 173

Remark 14.21 Various domain decomposition methods for constructing efficient preconditioners, in par-ticular for complex geometries are the topic of many recent research activities, see [15], [18]. Anotherprominent class are the so-called FETI techniques (‘Finite Element Tearing and Interconnecting’). Oneof these techniques is based on solving Neumann problems on the individual subdomains Ωi , where theNeumann data are taken from a current approximation. Then, Dirichlet problems are solved on the Ωi ,where the Dirichlet data at the interfaces are derived from the jumps in the solutions of the neighboringNeumann problems at the interfaces (here, Neumann-to-Dirichlet maps are involved). The traces at theinterfaces of the solutions of these Dirichlet problems are then employed to obtain improved Neumanndata; in this way, the procedure can be iterated.

Recall that all these techniques are usually applied as preconditioners for Krylov subspace methods appliedto the original system. Another common way of expressing this cooperation is that ‘the preconditioneris accelerated by the Krylov subspace method’. Note that in many cases, multigrid techniques or otherapproximate local solvers are applied to solve the subproblems in involved in a preconditioning step.

Iterative Solution of Large Linear Systems Ed. 2011

Page 178: Iterative Solution of Large Linear Systems - TU Wienwinfried/teaching/101.403/SS2012/downloads/... · 3 Direct Solution Methods 10 ... 6 Chebyshev Acceleration and Semi-iterative

174 REFERENCES

References

[1] P.R. Amestoy, T.A. Davis, and I.S. Duff. An approximate minimum degree ordering algorithm. SIAMJ. Matrix Anal., 17(4):886–905, 1996.

[2] W. Auzinger, D. Praetorius. Numerische Mathematik. Lecture Notes, TU Wien, 2007.

[3] S.C. Brenner and L.R. Scott. The Mathematical Theory of Finite Element Methods. Springer, 2002.

[4] W.K. Briggs, V.E. Henson, and S.F. McCormick. A Multigrid Tutorial. SIAM, 2000.

[5] I. Duff, A.M. Erisman, and J.K. Reid. Direct Methods for Sparse Matrices. Oxford Clarendon Press,1992.

[6] M. Eiermann and O. Ernst. Geometric aspects of the theory of Krylov subspace methods. In A. Iserles,editor, Acta Numerica 2001, pages 251–312. Cambridge University Press, 2001.

[7] A. George. Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal., 10:345–363,1973.

[8] A. George and J.W.H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, 1981.

[9] G.H. Golub and C.F. Van Loan Matrix Computations. 3. ed., John Hopkins University Press, 1996.

[10] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied Math-ematical Sciences. Springer, 1994.

[11] A. Iserles. A First Course in the Numerical Analysis of Differential Equations, Cambridge UniversityPress, 1995.

[12] C. Kanzow. Numerik linearer Gleichungssysteme. Direkte und iterative Verfahren, Springer-Lehrbuch,2005.

[13] C.D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2000.

[14] G. Meurant. Gaussian elimination for the solution of linear systems of equations. In Handbook ofNumerical Analysis, Vol. VII, pages 3–172. North-Holland, 2000.

[15] A. Quarteroni and A. Valli. Domain Decomposition Methods for Partial Differential equations. OxfordScience Publications, 1999.

[16] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS, 1. ed., 1996.

[17] W.F. Tinney and J.W. Walker. Direct solutions of sparse network equations by optimally orderedtriangular factorization. Proc. of the IEEE, 55:1801–1809, 1967.

[18] A. Toselli and O. Widlund. Domain Decomposition Methods - Algorithms and Theory. Springer, 2005.

[19] L.N. Trefethen, D. Bau. Numerical Linear Algebra. SIAM, 1997.

[20] W. Zulehner. Numerische Mathematik. Eine Einfuhrung anhand von Differentialgleichungsproblemen.Band 1: Stationare Probleme. Birkhauser, 2008.

Ed. 2011 Iterative Solution of Large Linear Systems


Recommended