
Iterative Algorithms in Inverse Problems

Charles L. Byrne

January 10, 2006


Contents

I Preliminaries

1 Preface

2 Introduction
  2.1 Algorithms and Operators
    2.1.1 Dynamical Systems
    2.1.2 Steepest Descent Minimization
    2.1.3 The Newton-Raphson Algorithm
    2.1.4 Newton-Raphson and Chaos
    2.1.5 Selecting the Operator
  2.2 Operators on Euclidean Space
    2.2.1 Non-Expansive Operators
    2.2.2 Strict Contractions
    2.2.3 Averaged Operators
    2.2.4 Affine Linear and Linear Operators
  2.3 Projection Operators
  2.4 Paracontractive Operators
    2.4.1 Linear and Affine Paracontractions
  2.5 Operators Related to a Gradient
    2.5.1 Constrained Minimization
  2.6 Systems of Linear Equations
    2.6.1 Exact Solutions
    2.6.2 Optimization and Approximate Solutions
    2.6.3 Splitting Methods
  2.7 Positive Solutions of Linear Equations
    2.7.1 Cross-Entropy
    2.7.2 The EMML and SMART algorithms
    2.7.3 Acceleration
    2.7.4 Entropic Projections onto Hyperplanes
  2.8 Sensitivity to Noise
    2.8.1 Norm Constraints
  2.9 Constrained Optimization
    2.9.1 Convex Feasibility and Split Feasibility
    2.9.2 Algorithms
  2.10 Bregman Projections and the SGP
  2.11 The Multiple-Distance SGP (MSGP)
  2.12 Linear Programming
  2.13 Applications

II Fixed-Point Iterative Algorithms

3 Convergence Theorems
  3.1 Fixed Points of Iterative Algorithms
  3.2 Convergence Theorems for Iterative Algorithms
    3.2.1 Strict Contractions
  3.3 Paracontractive Operators
  3.4 Averaged Non-expansive Operators
  3.5 Projection onto Convex Sets
  3.6 Generalized Projections

4 Averaged Non-expansive Operators
  4.1 Convex Feasibility
  4.2 Constrained Optimization
  4.3 Solving Linear Systems
    4.3.1 The Landweber Algorithm
    4.3.2 Splitting Algorithms
  4.4 Averaged Non-expansive Operators
    4.4.1 Properties of Averaged Operators
    4.4.2 Averaged Linear Operators
  4.5 The KM Theorem
  4.6 The De Pierro-Iusem Approach

5 Paracontractive Operators
  5.1 Paracontractions and Convex Feasibility
  5.2 The EKN Theorem
  5.3 Linear and Affine Paracontractions
    5.3.1 Back-propagation-of-error Methods
    5.3.2 Defining the Norm
    5.3.3 Proof of Convergence

6 Bregman-Paracontractive Operators
  6.1 Bregman Paracontractions
    6.1.1 Entropic Projections
    6.1.2 Weighted Entropic Projections
  6.2 Extending the EKN Theorem
  6.3 Multiple Bregman Distances
    6.3.1 Assumptions and Notation
    6.3.2 The Algorithm
    6.3.3 A Preliminary Result
    6.3.4 Convergence of the Algorithm

III Systems of Linear Equations

7 The Algebraic Reconstruction Technique
  7.1 The ART
  7.2 Calculating the ART
  7.3 When Ax = b Has Solutions
  7.4 When Ax = b Has No Solutions
    7.4.1 Subsequential Convergence of ART
    7.4.2 The Geometric Least-Squares Solution
    7.4.3 Nonnegatively Constrained ART
  7.5 Avoiding the Limit Cycle
    7.5.1 Double ART (DART)
    7.5.2 Strongly Underrelaxed ART
  7.6 Approximate Solutions and the Nonnegativity Constraint

8 Simultaneous ART
  8.1 Cimmino's Algorithm
  8.2 The Landweber Algorithms
    8.2.1 Finding the Optimum γ
    8.2.2 The Projected Landweber Algorithm
  8.3 An Upper Bound for the Maximum Eigenvalue of A†A
    8.3.1 The Normalized Case
    8.3.2 The General Case
    8.3.3 Upper Bounds for -Sparse Matrices

9 Jacobi and Gauss-Seidel Methods
  9.1 The Jacobi and Gauss-Seidel Methods: An Example
  9.2 Splitting Methods
  9.3 Some Examples of Splitting Methods
  9.4 Jacobi's Algorithm and JOR
    9.4.1 The JOR in the Nonnegative-definite Case
  9.5 The Gauss-Seidel Algorithm and SOR
    9.5.1 The SOR in the Nonnegative-definite Case

IV Positivity in Linear Systems

10 The Multiplicative ART (MART)
  10.1 A Special Case of ART and MART
  10.2 MART in the General Case
  10.3 ART and MART as Sequential Projection Methods
    10.3.1 Cross-Entropy or the Kullback-Leibler Distance
    10.3.2 Weighted KL Projections
  10.4 Proof of Convergence for MART
  10.5 Comments on the Rate of Convergence of MART

11 The Simultaneous MART (SMART)
  11.1 The SMART Iteration
  11.2 The SMART as a Generalized Projection Method
  11.3 Proof of Convergence of the SMART
  11.4 Remarks on the Rate of Convergence of the SMART
  11.5 Block-Iterative SMART
    11.5.1 The Rescaled Block-Iterative SMART

12 Expectation Maximization Maximum Likelihood (EMML)
  12.1 The EMML Iteration
  12.2 Proof of Convergence of the EMML Algorithm
    12.2.1 Some Pythagorean Identities Involving the KL Distance
  12.3 Block-Iterative EMML Iteration
    12.3.1 A Row-Action Variant of EMML

13 Rescaled Block-Iterative (RBI) Methods
  13.1 Block-Iterative Methods
  13.2 The SMART and the EMML method
  13.3 Ordered-Subset Versions
  13.4 The RBI-SMART
  13.5 The RBI-EMML
  13.6 RBI-SMART and Entropy Maximization

V Stability

14 Sensitivity to Noise
  14.1 Where Does Sensitivity Come From?
    14.1.1 The Singular-Value Decomposition of A
    14.1.2 The Inverse of Q = A†A
    14.1.3 Reducing the Sensitivity to Noise
  14.2 Iterative Regularization in ART
  14.3 A Bayesian View of Reconstruction
  14.4 The Gamma Prior Distribution for x
  14.5 The One-Step-Late Alternative
  14.6 Regularizing the SMART
  14.7 De Pierro's Surrogate-Function Method
  14.8 Block-Iterative Regularization

15 Feedback in Block-Iterative Reconstruction
  15.1 Feedback in ART
  15.2 Feedback in RBI methods
    15.2.1 The RBI-SMART
    15.2.2 The RBI-EMML

VI Optimization

16 Iterative Optimization
  16.1 Functions of a Single Real Variable
  16.2 Functions of Several Real Variables
    16.2.1 Cauchy's Inequality for the Dot Product
    16.2.2 Directional Derivatives
    16.2.3 Constrained Minimization
    16.2.4 An Example
  16.3 Gradient Descent Optimization
  16.4 The Newton-Raphson Approach
    16.4.1 Functions of a Single Variable
    16.4.2 Functions of Several Variables
  16.5 Other Approaches

17 Conjugate-Direction Methods in Optimization
  17.1 Backpropagation-of-Error Methods
  17.2 Iterative Minimization
  17.3 Quadratic Optimization
  17.4 Conjugate Directions
  17.5 The Conjugate Gradient Method

18 Convex Sets and Convex Functions
  18.1 Optimizing Functions of a Single Real Variable
    18.1.1 The Convex Case
  18.2 Optimizing Functions of Several Real Variables
    18.2.1 The Convex Case
  18.3 Convex Feasibility
    18.3.1 The SOP for Hyperplanes
    18.3.2 The SOP for Half-Spaces
  18.4 Optimization over a Convex Set
    18.4.1 Linear Optimization over a Convex Set
  18.5 Geometry of Convex Sets
  18.6 Projecting onto the Intersection of Convex Sets
    18.6.1 A Motivating Lemma
    18.6.2 Dykstra's Algorithm

19 Generalized Projections onto Convex Sets
  19.1 Bregman Functions and Bregman Distances
  19.2 The Successive Generalized Projections Algorithm
  19.3 Bregman's Primal-Dual Algorithm
  19.4 Dykstra's Algorithm for Bregman Projections
    19.4.1 A Helpful Lemma

20 An Interior-Point Optimization Method
  20.1 The Multiprojection Successive Generalized Projection Method
  20.2 An Interior-Point Algorithm (IPA)
  20.3 The MSGP Algorithm
    20.3.1 Assumptions and Notation
    20.3.2 The MSGP Algorithm
    20.3.3 A Preliminary Result
    20.3.4 The MSGP Convergence Theorem
  20.4 An Interior-Point Algorithm for Iterative Optimization
    20.4.1 Assumptions
    20.4.2 The IPA
    20.4.3 Motivating the IPA
    20.4.4 Preliminary results for the IPA

21 Linear Programming
  21.1 Primal and Dual Problems
    21.1.1 Canonical and Standard Forms
    21.1.2 Weak Duality
    21.1.3 Strong Duality
  21.2 The Simplex Method

22 Systems of Linear Inequalities
  22.1 Projection onto Convex Sets
  22.2 Solving Ax = b
    22.2.1 When the System Ax = b is Consistent
    22.2.2 When the System Ax = b is Inconsistent
  22.3 The Agmon-Motzkin-Schoenberg algorithm
    22.3.1 When Ax ≥ b is Consistent
    22.3.2 When Ax ≥ b is Inconsistent

23 The Split Feasibility Problem
  23.1 The CQ Algorithm
  23.2 Particular Cases of the CQ Algorithm
    23.2.1 The Landweber algorithm
    23.2.2 The Projected Landweber Algorithm
    23.2.3 Convergence of the Landweber Algorithms
    23.2.4 The Simultaneous ART (SART)
    23.2.5 More on the CQ Algorithm

24 Constrained Iteration Methods
  24.1 Modifying the KL distance
  24.2 The ABMART Algorithm
  24.3 The ABEMML Algorithm

25 Fourier Transform Estimation
  25.1 The Limited-Fourier-Data Problem
  25.2 Minimum-Norm Estimation
    25.2.1 The Minimum-Norm Solution of Ax = b
    25.2.2 Minimum-Weighted-Norm Solution of Ax = b
  25.3 Fourier-Transform Data
    25.3.1 The Minimum-Norm Estimate
    25.3.2 Minimum-Weighted-Norm Estimates
    25.3.3 Implementing the PDFT
  25.4 The Discrete PDFT (DPDFT)
    25.4.1 Calculating the DPDFT
    25.4.2 Regularization

VII Appendices

26 Basic Concepts
  26.1 The Geometry of Euclidean Space
    26.1.1 Inner Products
    26.1.2 Cauchy's Inequality
    26.1.3 Hyperplanes in Euclidean Space
    26.1.4 Convex Sets in Euclidean Space
  26.2 Analysis in Euclidean Space
  26.3 Basic Linear Algebra
    26.3.1 Systems of Linear Equations
    26.3.2 The Fundamental Subspaces
  26.4 Linear and Nonlinear Operators
    26.4.1 Linear and Affine Linear Operators
    26.4.2 Orthogonal Projection onto Convex Sets
    26.4.3 Gradient Operators

27 Metric Spaces and Norms
  27.1 Metric Spaces
  27.2 Norms
    27.2.1 The 1-norm
    27.2.2 The ∞-norm
    27.2.3 The 2-norm
    27.2.4 Weighted 2-norms
  27.3 Eigenvalues and Matrix Norms
    27.3.1 Matrix Norms
  27.4 The Euclidean Norm of a Square Matrix
    27.4.1 Diagonalizable Matrices
    27.4.2 Gerschgorin's Theorem
    27.4.3 Strictly Diagonally Dominant Matrices

28 Bregman-Legendre Functions
  28.1 Essential smoothness and essential strict convexity
  28.2 Bregman Projections onto Closed Convex Sets
  28.3 Bregman-Legendre Functions
  28.4 Useful Results about Bregman-Legendre Functions

29 Detection and Classification
  29.1 Estimation
    29.1.1 The simplest case: a constant in noise
    29.1.2 A known signal vector in noise
    29.1.3 Multiple signals in noise
  29.2 Detection
    29.2.1 Parametrized signal
  29.3 Discrimination
    29.3.1 Channelized Observers
    29.3.2 An Example of Discrimination
  29.4 Classification
    29.4.1 The Training Stage
    29.4.2 Our Example Again
  29.5 More realistic models
    29.5.1 The Fisher linear discriminant

30 Tomography
  30.1 X-ray Transmission Tomography
    30.1.1 The Exponential-Decay Model
    30.1.2 Reconstruction from Line Integrals
    30.1.3 The Algebraic Approach
  30.2 Emission Tomography
    30.2.1 Maximum-Likelihood Parameter Estimation
  30.3 Image Reconstruction in Tomography

31 Magnetic-Resonance Imaging
  31.1 An Overview of MRI
  31.2 The External Magnetic Field
  31.3 The Received Signal
    31.3.1 An Example of G(t)
    31.3.2 Another Example of G(t)

32 Hyperspectral Imaging

33 Farfield Propagation
  33.1 The Solar-Emission Problem
  33.2 The One-Dimensional Case
    33.2.1 The Plane-Wave Model
  33.3 Fourier-Transform Pairs
    33.3.1 The Fourier Transform
    33.3.2 Sampling
    33.3.3 Reconstructing from Fourier-Transform Data
    33.3.4 An Example
  33.4 The Dirac Delta
  33.5 Practical Limitations
    33.5.1 Convolution Filtering
    33.5.2 Low-Pass Filtering
  33.6 Point Sources as Dirac Deltas
  33.7 The Limited-Aperture Problem
    33.7.1 Resolution
    33.7.2 The Solar-Emission Problem Revisited
  33.8 Discrete Data
    33.8.1 Reconstruction from Samples
  33.9 The Finite-Data Problem
  33.10 Functions of Several Variables
    33.10.1 Two-Dimensional Farfield Object
    33.10.2 Two-Dimensional Fourier Transforms
    33.10.3 Two-Dimensional Fourier Inversion
    33.10.4 Limited Apertures in Two Dimensions
  33.11 Broadband Signals
  33.12 The Laplace Transform and the Ozone Layer
    33.12.1 The Laplace Transform
    33.12.2 Scattering of Ultraviolet Radiation
    33.12.3 Measuring the Scattered Intensity
    33.12.4 The Laplace Transform Data
  33.13 Summary

Bibliography

Index


Part I

Preliminaries



Chapter 1

Preface

VALENTINE: What she's doing is, every time she works out a value for y, she's using that as her next value for x. And so on. Like a feedback. She's feeding the solution into the equation, and then solving it again. Iteration, you see. ... This thing works for any phenomenon which eats its own numbers.

HANNAH: What I don't understand is... why nobody did this feedback thing before - it's not like relativity, you don't have to be Einstein.

VALENTINE: You couldn't see to look before. The electronic calculator was what the telescope was for Galileo.

HANNAH: Calculator?

VALENTINE: There wasn't enough time before. There weren't enough pencils. ... Now she'd only have to press a button, the same button, over and over. Iteration. ... And so boring!

HANNAH: Do you mean that was the only problem? Enough time? And paper? And the boredom?

VALENTINE: Well, the other thing is, you'd have to be insane.

(Act 1, Scene 4: Arcadia, by Tom Stoppard)


The well known formula for solving a quadratic equation produces the answer in a finite number of calculations; it is a non-iterative method, if we are willing to accept a square-root symbol in our answer. Similarly, Gauss elimination gives the solution to a system of linear equations, if there is one, in a finite number of steps; it, too, is a non-iterative method. A typical iterative algorithm (the name comes from the Latin word iterum, meaning "again") involves a relatively simple calculation, performed repeatedly. An iterative method produces a sequence of approximate answers that, in the best case, converges to the solution. The characters in Stoppard's play are discussing the apparent anticipation, by a (fictional) teenage girl in 1809, of the essential role of iterative algorithms in chaos theory and fractal geometry. A good example of an iterative algorithm is the bisection method for finding a root of a real-valued continuous function f (x) of the real variable x: begin with an interval [a, b] such that f (a)f (b) < 0 and then replace one of the endpoints with the average (a + b)/2, maintaining the negative product. The length of each interval so constructed is half the length of the previous interval and each interval contains a root. In the limit, the two sequences defined by the left endpoints and right endpoints converge to the same root.
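To make the bisection procedure concrete, here is a minimal Python sketch of the method just described; the test function, interval, and tolerance are illustrative choices, not taken from the text.

    def bisect(f, a, b, tol=1e-10):
        """Bisection: halve [a, b] while keeping f(a)f(b) < 0."""
        if f(a) * f(b) >= 0:
            raise ValueError("need f(a)f(b) < 0")
        while b - a > tol:
            m = (a + b) / 2.0
            if f(a) * f(m) <= 0:   # a root lies in [a, m]
                b = m
            else:                  # a root lies in [m, b]
                a = m
        return (a + b) / 2.0

    # Example: a root of f(x) = x^2 - 2, approximating sqrt(2).
    print(bisect(lambda x: x * x - 2, 1.0, 2.0))   # ~1.41421356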

Iterative algorithms are used to solve problems for which there is no non-iterative solution method, as well as problems for which non-iterative methods are impractical, such as using Gauss elimination to solve a system of thousands of linear equations in thousands of unknowns. We may want to find a root of f (x) = x2 − 2 in order to approximate √2, or to solve an algebraic equation, such as x = tan x, by writing the equation as f (x) = x − tan x = 0. On the other hand, we may want a root of f (x) because f (x) is the derivative of another function, say F (x), that we wish to optimize. If our goal is to minimize F (x), we may choose, instead, to generate an iterative sequence {xk}, k = 0, 1,..., that converges to a minimizer of F (x).

Iterative algorithms are often formulated as fixed-point methods: the equation f (x) = 0 is equivalent to x = f (x) + x = g(x), so we may try to find a fixed point of g(x), that is, an x for which g(x) = x.

The idea of using iterative procedures for solving problems is an ancient one. Archimedes' use of the areas of inscribed and circumscribed regular polygons to estimate the area of a circle is a famous instance of an iterative procedure, as is his method of exhaustion for finding the area of a section of a parabola.

It is not our aim here to describe all the various problems that can be solved by iterative methods. We shall focus on iterative methods currently being used in inverse problems, with special attention to remote-sensing applications, such as image reconstruction from tomographic data in medical diagnostics and acoustic array signal processing. Such methods include those for solving large systems of linear equations, with and without constraints, optimization techniques, such as likelihood and entropy maximization, data-extrapolation procedures, and algorithms for convex feasibility problems.

Throughout these discussions we shall be concerned with the speed of the algorithms, as well as their sensitivity to noise or errors in the data; methods for accelerating and regularizing the algorithms will be treated in detail.

The iterative algorithms we discuss take the form xk+1 = T xk, where T is some (usually nonlinear) continuous operator on the space of J-dimensional complex or real vectors. If the sequence {xk} converges to x∗, then T x∗ = x∗, that is, x∗ is a fixed point of T. To be sure that the sequence {xk} converges, we need to know that T has fixed points, but we need more than that.

We shall focus on two broad classes of operators, those that are averaged, non-expansive with respect to the Euclidean vector norm, and those that are paracontractive with respect to some vector norm. Convergence for the first class of operators is a consequence of the Krasnoselskii/Mann (KM) Theorem, and the Elsner/Koltracht/Neumann (EKN) Theorem establishes convergence for the second class. The definitions of these classes are derived from basic properties of orthogonal projection operators, which are members of both classes.

In many remote-sensing applications, the (discretized) object sought is naturally represented as a vector with nonnegative entries. For such problems, we can incorporate nonnegativity in the algorithms through the use of projections with respect to entropy-based distances. These algorithms are often developed by analogy with those methods using orthogonal projections. As we shall see, this analogy can often be further exploited to derive convergence theorems.

The cross-entropy distance is just one example of a Bregman distance. The notion of an operator being paracontractive, with respect to a norm, can be extended to being paracontractive, with respect to a Bregman distance. Bregman projections onto convex sets are paracontractive in this generalized sense, as are many of the operators of interest. The EKN Theorem and many of its corollaries can be extended to operators that are paracontractive, with respect to Bregman distances.

We begin with an overview of the algorithms and their applications.


Chapter 2

Introduction

All iterative algorithms generate a sequence {xk} of vectors. The sequence may converge for any starting vector x0, or may converge only if x0 is sufficiently close to the solution. The limit, when it exists, may depend on x0, and may, or may not, solve the original problem. Convergence to the limit may be slow and the algorithm may need to be accelerated. The algorithm may involve measured data. The limit may be sensitive to noise in the data and the algorithm may need to be regularized to lessen this sensitivity. The algorithm may be quite general, applying to all problems in a broad class, or it may be tailored to the problem at hand. Each step of the algorithm may be costly, but only a few steps are generally needed to produce a suitable approximate answer; or each step may be easily performed, but many such steps may be needed. Although convergence of an algorithm is important, theoretically, in practice only a few iterative steps may be used.

2.1 Algorithms and Operators

For most of the iterative algorithms we shall consider, the iterative step is

xk+1 = T xk,

for some operator T. The behavior of the algorithm will then depend on the properties of the operator T. If T is a continuous operator (and it usually is), and the sequence {xk} converges to x, then T x = x, that is, x is a fixed point of the operator T.

2.1.1 Dynamical Systems

The simplest differential equation describing population dynamics is

p′(t) = ap(t),


with exponential solutions. More realistic models impose limits to growth, and may take the form

p′(t) = a(L − p(t)) p(t),

where L is an asymptotic limit for p(t). Discrete versions of the limited-population problem then have the form

xk+1 − xk = a(L − xk)xk,

which, for zk = xk/L, can be written as

zk+1 = r(1 − zk)zk;

we shall assume that r > 0. With T z = r(1 − z)z = f (z) and zk+1 = T zk, we are interested in the behavior of the sequence, as a function of r.

The operator T has a fixed point at z∗ = 0 always, and another one at z∗ = 1 − 1/r, if r > 1.

Exercise 2.1 A fixed point z∗ of f (z) is said to be stable if |f′(z∗)| < 1. Show that z∗ = 0 is stable if r < 1, while z∗ = 1 − 1/r is stable if 1 < r < 3.

In fact, if r = 2, z∗ = 1 − 1/r = 1/2 is superstable and convergence is quite rapid, since f′(1/2) = 0. What happens for r > 3 is most interesting and is a primary topic in any book on chaos and dynamical systems [108].
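A few lines of Python illustrate this dependence on r; the particular values r = 2 and r = 3.9, the starting point, and the iteration count are illustrative choices only.

    def iterate_logistic(r, z0, n):
        """Iterate z_{k+1} = r(1 - z_k) z_k and return the final iterate."""
        z = z0
        for _ in range(n):
            z = r * (1.0 - z) * z
        return z

    print(iterate_logistic(2.0, 0.1, 50))   # converges to z* = 1 - 1/2 = 0.5
    print(iterate_logistic(3.9, 0.1, 50))   # no convergence; behaves chaotically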

2.1.2 Steepest Descent Minimization

Suppose that we want to minimize a real-valued function f : RJ → R. At each x the direction of greatest decrease of f is the negative of the gradient, −∇f (x). The steepest descent method has the iterative step

xk+1 = xk − αk∇f (xk),

where, ideally, the step-length parameter αk would be chosen so as to minimize f (x) in the chosen direction, that is, the choice of α = αk would minimize

f (xk − α∇f (xk)).

In practice, it is difficult, if not impossible, to determine the optimal value of αk at each step. Therefore, a line search is usually performed to find a suitable αk, meaning that values of f (xk − α∇f (xk)) are calculated, for some finite number of α values, to determine a suitable choice for αk.

For practical reasons, we are interested in iterative algorithms that avoid line searches. Some of the minimization algorithms we shall study take the form

xk+1 = xk − α∇f (xk),


where the α is a constant, selected at the beginning of the iteration. Such iterative algorithms have the form xk+1 = T xk, for T the operator defined by

T x = x − α∇f (x).

When properly chosen, the α will not be the optimal step-length parameter for every step of the iteration, but will be sufficient to guarantee convergence. In addition, the resulting iterative sequence is often monotonically decreasing, which means that

f (xk+1) < f (xk),

for each k.

We shall discuss other iterative monotone methods, such as the EMML and SMART algorithms, that can be viewed as generalized steepest descent methods taking the form

(xk+1)j = (xk)j − αk,j∇f (xk)j .

In these cases, the step-length parameter αk is replaced by ones that also vary with the entry index j. While this may seem even more complicated to implement, for the algorithms mentioned, these αk,j are automatically calculated as part of the algorithm, with no line searches involved.
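As a simple illustration of the fixed step-length iteration xk+1 = xk − α∇f (xk), the following Python sketch minimizes a quadratic; the objective, starting point, and value of α are hypothetical examples, not taken from the text.

    import numpy as np

    def gradient_descent(grad, x0, alpha, n_iter=200):
        """Fixed step-length descent: x <- x - alpha * grad(x)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            x = x - alpha * grad(x)
        return x

    # Minimize f(x) = ||x - c||^2, whose gradient is 2(x - c).
    c = np.array([1.0, -2.0])
    x_min = gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0], alpha=0.1)
    print(x_min)   # approaches c = [1, -2]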

2.1.3 The Newton-Raphson Algorithm

The well known Newton-Raphson (NR) iterative algorithm for finding a root of a function g : R → R has the iterative step

xk+1 = xk − g(xk)/g′(xk).

The operator T is now the ordinary function

T x = x − g(x)/g′(x).

If  g is a vector-valued function, g : RJ  → RJ , then g(x) has the form

g(x) = (g1(x),...,gJ (x))T ,

where gj : RJ → R are the component functions of g(x). The NR iterative step is

xk+1 = xk − [J (g)(xk)]−1g(xk),

where J (g)(x) is the Jacobian matrix of first partial derivatives of the component functions of g; that is, its entries are ∂gm/∂xj(x). The operator T is now

T x = x − [J (g)(x)]−1g(x).

Convergence of the NR algorithm is not guaranteed and depends on the starting point being sufficiently close to the answer. When it does converge, however, it does so fairly rapidly. In both the scalar and vector cases, the limit is a fixed point of T, and therefore a root of g(x).
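A minimal Python sketch of the scalar NR iteration follows; the example function, starting point, and iteration count are illustrative only.

    def newton_raphson(g, gprime, x0, n_iter=20):
        """NR iteration: x <- x - g(x)/g'(x)."""
        x = x0
        for _ in range(n_iter):
            x = x - g(x) / gprime(x)
        return x

    # Find the positive root of g(x) = x^2 - 2, starting near the answer.
    print(newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.5))   # ~1.41421356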


2.1.4 Newton-Raphson and Chaos

It is interesting to consider how the behavior of the NR iteration depends on the starting point.

Some Apparently Simple Cases

The complex-valued function f (z) = z2 − 1 of the complex variable z has two roots, z = 1 and z = −1.

Exercise 2.2 Show that the NR method for finding a root now has the iterative step

zk+1 = T zk = zk/2 + 1/(2zk).

If z0 is selected closer to z = 1 than to z = −1 then the iterative sequence converges to z = 1; similarly, if z0 is closer to z = −1, the limit is z = −1. If z0 is on the vertical axis of points with real part equal to zero, then the sequence does not converge, and is not even defined for z = 0. This axis separates the two basins of attraction of the algorithm.

Now consider the function f (z) = z3 − 1, which has the three roots z = 1, z = ω = e2πi/3, and z = ω2 = e4πi/3.

Exercise 2.3 Show that the NR method for finding a root now has the iterative step

zk+1 = T zk = 2zk/3 + 1/(3(zk)2).

Where are the basins of attraction now? Is the complex plane divided up as three people would divide a pizza, into three wedge-shaped slices, each containing one of the roots? Far from it. In fact, it can be shown that, if the sequence starting at z0 = a converges to z = 1 and the sequence starting at z0 = b converges to ω, then there is a starting point z0 = c, closer to a than b is, whose sequence converges to ω2. For more details and beautiful colored pictures illustrating this remarkable behavior, see [108].
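The following sketch, an illustration not taken from the text, applies the iteration of Exercise 2.3 to a small grid of complex starting points and records which root each sequence approaches; classifying a finer grid in the same way and plotting it reveals the intertwined basins described above.

    import numpy as np

    roots = np.array([1.0, np.exp(2j * np.pi / 3), np.exp(4j * np.pi / 3)])

    def nr_basin(z, n_iter=50):
        """Run z <- 2z/3 + 1/(3z^2) and return the index of the nearest root."""
        for _ in range(n_iter):
            z = 2 * z / 3 + 1 / (3 * z * z)
        return int(np.argmin(np.abs(roots - z)))

    # Classify starting points on a coarse grid in the complex plane.
    xs = np.linspace(-1.5, 1.5, 7)
    grid = [[nr_basin(complex(x, y)) if (x or y) else -1 for x in xs] for y in xs]
    for row in grid:
        print(row)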

The Sir Pinski Game

In [108] Schroeder discusses several iterative sequences that lead to fractal or chaotic behavior. The Sir Pinski Game has the following rules. Let P 0 be a point chosen arbitrarily within the interior of the equilateral triangle with vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1). Let V be the vertex closest to P 0 and P 1 chosen so that P 0 is the midpoint of the line segment V P 1. Repeat the process, with P 1 in place of P 0. The game is lost when P n falls outside the original triangle. The objective of the game is to select P 0 that will allow the player to win the game. Where are these winning points?


The inverse Sir Pinski Game is similar. Select any point P 0 in the plane of the equilateral triangle, let V be the most distant vertex, and P 1 the midpoint of the line segment P 0V. Replace P 0 with P 1 and repeat the procedure. The resulting sequence of points is convergent. Which points are limit points of sequences obtained in this way?

The Chaos Game

Schroeder also mentions Barnsley's Chaos Game. Select P 0 inside the equilateral triangle. Roll a fair die and let V = (1, 0, 0) if 1 or 2 is rolled, V = (0, 1, 0) if 3 or 4 is rolled, and V = (0, 0, 1) if 5 or 6 is rolled. Let P 1 again be the midpoint of V P 0. Replace P 0 with P 1 and repeat the procedure. Which points are limits of such sequences of points?
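A short Python sketch of the chaos game as just described; the starting point and the number of points generated are arbitrary illustrative choices.

    import random

    VERTICES = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

    def chaos_game(p0, n_points=10000):
        """Repeatedly replace P with the midpoint of VP for a randomly chosen vertex V."""
        p, points = p0, []
        for _ in range(n_points):
            v = random.choice(VERTICES)   # fair die: two faces per vertex
            p = tuple((vi + pi) / 2.0 for vi, pi in zip(v, p))
            points.append(p)
        return points   # the generated points rapidly approach a fractal subset of the triangle

    pts = chaos_game((1/3, 1/3, 1/3))
    print(pts[-5:])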

2.1.5 Selecting the Operator

Although any iterative algorithm involves a transformation of the current vector xk into the next one, xk+1, it may be difficult, if not impossible, and perhaps useless, to represent that transformation simply as xk+1 = T xk. The transformation that occurs in the bisection method for root-finding is not naturally represented using an operator T. Nevertheless, many algorithms do take the form xk+1 = T xk, as we shall see, and investigating the properties of such operators is an important part of the study of iterative algorithms.

2.2 Operators on Euclidean Space

Much of our attention will be devoted to operators on the finite-dimensional Euclidean spaces RJ, the space of real J-vectors, and CJ, the space of complex J-vectors; we call the space X when what we are saying applies to either of these spaces. Unless otherwise stated, the notation ||x|| will denote the Euclidean norm or Euclidean length of the vector x, given by

||x|| = √( |x1|2 + |x2|2 + · · · + |xJ|2 ).

The Euclidean distance between vectors x and y is ||x − y||. There are other norms that are helpful when developing the theory of iterative algorithms, as we shall see when we come to the Jacobi iterative method, but the Euclidean norm is the most convenient for our purposes.

An operator T on X can be written in terms of its scalar-valued component functions T j(x) as

T x = (T 1(x),...,T J (x))T .


We say that T is continuous if each of the functions T j is continuous, as a scalar-valued function on X. Continuity of T, by itself, will not guarantee the convergence of the iterative scheme xk+1 = T xk, even when Fix(T ), the set of fixed points of T, is non-empty.

An operator T on X is Lipschitz continuous if there is a positive constant λ such that

||T x − T y|| ≤ λ||x − y||,

for all x and y in X.

2.2.1 Non-Expansive Operators

We shall focus on operators T that are non-expansive (ne), which means that, for all vectors x and y in X,

||T x − T y|| ≤ ||x − y||.

Clearly, any ne operator is Lipschitz continuous, for λ = 1. As noted previously, ||x|| denotes the Euclidean norm of the vector x. Even being ne is not enough for convergence, as the example T = −I, I the identity operator, shows.

2.2.2 Strict Contractions

To guarantee convergence of {xk} to a fixed point of T, it is sufficient to assume that T is a strict contraction (sc), which means that there is r in the interval (0, 1) such that, for all x and y in X,

||T x − T y|| ≤ r||x − y||.

As we shall see later, if T is sc, then T has a unique fixed point, say x, and the sequence {xk} converges to x, for every starting vector x0. But being a strict contraction is too strong for our purposes.

2.2.3 Averaged Operators

Many of the operators we need to study have multiple fixed points. For example, the orthogonal projection onto a hyperplane in X has the entire hyperplane for its fixed-point set. We need a class of operators between the ne operators and the sc ones. The Krasnoselskii-Mann (KM) Theorem shows us how to select this class:

Theorem 2.1 If T = (1 − α)I + αN, for some α in the interval (0, 1) and ne operator N, then {xk} converges to a fixed point of T, whenever Fix(T ) is non-empty.


This theorem suggests that the appropriate class is that of the averaged (av) operators, that is, those T as described in the KM Theorem. The class of averaged operators is quite broad, and includes many of the operators we need to study, in the Euclidean case. Products of av operators are av operators, which is quite helpful in designing algorithms for constrained optimization.

For any operator T on X, we have the following identity relating T to its complement operator, G = I − T :

||x − y||2 − ||T x − T y||2 = 2⟨Gx − Gy, x − y⟩ − ||Gx − Gy||2. (2.1)

This identity, which allows us to transform properties of T into properties of G that might be easier to work with, is helpful in our study of av operators. One reason for our focus on the Euclidean norm is to be able to make use of this identity.

2.2.4 Affine Linear and Linear Operators

An operator B is linear  if, for all scalars α and β  and vectors x and y,

B(αx + βy) = αBx + βBy.

An operator T is affine linear, or just affine, if there is a linear operator B and a vector d, such that, for all vectors x,

T x = Bx + d.

We can see that an affine linear operator T will be ne, sc, or av precisely when its linear component, B, is ne, sc, or av, respectively.

A linear operator B, which we shall view as multiplication by the matrix B, is said to be Hermitian if B = B†; this means that B†, the conjugate transpose of B, is equal to B. The eigenvalues of such linear operators are real and we have the following: B is ne if and only if all its eigenvalues lie in the interval [−1, 1]; B is av if and only if all its eigenvalues lie in the interval (−1, 1]; and B is sc if and only if all its eigenvalues lie in the interval (−1, 1).

When B is not Hermitian, we cannot determine if B is av from its eigenvalues, which need not be real. An alternative approach is to ask if B, and therefore T, is a paracontraction for some vector norm, as discussed below.

2.3 Projection Operators

Several of the problems of interest to us here involve finding a vector that satisfies certain constraints, such as optimizing a function, or lying within certain convex sets. If C is a closed, non-empty convex set in X, and x is any vector, then there is a unique point P C x in C closest to x. This point is called the orthogonal projection of x onto C. The orthogonal projection operator T = P C is an av operator; in fact, it is a firmly non-expansive (fne) operator, which is a class of operators within the class of av operators. If C is a hyperplane, then we can get an explicit description of P C x in terms of x; for general convex sets C, however, we will not be able to express P C x explicitly.
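For the hyperplane case just mentioned, the explicit formula for the orthogonal projection onto H = {x | ⟨a, x⟩ = b} is the standard one, P x = x + (b − ⟨a, x⟩)a/||a||2; the Python sketch below is an illustration with hypothetical data, not a formula taken from this chapter.

    import numpy as np

    def project_hyperplane(x, a, b):
        """Orthogonal projection of x onto H = {x : <a, x> = b}."""
        a = np.asarray(a, dtype=float)
        x = np.asarray(x, dtype=float)
        return x + (b - a.dot(x)) / a.dot(a) * a

    x_proj = project_hyperplane([3.0, 1.0], a=[1.0, 1.0], b=2.0)
    print(x_proj)         # [2.0, 0.0]
    print(x_proj.sum())   # 2.0, so the constraint <a, x> = b is satisfied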

As we shall see in a subsequent chapter, the operators P C have properties that are stronger than being averaged; P C is firmly non-expansive:

⟨P C x − P C y, x − y⟩ ≥ ||P C x − P C y||2,

where the norm is the usual Euclidean norm. It follows from Cauchy's Inequality that

||P C x − P C y|| ≤ ||x − y||,

with equality if and only if

P C x − P C y = α(x − y),

for some scalar α with |α| = 1. But, because

0 ≤ ⟨P C x − P C y, x − y⟩ = α||x − y||2,

it follows that α = 1, and so

P C x − x = P C y − y.

This leads to the definition of  paracontractive operators.

2.4 Paracontractive Operators

A (possibly nonlinear) operator T is said to be a paracontraction (pc), or a paracontractive operator, with respect to a vector norm || · ||, if, for every fixed point y of T, and for every x,

||T x − y|| < ||x − y||,

or T x = x [64]. If T has no fixed points, then T is trivially pc. Being pc does not imply being ne. An operator T is said to be strictly non-expansive (sne), with respect to some vector norm || · ||, if

||T x − T y|| < ||x − y||,

or T x − x = T y − y [85]. Every T that is sne is pc. We have the following convergence result from [64]:


Theorem 2.2 If T  is pc with respect to some vector norm, and  T  has fixed points, then the iterative sequence {xk+1 = T xk} converges to a fixed point of  T , for every starting vector  x0.

Unlike av operators, the product of two or more pc operators may not be pc; the product is pc if the operators share at least one fixed point.

2.4.1 Linear and Affine Paracontractions

Say that the linear operator B is diagonalizable if B has a basis of eigenvectors. In that case let the columns of V be such an eigenvector basis. Then we have V −1BV = L, where L is the diagonal matrix having the eigenvalues of B along its diagonal.

Exercise 2.4 Show that B is diagonalizable if all its eigenvalues are distinct.

We see from the exercise that almost all B are diagonalizable. Indeed, all Hermitian B are diagonalizable.

Suppose that T is an affine linear operator whose linear part B is diagonalizable, and |λ| < 1 for all eigenvalues λ of B that are not equal to one. Let {u1,...,uJ} be linearly independent eigenvectors of B. For each x, we have

x = a1u1 + · · · + aJ uJ ,

for some coefficients aj . Define

||x|| = |a1| + · · · + |aJ |.

Then, T  is pc with respect to this norm.

Exercise 2.5 Show that, if  B is a linear av operator, then  |λ| < 1 for all eigenvalues  λ of  B that are not equal to one.

We see from the exercise that, for the case of affine operators T whose linear part is not Hermitian, instead of asking if T is av, we can ask if T is pc; since B will almost certainly be diagonalizable, we can answer this question by examining the eigenvalues of B.

2.5 Operators Related to a Gradient

The gradient descent method for minimizing a function g : RJ → R has the iterative step

xk+1 = xk − γk∇g(xk),


where the step-length parameter γk is adjusted at each step. If we hold γk = γ fixed, then we have xk+1 = T xk, for

T x = x − γ∇g(x).

We shall seek conditions on g and γ under which the operator T is av, which will then lead to iterative algorithms for minimizing g, with convergence a consequence of the KM Theorem.

2.5.1 Constrained Minimization

If our goal is to minimize g(x) over only those x that are in the closed, convex set C, then we may consider a projected gradient descent method, having the iterative step

xk+1 = P C (xk − γ ∇g(xk)).

When the operator T x = x − γ∇g(x) is av, so is P C T, so the KM Theorem will apply once again.

2.6 Systems of Linear Equations

In remote-sensing problems, including magnetic-resonance imaging, transmission and emission tomography, acoustic and radar array processing, and elsewhere, the data we have measured is related to the object we wish to recover by a linear transformation, often involving the Fourier transform. In the vector case, in which the object of interest is discretized, the vector b of measured data is related to the vector x we seek by linear equations that we write as Ax = b. The matrix A need not be square; there can be infinitely many solutions, or no solutions at all. We may want to calculate a minimum-norm solution, in the under-determined case, or a least-squares solution, in the over-determined case. The vector x may be the vectorization of a two-dimensional image, in which case I, the number of rows, and J, the number of columns of A, can be in the thousands, precluding the use of non-iterative solution techniques. We may have additional prior knowledge about x, such as that its entries are non-negative, which we want to impose as constraints. There is usually noise in measured data, so we may not want an exact solution of Ax = b, even if such solutions exist, but prefer a regularized approximate solution. What we need then are iterative algorithms to solve these problems involving linear constraints.

2.6.1 Exact Solutions

When J ≥ I and the system Ax = b has exact solutions, and we want to calculate one of these, we can choose among many iterative algorithms.


The algebraic reconstruction technique (ART) associates the ith equation in the system with the hyperplane

H i = {x | (Ax)i = bi}.

With P i the orthogonal projection onto H i, and i = k(mod I) + 1, the ART iterative step is

xk+1 = P ixk.

The operators P i are av, so the product

T  = P I P I −1 · · · P 2P 1

is also av and convergence of the ART follows from the KM Theorem. The ART is also an optimization method, in the sense that it minimizes ||x − x0|| over all x with Ax = b.
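A Python sketch of the ART sweep just described, using the standard explicit form of the hyperplane projection, P i x = x + (bi − (Ax)i)ai/||ai||2, where ai is the ith row of A; the small test system is an illustrative example, not taken from the text.

    import numpy as np

    def art(A, b, x0, n_sweeps=50):
        """ART: cyclically project the current iterate onto each hyperplane (Ax)_i = b_i."""
        A, b = np.asarray(A, float), np.asarray(b, float)
        x = np.asarray(x0, float)
        for _ in range(n_sweeps):
            for i in range(A.shape[0]):
                a_i = A[i]
                x = x + (b[i] - a_i.dot(x)) / a_i.dot(a_i) * a_i
        return x

    A = [[1.0, 1.0], [1.0, -1.0]]
    b = [2.0, 0.0]
    print(art(A, b, x0=[0.0, 0.0]))   # converges to the solution [1, 1]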

We can also use the operators P i in a simultaneous manner, taking the iterative step to be

xk+1 = (1/I)(P 1xk + P 2xk + · · · + P I xk).

This algorithm is the Cimmino algorithm [46]. Once again, convergence follows from the KM Theorem, since the operator

T = (1/I)(P 1 + P 2 + · · · + P I )

is av. Cimmino's algorithm also minimizes ||x − x0|| over all x with Ax = b, but tends to converge more slowly than ART. One advantage Cimmino's algorithm has over the ART is that, in the inconsistent case, in which Ax = b has no solutions, Cimmino's algorithm converges to a least-squares solution of Ax = b, while the ART produces a limit cycle of multiple vectors.

Note that Ax = b has solutions precisely when the square system AA†z = b has a solution; when A has full rank (which is most of the time) the matrix AA† will be invertible and the latter system will have a unique solution z = (AA†)−1b. Then x = A†z is the minimum-norm solution of the system Ax = b.

If we require a solution of Ax = b that lies in the closed, convex set C, we can modify both the ART and Cimmino's algorithm to achieve this end; all we need to do is to replace xk+1 with P C xk+1, the orthogonal projection of xk+1 onto C. We call these modified algorithms the projected ART and projected Cimmino algorithm, respectively. Convergence is again the result of the KM Theorem.


2.6.2 Optimization and Approximate Solutions

When I > J and the system Ax = b has no exact solutions, we can calculate the least-squares solution closest to x0 using Cimmino's algorithm. When all the rows of A are normalized to have Euclidean length one, the iterative step of Cimmino's algorithm can be written as

xk+1 = xk + (1/I)A†(b − Axk).

This is a special case of the Landweber algorithm, which has the iterative step

xk+1 = xk + γA†(b − Axk).

Landweber's algorithm converges to the least-squares solution closest to x0, if the parameter γ is in the interval (0, 2/L), where L is the largest eigenvalue of the matrix A†A. Landweber's algorithm can be written as xk+1 = T xk, for the operator T defined by

T x = (I − γA†A)x + γA†b.

This operator is affine linear and is an av operator, since its linear part, the matrix B = I − γA†A, is av for any γ in (0, 2/L). Convergence then follows from the KM Theorem. When the rows of A have Euclidean length one, the trace of AA† is I, the number of rows in A, so L ≤ I. Therefore, the choice of γ = 1/I used in Cimmino's algorithm is permissible, but usually much smaller than the optimal choice.

To minimize ||Ax − b|| over x in the closed, convex set C we can use the projected Landweber algorithm, with the iterative step

xk+1 = P C (xk + γA†(b − Axk)).

Since P C  is an av operator, the operator

T x = P C (x + γA†(b − Ax))

is av for all γ in (0, 2/L). Convergence again follows from the KM Theorem. Note that when Ax = b has solutions in C, the projected Landweber algorithm converges to such a solution.
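A Python sketch of the Landweber and projected Landweber iterations described above, with γ chosen in (0, 2/L); the nonnegativity constraint used for the projection and the small test system are illustrative assumptions. For real A, the adjoint A† is just the transpose.

    import numpy as np

    def landweber(A, b, x0, gamma, n_iter=500, project=None):
        """x <- P_C( x + gamma * A^T (b - A x) ); with project=None this is plain Landweber."""
        A, b = np.asarray(A, float), np.asarray(b, float)
        x = np.asarray(x0, float)
        for _ in range(n_iter):
            x = x + gamma * A.T.dot(b - A.dot(x))
            if project is not None:
                x = project(x)
        return x

    A = np.array([[1.0, 1.0], [1.0, -1.0], [2.0, 1.0]])
    b = np.array([2.0, 0.0, 3.0])
    L = np.linalg.eigvalsh(A.T @ A).max()       # largest eigenvalue of A^T A
    x_ls = landweber(A, b, x0=[0.0, 0.0], gamma=1.0 / L)
    x_nn = landweber(A, b, x0=[0.0, 0.0], gamma=1.0 / L,
                     project=lambda x: np.maximum(x, 0.0))   # projection onto x >= 0
    print(x_ls, x_nn)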

2.6.3 Splitting Methods

As we noted previously, the system Ax = b has solutions if and only if the square system AA†z = b has solutions. The splitting methods apply to square systems Sz = h. The idea is to decompose S into S = M − K, where M is easily inverted. Then

Sz = Mz − Kz = h.


The operator T  given by

Tz = M^{-1}Kz + M^{-1}h

is affine linear and is av whenever the matrix M^{-1}K is av. When M^{-1}K is not Hermitian, if M^{-1}K is a paracontraction, with respect to some norm, we can use Theorem 2.2.

Particular choices of M and K lead to Jacobi's method, the Gauss-Seidel method, and the more general Jacobi and Gauss-Seidel overrelaxation methods (JOR and SOR). For the case of S non-negative-definite, the JOR algorithm is equivalent to the Landweber algorithm and the SOR is closely related to the relaxed ART method. Convergence of both JOR and SOR in this case follows from the KM Theorem.
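As a concrete instance of a splitting, the sketch below uses the Jacobi choice M = diag(S), K = M − S; the example matrix is assumed diagonally dominant so the iteration converges, and all names are illustrative.

```python
import numpy as np

def jacobi_splitting(S, h, iters=200):
    # Write S = M - K with M = diag(S); iterate z^{k+1} = M^{-1}(K z^k + h).
    M_diag = np.diag(S)
    K = np.diag(M_diag) - S
    z = np.zeros_like(h)
    for _ in range(iters):
        z = (K @ z + h) / M_diag
    return z

S = np.array([[4.0, 1.0], [2.0, 5.0]])    # diagonally dominant
h = np.array([1.0, 3.0])
print(jacobi_splitting(S, h), np.linalg.solve(S, h))   # the two agree
```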

2.7 Positive Solutions of Linear Equations

Suppose now that the entries of the matrix A are non-negative, those of b are positive, and we seek a solution x with non-negative entries. We can, of course, use the projected algorithms discussed in the previous section. Alternatively, we can use algorithms designed specifically for non-negative problems and based on cross-entropy, rather than on the Euclidean distance between vectors.

2.7.1 Cross-Entropy

For a > 0 and b > 0, let the cross-entropy or Kullback-Leibler distance from a to b be

KL(a, b) = a log(a/b) + b − a,

with KL(a, 0) = +∞ and KL(0, b) = b. Extend to non-negative vectors coordinate-wise, so that

KL(x, z) = ∑_{j=1}^J KL(xj, zj).

Unlike the Euclidean distance, the KL distance is not symmetric; KL(Ax, b) and KL(b, Ax) are distinct, and we can obtain different approximate solutions of Ax = b by minimizing these two distances with respect to non-negative x.
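A direct numpy transcription of the definition above; the conventions for zero entries follow the text, and the sample vectors are arbitrary.

```python
import numpy as np

def kl(x, z):
    # KL(x, z) = sum_j [ x_j log(x_j/z_j) + z_j - x_j ], applied coordinate-wise,
    # with KL(0, b) = b and KL(a, 0) = +inf for a > 0.
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    if np.any((z == 0) & (x > 0)):
        return np.inf
    xs = np.where(x > 0, x, 1.0)      # placeholders avoid log(0) warnings
    zs = np.where(z > 0, z, 1.0)
    terms = np.where(x > 0, x * np.log(xs / zs) + z - x, z)
    return float(terms.sum())

Ax = np.array([1.0, 2.0]); b = np.array([2.0, 1.0])
print(kl(Ax, b), kl(b, Ax))           # the two distances differ, as noted above
```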

2.7.2 The EMML and SMART algorithms

The expectation maximization maximum likelihood (EMML) algorithm minimizes KL(b, Ax), while the simultaneous multiplicative ART (SMART) minimizes KL(Ax, b). These methods were developed for application to


tomographic image reconstruction, although they have much more general uses. Whenever there are nonnegative solutions of Ax = b, SMART converges to the nonnegative solution that minimizes KL(x, x0); the EMML also converges to a non-negative solution, but no characterization of that solution is known.
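The multiplicative updates below are a minimal sketch of the simultaneous EMML and SMART iterations for A with non-negative entries and positive b; the column-sum normalization s_j = ∑_i A_ij and the test data are assumptions of the sketch, not a definitive implementation.

```python
import numpy as np

def emml(A, b, iters=500):
    # EMML: x_j <- (x_j / s_j) * sum_i A_ij * b_i / (Ax)_i, with s_j = sum_i A_ij.
    s = A.sum(axis=0)
    x = np.ones(A.shape[1])
    for _ in range(iters):
        x = x * (A.T @ (b / (A @ x))) / s
    return x

def smart(A, b, iters=500):
    # SMART: x_j <- x_j * exp( (1/s_j) * sum_i A_ij * log(b_i / (Ax)_i) ).
    s = A.sum(axis=0)
    x = np.ones(A.shape[1])
    for _ in range(iters):
        x = x * np.exp((A.T @ np.log(b / (A @ x))) / s)
    return x

A = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]])
b = A @ np.array([0.5, 1.5])           # a consistent, positive system
print(emml(A, b), smart(A, b))         # both approach a non-negative solution
```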

2.7.3 Acceleration

Both the EMML and SMART algorithms are simultaneous, like Cimmino's algorithm, using all the equations in each step of the iteration. Like Cimmino's algorithm, they are slow to converge. In the consistent case, the ART converges much faster than Cimmino's algorithm, and analogous successive- and block-projection methods for accelerating the EMML and SMART methods have been developed, including the multiplicative ART (MART), the rescaled block-iterative SMART (RBI-SMART) and the rescaled block-iterative EMML (RBI-EMML). These methods can be viewed as involving projections onto hyperplanes, but the projections are entropic, not orthogonal, projections.

2.7.4 Entropic Projections onto Hyperplanes

Let H i be the hyperplane

H i = {x|(Ax)i = bi}.

For any non-negative z, denote by x = P_i^e z the non-negative vector in H_i that minimizes the entropic distance KL(x, z). Generally, we cannot express P_i^e z in closed form. On the other hand, if we ask for the non-negative vector x = Q_i^e z in H_i for which the weighted entropic distance

∑_{j=1}^J Aij KL(xj, zj)

is minimized, we find that x = Q_i^e z can be written explicitly:

xj = zj bi/(Az)i.

We can use these weighted entropic projection operators Q_i^e to derive the MART, the SMART, the EMML, the RBI-SMART, and the RBI-EMML methods.
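The sketch below shows the explicit operator Q_i^e and a MART-type sweep built from it; the relaxation exponent A_ij / max_j A_ij is one common choice and, like the test data, is an assumption of this illustration.

```python
import numpy as np

def weighted_entropic_projection(A, b, z, i):
    # Q_i^e z: scale every entry of z by b_i / (Az)_i, the closed-form
    # minimizer of the weighted entropic distance described above.
    return z * (b[i] / (A[i] @ z))

def mart_sweep(A, b, x):
    # One cycle of a MART-type iteration: relax the i-th update by the
    # exponent A_ij / max_j A_ij (an assumed, commonly used choice).
    for i in range(A.shape[0]):
        ratio = b[i] / (A[i] @ x)
        x = x * ratio ** (A[i] / A[i].max())
    return x

A = np.array([[1.0, 2.0], [2.0, 1.0]])
b = A @ np.array([1.0, 2.0])
x = np.ones(2)
for _ in range(200):
    x = mart_sweep(A, b, x)
print(x)        # approaches the positive solution (1, 2)
```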

2.8 Sensitivity to Noise

In many applications of these iterative methods, the vector b consists of measurements, and therefore, is noisy. Even though exact solutions of 


Ax = b may exist, they may not be useful, because they are the result of over-fitting the answer to noisy data. It is important to know where sensitivity to noise can come from, and how to modify the algorithms to lessen the sensitivity. Ill-conditioning in the matrix A can lead to sensitivity to noise, and regularization can help to make the solution less sensitive to noise and other errors.

2.8.1 Norm Constraints

For example, in the inconsistent case, when we seek a least-squares solution of Ax = b, we minimize ||Ax − b||. To avoid over-fitting to noisy data we can minimize

||Ax − b||^2 + ε^2 ||x||^2,

for some small ε > 0. In the consistent case, instead of calculating the exact solution that minimizes ||x − x0||, we can calculate the minimizer of

||Ax − b||^2 + ε^2 ||x − x0||^2.

These approaches to regularization involve the addition of a penalty term to the function being minimized. Such regularization can often be obtained through a Bayesian maximum a posteriori probability (MAP) approach.
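A minimal sketch of this norm-constrained regularization; solving the regularized normal equations directly is just for illustration, and an iterative method such as Landweber applied to the augmented system would serve equally well.

```python
import numpy as np

def regularized_ls(A, b, eps, x0=None):
    # Minimize ||Ax - b||^2 + eps^2 ||x - x0||^2 via the regularized
    # normal equations (A^T A + eps^2 I) x = A^T b + eps^2 x0.
    J = A.shape[1]
    x0 = np.zeros(J) if x0 is None else x0
    return np.linalg.solve(A.T @ A + eps**2 * np.eye(J),
                           A.T @ b + eps**2 * x0)
```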

Noise in the data can manifest itself in a variety of ways. For example, consider what can happen when we impose positivity on the calculated least-squares solution, that is, when we minimize ||Ax − b|| over all non-negative vectors x. We have the following result:

Theorem 2.3 Suppose that A and every matrix Q obtained from A by deleting columns have full rank. Suppose there is no nonnegative solution of the system of equations Ax = b. Then there is a subset S of the set {j = 1, 2,...,J} with cardinality at most I − 1 such that, if x is any minimizer of ||Ax − b|| subject to x ≥ 0, then xj = 0 for j not in S. Therefore, x is unique.

This theorem tells us that when J > I, but Ax = b has no non-negative solutions, the non-negatively constrained least-squares solution can have at most I − 1 non-zero entries, regardless of how large J is. This phenomenon also occurs with several other approximate methods, such as those that minimize the cross-entropy distance.

2.9 Constrained Optimization

In image reconstruction, we often have prior constraints that we wish to impose on the vectorized image x, as well as measured data, with which a


suitable x should be in reasonable agreement. Taken together, these constraints are usually insufficient to specify a unique solution; we obtain our desired solution by optimizing some cost function over all the x satisfying our constraints. This is constrained optimization.

2.9.1 Convex Feasibility and Split Feasibility

The constraints we wish to impose on x can often be formulated as requiring that x be a member of closed, convex sets C_i, i = 1,...,I. In some cases, there are sufficiently many C_i so that any member of C, their intersection, will be a satisfactory answer to our problem. Finding a member of C is the convex feasibility problem (CFP). When the intersection C is empty, we can minimize a proximity function, such as

F(x) = ∑_{i=1}^I ||P_{C_i} x − x||^2.

When the intersection C is quite large, we may want to minimize a cost function f(x) over the members of C. For example, we may want the member of C that is closest to x0; that is, we want to minimize ||x − x0|| over C.

Let A be an I by J matrix. The split feasibility problem (SFP) is to find a member of a closed, convex set C in R^J for which Ax is a member of a second closed, convex set Q in R^I. When there is no such x, we can minimize the proximity function

G(x) = ||P_Q Ax − Ax||,

over all x in C , whenever such minimizers exist.

2.9.2 Algorithms

The CFP can be solved using the successive orthogonal projections (SOP) method. The iterative step of the SOP is

xk+1 = P_I P_{I−1} · · · P_2 P_1 xk,

where P_i = P_{C_i} is the orthogonal projection onto C_i. The operator

T = P_I P_{I−1} · · · P_2 P_1

is averaged and convergence of the SOP follows from the KM Theorem. The SOP is useful when the sets C_i are easily described and the P_i are easily calculated, but P_C is not. The SOP converges to the member of C closest to x0 when the C_i are hyperplanes, but not in general.


When C = ∩_{i=1}^I C_i is empty and we seek to minimize the proximity function F(x), the relevant iteration is

xk+1 = (1/I) ∑_{i=1}^I P_i xk.

The operator

T = (1/I) ∑_{i=1}^I P_i

is averaged, so this iteration converges, by the KM Theorem, whenever F(x) has a minimizer.

The CQ algorithm for the SFP has the iterative step

xk+1 = P_C(xk − γA^T(I − P_Q)Axk).

The operator

T = P_C(I − γA^T(I − P_Q)A)

is averaged whenever γ is in the interval (0, 2/L), where L is the largest eigenvalue of A^T A, and so the CQ algorithm converges to a fixed point of T, whenever such fixed points exist. When the SFP has a solution, the CQ algorithm converges to one; when it does not, the CQ algorithm converges to a minimizer, over C, of the proximity function ||P_Q Ax − Ax||, whenever such minimizers exist.
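A sketch of the CQ iteration with C the non-negative orthant and Q a box {y : lo ≤ y ≤ hi}, so both projections are simple clippings; these particular sets, the step-size choice γ = 1/L, and the names are assumptions of the example.

```python
import numpy as np

def cq_algorithm(A, lo, hi, iters=1000):
    # Split feasibility sketch: seek x >= 0 with lo <= Ax <= hi.
    # Iterate x^{k+1} = P_C( x^k - gamma A^T (I - P_Q) A x^k ).
    L = np.linalg.eigvalsh(A.T @ A).max()
    gamma = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        Ax = A @ x
        PQAx = np.clip(Ax, lo, hi)                        # P_Q: project onto the box Q
        x = np.maximum(x - gamma * A.T @ (Ax - PQAx), 0.0)  # P_C: non-negative orthant
    return x
```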

When the intersection C = ∩_{i=1}^I C_i is large and we want to calculate the orthogonal projection of x0 onto C using the operators P_{C_i}, we cannot use the SOP; instead we use Dykstra's algorithm. Dykstra's algorithm employs the projections P_{C_i}, applied not directly to xk, but to translations of xk. It is motivated by the following lemma:

Lemma 2.1 If x = c + ∑_{i=1}^I p_i, where, for each i, c = P_{C_i}(c + p_i), then c = P_C x.
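A sketch of Dykstra's algorithm for two closed convex sets, following the translation idea in the lemma; the particular sets (a hyperplane and the non-negative orthant) and the projection routines are assumptions of the example.

```python
import numpy as np

def dykstra(x0, projections, sweeps=500):
    # Apply each projection to a translated point c + p_i and update the
    # translation p_i; the limit c is P_C x0, with C the intersection.
    c = x0.copy()
    p = [np.zeros_like(x0) for _ in projections]
    for _ in range(sweeps):
        for i, P in enumerate(projections):
            y = P(c + p[i])
            p[i] = c + p[i] - y
            c = y
    return c

# Example sets: the hyperplane {x : <a, x> = 1} and the non-negative orthant.
a = np.array([1.0, 2.0])
P1 = lambda x: x + ((1.0 - a @ x) / (a @ a)) * a
P2 = lambda x: np.maximum(x, 0.0)
print(dykstra(np.array([-1.0, 3.0]), [P1, P2]))
```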

Bregman discovered an iterative algorithm for minimizing a more general convex function f(x) over x with Ax = b, and also over x with Ax ≥ b [15]. These algorithms are based on his extension of the SOP to include projections with respect to generalized distances, such as entropic distances.

2.10 Bregman Projections and the SGP

If f : R^J → R is convex and differentiable, then, for all x and y, we have

Df(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩ ≥ 0.


If x minimizes f (x) over x with Ax = b, then

∇f (x) + A†c = 0,

for some vector c. Bregman's idea is to use Df(x, y) to define generalized projections, and then to mimic the SOP to solve for x. Simply requiring that f(x) be convex and differentiable is not sufficient for a complete theory, and additional requirements are necessary; see the appendix on Bregman-Legendre functions and Bregman projections.

For each i, let P_i^f z be the point in the hyperplane

H_i = {x | (Ax)i = bi}

that minimizes Df(x, z). Then P_i^f z is the Bregman projection of z onto H_i, and

∇f(P_i^f z) − ∇f(z) = λi ai,

for some λi, where ai is the ith column of A†. Bregman's successive generalized projection (SGP) method has the iterative step

xk+1 = ∇f^{-1}(∇f(xk) + λk ai),

for some scalar λk and i = k(mod I) + 1. The sequence {xk} will converge to x with Ax = b, provided solutions exist, and when x0 is chosen so that x0 = A†d, for some d, the sequence will converge to the solution that minimizes f(x). Bregman also uses Bregman distances to obtain a primal-dual algorithm for minimizing f(x) over all x with Ax ≥ b. Dykstra's algorithm can be extended to include Bregman projections; this extended algorithm is then equivalent to the generalization of Bregman's primal-dual algorithm to minimize f(x) over the intersection of closed, convex sets.

2.11 The Multiple-Distance SGP (MSGP)

As we noted earlier, both the EMML and SMART algorithms can be viewed in terms of weighted entropic projections onto hyperplanes. Unlike the SGP, the weighted entropic distances used vary with the hyperplane, suggesting that it may be possible to extend the SGP algorithm to include Bregman projections in which the function f is replaced by an f_i that depends on the set C_i. It is known, however, that merely replacing the single Bregman function f with an f_i that varies with the i is not enough to guarantee convergence. The multiple-distance SGP (MSGP) algorithm achieves convergence by using a dominating Bregman distance Dh(x, y) with

Dh(x, y) ≥ Df_i(x, y),

for each i, and a generalized notion of relaxation. The MSGP leads to an interior-point method, the IPA, for minimizing certain convex functions over convex sets.


2.12 Linear Programming

Bregman’s primal-dual algorithm suggests a method for approximating thesolution of the basic problem in linear programming, to minimize a linear

function cT x, over all x with Ax ≥ b. Other solution methods exist forthis problem, as well. Associated with the basic primary  problem is adual  problem. Both the primary and dual problems can be stated in theircanonical forms  or their standard forms . The primary and dual problemsare connected by the Weak Duality and Strong Duality theorems. Thesimplex method is the best known solution procedure.

2.13 Applications

Iterative algorithms are necessary in many areas of application. Transmission and emission tomography involve solving large-scale systems of linear equations, or optimizing convex functions of thousands of variables. Magnetic-resonance imaging produces data that is related to the object of interest by means of the Fourier transform or the Radon transform. Hyperspectral imaging leads to several problems involving limited Fourier-transform data. Iterative data-extrapolation algorithms can be used to incorporate prior knowledge about the object being reconstructed, as well as to improve resolution. Entropy-based iterative methods are used to solve the mixture problems common to remote sensing, as illustrated by sonar and radar array processing, as well as hyperspectral imaging.


Part II

Fixed-Point Iterative Algorithms


Chapter 3

Convergence Theorems

In this chapter we consider three fundamental convergence theorems that will play important roles in much of what follows.

3.1 Fixed Points of Iterative Algorithms

The iterative methods we shall consider can be formulated as

xk+1 = T xk, (3.1)

for k = 0, 1,..., where T is a linear or nonlinear continuous operator on (all or some of) the space X of real or complex J-dimensional vectors and x0 is an arbitrary starting vector. For any such operator T on X the fixed point set of T is

Fix(T) = {z | Tz = z}.

Exercise 3.1 Show that, if the iterative sequence defined by Equation (3.1) converges, then the limit is a member of Fix(T).

A wide variety of problems can be solved by finding a fixed point of a particular operator, and algorithms for finding such points play a prominent role in a number of applications. The paper [117] is an excellent source of background on these topics, particularly as they apply to signal and image processing. The more recent article by Bauschke and Borwein [7] is also quite helpful. The book by Borwein and Lewis [13] is an important reference.

In the algorithms of interest here the operator T is selected so that the set Fix(T) contains those vectors z that possess the properties we desire in a solution to the original signal processing or image reconstruction problem; finding a fixed point of the iteration leads to a solution of our problem.


3.2 Convergence Theorems for Iterative Algorithms

In general, a sequence generated by Equation (3.1) need not converge, even when T has fixed points. The Newton-Raphson iteration, for example, may converge only when the starting vector x0 is sufficiently close to a solution. We shall be concerned mainly with classes of operators T for which convergence holds for all starting vectors. The class of strict contractions provides a good example.

3.2.1 Strict Contractions

An operator T on X is Lipschitz continuous if there is a positive constant λ such that

||Tx − Ty|| ≤ λ||x − y||,

for all x and y in X. An operator T on X is a strict contraction (sc), with respect to a vector norm || · ||, if there is r ∈ (0, 1) such that

||Tx − Ty|| ≤ r||x − y||,

for all vectors x and y.

Exercise 3.2 Show that a strict contraction can have at most one fixed point.

For strict contractions, we have the Banach-Picard theorem [61]:

Theorem 3.1 Let T be sc. Then, there is a unique fixed point and, for any starting vector x0, the sequence {xk} generated by Equation (3.1) converges to the fixed point.

The key step in the proof is to show that {xk} is a Cauchy sequence and, therefore, that it has a limit.

Exercise 3.3 Show that the sequence {xk} is a Cauchy sequence. Hint: consider

||xk − xk+n|| ≤ ||xk − xk+1|| + ... + ||xk+n−1 − xk+n||,

and use

||xk+m − xk+m+1|| ≤ r^m ||xk − xk+1||.

Exercise 3.4 Since {xk} is a Cauchy sequence, it has a limit, say x. Let ek = x − xk. Show that {ek} → 0, as k → +∞, so that {xk} → x. Finally, show that Tx = x.


Exercise 3.5 Suppose that we want to solve the equation

x = (1/2) e^{−x}.

Let Tx = (1/2) e^{−x} for x in R. Show that T is a strict contraction, when restricted to non-negative values of x, so that, provided we begin with x0 > 0, the sequence {xk = Txk−1} converges to the unique solution of the equation. Hint: use the mean value theorem from calculus.
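A quick numeric check of this exercise, iterating the map from an arbitrary positive start:

```python
import math

def T(x):
    # the strict contraction of Exercise 3.5 on the non-negative reals
    return 0.5 * math.exp(-x)

x = 1.0                       # any x0 > 0
for _ in range(50):
    x = T(x)
print(x, abs(x - T(x)))       # x satisfies x = (1/2) e^{-x} to machine precision
```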

Exercise 3.6 Let T be an affine operator, that is, T has the form Tx = Bx + d, where B is a linear operator, and d is a fixed vector. Show that T is a strict contraction if and only if ||B||, the induced matrix norm of B, is less than one.

The spectral radius of B, written ρ(B), is the maximum of |λ|, over all eigenvalues λ of B. Since ρ(B) ≤ ||B|| for every norm, B being sc implies that ρ(B) < 1. When B is Hermitian, the matrix norm of B induced by the Euclidean vector norm is ||B||2 = ρ(B), so if ρ(B) < 1, then B is sc with respect to the Euclidean norm.

When B is not Hermitian, it is not as easy to determine if T is sc with respect to a given norm. Instead, we often tailor the norm to the operator T.

To illustrate, suppose that B is a diagonalizable matrix, that is, there is a basis for X consisting of eigenvectors of B. Let {u1,...,uJ} be such a basis, and let Buj = λjuj, for each j = 1,...,J. For each x in X, there are coefficients aj so that

x = ∑_{j=1}^J aj uj.

Then let

||x|| = ∑_{j=1}^J |aj|. (3.2)

Exercise 3.7 Show that || · || defines a norm on X.

Exercise 3.8 Suppose that  ρ(B) < 1. Show that the affine operator T  is sc, with respect to the norm defined by Equation (3.2).

Actually, this result holds for any square matrix B. According to Lemma 27.1, for any square matrix B and any ε > 0, there is a vector norm for which the induced matrix norm satisfies

||B|| ≤ ρ(B) + ε.

In many of the applications of interest to us, there will be multiple fixed points of T. Therefore, T will not be sc for any vector norm, and the


Banach-Picard fixed-point theorem will not apply. We need to consider other classes of operators.

The first class we consider is the class of paracontractive (pc) operators. This class is particularly important for the study of affine operators, since T being pc can be related to the behavior of the eigenvalues of B. For the non-affine case, we shall begin with operators that are non-expansive (ne) with respect to the Euclidean norm, and then focus on an important sub-class, the averaged operators.

3.3 Paracontractive Operators

An operator T on X is a paracontraction (pc), with respect to a vector norm || · ||, if, for every fixed point y of T, and every x, we have

||Tx − y|| < ||x − y||,

unless Tx = x. If T has no fixed points, then T is trivially pc. An operator T is strictly non-expansive (sne) if

||Tx − Ty|| < ||x − y||,

unless Tx − Ty = x − y. Clearly, if T is sc, then T is sne.

Exercise 3.9 Show that if T  is sne, then  T  is pc.

Exercise 3.10 Let H(a, γ) = {x | ⟨x, a⟩ = γ}. Show that P, the orthogonal projection onto H(a, γ), is given by

Px = x + ((γ − ⟨x, a⟩)/⟨a, a⟩) a.

Then show that P is pc, but not sc, with respect to the Euclidean norm.
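A direct transcription of the projection formula in the exercise; the vectors a and γ used in the check are arbitrary illustrative values.

```python
import numpy as np

def project_hyperplane(x, a, gamma):
    # Orthogonal projection onto H(a, gamma) = {x : <x, a> = gamma}:
    # P x = x + ((gamma - <x, a>) / <a, a>) a.
    return x + ((gamma - x @ a) / (a @ a)) * a

a = np.array([3.0, 4.0]); gamma = 5.0
x = np.array([10.0, -2.0])
Px = project_hyperplane(x, a, gamma)
print(Px, Px @ a)        # the image lies on the hyperplane: <Px, a> = 5
```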

To illustrate, suppose, once again, that B is a diagonalizable matrix, that is, there is a basis for X consisting of eigenvectors of B. Let {u1,...,uJ} be such a basis, and let Buj = λjuj, for each j = 1,...,J.

Exercise 3.11 Suppose that  |λj | < 1, for all eigenvalues  λj that are not equal to one. Show that the affine operator  T  is pc, with respect to the norm defined by Equation (3.2).

Our interest in paracontractions stems from the Elsner/Koltracht/Neumann (EKN) Theorem [64]:

Theorem 3.2 Let T  be pc with respect to some vector norm. If T  has fixed points, then the iteration in Equation (3.1) converges to a fixed point of  T ,

 for all starting vectors x0.

The product of two or more sne operators is again sne. The product of two or more pc operators will be pc if the operators share at least one fixed point, but not generally.


3.4 Averaged Non-expansive Operators

An operator T on X is non-expansive (ne), with respect to some vector norm, if, for every x and y, we have

||Tx − Ty|| ≤ ||x − y||.

The identity map Ix = x for all x is clearly ne; more generally, for any fixed vector w in X, the maps Nx = x + w and Nx = −x + w are ne. If T is pc, then T is ne. Being ne is not enough to guarantee convergence of the iterative sequence {xk}, as the example T = −I illustrates.

An operator T is averaged (av) if there is α ∈ (0, 1) and a non-expansive operator N, such that

T  = (1 − α)I + αN,

where I  is the identity operator. We may also say that T  is α-av.

Exercise 3.12 Show that an av operator is ne.

Although we have defined the av operators for any vector norm, they are most useful in the context of the Euclidean norm. The main reason for this is the following identity, relating an operator T to its complement G = I − T, which holds for the Euclidean norm:

||x − y||^2 − ||Tx − Ty||^2 = 2⟨Gx − Gy, x − y⟩ − ||Gx − Gy||^2. (3.3)

Our interest in averaged operators comes from the Krasnoselskii/Mann Theorem [94]:

Theorem 3.3 Let T be averaged, with respect to the Euclidean norm. If T has fixed points, then the iterative sequence generated by Equation (3.1) converges to a fixed point of T, for every starting vector x0.

To make use of the KM Theorem, we shall assume, from now on, that all av operators are averaged with respect to the Euclidean norm.

The product of two or more av operators is again av, which makes the class of av operators important for the development of convergent iterative algorithms.

3.5 Projection onto Convex Sets

Let C be a nonempty, closed convex subset of X. For every x in X, there is a unique point in C closest to x, in the Euclidean distance; this point is denoted P_C x and the operator P_C is the orthogonal projection onto C. For most sets C we will not be able to describe P_C x explicitly. We can, however, characterize P_C x as the unique member of C for which

⟨P_C x − x, c − P_C x⟩ ≥ 0, (3.4)

for all c in C; see Proposition 26.1.


Exercise 3.13 Show that the orthogonal projection operator T = P_C is non-expansive, with respect to the Euclidean norm. Hint: use Inequality (3.4) to get

⟨P_C y − P_C x, P_C x − x⟩ ≥ 0,

and

⟨P_C x − P_C y, P_C y − y⟩ ≥ 0.

Add the two inequalities and use the Cauchy inequality.

In fact, this exercise shows that

⟨P_C x − P_C y, x − y⟩ ≥ ||P_C x − P_C y||^2,

which says that the operator T = P_C is not simply ne, but is firmly non-expansive (fne). As we shall see later, being fne implies being av, so the P_C operators are av. If C_i, i = 1,...,I, are convex sets, and P_i the orthogonal projection onto C_i, then the operator

T = P_I P_{I−1} · · · P_2 P_1

is again av. When the intersection of the C_i is non-empty, the sequence {xk} will converge to a member of that intersection.

It follows from Cauchy's Inequality that

||P_C x − P_C y|| ≤ ||x − y||,

with equality if and only if

P_C x − P_C y = α(x − y),

for some scalar α with |α| = 1. But, because

0 ≤ ⟨P_C x − P_C y, x − y⟩ = α||x − y||^2,

it follows that α = 1, and so

P_C x − x = P_C y − y.

This shows that the P_C operators are pc.

3.6 Generalized Projections

So far, we have been discussing algorithms that apply to any vectors in X. In a number of applications, the vectors of interest will naturally have non-negative entries. For such problems, it is reasonable to consider distances that apply only to non-negative vectors, such as the cross-entropy, or


Kullback-Leibler, distance. Associated with such distances are generalized projections. Algorithms that are based on orthogonal projection operators can then be extended to employ these generalized projections. Of course, new proofs of convergence will be needed, but even there, aspects of earlier proofs are often helpful.

The orthogonal projection operators lead us to both the averaged operators and the paracontractive operators, as well as to generalized projections and Bregman paracontractions, and the algorithms built from them.


Chapter 4

Averaged Non-expansive Operators

Many well known algorithms in optimization, signal processing, and image reconstruction are iterative in nature. The Jacobi, Gauss-Seidel, and successive overrelaxation (SOR) procedures for solving large systems of linear equations, projection onto convex sets (POCS) methods, and iterative optimization procedures, such as entropy and likelihood maximization, are the primary examples. The editorial [90] provides a brief introduction to many of the recent efforts in medical imaging. It is a pleasant fact that convergence of many of these algorithms is a consequence of the Krasnoselskii/Mann (KM) Theorem for averaged operators or the Elsner/Koltracht/Neumann (EKN) Theorem for paracontractions. In this chapter we take a closer look at averaged non-expansive operators and the Krasnoselskii/Mann Theorem. In the following chapter, we turn to paracontractive non-expansive operators and the results of Elsner, Koltracht and Neumann.

4.1 Convex Feasibility

To illustrate, suppose that C is a closed convex set in X, such as the non-negative vectors in R^J. The orthogonal projection operator P_C associates with every x in X the point P_C x in C that is nearest to x, in the Euclidean distance. If C_1 and C_2 are two such sets, the fixed points of the operator T = P_{C_2}P_{C_1} are the vectors in the intersection C = C_1 ∩ C_2. Finding points in the intersection of convex sets is called the convex feasibility problem (CFP). If C is nonempty, then the sequence {xk} generated by Equation (3.1) converges to a member of C. This is a consequence of the KM Theorem, since the operator T is av.


4.2 Constrained Optimization

Some applications involve constrained optimization, in which we seek a vector x in a given convex set C that minimizes a certain function f. For suitable γ > 0 the operator T = P_C(I − γ∇f) will be av and the sequence {xk} will converge to a solution.
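A minimal sketch of this projected-gradient iteration for the convex function f(x) = (1/2)||Ax − b||^2, with C taken to be the non-negative orthant; the step size γ = 1/L, with L the largest eigenvalue of A^T A, and the names are assumptions of the sketch.

```python
import numpy as np

def projected_gradient(A, b, iters=2000):
    # Iterate x^{k+1} = P_C(x^k - gamma * grad f(x^k)),
    # f(x) = 0.5 ||Ax - b||^2, C = {x : x >= 0}.
    L = np.linalg.eigvalsh(A.T @ A).max()
    gamma = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = np.maximum(x - gamma * grad, 0.0)
    return x
```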

4.3 Solving Linear Systems

An important class of operators is the class of affine linear operators, having the form

Tx = Bx + h,

where B is linear, so that Bx is the multiplication of the vector x by the matrix B, and h is a fixed vector. Affine linear operators occur in iterative methods for solving linear systems of equations.

4.3.1 The Landweber Algorithm

The Landweber algorithm for solving the system Ax = b is

xk+1 = xk + γA†(b − Axk),

where γ  is a selected parameter. We can write the Landweber iteration as

xk+1 = T xk,

for

Tx = (I − γA†A)x + γA†b = Bx + h.

The Landweber algorithm actually solves the square linear system A†Ax = A†b for a least-squares solution of Ax = b. When there is a unique solution or unique least-squares solution of Ax = b, say x, then the error at the k-th step is ek = x − xk and we see that

Bek = ek+1.

We want ek → 0, and so we want ||B|| < 1; this means that both T and B are strict contractions. Since B is Hermitian, B will be sc if and only if ||B|| < 1, where ||B|| = ρ(B) is the matrix norm induced by the Euclidean vector norm.

On the other hand, when there are multiple solutions of Ax = b, the solution found by the Landweber algorithm will be the one closest to the starting vector. In this case, we cannot define ek and we do not want ||B|| < 1; that is, we do not need that B be a strict contraction, but something weaker. As we shall see, since B is Hermitian, B will be av whenever γ lies in the interval (0, 2/ρ(A†A)).


4.3.2 Splitting Algorithms

Affine linear operators also occur in splitting algorithms for solving a square system of linear equations, Sx = b. We write S = M − K, with M invertible. Then, the iteration is

xk+1 = M^{-1}Kxk + M^{-1}b,

which can be written as

xk+1 = Txk,

for the affine linear operator

Tx = M^{-1}Kx + M^{-1}b = Bx + h.

When S is invertible, there is a unique solution of Sx = b, say x, and we can define the error ek = x − xk. Then ek+1 = Bek, and again we want ||B|| < 1, that is, we want B to be a strict contraction. However, if S is not invertible and there are multiple solutions, then we do not want B to be sc. Since B is usually not Hermitian, deciding if B is av may be difficult. Therefore, we may instead ask if there is a vector norm with respect to which B is pc.

We begin, in the next section, a detailed discussion of averaged operators, followed by an examination of the proof of the Krasnoselskii/Mann theorem.

4.4 Averaged Non-expansive Operators

As we have seen, the fact that a ne operator N has fixed points is not sufficient to guarantee convergence of the orbit sequence {N^k x}; additional conditions are needed. Requiring the operator to be a strict contraction is quite restrictive; most of the operators we are interested in here have multiple fixed points, so are not sc. For example, if T = P_C, then C = Fix(T). Motivated by the KM Theorem, we concentrate on averaged operators, by which we shall always mean with respect to the Euclidean norm.

4.4.1 Properties of Averaged Operators

As we shall see now, in seeking fixed points for an operator T it is helpful to consider properties of its complement, G = I − T. An operator G on X is called ν-inverse strongly monotone (ν-ism) [69] (also called co-coercive in [50]) if there is ν > 0 such that

⟨Gx − Gy, x − y⟩ ≥ ν||Gx − Gy||^2.

Exercise 4.1 Show that N is ne if and only if its complement G = I − N is (1/2)-ism. If G is ν-ism and γ > 0 then the operator γG is (ν/γ)-ism.


Lemma 4.1 An operator A is av if and only if its complement G = I − A is ν-ism for some ν > 1/2.

Proof: We assume first that A is av. Then there is α ∈ (0, 1) and a ne operator N such that A = (1 − α)I + αN, and so G = I − A = α(I − N). Since N is ne, I − N is (1/2)-ism and G = α(I − N) is (1/(2α))-ism. Conversely, assume that G is ν-ism for some ν > 1/2. Let α = 1/(2ν) and write A = (1 − α)I + αN for N = I − (1/α)G. Since I − N = (1/α)G, I − N is (αν)-ism. Consequently I − N is (1/2)-ism and N is ne. Therefore, A is av.

Exercise 4.2 Show that, if the operator A is α-av and 1 > β > α, then A is β-av.

Exercise 4.3 Note that we can establish that a given operator is av by showing that there is an α in the interval (0, 1) such that the operator

(1/α)(A − (1 − α)I)

is ne. Use this approach to show that if T is sc, then T is av.

Lemma 4.2 Let T  = (1 − α)A + αN  for some  α ∈ (0, 1). If  A is averaged and  N  is non-expansive then  T  is averaged.

Proof: Let A = (1 − β)I + βM for some β ∈ (0, 1) and ne operator M. Let 1 − γ = (1 − α)(1 − β). Then we have

T = (1 − γ)I + γ[(1 − α)βγ^{-1}M + αγ^{-1}N].

Since the operator K = (1 − α)βγ^{-1}M + αγ^{-1}N is easily shown to be ne, and the convex combination of two ne operators is again ne, T is averaged.

Corollary 4.1 If  A and  B are av and  α is in the interval  [0, 1], then the operator  T  = (1 − α)A + αB formed by taking the convex combination of A and  B is av.

An operator F on X is called firmly non-expansive (fne) [117], [7] if it is 1-ism.

Lemma 4.3 An operator  F  is fne if and only if its complement  I − F  is  fne. If  F  is fne then  F  is av.

Proof: For any operator F with G = I − F we have

⟨Fx − Fy, x − y⟩ − ||Fx − Fy||^2 = ⟨Gx − Gy, x − y⟩ − ||Gx − Gy||^2.

The left side is nonnegative if and only if the right side is. Finally, if F is fne then I − F is fne, so I − F is ν-ism for ν = 1. Therefore F is av by Lemma 4.1.


Corollary 4.2 Let  T  = (1 − α)F  + αN  for some  α ∈ (0, 1). If  F  is fne and  N  is ne then  T  is averaged.

Since the orthogonal projection of x onto C is characterized by the inequalities

⟨c − P_C x, P_C x − x⟩ ≥ 0

for all c ∈ C, we have

⟨P_C y − P_C x, P_C x − x⟩ ≥ 0

and

⟨P_C x − P_C y, P_C y − y⟩ ≥ 0.

Adding, we find that

⟨P_C x − P_C y, x − y⟩ ≥ ||P_C x − P_C y||^2;

the operator P_C is fne, and therefore also av.

The orthogonal projection operators P_H onto hyperplanes H = H(a, γ) are sometimes used with relaxation, which means that P_H is replaced by the operator

T  = (1 − ω)I  + ωP H ,

for some ω in the interval (0, 2). Clearly, if ω is in the interval (0, 1), then T is av, by definition, since P_H is ne. We want to show that, even for ω in the interval [1, 2), T is av. To do this, we consider the operator R_H = 2P_H − I, which is reflection through H; that is,

P_H x = (1/2)(x + R_H x),

for each x.

Exercise 4.4 Show that  RH  is an isometry; that is,

||RH x − RH y|| = ||x − y||, for all  x and  y, so that  RH  is ne.

Exercise 4.5 Show that, for ω = 1 + γ in the interval [1, 2), we have

(1 − ω)I + ωP_H = αI + (1 − α)R_H,

for α = (1 − γ)/2; therefore, T = (1 − ω)I + ωP_H is av.

The product of finitely many ne operators is again ne, while the product of finitely many fne operators, even orthogonal projections, need not be fne. It is a helpful fact that the product of finitely many av operators is again av.

If A = (1 − α)I + αN is averaged and B is averaged then T = AB has the form T = (1 − α)B + αNB. Since B is av and NB is ne, it follows from Lemma 4.2 that T is averaged. Summarizing, we have


Proposition 4.1 If  A and  B are averaged, then  T  = AB is averaged.

It is possible for Fix(AB) to be nonempty while Fix(A) ∩ Fix(B) is empty; however, if the latter is nonempty, it must coincide with Fix(AB) [7]:

Proposition 4.2 Let A and B be averaged operators and suppose that Fix(A) ∩ Fix(B) is nonempty. Then Fix(A) ∩ Fix(B) = Fix(AB) = Fix(BA).

Proof: Let I − A be ν_A-ism and I − B be ν_B-ism, where both ν_A and ν_B are taken greater than 1/2. Let z be in Fix(A) ∩ Fix(B) and x in Fix(BA). Then

||z − x||^2 ≥ ||z − Ax||^2 + (2ν_A − 1)||Ax − x||^2

≥ ||z − BAx||^2 + (2ν_B − 1)||BAx − Ax||^2 + (2ν_A − 1)||Ax − x||^2

= ||z − x||^2 + (2ν_B − 1)||BAx − Ax||^2 + (2ν_A − 1)||Ax − x||^2.

Therefore ||Ax − x|| = 0 and ||BAx − Ax|| = ||Bx − x|| = 0.

4.4.2 Averaged Linear Operators

Affine linear operators have the form Tx = Bx + d, where B is a matrix. The operator T is av if and only if B is av. It is useful, then, to consider conditions under which B is av.

When B is averaged, there is an α in (0, 1) and a ne operator N, with

B = (1 − α)I + αN.

Therefore

N = (1/α)B + (1 − 1/α)I (4.1)

is non-expansive. Clearly, N is a linear operator; that is, N is multiplication by a matrix, which we also denote N. When is such an operator N ne?

Exercise 4.6 Show that a linear operator N is ne, in the Euclidean distance, if and only if ||N|| = √(ρ(N†N)), the matrix norm induced by the Euclidean vector norm, does not exceed one.

We know that B is av if and only if its complement, I − B, is ν-ism for some ν > 1/2. Therefore,

⟨(I − B)x, x⟩ ≥ ν||(I − B)x||^2,

for all x. This implies that x†(I − B)x ≥ 0, for all x. Since this quadratic form can be written as

x†(I − B)x = x†(I − Q)x,


for Q = (1/2)(B + B†), it follows that I − Q must be non-negative definite. Moreover, if B is av, then B is ne, so that ||B|| ≤ 1. Since ||B|| = ||B†||, and ||Q|| ≤ (1/2)(||B|| + ||B†||), it follows that Q must be non-expansive, also. Consequently, if the linear operator B is av, then the eigenvalues of Q must lie in the interval [−1, 1].

In later chapters we shall be particularly interested in linear operators B that are Hermitian, in which case N will also be Hermitian. Therefore, we shall assume, for the remainder of this subsection, that B is Hermitian, so that all of its eigenvalues are real. It follows from our discussion relating matrix norms to spectral radii that a Hermitian N is ne if and only if ρ(N) ≤ 1. We now derive conditions on the eigenvalues of B that are equivalent to B being an av linear operator.

For any (necessarily real) eigenvalue λ of B, the corresponding eigenvalue of N is

ν = (1/α)λ + (1 − 1/α).

Exercise 4.7 Show that  |ν | ≤ 1 if and only if 1 − 2α ≤ λ ≤ 1.

From the exercise, we see that the Hermitian linear operator B is av if and only if there is α in (0, 1) such that

−1 < 1 − 2α ≤ λ ≤ 1,

for all eigenvalues λ of  B. This is equivalent to saying that

−1 < λ ≤ 1,

for all eigenvalues λ of  B. The choice

α0 = (1 − λmin)/2

is the smallest α for which

N = (1/α)B + (1 − 1/α)I

will be non-expansive; here λmin denotes the smallest eigenvalue of B. So, α0 is the smallest α for which B is α-av.

The linear operator B will be fne if and only if it is (1/2)-av. Therefore, B will be fne if and only if 0 ≤ λ ≤ 1, for all eigenvalues λ of B. Since B is Hermitian, we can say that B is fne if and only if B and I − B are non-negative definite. We summarize the situation for Hermitian B as follows. Let λ be any eigenvalue of B. Then

B is non-expansive if and only if  −1 ≤ λ ≤ 1, for all λ;


B is averaged if and only if −1 < λ ≤ 1, for all λ;

B is a strict contraction if and only if  −1 < λ < 1, for all λ;

B is firmly non-expansive if and only if 0 ≤ λ ≤ 1, for all λ.

Next, we examine the proof of Theorem 4.1, in order to better understand the advantages of sequential methods over simultaneous ones.

4.5 The KM Theorem

The Krasnoselskii/Mann theorem is the following:

Theorem 4.1 Let T be an av operator on X and let Fix(T) be nonempty. Then the orbit sequence {T^k x} converges to a member of Fix(T), for any x.

As we shall see, many of the iterative methods used in signal and image processing are special cases of the KM approach.

Proof of the theorem: Let z be a fixed point of the non-expansive operator N and let α ∈ (0, 1). Let T = (1 − α)I + αN, so the iterative step becomes

xk+1 = T xk = (1 − α)xk + αN xk. (4.2)

The identity in Equation (26.3) is the key to proving Theorem 4.1. Using Tz = z and (I − T)z = 0 and setting G = I − T we have

||z − xk||^2 − ||Tz − xk+1||^2 = 2⟨Gz − Gxk, z − xk⟩ − ||Gz − Gxk||^2.

Since, by Lemma 4.1, G is (1/(2α))-ism, we have

||z − xk||^2 − ||z − xk+1||^2 ≥ (1/α − 1)||xk − xk+1||^2. (4.3)

Consequently the sequence {xk} is bounded, the sequence {||z − xk||} is decreasing and the sequence {||xk − xk+1||} converges to zero. Let x∗ be a cluster point of {xk}. Then we have Tx∗ = x∗, so we may use x∗ in place of the arbitrary fixed point z. It follows then that the sequence {||x∗ − xk||} is decreasing; since a subsequence converges to zero, the entire sequence converges to zero. The proof is complete.

As we outlined in the Introduction, a wide variety of operators T can be shown to be av. The convergence of the iterative fixed-point algorithms associated with these operators then follows as a consequence of this theorem.


4.6 The De Pierro-Iusem Approach

As we have seen, the class of non-expansive operators is too broad, and the class of strict contractions too narrow, for our purposes. The KM Theorem encourages us to focus on the intermediate class of averaged operators. While this is certainly a fruitful approach, it is not the only possible one. In [58] De Pierro and Iusem take a somewhat different approach, basing their class of operators on properties of orthogonal projections onto convex sets.

Exercise 4.8 Use the Cauchy-Schwarz Inequality and the fact that T = P_C is firmly non-expansive to show that

||T x − T y|| = ||x − y|| (4.4)

implies that

Tx − Ty = x − y, (4.5)

and

⟨Tx − x, x − y⟩ = 0. (4.6)

De Pierro and Iusem consider operators Q : R^J → R^J that are non-expansive and for which the property in Equation (4.4) implies both Equations (4.5) and (4.6). They then show that this class is closed under finite products and convex combinations.


Chapter 5

Paracontractive Operators

An affine linear operator Tx = Bx + d is an averaged non-expansive operator if and only if its linear part, B, is also averaged. A Hermitian B is av if and only if −1 < λ ≤ 1, for each eigenvalue λ of B. When B is not Hermitian, deciding if B is av is harder. In such cases, we can ask if there is some vector norm with respect to which B is paracontractive (pc). As we shall see, if B is diagonalizable, then B is pc if |λ| < 1, for every eigenvalue λ of B that is not equal to one. Then we can use the results of Elsner, Koltracht and Neumann to establish convergence of the iterative algorithm given by Equation (3.1).

5.1 Paracontractions and Convex Feasibility

An operator T on X is paracontractive (pc), with respect to some vector norm || · ||, if, for every fixed point y of T and for every x, we have

||Tx − y|| < ||x − y||,

unless Tx = x. Note that T can be pc without being continuous, hence without being ne. We shall restrict our attention here to those pc operators that are continuous.

Let C_i, i = 1,...,I, be non-empty, closed convex sets in X, with non-empty intersection C. The orthogonal projection P_i = P_{C_i} onto C_i is pc, with respect to the Euclidean norm, for each i. The product T = P_I P_{I−1} · · · P_1 is also pc, since C is non-empty. The SOP algorithm converges to a member of C, for any starting vector x0, as a consequence of the EKN Theorem. For the SOP to be a practical procedure, we need to be able to calculate easily the orthogonal projection onto each C_i. The cyclic subgradient projection method (CSP) (see [43]) provides a practical


alternative to the SOP, for sets C i of the form

C i = {x|gi(x) ≤ bi},

where gi is a convex, differentiable function on X . For each i, let

T ix = x − ωαi(x)∇gi(x),

for

αi(x) = max(gi(x) − bi, 0)/||∇gi(x)||2.

From [64] we have

Theorem 5.1 For  0 < ω < 2, the operators  T i are pc, with respect to the Euclidean norm.

Proof: A vector y is a fixed point of T_i if and only if g_i(y) ≤ b_i, so if and only if y ∈ C_i. Let x be a vector outside of C_i, and let α = α_i(x). Since g_i has no relative minimum outside of C_i, T_i x is well defined. We want to show that ||T_i x − y|| < ||x − y||. This is equivalent to showing that

ω^2 α^2 ||∇g_i(x)||^2 ≤ 2ωα⟨∇g_i(x), x − y⟩,

which, in turn, is equivalent to showing that

ω(g_i(x) − b_i) ≤ 2⟨∇g_i(x), x − y⟩. (5.1)

Since g_i(y) ≤ b_i and g_i is convex, we have

(g_i(x) − b_i) ≤ (g_i(x) − g_i(y)) ≤ ⟨∇g_i(x), x − y⟩.

Since ω < 2, Inequality (5.1) follows immediately.

The CSP algorithm has the iterative step

xk+1 = T i(k)xk,

where i(k) = k(mod I) + 1. Since each of the operators T_i is pc, the sequence converges to a member of C, whenever C is non-empty, as a consequence of the EKN Theorem.

Let A be an I by J real matrix, and for each i let g_i(x) = (Ax)_i. Then the gradient of g_i is ∇g_i(x) = a_i, the ith column of A^T. The set C_i is the half-space C_i = {x | (Ax)_i ≤ b_i}, and the operator T_i is the orthogonal projection onto C_i. The CSP algorithm in this case becomes the AMS algorithm for finding x with Ax ≤ b.
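A minimal sketch of this AMS iteration, cycling through the half-spaces with relaxation ω; the data and the stopping rule are assumptions of the example, not part of the text.

```python
import numpy as np

def ams(A, b, omega=1.0, sweeps=200):
    # Cyclic projection onto the half-spaces C_i = {x : (Ax)_i <= b_i};
    # each step is a (relaxed) orthogonal projection onto a violated C_i.
    x = np.zeros(A.shape[1])
    for _ in range(sweeps):
        for i in range(A.shape[0]):
            residual = A[i] @ x - b[i]
            if residual > 0:
                x = x - omega * (residual / (A[i] @ A[i])) * A[i]
    return x

A = np.array([[1.0, 1.0], [-1.0, 2.0], [0.0, -1.0]])
b = np.array([2.0, 3.0, -1.0])      # the system Ax <= b has solutions
x = ams(A, b)
print(x, A @ x - b)                  # residuals are (essentially) non-positive
```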


5.2 The EKN Theorem

We have the Elsner/Koltracht/Neumann Theorem and its corollaries from [64]:

Theorem 5.2 Suppose that there is a vector norm on X, with respect to which T_i is a pc operator, for all i = 1,...,I, and that F = ∩_{i=1}^I Fix(T_i) is not empty. Let

T = T_I T_{I−1} · · · T_2 T_1.

For k = 0, 1,..., let i(k) = k(mod I) + 1, and xk+1 = T_{i(k)} xk. The sequence {xk} converges to a member of F, for every starting vector x0.

Proof: Let y ∈ F. Then, for k = 0, 1,...,

||xk+1 − y|| = ||T_{i(k)} xk − y|| ≤ ||xk − y||,

so that the sequence {||xk − y||} is decreasing; let d ≥ 0 be its limit. Since the sequence {xk} is bounded, we select an arbitrary cluster point, x∗. Then d = ||x∗ − y||, from which we can conclude that

||T ix∗ − y|| = ||x∗ − y||,

and T_i x∗ = x∗, for i = 1,...,I; therefore, x∗ ∈ F. Replacing y, an arbitrary member of F, with x∗, we have that ||xk − x∗|| is decreasing. But, a subsequence converges to zero, so the whole sequence must converge to zero. This completes the proof.

Corollary 5.1 If T is pc with respect to some vector norm, and T has fixed points, then the iterative sequence {xk} generated by Equation (3.1) converges to a fixed point of T, for every starting vector x0.

Corollary 5.2 If F = ∩_{i=1}^I Fix(T_i) is not empty, then F = Fix(T).

Proof: The sequence xk+1 = T_{i(k)} xk converges to a member of Fix(T), for every x0. Select x0 in F.

Corollary 5.3 The product T of two or more pc operators T_i, i = 1,...,I, is again a pc operator, if F = ∩_{i=1}^I Fix(T_i) is not empty.

Proof: Suppose that y ∈ F = Fix(T) and

||Tx − y|| = ||x − y||.

Then, since

||T_I(T_{I−1} · · · T_1)x − y|| ≤ ||T_{I−1} · · · T_1 x − y|| ≤ ... ≤ ||T_1 x − y|| ≤ ||x − y||,

it follows that

||T_i x − y|| = ||x − y||,

and T_i x = x, for each i. Therefore, Tx = x.


5.3 Linear and Affine Paracontractions

Say that the linear operator B is diagonalizable if X has a basis consisting of eigenvectors of B. In that case let the columns of V be such an eigenvector basis. Then we have V^{-1}BV = L, where L is the diagonal matrix having the eigenvalues of B along its diagonal.

5.3.1 Back-propagation-of-error Methods

Suppose that A is I by J, with J > I, and that Ax = b has infinitely many solutions. A backpropagation-of-error approach leads to an algorithm with the iterative step

xk+1 = xk + γC †(b − Axk),

where C is some I by J matrix. The algorithm can then be written in the form of Equation (3.1), for T the affine operator given by

Tx = (I − γC†A)x + γC†b.

Since Ax = b has multiple solutions, A has a non-trivial null space, so that some of the eigenvalues of B = (I − γC†A) are equal to one. As we shall see, if γ is chosen so that |λ| < 1, for all the remaining eigenvalues of B, and B is diagonalizable, then T will be pc, with respect to some vector norm, and the iterative sequence {xk} will converge to a solution. For such a γ to exist, it is necessary that, for all non-zero eigenvalues λ = a + bi of C†A, the real parts a have the same sign, which we may, without loss of generality, assume to be positive. Then we need to select γ in the intersection of the intervals (0, 2a/(a^2 + b^2)), taken over every such eigenvalue λ.

5.3.2 Defining the Norm

Suppose that Tx = Bx + d is an affine linear operator whose linear part B is diagonalizable, and |λ| < 1 for all eigenvalues λ of B that are not equal to one. Let {u1,...,uJ} be linearly independent eigenvectors of B. For each x, we have

x = ∑_{j=1}^J aj uj,

for some coefficients aj. Define

||x|| = ∑_{j=1}^J |aj|.

We know from a previous exercise that T  is pc with respect to this norm.

It follows from Theorem 5.2 that the iterative sequence {xk} will converge to a fixed point of T, whenever T has fixed points.


5.3.3 Proof of Convergence

It is not difficult to prove convergence directly, as we now show.

Proof of convergence: Let the eigenvalues of B be λj, for j = 1,...,J, with associated linearly independent eigenvectors uj. Define a norm on vectors x by

||x|| = ∑_{j=1}^J |aj|,

for

x = ∑_{j=1}^J aj uj.

Assume that λj = 1, for j = K + 1,...,J, and that |λj| < 1, for j = 1,...,K. Let

d = ∑_{j=1}^J dj uj.

Let x be an arbitrary fixed point of T, with

x = ∑_{j=1}^J aj uj.

From Tx = x we have

∑_{j=1}^J aj uj = ∑_{j=1}^J (λj aj + dj) uj.

Then, with

xk = ∑_{j=1}^J ajk uj,

and

xk+1 = Bxk + d = ∑_{j=1}^J (λj ajk + dj) uj,

we have

xk − x = ∑_{j=1}^J (ajk − aj) uj,

and

xk+1 − x = ∑_{j=1}^K λj(ajk − aj) uj + ∑_{j=K+1}^J (ajk − aj) uj.


Therefore,

||xk − x|| = ∑_{j=1}^K |ajk − aj| + ∑_{j=K+1}^J |ajk − aj|,

while

||xk+1 − x|| = ∑_{j=1}^K |λj| |ajk − aj| + ∑_{j=K+1}^J |ajk − aj|.

Consequently,

||xk − x|| − ||xk+1 − x|| = ∑_{j=1}^K (1 − |λj|) |ajk − aj|.

It follows that the sequence {||xk − x||} is decreasing, and that the sequences {|ajk − aj|} converge to zero, for each j = 1,...,K.

Since the sequence {xk} is then bounded, select a cluster point, x∗, with

x∗ = ∑_{j=1}^J a∗j uj.

Then we must have

{|ajk − a∗j|} → 0,

for j = 1,...,K. It follows that aj = a∗j, for j = 1,...,K. Therefore,

x − x∗ = ∑_{j=K+1}^J cj uj,

for cj = aj − a∗j. We can conclude, therefore, that

x − Bx = x∗ − Bx∗,

so that x∗ is another solution of the system (I − B)x = d. Therefore, the sequence {||xk − x∗||} is decreasing; but a subsequence converges to zero, so the entire sequence must converge to zero. We conclude that {xk} converges to the solution x∗.

It is worth noting that the condition that B be diagonalizable cannot be omitted. Consider the non-diagonalizable matrix

B =
[ 1  1 ]
[ 0  1 ],

and the affine operator

Tx = Bx + (1, 0)^T.


The fixed points of T are the solutions of (I − B)x = (1, 0)^T, which are the vectors of the form x = (a, −1)^T. With starting vector x0 = (1, 0)^T, we find that xk = (k + 1)x0, so that the sequence {xk} does not converge to a fixed point of T. There is no vector norm for which T is pc.

If  T  is an affine linear operator with diagonalizable linear part, then T is pc whenever T  is av, as the following exercise will establish.

Exercise 5.1 Show that, if  B is a linear av operator, then  |λ| < 1 for all eigenvalues  λ of  B that are not equal to one.

We see from the exercise that, for the case of affine operators T whose linear part is not Hermitian, instead of asking if T is av, we can ask if T is pc; since B will almost certainly be diagonalizable, we can answer this question by examining the eigenvalues of B.


Chapter 6

Bregman-Paracontractive Operators

In the previous chapter, we considered operators that are paracontractive with respect to some norm. In this chapter, we extend that discussion to operators that are paracontractive with respect to some Bregman distance. Our objective here is to examine the extent to which the EKN Theorem and its consequences can be extended to the broader class of Bregman paracontractions. Typically, these operators are not defined on all of X, but on a restricted subset, such as the non-negative vectors, in the case of entropy. For details concerning Bregman distances and related notions, see the appendix.

6.1 Bregman Paracontractions

Let f be a closed proper convex function that is differentiable on the nonempty set int D. The corresponding Bregman distance D_f(x, z) is defined for x ∈ R^J and z ∈ int D by

D_f(x, z) = f(x) − f(z) − ⟨∇f(z), x − z⟩,

where D = {x | f(x) < +∞} is the essential domain of f. When the domain of f is not all of R^J, we define f(x) = +∞ for x outside its domain. Note that D_f(x, z) ≥ 0 always and that D_f(x, z) = +∞ is possible. If f is essentially strictly convex, then D_f(x, z) = 0 implies that x = z.

Let C be a nonempty closed convex set with C ∩ int D ≠ ∅. Pick z ∈ int D. The Bregman projection of z onto C, with respect to f, is

P_C^f(z) = argmin_{x ∈ C ∩ D} D_f(x, z).


If f is essentially strictly convex, then P_C^f(z) exists. If f is strictly convex on D, then P_C^f(z) is unique. We assume that f is Legendre, so that P_C^f(z) is uniquely defined and is in int D; this last condition is sometimes called zone consistency.

We shall make much use of the Bregman Inequality (28.1):

D_f(c, z) ≥ D_f(c, P_C^f z) + D_f(P_C^f z, z). (6.1)

A continuous operator T : int D → int D is called a Bregman paracontraction (bpc) if, for every fixed point z of T, and for every x, we have

D_f(z, Tx) < D_f(z, x),

unless Tx = x. In order for the Bregman distances D_f(z, x) and D_f(z, Tx) to be defined, it is necessary that ∇f(x) and ∇f(Tx) be defined, and so we need to restrict the domain and range of T in the manner above. This can sometimes pose a problem, when the iterative sequence {x^{k+1} = Tx^k} converges to a point on the boundary of the domain of f. This happens, for example, in the EMML and SMART methods, in which each x^k is a positive vector, but the limit can have entries that are zero. One way around this problem is to extend the notion of a fixed point: say that z is an asymptotic fixed point of T if (z, z) is in the closure of the graph of T, that is, (z, z) is the limit of points of the form (x, Tx). Theorems for iterative methods involving Bregman paracontractions can then be formulated to involve convergence to an asymptotic fixed point [26]. In our discussion here, however, we shall not consider this more general situation.

6.1.1 Entropic Projections

As an example of a Bregman distance and Bregman paracontractions, consider the function g(t) = t log(t) − t, with g(0) = 0, and the associated Bregman-Legendre function

f(x) = Σ_{j=1}^J g(x_j),

defined for vectors x in the non-negative cone R^J_+. The corresponding Bregman distance is the Kullback-Leibler, or cross-entropy, distance

D_f(x, z) = f(x) − f(z) − ⟨∇f(z), x − z⟩ = KL(x, z).

For any non-empty, closed, convex set C, the entropic projection operator P_C^e is defined by letting P_C^e z be the member x of C ∩ R^J_+ for which KL(x, z) is minimized.


Theorem 6.1 The operator T = P_C^e is bpc, with respect to the cross-entropy distance.

Proof: The fixed points of T = P_C^e are the vectors c in C ∩ R^J_+. From the Bregman Inequality (6.1) we have

D_f(c, x) − D_f(c, P_C^e x) ≥ D_f(P_C^e x, x) ≥ 0,

with equality if and only if D_f(P_C^e x, x) = 0, in which case Tx = x.

6.1.2 Weighted Entropic Projections

Generally, we cannot exhibit the entropic projection onto a closed, convex set C in closed form. When we consider the EMML and SMART algorithms, we shall focus on non-negative systems Ax = b, in which the entries of A are non-negative, those of b are positive, and we seek a non-negative solution. For each i = 1,...,I, let

H_i = {x ≥ 0 | (Ax)_i = b_i}.

We cannot write the entropic projection of z onto H_i in closed form, but, for each positive vector z, the member of H_i that minimizes the weighted cross-entropy

Σ_{j=1}^J A_ij KL(x_j, z_j) (6.2)

is

x_j = (Q_i^e z)_j = z_j b_i/(Az)_i .

Exercise 6.1 Show that the operator Q_i^e is bpc, with respect to the Bregman distance in Equation (6.2). Hint: show that, for each x in H_i,

Σ_{j=1}^J A_ij KL(x_j, z_j) − Σ_{j=1}^J A_ij KL(x_j, (Q_i^e z)_j) = KL(b_i, (Az)_i).

With Σ_{i=1}^I A_ij = 1, for each j, the iterative step of the EMML algorithm can be written as x^{k+1} = Tx^k, for

(Tx)_j = Σ_{i=1}^I A_ij (Q_i^e x)_j ,


and that of the SMART is x^{k+1} = Tx^k, for

(Tx)_j = Π_{i=1}^I [(Q_i^e x)_j]^{A_ij} .

It follows from the theory of these two algorithms that, in both cases, T is bpc, with respect to the cross-entropy distance.
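The weighted entropic projection Q_i^e, and the EMML and SMART steps built from it, are simple enough to state in code. The following is a minimal numpy sketch (the function names are mine, not the text's); it assumes a nonnegative array A whose columns sum to one, a positive vector b, and a positive current vector x.

```python
import numpy as np

def entropic_projection(A, b, z, i):
    # (Q_i^e z)_j = z_j * b_i / (Az)_i : weighted entropic projection onto H_i.
    return z * (b[i] / (A[i] @ z))

def emml_step(A, b, x):
    # EMML step when the columns of A sum to one:
    # (Tx)_j = sum_i A_ij (Q_i^e x)_j = x_j * sum_i A_ij b_i/(Ax)_i.
    return x * (A.T @ (b / (A @ x)))

def smart_step(A, b, x):
    # SMART step when the columns of A sum to one:
    # (Tx)_j = prod_i [(Q_i^e x)_j]^{A_ij} = x_j * exp(sum_i A_ij log(b_i/(Ax)_i)).
    return x * np.exp(A.T @ np.log(b / (A @ x)))
```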

6.2 Extending the EKN Theorem

We have the following generalization of Corollary 5.3:

Theorem 6.2 For i = 1,...,I, let T_i be bpc, for the Bregman distance D_f. Let F = ∩_{i=1}^I Fix(T_i) be non-empty, and let T = T_I T_{I−1} · · · T_1. Then the sequence {x^{k+1} = Tx^k} converges to a member of F.

Proof: Let z be in F. Since D_f(z, T_i x) ≤ D_f(z, x), for each i, it follows that D_f(z, x) − D_f(z, Tx) ≥ 0. If equality holds, then

D_f(z, (T_I T_{I−1} · · · T_1)x) = D_f(z, (T_{I−1} · · · T_1)x) = ... = D_f(z, T_1 x) = D_f(z, x),

from which we can conclude that T_i x = x, for each i. Therefore, Tx = x, and T is bpc.

Now we present a generalization of the EKN Theorem.

Theorem 6.3 For i = 1,...,I, let T_i be bpc, for the Bregman distance D_f. Let F = ∩_{i=1}^I Fix(T_i) be non-empty. Let i(k) = k(mod I) + 1 and x^{k+1} = T_{i(k)} x^k. Then the sequence {x^k} converges to a member of F.

Proof: Let z be a member of F. We know that

D_f(z, x^k) − D_f(z, x^{k+1}) ≥ 0,

so that the sequence {D_f(z, x^k)} is decreasing, with limit d ≥ 0. Then the sequence {x^k} is bounded; select a cluster point, x^*. Then T_1 x^* is also a cluster point, so we have

D_f(z, x^*) − D_f(z, T_1 x^*) = 0,

from which we conclude that T_1 x^* = x^*. Similarly, T_2 T_1 x^* = T_2 x^* is a cluster point, and T_2 x^* = x^*. Continuing in this manner, we show that x^* is in F. Then {D_f(x^*, x^k)} → 0, so that {x^k} → x^*.

Corollary 6.1 If F is not empty, then F = Fix(T).

Exercise 6.2 Prove this corollary.


6.3 Multiple Bregman Distances

We saw earlier that both the EMML and the SMART algorithms involve Bregman projections with respect to distances that vary with the sets C_i = H_i. This suggests that Theorem 6.3 could be extended to include continuous operators T_i that are bpc, with respect to Bregman distances D_{f_i} that vary with i. However, there is a counter-example in [31] that shows that the sequence {x^{k+1} = T_{i(k)} x^k} need not converge to a fixed point of T. The problem is that we need some Bregman distance D_h that is independent of i, with {D_h(z, x^k)} decreasing. The result we present now is closely related to the MSGP algorithm.

6.3.1 Assumptions and Notation

We make the following assumptions throughout this section. The function h is super-coercive and Bregman-Legendre with essential domain D = dom h. For i = 1, 2,...,I the function f_i is also Bregman-Legendre, with D ⊆ dom f_i, so that int D ⊆ int dom f_i. For all x ∈ dom h and z ∈ int dom h we have D_h(x, z) ≥ D_{f_i}(x, z), for each i.

6.3.2 The Algorithm

The multi-distance extension of Theorem 6.3 concerns the algorithm with the following iterative step:

x^{k+1} = ∇h^{−1}( ∇h(x^k) − ∇f_{i(k)}(x^k) + ∇f_{i(k)}(T_{i(k)}(x^k)) ). (6.3)
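The iterative step (6.3) can be written generically in terms of the gradient maps involved. The sketch below is only illustrative; the callables grad_h, grad_h_inv, grad_f_i and T_i are hypothetical placeholders for ∇h, (∇h)^{−1}, ∇f_{i(k)} and T_{i(k)}.

```python
import numpy as np

def multidistance_step(x, grad_h, grad_h_inv, grad_f_i, T_i):
    """One step of Equation (6.3), written with generic callables.

    h is Legendre, so grad_h is a bijection of int dom h and grad_h_inv
    inverts it.  All function names here are hypothetical placeholders.
    """
    return grad_h_inv(grad_h(x) - grad_f_i(x) + grad_f_i(T_i(x)))

# Example: with h(x) = sum(x_j log x_j - x_j) one has grad_h = np.log and
# grad_h_inv = np.exp, so a step is multidistance_step(x, np.log, np.exp, gf, T).
```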

6.3.3 A Preliminary Result

For each k = 0, 1,... define the function G_k(·) : dom h → [0, +∞) by

G_k(x) = D_h(x, x^k) − D_{f_{i(k)}}(x, x^k) + D_{f_{i(k)}}(x, T_{i(k)}(x^k)). (6.4)

The next proposition provides a useful identity, which can be viewed as an analogue of Pythagoras' theorem. The proof is not difficult and we omit it.

Proposition 6.1 For each x ∈ dom h, each k = 0, 1,..., and x^{k+1} given by Equation (6.3), we have

G_k(x) = G_k(x^{k+1}) + D_h(x, x^{k+1}). (6.5)

Consequently, x^{k+1} is the unique minimizer of the function G_k(·).

This identity (6.5) is the key ingredient in the convergence proof for the algorithm.


6.3.4 Convergence of the Algorithm

We shall prove the following convergence theorem:

Theorem 6.4 Let F be non-empty. Let x^0 ∈ int dom h be arbitrary. Any sequence {x^k} obtained from the iterative scheme given by Equation (6.3) converges to x^∞ ∈ F ∩ dom h.

Proof: Let z be in F. Then it can be shown that

D_h(z, x^k) − D_h(z, x^{k+1}) = G_k(x^{k+1}) + D_{f_{i(k)}}(z, x^k) − D_{f_{i(k)}}(z, T_{i(k)} x^k).

Therefore, the sequence {D_h(z, x^k)} is decreasing, and the non-negative sequences {G_k(x^{k+1})} and {D_{f_{i(k)}}(z, x^k) − D_{f_{i(k)}}(z, T_{i(k)} x^k)} converge to zero. The sequence {x^{mI}} is then bounded and we can select a subsequence {x^{m_n I}} with limit point x^{*,0}. Since the sequence {x^{m_n I + 1}} is bounded, it has a subsequence with limit x^{*,1}. But, since

D_{f_1}(z, x^{m_n I}) − D_{f_1}(z, x^{m_n I + 1}) → 0,

we conclude that T_1 x^{*,0} = x^{*,0}. Continuing in this way, we eventually establish that T_i x^{*,0} = x^{*,0}, for each i. So, x^{*,0} is in F. Using x^{*,0} in place of z, we find that {D_h(x^{*,0}, x^k)} is decreasing; but a subsequence converges to zero, so the entire sequence converges to zero, and {x^k} → x^{*,0}.


Part III

Systems of Linear Equations


Chapter 7

The Algebraic Reconstruction Technique

The algebraic reconstruction technique (ART) [70] is a sequential iterative algorithm for solving an arbitrary system Ax = b of I real or complex linear equations in J unknowns. For notational simplicity, we shall assume, from now on in this chapter, that the equations have been normalized so that the rows of A have Euclidean length one.

7.1 The ART

For each index value i let H_i be the hyperplane of J-dimensional vectors given by

H_i = {x | (Ax)_i = b_i}, (7.1)

and P_i the orthogonal projection operator onto H_i. Let x^0 be arbitrary and, for each nonnegative integer k, let i(k) = k(mod I) + 1. The iterative step of the ART is

x^{k+1} = P_{i(k)} x^k.

Because the ART uses only a single equation at each step, it has been called a row-action method [37].

We also consider the full-cycle ART, with iterative step z^{k+1} = Tz^k, for

T = P_I P_{I−1} · · · P_2 P_1.

As we saw previously, the operators P_i are averaged (av), so that the operator T is av. According to the KM theorem, the sequence {T^k x} will converge to a fixed point of T, for any x, whenever such fixed points exist.


When the system Ax = b has solutions, the fixed points of T are solutions. When there are no solutions of Ax = b, the operator T will still have fixed points, but they will no longer be exact solutions.

The ART can also include relaxation. For ω in the interval (0, 2), let

Q_i = (1 − ω)I + ωP_i.

As we have seen, the operators Q_i are also av, as is their product.

7.2 Calculating the ART

Given any vector z, the vector in H_i closest to z, in the sense of the Euclidean distance, has the entries

x_j = z_j + A_ij(b_i − (Az)_i)/Σ_{m=1}^J |A_im|^2 = z_j + A_ij(b_i − (Az)_i). (7.2)

The ART is the following: begin with an arbitrary vector x^0; for each nonnegative integer k, having found x^k, let x^{k+1} be the vector in H_i closest to x^k. We can use Equation (7.2) to write

x^{k+1}_j = x^k_j + A_ij(b_i − (Ax^k)_i). (7.3)

When the system Ax = b has exact solutions, the ART converges to the solution closest to x^0. How fast the algorithm converges will depend on the ordering of the equations and on whether or not we use relaxation. In selecting the equation ordering, the important thing is to avoid particularly bad orderings, in which the hyperplanes H_i and H_{i+1} are nearly parallel. Relaxed ART has the iterative step

x^{k+1}_j = x^k_j + γ A_ij(b_i − (Ax^k)_i), (7.4)

where γ ∈ (0, 2).
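A minimal numpy sketch of the relaxed ART sweep of Equation (7.4) follows; the function name and arguments are mine, and the rows of A are assumed to be already normalized, as in the text.

```python
import numpy as np

def art(A, b, x0, gamma=1.0, n_cycles=100):
    """Relaxed ART, Eq. (7.4); gamma = 1 gives the plain ART of Eq. (7.3).

    Assumes the rows of A have Euclidean length one and gamma lies in (0, 2).
    """
    x = np.asarray(x0, dtype=float).copy()
    I = A.shape[0]
    for k in range(n_cycles * I):
        i = k % I                      # cycle through the equations
        x = x + gamma * A[i] * (b[i] - A[i] @ x)
    return x
```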

7.3 When Ax = b Has Solutions

When the system Ax = b is consistent, that is, has solutions, the convergence of the full-cycle ART sequence

z^{k+1} = P_I P_{I−1} · · · P_2 P_1 z^k

to a solution is a consequence of the KM theorem. In fact, as we shall show now, the ART sequence {x^{k+1} = P_{i(k)} x^k} also converges, and to the solution closest to the initial vector x^0.


Exercise 7.1 Let x^0 and y^0 be arbitrary and {x^k} and {y^k} be the sequences generated by applying the ART algorithm, beginning with x^0 and y^0, respectively; that is, y^{k+1} = P_{i(k)} y^k. Show that

||x^0 − y^0||^2 − ||x^I − y^I||^2 = Σ_{i=1}^I ((Ax^{i−1})_i − (Ay^{i−1})_i)^2. (7.5)

We give a proof of the following result.

Theorem 7.1 Let Ax = b and let x^0 be arbitrary. Let {x^k} be generated by Equation (7.3). Then the sequence {||x − x^k||} is decreasing and {x^k} converges to the solution of Ax = b closest to x^0.

Proof: Let Ax = b. Let v^r_i = (Ax^{rI+i−1})_i and v^r = (v^r_1,...,v^r_I)^T, for r = 0, 1,.... It follows from Equation (7.5) that the sequence {||x − x^{rI}||} is decreasing and the sequence {v^r − b} → 0. So {x^{rI}} is bounded; let x^{*,0} be a cluster point. Then, for i = 1, 2,...,I, let x^{*,i} be the successor of x^{*,i−1} using the ART algorithm. It follows that (Ax^{*,i−1})_i = b_i for each i, from which we conclude that x^{*,0} = x^{*,i} for all i and that Ax^{*,0} = b. Using x^{*,0} in place of the arbitrary solution x, we have that the sequence {||x^{*,0} − x^k||} is decreasing. But a subsequence converges to zero, so {x^k} converges to x^{*,0}. By Equation (7.5), the difference ||x − x^k||^2 − ||x − x^{k+1}||^2 is independent of which solution x we pick; consequently, so is ||x − x^0||^2 − ||x − x^{*,0}||^2. It follows that x^{*,0} is the solution closest to x^0. This completes the proof.

7.4 When Ax = b Has No Solutions

When there are no exact solutions, the ART does not converge to a single vector, but, for each fixed i, the subsequence {x^{nI+i}, n = 0, 1,...} converges to a vector z^i and the collection {z^i | i = 1,...,I} is called the limit cycle [112, 58, 33]. For simplicity, we assume that I > J, and that the matrix A has full rank, which implies that Ax = 0 if and only if x = 0. Because the operator T = P_I P_{I−1} · · · P_2 P_1 is av, this subsequential convergence to a limit cycle will follow from the KM theorem, once we have established that T has fixed points.

7.4.1 Subsequential Convergence of ART

We know from Exercise (26.17) that the operator T is affine linear and has the form

Tx = Bx + d,

where B is the matrix

B = (I − a^I (a^I)^†) · · · (I − a^1 (a^1)^†),


and d a vector.

The matrix I − B is invertible, since if (I − B)x = 0, then Bx = x. It follows that x is in H_{i0} for each i, which means that ⟨a^i, x⟩ = 0 for each i. Therefore Ax = 0, and so x = 0.

Exercise 7.2 Show that the operator T is strictly nonexpansive, meaning that

||x − y|| ≥ ||Tx − Ty||,

with equality if and only if x = Tx and y = Ty. Hint: Write Tx − Ty = Bx − By = B(x − y). Since B is the product of orthogonal projections, B is av. Therefore, there is α > 0 with

||x − y||^2 − ||Bx − By||^2 ≥ (1/α − 1) ||(I − B)x − (I − B)y||^2.

The function ||x − Tx|| has minimizers, since ||x − Tx||^2 = ||x − Bx − d||^2 is quadratic in x. For any such minimizer z we will have

||z − Tz|| = ||Tz − T^2 z||.

Since T is strictly ne, it follows that z = Tz.

Exercise 7.3 Let AA† = L + D + L†, for diagonal matrix D and lower triangular matrix L. Show that, for the operator T above, Tx can be written as

Tx = (I − A†(L + D)^{−1}A)x + A†(L + D)^{−1}b.

As we shall see, this formulation of the operator T provides a connection between the full-cycle ART for Ax = b and the Gauss-Seidel method, as applied to the system AA†z = b [55].

The ART limit cycle will vary with the ordering of the equations, and contains more than one vector unless an exact solution exists. There are several open questions about the limit cycle.

Open Question: For a fixed ordering, does the limit cycle depend on the initial vector x^0? If so, how?

7.4.2 The Geometric Least-Squares Solution

When the system Ax = b has no solutions, it is reasonable to seek an approximate solution, such as the least squares solution, x_{LS} = (A†A)^{−1}A†b, which minimizes ||Ax − b||. It is important to note that the system Ax = b has solutions if and only if the related system WAx = Wb has solutions, where W denotes an invertible matrix; when solutions of Ax = b exist, they are identical to those of WAx = Wb. But, when Ax = b does not have


solutions, the least-squares solutions of Ax = b, which need not be unique, but usually are, and the least-squares solutions of WAx = Wb need not be identical. In the typical case in which A†A is invertible, the unique least-squares solution of Ax = b is

(A†A)^{−1}A†b,

while the unique least-squares solution of WAx = Wb is

(A†W†WA)^{−1}A†W†Wb,

and these need not be the same. A simple example is the following. Consider the system

x = 1; x = 2,

which has the unique least-squares solution x = 1.5, and the system

2x = 2; x = 2,

which has the least-squares solution x = 1.2. The so-called geometric least-squares solution of Ax = b is the least-squares solution of WAx = Wb, for W the diagonal matrix whose entries are the reciprocals of the Euclidean lengths of the rows of A. In our example above, the geometric least-squares solution for the first system is found by using W_11 = 1 = W_22, so is again x = 1.5, while the geometric least-squares solution of the second system is found by using W_11 = 0.5 and W_22 = 1, so that the geometric least-squares solution is x = 1.5, not x = 1.2.
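The little example above is easy to check numerically; the following lines, using numpy's least-squares solver, reproduce the values 1.5, 1.2 and 1.5 (this is only a verification of the example, not part of the text's development).

```python
import numpy as np

# Ordinary least-squares solutions of the two one-unknown systems.
A1, b1 = np.array([[1.0], [1.0]]), np.array([1.0, 2.0])
A2, b2 = np.array([[2.0], [1.0]]), np.array([2.0, 2.0])
print(np.linalg.lstsq(A1, b1, rcond=None)[0])          # [1.5]
print(np.linalg.lstsq(A2, b2, rcond=None)[0])          # [1.2]

# Geometric least-squares: weight each equation by 1/||row||.
W = np.diag(1.0 / np.linalg.norm(A2, axis=1))
print(np.linalg.lstsq(W @ A2, W @ b2, rcond=None)[0])  # [1.5]
```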

Open Question: If there is a unique geometric least-squares solution, where is it, in relation to the vectors of the limit cycle? Can it be calculated easily, from the vectors of the limit cycle?

There is a partial answer to the second question. In [23] (see also [33]) it was shown that if the system Ax = b has no exact solution, and if I = J + 1, then the vectors of the limit cycle lie on a sphere in J-dimensional space having the least-squares solution at its center. This is not generally true, however.

Open Question: In both the consistent and inconsistent cases, the sequence {x^k} of ART iterates is bounded [112, 58, 23, 33]. The proof is easy in the consistent case. Is there an easy proof for the inconsistent case?

7.4.3 Nonnegatively Constrained ART

If we are seeking a nonnegative solution for the real system Ax = b, we can modify the ART by replacing the x^{k+1} given by Equation (7.3) with (x^{k+1})_+. This version of ART will converge to a nonnegative solution, whenever one exists, but will produce a limit cycle otherwise.


7.5 Avoiding the Limit Cycle

Generally, the greater the minimum value of ||Ax − b||^2, the more the vectors of the LC are distinct from one another. There are several ways to avoid the LC in ART and to obtain a least-squares solution. One way is the double ART (DART) [27]:

7.5.1 Double ART (DART)

We know that any b can be written as b = Ax + w, where A^T w = 0 and x is a minimizer of ||Ax − b||^2. The vector w is the orthogonal projection of b onto the null space of the matrix transformation A^T. Therefore, in Step 1 of DART we apply the ART algorithm to the consistent system of linear equations A^T w = 0, beginning with w^0 = b. The limit is w^∞ = w, the member of the null space of A^T closest to b. In Step 2, apply ART to the consistent system of linear equations Ax = b − w^∞ = Ax. The limit is then the minimizer of ||Ax − b|| closest to x^0. Notice that we could also obtain the least-squares solution by applying ART to the system A^T y = A^T b, starting with y^0 = 0, to obtain the minimum-norm solution, which is y = Ax, and then applying ART to the system Ax = y.

7.5.2 Strongly Underrelaxed ART

Another method for avoiding the LC is strong underrelaxation [38]. Let t > 0. Replace the iterative step in ART with

x^{k+1}_j = x^k_j + t A_ij(b_i − (Ax^k)_i). (7.6)

In [38] it is shown that, as t → 0, the vectors of the LC approach the geometric least squares solution closest to x^0; a short proof is in [23]. Bertsekas [12] uses strong underrelaxation to obtain convergence of more general incremental methods.

7.6 Approximate Solutions and the Nonnegativity Constraint

For the real system Ax = b, consider the nonnegatively constrained least-squares problem of minimizing the function ||Ax − b||, subject to the constraints x_j ≥ 0 for all j; this is a nonnegatively constrained least-squares approximate solution. As noted previously, we can solve this problem using a slight modification of the ART. Although there may be multiple solutions x, we know, at least, that Ax is the same for all solutions.


According to the Karush-Kuhn-Tucker theorem [103], the vector Ax must satisfy the condition

Σ_{i=1}^I A_ij((Ax)_i − b_i) = 0 (7.7)

for all j for which x_j > 0 for some solution x. Let S be the set of all indices j for which there exists a solution x with x_j > 0. Then Equation (7.7) must hold for all j in S. Let Q be the matrix obtained from A by deleting those columns whose index j is not in S. Then Q^T(Ax − b) = 0. If Q has full rank and the cardinality of S is greater than or equal to I, then Q^T is one-to-one and Ax = b. We have proven the following result.

Theorem 7.2 Suppose that A has the full-rank property, that is, A and every matrix Q obtained from A by deleting columns has full rank. Suppose there is no nonnegative solution of the system of equations Ax = b. Then there is a subset S of the set {j = 1, 2,...,J} with cardinality at most I − 1 such that, if x is any minimizer of ||Ax − b|| subject to x ≥ 0, then x_j = 0 for j not in S. Therefore, x is unique.

When x is a vectorized two-dimensional image and J > I, the presence of at most I − 1 positive pixels makes the resulting image resemble stars in the sky; for that reason this theorem and the related result for the EMML algorithm ([19]) are sometimes called night sky theorems. The zero-valued pixels typically appear scattered throughout the image. This behavior occurs with all the algorithms discussed so far that impose nonnegativity, whenever the real system Ax = b has no nonnegative solutions.

This result leads to the following open question:

Open Question: How does the set S defined above vary with the choice of algorithm, with the choice of x^0 for a given algorithm, and for the choice of subsets in the block-iterative algorithms?


Chapter 8

Simultaneous ART

The ART is a sequential algorithm, using only a single equation from the system Ax = b at each step of the iteration. In this chapter we consider iterative procedures for solving Ax = b in which all of the equations are used at each step. Such methods are called simultaneous algorithms. As before, we shall assume that the equations have been normalized so that the rows of A have Euclidean length one.

8.1 Cimmino’s Algorithm

The ART seeks a solution of Ax = b by projecting the current vector x^k orthogonally onto the next hyperplane H(a^{i(k)}, b_{i(k)}) to get x^{k+1}. In Cimmino's algorithm, we project the current vector x^k onto each of the hyperplanes and then average the result to get x^{k+1}. The algorithm begins with an arbitrary x^0; the iterative step is then

x^{k+1} = (1/I) Σ_{i=1}^I P_i x^k, (8.1)

where P_i is the orthogonal projection onto H(a^i, b_i).

Exercise 8.1 Show that the iterative step can then be written as

x^{k+1} = x^k + (1/I) A†(b − Ax^k). (8.2)

As we saw in our discussion of the ART, when the system Ax = b has no solutions, the ART does not converge to a single vector, but to a limit cycle. One advantage of many simultaneous algorithms, such as Cimmino's, is that they do converge to the least squares solution in the inconsistent case.


Cimmino’s algorithm has the form xk+1 = T xk, for the operator T given by

T x = (I − 1

I A†A)x +

1

I A†b.

Experience with Cimmino's algorithm shows that it is slow to converge. In the next section we consider how we might accelerate the algorithm.

8.2 The Landweber Algorithms

The Landweber algorithm [87, 11], with the iterative step

x^{k+1} = x^k + γ A†(b − Ax^k), (8.3)

converges to the least squares solution closest to the starting vector x^0, provided that 0 < γ < 2/λ_max, where λ_max is the largest eigenvalue of the nonnegative-definite matrix A†A. Loosely speaking, the larger γ is, the faster the convergence. However, precisely because A is large, calculating the matrix A†A, not to mention finding its largest eigenvalue, can be prohibitively expensive. The matrix A is said to be sparse if most of its entries are zero. In [30] upper bounds for λ_max were obtained in terms of the degree of sparseness of the matrix A; we discuss these bounds in the final section of this chapter.
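A minimal sketch of the Landweber iteration (8.3) is given below; the function name and the default choice γ = 1/λ_max are mine, and A is assumed real.

```python
import numpy as np

def landweber(A, b, x0, gamma=None, n_iters=1000):
    """Landweber iteration (Eq. 8.3) for a real matrix A.

    Converges to the least-squares solution closest to x0 whenever
    0 < gamma < 2/lambda_max(A^T A).  With rows of A normalized and
    gamma = 1/I, one step reduces to Cimmino's step (Eq. 8.2).
    """
    x = np.asarray(x0, dtype=float).copy()
    if gamma is None:
        gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/lambda_max, a safe choice
    for _ in range(n_iters):
        x = x + gamma * (A.T @ (b - A @ x))
    return x
```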

8.2.1 Finding the Optimum γ 

The operator

Tx = x + γA†(b − Ax) = (I − γA†A)x + γA†b

is affine linear and is av if and only if its linear part, the Hermitian matrix

B = I − γA†A,

is av. To guarantee this we need 0 ≤ γ < 2/λ_max. Should we always try to take γ near its upper bound, or is there an optimum value of γ? To answer this question we consider the eigenvalues of B for various values of γ.

Exercise 8.2 Show that, if  γ < 0, then none of the eigenvalues of  B is less than one.

Exercise 8.3 Show that, for

0 ≤ γ ≤ 2/(λ_max + λ_min),

we have ρ(B) = 1 − γλ_min;


the smallest value of ρ(B) occurs when

γ = 2/(λ_max + λ_min),

and equals

(λ_max − λ_min)/(λ_max + λ_min).

Similarly, show that, for

γ ≥ 2/(λ_max + λ_min),

we have ρ(B) = γλ_max − 1; the smallest value of ρ(B) occurs when

γ = 2/(λ_max + λ_min),

and equals

(λ_max − λ_min)/(λ_max + λ_min).

We see from this exercise that, if 0 < γ < 2/λ_max, and λ_min > 0, then ||B|| = ρ(B) < 1, so that B is sc. We minimize ||B|| by taking

γ = 2/(λ_max + λ_min),

in which case we have

||B|| = (λ_max − λ_min)/(λ_max + λ_min) = (c − 1)/(c + 1),

for c = λ_max/λ_min, the condition number of the positive-definite matrix A†A. The closer c is to one, the smaller the norm ||B||, and the faster the convergence.

On the other hand, if λ_min = 0, then ρ(B) = 1 for all γ in the interval (0, 2/λ_max). The matrix B is still av, but it is no longer sc. For example, consider the orthogonal projection P_0 onto the hyperplane H_0 = H(a, 0), where ||a|| = 1. This operator can be written

P_0 = I − aa†.

The largest eigenvalue of aa† is λ_max = 1; the remaining ones are zero. The relaxed projection operator

B = I − γaa†


has ρ(B) = 1 − γ > 1, if γ < 0, and for γ ≥ 0, we have ρ(B) = 1. The operator B is av, in fact, it is fne, but it is not sc.

It is worth noting that the definition of the condition number given above applies only to positive-definite matrices. For general square, invertible matrices S, the condition number depends on the particular induced matrix norm and is defined as

c = ||S|| ||S^{−1}||.

To motivate this definition of the condition number, suppose that x = S^{−1}h is the solution of Sx = h, and that h is perturbed to h + δh. Then let δx be such that x + δx = S^{−1}(h + δh). The relative change in the solution, ||δx||/||x||, is related to the relative change in h, ||δh||/||h||, by

||δx||/||x|| ≤ ||S|| ||S^{−1}|| ||δh||/||h||.

8.2.2 The Projected Landweber Algorithm

When we require a nonnegative approximate solution x for the real system Ax = b, we can use a modified version of the Landweber algorithm, called the projected Landweber algorithm [11], having the iterative step

x^{k+1} = (x^k + γA†(b − Ax^k))_+, (8.4)

where, for any real vector a, we denote by (a)_+ the nonnegative vector whose entries are those of a, for those that are nonnegative, and are zero otherwise. The projected Landweber algorithm converges to a vector that minimizes ||Ax − b|| over all nonnegative vectors x, for the same values of γ.
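A sketch of the projected Landweber step (8.4), again for real A and with a hypothetical function name:

```python
import numpy as np

def projected_landweber(A, b, x0, gamma, n_iters=1000):
    # Projected Landweber (Eq. 8.4): a Landweber step followed by
    # setting any negative entries to zero.
    x = np.maximum(np.asarray(x0, dtype=float), 0.0)
    for _ in range(n_iters):
        x = np.maximum(x + gamma * (A.T @ (b - A @ x)), 0.0)
    return x
```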

Both the Landweber and projected Landweber algorithms are special cases of the CQ algorithm [30], which, in turn, is a special case of the more general iterative fixed point algorithm, the Krasnoselskii/Mann (KM) method.

8.3 An Upper Bound for the Maximum Eigenvalue of A†A

The upper bounds for λ_max we present here apply to any matrix A, but will be particularly helpful when A is sparse.

8.3.1 The Normalized Case

We assume now that the matrix A has been normalized so that each of its rows has Euclidean length one. Denote by s_j the number of nonzero


entries in the j-th column of A, and let s be the maximum of the s_j. Our first result is the following [30]:

Theorem 8.1 For normalized A, λ_max, the largest eigenvalue of the matrix A†A, does not exceed s.

Proof: For notational simplicity, we consider only the case of real matrices and vectors. Let A^T A v = cv for some nonzero vector v. We show that c ≤ s. We have AA^T Av = cAv, and so w^T AA^T w = v^T A^T AA^T Av = c v^T A^T Av = c w^T w, for w = Av. Then, with e_ij = 1 if A_ij ≠ 0 and e_ij = 0 otherwise, we have

(Σ_{i=1}^I A_ij w_i)^2 = (Σ_{i=1}^I A_ij e_ij w_i)^2 ≤ (Σ_{i=1}^I A_ij^2 w_i^2)(Σ_{i=1}^I e_ij^2) = (Σ_{i=1}^I A_ij^2 w_i^2) s_j ≤ (Σ_{i=1}^I A_ij^2 w_i^2) s.

Therefore,

w^T AA^T w = Σ_{j=1}^J (Σ_{i=1}^I A_ij w_i)^2 ≤ Σ_{j=1}^J (Σ_{i=1}^I A_ij^2 w_i^2) s,

and

w^T AA^T w = c Σ_{i=1}^I w_i^2 = c Σ_{i=1}^I w_i^2 (Σ_{j=1}^J A_ij^2) = c Σ_{i=1}^I Σ_{j=1}^J w_i^2 A_ij^2.

The result follows immediately.

When A is normalized, the trace of AA^T, that is, the sum of its diagonal entries, is I. Since the trace is also the sum of the eigenvalues of both AA^T and A^T A, we have λ_max ≤ I. When A is sparse, s is much smaller than I, so it provides a much tighter upper bound for λ_max.

8.3.2 The General Case

A similar upper bound for λ_max is given for the case in which A is not normalized.


Theorem 8.2 For each i = 1,...,I let ν_i = Σ_{j=1}^J |A_ij|^2 > 0. For each j = 1,...,J, let σ_j = Σ_{i=1}^I e_ij ν_i, where e_ij = 1 if A_ij ≠ 0 and e_ij = 0 otherwise. Let σ denote the maximum of the σ_j. Then the eigenvalues of the matrix A†A do not exceed σ.

The proof of Theorem 8.2 is similar to that of Theorem 8.1; the details are in [30].
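The bound σ of Theorem 8.2 is cheap to compute. The following sketch (function name mine) computes σ and checks it against λ_max for a random sparse matrix; the comparison is only an illustration.

```python
import numpy as np

def sparsity_bound(A):
    """Upper bound sigma for lambda_max(A^T A), as in Theorem 8.2."""
    nu = np.sum(np.abs(A) ** 2, axis=1)       # nu_i = squared norm of row i
    e = (A != 0).astype(float)                # e_ij = 1 where A_ij != 0
    sigma_j = e.T @ nu                        # sigma_j = sum_i e_ij nu_i
    return sigma_j.max()

# Illustration on a random sparse matrix:
rng = np.random.default_rng(0)
A = rng.random((200, 300)) * (rng.random((200, 300)) < 0.05)
lam_max = np.linalg.norm(A, 2) ** 2           # largest eigenvalue of A^T A
print(lam_max <= sparsity_bound(A))           # True, by the theorem
```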

8.3.3 Upper Bounds for ε-Sparse Matrices

If A is not sparse, but most of its entries have magnitude not exceeding ε > 0, we say that A is ε-sparse. We can extend the results for the sparse case to the ε-sparse case.

Given a matrix A, define the entries of the matrix B to be B_ij = A_ij if |A_ij| > ε, and B_ij = 0, otherwise. Let C = A − B; then |C_ij| ≤ ε, for all i and j. If A is ε-sparse, then B is sparse. The 2-norm of the matrix A, written ||A||, is defined to be the square root of the largest eigenvalue of the matrix A†A, that is, ||A|| = √λ_max. From Theorem 8.2 we know that ||B|| ≤ √σ. The trace of the matrix C†C does not exceed IJε^2. Therefore

√λ_max = ||A|| = ||B + C|| ≤ ||B|| + ||C|| ≤ √σ + ε√(IJ), (8.5)

so that

λ_max ≤ σ + 2ε√(σIJ) + IJε^2. (8.6)

Simulation studies have shown that these upper bounds become tighter as the size of the matrix A increases. In hundreds of runs, with I and J in the hundreds, we found that the relative error of the upper bound was around one percent [35].


Chapter 9

Jacobi and Gauss-Seidel Methods

Linear systems Ax = b need not be square but can be associated with two square systems, A†Ax = A†b, the so-called normal equations, and AA†z = b, sometimes called the Björck-Elfving equations [55]. In this chapter we consider two well known iterative algorithms for solving square systems of linear equations, the Jacobi method and the Gauss-Seidel method. Both these algorithms are easy to describe and to motivate. They both require not only that the system be square, that is, have the same number of unknowns as equations, but also that it satisfy additional constraints needed for convergence.

Both the Jacobi and the Gauss-Seidel algorithms can be modified to apply to any square system of linear equations, Sz = h. The resulting algorithms, the Jacobi overrelaxation (JOR) and successive overrelaxation (SOR) methods, involve the choice of a parameter. The JOR and SOR will converge for more general classes of matrices, provided that the parameter is appropriately chosen.

When we say that an iterative method is convergent, or converges, under certain conditions, we mean that it converges for any consistent system of the appropriate type, and for any starting vector; any iterative method will converge if we begin at the right answer.

9.1 The Jacobi and Gauss-Seidel Methods: An Example

Suppose we wish to solve the 3 by 3 system

S_11 z_1 + S_12 z_2 + S_13 z_3 = h_1


S_21 z_1 + S_22 z_2 + S_23 z_3 = h_2

S_31 z_1 + S_32 z_2 + S_33 z_3 = h_3,

which we can rewrite as

z_1 = S_11^{−1}[h_1 − S_12 z_2 − S_13 z_3]

z_2 = S_22^{−1}[h_2 − S_21 z_1 − S_23 z_3]

z_3 = S_33^{−1}[h_3 − S_31 z_1 − S_32 z_2],

assuming that the diagonal terms S_mm are not zero. Let z^0 = (z^0_1, z^0_2, z^0_3)^T be an initial guess for the solution. We then insert the entries of z^0 on the right sides and use the left sides to define the entries of the next guess z^1. This is one full cycle of Jacobi's method.

The Gauss-Seidel method is similar. Let z^0 = (z^0_1, z^0_2, z^0_3)^T be an initial guess for the solution. We then insert z^0_2 and z^0_3 on the right side of the first equation, obtaining a new value z^1_1 on the left side. We then insert z^0_3 and z^1_1 on the right side of the second equation, obtaining a new value z^1_2 on the left. Finally, we insert z^1_1 and z^1_2 into the right side of the third equation, obtaining a new z^1_3 on the left side. This is one full cycle of the Gauss-Seidel (GS) method.
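One full cycle of each method, for a square system Sz = h with nonzero diagonal, can be written as follows (a numpy sketch with hypothetical function names):

```python
import numpy as np

def jacobi_cycle(S, h, z):
    # One full cycle of Jacobi's method: every entry is updated using
    # only the old values, z_new = z + D^{-1}(h - Sz).
    return z + (h - S @ z) / np.diag(S)

def gauss_seidel_cycle(S, h, z):
    # One full cycle of Gauss-Seidel: each updated entry is used
    # immediately in the subsequent rows.
    z = np.asarray(z, dtype=float).copy()
    for m in range(len(h)):
        z[m] = (h[m] - S[m] @ z + S[m, m] * z[m]) / S[m, m]
    return z
```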

9.2 Splitting Methods

The Jacobi and the Gauss-Seidel methods are particular cases of a more general approach, known as splitting methods. Splitting methods apply to square systems of linear equations. Let S be an arbitrary N by N square matrix, written as S = M − K. Then the linear system of equations Sz = h is equivalent to Mz = Kz + h. If M is invertible, then we can also write z = M^{−1}Kz + M^{−1}h. This last equation suggests a class of iterative methods for solving Sz = h known as splitting methods. The idea is to select a matrix M that can be easily inverted, and then let

z^{k+1} = M^{−1}Kz^k + M^{−1}h. (9.1)

From K = M − S, we can write Equation (9.1) as

z^{k+1} = z^k + M^{−1}(h − Sz^k). (9.2)

Suppose that S is invertible and z is the unique solution of Sz = h. The error we make at the k-th step is e^k = z − z^k.

Exercise 9.1 Show that e^{k+1} = M^{−1}Ke^k.


We want the error to decrease with each step, which means that we should seek M and K so that ||M^{−1}K|| < 1. If S is not invertible and there are multiple solutions of Sz = h, then we do not want M^{−1}K to be a strict contraction, but only av or pc. The operator T defined by

Tz = M^{−1}Kz + M^{−1}h = Bz + d

is an affine linear operator and will be a sc or av operator whenever B = M^{−1}K is.

It follows from our previous discussion concerning linear av operators that, if B = B† is Hermitian, then B is av if and only if

−1 < λ ≤ 1,

for all (necessarily real) eigenvalues λ of B.

In general, though, the matrix B = M^{−1}K will not be Hermitian, and deciding if such a non-Hermitian matrix is av is not a simple matter. Instead, we can use Theorem ??. According to that theorem, if B has a basis of eigenvectors, and |λ| < 1 for all eigenvalues λ of B that are not equal to one, then {z^k} will converge to a solution of Sz = h, whenever solutions exist.

In what follows we shall write an arbitrary square matrix S as

S = L + D + U,

where L is the strictly lower triangular part of S, D the diagonal part, and U the strictly upper triangular part. When S is Hermitian, we have

S = L + D + L†.

We list now several examples of iterative algorithms obtained by the splitting method. In the remainder of the chapter we discuss these methods in more detail.

9.3 Some Examples of Splitting Methods

As we shall now see, the Jacobi and Gauss-Seidel methods, as well as their overrelaxed versions, JOR and SOR, are splitting methods.

Jacobi’s Method: Jacobi’s method uses M  = D and K  = −L−U , underthe assumption that D is invertible. The matrix B is

B = M −1K  = −D−1(L + U ). (9.3)

The Gauss-Seidel Method: The Gauss-Seidel (GS) method uses thesplitting M  = D + L, so that the matrix B is

B = I − (D + L)−1S. (9.4)


The Jacobi Overrelaxation Method (JOR): The JOR uses the splitting

M = (1/ω)D

and

K = M − S = ((1/ω) − 1)D − L − U.

The matrix B is

B = M^{−1}K = I − ωD^{−1}S. (9.5)

The Successive Overrelaxation Method (SOR): The SOR uses the splitting M = ((1/ω)D + L), so that

B = M^{−1}K = (D + ωL)^{−1}[(1 − ω)D − ωU],

or

B = I − ω(D + ωL)^{−1}S,

or

B = (I + ωD^{−1}L)^{−1}[(1 − ω)I − ωD^{−1}U]. (9.6)

9.4 Jacobi’s Algorithm and JOR

Most textbooks on numerical analysis describe Jacobi's method as an iterative procedure for solving Sz = h, where S is a square matrix. The matrix B in Equation (9.3) is not generally av and the Jacobi iterative scheme will not converge, in general. Additional conditions need to be imposed on S in order to guarantee convergence. One such condition is that S be strictly diagonally dominant. In that case, all the eigenvalues of B = M^{−1}K can be shown to lie inside the unit circle of the complex plane, so that ρ(B) < 1. It follows from Lemma 27.1 that B is sc with respect to some vector norm, and the Jacobi iteration converges. If, in addition, S is Hermitian, the eigenvalues of B are in the interval (−1, 1), and so B is sc with respect to the Euclidean norm.

Alternatively, one has the Jacobi overrelaxation (JOR) method, which is essentially a special case of the Landweber algorithm and involves an arbitrary parameter.

For S an N by N matrix, Jacobi's method can be written as

z^new_m = S_mm^{−1}[h_m − Σ_{j≠m} S_mj z^old_j],

for m = 1,...,N. With D the invertible diagonal matrix with entries D_mm = S_mm we can write one cycle of Jacobi's method as


z^new = z^old + D^{−1}(h − Sz^old).

The Jacobi overrelaxation (JOR) method has the following full-cycle iterative step:

z^new = z^old + ωD^{−1}(h − Sz^old);

choosing ω = 1 we get the Jacobi method. Convergence of the JOR iteration will depend, of course, on properties of S and on the choice of ω. When S is Hermitian, nonnegative-definite, for example, S = A†A or S = AA†, we can say more.

9.4.1 The JOR in the Nonnegative-definite Case

When S is nonnegative-definite and the system Sz = h is consistent, the JOR converges to a solution for any ω ∈ (0, 2/ρ(D^{−1/2}SD^{−1/2})), where ρ(Q) denotes the largest eigenvalue of the nonnegative-definite matrix Q. For nonnegative-definite S, the convergence of the JOR method is implied by the KM theorem, since the JOR is equivalent to Landweber's algorithm in these cases.

Exercise 9.2 Show that Ax = b has solutions if and only if the associated Björck-Elfving equations AA†z = b have solutions.

The JOR method, as applied to Sz = AA†z = b, is equivalent to the Landweber iterative method for Ax = b.

Exercise 9.3 Show that, if {z^k} is the sequence obtained from the JOR, then the sequence {A†z^k} is the sequence obtained by applying the Landweber algorithm to the system D^{−1/2}Ax = D^{−1/2}b, where D is the diagonal part of the matrix S = AA†.

If we select ω = 1/I we obtain the Cimmino method. Since the trace of the matrix D^{−1/2}SD^{−1/2} equals I, we know that the largest eigenvalue of D^{−1/2}SD^{−1/2} is not greater than I, so this choice of ω is acceptable and the Cimmino algorithm converges whenever there are solutions of Ax = b. In fact, it can be shown that Cimmino's method converges to a least squares approximate solution generally.

Similarly, the JOR method applied to the system A†Ax = A†b is equivalent to the Landweber algorithm, applied to the system Ax = b.

Exercise 9.4 Show that, if {z^k} is the sequence obtained from the JOR, then the sequence {D^{1/2}z^k} is the sequence obtained by applying the Landweber algorithm to the system AD^{−1/2}x = b, where D is the diagonal part of the matrix S = A†A.


9.5 The Gauss-Seidel Algorithm and SOR

In general, the full-cycle iterative step of the Gauss-Seidel method is the following:

z^new = z^old + (D + L)^{−1}(h − Sz^old),

where S = D + L + U is the decomposition of the square matrix S into its diagonal, lower triangular, and upper triangular parts. The GS method does not converge without restrictions on the matrix S. As with the Jacobi method, strict diagonal dominance is a sufficient condition.

The successive overrelaxation (SOR) method has the following full-cycle iterative step:

z^new = z^old + (ω^{−1}D + L)^{−1}(h − Sz^old);

the choice of ω = 1 gives the GS method. Convergence of the SOR iteration will depend, of course, on properties of S and on the choice of ω. When S is Hermitian, nonnegative-definite, as, for example, when we take S = A†A or S = AA†, we can say more.
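In the same spirit, one full cycle of JOR and of SOR can be sketched as follows (function names mine; S is assumed square with nonzero diagonal):

```python
import numpy as np

def jor_cycle(S, h, z, omega):
    # JOR full cycle: z_new = z + omega * D^{-1}(h - Sz); omega = 1 is Jacobi.
    return z + omega * (h - S @ z) / np.diag(S)

def sor_cycle(S, h, z, omega):
    # SOR full cycle: z_new = z + (D/omega + L)^{-1}(h - Sz); omega = 1 is GS.
    M = np.diag(np.diag(S)) / omega + np.tril(S, -1)
    return z + np.linalg.solve(M, h - S @ z)
```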

Exercise 9.5 Use the form

B = (D + ωL)^{−1}[(1 − ω)D − ωU]

to show that |det(B)| = |1 − ω|^N. Conclude from this and the fact that the determinant of B is the product of its eigenvalues that ρ(B) > 1 if ω < 0 or ω > 2.

9.5.1 The SOR in the Nonnegative-definite Case

When S is nonnegative-definite and the system Sz = h is consistent, the SOR converges to a solution for any ω ∈ (0, 2). This follows from the convergence of the ART algorithm, since, for such S, the SOR is equivalent to the ART.

Now we consider the SOR method applied to the Björck-Elfving equations. Rather than count a full cycle as one iteration, we now count as a single step the calculation of a single new entry. Therefore, for k = 0, 1,... the (k+1)-st step replaces the value z^k_i only, where i = k(mod I) + 1. We have

z^{k+1}_i = (1 − ω)z^k_i + ωD_ii^{−1}(b_i − Σ_{n=1}^{i−1} S_in z^k_n − Σ_{n=i+1}^I S_in z^k_n),

and z^{k+1}_n = z^k_n for n ≠ i. Now we calculate x^{k+1} = A†z^{k+1}:

x^{k+1}_j = x^k_j + ωD_ii^{−1}A_ij(b_i − (Ax^k)_i).


This is one step of the relaxed algebraic reconstruction technique (ART) applied to the original system of equations Ax = b. The relaxed ART converges to a solution, when solutions exist, for any ω ∈ (0, 2).

When Ax = b is consistent, so is AA†z = b. We consider now the case in which S = AA† is invertible. Since the relaxed ART sequence {x^k = A†z^k} converges to a solution x^∞, for any ω ∈ (0, 2), the sequence {AA†z^k} converges to b. Since S = AA† is invertible, the SOR sequence {z^k} then converges to S^{−1}b.


Part IV

Positivity in Linear Systems


Chapter 10

The Multiplicative ART (MART)

The multiplicative ART (MART) [70] is an iterative algorithm closely related to the ART. It applies to systems of linear equations Ax = b, for which the b_i are positive and the A_ij are nonnegative; the solution x we seek will have nonnegative entries. It is not so easy to see the relation between ART and MART if we look at the most general formulation of MART. For that reason, we begin with a simpler case, in which the relation is most clearly visible.

10.1 A Special Case of ART and MART

We begin by considering the application of ART to the transmission tomography problem. For i = 1,...,I, let L_i be the set of pixel indices j for which the j-th pixel intersects the i-th line segment, and let |L_i| be the cardinality of the set L_i. Let A_ij = 1 for j in L_i, and A_ij = 0 otherwise. With i = k(mod I) + 1, the iterative step of the ART algorithm is

x^{k+1}_j = x^k_j + (1/|L_i|)(b_i − (Ax^k)_i),

for j in L_i, and

x^{k+1}_j = x^k_j,

if j is not in L_i. In each step of ART, we take the error, b_i − (Ax^k)_i, associated with the current x^k and the i-th equation, and distribute it equally over each of the pixels that intersects L_i.

Suppose, now, that each b_i is positive, and we know in advance that the desired image we wish to reconstruct must be nonnegative. We can begin


with x^0 > 0, but as we compute the ART steps, we may lose nonnegativity. One way to avoid this loss is to correct the current x^k multiplicatively, rather than additively, as in ART. This leads to the multiplicative ART (MART).

The MART, in this case, has the iterative step

x^{k+1}_j = x^k_j (b_i/(Ax^k)_i),

for those j in L_i, and

x^{k+1}_j = x^k_j,

otherwise. Therefore, we can write the iterative step as

x^{k+1}_j = x^k_j (b_i/(Ax^k)_i)^{A_ij}.

10.2 MART in the General Case

Taking the entries of the matrix A to be either one or zero, depending on whether or not the j-th pixel is in the set L_i, is too crude. The line L_i may just clip a corner of one pixel, but pass through the center of another. Surely, it makes more sense to let A_ij be the length of the intersection of line L_i with the j-th pixel, or, perhaps, this length divided by the length of the diagonal of the pixel. It may also be more realistic to consider a strip, instead of a line. Other modifications to A_ij may be made, in order to better describe the physics of the situation. Finally, all we can be sure of is that A_ij will be nonnegative, for each i and j. In such cases, what is the proper form for the MART?

The MART, which can be applied only to nonnegative systems, is a sequential, or row-action, method that uses one equation only at each step of the iteration. The MART begins with a positive vector x^0. Having found x^k for nonnegative integer k, we let i = k(mod I) + 1 and define x^{k+1} by

x^{k+1}_j = x^k_j (b_i/(Ax^k)_i)^{m_i^{−1} A_ij}, (10.1)

where m_i = max{A_ij | j = 1, 2,...,J}. Some treatments of MART leave out the m_i, but require only that the entries of A have been rescaled so that A_ij ≤ 1 for all i and j. The m_i is important, however, in accelerating the convergence of MART.

The MART can be accelerated by relaxation, as well. The relaxed MART has the iterative step

x^{k+1}_j = x^k_j (b_i/(Ax^k)_i)^{γ_i m_i^{−1} A_ij}, (10.2)


where γ_i is in the interval (0, 1). As with ART, finding the best relaxation parameters is a bit of an art.
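A minimal numpy sketch of the MART sweep, covering both Equation (10.1) (γ = 1) and the relaxed step (10.2); the function name and arguments are mine.

```python
import numpy as np

def mart(A, b, x0, gamma=1.0, n_cycles=50):
    """MART, Eq. (10.1); 0 < gamma < 1 gives the relaxed MART of Eq. (10.2).

    Assumes A has nonnegative entries, b is positive, and x0 is positive.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = A.max(axis=1)                      # m_i = max_j A_ij
    I = A.shape[0]
    for k in range(n_cycles * I):
        i = k % I
        x = x * (b[i] / (A[i] @ x)) ** (gamma * A[i] / m[i])
    return x
```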

In the consistent case, by which we mean that Ax = b has nonnegative solutions, we have the following convergence theorem for MART.

Theorem 10.1 In the consistent case, the MART converges to the unique nonnegative solution of b = Ax for which the distance Σ_{j=1}^J KL(x_j, x^0_j) is minimized.

If the starting vector x^0 is the vector whose entries are all one, then the MART converges to the solution that maximizes the Shannon entropy,

SE(x) = Σ_{j=1}^J x_j − x_j log x_j.

As with ART, the speed of convergence is greatly affected by the ordering of the equations, converging most slowly when consecutive equations correspond to nearly parallel hyperplanes.

Open Question: When there are no nonnegative solutions, MART does not converge to a single vector, but, like ART, is always observed to produce a limit cycle of vectors. Unlike ART, there is no proof of the existence of a limit cycle for MART.

10.3 ART and MART as Sequential Projection Methods

We know from our discussion of the ART that the iterative ART step can be viewed as the orthogonal projection of the current vector, x^k, onto H_i, the hyperplane associated with the i-th equation. Can we view MART in a similar way? Yes, but we need to consider a different measure of closeness between nonnegative vectors.

10.3.1 Cross-Entropy or the Kullback-Leibler Distance

For positive numbers u and v, the Kullback-Leibler distance [86] from u to v is

KL(u, v) = u log(u/v) + v − u. (10.3)

We also define KL(0, 0) = 0, KL(0, v) = v and KL(u, 0) = +∞. The KL distance is extended to nonnegative vectors component-wise, so that for nonnegative vectors x and z we have

KL(x, z) = Σ_{j=1}^J KL(x_j, z_j). (10.4)
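For reference, a small numpy implementation of the KL distance with the conventions just stated (the function name is mine):

```python
import numpy as np

def kl(x, z):
    """Componentwise KL distance of Eqs. (10.3)-(10.4), with the
    conventions KL(0,0) = 0, KL(0,v) = v and KL(u,0) = +inf."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(x > 0, x * np.log(x / z) + z - x, z)
    return float(np.sum(terms))
```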


Exercise 10.1 One of the most useful facts about the KL distance is that, for all nonnegative vectors x and z, with z_+ = Σ_{j=1}^J z_j > 0, we have

KL(x, z) = KL(x_+, z_+) + KL(x, (x_+/z_+) z). (10.5)

Prove this.

Given the vector x^k, we find the vector z in H_i for which the KL distance f(z) = KL(x^k, z) is minimized; this z will be the KL projection of x^k onto H_i. Using a Lagrange multiplier, we find that

0 = (∂f/∂z_j)(z) − λ_i A_ij,

for some constant λ_i, so that

0 = −x^k_j/z_j + 1 − λ_i A_ij,

for each j. Multiplying by z_j, we get

z_j − x^k_j = z_j A_ij λ_i. (10.6)

For the special case in which the entries A_ij are zero or one, we can solve Equation (10.6) for z_j. We have

z_j − x^k_j = z_j λ_i,

for each j ∈ L_i, and z_j = x^k_j, otherwise. Multiplying both sides by A_ij and summing over j, we get

b_i − (Ax^k)_i = λ_i b_i,

since (Az)_i = b_i. Consequently, 1 − λ_i = (Ax^k)_i/b_i, and so

z_j = x^k_j/(1 − λ_i) = x^k_j (b_i/(Ax^k)_i),

for j ∈ L_i, which is clearly x^{k+1}_j. So, at least in the special case we have been discussing, MART consists of projecting, in the KL sense, onto each of the hyperplanes in succession.

10.3.2 Weighted KL Projections

For the more general case in which the entries A_ij are arbitrary nonnegative numbers, we cannot directly solve for z_j in Equation (10.6). There is an


alternative, though. Instead of minimizing KL(x, z), subject to (Az)_i = b_i, we minimize the weighted KL distance

Σ_{j=1}^J A_ij KL(x_j, z_j),

subject to the same constraint on z. We shall denote the optimal z by Q_i x. Again using a Lagrange multiplier approach, we find that

0 = A_ij(−x_j/z_j + 1) − A_ij λ_i,

for some constant λ_i. Multiplying by z_j, we have

A_ij z_j − A_ij x_j = A_ij z_j λ_i. (10.7)

Summing over the index j, we get

b_i − (Ax)_i = b_i λ_i,

from which it follows that

1 − λ_i = (Ax)_i/b_i.

Substituting for λ_i in equation (10.7), we obtain

z_j = (Q_i x)_j = x_j b_i/(Ax)_i, (10.8)

for all j for which A_ij ≠ 0.

projection of xk onto the hyperplane H i; that is,

xk+1j

= (Qix

k)j ,

except for those j for whichAijmi

= 1. What is true is that the MART stepinvolves relaxation. Writing

xk+1j = (xkj )1−m−1i

Aij

xkjbi

(Axk)i

m−1i

Aij,

we see that xk+1j is a weighted geometric mean of  xkj and (Qixk)j .

10.4 Proof of Convergence for MART

We assume throughout this proof that x is a nonnegative solution of Ax = b. For i = 1, 2,...,I, let

G_i(x, z) = KL(x, z) + m_i^{−1} KL((Ax)_i, b_i) − m_i^{−1} KL((Ax)_i, (Az)_i).


Exercise 10.2 Use Equation (10.5) to prove that G_i(x, z) ≥ 0 for all x and z.

Exercise 10.3 Show that G_i(x, z), viewed as a function of z, is minimized by z = x, by showing that

G_i(x, z) = G_i(x, x) + KL(x, z) − m_i^{−1} KL((Ax)_i, (Az)_i). (10.9)

Exercise 10.4 Show that G_i(x, z), viewed as a function of x, is minimized by x = z′, where

z′_j = z_j (b_i/(Az)_i)^{m_i^{−1} A_ij},

by showing that

G_i(x, z) = G_i(z′, z) + KL(x, z′). (10.10)

We note that x^{k+1} = (x^k)′.

Now we calculate G_i(x, x^k) in two ways, using, first, the definition, and, second, Equation (10.10). From the definition, we have

G_i(x, x^k) = KL(x, x^k) − m_i^{−1} KL(b_i, (Ax^k)_i).

From Equation (10.10), we have

G_i(x, x^k) = G_i(x^{k+1}, x^k) + KL(x, x^{k+1}).

Therefore,

KL(x, x^k) − KL(x, x^{k+1}) = G_i(x^{k+1}, x^k) + m_i^{−1} KL(b_i, (Ax^k)_i). (10.11)

From Equation (10.11) we can conclude several things:

1) the sequence {KL(x, xk)} is decreasing;

2) the sequence {xk

} is bounded, and therefore has a cluster point, x∗; and3) the sequences {Gi(xk+1, xk)} and {m−1i KL(bi, (Axk)i)} converge de-

creasingly to zero, and so bi = (Ax∗)i for all i.Since b = Ax∗, we can use x∗ in place of the arbitrary solution x to

conclude that the sequence {KL(x∗, xk)} is decreasing. But, a subsequenceconverges to zero, so the entire sequence must converge to zero, and there-fore {xk} converges to x∗. Finally, since the right side of Equation (10.11) isindependent of which solution x we have used, so is the left side. Summingover k on the left side, we find that

KL(x, x0) − KL(x, x∗)

is independent of which x we use. We can conclude then that minimizingKL(x, x0) over all solutions x has the same answer as minimizing KL(x, x∗)

over all such x; but the solution to the latter problem is obviously x = x∗.This concludes the proof.


10.5 Comments on the Rate of Convergence of MART

We can see from Equation (10.11),

$$KL(x, x^k) - KL(x, x^{k+1}) = G_i(x^{k+1}, x^k) + m_i^{-1} KL(b_i, (Ax^k)_i),$$

that the decrease in distance to a solution that occurs with each step of MART depends on $m_i^{-1}$ and on $KL(b_i, (Ax^k)_i)$; the latter measures the extent to which the current vector $x^k$ fails to solve the current equation. We see, then, that it is reasonable to select $m_i$ as we have done, namely, as the smallest positive number $c_i$ for which $A_{ij}/c_i \leq 1$ for all $j$. We also see that it is helpful if the equations are ordered in such a way that $KL(b_i, (Ax^k)_i)$ is fairly large, for each $k$. It is not usually necessary to determine an optimal ordering of the equations; the important thing is to avoid ordering the equations so that successive hyperplanes have nearly parallel normal vectors.


Chapter 11

The Simultaneous MART (SMART)

There is a simultaneous version of MART, called the SMART [42, 54, 107]. As with MART, the SMART applies only to nonnegative systems. Unlike MART, SMART uses all equations in each step of the iteration.

11.1 The SMART Iteration

It begins with a positive vector $x^0$; having calculated $x^k$, we calculate $x^{k+1}$ using

$$\log x^{k+1}_j = \log x^k_j + s_j^{-1} \sum_{i=1}^I A_{ij} \log\frac{b_i}{(Ax^k)_i}, \tag{11.1}$$

where $s_j = \sum_{i=1}^I A_{ij} > 0$.

The following theorem describes what we know concerning the SMART.

Theorem 11.1 In the consistent case the SMART converges to the unique nonnegative solution of $b = Ax$ for which the distance $\sum_{j=1}^J s_j KL(x_j, x^0_j)$ is minimized. In the inconsistent case it converges to the unique nonnegative minimizer of the distance $KL(Ax, b)$ for which $\sum_{j=1}^J s_j KL(x_j, x^0_j)$ is minimized; if $A$ and every matrix derived from $A$ by deleting columns has full rank, then there is a unique nonnegative minimizer of $KL(Ax, b)$, and at most $I - 1$ of its entries are nonzero.

When there are nonnegative solutions of $Ax = b$, both MART and SMART converge to the nonnegative solution minimizing the Kullback-Leibler distance $KL(x, x^0)$; if $x^0$ is the vector whose entries are all one,


then the solution minimizes the Shannon entropy, $SE(x)$, given by

$$SE(x) = \sum_{j=1}^J x_j \log x_j - x_j. \tag{11.2}$$

One advantage that SMART has over MART is that, if the nonnegative system $Ax = b$ has no nonnegative solutions, the SMART converges to the nonnegative minimizer of the function $KL(Ax, b)$ for which $KL(x, x^0)$ is minimized. One disadvantage of SMART, compared to MART, is that it is slow.

11.2 The SMART as a Generalized Projection Method

As we saw previously, the MART algorithm can be viewed as a sequential, relaxed generalized projection method that involves the weighted KL projections $Q_i$. In this section we show that the SMART iteration can be viewed in this way also.

Recall that, for any nonnegative vector $x$, the nonnegative vector $z = Q_i x$ given by

$$z_j = (Q_i x)_j = x_j \frac{b_i}{(Ax)_i}$$

minimizes the weighted KL distance

$$\sum_{j=1}^J A_{ij} KL(x_j, z_j),$$

over all nonnegative $z$ with $(Az)_i = b_i$. Given $x^k$, we take as $x^{k+1}$ the vector whose entries $x^{k+1}_j$ are weighted geometric means of the $(Q_i x^k)_j$; that is,

$$\log x^{k+1}_j = \sum_{i=1}^I s_j^{-1} A_{ij} \log (Q_i x^k)_j,$$

with $s_j = \sum_{i=1}^I A_{ij} > 0$. We then have

$$x^{k+1}_j = x^k_j \exp\Big(\sum_{i=1}^I s_j^{-1} A_{ij} \log\frac{b_i}{(Ax^k)_i}\Big),$$

or

$$x^{k+1}_j = x^k_j \prod_{i=1}^I \Big(\frac{b_i}{(Ax^k)_i}\Big)^{s_j^{-1} A_{ij}}.$$

This is the SMART iterative step.
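As a concrete illustration, here is a minimal NumPy sketch (ours, not from the text; the small system is purely illustrative) of the simultaneous step just derived.

```python
import numpy as np

def smart_step(x, A, b, s):
    # SMART step: x_j <- x_j * prod_i (b_i/(Ax)_i)^(A_ij/s_j),
    # a weighted geometric mean of the projections Q_i x.
    return x * np.exp((A.T @ np.log(b / (A @ x))) / s)

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0]])
b = A @ np.array([1.0, 2.0, 3.0])        # consistent right-hand side
s = A.sum(axis=0)                        # column sums s_j
x = np.ones(A.shape[1])                  # positive starting vector x^0
for _ in range(2000):
    x = smart_step(x, A, b, s)
print(np.max(np.abs(A @ x - b)))         # residual; should shrink toward zero
```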


11.3 Proof of Convergence of the SMART

For the consistent case, in which there are nonnegative solutions of $Ax = b$, the proof of convergence of SMART is almost the same as that for MART given previously. To simplify the notation, we shall assume that we have normalized the problem so that the sum of the entries in each column of $A$ is one. That means we replace each $A_{ij}$ with $s_j^{-1} A_{ij}$ and each $x_j$ with $s_j x_j$. Instead of $G_i(x, z)$, use

$$G(x, z) = KL(x, z) - KL(Ax, Az) + KL(Ax, b).$$

It follows from our assumption about normalization and Equation (10.5) that

$$KL(x, z) - KL(Ax, Az) \geq 0,$$

so $G(x, z) \geq 0$ for all nonnegative $x$ and $z$. Notice that

$$G(x, x) = KL(Ax, b), \tag{11.3}$$

so that

$$G(x, z) = G(x, x) + KL(x, z) - KL(Ax, Az),$$

and $G(x, z)$ is minimized, as a function of $z$, by the choice $z = x$. Minimizing $G(x, z)$ with respect to $x$, for fixed $z$, as we did for MART, we find that

$$G(x, z) = G(z', z) + KL(x, z'), \tag{11.4}$$

for $z'$ given by

$$z'_j = z_j \prod_{i=1}^I \Big(\frac{b_i}{(Az)_i}\Big)^{A_{ij}}.$$

Notice that the SMART iteration, in the normalized case, is

$$x^{k+1} = (x^k)'.$$

We complete the convergence proof through several exercises. In completing these exercises, it will be helpful to study the related results used in the convergence proof of MART.

Exercise 11.1 Show that the sequence $\{KL(Ax^k, b)\}$ is decreasing and the sequence $\{KL(x^k, x^{k+1})\}$ converges to zero. Hint: use Equations (11.3) and (11.4).

Exercise 11.2 Show that the sequence $\{x^k\}$ is bounded, by showing that

$$\sum_{j=1}^J x^k_j \leq \sum_{i=1}^I b_i.$$


Exercise 11.3 From the previous exercise, we know that the sequence $\{x^k\}$ has cluster points; let $x^*$ be one of them. Show that $(x^*)' = x^*$. Hint: use the fact that $\{KL(x^k, x^{k+1})\}$ converges to zero.

Exercise 11.4 Let $\hat{x} \geq 0$ minimize $KL(Ax, b)$ over all nonnegative vectors $x$. Show that $(\hat{x})' = \hat{x}$.

Exercise 11.5 Show that, for the SMART sequence $\{x^k\}$ with cluster point $x^*$ and $\hat{x}$ as defined previously, we have

$$KL(\hat{x}, x^k) - KL(\hat{x}, x^{k+1}) = KL(Ax^{k+1}, b) - KL(A\hat{x}, b) + KL(A\hat{x}, Ax^k) + KL(x^{k+1}, x^k) - KL(Ax^{k+1}, Ax^k), \tag{11.5}$$

and so $KL(A\hat{x}, Ax^*) = 0$, the sequence $\{KL(\hat{x}, x^k)\}$ is decreasing, and $KL(\hat{x}, x^*) < +\infty$.

Exercise 11.6 Show that, for any cluster point $x^*$ of the sequence $\{x^k\}$, we have

$$KL(A\hat{x}, b) = KL(Ax^*, b),$$

so that $x^*$ is a nonnegative minimizer of $KL(Ax, b)$. Consequently, the sequence $\{KL(x^*, x^k)\}$ converges to zero, the sequence $\{x^k\}$ converges to $x^*$, and

$$KL(\hat{x}, x^0) \geq KL(x^*, x^0).$$

11.4 Remarks on the Rate of Convergence of the SMART

In the consistent case, the progress we make toward a solution, using the SMART, is described by Equation (11.5), which now says

$$KL(x, x^k) - KL(x, x^{k+1}) = KL(Ax^{k+1}, b) + KL(b, Ax^k) + KL(x^{k+1}, x^k) - KL(Ax^{k+1}, Ax^k).$$

It follows that

$$KL(x, x^k) - KL(x, x^{k+1}) \geq KL(b, Ax^k).$$

While this is not an equality, it suggests that the improvement we make with each step is on the order of $KL(Ax, Ax^k)$. In the MART case, the improvement we make with each step is

$$KL(x, x^k) - KL(x, x^{k+1}) \geq m_i^{-1} KL(b_i, (Ax^k)_i).$$


Since we are assuming that the columns of $A$ sum to one, the individual entries will be on the order of $1/I$, if all the entries are roughly the same size, so that $m_i$ is then on the order of $1/I$. This indicates that the MART makes about as much progress toward a solution in one step (which means using a single equation) as SMART makes in one step (which means using all the equations). Said another way, the progress made in one pass through all the data using MART is about $I$ times better than in one iteration of SMART, and yet involves about the same amount of calculation. Of course, this is a rough estimate, but it does correspond to what we typically observe in practice. If, however, the matrix $A$ is sparse and has, say, only about $\sqrt{I}$ non-zero entries per column, then each entry is roughly $1/\sqrt{I}$, and $m_i^{-1}$ is on the order of $\sqrt{I}$. In such cases, the progress made in one pass through all the data using MART is about $\sqrt{I}$ times better than in one iteration of SMART, and yet involves about the same amount of calculation.

11.5 Block-Iterative SMART

As we just argued, there is good empirical, as well as theoretical, justification for the claim that MART converges, in the consistent case, significantly faster than SMART. On the other hand, the SMART can be implemented in parallel, which will accelerate the computation time. Because the MART uses only a single equation at each step, it does not take advantage of the computer architecture. A compromise between being purely sequential and being purely simultaneous might provide the best solution. Such a method is a block-iterative method.

Block-iterative methods involve a partition of the index set $\{i = 1, \ldots, I\}$ into nonempty subsets $B_n$, $n = 1, 2, \ldots, N$. For $k = 0, 1, 2, \ldots$, and $n = n(k) = k(\mathrm{mod}\ N) + 1$, only the equations corresponding to $i$ in the set $B_n$ are used to calculate $x^{k+1}$ from $x^k$. The ART and MART are extreme examples of block-iterative algorithms, in which $N = I$ and $B_n = B_i = \{i\}$, for each $i$.

The SMART algorithm involves a summation over $i = 1, \ldots, I$ at each step. Block-iterative SMART algorithms replace this sum with a sum only over those $i$ in the current block.

11.5.1 The Rescaled Block-Iterative SMART

Both the MART and SMART involve weighted geometric means of the generalized projections $Q_i$; MART involves relaxation, as well, while SMART does not. The block-iterative SMART algorithms can also be written in terms of such relaxed weighted geometric means. The rescaled block-iterative SMART (RBI-SMART) also uses a particular choice of a parameter designed to accelerate the convergence in the consistent case.


The vector $x^{k+1}$ determined by the RBI-SMART is the following:

$$x^{k+1}_j = (x^k_j)^{\,1 - m_n^{-1} s_j^{-1} s_{nj}} \prod_{i \in B_n} \Big[x^k_j \frac{b_i}{(Ax^k)_i}\Big]^{m_n^{-1} s_j^{-1} A_{ij}},$$

where

$$s_{nj} = \sum_{i \in B_n} A_{ij},$$

and

$$m_n = \max\{s_{nj}\, s_j^{-1} \,|\, j = 1, \ldots, J\}.$$

Consequently, $x^{k+1}_j$ is a weighted geometric mean of $x^k_j$ and the $(Q_i x^k)_j$ for $i$ in the block $B_n$.
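The following NumPy sketch (ours, not from the text; the block handling and names are illustrative) implements one RBI-SMART step for a given block $B_n$.

```python
import numpy as np

def rbi_smart_step(x, A, b, block, s):
    # One RBI-SMART step using only the equations i in `block`.
    An, bn = A[block], b[block]
    snj = An.sum(axis=0)                      # s_nj = sum_{i in B_n} A_ij
    mn = np.max(snj / s)                      # m_n = max_j s_nj / s_j
    log_ratios = np.log(bn / (An @ x))        # log(b_i / (Ax)_i), i in B_n
    # Weighted geometric mean: exponent A_ij / (m_n s_j) on each factor.
    return x * np.exp((An.T @ log_ratios) / (mn * s))

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0],
              [2.0, 1.0, 1.0]])
b = A @ np.array([1.0, 2.0, 3.0])
s = A.sum(axis=0)
blocks = [np.array([0, 1]), np.array([2, 3])]  # a partition of the rows
x = np.ones(A.shape[1])
for k in range(500):
    x = rbi_smart_step(x, A, b, blocks[k % len(blocks)], s)
print(np.max(np.abs(A @ x - b)))               # residual; small in the consistent case
```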

The RBI-SMART converges, in the consistent case, to the same solution as MART and SMART, for all choices of blocks. The proof is similar to that for MART and SMART, and we leave it as an exercise for the reader. There are variants of the RBI-SMART that involve other parameters [32].

As with ART and MART, the RBI-SMART does not converge to a single vector in the inconsistent case. What is always observed is that RBI-SMART exhibits subsequential convergence to a limit cycle. There is no proof of this, however.


Chapter 12

Expectation Maximization Maximum Likelihood (EMML)

For nonnegative systems $Ax = b$ in which the column sums of $A$ and the entries of $b$ are positive, the expectation maximization maximum likelihood (EMML) method produces a nonnegative solution of $Ax = b$, whenever one exists [19, 20, 32, 51, 95, 109, 88, 114, 89]. If not, the EMML converges to a nonnegative approximate solution that minimizes the function $KL(b, Ax)$ [19, 21, 32, 51, 114].

12.1 The EMML Iteration

As we saw previously, the iterative step in the SMART involves a weighted geometric mean of the weighted KL projections $Q_i x^k$: for the SMART we have

$$\log x^{k+1}_j = s_j^{-1} \sum_{i=1}^I A_{ij} \log (Q_i x^k)_j.$$

It would be nice if we could avoid the exponentiation required in the SMART iterative step. This suggests the algorithm in which the entries $x^{k+1}_j$ are weighted arithmetic means of the $(Q_i x^k)_j$; that is, the iterative step should be

$$x^{k+1}_j = s_j^{-1} \sum_{i=1}^I A_{ij} (Q_i x^k)_j,$$


which can be written as

$$x^{k+1}_j = x^k_j\, s_j^{-1} \sum_{i=1}^I A_{ij} \frac{b_i}{(Ax^k)_i}. \tag{12.1}$$

This is the iterative step of the EMML algorithm.

The EMML algorithm was not originally derived from the SMART algorithm, but from a general method for likelihood maximization in statistics, the expectation maximization (EM) approach [56]. The EMML algorithm we study here is the EM method, as it applies to the case in which the data $b_i$ are instances of independent Poisson random variables with mean values $(Ax)_i$; here the entries of $x$ are the parameters to be estimated.
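For comparison with the SMART sketch in the previous chapter, here is a minimal NumPy version of the step (12.1) (ours; the small system is illustrative only).

```python
import numpy as np

def emml_step(x, A, b, s):
    # EMML step (12.1): x_j <- x_j * (1/s_j) * sum_i A_ij * b_i / (Ax)_i,
    # a weighted arithmetic mean of the projections Q_i x.
    return x * (A.T @ (b / (A @ x))) / s

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0]])
b = A @ np.array([1.0, 2.0, 3.0])
s = A.sum(axis=0)
x = np.ones(A.shape[1])
for _ in range(2000):
    x = emml_step(x, A, b, s)
print(np.max(np.abs(A @ x - b)))         # residual; should shrink toward zero
```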

For the EMML algorithm the main results are the following.

Theorem 12.1 In the consistent case the EMML algorithm converges to a nonnegative solution of $Ax = b$. In the inconsistent case it converges to a nonnegative minimizer of the distance $KL(b, Ax)$; if $A$ and every matrix derived from $A$ by deleting columns has full rank, then there is a unique nonnegative minimizer of $KL(b, Ax)$, and at most $I - 1$ of its entries are nonzero.

An open question about the EMML algorithm is the following:

Open Question: How does the EMML limit depend on the starting vector $x^0$? In particular, when there are nonnegative exact solutions of $Ax = b$, which one does the EMML produce, and how does it depend on $x^0$?

12.2 Proof of Convergence of the EMML Algorithm

Let $A$ be an $I$ by $J$ matrix with entries $A_{ij} \geq 0$, such that, for each $j = 1, \ldots, J$, we have $s_j = \sum_{i=1}^I A_{ij} > 0$. Let $b = (b_1, \ldots, b_I)^T$ with $b_i > 0$ for each $i$. We shall assume throughout this section that $s_j = 1$ for each $j$. If this is not the case initially, we replace $x_j$ with $x_j s_j$ and $A_{ij}$ with $A_{ij}/s_j$; the quantities $(Ax)_i$ are unchanged.

For each nonnegative vector $x$ for which $(Ax)_i = \sum_{j=1}^J A_{ij} x_j > 0$, let $r(x) = \{r(x)_{ij}\}$ and $q(x) = \{q(x)_{ij}\}$ be the $I$ by $J$ arrays with entries

$$r(x)_{ij} = x_j A_{ij} \frac{b_i}{(Ax)_i}$$

and

$$q(x)_{ij} = x_j A_{ij}.$$


The KL distance

$$KL(r(x), q(z)) = \sum_{i=1}^I \sum_{j=1}^J KL(r(x)_{ij}, q(z)_{ij})$$

will play an important role in the proof that follows. Note that if there is a nonnegative $x$ with $r(x) = q(x)$, then $b = Ax$.

12.2.1 Some Pythagorean Identities Involving the KL Distance

The EMML iterative algorithm is derived using the principle of alternating minimization, according to which the distance $KL(r(x), q(z))$ is minimized, first with respect to the variable $x$ and then with respect to the variable $z$. Although the KL distance is not Euclidean, and, in particular, not even symmetric, there are analogues of Pythagoras' theorem that play important roles in the convergence proofs.

Exercise 12.1 Establish the following Pythagorean identities:

$$KL(r(x), q(z)) = KL(r(z), q(z)) + KL(r(x), r(z)); \tag{12.2}$$

$$KL(r(x), q(z)) = KL(r(x), q(x')) + KL(x', z), \tag{12.3}$$

for

$$x'_j = x_j \sum_{i=1}^I A_{ij} \frac{b_i}{(Ax)_i}. \tag{12.4}$$

Note that it follows from normalization and Equation (10.5) that $KL(x, z) - KL(Ax, Az) \geq 0$.

Exercise 12.2 Show that, for $\{x^k\}$ given by Equation (12.1), $\{KL(b, Ax^k)\}$ is decreasing and $\{KL(x^{k+1}, x^k)\} \to 0$. Hint: Use $KL(r(x), q(x)) = KL(b, Ax)$, and the Pythagorean identities.

Exercise 12.3 Show that the EMML sequence $\{x^k\}$ is bounded, by showing that

$$\sum_{j=1}^J x^k_j = \sum_{i=1}^I b_i.$$

Exercise 12.4 Show that $(x^*)' = x^*$ for any cluster point $x^*$ of the EMML sequence $\{x^k\}$. Hint: Use the fact that $\{KL(x^{k+1}, x^k)\} \to 0$.


Exercise 12.5 Let $\hat{x}$ minimize $KL(b, Ax)$ over all $x \geq 0$. Then $(\hat{x})' = \hat{x}$. Hint: Apply the Pythagorean identities to $KL(r(\hat{x}), q(\hat{x}))$.

Note that, because of convexity properties of the KL distance, even if the minimizer $\hat{x}$ is not unique, the vector $A\hat{x}$ is unique.

Exercise 12.6 Show that, for the EMML sequence $\{x^k\}$ with cluster point $x^*$ and $\hat{x}$ as defined previously, we have the double inequality

$$KL(\hat{x}, x^k) \geq KL(r(\hat{x}), r(x^k)) \geq KL(\hat{x}, x^{k+1}), \tag{12.5}$$

from which we conclude that the sequence $\{KL(\hat{x}, x^k)\}$ is decreasing and $KL(\hat{x}, x^*) < +\infty$. Hints: For the first inequality, calculate $KL(r(\hat{x}), q(x^k))$ in two ways. For the second one, use $(\hat{x})_j = \sum_{i=1}^I r(\hat{x})_{ij}$ and Exercise 10.1.

Exercise 12.7 For $x^*$ a cluster point of the EMML sequence $\{x^k\}$ we have $KL(b, Ax^*) = KL(b, A\hat{x})$. Therefore, $x^*$ is a nonnegative minimizer of $KL(b, Ax)$. Consequently, the sequence $\{KL(x^*, x^k)\}$ converges to zero, and so $\{x^k\} \to x^*$. Hint: Use the double inequality of Equation (12.5) and $KL(r(\hat{x}), q(x^*))$.

Both the EMML and the SMART algorithms are slow to converge. For that reason attention has shifted, in recent years, to block-iterative versions of these algorithms.

12.3 Block-Iterative EMML Iteration

Block-iterative versions of ART and SMART have been known for decades. In contrast, the first block-iterative variant of the EMML algorithm, the ordered-subset EM (OSEM) [81], was discovered in 1994. The main idea in the OSEM is simply to replace all the sums over all the indices $i$ with sums only over those $i$ in the current block. This is not quite right; it ignores the relaxation that we have seen in the MART and RBI-SMART. The OSEM was shown to converge, in the consistent case, only when the matrix $A$ satisfies a quite restrictive condition, subset balance. This means that the sums

$$s_{nj} = \sum_{i \in B_n} A_{ij}$$

depend only on $j$, and not on $n$.

The rescaled block-iterative EMML (RBI-EMML) corrects this omission. It has the iterative step

$$x^{k+1}_j = (1 - m_n^{-1} s_j^{-1} s_{nj})\, x^k_j + m_n^{-1} s_j^{-1} x^k_j \sum_{i \in B_n} A_{ij} \frac{b_i}{(Ax^k)_i}. \tag{12.6}$$


The RBI-EMML converges, in the consistent case, for any choice of blocks.
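A minimal NumPy sketch of the RBI-EMML step (12.6) follows (our own illustration, not from the text; here $m_n$ is computed as in the RBI-SMART section, $m_n = \max_j s_{nj}/s_j$).

```python
import numpy as np

def rbi_emml_step(x, A, b, block, s):
    # RBI-EMML step (12.6), using only the equations i in `block`.
    An, bn = A[block], b[block]
    snj = An.sum(axis=0)                       # s_nj
    mn = np.max(snj / s)                       # m_n = max_j s_nj / s_j
    back = An.T @ (bn / (An @ x))              # sum_{i in B_n} A_ij b_i/(Ax)_i
    return (1.0 - snj / (mn * s)) * x + x * back / (mn * s)

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0],
              [2.0, 1.0, 1.0]])
b = A @ np.array([1.0, 2.0, 3.0])
s = A.sum(axis=0)
blocks = [np.array([0, 1]), np.array([2, 3])]
x = np.ones(A.shape[1])
for k in range(500):
    x = rbi_emml_step(x, A, b, blocks[k % len(blocks)], s)
print(np.max(np.abs(A @ x - b)))               # residual; small in the consistent case
```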

Open Question: When there are multiple nonnegative solutions of $Ax = b$, the RBI-EMML solution will depend on the starting vector $x^0$, but precisely how is unknown. Simulations seem to show that the solution may also vary with the choice of blocks, as well as with their ordering. How?

12.3.1 A Row-Action Variant of EMML

The MART is the row-action, or sequential, variant of RBI-SMART. There is also a row-action variant of EMML, obtained by selecting $N = I$ and taking $B_n = B_i = \{i\}$ as the blocks. This row-action variant has been called the EM-MART [32]. The EM-MART has the iterative step

$$x^{k+1}_j = (1 - m_i^{-1} s_j^{-1} A_{ij})\, x^k_j + m_i^{-1} s_j^{-1} x^k_j A_{ij} \frac{b_i}{(Ax^k)_i},$$

for $m_i = \max_j\{A_{ij}\, s_j^{-1}\}$. Note that another version of the EM-MART has the iterative step

$$x^{k+1}_j = (1 - m_i^{-1} A_{ij})\, x^k_j + m_i^{-1} x^k_j A_{ij} \frac{b_i}{(Ax^k)_i},$$

for $m_i = \max_j\{A_{ij}\}$. The second convergent version looks more like MART, while the first follows directly from the RBI-EMML formula.
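For completeness, here is a sketch (ours, not from the text) of the first EM-MART step above; it can be dropped into the same cyclic driver loop used for the MART sketch in Chapter 10.

```python
import numpy as np

def em_mart_step(x, A, b, i, s):
    # EM-MART step for equation i: convex combination of x_j and
    # x_j * b_i / (Ax)_i, with weights A_ij / (m_i s_j), m_i = max_j A_ij / s_j.
    # Assumes row i of A has at least one positive entry.
    t = (A[i] / s) / np.max(A[i] / s)      # relaxation weights, all in [0, 1]
    return (1.0 - t) * x + t * x * (b[i] / (A[i] @ x))
```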


Chapter 13

Rescaled Block-Iterative (RBI) Methods

Image reconstruction problems in tomography are often formulated as statistical likelihood maximization problems in which the pixel values of the desired image play the role of parameters. Iterative algorithms based on cross-entropy minimization, such as the expectation maximization maximum likelihood (EMML) method and the simultaneous multiplicative algebraic reconstruction technique (SMART), can be used to solve such problems. Because the EMML and SMART are slow to converge for the large amounts of data typical in imaging problems, acceleration of the algorithms using blocks of data or ordered subsets has become popular. There are a number of different ways to formulate these block-iterative versions of EMML and SMART, involving the choice of certain normalization and regularization parameters. These methods are not faster merely because they are block-iterative; the correct choice of the parameters is crucial. The purpose of this chapter is to discuss these different formulations in detail sufficient to reveal the precise roles played by the parameters and to guide the user in choosing them.

13.1 Block-Iterative Methods

Methods based on cross-entropy, such as the multiplicative ART (MART), its simultaneous version, SMART, the expectation maximization maximum likelihood method (EMML), and all block-iterative versions of these algorithms apply to nonnegative systems that we denote by $Ax = b$, where $b$ is a vector of positive entries, $A$ is a matrix with entries $A_{ij} \geq 0$ such that for each $j$ the sum $s_j = \sum_{i=1}^I A_{ij}$ is positive, and we seek a solution $x$ with nonnegative entries. If no nonnegative $x$ satisfies $b = Ax$, we say the system


is inconsistent.

Simultaneous iterative algorithms employ all of the equations at each step of the iteration; block-iterative methods do not. For the latter methods we assume that the index set $\{i = 1, \ldots, I\}$ is the (not necessarily disjoint) union of the $N$ sets or blocks $B_n$, $n = 1, \ldots, N$. We shall require that $s_{nj} = \sum_{i \in B_n} A_{ij} > 0$ for each $n$ and each $j$. Block-iterative methods like ART and MART, for which each block consists of precisely one element, are called row-action or sequential methods.

We begin our discussion with the SMART and the EMML method.

13.2 The SMART and the EMML method

Both the SMART and the EMML method provide a solution of $b = Ax$ when such solutions exist, and (distinct) approximate solutions in the inconsistent case. Both begin with an arbitrary positive vector $x^0$. Having found $x^k$, the iterative step for the SMART is

SMART:

$$x^{k+1}_j = x^k_j \exp\Big(s_j^{-1} \sum_{i=1}^I A_{ij} \log\frac{b_i}{(Ax^k)_i}\Big), \tag{13.1}$$

while that for the EMML method is

EMML:

$$x^{k+1}_j = x^k_j\, s_j^{-1} \sum_{i=1}^I A_{ij} \frac{b_i}{(Ax^k)_i}. \tag{13.2}$$

The main result concerning the SMART is given by the following theorem.

Theorem 13.1 In the consistent case the SMART converges to the unique nonnegative solution of $b = Ax$ for which the distance $\sum_{j=1}^J s_j KL(x_j, x^0_j)$ is minimized. In the inconsistent case it converges to the unique nonnegative minimizer of the distance $KL(Ax, b)$ for which $\sum_{j=1}^J s_j KL(x_j, x^0_j)$ is minimized; if $A$ and every matrix derived from $A$ by deleting columns has full rank, then there is a unique nonnegative minimizer of $KL(Ax, b)$, and at most $I - 1$ of its entries are nonzero.

For the EMML method the main results are the following.


Theorem 13.2 In the consistent case the EMML algorithm converges to a nonnegative solution of $b = Ax$. In the inconsistent case it converges to a nonnegative minimizer of the distance $KL(b, Ax)$; if $A$ and every matrix derived from $A$ by deleting columns has full rank, then there is a unique nonnegative minimizer of $KL(b, Ax)$, and at most $I - 1$ of its entries are nonzero.

In the consistent case there may be multiple nonnegative solutions, and the one obtained by the EMML algorithm will depend on the starting vector $x^0$; how it depends on $x^0$ is an open question.

These theorems are special cases of more general results on block-iterative methods that we shall prove later in this chapter.

Both the EMML and SMART are related to likelihood maximization. Minimizing the function $KL(b, Ax)$ is equivalent to maximizing the likelihood when the $b_i$ are taken to be measurements of independent Poisson random variables having means $(Ax)_i$. The entries of $x$ are the parameters to be determined. This situation arises in emission tomography. So the EMML is a likelihood maximizer, as its name suggests.

more convoluted. Suppose that sj = 1 for each j. The solution of  b = Axfor which KL(x, x0) is minimized necessarily has the form

xj = x0j exp I i=1

Aijλi

(13.3)

for some vector λ with entries λi. This log linear  form also arises in trans-mission tomography, where it is natural to assume that sj = 1 for each jand λi ≤ 0 for each i. We have the following lemma that helps to connectthe SMART algorithm with the transmission tomography problem:

Lemma 13.1 Minimizing KL(d, x) over  x as in Equation (13.3) is equiv-

alent to minimizing  KL(x, x0

), subject to Ax = P d.The solution to the latter problem can be obtained using the SMART.

With x+ =J 

j=1 xj the vector A with entries pj = xj/x+ is a probabil-

ity vector. Let d = (d1,...,dJ )T  be a vector whose entries are nonnegative

integers, with K  =J 

j=1 dj . Suppose that, for each j, pj is the probabilityof index j and dj is the number of times index j was chosen in K  trials.The likelihood function of the parameters λi is

L(λ) =J j=1

 pdjj (13.4)

so that the log-likelihood function is

LL(λ) =J j=1

dj log pj . (13.5)


Since $p$ is a probability vector, maximizing $L(\lambda)$ is equivalent to minimizing $KL(d, p)$ with respect to $\lambda$, which, according to the lemma above, can be done using SMART. In fact, since all of the block-iterative versions of SMART have the same limit whenever they have the same starting vector, any of these methods can be used to solve this maximum likelihood problem. In the case of transmission tomography the $\lambda_i$ must be non-positive, so if SMART is to be used, some modification is needed to obtain such a solution.

Those who have used the SMART or the EMML on sizable problems have certainly noticed that they are both slow to converge. An important issue, therefore, is how to accelerate convergence. One popular method is through the use of block-iterative (or ordered subset) methods.

13.3 Ordered-Subset Versions

To illustrate block-iterative methods and to motivate our subsequent discussion we consider now the ordered subset EM algorithm (OSEM), which is a popular technique in some areas of medical imaging, as well as an analogous version of SMART, which we shall call here the OSSMART. The OSEM is now used quite frequently in tomographic image reconstruction, where it is acknowledged to produce usable images significantly faster than EMML. From a theoretical perspective both OSEM and OSSMART are incorrect. How to correct them is the subject of much that follows here.

The idea behind the OSEM (OSSMART) is simple: the iteration looks very much like the EMML (SMART), but at each step of the iteration the summations are taken only over the current block. The blocks are processed cyclically.

The OSEM iteration is the following: for $k = 0, 1, \ldots$ and $n = k(\mathrm{mod}\ N) + 1$, having found $x^k$, let

OSEM:

$$x^{k+1}_j = x^k_j\, s_{nj}^{-1} \sum_{i \in B_n} A_{ij} \frac{b_i}{(Ax^k)_i}. \tag{13.6}$$

The OSSMART has the following iterative step:

OSSMART:

$$x^{k+1}_j = x^k_j \exp\Big(s_{nj}^{-1} \sum_{i \in B_n} A_{ij} \log\frac{b_i}{(Ax^k)_i}\Big). \tag{13.7}$$

In general we do not expect block-iterative algorithms to converge in the inconsistent case, but to exhibit subsequential convergence to a limit cycle,


as we shall discuss later. We do, however, want them to converge to a solution in the consistent case; the OSEM and OSSMART fail to do this except when the matrix $A$ and the set of blocks $\{B_n, n = 1, \ldots, N\}$ satisfy the condition known as subset balance, which means that the sums $s_{nj}$ depend only on $j$ and not on $n$. While this may be approximately valid in some special cases, it is overly restrictive, eliminating, for example, almost every set of blocks whose cardinalities are not all the same. When the OSEM does well in practice in medical imaging it is probably because $N$ is not large and only a few iterations are carried out.

The experience with the OSEM was encouraging, however, and strongly suggested that an equally fast, but mathematically correct, block-iterative version of EMML was to be had; this is the rescaled block-iterative EMML (RBI-EMML). Both RBI-EMML and an analogous corrected version of OSSMART, the RBI-SMART, provide fast convergence to a solution in the consistent case, for any choice of blocks.

13.4 The RBI-SMART

We turn next to the block-iterative versions of the SMART, which we shall denote BI-SMART. These methods were known prior to the discovery of RBI-EMML and played an important role in that discovery; the importance of rescaling for acceleration was apparently not appreciated, however. The SMART was discovered in 1972, independently, by Darroch and Ratcliff, working in statistics [54], and by Schmidlin [107] in medical imaging. Block-iterative versions of SMART are also treated in [54], but they also insist on subset balance. The inconsistent case was not considered.

We start by considering a formulation of BI-SMART that is general enough to include all of the variants we wish to discuss. As we shall see, this formulation is too general and will need to be restricted in certain ways to obtain convergence. Let the iterative step be

$$x^{k+1}_j = x^k_j \exp\Big(\beta_{nj} \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Ax^k)_i}\Big), \tag{13.8}$$

for $j = 1, 2, \ldots, J$, $n = k(\mathrm{mod}\ N) + 1$, and $\beta_{nj}$ and $\alpha_{ni}$ positive. As we shall see, our convergence proof will require that $\beta_{nj}$ be separable, that is,

$$\beta_{nj} = \gamma_j \delta_n$$

for each $j$ and $n$, and that

$$\gamma_j \delta_n \sigma_{nj} \leq 1, \tag{13.9}$$

for $\sigma_{nj} = \sum_{i \in B_n} \alpha_{ni} A_{ij}$. With these conditions satisfied we have the following result.


Theorem 13.3 Let $x$ be a nonnegative solution of $b = Ax$. For any positive vector $x^0$ and any collection of blocks $\{B_n, n = 1, \ldots, N\}$, the sequence $\{x^k\}$ given by Equation (13.8) converges to the unique solution of $b = Ax$ for which the weighted cross-entropy $\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^0_j)$ is minimized.

The inequality in the following lemma is the basis for the convergence proof.

Lemma 13.2 Let $b = Ax$ for some nonnegative $x$. Then, for $\{x^k\}$ as in Equation (13.8), we have

$$\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^k_j) - \sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^{k+1}_j) \geq \delta_n \sum_{i \in B_n} \alpha_{ni} KL(b_i, (Ax^k)_i). \tag{13.10}$$

Proof: First note that

$$x^{k+1}_j = x^k_j \exp\Big(\gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Ax^k)_i}\Big), \tag{13.11}$$

and

$$\exp\Big(\gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Ax^k)_i}\Big)$$

can be written as

$$\exp\Big((1 - \gamma_j \delta_n \sigma_{nj}) \log 1 + \gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Ax^k)_i}\Big),$$

which, by the convexity of the exponential function, is not greater than

$$(1 - \gamma_j \delta_n \sigma_{nj}) + \gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \frac{b_i}{(Ax^k)_i}.$$

It follows that

$$\sum_{j=1}^J \gamma_j^{-1} (x^k_j - x^{k+1}_j) \geq \delta_n \sum_{i \in B_n} \alpha_{ni} ((Ax^k)_i - b_i).$$

We also have

$$\log(x^{k+1}_j / x^k_j) = \gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Ax^k)_i}.$$


Therefore

$$\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^k_j) - \sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^{k+1}_j)$$

$$= \sum_{j=1}^J \gamma_j^{-1} \big(x_j \log(x^{k+1}_j / x^k_j) + x^k_j - x^{k+1}_j\big)$$

$$= \sum_{j=1}^J x_j\, \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Ax^k)_i} + \sum_{j=1}^J \gamma_j^{-1} (x^k_j - x^{k+1}_j)$$

$$= \delta_n \sum_{i \in B_n} \alpha_{ni} \Big(\sum_{j=1}^J x_j A_{ij}\Big) \log\frac{b_i}{(Ax^k)_i} + \sum_{j=1}^J \gamma_j^{-1} (x^k_j - x^{k+1}_j)$$

$$\geq \delta_n \sum_{i \in B_n} \alpha_{ni} \Big(b_i \log\frac{b_i}{(Ax^k)_i} + (Ax^k)_i - b_i\Big) = \delta_n \sum_{i \in B_n} \alpha_{ni} KL(b_i, (Ax^k)_i).$$

This completes the proof of the lemma.

From the inequality (13.10) we conclude that the sequence $\{\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^k_j)\}$ is decreasing, that $\{x^k\}$ is therefore bounded, and that the sequence $\{\sum_{i \in B_n} \alpha_{ni} KL(b_i, (Ax^k)_i)\}$ converges to zero. Let $x^*$ be any cluster point of the sequence $\{x^k\}$. Then it is not difficult to show that $b = Ax^*$. Replacing $x$ with $x^*$, we have that the sequence $\{\sum_{j=1}^J \gamma_j^{-1} KL(x^*_j, x^k_j)\}$ is decreasing; since a subsequence converges to zero, so does the whole sequence. Therefore $x^*$ is the limit of the sequence $\{x^k\}$. This proves that the algorithm produces a solution of $b = Ax$. To conclude further that the solution is the one for which the quantity $\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^0_j)$ is minimized requires further work, to replace the inequality (13.10) with an equation in which the right side is independent of the particular solution $x$ chosen; see the final section of this chapter for the details.

We see from the theorem that how we select the $\gamma_j$ is determined by how we wish to weight the terms in the sum $\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^0_j)$. In some cases we want to minimize the cross-entropy $KL(x, x^0)$ subject to $b = Ax$; in this case we would select $\gamma_j = 1$. In other cases we may have some prior knowledge as to the relative sizes of the $x_j$ and wish to emphasize the smaller values more; then we may choose $\gamma_j$ proportional to


our prior estimate of the size of $x_j$. Having selected the $\gamma_j$, we see from the inequality (13.10) that convergence will be accelerated if we select $\delta_n$ as large as permitted by the condition $\gamma_j \delta_n \sigma_{nj} \leq 1$. This suggests that we take

$$\delta_n = 1 / \max\{\sigma_{nj} \gamma_j,\ j = 1, \ldots, J\}. \tag{13.12}$$

The rescaled BI-SMART (RBI-SMART) as presented in [21, 23, 24] uses this choice, but with $\alpha_{ni} = 1$ for each $n$ and $i$. Let's look now at some of the other choices for these parameters that have been considered in the literature.

First, we notice that the OSSMART does not generally satisfy the requirements, since in (13.7) the choices are $\alpha_{ni} = 1$ and $\beta_{nj} = s_{nj}^{-1}$; the only times this is acceptable is if the $s_{nj}$ are separable; that is, $s_{nj} = r_j t_n$ for some $r_j$ and $t_n$. This is slightly more general than the condition of subset balance and is sufficient for convergence of OSSMART.

In [42] Censor and Segman make the choices $\beta_{nj} = 1$ and $\alpha_{ni} > 0$ such that $\sigma_{nj} \leq 1$ for all $n$ and $j$. In those cases in which $\sigma_{nj}$ is much less than 1 for each $n$ and $j$, their iterative scheme is probably excessively relaxed; it is hard to see how one might improve the rate of convergence by altering only the weights $\alpha_{ni}$, however. Limiting the choice to $\gamma_j \delta_n = 1$ reduces our ability to accelerate this algorithm.

The original SMART in Equation (13.1) uses $N = 1$, $\gamma_j = s_j^{-1}$, and $\alpha_{ni} = \alpha_i = 1$. Clearly the inequality (13.9) is satisfied; in fact it becomes an equality now.

For the row-action version of SMART, the multiplicative ART (MART), due to Gordon, Bender and Herman [70], we take $N = I$ and $B_n = B_i = \{i\}$ for $i = 1, \ldots, I$. The MART begins with a strictly positive vector $x^0$ and has the iterative step

The MART:

$$x^{k+1}_j = x^k_j \Big(\frac{b_i}{(Ax^k)_i}\Big)^{m_i^{-1} A_{ij}}, \tag{13.13}$$

for $j = 1, 2, \ldots, J$, $i = k(\mathrm{mod}\ I) + 1$, and $m_i > 0$ chosen so that $m_i^{-1} A_{ij} \leq 1$ for all $j$. The smaller $m_i$ is, the faster the convergence, so a good choice is $m_i = \max\{A_{ij} \,|\, j = 1, \ldots, J\}$. Although this particular choice for $m_i$ is not explicitly mentioned in the various discussions of MART I have seen, it was used in implementations of MART from the beginning [78].

Darroch and Ratcliff included a discussion of a block-iterative version of SMART in their 1972 paper [54]. Close inspection of their version reveals that they require that $s_{nj} = \sum_{i \in B_n} A_{ij} = 1$ for all $j$. Since this is unlikely to be the case initially, we might try to rescale the equations or unknowns to obtain this condition. However, unless $s_{nj} = \sum_{i \in B_n} A_{ij}$ depends only


on $j$ and not on $n$, which is the subset balance property used in [81], we cannot redefine the unknowns in a way that is independent of $n$.

The MART fails to converge in the inconsistent case. What is always observed, but for which no proof exists, is that, for each fixed $i = 1, 2, \ldots, I$, as $m \to +\infty$, the MART subsequences $\{x^{mI+i}\}$ converge to separate limit vectors, say $x^{\infty,i}$. This limit cycle $LC = \{x^{\infty,i} \,|\, i = 1, \ldots, I\}$ reduces to a single vector whenever there is a nonnegative solution of $b = Ax$. The greater the minimum value of $KL(Ax, b)$, the more distinct from one another the vectors of the limit cycle are. An analogous result is observed for BI-SMART.

13.5 The RBI-EMML

As we did with SMART, we consider now a formulation of BI-EMML that is general enough to include all of the variants we wish to discuss. Once again, the formulation is too general and will need to be restricted in certain ways to obtain convergence. Let the iterative step be

$$x^{k+1}_j = x^k_j (1 - \beta_{nj} \sigma_{nj}) + x^k_j \beta_{nj} \sum_{i \in B_n} \alpha_{ni} A_{ij} \frac{b_i}{(Ax^k)_i}, \tag{13.14}$$

for $j = 1, 2, \ldots, J$, $n = k(\mathrm{mod}\ N) + 1$, and $\beta_{nj}$ and $\alpha_{ni}$ positive. As in the case of BI-SMART, our convergence proof will require that $\beta_{nj}$ be separable, that is,

$$\beta_{nj} = \gamma_j \delta_n$$

for each $j$ and $n$, and that the inequality (13.9) hold. With these conditions satisfied we have the following result.

Theorem 13.4 Let $x$ be a nonnegative solution of $b = Ax$. For any positive vector $x^0$ and any collection of blocks $\{B_n, n = 1, \ldots, N\}$, the sequence $\{x^k\}$ given by Equation (13.14) converges to a nonnegative solution of $b = Ax$.

When there are multiple nonnegative solutions of $b = Ax$, the solution obtained by BI-EMML will depend on the starting point $x^0$, but precisely how it depends on $x^0$ is an open question. Also, in contrast to the case of BI-SMART, the solution can depend on the particular choice of the blocks. The inequality in the following lemma is the basis for the convergence proof.

Lemma 13.3 Let $b = Ax$ for some nonnegative $x$. Then, for $\{x^k\}$ as in Equation (13.14), we have

$$\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^k_j) - \sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^{k+1}_j) \geq \delta_n \sum_{i \in B_n} \alpha_{ni} KL(b_i, (Ax^k)_i). \tag{13.15}$$

Proof: From the iterative step

$$x^{k+1}_j = x^k_j (1 - \gamma_j \delta_n \sigma_{nj}) + x^k_j \gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \frac{b_i}{(Ax^k)_i}$$

we have

$$\log(x^{k+1}_j / x^k_j) = \log\Big((1 - \gamma_j \delta_n \sigma_{nj}) + \gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \frac{b_i}{(Ax^k)_i}\Big).$$

By the concavity of the logarithm we obtain the inequality

$$\log(x^{k+1}_j / x^k_j) \geq (1 - \gamma_j \delta_n \sigma_{nj}) \log 1 + \gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Ax^k)_i},$$

or

$$\log(x^{k+1}_j / x^k_j) \geq \gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Ax^k)_i}.$$

Therefore

$$\sum_{j=1}^J \gamma_j^{-1} x_j \log(x^{k+1}_j / x^k_j) \geq \delta_n \sum_{i \in B_n} \alpha_{ni} \Big(\sum_{j=1}^J x_j A_{ij}\Big) \log\frac{b_i}{(Ax^k)_i}.$$

Note that it is at this step that we used the separability of the $\beta_{nj}$. Also

$$\sum_{j=1}^J \gamma_j^{-1} (x^k_j - x^{k+1}_j) = \delta_n \sum_{i \in B_n} \alpha_{ni} ((Ax^k)_i - b_i).$$

This concludes the proof of the lemma.

From the inequality (13.15) we conclude, as we did in the BI-SMART case, that the sequence $\{\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^k_j)\}$ is decreasing, that $\{x^k\}$ is therefore bounded, and that the sequence $\{\sum_{i \in B_n} \alpha_{ni} KL(b_i, (Ax^k)_i)\}$ converges to zero. Let $x^*$ be any cluster point of the sequence $\{x^k\}$. Then it is not difficult to show that $b = Ax^*$. Replacing $x$ with $x^*$, we have that the sequence $\{\sum_{j=1}^J \gamma_j^{-1} KL(x^*_j, x^k_j)\}$ is decreasing; since a subsequence converges to zero, so does the whole sequence. Therefore $x^*$ is the limit of the sequence $\{x^k\}$. This proves that the algorithm produces a nonnegative solution of $b = Ax$. We are now unable to replace the inequality (13.15) with an equation in which the right side is independent of the particular solution $x$ chosen.


Having selected the $\gamma_j$, we see from the inequality (13.15) that convergence will be accelerated if we select $\delta_n$ as large as permitted by the condition $\gamma_j \delta_n \sigma_{nj} \leq 1$. This suggests that once again we take

$$\delta_n = 1 / \max\{\sigma_{nj} \gamma_j,\ j = 1, \ldots, J\}. \tag{13.16}$$

The rescaled BI-EMML (RBI-EMML) as presented in [21, 23, 24] uses this choice, but with $\alpha_{ni} = 1$ for each $n$ and $i$. Let's look now at some of the other choices for these parameters that have been considered in the literature.

First, we notice that the OSEM does not generally satisfy the requirements, since in (13.6) the choices are $\alpha_{ni} = 1$ and $\beta_{nj} = s_{nj}^{-1}$; the only times this is acceptable is if the $s_{nj}$ are separable; that is, $s_{nj} = r_j t_n$ for some $r_j$ and $t_n$. This is slightly more general than the condition of subset balance and is sufficient for convergence of OSEM.

The original EMML in Equation (13.2) uses $N = 1$, $\gamma_j = s_j^{-1}$, and $\alpha_{ni} = \alpha_i = 1$. Clearly the inequality (13.9) is satisfied; in fact it becomes an equality now.

Notice that the calculations required to perform the BI-SMART are somewhat more complicated than those needed in BI-EMML. Because the MART converges rapidly in most cases, there is considerable interest in the row-action version of EMML. It was clear from the outset that using the OSEM in a row-action mode does not work. We see from the formula for BI-EMML that the proper row-action version of EMML, which we call the EM-MART, has the iterative step

EM-MART:

$$x^{k+1}_j = (1 - \delta_i \gamma_j \alpha_{ii} A_{ij})\, x^k_j + \delta_i \gamma_j \alpha_{ii}\, x^k_j A_{ij} \frac{b_i}{(Ax^k)_i}, \tag{13.17}$$

with

$$\gamma_j \delta_i \alpha_{ii} A_{ij} \leq 1$$

for all $i$ and $j$. The optimal choice would seem to be to take $\delta_i \alpha_{ii}$ as large as possible; that is, to select $\delta_i \alpha_{ii} = 1 / \max\{\gamma_j A_{ij},\ j = 1, \ldots, J\}$. With this choice the EM-MART is called the rescaled EM-MART (REM-MART).

The EM-MART fails to converge in the inconsistent case. What is always observed, but for which no proof exists, is that, for each fixed $i = 1, 2, \ldots, I$, as $m \to +\infty$, the EM-MART subsequences $\{x^{mI+i}\}$ converge to separate limit vectors, say $x^{\infty,i}$. This limit cycle $LC = \{x^{\infty,i} \,|\, i = 1, \ldots, I\}$ reduces to a single vector whenever there is a nonnegative solution of $b = Ax$. The greater the minimum value of $KL(b, Ax)$, the more distinct from one another the vectors of the limit cycle are. An analogous result is observed for BI-EMML.


We must mention a method that closely resembles the REM-MART, the row-action maximum likelihood algorithm (RAMLA), which was discovered independently by Browne and De Pierro [18]. The RAMLA avoids the limit cycle in the inconsistent case by using strong underrelaxation involving a decreasing sequence of relaxation parameters $\lambda_k$. The RAMLA has the following iterative step:

RAMLA:

$$x^{k+1}_j = \Big(1 - \lambda_k \sum_{i \in B_n} A_{ij}\Big) x^k_j + \lambda_k x^k_j \sum_{i \in B_n} A_{ij} \frac{b_i}{(Ax^k)_i}, \tag{13.18}$$

where the positive relaxation parameters $\lambda_k$ are chosen to converge to zero and $\sum_{k=0}^{+\infty} \lambda_k = +\infty$.
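A minimal sketch of RAMLA follows (ours, not from the text); the relaxation schedule $\lambda_k = 1/(k+1)$ is just one choice satisfying the stated conditions, and the block partition is illustrative.

```python
import numpy as np

def ramla_step(x, A, b, block, lam):
    # RAMLA step (13.18) for one block, with relaxation parameter lam.
    An, bn = A[block], b[block]
    snj = An.sum(axis=0)                       # sum_{i in B_n} A_ij
    return (1.0 - lam * snj) * x + lam * x * (An.T @ (bn / (An @ x)))

A = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])                # columns sum to one
b = A @ np.array([1.0, 2.0, 3.0])
blocks = [np.array([0]), np.array([1]), np.array([2])]
x = np.ones(3)
for k in range(3000):
    lam = 1.0 / (k + 1)                        # lambda_k -> 0, sum diverges
    x = ramla_step(x, A, b, blocks[k % 3], lam)
print(np.max(np.abs(A @ x - b)))               # residual; decreases slowly as lambda_k -> 0
```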

13.6 RBI-SMART and Entropy Maximization

As we stated earlier, in the consistent case the sequence $\{x^k\}$ generated by the BI-SMART algorithm and given by Equation (13.11) converges to the unique solution of $b = Ax$ for which the distance $\sum_{j=1}^J \gamma_j^{-1} KL(x_j, x^0_j)$ is minimized. In this section we sketch the proof of this result as a sequence of lemmas, each of which is easily established.

Lemma 13.4 For any nonnegative vectors $a$ and $b$ with $a_+ = \sum_{m=1}^M a_m$ and $b_+ = \sum_{m=1}^M b_m > 0$, we have

$$KL(a, b) = KL(a_+, b_+) + KL\Big(a, \frac{a_+}{b_+}\, b\Big). \tag{13.19}$$

For nonnegative vectors $x$ and $z$ let

$$G_n(x, z) = \sum_{j=1}^J \gamma_j^{-1} KL(x_j, z_j) + \delta_n \sum_{i \in B_n} \alpha_{ni} \big[KL((Ax)_i, b_i) - KL((Ax)_i, (Az)_i)\big]. \tag{13.20}$$

It follows from Equation (13.19) and the inequality $\gamma_j^{-1} \geq \delta_n \sigma_{nj}$, that is, from (13.9), that $G_n(x, z) \geq 0$ in all cases.

Lemma 13.5 For every $x$ we have

$$G_n(x, x) = \delta_n \sum_{i \in B_n} \alpha_{ni} KL((Ax)_i, b_i), \tag{13.21}$$


so that

$$G_n(x, z) = G_n(x, x) + \sum_{j=1}^J \gamma_j^{-1} KL(x_j, z_j) - \delta_n \sum_{i \in B_n} \alpha_{ni} KL((Ax)_i, (Az)_i). \tag{13.22}$$

Therefore the distance $G_n(x, z)$ is minimized, as a function of $z$, by $z = x$. Now we minimize $G_n(x, z)$ as a function of $x$. The following lemma shows that the answer is

$$x_j = z'_j = z_j \exp\Big(\gamma_j \delta_n \sum_{i \in B_n} \alpha_{ni} A_{ij} \log\frac{b_i}{(Az)_i}\Big). \tag{13.23}$$

Lemma 13.6 For each $x$ and $z$ we have

$$G_n(x, z) = G_n(z', z) + \sum_{j=1}^J \gamma_j^{-1} KL(x_j, z'_j). \tag{13.24}$$

It is clear that $(x^k)' = x^{k+1}$ for all $k$.

Now let $b = Au$ for some nonnegative vector $u$. We calculate $G_n(u, x^k)$ in two ways: using the definition, we have

$$G_n(u, x^k) = \sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^k_j) - \delta_n \sum_{i \in B_n} \alpha_{ni} KL(b_i, (Ax^k)_i),$$

while using Equation (13.24) we find that

$$G_n(u, x^k) = G_n(x^{k+1}, x^k) + \sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^{k+1}_j).$$

Therefore

$$\sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^k_j) - \sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^{k+1}_j) = G_n(x^{k+1}, x^k) + \delta_n \sum_{i \in B_n} \alpha_{ni} KL(b_i, (Ax^k)_i). \tag{13.25}$$

We conclude several things from this.

First, the sequence $\{\sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^k_j)\}$ is decreasing, so that the sequences $\{G_n(x^{k+1}, x^k)\}$ and $\{\delta_n \sum_{i \in B_n} \alpha_{ni} KL(b_i, (Ax^k)_i)\}$ converge to zero. Therefore the sequence $\{x^k\}$ is bounded, and we may select an arbitrary cluster point $x^*$. It follows that $b = Ax^*$. We may therefore replace


the generic solution $u$ with $x^*$ to find that $\{\sum_{j=1}^J \gamma_j^{-1} KL(x^*_j, x^k_j)\}$ is a decreasing sequence; but since a subsequence converges to zero, the entire sequence must converge to zero. Therefore $\{x^k\}$ converges to the solution $x^*$.

Finally, since the right side of Equation (13.25) does not depend on the particular choice of solution we made, neither does the left side. By telescoping we conclude that

$$\sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^0_j) - \sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^*_j)$$

is also independent of the choice of $u$. Consequently, minimizing the function $\sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^0_j)$ over all solutions $u$ is equivalent to minimizing $\sum_{j=1}^J \gamma_j^{-1} KL(u_j, x^*_j)$ over all solutions $u$; but the solution to the latter problem is obviously $u = x^*$. This completes the proof.


Part V

Stability


Chapter 14

Sensitivity to Noise

When we use an iterative algorithm, we want it to solve our problem. We also want the solution in a reasonable amount of time, and we want slight errors in the measurements to cause only slight perturbations in the calculated answer. We have already discussed the use of block-iterative methods to accelerate convergence. Now we turn to regularization as a means of reducing sensitivity to noise. Because a number of regularization methods can be derived using a Bayesian maximum a posteriori approach, regularization is sometimes treated under the heading of MAP methods (see, for example, [33]).

14.1 Where Does Sensitivity Come From?

We illustrate the sensitivity problem that can arise when the inconsistent system $Ax = b$ has more equations than unknowns and we calculate the least-squares solution,

$$x_{LS} = (A^\dagger A)^{-1} A^\dagger b,$$

assuming that the Hermitian, nonnegative-definite matrix $Q = A^\dagger A$ is invertible, and therefore positive-definite.

The matrix $Q$ has the eigenvalue/eigenvector decomposition

$$Q = \lambda_1 u_1 u_1^\dagger + \cdots + \lambda_J u_J u_J^\dagger,$$

where the (necessarily positive) eigenvalues of $Q$ are

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_J > 0,$$

and the vectors $u_1, \ldots, u_J$ are the corresponding orthonormal eigenvectors.


14.1.1 The Singular-Value Decomposition of A

The square roots $\sqrt{\lambda_i}$ are called the singular values of $A$. The singular-value decomposition (SVD) of $A$ is similar to the eigenvalue/eigenvector decomposition of $Q$: we have

$$A = \sqrt{\lambda_1}\, v_1 u_1^\dagger + \cdots + \sqrt{\lambda_J}\, v_J u_J^\dagger,$$

where the $v_i$ are particular eigenvectors of $AA^\dagger$. We see from the SVD that the quantities $\sqrt{\lambda_i}$ determine the relative importance of each term $v_i u_i^\dagger$.

The SVD is commonly used for compressing transmitted or stored images. In such cases, the rectangular matrix $A$ is a discretized image. It is not uncommon for many of the lowest singular values of $A$ to be nearly zero, and to be essentially insignificant in the reconstruction of $A$. Only those terms in the SVD for which the singular values are significant need to be transmitted or stored. The resulting images may be slightly blurred, but can be restored later, as needed.

When the matrix $A$ is a finite model of a linear imaging system, there will necessarily be model error in the selection of $A$. Getting the dominant terms in the SVD nearly correct is much more important (and usually much easier) than getting the smaller ones correct. The problems arise when we try to invert the system, to solve $Ax = b$ for $x$.

14.1.2 The Inverse of Q = A†A

The inverse of $Q$ can then be written

$$Q^{-1} = \lambda_1^{-1} u_1 u_1^\dagger + \cdots + \lambda_J^{-1} u_J u_J^\dagger,$$

so that, with $A^\dagger b = c$, we have

$$x_{LS} = \lambda_1^{-1} (u_1^\dagger c) u_1 + \cdots + \lambda_J^{-1} (u_J^\dagger c) u_J.$$

Because the eigenvectors are orthonormal, we can express $\|A^\dagger b\|^2 = \|c\|^2$ as

$$\|c\|^2 = |u_1^\dagger c|^2 + \cdots + |u_J^\dagger c|^2,$$

and $\|x_{LS}\|^2$ as

$$\|x_{LS}\|^2 = \lambda_1^{-1} |u_1^\dagger c|^2 + \cdots + \lambda_J^{-1} |u_J^\dagger c|^2.$$

It is not uncommon for the eigenvalues of $Q$ to be quite distinct, with some of them much larger than the others. When this is the case, we see that $\|x_{LS}\|$ can be much larger than $\|c\|$, because of the presence of the terms involving the reciprocals of the small eigenvalues. When the measurements $b$ are essentially noise-free, we may have $|u_i^\dagger c|$ relatively small, for the indices


$i$ near $J$, keeping the product $\lambda_i^{-1} |u_i^\dagger c|^2$ reasonable in size, but when $b$ becomes noisy, this may no longer be the case. The result is that those terms corresponding to the reciprocals of the smallest eigenvalues dominate the sum for $x_{LS}$, and the norm of $x_{LS}$ becomes quite large. The least-squares solution we have computed is essentially all noise and useless.

In our discussion of the ART, we saw that when we impose a non-negativity constraint on the solution, noise in the data can manifest itself in a different way. When $A$ has more columns than rows, but $Ax = b$ has no non-negative solution, then, at least for those $A$ having the full-rank property, the non-negatively constrained least-squares solution has at most $I - 1$ non-zero entries. This happens also with the EMML and SMART solutions. As with the ART, regularization can eliminate the problem.

14.1.3 Reducing the Sensitivity to Noise

As we just saw, the presence of small eigenvalues for $Q$ and noise in $b$ can cause $\|x_{LS}\|$ to be much larger than $\|A^\dagger b\|$, with the result that $x_{LS}$ is useless. In this case, even though $x_{LS}$ minimizes $\|Ax - b\|$, it does so by overfitting to the noisy $b$. To reduce the sensitivity to noise and thereby obtain a more useful approximate solution, we can regularize the problem.

It often happens in applications that, even when there is an exact solution of $Ax = b$, noise in the vector $b$ makes such an exact solution undesirable; in such cases a regularized solution is usually used instead. Select $\epsilon > 0$ and a vector $p$ that is a prior estimate of the desired solution. Define

$$F_\epsilon(x) = (1 - \epsilon)\|Ax - b\|^2 + \epsilon\|x - p\|^2. \tag{14.1}$$

Exercise 14.1 Show that $F_\epsilon$ always has a unique minimizer $\hat{x}_\epsilon$, given by

$$\hat{x}_\epsilon = ((1 - \epsilon)A^\dagger A + \epsilon I)^{-1}((1 - \epsilon)A^\dagger b + \epsilon p);$$

this is a regularized solution of $Ax = b$. Here, $p$ is a prior estimate of the desired solution. Note that the inverse above always exists.

Note that, if $p = 0$, then

$$\hat{x}_\epsilon = (A^\dagger A + \gamma^2 I)^{-1} A^\dagger b, \tag{14.2}$$

for $\gamma^2 = \frac{\epsilon}{1 - \epsilon}$. The regularized solution has been obtained by modifying the formula for $x_{LS}$, replacing the inverse of the matrix $Q = A^\dagger A$ with the inverse of $Q + \gamma^2 I$. When $\epsilon$ is near zero, so is $\gamma^2$, and the matrices $Q$ and $Q + \gamma^2 I$ are nearly equal. What is different is that the eigenvalues of $Q + \gamma^2 I$ are $\lambda_i + \gamma^2$, so that, when the eigenvalues are inverted, the reciprocal eigenvalues are no larger than $1/\gamma^2$, which prevents the norm of $\hat{x}_\epsilon$ from being too large, and decreases the sensitivity to noise.
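The effect is easy to see numerically. The sketch below (ours; it uses NumPy and an ill-conditioned random matrix purely for illustration) compares the norms of the least-squares and regularized solutions of Equation (14.2) when a little noise is added to $b$.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 50, 20
U, _ = np.linalg.qr(rng.standard_normal((I, J)))
V, _ = np.linalg.qr(rng.standard_normal((J, J)))
sing = np.logspace(0, -6, J)                     # widely spread singular values
A = U @ np.diag(sing) @ V.T
x_true = rng.standard_normal(J)
b = A @ x_true + 1e-4 * rng.standard_normal(I)   # slightly noisy data

Q = A.T @ A
x_ls = np.linalg.solve(Q, A.T @ b)               # least-squares solution
gamma2 = 1e-6
x_reg = np.linalg.solve(Q + gamma2 * np.eye(J), A.T @ b)   # Eq. (14.2), p = 0

print(np.linalg.norm(x_ls))    # typically huge: noise amplified by 1/lambda_i
print(np.linalg.norm(x_reg))   # much smaller: reciprocals capped near 1/gamma^2
```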


Exercise 14.2 Let $\epsilon$ be in $(0, 1)$, and let $I$ be the identity matrix whose dimensions are understood from the context. Show that

$$((1 - \epsilon)AA^\dagger + \epsilon I)^{-1}A = A((1 - \epsilon)A^\dagger A + \epsilon I)^{-1},$$

and, taking conjugate transposes,

$$A^\dagger((1 - \epsilon)AA^\dagger + \epsilon I)^{-1} = ((1 - \epsilon)A^\dagger A + \epsilon I)^{-1}A^\dagger.$$

Hint: use the identity

$$A((1 - \epsilon)A^\dagger A + \epsilon I) = ((1 - \epsilon)AA^\dagger + \epsilon I)A.$$

Exercise 14.3 Show that any vector $p$ in $R^J$ can be written as $p = A^\dagger q + r$, where $Ar = 0$.

What happens to $\hat{x}_\epsilon$ as $\epsilon$ goes to zero? This will depend on which case we are in:

Case 1: $J \leq I$, and $A^\dagger A$ is invertible; or

Case 2: $J > I$, and $AA^\dagger$ is invertible.

Exercise 14.4 Show that, in Case 1, taking limits as $\epsilon \to 0$ on both sides of the expression for $\hat{x}_\epsilon$ gives $\hat{x}_\epsilon \to (A^\dagger A)^{-1}A^\dagger b$, the least squares solution of $Ax = b$.

We consider Case 2 now. Write $p = A^\dagger q + r$, with $Ar = 0$. Then

$$\hat{x}_\epsilon = A^\dagger((1 - \epsilon)AA^\dagger + \epsilon I)^{-1}((1 - \epsilon)b + \epsilon q) + ((1 - \epsilon)A^\dagger A + \epsilon I)^{-1}(\epsilon r).$$

Exercise 14.5 (a) Show that

$$((1 - \epsilon)A^\dagger A + \epsilon I)^{-1}(\epsilon r) = r,$$

for all $\epsilon \in (0, 1)$. (b) Now take the limit of $\hat{x}_\epsilon$, as $\epsilon \to 0$, to get $\hat{x}_\epsilon \to A^\dagger(AA^\dagger)^{-1}b + r$. Show that this is the solution of $Ax = b$ closest to $p$. Hints: For part (a) let

$$t_\epsilon = ((1 - \epsilon)A^\dagger A + \epsilon I)^{-1}(\epsilon r).$$

Then, multiplying by $A$ gives

$$At_\epsilon = A((1 - \epsilon)A^\dagger A + \epsilon I)^{-1}(\epsilon r).$$

Now show that $At_\epsilon = 0$. For part (b), draw a diagram for the case of one equation in two unknowns.


14.2 Iterative Regularization in ART

It is often the case that the entries of the vector b in the system Ax = b come from measurements, so are usually noisy. If the entries of b are noisy but the system Ax = b remains consistent (which can easily happen in the underdetermined case, with J > I), the ART begun at x^0 = 0 converges to the solution having minimum norm, but this norm can be quite large. The resulting solution is probably useless. Instead of solving Ax = b, we regularize by minimizing, for example, the function F_ε(x) given in Equation (14.1). For the case of p = 0, the solution to this problem is the vector x_ε in Equation (14.2). However, we do not want to calculate A†A + γ²I, in order to solve

(A†A + γ²I)x_ε = A†b,

when the matrix A is large. Fortunately, there are ways to find x_ε, using only the matrix A and the ART algorithm.

We discuss two methods for using ART to obtain regularized solutions of Ax = b. The first one is presented in [33], while the second one is due to Eggermont, Herman, and Lent [63]. For notational convenience, we consider only real systems.

In our first method we use ART to solve the system of equations given in matrix form by

[ A^T  γI ]  [ u ]
             [ v ]  =  0.

We begin with u^0 = b and v^0 = 0.

Exercise 14.6 Show that the lower component of the limit vector is v^∞ = −γ x_ε.

The method of Eggermont et al. is similar. In their method we use ART to solve the system of equations given in matrix form by

[ A  γI ]  [ x ]
           [ v ]  =  b.

We begin at x^0 = 0 and v^0 = 0.

Exercise 14.7 Show that the limit vector has for its upper component x^∞ = x_ε as before, and that γ v^∞ = b − Ax_ε.
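A minimal sketch of the second approach follows, assuming NumPy; the test matrix, γ and number of sweeps are arbitrary choices, and the plain cyclic ART (Kaczmarz) routine shown is only one way to implement the idea described above.

```python
import numpy as np

def art(M, d, x0, sweeps=2000):
    """Plain ART (Kaczmarz): cycle through the rows of M, projecting the
    current iterate onto each hyperplane M[i] . x = d[i]."""
    x = x0.astype(float).copy()
    for _ in range(sweeps):
        for i in range(M.shape[0]):
            row = M[i]
            x += (d[i] - row @ x) / (row @ row) * row
    return x

rng = np.random.default_rng(1)
I, J = 3, 5                       # underdetermined example
A = rng.standard_normal((I, J))
b = rng.standard_normal(I)
gamma = 0.5

# Eggermont-Herman-Lent system [ A  gamma*I ] [x; v] = b, started at zero.
M = np.hstack([A, gamma * np.eye(I)])
z = art(M, b, np.zeros(J + I))
x_art = z[:J]                     # upper component should approximate x_eps

# Direct regularized solution (A^T A + gamma^2 I)^{-1} A^T b for comparison.
x_reg = np.linalg.solve(A.T @ A + gamma**2 * np.eye(J), A.T @ b)
print(np.linalg.norm(x_art - x_reg))   # should be tiny
```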

14.3 A Bayesian View of Reconstruction

The EMML iterative algorithm maximizes the likelihood function for the case in which the entries of the data vector b = (b_1,...,b_I)^T are assumed


to be samples of independent Poisson random variables with mean values (Ax)_i; here, A is an I by J matrix with nonnegative entries and x = (x_1,...,x_J)^T is the vector of nonnegative parameters to be estimated. Equivalently, it minimizes the Kullback-Leibler distance KL(b,Ax). This situation arises in single photon emission tomography, where the b_i are the number of photons counted at each detector i, x is the vectorized image to be reconstructed and its entries x_j are (proportional to) the radionuclide intensity levels at each voxel j. When the signal-to-noise ratio is low, which is almost always the case in medical applications, maximizing likelihood can lead to unacceptably noisy reconstructions, particularly when J is larger than I. One way to remedy this problem is simply to halt the EMML algorithm after a few iterations, to avoid over-fitting the x to the noisy data. A more mathematically sophisticated remedy is to employ a Bayesian approach and seek a maximum a posteriori (MAP) estimate of x.

In the Bayesian approach we view x as an instance of a random vector having a probability density function f(x). Instead of maximizing the likelihood given the data, we now maximize the posterior likelihood, given both the data and the prior distribution for x. This is equivalent to minimizing

F (x) = KL(b,Ax) − log f (x). (14.3)

The EMML algorithm is an example of an optimization method based on alternating minimization of a function H(x, z) > 0 of two vector variables. The alternating minimization works this way: let x and z be vector variables and H(x, z) > 0. If we fix z and minimize H(x, z) with respect to x, we find that the solution is x = z, the vector we fixed; that is,

H(x, z) ≥ H(z, z)

always. If we fix x and minimize H(x, z) with respect to z, we get something new; call it Tx. The EMML algorithm has the iterative step x^{k+1} = Tx^k.

Obviously, we can't use an arbitrary function H; it must be related to the function KL(b,Ax) that we wish to minimize, and we must be able to obtain each intermediate optimizer in closed form. The clever step is to select H(x, z) so that H(x, x) = KL(b,Ax), for any x. Now see what we have so far:

KL(b,Ax^k) = H(x^k, x^k) ≥ H(x^k, x^{k+1}) ≥ H(x^{k+1}, x^{k+1}) = KL(b,Ax^{k+1}).

That tells us that the algorithm makes KL(b,Ax^k) decrease with each iteration. The proof doesn't stop here, but at least it is now plausible that the EMML iteration could minimize KL(b,Ax).

The function H(x, z) used in the EMML case is the KL distance

H(x, z) = KL(r(x), q(z)) = Σ_{i=1}^I Σ_{j=1}^J KL(r(x)_ij, q(z)_ij); (14.4)


we define, for each nonnegative vector x for which (Ax)_i = Σ_{j=1}^J A_ij x_j > 0, the arrays r(x) = {r(x)_ij} and q(x) = {q(x)_ij} with entries

r(x)_ij = x_j A_ij b_i/(Ax)_i

and

q(x)_ij = x_j A_ij.

With x = x^k fixed, we minimize with respect to z to obtain the next EMML iterate x^{k+1}. Having selected the prior pdf f(x), we want an iterative algorithm to minimize the function F(x) in Equation (14.3). It would be a great help if we could mimic the alternating minimization formulation and obtain x^{k+1} by minimizing

KL(r(x^k), q(z)) − log f(z) (14.5)

with respect to z. Unfortunately, to be able to express each new x^{k+1} in closed form, we need to choose f(x) carefully.

14.4 The Gamma Prior Distribution for x

In [89] Lange et al. suggest viewing the entries x_j as samples of independent gamma-distributed random variables. A gamma-distributed random variable x takes positive values and has for its pdf the gamma distribution defined for positive x by

γ(x) = (1/Γ(α)) (α/β)^α x^{α−1} e^{−αx/β},

where α and β are positive parameters and Γ denotes the gamma function. The mean of such a gamma-distributed random variable is then μ = β and the variance is σ² = β²/α.

Exercise 14.8 Show that if the entries z_j of z are viewed as independent and gamma-distributed with means μ_j and variances σ_j², then minimizing the function in line (14.5) with respect to z is equivalent to minimizing the function

KL(r(x^k), q(z)) + Σ_{j=1}^J δ_j KL(γ_j, z_j), (14.6)

for

δ_j = μ_j/σ_j²,   γ_j = (μ_j² − σ_j²)/μ_j,


under the assumption that the latter term is positive. Show further that the resulting x^{k+1} has entries given in closed form by

x_j^{k+1} = (δ_j/(δ_j + s_j)) γ_j + (1/(δ_j + s_j)) x_j^k Σ_{i=1}^I A_ij b_i/(Ax^k)_i, (14.7)

where s_j = Σ_{i=1}^I A_ij.

We see from Equation (14.7) that the MAP iteration using the gamma priors generates a sequence of estimates each entry of which is a convex combination or weighted arithmetic mean of the result of one EMML step and the prior estimate γ_j. Convergence of the resulting iterative sequence is established in [89]; see also [19].
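The closed-form update (14.7) is easy to prototype. The following sketch assumes NumPy; the small system, the prior means γ_j, the weights δ_j and the fixed iteration count are all made up for illustration.

```python
import numpy as np

def map_emml_gamma(A, b, gamma_prior, delta, x0, iterations=200):
    """MAP-EMML sketch following Equation (14.7): each entry of the new
    iterate is a weighted arithmetic mean of the prior estimate gamma_j
    and the result of one EMML step.  All quantities are assumed positive."""
    s = A.sum(axis=0)                        # s_j = sum_i A_ij
    x = x0.astype(float).copy()
    for _ in range(iterations):
        emml = x * (A.T @ (b / (A @ x)))     # x_j^k * sum_i A_ij b_i/(Ax^k)_i
        x = (delta / (delta + s)) * gamma_prior + emml / (delta + s)
    return x

# Tiny illustrative system with made-up data and parameters.
A = np.array([[1.0, 0.5],
              [0.0, 0.5]])
b = np.array([0.5, 1.0])
x = map_emml_gamma(A, b, gamma_prior=np.array([1.0, 1.0]),
                   delta=np.array([0.1, 0.1]), x0=np.ones(2))
print(x)
```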

14.5 The One-Step-Late Alternative

It may well happen that we do not wish to use the gamma priors model and prefer some other f(x). Because we will not be able to find a closed form expression for the z minimizing the function in line (14.5), we need some other way to proceed with the alternating minimization. Green [71] has offered the one-step-late (OSL) alternative.

When we try to minimize the function in line (14.5) by setting the gradient to zero we replace the variable z that occurs in the gradient of the term − log f(z) with x^k, the previously calculated iterate. Then, we can solve for z in closed form to obtain the new x^{k+1}. Unfortunately, negative entries can result and convergence is not guaranteed. There is a sizable literature on the use of MAP methods for this problem. In [28] an interior-point algorithm (IPA) is presented that avoids the OSL issue. In [99] the IPA is used to regularize transmission tomographic images.

14.6 Regularizing the SMART

The SMART algorithm is not derived as a maximum likelihood method, so regularized versions do not take the form of MAP algorithms. Nevertheless, in the presence of noisy data, the SMART algorithm suffers from the same problem that afflicts the EMML, overfitting to noisy data resulting in an unacceptably noisy image. As we saw earlier, there is a close connection between the EMML and SMART algorithms. This suggests that a regularization method for SMART can be developed along the lines of the MAP with gamma priors used for EMML. Since the SMART is obtained by minimizing the function KL(q(z), r(x^k)) with respect to z to obtain x^{k+1},


it seems reasonable to attempt to derive a regularized SMART iterative scheme by minimizing

KL(q(z), r(x^k)) + Σ_{j=1}^J δ_j KL(z_j, γ_j), (14.8)

for selected positive parameters δ_j and γ_j.

Exercise 14.9 Show that the z_j minimizing the function in line (14.8) can be expressed in closed form and that the resulting x^{k+1} has entries that satisfy

log x_j^{k+1} = (δ_j/(δ_j + s_j)) log γ_j + (1/(δ_j + s_j)) Σ_{i=1}^I A_ij log[x_j^k b_i/(Ax^k)_i]. (14.9)

In [19] it was shown that this iterative sequence converges to a minimizer of the function

KL(Ax, b) + Σ_{j=1}^J δ_j KL(x_j, γ_j).

It is useful to note that, although it may be possible to rederive this minimization problem within the framework of Bayesian MAP estimation by carefully selecting a prior pdf for the vector x, we have not done so. The MAP approach is a special case of regularization through the use of penalty functions. These penalty functions need not arise through a Bayesian formulation of the parameter-estimation problem.
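A sketch of the regularized SMART update (14.9), under the same assumptions as the MAP-EMML sketch above (NumPy, made-up data and parameters, arbitrary fixed stopping rule); here the update is written as a weighted mean in the logarithmic domain.

```python
import numpy as np

def smart_regularized(A, b, gamma_prior, delta, x0, iterations=200):
    """Regularized SMART sketch following Equation (14.9): the new iterate
    is a weighted geometric mean of the prior gamma_j and a SMART-type step."""
    s = A.sum(axis=0)                        # s_j = sum_i A_ij
    x = x0.astype(float).copy()
    for _ in range(iterations):
        # sum_i A_ij log( x_j b_i / (Ax)_i )  =  s_j log x_j + sum_i A_ij log(b_i/(Ax)_i)
        t = s * np.log(x) + A.T @ np.log(b / (A @ x))
        x = np.exp((delta * np.log(gamma_prior) + t) / (delta + s))
    return x

A = np.array([[1.0, 0.5],
              [0.0, 0.5]])
b = np.array([0.5, 1.0])
x = smart_regularized(A, b, gamma_prior=np.array([1.0, 1.0]),
                      delta=np.array([0.1, 0.1]), x0=np.ones(2))
print(x)   # approximate minimizer of KL(Ax,b) + sum_j delta_j KL(x_j, gamma_j)
```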

14.7 De Pierro’s Surrogate-Function Method

In [57] De Pierro presents a modified EMML algorithm that includes regularization in the form of a penalty function. His objective is the same as ours was in the case of regularized SMART: to embed the penalty term in the alternating minimization framework in such a way as to make it possible to obtain the next iterate in closed form. Because his surrogate-function method has been used subsequently by others to obtain penalized likelihood algorithms [44], we consider his approach in some detail.

Let x and z be vector variables and H(x, z) > 0. Mimicking the behavior of the function H(x, z) used in Equation (14.4), we require that if we fix z and minimize H(x, z) with respect to x, the solution should be x = z, the vector we fixed; that is, H(x, z) ≥ H(z, z) always. If we fix x and minimize H(x, z) with respect to z, we should get something new; call it Tx. As with the EMML, the algorithm will have the iterative step x^{k+1} = Tx^k.


Summarizing, we see that we need a function H(x, z) with the properties (1) H(x, z) ≥ H(z, z) for all x and z; (2) H(x, x) is the function F(x) we wish to minimize; and (3) minimizing H(x, z) with respect to z for fixed x is easy.

The function to be minimized is

F(x) = KL(b,Ax) + g(x),

where g(x) ≥ 0 is some penalty function. De Pierro uses penalty functions g(x) of the form

g(x) = Σ_{l=1}^p f_l(⟨s_l, x⟩).

Let us define the matrix S to have for its l-th row the vector s_l^T. Then ⟨s_l, x⟩ = (Sx)_l, the l-th entry of the vector Sx. Therefore,

g(x) = Σ_{l=1}^p f_l((Sx)_l).

Let λ_lj > 0 with Σ_{j=1}^J λ_lj = 1, for each l. Assume that the functions f_l are convex. Therefore, for each l, we have

f_l((Sx)_l) = f_l(Σ_{j=1}^J S_lj x_j) = f_l(Σ_{j=1}^J λ_lj (S_lj/λ_lj) x_j) ≤ Σ_{j=1}^J λ_lj f_l((S_lj/λ_lj) x_j).

Therefore,

g(x) ≤ Σ_{l=1}^p Σ_{j=1}^J λ_lj f_l((S_lj/λ_lj) x_j).

So we have replaced g(x) with a related function in which the x_j occur separately, rather than just in the combinations (Sx)_l. But we aren't quite done yet.

We would like to take for De Pierro's H(x, z) the function used in the EMML algorithm, plus the function

Σ_{l=1}^p Σ_{j=1}^J λ_lj f_l((S_lj/λ_lj) z_j).

But there is one slight problem: we need H(z, z) = F(z), which we don't have yet. De Pierro's clever trick is to replace f_l((S_lj/λ_lj) z_j) with

f_l((S_lj/λ_lj) z_j − (S_lj/λ_lj) x_j + (Sx)_l).


So, De Pierro's function H(x, z) is the sum of the H(x, z) used in the EMML case and the function

Σ_{l=1}^p Σ_{j=1}^J λ_lj f_l((S_lj/λ_lj) z_j − (S_lj/λ_lj) x_j + (Sx)_l).

Now he has the three properties he needs. Once he has computed x^k, he minimizes H(x^k, z) by taking the gradient and solving the equations for the correct z = Tx^k = x^{k+1}. For the choices of f_l he discusses, these intermediate calculations can either be done in closed form (the quadratic case) or with a simple Newton-Raphson iteration (the logcosh case).
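The convexity bound that drives the construction is easy to check numerically; the following sketch assumes NumPy and uses the quadratic choice f_l(t) = t² with equal weights λ_lj, all made up for illustration.

```python
import numpy as np

# Numerical check of the bound used above: for convex f_l and weights
# lambda_lj summing to one over j,
#   f_l((Sx)_l) <= sum_j lambda_lj f_l((S_lj/lambda_lj) x_j).
rng = np.random.default_rng(5)
f = lambda t: t**2                 # a convex choice for f_l (quadratic case)
S_row = rng.standard_normal(4)     # one row S_l. of S
x = rng.standard_normal(4)
lam = np.full(4, 0.25)             # lambda_lj = 1/4 for all j

lhs = f(S_row @ x)
rhs = np.sum(lam * f((S_row / lam) * x))
print(lhs <= rhs + 1e-12)          # always True, by convexity of f
```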

14.8 Block-Iterative Regularization

We saw previously that it is possible to obtain a regularized least-squares solution x_ε, and thereby avoid the limit cycle, using only the matrix A and the ART algorithm. This prompts us to ask if it is possible to find regularized SMART solutions using block-iterative variants of SMART. Similarly, we wonder if it is possible to do the same for EMML.

Open Question: Can we use the MART to find the minimizer of thefunction

KL(Ax,b) + KL(x, p)?

More generally, can we obtain the minimizer using RBI-SMART?

Open Question: Can we use the RBI-EMML methods to obtain theminimizer of the function

KL(b,Ax) + KL( p, x)?

There have been various attempts to include regularization in block-iterative methods, to reduce noise sensitivity and avoid limit cycles, but allof these approaches have been ad hoc , with little or no theoretical basis.Typically, they simply modify each iterative step by including an additionalterm that appears to be related to the regularizing penalty function. Thecase of the ART is instructive, however. In that case, we obtained thedesired iterative algorithm by using an augmented set of variables, notsimply by modifying each step of the original ART algorithm. How to dothis for the MART and the other block-iterative algorithms is not obvious.

Recall that the RAMLA method in Equation (13.18) is similar to theRBI-EMML algorithm, but employs a sequence of decreasing relaxationparameters, which, if properly chosen, will cause the iterates to convergeto the minimizer of  KL(b,Ax), thereby avoiding the limit cycle. In [59]

RAMLA is extended to a regularized version, but with no guarantee of convergence.


Chapter 15

Feedback in Block-Iterative Reconstruction

When the nonnegative system of linear equations Ax = b has no nonnegative solutions we say that we are in the inconsistent case. In this case the SMART and EMML algorithms still converge, to a nonnegative minimizer of KL(Ax,b) and KL(b,Ax), respectively. On the other hand, the rescaled block-iterative versions of these algorithms, RBI-SMART and RBI-EMML, do not converge. Instead they exhibit cyclic subsequential convergence; for each fixed n = 1,...,N, with N the number of blocks, the subsequences {x^{mN+n}} converge to their own limits. These limit vectors then constitute the limit cycle (LC). The LC for RBI-SMART is not the same as for RBI-EMML, generally, and the LC varies with the choice of blocks. Our problem is to find a way to calculate the SMART and EMML limit vectors using the RBI methods. More specifically, how can we calculate the SMART and EMML limit vectors from their associated RBI limit cycles?

As is often the case with the algorithms based on the KL distance, we can turn to the ART algorithm for guidance. What happens with the ART algorithm in the inconsistent case is often closely related to what happens with RBI-SMART and RBI-EMML, although proofs for the latter methods are more difficult to obtain. For example, when the system Ax = b has no solution we can prove that ART exhibits cyclic subsequential convergence to a limit cycle. The same behavior is seen with the RBI methods, but no one knows how to prove this. When the system Ax = b has no solution we usually want to calculate the least squares (LS) approximate solution. The problem then is to use the ART to find the LS solution. There are several ways to do this, as discussed in [23, 33]. We would like to be able


to borrow some of these methods and apply them to the RBI problem. In this section we focus on one specific method that works for ART and we try to make it work for RBI; it is the feedback approach.

15.1 Feedback in ART

Suppose that the system Ax = b has no solution. We apply the ART and get the limit cycle {z^1, z^2,...,z^I}, where I is the number of equations and z^0 = z^I. We assume that the rows of A have been normalized so that their lengths are equal to one. Then the ART iterative step gives

z_j^i = z_j^{i−1} + A_ij(b_i − (Az^{i−1})_i)

or

z_j^i − z_j^{i−1} = A_ij(b_i − (Az^{i−1})_i).

Summing over the index i and using z^0 = z^I we obtain zero on the left side, for each j. Consequently A†b = A†c, where c is the vector with entries c_i = (Az^{i−1})_i. It follows that the systems Ax = b and Ax = c have the same LS solutions and that it may help to use both b and c to find the LS solution from the limit cycle. The article [23] contains several results along these lines. One approach is to apply the ART again to the system Ax = c, obtaining a new LC and a new candidate for the right side of the system of equations. If we repeat this feedback procedure, each time using the LC to define a new right side vector, does it help us find the LS solution? Yes, as Theorem 4 of [23] shows. Our goal in this section is to explore the possibility of using the same sort of feedback in the RBI methods. Some results in this direction are in [23]; we review those now.
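A small numerical sketch of the identity A†b = A†c follows, assuming NumPy and a randomly generated inconsistent system; the routine is a plain cyclic ART with normalized rows, run long enough that the iterates have essentially settled into the limit cycle.

```python
import numpy as np

def art_limit_cycle(A, b, x0, sweeps=500):
    """Run cyclic ART (rows normalized to length one) long enough to settle
    into the limit cycle, then record the vectors z^1,...,z^I of one sweep."""
    x = x0.astype(float).copy()
    I = A.shape[0]
    for _ in range(sweeps):
        for i in range(I):
            x += (b[i] - A[i] @ x) * A[i]
    cycle = []
    for i in range(I):
        x = x + (b[i] - A[i] @ x) * A[i]
        cycle.append(x.copy())
    return cycle

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 2))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize the rows
b = rng.standard_normal(4)                       # generically inconsistent

cycle = art_limit_cycle(A, b, np.zeros(2))
# c_i = (A z^{i-1})_i, with z^0 = z^I (cycle[-1] plays the role of z^0).
c = np.array([A[i] @ cycle[i - 1] for i in range(len(cycle))])

# Feedback rationale: A^T b and A^T c agree (up to how well the limit cycle
# has been reached), so Ax = b and Ax = c share the same LS solutions.
print(A.T @ b)
print(A.T @ c)
```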

15.2 Feedback in RBI methods

One issue that makes the KL methods more complicated than the ART is the support of the limit vectors, meaning the set of indices j for which the entries of the vector are positive. In [19] it was shown that when the system Ax = b has no nonnegative solutions and A has the full-rank property there is a subset S of {j = 1,...,J} with cardinality at most I − 1, such that every nonnegative minimizer of KL(Ax,b) has zero for its j-th entry whenever j is not in S. It follows that the minimizer is unique. The same result holds for the EMML, although it has not been proven that the set S is the same set as in the SMART case. The same result holds for the vectors of the LC for both RBI-SMART and RBI-EMML.

A simple, yet helpful, example to refer to as we proceed is the following.

A = [ 1  .5 ]          b = [ .5 ]
    [ 0  .5 ] ,            [ 1  ] .


There is no nonnegative solution to this system of equations and the support set S for SMART, EMML and the RBI methods is S = {j = 2}.

15.2.1 The RBI-SMART

Our analysis of the SMART and EMML methods has shown that the theory for SMART is somewhat nicer than that for EMML and the resulting theorems for SMART are a bit stronger. The same is true for RBI-SMART, compared to RBI-EMML. For that reason we begin with RBI-SMART.

Recall that the iterative step for RBI-SMART is

x_j^{k+1} = x_j^k exp( m_n^{-1} s_j^{-1} Σ_{i∈B_n} A_ij log(b_i/(Ax^k)_i) ),

where n = k(mod N) + 1, s_j = Σ_{i=1}^I A_ij, s_nj = Σ_{i∈B_n} A_ij, and m_n = max{s_nj/s_j, j = 1,...,J}.
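A sketch of this iterative step follows, assuming NumPy; the blocks are taken to be single equations and the data are the small two-equation example given above, so the iterates settle into a limit cycle rather than converge.

```python
import numpy as np

def rbi_smart(A, b, blocks, x0, sweeps=100):
    """RBI-SMART sketch: blocks is a list of index arrays B_1,...,B_N.
    Implements the iterative step quoted above."""
    s = A.sum(axis=0)                                       # s_j
    m = [(A[B].sum(axis=0) / s).max() for B in blocks]      # m_n
    x = x0.astype(float).copy()
    for _ in range(sweeps):
        for n, B in enumerate(blocks):
            ratio = np.log(b[B] / (A[B] @ x))
            x = x * np.exp((A[B].T @ ratio) / (m[n] * s))
    return x

# The small example above, with each equation forming its own block.
A = np.array([[1.0, 0.5],
              [0.0, 0.5]])
b = np.array([0.5, 1.0])
blocks = [np.array([0]), np.array([1])]
x = rbi_smart(A, b, blocks, x0=np.ones(2))
print(x)   # one vector of the limit cycle; Ax = b has no nonnegative solution
```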

For each n let

G_n(x, z) = Σ_{j=1}^J s_j KL(x_j, z_j) − m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, (Az)_i) + m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i).

Exercise 15.1 Show that

Σ_{j=1}^J s_j KL(x_j, z_j) − m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, (Az)_i) ≥ 0,

so that G_n(x, z) ≥ 0.

Exercise 15.2 Show that

G_n(x, z) = G_n(z', z) + Σ_{j=1}^J s_j KL(x_j, z'_j),

where

z'_j = z_j exp( m_n^{-1} s_j^{-1} Σ_{i∈B_n} A_ij log(b_i/(Az)_i) ).

We assume that there are no nonnegative solutions to the nonnegative system Ax = b. We apply the RBI-SMART and get the limit cycle {z^1,...,z^N}, where N is the number of blocks. We also let z^0 = z^N and for each i let c_i = (Az^{n−1})_i where i ∈ B_n, the n-th block. Prompted by what we learned concerning the ART, we ask if the nonnegative minimizers of KL(Ax,b) and KL(Ax,c) are the same. This would be the correct question to ask if


we were using the slower unrescaled block-iterative SMART, in which the m_n are replaced by one. For the rescaled case it turns out that the proper question to ask is: Are the nonnegative minimizers of the functions

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i)

and

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, c_i)

the same? The answer is "Yes, probably." The difficulty has to do with the support of these minimizers; specifically: Are the supports of both minimizers the same as the support of the LC vectors? If so, then we can prove that the two minimizers are identical. This is our motivation for the feedback approach.

The feedback approach is the following: beginning with b^0 = b we apply the RBI-SMART and obtain the LC, from which we extract the vector c, which we also call c^0. We then let b^1 = c^0 and apply the RBI-SMART to the system b^1 = Ax. From the resulting LC we extract c^1 = b^2, and so on. In this way we obtain an infinite sequence of data vectors {b^k}. We denote by {z^{k,1},...,z^{k,N}} the LC we obtain from the system b^k = Ax, so that

b_i^{k+1} = (Az^{k,n})_i, for i ∈ B_n.

One issue we must confront is how we use the support sets. At the first step of feedback we apply RBI-SMART to the system b = b^0 = Ax, beginning with a positive vector x^0. The resulting limit cycle vectors are supported on a set S^0 with cardinality less than I. At the next step we apply the RBI-SMART to the system b^1 = Ax. Should we begin with a positive vector (not necessarily the same x^0 as before) or should our starting vector be supported on S^0?

Exercise 15.3 Show that the RBI-SMART sequence {x^k} is bounded. Hints: For each j let M_j = max{b_i/A_ij : A_ij > 0} and let C_j = max{x_j^0, M_j}. Show that x_j^k ≤ C_j for all k.

Exercise 15.4 Let S be the support of the LC vectors. Show that

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} A_ij log(b_i/c_i) ≤ 0 (15.1)

for all j, with equality for those j ∈ S. Conclude from this that

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i) − Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, c_i) ≥ Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} (b_i − c_i),


with equality if the support of the vector x lies within the set S. Hints: For j ∈ S consider log(z_j^n/z_j^{n−1}) and sum over the index n, using the fact that z^N = z^0. For general j assume there is a j for which the inequality does not hold. Show that there are M and ε > 0 such that, for m ≥ M,

log(x_j^{(m+1)N}/x_j^{mN}) ≥ ε.

Conclude that the sequence {x_j^{mN}} is unbounded.

Exercise 15.5 Show that

Σ_{n=1}^N G_n(z^{k,n}, z^{k,n−1}) = Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} (b_i^k − b_i^{k+1}),

and conclude that the sequence {Σ_{n=1}^N m_n^{-1} (Σ_{i∈B_n} b_i^k)} is decreasing and that the sequence {Σ_{n=1}^N G_n(z^{k,n}, z^{k,n−1})} → 0 as k → ∞. Hints: Calculate G_n(z^{k,n}, z^{k,n−1}) using Exercise (15.2).

Exercise 15.6 Show that, for all vectors x ≥ 0, the sequence

{Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i^k)}

is decreasing and the sequence

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} (b_i^k − b_i^{k+1}) → 0,

as k → ∞. Hints: Calculate

{Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i^k)} − {Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i^{k+1})}

and use the previous exercise.

Exercise 15.7 Extend the boundedness result obtained earlier to conclude that for each fixed  n the sequence  {zk,n} is bounded.

Since the sequence {z^{k,0}} is bounded, there is a subsequence {z^{k_t,0}} converging to a limit vector z^{*,0}. Since the sequence {z^{k_t,1}} is bounded, there is a subsequence converging to some vector z^{*,1}. Proceeding in this


way we find subsequences {z^{k_m,n}} converging to z^{*,n} for each fixed n. Our goal is to show that, with certain restrictions on A, z^{*,n} = z^* for each n. We then show that the sequence {b^k} converges to Az^* and that z^* minimizes

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i).

It follows from Exercise (15.5) that

Σ_{n=1}^N G_n(z^{*,n}, z^{*,n−1}) = 0.

Exercise 15.8 Find suitable restrictions on the matrix  A that permit us to conclude from above that  z∗,n = z∗,n−1 = z∗ for each  n.

Exercise 15.9 Show that the sequence {b^k} converges to Az^*. Hints: Since the sequence

{Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Az^*)_i, b_i^k)}

is decreasing and a subsequence converges to zero, it follows that the whole sequence converges to zero.

Exercise 15.10 Use Exercise (15.4) to obtain conditions that permit us to conclude that the vector z^* is a nonnegative minimizer of the function

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i).

15.2.2 The RBI-EMML

We turn now to the RBI-EMML method, having the iterative step

x_j^{k+1} = (1 − m_n^{-1} s_j^{-1} s_nj) x_j^k + m_n^{-1} s_j^{-1} x_j^k Σ_{i∈B_n} A_ij b_i/(Ax^k)_i,

with n = k(mod N) + 1. As we warned earlier, developing the theory for feedback with respect to the RBI-EMML algorithm appears to be more difficult than in the RBI-SMART case.

Applying the RBI-EMML algorithm to the system of equations Ax = b having no nonnegative solution, we obtain the LC {z^1,...,z^N}. As before, for each i we let c_i = (Az^{n−1})_i where i ∈ B_n. There is a subset S of {j = 1,...,J} with cardinality less than I such that for all n we have z_j^n = 0 if j is not in S.

The first question that we ask is: Are the nonnegative minimizers of the functions

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL(b_i, (Ax)_i)


and

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL(c_i, (Ax)_i)

the same?

As before, the feedback approach involves setting b^0 = b, c^0 = c = b^1 and, for each k, defining b^{k+1} = c^k, where c^k is extracted from the limit cycle

LC(k) = {z^{k,1},...,z^{k,N} = z^{k,0}}

obtained from the system b^k = Ax as c_i^k = (Az^{k,n−1})_i, where n is such that i ∈ B_n. Again, we must confront the issue of how we use the support sets. At the first step of feedback we apply RBI-EMML to the system b = b^0 = Ax, beginning with a positive vector x^0. The resulting limit cycle vectors are supported on a set S^0 with cardinality less than I. At the next step we apply the RBI-EMML to the system b^1 = Ax. Should we begin with a positive vector (not necessarily the same x^0 as before) or should our starting vector be supported on S^0? One approach could be to assume first that J < I and that S = {j = 1,...,J} always and then see what can be discovered.

Our conjectures, subject to restrictions involving the support sets, are as follows:
1: The sequence {b^k} converges to a limit vector b^∞;
2: The system b^∞ = Ax has a nonnegative solution, say x^∞;
3: The LCs obtained for each k converge to the singleton x^∞;
4: The vector x^∞ minimizes the function

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL(b_i, (Ax)_i)

over nonnegative x.

Some results concerning feedback for RBI-EMML were presented in [23]. We sketch those results now.

Exercise 15.11 Show that the quantity

Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} b_i^k

is the same for k = 0, 1,.... Hints: Show that

Σ_{j=1}^J s_j Σ_{n=1}^N (z_j^{k,n} − z_j^{k,n−1}) = 0

and rewrite it in terms of b^k and b^{k+1}.


Exercise 15.12 Show that there is a constant B > 0 such that z_j^{k,n} ≤ B for all k, n and j.

Exercise 15.13 Show that

s_j log(z_j^{k,n−1}/z_j^{k,n}) ≤ m_n^{-1} Σ_{i∈B_n} A_ij log(b_i^{k+1}/b_i^k).

Hints: Use the concavity of the log function and the fact that the terms 1 − m_n^{-1} s_j^{-1} s_nj and m_n^{-1} s_j^{-1} A_ij, i ∈ B_n, sum to one.

Exercise 15.14 Use the previous exercise to prove that the sequence

{Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} KL((Ax)_i, b_i^k)}

is decreasing for each nonnegative vector x and the sequence

{Σ_{n=1}^N m_n^{-1} Σ_{i∈B_n} A_ij log(b_i^k)}

is increasing.


Part VI

Optimization



Chapter 16

Iterative Optimization

Optimization means finding a maximum or minimum value of a real-valuedfunction of one or several variables. Constrained optimization means thatthe acceptable solutions must satisfy some additional restrictions, such asbeing nonnegative. Even if we know equations that optimal points mustsatisfy, solving these equations is often difficult and usually cannot be donealgebraically. In this chapter we sketch the conditions that must hold inorder for a point to be an optimum point, and then use those conditionsto motivate iterative algorithms for finding the optimum points. We shallconsider only minimization problems, since any maximization problem canbe converted into a minimization problem by changing the sign of thefunction involved.

16.1 Functions of a Single Real Variable

If f(x) is a continuous, real-valued function of a real variable x and we want to find an x for which the function takes on its minimum value, then we need only examine those places where the derivative, f'(x), is zero, and those places where f'(x) does not exist; of course, without further assumptions, there is no guarantee that a minimum exists. Therefore, if f(x) is differentiable at all x, and if its minimum value occurs at x^*, then f'(x^*) = 0. If the problem is a constrained minimization, that is, if the allowable x lie within some interval, say, [a, b], then we must also examine the end-points, x = a and x = b. If the constrained minimum occurs at x^* = a and f'(a) exists, then f'(a) need not be zero; however, we must have f'(a) ≥ 0, since, if f'(a) < 0, we could select x = c slightly to the right of x = a with f(c) < f(a). Similarly, if the minimum occurs at x = b, and f'(b) exists, we must have f'(b) ≤ 0. We can combine these end-point conditions by saying that if the minimum occurs at one of the


two end-points, moving away from the minimizing point into the interval [a, b] cannot result in the function growing smaller. For functions of several variables similar conditions hold, involving the partial derivatives of the function.

16.2 Functions of Several Real Variables

Suppose, from now on, that f(x) = f(x_1,...,x_N) is a continuous, real-valued function of the N real variables x_1,...,x_N and that x = (x_1,...,x_N)^T is the column vector of unknowns, lying in the N-dimensional space R^N. When the problem is to find a minimum (or a maximum) of f(x), we call f(x) the objective function. As in the case of one variable, without additional assumptions, there is no guarantee that a minimum (or a maximum) exists.

16.2.1 Cauchy’s Inequality for the Dot Product

For any two vectors v and w in R^N the dot product is defined to be

v · w = Σ_{n=1}^N v_n w_n.

Cauchy’s inequality tells us that |v ·w| ≤ ||v||||w||, with equality if and onlyif  w = αv for some real number α. In the multi-variable case we speak of the derivative of a function at a point, in the direction of a given vector;these are the directional derivatives  and their definition involves the dotproduct.

16.2.2 Directional Derivatives

If ∂f/∂x_n(z), the partial derivative of f with respect to the variable x_n, at the point z, is defined for all z, and u = (u_1,...,u_N)^T is a vector of length one, that is, its norm,

||u|| = √(u_1² + ... + u_N²),

equals one, then the derivative of f(x), at a point x = z, in the direction of u, is

(∂f/∂x_1)(z)u_1 + ... + (∂f/∂x_N)(z)u_N.

Notice that this directional derivative is the dot product of u with the gradient of f(x) at x = z, defined by

∇f(z) = (∂f/∂x_1(z),...,∂f/∂x_N(z))^T.


According to Cauchy’s inequality, the dot product ∇f (z) · u will take onits maximum value when u is a positive multiple of  ∇f (z), and therefore,its minimum value when u is a negative multiple of  ∇f (z). Consequently,the gradient of  f (x) at x = z points in the direction, from x = z, of 

the greatest increase in the function f (x). This suggests that, if we aretrying to minimize f (x), and we are currently at x = z, we should considermoving in the direction of −∇f (z); this leads to Cauchy’s iterative methodof  steepest descent , which we shall discuss in more detail later.

If the minimum value of f(x) occurs at x = x^*, then either all the directional derivatives are zero at x = x^*, in which case ∇f(x^*) = 0, or at least one directional derivative does not exist. But, what happens when the problem is a constrained minimization?

16.2.3 Constrained Minimization

Unlike the single-variable case, in which constraining the variable simply meant requiring that it lie within some interval, in the multi-variable case constraints can take many forms. For example, we can require that each of the entries x_n be nonnegative, or that each x_n lie within an interval [a_n, b_n] that depends on n, or that the norm of x, defined by

||x|| = √(x_1² + ... + x_N²),

which measures the distance from x to the origin, does not exceed some bound. In fact, for any set C in N-dimensional space, we can pose the problem of minimizing f(x), subject to the restriction that x be a member of the set C. In place of end-points, we have what are called boundary-points of C, which are those points in C that are not entirely surrounded by other points in C. For example, in the one-dimensional case, the points x = a and x = b are the boundary-points of the set C = [a, b]. If C = R^N_+ is the subset of N-dimensional space consisting of all the vectors x whose entries are nonnegative, then the boundary-points of C are all nonnegative

is the subset of  N -dimensional space consisting of all the vectors x whoseentries are nonnegative, then the boundary-points of C  are all nonnegative

vectors x having at least one zero entry.Suppose that C  is arbitrary in RN  and the point x = x∗ is the solution

to the problem of minimizing f (x) over all x in the set C . Assume alsothat all the directional derivatives of  f (x) exist at each x. If  x∗ is not aboundary-point of  C , then all the directional derivatives of  f (x), at thepoint x = x∗, must be nonnegative, in which case they must all be zero,so that we must have ∇f (z) = 0. On the other hand, speaking somewhatloosely, if  x∗ is a boundary-point of  C , then it is necessary only that thedirectional derivatives of f (x), at the point x = x∗, in directions that pointback into the set C , be nonnegative.

16.2.4 An Example

To illustrate these concepts, consider the problem of minimizing the func-tion of two variables, f (x1, x2) = x1 + 3x2, subject to the constraint that

Page 160: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 160/311

146 CHAPTER 16. ITERATIVE OPTIMIZATION 

x = (x1, x2) lie within the unit ball C  = {x = (x1, x2)|x21 + x22 ≤ 1}.With the help of simple diagrams we discover that the minimizing pointx∗ = (x∗1, x∗2) is a boundary-point of C , and that the line x1+3x2 = x∗1+3x∗2is tangent to the unit circle at x∗. The gradient of  f (x), at x = z, is

∇f (z) = (1, 3)T , for all z, and is perpendicular to this tangent line. But,since the point x∗ lies on the unit circle, the vector (x∗1, x∗2)T  is also per-pendicular to the line tangent to the circle at x∗. Consequently, we knowthat (x∗1, x∗2)T  = α(1, 3)T , for some real α. From x21 + x22 = 1, it followsthat |α| =

√ 10. This gives us two choices for x∗: either x∗ = (

√ 10, 3

√ 10),

or x∗ = (−√ 10, −3

√ 10). Evaluating f (x) at both points reveals that f (x)

attains its maximum at the first, and its minimum at the second.Every direction vector u can be written in the form u = β (1, 3)T  +

γ (−3, 1)T , for some β  and γ . The directional derivative of f (x), at x = x∗,in any direction that points from x = x∗ back into C , must be nonnega-tive. Such directions must have a nonnegative dot product with the vector(−x∗1, −x∗2)T , which tells us that

0 ≤ β (1, 3)T  · (−x∗1, −x∗2)T  + γ (−3, 1)T  · (−x∗1, x∗2)T ,

or

0 ≤ (3γ − β )x∗1 + (−3β − γ )x∗2.

Consequently, the gradient (1, 3)T  must have a nonnegative dot productwith every direction vector u that has a nonnegative dot product with(−x∗1, −x∗2)T . For the dot product of (1, 3)T  with any u to be nonnegativewe need β ≥ 0. So we conclude that β ≥ 0 for all β  and γ  for which

0 ≤ (3γ − β )x∗1 + (−3β − γ )x∗2.

Saying this another way, if  β < 0 then

(3γ − β )x∗1 + (−3β − γ )x∗2 < 0,

for all γ . Taking the limit, as β → 0 from the left, it follows that

3γx∗1 − γx∗2 ≤ 0,

for all γ . The only way this can happen is if 3x∗1 − x∗2 = 0. Therefore,our optimum point must satisfy the equation x∗2 = 3x∗1, which is what wefound previously.

We have just seen the conditions necessary for x∗ to minimize f (x),subject to constraints, be used to determine the point x∗ algebraically.In more complicated problems we will not be able to solve for x∗ merely

by performing simple algebra. But we may still be able to find x∗ usingiterative optimization methods.

Page 161: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 161/311

16.3. GRADIENT DESCENT OPTIMIZATION  147

16.3 Gradient Descent Optimization

Suppose that we want to minimize f (x), over all x, without constraints.Begin with an arbitrary initial guess, x = x0. Having proceeded to xk, weshow how to move to xk+1. At the point x = xk, the direction of greatestrate of decrease of f (x) is u = −∇f (xk). Therefore, it makes sense to movefrom xk in the direction of −∇f (xk), and to continue in that direction untilthe function stops decreasing. In other words, we let

xk+1 = xk − αk∇f (xk),

where αk ≥ 0 is the step size , determined by the condition

f (xk − αk∇f (xk)) ≤ f (xk − α∇f (xk)),

for all α

≥0. This iterative procedure is Cauchy’s steepest descent method.

To establish the convergence of this algorithm to a solution requires ad-ditional restrictions on the function f ; we shall not consider these issuesfurther. Our purpose here is merely to illustrate an iterative minimizationphilosophy that we shall recall in various contexts.

If the problem is a constrained minimization, then we must proceedmore carefully. One method, known as interior-point iteration, begins withx0 within the constraint set C  and each subsequent step is designed to pro-duce another member of  C ; if the algorithm converges, the limit is thenguaranteed to be in C . For example, if  C  = RN 

+ , the nonnegative conein RN , we could modify the steepest descent method so that, first, x0 isa nonnegative vector, and second, the step from xk in C  is restricted sothat we stop before xk+1 ceases to be nonnegative. A somewhat differentmodification of the steepest descent method would be to take the full stepfrom xk to xk+1, but then to take as the true xk+1 that vector in C  nearestto what would have been xk+1, according to the original steepest descentalgorithm; this new iterative scheme is the projected steepest descent  al-gorithm. It is not necessary, of course, that every intermediate vector xk

be in C ; all we want is that the limit be in C . However, in applications,iterative methods must always be stopped before reaching their limit point,so, if we must have a member of  C  for our (approximate) answer, then wewould need xk in C  when we stop the iteration.

16.4 The Newton-Raphson Approach

The Newton-Raphson approach to minimizing a real-valued function f  :RJ  → R involves finding x∗ such that ∇f (x∗) = 0.

Page 162: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 162/311

148 CHAPTER 16. ITERATIVE OPTIMIZATION 

16.4.1 Functions of a Single Variable

We begin with the problem of finding a root of a function g : R → R. If x0

is not a root, compute the line tangent to the graph of  g at x = x0 and let

x1

be the point at which this line intersects the horizontal axis; that is,

x1 = x0 − g(x0)/g(x0).

Continuing in this fashion, we have

xk+1 = xk − g(xk)/g(xk).

This is the Newton-Raphson algorithm  for finding roots. Convergence,when it occurs, is more rapid than gradient descent, but requires thatx0 be sufficiently close to the solution.

Now suppose that f  : R → R is a real-valued function that we wishto minimize by solving f (x) = 0. Letting g(x) = f (x) and applying theNewton-Raphson algorithm to g(x) gives the iterative step

xk+1 = xk − f (xk)/f (xk).

This is the Newton-Raphson optimization algorithm. Now we extend theseresults to functions of several variables.

16.4.2 Functions of Several Variables

The Newton-Raphson algorithm for finding roots of functions g : RJ  → RJ 

has the iterative step

xk+1 = xk − [J (g)(xk)]−1g(xk),

where J (g)(x) is the Jacobian matrix of first partial derivatives, ∂gm∂xj

(xk),

for g(x) = (g1(x),...,gJ (x))T .

To minimize a function f  : RJ  → R, we let g(x) = ∇f (x) and find aroot of  g. Then the Newton-Raphson iterative step becomes

xk+1 = xk − [∇2f (xk)]−1∇f (xk),

where ∇2f (x) = J (g)(x) is the Hessian matrix of second partial derivativesof  f .

16.5 Other Approaches

Choosing the negative of the gradient as the next direction makes goodsense in minimization problems, but it is not the only, or even the best, wayto proceed. For least squares problems the method of conjugate directions

is a popular choice (see [33]). Other modifications of the gradient can alsobe used, as, for example, in the EMML algorithm.

Page 163: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 163/311

Chapter 17

Conjugate-DirectionMethods in Optimization

17.1 Backpropagation-of-Error Methods

In this chapter we consider iterative algorithms for solving systems of linearequations. If  A is an invertible real matrix, B−1 is an approximation of A−1, and we wish to find x for which Ax = b, we could proceed as follows:having obtained an approximate solution xk, view B−1(b − Axk) as anapproximation of the error x − xk = A−1(b − Axk) and take as the nextapproximation

xk+1 = xk + B−1(b − Axk).

For this to be a practical method, we need B to be a good approximationof  A and also to be easily inverted. For a more general I  by J  matrix A,

the system Ax = b need not have a solution, in which case we may wish tocalculate a least-squares solution. The Landweber algorithm, with iterativestep

xk+1 = xk + γA†(b − Axk),

converges to a least-squares solution for any γ  in the interval (0, 2/L),where L is the largest eigenvalue of  Q = A†A. Algorithms of this typeare sometimes called backpropagation-of-error  methods because the error istransformed by A† back into the space of  J -dimensional vectors. Through-out this chapter we shall assume that Q is invertible so that the systemAx = b has a unique least-squares solution.

Finding the least-squares solution of a possibly inconsistent system of linear equations Ax = b is equivalent to minimizing the quadratic functionf (x) = 1

2

||Ax

−b

||2 and so can be viewed within the framework of opti-

mization. Iterative-optimization methods can then be used to provide, orat least suggest, algorithms for obtaining the least-squares solution.

149

Page 164: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 164/311

150CHAPTER 17. CONJUGATE-DIRECTION METHODS IN OPTIMIZATION 

17.2 Iterative Minimization

Iterative methods for minimizing a real-valued function f (x) over the vectorvariable x usually take the following form: having calculated xk a new

direction vector dk is selected, an appropriate scalar αk > 0 is determinedand the next member of the iterative sequence is given by

xk+1 = xk + αkdk. (17.1)

Ideally, one would choose the αk to be the value of α for which the functionf (xk + αdk) is minimized. It is assumed that the direction dk is a descent direction ; that is, for small positive α the function f (xk + αdk) is strictlydecreasing. Finding the optimal value of  α at each step of the iteration isdifficult, if not impossible, in most cases, and approximate methods, usingline searches, are commonly used.

Exercise 17.1 Differentiate the function  f (xk + αdk) with respect to the 

variable  α to show that 

∇f (xk+1) · dk = 0. (17.2)

Since the gradient ∇f (xk+1) is orthogonal to the previous directionvector dk and also because −∇f (x) is the direction of greatest decreaseof  f (x), the choice of  dk+1 = −∇f (xk+1) as the next direction vector isa reasonable one. With this choice we obtain Cauchy’s steepest descent method  [92]:

xk+1 = xk − αk∇f (xk).

The steepest descent method need not converge in general and even whenit does, it can do so slowly, suggesting that there may be better choicesfor the direction vectors. For example, the Newton-Raphson method [100]

employs the following iteration:

xk+1 = xk − ∇2f (xk)−1∇f (xk),

where ∇2f (x) is the Hessian matrix for f (x) at x. To investigate furtherthe issues associated with the selection of the direction vectors, we considerthe more tractable special case of quadratic optimization.

17.3 Quadratic Optimization

Let A be an arbitrary real I  by J  matrix. The linear system of equationsAx = b need not have any solutions, and we may wish to find the least-squares solution x = x that minimizes

f (x) =1

2||b − Ax||2. (17.3)

Page 165: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 165/311

17.3. QUADRATIC OPTIMIZATION  151

The vector b can be written

b = Ax + w,

where A†w = 0 and the least squares solution is an exact solution of thelinear system Qx = c, with Q = A†A and c = A†b.

We consider now the iterative scheme described by Equation (17.1) forf (x) as in Equation (17.3). For this f (x) the gradient becomes

∇f (x) = Qx − c.

The optimal αk for the iteration can be obtained in closed form.

Exercise 17.2 Show that the optimal  αk is 

αk =

rk

·dk

dk · Qdk , (17.4)

where  rk = c − Qxk.

Exercise 17.3 Let  ||x||2Q = x · Qx denote the square of the  Q-norm of  x.Show that 

||x − xk||2Q − ||x − xk+1||2Q = (rk · dk)2/dk · Qdk ≥ 0

 for any direction vectors dk.

If the sequence of direction vectors {dk} is completely general, the iter-ative sequence need not converge. However, if the set of direction vectorsis finite and spans RJ  and we employ them cyclically, convergence follows.

Theorem 17.1 Let {d1,...,dJ } be any finite set whose span is all of  RJ .Let  αk be chosen according to Equation (17.4). Then, for  k = 0, 1,...,

 j = k(mod J ) + 1, and any  x0, the sequence defined by 

xk+1 = xk + αkdj

converges to the least squares solution.

Proof: The sequence {||x − xk||2Q} is decreasing and, therefore, the se-quence {(rk · dj)2/dj · Qdj must converge to zero. Therefore, the vectors

Page 166: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 166/311

152CHAPTER 17. CONJUGATE-DIRECTION METHODS IN OPTIMIZATION 

xk are bounded, and for each j = 1,...,J , the subsequences {xmJ +j , m =0, 1,...} have cluster points, say x∗,j with

x∗,j+1

= x∗,j

+

(c

−Qx∗,j)

·dj

dj · Qdj dj

.

Since

rmJ +j · dj → 0,

it follows that, for each j = 1,...,J ,

(c − Qx∗,j) · dj = 0.

Therefore,

x∗,1 = ... = x∗,J  = x∗

with Qx∗ = c. Consequently, x∗ is the least squares solution and the

sequence {||x∗−xk||Q} is decreasing. But a subsequence converges to zero;therefore, {||x∗ − xk||Q} → 0. This completes the proof.

There is an interesting corollary to this theorem that pertains to a mod-ified version of the ART algorithm. For k = 0, 1,... and m = k(mod M ) + 1and with the rows of  A normalized to have length one, the ART iterativestep is

xk+1 = xk + (bi − (Axk)i)ai,

where ai is the ith column of  A†. When Ax = b has no solutions, theART algorithm does not converge to the least-squares solution; rather,it exhibits subsequential convergence to a limit cycle. However, using theprevious theorem, we can show that the following modification of the ART,which we shall call the least squares ART (LS-ART), converges to the least-squares solution for every x0:

xk+1 = xk +rk · ai

ai · Qaiai.

In the quadratic case the steepest descent iteration has the form

xk+1 = xk +rk · rk

rk · Qrk(c − Qxk).

We have the following result.

Theorem 17.2 The steepest descent method converges to the least-squares solution.

Page 167: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 167/311

17.4. CONJUGATE DIRECTIONS  153

Proof: As in the proof of the previous theorem, we have

||x − xk||2Q − ||x − xk+1||2Q = (rk · dk)2/dk · Qdk ≥ 0,

where now the direction vectors are dk = rk. So, the sequence {||x−xk||2Q}is decreasing, and therefore the sequence {(rk ·rk)2/rk ·Qrk} must convergeto zero. The sequence {xk} is bounded; let x∗ be a cluster point. It followsthat c − Qx∗ = 0, so that x∗ is the least-squares solution x. The rest of the proof follows as in the proof of the previous theorem.

17.4 Conjugate Directions

From Equation (17.2) we have

(c − Qxk+1) · dk = 0,

which can be expressed as

(x − xk+1) · Qdk = (x − xk+1)†Qdk = 0.

So, the least-squares solution that we seek lies in a direction from xk+1 thatis Q-orthogonal (or conjugate ) to dk. This suggests that we can do betterthan steepest descent if we take the next direction to be Q-orthogonal tothe previous one, rather than just orthogonal. This leads us to conjugate direction methods .

Exercise 17.4 Say that the set { p0,...,pn−1} is a conjugate set for  RJ  if 

 pi

·Qpj

= 0 for i = j. Prove that a conjugate set that does not contain zerois linearly independent. Show that if  pn = 0 for  n = 0,...,J − 1, then the least-squares vector x can be written as 

x = a0 p0 + ... + aJ −1 pJ −1,

with  aj = c · pj/pj · Qpj  for each j.

Therefore, once we have a conjugate basis, computing the least squaressolution is trivial. Generating a conjugate basis can obviously be done usingthe standard Gram-Schmidt approach. This is not a practical solution inmost applications, however. If we base the construction on the set of powers

Qn p0, we have a much more efficient mechanism for generating a conjugatebasis, namely a three-term recursion formula [92].

Page 168: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 168/311

154CHAPTER 17. CONJUGATE-DIRECTION METHODS IN OPTIMIZATION 

Theorem 17.3 Let p0 = 0 be arbitrary. Let  p1 be given by 

 p1 = Qp0 − Qp0 · Qp0

 p0

·Qp0

p0

and, for  n ≥ 1, let  pn+1 be given by 

 pn+1 = Qpn − Qpn · Qpn

 pn · Qpnpn − Qpn · Qpn−1

 pn−1 · Qpn−1 pn−1.

Then, the set  { p0,...,pJ −1} is a conjugate set for  RJ . If  pn = 0 for each n, then the set is a conjugate basis for  RJ .

Proof: We consider the induction step of the proof. Assume that { p0,...,pn}is a Q-orthogonal set of vectors; we then show that { p0,...,pn+1} is also,provided that n ≤ J − 2. It is clear that

 pn+1 · Qpn = pn+1 · Qpn−1 = 0.

For j ≤ n − 2, we have

 pn+1 · Qpj = pj · Qpn+1 = pj · Q2 pn − apj · Qpn − bpj · Qpn−1,

for constants a and b. The second and third terms on the right side arethen zero because of the induction hypothesis. The first term is also zerosince

 pj · Q2 pn = (Qpj) · Qpn = 0

because Qpj is in the span of { p0,...,pj+1}, and so is Q-orthogonal to pn.

17.5 The Conjugate Gradient Method

The conjugate gradient method  (CGM) combines the use of the negativegradient directions from the steepest descent method with the use of aconjugate basis of directions. Since, in the quadratic case, we have

−∇f (xk) = rk = (c − Qxk),

the CGM constructs a conjugate basis of directions from the residuals rk.The iterative step for the CGM is the following:

xn+1 = xn +rn · pn

 pn · Qpn pn.

As before, there is an efficient recursion formula that provides the nextdirection: let p0 = r0 = (c − Qx0) and

 pn+1 = rn+1 − rn+1 · Qpn

 pn · Qpnpn.

Page 169: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 169/311

17.5. THE CONJUGATE GRADIENT METHOD  155

Since the αn is the optimal choice and

rn+1 = −∇f (xn+1),

we have, according to Equation (17.2),

rn+1 · pn = 0.

Consequently, if pn+1 = 0 then rn+1 = 0 also, which tells us that Qxn+1 =c. An induction proof similar to the one used to prove Theorem 17.3establishes that the set { p0,...,pJ −1} is a conjugate set [92]. In theory theCGM converges to the least squares solution in finitely many steps. Inpractice, the CGM can be employed as a fully iterative method by cyclingback through the previously used directions.

The convergence rate of the CGM depends on the condition number of the matrix Q, which is the ratio of its largest to its smallest eigenvalues.When the condition number is much greater than one convergence can be

accelerated by preconditioning  the matrix Q; this means replacing Q withP −1Q, for some approximate inverse P −1 of  Q (see [4]).

There are versions of the CGM for the minimization of nonquadraticfunctions. In the quadratic case the next conjugate direction pn+1 is builtfrom the residual rn and pn. Since, in that case, rn = −∇f (xn), thissuggests that in the nonquadratic case we build pn+1 from −∇f (xn) and

 pn. This leads to the Fletcher-Reeves method. Other similar algorithms,such as the Polak-Ribiere and the Hestenes-Stiefel methods, perform betteron certain problems [100].

Page 170: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 170/311

156CHAPTER 17. CONJUGATE-DIRECTION METHODS IN OPTIMIZATION 

Page 171: Iterative Algorithms in Inverse Problems

8/22/2019 Iterative Algorithms in Inverse Problems

http://slidepdf.com/reader/full/iterative-algorithms-in-inverse-problems 171/311

Chapter 18

Convex Sets and ConvexFunctions

In this chapter we consider several algorithms pertaining to convex sets andconvex functions, whose convergence is a consequence of the KM theorem.

18.1 Optimizing Functions of a Single RealVariable

Let f : R → R be a differentiable function. From the Mean-Value Theorem we know that

f(b) = f(a) + f′(c)(b − a),

for some c between a and b. If there is a constant L with |f′(x)| ≤ L for all x, that is, the derivative is bounded, then we have

|f(b) − f(a)| ≤ L|b − a|, (18.1)

for all a and b; functions that satisfy Equation (18.1) are said to be L-Lipschitz.

Suppose g : R → R is differentiable and attains its minimum value. We want to minimize the function g(x). Solving g′(x) = 0 to find the optimal x = x∗ may not be easy, so we may turn to an iterative algorithm for finding roots of g′(x), or one that minimizes g(x) directly. In the latter case, we may consider a steepest descent algorithm of the form

xk+1 = xk − γg′(xk),

for some γ > 0. We denote by T the operator

T x = x − γg′(x).


Then, using g′(x∗) = 0, we find that

|x∗ − xk+1| = |T x∗ − T xk|.

We would like to know if there are choices for γ that make T an av operator. For functions g(x) that are convex, the answer is yes.

18.1.1 The Convex Case

The function g(x) is said to be convex if, for each pair of distinct real numbers a and b and for every α in the interval (0, 1), we have

g((1 − α)a + αb) ≤ (1 − α)g(a) + αg(b).

If g(x) is a differentiable function, then convexity can be expressed in terms of properties of the derivative, g′(x).

Theorem 18.1 For the differentiable function g(x), the following are equivalent:
1) g(x) is convex;
2) for all a and b we have

g(b) ≥ g(a) + g′(a)(b − a); (18.2)

3) the derivative, g′(x), is an increasing function, or, equivalently,

(g′(b) − g′(a))(b − a) ≥ 0, (18.3)

for all a and b.

Proof of the Theorem: Assume that g(x) is convex. Then, for any a and b and α in (0, 1), we have

g(a + α(b − a)) = g((1 − α)a + αb) ≤ (1 − α)g(a) + αg(b).

Then,

[g(a + α(b − a)) − g(a)]/[α(b − a)] ≤ [g(b) − g(a)]/[b − a].

The limit on the left, as α → 0, is g′(a). It follows that

g′(a) ≤ [g(b) − g(a)]/[b − a],

which is Inequality (18.2).

Assume now that Inequality (18.2) holds, for all a and b. Therefore, we also have

g(a) − g(b) ≥ g′(b)(a − b),


or

g(a) − g(b) ≥ −g′(b)(b − a). (18.4)

Adding Inequalities (18.2) and (18.4), we obtain

0 ≥ (g′(a) − g′(b))(b − a),

from which we easily conclude that g′(x) is increasing.

Finally, assume that g′(x) is an increasing function, so that Inequality (18.3) holds. We show that g(x) is convex. Let a < b and let f(α) be defined by

f (α) = [(1 − α)g(a) + αg(b)] − g((1 − α)a + αb).

Then f (0) = f (1) = 0, and

f′(α) = g(b) − g(a) − g′((1 − α)a + αb)(b − a). (18.5)

If f(α) < 0 for some α, then there must be a minimum at some α = α̂ in (0, 1) with f′(α̂) = 0. But, if f(α) had a relative minimum, then f′(α) would be increasing nearby. We conclude by showing that the function

g′((1 − α)a + αb)(b − a)

is an increasing function of α. To see this, note that, for β > α,

(β − α)[g′((1 − β)a + βb) − g′((1 − α)a + αb)](b − a)

= [g′((1 − β)a + βb) − g′((1 − α)a + αb)][((1 − β)a + βb) − ((1 − α)a + αb)],

which is non-negative, according to Inequality (18.3). It follows that f′(α) is a decreasing function of α, so f(α) cannot have a relative minimum. This concludes the proof.

Theorem 18.2 If g(x) is twice differentiable and g′′(x) ≥ 0 for all x, then g(x) is convex.

Proof: We have g′′(x) ≥ 0 for all x, so that

f′′(α) = −g′′((1 − α)a + αb)(b − a)² ≤ 0,

where f(α) is as in the proof of the previous theorem. Therefore f(α) cannot have a relative minimum. This completes the proof.

Suppose that g(x) is convex and the function f(x) = g′(x) is L-Lipschitz. If g(x) is twice differentiable, this would be the case if

0 ≤ g′′(x) ≤ L,

for all x. As we shall see, if γ is in the interval (0, 2/L), then T is an av operator and the iterative sequence converges to a minimizer of g(x). In this regard, we have the following result.


Theorem 18.3 Let h(x) be convex and differentiable and its derivative, h′(x), non-expansive, that is,

|h′(b) − h′(a)| ≤ |b − a|,

for all a and b. Then h′(x) is firmly non-expansive, which means that

(h′(b) − h′(a))(b − a) ≥ (h′(b) − h′(a))².

Proof: Since h(x) is convex and differentiable, the derivative, h′(x), must be increasing. Therefore, if b > a, then |b − a| = b − a and

|h′(b) − h′(a)| = h′(b) − h′(a),

so that (h′(b) − h′(a))(b − a) = |h′(b) − h′(a)| |b − a| ≥ (h′(b) − h′(a))², by the non-expansivity of h′(x).

If g(x) is convex and f(x) = g′(x) is L-Lipschitz, then (1/L)g′(x) is ne, so that (1/L)g′(x) is fne and g′(x) is (1/L)-ism. Then, for γ > 0, γg′(x) is (1/γL)-ism, which tells us that the operator

T x = x − γg′(x)

is av whenever 0 < γ < 2/L. It follows from the KM Theorem that the iterative sequence xk+1 = T xk = xk − γg′(xk) converges to a minimizer of g(x).
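As a small illustration (not taken from the text), here is a sketch of this iteration for one particular convex function; the choice g(x) = (1 + x²)^{1/2}, whose derivative is 1-Lipschitz (so L = 1), and the value of γ are assumptions made for the example.

import math

def steepest_descent_1d(gprime, x0, gamma, n_iters=100):
    # x^{k+1} = T x^k = x^k - gamma * g'(x^k)
    x = x0
    for _ in range(n_iters):
        x = x - gamma * gprime(x)
    return x

# Assumed example: g(x) = sqrt(1 + x^2) is convex, g'(x) = x / sqrt(1 + x^2)
# is 1-Lipschitz (L = 1), and the minimizer is x = 0.  Any gamma in (0, 2/L)
# should work; values outside that interval need not.
gprime = lambda x: x / math.sqrt(1.0 + x * x)
print(steepest_descent_1d(gprime, x0=5.0, gamma=1.5))   # close to 0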

In the next section we extend these results to functions of several variables.

18.2 Optimizing Functions of Several Real Variables

Let f : RJ → R be a real-valued function of J real variables. The function f(x) is said to be differentiable at the point x0 if the partial derivatives, ∂f/∂xj(x0), exist for j = 1,...,J and

lim_{h→0} (1/||h||) [f(x0 + h) − f(x0) − ⟨∇f(x0), h⟩] = 0.

It can be shown that, if f is differentiable at x = x0, then f is continuous there as well [66].

Let f : RJ → R be a differentiable function. From the Mean-Value Theorem ([66], p. 41) we know that, for any two points a and b, there is α in (0, 1) such that

f(b) = f(a) + ⟨∇f((1 − α)a + αb), b − a⟩.


If there is a constant L with ||∇f(x)|| ≤ L for all x, that is, the gradient is bounded in norm, then we have

|f(b) − f(a)| ≤ L||b − a||, (18.6)

for all a and b; functions that satisfy Equation (18.6) are said to be L-Lipschitz .

In addition to real-valued functions f : RJ → R, we shall also be interested in functions F : RJ → RJ, such as F(x) = ∇f(x), whose range is RJ, not R. We say that F : RJ → RJ is L-Lipschitz if there is L > 0 such that

||F(b) − F(a)|| ≤ L||b − a||,

for all a and b.

Suppose g : RJ → R is differentiable and attains its minimum value. We want to minimize the function g(x). Solving ∇g(x) = 0 to find the optimal x = x∗ may not be easy, so we may turn to an iterative algorithm for finding roots of ∇g(x), or one that minimizes g(x) directly. In the latter case, we may again consider a steepest descent algorithm of the form

xk+1 = xk − γ ∇g(xk),

for some γ > 0. We denote by T  the operator

T x = x − γ ∇g(x).

Then, using ∇g(x∗) = 0, we find that

||x∗ − xk+1|| = ||T x∗ − T xk||.

We would like to know if there are choices for γ that make T  an av operator.

As in the case of functions of a single variable, for functions g(x) that are convex, the answer is yes.

18.2.1 The Convex Case

The function g(x) : RJ → R is said to be convex if, for each pair of distinct vectors a and b and for every α in the interval (0, 1) we have

g((1 − α)a + αb) ≤ (1 − α)g(a) + αg(b).

If g(x) is a differentiable function, then convexity can be expressed in terms of properties of the derivative, ∇g(x).

Theorem 18.4 For the differentiable function g(x), the following are equivalent:
1) g(x) is convex;


2) for all  a and  b we have 

g(b) ≥ g(a) + ⟨∇g(a), b − a⟩; (18.7)

3) for all  a and  b we have 

⟨∇g(b) − ∇g(a), b − a⟩ ≥ 0. (18.8)

Proof: Assume that g(x) is convex. Then, for any a and b and α in (0, 1), we have

g(a + α(b − a)) = g((1 − α)a + αb) ≤ (1 − α)g(a) + αg(b).

Then,

[g(a + α(b − a)) − g(a)]/α ≤ g(b) − g(a).

The limit on the left, as α → 0, is

⟨∇g(a), b − a⟩.

It follows that

⟨∇g(a), b − a⟩ ≤ g(b) − g(a),

which is Inequality (18.7).

Assume now that Inequality (18.7) holds, for all a and b. Therefore, we also have

g(a) − g(b) ≥ ⟨∇g(b), a − b⟩,

or

g(a) − g(b) ≥ −⟨∇g(b), b − a⟩. (18.9)

Adding Inequalities (18.7) and (18.9), we obtain Inequality (18.8).

Finally, assume that Inequality (18.8) holds. We show that g(x) is convex. Let a and b be distinct vectors and let f(α) be defined by

f (α) = [(1 − α)g(a) + αg(b)] − g((1 − α)a + αb).

Then f (0) = f (1) = 0, and

f′(α) = g(b) − g(a) − ⟨∇g((1 − α)a + αb), b − a⟩. (18.10)

If f(α) < 0 for some α, then there must be a minimum at some α = α̂ in (0, 1) with f′(α̂) = 0. But, if f(α) had a relative minimum, then f′(α) would be increasing nearby. We conclude by showing that the function

⟨∇g((1 − α)a + αb), b − a⟩


is an increasing function of  α. To see this, note that, for β > α,

(β − α)⟨∇g((1 − β)a + βb) − ∇g((1 − α)a + αb), b − a⟩

= ⟨∇g((1 − β)a + βb) − ∇g((1 − α)a + αb), ((1 − β)a + βb) − ((1 − α)a + αb)⟩,

which is non-negative, according to Inequality (18.8). It follows that f′(α) is a decreasing function of α, so f(α) cannot have a relative minimum. This concludes the proof.

As in the case of functions of a single variable, we can say more when the function g(x) is twice differentiable.

Theorem 18.5 If g(x) is twice differentiable and the second derivative matrix is non-negative definite, that is, ∇²g(x) ≥ 0 for all x, then g(x) is convex.

Proof: Now we have

f′′(α) = −(b − a)ᵀ∇²g((1 − α)a + αb)(b − a) ≤ 0,

where f(α) is as in the proof of the previous theorem. Therefore f(α) cannot have a relative minimum. This completes the proof.

Suppose that g(x) : RJ → R is convex and the function F(x) = ∇g(x) is L-Lipschitz. As we shall see, if γ is in the interval (0, 2/L), then the operator T = I − γF defined by

T x = x − γ ∇g(x),

is an av operator and the iterative sequence converges to a minimizer of g(x). In this regard, we have the following analog of Theorem 18.3.

Theorem 18.6 Let  h(x) be convex and differentiable and its derivative,∇h(x), non-expansive, that is,

||∇h(b) − ∇h(a)|| ≤ ||b − a||,

 for all  a and  b. Then ∇h(x) is firmly non-expansive, which means that 

⟨∇h(b) − ∇h(a), b − a⟩ ≥ ||∇h(b) − ∇h(a)||².

Unlike the proof of Theorem 18.3, the proof of this theorem is not trivial. In [69] Golshtein and Tretyakov prove the following theorem, from which Theorem 18.6 follows immediately:


Theorem 18.7 Let g : RJ → R be convex and differentiable. The following are equivalent:

||∇g(x) − ∇g(y)|| ≤ ||x − y||; (18.11)

g(x) ≥ g(y) + ⟨∇g(y), x − y⟩ + (1/2)||∇g(x) − ∇g(y)||²; (18.12)

and

⟨∇g(x) − ∇g(y), x − y⟩ ≥ ||∇g(x) − ∇g(y)||². (18.13)

The only difficult step in the proof is showing that Inequality (18.11) implies Inequality (18.12). To prove this part, let x(t) = (1 − t)y + tx, for 0 ≤ t ≤ 1. Then

(d/dt) g(x(t)) = ⟨∇g(x(t)), x − y⟩,

so that

∫₀¹ ⟨∇g(x(t)) − ∇g(y), x − y⟩ dt = g(x) − g(y) − ⟨∇g(y), x − y⟩.

Therefore,

g(x) − g(y) − ⟨∇g(y), x − y⟩ ≤ ∫₀¹ ||∇g(x(t)) − ∇g(y)|| ||x − y|| dt

≤ ∫₀¹ ||x(t) − y|| ||x − y|| dt = ∫₀¹ ||t(x − y)|| ||x − y|| dt = (1/2)||x − y||²,

according to Inequality (18.11). Therefore,

g(x) ≤ g(y) + ⟨∇g(y), x − y⟩ + (1/2)||x − y||².

Now let x = y − ∇g(y), so that

g(y − ∇g(y)) ≤ g(y) − ⟨∇g(y), ∇g(y)⟩ + (1/2)||∇g(y)||².

Consequently,

g(y − ∇g(y)) ≤ g(y) − (1/2)||∇g(y)||².

Therefore,

inf g(x) ≤ g(y) − (1/2)||∇g(y)||²,

or

g(y) ≥ inf g(x) + (1/2)||∇g(y)||². (18.14)


Now fix y and define the function h(x) by

h(x) = g(x) − g(y) − ⟨∇g(y), x − y⟩.

Then h(x) is convex, differentiable, and non-negative,

∇h(x) = ∇g(x) − ∇g(y),

and h(y) = 0, so that h(x) attains its minimum at x = y. Applying Inequality (18.14) to the function h(x), with z in the role of x and x in the role of y, we find that

inf h(z) = 0 ≤ h(x) − (1/2)||∇h(x)||².

From the definition of  h(x), it follows that

0 ≤ g(x) − g(y) − ⟨∇g(y), x − y⟩ − (1/2)||∇g(x) − ∇g(y)||².

This completes the proof of the implication.

If g(x) is convex and f(x) = ∇g(x) is L-Lipschitz, then (1/L)∇g(x) is ne, so that (1/L)∇g(x) is fne and ∇g(x) is (1/L)-ism. Then, for γ > 0, γ∇g(x) is (1/γL)-ism, which tells us that the operator

T x = x − γ ∇g(x)

is av whenever 0 < γ < 2/L. It follows from the KM Theorem that the iterative sequence xk+1 = T xk = xk − γ∇g(xk) converges to a minimizer of g(x), whenever minimizers exist.
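A minimal numerical sketch of this iteration (an illustration, not part of the text): for the convex function g(x) = (1/2)||Ax − b||², the gradient ∇g(x) = Aᵀ(Ax − b) is L-Lipschitz with L the largest eigenvalue of AᵀA, and any γ in (0, 2/L) may be used. The matrix A and vector b below are assumptions made for the example.

import numpy as np

def gradient_descent(grad, x0, gamma, n_iters=500):
    # x^{k+1} = T x^k = x^k - gamma * grad g(x^k)
    x = x0.astype(float)
    for _ in range(n_iters):
        x = x - gamma * grad(x)
    return x

# Assumed example: g(x) = (1/2)||Ax - b||^2, grad g(x) = A^T(Ax - b),
# and L = largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.eigvalsh(A.T @ A).max()
grad = lambda x: A.T @ (A @ x - b)
x = gradient_descent(grad, np.zeros(5), gamma=1.0 / L)
print(np.linalg.norm(grad(x)))   # should be small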

18.3 Convex Feasibility

The convex feasibility problem (CFP) is to find a point in the non-empty intersection C of finitely many closed, convex sets C i in RJ. The successive orthogonal projections (SOP) method [72] is the following. Begin with an arbitrary x0. For k = 0, 1,..., and i = k(mod I) + 1, let

xk+1 = P ixk,

where P ix denotes the orthogonal projection of x onto the set C i. Since each of the operators P i is firmly non-expansive, the product

T  = P I P I −1 · · · P 2P 1

is averaged. Since C is not empty, T has fixed points. By the KM Theorem, the sequence {xk} converges to a member of C. It is useful to note that the limit of this sequence will not generally be the point in C closest to x0; it is if the C i are hyperplanes, however.


18.3.1 The SOP for Hyperplanes

For any x, P ix, the orthogonal projection of x onto the closed, convex set C i, is the unique member of C i for which

⟨P ix − x, y − P ix⟩ ≥ 0,

for every y in C i.

Exercise 18.1 Show that 

||y − P ix||2 + ||P ix − x||2 ≤ ||y − x||2,

 for all  x and for all  y in  C i.

When the C i are hyperplanes, we can say more.

Exercise 18.2 Show that, if  C i is a hyperplane, then 

⟨P ix − x, y − P ix⟩ = 0,

 for all  y in  C i. Use this result to show that 

||y − P ix||2 + ||P ix − x||2 = ||y − x||2,

for every y in the hyperplane C i. Hint: since both P ix and y are in C i, so is P ix + t(y − P ix), for every real t.

Let the C i be hyperplanes with C their non-empty intersection. Let c be in C.

Exercise 18.3 Show that, for  xk+1 = P ixk, where  i = k(mod I ) + 1,

||c − xk||2 − ||c − xk+1||2 = ||xk − xk+1||2. (18.15)

It follows from this exercise that the sequence {||c − xk||} is decreasing and that the sequence {||xk − xk+1||2} converges to zero. Therefore, the sequence {xk} is bounded, so has a cluster point, x∗, and the cluster point must be in C. Therefore, replacing c with x∗, we find that the sequence {||x∗ − xk||2} converges to zero, which means that {xk} converges to x∗. Summing over k on both sides of Equation (18.15), we get

||c − x0||2 − ||c − x∗||2

on the left side, while on the right side we get a quantity that does not depend on which c in C we have selected. It follows that minimizing ||c − x0||2 over c in C is equivalent to minimizing ||c − x∗||2 over c in C; the minimizer of the latter problem is clearly c = x∗. So, when the C i are hyperplanes, the SOP algorithm does converge to the member of the intersection that is closest to x0. Note that the SOP is the ART algorithm, for the case of hyperplanes.
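A sketch of the SOP/ART iteration for hyperplanes H i = {x | ⟨ai, x⟩ = bi}, using the projection P ix = x + (bi − ⟨ai, x⟩)ai/||ai||². The system below and the number of sweeps are assumptions made for the illustration.

import numpy as np

def sop_hyperplanes(A, b, x0, n_sweeps=200):
    # Cyclic orthogonal projections onto the hyperplanes (Ax)_i = b_i.
    x = x0.astype(float)
    I = A.shape[0]
    for k in range(n_sweeps * I):
        i = k % I
        a = A[i]
        x = x + (b[i] - a @ x) / (a @ a) * a    # P_i x
    return x

# Assumed consistent system: the limit is the solution closest to x0.
A = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
b = np.array([2.0, 3.0])
x = sop_hyperplanes(A, b, np.zeros(3))
print(A @ x - b)    # approximately zero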


18.3.2 The SOP for Half-Spaces

If the C i are half-spaces, that is, there is some I by J matrix A and vector b so that

C i = {x|(Ax)i ≥ bi},

then the SOP becomes the Agmon-Motzkin-Schoenberg algorithm. When the intersection is non-empty, the algorithm converges, by the KM Theorem, to a member of that intersection. When the intersection is empty, we get subsequential convergence to a limit cycle.

18.4 Optimization over a Convex Set

Suppose now that g : RJ → R is a convex, differentiable function and we want to find a minimizer of g(x) over a closed, convex set C, if such minimizers exist. We saw earlier that, if ∇g(x) is L-Lipschitz, and γ is in the interval (0, 2/L), then the operator T x = x − γ∇g(x) is averaged. Since P C, the orthogonal projection onto C, is also averaged, their product, S = P C T, is averaged. Therefore, by the KM Theorem, the sequence {xk+1 = Sxk} converges to a fixed point of S, whenever such fixed points exist.

Exercise 18.4 Show that x is a fixed point of  S  if and only if  x minimizes g(x) over  x in  C .

18.4.1 Linear Optimization over a Convex Set

Suppose we take g(x) = dT x, for some fixed vector d. Then ∇g(x) = d for all x, and ∇g(x) is L-Lipschitz for every L > 0. Therefore, the operator T x = x − γd is averaged, for any positive γ. Since P C is also averaged, the product S = P C T is averaged and the iterative sequence xk+1 = Sxk converges to a minimizer of g(x) = dT x over C, whenever minimizers exist.

For example, suppose that C is the closed, convex region in the plane bounded by the coordinate axes and the line x + y = 1. Let dT = (1, −1). The problem then is to minimize the function g(x, y) = x − y over C. Let γ = 1 and begin with x0 = (1, 1)T. Then x0 − d = (0, 2)T and x1 = P C (0, 2)T = (0, 1)T, which is the solution.
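A sketch reproducing this small example. The projection routine below, written specifically for the region bounded by the coordinate axes and the line x + y = 1, is an assumption constructed for the illustration (it falls back on the standard sort-based simplex projection when the constraint x + y ≤ 1 is active).

import numpy as np

def project_triangle(z):
    # Projection onto C = {(x, y) : x >= 0, y >= 0, x + y <= 1}.
    p = np.maximum(z, 0.0)
    if p.sum() <= 1.0:
        return p
    # Otherwise project z onto the face x + y = 1, x, y >= 0.
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(z) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(z - theta, 0.0)

d = np.array([1.0, -1.0])      # g(x, y) = x - y, so grad g = d
x = np.array([1.0, 1.0])       # x^0
gamma = 1.0
for _ in range(10):
    x = project_triangle(x - gamma * d)    # x^{k+1} = P_C(x^k - gamma d)
print(x)                       # (0, 1), as in the example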

For this algorithm to be practical, P C x must be easy to calculate. In those cases in which the set C is more complicated than in the example, other algorithms, such as the simplex algorithm, will be preferred. We consider these ideas further, when we discuss the linear programming problem.


18.5 Geometry of Convex Sets

A point x in a convex set C is said to be an extreme point of C if the set obtained by removing x from C remains convex. Said another way, x cannot be written as

x = (1 − α)y + αz,

for y, z ≠ x and α ∈ (0, 1). For example, the point x = 1 is an extreme point of the convex set C = [0, 1]. Every point on the boundary of a sphere in RJ is an extreme point of the sphere. The set of all extreme points of a convex set is denoted Ext(C).

A non-zero vector d is said to be a direction of unboundedness of a convex set C if, for all x in C and all γ ≥ 0, the vector x + γd is in C. For example, if C is the non-negative orthant in RJ, then any non-negative vector d is a direction of unboundedness.

The fundamental problem in linear programming is to minimize the function

f(x) = cT x,

over the feasible set F, that is, the convex set of all x ≥ 0 with Ax = b. In the next chapter we present an algebraic description of the extreme points of the feasible set F, in terms of basic feasible solutions, show that there are at most finitely many extreme points of F and that every member of F can be written as a convex combination of the extreme points, plus a direction of unboundedness. These results will be used to prove the basic theorems about the primal and dual linear programming problems and to describe the simplex algorithm.

18.6 Projecting onto the Intersection of Convex Sets

As we saw previously, the SOP algorithm need not converge to the point in the intersection closest to the starting point. To obtain the point closest to x0 in the intersection of the convex sets C i, we can use Dykstra's algorithm, a modification of the SOP method [62]. For simplicity, we shall discuss only the case of C = A ∩ B, the intersection of two closed, convex sets.

18.6.1 A Motivating Lemma

The following lemma will help to motivate Dykstra’s algorithm.

Lemma 18.1 If  x = c + p + q , where  c = P A(c + p) and  c = P B(c + q ),then  c = P C x.


Proof: Let d be arbitrary in C . Then

⟨c − (c + p), d − c⟩ ≥ 0,

since d is in A, and

⟨c − (c + q), d − c⟩ ≥ 0,

since d is in B. Adding the two inequalities, we get

⟨−p − q, d − c⟩ ≥ 0.

But

−p − q = c − x,

so

⟨c − x, d − c⟩ ≥ 0,

for all d in C . Therefore, c = P C x.

18.6.2 Dykstra's Algorithm

Dykstra's algorithm begins with b0 = x, p0 = q0 = 0. It involves the construction of two sequences, {an} and {bn}, both converging to c = P C x, along with two other sequences, {pn} and {qn} designed so that

an = P A(bn−1 + pn−1),

bn = P B(an + qn−1),

and

x = an + pn + qn−1 = bn + pn + qn.

Both {an} and {bn} converge to c = P C x. Usually, but not always, {pn} converges to p and {qn} converges to q, so that

x = c + p + q,

with

c = P A(c + p) = P B(c + q ).

Generally, however, {pn + qn} converges to x − c.

In [15], Bregman considers the problem of minimizing a convex function f : RJ → R over the intersection of half-spaces, that is, over the set of points x for which Ax ≥ b. His approach is a primal-dual algorithm involving the notion of projecting onto a convex set, with respect to a generalized distance constructed from f. Such generalized projections have come to be called Bregman projections. In [41], Censor and Reich extend Dykstra's algorithm to Bregman projections, and, in [16], the three show that the extended Dykstra algorithm of [41] is the natural extension of Bregman's primal-dual algorithm to the case of intersecting convex sets. We shall consider these results in more detail in a subsequent chapter.
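As a concrete illustration of Dykstra's algorithm described in Section 18.6.2 (a sketch only; the two sets used here, a half-space A and the unit ball B, and the iteration count are assumptions made for the example):

import numpy as np

def dykstra(P_A, P_B, x, n_iters=200):
    # Dykstra's algorithm for c = P_{A ∩ B} x, for closed convex A and B.
    b, p, q = x.copy(), np.zeros_like(x), np.zeros_like(x)
    for _ in range(n_iters):
        a = P_A(b + p)         # a_n = P_A(b_{n-1} + p_{n-1})
        p = b + p - a          # keeps x = a_n + p_n + q_{n-1}
        b = P_B(a + q)         # b_n = P_B(a_n + q_{n-1})
        q = a + q - b          # keeps x = b_n + p_n + q_n
    return b

# Assumed example: A = {x : x_1 + x_2 <= 1}, B = the unit ball.
P_A = lambda z: z - max(z[0] + z[1] - 1.0, 0.0) / 2.0 * np.array([1.0, 1.0])
P_B = lambda z: z if np.linalg.norm(z) <= 1 else z / np.linalg.norm(z)
print(dykstra(P_A, P_B, np.array([2.0, 2.0])))   # approximately (0.5, 0.5)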


Chapter 19

Generalized Projections onto Convex Sets

The convex feasibility problem (CFP) is to find a member of the nonempty set C = ∩I i=1 C i, where the C i are closed convex subsets of RJ. In most applications the sets C i are more easily described than the set C and algorithms are sought whereby a member of C is obtained as the limit of an iterative procedure involving (exact or approximate) orthogonal or generalized projections onto the individual sets C i.

In his often cited paper [15] Bregman generalizes the SOP algorithm for the convex feasibility problem to include projections with respect to a generalized distance, and uses this successive generalized projections (SGP) method to obtain a primal-dual algorithm to minimize a convex function f : RJ → R over the intersection of half-spaces, that is, over x with Ax ≥ b. The generalized distance is built from the function f, which then must exhibit additional properties, beyond convexity, to guarantee convergence of the algorithm.

19.1 Bregman Functions and Bregman Distances

The class of functions f that are used to define the generalized distance has come to be called Bregman functions; the associated generalized distances are then Bregman distances, which are used to define generalized projections onto closed convex sets (see the book by Censor and Zenios [43] for details). In [8] Bauschke and Borwein introduce the related class of Bregman-Legendre functions and show that these functions provide an appropriate setting in which to study Bregman distances and generalized


projections associated with such distances. For further details concerning Bregman and Bregman-Legendre functions, see the appendix.

Bregman's successive generalized projection (SGP) method uses projections with respect to Bregman distances to solve the convex feasibility problem. Let f : RJ → (−∞, +∞] be a closed, proper convex function, with essential domain D = dom f = {x|f(x) < +∞} and int D ≠ ∅. Denote by Df(·, ·) : D × int D → [0, +∞) the Bregman distance, given by

Df(x, z) = f(x) − f(z) − ⟨∇f(z), x − z⟩ (19.1)

and by P f C i the Bregman projection operator associated with the convex function f and the convex set C i; that is

P f C i z = arg minx∈C i∩D Df(x, z). (19.2)

The Bregman projection of x onto C is characterized by Bregman's Inequality:

⟨∇f(P f C x) − ∇f(x), c − P f C x⟩ ≥ 0, (19.3)

for all c in C .
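Two standard instances of (19.1), shown here as a small numerical check (the particular choice of functions and test vectors is an assumption, not taken from the text): f(x) = (1/2)||x||² gives Df(x, z) = (1/2)||x − z||², and the entropy function f(x) = Σj xj log xj − xj gives the Kullback-Leibler distance.

import numpy as np

def bregman_distance(f, grad_f, x, z):
    # D_f(x, z) = f(x) - f(z) - <grad f(z), x - z>     (Equation (19.1))
    return f(x) - f(z) - grad_f(z) @ (x - z)

x = np.array([1.0, 2.0])
z = np.array([2.0, 1.0])

# f(x) = (1/2)||x||^2  ->  D_f(x, z) = (1/2)||x - z||^2
f1, g1 = lambda v: 0.5 * v @ v, lambda v: v
print(bregman_distance(f1, g1, x, z), 0.5 * np.sum((x - z) ** 2))

# f(x) = sum_j x_j log x_j - x_j  ->  D_f(x, z) = KL(x, z)
f2 = lambda v: np.sum(v * np.log(v) - v)
g2 = lambda v: np.log(v)
print(bregman_distance(f2, g2, x, z), np.sum(x * np.log(x / z) + z - x))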

19.2 The Successive Generalized Projections Algorithm

Bregman considers the following generalization of the SOP algorithm:

Algorithm 19.1 Bregman's method of Successive Generalized Projections (SGP): Beginning with x0 ∈ int dom f, for k = 0, 1,..., let i = i(k) := k(mod I) + 1 and

xk+1 = P f C i(k)(xk). (19.4)

He proves that the sequence {xk} given by (19.4) converges to a member of C ∩ dom f, whenever this set is nonempty and the function f is what came to be called a Bregman function ([15]). Bauschke and Borwein [8] prove that Bregman's SGP method converges to a member of C provided that one of the following holds: 1) f is Bregman-Legendre; 2) C ∩ int D ≠ ∅ and dom f∗ is open; or 3) dom f and dom f∗ are both open, with f∗ the function conjugate to f.

In [15] Bregman goes on to use the SGP to find a minimizer of a Bregman function f(x) over the set of x such that Ax = b. Each hyperplane associated with a single equation is a closed, convex set. The SGP finds the Bregman projection of the starting vector onto the intersection of the hyperplanes. If the starting vector has the form x0 = AT d, for some vector d, then this Bregman projection also minimizes f(x) over x in the intersection.


19.3 Bregman’s Primal-Dual Algorithm

The problem is to minimize f : RJ → R over the set of all x for which Ax ≥ b. Begin with x0 such that x0 = AT u0, for some u0 ≥ 0. For k = 0, 1,..., let i = k(mod I) + 1. Having calculated xk, there are three possibilities:

a) if (Axk)i < bi, then let xk+1 be the Bregman projection onto the hyperplane H i = {x|(Ax)i = bi}, so that

∇f (xk+1) = ∇f (xk) + λkai,

where ai is the ith column of AT. With ∇f(xk) = AT uk, for uk ≥ 0, update uk by

uk+1i = uki + λk,

and

uk+1m = ukm,

for m ≠ i.

b) if (Axk)i = bi, or (Axk)i > bi and uki = 0, then xk+1 = xk, and uk+1 = uk.

c) if (Axk)i > bi and uki > 0, then let µk be the smaller of the numbers µ′k and µ′′k, where

∇f(y) = ∇f(xk) − µ′k ai

puts y in H i, and

µ′′k = uki.

Then take xk+1 with

∇f (xk+1) = ∇f (xk) − µkai.

With appropriate assumptions made about the function f, the sequence {xk} so defined converges to a minimizer of f(x) over the set of x with Ax ≥ b. For a detailed proof of this result, see [43].

Bregman also suggests that this primal-dual algorithm be used to find approximate solutions for linear programming problems, where the problem is to minimize a linear function cT x, subject to constraints. His idea is to replace the function cT x with h(x) = cT x + f(x), and then apply his primal-dual method to h(x).


19.4 Dykstra's Algorithm for Bregman Projections

We are concerned now with finding the Bregman projection of x onto the intersection C of finitely many closed convex sets, C i. The problem can be solved by extending Dykstra's algorithm to include Bregman projections.

19.4.1 A Helpful Lemma

The following lemma helps to motivate the extension of Dykstra's algorithm.

Lemma 19.1 Suppose that 

∇f (c) − ∇f (x) = ∇f (c) − ∇f (c + p) + ∇f (c) − ∇f (c + q ),

with  c = P f A(c + p) and  c = P f B(c + q ). Then  c = P f C x.

Proof: Let d be arbitrary in C . We have

⟨∇f(c) − ∇f(c + p), d − c⟩ ≥ 0,

and

⟨∇f(c) − ∇f(c + q), d − c⟩ ≥ 0.

Adding, we obtain

⟨∇f(c) − ∇f(x), d − c⟩ ≥ 0.

This suggests the following algorithm for finding c = P f C x, which turns out to be the extension of Dykstra's algorithm to Bregman projections.

Begin with b0 = x, p0 = q0 = 0. Define

bn−1 + pn−1 = ∇f −1(∇f (bn−1) + rn−1),

an = P f A(bn−1 + pn−1),

rn = ∇f (bn−1) + rn−1 − ∇f (an),

∇f (an + q n−1) = ∇f (an) + sn−1,

bn = P f B(an + q n−1),

and

sn = ∇f(an) + sn−1 − ∇f(bn).

In place of

∇f(c + p) − ∇f(c) + ∇f(c + q) − ∇f(c),


we have

[∇f (bn−1) + rn−1] −∇f (bn−1) + [∇f (an) + sn−1] −∇f (an) = rn−1 + sn−1,

and also

[∇f (an) + sn−1] − ∇f (an) + [∇f (bn) + rn] − ∇f (bn) = rn + sn−1.

But we also have

rn−1 + sn−1 = ∇f (x) − ∇f (bn−1),

and

rn + sn−1 = ∇f(x) − ∇f(an).

Then the sequences {an} and {bn} converge to c. For further details, see [41] and [10].

In [16] the authors show that the extension of Dykstra's algorithm to Bregman projections can be viewed as an extension of Bregman's primal-dual algorithm to the case in which the intersection of half-spaces is replaced by the intersection of closed convex sets.


Chapter 20

An Interior-Point Optimization Method

Investigations in [22] into several well known iterative algorithms, including the 'expectation maximization maximum likelihood' (EMML) method, the 'multiplicative algebraic reconstruction technique' (MART) as well as block-iterative and simultaneous versions of MART, revealed that the iterative step of each algorithm involved weighted arithmetic or geometric means of Bregman projections onto hyperplanes; interestingly, the projections involved were associated with Bregman distances that differed from one hyperplane to the next. This representation of the EMML algorithm as a weighted arithmetic mean of Bregman projections provided the key step in obtaining block-iterative and row-action versions of EMML. Because it is well known that convergence is not guaranteed if one simply extends Bregman's algorithm to multiple distances by replacing the single distance Df in (19.4) with multiple distances Df i, the appearance of distinct distances in these algorithms suggested that a somewhat more sophisticated algorithm employing multiple Bregman distances might be possible.

20.1 The Multiprojection Successive Generalized Projection Method

In [26] such an iterative multiprojection method for solving the CFP, called the multidistance successive generalized projection (MSGP) method, was presented in the context of Bregman functions, and subsequently, in the framework of Bregman-Legendre functions [28]; see the Appendix on Bregman functions for definitions and details concerning these functions. The MSGP extends Bregman's SGP method by allowing the Bregman


projection onto each set C i to be performed with respect to a Bregman distance Df i derived from a Bregman-Legendre function f i. The MSGP method depends on the selection of a super-coercive Bregman-Legendre function h whose Bregman distance Dh satisfies the inequality Dh(x, z) ≥ Df i(x, z) for all x ∈ dom h ⊆ ∩I i=1 dom f i and all z ∈ int dom h, where dom h = {x|h(x) < +∞}. By using different Bregman distances for different convex sets, we found that we can sometimes calculate the desired Bregman projections in closed form, thereby obtaining computationally tractable iterative algorithms (see [22]).

20.2 An Interior-Point Algorithm (IPA)

Consideration of a special case of the MSGP, involving only a single convex set C 1, leads us to an interior point optimization method. If I = 1 and f := f 1 has a unique minimizer x̂ in int dom h, then the MSGP iteration using C 1 = {x̂} is

∇h(xk+1) = ∇h(xk) − ∇f(xk). (20.1)

This suggests an interior point algorithm (IPA) that could be applied more broadly to minimize a convex function f over the closure of dom h.

First, we present the MSGP method and prove convergence, in the context of Bregman-Legendre functions. Then we investigate the IPA suggested by the MSGP algorithm.

20.3 The MSGP Algorithm

We begin by setting out the assumptions we shall make and the notation we shall use in this section.

20.3.1 Assumptions and Notation

We make the following assumptions throughout this section. Let C = ∩I i=1 C i be the nonempty intersection of closed convex sets C i. The function h is super-coercive and Bregman-Legendre with essential domain D = dom h and C ∩ dom h ≠ ∅. For i = 1, 2,...,I the function f i is also Bregman-Legendre, with D ⊆ dom f i, so that int D ⊆ int dom f i; also C i ∩ int dom f i ≠ ∅. For all x ∈ dom h and z ∈ int dom h we have Dh(x, z) ≥ Df i(x, z), for each i.


20.3.2 The MSGP Algorithm

Algorithm 20.1 The MSGP algorithm: Let x0 ∈ int dom h be arbitrary. For k = 0, 1,... and i(k) := k(mod I) + 1 let

xk+1 = ∇h−1(∇h(xk) − ∇f i(k)(xk) + ∇f i(k)(P f i(k)C i(k)(xk))). (20.2)

20.3.3 A Preliminary Result

For each k = 0, 1,... define the function Gk(·) : dom h → [0, +∞) by

Gk(x) = Dh(x, xk) − Df i(k)(x, xk) + Df i(k)(x, P f i(k)C i(k)(xk)). (20.3)

The next proposition provides a useful identity, which can be viewed as an analogue of Pythagoras' theorem. The proof is not difficult and we omit it.

Proposition 20.1 For each  x ∈ dom h, each  k = 0, 1,..., and  xk+1 given by (20.2) we have 

Gk(x) = Gk(xk+1) + Dh(x, xk+1). (20.4)

Consequently, xk+1 is the unique minimizer of the function  Gk(·).

This identity (20.4) is the key ingredient in the convergence proof for theMSGP algorithm.

20.3.4 The MSGP Convergence Theorem

We shall prove the following convergence theorem:

Theorem 20.1 Let x0 ∈ int dom h be arbitrary. Any sequence {xk} obtained from the iterative scheme given by Algorithm 20.1 converges to x∞ ∈ C ∩ dom h. If the sets C i are hyperplanes, then x∞ minimizes the function Dh(x, x0) over all x ∈ C ∩ dom h; if, in addition, x0 is the global minimizer of h, then x∞ minimizes h(x) over all x ∈ C ∩ dom h.

Proof: All details concerning Bregman functions are in the Appendix. Let c be a member of C ∩ dom h. From the Pythagorean identity (20.4) it follows that

Gk(c) = Gk(xk+1) + Dh(c, xk+1). (20.5)

Using the definition of Gk(·), we write

Gk(c) = Dh(c, xk) − Df i(k)(c, xk) + Df i(k)(c, P f i(k)C i(k)(xk)). (20.6)


From Bregman’s Inequality (19.3) we have that

Df i(k)(c, xk) − Df i(k)(c, P f i(k)C i(k)(xk)) ≥ Df i(k)(P f i(k)C i(k)(xk), xk). (20.7)

Consequently, we know that

Dh(c, xk) − Dh(c, xk+1) ≥ Gk(xk+1) + Df i(k)(P f i(k)C i(k)(xk), xk) ≥ 0. (20.8)

It follows that {Dh(c, xk)} is decreasing and finite and the sequence {xk} is bounded. Therefore, {Df i(k)(P f i(k)C i(k)(xk), xk)} → 0 and {Gk(xk+1)} → 0; from the definition of Gk(x) it follows that {Df i(k)(xk+1, P f i(k)C i(k)(xk))} → 0

as well. Using the Bregman inequality we obtain the inequality

Dh(c, xk) ≥ Df i(k)(c, xk) ≥ Df i(k)(c, P f i(k)C i(k)(xk)), (20.9)

which tells us that the sequence {P f i(k)C i(k)(xk)} is also bounded. Let x∗ be an arbitrary cluster point of the sequence {xk} and let {xkn} be a subsequence of the sequence {xk} converging to x∗.

We first show that x∗ ∈ dom h and {Dh(x∗, xk)} → 0. If x∗ is in int dom h then our claim is verified, so suppose that x∗ is in bdry dom h. If c is in dom h but not in int dom h, then, applying B2 of the Appendix on Bregman functions, we conclude that x∗ ∈ dom h and {Dh(x∗, xk)} → 0. If, on the other hand, c is in int dom h then by R2 x∗ would have to be in int dom h also. It follows that x∗ ∈ dom h and {Dh(x∗, xk)} → 0. Now we show that x∗ is in C.

Label x∗ = x∗0. Since there must be at least one index i that occurs infinitely often as i(k), we assume, without loss of generality, that the subsequence {xkn} has been selected so that i(k) = 1 for all n = 1, 2,.... Passing to subsequences as needed, we assume that, for each m = 0, 1, 2,...,I − 1, the subsequence {xkn+m} converges to a cluster point x∗m, which is in dom h, according to the same argument we used in the previous paragraph. For each m the sequence {Df m(c, P f mC m(xkn+m−1))} is bounded, so, again, by passing to subsequences as needed, we assume that the subsequence {P f mC m(xkn+m−1)} converges to c∗m ∈ C m ∩ dom f m.

Since the sequence {Df m(c, P f mC m(xkn+m−1))} is bounded and c ∈ dom f m, it follows, from either B2 or R2, that c∗m ∈ dom f m. We know that

{Df m(P f mC m(xkn+m−1), xkn+m−1)} → 0 (20.10)

and both P f mC m(xkn+m−1) and xkn+m−1 are in int dom f m. Applying R1, B3 or R3, depending on the assumed locations of c∗m and x∗m−1, we conclude that c∗m = x∗m−1.


We also know that

{Df m(xkn+m, P f mC m(xkn+m−1))} → 0, (20.11)

from which it follows, using the same arguments, that x∗m = c∗m. Therefore, we have x∗ = x∗m = c∗m for all m; so x∗ ∈ C.

Since x∗ ∈ C ∩ dom h, we may now use x∗ in place of the generic c, to obtain that the sequence {Dh(x∗, xk)} is decreasing. However, we also know that the sequence {Dh(x∗, xkn)} → 0. So we have {Dh(x∗, xk)} → 0. Applying R5, we conclude that {xk} → x∗.

If the sets C i are hyperplanes, then we get equality in Bregman's inequality (19.3) and so

Dh(c, xk) − Dh(c, xk+1) = Gk(xk+1) + Df i(k)(P f i(k)C i(k)(xk), xk). (20.12)

Since the right side of this equation is independent of which c we have chosen in the set C ∩ dom h, the left side is also independent of this choice. This implies that

Dh(c, x0) − Dh(c, xM ) = Dh(x∗, x0) − Dh(x∗, xM ), (20.13)

for any positive integer M  and any c ∈ C ∩ dom h. Therefore

Dh(c, x0) − Dh(x∗, x0) = Dh(c, xM ) − Dh(x∗, xM ). (20.14)

Since {Dh(x∗, xM)} → 0 as M → +∞ and {Dh(c, xM)} → α ≥ 0, we have that Dh(c, x0) − Dh(x∗, x0) ≥ 0. This completes the proof.

20.4 An Interior-Point Algorithm for Iterative Optimization

We consider now an interior point algorithm (IPA) for iterative optimization. This algorithm was first presented in [27] and applied to transmission tomography in [99]. The IPA is suggested by a special case of the MSGP, involving functions h and f := f 1.

20.4.1 Assumptions

We assume, for the remainder of this section, that h is a super-coercive Legendre function with essential domain D = dom h. We also assume that f is continuous on the set D, takes the value +∞ outside this set, and is differentiable in int D. Thus, f is a closed, proper convex function on RJ. We assume also that x̂ = argminx∈D f(x) exists, but not that it is unique. As in the previous section, we assume that Dh(x, z) ≥ Df(x, z) for all x ∈ dom h and z ∈ int dom h. As before, we denote by h∗ the function conjugate to h.


20.4.2 The IPA

The IPA is an iterative procedure that, under conditions to be described shortly, minimizes the function f over the closure of the essential domain of h, provided that such a minimizer exists.

Algorithm 20.2 Let x0 be chosen arbitrarily in  int D. For  k = 0, 1,... let xk+1 be the unique solution of the equation 

∇h(xk+1) = ∇h(xk) − ∇f (xk). (20.15)

Note that equation (20.15) can also be written as

xk+1 = ∇h−1(∇h(xk) − ∇f (xk)) = ∇h∗(∇h(xk) − ∇f (xk)). (20.16)
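A minimal sketch of the iteration (20.16) for one particular choice of h (an illustration, not taken from the text): with h(x) = Σj xj log xj − xj on the nonnegative orthant, ∇h(x) = log x and ∇h−1(y) = exp(y) componentwise, so the IPA becomes a multiplicative update. The objective f, its scaling γ, and the data below are assumptions; for a given f one must still verify the dominance condition Dh(x, z) ≥ Df(x, z) before the convergence results apply.

import numpy as np

def ipa_entropy(grad_f, x0, n_iters=1000):
    # IPA iteration (20.16) for h(x) = sum_j (x_j log x_j - x_j):
    #   grad h(x) = log x, (grad h)^{-1}(y) = exp(y),
    # so x^{k+1}_j = x^k_j * exp(-(grad f(x^k))_j), and the iterates stay
    # in the interior of the nonnegative orthant.
    x = x0.astype(float)
    for _ in range(n_iters):
        x = x * np.exp(-grad_f(x))
    return x

# Assumed objective: f(x) = (gamma/2)||Ax - b||^2, gamma a small scale factor.
rng = np.random.default_rng(1)
A = rng.random((8, 4))
b = A @ np.array([1.0, 0.5, 2.0, 0.0])
gamma = 0.05
f = lambda x: 0.5 * gamma * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: gamma * A.T @ (A @ x - b)
x = ipa_entropy(grad_f, np.ones(4))
print(f(np.ones(4)), f(x))    # f(x^k) decreases along the iteration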

20.4.3 Motivating the IPA

As already noted, the IPA was originally suggested by consideration of a special case of the MSGP. Suppose that x̂ ∈ dom h is the unique global minimizer of the function f, and that ∇f(x̂) = 0. Take I = 1 and C = C 1 = {x̂}. Then P f C 1(xk) = x̂ always and the iterative MSGP step becomes that of the IPA. Since we are assuming that x̂ is in dom h, the convergence theorem for the MSGP tells us that the iterative sequence {xk} converges to x̂.

In most cases, the global minimizer of f will not lie within the essential domain of the function h and we are interested in the minimum value of f on the set D, where D = dom h; that is, we want x̂ = argminx∈D f(x), whenever such a minimum exists. As we shall see, the IPA can be used to advantage even when the specific conditions of the MSGP do not hold.

20.4.4 Preliminary results for the IPA

Two aspects of the IPA suggest strongly that it may converge under more general conditions than those required for convergence of the MSGP. The sequence {xk} defined by (20.15) is entirely within the interior of dom h. In addition, as we now show, the sequence {f(xk)} is decreasing. Adding both sides of the inequalities Dh(xk+1, xk) − Df(xk+1, xk) ≥ 0 and Dh(xk, xk+1) − Df(xk, xk+1) ≥ 0 gives

⟨∇h(xk) − ∇h(xk+1) − ∇f(xk) + ∇f(xk+1), xk − xk+1⟩ ≥ 0. (20.17)

Substituting according to equation (20.15) and using the convexity of the function f, we obtain

f(xk) − f(xk+1) ≥ ⟨∇f(xk+1), xk − xk+1⟩ ≥ 0. (20.18)


Therefore, the sequence {f(xk)} is decreasing; since it is bounded below by f(x̂), it has a limit, f̂ ≥ f(x̂). We have the following result (see [27], Prop. 3.1).

Lemma 20.1 f̂ = f(x̂).

Proof: Suppose, to the contrary, that 0 < δ = f̂ − f(x̂). Select z ∈ D with f(z) ≤ f(x̂) + δ/2. Then f(xk) − f(z) ≥ δ/2 for all k. Writing H k = Dh(z, xk) − Df(z, xk) for each k, we have

H k − H k+1 = Dh(xk+1, xk) − Df(xk+1, xk) + ⟨∇f(xk+1), xk+1 − z⟩. (20.19)

Since ⟨∇f(xk+1), xk+1 − z⟩ ≥ f(xk+1) − f(z) ≥ δ/2 > 0 and Dh(xk+1, xk) − Df(xk+1, xk) ≥ 0, it follows that {H k} is a decreasing sequence of positive numbers, so that the successive differences converge to zero. This is a contradiction; we conclude that f̂ = f(x̂).

Convergence of the IPA

We prove the following convergence result for the IPA (see also [27]).

Theorem 20.2 If x̂ = argminx∈D f(x) is unique, then the sequence {xk} generated by the IPA according to equation (20.15) converges to x̂. If x̂ is not unique, but can be chosen in D, then the sequence {Dh(x̂, xk)} is decreasing. If, in addition, the function Dh(x̂, ·) has bounded level sets, then the sequence {xk} is bounded and so has cluster points x∗ ∈ D with f(x∗) = f(x̂). Finally, if h is a Bregman-Legendre function, then x∗ ∈ D and the sequence {xk} converges to x∗.

Proof: According to Corollary 8.7.1 of [105], if G is a closed, proper convex function on RJ and if the level set Lα = {x|G(x) ≤ α} is nonempty and bounded for at least one value of α, then Lα is bounded for all values of α. If the constrained minimizer x̂ is unique, then, by the continuity of f on D and Rockafellar's corollary, we can conclude that the sequence {xk} converges to x̂. If x̂ is not unique, but can be chosen in D, then, with additional assumptions, convergence can still be established.

Suppose now that x̂ is not necessarily unique, but can be chosen in D. Assuming x̂ ∈ D, we show that the sequence {Dh(x̂, xk)} is decreasing. Using equation (20.15) we have

Dh(x̂, xk) − Dh(x̂, xk+1) = Dh(xk+1, xk) + ⟨∇h(xk+1) − ∇h(xk), x̂ − xk+1⟩

= Dh(xk+1, xk) − Df(xk+1, xk) + Df(xk+1, xk) + ⟨∇f(xk), xk+1 − x̂⟩

= Dh(xk+1, xk) − Df(xk+1, xk) + f(xk+1) − f(xk) − ⟨∇f(xk), x̂ − xk⟩

≥ Dh(xk+1, xk) − Df(xk+1, xk) + f(xk+1) − f(xk) + f(xk) − f(x̂);


the final inequality follows from the convexity of f. Since Dh(xk+1, xk) − Df(xk+1, xk) ≥ 0 and f(xk+1) − f(x̂) ≥ 0, it follows that the sequence {Dh(x̂, xk)} is decreasing.

If Dh(x̂, ·) has bounded level sets, then the sequence {xk} is bounded and we can extract a subsequence {xkn} converging to some x∗ in the closure of D.

Finally, assume that h is a Bregman-Legendre function. If x̂ is in D but not in int D, then, by B2, x∗ ∈ bdry D implies that x∗ is in D and {Dh(x∗, xkn)} → 0. If x̂ is in int D, then we conclude, from R2, that x∗ is also in int D. Then, by R1, we have {Dh(x∗, xkn)} → 0. We can then replace the generic x̂ with x∗, to conclude that {Dh(x∗, xk)} is decreasing. But, {Dh(x∗, xkn)} converges to zero; therefore, the entire sequence {Dh(x∗, xk)} converges to zero. Applying R5, we conclude that {xk} converges to x∗. This completes the proof.


Chapter 21

Linear Programming

The term linear programming (LP) refers to the problem of optimizing a linear function of several variables, subject to linear equality or inequality constraints. In this chapter we present the problem and establish the basic facts. For a much more detailed discussion, consult [100].

21.1 Primal and Dual Problems

Associated with the basic problem in LP, called the primary problem, there is a second problem, the dual problem. Both of these problems can be written in two equivalent ways, the canonical form and the standard form.

21.1.1 Canonical and Standard Forms

Let b and c be fixed vectors and A a fixed matrix. The problem

minimize z = cT x, subject to Ax ≥ b, x ≥ 0 (PC) (21.1)

is the so-called primary problem of LP, in canonical form . The dual problem in canonical form is

maximize w = bT y, subject to AT y ≤ c, y ≥ 0. (DC) (21.2)

The primary problem, in standard form , is

minimize z = cT x, subject to Ax = b, x ≥ 0 (PS) (21.3)

with the dual problem in standard form given by

maximize w = bT y, subject to AT y ≤ c. (DS) (21.4)


Notice that the dual problem in standard form does not require that y be nonnegative. Note also that the standard problems make sense only if the system Ax = b is underdetermined and has infinitely many solutions. For that reason, we shall assume, for the standard problems, that the I by J matrix A has more columns than rows, so J > I, and has full row rank.

If we are given the primary problem in canonical form, we can convert

it to standard form by augmenting the variables, that is, by defining

ui = (Ax)i − bi,

for i = 1,...,I , and rewriting Ax ≥ b as

Ãx̃ = b,

for Ã = [ A −I ] and x̃ = [xT uT]T.
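As a small numerical illustration of these forms (the specific A, b, c below and the use of SciPy's linprog solver are assumptions, not part of the text), one can solve a primary problem in canonical form and its dual and compare the optimal values z and w:

import numpy as np
from scipy.optimize import linprog

# Assumed data for (PC): minimize c^T x  subject to  Ax >= b, x >= 0.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 6.0])
c = np.array([3.0, 2.0])

# (PC): linprog minimizes subject to A_ub x <= b_ub, so use -A x <= -b.
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)

# (DC): maximize b^T y  subject to  A^T y <= c, y >= 0 (minimize -b^T y).
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(0, None)] * 2)

print("z =", primal.fun, " w =", -dual.fun)   # equal at the optimum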

21.1.2 Weak Duality

Consider the problems (PS) and (DS). Say that x is feasible if x ≥ 0 and Ax = b. Let F be the set of feasible x. Say that y is feasible if AT y ≤ c. The Weak Duality Theorem is the following:

Theorem 21.1 Let x and  y be feasible vectors. Then 

z = cT x ≥ bT y = w.

Corollary 21.1 If  z is not bounded below, then there are no feasible  y.

Corollary 21.2 If  x and  y are both feasible, and  z = w, then both  x and y are optimal for their respective problems.

Exercise 21.1 Prove the theorem and its corollaries.

The nonnegative quantity cT x − bT y is called the duality gap. The complementary slackness condition says that, for optimal x∗ and y∗, we have

x∗j(cj − (AT y∗)j) = 0,

for each j, which says that the duality gap is zero. Primal-dual algorithms for solving linear programming problems are based on finding sequences {xk} and {yk} that drive the duality gap down to zero [100].

21.1.3 Strong Duality

The Strong Duality Theorem makes a stronger statement.

Theorem 21.2 If one of the problems (PS) or (DS) has an optimal solution, then so does the other and z = w for the optimal vectors.


Before we consider the proof of the theorem, we need a few preliminary results.

A point x in F is said to be a basic feasible solution if the columns of A corresponding to positive entries of x are linearly independent; denote by B an invertible matrix obtained by deleting from A columns associated with zero entries of x. The entries of an arbitrary x corresponding to the columns not deleted are called the basic variables. Then, assuming that the columns of B are the first I columns of A, we write xT = (xT B, xT N), and

A = [ B N ] ,

so that Ax = BxB = b, and xB = B−1b. The following theorems are takenfrom [100].

Theorem 21.3 A point  x is in Ext( F ) if and only if  x is a basic feasible solution.

Proof: Suppose that x is a basic feasible solution, and we write xT = (xT B, 0T), A = [ B N ]. If x is not an extreme point of F, then there are y ≠ x and z ≠ x in F, and α in (0, 1), with

x = (1 − α)y + αz.

Then yT  = (yT B , yT N ), zT  = (zT B, zT N ), and yN  ≥ 0, zN  ≥ 0. From

0 = xN  = (1 − α)yN  + (α)zN 

it follows that

yN  = zN  = 0,

and b = ByB = BzB = BxB. But, since B is invertible, we have xB = yB = zB. This is a contradiction, so x must be in Ext(F).

Conversely, suppose that x is in Ext(F). Since it is in F, we know that Ax = b and x ≥ 0. By reordering the variables if necessary, we may assume that xT = (xT B, xT N), with xB > 0 and xN = 0; we do not know that xB is a vector of length I, however, so when we write A = [ B N ], we do not know that B is square. If B is invertible, then x is a basic feasible solution. If not, we shall construct y ≠ x and z ≠ x in F, such that

x = (1/2)y + (1/2)z.

If  {B1, B2,...,BK } are the columns of  B and are linearly dependent,then there are constants p1, p2,...,pK , not all zero, with

 p1B1 + ... + pK BK  = 0.


With pT  = ( p1,...,pK ), we have

B(xB + αp) = B(xB − αp) = BxB = b,

for all α ∈ (0, 1). We then select α so small that both xB + αp > 0 andxB − αp > 0. Let

yT  = (xT B + αpT , xT N )

and

zT  = (xT B − αpT , xT N ).

This completes the proof.

Exercise 21.2 Show that there are at most finitely many basic feasible solutions, so there are at most finitely many members of Ext( F ).

Theorem 21.4 If F is not empty, then Ext(F) is not empty. In that case, let {v1,...,vK} be the members of Ext(F). Every x in F can be written as

x = d + α1v1 + ... + αK vK,

for some αk ≥ 0, with ΣK k=1 αk = 1, and some direction of unboundedness, d.

Proof: We consider only the case in which F is bounded, so there is no direction of unboundedness; the unbounded case is similar. Let x be a feasible point. If x is an extreme point, fine. If not, then x is not a basic feasible solution. The columns of A that correspond to the positive entries of x are not linearly independent. Then we can find a vector p such that Ap = 0 and pj = 0 if xj = 0. If |ε| is small, then x + εp ≥ 0 and (x + εp)j = 0 whenever xj = 0, so x + εp is in F. We can alter ε in such a way that eventually y = x + εp has one more zero entry than x has, and so does z = x − εp. Both y and z are in F and x is the average of these points. If y and z are not basic, repeat the argument on y and z, each time reducing the number of positive entries. Eventually, we will arrive at the case where the number of non-zero entries is I, and so will have a basic feasible solution.

Proof of the Strong Duality Theorem: Suppose now that x∗ is a solution of the problem (PS) and z∗ = cT x∗. Without loss of generality, we may assume that x∗ is a basic feasible solution, hence an extreme point of F. Then we can write

xT∗ = ((B−1b)T, 0T),

cT = (cT B, cT N),


and A = [ B N ]. Every feasible solution has the form

xT = ((B−1b)T, 0T) + ((−B−1Nv)T, vT),

for some v ≥ 0. From cT x ≥ cT x∗ we find that

(cT N − cT BB−1N)v ≥ 0,

for all v ≥ 0. It follows that

cT N − cT BB−1N ≥ 0.

Now let y∗ = (B−1)T cB, or yT∗ = cT BB−1. We show that y∗ is feasible for (DS); that is, we show that

AT y∗ ≤ c.

Since

yT∗ A = (yT∗ B, yT∗ N) = (cT B, yT∗ N) = (cT B, cT BB−1N)

and

cT N ≥ cT BB−1N,

we have

yT∗ A ≤ cT,

so y∗ is feasible for (DS). Finally, we show that

cT x∗ = yT ∗ b.

We have

yT∗ b = cT BB−1b = cT x∗.

This completes the proof.

21.2 The Simplex Method

In this section we sketch the main ideas of the simplex method. For further details see [100].

Begin with a basic feasible solution of (PS), say

xT = (b̂T, 0T) = ((B−1b)T, 0T).

Compute the vector yT = cT BB−1. If

ĉT N = cT N − yT N ≥ 0,

then x is optimal. Otherwise, select an entering variable xj such that

(ĉN)j < 0.


Compute âj = B−1aj, where aj is the jth column of A. Find an index s such that

b̂s/(âj)s = min1≤i≤I {b̂i/(âj)i : (âj)i > 0}.

If there are no such positive denominators, the problem is unbounded. Otherwise, xs is the leaving variable, and xj replaces it in the basis. Redefine B and the basic variables xB accordingly.
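The following sketch of a single revised-simplex step follows the operations just listed. It is illustrative only; the tolerances, the rule for choosing the entering variable (most negative reduced cost), and the data structures are assumptions, and no anti-cycling rule is included.

import numpy as np

def simplex_iteration(A, b, c, basis):
    # One revised-simplex step for (PS): minimize c^T x, Ax = b, x >= 0;
    # `basis` holds the indices of the current basic variables.
    B = A[:, basis]
    b_hat = np.linalg.solve(B, b)                # current values of x_B
    y = np.linalg.solve(B.T, c[basis])           # y^T = c_B^T B^{-1}
    reduced = c - A.T @ y                        # reduced costs (zero on the basis)
    j = int(np.argmin(reduced))                  # candidate entering variable
    if reduced[j] >= -1e-12:
        return basis, True                       # current basis is optimal
    a_hat = np.linalg.solve(B, A[:, j])          # a^_j = B^{-1} a_j
    ratios = [(b_hat[i] / a_hat[i], i) for i in range(len(basis)) if a_hat[i] > 1e-12]
    if not ratios:
        raise ValueError("problem is unbounded")
    _, s = min(ratios)                           # ratio test: leaving position s
    new_basis = list(basis)
    new_basis[s] = j                             # x_j enters, the s-th basic variable leaves
    return new_basis, False

Repeating this step, starting from an initial basic feasible solution, carries out the simplex method.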


Chapter 22

Systems of Linear Inequalities

Designing linear discriminants for pattern classification involves the prob-lem of solving a system of linear inequalities Ax ≥ b. In this chapter wediscuss the iterative Agmon-Motzkin-Schoenberg (AMS) algorithm [1, 98]for solving such problems. We prove convergence of the AMS algorithm,for both the consistent and inconsistent cases, by mimicking the proof forthe ART algorithm. Both algorithms are examples of the method of pro-

 jection onto convex sets. The AMS algorithm is a special case of the cyclicsubgradient projection (CSP) method, so that convergence of the AMS,in the consistent case, follows from the convergence theorem for the CSPalgorithm.

22.1 Projection onto Convex Sets

In [118] Youla suggests that problems in image restoration might be viewed geometrically and the method of projection onto convex sets (POCS) employed to solve such inverse problems. In the survey paper [117] he examines the POCS method as a particular case of iterative algorithms for finding fixed points of nonexpansive mappings. This point of view is increasingly important in applications such as medical imaging and a number of recent papers have addressed the theoretical and practical issues involved [7], [9], [6], [26], [30], [36], [47], [48], [49].

In this geometric approach the restored image is a solution of the convex feasibility problem (CFP), that is, it lies within the intersection of finitely many closed nonempty convex sets $C_i$, i = 1, ..., I, in $R^J$ (or sometimes, in infinite dimensional Hilbert space). For any nonempty closed convex set C, the metric projection of x onto C, denoted $P_Cx$, is the unique member


of  C  closest to x. The iterative methods used to solve the CFP employthese metric projections. Algorithms for solving the CFP are discussed inthe papers cited above, as well as in the books by Censor and Zenios [43],Stark and Yang [111] and Borwein and Lewis [13].

The simplest example of the CFP is the solving of a system of linearequations Ax = b. Let A be an I  by J  real matrix and for i = 1,...,I  letBi = {x|(Ax)i = bi}, where bi denotes the i-th entry of the vector b. Nowlet C i = Bi. Any solution of  Ax = b lies in the intersection of the C i; if the system is inconsistent then the intersection is empty. The Kaczmarzalgorithm [83] for solving the system of linear equations Ax = b has theiterative step

$x_j^{k+1} = x_j^k + A_{i(k)j}\big(b_{i(k)} - (Ax^k)_{i(k)}\big), \qquad (22.1)$

for j = 1, ..., J, k = 0, 1, ... and $i(k) = k(\mathrm{mod}\ I) + 1$. This algorithm was rediscovered by Gordon, Bender and Herman [70], who called it the algebraic reconstruction technique (ART). This algorithm is an example of the method of successive orthogonal projections (SOP) [72] whereby we generate the sequence $\{x^k\}$ by taking $x^{k+1}$ to be the point in $C_{i(k)}$ closest to $x^k$. Kaczmarz's algorithm can also be viewed as a method for constrained optimization: whenever Ax = b has solutions, the limit of the sequence generated by equation (22.1) minimizes the function $\|x - x^0\|$ over all solutions of Ax = b.
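A minimal sketch of the ART/Kaczmarz iteration (22.1), assuming, as in this chapter, that the rows of A have been normalized to Euclidean length one; the function name and sweep count are our own.

```python
import numpy as np

def kaczmarz_art(A, b, x0, n_sweeps=50):
    """ART/Kaczmarz sweeps of Equation (22.1); rows of A assumed unit norm."""
    x = x0.astype(float).copy()
    I, J = A.shape
    for _ in range(n_sweeps):
        for i in range(I):                    # i(k) cycles through the rows
            x = x + A[i] * (b[i] - A[i] @ x)  # project onto {x | (Ax)_i = b_i}
    return x
```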

In the example just discussed the sets $C_i$ are hyperplanes in $R^J$; suppose now that we take the $C_i$ to be half-spaces and consider the problem of finding x such that Ax ≥ b. For each i let $H_i$ be the half-space $H_i = \{x \mid (Ax)_i \geq b_i\}$. Then x will be in the intersection of the sets $C_i = H_i$ if and only if Ax ≥ b. Methods for solving this CFP, such as Hildreth's algorithm, are discussed in [43]. Of particular interest for us here is the behavior of the Agmon-Motzkin-Schoenberg (AMS) algorithm [1], [98] for solving such systems of inequalities Ax ≥ b. The AMS algorithm has the iterative step

$x_j^{k+1} = x_j^k + A_{i(k)j}\big(b_{i(k)} - (Ax^k)_{i(k)}\big)_+. \qquad (22.2)$

The AMS algorithm converges to a solution of Ax ≥ b, if there are solutions. If there are no solutions the AMS algorithm converges cyclically, that is, subsequences associated with the same index converge [58], [9]. We present an elementary proof of this result in this chapter.
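A minimal sketch of the AMS iteration (22.2), under the same unit-row-norm assumption; again the function name and sweep count are ours.

```python
import numpy as np

def ams(A, b, x0, n_sweeps=50):
    """AMS sweeps of Equation (22.2) for Ax >= b; rows of A assumed unit norm."""
    x = x0.astype(float).copy()
    I, J = A.shape
    for _ in range(n_sweeps):
        for i in range(I):
            residual = b[i] - A[i] @ x
            x = x + A[i] * max(residual, 0.0)   # update only when (Ax)_i < b_i
    return x
```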

Algorithms for solving the CFP fall into two classes: those that employall the sets C i at each step of the iteration (the so-called simultaneous meth-ods ) and those that do not (the row-action algorithms  or, more generally,block-iterative methods ).

In the consistent case, in which the intersection of the convex sets $C_i$ is nonempty, all reasonable algorithms are expected to converge to a member


of that intersection; the limit may or may not be the member of the intersection closest to the starting vector $x^0$.

In the inconsistent case, in which the intersection of the $C_i$ is empty, simultaneous methods typically converge to a minimizer of a proximity function [36], such as

$f(x) = \sum_{i=1}^{I} \|x - P_{C_i}x\|^2,$

if a minimizer exists.Methods that are not simultaneous cannot converge in the inconsistent

case, since the limit would then be a member of the (empty) intersection.Such methods often exhibit what is called cyclic convergence ; that is, sub-sequences converge to finitely many distinct limits comprising a limit cycle.Once a member of this limit cycle is reached, further application of the al-gorithm results in passing from one member of the limit cycle to the next.Proving the existence of these limit cycles seems to be a difficult problem.

Tanabe [112] showed the existence of a limit cycle for Kaczmarz’s algo-

rithm (see also [55]), in which the convex sets are hyperplanes. The SOPmethod may fail to have a limit cycle for certain choices of the convexsets. For example, if, in R2, we take C 1 to be the lower half-plane andC 2 = {(x, y)|x > 0, y ≥ 1/x}, then the SOP algorithm fails to producea limit cycle. However, Gubin, Polyak and Riak [72] prove weak conver-gence to a limit cycle for the method of SOP in Hilbert space, under theassumption that at least one of the C i is bounded, hence weakly compact.In [9] Bauschke, Borwein and Lewis present a wide variety of results on theexistence of limit cycles. In particular, they prove that if each of the convexsets C i in Hilbert space is a convex polyhedron, that is, the intersection of finitely many half-spaces, then there is a limit cycle and the subsequentialconvergence is in norm. This result includes the case in which each C i is ahalf-space, so implies the existence of a limit cycle for the AMS algorithm.In this paper we give a proof of existence of a limit cycle for the AMSalgorithm using a modification of our proof for the ART.

In the next section we consider the behavior of the ART for solving Ax =b. The proofs given by Tanabe and Dax of the existence of a limit cycle forthis algorithm rely heavily on aspects of the theory of linear algebra, as didthe proof given in an earlier chapter here. Our goal now is to obtain a moredirect proof that can be easily modified to apply to the AMS algorithm.

We assume throughout this chapter that the real I by J matrix A has full rank and its rows have Euclidean length one.

22.2 Solving Ax = b

For i = 1, 2, ..., I let $K_i = \{x \mid (Ax)_i = 0\}$, $B_i = \{x \mid (Ax)_i = b_i\}$ and $p^i$ be the metric projection of x = 0 onto $B_i$. Let $v_i^r = (Ax^{rI+i-1})_i$


and $v^r = (v_1^r, ..., v_I^r)^T$, for r = 0, 1, .... We begin with some basic facts concerning the ART.

Fact 1: $\|x^k\|^2 - \|x^{k+1}\|^2 = ((Ax^k)_{i(k)})^2 - (b_{i(k)})^2$.

Fact 2: $\|x^{rI}\|^2 - \|x^{(r+1)I}\|^2 = \|v^r\|^2 - \|b\|^2$.

Fact 3: $\|x^k - x^{k+1}\|^2 = ((Ax^k)_{i(k)} - b_{i(k)})^2$.

Fact 4: There exists B > 0 such that, for all r = 0, 1, ..., if $\|v^r\| \leq \|b\|$ then $\|x^{rI}\| \geq \|x^{(r+1)I}\| - B$.

Fact 5: Let $x^0$ and $y^0$ be arbitrary and $\{x^k\}$ and $\{y^k\}$ the sequences generated by applying the ART. Then

$\|x^0 - y^0\|^2 - \|x^I - y^I\|^2 = \sum_{i=1}^{I}\big((Ax^{i-1})_i - (Ay^{i-1})_i\big)^2.$

22.2.1 When the System Ax = b is Consistent

In this subsection we give a proof of the following result.

Theorem 22.1 Let Ax = b and let  x0 be arbitrary. Let {xk} be generated by Equation (22.1). Then the sequence {||x − xk||} is decreasing and  {xk}converges to the solution of  Ax = b closest to x0.

Proof: Let Ax = b. It follows from Fact 5 that the sequence $\{\|x - x^{rI}\|\}$ is decreasing and the sequence $\{v^r - b\} \to 0$. So $\{x^{rI}\}$ is bounded; let $x^{*,0}$ be a cluster point. Then, for i = 1, 2, ..., I let $x^{*,i}$ be the successor of $x^{*,i-1}$ using the ART. It follows that $(Ax^{*,i-1})_i = b_i$ for each i, from which we conclude that $x^{*,0} = x^{*,i}$ for all i and that $Ax^{*,0} = b$. Using $x^{*,0}$ in place of x, we have that $\{\|x^{*,0} - x^k\|\}$ is decreasing. But a subsequence converges to zero, so $\{x^k\}$ converges to $x^{*,0}$. By Fact 5 the difference $\|x - x^k\|^2 - \|x - x^{k+1}\|^2$ is independent of which solution x we pick; consequently, so is $\|x - x^0\|^2 - \|x - x^{*,0}\|^2$. It follows that $x^{*,0}$ is the solution closest to $x^0$. This completes the proof.

22.2.2 When the System Ax = b is Inconsistent

In the inconsistent case the sequence $\{x^k\}$ will not converge, since any limit would be a solution. However, for each fixed $i \in \{1, 2, ..., I\}$, the subsequence $\{x^{rI+i}\}$ converges [112], [55]; in this subsection we prove this result and then, in the next section, we extend the proof to get cyclic convergence for the AMS algorithm. We start by showing that the sequence $\{x^{rI}\}$ is bounded. We assume that I > J and A has full rank.


Proposition 22.1 The sequence  {xrI } is bounded.

Proof: Assume that the sequence $\{x^{rI}\}$ is unbounded. We first show that we can select a subsequence $\{x^{r_tI}\}$ with the properties $\|x^{r_tI}\| \geq t$ and $\|v^{r_t}\| < \|b\|$, for t = 1, 2, ....

Assume that we have selected $x^{r_tI}$, with the properties $\|x^{r_tI}\| \geq t$ and $\|v^{r_t}\| < \|b\|$; we show how to select $x^{r_{t+1}I}$. Pick an integer s > 0 such that

$\|x^{sI}\| \geq \|x^{r_tI}\| + B + 1,$

where B > 0 is as in Fact 4. With $n + r_t = s$ let $m \geq 0$ be the smallest integer for which

$\|x^{(r_t+n-m-1)I}\| < \|x^{sI}\| \leq \|x^{(r_t+n-m)I}\|.$

Then $\|v^{r_t+n-m-1}\| < \|b\|$. Let $x^{r_{t+1}I} = x^{(r_t+n-m-1)I}$. Then we have

$\|x^{r_{t+1}I}\| \geq \|x^{(r_t+n-m)I}\| - B \geq \|x^{sI}\| - B \geq \|x^{r_tI}\| + B + 1 - B \geq t + 1.$

This gives us the desired subsequence.

For every k = 0, 1, ... let $z^{k+1} = x^{k+1} - p^{i(k)}$. Then $z^{k+1} \in K_{i(k)}$. For $z^{k+1} \neq 0$ let $u^{k+1} = z^{k+1}/\|z^{k+1}\|$. Since the subsequence $\{x^{r_tI}\}$ is unbounded, so is $\{z^{r_tI}\}$, so for sufficiently large t the vectors $u^{r_tI}$ are defined and on the unit sphere. Let $u^{*,0}$ be a cluster point of $\{u^{r_tI}\}$; replacing $\{x^{r_tI}\}$ with a subsequence if necessary, assume that the sequence $\{u^{r_tI}\}$ converges to $u^{*,0}$. Then let $u^{*,1}$ be a cluster point of $\{u^{r_tI+1}\}$; again, assume the sequence $\{u^{r_tI+1}\}$ converges to $u^{*,1}$. Continuing in this manner, we have $\{u^{r_tI+\tau}\}$ converging to $u^{*,\tau}$ for $\tau = 0, 1, 2, ...$. We know that $\{z^{r_tI}\}$ is unbounded and since $\|v^{r_t}\| < \|b\|$, we have, by Fact 3, that $\{z^{r_tI+i-1} - z^{r_tI+i}\}$ is bounded for each i. Consequently $\{z^{r_tI+i}\}$ is unbounded for each i.

Now we have

$\|z^{r_tI+i-1} - z^{r_tI+i}\| \geq \|z^{r_tI+i-1}\|\,\big\|u^{r_tI+i-1} - \langle u^{r_tI+i-1}, u^{r_tI+i}\rangle u^{r_tI+i}\big\|.$

Since the left side is bounded and $\|z^{r_tI+i-1}\|$ has no infinite bounded subsequence, we conclude that

$\big\|u^{r_tI+i-1} - \langle u^{r_tI+i-1}, u^{r_tI+i}\rangle u^{r_tI+i}\big\| \to 0.$

It follows that $u^{*,0} = u^{*,i}$ or $u^{*,0} = -u^{*,i}$ for each i = 1, 2, ..., I. Therefore $u^{*,0}$ is in $K_i$ for each i; but, since the null space of A contains only zero, this is a contradiction. This completes the proof of the proposition.

Now we give a proof of the following result.

Theorem 22.2 Let A be I by J, with I > J and A with full rank. If Ax = b has no solutions, then, for any $x^0$ and each fixed $i \in \{0, 1, ..., I\}$, the subsequence $\{x^{rI+i}\}$ converges to a limit $x^{*,i}$. Beginning the iteration in Equation (22.1) at $x^{*,0}$, we generate the $x^{*,i}$ in turn, with $x^{*,I} = x^{*,0}$.


Proof: Let $x^{*,0}$ be a cluster point of $\{x^{rI}\}$. Beginning the ART at $x^{*,0}$ we obtain $x^{*,n}$, for n = 0, 1, 2, .... It is easily seen that

$\|x^{(r-1)I} - x^{rI}\|^2 - \|x^{rI} - x^{(r+1)I}\|^2 = \sum_{i=1}^{I}\big((Ax^{(r-1)I+i-1})_i - (Ax^{rI+i-1})_i\big)^2.$

Therefore the sequence $\{\|x^{(r-1)I} - x^{rI}\|\}$ is decreasing and

$\Big\{\sum_{i=1}^{I}\big((Ax^{(r-1)I+i-1})_i - (Ax^{rI+i-1})_i\big)^2\Big\} \to 0.$

Therefore $(Ax^{*,i-1})_i = (Ax^{*,I+i-1})_i$ for each i.

For arbitrary x we have

$\|x - x^{*,0}\|^2 - \|x - x^{*,I}\|^2 = \sum_{i=1}^{I}\big((Ax)_i - (Ax^{*,i-1})_i\big)^2 - \sum_{i=1}^{I}\big((Ax)_i - b_i\big)^2,$

so that

$\|x - x^{*,0}\|^2 - \|x - x^{*,I}\|^2 = \|x - x^{*,I}\|^2 - \|x - x^{*,2I}\|^2.$

Using $x = x^{*,I}$ we have

$\|x^{*,I} - x^{*,0}\|^2 = -\|x^{*,I} - x^{*,2I}\|^2,$

from which we conclude that $x^{*,0} = x^{*,I}$. From Fact 5 it follows that the sequence $\{\|x^{*,0} - x^{rI}\|\}$ is decreasing; but a subsequence converges to zero, so the entire sequence converges to zero and $\{x^{rI}\}$ converges to $x^{*,0}$. This completes the proof.

Now we turn to the problem Ax ≥ b.

22.3 The Agmon-Motzkin-Schoenberg Algorithm

In this section we are concerned with the behavior of the AMS algorithm for finding x such that Ax ≥ b, if such x exist. We begin with some basic facts concerning the AMS algorithm.

Let $w_i^r = \min\{(Ax^{rI+i-1})_i, b_i\}$ and $w^r = (w_1^r, ..., w_I^r)^T$, for r = 0, 1, .... The following facts are easily established.

Fact 1a: $\|x^{rI+i-1}\|^2 - \|x^{rI+i}\|^2 = (w_i^r)^2 - (b_i)^2$.

Fact 2a: $\|x^{rI}\|^2 - \|x^{(r+1)I}\|^2 = \|w^r\|^2 - \|b\|^2$.

Fact 3a: $\|x^{rI+i-1} - x^{rI+i}\|^2 = (w_i^r - b_i)^2$.


Fact 4a: There exists B > 0 such that, for all r = 0, 1, ..., if $\|w^r\| \leq \|b\|$ then $\|x^{rI}\| \geq \|x^{(r+1)I}\| - B$.

Fact 5a: Let $x^0$ and $y^0$ be arbitrary and $\{x^k\}$ and $\{y^k\}$ the sequences generated by applying the AMS algorithm. Then

$\|x^0 - y^0\|^2 - \|x^I - y^I\|^2 = \sum_{i=1}^{I}\big((Ax^{i-1})_i - (Ay^{i-1})_i\big)^2 - \sum_{i=1}^{I}\big(((Ax^{i-1})_i - b_i)_+ - ((Ay^{i-1})_i - b_i)_+\big)^2 \geq 0.$

Consider for a moment the elements of the second sum in the inequality above. There are four possibilities:

1) both $(Ax^{i-1})_i - b_i$ and $(Ay^{i-1})_i - b_i$ are nonnegative, in which case this term becomes $\big((Ax^{i-1})_i - (Ay^{i-1})_i\big)^2$ and cancels with the same term in the previous sum;

2) neither $(Ax^{i-1})_i - b_i$ nor $(Ay^{i-1})_i - b_i$ is nonnegative, in which case this term is zero;

3) precisely one of $(Ax^{i-1})_i - b_i$ and $(Ay^{i-1})_i - b_i$ is nonnegative; say it is $(Ax^{i-1})_i - b_i$ (the remaining possibility is handled by symmetry), in which case the term becomes $\big((Ax^{i-1})_i - b_i\big)^2$.

Since we then have

$(Ay^{i-1})_i \leq b_i < (Ax^{i-1})_i$

it follows that

$\big((Ax^{i-1})_i - (Ay^{i-1})_i\big)^2 \geq \big((Ax^{i-1})_i - b_i\big)^2.$

We conclude that the right side of the equation in Fact 5a is nonnegative, as claimed.

It will be important in subsequent discussions to know under what conditions the right side of this equation is zero, so we consider that now. We then have

$\big((Ax^{i-1})_i - (Ay^{i-1})_i\big)^2 - \big(((Ax^{i-1})_i - b_i)_+ - ((Ay^{i-1})_i - b_i)_+\big)^2 = 0$

for each i separately, since each of these terms is nonnegative, as we have just seen.

In case 1) above this difference is already zero, as we just saw. In case 2) this difference reduces to $\big((Ax^{i-1})_i - (Ay^{i-1})_i\big)^2$, which then is zero precisely when $(Ax^{i-1})_i = (Ay^{i-1})_i$. In case 3) the difference becomes

$\big((Ax^{i-1})_i - (Ay^{i-1})_i\big)^2 - \big((Ax^{i-1})_i - b_i\big)^2,$

which equals

$\big((Ax^{i-1})_i - (Ay^{i-1})_i + (Ax^{i-1})_i - b_i\big)\big(b_i - (Ay^{i-1})_i\big).$


Since this is zero, it follows that $(Ay^{i-1})_i = b_i$, which contradicts our assumptions in this case. We conclude therefore that the difference of sums in Fact 5a is zero if and only if, for all i, either both $(Ax^{i-1})_i \geq b_i$ and $(Ay^{i-1})_i \geq b_i$, or $(Ax^{i-1})_i = (Ay^{i-1})_i < b_i$.

22.3.1 When Ax ≥ b is Consistent

We now prove the following result.

Theorem 22.3 Let Ax ≥ b. Let  x0 be arbitrary and let {xk} be generated by equation (22.2). Then the sequence {||x − xk||} is decreasing and the sequence  {xk} converges to a solution of  Ax ≥ b.

Proof: Let Ax ≥ b. When we apply the AMS algorithm beginning at x we obtain x again at each step. Therefore, by Fact 5a and the discussion that followed, with $y^0 = x$ and writing i = i(k), we have

$\|x^k - x\|^2 - \|x^{k+1} - x\|^2 = \big((Ax^k)_i - (Ax)_i\big)^2 - \big(((Ax^k)_i - b_i)_+ - (Ax)_i + b_i\big)^2 \geq 0. \qquad (22.3)$

Therefore the sequence $\{\|x^k - x\|\}$ is decreasing and so $\{x^k\}$ is bounded; let $x^{*,0}$ be a cluster point.

The sequence defined by the right side of Equation (22.3) above converges to zero. It follows from the discussion following Fact 5a that $Ax^{*,0} \geq b$. Continuing as in the case of Ax = b, we have that the sequence $\{x^k\}$ converges to $x^{*,0}$. In general it is not the case that $x^{*,0}$ is the solution of Ax ≥ b closest to $x^0$.

Now we turn to the inconsistent case.

22.3.2 When Ax ≥ b is Inconsistent

In the inconsistent case the sequence $\{x^k\}$ will not converge, since any limit would be a solution. However, we do have the following result.

Theorem 22.4 Let  A be  I  by  J , with  I > J  and  A with full rank. Let x0 be arbitrary. The sequence  {xrI } converges to a limit  x∗,0. Beginning the AMS algorithm at  x∗,0 we obtain  x∗,k, for  k = 1, 2,... . For each fixed i ∈ {0, 1, 2,...,I }, the subsequence {xrI +i} converges to x∗,i and x∗,I  = x∗,0.

We start by showing that the sequence {xrI } is bounded.

Proposition 22.2 The sequence  {xrI } is bounded.

Proof: Assume that the sequence $\{x^{rI}\}$ is unbounded. We first show that we can select a subsequence $\{x^{r_tI}\}$ with the properties $\|x^{r_tI}\| \geq t$ and $\|w^{r_t}\| < \|b\|$, for t = 1, 2, ....


Assume that we have selected $x^{r_tI}$, with the properties $\|x^{r_tI}\| \geq t$ and $\|w^{r_t}\| < \|b\|$; we show how to select $x^{r_{t+1}I}$. Pick an integer s > 0 such that

$\|x^{sI}\| \geq \|x^{r_tI}\| + B + 1,$

where B > 0 is as in Fact 4a. With $n + r_t = s$ let $m \geq 0$ be the smallest integer for which

$\|x^{(r_t+n-m-1)I}\| < \|x^{sI}\| \leq \|x^{(r_t+n-m)I}\|.$

Then $\|w^{r_t+n-m-1}\| < \|b\|$. Let $x^{r_{t+1}I} = x^{(r_t+n-m-1)I}$. Then we have

$\|x^{r_{t+1}I}\| \geq \|x^{(r_t+n-m)I}\| - B \geq \|x^{sI}\| - B \geq \|x^{r_tI}\| + B + 1 - B \geq t + 1.$

This gives us the desired subsequence.

For every k = 0, 1, ... let $z^{k+1}$ be the metric projection of $x^{k+1}$ onto the hyperplane $K_{i(k)}$. Then $z^{k+1} = x^{k+1} - p^{i(k)}$ if $(Ax^k)_i \leq b_i$ and $z^{k+1} = x^{k+1} - (Ax^k)_iA^i$ if not; here $A^i$ is the i-th column of $A^T$. Then $z^{k+1} \in K_{i(k)}$. For $z^{k+1} \neq 0$ let $u^{k+1} = z^{k+1}/\|z^{k+1}\|$. Let $u^{*,0}$ be a cluster point of $\{u^{r_tI}\}$; replacing $\{x^{r_tI}\}$ with a subsequence if necessary, assume that the sequence $\{u^{r_tI}\}$ converges to $u^{*,0}$. Then let $u^{*,1}$ be a cluster point of $\{u^{r_tI+1}\}$; again, assume the sequence $\{u^{r_tI+1}\}$ converges to $u^{*,1}$. Continuing in this manner, we have $\{u^{r_tI+m}\}$ converging to $u^{*,m}$ for m = 0, 1, 2, .... Since $\|w^{r_t}\| < \|b\|$, we have, by Fact 3a, that $\{z^{r_tI+i-1} - z^{r_tI+i}\}$ is bounded for each i. Now we have

$\|z^{r_tI+i-1} - z^{r_tI+i}\| \geq \|z^{r_tI+i-1}\|\,\big\|u^{r_tI+i-1} - \langle u^{r_tI+i-1}, u^{r_tI+i}\rangle u^{r_tI+i}\big\|.$

||zrtI +i−1 − zrtI +i|| ≥ ||zrtI +i−1||||urtI +i−1 − urtI +i−1, urt+I +iurtI +i||.

The left side is bounded. We consider the sequence ||zrtI +i−1|| in two cases:1) the sequence is unbounded; 2) the sequence is bounded.

In the first case, it follows, as in the case of  Ax = b, that u∗,i−1

= u∗,i

or u∗,i−1 = −u∗,i. In the second case we must have (AxrtI +i−1)i > bi for tsufficiently large, so that, from some point on, we have xrtI +i−1 = xrtI +i,in which case we have u∗,i−1 = u∗,i. So we conclude that u∗,0 is in thenull space of  A, which is a contradiction. This concludes the proof of theproposition.

Proof of Theorem 22.4: Let $x^{*,0}$ be a cluster point of $\{x^{rI}\}$. Beginning the AMS iteration (22.2) at $x^{*,0}$ we obtain $x^{*,m}$, for m = 0, 1, 2, .... From Fact 5a it is easily seen that the sequence $\{\|x^{rI} - x^{(r+1)I}\|\}$ is decreasing and that the sequence

$\Big\{\sum_{i=1}^{I}\big((Ax^{(r-1)I+i-1})_i - (Ax^{rI+i-1})_i\big)^2 - \sum_{i=1}^{I}\big(((Ax^{(r-1)I+i-1})_i - b_i)_+ - ((Ax^{rI+i-1})_i - b_i)_+\big)^2\Big\} \to 0.$


Again, by the discussion following Fact 5a, we conclude one of two things: either Case (1): $(Ax^{*,i-1})_i = (Ax^{*,jI+i-1})_i$ for each j = 1, 2, ..., or Case (2): $(Ax^{*,i-1})_i > b_i$ and, for each j = 1, 2, ..., $(Ax^{*,jI+i-1})_i > b_i$. Let $A^i$ denote the i-th column of $A^T$. As the AMS iteration proceeds from $x^{*,0}$ to $x^{*,I}$, from $x^{*,I}$ to $x^{*,2I}$ and, in general, from $x^{*,jI}$ to $x^{*,(j+1)I}$ we have either $x^{*,i-1} - x^{*,i} = 0$ and $x^{*,jI+i-1} - x^{*,jI+i} = 0$, for each j = 1, 2, ..., which happens in Case (2), or $x^{*,i-1} - x^{*,i} = x^{*,jI+i-1} - x^{*,jI+i} = (b_i - (Ax^{*,i-1})_i)A^i$, for j = 1, 2, ..., which happens in Case (1). It follows, therefore, that

$x^{*,0} - x^{*,I} = x^{*,jI} - x^{*,(j+1)I}$

for j = 1, 2, .... Since the original sequence $\{x^{rI}\}$ is bounded, we have

$\|x^{*,0} - x^{*,jI}\| \leq \|x^{*,0}\| + \|x^{*,jI}\| \leq K$

for some K and all j = 1, 2, .... But we also have

$\|x^{*,0} - x^{*,jI}\| = j\,\|x^{*,0} - x^{*,I}\|.$

We conclude that $\|x^{*,0} - x^{*,I}\| = 0$, that is, $x^{*,0} = x^{*,I}$.

From Fact 5a, using $y^0 = x^{*,0}$, it follows that the sequence $\{\|x^{*,0} - x^{rI}\|\}$ is decreasing; but a subsequence converges to zero, so the entire sequence converges to zero and $\{x^{rI}\}$ converges to $x^{*,0}$. This completes the proof of Theorem 22.4.


Chapter 23

The Split Feasibility Problem

The split feasibility problem (SFP) [39] is to find c ∈ C with Ac ∈ Q, if such points exist, where A is a real I by J matrix and C and Q are nonempty, closed convex sets in $R^J$ and $R^I$, respectively. In this chapter we discuss the CQ algorithm for solving the SFP, as well as recent extensions and applications.

23.1 The CQ Algorithm

In [30] the CQ algorithm for solving the SFP was presented. It has the iterative step

$x^{k+1} = P_C\big(x^k - \gamma A^T(I - P_Q)Ax^k\big), \qquad (23.1)$

where I is the identity operator and $\gamma \in (0, 2/\rho(A^TA))$, for $\rho(A^TA)$ the spectral radius of the matrix $A^TA$, which is also its largest eigenvalue.

The CQ algorithm converges to a solution of the SFP, for any starting vector $x^0$, whenever the SFP has solutions. When the SFP has no solutions, the CQ algorithm converges to a minimizer of the function

$f(x) = \frac{1}{2}\|P_QAx - Ax\|^2$

over the set C, provided such constrained minimizers exist. Therefore the CQ algorithm is an iterative constrained optimization method. As shown in [31], convergence of the CQ algorithm is a consequence of Theorem 4.1.
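A minimal sketch of the CQ iteration (23.1): the projections onto C and Q are supplied by the caller as functions, and the default choice of $\gamma$ below is our own assumption within the admissible interval $(0, 2/\rho(A^TA))$.

```python
import numpy as np

def cq_algorithm(A, proj_C, proj_Q, x0, gamma=None, n_iter=500):
    """CQ iteration (23.1); proj_C and proj_Q are the metric projections."""
    if gamma is None:
        L = np.linalg.norm(A, 2) ** 2        # rho(A^T A) = largest eigenvalue
        gamma = 1.0 / L                      # any gamma in (0, 2/L) works
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        Ax = A @ x
        x = proj_C(x - gamma * A.T @ (Ax - proj_Q(Ax)))
    return x
```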

The function f(x) is convex and differentiable on $R^J$ and its derivative is the operator

$\nabla f(x) = A^T(I - P_Q)Ax;$


see [3].

Lemma 23.1 The derivative operator $\nabla f$ is $\lambda$-Lipschitz continuous for $\lambda = \rho(A^TA)$, therefore it is $\nu$-ism for $\nu = 1/\lambda$.

Proof: We have

$\|\nabla f(x) - \nabla f(y)\|^2 = \|A^T(I - P_Q)Ax - A^T(I - P_Q)Ay\|^2 \leq \lambda\|(I - P_Q)Ax - (I - P_Q)Ay\|^2.$

Also

$\|(I - P_Q)Ax - (I - P_Q)Ay\|^2 = \|Ax - Ay\|^2 + \|P_QAx - P_QAy\|^2 - 2\langle P_QAx - P_QAy, Ax - Ay\rangle$

and, since $P_Q$ is fne,

$\langle P_QAx - P_QAy, Ax - Ay\rangle \geq \|P_QAx - P_QAy\|^2.$

Therefore,

$\|\nabla f(x) - \nabla f(y)\|^2 \leq \lambda\big(\|Ax - Ay\|^2 - \|P_QAx - P_QAy\|^2\big) \leq \lambda\|Ax - Ay\|^2 \leq \lambda^2\|x - y\|^2.$

This completes the proof.

If $\gamma \in (0, 2/\lambda)$ then $B = P_C(I - \gamma A^T(I - P_Q)A)$ is av and, by Theorem 4.1, the orbit sequence $\{B^kx\}$ converges to a fixed point of B, whenever such points exist. If z is a fixed point of B, then $z = P_C(z - \gamma A^T(I - P_Q)Az)$. Therefore, for any c in C we have

$\langle c - z,\, z - (z - \gamma A^T(I - P_Q)Az)\rangle \geq 0.$

This tells us that

$\langle c - z,\, A^T(I - P_Q)Az\rangle \geq 0,$

which means that z minimizes f(x) relative to the set C.

The CQ algorithm employs the relaxation parameter $\gamma$ in the interval (0, 2/L), where L is the largest eigenvalue of the matrix $A^TA$. Choosing the best relaxation parameter in any algorithm is a nontrivial procedure. Generally speaking, we want to select $\gamma$ near to 1/L. We saw a simple estimate for L in our discussion of singular values of sparse matrices: if A is normalized so that each row has length one, then the spectral radius of $A^TA$ does not exceed the maximum number of nonzero elements in any column of A. A similar upper bound on $\rho(A^TA)$ was obtained for non-normalized, sparse A.


23.2 Particular Cases of the CQ Algorithm

It is easy to find important examples of the SFP: if  C  ⊆ RJ  and Q = {b}then solving the SFP amounts to solving the linear system of equations

Ax = b; if  C  is a proper subset of  RJ , such as the nonnegative cone, thenwe seek solutions of  Ax = b that lie within C , if there are any. Generally,we cannot solve the SFP in closed form and iterative methods are needed.

A number of well known iterative algorithms, such as the Landweber[87] and projected Landweber methods (see [11]), are particular cases of the CQ algorithm.

23.2.1 The Landweber algorithm

With $x^0$ arbitrary and k = 0, 1, ... let

$x^{k+1} = x^k + \gamma A^T(b - Ax^k). \qquad (23.2)$

This is the Landweber algorithm.

23.2.2 The Projected Landweber Algorithm

For a general nonempty closed convex C, $x^0$ arbitrary, and k = 0, 1, ..., the projected Landweber method for finding a solution of Ax = b in C has the iterative step

$x^{k+1} = P_C\big(x^k + \gamma A^T(b - Ax^k)\big). \qquad (23.3)$
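A minimal sketch of the Landweber iteration and of its projected variant when a projection onto C is supplied; the function name and loop count are ours, and $\gamma$ should lie in $(0, 2/\rho(A^TA))$.

```python
import numpy as np

def landweber(A, b, x0, gamma, n_iter=500, proj_C=None):
    """Landweber iteration; pass e.g. proj_C=lambda x: np.maximum(x, 0)
    to obtain the projected Landweber method."""
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        x = x + gamma * A.T @ (b - A @ x)
        if proj_C is not None:
            x = proj_C(x)          # projected Landweber step
    return x
```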

23.2.3 Convergence of the Landweber Algorithms

From the convergence theorem for the CQ algorithm it follows that theLandweber algorithm converges to a solution of  Ax = b and the projectedLandweber algorithm converges to a solution of  Ax = b in C , wheneversuch solutions exist. When there are no solutions of the desired type, theLandweber algorithm converges to a least squares approximate solutionof  Ax = b, while the projected Landweber algorithm will converge to aminimizer, over the set C , of the function ||b − Ax||, whenever such aminimizer exists.

23.2.4 The Simultaneous ART (SART)

Another example of the CQ algorithm is the simultaneous algebraic reconstruction technique (SART) [2] for solving Ax = b, for nonnegative matrix A. Let A be an I by J matrix with nonnegative entries. Let $A_{i+} > 0$ be the sum of the entries in the ith row of A and $A_{+j} > 0$ be the sum of the


entries in the jth column of A. Consider the (possibly inconsistent) system Ax = b. The SART algorithm has the following iterative step:

$x_j^{k+1} = x_j^k + \frac{1}{A_{+j}}\sum_{i=1}^{I} A_{ij}\big(b_i - (Ax^k)_i\big)/A_{i+}.$

We make the following changes of variables:

$B_{ij} = A_{ij}/\big((A_{i+})^{1/2}(A_{+j})^{1/2}\big), \quad z_j = x_j(A_{+j})^{1/2}, \quad c_i = b_i/(A_{i+})^{1/2}.$

Then the SART iterative step can be written as

$z^{k+1} = z^k + B^T(c - Bz^k).$

This is a particular case of the Landweber algorithm, with $\gamma = 1$. The convergence of SART follows from Theorem 4.1, once we know that the largest eigenvalue of $B^TB$ is less than two; in fact, we show that it is one [30].

If $B^TB$ had an eigenvalue greater than one and some of the entries of A are zero, then, replacing these zero entries with very small positive entries, we could obtain a new A whose associated $B^TB$ also had an eigenvalue greater than one. Therefore, we assume, without loss of generality, that A has all positive entries. Since the new $B^TB$ also has only positive entries, this matrix is irreducible and the Perron-Frobenius theorem applies. We shall use this to complete the proof.

Let $u = (u_1, ..., u_J)^T$ with $u_j = (A_{+j})^{1/2}$ and $v = (v_1, ..., v_I)^T$, with $v_i = (A_{i+})^{1/2}$. Then we have $Bu = v$ and $B^Tv = u$; that is, u is an eigenvector of $B^TB$ with associated eigenvalue equal to one, and all the entries of u are positive, by assumption. The Perron-Frobenius theorem applies and tells us that the eigenvector associated with the largest eigenvalue has all positive entries. Since the matrix $B^TB$ is symmetric its eigenvectors are orthogonal; therefore u itself must be an eigenvector associated with the largest eigenvalue of $B^TB$. The convergence of SART follows.
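A minimal sketch of the SART step itself, assuming A is nonnegative with no zero rows or columns (so the sums $A_{i+}$ and $A_{+j}$ are positive); the function name and iteration count are ours.

```python
import numpy as np

def sart(A, b, x0, n_iter=200):
    """SART iteration for a nonnegative matrix A with positive row/column sums."""
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1)          # A_{i+}
    col_sums = A.sum(axis=0)          # A_{+j}
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        residual = (b - A @ x) / row_sums
        x = x + (A.T @ residual) / col_sums
    return x
```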

23.2.5 More on the CQ Algorithm

One of the obvious drawbacks to the use of the CQ algorithm is that we would need the projections $P_C$ and $P_Q$ to be easily calculated. Several authors have offered remedies for that problem, using approximations of the convex sets by the intersection of hyperplanes and orthogonal projections onto those hyperplanes [116].

In a recent paper [40] Censor et al discuss the application of the CQ algorithm to the problem of intensity-modulated radiation therapy treatment planning.


Chapter 24

Constrained Iteration Methods

The ART and its simultaneous and block-iterative versions are designed tosolve general systems of linear equations Ax = b. The SMART, EMMLand RBI methods require that the entries of  A be nonnegative, those of  bpositive and produce nonnegative x. In this chapter we present variationsof the SMART and EMML that impose the constraints uj ≤ xj ≤ vj ,where the uj and vj are selected lower and upper bounds on the individualentries xj .

24.1 Modifying the KL distance

The SMART, EMML and RBI methods are based on the Kullback-Leibler

distance between nonnegative vectors. To impose more general constraintson the entries of x we derive algorithms based on shifted KL distances, alsocalled Fermi-Dirac generalized entropies .

For a fixed real vector u, the shifted KL distance KL(x − u, z − u) isdefined for vectors x and z having xj ≥ uj and zj ≥ uj . Similarly, theshifted distance KL(v − x, v − z) applies only to those vectors x and z forwhich xj ≤ vj and zj ≤ vj . For uj ≤ vj , the combined distance

KL(x − u, z − u) + KL(v − x, v − z)

is restricted to those x and z whose entries $x_j$ and $z_j$ lie in the interval $[u_j, v_j]$. Our objective is to mimic the derivation of the SMART, EMML and RBI methods, replacing KL distances with shifted KL distances, to obtain algorithms that enforce the constraints $u_j \leq x_j \leq v_j$, for each j. The algorithms that result are the ABMART and ABEMML block-iterative methods. These algorithms were originally presented in [25], in which the


vectors u and v were called a and b, hence the names of the algorithms. Throughout this chapter we shall assume that the entries of the matrix A are nonnegative. We shall denote by $B_n$, n = 1, ..., N a partition of the index set $\{i = 1, ..., I\}$ into blocks. For k = 0, 1, ... let $n(k) = k(\mathrm{mod}\ N) + 1$.

The projected Landweber algorithm can also be used to impose therestrictions uj ≤ xj ≤ vj ; however, the projection step in that algorithmis implemented by clipping, or setting equal to uj or vj values of  xj thatwould otherwise fall outside the desired range. The result is that the valuesuj and vj can occur more frequently than may be desired. One advantageof the AB methods is that the values uj and vj represent barriers thatcan only be reached in the limit and are never taken on at any step of theiteration.

24.2 The ABMART Algorithm

We assume that $(Au)_i \leq b_i \leq (Av)_i$ and seek a solution of Ax = b with $u_j \leq x_j \leq v_j$, for each j. The algorithm begins with an initial vector $x^0$ satisfying $u_j \leq x^0_j \leq v_j$, for each j. Having calculated $x^k$, we take

$x_j^{k+1} = \alpha_j^k v_j + (1 - \alpha_j^k)u_j, \qquad (24.1)$

with n = n(k),

$\alpha_j^k = \frac{c_j^k\prod^n(d_i^k)^{A_{ij}}}{1 + c_j^k\prod^n(d_i^k)^{A_{ij}}}, \qquad (24.2)$

$c_j^k = \frac{x_j^k - u_j}{v_j - x_j^k}, \qquad (24.3)$

and

$d_i^k = \frac{\big(b_i - (Au)_i\big)\big((Av)_i - (Ax^k)_i\big)}{\big((Av)_i - b_i\big)\big((Ax^k)_i - (Au)_i\big)}, \qquad (24.4)$

where $\prod^n$ denotes the product over those indices i in $B_{n(k)}$. Notice that, at each step of the iteration, $x_j^k$ is a convex combination of the endpoints $u_j$ and $v_j$, so that $x_j^k$ lies in the interval $[u_j, v_j]$.
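A minimal sketch of one way to implement the ABMART iteration (24.1)-(24.4); the function name, iteration count, and block representation (a list of index arrays) are our own.

```python
import numpy as np

def abmart(A, b, u, v, x0, blocks, n_iter=100):
    """ABMART sweeps; assumes A >= 0, (Au)_i <= b_i <= (Av)_i and u < x0 < v."""
    x = x0.astype(float).copy()
    Au, Av = A @ u, A @ v
    N = len(blocks)
    for k in range(n_iter):
        Bn = blocks[k % N]                       # current block B_{n(k)}
        Ax = A @ x
        d = ((b[Bn] - Au[Bn]) * (Av[Bn] - Ax[Bn])) / \
            ((Av[Bn] - b[Bn]) * (Ax[Bn] - Au[Bn]))
        c = (x - u) / (v - x)
        prod = np.prod(d[:, None] ** A[Bn, :], axis=0)   # prod_i d_i^{A_ij}
        alpha = c * prod / (1.0 + c * prod)
        x = alpha * v + (1.0 - alpha) * u
    return x
```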

We have the following theorem concerning the convergence of the ABMART algorithm:

Theorem 24.1 If there is a solution of the system Ax = b that satisfies the

constraints  uj ≤ xj ≤ vj for each  j, then, for any  N  and any choice of the blocks  Bn, the ABMART sequence converges to that constrained solution 


of Ax = b for which the Fermi-Dirac generalized entropic distance from x to $x^0$,

KL(x − u, x0 − u) + KL(v − x, v − x0),

is minimized. If there is no constrained solution of  Ax = b, then, for N  = 1, the ABMART sequence converges to the minimizer of 

KL(Ax − Au,b − Au) + KL(Av − Ax, Av − b)

 for which KL(x − u, x0 − u) + KL(v − x, v − x0)

is minimized.

The proof is similar to that for RBI-SMART and is found in [25].

24.3 The ABEMML Algorithm

We make the same assumptions as in the previous section. The iterative step of the ABEMML algorithm is

$x_j^{k+1} = \alpha_j^k v_j + (1 - \alpha_j^k)u_j, \qquad (24.5)$

where

$\alpha_j^k = \gamma_j^k/d_j^k, \qquad (24.6)$

$\gamma_j^k = (x_j^k - u_j)e_j^k, \qquad (24.7)$

$\beta_j^k = (v_j - x_j^k)f_j^k, \qquad (24.8)$

$d_j^k = \gamma_j^k + \beta_j^k, \qquad (24.9)$

$e_j^k = \Big(1 - \sum_{i\in B_n}A_{ij}\Big) + \sum_{i\in B_n}A_{ij}\,\frac{b_i - (Au)_i}{(Ax^k)_i - (Au)_i}, \qquad (24.10)$

and

$f_j^k = \Big(1 - \sum_{i\in B_n}A_{ij}\Big) + \sum_{i\in B_n}A_{ij}\,\frac{(Av)_i - b_i}{(Av)_i - (Ax^k)_i}. \qquad (24.11)$
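A minimal sketch of one ABEMML sweep following Equations (24.5)-(24.11), under the same assumptions and block representation as in the ABMART sketch above.

```python
import numpy as np

def abemml(A, b, u, v, x0, blocks, n_iter=100):
    """ABEMML sweeps; assumes A >= 0, (Au)_i <= b_i <= (Av)_i and u < x0 < v."""
    x = x0.astype(float).copy()
    Au, Av = A @ u, A @ v
    N = len(blocks)
    for k in range(n_iter):
        Bn = blocks[k % N]
        Ax = A @ x
        col = A[Bn, :].sum(axis=0)               # sum_{i in B_n} A_ij
        e = (1.0 - col) + A[Bn, :].T @ ((b[Bn] - Au[Bn]) / (Ax[Bn] - Au[Bn]))
        f = (1.0 - col) + A[Bn, :].T @ ((Av[Bn] - b[Bn]) / (Av[Bn] - Ax[Bn]))
        gamma = (x - u) * e
        beta = (v - x) * f
        alpha = gamma / (gamma + beta)           # alpha = gamma / d
        x = alpha * v + (1.0 - alpha) * u
    return x
```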

We have the following theorem concerning the convergence of the ABEMML algorithm:


Theorem 24.2 If there is a solution of the system Ax = b that satisfies the constraints $u_j \leq x_j \leq v_j$ for each j, then, for any N and any choice of the blocks $B_n$, the ABEMML sequence converges to such a constrained solution of Ax = b. If there is no constrained solution of Ax = b, then, for N = 1, the ABEMML sequence converges to a constrained minimizer of

KL(Ax − Au,b − Au) + KL(Av − Ax,Av − b).

The proof is similar to that for RBI-EMML and is to be found in [25]. In contrast to the ABMART theorem, this is all we can say about the limits of the ABEMML sequences.

Open Question: How does the limit of the ABEMML iterative sequence depend, in the consistent case, on the choice of blocks, and, in general, on the choice of $x^0$?


Chapter 25

Fourier Transform Estimation

In many remote-sensing problems, the measured data is related to the function to be imaged by Fourier transformation. In the Fourier approach to tomography, the data are often viewed as line integrals through the object of interest. These line integrals can then be converted into values of the Fourier transform of the object function. In magnetic-resonance imaging (MRI), adjustments to the external magnetic field cause the measured data to be Fourier-related to the desired proton-density function. In such applications, the imaging problem becomes a problem of estimating a function from finitely many noisy values of its Fourier transform. Because the data are finite and noisy, the achievable resolution is limited; to overcome these limitations, one can use iterative and non-iterative methods for incorporating prior knowledge and regularization. Data-extrapolation algorithms form one class of such methods.

We focus on the use of iterative algorithms for improving resolution through extrapolation of Fourier-transform data. The reader should consult the appendices for brief discussion of some of the applications of these methods.

25.1 The Limited-Fourier-Data Problem

For notational convenience, we shall discuss only the one-dimensional case, involving the estimation of the (possibly complex-valued) function f(x) of the real variable x, from finitely many values $F(\omega_n)$, n = 1, ..., N of its Fourier transform. Here we adopt the definitions

$F(\omega) = \int f(x)e^{ix\omega}\,dx,$


and

$f(x) = \frac{1}{2\pi}\int F(\omega)e^{-ix\omega}\,d\omega.$

Because it is the case in the applications of interest to us here, we shallassume that the object function has bounded support, that is, there isA > 0, such that f (x) = 0 for |x| > A.

The values ω = ωn at which we have measured the function F (ω) maybe structured in some way; they may be equi-spaced along a line, or, in thehigher-dimensional case, arranged in a cartesian grid pattern, as in MRI.According to the Central Slice Theorem, the Fourier data in tomographylie along rays through the origin. Nevertheless, in what follows, we shallnot assume any special arrangement of these data points.

Because the data are finite, there are infinitely many functions f (x)consistent with the data. We need some guidelines to follow in selectinga best estimate of the true f (x). First, we must remember that the datavalues are noisy, so we want to avoid overfitting the estimate to noisy

data. This means that we should include regularization in whatever methodwe adopt. Second, the limited data is often insufficient to provide thedesired resolution, so we need to incorporate additional prior knowledgeabout f (x), such as non-negativity, upper and lower bounds on its values,its support, its overall shape, and so on. Third, once we have selectedprior information to include, we should be conservative in choosing anestimate consistent with that information. This may involve the use of constrained minimum-norm solutions. Fourth, we should not expect ourprior information to be perfectly accurate, so our estimate should not beoverly sensitive to slight changes in the prior information. Finally, theestimate we use will be one for which there are good algorithms for itscalculation.

25.2 Minimum-Norm Estimation

To illustrate the notion of minimum-norm estimation, we begin with the finite-dimensional problem of solving an underdetermined system of linear equations, Ax = b, where A is a real I by J matrix with J > I and $AA^T$ is invertible.

25.2.1 The Minimum-Norm Solution of Ax = b

Each equation can be written as

$b_i = (a^i)^Tx = \langle x, a^i\rangle,$

where the vector $a^i$ is the ith column of the matrix $A^T$ and $\langle u, v\rangle$ denotes the inner, or dot, product of the vectors u and v.


Exercise 25.1 Show that every vector  x in  RJ  can be written as 

x = AT z + w, (25.1)

with  Aw = 0 and ||x||2 = ||AT z||2 + ||w||2.

Consequently, Ax = b if and only if  A(AT z) = b and  AT z is the solution having the smallest norm. This minimum-norm solution  x = AT z can be 

 found explicitly; it is 

x = AT z = AT (AAT )−1b. (25.2)

Hint: multiply both sides of Equation (25.1) by  A and solve for  z.
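A minimal sketch of the minimum-norm solution (25.2), assuming $AA^T$ is invertible; the function name is ours.

```python
import numpy as np

def minimum_norm_solution(A, b):
    """Minimum-norm solution x = A^T (A A^T)^{-1} b of Equation (25.2)."""
    z = np.linalg.solve(A @ A.T, b)
    return A.T @ z
```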

It follows from this exercise that the minimum-norm solution x of Ax = b has the form $x = A^Tz$, which means that x is a linear combination of the $a^i$:

$x = \sum_{i=1}^{I} z_ia^i.$

25.2.2 Minimum-Weighted-Norm Solution of Ax = b

As we shall see later, it is sometimes convenient to introduce a new norm for the vectors. Let Q be a J by J symmetric positive-definite matrix and define

$\|x\|_Q^2 = x^TQx.$

With Q = C T C , where C  is the positive-definite symmetric square-root of Q, we can write

$\|x\|_Q^2 = \|y\|^2,$

for y = Cx. Now suppose that we want to find the solution of Ax = b for which $\|x\|_Q^2$ is minimum. We write

Ax = b as $AC^{-1}y = b$, so that, from Equation (25.2), we find that the solution y with minimum norm is

$y = (AC^{-1})^T\big(AC^{-1}(AC^{-1})^T\big)^{-1}b,$

or

$y = (AC^{-1})^T(AQ^{-1}A^T)^{-1}b,$


so that the xQ with minimum weighted norm is

$x_Q = C^{-1}y = Q^{-1}A^T(AQ^{-1}A^T)^{-1}b. \qquad (25.3)$

Notice that, writing

$\langle u, v\rangle_Q = u^TQv,$

we find that

$b_i = \langle Q^{-1}a^i, x_Q\rangle_Q,$

and the minimum-weighted-norm solution of Ax = b is a linear combination of the columns $g^i$ of $Q^{-1}A^T$, that is,

$x_Q = \sum_{i=1}^{I} d_ig^i,$

where

$d_i = \big((AQ^{-1}A^T)^{-1}b\big)_i,$

for each i = 1, ..., I.
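A minimal sketch of the minimum-weighted-norm solution (25.3), for a symmetric positive-definite Q; the function name is ours.

```python
import numpy as np

def minimum_weighted_norm_solution(A, b, Q):
    """x_Q = Q^{-1} A^T (A Q^{-1} A^T)^{-1} b, as in Equation (25.3)."""
    QinvAT = np.linalg.solve(Q, A.T)            # Q^{-1} A^T
    d = np.linalg.solve(A @ QinvAT, b)          # (A Q^{-1} A^T)^{-1} b
    return QinvAT @ d
```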

25.3 Fourier-Transform Data

Returning now to the case in which we have finitely many values of the Fourier transform of f(x), we write

$F(\omega) = \int f(x)e^{ix\omega}\,dx = \langle e_\omega, f\rangle,$

where $e_\omega(x) = e^{-ix\omega}$ and

$\langle g, h\rangle = \int \overline{g(x)}\,h(x)\,dx.$

The norm of a function f(x) is then

$\|f\| = \sqrt{\langle f, f\rangle} = \sqrt{\int |f(x)|^2\,dx}.$

25.3.1 The Minimum-Norm Estimate

Arguing as we did in the finite-dimensional case, we conclude that the minimum-norm solution of the data-consistency equations

$F(\omega_n) = \langle e_{\omega_n}, f\rangle, \quad n = 1, ..., N,$


has the form

$f(x) = \sum_{n=1}^{N} a_ne^{-ix\omega_n}.$

If the integration is assumed to extend over the whole real line, the functions $e_\omega(x)$ are mutually orthogonal and so

$a_n = \frac{1}{2\pi}F(\omega_n). \qquad (25.4)$

In most applications, however, the function f(x) is known to have finite support.

Exercise 25.2 Show that, if f(x) = 0 for x outside the interval [a, b], then the coefficients $a_n$ satisfy the system of linear equations

$F(\omega_n) = \sum_{m=1}^{N} G_{nm}a_m,$

with

$G_{nm} = \int_a^b e^{ix(\omega_n - \omega_m)}\,dx.$

For example, suppose that $[a, b] = [-\pi, \pi]$ and

$\omega_n = -\pi + \frac{2\pi}{N}n,$

for n = 1, ..., N.

Exercise 25.3 Show that, in this example, $G_{nn} = 2\pi$ and $G_{nm} = 0$ for $n \neq m$. Therefore, for this special case, we again have

$a_n = \frac{1}{2\pi}F(\omega_n).$

25.3.2 Minimum-Weighted-Norm Estimates

Let $p(x) \geq 0$ be a weight function. Let

$\langle g, h\rangle_p = \int \overline{g(x)}\,h(x)\,p(x)^{-1}dx,$

with the understanding that $p(x)^{-1} = 0$ outside of the support of p(x). The associated weighted norm is then

$\|f\|_p = \sqrt{\int |f(x)|^2\,p(x)^{-1}dx}.$


We can then write

$F(\omega_n) = \langle pe_{\omega_n}, f\rangle_p = \int \overline{p(x)e^{-ix\omega_n}}\,f(x)\,p(x)^{-1}dx.$

It follows that the function consistent with the data and having the minimum weighted norm has the form

$f_p(x) = p(x)\sum_{n=1}^{N} b_ne^{-ix\omega_n}. \qquad (25.5)$

Exercise 25.4 Show that the coefficients $b_n$ satisfy the system of linear equations

$F(\omega_n) = \sum_{m=1}^{N} P_{nm}b_m, \qquad (25.6)$

with

$P_{nm} = \int p(x)e^{ix(\omega_n - \omega_m)}\,dx,$

for m, n = 1, ..., N.

Whenever we have prior information about the support of  f (x), or aboutthe shape of |f (x)|, we can incorporate this information through our choiceof the weight function p(x). In this way, the prior information becomespart of the estimate, through the first factor in Equation (25.5), with thesecond factor providing information gathered from the measurement data.This minimum-weighted-norm estimate of  f (x) is called the PDFT, and isdiscussed in more detail in [33].
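A minimal sketch of computing the 1-D PDFT estimate from Equations (25.5)-(25.6). The entries of P are approximated here by numerical quadrature on the evaluation grid, rather than analytically, and the function name and arguments are our own.

```python
import numpy as np

def pdft_estimate(F_data, omegas, p, x_grid):
    """PDFT estimate f_p on x_grid from data F(omega_n); p is the weight function."""
    omegas = np.asarray(omegas, dtype=float)
    x = np.asarray(x_grid, dtype=float)
    dx = x[1] - x[0]
    px = p(x)
    # P_{nm} = integral of p(x) exp(i x (omega_n - omega_m)) dx  (quadrature)
    diff = omegas[:, None] - omegas[None, :]
    P = (px[None, None, :] * np.exp(1j * diff[:, :, None] * x)).sum(axis=2) * dx
    b = np.linalg.solve(P, np.asarray(F_data, dtype=complex))
    # f_p(x) = p(x) * sum_n b_n exp(-i x omega_n)
    return px * (np.exp(-1j * x[:, None] * omegas[None, :]) @ b)
```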

Once we have $f_p(x)$, we can take its Fourier transform, $F_p(\omega)$, which is then an estimate of $F(\omega)$. Because the coefficients $b_n$ satisfy Equations (25.6), we know that $F_p(\omega_n) = F(\omega_n)$,

for n = 1,...,N . For other values of  ω, the estimate F  p(ω) provides anextrapolation of the data. For this reason, methods such as the PDFT aresometimes called data-extrapolation methods . If  f (x) is supported on aninterval [a, b], then the function F (ω) is said to be band-limited . If [c, d] isan interval containing [a, b] and p(x) = 1, for x in [c, d], and p(x) = 0 other-wise, then the PDFT estimate is a non-iterative version of the Gerchberg-Papoulis band-limited extrapolation estimate of  f (x) (see [33]).

25.3.3 Implementing the PDFT

The PDFT can be extended easily to the estimation of functions of severalvariables. However, there are several difficult steps that can be avoided


by iterative implementation. Even in the one-dimensional case, when thevalues ωn are not equispaced, the calculation of the matrix P  can be messy.In the case of higher dimensions, both calculating P  and solving for thecoefficients can be expensive. In the next section we consider an iterative

implementation that solves both of these problems.

25.4 The Discrete PDFT (DPDFT)

The derivation of the PDFT assumes a function f (x) of one or more con-tinuous real variables, with the data obtained from f (x) by integration.The discrete PDFT (DPDFT) begins with f (x) replaced by a finite vectorf  = (f 1,...,f J )

T  that is a discretization of  f (x); say that f j = f (xj) forsome point xj . The integrals that describe the Fourier transform data canbe replaced by finite sums,

$F(\omega_n) = \sum_{j=1}^{J} f_jE_{nj},$

where $E_{nj} = e^{ix_j\omega_n}$. We have used a Riemann-sum approximation of the integrals here, but other choices are also available. The problem then is to solve this system of equations for the $f_j$.

Since N is fixed, but J is under our control, we select J > N, so that the system becomes underdetermined. Now we can use minimum-norm and minimum-weighted-norm solutions of the finite-dimensional problem to obtain an approximate, discretized PDFT solution.

Since the PDFT is a minimum-weighted-norm solution in the continuous-variable formulation, it is reasonable to let the DPDFT be the corresponding minimum-weighted-norm solution obtained by letting the positive-definite matrix Q be the diagonal matrix having for its jth diagonal entry

Qjj = 1/p(xj),

if  p(xj) > 0, and zero, otherwise.

25.4.1 Calculating the DPDFT

The DPDFT is a minimum-weighted-norm solution, which can be calculated using, say, the ART algorithm. We know that, in the underdetermined case, the ART provides the solution closest to the starting vector, in the sense of the Euclidean distance. We therefore reformulate the system, so that the minimum-weighted-norm solution becomes a minimum-norm solution, as we did earlier, and then begin the ART iteration with zero.


25.4.2 Regularization

We noted earlier that one of the principles guiding the estimation of f(x) from Fourier transform data should be that we do not want to overfit the estimate to noisy data. In the PDFT, this can be avoided by adding a small positive quantity to the main diagonal of the matrix P. In the DPDFT, implemented using ART, we regularize the ART algorithm, as we discussed earlier.


Part VII

Appendices



Chapter 26

Basic Concepts

In iterative methods, we begin with an initial vector, say $x^0$, and, for each nonnegative integer k, we calculate the next vector, $x^{k+1}$, from the current vector $x^k$. The limit of such a sequence of vectors $\{x^k\}$, when the limit exists, is the desired solution to our problem. The fundamental tools we need to understand iterative algorithms are the geometric concepts of distance between vectors and mutual orthogonality of vectors, the algebraic concept of transformation or operator on vectors, and the vector-space notions of subspaces and convex sets.

26.1 The Geometry of Euclidean Space

We denote by $R^J$ the real Euclidean space consisting of all J-dimensional column vectors $x = (x_1, ..., x_J)^T$ with real entries $x_j$; here the superscript T denotes the transpose of the 1 by J matrix (or, row vector) $(x_1, ..., x_J)$. We denote by $C^J$ the collection of all J-dimensional column vectors $x = (x_1, ..., x_J)^\dagger$ with complex entries $x_j$; here the superscript $\dagger$ denotes the conjugate transpose of the 1 by J matrix (or, row vector) $(x_1, ..., x_J)$. When discussing matters that apply to both $R^J$ and $C^J$ we denote the underlying space simply as X.

26.1.1 Inner Products

For $x = (x_1, ..., x_J)^T$ and $y = (y_1, ..., y_J)^T$ in $R^J$, the dot product $x\cdot y$ is defined to be

$x\cdot y = \sum_{j=1}^{J} x_jy_j.$


Note that we can write

x · y = yT x = xT y,

where juxtaposition indicates matrix multiplication. The 2-norm, or Euclidean norm, or Euclidean length, of x is

$\|x\| = \sqrt{x\cdot x} = \sqrt{x^Tx}.$

The Euclidean distance between two vectors x and y in $R^J$ is $\|x - y\|$. As we discuss in the appendix on metric spaces, there are other norms on X; nevertheless, in this chapter $\|x\|$ will denote the 2-norm of x.

For $x = (x_1, ..., x_J)^T$ and $y = (y_1, ..., y_J)^T$ in $C^J$, the dot product $x\cdot y$ is defined to be

$x\cdot y = \sum_{j=1}^{J} x_j\overline{y_j}.$

Note that we can write $x\cdot y = y^\dagger x$. The norm, or Euclidean length, of x is

$\|x\| = \sqrt{x\cdot x} = \sqrt{x^\dagger x}.$

As in the real case, the distance between vectors x and y is $\|x - y\|$.

Both of the spaces $R^J$ and $C^J$, along with their dot products, are examples of finite-dimensional Hilbert space. Much of what follows in these notes applies to both $R^J$ and $C^J$. In such cases, we shall simply refer to the underlying space as X and refer to the associated dot product using the inner product notation $\langle x, y\rangle$.

26.1.2 Cauchy’s Inequality

Cauchy's Inequality, also called the Cauchy-Schwarz Inequality, tells us that

$|\langle x, y\rangle| \leq \|x\|\,\|y\|,$

with equality if and only if $y = \alpha x$, for some scalar $\alpha$.

Proof of Cauchy's inequality: To prove Cauchy's inequality for the complex vector dot product, we write $x\cdot y = |x\cdot y|e^{i\theta}$. Let t be a real variable and consider

$0 \leq \|e^{-i\theta}x - ty\|^2 = (e^{-i\theta}x - ty)\cdot(e^{-i\theta}x - ty)$

$= \|x\|^2 - t\big[(e^{-i\theta}x)\cdot y + y\cdot(e^{-i\theta}x)\big] + t^2\|y\|^2$


$= \|x\|^2 - t\big[(e^{-i\theta}x)\cdot y + \overline{(e^{-i\theta}x)\cdot y}\big] + t^2\|y\|^2$

$= \|x\|^2 - 2\mathrm{Re}\big(te^{-i\theta}(x\cdot y)\big) + t^2\|y\|^2$

$= \|x\|^2 - 2\mathrm{Re}\big(t|x\cdot y|\big) + t^2\|y\|^2 = \|x\|^2 - 2t|x\cdot y| + t^2\|y\|^2.$

This is a nonnegative quadratic polynomial in the variable t, so it cannot have two distinct real roots. Therefore, the discriminant $4|x\cdot y|^2 - 4\|y\|^2\|x\|^2$ must be nonpositive; that is, $|x\cdot y|^2 \leq \|x\|^2\|y\|^2$. This is Cauchy's inequality.

Exercise 26.1 Use Cauchy's inequality to show that

$\|x + y\| \leq \|x\| + \|y\|;$

this is called the triangle inequality.

We say that the vectors x and y are mutually orthogonal if $\langle x, y\rangle = 0$.

26.1.3 Hyperplanes in Euclidean Space

For a fixed column vector a with Euclidean length one and a fixed scalar $\gamma$ the hyperplane determined by a and $\gamma$ is the set $H(a, \gamma) = \{z \mid \langle a, z\rangle = \gamma\}$.

Exercise 26.2 Show that the vector a is orthogonal to the hyperplane $H = H(a, \gamma)$; that is, if u and v are in H, then a is orthogonal to u − v.

For an arbitrary vector x in X and arbitrary hyperplane $H = H(a, \gamma)$, the orthogonal projection of x onto H is the member $z = P_Hx$ of H that is closest to x.

Exercise 26.3 Show that, for  H  = H (a, γ ), z = P H x is the vector 

$z = P_Hx = x + (\gamma - \langle a, x\rangle)a. \qquad (26.1)$
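A minimal sketch of the projection (26.1), assuming the vector a has Euclidean length one; the function name is ours.

```python
import numpy as np

def project_onto_hyperplane(x, a, gamma):
    """Orthogonal projection of x onto H(a, gamma), with ||a|| = 1."""
    return x + (gamma - a @ x) * a
```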

For γ  = 0, the hyperplane H  = H (a, 0) is also a subspace  of  X , meaningthat, for every x and y in H  and scalars α and β , the linear combinationαx + βy is again in H ; in particular, the zero vector 0 is in H (a, 0).

26.1.4 Convex Sets in Euclidean Space

A subset C of X is said to be convex if, for every pair of members x and y of C, and for every $\alpha$ in the open interval (0, 1), the vector $\alpha x + (1 - \alpha)y$ is also in C.

Exercise 26.4 Show that the unit ball U in X, consisting of all x with $\|x\| \leq 1$, is convex, while the surface of the ball, the set of all x with $\|x\| = 1$, is not convex.


A convex set C is said to be closed if it contains all the vectors that lie on its boundary. Given any nonempty closed convex set C and an arbitrary vector x in X, there is a unique member of C closest to x, denoted $P_Cx$, the orthogonal (or metric) projection of x onto C. For example, if C = U, the unit ball, then $P_Cx = x/\|x\|$, for all x such that $\|x\| > 1$, and $P_Cx = x$ otherwise. If C is $R^J_+$, the nonnegative cone of $R^J$, consisting of all vectors x with $x_j \geq 0$, for each j, then $P_Cx = x_+$, the vector whose entries are $\max(x_j, 0)$.

26.2 Analysis in Euclidean Space

We say that an infinite sequence {xk} of vectors in X  converges  to thevector x if the limit of  ||x − xk|| is zero, as k → +∞; then x is called thelimit of the sequence. An infinite sequence {xk} is said to be bounded  if there is a positive constant b > 0 such that ||xk|| ≤ b, for all k.

Exercise 26.5 Show that any convergent sequence is bounded. Find a bounded sequence of real numbers that is not convergent.

For any bounded sequence {xk}, there is at least one subsequence, oftendenoted {xkn}, that is convergent; the notation implies that the positiveintegers kn are ordered, so that k1 < k2 < .... The limit of such a subse-quence is then said to be a cluster point  of the original sequence.

Exercise 26.6 Show that your bounded, but not convergent, sequence found in the previous exercise, has a cluster point.

Exercise 26.7 Show that, if x is a cluster point of the sequence $\{x^k\}$, and if $\|x - x^k\| \geq \|x - x^{k+1}\|$, for all k, then x is the limit of the sequence.

A subset C of X is said to be closed if, for every convergent sequence $\{x^k\}$ of vectors in C, the limit point is again in C. For example, in X = R, the set C = (0, 1] is not closed, because it does not contain the point x = 0, which is the limit of the sequence $\{x^k = \frac{1}{k}\}$; the set [0, 1] is closed and is the closure of the set (0, 1], that is, it is the smallest closed set containing (0, 1].

When we investigate iterative algorithms, we will want to know if the sequence $\{x^k\}$ generated by the algorithm converges. As a first step, we will usually ask whether the sequence is bounded. If it is bounded, then it will have at least one cluster point. We then try to discover if that cluster point is really the limit of the sequence.

A sequence can be bounded without being convergent. Being a Cauchy sequence  in X , on the other hand, is sufficient for convergence. A sequence


$\{x^k\}$ in X is said to be a Cauchy sequence if, given any $\epsilon > 0$, there is a positive integer k, such that, for every positive integer n, we have

$\|x^k - x^{k+n}\| \leq \epsilon.$

The finite-dimensional Euclidean spaces RJ and C J  have the importantproperty that every Cauchy sequence has a limit; this is described by sayingthat these spaces are complete .

26.3 Basic Linear Algebra

In this section we discuss systems of linear equations, Gaussian elimination,basic and non-basic variables, the fundamental subspaces of linear algebraand eigenvalues and norms of square matrices.

26.3.1 Systems of Linear Equations

Consider the system of three linear equations in five unknowns given by

$x_1 + 2x_2 + 2x_4 + x_5 = 0$
$-x_1 - x_2 + x_3 + x_4 = 0$
$x_1 + 2x_2 - 3x_3 - x_4 - 2x_5 = 0.$

This system can be written in matrix form as Ax = 0, with A the coefficient matrix

A =
[  1   2   0   2   1 ]
[ −1  −1   1   1   0 ]
[  1   2  −3  −1  −2 ],

and x = (x1, x2, x3, x4, x5)T. Applying Gaussian elimination to this system, we obtain a second, simpler, system with the same solutions:

x1 − 2x4 + x5 = 0
x2 + 2x4 = 0
x3 + x4 + x5 = 0.

From this simpler system we see that the variables x4 and x5 can be freely chosen, with the other three variables then determined by this system of equations. The variables x4 and x5 are then independent, the others dependent. The variables x1, x2 and x3 are then called basic variables. To obtain a basis of solutions we can let x4 = 1 and x5 = 0, obtaining the solution x = (2, −2, −1, 1, 0)T, and then choose x4 = 0 and x5 = 1 to get the solution x = (−1, 0, −1, 0, 1)T. Every solution to Ax = 0 is then a linear combination of these two solutions. Notice that which variables are basic and which are non-basic is somewhat arbitrary, in that we could have chosen as the non-basic variables any two whose columns are independent.


Having decided that x4 and x5 are the non-basic variables, we can write the original matrix A as A = [ B N ], where B is the square invertible matrix

B =
[  1   2   0 ]
[ −1  −1   1 ]
[  1   2  −3 ],

and N  is the matrix

N =
[  2   1 ]
[  1   0 ]
[ −1  −2 ].

With xB = (x1, x2, x3)T  and xN  = (x4, x5)T  we can write

Ax = BxB + N xN  = 0,

so that

xB = −B−1N xN . (26.2)
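As a quick numerical check of Equation (26.2), the sketch below (Python with NumPy; purely illustrative) forms B and N from the example above, computes −B−1N, and confirms that its columns are the two basis solutions found earlier.

    import numpy as np

    B = np.array([[ 1,  2,  0],
                  [-1, -1,  1],
                  [ 1,  2, -3]], dtype=float)
    N = np.array([[ 2,  1],
                  [ 1,  0],
                  [-1, -2]], dtype=float)

    # xB = -B^{-1} N xN; the columns of -B^{-1}N give xB for xN = (1,0)^T and (0,1)^T
    coeffs = -np.linalg.solve(B, N)
    print(coeffs)
    # [[ 2. -1.]
    #  [-2.  0.]
    #  [-1. -1.]]
    # Stacking these on top of xN recovers the basis solutions
    # (2,-2,-1,1,0)^T and (-1,0,-1,0,1)^T.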

26.3.2 The Fundamental Subspaces

We begin with some definitions. Let S be a subspace of finite-dimensional Euclidean space RJ and Q a J by J Hermitian matrix. We denote by Q(S) the set

Q(S) = {t | there exists s ∈ S with t = Qs}

and by Q−1(S) the set

Q−1(S ) = {u|Qu ∈ S }.

Note that the set Q−1(S ) is defined whether or not Q is invertible.

We denote by S⊥ the set of vectors u that are orthogonal to every member of S; that is,

S ⊥ = {u|u†s = 0, for every s ∈ S }.

Let H be a J by N matrix. Then CS(H), the column space of H, is the subspace of RJ consisting of all the linear combinations of the columns of H. The null space of H†, denoted NS(H†), is the subspace of RJ containing all the vectors w for which H†w = 0.

Exercise 26.8 Show that CS(H)⊥ = NS(H†). Hint: If v ∈ CS(H)⊥, then v†Hx = 0 for all x, including x = H†v.


Exercise 26.9 Show that CS(H) ∩ NS(H†) = {0}. Hint: If y = Hx ∈ NS(H†), consider ||y||2 = y†y.

Exercise 26.10 Let S  be any subspace of  RJ . Show that if Q is invertible and  Q(S ) = S  then  Q−1(S ) = S . Hint: If Qt = Qs then  t = s.

Exercise 26.11 Let  Q be Hermitian. Show that  Q(S )⊥ = Q−1(S ⊥) for every subspace S . If  Q is also invertible then  Q−1(S )⊥ = Q(S ⊥). Find an example of a non-invertible Q for which Q−1(S )⊥ and Q(S ⊥) are different.

We assume, now, that Q is Hermitian and invertible and that the matrix H†H is invertible. Note that the matrix H†Q−1H need not be invertible under these assumptions. We shall denote by S an arbitrary subspace of RJ.

Exercise 26.12 Show that Q(S) = S if and only if Q(S⊥) = S⊥. Hint: Use Exercise 26.11.

Exercise 26.13 Show that if Q(CS(H)) = CS(H) then H†Q−1H is invertible. Hint: Show that H†Q−1Hx = 0 if and only if x = 0. Recall that Q−1Hx ∈ CS(H), by Exercise 26.11. Then use Exercise 26.9.

26.4 Linear and Nonlinear Operators

In our study of iterative algorithms we shall be concerned with sequences of vectors {xk | k = 0, 1, ...}. The core of an iterative algorithm is the transition from the current vector xk to the next one xk+1. To understand the algorithm, we must understand the operation (or operator) T by which xk is transformed into xk+1 = T xk. An operator is any function T defined on X with values again in X.

Exercise 26.14 Prove the following identity relating an arbitrary operator T to its complement G = I − T:

||x − y||2 − ||T x − T y||2 = 2⟨Gx − Gy, x − y⟩ − ||Gx − Gy||2. (26.3)


26.4.1 Linear and Affine Linear Operators

For example, if X = CJ and A is a J by J complex matrix, then we can define an operator T by setting T x = Ax, for each x in CJ; here Ax denotes the multiplication of the matrix A and the column vector x. Such operators are linear operators:

T (αx + βy) = αT x + βTy,

for each pair of vectors x and y and each pair of scalars α and β .

Exercise 26.15 Show that, for H = H(a, γ), H0 = H(a, 0), and any x and y in X,

P H (x + y) = P H x + P H y − P H 0,

so that 

P H 0(x + y) = P H 0x + P H 0y,

that is, the operator  P H 0 is an additive operator. Also, show that 

P H 0(αx) = αP H 0x,

so that  P H 0 is a linear operator. Show that we can write P H 0 as a matrix multiplication:

P H 0x = (I − aa†)x.

If d is a fixed nonzero vector in CJ, the operator defined by T x = Ax + d is not a linear operator; it is called an affine linear operator.

Exercise 26.16 Show that, for any hyperplane H = H(a, γ) and H0 = H(a, 0),

P H x = P H 0x + P H 0,

so P H  is an affine linear operator.

Exercise 26.17 For i = 1, ..., I let Hi be the hyperplane Hi = H(ai, γi), Hi0 = H(ai, 0), and Pi and Pi0 the orthogonal projections onto Hi and Hi0, respectively. Let T be the operator T = PIPI−1 · · · P2P1. Show that T is an affine linear operator, that is, T has the form

T x = Bx + d,

 for some matrix B and some vector  d. Hint: Use the previous exercise and the fact that P i0 is linear to show that 

B = (I − aI (aI )†) · · · (I − a1(a1)†).
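The sketch below (Python with NumPy; the two hyperplanes are made up for illustration) composes two hyperplane projections and recovers the matrix B and vector d of the affine form T x = Bx + d numerically, by applying T to 0 and to the standard basis vectors.

    import numpy as np

    def proj_hyperplane(a, gamma, x):
        # orthogonal projection onto H(a, gamma) = {x : <a, x> = gamma},
        # assuming ||a|| = 1, so that P_H0 x = (I - a a^T) x
        return x + (gamma - a @ x) * a

    rng = np.random.default_rng(0)
    a1 = rng.standard_normal(3); a1 /= np.linalg.norm(a1)
    a2 = rng.standard_normal(3); a2 /= np.linalg.norm(a2)
    g1, g2 = 1.0, -2.0

    T = lambda x: proj_hyperplane(a2, g2, proj_hyperplane(a1, g1, x))

    d = T(np.zeros(3))                                   # T0 = B0 + d = d
    B = np.column_stack([T(e) - d for e in np.eye(3)])   # columns are B e_j
    print(np.allclose(B, (np.eye(3) - np.outer(a2, a2)) @ (np.eye(3) - np.outer(a1, a1))))  # True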


26.4.2 Orthogonal Projection onto Convex Sets

For an arbitrary nonempty closed convex set C, the orthogonal projection T = PC is a nonlinear operator, unless, of course, C = H(a, 0) for some vector a. We may not be able to describe PC x explicitly, but we do know a useful property of PC x.

Proposition 26.1 For a given x, the vector z is PC x if and only if

⟨c − z, z − x⟩ ≥ 0,

for all c in the set C.

Proof: For simplicity, we consider only the real case, X = RJ. Let c be arbitrary in C and α in (0, 1). Then

||x − PC x||2 ≤ ||x − (1 − α)PC x − αc||2 = ||x − PC x + α(PC x − c)||2

= ||x − PC x||2 − 2α⟨x − PC x, c − PC x⟩ + α2||PC x − c||2.

Therefore,

−2α⟨x − PC x, c − PC x⟩ + α2||PC x − c||2 ≥ 0,

so that

2⟨x − PC x, c − PC x⟩ ≤ α||PC x − c||2.

Taking the limit, as α → 0, we conclude that

⟨c − PC x, PC x − x⟩ ≥ 0.

If z is a member of C that also has the property

⟨c − z, z − x⟩ ≥ 0,

for all c in C, then we have both

⟨z − PC x, PC x − x⟩ ≥ 0,

and

⟨z − PC x, x − z⟩ ≥ 0.

Adding these two inequalities leads to

⟨z − PC x, PC x − z⟩ ≥ 0.

But,

⟨z − PC x, PC x − z⟩ = −||z − PC x||2,

so it must be the case that z = PC x. This completes the proof.
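To see the proposition in action, the brief sketch below (Python with NumPy; an illustration, not part of the original text) projects a point onto the nonnegative cone and checks that ⟨c − PC x, PC x − x⟩ ≥ 0 holds for randomly sampled c in C.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(4)
    p = np.maximum(x, 0)                     # PC x for C the nonnegative cone

    # <c - PC x, PC x - x> >= 0 for every c in C
    checks = []
    for _ in range(1000):
        c = np.abs(rng.standard_normal(4))   # a random point of C
        checks.append((c - p) @ (p - x) >= -1e-12)
    print(all(checks))                       # True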


26.4.3 Gradient Operators

Another important example of a nonlinear operator is the gradient of a real-valued function of several variables. Let f(x) = f(x1, ..., xJ) be a real number for each vector x in RJ. The gradient of f at the point x is the vector whose entries are the partial derivatives of f; that is,

∇f(x) = ( ∂f/∂x1(x), ..., ∂f/∂xJ(x) )T.

The operator T x = ∇f(x) is linear only if the function f(x) is quadratic; that is, f(x) = xT Ax for some square matrix A, in which case the gradient of f is ∇f(x) = (A + AT)x.
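As a sanity check on this formula, the sketch below (Python with NumPy; illustrative) compares the analytic gradient (A + AT)x of the quadratic form f(x) = xT Ax with a finite-difference approximation.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((3, 3))
    x = rng.standard_normal(3)

    f = lambda v: v @ A @ v          # f(x) = x^T A x
    grad_analytic = (A + A.T) @ x    # gradient of the quadratic form

    # central finite-difference approximation of the gradient
    eps = 1e-6
    grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])
    print(np.allclose(grad_analytic, grad_fd, atol=1e-5))   # True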


Chapter 27

Metric Spaces and Norms

As we have seen, the inner product on X = RJ or X = CJ can be used to define the Euclidean norm ||x|| of a vector x, which, in turn, provides a metric, or a measure of distance between two vectors, d(x, y) = ||x − y||. The notions of metric and norm are actually more general, with no necessary connection to the inner product.

27.1 Metric Spaces

Let S be a non-empty set. We say that the function d : S × S → [0, +∞) is a metric if the following hold:

d(s, t) ≥ 0, (27.1)

for all s and t in S;

d(s, t) = 0 (27.2)

if and only if  s = t;

d(s, t) = d(t, s), (27.3)

for all s and t in S; and, for all s, t, and u in S,

d(s, t) ≤ d(s, u) + d(u, t). (27.4)

The last inequality is the triangle inequality.

A sequence {sk} in the metric space (S, d) is said to have limit s∗ if

lim_{k→+∞} d(sk, s∗) = 0.

Any sequence with a limit is said to be convergent .


Exercise 27.1 Show that a sequence can have at most one limit.

The sequence {sk} is said to be a Cauchy sequence if, for any ε > 0, there is a positive integer m such that, for any nonnegative integer n,

d(sm, sm+n) ≤ ε.

Exercise 27.2 Show that every convergent sequence is a Cauchy sequence.

The metric space (S, d) is said to be complete if every Cauchy sequence is a convergent sequence.

Exercise 27.3 Let S be the set of rational numbers, with d(s, t) = |s − t|. Show that (S, d) is a metric space, but not a complete metric space.

We turn now to metrics that come from norms.

27.2 Norms

Let X  denote either RJ  or C J . We say that ||x|| defines a norm  on X  if 

||x|| ≥ 0, (27.5)

for all x,

||x|| = 0 (27.6)

if and only if  x = 0,

||γx|| = |γ | ||x||, (27.7)

for all x and scalars γ , and

||x + y|| ≤ ||x|| + ||y||, (27.8)

for all vectors x and y.

Exercise 27.4 Show that d(x, y) = ||x − y|| defines a metric on X.

It can be shown that RJ and CJ are complete for any metric arising from a norm.

27.2.1 The 1-norm

The 1-norm on X  is defined by

||x||1 = Σ_{j=1}^{J} |xj|.

Exercise 27.5 Show that the  1-norm is a norm.


27.2.2 The ∞-norm

The ∞-norm on X  is defined by

||x||∞ = max{|xj|, j = 1, ..., J}.

Exercise 27.6 Show that the ∞-norm is a norm.

27.2.3 The 2-norm

The 2-norm, also called the Euclidean norm, is the most commonly used norm on X. It is the one that comes from the inner product:

||x||2 = √⟨x, x⟩ = √(x†x).

Exercise 27.7 Show that the  2-norm is a norm. Hint: for the triangle inequality, use the Cauchy Inequality.

It is this close relationship between the 2-norm and the inner product that makes the 2-norm so useful.

27.2.4 Weighted 2-norms

Let Q be a positive-definite Hermitian matrix. Define

||x||Q = √(x†Qx),

for all vectors x. If Q is the diagonal matrix with diagonal entries Qjj > 0, then

||x||Q = √( Σ_{j=1}^{J} Qjj |xj|2 );

for that reason we speak of ||x||Q as the Q-weighted 2-norm of x.

Exercise 27.8 Show that the Q-weighted 2-norm is a norm.

27.3 Eigenvalues and Matrix Norms

Let S be a complex, square matrix. We say that λ is an eigenvalue of S if λ is a root of the complex polynomial det(λI − S). Therefore, each S has as many (possibly complex) eigenvalues as it has rows or columns, although some of the eigenvalues may be repeated.

An equivalent definition is that λ is an eigenvalue of S if there is a non-zero vector x with Sx = λx, in which case the vector x is called an eigenvector of S. From this definition, we see that the matrix S is invertible if and only if zero is not one of its eigenvalues. The spectral radius of S, denoted ρ(S), is the maximum of |λ|, over all eigenvalues λ of S.


Exercise 27.9 Show that  ρ(S 2) = ρ(S )2.

Exercise 27.10 We say that S is Hermitian or self-adjoint if S† = S. Show that, if S is Hermitian, then every eigenvalue of S is real. Hint: suppose that Sx = λx. Then consider x†Sx.

A Hermitian matrix S is positive-definite if each of its eigenvalues is positive. If S is an I by I Hermitian matrix with (necessarily real) eigenvalues

λ1 ≥ λ2 ≥ · · · ≥ λI,

and associated (column) eigenvectors {ui | i = 1, ..., I} (which we may assume are mutually orthogonal), then S can be written as

S = λ1 u1(u1)† + · · · + λI uI(uI)†.

This is the eigenvalue/eigenvector decomposition of S. The Hermitian matrix S is invertible if and only if all of its eigenvalues are non-zero, in which case we can write the inverse of S as

S−1 = λ1−1 u1(u1)† + · · · + λI−1 uI(uI)†.

It follows from the eigenvector decomposition of S that, when S is positive-definite, S = QQ† for some Hermitian, positive-definite matrix Q, called the Hermitian square root of S.

27.3.1 Matrix Norms

Any matrix can be turned into a vector by vectorization. Therefore, we can define a norm for any matrix by simply vectorizing and taking a norm of the resulting vector. Such norms for matrices do not take full advantage of the matrix properties. An induced matrix norm for matrices is a special type of matrix norm that comes from a vector norm and that respects the matrix properties. If A is a matrix and ||A|| denotes a matrix norm of A, then we insist that ||Ax|| ≤ ||A|| ||x||, for all x. All induced matrix norms have this compatibility property.

Let ||x|| be any norm on CJ, not necessarily the Euclidean norm, |||b||| any norm on CI, and A a rectangular I by J matrix. The induced matrix norm of A, denoted ||A||, derived from these two vector norms, is the smallest positive constant c such that

|||Ax||| ≤ c||x||,

for all x in CJ. It can be written as

||A|| = max_{x≠0} {|||Ax|||/||x||}.


We study induced matrix norms in order to measure the distance |||Ax − Az|||, relative to the distance ||x − z||:

|||Ax − Az||| ≤ ||A|| ||x − z||,

for all vectors x and z; ||A|| is the smallest number for which this statement can be made. If we choose the two vector norms carefully, then we can get an explicit description of ||A||, but, in general, we cannot.

For example, let ||x|| = ||x||1 and |||Ax||| = ||Ax||1 be the 1-norms of the vectors x and Ax, where

||x||1 = Σ_{j=1}^{J} |xj|.

Exercise 27.11 Show that the  1-norm of  A, induced by the  1-norms of vectors in  C J  and  C I , is 

||A||1 = max { Σ_{i=1}^{I} |Aij| , j = 1, 2, ..., J }.

The infinity norm of the vector x is

||x||∞ = max {|xj | , j = 1, 2,...,J }.

Exercise 27.12 Show that the infinity norm of the matrix A, induced by the infinity norms of vectors in C J  and  C I , is 

||A||∞ = max { Σ_{j=1}^{J} |Aij| , i = 1, 2, ..., I }.
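These closed-form expressions are easy to check numerically. The sketch below (Python with NumPy; illustrative) compares the maximum-absolute-column-sum and maximum-absolute-row-sum formulas with NumPy's built-in induced norms.

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((4, 6))

    one_norm = np.abs(A).sum(axis=0).max()   # maximum absolute column sum
    inf_norm = np.abs(A).sum(axis=1).max()   # maximum absolute row sum

    print(np.isclose(one_norm, np.linalg.norm(A, 1)))       # True
    print(np.isclose(inf_norm, np.linalg.norm(A, np.inf)))  # True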

Exercise 27.13 Let M  be an invertible matrix and  ||x|| any vector norm.

Define ||x||M = ||Mx||. Show that, for any square matrix S, the matrix norm

||S||M = max_{x≠0} {||Sx||M /||x||M}

is ||S||M = ||MSM−1||.

In [4] this result is used to prove the following lemma:

Lemma 27.1 Let S be any square matrix and let ε > 0 be given. Then there is an invertible matrix M such that

||S||M ≤ ρ(S) + ε.


Exercise 27.14 Show that, for any square matrix S and any induced matrix norm ||S||, we have ||S|| ≥ ρ(S). Consequently, for any induced matrix norm ||S||,

||S|| ≥ |λ|,

 for every eigenvalue  λ of  S .

So we know that

ρ(S) ≤ ||S||,

for every induced matrix norm, but, according to Lemma 27.1, we also have

||S||M ≤ ρ(S) + ε.

Exercise 27.15 Show that, if  ρ(S ) < 1, then there is a vector norm on X  for which the induced matrix norm of  S  is less than one, so that  S  is a strict contraction with respect to this vector norm.

27.4 The Euclidean Norm of a Square Matrix

We shall be particularly interested in the Euclidean norm (or 2-norm) of the square matrix A, which is the induced matrix norm derived from the Euclidean vector norms. Unless otherwise stated, we shall understand ||A|| to be the Euclidean norm of A.

From the definition of the Euclidean norm of  A, we know that

||A|| = max{||Ax||/||x||},

with the maximum over all nonzero vectors x. Since

||Ax||2 = x†A†Ax,

we have

||A|| = √( max {x†A†Ax / x†x} ), (27.9)

over all nonzero vectors x.

Exercise 27.16 Show that 

||A|| = √(ρ(A†A));

that is, the term inside the square-root in Equation (27.9) is the largest eigenvalue of the matrix  A†A. Hints: let 

λ1 ≥ λ2 ≥ ... ≥ λJ  ≥ 0


and let {uj, j = 1, ..., J} be mutually orthogonal eigenvectors of A†A with ||uj|| = 1. Then, for any x, we have

x = Σ_{j=1}^{J} [(uj)†x] uj,

while

A†Ax = Σ_{j=1}^{J} [(uj)†x] A†Auj = Σ_{j=1}^{J} λj [(uj)†x] uj.

From the orthogonality it follows that

||x||2 = x†x = Σ_{j=1}^{J} |(uj)†x|2,

and

||Ax||2 = x†A†Ax = Σ_{j=1}^{J} λj |(uj)†x|2. (27.10)

Maximizing ||Ax||2/||x||2 over x ≠ 0 is equivalent to maximizing ||Ax||2, subject to ||x||2 = 1. The right side of Equation (27.10) is then a convex combination of the λj, which will have its maximum when only the coefficient of λ1 is non-zero.

Exercise 27.17 Show that, if S  is Hermitian, then the Euclidean norm of S  is  ||S || = ρ(S ). Hint: use Exercise (27.9).

If S is not Hermitian, then the Euclidean norm of S cannot be calculated directly from the eigenvalues of S.

Exercise 27.18 Let S be the square, non-Hermitian matrix

S =
[ i  2 ]
[ 0  i ],

having eigenvalues λ = i and λ = i. Show that the eigenvalues of the Hermitian matrix

S†S =
[ 1   −2i ]
[ 2i    5 ]

are λ = 3 + 2√2 and λ = 3 − 2√2. Therefore, the Euclidean norm of S is

||S|| = √(3 + 2√2).
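A quick numerical confirmation of this exercise (Python with NumPy; illustrative):

    import numpy as np

    S = np.array([[1j, 2.0],
                  [0.0, 1j]])

    # the Euclidean (2-)norm of S is the square root of the largest eigenvalue of S†S
    eigs = np.linalg.eigvalsh(S.conj().T @ S)
    print(eigs)   # approximately [0.1716, 5.8284], that is, 3 - 2*sqrt(2) and 3 + 2*sqrt(2)
    print(np.sqrt(eigs.max()), np.sqrt(3 + 2 * np.sqrt(2)), np.linalg.norm(S, 2))
    # all three values agree (about 2.4142)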


27.4.1 Diagonalizable Matrices

A square matrix S is diagonalizable if S has a basis of eigenvectors. In that case, with V the matrix whose columns are these linearly independent eigenvectors and L the diagonal matrix having the eigenvalues of S along its main diagonal, we have SV = V L, or V−1SV = L.

Exercise 27.19 Let  T  = V −1 and define  ||x||T  = ||T x||, the Euclidean norm of  T x. Show that the induced matrix norm of S  is  ||S ||T  = ρ(S ).

We see from this exercise that, for any diagonalizable matrix S, in particular, for any Hermitian matrix, there is a vector norm such that the induced matrix norm of S is ρ(S). In the Hermitian case we know that V−1 = V† and ||T x|| = ||V†x|| = ||x||, so that the required vector norm is just the Euclidean norm, and ||S||T is just the 2-norm of S, which we know to be ρ(S).

27.4.2 Gerschgorin’s Theorem

Gerschgorin’s theorem gives us a way to estimate the eigenvalues of an arbitrary square matrix A.

Theorem 27.1 Let A be J by J. For j = 1, ..., J, let Cj be the circle in the complex plane with center Ajj and radius rj = Σ_{m≠j} |Ajm|. Then every eigenvalue of A lies within one of the Cj.

Proof: Let λ be an eigenvalue of A, with associated eigenvector u. Let uj be the entry of the vector u having the largest absolute value. From Au = λu, we have

(λ − Ajj)uj = Σ_{m≠j} Ajm um,

so that

|λ − Ajj| ≤ Σ_{m≠j} |Ajm| |um|/|uj| ≤ rj.

This completes the proof.

27.4.3 Strictly Diagonally Dominant Matrices

A square I by I matrix S is said to be strictly diagonally dominant if, for each i = 1, ..., I,

|Sii| > ri = Σ_{m≠i} |Sim|.

When the matrix S is strictly diagonally dominant, all the eigenvalues of S lie within the union of the circles with centers Sii and radii ri. We use this result in our discussion of the Jacobi splitting method.
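The sketch below (Python with NumPy; the matrix is made up for illustration) computes the Gerschgorin centers and radii for a strictly diagonally dominant matrix and checks that every eigenvalue falls inside at least one disk.

    import numpy as np

    S = np.array([[ 4.0, 1.0, 0.5],
                  [ 1.0, 5.0, 2.0],
                  [-0.5, 1.0, 6.0]])   # strictly diagonally dominant

    centers = np.diag(S)
    radii = np.abs(S).sum(axis=1) - np.abs(centers)

    for lam in np.linalg.eigvals(S):
        in_some_disk = np.any(np.abs(lam - centers) <= radii + 1e-12)
        print(lam, in_some_disk)       # True for every eigenvalue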


Chapter 28

Bregman-Legendre Functions

In [8] Bauschke and Borwein show convincingly that the Bregman-Legendre functions provide the proper context for the discussion of Bregman projections onto closed convex sets. The summary here follows closely the discussion given in [8].

28.1 Essential smoothness and essential strict convexity

A convex function f : RJ → [−∞, +∞] is proper if there is no x with f(x) = −∞ and some x with f(x) < +∞. The essential domain of f is D = {x | f(x) < +∞}. A proper convex function f is closed if it is lower semi-continuous. The subdifferential of f at x is the set ∂f(x) = {x∗ | ⟨x∗, z − x⟩ ≤ f(z) − f(x), for all z}. The domain of ∂f is the set dom ∂f = {x | ∂f(x) ≠ ∅}. The conjugate function associated with f is the function f∗(x∗) = supz(⟨x∗, z⟩ − f(z)).

Following [105] we say that a closed proper convex function f is essentially smooth if intD is not empty, f is differentiable on intD and xn ∈ intD, with xn → x ∈ bdD, implies that ||∇f(xn)|| → +∞. Here intD and bdD denote the interior and boundary of the set D.

A closed proper convex function f is essentially strictly convex if f is strictly convex on every convex subset of dom ∂f.

The closed proper convex function f is essentially smooth if and only if the subdifferential ∂f(x) is empty for x ∈ bdD and is {∇f(x)} for x ∈ intD (so f is differentiable on intD) if and only if the function f∗ is essentially strictly convex.


A closed proper convex function f is said to be a Legendre function if it is both essentially smooth and essentially strictly convex. So f is Legendre if and only if its conjugate function is Legendre, in which case the gradient operator ∇f is a topological isomorphism with ∇f∗ as its inverse. The gradient operator ∇f maps int dom f onto int dom f∗. If int dom f∗ = RJ then the range of ∇f is RJ and the equation ∇f(x) = y can be solved for every y ∈ RJ. In order for int dom f∗ = RJ it is necessary and sufficient that the Legendre function f be super-coercive, that is,

lim_{||x||→+∞} f(x)/||x|| = +∞.

If the essential domain of f is bounded, then f is super-coercive and its gradient operator is a mapping onto the space RJ.

28.2 Bregman Projections onto Closed Convex Sets

Let f be a closed proper convex function that is differentiable on the nonempty set intD. The corresponding Bregman distance Df(x, z) is defined for x ∈ RJ and z ∈ intD by

Df(x, z) = f(x) − f(z) − ⟨∇f(z), x − z⟩.

Note that Df(x, z) ≥ 0 always and that Df(x, z) = +∞ is possible. If f is essentially strictly convex then Df(x, z) = 0 implies that x = z.
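Two standard examples, sketched below in Python (illustrative; the function names are ours), are f(x) = (1/2)||x||2, whose Bregman distance is (1/2)||x − z||2, and the entropy-type function f(x) = Σ xj log xj − xj on the positive orthant, whose Bregman distance is the Kullback-Leibler (cross-entropy) distance used elsewhere in this book.

    import numpy as np

    def bregman(f, grad_f, x, z):
        # D_f(x, z) = f(x) - f(z) - <grad f(z), x - z>
        return f(x) - f(z) - grad_f(z) @ (x - z)

    half_sq = lambda x: 0.5 * x @ x
    neg_ent = lambda x: np.sum(x * np.log(x) - x)

    x = np.array([1.0, 2.0, 0.5])
    z = np.array([0.5, 1.5, 1.0])

    print(bregman(half_sq, lambda v: v, x, z),
          0.5 * np.sum((x - z) ** 2))            # the two values agree
    print(bregman(neg_ent, np.log, x, z),
          np.sum(x * np.log(x / z) + z - x))     # both equal KL(x, z)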

Let K be a nonempty closed convex set with K ∩ intD ≠ ∅. Pick z ∈ intD. The Bregman projection of z onto K, with respect to f, is

PfK(z) = argmin_{x∈K∩D} Df(x, z).

If f is essentially strictly convex, then PfK(z) exists. If f is strictly convex on D then PfK(z) is unique. If f is Legendre, then PfK(z) is uniquely defined and is in intD; this last condition is sometimes called zone consistency.

Example: Let J = 2 and f(x) be the function that is equal to one-half the norm squared on D, the nonnegative quadrant, +∞ elsewhere. Let K be the set K = {(x1, x2) | x1 + x2 = 1}. The Bregman projection of (2, 1) onto K is (1, 0), which is not in intD. The function f is not essentially smooth, although it is essentially strictly convex. Its conjugate is the function f∗ that is equal to one-half the norm squared on D and equal to zero elsewhere; it is essentially smooth, but not essentially strictly convex.

If f is Legendre, then PfK(z) is the unique member of K ∩ intD satisfying the inequality

⟨∇f(PfK(z)) − ∇f(z), PfK(z) − c⟩ ≥ 0,


for all c ∈ K. From this we obtain the Bregman Inequality:

Df(c, z) ≥ Df(c, PfK(z)) + Df(PfK(z), z), (28.1)

for all c ∈ K .

28.3 Bregman-Legendre Functions

Following Bauschke and Borwein [8], we say that a Legendre function f is a Bregman-Legendre function if the following properties hold:

B1: for x in D and any a > 0 the set {z | Df(x, z) ≤ a} is bounded.
B2: if x is in D but not in intD, for each positive integer n, yn is in intD with yn → y ∈ bdD and if {Df(x, yn)} remains bounded, then Df(y, yn) → 0, so that y ∈ D.
B3: if xn and yn are in intD, with xn → x and yn → y, where x and y are in D but not in intD, and if Df(xn, yn) → 0 then x = y.

Bauschke and Borwein then prove that Bregman’s SGP method converges to a member of K provided that one of the following holds: 1) f is Bregman-Legendre; 2) K ∩ intD ≠ ∅ and dom f∗ is open; or 3) dom f and dom f∗ are both open.

28.4 Useful Results about Bregman-Legendre Functions

The following results are proved in somewhat more generality in [8].

R1: If yn ∈ int dom f and yn → y ∈ int dom f, then Df(y, yn) → 0.
R2: If x and yn ∈ int dom f and yn → y ∈ bd dom f, then Df(x, yn) → +∞.
R3: If xn ∈ D, xn → x ∈ D, yn ∈ int D, yn → y ∈ D, {x, y} ∩ int D ≠ ∅ and Df(xn, yn) → 0, then x = y and y ∈ int D.
R4: If x and y are in D, but are not in int D, yn ∈ int D, yn → y and Df(x, yn) → 0, then x = y.

As a consequence of these results we have the following.

R5: If {Df(x, yn)} → 0, for yn ∈ int D and x ∈ RJ, then {yn} → x.

Proof of R5: Since {Df(x, yn)} is eventually finite, we have x ∈ D. By Property B1 above it follows that the sequence {yn} is bounded; without loss of generality, we assume that {yn} → y, for some y ∈ D. If x is in int D, then, by result R2 above, we know that y is also in int D. Applying result R3, with xn = x, for all n, we conclude that x = y. If, on the other hand, x is in D, but not in int D, then y is in D, by result R2. There are two cases to consider: 1) y is in int D; 2) y is not in int D. In case 1) we have Df(x, yn) → Df(x, y) = 0, from which it follows that x = y. In case 2) we apply result R4 to conclude that x = y.


Chapter 29

Detection and Classification

In some applications of remote sensing, our goal is simply to see what is “out there”; in sonar mapping of the sea floor, the data are the acoustic signals as reflected from the bottom, from which the changes in depth can be inferred. Such problems are estimation problems.

In other applications, such as sonar target detection or medical diagnostic imaging, we are looking for certain things, evidence of a surface vessel or submarine, in the sonar case, or a tumor or other abnormality in the medical case. These are detection problems. In the sonar case, the data may be used directly in the detection task, or may be processed in some way, perhaps frequency-filtered, prior to being used for detection. In the medical case, or in synthetic-aperture radar (SAR), the data is usually used to construct an image, which is then used for the detection task. In estimation, the goal can be to determine how much of something is present; detection is then a special case, in which we want to decide if the amount present is zero or not.

The detection problem is also a special case of discrimination, in which the goal is to decide which of two possibilities is true; in detection the possibilities are simply the presence or absence of the sought-for signal.

More generally, in classification or identification, the objective is to decide, on the basis of measured data, which of several possibilities is true.


29.1 Estimation

We consider only estimates that are linear in the data, that is, estimates of the form

γ̂ = b†x = Σ_{n=1}^{N} bn xn, (29.1)

where b† denotes the conjugate transpose of the vector b = (b1, ..., bN)T.

The vector b that we use will be the best linear unbiased estimator (BLUE)[33] for the particular estimation problem.

29.1.1 The simplest case: a constant in noise

We begin with the simplest case, estimating the value of a constant, given several instances of the constant in additive noise. Our data are xn = γ + qn, for n = 1, ..., N, where γ is the constant to be estimated, and the qn are noises. For convenience, we write

x = γu + q, (29.2)

where x = (x1, ..., xN)T, q = (q1, ..., qN)T, u = (1, ..., 1)T, the expected value of the random vector q is E(q) = 0, and the covariance matrix of q is E(qqT) = Q. The BLUE employs the vector

b = (1/(u†Q−1u)) Q−1u. (29.3)

The BLUE estimate of γ is

γ̂ = (1/(u†Q−1u)) u†Q−1x. (29.4)

If Q = σ2I, for some σ > 0, with I the identity matrix, then the noise q is said to be white. In this case, the BLUE estimate of γ is simply the average of the xn.
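The sketch below (Python with NumPy; the covariance matrix is made up for illustration) computes the BLUE weights of Equation (29.3) for correlated noise and checks that they reduce to simple averaging when Q = σ2I.

    import numpy as np

    def blue_weights(Q, u):
        # b = Q^{-1} u / (u† Q^{-1} u), Equation (29.3)
        Qinv_u = np.linalg.solve(Q, u)
        return Qinv_u / (u @ Qinv_u)

    N = 5
    u = np.ones(N)

    # white noise: the BLUE reduces to the average
    print(blue_weights(4.0 * np.eye(N), u))              # all entries 1/N = 0.2

    # correlated noise: unequal weights, but still summing to one (unbiasedness)
    Q = np.array([[1.0, 0.5, 0.0, 0.0, 0.0],
                  [0.5, 1.0, 0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0, 0.5, 0.0],
                  [0.0, 0.0, 0.5, 1.0, 0.5],
                  [0.0, 0.0, 0.0, 0.5, 1.0]])
    print(blue_weights(Q, u), blue_weights(Q, u).sum())  # weights sum to 1.0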

29.1.2 A known signal vector in noise

Generalizing somewhat, we consider the case in which the data vector x has the form

x = γs + q, (29.5)

where s = (s1,...,sN )T  is a known signal vector. The BLUE estimator is

b = (1/(s†Q−1s)) Q−1s (29.6)


and the BLUE estimate of γ is now

γ̂ = (1/(s†Q−1s)) s†Q−1x. (29.7)

In numerous applications of signal processing, the signal vectors take the form of sampled sinusoids; that is, s = eθ, with

eθ = (1/√N) (e−iθ, e−2iθ, ..., e−Niθ)T, (29.8)

where θ is a frequency in the interval [0, 2π). If the noise is white, then the BLUE estimate of γ is

γ̂ = (1/√N) Σ_{n=1}^{N} xn einθ, (29.9)

which is the discrete Fourier transform (DFT) of the data, evaluated at the frequency θ.

29.1.3 Multiple signals in noise

Suppose now that the data values are

xn = Σ_{m=1}^{M} γm smn + qn, (29.10)

where the signal vectors sm = (sm1, ..., smN)T are known and we want to estimate the γm. We write this in matrix-vector notation as

x = Sc + q, (29.11)

where S is the matrix with entries Snm = smn, and our goal is to find c = (γ1, ..., γM)T, the vector of coefficients. The BLUE estimate of the vector c is

ĉ = (S†Q−1S)−1S†Q−1x, (29.12)

assuming that the matrix S†Q−1S is invertible, in which case we must have M ≤ N.

If the signals sm are mutually orthogonal and have length one, then S†S = I; if, in addition, the noise is white, the BLUE estimate of c is ĉ = S†x, so that

ĉm = Σ_{n=1}^{N} xn smn. (29.13)


This case arises when the signals are sm = eθm, for θm = 2πm/M, for m = 1, ..., M, in which case the BLUE estimate of cm is

ĉm = (1/√N) Σ_{n=1}^{N} xn e2πimn/M, (29.14)

the DFT of the data, evaluated at the frequency θm. Note that when the frequencies θm are not these, the matrix S†S is not I, and the BLUE estimate is not obtained from the DFT of the data.
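The sketch below (Python with NumPy; the signals, noise covariance, and coefficients are made up for illustration) implements the general estimate of Equation (29.12) and recovers the coefficient vector from noisy data.

    import numpy as np

    rng = np.random.default_rng(4)
    N, M = 64, 3
    S = rng.standard_normal((N, M))            # columns are the known signal vectors
    c_true = np.array([2.0, -1.0, 0.5])
    Q = 0.01 * np.eye(N)                       # white noise covariance, for simplicity

    x = S @ c_true + rng.multivariate_normal(np.zeros(N), Q)

    Qinv_S = np.linalg.solve(Q, S)
    c_hat = np.linalg.solve(S.T @ Qinv_S, Qinv_S.T @ x)   # (S†Q^{-1}S)^{-1} S†Q^{-1} x
    print(c_hat)   # close to [2.0, -1.0, 0.5]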

29.2 Detection

As we noted previously, the detection problem is a special case of estimation. Detecting the known signal s in noise is equivalent to deciding if the coefficient γ is zero or not. The procedure is to calculate γ̂, the BLUE estimate of γ, and say that s has been detected if |γ̂| exceeds a certain threshold. In the case of multiple known signals, we calculate ĉ, the BLUE estimate of the coefficient vector c, and base our decisions on the magnitudes of each entry of ĉ.

29.2.1 Parametrized signal

It is sometimes the case that we know that the signal s we seek to detect is a member of a parametrized family, {sθ | θ ∈ Θ}, of potential signal vectors, but we do not know the value of the parameter θ. For example, we may be trying to detect a sinusoidal signal, s = eθ, where θ is an unknown frequency in the interval [0, 2π). In sonar direction-of-arrival estimation, we seek to detect a farfield point source of acoustic energy, but do not know the direction of the source. The BLUE estimator can be extended to these cases, as well [33]. For each fixed value of the parameter θ, we estimate γ using the BLUE, obtaining the estimate

γ̂(θ) = (1/(s†θQ−1sθ)) s†θQ−1x, (29.15)

which is then a function of θ. If the maximum of the magnitude of this function exceeds a specified threshold, then we may say that there is a signal present corresponding to that value of θ.

Another approach would be to extend the model of multiple signals to include a continuum of possibilities, replacing the finite sum with an integral. Then the model of the data becomes

x = ∫_{θ∈Θ} γ(θ) sθ dθ + q. (29.16)


Let S  now denote the integral operator

S(γ) = ∫_{θ∈Θ} γ(θ) sθ dθ (29.17)

that transforms a function γ of the variable θ into a vector. The adjoint operator, S†, transforms any N-vector v into a function of θ, according to

S†(v)(θ) = Σ_{n=1}^{N} vn(sθ)n = s†θ v. (29.18)

Consequently, S †Q−1S  is the function of θ given by

g(θ) = (S†Q−1S)(θ) = Σ_{n=1}^{N} Σ_{j=1}^{N} Q−1nj (sθ)j (sθ)n, (29.19)

so

g(θ) = s†θQ−1sθ. (29.20)

The generalized BLUE estimate of  γ (θ) is then

γ̂(θ) = (1/g(θ)) Σ_{j=1}^{N} aj(sθ)j = (1/g(θ)) s†θ a, (29.21)

where x = Qa or

xn = Σ_{j=1}^{N} aj Qnj, (29.22)

for n = 1, ..., N, and so a = Q−1x. This is the same estimate we obtained in the previous paragraph. The only difference is that, in the first case, we assume that there is only one signal active, and apply the BLUE for each fixed θ, looking for the one most likely to be active. In the second case, we choose to view the data as a noisy superposition of a continuum of the sθ, not just one. The resulting estimate of γ(θ) describes how each of the individual signal vectors sθ contribute to the data vector x. Nevertheless, the calculations we perform are the same.

If the noise is white, we have aj = xj for each j. The function g(θ) becomes

g(θ) = Σ_{n=1}^{N} |(sθ)n|2, (29.23)


which is simply the square of the length of the vector sθ. If, in addition, the signal vectors all have length one, then the estimate of the function γ(θ) becomes

γ̂(θ) = Σ_{n=1}^{N} xn(sθ)n = s†θ x. (29.24)

Finally, if the signals are sinusoids sθ = eθ, then

γ̂(θ) = (1/√N) Σ_{n=1}^{N} xn einθ, (29.25)

again, the DFT of the data vector.

29.3 Discrimination

The problem now is to decide if the data is x = s1 + q or x = s2 + q, where s1 and s2 are known vectors. This problem can be converted into a detection problem: Do we have x − s1 = q or x − s1 = s2 − s1 + q? Then the BLUE involves the vector Q−1(s2 − s1) and the discrimination is made based on the quantity (s2 − s1)†Q−1x. If this quantity is near enough to zero we say that the signal is s1; otherwise, we say that it is s2. The BLUE in this case is sometimes called the Hotelling linear discriminant, and a procedure that uses this method to perform medical diagnostics is called a Hotelling observer.

More generally, suppose we want to decide if a given vector x comes from class C1 or from class C2. If we can find a vector b such that bT x > a for every x that comes from C1, and bT x < a for every x that comes from C2, then the vector b is a linear discriminant for deciding between the classes C1 and C2.

29.3.1 Channelized Observers

The N by N matrix Q can be quite large, particularly when x and q are vectorizations of two-dimensional images. If, in addition, the matrix Q is obtained from K observed instances of the random vector q, then for Q to be invertible, we need K ≥ N. To avoid these and other difficulties, the channelized Hotelling linear discriminant is often used. The idea here is to replace the data vector x with Ux for an appropriately chosen J by N matrix U, with J much smaller than N; the value J = 3 is used in [68], with the channels chosen to capture image information within selected frequency bands.


29.3.2 An Example of Discrimination

Suppose that there are two groups of students, the first group denoted G1, the second G2. The math SAT score for the students in G1 is always above 500, while their verbal scores are always below 500. For the students in G2 the opposite is true; the math scores are below 500, the verbal above. For each student we create the two-dimensional vector x = (x1, x2)T of SAT scores, with x1 the math score, x2 the verbal score. Let b = (1, −1)T. Then for every student in G1 we have bT x > 0, while for those in G2, we have bT x < 0. Therefore, the vector b provides a linear discriminant.

Suppose we have a third group, G3, whose math scores and verbal scores are both below 500. To discriminate between members of G1 and G3 we can use the vector b = (1, 0)T and a = 500. To discriminate between the groups G2 and G3, we can use the vector b = (0, 1)T and a = 500.

Now suppose that we want to decide from which of the three groups the vector x comes; this is classification.

29.4 Classification

The classification problem is to determine to which of several classes of vectors a given vector x belongs. For simplicity, we assume all vectors are real. The simplest approach to solving this problem is to seek linear discriminant functions; that is, for each class we want to have a vector b with the property that bT x > 0 if and only if x is in the class. If the vectors x are randomly distributed according to one of the parametrized family of probability density functions (pdf) p(x; ω) and the ith class corresponds to the parameter value ωi then we can often determine the discriminant vectors bi from these pdf. In many cases, however, we do not have the pdf and the bi must be estimated through a learning or training step before they are used on as yet unclassified data vectors. In the discussion that follows we focus on obtaining b for one class, suppressing the index i.

29.4.1 The Training Stage

In the training stage a candidate for b is tested on vectors whose class membership is known, say {x1, ..., xM}. First, we replace each vector xm that is not in the class with its negative. Then we seek b such that bT xm > 0 for all m. With A the matrix whose mth row is (xm)T we can write the problem as Ab > 0. If the b we obtain has some entries very close to zero it might not work well enough on actual data; it is often better, then, to take a vector ε with small positive entries and require Ab ≥ ε. When we have found b for each class we then have the machinery to perform the classification task.


There are several problems to be overcome, obviously. The main one is that there may not be a vector b for each class; the problem Ab ≥ ε need not have a solution. In classification this is described by saying that the vectors xm are not linearly separable [60]. The second problem is finding the b for each class; we need an algorithm to solve Ab ≥ ε.

One approach to designing an algorithm for finding b is the following: for arbitrary b let f(b) be the number of the xm misclassified by vector b. Then minimize f(b) with respect to b. Alternatively, we can minimize the function g(b) defined to be the sum of the values −bT xm, taken over all the xm that are misclassified; the g(b) has the advantage of being continuously valued. The batch Perceptron algorithm [60] uses gradient descent methods to minimize g(b). Another approach is to use the Agmon-Motzkin-Schoenberg (AMS) algorithm to solve the system of linear inequalities Ab ≥ ε [33].

When the training set of vectors is linearly separable, the batch Perceptron and the AMS algorithms converge to a solution, for each class. When the training vectors are not linearly separable there will be a class for which the problem Ab ≥ ε will have no solution. Iterative algorithms in this case cannot converge to a solution. Instead, they may converge to an approximate solution or, as with the AMS algorithm, converge subsequentially to a limit cycle of more than one vector.
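A minimal sketch of the batch Perceptron idea described above (Python with NumPy; the step size, stopping rule, and toy data are our own illustrative choices): it repeatedly decreases g(b), the sum of −bT xm over the currently misclassified xm, by stepping along the negative gradient, which is the sum of the misclassified xm.

    import numpy as np

    def batch_perceptron(X, steps=1000, eta=0.1):
        # rows of X are the training vectors, already negated when not in the class,
        # so the goal is a b with X @ b > 0 for every row
        b = np.zeros(X.shape[1])
        for _ in range(steps):
            mis = X[X @ b <= 0]             # currently misclassified vectors
            if len(mis) == 0:
                break                        # linearly separable: all constraints met
            b = b + eta * mis.sum(axis=0)    # gradient step on g(b) = sum of -b^T x_m
        return b

    rng = np.random.default_rng(5)
    in_class = rng.normal([0.0, 2.0], 0.5, size=(20, 2))
    not_in_class = rng.normal([0.0, -2.0], 0.5, size=(20, 2))
    X = np.vstack([in_class, -not_in_class])   # negate the vectors not in the class

    b = batch_perceptron(X)
    print(np.all(X @ b > 0))   # True: b separates this training set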

29.4.2 Our Example Again

We return to the example given earlier, involving the three groups of students and their SAT scores. To be consistent with the conventions of this section, we define x = (x1, x2)T differently now. Let x1 be the math SAT score, minus 500, and x2 be the verbal SAT score, minus 500. The vector b = (1, 0)T has the property that bT x > 0 for each x coming from G1, but bT x < 0 for each x not coming from G1. Similarly, the vector b = (0, 1)T has the property that bT x > 0 for all x coming from G2, while bT x < 0 for all x not coming from G2. However, there is no vector b with the property that bT x > 0 for x coming from G3, but bT x < 0 for all x not coming from G3; the group G3 is not linearly separable from the others. Notice, however, that if we perform our classification sequentially, we can employ linear classifiers. First, we use the vector b = (1, 0)T to decide if the vector x comes from G1 or not. If it does, fine; if not, then use vector b = (0, 1)T to decide if it comes from G2 or G3.

29.5 More realistic models

In many important estimation and detection problems, the signal vector s is not known precisely. In medical diagnostics, we may be trying to detect a lesion, and may know it when we see it, but may not be able to describe it using a single vector s, which now would be a vectorized image. Similarly, in discrimination or classification problems, we may have several examples of each type we wish to identify, but will be unable to reduce these types to single representative vectors. We now have to derive an analog of the BLUE that is optimal with respect to the examples that have been presented for training. The linear procedure we seek will be one that has performed best, with respect to a training set of examples. The Fisher linear discriminant is an example of such a procedure.

that is optimal with respect to the examples that have been presented fortraining. The linear procedure we seek will be one that has performed best,with respect to a training set of examples. The Fisher linear discriminant is an example of such a procedure.

29.5.1 The Fisher linear discriminant

Suppose that we have available for training K vectors x1, ..., xK in RN, with vectors x1, ..., xJ in the class A, and the remaining K − J vectors in the class B. Let w be an arbitrary vector of length one, and for each k let yk = wT xk be the projected data. The numbers yk, k = 1, ..., J, form the set YA, the remaining ones the set YB. Let

µA = (1/J) Σ_{k=1}^{J} xk, (29.26)

µB = (1/(K − J)) Σ_{k=J+1}^{K} xk, (29.27)

mA = (1/J) Σ_{k=1}^{J} yk = wT µA, (29.28)

and

mB = (1/(K − J)) Σ_{k=J+1}^{K} yk = wT µB. (29.29)

Let

σ2A = Σ_{k=1}^{J} (yk − mA)2, (29.30)

and

σ2B = Σ_{k=J+1}^{K} (yk − mB)2. (29.31)


The quantity σ2 = σ2A + σ2B is the total within-class scatter of the projected data. Define the function F(w) to be

F(w) = (mA − mB)2/σ2. (29.32)

The Fisher linear discriminant is the vector w for which F(w) achieves its maximum.

Define the scatter matrices SA and SB as follows:

SA = Σ_{k=1}^{J} (xk − µA)(xk − µA)T, (29.33)

and

SB = Σ_{k=J+1}^{K} (xk − µB)(xk − µB)T. (29.34)

Then

Swithin = SA + SB (29.35)

is the within-class scatter matrix and

Sbetween = (µA − µB)(µA − µB)T (29.36)

is the between-class scatter matrix. The function F(w) can then be written as

F(w) = wT Sbetween w / wT Swithin w. (29.37)

The w for which F(w) achieves its maximum value is then

w = Swithin−1(µA − µB). (29.38)

This vector w is the Fisher linear discriminant. When a new data vector x is obtained, we decide to which of the two classes it belongs by calculating wT x.
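The computation of Equation (29.38) takes only a few lines. The sketch below (Python with NumPy; the two training clouds are synthetic) builds the within-class scatter matrix and the Fisher direction w, then classifies a new vector by comparing wT x to the midpoint of the projected class means, which is one simple choice of cut-off.

    import numpy as np

    rng = np.random.default_rng(6)
    XA = rng.normal([2.0, 0.0], 1.0, size=(50, 2))    # class A training vectors
    XB = rng.normal([-2.0, 1.0], 1.0, size=(40, 2))   # class B training vectors

    muA, muB = XA.mean(axis=0), XB.mean(axis=0)
    SA = (XA - muA).T @ (XA - muA)
    SB = (XB - muB).T @ (XB - muB)
    S_within = SA + SB

    w = np.linalg.solve(S_within, muA - muB)    # Equation (29.38)

    x_new = np.array([1.5, -0.5])
    threshold = 0.5 * (w @ muA + w @ muB)
    print("class A" if w @ x_new > threshold else "class B")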


Chapter 30

Tomography

In this chapter we present a brief overview of transmission and emission tomography. These days, the term tomography is used by lay people and practitioners alike to describe any sort of scan, from ultrasound to magnetic resonance. It has apparently lost its association with the idea of slicing, as in the expression three-dimensional tomography. In this chapter we focus on two important modalities, transmission tomography and emission tomography. An x-ray CAT scan is an example of the first, a positron-emission (PET) scan is an example of the second.

30.1 X-ray Transmission Tomography

Computer-assisted tomography (CAT) scans have revolutionized medical practice. One example of CAT is x-ray transmission tomography. The goal here is to image the spatial distribution of various matter within the body, by estimating the distribution of x-ray attenuation. In the continuous formulation, the data are line integrals of the function of interest.

When an x-ray beam travels along a line segment through the body it becomes progressively weakened by the material it encounters. By comparing the initial strength of the beam as it enters the body with its final strength as it exits the body, we can estimate the integral of the attenuation function, along that line segment. The data in transmission tomography are these line integrals, corresponding to thousands of lines along which the beams have been sent. The image reconstruction problem is to create a discrete approximation of the attenuation function. The inherently three-dimensional problem is usually solved one two-dimensional plane, or slice, at a time, hence the name tomography [74].

The beam attenuation at a given point in the body will depend on the material present at that point; estimating and imaging the attenuation as a function of spatial location will give us a picture of the material within the body. A bone fracture will show up as a place where significant attenuation should be present, but is not.

30.1.1 The Exponential-Decay Model

As an x-ray beam passes through the body, it encounters various types of matter, such as soft tissue, bone, ligaments, air, each weakening the beam to a greater or lesser extent. If the intensity of the beam upon entry is Iin and Iout is its lower intensity after passing through the body, then

Iout = Iin exp(−∫_L f),

where f = f(x, y) ≥ 0 is the attenuation function describing the two-dimensional distribution of matter within the slice of the body being scanned and ∫_L f is the integral of the function f over the line L along which the x-ray beam has passed. To see why this is the case, imagine the line L parameterized by the variable s and consider the intensity function I(s) as a function of s. For small ∆s > 0, the drop in intensity from the start to the end of the interval [s, s + ∆s] is approximately proportional to the intensity I(s), to the attenuation f(s) and to ∆s, the length of the interval; that is,

I (s) − I (s + ∆s) ≈ f (s)I (s)∆s.

Dividing by ∆s and letting ∆s approach zero, we get

dI/ds = −f(s)I(s).

Exercise 30.1 Show that the solution to this differential equation is

I(s) = I(0) exp(−∫_{u=0}^{u=s} f(u) du).

Hint: Use an integrating factor.
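As an informal check on this solution, the sketch below (Python with NumPy; the attenuation profile f is made up) integrates dI/ds = −f(s)I(s) with small Euler steps and compares the result with I(0) exp(−∫ f), using the exact value of the integral for this particular f.

    import numpy as np

    f = lambda s: 0.5 + 0.3 * np.sin(s)   # a hypothetical attenuation profile along the line
    I0, s_max, n = 1.0, 4.0, 200000
    ds = s_max / n

    # forward-Euler integration of dI/ds = -f(s) I(s)
    I = I0
    for k in range(n):
        I -= f(k * ds) * I * ds

    # closed-form solution, with the integral of f computed exactly
    exact = I0 * np.exp(-(0.5 * s_max + 0.3 * (1.0 - np.cos(s_max))))
    print(I, exact)   # the two values agree to several digits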

From knowledge of Iin and Iout, we can determine ∫_L f. If we know ∫_L f for every line in the x, y-plane we can reconstruct the attenuation function f. In the real world we know line integrals only approximately and only for finitely many lines. The goal in x-ray transmission tomography is to estimate the attenuation function f(x, y) in the slice, from finitely many noisy measurements of the line integrals. We usually have prior information about the values that f(x, y) can take on. We also expect to find sharp boundaries separating regions where the function f(x, y) varies only slightly. Therefore, we need algorithms capable of providing such images.

As we shall see, the line-integral data can be viewed as values of the Fourier transform of the attenuation function.


30.1.2 Reconstruction from Line Integrals

We turn now to the underlying problem: reconstructing the function f(x, y) from line-integral data. Let θ be a fixed angle in the interval [0, π), and consider the rotation of the x, y-coordinate axes to produce the t, s-axis system, where

t = x cos θ + y sin θ,

and

s = −x sin θ + y cos θ.

We can then write the function f as a function of the variables t and s. For each fixed value of t, we compute the integral ∫ f(x, y) ds, obtaining the integral of f(x, y) = f(t cos θ − s sin θ, t sin θ + s cos θ) along the single line L corresponding to the fixed values of θ and t. We repeat this process for every value of t and then change the angle θ and repeat again. In this way we obtain the integrals of f over every line L in the plane. We denote by rf(θ, t) the integral

rf(θ, t) = ∫_L f(x, y) ds.

The function rf(θ, t) is called the Radon transform of f.

For fixed θ the function rf(θ, t) is a function of the single real variable t; let Rf(θ, ω) be its Fourier transform. Then,

Rf(θ, ω) = ∫ ( ∫ f(x, y) ds ) eiωt dt,

which we can write as

Rf(θ, ω) = ∫∫ f(x, y) eiω(x cos θ+y sin θ) dx dy = F(ω cos θ, ω sin θ),

where F(ω cos θ, ω sin θ) is the two-dimensional Fourier transform of the function f(x, y), evaluated at the point (ω cos θ, ω sin θ); this relationship is called the Central Slice Theorem. For fixed θ, as we change the value of ω, we obtain the values of the function F along the points of the line making the angle θ with the horizontal axis. As θ varies in [0, π), we get all the values of the function F. Once we have F, we can obtain f using the formula for the two-dimensional inverse Fourier transform. We conclude that we are able to determine f from its line integrals.

The Fourier-transform inversion formula for two-dimensional functions tells us that the function f(x, y) can be obtained as

f(x, y) = (1/(4π2)) ∫∫ F(u, v) e−i(xu+yv) du dv. (30.1)

The filtered backprojection methods commonly used in the clinic are derived from different ways of calculating the double integral in Equation (30.1).


30.1.3 The Algebraic Approach

Although there is some flexibility in the mathematical description of the image reconstruction problem in transmission tomography, one popular approach is the algebraic formulation of the problem. In this formulation, the problem is to solve, at least approximately, a large system of linear equations, Ax = b.

The attenuation function is discretized, in the two-dimensional case, by imagining the body to consist of finitely many squares, or pixels, within which the function has a constant, but unknown, value. This value at the j-th pixel is denoted xj. In the three-dimensional formulation, the body is viewed as consisting of finitely many cubes, or voxels. The beam is sent through the body along various lines and both initial and final beam strength is measured. From that data we can calculate a discrete line integral along each line. For i = 1, ..., I we denote by Li the i-th line segment through the body and by bi its associated line integral. Denote by Aij the length of the intersection of the j-th pixel with Li; therefore, Aij is nonnegative. Most of the pixels do not intersect line Li, so A is quite sparse. Then the data value bi can be described, at least approximately, as

bi = Σ_{j=1}^{J} Aij xj. (30.2)

Both I, the number of lines, and J, the number of pixels or voxels, are quite large, although they certainly need not be equal, and are typically unrelated.

The matrix A is large and rectangular. The system Ax = b may or may not have exact solutions. We are always free to select J, the number of pixels, as large as we wish, limited only by computation costs. We may also have some choice as to the number I of lines, but within the constraints posed by the scanning machine and the desired duration and dosage of the scan. When the system is underdetermined (J > I), there may be infinitely many exact solutions; in such cases we usually impose constraints and prior knowledge to select an appropriate solution. As we mentioned earlier, noise in the data, as well as error in our model of the physics of the scanning procedure, may make an exact solution undesirable, anyway. When the system is overdetermined (J < I), we may seek a least-squares approximate solution, or some other approximate solution. We may have prior knowledge about the physics of the materials present in the body that can provide us with upper bounds for xj, as well as information about body shape and structure that may tell where xj = 0. Incorporating such information in the reconstruction algorithms can often lead to improved images [99].


30.2 Emission Tomography

In single-photon emission tomography (SPECT) and positron emission tomography (PET) the patient is injected with, or inhales, a chemical to which a radioactive substance has been attached [115]. The chemical is designed to become concentrated in the particular region of the body under study. Once there, the radioactivity results in photons that travel through the body and, at least some of the time, are detected by the scanner. The function of interest is the actual concentration of the radioactive material at each spatial location within the region of interest. Learning what the concentrations are will tell us about the functioning of the body at the various spatial locations. Tumors may take up the chemical (and its radioactive passenger) more avidly than normal tissue, or less avidly, perhaps. Malfunctioning portions of the brain may not receive the normal amount of the chemical and will, therefore, exhibit an abnormal amount of radioactivity.

As in the transmission tomography case, this nonnegative function is discretized and represented as the vector x. The quantity b_i, the i-th entry of the vector b, is the photon count at the i-th detector; in coincidence-detection PET a detection is actually a nearly simultaneous detection of a photon at two different detectors. The entry A_ij of the matrix A is the probability that a photon emitted at the j-th pixel or voxel will be detected at the i-th detector.

In the emission tomography case it is common to take a statistical view [89, 88, 106, 109, 114], in which the quantity x_j is the expected number of emissions at the j-th pixel during the scanning time, so that the expected count at the i-th detector is

$$E(b_i) = \sum_{j=1}^{J} A_{ij} x_j. \qquad (30.3)$$

The system of equations Ax = b is obtained by replacing the expected count, E(b_i), with the actual count, b_i; obviously, an exact solution of the system is not needed in this case. As in the transmission case, we seek an approximate, and nonnegative, solution of Ax = b, where, once again, all the entries of the system are nonnegative.

30.2.1 Maximum-Likelihood Parameter Estimation

The measured data in tomography are values of random variables. The probabilities associated with these random variables are used in formulating the image reconstruction problem as one of solving a large system of linear equations. We can also use the stochastic model of the data to formulate the problem as a statistical parameter-estimation problem, which suggests the image be estimated using likelihood maximization.

When formulated that way, the problem becomes a constrained optimization problem. The desired image can then be calculated using general-purpose iterative optimization algorithms, or iterative algorithms designed specifically to solve the particular problem.
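As one concrete example of an algorithm tailored to this problem, the sketch below runs the generic EMML (MLEM) multiplicative update for the Poisson model with nonnegative A, b, and x. This is the common textbook form of the iteration, written here only for illustration; it is not copied from this book's statement of the algorithm, and the synthetic data used are noiseless.

    import numpy as np

    def emml(A, b, n_iters=100):
        """Generic EMML/MLEM multiplicative update for nonnegative A and b.

        x_j <- (x_j / s_j) * sum_i A_ij * b_i / (A x)_i,  with s_j = sum_i A_ij.
        """
        s = A.sum(axis=0)                       # column sums s_j (assumed positive)
        x = np.ones(A.shape[1])                 # positive starting image
        for _ in range(n_iters):
            Ax = A @ x
            ratio = b / np.maximum(Ax, 1e-12)   # guard against division by zero
            x = (x / s) * (A.T @ ratio)
        return x

    # Tiny synthetic example with exact Poisson-mean data.
    rng = np.random.default_rng(1)
    A = rng.random((60, 30))
    x_true = rng.random(30)
    b = A @ x_true
    x_hat = emml(A, b)
    print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))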

30.3 Image Reconstruction in Tomography

Because of the wide-spread use of the algebraic approach, image reconstruction from tomographic data is an increasingly important area of applied numerical linear algebra, particularly for medical diagnosis [70, 74, 84, 101, 102, 114, 115]. In the algebraic approach, the problem is to solve, at least approximately, a large system of linear equations, Ax = b. The vector x is large because it is usually a vectorization of a discrete approximation of a function of two or three continuous spatial variables. The size of the system necessitates the use of iterative solution methods [90]. Because the entries of x usually represent intensity levels, of beam attenuation in transmission tomography, and of radionuclide concentration in emission tomography, we require x to be nonnegative; the physics of the situation may impose additional constraints on the entries of x. In practice, we often have prior knowledge about the function represented, in discrete form, by the vector x and we may wish to include this knowledge in the reconstruction. In tomography the entries of A and b are also nonnegative. Iterative algorithms tailored to find solutions to these special, constrained problems may out-perform general iterative solution methods [99]. To be medically useful in the clinic, the algorithms need to produce acceptable reconstructions early in the iterative process.

The Fourier approach to tomographic image reconstruction maintains, at least initially, the continuous model for the attenuation function. The data are taken to be line integrals through the attenuator, that is, values of its so-called x-ray transform, which, in the two-dimensional case, is the Radon transform. The Central Slice Theorem then relates the Radon-transform values to values of the Fourier transform of the attenuation function. Image reconstruction then becomes estimation of the (inverse) Fourier transform. In magnetic-resonance imaging (MRI), we again have the measured data related to the function we wish to image, the proton density function, by a Fourier relation.

In transmission and emission tomography, the data are photon counts, so it is natural to adopt a statistical model and to convert the image reconstruction problem into a statistical parameter-estimation problem. The estimation can be done using maximum likelihood (ML) or maximum a posteriori (MAP) Bayesian methods, which then require iterative optimization algorithms.

Chapter 31

Magnetic-Resonance Imaging

Fourier-transform estimation and extrapolation techniques play a major role in the rapidly expanding field of magnetic-resonance imaging (MRI).

31.1 An Overview of MRI

Protons have spin, which, for our purposes here, can be viewed as a charge distribution in the nucleus revolving around an axis. Associated with the resulting current is a magnetic dipole moment collinear with the axis of the spin. Within a single volume element of the body, there will be many protons. In elements with an odd number of protons, the nucleus itself will have a net magnetic moment. In much of magnetic-resonance imaging (MRI), it is the distribution of hydrogen in water molecules that is the object of interest, although the imaging of phosphorus to study energy transfer in biological processing is also important. There is ongoing work using tracers containing fluorine, to target specific areas of the body and avoid background resonance.

In the absence of an external magnetic field, the axes of these magnetic dipole moments have random orientation, dictated mainly by thermal effects. When a magnetic field is introduced, it induces a small fraction of the dipole moments to begin to align their axes with that of the magnetic field. Only because the number of protons per unit of volume is so large do we get a significant number of moments aligned in this way.

The axes of the magnetic dipole moments precess around the axis of the external magnetic field at the Larmor frequency, which is proportional to the intensity of the external magnetic field. If the magnetic field intensity varies spatially, then so does the Larmor frequency.

When the body is probed with an electromagnetic field at a given frequency, a resonance signal is produced by those protons whose spin axes are precessing at that frequency. The strength of the signal is proportional to the proton density within the targeted volume. The received signal is then processed to obtain information about that proton density.

As we shall see, when the external magnetic field is appropriately chosen, a Fourier relationship is established between the information extracted from the received signal and the proton density.

31.2 The External Magnetic Field

The external magnetic field generated in the MRI scanner is

$$H(r, t) = (H_0 + G(t)\cdot r)\,k + H_1(t)(\cos(\omega_0 t)\, i + \sin(\omega_0 t)\, j), \qquad (31.1)$$

where r = (x, y, z) is the spatial position vector, and ω_0 is the Larmor frequency associated with the static field intensity H_0, that is,

$$\omega_0 = \gamma H_0,$$

with γ the gyromagnetic ratio. The vectors i, j, and k are the unit vectors along the coordinate axes. The vector-valued function G(t) produces the gradient field G(t) · r. The magnetic field component in the x-y plane is the radio-frequency (rf) field.

If G(t) = 0, then the Larmor frequency is ω_0 everywhere. If G(t) = θ, for some direction vector θ, then the Larmor frequency is constant on planes normal to θ. In that case, when the body is probed with an electromagnetic field of frequency

$$\omega = \gamma(H_0 + s),$$

there is a resonance signal received from the locations r lying in the plane θ · r = s. The strength of the received signal is proportional to the integral, over that plane, of the proton density function. Therefore, the measured data will be values of the three-dimensional Radon transform of the proton density function, which is related to its three-dimensional Fourier transform by the Central Slice Theorem. Later, we shall consider two more widely used examples of G(t).

31.3 The Received Signal

We assume now that the function H_1(t) is a short π/2-pulse, that is, it has constant value over a short time interval [0, τ] and has integral π/(2γ).

The signal produced by the probed precessing magnetic dipole moments is approximately

$$S(t) = \int_{R^3} M_0(r)\, \exp\Big(-i\gamma\Big(\int_0^t G(s)\,ds\Big)\cdot r\Big)\exp(-t/T_2)\,dr, \qquad (31.2)$$

where M_0(r) is the local magnetization, which is proportional to the proton density function, and T_2 is the transverse or spin-spin relaxation time.

31.3.1 An Example of G(t)

Suppose now that g > 0 and θ is an arbitrary direction vector. Let

$$G(t) = g\theta, \quad \text{for } \tau \le t, \qquad (31.3)$$

and G(t) = 0 otherwise. Then the received signal S(t) is

$$S(t) = \int_{R^3} M_0(r)\,\exp(-i\gamma g(t-\tau)\theta\cdot r)\,dr = (2\pi)^{3/2}\,\hat{M}_0(\gamma g(t-\tau)\theta), \qquad (31.4)$$

for τ ≤ t << T_2, where \hat{M}_0 denotes the three-dimensional Fourier transform of the function M_0(r).

From Equation (31.4) we see that, by selecting different direction vectors and by sampling the received signal S(t) at various times, we can obtain values of the Fourier transform of M_0 along lines through the origin in the Fourier domain, called k-space. If we had these values for all θ and for all t we would be able to determine M_0(r) exactly. Instead, we have much the same problem as in transmission tomography: only finitely many θ and only finitely many samples of S(t). Noise is also a problem, because the resonance signal is not strong, even though the external magnetic field is.

We may wish to avoid having to estimate the function M_0(r) from finitely many noisy values of its Fourier transform. We can do this by selecting the gradient field G(t) differently.

31.3.2 Another Example of G(t)

The vector-valued function G(t) can be written as

$$G(t) = (G_1(t), G_2(t), G_3(t)).$$

Now we let G_2(t) = g_2 and G_3(t) = g_3, for 0 ≤ t ≤ τ, and zero otherwise, and G_1(t) = g_1, for τ ≤ t, and zero otherwise. This means that only H_0 k and the rf field are present up to time τ, and then the rf field is shut off and the gradient field is turned on. Then, for t ≥ τ, we have

$$S(t) = (2\pi)^{3/2}\,\hat{M}_0(\gamma(t-\tau)g_1,\ \gamma\tau g_2,\ \gamma\tau g_3).$$

By selecting

$$t_n = n\Delta t + \tau, \quad \text{for } n = 1,...,N,$$

$$g_{2k} = k\Delta g, \quad \text{and} \quad g_{3i} = i\Delta g, \quad \text{for } i, k = -m,...,m,$$

we have values of the Fourier transform \hat{M}_0 on a Cartesian grid in three-dimensional k-space. The local magnetization function M_0 can then be approximated using the fast Fourier transform.
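The final FFT step can be illustrated with a small two-dimensional Python sketch. It fabricates Cartesian k-space data from a known magnetization phantom and recovers the phantom with an inverse FFT; this is a toy discrete stand-in for the continuous description above, not an actual MRI processing chain, and the phantom and grid size are arbitrary.

    import numpy as np

    # A simple 2-D "magnetization" phantom: a bright rectangle on a dark background.
    N = 64
    M0 = np.zeros((N, N))
    M0[24:40, 20:44] = 1.0

    # Simulated Cartesian k-space data: DFT samples of the phantom.
    kspace = np.fft.fft2(M0)

    # Reconstruction from the gridded k-space samples via the inverse FFT.
    M0_rec = np.fft.ifft2(kspace).real

    print("max reconstruction error:", np.abs(M0_rec - M0).max())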


Chapter 32

Hyperspectral Imaging

Hyperspectral image processing provides an excellent example of the need for estimating Fourier transform values from limited data. In this chapter we describe one novel approach, due to Mooney et al. [97]; the presentation here follows [17].

In this hyperspectral-imaging problem the electromagnetic energy reflected or emitted by a point, such as light reflected from a location on the earth's surface, is passed through a prism to separate the components as to their wavelengths. Due to the dispersion of the different frequency components caused by the prism, these components are recorded in the image plane not at a single spatial location, but at distinct points along a line. Since the received energy comes from a region of points, not a single point, what is received in the image plane is a superposition of different wavelength components associated with different points within the object. The first task is to reorganize the data so that each location in the image plane is associated with all the components of a single point of the object being imaged; this is a Fourier-transform estimation problem, which we can solve using band-limited extrapolation.

The points of the image plane are in one-to-one correspondence with points of the object. These spatial locations in the image plane and in the object are discretized into finite two-dimensional grids. Once we have reorganized the data we have, for each grid point in the image plane, a function of wavelength, describing the intensity of each component of the energy from the corresponding grid point on the object. Practical considerations limit the fineness of the grid in the image plane; the resulting discretization of the object is into pixels. In some applications, such as satellite imaging, a single pixel may cover an area several meters on a side. Achieving subpixel resolution is one goal of hyperspectral imaging; capturing other subtleties of the scene is another.


Within a single pixel of the object, there may well be a variety of object types, each reflecting or emitting energy differently. The data we now have corresponding to a single pixel are therefore a mixture of the energies associated with each of the subobjects within the pixel. With prior knowledge of the possible types and their reflective or emissive properties, we can separate the mixture to determine which object types are present within the pixel and to what extent. This mixture problem can be solved using the RBI-EMML method.

Hyperspectral imaging gives rise to several of the issues we discuss in this book. From an abstract perspective the problem is the following: F and f are a Fourier-transform pair, as are G and g; F and G have finite support. We measure G and want F; g determines some, but not all, of the values of f. We will have, of course, only finitely many measurements of G from which to estimate values of g. Having estimated finitely many values of g, we have the corresponding estimates of f. We apply band-limited extrapolation of these finitely many values of f to estimate F. In fact, once we have estimated values of F, we may not be finished; each value of F is a mixture whose individual components may be what we really want. For this unmixing step we use the RBI-EMML algorithm.

The region of the object that we wish to image is described by the two-dimensional spatial coordinate x = (x_1, x_2). For simplicity, we take these coordinates to be continuous, leaving until the end the issue of discretization. We shall also denote by x the point in the image plane corresponding to the point x on the object; the units of distance between two such points in one plane and their corresponding points in the other plane may, of course, be quite different. For each x we let F(x, λ) denote the intensity of the component at wavelength λ of the electromagnetic energy that is reflected from or emitted by location x. We shall assume that F(x, λ) = 0 for (x, λ) outside some bounded portion of three-dimensional space.

Consider, for a moment, the case in which the energy sensed by the imaging system comes from a single point x. If the dispersion axis of the prism is oriented according to the unit vector p_θ, for some θ ∈ [0, 2π), then the component at wavelength λ of the energy from x on the object is recorded not at x in the image plane but at the point x + µ(λ − λ_0)p_θ. Here, µ > 0 is a constant and λ_0 is the wavelength for which the component from point x of the object is recorded at x in the image plane.

Now imagine energy coming to the imaging system from all the points within the imaged region of the object. Let G(x, θ) be the intensity of the energy received at location x in the image plane when the prism orientation is θ. It follows from the description of the sensing that

$$G(x, \theta) = \int_{-\infty}^{+\infty} F(x - \mu(\lambda - \lambda_0)p_\theta, \lambda)\, d\lambda. \qquad (32.1)$$

The limits of integration are not really infinite due to the finiteness of the aperture and the focal plane of the imaging system.


Our data will consist of finitely many values of G(x, θ), as x varies over the grid points of the image plane and θ varies over some finite discretized set of angles.

We begin the image processing by taking the two-dimensional inverse Fourier transform of G(x, θ) with respect to the spatial variable x to get

$$g(y, \theta) = \frac{1}{(2\pi)^2} \int G(x, \theta)\exp(-i x\cdot y)\,dx. \qquad (32.2)$$

Inserting the expression for G in Equation (32.1) into Equation (32.2), we obtain

$$g(y, \theta) = \exp(i\mu\lambda_0 p_\theta\cdot y)\int \exp(-i\mu\lambda p_\theta\cdot y)\, f(y, \lambda)\,d\lambda, \qquad (32.3)$$

where f(y, λ) is the two-dimensional inverse Fourier transform of F(x, λ) with respect to the spatial variable x. Therefore,

$$g(y, \theta) = \exp(i\mu\lambda_0 p_\theta\cdot y)\, F(y, \gamma_\theta), \qquad (32.4)$$

where F(y, γ) denotes the three-dimensional inverse Fourier transform of F(x, λ) and γ_θ = µ p_θ · y. We see then that each value of g(y, θ) that we estimate from our measurements provides us with a single estimated value of F.

We use the measured values of G(x, θ) to estimate values of g(y, θ), guided by the discussion in our earlier chapter on discretization. Having obtained finitely many estimated values of F, we use the support of the function F(x, λ) in three-dimensional space to perform a band-limited extrapolation estimate of the function F.

Alternatively, for each fixed y for which we have values of g(y, θ) we use the PDFT or MDFT to solve Equation (32.3), obtaining an estimate of f(y, λ) as a function of the continuous variable λ. Then, for each fixed λ, we again use the PDFT or MDFT to estimate F(x, λ) from the values of f(y, λ) previously obtained.

Once we have the estimated function F(x, λ) on a finite grid in three-dimensional space, we can use the RBI-EMML method, as in [96], to solve the mixture problem and identify the individual object types contained within the single pixel denoted x. For each fixed x corresponding to a pixel, denote by b = (b_1,...,b_I)^T the column vector with entries b_i = F(x, λ_i), where the λ_i, i = 1,...,I, constitute a discretization of the wavelength space of those λ for which F(x, λ) > 0. We assume that this energy intensity distribution vector b is a superposition of those vectors corresponding to a number of different object types; that is, we assume that

$$b = \sum_{j=1}^{J} a_j q_j, \qquad (32.5)$$


for some a_j ≥ 0 and intensity distribution vectors q_j, j = 1,...,J. Each column vector q_j is a model for what b would be if there had been only one object type filling the entire pixel. These q_j are assumed to be known a priori. Our objective is to find the a_j.

With Q the I by J matrix whose j-th column is q_j and a the column vector with entries a_j, we write Equation (32.5) as b = Qa. Since the entries of Q are nonnegative, the entries of b are positive, and we seek a nonnegative solution a, we can use any of the entropy-based iterative algorithms discussed earlier. Because of its simplicity of form and speed of convergence our preference is the RBI-EMML algorithm. The recent master's thesis of E. Meidunas [96] discusses just such an application.
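A hedged sketch of the unmixing step: the same EMML-style multiplicative update shown earlier for Ax = b applies directly to b = Qa, since Q, b, and a are nonnegative. The block-iterative (RBI) refinement preferred in the text is omitted here for brevity, and the spectral signatures and abundances below are random stand-ins, not real data.

    import numpy as np

    def emml(Q, b, n_iters=200):
        # Multiplicative EMML update for nonnegative Q, b (see the earlier sketch).
        s = Q.sum(axis=0)
        a = np.ones(Q.shape[1])
        for _ in range(n_iters):
            Qa = Q @ a
            a = (a / s) * (Q.T @ (b / np.maximum(Qa, 1e-12)))
        return a

    # Hypothetical example: I = 50 wavelength bins, J = 4 candidate object types.
    rng = np.random.default_rng(2)
    Q = rng.random((50, 4))          # known spectral signatures q_j (columns)
    a_true = np.array([0.6, 0.1, 0.0, 0.3])
    b = Q @ a_true                   # measured mixed spectrum for one pixel
    print("estimated abundances:", np.round(emml(Q, b), 3))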


Chapter 33

Farfield Propagation

A basic problem in remote sensing is to determine the nature of a distant object by measuring signals transmitted by or reflected from that object. If the object of interest is sufficiently remote, that is, is in the farfield, it can be assumed that the data we obtain by sampling the propagating spatio-temporal field is related to what we want by Fourier transformation. The problem is then to estimate a function from finitely many (usually noisy) values of its Fourier transform. Although there are many important mathematical tools employed to solve signal-processing problems, the Fourier transform is the most important. Our discussion of farfield propagation will serve to motivate the Fourier transform, not only as a useful mathematical device, but also as an object having actual physical significance.

We shall begin our discussion of farfield propagation by considering an extended object transmitting or reflecting a single-frequency, or narrowband, signal. Later, we shall move to the problem of a distant point source whose location we wish to ascertain, as well as to signals involving multiple frequencies, the so-called broadband-signal case. The narrowband, extended-object case is a good place to begin, since a point object is simply a limiting case of an extended object, and broadband received signals can always be filtered to reduce their frequency band.

The application we consider here is a common one of remote sensing of transmitted or reflected waves propagating from distant sources. Examples include optical imaging of planets and asteroids using reflected sunlight, radio-astronomy imaging of distant sources of radio waves, active and passive sonar, and radar imaging.


33.1 The Solar-Emission Problem

In [14] Bracewell discusses the solar-emission problem. In 1942, it was observed that radio-wave emissions in the one-meter wavelength range were arriving from the sun. Were they coming from the entire disk of the sun or were the sources more localized, in sunspots, for example? The problem then was to view each location on the sun's surface as a potential source of these radio waves and to determine the intensity of emission corresponding to each location. The sun has an angular diameter of 30 min. of arc, or one-half of a degree, when viewed from earth, but the needed resolution was more like 3 min. of arc. As we shall see shortly, such resolution requires a radio telescope 1000 wavelengths across, which means a diameter of 1 km at a wavelength of 1 meter; in 1942 the largest military radar antennas were less than 5 meters across. A solution was found, using the method of reconstructing an object from line-integral data, a technique that surfaced again in tomography. The problem here is inherently two-dimensional, but, for simplicity, we shall begin with the one-dimensional case.

33.2 The One-Dimensional Case

Because our purpose is to motivate the Fourier transform by showing how it arises naturally in a discussion of farfield propagation, we begin with the more tractable narrowband-signal case. We assume that each of the signals being transmitted or reflected is a single-frequency complex sinusoid, having the form Ae^{iωt}, with complex amplitude A that varies as a function of position within the distant object. The Fourier transform enters the picture when we make the farfield assumption that the distance from the object to the sensors is much larger than the distance between sensors. Equivalently, we assume that we are far enough away from the sources that the spherically spreading waves they have generated appear to the sensors as planewave fronts.

Suppose that D > 0 represents a large distance from our sensors. Imagine each point (x, D, 0) along an axis parallel to the x-axis in three-dimensional space transmitting or reflecting the sinusoidal signal g(x)e^{iωt}, where ω is the common frequency of these signals and g(x) is the complex amplitude associated with each particular x. Our objective is to determine the values g(x), for each x. In the sun-spot problem such information will help us decide where the transmitted radio waves are coming from. In a radar problem, determining the g(x), the amplitudes of the reflected radio wave, will tell us something about the nature of the extended object, since different materials reflect the waves differently. We calculate the signal received at the point (s, 0, 0), under the assumption that D > 0 is much, much larger than |s|.


33.2.1 The Plane-Wave Model

Let θ denote the angle between the x-axis and the line from (0, 0, 0) to (x, D, 0). Because D is so much larger than |s|, the angle θ remains the same when observed from any other point (s, 0, 0), that is, there is no parallax, and the use of the point (0, 0, 0) is merely a convenience. Again, because D is so large, the spherically spreading field originating at (x, D, 0) is essentially a plane surface as it reaches the sensors. The planes of constant value are normal to the direction vector θ = (cos θ, sin θ). Let b(s, t) be the signal from (x, D, 0) that is received at location (s, 0, 0) at time t. For reference, let us suppose that

$$b(0, t) = e^{i\omega(t - \frac{D}{c})} g(x).$$

Because the planewaves travel at a speed c, we have

$$b(s, t) = b\Big(0, t + \frac{s\cos\theta}{c}\Big) = e^{i\omega(t - \frac{D}{c})}\, e^{i\frac{\omega s\cos\theta}{c}}\, g(x).$$

Of course, the signal received at (s, 0, 0) does not come only from a single point (x, D, 0), but from all the points (x, D, 0), so the combined signal received at (s, 0, 0) is

$$B(s, t) = e^{i\omega(t - \frac{D}{c})} \int e^{i\frac{\omega s\cos\theta}{c}}\, g(x)\,dx. \qquad (33.1)$$

Since θ is a one-to-one function of x, we can view g(x) as a function of θ, and write g(θ) in place of g(x). We then introduce the new variable k = (ω/c) cos θ and write the integral

$$\int e^{i\frac{\omega s\cos\theta}{c}}\, g(x)\,dx$$

as

$$\frac{c}{\omega}\int_{-\omega/c}^{\omega/c} f(k)e^{isk}\,dk, \qquad (33.2)$$

where f(k) is the function g(θ)/sin θ, written as a function of the variable k. Since, in most applications, the distant object has a small angular diameter when viewed from a great distance (the sun's is 30 minutes of arc), the angle θ will be restricted to a small interval centered at θ = π/2. Therefore, sin θ is bounded away from zero and f(k) is well defined.

The integral

$$\int_{-\omega/c}^{\omega/c} f(k)e^{isk}\,dk$$

is the familiar one that defines the Fourier transform of the function f(k). Using the approximations permitted under the farfield assumption, the received signal B(s, t) is easily shown to provide the Fourier transform of the object function f(k).


33.3 Fourier-Transform Pairs

We consider now the Fourier transform of a function of a single real variable. In the previous section it was reasonable to denote the Fourier transform of f(k) by F(s), with s denoting location in sensor space and k denoting wave vectors associated with given angles. However, in discussing the more general case, it is better to use more conventional notation. Therefore, we shall consider a function f(x) having Fourier transform F(γ). The variable x has no relation to the variable of the same name used to describe the spatial extent of the distant object being imaged in our previous example.

33.3.1 The Fourier Transform

Let f(x) be defined for the real variable x in (−∞, ∞). The Fourier transform of f(x) is the function of the real variable γ given by

$$F(\gamma) = \int_{-\infty}^{\infty} f(x)e^{i\gamma x}\,dx. \qquad (33.3)$$

In our example of farfield propagation, the signal received at (s, 0, 0), as given by Equation (33.1), can be rewritten as

$$B(s, t) = \frac{c}{\omega}\, e^{i\omega(t - \frac{D}{c})}\, F(s), \qquad (33.4)$$

where F(s) is the Fourier transform of f(k). Consequently, we can say that the data measured at the sensor locations (s, 0, 0) give us (noisy) values of the Fourier transform of f(k).

33.3.2 Sampling

Because the function f(k) is zero outside the interval [−ω/c, ω/c], the function F(s) is band-limited. The Nyquist spacing in the variable s is therefore

$$\Delta s = \frac{\pi c}{\omega}.$$

The wavelength λ associated with the frequency ω is defined to be

$$\lambda = \frac{2\pi c}{\omega},$$

so that

$$\Delta s = \frac{\lambda}{2}.$$


The significance of the Nyquist spacing comes from Shannon's Sampling Theorem, which says that if we have the values F(m∆s), for all integers m, then we have enough information to recover f(k) exactly. In practice, of course, this is never the case.

Notice that B(s, t) is not just F(s), but

$$B(s, t) = \frac{c}{\omega}\, e^{i\omega(t - \frac{D}{c})}\, F(s).$$

To extract F(s) from B(s, t), we need to remove the factor e^{iω(t−D/c)}. When the frequency ω is large, as in optical remote sensing, for example, determining this value accurately may be impossible. What we then have is the phase problem; that is, we can measure only |F(s)|, and not the phase of the complex numbers F(s).

33.3.3 Reconstructing from Fourier-Transform Data

As illustrated by the farfield propagation example, our goal is often to reconstruct the function f(x) from measurements of its Fourier transform F(γ). But, how? If we have F(γ) for all real γ, then we can recover the function f(x) using the Fourier Inversion Formula:

$$f(x) = \frac{1}{2\pi}\int F(\gamma)e^{-i\gamma x}\,d\gamma. \qquad (33.5)$$

The functions f(x) and F(γ) are called a Fourier-transform pair.

33.3.4 An Example

For example, consider an extended object of finite length, with uniform amplitude function f(x) = 1/(2X), for |x| ≤ X, and f(x) = 0, otherwise. The Fourier transform of this f(x) is

$$F(\gamma) = \frac{\sin(X\gamma)}{X\gamma},$$

for all real γ ≠ 0, and F(0) = 1. Note that F(γ) is nonzero throughout the real line, except for isolated zeros, but that it goes to zero as we go to the infinities. This is typical behavior. Notice also that the smaller the X, the slower F(γ) dies out; the first zeros of F(γ) are at |γ| = π/X, so the main lobe widens as X goes to zero.

It may seem paradoxical that when X is larger, its Fourier transform dies off more quickly. The Fourier transform F(γ) goes to zero faster for larger X because of destructive interference. Because of differences in their complex phases, the magnitude of the sum of the signals received from various parts of the object is much smaller than we might expect, especially when X is large.


For smaller X the signals received at a sensor are much more in phase with one another, and so the magnitude of the sum remains large. A more quantitative statement of this phenomenon is provided by the uncertainty principle (see [32]).
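A quick numerical check of this transform pair is easy to carry out. The Python sketch below approximates the integral defining F(γ) by a Riemann sum and compares it with sin(Xγ)/(Xγ); the grid sizes and the value of X are arbitrary choices made only for illustration.

    import numpy as np

    X = 2.0
    x = np.linspace(-X, X, 4001)          # support of f
    f = np.full_like(x, 1.0 / (2 * X))    # f(x) = 1/(2X) on [-X, X]
    dx = x[1] - x[0]

    gammas = np.linspace(-10, 10, 201)
    # Riemann-sum approximation of F(gamma) = integral of f(x) exp(i gamma x) dx
    F_num = np.array([np.sum(f * np.exp(1j * g * x)) * dx for g in gammas])
    # Exact value sin(X gamma)/(X gamma), written via np.sinc (sinc(t) = sin(pi t)/(pi t)).
    F_exact = np.sinc(X * gammas / np.pi)

    print("max error:", np.max(np.abs(F_num - F_exact)))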

33.4 The Dirac Delta

Consider what happens in the limit, as X → 0. Then we have an infinitely high point source at x = 0; we denote this by δ(x), the Dirac delta. The Fourier transform approaches the constant function with value 1, for all γ; the Fourier transform of f(x) = δ(x) is the constant function F(γ) = 1, for all γ. The Dirac delta δ(x) has the sifting property:

$$\int h(x)\delta(x)\,dx = h(0),$$

for each function h(x) that is continuous at x = 0.

Because the Fourier transform of δ(x) is the function F(γ) = 1, the Fourier inversion formula tells us that

$$\delta(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\omega x}\,d\omega. \qquad (33.6)$$

Obviously, this integral cannot be understood in the usual way. The integral in Equation (33.6) is a symbolic way of saying that

$$\int h(x)\Big(\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\omega x}\,d\omega\Big)dx = \int h(x)\delta(x)\,dx = h(0), \qquad (33.7)$$

for all h(x) that are continuous at x = 0; that is, the integral in Equation (33.6) has the sifting property, so it acts like δ(x). Interchanging the order of integration in Equation (33.7), we obtain

$$\int h(x)\Big(\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\omega x}\,d\omega\Big)dx = \frac{1}{2\pi}\int_{-\infty}^{\infty}\Big(\int h(x)e^{-i\omega x}\,dx\Big)d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} H(-\omega)\,d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} H(\omega)\,d\omega = h(0).$$

We shall return to the Dirac delta when we consider farfield point sources.

33.5 Practical Limitations

In actual remote-sensing problems, antennas cannot be of infinite extent.

In digital signal processing, moreover, there are only finitely many sensors.


We never measure the entire Fourier transform of f(x), but, at best, just part of it. In fact, the data we are able to measure is almost never exact values of the Fourier transform of f(x), but rather, values of some distorted or blurred version. To describe such situations, we usually resort to convolution-filter models.

33.5.1 Convolution Filtering

Imagine that what we measure are not values of F(γ), but of F(γ)H(γ), where H(γ) is a function that describes the limitations and distorting effects of the measuring process, including any blurring due to the medium through which the signals have passed, such as refraction of light as it passes through the atmosphere. If we apply the Fourier Inversion Formula to F(γ)H(γ), instead of to F(γ), we get

$$g(x) = \frac{1}{2\pi}\int F(\gamma)H(\gamma)e^{-i\gamma x}\,d\gamma. \qquad (33.8)$$

The function g(x) that results is g(x) = (f ∗ h)(x), the convolution of the functions f(x) and h(x), with the latter given by

$$h(x) = \frac{1}{2\pi}\int H(\gamma)e^{-i\gamma x}\,d\gamma.$$

Note that, if f(x) = δ(x), then g(x) = h(x); that is, our reconstruction of the object from distorted data is the function h(x) itself. For that reason, the function h(x) is called the point-spread function of the imaging system.

Convolution filtering refers to the process of converting any given function, say f(x), into a different function, say g(x), by convolving f(x) with a fixed function h(x). Since this process can be achieved by multiplying F(γ) by H(γ) and then inverse Fourier transforming, such convolution filters are studied in terms of the properties of the function H(γ), known in this context as the system transfer function, or the optical transfer function (OTF); when γ is a frequency, rather than a spatial frequency, H(γ) is called the frequency-response function of the filter. The magnitude of H(γ), |H(γ)|, is called the modulation transfer function (MTF). The study of convolution filters is a major part of signal processing. Such filters provide both reasonable models for the degradation signals undergo, and useful tools for reconstruction.

Let us rewrite Equation (33.8), replacing F(γ) and H(γ) with their definitions, as given by Equation (33.3). Then we have

$$g(x) = \frac{1}{2\pi}\int\Big(\int f(t)e^{i\gamma t}\,dt\Big)\Big(\int h(s)e^{i\gamma s}\,ds\Big)e^{-i\gamma x}\,d\gamma.$$

Interchanging the order of integration, we get

$$g(x) = \int\int f(t)h(s)\Big(\frac{1}{2\pi}\int e^{i\gamma(t+s-x)}\,d\gamma\Big)\,ds\,dt.$$


Now using Equation (33.6) to replace the inner integral with δ(t + s − x), the next integral becomes

$$\int h(s)\delta(t + s - x)\,ds = h(x - t).$$

Finally, we have

$$g(x) = \int f(t)h(x - t)\,dt; \qquad (33.9)$$

this is the definition of the convolution of the functions f and h.
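For discrete, sampled signals the analogous statement is that pointwise multiplication of discrete Fourier transforms corresponds to periodic (circular) convolution of the samples. The Python sketch below checks that discrete analogue numerically; it is not the continuous formula itself, and the random signals are stand-ins.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 128
    f = rng.standard_normal(n)
    h = rng.standard_normal(n)

    # Circular convolution computed directly from its definition...
    g_direct = np.zeros(n)
    for k in range(n):
        for m in range(n):
            g_direct[k] += f[m] * h[(k - m) % n]

    # ...and via the product of discrete Fourier transforms.
    g_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(h)).real

    print("max difference:", np.max(np.abs(g_direct - g_fft)))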

33.5.2 Low-Pass Filtering

A major problem in image reconstruction is the removal of blurring, which is often modelled using the notion of convolution filtering. In the one-dimensional case, we describe blurring by saying that we have available measurements not of F(γ), but of F(γ)H(γ), where H(γ) is the frequency-response function describing the blurring. If we know the nature of the blurring, then we know H(γ), at least to some degree of precision. We can try to remove the blurring by taking measurements of F(γ)H(γ), dividing these numbers by the value of H(γ), and then inverse Fourier transforming. The problem is that our measurements are always noisy, and typical functions H(γ) have many zeros and small values, making division by H(γ) dangerous, except where the values of H(γ) are not too small. These values of γ tend to be the smaller ones, centered around zero, so that we end up with estimates of F(γ) itself only for the smaller values of γ. The result is a low-pass filtering of the object f(x).

To investigate such low-pass filtering, we suppose that H(γ) = 1, for |γ| ≤ Γ, and is zero, otherwise. Then the filter is called the ideal Γ-lowpass filter. In the farfield propagation model, the variable x is spatial, and the variable γ is spatial frequency, related to how the function f(x) changes spatially, as we move x. Rapid changes in f(x) are associated with values of F(γ) for large γ. For the case in which the variable x is time, the variable γ becomes frequency, and the effect of the low-pass filter on f(x) is to remove its higher-frequency components.

One effect of low-pass filtering in image processing is to smooth out the more rapidly changing features of an image. This can be useful if these features are simply unwanted oscillations, but if they are important detail, the smoothing presents a problem. Restoring such wanted detail is often viewed as removing the unwanted effects of the low-pass filtering; in other words, we try to recapture the missing high-spatial-frequency values that have been zeroed out. Such an approach to image restoration is called frequency-domain extrapolation. How can we hope to recover these missing spatial frequencies, when they could have been anything?


To have some chance of estimating these missing values we need to have some prior information about the image being reconstructed.
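The effect of the ideal Γ-lowpass filter is easy to see numerically: zero out the high-frequency coefficients of a sampled signal and transform back. A minimal Python sketch, with an arbitrary boxcar object and an arbitrary cutoff:

    import numpy as np

    n = 256
    t = np.arange(n)
    f = np.where((t > 100) & (t < 156), 1.0, 0.0)      # an object with sharp edges

    F = np.fft.fft(f)
    freqs = np.fft.fftfreq(n)                          # cycles per sample
    cutoff = 0.05                                      # keep only the low frequencies
    F_low = np.where(np.abs(freqs) <= cutoff, F, 0.0)

    g = np.fft.ifft(F_low).real                        # low-pass filtered version of f
    print("max deviation after low-pass filtering:", round(float(np.max(np.abs(f - g))), 3))

The sharp edges of f are smoothed out in g; recovering them amounts to extrapolating the zeroed-out high-frequency values, which is only possible with prior information.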

33.6 Point Sources as Dirac Deltas

Television signals reflected from satellites are picked up using antennas in the shape of parabolic dishes. The idea here is to point the dish at the satellite, so that signals from other sources are discriminated against and the one from the satellite is reinforced. In applications such as sonar surveillance, it is often the case that the array of sensors cannot be moved. In such cases electronic steering using phase shifts replaces the physical turning of the antenna. A common practice in sonar is to place sensors at equal intervals along a straight line; such an arrangement is called a linear array. If our sensor array is linear, along the line making the angle φ = 0 with the horizontal axis, then each sensor in the linear array receives the same signal. If the line of the array corresponds to an angle φ that is not zero, then two sensors a distance ∆ apart along the line receive the signal with time delays that differ by ∆ sin φ / c, that is, with a phase difference of ω∆ sin φ / c. Therefore, the data we measure along this linear array contains, in the phase differences, information about the direction of the farfield point source, relative to the line of the array. This forms the basis for sonar direction-of-arrival estimation and detection; for further details see [32].

As we shall see, if we had available an infinite number of sensors, properly spaced along the line of the array, we could determine the direction of the distant point source with perfect accuracy. In the real world, we must make do with finitely many imperfect sensors. In addition, it is rarely the case that the received signal comes from a single point source; there will always be background noise, other point sources, and so on. Limitations on the number of sensors, and on where they can be placed, make it harder to separate closely-spaced distant point sources. If we know a priori that we are looking at point sources, and not extended objects, the resolution problem can be partly overcome, using nonlinear high-resolution techniques. We shall consider high-resolution methods, such as entropy maximization and likelihood maximization, in subsequent chapters.
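As a toy illustration of this idea (not an algorithm from the text), the following Python sketch simulates one narrowband snapshot across a uniform linear array and recovers the arrival angle from the average adjacent-sensor phase difference. The propagation speed, frequency, spacing, and angle are arbitrary choices.

    import numpy as np

    c = 1500.0                      # propagation speed (sonar-like), m/s
    omega = 2 * np.pi * 100.0       # temporal frequency, rad/s
    wavelength = 2 * np.pi * c / omega
    d = wavelength / 2              # sensor spacing at the Nyquist spacing
    n_sensors = 8
    phi_true = np.deg2rad(25.0)     # direction of the farfield point source

    # Narrowband snapshot at t = 0: relative phases across the linear array.
    n = np.arange(n_sensors)
    snapshot = np.exp(1j * omega * n * d * np.sin(phi_true) / c)

    # Estimate the arrival angle from the average adjacent-sensor phase difference.
    dphi = np.angle(np.sum(snapshot[1:] * np.conj(snapshot[:-1])))
    phi_est = np.arcsin(c * dphi / (omega * d))
    print("estimated angle (deg):", np.degrees(phi_est))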

33.7 The Limited-Aperture Problem

In the farfield propagation model, our measurements in the farfield give us the values F(s). Suppose now that we are able to take measurements only for limited values of s, say for |s| ≤ A; then 2A is the aperture of our antenna or array of sensors.


We describe this, in the general case, by saying that we have available measurements of F(γ)H(γ), where H(γ) = χ_Γ(γ) = 1, for |γ| ≤ Γ, and zero otherwise. So, in addition to describing blurring and low-pass filtering, the convolution-filter model can also be used to model the limited-aperture problem. As in the low-pass case, the limited-aperture problem can be attacked using extrapolation, but with the same sort of risks described for the low-pass case. A much different approach is to increase the aperture by physically moving the array of sensors, as in synthetic aperture radar (SAR).

Returning to the farfield propagation model, if we have Fourier-transform data only for |s| ≤ A, then we have F(s) for |s| ≤ A. Using H(s) = χ_A(s) to describe the limited aperture of the system, the point-spread function is h(k) = sin(Ak)/(πk). The first zeros of the numerator occur at |k| = π/A, so the main lobe of the point-spread function has width 2π/A. For this reason, the resolution of such a limited-aperture imaging system is said to be on the order of 1/A. Because the distant object is expressed as a function of k in the interval [−ω/c, ω/c], the resolution achieved in imaging the distant object will depend on the frequency ω, as well. For that reason, it is common practice to measure the aperture A in units of the wavelength λ, rather than, say, in units of meters; an aperture of A = 5 meters may be acceptable if the frequency is high, but not if the radiation is in the one-meter-wavelength range.

33.7.1 Resolution

If f(x) = δ(x) and H(γ) = χ_Γ(γ) describes the aperture-limitation of the imaging system, then the point-spread function is h(x) = sin(Γx)/(πx). The maximum of h(x) still occurs at x = 0, but the main lobe of h(x) extends from −π/Γ to π/Γ; the point source has been spread out. If the point-source object shifts, so that f(x) = δ(x − a), then the reconstructed image of the object is h(x − a), so the peak is still in the proper place. If we know a priori that the object is a single point source, but we do not know its location, the spreading of the point poses no problem; we simply look for the maximum in the reconstructed image. Problems arise when the object contains several point sources, or when we do not know a priori what we are looking at, or when the object contains no point sources, but is just a continuous distribution.

Suppose that f(x) = δ(x − a) + δ(x − b); that is, the object consists of two point sources. Then Fourier inversion of the aperture-limited data leads to the reconstructed image

$$g(x) = \frac{\sin\Gamma(x - a)}{\pi(x - a)} + \frac{\sin\Gamma(x - b)}{\pi(x - b)}.$$

If |b − a| is large enough, g(x) will have two distinct maxima, at approximately x = a and x = b, respectively. However, if |b − a| is too small, the distinct maxima merge into one, at x = (a + b)/2, and resolution will be lost.


How small is too small will depend on Γ, which, of course, depends on both A and ω.
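The merging of the two peaks is easy to reproduce numerically. The sketch below evaluates the reconstructed image g(x) for a well-separated pair and for a closely spaced pair of point sources and checks whether there is a dip between the two peak locations; the values of Γ, a, and b are arbitrary choices.

    import numpy as np

    def g(x, a, b, Gamma):
        # Reconstructed image of two point sources at a and b through an
        # aperture-limited system: sin(Gamma(x-a))/(pi(x-a)) + sin(Gamma(x-b))/(pi(x-b)).
        # np.sinc(t) = sin(pi t)/(pi t), so sin(G u)/(pi u) = (G/pi) * sinc(G u / pi).
        return (Gamma / np.pi) * (np.sinc(Gamma * (x - a) / np.pi)
                                  + np.sinc(Gamma * (x - b) / np.pi))

    Gamma = 5.0                      # main-lobe half-width is pi/Gamma, about 0.63
    for a, b in [(-1.5, 1.5), (-0.2, 0.2)]:
        dip = g((a + b) / 2, a, b, Gamma) < g(a, a, b, Gamma)
        print(f"separation {b - a:.1f}: dip between the peaks -> resolved = {dip}")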

Suppose now that f(x) = δ(x − a), but we do not know a priori that the object is a single point source. We calculate

$$g(x) = h(x - a) = \frac{\sin\Gamma(x - a)}{\pi(x - a)}$$

and use this function as our reconstructed image of the object, for all x. What we see when we look at g(x) for some x = b ≠ a is g(b), which is the same thing we see when the point source is at x = b and we look at x = a. Point-spreading is, therefore, more than a cosmetic problem. When the object is a point source at x = a, but we do not know a priori that it is a point source, the spreading of the point causes us to believe that the object function f(x) is nonzero at values of x other than x = a. When we look at, say, x = b, we see a nonzero value that is caused by the presence of the point source at x = a.

Suppose now that the object function f(x) contains no point sources, but is simply an ordinary function of x. If the aperture A is very small, then the function h(x) is nearly constant over the entire extent of the object. The convolution of f(x) and h(x) is essentially the integral of f(x), so the reconstructed object is g(x) = ∫ f(x)dx, for all x.

Let's see what this means for the solar-emission problem discussed earlier.

33.7.2 The Solar-Emission Problem Revisited

The wavelength of the radiation is λ = 1 meter. Therefore, ω/c = 2π, and k in the interval [−2π, 2π] corresponds to the angle θ in [0, π]. The sun has an angular diameter of 30 minutes of arc, which is about 10^{−2} radians. Therefore, the sun subtends the angles θ in [π/2 − (0.5)·10^{−2}, π/2 + (0.5)·10^{−2}], which corresponds roughly to the variable k in the interval [−3·10^{−2}, 3·10^{−2}]. Resolution of 3 minutes of arc means resolution in the variable k of 3·10^{−3}. If the aperture is 2A, then to achieve this resolution, we need

$$\frac{\pi}{A} = 3\cdot 10^{-3},$$

or

$$A = \frac{\pi}{3}\cdot 10^{3}$$

meters, or about 1000 meters.

The radio-wave signals emitted by the sun are focused, using a parabolic radio-telescope. The telescope is pointed at the center of the sun.


Because the sun is a great distance from the earth and the subtended arc is small (30 min.), the signals from each point on the sun's surface arrive at the parabola head-on, that is, parallel to the line from the vertex to the focal point, and are reflected to the receiver located at the focal point of the parabola. The effect of the parabolic antenna is not to discriminate against signals coming from other directions, since there are none, but to effect a summation of the signals received at points (s, 0, 0), for |s| ≤ A, where 2A is the diameter of the parabola. When the aperture is large, the function H(s) is nearly one for all s and the signal received at the focal point is essentially

$$\int F(s)\,ds = f(0);$$

we are now able to distinguish between f(0) and other values f(k). When the aperture is small, H(s) is essentially δ(s) and the signal received at the focal point is essentially

$$\int F(s)\delta(s)\,ds = F(0) = \int f(k)\,dk;$$

now all we get is the contribution from all the k, superimposed, and all resolution is lost.

Since the solar emission problem is clearly two-dimensional, and we need 3 min. resolution in both dimensions, it would seem that we would need a circular antenna with a diameter of about one kilometer, or a rectangular antenna roughly one kilometer on a side. We shall return to this problem later, once when we discuss multi-dimensional Fourier transforms, and then again when we consider tomographic reconstruction of images from line integrals.

33.8 Discrete Data

A familiar topic in signal processing is the passage from functions of continuous variables to discrete sequences. This transition is achieved by sampling, that is, extracting values of the continuous-variable function at discrete points in its domain. Our example of farfield propagation can be used to explore some of the issues involved in sampling.

Imagine an infinite uniform line array of sensors formed by placing receivers at the points (n∆, 0, 0), for some ∆ > 0 and all integers n. Then our data are the values F(n∆). Because we defined k = (ω/c) cos θ, it is clear that the function f(k) is zero for k outside the interval [−ω/c, ω/c].

Exercise 33.1 Show that our discrete array of sensors cannot distinguish between the signal arriving from θ and a signal with the same amplitude, coming from an angle α with

$$\frac{\omega}{c}\cos\alpha = \frac{\omega}{c}\cos\theta + \frac{2\pi}{\Delta}m,$$

where m is an integer.


To avoid the ambiguity described in Exercise 33.1, we must select ∆ > 0 so that

$$-\frac{\omega}{c} + \frac{2\pi}{\Delta} \ge \frac{\omega}{c},$$

or

$$\Delta \le \frac{\pi c}{\omega} = \frac{\lambda}{2}.$$

The sensor spacing ∆s = λ/2 is the Nyquist spacing.

In the sunspot example, the object function f(k) is zero for k outside of an interval much smaller than [−ω/c, ω/c]. Knowing that f(k) = 0 for |k| > K, for some 0 < K < ω/c, we can accept ambiguities that confuse θ with another angle that lies outside the angular diameter of the object. Consequently, we can redefine the Nyquist spacing to be

$$\Delta s = \frac{\pi}{K}.$$

This tells us that when we are imaging a distant object with a small angular diameter, the Nyquist spacing is greater than λ/2. If our sensor spacing has been chosen to be λ/2, then we have oversampled. In the oversampled case, band-limited extrapolation methods can be used to improve resolution (see [33]).

33.8.1 Reconstruction from Samples

From the data gathered at our infinite array we have extracted the Fourier-transform values F(n∆), for all integers n. The obvious question is whether or not the data is sufficient to reconstruct f(k). We know that, to avoid ambiguity, we must have ∆ ≤ πc/ω. The good news is that, provided this condition holds, f(k) is uniquely determined by this data and formulas exist for reconstructing f(k) from the data; this is the content of the Shannon Sampling Theorem. Of course, this is only of theoretical interest, since we never have infinite data. Nevertheless, a considerable amount of traditional signal-processing exposition makes use of this infinite-sequence model. The real problem, of course, is that our data is always finite.

33.9 The Finite-Data Problem

Suppose that we build a uniform line array of sensors by placing receivers at the points (n∆, 0, 0), for some ∆ > 0 and n = −N,...,N. Then our data are the values F(n∆), for n = −N,...,N. Suppose, as previously, that the object of interest, the function f(k), is nonzero only for values of k in the interval [−K, K], for some 0 < K < ω/c. Once again, we must have ∆ ≤ πc/ω to avoid ambiguity; but this is not enough, now. The finite Fourier data is no longer sufficient to determine a unique f(k). The best we can hope


to do is to estimate the true f(k), using both our measured Fourier data and whatever prior knowledge we may have about the function f(k), such as where it is nonzero, if it consists of Dirac delta point sources, or if it is nonnegative. The data is also noisy, and that must be accounted for in the reconstruction process. We shall return later to this important problem of reconstructing a general function f(x) from finitely many noisy values of its Fourier transform.

In certain applications, such as sonar array processing, the sensors are not necessarily arrayed at equal intervals along a line, or even at the grid points of a rectangle, but in an essentially arbitrary pattern in two, or even three, dimensions. In such cases, we have values of the Fourier transform of the object function, but at essentially arbitrary values of the variable. How best to reconstruct the object function in such cases is not obvious.

33.10 Functions of Several Variables

Fourier transformation applies, as well, to functions of several variables. As in the one-dimensional case, we can motivate the multi-dimensional Fourier transform using the farfield propagation model. As we noted earlier, the solar emission problem is inherently a two-dimensional problem.

33.10.1 Two-Dimensional Farfield Object

Consider the case of a distant two-dimensional transmitting or reflecting object. Let each point (x, D, z) in the x, z-plane send out the signal g(x, z)e^{iωt}. As in the one-dimensional case, D is so large that the spherically spreading wave from (x, D, z) is essentially a plane surface when it reaches the plane y = 0 of the sensors. Let θ be the unit vector along the line from (0, 0, 0) to (x, D, z). Then θ is normal to the planes of constant value of the field originating at (x, D, z). As before, we assume that D is so large that the direction of (x, D, z) as measured from (0, 0, 0) is the same as would have been measured at any other location (u, 0, v) at which we may locate a sensor.

Let b(u, v, t) be the signal from (x, D, z) that is received at location (u, 0, v) at time t. For reference, let us suppose that

$$b(0, 0, t) = e^{i\omega(t - \frac{D}{c})}\, g(x, z).$$

Because the planewaves travel at a speed c, we have

$$b(u, v, t) = b\Big(0, 0, t + \frac{s\cdot\theta}{c}\Big) = e^{i\omega(t - \frac{D}{c})}\, e^{i\frac{\omega s\cdot\theta}{c}}\, g(x, z),$$

where s = (u, 0, v).


Of course, the signal received at (u, 0, v) does not come only from a single point (x, D, z), but from all the points (x, D, z), so the combined signal received at (u, 0, v) is

$$B(u, v, t) = e^{i\omega(t - \frac{D}{c})}\int\int e^{i\frac{\omega s\cdot\theta}{c}}\, g(x, z)\,dx\,dz. \qquad (33.10)$$

Since there is a one-to-one relationship between the direction vectors θ and the points (x, D, z), we can view g(x, z) as a function of θ, and write g(θ) in place of g(x, z). We then introduce the new variable k = (ω/c)θ and write the integral

$$\int\int e^{i\frac{\omega s\cdot\theta}{c}}\, g(x, z)\,dx\,dz$$

as

$$\frac{c}{\omega}\int f(k)e^{i s\cdot k}\,dk, \qquad (33.11)$$

where the integral is over all three-dimensional vectors having length ω/c, and f(k) is the function obtained from g(θ) and the Jacobian of the transformation of the variables of integration. Since, in most applications, the distant object has a small angular diameter when viewed from a great distance (the sun's is 30 minutes of arc), the direction vector θ will be restricted to a small subset of vectors centered at θ = (0, 1, 0).

The integral

$$\int f(k)e^{i s\cdot k}\,dk$$

is the familiar one that defines the Fourier transform of the function f(k). Using the approximations permitted under the farfield assumption, the received signal B(u, v, t) provides the Fourier transform of the object function f(k).

33.10.2 Two-Dimensional Fourier Transforms

Generally, we consider a function f(x, z) of two real variables. Its Fourier transformation is

$$F(\alpha, \beta) = \int\int f(x, z)e^{i(x\alpha + z\beta)}\,dx\,dz. \qquad (33.12)$$

For example, suppose that f(x, z) = 1 for √(x² + z²) ≤ R, and zero, otherwise. Then we have

$$F(\alpha, \beta) = \int_{-\pi}^{\pi}\int_0^R e^{-i(\alpha r\cos\theta + \beta r\sin\theta)}\, r\,dr\,d\theta.$$


In polar coordinates, with α = ρ cos φ and β  = ρ sin φ, we have

F (ρ, φ) =

 R

0  π

−πeirρ cos(θ−φ)dθrdr.

The inner integral is well known; π−π

eirρ cos(θ−φ)dθ = 2πJ 0(rρ),

where J 0 denotes the 0th order Bessel function. Using the identity z0

tnJ n−1(t)dt = znJ n(z),

we have

F (ρ, φ) =2πR

ρJ 1(ρR).

Notice that, since f (x, z) is a radial function, that is, dependent only onthe distance from (0, 0, 0) to (x, 0, z), its Fourier transform is also radial.

The first positive zero of  J 1(t) is around t = 4, so when we measureF  at various locations and find F (ρ, φ) = 0 for a particular (ρ, φ), we canestimate R ≈ 4/ρ. So, even when a distant spherical object, like a star,is too far away to be imaged well, we can sometimes estimate its size byfinding where the intensity of the received signal is zero.

33.10.3 Two-Dimensional Fourier Inversion

Just as in the one-dimensional case, the Fourier transformation that produced F(α, β) can be inverted to recover the original f(x, y). The Fourier Inversion Formula in this case is

f(x, y) = (1/4π²) ∫ F(α, β) e^{−i(αx+βy)} dα dβ.    (33.13)

It is important to note that this procedure can be viewed as two one-dimensional Fourier inversions: first, we invert F(α, β), as a function of, say, β only, to get the function of α and y

g(α, y) = (1/2π) ∫ F(α, β) e^{−iβy} dβ;

second, we invert g(α, y), as a function of α, to get

f(x, y) = (1/2π) ∫ g(α, y) e^{−iαx} dα.

If we write the functions f(x, y) and F(α, β) in polar coordinates, we obtain alternative ways to implement the two-dimensional Fourier inversion. We shall consider these other ways when we discuss the tomography problem of reconstructing a function f(x, y) from line-integral data.
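A quick numerical illustration of this separability, not part of the text: in the discrete setting the same statement holds for the FFT, so a two-dimensional inverse transform can be computed as one-dimensional inverse transforms applied along one axis and then the other.

```python
# Sketch: a 2D inverse FFT equals 1D inverse FFTs applied along axis 1 (beta)
# and then along axis 0 (alpha), mirroring the two-step inversion in the text.
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((64, 48)) + 1j * rng.standard_normal((64, 48))

full_2d = np.fft.ifft2(F)

g = np.fft.ifft(F, axis=1)           # invert in beta first, giving g(alpha, y)
two_passes = np.fft.ifft(g, axis=0)  # then invert in alpha, giving f(x, y)

print(np.allclose(full_2d, two_passes))   # True
```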


33.10.4 Limited Apertures in Two Dimensions

Suppose we have the values of the Fourier transform, F(α, β), for |α| ≤ A, |β| ≤ B. We describe this limited-data problem using the function H(α, β) that is one for |α| ≤ A, |β| ≤ B, and zero, otherwise. Then the point-spread function is the inverse Fourier transform of this H(α, β), given by

h(x, z) = (sin(Ax)/(πx)) (sin(Bz)/(πz)).

The resolution in the horizontal (x) direction is on the order of 1/A, and 1/B in the vertical.
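As a sanity check on the stated point-spread function, one can evaluate the inverse transform (1/4π²) ∫_{|α|≤A} ∫_{|β|≤B} e^{−i(αx+βz)} dα dβ by quadrature and compare it with the product of the two sinc-type factors; the values of A, B and the test points below are arbitrary.

```python
# Sketch: compare a quadrature evaluation of the limited-aperture inverse transform
# with the closed form h(x,z) = (sin(Ax)/(pi*x)) * (sin(Bz)/(pi*z)).
import numpy as np
from scipy.integrate import dblquad

A, B = 2.0, 5.0   # aperture half-widths in the (alpha, beta) domain (arbitrary)

def h_numeric(x, z):
    # the imaginary part integrates to zero by symmetry, so keep only the cosine
    val, _ = dblquad(lambda b, a: np.cos(a * x + b * z) / (4 * np.pi ** 2),
                     -A, A, -B, B)
    return val

def h_closed_form(x, z):
    return (np.sin(A * x) / (np.pi * x)) * (np.sin(B * z) / (np.pi * z))

for x, z in [(0.3, 0.2), (1.0, 0.5), (2.0, 1.5)]:
    print(x, z, h_numeric(x, z), h_closed_form(x, z))
```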

Suppose our aperture is circular, with radius A. Then we have Fourier transform values F(α, β) for √(α² + β²) ≤ A. Let H(α, β) equal one, for √(α² + β²) ≤ A, and zero, otherwise. Then the point-spread function of this limited-aperture system is the inverse Fourier transform of H(α, β), given by h(x, z) = (A/(2πr)) J_1(rA), with r = √(x² + z²). The resolution of this system is roughly the distance from the origin to the first null of the function J_1(rA), which means that rA = 4, roughly.

For the solar emission problem, this says that we would need a circular aperture with radius approximately one kilometer to achieve 3 minutes of arc resolution. But this holds only if the antenna is stationary; a moving antenna is different! The solar emission problem was solved by using a rectangular antenna with a large A, but a small B, and exploiting the rotation of the earth. The resolution is then good in the horizontal, but bad in the vertical, so that the imaging system discriminates well between two distinct vertical lines, but cannot resolve sources within the same vertical line. Because B is small, what we end up with is essentially the integral of the function f(x, z) along each vertical line. By tilting the antenna, and waiting for the earth to rotate enough, we can get these integrals along any set of parallel lines. The problem then is to reconstruct f(x, z) from such line integrals. This is also the main problem in tomography, as we shall see.

33.11 Broadband Signals

We have spent considerable time discussing the case of a distant point source or an extended object transmitting or reflecting a single-frequency signal. If the signal consists of many frequencies, the so-called broadband case, we can still analyze the received signals at the sensors in terms of time delays, but we cannot easily convert the delays to phase differences, and thereby make good use of the Fourier transform. One approach is to filter each received signal, to remove components at all but a single frequency, and then to proceed as previously discussed. In this way we can process


one frequency at a time. The object now is described in terms of a function of both x and ω, with f(x, ω) the complex amplitude associated with the spatial variable x and the frequency ω. In the case of radar, the function f(x, ω) tells us how the material at (x, 0, 0) reflects the radio waves at the various frequencies ω, and thereby gives information about the nature of the material making up the object near the point (x, 0, 0).
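Below is a minimal sketch of the filtering step just mentioned, not taken from the text: a broadband received signal is reduced to a single frequency by masking its FFT, after which the narrowband processing already described can be applied one frequency at a time. The sample rate, tone frequencies, and bandwidth are invented for the demonstration.

```python
# Sketch: isolate one frequency component of a broadband signal with an FFT mask.
import numpy as np

fs = 8000.0                          # samples per second (assumed)
t = np.arange(4096) / fs
rng = np.random.default_rng(1)

# a broadband signal: three tones plus a little noise
signal = (1.0 * np.cos(2 * np.pi * 500 * t)
          + 0.7 * np.cos(2 * np.pi * 1250 * t)
          + 0.4 * np.cos(2 * np.pi * 2100 * t)
          + 0.1 * rng.standard_normal(t.size))

S = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# keep only a narrow band around 1250 Hz and transform back to the time domain
target, half_width = 1250.0, 20.0
mask = np.abs(freqs - target) <= half_width
narrowband = np.fft.irfft(S * mask, n=t.size)

peak = freqs[np.argmax(np.abs(np.fft.rfft(narrowband)))]
print("dominant frequency after filtering:", peak, "Hz")   # ~1250 Hz
```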

There are times, of course, when we do not want to decompose a broadband signal into single-frequency components. A satellite reflecting a TV signal is a broadband point source. All we are interested in is receiving the broadband signal clearly, free of any other interfering sources. The direction of the satellite is known and the antenna is turned to face the satellite. Each location on the parabolic dish reflects the same signal. Because of its parabolic shape, the signals reflected off the dish and picked up at the focal point have exactly the same travel time from the satellite, so they combine coherently, to give us the desired TV signal.

33.12 The Laplace Transform and the Ozone Layer

In the farfield propagation example just considered, we found the measured data to be related to the desired object function by a Fourier transformation. The image reconstruction problem then became one of estimating a function from finitely many noisy values of its Fourier transform. In this section we consider an inverse problem involving the Laplace transform. The example is taken from Twomey's book [113].

33.12.1 The Laplace Transform

The Laplace transform of the function f(x) defined for 0 ≤ x < +∞ is the function

F(s) = ∫_0^{+∞} f(x) e^{−sx} dx.

33.12.2 Scattering of Ultraviolet Radiation

The sun emits ultraviolet (UV) radiation that enters the Earth's atmosphere at an angle θ0 that depends on the sun's position, and with intensity I(0). Let the x-axis be vertical, with x = 0 at the top of the atmosphere and x increasing as we move down to the Earth's surface, at x = X. The intensity at x is given by

I(x) = I(0) e^{−kx/cos θ0}.


Within the ozone layer, the amount of UV radiation scattered in the direction θ is given by

S(θ, θ0) I(0) e^{−kx/cos θ0} Δp,

where S(θ, θ0) is a known parameter, and Δp is the change in the pressure of the ozone within the infinitesimal layer [x, x + Δx], and so is proportional to the concentration of ozone within that layer.

33.12.3 Measuring the Scattered Intensity

The radiation scattered at the angle θ then travels to the ground, a distance of X − x, weakened along the way, and reaches the ground with intensity

S(θ, θ0) I(0) e^{−kx/cos θ0} e^{−k(X−x)/cos θ} Δp.

The total scattered intensity at angle θ is then a superposition of the intensities due to scattering at each of the thin layers, and is then

S(θ, θ0) I(0) e^{−kX/cos θ} ∫_0^X e^{−xβ} dp,

where

β = k[1/cos θ0 − 1/cos θ].

This superposition of intensity can then be written as

S(θ, θ0) I(0) e^{−kX/cos θ} ∫_0^X e^{−βx} p′(x) dx.

33.12.4 The Laplace Transform Data

Using integration by parts, we get

∫_0^X e^{−βx} p′(x) dx = p(X) e^{−βX} − p(0) + β ∫_0^X e^{−βx} p(x) dx.

Since p(0) = 0 and p(X) can be measured, our data is then the Laplace transform value

∫_0^{+∞} e^{−βx} p(x) dx;

note that we can replace the upper limit X with +∞ if we extend p(x) as zero beyond x = X.

The variable β depends on the two angles θ and θ0. We can alter θ as we measure, and θ0 changes as the sun moves relative to the earth. In this way we get values of the Laplace transform of p(x) for various values of β. The problem then is to recover p(x) from these values. Because the Laplace transform involves a smoothing of the function p(x), recovering p(x) from its Laplace transform is more ill-conditioned than is the Fourier transform inversion problem.
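One way to see this claim concretely, though it is not an argument made in the text, is to discretize both transforms and compare condition numbers: sampling e^{−βx} on modest grids of β and x already produces a matrix that is numerically singular, while a unitary DFT matrix has condition number one. The grid sizes and ranges below are arbitrary choices.

```python
# Sketch: condition numbers of a discretized Laplace transform versus a DFT matrix,
# illustrating why recovering p(x) from Laplace data is so ill-conditioned.
import numpy as np

n = 40
x = np.linspace(0.0, 10.0, n)        # grid for p(x)
beta = np.linspace(0.05, 2.0, n)     # available values of the Laplace variable
dx = x[1] - x[0]

laplace_matrix = np.exp(-np.outer(beta, x)) * dx   # row i approximates the transform at beta[i]
k = np.arange(n)
dft_matrix = np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)   # unitary DFT

print("condition number, discretized Laplace transform:", np.linalg.cond(laplace_matrix))
print("condition number, unitary DFT matrix:", np.linalg.cond(dft_matrix))
```

Any noise in the data is therefore amplified enormously if the Laplace transform is inverted directly, which is one reason regularized or iterative methods are needed.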


33.13 Summary

Our goal in this chapter has been to introduce the Fourier transform through the use of the example of farfield propagation. For a more detailed discussion of Fourier transforms and Fourier series, see [33]. As our example of farfield propagation shows, the Fourier transform arises naturally in remote sensing, and measured data is often related by Fourier transformation to what we really want. The theory also connects the Fourier transform to the important class of convolution filters, which are used to model various types of signal degradation, such as blurring and point-spreading, as well as the limitations on the aperture of the array of sensors.


Bibliography

[1] Agmon, S. (1954) The relaxation method for linear inequalities, Canadian Journal of Mathematics, 6, pp. 382–392.
[2] Anderson, A. and Kak, A. (1984) Simultaneous algebraic reconstruction technique (SART): a superior implementation of the ART algorithm, Ultrasonic Imaging, 6, pp. 81–94.
[3] Aubin, J.-P. (1993) Optima and Equilibria: An Introduction to Nonlinear Analysis, Springer-Verlag.
[4] Axelsson, O. (1994) Iterative Solution Methods. Cambridge, UK: Cambridge University Press.
[5] Baillon, J., and Haddad, G. (1977) Quelques proprietes des operateurs angle-bornes et n-cycliquement monotones, Israel J. of Mathematics, 26, pp. 137–150.
[6] Bauschke, H. (2001) Projection algorithms: results and open problems, in Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, Butnariu, D., Censor, Y. and Reich, S., editors, Elsevier Publ., pp. 11–22.
[7] Bauschke, H., and Borwein, J. (1996) On projection algorithms for solving convex feasibility problems, SIAM Review, 38 (3), pp. 367–426.
[8] Bauschke, H., and Borwein, J. (1997) “Legendre functions and the method of random Bregman projections.” Journal of Convex Analysis, 4, pp. 27–67.
[9] Bauschke, H., Borwein, J., and Lewis, A. (1997) The method of cyclic projections for closed convex sets in Hilbert space, Contemporary Mathematics: Recent Developments in Optimization Theory and Nonlinear Analysis, 204, American Mathematical Society, pp. 1–38.


[10] Bauschke, H., and Lewis, A. (1998) “Dykstra’s algorithm with Bregman projections: a convergence proof.” xxxxxx.
[11] Bertero, M., and Boccacci, P. (1998) Introduction to Inverse Problems in Imaging. Bristol, UK: Institute of Physics Publishing.
[12] Bertsekas, D.P. (1997) A new class of incremental gradient methods for least squares problems, SIAM J. Optim., 7, pp. 913–926.
[13] Borwein, J. and Lewis, A. (2000) Convex Analysis and Nonlinear Optimization. Canadian Mathematical Society Books in Mathematics, New York: Springer-Verlag.
[14] Bracewell, R.C. (1979) Image Reconstruction in Radio Astronomy, in [74], pp. 81–104.
[15] Bregman, L.M. (1967) “The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming.” USSR Computational Mathematics and Mathematical Physics, 7, pp. 200–217.
[16] Bregman, L., Censor, Y., and Reich, S. (1999) “Dykstra’s algorithm as the nonlinear extension of Bregman’s optimization method.” Journal of Convex Analysis, 6 (2), pp. 319–333.
[17] Brodzik, A. and Mooney, J. (1999) “Convex projections algorithm for restoration of limited-angle chromotomographic images.” Journal of the Optical Society of America A, 16 (2), pp. 246–257.
[18] Browne, J. and De Pierro, A. (1996) “A row-action alternative to the EM algorithm for maximizing likelihoods in emission tomography.” IEEE Trans. Med. Imag., 15, pp. 687–699.
[19] Byrne, C. (1993) “Iterative image reconstruction algorithms based on cross-entropy minimization.” IEEE Transactions on Image Processing, IP-2, pp. 96–103.
[20] Byrne, C. (1995) “Erratum and addendum to ‘Iterative image reconstruction algorithms based on cross-entropy minimization’.” IEEE Transactions on Image Processing, IP-4, pp. 225–226.
[21] Byrne, C. (1996) “Iterative reconstruction algorithms based on cross-entropy minimization.” in Image Models (and their Speech Model Cousins), S.E. Levinson and L. Shepp, editors, IMA Volumes in Mathematics and its Applications, Volume 80, pp. 1–11. New York: Springer-Verlag.


[22] Byrne, C. (1996) “Block-iterative methods for image reconstruction from projections.” IEEE Transactions on Image Processing, IP-5, pp. 792–794.
[23] Byrne, C. (1997) “Convergent block-iterative algorithms for image reconstruction from inconsistent data.” IEEE Transactions on Image Processing, IP-6, pp. 1296–1304.
[24] Byrne, C. (1998) “Accelerating the EMML algorithm and related iterative algorithms by rescaled block-iterative (RBI) methods.” IEEE Transactions on Image Processing, IP-7, pp. 100–109.
[25] Byrne, C. (1998) “Iterative deconvolution and deblurring with constraints”, Inverse Problems, 14, pp. 1455–1467.
[26] Byrne, C. (1999) “Iterative projection onto convex sets using multiple Bregman distances.” Inverse Problems, 15, pp. 1295–1313.
[27] Byrne, C. (2000) “Block-iterative interior point optimization methods for image reconstruction from limited data.” Inverse Problems, 16, pp. 1405–1419.
[28] Byrne, C. (2001) “Bregman-Legendre multidistance projection algorithms for convex feasibility and optimization.” in Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, Butnariu, D., Censor, Y., and Reich, S., editors, pp. 87–100. Amsterdam: Elsevier Publ.
[29] Byrne, C. (2001) “Likelihood maximization for list-mode emission tomographic image reconstruction.” IEEE Transactions on Medical Imaging, 20 (10), pp. 1084–1092.
[30] Byrne, C. (2002) “Iterative oblique projection onto convex sets and the split feasibility problem.” Inverse Problems, 18, pp. 441–453.
[31] Byrne, C. (2004) “A unified treatment of some iterative algorithms in signal processing and image reconstruction.” Inverse Problems, 20, pp. 103–120.
[32] Byrne, C. (2005) Choosing parameters in block-iterative or ordered-subset reconstruction algorithms, IEEE Transactions on Image Processing, 14 (3), pp. 321–327.
[33] Byrne, C. (2005) Signal Processing: A Mathematical Approach, AK Peters, Publ., Wellesley, MA.
[34] Byrne, C. (2005) “Feedback in Iterative Algorithms”, unpublished lecture notes.


[35] Byrne, C., and Ward, S. (2005) “Estimating the Largest Singular Value of a Sparse Matrix”, in preparation.
[36] Byrne, C. and Censor, Y. (2001) Proximity function minimization using multiple Bregman projections, with applications to split feasibility and Kullback-Leibler distance minimization, Annals of Operations Research, 105, pp. 77–98.
[37] Censor, Y. (1981) “Row-action methods for huge and sparse systems and their applications.” SIAM Review, 23, pp. 444–464.
[38] Censor, Y., Eggermont, P.P.B., and Gordon, D. (1983) “Strong underrelaxation in Kaczmarz’s method for inconsistent systems.” Numerische Mathematik, 41, pp. 83–92.
[39] Censor, Y. and Elfving, T. (1994) A multiprojection algorithm using Bregman projections in a product space, Numerical Algorithms, 8, pp. 221–239.
[40] Censor, Y., Elfving, T., Kopf, N., and Bortfeld, T. (2006) “The multiple-sets split feasibility problem and its application for inverse problems.” Inverse Problems, to appear.
[41] Censor, Y., and Reich, S. (1998) “The Dykstra algorithm for Bregman projections.” Communications in Applied Analysis, 2, pp. 323–339.
[42] Censor, Y. and Segman, J. (1987) “On block-iterative maximization.” J. of Information and Optimization Sciences, 8, pp. 275–291.
[43] Censor, Y. and Zenios, S.A. (1997) Parallel Optimization: Theory, Algorithms and Applications. New York: Oxford University Press.
[44] Chang, J.-H., Anderson, J.M.M., and Votaw, J.R. (2004) “Regularized image reconstruction algorithms for positron emission tomography.” IEEE Transactions on Medical Imaging, 23 (9), pp. 1165–1175.
[45] Cheney, W., and Goldstein, A. (1959) “Proximity maps for convex sets.” Proc. Am. Math. Soc., 10, pp. 448–450.
[46] Cimmino, G. (1938) “Calcolo approssimato per soluzioni dei sistemi di equazioni lineari.” La Ricerca Scientifica XVI, Series II, Anno IX, 1, pp. 326–333.
[47] Combettes, P. (1993) The foundations of set theoretic estimation, Proceedings of the IEEE, 81 (2), pp. 182–208.
[48] Combettes, P. (1996) The convex feasibility problem in image recovery, Advances in Imaging and Electron Physics, 95, pp. 155–270.


[49] Combettes, P., and Trussell, J. (1990) Method of successive projections for finding a common point of sets in a metric space, Journal of Optimization Theory and Applications, 67 (3), pp. 487–507.
[50] Combettes, P. (2000) “Fejer monotonicity in convex optimization.” in Encyclopedia of Optimization, C.A. Floudas and P. M. Pardalos, editors, Boston: Kluwer Publ.
[51] Csiszar, I. and Tusnady, G. (1984) “Information geometry and alternating minimization procedures.” Statistics and Decisions, Supp. 1, pp. 205–237.
[52] Csiszar, I. (1989) “A geometric interpretation of Darroch and Ratcliff’s generalized iterative scaling.” The Annals of Statistics, 17 (3), pp. 1409–1413.
[53] Csiszar, I. (1991) “Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems.” The Annals of Statistics, 19 (4), pp. 2032–2066.
[54] Darroch, J. and Ratcliff, D. (1972) “Generalized iterative scaling for log-linear models.” Annals of Mathematical Statistics, 43, pp. 1470–1480.
[55] Dax, A. (1990) “The convergence of linear stationary iterative processes for solving singular unstructured systems of linear equations,” SIAM Review, 32, pp. 611–635.
[56] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) “Maximum likelihood from incomplete data via the EM algorithm.” Journal of the Royal Statistical Society, Series B, 37, pp. 1–38.
[57] De Pierro, A. (1995) “A modified expectation maximization algorithm for penalized likelihood estimation in emission tomography.” IEEE Transactions on Medical Imaging, 14, pp. 132–137.
[58] De Pierro, A. and Iusem, A. (1990) “On the asymptotic behavior of some alternate smoothing series expansion iterative methods.” Linear Algebra and its Applications, 130, pp. 3–24.
[59] De Pierro, A., and Yamaguchi, M. (2001) “Fast EM-like methods for maximum ‘a posteriori’ estimates in emission tomography”, Transactions on Medical Imaging, 20 (4).
[60] Duda, R., Hart, P., and Stork, D. (2001) Pattern Classification, Wiley.
[61] Dugundji, J. (1970) Topology. Boston: Allyn and Bacon, Inc.


[62] Dykstra, R. (1983) “An algorithm for restricted least squares regression”, J. Amer. Statist. Assoc., 78 (384), pp. 837–842.
[63] Eggermont, P.P.B., Herman, G.T., and Lent, A. (1981) “Iterative algorithms for large partitioned linear systems, with applications to image reconstruction.” Linear Algebra and its Applications, 40, pp. 37–67.
[64] Elsner, L., Koltracht, L., and Neumann, M. (1992) “Convergence of sequential and asynchronous nonlinear paracontractions.” Numerische Mathematik, 62, pp. 305–319.
[65] Fessler, J., Ficaro, E., Clinthorne, N., and Lange, K. (1997) Grouped-coordinate ascent algorithms for penalized-likelihood transmission image reconstruction, IEEE Transactions on Medical Imaging, 16 (2), pp. 166–175.
[66] Fleming, W. (1965) Functions of Several Variables, Addison-Wesley Publ., Reading, MA.
[67] Geman, S., and Geman, D. (1984) “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images.” IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6, pp. 721–741.
[68] Gifford, H., King, M., de Vries, D., and Soares, E. (2000) “Channelized Hotelling and human observer correlation for lesion detection in hepatic SPECT imaging”, Journal of Nuclear Medicine, 41 (3), pp. 514–521.
[69] Golshtein, E., and Tretyakov, N. (1996) Modified Lagrangians and Monotone Maps in Optimization. New York: John Wiley and Sons, Inc.
[70] Gordon, R., Bender, R., and Herman, G.T. (1970) “Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and x-ray photography.” J. Theoret. Biol., 29, pp. 471–481.
[71] Green, P. (1990) “Bayesian reconstructions from emission tomography data using a modified EM algorithm.” IEEE Transactions on Medical Imaging, 9, pp. 84–93.
[72] Gubin, L.G., Polyak, B.T. and Raik, E.V. (1967) The method of projections for finding the common point of convex sets, USSR Computational Mathematics and Mathematical Physics, 7, pp. 1–24.
[73] Hebert, T. and Leahy, R. (1989) “A generalized EM algorithm for 3-D Bayesian reconstruction from Poisson data using Gibbs priors.” IEEE Transactions on Medical Imaging, 8, pp. 194–202.


[74] Herman, G.T. (ed.) (1979) “Image Reconstruction from Projections”, Topics in Applied Physics, Vol. 32, Springer-Verlag, Berlin.
[75] Herman, G.T., and Natterer, F. (eds.) “Mathematical Aspects of Computerized Tomography”, Lecture Notes in Medical Informatics, Vol. 8, Springer-Verlag, Berlin.
[76] Herman, G.T., Censor, Y., Gordon, D., and Lewitt, R. (1985) Comment (on the paper [114]), Journal of the American Statistical Association, 80, pp. 22–25.
[77] Herman, G. T. and Meyer, L. (1993) “Algebraic reconstruction techniques can be made computationally efficient.” IEEE Transactions on Medical Imaging, 12, pp. 600–609.
[78] Herman, G. T. (1999) private communication.
[79] Hildreth, C. (1957) A quadratic programming procedure, Naval Research Logistics Quarterly, 4, pp. 79–85. Erratum, ibid., p. 361.
[80] Holte, S., Schmidlin, P., Linden, A., Rosenqvist, G. and Eriksson, L. (1990) “Iterative image reconstruction for positron emission tomography: a study of convergence and quantitation problems.” IEEE Transactions on Nuclear Science, 37, pp. 629–635.
[81] Hudson, H.M. and Larkin, R.S. (1994) “Accelerated image reconstruction using ordered subsets of projection data.” IEEE Transactions on Medical Imaging, 13, pp. 601–609.
[82] Hutton, B., Kyme, A., Lau, Y., Skerrett, D., and Fulton, R. (2002) “A hybrid 3-D reconstruction/registration algorithm for correction of head motion in emission tomography.” IEEE Transactions on Nuclear Science, 49 (1), pp. 188–194.
[83] Kaczmarz, S. (1937) “Angenaherte Auflosung von Systemen linearer Gleichungen.” Bulletin de l’Academie Polonaise des Sciences et Lettres, A35, pp. 355–357.
[84] Kak, A., and Slaney, M. (2001) “Principles of Computerized Tomographic Imaging”, SIAM, Philadelphia, PA.
[85] Koltracht, L., and Lancaster, P. (1990) “Constraining strategies for linear iterative processes.” IMA J. Numer. Anal., 10, pp. 555–567.
[86] Kullback, S. and Leibler, R. (1951) “On information and sufficiency.” Annals of Mathematical Statistics, 22, pp. 79–86.
[87] Landweber, L. (1951) “An iterative formula for Fredholm integral equations of the first kind.” Amer. J. of Math., 73, pp. 615–624.


[88] Lange, K. and Carson, R. (1984) “EM reconstruction algorithms for emission and transmission tomography.” Journal of Computer Assisted Tomography, 8, pp. 306–316.
[89] Lange, K., Bahn, M. and Little, R. (1987) “A theoretical study of some maximum likelihood algorithms for emission and transmission tomography.” IEEE Trans. Med. Imag., MI-6 (2), pp. 106–114.
[90] Leahy, R. and Byrne, C. (2000) “Guest editorial: Recent development in iterative image reconstruction for PET and SPECT.” IEEE Trans. Med. Imag., 19, pp. 257–260.
[91] Leahy, R., Hebert, T., and Lee, R. (1989) “Applications of Markov random field models in medical imaging.” in Proceedings of the Conference on Information Processing in Medical Imaging, Lawrence-Berkeley Laboratory, Berkeley, CA.
[92] Luenberger, D. (1969) Optimization by Vector Space Methods. New York: John Wiley and Sons, Inc.
[93] Levitan, E. and Herman, G. (1987) “A maximum a posteriori probability expectation maximization algorithm for image reconstruction in emission tomography.” IEEE Transactions on Medical Imaging, 6, pp. 185–192.
[94] Mann, W. (1953) “Mean value methods in iteration.” Proc. Amer. Math. Soc., 4, pp. 506–510.
[95] McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Extensions. New York: John Wiley and Sons, Inc.
[96] Meidunas, E. (2001) Re-scaled Block Iterative Expectation Maximization Maximum Likelihood (RBI-EMML) Abundance Estimation and Sub-pixel Material Identification in Hyperspectral Imagery, MS thesis, Department of Electrical Engineering, University of Massachusetts Lowell.
[97] Mooney, J., Vickers, V., An, M., and Brodzik, A. (1997) “High-throughput hyperspectral infrared camera.” Journal of the Optical Society of America A, 14 (11), pp. 2951–2961.
[98] Motzkin, T., and Schoenberg, I. (1954) The relaxation method for linear inequalities, Canadian Journal of Mathematics, 6, pp. 393–404.
[99] Narayanan, M., Byrne, C. and King, M. (2001) “An interior point iterative maximum-likelihood reconstruction algorithm incorporating upper and lower bounds with application to SPECT transmission imaging.” IEEE Transactions on Medical Imaging, TMI-20 (4), pp. 342–353.


[100] Nash, S. and Sofer, A. (1996) Linear and Nonlinear Programming. New York: McGraw-Hill.
[101] Natterer, F. (1986) Mathematics of Computed Tomography. New York: John Wiley and Sons, Inc.
[102] Natterer, F., and Wubbeling, F. (2001) Mathematical Methods in Image Reconstruction. Philadelphia, PA: SIAM Publ.
[103] Peressini, A., Sullivan, F., and Uhl, J. (1988) The Mathematics of Nonlinear Programming. Berlin: Springer-Verlag.
[104] Pretorius, P., King, M., Pan, T-S, deVries, D., Glick, S., and Byrne, C. (1998) Reducing the influence of the partial volume effect on SPECT activity quantitation with 3D modelling of spatial resolution in iterative reconstruction, Phys. Med. Biol., 43, pp. 407–420.
[105] Rockafellar, R. (1970) Convex Analysis. Princeton, NJ: Princeton University Press.
[106] Rockmore, A., and Macovski, A. (1976) A maximum likelihood approach to emission image reconstruction from projections, IEEE Transactions on Nuclear Science, NS-23, pp. 1428–1432.
[107] Schmidlin, P. (1972) “Iterative separation of sections in tomographic scintigrams.” Nucl. Med., 15 (1).
[108] Schroeder, M. (1991) Fractals, Chaos, Power Laws, W.H. Freeman, New York.
[109] Shepp, L., and Vardi, Y. (1982) Maximum likelihood reconstruction for emission tomography, IEEE Transactions on Medical Imaging, MI-1, pp. 113–122.
[110] Soares, E., Byrne, C., Glick, S., Appledorn, R., and King, M. (1993) Implementation and evaluation of an analytic solution to the photon attenuation and nonstationary resolution reconstruction problem in SPECT, IEEE Transactions on Nuclear Science, 40 (4), pp. 1231–1237.
[111] Stark, H. and Yang, Y. (1998) Vector Space Projections: A Numerical Approach to Signal and Image Processing, Neural Nets and Optics, John Wiley and Sons, New York.
[112] Tanabe, K. (1971) “Projection method for solving a singular system of linear equations and its applications.” Numer. Math., 17, pp. 203–214.


[113] Twomey, S. (1996) Introduction to the Mathematics of Inversion in Remote Sensing and Indirect Measurement. New York: Dover Publ.
[114] Vardi, Y., Shepp, L.A. and Kaufman, L. (1985) “A statistical model for positron emission tomography.” Journal of the American Statistical Association, 80, pp. 8–20.
[115] Wernick, M. and Aarsvold, J., editors (2004) Emission Tomography: The Fundamentals of PET and SPECT. San Diego: Elsevier Academic Press.
[116] Yang, Q. (2004) “The relaxed CQ algorithm solving the split feasibility problem.” Inverse Problems, 20, pp. 1261–1266.
[117] Youla, D.C. (1987) “Mathematical theory of image restoration by the method of convex projections.” in Image Recovery: Theory and Applications, pp. 29–78, Stark, H., editor (1987) Orlando, FL: Academic Press.
[118] Youla, D. (1978) Generalized image restoration by the method of alternating projections, IEEE Transactions on Circuits and Systems, CAS-25 (9), pp. 694–702.


Index

ρ(S), 231
affine linear, 36
affine linear operator, 11
affine operator, 11
Agmon-Motzkin-Schoenberg algorithm, 192
algebraic reconstruction technique, 15, 81
alternating minimization, 101
AMS algorithm, 46, 192
array aperture, 273
ART, 15, 61, 192
asymptotic fixed point, 54
averaged operator, 11, 31
backpropagation-of-error methods, 149
band-limited, 214
basic feasible solution, 168, 187
basic variables, 223
basin of attraction, 8
best linear unbiased estimator, 242
bi-section method, 2
Bjorck-Elfving equations, 75
BLUE, 242
Bregman function, 171
Bregman Inequality, 54, 239
Bregman paracontraction, 54
Bregman projection, 22, 169
Bregman's Inequality, 172
broadband signal, 265
canonical form, 185
Cauchy sequence, 222
Cauchy's Inequality, 220
Cauchy-Schwarz Inequality, 220
Central Slice Theorem, 253
CFP, 165
channelized Hotelling observer, 246
Chaos Game, 9
Cimmino's algorithm, 15
classification, 241
complementary slackness condition, 186
complete metric space, 230
complete space, 223
condition number, 71
conjugate gradient method, 154
conjugate set, 153
convergent sequence, 229
convex feasibility problem, 20, 35, 165
convex function, 158
convex function of several variables, 161
convolution, 271
convolution filter, 271
CQ algorithm, 21
CSP, 45
cyclic subgradient projection method, 45
DART, 66
data-extrapolation methods, 214
detection, 241
DFT, 243
diagonalizable matrix, 13, 48, 236
differentiable function of several variables, 160
Dirac delta, 270


direction of unboundedness, 168
discrete Fourier transform, 243
discrimination, 241
double ART, 66
dual problem, 185
duality gap, 186
Dykstra's algorithm, 21, 168
EKN Theorem, 30, 47
EMML algorithm, 17
entropic projection, 54
estimation, 241
Euclidean distance, 9, 220
Euclidean length, 9, 220
Euclidean norm, 9, 220
expectation maximization maximum likelihood, 17
extreme point, 168
farfield assumption, 266
feasible set, 168
Fermi-Dirac generalized entropies, 205
firmly non-expansive operator, 12
Fisher linear discriminant, 249
fixed point, 3, 5
Fourier Inversion Formula, 280
Fourier inversion formula, 269
Fourier transform, 265, 268
Fourier-transform pair, 269
frequency-domain extrapolation, 272
frequency-response function, 271
full-cycle ART, 61
full-rank property, 67, 123
gamma distribution, 127
Gauss-Seidel method, 76
Gerschgorin's theorem, 236
gradient descent method, 13
Hermitian operator, 11
Hotelling linear discriminant, 246
Hotelling observer, 246
hyperspectral imaging, 262
identification, 241
induced matrix norm, 232
interior-point methods, 147
inverse Sir Pinski Game, 9
Jacobi overrelaxation, 78, 79
Jacobi's method, 76, 78
JOR, 78
KL distance, 54
Landweber algorithm, 16, 203
least squares ART, 152
least squares solution, 150
linear operator, 11
linear programming, 185
linear sensor array, 273
Lipschitz continuity, 10, 28
Lipschitz function, 157
Lipschitz function of several variables, 161
LS-ART, 152
magnetic-resonance imaging, 257
MART, 18
maximum a posteriori, 126
minimum-norm solution, 15
modulation transfer function, 271
monotone iteration, 7
MRI, 257
multiple-distance SGP, 22
multiplicative ART, 18
narrowband signal, 265, 266
Newton-Raphson algorithm, 148
Newton-Raphson iteration, 150
non-expansive operator, 10, 31
norm, 230
normal equations, 75
Nyquist spacing, 277
optical transfer function, 271
orthogonal projection, 12
paracontraction, 12, 30
