
    Contents

Preface

Symbols and Acronyms

1  The Linear Data Fitting Problem
   1.1  Parameter estimation, data approximation
   1.2  Formulation of the data fitting problem
   1.3  Maximum likelihood estimation
   1.4  The residuals and their properties
   1.5  Robust regression

2  Linear Least Squares Problem
   2.1  Linear least squares problem formulation
   2.2  The QR factorization and its role
   2.3  Permuted QR factorization

3  Analysis of Least Squares Problems
   3.1  The pseudoinverse
   3.2  The singular value decomposition
   3.3  Generalized singular value decomposition
   3.4  Condition number and column scaling
   3.5  Perturbation analysis

4  Direct Methods for Full-Rank Problems
   4.1  Normal equations
   4.2  LU factorization
   4.3  QR factorization
   4.4  Modifying least squares problems
   4.5  Iterative refinement
   4.6  Stability and condition number estimation
   4.7  Comparison of the methods


5  Rank-Deficient LLSQ: Direct Methods
   5.1  Numerical rank
   5.2  Peters-Wilkinson LU factorization
   5.3  QR factorization with column permutations
   5.4  UTV and VSV decompositions
   5.5  Bidiagonalization
   5.6  SVD computations

6  Methods for Large-Scale Problems
   6.1  Iterative versus direct methods
   6.2  Classical stationary methods
   6.3  Non-stationary methods, Krylov methods
   6.4  Practicalities: preconditioning and stopping criteria
   6.5  Block methods

7  Additional Topics in LLSQ Problems
   7.1  Constrained linear least squares problems
   7.2  Missing data problems
   7.3  Total least squares (TLS)
   7.4  Convex optimization
   7.5  Compressed sensing

8  Nonlinear Least Squares Problems
   8.1  Introduction
   8.2  Unconstrained problems
   8.3  Optimality conditions for constrained problems
   8.4  Separable nonlinear least squares problems
   8.5  Multiobjective optimization

9  Algorithms for Solving Nonlinear LSQP
   9.1  Newton's method
   9.2  The Gauss-Newton method
   9.3  The Levenberg-Marquardt method
   9.4  Additional considerations and software
   9.5  Iteratively reweighted LSQ algorithms for robust data fitting problems
   9.6  Separable NLLSQ problems: variable projection algorithm
   9.7  Block methods for large-scale problems


10  Ill-Conditioned Problems
    10.1  Characterization
    10.2  Regularization methods
    10.3  Parameter selection techniques
    10.4  Extensions of Tikhonov regularization
    10.5  Ill-conditioned NLSQ problems

11  Linear Least Squares Applications
    11.1  Splines in approximation
    11.2  Global temperatures data fitting
    11.3  Geological surface modeling

12  Nonlinear Least Squares Applications
    12.1  Neural networks training
    12.2  Response surfaces, surrogates or proxies
    12.3  Optimal design of a supersonic aircraft
    12.4  NMR spectroscopy
    12.5  Piezoelectric crystal identification
    12.6  Travel time inversion of seismic data

Appendices

A  Sensitivity Analysis
   A.1  Floating-point arithmetic
   A.2  Stability, conditioning and accuracy

B  Linear Algebra Background
   B.1  Norms
   B.2  Condition number
   B.3  Orthogonality
   B.4  Some additional matrix properties

C  Advanced Calculus Background
   C.1  Convergence rates
   C.2  Multivariable calculus

D  Statistics
   D.1  Definitions
   D.2  Hypothesis testing

References

Index


    Preface

This book surveys basic modern techniques for the numerical solution of linear and nonlinear least squares problems and introduces the treatment of large and ill-conditioned problems. The theory is extensively illustrated with examples from engineering, environmental sciences, geophysics and other application areas.

In addition to the treatment of the numerical aspects of least squares problems, we introduce some important topics from the area of regression analysis in statistics, which can help to motivate, understand and evaluate the computed least squares solutions. The inclusion of these topics is one aspect that distinguishes the present book from other books on the subject.

The presentation of the material is designed to give an overview, with the goal of helping the reader decide which method would be appropriate for a given problem, point toward available algorithms/software and, if necessary, help in modifying the available tools to adapt for a given application. The emphasis is therefore on the properties of the different algorithms, and few proofs are presented; the reader is instead referred to the appropriate articles/books. Unfortunately, several important topics had to be left out, among them, direct methods for sparse problems.

The content is geared toward scientists and engineers who must analyze and solve least squares problems in their fields. It can be used as course material for an advanced undergraduate or graduate course in the sciences and engineering, presupposing a working knowledge of linear algebra and basic statistics. It is written mostly in a terse style in order to provide a quick introduction to the subject, while treating some of the not so well-known topics in more depth. This in fact presents the reader with an opportunity to verify the understanding of the material by completing or providing the proofs without checking the references.

The least squares problem is known under different names in different disciplines. One of our aims is to help bridge the communication gap between the statistics and the numerical analysis literature on the subject, often due to the use of different terminology, such as l2-approximation, regularization, regression analysis, parameter estimation, filtering, process identification, etc.

Least squares methods have been with us for many years, since Gauss invented and used them in his surveying activities [92]. In 1965, the paper by G. H. Golub [102] on using the QR factorization, and later his development of a stable algorithm for the computation of the SVD, started a renewed interest in the subject in the, by then, changed work environment of computers.

Thanks also to, among many others, Å. Björck, L. Eldén, C. C. Paige, M. A. Saunders, G. W. Stewart, S. Van Huffel and P.-Å. Wedin, the topic is now available in a robust, algorithmic and well-founded form.

There are many books partially or completely dedicated to linear and nonlinear least squares. The first and one of the fundamental references for linear problems is Lawson and Hanson's monograph [162]. Besides summarizing the state of the art at the time of its publication, it highlighted the practical aspects of solving least squares problems. Bates and Watts [11] have an early comprehensive book focused on the nonlinear least squares problem with a strong statistical approach. Björck's book [22] contains a very careful and comprehensive survey of numerical methods for both linear and nonlinear problems, including the treatment of large, sparse problems. Golub and Van Loan's Matrix Computations [116] includes several chapters on different aspects of least squares solution and on total least squares. The total least squares problem, known in statistics as latent root regression, is discussed in the book by S. Van Huffel and J. Vandewalle [255]. Seber and Wild [239] consider exhaustively all aspects of nonlinear least squares estimation and modeling. Although it is a general treatise on optimization, Nocedal and Wright's book [183] includes a very clear chapter on nonlinear least squares. Additional material can be found in [23, 70, 140, 248, 258, 269].

    Acknowledgements

We would like to acknowledge the help of Michael Saunders (iCME, Stanford University), who read carefully the whole manuscript and made a myriad of observations and corrections that have greatly improved the final product.

Per Christian Hansen would like to thank several colleagues from DTU Informatics who assisted with the statistical aspects.

Godela Scherer gives thanks for all the support at the Department of Mathematics and Statistics, University of Reading, where she was a visiting research fellow while working on this book. In particular, she would like to thank Professor Mike J. Baines for numerous inspiring discussions.

Victor Pereyra acknowledges Weidlinger Associates Inc. and most especially David Vaughan and Howard Levine, for their unflagging support and for letting him keep his office and access to computing facilities after retirement.

Special thanks are due to the professional handling of the manuscript by the publishers, and more specifically to the Executive Editor Vincent Burke and the Production Editor Andre Barnett.

Prior to his untimely death in November 2007, Professor Gene Golub had been an integral part of this project team. Although the book has changed significantly since then, it has greatly benefited from his insight and knowledge. He was an inspiring mentor and great friend, and we miss him dearly.


    Symbols and Acronyms

Symbol                  Represents
A                       m x n matrix
A†, A⁻, A^T             pseudoinverse, generalized inverse and transpose of A
b                       right-hand side, length m
cond(·)                 condition number of matrix in l2-norm
Cov(·)                  covariance matrix
diag(·)                 diagonal matrix
e_i                     noise component in data
e                       vector of noise, length m
e_i                     canonical unit vector
ε_M                     machine precision
E(·)                    expected value
f_j(t)                  model basis function
Γ(t)                    pure-data function
M(x, t)                 fitting model
N                       normal (or Gaussian) distribution
null(A)                 null space of A
p                       degree of polynomial
P, P_i, P_x             probability
P_X                     projection onto space X
Π                       permutation matrix
Q                       m x m orthogonal matrix, partitioned as Q = (Q1 Q2)
r = r(A)                rank of matrix A
r, r*                   residual vector, least squares residual vector
r_i                     residual for ith data
range(A)                range of matrix A
R, R1                   m x n and n x n upper triangular matrix
span{w_1, . . . , w_p}  subspace generated by the vectors
Σ                       diagonal SVD matrix


Symbol                  Represents
σ, σ_i                  standard deviation
σ_i                     singular value
t                       independent variable in data fitting problem
t_i                     abscissa in data fitting problem
U, V                    m x m and n x n left and right SVD matrices
u_i, v_i                left and right singular vectors, length m and n respectively
W                       m x m diagonal weight matrix
w_i                     weight in weighted least squares problem
x, x*                   vector of unknowns, least squares solution, length n
x*_min                  minimum-norm LSQ solution
x_B, x_TLS              basic LSQ solution and total least squares solution
x_i                     coefficient in a linear fitting model
y                       vector of data in a data fitting problem, length m
y_i                     data in data fitting problem
|| · ||_2               2-norm, ||x||_2 = (x_1^2 + · · · + x_n^2)^{1/2}
Ã                       perturbed version of A

Acronym                 Name
CG                      conjugate gradient
CGLS                    conjugate gradient for LSQ
FG                      fast Givens
GCV                     generalized cross validation
G-N                     Gauss-Newton
GS                      Gram-Schmidt factorization
GSVD                    generalized singular value decomposition
LASVD                   SVD for large, sparse matrices
L-M                     Levenberg-Marquardt
LP                      linear prediction
LSE                     equality constrained LSQ


Acronym                 Name
LSI                     inequality constrained LSQ
LSQI                    quadratically constrained LSQ
LSQ                     least squares
LSQR                    Paige-Saunders algorithm
LU                      LU factorization
MGS                     modified Gram-Schmidt
NLLSQ                   nonlinear least squares
NMR                     nuclear magnetic resonance
NN                      neural network
QR                      QR decomposition
RMS                     root mean square
RRQR                    rank revealing QR decomposition
SVD                     singular value decomposition
SNLLSQ                  separable nonlinear least squares
TLS                     total least squares problem
TSVD                    truncated singular value decomposition
UTV                     UTV decomposition
VARPRO                  variable projection algorithm, Netlib version
VP                      variable projection


    Chapter 1

The Linear Data Fitting Problem

This chapter gives an introduction to the linear data fitting problem: how it is defined, its mathematical aspects and how it is analyzed. We also give important statistical background that provides insight into the data fitting problem. Anyone with more interest in the subject is encouraged to consult the pedagogical expositions by Bevington [15], Rust [229], Strutz [249] and van den Bos [258].

We start with a couple of simple examples that introduce the basic concepts of data fitting. Then we move on to a more formal definition, and we discuss some statistical aspects. Throughout the first chapters of this book we will return to these data fitting problems in order to illustrate the ensemble of numerical methods and techniques available to solve them.

1.1 Parameter estimation, data approximation

Example 1. Parameter estimation. In food-quality analysis, the amount and mobility of water in meat has been shown to affect quality attributes like appearance, texture and storage stability. The water contents can be measured by means of nuclear magnetic resonance (NMR) techniques, in which the measured signal reflects the amount and properties of different types of water environments in the meat. Here we consider a simplified example involving frozen cod, where the ideal time signal Γ(t) from NMR is a sum of two damped exponentials plus a constant background,

Γ(t) = x_1 e^{-λ_1 t} + x_2 e^{-λ_2 t} + x_3,   λ_1, λ_2 > 0.

In this example we assume that we know the parameters λ_1 and λ_2 that control the decay of the two exponential components. In practice we do not measure this pure signal, but rather a noisy realization of it, as shown in Figure 1.1.1.

Figure 1.1.1: Noisy measurements of the time signal Γ(t) from NMR, for the example with frozen cod meat.

The parameters λ_1 = 27 s^{-1} and λ_2 = 8 s^{-1} characterize two different types of proton environments, responsible for two different water mobilities. The amplitudes x_1 and x_2 are proportional to the amount of water contained in the two kinds of proton environments. The constant x_3 accounts for an undesired background (bias) in the measurements. Thus, there are three unknown parameters in this model, namely, x_1, x_2 and x_3. The goal of data fitting in relation to this problem is to use the measured data to estimate the three unknown parameters and then compute the different kinds of water contents in the meat sample. The actual fit is presented in Figure 1.2.1.

In this example we use the technique of data fitting for the purpose of estimating unknown parameters in a mathematical model from measured data. The model is dictated by the physical or other laws that describe the data.

Example 2. Data approximation. We are given measurements of air pollution, in the form of the concentration of NO, over a period of 24 hours, on a busy street in a major city. Since the NO concentration is mainly due to the cars, it has maximum values in the morning and in the afternoon, when the traffic is most intense. The data is shown in Table 1.1 and the plot in Figure 1.2.2.

 t_i   y_i      t_i   y_i      t_i   y_i      t_i   y_i      t_i   y_i
  0   110.49     5    29.37    10   294.75    15   245.04    20   216.73
  1    73.72     6    74.74    11   253.78    16   286.74    21   185.78
  2    23.39     7   117.02    12   250.48    17   304.78    22   171.19
  3    17.11     8   298.04    13   239.48    18   288.76    23   171.73
  4    20.31     9   348.13    14   236.52    19   247.11    24   164.05

Table 1.1: Measurements of NO concentration y_i as a function of time t_i. The units of y_i and t_i are μg/m^3 and hours, respectively.

For further analysis of the air pollution we need to fit a smooth curve to the measurements, so that we can compute the concentration at an arbitrary time between 0 and 24 hours. For example, we can use a low-degree polynomial to model the data, i.e., we assume that the NO concentration can be approximated by

f(t) = x_1 t^p + x_2 t^{p-1} + · · · + x_p t + x_{p+1},

where t is the time, p is the degree of the polynomial and x_1, x_2, . . . , x_{p+1} are the unknown coefficients in the polynomial. A better model, however, since the data repeats every day, would use periodic functions:

f(t) = x_1 + x_2 sin(ω t) + x_3 cos(ω t) + x_4 sin(2ω t) + x_5 cos(2ω t) + · · ·

where ω = 2π/24 corresponds to the 24-hour period. Again, x_1, x_2, . . . are the unknown coefficients. The goal of data fitting in relation to this problem is to estimate the coefficients x_1, x_2, . . . , such that we can evaluate the function f(t) for any argument t. At the same time we want to suppress the influence of errors present in the data.

In this example we use the technique of data fitting for the purpose of approximating measured discrete data: we fit a model to given data in order to be able to compute smoothed data for any value of the independent variable in the model. We are free to choose the model, as long as it gives an adequate fit to the data.

Both examples illustrate that we are given data with measurement errors and that we want to fit a model to these data that captures their overall behavior without being too sensitive to the errors. The difference between the two examples is that in the first case the model arises from a physical theory, while in the second case it is an arbitrary continuous approximation to a set of discrete data.

Data fitting is distinctly different from the problem of interpolation, where we seek a model (a function f(t)) that interpolates the given data, i.e., satisfies f(t_i) = y_i for all the data points. We are not interested in interpolation (which is not suited for noisy data); rather, we want to approximate the noisy data with a parametric model that is either given or that we can choose, in such a way that the result is not too sensitive to the noise. In this data fitting approach there is redundant data, i.e., more data than unknown parameters, which also helps to decrease the uncertainty in the parameters of the model. See Example 15 in the next chapter for a justification of this.

1.2 Formulation of the data fitting problem

Let us now give a precise definition of the data fitting problem. We assume that we are given m data points

(t_1, y_1), (t_2, y_2), . . . , (t_m, y_m),

    which can be described by the relation

y_i = Γ(t_i) + e_i,   i = 1, 2, . . . , m.   (1.2.1)

The function Γ(t), which we call the pure-data function, describes the noise-free data (it may be unknown, or given by the application), while e_1, e_2, . . . , e_m are the data errors (they are unknown, but we may have some statistical information about them). The data errors, also referred to as noise, represent measurement errors as well as random variations in the physical process that generates the data. Without loss of generality we can assume that the abscissas t_i appear in non-decreasing order, i.e., t_1 ≤ t_2 ≤ · · · ≤ t_m.

In data fitting we wish to compute an approximation to Γ(t), typically in the interval [t_1, t_m]. The approximation is given by the fitting model M(x, t), where the vector x = (x_1, x_2, . . . , x_n)^T contains n parameters that characterize the model and are to be determined from the given noisy data. In the linear data fitting problem we always have a model of the form

Linear fitting model:   M(x, t) = Σ_{j=1}^{n} x_j f_j(t).   (1.2.2)

The functions f_j(t) are called the model basis functions, and the number n (the order of the fit) should preferably be smaller than the number m of data points. A notable modern exception is related to the so-called compressed sensing, which we discuss briefly in Section 7.5.

The form of the function M(x, t), i.e., the choice of basis functions, depends on the precise goal of the data fitting. These functions may be given by the underlying mathematical model that describes the data (in which case M(x, t) is often equal to, or an approximation to, the pure-data function Γ(t)), or the basis functions may be chosen arbitrarily among all functions that give the desired approximation and allow for stable numerical computations.


The method of least squares (LSQ) is a standard technique for determining the unknown parameters in the fitting model. The least squares fit is defined as follows. We introduce the residual r_i associated with the data points as

r_i = y_i - M(x, t_i),   i = 1, 2, . . . , m,

and we note that each residual is a function of the parameter vector x, i.e., r_i = r_i(x). A least squares fit is a choice of the parameter vector x that minimizes the sum-of-squares of the residuals:

LSQ fit:   min_x Σ_{i=1}^{m} r_i(x)^2 = min_x Σ_{i=1}^{m} (y_i - M(x, t_i))^2.   (1.2.3)

In the next chapter we shall describe in which circumstances the least squares fit is unique, and in the following chapters we shall describe a number of efficient computational methods for obtaining the least squares parameter vector x.

We note in passing that there are other related criteria used in data fitting; for example, one could replace the sum-of-squares in (1.2.3) with the sum-of-absolute-values:

min_x Σ_{i=1}^{m} |r_i(x)| = min_x Σ_{i=1}^{m} |y_i - M(x, t_i)|.   (1.2.4)

Below we shall use a statistical perspective to describe when these two choices are appropriate. However, we emphasize that the book focuses on the least squares fit.

In order to obtain a better understanding of the least squares data fitting problem we take a closer look at the residuals, which we can write as

r_i = y_i - M(x, t_i) = y_i - Γ(t_i) + Γ(t_i) - M(x, t_i)
    = e_i + ( Γ(t_i) - M(x, t_i) ),   i = 1, 2, . . . , m.   (1.2.5)

We see that the ith residual consists of two components: the data error e_i comes from the measurements, while the approximation error Γ(t_i) - M(x, t_i) is due to the discrepancy between the pure-data function and the computed fitting model. We emphasize that even if Γ(t) and M(x, t) have the same form, there is no guarantee that the estimated parameters x used in M(x, t) will be identical to those underlying the pure-data function Γ(t). At any rate, we see from this dichotomy that a good fitting model M(x, t) is one for which the approximation errors are of the same size as the data errors.


Underlying the least squares formulation in (1.2.3) are the assumptions that the data and the errors are independent and that the errors are white noise. The latter means that all data errors are uncorrelated and of the same size, or, in more precise statistical terms, that the errors e_i have mean zero and identical variance: E(e_i) = 0 and E(e_i^2) = σ^2 for i = 1, 2, . . . , m (where σ is the standard deviation of the errors).

This ideal situation is not always the case in practice! Hence, we also need to consider the more general case where the standard deviation depends on the index i, i.e.,

E(e_i) = 0,   E(e_i^2) = σ_i^2,   i = 1, 2, . . . , m,

where σ_i is the standard deviation of e_i. In this case, the maximum likelihood principle in statistics (see Section 1.3) tells us that we should minimize the weighted residuals, with weights equal to the reciprocals of the standard deviations:

min_x Σ_{i=1}^{m} ( r_i(x) / σ_i )^2 = min_x Σ_{i=1}^{m} ( (y_i - M(x, t_i)) / σ_i )^2.   (1.2.6)

    Now consider the expected value of the weighted sum-of-squares:

E[ Σ_{i=1}^{m} ( r_i(x) / σ_i )^2 ] = Σ_{i=1}^{m} E[ r_i(x)^2 / σ_i^2 ]
    = Σ_{i=1}^{m} E[ e_i^2 / σ_i^2 ] + Σ_{i=1}^{m} E[ ( Γ(t_i) - M(x, t_i) )^2 / σ_i^2 ]
    = m + Σ_{i=1}^{m} E[ ( Γ(t_i) - M(x, t_i) )^2 ] / σ_i^2,

where we have used that E(e_i) = 0 and E(e_i^2) = σ_i^2. The consequence of this relation is the intuitive result that we can allow the expected value of the approximation errors to be larger for those data (t_i, y_i) that have larger standard deviations (i.e., larger errors). Example 4 illustrates the usefulness of this approach. See Chapter 3 in [249] for a thorough discussion on how to estimate weights for a given data set.

We are now ready to state the least squares data fitting problem in terms of matrix-vector notation. We define the matrix A ∈ R^{m x n} and the vectors y, r ∈ R^m as follows,

A = [ f_1(t_1)  f_2(t_1)  · · ·  f_n(t_1) ]
    [ f_1(t_2)  f_2(t_2)  · · ·  f_n(t_2) ]
    [    ...       ...              ...   ]
    [ f_1(t_m)  f_2(t_m)  · · ·  f_n(t_m) ]

y = (y_1, y_2, . . . , y_m)^T,   r = (r_1, r_2, . . . , r_m)^T,


i.e., y is the vector of observations, r is the vector of residuals, and the matrix A is constructed such that the jth column is the jth model basis function sampled at the abscissas t_1, t_2, . . . , t_m. Then it is easy to see that for the un-weighted data fitting problem we have the relations

r = y - A x   and   Σ_{i=1}^{m} r_i(x)^2 = ||r||_2^2 = ||y - A x||_2^2.

    Similarly, for the weighted problem we have

Σ_{i=1}^{m} ( r_i(x) / σ_i )^2 = || W (y - A x) ||_2^2,

    with the weighting matrix and weights

W = diag(w_1, . . . , w_m),   w_i = 1/σ_i,   i = 1, 2, . . . , m.

In both cases, the computation of the coefficients in the least squares fit is identical to the solution of a linear least squares problem for x. Throughout the book we will study these least squares problems in detail and give efficient computational algorithms to solve them.
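As a concrete illustration of this matrix formulation, the following sketch (not from the book) builds A column by column from the basis functions and solves the un-weighted or weighted problem with NumPy; the basis functions and data in the usage example are placeholders.

    import numpy as np

    def lsq_fit(basis, t, y, sigma=None):
        """Solve the (optionally weighted) linear LSQ data fitting problem.

        basis : list of functions f_j(t), so that M(x,t) = sum_j x_j f_j(t)
        t, y  : abscissas and data values (1-D arrays of length m)
        sigma : standard deviations of the data errors; if given, rows are
                scaled by the weights w_i = 1/sigma_i as in (1.2.6)
        """
        A = np.column_stack([f(t) for f in basis])   # A[i,j] = f_j(t_i)
        if sigma is not None:
            w = 1.0 / np.asarray(sigma)
            A, y = A * w[:, None], y * w             # W A and W y
        x, *_ = np.linalg.lstsq(A, y, rcond=None)    # minimizes ||y - A x||_2
        return x

    # Example usage with placeholder basis functions and data:
    t = np.linspace(0.0, 1.0, 20)
    y = 1.0 + 2.0 * t + 0.01 * np.random.randn(20)
    x = lsq_fit([lambda t: np.ones_like(t), lambda t: t], t, y)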

Example 3. We return to the NMR data fitting problem from Example 1. For this problem there are 50 measured data points and the model basis functions are

f_1(t) = e^{-λ_1 t},   f_2(t) = e^{-λ_2 t},   f_3(t) = 1,

and hence we have m = 50 and n = 3. In this example the errors in all data points have the same standard deviation σ = 0.1, so we can use the un-weighted approach. The solution to the 50 x 3 least squares problem is

x_1 = 1.303,   x_2 = 1.973,   x_3 = 0.305.

The exact parameters used to generate the data are 1.27, 2.04 and 0.3, respectively. These data were then perturbed with random errors. Figure 1.2.1 shows the data together with the least squares fit M(x*, t); note how the residuals are distributed on both sides of the fit.
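A synthetic re-creation of this example can be set up in a few lines. The sketch below assumes a sampling interval (the excerpt only states m = 50 and σ = 0.1), so the computed coefficients will not reproduce the numbers above exactly; it only illustrates the workflow.

    import numpy as np

    lam1, lam2 = 27.0, 8.0                 # known decay parameters (s^-1)
    t = np.linspace(0.0, 0.5, 50)          # assumed sampling times (seconds)
    x_exact = np.array([1.27, 2.04, 0.30]) # parameters used to generate the data

    A = np.column_stack([np.exp(-lam1 * t), np.exp(-lam2 * t), np.ones_like(t)])
    rng = np.random.default_rng(0)
    y = A @ x_exact + 0.1 * rng.standard_normal(t.size)   # noisy data

    x_lsq, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(x_lsq)   # estimates of x_1, x_2, x_3 (close to, but not equal to, x_exact)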

For the data fitting problem in Example 2, we try both the polynomial fit and the trigonometric fit. In the first case the basis functions are the monomials f_j(t) = t^{n-j}, for j = 1, . . . , n = p + 1, where p is the degree of the polynomial. In the second case the basis functions are the trigonometric functions:

f_1(t) = 1,   f_2(t) = sin(ω t),   f_3(t) = cos(ω t),   f_4(t) = sin(2ω t),   f_5(t) = cos(2ω t),   . . .

Figure 1.2.1: The least squares fit (solid line) to the measured NMR data (dots) from Figure 1.1.1 in Example 1.

Figure 1.2.2: Two least squares fits (both of order n = 9) to the measured NO data from Example 2, using a polynomial (left) and trigonometric functions (right).

Figure 1.2.2 shows the two fits using a polynomial of degree p = 8 (giving a fit of order n = 9) and a trigonometric fit with n = 9. The trigonometric fit looks better. We shall later introduce computational tools that let us investigate this aspect in more rigorous ways.
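Using the NO data from Table 1.1, the trigonometric fit of order n = 9 can be computed as in the following sketch (plotting omitted; the fine evaluation grid is arbitrary).

    import numpy as np

    # NO concentrations from Table 1.1 (t_i = 0, 1, ..., 24 hours).
    y = np.array([110.49, 73.72, 23.39, 17.11, 20.31, 29.37, 74.74, 117.02,
                  298.04, 348.13, 294.75, 253.78, 250.48, 239.48, 236.52,
                  245.04, 286.74, 304.78, 288.76, 247.11, 216.73, 185.78,
                  171.19, 171.73, 164.05])
    t = np.arange(25, dtype=float)
    omega = 2 * np.pi / 24

    # Order n = 9: basis 1, sin(w t), cos(w t), ..., sin(4 w t), cos(4 w t).
    def trig_matrix(t):
        cols = [np.ones_like(t)]
        for k in range(1, 5):
            cols += [np.sin(k * omega * t), np.cos(k * omega * t)]
        return np.column_stack(cols)

    x, *_ = np.linalg.lstsq(trig_matrix(t), y, rcond=None)

    t_fine = np.linspace(0, 24, 241)
    smoothed = trig_matrix(t_fine) @ x   # smoothed NO concentration at any time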

Example 4. This example illustrates the importance of using weights when computing the fit. We use again the NMR data from Example 1, except this time we add larger Gaussian noise to the first 10 data points, with standard deviation 0.5. Thus we have σ_i = 0.5 for i = 1, 2, . . . , 10 (the first 10 data with larger errors) and σ_i = 0.1 for i = 11, 12, . . . , 50 (the remaining data with smaller errors). The corresponding weights w_i = 1/σ_i are therefore 2, 2, . . . , 2, 10, 10, . . . , 10. We solve the data fitting problem with and without weights for 10,000 instances of the noise. To evaluate the results, we consider how well we estimate the second parameter x_2, whose exact value is 2.04. The results are shown in Figure 1.2.3 in the form of histograms of the computed values of x_2. Clearly, the weighted fit gives more robust results because it is less influenced by the data with large errors.

Figure 1.2.3: Histograms of the computed values of x_2 for the modified NMR data in which the first 10 data points have larger errors. The left plot shows results from solving the un-weighted LSQ problem, while the right plot shows the results when weights w_i = 1/σ_i are included in the LSQ problem.
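A sketch of this experiment, reusing the assumed NMR setup from the previous sketch and fewer noise realizations, shows the same effect: the spread of the weighted estimates of x_2 is smaller.

    import numpy as np

    lam1, lam2 = 27.0, 8.0
    t = np.linspace(0.0, 0.5, 50)          # assumed time grid, as before
    A = np.column_stack([np.exp(-lam1 * t), np.exp(-lam2 * t), np.ones_like(t)])
    x_exact = np.array([1.27, 2.04, 0.30])
    sigma = np.where(np.arange(50) < 10, 0.5, 0.1)   # first 10 points noisier
    w = 1.0 / sigma

    rng = np.random.default_rng(1)
    x2_unw, x2_w = [], []
    for _ in range(1000):                            # 10,000 in the book's experiment
        y = A @ x_exact + sigma * rng.standard_normal(50)
        x2_unw.append(np.linalg.lstsq(A, y, rcond=None)[0][1])
        x2_w.append(np.linalg.lstsq(A * w[:, None], w * y, rcond=None)[0][1])

    # The weighted estimates of x_2 scatter less around the exact value 2.04.
    print(np.std(x2_unw), np.std(x2_w))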

1.3 Maximum likelihood estimation

At first glance the problems of interpolation and data fitting seem to resemble each other. In both cases, we use approximation theory to select a good model (through the model basis functions) for the given data that results in small approximation errors; in the case of data fitting these are given by the term Γ(t) - M(x, t) for t ∈ [t_1, t_m]. The main difference between the two problems comes from the presence of data errors and the way we deal with these errors.

In data fitting we deliberately avoid interpolating the data and instead settle for fewer degrees of freedom in the model, in order to reduce the model's sensitivity to errors. Also, it is clear that the data noise plays an important role in data fitting problems, and we should use concepts and tools from statistics to deal with it.

The classical statistical motivation for the least squares fit is based on the maximum likelihood principle. Our presentation follows [15]. We assume that the data are given by (1.2.1), that the errors e_i are unbiased and uncorrelated, and that each error e_i has a Gaussian distribution with standard deviation σ_i, i.e.,

y_i = Γ(t_i) + e_i,   e_i ∼ N(0, σ_i^2).

Here, N(0, σ_i^2) denotes the normal (Gaussian) distribution with zero mean and standard deviation σ_i. Gaussian errors arise, e.g., from the measurement process or the measuring devices, and they are also good models of composite errors that arise when several sources contribute to the noise.


Following once again the maximum likelihood approach we arrive at the problem of maximizing the function

P_x = K Π_{i=1}^{m} e^{-|y_i - M(x, t_i)|/σ_i} = K e^{-Σ_{i=1}^{m} |y_i - M(x, t_i)|/σ_i},   with K = Π_{i=1}^{m} (2σ_i)^{-1}.

Hence, for these errors we should minimize the sum-of-absolute-values of the weighted residuals. This is the linear 1-norm minimization problem that we mentioned in (1.2.4).

While the principle of maximum likelihood is universally applicable, it can lead to complicated or intractable computational problems. As an example, consider the case of Poisson data, where y_i comes from a Poisson distribution with expected value Γ(t_i) and standard deviation Γ(t_i)^{1/2}. Poisson data typically show up in counting measurements, such as the photon counts underlying optical detectors. Then the probability for making the observation y_i is

P_i = ( Γ(t_i)^{y_i} / y_i! ) e^{-Γ(t_i)},

    and hence, we should maximize the probability

P_x = Π_{i=1}^{m} ( M(x, t_i)^{y_i} / y_i! ) e^{-M(x, t_i)} = ( Π_{i=1}^{m} 1/y_i! ) ( Π_{i=1}^{m} M(x, t_i)^{y_i} ) e^{-Σ_{i=1}^{m} M(x, t_i)}.

Unfortunately, it is computationally demanding to maximize this quantity with respect to x, and instead one usually makes the assumption that the Poisson errors for each data value y_i are nearly Gaussian, with standard deviation σ_i = Γ(t_i)^{1/2} ≈ y_i^{1/2} (see, e.g., pp. 342-343 in [158] for a justification of this assumption). Hence, the above weighted least squares approach derived for Gaussian noise, with weights w_i = y_i^{-1/2}, will give a good approximation to the maximum likelihood fit for Poisson data.

The Gauss and Laplace errors discussed above are used to model additive errors in the data. We finish this section with a brief look at relative errors, which arise when the size of the error e_i is, perhaps to a good approximation, proportional to the magnitude of the pure data Γ(t_i). A straightforward way to model such errors, which fits into the above framework, is to assume that the data y_i can be described by a normal distribution with mean Γ(t_i) and standard deviation σ_i = |Γ(t_i)| σ. This relative Gaussian errors model can also be written as

y_i = Γ(t_i) (1 + e_i),   e_i ∼ N(0, σ^2).   (1.3.2)

Then the probability for making the observation y_i is

P_i = 1/( |Γ(t_i)| σ √(2π) ) e^{ -(1/2) ( (y_i - Γ(t_i)) / (σ Γ(t_i)) )^2 }.

Using the maximum likelihood principle again and substituting the measured data y_i for the unknown pure data Γ(t_i), we arrive at the following weighted least squares problem:

min_x Σ_{i=1}^{m} ( (y_i - M(x, t_i)) / y_i )^2 = min_x || W (y - A x) ||_2^2,   (1.3.3)

with weights w_i = y_i^{-1}.

An alternative formulation, which is suited for problems with positive data Γ(t_i) > 0 and y_i > 0, is to assume that y_i can be described by a log-normal distribution, for which log y_i has a normal distribution with mean log Γ(t_i) and standard deviation σ:

y_i = Γ(t_i) e^{ε_i},   i.e.,   log y_i = log Γ(t_i) + ε_i,   ε_i ∼ N(0, σ^2).

In this case we again arrive at a sum-of-squares minimization problem, but now involving the difference of the logarithms of the data y_i and the model M(x, t_i). Even when M(x, t) is a linear model, this is not a linear problem in x.

In the above log-normal model with standard deviation σ, the probability P_i for making the observation y_i is given by

P_i = 1/( y_i σ √(2π) ) e^{ -(1/2) ( (log y_i - log Γ(t_i)) / σ )^2 } = 1/( y_i σ √(2π) ) e^{ -(1/2) ( log ỹ_i / σ )^2 },

with ỹ_i = y_i / Γ(t_i). Now let us assume that σ is small compared to Γ(t_i), such that y_i ≈ Γ(t_i) and ỹ_i ≈ 1. Then we can write y_i = Γ(t_i)(1 + e_i), i.e., ỹ_i = 1 + e_i, with |e_i| ≪ 1 and log ỹ_i = e_i + O(e_i^2). Hence, the exponential factor in P_i becomes

e^{ -(1/2) ( log ỹ_i / σ )^2 } = e^{ -(1/2) ( (e_i + O(e_i^2)) / σ )^2 }
    = e^{ -(1/2) ( e_i^2 + O(e_i^3) ) / σ^2 }
    = e^{ -(1/2) ( e_i / σ )^2 } e^{ -O(e_i^3)/σ^2 }
    = e^{ -(1/2) ( e_i / σ )^2 } O(1),

while the other factor in P_i becomes

1/( y_i σ √(2π) ) = 1/( Γ(t_i) (1 + e_i) σ √(2π) ) = ( 1 + O(e_i) ) / ( Γ(t_i) σ √(2π) ).


Hence, as long as σ ≪ Γ(t_i) we have the approximation

P_i ≈ 1/( Γ(t_i) σ √(2π) ) e^{ -(1/2) ( e_i / σ )^2 } = 1/( Γ(t_i) σ √(2π) ) e^{ -(1/2) ( (y_i - Γ(t_i)) / (σ Γ(t_i)) )^2 },

which is the probability introduced above for the case of relative Gaussian errors. Hence, for small noise levels σ ≪ |Γ(t_i)|, the two different models for introducing relative errors in the data are practically identical, leading to the same weighted LSQ problem (1.3.3).

1.4 The residuals and their properties

This section focuses on the residuals r_i = y_i - M(x, t_i) for a given fit and how they can be used to analyze the quality of the fit M(x, t) that we have computed. Throughout the section we assume that the residuals behave like a time series, i.e., they have a natural ordering r_1, r_2, . . . , r_m associated with the ordering t_1 < t_2 < · · · < t_m of the samples of the independent variable t.

As we already saw in Equation (1.2.5), each residual r_i consists of two components: the data error e_i and the approximation error Γ(t_i) - M(x, t_i).

For a good fitting model, the approximation error should be of the same size as the data errors (or smaller). At the same time, we do not want the residuals to be too small, since then the model M(x, t) may overfit the data: i.e., not only will it capture the behavior of the pure-data function Γ(t), but it will also adapt to the errors, which is undesirable.

In order to choose a good fitting model M(x, t) we must be able to analyze the residuals r_i and determine whether the model captures the pure-data function well enough. We can say that this is achieved when the approximation errors are smaller than the data errors, so that the residuals are practically dominated by the data errors. In that case, some of the statistical properties of the errors will carry over to the residuals. For example, if the noise is white (cf. Section 1.1), then we will expect that the residuals associated with a satisfactory fit show properties similar to white noise.

If, on the other hand, the fitting model does not capture the main behavior of the pure-data function, then we can expect that the residuals are dominated by the approximation errors. When this is the case, the residuals will not have the characteristics of noise, but instead they will tend to behave as a sampled signal, i.e., the residuals will show strong local correlations. We will use the term trend to characterize a long-term movement in the residuals when considered as a time series.

Below we will discuss some statistical tests that can be used to determine whether the residuals behave like noise or include trends. These and many other tests are often used in time series analysis and signal processing. Throughout this section we make the following assumptions about the data errors e_i:

• They are random variables with mean zero and identical variance, i.e., E(e_i) = 0 and E(e_i^2) = σ^2 for i = 1, 2, . . . , m.
• They belong to a normal distribution, e_i ∼ N(0, σ^2).

We will describe three tests with three different properties:

• Randomness test: check for randomness of the signs of the residuals.
• Autocorrelation test: check whether the residuals are uncorrelated.
• White noise test: check for randomness of the residuals.

The use of the tools introduced here is illustrated below and in Chapter 11 on applications.

Test for random signs

Perhaps the simplest analysis of the residuals is based on the statistical question: can we consider the signs of the residuals to be random? (Which will often be the case when e_i is white noise with zero mean.) We can answer this question by means of a run test from time series analysis; see, e.g., Section 10.4 in [146].

Given a sequence of two symbols (in our case, + and - for positive and negative residuals r_i) a run is defined as a succession of identical symbols surrounded by different symbols. For example, the sequence

+++-----++----+++

has m = 17 elements, n_+ = 8 pluses, n_- = 9 minuses and u = 5 runs: +++, -----, ++, ---- and +++. The distribution of runs u (not the residuals!) can be approximated by a normal distribution with mean μ_u and standard deviation σ_u given by

μ_u = 2 n_+ n_- / m + 1,   σ_u^2 = (μ_u - 1)(μ_u - 2) / (m - 1).   (1.4.1)

With a 5% significance level we will accept the sign sequence as random if

z = |u - μ_u| / σ_u < 1.96   (1.4.2)

(other values of the threshold, for other significance levels, can be found in any book on statistics). If the signs of the residuals are not random, then it is likely that trends are present in the residuals. In the above example with 5 runs we have z = 2.25, and according to (1.4.2) the sequence of signs is not random.
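A minimal NumPy sketch of this run test; r is the vector of residuals, and residuals that are exactly zero are not given any special treatment.

    import numpy as np

    def run_test(r):
        """Test for random signs of the residuals, cf. (1.4.1)-(1.4.2).

        Returns z; the signs are accepted as random (5% level) if z < 1.96.
        """
        s = np.sign(r)
        m = len(s)
        n_plus = np.sum(s > 0)
        n_minus = np.sum(s < 0)
        u = 1 + np.sum(s[1:] != s[:-1])          # number of runs
        mu_u = 2.0 * n_plus * n_minus / m + 1.0
        sigma_u = np.sqrt((mu_u - 1.0) * (mu_u - 2.0) / (m - 1.0))
        return abs(u - mu_u) / sigma_u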


Test for correlation

Another question we can ask is whether short sequences of residuals are correlated, which is a clear indication of trends. The autocorrelation of the residuals is a statistical tool for analyzing this. We define the autocorrelation ρ of the residuals, as well as the trend threshold T, as the quantities

ρ = Σ_{i=1}^{m-1} r_i r_{i+1},   T = ( 1/√(m-1) ) Σ_{i=1}^{m} r_i^2.   (1.4.3)

Since ρ is the sum of products of neighboring residuals, it is in fact the unit-lag autocorrelation. Autocorrelations with larger lags, or distances in the index, can also be considered. Then, we say that trends are likely to be present in the residuals if the absolute value of the autocorrelation exceeds the trend threshold, i.e., if |ρ| > T. Similar techniques, based on shorter sequences of residuals, are used for placing knots in connection with spline fitting; see Chapter 6 in [137].

We note that in some presentations, the mean of the residuals is subtracted before computing ρ and T. In our applications this should not be necessary, as we assume that the errors have zero mean.
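A corresponding sketch of the autocorrelation test, following the definitions in (1.4.3):

    import numpy as np

    def autocorrelation_test(r):
        """Unit-lag autocorrelation rho and trend threshold T, cf. (1.4.3).

        Trends are likely present if abs(rho) > T.
        """
        r = np.asarray(r, dtype=float)
        m = len(r)
        rho = np.sum(r[:-1] * r[1:])
        T = np.sum(r**2) / np.sqrt(m - 1)
        return rho, T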

Test for white noise

Yet another question we can ask is whether the sequence of residuals behaves like white noise, which can be answered by means of the normalized cumulative periodogram. The underlying idea is that white noise has a flat spectrum, i.e., all frequency components in the discrete Fourier spectrum have the same probability; hence, we must determine whether this is the case. Let the complex numbers r̂_k denote the components of the discrete Fourier transform of the residuals, i.e.,

r̂_k = Σ_{i=1}^{m} r_i e^{ -ι 2π (i-1)(k-1)/m },   k = 1, . . . , m,

where ι denotes the imaginary unit. Our indices are in the range 1, . . . , m and thus shifted by 1 relative to the range 0, . . . , m - 1 that is common in signal processing. Note that r̂_1 is the sum of the residuals (called the DC component in signal processing), while r̂_{q+1} with q = ⌊m/2⌋ is the component of the highest frequency. The squared absolute values

|r̂_1|^2, |r̂_2|^2, . . . , |r̂_{q+1}|^2

are known as the periodogram (in statistics) or the power spectrum (in signal processing). Then the normalized cumulative periodogram consists of the q numbers

c_i = ( |r̂_2|^2 + |r̂_3|^2 + · · · + |r̂_{i+1}|^2 ) / ( |r̂_2|^2 + |r̂_3|^2 + · · · + |r̂_{q+1}|^2 ),   i = 1, . . . , q,   q = ⌊m/2⌋,

which form an increasing sequence from 0 to 1. Note that the sums exclude the first term in the periodogram.

If the residuals are white noise, then the expected values of the normalized cumulative periodogram lie on a straight line from (0, 0) to (q, 1). Any realization of white noise residuals should produce a normalized cumulative periodogram close to a straight line. For example, with the common 5% significance level from statistics, the numbers c_i should lie within the Kolmogorov-Smirnoff limit 1.35/q of the straight line. If the maximum deviation max_i {|c_i - i/q|} is smaller than this limit, then we recognize the residuals as white noise.

Example 5. Residual analysis. We finish this section with an example that illustrates the above analysis techniques. We use the two different data sets shown in Figure 1.4.1; both sets are artificially generated (in fact, the second set is the first set with the t_i and y_i values interchanged). In both examples we have m = 43 data points, and in the test for white noise we have q = 21 and 1.35/q = 0.0643. The fitting model M(x, t) is the polynomial of degree p = n - 1.

Figure 1.4.1: Two artificially created data sets used in Example 5. The second data set is inspired by the data on p. 60 in [137].
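A sketch of the white noise test described above, as it could be applied to the residual vectors in Example 5 (assuming NumPy; r is the vector of residuals):

    import numpy as np

    def cumulative_periodogram_test(r):
        """White noise test via the normalized cumulative periodogram.

        Returns (c, maxdev): the numbers c_1..c_q and the maximum deviation
        max_i |c_i - i/q|, to be compared with the limit 1.35/q (5% level).
        """
        r = np.asarray(r, dtype=float)
        m = len(r)
        q = m // 2
        p = np.abs(np.fft.fft(r))**2          # periodogram |r_hat_k|^2
        num = np.cumsum(p[1:q + 1])           # sums excluding the DC term p[0]
        c = num / num[-1]
        maxdev = np.max(np.abs(c - np.arange(1, q + 1) / q))
        return c, maxdev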

For fitting orders n = 2, 3, . . . , 9, Figure 1.4.2 shows the residuals and the normalized cumulative periodograms, together with z from (1.4.2), the ratios ρ/T from (1.4.3) and the maximum distance of the normalized cumulative periodogram to the straight line. A visual inspection of the residuals in the left part of the figure indicates that for small values of n the polynomial model does not capture all the information in the data, as there are obvious trends in the residuals, while for n ≥ 5 the residuals appear to be more random. The test for random signs confirms this: for n ≥ 5 the numbers z are less than 1.96, indicating that the signs of the residuals could be considered random. The autocorrelation analysis leads to approximately the same conclusion: for n = 6 and 7 the absolute value of the autocorrelation is smaller than the threshold T.

Figure 1.4.2: Residual analysis for polynomial fits to artificial data set 1.

The normalized cumulative periodograms are shown in the right part of Figure 1.4.2. For small values of n, the curves rise fast toward a flat part, showing that the residuals are dominated by low-frequency components. The closest we get to a straight line is for n = 6, but the maximum distance 0.134 to the straight line is still too large to clearly signify that the residuals are white noise. The conclusion from these three tests is nevertheless that n = 6 is a good choice of the order of the fit.

Figure 1.4.3 presents the residual analysis for the second data set. A visual inspection of the residuals clearly shows that the polynomial model is not well suited for this data set: the residuals have a slowly varying trend for all values of n. This is confirmed by the normalized cumulative periodograms, which show that the residuals are dominated by low-frequency components. The random-sign test and the autocorrelation analysis also give a clear indication of trends in the residuals.

1.5 Robust regression

The least squares fit introduced in this chapter is convenient and useful in a large number of practical applications, but it is not always the right choice for a data fitting problem. In fact, we have already seen in Section 1.3 that the least squares fit is closely connected to the assumption about Gaussian errors in the data. There we also saw that other types of noise, in the framework of maximum likelihood estimation, lead to other criteria for a best fit, such as the sum-of-absolute-values of the residuals (the 1-norm) associated with the Laplace distribution for the noise. The more dominating the tails of the probability density function for the noise, the more important it is to use another criterion than the least squares fit.

Another situation where the least squares fit is not appropriate is when the data contain outliers, i.e., observations with exceptionally large errors and residuals. We can say that an outlier is a data point (t_i, y_i) whose value y_i is unusual compared to its predicted value (based on all the reliable data points). Such outliers may come from different sources:

• The data errors may come from more than one statistical distribution. This could arise, e.g., in an astronomical CCD camera, where we have Poisson noise (or photon noise) from the incoming light, Gaussian noise from the electronic circuits (amplifier and A/D-converter), and occasional large errors from cosmic radiation (so-called cosmic ray events).

• The outliers may be due to data recording errors arising, e.g., when the measurement device has a malfunction or the person recording the data makes a blunder and enters a wrong number.

Figure 1.4.3: Residual analysis for polynomial fits to artificial data set 2.

A manual inspection can sometimes be used to delete blunders from the data set, but it may not always be obvious which data are blunders or outliers. Therefore we prefer to have a mathematical formulation of the data fitting problem that handles outliers in such a way that all data are used, and yet the outliers do not have a deteriorating influence on the fit. This is the goal of robust regression. Quoting from [90], we say that an estimator or statistical procedure is robust if it provides useful information even if some of the assumptions used to justify the estimation method are not applicable.

Example 6. Mean and median. Assume we are given n - 1 samples z_1, . . . , z_{n-1} from the same distribution and a single sample z_n that is an outlier. Clearly, the arithmetic mean (1/n)(z_1 + z_2 + · · · + z_n) is not a good estimate of the expected value, because the outlier contributes with the same weight as all the other data points. On the other hand, the median gives a robust estimate of the expected value since it is insensitive to a few outliers; we recall that if the data are sorted, then the median is z_{(n+1)/2} if n is odd, and (1/2)(z_{n/2} + z_{n/2+1}) if n is even.
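A two-line numerical illustration (with made-up samples):

    import numpy as np

    z = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 55.0])  # last sample is an outlier
    print(np.mean(z))    # about 17.52: pulled far away by the outlier
    print(np.median(z))  # 10.05: essentially unaffected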

The most common method for robust data fitting (or robust regression, as statisticians call it) is based on the principle of M-estimation introduced by Huber [142], which can be considered as a generalization of maximum likelihood estimation. Here we consider un-weighted problems only (the extension to weighted problems is straightforward). The underlying idea is to replace the sum of squared residuals in (1.2.3) with the sum of some function of the residuals:

Robust fit:   min_x Σ_{i=1}^{m} ρ( r_i(x) ) = min_x Σ_{i=1}^{m} ρ( y_i - M(x, t_i) ),   (1.5.1)

where the function ρ defines the contribution of each residual to the function to be minimized. In particular, we obtain the least squares fit when ρ(r) = (1/2) r^2. The function ρ must satisfy the following criteria:

1. Non-negativity: ρ(r) ≥ 0 for all r.

2. Zero only when the argument is zero: ρ(r) = 0 ⇔ r = 0.

3. Symmetry: ρ(-r) = ρ(r).

4. Monotonicity: ρ(r') ≥ ρ(r) for r' ≥ r.


Figure 1.5.1: Four functions ρ(r) used in the robust data fitting problem. All of them increase slower than the function (1/2) r^2 that defines the LSQ problem, and thus they lead to robust data fitting problems that are less sensitive to outliers.

Some well-known examples of the function ρ are (cf. [90, 184]):

Huber:     ρ(r) = (1/2) r^2               for |r| ≤ β,
           ρ(r) = β |r| - (1/2) β^2       for |r| > β.         (1.5.2)

Talwar:    ρ(r) = (1/2) r^2               for |r| ≤ β,
           ρ(r) = (1/2) β^2               for |r| > β.         (1.5.3)

Bisquare:  ρ(r) = β^2 log( cosh(r/β) ).                        (1.5.4)

Logistic:  ρ(r) = β^2 ( |r|/β - log(1 + |r|/β) ).              (1.5.5)

Note that all four functions include a problem-dependent positive parameter β that is used to control the behavior of the function for large values of r, corresponding to the outliers. Figure 1.5.1 shows these functions for the case β = 1, and we see that all of them increase slower than the function (1/2) r^2, which underlies the LSQ problem. This is precisely why they lead to a robust data fitting problem whose solution is less sensitive to outliers than the LSQ solution. The parameter β should be chosen from our knowledge of the standard deviation of the noise; if the standard deviation is not known, then it can be estimated from the fit, as we will discuss in Section 2.2.
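For reference, a direct transcription of (1.5.2)-(1.5.5) into NumPy, with the parameter written as beta:

    import numpy as np

    def rho_huber(r, beta):
        a = np.abs(r)
        return np.where(a <= beta, 0.5 * r**2, beta * a - 0.5 * beta**2)

    def rho_talwar(r, beta):
        return np.where(np.abs(r) <= beta, 0.5 * r**2, 0.5 * beta**2)

    def rho_bisquare(r, beta):
        return beta**2 * np.log(np.cosh(r / beta))

    def rho_logistic(r, beta):
        a = np.abs(r) / beta
        return beta**2 * (a - np.log1p(a))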

It appears that the choice of function ρ for a given problem relies mainly on experience with the specific data for that problem. Still, the Huber function has attained widespread use, perhaps due to the natural way it distinguishes between large and small residuals:

• Small residuals satisfying |r_i| ≤ β are treated in the same way as in the LSQ fitting problem; if there are no outliers, then we obtain the LSQ solution.

• Large residuals satisfying |r_i| > β are essentially treated as |r_i|, and the robust fit is therefore not so sensitive to the corresponding data points.

Figure 1.5.2: The pure-data function Γ(t) (thin line) and the data with Gaussian noise (dots); the outlier (t_60, y_60) = (3, 2.5) is outside the plot. The fitting model M(x, t) is a polynomial with n = 9. Left: the LSQ fit and the corresponding residuals; this fit is dramatically influenced by the outlier. Right: the robust Huber fit, using β = 0.025, together with the residuals; this is a much better fit to the given data because it approximates the pure-data function well.

Thus, robust regression is a compromise between excluding the outliers entirely from the analysis and including all the data points and treating all of them equally in the LSQ regression. The idea of robust regression is to weight the observations differently based on how well behaved these observations are. For an early use in seismic data processing see [54, 235].

Example 7. Robust data fitting with the Huber function. This example illustrates that the Huber function gives a more robust fit than the LSQ fit. The pure-data function Γ(t) is given by

Γ(t) = sin(e^{-t}),   0 ≤ t ≤ 5.

We use m = 100 data points with t_i = 0.05 i, and we add Gaussian noise with standard deviation σ = 0.05. Then we change the 60th data point to an outlier with (t_60, y_60) = (3, 2.5); Figure 1.5.2 shows the function Γ(t) and the noisy data. We note that the outlier is located outside the plot.

As fitting model M(x, t) we use a polynomial with n = 9, and the left part of Figure 1.5.2 shows the least squares fit with this model, together with the corresponding residuals. Clearly, this fit is dramatically influenced by the outlier, which is evident from the plot of the fit as well as by the behavior of the residuals, which exhibit a strong positive trend in the range 2 ≤ t ≤ 4. This illustrates the inability of the LSQ fit to handle outliers in a satisfactory way.

The right part of Figure 1.5.2 shows the robust Huber fit, with parameter β = 0.025 (this parameter is chosen to reflect the noise level in the data). The resulting fit is not influenced by the outlier, and the residuals do not seem to exhibit any strong trend. This is a good illustration of robust regression.

In Section 9.5 we describe numerical algorithms for computing the solutions to robust data fitting problems.
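One such algorithm is iteratively reweighted least squares (IRLS); the following is a simplified sketch (not the book's implementation) that uses the Huber weights w_i = min(1, β/|r_i|) and repeatedly solves a weighted LSQ problem:

    import numpy as np

    def huber_fit_irls(A, y, beta, iters=50):
        """Simplified IRLS sketch for the Huber fit min_x sum_i rho(y_i - (A x)_i)."""
        x, *_ = np.linalg.lstsq(A, y, rcond=None)        # start from the LSQ fit
        for _ in range(iters):
            r = y - A @ x
            w = np.minimum(1.0, beta / np.maximum(np.abs(r), 1e-12))  # Huber weights
            sw = np.sqrt(w)
            x, *_ = np.linalg.lstsq(A * sw[:, None], sw * y, rcond=None)
        return x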


    Chapter 2

Linear Least Squares Problem

This chapter covers some of the basic mathematical facts of the linear least squares problem, as well as some important additional statistical results for data fitting. We introduce the two formulations of the least squares problem: the linear system of normal equations and the optimization problem form.

The computation of the LSQ solution via an optimization problem has two aspects: simplification of the problem structure and actual minimization. In this and in the next chapter we present a number of matrix factorizations, for both full-rank and rank-deficient problems, which transform the original problem to one easier to solve. The QR factorization is emphasized, for both the analysis and the solution of the LSQ problem, while in the last section we look into the more expensive complete factorizations.

Some very interesting historical papers on Gaussian elimination, which also include least squares problems, can be found in Grcar [120, 121].

2.1 Linear least squares problem formulation

As we saw in the previous chapter, underlying the linear (and possibly weighted) least squares data fitting problem is the linear least squares problem

    min_x ‖b − A x‖_2   or   min_x ‖W (b − A x)‖_2,

where A ∈ R^{m×n} is the matrix with samples of the model basis functions, x is the vector of parameters to be determined, the right-hand side b is the vector of observations and W is a diagonal weight matrix (possibly the identity matrix).


Figure 2.1.1: The geometric interpretation of the linear least squares solution x∗. The plane represents the range of A, and if the vector b has a component outside this subspace, then we have an inconsistent system. Moreover, b̂ = A x∗ is the orthogonal projection of b on range(A), and r∗ is the LSQ residual vector.

Since the weights can always be absorbed into A and b in the mathematical formulation, we can, without loss of generality, restrict our discussion to the un-weighted case. Also, from this point on, when discussing the generic linear least squares problem, we will use the notation b for the right-hand side (instead of the y that we used in Chapter 1), which is more common in the LSQ literature.

Although most of the material in this chapter is also applicable to the underdetermined case (m < n), for notational simplicity we will always consider the overdetermined case m ≥ n. We denote by r the rank of A, and we consider both full-rank and rank-deficient problems. Thus we always have r ≤ n ≤ m in this chapter.

It is appropriate to remember at this point that an m × n matrix A is always a representation of a linear transformation x → A x with A : R^n → R^m, and therefore there are two important subspaces associated with it: the range or column space,

    range(A) = { z ∈ R^m | z = A x for some x ∈ R^n },

and its orthogonal complement, the null space of A^T:

    null(A^T) = { y ∈ R^m | A^T y = 0 }.

When A is square and has full rank, then the LSQ problem min_x ‖A x − b‖_2 reduces to the linear system of equations A x = b. In all other cases, due to the data errors, it is highly probable that the problem is inconsistent, i.e., b ∉ range(A), and as a consequence there is no exact solution, i.e., no coefficients x_j exist that express b as a linear combination of columns of A.


Instead, we can find the coefficients x_j for a vector b̂ in the range of A and closest to b. As we have seen, for data fitting problems it is natural to use the Euclidean norm as our measure of closeness, resulting in the least squares problem

    Problem LSQ:   min_x ‖b − A x‖_2^2,   A ∈ R^{m×n},   r ≤ n ≤ m,      (2.1.1)

with the corresponding residual vector r given by

    r = b − A x.      (2.1.2)

See Figure 2.1.1 for a geometric interpretation. The minimizer, i.e., the least squares solution (which may not be unique, as will be seen later), is denoted by x∗. We note that the vector b̂ ∈ range(A) mentioned above is given by b̂ = A x∗.

The LSQ problem can also be looked at from the following point of view. When our data are contaminated by errors, then the data are not in the span of the model basis functions f_j(t) underlying the data fitting problem (cf. Chapter 1). In that case the data vector b cannot and should not be precisely predicted by the model, i.e., by the columns of A. Hence, it must be perturbed by a minimum amount r, so that it can then be represented by A, in the form b̂ = A x∗. This approach will establish a viewpoint used in Section 7.3 to introduce the total least squares problem.

As already mentioned, there are good statistical reasons to use the Euclidean norm. The underlying statistical assumption that motivates this norm is that the vector r has random error elements, uncorrelated, with zero mean and a common variance. This is justified by the following theorem.

Theorem 8. (Gauss-Markov) Consider the problem of fitting a model M(x, t) with the n-parameter vector x to a set of data b_i = Γ(t_i) + e_i for i = 1, . . . , m (see Chapter 1 for details).

In the case of a linear model b = A x, if the errors are uncorrelated with mean zero and constant variance σ^2 (not necessarily normally distributed), and assuming that the m × n matrix A obtained by evaluating the model at the data abscissas {t_i}_{i=1,...,m} has full rank n, then the best linear unbiased estimator is the least squares estimator x∗, obtained by solving the problem min_x ‖b − A x‖_2^2.

For more details see [22], Theorem 1.1.1. Recall also the discussion on maximum likelihood estimation in Chapter 1. Similarly, for nonlinear models, if the errors e_i for i = 1, . . . , m have a normal distribution, the unknown parameter vector x estimated from the data using a least squares criterion is the maximum likelihood estimator.


There are also clear mathematical and computational advantages associated with the Euclidean norm: the objective function in (2.1.1) is differentiable, and the resulting gradient system of equations has convenient properties. Since the Euclidean norm is preserved under orthogonal transformations, this gives rise to a range of stable numerical algorithms for the LSQ problem.

Theorem 9. A necessary and sufficient condition for x∗ to be a minimizer of ‖b − A x‖_2^2 is that it satisfies

    A^T (b − A x∗) = 0.      (2.1.3)

Proof. The minimizer of φ(x) = ‖b − A x‖_2^2 must satisfy ∇φ(x) = 0, i.e., ∂φ(x)/∂x_k = 0 for k = 1, . . . , n. The kth partial derivative has the form

    ∂φ(x)/∂x_k = Σ_{i=1}^{m} 2 ( b_i − Σ_{j=1}^{n} x_j a_{ij} ) (−a_{ik}) = −2 Σ_{i=1}^{m} r_i a_{ik}
               = −2 r^T A(:, k) = −2 A(:, k)^T r,

where A(:, k) denotes the kth column of A. Hence the gradient can be written as

    ∇φ(x) = −2 A^T r = −2 A^T (b − A x)

and the requirement that ∇φ(x) = 0 immediately leads to (2.1.3).

Definition 10. The two conditions (2.1.2) and (2.1.3) can be written as a symmetric (m + n) × (m + n) system in x and r, the so-called augmented system:

    [ I     A ] [ r ]   [ b ]
    [ A^T   0 ] [ x ] = [ 0 ].      (2.1.4)

This formulation preserves any special structure that A might have, such as sparsity. Also, it is the formulation used in an iterative refinement procedure for the LSQ solution (discussed in Section 4.5), because of the relevance it gives to the residual.
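As a small illustration (not from the text), the following sketch forms the augmented system (2.1.4) for a random dense problem and checks that it reproduces both the LSQ solution and its residual; for a genuinely sparse A one would of course use a sparse solver instead.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 20, 4
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    # Augmented system (2.1.4):  [ I  A ; A^T  0 ] [ r ; x ] = [ b ; 0 ]
    K = np.block([[np.eye(m), A], [A.T, np.zeros((n, n))]])
    rhs = np.concatenate([b, np.zeros(n)])
    sol = np.linalg.solve(K, rhs)
    r, x = sol[:m], sol[m:]

    x_lsq = np.linalg.lstsq(A, b, rcond=None)[0]
    print(np.allclose(x, x_lsq), np.allclose(r, b - A @ x))   # both True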

Theorem 9 leads to the normal equations for the solution x∗ of the least squares problem:

    Normal equations:   A^T A x = A^T b.      (2.1.5)

The normal equation matrix A^T A, which is sometimes called the Grammian, is square, symmetric and additionally:

• If r = n (A has full rank), then A^T A is positive definite and the LSQ problem has a unique solution. (Since the Hessian for the least squares problem is equal to 2 A^T A, this establishes the uniqueness of x∗.)


• If r < n (A is rank deficient), then A^T A is non-negative definite. In this case, the set of solutions forms a linear manifold of dimension n − r that is a translation of the subspace null(A).

Theorem 9 also states that the residual vector of the LSQ solution lies in null(A^T). Hence, the right-hand side b can be decomposed into two orthogonal components

    b = A x∗ + r∗,

with A x∗ ∈ range(A) and r∗ ∈ null(A^T), i.e., A x∗ is the orthogonal projection of b onto range(A) (the subspace spanned by the columns of A) and r∗ is orthogonal to range(A).

Example 11. The normal equations for the NMR problem in Example 1 take the form

    [ 2.805   4.024    5.055 ] [ x_1 ]   [ 13.14 ]
    [ 4.024   8.156   15.21  ] [ x_2 ] = [ 25.98 ]
    [ 5.055  15.21    50     ] [ x_3 ]   [ 51.87 ],

giving the least squares solution x∗ = (1.303, 1.973, 0.305)^T.
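With the rounded coefficients shown above, this solution is easy to reproduce; a short NumPy check (since only three-digit entries are used, the computed solution agrees with x∗ to a couple of digits only):

    import numpy as np

    ATA = np.array([[2.805,  4.024,  5.055],
                    [4.024,  8.156, 15.21 ],
                    [5.055, 15.21,  50.0  ]])
    ATb = np.array([13.14, 25.98, 51.87])

    print(np.linalg.solve(ATA, ATb))   # close to (1.303, 1.973, 0.305)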

Example 12. Simplified NMR problem. In the NMR problem, let us assume that we know that the constant background is 0.3, corresponding to fixing x_3 = 0.3. The resulting 2 × 2 normal equations for x_1 and x_2 take the form

    [ 2.805   4.024 ] [ x_1 ]   [ 10.74 ]
    [ 4.024   8.156 ] [ x_2 ] = [ 20.35 ]

and the LSQ solution to this simplified problem is x_1 = 1.287 and x_2 = 1.991. Figure 2.1.2 illustrates the geometry of the minimization associated with the simplified LSQ problem for the two unknowns x_1 and x_2. The left plot shows the residual norm surface as a function of x_1 and x_2, and the right plot shows the elliptic contour curves for this surface; the unique minimum, the LSQ solution, is marked with a dot.

In the rank-deficient case the LSQ solution is not unique, but one can reduce the solution set by imposing additional constraints. For example, the linear least squares problem often arises from a linearization of a nonlinear least squares problem, and it may then be of interest to impose the additional constraint that the solution has minimal 2-norm, i.e., ‖x∗‖_2 = min { ‖x‖_2 : x minimizes ‖b − A x‖_2 }, so that the solution stays in the region where the linearization is valid. Because the set of all minimizers is convex, there is a unique such solution. Another reason for imposing minimal length is stability, as we will see in the section on regularization.

For data approximation problems, where we are free to choose the model basis functions f_j(t), cf. (1.2.2), one should do so in a way that gives A full rank.


Figure 2.1.2: Illustration of the LSQ problem for the simplified NMR problem. Left: the residual norm as a function of the two unknowns x_1 and x_2. Right: the corresponding contour lines for the residual norm.

A necessary condition is that the (continuous) functions f_1(t), . . . , f_n(t) are linearly independent, but furthermore, they have to define linearly independent vectors when evaluated on the specific discrete set of abscissas. More formally:

A necessary and sufficient condition for the matrix A to have full rank is that the model basis functions be linearly independent over the abscissas t_1, . . . , t_m:

    Σ_{j=1}^{n} c_j f_j(t_i) = 0 for i = 1, . . . , m   ⇒   c_j = 0 for j = 1, . . . , n.

Example 13. Consider the linearly independent functions f_1(t) = sin(t), f_2(t) = sin(2t) and f_3(t) = sin(3t); if we choose the data abscissas t_i = π/4 + iπ/2, i = 1, . . . , m, the matrix A has rank r = 2, whereas the same functions generate a full-rank matrix A when evaluated on the abscissas t_i = π(i/m), i = 1, . . . , m − 1.

An even stronger requirement is that the model basis functions f_j(t) be such that the columns of A are orthogonal.

Example 14. In general, for data fitting problems, where the model basis functions f_j(t) arise from the underlying model, the properties of the matrix A are dictated by these functions. In the case of polynomial data fitting, it is possible to choose the functions f_j(t) so that the columns of A are orthogonal, as described by Forsythe [89], which simplifies the computation of the LSQ solution. The key is to choose a clever representation of the fitting polynomials, different from the standard one with the monomials f_j(t) = t^{j−1}, j = 1, . . . , n, such that the sampled polynomials satisfy

    Σ_{i=1}^{m} f_j(t_i) f_k(t_i) = 0   for j ≠ k.      (2.1.6)

When this is the case we say that the functions are orthogonal over the given abscissas. This is satisfied by the family of orthogonal polynomials defined by the recursion

    f_1(t) = 1,
    f_2(t) = t − α_1,
    f_{j+1}(t) = (t − α_j) f_j(t) − β_j f_{j−1}(t),   j = 2, . . . , n − 1,

where the constants are given by

    α_j = (1/s_j^2) Σ_{i=1}^{m} t_i f_j(t_i)^2,   j = 1, . . . , n − 1,
    β_j = s_j^2 / s_{j−1}^2,                      j = 2, . . . , n − 1,
    s_j^2 = Σ_{i=1}^{m} f_j(t_i)^2,               j = 1, . . . , n,

i.e., s_j is the 2-norm of the jth column of A. These polynomials satisfy (2.1.6); hence, the normal equation matrix A^T A is diagonal, and it follows that the LSQ coefficients are given by

    x_j∗ = (1/s_j^2) Σ_{i=1}^{m} y_i f_j(t_i),   j = 1, . . . , n.
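A direct transcription of this construction into Python might look as follows (a sketch; the function name is our own, and the final check against an ordinary monomial-basis fit is only there to confirm that the two bases span the same polynomial space):

    import numpy as np

    def orthogonal_poly_fit(t, y, n):
        """Fit a degree-(n-1) polynomial to (t, y) using basis polynomials that
        are orthogonal over the abscissas t (three-term recurrence as above)."""
        m = len(t)
        F = np.zeros((m, n))              # column j holds f_{j+1} sampled at t
        F[:, 0] = 1.0
        alpha = np.sum(t * F[:, 0]**2) / np.sum(F[:, 0]**2)
        if n > 1:
            F[:, 1] = t - alpha
        for j in range(1, n - 1):         # build the next polynomial from the two previous ones
            s2j   = np.sum(F[:, j]**2)
            s2jm1 = np.sum(F[:, j - 1]**2)
            alpha = np.sum(t * F[:, j]**2) / s2j
            beta  = s2j / s2jm1
            F[:, j + 1] = (t - alpha) * F[:, j] - beta * F[:, j - 1]
        # A^T A is diagonal, so the LSQ coefficients decouple:
        x = F.T @ y / np.sum(F**2, axis=0)
        return F, x

    # Quick check on synthetic data: same fitted values as a monomial-basis LSQ fit.
    rng = np.random.default_rng(2)
    t = np.linspace(0, 1, 30)
    y = np.sin(3 * t) + 0.01 * rng.standard_normal(30)
    F, x = orthogonal_poly_fit(t, y, 5)
    V = np.vander(t, 5, increasing=True)
    fit_monomial = V @ np.linalg.lstsq(V, y, rcond=None)[0]
    print(np.allclose(F @ x, fit_monomial))   # True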

When A has full rank, it follows from the normal equations (2.1.5) that we can write the least squares solution as

    x∗ = (A^T A)^{−1} A^T b,

which allows us to analyze the solution and the residual vector in statistical terms. Consider the case where the data errors e_i are independent, uncorrelated and have identical standard deviations σ, meaning that the covariance matrix for b is given by

    Cov(b) = σ^2 I_m,

since the errors e_i are independent of the exact Γ(t_i). Then a standard result in statistics says that the covariance matrix for the LSQ solution is

    Cov(x∗) = (A^T A)^{−1} A^T Cov(b) A (A^T A)^{−1} = σ^2 (A^T A)^{−1}.


We see that the unknown coefficients in the fit (the elements of x∗) are uncorrelated if and only if A^T A is a diagonal matrix, i.e., when the columns of A are orthogonal. This is the case when the model basis functions are orthogonal over the abscissas t_1, . . . , t_m; cf. (2.1.6).
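The covariance formula is easy to verify experimentally. The following sketch uses an arbitrary small polynomial model (not one of the test problems in this book) and compares the empirical covariance of x∗ over many noise realizations with σ^2 (A^T A)^{−1}:

    import numpy as np

    rng = np.random.default_rng(3)
    m, n, sigma = 200, 3, 0.1
    t = np.linspace(0, 1, m)
    A = np.vander(t, n, increasing=True)      # simple polynomial model
    x_exact = np.array([1.0, -2.0, 0.5])

    # Empirical covariance of the LSQ solution over many white-noise realizations.
    X = np.empty((2000, n))
    for k in range(2000):
        b = A @ x_exact + sigma * rng.standard_normal(m)
        X[k] = np.linalg.lstsq(A, b, rcond=None)[0]

    cov_empirical = np.cov(X, rowvar=False)
    cov_theory = sigma**2 * np.linalg.inv(A.T @ A)
    print(np.max(np.abs(cov_empirical - cov_theory)))   # small compared to the entries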

Example 15. More data give better accuracy. Intuitively we expect that if we increase the number of data points, then we can compute a more accurate LSQ solution, and the present example confirms this. Specifically, we give an asymptotic analysis of how the solution's variance depends on the number m of data points, in the case of linear data fitting. There is no assumption about the distribution of the abscissas t_i, except that they belong to the interval [a, b] and appear in increasing order. Now let h_i = t_i − t_{i−1} for i = 2, . . . , m and let h = (b − a)/m denote the average spacing between the abscissas. Then for j, k = 1, . . . , n the elements of the normal equation matrix can be approximated as

    (A^T A)_{jk} = Σ_{i=1}^{m} h_i^{−1} f_j(t_i) f_k(t_i) h_i
                 ≈ (1/h) Σ_{i=1}^{m} f_j(t_i) f_k(t_i) h_i
                 ≈ (m/(b − a)) ∫_a^b f_j(t) f_k(t) dt,

and the accuracy of these approximations increases as m increases. Hence, if F denotes the matrix whose elements are the scaled inner products of the model basis functions,

    F_{jk} = (1/(b − a)) ∫_a^b f_j(t) f_k(t) dt,   j, k = 1, . . . , n,

then for large m the normal equation matrix approximately satisfies

    A^T A ≈ m F   and hence   (A^T A)^{−1} ≈ (1/m) F^{−1},

where the matrix F is independent of m. Hence, the asymptotic result (as m increases) is that, no matter the choice of abscissas and basis functions, as long as A^T A is invertible we have the approximation for the white-noise case

    Cov(x∗) = σ^2 (A^T A)^{−1} ≈ (σ^2/m) F^{−1}.

We see that the solution's variance is (to a good approximation) inversely proportional to the number m of data points.

To illustrate the above result we consider again the frozen cod meat example, this time with two sets of abscissas t_i uniformly distributed in [0, 0.4] for m = 50 and m = 200, leading to the two matrices (A^T A)^{−1} given by

    [  1.846  −1.300   0.209 ]        [  0.535  −0.359   0.057 ]
    [ −1.300   1.200  −0.234 ]   and  [ −0.359   0.315  −0.061 ],
    [  0.209  −0.234   0.070 ]        [  0.057  −0.061   0.018 ]

respectively. The average ratio between the elements in the two matrices is 3.71, i.e., fairly close to the factor 4 we expect from the above analysis when increasing m by a factor of 4.

We also solved the two LSQ problems for 1000 realizations of additive white noise, and Figure 2.1.3 shows histograms of the error norms ‖x_exact − x∗‖_2, where x_exact = (1.27, 2.04, 0.3)^T is the vector of exact parameters for the problem. These results confirm that the errors are reduced by a factor of 2, corresponding to the expected reduction of the standard deviation by the same factor.

Figure 2.1.3: Histograms of the error norms ‖x_exact − x∗‖_2 for the two test problems with additive white noise; the errors are clearly reduced by a factor of 2 when we increase m from 50 to 200.

2.2 The QR factorization and its role

In this and the next section we discuss the QR factorization and its role in the analysis and solution of the LSQ problem. We start with the simpler case of full-rank matrices in this section and then move on to rank-deficient matrices in the next section.

The first step in the computation of a solution to the least squares problem is the reduction of the problem to an equivalent one with a more convenient matrix structure. This can be done through an explicit factorization, usually based on orthogonal transformations, where instead of solving the original LSQ problem (2.1.1) one solves an equivalent problem with a triangular matrix. The basis of this procedure is the QR factorization, the less expensive decomposition that takes advantage of the isometric properties of orthogonal transformations (proofs for all the theorems in this section can be found in [22], [116] and many other references).


Theorem 16. QR factorization. Any real m × n matrix A can be factored as

    A = Q R   with   Q ∈ R^{m×m},   R = [ R_1 ; 0 ] ∈ R^{m×n},      (2.2.1)

where Q is orthogonal (i.e., Q^T Q = I_m) and R_1 ∈ R^{n×n} is upper triangular. If A has full rank, then so has R, and therefore all its diagonal elements are nonzero.

Theorem 17. Economical QR factorization. Let A ∈ R^{m×n} have full column rank r = n. The economical (or thin) QR factorization of A is

    A = Q_1 R_1   with   Q_1 ∈ R^{m×n},   R_1 ∈ R^{n×n},      (2.2.2)

where Q_1 has orthonormal columns (i.e., Q_1^T Q_1 = I_n) and the upper triangular matrix R_1 has nonzero diagonal entries. Moreover, Q_1 can be chosen such that the diagonal elements of R_1 are positive, in which case R_1 is the Cholesky factor of A^T A.

Similar theorems hold if the matrix A is complex, with the factor Q now a unitary matrix.

Remark 18. If we partition the m × m matrix Q in the full QR factorization (2.2.1) as

    Q = ( Q_1   Q_2 ),

then the sub-matrix Q_1 is the one that appears in the economical QR factorization (2.2.2). The m × (m − n) matrix Q_2 satisfies Q_2^T Q_1 = 0 and Q_1 Q_1^T + Q_2 Q_2^T = I_m.

Geometrically, the QR factorization corresponds to an orthogonalization of the linearly independent columns of A. The columns of the matrix Q_1 are an orthonormal basis for range(A) and those of Q_2 are an orthonormal basis for null(A^T).

    The following theorem expresses the least squares solution of the full-rank problem in terms of the economical QR factorization.

Theorem 19. Let A ∈ R^{m×n} have full column rank r = n, with the economical QR factorization A = Q_1 R_1 from Theorem 17. Considering that

    ‖b − A x‖_2^2 = ‖Q^T (b − A x)‖_2^2
                  = ‖ [ Q_1^T b ; Q_2^T b ] − [ R_1 ; 0 ] x ‖_2^2
                  = ‖Q_1^T b − R_1 x‖_2^2 + ‖Q_2^T b‖_2^2,

then the unique solution of the LSQ problem min_x ‖b − A x‖_2^2 can be computed from the simpler, equivalent problem

    min_x ‖Q_1^T b − R_1 x‖_2^2,

whose solution is

    x∗ = R_1^{−1} Q_1^T b      (2.2.3)

and the corresponding least squares residual is given by

    r∗ = b − A x∗ = (I_m − Q_1 Q_1^T) b = Q_2 Q_2^T b,      (2.2.4)

with the matrix Q_2 that was introduced in Remark 18.

Of course, (2.2.3) is short-hand for solving R_1 x∗ = Q_1^T b, and one point of this reduction is that it is much simpler to solve a triangular system of equations than a full one. Further on we will also see that this approach has better numerical properties, as compared to solving the normal equations introduced in the previous section.
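In NumPy/SciPy the QR-based solution procedure takes only a few lines; the sketch below uses a random test problem (in practice one can also simply call a library routine such as np.linalg.lstsq):

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(4)
    m, n = 30, 5
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    Q1, R1 = np.linalg.qr(A, mode='reduced')   # economical QR: A = Q1 R1
    x = solve_triangular(R1, Q1.T @ b)         # solve R1 x = Q1^T b, cf. (2.2.3)
    r = b - A @ x                              # LSQ residual, r = (I - Q1 Q1^T) b

    # Check against the normal equations characterization A^T r = 0.
    print(np.allclose(A.T @ r, 0), np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))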

Example 20. In Example 11 we saw the normal equations for the NMR problem from Example 1; here we take a look at the economical QR factorization for the same problem:

    A = [ 1.00          1.00          1 ]
        [ 0.80          0.94          1 ]
        [ 0.64          0.88          1 ]
        [  ...           ...         ... ]
        [ 3.2·10^{−5}   4.6·10^{−2}   1 ]
        [ 2.5·10^{−5}   4.4·10^{−2}   1 ]
        [ 2.0·10^{−5}   4.1·10^{−2}   1 ],

    Q_1 = [ 0.597         −0.281    0.172 ]
          [ 0.479         −0.139    0.071 ]
          [ 0.384         −0.029    0.002 ]
          [  ...            ...      ...  ]
          [ 1.89·10^{−5}    0.030    0.224 ]
          [ 1.52·10^{−5}    0.028    0.226 ]
          [ 1.22·10^{−5}    0.026    0.229 ],

    R_1 = [ 1.67   2.40   3.02 ]
          [ 0      1.54   5.16 ],        Q_1^T b = [ 7.81 ]
          [ 0      0      3.78 ]                   [ 4.32 ]
                                                   [ 1.19 ].

We note that the upper triangular matrix R_1 is also the Cholesky factor of the normal equation matrix, i.e., A^T A = R_1^T R_1.


The QR factorization allows us to study the residual vector in more detail. Consider first the case where we augment A with an additional column, corresponding to adding an additional model basis function in the data fitting problem.

Theorem 21. Let the augmented matrix Ā = ( A , a_{n+1} ) have the QR factorization

    Ā = ( Q̄_1   Q̄_2 ) [ R̄_1 ; 0 ],

with Q̄_1 = ( Q_1   q ), Q_1^T q = 0 and Q̄_2^T q = 0. Then the norms of the least squares residual vectors r∗ = (I_m − Q_1 Q_1^T) b and r̄∗ = (I_m − Q̄_1 Q̄_1^T) b are related by

    ‖r∗‖_2^2 = ‖r̄∗‖_2^2 + (q^T b)^2.

Proof. From the relation Q̄_1 Q̄_1^T = Q_1 Q_1^T + q q^T it follows that I_m − Q_1 Q_1^T = I_m − Q̄_1 Q̄_1^T + q q^T, and hence,

    ‖r∗‖_2^2 = ‖(I_m − Q_1 Q_1^T) b‖_2^2 = ‖(I_m − Q̄_1 Q̄_1^T) b + q q^T b‖_2^2
             = ‖(I_m − Q̄_1 Q̄_1^T) b‖_2^2 + ‖q q^T b‖_2^2 = ‖r̄∗‖_2^2 + (q^T b)^2,

where we used that the two components of r∗ are orthogonal and that ‖q q^T b‖_2 = |q^T b| ‖q‖_2 = |q^T b|.

This theorem shows that, when we increase the number of model basis functions for the fit in such a way that the matrix retains full rank, then the least squares residual norm decreases (or stays fixed if b is orthogonal to q).
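This relation is easy to check numerically; a small sketch with a random matrix and a random extra column (the last column of the reduced Q factor of the augmented matrix plays the role of q, up to an irrelevant sign):

    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 25, 4
    A = rng.standard_normal((m, n))
    a_new = rng.standard_normal(m)
    b = rng.standard_normal(m)
    A_aug = np.column_stack([A, a_new])

    r = b - A @ np.linalg.lstsq(A, b, rcond=None)[0]                # residual for A
    r_aug = b - A_aug @ np.linalg.lstsq(A_aug, b, rcond=None)[0]    # residual for (A, a_new)

    q = np.linalg.qr(A_aug, mode='reduced')[0][:, -1]               # new orthonormal direction
    print(np.isclose(np.linalg.norm(r)**2,
                     np.linalg.norm(r_aug)**2 + (q @ b)**2))        # True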

To obtain more insight into the least squares residual we study the influence of the approximation and data errors. According to (1.2.1) we can write the right-hand side as

    b = Γ + e,

where the two vectors

    Γ = ( Γ(t_1), . . . , Γ(t_m) )^T   and   e = ( e_1, . . . , e_m )^T

contain the pure data (the sampled pure-data function) and the data errors, respectively. Hence, the least squares residual vector is

    r∗ = Γ − A x∗ + e,      (2.2.5)

where the vector Γ − A x∗ is the approximation error. From (2.2.5) it follows that the least squares residual vector can be written as

    r∗ = (I_m − Q_1 Q_1^T) Γ + (I_m − Q_1 Q_1^T) e = Q_2 Q_2^T Γ + Q_2 Q_2^T e.


We see that the residual vector consists of two terms. The first term Q_2 Q_2^T Γ is an approximation residual, due to the discrepancy between the n model basis functions (represented by the columns of A) and the pure-data function. The second term is the projected error, i.e., the component of the data errors that lies in the subspace null(A^T). We can summarize the statistical properties of the least squares residual vector as follows.

Theorem 22. The least squares residual vector r∗ = b − A x∗ has the following properties:

    E(r∗) = Q_2 Q_2^T Γ,    Cov(r∗) = Q_2 Q_2^T Cov(e) Q_2 Q_2^T,

    E( ‖r∗‖_2^2 ) = ‖Q_2^T Γ‖_2^2 + E( ‖Q_2^T e‖_2^2 ).

If e is white noise, i.e., Cov(e) = σ^2 I_m, then

    Cov(r∗) = σ^2 Q_2 Q_2^T,    E( ‖r∗‖_2^2 ) = ‖Q_2^T Γ‖_2^2 + (m − n) σ^2.

Proof. It follows immediately that

    E( Q_2 Q_2^T e ) = Q_2 Q_2^T E(e) = 0   and   E( Γ^T Q_2 Q_2^T e ) = 0,

as well as

    Cov(r∗) = Q_2 Q_2^T Cov(Γ + e) Q_2 Q_2^T   and   Cov(Γ + e) = Cov(e).

Moreover,

    E( ‖r∗‖_2^2 ) = E( ‖Q_2 Q_2^T Γ‖_2^2 ) + E( ‖Q_2 Q_2^T e‖_2^2 ) + E( 2 Γ^T Q_2 Q_2^T e ).

It follows that

    Cov( Q_2^T e ) = σ^2 I_{m−n}   and   E( ‖Q_2^T e‖_2^2 ) = trace( Cov(Q_2^T e) ) = (m − n) σ^2.

From the above theorem we see that if the approximation error Γ − A x∗ is somewhat smaller than the data error e then, in the case of white noise, the scaled residual norm s∗ (sometimes referred to as the standard error), defined by

    s∗ = ‖r∗‖_2 / √(m − n),      (2.2.6)

provides an estimate for the standard deviation of the errors in the data. Moreover, provided that the approximation error decreases sufficiently fast when the fitting order n increases, then we should expect that for large enough n the least squares residual norm becomes dominated by the projected error term, i.e.,

    r∗ ≈ Q_2 Q_2^T e   for n sufficiently large.

Hence, if we monitor the scaled residual norm s∗ = s∗(n) as a function of n, then we expect to see that s∗(n) initially decreases, when it is dominated by the approximation error, while at a later stage it levels off, when the projected data error dominates. The transition between the two stages of the behavior of s∗(n) indicates a good choice for the fitting order n.

Example 23. We return to the air pollution example from Example 2. We compute the polynomial fit for n = 1, 2, . . . , 19 and the trigonometric fit for n = 1, 3, 5, . . . , 19 (only odd values of n are used, because we always need a sin-cos pair). Figure 2.2.1 shows the residual norm ‖r∗‖_2 and the scaled residual norm s∗ as functions of n.

The residual norm decreases monotonically with n, while the scaled residual norm shows the expected behavior mentioned above, i.e., a decaying phase (when the approximation error dominates), followed by a more flat or slightly increasing phase when the data errors dominate.
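A monitoring loop of this kind takes only a few lines to implement. The sketch below uses synthetic data (we do not have the air pollution data at hand here); the printed values of s∗(n) first decay and then level off near the assumed noise level:

    import numpy as np

    rng = np.random.default_rng(6)
    m, sigma = 80, 0.1
    t = np.linspace(0, 1, m)
    b = np.exp(np.sin(4 * t)) + sigma * rng.standard_normal(m)   # synthetic data

    for n in range(1, 11):
        A = np.vander(t, n, increasing=True)       # polynomial fit of order n
        x = np.linalg.lstsq(A, b, rcond=None)[0]
        r = b - A @ x
        s = np.linalg.norm(r) / np.sqrt(m - n)     # scaled residual norm, cf. (2.2.6)
        print(n, s)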

The standard error s∗ introduced in (2.2.6) above, defined as the residual norm adjusted by the degrees of freedom in the residual, is just one example of a quantity from statistics that plays a central role in the analysis of LSQ problems. Another quantity arising from statistics is the coefficient of determination R^2, which is used in the context of linear regression analysis (statistical modeling) as a measure of how well a linear model fits the data. Given a model M(x, t) that predicts the observations b_1, b_2, . . . , b_m and the residual vector r = ( b_1 − M(x, t_1), . . . , b_m − M(x, t_m) )^T, the coefficient of determination is defined by

    R^2 = 1 − ‖r‖_2^2 / Σ_{i=1}^{m} (b_i − b̄)^2,      (2.2.7)

where b̄ is the mean of the observations. In general, it is an approximation of the unexplained variance, since the second term compares the variance in the model's errors with the total variance of the data. Yet another useful quantity for analysis is the adjusted coefficient of determination, adj R^2, defined in the same way as the coefficient of determination R^2, but adjusted using the residual degrees of freedom,

    adj R^2 = 1 − (s∗)^2 / ( Σ_{i=1}^{m} (b_i − b̄)^2 / (m − 1) ),      (2.2.8)

making it similar in spirit to the squared standard error (s∗)^2. In Chapter 11 we demonstrate the use of these statistical tools.
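Both quantities are straightforward to compute from a residual vector; a small helper sketch (the function name and argument layout are our own, not from the text):

    import numpy as np

    def determination_coefficients(b, r, n):
        """Return (R2, adjusted R2) for observations b, residual vector r and a
        model with n parameters; cf. (2.2.7) and (2.2.8)."""
        m = len(b)
        tss = np.sum((b - np.mean(b))**2)       # total sum of squares
        R2 = 1.0 - np.sum(r**2) / tss
        s2 = np.sum(r**2) / (m - n)             # squared standard error, (s*)^2
        adjR2 = 1.0 - s2 / (tss / (m - 1))
        return R2, adjR2

    # For instance, with b, r and n from the fitting loop sketched above:
    # R2, adjR2 = determination_coefficients(b, r, n)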


Figure 2.2.1: The residual norm and the scaled residual norm, as functions of the fitting order n, for the polynomial and trigonometric fits to the air pollution data.

    2.3 Permuted QR factorization

The previous section covered in detail full-rank problems, and we saw that the QR factorization was well suited for solving such problems. However, for parameter estimation problems, where the model is given, there is no guarantee that A always has full rank, and therefore we must also consider the rank-deficient case. We first give an overview of some matrix factorizations that are useful for detecting and treating rank-deficient problems, although they are of course also applicable in the full-rank case. The minimum-norm solution from Definition 27 below plays a central role in this discussion.

When A is rank deficient we cannot always compute a QR factorization (2.2.1) that has a convenient economical version, where the range of A is spanned by the first columns of Q. The following example illustrates that a column permutation is needed to achieve such a form.

Example 24. Consider the factorization

    A = [ 0   0 ]  =  [ c  −s ] [ 0   s ]
        [ 0   1 ]     [ s   c ] [ 0   c ],   for any c^2 + s^2 = 1.

This QR factorization has the required form, i.e., the first factor is orthogonal and the second is upper triangular, but range(A) is not spanned by the first column of the orthogonal factor. However, a permutation of the columns of A gives a QR factorization of the desired form,

    A Π = [ 0   0 ]  =  Q R  =  [ 0   1 ] [ 1   0 ]
          [ 1   0 ]             [ 1   0 ] [ 0   0 ],

with a triangular R and such that the range of A is spanned by the first column of Q.

In general, we need a permutation of columns that selects the linearly independent columns of A and places them first. The following theorem formalizes this idea.

Theorem 25. QR factorization with column permutation. If A is real, m × n with rank(A) = r < n ≤ m, then there exists a permutation Π, not necessarily unique, and an orthogonal matrix Q such that

    A Π = Q [ R_11   R_12 ]   }  r rows
            [  0      0   ]   }  m − r rows,      (2.3.1)

where R_11 is r × r upper triangular with positive diagonal elements. The range of A is spanned by the first r columns of Q.

Similar results hold for complex matrices, where Q now is unitary. The first r columns of the matrix A Π are guaranteed to be linearly independent. For a model with basis functions that are not linearly dependent over the abscissas, this provides a method for choosing r linearly independent functions. The rank-deficient least squares problem can now be solved as follows.

Theorem 26. Let A be a rank-deficient m × n matrix with the pivoted QR factorization in Theorem 25. Then the LSQ problem (2.1.1) takes the form

    min_x ‖Q^T A Π Π^T x − Q^T b‖_2^2 = min_y ‖ [ R_11  R_12 ; 0  0 ] [ y_1 ; y_2 ] − [ d_1 ; d_2 ] ‖_2^2
                                      = min_y ( ‖R_11 y_1 + R_12 y_2 − d_1‖_2^2 + ‖d_2‖_2^2 ),

where we have introduced

    Q^T b = [ d_1 ; d_2 ]   and   y = Π^T x = [ y_1 ; y_2 ].

The general solution is

    x∗ = Π [ R_11^{−1} (d_1 − R_12 y_2) ;  y_2 ],   y_2 arbitrary,      (2.3.2)

and any choice of y_2 leads to a least squares solution with residual norm ‖r∗‖_2 = ‖d_2‖_2.


Definition 27. Given the LSQ problem with a rank-deficient matrix A and the general solution given by (2.3.2), we define x∗ as the solution of minimal 2-norm that satisfies

    x∗ = argmin_x ‖x‖_2   subject to   ‖b − A x‖_2 = min.

The choice y_2 = 0 in (2.3.2) is an important special case that leads to the so-called basic solution,

    x_B = Π [ R_11^{−1} Q_1^T b ;  0 ],

with at least n − r zero components. This corresponds to using only the first r columns of A Π in the solution, while setting the remaining elements to zero. As already mentioned, this is an important choice in data fitting as well as in other applications, because it implies that b is represented by the smallest subset of r columns of A, i.e., it is fitted with as few variables as possible. It is also related to the new field of compressed sensing [5, 39, 251].
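With SciPy's column-pivoted QR the basic solution is easy to compute; a sketch for a small rank-deficient test matrix (the rank tolerance 10^{-12} is an arbitrary choice made for this illustration):

    import numpy as np
    from scipy.linalg import qr, solve_triangular

    rng = np.random.default_rng(7)
    m, n, r = 20, 3, 2
    A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-2 test matrix
    b = rng.standard_normal(m)

    Q, R, perm = qr(A, pivoting=True)                               # A[:, perm] = Q R
    rank = np.sum(np.abs(np.diag(R)) > 1e-12 * np.abs(R[0, 0]))     # numerical rank

    # Basic solution: y_2 = 0 and R_11 y_1 = d_1, then undo the column permutation.
    d1 = (Q.T @ b)[:rank]
    y = np.zeros(n)
    y[:rank] = solve_triangular(R[:rank, :rank], d1)
    x_basic = np.zeros(n)
    x_basic[perm] = y

    # Same (minimal) residual norm as any other LSQ solution, e.g., from lstsq.
    print(np.linalg.norm(b - A @ x_basic))
    print(np.linalg.norm(b - A @ np.linalg.lstsq(A, b, rcond=None)[0]))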

Example 28. Linear prediction. We consider a digital signal, i.e., a vector s ∈ R^N, and we seek a relation between neighboring elements of the form

    s_i = Σ_{j=1}^{p} x_j s_{i−j},   i = p + 1, . . . , N,      (2.3.3)

for some (small) value of p. The technique of estimating the ith element from a number of previous elements is called linear prediction (LP), and the LP coefficients x_j can be used to characterize various underlying properties of the signal. Throughout this book we will use a test problem where the elements of the noise-free signal are given by

    s_i = α_1 sin(ω_1 t_i) + α_2 sin(ω_2 t_i) + · · · + α_p sin(ω_p t_i),   i = 1, 2, . . . , N.

In this particular example, we use N = 32, p = 2, ω_1 = 2, ω_2 = 1 and no noise.

There are many ways to estimate the LP coefficients in (2.3.3).
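One of them is to fit (2.3.3) directly in the least squares sense; the sketch below sets up the test signal and the corresponding overdetermined system (the amplitudes α_1 = α_2 = 1 and the abscissas t_i = i are assumptions, since they are not specified above):

    import numpy as np

    # Noise-free test signal (alpha_1 = alpha_2 = 1 and t_i = i are assumptions).
    N, p = 32, 2
    omega = [2.0, 1.0]
    t = np.arange(1, N + 1, dtype=float)
    s = np.sin(omega[0] * t) + np.sin(omega[1] * t)

    # Each row of the LSQ problem predicts s_i from the p previous samples, cf. (2.3.3).
    A = np.column_stack([s[p - j - 1 : N - j - 1] for j in range(p)])   # columns: s_{i-1}, s_{i-2}, ...
    rhs = s[p:]
    x_lp = np.linalg.lstsq(A, rhs, rcond=None)[0]
    print(x_lp)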

