LEAST SQUARES METHODS

PARAMETER ESTIMATION

The fundamental concept of parameter estimation is to determine optimal values of parameters for a numerical model that predicts dependent variable outputs of a function, process or phenomenon based on observations of independent variable inputs. For a given data observation, independent variable inputs are grouped into an input vector of length ninp and dependent variable outputs are grouped into an output vector of length nout. Corresponding input and output vectors for a given data observation are called a training pair. In general, training pairs from nobs number of data observations are pooled to estimate npar number of model parameters.

Table of Contents

1.0 Least Squares Methods
  1.1 Linear Models
    1.1.1 Straight Line
    1.1.2 Single Dependent Variable
      1.1.2.1 Polynomial
        1.1.2.1.1 Quadratic
        1.1.2.1.2 Cubic
        1.1.2.1.3 Quartic
        1.1.2.1.4 Choice Of Polynomial Order
      1.1.2.2 Harmonic
      1.1.2.3 Plane
      1.1.2.4 Bicubic
    1.1.3 Multiple Dependent Variables
    1.1.4 Weighting
    1.1.5 Scaling
  1.2 Linearized Models
    1.2.1 Newtonian Cooling
    1.2.2 Logarithmic Spiral
    1.2.3 Circle
    1.2.4 Sphere
  1.3 Example Data
    1.3.1 Anscombe's Data Set
2.0 Eigenvector Methods
  2.1 Line
  2.2 Plane
  2.3 Ellipse
  2.4 Ellipsoid
  2.5 2D Quadric
  2.6 3D Quadric
3.0 Levenberg-Marquardt
4.0 Simplex
5.0 Gradient Search
6.0 Neural Network
7.0 Genetic Algorithms
8.0 Simulated Annealing
9.0 Kalman Filter


1.0 LEAST SQUARES METHODS

The least squares optimality criterion minimizes the sum of squares of residuals between actual observed outputs and output values of the numerical model that are predicted from input observations.

1.1 LINEAR MODELS

Many processes exhibit true linear behavior. Many others operate over such small excursions of input variable values that the output behavior appears linear.

1.1.1 STRAIGHT LINE

The classic least squares problem is to fit a straight line model with a single input x and a single output y using parameters b and m as shown in Equation 1. Unfortunately, individual data observations xi and yi may not fit the model perfectly due to experimental measurement error, process variation or insufficient model complexity as shown in Equation 2. Multiple data observations may be concatenated as shown in Equation 3 and represented in matrix form per Equation 4. Even for optimal estimates of model parameters {β}, each data observation will have some residual error ei between the observed output yi and the predicted model output as shown in Equations 5 and 6.

$$ y = b + m x = \begin{bmatrix} 1 & x \end{bmatrix} \begin{Bmatrix} b \\ m \end{Bmatrix} \quad \text{for } ninp=1,\ nout=1,\ npar=2 \qquad \text{Eq. 1} $$

$$ y_i \approx \begin{bmatrix} 1 & x_i \end{bmatrix} \begin{Bmatrix} b \\ m \end{Bmatrix} \qquad \text{Eq. 2} $$

$$ \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{nobs} \end{Bmatrix} \approx \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{nobs} \end{bmatrix} \begin{Bmatrix} b \\ m \end{Bmatrix} \qquad \text{Eq. 3} $$

$$ \{Y\} \approx [X]\{\beta\} \quad \text{for}\quad \{Y\}_{nobs \times 1} = \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{nobs} \end{Bmatrix} \quad [X]_{nobs \times 2} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{nobs} \end{bmatrix} \quad \{\beta\}_{2 \times 1} = \begin{Bmatrix} b \\ m \end{Bmatrix} \qquad \text{Eq. 4} $$


$$ \begin{Bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{nobs} \end{Bmatrix} = \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{nobs} \end{Bmatrix} - \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{nobs} \end{bmatrix} \begin{Bmatrix} b \\ m \end{Bmatrix} \qquad \text{Eq. 5} $$

$$ \{e\} = \{Y\} - [X]\{\beta\} \quad \text{for}\quad \{e\}_{nobs \times 1} = \begin{Bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{nobs} \end{Bmatrix} \qquad \text{Eq. 6} $$

The scalar sum of squares SSQ of residual errors is shown in Equation 7. To minimize the sum of squares, one may set the partial derivative of the sum of squares with respect to the model parameters {β} equal to zero as shown in Equation 8. Rearranging these terms provides a linear matrix solution for optimal model parameters per Equation 9a. The coefficient of determination R2 is provided in Equation 9b.

$$ SSQ = \{e\}^T \{e\} = \{Y\}^T \{Y\} - 2\{\beta\}^T [X]^T \{Y\} + \{\beta\}^T [X]^T [X] \{\beta\} \qquad \text{Eq. 7} $$

$$ \partial SSQ / \partial \{\beta\} = -2[X]^T \{Y\} + 2[X]^T [X] \{\beta\} = \{0\} \qquad \text{Eq. 8} $$

$$ \{\beta\}_{2 \times 1} = \left( [X]^T_{2 \times nobs}\, [X]_{nobs \times 2} \right)^{-1} [X]^T_{2 \times nobs}\, \{Y\}_{nobs \times 1} \qquad \text{Eq. 9a} $$

$$ R^2 = \frac{\{\beta\}^T [X]^T \{Y\} - nobs\, y_o^2}{\{Y\}^T \{Y\} - nobs\, y_o^2} \quad \text{for } y_o = \frac{1}{nobs} \sum_{i=1}^{nobs} y_i \qquad \text{Eq. 9b} $$

Expanding this matrix solution as shown in Equation 10 provides valuable insights into the nature of least squares estimates. If the input observations xi and the output observations yi are both mean centered as shown in Equation 11, then the offset term b is zero and the slope m is equal to the covariance of xi and yi divided by the variance of xi as shown in Equation 12. Consequently, mean centered data preclude the need to compute an offset term for many linear models.

$$ \begin{Bmatrix} b \\ m \end{Bmatrix} = \begin{bmatrix} nobs & \sum_{i=1}^{nobs} x_i \\ \sum_{i=1}^{nobs} x_i & \sum_{i=1}^{nobs} x_i^2 \end{bmatrix}^{-1} \begin{Bmatrix} \sum_{i=1}^{nobs} y_i \\ \sum_{i=1}^{nobs} x_i y_i \end{Bmatrix} \qquad \text{Eq. 10} $$


$$ \begin{Bmatrix} b \\ m \end{Bmatrix} = \begin{bmatrix} nobs & 0 \\ 0 & \sum_{i=1}^{nobs} x_i^2 \end{bmatrix}^{-1} \begin{Bmatrix} 0 \\ \sum_{i=1}^{nobs} x_i y_i \end{Bmatrix} \quad \text{for } x_o = \frac{1}{nobs}\sum_{i=1}^{nobs} x_i = 0 \text{ and } y_o = \frac{1}{nobs}\sum_{i=1}^{nobs} y_i = 0 \qquad \text{Eq. 11} $$

$$ \begin{Bmatrix} b \\ m \end{Bmatrix} = \begin{Bmatrix} 0 \\ \sum_{i=1}^{nobs} x_i y_i \Big/ \sum_{i=1}^{nobs} x_i^2 \end{Bmatrix} \quad \text{for } x_o = 0,\ y_o = 0 \qquad \text{Eq. 12} $$
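As a concrete sketch of Equations 9a through 12, the straight-line fit below builds the [1 x] design matrix and solves the normal equations with NumPy (an assumed dependency; the function name fit_line is illustrative, not from the source):

```python
import numpy as np

def fit_line(x, y):
    # Solve {beta} = ([X]^T [X])^-1 [X]^T {Y} (Eq. 9a) for y = b + m x.
    X = np.column_stack([np.ones_like(x), x])   # rows [1, x_i]
    return np.linalg.solve(X.T @ X, X.T @ y)    # [b, m]

# Noise-free data on y = 2 + 3x recovers b and m exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x
b, m = fit_line(x, y)
```

For mean-centered data the same solve reduces to m = Σxiyi / Σxi² with b = 0, as Equation 12 shows.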


1.1.2 SINGLE DEPENDENT VARIABLE

The methodology developed above may be generalized to polynomials or other linear models requiring npar ≥ 2 linear parameters as shown in Equation 13. Note that the number of data observations nobs must be greater than or equal to the number of estimated parameters npar to prevent singularity of the symmetric matrix ([X]T[X]).

$$ \{\beta\}_{npar \times 1} = \left( [X]^T_{npar \times nobs}\, [X]_{nobs \times npar} \right)^{-1} [X]^T_{npar \times nobs}\, \{Y\}_{nobs \times 1} \quad \text{for } nout = 1 \qquad \text{Eq. 13} $$

1.1.2.1 POLYNOMIAL

1.1.2.1.1 QUADRATIC

A quadratic model is presented in Equations 14 and 15.

$$ y = a_0 + a_1 x + a_2 x^2 = \begin{bmatrix} 1 & x & x^2 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} \quad \text{for } ninp=1,\ nout=1,\ npar=3 \qquad \text{Eq. 14} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} y_1 \\ \vdots \\ y_{nobs} \end{Bmatrix} \quad [X]_{nobs \times 3} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{nobs} & x_{nobs}^2 \end{bmatrix} \quad \{\beta\}_{3 \times 1} = \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} \qquad \text{Eq. 15} $$
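The same normal-equations machinery extends to any polynomial order by widening the design matrix; a minimal quadratic sketch, assuming NumPy (the name fit_quadratic is illustrative):

```python
import numpy as np

def fit_quadratic(x, y):
    # Design matrix rows [1, x_i, x_i^2] per Eq. 15.
    X = np.column_stack([np.ones_like(x), x, x**2])
    return np.linalg.lstsq(X, y, rcond=None)[0]  # [a0, a1, a2]

x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x**2
a = fit_quadratic(x, y)
```

Using lstsq instead of an explicit inverse is numerically safer when the columns are nearly collinear.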

1.1.2.1.2 CUBIC

A cubic model is presented in Equations 16 and 17.


$$ y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 = \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{Bmatrix} \quad \text{for } ninp=1,\ nout=1,\ npar=4 \qquad \text{Eq. 16} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} y_1 \\ \vdots \\ y_{nobs} \end{Bmatrix} \quad [X]_{nobs \times 4} = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{nobs} & x_{nobs}^2 & x_{nobs}^3 \end{bmatrix} \quad \{\beta\}_{4 \times 1} = \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{Bmatrix} \qquad \text{Eq. 17} $$

1.1.2.1.3 QUARTIC

A quartic model (npar=5) is presented in Equations 18 and 19.

$$ y = \begin{bmatrix} 1 & x & x^2 & x^3 & x^4 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{Bmatrix} \quad \text{for } ninp=1,\ nout=1,\ npar=5 \qquad \text{Eq. 18} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} y_1 \\ \vdots \\ y_{nobs} \end{Bmatrix} \quad [X]_{nobs \times 5} = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & x_1^4 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{nobs} & x_{nobs}^2 & x_{nobs}^3 & x_{nobs}^4 \end{bmatrix} \quad \{\beta\}_{5 \times 1} = \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{Bmatrix} \qquad \text{Eq. 19} $$

1.1.2.1.4 CHOICE OF POLYNOMIAL ORDER

Engine speed [rpm]   Engine torque [N·m]
 800                 500
1000                 547
1200                 636
1400                 679
1600                 719
1800                 724
2000                 712
2200                 671
2400                 606
2600                 575

1.1.2.2 HARMONIC

A linear harmonic model is presented in Equations 17 and 18.

$$ y = \begin{bmatrix} 1 & \cos\theta & \sin\theta & \cos 2\theta & \sin 2\theta \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ b_1 \\ a_2 \\ b_2 \end{Bmatrix} \quad \text{for } ninp=1,\ nout=1,\ npar=5 \qquad \text{Eq. 17} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} y_1 \\ \vdots \\ y_{nobs} \end{Bmatrix} \quad [X]_{nobs \times 5} = \begin{bmatrix} 1 & \cos\theta_1 & \sin\theta_1 & \cos 2\theta_1 & \sin 2\theta_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & \cos\theta_{nobs} & \sin\theta_{nobs} & \cos 2\theta_{nobs} & \sin 2\theta_{nobs} \end{bmatrix} \quad \{\beta\}_{5 \times 1} = \begin{Bmatrix} a_0 \\ a_1 \\ b_1 \\ a_2 \\ b_2 \end{Bmatrix} \qquad \text{Eq. 18} $$
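Although trigonometric in θ, the harmonic model is linear in its coefficients, so the same least squares solve applies; a sketch assuming NumPy (the function name is illustrative):

```python
import numpy as np

def fit_harmonic(theta, y):
    # Rows [1, cos(t), sin(t), cos(2t), sin(2t)] for the two-harmonic model.
    X = np.column_stack([np.ones_like(theta),
                         np.cos(theta), np.sin(theta),
                         np.cos(2 * theta), np.sin(2 * theta)])
    return np.linalg.lstsq(X, y, rcond=None)[0]  # [a0, a1, b1, a2, b2]

theta = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
y = 3.0 + 1.5 * np.cos(theta) - 0.5 * np.sin(2 * theta)
coef = fit_harmonic(theta, y)
```

With evenly spaced angles the design-matrix columns are mutually orthogonal, so the coefficients are recovered exactly.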

1.1.2.3 PLANE

A linear planar model is presented in Equations 19 and 20.

$$ z = a + b x + c y = \begin{bmatrix} 1 & x & y \end{bmatrix} \begin{Bmatrix} a \\ b \\ c \end{Bmatrix} \quad \text{for } ninp=2,\ nout=1,\ npar=3 \qquad \text{Eq. 19} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} z_1 \\ \vdots \\ z_{nobs} \end{Bmatrix} \quad [X]_{nobs \times 3} = \begin{bmatrix} 1 & x_1 & y_1 \\ \vdots & \vdots & \vdots \\ 1 & x_{nobs} & y_{nobs} \end{bmatrix} \quad \{\beta\}_{3 \times 1} = \begin{Bmatrix} a \\ b \\ c \end{Bmatrix} \qquad \text{Eq. 20} $$

1.1.2.4 BICUBIC

A linear bicubic model is presented in Equations 21 and 22.


$$ z = \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix} \begin{bmatrix} b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \\ b_4 & b_8 & b_{12} & b_{16} \end{bmatrix} \begin{Bmatrix} 1 \\ y \\ y^2 \\ y^3 \end{Bmatrix} \quad \text{for } ninp=2,\ nout=1,\ npar=16 \qquad \text{Eq. 21} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{nobs} \end{Bmatrix} \qquad \text{Eq. 22a} $$

Each row i of [X] contains the sixteen products of the powers of xi and yi in the order matching {β}:

$$ [X]_{nobs \times 16},\ \text{row } i = \begin{bmatrix} 1 & x_i & x_i^2 & x_i^3 & y_i & x_i y_i & x_i^2 y_i & x_i^3 y_i & y_i^2 & x_i y_i^2 & x_i^2 y_i^2 & x_i^3 y_i^2 & y_i^3 & x_i y_i^3 & x_i^2 y_i^3 & x_i^3 y_i^3 \end{bmatrix} \qquad \text{Eq. 22b} $$

$$ \{\beta\}_{16 \times 1} = \begin{Bmatrix} b_1 & b_2 & b_3 & b_4 & b_5 & b_6 & b_7 & b_8 & b_9 & b_{10} & b_{11} & b_{12} & b_{13} & b_{14} & b_{15} & b_{16} \end{Bmatrix}^T \qquad \text{Eq. 22c} $$

1.1.3 MULTIPLE DEPENDENT VARIABLES


Multiple simultaneous linear outputs for each input data observation may be modeled using this methodology as shown in Equation 23a. Note that the number of data observations nobs must be greater than or equal to the number of parameters npar (not npar times nout) to prevent singularity of the symmetric matrix ([X]T[X]). The coefficient of determination R2 is provided in Equation 23b.

$$ [\beta]_{npar \times nout} = \left( [X]^T_{npar \times nobs}\, [X]_{nobs \times npar} \right)^{-1} [X]^T_{npar \times nobs}\, [Y]_{nobs \times nout} \qquad \text{Eq. 23a} $$

$$ R^2_{nout \times nout} = \frac{[\beta]^T [X]^T [Y] - nobs\, [Y_o]^T [Y_o]}{[Y]^T [Y] - nobs\, [Y_o]^T [Y_o]} \quad \text{for } [Y_o]_{1 \times nout} = \frac{1}{nobs} \sum_{i=1}^{nobs} [Y]_i \qquad \text{Eq. 23b} $$

(The matrix division in Equation 23b leaves open whether the denominator inverse pre- or post-multiplies; in practice the diagonal elements, one R2 per output, are the quantities of interest.)

A classic example of multiple linear outputs is to compute the affine alignment (translation, rotation, size, shear) between two images based on pixel locations of homologues as shown in Equations 23c and 23d, where (xi, yi) is the pixel location of a homologue in the target image and (ui, vi) is the pixel location of the matching homologue in the image to be aligned with the target.

$$ \begin{bmatrix} x & y \end{bmatrix} = \begin{bmatrix} 1 & u & v \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix} \qquad \text{Eq. 23c} $$

$$ [Y]_{nobs \times 2} = \begin{bmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_{nobs} & y_{nobs} \end{bmatrix} \quad [X]_{nobs \times 3} = \begin{bmatrix} 1 & u_1 & v_1 \\ \vdots & \vdots & \vdots \\ 1 & u_{nobs} & v_{nobs} \end{bmatrix} \quad [\beta]_{3 \times 2} = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix} \quad npar = 3 \qquad \text{Eq. 23d} $$
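A sketch of the affine-alignment fit, solving Equation 23a with a two-column output matrix (NumPy assumed; the synthetic homologues below are illustrative, not from the source):

```python
import numpy as np

def fit_affine(uv, xy):
    # Fit [x y] = [1 u v] [B] with [B] 3 x 2 (Eqs. 23c-23d).
    X = np.column_stack([np.ones(len(uv)), uv])   # nobs x 3
    B, *_ = np.linalg.lstsq(X, xy, rcond=None)    # 3 x 2
    return B

# Synthetic homologues: rotate 90 degrees and translate by (5, -2),
# i.e. x = 5 - v and y = -2 + u.
uv = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
xy = np.column_stack([5.0 - uv[:, 1], -2.0 + uv[:, 0]])
B = fit_affine(uv, xy)
```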

1.1.4 WEIGHTING

A weighting factor wi may be assigned to each training pair to emphasize relative importance among the data observations. Residuals {e} may be accentuated or attenuated by pre-multiplying with a diagonal weighting matrix [W] to form weighted residuals {ε} as shown in Equation 23a. The least squares solution for this new set of weighted residuals is shown in Equation 23b.


$$ \{\varepsilon\} = \begin{Bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{nobs} \end{Bmatrix} = \begin{bmatrix} w_1 & 0 & 0 & 0 \\ 0 & w_2 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & w_{nobs} \end{bmatrix} \begin{Bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{nobs} \end{Bmatrix} = [W]\{e\} \qquad \text{Eq. 23a} $$

$$ \{\beta\} = \left( [X]^T [W]^2 [X] \right)^{-1} \left( [X]^T [W]^2 \{Y\} \right) \qquad \text{Eq. 23b} $$

Two concepts are typically used for weights. The simplest form is to use a value of 1 if that data observation is present in the current data set while a value of 0 is used if that observation is missing as shown in Equation 23c. This technique is often used for scientific devices that employ the same data collection protocols for each data set (same nobs each time), but in which certain observations may be missing or spurious (e.g. photogrammetry with landmarks that are occasionally not visible).

$$ w_i^2 = 1 \text{ for valid data}, \quad w_i^2 = 0 \text{ for missing data} \qquad \text{Eq. 23c} $$

The second approach is to set the square of each weight equal to the inverse of the expected variance for that observation. The expected variance may be the variance of the respective dependent variable output, the variance of the independent variable input or a pooled value representative of both.

$$ w_i^2 = 1 / \sigma_i^2 \quad \text{for } \sigma_i^2 = \text{variance of observation } i \qquad \text{Eq. 23d} $$
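A weighted fit per Equation 23b; setting a weight to zero simply drops that observation, per Equation 23c (NumPy assumed, names illustrative):

```python
import numpy as np

def fit_line_weighted(x, y, w):
    # {beta} = ([X]^T [W]^2 [X])^-1 ([X]^T [W]^2 {Y})  (Eq. 23b)
    X = np.column_stack([np.ones_like(x), x])
    W2 = np.diag(w**2)
    return np.linalg.solve(X.T @ W2 @ X, X.T @ W2 @ y)

# The zero weight excludes the spurious last observation entirely (Eq. 23c).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0, 99.0])   # last point is an outlier
w = np.array([1.0, 1.0, 1.0, 0.0])
b, m = fit_line_weighted(x, y, w)
```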

1.1.5 SCALING

Inputs may be conditioned in two steps before fitting: mean center each input, xi* = xi − xo, then scale the data so that the largest |xi*| maps to ±1. This improves the numerical conditioning of the normal equations.

1.2 LINEARIZED MODELS

It is often tempting to linearize models of nonlinear phenomena in an effort to arrive at estimates for parameters. One should exercise caution however in that linear least squares solutions for linearized models minimize the sum of squares of residuals for nonlinear forms of the dependent variables rather than residuals of those dependent variables themselves.

1.2.1 NEWTONIAN COOLING

A classic example is to model the temperature T of an object as a function of time τ during exponential Newtonian cooling from an initial temperature T0 toward ambient temperature T∞ as shown in Equation 24. Taking the logarithm after a simple algebraic rearrangement provides the linearized model in Equation 25. If the ambient temperature T∞ is known, measuring temperatures Ti of the object at times τi allows direct estimation of the Newtonian cooling rate b and indirect estimation of the initial temperature difference above ambient with the linearized matrix model in Equation 26. Specifically, this approach minimizes residuals of the nonlinear function ln(T − T∞) rather than residuals of the dependent variable temperature T. Note that simultaneous estimation of the ambient temperature T∞ requires an iterative algorithm.

$$ T = T_\infty + (T_0 - T_\infty)\, e^{-b\tau} \qquad \text{Eq. 24} $$

$$ \ln(T - T_\infty) = \ln(T_0 - T_\infty) - b\tau = \begin{bmatrix} 1 & \tau \end{bmatrix} \begin{Bmatrix} \ln(T_0 - T_\infty) \\ -b \end{Bmatrix} \qquad \text{Eq. 25} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} \ln(T_1 - T_\infty) \\ \vdots \\ \ln(T_{nobs} - T_\infty) \end{Bmatrix} \quad [X]_{nobs \times 2} = \begin{bmatrix} 1 & \tau_1 \\ \vdots & \vdots \\ 1 & \tau_{nobs} \end{bmatrix} \quad \{\beta\}_{2 \times 1} = \begin{Bmatrix} \ln(T_0 - T_\infty) \\ -b \end{Bmatrix} \qquad \text{Eq. 26} $$
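A sketch of the linearized cooling fit (NumPy assumed; fit_cooling is an illustrative name). Note that, per the caution above, it minimizes residuals of ln(T − T∞), not of T:

```python
import numpy as np

def fit_cooling(tau, T, T_inf):
    # Fit ln(T - T_inf) = ln(T0 - T_inf) - b*tau  (Eqs. 25-26).
    Y = np.log(T - T_inf)
    X = np.column_stack([np.ones_like(tau), tau])
    c, slope = np.linalg.lstsq(X, Y, rcond=None)[0]
    return T_inf + np.exp(c), -slope   # indirect T0 estimate, cooling rate b

tau = np.linspace(0.0, 5.0, 20)
T = 20.0 + 80.0 * np.exp(-0.7 * tau)   # T_inf = 20, T0 = 100, b = 0.7
T0, b = fit_cooling(tau, T, T_inf=20.0)
```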

1.2.2 LOGARITHMIC SPIRAL

The logarithmic spiral often appears in nature, such as the cross-section of a nautilus shell, atmospheric low pressure spirals or the arms of galaxies, as shown in Equations 27. Given the center location xc and yc, the x and y data points on the spiral may be linearized as shown in Equations 28, providing the linearized matrix model of Equations 29. Again, this approach minimizes residuals of the nonlinear function ln(r) rather than residuals of the independent variables x and y. Simultaneous estimation of the center location xc and yc requires iterative techniques. Note that there are no actual dependent variables for closed form geometric models estimated from coordinate data.

$$ r = a\, e^{b\theta} \qquad x = x_c + r\cos\theta \qquad y = y_c + r\sin\theta \qquad \text{Eq. 27} $$

$$ r = \sqrt{(x - x_c)^2 + (y - y_c)^2} \qquad \tan\theta = \frac{y - y_c}{x - x_c} \qquad \ln(r) = \ln(a) + b\theta = \begin{bmatrix} 1 & \theta \end{bmatrix} \begin{Bmatrix} \ln(a) \\ b \end{Bmatrix} \qquad \text{Eq. 28} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} \ln(r_1) \\ \vdots \\ \ln(r_{nobs}) \end{Bmatrix} \quad [X]_{nobs \times 2} = \begin{bmatrix} 1 & \theta_1 \\ \vdots & \vdots \\ 1 & \theta_{nobs} \end{bmatrix} \quad \{\beta\}_{2 \times 1} = \begin{Bmatrix} \ln(a) \\ b \end{Bmatrix} \qquad \text{Eq. 29} $$

1.2.3 CIRCLE


Another example is to estimate the coordinates a and b for the center of a circle of unknown radius r by measuring the coordinates of points x and y on the circle as shown in the familiar nonlinear model in Equation 30. Expanding and rearranging provides the linearized model shown in Equation 31. Measuring coordinates xi and yi for points on the circle allows direct estimation of the center coordinates a and b and indirect estimation of the radius with the linearized model in Equations 32. This approach minimizes residuals of the nonlinear function (x2 + y2) rather than residuals for the independent variables x and y.

$$ (x - a)^2 + (y - b)^2 = r^2 \qquad \text{Eq. 30} $$

$$ (x^2 + y^2) = \begin{bmatrix} 1 & x & y \end{bmatrix} \begin{Bmatrix} r^2 - a^2 - b^2 \\ 2a \\ 2b \end{Bmatrix} \qquad \text{Eq. 31} $$

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} x_1^2 + y_1^2 \\ \vdots \\ x_{nobs}^2 + y_{nobs}^2 \end{Bmatrix} \quad [X]_{nobs \times 3} = \begin{bmatrix} 1 & x_1 & y_1 \\ \vdots & \vdots & \vdots \\ 1 & x_{nobs} & y_{nobs} \end{bmatrix} \quad \{\beta\}_{3 \times 1} = \begin{Bmatrix} r^2 - a^2 - b^2 \\ 2a \\ 2b \end{Bmatrix} \qquad \text{Eq. 32} $$
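A sketch of the linearized circle fit of Equations 31 and 32 (NumPy assumed, names illustrative):

```python
import numpy as np

def fit_circle(x, y):
    # (x^2 + y^2) = [1 x y] {r^2 - a^2 - b^2, 2a, 2b}  (Eq. 31)
    Y = x**2 + y**2
    X = np.column_stack([np.ones_like(x), x, y])
    c0, c1, c2 = np.linalg.lstsq(X, Y, rcond=None)[0]
    a, b = c1 / 2.0, c2 / 2.0
    r = np.sqrt(c0 + a**2 + b**2)   # indirect radius estimate
    return a, b, r

theta = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
x = 3.0 + 2.0 * np.cos(theta)       # circle centered at (3, -1), radius 2
y = -1.0 + 2.0 * np.sin(theta)
a, b, r = fit_circle(x, y)
```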

If radius r is known a priori, the problem may be reformulated to find the best fit for center coordinates a and b as shown in Equations 33. The solution for the term (-a2-b2) may not be completely consistent with the solution for terms (2a) and (2b) if the data does not exactly model the given radius r. Again, note that there are no actual dependent variables for closed form geometric models estimated from coordinate data.

$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} x_1^2 + y_1^2 - r^2 \\ \vdots \\ x_{nobs}^2 + y_{nobs}^2 - r^2 \end{Bmatrix} \quad [X]_{nobs \times 3} = \begin{bmatrix} 1 & x_1 & y_1 \\ \vdots & \vdots & \vdots \\ 1 & x_{nobs} & y_{nobs} \end{bmatrix} \quad \{\beta\}_{3 \times 1} = \begin{Bmatrix} -a^2 - b^2 \\ 2a \\ 2b \end{Bmatrix} \qquad \text{Eq. 33} $$

1.2.4 SPHERE

Following the circle example above, one can estimate the coordinates a, b and c for the center of a sphere of unknown radius r as shown in Equations 34.


$$ \{Y\}_{nobs \times 1} = \begin{Bmatrix} x_1^2 + y_1^2 + z_1^2 \\ \vdots \\ x_{nobs}^2 + y_{nobs}^2 + z_{nobs}^2 \end{Bmatrix} \quad [X]_{nobs \times 4} = \begin{bmatrix} 1 & x_1 & y_1 & z_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{nobs} & y_{nobs} & z_{nobs} \end{bmatrix} \quad \{\beta\}_{4 \times 1} = \begin{Bmatrix} r^2 - a^2 - b^2 - c^2 \\ 2a \\ 2b \\ 2c \end{Bmatrix} \qquad \text{Eq. 34} $$


1.3 EXAMPLE DATA

1.3.1 ANSCOMBE'S DATA SET

Ref: Anscombe, F.J. (1973) Graphs in statistical analysis. American Statistician 27:17-21.

Anscombe's data sets:

x1=[10.00  8.00 13.00  9.00 11.00 14.00  6.00  4.00 12.00  7.00  5.00]';
y1=[ 8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26 10.84  4.82  5.68]';
x2=x1;
y2=[ 9.14  8.14  8.74  8.77  9.26  8.10  6.13  3.10  9.13  7.26  4.74]';
x3=x1;
y3=[ 7.46  6.77 12.74  7.11  7.81  8.84  6.08  5.39  8.15  6.42  5.73]';
x4=[ 8.00  8.00  8.00  8.00  8.00  8.00  8.00 19.00  8.00  8.00  8.00]';
y4=[ 6.58  5.76  7.71  8.84  8.47  7.04  5.25 12.50  5.56  7.91  6.89]';
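The notorious property of these four sets is that a straight-line fit returns nearly identical parameters (b ≈ 3.00, m ≈ 0.500) for all of them despite wildly different structure; a quick check with NumPy (an assumed dependency):

```python
import numpy as np

x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

def line_fit(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]   # [b, m]

fits = [line_fit(x, y) for x, y in [(x1, y1), (x1, y2), (x1, y3), (x4, y4)]]
```

Plotting the four sets, not merely fitting them, is the point of Anscombe's paper.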


2.0 EIGENVECTOR METHODS

The linear methods described above minimize the sum of squares of residuals between observed outputs and output values predicted from multiple input observations. However for some numerical models, the concepts of independent inputs and dependent outputs are not well defined. Fitting geometric models to k dimensional point coordinate data is an example. The coordinates for each point are typically grouped into an input vector of length ninp=k and there are no dependent variable outputs. In general, input vectors from nobs number of data observations are pooled to estimate npar number of model parameters.

Linear least squares methods minimize the sum of squares of residuals parallel to the dependent variable. For eigenvector methods, the residuals are instead nominally orthogonal to the model, which is why this approach is also known as orthogonal distance fitting.

2.1 LINE

The parametric equation for a line in k dimensional space describes the location of any point {x} on the line in terms of a given point {p} on the line plus a directed scalar distance s measured from {p} in the unit direction {û} along the line as shown in Equation 32. However, individual data observations {xi} for points representing the line may not fit the model perfectly due to experimental measurement error, process variation or insufficient model complexity as shown in Equation 33.

$$ \{x\}_{k \times 1} = \{p\}_{k \times 1} + s\, \{\hat{u}\}_{k \times 1} \qquad \text{Eq. 32} $$

$$ \{x_i\} \approx \{p\} + s\, \{\hat{u}\} \qquad \text{Eq. 33} $$

The optimal least-squares estimate for the given point {p} on the line will be the centroid {xo} of the nobs data observations {xi} as shown in Equation 34. The symmetric centroidal point distribution matrix [S] may then be formed as shown in Equation 35. Its diagonal eigenvalue matrix [D] and orthogonal eigenvector matrix [V] shown in Equation 36 provide least-squares solutions.

$$ \{p\} = \{x_o\} = \frac{1}{nobs} \sum_{i=1}^{nobs} \{x_i\} \qquad \text{Eq. 34} $$

$$ [S]_{k \times k} = \frac{1}{nobs} \sum_{i=1}^{nobs} \left( \{x_i\} - \{x_o\} \right) \left( \{x_i\} - \{x_o\} \right)^T \qquad \text{Eq. 35} $$

$$ [S] = [V][D][V]^T \quad \text{for } [V] = \begin{bmatrix} \{v_1\} & \{v_2\} & \cdots & \{v_k\} \end{bmatrix} \text{ and } [D] = \mathrm{diag}(d_1\ d_2\ \cdots\ d_k) \qquad \text{Eq. 36} $$


Perfect data observations of points on a line would produce a single non-zero eigenvalue of [S] along a direction parallel to the line. Consequently, for real-world data observations, the optimal least-squares estimate for the unit direction {û} will be parallel to the eigenvector that corresponds to the largest eigenvalue as shown in Equation 37. The largest eigenvalue will be equal to the variance of the data points about their centroid along the line. The other eigenvalues are principal values of variance perpendicular to the line.

$$ \{\hat{u}\} = \{v_j\} / \|\{v_j\}\| \quad \text{for } \{v_j\} \text{ corresponding to the largest } d_j \text{ for a line} \qquad \text{Eq. 37} $$
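A sketch of the eigenvector line fit of Equations 34 through 37 (NumPy assumed; names illustrative):

```python
import numpy as np

def fit_line_pca(pts):
    # Centroid (Eq. 34), distribution matrix (Eq. 35), eigen-decomposition (Eq. 36).
    p = pts.mean(axis=0)
    S = (pts - p).T @ (pts - p) / len(pts)
    d, V = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
    return p, V[:, -1]              # direction = largest-eigenvalue eigenvector (Eq. 37)

# Noise-free 3D points along unit direction (1, 2, 2)/3 through (1, 1, 1).
s = np.linspace(-2.0, 2.0, 9)
dirv = np.array([1.0, 2.0, 2.0]) / 3.0
pts = np.array([1.0, 1.0, 1.0]) + np.outer(s, dirv)
p, u = fit_line_pca(pts)
```

As with any eigenvector, the recovered direction is defined only up to sign.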

2.2 PLANE

The equation for a plane in k dimensional space describes the location of any point {x} on the plane based on the perpendicular distance d from the coordinate origin to the plane as determined by a given point {p} on the plane and the unit normal {n̂} to the plane as shown in Equation 38. However, individual data observations {xi} for points representing the plane may not fit the model perfectly due to experimental measurement error, process variation or insufficient model complexity as shown in Equation 39.

$$ \{x\}^T \{\hat{n}\} = \{p\}^T \{\hat{n}\} = d \qquad \text{Eq. 38} $$

$$ \{x_i\}^T \{\hat{n}\} \approx \{p\}^T \{\hat{n}\} = d \qquad \text{Eq. 39} $$

The optimal least-squares estimate for the given point {p} on the plane will be the centroid {xo} of the nobs data observations {xi} as shown in Equation 34 above. The symmetric centroidal point distribution matrix [S], diagonal eigenvalue matrix [D] and orthogonal eigenvector matrix [V] shown in Equations 35 and 36 above again provide least-squares solutions.

Perfect data observations of points on a plane would produce a single zero eigenvalue of [S] in a direction normal to the plane. Consequently, for real-world data observations, the optimal least-squares estimate for the unit normal {n̂} will be parallel to the eigenvector that corresponds to the smallest eigenvalue as shown in Equation 40. The smallest eigenvalue will be equal to the variance of the data points perpendicular to the plane. The other eigenvalues are principal values of variance within the plane.

$$ \{\hat{n}\} = \{v_j\} / \|\{v_j\}\| \quad \text{for } \{v_j\} \text{ corresponding to the smallest } d_j \text{ for a plane} \qquad \text{Eq. 40} $$

2.3 ELLIPSE

2.4 ELLIPSOID


2.5 2D QUADRIC

2.6 3D QUADRIC


3.0 LEVENBERG-MARQUARDT

The Levenberg-Marquardt algorithm iteratively adjusts estimates of model parameters {β} to minimize residuals between measured dependent variable outputs {y} and predictions from a numerical model f(.) based on independent variable inputs {x} as shown in Equation 44. For a given set of model parameters {β}k at iteration k, each measured training pair {x}i and {y}i will have residuals {e}i,k as shown in Equation 45. For parameter updates {Δβ} shown in Equation 46, the Taylor series expansion for residuals at iteration k+1 may be written as shown in Equation 47 using the Jacobian [J] of the numerical model with respect to model parameters in Equation 48.

$$ \{y\}_{nout \times 1} = f\left( \{x\}_{ninp \times 1},\ \{\beta\}_{npar \times 1} \right) \qquad \text{Eq. 44} $$

$$ \{e\}_{i,k} = \{y\}_i - f\left( \{x\}_i,\ \{\beta\}_k \right) \qquad \text{Eq. 45} $$

$$ \{\beta\}_{k+1} = \{\beta\}_k + \{\Delta\beta\} \qquad \text{Eq. 46} $$

$$ \{e\}_{i,k+1\ (nout \times 1)} = \{e\}_{i,k} - [J]_{i,k\ (nout \times npar)}\, \{\Delta\beta\}_{npar \times 1} \qquad \text{Eq. 47} $$

$$ [J]_{i,k} = \begin{bmatrix} \dfrac{\partial f_1}{\partial \beta_1} & \dfrac{\partial f_1}{\partial \beta_2} & \cdots & \dfrac{\partial f_1}{\partial \beta_{npar}} \\ \dfrac{\partial f_2}{\partial \beta_1} & \dfrac{\partial f_2}{\partial \beta_2} & \cdots & \dfrac{\partial f_2}{\partial \beta_{npar}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f_{nout}}{\partial \beta_1} & \dfrac{\partial f_{nout}}{\partial \beta_2} & \cdots & \dfrac{\partial f_{nout}}{\partial \beta_{npar}} \end{bmatrix} \quad \text{evaluated for } \{x\}_i \qquad \text{Eq. 48} $$

If only one training pair is available and the number of dependent variable outputs is equal to the number of parameters (nobs=1 and nout=npar), Equation 47 is deterministic and one can try to drive all nout residuals {e}i,k+1 to zero using Equation 49. This provides the classical Newton-Raphson root finding algorithm shown in Equation 50.

$$ \{\Delta\beta\} = [J]_{i,k}^{-1}\, \{e\}_{i,k} \qquad \text{Eq. 49} $$

$$ \{\Delta\beta\} = -[J]_{i,k}^{-1}\, f\left( \{x\}_i,\ \{\beta\}_k \right) \quad \text{for } \{y\}_i = \{0\} \qquad \text{Eq. 50} $$

If only one training pair is available and the number of dependent variable outputs is larger than the number of parameters (nobs=1 and nout>npar), residuals {e}i,k+1 at iteration k+1 can be minimized by the standard linear least squares solution shown in Equation 51.


$$ \{\Delta\beta\}_{npar \times 1} = \left( [J]_{i,k}^T\, [J]_{i,k} \right)^{-1} \left( [J]_{i,k}^T\, \{e\}_{i,k} \right) \qquad \text{Eq. 51} $$

However, if the number of parameters is greater than the number of dependent variable outputs, Equation 51 is row insufficient and multiple training pairs are required (nobs>1 for nout<npar). Residuals from all training pairs at iteration k shown in Equation 45 may be concatenated as shown in Equation 52, providing an aggregate sum of squares SSQ over all observations. Similarly, all residuals predicted at iteration k+1 for update {Δβ} in Equation 47 may be concatenated as shown in Equation 53. The linear least squares solution for parameter updates that will minimize the predicted aggregate SSQ at iteration k+1 is then shown in Equations 54 and 55.

$$ SSQ_k = \begin{Bmatrix} \{e\}_{1,k} \\ \{e\}_{2,k} \\ \vdots \\ \{e\}_{nobs,k} \end{Bmatrix}^T \begin{Bmatrix} \{e\}_{1,k} \\ \{e\}_{2,k} \\ \vdots \\ \{e\}_{nobs,k} \end{Bmatrix} = \sum_{i=1}^{nobs} \left( \{e\}_{i,k}^T\, \{e\}_{i,k} \right) \qquad \text{Eq. 52} $$

$$ \begin{Bmatrix} \{e\}_{1,k+1} \\ \{e\}_{2,k+1} \\ \vdots \\ \{e\}_{nobs,k+1} \end{Bmatrix} = \begin{Bmatrix} \{e\}_{1,k} \\ \{e\}_{2,k} \\ \vdots \\ \{e\}_{nobs,k} \end{Bmatrix} - \begin{bmatrix} [J]_{1,k} \\ [J]_{2,k} \\ \vdots \\ [J]_{nobs,k} \end{bmatrix} \{\Delta\beta\} \qquad \text{Eq. 53} $$

$$ \{\Delta\beta\} = \left( \begin{bmatrix} [J]_{1,k} \\ [J]_{2,k} \\ \vdots \\ [J]_{nobs,k} \end{bmatrix}^T \begin{bmatrix} [J]_{1,k} \\ [J]_{2,k} \\ \vdots \\ [J]_{nobs,k} \end{bmatrix} \right)^{-1} \left( \begin{bmatrix} [J]_{1,k} \\ [J]_{2,k} \\ \vdots \\ [J]_{nobs,k} \end{bmatrix}^T \begin{Bmatrix} \{e\}_{1,k} \\ \{e\}_{2,k} \\ \vdots \\ \{e\}_{nobs,k} \end{Bmatrix} \right) \qquad \text{Eq. 54} $$

$$ \{\Delta\beta\} = \left( \sum_{i=1}^{nobs} [J]_{i,k}^T\, [J]_{i,k} \right)^{-1} \left( \sum_{i=1}^{nobs} [J]_{i,k}^T\, \{e\}_{i,k} \right) \qquad \text{Eq. 55} $$

Equation 55 provides rapid second order Newtonian convergence but can become unstable if the square Jacobian summation is nearly singular. Levenberg and Marquardt showed that a positive factor λ added to the diagonal elements of the square Jacobian summation matrix as shown in Equation 56 can provide both rapid and stable convergence. For very small values of λ, this provides Newtonian convergence similar to Equation 55. For larger values of λ, this provides small but stable steps along the gradient as shown in Equation 57.


$$ \{\Delta\beta\} = \left( \sum_{i=1}^{nobs} [J]_{i,k}^T\, [J]_{i,k} + \lambda\, [I]_{npar \times npar} \right)^{-1} \left( \sum_{i=1}^{nobs} [J]_{i,k}^T\, \{e\}_{i,k} \right) \qquad \text{Eq. 56} $$

$$ \{\Delta\beta\} = \frac{1}{\lambda} \left( \sum_{i=1}^{nobs} [J]_{i,k}^T\, \{e\}_{i,k} \right) \qquad \text{Eq. 57} $$

If parameter updates provide a stable step with smaller aggregate SSQ than prior iterations, the factor λ is reduced in preparation for the next iteration. If parameter updates provide an unstable step with larger aggregate SSQ than prior iterations, those updates are rejected, the factor λ is increased and the process is repeated. Typically λ is started at a value of 0.1, is reduced by a factor of 10 for stable steps, and is increased by a factor of 10 for unstable steps.

Convergence may be assessed by observing when absolute values of parameter updates are small while the aggregate SSQ approaches the level expected from the standard deviation of the residuals. Observing the progression of the factor λ can also help indicate convergence.

The algorithm may be summarized as follows.
1) Postulate initial estimates for parameters {β}
2) Evaluate aggregate SSQ over all training pairs for initial parameter estimates (Equation 52)
3) Set factor λ = 0.1
4) Proceed through all training pairs:
   a) Evaluate all residuals {e}i,k (Equation 45)
   b) Evaluate all Jacobians [J]i,k (Equation 48)
   c) Accumulate the summations Σ [J]i,kT [J]i,k and Σ [J]i,kT {e}i,k (Equation 55)
5) Add factor λ to the diagonal and compute parameter updates {Δβ} (Equation 56)
6) Update parameters {β}k+1 (Equation 46)
7) Evaluate aggregate SSQ over all training pairs for the new parameter estimates (Equation 52)
8) If aggregate SSQ has been reduced:
   a) Reduce factor λnew = λold / 10
   b) Proceed with the next iteration at step 4)
9) If aggregate SSQ has increased:
   a) Discard the new parameter estimates and use the immediate prior values
   b) Increase factor λnew = λold * 10
   c) Proceed with the next iteration at step 5)
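The nine steps above can be sketched compactly; this is an illustrative minimal implementation assuming NumPy, exercised on a hypothetical exponential-decay model with an analytic Jacobian (all names are illustrative, not from the source):

```python
import numpy as np

def levenberg_marquardt(f, jac, x, y, beta0, n_iter=100):
    beta = np.asarray(beta0, float)
    lam = 0.1                                     # step 3
    ssq = np.sum((y - f(x, beta))**2)             # step 2 (Eq. 52)
    for _ in range(n_iter):
        e = y - f(x, beta)                        # step 4a (Eq. 45)
        J = jac(x, beta)                          # step 4b (Eq. 48)
        A = J.T @ J + lam * np.eye(len(beta))     # step 5 (Eq. 56)
        beta_new = beta + np.linalg.solve(A, J.T @ e)   # step 6 (Eq. 46)
        ssq_new = np.sum((y - f(x, beta_new))**2)       # step 7
        if ssq_new < ssq:                         # step 8: accept, reduce lambda
            beta, ssq, lam = beta_new, ssq_new, lam / 10.0
        else:                                     # step 9: reject, increase lambda
            lam *= 10.0
    return beta

# Hypothetical model y = b0 * exp(-b1 * x), noise-free data.
f = lambda x, b: b[0] * np.exp(-b[1] * x)
jac = lambda x, b: np.column_stack([np.exp(-b[1] * x),
                                    -b[0] * x * np.exp(-b[1] * x)])
x = np.linspace(0.0, 4.0, 25)
y = 5.0 * np.exp(-1.3 * x)
beta = levenberg_marquardt(f, jac, x, y, beta0=[1.0, 0.5])
```

A production version would add a convergence test on |{Δβ}| and SSQ rather than running a fixed iteration count.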

Because of its robust performance, the Levenberg-Marquardt method is often used with finite difference numerical approximations for the Jacobian [J] of the numerical model with respect to the model parameters. Note that this Jacobian must be re-evaluated for each training pair at each iteration, whether analytically or numerically.

Penalty functions may be added to residuals to impose explicit or implicit inequality constraints on parameters. However as with any gradient technique, convergence may dither across constraint boundaries if the minimum SSQ is at a constraint boundary.


Levenberg, K. "A Method for the Solution of Certain Problems in Least Squares." Quart. Appl. Math. 2, 164-168, 1944.

Marquardt, D. "An Algorithm for Least-Squares Estimation of Nonlinear Parameters." SIAM J. Appl. Math. 11, 431-441, 1963.

4.0 SIMPLEX

Nelder, J.A. and Mead, R. "A Simplex Method for Function Minimization." Computer Journal, 7, 308-???, 1965.

5.0 GRADIENT SEARCH

6.0 NEURAL NETWORK

7.0 GENETIC ALGORITHMS

8.0 SIMULATED ANNEALING

9.0 KALMAN FILTER
