# LEAST SQUARES METHODS

Document Sample

```					PARAMETER ESTIMATION

The fundamental concept of parameter estimation is to determine optimal values of parameters
for a numerical model that predicts dependent variable outputs of a function, process or
phenomenon based on observations of independent variable inputs. For a given data
observation, independent variable inputs are grouped into an input vector of length ninp and
dependent variable outputs are grouped into an output vector of length nout. Corresponding
input and output vectors for a given data observation are called a training pair. In general,
training pairs from nobs number of data observations are pooled to estimate npar number of
model parameters.

1.0 Least Squares Methods
1.1 Linear Models
1.1.1 Straight Line
1.1.2 Single Dependent Variable
1.1.2.1 Polynomial
1.1.2.1.2 Cubic
1.1.2.1.3 Quintic
1.1.2.1.4 Choice Of Polynomial Order
1.1.2.2 Harmonic
1.1.2.3 Plane
1.1.2.4 Bicubic
1.1.3 Multiple Dependent Variables
1.1.4 Weighting
1.1.5 Scaling
1.2 Linearized Models
1.2.1 Newtonian Cooling
1.2.2 Circle
1.2.3 Sphere
1.3 Example Data
1.3.1 Anscombe's Data Set

2.0 Eigenvector Methods
2.1 Line
2.2 Plane
2.3 Ellipse
2.4 Ellipsoid

3.0 Levenberg-Marquardt

4.0 Simplex
6.0 Neural Network
7.0 Genetic Algorithms
8.0 Simulated Annealing
1.0 LEAST SQUARES METHODS

The least squares optimality criterion minimizes the sum of squares of residuals between actual
observed outputs and output values of the numerical model that are predicted from input
observations

1.1 LINEAR MODELS

Many processes exhibit true linear behavior. Many others operate over such small excursions of
input variable values that the output behavior appears linear.

1.1.1 STRAIGHT LINE

The classic least squares problem is to fit a straight line model with a single input x and a single
output y using parameters b and m as shown in Equation 1. Unfortunately, individual data
observations xi and yi may not fit the model perfectly due to experimental measurement error,
process variation or insufficient model complexity as shown in Equation 2. Multiple data
observations may be concatenated as shown in Equation 3 and represented in matrix form per
Equation 4. Even for optimal estimates of model parameters {}, each data observation will
have some residual error ei between the observed output yi and the predicted model output as
shown in Equations 5 and 6 .

b
y = b + mx = [ 1 x ]           for ninp=1, nout=1, npar=2                                     Eq. 1
m

b
yi  [ 1 xi ]                                                                                 Eq. 2
m

 y1  1 x 1 
 y  1 x  b
 2            2  
                                                                                           Eq. 3
             m 
 y nobs  1 x nobs 
                  

 y1                 1 x 1 
 y                  1 x 
                                        b
{Y}  [X] {}          for {Y}   2             [X]         2 
{}                  Eq. 4
nobs x 1
          nobs x 2           2 x1  m
 y nobs                      
                    1 x nobs 
 e1   y1  1 x 1 
 e   y  1 x  b
 2   2               2  
                                                                                                            Eq. 5
                   m
e nobs  y nobs  1 x nobs 
                         

 e1 
e 
       
{e} = {Y} - [X] {}                       for {e}   2                                                            Eq. 6
nobs x 1
  
e nobs 
       

The scalar sum of squares SSQ of residual errors is shown in Equation 7. To minimize the sum
of squares, one may set the partial derivative of the sum of squares with respect to the model
parameters {} equal to zero as shown in Equation 8. Rearranging these terms provides a linear
matrix solution for optimal model parameters per Equation 9.

SSQ = {e}T{e} = {Y}T{Y} - 2{}T[X]T{Y} + {}T[X] T[X] {}                                                           Eq. 7

SSQ / {} = - 2[X]T{Y} + 2[X] T[X] {} = 0                                                                        Eq. 8

{}  ( [X]T       [X ] ) 1 [X]T {Y}                                                                               Eq. 9
2 x1   2 x nobs   nobs x 2     2 x nobs   nobs x 1

Expanding this matrix solution as shown in Equation 10, provides valuable insights into the
nature of least squares estimates. If the input observations xi and the output observations yi are
both mean centered as shown in Equation 11, then the offset term b is zero and the slope m is
equal to the covariance of xi and yi divided by the variance of xi as shown in Equation 12.
Consequently, mean centered data preclude the need to compute an offset term for many linear
models.

1
              nobs
         nobs 
 b 
nobs          xi             yi 
 i 1     
   nobs                 
i 1
nobs            nobs                                                                        Eq. 10
m   x                   2
 i            xi            x i y i 
 i 1          i 1           i 1
          


1
 b  nobs             0           0 
nobs                             nobs                        nobs

  0                                                                x i = 0 and yo =          y
nobs
 xy 
 x i2          i i 
1                           1
for xo =                                                 =0   Eq. 11
m  
nobs                        nobs          i
i 1                        i 1
             i 1            i 1 
 b  nobs                    
0
                        
 x y
nobs
 
m    i i
 i 1
x 
i 1


2
i
for xo = 0, yo = 0                                    Eq. 12

r2 ????

1.1.2 SINGLE DEPENDENT VARIABLE

The methodology developed above may be generalized to polynomials or other linear models
requiring npar  2 number of linear parameters as shown in Equation 13. Note that the number
of data observations nobs must be greater than or equal to the number of estimated parameters
npar to prevent singularity of the symmetric matrix ([X]T[X]).

{}  ( [ X ]T             [X ] ) 1 [X ]T           {Y}        for nout=1                    Eq. 13
npar x 1   npar x nobs   nobs x npar   npar x nobs   nobs x 1

1.1.2.1 POLYNOMIAL

A quadratic model is presented in Equations 14 and 15.

a 0 
 
2               2
y = a0 + a1 x + a2 x = [ 1 x x ]  a 1                          for ninp=1, nout=1, npar=3   Eq. 14
a 
 2

 y1                    1 x 1               x1 
2

 y                                                       a 0 
                        1 x2                x2             
{Y}   2                [X]                          2
{}   a 1                  Eq. 15
nobs x 1
             nobs x 3                          3x1
a 
 y nobs                                                   2
                       1 x nobs
                   x2 
nobs 

1.1.2.1.2 CUBIC

A cubic model is presented in Equations 16 and 17.
a 0 
 
2      3         2 3 a1 
y = a0 + a1 x + a2 x + a3 x = [ 1 x x x ]                        for ninp=1, nout=1, npar=4   Eq. 16
a 2 
a 3 
 

 y1                 1 x 1        2
x1       x1 
3
a 0 
 y                                                     a 
                                                          
2
1 x2        x        x3 
{Y}   2             [ X]                   2      2
{}   1                        Eq. 17
nobs x 1
          nobs x 4                          4 x1
a 2 
 y nobs                                                a 3 
                    1 x nobs
           x2
nobs    x3 
nobs           

1.1.2.1.3 QUINTIC

A quintic model is presented in Equations 18 and 19.

a 0 
a 
 1
2 3 4      
y = [ 1 x x x x ] a 2            for ninp=1, nout=1, npar=5                                   Eq. 18
a 
 3
a 4 
 

a 0 
 y1                 1 x 1        2
x1       x1 
3
a 
 y                                                      1
                     1 x2        x2       x3             
{Y}   2             [ X]                2         2
{}  a 2                       Eq. 19
nobs x 1
          nobs x 4                          5 x1
a 
 y nobs                                                 3
                    1 x nobs
           x2       x3 
nobs 
nobs
a 4 
 

1.1.2.1.4 CHOICE OF POLYNOMIAL ORDER

Engine speed          Engine torque
[rpm]                 [N.m]
800                   500
1000                  547
1200                  636
1400                  679
1600                  719
1800                  724
2000                  712
2200                  671
2400                  606
2600                  575

1.1.2.2 HARMONIC

A linear harmonic model is presented in Equations 17 and 18.
a 0 
a 
 1
 
y = [ 1 cos sin cos2 sin2 ]  b 1  for ninp=1, nout=1, npar=5                                      Eq. 17
a 
 2
b 2 
 

a 0 
 y1                 1 cos1        sin 1       cos 21       sin 21             a 
 y                  1 cos         sin  2      cos 2 2      sin 2 2             1
                                                                                     
{Y}   2             [ X]           2                                               {}   b1    Eq. 18
nobs x 1
          nobs x 5                                                    5 x1
a 
 y nobs                                                                            2
                    1 cos nobs   sin  nobs   cos 2 nobs   sin 2 nobs 
b 2 
 

1.1.2.3 PLANE

A linear planar model is presented in Equations 19 and 20.

a 
 
z = a + b x + c y = [ 1 x y ] b              for ninp=2, nout=1, npar=3                               Eq. 19
c 
 

 z1 
z                 a 
                   
{Y}   2             {}  b                                                                        Eq. 20
nobs x 1
           3x1
c 
z nobs             
       

1.1.2.3 BICUBIC

A linear bicubic model is presented in Equations 21 and 22.
 b1     b5   b9     b13   1 
b                   b14   y 

z  1 x x2              
x3  2
b 3
b6
b7
b10
b11
 
 
b15  y 2 
for ninp=2, nout=1, npar=16                      Eq. 21
                        
b 4     b8   b12    b16   y 3 
 

 z1 
z 
       
{Y}   2                                                                                                    Eq. 22a
nobs x 1
  
z nobs 
       

Eq. 22b

1 x 1      2
x1      3
x1    y1    x 1 y1      2
x 1 y1     3
x 1 y1    2
y1         2
x 1 y1     2 2
x 1 y1     3 2
x 1 y1    3
y1         3
x 1 y1      2 3
x 1 y1   x 1 y1 
3 3

                                                                                                                          
x2     x3                    x 2 y2   x3 y2    y2    x 2 y2   x 2 y2   x3 y2    y3    x 2 y3    x 2 y3   x 3 y3 
X  1 x 2
 
2      2   y2    x 2 y2      2       2         2        2     2 2     2 2       2        2      2 2      2 2

nobs x 16                                                                                                            
                                                                                                                          
1 x n
         x2
n      x3
n    yn    x n yn    x 2 yn
n      x3 yn
n       y2
n    x n y2
n   x 2 y2
n n    x3 y2
n n     y3
n    x n y3
n    x 2 y3
n n    x 3 y3 
n n

 b1 
b 
 2
 b3 
 
 b4 
 b5 
 
 b6 
b 
 7
   8 
b
 
16 x1
 b9 
b10 
 
b11 
b 
 12 
b13 
b 
 14 
b15 
                                                                                                   Eq. 22c
b16 

1.1.3 MULTIPLE DEPENDENT VARIABLES
Multiple simultaneous linear outputs for each input data observation may be modeled using this
methodology as shown in Equation 23. Note that the number of data observations nobs must be
greater than or equal to the number of parameters npar (not npar times nout) to prevent
singularity of the symmetric matrix ([X]T[X]).

{}  ( [ X ]T                  [X ] ) 1 [X ]T                  Y                         Eq. 23
npar x nout       npar x nobs   nobs x npar        npar x nobs   nobs x nout

A classic example of multiple linear outputs is to compute the magnitude and location of a
vertical force applied on a forceplate transducer comprised of two load cells as shown in Figure
xxx.

full affine strain/graphic transform – displacement, rotation, principal strain, principal directions

1.1.4 WEIGHTING

A weighting factor wi may be assigned to each training pair to emphasize relative importance
among the data observations. Residuals {e} may be accentuated or attenuated by pre-
multiplying with a diagonal weighting matrix [W] to form weighted residuals {} as shown in
Equation 23a. The least squares solution for this new set of weighted residuals is shown
Equation 23b.

 1   w 1                  0         0  0   e1 
  0                                    0  e 2 
   2                                                   
w2          0
                                                          W e                  Eq. 23a
   0                    0             0   
 nobs   0
                           0

0 w nobs  e nobs 
       


{}  X W X
T         2
 X
1          T
W2 [Y]                                  Eq. 23b

Two concepts are typically used for weights. The simplest form is to use a value of 1 if that data
observation is present in the current data set while a value of 0 is used if that observation is
missing as shown in Equation 23c. This technique is often used for scientific devices that
employ the same data collection protocols for each data set (same nobs each time), but in which
certain observations may be missing or spurious (e.g. photogrammetry with landmarks that are
occasionally not visible).

wi2 = 1 for valid data, wi2 = 0 for missing data                                              Eq. 23c

The second approach is to set the square of each weight equal to the inverse of the expected
variance for that observation. The expected variance may be the variance of the respective
dependent variable output, the variance of the independent variable input or a pooled value
representative of both.
wi2 = 1 / i2 for i2 = variance of observation i                                          Eq. 23d

1.1.5 SCALING

mean center
smallest xi maps to -1
largest xi maps to +1
???

1.2 LINEARIZED MODELS

It is often tempting to linearize models of nonlinear phenomena in an effort to arrive at estimates
for parameters. One should exercise caution however in that linear least squares solutions for
linearized models minimize the sum of squares of residuals for nonlinear forms of the dependent
variables rather than residuals of those dependent variables themselves.

1.2.1 NEWTONIAN COOLING

A classic example is to model the temperature T of an object as a function of time  during
exponential Newtonian cooling from an initial temperature T0 toward ambient temperature T as
shown in Equation 24. Taking the logarithm after a simple algebraic rearrangement provides the
linearized model in Equation 25. If the ambient temperature T is known, measuring
temperatures Ti of the object at times i allows direct estimation of the Newtonian cooling rate b
and indirect estimation of the initial temperature difference above ambient with the linearized
matrix model in Equations 26. Specifically, this approach minimizes residuals of the nonlinear
function ln(T - T) rather than residuals of the dependent variable temperature T. Note that
simultaneous estimation of ambient temperature T∞ requires an iterative algorithm.

T = T + (T0 - T) e-b                                                                      Eq. 24

ln(T0 - T )
ln(T - T) = ln(T0 - T) - b                                            Eq. 25
 b         

 ln(T1 - T )              1 τ1 
 ln(T - T )                 1 τ 
                                                ln(T0 - T )
{Y}          2     
     [ X]         2 
{}                                  Eq. 26
nobs x 1
                  nobs x 2           2 x1   b 
ln(Tnobs - T )                      
                            1 τ nobs 

1.2.2 LOGARITHMIC SPIRAL
The logarithmic spiral often appears in nature such as the cross –section of a nautilus shell,
atmospheric low pressure spirals or the arms of galaxies as shown in Equations 27. Given center
location xc and yc allows linearization of x and y data points on the spiral as shown in Equations
28 and provides the linearized matrix model of Equations 29. Again, this approach minimizes
residuals of the nonlinear function ln(r) rather than residuals of the independent variables x and
y. Simultaneous estimation of center location xc and yc requires iterative techniques. Note that
there are no actual dependent variables for closed form geometric models estimated from
coordinate data.

r = a eb   x = xc + r cos       y = yc + r cos                                                        Eq. 27

y  yc                                       ln a 
r    x  x c 2  y  y c 2       tan                   ln r   ln a   b   1           Eq. 28
x  xC                                        b 

 ln( r1 )                 1 1 
 ln( r )                   1  
                                                    ln(a)
{Y}           2
        [ X]         2 
{}                                          Eq. 29
nobs x 1
                 nobs x 2                2 x1   b 
ln( rnobs )                         
                           1  nobs 

1.2.3 CIRCLE

A second example is to estimate the coordinates a and b for the center of a circle of unknown
radius r by measuring the coordinates of points x and y on the circle as shown in the familiar
nonlinear model in Equation 30. Expanding and rearranging provides the linearized model
shown in Equation 31. Measuring coordinates xi and yi for points on the circle allows direct
estimation of the center coordinates a and b and indirect estimation of the radius with the
linearized model in Equations 32. This approach minimizes residuals of the nonlinear function
(x2+y2) rather than residuals for the independent variables x and y.

( x - a )2 + ( y - b )2 = r2                                                                             Eq. 30

r 2  a 2  b 2 
                
( x2 + y2 ) = [ 1 x y ]       2a                                                                       Eq. 31
      2b        
                

 (x1  2
 y1 ) 
2
1 x 1         y1 
                                1 x                           r 2  a 2  b 2 
 (x 2
2
 y2 )                               y2                             
{Y}                   2
    [X]         2                 {}        2a                     Eq. 32
nobs x 1
                      nobs x 3                     3x1
                
( x 2                                                               2b        
 nobs      y nobs )
2
            1 x nobs     y nobs 
If radius r is known a priori, the problem may be reformulated to find the best fit for center
coordinates a and b as shown in Equations 33. Additionally, the solution for the term (-a2-b2)
may not be completely consistent with the solution for terms (2a) and (2b) if the data does not
exactly model the given radius r. Again, note that there are no actual dependent variables for
closed form geometric models estimated from coordinate data.

 ( x 1  y1  r 2 ) 
2     2
1 x 1        y1 
                                   1 x                           a 2  b 2 
 (x 2  y 2  r ) 
2     2    2
y2                         
{Y}                                [X]         2                {}   2a                         Eq. 33
nobs x 1
                         nobs x 3                    3x1
 2b 
( x 2  y 2  r 2 )                                                        
 nobs      nobs                    1 x nobs    y nobs 

1.2.4 SPHERE

Following the circle example above, one can estimate the coordinates a, b and c for the center of
a sphere of unknown radius r as shown in Equations 34.

 ( x 1  y1  z 1 ) 
2    2    2
1 x 1          y1      z1           r 2  a 2  b 2  c 2 
                                    1 x                                                        
 (x 2  y 2  z 2 ) 
2    2    2
y2       z2    {}           2a           
{Y}                                 [X]         2
                       Eq. 34
nobs x 1
                          nobs x 4                         4 x1             2b           
( x 2  y 2  z 2 )                                                                           
 nobs     nobs    nobs              1 x nobs     y nobs   z nobs                  2c          
1.3 EXAMPLE DATA

1.3.1 ANSCOMBE'S DATA SET

Ref: Anscombe, F.J. (1973) Graphs in statistical analysis. Amer. Statistician 27:17-21

Anscombe's datasets
x1=[10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00 ]';
y1=[ 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68 ]';

x2=x1;
y2=[ 9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74 ]';

x3=x1;
y3=[ 7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73 ]';

x4=[ 8.00 8.00 8.00 8.00 8.00 8.00 8.00 19.00 8.00 8.00 8.00 ]';
y4=[ 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.50 5.56 7.91 6.89 ]';
2.0 EIGENVECTOR METHODS

The linear methods described above minimize the sum of squares of residuals between observed
outputs and output values predicted from multiple input observations. However for some
numerical models, the concepts of independent inputs and dependent outputs are not well
defined. Fitting geometric models to k dimensional point coordinate data is an example. The
coordinates for each point are typically grouped into an input vector of length ninp=k and there
are no dependent variable outputs. In general, input vectors from nobs number of data
observations are pooled to estimate npar number of model parameters.

Linear least squares methods minimize the sum of squares of residuals parallel to the dependent
variable. For eigenvector methods, the residuals

Orthogonal distance fitting …

2.1 LINE

The parametric equation for a line in k dimensional space describes the location of any point {x}
on the line in terms of a given point {r} on the line plus a directed scalar distance s measured
from {r} multiplied by a unit direction u along the line as shown in Equation 32. However,
ˆ
individual data observations {xi} for points representing the line may not fit the model perfectly
due to experimental measurement error, process variation or insufficient model complexity as
shown in Equation 33.

x  r  s u
ˆ                                                                             Eq. 32
k x1   k x1          k x1

{xi}  {r} + s u
ˆ                                                                            Eq. 33

The optimal least-squares estimate for the given point {r} on the line will be the centroid {xo}of
the nobs data observations {xi} as shown Equation 34. The symmetric centroidal point
distribution matrix [S] may then be formed as shown in Equation 35. Its diagonal eigenvalue
matrix [D] and orthogonal eigenvector matrix [V] shown Equation 36 provide least-squares
solutions.

nobs
{r} = {xo} =           1
nobs   
i 1
{xi}                                                     Eq. 34

nobs
[S] = nobs  ({xi}-{xo})T ({xi}-{xo})
1
Eq. 35
kxk           i 1

[S] = [V] [D] [V]T                  for [V] = [{v1}{v2}…{vk}] and [D] = diag( d1 d2 … dk )   Eq. 36
Perfect data observations of points on a line would produce a single non-zero eigenvalue of [S]
along a direction parallel to the line. Consequently for real-world data observations, the optimal
least-squares estimate for the unit direction u will be parallel to the eigenvector that
ˆ
corresponds to the largest eigenvalue as shown in Equation 37. The largest eigenvalue will be
equal to the variance of the data points about their centroid along the line. The sum of the other
two eigenvalues will be equal the variance of the data points perpendicular to the line.

u = {vj} / norm{vj}
ˆ                              for {vj} corresponding to largest dj for a line             Eq. 37

2.2 PLANE

The equation for a plane in k dimensional space describes the location of any point {x} on the
plane based on the perpendicular distance  from the coordinate origin to the plane as
determined by a given point {p} on the plane and the unit normal n to plane as shown in
ˆ
Equation 38. However individual data observations {xi} for points representing the plane may
not fit the model perfectly due to experimental measurement error, process variation or
insufficient model complexity as shown in Equation 39.

{x}T n = {p}T n = 
ˆ          ˆ                                                                          Eq. 38

{xi}T n  {p}T n = 
ˆ          ˆ                                                                         Eq. 39

The optimal least-squares estimate for the given point {p} on the plane will be the centroid
{xo}of the nobs data observations {xi} as shown Equation 40. The symmetric centroidal point
distribution matrix [S] may then be formed as shown in Equation 41. Its diagonal eigenvalue
matrix [D] and orthogonal eigenvector matrix [V] shown Equation 42 provide least-squares
solutions.

nobs
{p} = {xo} =     1
nobs   
i 1
{xi}                                                          Eq. 40

nobs
[S] = nobs  ({xi}-{xo})T ({xi}-{xo})
1
Eq. 41
3x3      i 1

[S] = [V] [D] [V]T            for [V] = [{v1}{v2}{v3}] and [D] = diag( d1 d2 d3 )           Eq. 42

Perfect data observations of points on a 3D plane would produce a single zero eigenvalue of [S]
in a direction normal to the plane. Consequently for real-world data observations, the optimal
least-squares estimate for the unit normal n will be parallel to the eigenvector that corresponds
ˆ
to the smallest eigenvalue as shown in Equation 43. The smallest eigenvalue will be equal to the
variance of the data points perpendicular to the plane. The other two eigenvalues are principal
values of variance of the data points within the plane.
n = {vj} / norm{vj}
ˆ                      for {vj} corresponding to smallest dj for a 3D plane   Eq. 43

2.3 ELLIPSE

2.4 ELLIPSOID

3.0 LEVENBERG-MARQUARDT

The Levenberg-Marquardt algorithm iteratively adjusts estimates of model parameters {} to
minimize residuals between measured dependent variable outputs {y} and predictions from a
numerical model f(.) based on independent variable inputs {x} as shown in Equation 44. For a
given set of model parameters {}k at iteration k each measured training pair {x}i and {y}i will
have residuals {e}i,k as shown in Equation 45. For parameter updates {} shown in Equation
46, the Taylor series expansion for residuals at iteration k+1 may be written as shown in
Equation 47 using the Jacobian [J] of the numerical model with respect to model parameters in
Equation 48.

                  
y      f  x ,                                                                     Eq. 44
nout x 1      ninp x 1 npar x 1

ei,k  yi  f xi , k                                                              Eq. 45

{}k+1 = {}k + {}                                                                         Eq. 46

ei,k 1  ei,k  J i,k                                                             Eq. 47
nout x 1     nout x 1   nout x npar   npar x 1

 f1           f1            f1 
                         
 2           npar 
 1                                   
 f 2          f 2

f 2 
Ji,k     1            2           npar          evaluated for xi                Eq. 48
                                     
                              
 f nout      f nout

f nout 
 1
                2           npar 


If only one training pair is available and the number of dependent variable outputs is equal to the
number of parameters (nobs=1 and nout=npar), Equation 47 is deterministic and one can try to
drive all nout residuals {e}i,k+1 to zero using Equation 49. This provides the classical Newton-
Raphson root finding algorithm shown in Equation 50.

  Ji,1 ei,k
k                                                                               Eq. 49

   Ji,1 f xi , k 
k                                  for yi  0                             Eq. 50

If only one training pair is available and the number of dependent variable outputs is larger than
the number of parameters (nobs=1 and nout>npar), residuals {e}i,k+1 at iteration k+1 can be
minimized by the standard linear least squares solution shown in Equation 51.
1
                                                         
   J iT,k
 npar x nout     J i,k  
 J iT,k ei ,k 
 npar x nout nout x 1                                 Eq. 51
npar x 1                nout x npar                               

However, if the number of parameters is greater than the number of dependent variable outputs,
Equation 51 is row insufficient and multiple training pairs are required (nobs>1 for nout<npar).
Residuals from all training pairs at iteration k shown in Equation 45 may be concatenated as
shown in Equation 52 providing an aggregate sum of squares SSQ over all observations.
Similarly all residuals predicted at iteration k+1 for update {} in Equation 47 may be
concatenated as shown Equation 53. The linear least squares solution for parameter updates that
will minimize the predicted aggregate SSQ after at iteration k+1 is then shown in Equations 54
and 55.

 e1,k                 e1,k 
T

 e                    e  nobs

SSQk  
2,k 

 



2,k 

  
              
eiT,k ei,k                                    Eq. 52
                                     i 1
enobs,k 
                       enobs,k 
          

 e1,k 1   e1,k   J 1,k 
 e           e   J  
    2 , k 1      2,k 
                       
2,k 
                                                        Eq. 53
                    
enobs,k 1  enobs,k  J nobs,k 
                                    

1
     J 1,k           J 1,k                    J 1,k           e1,k      
T                                                  T
                                                                                        
     J               J                        J               e         
    2, k 
       2, k              2,k                        2,k                              Eq. 54
                                                                            
                                                          
J nobs,k        J nobs,k                   J nobs,k        enobs,k 
                                                                                        
                                                                                      

1

                   
    JiT,k Ji,k    JiT,k ei,k                          
nobs                 nobs
                                                                                           Eq. 55
 i 1               i 1                                        

Equation 55 provides rapid second order Newtonian convergence but can become unstable if the
square Jacobian summation is nearly singular. Levenberg and Marquardt showed that a positive
factor  added to the diagonal elements of the square Jacobian summation matrix as shown in
Equation 56 can provide both rapid and stable convergence. For very small values of , this
provides Newtonian convergence similar to Equation 55. For larger values of , this provides
small but stable steps along the gradient shown in Equation 57.
1

                             
    JiT,k Ji,k   I    JiT,k ei,k          
nobs                                nobs
                                                                                                     Eq. 56
 i 1                npar x npar   i 1                

  1    JiT,k ei,k   
nobs
                                                                                                     Eq. 57
  i 1                 

If parameter updates provide a stable step with smaller aggregate SSQ than prior iterations,
factor  is reduced in preparation for the next iteration. If parameter updates provide an unstable
step with larger aggregate SSQ than prior iterations, those updates are rejected, factor  is
increased and the process is repeated. Typically  is started at a value of 0.1, is reduced by a
factor of 10 for stable steps, and is increased by a factor of 10 for unstable steps.

Convergence may be assessed by observing when absolute values of parameter updates are small
while the aggregate SSQ approaches the expected standard deviation of residuals. Observing the
progression of factor  can also help indicate convergence.

The algorithm may be summarized as follows.
1) Postulate initial estimates for parameters {}
2) Evaluate aggregate SSQ over all training pairs for initial parameter estimates (Equation 52)
3) Set factor  = 0.1
4) Proceed through all training pairs
a) Evaluate all residuals {e}i,k (Equation 45)
b) Evaluate all Jacobians [J]i.k (Equation 48)

  J                  and   J                  
nobs                            nobs
c) Accumulate summations
T
i ,k   J i,k                  T
i ,k   ei,k       (Equation 55)
i 1                            i 1

5) Add factor  to diagonal and compute parameter updates {} (Equation 56)
6) Update parameters {}k+1 (Equation 46)
7) Evaluate aggregate SSQ over all training pairs for new parameter estimates (Equation 52)
8) If aggregate SSQ has been reduced:
a) Reduce factor NEW = OLD / 10
b) Proceed with the next iteration at step 4)
9) If aggregate SSQ has increased:
a) Discard the new parameter estimates and use immediate prior values
b) Increasee factor NEW = OLD * 10
c) Proceed with the next iteration at step 5)

Because of its robust performance, Levenberg-Marquardt method is often used with finite
difference numerical approximations for the Jacobian [J] of the numerical model with respect to
model parameters. Note that this Jacobian must be re-evaluated for each training pair for each
iteration whether analytically or numerically.

Penalty functions may be added to residuals to impose explicit or implicit inequality constraints
on parameters. However as with any gradient technique, convergence may dither across
constraint boundaries if the minimum SSQ is at a constraint boundary.
Levenberg, K. "A Method for the Solution of Certain Problems in Least Squares." Quart. Appl.
Math. 2, 164-168, 1944.
Marquardt, D. "An Algorithm for Least-Squares Estimation of Nonlinear Parameters." SIAM J.
Appl. Math. 11, 431-441, 1963.

4.0 SIMPLEX

Nedler, J.A. and Mead, R. "A Simplex Method for Function Minimization." Computer Journal,
7, 308-???, 1965.

6.0 NEURAL NETWORK

7.0 GENETIC ALGORITHMS

8.0 SIMULATED ANNEALING

```
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
 views: 25 posted: 7/12/2012 language: pages: 20