Parameter estimation for nonlinear models

Parameter estimation for nonlinear models:
Numerical approaches to solving the inverse problem

            Lecture 1
             01/08/2008

           Sven Zenker
             Schedule/Course overview (Part 1)

01/08/2008    Introduction/Basics: Problem setup, maximum likelihood estimation and nonlinear
              programming (NLP), Technical considerations

01/15/2008    Non-linear least squares as NLP, approaches to NLP 1: unconstrained minimization

01/22/2008    Approaches to NLP 2: constraints, duality

01/29/2008    NLP 3: Modern techniques: non-monotone globalization; Global optimization

02/05/2008    (no class)

02/12/2008    ODE-specific methods 1: Multiple shooting

02/19/2008    ODE-specific methods 2: Collocation

02/26/2008    Iterative estimators, generalized smoothing

03/04/2008    The stochastic viewpoint revisited: philosophy and mathematical formulation, Bayesian
              and frequentist viewpoints, review of probability basics
                   Schedule/Course overview (Part 2)

03/11/2008             (no class, Spring Recess)

03/18/2008 Numerical representation of probability distributions and operations on them, density
                      estimation, bandwidth adaptation as an optimization problem

03/25/2008 Sampling 1: Monte Carlo, Rejection sampling, Random walks: MCMC: Metropolis-
                      Hastings and its variations

04/01/2008    MCMC diagnostics, Sampling 2: Importance sampling (IS) and its relatives, particle filters

04/08/2008    "Population Monte Carlo", mixing MCMC and IS, Hybrid Monte Carlo, Langevin methods

04/15/2008 Utilizing the samples: model selection criteria, sampling the model space: reversible
                        jump MCMC

04/22/2008 Summary & Outlook
Literature

Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

Albert Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2005.

+ various other sources
General framework: steps in developing mathematical models of the physical world

1. Model development and parametrization
2. Solution of the forward problem
3. Solution of the inverse problem


Typically, this will be iteratively repeated.
Interaction between experiment and mathematical model

[Diagram: experimental models of all kinds and mathematical models form a loop.
Experiment -> model: data for identification of relationships, validation or
falsification (scientific method!). Model -> experiment: hypotheses, predictions,
realization of experimentally not realizable scenarios.]
The model development process yields

$$M : X \to Y, \quad x \mapsto y = M(x), \qquad x \in X \subseteq \mathbb{R}^n, \quad y \in Y \subseteq \mathbb{R}^m$$
M, the (deterministic) nonlinear
model
X: parameter or model space
Y: data or observation space
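To make the notation concrete, a minimal MATLAB sketch (a hypothetical two-parameter exponential decay model, not one from the course):

    t = linspace(0, 10, 20)';          % fixed measurement times
    M = @(x) x(1) * exp(-x(2) * t);    % forward map M: R^2 -> R^20, x = [A; lambda]
    y = M([2; 0.5]);                   % solving the forward problem for one x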
Terminology: Forward & inverse problem


The model maps parameter vectors in a model-dependent parameter space X to vectors of observable quantities y. This “prediction” of observations is termed the solution of the forward problem.
Terminology: Forward & inverse problem


Inferring parameter values from
observations is termed the solution of
the inverse problem.
              Stochasticity?

Although we will, for the purpose of this course, restrict ourselves to deterministic models, the presence of measurement error may still necessitate stochastic considerations.
Sources of uncertainty in the forward and inverse problems, a more complete picture

[Diagram: a quantitative representation of the system (system states, parameters) is mapped
by the mathematical model of the system to measurement results (“prediction”/“observation”),
and measurement results are mapped back to states and parameters (“inference”/“interpretation”).
“Forward”: measurement error and model stochasticity (if present) introduce uncertainty,
taking a single state vector to a probability density function on measurement space.
“Inverse”: measurement error, model stochasticity, and ill-posedness introduce uncertainty,
taking a single measurement vector to a probability density function on state and parameter space.]
Mathematical models: classification by formulation of forward problem (1)

• Closed form solution
  • The convenient situation
  • Rarely the case for practically relevant scenarios
Mathematical models: classification by formulation of forward problem (2)

• Solution available only by numerical approximation
  • Special case: Model is described by a system of non-linear differential equations
  • Special case: Solution of nonlinear algebraic equations
  • Other common scenarios: DDEs, PDEs, solutions of optimization problems, etc.
Approach for part 1 of this course: Nonlinear programming formulation

Turn the problem of finding the parameter vector x from the observation vector y into a nonlinear programming (NLP) problem, that is, an optimization problem.
Nonlinear programming formulation

Advantages:
• Relatively simple formulation
• Excellent (even real-time) performance possible through exploitation of derivative information
• Relatively straightforward extensions to optimal experimental design and control possible
Nonlinear programming formulation

Disadvantages:
• Strong assumptions required
  • These are usually met partially at best
  • Whether the failure to meet the assumptions is practically relevant in a given scenario is not always evident from the results
From the inverse problem to a nonlinear program: The maximum likelihood estimate (MLE)

Basic idea: Given
• a model and a set of parameters specifying it completely
• the distribution of measurement error
• a set of experimental observations
we can compute the probability of the data being observed given the model specification and identify this with the likelihood of the given parameter set.
From the inverse problem to a nonlinear program: The maximum likelihood estimate (MLE)

Deterministic model
$$M : X \to Y, \quad x \mapsto y = M(x)$$
and measurement error distribution together yield a parametrized family of conditional probability distributions on observation space Y (for the continuous case described by probability density functions)
$$f(y \mid x)$$
The likelihood function is the above interpreted as a function of the 2nd argument, x:
$$L(x) = f(y \mid x)$$
From the inverse problem to a nonlinear program: The maximum likelihood estimate (MLE)

The likelihood function is the above interpreted as a function of the 2nd argument, x,
$$L(x) = f(\tilde{y} \mid x)$$
for a fixed observation $\tilde{y}$ representing our data. For k independent observations we have
$$L(x) = \prod_{i=1}^{k} f(y_i \mid x)$$
From the inverse problem to a nonlinear program: The maximum likelihood estimate (MLE)

Note that L(x) does NOT define a probability density function (PDF) on parameter space (or a probability distribution in the discrete case). To obtain a bona fide PDF, we need to apply Bayes’ theorem (for PDFs):
$$f(x \mid y) = \frac{f(y \mid x)\,\pi(x)}{\int_X f(y \mid \hat{x})\,\pi(\hat{x})\,d\hat{x}}$$
which requires knowledge (or assumption) of a prior probability density π(x) on parameter space.
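As a small illustration of handling the normalizing integral numerically (a hypothetical scalar-parameter example with Gaussian likelihood and prior; the grid-plus-quadrature approach is only feasible in low dimensions):

    xg    = linspace(-5, 5, 1001);                      % grid on parameter space X
    lik   = exp(-0.5 * (1.3 - xg).^2) / sqrt(2*pi);     % f(y|x) for one fixed y = 1.3
    prior = exp(-0.5 * (xg / 2).^2) / (2 * sqrt(2*pi)); % pi(x): N(0, 2^2) density
    post  = lik .* prior / trapz(xg, lik .* prior);     % Bayes: normalize via quadrature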
From the inverse problem to a nonlinear program: The actual program

The maximum likelihood estimate (MLE) is now found by computing
$$x_{\mathrm{MLE}} = \arg\max_{x} L(x)$$
If prior information on the parameter distribution is available, the maximum a posteriori estimate (MAP) can be computed as
$$x_{\mathrm{MAP}} = \arg\max_{x} L(x)\,\pi(x)$$
since the integral term in the Bayes’ expression is independent of x and thus does not play a role in the optimization.
From the inverse problem to a nonlinear program

The previously described expressions constitute generic nonlinear programs and can, in principle, be applied for arbitrary error distributions. We will discuss numerical methods that can indeed handle such arbitrary programs (with certain fundamental limitations to be discussed). The likelihood function becomes the objective function of the nonlinear program.
The results are POINT ESTIMATES. Techniques for estimating full posterior distributions will be the subject of the 2nd part of this course.
From the inverse problem to a nonlinear program: The actual program

Logarithmic transformation
Note that for likelihood functions arising from multiple independent observations, i.e.
$$L(x) = \prod_{i=1}^{k} f(y_i \mid x)$$
the logarithm, as a strictly increasing function, can be applied to the likelihood function without changing the location of the optimum, yielding a computationally convenient sum:
$$\ln L(x) = \sum_{i=1}^{k} \ln f(y_i \mid x)$$
which often helps to avoid floating point computation related precision losses, among other things.
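A quick MATLAB illustration of this numerical point (hypothetical standard-normal data): the raw product of densities underflows in double precision long before the sum of log densities runs into any trouble.

    y   = randn(1000, 1);                  % hypothetical N(0,1) data, k = 1000
    f   = exp(-0.5 * y.^2) / sqrt(2*pi);   % individual densities f(y_i | x)
    L   = prod(f)                          % underflows to 0 in double precision
    lnL = sum(log(f))                      % around -1.4e3: perfectly representable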
From the inverse problem to a nonlinear program
Example & special case: normally distributed errors

If the measurement errors are independent and normally distributed, i.e.
$$L(x) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(y_i - M_i(x))^2}{2\sigma_i^2}}$$
the log likelihood takes a familiar form (up to an additive constant):
$$\ln L(x) = -\sum_{i=1}^{k} \frac{(y_i - M_i(x))^2}{2\sigma_i^2} + \mathrm{const}$$
the (negative) sum of squared residuals, inversely weighted by the respective variances.
From the inverse problem to a nonlinear program
Example & special case: normally distributed errors

In the special case of independent observations with normally distributed errors, finding the MLE corresponds to the solution of the (weighted) nonlinear least squares problem. This will be particularly convenient in the estimation process for reasons to be discussed.
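A minimal sketch of this correspondence in MATLAB, assuming the Optimization Toolbox is available and reusing the hypothetical exponential decay model from above; lsqnonlin minimizes the sum of squared residuals, so the weights 1/sigma_i are folded into the residual vector:

    t     = linspace(0, 10, 20)';
    M     = @(x) x(1) * exp(-x(2) * t);             % hypothetical forward model
    sigma = 0.1 * ones(size(t));                    % known measurement std devs
    y     = M([2; 0.5]) + sigma .* randn(size(t));  % synthetic data
    r     = @(x) (y - M(x)) ./ sigma;               % weighted residual vector
    xMLE  = lsqnonlin(r, [1; 1])                    % MLE = weighted least squares fit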
From the inverse problem to a nonlinear program: A (slightly) more specific example

Estimating parameters and initial conditions for a nonlinear system of ordinary differential equations, given measurements of (functions of) system states at a number of timepoints.
From the inverse problem to a nonlinear program: A more specific example

Let the model be specified as an initial value problem for a system of 1st order nonlinear ordinary differential equations
$$\dot{s}(t) = f(t, s(t), p), \qquad s \in \mathbb{R}^{n_s}, \quad p \in \mathbb{R}^{n_p}$$
with initial conditions
$$s(0) = s_0$$
for which, under regularity assumptions for the RHS, a unique solution $s_{p,s(0)}(t)$ exists on some interval.
From the inverse problem to a nonlinear program: A more specific example

For now, let us assume that we are able to find the (existing and unique) s(t) by numerical integration for arbitrary initial conditions s(0) and parameter vectors p. Let us further assume that our experimental data takes the form of a finite set of observation vectors
$$o_i, \quad i = 1, \dots, n_t, \qquad o_i \in \mathbb{R}^{n_o}$$
taken at times $t_1, \dots, t_{n_t}$.
From the inverse problem to a nonlinear program: A more specific example

… and that observations can be computed from the solutions of our ODE system through a (in general nonlinear, parameter and time dependent) observation function
$$T : \mathbb{R}^{1 + n_s + n_p} \to \mathbb{R}^{n_o}, \quad (t, p, s_{p,s(0)}(t)) \mapsto T(t, p, s_{p,s(0)}(t))$$
From the inverse problem to a nonlinear program: A more specific example

To formulate the MLE problem, we lump the unknown parameters and initial conditions into one “parameter” vector to be estimated
$$x := \begin{pmatrix} s(0) \\ p \end{pmatrix}, \qquad x \in \mathbb{R}^{n_s + n_p}$$
and can now view the solution of our ODE system as a function of time, initial conditions and parameters
$$s_t(x) := s_{p,s(0)}(t)$$
From the inverse problem to a nonlinear program: A more specific example

… which allows us to write down the log likelihood for Gaussian errors as a function of the sought-after “parameter” vector x:
$$\ln L(x) = -\sum_{i=1}^{n_t} \frac{[o_i - T(s_{t_i}(x), x)]^2}{2\sigma_i^2} + \mathrm{const}$$
The MLE thus corresponds to the solution of the weighted least squares problem
$$x_{\mathrm{MLE}} = \arg\min_{x} \sum_{i=1}^{n_t} \frac{[o_i - T(s_{t_i}(x), x)]^2}{2\sigma_i^2}$$
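A minimal sketch of the ODE variant in MATLAB (a hypothetical scalar ODE s' = -p*s with T the identity, not a course example; lsqnonlin from the Optimization Toolbox again):

    tObs = (0.5:0.5:5)';  sig = 0.05;                        % observation times, noise level
    o    = 2 * exp(-0.7 * tObs) + sig * randn(size(tObs));   % synthetic data
    % x = [s0; p] lumps the initial condition and the parameter, as above
    pred = @(x) deval(ode45(@(t, s) -x(2) * s, [0, 5], x(1)), tObs)';
    r    = @(x) (o - pred(x)) / sig;                         % weighted residuals
    xMLE = lsqnonlin(r, [1; 1])                              % should land near [2; 0.7]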
From the inverse problem to a nonlinear program: A more specific example

Why introduce the observation function T? Example: fluid flow between compartments:
• ODE formulation in terms of mass/volume conservation at nodes: state variables are volumes
• Parametrization in terms of pressure-volume relationship
• Observable: intracompartmental pressure
Will become interesting when we discuss implementation issues, in particular those related to computing derivative information…
From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups

We have thus far assumed that our observation data is explicable by a single parameter/initial condition set (single experiment case). However, nothing prevents us from considering multi-experiment setups in which a subset of parameters/initial conditions is allowed to vary between subsets of the observations.
From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups

In a “multi-experiment” setting, we have multiple observations from $n_{ex}$ experiments
$$o_i^j, \quad i = 1, \dots, n_t^j, \qquad o_i^j \in \mathbb{R}^{n_o}, \quad \text{taken at times } t_1^j, \dots, t_{n_t^j}^j, \qquad j = 1, \dots, n_{ex}$$
From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups

We then need to decide which components of our lumped “parameter” vector x we will allow to vary between experiments (example: initial conditions could be different, parameters the same), resulting in a combined parameter vector for the overall problem…
From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups

Overall parameter vector:
$$x_{\mathrm{total}} = \begin{pmatrix} x_{\mathrm{const}} \\ x_{\mathrm{var}}^1 \\ \vdots \\ x_{\mathrm{var}}^{n_{ex}} \end{pmatrix}$$
In the Gaussian case, we then need to solve
$$x_{\mathrm{MLE}} = \arg\min_{x_{\mathrm{total}}} \sum_{j=1}^{n_{ex}} \sum_{i=1}^{n_t^j} \frac{[o_i^j - T(s_{t_i^j}(x_{\mathrm{const}}, x_{\mathrm{var}}^j), x_{\mathrm{const}}, x_{\mathrm{var}}^j)]^2}{2\sigma_i^2}$$
From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups

This is also worth mentioning since it introduces a particular structure into the objective function that can be exploited during the optimization process (block structured Jacobian, will discuss later); a sketch follows below.
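A sketch of how such a multi-experiment residual might be assembled (again for the hypothetical scalar decay model from above: shared parameter p, experiment-specific initial conditions; save as multiExpResiduals.m and pass to lsqnonlin):

    % xtotal = [p; s0_1; ...; s0_nex]; residuals of experiment j depend only on
    % p and s0_j, so the Jacobian of r wrt. xtotal is block structured.
    function r = multiExpResiduals(xtotal, data)  % data(j).t, data(j).o, data(j).sig
    p = xtotal(1);
    r = [];
    for j = 1:numel(data)
        s0   = xtotal(1 + j);                     % experiment-specific initial condition
        sol  = ode45(@(t, s) -p * s, [0, data(j).t(end)], s0);
        pred = deval(sol, data(j).t)';
        r    = [r; (data(j).o - pred) ./ data(j).sig];  % stack residual block j
    end
    end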
            The role of derivatives


To fully exploit the advantages of the
nonlinear programming formulation of the
inverse problem, derivative information is
required.
Interlude: Vector calculus
Notation

For a totally differentiable function
$$f : \mathbb{R}^n \to \mathbb{R}^m, \quad x \mapsto y = f(x), \quad \text{with} \quad \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{pmatrix}$$
the total differential $D_x f(x)$ is a linear map described by the m × n Jacobi matrix (“Jacobian”)
$$J_x(x) = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{pmatrix}$$
Interlude: Vector calculus
Notation

We will use the subscript to denote with respect to which argument we are differentiating when a function formally depends on multiple vector-valued arguments (board…).
Interlude: Vector calculus
Reminder: Chain rule

For 2 totally differentiable functions
$$f : \mathbb{R}^m \to \mathbb{R}^n, \quad y \mapsto z = f(y)$$
and
$$g : \mathbb{R}^l \to \mathbb{R}^m, \quad x \mapsto y = g(x)$$
the total differential of the composite function
$$h : \mathbb{R}^l \to \mathbb{R}^n, \quad h(x) = f(g(x))$$
is the product of the Jacobians
$$D_x h(x) = D_y f(g(x)) \, D_x g(x)$$
which, as a product of an n × m matrix and an m × l matrix, is an n × l matrix.

Basic stuff, I know, but checking dimensions can often serve as a plausibility check when things get messy during implementation…
Back to our MLE setup

Straightforward application of the chain rule, e.g., allows us to compute the Jacobian of our predicted observables in the previous example from the Jacobian of the observation function wrt. the system states and the Jacobian of the system states wrt. model parameters and initial conditions, as well as the gradient of the likelihood function wrt. parameters if we so desire.
The key issue: Model Jacobian

For all models for which the forward problem can be solved numerically:
• Finite differences (see the sketch below)
  • Forward: O(h)
  • Backward: O(h)
  • Central: O(h²)
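A sketch of a central-difference Jacobian for a generic forward model M (the per-coordinate step size heuristic h ~ eps^(1/3) is a common rule of thumb, not prescribed by the slides):

    function J = fdJacobian(M, x)
    % Central-difference approximation of the m x n Jacobian of M at x
    % (x a column vector, M returning a vector).
    y0 = M(x);
    n  = numel(x);  m = numel(y0);
    J  = zeros(m, n);
    for k = 1:n
        h       = eps^(1/3) * max(1, abs(x(k)));   % balances truncation vs. roundoff
        e       = zeros(n, 1);  e(k) = h;
        J(:, k) = (M(x + e) - M(x - e)) / (2 * h); % O(h^2) central difference
    end
    end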
The key issue: Model Jacobian

For models available in closed form: can be done exactly, either symbolically
• Manually
• Computer algebra system (Maple, Mathematica, Derive, etc.)
or via
• Automatic/algorithmic differentiation
Interlude: Algorithmic differentiation

• Basic idea: compute derivatives along with regular values as program code is run, using “augmented arithmetic” (see the toy sketch below)
• Accuracy limited only by machine precision
• 2 extreme cases: forward and backward “mode”; performance depends on the number of input and output values
• For an m × n Jacobian: n forward or m backward sweeps
• Technical implementation:
  • Code transformation
  • Operator overloading
  • Expression templates
• Will discuss practical issues (choice of implementation, seeding, etc.) when we get to them
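To make “augmented arithmetic” concrete, here is a toy forward-mode sketch using MATLAB operator overloading (dual numbers; illustrative only, supporting just +, .*, and sin; save as Dual.m):

    classdef Dual
        properties
            v  % value
            d  % derivative carried along (the "seed" direction)
        end
        methods
            function x = Dual(v, d)
                x.v = v;  x.d = d;
            end
            function z = plus(x, y)      % (f+g)' = f' + g'
                z = Dual(x.v + y.v, x.d + y.d);
            end
            function z = times(x, y)     % product rule
                z = Dual(x.v .* y.v, x.d .* y.v + x.v .* y.d);
            end
            function z = sin(x)          % chain rule for sin
                z = Dual(sin(x.v), cos(x.v) .* x.d);
            end
        end
    end

For example, z = sin(Dual(2, 1)) .* Dual(2, 1) evaluates f(x) = x*sin(x) at x = 2 with seed 1, and z.d returns f'(2) = sin(2) + 2*cos(2) to machine precision.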
The key issue: Model Jacobian
Jacobian of ODE solutions

For models specified as ordinary differential equations and solved using numerical integration:
• Modify integrator (fast):
  • Internal numerical differentiation (Bock)
  • Iterative approximation using directional derivatives
• Solve sensitivity equations (error control!)
The key issue: Model Jacobian
Jacobian of ODE solutions
A few remarks on numerical ODE integration (my personal take)

• ALWAYS use adaptive schemes with error control in an optimization setting
• Read the solver manual; understand the error control scheme and the meaning of the tunable parameters (at least superficially)
• If integration is slow using an explicit method, try an implicit method (stiff integrator)
• If an implicit method fails or shows poor performance, provide exact Jacobians (analytical, AD)
• If it still fails, closely inspect the point of failure (typical issues: physically positive state becomes negative due to roundoff error accumulation, system explodes, etc.)
• Think about RHS discontinuities; performance integrators will NOT let you be sloppy on that account
Jacobian of ODE solutions
Forward sensitivity equations

For an ODE system of the form
$$\dot{s}(t) = f(t, s(t), p), \qquad s \in \mathbb{R}^{n_s}, \quad p \in \mathbb{R}^{n_p}$$
the chain rule of differentiation for each parameter yields a set of $n_s$ differential equations describing the temporal evolution of the partial derivatives of the solution components of the ODE wrt. that parameter. With
$$\hat{s}_i(t) = \frac{\partial s(t)}{\partial p_i}$$
we obtain
$$\frac{d}{dt}\hat{s}_i(t) = D_s f(t, s(t), p)\,\hat{s}_i(t) + \frac{\partial f}{\partial p_i}, \qquad \hat{s}_i(t) \in \mathbb{R}^{n_s}$$
and
$$\hat{s}_i(0) = \frac{\partial s_0(p)}{\partial p_i}$$
To obtain sensitivities to initial conditions, introduce the initial conditions as additional parameters, for which
$$\hat{s}_j(0) = e_j$$
for the sensitivity to the jth initial condition component.
Jacobian of ODE solutions
Forward sensitivity equations

We thus end up with an augmented ODE system with a total of
$$n_s (n_p + 1)$$
equations that need to be solved. While this is expensive, the full range of error control methods for the solution of ODE systems is available. In addition, integrators exist (CVODES, e.g.) that exploit the special structure of this system to solve it more efficiently than would be possible for a general system (at least for the stiff case). Can nevertheless become prohibitively expensive for large systems.
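A minimal sketch of such an augmented system for the hypothetical scalar ODE s' = -p*s (treating s0 as an additional parameter gives ns(np+1) = 1*(2+1) = 3 equations):

    rhs = @(t, z, p) [ -p * z(1);          % original ODE: s' = -p*s
                       -z(1) - p * z(2);   % d/dt (ds/dp)  = df/ds * ds/dp + df/dp
                       -p * z(3) ];        % d/dt (ds/ds0) = df/ds * ds/ds0
    s0 = 2;  p = 0.7;
    sol = ode45(@(t, z) rhs(t, z, p), [0, 5], [s0; 0; 1]);  % e_j seed for s0
    % exact check: s = s0*exp(-p*t), ds/dp = -t*s0*exp(-p*t), ds/ds0 = exp(-p*t)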
Jacobian of ODE solutions
Adjoint sensitivity analysis

Alternative: adjoint sensitivity analysis.
• Useful when the gradient wrt. parameters is needed for relatively few scalar function(al)s of the ODE solution
• Idea: integrate the ODE system first, then use the solution to integrate backward to find sensitivities
• One backward integration per scalar function
• Derivation a little more complicated; implementation requires a solver that supports efficient backward integration => will discuss when we are ready to use it…
Jacobian of ODE solutions
Why can’t we just AD it?

• Differentiability…
• Adaptive iterative codes contain branching points where differentiability criteria may not be met
• This can cause automatically differentiated codes to be unreliable
• Since reasonable other methods are available, not the best option from my point of view
• However, some successful applications of AD to large legacy PDE codes have been reported
General remarks on implementation issues

• Tradeoff between development time and performance (i.e., MATLAB vs. C implementation)
• Related to the above: tradeoff between robustness and performance
• High performance code is more error prone (memory handling, etc.)
• Recommended procedure (my opinion): prototype in a high level interpreted language such as MATLAB, obtain reference results, then gradually transition to a performance implementation (if even necessary…)
• Test cases are essential!!
• Use plenty of comments; understanding your own reasonably complex code a few months after writing it may be very hard…
Inner workings of NLP codes
Why bother?

NLP is a mature field of research. We will probably be unable to improve on existing codes with justifiable time investment.
BUT
Understanding the underlying principles of typical methods, in particular with respect to
• Proof of convergence
• Analysis of rate of convergence
• Numerical implementation
Inner workings of NLP codes
Why bother?

… should enable us to better
• Select an appropriate method for a given problem
• Understand and monitor its behavior (what is the meaning of all those iteratively updated, algorithm specific numbers, and what do they tell us about the solution we obtain?)
• Diagnose failure if it occurs (typically when assumptions are not met)
• Perform technical performance tuning (e.g., adaptation to tuned linear algebra codes, parallelization, etc.)
Reminder & outlook: fundamental issue

Non-linear optimization/programming: Efficient techniques make use of local information about the objective function, i.e. Jacobian, Hessian
=>
Can only find local minima; the result depends on the “basin of attraction” you start in:

[Figure: objective function landscape with a “local” and a “global” minimum]

At least partially amenable to technical solutions (multiple shooting for ODEs, global optimization algorithms)
Basins of attraction: Algorithm dependence

Toy model: superposition of 2 sine waves, 2 free parameters, 20 “measurements”

[Figure: residual error after convergence as a function of initial guess for 3 different algorithms]
Assignment (1)

MATLAB & ODE warm-up exercises
Consider a sinusoidally forced van der Pol oscillator
$$\ddot{y} - \mu (1 - y^2)\, \dot{y} + y = a \cos(\omega t + \phi)$$
μ: coefficient determining the position dependent damping (the system becomes stiff for large μ)
a: amplitude of forcing
ω: frequency of forcing
φ: phase of forcing
Assignment (1)

MATLAB & ODE warm-up
• Implement this oscillator (hint: a non-forced version is part of the MATLAB examples) and solve it for mu=1 and mu=1000, and at least 1 setting of the forcing parameters where an effect is noticeable (use a=0 as test case; can you find interesting settings?), using ode45 and ode15s; observe & describe the effect of stiffness (hint: relevant timescale for mu=1 is 10s of seconds, for mu=1000, 1000s of seconds)
• Compute sensitivities by deriving the forward sensitivity equations and integrating these (feel free to use a CAS if you wish)
• Download & install sens_analysis_1.1 (http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=1480&objectType=FILE)
• Use it to evaluate sensitivities for the same parameter settings for which you integrated the sensitivity equations and compare results
• Compute values and sensitivities wrt. parameters and initial conditions for the following vector valued observable for mu=1, a=100, omega=pi, phi=0, t=[0,20]:
$$o(t) := \begin{pmatrix} y(t)^2 \\ \dot{y}(t)^2 + y(t)^2 \\ \dot{y}(t) \end{pmatrix}$$
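A starter sketch for the first bullet (a minimal first-order form s = [y; y'] of the forced oscillator; parameter values are just the stiff test case from the hints, and tolerances are illustrative):

    vdpForced = @(t, s, mu, a, om, phi) ...
        [ s(2);
          mu * (1 - s(1)^2) * s(2) - s(1) + a * cos(om * t + phi) ];
    mu = 1000;  a = 0;  om = pi;  phi = 0;          % stiff, unforced test case
    opts = odeset('RelTol', 1e-6, 'AbsTol', 1e-8);  % adaptive error control
    [t, s] = ode15s(@(t, s) vdpForced(t, s, mu, a, om, phi), [0, 3000], [2; 0], opts);
    plot(t, s(:, 1))                                % relaxation oscillations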
Assignment (2) OPTIONAL, WILL NOT BE GRADED AT THIS POINT

• Implement (or provide an existing implementation of) an ODE system that either is related to your research or interests you for other reasons, and for which you wish to attempt parameter estimation
• Next week's interactive component will in part be for you to briefly (3-5 minutes, 3-5 slides) describe the system, show solutions for example parameter sets, and any data that you may wish to fit to
• The idea here is that, if suitable, we will use these systems for assignments, performance comparisons, etc., so that the implementation work you do for the course does not go to waste
• Do not hesitate if you think the problem difficult; grades will be assigned based on the quality of the work, not on speed of success (and there will always be toy models as alternative options)

				