Maximum likelihood (ML) and likelihood ratio (LR) test

•   Conditional distribution and likelihood
•   Maximum likelihood estimator
•   Information in the data and likelihood
•   Observed and Fisher’s information
•   Likelihood ratio test
•   Exercise
      Conditional probability distribution and likelihood
Let us assume that we know that our random sample points came from a population
     whose distribution has parameter(s) $\theta$. We do not know $\theta$. If we
     knew it, we could write the probability distribution of a single
     observation, $f(x \mid \theta)$. Here $f(x \mid \theta)$ is the conditional
     distribution of the observed random variable given the parameter. If we
     observe n independent sample points from the same population then the joint
     conditional probability distribution of all observations can be written:
          $$f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
We can write the product of the individual probability distributions because the
     observations are independent (conditionally, when the parameters are known).
     $f(x \mid \theta)$ is the probability of an observation in the discrete case
     and the density of the distribution in the continuous case.
We can interpret $f(x_1, x_2, \ldots, x_n \mid \theta)$ as the probability of observing
     the given sample points if we knew the parameter $\theta$. If we varied the
     parameter we would get different values for the probability f. Since f is a
     probability distribution, the parameters are fixed and the observations vary.
     For a given observation we define the likelihood to be equal to the
     conditional probability distribution:
          $$L(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1, x_2, \ldots, x_n \mid \theta)$$
    Conditional probability distribution and likelihood: Cont.
When we talk about the conditional probability distribution of the observations given
    the parameter(s), we assume that the parameters are fixed and the observations vary.
    When we talk about the likelihood, the observations are fixed and the parameters vary.
    That is the major difference between likelihood and conditional probability
    distribution. Sometimes, to emphasize that the parameters vary and the observations
    are fixed, the likelihood is written as:
          $$L(\theta \mid x_1, x_2, \ldots, x_n)$$
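The difference between the two views can be checked numerically. The sketch below assumes a simple model of independent 0/1 observations with success probability theta (the same kind of model the success and failure example uses later); the function name, the sample, and the grid are hypothetical choices made for illustration.

```python
import itertools
import math

def joint_prob(ys, theta):
    """f(y_1, ..., y_n | theta) for independent 0/1 observations (assumed model)."""
    return math.prod(theta if y == 1 else 1.0 - theta for y in ys)

# Probability view: theta is fixed, the observations vary.
theta_fixed = 0.3
all_samples = itertools.product([0, 1], repeat=3)
total_probability = sum(joint_prob(ys, theta_fixed) for ys in all_samples)

# Likelihood view: the observations are fixed, theta varies.
observed = (1, 0, 1)  # hypothetical sample
grid = [i / 100 for i in range(1, 100)]
likelihood_curve = [joint_prob(observed, t) for t in grid]
area_under_likelihood = sum(likelihood_curve) / 100  # crude Riemann sum over (0, 1)

print(total_probability)      # probabilities over all samples add up to one
print(area_under_likelihood)  # the likelihood is not normalised over theta
```

As a function of the observations f is a proper probability distribution, whereas as a function of the parameter it does not have to integrate to one.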

In this and the following lectures we will use one notation for probability and likelihood.
       When we talk about probability we will assume that the observations vary,
       and when we talk about likelihood we will assume that the parameters vary.
The principle of maximum likelihood states that the best parameters are those that maximise
       the probability of observing the current values of the observations. Maximum
       likelihood chooses the parameters $\hat{\theta}$ that satisfy, for all $\theta$:
          $$L(x_1, x_2, \ldots, x_n \mid \hat{\theta}) \ge L(x_1, x_2, \ldots, x_n \mid \theta)$$
                                     Maximum likelihood
The purpose of maximum likelihood is to maximize the likelihood function and so estimate
     the parameters. If the derivatives of the likelihood function exist then this can
     be done using:
          $$\frac{dL(x_1, x_2, \ldots, x_n \mid \theta)}{d\theta} = 0$$
Solutions of this equation give candidate values for the maximum likelihood estimator
      (they may also include minima and saddle points, so each must be checked).
      If the solution is unique then it will be the only estimator. In real applications
      there might be many solutions.
Usually the logarithm of the likelihood is maximized instead of the likelihood itself.
      Since log is a strictly monotonically increasing function, the derivative of the
      likelihood and the derivative of the log of the likelihood have exactly the same
      roots. If we use the fact that the observations are independent then the joint
      probability distribution of all observations is equal to the product of the
      individual probabilities. We can write the log of the likelihood (denoted as l):
          $$l(x_1, x_2, \ldots, x_n \mid \theta) = \ln L(x_1, x_2, \ldots, x_n \mid \theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$$

Usually working with sums is easier than working with products.
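There is also a numerical reason to prefer the sum of logs. A minimal sketch, assuming 2000 independent observations that each have probability 0.5 under the current parameter value (made-up numbers for illustration):

```python
import math

# Hypothetical values of f(x_i | theta) for 2000 independent observations.
probs = [0.5] * 2000

likelihood = math.prod(probs)                     # product of many small numbers
log_likelihood = sum(math.log(p) for p in probs)  # sum of their logarithms

print(likelihood)      # underflows to zero in double precision
print(log_likelihood)  # remains an ordinary finite number
```

The product is too small to be represented as a double, so it cannot be compared between parameter values, while the log-likelihood can.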
       Maximum likelihood: Example – success and failure
Let us consider two examples. The first example corresponds to a discrete probability
     distribution. Let us assume that we carry out trials. The possible outcomes of the
     trials are success or failure. The probability of success is $\theta$ and the
     probability of failure is $1-\theta$. We do not know the value of $\theta$. Let us
     assume we have n trials, of which k are successes and n-k are failures. The value
     of the random variable describing each trial is either 0 (failure) or 1 (success).
     Let us denote the observations as $y=(y_1, y_2, \ldots, y_n)$. The probability of
     the observation $y_i$ at the ith trial is:
          $$f(y_i \mid \theta) = \theta^{y_i} (1-\theta)^{1-y_i}$$
Since individual trials are independent we can write for n trials:
          $$L(y_1, y_2, \ldots, y_n \mid \theta) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i}$$
For the log of this function we can write:
          $$l(y_1, y_2, \ldots, y_n \mid \theta) = \sum_{i=1}^{n} \left( y_i \ln\theta + (1-y_i) \ln(1-\theta) \right)$$
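The derivative of the log-likelihood w.r.t. the unknown parameter is:
          $$\frac{dl}{d\theta} = \frac{\sum_{i=1}^{n} y_i}{\theta} - \frac{\sum_{i=1}^{n} (1-y_i)}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{k}{n}$$
The estimator for the parameter is equal to the fraction of successes.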
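The closed-form result can be compared with a brute-force maximisation of the log-likelihood. This is only a sketch: the ten observations and the grid resolution are hypothetical choices.

```python
import math

def log_lik(theta, ys):
    """Log-likelihood of independent 0/1 observations with success probability theta."""
    return sum(y * math.log(theta) + (1 - y) * math.log(1.0 - theta) for y in ys)

ys = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # hypothetical trials: 6 successes out of 10

theta_closed_form = sum(ys) / len(ys)  # fraction of successes, k / n

# Evaluate the log-likelihood on a grid strictly inside (0, 1) and keep the best point.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_lik(t, ys))

print(theta_closed_form, theta_grid)
```

The grid search is far less efficient than the formula, but it uses nothing except the definition of the likelihood, so agreement between the two is a useful check.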
         Maximum likelihood: Example – success and failure
In the example of successes and failures the result was not unexpected and we could
       have guessed it intuitively. More interesting problems arise when the parameter
       $\theta$ itself becomes a function of some other parameters and possibly of
       observations as well. Let us say:
          $$\theta = \theta(\alpha + \beta x_i)$$

It may happen that the $x_i$ themselves are also random variables. If the function
     $\theta(\cdot)$ is the cumulative distribution function of the normal distribution
     then the analysis is called probit analysis. The log-likelihood function then
     looks like:
          $$l(y_1, \ldots, y_n, x_1, \ldots, x_n \mid \alpha, \beta) = \sum_{i=1}^{n} \left( y_i \ln\theta(\alpha + \beta x_i) + (1-y_i) \ln(1 - \theta(\alpha + \beta x_i)) \right)$$


Finding the maximum of this function is more complicated. It can be treated as a
      non-linear optimization problem. Problems of this kind are usually solved
      iteratively, i.e. a solution to the problem is guessed and then improved step
      by step.
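As an illustration of such an iterative solution, the sketch below fits a probit model in which the success probability is the normal cumulative distribution function of a + b*x. It uses Fisher scoring with step halving, which is one common scheme and not necessarily the one intended in the lecture. The names a and b, the helper functions, the starting guess, and the data are all hypothetical.

```python
import math

def prob(a, b, x):
    """Success probability: normal CDF of a + b*x, clamped away from 0 and 1."""
    p = 0.5 * (1.0 + math.erf((a + b * x) / math.sqrt(2.0)))
    return min(max(p, 1e-12), 1.0 - 1e-12)

def log_lik(a, b, xs, ys):
    return sum(y * math.log(prob(a, b, x)) + (1 - y) * math.log(1.0 - prob(a, b, x))
               for x, y in zip(xs, ys))

def score_and_info(a, b, xs, ys):
    """Gradient of the log-likelihood and expected information matrix (2 x 2)."""
    g = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        eta = a + b * x
        p = prob(a, b, x)
        d = math.exp(-0.5 * eta * eta) / math.sqrt(2.0 * math.pi)  # normal density
        w = d / (p * (1.0 - p))
        for r, u in enumerate((1.0, x)):
            g[r] += (y - p) * w * u
            for c, v in enumerate((1.0, x)):
                info[r][c] += d * w * u * v
    return g, info

def fit_probit(xs, ys, iterations=50):
    a, b = 0.0, 0.0  # initial guess
    for _ in range(iterations):
        g, m = score_and_info(a, b, xs, ys)
        det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
        da = (m[1][1] * g[0] - m[0][1] * g[1]) / det  # solve info * step = score
        db = (m[0][0] * g[1] - m[1][0] * g[0]) / det
        step, current = 1.0, log_lik(a, b, xs, ys)
        # Halve the step until the log-likelihood does not decrease.
        while step > 1e-8 and log_lik(a + step * da, b + step * db, xs, ys) < current:
            step /= 2.0
        a, b = a + step * da, b + step * db
    return a, b

xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]  # hypothetical covariate
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1]                        # hypothetical outcomes
a_hat, b_hat = fit_probit(xs, ys)
print(a_hat, b_hat, log_lik(a_hat, b_hat, xs, ys))
```

Each pass improves the current guess using the gradient and the expected information. At convergence the gradient is numerically zero, which is the multi-parameter analogue of setting the derivative of the log-likelihood to zero.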
        Maximum likelihood: Example – normal distribution
Now let us assume that the sample points came from a population with normal
         distribution with unknown mean and variance. Let us assume that we have n
         observations, $y=(y_1, y_2, \ldots, y_n)$. We want to estimate the population
         mean and variance. Then the log-likelihood function will have the form:
          $$l(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) = \sum_{i=1}^{n} \ln\left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i-\mu)^2}{2\sigma^2}} \right) = -n \ln\sqrt{2\pi} - \frac{n}{2} \ln\sigma^2 - \sum_{i=1}^{n} \frac{(y_i-\mu)^2}{2\sigma^2}$$
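If we take the derivatives of this function w.r.t. the mean and the variance then we can write:
          $$\frac{dl}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$$
          $$\frac{dl}{d(\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - \mu)^2 = 0$$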


Fortunately the first of these equations can be solved without knowledge of the
     second one. If we then use the result from the first equation in the second
     (substitute $\mu$ by its estimate $\hat{\mu}$) we can solve the second equation
     also. The result is the sample variance:
          $$s^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2$$
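A small numerical check of these two estimators, assuming a handful of made-up observations. The variable names are illustrative, and the standard library `statistics` module is used only for comparison.

```python
import math
import statistics

def log_lik(mu, var, ys):
    """Normal log-likelihood for mean mu and variance var."""
    n = len(ys)
    return (-0.5 * n * math.log(2.0 * math.pi) - 0.5 * n * math.log(var)
            - sum((y - mu) ** 2 for y in ys) / (2.0 * var))

ys = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]  # hypothetical observations
n = len(ys)

mu_hat = sum(ys) / n                                 # ML estimate of the mean
var_ml = sum((y - mu_hat) ** 2 for y in ys) / n      # ML estimate: divides by n
var_unbiased = var_ml * n / (n - 1)                  # usual estimate: divides by n - 1

print(mu_hat, statistics.fmean(ys))
print(var_ml, statistics.pvariance(ys))      # pvariance also divides by n
print(var_unbiased, statistics.variance(ys)) # variance divides by n - 1
print(log_lik(mu_hat, var_ml, ys), log_lik(mu_hat, var_unbiased, ys))
```

The maximum likelihood variance is always smaller than the n-1 version, and it is the one that gives the larger log-likelihood.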
       Maximum likelihood: Example – normal distribution
The maximum likelihood estimator in this case gave the sample mean and the biased
    sample variance (it divides by n rather than n-1). Many statistical techniques are
    based on maximum likelihood estimation of the parameters when the observations are
    distributed normally. All parameters of interest are usually inside the mean
    value. In other words $\mu$ is a function of several parameters:
          $$\mu = g(x_1, x_2, \ldots, x_n, \beta_1, \beta_2, \ldots, \beta_k)$$
Then the problem is to estimate the parameters using the maximum likelihood estimator.
      Usually the x-s are either fixed values (fixed effects model) or random variables
      (random effects model). The parameters are the $\beta$-s. If this function is
      linear in the parameters then we have linear regression.
If the variances are known then the maximum likelihood estimator for observations
      with normal distribution becomes the least-squares estimator.
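The step from maximum likelihood to least squares can be written out explicitly. A short sketch, assuming independent observations with means $g(x_i, \beta)$ and a common known variance $\sigma^2$ (the per-observation notation $g(x_i, \beta)$ is introduced here only for illustration):

```latex
l(y \mid \beta) = -\frac{n}{2} \ln(2\pi\sigma^2)
                  - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y_i - g(x_i, \beta) \bigr)^2

\hat{\beta} = \arg\max_{\beta} \, l(y \mid \beta)
            = \arg\min_{\beta} \sum_{i=1}^{n} \bigl( y_i - g(x_i, \beta) \bigr)^2
```

The first term does not depend on $\beta$ and the factor $1/(2\sigma^2)$ is a positive constant, so maximising the log-likelihood is the same as minimising the sum of squared residuals.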
              Information matrix: Observed and Fisher’s
One of the important aspects of the likelihood function is its behavior near the
     maximum. If the likelihood function is flat then the observations have little to
     say about the parameters, because changes in the parameters will not cause
     large changes in the probability. That is to say, the same observation can be
     observed with similar probabilities for various values of the parameters. On the
     other hand, if the likelihood has a pronounced peak near the maximum then small
     changes in the parameters cause large changes in the probability. In this case
     we say that the observation has more information about the parameters. It is
     usually expressed as the second derivative (or curvature) of the log-likelihood
     function. Observed information is equal to the second derivative of the minus
     log-likelihood function:
          $$I_o(\theta) = -\frac{d^2 l(y \mid \theta)}{d\theta^2}$$
When there is more than one parameter it is called the information matrix.
Usually it is calculated at the maximum of the likelihood. This information is
     different from that defined using entropy.
Example: In the case of successes and failures we can write:
          $$I_o(\theta) = \frac{\sum_{i=1}^{n} y_i}{\theta^2} + \frac{\sum_{i=1}^{n} (1-y_i)}{(1-\theta)^2}$$
              Information matrix: Observed and Fisher’s
The expected value of the observed information matrix is called the expected
     information matrix or Fisher's information. The expectation is taken over the
     observations:
          $$I(\theta) = E(I_o(\theta))$$
It can be calculated at any value of the parameter. A remarkable fact about Fisher's
      information matrix is that it is also equal to the expected value of the product
      of the gradients (first derivatives):
          $$I(\theta) = -E\left( \frac{d^2 l(y \mid \theta)}{d\theta^2} \right) = E\left( \frac{dl(y \mid \theta)}{d\theta} \, \frac{dl(y \mid \theta)}{d\theta} \right)$$
Note that the observed information matrix depends on the particular observation whereas
     the expected information matrix depends only on the probability distribution of
     the observations. (It is the result of integration: when we integrate over
     variables we lose dependence on particular values of these variables.)
When the sample size becomes large the maximum likelihood estimator becomes
     approximately normally distributed with variance close to:
          $$I^{-1}(\theta) \quad \text{or} \quad I_o^{-1}(\theta)$$
Fisher points out that the inverse of the observed information matrix gives a slightly
      better estimate of the variance than the inverse of the expected information matrix.
                    Information matrix: Observed and Fisher’s
A more precise relation between expected information and variance is given by the
     Cramer-Rao inequality. According to this inequality the variance of an unbiased
     estimator, and hence of the maximum likelihood estimator when it is unbiased,
     can never be less than the inverse of the information:
          $$\mathrm{var}(\hat{\theta}) \ge I^{-1}(\theta)$$

Now let us consider the example of successes and failures. If we take the expectation
     of the second derivative of the minus log-likelihood function we get:
          $$I(\theta) = -E\left( \frac{d^2 l(y \mid \theta)}{d\theta^2} \right) = E\left( \frac{\sum_{i=1}^{n} y_i}{\theta^2} + \frac{\sum_{i=1}^{n} (1-y_i)}{(1-\theta)^2} \right) = \frac{\sum_{i=1}^{n} E(y_i)}{\theta^2} + \frac{\sum_{i=1}^{n} E(1-y_i)}{(1-\theta)^2} = \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} = \frac{n}{\theta(1-\theta)}$$
If we take this at the point of maximum likelihood then we can say that the variance of
      the maximum likelihood estimator can be approximated by:
          $$\mathrm{var}(\hat{\theta}) \approx \frac{\hat{\theta}(1-\hat{\theta})}{n}$$

This statement is true for large sample sizes.
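A simulation gives a feel for this approximation. The sketch below assumes a true success probability of 0.3 and 200 trials per sample. These values, the number of replicates, and the random seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is reproducible

theta_true, n, replicates = 0.3, 200, 5000

# Repeat the experiment many times and record the ML estimate k / n each time.
estimates = []
for _ in range(replicates):
    k = sum(1 for _ in range(n) if random.random() < theta_true)
    estimates.append(k / n)

empirical_var = statistics.pvariance(estimates)  # spread of the estimates
predicted_var = theta_true * (1.0 - theta_true) / n  # inverse Fisher information

print(empirical_var, predicted_var)
```

The two numbers should agree to within simulation noise. In practice the true parameter is unknown, so the estimate is plugged into the formula instead, which is where the large-sample approximation enters.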
                                         Likelihood ratio test
Let us assume that we have a sample of size n, $x=(x_1, \ldots, x_n)$, and we want to estimate a
    parameter vector $\theta=(\theta_1, \theta_2)$. Both $\theta_1$ and $\theta_2$ can also be vectors.
    We want to test the null hypothesis against the alternative one:
          $$H_0: \theta_1 = \theta_{10} \quad \text{against} \quad H_1: \theta_1 \ne \theta_{10}$$
Let us assume that the likelihood function is $L(x \mid \theta)$. Then the likelihood ratio test works
    as follows: 1) maximise the likelihood function under the null hypothesis (i.e. fix the
    parameter(s) $\theta_1$ equal to $\theta_{10}$) and find the value of the likelihood at the maximum;
    2) maximise the likelihood under the alternative hypothesis (i.e. unconditional
    maximisation) and find the value of the likelihood at the maximum; then find the ratio:
          $$w = \frac{L(x \mid \theta_{10}, \hat{\theta}_2)}{L(x \mid \hat{\hat{\theta}}_1, \hat{\hat{\theta}}_2)}$$
    $\hat{\theta}_2$ is the value of the parameter after constrained ($\theta_1 = \theta_{10}$) maximisation;
    $\hat{\hat{\theta}}_1, \hat{\hat{\theta}}_2$ are the values of both parameters after unconstrained maximisation.
w is the likelihood ratio statistic. Tests carried out using this statistic, or some other
    statistic related to it, are called likelihood ratio tests. In this case it is clear that:
          $$0 \le w \le 1$$
If the value of w is small then the null hypothesis is rejected. If $g(w)$ is the density of the
     distribution of w under the null hypothesis, then the critical value c for
     significance level $\alpha$ can be calculated using:
          $$\int_0^c g(w)\,dw = \alpha$$
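For a concrete case, consider again the success and failure model and test whether the success probability equals a given value. There is no second parameter here, so the constrained maximisation is trivial. Because the number of successes is discrete, the distribution of w under the null hypothesis can be enumerated exactly and a sum replaces the integral. The function names, the data (16 successes in 20 trials), and the significance level 0.05 are hypothetical.

```python
import math

def log_lik(theta, k, n):
    """Log-likelihood for k successes in n trials, using the convention 0*ln(0) = 0."""
    value = 0.0
    if k > 0:
        value += k * math.log(theta)
    if k < n:
        value += (n - k) * math.log(1.0 - theta)
    return value

def lr_statistic(k, n, theta0):
    """w = L(theta0) / L(theta_hat), where theta_hat = k / n is the unconstrained MLE."""
    return math.exp(log_lik(theta0, k, n) - log_lik(k / n, k, n))

def prob_w_at_most(w_obs, n, theta0):
    """P(w <= w_obs) under the null hypothesis, by enumerating every possible k."""
    total = 0.0
    for k in range(n + 1):
        # The small relative tolerance keeps outcomes whose w ties with w_obs.
        if lr_statistic(k, n, theta0) <= w_obs * (1.0 + 1e-9):
            total += math.comb(n, k) * theta0 ** k * (1.0 - theta0) ** (n - k)
    return total

n, k, theta0, alpha = 20, 16, 0.5, 0.05
w = lr_statistic(k, n, theta0)
tail = prob_w_at_most(w, n, theta0)
reject = tail <= alpha  # observed w falls in the critical region

print(w, tail, reject)
```

Rejecting when the probability of a ratio as small as the observed one is at most alpha is equivalent to comparing w with the critical value c.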
                                               Exercise 2
a) Assume that we have n sample points independently drawn from a population
     with the probability distribution
          $$f(x = k \mid \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$
What is the maximum likelihood estimator for $\lambda$? What are the observed and expected
     information?

b) Let us assume that we have a sample of size n of two-dimensional vectors
     $(x_1, x_2) = ((x_{11}, x_{21}), (x_{12}, x_{22}), \ldots, (x_{1n}, x_{2n}))$ from the normal distribution:
          $$L(x \mid \mu_1, \mu_2, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{1}{2\sigma^2} \left( (x_1-\mu_1)^2 + (x_2-\mu_2)^2 \right)}$$



Find the maximum of the likelihood under the following hypotheses:
          $$H_0: \mu_1 = \mu_{10} \quad \text{and alternative hypothesis} \quad H_1: \mu_1 \ne \mu_{10}$$
Try to find the likelihood ratio statistic.
Note that the variance is unknown.

						