Your Federal Quarterly Tax Payments are due April 15th Get Help Now >>

Linear and generalised linear models by Vikcyb7


									              Linear and generalised linear models
•   Purpose of linear models
•   Solution for linear models
•   Some statistics related with the linear model
•   Analysis of diagnostics
•   Exponential family and generalized linear model
                                  Reason for linear models
Purpose of regression is to reveal statistical relations between input and output variables.
      Statistics cannot reveal functional relationship. It is purpose of other scientific studies.
      Statistics can help validation various functional relationship. Let us assume that we suspect
      that functional relationship is
                                        y  f ( x,  ,  )
where  is a vector of unknown parameters, x=(x1,x2,,,xp) a vector of controllable parameters, and
      y is output,  is an error associated with the experiment. Then we can set for various values
      of x experiment and get output (or response) for them. If number of experiments is n then
      we will have n output values. Denote them as a vector y=(y1,y2,,,,yn). Purpose of statistics
      is to evaluate parameter vector using input and output values. If function f is a linear
      function of the parameters and errors are additive then we are dealing with linear model.
      For this model we can write
                           yi  1 x1i   2 x2i  ...   p x pi   i
Note that linear model is linearly dependent on parameters but not on input variables. For
                         y   0  1 x1   2 x12  
is a linear model. But
                     y   0  1 x1  12 x2  
is not a linear model.
Basic assumptions for analysis of linear model are:
1)    the model is linear in parameters
2)    the error structure is additive
3)    Random errors have 0 mean and equal variances and errors are uncorrelated.

These assumptions are sufficient to deal with linear models. Uncorrelated with equal
     variance assumption can be removed. Then the treatments becomes a little bit
     more complicated.
Note that for general solution normality assumption is not used. This assumption is
     necessary to design test statistics.
These assumption can be written in a vector form:
                   y  Xβ  ε, E(ε )  0, V (ε )  σ 2I
where y, 0, I,  are vectors and X is a matrix. This matrix is called design matrix,
     input matrix etc. I is nxn identity matrix.
Solution with given model and assumptions is:
                                         y T X 1 ) X T X (  β
If we use the form of the model and write least squares equation (since we want to find solution
      with minimum least-squares error):
                    S  ε Tε  ( y  Xβ)T ( y  Xβ)  min
and get the first and the second derivatives and solve the equation then we can see that this
      solution is correct. S     T      T
                                              2( X Xβ  X y )
                                                2X T X
                                        ββ 2

This solution is unbiased. If we use the formula for the solution and the expression of y then we
      can write:
        E (β)  E ((X T X ) 1 X T y )  E ((X T X ) 1 X T ( Xβ  ε))  E ((X T X ) 1 X T Xβ  ( X T X ) 1 X Tε)
         E (β)  ( X T X ) 1 X T E (ε)  β
So solution is unbiased. Variance of estimation is:
         ˆ         ˆ       ˆ
      V (β)  E (( β  β)( β  β)T )  E (( X T X ) 1 X Tεε T X ( X T X ) 1 )  ( X T X ) 1 X T E (εε T ) X ( X T X ) 1   2 ( X T X ) 1
Here we used the form of the solution and assumption 3)
To calculate variance we need to be able to calculate 2. Since it is variance of the error
     term we can find it using form of the solution. For the estimated error (denoted by
     e) we can write:
                    r  y  Xβ  y  X( X T X) 1 X T y  (I  X( X T X) 1 X T )y
If we use:
                y  Xβ  ε
Immediately gives
                   r  (I  X( X T X) 1 X T )ε  Mε
Since matrix M is idempotent and symmetric i.e. M2=M=MT we can write:
     E (r Tr )  E (ε T Mε)  E (tr (ε T Mε))  E (tr (Mεε T )  tr (ME (εε T ) 
      2tr (M )   2 (tr (I )  tr ( X(X T X) 1 X T )   2 (n  tr (( X T X) 1 X T X)) 
      2 (n  p)
Where n is the number of the observations and p is the number of the fitted parameters.
    Then for unbiased estimator for variance of the residual we can write:
                                 ˆ          ˆ
                     2  ( y  Xβ)T ( y  Xβ) /(n  p )
                                          Singular case
This forms of the solution is true if matrices X and XTX are non-singular. I.e. rank of matrix X
       is equal to the number of parameters. If it is not true then either singular value
       decomposition or eignevalue filtering techniques are used. Fortunately most good
       properties of the linear model remains.
Singular value decomposition: Any nxp matrix can be decomposed in a form:
                                      X  UDV
Where U is nxn and V is pxp orthogonal matrices. I.e.multiplication of transpose of the matrix
     with itself gives unit matrix. D is nxp diagonal matrix of the singular values. If X is
     singular then number of non-zero diagonal elements of D is less than p. Then for XTX we
     can write:
                         X T X  (UDV )T UDV  V TDTU TUDV  V TDTDV
DTD is pxp diagonal matrix. If the matrix is non-singular then we can write:
                              ( X T X ) 1  V T ( DT D) 1 V
Since DTD is diagonal its inverse is the diagonal matrix with diagonals inversed. Main trick
      used in singular value decomposition techniques for equation solution is that when
      diagonals are 0 or close to 0 then instead of their inversion 0 is used. I.e. if E is the
      inverse of the DTD then pseudo inverse is calculated:
                   0       if corresponding diagonal of DTD is 0
            E 
             ii     T
               1/(D D)ii                    otherwise
                                      Test of hypothesis
Sometimes question arises if some of the parameters are significant. To test this type of
      hypothesis it is necessary to understand elements of likelihood ratio test. Let us assume
      that we want to test the following null-hypothesis vs alternative hypothesis:
                            H 0 : β1  0, vs, H1 : β1  0
where 1 is a subvector of the parameter vector. It is equivalent to saying that we want to test of
      one or several parameters are 0 or not. Likelihood ratio test for this case works like that.
      Let us assume we have the likelihood function for the parameters:
                                        L( y | β)
where parameters are partitioned into to subvectors:
                                   β  (β1 , β2 )
Then maximum likelihood estimators are found for two cases. 1st case is when whole parameter
     vector is assumed to be variable. 2nd case is when subvector 1 is fixed to a value defined
     by the null hypothesis. Then values of the likelihood function for this two cases is found
     and their ratio is calculated. Assume that L0 is value of the likelihood under null
     hypothesis (subvector is fixed to the given value) and L1 is under the alternative
     hypothesis. Then ratio of these values is found and statistics related to this ratio is found
     and is used for testing. Ratio is:
                                         lr 
If this ratio is sufficiently small then null-hypothesis is rejected. It is not always possible to find
       distribution of this ratio.
                     Likelihood ratio test for linear model
Let us assume that we have found maximum likelihood values for the variances under null and
       alternative hypothesis and they are:
                                   and 
                                   ˆ     ˆ

furthermore let us assume that n is the number of the observations, p is the number of all
       parameters and r is the number of the parameters we want test. Then it turns out that
       relevant likelihood ratio test statistic for this case is related with F distribution. Relevant
       random variable is:
                                   n  p  2  2
                                          ˆ     ˆ
                                     r      2
This random variable has F distribution with (r,n-p) degrees of freedom. It is true if the
       distribution of the errors is normal. As we know in this case maximum likelihood and
       least-squares coincide.
Note: Distribution becomes F distribution if null-hypothesis is true. If it is not true then
       distribution becomes non-central F distribution
Note: if there are two random variables distributed by 2 distribution with n and m degrees of
       freedom respectively then their ration has F distribution with (n,m) degrees of freedom.
                          Analysis of diagnostics
Residuals and hat matrix: Residuals are differences between observation and fitted
 r  y  y  y  Xβ  y  X(X T X) 1 X T y  (I  X(X T X) 1 X T )y  (I  H) y
H is called a hat matrix. Diagonal terms hi are leverage of the observations. If these
   values are close to one then that fitted value is determined by this observation.
   Sometimes hi’=hi/(1-hi) is used to enhance high leverages.
Q-Q plot can be used to check normality assumption. Q-Q plot is plot of quantiles of
   two distributions (in this case frequency distribution of residuals vs normal
   distribution). If assumption on distribution is correct then this plot should be nearly
Cook’s distance: Observations in turn are removed (it is equivalent to removing
   corresponding row from the design matrix X). Then parameters estimated without
   this observation. Cook’s distance is defined as:
                              ˆ                 ˆ
                             (β i  β) T X T X( β i  β)
                        Di 
                                        p 2
                        β is estimated without i - th observation
                            Analysis of diagnostics: Cont.
Other analysis tools include: / 2
            ri'  ri /(s(1  hi )1 ) standardized (stutendized) residual
             ri*  ri /(si (1  hi )1/ 2 ) externally standardized residual
             Ci  (hi ' )1/ 2 ri*   DFFITS
             Di  (ri ' ) 2 hi ' / p
Where hi is leverage, hi’ is enhanced leverage, s2 is unbiased estimator of 2, si2 is
  unbiased estimator of 2 after removal of i-th observation
                               Exponential family
Natural exponential family of distributions has a form:
                           L( y |  ,  )  e ( yA( )G ( y )  B ( )) / S ( )
S() is called a scale parameter. When A is identity then it is called. By change of
   variables A can be changed by identity. So it is usual to use
                              L( y |  ,  )  e ( y G ( y )  B ( )) / S ( )
Many distributions including normal, binomial, Poisson, exponential distributions
   belong to this family.
Moment generating function if exists for this distribution is:
                          M (t )  e ( B ( )  B ( tS ( )))/ S ( )
Then the first moment (mean value) and the second central moments can be calculated:
                          dB( )
              E( y)  
                                       d 2 B ( )
                E ( y  E ( y ))  
                2                  2
                                                  S ( )
                                         d 2
                              Generalised linear model
If the distribution of errors is one of the distributions from the exponential family and
       some function of the expected value of the observations is linear function of the
       parameters then generalised linear models are used:
                 g ( E ( y ))  g (  B' (1 ), , , B' ( n ))  Xβ
Function g is called the link function. Here is list of the popular distribution and
      corresponding link functions:
binomial - logit = ln(p/(1-p))
normal - identity
Gamma - inverse
poisson - log
All good statistical packages have implementation of many generalised linear models.
      To use them finding initial values might be necessary.
To fit using generalised linear model the likelihood function is used (if  is constant
      and we ignore constants not depending on parameter of interest):
                        l( y | ) 
                                    C ( )
                                            ( y   B( ))
                                         i 1
                                                i i   i

Then expression for natural exponential family can be used to fit a model. Most
     natural way is using =X
                 Additive model and non-linear models
Let us consider several non-linear models briefly.
1)    Additive model. If model is described as
                           y j   0   si ( xij )
                                          i 1
then model is called an additive model. Where si can be some set of functions.
      Usually they are smooth functions. These type of models are used for
2) If model is a non-linear function of the parameters and the input variables then it is
      called non-linear model. In general form it can be written:
                               yi  f ( x1i , , , xmi , 1 , , ,  p ,  i )

Form of the function depends on the subject studied. This type of models do not have
     closed form and elegant solutions. Non-linear least-squares may not have
     unique solution or may have many local minimum. This type of models are
     usually solved iteratively. I.e. initial values of the parameters are found and
     then using some optimisation techniques they are iteratively updated. Statistical
     properties of non-linear models is not straightforward to derive.
                                     Exercise: linear model
Consider hypothesis testing. We have n observation. Parameter vector has dimension p.
We have partitioning of the parameter vector like (dimension of 1 is r):
                                β  (β1 , β2 )
Corresponding partitioning of the design (input) matrix is

                                        X  ( X1 , X 2 )
Assume that all observations are distributed with equal variance normally and they are
     uncorrelated. Find maximum likelihood estimators for parameters and variance under
     null and alternative hypothesis:
                                     H 0 : β1  0 vs H1 : β1  0
Hint: -loglikelihood function under null hypothesis is (since 1=0):
                         n          1
                    l     ln  2  2 ( y  X 2β2 )T ( y  X 2β2 )
                         2         2

and under the alternative hypothesis:
                         n          1
                    l     ln  2  2 ( y  Xβ)T ( y  Xβ)
                         2         2

Find minimum of these functions. They will be maximum likelihood estimators.

To top