VIEWS: 5 PAGES: 14 POSTED ON: 12/14/2011
Linear and generalised linear models • Purpose of linear models • Solution for linear models • Some statistics related with the linear model • Analysis of diagnostics • Exponential family and generalized linear model Reason for linear models Purpose of regression is to reveal statistical relations between input and output variables. Statistics cannot reveal functional relationship. It is purpose of other scientific studies. Statistics can help validation various functional relationship. Let us assume that we suspect that functional relationship is y f ( x, , ) where is a vector of unknown parameters, x=(x1,x2,,,xp) a vector of controllable parameters, and y is output, is an error associated with the experiment. Then we can set for various values of x experiment and get output (or response) for them. If number of experiments is n then we will have n output values. Denote them as a vector y=(y1,y2,,,,yn). Purpose of statistics is to evaluate parameter vector using input and output values. If function f is a linear function of the parameters and errors are additive then we are dealing with linear model. For this model we can write yi 1 x1i 2 x2i ... p x pi i Note that linear model is linearly dependent on parameters but not on input variables. For example y 0 1 x1 2 x12 is a linear model. But y 0 1 x1 12 x2 is not a linear model. Assumptions Basic assumptions for analysis of linear model are: 1) the model is linear in parameters 2) the error structure is additive 3) Random errors have 0 mean and equal variances and errors are uncorrelated. These assumptions are sufficient to deal with linear models. Uncorrelated with equal variance assumption can be removed. Then the treatments becomes a little bit more complicated. Note that for general solution normality assumption is not used. This assumption is necessary to design test statistics. These assumption can be written in a vector form: y Xβ ε, E(ε ) 0, V (ε ) σ 2I where y, 0, I, are vectors and X is a matrix. This matrix is called design matrix, input matrix etc. I is nxn identity matrix. Solution Solution with given model and assumptions is: y T X 1 ) X T X ( β ˆ If we use the form of the model and write least squares equation (since we want to find solution with minimum least-squares error): S ε Tε ( y Xβ)T ( y Xβ) min and get the first and the second derivatives and solve the equation then we can see that this solution is correct. S T T 2( X Xβ X y ) β 2S 2X T X ββ 2 This solution is unbiased. If we use the formula for the solution and the expression of y then we can write: ˆ E (β) E ((X T X ) 1 X T y ) E ((X T X ) 1 X T ( Xβ ε)) E ((X T X ) 1 X T Xβ ( X T X ) 1 X Tε) E (β) ( X T X ) 1 X T E (ε) β So solution is unbiased. Variance of estimation is: ˆ ˆ ˆ V (β) E (( β β)( β β)T ) E (( X T X ) 1 X Tεε T X ( X T X ) 1 ) ( X T X ) 1 X T E (εε T ) X ( X T X ) 1 2 ( X T X ) 1 Here we used the form of the solution and assumption 3) Variance To calculate variance we need to be able to calculate 2. Since it is variance of the error term we can find it using form of the solution. For the estimated error (denoted by e) we can write: ˆ r y Xβ y X( X T X) 1 X T y (I X( X T X) 1 X T )y If we use: y Xβ ε Immediately gives r (I X( X T X) 1 X T )ε Mε Since matrix M is idempotent and symmetric i.e. M2=M=MT we can write: E (r Tr ) E (ε T Mε) E (tr (ε T Mε)) E (tr (Mεε T ) tr (ME (εε T ) 2tr (M ) 2 (tr (I ) tr ( X(X T X) 1 X T ) 2 (n tr (( X T X) 1 X T X)) 2 (n p) Where n is the number of the observations and p is the number of the fitted parameters. Then for unbiased estimator for variance of the residual we can write: ˆ ˆ 2 ( y Xβ)T ( y Xβ) /(n p ) ˆ Singular case This forms of the solution is true if matrices X and XTX are non-singular. I.e. rank of matrix X is equal to the number of parameters. If it is not true then either singular value decomposition or eignevalue filtering techniques are used. Fortunately most good properties of the linear model remains. Singular value decomposition: Any nxp matrix can be decomposed in a form: X UDV Where U is nxn and V is pxp orthogonal matrices. I.e.multiplication of transpose of the matrix with itself gives unit matrix. D is nxp diagonal matrix of the singular values. If X is singular then number of non-zero diagonal elements of D is less than p. Then for XTX we can write: X T X (UDV )T UDV V TDTU TUDV V TDTDV DTD is pxp diagonal matrix. If the matrix is non-singular then we can write: ( X T X ) 1 V T ( DT D) 1 V Since DTD is diagonal its inverse is the diagonal matrix with diagonals inversed. Main trick used in singular value decomposition techniques for equation solution is that when diagonals are 0 or close to 0 then instead of their inversion 0 is used. I.e. if E is the inverse of the DTD then pseudo inverse is calculated: 0 if corresponding diagonal of DTD is 0 E * ii T 1/(D D)ii otherwise Test of hypothesis Sometimes question arises if some of the parameters are significant. To test this type of hypothesis it is necessary to understand elements of likelihood ratio test. Let us assume that we want to test the following null-hypothesis vs alternative hypothesis: H 0 : β1 0, vs, H1 : β1 0 where 1 is a subvector of the parameter vector. It is equivalent to saying that we want to test of one or several parameters are 0 or not. Likelihood ratio test for this case works like that. Let us assume we have the likelihood function for the parameters: L( y | β) where parameters are partitioned into to subvectors: β (β1 , β2 ) Then maximum likelihood estimators are found for two cases. 1st case is when whole parameter vector is assumed to be variable. 2nd case is when subvector 1 is fixed to a value defined by the null hypothesis. Then values of the likelihood function for this two cases is found and their ratio is calculated. Assume that L0 is value of the likelihood under null hypothesis (subvector is fixed to the given value) and L1 is under the alternative hypothesis. Then ratio of these values is found and statistics related to this ratio is found and is used for testing. Ratio is: L1 lr L0 If this ratio is sufficiently small then null-hypothesis is rejected. It is not always possible to find distribution of this ratio. Likelihood ratio test for linear model Let us assume that we have found maximum likelihood values for the variances under null and alternative hypothesis and they are: ˆ and ˆ ˆ furthermore let us assume that n is the number of the observations, p is the number of all parameters and r is the number of the parameters we want test. Then it turns out that relevant likelihood ratio test statistic for this case is related with F distribution. Relevant random variable is: ˆ n p 2 2 ˆ ˆ F r 2 ˆ This random variable has F distribution with (r,n-p) degrees of freedom. It is true if the distribution of the errors is normal. As we know in this case maximum likelihood and least-squares coincide. Note: Distribution becomes F distribution if null-hypothesis is true. If it is not true then distribution becomes non-central F distribution Note: if there are two random variables distributed by 2 distribution with n and m degrees of freedom respectively then their ration has F distribution with (n,m) degrees of freedom. Analysis of diagnostics Residuals and hat matrix: Residuals are differences between observation and fitted values: ˆ r y y y Xβ y X(X T X) 1 X T y (I X(X T X) 1 X T )y (I H) y ˆ H is called a hat matrix. Diagonal terms hi are leverage of the observations. If these values are close to one then that fitted value is determined by this observation. Sometimes hi’=hi/(1-hi) is used to enhance high leverages. Q-Q plot can be used to check normality assumption. Q-Q plot is plot of quantiles of two distributions (in this case frequency distribution of residuals vs normal distribution). If assumption on distribution is correct then this plot should be nearly linear. Cook’s distance: Observations in turn are removed (it is equivalent to removing corresponding row from the design matrix X). Then parameters estimated without this observation. Cook’s distance is defined as: ˆ ˆ (β i β) T X T X( β i β) Di p 2 ˆ ˆ β is estimated without i - th observation i Analysis of diagnostics: Cont. Other analysis tools include: / 2 ri' ri /(s(1 hi )1 ) standardized (stutendized) residual ri* ri /(si (1 hi )1/ 2 ) externally standardized residual Ci (hi ' )1/ 2 ri* DFFITS Di (ri ' ) 2 hi ' / p Where hi is leverage, hi’ is enhanced leverage, s2 is unbiased estimator of 2, si2 is unbiased estimator of 2 after removal of i-th observation Exponential family Natural exponential family of distributions has a form: L( y | , ) e ( yA( )G ( y ) B ( )) / S ( ) S() is called a scale parameter. When A is identity then it is called. By change of variables A can be changed by identity. So it is usual to use L( y | , ) e ( y G ( y ) B ( )) / S ( ) Many distributions including normal, binomial, Poisson, exponential distributions belong to this family. Moment generating function if exists for this distribution is: M (t ) e ( B ( ) B ( tS ( )))/ S ( ) Then the first moment (mean value) and the second central moments can be calculated: dB( ) E( y) d d 2 B ( ) E ( y E ( y )) 2 2 S ( ) d 2 Generalised linear model If the distribution of errors is one of the distributions from the exponential family and some function of the expected value of the observations is linear function of the parameters then generalised linear models are used: g ( E ( y )) g ( B' (1 ), , , B' ( n )) Xβ Function g is called the link function. Here is list of the popular distribution and corresponding link functions: binomial - logit = ln(p/(1-p)) normal - identity Gamma - inverse poisson - log All good statistical packages have implementation of many generalised linear models. To use them finding initial values might be necessary. To fit using generalised linear model the likelihood function is used (if is constant and we ignore constants not depending on parameter of interest): n 1 l( y | ) C ( ) ( y B( )) i 1 i i i Then expression for natural exponential family can be used to fit a model. Most natural way is using =X Additive model and non-linear models Let us consider several non-linear models briefly. 1) Additive model. If model is described as p y j 0 si ( xij ) i 1 then model is called an additive model. Where si can be some set of functions. Usually they are smooth functions. These type of models are used for smoothening. 2) If model is a non-linear function of the parameters and the input variables then it is called non-linear model. In general form it can be written: yi f ( x1i , , , xmi , 1 , , , p , i ) Form of the function depends on the subject studied. This type of models do not have closed form and elegant solutions. Non-linear least-squares may not have unique solution or may have many local minimum. This type of models are usually solved iteratively. I.e. initial values of the parameters are found and then using some optimisation techniques they are iteratively updated. Statistical properties of non-linear models is not straightforward to derive. Exercise: linear model Consider hypothesis testing. We have n observation. Parameter vector has dimension p. We have partitioning of the parameter vector like (dimension of 1 is r): β (β1 , β2 ) Corresponding partitioning of the design (input) matrix is X ( X1 , X 2 ) Assume that all observations are distributed with equal variance normally and they are uncorrelated. Find maximum likelihood estimators for parameters and variance under null and alternative hypothesis: H 0 : β1 0 vs H1 : β1 0 Hint: -loglikelihood function under null hypothesis is (since 1=0): n 1 l ln 2 2 ( y X 2β2 )T ( y X 2β2 ) 2 2 and under the alternative hypothesis: n 1 l ln 2 2 ( y Xβ)T ( y Xβ) 2 2 Find minimum of these functions. They will be maximum likelihood estimators.