					            Chapter 4
Prediction and Bayesian Inference
• 4.1 Estimators versus predictors
• 4.2 Prediction for one-way ANOVA models
   – Shrinkage estimation, types of predictors
• 4.3 Best linear unbiased predictors (BLUPs)
• 4.4 Mixed model predictors
• 4.5 Bayesian inference
• 4.6 Case study: Forecasting lottery sales
• 4.7 Credibility Theory
• Appendix 4A Linear unbiased predictors
   4.1 Estimators versus predictors
• In the longitudinal data model, $y_{it} = z_{it}'\alpha_i + x_{it}'\beta + \varepsilon_{it}$, the
  variables $\{\alpha_i\}$ describe subject-specific effects.
• Given the data {yit, zit, xit}, in some problems it is of
  interest to “summarize” subject effects.
   – We have discussed how to estimate fixed, unknown
      parameters.
   – It is also of interest to summarize subject-specific
      effects, such as those described by the random variable
      $\alpha_i$.
• Predictors are “estimators” of random variables.
   – Like estimators, predictors are said to be linear if they
      are formed from a linear combination of the response y.
        Applications of prediction
• In animal and plant breeding, one wishes to predict the
  production of milk for cows based on (1) their lineage
  (random) and (2) herds (fixed)
• In credibility theory, one wishes to predict expected claims
  for a policyholder given exposure to several risk factors
• In sample surveys, one wishes to predict the size of a
  specific age-sex-race cohort within a small geographical
  area (known as “small area estimation”).
• In a survey article, Robinson (1991) also cites (1) ore
  reserve estimation in geological surveys, (2) measuring
  quality of a production plan and (3) ranking baseball
  players' abilities.
     4.2 Prediction for one-way ANOVA models
• Consider the traditional one-way random effects ANOVA
  (analysis of variance) model:
                          $y_{it} = \mu_\alpha + \alpha_i + \varepsilon_{it}$
   – Suppose that we wish to summarize the subject-specific
     conditional mean, $\mu_\alpha + \alpha_i$.
• For contrast, first consider using the fixed effects model
  with $\mu_\alpha = 0$.
   – Here, we have that $\bar{y}_i$ is the “best” (Gauss-Markov)
     estimate of $\alpha_i$.
   – This estimate is unbiased, that is, $\mathrm{E}\,\bar{y}_i = \alpha_i$.
   – This estimate has minimum variance among all linear
     unbiased estimators (BLUE).
                      Shrinkage estimator
• Using the one-way random effects model:
   – Consider an “estimator” of $\mu_\alpha + \alpha_i$ that is a linear
     combination of $\bar{y}_i$ and $\bar{y}$, that is, $c_1 \bar{y}_i + c_2 \bar{y}$,
     for constants $c_1$ and $c_2$.
• Calculations show that the best values of $c_1$ and $c_2$ that
  minimize $\mathrm{E}\left(c_1 \bar{y}_i + c_2 \bar{y} - (\mu_\alpha + \alpha_i)\right)^2$ are $c_2 = 1 - c_1$ and

  $$c_1 = \frac{T_i^{*}\,\sigma_\alpha^2}{\sigma_\varepsilon^2 + T_i^{*}\,\sigma_\alpha^2}, \qquad
    T_i^{*} = T_i\,\frac{1 - 2\,T_i N^{-1} + N^{-2}\sum_{j=1}^{n} T_j^2}{1 - T_i N^{-1}}.$$

• For large $n$ (so that $T_i N^{-1}$ and $N^{-2}\sum_j T_j^2$ are negligible and
  $T_i^{*} \approx T_i$), we have the shrinkage estimator, or predictor, of $\mu_\alpha + \alpha_i$:

  $$\bar{y}_{i,s} = \zeta_i\, \bar{y}_i + (1 - \zeta_i)\, \bar{y}, \qquad
    \text{where } \zeta_i = \frac{T_i \sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}.$$
    Example of shrinkage estimator
              Hypothetical Run Times for Three Machines

    Machine     Run Times                Average Run Time
    1           14, 12, 10, 12           $\bar{y}_1 = 12$
    2            9, 16, 15, 12           $\bar{y}_2 = 13$
    3            8, 10,  7,  7           $\bar{y}_3 = 8$
   – Notation: $y_{ij}$ means the $j$th run from the $i$th machine.
   – For example, $y_{21} = 9$ and $y_{23} = 15$.
• Are there real differences among machines?
                Example - Continued
• To see the “shrinkage” effect, compare the subject-specific
  means with their shrinkage estimators: each mean is pulled
  toward the overall mean $\bar{y} = 11$.
   – $\bar{y}_3 = 8 \rightarrow 8.525$,  $\bar{y}_1 = 12 \rightarrow 11.825$,  $\bar{y}_2 = 13 \rightarrow 12.650$.
•   Figure 4.1 Comparison of Subject-Specific Means to
    Shrinkage Estimators.
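
The values in Figure 4.1 can be reproduced with a short calculation. The sketch below (not part of the original slides) estimates the variance components with the usual one-way ANOVA moment estimators and then forms the shrinkage estimators; since the slides do not state how the figure was computed, the last digit may differ slightly. Variable names are illustrative.

```python
import numpy as np

# Hypothetical run times for the three machines (table above)
runs = np.array([[14, 12, 10, 12],
                 [ 9, 16, 15, 12],
                 [ 8, 10,  7,  7]], dtype=float)   # shape (n, T), balanced design
n, T = runs.shape
ybar_i = runs.mean(axis=1)    # subject-specific means: 12, 13, 8
ybar = runs.mean()            # overall mean: 11

# One-way ANOVA moment estimators of the variance components
mse = ((runs - ybar_i[:, None]) ** 2).sum() / (n * (T - 1))   # within mean square
msb = T * ((ybar_i - ybar) ** 2).sum() / (n - 1)              # between mean square
sigma2_eps = mse                       # estimate of sigma_epsilon^2
sigma2_alpha = (msb - mse) / T         # estimate of sigma_alpha^2

# zeta_i = T_i sigma_alpha^2 / (T_i sigma_alpha^2 + sigma_epsilon^2); here T_i = T
zeta = T * sigma2_alpha / (T * sigma2_alpha + sigma2_eps)

shrinkage = zeta * ybar_i + (1 - zeta) * ybar
print(np.round(shrinkage, 3))   # about [11.825, 12.651, 8.524]; compare Figure 4.1
```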
    More on shrinkage estimators
• Under the random effects model, $\bar{y}_i$ is an unbiased
  predictor of $\mu_\alpha + \alpha_i$ in the sense that $\mathrm{E}\left[\bar{y}_i - (\mu_\alpha + \alpha_i)\right] = 0$.
   – However, $\bar{y}_i$ is inefficient in the sense that $\bar{y}_{i,s}$ has a
     smaller mean square error than $\bar{y}_i$.
   – Here, $\bar{y}_i$ has been “shrunk” toward the stable estimator $\bar{y}$.
   – The “estimator” $\bar{y}_{i,s}$ is said to “borrow strength” from the
     stable estimator $\bar{y}$.
• Recall $\zeta_i = \dfrac{T_i \sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}$.

• Note that $\zeta_i \to 1$ as either (i) $T_i \to \infty$ or (ii) $\sigma_\alpha^2 / \sigma_\varepsilon^2 \to \infty$.
                    Best predictors
• From Section 3.1, it is easy to check that the generalized
  least squares estimator of $\mu_\alpha$ is

  $$m_{\alpha,GLS} = \frac{\sum_{i=1}^{n} \zeta_i\, \bar{y}_i}{\sum_{i=1}^{n} \zeta_i},
    \qquad \zeta_i = \frac{T_i \sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}.$$

• Among linear unbiased predictors of $\mu_\alpha + \alpha_i$, the one with
  minimum variance is $\bar{y}_{i,BLUP} = \zeta_i\, \bar{y}_i + (1 - \zeta_i)\, m_{\alpha,GLS}$.
   – Here, the acronym BLUP stands for best linear unbiased
     predictor.
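
With unbalanced data the weights $\zeta_i$ differ across subjects, so $m_{\alpha,GLS}$ is no longer a simple average. A minimal sketch of the computation, using hypothetical subject means, sample sizes, and variance components (none of these numbers come from the slides):

```python
import numpy as np

ybar_i = np.array([12.0, 13.0, 8.0])   # hypothetical subject-specific means
T_i = np.array([4, 7, 2])              # hypothetical (unequal) numbers of observations
sigma2_alpha, sigma2_eps = 5.8, 4.9    # assumed known variance components

zeta_i = T_i * sigma2_alpha / (T_i * sigma2_alpha + sigma2_eps)

# GLS estimator of mu_alpha: a zeta-weighted average of the subject means
m_alpha_gls = np.sum(zeta_i * ybar_i) / np.sum(zeta_i)

# BLUP of mu_alpha + alpha_i: shrink each subject mean toward m_alpha_gls
ybar_i_blup = zeta_i * ybar_i + (1 - zeta_i) * m_alpha_gls
print(round(m_alpha_gls, 3), np.round(ybar_i_blup, 3))
```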
             Types of Predictors
• We have now introduced the BLUP of $\mu_\alpha + \alpha_i$. This
  quantity is a linear combination of global parameters and
  subject-specific effects.
• Two other types of predictors are of interest.
   – Residuals. Here, we wish to “predict” $\varepsilon_{it}$. The BLUP
     residual turns out to be
                     $e_{it,BLUP} = y_{it} - \bar{y}_{i,BLUP}$
   – Forecasts. Here, we wish to predict, for $L$ lead time
     units into the future,
                         $y_{i,T_i+L} = \mu_\alpha + \alpha_i + \varepsilon_{i,T_i+L}$
   – Without serial correlation, the predictor is the same as
     the predictor of $\mu_\alpha + \alpha_i$. However, we will see that the
     mean square error turns out to be larger.
4.3 Best linear unbiased predictors
• This section develops best linear unbiased predictors in the
  context of mixed linear models and then specializes them to
  the longitudinal data mixed model.
• BLUPs are developed by examining the minimum mean
  square error predictor of a random variable, w.
   – We give a development due to Harville (1976).
   – The argument is originally due to Goldberger (1962),
     who coined the phrase best linear unbiased predictor.
   – The acronym was first used by Henderson (1973).
• BLUPs can also be developed as conditional expectations
  using multivariate normality
• BLUPs can also be developed in a Bayesian context.
            Mixed linear models
• Suppose that we observe an $N \times 1$ random vector $y$ with mean
  $\mathrm{E}\, y = X\beta$ and variance $\mathrm{Var}\, y = V$.
   – We wish to predict a random variable $w$ that has mean
     $\mathrm{E}\, w = \lambda'\beta$ and $\mathrm{Var}\, w = \sigma_w^2$.
   – Denote the covariance between $w$ and $y$ as $\mathrm{Cov}(w, y') = \mathrm{cov}_{wy}$.
• Assuming known regression parameters $\beta$, the best linear (in
  $y$) predictor of $w$ is

  $$w^* = \mathrm{E}\, w + \mathrm{cov}_{wy}\, V^{-1}(y - \mathrm{E}\, y) = \lambda'\beta + \mathrm{cov}_{wy}\, V^{-1}(y - X\beta).$$

   – If $(w, y)$ are multivariate normal, then $w^*$ equals $\mathrm{E}(w \mid y)$ and
     hence is a minimum mean square predictor of $w$.
   – The predictor $w^*$ is also a minimum mean square linear
     predictor of $w$ without the assumption of normality. See Appendix
     4A.1.
              BLUPs as predictors
• To develop the BLUP,
   – define $b_{GLS} = (X' V^{-1} X)^{-1} X' V^{-1} y$ to be the generalized
     least squares (GLS) estimator of $\beta$.
   – This is the best linear unbiased estimator (BLUE) of $\beta$.
   – Replace $\beta$ by $b_{GLS}$ in the definition of $w^*$ to get the BLUP:

  $$w_{BLUP} = \lambda' b_{GLS} + \mathrm{cov}_{wy}\, V^{-1}(y - X b_{GLS})
             = (\lambda' - \mathrm{cov}_{wy}\, V^{-1} X)\, b_{GLS} + \mathrm{cov}_{wy}\, V^{-1} y.$$

   – See Appendix 4A.2 for a check establishing $w_{BLUP}$ as
     the best linear unbiased predictor of $w$.
• From Appendix 4A.3, we also have the form for the
  minimum mean square error:

  $$\mathrm{Var}(w_{BLUP} - w) = (\lambda' - \mathrm{cov}_{wy}\, V^{-1} X)\,(X' V^{-1} X)^{-1}\,(\lambda' - \mathrm{cov}_{wy}\, V^{-1} X)'
    - \mathrm{cov}_{wy}\, V^{-1}\, \mathrm{cov}_{wy}' + \sigma_w^2.$$
               Example: One-way model
• Recall $y_{it} = \mu_\alpha + \alpha_i + \varepsilon_{it}$, so that $y_i = 1_i(\mu_\alpha + \alpha_i) + \varepsilon_i$, with $X_i = 1_i$ and

  $$b_{GLS} = m_{\alpha,GLS} = \frac{\sum_{i=1}^{n} \zeta_i\, \bar{y}_i}{\sum_{i=1}^{n} \zeta_i}, \qquad
    V_i^{-1} = \frac{1}{\sigma_\varepsilon^2}\left(I_i - \frac{\sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}\, J_i\right).$$

   – With this, we note that

  $$V_i^{-1}(y_i - X_i b_{GLS}) = \frac{1}{\sigma_\varepsilon^2}\left[(y_i - 1_i m_{\alpha,GLS}) - \zeta_i 1_i (\bar{y}_i - m_{\alpha,GLS})\right].$$

   – Thus, for predicting $w = \mu_\alpha + \alpha_i$ we have $\lambda = 1$ and
     $\mathrm{Cov}(w, y_i') = \sigma_\alpha^2\, 1_i'$ for the $i$th subject, 0 otherwise. Thus,

  $$w_{BLUP} = m_{\alpha,GLS} + \mathrm{Cov}(w, y_i')\, V_i^{-1}(y_i - X_i b_{GLS})
             = m_{\alpha,GLS} + \sigma_\alpha^2\, 1_i' \frac{1}{\sigma_\varepsilon^2}\left[(y_i - 1_i m_{\alpha,GLS}) - \zeta_i 1_i (\bar{y}_i - m_{\alpha,GLS})\right]
             = m_{\alpha,GLS} + \zeta_i\,(\bar{y}_i - m_{\alpha,GLS}) = \bar{y}_{i,BLUP}.$$
    Random effect ANOVA model
• For predicting residuals $\varepsilon_{it}$ we have $\lambda = 0$ and
  $\mathrm{Cov}(w, y_i') = \sigma_\varepsilon^2\, 1_{it}'$ for the $i$th subject, $t$th time period, 0 otherwise.
• Let $1_{it}$ be a $T_i \times 1$ vector with a one in the $t$th position and
  zeros elsewhere. Thus,

  $$w_{BLUP} = \sigma_\varepsilon^2\, 1_{it}'\, V_i^{-1}(y_i - X_i b_{GLS}) = y_{it} - \bar{y}_{i,BLUP}$$

  is our BLUP residual.
         4.4 Mixed model predictors
• Recall the longitudinal data mixed model
                        $y_i = Z_i \alpha_i + X_i \beta + \varepsilon_i$
• As described in Section 3.3, this is a special case of the
  mixed linear model. We use
                 $V = \text{block diagonal}(V_1, \ldots, V_n)$,
  where $V_i = Z_i D Z_i' + R_i$, and $X' = (X_1', \ldots, X_n')$.
• For BLUP calculations, note that
            $\mathrm{cov}_{wy} = \left(\mathrm{Cov}(w, y_1'), \ldots, \mathrm{Cov}(w, y_n')\right)$
Longitudinal data mixed model BLUP

• Recall that the random variable $w$ has mean $\mathrm{E}\, w = \lambda'\beta$ and $\mathrm{Var}\, w = \sigma_w^2$.
• The BLUP is

  $$w_{BLUP} = \lambda'\, b_{GLS} + \sum_{i=1}^{n} \mathrm{Cov}(w, y_i')\, V_i^{-1}(y_i - X_i b_{GLS}).$$

• The mean square error is

  $$\mathrm{Var}(w_{BLUP} - w) =
    \left(\lambda' - \sum_{i=1}^{n} \mathrm{Cov}(w, y_i')\, V_i^{-1} X_i\right)
    \left(\sum_{i=1}^{n} X_i'\, V_i^{-1} X_i\right)^{-1}
    \left(\lambda' - \sum_{i=1}^{n} \mathrm{Cov}(w, y_i')\, V_i^{-1} X_i\right)'
    - \sum_{i=1}^{n} \mathrm{Cov}(w, y_i')\, V_i^{-1}\, \mathrm{Cov}(w, y_i')' + \sigma_w^2.$$
                 BLUP special cases
• Global parameters and subject-specific effects.
   – Suppose that the interest is in predicting linear
     combinations of global parameters $\beta$ and the subject-
     specific effect $\alpha_i$.
   – Consider linear combinations of the form
                          $w = c_1'\alpha_i + c_2'\beta$.
• Residuals. Here, $w = \varepsilon_{it}$.
• Forecasts. Suppose that the $i$th subject is included in the data
  set; predict
           $y_{i,T_i+L} = z_{i,T_i+L}'\alpha_i + x_{i,T_i+L}'\beta + \varepsilon_{i,T_i+L}$
   – for $L$ lead time units in the future.
      Predicting global parameters and
           subject-specific effects

• Consider linear combinations of the form $w = c_1'\alpha_i + c_2'\beta$.
• Straightforward calculations show that
   – $\mathrm{E}\, w = c_2'\beta$ so that $\lambda' = c_2'$,
   – $\mathrm{Cov}(w, y_j') = c_1' D Z_i'$ for $j = i$, and
   – $\mathrm{Cov}(w, y_j') = 0$ for $j \ne i$.

• Thus, $w_{BLUP} = c_2'\, b_{GLS} + c_1'\, D Z_i'\, V_i^{-1}(y_i - X_i b_{GLS})$.
                   Special case 1
• Take $c_2 = 0$. Because the mean and variance expressions
  hold for all vectors $c_1$, we may write this in vector
  notation to get the BLUP of $\alpha_i$, the vector
                $a_{i,BLUP} = D Z_i'\, V_i^{-1}(y_i - X_i b_{GLS})$.
• This is unbiased in the sense that $\mathrm{E}(a_{i,BLUP} - \alpha_i) = 0$.
• This predictor has minimum variance among all linear
  unbiased predictors (BLUP).
• In the case of the error components model ($z_{it} = 1$), this
  reduces to
                    $a_{i,BLUP} = \zeta_i\,(\bar{y}_i - \bar{x}_i' b_{GLS})$.
• For comparison, recall the fixed effects parameter estimate,
                     $a_i = \bar{y}_i - \bar{x}_i' b$.
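
A small sketch of the vector BLUP $a_{i,BLUP} = D Z_i' V_i^{-1}(y_i - X_i b_{GLS})$ for one subject, assuming $D$, $R_i$, and $b_{GLS}$ have already been computed; names are illustrative:

```python
import numpy as np

def random_effects_blup(y_i, X_i, Z_i, D, R_i, b_gls):
    """BLUP of alpha_i in the mixed model y_i = Z_i alpha_i + X_i beta + eps_i."""
    V_i = Z_i @ D @ Z_i.T + R_i                     # Var y_i
    resid = y_i - X_i @ b_gls
    return D @ Z_i.T @ np.linalg.solve(V_i, resid)  # D Z_i' V_i^{-1} (y_i - X_i b_GLS)

# In the error components model (z_it = 1, R_i = sigma_eps^2 I), this reduces to
# zeta_i * (ybar_i - xbar_i' b_GLS), as noted above.
```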
                 Motivating BLUPs

• We can also motivate BLUPs using normal theory:
   – Consider the case where $\alpha_i$ and $\varepsilon_i$ are multivariate normally
     distributed.
   – Then, it can be shown that $\mathrm{E}(\alpha_i \mid y_i) = D Z_i'\, V_i^{-1}(y_i - X_i \beta)$.
   – To motivate this, consider asking the question: what
     realization of $\alpha_i$ could be associated with $y_i$? The
     conditional expectation!
   – The BLUP is the BLUE of $\mathrm{E}(\alpha_i \mid y_i)$. (That is, replace $\beta$ by
     $b_{GLS}$.)
                       Special case 2
• As another example, it is of interest to predict
        $w = \mathrm{E}(y_{i,T_i+1} \mid \alpha_i) = z_{i,T_i+1}'\alpha_i + x_{i,T_i+1}'\beta$.
• Choose $c_1 = z_{i,T_i+1}$ and $c_2 = x_{i,T_i+1}$.
• This yields
                     $w_{BLUP} = z_{i,T_i+1}'\, a_{i,BLUP} + x_{i,T_i+1}'\, b_{GLS}$.


• This predictor is of interest in actuarial science, where it is
  known as the credibility estimator.
                     BLUP Residuals
• Here, $w = \varepsilon_{it}$. Because $\mathrm{E}\, w = 0$, it follows that $\lambda = 0$.
• Straightforward calculations show that
   – $\mathrm{Cov}(w, y_j') = \sigma_\varepsilon^2\, 1_{it}'$ for $j = i$ and
   – $\mathrm{Cov}(w, y_j') = 0$ for $j \ne i$.
   – Here, the symbol $1_{it}$ denotes a $T_i \times 1$ vector that has a
     “one” in the $t$th position and is zero otherwise.
• Thus
              $e_{it,BLUP} = \sigma_\varepsilon^2\, 1_{it}'\, V_i^{-1}(y_i - X_i b_{GLS})$.
• This can also be expressed as

           $e_{it,BLUP} = y_{it} - \left(z_{it}'\, a_{i,BLUP} + x_{it}'\, b_{GLS}\right).$
      Predicting future observations
• Suppose that the $i$th subject is included in the data set; predict
               $y_{i,T_i+L} = z_{i,T_i+L}'\alpha_i + x_{i,T_i+L}'\beta + \varepsilon_{i,T_i+L}$
   – for $L$ lead time units in the future.
• We will assume that $x_{i,T_i+L}$ and $z_{i,T_i+L}$ are known.
• It follows that $\lambda = x_{i,T_i+L}$.
• Straightforward calculations show that

  $$\mathrm{Cov}(w, y_j') = \begin{cases}
      z_{i,T_i+L}'\, D Z_i' + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i') & \text{for } j = i \\
      0 & \text{for } j \ne i
    \end{cases}$$

• Thus, the forecast of $y_{i,T_i+L}$ is

  $$\hat{y}_{i,T_i+L} = x_{i,T_i+L}'\, b_{GLS} + z_{i,T_i+L}'\, a_{i,BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i')\, R_i^{-1}\, e_{i,BLUP}.$$

• Thus, the forecast is the estimate of the conditional mean
  plus the serial correlation correction factor
               $\mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i')\, R_i^{-1}\, e_{i,BLUP}$,
  where $e_{i,BLUP}$ is the vector of BLUP residuals for subject $i$.
                Predicting future observations
• To illustrate, consider the special case of autoregressive of
  order 1 (AR(1)), serially correlated errors.
• Thus, we have

  $$R = \sigma^2 \begin{pmatrix}
      1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
      \rho & 1 & \rho & \cdots & \rho^{T-2} \\
      \rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      \rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
    \end{pmatrix}, \qquad
    R^{-1} = \frac{1}{\sigma^2 (1 - \rho^2)} \begin{pmatrix}
      1 & -\rho & 0 & \cdots & 0 & 0 \\
      -\rho & 1 + \rho^2 & -\rho & \cdots & 0 & 0 \\
      0 & -\rho & 1 + \rho^2 & \cdots & 0 & 0 \\
      \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
      0 & 0 & 0 & \cdots & 1 + \rho^2 & -\rho \\
      0 & 0 & 0 & \cdots & -\rho & 1
    \end{pmatrix}$$

• After some algebra, the $L$-step forecast is

  $$\hat{y}_{i,T_i+L} = x_{i,T_i+L}'\, b_{GLS} + z_{i,T_i+L}'\, a_{i,BLUP} + \rho^L\, e_{i T_i,BLUP}.$$
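
A sketch of the $L$-step forecast with AR(1) errors, assuming the fitted quantities ($b_{GLS}$, $a_{i,BLUP}$, $\rho$, and the last BLUP residual $e_{iT_i,BLUP}$) are already available; all names and numbers are illustrative:

```python
import numpy as np

def ar1_forecast(x_future, z_future, b_gls, a_i_blup, rho, last_resid, L):
    """Conditional-mean estimate plus the AR(1) correction rho^L * e_{iT_i,BLUP}."""
    cond_mean = x_future @ b_gls + z_future @ a_i_blup
    return cond_mean + rho ** L * last_resid

# Hypothetical two-step-ahead forecast
print(ar1_forecast(x_future=np.array([1.0, 0.5]), z_future=np.array([1.0]),
                   b_gls=np.array([2.0, 1.0]), a_i_blup=np.array([0.3]),
                   rho=0.4, last_resid=0.25, L=2))   # 2.5 + 0.3 + 0.16*0.25 = 2.84
```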
             4.5 Bayesian Inference
• With Bayesian statistical models, one views both the model
  parameters and the data as random variables.
   – We assume distributions for each type of random variable.
• Given the parameters $\beta$ and $\alpha$, the response model is
                     $y = Z\alpha + X\beta + \varepsilon$
   – Specifically, we assume that the responses $y$ conditional on
     $\alpha$ and $\beta$ are normally distributed and that
           $\mathrm{E}(y \mid \alpha, \beta) = Z\alpha + X\beta$ and $\mathrm{Var}(y \mid \alpha, \beta) = R$.
• Assume that $\alpha$ is distributed normally with mean $\mu_\alpha$ and
  variance $D$ and that $\beta$ is distributed normally with mean $\mu_\beta$
  and variance $\Sigma_\beta$, each independent of the other.
                      Distributions
• The joint distribution of (α, β) is known as the prior
  distribution.
• To summarize, the joint distribution of $(\alpha', \beta', y')'$ is

  $$\begin{pmatrix} \alpha \\ \beta \\ y \end{pmatrix} \sim
    N\left(\begin{pmatrix} \mu_\alpha \\ \mu_\beta \\ Z\mu_\alpha + X\mu_\beta \end{pmatrix},\;
    \begin{pmatrix} D & 0 & D Z' \\ 0 & \Sigma_\beta & \Sigma_\beta X' \\ Z D & X \Sigma_\beta & V + X \Sigma_\beta X' \end{pmatrix}\right)$$

• where $V = R + Z D Z'$.
              Posterior Distribution
• The distribution of parameters given the data is known as
  the posterior distribution.
• The posterior distribution of (α, β) given y is normal.
• The conditional moments are

  $$\mathrm{E}\left[\begin{pmatrix} \alpha \\ \beta \end{pmatrix} \,\Big|\, y\right] =
    \begin{pmatrix}
      \mu_\alpha + D Z' \left(V + X \Sigma_\beta X'\right)^{-1} \left(y - Z\mu_\alpha - X\mu_\beta\right) \\
      \mu_\beta + \Sigma_\beta X' \left(V + X \Sigma_\beta X'\right)^{-1} \left(y - Z\mu_\alpha - X\mu_\beta\right)
    \end{pmatrix}$$

  and

  $$\mathrm{Var}\left[\begin{pmatrix} \alpha \\ \beta \end{pmatrix} \,\Big|\, y\right] =
    \begin{pmatrix} D & 0 \\ 0 & \Sigma_\beta \end{pmatrix} -
    \begin{pmatrix} D Z' \\ \Sigma_\beta X' \end{pmatrix}
    \left(V + X \Sigma_\beta X'\right)^{-1}
    \begin{pmatrix} Z D & X \Sigma_\beta \end{pmatrix}.$$
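
Once $R$, $D$, $\Sigma_\beta$, and the prior means are specified, the posterior moments are routine matrix computations. A generic sketch with illustrative names (small dense matrices assumed):

```python
import numpy as np

def posterior_moments(y, X, Z, R, D, Sigma_beta, mu_alpha, mu_beta):
    """Posterior mean and variance of (alpha, beta) given y in the normal model."""
    V = R + Z @ D @ Z.T                           # Var(y | beta), marginal over alpha
    S = V + X @ Sigma_beta @ X.T                  # unconditional Var y
    C = np.vstack([D @ Z.T, Sigma_beta @ X.T])    # Cov((alpha, beta), y)
    resid = y - Z @ mu_alpha - X @ mu_beta
    post_mean = np.concatenate([mu_alpha, mu_beta]) + C @ np.linalg.solve(S, resid)
    prior_var = np.block([[D, np.zeros((D.shape[0], Sigma_beta.shape[0]))],
                          [np.zeros((Sigma_beta.shape[0], D.shape[0])), Sigma_beta]])
    post_var = prior_var - C @ np.linalg.solve(S, C.T)
    return post_mean, post_var
```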
               Relation with BLUPs
• In longitudinal data applications, one typically has more
  information about the global parameters β than subject-
  specific parameters α.
• Consider first the case $\Sigma_\beta = 0$, so that $\beta = \mu_\beta$ with probability
  one.
   – Intuitively, this means that $\beta$ is precisely known, generally
     from collateral information.
   – Assuming that $\mu_\alpha = 0$, it is easy to check that the best linear
     unbiased estimator (BLUE) of $\mathrm{E}(\alpha \mid y)$ is
                    $a_{BLUP} = D Z'\, V^{-1}(y - X b_{GLS})$
   – Recall from equation (4.11) that aBLUP is also the best
      linear unbiased predictor in the frequentist (non-Bayesian)
      model framework.
               Relation with BLUPs
• Consider second the case where $\Sigma_\beta^{-1} = 0$.
   – In this case, prior information about the parameter β is
       vague; this is known as using a diffuse prior.
   – Assuming mα = 0, one can show that
                             E ( α | y ) = aBLUP
• It is interesting that in both extreme cases, we arrive at the
  statistic aBLUP as a predictor of α.
   – This analysis assumes D and R are matrices of fixed
       parameters.
   – It is also possible to assume distributions for these
       parameters; typically, independent Wishart distributions are
       used for D-1 and R-1 as these are conjugate priors.
   – The general strategy of substituting point estimates for
       certain parameters in a posterior distribution is called
       empirical Bayes estimation.
    Example – One-way random effects
             ANOVA model
• For the balanced model $y_{it} = \beta + \alpha_i + \varepsilon_{it}$ (with $T_i = T$), the
  posterior means turn out to be

  $$\hat{\beta} = \mathrm{E}(\beta \mid y) = (1 - \delta_b)\, m_b + \delta_b\, \bar{y}, \qquad
    \hat{a}_i = \mathrm{E}(\alpha_i \mid y) = \zeta\left[(\bar{y}_i - m_b) - \delta_b\,(\bar{y} - m_b)\right],$$

• where

  $$\zeta = \frac{T \sigma_\alpha^2}{\sigma_\varepsilon^2 + T \sigma_\alpha^2}, \qquad
    \delta_b = \frac{n T \sigma_b^2}{\sigma_\varepsilon^2 + T \sigma_\alpha^2 + n T \sigma_b^2},$$

  and $m_b$ and $\sigma_b^2$ are the prior mean and variance of $\beta$.

• Note that $\delta_b$ measures the precision of knowledge about $\beta$.
  Specifically, we see that $\delta_b$ approaches one as $\sigma_b^2 \to \infty$, and
  approaches zero as $\sigma_b^2 \to 0$.
• Combining the two posterior means,

  $$\hat{a}_i + \hat{\beta} = (1 - \delta_b)(1 - \zeta)\, m_b + \zeta\, \bar{y}_i + \delta_b (1 - \zeta)\, \bar{y}.$$
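
A short numerical check using the machine/town averages from Section 4.2, with an assumed prior for $\beta$ ($m_b$ and $\sigma_b^2$ are hypothetical) and the ANOVA variance estimates from the earlier sketch:

```python
import numpy as np

ybar_i = np.array([12.0, 13.0, 8.0]); ybar = 11.0   # subject and overall means
n, T = 3, 4
sigma2_alpha, sigma2_eps = 5.778, 4.889             # ANOVA estimates (Section 4.2 sketch)
m_b, sigma2_b = 10.0, 2.0                           # hypothetical prior mean and variance of beta

zeta = T * sigma2_alpha / (sigma2_eps + T * sigma2_alpha)
delta_b = n * T * sigma2_b / (sigma2_eps + T * sigma2_alpha + n * T * sigma2_b)

beta_hat = (1 - delta_b) * m_b + delta_b * ybar                 # E(beta | y)
a_hat = zeta * ((ybar_i - m_b) - delta_b * (ybar - m_b))        # E(alpha_i | y)
print(np.round(a_hat + beta_hat, 3))

# As sigma2_b grows (delta_b -> 1), a_hat + beta_hat approaches the shrinkage
# estimators of Section 4.2 (about 11.825, 12.651, 8.524); as sigma2_b -> 0,
# it approaches (1 - zeta) * m_b + zeta * ybar_i.
```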
          4.6 Wisconsin Lottery Sales
• T = 40 weeks of sales from n = 50 ZIP codes
        Table 4.1. Lottery, Economic and Demographic
          Characteristics of 50 Wisconsin ZIP Codes
Lottery Characteristics
ZOLSALES         Online lottery sales to individual consumers
NRETAIL          Number of listed retailers
Economic and Demographic Characteristics
PERPERHH         Persons per household
MEDSCHYR         Median years of schooling
MEDHVL           Median home value in $1000s for owner-occupied homes
PRCRENT          Percent of housing that is renter occupied
PRC55P           Percent of population that is 55 or older
HHMEDAGE         Household median age
MEDINC           Estimated median household income, in $1000s
POPULAT          Population
         Lottery Sales Data Analysis
• Cross-sectional analysis shows that population size heavily
  influences sales, with Kenosha as an outlier
• Multiple time series plots
   – show the effect of jackpots that is common to all postal
      codes
   – show the heterogeneity among postal codes (reaffirmed
      by a pooling test)
   – show the heteroscedasticity that is accommodated
      through a logarithmic transformation
       Lottery Sales Model Selection
• In-sample results show that
   – One-way error components dominates pooled cross-
     sectional models
   – An AR(1) error specification significantly improves the
     fit.
   – The best model is probably the two-way error
     component model, with an AR(1) error specification (not
     yet documented)
• Out-of-sample analysis suggests that
   – logarithmic sales is the preferred choice of response; it
     outperforms sales and percentage change.
          4.7. What is Credibility?
• Hickman’s (1975) Analogy
   – In politics, leaders begin with a reservoir of credibility
     which decreases as executive experience is compiled.
   – Insurance behaves in a reverse fashion!
   – Here, credibility increases as experience increases.
               Credibility Theory
• Credibility is a technique for predicting future expected
  claims for a risk class, given past claims of that and related
  risk classes.
• Importance
   – Credibility is widely used for pricing property and
      casualty, workers' compensation, and health care
      coverages.
   – According to Rodermund (1989), “the concept of
      credibility has been the casualty actuaries’ most
      important and enduring contribution to casualty actuarial
      science.”
                       History
• Mowbray (1914 - PCAS)
  – Asked the question, “How extensive an exposure is
    necessary to give a dependable pure premium?”
  – This approach is now known as the “limited
    fluctuation” or “American” credibility
      • Question 1 – do we have enough exposure to give
        full weight to the risk class under consideration?
      • Question 2 – if not, how can we combine information
        from this and related risk classes?
                More History
• Whitney (1918 - PCAS)
  – introduced the idea of using a weighted average
    of average claims of (1) a given risk class and
    (2) all risk classes.
  – The weight is known as the credibility factor.
  – It is of the form
   New Premium =
   Z × Claims Experience + (1 − Z) × Old Premium.
        Example - Balanced Bühlmann

• Consider the model
                  $y_{it} = \beta + \alpha_i + \varepsilon_{it}$.

• The credibility factor is

  $$\zeta = \frac{T}{T + \sigma_\varepsilon^2 / \sigma_\alpha^2}$$

• The traditional credibility estimator is

  $$w_{BLUP} = (1 - \zeta)\, \bar{y} + \zeta\, \bar{y}_i.$$
                        Example
            Hypothetical Claims for Three Towns

  Town          Claims               Average Claim
   1          14, 12, 10, 12          $\bar{y}_1 = 12$
   2           9, 16, 15, 12          $\bar{y}_2 = 13$
   3            8, 10, 7, 7           $\bar{y}_3 = 8$



• Are there real differences among towns?
• Mowbray - does Town 3 have enough data to
  support its own estimator of pure premiums?
• Whitney - how can I use the information in Towns
  1 and 2 to help determine my rate for Town 3?
             Response to Whitney
• Known as the “shrinkage” effect: each town's mean is pulled
  toward the overall mean $\bar{y} = 11$.
   – $\bar{y}_3 = 8 \rightarrow 8.525$,  $\bar{y}_1 = 12 \rightarrow 11.825$,  $\bar{y}_2 = 13 \rightarrow 12.650$.
• Comparison of Subject-Specific Means to
  Credibility Estimators.
  Why study credibility theory?
• Long history of applications – “a business necessity”
   – More recently, many theoretical advances with fewer
     innovative applications
• Credibility techniques required in legal statutes and
  standards of practice
   – Standard of Practice 25 by the Actuarial Standards Board
     of the American Academy of Actuaries
   – Wisconsin statutes on credibility insurance and disability
     income
• Advanced techniques are critical for keeping up with
  competition (health insurance – health economists)
• Innovative techniques enhance the “credibility” of the
  profession

				