
									The Role of Models in Variance Estimation

       for Establishment Surveys:

   An Introductory Overview Lecture


                Phillip S. Kott
    National Agricultural Statistics Service
                June 21, 2007
            pkott@nass.usda.gov
              703 877-8000 x 102

                        1
                         Outline


Why Do We Need Models?

The Role of Models in Establishment Surveys

The Role of Models in Variance Estimation

Prediction Models in Nonresponse Adjustment

Extensions

Concluding Remark


                             2
                 Why Do We Need Models?


To make statistically rigorous inferences from a convenience
sample or when the sample size is too small for randomization-
based (design-based) inference.


To choose the best among reasonable-sounding randomization-
based estimation strategies.


To simplify variance estimation under an unequal probability-of-
selection scheme.


                                  3
To speed up the asymptotics with middling-sized samples.
This will often improve the accuracy of coverage intervals.


To put statistical structure on “ad-hoc” practices.
This provides us with tools for evaluating their strengths and
weaknesses.


To estimate totals and their variances in the face of nonresponse
or coverage errors.




                                   4
             Model-based Estimators (for totals)


Goal: Estimate a population (U) total, $T_y = \sum_{k \in U} y_k = \sum_U y_k$,
       with a sample, S, of n (out of N) elements.


Given: An auxiliary variable, $x_k$ for $k \in S$, about which $T_x = \sum_U x_k$ is
        known.



The auxiliary variable may come from administrative data or a

census. It could trivially be 1 for all elements in the population.

                                   5
                        A Motivating Example


A population of farms, where

xk is the total land (believed to be) on farm k, and

yk is the planted acres of a particular crop (say corn) on the farm.



The total land value is assumed to be known before sampling and
enumeration. It cannot be changed by survey information even if
found to be in error.


                                   6
If we assume that each $y_k$, $k \in U$, satisfies the linear prediction
model:

$$y_k = \beta x_k + \varepsilon_k,$$

where $E(\varepsilon_k \mid x_k) = 0$, then the ratio estimator,

$$t_y^r = T_x \frac{\sum_S y_k}{\sum_S x_k} = T_x \frac{\bar{y}_S}{\bar{x}_S},$$

is a (model) unbiased estimator (predictor) for $T_y$ in the sense that

$$E(t_y^r - T_y) = 0$$

when the sampling mechanism is ignorable:

$$E(\varepsilon_k \mid x_k) = E(\varepsilon_k \mid x_k;\ k \in S) = 0.$$


This ignorability assumption is unnecessary for
simple random samples and cutoff samples (so long as the
population values are unchanged by survey information), but is
needed for convenience samples.
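
As a rough illustration, the ratio estimator (the known x-total times the ratio of the sample y-total to the sample x-total) can be sketched in a few lines of Python. The function name and the farm figures below are hypothetical, invented only for this example.

```python
def ratio_estimator(y_sample, x_sample, t_x):
    """Return T_x * sum_S(y) / sum_S(x), the ratio estimator of the y-total."""
    if len(y_sample) != len(x_sample):
        raise ValueError("y_sample and x_sample must have the same length")
    return t_x * sum(y_sample) / sum(x_sample)


# Hypothetical sample of four farms (illustrative numbers only).
x_s = [100.0, 250.0, 400.0, 50.0]   # total land on each sampled farm
y_s = [40.0, 90.0, 170.0, 20.0]     # planted corn acres on each sampled farm
T_x = 10_000.0                      # population land total, assumed known

estimate = ratio_estimator(y_s, x_s, T_x)
print(estimate)
```

Here the sample ratio is 320/800 = 0.4, so the estimated corn total is 0.4 times the known land total.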




                                    8
                              Model Groups


Some populations naturally separate into G mutually exclusive
groups. If the group x-totals, $T_{xg} = \sum_{U_g} x_k$, are known,
then the separate ratio estimator:

$$t_y^{sr} = \sum_{g=1}^{G} T_{xg} \frac{\sum_{S_g} y_k}{\sum_{S_g} x_k} = \sum_{g=1}^{G} T_{xg} \frac{\bar{y}_{gS}}{\bar{x}_{gS}}$$

is not only unbiased for $T_y$ under the simple linear model,
but also under the group-ratio model:

$$y_k = \beta_g x_k + \varepsilon_k \quad \text{for } k \in U_g,$$

where (again) $E(\varepsilon_k \mid x_k) = E(\varepsilon_k \mid x_k;\ k \in S) = 0$.

The simple ratio estimator, $t_y^r$, is generally not unbiased under this
model.




                                      10
When all the $x_k = 1$, the group-ratio model is called the “group-mean
model.” The separate ratio estimator has the familiar
“poststratified” form:

$$t_y^{PS} = \sum_{g=1}^{G} N_g \frac{\sum_{S_g} y_k}{n_g} = \sum_{g=1}^{G} N_g \bar{y}_{gS} .$$



Important distinction: Although the G groups look like H design
strata, the difference is that samples need not be selected
independently across groups.
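
A minimal sketch of the separate ratio estimator and its poststratified special case (all x-values equal to 1) follows. The function names and the group data are hypothetical and serve only to show the arithmetic.

```python
def separate_ratio_estimator(groups):
    """groups: iterable of (T_xg, y_sample_g, x_sample_g), one tuple per group."""
    return sum(t_xg * sum(y_g) / sum(x_g) for t_xg, y_g, x_g in groups)


def poststratified_estimator(groups):
    """groups: iterable of (N_g, y_sample_g); the special case with every x_k = 1."""
    return separate_ratio_estimator(
        (n_pop, y_g, [1.0] * len(y_g)) for n_pop, y_g in groups
    )


# Hypothetical data: two groups with known x-totals 1000 and 2000.
sr = separate_ratio_estimator([
    (1000.0, [10.0, 30.0], [50.0, 150.0]),    # group ratio 40/200 = 0.2
    (2000.0, [40.0, 60.0], [100.0, 100.0]),   # group ratio 100/200 = 0.5
])

# Hypothetical poststrata with population counts 100 and 50.
ps = poststratified_estimator([(100, [2.0, 4.0]), (50, [10.0])])
print(sr, ps)
```

Nothing in the code requires the groups to coincide with design strata, which mirrors the distinction drawn above.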



                                    11
                        Prediction Form


Returning to the simple linear model, $y_k = \beta x_k + \varepsilon_k$,
an estimator for $T_y$ of the form:

$$t_y^{pred} = \sum_S y_k + \Big(\sum_{U-S} x_k\Big) \hat{\beta}$$

is model unbiased when $\hat{\beta}$ is an unbiased estimator for $\beta$.

This estimator uses the model to predict the y-values for the
nonsampled elements.

                                   12
If we assume the $\varepsilon_k$ are uncorrelated and each has variance $\sigma_k^2$,
then the best linear unbiased (BLU) estimator for $T_y$ is

$$t_y^{BLU} = \sum_S y_k + \Big(\sum_{U-S} x_k\Big) \frac{\sum_S x_j y_j / \sigma_j^2}{\sum_S x_j^2 / \sigma_j^2} .$$

This collapses into the ratio estimator if and only if
$\sigma_k^2 \propto x_k$.


                                   13
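
The collapse of the BLU estimator into the ratio estimator when the error variances are proportional to x can be checked numerically. The sketch below is illustrative only: the function name and data are hypothetical, and the variances need only be supplied up to a constant because the constant cancels in the slope.

```python
def blu_estimator(y_s, x_s, var_s, t_x):
    """Sample y-total plus the predicted y-total of the nonsampled elements.

    var_s holds the error variances of the sampled elements (up to a constant).
    """
    beta_hat = (sum(x * y / v for x, y, v in zip(x_s, y_s, var_s))
                / sum(x * x / v for x, v in zip(x_s, var_s)))
    x_nonsampled = t_x - sum(x_s)   # x-total over the nonsampled elements
    return sum(y_s) + x_nonsampled * beta_hat


# Hypothetical data (illustrative numbers only).
x_s = [100.0, 250.0, 400.0, 50.0]
y_s = [45.0, 90.0, 170.0, 15.0]
T_x = 10_000.0

ratio = T_x * sum(y_s) / sum(x_s)
blu_prop_x = blu_estimator(y_s, x_s, var_s=x_s, t_x=T_x)          # variance proportional to x
blu_const = blu_estimator(y_s, x_s, var_s=[1.0] * 4, t_x=T_x)     # constant variance
print(ratio, blu_prop_x, blu_const)
```

With variances proportional to x the two estimates agree. Under the constant-variance assumption they do not.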
    Model-assisted Randomization-based Estimators


But models fail.

In particular, $y_k / x_k$ may tend to increase or decrease along with $x_k$.

Nevertheless, from a randomization-based point of view, $t_y^r$ under
simple random sampling is almost unbiased. In fact, its relative
randomization mean squared error will tend to zero as the sample
size grows arbitrarily large: it is randomization consistent.


                                   14
Moreover, if the $e_k = y_k - B x_k$, where $B = T_y / T_x$, tend to be
smaller than the $y_k - \bar{y}_U$, then the ratio estimator will have less
randomization mean squared error than the expansion estimator,

$$t_y^e = \frac{N}{n} \sum_S y_k = N \bar{y}_S .$$

In other words,
$y_k = \beta x_k + \varepsilon_k$,
however imperfect, is a better implicit prediction model than
$y_k = \beta + \varepsilon_k$.

                                      15
                Combined Mean Squared Error


Without a model, it is often impossible to choose among
reasonable randomization-based estimation strategies with
identical sample sizes. As a result, it is often helpful to:


1. Restrict attention (if possible) to almost randomization-
   unbiased or randomization-consistent estimators, and


2. Look at the model expectation of the randomization mean
   squared error (mse) of this strategy.


                                   16
Suppose we want to estimate $T_y$ with the Hajek estimator,

$$t_y^{Hajek} = T_x \frac{\sum_S y_k / \pi_k}{\sum_S x_k / \pi_k} = T_x \frac{t_y^e}{t_x^e},$$

where the element selection probabilities, the $\pi_k$, can be
anything.


The Hajek estimator is unbiased under the linear model,
$y_k = \beta x_k + \varepsilon_k$, and is almost randomization unbiased.


                                    17
When the $\varepsilon_k$ are uncorrelated and each has variance $\sigma_k^2$,
one can show that for a fixed (expected) sample size n, the
(asymptotic) combined mse of $t_y^{Hajek}$ is minimized under Brewer
selection; that is, when

$$\pi_k = n \frac{\sigma_k}{\sum_U \sigma_j}$$

if all $\sigma_k \le \sum_U \sigma_j / n$
(otherwise k should be selected with certainty).
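
The rule above can be sketched in Python. How to treat the elements taken with certainty is an assumption here: this sketch sets their probability to 1 and then shares the remaining sample size among the other elements in proportion to their standard deviations, repeating as needed. The function name and inputs are hypothetical.

```python
def brewer_probabilities(sigma, n):
    """Selection probabilities proportional to sigma_k, capped at 1.

    Assumes every sigma_k is positive. Elements whose proportional
    probability would reach 1 are taken with certainty, and the remaining
    sample size is reallocated among the rest (an assumed convention).
    """
    probs = [0.0] * len(sigma)
    remaining = set(range(len(sigma)))
    n_left = n
    while remaining and n_left > 0:
        total = sum(sigma[k] for k in remaining)
        certain = {k for k in remaining if n_left * sigma[k] >= total}
        if not certain:
            for k in remaining:
                probs[k] = n_left * sigma[k] / total
            break
        for k in certain:
            probs[k] = 1.0
        remaining -= certain
        n_left -= len(certain)
    return probs


# Hypothetical standard deviations (known only up to a constant), sample size 2.
probs = brewer_probabilities([1.0, 1.0, 2.0, 16.0], n=2)
print(probs)
```

In this example the largest element is selected with certainty and the other three share the one remaining draw.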

                                   18
Although the $\sigma_k$ need only be known up to a constant for Brewer
selection, even that is unusual.


In establishment surveys it is often speculated that $\sigma_k \propto x_k^{\gamma}$ for
some $\gamma$ between 1/2 and 1 (note that $x_k$ must be known for all
elements in U to use this).


Getting the $\sigma_k$ right is unnecessary for randomization consistency
or model unbiasedness.




                                   19
The Hajek estimator under stratified simple random sampling (stsrs)
is called the combined ratio estimator,

$$t_y^{cr} = T_x \frac{\sum_S y_k / \pi_k}{\sum_S x_k / \pi_k} = T_x \frac{\sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{S_h} y_k}{\sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{S_h} x_k},$$

where $\pi_k = n_h / N_h$ when k is in $S_h$ (the sample in stratum h).


It is randomization consistent and model unbiased under the
simple linear model but not the group-ratio model.
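
A short sketch may make the link between the general Hajek form and the combined ratio estimator concrete: under stratified simple random sampling each sampled element in stratum h simply carries the selection probability n_h/N_h. The function name, strata, and x-total below are hypothetical.

```python
def hajek_estimator(y_s, x_s, pi_s, t_x):
    """T_x times the ratio of the probability-weighted y- and x-totals."""
    t_y_e = sum(y / p for y, p in zip(y_s, pi_s))
    t_x_e = sum(x / p for x, p in zip(x_s, pi_s))
    return t_x * t_y_e / t_x_e


# Hypothetical two-stratum sample: (N_h, sampled y-values, sampled x-values).
strata = [
    (100, [10.0, 20.0], [20.0, 40.0]),
    (10, [300.0, 500.0], [500.0, 700.0]),
]
y_s, x_s, pi_s = [], [], []
for n_pop, y_h, x_h in strata:
    y_s += y_h
    x_s += x_h
    pi_s += [len(y_h) / n_pop] * len(y_h)   # pi_k = n_h / N_h

t_cr = hajek_estimator(y_s, x_s, pi_s, t_x=18_000.0)
print(t_cr)
```

The weighted totals are 5500 for y and 9000 for x, so the estimate is 18000 × 5500/9000.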

                                   20
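               Estimating the Model Variance of $t_y^r$



Observe that under the simple linear model:

$$\begin{aligned}
t_y^r - T_y &= \sum_U x_k \frac{\sum_S y_k}{\sum_S x_k} - \sum_U y_k \\
 &= \sum_U x_k \frac{\sum_S (\beta x_k + \varepsilon_k)}{\sum_S x_k} - \sum_U (\beta x_k + \varepsilon_k) \\
 &= \frac{N \bar{x}_U}{n \bar{x}_S} \sum_S \varepsilon_k - \sum_U \varepsilon_k .
\end{aligned}$$

Thus, the model bias of $t_y^r$ is zero.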


                                    21
If the $\varepsilon_k$ are uncorrelated and each has variance $\sigma_k^2$,
then the model variance of $t_y^r$ is

$$E\big[(t_y^r - T_y)^2\big] = \left(\frac{N \bar{x}_U}{n \bar{x}_S}\right)^2 \sum_S \sigma_k^2 - 2\, \frac{N \bar{x}_U}{n \bar{x}_S} \sum_S \sigma_k^2 + \sum_U \sigma_k^2,$$

which could be estimated in an unbiased fashion if we had an
unbiased estimator for the $\sigma_k^2$.
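
The step from the error decomposition to the variance formula can be made explicit. Writing $a = N\bar{x}_U / (n\bar{x}_S)$ (a shorthand introduced only for this note) and using the assumption that the errors are uncorrelated with mean zero and variances $\sigma_k^2$:

```latex
\begin{aligned}
t_y^r - T_y &= a \sum_S \varepsilon_k - \sum_U \varepsilon_k
             = (a - 1) \sum_S \varepsilon_k - \sum_{U-S} \varepsilon_k, \\
E\big[(t_y^r - T_y)^2\big]
            &= (a - 1)^2 \sum_S \sigma_k^2 + \sum_{U-S} \sigma_k^2 \\
            &= a^2 \sum_S \sigma_k^2 - 2a \sum_S \sigma_k^2 + \sum_U \sigma_k^2 .
\end{aligned}
```

The last line uses $\sum_S \sigma_k^2 + \sum_{U-S} \sigma_k^2 = \sum_U \sigma_k^2$.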




If we assume $\sigma_k^2 \propto \nu_k$ for a known set of $\nu_k$ (say $x_k^{2\gamma}$ for some $\gamma$),
then

$$\hat{\sigma}_k^2 = \left[ \frac{1}{n-1} \sum_S \frac{1}{\nu_i} \left( y_i - x_i \frac{\sum_S x_j y_j / \nu_j}{\sum_S x_j^2 / \nu_j} \right)^2 \right] \nu_k$$

is an unbiased estimator for $\sigma_k^2$.

Since assumptions about the $\sigma_k^2$ are very often dubious, a more
robust approach to model variance estimation is advisable.

                                    23
              Using a Robust Estimator for $\sigma_k^2$


For $k \in S$, define

$$s_k^2 = \frac{\bar{x}_S}{\bar{x}_S - \frac{1}{n} x_k} \, (y_k - b x_k)^2,$$

where $b = \sum_S y_j / \sum_S x_j$.


This is an unbiased estimator for $\sigma_k^2$ when $\sigma_k^2 \propto x_k$,
and nearly (asymptotically) unbiased otherwise because

$$s_k^2 \approx r_k^2 = (y_k - b x_k)^2 \approx \varepsilon_k^2 .$$



                                         24
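Since $\sigma_k^2 \propto x_k$ implies $\frac{N \bar{x}_U}{n \bar{x}_S} \sum_S \sigma_k^2 = \sum_U \sigma_k^2$,

$$v^{robust} = \left(\frac{N \bar{x}_U}{n \bar{x}_S}\right)^2 \sum_S s_k^2 - \frac{N \bar{x}_U}{n \bar{x}_S} \sum_S s_k^2$$

is exactly unbiased as an estimator for the model variance of
$t_y^r$ when $\sigma_k^2 \propto x_k$.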




                                       25
Moreover, $v^{robust}$ is almost unbiased as long as both n and N/n
are large, in which case $s_k^2$ can be approximated by
$r_k^2$ (or $\frac{n}{n-1} r_k^2$), and $v^{robust}$ by

$$v^{approx} = \left(\frac{N \bar{x}_U}{n \bar{x}_S}\right)^2 \sum_S r_k^2 .$$

N/n large means the finite population correction is ignorably small, which is
often, but not always, the case with establishment surveys.
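
The robust estimator and its large-sample approximation can be sketched as follows, taking r_k = y_k − b·x_k with b the ratio of the sample totals, and s_k² equal to r_k² inflated by x̄_S / (x̄_S − x_k/n). The code uses the identity N·x̄_U / (n·x̄_S) = T_x / Σ_S x_k, so only the known x-total is needed. The function name and data are hypothetical.

```python
def ratio_variance_estimates(y_s, x_s, t_x):
    """Return (v_robust, v_approx) for the ratio estimator of the y-total."""
    sum_x = sum(x_s)
    b = sum(y_s) / sum_x
    a = t_x / sum_x                      # equals N*xbar_U / (n*xbar_S)
    r2 = [(y - b * x) ** 2 for y, x in zip(y_s, x_s)]
    # s_k^2 = xbar_S / (xbar_S - x_k/n) * r_k^2 = sum_x / (sum_x - x_k) * r_k^2
    s2 = [sum_x / (sum_x - x) * r for x, r in zip(x_s, r2)]
    v_robust = (a * a - a) * sum(s2)
    v_approx = a * a * sum(r2)
    return v_robust, v_approx


# Hypothetical data (illustrative numbers only).
x_s = [100.0, 250.0, 400.0, 50.0]
y_s = [45.0, 90.0, 170.0, 15.0]
v_robust, v_approx = ratio_variance_estimates(y_s, x_s, t_x=10_000.0)
print(v_robust, v_approx)
```

With only four sampled elements the two values differ noticeably. The approximation is meant for the case where both n and N/n are large.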


                                         26
              Estimating the Randomization MSE
                    of the Hajek Estimator

A conventional nearly unbiased randomization mse estimator is

$$v_{conv} = \sum_{k, j \in S} \frac{\pi_{kj} - \pi_k \pi_j}{\pi_{kj}} \, \frac{r_k}{\pi_k} \, \frac{r_j}{\pi_j},$$

where $r_k = y_k - (t_y^e / t_x^e)\, x_k$.


This complex Horvitz-Thompson form simplifies under many
commonly-used designs.

                                     27
For the combined ratio estimator under stsrs, $t_y^{cr}$, it translates into

$$v_{conv}^{cr} = \sum_{h=1}^{H} \left(\frac{N_h}{n_h}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2 .$$

For the special case of $t_y^r$ under srs, H = 1, and $\bar{r}_{1S} = 0$.


$v_{conv}^{cr}$ also estimates the (large-sample) combined mse of $t_y^{cr}$.



                                          28
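
The stratified form of the conventional estimator for the combined ratio estimator can be sketched as below: within each stratum, the squared deviations of the residuals from their stratum mean are scaled by (N_h/n_h)² (1 − n_h/N_h) n_h/(n_h − 1). The function name and data are hypothetical, and every stratum is assumed to contain at least two sampled elements.

```python
def v_conv_cr(strata):
    """strata: list of (N_h, y_sample_h, x_sample_h); each n_h must be >= 2."""
    t_y_e = sum(n_pop / len(y) * sum(y) for n_pop, y, _ in strata)
    t_x_e = sum(n_pop / len(x) * sum(x) for n_pop, _, x in strata)
    ratio = t_y_e / t_x_e
    total = 0.0
    for n_pop, y, x in strata:
        n_h = len(y)
        r = [yk - ratio * xk for yk, xk in zip(y, x)]   # residuals r_k
        r_bar = sum(r) / n_h                            # stratum mean residual
        ss = sum((rk - r_bar) ** 2 for rk in r)
        total += (n_pop / n_h) ** 2 * (1 - n_h / n_pop) * n_h / (n_h - 1) * ss
    return total


# Hypothetical srs special case: one stratum of N = 40 with n = 4.
v = v_conv_cr([(40, [45.0, 90.0, 170.0, 15.0], [100.0, 250.0, 400.0, 50.0])])
print(v)
```

In the single-stratum case the residuals sum to zero, so the stratum mean residual drops out, as noted for srs.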
We can estimate the combined mse of the Hajek estimator in
general with

$$v_1^{Hajek} = \sum_S (1 - \pi_k) \frac{r_k^2}{\pi_k^2} = \sum_S \left[ \left(\frac{1}{\pi_k}\right)^2 - \frac{1}{\pi_k} \right] r_k^2 .$$


Here $1 - \pi_k$ is an element-specific model-based finite population
correction (fpc) term.


Its average value is $1 - \sum_S \pi_j^2 / n$ (a pooled, model-based fpc
term), not the ad hoc 1 – n/N.

                                 29
                     Systematic Sampling


It is impossible to estimate the randomization mse of a strategy
employing systematic sampling from a purposefully-ordered list.


If one assumes the selection mechanism is ignorable, however,
one can estimate the combined mean squared error of this
strategy.


Usually when the ignorability assumption is violated, estimates of
combined mse will be biased upward.


                                 30
             Randomization-assisted Model-based
                    Variance Estimation

The weighted residual variance estimator for $t_y^{Hajek}$ is

$$v_{wrve} = \sum_{k, j \in S} \frac{\pi_{kj} - \pi_k \pi_j}{\pi_{kj}} \left(\frac{T_x}{t_x^e}\right)^2 \frac{r_k}{\pi_k} \, \frac{r_j}{\pi_j},$$

where $t_x^e = \sum_S x_k / \pi_k$, and $r_k = y_k - (t_y^e / t_x^e)\, x_k$.

It is as good an estimator of randomization mse as $v_{conv}$ and a better
estimator for model variance when all the $\pi_k \ll 1$.

                                       31
Even better estimators for the model variance of the combined
ratio estimator and the general Hajek estimator are

$$v_2^{cr} = \sum_{h=1}^{H} \left[ \left(\frac{N_h T_x}{n_h t_x^e}\right)^2 - \frac{N_h T_x}{n_h t_x^e} \right] \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2,$$

and

$$v_2^{Hajek} = \sum_S \left[ \left(\frac{T_x}{t_x^e} \frac{1}{\pi_k}\right)^2 - \frac{T_x}{t_x^e} \frac{1}{\pi_k} \right] r_k^2 .$$


                                      32
$v_2^{cr}$ retains the large-sample randomization-based properties of
$v_{conv}^{cr}$, and is nearly model unbiased when $\sigma_k^2 \propto x_k$.


$v_2^{Hajek}$ is also nearly model unbiased when $\sigma_k^2 \propto x_k$, but has a
small downward bias. A slightly improved version replaces the
$r_k^2$ with $\frac{n}{n-1} r_k^2$ or, better yet,

$$s_k^2 = \frac{t_x^e}{t_x^e - \frac{1}{\pi_k} x_k} \, r_k^2 .$$

                                         33
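
A sketch of the second Hajek variance estimator follows, with an optional switch for the s_k² adjustment that inflates each squared residual by t_x^e / (t_x^e − x_k/π_k). The function name, the `adjust` flag, and the data are hypothetical, introduced only for illustration.

```python
def v2_hajek(y_s, x_s, pi_s, t_x, adjust=True):
    """Sum over the sample of (a_k^2 - a_k) * residual^2, a_k = T_x / (t_x_e * pi_k)."""
    t_y_e = sum(y / p for y, p in zip(y_s, pi_s))
    t_x_e = sum(x / p for x, p in zip(x_s, pi_s))
    ratio = t_y_e / t_x_e
    total = 0.0
    for y, x, p in zip(y_s, x_s, pi_s):
        res2 = (y - ratio * x) ** 2                 # r_k^2
        if adjust:
            res2 *= t_x_e / (t_x_e - x / p)         # turn r_k^2 into s_k^2
        a_k = t_x / (t_x_e * p)
        total += (a_k * a_k - a_k) * res2
    return total


# Hypothetical equal-probability sample (pi_k = 0.1 for every element).
x_s = [100.0, 250.0, 400.0, 50.0]
y_s = [45.0, 90.0, 170.0, 15.0]
pi_s = [0.1] * 4
v2_adjusted = v2_hajek(y_s, x_s, pi_s, t_x=10_000.0)
v2_plain = v2_hajek(y_s, x_s, pi_s, t_x=10_000.0, adjust=False)
print(v2_adjusted, v2_plain)
```

With equal selection probabilities the multiplier T_x / (t_x^e π_k) reduces to T_x / Σ_S x_k, the same factor that appears in the variance estimators for the simple ratio estimator.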
                  Adjusting for Nonresponse


Let R be the respondent sample, and suppose the population can
be separated into G groups such that the model

$$y_k = \beta_g x_k + \varepsilon_k \quad \text{for } k \in U_g$$

holds with

$$E(\varepsilon_k \mid x_k) = E(\varepsilon_k \mid x_k;\ k \in R) = 0 .$$

The key assumption in prediction modeling for unit nonresponse is
the ignorability of the response mechanism.


                                   34
It is not necessary for all the elements in a group to have the
same probability of response. That is the assumption of an
alternative and complementary paradigm: quasi-random response
modeling.


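Let

$$t_y^{grm} = \sum_{g=1}^{G} T_{xg} \frac{\sum_{R_g} d_k y_k}{\sum_{R_g} d_k x_k},$$

where $d_k = 1/\pi_k$, and the group total $T_{xg}$ used here can be either the
known total or an estimate, $\hat{T}_{xg}$.
This combined-unbiased estimator generalizes the separate ratio
estimator.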
                                     35
$t_y^{grm}$ can be put in reweighted (or calibration-weighted) form as

$$t_y^{grm} = \sum_R w_k y_k,$$

where

$$w_k = \frac{T_{xg}}{\sum_{R_g} d_j x_j} \, d_k, \quad \text{for } k \in R_g .$$
                               Rg d j x j




                                  36
We can estimate the model variance of $t_y^{grm}$ with

$$v_{grm} = \sum_{g=1}^{G} \sum_{k \in R_g} \left( w_k^2 - w_k \right) \left[ \frac{\sum_{R_g} w_j x_j}{\sum_{j \in R_g} w_j x_j - w_k x_k} \right] r_k^2$$

with $r_k = y_k - \dfrac{\sum_{R_g} w_j y_j}{\sum_{R_g} w_j x_j} \, x_k$ for $k \in R_g$.


There is also a randomization-based component of the combined
mse when the Txg are themselves estimated.
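
The reweighting and the associated model variance estimate can be sketched together. Within each group, the weight is the design weight d_k scaled so that the weighted x-total of the respondents reproduces the group total. The variance estimate then sums (w_k² − w_k) times an inflated squared residual. The function names and data are hypothetical, the group totals are treated as known, and each group is assumed to have at least two respondents (otherwise the inflation factor is undefined).

```python
def grm_weights(d, x, group, t_xg):
    """w_k = T_xg / (sum over respondents in g of d_j * x_j) * d_k."""
    denom = {}
    for dk, xk, g in zip(d, x, group):
        denom[g] = denom.get(g, 0.0) + dk * xk
    return [t_xg[g] / denom[g] * dk for dk, g in zip(d, group)]


def v_grm(w, y, x, group):
    """Model variance estimate for the reweighted total sum(w_k * y_k)."""
    swx, swy = {}, {}
    for wk, yk, xk, g in zip(w, y, x, group):
        swx[g] = swx.get(g, 0.0) + wk * xk
        swy[g] = swy.get(g, 0.0) + wk * yk
    total = 0.0
    for wk, yk, xk, g in zip(w, y, x, group):
        r = yk - swy[g] / swx[g] * xk
        total += (wk * wk - wk) * swx[g] / (swx[g] - wk * xk) * r * r
    return total


# Hypothetical respondents: all x_k = 1, design weight 10, two groups.
d = [10.0] * 5
x = [1.0] * 5
y = [1.0, 2.0, 3.0, 4.0, 6.0]
group = ["a", "a", "a", "b", "b"]
w = grm_weights(d, x, group, t_xg={"a": 60.0, "b": 30.0})
t_grm = sum(wk * yk for wk, yk in zip(w, y))
v = v_grm(w, y, x, group)
print(w, t_grm, v)
```

Because the same weights apply to every y-variable, one weight vector can serve all survey items once the x-variable is fixed.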



                                            37
Surveys are usually designed to estimate totals for a number of
variables, for example, not only total corn acres but total number
of corn farms.


Consequently, reweighting for unit nonresponse is often done by
setting all the xk = 1, so that the same set of adjusted weights,

{wk}, can be used for every survey variable.




                                  38
                      Item Nonresponse


Suppose there is item nonresponse for some y-variable closely
related to an auxiliary x-variable known for all sampled elements.
A missing y-value is often imputed with something in this form:

$$\hat{y}_k = \frac{\sum_{j \in R_g} a_{kj} y_j}{\sum_{j \in R_g} a_{kj} x_j} \, x_k,$$

where the $a_{kj}$ can be anything from $d_j$, through $1/x_j$, to
1 for a randomly chosen member of $R_g$ and 0 for all others.
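
The three choices of the donor coefficients can be compared in a small sketch. The function name and the donor values are hypothetical. The last call mimics a single donor that happens to have been drawn, rather than performing the random draw itself.

```python
def ratio_impute(x_k, donors_y, donors_x, a):
    """Imputed y for a nonrespondent with auxiliary value x_k.

    a holds one coefficient per donor (respondent) in the same group.
    """
    num = sum(aj * yj for aj, yj in zip(a, donors_y))
    den = sum(aj * xj for aj, xj in zip(a, donors_x))
    return num / den * x_k


# Hypothetical donors in one group, and a nonrespondent with x_k = 60.
donors_y = [10.0, 30.0]
donors_x = [20.0, 30.0]

equal_weights = ratio_impute(60.0, donors_y, donors_x, a=[1.0, 1.0])
inverse_x = ratio_impute(60.0, donors_y, donors_x, a=[1 / 20.0, 1 / 30.0])
single_donor = ratio_impute(60.0, donors_y, donors_x, a=[0.0, 1.0])
print(equal_weights, inverse_x, single_donor)
```

Equal coefficients (for instance equal design weights) give the ratio of donor totals, coefficients of 1/x_j give the mean of the donor ratios, and the indicator choice copies one donor's ratio.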
                                    39
In each case, we can, for variance-estimation purposes, rewrite
the estimated total in the form of a reweighted estimator
(there is a possibly different reweighted estimator for each survey
variable) and compute

$$v_{grm} = \sum_{g=1}^{G} \sum_{k \in R_g} \left( w_k^2 - w_k \right) \left[ \frac{\sum_{R_g} w_j x_j}{\sum_{j \in R_g} w_j x_j - w_k x_k} \right] r_k^2 .$$

There are other approaches.

                                       40
                            Extensions


When the frame exhibits coverage errors (duplications or missing
population elements), and true group population totals are known
for some auxiliary variable x, then

$$t_y^{grm} = \sum_{g=1}^{G} T_{xg} \frac{\sum_{S_g} d_k y_k}{\sum_{S_g} d_k x_k}$$

is a model unbiased estimator for $T_y$ under the group-ratio model,
assuming the coverage mechanism is ignorable. $v_{grm}$ serves as a
model (and combined) variance estimator.


                                    41
We can extend the framework to incorporate additional phases of
sampling, with cluster sampling in the original phase(s).


We can also incorporate a more complex set of benchmark
variables than the estimated group totals for a scalar x.


We can incorporate nonlinear prediction models for the yk.
Some care must be taken, however, because $T_y = \sum_U y_k$ is itself
a linear function of random variables under the prediction model.



                                  42
                         Concluding Remark


When all the sampling fractions (the $n_h/N_h$) are small enough to
ignore,

$$v_{2approx}^{cr} = \sum_{h=1}^{H} \left(\frac{N_h T_x}{n_h t_x^e}\right)^2 \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2$$

is a good estimator for the randomization mean squared error of
the $t_y^{cr}$ estimator under stsrs and an even better estimator of its model
variance.


                                     43
It can be shown that $v_{2approx}^{cr}$ is asymptotically closer to the
stratified jackknife variance estimator than to the more
conventional

$$v_{conv\,approx}^{cr} = \sum_{h=1}^{H} \left(\frac{N_h}{n_h}\right)^2 \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2 .$$

That is why I argue that the stratified jackknife should be viewed
as a robust (to misspecification of the $\sigma_k^2$) model-based variance
estimator rather than a randomization-based variance estimator.
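
A one-line comparison of the two small-sampling-fraction estimators follows directly from their definitions. The stratum multipliers are $(N_h T_x / (n_h t_x^e))^2$ in one and $(N_h / n_h)^2$ in the other, so they differ only by a factor common to all strata:

```latex
v_{2approx}^{cr}
  = \sum_{h=1}^{H} \left(\frac{T_x}{t_x^e}\right)^2 \left(\frac{N_h}{n_h}\right)^2
    \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2
  = \left(\frac{T_x}{t_x^e}\right)^2 v_{conv\,approx}^{cr} .
```

The two coincide when the sample-based estimate $t_x^e$ happens to equal the known total $T_x$. Otherwise $v_{2approx}^{cr}$ is the larger of the two when $t_x^e < T_x$ and the smaller when $t_x^e > T_x$.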

                                    44
                                        References
Brewer, K. R. W. (1963). Ratio estimation and finite populations: some results deducible
      from the assumption of an underlying stochastic process. Australian Journal of
      Statistics 5, 93-105.
Brewer, K. R. W. (2002). Combined survey and sampling inference: Weighing Basu’s
       elephants. London: Arnold.
Deville, J- C. and C- E. Särndal (1992). Calibration estimators in survey sampling.
      Journal of the American Statistical Association 87, 376.
Isaki, C.T. and W. A. Fuller (1982). Survey design under the regression super-population
      model. Journal of the American Statistical Association 77, 89-96.
Kott, P. S. (2004). Randomization-assisted model-based survey sampling. Journal of
      Statistical Planning and Inference 48, 263-277.
Kott, P. S. and Bailey, J. T. (2000). The theory and practice of maximal Brewer selection.
      Proceedings of the Second International Conference on Establishment Surveys,
      invited papers, 269-278.
Kott, P. S. and K. R. W. Brewer (2001). Estimating the model variance of a randomization-
      consistent regression estimator. Proceedings of the Section on Survey Research
      Methods, American Statistical Association, Washington DC.
Särndal, C- E, B. Swensson, and J. Wretman (1989). The weighted residual technique for
      estimating the variance of the general regression estimator of a finite population total.
      Biometrika 76, 527-537.
Valliant, R, Dorfman, A.H., and Royall, R.M. (2000). Finite population sampling and
      inference: a prediction approach. New York: Wiley.

                                                45
The full paper is available at my website:

http://www.nass.usda.gov/research/OD5.htm




                                         46

								