The Role of Models in Variance Estimation for Establishment Surveys: An Introductory Overview Lecture

Phillip S. Kott
National Agricultural Statistics Service
June 21, 2007
pkott@nass.usda.gov
703 877-8000 x 102

Outline

Why Do We Need Models?
The Role of Models in Establishment Surveys
The Role of Models in Variance Estimation
Prediction Models in Nonresponse Adjustment
Extensions
Concluding Remark

Why Do We Need Models?

To make statistically rigorous inferences from a convenience sample or when the sample size is too small for randomization-based (design-based) inference.

To choose the best among reasonable-sounding randomization-based estimation strategies.

To simplify variance estimation under an unequal probability-of-selection scheme.

To speed up the asymptotics with middling-sized samples. This will often improve the accuracy of coverage intervals.

To put statistical structure on "ad hoc" practices. This provides us with tools for evaluating their strengths and weaknesses.

To estimate totals and their variances in the face of nonresponse or coverage errors.

Model-based Estimators (for totals)

Goal: Estimate a population ($U$) total,

$$T_y = \sum_U y_k = \sum_{k \in U} y_k,$$

with a sample, $S$, of $n$ (out of $N$) elements.

Given: An auxiliary variable, $x_k$ for $k \in S$, about which $T_x = \sum_U x_k$ is known.

The auxiliary variable may come from administrative data or a census. It could trivially be 1 for all elements in the population.

A Motivating Example

A population of farms, where $x_k$ is the total land (believed to be) on farm $k$, and $y_k$ is the planted acres of a particular crop (say corn) on the farm.

The total land value is assumed to be known before sampling and enumeration. It cannot be changed by survey information even if found to be in error.
If we assume that each $y_k$, $k \in U$, satisfies the linear prediction model:

$$y_k = \beta x_k + \epsilon_k, \quad \text{where } E(\epsilon_k \mid x_k) = 0,$$

then the ratio estimator,

$$t_y^{r} = T_x \frac{\sum_S y_k}{\sum_S x_k} = T_x \frac{\bar{y}_S}{\bar{x}_S},$$

is a (model) unbiased estimator (predictor) for $T_y$ in the sense that $E(t_y^{r} - T_y) = 0$ when the sampling mechanism is ignorable:

$$E(\epsilon_k \mid x_k) = E(\epsilon_k \mid x_k;\ k \in S) = 0.$$

This ignorability assumption is unnecessary for simple random samples and cutoff samples (so long as the population values are unchanged by survey information), but is needed for convenience samples.

Model Groups

Some populations naturally separate into $G$ mutually exclusive groups. If the group x-totals, $T_{xg} = \sum_{U_g} x_k$, are known, then the separate ratio estimator:

$$t_y^{sr} = \sum_{g=1}^{G} T_{xg} \frac{\sum_{S_g} y_k}{\sum_{S_g} x_k} = \sum_{g=1}^{G} T_{xg} \frac{\bar{y}_{gS}}{\bar{x}_{gS}}$$

is unbiased for $T_y$ not only under the simple linear model, but also under the group-ratio model:

$$y_k = \beta_g x_k + \epsilon_k \quad \text{for } k \in U_g,$$

where (again) $E(\epsilon_k \mid x_k) = E(\epsilon_k \mid x_k;\ k \in S) = 0$.

The simple ratio estimator, $t_y^{r}$, is generally not unbiased under this model.

When all the $x_k = 1$, the group-ratio model is called the "group-mean model." The separate ratio estimator has the familiar "poststratified" form:

$$t_y^{PS} = \sum_{g=1}^{G} N_g \frac{\sum_{S_g} y_k}{n_g} = \sum_{g=1}^{G} N_g \bar{y}_{gS}.$$

Important distinction: Although the $G$ groups look like $H$ design strata, the difference is that samples need not be selected independently across groups.

Prediction Form

Returning to the simple linear model, $y_k = \beta x_k + \epsilon_k$, an estimator for $T_y$ of the form:

$$t_y^{pred} = \sum_S y_k + \hat{\beta} \sum_{U - S} x_k$$

is model unbiased when $\hat{\beta}$ is an unbiased estimator for $\beta$.

This estimator uses the model to predict the y-values for the nonsampled elements.

If we assume the $\epsilon_k$ are uncorrelated and each has variance $\sigma_k^2$, then the best linear unbiased (BLU) estimator for $T_y$ is

$$t_y^{BLU} = \sum_S y_k + \left( \frac{\sum_S x_j y_j / \sigma_j^2}{\sum_S x_j^2 / \sigma_j^2} \right) \sum_{U - S} x_k.$$

This collapses into the ratio estimator if and only if $\sigma_k^2 \propto x_k$.

Model-assisted Randomization-based Estimators

But models fail. In particular, $y_k / x_k$ may tend to increase or decrease along with $x_k$.

Nevertheless, from a randomization-based point of view, $t_y^{r}$ under simple random sampling is almost unbiased. In fact, its relative randomization mean squared error will tend to zero as the sample size grows arbitrarily large: it is randomization consistent.

Moreover, if the $e_k = y_k - B x_k$, where $B = T_y / T_x$, tend to be smaller in absolute value than the $y_k - \bar{y}_U$, then the ratio estimator will have less randomization mean squared error than the expansion estimator,

$$t_y^{e} = \frac{N}{n} \sum_S y_k = N \bar{y}_S.$$

In other words, $y_k = \beta x_k + \epsilon_k$, however imperfect, is a better implicit prediction model than $y_k = \beta + \epsilon_k$.

Combined Mean Squared Error

Without a model, it is often impossible to choose among reasonable randomization-based estimation strategies with identical sample sizes. As a result, it is often helpful to:

1. Restrict attention (if possible) to almost randomization-unbiased or randomization-consistent estimators, and
2. Look at the model expectation of the randomization mean squared error (mse) of this strategy.

Suppose we want to estimate $T_y$ with the Hajek estimator,

$$t_y^{Hajek} = T_x \frac{\sum_S y_k / \pi_k}{\sum_S x_k / \pi_k} = T_x \frac{t_y^{e}}{t_x^{e}},$$

where $t_y^{e} = \sum_S y_k / \pi_k$ and $t_x^{e} = \sum_S x_k / \pi_k$, and the element selection probabilities, the $\pi_k$, can be anything.

The Hajek estimator is unbiased under the linear model, $y_k = \beta x_k + \epsilon_k$, and is almost randomization unbiased.

When the $\epsilon_k$ are uncorrelated and each has variance $\sigma_k^2$, one can show that for a fixed (expected) sample size $n$, the (asymptotic) combined mse of $t_y^{Hajek}$ is minimized under Brewer selection; that is, when

$$\pi_k = \frac{n \sigma_k}{\sum_U \sigma_j}$$

if all $\sigma_k \le \sum_U \sigma_j / n$ (otherwise $k$ should be selected with certainty).

Although the $\sigma_k$ need only be known up to a constant for Brewer selection, even that is unusual. In establishment surveys it is often speculated that $\sigma_k \propto x_k^{\gamma}$ for some $\gamma$ between 1/2 and 1 (note that $x_k$ must be known for all elements in $U$ to use this).

Getting the $\sigma_k$ right is unnecessary for randomization consistency or model unbiasedness.
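As a rough illustration of Brewer selection and the Hajek estimator described above, here is a minimal Python sketch. The function names, the toy x-values, and the choice $\gamma = 1/2$ are hypothetical. Recomputing the probabilities for the remaining elements after each round of certainty selections is an assumption about how "selected with certainty" is handled in practice; the slides do not spell it out.

```python
def brewer_probabilities(sigma, n):
    """Selection probabilities proportional to sigma_k for expected sample size n.

    Elements whose probability would reach 1 are taken with certainty, and the
    remaining probabilities are recomputed (an assumed, iterative convention).
    Assumes all sigma_k > 0 and n <= len(sigma).
    """
    pi = [0.0] * len(sigma)
    remaining = set(range(len(sigma)))
    n_left = n
    while True:
        total = sum(sigma[k] for k in remaining)
        certain = [k for k in remaining if n_left * sigma[k] >= total]
        if not certain:
            break
        for k in certain:
            pi[k] = 1.0
            remaining.discard(k)
        n_left -= len(certain)
    for k in remaining:
        pi[k] = n_left * sigma[k] / total
    return pi


def hajek_total(y_sample, x_sample, pi_sample, T_x):
    """Hajek estimator: T_x * (sum_S y_k/pi_k) / (sum_S x_k/pi_k)."""
    t_y = sum(yk / pk for yk, pk in zip(y_sample, pi_sample))
    t_x = sum(xk / pk for xk, pk in zip(x_sample, pi_sample))
    return T_x * t_y / t_x


# Toy population of five units (made-up x-values), with sigma_k = x_k ** gamma.
x = [1.0, 4.0, 9.0, 16.0, 400.0]
gamma = 0.5
sigma = [xk ** gamma for xk in x]
pi = brewer_probabilities(sigma, n=3)  # the largest unit becomes a certainty

# A hypothetical realized sample; y follows the linear model with no error.
sample = [1, 3, 4]
beta = 2.0
estimate = hajek_total(
    [beta * x[k] for k in sample],
    [x[k] for k in sample],
    [pi[k] for k in sample],
    T_x=sum(x),
)
```

When the $y_k$ are exactly proportional to the $x_k$, as in this toy sample, the estimate equals $\beta T_x$ whatever the $\pi_k$ are, which matches the remark that getting the $\sigma_k$ right is not needed for model unbiasedness.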
The Hajek estimator under stratified simple random sampling (stsrs) is called the combined ratio estimator,

$$t_y^{cr} = T_x \frac{\sum_S y_k / \pi_k}{\sum_S x_k / \pi_k} = T_x \frac{\sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{S_h} y_k}{\sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{S_h} x_k},$$

where $\pi_k = n_h / N_h$ when $k$ is in $S_h$ (the sample in stratum $h$).

It is randomization consistent and model unbiased under the simple linear model but not the group-ratio model.

Estimating the Model Variance of $t_y^{r}$

Observe that under the simple linear model:

$$t_y^{r} - T_y = \frac{\sum_S y_k}{\sum_S x_k} \sum_U x_k - \sum_U y_k = \frac{\sum_S \epsilon_k}{\sum_S x_k} \sum_U x_k - \sum_U \epsilon_k = \frac{N \bar{x}_U}{n \bar{x}_S} \sum_S \epsilon_k - \sum_U \epsilon_k.$$

Thus, the model bias of $t_y^{r}$ is zero.

If the $\epsilon_k$ are uncorrelated and each has variance $\sigma_k^2$, then the model variance of $t_y^{r}$ is

$$E\left[ (t_y^{r} - T_y)^2 \right] = \left( \frac{N \bar{x}_U}{n \bar{x}_S} \right)^2 \sum_S \sigma_k^2 - 2 \frac{N \bar{x}_U}{n \bar{x}_S} \sum_S \sigma_k^2 + \sum_U \sigma_k^2,$$

which could be estimated in an unbiased fashion if we had an unbiased estimator for each $\sigma_k^2$.

If we assume $\sigma_k^2 = \sigma^2 \phi_k$ for a known set of $\phi_k$ (say $x_k^{2\gamma}$ for some $\gamma$), then

$$\hat{\sigma}_k^2 = \frac{\phi_k}{n - 1} \sum_{i \in S} \frac{1}{\phi_i} \left( y_i - x_i \frac{\sum_S x_j y_j / \phi_j}{\sum_S x_j^2 / \phi_j} \right)^2$$

is an unbiased estimator for $\sigma_k^2$.

Since assumptions about the $\sigma_k^2$ are very often dubious, a more robust approach to model variance estimation is advisable.

Using a Robust Estimator for $\sigma_k^2$

For $k \in S$, define

$$s_k^2 = \frac{\bar{x}_S (y_k - b x_k)^2}{\bar{x}_S - x_k / n},$$

where $b = \sum_S y_j / \sum_S x_j$.

This is an unbiased estimator for $\sigma_k^2$ when $\sigma_k^2 \propto x_k$, and nearly (asymptotically) unbiased otherwise because $r_k = y_k - b x_k \approx \epsilon_k$, so that $s_k^2 \approx r_k^2 \approx \epsilon_k^2$.

Since $\sigma_k^2 \propto x_k$ implies $\sum_U \sigma_k^2 = \frac{N \bar{x}_U}{n \bar{x}_S} \sum_S \sigma_k^2$,

$$v^{robust} = \left( \frac{N \bar{x}_U}{n \bar{x}_S} \right)^2 \sum_S s_k^2 - \frac{N \bar{x}_U}{n \bar{x}_S} \sum_S s_k^2$$

is exactly unbiased as an estimator for the model variance of $t_y^{r}$ when $\sigma_k^2 \propto x_k$.

Moreover, $v^{robust}$ is almost unbiased as long as both $n$ and $N/n$ are large, in which case $s_k^2$ can be approximated by $r_k^2$ (or $\frac{n}{n-1} r_k^2$), and $v^{robust}$ by

$$v^{approx} = \left( \frac{N \bar{x}_U}{n \bar{x}_S} \right)^2 \sum_S r_k^2.$$

$N/n$ large means the finite population correction is ignorably small, which is often, but not always, the case with establishment surveys.

Estimating the Randomization MSE of the Hajek Estimator

A conventional nearly unbiased randomization mse estimator is

$$v_{conv} = \sum_{k, j \in S} \frac{\pi_{kj} - \pi_k \pi_j}{\pi_{kj}} \frac{r_k}{\pi_k} \frac{r_j}{\pi_j},$$

where $r_k = y_k - (t_y^{e} / t_x^{e}) x_k$.

This complex Horvitz-Thompson form simplifies under many commonly used designs.

For the combined ratio estimator under stsrs, $t_y^{cr}$, it translates into

$$v_{conv}^{cr} = \sum_{h=1}^{H} \left( \frac{N_h}{n_h} \right)^2 \left( 1 - \frac{n_h}{N_h} \right) \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2.$$

For the special case of $t_y^{r}$ under srs, $H = 1$, and $\bar{r}_{1S} = 0$.

$v_{conv}^{cr}$ also estimates the (large-sample) combined mse of $t_y^{cr}$.

We can estimate the combined mse of the Hajek estimator in general with

$$v_1^{Hajek} = \sum_S (1 - \pi_k) \left( \frac{r_k}{\pi_k} \right)^2 = \sum_S \left( \frac{1}{\pi_k^2} - \frac{1}{\pi_k} \right) r_k^2.$$

Here $1 - \pi_k$ is an element-specific model-based finite population correction (fpc) term. Its average value over the sample, $1 - \sum_S \pi_j / n$, is a pooled, model-based fpc term, not the ad hoc $1 - n/N$.

Systematic Sampling

It is impossible to estimate the randomization mse of a strategy employing systematic sampling from a purposefully ordered list.

If one assumes the selection mechanism is ignorable, however, one can estimate the combined mean squared error of this strategy.

Usually when the ignorability assumption is violated, estimates of combined mse will be biased upward.

Randomization-assisted Model-based Variance Estimation

The weighted residual variance estimator for $t_y^{Hajek}$ is

$$v_{wrve} = \left( \frac{T_x}{t_x^{e}} \right)^2 \sum_{k, j \in S} \frac{\pi_{kj} - \pi_k \pi_j}{\pi_{kj}} \frac{r_k}{\pi_k} \frac{r_j}{\pi_j},$$

where $t_x^{e} = \sum_S x_k / \pi_k$, and $r_k = y_k - (t_y^{e} / t_x^{e}) x_k$.

It is as good an estimator of randomization mse as $v_{conv}$ and a better estimator for model variance when all the $\pi_k \ll 1$.

Even better estimators for the model variance of the combined ratio estimator and the general Hajek estimator are

$$v_2^{cr} = \sum_{h=1}^{H} \left[ \left( \frac{N_h T_x}{n_h t_x^{e}} \right)^2 - \frac{N_h T_x}{n_h t_x^{e}} \right] \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2,$$

and

$$v_2^{Hajek} = \sum_S \left[ \left( \frac{T_x}{t_x^{e}} \frac{1}{\pi_k} \right)^2 - \frac{T_x}{t_x^{e}} \frac{1}{\pi_k} \right] r_k^2.$$

$v_2^{cr}$ retains the large-sample randomization-based properties of $v_{conv}^{cr}$, and is nearly model unbiased when $\sigma_k^2 \propto x_k$.

$v_2^{Hajek}$ is also nearly model unbiased when $\sigma_k^2 \propto x_k$, but has a small downward bias. A slightly improved version replaces the $r_k^2$ with $\frac{n}{n-1} r_k^2$ or, better yet,

$$s_k^2 = \frac{t_x^{e}}{t_x^{e} - x_k / \pi_k} r_k^2.$$

Adjusting for Nonresponse

Let $R$ be the respondent sample, and suppose the population can be separated into $G$ groups such that the model

$$y_k = \beta_g x_k + \epsilon_k \quad \text{for } k \in U_g$$

holds with $E(\epsilon_k \mid x_k) = E(\epsilon_k \mid x_k;\ k \in R) = 0$.

The key assumption in prediction modeling for unit nonresponse is the ignorability of the response mechanism.

It is not necessary for all the elements in a group to have the same probability of response. That is the assumption of an alternative and complementary paradigm: quasi-random response modeling.

Let

$$t_y^{grm} = \sum_{g=1}^{G} T_{xg} \frac{\sum_{R_g} d_k y_k}{\sum_{R_g} d_k x_k},$$

where $d_k = 1 / \pi_k$, and $T_{xg}$ can be either the known group total or an estimate of it, $\hat{T}_{xg}$.

This combined-unbiased estimator generalizes the separate ratio estimator.

$t_y^{grm}$ can be put in reweighted (or calibration-weighted) form as

$$t_y^{grm} = \sum_R w_k y_k, \quad \text{where } w_k = d_k \frac{T_{xg}}{\sum_{R_g} d_j x_j} \text{ for } k \in R_g.$$

We can compute the model variance of $t_y^{grm}$ as

$$v_{grm} = \sum_{g=1}^{G} \sum_{k \in R_g} \frac{\sum_{j \in R_g} w_j x_j}{\sum_{j \in R_g} w_j x_j - w_k x_k} \left( w_k^2 - w_k \right) r_k^2,$$

with

$$r_k = y_k - \frac{\sum_{R_g} w_j y_j}{\sum_{R_g} w_j x_j} x_k \quad \text{for } k \in R_g.$$

There is also a randomization-based component of the combined mse when the $T_{xg}$ are themselves estimated.

Surveys are usually designed to estimate totals for a number of variables, for example, not only total corn acres but total number of corn farms.

Consequently, reweighting for unit nonresponse is often done by setting all the $x_k = 1$, so that the same set of adjusted weights, $\{w_k\}$, can be used for every survey variable.

Item Nonresponse

Suppose there is item nonresponse for some y-variable closely related to an auxiliary x-variable known for all sampled elements.

A missing y-value is often imputed with something in this form:

$$\hat{y}_k = \frac{\sum_{j \in R_g} a_{kj} y_j}{\sum_{j \in R_g} a_{kj} x_j} x_k,$$

where the $a_{kj}$ can be anything from $d_j$, through $1/x_j$, to 1 for a randomly chosen member of $R_g$ and 0 for all others.
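The imputation rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the toy respondent values, and the design weights are hypothetical, and the weights $a_{kj}$ are passed as a plain list for one fixed nonrespondent $k$ within a single group.

```python
import random


def impute_ratio(x_missing, x_resp, y_resp, a):
    """Impute y_k = (sum_j a_j * y_j / sum_j a_j * x_j) * x_k over the group's respondents."""
    num = sum(aj * yj for aj, yj in zip(a, y_resp))
    den = sum(aj * xj for aj, xj in zip(a, x_resp))
    return num / den * x_missing


# Toy respondents in one group (made-up values) and one item nonrespondent.
x_resp = [10.0, 20.0, 40.0]
y_resp = [5.0, 8.0, 24.0]
d = [2.0, 2.0, 4.0]   # hypothetical design weights d_j = 1 / pi_j
x_k = 30.0            # auxiliary value of the element with the missing y

# a_kj = d_j: weighted ratio imputation.
y_hat_weighted = impute_ratio(x_k, x_resp, y_resp, d)

# a_kj = 1 / x_j: the mean of the respondents' ratios y_j / x_j, times x_k.
y_hat_mean_of_ratios = impute_ratio(x_k, x_resp, y_resp, [1.0 / xj for xj in x_resp])

# a_kj = 1 for one randomly chosen respondent, 0 for the rest: a ratio-adjusted donor.
rng = random.Random(0)
donor = rng.randrange(len(x_resp))
indicator = [1.0 if j == donor else 0.0 for j in range(len(x_resp))]
y_hat_donor = impute_ratio(x_k, x_resp, y_resp, indicator)
```

All three choices are the same ratio form with different weights, so a single function covers them.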
In each case, we can, for variance-estimation purposes, rewrite the estimated total in the form of a reweighted estimator (there is a possibly different reweighted estimator for each survey variable) and compute

$$v_{grm} = \sum_{g=1}^{G} \sum_{k \in R_g} \frac{\sum_{j \in R_g} w_j x_j}{\sum_{j \in R_g} w_j x_j - w_k x_k} \left( w_k^2 - w_k \right) r_k^2.$$

There are other approaches.

Extensions

When the frame exhibits coverage errors (duplications or missing population elements), and true group population totals are known for some auxiliary variable $x$, then

$$t_y^{grm} = \sum_{g=1}^{G} T_{xg} \frac{\sum_{S_g} d_k y_k}{\sum_{S_g} d_k x_k}$$

is a model unbiased estimator for $T_y$ under the group-ratio model, assuming the coverage mechanism is ignorable.

$v_{grm}$ serves as a model (and combined) variance estimator.

We can extend the framework to incorporate additional phases of sampling, with cluster sampling in the original phase(s).

We can also incorporate a more complex set of benchmark variables than the estimated group totals for a scalar $x$.

We can incorporate nonlinear prediction models for the $y_k$. Some care must be taken, however, because $T_y = \sum_U y_k$ is itself a linear function of random variables under the prediction model.

Concluding Remark

When all the sampling fractions (the $n_h / N_h$) are small enough to ignore,

$$v_{2approx}^{cr} = \sum_{h=1}^{H} \left( \frac{N_h T_x}{n_h t_x^{e}} \right)^2 \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2$$

is a good estimator for the randomization mean squared error of the $t_y^{cr}$ estimator under stsrs and an even better estimator of its model variance.

It can be shown that $v_{2approx}^{cr}$ is asymptotically closer to the stratified jackknife variance estimator than to the more conventional

$$v_{conv\text{-}approx}^{cr} = \sum_{h=1}^{H} \left( \frac{N_h}{n_h} \right)^2 \frac{n_h}{n_h - 1} \sum_{k \in S_h} (r_k - \bar{r}_{hS})^2.$$

That is why I argue that the stratified jackknife should be viewed as a robust (to misspecification of the $\sigma_k^2$) model-based variance estimator rather than a randomization-based variance estimator.

References

Brewer, K. R. W. (1963). Ratio estimation and finite populations: some results deducible from the assumption of an underlying stochastic process. Australian Journal of Statistics 5, 93-105.

Brewer, K. R. W. (2002). Combined survey and sampling inference: Weighing Basu's elephants. London: Arnold.

Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association 87, 376.

Isaki, C. T. and Fuller, W. A. (1982). Survey design under the regression super-population model. Journal of the American Statistical Association 77, 89-96.

Kott, P. S. (2004). Randomization-assisted model-based survey sampling. Journal of Statistical Planning and Inference 48, 263-277.

Kott, P. S. and Bailey, J. T. (2000). The theory and practice of maximal Brewer selection. Proceedings of the Second International Conference on Establishment Surveys, invited papers, 269-278.

Kott, P. S. and Brewer, K. R. W. (2001). Estimating the model variance of a randomization-consistent regression estimator. Proceedings of the Section on Survey Research Methods, American Statistical Association, Washington DC.

Särndal, C.-E., Swensson, B., and Wretman, J. (1989). The weighted residual technique for estimating the variance of the general regression estimator of a finite population total. Biometrika 76, 527-537.

Valliant, R., Dorfman, A. H., and Royall, R. M. (2000). Finite population sampling and inference: a prediction approach. New York: Wiley.

The full paper is available at my website: http://www.nass.usda.gov/research/OD5.htm