Ratio and Regression Estimation

Basic ideas:
• The estimated ratio of one variable to a related (auxiliary) variable can be used to obtain a more precise estimate (i.e., one with smaller sampling variance) of a population mean or total than can be obtained from the one variable alone, provided the population value of the auxiliary variable is known or can be measured easily.
• The ratio estimate is slightly biased, as studied before, but the bias is negligible for large n.
• The ratio estimation strategy works well when the auxiliary variable is highly correlated with the variable under investigation.
• In ratio estimation, one auxiliary variable is used.
• In regression estimation, one or more auxiliary variables can be used.

Examples of ratio estimation:
1. In 1786, Laplace proposed a method of population estimation (Stigler, History of Statistics, pp. 163-164). He suggested conducting a census in a few carefully selected communities to determine the ratio of population to births, and multiplying this ratio by the total number of births in France to estimate the population total. Vital statistics were readily available for most local areas at that time. This method was actually used in 1802, instead of taking a population census of the entire country. (Cf. the first US census, taken in 1790.)
2. To estimate the amount of juice that can be produced from a truckload of oranges, a sample of oranges can be squeezed to determine the ratio of juice to weight. The estimated ratio is then multiplied by the total weight of the oranges to estimate the total amount of juice. It is easier to weigh the oranges in the truck than to count them.
3. To estimate the current population of a local area, the population can be counted in randomly selected census tracts to determine the ratio of the current population to the previous census count. The current population of the entire area can then be estimated by multiplying the estimated ratio by the previous census total.
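The orange-juice example can be sketched in a few lines of Python. The sample weights, juice yields, and truckload weight below are made-up numbers used only for illustration, and the function name is hypothetical.

```python
# Hypothetical sample of 8 oranges: weight in pounds (auxiliary variable)
# and juice in ounces (variable of interest). All figures are invented.
sample_weight = [0.40, 0.48, 0.43, 0.42, 0.50, 0.46, 0.39, 0.41]
sample_juice = [2.6, 3.1, 2.8, 2.7, 3.3, 3.0, 2.5, 2.7]
truck_weight = 1800.0  # assumed known total weight of the truckload (pounds)


def ratio_estimate_total(target, auxiliary, auxiliary_total):
    """Return (r, r * auxiliary_total), where r = sum(target) / sum(auxiliary)."""
    r = sum(target) / sum(auxiliary)
    return r, r * auxiliary_total


r, juice_total = ratio_estimate_total(sample_juice, sample_weight, truck_weight)
print(f"estimated juice per pound: {r:.3f} oz")
print(f"estimated total juice:     {juice_total:.0f} oz")
```

Only the total weight of the truckload is needed; the number of oranges never enters the calculation.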
Variance of sample ratio:
• The population ratio is $R = X/Y = \bar{X}/\bar{Y}$ and the sample ratio is $r = x/y = \bar{x}/\bar{y}$. Since both the numerator and the denominator of the ratio are random variables, an expression for its sampling variance cannot be derived easily.
• Using a Taylor series approximation, the following expression can be obtained:

$$\mathrm{Var}(r) \approx \frac{1}{n\bar{Y}^2} \cdot \frac{\sum_{i=1}^{N} (X_i - R Y_i)^2}{N} \cdot \frac{N - n}{N - 1}$$

This is an alternative expression of formula (7.1) on page 193 (ignoring the square root). This can be verified using the data in the illustrative example on that page.
• This can be estimated by

$$\widehat{\mathrm{Var}}(r) = \frac{1}{n\bar{Y}^2} \cdot \frac{\sum_{i=1}^{n} (x_i - r y_i)^2}{n - 1} \cdot \frac{N - n}{N},$$

where $\bar{Y}$ can be replaced by $\bar{y}$ when it is unknown. This is an alternative expression of formula (7.3) on page 196 (ignoring the square root).

Estimation of the population mean and total:
• The population mean is estimated by $r\bar{Y}$ and the population total by $rY$.
• The sampling variance of the estimated population mean can be estimated by multiplying the variance of r (see the formulas above) by $\bar{Y}^2$; the sampling variance of the estimated population total is obtained by multiplying the variance of r by $Y^2$.

When to use the ratio estimation:
The ratio estimate will have smaller sampling variance than the simple inflation estimate whenever

$$\rho_{xy} > \frac{V_y}{2V_x} \quad\text{or, equivalently,}\quad V_y < 2\rho_{xy}V_x,$$

where $V_x$ and $V_y$ are the coefficients of variation (cv) of x and y. That is, the correlation between x and y should be larger than one-half of the ratio of the cv of y to the cv of x. When the cv of the two variables is the same, the correlation should be greater than 0.5.

Ratio estimation in stratified random sampling:
1. Combined ratio estimate:

$$x'_{Rc} = \frac{\bar{x}_{st}}{\bar{y}_{st}}\, Y$$

2. Separate ratio estimate:

$$x'_{Rs} = \sum_{h=1}^{L} \frac{\bar{x}_h}{\bar{y}_h}\, Y_h = \sum_{h=1}^{L} r_h Y_h$$

Note that the separate ratio estimate requires more information: the total $Y_h$ must be known for every stratum, whereas the combined estimate needs only the overall total Y.
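The combined and separate estimators can be compared by brute force on the small two-stratum population used in the illustrative example that follows. The sketch below enumerates every stratified sample of 2 units per stratum and summarizes both estimators; the function names are hypothetical, and the variance is taken over the 9 equally likely samples (divisor 9). The summary figures should agree with the tabulated ones to about two decimal places, with small differences in later digits possible if the table was computed from rounded estimates.

```python
from itertools import combinations, product
from statistics import mean

# Artificial population: (x, y) pairs by stratum, as in the illustrative example.
strata = [
    [(1, 4), (2, 5), (3, 6)],
    [(3, 14), (4, 16), (5, 18)],
]
n_h = 2  # sample size drawn from each stratum

N_h = [len(s) for s in strata]
N = sum(N_h)
Y_h = [sum(y for _, y in s) for s in strata]  # known stratum totals of y
Y = sum(Y_h)                                  # known overall total of y
X = sum(x for s in strata for x, _ in s)      # true total, used only to measure bias


def combined_estimate(samples):
    """(x_st / y_st) * Y, with stratum means weighted by N_h / N."""
    x_st = sum(Nh / N * mean([x for x, _ in s]) for Nh, s in zip(N_h, samples))
    y_st = sum(Nh / N * mean([y for _, y in s]) for Nh, s in zip(N_h, samples))
    return x_st / y_st * Y


def separate_estimate(samples):
    """Sum over strata of r_h * Y_h, with r_h the within-stratum sample ratio."""
    return sum(
        sum(x for x, _ in s) / sum(y for _, y in s) * Yh
        for s, Yh in zip(samples, Y_h)
    )


def summarize(estimates, truth):
    """Expected value, bias, variance, and MSE over equally likely samples."""
    ev = mean(estimates)
    var = mean([(e - ev) ** 2 for e in estimates])
    bias = ev - truth
    return ev, bias, var, var + bias ** 2


# Every combination of 2 units from stratum 1 paired with every one from stratum 2.
all_samples = list(product(*(combinations(s, n_h) for s in strata)))
combined = [combined_estimate(s) for s in all_samples]
separate = [separate_estimate(s) for s in all_samples]

for name, est in (("combined", combined), ("separate", separate)):
    ev, bias, var, mse = summarize(est, X)
    print(f"{name}: E = {ev:.2f}, bias = {bias:.3f}, var = {var:.4f}, MSE = {mse:.4f}")
```

Enumeration is feasible here only because the population is tiny; for realistic populations the same comparison would be made by simulation.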
Illustrative example of ratio estimation in a stratified sample:
Consider the following artificial population of N = 6 in two strata of equal size (x = number with MPH degree; y = total number of employees):

Stratum 1    x:  1   2   3    Total = 6
             y:  4   5   6    Total = 15
Stratum 2    x:  3   4   5    Total = 12
             y: 14  16  18    Total = 48
Totals: X = 18, Y = 63

For a stratified random sample of 2 from each stratum, there are 9 possible samples. The combined and separate ratio estimates of the total X from each sample are as follows (y values are in parentheses):

Sample   Stratum 1    Stratum 2      Combined   Separate
1.       1(4) 2(5)    3(14) 4(16)    16.15      16.20
2.       1(4) 2(5)    3(14) 5(18)    16.90      17.00
3.       1(4) 2(5)    4(16) 5(18)    17.58      17.71
4.       1(4) 3(6)    3(14) 4(16)    17.33      17.20
5.       1(4) 3(6)    3(14) 5(18)    18.00      18.00
6.       1(4) 3(6)    4(16) 5(18)    18.61      18.71
7.       2(5) 3(6)    3(14) 4(16)    18.44      18.02
8.       2(5) 3(6)    3(14) 5(18)    19.05      18.82
9.       2(5) 3(6)    4(16) 5(18)    19.60      19.52

Expected value:                      17.96      17.91
Bias:                                -0.038     -0.091
Variance:                            1.0526     0.9307
MSE:                                 1.0542     0.9390

• Note that the bias is smaller for the combined estimate, while the variance and MSE are larger for the combined estimate.
• For a small sample in each stratum, the combined ratio estimate is recommended (bias consideration). For a large sample in each stratum, the separate ratio estimate is recommended (variance/MSE consideration; the bias is negligible for a large sample).

Regression estimation:
• The ratio estimate is a special case of a regression estimate. The regression estimate of the population mean has the following linear form:

$$\hat{\bar{X}}_{lr} = \bar{x} + b(\bar{Y} - \bar{y}),$$

where b is a regression coefficient of x on y, either estimated from the sample or borrowed from external sources.
• If b is replaced by the sample ratio $r = \bar{x}/\bar{y}$, then we have the ratio estimator, $\hat{\bar{X}}_R = r\bar{Y}$. Note that this is the case of a regression line going through the origin.
• If b = 1, then we have the difference estimator, $\hat{\bar{X}}_d = \bar{x} + (\bar{Y} - \bar{y})$.
• If b = 0, then we have the estimator used in simple random sampling, $\bar{x}$.
• Like the ratio estimate, the regression estimate is generally a biased but consistent estimator:

$$E(\hat{\bar{X}}_{lr}) = E(\bar{x}) + \bar{Y}E(b) - E(b\bar{y}) = \bar{X} - \mathrm{Cov}(b, \bar{y}).$$

• When the covariance is zero (when the joint distribution of x and y is bivariate normal), the regression estimate is unbiased. If a sample plot of x against y appears approximately linear, there should be little risk of major bias in the regression estimate.
• The sampling variance of a regression estimate can be estimated by

$$\widehat{\mathrm{Var}}(\bar{x}_{lr}) = \frac{1 - f}{n} \cdot \frac{\sum_{i=1}^{n} \left[(x_i - \bar{x}) - b(y_i - \bar{y})\right]^2}{n - 2}$$

(the divisor (n-2) is used when b is estimated from the sample; (n-1) is used when b is pre-assigned). For a large sample the following approximation is valid:

$$\widehat{\mathrm{Var}}(\bar{x}_{lr}) \approx \frac{1 - f}{n}\, s_x^2 \left(1 - \hat{\rho}_{xy}^2\right)$$

(see formula 7.17 on page 215).
• When the correlation between x and y is zero, the variance of the regression estimate is the same as the variance under simple random sampling. As long as there is a measurable correlation between x and y, the regression estimate is more precise than the simple expansion type of estimate. The reduction in variance is large when the correlation is high:

Correlation    Variance relative to simple random sampling (1 - ρ²)
0.95           about 1/10
0.5            3/4
0              1 (no reduction)

• In general, the regression estimate is more precise than the ratio estimate (see the summary data on page 217; correct a typo there: the MSE of 82.97 for x' should be 8297).
• The regression estimator can be extended to include more than one auxiliary (y) variable as follows:

$$\hat{\bar{X}}_{lr} = \bar{x} + b_1(\bar{Y}_1 - \bar{y}_1) + b_2(\bar{Y}_2 - \bar{y}_2).$$

• The sampling variance of a difference estimator can be estimated by

$$\widehat{\mathrm{Var}}(\bar{x}_d) = \frac{1 - f}{n} \cdot \frac{\sum_{i=1}^{n} \left[(x_i - \bar{x}) - (y_i - \bar{y})\right]^2}{n - 1},$$

which is similar to the variance estimator for a regression estimate shown above, with b = 1 and divisor (n-1).
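The regression estimator of the mean and its estimated variance can be sketched in Python as follows. The function name, the sample data, the population size, and the population mean of y are all hypothetical values chosen for illustration. Passing b=None estimates the slope of x on y by least squares and uses the divisor n-2; passing a fixed b uses the divisor n-1, so b=1 gives the difference estimator and b=0 gives the simple sample mean.

```python
def regression_estimate(x, y, y_bar_pop, N, b=None):
    """Return (estimate of the mean of x, estimated variance).

    x, y       sample values of the variable of interest and the auxiliary variable
    y_bar_pop  known population mean of y
    N          population size, used for the sampling fraction f = n / N
    b          pre-assigned slope, or None to estimate it from the sample
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    if b is None:
        s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        s_yy = sum((yi - y_bar) ** 2 for yi in y)
        b = s_xy / s_yy  # least-squares slope of x on y
        divisor = n - 2
    else:
        divisor = n - 1
    estimate = x_bar + b * (y_bar_pop - y_bar)
    resid_ss = sum(
        ((xi - x_bar) - b * (yi - y_bar)) ** 2 for xi, yi in zip(x, y)
    )
    f = n / N
    variance = (1 - f) / n * resid_ss / divisor
    return estimate, variance


# Hypothetical sample of n = 6 from a population of N = 60 with known mean of y.
x_sample = [12, 15, 11, 18, 14, 16]
y_sample = [30, 36, 29, 44, 33, 40]
N_POP = 60
Y_BAR = 35.0

est_lr, var_lr = regression_estimate(x_sample, y_sample, Y_BAR, N_POP)
est_d, var_d = regression_estimate(x_sample, y_sample, Y_BAR, N_POP, b=1)
est_srs, var_srs = regression_estimate(x_sample, y_sample, Y_BAR, N_POP, b=0)
print(f"regression: {est_lr:.3f} (var {var_lr:.4f})")
print(f"difference: {est_d:.3f} (var {var_d:.4f})")
print(f"sample mean: {est_srs:.3f} (var {var_srs:.4f})")
```

Supplying b equal to the sample ratio of the x total to the y total reproduces the ratio estimate of the mean, which is one way to check the special cases listed earlier.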