Lecture 16: Statistical Inference with Proportions
Nov. 21, 2006

I. Overview

II. Proportions as univariate statistic
   A. Univariate: p for a sample, π for the population; ranges from 0 to 1.0
      1. Applies to categorical variables (the book uses the term "qualitative") that you have measured as dichotomies
      2. A proportion is a linear transformation of a percentage (and vice versa)
   B. Like other sample statistics, the sample proportion has a sampling distribution whose shape depends on the size of the sample
      1. Sample values of proportions (and percentages) are unstable when n is small
         a. [transparency]
      2. When the sample is small, we can test hypotheses using the binomial distribution, but we're not going to do this in class because it is something we rarely if ever do. You can read about it in the text, pp. 187-93; it won't be on the test
   C. Sampling distribution of the proportion when N > 30
      1. The sampling distribution of a proportion depends on sample size
      2. Mean of the sampling distribution of the proportion: E(p) = π
      3. Standard error of the sample proportion: σ_p = √[π(1-π)/n]
         a. NB: the size of the standard error of the sampling distribution of the sample proportion depends on the degree of variability of Y in the population and on the size of the sample
         b. the larger the n, the smaller the standard error of the sampling distribution of p
   D. Estimating a confidence interval that encompasses π
      1. the proportion is the statistic that polling organizations typically report (e.g., support for marriage between two persons of the same sex)
         a. as I've pointed out more than once, when they report a margin of error, they are actually reporting a confidence interval
      2. Example
         a. p = .32, so 1-p = .68; n = 350
         b. we want to be 99% confident in our estimate; more accurately, 99% of the intervals that we calculate in this way will encompass π
         c. σ_p = √[π(1-π)/n], which we estimate with s_p (or σ̂_p):
            s_p = √[p(1-p)/n] = √[.32*.68/350] = √[.2176/350] = √.000622 = .025
         d. CI: π = p ± Z_α/2(s_p), with 99 percent confidence
         e.
            π = .32 ± 2.58(.025) = .32 ± .064, with 99 percent confidence
         f. .384 ≥ π ≥ .256, with 99 percent confidence
         g. Conclusion: 99 percent of the intervals we construct this way will encompass the population proportion; this one runs from .256 to .384
      3. Example of testing a hypothesis about π
         a. H0: π ≥ .30; Ha: π < .30
         b. set α = .01; with this one-tailed test, the region of rejection is in the left tail, corresponding to Z values < -2.33
         c. Sample data: p = .27, so 1-p = .73; n = 350
         d. Calculate the test statistic: σ_p = √[π(1-π)/n], which we estimate with s_p
            (1) s_p = √[p(1-p)/n] = √[.27*.73/350] = √[.1971/350] = √.000563 = .024
            (2) Z = (estimate of parameter from sample - H0 value of parameter)/standard error of estimate
            (3) Z = (.27 - .30)/.024 = -.03/.024 = -1.25
         e. Decision: Z does not fall in the region of rejection, so we fail to reject the null hypothesis that π ≥ .30

III. Bivariate associations involving percentages and proportions: here we use statistical inference about the relationship between two categorical variables, with the difference-between-proportions test
   A. Recall that when we talked about descriptive statistics for the strength of association between variables, we compared 2 x 2 tables and assessed the strength of association by the difference between two proportions or percentages, using the statistic D
      1. We use statistical inference to assess whether the observed difference between the sample proportions for two groups is large enough that we can conclude that it is not simply a result of chance, but instead reflects a real association between those variables (group membership and outcome behavior) in the population
      2. Although we're talking about statistical inference with respect to differences in proportions (and, next week, differences in means), what we're really doing is asking whether associations involving a categorical independent variable that we observe in sample data actually exist in the population from which those sample data were drawn.
      3.
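The two single-proportion procedures above (the 99% confidence interval and the one-tailed Z test) can be sketched in Python. This is a minimal illustration using only the standard library; the function names are my own, and the numbers and critical values come from the lecture's examples.

```python
import math

def prop_ci(p, n, z):
    """Confidence interval for a population proportion pi, using s_p = sqrt[p(1-p)/n]."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def prop_z_test(p, pi0, n):
    """Z statistic for a test of H0: pi = pi0, with s_p estimating the standard error."""
    se = math.sqrt(p * (1 - p) / n)
    return (p - pi0) / se

# 99% CI example from the notes: p = .32, n = 350, Z = 2.58
lo, hi = prop_ci(.32, 350, 2.58)
print(round(lo, 3), round(hi, 3))    # close to the notes' (.256, .384)

# One-tailed test example: p = .27, H0: pi >= .30, n = 350
z = prop_z_test(.27, .30, 350)
print(round(z, 2))                   # about -1.26 (the notes round s_p to .024, giving -1.25)
```

Since -1.26 is not below the critical value of -2.33, we fail to reject H0, matching the decision in the notes.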
In these cases, the statistical tests assume that the groups are independent. For example, we can examine the association between gender and the disruptive effects of family demands on one's job among members of a random sample, but we cannot use the techniques we're about to discuss if the women and men in the sample are each other's husbands and wives, because those groups are dependent.
         a. some statistics books refer to tests making this assumption as two-sample tests
      4. The steps involved in building a confidence interval around a difference in proportions, or testing a hypothesis about a difference in proportions (i.e., about whether two categorical variables are related), are almost identical to those we just followed for a single proportion.
   B. Sampling distribution for the difference in proportions (i.e., for the association between two dichotomous variables)
      1. the sampling distribution of the difference between two sample proportions is normal when the samples are relatively large
      2. mean (or expected value) of the sampling distribution of the difference between two sample proportions: E(p1 - p2) = π1 - π2
      3. The primary difference between statistical inference involving a single proportion and inference involving a difference in proportions is how we calculate the standard error of the sampling distribution for the difference in proportions
         a. the standard error of the sampling distribution for two sample proportions (p1 - p2) is symbolized σ_p1-p2
         b.
we don't know σ_p1-p2, so we have to estimate it
            (1) the variance of the sampling distribution for two samples equals the sum of the variances of the sampling distributions of the separate estimates
            (2) since the estimated variance of the sampling distribution for p is p(1-p)/n, the estimated variance of the sampling distribution for two samples is p1(1-p1)/n1 + p2(1-p2)/n2
            (3) so s_p1-p2 = √[p1(1-p1)/n1 + p2(1-p2)/n2]
            (4) this is referred to as the pooled estimate; dividing each term by its n weights the pooled estimate by the sizes of the two samples
   C. Estimating confidence intervals for the population parameter of the association between two dichotomous variables, based on sample data on the difference in proportions across groups
      1. CI for π1 - π2 = (p1 - p2) ± Z√[p1(1-p1)/n1 + p2(1-p2)/n2]
      2. in words: the CI for the difference between the population proportions runs from the difference between the sample proportions minus Z standard errors to the difference between the sample proportions plus Z standard errors
      3. Example: p1 = .462 and n1 = 521; p2 = .499 and n2 = 637; we want to be 95% confident that the interval we construct contains the population difference between π1 and π2, so our Z score will be 1.96
         a. estimate σ_p1-p2:
            s_p1-p2 = √[.462*.538/521 + .499*.501/637] = √[.249/521 + .250/637] = √[.00048 + .00039] = √.00087 = .0295
         b. construct the confidence interval:
            π2 - π1 = (.499 - .462) ± 1.96(.0295) = .037 ± .058, with 95 percent confidence
            .095 ≥ π2 - π1 ≥ -.021
         c. Conclusion: 95% of the intervals constructed this way will encompass the population difference in proportions; this one runs from -.021 to .095
   D. Testing a hypothesis about the difference between the proportions for two groups: does the observed sample difference reflect a real difference between the two groups in the population?
      1. In other words, can we conclude from sample data that two dichotomous variables are associated in the population?
      2.
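The standard-error and confidence-interval steps above can be sketched in Python. A minimal illustration with the standard library only; the function names are my own, and the inputs are the lecture's example values.

```python
import math

def diff_se(p1, n1, p2, n2):
    """Estimated standard error of p1 - p2: the square root of the sum of the
    two estimated sampling variances, p(1-p)/n for each sample."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def diff_ci(p1, n1, p2, n2, z):
    """CI for pi1 - pi2: (p1 - p2) plus or minus Z standard errors."""
    d = p1 - p2
    se = diff_se(p1, n1, p2, n2)
    return d - z * se, d + z * se

# Example from the notes: p = .499 (n = 637) vs. p = .462 (n = 521), 95% -> Z = 1.96
lo, hi = diff_ci(.499, 637, .462, 521, 1.96)
print(round(lo, 3), round(hi, 3))    # close to the notes' (-.021, .095)
```

Because the interval includes 0, these sample data alone would not let us rule out no difference between the two population proportions at the 95% level.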
How would we state "no association between X and Y" as a hypothesis?
         a. H0: π1 = π2, which we can restate as π1 - π2 = 0 (i.e., X and Y are statistically independent)
            Ha: π1 ≠ π2, which we can restate as π1 - π2 ≠ 0
         b. set α = .05; with a two-tailed test, we'll reject H0 if Z > 1.96 or Z < -1.96 [draw]
         c. Under H0, the sampling distribution of p1 - p2 has a mean (expected value) of 0
         d. collect data and calculate the test statistic:

            Proportion favoring affirmative action, by faculty or student status (n = 1158)
                        Faculty   Students
            Pro AA        .50       .43
            Total n       521       637

            so pf - ps = .07

            the pooled estimate of the standard error of the sampling distribution of the difference in proportions is
            s_pf-ps = √[pf(1-pf)/nf + ps(1-ps)/ns] = √[.25/521 + .245/637] = √[.00048 + .00039] = √.00087 = .0295

            Z = [(pf - ps) - (πf - πs)]/s_pf-ps, but according to H0, πf - πs = 0, so
            Z = (pf - ps)/s_pf-ps = .07/.0295 = 2.37
         e. Decision: Z falls in the region of rejection in the right tail, so we reject the H0 that students and faculty members are equally likely to approve of affirmative action, with a 5% chance of falsely rejecting a true null hypothesis
            (1) Indeed, if πf = πs, we could get a difference in proportions this large by chance alone only about 2% of the time (two-tailed), which leaves us with the alternative Ha: unequal proportions of law students and faculty support AA in the population
         f. if we had samples with fewer than 30 people, we would have used a t-test
      3. the text discusses statistical inference when your samples are not independent (e.g., before-after studies); if you find yourself in that situation, consult a text.
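The faculty-versus-students Z test above can be sketched in Python. A minimal illustration with the standard library only; the function name is my own. Note that, following the notes, the standard error is estimated from the two separate sample proportions rather than from a single combined proportion.

```python
import math

def diff_z_test(p1, n1, p2, n2):
    """Z statistic for H0: pi1 - pi2 = 0, using the estimated standard error
    sqrt[p1(1-p1)/n1 + p2(1-p2)/n2]."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

# Faculty: p = .50, n = 521; students: p = .43, n = 637
z = diff_z_test(.50, 521, .43, 637)
print(round(z, 2))    # about 2.38; the notes' intermediate rounding gives 2.37
```

Since 2.38 > 1.96, Z falls in the right-tail region of rejection, matching the decision in the notes.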