# Lecture 16: Statistical Inference with Proportions by hcj

```
Lecture 16: Statistical Inference with Proportions                Nov. 21, 2006
I. Overview
II. Proportions as univariate statistic
A. Univariate, p for a sample, π for the population; ranges from 0 to 1.0
1. Applies for categorical variables (book uses term “qualitative”) that you have
measured as dichotomies
2. Proportion is linear transformation of percentage (and vice versa)
B. Like other sample statistics, the sample proportion has a sampling distribution whose
shape depends on the size of the sample
1. The sample values of proportions (and percentages) are unstable when n is small
a. transparency
2. When the sample is small, we can test hypotheses using the binomial distribution,
but we’re not going to do this in class because it is something we rarely if ever do. You can
read about it in the text, pp. 187-93; it won’t be on the test
C. Sampling distribution of the proportion when n > 30
1. Sampling distribution of a proportion depends on sample size
2. Mean of sampling distribution of proportion: π
3. Standard error of sample proportion: σp = √[π(1-π)/n]
a. NB: the size of the standard error of the sampling distribution for sample
proportion depends on the degree of variability of Y in the population and
the size of the sample
b. the larger the n, the smaller the standard error of the sampling distribution of p
D. Estimating a confidence interval that encompasses π
1. the proportion is the statistic that polling organizations typically report (e.g., support
for marriage between two persons of the same sex)
a. as I’ve pointed out more than once, when they report a margin of error, they
are actually reporting confidence intervals
2. Example
a. p = .32, 1-p = .68; n = 350
b. we want to be 99% confident in our estimate; more accurately, 99% of the
intervals that we calculate in this way will encompass π
c. σp = √[π(1-π)/n], which we estimate with sp (or σ-hat p)
sp = √[p(1-p)/n] = √(.32*.68/350) = √(.2176/350) = √.0006217 = .025
d. Pr [π = p ± Zα/2 (sp)] = 99 percent confident
e. Pr [π = .32 ± 2.58(.025)] = [.32 ± .064] = 99 percent confident
f. Pr [.384 ≥ π ≥ .256] = 99 percent confident
g. Conclusion: we are 99 percent confident that the population proportion falls between
.256 and .384; that is, 99 percent of the intervals we construct this way will encompass π
3. Example of testing a hypothesis about π
1. H0: π ≥ .30, Ha: π < .30
2. α = .01, with .005 under the curve in each tail, corresponding to Z values beyond ±2.58
3. Sample data: p = .27, so 1-p = .73; n = 350
4. Calculate test statistic: σp = √[π(1-π)/n], which we estimate with sp
a. sp = √[p(1-p)/n] = √[.27*.73/350] = √[.197/350] = √.000563 = .024
b. Z = (estimate of parameter from sample – H0 value of parameter)/std error
of estimate
c. Z = (.27 - .30)/.024 = -.03/.024 = -1.25
5. Decision: Z = -1.25 does not fall beyond -2.58, so we fail to reject the null hypothesis that π ≥ .30
III. Bivariate associations involving percentages, proportions: Here we use statistical
inference involving relationship between two categorical variables, using difference between
proportions test
A. Recall that when we talked about descriptive statistics about the strength of association of
the relation between variables we compared 2 x 2 tables, and assessed the strength of association
by the difference in two proportions or percentages, with the statistic D
1. We use statistical inference to assess the likelihood that the observed difference
between sample proportions for two groups in sample data is large enough that we can
conclude that it is not simply a result of chance, but instead reflects a real association between
those variables (group membership and outcome behavior) in the population
2. Although we talk about statistical inference with respect to differences in
proportions (and next week about differences in means), what we’re really doing is asking
whether associations involving a categorical independent variable that we observe in
sample data actually exist in the population from which those sample data were drawn.
3. In these cases, the statistical tests assume that the groups are independent. For
example, we can examine the association between gender and the disruptive effects of
family demands on one’s job among members of a random sample, but we cannot use the
techniques we’re about to discuss if the women and men in the sample are each other’s
husbands and wives because these groups are dependent.
a. some statistics books refer to tests making this assumption as two-sample
tests
4. The steps involved in building a confidence interval around a difference in
proportions or testing a hypothesis about a difference in proportions (i.e., are two
categorical variables related?) are almost identical to those we just did
for a single proportion.
B. Sampling distribution for the differences in proportions (i.e., association between two
dichotomous variables)
1. the sampling distribution of the difference between two sample proportions is normal
when the samples are relatively large
2. mean (or expected value) of the sampling distribution of the difference between two
sample proportions = μp1-p2 [or E(p1 - p2)] = π1 - π2
3. The primary difference between statistical inference involving a single proportion and
inference involving the difference in proportions is how we calculate the standard error of
the sampling distribution for the difference in proportions
a. standard error of sampling distribution for two sample proportions (p1 – p2) is
symbolized as σp1-p2
b. we don’t know σp1-p2 so we have to estimate it
(1) the variance of the sampling distribution for two samples equals the sum of the
variances of the sampling distributions of the separate estimates
(2) since the estimated variance of the sampling distribution for p = [p(1-p)]/n, the
estimated variance of the sampling distribution for two samples = p1(1-p1)/n1 +
p2(1-p2)/n2
(3) so our estimate of σp1-p2 = √[p1(1-p1)/n1 + p2(1-p2)/n2]
(4) referred to as pooled estimate, and dividing each term by n weights the
pooled estimate by the sizes of the two samples
C. Estimating confidence intervals for the value of the parameter for the association
between two dichotomous variables based on sample data on the differences across
groups in proportions
1. CI for π1 – π2 = p1 – p2 ± Z√[p1(1-p1)/n1 + p2(1-p2)/n2]
2. in words, the difference between proportions in the population falls between the
difference between the sample proportions minus Z standard errors of the difference and
the difference between the sample proportions plus Z standard errors of the difference
3. Example: p1 = .462 and n1 = 521; p2 = .499 and n2 = 637, and we want to be 95%
confident that the interval we construct contains the population difference between π1 and
π2, so our Z score will be 1.96
a. estimate σp1-p2
σp1-p2 = √(.499*.501/637 + .462*.538/521) = √(.250/637 + .249/521)
σp1-p2 = √(.00039 + .00048) = √.00087 = .0295
b. construct confidence interval: π2 - π1 = (.499 - .462) ± 1.96(.0295) = 95%
confidence
π2 - π1 = .037 ± 1.96*.0295 = .037 ± .058, so .095 ≥ π2 - π1 ≥ -.021
c. Conclusion: we are 95% confident that the difference in proportions in the population
lies between -.021 and .095; 95% of the intervals constructed this way will encompass it
D. Testing a hypothesis about the difference between the proportions for two groups; does
the observed sample difference reflect a real difference between the two groups in the
population?
1. In other words, can we conclude from sample data that two dichotomous variables are
associated in the population?
2. How would we state no association between X and Y in terms of a hypothesis?
a. H0: π1 = π2, which we can restate as π1 - π2 = 0 (i.e., X and Y are statistically
independent)
Ha: π1 ≠ π2, which we can restate as π1 - π2 ≠ 0
b. set α = .05; with a two-tailed test, we’ll reject H0 if Z > 1.96 or < -1.96 [draw]
c. Under H0, the sampling distribution of p1 - p2 has a mean (expected value) of 0
d. collect data and calculate test statistic
Proportion favoring affirmative action by faculty or student status (n = 1158)
                Faculty    Students
Pro AA          .50        .43
Total (n)       521        637
so p1 – p2 = pf – ps = .07
The pooled estimate of standard error of sampling distribution of difference in
proportions,
σpf-ps = √[p1(1-p1)/n1 + p2(1-p2)/n2]
σpf-ps = √(.25/521 + .245/637) = √(.000480 + .000385) = √.000865 = .0294
Z = [(pf – ps) – (πf – πs)]/σpf-ps, but according to H0, πf - πs = 0, so Z = (pf – ps)/σpf-ps
Z = .07/.0294 = 2.38
e. Decision: Z falls into the region of rejection in the right tail, so we reject the H0 that students
and faculty members are equally likely to approve of affirmative action, with a 5% chance
of falsely rejecting a true null hypothesis
(1) Indeed, with a two-tailed test we could get a difference in proportions this large by
chance alone less than 2% of the time if πf = πs, which leaves us with the alternative Ha that
unequal proportions of law students and faculty support AA in the pop.
f. if we had samples with fewer than 30 people, we would have used a t-test
3. text discusses statistical inference when your samples are not independent (e.g., before-
after studies); if you find yourself in this situation, then you need to consult a text.

```
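The worked examples in the outline can be checked with a short script. This is a minimal sketch, not part of the original lecture: the function names are hypothetical, it assumes the large-sample normal approximation the notes rely on, and it follows the notes in estimating standard errors from the sample proportions. Because it does not round intermediate values, its results may differ from the hand calculations in the last digit.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def se_proportion(p, n):
    """Estimated standard error of one sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1.0 - p) / n)


def ci_proportion(p, n, z):
    """Confidence interval p +/- z * s_p for a single proportion."""
    se = se_proportion(p, n)
    return p - z * se, p + z * se


def z_proportion(p, pi0, n):
    """Z statistic for H0 value pi0, using the sample-based s_p as in the notes."""
    return (p - pi0) / se_proportion(p, n)


def se_difference(p1, n1, p2, n2):
    """Estimated standard error of p1 - p2 for two independent groups."""
    return math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)


def ci_difference(p1, n1, p2, n2, z):
    """Confidence interval (p1 - p2) +/- z * standard error of the difference."""
    diff = p1 - p2
    se = se_difference(p1, n1, p2, n2)
    return diff - z * se, diff + z * se


def z_difference(p1, n1, p2, n2):
    """Z statistic for H0: pi1 - pi2 = 0."""
    return (p1 - p2) / se_difference(p1, n1, p2, n2)


def two_tailed_p(z):
    """Probability of a Z at least this far from 0 in either direction."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))


# Single proportion: p = .32, n = 350, 99% confidence (Z = 2.58)
print("99% CI for pi:", ci_proportion(0.32, 350, 2.58))

# Single-proportion test: p = .27 against H0 value .30, n = 350
print("Z, one proportion:", z_proportion(0.27, 0.30, 350))

# Difference in proportions: (.499, n = 637) minus (.462, n = 521), 95% confidence
print("95% CI for difference:", ci_difference(0.499, 637, 0.462, 521, 1.96))

# Faculty (.50, n = 521) versus students (.43, n = 637)
z = z_difference(0.50, 521, 0.43, 637)
print("Z, difference:", z, "two-tailed p:", two_tailed_p(z))
```

Each function mirrors one formula from the outline, so the printed values can be compared line by line with the hand calculations.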