# 9 Master by cJaxKih

```									9 Master

Presentation 8.1
Example
• A survey of 436 workers showed that 192 of
them said that it was seriously unethical to
monitor employee e-mail. When 121 senior-
level bosses were surveyed, 40 said that it was
seriously unethical to monitor employee e-mail.
• Let $p_W$ and $p_B$ be the population proportions of
workers and bosses that feel it's unethical to
monitor e-mail.
We might want to obtain a CI for $p_W - p_B$.
We would first need an estimate of this
difference. It should seem reasonable that an
estimate be
$$\hat{p}_W - \hat{p}_B = 192/436 - 40/121 = 0.1097$$
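The standard error of $\hat{p}_W - \hat{p}_B$ is estimated by

$$\mathrm{s.e.}(\hat{p}_W - \hat{p}_B) = \sqrt{\frac{\hat{p}_W(1-\hat{p}_W)}{n_W} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}$$

and just thinking intuitively, this means a CI for $p_W - p_B$ is

$$\hat{p}_W - \hat{p}_B \pm z^* \, \mathrm{s.e.}(\hat{p}_W - \hat{p}_B)$$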
• To compute a CI for $p_W - p_B$ we need $\hat{p}_W$ and $\hat{p}_B$,
which are 192/436 = 0.4403 and
40/121 = 0.3305 respectively. This gives a
standard error of

$$\sqrt{\frac{0.4403(1-0.4403)}{436} + \frac{0.3305(1-0.3305)}{121}} = 0.0489$$
• Now, if we want to obtain an 80% CI for
$p_W - p_B$ we have

$$0.1097 \pm 1.282(0.0489) = (0.047, 0.172)$$
• Suppose we want to test the claim that a
larger percentage of workers feel that it's
unethical to monitor email. That is

$$H_1: p_W > p_B \iff H_1: p_W - p_B > 0$$
Again, it should seem intuitive that the
test statistic will be of the form
$$\frac{\hat{p}_W - \hat{p}_B}{\sqrt{\frac{p_W(1-p_W)}{n_W} + \frac{p_B(1-p_B)}{n_B}}}$$
but under H0, pW and pB are equal. So, in the
denominator, we can simply replace each of them with a common p.
$$\frac{\hat{p}_W - \hat{p}_B}{\sqrt{\frac{p(1-p)}{n_W} + \frac{p(1-p)}{n_B}}}$$

An estimate for p is (192+40)/(436+121)
= 0.4165. This gives the test statistic as
2.1656.
Similar to the one sample tests, we can make a
decision by
• comparing the test statistic to the critical value.
If α = 0.05, then the critical value is 1.645. Since
TS > CV, reject H0.
• or we can compare the p-value to α. The p-
value is found as P(Z > 2.1656) =0.015. Since
this value is less than α, we reject H0.
Another example
A major court case on the health effects of drinking contaminated
water took place in the town of Woburn, Massachusetts. A town well
was contaminated with industrial chemicals. During the period when
the well was open, there were 16 birth defects out of 414 births. When this
particular well was shut off and water was supplied from other
wells, 3 birth defects out of 228 births were reported. The plaintiffs suing the
firm responsible for contaminating the well claim that the rate of birth
defects is higher when the contaminated well was in use. Denote the
contaminated well as 'C' and the other uncontaminated wells as 'U', and
let p be the proportion of birth defects. What exactly are the plaintiffs
wanting to test?
• Obtain a 98% confidence interval for the
difference in the rate of birth defects for when
the well was on compared to when it was shut
off.
• What is the test statistic?
• What's the critical value if we use α=0.01?
• What's the conclusion? Should the plaintiffs be
favored here?
Confidence Interval for p

Reasonable Range of Values for
True Population Proportion p
Confidence Interval for p
• The goal is to take a sample and be able
to make intelligent guesses about the true
value of the proportion p in the population.
• A valuable tool is the confidence interval:
the range of values for p in the population
that could reasonably have produced the
sample p-hat we observed.
CI Formula
• A confidence interval for the population p is given
by:

$$\hat{p} \pm Z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
CI Formula
• A 95 percent confidence interval for the
population p is given by:

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Example
• Suppose we cure p-hat = .9 of n=1000
heartworm infected dogs. What is the
reasonable range for the cure rate p of our new
treatment? Do a 95% CI for p.
$$.9 \pm 1.96\sqrt{\frac{.9(.1)}{1000}}$$
$$.9 \pm 1.96(.009487)$$
$$.9 \pm .0186$$
$$(.8814, .9186)$$
Example
• Reasonable range for p (.88, .92) is same
range argued in previous section on
sampling distributions for p-hat.
• The only reasonable values for p are those
that could produce p-hats only a couple of
standard deviations removed from the
truth.
Reeses Pieces Example
• What is the proportion of orange candies,
p?
• To study this unknown, but very important
value p, we will construct confidence
intervals for p from samples of candies.
• Each bag represents a random sample of
size n from the population of these
candies.
• From each bag your group should: find n,
Reeses Pieces Example
• On whiteboard place your information in tabular
form:

Group   N   P-hat   CI
1
2
3
4
5
6
Reeses Pieces Example
• A histogram of p-hat values should result
in a representation of the sampling
distribution of p-hat.
• The center of this histogram should be p.
What do you think p is?
Reeses Pieces Example
• From the CIs, what do you think the true p
is?
• Is an evenly distributed color distribution,
p=1/3, a reasonable hypothesis based on
our data? Why or why not?
• Pay attention to the written conclusion I
provide on the board !
Vietnam Veterans Divorce Rate
• N=2101 veterans were interviewed, and p-hat=777/2101
= .3698 had been divorced at least once.
• What is a reasonable range of values for the true divorce
proportion p?

$$.3698 \pm 1.96(.01053)$$
$$.3698 \pm .02064$$
$$(.349, .390)$$
Vietnam Vets Divorces
• Do you think true divorce proportion is
greater than .5?
• Ans: No. The reasonable range of values
for the true p is (.349, .390). This range is
entirely below p=.5, so we have strong
evidence that the true divorce proportion is
BELOW .5 not above it.
Vietnam Vets Divorces
• Do you think the true divorce proportion
could be .37?
• Ans: Yes, a proportion like .37 is a
reasonable value for the true p according
to our range of reasonable values, so the
truth could reasonably be .37.
Domestic Violence
• For those women who had experienced
some abuse before age 18, the sample
proportion abused in the past 12 months was p-hat =
236/569 = .4147
• CI for p: (.374, .455).
• Suppose the true proportion currently
abused for those not abused before age 18
was .11.
• Is there evidence the true population
proportion in our study is greater than .11?
Ask Marilyn – Let's Make a Deal
• In 1991 a reader wrote to Marilyn vos
Savant (highest documented IQ) and
asked whether a player should switch
doors when playing Let's Make a Deal.
• There are 3 doors, two with goats and one
with a car. You pick a door. The host,
Monty Hall shows you a door you have not
picked and there is a goat behind it. You
are then asked if you wish to switch doors.
Should you switch?
Let's Make a Deal
• Marilyn said yes, you should switch doors.
• There was a storm of angry letters from bad
professors:
• “you are the goat”, “take my intro class”, “it is
clearly 50-50 with no advantage to switching”.
• The next week stats professors from elite
universities like Harvard, Stanford, UMM
wrote in and said that Marilyn was correct,
but her reasoning was wrong.
Let's Make a Deal
• Let's play the game on the computer
simulation, be sure to play the strategy of
switching doors after a goat is shown to
you. Keep track of how many times you
win divided by the number of plays.
Compute p-hat.
• Who is right? Marilyn or the bad
professors?
• Do a 95% CI for p, the proportion of
Level of Confidence
• A CI for p includes a statement of a
confidence level, usually 95%.
• You should know how to compute
confidence intervals for any level of
confidence, but particularly for 80%, 90%,
95%, 98%, 99%.
• The formula is the same for each, but the
Z multiplier changes.
Z Multiplier
• For any confidence level, the Z multiplier is
obtained by drawing a standard normal
curve and then placing symmetric
boundaries around the mean zero.
• For a 95% interval these boundaries
should contain 95% of the observations
within these bounds. That means there is
2.5% of the observations outside these
bounds in each tail to add to the remaining
5%.
Finding Z*
Z-Multiplier
• This means that the upper boundary is at
the 97.5 percentile, and the lower
boundary is at the 2.5 percentile.
• Use your normal table and look up in the
middle for .975 (97.5%), go to the edges to
observe that the z-value corresponding to
this point is 1.96. That is why we have
used 1.96 for the 95% CI multiplier.
Other Z-Multipliers
• You should be able to verify that the
correct multipliers for the other confidence
levels (80%, 90%, 98%, 99%) are: 1.28, 1.645, 2.33, 2.576.
• Do you know how these were obtained?
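A sketch of where the multipliers come from, assuming the notation $\Phi$ for the standard normal cumulative distribution (the table you look up; this symbol is not used on the slides). A confidence level $C$ leaves $(1-C)/2$ in each tail, so the upper boundary sits at the $(1+C)/2$ percentile:
$$Z^* = \Phi^{-1}\left(\frac{1+C}{2}\right)$$
$$80\%: \Phi^{-1}(.90) \approx 1.28, \quad 90\%: \Phi^{-1}(.95) \approx 1.645, \quad 98\%: \Phi^{-1}(.99) \approx 2.33, \quad 99\%: \Phi^{-1}(.995) \approx 2.576$$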
What Does 95% Confidence Mean
Anyway?
• A 95% CI means that the method used to
construct the interval will produce intervals
containing the true p in about 95% of the
intervals constructed.
• This means that if the 95% CI method was
used in 100 samples, we should expect
that about 95 of the intervals will contain
the true p, and about 5 intervals should
miss the true p.
Diagram of Confidence
[Diagram: 95% of intervals contain the true p, but 5% miss the truth.]
CI Meaning
• We never know if our CI has contained the
true p or not, but we know the method we
used has the property that it catches the
truth 90% of the time (for a 90% CI), so it
probably has done well in our study, or at
least is not far from the truth.
Butterfly Net
• A confidence interval is like a butterfly net
for catching the true p within its
boundaries.
• Take a swing at the butterfly (p) with your
net (CI), you have a known reliability of
catching the butterfly (p), say 90%, but
you will never know if your net caught the
butterfly or not, just that it is typically a
good method for catching butterflies, and
so it was probably good for you too!
Percent Confidence
• The percent confidence refers to the
reliability of the CI method to produce
intervals that contain the true p.
• Why not do a 100% confidence interval?
Then we would be completely sure that
the interval has contained the true p.
100 % CI for p
• A 100% CI for p is (0, 1), this interval is
sure to contain the true p.
• However, this is not very useful. There is a trade-off between the
percent confidence and the usefulness of the
interval to simplify the world.
• We usually choose 90, 95, or 99 percent
confidence levels.
CI Cautions !
• Don't suggest that the parameter varies: There is
a 95% chance the true proportion is between .37
and .42. YUCK!! It sounds like the true proportion
is wandering around like an intoxicated (blank) fan.
(Fill in your most hated sports team in the blank).
The true p is fixed, not random.

• Don't claim that other samples will agree with
yours: 95% of samples will have proportions
supporting proposal X between .37 and .42.
NOPE!! This range is not about sample proportions
as this statement implies.
CI Cautions ! (Continued)
• Don't be certain about the parameter: The cure
rate is between 37 and 42 percent. UGG !! This
makes it seem like the true p could never be
outside this range. We are not sure of this, just
sorta-kinda-sure.
• Don't forget: It's the parameter (not the statistic):
Never, ever say that we are 95% sure the
sample proportion is between .37 and .42. DUH
! There is NO uncertainty in this, it HAS to be
true.
• Don't claim to know too much.
• Do take responsibility (for the uncertainty).
CI Cautions ! (Continued)
• Don't claim to know too much: “I'm 95%
confident that between 37 and 42 percent of
people in the universe are lunkheads.” Well
your population really wasn't the whole universe,
just Podunk State U.
• Do take responsibility (for the uncertainty): You
are the one who is uncertain, not the parameter
p. You must accept that only 95% of CIs will
contain the true value of p.
Usefulness of CIs
• There is a trade-off between reliability
(confidence) and the width of the interval.
• Increasing confidence means the interval
width becomes greater (wider). By
increasing the sample size, n, the interval
becomes narrower.
• How big should the sample size be to get
population p?
CI Behavior
Margin of Error
• The margin of error (m) of a confidence
interval is the plus and minus part of the
confidence interval, m=Z se(p-hat)
• P-hat +/- Z se(p-hat)
• P-hat +/- m
• A confidence interval that has a margin of
error of plus or minus 3 percentage points
means that the margin of error m=.03.
Margin of Error
• From the formula m=Z se (p-hat), you can
see that the margin of error depends on
the confidence level (Z multiplier) and
through the sample size n inside the
expression for se(p-hat).
• A common problem in statistics is to figure
out what sample size will be needed to
obtain the desired accuracy (margin of
error m).
Sample Size Formula
• The sample size n needed to get desired margin of
error m is given by,

$$n = \left(\frac{Z^*}{m}\right)^2 p^*(1-p^*)$$
Sample Size
• The margin of error desired m, is usually
provided in the problem. The value Z* is
determined by the level of confidence that
is desired. If no level is given, just assume
95% confidence.
• The p* value is a bit of a chicken-and-egg
problem: it is a guess at the value of the true p.
Sample Size
• Mmmm, let's see, we are trying to do a
study to estimate p, but we need to know p
(p*) to compute the needed sample size.
This seems impossible!
• Quit whining and do the best you can.
Give the best or most current state of
knowledge about p as p*. Usually there is
some information about what p might be.
If you know absolutely nothing, then use
p*=.5.
Why use p*=.5?
• Here is a graph of p*(1-p*) for values of p*:
[Graph: p*(1-p*) on the vertical axis against p* from 0 to 1 on the horizontal axis; the curve peaks at .25 when p*=.5.]
Why use p*=.5
• The graph shows that p*(1-p*) will be
largest when p*=.5. This means the
sample size will be largest when p*=.5.
This means that the sample size will be at
least as big as actually needed.
• This is called being conservative because
you are using more data than would
actually be needed to achieve the margin
of error desired.
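The claim can also be checked without the graph. A short calculus sketch, assuming we may treat p* as a continuous variable between 0 and 1:
$$\frac{d}{dp^*}\left[p^*(1-p^*)\right] = 1 - 2p^* = 0 \implies p^* = .5, \qquad p^*(1-p^*) \le .5(1-.5) = .25$$
So for any p*, the required sample size satisfies
$$n = \left(\frac{Z^*}{m}\right)^2 p^*(1-p^*) \le .25\left(\frac{Z^*}{m}\right)^2$$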
Sample Size Example
orgy at my house. I watched n=30 NBA
games from my big blue chair, drank
beverages of God, ate lots of popcorn. I
found that X=18 games were won by the
home team. This means p-hat = 18/30 =
.6.
• What is a 95% CI for true home court win
proportion p?
NBA Games Example
$$.6 \pm 1.96\sqrt{\frac{.6(.4)}{30}}$$
$$.6 \pm .1753$$
$$(.4247, .7753)$$
NBA Games Example
• Plausible range of values for true home
court winning proportion was (.42, .78).
This is not very helpful, I knew this even
before the first popcorn kernel popped.
• Why was the procedure not more helpful?
• Problem was the margin of error. It was
huge ! It was about m=.17, .18. The
sample size was too small to make our
inference more precise. We need a bigger
sample size. How big?
NBA Sample Size
• Suppose we wish to obtain a margin of
error of m=.02 in a 95% CI for p. What
sample size is needed?
• n=(1.96/.02)^2 .6(1-.6) = 2304.96
• Round up to n=2305 games. Oh Joy!
What a fiesta !
• Note that our best knowledge was the
small study done at my house, there p-hat
=.6 so it is our best knowledge of the true
p, so p*=.6.
Vietnam Vets Example
• If you go back a few slides you will find
that in the Vietnam Vets divorce rate
example, the margin of error was about
.02. Notice this is a small value for m, and
it was obtained because the sample size
was huge for that problem. Sample size
was over 2000 subjects!
Relationship between m and n
[Graph: margin of error m (vertical axis) against sample size n (horizontal axis).]
Graph Computation
•   When p*=.5, m=.05, n=385
•   When m=.03, n=1068
•   When m=.02, n=2401
•   etc
Relationship between m and n
• Notice that as the sample size increases
initially, there is a big drop in the margin of
error. It drops substantially early on.
• However, for larger sample sizes there is
almost no additional reduction in margin of
error for increasing the sample size.
• Most big surveys are below 2000 – 3000
subjects. Do you see why?
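A sketch of the diminishing returns, assuming p*=.5 and 95% confidence as in the graph computation. Solving the sample size formula for m gives
$$m = Z^*\sqrt{\frac{p^*(1-p^*)}{n}} = \frac{1.96(.5)}{\sqrt{n}} = \frac{.98}{\sqrt{n}}$$
so m shrinks only like $1/\sqrt{n}$, and cutting m in half requires four times the sample size. For instance, n=1000 gives m of about .031, while n=4000 gives m of about .0155.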
Poor, Ignorant Phil !
Right Eye Dominance
• Hold a piece of paper with small hole in
middle out in front of you with both hands.
Focus on an object across the room to be
visible in the hole with both eyes open.
• Now shut one eye, if the object is still
visible, the open eye is the dominant eye.
• Do a 95% CI for the proportion of the
population that is right eye dominant, p.
A Recent Poll (Gallup)
Poll Details
• Certainly, one of the challenges for the winner of
this year's election will be to bring a divided
nation together again.
Survey Methods
• These results are based on telephone interviews
with a randomly selected national sample of
1,013 adults, aged 18 and older, conducted Oct.
14-16. For results based on this sample, one can
say with 95% confidence that the maximum error
attributable to sampling and other random effects
is ±3 percentage points. In addition to sampling
error, question wording and practical difficulties
in conducting surveys can introduce error or bias
into the findings of public opinion polls.
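A rough check of the ±3 percentage points, assuming the quoted figure corresponds to the conservative p*=.5 margin of error from the earlier slides (the poll write-up does not state how it was computed):
$$m = 1.96\sqrt{\frac{.5(1-.5)}{1013}} \approx 1.96(.0157) \approx .031$$
which is about 3 percentage points.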
Hypothesis Tests for p

Population Proportion p
Hypothesis Test for p
• You have seen previously the method for
producing a confidence interval or
reasonable range for parameter p.
• Hypothesis tests can also be performed
with one sample proportion to learn about
the population proportion of interest.
Hypothesis Test Formula
$$H_0: p = p_0$$
$$H_a: p > p_0, \quad p < p_0, \quad p \neq p_0$$
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$
$$P\text{-value} = P(Z > Z_{obs}), \quad P(Z < Z_{obs}), \quad 2 \cdot P(Z > |Z_{obs}|)$$

$$H_0: p = .5$$
$$H_a: p > .5$$
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{.689 - .5}{\sqrt{\frac{.5(1-.5)}{45}}} = 2.54$$
$$P\text{-value} = P(Z > 2.54) = .0055$$
• Data is unlikely under Ho, data is inconsistent
with Ho.
• We have evidence to doubt Ho.
• We have evidence to support Ha.
• We have evidence the proportion of wins by
switching doors, p, is greater than .5. We have
evidence that Marilyn is right, we should switch
doors.
Reeses Pieces Example
• What is the proportion of orange candies,
p?
• I believe our data were something like p-
hat=.52 for n=60 candies. Do appropriate
hypothesis test.
Right Eye Dominance
• Hold a piece of paper with small hole in
middle out in front of you with both hands.
Focus on an object across the room to be
visible in the hole with both eyes open.
• Now shut one eye, if the object is still
visible, the open eye is the dominant eye.
• Do a hypothesis test that the proportion of
the population that is right eye dominant, p
is not equal to .5.
Spinning Pennies
• We wish to test the hypothesis that the
proportion of spins that will turn heads is
different than .5.
• Some students perform an experiment and
find that 17 heads were obtained from 40
spins. This means p-hat=17/40 = .425.
Spinning Pennies
$$H_0: p = .5$$
$$H_a: p \neq .5$$
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{.425 - .5}{\sqrt{\frac{.5(1-.5)}{40}}} = -.95$$
$$P\text{-value} = 2 \cdot P(Z < -.95) = 2 \cdot P(Z > .95) = 2(.171) = .342$$
Spinning Pennies Conclusion
•   The data is consistent with Ho.
•   There is no evidence to doubt the Ho.
•   There is no evidence to support the Ha.
•   There is no evidence to suggest the
proportion of spins that are heads is
anything other than .5.
Spinning Pennies
• Let's do the experiment ourselves.
Inference for Two Population
Proportions
population proportions.
Data Situation
• We now have two populations, and we
wish to compare the proportions of these
populations.
• Population 1 Data: n_1 and p-hat_1.
• Population 2 Data: n_2 and p-hat_2.
Data Situation
Data:
$$\text{Sample 1}: \; n_1, \quad \hat{p}_1 = \frac{X_1}{n_1}$$
$$\text{Sample 2}: \; n_2, \quad \hat{p}_2 = \frac{X_2}{n_2}$$
Hypothesis Test Formula
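$$H_0: p_1 - p_2 = 0$$
$$H_a: p_1 - p_2 > 0, \quad p_1 - p_2 < 0, \quad p_1 - p_2 \neq 0$$
$$Z = \frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \quad \text{where} \quad \hat{p} = \frac{X_1 + X_2}{n_1 + n_2}$$
$$P\text{-value} = P(Z > Z_{obs}), \quad P(Z < Z_{obs}), \quad 2 \cdot P(Z > |Z_{obs}|)$$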
Hypothesis Test Formula
• Notice the p-hat with no subscript in the
denominator of the Z statistic. This is
called the pooled proportion.
• Under the Ho we hypothesize that both
populations have the same proportion, so
the natural thing to do is use all the data to
estimate the common proportion. Simply
add all events and divide by the total
sample size.
Red Dye #2 Example
• Two samples of lab animals were studied. One
group of 44 animals was given a typical animal diet.
Four developed tumors.
Thus, p-hat = 4/44 = .091
• In a group given red dye # 2, there were
14 animals developing tumors out of 44.
Thus p-hat = 14/44 = .318.
Red Dye Hypothesis Test
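$$H_0: p_R - p_C = 0$$
$$H_a: p_R - p_C > 0$$
$$Z = \frac{\hat{p}_R - \hat{p}_C - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_R} + \frac{1}{n_C}\right)}} = \frac{.318 - .091}{\sqrt{.205(1-.205)\left(\frac{1}{44} + \frac{1}{44}\right)}}$$
$$Z = 2.64$$
$$\hat{p} = \frac{X_R + X_C}{n_R + n_C} = \frac{14 + 4}{44 + 44} = .205$$
$$P\text{-value} = P(Z > 2.64) = .0041$$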
Red Dye #2 Conclusion
• The data are unusual if the Ho is true.
The data are inconsistent with the Ho.
• There is evidence to doubt the Ho.
• There is evidence to support the Ha.
• There is evidence that p_r > p_c, and this
means there is evidence the red dye #2
group has a higher proportion of animals
with cancerous tumors than the control
diet. This is evidence that RD#2 is a
carcinogen.
Red Dye #2 Historical Note
• All red-colored food disappeared for a while.
No Jello, no red M&M's, no Hawaiian
Punch, etc. Poor young Jon.
• Eventually another red dye was approved
for sale. Jon's favorite mass-produced
junk items returned.
Saracco Study (Italy)
• Study of heterosexual couples where one
member of the couple was HIV infected.
• First group used condoms regularly, 171
couples. Of these 3 subsequently became
infected. P-hat = 3/171=.0175
• Second group did not use condoms
regularly. There were 55 such couples,
and 8 subsequently became infected, p-
hat = 8/55 = .14545.
Saracco Hypothesis Test
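$$H_0: p_R - p_N = 0$$
$$H_a: p_R - p_N < 0$$
$$Z = \frac{\hat{p}_R - \hat{p}_N - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_R} + \frac{1}{n_N}\right)}} = \frac{.0175 - .14545}{\sqrt{.04867(1-.04867)\left(\frac{1}{171} + \frac{1}{55}\right)}}$$
$$Z = -3.84$$
$$\hat{p} = \frac{X_R + X_N}{n_R + n_N} = \frac{3 + 8}{171 + 55} = .04867$$
$$P\text{-value} = P(Z < -3.84) < .0002$$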
Saracco Conclusion
• The data are unusual under Ho, so data
are inconsistent with Ho.
• There is evidence to doubt the Ho.
• There is evidence to support the Ha.
• There is evidence that p_r<p_n, this
means evidence that HIV infection
proportion is less in group that used
condoms regularly.
Saracco Historical Note
• This was the study that prompted world
health officials to proclaim that regular
condom use was “effective” in preventing
HIV infection.
• This does not mean that using condoms is
risk-free, all it means is that the infection
proportion was statistically less than not
using them.
Confidence Interval Formula

$$\hat{p}_1 - \hat{p}_2 \pm Z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
Confidence Intervals
• The crucial value used to evaluate these
intervals is zero. If all values are above
zero, it implies that proportion p_1 is
greater than p_2.
• If the interval is all negative, there is
evidence p_1<p_2.
• If the interval contains zero, it means no
difference is a plausible/reasonable
statement, and thus no evidence to say
that the proportions differ.
Woburn Mass CI
• In Woburn, Massachusetts there were
public wells that provided the city's water
supply.
• When the questionable water was being
consumed there were 16 adverse birth
outcomes out of 414 births. P-hat =
16/414=.039.
• When the water was not being consumed,
there were 3 adverse birth outcomes out
of 228 births. P-hat = 3/228=.013.
Woburn Confidence Interval
$$\hat{p}_y - \hat{p}_n \pm Z^* \sqrt{\frac{\hat{p}_y(1-\hat{p}_y)}{n_y} + \frac{\hat{p}_n(1-\hat{p}_n)}{n_n}}$$
$$.039 - .013 \pm 1.96\sqrt{\frac{.039(1-.039)}{414} + \frac{.013(1-.013)}{228}}$$
$$.026 \pm 1.96(.012)$$
$$.026 \pm .024$$
$$(.002, .05)$$
Woburn Water Conclusion
• The plausible range of values for p_y-p_n is
(.002, .05).
• The entire plausible range is positive.
• This means there is evidence the p_y >
p_n, and that the proportion of adverse
birth events with the water on is greater
than when the water was not used. There
is evidence the water is responsible for an
increase in adverse birth events in
Woburn.
Woburn Water Note
• Entertainment note: Hollywood film, A
Civil Action, starring John Travolta and
Robert Duvall, is based on this problem
situation.
• I believe the parents shown in the video
clip are part of the plot of the movie.
Propranolol Study
• Potential usefulness of propranolol for
recent heart attack victims. Population
proportion p_c = proportion of deaths within two
years under usual care. Population proportion p_p =
proportion of deaths within two years with propranolol.
Propranolol Confidence Interval
$$\hat{p}_c - \hat{p}_p \pm Z^* \sqrt{\frac{\hat{p}_c(1-\hat{p}_c)}{n_c} + \frac{\hat{p}_p(1-\hat{p}_p)}{n_p}}$$
$$.0954 - .0704 \pm 1.645\sqrt{\frac{.0954(1-.0954)}{1919} + \frac{.0704(1-.0704)}{1918}}$$
$$.025 \pm 1.645(.00889)$$
$$.025 \pm .0146$$
$$(.0104, .0396)$$
Propranolol Conclusion
• Note this was a 90% CI. The plausible range of
values for p_c – p_p is (.01, .04).
• This range includes only positive values.
• This implies p_c > p_p, and that there is a
higher death proportion under usual care,
and that two year death rates are reduced
when using propranolol.
• Is this a big deal?
• Compute four confidence intervals – one
for each of the four attributes.
• Compute 95% CIs for p_male – p_female.
• Write complete conclusion for each
interval.
Large-sample Confidence
Interval for a Population
Proportion
•A confidence interval for a
population characteristic is an
interval of plausible values for the
characteristic. It is constructed so
that, with a chosen degree of
confidence, the value of the
characteristic will be captured
inside the interval.
Confidence Level
•The confidence level associated
with a confidence interval estimate
is the success rate of the method
used to construct the interval.
Recall
For the sampling distribution of $\hat{p}$,
$$\mu_{\hat{p}} = p, \quad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
and for large* n the sampling distribution of $\hat{p}$ is
approximately normal.
Specifically when n is large*, the statistic
$\hat{p}$ has a sampling distribution that is
approximately normal with mean p and
standard deviation $\sqrt{p(1-p)/n}$.
* $np \geq 10$ and $n(1-p) \geq 10$
Some considerations

Approximately 95% of all large samples will
result in a value of $\hat{p}$ that is within
$$1.96\,\sigma_{\hat{p}} = 1.96\sqrt{\frac{p(1-p)}{n}}$$
of the true population proportion p.
Some considerations
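Equivalently, this means that for 95% of
all possible samples, p will be in the
interval
$$\hat{p} - 1.96\sqrt{\frac{p(1-p)}{n}} \quad \text{to} \quad \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}$$

Since p is unknown and n is large, we estimate
$$\sqrt{\frac{p(1-p)}{n}} \quad \text{with} \quad \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

This interval can be used as long as
$n\hat{p} \geq 10$ and $n(1-\hat{p}) \geq 10$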
The 95% Confidence Interval
When n is large, a 95% confidence
interval for p is
$$\left(\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \; \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$

The endpoints of the interval are often
abbreviated by
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where - gives the lower endpoint and + the
upper endpoint.
Example
•For a project, a student randomly sampled
182 other students at a large university to
determine if the majority of students were in
favor of a proposal to build a field house.
He found that 75 were in favor of the
proposal.

•Let p = the true proportion of students that
favor the proposal.
Example - continued
$$\hat{p} = \frac{75}{182} = 0.4121$$
So $n\hat{p}$ = 182(0.4121) = 75 > 10 and
$n(1-\hat{p})$ = 182(0.5879) = 107 > 10, so we can use
the formulas given on the previous slide to
find a 95% confidence interval for p.

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.4121 \pm 1.96\sqrt{\frac{0.4121(0.5879)}{182}} = 0.4121 \pm 0.07151$$

The 95% confidence interval for p is
(0.341, 0.484).
The General Confidence
Interval
The general formula for a confidence
interval for a population proportion p
when
1. $\hat{p}$ is the sample proportion from a
random sample, and
2. The sample size n is large
($n\hat{p} \geq 10$ and $n(1-\hat{p}) \geq 10$)
is given by
$$\hat{p} \pm (z \text{ critical value})\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Finding a z Critical Value
•Finding a z critical value for a 98%
confidence interval.

[Diagram: z = 2.33]
Looking up the cumulative area of 0.9900 in the
body of the table we find z = 2.33
Some Common Critical
Values
Confidence level   z critical value
80%                1.28
90%                1.645
95%                1.96
98%                2.33
99%                2.58
99.8%              3.09
99.9%              3.29
Terminology

The standard error of a statistic is the
estimated standard deviation of the statistic.

For sample proportions, the standard deviation is
$$\sqrt{\frac{p(1-p)}{n}}$$

This means that the standard error of the sample
proportion is
$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Terminology

The bound on error of estimation, B,
associated with a 95% confidence interval is
(1.96)·(standard error of the statistic).

The bound on error of estimation, B, associated
with a confidence interval is
(z critical value)·(standard error of the statistic).
Sample Size
The sample size required to estimate a
population proportion p to within an amount
B with 95% confidence is

$$n = p(1-p)\left(\frac{1.96}{B}\right)^2$$
The value of p may be estimated by prior
information. If no prior information is available,
use p = 0.5 in the formula to obtain a
conservatively large value for n.

Generally one rounds the result up to the nearest integer.
Sample Size Calculation Example
•A TV executive would like to find a 95%
confidence interval estimate within 0.03 for
the proportion of all households that watch
NYPD Blue regularly. How large a sample
is needed if a prior estimate for p was 0.15?
We have B = 0.03 and the prior estimate of p = 0.15
$$n = p(1-p)\left(\frac{1.96}{B}\right)^2 = (0.15)(0.85)\left(\frac{1.96}{0.03}\right)^2 = 544.2$$
A sample of 545 or more would be needed.
Sample Size Calculation
Example revisited
•Suppose a TV executive would like
to find a 95% confidence interval
estimate within 0.03 for the
proportion of all households that
watch NYPD Blue regularly. How
large a sample is needed if we have
no reasonable prior estimate for p?
We have B = 0.03 and should use p = 0.5 in
the formula.
$$n = p(1-p)\left(\frac{1.96}{B}\right)^2 = (0.5)(0.5)\left(\frac{1.96}{0.03}\right)^2 = 1067.1$$
The required sample size is now 1068.
Notice, a reasonable ball park estimate for p
can lower the needed sample size.
Another Example
•A college professor wants to
estimate the proportion of students
at a large university who favor
building a field house with a 99%
confidence interval accurate to 0.02.
If one of his students performed a
preliminary study and estimated p to
be 0.412, how large a sample
should he take?
We have B = 0.02, a prior estimate p = 0.412 and we
should use the z critical value 2.58 (for a 99%
confidence interval).
$$n = p(1-p)\left(\frac{2.58}{B}\right)^2 = (0.412)(0.588)\left(\frac{2.58}{0.02}\right)^2 = 4031.4$$
The required sample size is 4032.
Large Sample Hypothesis
Test for a Single
Proportion
To test the hypothesis
H0: p = hypothesized proportion,
compute the z statistic
$$z = \frac{\hat{p} - \text{hypothesized value}}{\sqrt{\frac{\text{hypothesized value}(1-\text{hypothesized value})}{n}}}$$
In terms of a standard normal random variable z, the
approximate P-value for this test depends on the
alternate hypothesis and is given for each of the
possible alternate hypotheses on the next 3 slides.
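Hypothesis Test
Large Sample Test of Population
Proportion

For the upper-tailed alternative Ha: p > hypothesized value,
$$P\text{-value} = P\left(z > \frac{\hat{p} - \text{hypothesized value}}{\sqrt{\frac{\text{hypothesized value}(1-\text{hypothesized value})}{n}}}\right)$$
Hypothesis Test
Large Sample Test of
Population Proportion

For the lower-tailed alternative Ha: p < hypothesized value,
$$P\text{-value} = P\left(z < \frac{\hat{p} - \text{hypothesized value}}{\sqrt{\frac{\text{hypothesized value}(1-\text{hypothesized value})}{n}}}\right)$$
Hypothesis Test
Large Sample Test of Population
Proportion

For the two-tailed alternative Ha: p ≠ hypothesized value,
$$P\text{-value} = 2P\left(z > \left|\frac{\hat{p} - \text{hypothesized value}}{\sqrt{\frac{\text{hypothesized value}(1-\text{hypothesized value})}{n}}}\right|\right)$$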
Hypothesis Test Example
Large-Sample Test for a Population Proportion
• An insurance company states that
the proportion of its claims that are
settled within 30 days is 0.9. A
consumer group thinks that the
company drags its feet and takes
longer to settle claims. To check
these hypotheses, a simple
random sample of 200 of the
company’s claims was obtained
and it was found that 160 of the
Example 2
Single Proportion continued
p = proportion of the company’s claims that are
settled within 30 days
H0: p = 0.9
HA: p < 0.9
The sample proportion is $\hat{p} = \frac{160}{200} = 0.8$
$$z = \frac{0.8 - 0.9}{\sqrt{\dfrac{(0.9)(1 - 0.9)}{200}}} = \frac{0.8 - 0.9}{\sqrt{\dfrac{0.9(0.1)}{200}}} = -4.71$$
$$\text{P-value} = P(z < -4.71) \approx 0$$
Example 2
Single Proportion continued
The probability of getting a result as strongly or
more strongly in favor of the consumer group's
claim (the alternate hypothesis Ha) if the
company’s claim (H0) was true is essentially 0.
Clearly, this gives strong evidence in support of
the alternate hypothesis (against the null
hypothesis).
Example 2
Single Proportion continued
We would say that we have strong support for
the claim that the proportion of the insurance
company’s claims that are settled within 30 days
is less than 0.9.
Some people would state that we have shown
that the true proportion of the insurance
company’s claims that are settled within 30 days
is statistically significantly less than 0.9.
Hypothesis Test Example
Single Proportion
• A county judge has agreed that he will give
up his county judgeship and run for a state
judgeship unless there is evidence at the
0.10 level that more than 25% of his party is
in opposition. A SRS of 800 party members
included 217 who opposed him. Please
Hypothesis Test Example
Single Proportion continued
p = proportion of his party that is in opposition
H0: p = 0.25
HA: p > 0.25
α = 0.10
Note: hypothesized value = 0.25

$$n = 800, \quad \hat{p} = \frac{217}{800} = 0.27125$$
$$z = \frac{0.27125 - 0.25}{\sqrt{\dfrac{0.25(0.75)}{800}}} = 1.39$$
Hypothesis Test Example
Single Proportion continued
$$\text{P-value} = P(z > 1.39) = 1 - 0.9177 = 0.0823$$

• At a level of significance of 0.10,
there is sufficient evidence to
support the claim that the true
percentage of the party members
that oppose him is more than 25%.

• Under these circumstances, I
Large-Sample Inferences
Difference of Two Population (Treatment) Proportions
Some notation:

                            Sample   Population Proportion   Sample Proportion
                            Size     of Successes            of Successes
Population or treatment 1   $n_1$    $p_1$                   $\hat{p}_1$
Population or treatment 2   $n_2$    $p_2$                   $\hat{p}_2$
Properties: Sampling Distribution of $\hat{p}_1 - \hat{p}_2$
If two random samples are selected
independently of one another, the following
properties hold:
1. $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
2. $\sigma^2_{\hat{p}_1 - \hat{p}_2} = \sigma^2_{\hat{p}_1} + \sigma^2_{\hat{p}_2} = \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}$ and
   $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$
3. If both n1 and n2 are large [$n_1 p_1 \ge 10$,
$n_1(1 - p_1) \ge 10$, $n_2 p_2 \ge 10$, $n_2(1 - p_2) \ge 10$],
then $\hat{p}_1$ and $\hat{p}_2$ each have a sampling
distribution that is approximately normal
Large-Sample z Tests for p1 – p2 = 0
The combined estimate of the common
population proportion is
$$p_c = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{\text{total number of successes in two samples}}{\text{total sample size}}$$
Large-Sample z Tests for p1 – p2 = 0
Null hypothesis: H0: p1 – p2 = 0

Test statistic:
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}$$
Assumptions:
1. The samples are independently chosen random
samples OR treatments are assigned at random to
individuals or objects (or vice versa).
2. Both sample sizes are large:
$n_1 \hat{p}_1 \ge 10$, $n_1(1 - \hat{p}_1) \ge 10$, $n_2 \hat{p}_2 \ge 10$, $n_2(1 - \hat{p}_2) \ge 10$
Large-Sample z Tests for p1 – p2 = 0
Alternate hypothesis and finding the P-value:
1. Ha: p1 - p2 > 0
P-value = Area under the z curve to the
right of the calculated z
2. Ha: p1 - p2 < 0
P-value = Area under the z curve to the
left of the calculated z
3. Ha: p1 - p2 ≠ 0
i. 2•(area to the right of z) if z is positive
ii. 2•(area to the left of z) if z is negative
Example - Student Retention
A group of college students were asked what they
thought was the “issue of the day”. Without a pause the
class almost to a person said “student retention”. The
class then went out and obtained a random sample
(questionable) and asked the question, “Do you plan
on returning next year?”
The responses along with the gender of the person
responding are summarized in the following table.

                Response
Gender      Yes    No    Maybe
Male        211    45    19
Female      141    32     9

Test to see if the proportion of students planning on returning is
the same for both genders at the 0.05 level of significance.
Example - Student Retention
p1 = true proportion of males who plan on returning
p2 = true proportion of females who plan on returning
n1 = number of males surveyed
n2 = number of females surveyed
$\hat{p}_1$ = x1/n1 = sample proportion of males who plan on
returning
$\hat{p}_2$ = x2/n2 = sample proportion of females who plan on
returning

Null hypothesis: H0: p1 – p2 = 0
Alternate hypothesis: Ha: p1 – p2 ≠ 0
Example - Student Retention
Significance level: α = 0.05

Test statistic:
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}$$

Assumptions: The two samples are independently
chosen random samples. Furthermore, the sample sizes
are large enough since
$n_1 \hat{p}_1 = 211 \ge 10$, $n_1(1 - \hat{p}_1) = 64 \ge 10$
$n_2 \hat{p}_2 = 141 \ge 10$, $n_2(1 - \hat{p}_2) = 41 \ge 10$
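Example - Student Retention
Calculations:
$$p_c = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{211 + 141}{275 + 182} = \frac{352}{457} = 0.7702$$
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_c(1 - p_c)}{275} + \dfrac{p_c(1 - p_c)}{182}}} = \frac{0.76727 - 0.77473}{\sqrt{\dfrac{0.77024(1 - 0.77024)}{275} + \dfrac{0.77024(1 - 0.77024)}{182}}} = \frac{-0.0074525}{0.040198} = -0.19$$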
Example - Student Retention
P-value:
The P-value for this test is 2 times the area
under the z curve to the left of the computed
z = -0.19.
P-value = 2(0.4247) = 0.8494

Conclusion:
Since P-value = 0.849 > 0.05 = α, the hypothesis H0 is
not rejected at significance level 0.05.
There is no evidence that the return rate is different for
males and females.
Example
A consumer agency spokesman stated that he
thought that the proportion of households having
a washing machine was higher for suburban
households than for urban households. To test to
see if that statement was correct at the 0.05 level
of significance, a reporter randomly selected a
number of households in both suburban and
urban environments and obtained the following
data.

            Number      Number having       Proportion having
            surveyed    washing machines    washing machines
Suburban    300         243                 0.810
Urban       250         181                 0.724
Example
p1 = proportion of suburban households having
washing machines
p2 = proportion of urban households having
washing machines
p1 - p2 is the difference between the proportions
of suburban households and urban
households that have washing machines.
H0: p1 - p2 = 0
Ha: p1 - p2 > 0
Example
Significance level: α = 0.05
Test statistic:
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}$$

Assumptions: The two samples are independently
chosen random samples. Furthermore, the sample sizes
are large enough since
$n_1 \hat{p}_1 = 243 \ge 10$, $n_1(1 - \hat{p}_1) = 57 \ge 10$
$n_2 \hat{p}_2 = 181 \ge 10$, $n_2(1 - \hat{p}_2) = 69 \ge 10$
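Example
Calculations:

$$p_c = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{243 + 181}{300 + 250} = \frac{424}{550} = 0.7709$$

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}} = \frac{0.810 - 0.724}{\sqrt{0.7709(1 - 0.7709)\left(\dfrac{1}{300} + \dfrac{1}{250}\right)}} = 2.390$$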
Example
P-value:
The P-value for this test is the area under the z
curve to the right of the computed z = 2.39.
The P-value = 1 - 0.9916 = 0.0084
Conclusion:
Since P-value = 0.0084 < 0.05 = α, the hypothesis H0 is
rejected at significance level 0.05. There is sufficient
evidence at the 0.05 level of significance that the
proportion of suburban households that have washers is
greater than the proportion of urban households that have
washers.
Large-Sample Confidence Interval for p1 – p2
When
1. The samples are independently selected random
samples OR treatments that were assigned at
random to individuals or objects (or vice versa), and
2. Both sample sizes are large:
$n_1 \hat{p}_1 \ge 10$, $n_1(1 - \hat{p}_1) \ge 10$, $n_2 \hat{p}_2 \ge 10$, $n_2(1 - \hat{p}_2) \ge 10$
A large-sample confidence interval for p1 – p2 is
$$(\hat{p}_1 - \hat{p}_2) \pm (z \text{ critical value}) \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
Example
A student assignment called for the students to survey
both male and female students (independently and
randomly chosen) to compare the proportions that
approve of the College’s new drug and alcohol policy.
A student went and randomly selected 200 male
students and 100 female students and obtained the
data summarized below.

          Number      Number that    Proportion
          surveyed    approve        that approve
Female    100         43             0.430
Male      200         61             0.305

Use this data to obtain a 90% confidence interval estimate
for the difference of the proportions of female and male
students that approve of the new policy.
Example
For a 90% confidence interval the z value to use is
1.645. This value is obtained from the bottom row of
the table of t critical values (Table III).
We use $\hat{p}_1$ to be the females’ sample approval
proportion and $\hat{p}_2$ as the males’ sample approval
proportion.
$$(0.430 - 0.305) \pm 1.645\sqrt{\frac{0.430(1 - 0.430)}{100} + \frac{0.305(1 - 0.305)}{200}}$$
$$(0.125) \pm 0.097 \quad \text{or} \quad (0.028, 0.222)$$
Based on the observed sample, we believe that the
proportion of females that approve of the policy exceeds the
proportion of males that approve of the policy by
somewhere between 0.028 and 0.222.

```
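
The calculations in the slides above (sample size, one-proportion z test, pooled two-proportion z test, and the confidence interval for p1 – p2) can be checked with a short script. What follows is a minimal sketch, not part of the original presentation: the function names (`sample_size`, `one_prop_z`, `two_prop_z`, `p_value`, `two_prop_ci`) are hypothetical, and it assumes Python 3.8+ so that `statistics.NormalDist` is available. Because it works with unrounded z values and an exact normal critical value rather than table lookups, its P-values can differ slightly from the table-based figures in the slides.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def sample_size(p, bound, z_crit):
    """Sample size from n = p(1 - p)(z/B)^2, rounded up to a whole number."""
    return ceil(p * (1 - p) * (z_crit / bound) ** 2)


def one_prop_z(successes, n, p0):
    """z statistic for H0: p = p0, using p0 in the standard error."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


def two_prop_z(x1, n1, x2, n2):
    """z statistic for H0: p1 - p2 = 0, using the combined estimate p_c."""
    p_c = (x1 + x2) / (n1 + n2)
    se = sqrt(p_c * (1 - p_c) / n1 + p_c * (1 - p_c) / n2)
    return (x1 / n1 - x2 / n2) / se


def p_value(z, alternative):
    """P-value for a z statistic; alternative is 'greater', 'less' or 'two-sided'."""
    if alternative == "greater":
        return 1 - STD_NORMAL.cdf(z)
    if alternative == "less":
        return STD_NORMAL.cdf(z)
    if alternative == "two-sided":
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


def two_prop_ci(x1, n1, x2, n2, confidence):
    """Large-sample confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    z_crit = STD_NORMAL.inv_cdf(0.5 + confidence / 2)
    margin = z_crit * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - margin, p1 - p2 + margin


if __name__ == "__main__":
    # Field house example: B = 0.02, prior estimate 0.412, z critical value 2.58
    print(sample_size(0.412, 0.02, 2.58))
    # County judge example: 217 of 800 opposed, H0: p = 0.25, Ha: p > 0.25
    z = one_prop_z(217, 800, 0.25)
    print(round(z, 2), round(p_value(z, "greater"), 4))
    # Washing machine example: 243 of 300 suburban, 181 of 250 urban
    z = two_prop_z(243, 300, 181, 250)
    print(round(z, 2), round(p_value(z, "greater"), 4))
    # Policy approval example: 43 of 100 females, 61 of 200 males, 90% interval
    print(two_prop_ci(43, 100, 61, 200, 0.90))
```

The pooled standard error is used only for the test of p1 – p2 = 0, where the null hypothesis says the two proportions are equal; the confidence interval uses the unpooled form, matching the formulas on the slides.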