# Stat 101-106

Stat 101-106
T. Brown, J. Chang
N. Hengartner, J. Kim, E. Kostello,
J. Lapinski, J. Reuning-Scherer

Fall, 2001
Class #1 9/6/01
1
Who are we and why are we here?
Intro to Statistics for...
• Life Sciences: Stat 101/EEB 210/MCDB 215
(Instructors: J. Kim, JC)
• Political Science: Stat 102/PLSC 452/EP&E 203
(J. Lapinski, JC)
• Sociology: Stat 103/Soc 119
(E. Kostello, JC)
• Psychology: Stat 104/Psych 201 (T. Brown, JC)
• Environmental Sciences: Stat 105/F&ES 205
(J. Reuning-Scherer, JC)
• Data analysis: Stat 106 (N. Hengartner, JC)
2
The intellectual universe
(In 3 easy steps…)

3
The intellectual universe
Physics                               Philosophy

Math                            Economics
6
What is Statistics?
• The science of collecting, organizing, and
interpreting numerical facts, which we call
data. (Moore and McCabe)
• The science and art of prediction and
explanation. (Yale College Programs of Study)

Statistics provides a framework and tools for…
7
An “easy” question
• In Euclidean plane geometry, what is the sum
of the angles in a triangle?

(Nobody asks: “What is the latest thinking on the sum
of the angles in a triangle in Euclidean geometry?”)
8
Some “hard” questions
• Does smoking cause cancer?
• Do men and women differ in pulse?
• Does listening to Mozart make people smarter?
• Is there a single gene that causes cystic fibrosis?
Where is it in the genome?
• Is global warming really happening? How much
should we worry?
• How long ago did “mitochondrial Eve” live?
• Should Bush really have won the election?
9
Nature of class
• purpose
• prerequisites
• structure
– sections
– schedule

10
What’s going on here
• Two kinds of lectures:
– “General Statistics lectures” introduce statistical
theory, concepts, etc.  Everybody, OML202
– “Subject area lectures”: Introduce subject-
oriented applications and examples; possibly more
techniques and theory  Sections separate
• Not much math used (and none of it
advanced). Lots of reasoning. Some use of
computers.
11
Where to go & when
• Weeks 1, 2, 3: General Stat lecture only
• Weeks 4-9: General Stat lecture Tues,
Subject lecture Thurs
• (Week 11: Fall Break)
• Weeks 10, 12, 13: Subject lecture only

12

• “Section” may be a misleading word. “Subject-area
lectures” may be more suggestive.
• Your section professor (not me) is the ultimate
authority, responsible for giving grades, setting
policies, etc.
• Sections are independent of each other.

13
Course topics
“science of collecting, organizing, and
interpreting data”
(I) Organizing
distributions, graphical displays (histograms,
boxplots,...) numerical summaries (mean, median,
standard deviation,...), Normal distributions
more than one variable: correlation, regression

14
Course topics
“science of collecting, organizing, and
interpreting data”
(II) Collecting and producing data
Sampling, bias, design of experiments,
randomization

15
Course topics
“science of collecting, organizing, and
interpreting data”
(III) Interpreting data -- Statistical inference
Confidence intervals, hypothesis testing, various
techniques for various questions with various
kinds of data

16
Course topics
“science of collecting, organizing, and
interpreting data”
(II.V) Probability
Random variables , rules of probability,
conditional probability, Bayes’ rule, binomial
& Normal distributions, Central Limit
Theorem

17
Using the web

“Register” at http://classes.yale.edu/student/

Register for both your section and for “Stat 100a”

E.g., Stat 102a or whatever

The class web pages have announcements and links to the class notes and homework.
18
Stat 10x
J. Chang
Tuesday, 9/11/01

…a hard day to do Statistics.

19
What is Statistics?
• The science of collecting, organizing, and
interpreting numerical facts, which we call
data. (Moore and McCabe)
• The science and art of prediction and
explanation. (Yale College Programs of Study)

Statistics provides a framework and tools for…
20
What is Statistics? (cont.)
Conceptual framework and methods for
• learning from experience (data, experiments,
etc.)
• reasoning under uncertainty
• answering questions and quantifying the
strength or reliability of the conclusions

21
A few (4, actually) words
Prototypical situation: Want to answer a
question about a population of individuals.
We use a sample of individuals from the
population.

Parameter: a number describing the population
Statistic: a number describing the sample.

22
P.S.

Population          Parameter

Sample              Statistic

23
e.g.
In a sample of 1000 US voters before the 2000
election, 493 said they would vote for Bush (and
507 said they would vote for Gore).

Population, parameter, sample, statistic?
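The distinction can be made concrete with a line of arithmetic: the statistic is something we can actually compute from the sample; the parameter (the true proportion among all voters) stays unknown. A minimal Python sketch (not part of the course's Minitab workflow):

```python
# The statistic here is the sample proportion of Bush voters.
n = 1000
bush = 493
p_hat = bush / n
print(p_hat)  # 0.493 -- an estimate of the unknown population parameter
```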

24
Data
Typically consists of values of some variables
measured or observed for some individuals.

A variable is a characteristic of an individual, or
something like that. (M&M p. 4 attempts to give
definitions)

E.g. our questionnaire from the first day
• individuals: each of you
• variables: sex, pulse, height, etc.
25
Distributions
The distribution of a variable says what possible
values the variable takes and how frequently it
takes those values.

Many methods to describe and display
distributions

26
Distributions, cont.
E.g. (Minitab demo)
tally ‘section’
histogram ‘weight’
• try with: default, 100 bins, 2 bins

What makes a histogram “good”?

27
Some words that describe distributions

[Density sketches labeled: symmetric, unimodal, bimodal]
28
Which is skewed to the left? To the right?

29
Skewed to the right        Skewed to the left
Beta (1.6, 2.7) density    Beta (2.7, 1.6) density

30
Numerical descriptions of distributions

• Mean = average
• Median = 50th percentile

Different ways of getting at the idea of a
“center” of a distribution.

31
More detail (and that pesky Σ)
E.g. if the data are 76, 57, 88, 93, 72, 94,

Mean = (76 + 57 + 88 + 93 + 72 + 94) / 6 = 80

For a variable x with n observed values x1, x2, ..., xn, the mean of x is

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
32
“Physical” interpretation of mean
Center of mass -- the balance point of a distribution

[Figure: a histogram balancing at its mean]
x̄ = 93.4

33
Median = 50th percentile
Arrange data in order.
Median M = 50th percentile = “middle observation”

E.g. for data 57, 72, 76, 88, 93
M = 76.
E.g. for data 57, 72, 76, 88, 93, 94
M = (76 + 88) / 2 = 82.
[i.e. if number of observations is even, average
the middle two.]
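Both cases can be checked with Python's statistics module (a sketch, outside the course's Minitab workflow):

```python
import statistics

odd_data = [57, 72, 76, 88, 93]
even_data = [57, 72, 76, 88, 93, 94]

print(statistics.median(odd_data))   # 76: the middle observation
print(statistics.median(even_data))  # 82.0: average of the two middle observations
```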

34
Quartiles
• Define first quartile to be the median of the
observations below the median (i.e. 25th percentile)
• Define third quartile to be the median of the
observations above the median (i.e. 75th percentile)

57, 72, 76, 88, 93, 94
M = 82
Q1 = 72,   Q3 = 93
35
“Robust” or “resistant”
These terms mean “insensitive to a few extreme
observations”
(imagine typo of adding several zeros to a number).

Which is more robust: mean or median ?

Compare      57, 72, 76, 88, 93, 94
to          57, 72, 76, 88, 93, 94000000
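The comparison above can be run directly (Python sketch; the typo value is the slide's made-up 94000000):

```python
import statistics

data = [57, 72, 76, 88, 93, 94]
typo = [57, 72, 76, 88, 93, 94000000]  # "94" with several extra zeros

print(statistics.mean(data), statistics.median(data))  # 80, 82.0
print(statistics.mean(typo), statistics.median(typo))  # mean explodes; median doesn't move
```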

36
Which is which?
[Figure: a skewed density with two vertical lines, one at the median and one at the mean]
37
Mean vs. Median          (a few comments)

• Robustness is nice, like apple pie.
• But, e.g., insurance companies and casinos
care more about mean than median

• In a symmetric distribution mean = median.

38
Summary
Today focused on
•Conceptual framework of Statistics
(population parameter, sample, statistic)
•The idea of a distribution
•Started some simple statistics describing
distributions (mean, median,…)
•We saw a bit of Minitab in class

Next time: Standard deviation, densities, Normal
distributions…
39
Stat 10x
J. Chang

Thursday 9/13/01

40
• Minitab intro sessions start today. Look on the web
to see what session you’re assigned.
• “Register” for both “Stat100a” and for your section
through http://classes.yale.edu/student
• Homework will be posted today; linked to the
syllabus.

41
Measures of overall location or “center”

As discussed last time -- mainly mean, median

42
Measures of spread
• Range = max - min
• Interquartile range: IQR = Q3 - Q1
• Most common and useful: variance and
standard deviation (SD).
Relationship:
SD = √Variance

43
The typical letters
• Sample:
Variance = s²,   SD = s

• Population:
Variance = σ²,   SD = σ

44
Idea of variance
• How far away are the observations, on
average, from the mean?

45
Deviations

xi  x

Difference between ith observation and mean
46
Formula

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Scary, isn’t it?

47
Calculating Variance

data x_i: 1, 2, 4, 5, 7     (n = 5, n − 1 = 4, x̄ = 3.8)

 x_i    x_i − x̄    (x_i − x̄)²
  1      −2.8        7.84
  2      −1.8        3.24
  4       0.2        0.04
  5       1.2        1.44
  7       3.2       10.24
sum:       0        22.80

s² = 22.80 / 4 = 5.7
s = √5.7 ≈ 2.39
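The table's arithmetic can be verified in a few lines of Python (a sketch; the standard library's `variance` and `stdev` also divide by n − 1):

```python
import statistics

data = [1, 2, 4, 5, 7]
xbar = statistics.mean(data)                      # 3.8
devs = [x - xbar for x in data]                   # -2.8, -1.8, 0.2, 1.2, 3.2
s2 = sum(d ** 2 for d in devs) / (len(data) - 1)  # 22.80 / 4 = 5.7
print(s2)
print(statistics.variance(data), statistics.stdev(data))  # 5.7 and s = sqrt(5.7)
```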
48
Deviations from the mean

49
Why square?
• Sum of deviations (not squared) is just 0.
Squaring the deviations converts the negative
deviations to positive numbers...
• Summing squares is a natural operation; our
eyes do it all the time with no help from our
brains…          (Just kidding, Professor Brown)

50
How far apart are these 2 points?

• (4,6)

• (1,2)    • (4,2)

$\sqrt{(4-1)^2 + (6-2)^2} = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5$

51
Why divide by n-1 ?
• It’s unimportant if n is large.
• It’s not that important in any case.
• What happens when n = 1?
– You shouldn’t be trying to estimate a variance
from a sample of size 1!
• Dividing by n-1 gives an unbiased estimate of
the population variance σ².
52
More practice
100, 100, 100, 100, 100, 100, 100
Here s = 0.

90, 90, 90, 100, 110, 110, 110

Here s = 10.

53
Robustness of IQR vs. SD
• IQR is robust; SD is not.

54
Some simple rules
Start with a variable X having mean x̄ and SD s_x.

Add 3 to each value, getting a new variable Y:
Y_i = X_i + 3.

What are ȳ and s_y?

55
Some simple rules
Start with a variable X having mean x̄ and SD s_x.

Add 3 to each value, getting a new variable Y:
Y_i = X_i + 3.

What are ȳ and s_y?

ȳ = x̄ + 3,
s_y = s_x   ...no change in SD

56
Some simple rules

Multiply each value by 3, getting a new variable Z:
Z_i = 3X_i.

What are z̄ and s_z?

57
Some simple rules

Multiply each value by 3, getting a new variable Z:
Z_i = 3X_i.

What are z̄ and s_z?

z̄ = 3x̄,
s_z = 3s_x,
s_z² = 9s_x²
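Both rules are easy to confirm numerically (Python sketch; reusing the data 1, 2, 4, 5, 7 from the variance example is an assumption, since the slide keeps X abstract):

```python
import statistics

x = [1, 2, 4, 5, 7]
y = [xi + 3 for xi in x]   # add 3: mean shifts by 3, SD unchanged
z = [3 * xi for xi in x]   # multiply by 3: mean and SD both triple

print(statistics.mean(y), statistics.stdev(y))
print(statistics.mean(z), statistics.stdev(z))
```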

58
Nonlinear transformations have a more
complicated effect

Square each value, getting new variable W:
W_i = X_i².

Is w̄ = (x̄)²?

I.e. (mean of the square) = (square of the mean)?

59
Nonlinear transformations have a more
complicated effect

Square each value, getting new variable W:
W_i = X_i².

Is w̄ = (x̄)²?

I.e. (mean of the square) = (square of the mean)?

No: w̄ ≥ (x̄)²
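A quick numerical check (Python sketch; the data set 1, 2, 4, 5, 7 is an assumed example, since the slide leaves X abstract):

```python
import statistics

x = [1, 2, 4, 5, 7]
w = [xi ** 2 for xi in x]

print(statistics.mean(w))        # mean of the squares: 19.0
print(statistics.mean(x) ** 2)   # square of the mean: 14.44
```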

60
Density curves
Idealized, smoothed
histogram. Limit of a
large population.
(population → ∞)

61
Areas correspond to
proportions of population

62
Why “idealized”?
• No such thing as a precisely normally
distributed population.

63
Example: Uniform density
E.g. Uniform on the interval [0, 10]

flat, from 0 to 10

Height? 1/10 = 0.1, so that
(Area under a density) = 1
64
Standard Normal density

$y = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$

[Figure: the bell curve, plotted for x from −4 to 4]
65
General Normal densities

Interpretation of s in terms of inflection points
66
Notation for Normal distributions

Normal distribution with mean μ and SD σ is
denoted N(μ, σ).

If the variable X has a N(μ, σ) distribution, we
write X ~ N(μ, σ).

67
“68, 95, 99.7 rule”

This picture is for a standard Normal distribution N(0,1)
68
68, 95, 99.7 rule for N(μ, σ)
• 68% of the population is within 1 SD of the
mean (i.e. between μ − σ and μ + σ)
• 95% of the population is within 2 SD’s of the
mean (i.e. between μ − 2σ and μ + 2σ)
• 99.7% of the population is within 3 SD’s of
the mean (i.e. between μ − 3σ and μ + 3σ)
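The three numbers are just areas under the N(0, 1) curve; they can be reproduced with Python's NormalDist (a sketch outside the course's Minitab workflow):

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal, N(0, 1)
for k in (1, 2, 3):
    print(round(z.cdf(k) - z.cdf(-k), 3))   # 0.683, 0.954, 0.997
```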

69
Example
Assume verbal SAT scores have approximately
a N(505,110) distribution.

What percentile is a score of 615?

70

615 is 1 SD above the mean…

…blackboard...

71
More precise answers to more general questions
like this using Minitab or Normal Tables
• Minitab
– Do the 615 example
• Tables
– A bit anachronistic now, but useful for desert
islands and exams...

72
Normal tables

See Table A in textbook.   (Show transparency in class)
73
Sir Francis Galton (1822-1911) on the Normal distribution

I know of scarcely anything so apt to impress the
imagination as the wonderful form of cosmic order
expressed by the "Law of Frequency of Error." The law
would have been personified by the Greeks and deified, if
they had known of it. It reigns with serenity and in
complete self-effacement, amidst the wildest confusion.
The huger the mob, and the greater the apparent anarchy,
the more perfect is its sway. It is the supreme law of
Unreason. Whenever a large sample of chaotic elements
are taken in hand and marshaled in the order of their
magnitude, an unsuspected and most beautiful form of
regularity proves to have been latent all along.
74
Stat 10x
J. Chang
Tuesday, 9/18/01

75
Densities
Remember densities describe the distrib of a
variable in a large population.
Total area = 1

[Figure: a density curve, x from −4 to 4]
76
Areas give fractions of population

-1               2
E.g. What is the fraction of the pop having X between -1 and 2?
77
Areas give fractions of population

Area = 0.82

-1               2
E.g. What is the fraction of the pop having X between -1 and 2?
79
Standard Normal density: The “bell curve”
(Also called “Gaussian”)

$y = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$

[Figure: the bell curve, plotted for x from −4 to 4]
80
General Normal densities

Interpretation of s in terms of inflection points
81
Notation for Normal distributions

Normal distribution with mean μ and SD σ is
denoted N(μ, σ).

If the variable X has a N(μ, σ) distribution, we
write X ~ N(μ, σ).

82
“68, 95, 99.7 rule”

This picture is for a standard Normal distribution N(0,1)
83
68, 95, 99.7 rule for N(μ, σ)
This “rule” is just 3 numbers to memorize:

• 68% of the population is within 1 SD of the
mean (i.e. between μ − σ and μ + σ)
• 95% of the population is within 2 SD’s of the
mean (i.e. between μ − 2σ and μ + 2σ)
• 99.7% of the population is within 3 SD’s of
the mean (i.e. between μ − 3σ and μ + 3σ)

84
Example
Suppose verbal SAT scores have approx N(μ, σ)
distribution with μ = 505 and σ = 110.
What is the percentile of the score 615?

85
Start by drawing a picture

86
615 is 1 SD above the mean. (615 = 505 + 110)

Want this area:

88
This is same as area above values < 1 in a
standard Normal distrib

89

68 %

16 %            16 %

Answer = 16% + 68% = 84%
90
Doing the problem with Minitab
Use Calc > Probability Distributions > Normal
and fill in 505 and 110 for mean and SD.
Do a cumulative probability, for 615.

Cumulative Distribution Function
Normal with mean = 505.000 and standard
deviation = 110.000

x     P( X <= x)
615.0000        0.8413
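The same cumulative probability can be sketched in Python (NormalDist plays the role of the Minitab dialog here):

```python
from statistics import NormalDist

sat = NormalDist(mu=505, sigma=110)
print(sat.cdf(615))   # 0.8413..., matching the Minitab output
```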
91
Using a Normal Table
Useful for desert islands and exams.
Tables typically give cumulative probabilities.
I showed you a table on the overhead…

Textbook explains and gives examples on pp. 75-79.

92
Another problem
Want percentile for 680.
Not a neat use of 68, 95, 99.7 rule.

680
93
To use a Normal table for this problem

Score is (680 − 505)/110 = 175/110 ≈ 1.59
SD’s above mean.

Now use standard Normal table.
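The table lookup can be mimicked in Python (a sketch, in place of the printed standard Normal table):

```python
from statistics import NormalDist

z = (680 - 505) / 110        # about 1.59 SD's above the mean
print(z)
print(NormalDist().cdf(z))   # about 0.94: roughly the 94th percentile
```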

94
Standardizing and z-scores (Just terminology)

Let x be an observation from a distrib with mean
μ and SD σ.
How many SD’s is x above the mean?
Standardized value:

$z = \frac{x - \mu}{\sigma}$
“z-score”
These are nice for comparing “extremenesses” of
otherwise incomparable quantities                 95
Normal probability plots (or quantile plots)
[M&M pp. 79-83]
Have some data on a variable.
Is it believable that the data came from a Normal
population? If not, in what way does the pop
distrib differ from a Normal?

How can we see this?

96
Idea of Normal probability plots

Change the problem of judging whether a
histogram looks like a “Normally shaped hump”
into judging how well some points fall along a
straight line.

97
E.g.: Gold plating thickness on circuit boards

unit = 10⁻⁶ inch
98
99
E.g. Water runoff in Arizona

100
Normal probability plot for Runoff

101
Idea of quantile plots
• Plot the observed values of the variable vs.
“where we would expect them to be if they
came from a (standard) Normal distribution.”
• (Which Normal distrib to use? Doesn’t matter since they are
all linearly related to each other.)
• Quantile plots use percentiles of the Normal
distribution.
• Roughly, plot ith largest observation vs. (i/n)
percentile of N(0,1) distrib.
102
Idea of quantile plots (cont)
• Idea is to plot
– median of data vs. median of N(0,1)
– 10th percentile of data vs 10th percentile of N(0,1)
– etc.
• Need to use a precise definition of sample
percentiles; there are several variations.

103
Something like this...
E.g. given sorted data 10.8 24.2 35.8 36.1 49.5

Plot 35.8 [median of data] vs 0.0 [median of N(0,1)]
Plot 10.8 [20th % of data] vs ?? [20th % of N(0,1)]
Plot 24.2 [40th % of data] vs ?? [40th % of N(0,1)]
.....
Plot 49.5 [100th % of data] vs ?? [100th % of N(0,1)]

oops
104
That needs a little fixing. Several ways to go...

Instead of using fractions 1/5, 2/5, 3/5, 4/5, 5/5,
could use e.g.
1/6, 2/6, 3/6, 4/6, 5/6
or    1/10, 3/10, 5/10, 7/10, 9/10
or    ...

These are different options in Minitab. Makes
little difference if n is large.
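For example, the second option's fractions (1/10, 3/10, 5/10, 7/10, 9/10 for n = 5) and the corresponding N(0,1) percentiles can be computed with Python's inverse Normal CDF (a sketch outside the course's Minitab workflow):

```python
from statistics import NormalDist

n = 5
fractions = [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]
quantiles = [NormalDist().inv_cdf(p) for p in fractions]
print(fractions)   # [0.1, 0.3, 0.5, 0.7, 0.9]
print(quantiles)   # symmetric about 0; the middle one is 0.0
```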
105
N(0,1) percentiles
E.g. for n=5 and first option, N(0,1) percentiles look like

•-1           •0           •1

and for n= 20,

•-1          •0            •1

Blackboard…
106
[Figure: a hand-drawn Normal probability plot; vertical axis is the Normal percent scale (1 to 99), horizontal axis shows the data values 0, 2, 4, 5, 9, 10, 17, 20]
107
E.g. Water runoff in Arizona

108
Normal probability plot for Runoff

109
Log(Runoff)

110
111
Today: describing joint distribution
of two variables
• Scatterplots
• Correlation
• Regression

112
• How strong a linear relationship is there
between two variables?
– E.g. when height increases, does weight also tend
to increase?
– E.g. How about weight and pulse?
• If we know the value of one variable for an
individual, how can we best predict the value
of another variable for that individual?

113
Scatterplots
Plot two variables simultaneously.
Put one variable on horizontal axis,
other variable on vertical axis.

114
E.g. weight vs. height

[Scatterplot: weight (100–200) on the vertical axis vs. height (55–75) on the horizontal axis]
115
E.g. pulse vs. weight
[Scatterplot: pulse (40–120) on the vertical axis vs. weight (100–200) on the horizontal axis]
116
Correlation
• Measures the “strength of the linear
relationship” between two variables.

117
Small correlation

[Scatterplot: no visible trend]

r = 0.06
118
Highly correlated variables
[Scatterplot: points tightly clustered along an upward-sloping line]

r = 0.99
119
Moderate correlation

[Scatterplot: a loose upward trend]

r = 0.55
120
Negative correlation
[Two scatterplots with downward trends]

r = -0.52 (left)                     r = -0.96 (right)

121
Zero correlation

[Scatterplot: no visible trend]

Positive or negative?       r = 0.03
122
Definition of correlation
First standardize the variables:

Instead of $x_i$ and $y_i$, look at $\frac{x_i - \bar{x}}{s_x}$ and $\frac{y_i - \bar{y}}{s_y}$,

i.e. how many SD’s is each observation above the
mean?
Then do this...

123
…Definition of correlation

$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$

That is:
standardize each xi and yi ,
multiply, and
“average”
124
• Correlation is “dimensionless”
• Since we can rewrite the definition as

$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$

we can see that r is between -1 and 1.
E.g. if yi = xi for all i, then r = 1.
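The definition translates almost line-for-line into Python (a sketch; the data are made up to show the boundary cases r = 1 and r = −1):

```python
import statistics

def correlation(xs, ys):
    # Standardize each x and y, multiply pairwise, and "average" with n - 1.
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(correlation(xs, xs))                    # 1: a perfect linear relation
print(correlation(xs, [-2 * x for x in xs]))  # -1: perfect, negative slope
```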

125
Rough idea of definition
• Draw picture...

126
A small example worked out in
detail
• Blackboard...

127
Stat 10x
J. Chang
Thursday, 9/20/01

128
Scatterplots
Plot two variables simultaneously.
Put one variable on horizontal axis,
other variable on vertical axis.

129
E.g. weight vs. height

[Scatterplot: weight (100–200) on the vertical axis vs. height (55–75) on the horizontal axis]
130
E.g. pulse vs. weight
[Scatterplot: pulse (40–120) on the vertical axis vs. weight (100–200) on the horizontal axis]
131
Correlation
• Measures strength and direction of linear
relationship between two variables.
• Between -1 and +1.
• +1 : perfect linear relationship, positive slope
• -1 : perfect linear relationship, negative slope

132
…Definition of correlation

$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$

That is:
standardize each xi and yi ,
multiply, and
“average”
133
Rough idea of definition
xi  x  0
xi  x  0                 yi  y  0
yi  y  0           +

Sign of
 xi  x   yi  y 
         sy ?   

xi  x  0  sx              
+                 y y 0
i

xi  x  0, yi  y  0

134
A small example worked out in detail
by hand
• Did this in detail on the blackboard…

135
Correlation and Regression
Sir Francis Galton (1822-1911)
Scientist and explorer
Cousin of Charles Darwin
Studied heredity, intelligence, eugenics …
IQ estimated at 200
Invented quincunx
Idea of branching processes
Statistical study of efficacy of prayer

136
Fathers and sons data

Correlation
r  0.5

What is average height of a son whose father is 72” ?
137
Descriptive statistics on father-son
data
• Fathers: Mean = 68” SD = 3”
• Sons: Mean = 69” SD = 3”
• Average height of son if father is 72” ?
A natural guess:
Father is 4/3 SD’s above mean, so guess
son will be 4/3 SD’s above mean, or 73”

138
How is 73” as a guess?

139
(Just in case you like 72” for some
reason)

140
Best guess depends on correlation
Guess that son will be,
not 4/3 SD’s above mean,
but correlation × 4/3 = 2/3 SD’s above mean.

r ≈ 0.5

That is, in our example, guess son’s height to be
69 + (2/3) × 3 = 71 inches.

141
“Natural” guess vs. “best” guess

73
71

142
Another example
Use LSAT scores to predict 1st-year final exam scores
Historical data:
X = LSAT scores: mean 650, SD 80
Y = final scores: mean 65, SD 10
correlation: r = 0.4
Question: Predict final score for student with x = 750.

•Step 1
750  650 100
Standardize: x is               1.25 SD’s above mean
80      80
143
Example            (cont)
X = LSAT scores: mean 650, SD 80
Y = final scores: mean 65, SD 10            correlation: r = 0.4
Question: Predict final score for student with x = 750.

Natural (bad): Guess Y to be 1.25 SD’s above its mean,
or 65 + 1.25 * 10 = 77.5

Best: Guess Y to be 0.4 * 1.25 = 0.5 SD’s above its
mean, or 65 + 0.5 * 10 = 70.

144
Equation of the regression line
Just a formula for all the best guesses:
y  65         x  650 
 (0.4)         
10           80 

In general:
y y    xx 
 r    
sY      sX 
145
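The prediction rule from the LSAT example (standardize x, shrink by r, convert back to the y scale) is short enough to sketch in Python, using the example's numbers:

```python
x_mean, x_sd = 650, 80   # LSAT scores
y_mean, y_sd = 65, 10    # final exam scores
r = 0.4

def predict(x):
    z = (x - x_mean) / x_sd        # SD's above the mean on the x scale
    return y_mean + r * z * y_sd   # best guess: only r * z SD's above the y mean

print(predict(750))   # 70.0
```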
The “regression fallacy”
In training, air force pilots make two practice landings
with instructors and are rated on performance. The
instructors discuss the ratings with the pilots after
each landing. Statistical analysis shows that the
pilots who make poor landings the first time tend to
do better the second, and those who make good
landings the first time tend to do worse on the second
try.
The conclusion: criticism helps the pilots, while praise
tends to make them do worse. As a result, instructors
were ordered to criticize all landings, good or bad.
146
Regression and least squares
Imagine fitting a line through some data

[Scatterplot with a fitted line]

residual = (observed y) − (fitted y)
$r_i = y_i - \hat{y}_i$

147
(“Predicted” or “fitted” y’s) & (“error”
or “residual”)

148
The least squares criterion
Want residuals small: Minimize sum of squared
residuals

[Two scatterplots comparing candidate lines through the same data]

149
Flat lines...
[Scatterplot with a horizontal line y = c]

Q: Which c gives the least-squares fit?

A: c = ȳ
…another property of the mean
150
r²

Which is smaller: $\sum_i (y_i - \hat{y}_i)^2$ or $\sum_i (y_i - \bar{y})^2$?
Hint...

Answer: $\sum_i (y_i - \hat{y}_i)^2 \le \sum_i (y_i - \bar{y})^2$

r² measures the improvement:

$\frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - r^2$
151
Interpretation
$\frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - r^2$

That is,

(SD of y_i's about the regression line) / (SD of y_i's about the mean ȳ) $= \sqrt{1 - r^2}$

That is,

(SD of y_i's about the regression line) $= \sqrt{1 - r^2}\,\times$ (SD of y_i's)
152
Bivariate normal distributions

[Left: 3-D plot of a bivariate Normal density; right: a scatterplot of sample data from it]

153
Distributions within vertical strips
in a bivariate Normal distribution
Consider y values in a narrow vertical strip at x.
These have

mean $= \bar{y} + r s_Y\left(\frac{x - \bar{x}}{s_X}\right)$,   SD $= \sqrt{1 - r^2}\, s_Y$

• SD within a strip is always ≤ $s_Y$
  ($s_Y$ is the SD over all individuals)
• If r = ±1 then SD in a strip is 0
• If r = 0 then SD in a strip is the same as $s_Y$
154
Example
mean         SD
LSAT scores     650          80               r = 0.4
final exams     65           10

1. What percentage of students score over 75 on
final exam?

Easy: 75 is (75 - 65)/10 = 1 SD above mean.
Answer is 1 − Φ(1) ≈ 0.16      (16 %).

[Φ(1) is the standard Normal table value]
155
mean         SD
LSAT scores 650          80     Example (cont.)
final exams 65           10    r = 0.4

2. Among students who get 750 on LSAT, what fraction
get over 75 on final exam?
In strip at x = 750 (standard score = 1.25):
these students have mean = 70
and
SD $= \sqrt{1 - r^2}\, s_Y = \sqrt{1 - (0.4)^2} \times 10 \approx 9.165$

We want fraction of N(70, 9.165) distrib to the right of 75.
Standard score for 75 is (75 - 70)/9.165 = 0.546.
Answer: 1 − Φ(0.546) ≈ 0.29.      (Compare previous 0.16)
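The within-strip computation, checked numerically (Python sketch with the example's numbers):

```python
from statistics import NormalDist

y_sd = 10
r = 0.4
strip_mean = 70                         # the regression prediction at x = 750
strip_sd = (1 - r ** 2) ** 0.5 * y_sd   # sqrt(1 - 0.4^2) * 10, about 9.165
print(strip_sd)
print(1 - NormalDist(strip_mean, strip_sd).cdf(75))   # about 0.29
```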
156
A Pythagorean identity
$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2$

Ignoring divisions by n-1, this says:

Variance of fitted values (around mean)
+ Variance of y’s around fitted values
= Variance of y’s.
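The identity can be checked numerically on any data set (Python sketch; the five (x, y) pairs are made up for illustration):

```python
import statistics

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

xbar, ybar = statistics.mean(xs), statistics.mean(ys)
sx, sy = statistics.stdev(xs), statistics.stdev(ys)
r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)

# Least-squares fitted values from the regression line.
fitted = [ybar + r * (sy / sx) * (x - xbar) for x in xs]

total = sum((y - ybar) ** 2 for y in ys)
residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
explained = sum((f - ybar) ** 2 for f in fitted)

print(abs(total - (residual + explained)) < 1e-9)   # True: the identity holds
print(explained / total, r ** 2)                    # both equal r squared
```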

157
Interpretation of r-squared as the
“fraction of variance explained
by the regression”

$r^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = \frac{\text{Variance of } \hat{y}_i\text{'s}}{\text{Variance of } y_i\text{'s}}$

Easily derived from the equation of the
regression line, which we know…
Homework?
158
• Least-squares regression is not robust
(resistant)
• Two kinds of interesting points:
– Outlier : a point with a large residual
– Influential point : if removed, causes a large
change in the regression line

159
A little example
  x    y
  1    0
  0    1
 -1    0
  0   -1
 10   10   ← the “?” point

[Scatterplot of the five points; both axes run from 0 to 10]

160
little example (cont)
[Scatterplot with the fitted line passing near the point (10, 10)]

Outlier? No.               Influential? Yes.
161
162
With and without Child 18

163
Next...
• Lurking variables
• The perils of aggregation

164
Stat 10x
J. Chang
Tuesday, 9/25/01

To understand God’s thoughts we must study
statistics, for these are the measure of His purpose.
Florence Nightingale
165
E.g. weight vs. height

[Scatterplot: weight (100–200) on the vertical axis vs. height (55–75) on the horizontal axis]
166
E.g. pulse vs. weight
[Scatterplot: pulse (40–120) on the vertical axis vs. weight (100–200) on the horizontal axis]
167
Correlation
• Measures strength and direction of linear
relationship between two variables.
• Between -1 and +1.
• +1 : perfect linear relationship, positive slope
• -1 : perfect linear relationship, negative slope

168
…Definition of correlation

$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$

That is:
standardize each xi and yi ,
multiply, and
“average”
169
Rough idea of definition
x  x  0, y  y  0
i          i

x  x  0, y  y  0


i           i

Sign of
+                     x  x  y  y 
 s  s 
                  ?
i        i


     x      y 

+              x  x  0,
i
y y0
i

x  x  0, y  y  0
i           i

170
Fathers and sons data

Correlation
r  0.5

What is average height of son whose father is 72” ?
171
Descriptive statistics on father-son
data
• Fathers: Mean = 68” SD = 3”
• Sons: Mean = 69” SD = 3”
• Average height of son if father is 72” ?
A natural guess:
Father is 4/3 SD’s above mean, so guess
son will be 4/3 SD’s above mean, or 73”

172
Best guess depends on correlation
Guess that son will be,
not 4/3 SD’s above mean,
but correlation × 4/3 = 2/3 SD’s above mean.

That is, in our example, guess son’s height to be
69 + (2/3) × 3 = 71 inches.

173
Equation of the regression line
Just a formula for all the best guesses:

y  Y     x  X          
 r
 s               

sY       X               

174
Least squares regression
Imagine fitting a line through some data

[Scatterplot with a fitted line]

residual = (observed y) − (fitted y)
$r_i = y_i - \hat{y}_i$

175
The least squares criterion
Want residuals small: Minimize sum of squared
residuals

[Two scatterplots comparing candidate lines through the same data]

“fraction of variance explained by the
regression”

$r^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = \frac{\text{Variance of } \hat{y}_i\text{'s}}{\text{Variance of } y_i\text{'s}}$

Easily derived from the equation of the
regression line, which we know…
Homework?

177
• Least-squares regression is not robust
(resistant)
• Two kinds of interesting points:
– Outlier : a point with a large residual
– Influential point : if removed, causes a large
change in the regression line

178
179
With and without Child 18

180
To lie hidden, as in ambush
Lurking variables
A variable that has an important effect but was
overlooked.
Danger: Confounding
[Thinking an effect is due to one variable when it is
better explained by another (lurking) variable.]

1971 study: People who drink a lot of coffee have
higher cancer rates.
Correlation noticed. Causation?
181
Lurking variables (cont.)
1993: A larger study concluded that after
adjusting for the effects of smoking, no evidence
for increased risk from coffee.

“Spurious correlations”
The correlation is real, but causation isn’t.

182
Lurking variables (cont.)
Lurking var’s can also hide “real” correlations.

(...or even reverse correlations)   183
More on the perils of aggregation:
Categorical data
Hospital A      Hospital B
Died     300                50
Survived 3000              1000

If you needed surgery, which hospital would you
prefer?

184
Hospital A   Hospital B
Died     300             50
Survived 3000           1000

[Less seriously ill patients]               [More seriously ill patients]
          Hospital A  Hospital B                      Hospital A  Hospital B
Died            5         10                Died           295        40
Survived     1000        800                Survived      2000       200

Maybe…
Hospital A: university medical center, attracts seriously ill
patients from wide area
Hospital B: local, fewer seriously sick patients.             185
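The paradox is just arithmetic on the two tables (Python sketch; the sub-table counts are the slide's hypothetical numbers):

```python
# Overall death rates: Hospital B looks better.
print(300 / 3300, 50 / 1050)    # A: about 9.1%, B: about 4.8%

# Split by how sick the patients were, A is better in BOTH groups.
print(5 / 1005, 10 / 810)       # less sick:  A about 0.5%, B about 1.2%
print(295 / 2295, 40 / 240)     # more sick:  A about 12.9%, B about 16.7%
```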
Another (real) example:
U.C. Berkeley, 1970’s
Committee searched for discrimination -- higher
percentage of male applicants accepted into grad school
than female.
Looking at individual dept’s, no evidence of admitting
men more than women -- if anything the reverse. ???

Men were applying more to dept.’s with higher
acceptance rates, women applying more to dept’s
that were harder to get into.               186
Next Topic: “Producing Data”
Sampling and Experimental
Design
•   3 Principles of Experimental Design
•   Simple random samples
•   Bias, variance
•   Stratified sampling and blocking

Moore and McCabe Chapter 3.

187
Observation versus experiment
Both attempt to study relationship between an
“explanatory variable” and a “response variable”

• Experiment: deliberately impose “treatments” on
individuals to observe their responses.
• Observational study: observe and measure what
participants do naturally

“experimental units” or “subjects”

188
An example experiment
• Wangensteen (1958): Gastric freezing.
Experiment reported in JAMA: treatment reduced
ulcer pain. 24 patients; all said they felt better.
Technique widely used for several years. OK?
• Several years later: a different, larger study with a
control group. Results:
– 34 % in treatment group improved.
– 38 % in control group improved.

• Salk vaccine trial…
189
Principle 1: Control or
Comparison
• Comparison of different treatments.
• Want different treatment groups to be as similar as
possible -- except for the treatments applied.
• Control effects of environmental or outside variables.
– Outside influences act the same on the different
treatment groups. (E.g. placebo effect)

190
Bias
How to assign experimental units to treatments?
E.g.
– in comparing two medical treatments don’t want to assign
one treatment to sicker patients
– comparing seed varieties: don’t plant one in more fertile
ground

A study is biased if it systematically favors certain
outcomes.

How to avoid bias? Elaborate balancing?
191
Principle 2: Randomization
Assign treatments randomly.

Fair -- doesn’t give any treatment a systematic
advantage.
But randomization balances out well only in the
“long run.” So…

192
Principle 3: Replication, or
Sample size
Use sample sizes big enough so that we will be able to
distinguish a real effect from random “luck.”

193
It’s hard to be random

0011110101000000110110101
00101100000100111110000011
00100110100110011000011000
11011011111110010010110100
10110110110001011001010001
00000011001111101000100001
11011010110001100111010110
1010000000010101100
194
It’s hard to be random
not very creative

0011110101000000110110101
00101100000100111110000011
00100110100110011000011000
11011011111110010010110100
10110110110001011001010001
00000011001111101000100001
11011010110001100111010110
1010000000010101100
getting tired…   196
Simple random samples
Def: A simple random sample of size n is a set
of n individuals from a population chosen in
such a way that each set of n individuals has
an equal chance to be the sample actually
selected.

Abbreviate: “simple random sample” → “SRS”

197
How to randomize
Table of random digits

E.g. choose a SRS of size 4 out of 10 individuals,
using first row of table
19223 95034 05756 28713 …

A natural way: label individuals with 0, 1, 2,…, 9.
Take individual 1, then 9, then 2, then 3.
What if we had 25 individuals and wanted a SRS of
size 4?
19, 22, 05, 13
198
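A random-digits table does by hand what any pseudo-random generator does in software. A minimal Python sketch of drawing an SRS of size 4 from 25 labeled individuals (the seed is arbitrary, just to make the demo reproducible):

```python
import random

random.seed(1)  # arbitrary; fixes the sample for the demo

# SRS of size 4 from individuals labeled 1..25 --
# every set of 4 labels is equally likely to be the sample.
population = list(range(1, 26))
sample = random.sample(population, 4)
print(sorted(sample))
```

`random.sample` draws without replacement, which is exactly the SRS definition above.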
Blobs

What is the average area?
E.g. throwing darts leads to size-biased sampling.
199
Buses
Suppose average time between bus arrivals at a
stop is 20 minutes. You arrive at a random time.
What is your average waiting time until the next
bus?
10 minutes?

No -- in general it’s more.

Analogous to blobs…
200
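The bus waiting-time claim can be checked by simulation. The sketch below adds a modeling assumption not on the slide: gaps between buses are exponential with mean 20 minutes. Long gaps are more likely to contain your arrival time (the same size-biasing as the blobs example), so the mean wait exceeds 10 minutes.

```python
import bisect
import random
from itertools import accumulate

random.seed(0)  # arbitrary seed for reproducibility

# Assumed model: gaps between bus arrivals are exponential, mean 20.
gaps = [random.expovariate(1 / 20) for _ in range(100_000)]
arrivals = list(accumulate(gaps))      # bus arrival times
total = arrivals[-1]

# Show up at a uniformly random time and wait for the next bus.
waits = []
for _ in range(20_000):
    t = random.uniform(0, total)
    i = min(bisect.bisect_right(arrivals, t), len(arrivals) - 1)
    waits.append(arrivals[i] - t)

# Size-biased sampling of gaps: the mean wait comes out well
# above 10 minutes.
print(round(sum(waits) / len(waits), 1))
```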
Sampling distributions
                        Ind   Vote
“Population”:            1    Bush
4 individuals and        2    Bush
their votes:             3    Gore
                         4    Gore

Say we want to estimate parameter
p = Prob{vote for Bush}
using a sample of size 2.

Here p = 0.5. Pretend we don’t know this.
201
Sampling distributions (cont.)
List possible SRS’s and the corresponding estimates.

  Ind   Vote         Sample   Votes   p̂
   1    Bush          1,2      BB     1
   2    Bush          1,3      BG     0.5
   3    Gore          1,4      BG     0.5
   4    Gore          2,3      BG     0.5
                      2,4      BG     0.5
                      3,4      GG     0

202
Sampling distrib of p̂ from SRS’s of size 2

(dotplot of the 6 estimates: one at 0, four at 0.5, one at 1)

Or, in terms of probabilities,

  p̂:            0     0.5     1
  probability:  1/6   4/6    1/6

203
Bias and variability of an estimator
E.g.: recall true value was p = 0.5. Sampling distrib as above.

Unbiased: Mean of sampling distrib = 0.5 = true value

Variability: SD of sampling distrib ≈ 0.3

204
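The whole sampling-distribution calculation for SRSs of size 2 can be reproduced by enumerating the six samples. A short Python sketch (Minitab is what the course uses; this is just an illustration):

```python
from itertools import combinations

# Population from the slides: individuals 1,2 vote Bush; 3,4 Gore.
votes = {1: "Bush", 2: "Bush", 3: "Gore", 4: "Gore"}

# Sampling distribution of p-hat: list every possible SRS of
# size 2 and the estimate it yields.
dist = {}
for sample in combinations(votes, 2):
    p_hat = sum(votes[i] == "Bush" for i in sample) / 2
    dist[p_hat] = dist.get(p_hat, 0) + 1

n = sum(dist.values())                        # 6 equally likely SRSs
probs = {v: c / n for v, c in sorted(dist.items())}
mean = sum(v * p for v, p in probs.items())   # unbiased: equals 0.5
sd = sum((v - mean) ** 2 * p for v, p in probs.items()) ** 0.5
print(probs, round(mean, 2), round(sd, 2))
```

The output matches the slides: probabilities 1/6, 4/6, 1/6 at 0, 0.5, 1; mean 0.5; SD ≈ 0.29.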
How about with SRS’s of size 1?

  Ind   Vote         Sample   p̂
   1    Bush           1      1
   2    Bush           2      1
   3    Gore           3      0
   4    Gore           4      0

Sampling distrib: p̂ = 0 or 1, each with probability 1/2.

205
Bias? Variability?

n = 2: values 0, 0.5, 1 with probs 1/6, 4/6, 1/6
n = 1: values 0, 1 with probs 1/2, 1/2

Neither is biased. Case n = 2 has less variability.
206
Bias and Variability
• Bias of an estimator = (mean of sampling distrib)
  − (true value of parameter)

Statistic is unbiased if bias = 0.

• Variability of an estimator = (SD of sampling distrib)

Depends on sample size.

207
An example of a simulation
• Bias of estimators of variance -- use Minitab.

208
Stratified sampling
E.g.: estimate avg. salary of engineers at a company.
Suppose 2 types of engineers: “junior” and “senior.”
Suppose company has 200 of each type.
Want to est avg salary with a sample of size 10.

Stratification idea: combine
a SRS of size 5 from junior engineers, and
a SRS of size 5 from senior engineers.

Is this a SRS of size 10?
209
Why stratify vs. take a SRS?
• What’s the advantage of stratifying?
– Bias?
– Variability?

210
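Those two questions can be answered with a small simulation. The salary numbers below are invented for illustration (the slide gives only the group sizes); the point is that both schemes are essentially unbiased, but stratifying cuts the variability.

```python
import random
import statistics

random.seed(2)  # arbitrary seed

# Hypothetical population (an assumption, not from the slide):
# 200 junior and 200 senior engineers at different salary levels.
juniors = [random.gauss(50_000, 5_000) for _ in range(200)]
seniors = [random.gauss(90_000, 5_000) for _ in range(200)]
population = juniors + seniors

def srs_estimate():
    return statistics.mean(random.sample(population, 10))

def stratified_estimate():
    # SRS of 5 from each stratum; strata are equal-sized, so the
    # plain average of the 10 salaries estimates the overall mean.
    return statistics.mean(random.sample(juniors, 5) +
                           random.sample(seniors, 5))

srs = [srs_estimate() for _ in range(2000)]
strat = [stratified_estimate() for _ in range(2000)]
# Both centers sit near the true mean; the stratified SD is
# much smaller because the junior/senior gap never enters.
print(round(statistics.stdev(srs)), round(statistics.stdev(strat)))
```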
Blocking in experimental design
3 types of seeds (treatments): A, B, C.
And some land to try them on:

Divide plot into 30 squares
Use each of A, B, C on 10 squares.
211
Blocking (cont.)
C  B  C  C  A  A  B  B  B  C
A  C  B  A  B  C  C  A  C  A
B  A  A  B  C  B  A  C  A  B

Believe field homogeneous → assign completely at random, as above.
If not (e.g. fertility varies across the field):
Partition experimental units into blocks.
Assign treatments randomly within each block.

212
Stat 10x
J. Chang
Tuesday, 9/27/01

A statistician is somebody who is good with figures
but lacks the personality to be an accountant.

213
Blobs

n ≈ 50

What is the average area?
E.g. throwing darts leads to size-biased sampling.
217
Sampling distribution of an estimator
Understand first using a “toy example”

                        Indiv   Vote
“Population”:             1     Bush
4 individuals and         2     Bush
their votes:              3     Gore
                          4     Gore

Say we want to estimate parameter
p = fraction of pop who vote for Bush,
using a sample of size 2.

Here p = 0.5. Pretend we don’t know this.              218
Bias and Variability
• Bias of an estimator = (mean of sampling distrib)
  − (true value of parameter)

An estimator is unbiased if its bias = 0.

• Variability of an estimator = (SD of sampling distrib)

Depends on sample size: shrinks as sample size grows.

224
Stratified sampling
E.g.: estimate avg. salary of engineers at a company.
Suppose 2 types of engineers: “junior” and “senior.”
Suppose company has 200 of each type.
Want to est avg salary with a sample of size 10.

Stratification idea: combine
a SRS of size 5 from junior engineers, and
a SRS of size 5 from senior engineers.

Is this a SRS of size 10?         No
226
Why stratify vs. take a SRS?
• What’s the advantage of stratifying?
– Bias?        no
– Variability? yes

(for a “toy example” see next homework…)

227
Blocking (cont.)
C  B  C  C  A  A  B  B  B  C
A  C  B  A  B  C  C  A  C  A
B  A  A  B  C  B  A  C  A  B

Randomized block design:
Block = a group of exp’l units thought to be similar in
some important way.
Partition experimental units into blocks.
Assign treatments randomly within each block.

229
Probability and Statistics
• Probability theory as a major tool in Statistical
inference
– All inferences are expressed in terms of probabilities:
E.g. “95% confidence interval” -- 0.95 is the probability of
something
• E.g. poll
– Imagine precisely 50% of a large pop favor Gore.
– We take a random sample of size 1000
– Expect to see about 500 in sample who favor Gore.
• E.g., how likely are we to see more than 600?
230
Probability Models
Given a random phenomenon we are modeling.

S = Sample space = set of all possible outcomes.

E.g.:
Toss a coin: S = {H,T}.

Toss a coin 3 times:
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

231
Probability Models (cont)
An event is a set of some possible outcomes
i.e., an event is a subset of S.

E.g. A = (get one head in 3 tosses)
= {HTT, THT, TTH}

A probability measure is a function (satisfying certain
conditions) that assigns a probability (a number
between 0 and 1) to each event.

If A is an event, P(A) denotes the probability of A.
232
Interpretations 1: Equally-likely case
Sometimes, e.g. by symmetry, we believe all possible
outcomes are equally likely. In this case

P(A) = (# outcomes in A) / (# outcomes in S)

E.g. tossing a coin once, with S = {H, T}.
If A = {H}, then

P(A) = 1/2

233
Interpretations 1: Equally-likely case
E.g. roll two dice. What is probability of getting
a total of at least 11?

Can think of S like this…
36 outcomes, equally likely, prob 1/36 each.

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

P{total at least 11} = 3/36 ≈ 0.083.
234
Interpretations 2: Long-run
frequency
Imagine repeating the experiment over and over,
“independently,” under the same conditions.

Sometimes A occurs, sometimes it doesn’t.

As repeat more and more,
(Fraction of trials in which A occurs) → P(A)

235
Interpretations 3: Subjective
probability
A “subjective probability” indicates a person’s beliefs
about the likelihood of an event.

E.g. P{Humans extinct within next 1000 years}?

Betting…

236
A useful picture/example

S

A

You’re driving and it’s about to start raining. Think of S as your
windshield. Event A corresponds to statement {the first drop to
hit the windshield hit the set A}.

237
A useful picture/example
A simple probability measure to model this:

P(A) = (area of A) / (area of S)

For convenience assume: (area of S) = 1.
So P(A) = area of A.

Note 0 ≤ P(A) ≤ 1 and P(S) = 1.

238
New events from old

A          B

239
New events from old

A                B

What should we call this?
A and B ?
A or B ?
240
So what’s (A and B) ?

241
So what’s (A and B) ?

(raindrop falls in A) and (raindrop falls in B)

242
Complement of A?

243
Complement of A?

244
Axioms of probability
(i.e., properties of probability measures)
• For each event A,
  P(A) ≥ 0 and P(A) ≤ 1.

• P(S) = 1, where S is the whole sample
space.

• If A and B are disjoint, then
  P(A or B) = P(A) + P(B).
245
Example: Complement rule
P(Ac) = 1 − P(A)

Why?
(A or Ac) = S

So P(A or Ac) = P(S) = 1.
But A and Ac are disjoint.
So P(A or Ac) = P(A) + P(Ac).

So P(A) + P(Ac) = 1.
246
Definition of P(B | A)

B
A

Idea of P(B|A): Given that A occurs, what is the
probability that B also occurs?
Question: By eyeball, what is P(B|A) ?
Definition of P(B | A)

Given that the raindrop fell in A, we restrict our attention
to the set A. The drop is equally likely to fall anywhere
within A.

248
Definition of P(B | A)

Given A, the event B also occurs when the drop falls in
the dark blue region, i.e., the event (A and B).

249
Definition of P(B | A)

P(B | A) = P(A and B) / P(A)

Often used in form: P(A and B) = P(A) P(B | A)
250
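The windshield picture translates directly into a Monte Carlo check of this definition. The two rectangular events below are made up for the demo (chosen so the exact answer is easy to verify by hand): A is the left half of the unit-square windshield and B is its bottom-left quarter.

```python
import random

random.seed(3)  # arbitrary seed

# Windshield = unit square.  Hypothetical events for the demo:
# A = left half, B = bottom-left quarter (so B lies inside A).
def in_A(x, y): return x < 0.5
def in_B(x, y): return x < 0.5 and y < 0.5

n = 200_000
drops = [(random.random(), random.random()) for _ in range(n)]
n_A = sum(in_A(x, y) for x, y in drops)
n_AB = sum(in_A(x, y) and in_B(x, y) for x, y in drops)

# P(B | A) = P(A and B) / P(A), estimated by relative frequencies.
# Exact answer here: (1/4) / (1/2) = 0.5.
print(round((n_AB / n) / (n_A / n), 2))
```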
Independence
E.g. two tosses of a coin

“B is independent of A” means “being told that A occurred
does not affect the likelihood of B occurring.”

I.e. P(B | A) = P(B)

I.e. P(A and B) / P(A) = P(B)

I.e. P(A and B) = P(A) P(B)

“A and B are independent”
251
Stat 10x
J. Chang
Tuesday, 10/02/01

252
New events from old

(Venn diagram: overlapping regions A and B)

A or B        A and B

Ac = complement of A

260
Example: tree diagrams, Bayes’ rule
A blood test screening for the AIDS virus is given to people
randomly chosen from a population. If a given person has a
positive test result, what is the conditional probability that the
person indeed has the virus?

Suppose that
• 1% of the population has the virus        P(A) = 0.01
• Test has a false positive rate of 1.5%    P(B|Ac) = 0.015
• Test has a false negative rate of 0.3%    P(Bc|A) = 0.003

Let A = {AIDS virus in blood}
B = {Blood test positive}
Want: P(A | B)
267
Draw a tree (sideways)
Given P(A) = 0.01, P(B|Ac) = 0.015, P(Bc|A) = 0.003.
Want P(A | B).

            .997   B
   .01   A
            .003   Bc

            .015   B
   .99   Ac
            .985   Bc
                                268
Using the tree
Want P(A | B) = P(A and B) / P(B)

Numerator:
P(A and B) = P(A) P(B | A)
           = (.01)(.997) = .00997

Note probabilities multiply along a path in the tree.

269
Using the tree (cont.)

P(B) = P(A & B) + P(Ac & B)
     = (.01)(.997) + (.99)(.015) = .02482

P(A | B) = P(A & B) / P(B) = .00997 / .02482 ≈ .4017
270
Bayes’ rule (what we just did…)
Given P(A), P(B | A), and P(B | Ac).
Want to find a “turned-around” probability like P(A | B).

P(B) = P(A & B) + P(Ac & B)
     = P(A) P(B | A) + P(Ac) P(B | Ac)

P(A | B) = P(A & B) / P(B)
         = P(A) P(B | A) / [ P(A) P(B | A) + P(Ac) P(B | Ac) ]

Know this, but don’t memorize!
Understand, derive, draw a tree...
271
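The tree calculation from the slides in a few lines of Python:

```python
# The AIDS-test calculation done as a tree: multiply along each
# path, then add the paths that end in B.
p_A = 0.01               # P(virus in blood)
p_B_given_A = 0.997      # 1 - false negative rate of 0.003
p_B_given_Ac = 0.015     # false positive rate

p_A_and_B = p_A * p_B_given_A            # path A -> B
p_Ac_and_B = (1 - p_A) * p_B_given_Ac    # path Ac -> B
p_B = p_A_and_B + p_Ac_and_B

p_A_given_B = p_A_and_B / p_B
print(round(p_A_given_B, 4))             # about 0.4017
```

Even with a quite accurate test, a positive result means only about a 40% chance of infection, because the disease is rare.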
Random variables
Abstract definition:
A random variable is a function defined on S.
Recall S = sample space = {all possible outcomes}

The function assigns a value to each possible outcome.

Typically a number… might be a “category”

272
Simple example of a random variable
Toss a coin 3 times. Let X = number of heads.

Outcome s   TTT   TTH   THT   THH   HTT   HTH   HHT   HHH
X(s)      0     1     1     2     1     2     2     3

Each outcome s in S has probability 1/8.

Note, e.g., {X = 1} is the event {TTH, THT, HTT}. So it
makes sense to talk about the probability P{X = 1}, etc.

So P{X = 0} = 1/8,
   P{X = 1} = 3/8,
   P{X = 2} = 3/8,
   P{X = 3} = 1/8.
273
Distribution of a random variable
Discrete random variable: takes on finitely many
possible values.

A distribution of a discrete random variable is a list of
its possible values and the probabilities that it takes on
those values.

E.g. we did the example X = number of heads in 3 tosses...

Continuous random variables: distribution described
by a probability density function.
274
Independent random variables
“X and Y are independent” means that the events
{a < X < b} and {c < Y < d} are independent for all
numbers a, b, c, and d.
i.e. P({a < X < b} & {c < Y < d})
= P{a < X < b} P{c < Y < d}

Idea: knowing information about the value of X tells us
nothing about the value of Y.

275
Binomial distributions
Generic setup: performing n independent trials of an
experiment.

Each trial could be a “success” or a “failure.”

Let p = probability of success on each trial.

Let X = number of successes among the n trials.

Definition: The random variable X is said to have a
Binomial distribution with parameters n and p.

Notation: X ~ B(n, p)
276
Examples
1. We already figured out the B(3, ½) distribution:

X = 0 with probability 1/8
    1 with probability 3/8
    2 with probability 3/8
    3 with probability 1/8

2. Let Y be the number of 6’s in 10 rolls of a die.
Distrib of Y ?

Y ~ B(10, 1/6).

277
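These probabilities can be computed directly from the Binomial formula P{X = k} = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ (the formula is standard, from Moore and McCabe; the helper function name here is just for the demo):

```python
from math import comb

def binom_pmf(n, p, k):
    """P{X = k} when X ~ B(n, p): C(n,k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Reproduces the B(3, 1/2) distribution worked out above.
print([binom_pmf(3, 0.5, k) for k in range(4)])
# -> [0.125, 0.375, 0.375, 0.125]

# Y = number of 6's in 10 rolls of a die: Y ~ B(10, 1/6).
print(round(binom_pmf(10, 1/6, 2), 3))
```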
Examples (cont)
3. Suppose in a pop of size 100 million, 60 million
favor Gore. We take a random sample of size 2500.
Let X = number in sample who favor Gore.
Distribution of X ?

X ~ B(2500, 0.6).     Exactly?
No… (we sample without replacement, so the trials are not
quite independent -- but with a small sample from a huge
population, the Binomial is an excellent approximation)

278
Mean of a random variable
E.g. X = payoff in spinner game

(spinner: $0 on half the wheel, $2 on 0.3 of it, $9 on 0.2)

Distribution of X :
X = 0 with prob 0.5
    2 with prob 0.3
    9 with prob 0.2

Notation: μX = μ(X)

Define mean of X:
μX = (0)(.5) + (2)(.3) + (9)(.2)
   = 0 + .6 + 1.8 = 2.4 dollars
279
Why is the mean defined this way?
Answer: it makes the “law of large numbers” true.

Law of large numbers: As we do many independent
repetitions of the experiment, drawing more and more
numbers from the same distribution, the mean of our
sample will approach the mean of the distribution more
and more closely.

280
Recall “long run frequency” interpretation of
probability
Suppose P(event) = p. Look at

Fn = (# times event occurs in n independent trials) / n

i.e. the fraction of times the event occurs in the first n
trials.

As n increases, this fraction will approach p :
Fn → p
281
Law of large numbers for spinning game
Imagine playing the game n times, with n large.
Total “sample” winnings:

X1 + X2 + … + Xn = 0(# of 0’s) + 2(# of 2’s) + 9(# of 9’s)

Mean of sample winnings:

X̄n = 0(# of 0’s / n) + 2(# of 2’s / n) + 9(# of 9’s / n)

The three fractions approach 0.5, 0.3, and 0.2.

I.e.  X̄n → 0(0.5) + 2(0.3) + 9(0.2) = μX                  282
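The law of large numbers for the spinner can be watched happening in a simulation (seed arbitrary; the course uses Minitab, this Python sketch is just an illustration):

```python
import random

random.seed(4)  # arbitrary seed

# LLN for the spinner game: the sample mean of many plays
# approaches the distribution mean 2.4.
values, probs = [0, 2, 9], [0.5, 0.3, 0.2]

for n in (100, 10_000, 1_000_000):
    plays = random.choices(values, weights=probs, k=n)
    print(n, round(sum(plays) / n, 3))
```

The printed sample means settle down near 2.4 as n grows.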
SD and Variance of a random variable
Notation:   SD of X:        σX
            Variance of X:  σX²

Definition:
σX² = μ((X − μX)²)

283
Calculating Variance
Recall μX = 2.4.

prob    X    X − μX    (X − μX)²
0.5     0    −2.4        5.76
0.3     2    −0.4        0.16
0.2     9     6.6       43.56

σX² = μ((X − μX)²)
    = (5.76)(0.5) + (0.16)(0.3) + (43.56)(0.2)
    = 11.64

Dollars? No: σX = √11.64 ≈ 3.41 dollars
284
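The same table arithmetic for the spinner payoff, in a short Python sketch:

```python
# Mean and variance of the spinner payoff, straight from the
# general formulas: mu = sum x*p, var = sum (x - mu)^2 * p.
dist = {0: 0.5, 2: 0.3, 9: 0.2}   # value -> probability

mu = sum(x * p for x, p in dist.items())
var = sum((x - mu) ** 2 * p for x, p in dist.items())
sd = var ** 0.5                   # back to dollar units

print(mu, round(var, 2), round(sd, 2))   # 2.4 11.64 3.41
```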
General formulas
If X = x1 with prob p1
       x2            p2
       …             …
       xk            pk

then X has mean  μX = Σ (i=1..k) xi pi

and variance  σX² = Σ (i=1..k) (xi − μX)² pi

285
Rules for mean and variance
μ(X + c) = μ(X) + c

μ(cX) = c μ(X)

σ(X + c) = σ(X),    σ²(X + c) = σ²(X)

σ(cX) = c σ(X),     σ²(cX) = c² σ²(X)

For c ≥ 0. Otherwise use |c|.
E.g. σ(−X) = σ(X)
286
Stat 10x
J. Chang
Tuesday, 10/09/01

287
Today
• Random variables, mean and variance, rules,
Law of large numbers
• Central Limit theorem
• Sampling distributions, sampling distrib of
sample mean
• Concept of a confidence interval
• Simple examples using Normal distribution
• Binomial distributions and normal
approximations                              288
Sums
Suppose we play the spinner game
(X = 0 with prob 0.5, 2 with prob 0.3, 9 with prob 0.2)
2 times. Total winnings: S = X1 + X2

Distrib of S ?
Possible values              Probabilities
 0 = 0 + 0                   (.5)(.5) = .25
 2 = 0 + 2 = 2 + 0           (.5)(.3) + (.3)(.5) = .3
 9 = 0 + 9 = 9 + 0           (.5)(.2) + (.2)(.5) = .2
 4 = 2 + 2                   (.3)(.3) = .09
11 = 2 + 9 = 9 + 2           (.3)(.2) + (.2)(.3) = .12
18 = 9 + 9                   (.2)(.2) = .04
297
Probability mass functions of S1 and S2

S1 = 0 with prob .5
     2           .3
     9           .2

S2 = 0 with prob .25
     2           .3
     4           .09
     9           .2
    11           .12
    18           .04
298
Mean and variance of sum of r.v.’s
Let X1 and X2 be random variables.
Define a new variable S = X1 + X2.
What are the mean and variance of S ?

μ(S) = μ(X1) + μ(X2)

If X1 and X2 are independent, then
σ²(S) = σ²(X1) + σ²(X2)

299
Mean and SD of sum of n (indep) r.v.’s
Let X1, X2, …, Xn be a random sample of size n from a
distrib having mean μ and SD σ.

Then the sampling distrib of the sum Sn = X1 + … + Xn
has mean nμ and SD √n σ

(because the variance is nσ²).

300
Distrib of total winnings from playing the
spinner game n times
(histograms for n = 1, 2, 4, 8)
301
Distrib of total winnings from playing the
spinner game n times (cont.)
(histograms for n = 16, 32, 64)
302
It’s all very simply described, nearly
n = 64,  μ = 2.4,  σ ≈ 3.41
S64 has mean (64)(2.4) = 153.6

and SD √n σ = √64 (3.41) = 27.28

And a Normal shape, centered near 154,
with 127 and 181 one SD away.
303
Central Limit Theorem
Repeat an experiment n times independently, getting a
sum S = X1 + X2 + … + Xn

We know μ(S) = nμ(X) and σ(S) = √n σ(X).

CLT: If n is large, the distrib of S is nearly
Normal.

That is, Sn is approximately N(nμX, √n σX)

304
Example
Suppose we play our spinner game
(X = 0 with prob 0.5, 2 with prob 0.3, 9 with prob 0.2)
25 times. Total winnings: S = X1 + X2 + … + X25.

Note mean of S is μ(S) = (25)(2.4) = 60

Q: How likely to get S ≥ 80 ? What is P{S ≥ 80} ?

SD of S: σ(S) = √25 σ(X) = 5 × 3.41 = 17.05
305
Example (cont)
Q: Spin 25 times. P{S ≥ 80} ?

CLT says S is approximately N(60, 17.05). So…

P{S ≥ 80} ≈ P{N(60, 17.05) ≥ 80}
          = P{N(0,1) ≥ (80 − 60)/17.05}
          = P{N(0,1) ≥ 1.17}
          = 1 − Φ(1.17) = 1 − 0.879 = 0.121

306
“Continuity correction”

P{S ≥ 80} = P{S ≥ 79.5} ≈ P{N(60, 17.05) ≥ 79.5}
          = P{N(0,1) ≥ (79.5 − 60)/17.05}
          = P{N(0,1) ≥ 1.14}
          = 1 − Φ(1.14) = 1 − 0.874 = 0.126

307
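Both the normal approximation (with continuity correction) and the probability itself can be checked numerically. A sketch using Python's statistics.NormalDist plus a brute-force simulation of 25 spins (seed and trial count are arbitrary choices for the demo):

```python
import random
from statistics import NormalDist

random.seed(5)  # arbitrary seed

mu, sd = 25 * 2.4, (25 ** 0.5) * 3.41        # mean 60, SD 17.05
approx = NormalDist(mu, sd)

# Normal approximation to P{S >= 80}, with continuity correction.
print(round(1 - approx.cdf(79.5), 3))        # 0.126, as on the slide

# Compare with a direct simulation of the 25-spin total.
values, probs = [0, 2, 9], [0.5, 0.3, 0.2]
trials = 50_000
hits = sum(sum(random.choices(values, probs, k=25)) >= 80
           for _ in range(trials))
print(round(hits / trials, 3))
```

The simulated frequency lands close to the corrected normal answer.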
Starting Chapter 6:
Introduction to Inference
Main concepts:
• Confidence intervals (basic idea today)
• Hypothesis tests (next time)

308
Who wants to be a millionaire guess a number?
I have a number written down on a slip of paper.
Call it μ.
Might be 1 million…
Or 3.14…
Or anything!

I’ll tell you a number, X, within ±5 of μ.
Suppose X = 64.1.

Q: What can you tell me about μ ?
μ is in the interval 64.1 ± 5, i.e. [59.1, 69.1].   “Yes!”
100% confident.
309
Who wants to change the rules?
First I spin the needle.
• If needle stops in the yellow
part, then I do as before -- report
an X within ±5 of μ.
• If needle stops in the red part,
I lie, reporting an X not within
±5 of μ.
Yellow part: 99%
Red part: 1%

Suppose I still report X = 64.1.
You still guess μ is in interval [59.1, 69.1].

But now you are “99% confident.”
310
Who wants to drag the Normal distrib into the
discussion?
I have another number μ, which I know and you don’t.

Suppose I draw a random X ~ N(μ, 0.06),
and report X = 0.38.

Can you give a 95% confidence interval for μ ?

Reasoning:
• with probability 0.95, X is within 2 SD’s of μ
• i.e. with prob 0.95, X is within 0.12 of μ
• 95% CI is 0.38 ± 0.12, i.e., [0.26, 0.50]
311
How about a 99% CI in the same problem?
Again, suppose you want to estimate μ, and I draw a
random X ~ N(μ, 0.06) and report X = 0.38.
99% confidence interval for μ ?

To use same reasoning as before:
• with probability 0.99, X is within ??? SD’s of μ
Use Table or Minitab. Get ??? = 2.576

• So with prob 0.99, X is within (2.576)(0.06) ≈ 0.15 of μ
• So 99% CI is 0.38 ± 0.15, i.e., [0.23, 0.53]
Wider than our 95% CI, [0.26, 0.50]… makes sense.
312
General confidence intervals with Normal distribs
Suppose Y ~ N(μ, σ), where σ is known and we want
to estimate μ.

A “level C” CI for μ is [Y − zC σ, Y + zC σ], where,
for example,
C               zC
.95             1.960 (nearly 2)
.99             2.576
.90             1.645

313
Mean and variance of a sample mean
Let X1, X2, …, Xn be independent random variables
all having the same distribution.
Suppose this “parent distribution”
has mean μ and SD σ.

Let X̄n denote the sample mean,
i.e. X̄n = (1/n)(X1 + X2 + … + Xn).

Then the sampling distrib of X̄n has mean μ
and SD σ/√n.
• follows from the rules
• related to law of large numbers
314
Example using a sample mean rather than just
one obs’n
Want to estimate
μ = mean pulse rate using a certain medicine.
We sample n = 30 people and find sample mean X̄ = 103.9.
Find a 95% CI for μ.

Assume SD σ is known to be 5.1.
(Probably unrealistic to assume we
know this. Later see how to fix…)

Key: At least approximately, X̄ ~ N(μ, 5.1/√30) = N(μ, 0.93)

95% CI is 103.9 ± (1.96)(0.93), i.e., [102.1, 105.7]
315
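The same interval computed in Python (NormalDist supplies the 1.96; the sample numbers are the ones from the slide):

```python
from statistics import NormalDist

# 95% CI for mean pulse rate: sample mean 103.9, n = 30, and the
# (assumed known) SD 5.1.
x_bar, sigma, n = 103.9, 5.1, 30
se = sigma / n ** 0.5                    # SD of X-bar, about 0.93

z = NormalDist().inv_cdf(0.975)          # 1.96 for a 95% interval
lo, hi = x_bar - z * se, x_bar + z * se
print(round(lo, 1), round(hi, 1))        # 102.1 105.7
```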
Mean and SD of Binomial(n, p)
Play game I (win $1 with prob p, $0 with prob 1 − p)
n times.
Total winnings X ~ B(n, p).

Mean and SD for 1 play are
μ(I) = p,  σ(I) = √(p(1 − p))        (exercise)

Think: X = I1 + I2 + … + In, where
Ik = 1 if k-th trial is a success
     0 if k-th trial is a failure

So μ(X) = np,  σ(X) = √(np(1 − p))                    317
Normal approximation to Binomial
Let X ~ B(n, p).

Think: X = I1 + I2 + … + In, where
Ik = 1 if k-th trial is a success
     0 if k-th trial is a failure

By the Central Limit Theorem, for large n the B(n, p)
distrib is approximately Normally distributed with
mean np and SD √(np(1 − p)).

318
Example: Margin of error in a poll
Suppose in a large pop, a fraction p = 0.6 of
voters favor Gore. We don’t know this, but take a
random sample of size n = 2500. Let X be the
number in the sample who favor Gore, and let
p̂ = X / n be our estimate of p. What is the
“margin of error” of the poll, i.e., the width of a
95% CI for p?

319
Margin of error in a poll
X ~ B(2500, 0.6)
  ≈ N((2500)(0.6), √((2500)(0.6)(0.4)))
  = N((2500)(0.6), 50√((0.6)(0.4)))

p̂ = X/2500 ≈ N(0.6, √((0.6)(0.4))/50) = N(0.60, 0.01)

So, e.g., the prob that our estimate p̂ is off by at
most 2 SDs, i.e., 0.02, is about 95%:
"Margin of error is ±2%."
320
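The SD of p̂ and the resulting margin of error can be sketched in a couple of lines (Python, not part of the slides):

```python
# SD of phat is sqrt(p(1-p)/n); the slide's "±2%" is roughly 2 SDs.
from math import sqrt

p, n = 0.6, 2500
sd_phat = sqrt(p * (1 - p) / n)
print(round(sd_phat, 3))      # 0.01, matching N(0.60, 0.01) on the slide
print(round(2 * sd_phat, 2))  # 0.02, i.e., the ±2% margin of error
```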
Stat 10x
J. Chang, 10/16/01
321
Today
• Hypothesis tests
– Basic logic
– 1-sided and 2-sided examples with Normal
distributions
• “ t procedures ” (t tests and CI’s). These
remove the assumption that s is known.

322
Some basic facts we've been using
Suppose X1, X2, ..., Xn is a sample from a distribution
having mean μ and SD σ.

The sampling distrib of X̄n has mean μ and SD σ/√n.

CLT: For large n the distrib of X̄n is approximately
Normal.

If the individual r.v.'s Xi come from a Normal distribution,
then the distrib of X̄n is exactly Normal. That is:

If Xi ~ N(μ, σ), then X̄n ~ N(μ, σ/√n).                    323
Idea of hypothesis testing: first an
analogous idea from logic
A probabilistic extension of proof by contradiction.

Example: Prove that √2 is irrational.
I.e., prove that √2 cannot be expressed as a quotient m/n.

Technique: Assume the opposite and derive a contradiction.

Suppose we do have m and n satisfying √2 = m/n.
We can choose m and n to be not both even.
(Cancel out common factors of 2 from numerator and
denominator until one or both become odd.)
324
Logical analog: Proof by contradiction
We are assuming that m and n are numbers satisfying
√2 = m/n, and m and n are not both even.

We have 2 = m²/n², i.e., 2n² = m².

So m² is even. So m must be even.

Let m = 2k. So 2n² = 4k², i.e., n² = 2k².

So n² is even. So n must be even.

But m and n were chosen not both even. Contradiction!
325
To prove a statement we
• Assume the opposite
• Do some reasoning, always assuming the
opposite of what we are trying to prove.
• If we reason to a contradiction, we conclude
that our assumption could not possibly be true.

327
Math, logic                        Statistics, real life
Want to prove a statement.         Want to use some data to
                                   provide evidence for a
                                   hypothesis.

• Assume the statement             • Assume the hypothesis
  is not true.                       is not true.
• Reason to a contradiction,       • Show the observed data is
  i.e., observe something            very unlikely.
  impossible.
                                                        328
A cheesy example        cheesy: 1. Containing or resembling cheese.
                                2. Slang. Shoddy; cheap.

Cheese manufacturer suspects milk supplier is diluting
milk with water. Wants to "prove" this.
Note: adding water increases freezing temp of milk.

Assume freezing temp measurements for pure milk are
known to have a N(−.545, .008) distrib. (in ºC)

Data: Take 5 lots of milk.
Say mean of 5 temps is X̄ = −.538
Model: X1, …, X5 ~ N(μ, .008)
"Null hypothesis"    H0: μ = −.545
"Alternative hypoth" Ha: μ > −.545                            329
Cheesy example (cont)
Model: X1, …, X5 ~ N(μ, .008)        Data: X̄5 = −.538

"Null hypothesis"    H0: μ = −.545
"Alternative hypoth" Ha: μ > −.545

We want to assess evidence for Ha.
Assume the opposite; that is, assume H0.
Then X̄5 ~ N(−.545, .008/√5) = N(−.545, .0036)

How "extreme" is our observed value of −.538?
How likely were we to get a value at least this high?
330
Cheesy example (cont)
Take an observation X̄ from the N(−.545, .0036) distrib.
What is P{X̄ ≥ −.538}?
You know how to do this. Standardize the −.538, getting
(−.538 − (−.545))/.0036 = 1.95, so that
P{X̄ ≥ −.538} = 1 − Φ(1.95) = .025   ← "P-value"
Conclusion: Since a mean temp as high as what we
observed would be quite unlikely (P ≈ .025) if the milk
were pure, we have substantial evidence that water has
been added.
An interpretation: (1 − P-value) gives the percentile
of our observation, assuming H0.                    331
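The one-sided P-value on this slide can be reproduced with the standard Normal CDF. A Python sketch (not part of the slides, which use tables/Minitab):

```python
# One-sided P-value for the milk example.
# H0: mu = -0.545; observed mean of 5 lots = -0.538; SD of the
# mean = 0.008/sqrt(5) ≈ 0.0036.
from math import sqrt
from statistics import NormalDist

z = (-0.538 - (-0.545)) / (0.008 / sqrt(5))
p_value = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p_value, 3))  # 1.96 0.025 (slide rounds z to 1.95)
```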
Rejection, acceptance, and OJ
Simpson
A common convention is to "reject H0" if P ≤ .05.

What if we get a P larger than our chosen threshold?
E.g. P = .07?

What is the opposite of “Reject H0” ?        “Accept H0” ?

Better to use terminology like “Fail to reject H0” .

Legal analogy: Innocent (H0 ) until proven guilty.
OJ Simpson was not convicted.
Does this suggest he was innocent?
332
"A Critical Appraisal of 98.6°F" (a two-sided test)
JAMA article. Sampled temps of 93 healthy people.
Sample mean was 98.12°F.
How strong is evidence against a population mean of 98.6?
Suppose we know that temps in pop have SD σ = 0.63°F.
Model: X1, X2, …, X93 ~ N(μ, .63).
Hypotheses H0: μ = 98.6,  Ha: μ ≠ 98.6.
What possible values for X̄ are "more extreme" than 98.12?
Let's say "extreme" means "far from 98.6".
(Hi or low – this corresponds to doing a "two-sided test.")
(picture: number line marking 98.12, 98.6, 99.08, with "more
extreme than 98.12" shaded in both tails)               333
"A Critical Appraisal of 98.6°F"
Assuming H0 is true,
X̄ ~ N(98.6, .63/√93) = N(98.6, .0653)        σ(X̄) = .0653

Standardize observed value:
(98.12 − 98.6)/.0653 = −.48/.0653 = −7.35

The P-value of the test is the prob of getting a
sample mean more extreme than 98.12, which is
Φ(−7.35) + (1 − Φ(7.35)) = 2 × 10⁻¹³.
334
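A Python sketch of this two-sided z test (mine, not from the slides):

```python
# Two-sided test from the slide: 93 temps, xbar = 98.12,
# H0 mean 98.6, sigma assumed known to be 0.63.
from math import sqrt
from statistics import NormalDist

z = (98.12 - 98.6) / (0.63 / sqrt(93))
p_value = 2 * NormalDist().cdf(-abs(z))
print(round(z, 2))  # -7.35
print(p_value)      # ~2e-13, as on the slide
```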
CI's: Who wants to guess a number?                    Last time

I have a number written down on a slip of paper.
Call it μ.
Might be 1 million.
Or 3.14
Anything...

I'll tell you a number, X, within ±5 of μ.
Say X = 64.1.
                                     μ is in the interval 64.1 ± 5,
Q: What can you tell me about μ?     i.e., [59.1, 69.1]

Q: How confident are you?            100% confident.          335
Last time

Who wants to change the rules?
Use a spinner.
• If needle points to the yellow
wedge, then I do as before --
report an X within ±5 of  .
• If needle points to the red
wedge, I lie, reporting an X not
within ±5 of  .                              Yellow part: 99%
Red part: 1%
Suppose I still report X = 64.1.
You still guess  is in interval [59.1, 69.1].

But now you are “99% confident.”
336
Who wants to drag in the Normal distribution?         Last time

I have another number μ, which I know and you don't.
Suppose I draw a random X ~ N(μ, 0.06),
and report X = 0.38.
Can you give a 95% confidence interval for μ?
Reasoning:
• with probability 0.95, X is within 2 SD's of μ
• i.e. with prob 0.95, X is within 0.12 of μ
• i.e. with prob 0.95, μ is within 0.12 of X
• 95% CI is 0.38 ± 0.12, i.e., [0.26, 0.50]
337
Example using a sample mean rather than just one obs'n      Last time

Want to estimate μ = mean pulse rate for people on
a certain medicine.
Assume SD σ is known to be 5.1.
We sample n = 30 people and find sample mean X̄ = 103.9.
Find a 95% CI for μ.

Key: At least approximately, X̄ ~ N(μ, 5.1/√30) = N(μ, 0.93)

95% CI is 103.9 ± (1.96)(0.93), i.e., [102.1, 105.7]
338
Review of Confidence Intervals
We've done the Normal case:
Data: X1, …, Xn, a sample of size n from N(μ, σ),
with μ unknown, σ known.
Sample mean X̄n.

E.g., 95% CI for μ is [X̄n − 1.96σ/√n, X̄n + 1.96σ/√n].

Basis for this: X̄n ~ N(μ, σ/√n), which says, e.g., that
the prob that X̄n is w/in 1.96σ/√n of μ is 0.95.
339
Confidence intervals for μ without being told σ
E.g. case n = 3. Data: X1, X2, X3 from N(μ, σ),
with μ unknown, σ unknown.
Sample mean X̄ and SD s.
Say we want a 95% CI for μ.

Note can't use [X̄ − 1.96σ/√3, X̄ + 1.96σ/√3], because we don't know σ!

How about [X̄ − 1.96s/√3, X̄ + 1.96s/√3]?        Nope. Bad.

95% CI for μ:  [X̄ − 4.30s/√3, X̄ + 4.30s/√3].

Q: Where does 4.30 come from?
A: The “ t distribution with 2 degrees of freedom ” 340
Minitab demo for 95% CI: X̄3 ± 1.96 s/√3 (bad) vs. X̄3 ± 4.30 s/√3 (good)
 Enable command language
 Make 3 cols, 2000 rows of N(10,2). These are c1-c3.
 Name c4 "mean". Do rmean c1-c3 c4.
 col c5 "L" Use Calc menu: L=mean-1.96*2/sqrt(3)
 col c6 "U" Use Ctrl-E: U=mean+1.96*2/sqrt(3)
 col c7 Let c7 = (L<10) and (U>10)
 sum(c7). Hope about 1900. Let k1=sum(c7)/2000. Print k1.
 col c8 "stdev" rstdev c1-c3 c8 (or could use row stats menu)
 col c9 "Lz" calc menu: Lz=mean-1.96*stdev/sqrt(3)
 col c10 "Uz" Use Ctrl-E: Uz=mean+1.96*stdev/sqrt(3)
 col c11 "zcover" let c11=(Lz < 10) and (Uz > 10)
 sum(c11) Let k2=sum(c11)/2000 Print k2
 c12 "Lt" Lt = mean - 4.30*stdev/sqrt(3)             Script for
 c13 "Ut" Ut = mean + 4.30*stdev/sqrt(3)             the drama
 c14 "tcover" Let c14 = (Lt < 10) and (Ut > 10)      played out
 let k3=sum(c14)/2000 Print k3.                      in class
341
σ known vs. σ unknown                 Jargon: "degrees of freedom"

X̄ ~ N(μ, σ/√n), i.e.,
(X̄ − μ)/(σ/√n) ~ N(0, 1)        (X̄ − μ)/(s/√n) ~ t distrib with n − 1 df

Which is N(0,1)? Which is t?

N(0,1) density: 95% prob in ±1.96
t(2) density: 95% prob in ±4.30
342
So what's a t distribution again?
The distribution of a "standardized" X̄n, based on a
sample of size n, and "standardized" using s instead of σ,
is called the t distribution with n − 1 degrees of freedom.

i.e., the distrib of (X̄n − μ)/(s/√n)
343
t densities
(figure: t densities for df = 1 (red), 2, 4, 8, and ∞ = N(0,1) (black))
344
The logic of the t CI
E.g. for n = 3:
(X̄ − μ)/(s/√3) ~ t distrib with 2 df

This t(2) distrib has .95 probability between ±4.30.

I.e. P{(X̄ − μ)/(s/√3) is between −4.30 and 4.30} = 0.95.
I.e. P{X̄ is within 4.30(s/√3) of μ} = 0.95.

95% CI for μ:  X̄ ± 4.30(s/√3).
345
t tests: example
                                     Y       Z     X = Y − Z
H0: μY = μZ                         125     109        16
Ha: μY ≠ μZ                         347     278        69
                                    265     275       −10
Equivalent: Define X = Y − Z,       195     191         4
let μ denote mean of X, and test    535     416       119
H0: μ = 0                           235     250       −15
Ha: μ ≠ 0.
                                    X̄ = 30.5
                                    sX = 52.8
Now it's a 1-sample test. Xi ~ N(μ, σ).
Doing a test about μ, with σ not assumed known.            346
t tests: example (cont)                       n = 6
                                              X̄ = 30.5
"t statistic" t = (X̄ − μ)/(s/√n)              sX = 52.8

Under H0, μ = 0, so we get t = 30.5/(52.8/√6) = 1.42.

For a 2-sided test, we want to
add the prob to the right of 1.42
and to the left of −1.42 in the t
distrib with n − 1 = 5 df.
                                              Don't reject
t table gives P value between .2 and .3.      null hypothesis
Minitab: P = .215                                        347
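The t statistic above can be recomputed from the data table (with the negative differences restored). A Python sketch, not from the slides:

```python
# Paired t statistic for the reading-speed example.
from math import sqrt
from statistics import mean, stdev

y = [125, 347, 265, 195, 535, 235]
z = [109, 278, 275, 191, 416, 250]
x = [yi - zi for yi, zi in zip(y, z)]  # differences: 16, 69, -10, 4, 119, -15

n = len(x)
t = mean(x) / (stdev(x) / sqrt(n))
print(round(mean(x), 1), round(stdev(x), 1), round(t, 2))  # 30.5 52.8 1.42
```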
Stat 10x
J. Chang
Tuesday, 10/23/01

"Statistical thinking will one day be as necessary for
efficient citizenship as the ability to read and write."
-- H.G. Wells

348
Today
• CI for a proportion
• Tests and CI’s for difference between two
means
• Chi-square for goodness of fit to a given
distribution
• Two-way tables and chi-square

349
Review
Simple confidence interval
I have a number μ, which I know and you don't.

Suppose I draw a random X ~ N(μ, 0.06),
and report X = 0.38.

Reasoning for a 95% confidence interval for μ:
• with probability 0.95, X is within 2 SD's of μ
• i.e. with prob 0.95, X is within 0.12 of μ
• i.e. with prob 0.95, μ is within 0.12 of X
• 95% CI is 0.38 ± 0.12, i.e., [0.26, 0.50]

350
Review
(Statistical hypothesis testing)
Math, logic                        Statistics, real life
Want to prove a statement.         Want to use some data to
                                   provide evidence for a
                                   hypothesis.

• Assume the statement             • Assume the hypothesis
  is not true.                       is not true.
• Reason to a contradiction,       • Show the observed data is
  i.e., observe something            very unlikely.
  impossible.
                                                        351
Inference for a proportion
Example: Confidence interval in a poll
Suppose we take a random sample of 900 likely voters.

We ask who they will vote for:        (Imagine a world without
52% say Bush, 48% say Gore.           Nader, Buchanan, …)

Let p denote the unknown fraction of Bush voters.
Our sample gives the point estimate p̂ = .52.

p̂ = X/900, where X ~ B(900, p)
352
Poll proportion example (cont.)
Distrib of p̂ is approx
N(p, √(p(1 − p)/900)) = N(p, √(p(1 − p))/30) ≈ N(p, .017)

Estimate this by √((.52)(.48))/30 = .017

An approximate 95% CI is
the observed p̂ = .52, plus-or-minus 2 × SD(p̂) = .034,
i.e. 52% ± 3.4%.
              ↑ margin of error
353
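The estimated SD and margin of error here are two lines of code. A Python sketch (mine, not from the slides):

```python
# 95% CI for a proportion: phat ± 2·sqrt(phat(1-phat)/n).
from math import sqrt

phat, n = 0.52, 900
se = sqrt(phat * (1 - phat) / n)
moe = 2 * se
print(round(se, 3), round(moe, 3))  # 0.017 0.033
```

The slide's ±3.4% comes from doubling the already-rounded SD of .017.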
Review of Confidence Intervals                        Review
We've done the Normal case:
Data: X1, …, Xn, a sample of size n from N(μ, σ),
with μ unknown, σ known.
Sample mean X̄n.

E.g., 95% CI for μ is [X̄n − 1.96σ/√n, X̄n + 1.96σ/√n].

Basis for this: X̄n ~ N(μ, σ/√n), which says, e.g., that
the prob that X̄n is within 1.96σ/√n of μ is 0.95.
354
Confidence intervals for μ without being told σ       Review
E.g. case n = 3. Data: X1, X2, X3 from N(μ, σ),
with μ unknown, σ unknown.
Sample mean X̄ and SD s.
Say we want a 95% CI for μ.

Note can't use [X̄ − 1.96σ/√3, X̄ + 1.96σ/√3], because we don't know σ!

How about [X̄ − 1.96s/√3, X̄ + 1.96s/√3]?        Nope. Bad.

95% CI for μ:  [X̄ − 4.30s/√3, X̄ + 4.30s/√3].

Q: Where does 4.30 come from?
A: The “ t distribution with 2 degrees of freedom ” 355
Minitab demo for 95% CI: X̄3 ± 1.96 s/√3 (bad) vs. X̄3 ± 4.30 s/√3 (good)   Review
 Enable command language
 Make 3 cols, 2000 rows of N(10,2). These are c1-c3.
 Name c4 "mean". Do rmean c1-c3 c4.
 col c5 "L" Use Calc menu: L=mean-1.96*2/sqrt(3)
 col c6 "U" Use Ctrl-E: U=mean+1.96*2/sqrt(3)
 col c7 Let c7 = (L<10) and (U>10)
 sum(c7). Hope about 1900. Let k1=sum(c7)/2000. Print k1.
 col c8 "stdev" rstdev c1-c3 c8 (or could use row stats menu)
 col c9 "Lz" calc menu: Lz=mean-1.96*stdev/sqrt(3)
 col c10 "Uz" Use Ctrl-E: Uz=mean+1.96*stdev/sqrt(3)
 col c11 "zcover" let c11=(Lz < 10) and (Uz > 10)
 sum(c11) Let k2=sum(c11)/2000 Print k2
 c12 "Lt" Lt = mean - 4.30*stdev/sqrt(3)             Script for
 c13 "Ut" Ut = mean + 4.30*stdev/sqrt(3)             the drama
 c14 "tcover" Let c14 = (Lt < 10) and (Ut > 10)      played out
 let k3=sum(c14)/2000 Print k3.                      in class
356
σ known vs. σ unknown                 Review   Jargon: "degrees of freedom"

X̄ ~ N(μ, σ/√n), i.e.,
(X̄ − μ)/(σ/√n) ~ N(0, 1)        (X̄ − μ)/(s/√n) ~ t distrib with n − 1 df

N(0,1) density: 95% prob in ±1.96
t(2) density: 95% prob in ±4.30
357
Review

So what's a t distribution again?
The distribution of a "standardized" X̄n, based on a
sample of size n, and "standardized" using s instead of σ,
is called the t distribution with n − 1 degrees of freedom.

i.e., t(n − 1) is the distrib of (X̄n − μ)/(s/√n)
358
Review

The logic of the t CI
E.g. for n = 3:
(X̄ − μ)/(s/√3) ~ t distrib with 2 df

This t(2) distrib has .95 probability between ±4.30.

I.e. P{(X̄ − μ)/(s/√3) is between −4.30 and 4.30} = 0.95.
I.e. P{X̄ is within 4.30(s/√3) of μ} = 0.95.

95% CI for μ:  X̄ ± 4.30(s/√3).
359
Comparing means of two Normal distributions:
case of paired data
Given paired data: (X1, Y1), (X2, Y2), …, (Xn, Yn).
Want to test H0: μx = μy
(versus Ha: μx ≠ μy or whatever)

Consider differences D1, …, Dn, where Di = Xi − Yi.

Now it's a one-sample problem: D1, …, Dn is a random
sample from a population, and the null hypoth is that the
population mean is 0.
360
t tests: an example with paired data
6 students, each took 2 reading
speed measurements.                  Y       Z     X = Y − Z
                                    125     109        16
H0: μY = μZ                         347     278        69
Ha: μY ≠ μZ                         265     275       −10
                                    195     191         4
Equivalent: Define X = Y − Z,       535     416       119
let μ denote mean of X, and test    235     250       −15
H0: μ = 0
Ha: μ ≠ 0.                          X̄ = 30.5
                                    sX = 52.8
Now it's a 1-sample test. Xi ~ N(μ, σ).
Doing a test about μ. Let us not assume σ is known.
361
t tests: example (cont)                       n = 6
                                              X̄ = 30.5
"t statistic" t = (X̄ − μ)/(s/√n)              sX = 52.8

Under H0, μ = 0, so we get t = 30.5/(52.8/√6) = 1.42.

For a 2-sided test, we want to
add the prob to the right of 1.42
and to the left of −1.42 in the t
distrib with n − 1 = 5 df.
                                                  Don't reject
t table gives P value > .2 (see next slide)       null hypothesis
Minitab: P = .215                                           362
Critical values for t distributions
We got t = 1.42, with 5 df.

(t table excerpt: for 5 df, the critical value
for tail probability .1 is 1.476)

Since t = 1.42 < 1.476,
tail prob > .1, so
2-sided P value > .2.

(Bottom row of the t table, df = ∞, is N(0,1)!)          363
Comparing means of two Normal distributions
Data: Two indep samples X1, …, Xm and Y1, …, Yn.
Model: X1, …, Xm ~ N(μx, σx) and Y1, …, Yn ~ N(μy, σy).
Our goal is to test the null hypothesis H0: μx = μy.
Idea: use test statistic X̄ − Ȳ.
Key: Assuming H0, need to know distrib of X̄ − Ȳ.
Mean 0. SD? Shape?
Some cases:

σx and σy          σx and σy           σx and σy
known, and         unknown, but        unknown
not nec equal      assumed equal                       364
Simplest case: Population SD's known
(recall Key: Assuming H0, need to know distrib of X̄ − Ȳ)

Easy: We know
• X̄ ~ N(μx, σx/√m) and Ȳ ~ N(μy, σy/√n).
• X̄ and Ȳ are independent

So…   X̄ − Ȳ ~ N(0, √(σx²/m + σy²/n))
365
Two-sample procedures with SD's known: example
E.g. here X̄ = 9.77, Ȳ = 16.27,
so X̄ − Ȳ = −6.5.

Also suppose we are told that
σx = 3 and σy = 5, with m = 7, n = 9. Then…

σ(X̄ − Ȳ) = √(3²/7 + 5²/9) = √4.06 = 2.02

95% CI for μx − μy is
−6.5 ± (1.96)(2.02) = [−10.45, −2.55]
366
Two-sample hypothesis test with SD's known
…continuing our example…
If H0: μx = μy is true, then using the given values for
σx and σy, the distrib of X̄ − Ȳ is N(0, 2.02).

We just observed a value −6.5 for X̄ − Ȳ.
Standardized observed value is −6.5/2.02 = −3.22.

P value is 2Φ(−3.22) = .0013

(The statistic (X̄ − Ȳ)/√(σx²/m + σy²/n) measures
"how extreme" the difference in means is.)               367
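A Python sketch of this two-sample z test (not from the slides, which use tables):

```python
# Two-sample z test with known SDs, using the slide's numbers:
# xbar - ybar = -6.5, sigma_x = 3 (m = 7), sigma_y = 5 (n = 9).
from math import sqrt
from statistics import NormalDist

diff = -6.5
sd_diff = sqrt(3**2 / 7 + 5**2 / 9)
z = diff / sd_diff
p_value = 2 * NormalDist().cdf(-abs(z))
print(round(sd_diff, 2), round(z, 2), round(p_value, 4))  # 2.02 -3.22 0.0013
```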
Two-sample procedures with unknown variances
Idea: would still like to use (X̄ − Ȳ)/√(σx²/m + σy²/n), but can't.

So we estimate σx and σy by sample SD's, sx and sy,
and use the test statistic T = (X̄ − Ȳ)/√(sx²/m + sy²/n).

Distrib for T is not Normal, but approximately a t distrib.
Degrees of freedom? No really clean answer...
• Conservative: Minimum of (m−1) and (n−1).
• More accurate: A complicated function of m and n. (In textbook…)
• Precise df usually doesn't matter a whole lot…              368
• Basically, the distinction between t and
Normal distribs, and the precise number of
degrees of freedom, hardly matter unless the
sample sizes involved are very small.

369
t densities
(figure: t densities for df = 1 (red), 2, 4, 8, and ∞ = N(0,1) (black))
370
t critical values
(used for 95% CI's and hypothesis tests)
(table: t critical values by df)
371
Two-sample t procedures

372
Two-sample t procedures

373
Two-sample t procedures

374
Two sample procedures with SD's unknown but assumed equal
(Textbook, pp. 550-554)
If σx = σy = σ, say, it's more accurate to use both samples
to give a single, "pooled" estimate of σ, instead of
using T = (X̄ − Ȳ)/√(sx²/m + sy²/n):

sp = √[ (Σi (Xi − X̄)² + Σj (Yj − Ȳ)²) / ((m − 1) + (n − 1)) ]

Use test statistic T = (X̄ − Ȳ)/(sp √(1/m + 1/n)), which, under H0, has
(exactly!) a t distrib with (m − 1) + (n − 1) df.
375
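The pooled estimate and test statistic can be sketched directly from the formulas above. The two tiny samples below are made up for illustration (Python, not from the slides):

```python
# Pooled two-sample t statistic under the equal-SD assumption.
from math import sqrt
from statistics import mean

def pooled_t(xs, ys):
    m, n = len(xs), len(ys)
    xbar, ybar = mean(xs), mean(ys)
    ss_x = sum((x - xbar)**2 for x in xs)
    ss_y = sum((y - ybar)**2 for y in ys)
    sp = sqrt((ss_x + ss_y) / ((m - 1) + (n - 1)))  # pooled SD estimate
    t = (xbar - ybar) / (sp * sqrt(1/m + 1/n))      # t with (m-1)+(n-1) df
    return sp, t

sp, t = pooled_t([1, 2, 3], [2, 4, 6])
print(round(sp, 3), round(t, 2))  # 1.581 -1.55
```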
Tables of counts, Goodness of fit, and Chi-Square
Generalize the Binomial...

Binomial: two cells:
Success, prob p;   Failure, prob (1 − p)

Multinomial distribution: k cells:
prob p1,  prob p2,  prob p3,  …,  prob pk

Test hypotheses about the urn (or "cell") probabilities p1, p2, …, pk.
376
Goodness of fit: is the die fair?
Suppose we roll a die 60 times and get these frequencies:

value            1    2    3    4    5    6
observed freq    8   13   11    5   14    9     H0: pi = 1/6 for all i
expected freq   10   10   10   10   10   10     ← Assuming H0

X² = Σ (observed − expected)²/expected
   = (8−10)²/10 + (13−10)²/10 + (11−10)²/10
     + (5−10)²/10 + (14−10)²/10 + (9−10)²/10
   = 5.6              How "extreme" is this?                                   377
Fair die (cont)
X² = Σ (observed − expected)²/expected = 5.6

Distrib of X², assuming H0, is chi-square, here with 5 df.

P-value:
Table F, p. T-20: P > 0.25
Minitab:          P = 0.347

Don't reject null hypoth.
378
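The goodness-of-fit statistic is a one-line sum. A Python sketch of the slide's computation (mine, not from the slides):

```python
# Chi-square goodness-of-fit statistic for the fair-die data.
observed = [8, 13, 11, 5, 14, 9]
expected = [10] * 6   # 60 rolls, H0: p_i = 1/6

x2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
print(round(x2, 1))  # 5.6
```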
Fair-die example with different numbers
Before: X² = Σ (observed − expected)²/expected = 5.6,  P = .347

That was then. This is now.

What if we get these?
value            1     2     3     4     5     6
observed freq   80   130   110    50   140    90
expected freq  100   100   100   100   100   100

Then X² = 56.    And P is miniscule (0.00000000008).
379
How many degrees of freedom?
For chi-square distrib,
number of df is important!
(Unlike for t distrib)

Here df = 5 is number of
"cells" minus 1. "Why?"

H0: (p1, p2, p3, p4, p5, p6) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

Ha: General p's (6 positive numbers that sum to 1)

Ha has 5 df, H0 has 0 df.            Overall test has 5 − 0 = 5 df.
                                                        380
General rule: df for the test = (df in Ha) − (df in H0).
t and chi-square critical values.
Used for constructing 95% CI's (2-sided for t)
(table: t.975 and χ².95 critical values by df)
381
Contingency tables, homogeneity,
independence
A two-way classification of subjects by two variables --
gender and handedness:

right      left   ambidextrous
men       934        113         20

women     1070       92           8

Are gender and handedness independent?
I.e. do proportions of right, left, and ambidextrous agree
between men and women?
382
Chi-square for independence in 2-way tables
Again use X² = Σ (observed − expected)²/expected
(a sum of 6 terms in our example)
"expected" counts: see below…      Degrees of freedom?

Ha?? 6 cells, so 6 probabilities, so 5 df

H0?? Can choose, e.g., P(man)                     Ha      5 df
[and then P(woman) is determined],                H0      3 df
and can choose P(right) and P(left)               test    2 df
[and then P(ambidextrous) is determined].
                                                        383
Expected counts, assuming null hypoth (independence)
Data: the gender-by-handedness counts above.

Expected counts, assuming H0:
expected = (row total)(column total)/n

E.g. expected count for (men, right) is (1067)(2004)/2237 = 955.86
384
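The expected counts and the chi-square statistic follow directly from the row and column totals. A Python sketch using the table's counts (the code is mine, not from the slides):

```python
# Expected counts and chi-square for the gender-by-handedness table.
counts = [[934, 113, 20],   # men: right, left, ambidextrous
          [1070, 92, 8]]    # women

row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]
n = sum(row_tot)

expected = [[r * c / n for c in col_tot] for r in row_tot]
x2 = sum((counts[i][j] - expected[i][j])**2 / expected[i][j]
         for i in range(2) for j in range(3))

print(round(expected[0][0], 2))  # 955.86, as on the slide
print(round(x2, 3))              # 11.806, matching the Minitab output
```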
2-way table chi-square test using
Minitab

385
(113  97.78) 2
 2.369
97.78

P{ 2 (2 df)  11.806}  .003

386
Stat 10x
J. Chang
Tuesday, 10/30/01

I always find that statistics are hard to swallow and impossible to
digest. The only one I can remember is that if all the people who go
to sleep in church were laid end to end they would be a lot more
comfortable.
--Mrs. Robert A. Taft
387
Today
• Chi-square for goodness of fit to a given
distribution
• Two-way tables and chi-square
• Inference for simple regression

388
Next: Inference for regression

• How strong is the evidence that there is a real correlation
between two variables?
• What is a 95% confidence interval for the mean of one
variable, for a given value of another variable?

403
Crying and IQ
The heartwarming story...
Data on 38 infants (4 to 10 days old).
Researchers used a rubber band to snap infants in the foot.
Measured crying intensity
(“number of peaks in the
most active 20 seconds”)
Later measured IQ at age 3 years.

Data from Basic Practice...
404
Crying data
(scatterplot: IQ (85 to 165) vs. crying intensity (10 to 30); r = .455)

"Is there a real relationship?"
405
Regression model
Mean of Y is a linear function of X.       Y = IQ
                                           X = crying
μY = β0 + β1X
Actual values = mean + "random error"
Yi = β0 + β1Xi + εi

Model: εi ~ N(0, σ) are indep random variables

3 unknown parameters: β0, β1, and σ.
406
Estimating the parameters
(Start w/ the coefficients in the linear equation)
Yi = β0 + β1Xi + εi,     εi ~ N(0, σ)

We'll estimate β0 and β1 by b0 and b1, say.
b0, b1: intercept and slope of the usual least-squares
regression line ŷ = b0 + b1x.

(scatterplot with fitted line Ŷ = 91.3 + 1.49X)

On average, gain 1.49 IQ points per unit of crying
intensity. Very scientific.
407
Plot the residuals
Residual is difference between observed Yi and the
prediction of the regression line: ei = Yi − (b0 + b1Xi)
Hope to see a formless blob…

(residual plot: residuals vs. crying intensity)

Looks pretty formless to me. (Except maybe one point to examine.)
408
Check that the residuals look decently Normal

(Normal probability plot of the residuals: percent vs. data,
roughly a straight line)
409
How to estimate σ?
As usual, need to est σ to construct CI's and hypothesis tests.

Yi = β0 + β1Xi + εi,     εi ~ N(0, σ)

How about using s = SD of the εi's?
We don't know the εi's!        εi = Yi − (β0 + β1Xi)

Idea: Can estimate εi by the residual ei = Yi − (b0 + b1Xi).
Estimate σ by the SD of residuals…

s = √( Σ ei² / (n − 2) ) = √( Σ (Yi − b0 − b1Xi)² / (n − 2) )
410
There you go again… Why n − 2?
Before we divided by n − 1.        s = √( Σ ei² / (n − 2) )
Why n − 2 here?

Choose n − 2 to get an unbiased estimator of σ².

An intuitive way to remember:
Before we said if we have n = 1, we want s undefined.
Now: if n = 2, s should be undefined (0/0).

The real idea: there are two estimated parameters in this
expression.
411
Minitab report

412
CI's and tests for β1
E.g. a 95% CI for β1 will look like
b1 ± (multiplier)(SE of b1)

Multiplier: around 2, as usual;       SE: standard error
from a t distrib.                     (estimated SD).

By algebra... SD(b1) = σ/√(Σ(Xi − X̄)²)

Estimate this by SE(b1) = s/√(Σ(Xi − X̄)²)
413
SE of regression coefficient
By algebra... SD(b1) = σ/√(Σ(Xi − X̄)²)
Estimate by SE(b1) = s/√(Σ(Xi − X̄)²)

This all makes qualitative sense at least. E.g.:
b1 is more variable when σ is larger,
less variable when the Xi's are spaced farther apart.
414
95% CI for β1 in crying example
b1 ± t*·SE(b1), where t* comes from the t
distrib with 38 − 2 = 36 df.

= 1.49 ± (2.03)(SE(b1))

SE(b1) = 17.5/√(Σ(Xi − X̄)²) = 0.487

= 1.49 ± (2.03)(0.487)
= [0.51, 2.48]
415
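Plugging the slide's numbers into the CI formula (a Python sketch; the values b1, SE, and t* are taken from the slide):

```python
# 95% CI for the slope: b1 ± t* · SE(b1), with t* = 2.03 for 36 df.
b1, se, t_star = 1.49, 0.487, 2.03
lo, hi = b1 - t_star * se, b1 + t_star * se
print(round(lo, 2), round(hi, 2))  # 0.5 2.48 (slide, using unrounded b1, gets [0.51, 2.48])
```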
Hypothesis test in crying example
Test H0: β1 = 0.

Calculate b1/SE(b1) = 1.4929/0.487 = 3.07

P value: 2 × (0.002) = 0.004.

This data gives strong evidence that IQ and crying
are correlated.
416
Confidence and prediction intervals at a given value x*
E.g., for x* = 30:
Want a CI for the mean of Y in the "vertical strip"
at X = 30, that is, a CI for μ(Y | X = x*).
Prediction interval: Suppose we just saw an infant with a
crying score of 30. Give an interval for the future IQ score,
for which we have a given confidence.

In short: want a CI for the "vertical strip" mean μ(Y | X = x*),
and a "prediction interval" for a new value of Y that
we haven't observed yet, for a given value X = x*.
417
Formulas, etc.
Intervals will be of the form ŷ ± (multiplier)(SE),
where "multiplier" comes from a t(n − 2) distribution.

For CI for μ(Y | X = x*) use
SE(μ̂(Y | x*)) = s √( 1/n + (x* − x̄)² / Σ(xi − x̄)² )
                                            See pp. 674-677
For prediction interval use                 and pp. 690-691
SE(Ŷ | x*) = s √( 1 + 1/n + (x* − x̄)² / Σ(xi − x̄)² )
n  ( xi  x ) 2               418
In Minitab
Do Stat > Regression > Regression, click Options, and fill
in a value or a column for “Prediction intervals for new
observations”

Do Stat > Regression > Fitted line plot, click Options, and
check “Display confidence bands” and “Display
prediction bands” for a nice picture.

E.g. for x* = 30, get a CI of [122, 150] for the mean,
and a prediction interval of [98, 174] for a new IQ.

419
Regression Plot
Y = 91.2683 + 1.49290 X      R-Sq = 20.7%

[Fitted line plot of IQ vs. crying, showing the regression line
with 95% CI bands and 95% PI bands.]

420
From K.A.C. Manderville, The Undoing of Lamia Gurdleneck
"You haven't told me yet," said Lady Nuttal, "what it is your fiancé
does for a living."
"He's a statistician," replied Lamia, with an annoying sense of being
on the defensive.
Lady Nuttal was obviously taken aback. It had not occurred to her that
statisticians entered into normal social relationships. The species,
she would have surmised, was perpetuated in some collateral
manner, like mules.
"But Aunt Sara, it's a very interesting profession," said Lamia warmly.
"I don't doubt it," said her aunt, who obviously doubted it very much.
"To express anything important in mere figures is so plainly
impossible that there must be endless scope for well-paid advice
on how to do it. But don't you think that life with a statistician
would be rather, shall we say, humdrum?"
Lamia was silent. She felt reluctant to discuss the surprising depth of
emotional possibility which she had discovered below Edward's
numerical veneer.                                             421
Stat 10x
J. Chang
Tuesday, 11/6/01

422
Today
• Multiple regression, including some ideas of
model selection

423
Before I forget: are these
interpretations correct?
• 95% CI of [.521, .583] for population
proportion p means that “The probability that
p lies between .521 and .583 is 0.95.”
• Testing a null hypothesis and finding P value =
.015 means that “The probability that the null
hypothesis is true is 0.015.”

424
Multiple regression example:
Deciding who should get scholarships

Data: [Minitab worksheet with columns coll_GPA, HS_GPA, and ach_test.]

Want to use HS GPA and achievement test to predict college GPA
425
Look at scatterplots for all pairs of
variables
[Matrix plot: scatterplots of all pairs among coll_GPA, HS_GPA,
and ach_test.]

Minitab: Graph > Matrix Plot

HS_GPA looks useful in predicting coll_GPA.                         Good
ach_test looks useful in predicting coll_GPA.                       Good
ach_test & HS_GPA not useful in predicting each other! Also good
426
Simple (not multiple) regression of
coll_GPA on HS_GPA

427
Results of simple regression

428
Minitab can store the fits and residuals
in the worksheet

429
Plot residuals

[Plot of the residuals (RES1) vs. HS_GPA.]

Looks nice
430
Same residuals vs. ach_test (the
other predictor)

[Plot of the same residuals (RES1) vs. ach_test.]

431
Multiple regression using both
predictor variables

432
The report

433
Recall the simple regression report for
comparison...

434
Fits and resids from both
regressions
(simple and multiple)

435
Plotting residuals vs. fitted values
[Side-by-side plots of residuals vs. fitted values: the simple
regression (vs. FITS1) and the multiple regression (RESI2 vs. FITS2).]

436
How much better did we do with two predictors?

Simple:   r = .73  (r² = .53)        Multiple:   r = .93  (r² = .87)

[Plots of actual coll_GPA vs. fitted values, for the simple fit
(FITS1) and the multiple fit (FITS2).]

The famous “multiple R-sq” (reported by Minitab) is simply
the squared correlation between actual and fitted y’s. 437
Multiple regression model
In our example:
Mean of Y is a linear fcn of X1 and X2:        Y  = coll_GPA
                                               X1 = HS_GPA
    μY = β0 + β1 X1 + β2 X2                    X2 = ach_test

Actual values = mean + “random error”:
    Yi = β0 + β1 Xi1 + β2 Xi2 + εi

Model: εi ~ N(0, σ) are indep random variables

4 unknown parameters: β0, β1, β2, and σ.

438
Estimates in multiple regression
Yi   0  1 X i1   2 X i 2   i     i ~ N (0,s )

If we estimate  0 , 1 ,  2 by b0 , b1, b2 ,
define residual ei  Yi  (b0  b1 X i1  b2 X i 2 ) .

How to choose “best” b0 , b1, b2 ?
Least squares idea: choose b0 , b1, b2 that give smallest sum
of squared residuals.                     i.e. smallest  ei2 
Estimate of s : s               2
ei
n3
There are formulas for SE(b0 ), SE (b1 ), SE (b2 ) , which
depend, roughly, on s and how spread out the X values are.439
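The least-squares recipe above can be sketched directly. This uses simulated data with made-up coefficients (not the scholarship data), and the standard matrix formula for the SEs:

```python
# Sketch: least-squares estimates, residuals, s, and SEs for two predictors.
import numpy as np

rng = np.random.default_rng(3)
n = 40
x1 = rng.uniform(2.0, 4.0, n)                # plays the role of HS_GPA
x2 = rng.uniform(70, 95, n)                  # plays the role of ach_test
y = 0.3 + 0.6 * x1 + 0.01 * x2 + rng.normal(0, 0.2, n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # b0, b1, b2 minimize sum(e_i^2)
e = y - X @ b                                # residuals
s = np.sqrt(np.sum(e**2) / (n - 3))          # estimate of sigma (3 coefficients)
se = s * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))   # SE(b0), SE(b1), SE(b2)
```

These are the same b's and SEs that appear in Minitab's coefficient table.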
Minitab uses those calculations and
more algebra
to get all this...

440
Model selection example:
Guessing the degree of a polynomial

Data:
[Scatterplot of y vs. x, for x between −3 and 3.]

441
linear fit (polynomial of degree 1)
[Plot: data with the degree-1 least-squares fit.]
442

quadratic fit (polynomial of degree 2)
[Plot: data with the degree-2 fit.]
443

3rd degree polynomial fit
[Plot: data with the degree-3 fit.]
444

4th degree polynomial fit
[Plot: data with the degree-4 fit.]
445

5th degree polynomial fit
[Plot: data with the degree-5 fit.]
446

10th degree polynomial fit
[Plot: data with the degree-10 fit.]
447

15th degree polynomial fit
[Plot: data with the degree-15 fit.]
448

20th degree polynomial fit
[Plot: data with the degree-20 fit.]
449

25th degree polynomial fit
[Plot: data with the degree-25 fit.]
450

30th degree polynomial fit
[Plot: data with the degree-30 fit.]
451
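The sequence of fits above can be reproduced numerically: fit polynomials of increasing degree by least squares and watch the residual sum of squares shrink. A sketch on data simulated from an illustrative cubic (made-up coefficients, not the course's dataset):

```python
# Polynomials of increasing degree fit by least squares to noisy cubic data.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 40)
y = 0.5 * x**3 + x**2 - 5 * x + 4 + rng.normal(0, 2, x.size)

sse = {}
for deg in (1, 2, 3, 5, 10):
    coefs = np.polyfit(x, y, deg)            # least-squares polynomial fit
    resid = y - np.polyval(coefs, x)
    sse[deg] = np.sum(resid**2)              # never increases as deg grows
```

The SSE alone always favors the highest degree, which is exactly why a model selection criterion is needed.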
Model selection alphabet soup
AIC: "An" Information Criterion (proposed by Akaike)
BIC: "Bayesian Information Criterion" (Schwarz, 1978)

n 1
2
2
      (1  R 2 )
n p

n p    2   p
FPE  (average squared residual)        s 1  
n p        n

"Cross validation"
452
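As a sketch of how such a criterion gets used: compute FPE, as defined above, for each candidate degree and pick the minimizer. Data are simulated from an illustrative cubic (made-up coefficients), so FPE should penalize both the too-simple and the too-wiggly fits:

```python
# Choose a polynomial degree by minimizing FPE over candidate degrees.
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-3, 3, 60)
y = 0.5 * x**3 + x**2 - 5 * x + 4 + rng.normal(0, 2, x.size)
n = x.size

def fpe(deg):
    p = deg + 1                               # number of fitted coefficients
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    return (np.sum(resid**2) / n) * (n + p) / (n - p)

scores = {d: fpe(d) for d in range(1, 11)}
best = min(scores, key=scores.get)            # degree with smallest FPE
```

Unlike raw SSE, FPE multiplies the average squared residual by (n + p)/(n − p), so adding coefficients must buy a real drop in SSE to pay for itself.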
Results of a couple of model
selection criteria

453
The answer is indeed degree = 3
    y = (1/2) x³ + x² − 5x + 4

The actual (3rd degree) polynomial:
[Plot of the true cubic for −3 ≤ x ≤ 3.]
454
[Side-by-side comparison: the 3rd degree polynomial fit, the 4th degree
polynomial fit, and the actual (3rd degree) polynomial.]
455
And now, as you go forth…
Please remember that the knowledge you have gained in
this class must always be used for good, and never,
not ever, ever, ever, ever, ever,
for evil.

I’ll be around…
Good luck with the rest of the course,
and with all future random pursuits!
456
