Stat 101-106

							        Stat 101-106
       T. Brown, J. Chang
N. Hengartner, J. Kim, E. Kostello,
 J. Lapinski, J. Reuning-Scherer

            Fall, 2001
         Class #1 9/6/01
                                      1
Who are we and why are we here?
             Intro to Statistics for...
• Life Sciences: Stat 101/EEB 210/MCDB 215
  (Instructors: J. Kim, JC)
• Political Science: Stat 102/PLSC 452/EP&E 203
  (J. Lapinski, JC)
• Sociology: Stat 103/Soc 119
  (E. Kostello, JC)
• Psychology: Stat 104/Psych 201 (T. Brown, JC)
• Environmental Sciences: Stat 105/F&ES 205
  (J. Reuning-Scherer, JC)
• Data analysis: Stat 106 (N. Hengartner, JC)
                                                  2
The intellectual universe
     (In 3 easy steps…)




                            3
          The intellectual universe
Physics                               Philosophy




  Math                            Economics
                                            6
            What is Statistics?
• The science of collecting, organizing, and
  interpreting numerical facts, which we call
  data. (Moore and McCabe)
• The science and art of prediction and
  explanation. (Yale College Programs of Study)

Statistics provides a framework and tools for
  addressing hard questions.
                                                  7
            An “easy” question
• In Euclidean plane geometry, what is the sum
  of the angles in a triangle?




(Nobody asks: “What is the latest thinking on the sum
of the angles in a triangle in Euclidean geometry?”)
                                                        8
         Some “hard” questions
• Does smoking cause cancer?
• Do men and women differ in pulse?
• Does listening to Mozart make people smarter?
• Is there a single gene that causes cystic fibrosis?
  Where is it in the genome?
• Is global warming really happening? How much
  should we worry?
• How long ago did “mitochondrial Eve” live?
• Should Bush really have won the election?
                                                        9
               Nature of class
• purpose
• prerequisites
• structure
  – sections
  – schedule




                                 10
         What’s going on here
• Two kinds of lectures:
  – “General Statistics lectures” introduce statistical
    theory, concepts, etc. (Everybody, OML202)
  – “Subject area lectures” introduce subject-oriented
    applications and examples; possibly more
    techniques and theory. (Sections separate)
• Not much math used (and none of it
  advanced). Lots of reasoning. Some use of
  computers.
                                                    11
      Where to go & when
• Weeks 1, 2, 3: General Stat lecture only
• Weeks 4-9: General Stat lecture Tues,
Subject lecture Thurs
• (Week 11: Fall Break)
• Weeks 10, 12, 13: Subject lecture only




                                             12
         More about how this works

• “Section” may be a misleading word. “Subject-area
  lectures” may be more suggestive.
• Your section professor (not me) is the ultimate
  authority, responsible for giving grades, setting
  policies, etc.
• Sections are independent of each other.




                                                 13
                Course topics
“science of collecting, organizing, and
  interpreting data”
(I) Organizing
distributions, graphical displays (histograms,
  boxplots,...) numerical summaries (mean, median,
  standard deviation,...), Normal distributions
more than one variable: correlation, regression



                                                     14
               Course topics
“science of collecting, organizing, and
  interpreting data”
(II) Collecting and producing data
Sampling, bias, design of experiments,
  randomization




                                          15
                Course topics
“science of collecting, organizing, and
  interpreting data”
(III) Interpreting data -- Statistical inference
Confidence intervals, hypothesis testing, various
 techniques for various questions with various
 kinds of data



                                                   16
               Course topics
“science of collecting, organizing, and
  interpreting data”
(II.V) Probability
  Random variables, rules of probability,
  conditional probability, Bayes’ rule, binomial
  & Normal distributions, Central Limit
  Theorem


                                              17
                   Using the web

Course home page http://classes.yale.edu/stat100a/
    “Register” at http://classes.yale.edu/student/


 Register for both your section and for “Stat 100a”

                                    E.g., Stat 102a or whatever


The syllabus, which will be linked to the course home page, will
have announcements and links to the class notes and homework.
                                                             18
       Stat 10x
       J. Chang
    Tuesday, 9/11/01



a hard day to do Statistics.


                               19
            What is Statistics?
• The science of collecting, organizing, and
  interpreting numerical facts, which we call
  data. (Moore and McCabe)
• The science and art of prediction and
  explanation. (Yale College Programs of Study)

Statistics provides a framework and tools for
  addressing hard questions.
                                                  20
       What is Statistics? (cont.)
Conceptual framework and methods for
• learning from experience (data, experiments,
  etc.)
• reasoning under uncertainty
• answering questions and quantifying the
  strength or reliability of the conclusions



                                             21
         A few (4, actually) words
         about statistical inference
Prototypical situation: Want to answer a
  question about a population of individuals.
  We use a sample of individuals from the
  population.

Parameter: a number describing the population
Statistic: a number describing the sample.

                                                22
             P.S.

Population          Parameter




Sample              Statistic



                                23
                      e.g.
In a sample of 1000 US voters before the 2000
election, 493 said they would vote for Bush (and
507 said they would vote for Gore).

Population, parameter, sample, statistic?




                                             24
                         Data
Typically consists of values of some variables
 measured or observed for some individuals.

A variable is a characteristic of an individual, or
  something like that. (M&M p. 4 attempts to give
  definitions)

E.g. our questionnaire from the first day
• individuals: each of you
• variables: sex, pulse, height, etc.
                                                      25
               Distributions
The distribution of a variable says what possible
 values the variable takes and how frequently it
 takes those values.


Many methods to describe and display
 distributions



                                              26
           Distributions, cont.
E.g. (Minitab demo)
tally ‘section’
histogram ‘weight’
• try with: default, 100 bins, 2 bins


What makes a histogram “good”?


                                        27
Some words that describe distributions




     Symmetric,
                           Bimodal
     unimodal
                                         28
Which is skewed to the left? To the right?

                                         29
Skewed to the right        Skewed to the left
 Beta (1.6, 2.7) density    Beta (2.7, 1.6) density

                                                      30
  Numerical descriptions of distributions

• Mean = average
• Median = 50th percentile

Different ways of getting at the idea of a
 “center” of a distribution.




                                             31
   More detail (and that pesky $\Sigma$)
E.g. if the data are 76, 57, 88, 93, 72, 94,

    Mean = (76 + 57 + 88 + 93 + 72 + 94) / 6 = 80

For a variable x with n observed values
$x_1, x_2, \ldots, x_n$, the mean of x is

    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
                                                32
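As a quick check of the mean formula, here is a minimal Python sketch. Python is not part of the course (which uses Minitab), and the variable names are only illustrative.

```python
# Mean of the six example values: add them up and divide by how many there are.
data = [76, 57, 88, 93, 72, 94]

n = len(data)
mean = sum(data) / n

print(mean)  # 80.0
```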
 “Physical” interpretation of mean
Center of mass -- balance point of a distribution




          $\bar{x} = 93.4$



                                               33
        Median = 50th percentile
Arrange data in order.
  Median M = 50th percentile = “middle observation”

E.g. for data 57, 72, 76, 88, 93
       M = 76.
E.g. for data 57, 72, 76, 88, 93, 94
       M = (76 + 88) / 2 = 82.
       [i.e. if number of observations is even, average
       the middle two.]

                                                      34
                    Quartiles
• Define first quartile to be the median of the
  observations below the median (i.e. 25th percentile)
• Define third quartile to be the median of the
  observations above the median (i.e. 75th percentile)


            57, 72, 76, 88, 93, 94
                    M = 82
              Q1 = 72   Q3 = 93
                                                    35
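The median-of-halves rule for quartiles can be sketched in Python as follows. The helper name `quartiles` is hypothetical, and other software (Minitab included) may use a slightly different percentile convention, so results can differ for small samples.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule described on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations below the median position
    upper = ordered[half + n % 2:]   # observations above it (middle value skipped when n is odd)
    return median(lower), median(ordered), median(upper)

q1, m, q3 = quartiles([57, 72, 76, 88, 93, 94])
print(q1, m, q3)  # 72 82.0 93
```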
         “Robust” or “resistant”
These terms mean “insensitive to a few extreme
 observations”
  (imagine typo of adding several zeros to a number).


Which is more robust: mean or median?

          Compare      57, 72, 76, 88, 93, 94
           to          57, 72, 76, 88, 93, 94000000

                                                      36
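The comparison above is easy to try numerically. A minimal Python sketch (the list names are illustrative):

```python
from statistics import mean, median

clean = [57, 72, 76, 88, 93, 94]
typo = [57, 72, 76, 88, 93, 94000000]   # last value with six extra zeros

# The mean jumps from 80 to more than 15 million; the median stays at 82.
print("clean:", mean(clean), median(clean))
print("typo: ", mean(typo), median(typo))
```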
                     Which is which?
[Figure: density plot, horizontal axis from 0 to 4, vertical axis from 0 to 0.6, with two marked locations labeled “median” and “mean”]
                                           37
   Mean vs. Median          (a few comments)

• Robustness is nice, like apple pie.
• But, e.g., insurance companies and casinos
  care more about mean than median

• In a symmetric distribution mean = median.




                                               38
                   Summary
Today focused on
•Conceptual framework of Statistics
(population parameter, sample, statistic)
•The idea of a distribution
•Started some simple statistics describing
distributions (mean, median,…)
•We saw a bit of Minitab in class

Next time: Standard deviation, densities, Normal
distributions…
                                             39
  Stat 10x
    J. Chang

Thursday, 9/13/01




                   40
           Administrative stuff
• Minitab intro sessions start today. Look on the web
  to see what session you’re assigned.
• Class home page is http://classes.yale.edu/stat100a
• “Register” for both “Stat100a” and for your section
  through http://classes.yale.edu/student
• Homework will be posted today; linked to the
  syllabus.



                                                   41
 Measures of overall location or “center”

As discussed last time -- mainly mean, median




                                            42
Measures of “spread” or “variability”
• Range = max - min
• Interquartile range: IQR = Q3 - Q1
• Most common and useful: variance and
  standard deviation (SD).
Relationship:
                $\text{SD} = \sqrt{\text{Variance}}$


                                         43
         The typical letters
• Sample:
  Variance = $s^2$,   SD = $s$


• Population:
  Variance = $\sigma^2$,   SD = $\sigma$


                               44
            Idea of variance
• How far away are the observations, on
  average, from the mean?




                                          45
              Deviations




                 $x_i - \bar{x}$

Difference between ith observation and mean
                                          46
         Formula

    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

               Scary, isn’t it?

                                  47
         Calculating Variance
$X_i$   $\bar{X}$   $X_i - \bar{X}$   $(X_i - \bar{X})^2$
1       3.8         -2.8              7.84
2       3.8         -1.8              3.24
4       3.8          0.2              0.04
5       3.8          1.2              1.44
7       3.8          3.2             10.24
                    ____             ____
Sum                  0               22.80

n = 5,  n - 1 = 4
$s^2 = 22.80 / 4 = 5.7$
$s = \sqrt{5.7} \approx 2.39$
                                                    48
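The same calculation can be followed step by step in Python. This is only a sketch that mirrors the columns of the table; the variable names are illustrative.

```python
import math

data = [1, 2, 4, 5, 7]
n = len(data)

mean = sum(data) / n                              # 3.8
deviations = [x - mean for x in data]             # -2.8, -1.8, 0.2, 1.2, 3.2
squared_devs = [d ** 2 for d in deviations]       # 7.84, 3.24, 0.04, 1.44, 10.24

variance = sum(squared_devs) / (n - 1)            # 22.80 / 4
sd = math.sqrt(variance)

print(round(variance, 2), round(sd, 2))  # 5.7 2.39
```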
Deviations from the mean




                           49
               Why square?
• Sum of deviations (not squared) is just 0.
  Squaring the deviations converts the negative
  deviations to positive numbers...
• Summing squares is a natural operation; our
  eyes do it all the time with no help from our
  brains…          (Just kidding, Professor Brown)




                                                50
How far apart are these 2 points?
[Figure: the points (1,2), (4,2), and (4,6) form a right triangle with legs of length 3 and 4]

    $\sqrt{(4-1)^2 + (6-2)^2} = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5$

                                          51
          Why divide by n-1 ?
• It’s unimportant if n is large.
• It’s not that important in any case.
• What happens when n = 1?
  – You shouldn’t be trying to estimate a variance
    from a sample of size 1!
• Dividing by n-1 gives an unbiased estimate of
  variance.     (More about this later…)



                                                     52
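One way to see what “unbiased” means is to average the variance estimate over every possible sample. The sketch below is an illustration under stated assumptions, not the course's own argument: it treats the five numbers 1, 2, 4, 5, 7 as a hypothetical population and enumerates all samples of size 2 drawn with replacement.

```python
from itertools import product

population = [1, 2, 4, 5, 7]        # hypothetical population
N = len(population)
mu = sum(population) / N
pop_var = sum((x - mu) ** 2 for x in population) / N   # population variance

def sample_var(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor

n = 2
samples = list(product(population, repeat=n))   # all 25 equally likely samples

avg_n_minus_1 = sum(sample_var(s, n - 1) for s in samples) / len(samples)
avg_n = sum(sample_var(s, n) for s in samples) / len(samples)

# Dividing by n-1 matches the population variance on average; dividing by n is too small.
print(pop_var, avg_n_minus_1, avg_n)
```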
           More practice
100, 100, 100, 100, 100, 100, 100
         Here s = 0.

90, 90, 90, 100, 110, 110, 110

         Here s = 10.



                                    53
      Robustness of IQR vs. SD
• IQR is robust; SD is not.




                                 54
                Some simple rules
Start with a variable X having mean $\bar{x}$ and SD $s_x$.

Add 3 to each value, getting a new variable Y:
$Y_i = X_i + 3$.

What are $\bar{y}$ and $s_y$?

                       $\bar{y} = \bar{x} + 3$,
                       $s_y = s_x$   (no change in SD)

                                                     56
             Some simple rules
How about multiplying?

Multiply each value by 3, getting a new variable Z:
$Z_i = 3 X_i$.

What are $\bar{z}$ and $s_z$?
                              $\bar{z} = 3\bar{x}$,
                              $s_z = 3 s_x$,
                              $s_z^2 = 9 s_x^2$


                                                    58
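Both rules (adding a constant, multiplying by a constant) can be checked numerically. The sketch below uses a small made-up data set; any numbers would do.

```python
import math
from statistics import mean, stdev

x = [1, 2, 4, 5, 7]            # illustrative data
y = [xi + 3 for xi in x]       # add 3 to each value
z = [3 * xi for xi in x]       # multiply each value by 3

# Adding shifts the mean and leaves the SD alone.
print(mean(x), mean(y), stdev(x), stdev(y))
# Multiplying scales both the mean and the SD by 3 (and the variance by 9).
print(mean(z), stdev(z), stdev(z) ** 2 / stdev(x) ** 2)
```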
  Nonlinear transformations have a more
            complicated effect
How about squaring?

Square each value, getting a new variable W:
$W_i = X_i^2$.

Is $\bar{w} = (\bar{x})^2$?

I.e. (mean of the square) = (square of the mean)?


                           No: in general $\bar{w} \neq (\bar{x})^2$


                                                    60
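A small numerical counterexample makes the point concrete. The data below are made up for illustration.

```python
from statistics import mean

x = [1, 2, 4, 5, 7]                 # illustrative data
w = [xi ** 2 for xi in x]           # 1, 4, 16, 25, 49

mean_of_squares = mean(w)           # 95 / 5 = 19
square_of_mean = mean(x) ** 2       # 3.8 squared, about 14.44

print(mean_of_squares, square_of_mean)
```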
Density curves
             Idealized, smoothed
             histogram. Limit of
             large population.
             (population $\to \infty$)




                            61
   Areas correspond to
proportions of population




                            62
           Why “idealized”?
• No such thing as a precisely normally
  distributed population.




                                          63
      Example: Uniform density
 E.g. Uniform on the interval [0, 10]

                   flat

Height? 



             0                          10
(Area under a density) = 1                   64
        Standard Normal density

    $y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$

[Figure: the standard Normal density curve, x from -4 to 4, y from 0 to 0.4]
                                                         65
    General Normal densities




Interpretation of $\sigma$ in terms of inflection points
                                             66
Notation for Normal distributions

 Normal distribution with mean $\mu$ and SD $\sigma$ is
 denoted $N(\mu, \sigma)$.

 If the variable X has a $N(\mu, \sigma)$ distribution, we
 write $X \sim N(\mu, \sigma)$.




                                                        67
            “68, 95, 99.7 rule”




This picture is for a standard Normal distribution N(0,1)
                                                     68
     68, 95, 99.7 rule for $N(\mu, \sigma)$
• 68% of the population is within 1 SD of the
  mean (i.e. between $\mu - \sigma$ and $\mu + \sigma$)
• 95% of the population is within 2 SD’s of the
  mean (i.e. between $\mu - 2\sigma$ and $\mu + 2\sigma$)
• 99.7% of the population is within 3 SD’s of
  the mean (i.e. between $\mu - 3\sigma$ and $\mu + 3\sigma$)



                                             69
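The three percentages in the rule are rounded areas under the Normal curve. A minimal Python sketch using the standard library's `statistics.NormalDist` (Python 3.8 or later) recovers them; the course itself uses Minitab or tables for this.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Area within k SDs of the mean, for k = 1, 2, 3, as a percentage.
percent_within = [
    round(100 * (std_normal.cdf(k) - std_normal.cdf(-k)), 1)
    for k in (1, 2, 3)
]

print(percent_within)  # [68.3, 95.4, 99.7]
```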
                  Example
Assume verbal SAT scores have approximately
 a N(505,110) distribution.

What percentile is a score of 615?




                                          70
                  Answer
About 84th percentile.

615 is 1 SD above the mean…

     …blackboard...




                              71
More precise answers to more general questions
  like this using Minitab or Normal Tables
• Minitab
  – Do the 615 example
  – How about 680?
• Tables
  – A bit anachronistic now, but useful for desert
    islands and exams...




                                                     72
            Normal tables




See Table A in textbook.   (Show transparency in class)
                                                          73
Sir Francis Galton (1822-1911) on the Normal distribution

I know of scarcely anything so apt to impress the
imagination as the wonderful form of cosmic order
expressed by the "Law of Frequency of Error." The law
would have been personified by the Greeks and deified, if
they had known of it. It reigns with serenity and in
complete self-effacement, amidst the wildest confusion.
The huger the mob, and the greater the apparent anarchy,
the more perfect is its sway. It is the supreme law of
Unreason. Whenever a large sample of chaotic elements
are taken in hand and marshaled in the order of their
magnitude, an unsuspected and most beautiful form of
regularity proves to have been latent all along.
                                                    74
  Stat 10x
   J. Chang
Tuesday, 9/18/01




                   75
                          Densities
Remember densities describe the distrib of a
variable in a large population.
[Figure: standard Normal density curve, x from -4 to 4, y from 0 to 0.4; total area under the curve = 1]
                                                           76
   Areas give fractions of population

                                      Area = 0.82




                 -1               2
E.g. What is the fraction of the pop having X between -1 and 2?
                                                            79
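The area quoted on this slide can be reproduced as a difference of two cumulative probabilities. This is a sketch that assumes X is standard Normal, as in the density pictured earlier in the lecture.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, SD 1

# Fraction of the population with X between -1 and 2:
# (area to the left of 2) minus (area to the left of -1).
area = std_normal.cdf(2) - std_normal.cdf(-1)

print(round(area, 2))  # 0.82
```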
            Standard Normal density: The “bell curve”
                     (Also called “Gaussian”)

    $y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$

[Figure: the standard Normal density curve, x from -4 to 4, y from 0 to 0.4]
                                                         80
      68, 95, 99.7 rule for $N(\mu, \sigma)$
This “rule” is just 3 numbers to memorize:

• 68% of the population is within 1 SD of the
  mean (i.e. between $\mu - \sigma$ and $\mu + \sigma$)
• 95% of the population is within 2 SD’s of the
  mean (i.e. between $\mu - 2\sigma$ and $\mu + 2\sigma$)
• 99.7% of the population is within 3 SD’s of
  the mean (i.e. between $\mu - 3\sigma$ and $\mu + 3\sigma$)

                                             84
                Example
Suppose verbal SAT scores have approximately a $N(\mu, \sigma)$
  distribution with $\mu = 505$ and $\sigma = 110$.
What is the percentile of the score 615?




                                          85
                   Answer
Start by drawing a picture




                             86
                     Answer
615 is   1   SD above the mean. (615 = 505 + 110)

Want this area:




                                               88
                   Answer
This is the same as the area to the left of 1 under a
 standard Normal density curve




                                             89
   Answer (cont)


              68 %



 16 %            16 %


Answer = 16% + 68% = 84%
                           90
 Doing the problem with Minitab
Use Calc > Probability Distributions > Normal
 and fill in 505 and 110 for mean and SD.
Do a cumulative probability, for 615.

Cumulative Distribution Function
Normal with mean = 505.000 and standard
  deviation = 110.000

         x     P( X <= x)
  615.0000        0.8413
                                          91
         Using a Normal Table
Useful for desert islands and exams.
Tables typically give cumulative probabilities.
I showed you a table on the overhead…

Textbook explains and gives examples on pp. 75-79.




                                                     92
            Another problem
Want percentile for 680.
Not a neat use of 68, 95, 99.7 rule.




                                       680
                                             93
    To use a Normal table for this problem

Score is

    $\frac{680 - 505}{110} = \frac{175}{110} \approx 1.59091$

SD’s above the mean.

Now use standard Normal table.




                                             94
 Standardizing and z-scores (Just terminology)

Let x be an observation from a distribution with mean
$\mu$ and SD $\sigma$.
How many SD’s is x above the mean?
Standardized value (“z-score”):

    $z = \frac{x - \mu}{\sigma}$

 These are nice for comparing “extremenesses” of
 otherwise incomparable quantities                 95
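Putting the last few slides together, the percentile for a score of 680 can be sketched in Python: standardize, then look up the cumulative probability for the standard Normal. The helper name `z_score` is hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """How many SDs the observation x lies above the mean."""
    return (x - mu) / sigma

z = z_score(680, mu=505, sigma=110)
percentile = NormalDist().cdf(z)   # area to the left of z under N(0,1)

print(round(z, 5))           # 1.59091
print(round(percentile, 3))  # roughly the 94th percentile
```

Standardizing first and using N(0,1) gives the same answer as asking for the cumulative probability at 680 under N(505, 110) directly.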
 Normal probability plots (or quantile plots)
[M&M pp. 79-83]
Have some data on a variable.
Is it believable that the data came from a Normal
  population? If not, in what way does the pop
  distrib differ from a Normal?

How can we see this?


                                              96
 Idea of Normal probability plots

Change the problem of judging whether a
histogram looks like a “Normally shaped hump”
into judging how well some points fall along a
straight line.




                                           97
         E.g.: Gold plating thickness on circuit boards




unit = $10^{-6}$ inch
                                                          98
99
E.g. Water runoff in Arizona




                               100
Normal probability plot for Runoff




                                     101
            Idea of quantile plots
• Plot the observed values of the variable vs.
  “where we would expect them to be if they
  came from a (standard) Normal distribution.”
• (Which Normal distrib to use? Doesn’t matter since they are
  all linearly related to each other.)
• Quantile plots use percentiles of the Normal
  distribution.
• Roughly, plot the ith smallest observation vs. the 100(i/n)th
  percentile of the N(0,1) distribution.
                                                          102
       Idea of quantile plots (cont)
• Idea is to plot
   – median of data vs. median of N(0,1)
   – 10th percentile of data vs 10th percentile of N(0,1)
   – etc.
• Need to use a precise definition of sample
  percentiles; there are several variations.



                                                     103
               Something like this...
E.g. given sorted data 10.8 24.2 35.8 36.1 49.5

Plot 35.8 [median of data] vs 0.0 [median of N(0,1)]
Plot 10.8 [20th % of data] vs ?? [20th % of N(0,1)]
Plot 24.2 [40th % of data] vs ?? [40th % of N(0,1)]
.....
Plot 49.5 [100th % of data] vs ?? [100th % of N(0,1)]


                                    oops
                                                   104
That needs a little fixing. Several ways to go...

Instead of using fractions 1/5, 2/5, 3/5, 4/5, 5/5,
  could use e.g.
      1/6, 2/6, 3/6, 4/6, 5/6
or    1/10, 3/10, 5/10, 7/10, 9/10
or    ...

These are different options in Minitab. Makes
 little difference if n is large.
                                                 105
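The two alternative sets of fractions above can be turned into N(0,1) percentiles with a few lines of Python. This is a sketch; the names `option_a` and `option_b` are illustrative and are not Minitab's names for these options.

```python
from statistics import NormalDist, StatisticsError

std_normal = NormalDist()
n = 5

option_a = [i / (n + 1) for i in range(1, n + 1)]             # 1/6, 2/6, ..., 5/6
option_b = [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]   # 1/10, 3/10, ..., 9/10

scores_a = [std_normal.inv_cdf(p) for p in option_a]
scores_b = [std_normal.inv_cdf(p) for p in option_b]

print([round(s, 2) for s in scores_a])
print([round(s, 2) for s in scores_b])

# The naive fraction 5/5 fails: the 100th percentile of N(0,1) is not finite.
try:
    std_normal.inv_cdf(5 / 5)
except StatisticsError:
    print("no finite 100th percentile")
```

Both options are symmetric about 0 and differ mainly in the tails, which is why the choice matters little once n is large.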
            N(0,1) percentiles
E.g. for n=5 and first option, N(0,1) percentiles look like


             •-1           •0           •1




and for n= 20,


             •-1          •0            •1



                                             Blackboard…
                                                       106
[Figure: Normal probability plot. Horizontal axis: Data (0 to 20); vertical axis: percent on a Normal probability scale (1 to 99), with 16.7, 33.3, 50.0, 66.7, and 83.3 marked]
                                        107
E.g. Water runoff in Arizona




                               108
Normal probability plot for Runoff




                                     109
Log(Runoff)




              110
111
  Today: describing joint distribution
           of two variables
• Scatterplots
• Correlation
• Regression




                                     112
        Address questions like
• How strong a linear relationship is there
  between two variables?
  – E.g. when height increases, does weight also tend
    to increase?
  – E.g. How about weight and pulse?
• If we know the value of one variable for an
  individual, how can we best predict the value
  of another variable for that individual?

                                                  113
                Scatterplots
Plot two variables simultaneously.
Put one variable on horizontal axis,
  other variable on vertical axis.




                                       114
               E.g. weight vs. height
[Figure: scatterplot of weight (vertical axis, roughly 100 to 200) against height (horizontal axis, roughly 55 to 75)]
                                        115
              E.g. pulse vs. weight
[Figure: scatterplot of pulse (vertical axis, roughly 40 to 120) against weight (horizontal axis, roughly 100 to 200)]
                                       116
                Correlation
• Measures the “strength of the linear
  relationship” between two variables.




                                         117
          Small correlation

[Figure: scatterplot, axes from about -2 to 3]

                    r = 0.06
                                       118
Highly correlated variables
[Figure: scatterplot, axes from about -2 to 3]

                                r = 0.99
                                    119
     Moderate correlation

[Figure: scatterplot, horizontal axis from about -3 to 3, vertical axis from about -2 to 2]

                                    r = 0.55
                                        120
                    Negative correlation
[Figure: two scatterplots side by side, axes from about -3 to 3]

          r = -0.52 (left)                     r = -0.96 (right)

                                                                121
          Zero correlation

[Figure: scatterplot, axes from about -2 to 3]

          Positive or negative?       r = 0.03
                                          122
       Definition of correlation
First standardize variables:
Instead of $x_i$ and $y_i$ look at $\frac{x_i - \bar{x}}{s_x}$ and $\frac{y_i - \bar{y}}{s_y}$,

  i.e. How many SD’s is each observation above the
  mean?
Then do this...


                                                 123
    …Definition of correlation

      $r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

That is:
      standardize each xi and yi ,
      multiply, and
      “average”
                                           124
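The three steps (standardize, multiply, “average”) translate directly into code. The function below is a minimal sketch with a hypothetical name and made-up data; in practice a statistics package would be used.

```python
import math

def correlation(xs, ys):
    """Correlation r: standardize each x and y, multiply, and 'average' with n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

xs = [1, 2, 4, 5, 7]                       # illustrative data
r_same = correlation(xs, xs)               # y = x: perfect positive relationship
r_neg = correlation(xs, [-2 * x + 1 for x in xs])   # exact line with negative slope

print(round(r_same, 6), round(r_neg, 6))
```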
         Notes about correlation
• Correlation is “dimensionless”
• Since we can rewrite the definition
      $r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
  as
      $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$

  we can see that r is between -1 and 1.
       E.g. if $y_i = x_i$ for all i, then r = 1.

                                                     125
       Rough idea of definition
• Draw picture...




                                  126
  A small example worked out in
              detail
• Blackboard...




                              127
  Stat 10x
   J. Chang
Thursday, 9/20/01




                   128
                 Correlation
• Measures strength and direction of linear
  relationship between two variables.
• Between -1 and +1.
• +1 : perfect linear relationship, positive slope
• -1 : perfect linear relationship, negative slope




                                                132
    …Definition of correlation

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right) $$

That is:
      standardize each xi and yi ,
      multiply, and
      “average”
                                           133
          Rough idea of definition
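Split the scatterplot into four quadrants at $(\bar{x}, \bar{y})$.

Sign of $\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$ ?

  upper left:   $x_i - \bar{x} < 0,\ y_i - \bar{y} > 0$    −
  upper right:  $x_i - \bar{x} > 0,\ y_i - \bar{y} > 0$    +
  lower left:   $x_i - \bar{x} < 0,\ y_i - \bar{y} < 0$    +
  lower right:  $x_i - \bar{x} > 0,\ y_i - \bar{y} < 0$    −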

                                                       134
A small example worked out in detail
             by hand
• Did this in detail on the blackboard…




                                          135
      Correlation and Regression
Sir Francis Galton (1822-1911)
              Scientist and explorer
              Cousin of Charles Darwin
              Studied heredity, intelligence, eugenics …
              IQ estimated at 200
              Invented quincunx
              Idea of branching processes
              Statistical study of efficacy of prayer


                                                 136
         Fathers and sons data


                                          Correlation
                                          r ≈ 0.5




What is average height of a son whose father is 72” ?
                                                  137
  Descriptive statistics on father-son
                 data
• Fathers: Mean = 68” SD = 3”
• Sons: Mean = 69” SD = 3”
• Average height of son if father is 72” ?
A natural guess:
  Father is 4/3 SD’s above mean, so guess
  son will be 4/3 SD’s above mean, or 73”


                                             138
How is 73” as a guess?




                         139
(Just in case you like 72” for some
               reason)




                                  140
Best guess depends on correlation
Guess that son will be,
 not 4/3 SD’s above mean,
 but correlation × 4/3 = 2/3 SD’s above mean
 (here r ≈ 0.5).

That is, in our example, guess son’s height to be
 69 + (2/3) × 3 = 71 inches.


                                              141
“Natural” guess vs. “best” guess


   73
   71




                               142
               Another example
 Use LSAT scores to predict 1st-year final exam scores
 Historical data:
   X = LSAT scores: mean 650, SD 80
   Y = final scores: mean 65, SD 10
   correlation: r = 0.4
 Question: Predict final score for student with x = 750.

• Step 1
Standardize: x is $\frac{750 - 650}{80} = \frac{100}{80} = 1.25$ SD’s above mean
                                                     143
                   Example            (cont)
X = LSAT scores: mean 650, SD 80
Y = final scores: mean 65, SD 10            correlation: r = 0.4
Question: Predict final score for student with x = 750.

Natural (bad): Guess Y to be 1.25 SD’s above its mean,
  or 65 + 1.25 * 10 = 77.5


Best: Guess Y to be 0.4 * 1.25 = 0.5 SD’s above its
mean, or 65 + 0.5 * 10 = 70.


                                                              144
Equation of the regression line
Just a formula for all the best guesses:
    $$ \frac{y - 65}{10} = (0.4)\,\frac{x - 650}{80} $$

In general:
    $$ \frac{y - \bar{y}}{s_Y} = r\,\frac{x - \bar{x}}{s_X} $$
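
As a rough check, the regression-line formula can be written as a small Python function and applied to the two examples from these slides. The function name `predict_y` and its argument names are illustrative assumptions.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Best guess for y: go r times as many SDs above the mean of y as x is above the mean of x."""
    z_x = (x - mean_x) / sd_x            # standardize x
    return mean_y + r * z_x * sd_y       # shrink by r, convert back to y units

# Father-son example: fathers mean 68, SD 3; sons mean 69, SD 3; r about 0.5
son_guess = predict_y(72, 68, 3, 69, 3, 0.5)

# LSAT example: LSAT mean 650, SD 80; final exam mean 65, SD 10; r = 0.4
final_guess = predict_y(750, 650, 80, 65, 10, 0.4)
```

With r = 1 the function would return the "natural" guess (73 inches, 77.5 points); any weaker correlation pulls the guess back toward the mean of y.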
                                           145
        The “regression fallacy”
In training, air force pilots make two practice landings
   with instructors and are rated on performance. The
   instructors discuss the ratings with the pilots after
   each landing. Statistical analysis shows that the
   pilots who make poor landings the first time tend to
   do better the second, and those who make good
   landings the first time tend to do worse on the second
   try.
The conclusion: criticism helps the pilots, while praise
   tends to make them do worse. As a result, instructors
   were ordered to criticize all landings, good or bad.
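
Why this conclusion is a fallacy can be illustrated with a small simulation in which feedback has no effect at all. The model below is an assumption made up for illustration: each landing score is a pilot's fixed skill plus independent luck, so the two scores are correlated but imperfectly.

```python
import random
import statistics

rng = random.Random(0)                   # fixed seed so the run is repeatable
n_pilots = 10_000

# Hypothetical model: score = skill + luck; nothing an instructor says changes anything.
skill = [rng.gauss(0, 1) for _ in range(n_pilots)]
first = [s + rng.gauss(0, 1) for s in skill]
second = [s + rng.gauss(0, 1) for s in skill]

# Rank pilots by first landing; take the worst and best 10%.
order = sorted(range(n_pilots), key=lambda i: first[i])
worst = order[: n_pilots // 10]
best = order[-(n_pilots // 10):]

def mean_change(group):
    """Average (second score - first score) within a group of pilots."""
    return statistics.mean([second[i] - first[i] for i in group])

worst_change = mean_change(worst)        # tends to be positive: they "improve"
best_change = mean_change(best)          # tends to be negative: they "get worse"
```

In this sketch the poor first-landers improve and the good ones decline purely because the correlation between the two landings is less than 1, the same effect as in the father-son heights.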
                                                     146
     Regression and least squares
Imagine fitting a line through some data


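[Figure: scatterplot with a fitted line; the residual is the vertical distance from a point to the line]

residual = (observed y) - (fitted y)
    $$ r_i = y_i - \hat{y}_i $$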


                                                                    147
(“Predicted” or “fitted” y’s) & (“error”
             or “residual”)




                                      148
     The least squares criterion
Want residuals small: Minimize sum of squared
residuals

[Figure: the same scatterplot with two candidate lines: a bad fit (left) and a better fit (right)]
                                                          149
                Flat lines...
[Figure: scatterplot with a horizontal line y = c]

Q: Which c gives the least-squares fit?

A: $c = \bar{y}$
                    …another property of the mean
                                             150
                            r2

Which is smaller:  $\sum (y_i - \hat{y}_i)^2$  or  $\sum (y_i - \bar{y})^2$ ?
       Hint...

Answer: $\sum (y_i - \hat{y}_i)^2 \le \sum (y_i - \bar{y})^2$

$r^2$ measures the improvement:
    $$ \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - r^2 $$
                                                             151
                  Interpretation
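Start with
    $$ \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - r^2 $$
That is,
    $$ \frac{\text{“SD of } y_i\text{'s about regression line”}}{\text{SD of } y_i\text{'s [about mean } \bar{y}\text{]}} = \sqrt{1 - r^2} $$
That is,
    (SD of $y_i$'s about regression line) = $\sqrt{1 - r^2}$ × (SD of $y_i$'s)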
                                                 152
           Bivariate normal distributions


[Figures: surface plot of a bivariate normal density, and a scatterplot of sample data]

                                                         153
 Distributions within vertical strips
 in a bivariate Normal distribution
Consider y values in a narrow vertical strip at x.
These have
    mean = $\bar{y} + r s_Y \left(\frac{x - \bar{x}}{s_X}\right)$,     SD = $\sqrt{1 - r^2}\, s_Y$
• SD within a strip is always ≤ $s_Y$
      ($s_Y$ is SD over all individuals)
• If r = 1 then SD in a strip is 0
• If r = 0 then SD in a strip is same as $s_Y$
                                              154
                 Example
                  mean    SD
  LSAT scores     650     80
  final exams     65      10
  correlation: r = 0.4

1. What percentage of students score over 75 on
final exam?

Easy: 75 is (75 - 65)/10 = 1 SD above mean.
Answer is 1 − Φ(1) ≈ 0.16      (16 %),
where Φ(1) is the Standard Normal table value.
                                                     155
                 Example (cont.)
                  mean    SD
  LSAT scores     650     80
  final exams     65      10
  correlation: r = 0.4

2. Among students who get 750 on LSAT, what fraction
get over 75 on final exam?
In strip at x = 750 (standard score = 1.25):
these students have mean = 70
and
    SD = $\sqrt{1 - r^2}\, s_Y = \sqrt{1 - (0.4)^2} \times 10 \approx 9.165$

We want fraction of N(70, 9.165) distrib to the right of 75.
Standard score for 75 is (75-70)/9.165 = 0.546.
Answer: 1 − Φ(0.546) ≈ 0.29.      (Compare previous 0.16)
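
Both answers can be reproduced without a Normal table. The sketch below assumes the bivariate Normal model from the previous slide and computes the standard Normal CDF from `math.erf`; the helper name `normal_cdf` is an assumption for illustration.

```python
import math

def normal_cdf(z):
    """Standard Normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_x, sd_x = 650.0, 80.0     # LSAT scores
mean_y, sd_y = 65.0, 10.0      # final exam scores
r = 0.4

# Question 1: fraction of all students scoring over 75
frac_all = 1.0 - normal_cdf((75 - mean_y) / sd_y)

# Question 2: fraction over 75 among students in the strip at x = 750
x = 750.0
strip_mean = mean_y + r * sd_y * (x - mean_x) / sd_x
strip_sd = math.sqrt(1 - r ** 2) * sd_y
frac_strip = 1.0 - normal_cdf((75 - strip_mean) / strip_sd)
```

The strip has a higher mean (70 rather than 65) and a slightly smaller SD, which is why the fraction above 75 rises from about 0.16 to about 0.29.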
                                                     156
      A Pythagorean identity
$$ \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 = \sum (y_i - \bar{y})^2 $$


Ignoring divisions by n-1, this says:

Variance of fitted values (around mean)
  + Variance of y’s around fitted values
  = Variance of y’s.
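
The identity is easy to check numerically. The sketch below fits a least-squares line to a small made-up data set (the numbers and the helper `fit_line` are assumptions for illustration) and compares the three sums of squares.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

ss_fitted = sum((f - mean_y) ** 2 for f in fitted)         # fitted values around the mean
ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # y's around fitted values
ss_total = sum((y - mean_y) ** 2 for y in ys)              # y's around the mean

# Correlation from the product formula, for comparison with ss_fitted / ss_total
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
r = sxy / math.sqrt(sxx * ss_total)
```

For these numbers the fitted and residual sums of squares add up to the total, and the ratio `ss_fitted / ss_total` agrees with `r ** 2`.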

                                                     157
 Interpretation of r-squared as the
  “fraction of variance explained
         by the regression”


$$ r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = \frac{\text{Variance of } \hat{y}_i\text{'s}}{\text{Variance of } y_i\text{'s}} $$


Easily derived from the equation of the
regression line, which we know…
                             Homework?
                                                   158
        Notes about regression
• Least-squares regression is not robust
  (resistant)
• Two kinds of interesting points:
   – Outlier : a point with a large residual
   – Influential point : if removed, causes a large
     change in the regression line



                                               159
          A little example
   x     y
   1     0
   0     1
  -1     0
   0    -1
  10    10

[Scatterplot of the five points, marked with a “?”; both axes run from about 0 to 10]



                                      160
         little example (cont)
[Scatterplot: y vs. x, both axes from 0 to 10]

Outlier? No.               Influential? Yes.
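
The "No / Yes" answers can be checked by fitting the least-squares line with and without the point (10, 10). The helper `fit_line` below is an illustrative assumption; the five points are the ones in the little example.

```python
def fit_line(points):
    """Least-squares intercept and slope for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

points = [(1, 0), (0, 1), (-1, 0), (0, -1), (10, 10)]

# Fit using all five points
intercept_all, slope_all = fit_line(points)
residual_far = 10 - (intercept_all + slope_all * 10)   # residual of the point (10, 10)

# Fit after removing (10, 10)
intercept_wo, slope_wo = fit_line(points[:-1])
```

With all five points the slope is close to 1 and the residual at (10, 10) is small, so it is not an outlier; without it the slope drops to 0, so it is highly influential.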
                                               161
162
With and without Child 18




                            163
                   Next...
• Lurking variables
• The perils of aggregation
• Simpson’s paradox




                              164
                   Stat 10x
                   J. Chang
                Tuesday, 9/25/01



To understand God’s thoughts we must study
statistics, for these are the measure of His purpose.
                             Florence Nightingale
                                               165
               E.g. weight vs. height


[Scatterplot: weight (vertical axis, about 100 to 200) vs. height (horizontal axis, about 55 to 75)]
                                        166
              E.g. pulse vs. weight
[Scatterplot: pulse (vertical axis, about 40 to 120) vs. weight (horizontal axis, about 100 to 200)]
                                       167
                 Correlation
• Measures strength and direction of linear
  relationship between two variables.
• Between -1 and +1.
• +1 : perfect linear relationship, positive slope
• -1 : perfect linear relationship, negative slope




                                                168
    …Definition of correlation

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right) $$

That is:
      standardize each xi and yi ,
      multiply, and
      “average”
                                           169
         Rough idea of definition
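Split the scatterplot into four quadrants at $(\bar{x}, \bar{y})$.

Sign of $\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$ ?

  upper left:   $x_i - \bar{x} < 0,\ y_i - \bar{y} > 0$    −
  upper right:  $x_i - \bar{x} > 0,\ y_i - \bar{y} > 0$    +
  lower left:   $x_i - \bar{x} < 0,\ y_i - \bar{y} < 0$    +
  lower right:  $x_i - \bar{x} > 0,\ y_i - \bar{y} < 0$    −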




                                                           170
         Fathers and sons data




                                          Correlation
                                          r ≈ 0.5


What is average height of son whose father is 72” ?
                                                  171
  Descriptive statistics on father-son
                 data
• Fathers: Mean = 68” SD = 3”
• Sons: Mean = 69” SD = 3”
• Average height of son if father is 72” ?
A natural guess:
  Father is 4/3 SD’s above mean, so guess
  son will be 4/3 SD’s above mean, or 73”


                                             172
Best guess depends on correlation
Guess that son will be,
 not 4/3 SD’s above mean,
 but correlation × 4/3 = 2/3 SD’s above mean.

That is, in our example, guess son’s height to be
 69 + (2/3) × 3 = 71 inches.


                                              173
Equation of the regression line
Just a formula for all the best guesses:


    $$ \frac{y - \bar{Y}}{s_Y} = r\,\frac{x - \bar{X}}{s_X} $$



                                           174
         Least squares regression
Imagine fitting a line through some data


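[Figure: scatterplot with a fitted line; the residual is the vertical distance from a point to the line]

residual = (observed y) - (fitted y)
    $$ r_i = y_i - \hat{y}_i $$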


                                                                    175
        The least squares criterion
Want residuals small: Minimize sum of squared
residuals

[Figure: the same scatterplot with two candidate lines: a bad fit (left) and a better fit (right)]
                                                                176
“fraction of variance explained by the
              regression”

$$ r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = \frac{\text{Variance of } \hat{y}_i\text{'s}}{\text{Variance of } y_i\text{'s}} $$




     Easily derived from the equation of the
     regression line, which we know…
                        Homework?


                                               177
        Notes about regression
• Least-squares regression is not robust
  (resistant)
• Two kinds of interesting points:
   – Outlier : a point with a large residual
   – Influential point : if removed, causes a large
     change in the regression line



                                               178
179
With and without Child 18




                            180
             Lurking variables
     (lurk: to lie hidden, as in ambush)
A variable that has an important effect but was
overlooked.
Danger: Confounding
[Thinking an effect is due to one variable when it is
better explained by another (lurking) variable.]

1971 study: People who drink a lot of coffee have a higher
incidence of bladder cancer.
      Correlation noticed. Causation?
                                                        181
      Lurking variables (cont.)
1993: A larger study concluded that, after
adjusting for the effects of smoking, there was no evidence
for increased risk from coffee.

“Spurious correlations”
     The correlation is real, but causation isn’t.



                                               182
      Lurking variables (cont.)
Lurking var’s can also hide “real” correlations.




               (...or even reverse correlations)   183
More on the perils of aggregation:
      Simpson’s paradox
Categorical data
           Hospital A      Hospital B
Died     300                50
Survived 3000              1000

If you needed surgery, which hospital would you
prefer?

                                           184
       Simpson’s paradox (cont.)
                       Hospital A   Hospital B
              Died     300             50
              Survived 3000           1000

        Good condition                           Bad condition
       Hospital A Hospital B                     Hospital A Hospital B
Died      5         10                Died     295           40
Survived 1000       800               Survived 2000          200

Maybe…
Hospital A: university medical center, attracts seriously ill
patients from wide area
Hospital B: local, fewer seriously sick patients.             185
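
The reversal can be verified by computing death rates from the tables above, overall and within each condition. This is a small sketch; the dictionary layout and names are assumptions, and the counts are those on the slide.

```python
from fractions import Fraction

# counts[condition][hospital] = (died, survived), copied from the tables
counts = {
    "good": {"A": (5, 1000), "B": (10, 800)},
    "bad": {"A": (295, 2000), "B": (40, 200)},
}

def death_rate(died, survived):
    """Fraction of patients who died."""
    return Fraction(died, died + survived)

# Death rate within each condition
by_condition = {
    cond: {hosp: death_rate(*cell) for hosp, cell in hospitals.items()}
    for cond, hospitals in counts.items()
}

# Death rate after aggregating over condition
overall = {}
for hosp in ("A", "B"):
    died = sum(counts[cond][hosp][0] for cond in counts)
    survived = sum(counts[cond][hosp][1] for cond in counts)
    overall[hosp] = death_rate(died, survived)

for hosp in ("A", "B"):
    print(hosp, "overall:", float(overall[hosp]),
          "good:", float(by_condition["good"][hosp]),
          "bad:", float(by_condition["bad"][hosp]))
```

Hospital A has the higher death rate overall but the lower death rate within both the good-condition and bad-condition groups, because most of its patients arrive in bad condition.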
      Simpson’s paradox (cont.)
Another (real) example:
U.C. Berkeley, 1970’s
Committee searched for discrimination -- higher
percentage of male applicants accepted into grad school
than female.
Looking at individual dept’s, no evidence of admitting
men more than women -- if anything the reverse. ???

Men were applying more to dept.’s with higher
acceptance rates, women applying more to dept’s
that were harder to get into.               186
     Next Topic: “Producing Data”
      Sampling and Experimental
                Design
•   3 Principles of Experimental Design
•   Simple random samples
•   Bias, variance
•   Stratified sampling and blocking


                    Moore and McCabe Chapter 3.


                                              187
   Observation versus experiment
Both attempt to study relationship between an
  “explanatory variable” and a “response variable”

• Experiment: deliberately impose “treatments” on
  individuals to observe their responses.
• Observational study: observe and measure what
  participants do naturally

                  “experimental units” or “subjects”

                                                       188
         An example experiment
• Wangensteen (1958): Gastric freezing.
  Experiment reported in JAMA: treatment reduced
  ulcer pain. 24 patients; all said they felt better.
  Technique widely used for several years. OK?
• Several years later: a different, larger study with a
  control group. Results:
   – 34 % in treatment group improved.
   – 38 % in control group improved.


• Salk vaccine trial…
                                                      189
         Principle 1: Control or
              Comparison
• Comparison of different treatments.
• Want different treatment groups to be as similar as
  possible -- except for the treatments applied.
• Control effects of environmental or outside variables.
   – Outside influences act the same on the different
     treatment groups. (E.g. placebo effect)




                                                    190
                           Bias
How to assign experimental units to treatments?
E.g.
   – in comparing two medical treatments don’t want to assign
     one treatment to sicker patients
   – comparing seed varieties: don’t plant one in more fertile
     ground

A study is biased if it systematically favors certain
  outcomes.

How to avoid bias? Elaborate balancing?
                                                          191
     Principle 2: Randomization
Assign treatments randomly.

Fair -- doesn’t give any treatment a systematic
  advantage.

But randomization balances out well only in the
 “long run.” So…

                                                 192
      Principle 3: Replication, or
              Sample size
Use sample sizes big enough so that we will be able to
distinguish a real effect from random “luck.”




                                                    193
    It’s hard to be random
       not very creative

0011110101000000110110101
00101100000100111110000011
00100110100110011000011000
11011011111110010010110100
10110110110001011001010001
00000011001111101000100001
11011010110001100111010110
1010000000010101100
                 getting tired…   196
       Simple random samples
Def: A simple random sample of size n is a set
 of n individuals from a population chosen in
 such a way that each set of n individuals has
 an equal chance to be the sample actually
 selected.



        Abbreviate: “simple random sample” = “SRS”

                                             197
             How to randomize
Table of random digits

E.g. choose a SRS of size 4 out of 10 individuals,
using first row of table
       19223 95034 05756 28713 …

A natural way: label individuals with 0, 1, 2,…, 9.
Take individual 1, then 9, then 2, then 3.
What if we had 25 individuals and wanted a SRS of
size 4?
                                 19, 22, 05, 13 198
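
The table-reading procedure can be sketched as follows. The function name `srs_from_digits` is an assumption, and so is the labeling of the 25 individuals as 01 through 25 (the slide's answer is consistent with it): read the digits in groups, skip labels that are out of range or already chosen, and stop once the sample is full.

```python
def srs_from_digits(digits, label_width, valid_labels, size):
    """Pick `size` distinct labels by reading `digits` in groups of `label_width`."""
    chosen = []
    for start in range(0, len(digits) - label_width + 1, label_width):
        label = int(digits[start:start + label_width])
        # Skip labels that match no individual, and repeats
        if label in valid_labels and label not in chosen:
            chosen.append(label)
            if len(chosen) == size:
                break
    return chosen

row = "19223950340575628713"   # first row of the table, spaces removed

# 10 individuals labeled 0..9: read one digit at a time
sample_of_10 = srs_from_digits(row, 1, range(0, 10), 4)

# 25 individuals labeled 01..25 (assumed labeling): read two digits at a time
sample_of_25 = srs_from_digits(row, 2, range(1, 26), 4)
```

In the first case the repeated 2 is skipped; in the second, the pairs 39, 50, 34, 75, 62, and 87 are skipped because no individual carries those labels.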
                  Blobs




         What is the average area?
E.g. throwing darts leads to size-biased sampling.
                                                199
                    Buses
Suppose average time between bus arrivals at a
stop is 20 minutes. You arrive at a random time.
What is your average waiting time until the next
bus?
10 minutes?

No -- in general it’s more.

Analogous to blobs…
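
One way to see why the wait exceeds 10 minutes: a rider arriving at a uniformly random time lands in a given gap with probability proportional to its length, then waits half that gap on average. Under that assumption the average wait is (sum of squared gaps) / (2 × sum of gaps). The function name and the example schedules below are made up for illustration.

```python
def average_wait(gaps):
    """Average wait for a rider arriving at a uniformly random time.

    A gap of length g is hit with probability g / sum(gaps), and the
    expected wait inside it is g / 2.
    """
    return sum(g * g for g in gaps) / (2 * sum(gaps))

# Both schedules have an average gap of 20 minutes.
even_wait = average_wait([20, 20])     # perfectly regular buses
uneven_wait = average_wait([10, 30])   # alternating short and long gaps
```

Only the perfectly regular schedule gives 10 minutes; any variation in the gaps makes long gaps over-represented, just as large blobs are over-represented when throwing darts.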
                                             200
       Sampling distributions
“Population”: 4 individuals and their votes.

   Ind   Vote
   1     Bush
   2     Bush
   3     Gore
   4     Gore

Say we want to estimate parameter
     p = Prob{vote for Bush}
using a sample of size 2.

Here p = 0.5. Pretend we don’t know this.
                                            201
       Sampling distributions (cont.)
 List possible SRS’s and the corresponding estimates.
   Ind   Vote
   1     Bush
   2     Bush
   3     Gore
   4     Gore

   SRS    Votes    $\hat{p}$
   12     BB       1
   13     BG       0.5
   14     BG       0.5
   23     BG       0.5
   24     BG       0.5
   34     GG       0

                                                    202
Sampling distrib of p-hat from SRS’s
              of size 2

[Dot plot of the six possible values of $\hat{p}$ on an axis from 0.0 to 1.0]

Or, in terms of probabilities,

   $\hat{p}$:        0      0.5     1
   probability:     1/6    4/6     1/6
                                                 203
Bias and variability of an estimator
 E.g.: recall true value was p = 0.5. Sampling distrib:




              0           0.5         1

Unbiased: Mean of sampling distrib = 0.5 = true value

 Variability: SD of sampling distrib ≈ 0.3

                                                   204
  How about with SRS’s of size 1?
   Ind   Vote
   1     Bush
   2     Bush
   3     Gore
   4     Gore

   SRS   Votes   $\hat{p}$
   1     B       1
   2     B       1
   3     G       0
   4     G       0

Sampling distrib of $\hat{p}$: probability 1/2 at 0, probability 1/2 at 1.
                                         205
            Bias? Variability?

n=2
              0           0.5          1



n=1
              0           0.5          1

Neither is biased. Case n = 2 has less variability.
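
For a population this small, the sampling distribution can be listed exhaustively. The sketch below enumerates every possible SRS of size 2 and of size 1 for the 4-person toy population; the function and variable names are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev

votes = {1: "Bush", 2: "Bush", 3: "Gore", 4: "Gore"}

def sampling_distribution(sample_size):
    """p-hat for every possible SRS of the given size (each SRS is equally likely)."""
    estimates = []
    for sample in combinations(votes, sample_size):
        bush = sum(1 for person in sample if votes[person] == "Bush")
        estimates.append(bush / sample_size)
    return estimates

p_hats_2 = sampling_distribution(2)   # six possible samples
p_hats_1 = sampling_distribution(1)   # four possible samples

# Mean and SD of each sampling distribution
mean_2, sd_2 = mean(p_hats_2), pstdev(p_hats_2)
mean_1, sd_1 = mean(p_hats_1), pstdev(p_hats_1)
```

Both means equal the true value 0.5, so neither estimator is biased; the SD is about 0.29 for n = 2 and 0.5 for n = 1.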
                                                      206
            Bias and Variability
• Bias of an estimator = (mean of sampling distrib)
                           − (true value of parameter)

  Statistic is unbiased if bias = 0.

• Variability of an estimator = (SD of sampling distrib)

  Depends on sample size.



                                                    207
     An example of a simulation
• Bias of estimators of variance -- use Minitab.
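
The class demonstration uses Minitab; as a stand-in, here is a Python sketch that replaces simulation with exact enumeration. The setup is an assumption chosen for illustration: two independent draws (with replacement, so not an SRS) from the population {1, ..., 6}, comparing the variance estimator that divides by n with the one that divides by n - 1.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]          # e.g. faces of a die (assumed example)
pop_mean = Fraction(sum(population), len(population))
pop_var = sum((Fraction(v) - pop_mean) ** 2 for v in population) / len(population)

def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = Fraction(sum(sample), len(sample))
    return sum((Fraction(v) - m) ** 2 for v in sample) / divisor

n = 2
samples = list(product(population, repeat=n))   # all 36 equally likely samples

# Mean of each estimator's sampling distribution
mean_div_n = sum(sample_variance(s, n) for s in samples) / len(samples)
mean_div_n_minus_1 = sum(sample_variance(s, n - 1) for s in samples) / len(samples)
```

Dividing by n - 1 gives a sampling distribution centered exactly on the population variance, while dividing by n is biased low, here by a factor of (n - 1)/n = 1/2.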




                                              208
             Stratified sampling
E.g.: estimate avg. salary of engineers at a company.
  Suppose 2 types of engineers: “junior” and “senior.”
  Suppose company has 200 of each type.
  Want to est avg salary with a sample of size 10.

Stratification idea: combine
  a SRS of size 5 from junior engineers, and
  a SRS of size 5 from senior engineers.

Is this a SRS of size 10?
                                                   209
      Why stratify vs. take a SRS?
• What’s the advantage of stratifying?
  – Bias?
  – Variability?




                                         210
  Blocking in experimental design
3 types of seeds (treatments): A, B, C.
And some land to try them on:




Divide plot into 30 squares
Use each of A, B, C on 10 squares.
                                          211
                Blocking (cont.)
  C    B    C     C   A   A    B    B    B C
  A    C    B     A   B   C    C    A    C      A
  B    A    A     B   C   B    A    C    A      B

Suppose: Worry about a fertility gradient 
         Believe field homogeneous 

Partition experimental units into blocks.
Assign treatments randomly within each block.

                                                    212
                 Stat 10x
                  J. Chang
               Thursday, 9/27/01



A statistician is somebody who is good with figures
but lacks the personality to be an accountant.


                                              213
                           Bias
How to assign experimental units to treatments?
E.g.
   – in comparing two medical treatments don’t want to assign
     one treatment to sicker patients
   – comparing seed varieties: don’t plant one in more fertile
     ground

A study is biased if it systematically favors certain
  outcomes.

How to avoid bias? Elaborate balancing?
                                                           214
       Simple random samples
Def: A simple random sample of size n is a set
 of n individuals from a population chosen in
 such a way that each set of n individuals has
 an equal chance to be the sample actually
 selected.



        Abbreviate: “simple random sample” = “SRS”

                                             215
             How to randomize
Table of random digits

E.g. choose a SRS of size 4 out of 10 individuals,
using first row of table
       19223 95034 05756 28713 …

A natural way: label individuals with 0, 1, 2,…, 9.
Take individual 1, then 9, then 2, then 3.
What if we had 25 individuals and wanted a SRS of
size 4?
                                 19, 22, 05, 13 216
                  Blobs




                                              n  50

         What is the average area?
E.g. throwing darts leads to size-biased sampling.
                                                217
 Sampling distribution of an estimator
              Understand first using a “toy example”

“Population”: 4 individuals and their votes.

   Indiv   Vote
   1       Bush
   2       Bush
   3       Gore
   4       Gore

Say we want to estimate parameter
      p = fraction of pop who vote for Bush,
using a sample of size 2.

Here p = 0.5. Pretend we don’t know this.              218
    Sampling distributions (cont.)
List possible SRS’s and the corresponding estimates.

   Indiv   Vote
   1       Bush
   2       Bush
   3       Gore
   4       Gore

   SRS     Votes    $\hat{p}$
   12      BB       1
   13      BG       0.5
   14      BG       0.5
   23      BG       0.5
   24      BG       0.5
   34      GG       0
                                                  219
  Sampling distrib of $\hat{p}$ from SRS's of size 2

[Dot plot of the six possible values of $\hat{p}$ on an axis from 0.0 to 1.0]

Or, in terms of probabilities,

   $\hat{p}$:        0      0.5     1
   probability:     1/6    4/6     1/6
                                                 220
Bias and variability of an estimator
 E.g.: recall true value was p = 0.5. Sampling distrib:




              0           0.5         1

Unbiased: Mean of sampling distrib = 0.5 = true value

 Variability: SD of sampling distrib ≈ 0.3

                                                   221
  How about with SRS’s of size 1?
   Ind   Vote
   1     Bush
   2     Bush
   3     Gore
   4     Gore

   SRS   Votes   $\hat{p}$
   1     B       1
   2     B       1
   3     G       0
   4     G       0

Sampling distrib of $\hat{p}$: probability 1/2 at 0, probability 1/2 at 1.
                                         222
            Bias? Variability?

n=2
              0           0.5          1



n=1
              0           0.5          1

Neither is biased. Case n = 2 has less variability.
                                                      223
            Bias and Variability
• Bias of an estimator = (mean of sampling distrib)
                           − (true value of parameter)

  An estimator is unbiased if its bias = 0.

• Variability of an estimator = (SD of sampling distrib)

  Depends on sample size: shrinks as sample size grows.



                                                        224
     An example of a simulation
• Bias of estimators of variance -- use Minitab.




                                              225
             Stratified sampling
E.g.: estimate avg. salary of engineers at a company.
  Suppose 2 types of engineers: “junior” and “senior.”
  Suppose company has 200 of each type.
  Want to est avg salary with a sample of size 10.

Stratification idea: combine
  a SRS of size 5 from junior engineers, and
  a SRS of size 5 from senior engineers.

Is this a SRS of size 10?         No
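
A quick count shows why the answer is no. An SRS of size 10 must give every set of 10 engineers (out of 400) the same chance, but the stratified design can only produce sets with exactly 5 juniors and 5 seniors. The variable names below are assumptions for illustration.

```python
from math import comb

# Number of distinct sets of 10 engineers out of 400 (what an SRS chooses among)
all_sets = comb(400, 10)

# Number of sets the stratified design can produce: 5 of 200 juniors and 5 of 200 seniors
stratified_sets = comb(200, 5) ** 2

# Fraction of all possible sets that the stratified design can ever select
reachable_fraction = stratified_sets / all_sets
```

A set with, say, 6 juniors and 4 seniors has probability zero under stratification, so not every set of 10 is equally likely.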
                                                   226
      Why stratify vs. take a SRS?
• What’s the advantage of stratifying?
  – Bias?        no
  – Variability? yes



 (for a “toy example” see next homework…)




                                            227
  Blocking in experimental design
3 types of seeds (treatments): A, B, C.
And some land to try them on:




Divide plot into 30 squares
Use each of A, B, C on 10 squares.
                                          228
                Blocking (cont.)
  C    B    C     C   A   A    B    B      B C
  A    C    B     A   B   C    C    A     C      A
  B    A    A     B   C   B    A    C     A      B

Suppose: Worry about a fertility gradient

Randomized block design:
Partition experimental units into blocks.
  (Block: a group of exp’l units thought to be similar in
  some important way.)
Assign treatments randomly within each block.
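
A randomized block layout like the one above can be generated as follows. This sketch assumes the blocks are the 10 columns of the field (each column in the layout contains A, B, and C once); the function name and the seed are illustrative choices.

```python
import random

def randomized_block_design(treatments, n_blocks, rng):
    """Return one random ordering of the treatments for each block."""
    layout = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)        # randomize only within the block
        layout.append(block)
    return layout

rng = random.Random(2001)         # arbitrary seed, for a repeatable layout
blocks = randomized_block_design("ABC", 10, rng)

# Print the field: 3 rows by 10 columns, one block per column
for row in range(3):
    print("  ".join(block[row] for block in blocks))
```

Every block receives each seed type exactly once, so a fertility gradient running across the blocks affects A, B, and C equally.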

                                                      229
       Probability and Statistics
• Probability theory as a major tool in Statistical
  inference
  – All inferences are expressed in terms of probabilities:
    E.g. “95% confidence interval” -- 0.95 is the probability of
    something
• E.g. poll
  – Imagine precisely 50% of a large pop favor Gore.
  – We take a random sample of size 1000
  – Expect to see about 500 in sample who favor Gore.
     • E.g., how likely are we to see more than 600?
                                                           230
          Probability Models
Given a random phenomenon we are modeling.

S = Sample space = set of all possible outcomes.

E.g.:
Toss a coin: S = {H,T}.


Toss a coin 3 times:
  S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}


                                                   231
      Probability Models (cont)
An event is a set of some possible outcomes
     i.e., an event is a subset of S.

E.g. A = (get exactly one head in 3 tosses)
       = {HTT, THT, TTH}

A probability measure is a function (satisfying certain
conditions) that assigns a probability (a number
between 0 and 1) to each event.

If A is an event, P(A) denotes the probability of A.
                                                       232
 Interpretations 1: Equally-likely
               case
Sometimes, e.g. by symmetry, we believe all possible
outcomes are equally likely. In this case
            $P(A) = \frac{\#\text{ outcomes in } A}{\#\text{ outcomes in } S}$

E.g. tossing a coin once, with $S = \{H, T\}$.
If $A = \{H\}$, then
                    $P(A) = \frac{1}{2}$
                                                 233
  Interpretations 1: Equally-likely
                case
E.g. roll two dice. What is probability of getting
a total of at least 11?

Can think of S like this…

                   (1,1)   (1,2)   (1,3)   (1,4)   (1,5)   (1,6)
                   (2,1)   (2,2)   (2,3)   (2,4)   (2,5)   (2,6)
                   (3,1)   (3,2)   (3,3)   (3,4)   (3,5)   (3,6)
                   (4,1)   (4,2)   (4,3)   (4,4)   (4,5)   (4,6)
                   (5,1)   (5,2)   (5,3)   (5,4)   (5,5)   (5,6)
                   (6,1)   (6,2)   (6,3)   (6,4)   (6,5)   (6,6)

36 outcomes, equally likely, prob 1/36 each.

          $P\{\text{total at least 11}\} = \frac{3}{36} = .083$
                                                                   234
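
The counting argument above can be checked by brute force. This is a small sketch that assumes two fair six-sided dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (first die, second die).
S = list(product(range(1, 7), repeat=2))

# Event A: the two faces total at least 11.
A = [outcome for outcome in S if sum(outcome) >= 11]

# Equally likely outcomes, so P(A) = (# outcomes in A) / (# outcomes in S).
p_A = Fraction(len(A), len(S))
print(A, p_A, round(float(p_A), 3))
```

Using `Fraction` keeps the answer exact (3/36 reduces to 1/12) before rounding to .083.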
      Interpretations 2: Long-run
               frequency
Imagine repeating the experiment over and over,
“independently,” under the same conditions.

Sometimes A occurs, sometimes it doesn’t.

As we repeat more and more times,
      (Fraction of trials in which A occurs) $\to$ P(A)
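
A quick simulation can illustrate the long-run frequency idea. This is only a sketch: the event is modeled as a pseudo-random draw that occurs with probability `p`, and the helper name `running_fraction` and the seed are assumptions made for the example.

```python
import random

def running_fraction(p, n_trials, seed=0):
    """Fraction of n_trials independent trials in which an event of probability p occurs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        if rng.random() < p:    # the event occurs on this trial
            hits += 1
    return hits / n_trials

# The fraction should settle near p = 0.5 as the number of trials grows.
for n in (10, 100, 10_000, 100_000):
    print(n, running_fraction(0.5, n))
```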



                                                      235
     Interpretations 3: Subjective
              probability
A “subjective probability” indicates a person’s beliefs
about the likelihood of an event.

E.g. P{Humans extinct within next 1000 years}?

Betting…




                                                     236
         A useful picture/example

            S

                        A




You’re driving and it’s about to start raining. Think of S as your
windshield. Event A corresponds to statement {the first drop to
hit the windshield hit the set A}.

                                                             237
       A useful picture/example
A simple probability measure to model this:

[Figure: region A inside the windshield S]

         $P(A) = \frac{\text{area of } A}{\text{area of } S}$

For convenience assume: (area of S) = 1.
So P(A) = area of A.

Note $0 \le P(A) \le 1$ and P(S) = 1.


                                               238
New events from old


   A                B




 What should we call this?
     A and B ?
     A or B ?
                             240
      So what’s (A and B) ?




(raindrop falls in A) and (raindrop falls in B)

                                            242
Complement of A?




                   244
        Axioms of probability
     (i.e., properties of probability
                measures)
• For each event A,
     $P(A) \ge 0$ and $P(A) \le 1$.

• P(S) = 1, where S is the whole sample
  space.

• If A and B are disjoint, i.e. $(A \text{ and } B) = \emptyset$, then
      P(A or B) = P(A) + P(B) .
                                          245
  Example: Complement rule
       $P(A^c) = 1 - P(A)$

Why?
        $(A \text{ or } A^c) = S$

            So $P(A \text{ or } A^c) = P(S) = 1$.
        But $A$ and $A^c$ are disjoint.
          So $P(A \text{ or } A^c) = P(A) + P(A^c)$.

        So $P(A) + P(A^c) = 1$.
                                              246
        Definition of P(B | A)


                             B
              A


Idea of P(B|A): Given that A occurs, what is the
probability that B also occurs?
Question: By eyeball, what is P(B|A) ?
Answer: 0.5.                                 247
          Definition of P(B | A)




Given that the raindrop fell in A, we restrict our attention
to the set A. The drop is equally likely to fall anywhere
within A.

                                                       248
           Definition of P(B | A)




Given A, the event B also occurs when the drop falls in
the dark blue region, i.e., the event (A and B).


                                                     249
        Definition of P(B | A)




           $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$

Often used in form: $P(A \text{ and } B) = P(A)\,P(B \mid A)$
                                                    250
                 Independence
E.g. two tosses of a coin

“B is independent of A” means “being told that A occurred
does not affect the likelihood of B occurring.”

 I.e. P(B | A) = P(B)
      I.e., $\frac{P(A \text{ and } B)}{P(A)} = P(B)$
            I.e., $P(A \text{ and } B) = P(A)\,P(B)$

                            “A and B are independent”
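
For the two-coin-toss example, the definitions can be checked by enumeration. This is a sketch that assumes a fair coin and takes A = {first toss is heads} and B = {second toss is heads}; those event choices are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses: HH, HT, TH, TT (equally likely).
S = list(product("HT", repeat=2))

def prob(event):
    """P(event) under equally likely outcomes; event is a predicate on outcomes."""
    return Fraction(sum(1 for s in S if event(s)), len(S))

def A(s):
    return s[0] == "H"      # first toss is heads

def B(s):
    return s[1] == "H"      # second toss is heads

def A_and_B(s):
    return A(s) and B(s)

p_B_given_A = prob(A_and_B) / prob(A)   # definition of P(B | A)
print(p_B_given_A, prob(B), prob(A_and_B), prob(A) * prob(B))
```

Here P(B | A) equals P(B), and P(A and B) equals P(A)P(B), which is the independence condition in its two equivalent forms.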
                                                   251
  Stat 10x
    J. Chang
Tuesday, 10/02/01




                    252
 Sampling distribution of an estimator
              Understand first using a “toy example”

                         Indiv   Vote
“Population”:            1       Bush
4 individuals and        2       Bush
their votes.             3       Gore
                         4       Gore

Say we want to estimate parameter
      p = fraction of pop who vote for Bush,
using a sample of size 2.

Here p = 0.5. Pretend we don’t know this.              253
    Sampling distributions (cont.)
List possible SRS’s and the corresponding estimates.

Population:
Indiv   Vote
1       Bush
2       Bush
3       Gore
4       Gore

Possible samples:
SRS     Votes     $\hat{p}$
12      BB        1
13      BG        0.5
14      BG        0.5
23      BG        0.5
24      BG        0.5
34      GG        0
                                                  254
  Sampling distrib of $\hat{p}$ from SRS's of size 2

[Figure: plot of the six estimates on an axis from 0.0 to 1.0]

Or, in terms of probabilities,

  $\hat{p}$      0       0.5     1
  prob           1/6     4/6     1/6
                                                  255
      New events from old

A            B

    A or B                 A and B




                 $A^c$ = complement of A

                                        260
 Example: tree diagrams, Bayes’ rule
A blood test screening for the AIDS virus is given to people
randomly chosen from a population. If a given person has a
positive test result, what is the conditional probability that the
person indeed has the virus?

Let A = {AIDS virus in blood}
    B = {Blood test positive}

Suppose that
• 1% of the population has the virus        $P(A) = 0.01$
• Test has a false positive rate of 1.5%    $P(B \mid A^c) = 0.015$
• Test has a false negative rate of 0.3%    $P(B^c \mid A) = 0.003$

                                             Want: $P(A \mid B)$
                                                                267
           Draw a tree (sideways)
Given $P(A) = 0.01$, $P(B \mid A^c) = 0.015$, $P(B^c \mid A) = 0.003$.
Want $P(A \mid B)$.

                    ┌─ .997 ── B
          ┌─ .01 ── A
          │         └─ .003 ── B^c
        ──┤
          │         ┌─ .015 ── B
          └─ .99 ── A^c
                    └─ .985 ── B^c
                                                       268
                Using the tree
Want $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$

Numerator:
 $P(A \text{ and } B) = P(A)\,P(B \mid A) = (.01)(.997) = .00997$


Note probabilities multiply along a path in the tree.


                                                        269
          Using the tree (cont.)




$P(B) = P(A \,\&\, B) + P(A^c \,\&\, B) = (.01)(.997) + (.99)(.015) = .02482$

$P(A \mid B) = \frac{P(A \,\&\, B)}{P(B)} = \frac{.00997}{.02482} = .4017$
                                             270
    Bayes’ rule (what we just did…)
Given P(A), P(B | A), and P(B | Ac).
Want to find a “turned-around” probability like P(A | B).

$P(B) = P(A \,\&\, B) + P(A^c \,\&\, B) = P(A)\,P(B \mid A) + P(A^c)\,P(B \mid A^c)$

$P(A \mid B) = \frac{P(A \,\&\, B)}{P(B)} = \frac{P(A)\,P(B \mid A)}{P(A)\,P(B \mid A) + P(A^c)\,P(B \mid A^c)}$

                      Know this, but don’t memorize!
                      Understand, derive, draw a tree...
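
The tree calculation fits in a few lines of code. This is a sketch using the numbers from the blood-test example; the function name `posterior` and its parameter names are illustrative.

```python
def posterior(prior, p_pos_given_a, p_pos_given_not_a):
    """P(A | B) from P(A), P(B | A), and P(B | A^c), following the two tree paths that end in B."""
    joint_a = prior * p_pos_given_a                 # path A -> B
    joint_not_a = (1 - prior) * p_pos_given_not_a   # path A^c -> B
    return joint_a / (joint_a + joint_not_a)        # P(A & B) / P(B)

# P(B | A) = 1 - (false negative rate) = 1 - 0.003 = 0.997.
p = posterior(prior=0.01, p_pos_given_a=1 - 0.003, p_pos_given_not_a=0.015)
print(round(p, 4))
```

Even with an accurate test, most positives come from the large virus-free group, which is why the answer is only about 40%.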
                                                         271
                 Random variables
Abstract definition:
A random variable is a function defined on S.
       Recall S = sample space = {all possible outcomes}

The function assigns a value to each possible outcome.

                 Typically a number… might be a “category”




                                                         272
      Simple example of random
               variable
Toss a coin 3 times. Let X = number of “heads”.

Outcome s   TTT   TTH   THT   THH   HTT   HTH   HHT   HHH
   X(s)      0     1     1     2     1     2     2     3



Each outcome s in S has probability 1/8.

Note, e.g., {X = 1} is the event {TTH, THT, HTT}. So it
makes sense to talk about the probability P{X = 1}, etc.

So P{X = 0} = 1/8,
   P{X = 1} = 3/8,
   P{X = 2} = 3/8,
   P{X = 3} = 1/8.
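
A random variable really is just a function on the sample space, which a short sketch makes concrete. It assumes a fair coin, so each of the 8 outcomes has probability 1/8.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for 3 tosses: 8 equally likely strings such as "HTH".
S = ["".join(t) for t in product("HT", repeat=3)]

def X(s):
    """The random variable: number of heads in outcome s."""
    return s.count("H")

# The event {X = 1} is the set of outcomes that X maps to 1.
event_X_is_1 = sorted(s for s in S if X(s) == 1)

# Distribution of X: add up the probability 1/8 of each outcome mapped to each value.
counts = Counter(X(s) for s in S)
dist = {x: Fraction(c, len(S)) for x, c in sorted(counts.items())}
print(event_X_is_1, dist)
```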
                                                            273
 Distribution of a random variable
Discrete random variable: takes on finitely many
possible values.

A distribution of a discrete random variable is a list of
its possible values and the probabilities that it takes on
those values.

E.g. we did the example X = number of heads in 3 tosses...


Continuous random variables: distribution described
by a probability density function.
                                                       274
   Independent random variables
“X and Y are independent” means that the events
{a < X < b} and {c < Y < d} are independent for all
numbers a, b, c, and d.
                      i.e. P({a < X < b} & {c < Y < d})
                           = P{a < X < b} P{c < Y < d}

Idea: knowing information about the value of X tells us
nothing about the value of Y.


                                                     275
          Binomial distributions
Generic setup: performing n independent trials of an
experiment.

Each trial could be a “success” or a “failure.”

Let p = probability of success on each trial.

Let X = number of successes among the n trials.

Definition: The random variable X is said to have a
Binomial distribution with parameters n and p.

Notation: X ~ B(n, p)
                                                  276
                    Examples
1. We already figured out the B(3, ½) distribution:
$$X = \begin{cases} 0 & \text{with probability } 1/8 \\ 1 & \text{with probability } 3/8 \\ 2 & \text{with probability } 3/8 \\ 3 & \text{with probability } 1/8 \end{cases}$$

2. Let Y be the number of 6’s in 10 rolls of a die.
Distrib of Y ?

                           Y ~ B (10, 1/6).
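
Both examples can be computed from the standard Binomial formula, $P\{X = k\} = \binom{n}{k} p^k (1-p)^{n-k}$. The formula is not stated on these slides and is brought in here as an assumption; the function name `binomial_pmf` is illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """List of P{X = k} for k = 0..n when X ~ B(n, p), kept as exact fractions."""
    p = Fraction(p)
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

b3 = binomial_pmf(3, Fraction(1, 2))     # number of heads in 3 tosses
y = binomial_pmf(10, Fraction(1, 6))     # number of 6's in 10 rolls of a die

print(b3)
print(float(y[0]))                       # chance of no 6's at all in 10 rolls
```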

                                                      277
             Examples (cont)
3. Suppose in a pop of size 100 million, 60 million
favor Gore. We take a random sample of size 2500.
Let X = number in sample who favor Gore.
Distribution of X ?

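X ~ B (2500, 0.6).     Exactly?
                       No… (we sample without replacement, so the draws are
                       slightly dependent; with a pop this large the Binomial
                       is an excellent approximation)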




                                                  278
     Mean of a random variable
E.g. X = payoff in spinner game
[Figure: spinner with sectors $0, $2, $9]

Distribution of X :
$$X = \begin{cases} 0 & \text{with prob } 0.5 \\ 2 & \text{with prob } 0.3 \\ 9 & \text{with prob } 0.2 \end{cases}$$

                                       Notation: $\mu_X = \mu(X)$
Define mean of X
           $\mu_X = (0)(.5) + (2)(.3) + (9)(.2) = 0 + .6 + 1.8 = 2.4$ dollars
                                                        279
 Why is the mean defined this way?
Answer: it makes the “law of large numbers” true.

Law of large numbers: As we do many independent
repetitions of the experiment, drawing more and more
numbers from the same distribution, the mean of our
sample will approach the mean of the distribution more
and more closely.




                                                    280
Recall “long run frequency” interpretation of
                 probability
Suppose P(event) = p. Look at

    $F_n = \frac{\#\text{ times event occurs in } n \text{ independent trials}}{n}$

i.e. the fraction of times the event occurs in the first n
trials.

As n increases, this fraction will approach p :
                    $F_n \to p$
                                                         281
Law of large numbers for spinning
              game
Imagine playing the game n times, with n large.
Total “sample” winnings:

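$X_1 + X_2 + \cdots + X_n = 0(\#\text{ of 0's}) + 2(\#\text{ of 2's}) + 9(\#\text{ of 9's})$

Mean of sample winnings:
$\bar{X}_n = 0\left(\frac{\#\text{ of 0's}}{n}\right) + 2\left(\frac{\#\text{ of 2's}}{n}\right) + 9\left(\frac{\#\text{ of 9's}}{n}\right)$

For large n the three fractions are close to 0.5, 0.3, and 0.2.
I.e.      $\bar{X}_n \approx 0(0.5) + 2(0.3) + 9(0.2) = \mu_X$
                                                                   282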
   SD and Variance of a random
            variable
Notation:      SD of X                     $\sigma_X$

               Variance of X               $\sigma_X^2$

Definition:
              $\sigma_X^2 = \mu\left((X - \mu_X)^2\right)$


                                                283
       Calculating Variance
   ($\mu_X = 2.4$)
   prob      X      $X - \mu_X$    $(X - \mu_X)^2$
    0.5      0       -2.4           5.76
    0.3      2       -0.4           0.16
    0.2      9        6.6          43.56

$\sigma_X^2 = \mu\left((X - \mu_X)^2\right)$
    = (5.76)(0.5) + (0.16)(0.3) + (43.56)(0.2)
    = 11.64        Dollars? No: $\sigma_X = \sqrt{11.64} = 3.41$ dollars
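
The same arithmetic as a sketch in code, with the distribution stored as a dictionary from value to probability. The helper names `mean` and `variance` are illustrative.

```python
from math import sqrt

# Spinner game: payoff -> probability.
spinner = {0: 0.5, 2: 0.3, 9: 0.2}

def mean(dist):
    """Mean of a discrete distribution: sum of value times probability."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Variance: mean of the squared deviation from the mean."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

mu = mean(spinner)
var = variance(spinner)
sd = sqrt(var)      # back in dollars, unlike the variance
print(round(mu, 2), round(var, 2), round(sd, 2))
```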
                                                    284
           General formulas
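If
$$X = \begin{cases} x_1 & \text{with prob } p_1 \\ x_2 & \text{with prob } p_2 \\ \;\vdots & \qquad\vdots \\ x_k & \text{with prob } p_k \end{cases}$$
then X has mean $\mu_X = \sum_{i=1}^{k} x_i p_i$
and variance $\sigma_X^2 = \sum_{i=1}^{k} (x_i - \mu_X)^2 p_i$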

                                       285
Rules for mean and variance
   $\mu(X + c) = \mu(X) + c$

   $\mu(cX) = c\,\mu(X)$

 $\sigma(X + c) = \sigma(X)$,   $\sigma^2(X + c) = \sigma^2(X)$

  $\sigma(cX) = c\,\sigma(X)$,    $\sigma^2(cX) = c^2\,\sigma^2(X)$

           For $c \ge 0$. Otherwise use $|c|$.
           E.g. $\sigma(-X) = \sigma(X)$
                                                    286
  Stat 10x
    J. Chang
Tuesday, 10/09/01




                    287
                    Today
• Random variables, mean and variance, rules,
  Law of large numbers
• Central Limit theorem
• Sampling distributions, sampling distrib of
  sample mean
• Concept of a confidence interval
• Simple examples using Normal distribution
• Binomial distributions and normal
  approximations                              288
Sums

[Figure: spinner with sectors $0, $2, $9]

Suppose we play the game
$$X = \begin{cases} 0 & \text{with prob } 0.5 \\ 2 & \text{with prob } 0.3 \\ 9 & \text{with prob } 0.2 \end{cases}$$
2 times. Total winnings: $S = X_1 + X_2$
Distrib of S ?

Possible values              Probabilities
  0 = 0 + 0                  (.5)(.5) = .25
  2 = 0 + 2 = 2 + 0          (.5)(.3) + (.3)(.5) = .3
  9 = 0 + 9 = 9 + 0          (.5)(.2) + (.2)(.5) = .2
  4 = 2 + 2                  (.3)(.3) = .09
 11 = 2 + 9 = 9 + 2          (.3)(.2) + (.2)(.3) = .12
 18 = 9 + 9                  (.2)(.2) = .04
                                                             297
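Probability mass functions of S1 and S2
$$S_1 = \begin{cases} 0 & \text{with prob } .5 \\ 2 & \text{with prob } .3 \\ 9 & \text{with prob } .2 \end{cases}$$

$$S_2 = \begin{cases} 0 & \text{with prob } .25 \\ 2 & \text{with prob } .3 \\ 4 & \text{with prob } .09 \\ 9 & \text{with prob } .2 \\ 11 & \text{with prob } .12 \\ 18 & \text{with prob } .04 \end{cases}$$
                                            298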
Mean and variance of sum of r.v.’s
Let X1 and X2 be random variables.
Define a new variable S = X1 + X2.
What are the mean and variance of S ?

   $\mu(S) = \mu(X_1) + \mu(X_2)$

   If X1 and X2 are independent, then
             $\sigma^2(S) = \sigma^2(X_1) + \sigma^2(X_2)$


                                                    299
Mean and SD of sum of n (indep)
                      r.v.’s
Let X1, X2 ,…, Xn be a random sample of size n from a
distrib having mean $\mu$ and SD $\sigma$.

Then the sampling distrib of the sum $S_n = X_1 + \cdots + X_n$
has mean $n\mu$ and SD $\sqrt{n}\,\sigma$

                   (because the variance is $n\sigma^2$).




                                                      300
Distrib of total winnings from playing the
           spinner game n times
[Figure: distributions of total winnings for n = 1, 2, 4, 8]
                                        301
Distrib of total winnings from playing the
       spinner game n times (cont.)
[Figure: distributions of total winnings for n = 16, 32, 64]
                                            302
It’s all very simply described, nearly
                      n = 64, $\mu = 2.4$, $\sigma = 3.41$
                      $S_{64}$ has mean (64)(2.4) = 153.6
                      and SD $\sqrt{n}\,\sigma = \sqrt{64}\,(3.41) = 27.28$
                      And a Normal shape

[Figure: Normal curve centered at 154, with 127 and 181 marked]
                                                          303
         Central Limit Theorem
Repeat an experiment n times independently, getting a
sum $S_n = X_1 + X_2 + \cdots + X_n$

We know $\mu(S_n) = n\,\mu(X)$ and $\sigma(S_n) = \sqrt{n}\,\sigma(X)$.

CLT: If n is large, the distrib of $S_n$ is nearly
Normal.

That is, $S_n$ is approximately $N(n\mu_X, \sqrt{n}\,\sigma_X)$



                                                     304
                       Example
[Figure: spinner with sectors $0, $2, $9]

  Suppose we play our game
  $$X = \begin{cases} 0 & \text{with prob } 0.5 \\ 2 & \text{with prob } 0.3 \\ 9 & \text{with prob } 0.2 \end{cases}$$
  25 times. Total winnings: $S = X_1 + X_2 + \cdots + X_{25}$.

  Note mean of S is $\mu(S) = (25)(2.4) = 60$

Q: How likely to get $S \ge 80$ ? What is $P\{S \ge 80\}$ ?

  SD of S: $\sigma(S) = \sqrt{25} \times \sigma(X) = 5 \times 3.41 = 17.05$
                                                        305
                Example (cont)
Q: Spin 25 times. $P\{S \ge 80\}$ ?

CLT says S is approximately N(60, 17.05). So…

$P\{S \ge 80\} \approx P\{N(60, 17.05) \ge 80\}$
$= P\left\{N(0,1) \ge \frac{80 - 60}{17.05}\right\} = P\{N(0,1) \ge 1.17\} = 1 - \Phi(1.17) = 1 - 0.879 = 0.121$

                        True answer using computer: 0.127
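
One way to get a "true answer using computer" is to build the exact distribution of the 25-play total and compare it with the Normal approximation. The slides do not say how their exact figure was computed, so the approach below is an assumption; the helper names are illustrative.

```python
from collections import defaultdict
from math import erf, sqrt

spinner = {0: 0.5, 2: 0.3, 9: 0.2}

def add_independent(dist_a, dist_b):
    """Distribution of the sum of two independent discrete r.v.'s."""
    total = defaultdict(float)
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            total[a + b] += pa * pb
    return dict(total)

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n = 25
total = {0: 1.0}                         # total winnings before any play
for _ in range(n):
    total = add_independent(total, spinner)

exact = sum(p for s, p in total.items() if s >= 80)

mu = sum(x * p for x, p in spinner.items())
sigma = sqrt(sum((x - mu) ** 2 * p for x, p in spinner.items()))
mean_S, sd_S = n * mu, sqrt(n) * sigma   # mean and SD of the sum

approx = 1 - normal_cdf((80 - mean_S) / sd_S)        # plain CLT approximation
approx_cc = 1 - normal_cdf((79.5 - mean_S) / sd_S)   # cutoff moved to 79.5
print(round(exact, 3), round(approx, 3), round(approx_cc, 3))
```

Moving the cutoff to 79.5 gives the continuity-corrected version of the same approximation.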

                                                     306
             “Continuity correction”

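S takes only whole-dollar values, so {S ≥ 80} and {S ≥ 79.5} are the same event:

$P\{S \ge 80\} = P\{S \ge 79.5\} \approx P\{N(60, 17.05) \ge 79.5\}$
$= P\left\{N(0,1) \ge \frac{79.5 - 60}{17.05}\right\} = 1 - \Phi(1.14) = 1 - 0.874 = 0.126$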




                                                         307
            Starting Chapter 6:
         Introduction to Inference
Main concepts:
• Confidence intervals (basic idea today)
• Hypothesis tests (next time)




                                            308
Who wants to be a millionaire guess a
             number?
I have a number written down on a slip of paper.
Call it $\mu$.
                Might be 1 million…
                        Or 3.14…
                            Or anything!

I’ll tell you a number, X , within ±5 of $\mu$.
Suppose X = 64.1.

Q: What can you tell me about $\mu$ ?
   $\mu$ is in the interval 64.1 ±5, i.e. [59.1, 69.1]

Q: How confident are you? (or, “Is that your final answer?”)
   100% confident                                    “Yes!”
                                                      309
     Who wants to change the rules?
First I spin the needle.            [Spinner: yellow part 99%, red part 1%]
 • If needle stops in the yellow part, then I do as before -- report
   an X within ±5 of $\mu$.
 • If needle stops in the red part, I lie, reporting an X not within
   ±5 of $\mu$.

Suppose I still report X = 64.1.
You still guess $\mu$ is in interval [59.1, 69.1].

                       But now you are “99% confident.”
                                                        310
Who wants to drag the Normal distrib into the
                discussion?
I have another number $\mu$, which I know and you don’t.

Suppose I draw a random $X \sim N(\mu, 0.06)$,
and report $X = 0.38$.

Can you give a 95% confidence interval for $\mu$ ?

Reasoning:
• with probability 0.95, X is within 2 SD’s of $\mu$
• i.e. with prob 0.95, X is within 0.12 of $\mu$
• 95% CI is 0.38 ± 0.12, i.e., [0.26, 0.50]
                                                   311
     How about a 99% CI in the same
               problem?
Again, suppose you want to estimate $\mu$, and I draw a
random $X \sim N(\mu, 0.06)$ and report X = 0.38.
99% confidence interval for $\mu$ ?

To use same reasoning as before:
• with probability 0.99, X is within ??? SD’s of $\mu$
              Use Table or Minitab. Get ??? = 2.576

• So with prob 0.99, X is within $(2.576)(0.06) \approx 0.15$ of $\mu$
• So 99% CI is 0.38 ± 0.15, i.e., [0.23, 0.53]
              Wider than our 95% CI, [0.26, 0.50]… makes sense.
                                                       312
General confidence intervals with Normal
                distribs
Suppose $Y \sim N(\mu, \sigma)$, where $\sigma$ is known and we want
to estimate $\mu$.

A “level C” CI for $\mu$ is $[Y - z_C\sigma,\; Y + z_C\sigma]$, where,
for example,
                  C               $z_C$
                  .95             1.960 (nearly 2)
                  .99             2.576
                  .90             1.645
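
The table values can be looked up in code instead of a printed table or Minitab. This is a sketch using `statistics.NormalDist` from the Python standard library (3.8 or later); the function names `z_for_level` and `normal_ci` are illustrative.

```python
from statistics import NormalDist

def z_for_level(C):
    """z_C such that a standard Normal lies within ±z_C with probability C."""
    return NormalDist().inv_cdf((1 + C) / 2)

def normal_ci(y, sigma, C=0.95):
    """Level-C confidence interval for the mean, from one observation y with known SD sigma."""
    z = z_for_level(C)
    return (y - z * sigma, y + z * sigma)

for C in (0.95, 0.99, 0.90):
    print(C, round(z_for_level(C), 3))

# The earlier example: X = 0.38 reported, SD 0.06.
print(normal_ci(0.38, 0.06, C=0.95))
print(normal_ci(0.38, 0.06, C=0.99))
```

The central area C leaves (1 - C)/2 in each tail, which is why the quantile is taken at (1 + C)/2.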



                                                     313
    Mean and variance of a sample mean
Let $X_1, X_2, \ldots, X_n$ be independent random variables
all having the same distribution.
              Suppose this “parent distribution”
              has mean $\mu$ and SD $\sigma$.

Let $\bar{X}_n$ denote the sample mean,
           i.e. $\bar{X}_n = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right)$

Then the sampling distrib of $\bar{X}_n$ has mean $\mu$
and SD $\sigma / \sqrt{n}$.
                            • follows from the rules
                            • related to law of large numbers
                                                                314
Example using a sample mean rather than just
                 one obs’n
Want to estimate
$\mu$ = mean pulse rate using a certain medicine.
We sample n = 30 people and find sample mean $\bar{X} = 103.9$
Find a 95% CI for $\mu$.

Assume SD $\sigma$ is known to be 5.1.
                               Probably unrealistic to assume we
                               know this. Later see how to fix…

Key: At least approximately, $\bar{X} \sim N\left(\mu, \frac{5.1}{\sqrt{30}}\right) = N(\mu, 0.93)$
95% CI is 103.9 ± (1.96)(0.93), i.e., [102.1, 105.7]
                                                        315
      Mean and SD of Binomial(n,p)
[Figure: spinner with sectors $0, $1]

Play game
$$I = \begin{cases} 0 & \text{with prob } 1 - p \\ 1 & \text{with prob } p \end{cases}$$
n times.
Total winnings X ~ B(n, p).
Mean and SD for 1 play are
          $\mu(I) = p$, $\sigma(I) = \sqrt{p(1-p)}$        (exercise)
Think: $X = I_1 + I_2 + \cdots + I_n$, where
$$I_k = \begin{cases} 1 & \text{if } k\text{th trial is a success} \\ 0 & \text{if } k\text{th trial is a failure} \end{cases}$$
So $\mu(X) = np$, $\sigma(X) = \sqrt{np(1-p)}$
                                                         317
        Normal approximation to
               Binomial
[Figure: spinner with sectors $0, $1]

Let X ~ B(n, p).
Think: $X = I_1 + I_2 + \cdots + I_n$, where
$$I_k = \begin{cases} 1 & \text{if } k\text{th trial is a success} \\ 0 & \text{if } k\text{th trial is a failure} \end{cases}$$

By the Central Limit Theorem, for large n the B(n, p)
distrib is approximately Normally distributed with
mean np and SD $\sqrt{np(1-p)}$

                                                      318
Example: Margin of error in a poll
Suppose in a large pop, a fraction p = 0.6 of
voters favor Gore. We don’t know this, but take a
random sample of size n = 2500. Let X be the
number in the sample who favor Gore, and let
$\hat{p} = X/n$ be our estimate of p. What is the
“margin of error” of the poll, i.e., the half-width of a
95% CI for p?




                                                      319
      Margin of error in a poll
$X \sim B(2500, 0.6)$
  $\approx N\big((2500)(0.6),\ \sqrt{(2500)(0.6)(0.4)}\big)$
  $= N\big((2500)(0.6),\ 50\sqrt{(0.6)(0.4)}\big)$

$\hat{p} = X/2500 \approx N\big(0.6,\ \sqrt{(0.6)(0.4)}/50\big) \approx N(0.60,\ 0.01)$

So, e.g., the prob that our estimate $\hat{p}$ is off by at
most 0.02 is about 95%.

                       “Margin of error is ±2%.”
                                                   320
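How good is the "about 95%" claim? A rough Python sketch, under the slide's assumptions (n = 2500, p = 0.6), that compares the exact Binomial probability with the Normal approximation. The pmf is computed on the log scale because the binomial coefficients for n = 2500 are too large for ordinary floats.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def log_binomial_pmf(k, n, p):
    """log P(X = k) for X ~ B(n, p), using lgamma to avoid overflow."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)

n, p, margin = 2500, 0.6, 0.02
lo_count = round(n * (p - margin))  # smallest count with p-hat within 0.02 of p
hi_count = round(n * (p + margin))  # largest such count

# Exact: sum the Binomial probabilities over the allowed counts
exact = sum(exp(log_binomial_pmf(k, n, p)) for k in range(lo_count, hi_count + 1))

# Normal approximation: p-hat ~ N(p, sqrt(p(1-p)/n)), no continuity correction
approx_dist = NormalDist(p, sqrt(p * (1 - p) / n))
approx = approx_dist.cdf(p + margin) - approx_dist.cdf(p - margin)

print(f"exact Binomial prob: {exact:.3f}")
print(f"Normal approx:       {approx:.3f}")
```

Both numbers should land close to 0.95, which is all the "±2%" statement needs.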
          Stat 10x
        J. Chang, 10/16/01






                    321
                     Today
• Hypothesis tests
  – Basic logic
  – 1-sided and 2-sided examples with Normal
    distributions
• “t procedures” (t tests and CI’s). These
  remove the assumption that $\sigma$ is known.



                                               322
   Some basic facts we’ve been
             using
Suppose $X_1, X_2, \ldots, X_n$ is a sample from a distribution
having mean $\mu$ and SD $\sigma$.

The sampling distrib of $\bar{X}_n$ has mean $\mu$ and SD $\sigma/\sqrt{n}$.

CLT: For large n the distrib of $\bar{X}_n$ is approximately
Normal.

If the individual r.v.’s $X_i$ come from a Normal distribution,
then the distrib of $\bar{X}_n$ is exactly Normal. That is:
     If $X_i \sim N(\mu, \sigma)$, then $\bar{X}_n \sim N(\mu,\ \sigma/\sqrt{n})$.
                                                            323
   Idea of hypothesis testing: first an
       analogous idea from logic
A probabilistic extension of proof by contradiction.


Example: Prove that $\sqrt{2}$ is irrational.
I.e. prove that $\sqrt{2}$ cannot be expressed as a quotient m/n of integers.

Technique: Assume the opposite and derive a contradiction.

Suppose we do have integers m and n satisfying $\sqrt{2} = m/n$.

We can choose m and n to be not both even.
  (Cancel out common factors of 2 from numerator and
  denominator until one or both become odd.)
                                                       324
           Logical analog: Proof by
                contradiction
We are assuming that m and n are numbers satisfying
$\sqrt{2} = m/n$, and m and n are not both even.

We have $2 = m^2/n^2$, i.e. $2n^2 = m^2$.

So $m^2$ is even. So m must be even.

Let $m = 2k$. So $2n^2 = 4k^2$, i.e., $n^2 = 2k^2$.

So $n^2$ is even. So n must be even.

Contradiction: m and n are both even.

                                                      325
Proof by contradiction: The moral
To prove a statement we
• Assume the opposite
• Do some reasoning, always assuming the
  opposite of what we are trying to prove.
• If we reason to a contradiction, we conclude
  that our assumption could not possibly be true.




                                             327
Math, logic                    Statistics, real life

Want to prove a statement.     Want to use some data to
                               provide evidence for a
                               hypothesis.

• Assume the statement         • Assume the hypothesis
  is not true.                   is not true.
• Reason to a contradiction,   • Show the observed data is
  i.e. observe something         very unlikely.
  impossible.                                        328
A cheesy example
  (cheesy: 1. Containing or resembling cheese. 2. Slang. Shoddy; cheap.)

 Cheese manufacturer suspects milk supplier is diluting
 milk with water. Wants to “prove” this.
 Note: adding water increases freezing temp of milk.

 Assume freezing temp measurements for pure milk are
 known to have a N(−.545, .008) distrib. (in ºC)

 Data: Take 5 lots of milk.
 Say mean of 5 temps is $\bar{X} = -.538$.
 Model: $X_1, \ldots, X_5 \sim N(\mu, .008)$
“Null hypothesis”    $H_0: \mu = -.545$
“Alternative hypoth” $H_a: \mu > -.545$                            329
          Cheesy example (cont)
Model: $X_1, \ldots, X_5 \sim N(\mu, .008)$   Data: $\bar{X}_5 = -.538$

“Null hypothesis”    $H_0: \mu = -.545$
“Alternative hypoth” $H_a: \mu > -.545$

We want to assess evidence for $H_a$.
Assume the opposite; that is, assume $H_0$.
Then $\bar{X}_5 \sim N(-.545,\ .008/\sqrt{5}) = N(-.545,\ .0036)$.
How “extreme” is our observed value of −.538?
How likely were we to get a value at least this high?
                                                         330
        Cheesy example (cont)
Take an observation $\bar{X}$ from N(−.545, .0036) distrib.
What is $P\{\bar{X} \ge -.538\}$?
You know how to do this. Standardize the −.538, getting
(−.538 − (−.545))/.0036 = 1.95, so that
$P\{\bar{X} \ge -.538\} = 1 - \Phi(1.95) \approx .025$   “P value”
Conclusion: Since a mean temp. as high as what we
observed would be quite unlikely ($P \approx .025$) if the milk
were pure, we have substantial evidence that water has
been added.
An interpretation: (1 − P value) gives the percentile
of our observation, assuming $H_0$.                    331
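The one-sided P value above can be reproduced in a few lines of Python. This sketch takes the freezing temperatures as negative ºC values (−0.545 for pure milk, −0.538 observed), which is what the standardization step implies. It does not round the SD of the mean to .0036, so the z value comes out slightly different from the slide's 1.95.

```python
from math import sqrt
from statistics import NormalDist

mu0 = -0.545    # mean freezing temp of pure milk under H0 (deg C)
sigma = 0.008   # known SD of a single measurement
n = 5           # number of lots
xbar = -0.538   # observed sample mean

se = sigma / sqrt(n)               # SD of the sample mean
z = (xbar - mu0) / se              # standardized observation
p_value = 1 - NormalDist().cdf(z)  # one-sided: prob of a mean at least this high

print(f"SE = {se:.4f}, z = {z:.2f}, P = {p_value:.3f}")
```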
      Rejection, acceptance, and OJ
                 Simpson
A common convention is to "reject $H_0$" if $P \le .05$.

What if we get a P larger than our chosen threshold?
E.g. P = .07?

What is the opposite of “Reject H0” ?        “Accept H0” ?

Better to use terminology like “Fail to reject H0” .

Legal analogy: Innocent (H0 ) until proven guilty.
 OJ Simpson was not convicted.
 Does this suggest he was innocent?
                                                       332
 “A Critical Appraisal of 98.6°F” (a two-sided test)
JAMA article. Sampled temps of 93 healthy people.
Sample mean was 98.12°F.
How strong is evidence against a population mean of 98.6?
Suppose we know that temps in pop have SD $\sigma$ = 0.63°F.
Model: $X_1, X_2, \ldots, X_{93} \sim N(\mu, .63)$.
Hypotheses $H_0: \mu = 98.6$, $H_a: \mu \ne 98.6$.
What possible values for $\bar{X}$ are “more extreme” than 98.12?
Let’s say “extreme” means “far from 98.6”.
(Hi or low – this corresponds to doing a “two-sided test.”)
[Figure: Normal curve centered at 98.6, with 98.12 and 99.08 marked; the two tails beyond them are labeled “More extreme than 98.12”.]
                                                                     333
   “A Critical Appraisal of 98.6°F”
Assuming $H_0$ is true,
$\bar{X} \sim N(98.6,\ .63/\sqrt{93}) = N(98.6,\ .0653)$, i.e. $\sigma_{\bar{X}} = .0653$.

Standardize observed value:
    $(98.12 - 98.6)/.0653 = -.48/.0653 = -7.35$
[Figure: standard Normal curve with −7.35, 0, and 7.35 marked.]

The P-value of the test is the prob of getting a
sample mean more extreme than 98.12, which is
      $\Phi(-7.35) + (1 - \Phi(7.35)) \approx 2 \times 10^{-13}$.
                                                       334
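A P value this small is easy to get wrong numerically, so here is a hedged Python sketch of the two-sided calculation. It uses `math.erfc` for the Normal tail (an implementation choice made here, not something from the slides), because computing 1 − Φ(7.35) by subtraction loses most of its precision.

```python
from math import erfc, sqrt

def normal_cdf(z):
    """Standard Normal cdf via erfc, which stays accurate far out in the lower tail."""
    return 0.5 * erfc(-z / sqrt(2))

mu0, sigma, n, xbar = 98.6, 0.63, 93, 98.12

se = sigma / sqrt(n)   # SD of the sample mean, about .0653
z = (xbar - mu0) / se  # standardized observed value

# Two-sided: both tails are equally likely by symmetry, so double the lower tail
p_value = 2 * normal_cdf(-abs(z))

print(f"SE = {se:.4f}, z = {z:.2f}, P = {p_value:.1e}")
```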
    CI’s: Who wants to guess a number?                   Last time
I have a number written down on a slip of paper.
Call it $\mu$.
                Might be 1 million.
                        Or 3.14
                            Anything...

I’ll tell you a number, X, within ±5 of $\mu$.
Say X = 64.1.

Q: What can you tell me about $\mu$?
       $\mu$ is in the interval 64.1 ± 5, i.e. [59.1, 69.1]

Q: How confident are you? Final. 100% confident.          335
   Who wants to change the rules?                        Last time
Use a spinner.  (Yellow part: 99%. Red part: 1%.)
• If needle points to the yellow wedge, then I do as before --
  report an X within ±5 of $\mu$.
• If needle points to the red wedge, I lie, reporting an X not
  within ±5 of $\mu$.
 Suppose I still report X = 64.1.
 You still guess $\mu$ is in interval [59.1, 69.1].

                        But now you are “99% confident.”
                                                         336
  Who wants to drag in the Normal distribution?          Last time
I have another number $\mu$, which I know and you don’t.
Suppose I draw a random $X \sim N(\mu, 0.06)$,
and report X = 0.38.
Can you give a 95% confidence interval for $\mu$?
Reasoning:
• with probability 0.95, X is within 2 SD’s of $\mu$
• i.e. with prob 0.95, X is within 0.12 of $\mu$
• i.e. with prob 0.95, $\mu$ is within 0.12 of X
• 95% CI is 0.38 ± 0.12, i.e., [0.26, 0.50]
                                                   337
  Review of Confidence Intervals
  We’ve done Normal case for $\mu$, with $\sigma$ known.
Data: $X_1, \ldots, X_n$, a sample of size n from $N(\mu, \sigma)$.     ($\mu$ unknown, $\sigma$ known)
Sample mean $\bar{X}_n$.

E.g., 95% CI for $\mu$ is $\left[\bar{X}_n - 1.96\,\sigma/\sqrt{n},\ \bar{X}_n + 1.96\,\sigma/\sqrt{n}\right]$.

Basis for this: $\bar{X}_n \sim N(\mu,\ \sigma/\sqrt{n})$, which says, e.g., that
the prob that $\bar{X}_n$ is w/in $1.96\,\sigma/\sqrt{n}$ of $\mu$ is 0.95.

                                                        339
Confidence intervals for $\mu$ without being told $\sigma$
E.g. case n = 3. Data: $X_1, X_2, X_3$ from $N(\mu, \sigma)$.     ($\mu$ unknown, $\sigma$ unknown)
Sample mean $\bar{X}$ and SD s.
Say we want a 95% CI for $\mu$.
Note can't use $\left[\bar{X} - 1.96\,\sigma/\sqrt{3},\ \bar{X} + 1.96\,\sigma/\sqrt{3}\right]$, because we don't know $\sigma$!

How about $\left[\bar{X} - 1.96\,s/\sqrt{3},\ \bar{X} + 1.96\,s/\sqrt{3}\right]$?        Nope. Bad.

95% CI for $\mu$: $\left[\bar{X} - 4.30\,s/\sqrt{3},\ \bar{X} + 4.30\,s/\sqrt{3}\right]$.

                     Q: Where does 4.30 come from?

           A: The “t distribution with 2 degrees of freedom”      340
Minitab demo for 95% CI: $\bar{X}_3 \pm 1.96\,s/\sqrt{3}$ bad, $\bar{X}_3 \pm 4.30\,s/\sqrt{3}$ good
(Script for the drama played out in class)
 Enable command language
 Make 3 cols, 2000 rows of N(10,2). These are c1-c3.
 Name c4=mean. Do rmean c1-c3 c4.
 col c5 “L” Use Calc menu: L=mean-1.96*2/sqrt(3)
 col c6 “U” Use Ctrl-E: U=mean+1.96*2/sqrt(3)
 col c7 Let c7 = (L<10) and (U>10)
 sum(c7). Hope about 1900. Let k1=sum(c7)/2000. Print k1.
 col c8 “stdev” rstdev c1-c3 c8 (or could use row stats menu)
 col c9 “Lz” calc menu: Lz=mean-1.96*stdev/sqrt(3)
 col c10 “Uz” Use Ctrl-E: Uz=mean+1.96*stdev/sqrt(3)
 col c11 “zcover” let c11=(Lz < 10) and (Uz > 10)
 sum(c11) Let k2=sum(c11)/2000 Print k2
 c12 “Lt” Lt = mean - 4.30*stdev/sqrt(3)
 c13 “Ut” Ut = mean + 4.30*stdev/sqrt(3)
 c14 “tcover” Let c14 = (Lt < 10) and (Ut > 10)
 let k3=sum(c14)/2000 Print k3.
                                                          341
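For anyone without Minitab, the same demo can be sketched in Python. This is a loose translation of the script above, not the class's actual code: it draws 2000 samples of size 3 from N(10, 2) and counts how often each of the three intervals covers the true mean 10. The function name, the dictionary keys, and the seed are arbitrary choices made for this illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

def coverage(trials=2000, n=3, mu=10.0, sigma=2.0, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = {"z_known": 0, "z_s": 0, "t_s": 0}
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar, s = mean(sample), stdev(sample)
        intervals = (
            ("z_known", 1.96, sigma),  # 1.96 with the true sigma (k1 in the script)
            ("z_s", 1.96, s),          # 1.96 with the sample SD  (k2): the "bad" one
            ("t_s", 4.30, s),          # 4.30 with the sample SD  (k3): the "good" one
        )
        for label, multiplier, sd in intervals:
            half_width = multiplier * sd / sqrt(n)
            if xbar - half_width < mu < xbar + half_width:
                hits[label] += 1
    return {label: count / trials for label, count in hits.items()}

print(coverage())
```

The first and third fractions should come out near 0.95, while the middle one falls well short, which is the point of the drama.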
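$\sigma$ known:
    $\bar{X} \sim N(\mu,\ \sigma/\sqrt{n})$, i.e., $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$

$\sigma$ unknown, using s instead:
    $\dfrac{\bar{X} - \mu}{s/\sqrt{n}} \sim$ t distrib with $n - 1$ df       Jargon: “degrees of freedom”

                     Which is N(0,1)? Which is t ?
[Figure: two density curves.]
                     N(0,1) density. 95% prob in ±1.96

                        t(2) density. 95% prob in ±4.30

                                                       342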
    So what’s a t distribution again?
The distribution of a "standardized" $\bar{X}_n$, based on a
sample of size n, and "standardized" using s instead of $\sigma$,
is called the t distribution with $n - 1$ degrees of freedom.

i.e., the distrib of $\dfrac{\bar{X}_n - \mu}{s/\sqrt{n}}$




                                                        343
t densities
[Figure: t density curves for df = 1 (red), 2, 4, 8, and ∞ (black, which is N(0,1)).]

                           344
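             The logic of the t CI
E.g. for n = 3:
   $\dfrac{\bar{X} - \mu}{s/\sqrt{3}} \sim$ t distrib with 2 df

   This t(2) distrib has .95 probability between ±4.30.

   I.e. $P\left\{\dfrac{\bar{X} - \mu}{s/\sqrt{3}} \text{ is between } -4.30 \text{ and } 4.30\right\} = 0.95$.
   I.e. $P\left\{\bar{X} \text{ is within } 4.30(s/\sqrt{3}) \text{ of } \mu\right\} = 0.95$.

   95% CI for $\mu$:    $\bar{X} \pm 4.30(s/\sqrt{3})$.
                                                      345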
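Where 4.30 comes from can be checked by hand for this one case, because the t distribution with 2 df happens to have a closed-form cdf, F(t) = 1/2 + t / (2·sqrt(t² + 2)). The Python sketch below uses that formula (it is special to 2 df and does not generalize) to recover the critical value and to show how much coverage is lost by using 1.96 instead.

```python
from math import sqrt

def t2_cdf(t):
    """cdf of the t distribution with 2 degrees of freedom (closed form)."""
    return 0.5 + t / (2 * sqrt(t * t + 2))

def t2_quantile(p):
    """Inverse of t2_cdf for 0 < p < 1, obtained by solving F(t) = p for t."""
    u = 2 * p - 1
    return u * sqrt(2 / (1 - u * u))

# Central 95% needs 2.5% in each tail, so look up the 97.5th percentile
crit = t2_quantile(0.975)
print(f"t(2) critical value: {crit:.2f}")

# Probability that a t(2) variable falls between -1.96 and 1.96
z_coverage = t2_cdf(1.96) - t2_cdf(-1.96)
print(f"coverage if 1.96 is used with n = 3: {z_coverage:.3f}")
```

The critical value rounds to 4.30, and the 1.96 interval covers only about 81% of the time.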
                 t tests: example
6 students, each took 2 reading tests, radio off and radio on.

    Radio off   Radio on   Difference
        Y          Z        X = Y − Z
       125        109          16
       347        278          69
       265        275         −10
       195        191           4
       535        416         119
       235        250         −15
                           $\bar{X} = 30.5$
                           $s_X = 52.8$

$H_0: \mu_Y = \mu_Z$
$H_a: \mu_Y \ne \mu_Z$
Equivalent: Define $X = Y - Z$,
let $\mu$ denote mean of X, and test
$H_0: \mu = 0$
$H_a: \mu \ne 0$.
Now it’s a 1-sample test. $X_i \sim N(\mu, \sigma)$.
Doing a test about $\mu$, with $\sigma$ not assumed known.            346
           t tests: example (cont)          $n = 6$, $\bar{X} = 30.5$, $s_X = 52.8$
"t statistic" $t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$

Under $H_0$, $\mu = 0$, so we get $t = \dfrac{30.5}{52.8/\sqrt{6}} = 1.42$.

For a 2-sided test, we want to
add the prob to the right of 1.42
and to the left of -1.42 in the t
distrib with n - 1 = 5 df.
[Figure: t density with −1.42, 0, and 1.42 marked.]
                                              Don’t reject
t table gives P value between .2 and .3.      null hypothesis
Minitab: P = .215                                        347
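The radio example can be reworked from the raw data in Python. The standard library has no t distribution, so this sketch gets the two-sided P value by integrating the t density numerically with Simpson's rule; that is a choice made here for self-containedness, not how Minitab does it. The helper names are made up.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|): one minus the Simpson's-rule integral over [-|t|, |t|]."""
    a = abs(t)
    h = 2 * a / steps  # steps must be even for Simpson's rule
    total = t_pdf(-a, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-a + i * h, df)
    return 1 - total * h / 3

radio_off = [125, 347, 265, 195, 535, 235]
radio_on = [109, 278, 275, 191, 416, 250]
diffs = [y - z for y, z in zip(radio_off, radio_on)]  # X = Y - Z

n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))  # under H0 the mean difference is 0
p_value = t_two_sided_p(t_stat, df=n - 1)

print(f"mean = {mean(diffs):.1f}, s = {stdev(diffs):.1f}")
print(f"t = {t_stat:.2f}, two-sided P = {p_value:.3f}")
```

The P value lands between .2 and .3, consistent with the table lookup and close to the Minitab figure.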
                    Stat 10x
                    J. Chang
                Tuesday, 10/23/01

"Statistical thinking will one day be as necessary for
efficient citizenship as the ability to read and write."
                                          -- H.G. Wells

                                                     348
                   Today
• CI for a proportion
• Tests and CI’s for difference between two
  means
• Chi-square for goodness of fit to a given
  distribution
• Two-way tables and chi-square



                                              349
        Inference for a proportion
  Example: Confidence interval in a poll
Suppose we take a random sample of 900 likely voters.

We ask who they will vote for:        (Imagine a world without
52% say Bush, 48% say Gore.           Nader, Buchanan,…)


Let p denote the unknown fraction of Bush voters.
Our sample gives the point estimate $\hat{p} = .52$
How about a 95% CI?
$\hat{p} = X/900$, where $X \sim B(900, p)$

                                                        352
    Poll proportion example (cont.)
Distrib of $\hat{p}$ is approx
$N\left(p,\ \sqrt{p(1-p)/900}\right) = N\left(p,\ \sqrt{p(1-p)}/30\right) \approx N(p, .017)$

             Estimate the SD by $\sqrt{(.52)(.48)}/30 = .017$
An approximate 95% CI is
 the observed $\hat{p} = .52$, plus-or-minus $2 \times SD(\hat{p}) = .034$,
 i.e. 52% ± 3.4%   (3.4% is the margin of error).
                                                           353
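The poll interval as a small Python sketch. The function name `proportion_ci` is invented for illustration, and the multiplier 2 follows the slide's rule of thumb rather than 1.96. Note that without rounding the SD to .017 the margin comes out nearer 3.3% than 3.4%.

```python
from math import sqrt

def proportion_ci(p_hat, n, multiplier=2.0):
    """Approximate CI for a proportion: p_hat plus-or-minus multiplier * SD(p_hat)."""
    se = sqrt(p_hat * (1 - p_hat) / n)  # SD of p_hat, estimated by plugging in p_hat
    margin = multiplier * se
    return p_hat - margin, p_hat + margin, margin

low, high, margin = proportion_ci(p_hat=0.52, n=900)
print(f"margin of error = {margin:.3f}")
print(f"approx 95% CI: [{low:.3f}, {high:.3f}]")
```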
  Comparing means of two Normal distributions
             Case of paired data
Given paired data: $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.
Want to test $H_0: \mu_x = \mu_y$
     (versus $H_a: \mu_x \ne \mu_y$ or whatever)

Consider differences $D_1, \ldots, D_n$, where $D_i = X_i - Y_i$.

Now it’s a one-sample problem: $D_1, \ldots, D_n$ is a random
sample from a population, and null hypoth is the population
mean is 0.
                                                         360
     t tests: an example with paired data
6 students, each took 2 reading speed measurements,
with radio off and radio on.

    Radio off   Radio on   Difference
        Y          Z        X = Y − Z
       125        109          16
       347        278          69
       265        275         −10
       195        191           4
       535        416         119
       235        250         −15
                           $\bar{X} = 30.5$
                           $s_X = 52.8$

  $H_0: \mu_Y = \mu_Z$
  $H_a: \mu_Y \ne \mu_Z$
Equivalent: Define $X = Y - Z$,
let $\mu$ denote mean of X, and test
 $H_0: \mu = 0$
 $H_a: \mu \ne 0$.
Now it’s a 1-sample test. $X_i \sim N(\mu, \sigma)$.
Doing a test about $\mu$. Let us not assume $\sigma$ is known.
                                                 361
           t tests: example (cont)          $n = 6$, $\bar{X} = 30.5$, $s_X = 52.8$
"t statistic" $t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$

Under $H_0$, $\mu = 0$, so we get $t = \dfrac{30.5}{52.8/\sqrt{6}} = 1.42$.

For a 2-sided test, we want to
add the prob to the right of 1.42
and to the left of -1.42 in the t
distrib with n - 1 = 5 df.
[Figure: t density with −1.42, 0, and 1.42 marked.]
                                                 Don’t reject
t table gives P value > .2 (see next slide)      null hypothesis
Minitab: P = .215                                           362
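Critical values for t distributions
We got t = 1.42, with 5 df

[Figure: table of t critical values, with tail probabilities along one margin and critical values in the body. Annotation on the table: “N(0,1)!”]

                             Since t = 1.42 < 1.476,
                             tail prob > .1, so
                             2-sided P value > .2.

                                             363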
 Comparing means of two Normal distributions
Data: Two indep samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$.
Model: $X_1, \ldots, X_m \sim N(\mu_x, \sigma_x)$ and $Y_1, \ldots, Y_n \sim N(\mu_y, \sigma_y)$.
Our goal is to test the null hypothesis $H_0: \mu_x = \mu_y$.
Idea: use test statistic $\bar{X} - \bar{Y}$.
Key: Assuming $H_0$, need to know distrib of $\bar{X} - \bar{Y}$.
                                             Mean 0. SD? Shape?
Some cases:

    $\sigma_x$ and $\sigma_y$       $\sigma_x$ and $\sigma_y$       $\sigma_x$ and $\sigma_y$
    known, and          unknown, but          unknown
    not nec equal       assumed equal                       364
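    Simplest case: Population SD’s known
 recall Key: Assuming $H_0$, need to know distrib of $\bar{X} - \bar{Y}$.

Easy: We know
  $\bar{X} \sim N(\mu_x,\ \sigma_x/\sqrt{m})$ and $\bar{Y} \sim N(\mu_y,\ \sigma_y/\sqrt{n})$.
  $\bar{X}$ and $\bar{Y}$ are independent

So…   $\bar{X} - \bar{Y} \sim N\left(0,\ \sqrt{\dfrac{\sigma_x^2}{m} + \dfrac{\sigma_y^2}{n}}\right)$

                                                         365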
Two-sample procedures with SD’s known: example
           E.g. here $\bar{X} = 9.77$, $\bar{Y} = 16.27$.
              $\bar{X} - \bar{Y} = -6.5$.

              Also suppose we are told that
              $\sigma_x = 3$ and $\sigma_y = 5$. Then…

              $\sigma(\bar{X} - \bar{Y}) = \sqrt{\dfrac{3^2}{7} + \dfrac{5^2}{9}} = \sqrt{4.06} = 2.02$

              95% CI for $\mu_x - \mu_y$ is
               $-6.5 \pm (1.96)(2.02) = [-10.45, -2.55]$

                                              366
  Two-sample hypothesis test with SD’s known
…continuing our example…
If $H_0: \mu_x = \mu_y$ is true, then using the given values for
$\sigma_x$ and $\sigma_y$, the distrib of $\bar{X} - \bar{Y}$ is N(0, 2.02).

We just observed a value −6.5 for $\bar{X} - \bar{Y}$.
Standardized observed value is $-6.5/2.02 = -3.22$.
P value is $2\Phi(-3.22) = .0013$

“How extreme” is difference in means:  $\dfrac{\bar{X} - \bar{Y}}{\sqrt{\sigma_x^2/m + \sigma_y^2/n}}$
                                                                  367
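Both the CI and the test for this example fit in one short Python sketch. The sample sizes m = 7 and n = 9 are read off the denominators in the slide's SD calculation; the raw data are not shown in the slides, so only the summary numbers are used.

```python
from math import erfc, sqrt

def normal_cdf(z):
    """Standard Normal cdf, written with erfc so small tail areas stay accurate."""
    return 0.5 * erfc(-z / sqrt(2))

xbar, ybar = 9.77, 16.27
sigma_x, sigma_y = 3.0, 5.0  # population SDs, assumed known
m, n = 7, 9                  # sample sizes implied by the slide's formula

diff = xbar - ybar
se = sqrt(sigma_x**2 / m + sigma_y**2 / n)  # SD of the difference in sample means

ci = (diff - 1.96 * se, diff + 1.96 * se)   # 95% CI for mu_x - mu_y
z = diff / se                               # standardized difference under H0
p_value = 2 * normal_cdf(-abs(z))           # two-sided P value

print(f"SE = {se:.2f}")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"z = {z:.2f}, P = {p_value:.4f}")
```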
      Two-sample procedures with unknown variances
Idea: would still like to use $\dfrac{\bar{X} - \bar{Y}}{\sqrt{\sigma_x^2/m + \sigma_y^2/n}}$, but can’t.

So we estimate $\sigma_x$ and $\sigma_y$ by sample SD’s, $s_x$ and $s_y$,
and use the test statistic $T = \dfrac{\bar{X} - \bar{Y}}{\sqrt{s_x^2/m + s_y^2/n}}$.

Distrib for T is not Normal, but approximately a t distrib.
Degrees of freedom? No really clean answer...
• Conservative: Minimum of (m−1) and (n−1).
• More accurate: A complicated function of m, n, and the sample SD’s. (In textbook…)
• Precise df usually doesn’t matter a whole lot…              368
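The slide leaves the "more accurate" df to the textbook. In most textbooks that formula is the Welch-Satterthwaite approximation, and the sketch below assumes that is the one meant. The sample SDs and sizes are hypothetical numbers chosen only to show how the two answers compare.

```python
def welch_df(s_x, m, s_y, n):
    """Welch-Satterthwaite approximate degrees of freedom for the two-sample T."""
    vx = s_x**2 / m  # estimated variance of the first sample mean
    vy = s_y**2 / n  # estimated variance of the second sample mean
    return (vx + vy) ** 2 / (vx**2 / (m - 1) + vy**2 / (n - 1))

def conservative_df(m, n):
    """The conservative choice from the slide: the smaller of m - 1 and n - 1."""
    return min(m - 1, n - 1)

# Hypothetical sample SDs and sample sizes, for illustration only
s_x, m, s_y, n = 3.0, 7, 5.0, 9

df_welch = welch_df(s_x, m, s_y, n)
df_safe = conservative_df(m, n)
print(f"conservative df = {df_safe}, Welch df = {df_welch:.1f}")
```

The Welch value always lies between the conservative df and (m − 1) + (n − 1), and it is usually not a whole number.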
      Note about t distributions
• Basically, the distinction between t and
  Normal distribs, and the precise number of
  degrees of freedom, hardly matter unless the
  sample sizes involved are very small.




                                             369
          t critical values
(used for 95% CI’s and hypothesis tests)
[Table not recovered: columns are df and the corresponding t critical value.]

                               371
Two-sample t procedures




                          372
Two-sample t procedures




                          373
Two-sample t procedures




                          374
  Two sample procedures with SD’s
    unknown but assumed equal
                     (Textbook, pp. 550-554)
If $\sigma_x = \sigma_y = \sigma$, say, instead of using $T = \dfrac{\bar{X} - \bar{Y}}{\sqrt{s_x^2/m + s_y^2/n}}$ it’s more accurate
to use both samples to give a single, “pooled” estimate of $\sigma$.

 $s_p = \sqrt{\dfrac{\sum_{i=1}^{m} (X_i - \bar{X})^2 + \sum_{j=1}^{n} (Y_j - \bar{Y})^2}{(m - 1) + (n - 1)}}$

Use test statistic $T = \dfrac{\bar{X} - \bar{Y}}{s_p \sqrt{\frac{1}{m} + \frac{1}{n}}}$, which, under $H_0$, has
(exactly!) a t distrib with $(m - 1) + (n - 1)$ df.
                                                               375
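A minimal Python sketch of the pooled calculation, following the two formulas above term by term. The data are made up for illustration; nothing here comes from the textbook example.

```python
from math import sqrt
from statistics import mean

def pooled_t(xs, ys):
    """Pooled SD, pooled two-sample T statistic, and its degrees of freedom."""
    m, n = len(xs), len(ys)
    xbar, ybar = mean(xs), mean(ys)
    ss_x = sum((x - xbar) ** 2 for x in xs)  # sum of squared deviations, sample 1
    ss_y = sum((y - ybar) ** 2 for y in ys)  # sum of squared deviations, sample 2
    df = (m - 1) + (n - 1)
    s_p = sqrt((ss_x + ss_y) / df)           # single "pooled" estimate of sigma
    t_stat = (xbar - ybar) / (s_p * sqrt(1 / m + 1 / n))
    return s_p, t_stat, df

# Made-up data, for illustration only
xs = [10, 12, 14]
ys = [20, 22, 24, 26]

s_p, t_stat, df = pooled_t(xs, ys)
print(f"s_p = {s_p:.3f}, T = {t_stat:.2f}, df = {df}")
```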
  Tables of counts, Goodness of fit,
           and Chi-Square
       Generalize the Binomial...

[Figure: Binomial shown as two urns: Success, prob p; Failure, prob (1−p).]

[Figure: Multinomial distribution shown as k urns: prob $p_1$, prob $p_2$, prob $p_3$, …, prob $p_k$.]

Test hypotheses about the urn (or "cell") probabilities $p_1, p_2, \ldots, p_k$.
                                                                   376
          Goodness of fit: is the die fair?
Suppose we roll a die 60 times and get these frequencies:

value            1    2    3    4    5    6
observed freq    8   13   11    5   14    9
expected freq   10   10   10   10   10   10    (assuming $H_0: p_i = 1/6$ for all i)

$X^2 = \sum \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$

    $= \dfrac{(8-10)^2}{10} + \dfrac{(13-10)^2}{10} + \dfrac{(11-10)^2}{10} + \dfrac{(5-10)^2}{10} + \dfrac{(14-10)^2}{10} + \dfrac{(9-10)^2}{10}$

    $= 5.6$              How “extreme” is this?                                   377
                  Fair die (cont)
        $X^2 = \sum \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}} = 5.6$


Distrib of $X^2$, assuming $H_0$, is chi-square, here with 5 df.

P-value:
Table F, p. T-20: P > 0.25
Minitab:          P = 0.347

 Don’t reject null hypoth.
                                                            378
     Fair-die example with different
                numbers expected
Before: $X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 5.6$, P = .347

That was then. This is now.

What if we get
                 value            1    2    3    4    5    6
                 observed freq   80  130  110   50  140   90    ?
                 expected freq  100  100  100  100  100  100

Then $X^2 = 56$. And P is minuscule (0.00000000008).
                                                                379
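A short check of why the statistic jumps from 5.6 to 56: the observed proportions are identical in the two tables, and multiplying every count by 10 multiplies X^2 by 10. A minimal sketch (the function name `chi2_stat` is ours):

```python
def chi2_stat(observed, expected):
    """X^2 = sum over cells of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

small = chi2_stat([8, 13, 11, 5, 14, 9], [10] * 6)           # 60 rolls
large = chi2_stat([80, 130, 110, 50, 140, 90], [100] * 6)    # 600 rolls
print(small, large, large / small)
```

Same pattern of deviations, ten times the data, ten times the evidence against a fair die.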
    How many degrees of freedom?
For chi-square distrib,
number of df is important!
     (Unlike for t distrib)


Here df = 5 is number of
“cells” minus 1. “Why?”

$H_0: (p_1, p_2, p_3, p_4, p_5, p_6) = (\tfrac16, \tfrac16, \tfrac16, \tfrac16, \tfrac16, \tfrac16)$

$H_a$: General p’s (6 positive numbers that sum to 1)

$H_a$ has 5 df, $H_0$ has 0 df.                    Overall test has 5 - 0 = 5 df.
General rule: df of test = (df in $H_a$) minus (df in $H_0$).
                                                                     380
t and chi-square critical values.
Used for constructing 95% CI’s (2-sided for t)
[Table with columns df, $t_{.975}$, $\chi^2_{.95}$; entries not preserved]




                                    381
 Contingency tables, homogeneity,
                 independence
A two-way classification of subjects by two variables --
gender and handedness:

                 right      left   ambidextrous
       men       934        113         20

       women     1070       92           8

Are gender and handedness independent?
I.e. do proportions of right, left, and ambidextrous agree
between men and women?
                                                      382
   Chi-square for independence in 2-
              way tables
Again use $X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$   (a sum of 6 terms in our example)
“expected” counts: see below…

Degrees of freedom?

$H_a$ ?? 6 cells, so 6 probabilities, so 5 df

$H_0$ ?? Can choose, e.g., P(man)
      [and then P(woman) is determined],
      and can choose P(right) and P(left)
      [and then P(ambidextrous) is determined].

                         Ha      5 df
                         H0      3 df
                         test    2 df
                                                           383
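The same counting extends to any two-way table with r rows and c columns. The worked equation below is the standard general result, written out here as an illustration rather than taken from the slides:

```latex
\begin{aligned}
\text{df}(H_a) &= rc - 1, \\
\text{df}(H_0) &= (r - 1) + (c - 1), \\
\text{df}(\text{test}) &= (rc - 1) - (r - 1) - (c - 1) = (r - 1)(c - 1).
\end{aligned}
```

With r = 2 (men, women) and c = 3 (right, left, ambidextrous) this gives (1)(2) = 2 df.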
    Expected counts, assuming null
        hypoth (independence)
Data: [table image not preserved]

Expected counts, assuming $H_0$:
$\text{expected} = \frac{(\text{row total})(\text{column total})}{n}$

E.g. expected count for (men, right) is $\frac{(1067)(2004)}{2237} = 955.86$
                                                                384
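The expected counts for every cell follow from the row totals, column totals, and n. A minimal sketch using the gender-by-handedness counts from the earlier slide (the dictionary layout is our own choice):

```python
counts = {
    "men":   {"right": 934,  "left": 113, "ambidextrous": 20},
    "women": {"right": 1070, "left": 92,  "ambidextrous": 8},
}

row_totals = {r: sum(cells.values()) for r, cells in counts.items()}
col_totals = {}
for cells in counts.values():
    for c, k in cells.items():
        col_totals[c] = col_totals.get(c, 0) + k
n = sum(row_totals.values())

# Under independence: expected = (row total)(column total) / n
expected = {
    r: {c: row_totals[r] * col_totals[c] / n for c in col_totals}
    for r in counts
}
print(round(expected["men"]["right"], 2))
```

This prints 955.86, matching the slide's example for the (men, right) cell.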
2-way table chi-square test using
            Minitab




                                385
$\frac{(113 - 97.78)^2}{97.78} = 2.369$

$P\{\chi^2 (2 \text{ df}) \ge 11.806\} = .003$



                    386
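The whole two-way test can be sketched without Minitab. This is an illustration in plain Python, not the course's procedure. The P-value step uses a closed form that holds only for 2 df, where the chi-square upper tail is exactly exp(-x/2).

```python
import math

observed = [[934, 113, 20],     # men:   right, left, ambidextrous
            [1070, 92, 8]]      # women: right, left, ambidextrous

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

x2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp_count = row_totals[i] * col_totals[j] / n
        x2 += (obs - exp_count) ** 2 / exp_count

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (2-1)(3-1) = 2
# For df = 2 only, the chi-square upper tail has the closed form exp(-x/2).
p_value = math.exp(-x2 / 2)
print(round(x2, 3), df, round(p_value, 3))
```

This prints 11.806, 2, and 0.003, in agreement with the values on the slide.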
                        Stat 10x
                        J. Chang
                    Tuesday, 10/30/01

I always find that statistics are hard to swallow and impossible to
digest. The only one I can remember is that if all the people who go
to sleep in church were laid end to end they would be a lot more
comfortable.
                                         --Mrs. Robert A. Taft
                                                             387
                    Today
• Chi-square for goodness of fit to a given
  distribution
• Two-way tables and chi-square
• Inference for simple regression




                                              388
t densities
[Plot: t densities for df = ∞ (black, N(0,1)), 8, 4, 2, and 1 (red)]




                           389
          t critical values
(used for 95% CI’s and hypothesis tests)
[Table with columns df, t, $\chi^2_{.95}$; entries not preserved]




                               390
   Next: Inference for regression
Ask questions like:

• How strong is the evidence that there is a real correlation
between two variables?
• What is a 95% confidence interval for the mean of one
variable, for a given value of another variable?




                                                     403
                        Crying and IQ
The heartwarming story...
• Data on 38 infants (4 to 10 days old).
• Researchers used a rubber band to snap infants in the foot.
• Measured crying intensity
  (“number of peaks in the most active 20 seconds”)
• Later measured IQ at age 3 years.




Data from Basic Practice...
                                                      404
                Crying data
[Scatterplot: IQ (85 to 165) vs. crying (10 to 30)]        r = .455


            “Is there a real relationship?”
                                                   405
                Regression model
Y = IQ, X = crying

Mean of Y is a linear function of X.
$\mu_Y = \beta_0 + \beta_1 X$

Actual values = mean + “random error”
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

Model: $\varepsilon_i \sim N(0, \sigma)$ are indep random variables

3 unknown parameters: $\beta_0$, $\beta_1$, and $\sigma$.


                                                        406
       Estimating the parameters
  (Start w/the coefficients in the linear equation)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$,   $\varepsilon_i \sim N(0, \sigma)$

We’ll estimate $\beta_0$ and $\beta_1$ by $b_0$ and $b_1$, say.
    $b_0, b_1$: intercept and slope of the usual least-squares
    regression line $y = b_0 + b_1 x$.

[Scatterplot: IQ vs. crying with fitted line Y = 91.3 + 1.49 X]

On average, gain about 1.5 IQ points per unit of crying intensity.
Very scientific.
                                                               407
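The "usual least-squares regression line" can be computed directly from sums of squares. A minimal sketch: the function name `least_squares` is ours, and the data are made up for illustration. They are NOT the 38-infant crying data, which are not reproduced in these notes.

```python
def least_squares(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept: line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical (crying, IQ) pairs, for illustration only.
crying = [10, 12, 15, 18, 20, 23, 27, 30]
iq = [100, 94, 120, 112, 130, 108, 135, 141]
b0, b1 = least_squares(crying, iq)
print(b0, b1)
```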
                         Plot the residuals
Residual is difference between observed $Y_i$ and the
prediction of the regression line: $e_i = Y_i - (b_0 + b_1 X_i)$
Hope to see a formless blob…

[Plot: residual (-30 to 50) vs. crying (10 to 30)]

Looks pretty formless to me. (Except maybe one point to examine.)
                                                                       408
          Check that the residuals look
               decently Normal
[Normal probability plot for residuals: Percent (1 to 99) vs. Data (-40 to 50)]



                                                                        409
               How to estimate $\sigma$?
As usual, need to est $\sigma$ to construct CI’s and hypothesis tests.

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$,   $\varepsilon_i \sim N(0, \sigma)$

How about using s = SD of $\varepsilon_i$’s?
We don’t know the $\varepsilon_i$’s!   $\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$

Idea: Can estimate $\varepsilon_i$ by residual $e_i = Y_i - (b_0 + b_1 X_i)$.
Estimate $\sigma$ by SD of residuals…

$s = \sqrt{\frac{\sum e_i^2}{n-2}} = \sqrt{\frac{\sum (Y_i - b_0 - b_1 X_i)^2}{n-2}}$
                                                                  410
 There you go again… Why n - 2 ?
Before we divided by n-1. Why n-2 here?

$s = \sqrt{\frac{\sum e_i^2}{n-2}} = \sqrt{\frac{\sum (Y_i - b_0 - b_1 X_i)^2}{n-2}}$

Choose n - 2 to get an unbiased estimator of $\sigma^2$.

An intuitive way to remember:
Before we said if we have n = 1, we want s undefined.
Now: if n = 2, s should be undefined (0/0).

The real idea: there are two estimated parameters in this
expression.
                                                          411
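A small sketch of the estimate s with the n - 2 divisor. The function name `residual_sd` and the four data points are hypothetical, chosen so the arithmetic is easy to check by hand. For these points the least-squares line works out to y = -0.5 + 2.2x.

```python
import math

def residual_sd(x, y, b0, b1):
    """s = sqrt(sum of squared residuals / (n - 2))."""
    n = len(x)
    if n <= 2:
        # With n = 2 the line fits both points exactly, so s would be 0/0.
        raise ValueError("need more than 2 points to estimate sigma")
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]
s = residual_sd(x, y, b0=-0.5, b1=2.2)
print(s)
```

The residuals here are 0.3, 0.1, -1.1, 0.7, so the sum of squares is 1.8 and s is the square root of 1.8 / 2 = 0.9.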
Minitab report




                 412
              CI’s and tests for $\beta_1$
E.g. a 95% CI for $\beta_1$ will look like
   $b_1 \pm (\text{multiplier})(\text{SE of } b_1)$

   multiplier: Around 2, as usual. From a t distrib.
   SE of $b_1$: Standard error (estimated SD).

By algebra... $SD(b_1) = \frac{\sigma}{\sqrt{\sum (X_i - \bar{X})^2}}$

Estimate by $SE(b_1) = \frac{s}{\sqrt{\sum (X_i - \bar{X})^2}}$

                                               413
        SE of regression coefficient
By algebra... $SD(b_1) = \frac{\sigma}{\sqrt{\sum (X_i - \bar{X})^2}}$

Estimate by $SE(b_1) = \frac{s}{\sqrt{\sum (X_i - \bar{X})^2}}$

This all makes qualitative sense at least. E.g.:
   $b_1$ is more variable when $\sigma$ is larger,
   less variable when the $X_i$’s are spaced farther apart


                                                        414
 95% CI for $\beta_1$ in crying example
$b_1 \pm (t^*_{n-2})(SE(b_1))$        ($t^*$ from t distrib with 38-2=36 df.)

$1.49 \pm (2.03)(SE(b_1))$

$SE(b_1) = \frac{17.5}{\sqrt{\sum (X_i - \bar{X})^2}} = 0.487$

$1.49 \pm (2.03)(0.487)$
$= [0.51, 2.48]$



                                                          415
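The interval arithmetic is easy to check. A minimal sketch using the rounded numbers shown on the slide: with these rounded inputs the lower endpoint comes out near 0.50 rather than 0.51, and the slide's value presumably reflects unrounded inputs (an assumption on our part).

```python
b1 = 1.49        # slope estimate, from the slide
t_star = 2.03    # t critical value with 36 df, from the slide
se_b1 = 0.487    # standard error of b1, from the slide

margin = t_star * se_b1
ci = (b1 - margin, b1 + margin)
print(tuple(round(v, 2) for v in ci))
```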
Hypothesis test in crying example
Test $H_0: \beta_1 = 0$.

Calculate $\frac{b_1}{SE(b_1)} = \frac{1.4929}{0.487} = 3.07$

P value: 2 (0.002) = 0.004.

These data give strong evidence that IQ and crying
are correlated.


                                                     416
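The test statistic can be checked the same way. This sketch only computes the t ratio and compares it with the two-sided 5% critical value for 36 df quoted on the confidence-interval slide; it does not compute the P-value, which needs a t table or software.

```python
b1 = 1.4929      # slope estimate, from the slide
se_b1 = 0.487    # standard error of b1, from the slide

t_stat = (b1 - 0) / se_b1      # test H0: beta1 = 0
t_star = 2.03                  # two-sided 5% critical value, 36 df
reject_at_5_percent = abs(t_stat) > t_star
print(round(t_stat, 2), reject_at_5_percent)
```

This prints 3.07 and True.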
          Confidence and prediction
        intervals at a given value x*
E.g., for x* = 30:
Want a CI for mean of Y in the "vertical strip"
at X = 30, that is, a CI for $\mu(Y \mid X = x^*)$.
Prediction interval: Suppose we just saw an infant with a
crying score of 30. Give an interval for future IQ score,
for which we have a given confidence.

Want a CI for the "vertical strip" mean $\mu(Y \mid X = x^*)$,
and a "prediction interval" for a new value of Y that
we haven't observed yet, for a given value X = x*.
                                                          417
                      Formulas, etc.
Intervals will be of the form $\hat{y} \pm (\text{multiplier})(SE)$,
where "multiplier" comes from a t(n - 2) distribution.

For CI for $\mu(Y \mid X = x^*)$ use
   $SE_{\hat{\mu}}(Y \mid x^*) = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

For prediction interval use
   $SE_{\hat{y}}(Y \mid x^*) = s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

See pp. 674-677 and pp. 690-691
                                                          418
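The two standard errors differ only by the extra "1 +" under the square root, which accounts for the random error in a single new observation. A minimal sketch: the function names and the x values are hypothetical, and the multiplier `t_star` is assumed to be looked up from a t(n - 2) table.

```python
import math

def interval_standard_errors(x, s, x_star):
    """Return (SE for the mean of Y at x_star, SE for a new Y at x_star)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    core = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(core)
    se_pred = s * math.sqrt(1 + core)
    return se_mean, se_pred

def interval(y_hat, t_star, se):
    """y_hat +/- t_star * se, with t_star taken from a t(n - 2) table."""
    return y_hat - t_star * se, y_hat + t_star * se

# Hypothetical x values, for illustration only; s = 17.5 echoes the crying example.
x = [10, 14, 18, 22, 26, 30]
se_mean, se_pred = interval_standard_errors(x, s=17.5, x_star=30)
print(se_mean, se_pred)
```

The prediction SE is always larger than the SE for the mean, so the prediction interval is always the wider of the two.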
                    In Minitab
Do stat > Regression > Regression, click Options, and fill
in a value or a column for “Prediction intervals for new
observations”

Do stat > Regression > Fitted line plot, click Options, and
check “Display confidence bands” and “Display
prediction bands” for a nice picture.

E.g. for x* = 30, get a CI of [122, 150] for the mean,
and a prediction interval of [98, 174] for a new IQ.


                                                     419
                Regression Plot
                 Y = 91.2683 + 1.49290X
                     R-Sq = 20.7 %

[Fitted line plot: IQ (80 to 180) vs. crying (10 to 30), showing the regression line, 95% CI bands, and 95% PI bands]

                                                     420
          From K.A.C. Manderville, The Undoing of Lamia Gurdleneck
"You haven't told me yet," said Lady Nuttal, "what it is your fiance
  does for a living."
"He's a statistician," replied Lamia, with an annoying sense of being
  on the defensive.
Lady Nuttal was obviously taken aback. It had not occurred to her that
  statisticians entered into normal social relationships. The species,
  she would have surmised, was perpetuated in some collateral
  manner, like mules.
"But Aunt Sara, it's a very interesting profession," said Lamia warmly.
"I don't doubt it," said her aunt, who obviously doubted it very much.
   "To express anything important in mere figures is so plainly
   impossible that there must be endless scope for well-paid advice
   on how to do it. But don't you think that life with a statistician
   would be rather, shall we say, humdrum?"
Lamia was silent. She felt reluctant to discuss the surprising depth of
  emotional possibility which she had discovered below Edward's
  numerical veneer.                                             421
  Stat 10x
   J. Chang
Tuesday, 11/6/01




                   422
                   Today
• Multiple regression, including some ideas of
  model selection




                                             423
       Before I forget: are these
        interpretations correct?
• 95% CI of [.521, .583] for population
  proportion p means that “The probability that
  p lies between .521 and .583 is 0.95.”
• Testing a null hypothesis and finding a P value of
  .015 means that “The probability that the null
  hypothesis is true is 0.015.”



                                              424
    Multiple regression example:
 Deciding who should get scholarships
Data: [table image not preserved]




Want to use HS GPA and achievement test to predict college GPA
                                                         425
 Look at scatterplots for all pairs of
              variables
[Matrix plot of coll_GPA, HS_GPA, and ach_test (Minitab: Graph > Matrix Plot)]

HS_GPA looks useful in predicting coll_GPA.                         Good
ach_test looks useful in predicting coll_GPA.                       Good
ach_test & HS_GPA not useful in predicting each other! Also good
                                                                       426
Simple (not multiple) regression of
     coll_GPA on HS_GPA




                                  427
Results of simple regression




                               428
Minitab can store the fits and residuals
          in the worksheet




                                     429
                  Plot residuals

[Plot: RES1 (-0.5 to 0.5) vs. HS_GPA (2 to 4)]

                                   Looks nice
                                                430
       Same residuals vs. ach_test (the
               other predictor)

[Plot: RES1 (-0.5 to 0.5) vs. ach_test (70 to 100)]

                   Indicates there’s still more info to be extracted!
                                                                 431
Multiple regression using both
     predictor variables




                                 432
The report




             433
Recall the simple regression report for
             comparison...




                                    434
Fits and resids from both
       regressions
  (simple and multiple)




                            435
       Plotting residuals vs. fitted values
[Two plots: residuals vs. FITS1 (left) and RESI2 vs. FITS2 (right)]




                                                      436
     How much better did we do with
             two predictors?
FITS1 (simple): r = .73 ($r^2$ = .53)       FITS2 (multiple): r = .93 ($r^2$ = .87)

[Two plots: coll_GPA vs. FITS1 (left) and coll_GPA vs. FITS2 (right)]

The famous “multiple R-sq” (reported by Minitab) is simply
the squared correlation between actual and fitted y’s.
                                                          437
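The identity between R-sq and the squared correlation of actual and fitted values can be checked numerically. The sketch below uses a one-predictor fit for brevity (the identity holds for any least-squares fit that includes an intercept), with made-up data that are not the scholarship data set. The helper names are ours.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
          / sum((a - x_bar) ** 2 for a in x))
    return y_bar - b1 * x_bar, b1

def correlation(u, v):
    """Sample correlation coefficient r."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical data, for illustration only.
x = [2.0, 2.5, 3.0, 3.2, 3.6, 4.0]
y = [2.1, 2.4, 3.1, 2.9, 3.3, 3.9]
b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]

y_bar = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - y_bar) ** 2 for yi in y)
r_sq_from_sums = 1 - sse / sst                  # usual definition of R-sq
r_sq_from_corr = correlation(y, fitted) ** 2    # squared corr(actual, fitted)
print(r_sq_from_sums, r_sq_from_corr)
```

The two printed numbers agree up to rounding error.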
          Multiple regression model
In our example: Y = coll_GPA, X1 = HS_GPA, X2 = ach_test

Mean of Y is a linear fcn of X1 and X2.
$\mu_Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

Actual values = mean + “random error”
$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$

Model: $\varepsilon_i \sim N(0, \sigma)$ are indep random variables

4 unknown parameters: $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma$.


                                                       438
    Estimates in multiple regression
$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$,   $\varepsilon_i \sim N(0, \sigma)$

If we estimate $\beta_0, \beta_1, \beta_2$ by $b_0, b_1, b_2$,
define residual $e_i = Y_i - (b_0 + b_1 X_{i1} + b_2 X_{i2})$.

How to choose “best” $b_0, b_1, b_2$?
Least squares idea: choose $b_0, b_1, b_2$ that give smallest sum
of squared residuals, i.e. smallest $\sum e_i^2$.

Estimate of $\sigma$: $s = \sqrt{\frac{\sum e_i^2}{n-3}}$

There are formulas for $SE(b_0), SE(b_1), SE(b_2)$, which
depend, roughly, on s and how spread out the X values are.
                                                             439
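One way to carry out the least-squares idea with two predictors is to solve the normal equations (X'X) b = X'y. This is a sketch of that standard approach in plain Python, not a description of what Minitab does internally. The function names and the data are hypothetical; the data are not the scholarship data set.

```python
import math

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def fit_two_predictors(x1, x2, y):
    """Least-squares [b0, b1, b2] and s = sqrt(sum e_i^2 / (n - 3))."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    b = solve(xtx, xty)
    resid = [yi - sum(bj * rj for bj, rj in zip(b, r)) for r, yi in zip(rows, y)]
    s = math.sqrt(sum(e * e for e in resid) / (len(y) - 3))
    return b, s

# Hypothetical data, for illustration only.
hs_gpa = [2.2, 2.8, 3.0, 3.4, 3.7, 2.5]
ach_test = [75, 90, 70, 85, 80, 88]
coll_gpa = [2.3, 3.1, 2.6, 3.3, 3.2, 2.9]
(b0, b1, b2), s = fit_two_predictors(hs_gpa, ach_test, coll_gpa)
print(b0, b1, b2, s)
```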
Minitab uses those calculations and
           more algebra
          to get all this...




                                  440
        Model selection example:
   Guessing the degree of a polynomial
Data: [Scatterplot: y (-10 to 10) vs. x (-3 to 3)]

                                                 441
                linear fit (polynomial of degree 1)
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                              442
                quadratic fit (polynomial of degree 2)
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                                 443
                3rd degree polynomial fit
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                    444
                4th degree polynomial fit
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                    445
                5th degree polynomial fit
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                    446
                10th degree polynomial fit
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                     447
                15th degree polynomial fit
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                     448
                20th degree polynomial fit
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                     449
                25th degree polynomial fit
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                     450
                30th degree polynomial fit
[Plot: data with fitted curve; y from -15 to 15, x from -3 to 3]
                                                     451
  Model selection alphabet soup
AIC: "An" Information Criterion (proposed by Akaike)
BIC: "Bayesian Information Criterion" (Schwarz, 1978)

Adjusted $R^2$:   $1 - R^2_{adj} = \frac{n-1}{n-p}(1 - R^2)$

$FPE = (\text{average squared residual}) \cdot \frac{n+p}{n-p} = s^2 \left(1 + \frac{p}{n}\right)$

"Cross validation"
                                               452
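Two of these criteria are simple enough to write down as functions. A minimal sketch, assuming p counts all fitted coefficients including the intercept (our reading of the formulas, not stated on the slide) and that s^2 is the residual sum of squares divided by n - p. The function names and numbers are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Solve 1 - R2_adj = (n - 1)/(n - p) * (1 - R2) for R2_adj."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

def fpe(rss, n, p):
    """FPE = (average squared residual) * (n + p)/(n - p)."""
    return (rss / n) * (n + p) / (n - p)

# Hypothetical numbers, for illustration only.
n, p, rss = 30, 4, 52.0
s2 = rss / (n - p)               # usual variance estimate
print(fpe(rss, n, p), s2 * (1 + p / n))
print(adjusted_r2(0.87, n, p))
```

The first line prints the same value twice (up to rounding), which is the algebraic identity between the two forms of FPE. Both criteria penalize extra parameters: for a fixed R-sq, adjusted R-sq falls as p grows, and for a fixed residual sum of squares, FPE rises.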
Results of a couple of model
     selection criteria




                               453
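The degree-guessing exercise can be imitated on simulated data. The sketch below fits polynomials of degree 1 to 6 by least squares and picks the degree with the smallest FPE. Everything here is an assumption for illustration: the cubic, the noise level, the seed, and the helper names are ours, and the data are NOT the data or the polynomial from the slides.

```python
import random

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def poly_rss(x, y, degree):
    """Residual sum of squares of the least-squares polynomial of this degree."""
    p = degree + 1
    rows = [[xi ** k for k in range(p)] for xi in x]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve(xtx, xty)
    return sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, yi in zip(rows, y))

def fpe(rss, n, p):
    """FPE = (average squared residual) * (n + p)/(n - p)."""
    return (rss / n) * (n + p) / (n - p)

random.seed(0)
# Hypothetical cubic plus Normal noise, for illustration only.
x = [-3 + 6 * i / 29 for i in range(30)]
y = [xi ** 3 - 3 * xi + random.gauss(0, 1) for xi in x]

scores = {d: fpe(poly_rss(x, y, d), len(x), d + 1) for d in range(1, 7)}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

Degrees 1 and 2 score badly because they cannot follow the cubic shape. Beyond degree 3 the residual sum of squares keeps shrinking slightly, but the (n + p)/(n - p) factor charges for each extra coefficient.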
 The answer is indeed degree = 3
$y = \tfrac{1}{2}x^3 \;?\; x^2 \;?\; 5x \;?\; 4$   (the + or - signs between terms are not preserved)

[Plot: the actual (3rd degree) polynomial; y from -15 to 15, x from -3 to 3]
                                                                     454
                                                                     454
                3rd degree polynomial fit                                                                          4th degree polynomial fit

15                                                                                                  15


10                                                                                                  10


 5                                                                                                   5


 0                                                                                                   0


 -5                                                                                                 -5


-10                                                                                                -10


-15                                                                                                -15
      -3   -2   -1          0         1                2        3                                        -3   -2   -1          0         1     2     3


                                                                    The actual (3rd degree) polynomial

                                            15


                                            10


                                             5


                                             0


                                             -5


                                            -10


                                            -15
                                                                                                                                                   455
                                                  -3       -2            -1         0         1           2    3
      And now, as you go forth…
Please remember that the knowledge you have gained in
this class must always be used for good, and never,
not ever, ever, ever, ever,       ever,
for



 I’ll be around…
 Good luck with the rest of the course,
 and with all future random pursuits!
                                                 456

						