					Mathematics 243: Statistics

          M. Stob

         May 4, 2008

   This is the textbook for the course Mathematics 243 taught at Calvin College. This
edition of the book is for the Spring, 2008 version of the course.
   Not using a “standard textbook” requires an explanation. This book differs from
other available books in at least three ways. First, this book is a “modern” treatment
of statistics that reflects the most recent wisdom about what belongs, and does not
belong, in a first course on statistics. Most existing textbooks must give at least
some attention to traditional or old-fashioned approaches since traditional and old-
fashioned courses are often taught. Second, this course relies on a particular statistical
software package, R. The use of R is expected of students throughout the course. Most
traditional textbooks are published so as to be usable with any software package (or
with no software package at all). The use of R is part of what makes this text modern.
Third, this textbook is written for Mathematics 243 and so includes all and only what
is covered in the course. Most traditional textbooks are rather encyclopedic.
   While this textbook includes all the topics that are covered in the course, it is not
meant to be self-contained. In particular, the textbook is for a class that meets 52
times throughout the semester and what goes on in those sessions is important. Also,
the textbook contains numerous problems and the problems must be done so that the
concepts are understood in full detail.
   The sections of the textbook are intended to be covered in the order that they
appear in the text. An exception concerns the appendix, Using R. The R language will
be introduced throughout the text by means of examples that solve the problems at
hand. The appendix gives fuller explanation of language features that are important for
developing the proficiency with R needed to proceed. The text will often refer forward
to the appropriate section of the appendix for more details. The text is not a complete
introduction to R however. R has a built-in help facility and there are also several
introductions to the R language that are available on the web. A particularly good one
is by John Verzani.
   This text will change over the course of the semester. The current version of the
text will always be available on the course website
courses/m243/S08. The pdf version is designed to be useful for on-screen reading.
The references in the text to other parts of the text and to the web are hyperlinked.
This is the first edition of this text. Thus errors, typographical and otherwise, abound.
I encourage readers to communicate them to me. This text is a
part of a larger effort to improve the teaching of statistics at Calvin College. Earlier
versions of some of this material were used for the course Mathematics 232. Some of
the material in this book was developed by Randy Pruim and appears in the text for
Mathematics 343–344. The assistance of Pruim and Tom Scofield in the development
of these courses is gratefully acknowledged.

Introduction                                                                                                                       1

1. Data                                                                                                                          101
   1.1. Basic Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
   1.2. A Single Variable - Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
   1.3. Measures of the Center of a Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
   1.4. Measures of Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
   1.5. The Relationship Between Two Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
   1.6. Two Quantitative Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
   1.7. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

2. Data from Random Samples                                                                                                      201
   2.1. Populations and Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
   2.2. Simple Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
   2.3. Other Sampling Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
   2.4. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

3. Probability                                                                                                                   301
   3.1. Random Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
   3.2. Assigning Probabilities I – Equally Likely Outcomes . . . . . . . . . . . . . . . . . . 306
   3.3. Probability Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
   3.4. Empirical Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
   3.5. Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
   3.6. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319

4. Random Variables                                                                                                         401
   4.1. Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
   4.2. Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
        4.2.1. The Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
        4.2.2. The Hypergeometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
   4.3. An Introduction to Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
   4.4. Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
        4.4.1. pdfs and cdfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
        4.4.2. Uniform Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
        4.4.3. Exponential Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419


          4.4.4. Weibull Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
     4.5. The Mean of a Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
          4.5.1. The Mean of a Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . 423
          4.5.2. The Mean of a Continuous Random Variable . . . . . . . . . . . . . . . . . . 425
     4.6. Functions of a Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
          4.6.1. The Variance of a Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . 428
     4.7. The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
     4.8. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431

5. Inference - One Variable                                                                                                      501
   5.1. Statistics and Sampling Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
        5.1.1. Samples as random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
        5.1.2. Big Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
        5.1.3. The Standard Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
   5.2. The Sampling Distribution of the Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
   5.3. Estimating Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
        5.3.1. Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
        5.3.2. Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
        5.3.3. Mean Squared Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
   5.4. Confidence Interval for Sample Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
        5.4.1. Confidence Intervals for Normal Populations . . . . . . . . . . . . . . . . . . . 512
        5.4.2. The t Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
        5.4.3. Interpreting Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
        5.4.4. Variants on Confidence Intervals and Using R . . . . . . . . . . . . . . . . . . 516
   5.5. Non-Normal Populations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
        5.5.1. t Confidence Intervals are Robust . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
        5.5.2. Why are t Confidence Intervals Robust? . . . . . . . . . . . . . . . . . . . . . . 519
   5.6. Confidence Interval for Proportion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
   5.7. The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
   5.8. Testing Hypotheses About the Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
   5.9. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531

6. Producing Data – Experiments                                                                                                 601
   6.1. Observational Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
   6.2. Randomized Comparative Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
   6.3. Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
   6.4. Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608

7. Inference – Two Variables                                                                                             701
   7.1. Two Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
        7.1.1. The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
        7.1.2. I independent populations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
        7.1.3. One population, two factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706


          7.1.4. I experimental treatments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
     7.2. Difference of Two Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709
     7.3. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714

8. Regression                                                                                                                    801
   8.1. The Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
   8.2. Inferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
   8.3. More Inferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
   8.4. Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
        8.4.1. The residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 812
        8.4.2. Influential Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
   8.5. Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 816
   8.6. Evaluating Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
   8.7. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826

A. Appendix: Using R                                                                                                           1001
   A.1. Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1001
   A.2. Vectors and Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1001
   A.3. Data frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1002
   A.4. Getting Data In and Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1004
   A.5. Functions in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1006
   A.6. Samples and Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1007
   A.7. Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1009
   A.8. Lattice Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1009
   A.9. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1010

19:08 -- May 4, 2008                                                                                                               vii

Introduction
Kellogg’s makes Raisin Bran and packages it in boxes that are labeled “Net Weight:
20 ounces”. How might we test this claim? It seems obvious that we need to actually
weigh some boxes. However, we certainly cannot require that every box that we weigh
contains exactly 20 ounces. Surely some variation in weight from box to box is to be
expected and should be allowed. So we are faced with several questions: How many
boxes should we weigh? How should we choose these boxes? How much deviation in
weight from 20 ounces should we allow? These are the kinds of questions that the
discipline of statistics is designed to answer.

Definition 0.0.1 (Statistics). Statistics is the scientific discipline concerned with col-
lecting, analyzing and making inferences from data.

   While we cannot tell the whole Raisin Bran story here, the answers to our questions as
prescribed by NIST (National Institute of Standards and Technology) and developed
from statistical theory are something like this. Suppose that we are at a Meijer’s
warehouse that has just received a shipment of 250 boxes of Raisin Bran. We first
select twelve boxes out of the whole shipment at random. By at random we mean that
no box should be any more likely to occur in the group of twelve than any other. In
other words, we shouldn’t simply take the first twelve boxes that we find. Next we
weigh the contents of the twelve boxes. If any of the boxes are “too” underweight, we
reject the whole shipment; that is, we disbelieve the claim of Kellogg’s (and they are
in trouble). If that is not the case, then we compute the average weight of the twelve
boxes. If that average is not “too” far below 20 ounces, we do not disbelieve the claim.
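The selection step can be sketched in R. The weights below are simulated (we do not have Kellogg's actual data), so `rnorm()` and the numbers 20.1 and 0.15 are assumptions standing in for the real box-to-box variation.

```r
# Hypothetical shipment: 250 box weights that vary around 20 ounces.
shipment <- rnorm(250, mean = 20.1, sd = 0.15)

# Choose twelve boxes at random; sample() makes every box equally likely.
chosen <- sample(shipment, 12)

mean(chosen)   # the average weight of the twelve boxes
```

Here `min(chosen)` would flag whether any single box is “too” underweight.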
   Of course, the above paragraph leaves out some details. We’ll address the issue
of how to choose the boxes more carefully in Chapter 2. We’ll address the issue of
summarizing the data (in this case, using the average weight) in Chapter 1. The
question of how much below 20 ounces the average of our sample should be allowed to
be will be dealt with in Chapter 5.
   Underlying our statistical techniques is the theory of probability which we take up
in Chapter 3. The theory of probability is meant to supply a mathematical model for
situations in which there is uncertainty. In the context of Raisin Bran, we will use
probability to give a model for the variation that exists from box to box. We will
also use probability to give a model of the uncertainty introduced because we are only
weighing a sample of boxes.
   If the whole course were only about Raisin Bran, it wouldn’t be worth it (except
perhaps to Kellogg’s). But you are probably sophisticated enough to be able to generalize


this example. Indeed, the above story can be told in every branch of science (biological,
physical, and social). Each time we have a hypothesis about a real-world phenomenon
that is measurable but variable, we need to test that hypothesis by collecting data. We
need to know how to collect that data, how to analyze it, and how to make inferences
from it.
   So without further ado, let’s talk about data.

1. Data
Statistics is the science of data. In this chapter, we talk about the kinds of data that
we study and how to effectively summarize such data.

1.1. Basic Notions
For our purposes, the sort of data that we will use comes to us in collections or datasets.
A dataset consists of a set of objects, variously called individuals, cases, items,
instances, units, or subjects, together with a record of the value of a certain variable
or variables defined on the objects.

Definition 1.1.1 (variable). A variable is a function defined on the set of objects.

  Ideally, each individual has a value for each variable. These values are usually
numbers but need not be. Sometimes there are missing values.

    Example 1.1.2. Your college maintains a dataset of all currently active students.
    The individuals in this dataset are the students. Many different variables are
    defined and recorded in this dataset. For example, every student has a GPA,
    a GENDER, a CLASS, etc. Not every student has an ACT score — there are
    missing values for this variable.
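In R a missing value is recorded as NA. A minimal sketch, with made-up ACT scores, shows how NA behaves in computations:

```r
# Hypothetical ACT scores; the third student has no score on record.
act <- c(29, 24, NA, 31)

is.na(act)                # FALSE FALSE TRUE FALSE
mean(act)                 # NA: a missing value propagates through the mean
mean(act, na.rm = TRUE)   # 28: the mean of the three recorded scores
```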

   In the preceding example, some of the variables are obviously quantitative (e.g.,
GPA) and others are categorical (e.g., GENDER). A categorical variable is often
called a factor and the possible values of the categorical variables are called its levels.
Sometimes the levels of a categorical variable are represented by numbers. For example,
we might code gender using 1 for female and 0 for male. It will be quite important to
us not to treat the categorical variable as quantitative just because numbers are used
in this way. (Is the average gender 1/2?)
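The point is easy to demonstrate in R (the coded values here are invented for illustration): a numeric coding happily produces a meaningless mean, while a factor supports only the summaries that make sense.

```r
# Gender coded numerically: 1 = female, 0 = male.
gender.num <- c(1, 0, 1, 1)
mean(gender.num)   # 0.75: computable, but not a meaningful "average gender"

# The same data stored as a factor.
gender <- factor(c("F", "M", "F", "F"))
table(gender)      # counts per level: the sensible summary
# mean(gender) would warn and return NA; a factor has no mean.
```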
   It is useful to think of the values of the variable as forming a list. In R, the values
of a particular quantitative variable defined on a collection of individuals are usually
stored in a vector. A categorical variable is stored in an R object called a factor
(which behaves much like a vector). You can read more about
vectors and factors in Section A.2 of the Appendix.
   We will normally think of a dataset as presented in a two-dimensional table. The
rows of the table correspond to the individuals. (Thus the individuals need to be
ordered in some way.) The columns of the table correspond to the variables. Each
of the rows and the columns normally has a name. In R, the canonical way to store
such data is an object called a data.frame. More details on how to operate on
data.frames are in Appendix A.3.
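A tiny hand-built example (the names and values here are invented) shows the shape of a data.frame:

```r
# Three individuals (rows) and two variables (columns).
students <- data.frame(GPA   = c(3.4, 2.9, 3.8),
                       CLASS = factor(c("junior", "senior", "junior")))

dim(students)      # 3 rows, 2 columns
names(students)    # "GPA" "CLASS"
students$GPA       # the GPA column is a numeric vector
```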
  In the remainder of this section, we give a few examples of datasets that can be
accessed in R and look at some of their basic properties. These datasets will be used
several times in this book.

      Example 1.1.3. The iris dataset is a famous set of measurements taken by Edgar
      Anderson on 150 iris plants of the Gaspé Peninsula, which is located on the eastern
      tip of the province of Quebec. The dataset is included in the basic installation of
      R. The variable iris is a predefined data.frame. There are many such datasets
      built into R.
      > data(iris)     # the dataset called iris is loaded into a data.frame called iris
      > dim(iris)      # list dimensions of iris data
      [1] 150   5
      > iris[1:5,]     # print first 5 rows (individuals), all columns
        Sepal.Length   Sepal.Width Petal.Length Petal.Width Species
      1          5.1           3.5          1.4         0.2 setosa
      2          4.9           3.0          1.4         0.2 setosa
      3          4.7           3.2          1.3         0.2 setosa
      4          4.6           3.1          1.5         0.2 setosa
      5          5.0           3.6          1.4         0.2 setosa

         Notice that the data.frame has rows and columns. The individuals (rows) are,
      by default, numbered (they can also be named) and the variables (columns) are
      named. The numbers and names are not part of the dataset. Each column of a
      data.frame is a vector or a factor. In the iris dataset, there are 150 individ-
      uals (plants) and five variables. Notice that four of the variables (Sepal.Length,
      Sepal.Width, Petal.Length, Petal.Width) are quantitative variables. The fifth
      variable is categorical. In this example the variable Species is a categorical variable
      (factor) with three levels. The following example shows how to look at pieces of
      the dataset.
      > iris$Species       # a boring factor
        [1] setosa        setosa     setosa        setosa       setosa       setosa
        [7] setosa        setosa     setosa        setosa       setosa       setosa
       [13] setosa        setosa     setosa        setosa       setosa       setosa
       [19] setosa        setosa     setosa        setosa       setosa       setosa
       [25] setosa        setosa     setosa        setosa       setosa       setosa
       [31] setosa        setosa     setosa        setosa       setosa       setosa
       [37] setosa        setosa     setosa        setosa       setosa       setosa
       [43] setosa        setosa     setosa        setosa       setosa       setosa
        [49] setosa        setosa     versicolor    versicolor   versicolor   versicolor
        [55] versicolor versicolor versicolor versicolor       versicolor   versicolor
     [61] versicolor versicolor versicolor versicolor       versicolor   versicolor
     [67] versicolor versicolor versicolor versicolor       versicolor   versicolor
     [73] versicolor versicolor versicolor versicolor       versicolor   versicolor
     [79] versicolor versicolor versicolor versicolor       versicolor   versicolor
     [85] versicolor versicolor versicolor versicolor       versicolor   versicolor
     [91] versicolor versicolor versicolor versicolor       versicolor   versicolor
     [97] versicolor versicolor versicolor versicolor       virginica    virginica
    [103] virginica virginica virginica virginica           virginica    virginica
    [109] virginica virginica virginica virginica           virginica    virginica
    [115] virginica virginica virginica virginica           virginica    virginica
    [121] virginica virginica virginica virginica           virginica    virginica
    [127] virginica virginica virginica virginica           virginica    virginica
    [133] virginica virginica virginica virginica           virginica    virginica
    [139] virginica virginica virginica virginica           virginica    virginica
    [145] virginica virginica virginica virginica           virginica    virginica
    Levels: setosa versicolor virginica
     > iris$Petal.Width[c(1:5,146:150)]   # selecting some individuals
     [1] 0.2 0.2 0.2 0.2 0.2 2.3 1.9 2.0 2.3 1.8

   Example 1.1.4. There are 3,077 counties in the United States (including D.C.).
   The U.S. Census Bureau lists 3,141 units that are counties or county-equivalents.
   (Some people don’t live in a county. For example, most of the land in Alaska is
    not in any borough, which is what Alaska calls its county-level divisions. The Census
   Bureau has defined county equivalents so that all land and every person is in some
   county or other.) Data from the 2000 census about each county is available in a
    dataset maintained at the website for this course. The short R session below shows
    how to read the file and compute a few interesting numbers.
    > counties=read.csv(’’)
    > dim(counties)
    [1] 3141    9
    > names(counties)
    [1] "County"         "State"          "Population"     "HousingUnits"
    [5] "TotalArea"      "WaterArea"      "LandArea"       "DensityPop"
    [9] "DensityHousing"
    > sum(counties$Population)
    [1] 281421906
    > sum(counties$LandArea)
    [1] 3537438
   The population of the 50 states and D.C. was 281,421,906 at the time of the 2000
   U.S. Census. There were over 3.5 million square miles of land area. Notice that the
    variable State is a categorical variable and that County is really just a variable to
    hold the name of each individual.

      Example 1.1.5. Many user-contributed “packages” are available for R, and many
      of them contain additional datasets. The faraway package includes a broccoli dataset.
      In this dataset, a number of growers supply broccoli to a food processing plant.
      They are supposed to pack the broccoli in boxes with 18 clusters to a box and with
      each cluster weighing between 1.3 and 1.5 pounds. Four boxes from each of three
      growers were selected and three clusters from each box were weighed. Notice that
      although numbers are used to label the grower, box, and cluster, these variables
      are correctly stored in factors and not vectors.
      > library(faraway)
      > dim(broccoli)
      [1] 36 4
      > broccoli[1:5,]
         wt grower box cluster
      1 352      1   1       1
      2 369      1   1       2
      3 383      1   1       3
      4 339      2   1       1
      5 367      2   1       2
      > broccoli$grower
       [1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
      Levels: 1 2 3

1.2. A Single Variable - Distributions
Now that we can get our hands on some data, we would like to develop some tools
to help us understand the distribution of a variable in a dataset. By distribution
we mean two things: what values the variable takes on, and with what frequency.
Simply listing all the values of a variable is not an effective way to describe a distribution
unless the dataset is quite small. For larger datasets, we require better methods
of summarizing a distribution. In this section, we will look particularly at graphical
summaries of a single variable.
   The type of summary that we generate will vary depending on the type of data that
we are summarizing. A table is useful for summarizing a categorical variable. The
following table is a useful description of the distribution of species of iris flowers in the
iris dataset.

> table(iris$Species)

     setosa versicolor    virginica
         50         50           50

  A more interesting table gives the number of counties per state. Note that it isn’t
always the largest states that have the most counties.

> table(counties$State)
             Alabama                  Alaska              Arizona
                  67                      27                   15
            Arkansas              California             Colorado
                  75                      58                   63
         Connecticut                Delaware District of Columbia
                   8                       3                    1
             Florida                 Georgia               Hawaii
                  67                     159                    5
               Idaho                Illinois              Indiana
                  44                     102                   92
                Iowa                  Kansas             Kentucky
                  99                     105                  120
           Louisiana                   Maine             Maryland
                  64                      16                   24
       Massachusetts                Michigan            Minnesota
                  14                      83                   87
         Mississippi                Missouri              Montana
                  82                     115                   56
            Nebraska                  Nevada        New Hampshire
                  93                      17                   10
          New Jersey              New Mexico             New York
                  21                      33                   62
      North Carolina            North Dakota                 Ohio
                 100                      53                   88
            Oklahoma                  Oregon         Pennsylvania
                  77                      36                   67
        Rhode Island          South Carolina         South Dakota
                   5                      46                   66
           Tennessee                   Texas                 Utah
                  95                     254                   29
             Vermont                Virginia           Washington
                  14                     135                   39
       West Virginia               Wisconsin              Wyoming
                  55                      72                   23

  Tables can be generated for quantitative variables as well.

> table(iris$Sepal.Length)

4.3 4.4 4.5 4.6 4.7 4.8 4.9      5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9         6 6.1
  1   3   1   4   2   5   6     10   9   4   1   6   7   6   8   7   3         6   6
6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
  4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1

[Figure 1.1.: Homeruns in major leagues: hist() and histogram()]

  The table function is more useful in conjunction with the cut() function. The
second argument to cut() gives a vector of endpoints of half-open intervals. Note that
the default behavior is to use intervals that are open to the left and closed to the right.

    > table(cut(iris$Sepal.Length,c(4,5,6,7,8)))

    (4,5] (5,6] (6,7] (7,8]
       32    57    49    12
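The default can be overridden with cut()'s right argument; setting right=FALSE makes the intervals closed on the left and open on the right instead. As a check on the same iris data:

```r
> table(cut(iris$Sepal.Length,c(4,5,6,7,8),right=FALSE))

[4,5) [5,6) [6,7) [7,8)
   22    61    54    13
```

Comparing with the table above, the 10 flowers with sepal length exactly 5 have moved from the first interval to the second.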

  The kind of summary in the above table is graphically presented by means of a
histogram. There are two R commands that can be used to build a histogram: hist()
and histogram(). hist() is part of the standard distribution of R. histogram()
can only be used after first loading the lattice graphics package, which now comes
standard with all distributions of R. The R functions are used as in the following
excerpt which generates the two histograms in Figure 1.1. Notice that two forms of
the histogram() function are given. The second form (the “formula” form) will be
discussed in more detail in Section 1.5. The histograms are of the number of homeruns
per team during the 2007 Major League Baseball season.
    > library(lattice)
    > bball=read.csv('')
    > hist(bball$HR)
    > histogram(bball$HR)            # lattice histogram of a vector
    > histogram(~HR,data=bball)      # formula form of histogram

   Notice that the histograms produced differ in several ways. Besides aesthetic dif-
ferences, the two histogram algorithms typically choose different break points. Also,
the vertical scale of histogram() is in percentages of total while the vertical scale of
hist() contains actual counts. As one might imagine, there are optional arguments to
each of these functions that can be used to change such decisions.
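For example, one might sketch such adjustments as follows (breaks and type are standard arguments to these functions; bball is the data frame read in above):

```r
> hist(bball$HR,breaks=seq(100,260,by=20))   # choose the break points yourself
> histogram(bball$HR,type="count")           # counts instead of percent of total
```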

                                                                                  1.2. A Single Variable - Distributions

                     Figure 1.2.: Skewed and symmetric distributions.

  In these notes, we will usually use histogram() and indeed we will assume that the
lattice package has been loaded. Graphics functions in the lattice package often
have several useful features. We will see some of these in later Sections.
  A histogram gives a shape to a distribution and distributions are often described in
terms of these shapes. The exact shape depicted by a histogram will depend not only
on the data but on various other choices, such as how many bins are used, whether the
bins are equally spaced across the range of the variable, and just where the divisions
between bins are located. But reasonable choices of these arguments will usually lead
to histograms of similar shape, and we use these shapes to describe the underlying
distribution as well as the histogram that represents it.
  Some distributions are approximately symmetric with the distribution of the larger
values looking like a mirror image of the distribution of the lower values. We will call
a distribution positively skewed if the portion of the distribution with larger values
(the right of the histogram) is more spread out than the other side. Similarly, a
distribution is negatively skewed if the distribution deviates from symmetry in the
opposite manner. Later we will learn a way to measure the degree and direction of
skewness with a number; for now it is sufficient to describe distributions qualitatively
as symmetric or skewed. See Figure 1.2 for some examples of symmetric and skewed distributions.
  The county population data gives a natural example of a positively skewed distri-
bution. Indeed, it is so skewed that the histogram of populations by county is almost
worthless. The histogram is on the left in Figure 1.3.
  In the case of positively skewed data where the data includes observations of several
orders of magnitude, it is sometimes useful to transform the data. In the case of
county populations, a histogram of the natural log of population gives a nice symmetric
distribution. The histogram is on the right in Figure 1.3.

> logPopulation=log(counties$Population)
> histogram(logPopulation)

  Notice that each of these distributions is clustered around a center where most of
the values are located. We say that such distributions are unimodal. Shortly we




          Figure 1.3.: County populations and natural log of county populations.


        Figure 1.4.: Old Faithful eruption times (based on the faithful data set).

will discuss ways to summarize the location of the “center” of unimodal distributions
numerically. But first we point out that some distributions have other shapes that are
not characterized by a strong central tendency. One famous example is eruption times
of the Old Faithful geyser in Yellowstone National Park. The command
 > data(faithful);
 > histogram(faithful$eruptions,n=20);

produces the histogram in Figure 1.4 which shows a good example of a bimodal
distribution. There appear to be two groups or kinds of eruptions, some lasting about
2 minutes and others lasting between 4 and 5 minutes.
While the default histogram has the vertical axis read percent of total, another scale
will be useful to us. In Figure 1.5, generated by setting the type argument of
histogram() to "density", we have a density histogram. The vertical axis gives
density per unit of the horizontal axis. With this as a density, the bars of the
histogram have total mass of 1. The








               Figure 1.5.: Density histogram of Old Faithful eruption times.

histogram is read as follows. The bar that extends from 4 to 4.4 on the horizontal axis
has width 0.4 and density approximately 0.6. This means that about 24% of the data
is represented by this bar.
   One disadvantage of a histogram is that the actual data values are lost. For a large
data set, this is probably unavoidable. But for more modestly sized data sets, a stem
plot can reveal the shape of a distribution without losing the actual data values. A
stem plot divides each value into a stem and a leaf at some place value. The leaf is
rounded so that it requires only a single digit.

> stem(faithful$eruptions)

   The decimal point is 1 digit(s) to the left of the |

   16   |   070355555588
   18   |   000022233333335577777777888822335777888
   20   |   00002223378800035778
   22   |   0002335578023578
   24   |   00228
   26   |   23
   28   |   080
   30   |   7
   32   |   2337
   34   |   250077
   36   |   0000823577
   38   |   2333335582225577
   40   |   0000003357788888002233555577778
   42   |   03335555778800233333555577778
   44   |   02222335557780000000023333357778888
   46   |   0000233357700000023578
   48   |   00000022335800333
   50   |   0370

  From this output we can readily see that the shortest recorded eruption time was


1.60 minutes. The second 0 in the first row represents 1.70 minutes. Note that the
output of stem() can be ambiguous when there are not enough data values in a row.

1.3. Measures of the Center of a Distribution
Qualitative descriptions of the shape of a distribution are important and useful. But we
will often desire the precision of numerical summaries as well. Two aspects of unimodal
distributions that we will often want to measure are central tendency (what is a typical
value? where do the values cluster?), and the amount of variation (are the data tightly
clustered around a central value, or more spread out?)
  Two widely used measures of center are the mean and the median. You are prob-
ably already familiar with both. The mean is calculated by adding all the values of a
variable and dividing by the number of values. Our usual notation will be to denote
the n values as x1 , x2 , . . . , xn , and the mean of these values as x̄. Then the formula for
the mean becomes
                          x̄ = (x1 + x2 + · · · + xn )/n = (1/n) ∑ xi .
  The median is a value that splits the data in half – half of the values are smaller than
the median and half are larger. By this definition, there could be more than one median
(when there are an even number of values). This ambiguity is removed by taking the
mean of the “two middle numbers” (after sorting the data). Whereas x̄ denotes the
mean of the n numbers x1 , . . . , xn , we use x̃ to denote the median of these numbers.
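Both definitions are easy to check by hand on a small made-up vector (six values, so the median averages the two middle numbers):

```r
> x=c(3,1,4,1,5,9)
> sum(x)/length(x)      # the mean, from the formula
[1] 3.833333
> sort(x)               # the two middle values are 3 and 4
[1] 1 1 3 4 5 9
> median(x)             # (3+4)/2
[1] 3.5
```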
  The mean and median are easily computed in R. For example,
> mean(iris$Sepal.Length); median(iris$Sepal.Length);
[1] 5.843333
[1] 5.8
  We can also compute the mean and median of the Old Faithful eruption times.
> mean(faithful$eruptions); median(faithful$eruptions);
[1] 3.487783
[1] 4
Notice, however, that in the Old Faithful eruption times histogram (Figure 1.4) there
are very few eruptions that last between 3.5 and 4 minutes. So although these numbers
are the mean and median, neither is a very good description of the typical eruption
time(s) of Old Faithful. It will often be the case that the mean and median are not
very good descriptions of a data set that is not unimodal. In the case of our Old
Faithful data, there seem to be two predominant peaks, but unlike in the case of the
iris data, we do not have another variable in our data that lets us partition the eruption
times into two corresponding groups. This observation could, however, lead to some
hypotheses about Old Faithful eruption times. Perhaps eruption times are different
at night than during the day. Perhaps there are other differences in the eruptions.
Subsequent data collection (and statistical analysis of the resulting data) might help
us determine whether our hypotheses appear correct.


Comparing mean and median
Why bother with two different measures of central tendency? The short answer is
that they measure different things, and sometimes one measure is better than the
other. If a distribution is (approximately) symmetric, the mean and median will be
(approximately) the same. (See Exercise 1.2.)
  If the distribution is not symmetric, however, the mean and median may be very
different. For example, if we begin with a symmetric distribution and add in one
additional value that is very much larger than the other values (an outlier), then the
median will not change very much (if at all), but the mean will increase substantially.
We say that the median is resistant to outliers while the mean is not. A similar
thing happens with a skewed, unimodal distribution. If a distribution is positively
skewed, the large values in the tail of the distribution increase the mean (as compared
to a symmetric distribution) but not the median, so the mean will be larger than the
median. Similarly, the mean of a negatively skewed distribution will be smaller than
the median. Consider the data on the populations of the 3,141 county equivalents in
the United States. From R we see the great difference in the mean county population
and the median county population. Note that the largest county, Los Angeles County
with over 9 million people, alone contributes over 3,000 people to the mean.
> mean(counties$Population); median(counties$Population)
[1] 89596.28
[1] 24595

Over 80% of the counties in the United States are less populous than the “average” county:
> sum(counties$Population<mean(counties$Population))
[1] 2565
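The same resistance can be seen on a small made-up example: a single extreme value moves the mean dramatically but barely moves the median.

```r
> x=1:10
> mean(x); median(x)
[1] 5.5
[1] 5.5
> y=c(x,1000)            # add one very large outlier
> mean(y); median(y)
[1] 95.90909
[1] 6
```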

  Whether a resistant measure is desirable or not depends on context. If we are looking
at the income of employees of a local business, the median may give us a much better
indication of what a typical worker earns, since there may be a few large salaries (the
business owner’s, for example) that inflate the mean. This is also why the government
reports median household income and median housing costs. The median county population
perhaps tells us more about what a “typical” county looks like than does the mean.
  On the other hand, if we compare the median and mean of the value of raffle prizes,
the mean is probably more interesting. The median is probably 0, since typically the
majority of raffle tickets do not win anything. This is independent of the values of any
of the prizes. The mean will tell us something about the overall value of the prizes
involved. In particular, we might want to compare the mean prize value with the cost
of the raffle ticket when we decide whether or not to purchase one. From the mean
population of counties, we can compute the total population of the United States. We
might underestimate that number if we are only told the median county size.
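A made-up raffle of 100 tickets makes the point concrete: with one $500 prize and four $10 prizes, the median prize is 0 while the mean reflects the total prize value.

```r
> prizes=c(500,rep(10,4),rep(0,95))   # hypothetical prize list: 100 tickets
> median(prizes)
[1] 0
> mean(prizes)                        # compare with the ticket price
[1] 5.4
```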


The trimmed mean compromise
There is another measure of central tendency that is less well known and represents a
kind of compromise between the mean and the median. In particular, it is more sensitive
to the the extreme values of a distribution than the median is, but less sensitive than
the mean. The idea of a trimmed mean is very simple.
   Before calculating the mean, we remove the largest and smallest values from the data.
The percentage of the data removed from each end is called the trimming percentage.
A 0% trimmed mean is just the mean; a 50% trimmed mean is the median; a 10%
trimmed mean is the mean of the middle 80% of the data (after removing the largest
and smallest 10%). A trimmed mean is calculated in R by setting the trim argument of
mean(), e.g. mean(x,trim=.10). Although a trimmed mean in some sense combines
the advantages of both the mean and median, it is less common than either the mean
or the median. This is partly due to the mathematical theory that has been developed for
working with the median and especially the mean of sample data. The 10% trimmed
mean of county populations is 38,234 which is much closer in size to the median than
to the mean.
 > mean(counties$Population,trim=.1)
 [1] 38234.59

  In some sports, the trimmed mean is used to compute a competitor’s score based
on the scores given by individual judges. Both diving and international figure skating
work this way.
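A sketch of the judging idea with seven hypothetical scores: trimming 2/7 from each end discards the two highest and two lowest scores before averaging.

```r
> scores=c(9.0,8.5,9.5,8.0,7.5,8.0,10.0)   # hypothetical judges' scores
> mean(scores,trim=2/7)                    # mean of the middle three: 8.0, 8.5, 9.0
[1] 8.5
```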

1.4. Measures of Dispersion
It is often useful to characterize a distribution in terms of its center, but that is not
the whole story. Consider the distributions depicted in the histograms below.

        [Two histograms, labeled A and B, each on a horizontal scale from −10 to 30.]

In each case the mean and median are approximately 10, but the distributions clearly
have very different shapes. The difference is that distribution B is much more “spread


out”. “Almost all” of the data in distribution A are quite close to 10; a much larger
proportion of distribution B is “far away” from 10. The intuitive (and not very precise)
statement in the preceding sentence can be quantified by means of quantiles. The
idea of quantiles is probably familiar to you since percentiles are a special case of quantiles.

Definition 1.4.1 (Quantile). Let p ∈ [0, 1]. A p-quantile of a quantitative distribution
is a number q such that the (approximate) proportion of the distribution that is less
than q is p.

   So for example, the .2-quantile divides a distribution into 20% below and 80% above.
This is the same as the 20th percentile. The median is the .5-quantile (and the 50th percentile).
   The idea of a quantile is quite straightforward. In practice there are a few wrinkles
to be ironed out. Suppose your data set has 15 values. What is the .30-quantile? 30%
of the data would be (.30)(15) = 4.5 values. Of course, there is no number that has
4.5 values below it and 10.5 values above it. This is the reason for the parenthetical
word approximate in Definition 1.4.1. Different schemes have been proposed for giving
quantiles a precise value, and R implements several such methods. They are similar in
many ways to the decision we had to make when computing the median of a variable
with an even number of values.
   Two important methods can be described by imagining that the sorted data have
been placed along a ruler, one value at every unit mark and also at each end. To find
the p-quantile, we simply snap the ruler so that proportion p is to the left and 1 − p
to the right. If the break point happens to fall precisely where a data value is located
(i.e., at one of the unit marks of our ruler), that value is the p-quantile. If the break
point is between two data values, then the p-quantile is a weighted mean of those two
values. For example, suppose we have 10 data values: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100.
The 0-quantile is 1, the 1-quantile is 100, the .5-quantile (median) is midway between
25 and 36, that is 30.5. Since our ruler is 9 units long, the .25-quantile is located
9/4 = 2.25 units from the left edge. That would be one quarter of the way from 9 to
16, which is 9 + .25(16 − 9) = 9 + 1.75 = 10.75. (See Figure 1.6.) Other quantiles are
found similarly. This is precisely the default method used by quantile().

> quantile((1:10)^2)
    0%    25%    50%        75%   100%
  1.00 10.75 30.50        60.25 100.00

  A second scheme is just like this one except that the data values are placed midway
between the unit marks. In particular, this means that the 0-quantile is not the smallest
value. This could be useful, for example, if we imagined we were trying to estimate the
lowest value in a population from which we only had a sample. Probably the lowest
value overall is less than the lowest value in our particular sample. Other methods try


                     1    4   9       16   25   36   49   64   81 100
                                  ↑             ↑
Figure 1.6.: An illustration of a method for determining quantiles from data. Arrows
             indicate the locations of the .25-quantile and the .5-quantile.

to refine this idea, usually based on some assumptions about what the population of
interest is like.
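In R, these schemes are selected with the type argument of quantile(); type=7 is the default ruler method described above, and type=5 corresponds to placing the values midway between the marks. For the squares used above:

```r
> quantile((1:10)^2,c(.25,.5),type=7)   # the default 'ruler' method
  25%   50%
10.75 30.50
> quantile((1:10)^2,c(.25,.5),type=5)   # values placed midway between the marks
 25%  50%
 9.0 30.5
```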
   Fortunately, for large data sets, the differences between the different quantile meth-
ods are usually unimportant, so we will just let R compute quantiles for us using the
quantile() function. For example, here are the deciles and quartiles of the Old
Faithful eruption times.
> quantile(faithful$eruptions,(0:10)/10);
    0%    10%    20%    30%     40%    50%   60%    70%    80%    90%   100%
1.6000 1.8517 2.0034 2.3051 3.6000 4.0000 4.1670 4.3667 4.5330 4.7000 5.1000
> quantile(faithful$eruptions,(0:4)/4);
     0%     25%     50%     75%     100%
1.60000 2.16275 4.00000 4.45425 5.10000
The latter of these provides what is commonly called the five number summary.
The 0-quantile and 1-quantile (at least in the default scheme) are the minimum and
maximum of the data set. The .5-quantile gives the median, and the .25- and .75-
quantiles (also called the first and third quartiles) isolate the middle 50% of the data.
When the quartiles are close together, then most (well, half, to be more precise) of
the values are near the median. If those numbers are farther apart, then much (again,
half) of the data is far from the center. The difference between the first and third
quartiles is called the inter-quartile range and abbreviated IQR. This is our first
numerical measure of dispersion. The five number summary is also computed by the
R function fivenum(). However, fivenum() uses yet another method of computing the
quartiles. The .25- and .75-quantiles computed this way are called the lower hinge
and upper hinge. The computation of the lower hinge depends on whether there are
an even or odd number of data points. If there are an even number of points, the lower
hinge is simply the median of the lower half of the data. If there are an odd number
of points, the lower hinge is simply the median of the lower half of the data with the
middle data point included in that lower half. The upper hinge is computed in exactly
the same way with the middle point again being considered as part of the upper half
of the data if there are an odd number of data points.
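A small odd-length example (made up) makes the hinge rule concrete: with seven values, the middle value belongs to both halves.

```r
> x=c(1,3,5,7,9,11,13)   # median is 7
> median(x[1:4])         # lower hinge: lower half, median included
[1] 4
> median(x[4:7])         # upper hinge: upper half, median included
[1] 10
> fivenum(x)
[1]  1  4  7 10 13
```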
  The five-number summary is often presented by means of a boxplot. The standard
R function is boxplot() and the lattice function is bwplot(). A boxplot of the
Sepal.Width of the iris data is in Figure 1.7 and was generated by
> bwplot(iris$Sepal.Width)
  The sides of the box are drawn at the hinges. The median is represented by a
dot in the box. In some boxplots, the whiskers extend out to the maximum and



                  Figure 1.7.: Boxplot of Sepal.Width of iris data.

minimum values. However the boxplot that we are using here attempts to identify
outliers. Outliers are values that are unusually large or small and are indicated by a
special symbol beyond the whiskers. The whiskers are then drawn from the box to the
largest and smallest non-outliers. One common rule for automating outlier detection
for boxplots is the 1.5 IQR rule. This is the default rule in both boxplot functions in
R. Under this rule, any value that is more than 1.5 IQR away from the box is marked
as an outlier. Indicating outliers in this way is useful since it allows us to see if the
whisker is long only because of one extreme value.
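The rule is easy to apply by hand. A sketch with toy data (note that the boxplot functions measure from the hinges, so their results can differ slightly from quantile()-based quartiles):

```r
> x=c(2,3,3,4,4,5,20)
> q=quantile(x,c(.25,.75))                 # quartiles: 3 and 4.5
> iqr=q[2]-q[1]
> x[x < q[1]-1.5*iqr | x > q[2]+1.5*iqr]   # fences at 0.75 and 6.75
[1] 20
```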

Variance and Standard Deviation
Another important way to measure the dispersion of a distribution is by comparing
each value with the center of the distribution. If the distribution is spread out, these
differences will tend to be large, otherwise these differences will be small. To get a
single number, we could simply add up all of the deviations from the mean:
                    total deviation from the mean = ∑ (xi − x̄) .

The trouble with this is that the total deviation from the mean is always 0 (see Ex-
ercise 1.5). The problem is that the negative deviations and the positive deviations
always exactly cancel out.
   To fix this problem we might consider taking the absolute value of the deviations
from the mean:
                total absolute deviation from the mean = ∑ |xi − x̄| .

This number will only be 0 if all of the data values are equal to the mean. Even better
would be to divide by the number of data values. Otherwise large data sets will have


large sums even if the values are all close to the mean.
                      mean absolute deviation = (1/n) ∑ |xi − x̄| .

This is a reasonable measure of the dispersion in a distribution, but we will not use it
very often. There is another measure that is much more common, namely the variance,
which is defined by
                      variance = Var(x) = (1/(n − 1)) ∑ (xi − x̄)² .

  You will notice two differences from the mean absolute deviation. First, instead of
using an absolute value to make things positive, we square the deviations from the
mean. The chief advantage of squaring over the absolute value is that it is much
easier to do calculus with a polynomial than with functions involving absolute values.
The second difference is that we divide by n − 1 instead of by n. There is a good
reason for this, even though dividing by n seems more natural. We will get to that
reason in a later chapter on inference for a single variable. For now, we’ll use this heuristic for
remembering the n−1. If you know the mean and all but one of the values of a variable,
then you can determine the remaining value, since the sum of all the values must be
the product of the number of values and the mean. So once the mean is known, there
are only n − 1 independent pieces of information remaining.
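The heuristic is easy to verify: since the values must sum to n times the mean, the last value can be recovered from the mean and the other n − 1 values.

```r
> x=c(2,4,6,8,10)
> m=mean(x)                  # m = 6
> length(x)*m - sum(x[1:4])  # recovers the fifth value
[1] 10
```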
  Because the squaring changes the units of this measure, the square root of the vari-
ance, called the standard deviation, is commonly used in place of the variance.

                       standard deviation = SD(x) = √Var(x) .
  We will sometimes use the notation sx and sx² for the standard deviation and variance of x.
  All of these quantities are easy to compute in R.
> x=c(1,3,5,5,6,8,9,14,14,20);
> mean(x);
[1] 8.5
> x - mean(x);
  [1] -7.5 -5.5 -3.5 -3.5 -2.5 -0.5 0.5        5.5       5.5 11.5
> sum(x - mean(x));
[1] 0
> abs(x - mean(x));
  [1] 7.5 5.5 3.5 3.5 2.5 0.5 0.5              5.5       5.5 11.5
> sum(abs(x - mean(x)));
[1] 46
> (x - mean(x))^2;
  [1] 56.25 30.25 12.25 12.25       6.25          0.25         0.25   30.25   30.25 132.25





          Figure 1.8.: Box plot for iris sepal length as a function of Species.

> sum((x - mean(x))^2);
[1] 310.5
> n= length(x);
> 1/(n-1) * sum((x - mean(x))^2);
[1] 34.5
> var(x);
[1] 34.5
> sd(x);
[1] 5.87367
> sd(x)^2;
[1] 34.5

1.5. The Relationship Between Two Variables
Many scientific problems are about describing and explaining the relationship between
two or more variables. In the next two sections, we begin to look at graphical and
numerical ways to summarize such relationships. In this section, we consider the case
where one or both of the variables are categorical.
  We first consider the case when one of the variables is categorical and the other is
quantitative. This is the situation with the iris data if we are interested in the question
of how, say, Sepal.Length varies by Species. A very common way of beginning to
answer this question is to construct side-by-side boxplots.

> bwplot(Sepal.Length~Species,data=iris)

We see from these boxplots (Figure 1.8) that the virginica variety of iris tends to have
the longest sepal length, though the sepal lengths of this variety also have the greatest
spread.


                     Figure 1.9.: Sepal lengths of three species of irises.

   The notation used in the first argument of bwplot() is called formula notation and
is extremely important when considering the relationship between two variables. This
formula notation is used throughout lattice graphics and in other R functions as well.
The simplest form of a formula is

   y ~ x

We will often read this formula as “y modelled by x”. In general, the variable y is the
dependent variable and x the independent variable. In this example, it is more natural
to think of Species as the independent variable. There is nothing logically incorrect
however with thinking of sepal length as the independent variable. Usually, for plotting
functions, y will be the variable presented on the vertical axis, and x the variable to be
plotted along the horizontal axis. In this case, we are modeling (or describing) sepal
length by species.
  The formula notation can also be used with the lattice function histogram(). For
example,

> histogram(~Sepal.Length,data=iris)

will produce a histogram of the variable Sepal.Length. In this case, the dependent
variable in the formula is omitted since the dependent variable, the frequency of the
class, is computed by histogram(). Side-by-side histograms can be generated with
a more general form of the formula syntax. The same information in the boxplots above
is contained in the side-by-side histograms of Figure 1.9.

> histogram(~Sepal.Length | Species,data=iris,layout=c(3,1))

  In this form of the formula

  y~x | z


the variable z is a conditioning variable, which is used to break the data into
different groups. In the case of histogram(), the different groups
are plotted in separate panels. When z is categorical there is one panel for each level
of z. When z is quantitative, the data is divided into a number of sections based on
the values of z.
   The formula notation is used for more than just graphics. In the above example,
we would also like to compute summary statistics (such as the mean) for each of the
species separately. There are two ways to do this in R. The first uses the aggregate()
function. A much easier way uses the summary() function from the Hmisc package. The
summary() function allows us to apply virtually any function that has vector input to
each level of a categorical variable separately.

> library(Hmisc) # load Hmisc package
Loading required package: Hmisc
> summary(Sepal.Length~Species,data=iris,fun=mean);
Sepal.Length    N=150

|       |          |N |Sepal.Length|
|Species|setosa    | 50|5.006000    |
|       |versicolor| 50|5.936000    |
|       |virginica | 50|6.588000    |
|Overall|          |150|5.843333    |
> summary(Sepal.Length~Species,data=iris,fun=median);
Sepal.Length    N=150

|       |          |N |Sepal.Length|
|Species|setosa    | 50|5.0         |
|       |versicolor| 50|5.9         |
|       |virginica | 50|6.5         |
|Overall|          |150|5.8         |
> summary(Sepal.Length~Species,iris);
Sepal.Length    N=150

|       |          |N |Sepal.Length|
|Species|setosa    | 50|5.006000    |
|       |versicolor| 50|5.936000    |


|       |virginica | 50|6.588000    |
|Overall|          |150|5.843333    |

Notice that the default function used in summary() computes the mean.

       From now on we will assume that the lattice and Hmisc packages have
       been loaded and will not show the loading of these packages in our exam-
       ples. If you try an example in this book and R reports that it cannot find
       a function, it is likely that you have failed to load one of these packages.
       You can set up R to automatically load these two packages every time you
       launch R if you like.

  Of course none of these summaries – boxplots, histograms, or numerical summaries –
can tell us whether the differences in sepal lengths among species are accidental to these
150 flowers or whether these differences are significant properties of the species.
  We next turn to the case where both variables are categorical.

      Example 1.5.1. In 2004, over 400 incoming first-year students at Calvin College
      took a survey concerning, among other things, their beliefs and values. In 2007, 221
      of these students were asked these same questions again. Their responses to three
      of the questions are included in the file
      CSBVpolitical.csv. The variable SEX uses codes of 1 for male and 2 for female.
      The other two variables, POLIVW04 and POLIVW07, refer to the question “How
      would you characterize your political views?” as answered in 2004 and 2007. The
      coded responses are
                                    Far right            1
                                    Conservative         2
                                    Middle-of-the-road 3
                                    Liberal              4
                                    Far left             5
      Each of these questions results in a categorical variable. We might be interested
      in whether there is a difference between self-characterization of male students and
      female students. We might also be interested in the relationship of the views of
      a student in 2004 and 2007. The first few entries of this dataset are given in the
      following output.
      > csbv=read.csv("")
      > csbv[1:5,]
        SEX POLIVW04 POLIVW07
      1   1        2        2
      2   1        3        3
      3   2        2        2


     4   1          2         2
     5   2          2         2

   The most useful form of summary of data that arises from two or more categorical
variables is a cross tabulation. We first use a cross-tabulation to determine the
relationship of the gender of a student to his or her political views as entering first-year
students.
> xtabs(~SEX+POLIVW04,csbv)
   POLIVW04
SEX  1  2  3  4
  1  7 47 28  6
  2  0 67 48 14

   While the command syntax is a bit inscrutable, it should be clear how to read the
table. Note that no entering students characterized their views as “Far left” and no
female characterized her views as “Far right.” Also notice that it appears that males
tended to be more conservative than females.
   The xtabs() function uses the formula syntax. As in histogram(), there is no
dependent variable in the formula since the frequencies are computed from the data.
Also, the formula has the form ~y1+y2 where the plus sign indicates that there are two
independent variables. Another example of xtabs(), with just one independent variable, is
> xtabs(~SEX ,csbv)
SEX
  1   2
 88 133

which counts the number of males and females in our dataset.
   In this first example of xtabs our dataset contained a record for each observation.
It is quite often the case that we are only given summary data.

    Example 1.5.2. Data on graduate school admissions to six different departments
    of the University of California, Berkeley, in 1973 are summarized in
    the dataset
     > Admissions=read.csv(’’)
     > Admissions[c(1,10,19),]
           Admit Gender Dept Freq
     1 Admitted    Male    A 512
     10 Rejected   Male    C 205
     19 Admitted Female    E   94
     We see that 512 Males were admitted to Department A while 205 Males were rejected
    by Department C. We now use the xtabs function with a dependent variable:


      > xtabs(Freq~Gender+Admit,Admissions)
                Admit
      Gender   Admitted Rejected
        Female       557    1278
        Male        1198    1493

      There seems to be a relationship between the two variables in this cross-tabulation.
      Females were rejected at a greater rate than Males. While this might be evidence
      of gender bias at Berkeley, further analysis tells a more complicated story.

      > xtabs(Freq~Gender+Admit+Dept,Admissions)
      , , Dept = A

      Gender   Admitted Rejected
        Female       89       19
        Male        512      313

      , , Dept = B

      Gender   Admitted Rejected
        Female       17        8
        Male        353      207

      , , Dept = C

      Gender   Admitted Rejected
        Female      202      391
        Male        120      205

      , , Dept = D

      Gender   Admitted Rejected
        Female      131      244
        Male        138      279

      , , Dept = E

      Gender   Admitted Rejected
        Female       94      299
        Male         53      138

      , , Dept = F


    Gender   Admitted Rejected
      Female       24      317
      Male         22      351

     In all but two departments, females are admitted at a greater rate than males,
    while in those two departments the admission rates are quite similar.

  The next example again illustrates the difficulty in trying to explain the relationship
between two categorical variables, in this case race and the death penalty.

   Example 1.5.3. A 1981 paper investigating racial biases in the application of
   the death penalty reported on 326 cases in which the defendant was convicted of
   murder. For each case they noted the race of the defendant and whether or not the
   death penalty was imposed.
     > deathpenalty=read.table("",header=T)
    > deathpenalty[1:5,]
      Penalty Victim Defendant
    1     Not White      White
    2     Not Black      Black
    3     Not White      White
    4     Not Black      Black
    5   Death White      Black
     > xtabs(~Penalty+Defendant,data=deathpenalty)
            Defendant
     Penalty Black White
       Death    17     19
       Not     149    141
    (We have used read.table(), which is suitable for files that are not CSV but
    rather have data separated by spaces. Unlike read.csv(), read.table() does not
    assume a header row of variable names, so we specify header=T.)
     From the output, it does not look like there is much of a difference in the rates
   at which black and white defendants receive the death penalty although a white
   defendant is slightly more likely to receive the death penalty. However a different
   picture emerges if we take into account the race of the victim.
    > xtabs(~Penalty+Defendant+Victim,data=deathpenalty)
    , , Victim = Black

    Penalty Black White
      Death     6     0
      Not      97     9


      , , Victim = White

      Penalty Black White
        Death    11     19
        Not      52    132

      It appears that black defendants are more likely to receive the death penalty when
      the victim is black and also when the victim is white.

   In the last example, we met something called Simpson’s Paradox. Specifically, we
found that a relationship between two categorical variables (white defendants receive
the death penalty more frequently) is reversed if we divide the analysis by a third
categorical variable (black defendants receive the death penalty more often if the victim
is white and if the victim is black).
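The reversal in the Berkeley data can be checked directly from the department counts printed above. Since the point here is the arithmetic rather than R syntax, the following sketch is in Python; the counts are exactly those in the cross-tabulations.

```python
# (female admitted, female rejected, male admitted, male rejected) per department,
# copied from the Berkeley cross-tabulation above.
depts = {
    "A": (89, 19, 512, 313),
    "B": (17, 8, 353, 207),
    "C": (202, 391, 120, 205),
    "D": (131, 244, 138, 279),
    "E": (94, 299, 53, 138),
    "F": (24, 317, 22, 351),
}

def rate(admitted, rejected):
    return admitted / (admitted + rejected)

# Aggregated over all departments: males are admitted at the higher rate.
fa = sum(d[0] for d in depts.values()); fr = sum(d[1] for d in depts.values())
ma = sum(d[2] for d in depts.values()); mr = sum(d[3] for d in depts.values())
print(f"female overall: {rate(fa, fr):.3f}  male overall: {rate(ma, mr):.3f}")

# Department by department, the comparison largely reverses.
for name, (f_a, f_r, m_a, m_r) in depts.items():
    print(name, f"female {rate(f_a, f_r):.3f}", f"male {rate(m_a, m_r):.3f}")
```

Running this shows the male admission rate is higher in the aggregate, yet the female rate is higher in four of the six departments — Simpson's paradox in miniature.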
   A cross-table is usually the most useful way to present data on the relationship
between two categorical variables. A graphical representation that is sometimes used
however is called a mosaic plot. We illustrate the relationship between gender and
political views in 2007 of the Calvin sample of 221 students. The function is

> mosaicplot(~SEX+POLIVW07,csbv)

and it generates the picture in Figure 1.10. Here area is proportional to frequency. It is
easy to see here (if we recall the codes) that the female student population is somewhat
less conservative in political orientation than the male population.
 Figure 1.10.: A mosaic plot of the relationship between political views and gender.


             Figure 1.11.: The corrosion data with a “good” line added on the right.

1.6. Two Quantitative Variables
A very common problem in science is to describe and explain the relationship be-
tween two quantitative variables. Often our scientific theory (or at least our intuition)
suggests that two variables have a relatively simple functional relationship, at least
approximately. We look at three typical examples.

        Example 1.6.1. Thirteen bars of 90-10 Cu/Ni alloys were submerged for sixty days
        in sea water. The bars varied in iron content. The weight loss due to corrosion for
        each bar was recorded. The R dataset below gives the percentage content of iron
        (Fe) and the weight loss in mg per square decimeter (loss).
         > library(faraway)
         > data(corrosion)
         > corrosion[c(1:3,12:13),]
              Fe loss
         1 0.01 127.6
         2 0.48 124.0
         3 0.71 110.8
         12 1.44 91.4
         13 1.96 86.2
         > xyplot(loss~Fe, data=corrosion)
         > xyplot(loss~Fe,data=corrosion,type=c("p","r"))                  # plot has points, regression line
          It is evident from the plot (Figure 1.11) that the greater the percentage of iron,
        the less corrosion. The plot suggests that the relationship might be linear. In the
        second plot, a line is superimposed on the data. The line is meant to summarize
        approximately the linear relationship between iron content and corrosion. (We will
        explain how to choose the line soon.) Note that to plot the relationship between
        two quantitative variables, we may use either plot() from the base R package or
        xyplot() from lattice. The function xyplot() uses the same formula notation as
        histogram().


                  Distance      Time    Record Holder
                       100       9.77   Asafa Powell (Jamaica)
                       200      19.32   Michael Johnson (US)
                       400      43.18   Michael Johnson (US)
                       800    1:41.11   Wilson Kipketer (Denmark)
                      1000    2:11.96   Noah Ngeny (Kenya)
                      1500    3:26.00   Hicham El Guerrouj (Morocco)
                      Mile    3:43.13   Hicham El Guerrouj (Morocco)
                      2000    4:44.79   Hicham El Guerrouj (Morocco)
                      3000    7:20.67   Daniel Komen (Kenya)
                      5000   12:37.35   Kenenisa Bekele (Ethiopia)
                    10,000   26:17.53   Kenenisa Bekele (Ethiopia)

                    Table 1.1.: Men’s World Records in Track (IAAF)

   What is the role of the line that we superimposed on the plot of the data in this
example? Obviously, we do not mean to claim that the relationship between iron
content and corrosion loss is completely captured by the line. But as a “model” of
the relationship between these variables, the line has at least three possible important
uses. First, it provides a succinct description of the relationship that is difficult to see
in the unsummarized data. The line plotted has equation

                                loss = 129.79 − 24.02Fe.

Both the intercept and slope of this line have simple interpretations. For example,
the slope suggests that every increase of 1% in iron content means a decrease in weight
loss of 24.02 mg per square decimeter. Second, the model might be used for
prediction in a situation where we have a yet untested object. We can easily use this
line to make a prediction for the material loss in an alloy of 2% iron content. Finally,
it might figure in a scientific explanation of the phenomenon of corrosion.
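The prediction at 2% iron content, for instance, is a one-line calculation with the fitted coefficients (shown here in Python simply to display the arithmetic):

```python
b0, b1 = 129.79, -24.02   # intercept and slope of the fitted line above
fe = 2.0                  # iron content, in percent
loss = b0 + b1 * fe       # predicted weight loss, mg per square decimeter
print(round(loss, 2))     # 81.75
```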

      Example 1.6.2. The current world records for men’s track appear in Table 1.1.
      These data may be found at
      csv. The plot of record distances (in meters) and times (in seconds) looks roughly
      linear. We know of course (for physical reasons) that this relationship cannot be
      a linear one. Nevertheless, it appears that a smooth curve might approximate the
      data very well and that this curve might have a relatively simple formula. Such a
      formula might help us predict what the world record time in a 4,000 meter race
      might be (if ever such a race would be run by world-class runners).





    Example 1.6.3. The R dataset trees contains the measurements of the volume
    (in cu ft), girth (diameter of tree in inches measured at 4 ft 6 in above the ground),
    and height (in ft) of 31 black cherry trees in a certain forest. Since girth is easily
    measured, we might want to use girth to predict volume of the tree. A plot shows
    the relationship.
     > data(trees)
     > trees[c(1:2,30:31),]
        Girth Height Volume
     1    8.3     70   10.3
     2    8.6     65   10.3
     30 18.0      80   51.0
     31 20.6      87   77.0
     > xyplot(Volume~Girth,data=trees)


    In this example, we probably wouldn’t expect that a linear relationship is the best
    way to describe the data. Furthermore, the data indicate that no simple function
    is going to describe completely the variation in volume as a function of girth.
    This makes sense because we know that trees of the same girth can have different
    heights, and hence different volumes.
   These three examples share the following features. In each, we are given n observa-
tions (x1 , y1 ), . . . , (xn , yn ) of quantitative variables x and y. In each we would like to


express the relationship between x and y, at least approximately, using a simple func-
tional form. In each case we would like to find a “model” that explains y in terms of
x. Specifically, we would like to find a simple functional relationship y = f (x) between
these variables. Summarizing, our goal is the following

      Goal:   Given (x1 , y1 ), . . . , (xn , yn ), find a “simple” function f such that yi is
              approximately equal to f (xi ) for every i.

   The goal is vague. We need to make precise the notion of “simple” and also the
measure of fit we will use in evaluating whether yi is close to f (xi ). In the rest of this
section, we make these two notions precise. The simplest functions we study are linear
functions such as the function that we used in Example 1.6.1. In other words, in this
case our goal is to find b0 and b1 so that yi ≈ b0 + b1 xi for all i. (Statisticians use b0 , b1
or a, b for the intercept and slope rather than the m, b that are typical in mathematics
texts. We will use b0 , b1 .) Of course, in only one of our motivating examples does it
seem sensible to use a line to approximate the data. So two important questions that
we will need to address are: How do we tell if a line is an appropriate description of
the relationship? and What do we do if a linear function is not the right relationship?
We will address both questions later.
   How shall we measure the goodness of fit of a proposed function f to the data? For
each xi the function f predicts a certain value ŷi = f (xi ) for yi . Then ri = yi − ŷi is
the “mistake” that f makes in the prediction of yi . Obviously we want to choose f so
that the values ri are small in absolute value. Introducing some terminology, we will
call ŷi the fitted or predicted value of the model and ri the residual. The following
is a succinct statement of the relationship

                                observation = fitted + residual.

   It will be impossible to choose a line so that all the values of ri are simultaneously
small (unless the data points are collinear). Various values of b0 , b1 might make some
values of ri small while making others large. So we need some measure that aggregates
all the residuals. Many choices are possible, and R provides software to find the resulting
lines for many of them, but the canonical choice, and the one we investigate here, is the
sum of squares of the residuals. Namely, our goal is now refined to the following

       Goal:   Given (x1 , y1 ), . . . , (xn , yn ), find b0 and b1 such that if f (x) = b0 + b1 x
               and ri = yi − f (xi ) then Σ ri² is minimized.


                                                                  1.6. Two Quantitative Variables

  We call Σ ri² the sum of squares of the residuals and denote it by SSResid or SSE (for
sum of squared errors). The choice of the squaring function here is quite analogous to the
choice of squaring in the definition of variance for measuring variation. Just as in that
problem, different ways of combining the ri are possible. Before we discuss the solution
of this problem, we show how to solve it in R using the data of Example 1.6.1. The R
function lm finds the coefficients of the line that minimizes the sums of squares of the
residuals. Note that it uses the same syntax for expressing the relationship between
variables as does xyplot.

> lm(loss~Fe,data=corrosion)

Call:
lm(formula = loss ~ Fe, data = corrosion)

Coefficients:
(Intercept)           Fe
     129.79       -24.02

   While we will always use R to solve our minimization problem, it is worthwhile to
explicitly solve for b0 and b1 so that we see how these coefficients are related to the
values of the data. Finding b0 and b1 is a minimization problem of the sort addressed
in calculus classes. In particular we want to find b0 and b1 to minimize

                               SSResid = Σ (yi − (b0 + b1 xi ))² .

  It is important to note that SSResid is a function of b0 and b1 thought of as variables
(the xi and yi that appear in this function are not variables but rather have numerical
values) and so the task of finding b0 and b1 is that of minimizing a function of two
variables. Since the function is nicely differentiable (one consequence of using squares
rather than absolute values), calculus tells us to find the points where the partial
derivatives of SSResid with respect to each of b0 and b1 are 0. (Of course then we have
to check that we have found a minimum rather than a maximum or a saddlepoint.)
After much algebra, we find that

                        b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² ,        b0 = ȳ − b1 x̄ .
  Therefore the equation of our “least-squares” line is

                       y = ȳ + [ Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² ] (x − x̄) .
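For readers who want the skipped step, setting the two partial derivatives of SSResid to zero yields the so-called normal equations, which the formulas for b0 and b1 solve:

```latex
\frac{\partial\,\mathrm{SSResid}}{\partial b_0}
  = -2\sum_{i=1}^{n}\bigl(y_i - b_0 - b_1 x_i\bigr) = 0,
\qquad
\frac{\partial\,\mathrm{SSResid}}{\partial b_1}
  = -2\sum_{i=1}^{n} x_i\bigl(y_i - b_0 - b_1 x_i\bigr) = 0 .
```

Dividing the first equation by −2n gives ȳ = b0 + b1 x̄ immediately, which is where b0 = ȳ − b1 x̄ comes from.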
  The quantities in these expressions are tedious to write, so we introduce some useful
notation:

                   Sxx = Σ(xi − x̄)²                        sx² = Sxx /(n − 1)
             SST = Syy = Σ(yi − ȳ)²                        sy² = Syy /(n − 1)
                   Sxy = Σ(xi − x̄)(yi − ȳ)

  We can now rewrite the expression for b1 as
                                       b1 = Sxy / Sxx
and the equation for the line as
                                y − ȳ = (Sxy / Sxx) (x − x̄) .
   An important fact that we note immediately from the above equation for the line
is that it passes through the point (x̄, ȳ). This says that, whatever else, we should
predict that the value of y is “average” if the value of x is “average”. This seems like
a plausible thing to do.
   The slope b1 of the regression line tells us something about the nature of the linear
relationship between x and y. A positive slope suggests a positive relationship between
the two quantities, for example. However the slope has units — we would like a
dimensionless measure of the linear relationship. The key to finding one is to re-
express the variables x and y as unit-free quantities, that is, to “standardize” x and y.
   In problem 1.13 we introduced the notion of standardization of a variable. If x is
a variable, the new variable z = (x − x̄)/sx changes the data to have mean 0 and standard
deviation 1. This new variable is unit-less. It can be shown that the regression equation
can be written as
                              (y − ȳ)/sy = r · (x − x̄)/sx
where r is the correlation coefficient between x and y given by
                              r = Sxy / √(Sxx Syy) .

It can be shown that −1 ≤ r ≤ 1. For the corrosion dataset we find that the correlation
coefficient between iron content (Fe) and material loss due to corrosion (loss) is −.98.
> cor(corrosion$loss,corrosion$Fe)
[1] -0.9847435


This number can be easily interpreted using a sentence such as “loss decreases approxi-
mately .98 standard deviations for each increase of 1 standard deviation of iron content
in this dataset.”
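The quantities Sxx, Syy, Sxy, b1, b0, and r defined above can all be computed directly from their formulas. The sketch below does so in Python for concreteness, on a tiny made-up dataset whose points fall exactly on a line; in practice R's lm() and cor() do this work for you.

```python
def least_squares(x, y):
    """Compute b0, b1 and r from the textbook formulas."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = ybar - b1 * xbar               # intercept: line passes through (xbar, ybar)
    r = sxy / (sxx * syy) ** 0.5        # correlation coefficient
    return b0, b1, r

# Made-up data lying exactly on y = 1 + 2x, so r should be exactly 1.
b0, b1, r = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1, r)   # 1.0 2.0 1.0
```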
   In R, the object defined by the lm() function is actually a list that contains more than
just the fitted line. There are several functions to access the information contained in
that object. In particular, residuals() and fitted() return vectors of the same
length as the data containing the residuals and fitted values corresponding to each data
point.

> l=lm(loss~Fe,corrosion)
> fitted(l)
        1         2         3         4         5         6         7        8
129.54640 118.25705 112.73247 106.96770 101.20293 129.54640 118.25705 95.19795
        9        10        11        12        13
112.73247 82.70761 129.54640 95.19795 82.70761
> residuals(l)
         1          2          3          4          5          6         7
-1.9464003 5.7429496 -1.9324749 -3.0677005 0.2970739 0.5535997 3.7429496
         8          9         10         11         12         13
-2.8979527 0.3675251 0.9923919 -1.5464003 -3.7979527 3.4923919

From this output, we can see that the largest residual corresponds to the second data
point. For that point, (0.48,124), the predicted value is 118.26 and the residual is 5.74.
Note that a positive residual means that the prediction underestimates the actual.
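The arithmetic for that point is easy to check by hand, using the rounded coefficients printed by lm() (a quick Python check; the values agree with the R output above to two decimals):

```python
b0, b1 = 129.79, -24.02         # fitted line for the corrosion data
fe, observed = 0.48, 124.0      # the second data point
fitted = b0 + b1 * fe
residual = observed - fitted    # positive: the line under-predicts here
print(round(fitted, 2), round(residual, 2))   # 118.26 5.74
```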
  A plot of residuals is often useful in determining whether a linear relationship is an
appropriate description of the relationship between the two variables. We know that
the track record data of Example 1.6.2 is not best summarized by a linear relationship.
When we try to do that, we have the residual plot of Figure 1.12.

> track=read.csv("")
> l=lm(Seconds~Meters,data=track)
> xyplot(residuals(l)~Meters,data=track)

  The residual plot certainly suggests that there is structure in the data that is other
than linear. The fitted model consistently underpredicts at short and long distances
while overpredicting at intermediate distances.

1.7. Exercises

1.1 Load the built-in R dataset chickwts. (Use data(chickwts).)
  a) How many individuals are in this dataset?

  b) How many variables are in this dataset?



          Figure 1.12.: A residual plot for the male world records in track data.

  c) Classify the variables as quantitative or categorical.

1.2 The distribution of a quantitative variable is symmetric about m if whenever
there are k data values equal to m + d there are also k data values equal to m − d.
  a) Show that if a distribution is symmetric about m then m is the median. (You
     may need to handle separately the cases where the number of values is odd and
     even.)
  b) Show that if a distribution is symmetric about m then m is the mean.

  c) Create a small distribution that is not symmetric about m, but the mean and
     median are both equal to m.

1.3 Describe some situations where the mean or median is clearly a better measure of
central tendency than the other.
1.4 A bowler normally bowls a series of three games. When the author was first
learning long division, he learned to compute a bowling average. However he did not
completely understand the concept since, to find the average of three games, he took
the average of the first two games and then averaged that with the third game.
(That is, if x̄2 denotes the mean of the first two games and x̄3 denotes the mean of the
three games, the author thought that x̄3 = (x̄2 + x3 )/2, where x3 is the score of the
third game.)
  a) Give a counterexample to the author’s method of computing the average of three
     games.

  b) Given x̄2 and x3 , how should x̄3 be computed?

                                                                           1.7. Exercises

  c) Generalizing, given the mean x̄n of n observations and an additional observation
     xn+1 , how should the mean x̄n+1 of the n + 1 observations be computed?

1.5 Show that the total deviation from the mean, defined by
                      total deviation from the mean = Σ(xi − x̄) ,

is 0 for any distribution.
1.6 Find a distribution with 10 values between 0 and 10 that has as large a variance
as possible.
1.7 Find a distribution with 10 values between 0 and 10 that has as small a variance
as possible.
1.8 We could compute the mean absolute deviation from the median instead of from
the mean. Show that the mean absolute deviation from the median is always less than
or equal to the mean absolute deviation from the mean.

1.9 Let SS(c) = Σ(xi − c)² . (SS stands for sum of squares.) Show that the smallest
value of SS(c) occurs when c = x̄. This shows that the mean is a minimizer of SS.
(Hint: use calculus.)

1.10 Sketch a boxplot of a distribution that is positively skewed.
1.11 Suppose that x1 , . . . , xn are the values of some variable and a new variable y is
defined by adding a constant c to each xi . In other words, yi = xi + c for all i.
  a) How does ȳ compare to x̄?
  b) How does Var(y) compare to Var(x)?

1.12 Repeat Problem 1.11 but with yi defined by multiplying xi by c. In other words,
yi = cxi for all i.
1.13 Suppose that x1 , . . . , xn are given and we define a new variable z by
                                     zi = (xi − x̄)/sx .
What is the mean and the standard deviation of the variable z? This transformed
variable is called the standardization of x. In R, the expression z=scale(x) produces
the standardization. The standard value zi of xi is also sometimes called the z-score
of xi .
1.14 The dataset singer comes with the lattice package. Make sure that you have
loaded the lattice package and then load that dataset. The dataset contains the
heights of 235 singers in the New York Choral Society.


  a) Using a histogram of the heights of the singers, describe the distribution of
     heights.
  b) Using side-by-side boxplots, describe how the heights of singers vary according
     to the part that they sing.

1.15 The R dataset barley has the yield in bushels/acre of barley for various varieties
of barley planted in 1931 and 1932. There are three categorical variables in play: the
variety of barley planted, the year of the experiment, and the site at which the exper-
iment was done (the site Grand Rapids is in Minnesota, not Michigan). By examining
each of these variables one at a time, make some qualitative statements about the way
each variable affected yield. (E.g., did the year in which the experiment was done affect
the yield?)
1.16 A dataset from the Data and Story Library on the result of three different methods
of teaching reading can be found at
csv. The data includes the results of various pre- and post-tests given to each student.
There were 22 students taught by each method. Using the results of POST3, what can
you say about the differences in reading ability of the three groups at the end of the
course? Would you say that one of the methods is better than the other two? Why or
why not?

1.17 The death penalty data illustrated Simpson’s paradox. Construct your own
illustration to conform to the following story:

      Two surgeons each perform the same kind of heart surgery. The result
      of the surgery could be classified as “successful” or “unsuccessful.” They
      have each done exactly 200 surgeries. Surgeon A has a greater rate of
      success than Surgeon B. Now the surgical patient’s case can be classified as
      either “severe” or “moderate.” It turns out that when operating on severe
      cases, Surgeon B has a greater rate of success than Surgeon A. And when
      operating on moderate cases, Surgeon B also has a greater rate of success
      than Surgeon A.

By the way, who would you want to be your surgeon?

1.18 Data on the 2003 American League Baseball season is in the file http://www.

  a) Suppose that we wish to predict the number of runs (R) a team will score on
     the year given the number of homeruns (HR) the team will hit. Write a linear
     relationship between these two variables.

  b) Use this linear relationship to predict the number of runs a team will score given
     it hits 200 homeruns on the year.


  c) Are there any teams for which the linear relationship does a poor job in predicting
     runs from homeruns?

1.19 Continuing to use data from the AL 2003 baseball season, suppose that we wish
to predict the number of games a team will win (W) from the number of runs the team
scores (R).

  a) Write a linear relationship for W in terms of R.

  b) How many runs must a team score to win 81 games according to this relationship?

1.20 Suppose that we wish to fit a linear model without a constant: i.e., y = bx. Find
the value of b that minimizes the sum of squares of residuals, ∑_{i=1}^{n} (y_i − bx_i)^2, in this
case. (Hint: there is only one variable here, b, so this is a straightforward Mathematics
161 max-min problem.)

1.21 In R, if we wish to fit a line y = bx without the constant term, we use lm(y~x-1).
(The -1 in the formula notation in this context tells R to omit the constant term.) Using
the same data as Problem 1.19, define new variables for W − L and R − OR. (For
example, define wl=s$W-s$L where s is the data frame containing your data.)

  a) Write W − L as a linear function of R − OR without a constant term.

  b) Why do you think it makes sense (given the nature of the variables) to omit a
     constant term in this model?
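The -1 notation can be tried out on any small dataset. As a sketch (the numbers below are made up for illustration; they are not from the baseball file):

```r
# Hypothetical data (not the baseball dataset) to illustrate the -1 notation.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit.with <- lm(y ~ x)        # the usual model y = a + bx
fit.without <- lm(y ~ x - 1) # the -1 omits the constant term

coef(fit.with)     # two coefficients: (Intercept) and x
coef(fit.without)  # one coefficient: the slope b
```

The coef function extracts the fitted coefficients; with -1 in the formula, the model has a single coefficient.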

1.22 The R dataset women gives the average weight of American women by height. Do
you think that a linear relationship is the best way to describe the relationship between
average weight and height?

19:08 -- May 4, 2008                                                                 135
2. Data from Random Samples
If we are to make decisions based on data, we need to be careful in their collection. In
this chapter we consider one common way of generating data, that of sampling from a
population.

2.1. Populations and Samples
To determine whether Kellogg’s is telling the truth about the net weight of its boxes
of Raisin Bran, it is simply not feasible to weigh every box of cereal in the warehouse.
Instead, the procedure recommended by NIST (National Institute of Standards and
Technology) tells us to select a sample consisting of a relatively small number of boxes
and weigh those. For example, in a shipment of 250 boxes, NIST tells us to weigh just
12. The hope is that this smaller sample is representative of the larger collection, the
population of all cereal boxes. We might hope, for example, that the average weight
of boxes in the sample is close to the average weight of the boxes in the population.

Definition 2.1.1 (population). A population is a well-defined collection of individuals.

  As with any mathematical set, sometimes we define a population by a census or
enumeration of the elements of the population. The registrar can easily produce an
enumeration of the population of all currently registered Calvin students. Other times,
we define a population by properties that determine membership in the population.
(In mathematics, we define sets like this all the time since many sets in mathematics
are infinite and so do not admit enumeration.) For example, the set of all persons who
voted in the last Presidential election is a well-defined population but it doesn’t admit
an easy enumeration.

Definition 2.1.2 (sample). A subset S of population P is called a sample from P .

   Quite typically, we are studying a population P but have only a sample S and have
the values of one or several variables for each element of S. The canonical goal of
(inferential) statistics is:

    Goal:   Given a sample S from population P and values of a variable X on
            elements of S, make inferences about the values of X on the elements of P.


  Most commonly, we will be making inferences about parameters of the population.

Definition 2.1.3 (parameter). A parameter is a numerical characteristic of the population.

  For example, we might want to know the mean value of a certain variable defined
on the population. One strategy for estimating the mean of such a variable is to take
a random sample and compute the mean of the sample elements. Such an estimate is
called a statistic.

Definition 2.1.4 (statistic). A statistic is a numerical characteristic of a sample.

      Example 2.1.5. The Current Population Survey (CPS) is a survey sponsored
      jointly by the Census Bureau and the Bureau of Labor Statistics. Each month
      60,000 households are surveyed. The intent is to make inferences about the whole
      population of the United States. For example, one population parameter is the
      unemployment rate – the ratio of the number of those unemployed to the size of
      the total labor force. The sample produces a statistic that is an estimate of the
      unemployment rate of the whole population.

  Obviously, our success in using a sample to make inferences about a population
will depend to a large extent on how representative S is of the whole population P
with respect to the properties measured by X. As one might imagine, if the 60,000
households in the Current Population Survey are to give dependable information about
the whole population, they must be chosen very carefully.

      Example 2.1.6. The Literary Digest began forecasting elections in 1912. While it
      forecast the results of the elections accurately until 1932, in 1936 the poll predicted
      that Alf Landon would receive 55% of the popular vote. Of course Roosevelt went
      on to win the election in a landslide with 61% of the popular vote. What went
      wrong with the poll? There were at least two problems with the survey. First,
      the Literary Digest sampled from telephone directories and automobile registration
      lists. Voters with telephones and automobiles in 1936 tended to be more affluent
      and so were somewhat more likely to favor Landon than the typical voter. Second,
      although the Digest sent out more than 10 million questionnaires, only 2.3 million of
      these were returned. So it probably is the case that voters favorable to Landon were
      more likely to return their questionnaires than those favorable to Roosevelt.

  The representativeness of the sample will depend on how the sample is chosen. A
convenience sample is a sample chosen simply by locating units that conveniently
present themselves. A convenience sample of students at Calvin could be produced by

                                                             2.2. Simple Random Samples

grabbing the first 100 students that come through the doors of Johnny’s. It’s pretty
obvious that in this case, and for convenience samples in general, there is no guarantee
that the sample is likely to be representative of the whole population. In fact we can
predict some ways in which a “Johnny’s sample” would not be representative of the
whole student population.
   One might suppose that we could construct a representative sample by carefully
choosing the sample according to the important characteristics of the units. For ex-
ample, to choose a sample of 100 Calvin students, we might ensure that the sample
contains 54 females and 46 males. Continuing, we would then ensure a representative
proportion of first-year students, dorm-livers, etc. There are several problems with
this strategy. There are usually so many characteristics that we might consider that
we would have to take too large a sample so as to get enough subjects to represent
all the possible combinations of characteristics in the proportions that we desire. It
might be expensive to find the individuals with the desired characteristics. We have
no assurance that the subjects we choose with the desired combination of character-
istics are representative of the group of all the individuals with those characteristics.
Finally, even if we list many characteristics, it might be the case that the sample will
be unrepresentative according to some other characteristic that we didn’t think of and
that characteristic might turn out to be important for the problem at hand.
   Instead of trying to construct a representative sample, most survey samples are
chosen at “random.” We investigate the simplest sort of random sample in the next
section.

2.2. Simple Random Samples

Definition 2.2.1 (simple random sample). A simple random sample (SRS) of size k
from a population is a sample that results from a procedure for which every subset of
size k has the same chance to be the sample chosen.

   For example, to pick a random sample of size 100 of Calvin students, we might
write the names of all Calvin students on index cards and choose 100 of these cards
from a well-mixed bag of all the cards. In practice, random samples are often picked
by computers that produce “random numbers.” (A computer can’t really produce
random numbers since a computer can only execute a deterministic algorithm. However
computers can produce numbers that behave as if they are random. We’ll talk about
what that might mean later.) In this case, we would number all students from 1 to 4,224
and then choose 100 numbers from 1 to 4224 in such a way that any set of 100 numbers
has the same chance of occurring. The R command sample(1:4224,100,replace=F)
will choose such a set of 100 numbers.
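As a sketch of how this command behaves (the count 4,224 is from the text; the comment about a names vector is a hypothetical illustration, since no actual student list is assumed):

```r
# Draw a simple random sample of 100 student numbers from 1 to 4224.
ids <- sample(1:4224, 100, replace = FALSE)

length(ids)          # 100 numbers were drawn
length(unique(ids))  # all 100 are distinct, since replace = FALSE
range(ids)           # every number lies between 1 and 4224

# Given a character vector of all 4224 student names, names[ids] would
# then select the sampled students.
```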
   It is certainly possible that a random sample is unrepresentative in some significant
way. Since all possible samples are equally likely to be chosen, by definition it is possible
that we choose a bad sample. For example, a random sample of Calvin students might
fail to have any seniors in it. However, the fact that a sample is chosen by simple
random sampling enables us to make quantitative statements about the likelihood of
certain kinds of nonrepresentativeness. This in turn will enable us to make inferences
about the population and to make statements about how likely it is that our inferences
are accurate. In Chapter 5 we will see how to place some bounds on the error that
using a random sample might produce.

Definition 2.2.2 (sampling error). The sampling error of an estimate of a population
parameter is the error that results from using a sample rather than the whole population
to estimate the parameter.

  Of course we cannot know the sampling error exactly (this is equivalent to knowing
the population parameter). But we will be able to place some bounds on it. High
quality public opinion polls are usually published with some information about the
sampling error. For example, typical political polls are expressed this way:
       Mitt Romney is favored by 43% of the Iowa voters (with a margin of error
       of ±3%).
While we will learn to carefully interpret this statement in Section 4.3, it means roughly
that we can be reasonably sure that 40%–46% of the population of Iowa voters favors
Romney if the only errors made in this process are those introduced by using a sample
rather than the whole population. (Though this survey was reported the day before the
Iowa caucuses, Romney actually only received 25.2% of the votes in those caucuses.)
  To see how sampling error might work we return to the data on US counties.

      Example 2.2.3. Recall that the dataset http://www.calvin.edu/~stob/data/
      uscounties.csv contains data on the 3,141 county equivalents in the United
      States. Suppose that we take a random sample of size 10 counties from this pop-
      ulation. How representative is it? For example, can we make inferences about the
      mean population per county from a sample of size 10? (Of course in this instance,
      we know the actual mean population per county – 89,596 – so we do not need a
      sample to estimate it!) There are too many possible samples of size 10 to investigate
      them all, but we can get an idea of what might happen by taking many different
      samples. In the following example, we collect 10,000 different random samples of
      size 10. Notice that one of these samples had a mean population of as small as
      8,392 and another larger than 1.1 million. Half of the samples had means between
      38,219 and 107,426. It looks like using a sample of size 10 would more often than
      not produce a sample with mean considerably less than the population mean. This
      is to be expected since the distribution of populations by county is highly skewed.
      Notice also from the example that samples of size 30 produce a narrower range of
      estimates than samples of size 10. That’s of course not surprising. The distribu-
      tion of all of the 10,000 samples of size 10 and of size 30 are in the histograms of
      Figure 2.1.

                                                                                                                         2.2. Simple Random Samples

> counties = read.csv("http://www.calvin.edu/~stob/data/uscounties.csv")
> mean(counties$Population)
[1] 89596.28
> fivenum(counties$Population)
[1]      67   11206   24595   61758 9519338
> samples = replicate(10000, mean(sample(counties$Population, 10, replace=F)))
> fivenum(samples)
[1]    8391.70   38219.15   62015.35 107425.60 1122651.50
> samples30 = replicate(10000, mean(sample(counties$Population, 30, replace=F)))
> fivenum(samples30)
[1] 18066.50 56462.10 78047.07 107471.27 592331.20




Figure 2.1.: Sample means of 10,000 samples of size 10 (left) and 30 (right) of U.S.
counties.

   Of course the description of simple random sampling above is an idealized picture of
what happens in the real world. We are assuming that we can produce a dependable
list of the entire population, that we can have access to any subset of a particular size
from that population, and that we get perfect information about the sample that we
choose. The Current Population Survey Technical Manual spends considerable effort
identifying and attempting to measure non-sampling error. It lists several basic kinds
of such errors.

  1. Inability to obtain information about all sample cases (unit non-response).
  2. Definitional difficulties.
  3. Differences in the interpretation of questions.
  4. Respondent inability or unwillingness to provide correct information.
  5. Respondent inability to recall information.
  6. Errors made in data collection, such as recording and coding data.
  7. Errors made in processing the data.
  8. Errors made in estimating values for missing data.
  9. Failure to represent all units with the sample (i.e., under-coverage).
  Most surveys of real populations (of people) fall prey to some or all of these problems.

      Example 2.2.4. The US National Immunization Survey attempts to determine
      how many young children receive the common vaccines against childhood illnesses.
      For example, in 2006, this survey estimated that 92.9% of children ages 19–35
      months at the time of the survey had received at least three doses of one of the
      polio vaccines. The sampling error reported for this estimate is 0.6%. The survey
      itself is a telephone survey of households and covers at least 30,000 children. One
      issue with a telephone survey is that not all children of the appropriate age live in
      a household with a telephone. Also, it is extremely difficult to choose telephone
      numbers at random.

  Though we would like a list of the entire population from which to choose our sample,
as in the previous example we often must choose our sample from another list that does
not “cover” the population. The sampling frame is the list of individuals from which
we actually choose our sample. The quality of the sampling frame is one of the most
important features in ensuring a representative sample. Political pollsters, for example,
would like a list of all and only those persons who will actually vote in the election.
Usual sampling frames will omit some of these voters but will also include many persons
who will not vote.

      Example 2.2.5. In 2004 during Quest, all incoming Calvin students were given a
      survey, the CIRP Freshman Survey. In other words, the “sample” was actually the
      whole first year class. However only 43% of the first-year students actually filled
      out the survey and returned it. Of those students who returned it, in the Spring
      of 2007 (when they were Juniors), their GPA was substantially higher on average
      than those students who had not returned the survey. So the sample of students
      studied in this survey was not representative of the first-year students of 2004 in
      at least one important way.

  The response rate in the National Immunization Survey is about 75%. Consider-
able effort is expended in determining in what ways non-responders might differ from
responders.

2.3. Other Sampling Plans
The concept of random sampling can be extended to produce samples other than simple
random samples. There are a number of reasons that we might want to choose a sample
that is not a simple random sample. One important reason is to reduce sampling error.


                          Class Level   Population   Sample
                          First-year         1,129       27
                          Sophomore          1,008       24
                          Junior               897       21
                          Senior             1,041       24
                          Other                149        4
                          Total              4,224      100

     Table 2.1.: Population of Calvin Students and Proportionate Sample Sizes

Consider the situation in which the population in question has several subpopulations
that differ substantially on the variables in question. For example, suppose that we
wish to survey Calvin College students to determine whether they favor abolishing
the Interim. It seems likely that the seniors (who have taken three or four Interims)
might in general have a higher opinion of the Interim than first-year students, who have
only taken DCM. Then a simple random sample in which first-year students happen to
be overrepresented is likely to underestimate the percentage of students favoring the
Interim. A sample in which the classes are represented proportionally is an obvious
strategy for overcoming this bias.

   Example 2.3.1. Suppose that we wish to have a sample of Calvin students of size
   100 in which the classes are represented proportionally. We should then choose a
   sample according to the breakdowns in Table 2.1.

  Once we have defined the sizes of our subsamples, it seems wise to proceed to choose
simple random samples from each subpopulation.

Definition 2.3.2 (stratified random sample). A stratified random sample of size k
from a population is a sample that results from a procedure that chooses simple random
samples from each of a finite number of groups (strata) that partition the population.
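Such a sample is easy to construct in R. As a sketch (the data frame students and its columns below are hypothetical stand-ins for a real student list):

```r
# A made-up population of 200 students in four class levels.
students <- data.frame(
  id    = 1:200,
  class = rep(c("First-year", "Sophomore", "Junior", "Senior"), each = 50)
)

# Split the population into strata and take an SRS of 10 from each stratum.
strata <- split(students, students$class)
strat.sample <- do.call(rbind, lapply(strata, function(s)
  s[sample(nrow(s), 10, replace = FALSE), ]))

table(strat.sample$class)  # 10 students from each class level
```

For a proportionate sample, the per-stratum sample sizes would instead be chosen in proportion to the stratum sizes, as in Table 2.1.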

   In the example of sampling from the Calvin student body, we chose the random
sample so that the number of individuals in the sample from each stratum was propor-
tional to the size of the stratum. While this procedure has much to recommend it, it is
not necessary and sometimes not even desirable. For example, only 4 “other” students
appear in our sample of size 100 from the whole population. This is fine if we are only
interested in making inferences about the whole population, but often we would like to
say something about the subgroups as well. For example, we might want to know how
much Calvin students work in off-campus jobs but we might expect and would like to
discover differences among the class levels in this variable. For this purpose, we might
choose a sample of 20 students from each of the five strata. (Of course we would have
to be careful about how to combine our numbers when making inferences about the
whole population.) We would say about this sample that we have “oversampled” one
of the groups. In public opinion polls, it is often the case that small minority groups
are oversampled. The sample that results will still be called a random sample.

Definition 2.3.3 (random sample). A random sample of size k from a population is
a sample chosen by a procedure such that each element of the population has a fixed
probability of being chosen as part of the sample.

   While we need to give a definition of probability in order to make this definition
precise, it is clear from the above examples what we mean. This definition differs from
that of a simple random sample in two ways. First, it does not require that each
element have the same likelihood of being chosen. Second, it does not require that every
subset of size k have the same chance of being the sample chosen. It is obvious that
stratified random sampling is a form of random sampling according to this definition.
   Other forms of sampling meet the above definition of random sampling without
being simple random sampling. A sampling method that we might employ given a list
of Calvin students is to choose one of the first 422 students in the list and then choose
every 422nd student thereafter. Obviously some subsets can never occur as the sample
since two students whose names are next to each other in the list can never be in the
same sample. Such a sample might indeed be representative however.
   It is very important to note that we cannot guarantee by using random sampling of
whatever form that our sample is representative of the population along the dimension
we are studying. In fact with random sampling, it is guaranteed that it is possible that
we could select a really bad (unrepresentative) sample. What we hope to be able to do
(and we will later see how to do it) is to be able to quantify our uncertainty about the
representativeness of the sample.

      Example 2.3.4. Another kind of modification to random sampling is used in the
      Current Population Survey. This survey of 60,000 households in the United States
      is conducted by individuals who live and work near enough to the sample subjects
      so that they can conduct the survey in person. It is easy to imagine that 60,000
      households chosen totally at random might be inconveniently distributed geograph-
      ically. The CPS works as follows. First, the country is divided into about 800 primary
      sampling units, PSUs, which must be, geographically, not too large. For example,
      large cities (actually, Metropolitan Statistical Areas) are each PSUs. Other PSUs
      are whole counties or pairs of contiguous counties. The PSUs are grouped into
      strata, and then one PSU per stratum is chosen at random (with a probability pro-
      portional to its population). The next stage of the sampling procedure is to choose
      at random certain housing clusters. A housing cluster is a group of four housing
      units in a PSU. The idea behind sampling housing clusters rather than individual


    houses is to cut down on interviewer travel time. A larger sample is generated for
    the same cost. Of course the penalty for using clusters is that clusters tend to have
    less variability than the whole PSU in which the cluster lies and so the group of
    individuals in the cluster will probably not be as representative of the PSU as a
    sample of similar size chosen from the PSU at random.

  The CPS illustrates two enhancements to simple random sampling: it is multistage
(but with random sampling at each stage) and it produces a cluster sample, a sample
in which the ultimate sampling units are not the individuals desired but clusters of
individuals. We will not undertake a formal study of all the variants of sampling
methods and their resultant sampling errors, but it is good to keep in mind that
most large scale surveys are not simple random samples but some modification thereof.
Nevertheless, they all rely on the basic principle that randomness is our best hope for
producing representative samples.

2.4. Exercises

2.1 In the parts below, we list some convenience samples of Calvin students. For each
of these methods for sampling Calvin students, indicate in what ways that the sample
is likely not to be representative of the population of all Calvin students.
  a) The students in Mathematics 243A.
  b) The students in Nursing 329.
  c) The first 30 students who walk into the FAC west door after 12:30 PM today.
  d) The first 30 students you meet on the sidewalk outside Hiemenga after 12:30 PM
     today.
  e) The first 30 students named in the “Names and Faces” picture directory.
  f ) The men’s basketball team.

2.2 Suppose that we were attempting to estimate the average height of a Calvin stu-
dent. For this purpose, which of the convenience samples in the previous problem
would you suppose to be most representative of the Calvin population? Which would
you suppose to be least representative?

2.3 Consider the set of natural numbers P = {1, 2, . . . , 30} to be a population.
  a) How many prime numbers are there in the population?
  b) If a sample of size 10 is representative of the population, how many prime numbers
     would we expect to be in the sample? How many even numbers would we expect
     to be in the sample?


  c) Using R choose 5 different samples of size 10 from the population P . Record how
     many prime numbers and how many even numbers are in each sample. Make any
     comments about the results that strike you as relevant.

2.4 Before easy access to computers, random samples were often chosen by using tables
of random digits. The tables looked something like this table, which was constructed
in R.

 [1]   40139   61007   60277   41219   45533   68878   48506   11950   07747   69280
[11]   82348   44867   12854   03179   21145   91154   84831   78503   00159   97920
[21]   09366   05554   86209   36252   33740   92037   21446   63192   87206   58877
[31]   00976   43068   88362   42080   54161   34593   18209   04344   52566   86976
[41]   83264   34861   60488   52180   03796   17289   39816   19080   64575   55492
[51]   54703   28006   03477   66384   55787   42212   55253   82256   61471   73665

   Each digit in this table is supposed to occur with equal likelihood as are all pairs,
triples, etc. Suppose that a population has 280 individuals numbered 1–280. Explain
whether each of the following methods of using the random number table is an appro-
priate method of producing a simple random sample of size 5.

  a) Divide the table into three digit groups. (i.e., 401, 396, 100, etc.). Choose the
     first five numbers between 1 and 280 and choose the corresponding individuals. If a
     number is repeated, do not use it again. (So in this case, the first individual in
     the sample is the individual numbered 100.)

  b) Proceed as in (a) but instead of throwing out the whole three digit number if the
     first digit is 3 or larger, throw out only the first digit and use the next three. (On
     this method, the first element of the sample is the one numbered 013.)

  c) As in (a), use a 3 digit group. However divide the three digits by 3 and throw away
     the remainder. If the result is 1–280, use that individual as the next individual in
     the sample. (On this method, since 401/3=133.7, the first element of the sample
     is 133.)

2.5 In a very small class, the final exam scores of the six students were 139, 145, 152,
169, 171, and 189.

  a) How many different simple random samples of size 3 of students in this class are
     there?

  b) What is the “population” mean of exam scores?

  c) Suppose that we use the mean of exam scores of a SRS of size 3 to estimate the
     population mean. What is the greatest possible error that we could make?


2.6 Donald Knuth, the famous computer scientist, wrote a book entitled “3:16”. This
book was a Bible study book that studied the 16th verse of the 3rd chapter of each
book of the Bible (that had a 3:16). Knuth’s thesis was that a Bible study of random
verses of the Bible might be edifying. The sample was of course not a random sample
of Bible verses and Knuth had ulterior motives in choosing 3:16. Describe a method for
choosing a random sample of 60 verses from the Bible. Construct a method that is more
complicated than simple random sampling that seeks to get a sample representative of
all parts of the Bible.

2.7 Suppose that we wish to survey the Calvin student body to see whether the student
body favors abolishing the Interim (we could only hope!). Suppose that instead of a
simple random sample, we select a random sample of size 20 from each of the five
groups of Table 2.1. Suppose that of 20 students in each group, 9 of the first-year
students, 10 of the sophomores, 13 of the juniors, 19 of the seniors and all 20 of the
other students favor abolishing the Interim. Produce an estimate of the proportion of
the whole student body that favors abolition by using these sample results. Be sure to
describe and justify the computation that uses these results.

2.8 There are 3,141 county equivalents in the county dataset (http://www.calvin.
edu/~stob/data/uscounties.csv). Suppose that we wish to take a random sample
of 60 counties. What are two different variables that might be useful to create strata
for a stratified random sample?

2.9 Describe a method for choosing a random sample of 200 Calvin students using the
“Names and Faces” directory.

2.10 You would like to estimate the percentage of books in the library that have red
covers. Describe a method of choosing a random sample of books to help estimate this
parameter. Discuss any problems that you see with constructing such a sample.

3. Probability
3.1. Random Processes
Probability theory is the mathematical discipline concerned with modeling situations
in which the outcome is uncertain. For example, in choosing a simple random sample,
we do not know which individuals from the population we will actually
get in our sample. The basic notion is that of a probability.

Definition 3.1.1 (A probability). A probability is a number meant to measure the
likelihood of the occurrence of some uncertain event (in the future).

Definition 3.1.2 (probability). Probability (or the theory of probability) is the math-
ematical discipline that
  1. constructs mathematical models for “real-world” situations that enable the com-
     putation of probabilities (“applied” probability)

  2. develops the theoretical structure that undergirds these models (“theoretical” or
     “pure” probability).

  The setting in which we make probability computations is that of a random process.
(What we call a random process is usually called a random experiment in the literature
but we use process here so as not to get the concept confused with that of randomized
experiment, a concept that we introduce later.)

  Characteristics of a Random Process:
    1. A random process is something that is to happen in the future (not in the
       past). We can only make probability statements about things that have not
       yet happened.

     2. The outcome of the process could be any one of a number of outcomes and
        which outcome will obtain is uncertain.

     3. The process could be repeated indefinitely (under essentially the same cir-
        cumstances), at least in theory.


   Historically, some of the basic random processes that were used to develop the theory
of probability were those originating in games of chance. Tossing a coin or dealing a
poker hand from a well-shuffled deck are examples of such processes. One of the most
important random processes that we study is that of choosing a random sample from
a population. It is clear that this process has all three characteristics of a random
process.
   The first step in understanding a random process is to identify what might happen.

Definition 3.1.3 (sample space, event). Given a random process, the sample space
is the set (collection) of all possible outcomes of the process. An event of the random
process is any subset of the sample space.

  The next example lists several random processes, their sample spaces, and a typical
event for each.

      Example 3.1.4.
       1. A fair die is tossed. The sample space can be described as the set S =
          {1, 2, 3, 4, 5, 6}. A typical event might be E = {2, 4, 6}; i.e., the event that an
          even number is rolled.
       2. A card is chosen from a well-shuffled standard deck of playing cards. There
          are 52 outcomes in the sample space. A typical event might be “A heart is
          chosen” which is a subset consisting of 13 of the possible outcomes.
       3. Twenty-nine students are in a certain statistics class. It is decided to choose
          a simple random sample of 5 of the students. There are a boatload of possible
          outcomes. (It can be shown that there are 118,755 different samples of 5
          students out of 29.) One event of interest is the collection of all outcomes in
          which all 5 of the students are male. Suppose that 25 of the students in the
          class are male. Then it can be shown that 53,130 of the outcomes comprise
          this event.
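The counts quoted in part 3 can be checked with R. The function choose(n,k) counts the subsets of size k of a set of n elements (it is discussed further in Section 3.2) and gives both numbers directly:

```r
# Number of different simple random samples of 5 students from a class of 29
choose(29, 5)   # 118755
# Number of those samples in which all 5 students come from the 25 males
choose(25, 5)   # 53130
```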

We often have some choice as to what we call outcomes of a random process. For
example, in Example 3.1.4(3), we might consider two samples different outcomes if the
students in the sample are chosen in a different order, even if the same five students
appear in the samples. Or we might call such samples the same outcome. To some
extent, what we call an outcome depends on the way in which we are going to use the
results of the random process.
   Given a random process, our goal is to assign to each event E a number P(E)
(called the probability of E) such that P(E) measures in some way the likelihood
of E. In order to assign such numbers however, we need to understand what they are
intended to measure. Interpreting probability computations is fraught with all sorts of
philosophical issues, but it is not too great a simplification at this stage to distinguish
between two different interpretations of probability statements.

  The frequentist interpretation.
  The probability of an event E, P(E), is the limit of the relative frequency that E
  occurs in repeated trials of the process as the number of trials approaches infinity.

   In other words, if the event E occurs eₙ many times in the first n trials, then on the
frequentist interpretation, P(E) = lim eₙ/n as n → ∞.

  The subjectivist interpretation.
  The probability of an event E, P(E), is an expression of how confident the assignor
  is that the event will happen in the next trial of the process.

   The word “subjective” is usually used in science in a pejorative sense but that is
not the sense of the word here. Subjective here simply means that the assignor needs
to make a judgment and that this judgment may differ from assignor to assignor.
Nevertheless, this judgment might be based on considerable evidence and experience.
That is, it might be expert judgment.
   Mathematics cannot tell us which of these two interpretations is “true” or even which
is “better.” In some sense this is a discussion about how mathematics can be applied
to the real world and is a philosophical not a mathematical discussion. In this book (as
is customary for introductory texts) we will explain our probability statements using
frequentist language.
   Notice that the frequentist approach makes an important assumption about a ran-
dom process. Namely, it assumes that, given an event E, there will be a limiting
relative frequency of occurrence of E in repeated trials of the random process and that
this limiting relative frequency is always the same given any such infinite sequence
of repeated trials. This is not something that can be proved. Consider the simplest
kind of random process, one with two outcomes. The paradigmatic example of such a
process is coin tossing. The frequentist approach would say that in repeated tossing of
a coin, the fraction of tosses that have produced a head approaches some limit. The
next example simulates this situation.


      Example 3.1.5. Suppose that we toss a coin “fairly.” That is we toss a coin so
      that we expect that heads and tails are equally likely. Let E be the event that the
      coin turns up heads. It is reasonable to think that in large numbers of tosses, the
      fraction of heads approaches 1/2 so that P(E) = 1/2. (Indeed, there have been
      many famous coin-tossers throughout the years that have tried this experiment.)
      Rather than toss physical coins, we illustrate what happens when a coin is tossed
      1,000 times using R. In the R code below, we toss the coin 1,000 times and find that
      after 1,000 tosses, the relative frequency of heads is 0.499. Notice however that in
      the first 100 tosses or so, approximately 60% of the tosses were heads.
      > library(lattice)    # xyplot is provided by the lattice package
      > coins = sample(c('H','T'), 1000, replace=TRUE)
      > noheads = cumsum(coins=='H')
      > cumfrequency = noheads/(1:1000)
      > xyplot(cumfrequency~(1:1000), type="l")
      > cumfrequency[1000]
      [1] 0.499





                 [Figure: cumulative relative frequency of heads vs. toss number (1 to 1000)]

  Though in the above example, the simulated frequency of heads did indeed approach
1/2, there does not seem to be any reason why it wouldn’t be possible to toss 1,000
consecutive heads or, alternatively, to have the relative frequency of heads oscillate
wildly from very close to 0 to very close to 1. We will return to this issue when we
discuss the Law of Large Numbers in Section 5.2.
  We should note here that one fact that is clear from the frequentist interpretation of
probability is the following.

       For every event E, 0 ≤ P(E) ≤ 1.


  We have already said that the sample space is a set and an event is a subset of the
sample space. We will use the language of set theory extensively to talk about events.

Definition 3.1.6 (union, intersection, complement). Suppose that E and F are events
in some sample space S.

  1. The union of events E and F , denoted E ∪ F , is the set of outcomes that are in
     either E or F .

  2. The intersection of events E and F , denoted E ∩ F , is the set of outcomes that
     are in both E and F .

  3. The complement of an event E, denoted E′, is the set of outcomes that are in S
     but not in E.

   Example 3.1.7. Suppose that a random sample of 5 individuals is chosen from a
   statistics class of 20 students. Let E be the event that there are at least 3 males in
   the sample and let F be the event that all five individuals are sophomores. Then
   we have
              Event Description
              E ∪ F either at least three males or all sophomores (or both)
              E ∩ F all sophomores and at least three of them male
               E′        at most two males
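R provides the set operations of Definition 3.1.6 directly as union, intersect, and setdiff. Here is a small sketch using the die-rolling sample space of Example 3.1.4; the events E and F below are chosen just for illustration:

```r
S <- 1:6            # sample space for one roll of a die
E <- c(2, 4, 6)     # event: the roll is even
F <- c(4, 5, 6)     # event: the roll is at least 4
union(E, F)         # E union F: 2 4 6 5
intersect(E, F)     # E intersect F: 4 6
setdiff(S, E)       # complement of E in S: 1 3 5
```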

  So far we have considered random processes that have only finitely many different
possible outcomes. Some random processes have infinitely many different outcomes
however. Here are two typical examples.

   Example 3.1.8. A six-sided die is tossed until all six different faces have appeared
   on top at least once. The possible outcomes form an infinite collection since we
   could toss arbitrarily many times before seeing the number 1.

   Example 3.1.9. Kellogg’s packages Raisin Bran in 11 ounce boxes. We might view
   the weight of any particular box as the result of a random process. It is difficult
   to describe exactly what outcomes are possible (is a 22 ounce box of Raisin Bran
   possible?), but it certainly seems like at least all real numbers between 10.9 and
   11.1 ounces are possible. This is already an infinite set of outcomes. An important
   event is that the weight of the box is at least 11 ounces.


3.2. Assigning Probabilities I – Equally Likely Outcomes
How shall we assign a probability P(E) to an event E? On the frequentist interpreta-
tion, we need to examine what happens if we repeat the experiment indefinitely. This
of course is not usually feasible. In fact, we often want to make probability statements
about a process that we will perform only once. For example, we would like to make
probability statements about what might happen in the Current Population Survey
but only one random sample is chosen. So what we need to do is make some sort of
model of the process and argue that the model allows us to draw conclusions about
what might happen if we repeat the experiment many times.
   For many random processes, we can make a plausible argument that the possible
outcomes of the process are equally likely. That is, we can argue that each of the
outcomes will occur about as often as any other outcome in a long series of trials.
For example, when we toss a fair coin, we usually assume that in a large number of
trials we will have as many heads as tails. That is, we assume that heads and tails are
equally likely. That’s why coin tossing is often used as a means of choosing between
two alternatives. Similarly, given the symmetry of a six-sided die, the sides of a die
should be equally likely to occur when the die is rolled vigorously. In a more important
example, a procedure for random sampling is designed to ensure that all samples are
equally likely to occur. In this situation, it is straightforward to assign probabilities to
each event.

Definition 3.2.1 (probability in the equally likely case). Suppose that a sample space
S has n outcomes that are equally likely. Then the probability of each outcome is 1/n.
Also, the probability of an event E, P(E), is k/n where k is the number of outcomes in E.

   The following examples illustrate this definition. In each example, the key is to list
the outcomes of the process in such a way that it is apparent that they are equally likely.

      Example 3.2.2. A six-sided die is rolled. Then one of six possible outcomes occurs.
      From the symmetry of the die it is reasonable to assume that the six outcomes are
      equally likely. Therefore, the probability of each outcome is 1/6. If E is the event
      that is described by “the die comes up 1 or 2” then P(E) = 2/6 = 1/3 since the
      event E contains two of the outcomes. This probability assignment means that in
      a large number of tosses of the die, approximately one-third of them will be 1s or 2s.

      Example 3.2.3. Suppose that four coins are tossed. What is the probability that
      exactly three heads occur? It is tempting to list the outcomes in this particular
      experiment as the set S = {0, 1, 2, 3, 4} since all that we are interested in is the
    number of heads that occurs. However, it would be difficult to make an argument
    that these outcomes are equally likely. The key is to note that there are really
    sixteen possible outcomes if we distinguish the four coins carefully. To see this,
    label the four coins (say, penny, nickel, dime, and quarter) and list the possible
    outcomes as a four-tuple in that order (PNDQ):
                             HHHH HHHT HHTH HTHH
                             THHH HHTT HTHT THHT
                             HTTH THTH TTHH HTTT
                             THTT TTHT TTTH TTTT
       Exactly 4 of these outcomes have three heads so that P(three heads) = 4/16 =
    1/4. In fact, the following table gives the complete probability distribution of the
    number of heads:
                          no. of heads     0      1      2      3      4
                          probability     1/16   4/16   6/16   4/16   1/16
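The sixteen outcomes and the distribution table can be reproduced in R by enumerating all four-tuples with expand.grid:

```r
# All 2^4 = 16 equally likely outcomes of tossing four labeled coins
outcomes <- expand.grid(rep(list(c("H", "T")), 4))
nrow(outcomes)                      # 16
heads <- rowSums(outcomes == "H")   # number of heads in each outcome
table(heads) / nrow(outcomes)       # distribution of the number of heads
```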

     Example 3.2.4. In many games (e.g., Monopoly) two dice are thrown and the sum
     of the two numbers that occur is used to initiate some action. Rather than use
    the 11 possible sums as outcomes, it is easy to see that there are 36 equally likely
    outcomes (list the pairs (i, j) of numbers where i is the number on the first die, j
    is the number on the second die and i and j range from 1 to 6). One event related
    to this process is the event E that the throw results in a sum of 7 on the two dice.
    It is easy to see that there are 6 outcomes in E so that P(E) = 6/36 = 1/6.
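The 36 pairs in Example 3.2.4 can likewise be listed with expand.grid, and the event E counted without writing the pairs out by hand:

```r
# All 36 equally likely outcomes (i, j) of throwing two dice
rolls <- expand.grid(i = 1:6, j = 1:6)
sums <- rolls$i + rolls$j
sum(sums == 7)    # 6 outcomes give a sum of 7
mean(sums == 7)   # P(E) = 6/36, about 0.167
```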
   For simple random processes with a small number of equally likely outcomes, it is
easy to compute probabilities using Definition 3.2.1. But when the number of outcomes
is so large that it is impractical to list them all, it becomes more difficult. In such a
case, we need to be able to count the number of outcomes without listing them. For
example, in choosing a random sample of 10 students from a large class, the number
of different possible samples is very large and would be impractical to enumerate.
   The mathematical discipline of counting is known as combinatorics. In this text,
we will not spend a great deal of time counting outcomes in complicated cases but
rather leave such computations to R. However a few of the more important principles
of counting will be quite useful to us.

The Multiplication Principle
It is no accident that in rolling 2 dice there are 6² = 36 possible outcomes and that in
flipping 4 coins there are 2⁴ = 16 possible outcomes. These are special cases of what
we will call the multiplication principle.


Definition 3.2.5 (cartesian product). If A and B are sets then the Cartesian product
of A and B, A × B, is the set of ordered pairs of elements of A and B. That is

                           A × B = {(a, b) | a ∈ A and b ∈ B} .

  The Multiplication Principle is then given by the following lemma.

Lemma 3.2.6. If A has n elements and B has m elements then A×B has mn elements.

  It is easy to prove this lemma (and to remember the multiplication principle) by a
diagram. Let a₁, . . . , aₙ be the elements of A and b₁, . . . , bₘ be the elements of B. Then
the elements of A × B are listed in the following two-dimensional array that has n rows
and m columns, hence nm entries.

                             (a₁, b₁)   (a₁, b₂)   . . .   (a₁, bₘ)
                             (a₂, b₁)   (a₂, b₂)   . . .   (a₂, bₘ)
                                 ⋮          ⋮                  ⋮
                             (aₙ, b₁)   (aₙ, b₂)   . . .   (aₙ, bₘ)

  It is easy to see that counting the outcomes in the experiment of tossing two dice is
equivalent to counting D × D where D = {1, 2, 3, 4, 5, 6}. The two sets A and B do not
have to be the same however.

      Example 3.2.7. A class has 20 students, 12 male and 8 female. A male and a
      female are chosen at random from the class. How many possible outcomes of this
      process are there? It is easy to see that we are simply counting A × B where A, the
      set of males, has 12 elements and B, the set of females, has 8 elements. Therefore
      there are 12 · 8 = 96 outcomes.
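The product set in Example 3.2.7 can be built explicitly in R; the student labels below are invented for illustration:

```r
males   <- paste0("M", 1:12)   # hypothetical labels for the 12 males
females <- paste0("F", 1:8)    # hypothetical labels for the 8 females
pairs   <- expand.grid(male = males, female = females)   # the Cartesian product A x B
nrow(pairs)                    # 12 * 8 = 96 possible outcomes
```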

   The multiplication principle can be profitably generalized in two ways. First, we can
extend the principle to the case of more than two sets. It is easy to see that if sets A,
B, and C have n, m, p elements respectively, there are nmp triples of elements, one
from each of A, B, and C. This is because the set A × B × C can be thought of as
(A × B) × C. So for example, there are 6³ = 216 different outcomes of the process of
tossing three fair dice.
   A second way to generalize this principle is illustrated in the following example.

      Example 3.2.8. In a certain card game, a player is dealt two cards. What is the
      probability that the player is dealt a pair? (A pair is two cards of the same rank.
      A deck of playing cards has 4 cards of each of thirteen ranks.) We first need to
      identify the equally likely outcomes and count them. Consider the cards being dealt
      in succession. There are 52 choices for the first card that the player receives. For
      each of these 52 cards there are 51 possible choices for the second card that the player
    receives. Thus there are (52)(51) = 2, 652 possible equally likely outcomes. To
    see that this is really an application of the multiplication principle above, we could
    view it as counting a set that has the same size as A×B where A = {1, . . . , 52} and
    B = {1, . . . , 51} or we could directly list the possible outcomes in a table as we did
     in the proof of the multiplication principle. To compute the probability that a
    pair is dealt, we need to also count the number of outcomes that are a pair. This is
    (52)(3) = 156 since the first card can be any card but the second card needs to be
    one of the three cards remaining that has the same rank as the first card. Thus the
    probability in question is 156/2652 = .059. Notice that in this example, we have
    treated the two cards of a given hand as being ordered by taking into account the
    order in which they are dealt. Of course it does not usually matter in a card game
    the order in which the cards of a given hand are dealt. We will later show how to
    compute the number of different unordered hands.
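The counts in Example 3.2.8 can be verified by enumerating every ordered two-card deal in R; the rank/suit representation below is a sketch, not notation from the text:

```r
# Represent the 52 cards by their (rank, suit) combinations
deck  <- expand.grid(rank = 1:13, suit = 1:4)
# All ordered deals (i, j) of two distinct cards
deals <- expand.grid(i = 1:52, j = 1:52)
deals <- deals[deals$i != deals$j, ]
nrow(deals)                                       # 52 * 51 = 2652
pair <- deck$rank[deals$i] == deck$rank[deals$j]  # same rank on both cards?
sum(pair)                                         # 52 * 3 = 156
mean(pair)                                        # 156/2652, about 0.059
```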

  Generalizing this example, we have the following principle. If two choices must be
made, and there are n possibilities for the first choice and, for any first choice, m
possibilities for the second choice, then there are nm ways to make the two choices
in succession.

Counting Subsets
Many of our counting problems can be reduced to counting the number of subsets of a
set that are of a given size.

    Example 3.2.9. Suppose that a set A has 10 elements. How many different three
    element subsets of A are there? To answer this question, we first count the number
    of ordered three element subsets of A using the multiplication principle. It is
    easy to see that there are 10 · 9 · 8 = 720 of these. However, since this counts
    the number of ordered subsets it counts each different (unordered) subset several
    times. In fact each three element subset is counted 3 × 2 × 1 = 6 times using
    the same multiplication principle. (There are 3 choices for the first element, 2
    for the second, and 1 for the third.) Thus there must be 720/6 = 120 different
    three-element subsets of A.
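Both counts in Example 3.2.9 are easy to check in R; the function combn actually lists the subsets:

```r
10 * 9 * 8           # 720 ordered three-element selections
choose(10, 3)        # 120 (unordered) three-element subsets
ncol(combn(10, 3))   # combn lists the subsets as matrix columns: also 120
```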

  Generalizing the example, we have

Theorem 3.2.10. Suppose that A has n elements. There are (n choose k) many subsets of A
of size k, where

                              (n choose k) = n! / (k! (n − k)!) .


Proof. We first count the number of k-element ordered subsets of A. By the multipli-
cation principle this is

                        n(n − 1)(n − 2) · · · (n − k + 1) = n!/(n − k)!

This follows from the multiplication principle since there are n choices for the first
element of the subset, n−1 choices for the second element, and so forth down to n−k+1
choices for the k th element. Now for any subset of size k, there are k(k − 1) · · · 1 = k!
many different orderings of the elements of that subset. Thus each subset is counted
k! many times in our count of the ordered subsets. So there are actually only

                          (n!/(n − k)!) / k! = n!/(k!(n − k)!) = (n choose k)

many subsets of size k of A.

  The number (n choose k) is obviously an important one and it can be computed using R.
The R function choose(n,k) computes (n choose k).

      Example 3.2.11. A random sample of 5 students is chosen from a class of 20
      students, 12 of whom are female. What is the probability that the sample consists
      of 5 females? We first need to count the number of equally likely outcomes. Since
      there are 20 students and an outcome is a subset of size 5 of those 20, the number
      of different random samples that we could have chosen is (20 choose 5) = 15,504. Since the
      event that we are interested in is the collection of samples that have five females,
      we need to count how many of these 15,504 outcomes contain five females. But
      that is simply (12 choose 5) = 792 since each sample of five females is a subset of the 12
      females in the class. So the probability in question is 792/15504 = 0.051.
      > choose(20,5)
      [1] 15504
      > choose(12,5)
      [1] 792
      > 792/15504
      [1] 0.05108359

3.3. Probability Axioms
In the last section, we considered one way of assigning probabilities to events. But we
can’t always identify equally likely outcomes.


    Example 3.3.1. A basketball player is going to shoot two free throws. What is the
    probability that she makes both of them? It is easy to write the possible outcomes.
    Using X for a made free throw and O for a miss, the four outcomes are XX, XO,
    OX, and OO. In this respect, the process looks just like that of tossing a coin twice
    in succession. But we have no reason to think that these four outcomes are equally
     likely. In fact, it is almost always the case that the shooter is more likely to make a
    free throw than miss it so that it is probably the case that XX is more likely to
    occur than OO.

   As we have said before, mathematics cannot tell us how to assign probabilities in
situations such as Example 3.3.1. However not just any assignment of probabilities
makes sense. For example, we cannot assign a probability of 1/2 to each of the four
outcomes. It is not reasonable to think that the limiting relative frequency of all four
outcomes will be 1/2 if the experiment is repeated many times. In fact it seems clear
that we should be looking for four numbers that sum to 1. In 1933, Andrei Kolmogorov
published the first rigorous treatment of probability in which he gave axioms for a
probability assignment in the same way that Euclid gave axioms for geometry.

Axiom 1. For all events E, P(E) ≥ 0.

Axiom 2. P(S) = 1.

Axiom 3. If E and F are disjoint events (i.e., have no outcomes in common) then

                                 P(E ∪ F ) = P(E) + P(F )

More generally, if E1 , E2 , . . . is a sequence of pairwise disjoint events, then

                        P(E1 ∪ E2 ∪ · · · ) = P(E1 ) + P(E2 ) + · · · .

   Axioms in mathematics are supposed to be propositions that are “intuitively obvious”
and that we agree to accept as true without proof. Each of the three Kolmogorov
axioms can easily be interpreted as a statement about limiting relative frequency that
is obviously true. For example, the second axiom is obviously true because by our
definition of a random process, one of the outcomes in the sample space must occur.
   Notice that the method of equally likely outcomes can be seen to rely heavily on
Axiom 2 and Axiom 3. While the axioms do not directly help us assign probabilities in
a case like Example 3.3.1, they do constrain our assignments. Also, they are useful in
helping to compute some probabilities in terms of others. Namely, we can prove some
theorems using these axioms.


Proposition 3.3.2. For every event E, P(E′) = 1 − P(E).

Proof. The events E and E′ are disjoint and E ∪ E′ = S. Thus

                        P(E) + P(E′) = P(E ∪ E′) = P(S) = 1 .

The first equality is Axiom 3 and the last is Axiom 2. The proposition follows immediately.

   A curious event is ∅. Since we assume that something happens each time the random
process is performed, it should be the case that P(∅) = 0. It is easy to see that this
follows from the proposition and Axiom 2 since ∅′ = S.

Proposition 3.3.3. For any events E and F , P(E ∪ F ) = P(E) + P(F ) − P(E ∩ F ).

Proof. We first use Axiom 3 and find that

        P(E) = P(E ∩ F ) + P(E ∩ F′)       and       P(F ) = P(F ∩ E) + P(F ∩ E′)

Next we use Axiom 3 again to see that

                     P(E ∪ F ) = P(E ∩ F′) + P(E ∩ F ) + P(E′ ∩ F )

Combining, we have that

           P(E ∪ F ) = (P(E) − P(E ∩ F )) + P(E ∩ F ) + (P(F ) − P(E ∩ F ))

which after simplifying gives the desired result.

   The propositions above help us simplify probability computations, even in the case
of equally likely outcomes.

      Example 3.3.4. From experience, an insurance company estimates that a cus-
      tomer that has both a homeowner’s policy and an auto policy has a probability
      of .83 of having no claim on either policy in a given year. These policy holders
      also have a probability of .15 of having an automobile claim and .05 of having a
      homeowner’s claim. What is the probability that such a policy holder has both a
      homeowner and automobile claim? If E is the event of a homeowner’s claim and F
      the event of an auto claim, then we have P(E ∪F ) = 1−.83 = .17. Also P(E) = .05
      and P(F ) = .15. Thus the event that we are looking for, E ∩ F , has probability
      P(E) + P(F ) − P(E ∪ F ) = .03.
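The computation in Example 3.3.4 is just Proposition 3.3.2 followed by Proposition 3.3.3; in R:

```r
p_none  <- 0.83        # P of no claim of either kind, i.e. the complement of E union F
p_union <- 1 - p_none  # P(E union F) = 0.17, by Proposition 3.3.2
p_E <- 0.05            # P(homeowner's claim)
p_F <- 0.15            # P(auto claim)
p_E + p_F - p_union    # P(E intersect F) = 0.03, by Proposition 3.3.3
```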


3.4. Empirical Probabilities
In Section 3.2, we saw how to assign probabilities consistent with the Kolmogorov
Axioms in the case that we could identify a priori equally likely outcomes. However in
many applications, the outcomes are not equally likely and there is usually no similar
theoretical principle that enables us to assign probabilities with confidence. In such
cases, we need some data from the real world to assign probabilities. While much of
Chapter 4 will be devoted to this problem, in this section we look at a very simple
method of assigning probabilities based on data.
  Since the probability of an event E is supposed to be the limiting relative frequency
of the occurrence of E as the number of trials increases indefinitely, a very simple
estimate of the probability of E is the relative frequency with which it has occurred in
the past.

    Example 3.4.1. What is the probability that number 20 of the Calvin Knights
    will make a free-throw when he has to shoot one in a game? As of the writing of
    this example, number 20 had attempted 25 free-throws and had made 22 of them.
     The relative frequency of a made free-throw is thus 22/25 = 88%, and we say that
     number 20 has an 88% probability of making a free-throw.

   There are all sorts of objections that might be raised to the computation in Exam-
ple 3.4.1. The first that comes to mind is that 25 is a relatively small number of trials
on which to base the argument. Another serious objection might be to the whole idea
that there is a fixed probability that number 20 makes a free-throw. Nevertheless, as
a model of what number 20 might do on his next and subsequent free-throws, this
number might have some value and allow us to make some useful predictions.
   We have seen of course that this method of assigning probabilities can lead us to
incorrect (and sometimes really bad) probability values. Even in 100 tosses of a coin,
it is quite possible that we would find 60 heads and so think that the probability of
a head was 0.6 rather than 1/2. (In Section 4.3 we will actually examine closely the
question of just how close to the “true” value we are likely to be given a certain number
n of coin tosses.) But in situations where we have a lot of past data and very little of
a theoretical model to help us compute otherwise, this might be a reasonable strategy.
This way of assigning probabilities is an important tool in the insurance industry.

    Example 3.4.2. Suppose that an insurance company wants to sell a 5-year term
    life insurance policy in the amount of $100,000 to a 55-year old male. Such a policy
    pays $100,000 to the beneficiary of the policy holder only if he dies within five years.
     Obviously, the insurance company would like to know the probability that the
    insured dies within five years. The key tool in computing such a probability is a
     mortality table such as the one in Figure 3.1. (The full table is available at
     http://www. ...) Using data from a variety
      of sources (including the US Census Bureau and the Center for Medicare and
      Medicaid), the Division of Vital Services makes a very accurate count of the number
      of people that die in the United States each year. For our problem, we note that
      the table indicates that of every 88,846 men alive at the age of 55, only 84,725 of
      them are alive at the age of 60. This means that our insurance company has a
      probability of (88846 − 84725)/88846 = 0.046 of paying out on this policy. If the
      company writes many such policies, it appears that it would average about $4,600
      per policy in payouts. This is the most important number in trying to decide how
      much the company should charge for such a policy.

Figure 3.1.: Portion of life table prepared by Division of Vital Services of U.S. Depart-
             ment of Health and Human Services
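The probability and expected payout in Example 3.4.2 come straight from the life-table counts:

```r
alive_55 <- 88846                          # men alive at age 55, from the life table
alive_60 <- 84725                          # of those, still alive at age 60
p_die <- (alive_55 - alive_60) / alive_55  # probability of dying within five years
round(p_die, 3)                            # 0.046
p_die * 100000                             # expected payout per policy, about $4,600
```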

  For the purpose of investigating how random processes work, it is very useful to use
R to perform simulations. We have already seen how to simulate a random process in
which the outcomes are equally likely. The next example simulates a process in which
the probabilities are determined empirically.

      Example 3.4.3. In the 2007 baseball season, Manny Ramirez came to the plate
      569 times. Of those 569 times, he had 89 singles, 33 doubles, 1 triple, 20 homeruns,
      78 walks (and hit by pitch), and 348 outs. We can use the frequency of these
      events to estimate the probabilities of each sort of event that might happen when
      Ramirez comes to the plate. For example, we might estimate that the probability
   Ramirez will hit a homerun in his next plate appearance to be 20/569 = .035. In
   the following R session we simulate one, and then five, of Manny Ramirez’s plate
   appearances:
    > outcomes=c(’Out’,’Single’,’Double’,’Triple’,’Homerun’,’Walk’)
    > ramirez=c(348,89,33,1,20,78)/569
    > sum(ramirez)
    [1] 1
    > ramirez
    [1] 0.611599297 0.156414763 0.057996485 0.001757469 0.035149385 0.137082601
    > sample(outcomes,1,prob=ramirez)
    [1] "Double"
    > sample(outcomes,5,prob=ramirez,replace=T)
    [1] "Out"    "Double" "Out"    "Out"    "Walk"
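For comparison only (the course itself uses R), the same empirical simulation can be sketched with Python's standard library; random.choices accepts the empirical frequencies as weights, playing the role of the prob argument to sample():

```python
import random

# 2007 outcome counts for Manny Ramirez's 569 plate appearances (from the example).
counts = {'Out': 348, 'Single': 89, 'Double': 33, 'Triple': 1,
          'Homerun': 20, 'Walk': 78}

outcomes = list(counts)
probs = [c / 569 for c in counts.values()]   # empirical probabilities

random.seed(1)   # fixed seed so the run is reproducible
# Analogue of sample(outcomes, 5, prob=ramirez, replace=T) in R.
five = random.choices(outcomes, weights=probs, k=5)
print(five)
```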

3.5. Independence
It is often the case that two events associated with a random process are related in
some way so that if we knew that one of them was going to happen we would change
our estimate of the likelihood that the other would happen. The following example
illustrates this.

   Example 3.5.1. At the end of each semester, students in many college courses are
   given the opportunity to rate the course. At Calvin, the first two questions that
   students are asked are:

    The course as a whole was: (Excellent, Very Good, Good, Fair, Poor, Very Poor)
    The course content was: (Excellent, Very Good, Good, Fair, Poor, Very Poor)
   Empirical evidence suggests that the probability that a student answers Excellent
   on the first question is 0.24 and the probability that a student answers Excellent to
   the second question is 0.22. (What is the random process here? Am I suggesting
   that students answer these questions at random?) Suppose that we happen to see
   that a student has answered Excellent to the first question. We would certainly not
   continue to suppose that the probability that this student has answered Excellent
   to the second question is just 0.22. We would guess that the student’s answers are
   not independent of one another. In fact, 75% of the students who answer Excellent
   to question 1 also answer Excellent to question 2.


Definition 3.5.2 (conditional probability). Given two events E and F such that
P(F) ≠ 0, the conditional probability of E given F, written P(E | F), is given by

                                  P(E | F) = P(E ∩ F) / P(F) .

   It is easiest to interpret the formula for P(E | F ) using the relative frequency in-
terpretation. The denominator in the fraction in the definition is the proportion of
times that the event F happens in a large number of trials of the random process.
The numerator, P(E ∩ F), is the proportion of times that both events happen. So the
fraction is the proportion of times E happens among those times that F happens, which
is precisely what we want conditional probability to measure.
   In the definition of conditional probability, it is best to think of F as being a fixed
event and that E is allowed to be any event in the sample space. Thus P(E | F ) is a
function of E. As a function of E, we can see that the new probabilities satisfy the
axioms of probability theory.

Proposition 3.5.3. Suppose that F is a fixed event of some process with sample space
S and such that P(F ) > 0. Then
  1. for every E, P(E | F ) ≥ 0,

  2. P(S | F ) = 1,

  3. for disjoint events E1 and E2 , P(E1 ∪ E2 | F ) = P(E1 | F ) + P(E2 | F ).

  In applications, it is often the case that we know P(F ) and P(E | F ). Using the
definition of conditional probability, we can then compute P(E ∩ F ) using

                           Multiplication Law of Probability

   If E and F are events with P(F) ≠ 0 then

                                 P(E ∩ F ) = P(F ) P(E|F ) .

      Example 3.5.4. Suppose that we choose two students from a class of 20 without
      replacement. If there are 12 female students in the class, the probability of the first
      chosen student being female is 12/20 = .6. Having chosen a female, the probability
      that the second chosen student is also female is 11/19 since there are 11 remaining
      females of the 19 remaining students. So the probability of choosing two females
      in succession is (12/20)(11/19) = .347.
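A simulation can corroborate Example 3.5.4. The following Python sketch (an illustration only, not part of the course's R materials) draws two students without replacement many times and compares the observed frequency with (12/20)(11/19):

```python
import random

random.seed(0)
students = ['F'] * 12 + ['M'] * 8      # class of 20 with 12 females

trials = 100_000
both_female = 0
for _ in range(trials):
    pair = random.sample(students, 2)  # two draws without replacement
    if pair == ['F', 'F']:
        both_female += 1

exact = (12 / 20) * (11 / 19)          # multiplication law: P(F1) P(F2 | F1)
estimate = both_female / trials
print(round(exact, 3), round(estimate, 3))   # exact value rounds to .347
```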


  We can extend the analysis in Example 3.5.4 to compute the probabilities of all
possible combinations of E and F occurring or not. It is useful to view this situation
as a tree.

                                           P(F | E)      F       E ∩ F
                         P(E)      E
                                           P(F̄ | E)      F̄       E ∩ F̄

                                           P(F | Ē)      F       Ē ∩ F
                         P(Ē)      Ē
                                           P(F̄ | Ē)      F̄       Ē ∩ F̄

   It is clear when one thinks about it that, in general, P(E | F) ≠ P(F | E). Indeed,
simply knowing P(E | F) does not necessarily give us any information about P(F | E).
As a simple example, note that the probability that a primary voter votes for Hillary
Clinton given that she votes in the Democratic primary is certainly not equal to the
probability that she votes in the Democratic primary given that she votes for Hillary
Clinton (the latter probability is 1!). In the next example, we look at an important sit-
uation in which we desire to know P(F | E) but we only know conditional probabilities
of the form P(E | F).

    Example 3.5.5. Most laboratory tests for diseases aren’t infallible. The important
    question from the point of view of the patient is what inference to make about the
    disease status given the outcome of the test. Namely, if the test is positive, how
    likely is it that the patient has the disease? The sensitivity of a test is the
    probability that it will give a positive result given that the patient has the disease.
    The specificity of a test is the probability that it will give a negative result given
    that the patient does not have the disease. A widely used rapid test for the HIV
    virus has sensitivity 99.9% and specificity 99.8%. Since the test appears to be very
    accurate and it is now quite inexpensive, one might suppose that doctors should
    give this test as a routine matter to allow for early detection of the virus. In this
    situation, we are interested in four possible events:
                         D+ the patient has the disease
                         D− the patient does not have the disease
                         T + the test is positive
                         T − the test is negative
     The sensitivity and specificity then give P(T + | D+ ) = .999 and P(T − | D− ) =
     .998. (Note that this means that P(T + | D− ) = 0.002 and P(T − | D+ ) = 0.001.)


      Suppose now that a patient tests positive. What is the probability that this patient
      has the disease? It is clear that this is the question of computing P(D+ | T + ). We
      have

                                P(D+ | T + ) = P(D+ ∩ T + ) / P(T + ) .
      Using the Multiplication Law, we have
                               P(D+ ∩ T + ) = P(T + | D+ ) P(D+ )
      and also we have
                             P(T + ) = P(T + ∩ D+ ) + P(T + ∩ D− ) .
      One more piece of information is needed to compute P(D+ | T + ) and that is P(D+ ),
      the prevalence of the disease in the tested population. Of course this depends
      on the population that is tested. It is estimated that about 0.01% of all persons
      in the U.S. have the disease. So if we adopt a policy of testing everyone without
      regard to other factors, we might estimate P(D+ ) = 0.0001. We can now compute
      P(D+ | T + ). The probability tree is as follows.

                                       0.999    T+      (0.0001)(0.999) = 9.99 × 10⁻⁵
                  0.0001     D+
                                       0.001    T−      (0.0001)(0.001) = 10⁻⁷

                                       0.002    T+      (0.9999)(0.002) ≈ 0.0020
                  0.9999     D−
                                       0.998    T−      (0.9999)(0.998) ≈ 0.9979

      Using the probabilities computed from the tree, we have

                       P(D+ | T + ) = P(D+ ∩ T + ) / P(T + )
                                    = P(T + | D+ ) P(D+ ) / (P(T + ∩ D+ ) + P(T + ∩ D− ))
                                    = 9.99 × 10⁻⁵ / (9.99 × 10⁻⁵ + 0.0020)
                                    = 0.047 .
      Thus, even though the test is very accurate, 95% of the time the positive result will
      be for someone who does not have the disease! This is one reason that universal
      testing for rare diseases often does not make economic sense.
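The tree computation can be reproduced numerically. This Python sketch (an illustrative aside; the sensitivity, specificity, and prevalence are the values quoted in the example) carries out the Bayes' Theorem calculation:

```python
# Quantities from the example.
sens = 0.999    # sensitivity:  P(T+ | D+)
spec = 0.998    # specificity:  P(T- | D-)
prev = 0.0001   # prevalence:   P(D+)

p_pos_and_dis = sens * prev              # P(T+ ∩ D+) = P(T+ | D+) P(D+)
p_pos_and_no  = (1 - spec) * (1 - prev)  # P(T+ ∩ D-) = P(T+ | D-) P(D-)
p_pos = p_pos_and_dis + p_pos_and_no     # P(T+), by total probability

p_dis_given_pos = p_pos_and_dis / p_pos  # Bayes' Theorem
print(round(p_dis_given_pos, 3))         # 0.048 (the text rounds down to 0.047)
```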

  This method of “reversing” the conditional probabilities is so important that it has
a name: Bayes’ Theorem.


If P(E | F ) = P(E), knowing that the event F occurs does not give us any more
information as to whether E will occur. Such events E and F are called independent.
The multiplication law simplifies in this case and leads to the following definition.

Definition 3.5.6 (independent). Events E and F are independent if

                               P(E ∩ F ) = P(E) P(F ) .

   Notice that we do not assume that P(F) ≠ 0 in this definition. It is easy to see that
if P(F) = 0, the equality in the definition is always true, so we would consider E
and F to be independent in this special case.

    Example 3.5.7. Suppose that a free-throw shooter makes 70% of her free-throws.
    What is the probability that she makes both of her free-throws when she is fouled
    in the act of shooting? It might be reasonable to suppose that the results of the
    two free-throws are independent of each other. Then the probability of making
    two successive free-throws is (0.7)(0.7) = 0.49. Similarly, the probability that she
    misses both free throws is only 9%.
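Under the independence assumption the probabilities simply multiply, and a simulation agrees. A quick Python check (illustrative only; the 70% figure is from the example):

```python
import random

random.seed(2)
p = 0.7            # probability of making a single free-throw
trials = 100_000

# Count trips to the line on which both independent free-throws are made.
both = sum(random.random() < p and random.random() < p for _ in range(trials))

exact = p * p      # independence: P(make both) = (0.7)(0.7)
print(round(exact, 2), round(both / trials, 2))   # exact value is 0.49
```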

3.6. Exercises

3.1 For each of the following random processes, write a complete list of all outcomes
in the sample space.
  a) A nickel and a dime are tossed and the resulting faces observed.
  b) Two different cards are drawn from a hat containing five cards numbered 1–5.
     (For some reason, lots of probability problems are about cards in hats.)
  c) A voter in the Michigan 2008 Primary elections is chosen at random and asked
     for whom she voted. (See problem A.2.)

3.2 Two six-sided dice are tossed.
  a) List all the outcomes in the sample space (you should find 36) using some appro-
     priate notation.
  b) Let F be the event that the sum of the dice is 7. List the elements of F .
  c) Let E be the event that the sum of the dice is odd. List the elements of the event E.


3.3 If a Calvin College student is chosen at random and his/her height is recorded,
what is a reasonable listing of the possible outcomes? Explain the choices that you
have to make in determining what the outcomes are.

3.4 Weathermen in Grand Rapids are fond of saying things like “The probability of
snow tomorrow is 70%.” What do you think this statement really means? Can you
give a frequentist interpretation of this statement? A subjectivist interpretation?

3.5 In Example 3.2.3 we considered the random experiment of tossing four coins. In
this problem, we consider the problem of tossing five coins.

  a) How many equally likely outcomes are there?

  b) For each x = 0, 1, 2, 3, 4, 5, compute the probability that exactly x heads
     occur in the toss of five coins.

3.6 A 20-sided die (with sides numbered 1–20) is used in some games. Obviously the
die is constructed in such a way that the sides are intended to be equally likely to occur
when the die is rolled. (The die is in fact an icosahedron.) Using R, simulate 1,000
rolls of such a die. How many of each number did you expect to see? Include a table
of the actual number of times each of the 20 numbers occurred. Is there anything that
surprises you in the result?

3.7 A poker hand consists of 5 cards. What is the probability of getting dealt a poker
hand of 5 hearts? (Remember that there are 13 hearts in the deck of 52 cards.)

3.8 In Example 3.2.11 we considered choosing a random sample of 5 students from a
class of 20 students of whom 12 were female.

  a) What is the probability that such a random sample will contain 5 males?

  b) What is the probability that such a random sample will contain 3 females and 2
     males?
3.9 Many games use spinners rather than dice to initiate action. A classic board game
published by Cadaco-Ellis is “All-American Baseball.” The game contains discs for
each of several baseball players. The disc for Nellie Fox (the great Chicago White Sox
second baseman) is pictured below.


The disc is placed over a peg with a spinner mounted in the center of the circle. The
spinner is spun and comes to rest pointing to one of the numbered areas. Each
number corresponds to the possible result of Nellie Fox batting. (For example, 1 is a
homerun and 14 is a flyout.)
  a) Why is it unreasonable to believe that all the numbered outcomes are equally
     likely?
  b) Explain how one could use the idea of equal likelihood to predict the probability
     that the spinner will land on the sector numbered 14 and then make an estimate
     of this probability.
(Spinners with regions of unequal size are used heavily in the K–8 textbook series
Everyday Mathematics to introduce probability to younger children.)

3.10 The traditional dartboard is pictured below.


A dart that sticks in the board is scored as follows. There are 20 numbered sectors
each of which has a small outer ring, a small inner ring, and two larger areas. A dart
landing in the larger areas scores the number of the sector, in the outer ring scores
double the number of the sector, and in the inner ring scores triple the number of the
sector. The two circles near the center score 25 points (the outer one) and 50 points
(the inner one). Unlike the last problem, it does not seem that an equal likelihood
model could be used to compute the probability of a “triple 20.” Explain why not.

3.11 Suppose that E and F are events and that P(E), P(F ), and P(E ∩ F ) are
given. Find formulas (in terms of these known probabilities) for the probabilities of
the following events:

  a) exactly one of E or F happens,

  b) neither E nor F happens,

  c) at least one of E or F happens,

  d) E happens but F does not.

3.12 Suppose that E, F , and G are events. Show that

P(E ∪ F ∪ G) = P(E) + P(F ) + P(G) − P(E ∩ F ) − P(E ∩ G) − P(F ∩ G) + P(E ∩ F ∩ G) .

3.13 Use the axioms to prove that for all events E and F , if E ⊆ F then P(E) ≤ P(F ).
3.14 Show that for all events E and F, P(E ∩ F ) ≤ min{P(E), P(F )}.
3.15 In 2006, there were 42,642 deaths in vehicular accidents in the United States.
17,602 of the victims had a positive blood alcohol content (BAC). In 15,121 of these,
the BAC of the victim was greater than 0.08 (which is the legal limit for DUI in
many states). What is a good estimate for the probability that a victim of a vehicular
accident had BAC exceeding 0.08? The statistics in this problem can be found at the
Fatality Analysis Reporting System,
aspx. (A probability that we would really like to know is the probability that a driver
with a BAC of greater than 0.08 becomes a fatality in an accident. Unfortunately,
that’s a much harder number to obtain.)

3.16 We have used tossing coins as our favorite example of a process with two equally
likely outcomes. Consider instead the process where the coin is stood on end on a hard
surface and spun.

  a) If a dime is used, do you think a head and a tail are equally likely to occur?

  b) Do the experiment 10 times and record the results.


  c) On the basis of your data, is it possible that heads and tails are equally likely?

  d) Using the data alone, estimate the probability that a spun dime comes up heads.

3.17 In Example 3.4.2, we determined that the probability that a 55 year old male
dies before his 60th birthday is 0.046.

  a) If the company sells this 5-year, $100,000 policy to 100 different men, on how many
     of these policies would you expect the company to have to pay the death benefit?

  b) Simulate this situation. Namely use this empirical probability to simulate the
     100 policies. How many policies did the company have to pay off on in your
     simulation? Are you surprised by this result?

3.18 Show that if E ⊆ F then P(F | E) = 1.
3.19 Construct an example to show that it is not necessarily true that P(E | F̄) =
1 − P(E | F ).

3.20 Show that if E and F are independent, then so are E and F̄.
3.21 Suppose that two different bags of blue and red marbles are presented to you and
you are told that one bag (bag A) has 75% blue marbles and the other bag (bag B) has
90% red marbles. Suppose that you choose a bag at random. Now suppose that you
choose a single marble from the bag at random and it is red. What is the probability
that you have in fact chosen bag A?
3.22 Over the course of a season, a certain basketball player shot two free-throws on
36 occasions. On 18 of those occasions, she made both of the free-throws and on 9 of
the occasions she missed both (and so on 9 occasions she made one and missed one).
Does this data appear to be consistent with the hypothesis that she has a constant
probability of making a free-throw and that the result of the second throw of a pair is
independent of the first?
3.23 In Example 3.5.5 we studied the effectiveness of universal HIV testing and deter-
mined that 95% of the time positive tests results are wrong even though the test itself
has a very high sensitivity. Now suppose that HIV testing is restricted to a high risk
population - one in which the prevalence of the disease is 25%. What is the probability
that a positive test result is wrong in this case?

4. Random Variables

4.1. Basic Concepts
If the outcomes of a random process are numbers, we will call the random process a
random variable. Since non-numerical outcomes can always be coded with numbers,
restricting our attention to random variables results in no loss of generality. We will
use upper-case letters to name random variables (X, Y , etc.) and the corresponding
lower-case letters (x, y, etc.) to denote the possible values of the random variable.
Then we can describe events by equalities and inequalities so that we can write such
things as P (X = 3), P (Y = y), and P (Z ≤ z). Some examples of random variables:

   1. Choose a random sample of size 12 from 250 boxes of Raisin Bran. Let X be the
      random variable that counts the number of underweight boxes and let Y be the
      random variable that is the average weight of the 12 boxes.

   2. Choose a Calvin senior at random. Let Z be the GPA of that student and let U
      be the composite ACT score of that student.

   3. Assign 12 chicks at random to two groups of six and feed each group a different
      feed. Let D be the difference in average weight between the two groups.

   4. Throw a fair die until all six numbers have appeared. Let T be the number of
      throws necessary.

  We will consider two types of random variables, discrete and continuous.

Definition 4.1.1 (discrete random variable). A random variable X is discrete if its
possible values can be listed x1 , x2 , x3 , . . . .

  In the example above, the random variables X, U , and T are discrete random vari-
ables. Note that the possible values for X are 0, 1, . . . , 12 but that T has infinitely many
possible values 1, 2, 3, . . . . The random variables Y , Z, and D above are not discrete.
The random variable Z (GPA) for example can take on all values between 0.00 and
4.00. (We should make the following caveat here however. All variables are discrete
in the sense that there are only finitely many different measurements available to us.
Each measurement device that we use has divisions only down to a certain tolerance.


Nevertheless it is usually more helpful to view these measurements as on a continuous
scale rather than a discrete one. We learned that in calculus.)
   The following definition is not quite right — it omits some technicalities. But it is
close enough for our purposes.

Definition 4.1.2 (continuous random variable). A random variable X is continuous
if its possible values are all x in some interval of real numbers.

  We will turn our attention first to discrete random variables.

4.2. Discrete Random Variables
If X is a discrete random variable, we will be able to compute the probability of any
event defined in terms of X if we know all the possible values of X and the probability
P (X = x) for each such value x.

Definition 4.2.1 (probability mass function). The probability mass function (pmf)
of a random variable X is the function f such that for all x, f (x) = P (X = x). We
will sometimes write fX to denote the probability mass function of X when we want
to make it clear which random variable is in question.

  The word mass is not arbitrary. It is convenient to think of probability as a unit
mass that is divided into point masses at each possible outcome. The mass of each
point is its probability. Note that mass obeys the Kolmogorov axioms.

      Example 4.2.2. Two dice are thrown and the sum X of the numbers appearing
      on their faces is recorded. X is a random variable with possible values 2, 3, . . . , 12.
      By using the method of equally likely outcomes, we can see that the pmf f of X is
      given by the following table:
        x         2      3      4      5      6      7       8      9     10     11      12
        f (x)   1/36   2/36   3/36   4/36   5/36   6/36    5/36   4/36   3/36   2/36    1/36
        We can now compute such probabilities as P (X ≤ 5) = 5/18 by adding the
      appropriate values of f .
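The table in this example can be generated by brute-force enumeration of the 36 equally likely outcomes. A Python sketch (illustrative only; the course itself uses R):

```python
from collections import Counter
from fractions import Fraction

# Tally the sum over all 36 equally likely outcomes of two dice.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

print(pmf[7])                            # 1/6  (i.e. 6/36)
print(sum(pmf[x] for x in range(2, 6)))  # 5/18, matching P(X <= 5)
```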

      Example 4.2.3. We can think of a categorical variable as a discrete random vari-
      able by coding. Suppose that a student is chosen at random from the Calvin student
      body. We will code the class of the student by 1, 2, 3, 4 for the four standard classes
      and 5 for other. The coded class is a random variable X. Referring to Table 2.1, we
      see that the probability mass function of X is given by f (1) = 0.27, f (2) = 0.24,
      f (3) = 0.21, f (4) = 0.25, f (5) = 0.03, and f (x) = 0 otherwise.




     [Bar chart: probability histogram with the classes 1–5 on the horizontal axis
      and Percent of Total on the vertical axis.]

     Figure 4.1.: The probability histogram for the Calvin class random variable.

   One useful way of picturing a probability mass function is by a probability his-
togram. For the mass function in Example 4.2.3, we have the corresponding histogram
in Figure 4.1.
   On the frequentist interpretation of probability, if we repeat the random process many
times, the histogram of the results of those trials should approximate the probability
histogram. The probability histogram is not a histogram of data from many trials
however. It is a representation of what might happen in the next trial. We will often
use this idea to work in reverse. In other words, given a histogram of data obtained
from successive trials of a random process, we will choose the pmf to fit the data. Of
course we will not ask for a perfect fit; instead we will choose the pmf f to fit the
data approximately while requiring f to have some simple form.
   Several families of discrete random variables are particularly important to us and
provide models for many real-world situations. We examine two such families here.
Each arises from a common kind of random process that will be important for statistical
inference. The second of these arises from the very important case of simple random
sampling from a population. We will first study a somewhat different case (which,
among other uses, can be used to study sampling with replacement).

4.2.1. The Binomial Distribution
A binomial process is a random process characterized by the following conditions:

  1. The process consists of a sequence of finitely many (n) trials of some simpler
     random process.

  2. Each trial results in one of two possible outcomes, usually called success (S) and
     failure (F ).


  3. The probability of success on each trial is a constant denoted by π.

  4. The trials are independent one from another.

  Thus a binomial process is characterized by two parameters, n and π. Given a
binomial process, the natural random variable to observe is the number of successes.

Definition 4.2.4 (binomial random variable). Given a binomial process, the binomial
random variable X associated with this process is the number of successes in the n
trials of the process. If X is a binomial random variable with
parameters n and π, we write X ∼ Binom(n, π).

  The symbol ∼ can be read as “has the distribution” or something to that effect.
The use of the word distribution here is not inconsistent with our earlier use. Here to
specify a distribution is to specify the possible values of the random variable and the
probability that the random variable attains any particular value.

      Example 4.2.5. The following are all natural examples of binomial random variables.
        1. A fair coin is tossed n = 10 times with the probability of a HEAD (success)
           being π = .5. X is the number of heads.
        2. A basketball player shoots n = 25 freethrows with the probability of making
           each freethrow being π = .70. Y is the number of made freethrows.
        3. A quality control inspector tests the next n = 12 widgets off the assembly line
           each of which has a probability of 0.10 of being defective. Z is the number of
           defective widgets.
        4. Ten Calvin students are randomly sampled with replacement. W is the num-
           ber of males in the sample.

  The fact that the trials are independent of one another makes it possible to easily
compute the pmf of any binomial random variable using the multiplication principle.
We first give a simple example.

      Example 4.2.6. An unaccountably popular dice game is known as Bunko. Three
      dice are rolled and the number of sixes rolled is the important value. Let X be
      the random variable that counts the number of sixes in three dice. Then X ∼
      Binom(3, 1/6). We can now compute the probability mass function (X can take
      on the values 0, 1, 2, 3). We simply need to keep track of all possible sequences of
      three successes and failures and find the probability of each such sequence.


    f (3) = P(X = 3) = (1/6)(1/6)(1/6) = 1/216
    f (2) = P(X = 2) = (1/6)(1/6)(5/6) + (1/6)(5/6)(1/6) + (5/6)(1/6)(1/6) = 15/216
    f (1) = P(X = 1) = (1/6)(5/6)(5/6) + (5/6)(1/6)(5/6) + (5/6)(5/6)(1/6) = 75/216
     f (0) = P(X = 0) = (5/6)(5/6)(5/6) = 125/216

      The computation for f (2) for example has three terms, one for each of SSF, SFS,
    FSS. The important probability fact for Bunko players is P(X ≥ 1) = 91/216.
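The Bunko probabilities can be verified with exact arithmetic. A Python sketch of the counting argument (illustrative only; it uses exact fractions so nothing is lost to rounding):

```python
from fractions import Fraction
from math import comb

# pmf of X ~ Binom(3, 1/6): comb(3, x) orderings with x sixes, each
# occurring with probability (1/6)^x (5/6)^(3 - x).
p = Fraction(1, 6)
pmf = [comb(3, x) * p**x * (1 - p)**(3 - x) for x in range(4)]

print(pmf[0], pmf[3])   # 125/216 1/216
print(1 - pmf[0])       # P(X >= 1) = 91/216
```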

  We can easily generalize the previous example to any n and π to get the following
theorem.

Theorem 4.2.7 (The Binomial Distribution). Suppose that X is a binomial random
variable with parameters n and π. The pmf of X is given by

     fX (x; n, π) = C(n, x) π^x (1 − π)^(n−x) ,      x = 0, 1, 2, . . . , n ,

where C(n, x) = n!/(x!(n − x)!) counts the ways to choose which x of the n trials
are successes.

Proof. Suppose that n and π are given and that 0 ≤ x ≤ n. Consider all sequences of
n trials that have exactly x successes and n − x failures. There are C(n, x) of these since
all we have to decide is how to “choose” the x places in the sequence for the successes.
Now consider any one such sequence, say the sequence S. . . SF. . . F, the sequence in
which x successes are followed by n − x failures. The probability of this sequence (and
of any sequence with x successes) is π^x (1 − π)^(n−x) by the multiplication principle,
relying on the independence of the trials. The result follows.

  Note the use of the semicolon in the definition of fX in the theorem. We will use a
semicolon to separate the possible values of the random variable (x) from the parame-
ters (n, π). For any particular binomial experiment, n and π are fixed. If n and π are
understood, we might write fX (x) for fX (x; n, π).
  For all but very small n, computing f by hand is tedious. We will use R to do
this. Besides computing the mass function, R can be used to compute the cumulative
distribution function FX which is the useful function defined in the next definition.

Definition 4.2.8 (cumulative distribution function). If X is any random variable, the
cumulative distribution function of X (cdf) is the function FX given by

                         FX (x) = P (X ≤ x) = Σ_{y ≤ x} fX (y)

(for a discrete random variable X; the sum runs over the possible values y of X with
y ≤ x).


   We will usually use the convention that the pmf of X is named by a lower-case letter
(usually fX ) and the cdf by the corresponding upper-case letter (usually FX ). The
R functions to compute the cdf and pmf, and also to simulate binomial processes, are as
follows if X ∼ Binom(n, π).

        function (& parameters)        explanation
        rbinom(n,size,prob)            makes n random draws of the random vari-
                                       able X and returns them in a vector.
        dbinom(x,size,prob)            returns P(X = x) (the pmf).
        pbinom(q,size,prob)            returns P(X ≤ q) (the cdf).

      Example 4.2.9. Suppose that a manufacturing process produces defective parts
      with probability π = .1. If we take a random sample of size 10 and count the number
      of defectives X, we might assume that X ∼ Binom(10, 0.1). Some examples of R
      related to this situation are as follows.
       > defectives=rbinom(n=30, size=10,prob=0.1)
       > defectives
        [1] 2 0 2 0 0 0 0 2 0 1 1 1 0 0 2 2 3 1 1 2 1 1 0 2 0 1 1 0 1 1
       > table(defectives)
        0 1 2 3
       11 11 7 1
       > dbinom(c(0:4),size=10,prob=0.1)
       [1] 0.34867844 0.38742049 0.19371024 0.05739563 0.01116026
       > dbinom(c(0:4),size=10,prob=0.1)*30                      # pretty close to table
       [1] 10.4603532 11.6226147 5.8113073 1.7218688 0.3348078
       > pbinom(c(0:5),size=10,prob=0.1)                         # same as cumsum(dbinom(...))
       [1] 0.3486784 0.7360989 0.9298092 0.9872048 0.9983651 0.9998531

It is important to note that

      • R uses size for the number of trials (what we have called n) and n for the number
        of random draws.

      • pbinom() gives the cdf not the pmf. Reasons for this naming convention will
        become clearer later.

      • There are similar functions in R for many of the distributions we will encounter,
        and they all follow a similar naming scheme. We simply replace binom with the
        R-name for a different distribution.
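The consistency of this naming scheme is easy to check directly. The following sketch, using only the functions introduced above, confirms that the pmf values sum to 1 and that pbinom() agrees with a cumulative sum of dbinom() values.

```r
# For X ~ Binom(10, 0.1): the pmf sums to 1, and the cdf at 4
# equals the sum of the pmf values for 0, 1, 2, 3, 4.
probs <- dbinom(0:10, size = 10, prob = 0.1)
sum(probs)                              # total probability is 1
pbinom(4, size = 10, prob = 0.1)        # cdf at 4 ...
sum(dbinom(0:4, size = 10, prob = 0.1)) # ... matches the cumulative sum
```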


4.2.2. The Hypergeometric Distribution
The hypergeometric distribution arises from considering the situation of random sam-
pling from a population in which there are just two types of individuals. (That is, there
is a categorical variable defined on the population with just two levels.) It is traditional
to describe the distribution in terms of the urn model. Suppose that we have an urn
with two different colors of balls. There are m white balls and n black balls. Suppose
we choose k balls from the urn in such a way that every set of k balls is equally likely to
be chosen (i.e., a random sample of balls) and count the number X of white balls. We
say that X has the hypergeometric distribution with parameters m, n, and k and
write X ∼ Hyper(m, n, k). A simple example shows how we can compute probabilities
in this case.

    Example 4.2.10. Suppose the urn has 2 white and 3 black balls and that we choose
    2 balls at random without replacement. If X is the number of white balls, we have
    X ∼ Hyper(2, 3, 2). Notice that in this case there are 10 different possible choices
    of two balls. If we label the balls W1, W2, B1, B2, B3, we have the following:
           2 whites (W1,W2)
           1 white (W1,B1), (W1,B2), (W1,B3), (W2,B1), (W2,B2), (W2,B3)
           0 whites (B1,B2), (B1,B3), (B2,B3)
    Since the 10 different pairs are equally likely, we have P (X = 0) = 3/10, P (X =
    1) = 6/10, and P (X = 2) = 1/10.

The systematic counting of the example can easily be extended to compute the pmf of
any hypergeometric random variable.

Theorem 4.2.11. Suppose that X ∼ Hyper(m, n, k). Then the pmf f of X is given by

                 f(x; m, n, k) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}} ,        x ≤ min(k, m) .

Proof. The denominator counts the number of samples of size k from m+n many balls.
The two terms in the numerator count the number of ways of choosing x white balls
from m and k − x black balls from n. Multiplying the two terms together counts the
number of ways of choosing x white balls and k − x black balls.
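The theorem can be checked against the hand count of Example 4.2.10 using R's choose() function for binomial coefficients.

```r
# X ~ Hyper(m = 2, n = 3, k = 2), as in Example 4.2.10.
x <- 0:2
by.formula <- choose(2, x) * choose(3, 2 - x) / choose(5, 2)
by.formula                       # 0.3 0.6 0.1, i.e., 3/10, 6/10, 1/10
dhyper(x, m = 2, n = 3, k = 2)   # dhyper() agrees with the formula
```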

  R knows the hypergeometric distribution and the syntax is exactly the same as for
the binomial distribution (except that the names of the parameters have changed).

19:08 -- May 4, 2008                                                                   407

        function (& parameters)        explanation
        rhyper(nn,m,n,k)               makes nn random draws of the random vari-
                                       able X and returns them in a vector.
        dhyper(x,m,n,k)                returns P(X = x) (the pmf).
        phyper(q,m,n,k)                returns P(X ≤ q) (the cdf).

      Example 4.2.12. Suppose that a statistics class has 29 students, 25 of whom are
      male. Let’s call the females the white balls and the males the black balls. Suppose
      that we choose 5 of these students at random and without replacement, i.e., a
      random sample of size 5. Let X be the number of females in our sample. Then
      X ∼ Hyper(4, 25, 5). Some interesting questions related to this random variable are
      answered by the R output below.
      > dhyper(x=c(0:5),m=4,n=25,k=5)
      [1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
      [6] 0.0000000000
      > dhyper(x=c(0:5),k=5,m=4,n=25)            # order of named arguments does not matter
      [1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
      [6] 0.0000000000
      > phyper(q=c(0:5),m=4,n=25,k=5)
      [1] 0.4473917 0.8734790 0.9896846 0.9997895 1.0000000 1.0000000
      > rhyper(nn=30,m=4,n=25,k=5)              # note nn for number of random outcomes
       [1] 2 1 1 1 1 2 2 2 1 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 1 2 0 0 0
      > dhyper(0:5,4,25,5)                      # default order of unnamed arguments
      [1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
      [6] 0.0000000000

4.3. An Introduction to Inference
There are many situations in which the binomial distribution seems to be the right
model for a process but for which π is unknown. The next example gives several quite
natural cases of this.

      Example 4.3.1.
        1. Microprocessor chips are being produced by an assembly line. There is a pos-
           sibility that any particular chip produced is defective. It might be reasonable
           under some circumstances to assume that the probability that any particu-
           lar chip is defective is a constant π. Then in a sample of 10 chips, it might


                       Figure 4.2.: Zener cards for ESP testing.

         be plausible to assume that the number of defective chips X behaves like a
         binomial random variable with n = 10 and π fixed but unknown.
      2. Perhaps it is reasonable to assume that a free-throw shooter in a basketball
         game has a constant probability π of making a free-throw and that successive
         attempts are independent one from another. Then in a series of n free-throws,
         the number of successful free-throws might behave as a binomial random vari-
         able with n known and π unknown.
      3. In a standard test for ESP, a card with one of five printed symbols is selected
         without the person claiming to have ESP being able to see it. As the ex-
         perimenter “concentrates” on the symbol printed on the card, the subject is
         supposed to announce which symbol is on the card. (These cards are called
         Zener cards and are pictured in Figure 4.2.) While we think that the proba-
         bility that a subject can identify any card is 1/5, the person with ESP might
         claim that the probability is higher. If we allow n trials of this experiment,
         it is plausible to assume that the number of successful trials X is a binomial
         random variable with π unknown.

  In situations like those in the example, we often want to test a hypothesis about π.
For example, in the case of the person supposed to have ESP, we would like to test our
hypothesis that π = .2.
  Let us look more closely at the ESP situation. What would it take for us to believe
that the subject in fact has a probability greater than 0.2 of correctly identifying the
hidden card? Clearly, we would want to have several trials and a rate of success that we
would think would not be likely by luck (or “chance”) alone. A standard test is to use
25 trials. (In a standard deck, there are 25 cards with five each of the first five symbols.
Rather than going through the deck once however, we will think of the experiment as
shuffling the deck after each trial. Then it is clear that each of the five types of cards is


equally likely to occur as the top card.) The following R output is relevant to our test.

> x=c(5:15)
> pbinom(x,25,.2)
 [1] 0.6166894 0.7800353 0.8908772 0.9532258 0.9826681 0.9944451 0.9984599
 [8] 0.9996310 0.9999237 0.9999864 0.9999979

   Even if our subject is just guessing, he will get more than five cards right about 40%
of the time. It is certainly possible that he is just guessing if he gets just 6 out
of 25 correct. On the other hand, it is virtually certain that he will not get more than
12 right if he is just guessing. While we would not have to believe the ESP explanation
for 13 out of 25 successes, it would be difficult to continue asserting that π = 0.2 in
this case. Of course there is a grey area. Suppose that our subject gets 10 cards right.
The probability that our subject will get at least 10 cards correct by guessing alone is
less than 2%. Is this sufficiently surprising to rule out guessing as an explanation? We
might not rule out guessing but we would very likely test this subject further.
   The procedure described above for testing the ESP hypothesis is a special case of a
general (class of) procedures known as hypothesis tests.
   Any hypothesis test follows the same outline.

Step 1: Identify the hypotheses
A statistical hypothesis test starts, oddly enough, with a hypothesis. A hypoth-
esis is a statement proposing a possible state of affairs with respect to a probability
distribution governing an experiment that we are about to perform. There are a variety
of kinds of hypotheses that we might want to test.

  1. A hypothesis stating a fixed value of a parameter: π = .5.

  2. A hypothesis stating a range of values of a parameter: π ≤ .3.

  3. A hypothesis about the nature of the distribution itself: X has a binomial distribution.

   In the ESP example, the hypothesis that we wished to test was π = .2. Notice
that we did not propose to test the hypothesis that a binomial distribution was the
correct explanation of the data. We assumed that the binomial distribution is a plau-
sible model of our data collection procedure. It will often be the case that we make
distributional hypotheses without thinking about testing them. (Sometimes that will
be a big mistake.)
   In the standard way of describing hypothesis tests, there are actually two hypotheses
that we view as being pitted against each other. For example, the two hypotheses in
the the ESP case were π = 0.2 (the subject does not have ESP) and π > 0.2 (the
subject does have ESP or some other mechanism of doing better than guessing). The
two hypotheses have standard names.


  1. Null Hypothesis. The null hypothesis, usually denoted H0 , is generally a
     hypothesis that the data analysis is intended to investigate. It is usually thought
     of as the “default” or “status quo” hypothesis that we will accept unless the data
     gives us substantial evidence against it. The null hypothesis is often a hypothesis
     that we want to “prove” false.
  2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha ,
     is the hypothesis that we are wanting to put forward as true if we have sufficient
     evidence against the null hypothesis.

  Thus we present our hypotheses in the ESP experiment as
                                     H0 :   π = 0.2
                                     Ha :   π > 0.2 .
For the ESP experiment, π = 0.2 is the null hypothesis since it is clearly our starting
point and it is the hypothesis that we wish to retain unless we have convincing evidence against it.

Step 2: Collect data and compute a test statistic
Earlier, we defined a statistic as a number computed from the data. In the ESP
example, our statistic is simply the result of the binomial random variable. Since we
are using the statistic to test a hypothesis, we often call it a test statistic.
  In our previous definition of a statistic, a statistic is a number, i.e., it is an actual
value computed from the data. In fact, we are now going to introduce some ambiguity
and refer also to the random variable as a statistic. So in this case, the random
variable X that is the result of counting the number of correct cards in 25 trials is our
test statistic but we will also refer to the value of that random variable x as a test
statistic. Of course the difference is whether we are referring to the experiment before
or after we collect the data. Note that we can only make probability statements about
X. The value of the statistic x is just a number which we have computed from the
result of the random process.
  Amidst this confusion, the central point is that if we think of the test statistic as a
random variable it has a distribution. This distribution is unknown (since we do not
know π).

Step 3: Compute the p-value.
Next we need to evaluate the evidence that our test statistic provides. To do this
requires that we think about our statistic as a random variable. In the ESP testing
example, X ∼ Binom(25, π). The distribution of the test statistic is called its sampling
distribution since we think of it as arising from producing a sample of the process.
  Since our test statistic is a random variable, we can ask probability questions about
our test statistic. The key question that we want to ask is this:


      How unusual would the value of the test statistic that I obtained be if the
      null hypothesis were true?

   We show how to answer the question if the result of the ESP experiment were 9 out
of 25 correct cards (36%).
   Notice that if the null hypothesis is true,

                      P(X ≥ 9) = 1 − pbinom(8,25,0.2) = 0.047.

  Therefore, if the null hypothesis is true, the probability that we would see a result
at least as extreme (in the direction of the alternate hypothesis) as 9 is 0.047. This
probability is called the p-value of the test statistic.

Definition 4.3.2 (p-value). The p-value of a test statistic t is the probability that a
result at least as extreme as t (in the direction of the alternate hypothesis) would occur
if the null hypothesis is true.

   Notice that the p-value is a number that is associated with a particular outcome
of the process. The p-value of 9 successes in our example is 0.047. Since the p-value
is computed after the random process is performed, it is not a probability associated
with this particular outcome of the random process. Rather it is a probability that
describes what might happen if the experiment is repeated indefinitely. Namely, if the
null hypothesis is true, then about 5% of the time the subject would get 9 or more
successes in his 25 trials.
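This interpretation of the p-value can also be illustrated by simulation. The sketch below (the number of repetitions and the seed are arbitrary choices) generates many guessing subjects and computes the fraction who get 9 or more cards right.

```r
# Simulate 100,000 subjects who are purely guessing (pi = 0.2)
# on 25 cards each; about 4.7% of them get 9 or more correct.
set.seed(1)  # arbitrary seed, for reproducibility
guesses <- rbinom(n = 100000, size = 25, prob = 0.2)
mean(guesses >= 9)        # close to the exact p-value below
1 - pbinom(8, 25, 0.2)    # 0.047
```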
   Countless journal articles in the social, biological, and medical sciences report on the
results of hypothesis tests. While there are many kinds of hypothesis tests and it might
not always be clear what kind of test the article is reporting on, it is almost universally
the case that the result of such a hypothesis test is reported using a p-value. It is quite
common to see statements such as p < 0.001. This obviously means that either the
null hypothesis being tested is false or something exceedingly surprising happened.

Step 4: Draw a conclusion
Drawing a conclusion from a p-value is a judgment call and it is a scientific rather than
mathematical decision. Our p-value of a test statistic of 9 in the ESP experiment is
0.047. This means that if we would test many people for ESP and they were all just
guessing, about 5% of them would have a result at least as extreme as this. A test
statistic of 9 provides some evidence that our subject is more successful than we would
expect by chance alone but certainly not definitive such evidence. If we were really
interested in this question, we would probably subject the subject to more tests.
   Sometimes the results of hypothesis tests are expressed in terms of decisions rather
than p-values. This is often the case when we must take some action based on our
data. We illustrate with a common example.


    Example 4.3.3. Suppose that a company claims that the defective rate of their
    manufacturing process is 1%. A customer tests 100 parts in a large shipment and
    finds 4 of these parts defective. Is the customer justified in rejecting the shipment?
    It is easy to think of this situation as a hypothesis test. The test statistic 4 is the
    result of a random variable X ∼ Binom(100, π). The null and alternate hypotheses
    are given by
                                        H0 : π = 0.01
                                        Ha : π > 0.01 .
        The p-value of this test statistic is 1 − pbinom(3,100,.01) = 0.018.
     Therefore, if the manufacturer’s claim is correct, we should only see 4 or more
    defectives 1.8% of the time when we test 100 parts. The customer might be justified
    in rejecting the shipment and therefore rejecting the null hypothesis.
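The p-value in the example is a one-line computation in R.

```r
# P(X >= 4) when X ~ Binom(100, 0.01): the p-value of Example 4.3.3
1 - pbinom(3, size = 100, prob = 0.01)   # about 0.018
```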

  We will describe our possible decisions as to reject the null hypothesis or not
reject the null hypothesis. There are of course two different kinds of errors that we
could make.

Definition 4.3.4 (Type I and Type II errors). A Type I error is the error of rejecting
H0 even though it is true.
 A Type II error is the error of not rejecting H0 even though it is false.

  Of course, if we reject the null hypothesis, we cannot know whether we have made a
Type I error. Similarly, if we do not reject the null hypothesis, we cannot know whether
we have made a Type II error. Whether we have committed such an error depends on
the true value of π which we cannot ever know simply from data.
  To determine the likelihood of making such errors, we need to specify our decision
rule. For example in Example 4.3.3 we might decide that our rule is to reject the null
hypothesis if we see 4 or more defective parts in our shipment of 100. Then we know
that if the null hypothesis is true, the probability that we will make a type I error is
0.018. Of course we cannot know the probability of a type II error without knowing
the true value of π.
  In the next example, we further consider the probabilities of the two kinds of errors in
a particular situation. This example also illustrates another variation on the hypothesis
test as it considers a two-sided alternate hypothesis.

    Example 4.3.5. A coin-toss is going to be used to make a very important decision.
    Since the coin is a commemorative coin of non-standard design (like those used in
    the Super Bowl), it is very important to know whether it is fair. We decide to do a
    test by tossing the coin 100 times and observing the number of heads. It is obvious
    that our test statistic x is the result of a random variable X ∼ Binom(100, π) and
    that our competing hypotheses should be


                                           H0 : π = 0.5
                                           Ha :   π ≠ 0.5 .
      Notice that the two-sided alternate hypothesis suggests that we want to reject the fair-
      ness hypothesis in the case that the coin favors heads as well as the case that the
      coin favors tails. It is reasonable to reject the null hypothesis whenever the number
      of heads is too far from 50 in either direction. Let’s suppose that our decision rule
      is to reject the null hypothesis if the number of heads is at most 40 or at least 60
      (i.e., X ≤ 40 or X ≥ 60). Then the probability of a Type I error is the probability
      of getting 40 or fewer or 60 or more heads when the true probability of heads is
      0.5. This is given by
       > pbinom(40,100,.5)+(1-pbinom(59,100,0.5))
      [1] 0.05688793

      The probability of a Type I error is 0.057. The probability of a Type II error for
      this decision rule can only be computed if we know the true value of π. Suppose
      that π = 0.48. Then the probability of not rejecting the null hypothesis is given by
      > pbinom(59,100,.48)-pbinom(39,100,0.48)
      [1] 0.9454557
      In other words, we will almost always make a type II error with this decision rule
      if π = 0.48. On the other hand, if π = 0.4, then the probability of a Type II error
      is about 0.54, as computed by
      > pbinom(59,100,.4)-pbinom(39,100,0.4)
      [1] 0.5378822
      The above computation illustrates the central dilemma of hypothesis testing. If
      we want to make it very unlikely that we commit a Type I error as we did with our
      decision rule here, it will be very difficult to detect that the null hypothesis is false.
      For this decision rule, even 100 tosses of the coin do not allow us much success in
      discovering that the coin has a 10% bias!

   There is a trade-off between type I and type II errors. If we choose a decision rule
that is less likely to make a Type I error if the null hypothesis is true, then it is more
likely to make a Type II error if the null hypothesis is false. What one should also notice
in our treatment of hypothesis testing is the asymmetry between the two hypotheses.
We are generally not willing to tolerate a large probability of a Type I error. However
this seems to lead to a rather large probability of a Type II error in the case that the
null hypothesis is false. This asymmetry is intentional however as the null hypothesis
usually has a preferred status as the “innocent until proven guilty” hypothesis.
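The trade-off can be seen concretely by computing the Type II error probability of the coin-tossing rule of Example 4.3.5 for several possible true values of π. (The function name below is our own choice.)

```r
# Type II error probability (failing to reject H0) for the rule
# "reject if X <= 40 or X >= 60" with X ~ Binom(100, pi).
typeII <- function(p) pbinom(59, 100, p) - pbinom(39, 100, p)
round(typeII(c(0.48, 0.45, 0.40, 0.35, 0.30)), 3)
# The error probability shrinks as the true pi moves away from 0.5.
```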

4.4. Continuous Random Variables
Recall that a continuous random variable X is one that can take on all values in
an interval of real numbers. For example, the height of a randomly chosen Calvin




                               Figure 4.3.: Discretized pmf for T .

student in inches could be any real number between, say, 36 and 80. Of course all
continuous random variables are idealizations. If we measure heights to the nearest
quarter inch, there are only finitely many possibilities for this random variable and we
could, in principle, treat it as discrete. We know from calculus, however, that treating
measurements as continuously varying quantities often simplifies rather than complicates
our techniques. In order to understand what kinds of probability statements we
would like to make about continuous random variables, it is helpful to keep in
mind this idea of the finite precision of our measurements. For example, a
statement that a randomly chosen individual is 72 inches tall is not a claim that the
individual is exactly 72 inches tall but rather a claim that the height of the individual
is in some small interval (maybe 71¾ to 72¼ if we are measuring to the nearest half
inch). So probabilities of the form P(X = x) are not especially meaningful. Rather,
the appropriate probability statements will be of the form P (a ≤ X ≤ b).

4.4.1. pdfs and cdfs
Recall the analogy of probability and mass. In the case of discrete random variables,
we represented the probability P(X = x) by a point of mass P(X = x) at the point x
and had total mass 1. In this case mass is continuous and the appropriate weighting
of mass is a density function. In the following example, we can see how this works.

       Example 4.4.1. A Geiger counter emits a beep when a radioactive particle is
       detected. The rate of beeping determines how radioactive the source is. Suppose
       that we record the time T to the next beep. It turns out that T behaves like
       a random variable. Suppose that we measured T with increasing precision. We
       might get histograms that look like those in Figure 4.3 for the pmf of T . It’s pretty
       obvious that we want to replace these histograms by a smooth curve. In fact the
       pictures should remind us of the pictures drawn for the Riemann sums that define
       the integral.

  The analogue to a probability mass function for a continuous variable is a probability
density function.


Definition 4.4.2 (probability density function, continuous random variable). A prob-
ability density function (pdf) is a function f such that
      • f (x) ≥ 0 for all real numbers x, and
      • \int_{-\infty}^{\infty} f(x) \, dx = 1.
The continuous random variable X defined by the pdf f satisfies

                                   P(a ≤ X ≤ b) = \int_a^b f(x) \, dx
for any real numbers a ≤ b.

  The following simple lemma demonstrates one way in which continuous random
variables are very different from discrete random variables.

Lemma 4.4.3. Let X be a continuous random variable with pdf f . Then for any
a ∈ R,
  1. P(X = a) = 0,
  2. P(X < a) = P(X ≤ a), and
  3. P(X > a) = P(X ≥ a).

Proof. P(X = a) = \int_a^a f(x) \, dx = 0 . And P(X ≤ a) = P(X < a) + P(X = a) = P(X < a).

      Example 4.4.4.
      Q. Consider the function

                 f(x) = \begin{cases} 3x^2 & x ∈ [0, 1] \\ 0 & \text{otherwise.} \end{cases}

      Show that f is a pdf and calculate P(X ≤ 1/2).
      A. Let’s begin by looking at a plot of the pdf. [The graph of f rises from 0 at
      x = 0 to 3 at x = 1.]



    The rectangular region of the plot has an area of 3, so it is plausible that the area
    under the graph of the pdf is 1. We can verify this by integration.
                 \int_{-\infty}^{\infty} f(x) \, dx = \int_0^1 3x^2 \, dx = x^3 \Big|_0^1 = 1 ,

     so f is a pdf and P(X ≤ 1/2) = \int_0^{1/2} 3x^2 \, dx = x^3 \Big|_0^{1/2} = 1/8.
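When an antiderivative is not handy, integrals like these can also be checked numerically with R’s integrate() function.

```r
# Numerical checks of the integrals in Example 4.4.4.
f <- function(x) ifelse(x >= 0 & x <= 1, 3 * x^2, 0)
integrate(f, lower = 0, upper = 1)$value    # total probability, 1
integrate(f, lower = 0, upper = 0.5)$value  # P(X <= 1/2) = 1/8
```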

  The cdf of a continuous random variable is defined the same way as it was for a
discrete random variable, but we use an integral rather than a sum to get the cdf from
the pdf in this case.

Definition 4.4.5 (cumulative distribution function). Let X be a continuous random
variable with pdf f , then the cumulative distribution function (cdf) for X is
                             F(x) = P(X ≤ x) = \int_{-\infty}^{x} f(t) \, dt .

     Example 4.4.6. Q. Determine the cdf of the random variable from Example 4.4.4.
     A. For any x ∈ [0, 1],

                 F_X(x) = P(X ≤ x) = \int_0^x 3t^2 \, dt = t^3 \Big|_0^x = x^3 ,

     so

                 F_X(x) = \begin{cases} 0 & x ∈ (−∞, 0) \\ x^3 & x ∈ [0, 1] \\ 1 & x ∈ (1, ∞) . \end{cases}

Notice that the cdf FX is an antiderivative of the pdf fX . This follows immediately from
the Fundamental Theorem of Calculus. Notice also that P(a ≤ X ≤ b) = F (b) − F (a).

Lemma 4.4.7. Let FX be the cdf of a continuous random variable X. Then the pdf
fX satisfies
                                    f_X(x) = F_X'(x) .

   Just as the binomial and hypergeometric distributions were important families of
discrete random variables, there are several important families of continuous random
variables that are often used as models of real-world situations. We investigate a few
of these in the next three subsections.


4.4.2. Uniform Distributions
The continuous uniform distribution has a pdf that is constant on some interval.

Definition 4.4.8 (uniform random variable). A continuous uniform random variable
on the interval [a, b] is the random variable with pdf given by

                 f(x; a, b) = \begin{cases} \frac{1}{b-a} & x ∈ [a, b] \\ 0 & \text{otherwise.} \end{cases}

  It is easy to confirm that this function is indeed a pdf. We could integrate, or we
could simply use geometry. The region under the graph of the uniform pdf is a rectangle
with width b − a and height 1/(b − a), so the area is 1.

      Example 4.4.9.
      Q. Let X be uniform on [0, 10]. What is P(X > 7)? What is P(3 ≤ X < 7)?
      A. Again we argue geometrically. P(X > 7) is represented by a rectangle with base
      from 7 to 10 along the x-axis and a height of .1, so P(X > 7) = 3 · 0.1 = 0.3.
         Similarly P(3 ≤ X < 7) = 0.4. In fact, for any interval of width w contained in
      [0, 10], the probability that X falls in that particular interval is w/10.
         We could also compute these results by integrating, but this would be silly.
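The geometric answers agree with R’s punif() function for the uniform cdf.

```r
# X ~ Unif(0, 10): the probabilities of Example 4.4.9
1 - punif(7, min = 0, max = 10)                           # P(X > 7) = 0.3
punif(7, min = 0, max = 10) - punif(3, min = 0, max = 10) # P(3 <= X < 7) = 0.4
```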

      Example 4.4.10. Q. Let X be uniform on the interval [0, 1] (which we denote
      X ∼ Unif(0, 1)). What is the cdf for X?
      A. For x ∈ [0, 1], F_X(x) = \int_0^x 1 \, dt = x, so

                 F_X(x) = \begin{cases} 0 & x ∈ (−∞, 0) \\ x & x ∈ [0, 1] \\ 1 & x ∈ (1, ∞) . \end{cases}

      [Graphs of the pdf and cdf for Unif(0, 1): the pdf f(x) is constant at 1 on [0, 1];
      the cdf F(x) rises linearly from 0 to 1 on [0, 1].]


       Although it has a very simple pdf and cdf, this random variable actually has sev-
    eral important uses. One such use is related to random number generation. Com-
    puters are not able to generate truly random numbers. Algorithms that attempt to
    simulate randomness are called pseudo-random number generators. X ∼ Unif(0, 1)
    is a model for an idealized random number generator. Computer scientists compare
    the behavior of a pseudo-random number generator with the behavior that would
    be expected for X to test the quality of the pseudo-random number generator.

There are R functions for computing the pdf and cdf of a uniform random variable as
well as a function to return random numbers. An additional function computes the
quantiles of the uniform distribution. If X ∼ Unif(min, max) the following functions can
be used.

      function (& parameters)        explanation
      runif(n,min,max)               makes n random draws of the random vari-
                                     able X and returns them in a vector.
       dunif(x,min,max)               returns fX (x) (the pdf).
      punif(q,min,max)               returns P(X ≤ q) (the cdf).
      qunif(p,min,max)               returns x such that P(X ≤ x) = p.

  Here are examples of computations for X ∼ Unif(0, 10).

> runif(6,0,10)    # 6 random values on [0,10]
[1] 5.449745 4.124461 3.029500 5.384229 7.771744 8.571396
> dunif(5,0,10)    # pdf is 1/10
[1] 0.1
> punif(5,0,10)    # half the distribution is below 5
[1] 0.5
> qunif(.25,0,10) # 1/4 of the distribution is below 2.5
[1] 2.5

4.4.3. Exponential Distributions
In Example 4.4.1 we considered a “waiting time” random variable, namely the waiting
time until the next radioactive event. Waiting times are important random variables
in reliability studies. For example, a common characteristic of a manufactured object
is MTTF, or mean time to failure. The model often used for the Geiger counter random
variable is the exponential distribution. Note that a waiting time can be any x in the
range 0 ≤ x < ∞.


Definition 4.4.11 (The exponential distribution). The random variable X has the
exponential distribution with parameter λ > 0 (X ∼ Exp(λ)) if X has the pdf

                                           { λe^(−λx)   x ≥ 0
                                fX (x) =   {
                                           { 0          x < 0.

  It is easy to see that the function fX of the previous definition is a pdf for any
positive value of λ. R refers to the value of λ as the rate so the appropriate functions
in R are rexp(n,rate), dexp(x,rate), pexp(q,rate), and qexp(p,rate). We will
see later that rate is an apt name for λ as λ will be the rate per unit time if X is a
waiting time random variable.
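Concretely, integrating the pdf gives the closed-form cdf (the quantity that R's pexp returns):

```latex
F_X(x) = P(X \le x) = \int_0^x \lambda e^{-\lambda t}\,dt
       = \Big[-e^{-\lambda t}\Big]_0^x
       = 1 - e^{-\lambda x}, \qquad x \ge 0 .
```

Letting x → ∞ shows that the total area under fX is 1, confirming that fX is indeed a pdf.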

      Example 4.4.12. Suppose that a random variable T measures the time until
      the next radioactive event is recorded at a Geiger counter (time measured since
      the last event). For a particular radioactive material, a plausible model for T is
      T ∼ Exp(0.1) where time is measured in seconds. Then the following R session
      computes some important values related to T .
      > pexp(q=0.1,rate=.1)   # probability waiting time less than .1
      [1] 0.009950166
      > pexp(q=1,rate=.1)     # probability waiting time less than 1
      [1] 0.09516258
      > pexp(q=10,rate=.1)
      [1] 0.6321206
      > pexp(q=20,rate=.1)
      [1] 0.8646647
      > pexp(100,rate=.1)
      [1] 0.9999546
      > pexp(30,rate=.1)-pexp(5,rate=.1)   # probability waiting time between 5 and 30
      [1] 0.5567436
      > qexp(p=.5,rate=.1)    # probability is .5 that T is less than 6.93
      [1] 6.931472

      The graphs in Figure 4.4 are graphs of the pdf and cdf of this random variable. All
      exponential distributions look the same except for the scale. The rate of 0.1 here
      means that we can expect that in the long run this process will average 0.1 counts
      per second.
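The pexp and qexp values in this session follow from the closed-form cdf of the exponential distribution, F(t) = 1 − e^(−λt). A quick cross-check outside R (a Python sketch; the helper names mimic R's but are defined here):

```python
import math

rate = 0.1  # lambda for T ~ Exp(0.1)

def pexp(t, lam):
    """cdf of the exponential: P(T <= t) = 1 - exp(-lam * t) for t >= 0."""
    return 1 - math.exp(-lam * t)

def qexp(p, lam):
    """quantile: solve 1 - exp(-lam * t) = p for t."""
    return -math.log(1 - p) / lam

print(pexp(10, rate))                  # ~0.6321206, matching pexp(q=10, rate=.1)
print(pexp(30, rate) - pexp(5, rate))  # ~0.5567436
print(qexp(0.5, rate))                 # ~6.931472 = ln(2)/0.1, the median
```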

  Notice that when given a random variable such as the waiting time to a Geiger counter
event, we are not handed its pdf as well. The pdf is a model of the situation. In the
case of an example such as this, we really are faced with two decisions.

  1. Which family (e.g., uniform, exponential, etc.) of distributions best models the
     process?


               Figure 4.4.: The pdf and cdf of the random variable T ∼ Exp(0.1).

     2. What particular values of the parameters should we use for the pdf?

   Sometimes we can begin to answer question 1 even before we collect data. Each
of the distributions that we have met has certain properties which we check against
our process. For example, it is often apparent whether the properties of a binomial
process should apply to a certain process we are examining. Of course it is always
useful to check our answer to question 1 by collecting data and verifying that the shape
of the distribution of the data collected is consistent with the distribution we are using.
The only reasonable way to answer the second question however is to collect data. In
the last example, for instance, we saw that if X ∼ Exp(0.1) that P (X ≤ 6.93) = .5.
Therefore if about half of our data are less than 6.93, we would say that the data
are consistent with the hypothesis that X ∼ Exp(0.1) but if almost all the data are
less than 5, we would probably doubt that X has this distribution. The problems of
choosing the appropriate distribution and the appropriate values of the parameters is
an important one that we will address in various ways in Chapter 5.

4.4.4. Weibull Distributions
A very important generalization of the exponential distribution is the Weibull distribution.
A Weibull distribution is often used by engineers to model phenomena such as
failure, manufacturing or delivery times. They have also been used for applications as
diverse as fading in wireless communications channels and wind velocity. The Weibull
is a two-parameter family of distributions. The two parameters are a shape parameter
α and a scale parameter β.

Definition 4.4.13 (The Weibull distributions). The random variable X has a Weibull
distribution with shape parameter α > 0 and scale parameter β > 0 (X ∼ Weibull(α, β))
if the pdf of X is
                                         { (α/β^α) x^(α−1) e^(−(x/β)^α)   x ≥ 0
                      fX (x; α, β) =     {
                                         { 0                              x < 0








                                 Figure 4.5.: Left: fixed β. Right: fixed α.

  Notice that if X ∼ Weibull(1, β) then X ∼ Exp(1/β). Varying α in the Weibull
distribution changes the shape of the distribution while changing β changes the scale.
The effect of fixing β (β = 5) and changing α (α = 1, 2, 3) is illustrated by the first
graph in Figure 4.5 while the second graph shows the effect of changing β (β = 1, 3, 5)
with α fixed at α = 2. The appropriate R functions to compute with the Weibull
distribution are dweibull(x,shape,scale), pweibull(q,shape,scale), etc.

         Example 4.4.14. The Weibull distribution is sometimes used to model the max-
         imum wind velocity measured during a 24 hour period at a specific location. The
         dataset gives the maximum wind
         velocity at the San Diego airport on each of 6,209 consecutive days. It is claimed
         that the maximum wind velocity measured on a day behaves like a random variable
         W that has a Weibull distribution with α = 3.46 and β = 16.90. The R code below
         investigates that model using this past data. (In fact, this model is not a very good
         one although the output below suggests that it might be plausible.)
             > w$Wind
                [1] 14 11 10 13 11 11 26 21 14 13 10 10 13 10 13 13 12 12 13 17 11 11 13 25 15
               [26] 18 13 17 12 14 15 10 16 17 17 13 18 14 12 20 11 14 20 16 12 14 18 17 13 16
               [51] 13 16 11 13 11 15 13 15 16 18 14 15 15 14 14 16 15 18 14 16 14 10 17 14 12
             > cutpts=c(0,5,10,15,20,25,30)
             > table(cut(w$Wind,cutpts))

               (0,5] (5,10] (10,15] (15,20] (20,25] (25,30]
                   2     434    3303    1910    409      95
             > length(w$Wind[w$Wind<12.5])/6209
             [1] 0.2728298                             # 27.3% days with max windspeed less than 12.5
             > pweibull(12.5,3.46,16.9)
             [1] 0.2968784                             # 29.7% predicted by Weibull model
             > length(w$Wind[w$Wind<22.5])/6209


    [1] 0.951361
    > pweibull(22.5,3.46,16.9)
    [1] 0.9322498
    > simulation=rweibull(100000,3.46,16.9)         # 100,000 simulated days
    > mean(simulation)                              # simulated days have mean about the same as actual
    [1] 15.18883
    > mean(w$Wind)
    [1] 15.32405
    > sd(simulation)                                # simulated days have greater variation
    [1] 4.85144
    > sd(w$Wind)
    [1] 4.239603
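The pweibull values above can likewise be reproduced from the closed-form Weibull cdf, F(x) = 1 − e^(−(x/β)^α) for x ≥ 0. A Python sketch (the helper name mimics R's but is defined here):

```python
import math

alpha, beta = 3.46, 16.9  # shape and scale from the claimed model

def pweibull(x, shape, scale):
    """Weibull cdf: P(X <= x) = 1 - exp(-(x/scale)^shape) for x >= 0."""
    return 1 - math.exp(-((x / scale) ** shape)) if x >= 0 else 0.0

print(pweibull(12.5, alpha, beta))  # ~0.2968784, matching pweibull(12.5,3.46,16.9)
print(pweibull(22.5, alpha, beta))  # ~0.9322498
```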

4.5. The Mean of a Random Variable

Just as numerical summaries of a data set can help us understand our data, numerical
summaries of the distribution of a random variable can help us understand the behavior
of that random variable. In this section we introduce the notion of a mean of a random
variable. The name of this summary, mean, is no accident. The mean of a random
variable is supposed to measure the “center” of a distribution in the same way that the
mean of data measures the center of that data. We will use our experience with data
to help us develop a definition.

4.5.1. The Mean of a Discrete Random Variable

   Example 4.5.1.
   Q. Let’s begin with a motivating example. Suppose a student has taken 10 courses
   and received 5 A’s, 4 B’s and 1 C. Using the traditional numerical scale where an
   A is worth 4, a B is worth 3 and a C is worth 2, what is this student’s GPA (grade
   point average)?

   A. The first thing to notice is that (4 + 3 + 2)/3 = 3 is not correct. We cannot simply
   add up the distinct values and divide by the number of values. Clearly this student
   should have a GPA that is higher than 3.0, since there were more A's than C's.
      Consider now a correct way to do this calculation and some algebraic reformulations
      of it.
             GPA = (4 + 4 + 4 + 4 + 4 + 3 + 3 + 3 + 3 + 2)/10 = (5·4 + 4·3 + 1·2)/10
                 = (5/10)·4 + (4/10)·3 + (1/10)·2
                 = 4·(5/10) + 3·(4/10) + 2·(1/10)
                 = 3.4

  Our definition of the mean of a random variable follows the example above. Notice
that we can think of the GPA as a sum of terms of the form

                      (grade) · (proportion of courses receiving that grade) .

Since the limiting proportion of outcomes that have a particular value is the probability
of that value, we are led to the following definition.

Definition 4.5.2 (mean). Let X be a discrete random variable with pmf f . The mean
(also called expected value) of X is denoted as µX or E(X) and defined by

                                     µX = E(X) = Σ x · f (x) .

The sum is taken over all possible values of X.

      Example 4.5.3. Q. If we flip four fair coins and let X count the number of heads,
      what is E(X)?
      A. If we flip four fair coins and let X count the number of heads, then the distri-
      bution of X is described by the following table. (Note that X ∼ Binom(4, .5).)
                                  value of X       0      1      2      3      4
                                  probability     1/16   4/16   6/16   4/16   1/16
      So the expected value is
                 0 · (1/16) + 1 · (4/16) + 2 · (6/16) + 3 · (4/16) + 4 · (1/16) = 2
      On average we get 2 heads in 4 tosses. This is certainly in keeping with our informal
      understanding of the word average.
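The sum in this example is just Σ x · f(x) for the Binom(4, .5) pmf, and it can be checked exactly with rational arithmetic (a sketch in Python; math.comb supplies the binomial coefficients):

```python
from fractions import Fraction
from math import comb

n = 4
# pmf of Binom(4, 1/2): f(x) = C(4, x) / 2^4
pmf = {x: Fraction(comb(n, x), 2 ** n) for x in range(n + 1)}

mean_binom = sum(x * p for x, p in pmf.items())
print(mean_binom)  # 2, agreeing with n*pi = 4 * (1/2)
```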


 More generally, the mean of a binomial random variable is found by the following
theorem.

Theorem 4.5.4. Let X ∼ Binom(n, π). Then E(X) = nπ.

  Similarly, the mean of a hypergeometric random variable is just what we think it
should be.

Theorem 4.5.5. Let X ∼ Hyper(m, n, k). Then E(X) = km/(m + n).

  The following example illustrates the computation of the mean for a hypergeometric
random variable.

> x=c(0:5)
> p=dhyper(x,m=4,n=25,k=5)
> sum(x*p)
[1] 0.6896552
> 4/29 * 5
[1] 0.6896552
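The same computation can be reproduced from the hypergeometric pmf itself, P(X = x) = C(m, x)C(n, k − x)/C(m + n, k). A Python sketch (the helper name mimics R's dhyper but is defined here):

```python
from math import comb

m, n, k = 4, 25, 5  # same parameters as the dhyper call above

def dhyper(x, m, n, k):
    """hypergeometric pmf: probability of x successes in k draws
    without replacement from m successes and n failures."""
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

mean_hyper = sum(x * dhyper(x, m, n, k) for x in range(0, min(m, k) + 1))
print(mean_hyper)  # ~0.6896552 = k*m/(m+n) = 5*4/29
```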

4.5.2. The Mean of a Continuous Random Variable
If we think of probability as mass, then the expected value for a discrete random
variable X is the center of mass of a system of point masses where a mass fX (x) is
placed at each possible value of X. The expected value of a continuous random variable
should also be the center of mass where the pdf is now interpreted as density.

Definition 4.5.6 (mean). Let X be a continuous random variable with pdf f . The
mean of X is defined by
                            µX = E(X) = ∫_(−∞)^(∞) x f (x) dx .

   Example 4.5.7. Recall the pdf in Example 4.4.4: f (x) = 3x² for x ∈ [0, 1] and
   f (x) = 0 otherwise. Then
                              E(X) = ∫_0^1 x · 3x² dx = 3/4 .

   The value 3/4 seems plausible from the graph of f .

  We compute the mean of two of our favorite continuous random variables in the next
theorem.


Theorem 4.5.8.

  1. If X ∼ Unif(a, b) then E(X) = (a + b)/2.

  2. If X ∼ Exp(λ) then E(X) = 1/λ.

Proof. The proof of each of these is a simple integral. These are left to the reader.
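For readers who want the integrals, here is a sketch of both computations (the exponential case uses integration by parts with u = x and dv = λe^(−λx) dx):

```latex
E(X) = \int_a^b \frac{x}{b-a}\,dx = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2},
\qquad
E(X) = \int_0^\infty x\,\lambda e^{-\lambda x}\,dx
     = \Big[-x e^{-\lambda x}\Big]_0^\infty + \int_0^\infty e^{-\lambda x}\,dx
     = 0 + \frac{1}{\lambda}.
```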

  Our intuition tells us that in a large sequence of trials of the random process described
by X, the sample mean of the observations should usually be close to the mean of X.
This is in fact true and is known as the Law of Large Numbers. We will not state that
law precisely here but we will illustrate it using several simulations in R.

> r=rexp(100000,rate=1)
> mean(r)                                # should be 1
[1] 0.9959467
> r=runif(100000,min=0,max=10)
> mean(r)
[1] 5.003549                             # should be 5
> r=rbinom(100000,size=100,p=.1)
> mean(r)
[1] 9.99755                              # should be 10
> r=rhyper(100000,m=10,n=20,k=6)
> mean(r)
[1] 1.99868                              # should be 2
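The same illustration can be run with any language's random number facilities; here is a Python sketch using the standard library, seeded so the run is reproducible (the sample means are still only approximations of the true means):

```python
import random

random.seed(1)
n = 100_000

# Exponential with rate 1: sample mean should be close to E(X) = 1.
exp_mean = sum(random.expovariate(1.0) for _ in range(n)) / n
print(exp_mean)

# Uniform on (0, 10): sample mean should be close to (0 + 10)/2 = 5.
unif_mean = sum(random.uniform(0, 10) for _ in range(n)) / n
print(unif_mean)
```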

4.6. Functions of a Random Variable
After collecting data, we often transform it. That is, we apply some function to all the
data. For example, we saw the value of using a logarithmic transformation (on the
U.S. Counties data) to make a distribution more symmetric. Now consider the notion
of transforming a random variable.

Definition 4.6.1 (transformation). Suppose that t is a function defined on all the
possible values of the random variable X. Then the random variable t(X) is the
random variable that has outcome t(x) whenever x is the outcome of X.

  If the random variable Y is defined by Y = t(X), then Y itself has an expected value.
To find the expected value of Y , we would need to find the pmf or pdf of Y , fY (y),
and then use the definition of E(Y ) to compute E(Y ). Occasionally, this is easy to do,
particularly in the case of a discrete random variable X.


    Example 4.6.2. Suppose that X is the random variable that results when a single
    die is rolled and the number on its face recorded. The pmf of X is f (x) = 1/6,
    x = 1, 2, 3, 4, 5, 6, and E(X) = 3.5. Now suppose that for a certain game, the value
    Y = X² is interesting. Then the pmf of Y is easily seen to be f (y) = 1/6, y =
    1, 4, 9, 16, 25, 36, and E(Y ) = 91/6 ≈ 15.2. Note that to find E(Y ) we first found
    the pmf of Y and then found E(Y ) using the usual method. Note that E(Y ) ≠ [E(X)]² !

  It turns out that there is a way to compute E(t(X)) that does not require us to first
find fY . This is especially useful in the case that X is continuous.

Lemma 4.6.3. If X is a random variable (discrete or continuous) and t a function
defined on the values of X, then if Y = t(X) and X has pdf (pmf) fX

                                 { Σ_x t(x) fX (x)               if X is discrete
                      E(Y ) =    {
                                 { ∫_(−∞)^(∞) t(x) fX (x) dx     if X is continuous .

  We will not give the proof but it is easy to see that this lemma should be so (at least
for the discrete case) by looking at an example.

    Example 4.6.4. Let X be the result of tossing a fair die. X has possible outcomes
    1, 2, 3, 4, 5, 6. Let Y be the random variable |X − 2|. Then the lemma gives
          E(Y ) = Σ_x |x − 2| · (1/6)
                = 1·(1/6) + 0·(1/6) + 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) = 11/6 .

    But we can also compute E(Y ) directly from the definition. Noting that the possible
    values of Y are 0, 1, 2, 3, 4, we have
              E(Y ) = Σ_y y fY (y) = 0·(1/6) + 1·(2/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) = 11/6 .

     The sum that computes E(Y ) from the definition is clearly the same sum as the one
     given by the lemma, but in a “different order” and with some terms combined, since
     more than one x can produce a given value of Y .
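Both computations can be carried out exactly with rational arithmetic; the following Python sketch does the lemma's sum and the direct definition side by side:

```python
from fractions import Fraction
from collections import Counter

sixth = Fraction(1, 6)
faces = range(1, 7)

# Lemma: E(Y) = sum over x of t(x) * f_X(x); no pmf of Y needed.
via_lemma = sum(abs(x - 2) * sixth for x in faces)

# Definition: first build the pmf of Y = |X - 2|, then sum y * f_Y(y).
pmf_y = Counter()
for x in faces:
    pmf_y[abs(x - 2)] += sixth
via_definition = sum(y * p for y, p in pmf_y.items())

print(via_lemma, via_definition)  # both 11/6
```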

    Example 4.6.5. Suppose that X ∼ Unif(0, 1) and that Y = X². Then
                              E(Y ) = ∫_0^1 x² · 1 dx = 1/3 .

    This is consistent with the following simulation.


      > x=runif(1000,0,1)
      > y=x^2
      > mean(y)
      [1] 0.326449
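A similar simulation can be run outside R; a Python sketch, seeded for reproducibility (the estimate is still only approximate):

```python
import random

random.seed(2)
xs = [random.random() for _ in range(100_000)]  # Unif(0, 1) draws
ys = [x ** 2 for x in xs]
print(sum(ys) / len(ys))  # close to E(X^2) = 1/3
```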

While it is not necessarily the case that E(t(X)) = t(E(X)) (see problem 4.21), the
next proposition shows that the expectation function is a “linear operator.”

Lemma 4.6.6. If a and b are real numbers, then E(aX + b) = a E(X) + b.

4.6.1. The Variance of a Random Variable
We are now in a position to define the variance of a random variable. Recall that
the variance of a set of n data points x1 , . . . , xn is almost the average of the squared
deviations from the sample mean:

                               Var(x) = Σ_i (xi − x̄)² /(n − 1) .

The natural analogue for random variables is the following.

Definition 4.6.7 (variance, standard deviation of a random variable). Let X be a
random variable. The variance of X is defined by

                              σX² = Var(X) = E((X − µX )²) .

The standard deviation is the square root of the variance and is denoted σX .

  It is obvious from the definition that σX ≥ 0 and that σX > 0 unless X = µX with
probability 1.

      Example 4.6.8. Suppose that X is a uniform random variable, X ∼ Unif(0, 1).
      Then E(X) = 1/2. To compute the variance of X we need to compute
                                     ∫_0^1 (x − 1/2)² dx .

      It is easy to see that the value of this integral is 1/12.
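For completeness, the integral evaluates by the power rule:

```latex
\int_0^1 \Big(x - \tfrac{1}{2}\Big)^2\,dx
  = \left[\frac{(x - \frac12)^3}{3}\right]_0^1
  = \frac{(1/2)^3}{3} - \frac{(-1/2)^3}{3}
  = \frac{1}{24} + \frac{1}{24}
  = \frac{1}{12}.
```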

  The following lemma records the variance of several of our favorite random variables.

Lemma 4.6.9.

  1. If X ∼ Binom(n, π) then Var(X) = nπ(1 − π).

  2. If X ∼ Hyper(m, n, k) then Var(X) = k · (m/(m + n)) · (n/(m + n)) · ((m + n − k)/(m + n − 1)).

  3. If X ∼ Unif(a, b) then Var(X) = (b − a)²/12.

  4. If X ∼ Exp(λ) then Var(X) = 1/λ².


  It is instructive to compare the variances of the binomial and the hypergeometric
distribution. We do that in the next example.

    Example 4.6.10. Suppose that a population has 10,000 voters and that 4,000 of
    them plan to vote for a certain candidate. We select 100 voters at random and ask
    them if they favor this candidate. Obviously, the number of voters X that favor
    this candidate has the distribution Hyper(4000, 6000, 100). This distribution has
    mean 40 and variance 100(.4)(.6)(.99). On the other hand, were we to treat this
    situation as sampling with replacement so that X ∼ Binom(100, .4), X would have
     mean 40 and variance 100(.4)(.6). The only difference in the two expressions for the
     variance is the factor (m + n − k)/(m + n − 1) = 9900/9999 ≈ .99, which is sometimes
     called the finite population correction factor. It should really be called the
     sampling without replacement correction factor.

  The following lemma sometimes helps us to compute the variance of X. It also is
useful in understanding the properties of the variance.

Lemma 4.6.11. Suppose that the random variable X is either discrete or continuous
with mean µX . Then
                               σX² = E(X²) − µX² .

Proof. We have

σX² = E((X − µX )²) = E(X² − 2µX X + µX²) = E(X²) − 2µX E(X) + µX² = E(X²) − µX² .

Note that we have used the linearity of E and also that E(c) = c if c is a constant.
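The shortcut is easy to sanity-check on the fair die of Example 4.6.2, where µX = 7/2. A Python sketch using exact rational arithmetic:

```python
from fractions import Fraction

sixth = Fraction(1, 6)
faces = range(1, 7)

mu = sum(x * sixth for x in faces)                          # 7/2
var_definition = sum((x - mu) ** 2 * sixth for x in faces)  # E((X - mu)^2)
var_shortcut = sum(x ** 2 * sixth for x in faces) - mu ** 2 # E(X^2) - mu^2

print(var_definition, var_shortcut)  # both 35/12
```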

4.7. The Normal Distribution
The most important distribution in statistics is called the normal distribution.

Definition 4.7.1 (normal distribution). A random variable X has the normal distri-
bution with parameters µ and σ if X has pdf
                   f (x; µ, σ) = (1/(√(2π) σ)) e^(−(x−µ)²/2σ²) ,    −∞ < x < ∞ .
We write X ∼ Norm(µ, σ) in this case.

  The mean and variance of a normal distribution are µ and σ² so that the parameters
are aptly, rather than confusingly, named. R functions dnorm(x,mean,sd),
pnorm(q,mean,sd), rnorm(n,mean,sd), and qnorm(p,mean,sd) compute the relevant
quantities for this distribution.








             Figure 4.6.: The pdf of a standard normal random variable.

  If µ = 0 and σ = 1 we say that X has a standard normal distribution. Figure 4.6
provides a graph of the density of the standard normal distribution. Notice the following
important characteristics of this distribution: it is unimodal, symmetric, and can take
on all possible real values both positive and negative. The curve in Figure 4.6 suffices
to understand all of the normal distributions due to the following lemma.

Lemma 4.7.2. If X ∼ Norm(µ, σ) then the random variable Z = (X − µ)/σ has the
standard normal distribution.

Proof. To see this, we show that P(a ≤ Z ≤ b) is computed by the integral of the
standard normal density function.

P(a ≤ Z ≤ b) = P(a ≤ (X − µ)/σ ≤ b) = P(µ + aσ ≤ X ≤ µ + bσ)
             = ∫_(µ+aσ)^(µ+bσ) (1/(√(2π) σ)) e^(−(x−µ)²/2σ²) dx .

Now in the integral, make the substitution u = (x − µ)/σ. We have then that

       ∫_(µ+aσ)^(µ+bσ) (1/(√(2π) σ)) e^(−(x−µ)²/2σ²) dx = ∫_a^b (1/√(2π)) e^(−u²/2) du .

But the latter integral is precisely the integral that computes P(a ≤ U ≤ b) if U is a
standard normal random variable.

  The normal distribution is used so often that it is helpful to commit to memory
certain important probability benchmarks associated with it.


                              The 68–95–99.7 Rule
  If Z has a standard normal distribution, then

     1. P(−1 ≤ Z ≤ 1) ≈ 68%

     2. P(−2 ≤ Z ≤ 2) ≈ 95%

     3. P(−3 ≤ Z ≤ 3) ≈ 99.7%.

  If the distribution of X is normal (but not necessarily standard normal), then these
approximations have natural interpretations using Lemma 4.7.2. For example, we can
say that the probability that X is within one standard deviation of the mean is about
68%.

   Example 4.7.3. In 2000, the average height of a 19-year old United States male
   was 69.6 inches. The standard deviation of the population of males was 5.8 inches.
   The distribution of heights of this population is well-modeled by a normal distri-
   bution. Then the percentage of males within 5.8 inches of 69.6 inches was approx-
   imately 68%. In R,
    > pnorm(69.6+5.8,69.6,5.8)-pnorm(69.6-5.8,69.6,5.8)
    [1] 0.6826895
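The pnorm computation can be cross-checked through the standard normal cdf, Φ(z) = (1 + erf(z/√2))/2, combined with the standardization of Lemma 4.7.2. A Python sketch (the helper name mimics R's pnorm but is defined here):

```python
import math

def pnorm(q, mean, sd):
    """Normal cdf via standardization and the error function."""
    z = (q - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 69.6, 5.8
prob = pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
print(prob)  # ~0.6826895, the "68" of the 68-95-99.7 rule
```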

  It turns out that the normal distribution is a good model for many variables. When-
ever a variable has a unimodal, symmetric distribution in some population, we tend to
think of the normal distribution as a possible model for that variable. For example,
suppose that we take repeated measures of a difficult to measure quantity such as the
charge of an electron. It might be reasonable to assume that our measurements center
on the true value of the quantity but have some spread around that true value. And it
might also be reasonable to assume that the spread is symmetric around the true value
with measurements closer to the true value being more likely to occur than measurements
that are further away from the true value. Then a normal random variable is a
candidate (and often used) model for this situation.

4.8. Exercises

4.1 Suppose that you roll 5 standard dice. Determine the probability that all the dice
are the same. (Hint: first compute the probability that all five dice are sixes.)

4.2 Suppose that you deal 5 cards from a standard deck of cards. Determine the
probability that all the cards are of the same color. (A standard deck of cards has 52


cards in two colors. There are 26 red and 26 black cards. You should be able to do
this computation using R and the appropriate discrete distribution.)

4.3 Acceptance sampling is a procedure that tests some of the items in a lot and decides
to accept or reject the entire lot based on the results of testing the sample. Suppose
that the test determines whether an item is “acceptable” or “defective”. Suppose that
in a lot of 100 items, 4 are tested and that the lot is rejected if one or more of those
four are found to be defective.

  a) If 10% of the lot of 100 are defective, what is the probability that the purchaser
     will reject the shipment?

  b) If 20% of the lot of 100 are defective, what is the probability that the purchaser
     will reject the shipment?

4.4 Suppose that there are 10,000 voters in a certain community. A random sample
of 100 of the voters is chosen, and the sampled voters are asked whether they are for
or against a new bond proposal. Suppose that in fact only 4,500 of the voters are in
favor of the bond proposal.

  a) What is the probability that fewer than half of the sampled voters (i.e., 49 or
     fewer) are in favor of the bond proposal?

  b) Suppose instead that the sample consists of 2,000 voters. Answer the same question
     as in the previous part.

4.5 If the population is very large relative to the size of the sample, sampling with
replacement should yield very similar results to that of sampling without replacement.
Suppose that an urn contains 10,000 balls, 3,000 of which are white.

  a) If 100 of these balls are chosen at random with replacement, what is the proba-
     bility that at most 25 of these are white?

  b) If 100 of these balls are chosen at random without replacement, what is the
     probability that at most 25 of these are white?

4.6 In the days before calculators, it was customary for textbooks to include tables
of the cdf of the binomial distribution for small values of n. Of course not all values
of π could be included; often only the values π = .1, .2, . . . , .8, .9 were included.
Let's suppose that one of these tables includes the value of the cdf of the binomial
distribution for all n ≤ 25, all x ≤ n and all these values of π.

  a) To save space, the values of π = .6, .7, .8, .9 could be omitted. Give a clear reason
     why F (x; n, π) could be computed for these values of π from the other values in
     the table.


  b) On the other hand, we could instead omit the values of x ≥ n/2. Show how the
     value of F (x; n, π) could be computed from the other values in the table for such
     omitted values of x.

(Hint: one person’s success is another person’s failure.)

4.7 The number of trials in the ESP experiment, 25, was arbitrary and perhaps too
small. Suppose that instead we use 100 trials.

  a) Suppose that the subject gets 30 right. What is the p-value of this test statistic?

  b) Suppose that the subject actually has a probability of .30 of guessing the card
     correctly. What is the probability that the subject will get at least 30 correct?

4.8 A basketball player claims to be a 90% free-throw shooter. Namely, she claims to
be able to make 90% of her free-throws. Should we doubt her claim if she makes 14
out of 20 in a session at practice? Set this problem up as a hypothesis testing problem
and answer the following questions.

  a) What are the null and alternate hypotheses?

  b) What is the p-value of the result 14?

  c) If the decision rule is to reject her claim if she makes 15 or fewer free-throws,
     what is the probability of a Type I error?

4.9 Nationally, 79% of students report that they have cheated on an exam at some
point in their college career. You can’t believe that the number is this high at your
own institution. Suppose that you take a random sample of size 50 from your student
body. Since 50 is so small compared to the size of the student body, you can treat this
sampling situation as sampling with replacement for the purposes of doing a statistical
test.

  a) Write an appropriate set of hypotheses to test the claim that 79% of students at
     your institution have cheated on an exam.

  b) Construct a decision rule so that the probability of a Type I error is less than

4.10 A random variable X has the triangular distribution if it has pdf

                                          { 2x   x ∈ [0, 1]
                               fX (x) =   {
                                          { 0    otherwise.

  a) Show that fX is indeed a pdf.


  b) Compute P(0 ≤ X ≤ 1/2).

  c) Find the number m such that P(0 ≤ X ≤ m) = 1/2. (It is natural to call m the
     median of the distribution.)

                   { k(x − 2)(x + 2)   −2 ≤ x ≤ 2
4.11 Let f (x) =   {
                   { 0                 otherwise.

  a) Determine the value of k that makes f a pdf. Let X be the corresponding random
     variable.

  b) Calculate P(X ≥ 0).

  c) Calculate P(X ≥ 1).

  d) Calculate P(−1 ≤ X ≤ 1).

4.12 Describe a random variable that is neither continuous nor discrete. Does your
random variable have a pmf? a pdf? a cdf?

4.13 Show that if f and g are pdfs and α ∈ [0, 1], then αf + (1 − α)g is also a pdf.
4.14 Suppose that a number of measurements that are made to 3 decimal digits accuracy
are each rounded to the nearest whole number. A good model for the “rounding
error” introduced by this process is that X ∼ Unif(−.5, .5) where X is the difference
between the true value of the measurement and the rounded value.

  a) Explain why this uniform distribution might be a good model for X.

  b) What is the probability that the rounding error has absolute value smaller than

4.15 If X ∼ Exp(λ), find the median of X. That is find the number m (in terms of λ)
such that P(X ≤ m) = 1/2.

4.16 A part in the shuttle has a lifetime that can be modeled by the exponential
distribution with parameter λ = 0.0001, where the units are hours. The shuttle mission
is scheduled for 200 hours.

  a) What is the probability that the part fails on the mission?

  b) The event that is described in part (a) is BAD. So the shuttle actually runs
     three of these systems in parallel. What is the probability that the mission ends
     without all three failing if they are functioning independently?

  c) Is the assumption of independence in the previous part a realistic one?

                                                                         4.8. Exercises

4.17 The lifetime of a certain brand of water heaters in years can be modeled by a
Weibull distribution with α = 2 and β = 25.

  a) What is the probability that the water heater fails within its warranty period of
     10 years?

  b) What is the probability that the water heater lasts longer than 30 years?

  c) Using a simulation, estimate the average life of one of these water heaters.

4.18 Prove Theorem 4.5.8.
4.19 Suppose that you have an urn containing 100 balls, some unknown number of
which are red and the rest are black. You choose 10 balls without replacement and find
that 4 of them are red.

  a) How many red balls do you think are in the urn? Give an argument using the
     idea of expected value.

  b) Suppose that there were only 20 red balls in the urn. How likely is it that a
     sample of 10 balls would have at least 4 red balls?

4.20 The file contains a dataset
that records the time in seconds between scores in a basketball game played between
Kalamazoo College and Calvin College on February 7, 2003.

  a) This waiting time data might be modeled by an exponential distribution. Make
     some sort of graphical representation of the data and use it to explain why the
     exponential distribution might be a good candidate for this data.

  b) If we use the exponential distribution to model this data, which λ should we use?
     (A good choice would be to make the sample mean equal to the expected value
     of the random variable.)

  c) Your model of part (b) makes a prediction about the proportion of times that the
     next score will be within 10, 20, 30 and 40 seconds of the previous score. Test
     that prediction against what actually happened in this game.

4.21 Show that it is not necessarily the case that E(t(X)) = t(E(X)).
4.22 Prove Lemma 4.6.6 in the case that X is continuous.
4.23 Let X be the random variable that results from tossing a fair six-sided die and
reading the result (1–6). Since E(X) = 3.5, the following game seems fair. I will pay
you 3.5² (that is, 12.25) and then we will roll the die and you will pay me the square
of the result. Is the game fair? Why or why not?

19:08 -- May 4, 2008                                                                435

4.24 Not every distribution has a mean! Define

                        f (x) = (1/π) · 1/(1 + x²) ,      −∞ < x < ∞ .

  a) Show that f is a density function. (The resulting distribution is called the Cauchy
     distribution.)

  b) Show that this distribution does not have a mean. (You will need to recall the
     notion of an improper integral.)

4.25 In this problem we compare sampling with replacement to sampling without
replacement. You will recall that the former is modeled by the binomial distribution
and the latter by the hypergeometric distribution. Consider the following setting.
There are 4,224 students at Calvin and we would like to know what they think about
abolishing the interim. We take a random sample of size 100 and ask the 100 students
whether or not they favor abolishing the interim. Suppose that 1,000 students favor
abolishing the interim and the other 3,224 misguidedly want to keep it.

  a) Suppose that we sample these 100 students with replacement. What is the mean
     and the variance of the random variable that counts the number of students in
     the sample that favor abolishing the interim?

  b) Now suppose that we sample these 100 students without replacement. What is
     the mean and the variance of the random variable that counts the number of
     students in the sample that favor abolishing the interim?

  c) Comment on the similarities and differences between the two. Give an intuitive
     reason for any difference.

4.26 Scores on IQ tests are scaled so that they have a normal distribution with mean
100 and standard deviation 15 (at least on the Stanford-Binet IQ Test).

  a) MENSA, a society supposedly for persons of high intellect, requires a score of
     130 on the Stanford-Binet IQ test for membership. What percentage of the
     population qualifies for MENSA?

  b) One psychology text labels those with IQs of between 80 and 115 as having “nor-
     mal intelligence.” What percentage of the population does this range contain?

  c) The top 25% of scores on an IQ test are in what range?

5. Inference - One Variable
In Chapter 2 we introduced random sampling as a way of making inferences about
populations. Recall the framework. We first identified a population and some pa-
rameters of that population about which we wanted to make inferences. We then
chose a sample, most often by simple random sampling, and computed statistics
from that sample to allow us to make statements about the parameters. Alas, these
statements were subject to sampling error. Armed now with the technology of the last
two chapters, we develop this framework further with a particular emphasis on under-
standing sampling error. We will focus especially on the problem of making inferences
about the mean of a population from that of a sample.

5.1. Statistics and Sampling Distributions
5.1.1. Samples as random variables
Suppose that we have a large population and a variable x defined on that population,
and we would like to estimate the mean of x on that population. We choose a simple
random sample x1 , . . . , xn and compute x̄. How is this sample mean related to the
population mean? In other words, what is likely to be the sampling error?
   Consider the first value of the sample, x1 . This value is the result of a random
variable, namely the random variable that results from choosing an individual from the
population at random and measuring or recording the value of the variable x. We call
that random variable X1 . Similarly, X2 is the process of choosing the second element
of the sample. And so forth. The result is a sequence of random variables X1 , . . . , Xn .
   Since we are now thinking of the data x1 , . . . , xn as the result of the random variables
X1 , . . . , Xn , the sample mean x̄ is the result of a random variable as well, namely

                                   X̄ = (X1 + · · · + Xn )/n .

Then X̄ is a random variable and so it also has a distribution. We’ll call the distribution
of X̄ the sampling distribution of the mean since it is a distribution that results
from sampling. The same kind of analysis can be done for any statistic. For example,
we will write SX² for the random variable that is the result of computing the sample
variance that results from X1 , . . . , Xn . This is indeed a random variable — different
possible samples may have different values of SX². As another example, the sample
median X̃ is a statistic and so it has a distribution as well.


  We would like to know the distribution of the random variables X̄ and SX² (as well
as the distribution of any other statistics that we might want to compute). Obviously,
these distributions depend on the distributions of X1 , . . . , Xn , which in turn depend
on the underlying population. Before investigating this problem analytically, let’s
investigate it via simulation.

5.1.2. Big Example
In general, we do not know the distribution of the variable in the population. In order
to illustrate what can happen in simple random sampling, we will do some simulation
in a situation where we actually have the entire population. The dataset we will use
is a dataset that contains information on every baseball game played in Major League
Baseball during the 2003 season. This population consists of 2430 games. The dataset
is available at For
our variable of interest, we will consider the number of runs scored by the visitors
in each game. In the population, the distribution of this variable is unimodal and
positively skewed as illustrated in Figure 5.1.

             Figure 5.1.: Runs scored by visitors in 2003 baseball games.

Some numerical characteristics of this population are as follows.

> games=read.csv(’’)
> vs=games$visscore
> summary(vs)
   Min. 1st Qu. Median     Mean 3rd Qu.    Max.
  0.000   2.000   4.000   4.656   7.000 19.000

  Suppose that we take samples of size 2 from this population. It is in fact possible
to generate all possible samples of size 2 and compute the mean of each such sample
using the function combn().

> vs2mean=combn(vs,2,mean)                         # applies mean to all combinations of 2 elements of vs
> summary(vs2mean)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.500   4.656   6.000  18.500

Note that the mean of the distribution of sample means of size 2 is the same as the
mean of the population. This should be expected. The histogram of the sample means
is in Figure 5.2.


                      Figure 5.2.: All means of samples of size 2.

   We note the following two features of the distribution of sample means of samples of
size 2: its spread is less than the spread of the population variable and its shape, while
still positively skewed, is less so.
   It is not realistic to generate the actual sampling distribution of X̄ for samples
larger than size 2. For example, there are about 10^14 samples of size 5. However,
simulation allows us to get a fairly good idea of what the distribution of X̄ looks like
for larger sample sizes. Consider first samples of size 5.
> vs5mean=replicate(10000, mean(sample(vs,5,replace=F)))
> summary(vs5mean)
   Min. 1st Qu. Median     Mean 3rd Qu.    Max.
  0.600   3.600    4.600  4.669   5.600 10.800

Comparing Figure 5.3 to Figure 5.2, we see that the distribution of the sample mean
in samples of size 5 appears to have less spread and to be more symmetric than that
of the distribution of sample means in samples of size 2.
   Now let’s consider samples of size 30. Again, simulating this situation by choosing
10,000 such samples, we have the following results.

> vs30mean=replicate(10000,mean(sample(vs,30,replace=F)))
> summary(vs30mean)
   Min. 1st Qu. Median     Mean 3rd Qu.    Max.
  2.433   4.267   4.633   4.658   5.033   7.300

With samples of size 30, we note that the distribution is now dramatically decreased in
spread. For example, the IQR is 0.76 (as compared to 2.0 for samples of size 5). This




                    Figure 5.3.: Means of 10,000 samples of size 5.


                    Figure 5.4.: Means of 10,000 samples of size 30.

says that if we use the sample mean of a sample of size 30 to estimate the population
mean (of 4.656), over 50% of the time we will be within 0.4 of the true value. Notice
too from Figure 5.4 that the distribution of X̄30 appears to be unimodal and quite
symmetric.

5.1.3. The Standard Framework
We are now conceiving of the simple random sample x1 , . . . , xn from a population as the
result of n random variables X1 , . . . , Xn . What can we say about the distributions of
these random variables? The first property is the Identically Distributed Property.

  Identically Distributed Property
  In simple random sampling, the random variables X1 , . . . , Xn all have the same
  distribution. In fact, the distribution of Xi is the same as the distribution of the
  variable x in the population.


   It is easy to see that this property is true in the case of simple random sampling.
Each xi is equally likely to be any of the individuals in the population. Therefore the
distribution of possible values of Xi is exactly the same as the distribution of actual
values of x in the population. For example, if the values of x are normally distributed
in the population, then Xi will have that same normal distribution.
   One important fact to note however is that the random variables Xi are not
independent of one another. In simple random sampling (which among other properties
is sampling without replacement) the outcome of X2 is dependent on that of X1 . This
will usually be an annoyance to us in trying to analyze the distribution of certain
statistics — independent random variables are easier to deal with. Therefore we will
simplify and often assume that the Xi are independent. In fact, if we sample with
replacement, this will be exactly true. And if the population is large, this will be
“almost” true — sampling without replacement behaves almost like sampling with re-
placement. One general rule of thumb is that if the sample is of size less than 10% of
the population, then it does not do much harm to treat sampling without replacement
in the same way as sampling with replacement. Therefore we will usually assume that
our sample random variables are independent.
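We can check this rule of thumb with a quick simulation (a sketch, not part of the text; the 0/1 population below borrows the numbers from the interim example of Exercise 4.25):

```r
# Compare the variability of the count of "successes" in samples of size 100
# taken with and without replacement from a population of 4224 individuals,
# 1000 of whom are coded 1.  The sample is well under 10% of the population,
# so the two variances should be nearly equal.
set.seed(1)
pop = rep(c(0, 1), c(3224, 1000))
with.r    = replicate(10000, sum(sample(pop, 100, replace = TRUE)))
without.r = replicate(10000, sum(sample(pop, 100, replace = FALSE)))
c(var(with.r), var(without.r))    # nearly equal
```

The without-replacement variance is slightly smaller, but by only a few percent.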

   The i.i.d. assumption.
   Random variables X1 , . . . , Xn are called i.i.d. if they are independent and identi-
   cally distributed. We will usually assume that the random variables X1 , . . . , Xn
   that arise from a simple random sample are i.i.d. (For this reason, we will call i.i.d.
   random variables X1 , . . . , Xn a random sample from X.)

  Given i.i.d. random variables X1 , . . . , Xn , we will refer to their (common) distribution
as the population distribution. With all this background, we expand the meaning of
our four important concepts.

   Population    any random variable X
   Parameter     a numerical property of X, (e.g., µX )
   Sample        i.i.d. random variables X1 , . . . , Xn with the same distribution as X
   Statistic     any function T = f (X1 , . . . , Xn ) of the sample

   While we have motivated this terminology by the very important problem of sampling
from a finite population, it is also useful for describing other situations. Suppose that
we have a random variable X which (since it is a random process) is repeatable under
essentially identical conditions. Suppose that the process is repeated n times. Then the
results of those n trials X1 , . . . , Xn are i.i.d. random variables and so fit the framework above.


5.2. The Sampling Distribution of the Mean
In this section we consider the problem of determining the sampling distribution of the
mean. Namely, we assume that X1 , . . . , Xn are i.i.d. random variables with population
random variable X and we want to explore the relationship between the distribution
of X and that of X̄. The fundamental tool in studying this problem is the following theorem.

Theorem 5.2.1. Suppose that Y and Z are random variables. Then

  1. If c is a constant, then E(cY ) = c E(Y ) and Var(cY ) = c² Var(Y ),

  2. E(Y + Z) = E(Y ) + E(Z), and

  3. if Y and Z are independent, then Var(Y + Z) = Var(Y ) + Var(Z).

   We will not prove this theorem. Part (1) is easy to prove (it’s a simple fact about
integrals or sums). Part (2) certainly fits our intuition. Part (3) is not obvious. While
there certainly should be some relationship between the variance of Y + Z and those of
Y and Z, the fact that variances are additive seems almost accidental. Notice that this
rule looks like a “Pythagorean Theorem” as it involves squares on both sides. From
this Theorem, we now have one of the most important tools of inferential statistics.

Theorem 5.2.2 (The distribution of the sample mean). Suppose that X1 , . . . , Xn are
i.i.d. random variables with population random variable X. Then

  1. E(X̄) = E(X), and

  2. Var(X̄) = Var(X)/n.

Proof. By Theorem 5.2.1, we have that

            E(X1 + · · · + Xn ) = E(X1 ) + · · · + E(Xn ) = n E(X).

Therefore

   E(X̄) = E( (X1 + · · · + Xn )/n ) = (1/n) E(X1 + · · · + Xn ) = (1/n) · n E(X) = E(X) .

Similarly, since the Xi are independent,

   Var(X̄) = Var( (X1 + · · · + Xn )/n ) = (1/n²) Var(X1 + · · · + Xn ) = (1/n²) · n Var(X) = Var(X)/n .


    Example 5.2.3. We know that a random variable X such that X ∼ Unif(0, 1) has
    mean 1/2 and variance 1/12. Suppose that we have a random sample X1 , . . . , X10
    with population random variable X. Then X̄10 has mean 1/2 and variance 1/120.
    This is not inconsistent with the simulation below.
     > means=replicate(10000,mean(runif(10,0,1)))
     > mean(means)
     [1] 0.4991267
     > var(means)
     [1] 0.008315763
     > 1/120
     [1] 0.008333333

   Theorem 5.2.2 gives us two crucial pieces of information concerning the distribution
of X̄. However, it does not tell us the shape of the distribution. In the example of Sec-
tion 5.1.2, we noted that as the size of the sample increased, the empirical distribution
of X̄ approached a more symmetrical distribution. This was not a property peculiar
to that example. The next theorem is so important, we might call it the Fundamental
Theorem of Statistics.

Theorem 5.2.4 (The Central Limit Theorem). Suppose that X is a random variable
with mean µ and variance σ². For every n, let X̄n denote the sample mean of i.i.d.
random variables X1 , . . . , Xn which have the same distribution as X. Then as n gets
large, the shape of the distribution of X̄n approaches that of a normal distribution. In
particular, for every a, b,

                  lim P( a ≤ (X̄n − µ)/(σ/√n) ≤ b ) = P(a ≤ Z ≤ b)
                  n→∞

where Z is a standard normal random variable.

  The Central Limit Theorem (CLT) is a limit theorem. As such, it only provides an
approximation. In using it, we will always be faced with the question of how large n
needs to be so that the approximation is “close enough” for our purposes. Nevertheless,
it will be a crucial tool in making inferences about µ.

    Example 5.2.5. Continuing Example 5.2.3, suppose again that X1 , . . . , X10 is a
    random sample from a population X ∼ Unif(0, 1). By the Central Limit Theo-
    rem, we have that X̄ is approximately normal with mean 1/2 and variance 1/120.
    Therefore we have the approximate probability statement

            P( 1/2 − √(1/120) ≤ X̄ ≤ 1/2 + √(1/120) ) = .68 .


      Again, we can compare this with the results of a simulation.
      > means=replicate(10000,mean(runif(10,0,1)))
      > sum( (1/2-sqrt(1/120))<means & means<(1/2+sqrt(1/120)) )
      [1] 6783
      > pnorm(1)-pnorm(-1)
      [1] 0.6826895

  We know even more in the special case that the population random variable X is
normally distributed.

Theorem 5.2.6. Suppose that X is normally distributed with mean µ and variance
σ². Let X1 , . . . , Xn be i.i.d. random variables with population random variable X.
Then X̄n has a normal distribution with mean µ and variance σ²/n.

      Example 5.2.7. The distribution of heights of 20 year old females in the United
      States in 2005 was very close to being normal with mean 163.3 cm and standard
      deviation 6.5 cm. If a random sample of 20 such females had been chosen, what
      is the probability that the mean of the sample was greater than 165 cm? Since
      the distribution of the sample mean of a sample of size 20 has mean 163.3 and
      standard deviation 6.5/√20 = 1.45, a sample mean of 165 has a z-score of (165 −
      163.3)/1.45 = 1.17. Since 1-pnorm(1.17)=.12, this probability is 12%.
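The same computation can be done in one line in R, since pnorm() accepts the mean and standard deviation of the sampling distribution directly (a cross-check of the arithmetic above):

```r
# P(sample mean > 165) for samples of size 20 from a Normal(163.3, 6.5) population
se = 6.5 / sqrt(20)                        # sd of the sampling distribution
p  = 1 - pnorm(165, mean = 163.3, sd = se)
p                                          # about 0.12
```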

5.3. Estimating Parameters
The results of the last section taken together tell us that x̄ provides a good estimate
of µX . In this section, we look at the problem of parameter estimation in general and
identify properties to look for in good estimators.
   Suppose that X is a random variable and that θ is a parameter associated with
X. Examples of such parameters include µX and σX . Let X1 , . . . , Xn be a random
sample with population random variable X. In that setting, we have the following
definition.

Definition 5.3.1 (estimator, estimate). An estimator of the parameter θ is any statis-
tic θ̂ = f (X1 , . . . , Xn ) used to estimate θ. The value of θ̂ for a particular outcome of
X1 , . . . , Xn is called the estimate of θ.

  Using the notation of the definition, X̄ should be written µ̂ and SX² should be written σ̂X².


5.3.1. Bias
Consider the following simple situation. We have one observation x from a random
variable X ∼ Binom(n, π) and we wish to estimate π. An absolutely natural choice
is to use x/n. In other words, π̂ = X/n. One way of justifying this choice is that
E(X/n) = π so “on average” this estimator gets it right. Consider another estimator,
proposed by Laplace. He suggested using π̂L = (X + 1)/(n + 2). Notice that if π > .5, this
estimator tends to underestimate π a bit by on average shading its estimate towards
0.5. Likewise, if π < 0.5, the estimate tends to be a little larger than π. In other words,
Laplace’s estimate has a bias.

Definition 5.3.2 (unbiased, bias). An estimator θ̂ of θ is unbiased if E(θ̂) = θ. The
bias of an estimator θ̂ is E(θ̂) − θ.

  It is important to note that θ is unknown and E(θ̂) depends on θ so that in general we
do not know the bias of an estimator. In the first example below, we look at examples
where we can determine that an estimator is unbiased. In the second example, we look
more carefully at the bias of Laplace’s estimator. In the third example, we look at
another biased estimator via a simulation.

    Example 5.3.3.
      1. Since E(X̄n ) = µX for all random variables X no matter what the sample size
         n, we have that X̄n is an unbiased estimator of µ.
      2. It can be shown that E(S²) = σ². Thus S² is an unbiased estimator of σ².
         This is the real reason for using n − 1 in the definition of S² rather than n. (It
         is important to note that it does not follow that S is an unbiased estimator
         of σ. Indeed, this is not true.)
      3. X/n is an unbiased estimator of π if X ∼ Binom(n, π).
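The computation behind item 2 is worth sketching, even though the text does not carry it out. Writing µ = E(X) and σ² = Var(X), and using E(Y²) = Var(Y) + E(Y)² twice:

```latex
\begin{aligned}
\sum_{i=1}^n (X_i - \bar X)^2 &= \sum_{i=1}^n X_i^2 - n\bar X^2, \\
E(X_i^2) &= \sigma^2 + \mu^2,
\qquad E(\bar X^2) = \operatorname{Var}(\bar X) + \mu^2 = \sigma^2/n + \mu^2, \\
E\Big[\sum_{i=1}^n (X_i - \bar X)^2\Big]
  &= n(\sigma^2 + \mu^2) - n(\sigma^2/n + \mu^2) = (n-1)\sigma^2,
\end{aligned}
```

so dividing by n − 1 gives E(S²) = σ².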

    Example 5.3.4. Consider Laplace’s estimator π̂L = (X + 1)/(n + 2). We have

       E(π̂L ) = E( (X + 1)/(n + 2) ) = (1/(n + 2)) E(X + 1) = (n/(n + 2)) π + 1/(n + 2) .

    Thus the bias of π̂L is

       E(π̂L ) − π = (n/(n + 2)) π + 1/(n + 2) − π = 1/(n + 2) − (2/(n + 2)) π = (1 − 2π)/(n + 2) .

    If π = .5 then this estimator is unbiased but the bias is negative if π > 0.5 and
    positive if π < 0.5.


      Example 5.3.5. Suppose that we have a random sample from a population X ∼
      Exp(λ). Since µX = 1/λ, we have that E(X̄) = 1/λ. Therefore a reasonable choice
      for an estimator of λ is λ̂ = 1/X̄. Notice that this estimator is not necessarily
      unbiased. We investigate with a simulation. We first consider random samples of
      size 5 and then random samples of size 20. We use λ = 10 in our simulation.
      > hatlambda5 = replicate(10000,1/mean(rexp(5,10)))
      > mean(hatlambda5)
      [1] 12.47850
      > hatlambda20 = replicate(10000,1/mean(rexp(20,10)))
      > mean(hatlambda20)
      [1] 10.51414
      Note that in both cases, our estimator appears to be biased and produces an
      overestimate on average.
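The simulation suggests a bias that shrinks as n grows. In fact the bias can be computed exactly (a side calculation, not needed for the sequel): the sum X1 + · · · + Xn of i.i.d. Exp(λ) random variables has a Gamma(n, λ) distribution, so

```latex
E\big(\hat\lambda\big) = E\!\left(\frac{n}{X_1+\cdots+X_n}\right)
  = n \int_0^\infty \frac{1}{s}\,\frac{\lambda^n s^{\,n-1} e^{-\lambda s}}{\Gamma(n)}\,ds
  = n\,\frac{\lambda\,\Gamma(n-1)}{\Gamma(n)} = \frac{n\lambda}{n-1}.
```

With λ = 10 this gives 12.5 when n = 5 and about 10.53 when n = 20, in good agreement with the simulated means.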
   The last example illustrates an important point. Even if θ̂ is an unbiased estimator
of θ, this does not mean that f (θ̂) is an unbiased estimator of f (θ).

5.3.2. Variance
An estimator is a random variable. In considering its bias, we are considering its mean.
But its variance is also important — an estimator with large variance is not likely to
produce an estimate close to the parameter it is trying to estimate.

Definition 5.3.6 (standard error). If θ̂ is an estimator for θ, the standard error of θ̂ is

                                  σθ̂ = √Var(θ̂) .

If we can estimate σθ̂ , we write sθ̂ for the estimate of σθ̂ .

      Example 5.3.7. Regardless of the population random variable X, we know that
      Var(X̄) = σX²/n. Thus σX̄ = σX /√n. To estimate this, it is natural to use

                                          sX̄ = sX /√n .
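In R, for a simulated (hypothetical) sample, the estimated standard error of the mean is one line:

```r
# Estimate the standard error of the mean for a hypothetical sample
# of 50 observations drawn from Unif(0,1).
set.seed(1)                       # for reproducibility
x  = runif(50)
se = sd(x) / sqrt(length(x))      # s_xbar = s_x / sqrt(n)
se
```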

      Example 5.3.8. If X ∼ Binom(n, π), we have that π̂ = X/n has variance Var(π̂) =
      π(1 − π)/n. Thus
                                   σπ̂ = √( π(1 − π)/n ) .

      A good estimator for σπ̂ can be found by using π̂ to estimate π. Thus

                                   sπ̂ = √( π̂(1 − π̂)/n ) .
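For instance, with invented numbers (37 “yes” answers in a hypothetical poll of n = 100):

```r
# Estimated proportion and its standard error for a hypothetical poll.
n = 100
x = 37
pihat = x / n
se = sqrt(pihat * (1 - pihat) / n)
c(pihat, se)    # 0.37 and about 0.048
```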

  An unbiased estimator with small variance is obviously the kind of estimator that we
seek. We note that the sample mean is always an unbiased estimator of the population
mean and the variance of the sample mean goes to 0 as the sample size gets large.

5.3.3. Mean Squared Error
Bias is bad and so is high variance. We put these two measures together into one in
this section.

Definition 5.3.9 (mean squared error). The mean squared error of an estimator θ̂ is

                               MSE(θ̂) = E[(θ̂ − θ)²] .

 The mean squared error measures how far away θ̂ is from θ on average where the
measure of distance is our now familiar one of squaring.

Proposition 5.3.10. For any estimator θ̂ of θ,

                            MSE(θ̂) = Var(θ̂) + Bias(θ̂)² .
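For the interested reader, a sketch of the computation (writing µ_θ̂ = E(θ̂) and expanding the square; the cross term vanishes because E(θ̂ − µ_θ̂) = 0):

```latex
\begin{aligned}
\mathrm{MSE}(\hat\theta)
  &= E\big[(\hat\theta - \mu_{\hat\theta} + \mu_{\hat\theta} - \theta)^2\big] \\
  &= E\big[(\hat\theta - \mu_{\hat\theta})^2\big]
     + 2(\mu_{\hat\theta} - \theta)\,E\big[\hat\theta - \mu_{\hat\theta}\big]
     + (\mu_{\hat\theta} - \theta)^2 \\
  &= \operatorname{Var}(\hat\theta) + 0 + \operatorname{Bias}(\hat\theta)^2 .
\end{aligned}
```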

   The full proof of Proposition 5.3.10 is a routine computation and we omit the details.
We illustrate the use of the MSE to compare the two estimators we have for the parameter
π of the binomial distribution. Again, π̂ denotes the usual unbiased estimator and
π̂L = (X + 1)/(n + 2) denotes the Laplace estimator. We have

                          Estimator          Bias            Variance

                              π̂               0            π(1 − π)/n

                              π̂L       (1 − 2π)/(n + 2)    π(1 − π)/(n + 4 + 4/n)

   It is obvious that π̂L has a smaller variance than π̂ (and it is clear why this should
be so). It is not immediately obvious from the expressions above which has the smaller
MSE. In fact, this depends on both π and n. In Figure 5.5, we plot the MSE of
both estimators for samples of size 10 and size 30 respectively. Note that the Laplace
estimator has smaller MSE for intermediate values of π while the unbiased estimator
has smaller MSE for extreme values of π. As we might expect, there is a greater
difference in the two estimators for smaller samples than for large samples.







              Figure 5.5.: MSE of two estimators of π, sample sizes n = 10 and n = 30.

5.4. Confidence Interval for Sample Mean
In this section, we introduce an important method for quantifying sampling error, the
confidence interval. First, we’ll look at a very special but important case.

5.4.1. Confidence Intervals for Normal Populations
Suppose that X1 , . . . , Xn is a random sample with population random variable X with
unknown mean µ and variance σ². Suppose too that the population random variable
X has a normal distribution. Using Theorem 5.2.6 and one of our favorite facts about
the standard normal distribution, we have

                 P( −1.96 < (X̄ − µ)/(σ/√n) < 1.96 ) = .95 .
   We now do some algebra to get

                 P( X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n ) = .95 .

   The interval
                 ( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n )
is a random interval. Now suppose that we know σ (an unlikely happenstance, we
admit). For any particular set of data x1 , . . . , xn the interval is simply a numerical
interval. The key fact is that we are fairly confident that this interval contains µ.

Definition 5.4.1 (confidence interval). Suppose that X1 , . . . , Xn is a random sample
from a normal distribution with known variance σ². Suppose that x1 , . . . , xn is the
observed sample. The interval

                 ( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n )


is called a 95% confidence interval for µ.

   Example 5.4.2. A machine creates rods that are to have a diameter of 23 mil-
   limeters. It is known that the distribution of the diameters of the parts is normal
   and that the standard deviation of the actual diameters of parts created over time
    is 0.1 mm. A random sample of 40 parts is measured precisely to determine if the
   machine is still producing rods of diameter 23 mm. The data and 95% confidence
   interval are given by
    > x
     [1] 22.958 23.179 23.049 22.863 23.098 23.011 22.958 23.186        23.015   23.089
    [11] 23.166 22.883 22.926 23.051 23.146 23.080 22.957 23.054        22.995   22.894
    [21] 23.040 23.057 22.985 22.827 23.172 23.039 23.029 22.889        23.019   23.073
    [31] 22.837 23.045 22.957 23.212 23.092 22.886 23.018 23.031        23.059   23.117
    > mean(x)
    [1] 23.024
    > c(mean(x)-(1.96)*.1/sqrt(40),mean(x)+(1.96)*.1/sqrt(40))
    [1] 22.993 23.055
   It appears that the process could still be producing rods of diameter 23 mm.
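The arithmetic of this known-σ interval is easy to reproduce outside of R. Below is a minimal sketch in Python (the course itself uses R); the function name and the sample values are made up for illustration, not taken from the rod data.

```python
import math

def z_interval(xs, sigma, z=1.96):
    """95% confidence interval for the mean when the population sd is known."""
    n = len(xs)
    xbar = sum(xs) / n
    half = z * sigma / math.sqrt(n)
    return (xbar - half, xbar + half)

# Hypothetical diameter measurements (not the 40 values from the example).
xs = [22.96, 23.18, 23.05, 22.86, 23.10]
lo, hi = z_interval(xs, sigma=0.1)
```

The interval is centered at the sample mean and its width depends only on σ, n, and the multiplier 1.96.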

   Of course the example illustrates a problem with using this notion of confidence
interval, namely that we need to know the standard deviation of the population. It is
unlikely that we would be in a situation where the mean of the population is unknown
but the standard deviation is known. One approach to solving this problem is to use
an estimate for σ, namely sX , the sample standard deviation. If the sample size is
quite large, we hope that sX is close to σ so that our confidence interval statement is
approximately correct. In the case of a normal population random variable X however,
we know more.

5.4.2. The t Distribution

Definition 5.4.3 (t distribution). A random variable T has a t distribution (with
parameter ν ≥ 1, called the degrees of freedom of the distribution) if it has pdf
              f (x) = ( Γ((ν + 1)/2) / (√(πν) Γ(ν/2)) ) · (1 + x²/ν)^(−(ν+1)/2) ,      −∞ < x < ∞

  Here Γ is the gamma function from mathematics but all we need to know about the
constant out front is that it exists to make the integral of the density function equal
to 1. Some properties of the t distribution include
  1. f is symmetric about x = 0 and unimodal. In fact f looks bell-shaped.

  2. If ν > 1 then the mean of T is 0.


    3. If ν > 2 then the variance of T is ν/(ν − 2).
    4. For large ν, T is approximately standard normal.
  In summary, the t distributions look very similar to the normal distribution except
that they have slightly more spread, especially for small values of ν. R knows the t-
distribution of course and the appropriate functions are dt(x,df), pt(), qt(), and
rt(). The graphs of the normal distribution and two t-distributions are shown below.

> x=seq(-3,3,.01)
> y=dt(x,3)
> z=dt(x,10)
> w=dnorm(x,0,1)
> plot(w~x,type="l",ylab="density")
> lines(y~x)
> lines(z~x)

[Figure: density curves of the standard normal distribution and the t distributions with 3 and 10 degrees of freedom, plotted on [−3, 3].]

   The important fact that relates the t distribution to the normal distribution is the
following theorem which is one of the most heavily used in statistics.

Theorem 5.4.4. If X1 , . . . , Xn is a random sample from a normal distribution with
mean µ and variance σ 2 , then the random variable
                                     (X̄ − µ) / (S/√n)
has a t distribution with n − 1 degrees of freedom.

  To generate confidence intervals using this theorem, first define tβ,ν to be the unique
number such that
                                  P (T > tβ,ν ) = β
where T is a random variable that has a t distribution with ν degrees of freedom. We
have the following:

    Confidence Interval for µ If x1 , . . . , xn are the observed values of a random
    sample from a normal distribution with unknown mean µ and t∗ = tα/2,n−1 , the
    interval
                              ( x̄ − t∗ s/√n , x̄ + t∗ s/√n )
    is a 100(1 − α)% confidence interval for µ.
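As a sketch of the same computation outside R (the data below are hypothetical; the constant 2.262 is t.025,9 , i.e. qt(.975,9) in R, quoted from tables rather than computed):

```python
import math
import statistics

def t_interval(xs, tstar):
    """Confidence interval xbar +/- tstar * s / sqrt(n)."""
    n = len(xs)
    xbar = statistics.mean(xs)
    s = statistics.stdev(xs)             # sample standard deviation
    half = tstar * s / math.sqrt(n)
    return (xbar - half, xbar + half)

# Ten hypothetical observations; t* = t_{.025,9} = 2.262 for a 95% interval.
xs = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1]
lo, hi = t_interval(xs, tstar=2.262)
```

Unlike the known-σ interval, the width here is itself random because it depends on s.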


      Example 5.4.5. It is plausible to think that the logs of populations of U.S. counties
   have a normal distribution. (We’ll talk about how to test that claim at a later point.)
      In the following example, we look at a sample of 10 such counties and produce
      a 95% confidence interval for the mean of the log-population. To produce a 95%
   confidence interval, we need t.025,9 , the 97.5% quantile of the t distribution with 9
   degrees of freedom.
      Notice that the true mean of our population random variable is 10.22 so in this
      case the confidence interval does capture the mean.
       > counties=read.csv(’’)
       > logpop=log(counties$Population)
       > smallsample=sample(logpop,10,replace=F)    # our sample of size 10
       > tstar = qt(.975,9)                        # 9 degrees of freedom
       > xbar= mean(smallsample)
       > s= sd(smallsample)
       > c( xbar-tstar* s/sqrt(10), xbar+tstar * s/sqrt(10))
       [1] 10.14891 12.01605

5.4.3. Interpreting Confidence Intervals
It is important to be very careful in making statements about what a confidence interval
means.
   In Example 5.4.5, we can say something like “we are 95% confident that the true
mean of the logs of population is in the interval (10.15, 12.02).” (This, at least, is what
many AP Statistics students are taught to say.) But beware:
        This is not a probability statement! That is, we do not say that the prob-
        ability that the true mean is in the interval (10.15, 12.02) is 95%. There is
        no probability after the experiment is done, only before.
     The correct probability statement is one that we make before the experiment.
        If we are to generate a 95% confidence interval for the mean of the popula-
        tion from a sample of size 10 from this population, then the probability is
        95% that the resulting confidence interval will contain the mean.
   Another way of saying this uses the relative frequency interpretation of probability.
        If we generate many 95% confidence intervals by this procedure, approxi-
        mately 95% of them will contain the mean of the population.
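This relative-frequency statement can itself be checked by simulation. Here is a sketch in Python (the course's own simulations use R): we draw samples from a normal population whose mean we know, compute each t interval, and count captures; the constant 2.262 is t.025,9 , and the population mean and sd are invented for the demonstration.

```python
import math
import random
import statistics

random.seed(1)
TSTAR = 2.262                 # t_{.025,9}, i.e. qt(.975, 9) in R
MU = 10                       # known population mean for the simulation
hits = 0
for _ in range(1000):
    xs = [random.gauss(MU, 2) for _ in range(10)]
    xbar = statistics.mean(xs)
    half = TSTAR * statistics.stdev(xs) / math.sqrt(10)
    hits += xbar - half < MU < xbar + half
coverage = hits / 1000
```

With a normal population the observed coverage should be very close to 95%.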
   After the experiment, a good way of saying what confidence means is this:
       Either the population mean is in (10.15, 12.02) or something very surprising
       has happened.


5.4.4. Variants on Confidence Intervals and Using R
Nothing is sacred about 95%. We could generate 90% confidence intervals or confidence
intervals of any other level. There might also be a reason for generating one-sided
confidence intervals, which could be done by eliminating one of the two tails of
the t-distribution in our computation. R will actually do all the computations for us.
We illustrate.

      Example 5.4.6. The file
      csv contains the results of all basketball games played in NCAA Division I on
      March 9, 2008. It might be a reasonable assumption that the visitor’s scores in
      Division I games have a normal distribution and that the games of March 9 ap-
      proximate a random sample. Proceeding on that assumption, we write a variety of
      different confidence intervals. Notice that the output of t.test() gives a variety
      of information beyond simply the confidence interval.

      > games=read.csv(’’)
      > names(games)
      [1] "Visitor" "Vscore" "Home"     "Hscore"
      > t.test(games$Vscore)

               One Sample t-test

      data: games$Vscore
      t = 35.7926, df = 38, p-value < 2.2e-16
      alternative hypothesis: true mean is not equal to 0
      95 percent confidence interval:
       59.38840 66.50903
      sample estimates:
      mean of x
      > t.test(games$Vscore,conf.level=.9)      # 90% confidence interval

               One Sample t-test

      data: games$Vscore
      t = 35.7926, df = 38, p-value < 2.2e-16
      alternative hypothesis: true mean is not equal to 0
      90 percent confidence interval:
       59.98362 65.91382
      sample estimates:
      mean of x
      > t.test(games$Vscore,conf.level=.9,alternative=’greater’)       # 90% one-sided interval

               One Sample t-test


     data: games$Vscore
     t = 35.7926, df = 38, p-value < 2.2e-16
     alternative hypothesis: true mean is greater than 0
     90 percent confidence interval:
      60.65496      Inf
     sample estimates:
     mean of x

5.5. Non-Normal Populations
In this section we consider the problem of generating confidence intervals for the mean
in the case that our population random variable does not have a normal distribution.
Of course it is not hard to find examples where this would be useful. Indeed, it is really
not often that we know our population is normal. Our advice in this section amounts
to the following: we can often use the same confidence intervals that we used when the
population is normal.

5.5.1. t Confidence Intervals are Robust
A statistical procedure is robust if it performs as advertised (at least approximately)
even if the underlying distributional assumptions are not satisfied. The important fact
about confidence intervals generated by the method of the last section is that they are
robust against violations of the normality assumption if the sample size is not small
and if the data does not have extreme outliers. To measure whether the t procedure
works, we have the following definition.

Definition 5.5.1. Suppose that I is a random interval used as a confidence interval for
θ. The coverage probability of I is P(θ ∈ I). (In other words, the coverage probability
is the true confidence level of the confidence intervals produced by I.)

  We would like 95% confidence intervals generated from the t distribution to have a
95% coverage probability even in the case that the normality assumption is not satisfied.
We first look at some examples.

    Example 5.5.2. We will use as our population the maximum wind velocity at the
    San Diego airport on 6,209 consecutive days. The true mean of this population is
    15.32. We generate 10,000 samples of each of size 10, 30 and 50.


      > w=read.csv(’’)
      > m=mean(w$Wind)

      #    samples of size 10

      > intervals= replicate(10000, t.test(sample(w$Wind,10,replace=F))$conf.int)
      > sum(intervals[1,]<m & intervals[2,]>m)
      [1] 9346

      # samples of size 30
      > intervals= replicate(10000, t.test(sample(w$Wind,30,replace=F))$conf.int)
      > sum(intervals[1,]<m & intervals[2,]>m)
      [1] 9427

      # samples of size 50
      > intervals= replicate(10000, t.test(sample(w$Wind,50,replace=F))$conf.int)
      > sum(intervals[1,]<m & intervals[2,]>m)
      [1] 9441
      We find that we do not quite achieve our desired goal of 95% confidence intervals,
      though for samples of size 50 the coverage probability is approximately 94.4%.

      Example 5.5.3. Suppose that X ∼ Exp(0.2) so that µX = 5. We generate 10,000
      different random samples of size 10 from this distribution and compute the 95%
      confidence interval given by the t-distribution in each case. We note that we do
      not have exceptional success - only 89.2% of the 95% confidence intervals contain
      the mean.
      > # samples of size 10 from an exponential distribution with mean 5
      > # t.test()$conf.int recovers just the confidence interval
      > intervals = replicate(10000, t.test(rexp(10,.2))$conf.int)
      > # now count the intervals that capture the mean
      > sum (intervals[1,]<5 & intervals[2,]>5)
      [1] 8918
      With random samples of size 30, we do better and with samples of size 50 better
      yet. However in no case do we achieve the 95% coverage probability that we desire.
      The exponential distribution is quite asymmetric.
      #   samples of size 30

      > intervals = replicate(10000, t.test(rexp(30,.2))$conf.int)
      > sum (intervals[1,]<5 & intervals[2,]>5)
      [1] 9297


     #   samples of size 50

      > intervals = replicate(10000, t.test(rexp(50,.2))$conf.int)
      > sum (intervals[1,]<5 & intervals[2,]>5)
     [1] 9348

   In neither of the last two examples did we achieve our objective of 95% confidence
intervals containing the mean 95% of the time. The next example uses the Weibull
distribution with parameters that make it fairly symmetric.

    Example 5.5.4. The Weibull distribution with parameters α = 5 and β = 10 has
    mean 9.181687. We generate samples of size 10, 30 and 50. Note that we have
    achieved almost exactly 95% confidence intervals regardless of the sample size.
     > m=9.181687         # mean of Weibull distribution with parameters 5, 10

      > intervals = replicate(10000, t.test(rweibull(10,5,10))$conf.int)
     > sum (intervals[1,]<m & intervals[2,]>m)
     [1] 9502

      > intervals = replicate(10000, t.test(rweibull(30,5,10))$conf.int)
     > sum (intervals[1,]<m & intervals[2,]>m)
     [1] 9499

      > intervals = replicate(10000, t.test(rweibull(50,5,10))$conf.int)
      > sum (intervals[1,]<m & intervals[2,]>m)
     [1] 9496

5.5.2. Why are t Confidence Intervals Robust?
Let’s consider generating a 95% confidence interval from 30 data points x1 , . . . , x30 .
The t-confidence interval in this case is

                              ( x̄ − 2.05 s/√30 , x̄ + 2.05 s/√30 ) .                       (5.1)
The magic number 2.05 of course is just t.025,29 .
  Let’s approach the problem of generating a confidence interval from a different di-
rection. Namely let’s use the Central Limit Theorem. The CLT says that the random
variable
                                     (X̄ − µ) / (σ/√n)


has a distribution that is approximately standard normal (if we believe that n = 30 is
large). We therefore have the following approximate probability statement:

                      P ( −1.96 < (X̄ − µ)/(σ/√n) < 1.96 ) ≈ .95 .

This leads to the approximate 95% confidence interval
                              ( x̄ − 1.96 σ/√30 , x̄ + 1.96 σ/√30 ) .                      (5.2)
The problem with this interval (besides the fact that it is only approximate) is that σ
is not known. Now for a reasonably large sample size, we might expect that the value
s of the sample standard deviation is close to σ. If we replace σ in 5.2 by s, we have
the interval
                              ( x̄ − 1.96 s/√30 , x̄ + 1.96 s/√30 ) .
Now we see that the only difference between this interval (which involves two approx-
imations) and the interval of Equation 5.1 that results from the t-distribution is the
difference between the numbers 1.96 and 2.05. It is easy to give an argument for using
a larger number than 1.96 — using 2.05 helps compensate for the fact that we are
making several approximations in constructing the interval by expanding the width of
the interval slightly.
   Of course we should note that the t intervals do not perform equally well regard-
less of the population. The performance of this method depends on the shape of the
distribution (symmetric, unimodal is best) and the sample size (the larger the better).

5.6. Confidence Interval for Proportion
To estimate the proportion of individuals in a population with a certain property, we
often choose a random sample and use as an estimate the proportion of individuals in
the sample with that property. This is the methodology of political polls, for example.
While this random process is best modeled by the hypergeometric distribution, we
normally use the binomial distribution instead if the size of the population is large
relative to the size of the sample.
   So then, assume that we have a binomial random variable X ∼ Binom(n, π) where
as usual n is known but π is not. Then of course the obvious estimator for π is π̂ = X/n
and it is an unbiased estimator of π. Of course we would also like to write a confidence
interval for π so that we know the precision of our estimate. Because X is discrete,
there is no good way to write exact confidence intervals for π, but the Central Limit
Theorem allows us to write an approximate confidence interval that is really quite good.
The key is understanding the relationship between the binomial distribution and the
Central Limit Theorem.


Theorem 5.6.1. Suppose that X ∼ Binom(n, π). Then if n is large, the random
variable
                              (X/n − π) / √(π(1 − π)/n)

has a distribution that is approximately standard normal.

Proof. Let the individual trials of the random process X be denoted X1 , . . . , Xn . This
sequence is i.i.d. In fact Xi ∼ Binom(1, π). Obviously µXi = π and σ²Xi = π(1 − π) for
each i and X = X1 + · · · + Xn . We apply the CLT to the sequence X1 , . . . , Xn . The
random variable X/n is the sample mean for this i.i.d. sequence and so has mean π and
variance π(1 − π)/n. The result follows.
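The quality of this normal approximation can be checked directly, since binomial probabilities can be computed exactly. A sketch in Python (the values n = 400, π = 0.3, and the cutoff are chosen arbitrarily for illustration):

```python
import math

n, pi = 400, 0.3
k = 128                        # compare P(X <= 128), i.e. P(pihat <= 0.32)

# Exact binomial probability, summed term by term
exact = sum(math.comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k + 1))

# Approximation from the theorem: standardize pihat and use the normal cdf
z = (k / n - pi) / math.sqrt(pi * (1 - pi) / n)
approx = 0.5 * (1 + math.erf(z / math.sqrt(2)))
gap = abs(exact - approx)
```

For samples of this size the two probabilities agree to within a couple of percentage points.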

  The Theorem suggests how to find an (approximate) confidence interval. For a fixed
β, let zβ be the number such that P(Z > zβ ) = β where Z is the standard normal
random variable. Then we have the following approximate equality from the CLT.

                 P ( −zα/2 < (π̂ − π)/√(π(1 − π)/n) < zα/2 ) ≈ 1 − α                    (5.3)
   Equation 5.3 is the starting point for several different approximate confidence inter-
vals. As we did for confidence intervals for µ, we should attempt to use Equation 5.3 to
isolate π in the “middle” of the inequalities. The first two steps are

            P ( −zα/2 √(π(1 − π)/n) < π̂ − π < zα/2 √(π(1 − π)/n) ) ≈ 1 − α,

  and thus

            P ( π̂ − zα/2 √(π(1 − π)/n) < π < π̂ + zα/2 √(π(1 − π)/n) ) ≈ 1 − α .          (5.4)

   The problem with 5.4 is that the unknown π appears not only in the middle of the
inequalities but also in the bounds. Thus we do not yet have a true confidence interval
since the endpoints are not statistics that we can compute from the data.

The Wald interval.
The Wald interval results from replacing π by π̂ in the endpoints of the interval of 5.4.

            ( π̂ − zα/2 √(π̂(1 − π̂)/n) , π̂ + zα/2 √(π̂(1 − π̂)/n) )


   Until recently, this was the standard confidence interval suggested in most elementary
statistics textbooks if the sample size is large enough. (In fact this interval still receives
credit on the AP Statistics Test.) Books varied as to what large enough meant. A
typical piece of advice is to only use this interval if nπ̂(1 − π̂) ≥ 10. However, you
should never use this interval.
   The coverage probability of the (approximately) 95% Wald confidence intervals is
almost always less than 95% and could be quite a bit less depending on π and the
sample size. For example, if π = .2, it takes a sample size of 118 to guarantee that the
coverage probability of the Wald confidence interval is at least 93%. For very small
probabilities, it takes thousands of observations to ensure that the coverage probability
of the Wald interval approaches 95%.
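For reference (despite the warning above), the Wald computation is a one-liner. A sketch in Python with made-up counts (the course itself uses R):

```python
import math

def wald_interval(x, n, z=1.96):
    """Wald interval: pihat +/- z * sqrt(pihat * (1 - pihat) / n)."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

# 7 successes in 10 trials, the same counts reused later in this section
lo, hi = wald_interval(7, 10)
```

The interval is always centered exactly at π̂, which is part of why its coverage is poor for small n.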

The Wilson Interval.
At least since 1927, a much better interval than the Wald interval has been known
although it wasn’t always appreciated how much better the Wilson interval is. The
Wilson interval is derived by solving the inequality in 5.3 so that π is isolated in the
middle. After some algebra and the quadratic formula, we get the following (impressive
looking) approximate confidence interval statement (writing z for zα/2 ):

   P (  (π̂ + z²/(2n) − z √(π̂(1−π̂)/n + z²/(4n²))) / (1 + z²/n)  <  π
              <  (π̂ + z²/(2n) + z √(π̂(1−π̂)/n + z²/(4n²))) / (1 + z²/n)  )  ≈  1 − α

   The Wilson interval performs much better than the Wald interval. If nπ̂(1 − π̂) ≥ 10,
you can be reasonably certain that the coverage probability of the 95% Wilson interval
is at least 93%. The Wilson interval is computed by R in the function prop.test().
The option correct=F needs to be used however. (The option correct=T makes a
“continuity” correction that comes from the fact that binomial data is discrete. It is
not recommended for the Wilson interval, however.)
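The Wilson endpoints can also be computed by hand from the boxed formula. A sketch in Python (using z = 1.96 rather than the slightly more precise qnorm(.975) that R uses internally):

```python
import math

def wilson_interval(x, n, z=1.96):
    """Wilson interval, as computed by prop.test(x, n, correct=F) in R."""
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

# 7 successes in 10 trials
lo, hi = wilson_interval(7, 10)
```

For x = 7 and n = 10 this reproduces R's prop.test(7,10,correct=F) interval to about four decimal places.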

      Example 5.6.2. In a poll taken in Mississippi on March 7, 2008, of 354 voters
      who were decided between Obama and Clinton, 190 said that they would vote for
      Obama in the Mississippi primary. We can estimate the proportion of voters in the
      population that will vote for Obama (of those who were decided on one of these
      two candidates) using the Wilson method.
      > prop.test(190,354,correct=F)

                  1-sample proportions test without continuity correction

      data:   190 out of 354, null probability 0.5


     X-squared = 1.9096, df = 1, p-value = 0.167
     alternative hypothesis: true p is not equal to 0.5
     95 percent confidence interval:
      0.4846622 0.5879957
     sample estimates:

     We see that π̂ = .537 and that a 95% confidence interval for π is (.485, .588). This
     is often reported by the media as 53.7% ± 5.1% with no mention of the fact that a
     95% confidence interval is being used. (Note that the center of the interval is not
     π̂ but in this case does agree with π̂ to three decimal digits.)

  Notice that the center of the Wilson interval is not π̂ . It is

            ( π̂ + (zα/2 )²/(2n) ) / ( 1 + (zα/2 )²/n )  =  ( x + (zα/2 )²/2 ) / ( n + (zα/2 )² ) .

A way to think about this is that the center of the interval comes from adding (zα/2 )²
trials and (zα/2 )²/2 successes to the observed data. (For a 95% confidence interval, this
is very close to adding two successes and four trials.) This is the basis for the next
interval.

The Agresti-Coull Interval.
Agresti and Coull (1998) suggest combining the biased estimator of π that is used in the
Wilson interval together with the simpler estimate for the standard error that comes
from the Wald interval. In particular, if we are looking for a 100(1 − α)% confidence
interval and x is the number of successes observed in n trials, define
                      x̃ = x + (zα/2 )²/2          ñ = n + (zα/2 )²
Then, with π̃ = x̃/ñ, the Agresti-Coull interval is

                 ( π̃ − zα/2 √(π̃(1 − π̃)/ñ) , π̃ + zα/2 √(π̃(1 − π̃)/ñ) )

   In practice, this interval performs even better than the Wilson interval and is now widely
recommended, even in basic statistics textbooks. For the particular example of x = 7
and n = 10, the Wilson and Agresti-Coull intervals are compared below. Note that
the Agresti-Coull interval is somewhat wider than the Wilson interval. Of course wider
intervals are more likely to capture the mean.

#   The Wilson interval
> prop.test(7,10,correct=F)


          1-sample proportions test without continuity correction

data: 7 out of 10, null probability 0.5
X-squared = 1.6, df = 1, p-value = 0.2059
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3967781 0.8922087
sample estimates:

#     The Agresti-Coull Interval

> xtilde=9
> ntilde=14
> z=qnorm(.975)
> pitilde=xtilde/ntilde
> se= sqrt ( pitilde * (1-pitilde)/ntilde )
> c( pitilde - z* se, pitilde + z * se)
[1] 0.3918637 0.8938505

  In elementary statistics books, the Agresti-Coull interval is often presented as the
“Plus 4” interval and the instructions for computing it are simply to add four trials
and two successes and then to compute the Wald interval.
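The whole recipe fits in a few lines. A sketch in Python, using the exact (zα/2)² rather than the rounded "plus 4" (the text's R computation uses the rounded counts x̃ = 9, ñ = 14):

```python
import math

def agresti_coull(x, n, z=1.96):
    """Agresti-Coull interval: a Wald interval after augmenting the data."""
    xt = x + z * z / 2            # add z^2/2 successes
    nt = n + z * z                # add z^2 trials
    pt = xt / nt
    half = z * math.sqrt(pt * (1 - pt) / nt)
    return (pt - half, pt + half)

lo, hi = agresti_coull(7, 10)     # compare the "plus 4" numbers in the text
```

The result differs from the text's rounded version only in the third decimal place.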

5.7. The Bootstrap
Throughout this chapter we have been developing methods for making inferences about
the unknown value of the parameter θ associated with a “population” random variable.
In general, to estimate θ we need good answers to two questions:
  1. What estimator θ̂ of θ should we use?
  2. How accurate is the estimator θ̂?

  For the case that θ is the population mean, we have a rich theory that answers these
questions. We answered the questions by knowing two things:

    1. the distribution of the population random variable (e.g., normal, binomial), and

  2. how the sampling distribution of the estimator depends on the distribution
     of the population.

  If we know the distribution of the population random variable but not how the
sampling distribution of the estimator depends on it, we can often do simulation to get


an idea of the sampling distribution of the estimator. Indeed, that is what we did in
Section 5.1. In this section we look at the bootstrap, a “computer-intensive” technique
for addressing these questions if we know neither of these two facts.
   We will illustrate the bootstrap with the following example dataset. (This dataset is
found in the package boot which you would probably need to load from the internet.)
The data are the times to failure for the air-conditioning unit of a certain Boeing 720
aircraft.
> aircondit$hours
 [1]   3   5   7  18  43  85  91  98 100 130 230 487
> mean(aircondit$hours)
[1] 108.0833

   Suppose that we want to estimate the MTF — mean time to failure of such air-
conditioning units. Our estimate is 108 hours, but we would like an estimate of the
precision of this estimate, e.g., a confidence interval. While the simple advice of Sec-
tion 5.5 is to use the t-distribution, this is not really a good strategy as the dataset is
quite small and the distribution of the data is quite skewed. Furthermore, the small
size of the dataset does not suggest to us a particular distribution for the population
(although engineers might naturally turn to some Weibull distribution).
   The idea of the bootstrap is to generate lots of different samples from the population
(as we did in Section 5.1). However, without any assumptions about the shape of the
distribution of the population, the bootstrap uses the data itself to approximate that
shape. In this case, we have that 1/12 of our sample has the value 3, 1/12 of the sample
has the value 5, etc. Therefore, we will model the population by assuming that 1/12
of the population has the value 3, 1/12 of the population has the value 5, etc! Now to
take a random sample of size 12 from such a population, we need only take a sample
of size 12 from our original data with replacement. The idea of the bootstrap is to
take many such samples and compute the value of the estimator for each sample and
thereby get an approximation to the sampling distribution of the estimator.
   Here are the steps to computing a bootstrap confidence interval for the mean of
our air-conditioning failure time population. The following R command chooses 1,000
different random samples of size 12 from our original random sample, with replacement,
and computes the mean of each sample.
> means = replicate (1000, mean(sample(aircondit$hours,12,replace=T)))

   These 1,000 means are our approximation of what would happen if we took 1,000
samples from the population of air-conditioning failure times. A histogram of these
1,000 means is in Figure 5.6. We now convert these 1,000 means to a confidence
interval by using the quantile() function.
> quantile(means,c(0.025,0.975))
     2.5%     97.5%
 45.16042 190.33750
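The same percentile computation can be sketched in Python (the text itself uses R); the failure times are the twelve values listed above, and the simple index-based quantiles below are a rough stand-in for R's quantile():

```python
import random
import statistics

hours = [3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487]   # aircondit$hours

random.seed(0)
means = sorted(
    statistics.mean(random.choices(hours, k=12))   # resample with replacement
    for _ in range(1000)
)
# Rough 2.5% and 97.5% quantiles of the 1,000 bootstrap means
lo, hi = means[24], means[974]
```

The exact endpoints vary from run to run, but they land near the (45.16, 190.34) interval computed in R.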




[Histogram omitted. Horizontal axis: 0 to 300; vertical axis: Percent of Total.]

Figure 5.6.: 1,000 sample means of bootstrapped samples of air-conditioning failure
times

It is reasonable to announce that the 95% confidence interval for µ is (45.16, 190.34).
   The bootstrap method that we illustrated above (called the bootstrap percentile
confidence interval) is quite general. There was nothing special about the fact that
we were constructing a confidence interval for the mean. Indeed, we could use the
very same method to construct a confidence interval for any parameter, as long as we
have a reasonable estimator for the parameter. (For parameters other than the mean,
there are more sophisticated bootstrap methods that account for the fact that many
estimators are biased.) We illustrate with one more example.

    Example 5.7.1. The dataset city in the boot package consists of a random sample
    of 10 of the 196 largest cities of 1930. The variables are u which is the population
    (in 1,000s) in 1920 and x which is the population in 1930. The population is the
    196 cities and we would like to know the value of θ = Σx/Σu, the ratio of
    increase of population in these cities from 1920 to 1930. The obvious estimator is
    θ̂ = Σx/Σu for the sample. We construct our bootstrap confidence interval for θ.
      > library(boot)
      > city
           u   x
      1 138 143
      2   93 104
      3   61 69
      4 179 260
      5   48 75
      6   37 63
      7   29 50
      8   23 48
      9   30 111


    10   2 50
    > thetahat=sum(city$x)/sum(city$u)              # estimate from sample
    > thetahat
    [1] 1.520312

    > thetahats = replicate ( 1000, { i=sample((1:10),10,replace=T) ;
    +                             us=city[i,]$u ; xs=city[i,]$x ;
    +                             sum(xs)/sum(us) } )

    > quantile(thetahats, c(0.025,0.975))           # bootstrap confidence interval
        2.5%    97.5%
    1.250343 2.127813

   Notice that the confidence interval is very wide. This is only to be expected from
   such a small sample.
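A Python sketch of the same pairs bootstrap (the data are the ten (u, x) pairs printed above; the index-based quantiles are a rough stand-in for quantile() in R):

```python
import random

u = [138, 93, 61, 179, 48, 37, 29, 23, 30, 2]     # city$u (1920 population)
x = [143, 104, 69, 260, 75, 63, 50, 48, 111, 50]  # city$x (1930 population)

thetahat = sum(x) / sum(u)

random.seed(0)
boots = []
for _ in range(1000):
    idx = random.choices(range(10), k=10)          # resample whole cities
    boots.append(sum(x[i] for i in idx) / sum(u[i] for i in idx))
boots.sort()
lo, hi = boots[24], boots[974]                     # rough percentile interval
```

Note that each resample keeps the (u, x) pairs together, just as the R code resamples row indices.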

5.8. Testing Hypotheses About the Mean
In this section, we review the logic of hypothesis testing in the context of testing
hypotheses about the mean. While the language of hypothesis testing is still quite
common in the literature, it is fair to say that confidence intervals are a superior way
to quantify inferences about the mean. The language of hypothesis testing is perhaps
most useful when one needs to make a decision about the parameter in question. We
first look at an example of a situation in which a decision rule is necessary.

   Example 5.8.1. Kellogg’s makes Raisin Bran and fills boxes that are labelled
   11 oz. NIST mandates testing protocols to ensure that this claim is accurate.
   Suppose that a shipment of 250 boxes, called the inspection lot, is to be tested.
   The mandated procedure is to take a random sample of 12 boxes from this shipment.
    If any box is more than 1/2 ounce underweight, then the lot is declared defective.
    Otherwise, the sample mean x̄ and the sample standard deviation s are computed.
    The shipment is rejected if (x̄ − 11)/s ≤ −0.635.

 We can view Example 5.8.1 as implementing a hypothesis test. Recall the technology.
There are four steps as described in Section 4.3.

  1. Identify the hypotheses.

  2. Collect data and compute a test statistic.

  3. Compute a p-value.

  4. Draw a conclusion.

19:08 -- May 4, 2008                                                               527
5. Inference - One Variable

  We go through these four steps in the case that our hypotheses are about the pop-
ulation mean µ, using the Kellogg’s example as an illustration. We will suppose that
X1 , . . . , Xn is a random sample from a normal distribution with unknown mean µ and
that we wish to make inferences about µ.

Identify the Hypotheses
We start with a null hypothesis, H0 , the default or “status quo” hypothesis. We want
to use the data to determine whether there is substantial evidence against it. The
alternate hypothesis, Ha , is the hypothesis that we want to put forward as true
if we have sufficient evidence in its favor. So in the Raisin Bran example, our pair of
hypotheses are
                                     H0 :   µ = 11
                                     Ha :   µ < 11 .
  In general, our hypotheses for a test of means are one of the following three pairs:
                  H0 :   µ = µ0        H0 :   µ = µ0        H0 :   µ = µ0
                  Ha :   µ < µ0        Ha :   µ > µ0        Ha :   µ ≠ µ0
where µ0 is some fixed number.
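
In R, these three pairs of hypotheses correspond to the three possible values of the
alternative argument of t.test(). A minimal sketch (the data vector x and the null
value µ0 = 5 here are invented for illustration):

```r
# Invented data for illustration; mu0 = 5 is an arbitrary null value.
x <- c(4.8, 5.2, 4.9, 5.1, 4.7, 5.0)

t.test(x, mu = 5, alternative = "less")       # Ha: mu < 5
t.test(x, mu = 5, alternative = "greater")    # Ha: mu > 5
t.test(x, mu = 5, alternative = "two.sided")  # Ha: mu != 5 (the default)
```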

Collect data and compute a test statistic
We will use the following test statistic:

                                  T = (X̄ − µ0)/(S/√n).
   The important fact about this statistic is that if H0 is true then the distribution
of T is known. (It is a t distribution with n − 1 degrees of freedom.) This is the key
property that we need whenever we do a hypothesis test: we must have a test statistic
whose distribution we know if H0 is true.
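
As a quick check on this formula, we can compute T by hand and compare it with the
statistic that t.test() reports (the data values below are invented for illustration):

```r
# Invented sample for illustration
x <- c(10.9, 11.1, 10.8, 11.0, 10.9, 10.7)
mu0 <- 11

tstat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
tstat                            # about -1.73
t.test(x, mu = mu0)$statistic    # the same value, computed by t.test()
```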

Compute a p-value
Recall that the p-value of the test statistic t is the probability that we would see a
value at least as extreme as t (in the direction of the alternate hypothesis) if the null
hypothesis were true. The R function t.test() computes the p-value if the argument
alternative is appropriately set. Let’s look at some possible Raisin Bran data.

> raisinbran
 [1] 11.01 10.91 10.94 11.01 10.97 11.01 10.95 10.93 10.92 10.83 11.02 10.84
> t.test(raisinbran,alternative="less",mu=11)

          One Sample t-test

data: raisinbran
t = -2.9689, df = 11, p-value = 0.006385
alternative hypothesis: true mean is less than 11
95 percent confidence interval:
     -Inf 10.97827
sample estimates:
mean of x
   10.945

  In this example, the p-value is 0.006. This means that if the null hypothesis (of
µ = 11) were true, we would expect to get a value of the test statistic at least as extreme
as the value we computed from the data (-2.9689) 0.6% of the time. This would be an
extremely rare occurrence, so this is strong evidence against the null hypothesis.
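
The p-value that t.test() reports is just a tail probability of the t distribution, so
we can also obtain it directly from pt():

```r
# Left-tail probability of a t distribution with 11 degrees of freedom,
# evaluated at the observed test statistic:
pt(-2.9689, df = 11)    # approximately 0.0064, matching t.test()
```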

Draw a conclusion
It is often enough to present the result of a hypothesis test by stating the p-value.
What to do with that evidence is not really a statistical problem. It is sometimes
necessary to go further, however, and announce a decision. That is the case in the
Raisin Bran example where it is necessary to decide whether to reject the shipment as
being underweight.
   In this case, we set up the hypothesis test in terms of a decision rule. The possible
decisions are either to reject the null hypothesis (and accept the alternate hypothesis)
or not to reject the null hypothesis. The decision rule is expressed in terms of the test
statistic. In order to determine what the decision rule should be, we need to examine
the errors in making an incorrect decision.
   Recall the kinds of errors that we might make:

  1. A Type I error is the error of rejecting H0 even though it is true. The probability
     of a Type I error is denoted by α.

  2. A Type II error is the error of not rejecting H0 even though it is false. The
     probability of a Type II error is denoted by β.

To construct a decision rule, we choose α, the probability of a Type I error. This
number α is often called the significance level of the test.
  In this case, testing H0 : µ = µ0 versus Ha : µ < µ0 , our decision rule should be:

                         Reject H0 if and only if t < −tα,n−1 .

  It is easy to see that this decision rule rejects a true H0 with probability α.
While the R example above does not explicitly make a decision, the p-value of the test
statistic gives us enough information to determine what the decision should be. Namely,
if the p-value is less than α, we reject the null hypothesis. Otherwise we do not.
In the Kellogg’s example above, we obviously reject the null hypothesis.
   We can now understand the test that NIST prescribes in Example 5.8.1. The NIST
manual says that “this method gives acceptable lots a 97.5% chance of passing.” In
other words, NIST is prescribing that α = 0.025. For such an α, our test should be to
reject H0 if

         (x̄ − 11)/(s/√12) < −t0.025,11      or, equivalently,      (x̄ − 11)/s < −t0.025,11/√12 = −0.635
which is exactly the requirement of the NIST test. Of course this NIST method
implicitly relies on the assumption that the distribution of weights in the lot is
normal. We really should be cautious about using the t-distribution for a non-normal
population with a sample size of 12, although the t-test is robust.

Type II Errors
The four step procedure above focuses on α, the probability of a Type I error. Usually,
the consequences of a Type I error are much more severe than those of making a Type II
error and it is for this reason that we set α to be a small number. But if our procedures
were only about minimizing Type I errors, we would never reject H0 since this would
make the probability of a Type I error 0!
  Of course the probability of a Type II error depends on the distribution of

                                  T = (X̄ − 11)/(S/√12)

when µ ≠ 11. This distribution depends on the true mean µ, the standard deviation σ
(neither of which we know), and the sample size. R will compute this probability for us
if we specify these values. The probability of a Type II error is denoted by β, and the
number 1 − β is called the power of the hypothesis test. (Higher power is better.)
The R function power.t.test computes the power given the following arguments:

      delta         the deviation of the true mean from the null hypothesis mean
      sd            the true standard deviation
      n             the sample size
      sig.level     α
      type          "one.sample" (this is a one-sample test)
      alternative   "one.sided" (we tested a one-sided alternative)

  In the Raisin Bran example, if the true value of the mean is 10.9 and the standard
deviation is 0.1, then the power of the test is 88.3%. In other words, we will reject
a shipment that on average is one standard deviation underweight 88.3% of the time
using this test.


> power.t.test(delta=.1,sd=.1,n=12,sig.level=.025,type='one.sample',
+ alternative='one.sided')

      One-sample t test power calculation

               n   =   12
           delta   =   0.1
              sd   =   0.1
       sig.level   =   0.025
           power   =   0.8828915
     alternative   =   one.sided
Obviously, the test should have greater power when the true mean is further
from 11.
> diff=seq(0,.1,.01)
> power.t.test(delta=diff,sd=.1,n=12,sig.level=.025,type='one.sample',
+ alternative='one.sided')

      One-sample t test power calculation

               n = 12
           delta = 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10
              sd = 0.1
       sig.level = 0.025
           power = 0.02500000, 0.05024502, 0.09249152, 0.15643493,
                   0.24401839, 0.35263574, 0.47466264, 0.59891866,
                   0.71365697, 0.80978484, 0.88289152
     alternative = one.sided
   Many users of hypothesis testing technology do not think very carefully about Type
II errors before doing the experiment and so often construct tests that are not very
powerful. For example, if we think that it is important to reject shipments that
average more than half a standard deviation underweight, we find that the sample
size of 12 used above gives a power of only 35%. We really should increase the sample
size in this case (and we know this even before we collect data).
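
If we omit n and specify the power we want instead, power.t.test() will solve for the
required sample size. A sketch for the half-standard-deviation case (the 90% power
target here is our own choice, not part of the NIST protocol):

```r
# Give power instead of n; power.t.test() solves for whichever
# of its parameters is left unspecified.
power.t.test(delta = 0.05, sd = 0.1, power = 0.90,
             sig.level = 0.025, type = "one.sample",
             alternative = "one.sided")
```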

5.9. Exercises

5.1 In this problem and the next, we investigate the use of the sample mean to estimate
the mean population of a U.S. county. We use the dataset at http://www.calvin.
  a) What is the average population of a U.S. county? (Answer: 89596.)
  b) Generate 10,000 samples of size 5 and compute the mean of the population of
     each sample. In how many of these 10,000 samples was the sample mean greater
     than the population mean? Why so many?


  c) Repeat part (b) but this time use samples of size 30. Compare the result to that
     of part (b).

  d) For the 10,000 samples of size 30 in part (c), what is the IQR of the sample
     means?

  e) Explain why using the sample mean for a sample of size 30 is likely to give a
     fairly poor estimate of the average population of a county.

5.2 Recall from Chapter 1 that reexpressing the population of counties by taking
logarithms produced a symmetric unimodal distribution. (See Figure 1.3.) Let’s now
repeat the work of the last problem using this transformed data.

  a) What is the mean of the log of population for all counties? (Answer: 10.22)

  b) Generate 10,000 samples of size 5 and compute the mean of the log-population
     for each of the samples. In how many of these samples was the sample mean
     greater than the population mean?

  c) Repeat part (b) but this time use samples of size 30.

  d) For the 10,000 samples in part (c), what is the IQR of the sample means?

  e) How useful is a sample of size 30 for estimating the mean log-population?

5.3 Suppose that X ∼ Binom(n, π) and that Y ∼ Binom(m, π). Also suppose that X
and Y are independent.

  a) Give a convincing reason why Z = X + Y should have a binomial distribution
     (with parameters n + m and π).

  b) Show that the mean and variance of Z as computed by Theorem 5.2.1 from those
     of X and Y are the same as computed directly from the fact that Z is binomial
     with parameters n + m and π.

5.4 In this problem, you are to investigate the accuracy of the approximation of the
Central Limit Theorem for the exponential distribution. Suppose that X ∼ Exp(0.1)
and that a random sample of size 20 is chosen from this population.

  a) What is the mean and variance of X?

  b) What is the mean and variance of X̄?

  c) Using the Central Limit Theorem approximation, compute the probability that
     X̄ is within 1, 2, 3, 4, and 5 of µX .


  d) Now choose 1,000 random samples of size 20 from this distribution. Count the
     number of samples in which x̄ is within 1, 2, 3, 4, and 5 of µX and compare to part
     (c). Comment.

5.5 Scores on the SAT test were redefined (recentered) in 1990 and were set to have a
mean of 500 and a standard deviation of 110 on each of the Mathematics and Verbal
Tests. The scores were constructed so that the population had a normal distribution
(or at least very close to normal). In a random sample of size 100 from this population,

  a) What is the probability that the sample mean will be between 490 and 510?

  b) What is the probability that the sample mean will exceed 500? 510? 520?

5.6 Continuing Problem 5.5, the total SAT score for each student is formed by adding
their verbal score V and their math score M .

  a) If the two scores for an individual are independent of each other, what is the
     mean and standard deviation of V + M ?

  b) It is not likely that the verbal and mathematics scores of individuals in the popula-
     tion behave like independent random variables. Do you expect that the standard
     deviation of V + M is more or less than you computed in part (a)? Why?

5.7 Which is wider, a 90% confidence interval or a 95% confidence interval generated
from the same random sample from a normal population?

5.8 Suppose that the standard deviation σ of a normal population is known. How
large a random sample must be chosen so that a 95% confidence interval will be of
form x̄ ± 0.1σ?

5.9 The dataset found at contains the body temperature and heart rate of 130 adults.
(“What's Normal? – Temperature, Gender, and Heart Rate,” Journal of Statistics
Education, Shoemaker 1996.)

  a) Assuming that the body temperatures of adults in the population are
     approximately normal and that the 130 adults sampled behave like a simple
     random sample, write a 95% confidence interval for the mean body temperature
     of an adult.

  b) Comment on the result in (a).

  c) Is there anything in the data that would lead you to believe that the normality
     assumption is incorrect?


5.10 The R dataset morley contains the speed of light measurements for 100 different
experimental runs. The vector Speed contains the measurements (in some obscure
units).

  a) If we think of these 100 measurements as repeated independent trials of a ran-
     dom variable X, what is a good description of the population of which these
     measurements are a sample?

  b) Write a 95% confidence interval for the mean of this population.

  c) What is the value tβ,n−1 for the confidence interval generated in the previous
     part?

  d) Is there anything in the histogram of the data values that suggests that the t
     procedure might not be a good one for generating a confidence interval in this
     case?
5.11 Write 95% confidence intervals for the mean of the sepal length of each of the
three species of irises in the R dataset iris. Would you say that these confidence
intervals give strong evidence that the means of the sepal lengths of these species are
different?

5.12 The dataset contains
data collected about each student in our class. Our class is not a random sample of
Calvin students but suppose that we consider it so.

  a) Write a 90% confidence interval for mean number of hours of sleep that a Calvin
     student got the night before the first day of classes. (The variable named Sleep
     records that for the sample.)

  b) From the data, is there anything about the data on hours slept that concerns you
     in using the t-distribution to generate the confidence interval in (a)?

  c) Write a 90% confidence interval for the average amount of cash that students
     carried on that first day of class.

  d) Is there anything in the data that concerns you about using the t-distribution to
     generate the interval in (c)?

5.13 Suppose that 4 circuit boards out of 100 tested are defective. Generate 95%
confidence intervals for the proportion of the population of boards that is defective.
Give each of the Wald, Wilson and Agresti-Coull intervals.

5.14 The Chicago Cubs (a major league baseball team) won 11 games and lost 5 games
in their season series against the St. Louis Cardinals last year. Write a 90% confidence
interval for the proportion of the games that the Cubs would win if they played many
games against the Cardinals. Comment on the assumptions you are making about the
process of playing baseball games.
5.15 In a taste test, 30 Calvin students prefer Andrea’s Pizza and 19 prefer Papa
John’s. If the sample of students could reasonably be considered a random sample
of Calvin students, write a 95% confidence interval for the proportion of students who
prefer Andrea’s Pizza.
5.16 It is common to use a sample size of 1,000 when doing a political poll. It is also
common to use the Wald interval to report the results of such polls. What is the widest
that a 95% confidence interval for a proportion could be with this sample size?

6. Producing Data – Experiments
In many datasets we have more than one variable and we wish to describe and explain
the relationships between them. Often, we would like to establish a cause-and-effect
relationship between the variables.

6.1. Observational Studies
The American Music Conference is an organization that promotes music education at
all levels. On their website they
promote music education as having all sorts of benefits. For example, they quote a
study performed at the University of Sarasota in which “middle school and high school
students who participated in instrumental music scored significantly higher than their
non-band peers in standardized tests”. Does this mean that if the availability of and
participation in instrumental programs in a school is increased, standardized test scores
would generally increase? The American Music Conference is at least suggesting that
this is true. They are attempting to “explain” the variation in test scores by the
variation in music participation. The problem with that conclusion is that there might
be other factors that cause the higher test scores of the band students. For example,
students who play in bands are more likely to come from schools with more financial
resources. They are also more likely to be in families that are actively involved in their
education. It might be that music participation and higher test scores are a result of
these variables. Such variables are often called lurking variables. A lurking variable
is any variable that is not measured or accounted for but that has a significant effect
on the relationship of the variables in the study.
   The Sarasota study described above is an observational study. In such a study,
the researcher simply observes the values of the relevant variables on the individuals
studied. But as we saw above, an observational study can never definitively establish a
causal relationship between two variables. This problem typically bedevils the analysis
of data concerning health and medical treatment. The long process of establishing the
relationship between smoking and lung cancer is a classic example. In 1957, the Joint
Report of the Study Group on Smoking and Health concluded (in Science, vol. 125,
pages 1129–1133) that smoking is an important health hazard because it causes an
increased risk for lung cancer. However for many years after that the tobacco industry
denied this claim. One of their principal arguments is that the data indicating this
relationship came from observational studies. (Indeed, the data in the Joint Report
came from 16 independent observational studies.) For example, the report documented
that one out of every ten males who smoked at least two packs a day died of lung cancer,
but only one out of every 275 males who did not smoke died of lung cancer. Data such
as this falls short of establishing a cause-and-effect relationship however as there might
be other variables that increase both one’s disposition to smoke and susceptibility to
lung cancer.
   Observational studies are useful for identifying possible relationships and also sim-
ply for describing relationships that exist. But they can never establish that there is a
causal relationship between variables. Using observational studies in this way is anal-
ogous to using convenience samples to make inferences about a population. There are
some observational studies that are better than others, however. The music study
described above is a retrospective study. That is, the researchers identified the
subjects and then recorded information about past music behavior and grades. A prospective
study is one in which the researcher identifies the subjects and then records variables
over a period of time. A prospective study usually has a greater chance of identify-
ing relevant possible “lurking” variables so as to rule them out as explanations for a
possible relationship.
   One of the most ambitious and scientifically important prospective observational
studies has been the Framingham Heart Study. In 1948, researchers identified a sample
of 5,209 adults in the town of Framingham, Massachusetts (a town about 25 miles west
of Boston). The researchers tracked the lifestyle choices and medical records of these
individuals for the rest of their lives. In fact the study continues to this day with the
1,110 individuals who are still living. The researchers have also added to the study 5,100
children of original study participants. There is no question that the Framingham Heart
Study has led to a much greater understanding of what causes heart disease although it
is “only” an observational study. For example, it is this study that gave researchers the
first convincing data that smoking can cause high blood pressure. The website of the
study gives a wealth of information
about the study and about cardiovascular health.

6.2. Randomized Comparative Experiments
If an observational study falls short of establishing a causal relationship and even an
expensive well-designed prospective observational study cannot identify all possible
lurking variables, can we ever prove such a relationship?
   The “gold standard” for establishing a cause and effect relationship between two
variables is the randomized comparative experiment. In an experiment, we want
to study the relationship between two or more variables. At least one variable is an
explanatory variable and the value of the variable can be controlled or manipulated.
At least one variable is a response variable. The experimenter has access to a certain
set of experimental units (subjects, individuals, cases), sets various values of the
explanatory variables to create a treatment, and records the values of the response
variables.
   It is important first of all that an experiment be comparative. If we are attempting
to establish that music participation increases grades, we cannot simply look at par-
ticipators. We need to compare the achievement level of participators to those who do
not participate. Many educational studies fall short of this standard. A school might
introduce a new curriculum in mathematics and measure the test scores of the students
at the end of the year. However the school cannot make the case that the test scores
are a result of the new curriculum — the students might have achieved the same level
with any curriculum.
   In a randomized experiment we assign the individuals to the various treatments
at random. For example, if we took 100 fifth graders and randomly chose 50 of them
to be in the band and 50 of them not to receive any music instruction, we could
begin to believe that differences in their test scores could be explained by the different

    Example 6.2.1. Patients undergoing certain kinds of eye surgery are likely to
    experience serious post-operative pain. Researchers were interested in the question
    of whether giving acetaminophen to the patients before they experienced any pain
    would substantially reduce the subsequent pain and the further need for analgesics.
    One group received acetaminophen before the surgery but no pain medicine after
    the surgery. A second group received no pain medicine before the surgery and
    acetaminophen after the surgery. And the third group received no acetaminophen
    either before or after the surgery. Sixty subjects were used and 20 subjects were
    assigned at random to each group. (Soltani, Hashemi, and Babaei, Journal of
    Research in Medical Sciences, March and April 2007; vol. 12, No 2.)

   In Example 6.2.1, the goal of random assignment is to construct groups that are
likely to be representative of the whole pool of subjects. If the assignment were left
to the surgeons, for example, it might be the case that surgeons would give more pain
medication to certain types of patients and therefore we wouldn’t be able to attribute
the different results to the different treatments.

    Example 6.2.2. The R dataset chickwts gives the weights of chicks who were fed
    six different diets over a period of time. The experimenter was attempting to deter-
    mine which chicken feed caused the greatest weight gain. Feed is the explanatory
    variable and there were six treatments (six different feeds). Weight is the response
    variable. The first step in designing such an experiment is to assign baby chicks
    at random to the six different feed groups. If we allow the experimenter to choose
    which chicks receive which feed, she might unconsciously (or consciously) construct
    treatment groups that are unequal to start.
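
Since chickwts ships with R, the variation in weight across the six feed groups (as in
Figure 6.1 below) can be drawn with a single command:

```r
# One box plot of weight per feed group
boxplot(weight ~ feed, data = chickwts)
```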

  Student (W.S. Gosset) was one of the researchers in the early part of the twentieth
century who realized the importance of randomization. One of his influential papers
analyzed a large-scale study that was to compare the nutritional effects of pasteurized
and unpasteurized milk. In the Spring of 1930, 20,000 school children participated in the
study. Of these, 5,000 received pasteurized milk each day, 5,000 received unpasteurized
milk, and 10,000 did not receive milk at all. The weight and height of each student was
recorded both before and after the trial. Student analyzed the way in which students
were assigned to the three experimental treatments. There were 67 schools involved and
in each school about half the students were in the control group and half received milk.
However each school received only one kind of milk, pasteurized or unpasteurized. This
was the first sort of bias that Student found — he was not convinced that the schools
that received pasteurized milk were comparable to those that received unpasteurized
milk. A more important difficulty was the way in which students were assigned either
to the control or milk group within a school. The students were assigned at random
initially, but teachers were given freedom to adjust the assignments if it seemed to
them that the two groups were not comparable to each other in weight and height.
In fact Student showed that this freedom on the part of teachers to assign subjects
to groups resulted in a systematic difference between the groups in initial weight and
height. The control groups were taller and heavier on average than those in the milk
groups. Student conjectured that teachers unconsciously favored giving milk to the
more undernourished students.
   Of course assigning subjects to treatments at random does not ensure that the ex-
perimental groups are alike in all relevant ways. Just as we were subjected to sampling
error when choosing a random sample from a population, we can have variation in the
groups due to the chance mechanism alone. But assigning subjects at random will
allow us to make probabilistic statements about the likelihood of such error just as we
were able to make confidence intervals for parameters based on our analysis of sampling
error that might arise in random sampling.

Randomized assignment and random samples
We assign subjects to treatments at random so that the various treatment groups will
be similar with respect to the variables that we do not control. That is, we would
like the experimental groups to be representative of the whole group of subjects. In
surveys (Chapter 2), we choose a random sample from a population for a similar reason.
We hope that the random sample is representative of a larger population. Ideally, we
would like both kinds of randomness in our experiments. Not only do we ensure that
the subjects are assigned at random to treatments, but we would like the subjects to be
chosen at random from a larger population. If this is true, we could more easily justify
generalizing our experimental results to a larger population than the immediate subject
pool. However that is almost never the case. In the pain study of Example 6.2.1, the
subjects were simply all those persons who were operated on at a given clinic in a
given period of time. This issue is particularly important if we try to generalize the
conclusions of an experiment to a larger population.


   Example 6.2.3. The author of this text participated in a study to investigate how
   people make probabilistic judgments in situations for which they do not have much
   data. (Default Probabilities, Osherson, Smith, Stob, and Wilkie, Cognitive Sci-
   ence, (15), 1991, 251–270.) Subjects were placed in various experimental groups at
   random. However the subjects were not chosen at random from any particular pop-
   ulation. Indeed every subject was an undergraduate in an introductory psychology
   course at the University of Michigan or Massachusetts Institute of Technology. It is
   difficult to make an argument that the results of the paper would generalize to the
   population of all undergraduates in the United States let alone to the population
   of all adults. The MIT students in particular seemed to have a different set of
   strategies for dealing with probabilistic arguments.

Other features of a good experiment
In our analysis of simple random sampling from a population, we saw again and again
the importance of large samples in getting precise estimates of our parameters. Anal-
ogously, if we are to measure precisely the effect of a treatment, we would like many
individuals in each treatment group. This principle is known as replication. With a
small number of individuals, it might be difficult to determine whether the differences
in response are due only to the treatments or whether they reflect the natural variation
in individuals. The chickwts data illustrate the issue. Figure 6.1 plots the weights of
the six different treatment groups of chicks. While there is definitely some variation


        [Box plots of chick weight for each of the six feeds: casein, horsebean,
        linseed, meatmeal, soybean, sunflower]

    Figure 6.1.: Weights of six different treatment groups of a total of 71 chicks.

between the groups, there is also considerable variation within each group. Chicks fed
meatmeal, for example, have weights spanning most of the range of the entire
experimental group. It is probably the case that the small difference between the linseed
and soybean groups is due to the particular chicks in the groups rather than due to the
feed. More chickens in each group would help us resolve this issue however.
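A plot to the same effect, along with group summaries, can be produced from the chickwts data, which is built into R (boxplots here, rather than the raw points of Figure 6.1):

```r
# chickwts is built into R: weight (in grams) and feed for 71 chicks.
# Side-by-side boxplots show the variation between and within feeds:
boxplot(weight ~ feed, data = chickwts)

# Group sizes and mean weight for each feed:
table(chickwts$feed)
tapply(chickwts$weight, chickwts$feed, mean)
```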

19:08 -- May 4, 2008                                                                                  605
6. Producing Data – Experiments

   In most good experiments one of the treatments is a control. A control is generally
a baseline or status quo treatment. In an educational experiment, the control group
might receive the standard curriculum while another group
is receiving the supposed improved curriculum. In a medical experiment, the control
group might receive the generally accepted treatment (or no treatment at all if ethical)
while another group receives a new drug. In Example 6.2.1, the group that received no
pre-pain medication is referred to as the control group. The goal of a control group is
to establish a baseline to which to compare the new or changed treatment.
   Often the control is a placebo. A placebo is a “treatment” that is really no treatment
at all but looks like a treatment from the point of view of the subject. In Example 6.2.1,
all subjects received pills both before and after surgery. But some of these pills con-
tained no acetaminophen and were inert. Placebos are given to ensure that the placebo
effect is measurable. The placebo effect is the tendency for experimental subjects to
be affected by the treatment even if it has no content. The need for control groups and
placebos is highlighted by the next famous example.

      Example 6.2.4. During the period 1927-1932, researchers conducted a large-scale
      study of industrial efficiency at the Hawthorne Plant of the Western Electric Com-
      pany in Cicero, IL. The researchers were interested in how physical and environ-
      mental features (e.g., lighting) affected worker productivity and satisfaction. Re-
      searchers found that no matter what the experimental conditions were, productiv-
      ity tended to improve. Workers participating in the experiment tended to work
      harder and better to satisfy those persons who were experimenting on them. This
      feature of human experimentation — that the experimentation itself changes be-
      havior whatever the treatment — is now called the Hawthorne Effect. (It is
now generally accepted that the extent of the Hawthorne Effect in the original
experiments has been significantly overstated by the gazillions of undergraduate
      psychology textbooks that refer to it. But the name remains and it makes a nice
      story as well as a plausible cautionary tale!)

   Another feature which helps to ensure that the differences in treatments are due
to the treatments themselves is blinding. An experiment is blind if the subjects do
not know which treatment group they are in. In Example 6.2.1, no subject knew
whether they were receiving acetaminophen or a placebo. It is plausible that a sub-
ject knowing they receive a placebo would have a different (subjective) estimate of
pain than one who thought that they might be receiving acetaminophen. An experi-
ment is double-blind if the person administering the treatment also does not know
which treatment is being administered. This prevents the researcher from treating the
groups differently. It is not always possible or ethical to make an experiment blind or
double-blind. But when possible, blinding helps to ensure that the differences between
treatments are due to the treatments, which is always the goal in experimentation.


                          A                            C        A        B

                          B                            B        C        A

                          C                            A        B        C

              Figure 6.2.: Two experimental designs for three fertilizers.

6.3. Blocking
If the experimental subjects are identical, it does not matter which is assigned to which
treatment. The differences in the response variable are likely to be the result of the
differences in treatment. Usually, however, the subjects are not identical, or at least
cannot be treated identically. So we would like to know that the differences in the
response variable are due to the differences in the explanatory variable and not any
systematic differences in subjects. Randomization is one tool that we use to distribute
such differences equally across the treatments. In some cases however, our experimental
units are not identical or our experiment itself introduces a systematic difference in the
units that is due to something other than the treatment variable. This leads to the
notion of blocking which we illustrate with a classic example.
   R.A. Fisher was one of the key early figures in developing the principles of good
experimental design. He did much of this while working at Rothamsted Experimental
Station on agricultural experiments. He studied closely data from experiments that
were attempting to establish such things as the effects of fertilizer on yield. Suppose
that we have three unimaginatively named fertilizers A, B, C. We could divide the
plot of land that we are using as in the first diagram of Figure 6.2. But it might be
the case that the further north in the plot, the better the soil conditions. In that case,
the variation in yield might be better explained (or at least partially explained) by the
location of the plot rather than by fertilizer. In this example, we would say that the
effects of northernness and fertilizer are confounded, meaning simply that we cannot
separate them given the data of the experiment at hand. To separate out the effect of
northernness from that of fertilizer, we could instead divide the patch using the second
diagram in Figure 6.2. Of course, there still might be variations in the soil conditions
across the three fertilizers. But we would at least be able to measure the effect of
northernness separately from that of fertilizer. In this example, “northernness” is a
blocking variable and our goal is to isolate the variability attributable to northernness
so that we can see the differences between the fertilizers more clearly.
   In a medical experiment it is often the case that gender or age are used as blocking


variables. Obviously, we cannot assign individuals to the various levels of these variables
at random but it is plausible that in certain circumstances gender or age can have a
significant effect on the response. If so, it would be useful to design an experiment that
allows us to separate out the effects of, say, gender and the treatment.
   When using a blocking variable, it is important to continue to honor the principle
of randomization. Suppose for example that we use gender as a blocking variable in a
medical experiment comparing two treatments. The ideal experimental design would
be to take a group of females and assign them at random to the two treatments and
similarly for the group of males. That is, we should randomize the treatments within
the blocks. The resulting experiment is usually called a randomized block design.
It is not completely randomized because subjects in one block cannot be assigned to
another but within a block it is randomized.
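Randomizing within blocks is straightforward to carry out in R. A minimal sketch, with a hypothetical group of ten females and ten males assigned to two treatments A and B (the subjects and group sizes are invented for illustration):

```r
set.seed(1)   # for reproducibility
# Hypothetical subjects: ten females and ten males, blocked on gender.
gender <- rep(c("F", "M"), each = 10)
treatment <- character(length(gender))
# Within each block, assign half the subjects to each treatment at random:
for (g in unique(gender)) {
  idx <- which(gender == g)
  treatment[idx] <- sample(rep(c("A", "B"), length.out = length(idx)))
}
table(gender, treatment)   # five subjects in every gender-by-treatment cell
```

Because the assignment is balanced within each block, every gender-by-treatment cell ends up with exactly five subjects no matter how the random draws fall.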
   It is instructive to compare the randomized block design to stratified random sam-
pling. In each case, we divide subjects into groups and randomize within these groups.
The goal is to isolate and measure the variability that is due to the groups so that we
can measure the variability that remains.
   A special case of blocking is known as a matched pair design. In such an experi-
ment, there are just two observations in each block (one for each of two treatments).
In his 1908 paper, Student analyzed earlier published data from such an experiment.
That data is in the R dataframe sleep. The two different treatments were two different
soporifics (sleeping drugs). There was no control treatment. The response variable was
the number of extra hours of sleep gained by the subject over his “normal” sleep. There
were just 10 subjects and each subject took both drugs (on different nights). Thus each
subject was a block and there was one observation on each treatment in each block.
Student then compared the difference in the two drugs on each patient. Using the
individuals as blocks served to help Student to decide what part of the variation in
the response could be explained by the normal variation between individuals and what
could be attributed to the drugs themselves.
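Student's data can be examined directly, since sleep is built into R; rows 1 through 10 hold the ten subjects under the first drug and rows 11 through 20 hold the same subjects, in the same order, under the second:

```r
# sleep is built into R: extra hours of sleep for ten subjects,
# each measured under both drugs (group 1 and group 2).
with(sleep, tapply(extra, group, mean))   # group means 0.75 and 2.33

# Per-subject differences, pairing row i with row i + 10:
d <- sleep$extra[sleep$group == 2] - sleep$extra[sleep$group == 1]
mean(d)   # average gain of the second drug over the first: 1.58
```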
   In educational experiments, matched pairs are often constructed by finding two stu-
dents who are very similar in baseline academic performance. Then it is hoped that
the differences between these students at the end of the experiment are the result of
the different treatments.
   It is important to remember that block designs are not an alternative to random-
ization. Indeed, it is very important that we randomize the assignment to treatments
within every block for the same reasons that randomization is important when we have
no blocking variable. Identifying blocking variables is simply acknowledging that there
are variables on which the treatments may systematically differ.

6.4. Experimental Design
In the above sections, we have introduced the three key features of a good experimental
design — randomization, replication, blocking. We’ve illustrated these principles in the


case that we have just one explanatory variable with just a few levels. These principles
can be extended to situations with more than one explanatory variable however. In
this book, we will not investigate the problem of inference for such situations or discuss
in detail the issues of experimental design in these cases. In this section, we look
at one example of extending these principles to experiments involving more than one
explanatory variable.

    Example 6.4.1. The R dataframe ToothGrowth contains the results of an exper-
    iment performed on guinea pigs to determine the effect of Vitamin C on tooth
    growth. There were two treatment variables, the dose of Vitamin C, and the deliv-
    ery method of the Vitamin C. The dose variable had three levels (.5, 1, and 2 mg)
    and the delivery method was by either orange juice or ascorbic acid. There were
    10 guinea pigs given each of the six treatments. The plot below (using coplot())
    shows the differences between the two delivery methods and the various dose levels.
     [Coplot of the ToothGrowth data: tooth length versus dose, in one panel
      for each type of supplement.]

    It appears that both the delivery method and the dose have some effect on tooth
    growth.
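A plot of this kind, along with the treatment counts, can be produced from the built-in ToothGrowth data:

```r
# ToothGrowth is built into R: len (tooth length), supp (delivery method,
# OJ or VC), and dose (0.5, 1, or 2 mg) for 60 guinea pigs.
coplot(len ~ dose | supp, data = ToothGrowth)

# Ten animals in each of the six dose-by-delivery treatments:
table(ToothGrowth$supp, ToothGrowth$dose)
```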

  Both the principles of randomization and replication extend to experiments with
more than one explanatory variable. In Example 6.4.1 for example, it is apparent that
the 60 guinea pigs should have been assigned at random to the six different treatments.
And it also is clear that there should have been enough guinea pigs in each treatment
so that the natural variation from pig to pig can be accounted for.
  No blocking variables are described in the tooth growth study but it is often the case
that natural blocking variables can be identified. For example, in the tooth growth
study, it might not have been possible for the same technician to have recorded all the
measurements. In that case, it would not be a good idea for one technician to make


all the measurements for the orange juice treatment while another technician makes
all the measurements for the ascorbic acid treatment. The blocking variable would
be the technician and we would attempt to randomize assignment within each treatment.
Since there were 10 guinea pigs in each of the 6 treatments, two technicians could each
measure 5 guinea pigs in each treatment.

7. Inference – Two Variables
Is there a relationship between two variables? If there is a relationship, is it causal?

7.1. Two Categorical Variables
7.1.1. The Data
Suppose that the data consist of a number of observations on which we observe two
categorical variables. We normally present such data in a (two-way) contingency table.

    Example 7.1.1. In 1973, the rate of acceptance to graduate school at the Uni-
    versity of California at Berkeley was lower for females than males. (See the R
    dataset UCBAdmissions.) Here 4,526 individuals are classified according to these
    two variables in the following contingency table.
     > xtabs(Freq~Gender+Admit,data=UCBAdmissions)
             Admit
     Gender   Admitted Rejected
       Male       1198     1493
       Female      557     1278

  We introduce some notation to aid in our discussion and analysis of such situations.

               I          the number of rows
               J          the number of columns
               nij        the integer entry in the ith row and jth column
               ni.        the sum of the entries in the ith row
               n.j        the sum of the entries in the jth column
               n = n..    the total of all entries in the table

  We’ll also usually call the row variable R and the column variable C. Dots in
subscripts are often used in statistics to denote the operation of summing over the
possible values of that subscript. Hence ni. sums over the possible values of the second
subscript. This notation can be extended to more dimensions with k, l etc. denoting
the generic subscripts in the next places.
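The marginal sums ni. , n.j , and n.. are easily computed in R; here applied to the admissions table of Example 7.1.1 (rebuilt from the built-in UCBAdmissions data, collapsed over department):

```r
# The Berkeley admissions counts, cross-classified by Gender and Admit:
DF <- as.data.frame(UCBAdmissions)
tab <- xtabs(Freq ~ Gender + Admit, data = DF)
margin.table(tab, 1)   # the row sums n_i.
margin.table(tab, 2)   # the column sums n_.j
sum(tab)               # the grand total n.. = 4526
```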


  Our research question and the nature of the two categorical variables determine how
we collect and analyze the data. There are three different data collection schemes that
we distinguish among.

  1. I independent populations. On this model, R, the categorical variable that
     determines the rows, defines I many populations. The data are collected by
      choosing a simple random sample of each population and categorizing each sampled
      individual according to the column categorical variable. An example of such a data
     collection exercise might be to choose a random sample of students of each class
     level and ask each subject a YES-NO question. On this model of sampling, we
     need to be able to identify each of the populations in advance.

  2. One population, two factors. On this model, we choose n individuals at
     random from one population and classify the individuals according to the two
     different categorical variables.

  3. I experimental treatments. On this model, the I rows are the I different
     treatments to which we might assign a number of individuals. We assign ni.
     individuals to each treatment (we hope by randomization) and then observe the
     value of the column categorical variable in each individual.

   Sometimes it is difficult to see immediately which of the three data collection schemes
is the best description of our data and sometimes it is clear that the data did not arise
in any one of these ways. For example, in most observational studies, randomness does
not play a role. It is often the case that such studies correspond to the description
of the second data collection scheme above but without the random sampling. How
we make inferences from such data from observational studies (and whether we can
make any inferences at all) is usually a difficult question. Of course the data collection
scheme should match the research question and we would like to phrase our research
questions as questions about parameters.

7.1.2. I independent populations
Suppose that random samples are chosen from each of the I independent populations
determined by the rows. This situation is really that of stratified random sampling with
the rows determining the strata. In this case, the variable C divides each population
into J many groups. A natural question to ask is whether the proportion of individuals
in a particular group is the same across populations.

      Example 7.1.2. In [AM], Chase and Dummer report on a survey of 478 children in
      Ingham and Clinton Counties in Michigan. (The data are available at the Data and
      Story Library.)
      The children were chosen from grades 4, 5, and 6. Among the questions asked was


    which goal was most important to them: making good grades, being popular, or
    being good in sports. The results are
     > pk=read.csv(’’)
     > names(pk)
      [1] "Gender"      "Grade"       "Age"         "Race"        "Urban.Rural"
      [6] "School"      "Goals"       "Grades"      "Sports"      "Looks"
     [11] "Money"
      > xtabs(~Grade+Goals,data=pk)
           Goals
      Grade Grades Popular Sports
          4     63      31     25
          5     88      55     33
          6     96      55     32
    Here the three populations are students in the three grades and the research ques-
    tion is whether students at the three grade levels are the same in their choice of
    their most important goal.

  We define parameters as follows:

           πi,j = proportion of population i at level j of the second variable.
   Note that with πi,j defined in this way, πi,1 + πi,2 + · · · + πi,J = 1 for every i. A
natural first hypothesis to test is

                        H0 :   for every j,   π1,j = π2,j = · · · = πI,j .
  If H0 is true, we say that the populations are homogeneous (with respect to variable
C). In order to test this hypothesis, it is necessary to construct a test statistic T such
that two things are true:

   1. We know the distribution of T when H0 is true, and

   2. The values of T tend to be small if H0 is true and large if H0 is false (or the
      other way around).

   It is easy to construct test statistics that have the second of these two properties.
However, since the distribution of such a statistic is discrete, it is usually computation-
ally impossible to determine the distribution of the statistic we construct even under
the assumption that the null hypothesis is true. The classical test in this situation is
to use a test statistic for which we have a good approximation to its distribution. The
statistic is called the chi-square (χ2 ) statistic and its lineage is really the same as that
of the normal approximation to the binomial distribution.
   To form the chi-square statistic, we investigate what we expect would happen if the
null hypothesis were true. In this case, for every j, we have π1,j = π2,j = · · · = πI,j .


We let π.j denote the common value. (Here we use the dot in a slightly different
but analogous manner.) How would we estimate π.j , the probability of an individual
falling in the jth column? Since there are n.j individuals in this column, a natural
estimate would be π̂.j = n.j /n. With this estimate of π.j , we can estimate the number
of individuals that should fall in each cell. Since there are ni. individuals in row i,
we should estimate that there are ni. π̂.j = ni. n.j /n individuals in the i, jth cell. This
quantity is important: we give it a name and notation.

Definition 7.1.3 (Expected Count). Under the null hypothesis H0 , the expected count
in cell i, j is
                                n̂i,j = ni. n.j /n.

   We now introduce the statistic that we use to test this hypothesis. (We use X²
rather than χ² so that the statistic is an upper-case Roman letter!)

            X² = Σ (observed − expected)²/expected = Σi Σj (nij − n̂ij )²/n̂ij .

   It is not hard to see that this statistic is always nonnegative and tends to be larger
if the null hypothesis is false and smaller if it is true. However the distribution of
this statistic cannot be computed exactly for all but the smallest n. We digress and
introduce a new and important distribution.

Definition 7.1.4 (chi-square distribution). The chi-square distribution is a one-parameter
family of distributions with parameter a natural number ν and pdf

            f (x; ν) = x^(ν/2−1) e^(−x/2) / (2^(ν/2) Γ(ν/2)),    x ≥ 0.

The chi-square distribution has mean ν and variance 2ν. The parameter ν is called the
degrees of freedom.

  The plot of the density function for the chi-square distribution with ν = 4 is in
Figure 7.1.
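The curve in Figure 7.1 can be drawn with curve() and dchisq(); a numerical integration also checks the stated mean:

```r
# Density of the chi-square distribution with nu = 4 (as in Figure 7.1):
curve(dchisq(x, df = 4), from = 0, to = 12)

# Numerical check that the mean is nu = 4:
m <- integrate(function(x) x * dchisq(x, df = 4), 0, Inf)$value
m   # approximately 4
```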
  The importance of the chi-square distribution stems from the following fact.

Proposition 7.1.5. Suppose that X1 , . . . , Xν are independent random variables each
of which has a standard normal distribution. Then X1² + · · · + Xν² has a chi-square
distribution with ν degrees of freedom.
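The proposition can be checked by simulation; a sketch for ν = 2:

```r
# Simulation check of Proposition 7.1.5 for nu = 2: the sum of two
# squared standard normals should follow a chi-square with 2 df.
set.seed(1)   # for reproducibility
z1 <- rnorm(100000)
z2 <- rnorm(100000)
mean(z1^2 + z2^2 <= 3)   # empirical P(sum <= 3)
pchisq(3, df = 2)        # theoretical value, about 0.777
```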

  For our purposes, we have the following fact.

Proposition 7.1.6. If the null hypothesis H0 is true, then the statistic X 2 has a
distribution that is approximately chi-square with (I − 1)(J − 1) degrees of freedom.


         Figure 7.1.: The density of the chi-square distribution with ν = 4.

  We now use the proposition to make a hypothesis test.

  chi-square test of homogeneity of populations.
  Suppose that the value of X 2 is c. The p-value of the hypothesis test of H0
  is p = P(X 2 ≥ c) where we assume that X 2 has a chi-square distribution with
  ν = (I − 1)(J − 1) degrees of freedom.

   Example 7.1.7. Continuing the popular kids example, Example 7.1.2, we compute
   the chi-square value using R. While R does the computations, we illustrate the
   computation by considering the first cell. There are 478 subjects total (n.. =
   478) of which 119 are in grade 4 (n1. = 119). Of the 478 subjects, 247 have
   getting good grades as their most important goal. Thus 247/478 = 51.7% of
   the sampled children have this as their goal. The expected count in the first
    cell is therefore n̂1,1 = (247/478) · 119 = 61.49. Since the actual count is 63, this
    contributes (63 − 61.49)²/61.49 = 0.037 to the chi-square value. Continuing over
    the nine cells, we have a chi-square value of 1.3121 according to R.
    > popkidstable=xtabs(~Grade+Goals,data=pk)
    > chisq.test(popkidstable)

             Pearson's Chi-squared test

    data: popkidstable
    X-squared = 1.3121, df = 4, p-value = 0.8593
   The value of X 2 is 1.31. The p-value indicates that if H0 is true, we would expect
   to see a value of X 2 at least as large as 1.31 over 85% of the time. So if H0 is true,


      this value of the chi-square statistic is not at all surprising. We have no reason to
      doubt the null hypothesis that students of these three grades do not differ in their
      most important goals.
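The by-hand computation can be carried over all nine cells; the counts below are those of Example 7.1.2:

```r
# Counts from the popular-kids table: rows are grades 4-6,
# columns are Grades, Popular, Sports.
tab <- rbind("4" = c(63, 31, 25),
             "5" = c(88, 55, 33),
             "6" = c(96, 55, 32))
# Expected count in cell (i, j) under H0 is n_i. * n_.j / n:
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
round(expected[1, 1], 2)                  # 61.49, as computed above

# The chi-square statistic, summed over all nine cells:
X2 <- sum((tab - expected)^2 / expected)
round(X2, 4)                              # 1.3121, matching chisq.test()
```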

   The use of the chi-square distribution is only an approximation. The approximation
is better if the populations are large and the individual cell sizes are not too small. The
conventional wisdom is not to use this test if any cell has a count of 0 or if more than
20% of the cells have expected count less than 5. R will give a warning message if any
cell has expected count less than 5.

7.1.3. One population, two factors
We now look at the case in which the contingency table results from sampling from
a single population and classifying the sampled elements according to two different
categorical variables. The natural research question is whether the two variables are
“independent” of each other. We start with an example.

      Example 7.1.8. During the Spring semester of 2007, 280 statistics students were
      given a survey. Among other things, they were asked their gender and whether
      they were smokers. The results are tabulated below. (Note that the file was
      created using a blank field to denote a missing value. An argument to read.csv()
      addresses that.)
      > survey=read.csv('',na.strings=c('NA',''))
      > t=xtabs(~gender+smoker,data=survey)
      > t
            smoker
      gender Non Smoke
           F 133     5
           M 125    13
      Now these 280 students were not a random sample of 280 students from any partic-
      ular population. However we might think that this group could be representative
      of the population of all students with respect to the relationship of smoking to
      gender. We note that in this (convenience) sample, a male is more likely to smoke
      than a female. Does this difference indicate a true difference between the genders
      or is this simply a result of sampling variability?

  To formulate the research question as a question about parameters, we define πi,j as
the proportion of the population that has the value i for variable R and j for variable C.
We also define πi. and π.j to denote the proportion of the population with the relevant
value of each individual categorical variable. Then the hypothesis of independence that
we wish to test is

                            H0 :    for every i, j:   πi,j = πi. π.j .


   This hypothesis is an independence hypothesis as it states that the events of an
object being classified as i on variable R and j on variable C are independent. Just as
in the case of independent populations, it is plausible to estimate π.j by n.j /n. It is
also reasonable to estimate πi. by ni. /n. Then, if the null hypothesis is true, we should
use π̂i,j = ni. n.j /n² as our estimate of πi,j . Notice that with this estimate of πi,j , we
expect that we would have n̂i,j = ni. n.j /n individuals in cell i, j. This is exactly the
same expected cell value as in the case of the test for homogeneity. This suggests that
exactly the same statistic, X 2 , should be used to test H0 . Indeed, we have

Proposition 7.1.9. If H0 is true, then the statistic

            X² = Σ (observed − expected)²/expected = Σi Σj (nij − n̂ij )²/n̂ij

has a distribution that is approximately chi-square with (I − 1)(J − 1) degrees of
freedom.

   The proposition means that we can use exactly the same R test in this case. It
also means that in cases where it is not so clear whether we are testing for homogene-
ity or independence, it doesn’t really matter! In the smoking and gender example,
Example 7.1.8, we have

> chisq.test(t)

          Pearson’s Chi-squared test with Yates’ continuity correction

data: t
X-squared = 2.9121, df = 1, p-value = 0.08791

  A p-value of 0.088 suggests that there is not sufficient evidence to claim that smoking
and gender are not independent.
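Since chisq.test() needs only the table of counts, the test can be rerun without the original data file; a sketch that rebuilds the table shown in Example 7.1.8:

```r
# The gender-by-smoker counts from the table above; rebuilding the
# table is enough, since chisq.test() needs only the counts.
t <- as.table(rbind(F = c(133, 5), M = c(125, 13)))
colnames(t) <- c("Non", "Smoke")
chisq.test(t)   # X-squared = 2.9121, df = 1, p-value = 0.08791
```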

7.1.4. I experimental treatments
The third way that a two-way contingency table might arise is in the case that the rows
correspond to the different treatments in an experiment. Here we are thinking that
the n individuals are assigned at random to the I treatments with ni. individuals as-
signed to treatment i. (We hope as well that the n individuals are a random sample from
some larger population to which we want to generalize the results of the experiment.
This hope will hardly ever be realized.) We want to know whether the experimental
treatments have an effect on the column variable C.


      Example 7.1.10. In [LP01], a study was done to see if delayed prescribing of
      antibiotics was as effective as immediate prescribing of antibiotics for treatment of
      ear infections. 164 children were assigned to the treatment group that received a
      prescription for antibiotics but which was instructed not to take the antibiotics for
      three days (the “delay” group). 151 children received a prescription for antibiotics
      to be taken immediately (the “immediate” group). The assignment was by ran-
      domization. One of the side effects of antibiotics in children is diarrhea. Of the
      delay group, 15 children had diarrhea and of the immediate group, 29 had diarrhea.
      The question is whether the rate of diarrhea differs for those receiving antibiotics
      immediately as opposed to those who waited. We do not have the raw data so we
      construct the table ourselves using the summary data above.
      > m=matrix(c(15,149,29,122),nrow=2,ncol=2,byrow=T)
      > m
            [,1] [,2]
      [1,]    15 149
      [2,]    29 122
      > colnames(m)=c(’Diarrhea’,’None’)
      > rownames(m)=c(’Delay’,’Immediate’)
      > m
                 Diarrhea None
      Delay            15 149
      Immediate        29 122
      Obviously, the rate of diarrhea in the immediate group is bigger, but we would like
      to know if this difference could be attributable to chance.

   The null hypothesis in this case is that there is no difference between the treatments
(e.g., the rows) as far as the column variable C is concerned. This is essentially a
homogeneity hypothesis and we will analyze the data in precisely the same manner as
the case of I independent populations. In this case, we could think of the treatment
levels as defining theoretical populations, namely the population of individuals that
might have received each treatment. The “random sample” from the ith population is
then the collection of subjects randomly assigned to treatment i. We write the null
hypothesis in terms of parameters πij just as in the null hypothesis for homogeneity.
In this case πij denotes the probability that a subject assigned to treatment i will have
the value j on the categorical variable C. The null hypothesis is

                       H0 :    for every j,   π1,j = π2,j = · · · = πI,j .
  and we test this null hypothesis exactly the same way as in the case of homogeneity.

      Example 7.1.11. Continuing Example 7.1.10, we have the following test of the
      hypothesis that there is no difference in the rates of diarrhea for the two treatment
      groups.

    > chisq.test(m,correct=F)

             Pearson’s Chi-squared test

    data: m
    X-squared = 6.6193, df = 1, p-value = 0.01009
   With a p-value of .01 it appears that the difference in the rate of diarrhea in the
   two groups is greater than we would expect to see if the null hypothesis were true.
   We would reject the null hypothesis at the significance level α = 0.05, for example.
      In the above test, we have chosen, by specifying correct=F, not to use something
    called the “continuity correction.” If we use the correction, we find
    > chisq.test(m)

             Pearson’s Chi-squared test with Yates’ continuity correction

    data: m
    X-squared = 5.8088, df = 1, p-value = 0.01595
    In the correction, which is used only for the two-by-two case, the value 0.5 is
    subtracted from each of the absolute differences |Observed − Expected|. It turns out that this makes
   the chi-square approximation somewhat closer.
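The arithmetic behind both statistics can be checked by hand. Here is a minimal sketch, in Python rather than R, using only the counts from the table above:

```python
# Observed counts from the delay/immediate table above.
obs = [[15, 149],   # delay:     diarrhea, none
       [29, 122]]   # immediate: diarrhea, none

row_tot = [sum(row) for row in obs]                       # 164 and 151
col_tot = [sum(row[j] for row in obs) for j in range(2)]  # 44 and 271
n = sum(row_tot)                                          # 315 children in all

# Expected count for each cell under the null: (row total)(column total)/n.
exp = [[row_tot[i] * col_tot[j] / n for j in range(2)] for i in range(2)]

# Pearson statistic: the sum of (Observed - Expected)^2 / Expected.
x2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
         for i in range(2) for j in range(2))

# Yates' correction subtracts 0.5 from each |Observed - Expected|.
x2_yates = sum((abs(obs[i][j] - exp[i][j]) - 0.5) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

# x2 is about 6.6193 and x2_yates about 5.8088, matching chisq.test above.
```

The only difference between the two statistics is the 0.5 shrinkage of each cell's deviation from its expected count.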

7.2. Difference of Two Means
This section addresses the problem of determining the relationship between a cate-
gorical variable (with two levels) and a quantitative variable. Just as in the case of
two categorical variables, data like this can arise from independent samples from two
different populations, from a randomized comparative experiment with two treatment
groups, or from cross-classifying a random sample from a single population according
to the two variables. We look at the two population case here (and suggest that the
two treatment group case should be analyzed the same way as in Section 7.1).

  Assumptions for two independent samples:
     1. X1 , . . . , Xm is a random sample from a population with mean µX and vari-
        ance σX² .

     2. Y1 , . . . , Yn is a random sample from a population with mean µY and variance
        σY² .

     3. The two samples are independent one from another.

     4. The samples come from normal distributions.

19:08 -- May 4, 2008                                                               709
7. Inference – Two Variables

  Of course the fourth assumption above is an assumption of convenience to make the
mathematics work out. In most cases, our populations are either not known to be normal
or are known not to be normal, and we hope that the inference procedures we develop
below are reasonably robust.
  We first write a confidence interval for the difference in the two means µX − µY . Just
as with our confidence intervals for one mean µ, our confidence interval will have the form

              (estimate) ± (critical value) · (estimate of standard error) .

  The natural choice for an estimator of µX − µY is X − Y . To write the other two
pieces of the confidence interval, we need to know the distribution of X − Y . The
necessary fact is this:

                  (X̄ − Ȳ − (µX − µY )) / √(σX²/m + σY²/n) ∼ Norm(0, 1) .

  Analogously to confidence intervals for a single mean, it seems like the right way to
proceed is to estimate σX by sX , σY by sY and to investigate the random variable

                  (X̄ − Ȳ − (µX − µY )) / √(SX²/m + SY²/n)                          (7.1)

  The problem with this approach is that the distribution of this quantity is not known
even if we assume that the populations are normal (unlike the case of the single mean
where the analogous quantity has a t-distribution). We need to be content with an
approximation.

Lemma 7.2.1. (Welch) The quantity in Equation 7.1 has a distribution that is ap-
proximately a t-distribution with degrees of freedom ν where ν is given by

              ν = (SX²/m + SY²/n)² / ( (SX²/m)²/(m − 1) + (SY²/n)²/(n − 1) )        (7.2)

(It isn’t at all obvious from the formula but it is good to know that min(m − 1, n − 1) ≤
ν ≤ n + m − 2.)

  We are now in a position to write a confidence interval for µX − µY .


  An approximate 100(1 − α)% confidence interval for µX − µY is

                               x̄ − ȳ ± t∗ √(sX²/m + sY²/n)                          (7.3)
  where t∗ is the appropriate critical value tα/2,ν from the t-distribution with ν
  degrees of freedom given by (7.2).

  We note that ν is not necessarily an integer and we leave it to R to compute both the
value of ν and the critical value t∗ .
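Although we leave the computation to R, the formula for ν is simple enough to evaluate directly. A sketch in Python rather than R, with made-up summary statistics, that also verifies the bounds noted after Lemma 7.2.1:

```python
# Hypothetical summary statistics for two independent samples.
m, sx2 = 12, 9.0    # size and sample variance of the X sample
n, sy2 = 15, 25.0   # size and sample variance of the Y sample

vx, vy = sx2 / m, sy2 / n   # the two estimated variance components

# Welch's approximate degrees of freedom, Equation (7.2).
nu = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))

# nu is generally not an integer; here it is about 23.4, and it always
# satisfies min(m - 1, n - 1) <= nu <= m + n - 2.
```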

   Example 7.2.2. The barley dataset of the lattice package has the yield in
   bushels per acre of various experiments done in Minnesota in 1931 and 1932. If we
   think of the experiments done in 1931 and in 1932 as samples from two populations,
   we have
    > t.test(yield~year,barley)

             Welch Two Sample t-test

    data: yield by year
    t = -2.9031, df = 116.214, p-value = 0.004422
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -8.940071 -1.688820
    sample estimates:
    mean in group 1932 mean in group 1931
              31.76333           37.07778

   There is a significant difference in the mean yield for the two years.

   We should remark at this point that older books (and even some newer books which
don’t reflect current practice) suggest an alternate approach to the problem of writing
confidence intervals for µX − µY . These books suggest that we assume that the two
standard deviations σX and σY are equal. In this case the exact distribution of our
quantity is known (it is t with n + m − 2 degrees of freedom). The difficulty with this
approach is that there is usually no reason to suppose that σX and σY are equal and
if they are not equal the proposed confidence interval procedure is not as robust as the
one we are using. Current best practice is to always prefer the Welch procedure to that
of assuming that the two standard deviations are equal.


Confidence intervals generated by Equation 7.3 are probably the most common confi-
dence intervals in the statistical literature. But those who generate such intervals are
not always sensitive to the hypotheses that are necessary to be confident about the
confidence intervals generated. It should first be noted that the confidence intervals
constructed are based on the hypothesis that the two populations are normally dis-
tributed. It is often apparent from even a cursory examination of the data that this
hypothesis is unlikely to be true. However, if the sample sizes are large enough, the
intervals generated are fairly robust. (This is related to the Central Limit Theorem and
the fact that we are making inferences about means.) There are a number of different
rules of thumb as to what large enough means, but n, m > 15 for distributions that are
relatively symmetric and n, m > 40 for most distributions are common rules of thumb.
A second principle is that we can be more confident of intervals in which the quotients sX²/m
and sY²/n are similar in size than of those in which they are quite different.

Turning Confidence Intervals into Hypothesis Tests
It is often the case that researchers content themselves with testing hypotheses about
µX −µY rather than computing a confidence interval for that quantity. For example, the
null hypothesis µX − µY = 0 in the context of an experiment is a claim that there is no
difference in the two treatments represented by X and Y . This would be the typical null
hypothesis in comparing a medical treatment to a control or a placebo. Hypothesis
testing of this sort has fallen into disfavor in many circles since the knowledge that
µX − µY ≠ 0 is of rather limited interest unless the size of this quantity is known.
(After all, nobody should really believe that two populations would have exactly the
same mean on any variable.) A confidence interval gives information about the size of
the difference. Nevertheless, since the literature is still littered with such hypothesis
tests, we give an example here.

      Example 7.2.3. Returning to our favorite chicks, we might want to know if we
      should believe that the effect of a diet of horsebean seed is really different from that of
      a diet of linseed. Suppose that x1 , . . . , xm are the weights of the m chickens fed
      horsebean seed and y1 , . . . , yn are the weights of the n chickens fed linseed. The
      hypothesis that we really want to test is H0 : µX − µY = 0. We note that if
      the null hypothesis is true, then T = (X̄ − Ȳ )/√(SX²/m + SY²/n) has a distribution
      that is approximately a t-distribution with the Welch formula giving the degrees of
      freedom. Thus the obvious strategy is to reject the null hypothesis if the value of T
      is too large in absolute value. Fortunately, R does all the appropriate computations.
      Notice that the mean weight of the two groups of chickens differs by 58.55 but that
      a 95% confidence interval for the true difference in means is (−99.1, −18.0). On
      this basis we conclude that the linseed diet is superior, i.e., that there


    is a difference in the mean weights of the two populations. This is verified by the
    hypothesis test of H0 : µX − µY = 0 which results in a p-value of 0.007. That is,
    this great a difference in mean weight would have been quite unlikely to occur if
    there was no real difference in the mean weights of the populations.
     > hb=chickwts$weight[chickwts$feed=="horsebean"]
     > ls=chickwts$weight[chickwts$feed=="linseed"]
     > t.test(hb,ls)

              Welch Two Sample t-test

     data: hb and ls
     t = -3.0172, df = 19.769, p-value = 0.006869
     alternative hypothesis: true difference in means is not equal to 0
     95 percent confidence interval:
      -99.05970 -18.04030
     sample estimates:
     mean of x mean of y
        160.20    218.75
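The numbers reported by t.test() can be reproduced from the formulas of this section. A sketch in Python rather than R; the weights are transcribed here from R's chickwts dataset, so treat the transcription itself as an assumption:

```python
from math import sqrt

# Chick weights transcribed from R's chickwts dataset (an assumption;
# if the transcription is off, the results below will differ slightly).
hb = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]            # horsebean
ls = [309, 229, 181, 141, 260, 203, 148, 169, 213, 257, 244, 271]  # linseed

def mean(v):
    return sum(v) / len(v)

def var(v):  # sample variance with denominator len(v) - 1
    xbar = mean(v)
    return sum((x - xbar) ** 2 for x in v) / (len(v) - 1)

m, n = len(hb), len(ls)
vx, vy = var(hb) / m, var(ls) / n

# Welch t statistic and approximate degrees of freedom.
t = (mean(hb) - mean(ls)) / sqrt(vx + vy)
nu = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))

# t is about -3.0172 and nu about 19.769, matching the t.test output above.
```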

One-sided confidence intervals and one-sided tests are possible as are intervals of dif-
ferent confidence levels. All that is needed is an adjustment of the critical numbers (for
confidence intervals) or p-values for tests.

    Example 7.2.4. A random dot stereogram is shown to two groups of subjects
    and the time it takes for the subject to see the image is recorded. Subjects in one
    group (VV) are told what they are looking for but subjects in the other group (NV)
    are not. The quantity of interest is the difference in average times. If µX is the
    theoretical average of the population of the NV group and µY is the average of the
    VV group, then we might want to test the hypothesis
                                    H0 : µX − µY = 0
                                    Ha : µX > µY
     > rds=read.csv(’’)
     > rds
            Time Treatment
     1 47.20001         NV
     2 21.99998         NV
     3 20.39999         NV
     77 1.10000         VV
     78 1.00000         VV


      > t.test(Time~Treatment,data=rds,conf.level=.9,alternative="greater")

               Welch Two Sample t-test

      data: Time by Treatment
      t = 2.0384, df = 70.039, p-value = 0.02264
      alternative hypothesis: true difference in means is greater than 0
      90 percent confidence interval:
       1.099229      Inf
      sample estimates:
      mean in group NV mean in group VV
              8.560465         5.551429

      From this we see that a lower bound on the difference µX − µY is 1.10 at the 90%
      level of confidence. And we see that the p-value for the result of this hypothesis
      test is 0.023. We would probably conclude that those getting no information take
      longer than those who do on average.

   Just as in the case of the t-test for one mean, it is important to consider the
power of the two-sample t-test before conducting an experiment. The R function
power.t.test() with argument type='two.sample' does the appropriate computations.

7.3. Exercises

7.1 In Berkson, JASA, 33, pp. 526-536, there is data on the result of an experiment
evaluating a treatment designed to prevent the common cold. There were 300 subjects
and 143 received the treatment and 157 the placebo. Of the treatment group, 121
eventually got a cold and of the placebo group, 145 got a cold. Was the treatment effective?
Write a contingency table and formulate this problem as a chi-square hypothesis test
(as indeed Berkson did).

7.2 The DAAG package has a dataset rareplants classifying various plant species in
South Australia and Tasmania. Each species was classified according to whether it was
rare or common in each of those two locations (giving the possibilities CC, CR, RC,
RR) and whether its habitat was wet, dry, or both (W, D, WD). The dataset contains
the summary table which is also reproduced here.
> rareplants
    D   W WD
CC 37 190 94
CR 23 59 23
RC 10 141 28
RR 15 58 16


  a) What hypothesis exactly is begging to be tested with the aid of this contingency
     table (homogeneity or independence)?

  b) Test this hypothesis.

7.3 21 rubber bands were divided into two groups. One group was placed in hot water
for 4 minutes while the other was left at room temperature. They were each then
stretched by a 1.35 kg weight and the amount of stretch in mm was recorded. (The
dataset comes from the DAAG library where it is called two65). You can get the dataset
in a dataframe format from
Write a 95% confidence interval for the difference in average stretch for this kind of
rubber band for the two conditions.
7.4 The dataset contains the results
of an experiment done to test the effectiveness of three different methods of reading
instruction. We are interested here in comparing the two methods DRTA and Strat.
Let’s suppose, for the moment, that students were assigned randomly to these two
different treatments.

  a) Use the scores on the third posttest (POST3) to investigate the difference between
     these two teaching methods by constructing a 95% confidence interval for the
     difference in the means of posttest scores.

  b) Your confidence interval in part (a) relies on certain assumptions. Do you have
     any concerns about these assumptions being satisfied in this case?

  c) Using your result in (a), can you make a conclusion about which method of
     reading instruction is better?

7.5 Surveying a choir, you might expect that there would not be a significant height
difference between sopranos and altos but that there would be between sopranos and
basses. The dataset singer from the lattice package contains the heights of the
members of the New York Choral Society together with their singing parts.

  a) Decide whether these differences do or do not exist by computing relevant confi-
     dence intervals.

  b) These singers aren’t random samples from any particular population. Explain
     what your conclusion in (a) might be about.

7.6 The package alr3 has a dataframe ais containing various statistics on 202 elite
Australian athletes. (The package must be loaded and then the dataset must be loaded
as well using data(ais).)


  a) Is there a difference between the hemoglobin levels of males and females? (Well,
     of course there is a difference. But is it statistically significant?)

  b) What assumptions are you making about the data in (a) to make it a problem in
     statistical inference?

  c) To what populations do you think you could generalize the result of (a)?

8. Regression
In Section 1.6 we introduced the least-squares method for finding a linear function that
best describes the relationship between a pair of quantitative variables. In this chapter
we enhance that method by grounding it in a statistical model.

8.1. The Linear Model
Suppose that we have n individuals on which we measure two variables. The data
then consist of n pairs (x1 , y1 ), . . . , (xn , yn ). We will develop a model for the situation
in which we consider the variable x as an explanatory variable and y as the response
variable. Our model will assume that for each fixed data value x, the corresponding
value y is the result of a random variable, Y . The linearity of the model comes from
the fact that we will assume that the expected value of Y is a linear function of x. The
model is given by

                                The standard linear model

   The standard linear model is given by the equation

                                       Y = β0 + β1 x + ε                                  (8.1)

   where

      1. ε is a random variable with mean 0 and variance σ² ,

      2. β0 , β1 , σ² are (unknown) parameters,

      3. and ε has a normal distribution.

  We will assume that the data (x1 , y1 ), . . . , (xn , yn ) result from n independent trials
governed by the process above. That is, we assume that ε1 , . . . , εn is an iid sequence of
random variables with mean 0 and variance σ² . Then each yi is the result of a random
variable Yi given by
                                Yi = β0 + β1 xi + εi .
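The data-generating process just described is easy to simulate. A small sketch, in Python rather than R, with hypothetical parameter values:

```python
import random

random.seed(1)  # make the simulated sample reproducible

# Hypothetical parameter values for the standard linear model (8.1).
beta0, beta1, sigma = 2.0, 0.5, 1.0

# The x values are fixed; only the errors (and hence the y's) are random.
xs = [i / 10 for i in range(100)]
eps = [random.gauss(0, sigma) for _ in xs]      # iid errors, mean 0, sd sigma
ys = [beta0 + beta1 * x + e for x, e in zip(xs, eps)]

# Each ys[i] is one realization of Y_i = beta0 + beta1 * x_i + eps_i.
```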


   Notice that in our description of the data collection process, yi is treated as the result
of a random variable but xi is not. Also note that for any fixed i, the mean of Yi is
β0 + β1 xi and the variance of Yi is σ 2 .
   There are three unknown parameters (β0 , β1 , and σ 2 ) in the linear model. Usually,
the most interesting of these from a scientific point of view is β1 since it is an expression
of the way in which the response variable Y depends on the value of the explanatory
variable x. We would like to estimate these parameters and make inferences about
them. It turns out that we have already done much of the work in Section 1.6. The
least-squares line is the “right” line to use in estimating β0 and β1 . We review the
construction of that line. Let β̂0 and β̂1 denote the estimators of β0 and β1 respectively.

      A note on notation. It would be nice to use uppercase to denote the
      estimator and lowercase to denote the estimate. That would mean that we
      should use b1 to denote the estimate of β1 and B1 to denote the estimator of
      β1 . However this is typically not done and instead β̂1 is used for both. So β̂1
      might be a number (an estimate) or a random variable (the corresponding
      estimator) depending on the context. Be careful!

  Now define
                                      ŷi = β̂0 + β̂1 xi .
Of course ŷi is not defined until we specify how to choose β̂0 and β̂1 . Given β̂0 and β̂1 ,
we define
                  SSResid = Σ (yi − ŷi )² = Σ (yi − (β̂0 + β̂1 xi ))² .

We proceed exactly as in Section 1.6. Namely we choose β̂0 and β̂1 to minimize SSResid.
(In fact in that section we called these two numbers b0 and b1 .) We have the following
expressions for β̂0 and β̂1 :

              β̂1 = Σ (xi − x̄) yi / Σ (xi − x̄)²       and       β̂0 = ȳ − β̂1 x̄ .

The corresponding estimators result from these expressions by replacing yi by Yi . The
desirable properties of these estimators (besides minimizing SSResid) are summarized
in the next three results.
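The expressions above can be applied directly. A sketch in Python rather than R, on a small made-up dataset where the answer is easy to verify:

```python
# A small made-up dataset.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Least-squares slope and intercept from the formulas above.
b1 = sum((x - xbar) * y for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

# Here b1 = 1.4 and b0 = 0.5.
```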

Proposition 8.1.1. Assume only that E(εi ) = 0 for all i in the model given by
(8.1). Then β̂0 and β̂1 are unbiased estimates of β0 and β1 respectively. Therefore,
ŷi = β̂0 + β̂1 xi is an unbiased estimate of β0 + β1 xi (which is the expected value of Yi
for the value x = xi ).

  Notice that in Proposition 8.1.1 we do not need to assume that the errors have
constant variance or that they are independent! This proposition therefore gives us a
very good reason to use the least-squares slope and intercept for our estimates.



         Figure 8.1.: The corrosion data with the least-squares line added.

   Example 8.1.2. In Example 1.6.1 we looked at the loss due to corrosion of 13
   Cu/Ni alloy bars submerged in the ocean for sixty days. Here the iron content
   Fe is the explanatory variable and it is reasonable to treat that as controlled and
   known by the experimenter (rather than as a random variable). The data plot
   suggests that the linear model might be a reasonable approximation to the true
   relationship between iron content and material loss. We reproduce the analysis
   here. Using R we find that β̂0 = 129.79 and β̂1 = −24.02.
    > library(faraway)
    > data(corrosion)
    > corrosion[c(1:3,12:13),]
         Fe loss
    1 0.01 127.6
    2 0.48 124.0
    3 0.71 110.8
    12 1.44 91.4
    13 1.96 86.2
     > lm(loss~Fe,data=corrosion)

     Call:
     lm(formula = loss ~ Fe, data = corrosion)

     Coefficients:
     (Intercept)                    Fe
          129.79                -24.02

  If we add the assumption of independence of the εi and also the assumption of
constant variance, we know considerably more about our estimates as evidenced by the
next two propositions. (Recall that Sxx = Σ (xi − x̄)² .)


Proposition 8.1.3. Suppose that Yi = β0 + β1 xi + εi where the random variables εi
are independent and satisfy E(εi ) = 0 and Var(εi ) = σ² . Then

   1. Var(β̂1 ) = σ²/Sxx ,

   2. Var(β̂0 ) = σ² Σ xi² / (n Sxx ) = σ² (1/n + x̄²/Sxx ) .

  It is not important to remember the formulas of this proposition. But they are worth
examining for what they say about the variance of our estimators. We can decrease the
variance of the estimator of slope, for example, by collecting a large amount of data
with x values that are widely spread. This seems intuitively correct.
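The two expressions for Var(β̂0 ) in Proposition 8.1.3 are algebraically equal because Σ xi² = Sxx + n x̄². A quick numeric check, in Python rather than R, with arbitrary x values and a hypothetical σ²:

```python
# Arbitrary x values and a hypothetical error variance.
xs = [0.5, 1.0, 2.0, 3.5, 4.0]
sigma2 = 2.0

n = len(xs)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# The two forms of Var(beta0-hat) from Proposition 8.1.3.
v1 = sigma2 * sum(x * x for x in xs) / (n * sxx)
v2 = sigma2 * (1 / n + xbar ** 2 / sxx)

# v1 and v2 agree (up to rounding error).
```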
  In general we like unbiased estimators with small variance. The next theorem assures
us that the least squares estimators are good estimators in this respect.

Theorem 8.1.4 (Gauss-Markov Theorem). Assume that E(εi ) = 0, Var(εi ) = σ² ,
and the random variables εi are independent. Then the estimators β̂0 and β̂1 are the
unbiased estimators of minimum variance among all unbiased estimators that are linear
in the random variables Yi . (We say that these estimators are best linear unbiased
estimators, BLUE.)

  While there might be non-linear estimators that improve on β̂0 and β̂1 , the Gauss-
Markov Theorem gives us a powerful reason for using these estimators. Notice however
that the Theorem has hypotheses. Both the homoscedasticity (equal variance) and
independence hypotheses are important.
  Our final proposition of the section gives us additional information if we add the
normality assumption.

Theorem 8.1.5. Assume that E(εi ) = 0, Var(εi ) = σ² , and the random variables εi
are independent and normally distributed. Then β̂0 and β̂1 are normally distributed.

   We exploit this theorem in the next section to make inferences about the parameters
β0 , β1 .

8.2. Inferences
We first consider the problem of making inferences about β1 . In particular, we would
like to construct confidence intervals for β1 with the aid of our estimate β̂1 . In order
to do this, we must clearly make some distributional assumptions about the Yi . So
for this entire section, we will assume all the hypotheses of the standard linear model,
namely that E(εi ) = 0, Var(εi ) = σ² , and the random variables εi are normally
distributed and independent of one another. From the results of the last section, we then have


                                   β̂1 ∼ Norm(β1 , σ²/Sxx ) .
  We’ve been in this situation before. Namely, we have an estimator that has a normal
distribution centered at the true value of the parameter but with a standard deviation
that depends on an unknown parameter σ. Clearly the way to proceed is to estimate
the unknown standard deviation. To do this, we need to estimate σ.

Proposition 8.2.1. Under the assumptions of the linear model,

                              MSResid = SSResid/(n − 2)

is an unbiased estimate of σ² .

   While we will not prove the proposition, let’s see that it is plausible. The numerator
in this computation is a sum of terms of the form (yi − ŷi )². Since ŷi is the best estimate
of E(Yi ) that we have, yi − ŷi is a measure of the deviation of yi from its mean. Thus
(yi − ŷi )² functions exactly the same way that (xi − x̄)² functions in the computation
of the sample variance. However in this case we have a denominator of n − 2 rather
than n − 1. This accounts for the fact that we are minimizing SSResid by choosing
two parameters. The n − 2 is the key to making this estimator unbiased; a more
straightforward choice would have been to use n in the denominator. Since MSResid
is an estimate for σ² , we will use s to denote √MSResid.
   With the estimate s = √MSResid for σ in hand, we can estimate the standard
deviation of β̂1 . Since Var(β̂1 ) = σ²/Sxx we will use s/√Sxx to estimate the standard
deviation of β̂1 . We can similarly estimate Var(β̂0 ). We record these estimates in a
definition.

Definition 8.2.2 (standard errors of β̂0 and β̂1 ). The estimates of the standard devi-
ation of the estimators β̂0 and β̂1 , called the standard errors of the estimates, are given
by

   1. sβ̂1 = s/√Sxx , and

   2. sβ̂0 = s √(1/n + x̄²/Sxx ) .
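All of the estimates defined so far can be computed in a few lines. A sketch in Python rather than R, on a small made-up dataset:

```python
from math import sqrt

# A small made-up dataset.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# Least-squares estimates.
b1 = sum((x - xbar) * y for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar

# s^2 = MSResid = SSResid/(n - 2), the unbiased estimate of sigma^2.
ssresid = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = sqrt(ssresid / (n - 2))

# The two standard errors of Definition 8.2.2.
se_b1 = s / sqrt(sxx)
se_b0 = s * sqrt(1 / n + xbar ** 2 / sxx)
```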

  We illustrate all the estimates computed so far with another example.

    Example 8.2.3. A number of paper helicopters were dropped from a balcony and
    the time in air was recorded by two different timers. Various dimensions of each
    helicopter were measured including L, the “wing” length. A plot shows that there


      is a positive relationship between L and the time of the second timer (Time.2).
      To describe the relationship, we suppose that a linear model might be a good
      description. A plot of the data with a regression line added is in Figure 8.2.


            Figure 8.2.: Flight time for helicopters with various wing lengths.

      > h=read.csv(’’)
      > h[1,]
        Number W L H   B Time.1 Time.2
      1      1 3 6 2 1.5   6.89   6.82
      > l=lm(Time.2~L,data=h)
      > summary(l)

      Call:
      lm(formula = Time.2 ~ L, data = h)

      Residuals:
           Min       1Q             Median                  3Q           Max
      -1.42875 -0.53381            0.04489             0.49348       1.59941

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)   2.9816     0.7987   3.733 0.00200 **
      L             0.5773     0.1753   3.293 0.00493 **
      ---
      Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

      Residual standard error: 0.8872 on 15 degrees of freedom
        (3 observations deleted due to missingness)
      Multiple R-Squared: 0.4196,        Adjusted R-squared: 0.381
      F-statistic: 10.85 on 1 and 15 DF, p-value: 0.004925

        We have the following estimates: s = 0.8872 (labeled residual standard error in
      the output of lm), β̂1 = 0.5773, sβ̂1 = 0.1753, β̂0 = 2.9816, and sβ̂0 = 0.7987.


   To construct confidence intervals for β1 , we need one more piece of information. The
following result should not seem surprising given our work on the t-distribution.

Proposition 8.2.4. With all the assumptions of the linear model, the random variable

                      T = (β̂1 − β1 )/(S/√Sxx ) = (β̂1 − β1 )/Sβ̂1

has a t-distribution with n − 2 degrees of freedom.

   The proposition is another example for us of the use of the t distribution to generate
a confidence interval in the presence of a normality assumption. We generalize this into
a principle (which is too imprecise to call a theorem or to prove).

  Suppose that θ̂ is an unbiased estimator of a parameter θ and sθ̂ is the standard
  error of θ̂ (that is, an estimate of the standard deviation of θ̂). Suppose also that sθ̂
  has ν degrees of freedom. Then, in the presence of sufficient normality assumptions,
  the random variable T = (θ̂ − θ)/sθ̂ has a t distribution with ν degrees of freedom.

  We now use Proposition 8.2.4 to write confidence intervals for β1 .

                             Confidence Intervals for β1

  A 100(1 − α)% confidence interval for β1 is given by

                                    β̂1 ± tα/2,n−2 · sβ̂1 .

  We don’t even have to use qt() or do the multiplication since R will compute the
confidence intervals for us. Both 95% and 90% confidence intervals for the slope and
the intercept of the regression line in Example 8.2.3 are given by
> confint(l)
                2.5 %       97.5 %
(Intercept) 1.2791991    4.6839421
L           0.2036587    0.9508533
> confint(l,level=.9)
                  5 %         95 %
(Intercept) 1.5814235    4.3817177
L           0.2699840    0.8845281
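   Still, it is a good habit to check one of these intervals by hand with qt(). The
following sketch uses the estimates from Example 8.2.3 (the rounding is ours):

 > round(0.5773 + c(-1,1)*qt(.975,15)*0.1753, 4)
 [1] 0.2037 0.9509

This agrees with the confint() output up to rounding.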

19:08 -- May 4, 2008                                                                 807
8. Regression

The 95% confidence interval for β1 of (0.204, 0.951) gives us a good idea of the
large uncertainty in the estimated slope of the relationship between L and flight time.
Nevertheless, it does tell us that L has some use in predicting flight time.

8.3. More Inferences
We usually want to use the results of a regression to make inferences about the possible
values of y for given values of x. In this section, we look at two different kinds of
inferences of this sort. We begin with an example.

      Example 8.3.1. In the R library DAAG is a dataset ironslag that has measurements
      of the iron content of 53 samples of slag taken by two different methods.
      One method, the chemical method, is more time-consuming and expensive than
      the other, the magnetic method, but presumably more accurate.


               Figure 8.3.: Iron content measured by two different methods.

      > library(DAAG)
      > l=lm(chemical~magnetic,data=ironslag)
      > summary(l)
      Call:
      lm(formula = chemical ~ magnetic, data = ironslag)

      Residuals:
          Min      1Q  Median      3Q     Max
      -6.5828 -2.6893 -0.3825  2.7240  6.6572

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept) 8.95650     1.65235   5.420 1.63e-06 ***
      magnetic     0.58664    0.07624   7.695 4.38e-10 ***


     Signif. codes:    0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

     Residual standard error: 3.464 on 51 degrees of freedom
     Multiple R-Squared: 0.5372,        Adjusted R-squared: 0.5282
     F-statistic: 59.21 on 1 and 51 DF, p-value: 4.375e-10

   Given a particular value x = x∗ (here x∗ might be one of the values xi or some other
possible value of x), define Y = β0 + β1 x∗ and Ŷ = β̂0 + β̂1 x∗. Since β̂0 and β̂1 are
unbiased estimators of β0 and β1, we have that E(Ŷ) = β0 + β1 x∗ = E(Y). It can also
be shown that

                         Var(Ŷ) = σ² (1/n + (x∗ − x̄)²/Sxx).

If we make the normality assumptions of the standard linear model, Ŷ is also normally
distributed and we have the following confidence interval.

                         Confidence intervals for β0 + β1 x∗

  A 100(1 − α)% confidence interval for β0 + β1 x∗ is given by

                    β̂0 + β̂1 x∗ ± tα/2,n−2 · s √(1/n + (x∗ − x̄)²/Sxx)

  Notice that the confidence interval is narrowest when x∗ = x̄ and at that point the
standard error is simply s/√n. This should remind us of the standard error in the
construction of simple confidence intervals for the mean of a normal population. The
confidence interval is wider the greater the distance of x∗ from x̄. This is not surprising,
as small errors in the position of a line magnify the errors at its extremes.
  Of course the computations of these intervals are to be left to R. We illustrate with
the ironslag data.

    Example 8.3.2. (continuing Example 8.3.1) In the ironslag data, the values
    of the explanatory variable magnetic range from 10 to 40. We use R to write
    confidence intervals for β0 + β1 x∗ for four different values of x∗ in this range.
     > x=data.frame(magnetic=c(10,20,30,40))
     > predict(l,x,interval=’confidence’)
            fit      lwr      upr
     1 14.82291 12.91976 16.72607
     2 20.68933 19.72724 21.65142
     3 26.55574 24.84847 28.26301
     4 32.42215 29.32547 35.51884


      Notice that for a value of x∗ = 20, the confidence interval for the mean of Y
      is (19.7, 21.7), which is considerably narrower than the confidence intervals at the
      extremes of the data. As usual, R defaults to a 95% confidence interval.

It is important to realize that the confidence intervals produced by this method are
confidence intervals for the mean of Y . The confidence interval of (19.7, 21.7) for
x∗ = 20 means that we are confident that the true line has the value somewhere in this
interval at x = 20.
   Obviously, we often want to use the regression line to make predictions about future
observations of Y . Suppose for example in Example 8.3.1 that we produce another
sample with a measurement of 30 on the variable magnetic. The fitted line predicts a
measurement of 26.56 on the variable chemical. We also have a confidence interval of
(24.85, 28.26) for the mean of the possible observations at x = 30. But what we would
like to do is have an estimate of how close our measured value is likely to be to our
predicted value of 26.56. We take up this question next.
   Given a value of x = x∗, define Ŷ = β̂0 + β̂1 x∗ as before, and now let Y denote a
future observation of y at x = x∗. Since Y is based on a future observation, the random
variable Y is independent of the random variable Ŷ (which is based on the sample
observations). Consider the random variable Y − Ŷ. (This is simply the error made in
using Ŷ to predict Y.) This random variable has mean 0 and variance given by

              Var(Y − Ŷ) = Var(Y) + Var(Ŷ) = σ² + σ² (1/n + (x∗ − x̄)²/Sxx).

This leads to the following prediction interval for Y .

               Prediction intervals for a new Y given x = x∗ .
  A 100(1 − α)% prediction interval for a future value of Y given x = x∗ is

                  β̂0 + β̂1 x∗ ± tα/2,n−2 · s √(1 + 1/n + (x∗ − x̄)²/Sxx).

  For the ironslag data, with x∗ = 30 we have

> predict(l,data.frame(magnetic=30),interval="predict")
          fit      lwr      upr
[1,] 26.55574 19.39577 33.71571

Obviously, this is a very wide interval compared to the confidence intervals we generated
for the mean. This is because we are asking that the interval capture 95% of the values
of future measurements rather than just the true mean of such measurements.
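Placing the two kinds of intervals at x∗ = 30 side by side makes the comparison
concrete; the confidence interval below is the one computed in Example 8.3.2:

> predict(l,data.frame(magnetic=30),interval="confidence")
       fit      lwr      upr
1 26.55574 24.84847 28.26301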


The problem of multiple confidence intervals
When constructing many confidence intervals, we need to be careful in how we phrase
our conclusions. Consider the problem of constructing 95% confidence intervals for the
two parameters β0 and β1 . By the definition of confidence intervals, there is a 95%
probability that the confidence interval that we will construct for β0 will in fact contain
β0 and similarly for β1 . But what is the probability that both confidence intervals will
be correct? Formally, let Iβ0 denote the (random) interval for β0 and Iβ1 denote the
interval for β1 . We have P(β0 ∈ Iβ0 ) = .95 and P(β1 ∈ Iβ1 ) = .95. Then we know that

                            .90 ≤ P (β0 ∈ Iβ0 and β1 ∈ Iβ1 ) ≤ .95                          (8.2)

but we cannot say more than this unless we know the joint distribution of β̂0 and β̂1. In
fact, given the full assumptions of the normality model, we can find a joint confidence
region in the plane for the pair (β0 , β1 ). We need the ellipse package of R.

 > library(ellipse)
 > e=ellipse(l)
 > e[1:5,]
      (Intercept) magnetic
 [1,]    9.562752 0.6146146
 [2,]    9.300103 0.6266209
 [3,]    9.036070 0.6384663
 [4,]    8.771717 0.6501030
 [5,]    8.508108 0.6614841
 > plot(e,type=’l’)

We note that the ellipse is simply a set of points (by default 100 points are used)
and we can plot the points. (It is easier to use standard graphics to do this.) The
resulting ellipse is in Figure 8.4. The ellipse is chosen to have minimum area (just as
our confidence intervals are chosen to have minimum length). Thus more pairs of
slope and intercept values are allowed, but the ellipse itself is small in area compared
to the rectangle that is implied by using both individual confidence intervals.
   The problem of multiple confidence intervals arises in several other places. For
example, if we generate many 95% confidence intervals for the mean of Y given x
from the same data, we are not 95% confident in the entire collection.
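   A simple (if conservative) remedy, not pursued in the text's ellipse approach, is the
Bonferroni correction: to be at least 95% confident in both intervals jointly, compute
each individual interval at level 1 − .05/2 = .975:

 > confint(l, level=.975)

The resulting intervals are slightly wider than the individual 95% intervals, and by
inequality (8.2) the pair holds jointly with probability at least .95.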

8.4. Diagnostics
We can construct the regression line and compute confidence and prediction intervals
for any set of pairs (x1 , y1 ), . . . , (xn , yn ). But unless the hypotheses of the linear model
are satisfied and the data are “clean,” we will be producing mostly nonsense. Anscombe
constructed the examples of Figure 8.5 to illustrate this fact in a dramatic way. Each
of the datasets has the same regression line: y = 3 + .5x. Indeed, the means and
standard deviations of all the x’s are exactly the same in each case, and similarly for



Figure 8.4.: The 95% confidence ellipse for the parameters in the ironslag example.

the y’s. These data are available in the dataset anscombe. The first example looks like
a textbook example for the application of regression. The relationship in the second
example is clearly non-linear. In the third example, one point is disguising what seems
to be the “real” relationship between x and y. And in the fourth example, it is clear
that some other method of analysis is more appropriate (is the outlier good data or
not?). In each of these four examples, a simple plot of the data suffices to convince
us not to use linear regression (at least with the data as given). But departures from
the assumptions that are more subtle are not always easily detectable by a plot. (That
will be true particularly in the case of several predictors which we take up in the next
section.) In this section we look at some of the things that can be done to determine
if the linear model is the appropriate one.

8.4.1. The residuals
A careful look at the residuals often gives useful information about the appropriateness
of the linear model. We will use ei for the ith residual rather than ri to emphasize
that the residual is an estimate of εi , the error random variable of the model. Thus
ei = yi − ŷi . If the linear model is true, the random variables εi are a random sample
from a population that has mean 0, variance σ², and, in the case of the normality
assumption, are normally distributed. The residuals are estimates of the εi in this
random sample, so it behooves us to take a closer look at the distribution of the residuals.
The first step in an analysis of the model using residuals is to construct a residual
plot. While we could plot the residuals ei against either of xi or yi , the plot that is
usually constructed is that of the residuals against the fitted values ŷi . In other words,
we plot the n points (ŷ1 , e1 ), . . . , (ŷn , en ). In this plot we are looking for violations
of the linearity assumption, heteroscedasticity (unequal variances), and perhaps non-normality.

               Figure 8.5.: Four datasets with regression line y = 3 + .5x.


   Example 8.4.1. A famous dataset on cats used in a certain experiment has mea-
   surements of the body weight (in kg) and heart weight (in g) of 144 cats of both
   sexes. A linear regression for the 97 male cats suggests a strong relationship.
    >   library(MASS)
    >   cats.m=subset(cats,Sex==’M’)
    >   l.cats.m=lm(Hwt~Bwt,data=cats.m)
    >   xyplot(residuals(l.cats.m)~fitted(l.cats.m))
    >   summary(l.cats.m)

     Call:
     lm(formula = Hwt ~ Bwt, data = cats.m)

     Residuals:
         Min      1Q  Median      3Q     Max
     -3.7728 -1.0478 -0.2976  0.9835  4.8646

     Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
     (Intercept) -1.1841      0.9983  -1.186    0.239
     Bwt           4.3127     0.3399  12.688   <2e-16 ***
     ---
     Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

    Residual standard error: 1.557 on 95 degrees of freedom
    Multiple R-Squared: 0.6289,        Adjusted R-squared: 0.625


      F-statistic:     161 on 1 and 95 DF,                                       p-value: < 2.2e-16

      Note that R has functions to return both a vector of fitted values and a vector of
      the residual values of the fit. This makes it easy to construct the residual plot.


      [Residual plot: residuals(l.cats.m) plotted against fitted(l.cats.m).]

      This plot gives no obvious evidence of the failure of any of our assumptions. The
      residuals do look as if they are random noise.

   We next take a more careful look at the size of the residuals. The residual ei is the
realized value of a random variable Ei where Ei = Yi − Ŷi . (It is useful at this point to
stop and think about what a complicated random variable Ei is. We’ve come a long way
from tossing coins.) The important facts about the distribution of the residual random
variable Ei are

               E(Ei ) = 0,        Var(Ei ) = σ² (1 − 1/n − (xi − x̄)²/Sxx).

   The first equation here is easy to prove and expected. It follows from the fact that β̂0
and β̂1 are unbiased estimators of β0 and β1. Since Yi = β0 + β1 xi + εi and Ŷi = β̂0 + β̂1 xi ,
we have that Ei = εi + (β0 − β̂0 ) + (β1 − β̂1 )xi .
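   Relatedly, the realized residuals of any least squares fit that includes an intercept sum
to exactly zero, which is easy to confirm in R (a quick sketch using the cats fit):

 > mean(residuals(l.cats.m))   # essentially zero, up to machine precision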
   The variance computation above is a bit surprising at first glance. For ease of
notation, define hi = 1/n + (xi − x̄)²/Sxx . Then the second equality above says that
Var(Ei ) = σ²(1 − hi ). It can be shown that 1/n ≤ hi for every i. Therefore we
have that Var(Ei ) ≤ ((n − 1)/n)σ². This means that the variances of our estimates of εi
are smaller than the variances of the εi by a factor that depends only on the x values.
Notice that if hi is large, the variance of Ei is small. This means that for such a point,
the line is forced to be close to the point. Since hi is large when xi is far from x̄, this
means that points with extreme values of x pull the line close to them. The number hi is
appropriately called the leverage of the point (xi , yi ). This suggests that we should
pay careful attention to points of high leverage.
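   The leverages hi are computed in R by hatvalues(), and the bound 1/n ≤ hi can be
confirmed numerically for the cats fit (a sketch):

 > h=hatvalues(l.cats.m)
 > min(h) >= 1/length(h)
 [1] TRUE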
   With the variance of the residual in hand, we can normalize ei by dividing by the
estimate of its standard deviation. We are not surprised when the resulting random
variable has a t distribution. The resulting proposition should have a familiar look.


Proposition 8.4.2. With the normality assumption, E∗i = Ei /(S√(1 − hi )) has a t
distribution with n − 2 degrees of freedom.

  The proposition implies that if the normality assumption is true we should not
expect to see many standardized residuals outside of the range −2 ≤ e∗i ≤ 2. It is
useful to plot the standardized residuals against the fitted values. In the cats example,
the plot of the standardized residuals is produced by
> xyplot(rstandard(l.cats.m)~fitted(l.cats.m))

From this plot we see that there are one or two large residuals, both for relatively large
fitted values (corresponding to large cats).
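A quick way to count the flagged residuals directly (a sketch; compare the count with
what the plot shows):

> sum(abs(rstandard(l.cats.m)) > 2)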


[Plot: standardized residuals rstandard(l.cats.m) against fitted(l.cats.m).]


8.4.2. Influential Observations
An influential observation is one that has a large effect on the fit. We have already
seen that a point with large leverage has the potential to have a large effect on the
fit as the fitted line tends to be closer to such a point than other points. However
that point might still have a relatively small effect on the regression as it might be
entirely consistent with the rest of the data. To measure the influence of a particular
observation on the fit, we consider what would change if we left that point out of the
fit. Let a subscript of (i) on any computed value denote the value we get from a fit
that omits the point (xi , yi ). Thus β̂0(i) denotes the value of β̂0 when the point (xi , yi )
is removed. Also ŷj(i) denotes the predicted yj when the point (xi , yi ) is removed. We
might measure the influence of a point on the regression by measuring


  1. changes in the coefficients β̂ − β̂(i) and

  2. changes in the fit ŷj(i) − ŷj .

  The R function dfbeta() computes the changes in the coefficients. In the case of the
cats data, we have

> dfbeta(l.cats.m)
      (Intercept)           Bwt
48 -0.1333235404 4.245539e-02
49 -0.1333235404 4.245539e-02
50   0.2807378812 -8.855077e-02
143 0.1677492306 -6.250644e-02
144 -0.6605688365 2.461400e-01

Note that the last observation has a considerably greater influence on the regression
than the four other points listed (and indeed it is the point of greatest influence in this
sense). In particular, its inclusion changes the slope coefficient by about 0.25 (from
4.06 to 4.31).
  Changes in the fit depend on the scale of the observations, so it is customary to
normalize by a measure of scale. One such popular measure is known as Cook’s
distance. The Cook’s distance Di of a point (xi , yi ) is a measure of how this point
affects the other fitted values and is defined by Di = Σj (ŷj − ŷj(i) )²/(2s²). It can be
shown that

                           Di = ((e∗i )²/2) · (hi /(1 − hi )).

Thus the point (xi , yi ) has a large influence on the regression if it has a large residual
and/or a large leverage, and especially if it has both. A general rule of thumb is that a
point with Cook’s distance greater than 0.7 is considered influential. In the cats data
the last point is by far the most influential but is not considered overly influential by
this criterion. This point corresponds to the biggest male cat.

> cd=cooks.distance(l.cats.m)
> summary(cd)
     Min.    1st Qu.   Median      Mean   3rd Qu.      Max.
1.117e-06 1.563e-03 4.482e-03 1.331e-02 1.155e-02 3.189e-01
> cd[cd>0.1]
      140        144
0.1302626 0.3189215
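   The identity relating Di to the standardized residual and the leverage can be checked
numerically with hatvalues() and rstandard() (a sketch using the cats fit):

 > h=hatvalues(l.cats.m)
 > d=rstandard(l.cats.m)^2/2*h/(1-h)
 > max(abs(d-cooks.distance(l.cats.m)))   # essentially zero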

8.5. Multiple Regression
In this section, we extend the linear model to the case of several quantitative explana-
tory variables. There are many issues involved in this problem and this section serves
only as an introduction. We start with an example.


    Example 8.5.1. The dataset fat in the faraway package contains several body
    measurements of 252 adult males. Included in this dataset are two measures of
    the percentage of body fat, the Brozek and Siri indices. Each of these indices
    computes the percentage of body fat from the density (in gm/cm3 ) which in turn
    is approximated by an underwater weighing technique. This is a time-consuming
    procedure and it might be useful to be able to estimate the percentage of body
    fat from easily obtainable measurements. For example, it might be nice to have a
    relationship of the following form: density = f (x1 , . . . , xk ) for k easily measured
    variables x1 , . . . , xk . We will first investigate the problem of approximating body
    fat by a function of only weight and abdomen circumference. The data on the first
    two individuals is given for illustration.
     > fat[1:2,]
       brozek siri density age        weight height adipos free neck chest abdom hip
     1   12.6 12.3 1.0708 23          154.25 67.75    23.7 134.9 36.2 93.1 85.2 94.5
     2    6.9 6.1 1.0853 22           173.25 72.25    23.4 161.3 38.5 93.6 83.0 98.7
       thigh knee ankle biceps        forearm wrist
     1 59.0 37.3 21.9     32.0           27.4 17.1
     2 58.7 37.3 23.4     30.5           28.9 18.2

   The notation gets a bit messy. We will continue to use y for the response variable
and we will use x1 , . . . , xk for the k explanatory variables. We will again assume that
there are n individuals and use the subscript i to range over individuals. Therefore,
the ith data point is (xi1 , xi2 , . . . , xik , yi ). The standard linear model now becomes
the following.

                                The standard linear model

   The standard linear model is given by the equation

                            Y = β0 + β1 x1 + · · · + βk xk + ε                        (8.3)

   where

      1. ε is a random variable with mean 0 and variance σ²,

      2. β0 , β1 , . . . , βk , σ² are (unknown) parameters,

      3. and ε has a normal distribution.

  We again assume that the n data points are the result of independent ε1 , . . . , εn . To
find good estimates of β0 , . . . , βk we proceed exactly as in the case of one predictor and

19:08 -- May 4, 2008                                                                        817
8. Regression

find the least squares estimates. Specifically, let β̂i be an estimate of βi and define

                         ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + · · · + β̂k xik .
We choose these estimates so that we minimize SSResid where

                                SSResid = ∑ (yi − ŷi )2 .

It is routine to find the values of the β’s that minimize SSResid. R computes them with
dispatch. Suppose that we use weight and abdomen circumference to try to predict the
Brozek measure of body fat.
> l=lm(brozek~weight+abdom,data=fat)
> l

lm(formula = brozek ~ weight + abdom, data = fat)

(Intercept)          weight           abdom
   -41.3481         -0.1365          0.9151

In the case of multiple predictors, we need to be very careful in how we interpret the
various coefficients of the model. For example β1 = −0.14 in this model seems to
indicate that body fat is decreasing as a function of weight. This is counter to our
intuition and our experience which says that the heaviest men tend to have more body
fat than average. On the other hand, the coefficient β2 = 0.9151 seems to be consistent
with the relationship between stomach girth and body fat that we know. The key here
is that the coefficient β1 measures the effect of weight on body fat for a fixed abdomen
circumference. This makes more sense. Among individuals with a fixed abdomen
circumference, the heavier individuals tend to be taller and so have perhaps less body
fat. Even this interpretation needs to be expressed carefully however. It is misleading
to say that “body fat decreases as weight increases with abdomen circumference held
fixed” since increasing weight tends to increase abdomen circumference. We will come
back to this relationship in a moment but first we investigate the problem of inference
in this linear model. The short story of inference is that all of the results for the one
predictor case have the obvious extensions to more than one variable. For example, we
have the following theorem.

Theorem 8.5.2 (Gauss-Markov Theorem). The least squares estimator β̂j of βj has the
minimum variance among all linear unbiased estimators of βj .

  To estimate σ 2 , we again use MSResid except that we define MSResid by

                                MSResid = SSResid / (n − (k + 1)).


The denominator in MSResid is simply n − p where p is the number of estimated
parameters in the model. Using the estimate MSResid of σ 2 , we can again produce an
estimate sβ̂j of the standard deviation of β̂j and produce confidence intervals for βj .
For the body fat data we have

> summary(l)

lm(formula = brozek ~ weight + abdom, data = fat)

      Min       1Q        Median         3Q        Max
-10.83074 -2.97730       0.02372    2.93970    9.76794

             Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.34812    2.41299 -17.136 < 2e-16 ***
weight       -0.13645    0.01928 -7.079 1.47e-11 ***
abdom         0.91514    0.05254 17.419 < 2e-16 ***
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 4.127 on 249 degrees of freedom
Multiple R-Squared: 0.7187,     Adjusted R-squared: 0.7165
F-statistic: 318.1 on 2 and 249 DF, p-value: < 2.2e-16
> confint(l)
                  2.5 %       97.5 %
(Intercept) -46.1005887 -36.59566057
weight       -0.1744175 -0.09848946
abdom         0.8116675   1.01860856

From the output we observe the following. Our estimate for σ is the residual standard
error, 4.127, which is the square root of MSResid. We note that 249 degrees of freedom
are used, which is 252 − 3 since there are three parameters. We can compute the
confidence interval for β1 from the summary table (β̂1 = −0.14 and sβ̂1 = 0.019) using
the t distribution with 249 degrees of freedom, or from the R function confint.
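The hand computation can be sketched in R as follows (a sketch, assuming library(faraway) has been loaded so that the fat data frame is available):

```r
# Hand computation of the 95% confidence interval for the weight
# coefficient, using the estimate and standard error from summary().
l <- lm(brozek ~ weight + abdom, data = fat)
est <- coef(summary(l))["weight", "Estimate"]
se  <- coef(summary(l))["weight", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = 249) * se   # matches confint(l)["weight", ]
```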
   We can compute confidence intervals for the expected value of body fat and prediction
intervals for an individual observation as well. Investigating what happens for a male
weighing 180 pounds with an abdomen measure of 82 cm gives the following prediction
and confidence intervals:
> d=data.frame(weight=180, abdom=82)
> predict(l,d,interval=’confidence’)
         fit      lwr      upr
[1,] 9.13157 7.892198 10.37094
> predict(l,d,interval=’prediction’)


         fit       lwr      upr
[1,] 9.13157 0.9090354 17.35410
The average body fat of such individuals is likely to be between 7.9% and 10.4%. An
individual male not part of the dataset is likely to have body fat between 0.91% and
17.35%.
  We now return to the issue of interpreting the coefficients in the linear model. In
the case of the body fat example, let’s fit a model with weight as the only predictor.
> lm(brozek~weight,data=fat)
lm(formula = brozek ~ weight, data = fat)

(Intercept)          weight
    -9.9952          0.1617

Notice that the sign of the relationship between weight and body fat has changed!
Using weight alone, we predict an increase of 0.16 in percentage of body fat for each
pound increase in weight. What has happened? Let’s first restate the two fitted linear
models:

                     brozek = −41.3 − 0.14 weight + 0.92 abdom                     (8.4)
                     brozek = −10.0 + 0.16 weight                                  (8.5)
  In order to understand the relationships above, it is important to understand that
there is a linear relationship between weight and the abdomen measurement. One more
regression is useful.
> lm(abdom~weight,data=fat)

lm(formula = abdom ~ weight, data = fat)

(Intercept)          weight
    34.2604          0.3258

   Now suppose that we change weight by 10 pounds. The last analysis says that we
would predict that the abdomen measure increases by 3.3 cm. Using (8.4) we see that
an increase of 10 pounds in weight together with an increase of 3.3 cm in abdomen
circumference predicts an increase of 10 · (−0.14) + 0.92 · (3.3) ≈ 1.6 in the Brozek
index. But this is precisely what an increase of 10 pounds in weight should produce
according to (8.5). The
fact that our predictors are linearly related in the set of data (and so presumably in
the population that we are modeling) is known as multicollinearity. The presence of
multicollinearity makes it difficult to give simple interpretations of the coefficients in a
multiple regression.
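A quick way to see this multicollinearity numerically is to compute the correlation between the two predictors (a sketch, assuming the fat data frame from faraway is loaded):

```r
# A correlation near 1 indicates that weight and abdomen
# circumference are strongly linearly related.
cor(fat$weight, fat$abdom)
```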


Interaction terms
Consider our linear relationship, brozek = −41.3−0.14 weight+0.92 abdom. This model
implies that for any fixed value of abdom, the slope of the line relating brozek to weight
is always −0.14. An alternative (and more complicated) model would be that the slope
of this line also changes as the value of abdom changes. One strategy for incorporating
such behavior into our model is to add an additional term, an interaction term. The
equation for the linear model with an interaction term in the case that there are only
two predictor variables is

                        Y = β0 + β1 x1 + β2 x2 + β1,2 x1 x2 + ε.
While this is not the only way that two variables could interact, it seems to be the
simplest possible way. R allows us to add an interaction term using a colon.

> lm(brozek~weight+abdom+weight:abdom,data=fat)

lm(formula = brozek ~ weight + abdom + weight:abdom, data = fat)

 (Intercept)           weight           abdom       weight:abdom
  -65.866013         0.003406        1.155338          -0.001350

While the coefficient for the interaction term (−0.0014) seems small, one should realize
that the values of the product of these two variables are large so that this term con-
tributes significantly to the sum. On the other hand, in the presence of this interaction
term, the contribution of the term for weight is now very small.
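The size of a typical interaction contribution can be checked directly (a sketch, assuming fat is loaded):

```r
# The products weight*abdom are on the order of 10^4, so a
# coefficient of -0.00135 still contributes roughly -20 or so
# to a fitted value.
summary(fat$weight * fat$abdom)
```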
   With all the possible variables that we might include in our model and with all the
possible interaction terms, it is important to have some tools for evaluating different
choices. We take up this issue in the next section.

8.6. Evaluating Models
In the previous section, we considered several different linear models for predicting the
Brozek body fat index from easily determined physical measurements. Other models
could be considered by using other physical measurements that were available in the
dataset. How should we evaluate one of these models and how should we choose among
them?
  One of the principal tools used to evaluate such models is known as the analysis
of variance. Given a linear model (any model, really), we choose the parameters to
minimize SSResid. Recall
                                SSResid = ∑ (yi − ŷi )2 .


Therefore it seems reasonable to suppose that a model with smaller SSResid is better
than one with larger SSResid. Such a model seems to “explain” or account for more of
the variation in the yi . Consider the two models for body fat, one using only abdomen
circumference and the other only weight.
> la=lm(brozek~abdom,data=fat)
> anova(la)
Analysis of Variance Table

Response: brozek
           Df Sum Sq Mean Sq F value    Pr(>F)
abdom       1 9984.1 9984.1    489.9 < 2.2e-16 ***
Residuals 250 5094.9    20.4
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
> lw=lm(brozek~weight,data=fat)
> anova(lw)
Analysis of Variance Table

Response: brozek
           Df Sum Sq Mean Sq F value   Pr(>F)
weight      1 5669.1 5669.1 150.62 < 2.2e-16 ***
Residuals 250 9409.9    37.6
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Among other things, the function anova() tells us that SSResid = 5,095 for the linear
model using abdomen circumference and SSResid = 9,410 for the model using only
weight. While this comparison seems clearly to indicate that abdomen circumference
predicts Brozek index better on average than does weight, using SSResid as an absolute
measure of goodness of fit has two shortcomings. First, the units of SSResid are in
terms of the squares of y units which means that SSResid will tend to be large or small
according as the observations are large or small. Second, we will obviously reduce
SSResid by including more variables in the model so that comparing SSResid does not
give us a good way of comparing, say, the model with abdomen circumference and
weight to the model with abdomen circumference alone. We address the first issue now.
  We would like to transform SSResid into a dimension-free measurement. The key to
doing this is to compare SSResid to the maximum possible SSResid. To do this, define
                               SSTotal = ∑ (yi − ȳ)2 .

The quantity SSTotal could be viewed as SSResid for the model with only a constant
term. We have already seen (Problem 1.9) that ȳ is the unique constant c that min-
imizes ∑ (yi − c)2 . The quantity SSTotal can be computed from the output of the


function anova() by summing the column labeled Sum Sq. For the body fat data, that
number is SSTotal = 15,079.0.
   We first note that 0 ≤ SSResid ≤ SSTotal. This is because choosing β̂0 = ȳ and
β̂1 = · · · = β̂k = 0 would already achieve SSResid = SSTotal, but SSResid is the
minimum among all such choices. Using this fact, we have a first measure of the fit of
a linear model:

                                R2 = 1 − SSResid / SSTotal .
We have that 0 ≤ R2 ≤ 1 and R2 is close to 1 if the linear part of the model fits the data
well. The number R2 is sometimes called the coefficient of determination of the
model and is often read as a percentage. In the model for Brozek index which uses
only abdomen circumference, we can compute R2 from the statistics in the analysis of
variance table or else we can read it from the summary of the regression where it is
labeled Multiple R-Squared. We read the result below as “abdomen circumference
explains 66.2% of the variation in Brozek index.”
> summary(la)

lm(formula = brozek ~ abdom, data = fat)

      Min       1Q          Median         3Q         Max
-17.62568 -3.46724         0.01113    3.14145    11.97539

             Estimate Std. Error t value Pr(>|t|)
(Intercept) -35.19661    2.46229 -14.29    <2e-16 ***
abdom         0.58489    0.02643   22.13   <2e-16 ***
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 4.514 on 250 degrees of freedom
Multiple R-Squared: 0.6621,     Adjusted R-squared: 0.6608
F-statistic: 489.9 on 1 and 250 DF, p-value: < 2.2e-16
The value of R2 for the model using only weight is 37.6%.
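As a check, R2 for the abdomen-only model can be recomputed from the sums of squares in the anova() table (a sketch, assuming fat is loaded):

```r
la <- lm(brozek ~ abdom, data = fat)
tab <- anova(la)
ssresid <- tab["Residuals", "Sum Sq"]
sstotal <- sum(tab[["Sum Sq"]])    # SSTotal is the sum of the Sum Sq column
1 - ssresid / sstotal              # about 0.662, the Multiple R-Squared
```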
  The R2 values for two different models with the same number of parameters give
us a reasonable way to compare their usefulness. However R2 is a misleading tool
for comparing models with differing numbers of parameters. After all, if we allow
ourselves n parameters (an intercept and n − 1 explanatory variables), we will be able
to fit the data exactly and so achieve R2 = 100%. We consider just one
way of comparing two models with a different number of parameters. Given a model
with parameters β0 , . . . , βk , we define a quantity AIC, called the Akaike Information
Criterion, by

                                AIC = n ln(SSResid/n) + 2(k + 1).


While we cannot give the theoretical basis for choosing this measure, we can notice the
following two properties:

  1. AIC is larger if SSResid is larger, and

  2. AIC is larger if the number of parameters is larger.

These two properties should lead us to choose models with small AIC. Indeed, AIC
captures one good way of measuring the trade-off between reducing SSResid (good)
and increasing the number of terms in the model (bad). We can compute AIC for a
given model with the R function extractAIC.
> law=lm(brozek ~ abdom + weight,data=fat)
> extractAIC(law)
[1]   3.0000 717.4471

The 3 parameter model with linear terms for abdomen circumference and weight has
AIC = 717.4. This value of AIC does not mean much alone but it is used for comparing
models with differing numbers of parameters. (We should remark here that there are
different definitions of AIC that vary in the choice of some constants in the formula.
The R function AIC() computes another version of AIC. It usually does not matter
which version one uses to compare two models.)
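The formula can be verified against extractAIC by hand (a sketch, assuming fat is loaded):

```r
law <- lm(brozek ~ abdom + weight, data = fat)
n <- nrow(fat)                        # n = 252
ssresid <- sum(resid(law)^2)
n * log(ssresid / n) + 2 * (2 + 1)    # agrees with extractAIC(law)[2]
```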
  We illustrate the use of AIC in developing a model by applying it to the Brozek data.
We first consider a model that contains all 12 easily measured explanatory variables in
the dataset fat.

> lbig=lm(brozek ~ weight + height + neck + chest +
+ abdom + hip + thigh + knee + ankle + biceps + forearm + wrist,data=fat)
> extractAIC(lbig)
[1] 13.0000 712.5451

At least by the AIC criterion, the 13 parameter model is better (by a small margin)
than the 3 parameter model that we first considered.
   We really do not want the 13 parameter model above, however. First, it is too com-
plicated to suit the purpose of easily approximating body fat from body measurements.
Second, we really cannot believe that all these explanatory variables are necessary. In
order to decide which model to use, we might simply evaluate AIC for all possible
subsets of the 12 explanatory variables in the big model. While R packages exist that
do this, we use an alternate approach where we consider one variable at a time. The R
function that does this is step(). At each stage, step() performs a regression for each
variable, determining how AIC would change if that variable were left out of (or
included in) the model. The output is lengthy; the piece below illustrates the first step:

> step(lbig,direction=’both’)
Start: AIC=712.55
brozek ~ weight + height + neck + chest + abdom + hip + thigh +


     knee + ankle + biceps + forearm + wrist

            Df Sum of Sq      RSS     AIC
- chest      1       0.6   3842.8   710.6
- knee       1       3.0   3845.3   710.7
- ankle      1       5.8   3848.0   710.9
- height     1      15.5   3857.8   711.6
- biceps     1      19.4   3861.6   711.8
- thigh      1      19.9   3862.1   711.8
<none>                     3842.3   712.5
- hip       1       42.0   3884.2   713.3
- neck      1       55.4   3897.7   714.2
- forearm   1       67.0   3909.2   714.9
- weight    1       67.3   3909.6   714.9
- wrist     1       98.1   3940.4   716.9
- abdom     1     2831.4   6673.6   849.7

Step: AIC=710.58
brozek ~ weight + height + neck + abdom + hip + thigh + knee +
    ankle + biceps + forearm + wrist

For each possible variable that is in the big model, AIC is computed for a regression
leaving that variable out. For example, leaving out the variable chest reduces AIC to
710.6, an improvement from the value 712.5 of the full model. Removing chest gives
the most reduction of AIC. The second step starts with this model and determines that
it is useful to remove the knee measurement from the model.
brozek ~ weight + height + neck + abdom + hip + thigh + knee +
    ankle + biceps + forearm + wrist

            Df Sum of Sq      RSS     AIC
- knee       1       3.3   3846.1   708.8
- ankle      1       5.9   3848.7   709.0
- height     1      14.9   3857.8   709.6
- biceps     1      19.0   3861.8   709.8
- thigh      1      21.9   3864.7   710.0
<none>                     3842.8   710.6
- hip       1       41.6   3884.4   711.3
- neck      1       55.9   3898.7   712.2
+ chest     1        0.6   3842.3   712.5
- forearm   1       66.4   3909.2   712.9
- weight    1       87.3   3930.1   714.2
- wrist     1       98.0   3940.9   714.9
- abdom     1     3953.3   7796.1   886.9

Step:   AIC=708.8


brozek ~ weight + height + neck + abdom + hip + thigh + ankle +
    biceps + forearm + wrist

Notice that at this second step, all variables in the model were considered for exclusion
and all variables currently not in the model (chest) were considered for inclusion. After
several more steps, the final step determines that no single variable should be included
or excluded:

            Df Sum of Sq        RSS      AIC
<none>                       3887.9    705.5
- hip           1     38.8   3926.6    706.0
+ biceps        1     19.6   3868.2    706.2
+ height        1     17.5   3870.3    706.4
- thigh         1     53.3   3941.1    706.9
+ ankle         1      6.8   3881.1    707.1
- neck          1     57.7   3945.5    707.2
+ knee          1      2.4   3885.4    707.4
+ chest         1      0.1   3887.7    707.5
- wrist         1     89.8   3977.7    709.3
- forearm       1    102.6   3990.5    710.1
- weight        1    134.1   4021.9    712.1
- abdom         1   4965.6   8853.4    910.9

lm(formula = brozek ~ weight + neck + abdom + hip + thigh + forearm +
    wrist, data = fat)

(Intercept)          weight              neck        abdom           hip           thigh
   -21.7410         -0.1042           -0.3971       0.9584       -0.2010          0.2090
    forearm           wrist
     0.4372         -1.0514

The final model has AIC = 705.5 and appears to be the best model, at least by the
AIC criterion.
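If only the final model is wanted, the step-by-step trace can be suppressed. A sketch, reusing the lbig model from above:

```r
# trace = 0 suppresses the step-by-step output; the value of step()
# is the lm object for the final model.
lbest <- step(lbig, direction = "both", trace = 0)
extractAIC(lbest)   # (number of parameters, AIC) of the final model
```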

8.7. Exercises

8.1 Sometimes the experimenter has control over the choice of the points x1 , . . . , xn in
an experiment. Consider the following two sets of choices:
   Set A: x1 = 1, x2 = 2, x3 = 3, x4 = 4, x5 = 5, x6 = 6, x7 = 7, x8 = 8, x9 = 9, x10 = 10
   Set B: x1 = 1, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 10, x7 = 10, x8 = 10, x9 = 10, x10 = 10

  a) Explain how Proposition 8.1.3 can be used to argue for Set B.

  b) Despite the argument in part (a), why might Set A be a better choice?


8.2 A simple random sample was chosen from the population of all the students with
senior status as of February, 2003, who had taken the ACT test. The ACT score and
GPA of each student are in the file.

  a) Write the equation of the regression line that could be used to predict the GPA
     of a student from their ACT.

  b) Write a 95% confidence interval for the slope of the line.

  c) For each of the ACT scores 20, 25, 30, use the line to predict the GPA of a student
     with that score.

  d) Write 95% confidence intervals for the mean GPA of all students with ACT scores
     20, 25, and 30.

  e) Write a 95% prediction interval for the GPA of another student with ACT score

  f ) Plot the residuals from this regression and say whether the residuals indicate any
      concerns about whether the assumptions of the standard linear model are met.

8.3 A famous dataset (Pierce, 1948) contains data on the relationship between cricket
chirps and temperature. The dataset is reproduced at data/crickets.csv. Here the
variables are Temperature in degrees Fahrenheit and
Chirps giving the number of chirps per second of crickets at that temperature.

  a) Write the equation of the regression line that could be used to predict the tem-
     perature from the number of cricket chirps per second.

  b) Write a 95% confidence interval for the slope of the line.

  c) Write a 95% confidence interval for the mean temperature for each of the values
     12, 14, 16, and 18 of cricket chirps per second.

  d) You hear a cricket chirping 15 times per second. What is an interval that is likely
     to capture the value of the temperature? Explain what likely means here.

  e) Plot the residuals from this regression and say whether the residuals indicate any
     concerns about whether the assumptions of the standard linear model are met.

8.4 Prove Equation 8.2.
8.5 The faraway package contains a dataset cpd which has the projected and actual
sales of 20 different products of a company. (The data were actually transformed to
disguise the company.)


  a) Write a regression line that describes a linear relationship between projected and
     actual sales.

  b) Identify one data point that has particularly large influence on the regression.
     Give a couple of quantitative measures that summarize its influence.

  c) Refit the regression line after removing the data point that you identified in part
     (b). How does the equation of the line change?

A. Appendix: Using R
A.1. Getting Started
Download R from the R project website. There are Windows, Mac, and Unix versions.
These notes are for the Windows version. There will be minor differences for the other
versions.

A.2. Vectors and Factors
A vector has a length (a non-negative integer) and a mode (numeric, character, complex,
or logical). All elements of the vector must be of the same mode. Typically, we use a
vector to store the values of a quantitative variable. Usually vectors will be constructed
by reading data from an R dataset or a file. But short vectors can be constructed by
entering the elements directly.
> x=c(1,3,5,7,9)
> x
[1] 1 3 5 7 9

Note that the [1] that precedes the elements of the vector is not one of the elements
but rather an indication that the first element of the vector follows. There are a couple
of shortcuts that help construct vectors that are regular.
> y=1:5
> z=seq(0,10,.5)
> y;z
[1] 1 2 3 4 5
 [1] 0.0 0.5 1.0         1.5   2.0 2.5    3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0
[16] 7.5 8.0 8.5         9.0   9.5 10.0

To refer to individual elements of a vector we use square brackets. Note that a variety
of expressions, including other vectors, can go within the brackets.
> x[3]            # 3rd element of x
[1] 5
> x[c(1,3,5)]     # 1st, 3rd, 5th elements of x
[1] 1 5 9
> x[-4]           # all but 4th element of x
[1] 1 3 5 9


> x[-c(2,3)]      # all but 2nd and 3rd elements of x
[1] 1 7 9

   If t is a logical vector of the same length as x, then x[t] selects only those
elements of x for which t is true. Such logical vectors t are often constructed from
logical operations on x itself.

> x>5                                    # compares x elementwise to 5
[1] FALSE FALSE FALSE  TRUE  TRUE
> x[x>5]                                 # those elements of x where condition is true
[1] 7 9
> x[x==1|x>5]                            # == for equality and | for logical or
[1] 1 7 9

  Arithmetic on vectors works element by element as do many functions.

> x
[1] 1 3 5 7 9
> y
[1] 1 2 3 4 5
> x*y                  # componentwise multiplication
[1] 1 6 15 28 45
> x^2                  # exponentiation of each element by a constant
[1] 1 9 25 49 81
> c(1,2,3,4)*c(2,4)    # if the vectors are not of the same length, the shorter is
[1] 2 8 6 16           #     recycled if the lengths are compatible
> log(x)               # the log function operates componentwise
[1] 0.000000 1.098612 1.609438 1.945910 2.197225

A.3. Data frames
Datasets are typically stored in data frames. A data frame in R is a data structure
that can be considered a two-dimensional array with rows and columns. Each column
is a vector or a factor. The rows usually correspond to the individuals of our dataset.
Usually data frames are constructed by reading data from a file or loading a built-in R
dataset (see the next section). A data frame can also be constructed from individual
vectors and factors. The following R session uses the built-in iris dataset to illustrate
some of the basic operations on data frames.

> dim(iris)     # 150 rows or observations, 5 columns or variables
[1] 150   5
> iris[1,]      # the first observation (row)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2 setosa
> iris[,1]      # the first column (variable), output is a vector
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1


  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
> iris[1]       # alternative means of referring to first column, output is a data frame
    Sepal.Length
1             5.1
2             4.9
3             4.7
4             4.6
5             5.0
................     # many observations omitted
145           6.7
146           6.7
147           6.3
148           6.5
149           6.2
150           5.9
> iris[1:5,3]        # the first five observations, the third variable
[1] 1.4 1.4 1.3 1.5 1.4
> iris$Sepal.Length    # the vector in the data frame named Sepal.Length
   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
> iris$Sepal.Length[10]    # iris$Sepal.Length is a vector and can be used as such
[1] 4.9

    We next demonstrate how to construct a data frame from vectors and factors.

>   x=1:3
>   y=factor(c("a","b","c"))     # makes a factor of the character vector
>   d=data.frame(numbers=x, letters=y)
>   d
    numbers letters
1         1       a
2         2       b
3         3       c
>   d[,2]


[1] a b c
Levels: a b c
> d$numbers
[1] 1 2 3

A.4. Getting Data In and Out
Accessing datasets in R
There are a large number of datasets that are included with the standard distribution
of R. Many of these are historically important datasets or datasets that are often
used in statistics courses. A complete list of such datasets is available via the command data().
A built-in dataset named junk usually contains a data.frame named junk and the
command data(junk) defines that data.frame. In fact, many datasets are preloaded.
For example, the iris dataset is available to you without using data(iris). For the
built-in dataset junk, ?junk usually gives a description of the dataset.
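For example, using the built-in iris dataset:

```r
data()       # list the datasets available in the loaded packages
data(iris)   # load iris explicitly (usually already available)
?iris        # read the documentation for the dataset
```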
  Many users of R have made other datasets available by creating a package. A
package is a collection of R datasets and/or functions that a user can load. Some of
these packages come with the standard distribution of R. Others are available from
CRAN. To load a package, use library() or require() with the name of the package.
For example, the faraway package contains several datasets. One such dataset records
various health statistics on 768 adult Pima Indians for a medical study of diabetes.

> library(faraway)
> data(pima)
> dim(pima)
[1] 768    9
> pima[1:5,]
  pregnant glucose diastolic triceps insulin bmi diabetes age test
1        6     148        72      35       0 33.6   0.627 50     1
2        1      85        66      29       0 26.6   0.351 31     0
3        8     183        64       0       0 23.3   0.672 32     1
4        1      89        66      23      94 28.1   0.167 21     0
5        0     137        40      35     168 43.1   2.288 33     1

  If the package is not included in the distribution of R installed on your machine, the
package can be installed from a remote site. This can be done easily in both Windows
and Mac implementations of R using menus.
  Finally, datasets can be loaded from a file that is located on one’s local computer
or on the internet. Two things need to be known: the format of the data file and
the location of the data file. The most common format of a datafile is CSV (comma
separated values). In this format, each individual is a line in the file and the values
of the variables are separated by commas. The first line of such a file contains the
variable names. There are no individual names. The R function read.csv reads such


a file. Other formats are possible and the function read.table can be used with
various options to read these. The following example shows how a file is read from the
internet. The file contains the offensive statistics of all major league baseball teams for
the complete 2007 season.

> bball=read.csv(’’)
> bball[1:4,]
         CLUB LEAGUE    BA   SLG   OBP   G   AB   R    H   TB X2B X3B            HR   RBI
1    New York      A 0.290 0.463 0.366 162 5717 968 1656 2649 326 32            201   929
2     Detroit      A 0.287 0.458 0.345 162 5757 887 1652 2635 352 50            177   857
3     Seattle      A 0.287 0.425 0.337 162 5684 794 1629 2416 284 22            153   754
4 Los Angeles      A 0.284 0.417 0.345 162 5554 822 1578 2317 324 23            123   776
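Reading a file on the local machine works the same way. The following sketch writes a small CSV file to a temporary location and reads it back; the file contents and variable names are invented for illustration.

```r
# Create a small CSV file (the first line holds the variable names).
f <- tempfile(fileext = ".csv")
writeLines(c("club,league,runs",
             "Detroit,A,887",
             "Seattle,A,794"), f)

# Read it back; read.csv assumes comma-separated values with a header line.
teams <- read.csv(f)
teams$runs     # the runs column as a vector
# read.table(f, header=TRUE, sep=",") produces the same data frame.
```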

Creating datasets in R
Probably the best way to create a new dataset for use in R is to use an external program
to create it. Excel, for example, can save a spreadsheet in CSV format. The editing
features of Excel make it very easy to create such a dataset. Small datasets can be
entered into R by hand. Usually this is done by creating the vectors of the data.frame
individually. Vectors can be created using the c() or scan() functions.

> x=c(1,2,3,4,5:10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y=c("a","b","c")
> y
[1] "a" "b" "c"
> z=scan()
1: 2 3 4
4: 11 12 19
7: 4
Read 7 items
> z
[1] 2 3 4 11 12 19 4

   The scan() function prompts the user with the number of the next item to enter.
Items are entered delimited by spaces or commas. We can use as many lines as we like
and the input is terminated by a blank line. There is also a data editor available in
the graphical user interfaces but it is quite primitive.
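Once the individual vectors exist, they can be assembled into a dataset with the data.frame function. A minimal sketch (the variable names here are invented for illustration):

```r
# Build a small data frame from hand-entered vectors.
name  <- c("a", "b", "c")    # a character vector
score <- c(10, 12, 19)       # a numeric vector
d <- data.frame(name, score)
dim(d)                       # 3 rows, 2 columns
d$score                      # extract one variable from the data frame
```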

19:08 -- May 4, 2008                                                                  1005
A. Appendix: Using R

A.5. Functions in R
Almost all the capabilities of R are implemented as functions. A function in R is
much like a mathematical function: it has inputs and outputs. In
mathematics, f(x, y) is functional notation: the name of the function is f and there
are two inputs, x and y. The expression f(x, y) names the output of the
function. The notation in R is quite similar. For example, mean(x) denotes the result
of applying the function mean to the input x. There are some important differences in
the conventions that we typically use in mathematics and that are used in R.
   A first difference is that functions in R often have optional arguments. For example,
in using the function to compute the mean, there is an optional argument that allows
us to compute the trimmed mean. Thus mean(x,trim=.1) computes a 10%-trimmed
mean of x.
   A second difference is that in R inputs have names. In mathematics, we rely only on
position to identify which input is which in functions that have several inputs. Because
we have optional arguments in R, we need some way to indicate which arguments
we are including. Hence, in the example of the mean function above, the argument
trim is named. If we use a function in R, without naming arguments, then R assumes
that the arguments are included in a certain order (that can be determined from the
documentation). For example, the mean function has specification
mean(x, trim = 0, na.rm = FALSE, ...)

This means that the first three arguments are called x, trim, and na.rm. The latter two
of these arguments have default values if they are missing. If unnamed, the arguments
must appear in this order. If named they can appear in any order. The following short
session of R shows the variety of possibilities. Just remember that R first matches up the
named arguments. Then R uses the unnamed arguments to match the other arguments
it accepts in the order that it expects them. Notice that mean allows other arguments
... that it does not use.
> y=1:10
> mean(y)
[1] 5.5
> mean(x=y)                    # all these are legal
> mean(y,trim=.1)
> mean(trim=.1,y)
> mean(trim=.1,x=y)
> mean(y,.1,na.rm=F)
> mean(y,na.rm=F,.1)
> mean(y,na.rm=F,trim=.1)
> mean(y,.1,F)
> mean(y,trim=.1,F)
> mean(y,F,trim=.1)

> mean(y,F,.1)                 # these are not legal
> mean(.1,y)
> mean(z=y,.1)

A third difference between R and our usual mathematical conventions is that many
functions are “vectorized.” For example, the natural log function operates on vectors
one component at a time:
> x=c(1:10)
> log(x)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851
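The same conventions apply to functions we define ourselves, which the exercises at the end of this appendix ask for. The sketch below defines a function with one required argument and one optional, defaulted argument; the function itself is invented for illustration.

```r
# A user-defined function: shift x by m, where m defaults to the median of x.
center <- function(x, m = median(x)) {
  x - m
}
center(c(1, 2, 4, 6, 8, 10))          # uses the default, m = 5
center(c(1, 2, 4, 6, 8, 10), m = 0)   # names the optional argument
```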

A.6. Samples and Simulation
The sample() function allows us to choose probability samples of any size from a fixed
population. The syntax is sample(x,size,replace=F,prob=NULL) where

 x          a vector representing the population
 size       the size of the sample
 replace    is true or false according to whether the sampling is with replacement or not
 prob       if present, a vector of the same length as x giving the probability of choosing
            the corresponding individual
  The following R session gives examples of some typical uses of the sample command.

> x=1:6
> sample(x)       # a random permutation
[1] 5 6 2 1 4 3
> sample(x,size=10,replace=T)   # throwing 10 dice
 [1] 5 3 2 5 5 5 5 3 1 2
> sample(x,size=10,replace=T,prob=c(1/2,1/10,1/10,1/10,1/10,1/10)) # weighted dice
 [1] 1 1 3 5 1 6 1 5 1 1
> sample(x,size=10,replace=T,prob=c(5,3,2,1,1,1)) # weights need not sum to 1 (used proportionally)
 [1] 3 1 1 1 2 1 4 2 1 5
> sample(x,size=4,replace=F)    # sampling without replacement
[1] 2 3 1 6

  Simulation is an important tool for understanding what might happen in a random
sampling situation. Many simulations can be performed using the replicate function.
The simplest form of the replicate function is replicate(n,expr) where expr is an
R expression that has a value (e.g., a function) and n is the number of times that we
wish to replicate expr. The result of replicate is a list but if all replications of expr
have scalar values of the same mode, the result is a vector. Continuing the dice-tossing
motif, the following R session gives the result of computing the mean of 10 dice rolls
for 20 different trials.


> replicate(20, mean(sample(1:6,10,replace=T)))
 [1] 3.5 4.3 3.5 3.7 2.1 3.5 3.6 3.4 3.9 3.3 2.6 3.1 3.8 3.2 2.9 3.1 3.1 3.6 4.0
[20] 2.7

  If expr returns something other than a scalar, then the object created by replicate
might be a list or a matrix. For example, we generate 10 different permutations of the
numbers from 1 to 5.

> r=replicate(10,sample(c(1:5)))
> r
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    3    3    5    4    2    2    4    1    3     1
[2,]    5    1    4    1    4    3    1    2    1     3
[3,]    2    5    1    5    1    5    3    5    5     2
[4,]    4    2    3    3    3    1    5    3    4     4
[5,]    1    4    2    2    5    4    2    4    2     5
> r[,1]
[1] 3 5 2 4 1

Notice that the results of replicate are placed in the columns of the returned object.
In fact the result of replicate can have quite a complicated structure. In the following
code, we perform 1,000 trials of tossing 1,000 dice, and for each trial we
construct a histogram. Note that the internal structure of a histogram is a list with
various components.

> h=replicate(1000,hist(sample(1:6,1000,replace=T)))
> h[,1]
$breaks
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0

$counts
 [1] 169 176   0 177   0 149   0 170   0 159

$intensities
 [1] 0.3379999 0.3520000 0.0000000 0.3540000 0.0000000 0.2980000 0.0000000
 [8] 0.3400000 0.0000000 0.3180000

$density
 [1] 0.3379999 0.3520000 0.0000000 0.3540000 0.0000000 0.2980000 0.0000000
 [8] 0.3400000 0.0000000 0.3180000

$mids
 [1] 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25 5.75

$xname
[1] "sample(1:6, 1000, replace = T)"

$equidist
[1] TRUE

A.7. Formulas
Formulas are used extensively in R when analyzing multivariate data. Formulas can
take many forms and their meaning varies by R context but in general they are used to
describe models in which we have a dependent or response variable that depends
on some independent or predictor variables. There may also be conditioning
variables that limit the scope of the model. Suppose that x, y, z, w are variables (which
are usually vectors or factors). Then the following are legal formulas, together with a
way to read them.

                    x~y             x modeled by y
                    x~y|z           x modeled by y conditioned on z
                    x~y+w           x modeled by y and w
                    x~y*w           x modeled by y, w and y*w
                    x~y+I(y^2)      x modeled by y and y^2

  Notice in the last example that we are essentially defining a new variable, y^2, as one of
the predictor variables. In this case we need I to indicate that this is the interpretation.
Most arithmetic expressions can occur within the scope of I. For example,

> histogram(~I(x^2+x))

produces a histogram of the transformed variable x2 + x. (Leaving out the I in this
case gives a completely different result.)
  Most graphics commands that use formulas will use the vertical axis for the response
variable, the horizontal axis for the predictor variable, and will draw a separate plot
for each value of the conditioning variable (which is usually a categorical variable).
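Modeling functions take formulas in the same way. As a sketch using the built-in iris data (also used in the next section), the formula below asks lm, R's linear-model function, to model Sepal.Length by Sepal.Width:

```r
# Fit a simple linear model; the formula names the response (left of ~)
# and the predictor (right of ~).
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
coef(fit)   # the intercept and slope of the fitted line
```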

A.8. Lattice Graphics
The lattice graphics package (accessed by library(lattice)) is the R implementa-
tion of Trellis graphics, a graphics system developed at Bell Laboratories. The lattice
graphics package is completely self-contained and unrelated to the base graphics pack-
age of R. Lattice graphics functions in general produce objects that are of class “trellis.”
These objects can be manipulated and printed. Printing a lattice object is generally
what makes a graph appear in its own window on the display. The standard high-level
graphics functions automatically print the object they create. The most important
lattice graphic functions are as follows.


       xyplot()        scatter plot
       bwplot()        box and whiskers plot
       histogram()     histograms
       dotplot()       dot plots
       densityplot()   kernel density plots
       qq()            quantile-quantile plot for comparing two distributions
       qqmath()        quantile plots against certain mathematical distributions
       stripplot()     one-dimensional scatter plots
       contourplot()   contour plot of trivariate data
       levelplot()     level plot of trivariate data
       splom()         scatter plot matrix of several variables
       rfs()           residuals and fitted values plot

  The syntax of these plotting commands differs according to the nature of the plot
and the data and most of these high-level plotting commands allow various options. A
typical syntax is that of xyplot(), which we illustrate here using the iris data.

> xyplot(Sepal.Length~Sepal.Width | Species, data=iris, subset=c(1:149),
+ type=c("p","r"),layout=c(3,1))

  Here we are using the data frame iris, and we are using only the first 149 obser-
vations of this data frame. We are making three x-y plots, one for each Species (the
conditioning variable in the formula). The plots have Sepal.Width on the horizontal
axis and Sepal.Length on the vertical axis. The plots contain points and also a fitted
regression line. The three plots are displayed in a 3-column by 1-row layout.
All kinds of options besides type and layout are available to control the size, shape,
labeling, colors, etc. of the plot.

A.9. Exercises

A.1 Choose 4 integers in the range 1–10 and 4 in the range 11–20. Enter these 8
integers in non-decreasing order into a vector x. For each of the following R commands,
write down a guess as to what the output of R would be and then write down (using R,
of course) what the output actually is.

  a) x

  b) x+1

  c) sum(x)

  d) x>10

  e) x[x>10]

  f) sum(x>10)      Explain what R is computing here.

  g) sum(x[x>10])        Explain what R is computing here.

  h) x[-(1:4)]

   i) x^2

A.2 The following table gives the total of votes cast for each of the candidates in the
2008 Presidential Primaries in the State of Michigan.

                         Democratic                   Republican
                   Clinton       328,151        Romney         337,847
                   Uncommitted 236,723          McCain         257,521
                   Kucinich        21,708       Huckabee       139,699
                   Dodd             3,853       Paul            54,434
                   Gravel           2,363       Thompson        32,135
                                                Giuliani        24,706
                                                Uncommitted     17,971
                                                Hunter           2,823

  a) Create a data frame in R, named Michigan, that has three variables: candidate,
     party, votes. Be careful to make variables factors or vectors as appropriate.

  b) Write an R expression to list all the candidates.

  c) Write an R expression to list all the Democratic candidates.

  d) Write an R expression that computes the total number of votes cast in the
     Democratic primary.

A.3 The function mad computes the median of the absolute deviations from the median
of a vector of numbers. That is, if m is the median of x1, . . . , xn,
then the median absolute deviation from the median is

                             median{|x1 − m|, . . . , |xn − m|}.

  Actually, the function in R is considerably more versatile. For example, instead of m,
the function allows as an option that the mean x̄ be used. Also there are several
choices for which median is computed of the set of numbers. Finally, the R function
multiplies the result by a constant (the default is 1.4826 for technical reasons). Using
?mad, we find that the usage for the function is

mad(x, center = median(x), constant = 1.4826, na.rm = FALSE,
    low = FALSE, high = FALSE)


Enter the vector x=c(1,2,4,6,8,10).

     a) R computes mad(x) to be 4.4478. (Try it!) Using the help document and the
        default values of the function, explain how the number 4.4478 is computed.

     b) Compute mad(x,mean(x),constant=1,FALSE,TRUE,FALSE). Explain the result.

  c) The three logical values in the expression in part (b) might be mysterious to a
     reader. Rewrite the expression in a somewhat more self-explanatory form.

A.4 In R, define a vector x with 100 values of your own choosing. Compare the behavior of
 > histogram(~x^2+x)
 > histogram(~I(x^2+x))

and state precisely what each of the two expressions does with the data x.


