Chapter5--stat by ashrafp


									                                     Thanks to Joe Gheopreal (class of 2008) for outlining Chapter 5

                               Chapter 5: Producing Data


In statistics, we often need answers about immense groups of individuals. To get
proper answers, we must produce data in a way that addresses our questions. Since it is
usually nearly impossible to ask every subject in the population, we must generate a
sample that accurately represents the entire population. Two ways to gather data that
give the least distorted image of the population are an Observational Study and an
Experiment.

Observational Study- the observation of individuals and the measurement of variables of
interest. NOTE: An observational study does NOT attempt to influence the responses.

Experiment- the deliberate imposing of some treatment on individuals in order to observe
their responses.

Observational studies are great when a statistician explores data about topics such as
opinions and behaviors. But to gauge the effect of an intervention, a statistician needs
to impose a change, since the goal is to understand cause and effect. Observational
studies of the effect of one variable on another often fail because the explanatory
variable is confounded with lurking variables. Only well-designed experiments take steps
to defeat confounding. Sometimes we cannot observe individuals directly or perform an
experiment, so simulations provide an alternative method for producing data in such
circumstances. Statistical techniques for producing data open the door to statistical
inference: answering specific questions with a known degree of confidence. In the end,
the most important prerequisite for a trustworthy inference is careful design of data
production.

Part 1: Designing Samples

In most cases, we are gathering information about a large group of individuals. In the
real world, we do not have the time and money to contact every individual in the
entire population. Instead, we gather information from only a part of the group so that
we can draw conclusions about the entire group.

Population- The ENTIRE group of individuals that we want information about.

Sample- A part of the population that we actually examine in order to gather information.

Our population is defined in terms of our desire for knowledge. For example, if we wish
to draw conclusions about all the snowmen made in the U.S. during a winter storm, that
group is our population even if the snowmen in your neighborhood are the only snowmen
you see. The sample is the part from which we draw conclusions about the whole. To
collect data we can either use sampling or conduct a census.

Sampling- studying a part in order to gain information about the whole.

Census- to contact every individual in the entire population.

A carefully conducted sample can often be more accurate than a census. For example, a
farmer can sample his corn inventory to verify how much corn he has grown. Attempting
to count every last piece of corn would not just make him sick of the crop, but would
also be expensive and inaccurate, since bored people do not count carefully.

For conclusions based on a sample to be valid for the entire population, a proper design
for selecting the sample is required.

Design- the method used to choose the sample of the population.

Poor sample designs can produce MISLEADING conclusions…

Example 1: American Idol Example

It is down to the final two singers on American Idol, Sally Singsgreat and Bobby
Badvoice. As usual on the show, the host asks who should be the American Idol. Let's say
290,000 American callers responded and 86% said they want Bobby Badvoice to win. What is
wrong with this sampling?

Solution: People who actually spend time and money to respond to call-in polls are not
representative of the entire population. In fact, they tend to be the same people who call
radio shows. People who feel strongly, especially those with negative opinions, are more
likely to call. It would not be surprising if a properly designed sample showed that
79% want Sally Singsgreat to win.

Call-in opinion polls are an example of voluntary response sampling.

Voluntary response sample- a sample that consists of people who choose themselves by
responding to a general appeal. Voluntary response samples are biased because people
with strong opinions, especially negative opinions, are more likely to respond.
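
The arithmetic behind this bias can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 21% true support for Bobby comes from the 79% figure in the example, but the two response rates and the function name call_in_share are made up for the sketch.

```python
def call_in_share(true_share, rate_supporters, rate_others):
    """Expected share of callers backing a singer in a call-in poll.

    true_share      -- fraction of the whole population that backs the singer
    rate_supporters -- fraction of backers who bother to call (assumed)
    rate_others     -- fraction of everyone else who bothers to call (assumed)
    """
    callers_for = true_share * rate_supporters
    callers_against = (1 - true_share) * rate_others
    return callers_for / (callers_for + callers_against)


# Hypothetical rates: Bobby's fans feel strongly, so 10% of them call,
# while only 0.5% of everyone else calls.
share = call_in_share(0.21, 0.10, 0.005)
print(f"Bobby's share among callers: {share:.1%}")

# If both groups called at the same rate, the poll would match the population.
print(f"With equal rates: {call_in_share(0.21, 0.05, 0.05):.1%}")
```

Whenever the two response rates differ, the share among callers drifts away from the true share, which is exactly what bias means.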

Voluntary response is one common type of bad sample design. Another is shown in the
following example…

Example 2: Should the Mall Be Remodeled Example

The King of Prussia Mall wants to determine whether it should remodel, so its officials
decide to ask shoppers at the mall whether the mall should be remodeled. What is the
problem with this sampling?

Solution: This is a form of bad sample design called convenience sampling.

Convenience sampling- a sample that chooses the individuals easiest to reach.

This sample does not represent the entire population. For example, people who go to
malls more often tend to be richer, teenagers, or retired. Plus, mall officials might tend
to select neat, safe-looking individuals from the stream of customers.

Both forms of sampling are almost guaranteed not to represent the entire population.
These sampling methods display systematic error, or bias.

Bias- favoring some parts of the population over others.

To eliminate bias, the statistician allows impersonal chance to choose the sample. A
sample chosen by chance allows neither favoritism by the sampler nor self-selection by
respondents. Choosing a sample by chance eliminates bias by giving all individuals an
equal chance to be chosen. The simplest way to use chance to select a sample is to place
names in a hat (the population) and draw out a handful (the sample). This is called simple
random sampling.

Simple Random Sample (SRS)- consists of n individuals from the population chosen in such
a way that every set of n individuals has an equal chance to be the sample actually
selected.

An SRS does not only give each individual an equal chance of being chosen, but also
gives every possible sample an equal chance of being chosen. The idea of SRS is to
choose our sample by drawing names from a group. In practice, computer software can
choose an SRS from a list of individuals in the population. If software is unavailable, one
can randomize by using a table of random digits.
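
As a minimal sketch of letting software choose an SRS, the Python standard library's random.sample draws n individuals without replacement so that every set of n is equally likely. The population of 15 labels and the helper name simple_random_sample are assumptions made for this illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Return an SRS of n individuals: every set of n is equally likely."""
    rng = random.Random(seed)  # the seed is optional; it only makes the draw repeatable
    return rng.sample(list(population), n)


# Hypothetical population: 15 individuals labeled 0 to 14.
population = list(range(15))
sample = simple_random_sample(population, 2, seed=1)
print(sample)
```

Leaving seed as None gives a fresh sample on every run, which is what you want in practice; a fixed seed is only useful for checking your work.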

Table of Random Digits- a long string of digits 0 to 9 with the following properties: each
entry in the table is equally likely to be any of the 10 digits 0 through 9, and the entries
are independent of each other (this means knowledge of one part of the table gives
no information about any other part).

For random digits, refer to table B at the back of the book. These digits are random and
are put in groups of five only to make them easier to read. The groupings have no real
meaning.

Use table B for the following example…

Example 3: Pizza Delivery Example:

You are hosting a party and everyone wants pizza. You decide to choose at random the
pizza places you will order from. You plan to select two of the pizza places (an SRS
of 2).

Solution: Begin by labeling the places. For this example we will use 15. We will
use two-digit labels and label the pizza places from 00 to 14. Here are the places with the
labels attached.

00 – Pizza Hut
01 – Dominos
02 – Papa Johns
03 – Franzone’s
04 – Angelo’s
05 – Creaser’s
06 – Pizza Planet
07 – Peace of Pizza
08 – Famous Georges
09 – Costco
10 – Uno’s
11 – Burchuchi’s
12 – Leaning Tower of Pizza
13 – Pizza Castle
14 – Sabarro’s

Second, enter table B at any line and read off two-digit groups. For this example, here
is line 115.

61041 77684 94322 24709 73698 14526 31893 31592

The two-digit groups made from this line are…

61 04 17 76 84 94 32 22 47 09 73 69 81 45 26 31 89 33 15 92

As you can see, some of the groups do not match any label, so we simply ignore them. The
first two groups that do match a label (00 to 14) form our sample: in this case
04 (Angelo’s) and 09 (Costco).
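
The table-reading procedure can also be written out as a short Python sketch. The function name select_from_digits is made up for this illustration; the digits are the ones printed above for line 115. The sketch reads the digits strictly in groups of label_width, ignores groups that match no label, and skips a label that has already been chosen.

```python
def select_from_digits(digit_line, label_width, max_label, n):
    """Pick n distinct labels (0..max_label) by reading a line of random digits."""
    digits = "".join(ch for ch in digit_line if ch.isdigit())  # drop the spaces
    chosen = []
    for i in range(0, len(digits) - label_width + 1, label_width):
        label = int(digits[i:i + label_width])
        if label <= max_label and label not in chosen:
            chosen.append(label)
            if len(chosen) == n:
                break
    return chosen


line_115 = "61041 77684 94322 24709 73698 14526 31893 31592"
labels = select_from_digits(line_115, label_width=2, max_label=14, n=2)
print([f"{label:02d}" for label in labels])
```

If the line runs out before n labels are found, the function returns fewer than n labels; with a real table you would simply continue onto the next line.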

For an SRS there are two steps. The first step is to assign numerical labels to every
individual in the population. The second step is to use table B or any random number
generator to select labels at random. Be sure that all labels have the same number of
digits so they all have the same chance of being chosen. Use the shortest possible labels:
one digit for a population up to 10 members, two digits for 11 to 100 members, three
digits for 101 to 1000 members, etc.

The general framework for designs that use chance to choose a sample is a probability
sample.

Probability Sample – A sample chosen by chance. We must know what samples are
possible and what chance, or probability, each possible sample has.

Some probability sampling designs like SRS give each member of the population an
equal chance to be selected. This may not be true in more elaborate sampling designs. In
every case however, the use of chance to select the sample is the essential principle of
statistical sampling.

Yet designs for sampling from large populations spread out over a wide area are usually
more complex than an SRS. It is common to sample important groups within the
population separately, then combine these samples. This is called a stratified sample.

Stratified random sample – first divide the population into groups of similar individuals,
called strata, then choose a separate SRS in each stratum and combine these SRSs to form
the full sample.

Strata – the divisions of the population into groups of individuals.

One chooses the strata based on the facts known before the sample is taken. For
example, a population of bees can be divided into workers, drones, larvae, and queen
strata. A stratified design can produce more exact information than an SRS of the same
size by using the idea that individuals in the same stratum are similar to one another. If
all individuals in each stratum are identical, just one individual from each stratum is
enough to completely describe the population.
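
A stratified random sample can be sketched in Python as one SRS per stratum, combined at the end. The hive sizes, the bee names, and the choice of three bees per stratum are all hypothetical values picked for this illustration.

```python
import random


def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a separate SRS inside each stratum and combine the results."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        k = min(n_per_stratum, len(members))  # a stratum may be smaller than n
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical hive, divided into strata known before sampling.
hive = {
    "workers": [f"worker{i}" for i in range(50)],
    "drones": [f"drone{i}" for i in range(10)],
    "larvae": [f"larva{i}" for i in range(20)],
    "queen": ["queen0"],
}
picked = stratified_sample(hive, 3, seed=0)
print(picked)
```

Unlike an SRS of the whole hive, this design guarantees that the lone queen ends up in the sample.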

Another common way to restrict random selection is to choose the sample in stages. This
is done by multistage sampling design.

Multistage samples – selecting successively smaller groups within the population in
stages, resulting in a sample consisting of clusters of individuals.

Analysis of data from sampling designs more complex than an SRS goes beyond basic
statistics. The SRS is the building block of more elaborate designs, and analysis of other
designs differs more in complexity of detail than in fundamental concepts.

Random sampling eliminates bias in the choice of the sample from a population, yet
accurate information from a sample requires more than good sampling design. To begin
with, we need an accurate and complete list of the population; otherwise the sample
suffers from undercoverage.

Undercoverage – when some groups in the population are left out of the process of
choosing the sample.

For example, a sample survey of households will miss the homeless, prisoners, and
students in dormitories.

While some undercoverage is nearly unavoidable in most surveys, a more serious source
of bias is nonresponse.

Nonresponse – when an individual chosen for the sample cannot be contacted or does not
respond.

Example 4: Lance's Date Example:

Lance is looking for a date for the next prom, so he sends e-mails to every girl in the
senior class, asking if she wants to go to prom with him. Of all the girls that Lance
e-mailed, only 14% responded, and all of them said no. How could Lance adjust his survey
to improve his chances of getting a date?

Solution: There are several problems with Lance's survey. First, as described earlier, it
is a voluntary response sample, which helps explain why all 14% of the girls who
responded said no. Second, the sample suffers from undercoverage: he asked only the
senior girls, not any juniors, sophomores, or freshmen. Last, since only 14% responded,
Lance's survey suffers from nonresponse; some of the other girls might say yes to
him in person, or more likely no in person.

Yet, some girls might see Lance does not have a date, and decide to say yes to him just to
make him happy. This is an example of a cause of response bias.

Response bias – the behavior of the respondent or of the interviewer that can cause bias
in sample results.

Respondents may lie if asked about behavior that is unpopular or illegal. The sample
then underestimates the presence of such behavior in the population. The interviewer's
attitude, race, or sex, and the respondent's faulty memory, can easily influence
responses. Good interviewing technique is therefore another aspect of a well-done sample
survey.

The most important influence on the answers given to a sample survey is the wording of
the question.

Wording of question – Confusing or leading questions can lead to strong bias.

But how accurate are the results of a survey? If we take another sample, we can get
different results. But since we purposely use chance, the results obey the laws
of probability. In short, larger random samples give more accurate results than smaller
samples. Another important part of designing a survey is the sample frame.

Sample frame – List of individuals from which a sample is actually selected.

Ideally, the frame should list every individual in the population, but in practice this is
often difficult.
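
The earlier claim that larger random samples give more accurate results than smaller ones can be checked with a small simulation. This is a sketch under assumptions: the true proportion of 0.6, the sample sizes 25 and 400, and the function name are all chosen for illustration, and each respondent is modeled as an independent draw from a very large population.

```python
import random
import statistics


def spread_of_sample_proportions(p, n, repetitions, seed=0):
    """Standard deviation of the sample proportion over many samples of size n."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(repetitions):
        hits = sum(1 for _ in range(n) if rng.random() < p)  # one simulated survey
        proportions.append(hits / n)
    return statistics.pstdev(proportions)


small = spread_of_sample_proportions(0.6, n=25, repetitions=500)
large = spread_of_sample_proportions(0.6, n=400, repetitions=500)
print(f"spread with n=25:  {small:.3f}")
print(f"spread with n=400: {large:.3f}")
```

The results from the larger samples cluster much more tightly around the true proportion, so any single large sample is likely to land closer to the truth.
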
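
Another type of sample used is a systematic random sample.

Systematic random sample – Similar to an SRS, but only the starting point is chosen at
random; the remaining individuals are then taken at a fixed interval down the list (e.g.,
starting at 10 and taking every 20th individual: 10, 30, 50, 70, 90).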
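
A systematic random sample is easy to sketch in Python: choose one random starting point, then step through the frame at a fixed interval. The frame of 100 numbered individuals and the function name systematic_sample are assumptions for this illustration.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Random start among the first k individuals, then every k-th after that."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the size of the frame")
    rng = random.Random(seed)
    k = len(frame) // n        # sampling interval
    start = rng.randrange(k)   # the only random choice in the whole design
    return [frame[start + i * k] for i in range(n)]


frame = list(range(1, 101))    # hypothetical frame: individuals numbered 1 to 100
picked = systematic_sample(frame, 5, seed=3)
print(picked)
```

Every individual has the same chance of being chosen, but not every set of five is possible, which is why this design is not an SRS.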

Part 2: Designing Experiments

A study is an experiment when we actually do something to people, animals, or objects
in order to observe the response.

Experimental Units – The individuals on which the experiment is done.

Subject – When the units are human beings.

Treatment – A specific experimental condition applied to the units.

Because the purpose of an experiment is to reveal the response of one variable to changes
in other variables, the distinction between explanatory and response variables is
important.

Factors – The explanatory variables in an experiment.

Yet many experiments study the joint effects of several factors. Combining a specific
value of each of the factors forms each treatment.

Level – the specific value of each of the factors.

Example 5: The drug dealer experiment:

A drug dealer is wondering if taking two drugs at the same time makes you more
addicted to each. The drug dealer decides to ask the friends of his customers whether
they are more addicted to drugs. How should this experiment be conducted?

Solution: For this experiment the dealer should divide the subjects into four groups.

Group 1: Drug 1 and Drug 2
Group 2: Drug 1 and placebo
Group 3: placebo and Drug 2
Group 4: placebo and placebo
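
The four groups above are simply every combination of one level from each factor. A minimal Python sketch, with the factor names "first pill" and "second pill" invented for the illustration, builds them with itertools.product:

```python
from itertools import product

# Two factors, each with two levels (the drug itself or a placebo).
factors = {
    "first pill": ["Drug 1", "placebo"],
    "second pill": ["Drug 2", "placebo"],
}

# Each treatment combines one level of every factor.
treatments = list(product(*factors.values()))

for number, treatment in enumerate(treatments, start=1):
    print(f"Group {number}: {' and '.join(treatment)}")
```

With more factors or more levels the same call still lists every treatment, and the number of treatments is the product of the numbers of levels.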

Placebo – a dummy pill (or other dummy treatment) that looks and tastes like
the real treatment but has none of the active ingredients.

A study must also watch for lurking variables, for example whether the guys are
getting drugs from somewhere else, or what they eat, etc.

What makes an experiment more advantageous than an observational study is that
experiments give good evidence of causation. Experiments also let us study only the
factors that we are actually interested in, as well as the combined effects of
several factors. The basic design for an experiment is the following:

Units -> Treatment -> Observe Response

Yet in experiments there are chances that a response is due to a lurking variable rather
than a treatment…

Example 6: Luca Running Experiment:

Luca is determined to run a faster time than Marco, so when he sees new track shoes, the
Speedy Gonzales, that claim they will make you faster, he quickly buys them.
Luca then decides to alternate shoes every other day, and he finds that on the days he
wears the Speedy Gonzales, he beats Marco. What is wrong with Luca's experiment?

Solution: The kid may be a great runner, but he is not as good a statistician. His
experiment is poorly designed, as it suffers from the placebo effect.

Placebo Effect – When a subject responds favorably to any treatment, even a placebo.

Since Luca believed the Speedy Gonzales would make him run faster than Marco, he
would probably run a little harder on those days than on the other days. The results were
confounded by the placebo effect.

Confounded – Mixed up with.

The days Luca used his regular shoes are the days in the control group.

Control Group – the group receiving the placebo (or no new treatment), used as the
baseline for comparison.

The control group enables us to control the effects of outside variables on the outcome.
Control is the first basic principle of statistical design of experiments. The simplest form
of control is comparison.

Many experimenters try to match groups by elaborate balancing acts. Matching is
helpful, yet not adequate, because there are too many lurking variables that might affect
the outcome. The statistician's remedy is to rely on chance to make an assignment that
does not depend on any characteristic of the experimental units and that does not rely on
the judgment of the experimenter in any way. The use of chance can also be combined
with matching, as later examples will show.

Example 7: Dog crap problem:

Veterinarians want to help dogs with constipation get healthy again, so they
decide to test a new constipation drug for dogs. The response variable is a dog’s crapage
over a 30-day period. The control group eats a placebo. There are 40 dogs. How will we
conduct this experiment?

Solution: For this experiment let's use all 40 dogs and divide them into two groups
of 20. To do this without bias, number the dogs 00 to 39 and assign them
randomly to the two groups. Here is a diagram of the experiment…

                 Random assignment
                  /             \
           Group 1               Group 2
          (20 dogs)             (20 dogs)
              |                     |
         Treatment 1           Treatment 2
          Crap pill              placebo
                  \             /
                 Compare crap loss.

Randomization, the use of chance to divide experimental units into groups, is an essential
ingredient of good experimental design.
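
The random assignment of the 40 dogs can be sketched in Python by shuffling the labels and cutting the shuffled list in half. The helper name randomize_two_groups and the fixed seed are assumptions made so that the illustration is repeatable.

```python
import random


def randomize_two_groups(units, seed=None):
    """Shuffle the units by chance and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


dogs = [f"{i:02d}" for i in range(40)]   # labels 00 to 39
pill_group, placebo_group = randomize_two_groups(dogs, seed=5)
print("Pill:   ", pill_group)
print("Placebo:", placebo_group)
```

Nothing about a dog (size, breed, how constipated it looks) can influence which group it lands in, which is the whole point of randomizing.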

The logic behind the randomized comparative design as shown above is as follows:

      Randomization produces groups of dogs that should be similar in all respects
       before the treatment is applied.
      Comparative design ensures that influences other than the pill
       operate equally on both groups.
      Therefore, differences in average crapage must be due to either the pill or the play
       of chance in the random assignment of dogs to the pills.

The reason we assign many dogs to each group is that the effects of chance will then
average out, and there will be little difference in the average crapage of the two groups
unless the pills themselves cause a difference. The use of enough experimental units to
reduce chance variation is the third big idea of statistical design of experiments.

The basic principles of statistical design of experiments are

   1. Control the effects of lurking variables on the response, most simply by
      comparing two or more treatments.
   2. Randomize – Use impersonal chance to assign experimental units to treatments.
   3. Replicate each treatment on many units to reduce chance variation in the results.

We hope to see a difference in the responses so large that it is unlikely to happen just
because of chance variation. We try to learn if the treatment effects are larger than we
would expect to see if only chance were operating. If they are, then they are statistically
significant.

Statistically significant – An observed effect so large that it would rarely occur by
chance.
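
One way to see what "would rarely occur by chance" means is a randomization test: shuffle the observed responses between the two groups many times and count how often chance alone produces a difference at least as large as the one observed. This technique is not described in the chapter; it is offered only as a sketch, and the response values below are hypothetical.

```python
import random


def randomization_p_value(group_a, group_b, repetitions=2000, seed=0):
    """Fraction of random regroupings whose mean difference is at least
    as large as the observed one (in absolute value)."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(repetitions):
        rng.shuffle(pooled)  # pretend the group labels were handed out by chance
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / repetitions


# Hypothetical responses for a treatment group and a placebo group.
pill = [12, 14, 11, 15, 13, 16, 14, 12]
placebo = [8, 9, 7, 10, 8, 9, 11, 7]
p_value = randomization_p_value(pill, placebo)
print(f"Share of shuffles at least as extreme: {p_value:.4f}")
```

A very small share means chance alone rarely produces such a difference, so the observed effect would be called statistically significant.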

To compare an array of treatments, a completely randomized design would be best used.

Completely randomized – When all experimental units are allocated at random among all the
treatments.

The logic of a randomized comparative experiment depends on our ability to treat all the
experimental units identically in every way except for the actual treatments being
compared. Therefore, careful attention to detail is a must for good experiments.

Some experiments can be plagued by unconscious bias, so for an experiment to be most
effective, it must be a double blind experiment.

Double blind – neither the subjects nor the people who have contact with them know
which treatment a subject received.

An experiment can also be plagued if there is a strong lack of realism.

Lack of realism – The subjects, treatments, or setting of an experiment may not
realistically duplicate the conditions we really want to study.

Example 8: Pulse Introduction Example:

Pulse wants to redo the introduction to its daily broadcast. The staff produced two
introductions and must decide which one to use for the first broadcast. So they brought
in groups of students and told them that they were viewing the introductions for an
experiment. What is wrong with this experiment?

Solution: First, this experiment is not blind. Second, we cannot be sure the results apply
to everyday students, since the students know this is an experiment; hence the setting is
not realistic.

Lack of realism can limit our ability to apply the conclusions of an experiment to the
settings of greatest interest.

Example 9: Rabbit trap Example:

Rabbits can be a problem for farmers, and a new trap has come out to stop them. This trap
uses the scent of a specific vegetable to lure the rabbits. It is believed that the carrot
trap is better than the lettuce trap. So a farmer sets up an equal number of traps of each
kind and counts the rabbits caught by each respective trap. How is this
experiment organized?

Solution: This experiment is organized in a matched pairs design.

Matched pairs – A design that compares just two treatments.

For matched pairs we choose blocks of two units that are as closely matched as possible.

Block – a group of experimental units or subjects that are known before the experiment to
be similar in some way that is expected to affect the response to the treatments.

Block design – the random assignment of units to treatments is carried out separately
within each block.

Block designs can have blocks of any size. A block design combines the idea of creating
equivalent treatment groups by matching with the principle of forming treatment groups
at random. Blocks are another form of control. For example…

Example 10: STD’s and Gender example:

The progress of S.T.D. (Sexually Terrifying Disease), a type of STD, differs between women
and men. How can an experiment comparing therapies for it be properly designed?

Solution: Two separate randomizations would be done, assigning the subjects by their
gender. Note that there is no randomization in making these blocks. Then conduct the
experiment as normal.

               Men   --- Group (x3) – Therapy (x3) – Compare results
Subjects -----
               Women --- Group (x3) – Therapy (x3) – Compare results
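
A block design like the one in the diagram can be sketched in Python by sorting the subjects into blocks first and then randomizing separately inside each block. The subject names, the group sizes, and the therapy labels A, B, and C are hypothetical.

```python
import random


def block_randomize(subjects, block_of, treatments, seed=None):
    """Randomly assign treatments separately within each block."""
    rng = random.Random(seed)
    blocks = {}
    for subject in subjects:
        # No chance is involved here: the block is a known trait of the subject.
        blocks.setdefault(block_of(subject), []).append(subject)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # chance enters only inside a block
        for i, subject in enumerate(members):
            assignment[subject] = treatments[i % len(treatments)]
    return assignment


# Hypothetical subjects: (name, sex) pairs, six men and six women.
subjects = [(f"M{i}", "man") for i in range(6)] + [(f"W{i}", "woman") for i in range(6)]
therapies = ["Therapy A", "Therapy B", "Therapy C"]
assignment = block_randomize(subjects, lambda s: s[1], therapies, seed=2)

for (name, sex), therapy in sorted(assignment.items()):
    print(name, sex, therapy)
```

Each therapy ends up with the same number of men and the same number of women, so the difference between the sexes cannot be confused with a difference between therapies.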

Part 3: Simulating Experiments:

There are three methods we use to answer questions involving chance…

   1. Try to estimate the likelihood of a result of interest by actually carrying out the
      experiment many times and calculating the result’s relative frequency. That is
      slow, sometimes costly, and often impractical or logistically difficult.
   2. Develop a probability model and use it to calculate a theoretical answer. This
      requires that we know something about the rules of probability and therefore may
      not be feasible… yet.
   3. Start with a model that, in some way, reflects the truth about the experiment, and
      then develop a procedure for imitating or simulating a number of repetitions of
      the experiment. This is quicker than repeating the real experiment, especially if
      we use a calculator, and it allows us to do problems that are hard when done
      mathematically.

Example 11: Gambling Example:

Suppose that for some gambling game, you win if the majority of 10 drawn cards are
spades or clubs. But five of the deck's cards are damaged and unusable: two are black
suits and three are red suits. How do we conduct this simulation?

Simulation – The imitation of chance behavior, based on a model that accurately reflects
the experiment under consideration.

Solution: There are a few steps to follow when doing a simulation…

   1. State the problem or describe the experiment – Draw 10 cards. What is the
      likelihood of a run where more than 5 of the cards are either a spade or a club
      (black suit)?
   2. State the assumptions – The picks are not independent of each other, because
      cards are drawn without replacement from the 47 usable cards.
   3. Assign digits to represent outcomes – assign the labels 00 to 46 (since five of
      the cards are destroyed) (00 – 23 black suit, 24 – 46 red suit). Use the number
      table in the back (table B) to select labels until 10 are selected.
   4. Simulate many repetitions – Reading two-digit groups from table B until 10
      different labels from 00 to 46 have appeared (skip 47 to 99 and any repeated
      label) simulates one repetition. Repeat this many times to simulate many
      repetitions. Make sure to keep track of whether or not the event of interest
      has occurred.
   5. State your conclusion – Let's say that after 1000 simulations, 394 of those
      simulations had a majority of clubs and spades selected. We estimate the
      probability of a win by the observed proportion…

       394/1000 = 39.4%

       In the real world, if we did 1000 simulations by hand, we would go crazy. But a
       small number of simulations like 20 would not be enough. That’s why, with enough
       understanding of simulations, we can use a computer to do a large number of
       simulations. A mathematical analysis finds that the true probability is about
       39.1%.
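
The steps above can be sketched on a computer in Python. This is an illustration under the example's own assumptions (24 usable black cards, 23 usable red cards, 10 cards drawn without replacement); the function names are made up, and the exact calculation uses a standard counting argument that the chapter itself does not cover.

```python
import random
from math import comb

BLACK, RED, DRAWS = 24, 23, 10   # usable cards of each color, cards drawn


def simulate(repetitions, seed=0):
    """Estimate the chance that more than half of the drawn cards are black."""
    rng = random.Random(seed)
    deck = ["black"] * BLACK + ["red"] * RED
    wins = 0
    for _ in range(repetitions):
        hand = rng.sample(deck, DRAWS)   # drawing without replacement
        if hand.count("black") > DRAWS // 2:
            wins += 1
    return wins / repetitions


def exact_probability():
    """Count the winning hands directly instead of simulating them."""
    total = comb(BLACK + RED, DRAWS)
    favorable = sum(
        comb(BLACK, k) * comb(RED, DRAWS - k)
        for k in range(DRAWS // 2 + 1, DRAWS + 1)
    )
    return favorable / total


print(f"simulated: {simulate(20000):.3f}")
print(f"exact:     {exact_probability():.3f}")
```

Running more repetitions pulls the simulated estimate closer to the exact value, while a run of only 20 repetitions can miss it by a wide margin.
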

The hardest part of this process is establishing a correspondence between random
numbers and outcomes in the experiment, so it must be done carefully. While not true
of the above problem, some problems consist of independent trials.

Independent – The result of one part of the experiment does not affect the result of the
next part.

The trials in the above problem are not independent because when one card is drawn
from the deck, the chance that the next card is of the opposite color goes up slightly.

Test Questions:

   1. YouTube wants to know if it should add advertisements before every video. So
      YouTube decides to ask the first 1000 viewers of each day to state their opinion on
      the question for one month. This type of sample is most likely

   A.   Voluntary Response Sample.
   B.   Convenience Sample.
   C.   Simple Random Sample.
   D.   Probability Sample.
   E.   Stratified Random Sample.

   2. The science department of A.V. Agadro School is surveying students in person
      about whether they drink alcohol. Since some
      of the students did drink, they decided to ditch school so they would not get caught.
      This survey is suffering from…

   A.   Undercoverage.
   B.   Nonresponse.
   C.   Response bias.
   D.   None of the above.
   E.   Two of the above.

   3. A monkey, Chimpy, is one of 3 trillion chimps used as part of an experiment on
      a pill that is planned to be sold as an intelligence-raising pill. Chimpy took the
      one pill a day that he was given for a month and was then able to write
      Shakespeare, defuse a nuclear bomb, and compute the number pi to the
      1,000,000,000,000,000,000,000,000th place. There is evidence that the monkeys
      know what the pill is capable of doing. During that one month Chimpy could
      have taken…

   A.   A placebo.
   B.   The intelligence-raising pill.
   C.   Both pills.
   D.   Either pill.
   E.   A totally different pill from the above.

4. An experiment investigates whether playing another video game before the tested
   game helps the player score better on the tested game. Before treating the subjects,
   the experimenters separate those with proven video game experience from those who
   have never played a game in their lives. This is
   an example of …

A.   Lack of realism.
B.   Matched-pairs design.
C.   Double blind.
D.   Block design.
E.   None of the above.

5. You are an intern for the FDA. Your boss tells you to conduct a simulation
   to check how many restaurants disobey FDA regulations. Based on last
   year's data, 34 of 100 restaurants disobeyed FDA
   regulations. You need to conduct a simulation to estimate the probability
   that a restaurant would disobey FDA regulations.

A. State the assumptions that are found within this experiment.

B. How are you going to assign digits to accurately represent the outcomes of this
   experiment? Remember to make it clear, as this system is needed for the next
   parts.

C. Using line 120 of table B, simulate this experiment with the system you made in
   part B.

D. Now do the same thing as C, with line 122 of table B.

E. Do the same thing as C, with any other line on table B. Remember to state the
   line you selected or you get no credit.

F. Calculate the relative frequency of restaurants that fail FDA regulation and state
   your conclusions.

G. Now repeat this simulation 20 times on the calculator. Calculate the relative
   frequency and state your conclusions.

H. Now which relative frequency would you tell to your boss and why?
