Data Collection

					    Chapter 1: Data Collection

1.1 Introduction to the Practice of Statistics
1.2 Observational Studies, Experiments, and Simple Random Sampling
1.3 Other Effective Sampling Methods
1.4 Sources of Errors in Sampling
1.5 The Design of Experiments

                                                  September 3, 2008
   Definition of Statistics

Given a question, statistics is the art and science of designing studies,
collecting the data, summarizing the data, and then analyzing the data
to draw conclusions. In particular, statistics is:
     • collecting data
     • organizing this data
     • summarizing the organized data
     • analyzing the summarized data
     • drawing conclusions from this analysis

                                           Section 1.1

Data is information that is collected about a generic population (people,
animals, machines, etc.).
In the social sciences it is usually about people: their characteristics (height,
weight, age, etc.) or attitudes (beliefs, political opinions, religion, etc.).

                   Types of Statistics

• Descriptive Statistics: This type of statistics uses graphs, tables,
  charts and the calculation of various statistical measures (mean,
  standard deviation, etc.) to organize and summarize information
  about a population. This is the material in Math 127A.
• Inferential Statistics: This type of statistics consists of techniques
  (hypothesis testing, confidence intervals, etc.) to reach conclusions
  about a population based upon information obtained from a subset of
  the population. This is the material in Math 127B.

Average Yearly Temperature in Nashville

   Question: Is the climate of Nashville warming?
   The average temperature of Nashville for 1872-2007 is available on the
   National Weather Service website. The average daily
   temperature is calculated by summing the highest and lowest
   hourly temperatures and then dividing by 2. The monthly
   average temperature is obtained by computing the average
   of the daily average temperatures, and the yearly average
   temperature is obtained by computing the average of the
   monthly average temperatures.
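
The three averaging steps described above can be sketched in Python. This is only an illustration: the function names and the three-day set of readings are made up for the example and are not National Weather Service data.

```python
from statistics import mean

def daily_average(high, low):
    """Daily average as described above: (highest + lowest hourly reading) / 2."""
    return (high + low) / 2

def monthly_average(daily_averages):
    """Monthly average: the mean of the daily averages in that month."""
    return mean(daily_averages)

def yearly_average(monthly_averages):
    """Yearly average: the mean of the monthly averages."""
    return mean(monthly_averages)

# Hypothetical (high, low) readings in degrees F for a three-day "month".
readings = [(60, 40), (64, 44), (62, 42)]
dailies = [daily_average(high, low) for high, low in readings]
print(dailies)                   # the three daily averages
print(monthly_average(dailies))  # their mean
```

Note that the yearly figure computed this way gives every month the same weight, regardless of how many days the month has, which is what the description above says.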

Mathematica Notebook

     The Statistical Method (QDDI)

• Question: What is the problem of interest? Identify your
  research objective.
• Design: How will the data be collected? From whom? About
• Description: Give the characteristics of the data. This is where
  mathematics can play a major role. Summarize the data. Give
  a graphical description of the data. (Descriptive Statistics)
• Inference: What does the data tell us? If you started with a
  hypothesis, does the data confirm this hypothesis? (Inferential
  Statistics)


Harvard Medical School studied about 22,000 male physicians to determine if
taking aspirin could prevent heart attacks. The physicians were split into
two groups of about 11,000 each: one group would receive an aspirin per day
and the other would receive a placebo. The assignment of physicians was done
randomly. During the course of the study, 0.9% of the male physicians
who were taking aspirin had a heart attack, while 1.7% of those
taking the placebo experienced a heart attack. The researchers then used the
statistical method to predict that if all male physicians could have
participated in the study, the percentage having a heart attack would have
been lower for those taking aspirin.


• Question: Does taking aspirin each day reduce the
  incidence of heart attacks in male physicians?
• Design: Take a sample, with half taking aspirin and half
  taking a placebo. This is called an experiment.
• Description: Heart attack rate: aspirin (0.9%) versus
  placebo (1.7%).
• Inference: All male physicians would benefit from taking
  daily aspirin.

           Terminology of Statistics

•   Population: A population is the complete collection of all elements to be
    studied.
•   Sample: Any subset or group of a population is called a sample.
•   Variable: A variable is a characteristic of the individuals in the population
    that will be analyzed.
•   Parameter: A parameter is a numerical summary of a variable for the
    population.
•   Statistic: A statistic is a numerical summary of a variable obtained from a
    sample of the population.

                    Types of Data

• Quantitative data is composed of measurements (numbers)
  about the population.
• Categorical (or qualitative) data is data that can be separated
  into categories and can be identified by some non-numeric label.
• Continuous data is quantitative data that can take any value in an interval.
• Discrete data is quantitative data that is not continuous.

• Population: All of the students in Math 127A that are in WH 103 today.
• Sample: The students in Row 10 of the classroom.
• Variables:
    –   Color of eyes
    –   Month of birth
    –   Home state
    –   Age
    –   Religion

                       Example (continued)
• Data (Qualitative/Quantitative):
   –   Blue eyes
   –   October
   –   Georgia
   –   18
   –   Lutheran
• Parameters:
   – The average age.
   – The standard deviation of heights.
• Statistics:
   – The average age of students in Row 5.
   – The fraction of students with blue eyes in Row 9.
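
The difference between a parameter and a statistic in this example can be sketched in Python. The ages below are invented purely for illustration (three short rows instead of a full classroom); only the distinction between the two summaries matters.

```python
from statistics import mean

# Hypothetical ages, keyed by classroom row. Together the rows form the population.
ages_by_row = {
    5: [18, 19, 18, 20],
    9: [19, 19, 21, 18],
    10: [18, 18, 22, 20],
}

population = [age for row in ages_by_row.values() for age in row]

parameter = mean(population)       # average age of the whole population
statistic = mean(ages_by_row[5])   # average age of the sample (Row 5 only)

print(parameter, statistic)
```

The statistic depends on which row is chosen as the sample, while the parameter is a single fixed number for the population.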

          Data for Statistical Studies

• Census: A census is a list of all individuals in a population along with certain
  characteristics of each individual in the population (e.g., age, race, home
  ownership, etc.).
• Observational Study: An observational study attempts to measure a
  characteristic of the population by examining a sample, but does not
  manipulate the sample. An observational study often uses a sample
  survey to collect data.
• Experimental Study: An experiment selects a sample of the population
  and manipulates one or more variables of the population. The variable
  that is manipulated is called an independent variable and the variable that is
  affected is called a dependent variable.

                                                                Section 1.2
Census Website

          Observational Study

Observational Study: An observational study measures
the characteristics of a population by studying a sample
of individuals. It attempts to find connections between
these characteristics without manipulation of the sample.
The study is passive or ex post facto.

Design of Observational Studies

    Example of Sample Survey
Sample Survey: A random sample of 10,000 people is selected, and the individuals
are interviewed to determine information about the following variables of
the population:
• age
• race
• gender
• number of children
• income bracket ($0-$25K, $25K-$50K, ...)
• wealth bracket
• homeowner
Question: Is there a relationship between homeownership and number
of children?

Algorithm for Setting Up a Sample Survey

• Step 1: Identify the population from which the sample is to be drawn.
• Step 2: Compile a list of subjects in the population from which the
  sample will be taken. This is called the sampling frame.
• Step 3: Specify a method for selecting subjects from the sampling
  frame. This is called the sampling design.
• Step 4: Collect the data.

         Designed Experiments

Experimental Study: An experiment is a study in which data
is used and manipulated to determine the effects of one or
more variables (called explanatory variables) on another
variable (called the response variable). That is, the
explanatory variable is controlled to see how the response
variable changes with changes in the explanatory variable.
The conditions placed on the explanatory variable are called
treatments. In this type of study, the explanatory variable is
sometimes called a factor of the experiment.

Design of Experiments


Observational studies are useful for detecting connections between
two variables in a population. Experimental studies are useful to
determine the nature of the connection.

                  Types of Sampling
• Random (good)
• Non-random (bad)
  Examples: Suppose that our population is 200 students who are seated in a
  classroom of 10 rows with 20 seats per row.

  If we chose as our sample the students who sit in the even-numbered rows,
  then this would be a non-random sample.

  Suppose that we place 10 balls, each marked with a different number (1-10),
  in a bag. We would generate a random sample of 20 by choosing one of the
  balls out of the bag and using the number on the ball as the row for our
  sample.
                                                            Section 1.3
        Simple Random Sample

Simple Random Sampling: each individual in the
 population has the same or equal chance of being
 selected for a sample as any other individual. A list of
 individuals in the population from which a sample is to
 be drawn is called a frame.
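
A simple random sample drawn from a frame can be sketched with Python's standard `random` module. The frame here is an assumption made for illustration: 200 students labeled by (row, seat), matching the 10-by-20 classroom used earlier. The seed is arbitrary and only makes the sketch repeatable.

```python
import random

# Hypothetical frame: one (row, seat) label for each of 200 students.
frame = [(row, seat) for row in range(1, 11) for seat in range(1, 21)]

rng = random.Random(2008)          # arbitrary seed, for repeatability only
sample = rng.sample(frame, k=20)   # 20 distinct individuals, drawn without replacement

print(sample)
```

`random.sample` draws without replacement, so no student can be selected twice.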

              Two Sets of Random Numbers

     Generate a set of 100 random numbers (1 - 9)

     First set:
     S = {1, 6, 6, 9, 3, 1, 6, 3, 5, 5, 4, 4, 4, 9, 2, 1, 1, 7, 6, 3, 2, 8, 1, 5, 4,
     6, 4, 9, 8, 1, 3, 7, 5, 7, 9, 6, 1, 8, 1, 6, 8, 8, 6, 2, 5, 1, 6, 9, 6, 5, 8,
     8, 2, 9, 9, 6, 8, 6, 2, 9, 8, 1, 1, 8, 2, 9, 1, 9, 3, 9, 4, 5, 2, 2, 5, 3, 5,
     7, 2, 4, 1, 1, 4, 7, 4, 7, 7, 9, 9, 2, 4, 4, 9, 3, 6, 6, 6, 4, 1, 6}

     Second set:
     S = {8, 1, 7, 1, 2, 7, 6, 4, 4, 5, 9, 6, 5, 4, 9, 9, 2, 4, 6, 6, 6, 7, 4, 2, 1,
     8, 8, 7, 5, 9, 2, 6, 6, 7, 2, 8, 1, 4, 1, 4, 9, 2, 7, 2, 8, 7, 4, 4, 1, 9, 8,
     3, 5, 5, 5, 2, 8, 1, 2, 4, 2, 2, 7, 4, 2, 8, 8, 2, 4, 3, 9, 3, 7, 3, 2, 5, 1,
     1, 6, 7, 4, 6, 9, 1, 8, 4, 1, 8, 5, 9, 6, 3, 7, 5, 4, 1, 9, 9, 5, 3}

     [Figure: Frequency Chart of Numbers]

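
A frequency chart like the one named above can be approximated in Python. This sketch generates its own 100 random numbers from 1 to 9 rather than reusing the sets listed on the slide. The seed is arbitrary, and the text bar chart is only a stand-in for the original figure.

```python
import random
from collections import Counter

rng = random.Random(1)                              # arbitrary seed
numbers = [rng.randint(1, 9) for _ in range(100)]   # 100 draws from 1..9, repeats allowed
freq = Counter(numbers)                             # how often each value occurred

for value in range(1, 10):
    print(value, "#" * freq[value])                 # one '#' per occurrence
```

With only 100 draws the bars will usually be uneven, even though every value from 1 to 9 is equally likely on each draw.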
                         Types of Samples

Simple Random Sample: A sample that is obtained by randomly choosing individuals in the
population.

Stratified Sample: A stratified sample is a sample that is obtained by separating the population
into non-overlapping groups (called strata) and then randomly selecting individuals from each
stratum.

Systematic Sample: A systematic sample is a sample that is obtained by selecting individuals in
the population in a systematic way, e.g., every 5th individual.

Cluster Sample: A cluster sample is a sample that is obtained by selecting all individuals within a
randomly selected subset or group of the population.

Convenience Sample: A convenience sample is a type of sample that is drawn because it is easy
or convenient to collect. Convenience samples are likely to underrepresent portions of the
population. They may not be random and may contain bias due to time or location.
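
The stratified, systematic, and cluster methods defined above can be sketched in Python. The sketch assumes a hypothetical population of 200 students labeled by (row, seat) and treats each classroom row as a stratum or a cluster. The function names and the seed are illustrative choices, not part of the course material.

```python
import random

rng = random.Random(3)   # arbitrary seed, for repeatability only

# Hypothetical population: 200 students identified by (row, seat).
population = [(row, seat) for row in range(1, 11) for seat in range(1, 21)]

def stratified_sample(pop, per_stratum):
    """Treat each row as a stratum and draw randomly within every row."""
    rows = sorted({row for row, _ in pop})
    chosen = []
    for r in rows:
        stratum = [p for p in pop if p[0] == r]
        chosen.extend(rng.sample(stratum, per_stratum))
    return chosen

def systematic_sample(pop, step):
    """Pick a random starting point, then take every `step`-th individual."""
    start = rng.randrange(step)
    return pop[start::step]

def cluster_sample(pop, n_clusters):
    """Treat each row as a cluster and keep everyone in the randomly chosen rows."""
    rows = sorted({row for row, _ in pop})
    chosen_rows = set(rng.sample(rows, n_clusters))
    return [p for p in pop if p[0] in chosen_rows]

print(len(stratified_sample(population, 2)))   # 2 students from each of 10 rows
print(len(systematic_sample(population, 5)))   # every 5th student
print(len(cluster_sample(population, 2)))      # all students in 2 rows
```

The stratified sample touches every row, while the cluster sample takes whole rows and ignores the rest. That is the trade-off between coverage and convenience.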

     Three Main Sampling Methods




    Advantages of Different Random Sampling

•  Simple Random Sampling: Gives a good picture of the
  whole population.
• Cluster Random Sampling: It is often easier and cheaper to
  implement because subjects are close together and well-
  defined once clusters are chosen.
• Stratified Random Sampling: Guarantees that each
  stratum (segment) is sampled.

  Sources of Errors in Sampling

Fact: Erroneous conclusions can be drawn from observational or experimental
studies due to faulty statistical design and sampling.

• Non-sampling Errors: These errors occur when the sampling process (design)
is faulty. This usually occurs when there is a problem with the sampling frame
or sampling design. In other words, preference is given to selecting some
individuals over other individuals in the population.
      response errors
      non-response errors
      processing error
      analysis errors
      coverage errors
• Sampling or Estimation Errors: This error occurs when the sample gives an
incomplete picture of the population. This type of error is due to the fact that
we are using a sample instead of the whole population.

                                                    Section 1.4
          Non-sampling Errors

• Response Errors: Poor questionnaire design, interview
bias, respondent errors, poor survey process. For example,
the organization of the survey could be confusing, individuals
give deceptive responses to questions, the data collector
may not speak the language of the individual to be
interviewed, etc.
• Non-response Errors: Complete or partial non-response.
For example, individuals may agree to be interviewed, but
then choose not to answer some or all of the questions.
• Processing Errors: There are computational errors in
coding, capturing, editing and presenting the final data.
• Analysis Errors: Incorrect statistical tests are applied to
the data resulting in erroneous conclusions.
• Coverage Errors: There are errors in the duplication or
omission of individuals in the sample.
                    Non-sampling Bias

Example: Suppose we are interested in the approval rating of Mayor Dean and we
will conduct a random telephone survey on whether citizens of Nashville approve
or disapprove of his job performance since he took office. Is there bias in this
sample survey?
Answer: Maybe, since it will miss citizens who do not have a telephone and this
group of people may have different opinions about the mayor than those who do
have a telephone.

             Design of Experiments

Review from Section 1.3:

An experiment is a study for the collection of data that is used to
determine the effects of one or more variables (called explanatory
variables) on another variable (called the response variable). The
individuals from which the data is collected are called subjects or
experimental units. The conditions placed on the explanatory variable are
called treatments. In this type of study, the explanatory variable is
sometimes called a factor. An experiment is called double-blind if neither the
subjects nor the experimenter know which treatment is being
administered to each subject. We say that the experiment is completely
randomized if each experimental unit is randomly assigned to a
treatment. A randomized experiment comparing medical treatments is
called a clinical trial.

                                                           Section 1.5
         Types of Experiments

• Completely Randomized Design: Each experimental unit is
randomly assigned a treatment.
• Randomized Matched-pairs Design: Experimental units are
paired, with each experimental unit in the pair assigned a
different treatment. The matched pair can be the same
individual so that the individual receives both treatments (e.g.,
before and after).
• Randomized Block Design: Experimental units are
grouped into blocks. Units in each group (block) are
randomly assigned treatments.
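
A completely randomized design, the first type listed above, can be sketched in Python. The helper name `completely_randomized` and the 20 numbered units are assumptions made for illustration. Dealing the shuffled units out in turn is one simple way to keep the group sizes equal, and it is not the only valid randomization scheme.

```python
import random

def completely_randomized(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

# Hypothetical experiment: 20 numbered units, two treatments.
assignment = completely_randomized(range(1, 21), ["treatment", "placebo"], seed=5)
print(assignment)
```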


Object of Study: Does aspirin reduce the heart attack rate?
Population: Male physicians in the U.S.
Sample: 22,071 male physicians between the ages of 40 and 84.
Study: The sample was split into two groups. One group took an aspirin per
day and the other group took a placebo. The doctors were randomly
assigned to these two groups. The doctors were monitored over a 5-year
period.
Explanatory Variable: aspirin: yes or no (categorical)
Response Variable: heart attack: yes or no (categorical)
Type of Experiment: Completely randomized design.

                Example (continued)

                      Heart Attack
                  Yes            No           Total
  Aspirin         104         10,933         11,037
  Placebo         189         10,845         11,034
  Total           293         21,778         22,071

This is an experiment and the aspirin/placebo are the
treatments. We manipulated the explanatory variable
to see the effect on the response variable.

                 Example (continued)

Fraction of Heart Attacks for both Treatments

                    Yes             No         Total
  Aspirin         0.0094          0.9906        1.0
  Placebo         0.0171          0.9829        1.0
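
The fractions in this table follow directly from the counts in the previous table. A short Python sketch of the arithmetic, using the counts from the slide (the dictionary layout and function name are illustrative choices):

```python
# Heart attack counts from the study table (yes = had a heart attack).
counts = {
    "Aspirin": {"yes": 104, "no": 10_933},
    "Placebo": {"yes": 189, "no": 10_845},
}

def attack_fraction(group):
    """Fraction of the group that had a heart attack."""
    row = counts[group]
    return row["yes"] / (row["yes"] + row["no"])

for group in counts:
    frac = attack_fraction(group)
    print(f"{group}: {frac:.4f} ({1000 * frac:.1f} per 1000)")
# Aspirin: 0.0094 (9.4 per 1000)
# Placebo: 0.0171 (17.1 per 1000)
```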

                    Example (continued)

Conclusion from Study: The heart attack rate per 1000 male physicians
is 9.4 for those taking aspirin and 17.1 for those not taking aspirin.
Hence, we would conclude that taking aspirin reduces the heart attack
rate.

               Matched-pairs Designs

A matched-pair design experiment is a study where there are only two
treatments and experimental units are matched. One experimental unit receives
one treatment and the other experimental unit receives the second treatment.
The pairs may be the same individual (before treatment and after treatment) or they
may be two individuals who have similar characteristics (e.g., gender, age, etc.).
The assignment of the treatments within each pair should be random.

         Example of Matched-Pairs

Purpose: Study the effect of taking caffeine one half hour before swimming.
Sample: 50 randomly chosen swimmers.
Explanatory Variable: A caffeine pill or a placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment
Matched-pair Design: The 50 swimmers are selected. Each swimmer is randomly
given the caffeine pill or the placebo and swims one mile with the time recorded. After 1
week, the same 50 swimmers return and are given the treatment that they did not
receive the previous week. They swim the mile and the time is recorded. Each
swimmer's two times, one under each treatment, are then compared.
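
The matched-pair procedure in this example can be sketched in Python. The swimmer labels, the seed, and the helper `paired_differences` are hypothetical. No swim times are supplied here because the slides do not report any.

```python
import random

rng = random.Random(7)   # arbitrary seed, for repeatability only
swimmers = [f"swimmer_{i}" for i in range(1, 51)]   # hypothetical labels

# The week-1 treatment is chosen at random. Week 2 is whichever one is left.
schedule = {}
for s in swimmers:
    first = rng.choice(["caffeine", "placebo"])
    second = "placebo" if first == "caffeine" else "caffeine"
    schedule[s] = (first, second)

def paired_differences(times):
    """times maps swimmer -> {"caffeine": minutes, "placebo": minutes}.

    Returns the caffeine time minus the placebo time for each swimmer.
    """
    return {s: t["caffeine"] - t["placebo"] for s, t in times.items()}
```

Once times are recorded, `paired_differences` gives one number per swimmer. A negative difference means that swimmer was faster with caffeine.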

             Blocks and Block Designs

•   A collection of experimental units that have the same (or similar) values on a key
    variable is called a block. In the previous example, each subject (person) is a block.
•   Experimental units are divided into groups (blocks) and each treatment is randomly
    assigned to one or more of the units in each block. In other words, a block design
    identifies blocks before the start of the experiment and assigns subjects to
    treatments within those blocks.
•   To reduce bias, the order of treatments within each block is randomized and we call
    this a randomized block design.
•   A matched-pair design is a special type of block design. Here each pair of
    experimental units forms a block.
•   In a block design study, an experimental unit (subject) may receive only one
    treatment.

             Example of Block Design

Purpose: Study the effect of taking caffeine one half hour before swimming.
Sample: 50 swimmers: 16 males who swim competitively, 14 males who do not
swim competitively, 8 females who swim competitively and 12 females who do not swim
competitively.

Explanatory Variable: A caffeine pill or a placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment
Randomized Block Design: We create four blocks (16, 14, 8, 12 subjects). Within
each block, individuals are randomly assigned either the caffeine pill or the placebo. Each
subject's swim time is recorded. The times of the swimmers within each block, as well as
across the blocks, are compared (caffeine pill versus placebo).
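
The randomization within blocks can be sketched in Python using the four block sizes from this example. The even split of each block between caffeine and placebo is an assumption, since the slide does not say how many subjects per block receive each treatment. The subject labels and seed are also hypothetical.

```python
import random

rng = random.Random(11)   # arbitrary seed, for repeatability only

# Block sizes from the example above.
block_sizes = {
    "male, competitive": 16,
    "male, non-competitive": 14,
    "female, competitive": 8,
    "female, non-competitive": 12,
}

assignment = {}
for block, size in block_sizes.items():
    units = [f"{block} #{i}" for i in range(1, size + 1)]   # hypothetical labels
    rng.shuffle(units)                                      # randomize within the block only
    half = size // 2                                        # assumed even split
    for unit in units[:half]:
        assignment[unit] = (block, "caffeine")
    for unit in units[half:]:
        assignment[unit] = (block, "placebo")

print(len(assignment))   # 50 subjects in total
```

Because the shuffle happens inside each block, every block contributes subjects to both treatments, so a comparison within a block is not confounded by gender or competitive status.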

       What type of experiment?

A drug company wanted to test a new arthritis medication. The
researchers found 200 adults aged 25-35 and randomly assigned them to
two groups. The first group received the new drug, while the second
received a placebo. After one month of treatment, the percentage of each
group whose arthritis symptoms decreased was recorded and compared
with their original condition. What type of experimental design is this?

       What type of experiment?

 A medical journal published the results of an experiment on insomnia.
The experiment investigated the effects of a controversial new therapy for
insomnia. Researchers measured the insomnia levels of 86 adult women
who suffer moderate conditions of the disorder. After the therapy, the
researchers again measured the women's insomnia levels. The
differences between the pre- and post-therapy insomnia levels were
reported. What type of experimental design is this?

      What type of experiment?

A farmer wishes to test the effects of a new fertilizer on her tomato yield.
She has four equal-sized plots of land--one with sandy soil, one with rocky
soil, one with clay-rich soil, and one with average soil. She divides each of
the four plots into three equal-sized portions and randomly labels them A,
B, and C. The four A portions of land are treated with her old fertilizer. The
four B portions are treated with the new fertilizer, and the four C's are
treated with no fertilizer. At harvest time, the tomato yield is recorded for
each section of land. What type of experimental design is this?

       What type of experiment?

A random sample of 1,000 overweight male adults is recruited. Each male
is weighed and his weight is recorded. Each individual is given a diet and
is told to follow it for one month. After one month, each individual is
weighed again and the weight is recorded. The “before” and “after” weights are compared. What type of
experimental design is this?

        What type of experiment?

A random sample of 30 Vanderbilt students is selected. We are interested in
the reaction times when using or not using a cell phone during driving. Each
student’s reaction time was measured when he or she was using or not
using a cell phone on a driving course in a Vanderbilt parking lot. What type
of experimental design is this?