Chapter 4

Shared by: liwenting (posted 1/31/2011)
							             Chapter 4
Scatterplots & Correlation


                       Samuel Clark




 Department of Sociology, University of Washington
  Institute of Behavioral Science, University of Colorado at Boulder
Agincourt Health and Population Unit, University of the Witwatersrand
            Explanatory and Response Variables
 • Interested in studying the relationship
   between two variables by measuring
   both variables on the same individuals.
      – a response variable measures an outcome
        of a study
      – an explanatory variable explains or
        influences changes in a response variable
      – sometimes there is no distinction

                     Question
In a study to determine whether surgery or
chemotherapy results in higher survival rates for
a certain type of cancer, whether or not the
patient survived is one variable, and whether
they received surgery or chemotherapy is the
other. Which is the explanatory variable and
which is the response variable?
                 Response: Survival
            Explanatory: Type of treatment

  Example 4.1: Response & Explanatory

Question: How does drinking affect blood
alcohol level?
Investigation: Student volunteers drink different
numbers of cans of beer and thirty minutes later
a police officer measures the alcohol content of
their blood.
         Response: Blood alcohol content
Explanatory: Number of cans of beer consumed

            Example 4.2 - Descriptive
      A college student loan officer wants to
understand the situation of recent college grads.
She looks at data describing recent grads:
amount of debt, current income and how stressed
they feel.
      In this situation the distinction between
response and explanatory variables is not
important – she is not trying to 'explain' changes
in one variable with changes in another.

        Example 4.2 - Explanatory
     A sociologist looks at the same data
describing recent college grads and asks “can
amount of debt and income be used along with
other variables to explain stress caused by
college debt?”
     Now stress level is the response variable
and amount of college debt and income, etc. are
the explanatory variables.


                  Scatterplot

 • Graphs the relationship between two
   quantitative (numerical) variables measured
   on the same individuals.

 • If a distinction exists, plot the explanatory
   variable on the horizontal (x) axis and plot the
   response variable on the vertical (y) axis.

               Example 4.3/4
 • The next figure displays a scatterplot of state
   mean SAT math scores vs. percent of
   graduates taking the SAT.
 • Use the four-step process to describe the
   possible influence of who takes the SAT on
   the mean math score:
     1. State the problem  2. Make a plan
     3. Solve  4. Conclude

              Example 4.3/4
STATE: The percent of high school students
who take the SAT varies from state to state.
Does this affect the average SAT math score?

PLAN: Examine the relationship between the
mean SAT math score and the percent of
graduating students who take the SAT. Make a
scatterplot to display the relationship between
the variables. Interpret what we see.
              Example 4.3/4
SOLVE: We suspect that “percent taking” will
help explain the “mean math score”. We want
to see how mean math score (response)
changes as percent taking (explanatory)
changes.
     There is a clear direction to the overall
pattern, from upper left to lower right, so the
association is negative. The form of the
relationship is linear.

               Example 4.3/4
SOLVE (cont.): There also appear to be two
clusters in the data: one in the upper left and
a second containing the rest of the data.
      The strength of the relationship is weak
because the points do not lie very close to the
line that you could draw through them.

               Example 4.3/4
CONCLUDE: Percent taking does explain
some of the variation in mean math SAT score.
States with a larger fraction of students taking
the SAT have lower mean math SAT scores.
These are the states in which most students
take the SAT and fewer students take the ACT.

In the ACT states, it is mainly the better students
who take the SAT, in order to apply to the best
colleges, so the mean score in those states is higher.

              ACT and SAT States
To add a categorical variable (region), use a
different plot color or symbol for each category.

[Figure: the SAT scatterplot with Southern states highlighted]

The Midwest states are mainly ACT states and the
Northeast states are mainly SAT states.

  Example 4.5 – Manatees and Boats




                  Scatterplot

   • Look for overall pattern and
     deviations from this pattern
   • Describe pattern by form, direction,
     and strength of the relationship
   • Look for outliers

            Linear Relationship

     Some relationships are such that the
     points of a scatterplot tend to fall along
     a straight line – linear relationship




                    Direction
 • Positive association
      – above-average values of one variable tend
        to accompany above-average values of the
        other variable, and below-average values
        tend to occur together
 • Negative association
      – above-average values of one variable tend
        to accompany below-average values of the
        other variable, and vice versa

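The two definitions above can be checked on data by comparing each individual with the two means. The following is a minimal Python sketch under stated assumptions: the function name `direction` and the used-car numbers are made up for illustration and do not come from the slides. Counting points is only a rough check of direction, not a measure of strength.

```python
from statistics import mean

def direction(xs, ys):
    """Classify the association by comparing each point with the two means."""
    x_bar, y_bar = mean(xs), mean(ys)
    # A positive product means both values are above, or both below, average.
    same_side = sum(1 for x, y in zip(xs, ys) if (x - x_bar) * (y - y_bar) > 0)
    # A negative product means one value is above average and the other below.
    opposite = sum(1 for x, y in zip(xs, ys) if (x - x_bar) * (y - y_bar) < 0)
    if same_side > opposite:
        return "positive"
    if opposite > same_side:
        return "negative"
    return "none"

# Hypothetical used-car data: age in years, selling price in $1000s.
age = [1, 2, 4, 6, 8, 10]
price = [22, 19, 15, 11, 8, 5]
print(direction(age, price))
```

Every older-than-average car in this made-up sample sells below the average price, so the sketch reports a negative association.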
                 Examples

    From a scatterplot of college students,
    there is a positive association between
    verbal SAT score and GPA.

    For used cars, there is a negative
    association between the age of the car
    and the selling price.


             Examples of Relationships
[Figure: four scatterplots. Top left: Health Status Measure vs. Income.
Top right: Health Status Measure vs. Age. Bottom left: Education Level
vs. Age. Bottom right: Mental Health Score vs. Physical Health Score.]

    Measuring Strength & Direction
       of a Linear Relationship
 • How closely does a non-horizontal straight
   line fit the points of a scatterplot?
 • The correlation coefficient (often referred to
   as just correlation): r
    – measure of the strength of the relationship:
      the stronger the relationship, the larger the
      magnitude of r.
    – measure of the direction of the relationship:
      positive r indicates a positive relationship,
      negative r indicates a negative relationship.

            Correlation Coefficient
 • special values for r:
     – a perfect positive linear relationship would have r = +1
     – a perfect negative linear relationship would have r = -1
     – if there is no linear relationship, or if the scatterplot
       points are best fit by a horizontal line, then r = 0
     – Note: r must be between -1 and +1, inclusive
 • both variables must be quantitative; no distinction
   between response and explanatory variables
 • r has no units; does not change when
   measurement units are changed (e.g., feet or inches)

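Two of the properties listed above, that r ignores the choice of units and makes no distinction between the two variables, can be checked numerically. This is a minimal sketch, assuming made-up heights and weights that are not from the slides. The helper `correlation` is written here for illustration; it uses sums of deviation products, which is algebraically the same as averaging products of standardized values.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Sample correlation r, computed from sums of deviation products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights in feet and weights in pounds.
height_ft = [5.0, 5.5, 5.8, 6.0, 6.2]
weight_lb = [110, 140, 150, 175, 190]
height_in = [12 * h for h in height_ft]  # the same heights in inches

r_feet = correlation(height_ft, weight_lb)
r_inches = correlation(height_in, weight_lb)   # units changed
r_swapped = correlation(weight_lb, height_ft)  # roles of x and y swapped
print(round(r_feet, 4), round(r_inches, 4), round(r_swapped, 4))
```

All three printed values agree, and each lies between -1 and +1.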
            Examples of Correlations
 • Husband's versus wife's ages: r = .94
 • Husband's versus wife's heights: r = .36
 • Professional golfer's putting success
   (distance of putt in feet versus percent success): r = -.94

 Not all Relationships are Linear
  Miles per Gallon versus Speed
 • Linear relationship?
 • Correlation is close to zero.
[Figure: scatterplot of miles per gallon vs. speed with fitted line
y = -0.013x + 26.9, r = -0.06]

 Not all Relationships are Linear
  Miles per Gallon versus Speed
 • Curved relationship.
 • Correlation is misleading.
[Figure: scatterplot of miles per gallon vs. speed]

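A small numeric check makes this point concrete. The sketch below uses made-up data (not the miles-per-gallon data from the slide) in which y is determined exactly by x along a symmetric curve, yet the correlation is zero. The helper `correlation` is written here for illustration and uses the sum-of-products form of r.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Sample correlation r, computed from sums of deviation products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y is exactly x squared, a perfect but curved relationship.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

r_curved = correlation(xs, ys)
print(r_curved)  # 0.0: no straight-line trend despite an exact relationship
```

A correlation near zero therefore rules out a straight-line pattern only; it says nothing about whether the variables are related in some other way.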
        Problems with Correlations
 • Outliers can inflate or deflate correlations (see
   next slide)

 • Groups combined inappropriately may mask
   relationships (a third variable)
    – groups may have different relationships when
      separated

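The effect of combining groups can be illustrated with a small sketch. The two groups below are invented for illustration and are not data from the slides: within each group the points lie exactly on a rising line, but pooling them gives a negative correlation because one group sits lower and further right than the other.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Sample correlation r, computed from sums of deviation products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements; group membership is the third variable.
group_a_x, group_a_y = [1, 2, 3, 4], [10, 11, 12, 13]
group_b_x, group_b_y = [6, 7, 8, 9], [2, 3, 4, 5]

r_a = correlation(group_a_x, group_a_y)   # within group A
r_b = correlation(group_b_x, group_b_y)   # within group B
r_all = correlation(group_a_x + group_b_x, group_a_y + group_b_y)  # pooled
print(round(r_a, 2), round(r_b, 2), round(r_all, 2))
```

Plotting each group with its own color or symbol, as suggested earlier for categorical variables, is one way to notice this kind of masking before computing r.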
            Outliers and Correlation

[Figure: two scatterplots, A and B, each containing one outlier]

    For each scatterplot above, how does the outlier
    affect the correlation?
            A: outlier decreases the correlation
            B: outlier increases the correlation

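Both effects can be reproduced with small invented data sets. The sketch below does not use the points from the slide figures, which are not available; the numbers are hypothetical and chosen only to mimic the two situations: a tight line with one point far off it, and a shapeless cloud with one distant point.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Sample correlation r, computed from sums of deviation products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Case A (hypothetical): points close to a line, plus one point far below it.
line_x = [1, 2, 3, 4, 5, 6]
line_y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9]
r_a_base = correlation(line_x, line_y)
r_a_outlier = correlation(line_x + [7], line_y + [1.0])

# Case B (hypothetical): a grid with no trend, plus one distant point.
cloud_x = [1, 2, 3, 1, 2, 3]
cloud_y = [1, 1, 1, 3, 3, 3]
r_b_base = correlation(cloud_x, cloud_y)
r_b_outlier = correlation(cloud_x + [10], cloud_y + [10])

print(round(r_a_base, 2), round(r_a_outlier, 2))
print(round(r_b_base, 2), round(r_b_outlier, 2))
```

In case A the extra point pulls r down; in case B a single distant point creates a large r where the remaining points show no trend.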
       Correlation Calculation
 • Suppose we have data on variables X
   and Y for n individuals:
      $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$
 • Each variable has a mean and standard deviation:
      $(\bar{x}, s_x)$ and $(\bar{y}, s_y)$   (see ch. 2 for s)

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

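The formula translates almost line for line into code. The following is a minimal Python sketch, assuming the sample standard deviation (divisor n - 1) from the standard library; the function name `correlation` and the test data are illustrative choices, not part of the slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data lying exactly on a line with positive slope.
r_up = correlation([1, 2, 3, 4], [3, 5, 7, 9])
# The same x values with y falling along a line instead.
r_down = correlation([1, 2, 3, 4], [9, 7, 5, 3])
print(round(r_up, 6), round(r_down, 6))
```

The two calls recover the special values +1 and -1 for perfect linear relationships, up to floating-point rounding.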
                 Case Study


        Per Capita Gross Domestic Product
         and Average Life Expectancy for
           Countries in Western Europe




                       Case Study
         Country       Per Capita GDP (x)   Life Expectancy (y)
          Austria            21.4                  77.48
          Belgium            23.2                  77.53
          Finland            20.0                  77.32
          France             22.7                  78.63
         Germany             20.8                  77.17
          Ireland            18.6                  76.39
            Italy            21.5                  78.51
       Netherlands           22.0                  78.15
       Switzerland           23.8                  78.99
      United Kingdom         21.2                  77.37



                           Case Study
      x          y      (x_i - x̄)/s_x   (y_i - ȳ)/s_y    product
     21.4       77.48        -0.078         -0.345          0.027
     23.2       77.53         1.097         -0.282         -0.309
     20.0       77.32        -0.992         -0.546          0.542
     22.7       78.63         0.770          1.102          0.849
     20.8       77.17        -0.470         -0.735          0.345
     18.6       76.39        -1.906         -1.716          3.271
     21.5       78.51        -0.013          0.951         -0.012
     22.0       78.15         0.313          0.498          0.156
     23.8       78.99         1.489          1.555          2.315
     21.2       77.37        -0.209         -0.483          0.101

   x̄ = 21.52    ȳ = 77.754                        sum = 7.285
   s_x = 1.532   s_y = 0.795

              Case Study

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{1}{10-1} (7.285) = 0.809 $$

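The case study arithmetic can be reproduced in a few lines. This is a sketch using the ten (x, y) pairs from the case study table; the variable names are illustrative, and small differences in the last digit are possible because the slides round the intermediate columns.

```python
from statistics import mean, stdev

# Data from the case study table: per capita GDP (x) and life expectancy (y).
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

x_bar, s_x = mean(gdp), stdev(gdp)
y_bar, s_y = mean(life), stdev(life)

# One product of standardized values per country (the last table column).
products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(gdp, life)]
total = sum(products)
r = total / (len(gdp) - 1)

print(f"x_bar={x_bar:.2f} s_x={s_x:.3f} y_bar={y_bar:.3f} s_y={s_y:.3f}")
print(f"sum={total:.3f} r={r:.3f}")
```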
            Facts about Correlation
 • Correlation makes no distinction between the
   explanatory and response variables

 • r is unitless – it doesn't matter if we change
   the units of a variable when we calculate r
   (because the variables are standardized)

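Why standardizing removes the units can be seen in a short derivation. This is a sketch assuming a linear change of units $x' = a x + b$ with $a > 0$ (for example, $a = 12$ and $b = 0$ for feet to inches); the symbols $a$ and $b$ are introduced here for illustration and do not appear in the slides.

```latex
\bar{x}' = a\bar{x} + b, \qquad s_{x'} = a\, s_x
\quad\Longrightarrow\quad
\frac{x_i' - \bar{x}'}{s_{x'}}
  = \frac{a\,(x_i - \bar{x})}{a\, s_x}
  = \frac{x_i - \bar{x}}{s_x}
```

Each standardized value is unchanged, so the sum that defines r is unchanged as well.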
            Facts about Correlation
 • Positive r indicates positive association
   between the variables, negative r indicates
   negative association

 • The value r is always between -1 and 1
    – Values near 0 indicate a weak relationship
    – Values near -1 or 1 indicate strong negative and
      positive relationships, respectively

            Facts about Correlation
 • Correlation requires that both variables are
   quantitative

 • Correlation measures the strength and
   direction of straight line relationships only –
   says nothing about curved relationships

            Facts about Correlation
 • Correlation is strongly affected by outliers
   (because it relies on the mean and standard
   deviation)

 • Correlation is not a complete summary of two-
   variable data
    – Also need means and standard deviations of both
      variables

                              4.12
a)   What is the correlation between lean
     body mass and metabolic rate?

b)   Make a scatterplot with two
     additional points A (65, 1761) and B
     (35, 1400). Find the correlation for the
     original data plus A, and for the
     original data plus B.

c)   Why does point A make the
     correlation stronger, and point B
     make the correlation weaker?

     Lean Body Mass (kg)   Metabolic Rate (cal/24hr)
            36.1                     995
            54.6                   1,425
            48.5                   1,396
            42.0                   1,418
            50.6                   1,502
            42.0                   1,256
            40.3                   1,189
            33.1                     913
            42.4                   1,124
            34.5                   1,052
            51.1                   1,347
            41.2                   1,204
            65.0                   1,761   (point A)
            35.0                   1,400   (point B)

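Parts (a) and (b) can be worked through with a short script. This is a sketch using the twelve original pairs and the two extra points from the exercise; the helper `correlation` and the variable names are illustrative. The values it prints are recomputed from the listed data and may not match the labels on the slide figures exactly.

```python
from math import sqrt
from statistics import mean

def correlation(points):
    """Sample correlation r for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Original twelve observations from the exercise table.
mass = [36.1, 54.6, 48.5, 42.0, 50.6, 42.0, 40.3, 33.1, 42.4, 34.5, 51.1, 41.2]
rate = [995, 1425, 1396, 1418, 1502, 1256, 1189, 913, 1124, 1052, 1347, 1204]
original = list(zip(mass, rate))

point_a = (65.0, 1761)
point_b = (35.0, 1400)

r_original = correlation(original)
r_with_a = correlation(original + [point_a])
r_with_b = correlation(original + [point_b])
print(round(r_original, 2), round(r_with_a, 2), round(r_with_b, 2))
```

For part (c), note that point A extends the existing upward pattern to a larger body mass, while point B has a metabolic rate well above the other individuals with a body mass near 35 kg.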
a) Metabolic Rate vs. Body Mass
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg) for the original data; r = 0.88]

b) MR vs. BM with point A
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg), original data plus point A; r = 0.93]

c) MR vs. BM with point B
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg), original data plus point B; r = 0.75]

      Fertility & Mortality Example
 • The next slide presents a scatterplot of
   fertility rates vs. mortality rates for a number
   of years
    – Each measurement taken from the same
      population in a given year
    – Age is a categorical third variable
    – There are reasonably strong linear relationships
      between the two variables
 • What can we conclude from this scatterplot?

[Figure: Age-Specific Fertility vs. Age-Specific Mortality: 1992-2003,
Agincourt Study Population, Northeast South Africa. Horizontal axis:
Age-Specific Probability of Dying nqx. Vertical axis: Age-Specific
Fertility Rate nFx. One series and one fitted line for each age group:
15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49.]

						