# Chapter 4

Posted: 1/31/2011

```
Chapter 4
Scatterplots & Correlation

Samuel Clark

Department of Sociology, University of Washington
Institute of Behavioral Science, University of Colorado at Boulder
Agincourt Health and Population Unit, University of the Witwatersrand
Explanatory and Response Variables
We are interested in studying the relationship between two variables by measuring both variables on the same individuals.
– A response variable measures an outcome of a study.
– An explanatory variable explains or influences changes in a response variable.
– Sometimes there is no distinction between the two.

Question
In a study to determine whether surgery or
chemotherapy results in higher survival rates for
a certain type of cancer, whether or not the
patient survived is one variable, and whether
they received surgery or chemotherapy is the
other. Which is the explanatory variable and
which is the response variable?
Response: Survival
Explanatory: Type of treatment

Example 4.1: Response & Explanatory

Question: How does drinking affect blood
alcohol level?
Investigation: Student volunteers drink different
numbers of cans of beer and thirty minutes later
a police officer measures the alcohol content of
their blood.
Response: Blood alcohol content
Explanatory: Number of cans of beer consumed

Example 4.2 - Descriptive
A college student loan officer wants to
understand the situation of recent college grads.
She looks at data describing recent grads:
amount of debt, current income and how stressed
they feel.
In this situation the distinction between response and explanatory variables is not important: she is not trying to "explain" changes in one variable with changes in another.

Example 4.2 - Explanatory
A sociologist looks at the same data and asks: "Can amount of debt and income be used along with other variables to explain stress caused by college debt?"
Now stress level is the response variable, and amount of college debt, income, etc. are the explanatory variables.

Scatterplot

Graphs the relationship between two quantitative (numerical) variables measured on the same individuals.

If a distinction exists, plot the explanatory variable on the horizontal (x) axis and plot the response variable on the vertical (y) axis.

Example 4.3/4
The next figure displays a scatterplot of state mean SAT math scores vs. percent of graduating students who take the SAT.
Use the four-step process to describe the possible influence of who takes the SAT on the mean math score:
1. State the problem
2. Make a plan
3. Solve
4. Conclude

Example 4.3/4
STATE: The percent of high school students
who take the SAT varies from state to state.
Does this affect the average SAT math score?

PLAN: Examine the relationship between the
mean SAT math score and the percent of
graduating students who take the SAT. Make a
scatterplot to display the relationship between
the variables. Interpret what we see.

Example 4.3/4
SOLVE: We suspect that “percent taking” will help explain the “mean math score”. We want to see how mean math score (response) changes as percent taking (explanatory) changes.
There is a clear direction to the overall pattern, from upper left to lower right (a negative association). The form of the relationship is linear.

Example 4.3/4
SOLVE (cont.): There also appear to be two clusters in the data: one in the upper left, and another containing the rest of the points.
The strength of the relationship is weak
because the points do not lie very close to the
line that you could draw through them.

Example 4.3/4
CONCLUDE: Percent taking does explain
some of the variation in mean math SAT score.
States with a larger fraction of students taking
the SAT have lower mean math SAT scores.
These are the states in which most students
take the SAT and fewer students take the ACT.

In the ACT states, it is mainly the better students who take the SAT, in order to apply to the best colleges.

ACT and SAT States
To add a categorical variable (region), use a different plot color or symbol for each category.
[Figure: scatterplot with Southern states highlighted]

The Midwest states are mainly ACT states and the Northeast states are mainly SAT states.

Example 4.5 – Manatees and Boats

Scatterplot

Look for the overall pattern and deviations from this pattern.
Describe the pattern by the form, direction, and strength of the relationship.
Look for outliers.

Linear Relationship

In some relationships the points of a scatterplot tend to fall along a straight line; this is called a linear relationship.

Direction
Positive association
– above-average values of one variable tend to accompany above-average values of the other variable, and below-average values tend to occur together
Negative association
– above-average values of one variable tend to accompany below-average values of the other variable, and vice versa

Examples

From a scatterplot of college students,
there is a positive association between
verbal SAT score and GPA.

For used cars, there is a negative
association between the age of the car
and the selling price.

Examples of Relationships
[Figure: four scatterplots: Health Status Measure vs. Income, Health Status Measure vs. Age, Education Level vs. Age, and Mental Health Score vs. Physical Health Score]

Measuring Strength & Direction of a Linear Relationship
How closely does a non-horizontal straight line fit the points of a scatterplot?
The correlation coefficient (often referred to as just correlation): r
– measure of the strength of the relationship:
the stronger the relationship, the larger the
magnitude of r.
– measure of the direction of the relationship:
positive r indicates a positive relationship,
negative r indicates a negative relationship.

Correlation Coefficient
Special values for r:
– a perfect positive linear relationship would have r = +1
– a perfect negative linear relationship would have r = -1
– if there is no linear relationship, or if the scatterplot points are best fit by a horizontal line, then r = 0
Note: r must be between -1 and +1, inclusive
Both variables must be quantitative; no distinction between response and explanatory variables
r has no units; it does not change when measurement units are changed (ex: ft. or in.)

Examples of Correlations
Husband's versus wife's ages: r = 0.94
Husband's versus wife's heights: r = 0.36
Professional golfer's putting success (distance of putt in feet versus percent success): r = -0.94

Not all Relationships are Linear
Miles per Gallon versus Speed
Linear relationship?
Correlation is close to zero.
[Figure: scatterplot of miles per gallon vs. speed with fitted line y = -0.013x + 26.9, r = -0.06]

Not all Relationships are Linear
Miles per Gallon versus Speed
Curved relationship.
Correlation is misleading.
[Figure: scatterplot of miles per gallon vs. speed]
Problems with Correlations
Outliers can inflate or deflate correlations (see next slide)

Groups combined inappropriately may mask relationships (a third variable)
– groups may have different relationships when separated

Outliers and Correlation

[Figure: two scatterplots, A and B, each containing an outlier]

For each scatterplot above, how does the outlier
affect the correlation?
A: outlier decreases the correlation
B: outlier increases the correlation

Correlation Calculation
Suppose we have data on variables X and Y for n individuals:
x_1, x_2, …, x_n and y_1, y_2, …, y_n
Each variable has a mean and standard deviation:
(\bar{x}, s_x) and (\bar{y}, s_y)   (see ch. 2 for s)

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)


Case Study

Per Capita Gross Domestic Product
and Average Life Expectancy for
Countries in Western Europe

Case Study
Country           Per Capita GDP (x)   Life Expectancy (y)
Austria           21.4                 77.48
Belgium           23.2                 77.53
Finland           20.0                 77.32
France            22.7                 78.63
Germany           20.8                 77.17
Ireland           18.6                 76.39
Italy             21.5                 78.51
Netherlands       22.0                 78.15
Switzerland       23.8                 78.99
United Kingdom    21.2                 77.37

Case Study
x       y        (x_i - \bar{x})/s_x   (y_i - \bar{y})/s_y   product
21.4    77.48    -0.078                -0.345                 0.027
23.2    77.53     1.097                -0.282                -0.309
20.0    77.32    -0.992                -0.546                 0.542
22.7    78.63     0.770                 1.102                 0.849
20.8    77.17    -0.470                -0.735                 0.345
18.6    76.39    -1.906                -1.716                 3.271
21.5    78.51    -0.013                 0.951                -0.012
22.0    78.15     0.313                 0.498                 0.156
23.8    78.99     1.489                 1.555                 2.315
21.2    77.37    -0.209                -0.483                 0.101

\bar{x} = 21.52   \bar{y} = 77.754   s_x = 1.532   s_y = 0.795
sum of products = 7.285

Case Study

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{1}{10-1} (7.285)
  = 0.809

Correlation makes no distinction between the explanatory and response variables.

r is unitless: it doesn't matter if we change the units of a variable when we calculate r (because the variables are standardized).

Positive r indicates positive association between the variables; negative r indicates negative association.

The value of r is always between -1 and 1
– Values near 0 indicate a weak linear relationship
– Values near -1 or 1 indicate strong negative and positive linear relationships, respectively

Correlation requires that both variables are quantitative.

Correlation measures the strength and direction of straight-line relationships only.

Correlation is strongly affected by outliers (because it relies on the mean and standard deviation).

Correlation is not a complete summary of two-variable data
– Also need means and standard deviations of both variables

Start Here Weds 4/14

4.12
a) What is the correlation between lean body mass and metabolic rate?
b) Make a scatterplot with two additional points A (65, 1761) and B (35, 1400). Find the correlation for the original data plus A and for the original data plus B.
c) Why does point A make the correlation stronger, and point B make the correlation weaker?

Lean Body Mass (kg)   Metabolic Rate (cal/24hr)
36.1                    995
54.6                  1,425
48.5                  1,396
42.0                  1,418
50.6                  1,502
42.0                  1,256
40.3                  1,189
33.1                    913
42.4                  1,124
34.5                  1,052
51.1                  1,347
41.2                  1,204
65.0                  1,761   (point A)
35.0                  1,400   (point B)

a) Metabolic Rate vs. Body Mass
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg); r = 0.88]

b) MR vs. BM with point A
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg) with point A added; r = 0.93]

c) MR vs. BM with point B
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg) with point B added; r = 0.75]
Fertility & Mortality Example
The next slide presents a scatterplot of fertility rates vs. mortality rates for a number of years
– Each measurement taken from the same population in a given year
– Age is a categorical third variable
– There are reasonably strong linear relationships between the two variables
What can we conclude from this scatterplot?

Age-Specific Fertility vs. Age-Specific Mortality: 1992-2003
Agincourt Study Population, Northeast South Africa
[Figure: scatterplot of age-specific fertility rate (nFx) vs. age-specific probability of dying (nqx), with separate points and fitted lines for age groups 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, and 45-49]

```
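The correlation calculation in the case study can be checked with a few lines of code. The sketch below is not part of the original slides: it uses Python, the function name `correlation` and the variable names are illustrative assumptions, and the factor of 1000 used to change units is arbitrary. It applies the z-score formula from the Correlation Calculation slide to the GDP and life expectancy data, then illustrates two properties stated in the slides (r does not depend on which variable is x, and r does not change with units) and the outlier effect from exercise 4.12.

```python
import math


def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [
        ((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)


# Case study data: per capita GDP (x) and life expectancy (y).
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

r_case = correlation(gdp, life)
print(f"case study r = {r_case:.3f}")  # about 0.809, as on the slide

# No distinction between explanatory and response: swapping x and y leaves r alone.
r_swapped = correlation(life, gdp)

# r has no units: rescaling GDP (arbitrary factor, for illustration) leaves r alone.
r_rescaled = correlation([1000 * g for g in gdp], life)

# Exercise 4.12: the 12 original points, then with point A or point B added.
mass = [36.1, 54.6, 48.5, 42.0, 50.6, 42.0, 40.3, 33.1, 42.4, 34.5, 51.1, 41.2]
rate = [995, 1425, 1396, 1418, 1502, 1256, 1189, 913, 1124, 1052, 1347, 1204]

r_base = correlation(mass, rate)
r_with_a = correlation(mass + [65.0], rate + [1761])  # A extends the linear pattern
r_with_b = correlation(mass + [35.0], rate + [1400])  # B sits off the pattern

print(f"original: {r_base:.2f}, with A: {r_with_a:.2f}, with B: {r_with_b:.2f}")
```

Point A lies far from the other points but in line with the overall pattern, so adding it raises r. Point B lies away from the pattern, so adding it lowers r.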