Chapter 4
Scatterplots & Correlation
Samuel Clark
Department of Sociology, University of Washington
Institute of Behavioral Science, University of Colorado at Boulder
Agincourt Health and Population Unit, University of the Witwatersrand
Explanatory and Response Variables
Interested in studying the relationship
between two variables by measuring
both variables on the same individuals.
– a response variable measures an outcome
of a study
– an explanatory variable explains or
influences changes in a response variable
– sometimes there is no distinction
Question
In a study to determine whether surgery or
chemotherapy results in higher survival rates for
a certain type of cancer, whether or not the
patient survived is one variable, and whether
they received surgery or chemotherapy is the
other. Which is the explanatory variable and
which is the response variable?
Response: Survival
Explanatory: Type of treatment
Example 4.1: Response & Explanatory
Question: How does drinking affect blood
alcohol level?
Investigation: Student volunteers drink different
numbers of cans of beer and thirty minutes later
a police officer measures the alcohol content of
their blood.
Response: Blood alcohol content
Explanatory: Number of cans of beer consumed
Example 4.2 - Descriptive
A college student loan officer wants to
understand the situation of recent college grads.
She looks at data describing recent grads:
amount of debt, current income and how stressed
they feel.
In this situation the distinction between
response and explanatory variables is not
important – she is not trying to 'explain' changes
in one variable with changes in another.
Example 4.2 - Explanatory
A sociologist looks at the same data
describing recent college grads and asks “can
amount of debt and income be used along with
other variables to explain stress caused by
college debt?”
Now stress level is the response variable
and amount of college debt and income, etc. are
the explanatory variables.
Scatterplot
Graphs the relationship between two
quantitative (numerical) variables measured
on the same individuals.
If a distinction exists, plot the explanatory
variable on the horizontal (x) axis and plot the
response variable on the vertical (y) axis.
Example 4.3/4
The next figure displays a scatterplot of state
mean SAT math scores vs. percent of
graduates taking SAT
Use the four-step process to describe the
possible influence of who takes the SAT on
the mean math score:
1. State the problem
2. Make a plan
3. Solve
4. Conclude
Example 4.3/4
STATE: The percent of high school students
who take the SAT varies from state to state.
Does this affect the average SAT math score?
PLAN: Examine the relationship between the
mean SAT math score and the percent of
graduating students who take the SAT. Make a
scatterplot to display the relationship between
the variables. Interpret what we see.
Example 4.3/4
SOLVE: We suspect that “percent taking” will
help explain the “mean math score”. We want
to see how mean math score (response)
changes as percent taking (explanatory)
changes.
There is a clear direction to the overall
pattern – from upper left to lower right. The
form of the relationship is linear.
Example 4.3/4
SOLVE (cont.): There also appear to be two
clusters in the data – one in the upper left and
a second made up of the remaining states.
The strength of the relationship is weak
because the points do not lie very close to the
line that you could draw through them.
Example 4.3/4
CONCLUDE: Percent taking does explain
some of the variation in mean SAT math score.
States where a larger fraction of students take
the SAT have lower mean SAT math scores.
In these states most students take the SAT
and fewer students take the ACT.
In the ACT states, mainly the better students take
the SAT, in order to apply to the best colleges.
ACT and SAT States
To add a categorical variable (region),
use a different plot color or symbol for
each category.
[Figure: scatterplot with Southern states highlighted]
The Midwest states are mainly ACT states
and the Northeast states are mainly SAT states.
Example 4.5 – Manatees and Boats
Scatterplot
Look for overall pattern and
deviations from this pattern
Describe pattern by form, direction,
and strength of the relationship
Look for outliers
Linear Relationship
Some relationships are such that the
points of a scatterplot tend to fall along
a straight line – linear relationship
Direction
Positive association
– above-average values of one variable tend
to accompany above-average values of the
other variable, and below-average values
tend to occur together
Negative association
– above-average values of one variable tend
to accompany below-average values of the
other variable, and vice versa
Examples
From a scatterplot of college students,
there is a positive association between
verbal SAT score and GPA.
For used cars, there is a negative
association between the age of the car
and the selling price.
Examples of Relationships
[Figure: four example scatterplots: Health Status Measure vs. Income, Health Status Measure vs. Age, Education Level vs. Age, and Mental Health Score vs. Physical Health Score]
Measuring Strength & Direction
of a Linear Relationship
How closely does a non-horizontal straight
line fit the points of a scatterplot?
The correlation coefficient (often referred to
as just correlation): r
– measure of the strength of the relationship:
the stronger the relationship, the larger the
magnitude of r.
– measure of the direction of the relationship:
positive r indicates a positive relationship,
negative r indicates a negative relationship.
Correlation Coefficient
special values for r :
a perfect positive linear relationship would have r = +1
a perfect negative linear relationship would have r = -1
if there is no linear relationship, or if the scatterplot
points are best fit by a horizontal line, then r = 0
Note: r must be between -1 and +1, inclusive
both variables must be quantitative; no distinction
between response and explanatory variables
r has no units; it does not change when
measurement units are changed (e.g., feet vs. inches)
Examples of Correlations
Husband's versus wife's ages
r = .94
Husband's versus wife's heights
r = .36
Professional golfers' putting success:
distance of putt in feet versus percent
success
r = -.94
Not all Relationships are Linear
Miles per Gallon versus Speed
Linear relationship?
Correlation is close to zero.
[Figure: scatterplot of miles per gallon vs. speed with fitted line y = -0.013x + 26.9, r = -0.06]
Not all Relationships are Linear
Miles per Gallon versus Speed
Curved relationship.
Correlation is misleading.
[Figure: scatterplot of miles per gallon vs. speed showing a curved pattern]
Problems with Correlations
Outliers can inflate or deflate correlations (see
next slide)
Groups combined inappropriately may mask
relationships (a third variable)
– groups may have different relationships when
separated
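A small made-up illustration of how combining groups can mask a relationship. The data and the helper name `correlation` are hypothetical, chosen only so that each group lies exactly on an upward-sloping line:

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: two groups, each perfectly linear with positive slope,
# but the second group sits much lower on the y axis.
group1_x = [1, 2, 3, 4, 5]
group1_y = [11, 12, 13, 14, 15]
group2_x = [6, 7, 8, 9, 10]
group2_y = [-4, -3, -2, -1, 0]

r_group1 = correlation(group1_x, group1_y)
r_group2 = correlation(group2_x, group2_y)
r_combined = correlation(group1_x + group2_x, group1_y + group2_y)

print(r_group1, r_group2, r_combined)
```

Within each group the relationship is perfectly positive, yet the combined correlation is negative, because the grouping variable shifts the second group downward.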
Outliers and Correlation
[Figure: two scatterplots, A and B, each containing one outlier]
For each scatterplot above, how does the outlier
affect the correlation?
A: outlier decreases the correlation
B: outlier increases the correlation
Correlation Calculation
Suppose we have data on variables X
and Y for n individuals:
x1, x2, … , xn and y1, y2, … , yn
Each variable has a mean and std dev:
$(\bar{x}, s_x)$ and $(\bar{y}, s_y)$ (see ch. 2 for s)
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
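The formula can be sketched directly in Python. This is a minimal illustration, not a library routine: the function names `mean`, `std_dev`, and `correlation` are our own, and the standard deviation uses the n - 1 divisor to match s.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def std_dev(values):
    """Sample standard deviation s (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = 1/(n - 1) times the sum of products of standardized values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = std_dev(xs), std_dev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical check: points on a rising line and on a falling line.
xs = [1.0, 2.0, 3.0, 4.0]
r_up = correlation(xs, [2 * x + 1 for x in xs])
r_down = correlation(xs, [10 - 3 * x for x in xs])
print(r_up, r_down)
```

Points that fall exactly on a rising line should give r = +1 and points on a falling line r = -1, up to floating-point rounding.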
Case Study
Per Capita Gross Domestic Product
and Average Life Expectancy for
Countries in Western Europe
Case Study
Country Per Capita GDP (x) Life Expectancy (y)
Austria 21.4 77.48
Belgium 23.2 77.53
Finland 20.0 77.32
France 22.7 78.63
Germany 20.8 77.17
Ireland 18.6 76.39
Italy 21.5 78.51
Netherlands 22.0 78.15
Switzerland 23.8 78.99
United Kingdom 21.2 77.37
Case Study
x       y        $(x_i - \bar{x})/s_x$   $(y_i - \bar{y})/s_y$   product
21.4    77.48    -0.078                  -0.345                   0.027
23.2    77.53     1.097                  -0.282                  -0.309
20.0    77.32    -0.992                  -0.546                   0.542
22.7    78.63     0.770                   1.102                   0.849
20.8    77.17    -0.470                  -0.735                   0.345
18.6    76.39    -1.906                  -1.716                   3.271
21.5    78.51    -0.013                   0.951                  -0.012
22.0    78.15     0.313                   0.498                   0.156
23.8    78.99     1.489                   1.555                   2.315
21.2    77.37    -0.209                  -0.483                   0.101
$\bar{x}$ = 21.52, $\bar{y}$ = 77.754
$s_x$ = 1.532, $s_y$ = 0.795
sum of products = 7.285
Case Study
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{1}{10-1}(7.285) = 0.809 $$
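The same arithmetic can be checked in a few lines of Python. This sketch re-enters the ten (x, y) pairs from the case study; the variable names are our own, and tiny differences from the slide values can arise because the slides round the standardized values to three decimals.

```python
from math import sqrt

# Per capita GDP (x) and life expectancy (y) from the case study.
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(gdp)
x_bar = sum(gdp) / n
y_bar = sum(life) / n

# Sample standard deviations (n - 1 divisor).
s_x = sqrt(sum((x - x_bar) ** 2 for x in gdp) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in life) / (n - 1))

# One product of standardized values per country.
products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(gdp, life)]

r = sum(products) / (n - 1)
print(round(x_bar, 2), round(y_bar, 3), round(s_x, 3), round(s_y, 3), round(r, 3))
```

Computed without intermediate rounding, r comes out at about 0.81, in agreement with the hand calculation.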
Facts about Correlation
Correlation makes no distinction between the
explanatory and response variables
r is unitless – it doesn't matter if we change
the units of a variable when we calculate r
(because the variables are standardized)
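A quick numerical check of these two facts, sketched with made-up height and arm-span measurements (the data and the `correlation` helper are hypothetical, not taken from the chapter):

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements in feet.
heights_ft = [5.0, 5.5, 5.8, 6.0, 6.2]
spans_ft = [4.9, 5.6, 5.7, 6.1, 6.0]

# Same heights converted to inches.
heights_in = [12 * h for h in heights_ft]

r_feet = correlation(heights_ft, spans_ft)
r_inches = correlation(heights_in, spans_ft)   # units of x changed
r_swapped = correlation(spans_ft, heights_ft)  # roles of x and y swapped

print(r_feet, r_inches, r_swapped)
```

All three values agree up to rounding error: rescaling a variable or swapping which one is called x leaves r unchanged.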
Facts about Correlation
Positive r indicates positive association
between the variables, negative r indicates
negative association
The value r is always between -1 and 1
– Values near 0 indicate a weak relationship
– Values near -1 or 1 indicate strong negative and
positive relationships, respectively
Facts about Correlation
Correlation requires that both variables are
quantitative
Correlation measures the strength and
direction of straight line relationships only –
says nothing about curved relationships
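To see why r can miss a curved relationship, consider a sketch with invented fuel-economy numbers in which mileage rises and then falls symmetrically around a middle speed. The data and the `correlation` helper are hypothetical:

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: mileage peaks at 50 and drops off equally on both sides.
speeds = [20, 30, 40, 50, 60, 70, 80]
mpg = [21, 26, 29, 30, 29, 26, 21]

r = correlation(speeds, mpg)
print(r)
```

Mileage here is completely determined by speed, yet r is zero, because the rising and falling halves of the curve cancel in the sum of products.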
Facts about Correlation
Correlation is strongly affected by outliers
(because it relies on the mean and standard
deviation)
Correlation is not a complete summary of two-
variable data
– Also need means and standard deviations of both
variables
Start Here Weds 4/14
4.12
a) What is the correlation between lean
body mass and metabolic rate?
b) Make a scatterplot with two
additional points A (65, 1761) and B
(35, 1400). Find the correlation for the
original data plus A, and for the original data plus B.
c) Why does point A make the
correlation stronger, and point B
make the correlation weaker?

Lean Body Mass (kg)   Metabolic Rate (cal/24hr)
36.1                  995
54.6                  1,425
48.5                  1,396
42.0                  1,418
50.6                  1,502
42.0                  1,256
40.3                  1,189
33.1                  913
42.4                  1,124
34.5                  1,052
51.1                  1,347
41.2                  1,204
65.0                  1,761   (point A)
35.0                  1,400   (point B)
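One way to work parts a) and b) numerically is sketched below. The twelve original observations and the two extra points are copied from the exercise; the `correlation` helper and variable names are our own, and the printed values may differ slightly from hand-rounded results.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Original 12 observations: lean body mass (kg), metabolic rate (cal/24hr).
lean_mass = [36.1, 54.6, 48.5, 42.0, 50.6, 42.0,
             40.3, 33.1, 42.4, 34.5, 51.1, 41.2]
rate = [995, 1425, 1396, 1418, 1502, 1256,
        1189, 913, 1124, 1052, 1347, 1204]

point_a = (65.0, 1761)
point_b = (35.0, 1400)

r_original = correlation(lean_mass, rate)
r_with_a = correlation(lean_mass + [point_a[0]], rate + [point_a[1]])
r_with_b = correlation(lean_mass + [point_b[0]], rate + [point_b[1]])

print(round(r_original, 2), round(r_with_a, 2), round(r_with_b, 2))
```

Adding A raises r and adding B lowers it, which is the contrast part c) asks about.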
a) Metabolic Rate vs. Body Mass
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg) for the original data; r = 0.88]
b) MR vs. BM with point A
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg) with point A added; r = 0.93]
c) MR vs. BM with point B
[Figure: scatterplot of Metabolic Rate (cal/24hr) vs. Body Mass (kg) with point B added; r = 0.75]
Fertility & Mortality Example
The next slide presents a scatter plot of
fertility rates vs. mortality rates for a number
of years
– Each measurement taken from the same
population in a given year
– Age is a categorical third variable
– There are reasonably strong linear relationships
between the two variables
What can we conclude from this scatterplot?
Age-Specific Fertility vs. Age-Specific Mortality: 1992-2003
Agincourt Study Population, Northeast South Africa
[Figure: scatterplot of Age-Specific Fertility Rate (nFx) vs. Age-Specific Probability of Dying (nqx), with points and a fitted linear trend line for each five-year age group from 15-19 to 45-49]