# Do the data - Goblues.org by linxiaoqin

Chapter 4: More on Two-Variable Data
4.1    Transforming Relationships
4.2    Cautions
4.3    Relations in Categorical Data

[Figure: plot of g dye/kg fiber (vertical axis, 0 to 140) against Time, minutes (horizontal axis, 20 to 180)]

1
Example: Cell Phone Users
Year      Users (thousands)
1990          5,283
1993         16,009
1994         24,134
1995         33,786
1996         44,043
1997         55,312
1998         69,209
1999         86,047

2
Scatterplot for Cell Phone Example

3
Residuals Plot

4
What’s going on here?

• Do the data (y) increase by a constant amount each
year?
– This would suggest a linear model.
• Or, do the data increase by a fixed percentage each
year? That is, can you multiply the y-value by a fixed
number to get the next year’s number, and then
multiply that number by the fixed number to get the
following year’s number?
– This would suggest an exponential model.
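
A quick way to check both questions is to compute the year-to-year differences and ratios. The sketch below uses only the consecutive years 1993 to 1999 from the table (1990 is left out because of the three-year gap); the variable names are illustrative.

```python
# Cell phone users (thousands) for the consecutive years 1993-1999,
# copied from the example table. 1990 is omitted because of the 3-year gap.
users = [16009, 24134, 33786, 44043, 55312, 69209, 86047]

# Year-to-year differences: a roughly constant value would suggest a linear model.
differences = [b - a for a, b in zip(users, users[1:])]

# Year-to-year ratios: a roughly constant value would suggest an exponential model.
ratios = [b / a for a, b in zip(users, users[1:])]

print("differences:", differences)
print("ratios:", [round(r, 3) for r in ratios])
```

The differences grow every year, so a constant-amount (linear) model looks doubtful. The ratios are not perfectly constant either, but they stay within a fairly narrow band, which is why an exponential model is worth trying.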

5
Transformation of the Variables

• The next step is to apply a mathematical
transformation that changes exponential
growth into linear growth.
– The transformation that can help here is
to take the logarithm of the y-variable,
then re-plot and re-calculate the LSR.
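
A minimal sketch of this step, assuming base-10 logarithms and NumPy's `polyfit` as the least-squares routine (the slides do not say which tool was used):

```python
import numpy as np

# Data from the cell phone example (users in thousands).
years = np.array([1990, 1993, 1994, 1995, 1996, 1997, 1998, 1999])
users = np.array([5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047])

# Transform the response variable only: log10 of y.
log_users = np.log10(users)

# Least-squares line for log10(y) on year; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(years, log_users, 1)

# Residuals on the log scale, for the new residual plot.
residuals = log_users - (intercept + slope * years)

print(f"log y = {intercept:.2f} + {slope:.5f} x")
print("residuals:", np.round(residuals, 3))
```

The residuals here are on the log scale, so they should be judged against the transformed values, not the original counts.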

6
New LSR, with Transformed y

Residuals Plot
7
We are dealing with a transformed y-value!

• Model:
$\log y = -263.20 + 0.13417x$

• In order to use the model for prediction, we
must “undo” the logarithm transformation.
– How do we do this?
• Now use the new model to predict cell phone
subscribers for 2000.
8
How do we predict for year 2000?
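
One way to work this out, as a sketch: raise 10 to the fitted value of log y. This assumes base-10 logarithms and reads the fitted line as log y = -263.20 + 0.13417x (intercept negative), which is what a least-squares fit of log10(users) on year gives for the table in the example. The function name `predict_users` is illustrative.

```python
# Coefficients of the fitted line on the log scale (rounded, as on the slide).
intercept = -263.20
slope = 0.13417

def predict_users(year):
    """Undo the log10 transformation: y = 10 ** (intercept + slope * year)."""
    return 10 ** (intercept + slope * year)

# Prediction for the year 2000, in thousands of users.
pred_2000 = predict_users(2000)
print(round(pred_2000))
```

The result is roughly 138,000 thousand users. Because the coefficients are rounded and sit in an exponent, small changes in them move the prediction noticeably, and 2000 lies just outside the observed years.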

9
Plotting our original data vs. our
exponential model …

10
Homework

• Problem 4.6, p. 212
• Problem 4.11, p. 213

11
Power Law Models

• General form of a power law model:

$y = ax^p$

• Biologists have found that many
characteristics of living things are described
quite closely by power laws.
– For example, the rate at which animals use
energy goes up as the ¾ power of their body
weight (Kleiber’s Law).
12
LSR and Power Law Models

• As we saw in the last section, exponential
growth models become linear when we apply
the logarithm transformation to the response
variable y.
• Power law models become linear when we
apply the logarithm transformation to both
variables, x and y.

13
Log Transformations for Power Law Models

$y = ax^p$

$\log y = \log(ax^p)$
$\log y = \log a + p \log x$

• Looking carefully at the last equation, the power (p)
becomes the slope of the straight line that links log y to
log x.
– We can estimate what power (p) the law involves by
regressing log y on log x and using the slope of the
regression line to estimate the power.
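
The idea can be sketched on synthetic data. The values below are made up (noise-free, with a = 2 and p = 0.75 chosen for illustration), so the regression should recover the power exactly; real data would only give an estimate.

```python
import numpy as np

# Synthetic power-law data: y = a * x**p with illustrative values a = 2, p = 0.75.
a_true, p_true = 2.0, 0.75
x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])
y = a_true * x ** p_true

# Regress log y on log x; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)

p_hat = slope            # the slope estimates the power p
a_hat = 10 ** intercept  # the intercept is log a, so a = 10 ** intercept

print(f"estimated power p = {p_hat:.4f}, estimated a = {a_hat:.4f}")
```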
14
Problem 4.13, p. 219

15
Problem 4.13, p. 219

Log of Both Variables

16
Residuals Analysis (Transformed Data)

17
Undoing the Transformation

• Let’s do the math to see what we need:

$\log y = 0.76172 + 0.218215 \log x$
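
A sketch of the algebra, assuming base-10 logarithms and reading the fitted line as log y = 0.76172 + 0.218215 log x: raising 10 to both sides gives y = 10^0.76172 · x^0.218215, which is a power law with the regression slope as the power. The names `a`, `power` and `predict` are illustrative.

```python
import math

# Fitted line on the log-log scale (coefficients as given on the slide).
intercept = 0.76172
power = 0.218215

# Undo the transformation: 10**(intercept + power*log10(x)) = 10**intercept * x**power.
a = 10 ** intercept

def predict(x):
    """Power-law model in the original units: y = a * x**power."""
    return a * x ** power

print(f"y = {a:.3f} * x^{power}")
```

To predict for a particular case, substitute its x-value (in the units used in Problem 4.13) into `predict`.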

18
Predicting Lifespan for Humans

19
• HW Problem:
– 4.14, p. 220

20
Warm-Up Problem

• 4.25, pp. 224-225
• Create appropriate model
• Predict seed count for tree with seed weight of
1,000 mg.

21
Problem 4.25

[Calculator screens I through V. Recoverable labels: II. Log of both L1 and L2 (axes off to see trend); Y2 vs. original data.]

22
Cautions about Correlation
and Regression
• The correlation (r) and the LSR line are not
resistant.
• As we have seen, extrapolation is often
dangerous.
– Extrapolation means predicting outside the range of
x-values for which the model was developed.

23

• The paradox refers to the fact that the French have
long had low rates of heart disease (Japan is the only
developed country with a lower rate), despite a diet
relatively rich in saturated animal fats. The French
propensity to drink wine the way some Americans
guzzle soft drinks has been cited as a likely explanation
of the paradox, since numerous studies have indicated
that alcohol consumed in moderation helps to prevent
atherosclerosis, or accumulation of fatty deposits in
arteries, which is the underlying cause of most heart
attacks.

+ from NY Times article
24
Lurking Variables
• As we discussed in the example of amount of wine
consumed vs. number of incidents of heart disease,
there can be other variables not measured in a
correlation study that may influence the interpretation
of relationships among those variables.
– Lurking Variables
• It is possible to show, for example, that there is a high
correlation between shoe size and intelligence for a
group of children varying in age from, say, 4 to 15.
– What is the lurking variable?
• To control for age, we can calculate the correlation
between shoe size and IQ for each of the different ages.
– Age 4, 5, 6, …
25
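
The effect can be sketched with simulated data. Everything below is made up for illustration: both shoe size and a test score are generated from age plus independent noise, with the score standing in for the intelligence measure on the slide. No real measurements are used.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 simulated children at each age from 4 to 15 (hypothetical data).
ages = np.repeat(np.arange(4, 16), 200)

# Both variables depend on age only; their noise terms are independent.
shoe = 0.8 * ages + rng.normal(0, 1, ages.size)
score = 5.0 * ages + rng.normal(0, 10, ages.size)

# Correlation over all ages combined.
overall_r = np.corrcoef(shoe, score)[0, 1]

# Correlation within each single age: this controls for the lurking variable.
within_r = {
    int(a): np.corrcoef(shoe[ages == a], score[ages == a])[0, 1]
    for a in np.unique(ages)
}

print(f"overall r = {overall_r:.2f}")
for a, r in within_r.items():
    print(f"age {a}: r = {r:.2f}")
```

In this simulation the overall correlation is strong, while the correlation within each age group is close to zero, which is the pattern expected when age is the lurking variable.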
Correlation Between Shoe Size and IQ?
(Common Response)

[Diagram: Age influences both Shoe Size and IQ]

26
See Figure 4.18, p. 227

27
Lurking Variables That
Change Over Time

• Many lurking variables change systematically
over time.
• One useful method for detecting lurking
variables is to plot both the response variable
and the regression residuals against the time
order of the observations (whenever the time
order is available).
• See Example 4.12, p. 228
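
A small sketch of the idea, using simulated data (not Example 4.12): the response depends on x and also drifts with time, but only x is used in the regression. The drift size and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 50
t = np.arange(n)            # time order of the observations
x = rng.uniform(0, 10, n)   # explanatory variable, unrelated to time

# Hypothetical response: linear in x, plus a lurking drift over time, plus noise.
y = 2 + 3 * x + 0.2 * t + rng.normal(0, 0.5, n)

# Least-squares line of y on x only (time is not in the model).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# If nothing were changing over time, residuals in time order would show no trend.
r_time = np.corrcoef(t, residuals)[0, 1]
print(f"correlation of residuals with time order: {r_time:.2f}")
```

Plotting `residuals` against `t` would show a clear upward trend here, which is the signal that some variable changing over time has been left out.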

28
29
Using Averaged Data

• Be careful when applying the results of a study
that uses averages to individuals.

• Problem 4.31, p. 231

30
Causation

• Simply put, a strong correlation between two
variables says nothing about one variable
causing the other. One variable may in fact
cause the other to change, but a correlation or
LSR line cannot tell us that.
– More investigation is needed!
• A designed study with proper experimental
controls should be used.

31
Figure 4.22, p. 232

• Causation
• Common Response
• Confounding

32
Confounding
• The effects of two variables on a response variable are
said to be confounded when they cannot be
distinguished from one another.
– Definition: Two or more variables that might have
caused an effect were simultaneously present, so that we
do not know to which to attribute the effect.
– See 1, Example 4.13 (p. 232), and explanation, p. 233, top
of p. 234.
• Does this mean that we cannot ever suggest causation?
– Read the two paragraphs on p. 235 (establishing
causation).

33
Causation
• Example 4.14, p. 232
– Numbers 1 and 2 (p. 233)

34
Common Response
• Example 4.15, p. 233

35
Homework

36
Problems

• Problems on p. 237:
– 4.33, 4.34, 4.35
• 4.73, p. 257

37
Problem 4.73, p. 257
Power law model might best fit,
so take log of L1 and L2. Plot below
of L3 and L4.

38
4.73, cont.

The pendulum period is proportional to the square root
of its length.
39
4.3 Relations in Categorical Variables

• There are many relationships of interest to us
that cannot be described by using correlation
and LSR techniques.
– Recall that correlation and LSR require both
variables to be quantitative.
• Often, we want to study the relationship
between two variables that are inherently
categorical.

40
Two-Way Table (Ex. 4.19, p. 241)

                                     Age Group
Education               25 to 34    35 to 54         55+       Total
Did not complete HS        4,474       9,155      14,224      27,853
Complete HS               11,546      26,481      20,060      58,087
1-3 yrs college           10,700      22,618      11,127      44,445
4+ yrs college            11,066      23,183      10,596      44,845
Total                     37,786      81,435      56,008     175,230

41
Two-Way Table

• The row variable is level of education.
– In this study, is level of education the
explanatory or response variable?
• The column variable is age.
– Explanatory or response?

• Marginal distributions:
– The distributions of education alone and age
alone are called marginal distributions because
their totals are in the margins: Education at the
right, and age at the bottom.
42
Marginal Distributions

• It is often more informative to display the
marginal distribution in percents instead
of raw numbers.

[Bar graph: Education Level in U.S. (adults age 25+). Vertical axis: Percent of Total. Horizontal axis: Years of Schooling.]
No high school degree      15.9
High school only           33.1
1-3 years of college       25.4
4+ years of college        25.6
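
The percents for the education marginal distribution can be reproduced from the two-way table in Example 4.19. This is a sketch; the dictionary layout and names are illustrative.

```python
# Counts from the two-way table; columns are age groups 25 to 34, 35 to 54, 55+.
counts = {
    "Did not complete HS": [4474, 9155, 14224],
    "Complete HS": [11546, 26481, 20060],
    "1-3 yrs college": [10700, 22618, 11127],
    "4+ yrs college": [11066, 23183, 10596],
}

# Row totals (the right-hand margin) and the grand total.
row_totals = {level: sum(row) for level, row in counts.items()}
grand_total = sum(row_totals.values())

# Marginal distribution of education, as percents of the grand total.
marginal_pct = {
    level: round(100 * total / grand_total, 1)
    for level, total in row_totals.items()
}

for level, pct in marginal_pct.items():
    print(f"{level}: {pct}%")
```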

43
Conditional Distributions

• The previous graph looked at the breakdown of
education levels for the entire population. Many times,
however, we are looking for breakdowns (i.e.,
distributions) for a certain group within the
population.
– For example, of those people with 4+ years of college,
look at the distribution across age groups.
– Let’s complete a bar graph for this comparison.
– This is a conditional distribution.
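
As a sketch, the conditional distribution of age among people with 4+ years of college uses only that row of the table in Example 4.19, divided by its row total. Variable names are illustrative.

```python
# The "4+ yrs college" row of the two-way table, by age group.
age_groups = ["25 to 34", "35 to 54", "55+"]
college_4plus = [11066, 23183, 10596]

# Condition on education level: divide each count by the row total.
row_total = sum(college_4plus)
conditional_pct = {
    group: round(100 * count / row_total, 1)
    for group, count in zip(age_groups, college_4plus)
}

print(conditional_pct)
```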

44
One Conditional Distribution for
Example 4.19
Breakdown by age group of people with 4+
years of college

[Bar graph. Vertical axis: Percent. Horizontal axis: Age Group.]
25-34      24.7
35-54      51.7
55+        23.6
45
Different Question

• What proportion of each age group received 4+
years of college education?
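
This question conditions on age instead of education, so each count is divided by its column total. The sketch below uses the column totals as printed in the table for Example 4.19; two of them differ by a unit or two from the sum of the entries above them, which does not change the rounded percents.

```python
# The "4+ yrs college" row and the printed column totals, by age group.
age_groups = ["25 to 34", "35 to 54", "55+"]
college_4plus = [11066, 23183, 10596]
age_totals = [37786, 81435, 56008]  # bottom margin of the table, as printed

# Condition on age group: divide each count by its column total.
pct_by_age = {
    group: round(100 * count / total, 1)
    for group, count, total in zip(age_groups, college_4plus, age_totals)
}

print(pct_by_age)
```

The two youngest groups come out at about 29%, while the 55+ group is noticeably lower at about 19%.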

46
• Read paragraph at the bottom of page 248.

47
One set of conditional distributions:
Figure 4.27, p. 248

48
Problems

• 4.53, p. 245
• 4.59, p. 251

49
Graph for Problem 4.59

Breakdown of Planned Majors in Business School,
by Gender

[Bar graph. Vertical axis: Percent. Legend: Female, Male. Bar labels: 40.4, 34.8, 36.6, 30.2, 24.8, 27.1, 3.7, 2.2.]
50
Homework

• Read through the end of the chapter.
• Be sure you understand “Simpson’s paradox.”
• Problem:
– 4.62, p. 253

51

• Problem 4.60, p. 251
– Simpson’s paradox refers to the reversal
of the direction of a comparison or
association when data from several
groups are combined to form a single
group.
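
A small numerical sketch of the reversal. The counts are hypothetical, invented for illustration (they are not from Problem 4.60): option A has the higher success rate in each group, yet the lower rate once the groups are combined.

```python
# Hypothetical (successes, trials) for two options, A and B, in two groups.
data = {
    "group 1": {"A": (90, 100), "B": (800, 1000)},
    "group 2": {"A": (300, 1000), "B": (20, 100)},
}

# Success rate for each option within each group.
within = {
    group: {opt: s / n for opt, (s, n) in options.items()}
    for group, options in data.items()
}

# Success rate for each option after combining the groups.
combined = {}
for opt in ("A", "B"):
    successes = sum(data[g][opt][0] for g in data)
    trials = sum(data[g][opt][1] for g in data)
    combined[opt] = successes / trials

print("within groups:", within)
print("combined:", {opt: round(rate, 3) for opt, rate in combined.items()})
```

The reversal happens because the two options are spread very unevenly across the groups: most of A's trials fall in the group where success is rare for both options.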

52
Practice/Review Problems

• Problem:
– 4.68, p. 254
– 4.72 (parts a-c), p. 257

53
Relationship Between Type of College and
Management Level

[Bar graph. Vertical axis: Percent. Horizontal axis: Management Level (High, Middle, Low). Legend: Public, Private. Bar labels: 4.2 and 7.3 (High); 54.4, 53.1, 41.4, and 39.8 (Middle and Low).]
54
