Posted on: 1/31/2013 (54 pages)
Chapter 4: More on Two-Variable Data
4.1 Transforming Relationships
4.2 Cautions
4.3 Relations in Categorical Data
[Figure: g dye/kg fiber vs. time, minutes]

Example
Year    Cell Phone Users (thousands)
1990     5,283
1993    16,009
1994    24,134
1995    33,786
1996    44,043
1997    55,312
1998    69,209
1999    86,047

Scatterplot for Cell Phone Example

Residuals Plot

What’s going on here?
• Do the data (y) increase by a constant amount each year?
  – This would suggest a linear model.
• Or, do the data increase by a fixed percentage each year? That is, can you multiply the y-value by a fixed number to get the next year’s number, and then multiply that number by the fixed number to get the following year’s number?
  – This would suggest an exponential model.

Transformation of the Variables
• The next step is to apply a mathematical transformation that changes exponential growth into linear growth.
  – The transformation that can help here is to take the logarithm of the y-variable, then re-plot and re-calculate the LSR.

New LSR, with Transformed y
Residuals Plot

We are dealing with a transformed y-value!
• Model: $\log y = -263.20 + 0.13417\,x$
• In order to use the model for prediction, we must “undo” the logarithm transformation to return to the original units of measurement.
  – How do we do this?
• Now use the new model to predict cell phone subscribers for 2000.

How do we predict for year 2000?

Plotting our original data vs. our exponential model …

Homework
• Problem 4.6, p. 212
• Problem 4.11, p. 213
• Reading: pp. 203-215

Power Law Models
• General form of a power law model: $y = a x^p$
• Biologists have found that many characteristics of living things are described quite closely by power laws.
  – For example, the rate at which animals use energy goes up as the ¾ power of their body weight (Kleiber’s Law).
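Returning to the cell phone example above, the log-transform-and-refit steps can be sketched in Python. This is a minimal sketch, not the calculator procedure used in class: the helper names `least_squares` and `predict_users` are hypothetical, and a base-10 logarithm is assumed because the slides do not state the base.

```python
import math

# Cell phone users (thousands) by year, copied from the example table.
years = [1990, 1993, 1994, 1995, 1996, 1997, 1998, 1999]
users = [5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Transform the response with a base-10 logarithm (assumed base), then refit.
log_users = [math.log10(u) for u in users]
intercept, slope = least_squares(years, log_users)


def predict_users(year):
    """Undo the log transformation: y = 10 ** (intercept + slope * year)."""
    return 10 ** (intercept + slope * year)


print(f"log y = {intercept:.2f} + {slope:.5f} x")
print(f"Predicted subscribers in 2000 (thousands): {predict_users(2000):,.0f}")
```

Raising 10 to the fitted value is what "undoes" the logarithm. Under this model each additional year multiplies the predicted count by 10 raised to the slope, which is the fixed-percentage growth described earlier.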
LSR and Power Law Models
• As we saw in the last section, exponential growth models become linear when we apply the logarithm transformation to the response variable y.
• Power law models become linear when we apply the logarithm transformation to both variables, x and y.

Log Transformations for Power Law Models
$y = a x^p$
$\log y = \log(a x^p)$
$\log y = \log a + p \log x$
• Looking carefully at the last equation, the power (p) becomes the slope of the straight line that links log y to log x.
  – We can estimate what power (p) the law involves by regressing log y on log x and using the slope of the regression line to estimate the power.

Problem 4.13, p. 219

Problem 4.13, p. 219
Log of Both Variables

Residuals Analysis (Transformed Data)

Undoing the Transformation
• Let’s do the math to see what we need: $\log y = 0.76172 + 0.218215 \log x$

Predicting Lifespan for Humans

• HW Problem:
  – 4.14, p. 220

Warm-Up Problem
• 4.25, pp. 224-225
• Create appropriate model
• Predict seed count for tree with seed weight of 1,000 mg.

[Figure panels I–V for Problem 4.25: log of both L1 and L2; axes off to see trend; Y2 vs. original data]

4.2 Cautions about Correlation and Regression
• The correlation (r) and the LSR line are not resistant.
• As we have seen, extrapolation is often dangerous.
  – Extrapolation means predicting outside the range of x-values for which the model was developed.

The French Paradox
• The paradox refers to the fact that the French have long had low rates of heart disease (Japan is the only developed country with a lower rate), despite a diet relatively rich in saturated animal fats. The French propensity to drink wine the way some Americans guzzle soft drinks has been cited as a likely explanation of the paradox, since numerous studies have indicated that alcohol consumed in moderation helps to prevent atherosclerosis, or accumulation of fatty deposits in arteries, which is the underlying cause of most heart attacks.
+ from NY Times article

Lurking Variables
• As we discussed in the example of amount of wine consumed vs. number of incidents of heart disease, there can be other variables not measured in a correlation study that may influence the interpretation of relationships among those variables.
  – Lurking Variables
• It is possible to show, for example, that there is a high correlation between shoe size and intelligence for a group of children varying in age from, say, 4 to 15.
  – What is the lurking variable?
• To control for age, we can calculate the correlation between shoe size and IQ for each of the different ages.
  – Age 4, 5, 6, …

Correlation Between Shoe Size and IQ? (Common Response)
[Diagram relating Age, Shoe Size, and IQ]

See Figure 4.18, p. 227

Lurking Variables That Change Over Time
• Many lurking variables change systematically over time.
• One useful method for detecting lurking variables is to plot both the response variable and the regression residuals against the time order of the observations (whenever the time order is available).
• See Example 4.12, p. 228

Using Averaged Data
• Be careful when applying to individuals the results of a study that uses averages.
• Problem 4.31, p. 231

Causation
• Simply put, a strong correlation between two variables says nothing about one variable causing the other. One variable may in fact cause the other to change, but a correlation or LSR line cannot tell us that.
  – More investigation is needed!
• A designed study with proper experimental controls should be used.

Figure 4.22, p. 232
• Causation
• Common Response
• Confounding

Confounding
• The effects of two variables on a response variable are said to be confounded when they cannot be distinguished from one another.
  – Definition: Two or more variables that might have caused an effect were simultaneously present, so that we do not know to which to attribute the effect.
  – See 1, Example 4.13 (p. 232), and explanation, p. 233, top of p. 234.
• Does this mean that we cannot ever suggest causation?
  – Read the two paragraphs on p. 235 (establishing causation).

Causation
• Example 4.14, p. 232
  – Numbers 1 and 2 (p. 233)

Common Response
• Example 4.15, p. 233

Homework
• Reading through p. 240

Problems
• Problems on p. 237:
  – 4.33, 4.34, 4.35
• 4.73, p. 257

Problem 4.73, p. 257
A power law model might fit best, so take the log of L1 and L2. Plot of L3 and L4 below.

4.73, cont.
The pendulum period is proportional to the square root of its length.

4.3 Relations in Categorical Variables
• There are many relationships of interest to us that cannot be described by using correlation and LSR techniques.
  – Recall that correlation and LSR require both variables to be quantitative.
• Often, we want to study the relationship between two variables that are inherently categorical.

Two-Way Table (Ex. 4.19, p. 241)
                                 Age Group
Education              25 to 34   35 to 54      55+     Total
Did not complete HS       4,474      9,155   14,224    27,853
Completed HS             11,546     26,481   20,060    58,087
1-3 yrs college          10,700     22,618   11,127    44,445
4+ yrs college           11,066     23,183   10,596    44,845
Total                    37,786     81,435   56,008   175,230

Two-Way Table
• The row variable is level of education.
  – In this study, is level of education the explanatory or response variable?
• The column variable is age.
  – Explanatory or response?
• Marginal distributions:
  – The distributions of education alone and age alone are called marginal distributions because their totals are in the margins: Education at the right, and age at the bottom.

Marginal Distributions
• It is often advantageous to display the marginal distribution in percents instead of raw numbers.
[Figure: Education Level in U.S. (adults age 25+). Bar chart of percent of total by years of schooling: no high school degree 15.9, high school only 33.1, 1-3 years of college 25.4, 4+ years of college 25.6]

Conditional Distributions
• The previous graph looked at the breakdown of education levels for the entire population.
Many times, however, we are looking for breakdowns (i.e., distributions) for a certain group within the population.
  – For example, of those people with 4+ years of college, look at the distribution across age groups.
  – Let’s complete a bar graph for this comparison.
  – This is a conditional distribution.

One Conditional Distribution for Example 4.19
[Figure: Breakdown by age group of people with 4+ years of college. Bar chart of percent by age group: 25-34 24.7, 35-54 51.7, 55+ 23.6]

Different Question
• What proportion of each age group received 4+ years of college education?

• Read paragraph at the bottom of page 248.

One set of conditional distributions: Figure 4.27, p. 248

Problems
• 4.53, p. 245
• 4.59, p. 251

Graph for Problem 4.59
[Figure: Breakdown of Planned Majors in Business School, by Gender. Bar chart of percent by business school major (Accounting, Admin, Economics, Finance), with separate bars for female and male students]

Homework
• Read through the end of the chapter.
• Be sure you understand “Simpson’s Paradox.”
• Problem:
  – 4.62, p. 253

Simpson’s Paradox
• Problem 4.60, p. 251
• Statement of the Paradox:
  – Simpson’s paradox refers to the reversal of the direction of a comparison or association when data from several groups are combined to form a single group.

Practice/Review Problems
• Problem:
  – 4.68, p. 254
  – 4.72 (parts a-c), p. 257

[Figure: Relationship Between Type of College and Management Level. Bar chart of percent by management level (High, Middle, Low), with separate bars for public and private colleges]
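The marginal and conditional distributions in Section 4.3 can be reproduced from the Example 4.19 counts. The following is a minimal Python sketch under stated assumptions: the variable names are hypothetical, and the column totals are summed from the cells, so they can differ slightly from the printed totals, which appear to have been rounded separately.

```python
# Counts (thousands of persons) from the two-way table in Example 4.19.
# Each row lists the age groups in the order given by `ages`.
ages = ["25 to 34", "35 to 54", "55+"]
counts = {
    "Did not complete HS": [4474, 9155, 14224],
    "Completed HS": [11546, 26481, 20060],
    "1-3 yrs college": [10700, 22618, 11127],
    "4+ yrs college": [11066, 23183, 10596],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of education: row totals as percents of everyone.
marginal_education = {
    level: round(100 * sum(row) / grand_total, 1)
    for level, row in counts.items()
}

# Conditional distribution of age GIVEN 4+ years of college:
# divide each cell in that row by the row total.
college = counts["4+ yrs college"]
age_given_college = {
    age: round(100 * n / sum(college), 1) for age, n in zip(ages, college)
}

# The "different question": percent of each age group with 4+ years of
# college. Here each cell is divided by its column total instead.
column_totals = [sum(row[j] for row in counts.values()) for j in range(len(ages))]
college_given_age = {
    age: round(100 * college[j] / column_totals[j], 1)
    for j, age in enumerate(ages)
}

print(marginal_education)
print(age_given_college)
print(college_given_age)
```

The only difference between the last two calculations is the denominator (row total vs. column total), which is why the two questions give different answers from the same cells.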