POSTED ON: 11/2/2011 Public Domain
CHAPTER 4 NOTES

Modeling Nonlinear Data:

Linear Data is data modeled by an equation of the form $y = a + bx$. Linearization is the process of transforming nonlinear data into linear data. We use properties of logarithms to linearize certain types of data.

PROPERTIES OF LOGARITHMS:
1. $\log(ab) = \log a + \log b$
2. $\log\left(\frac{a}{b}\right) = \log a - \log b$
3. $\log x^p = p \log x$

Examples:
$\log 2x = \log 2 + \log x$
$\log\left(\frac{2}{x}\right) = \log 2 - \log x$
$\log 2^x = x \log 2$

Case 1: Consider the following set of Linear Data representing an account balance as a function of time:

x: time (months)            0      48      96     144     192     240
y: account balance ($)    100     580    1060    1540    2020    2500

Describe the pattern of change…

The relationship between x and y is linear if, for equal increments of x, we add a fixed increment to y.

Case 2: Consider the following set of Nonlinear Data representing an account balance as a function of time:

x: time (months)            0       48       96      144      192      240
y: account balance ($)    100   161.22   259.93   419.06   675.62  1089.30

The relationship between x and y is exponential if, for equal increments of x, we multiply y by a fixed factor. This factor is called the common ratio.

We want to find the best fitting model to represent our data. For the linear data, we use least-squares regression to find the best fitting line. For the nonlinear data, the best fitting model would be an exponential curve.

PROBLEM: We cannot apply least-squares regression directly to the nonlinear data, because least-squares regression depends upon correlation, which measures only the strength of linear relationships.

SOLUTION: We transform the nonlinear data into linear data, then use least-squares regression to determine the best fitting line for the transformed data. Finally, we do a reverse transformation to turn the linear equation back into a nonlinear equation that models our original nonlinear data.

Linearizing Exponential Functions:
(We want to write an exponential function of the form $y = a \cdot b^x$ as a function of the form $y = a + bx$.)
$y = a \cdot b^x$ (x, y are variables and a, b are constants)
$\log y = \log(a \cdot b^x)$
$\log y = \log a + \log b^x$
$\log y = \log a + x \log b$
var2 = con1 + (var1)(con2)

This is in the general form $y = a + bx$, which is linear. So, the graph of (var1, var2) is linear. This means the graph of $(x, \log y)$ is linear.

CONCLUSIONS:
1. If the graph of $(x, \log y)$ is linear, then the graph of $(x, y)$ is exponential.
2. If the graph of $(x, y)$ is exponential, then the graph of $(x, \log y)$ is linear.

Once we have linearized our data, we can use least-squares regression on the transformed data $(x, \log y)$ to find the best fitting linear model.

PRACTICE: Linearize the data for Case 2 and find the least-squares regression line for the transformed data. Then, do a reverse transformation to turn the linear equation back into an exponential equation.

$\log \hat{y} = 2 + 0.0043x$
$10^{\log \hat{y}} = 10^{2 + 0.0043x}$
$\hat{y} = 10^{2 + 0.0043x}$
$\hat{y} = 10^2 \cdot 10^{0.0043x}$
$\hat{y} = 100 \cdot 10^{0.0043x}$
$\hat{y} = 100 \cdot 1.01^x$ (rounding $10^{0.0043} \approx 1.01$)

So our exponential model for Case 2 is: $\hat{y} = 100 \cdot 10^{0.0043x}$

***Compare this to the equation the calculator gives when performing exponential regression on the Case 2 data.

Linearizing Power Functions:
(We want to write a power function of the form $y = a x^b$ as a function of the form $y = a + bx$.)

$y = a x^b$ (x, y are variables and a, b are constants)
$\log y = \log(a x^b)$
$\log y = \log a + \log x^b$
$\log y = \log a + b \log x$
var2 = con1 + (con2)(var1)

This is in the general form $y = a + bx$, which is linear. So, the graph of (var1, var2) is linear. This means the graph of $(\log x, \log y)$ is linear.

Case 3: Consider the following set of Nonlinear Data representing the average weight at different ages for Atlantic Ocean rockfish:

x: age (years)        0     4     8    12    16    20
y: weight (grams)     0    48   192   432   768  1200

PRACTICE: Linearize the data for Case 3 and find the least-squares regression line for the transformed data. Then, do a reverse transformation to turn the linear equation back into a power equation.
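The two linearizations in this section can also be checked numerically. The sketch below is not part of the original notes. It assumes base-10 logarithms and uses a hypothetical helper, `least_squares`, written in plain Python. It fits a line to (x, log y) for the Case 2 data and to (log x, log y) for the Case 3 data, then reverses each transformation. The Case 3 point (0, 0) is left out because log 0 is undefined.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Case 2: exponential data, so fit a line to (x, log y).
months = [0, 48, 96, 144, 192, 240]
balance = [100, 161.22, 259.93, 419.06, 675.62, 1089.30]
exp_intercept, exp_slope = least_squares(months, [math.log10(y) for y in balance])
exp_a = 10 ** exp_intercept  # reverse transformation gives y = exp_a * exp_b ** x
exp_b = 10 ** exp_slope

# Case 3: power data, so fit a line to (log x, log y).
# The point (0, 0) is dropped because log 0 is undefined.
age = [4, 8, 12, 16, 20]
weight = [48, 192, 432, 768, 1200]
pow_intercept, pow_slope = least_squares(
    [math.log10(x) for x in age], [math.log10(y) for y in weight]
)
pow_a = 10 ** pow_intercept  # reverse transformation gives y = pow_a * x ** pow_b
pow_b = pow_slope

print(f"Case 2: log y = {exp_intercept:.4f} + {exp_slope:.4f} x, so y = {exp_a:.2f} * {exp_b:.4f}^x")
print(f"Case 3: log y = {pow_intercept:.4f} + {pow_slope:.4f} log x, so y = {pow_a:.2f} * x^{pow_b:.2f}")
```

The fitted intercepts and slopes should agree, up to rounding, with the hand-derived regression lines in these notes and with the calculator's exponential and power regression output.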
$\log \hat{y} = 0.4771 + 2 \log x$
$\log \hat{y} = 0.4771 + \log x^2$
$10^{\log \hat{y}} = 10^{0.4771 + \log x^2}$
$\hat{y} = 10^{0.4771} \cdot 10^{\log x^2}$
$\hat{y} = 10^{0.4771} \cdot x^2$
$\hat{y} = 3x^2$

So our power model for Case 3 is: $\hat{y} = 3x^2$

***Compare this to the equation the calculator gives when performing power regression on the Case 3 data.

Interpreting Correlation and Regression:

Extrapolation is the use of a regression equation to make predictions outside the range of values of the explanatory variable that were used to fit the model. Caution! These predictions are often not reliable.

Example: The following data represent Sharon’s typing speed (words per minute) as a function of the time she has practiced (hours).

Practice (hrs.)     0    10    20    30    40
Speed (wpm)        30    40    48    55    61

Use the linear regression model to predict Sharon’s typing speed after 50 hours of practice. After 60 hours of practice? Are these predictions reliable?

A lurking variable affects the relationship between the variables being studied, although it is not part of the study.

Example: A study compared the SAT scores of high school seniors who took an SAT-Prep course to those of high school seniors who did not take the Prep course. The study found no significant difference in the SAT scores. Is it fair to conclude that taking an SAT-Prep course has no effect on SAT scores? Identify any lurking variables.

Association vs. Causation

An association between two variables, even if it is very strong, does not imply that changes in one variable cause changes in the other variable. Association DOES NOT imply causation!

Example: A study found a strong negative correlation between student failure rate and years of teaching experience at East Larson High School. The study found that failure rates for first-year teachers were significantly higher than failure rates for tenured (more experienced) teachers. Does longer teaching experience cause lower failure rates? Identify any lurking variables.
A strong association between two variables x and y could reflect the following relationships:
Causation
Common Response
Confounding

Causation: changes in x cause changes in y.
(Diagram: x → y)

Common Response: both x and y are caused by a lurking variable z.
(Diagram: z → x and z → y)

Confounding: changes in x cause changes in y, but y is also caused by a lurking variable z.
(Diagram: x → y and z → y)

Relations in Categorical Data:

Because we cannot perform direct calculations on categorical data, we use the counts or percents of individuals by category.

The counts or percents of individuals in each category of one variable are called a marginal distribution.

Example: Here are data from eight high schools on smoking among students and among their parents:

                         Student smokes   Student does not smoke   Total
Both parents smoke             400                 1380
One parent smokes              416                 1823
Neither parent smokes          188                 1168
Total

Find the marginal distributions for parent smoking behavior and student smoking behavior by count.

Note: Counts cannot be directly compared when the sizes of the groups are unequal. Instead, compare the percents.

Example: Find the marginal distributions for parent smoking behavior and student smoking behavior by percent.

                         Student smokes   Student does not smoke   Total
Both parents smoke
One parent smokes
Neither parent smokes
Total                                                                100

The counts or percents of individuals in each category of one variable that are also in a given category of the other variable are called a conditional distribution.

Example: Find the conditional distribution for student smoking behavior given that neither parent smokes.

                         Student smokes   Student does not smoke   Total
Neither parent smokes

Simpson’s Paradox

When data from several groups are combined to form a single group, the association between variables can change drastically, and can even reverse direction.

Example: Upper Wabash Tech has a Business school and a Law school. The following table shows the number of applicants admitted to and denied by each school.
Upper Wabash Tech Applicants

              Business                        Law
          Admit    Deny                  Admit    Deny
Male        480     120        Male         10      90
Female      180      20        Female      100     200

By combining the data from the Business school and the Law school, we have the following two-way table:

Applicants for both schools combined

          Admit    Deny    Total
Male        490     210      700
Female      280     220      500
Total       770     430     1200

From this table, we have the following result:
490/700 = 70% of males are granted admission.
280/500 = 56% of females are granted admission.

So Wabash admits a higher percentage of male applicants.

Now consider each of the schools separately:

Business

          Admit    Deny    Total
Male        480     120      600
Female      180      20      200
Total       660     140      800

480/600 = 80% of males are granted admission.
180/200 = 90% of females are granted admission.

Law

          Admit    Deny    Total
Male         10      90      100
Female      100     200      300
Total       110     290      400

10/100 = 10% of males are granted admission.
100/300 ≈ 33.3% of females are granted admission.

So each school admits a higher percent of female applicants!

If each school admits a higher percent of females, how can both schools combined admit a higher percent of males? This is an example of Simpson’s Paradox.

600/700 ≈ 86% of males apply to the Business school, whereas 200/500 = 40% of females apply to the Business school. Likewise, 14% of males apply to the Law school, whereas 60% of females apply to the Law school.

The Business school admits 660/800 ≈ 83% of its applicants, while the Law school admits only 110/400 ≈ 28% of its applicants.

Because a higher percent of males apply to the school that’s easier to get into, a higher percent of males are admitted overall, even though each school admits a higher percent of female applicants.
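The admission percents above can be reproduced with a few lines of code. The following is a minimal sketch, not part of the original notes. The dictionary layout and the helper name `admit_rate` are assumptions made for illustration, and the counts are copied from the Upper Wabash Tech tables. It computes the admission rate for each sex within each school and for both schools combined, then checks whether the direction of the comparison reverses.

```python
# (admitted, denied) counts copied from the Upper Wabash Tech tables.
applicants = {
    "Business": {"Male": (480, 120), "Female": (180, 20)},
    "Law": {"Male": (10, 90), "Female": (100, 200)},
}

def admit_rate(admit, deny):
    """Percent of applicants admitted."""
    return 100 * admit / (admit + deny)

# Admission rate for each sex within each school.
by_school = {
    school: {sex: admit_rate(*counts) for sex, counts in groups.items()}
    for school, groups in applicants.items()
}

# Admission rate for each sex with both schools combined.
combined = {}
for sex in ("Male", "Female"):
    admit = sum(applicants[school][sex][0] for school in applicants)
    deny = sum(applicants[school][sex][1] for school in applicants)
    combined[sex] = admit_rate(admit, deny)

for school, rates in by_school.items():
    print(school, {sex: round(rate, 1) for sex, rate in rates.items()})
print("Combined", {sex: round(rate, 1) for sex, rate in combined.items()})

# Simpson's paradox: females lead within every school, males lead overall.
reversal = (
    all(rates["Female"] > rates["Male"] for rates in by_school.values())
    and combined["Male"] > combined["Female"]
)
print("Reversal present:", reversal)
```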
Consider the following data on 326 cases in which the defendant was convicted of murder:

                   White Defendant           Black Defendant
                    Death Penalty             Death Penalty
                    Yes      No               Yes      No
White Victim         19     132                11      52
Black Victim          0       9                 6      97

The following two-way table shows the combined data (combining the victim’s race) for defendant’s race versus death penalty.

                          Death Penalty
                    Yes      No     Total
White Defendant      19     141      160
Black Defendant      17     149      166
Total                36     290      326

11.9% of white defendants receive the death penalty.
10.2% of black defendants receive the death penalty.

White Defendant

                          Death Penalty
                    Yes      No     Total
White Victim         19     132      151
Black Victim          0       9        9
Total                19     141      160

12.6% receive the death penalty for killing a white victim.
0% receive the death penalty for killing a black victim.

Black Defendant

                          Death Penalty
                    Yes      No     Total
White Victim         11      52       63
Black Victim          6      97      103
Total                17     149      166

17.5% receive the death penalty for killing a white victim.
5.8% receive the death penalty for killing a black victim.

Simpson’s Paradox: Although a higher percent of white defendants receive the death penalty overall, a higher percent of black defendants receive the death penalty for killing a white victim and for killing a black victim.

Out of the 214 white victims killed, 14% of the defendants received the death penalty. Out of the 112 black victims killed, only 5.4% of the defendants received the death penalty. Since whites killed whites 151 out of 160 times (94%), this group has greater influence on the combined results.
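One way to see why the reversal happens is to write each defendant group's overall death-penalty rate as a weighted average of its rates by victim's race. The sketch below is not part of the original notes. The dictionary layout and the helper name `weighted_breakdown` are assumptions for illustration, and the counts are copied from the tables above. For each defendant group it reports the weight (share of that group's cases) and the rate for each victim group, along with the overall rate that these combine to.

```python
# (death penalty: yes, no) counts from the tables above,
# keyed by defendant's race, then by victim's race.
cases = {
    "White": {"White": (19, 132), "Black": (0, 9)},
    "Black": {"White": (11, 52), "Black": (6, 97)},
}

def weighted_breakdown(by_victim):
    """Return (overall_rate, parts); parts maps victim race to (weight, rate)."""
    total = sum(yes + no for yes, no in by_victim.values())
    parts = {}
    for victim, (yes, no) in by_victim.items():
        weight = (yes + no) / total  # share of this defendant group's cases
        rate = yes / (yes + no)      # death-penalty rate within this victim group
        parts[victim] = (weight, rate)
    # The overall rate is the weighted average of the per-victim-group rates.
    overall = sum(weight * rate for weight, rate in parts.values())
    return overall, parts

breakdown = {
    defendant: weighted_breakdown(by_victim)
    for defendant, by_victim in cases.items()
}

for defendant, (overall, parts) in breakdown.items():
    print(f"{defendant} defendant: overall rate {overall:.1%}")
    for victim, (weight, rate) in parts.items():
        print(f"  {victim} victim: weight {weight:.1%}, rate {rate:.1%}")
```

With these counts, white-victim cases make up about 94% of the white defendants' cases but only about 38% of the black defendants' cases. The higher-rate victim group therefore dominates the white defendants' overall percent.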