Do the data - Goblues.org by linxiaoqin


Chapter 4: More on Two-Variable Data

4.1    Transforming Relationships
4.2    Cautions
4.3    Relations in Categorical Data

[Figure: plot of g dye/kg fiber (vertical axis, 0 to 140) against time in minutes (horizontal axis, 20 to 180)]

            Example

Year      Cell Phone Users (thousands)
1990          5,283
1993         16,009
1994         24,134
1995         33,786
1996         44,043
1997         55,312
1998         69,209
1999         86,047



Scatterplot for Cell Phone Example




Residuals Plot




           What’s going on here?

• Do the data (y) increase by a constant amount each
  year?
   – This would suggest a linear model.
• Or, do the data increase by a fixed percentage each
  year? That is, can you multiply the y-value by a fixed
  number to get the next year’s number, and then
  multiply that number by the fixed number to get the
  following year’s number?
   – This would suggest an exponential model.
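
One way to check both questions is to compute the year-to-year differences and ratios directly. The sketch below uses the table values for the consecutive years 1993 to 1999 (1990 is left out because the gap to 1993 is three years, not one); the variable names are illustrative.

```python
# Cell phone users (thousands) for the consecutive years 1993-1999,
# copied from the Example table.
years = [1993, 1994, 1995, 1996, 1997, 1998, 1999]
users = [16009, 24134, 33786, 44043, 55312, 69209, 86047]

# Year-to-year differences: roughly constant would suggest a linear model.
differences = [b - a for a, b in zip(users, users[1:])]

# Year-to-year ratios: roughly constant would suggest an exponential model.
ratios = [b / a for a, b in zip(users, users[1:])]

for year, diff, ratio in zip(years[1:], differences, ratios):
    print(f"{year}: difference = {diff:>6}, ratio = {ratio:.3f}")
```

The differences grow every year, which argues against a linear model. The ratios are not perfectly constant either, but they settle near 1.25 in the later years, which is closer to the exponential pattern.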


     Transformation of the Variables

• The next step is to apply a mathematical
  transformation that changes exponential
  growth into linear growth.
  – The transformation that can help here is
    to take the logarithm of the y-variable,
    then re-plot the data and re-calculate the
    least-squares regression (LSR) line.
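
A minimal sketch of this step, assuming base-10 logarithms and the cell phone table from the Example slide. The helper `least_squares` is written here for illustration and is not part of the slides.

```python
import math

# Year and cell phone users (thousands), copied from the Example table.
years = [1990, 1993, 1994, 1995, 1996, 1997, 1998, 1999]
users = [5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047]

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Transform the response only: regress log10(users) on year.
log_users = [math.log10(u) for u in users]
intercept, slope = least_squares(years, log_users)

# Residuals on the log scale, for the new residual plot.
residuals = [ly - (intercept + slope * x) for x, ly in zip(years, log_users)]

print(f"log y = {intercept:.2f} + {slope:.5f} x")
```

If the transformation has straightened the data, the residuals should show less of a curved pattern than the residuals from the untransformed fit.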




New LSR, with Transformed y




Residuals Plot
 We are dealing with a transformed y-value!

• Model:
           log y = -263.20 + 0.13417x

• In order to use the model for prediction, we
  must “undo” the logarithm transformation to
  return to the original units of measurement.
  – How do we do this?
• Now use the new model to predict cell phone
  subscribers for 2000.
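
A sketch of the back-transformation, assuming the model is log y = -263.20 + 0.13417x with base-10 logarithms. Raising 10 to the predicted log value returns the prediction to thousands of users. The function name `predict_users` is illustrative.

```python
# Coefficients of the fitted line on the log scale, from the Model slide.
# Base-10 logarithms are assumed.
INTERCEPT = -263.20
SLOPE = 0.13417

def predict_users(year):
    """Predicted cell phone users (thousands) for a given year."""
    log_y = INTERCEPT + SLOPE * year   # prediction on the log scale
    return 10 ** log_y                 # undo log10 to get back to thousands

print(f"Predicted users in 2000: {predict_users(2000):,.0f} thousand")
```

Because the coefficients are rounded and the exponent is large, small changes in the slope or intercept move the prediction noticeably. The year 2000 also lies one year beyond the data, so this is a mild extrapolation.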
How do we predict for year 2000?




Plotting our original data vs. our
      exponential model …




                   Homework

• Problem 4.6, p. 212
• Problem 4.11, p. 213
• Reading: pp. 203-215




            Power Law Models

• General form of a power law model:

                   y = a x^p


• Biologists have found that many
  characteristics of living things are described
  quite closely by power laws.
   – For example, the rate at which animals use
     energy goes up as the ¾ power of their body
     weight (Kleiber’s Law).
     LSR and Power Law Models

• As we saw in the last section, exponential
  growth models become linear when we apply
  the logarithm transformation to the response
  variable y.
• Power law models become linear when we
  apply the logarithm transformation to both
  variables, x and y.


  Log Transformations for Power Law Models


                  y = a x^p

                  log y = log(a x^p)
                  log y = log a + p log x

• Looking carefully at the last equation, the power (p)
  becomes the slope of the straight line that links log y to
  log x.
   – We can estimate what power (p) the law involves by
     regressing log y on log x and using the slope of the
     regression line to estimate the power.
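
The idea can be sketched with made-up data. The values a = 2 and p = 0.75 below are hypothetical, chosen only so that the recovered slope can be checked against a known power; `least_squares` is an illustrative helper, not something from the slides.

```python
import math

# Hypothetical power law y = a * x**p with a = 2 and p = 0.75.
# These values are made up for illustration; they are not from the slides.
A_TRUE, P_TRUE = 2.0, 0.75
xs = [1.0, 10.0, 100.0, 1000.0, 10000.0]
ys = [A_TRUE * x ** P_TRUE for x in xs]

def least_squares(us, vs):
    """Return (intercept, slope) of the least-squares line of vs on us."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    slope = suv / suu
    return v_bar - slope * u_bar, slope

# Take logs of BOTH variables, then fit a straight line.
log_x = [math.log10(x) for x in xs]
log_y = [math.log10(y) for y in ys]
log_a, p = least_squares(log_x, log_y)

a = 10 ** log_a   # the intercept is log a, so undo the log to get a
print(f"estimated p = {p:.3f}, estimated a = {a:.3f}")
```

With real data the points will not fall exactly on a line, so the slope is only an estimate of the power.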
Problem 4.13, p. 219




Problem 4.13, p. 219

            Log of Both Variables




Residuals Analysis (Transformed Data)




     Undoing the Transformation

• Let’s do the math to see what we need:

           log y = 0.76172 + 0.218215 log x
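
A sketch of the algebra, assuming base-10 logarithms and a fitted line with intercept 0.76172 and slope 0.218215 on the log-log scale. Raising 10 to both sides gives y = 10^0.76172 * x^0.218215, which is a power law again. The function name `predict` and the input value 65 are illustrative; the slides do not state which x-value to use.

```python
# Coefficients of the line fitted to (log x, log y), from the slide.
# Base-10 logarithms are assumed.
LOG_A = 0.76172
P = 0.218215

# log y = LOG_A + P * log x   is equivalent to   y = 10**LOG_A * x**P
A = 10 ** LOG_A

def predict(x):
    """Predicted y on the original scale for a given x."""
    return A * x ** P

print(f"y = {A:.3f} * x^{P}")

# Hypothetical input: 65 is an illustrative x-value, not one given in the slides.
print(f"predicted y at x = 65: {predict(65):.1f}")
```

The input must be in the same units as the x-values used to fit the model.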




Predicting Lifespan for Humans




• HW Problem:
 – 4.14, p. 220




           Warm-Up Problem

• 4.25, pp. 224-225
• Create appropriate model
• Predict seed count for tree with seed weight of
  1,000 mg.




Problem 4.25

[Figure: five calculator screens, I to V. Captions: "Axes off to see trend" (I); "Log of both L1 and L2" (II); "Y2 vs. original data" (IV)]

      4.2 Cautions about Correlation
             and Regression
• The correlation (r) and the LSR line are not
  resistant.
• As we have seen, extrapolation is often
  dangerous.
   – Extrapolation means predicting beyond the range
     of x-values for which the model was developed.
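
To see what "not resistant" means, the sketch below computes r for a small made-up data set, then adds a single outlying point. All values are hypothetical, and `correlation` is an illustrative helper.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: five points lying exactly on a line.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = correlation(xs, ys)

# Add one outlying point far from the pattern.
r_outlier = correlation(xs + [10], ys + [0])

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_outlier:.3f}")
```

A single point is enough to turn a perfect positive correlation into a negative one here, which is why outliers and influential points deserve a look before r or the LSR line is trusted.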




                  The French Paradox

• The paradox refers to the fact that the French have
  long had low rates of heart disease (Japan is the only
  developed country with a lower rate), despite a diet
  relatively rich in saturated animal fats. The French
  propensity to drink wine the way some Americans
  guzzle soft drinks has been cited as a likely explanation
  of the paradox, since numerous studies have indicated
  that alcohol consumed in moderation helps to prevent
  atherosclerosis, or accumulation of fatty deposits in
  arteries, which is the underlying cause of most heart
  attacks.

+ from NY Times article
                Lurking Variables
• As we discussed in the example of amount of wine
  consumed vs. number of incidents of heart disease,
  there can be other variables not measured in a
  correlation study that may influence the interpretation
  of relationships among those variables.
   – Lurking Variables
• It is possible to show, for example, that there is a high
  correlation between shoe size and intelligence for a
  group of children varying in age from, say, 4 to 15.
   – What is the lurking variable?
• To control for age, we can calculate the correlation
  between shoe size and IQ for each of the different ages.
   – Age 4, 5, 6, …
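
The effect of controlling for age can be sketched with constructed data. Everything below is hypothetical: the "score" stands in for a raw measure of intelligence that rises with age, and the numbers are built so that shoe size and score are unrelated within any single age.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: within each age the deviations in shoe size (-1, 0, +1)
# and in score (+1, -2, +1) are uncorrelated, but both rise with age.
records = []  # (age, shoe_size, score)
for age in range(4, 16):
    for d_shoe, d_score in [(-1, 1), (0, -2), (1, 1)]:
        records.append((age, age + d_shoe, 10 * age + d_score))

# Correlation over all children, ignoring age.
overall_r = correlation([s for _, s, _ in records], [t for _, _, t in records])

# Correlation computed separately for each age.
within_age_r = {}
for age in range(4, 16):
    group = [(s, t) for a, s, t in records if a == age]
    within_age_r[age] = correlation([s for s, _ in group], [t for _, t in group])

print(f"overall r = {overall_r:.3f}")
print("within-age r:", {age: round(r, 3) for age, r in within_age_r.items()})
```

The overall correlation is strong only because age drives both variables; once age is held fixed, the association disappears in this constructed example.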
Correlation Between Shoe Size and IQ?
         (Common Response)


[Diagram: Age influences both IQ and Shoe Size]

See Figure 4.18, p. 227




            Lurking Variables That
              Change Over Time

• Many lurking variables change systematically
  over time.
• One useful method for detecting lurking
  variables is to plot both the response variable
  and the regression residuals against the time
  order of the observations (whenever the time
  order is available).
• See Example 4.12, p. 228

          Using Averaged Data

• Be careful when applying the results of a study
  that uses averages to individuals.

• Problem 4.31, p. 231




                   Causation

• Simply put, a strong correlation between two
  variables says nothing about one variable
  causing the other. One variable may in fact
  cause the other to change, but a correlation or
  LSR line cannot tell us that.
   – More investigation is needed!
• A designed study with proper experimental
  controls should be used.

              Figure 4.22, p. 232

• Causation
• Common Response
• Confounding




                    Confounding
• The effects of two variables on a response variable are
  said to be confounded when they cannot be
  distinguished from one another.
   – Definition: Two or more variables that might have
     caused an effect were simultaneously present, so that we
     do not know to which to attribute the effect.
   – See 1, Example 4.13 (p. 232), and explanation, p. 233, top
     of p. 234.
• Does this mean that we cannot ever suggest causation?
   – Read the two paragraphs on p. 235 (establishing
     causation).

              Causation
• Example 4.14, p. 232
  – Numbers 1 and 2 (p. 233)




       Common Response
• Example 4.15, p. 233




              Homework

• Reading through p. 240




                 Problems

• Problems on p. 237:
  – 4.33, 4.34, 4.35
• 4.73, p. 257




Problem 4.73, p. 257
    A power law model might fit best,
    so take the log of L1 and L2. The plot below
    is of L3 and L4.




                   4.73, cont.




The pendulum period is proportional to the square root
of its length.
  4.3 Relations in Categorical Variables

• There are many relationships of interest to us
  that cannot be described by using correlation
  and LSR techniques.
   – Recall that correlation and LSR require both
     variables to be quantitative.
• Often, we want to study the relationship
  between two variables that are inherently
  categorical.

   Two-Way Table (Ex. 4.19, p. 241)

                                 Age Group
Education              25 to 34    35 to 54        55+      Total
Did not complete HS       4,474       9,155     14,224     27,853
Complete HS              11,546      26,481     20,060     58,087
1-3 yrs college          10,700      22,618     11,127     44,445
4+ yrs college           11,066      23,183     10,596     44,845
Total                    37,786      81,435     56,008    175,230

               Two-Way Table

• The row variable is level of education.
   – In this study, is level of education the
     explanatory or response variable?
• The column variable is age.
   – Explanatory or response?

• Marginal distributions:
   – The distributions of education alone and age
     alone are called marginal distributions because
     their totals are in the margins: Education at the
     right, and age at the bottom.
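
The marginal distribution of education can be computed from the cell counts of the two-way table and expressed in percents. This is a minimal sketch; the dictionary layout and variable names are illustrative.

```python
# Counts from the two-way table (Ex. 4.19), in the table's own units.
# Columns are the age groups 25 to 34, 35 to 54, 55+.
table = {
    "Did not complete HS": [4474, 9155, 14224],
    "Complete HS":         [11546, 26481, 20060],
    "1-3 yrs college":     [10700, 22618, 11127],
    "4+ yrs college":      [11066, 23183, 10596],
}

# Row totals are the entries in the right-hand margin of the table.
row_totals = {level: sum(counts) for level, counts in table.items()}
grand_total = sum(row_totals.values())

# Marginal distribution of education: each row total as a percent
# of the grand total.
education_percent = {
    level: round(100 * total / grand_total, 1)
    for level, total in row_totals.items()
}

for level, pct in education_percent.items():
    print(f"{level:<20} {pct:5.1f}%")
```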
          Marginal Distributions

• It is often advantageous to display the marginal
  distribution in percents instead of raw numbers.

[Figure: bar graph, "Education Level in U.S. (adults age 25+)". Percent of total by years of schooling: no high school degree 15.9, high school only 33.1, 1-3 years of college 25.4, 4+ years of college 25.6]

         Conditional Distributions

• The previous graph looked at the breakdown of
  education levels for the entire population. Many times,
  however, we are looking for breakdowns (i.e.,
  distributions) for a certain group within the
  population.
   – For example, of those people with 4+ years of college,
     look at the distribution across age groups.
   – Let’s complete a bar graph for this comparison.
   – This is a conditional distribution.
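
A sketch of this conditional distribution, using the "4+ yrs college" row of the two-way table. Each count is divided by the row total, so the percents describe only the people with 4+ years of college. Variable names are illustrative.

```python
# The "4+ yrs college" row of the two-way table (Ex. 4.19).
age_groups = ["25 to 34", "35 to 54", "55+"]
college_4plus = [11066, 23183, 10596]

# Condition on education: divide by the row total, not the grand total.
row_total = sum(college_4plus)
conditional_percent = {
    group: round(100 * count / row_total, 1)
    for group, count in zip(age_groups, college_4plus)
}

for group, pct in conditional_percent.items():
    print(f"{group:<10} {pct:5.1f}%")
```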


          One Conditional Distribution for
                  Example 4.19

[Figure: bar graph, "Breakdown by age group of people with 4+ years of college". Percent by age group: 25-34: 24.7, 35-54: 51.7, 55+: 23.6]

           Different Question

• What proportion of each age group received 4+
  years of college education?
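
This question conditions on age instead of education, so each count in the "4+ yrs college" row is divided by its column total. The sketch below sums the columns from the cell counts of the two-way table; variable names are illustrative.

```python
# Cell counts from the two-way table (Ex. 4.19).
# Columns are the age groups 25 to 34, 35 to 54, 55+.
age_groups = ["25 to 34", "35 to 54", "55+"]
table = {
    "Did not complete HS": [4474, 9155, 14224],
    "Complete HS":         [11546, 26481, 20060],
    "1-3 yrs college":     [10700, 22618, 11127],
    "4+ yrs college":      [11066, 23183, 10596],
}

# Column totals, summed from the cells.
column_totals = [sum(col) for col in zip(*table.values())]

# Percent of each age group with 4+ years of college.
college_by_age = {
    group: round(100 * count / total, 1)
    for group, count, total in zip(age_groups, table["4+ yrs college"], column_totals)
}

for group, pct in college_by_age.items():
    print(f"{group:<10} {pct:5.1f}%")
```

The column sums computed from the cells (81,437 and 56,007) differ by a unit or two from the printed column totals (81,435 and 56,008). The rounded percents are the same either way.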




• Read paragraph at the bottom of page 248.




One set of conditional distributions:
         Figure 4.27, p. 248




           Problems


• 4.53, p. 245
• 4.59, p. 251




                  Graph for Problem 4.59

[Figure: bar graph, "Breakdown of Planned Majors in Business School, by Gender"]

Business School Major    Female (%)    Male (%)
Accounting                     30.2        34.8
Admin                          40.4        24.8
Economics                       2.2         3.7
Finance                        27.1        36.6

                   Homework

• Read through the end of the chapter.
• Be sure you understand “Simpson’s
  Paradox.”
• Problem:
  – 4.62, p. 253



          Simpson’s Paradox

• Problem 4.60, p. 251
• Statement of the Paradox:
  – Simpson’s paradox refers to the reversal
    of the direction of a comparison or
    association when data from several
    groups are combined to form a single
    group.
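
The reversal can be illustrated with a small set of made-up counts. The numbers below are hypothetical and are not taken from Problem 4.60; they are chosen so that option A does better within each group but worse once the groups are combined.

```python
# Hypothetical counts, made up for illustration: (successes, trials)
# for two options, A and B, within two groups.
groups = {
    "group 1": {"A": (8, 10),   "B": (70, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a fraction."""
    return successes / trials

# Within each group, A has the higher success rate.
for name, g in groups.items():
    print(name, {option: f"{rate(*counts):.0%}" for option, counts in g.items()})

# Combine the groups by adding the counts, then compare again.
combined = {
    option: tuple(sum(g[option][i] for g in groups.values()) for i in range(2))
    for option in ("A", "B")
}
print("combined", {option: f"{rate(*counts):.1%}" for option, counts in combined.items()})
```

The reversal happens because most of A's trials fall in the group where success is rare, while most of B's trials fall in the group where success is common. Group membership acts as a lurking variable.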

      Practice/Review Problems

• Problem:
  – 4.68, p. 254
  – 4.72 (parts a-c), p. 257




               Relationship Between Type of College and
                          Management Level

[Figure: bar graph of percent by management level, for public and private colleges]

Management Level    Public (%)    Private (%)
High                       4.2            7.3
Middle                    54.4           53.1
Low                       41.4           39.8

								