Correlation Coefficient spr09

					Stat 390                                                                                       April 15, 2009


                     Topic 27: Correlation Coefficient¹
In 1970, the United States Selective Service instituted a draft to decide which young men would be
forced to join the armed forces. Wanting to be completely fair, they used a random lottery process that
assigned draft numbers to birthdays: those born on days with low draft numbers were drafted. But was
the lottery process carried out in a fair, truly random manner? In this topic, you will learn a new
technique for analyzing such data and answering this question.


Overview
In the previous topic, you saw how scatterplots provide useful visual information about the relationship
between two quantitative variables. Rather than relying on visual impressions alone, however, it is also
handy to have a numerical measure of the strength of association between two variables—just as you
made use of numerical summaries for various aspects of a single variable’s distribution. This topic
introduces you to such a measure and asks you to investigate some of its properties. This measure, one
of the most famous in statistics, is the correlation coefficient.

Activity 27-1: Car Data
Recall the nine scatterplots related to car data from Activity 26-3 in your previous handout.

a. Check that you have this ordering of those scatterplots according to the direction and strength of
   association revealed in them:

                                     Negative                None              Positive
                              Strongest ..... Weakest                   Weakest ..... Strongest
    Letter of Scatterplot       D     G     A     H           C           E     I     F     B
    Correlation Coefficient    ___   ___   ___   ___         ___         ___   ___   ___   ___


                 The correlation coefficient, denoted by r, is a number that measures the
                 degree to which two quantitative variables are linearly associated.

The calculation of r is very tedious to do by hand, so you will begin by letting technology calculate
correlation coefficients while you explore their properties.

b. Use Minitab to calculate the value of the correlation coefficient between time to travel ¼ mile and
   weight. Record this value in the preceding table in the column corresponding to scatterplot A:

        1. Open the Cars99.MTW worksheet. Notice that the variables you are interested in are in
           columns C10 and C6 respectively.

        2. Select Stat > Basic Statistics > Correlation. Enter c10 and c6 as the Variables and then
           click OK.

¹ Excerpted from Workshop Statistics: Discovery with Data, 3rd Edition, by Allan J. Rossman & Beth L. Chance, and
Minitab Companion for Workshop Statistics, by Julie M. Clark.

         Minitab Tip: Minitab reports the (Pearson) correlation coefficient and a “P-Value.” The sample
         correlation coefficient (r) is the first number reported. The reported p-value is the result of a
         “hypothesis test” of whether or not the correlation coefficient is different from zero. If the
         p-value is confusing, you can ask Minitab not to report it by unchecking the box labeled
         Display p-values.

         Minitab Tip: As a shortcut, you can obtain the correlation coefficient by typing at the command
         prompt: MTB > corr c10 c6.

c. Now use Minitab to calculate the value of the correlation coefficient for the other eight scatterplots.
   Record these in the preceding table, below the appropriate letter (B-I). Does it matter
   in what order you give Minitab the variables?

d. Based on these results, what do you suspect is the largest value that a correlation coefficient can
   assume? What do you suspect is the smallest value?
       Largest:                                         Smallest:

e. Under what circumstances do you think the correlation coefficient assumes its largest or smallest
   value? Hint: Consider what would have to be true of the curve in the scatterplot.

f.   How does the value of the correlation relate to the direction of the association?

g. How does the value of the correlation relate to the strength of the association?


These examples should convince you that a correlation coefficient has to be between -1 and +1, and it
equals one of those values only when the observations form a perfectly straight line. The sign of the
correlation coefficient reflects the direction of the association (e.g., positive values of r correspond to a
positive linear association). The magnitude of the correlation coefficient indicates the strength of the
association, with values closer to -1 or +1 signifying a stronger linear association.
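
The properties just summarized can be checked on a small scale outside Minitab. The sketch below is only an illustration: the data are made up (they are not the car data), the helper name pearson_r is our own, and a deterministic zig-zag pattern stands in for random scatter so the results are reproducible.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = list(range(1, 31))  # hypothetical explanatory variable

def zigzag(k):
    """Deterministic stand-in for scatter: +k, -k, +k, ... (an assumption)."""
    return [k if i % 2 == 0 else -k for i in range(len(x))]

datasets = {
    "perfect line, rising": [2 * v + 5 for v in x],
    "perfect line, falling": [100 - 3 * v for v in x],
    "rising, mild scatter": [2 * v + e for v, e in zip(x, zigzag(5))],
    "rising, heavy scatter": [2 * v + e for v, e in zip(x, zigzag(40))],
}

results = {label: pearson_r(x, y) for label, y in datasets.items()}
for label, r in results.items():
    print(f"{label:>22}: r = {r:+.3f}")
```

The two perfect lines give r = +1 and r = -1 (up to rounding), and the same rising trend produces a value of r closer to 0 as the scatter around the line grows.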


Activity 27-2: Governors’ Salaries
The following table reports governors’ salaries for the fifty states (as of the year 2005), along with the
median housing prices for the states.
         State         Governor’s    Median Housing           State         Governor’s    Median Housing
                          Salary         Price                                 Salary         Price
       Alabama           $96,361        $85,100            Montana            $96,462        $99,500
        Alaska           $85,776       $144,200            Nebraska           $85,000        $88,000
        Arizona          $95,000       $121,300             Nevada           $117,000       $142,000
       Arkansas          $77,028        $72,800         New Hampshire        $104,758       $133,300
       California       $175,000       $211,500           New Jersey         $175,000       $170,800
       Colorado          $90,000       $166,600          New Mexico          $110,000       $108,100
      Connecticut       $150,000       $166,900            New York          $179,000       $148,700
       Delaware         $114,000       $130,400         North Carolina       $123,819       $108,300
        Florida         $129,060       $105,500          North Dakota         $88,926        $74,400
        Georgia         $128,903       $111,200              Ohio            $132,292       $103,700
        Hawaii           $94,780       $272,700           Oklahoma           $117,571        $70,700

         Idaho          $98,500       $106,300            Oregon           $93,600       $152,100
        Illinois       $150,691       $130,800          Pennsylvania      $144,416        $97,000
          Iowa         $107,482        $82,500          Rhode Island      $105,194       $133,000
        Kansas         $103,813        $83,500          South Dakota      $103,222        $79,600
       Kentucky        $112,705        $86,700           Tennessee         $85,000        $93,000
       Louisiana        $95,000        $85,000             Texas          $115,345        $82,500
        Maine           $70,000        $98,700              Utah          $104,600       $146,100
      Maryland         $145,000       $146,000            Vermont         $168,600       $111,500
     Massachusetts     $135,000       $185,700            Virginia        $175,000       $125,400
       Michigan        $177,000       $115,600           Washington       $148,035       $168,300
      Minnesota        $120,303       $122,400          West Virginia      $95,000        $72,800
      Mississippi      $122,160        $71,400           Wisconsin        $131,768       $112,200
       Missouri        $120,087        $89,900            Wyoming         $105,000        $96,600

a. What are the observational units for these data?

b. Use Minitab (Govenors05.mtw) to produce a scatterplot of governor’s salary vs. median housing
   price. Describe the association (direction, strength, and form) between these two variables.

c. Based on this scatterplot, guess the value of the correlation coefficient between governor’s salary
   and median housing price.

d. Use Minitab to calculate the value of this correlation. Record this value, and comment on the
   accuracy of your guess.

e. Suppose Hawaii gives its governor a $100,000 raise. Make this change in the data. Then reproduce
   the scatterplot, and recalculate the value of the correlation coefficient. Has the correlation
   coefficient changed much?

f.   Repeat part e after giving the governor of Hawaii an additional $100,000 raise.

g. Now suppose Hawaii decides to make its governorship an unpaid position. Change the governor of
   Hawaii’s salary to $0. Then reproduce the scatterplot and recalculate the value of the correlation
   coefficient. Has the correlation coefficient changed much?

h. Based on these calculations, would you say the correlation coefficient is a resistant measure of
   association? Explain.


Activity 27-3: Televisions and Life Expectancy
Reconsider the data from Activity 26-6 about life expectancy and number of televisions per thousand
people in a sample of 22 countries. A scatterplot is reproduced here.








a. Describe the direction and strength of the association between life expectancy and number of
   televisions per thousand people in these countries. Also comment on whether or not this
   association follows a linear form.

b. Based on this scatterplot, guess the value of the correlation coefficient between life expectancy and
   televisions per thousand people in these countries.

c. Use Minitab (TVlife06.mtw) to calculate this correlation coefficient. How accurate was your
   guess?

d. Would you say the value of the correlation coefficient is fairly high, even though the association
   between the variables is not linear?

e. Does the fairly high value of the correlation coefficient provide evidence of a cause-and-effect
   relationship between number of televisions and life expectancy? Explain.


Watch Out
        • Correlation measures the degree of linear association between two quantitative variables. But
        even when two variables display a nonlinear relationship, the correlation between them still
        might be quite high. With these data, the relationship is clearly curved and not linear, and yet
        the correlation is still fairly high. Do not assume from a high correlation coefficient that the
        relationship between the variables must be only linear. Always look at a scatterplot, in
        conjunction with the correlation coefficient, to assess the form (linear or not) of the association.

        • No matter how close a correlation coefficient is to ±1, and no matter how strong the
        association between two variables, a cause-and-effect conclusion cannot necessarily be drawn
        from observational data. There are far more plausible explanations for why countries with lots
        of televisions per thousand people tend to have long life expectancies. For example, the
        technological sophistication of the country is related to both number of televisions and life
        expectancy.






Activity 27-4: Guess the Correlation
This activity will give you practice at judging the value of a correlation coefficient by examining a
scatterplot. http://www.rossmanchance.com/applets/guesscorrelation/GuessCorrelation.html
a. Open the applet Guess the Correlation. Keep 15 for the Number of Points, and click New Sample.
    The applet will generate some “pseudo-random data” and produce a scatterplot.




     Based solely on the scatterplot, guess the value of the correlation coefficient. Enter your guess in
     the Correlation Guess field in the applet, and click Enter. The applet then reports the actual value of
     the correlation coefficient. Record your guess and the actual value in the first empty column of the
     following table:

                    Repetition Number     1     2     3     4     5     6     7     8     9     10
                    Your Guess
                    Actual Correlation
b. Click New Sample to generate another scatterplot of pseudo-random data. Enter your guess for the
   value of the correlation coefficient in the applet. Then record your guess and the actual value of the
   correlation coefficient in the preceding table. Repeat for a total of 10 repetitions.

c. After the ten repetitions, guess the value of the correlation coefficient between your guesses for r
   and the actual values of r.

d. From the applet’s pull-down menu below Show Graph Of, select Guess vs. Actual. The applet will
   create the scatterplot of your ten guesses and the corresponding actual correlation coefficients and
   will also report the correlation coefficient between your guesses and the actual values. Record this
   correlation coefficient. Does the value surprise you?

e. Use the applet to examine a scatterplot of your errors vs. the actual values. Is there evidence you
   are better at guessing certain correlation coefficient values than other values? Explain.

f.   Use the applet to examine a scatterplot of your errors vs. the repetition (trial) number. Is there
     evidence your guesses were more accurate or less accurate as you went along? Explain.




g. Suppose all of your guesses had been too high by exactly 0.1. What would the correlation coefficient
   between your guesses and the actual values be? Hint: Think about what the scatterplot would look
   like.

h. Repeat part g if your guesses had all been too low by exactly 0.5.

i.   If the correlation coefficient between your guesses and the actual values is 1.0, does this mean you
     guessed perfectly every time? What does this value reveal about the utility of the correlation
     coefficient as a measure of your guessing prowess? Explain.


Activity 27-5: House Prices
Reconsider the data on house prices from Activity 26-1. The mean house price is $482,386, and the
standard deviation is $79,801.5. The mean house size is 1288.1 square feet, and the standard deviation
is 369.191 square feet. You can gain some insight into how the correlation coefficient r measures
association by examining the formula for its calculation:
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

where $x_i$ denotes the ith observation of one variable, $y_i$ the ith observation of the other variable, $\bar{x}$ and $\bar{y}$
the respective sample means, $s_x$ and $s_y$ the respective sample standard deviations, and n the sample size.
This formula says to standardize each x and y value into its z-score, multiply these z-scores together for
each observational unit, add those results, and finally divide the sum by one less than the sample size.
The following table begins the process of calculating the correlation between house price and size by
calculating the houses’ z-scores for price and size and then multiplying the results.

      Address                Price ($)  Price Z-score    Size (sq ft)   Size Z-score   Product of Z-scores
      2130 Beach St.         311,000                         460           -2.243
      2545 Lancaster Dr.     344,720       -1.725           1030           -0.699             1.206
      415 Golden West Pl.    359,500       -1.54             883           -1.097             1.69
      990 Fair Oaks Ave.     414,000       -0.857            728           -1.517             1.30
      845 Pearl Dr.          459,000       -0.293           1242           -0.125             0.037
      1115 Rogers Ct.        470,000       -0.155           1419            0.355            -0.055
      579 Halcyon Rd.        470,000       -0.155            952           -0.91              0.141
      1285 Poplar St.        470,000       -0.155           1499            0.571            -0.089
      1080 Fair Oaks Ave.    474,000       -0.105           1014           -0.742             0.078
      690 Garfield Pl.       475,000       -0.093           1615            0.885            -0.082
      1030 Sycamore Dr.      490,000        0.095           1664            1.018             0.097
      620 Eman Ct.           492,000        0.120           1160           -0.347            -0.042
      529 Adler St.          500,000        0.221           1545            0.696             0.154
      646 Cerro Vista Cir.   510,000        0.346           1567            0.755             0.261
      926 Sycamore Dr.       520,000        0.471           1176           -0.304            -0.143
      227 S Alpine St.       541,000        0.734           1120           -0.455            -0.334
      654 Woodland Ct.       567,500        1.067           1549            0.707             0.754
      2230 Paso Robles St.   575,000        1.161           1540            0.682             0.792
      2461 Ocean St.         580,000        1.223           1755
      833 Creekside Dr.      625,000        1.787           1844            1.506             2.691
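
The recipe described above (standardize, multiply, add, divide by n - 1) can be written out step by step. The following is a minimal sketch on five hypothetical houses, not the houses in the table; the function name correlation_from_z_scores and the data values are assumptions made for illustration.

```python
from statistics import mean, stdev

def correlation_from_z_scores(xs, ys):
    """Compute r as the sum of z-score products divided by (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample SDs, which divide by n - 1
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    products = [zx * zy for zx, zy in zip(z_x, z_y)]
    return sum(products) / (n - 1)

# Hypothetical data: size in square feet, price in $1000s.
sizes = [900, 1100, 1300, 1500, 1700]
prices = [300, 340, 330, 400, 430]

r = correlation_from_z_scores(sizes, prices)
print(f"r = {r:.3f}")
```

A house that is below average on both variables, or above average on both, contributes a positive product; a house that is above average on one variable and below on the other contributes a negative product. That is why mostly same-sign z-score pairs lead to a positive r.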





a. Calculate the z-score for the price of 2130 Beach St. and for the size of 2461 Ocean St. Then
   calculate the product of the z-scores for these two houses. Show your calculations below and record
   the results in the table.


b. The sum of the products turns out to equal 14.819. Use this information, and the fact that there are
   20 houses in this sample, to determine the value of the correlation coefficient between house price
   and size.

c. What do you notice about the size z-score for most of the houses with negative price z-scores?
   Explain how the signs of these z-scores result from the strong positive association between house
   price and size.

d. Confirm your calculation in part b by using Minitab (HousePricesAG.mtw) to calculate the value
   of the correlation coefficient between house price and size.


Activity 27-6: Exam Score Improvements
Consider some data on hypothetical exam scores stored in the Minitab file ExamScores.mtw.

a. Use Minitab to produce a scatterplot of exam 2 score vs. exam 1 score. Comment on the direction,
   strength, and form of the association revealed.

b. Use Minitab to calculate the correlation coefficient between exam 1 and exam 2.

c. Now suppose each student scores 10 points lower on exam 1 than she actually did. How would you
   expect this result to affect the value of the correlation coefficient between exam 2 and exam 1?
   Explain.

d. Use Minitab to make this change (subtract 10 points from everyone’s score on exam 1):
        1. Click in the Session window at the Command Prompt (MTB >).
        2. Type let c5 = c1 - 10
        3. Now type a title for column C5 in the Data window (something clever like “Exam 1-10”).
        4. Create a scatterplot of exam 2 vs. new exam 1 score and recalculate the correlation
           coefficient. How did the correlation value change?

e. Now suppose each student scores twice as many points on exam 2 as she actually did. How would
   you expect this result to affect the value of the correlation coefficient between exam 2 and exam 1?
   Explain.

f.   Use Minitab’s let command to make this change: double everyone’s score on exam 2. (You will need
     to use the * character to multiply in Minitab.) Store your results in column C6. Then reproduce the
     scatterplot of this new exam 2 vs. new exam 1, and recalculate the correlation. How did the
     correlation value change?






                 These questions demonstrate another property of the correlation coefficient: It
                 does not change if the scale of measurement is altered by adding a constant or
                 multiplying by a positive constant.
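
For readers who want to see this property outside Minitab, here is a small sketch using six hypothetical pairs of exam scores (not the contents of ExamScores.mtw); pearson_r is a helper defined only for this illustration. It also shows one caveat: multiplying a variable by a negative constant keeps the magnitude of r but flips its sign.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for six students.
exam1 = [62, 70, 75, 81, 88, 93]
exam2 = [65, 68, 80, 79, 90, 96]

r_original = pearson_r(exam1, exam2)
r_shifted = pearson_r([s - 10 for s in exam1], exam2)   # subtract 10 from exam 1
r_doubled = pearson_r(exam1, [2 * s for s in exam2])    # double exam 2
r_negated = pearson_r(exam1, [-1 * s for s in exam2])   # multiply exam 2 by -1

print(f"original:          r = {r_original:+.4f}")
print(f"exam 1 minus 10:   r = {r_shifted:+.4f}")
print(f"exam 2 doubled:    r = {r_doubled:+.4f}")
print(f"exam 2 times -1:   r = {r_negated:+.4f}")
```

The first three values agree because shifting or rescaling a variable changes its mean and standard deviation in step with the data, so every z-score stays the same.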

g. Now consider a different (hypothetical) class of students. Suppose each student scores exactly 10
   points higher on exam 2 than he/she does on exam 1. What do you think the value of the correlation
   coefficient would be between exam 1 and exam 2? Explain your reasoning. Hint: Consider what the
   scatterplot would look like.

h. Make up some hypothetical bivariate data in Minitab with the property described in part g. Hint:
   Choose any values at all for the exam 1 scores, and then make sure each exam 2 score is 10 points
   higher. Do this for at least 5 hypothetical students. Then use Minitab to produce a scatterplot and
   calculate the correlation. Does this confirm the value you expected in part g, or do you need to
   revise your thinking?

i.   Now suppose each student scores exactly twice as many points on exam 2 as he/she does on
     exam 1. What do you think the value of the correlation coefficient would be between exam 1 and
     exam 2? Explain your reasoning. Hint: Consider what the scatterplot would look like.

j.   Make up some hypothetical bivariate data in Minitab with the property described in part i. Then use
     Minitab to produce a scatterplot and calculate the correlation. Does this confirm the value you
     expected in part i, or do you need to revise your thinking?


Watch Out
• A correlation coefficient is a number! In fact, it is a number between -1 and +1, inclusive. While this may
seem obvious by now, many students say “the same” and do not give a number in response to the
question in part g.

• The slope, or steepness, of the points in a scatterplot is unrelated to the value of the correlation
coefficient. If the points fall on a perfectly straight line with a positive slope, then the correlation
coefficient equals 1.0 whether that slope is very steep or not steep at all. What matters for the
magnitude of the correlation is how closely the points concentrate around a line, not the steepness of a
line.



Activity 27-7: Draft Lottery (Self-Check Activity)
In 1970 the United States Selective Service conducted a lottery to decide which young men would be
drafted into the armed forces (Fienberg, 1971). Each of the 366 birthdays of the year was assigned a
draft number. Young men born on days assigned low draft numbers were drafted. The file
DraftLottery.mtw lists the draft number assigned to each birthday. The “sequential date” column
lists the birthday as a number from 1–366 (January 1 is coded as 1 and December 31 as 366).
a. What draft number was assigned to your birthday?

b. In a perfectly fair, random lottery, what should the correlation coefficient between draft number
   and sequential date of the birthday equal? Explain.

c. Use Minitab to produce a scatterplot of draft number vs. sequential date of the birthday. Based on
   the scatterplot, guess the value of the correlation coefficient. Explain the reasoning behind your
   guess.

d. Use Minitab to calculate the value of the correlation coefficient. Does its value surprise you? If so,
   look back at the scatterplot to see if, in hindsight, its value makes sense. Summarize what the value
   of this correlation coefficient reveals about how the draft numbers were distributed across
   birthdays throughout the year.

e. Data for 1971 are also stored in the file DraftLottery.mtw. Examine a scatterplot, and calculate
   the correlation coefficient between draft number and sequential date for that year’s lottery.
   Comment on your findings.


Solution
a. Answers will vary.
b. With a perfectly fair, random lottery, there should be no association between draft number and
sequential date for the birthday. In other words, these variables should be independent, so the
correlation coefficient would equal zero. With an actual lottery, you would not expect the correlation
coefficient to equal exactly zero, but it should be close to zero.
c. The scatterplot is shown here.




It’s hard to see a relationship between the variables in this scatterplot, so a reasonable guess for the
value of the correlation coefficient would be close to zero.
d. Minitab reveals the correlation coefficient to equal r = -0.226. This indicates a weak negative
association between draft number and sequential date. While not large, this correlation value is farther
from zero than most people expect. Looking at the scatterplot more closely, you can see there are few
points in the top right and bottom left of the graph. This result suggests few birthdays late in the year
were assigned high draft numbers, and few birthdays early in the year were assigned low draft numbers.
Young men born late in the year were therefore at a disadvantage: they had a greater chance of getting
a low draft number. Birthdays late in the year were not mixed as thoroughly as those earlier in the year,
so they tended to be selected early in the process and thereby assigned a low draft number.
e. The scatterplot for the 1971 draft lottery data is shown here.







The correlation coefficient is 0.014, which is very close to 0. This value indicates there is no association
between draft number and sequential date, suggesting the lottery process was fair and random in 1971.
The mixing mechanism was greatly improved after the anomaly with the 1970 results was spotted.
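
One way to judge how unusual r = -0.226 would be under a fair lottery is to simulate many fair lotteries and look at the correlations they produce. The sketch below is an idealized illustration, not a description of the actual Selective Service procedure: it assumes a fair lottery is a uniformly random shuffle of the numbers 1 to 366, and the seed, the number of simulations, and the helper pearson_r are arbitrary choices made here.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1970)        # arbitrary seed, for reproducibility only
DAYS = 366
N_SIMULATIONS = 2000     # arbitrary choice
OBSERVED_R = -0.226      # value reported in the solution to part d

dates = list(range(1, DAYS + 1))
simulated = []
for _ in range(N_SIMULATIONS):
    draft_numbers = dates[:]        # the draft numbers 1..366
    random.shuffle(draft_numbers)   # assumed fair: every ordering equally likely
    simulated.append(pearson_r(dates, draft_numbers))

as_extreme = sum(abs(r) >= abs(OBSERVED_R) for r in simulated)
print(f"mean simulated r:     {sum(simulated) / N_SIMULATIONS:+.4f}")
print(f"largest |r| observed: {max(abs(r) for r in simulated):.3f}")
print(f"share with |r| >= {abs(OBSERVED_R)}: {as_extreme / N_SIMULATIONS:.4f}")
```

Under this assumption the simulated correlations cluster tightly around zero, so a value as far from zero as -0.226 would be rare in a fair lottery. This supports the solution's point that the 1970 result, while weak as correlations go, is farther from zero than chance alone would typically produce.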


Wrap-Up
In this topic, you discovered the correlation coefficient as a measure of the linear relationship between
two variables. Analyzing pairs of variables for the car data, you discovered some of the properties of
this measure. For example, a correlation value has to be between -1 and +1, inclusive. The sign of the
correlation coefficient reflects the direction of the association. The magnitude of the correlation
coefficient reflects the strength of the association, with correlation coefficients close to -1 or +1
indicating very strong association, and correlation coefficients close to 0 reflecting very weak linear
association. But also keep in mind that you discovered the correlation coefficient is not resistant to
outliers, as altering just one state’s value for governor’s salary changed the value of the correlation
considerably. It is important to always accompany your interpretation of the correlation coefficient with
a scatterplot. You also learned how to calculate a correlation coefficient based on z-scores and gained
practice judging the value of a correlation based on a scatterplot. Finally, with the data on televisions
and life expectancy, you saw again that you should not infer a causal relationship between variables
based on a high correlation.
Some useful definitions to remember and habits to develop from this topic include:

• The correlation coefficient is a number that measures the direction and strength of linear association
between two quantitative variables.
• The correlation coefficient is not resistant to outliers. One very unusual point can produce a large
correlation coefficient even when most of the data reveals no pattern, or a small correlation coefficient
when most of the data follows a clear linear pattern.
• Always examine a scatterplot in addition to calculating a correlation coefficient. A clear nonlinear
relationship can have a small (close to zero) correlation, and a correlation can be close to -1 or +1, even
if the relationship follows a curve or other nonlinear pattern.
• Never forget a large correlation coefficient between two variables does not necessarily establish a
cause-and-effect relationship between those variables.
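
The third habit above can be illustrated with two tiny made-up data sets that both follow the same curve, y = x². This is only a sketch under that assumption; pearson_r is a helper written for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Curved and steadily increasing: y = x^2 for x = 1, ..., 10.
x_pos = list(range(1, 11))
y_pos = [v ** 2 for v in x_pos]

# The same curve over x = -5, ..., 5: it falls, then rises.
x_sym = list(range(-5, 6))
y_sym = [v ** 2 for v in x_sym]

r_pos = pearson_r(x_pos, y_pos)
r_sym = pearson_r(x_sym, y_sym)
print(f"x = 1..10, y = x^2: r = {r_pos:.3f}")
print(f"x = -5..5, y = x^2: r = {r_sym:.3f}")
```

In the first case r is close to +1 even though the relationship is not linear; in the second, y is completely determined by x and yet r is 0. Neither number alone reveals the form of the relationship, which is why the scatterplot is needed.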






Activity 27-8: Hypothetical Exam Scores
Consider the following scatterplots of hypothetical scores on two exams for Class A and Class B (the data
are also stored in the file HypoExams.mtw):




a. In class A, do most of the exam scores follow a linear pattern? Are there any exceptions?

b. In class B, are most of the exam scores scattered haphazardly with no apparent pattern? Are there
   any exceptions?

c. Use Minitab to calculate the correlation coefficient between exam 1 score and exam 2 score for each
   of these classes. Are you surprised at either of the values? Explain.

d. Describe how these scatterplots pertain to the issue of resistance of the correlation coefficient.

Now consider the following scatterplot of exam data for Class C:




e. Describe what the scatterplot reveals about the relationship between exam scores in class C.



f.    Use Minitab to calculate the correlation coefficient between exam scores in class C. Is its value
      higher than you expected? Explain what this example reveals about correlation.


Activity 27-9: Proximity to the Teacher
Consider the idea of studying whether students who sit closer to the teacher tend to have higher quiz
scores than students who sit farther away from the teacher. Suppose you measure distance from the
teacher and average quiz score for a group of students. Explain how you know each of the following
statements is in error:
a. The correlation between distance and quiz average is –1.8.
b. The correlation between distance and quiz average is –0.8, and the correlation between quiz
    average and distance is –0.4.
c. The correlation is –0.8, so there is no association between distance and quiz average.
d. The correlation between quiz average and gender is –0.8.
e. The correlation between distance and quiz average is –0.8, so students who sit farther away tend to
    score higher.
f. The correlation between distance and quiz average is –0.8, so sitting closer to the teacher must
    cause students to score higher on quizzes.


Activity 27-10: Monthly Temperatures
Reconsider Activity 26-10 and the data on average monthly temperatures in Raleigh, North Carolina:

                  Jan   Feb    Mar     Apr    May    Jun    Jul   Aug      Sept    Oct      Nov      Dec
     Avg. Temp    39    42     50      59     67     74     78    77       71      60       51       43

The following scatterplot displays Raleigh’s average monthly temperature vs. the month number:




      a. Does there appear to be any relationship between temperature and month in Raleigh? If so,
         describe the relationship.
      b. Use Minitab to calculate the correlation coefficient between these variables. Does this
         correlation value seem to indicate a strong or a weak relationship?
      c. Explain why the correlation is so close to 0 even though the scatterplot reveals a clear
         relationship between temperature and month.
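
As a cross-check on the Minitab result, the correlation can also be computed directly from the twelve values in the table above. The sketch below uses those values with months coded 1 through 12; the helper pearson_r and the extra January-to-July calculation are additions made here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

months = list(range(1, 13))  # Jan = 1, ..., Dec = 12
avg_temp = [39, 42, 50, 59, 67, 74, 78, 77, 71, 60, 51, 43]  # from the table

r_year = pearson_r(months, avg_temp)
r_jan_to_jul = pearson_r(months[:7], avg_temp[:7])  # rising half of the year only

print(f"all 12 months:   r = {r_year:.3f}")
print(f"January to July: r = {r_jan_to_jul:.3f}")
```

Over the full year the rising and falling halves largely cancel in the sum of z-score products, so r is small. Restricted to the months when temperature is rising, the relationship is nearly linear and r is close to 1.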






Activity 27-11: Planetary Measurements
Consider the data below on planetary measurements. The following scatterplot displays the period of
revolution around the sun (in earth days) vs. the distance from the sun (in millions of miles).




    a. Describe the association between these variables as revealed in the scatterplot.
    b. Would a straight line appear to be a reasonable summary of the relationship between revolution
       and distance? Explain.
    c. The correlation coefficient between revolution and distance turns out to equal 0.989. This value
       is very close to 1. Does this value mean that a straight line is the best model for the
       relationship between revolution and distance? Explain.
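
The idea behind part c can be sketched in Python. The planetary data are not reproduced in this section, so the distances below are round, illustrative values only, and the periods are generated from Kepler's third law (period proportional to distance raised to the 1.5 power) rather than taken from measurements. The constant k and the helper pearson_r are assumptions of this sketch.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative distances (millions of miles); NOT the handout's data.
distance = [40, 70, 90, 140, 480, 890, 1800, 2800, 3700]

# Periods from an exact curved rule: period = k * distance ** 1.5.
k = 0.4  # arbitrary constant chosen for this sketch
period = [k * d ** 1.5 for d in distance]

r_raw = pearson_r(distance, period)
r_log = pearson_r([math.log(d) for d in distance],
                  [math.log(p) for p in period])

print(f"r for (distance, period):           {r_raw:.4f}")
print(f"r for (log distance, log period):   {r_log:.4f}")
```

Here the relationship is perfectly curved by construction, yet r for the raw values is still close to 1. A high correlation alone does not show that a straight line is the right model; the scatterplot is needed as well.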


Activity 27-12: Ice Cream, Drownings, and Fire Damage
a. Suppose a beach community keeps track of the amount of ice cream sold in a given month and the
   number of drownings that occur in that month. Would you expect to find a negative correlation, a
   positive correlation, or a correlation close to zero? Explain your reasoning.

b. If the community in part a were to find a strong positive correlation between ice cream sales and
   drownings, would that mean ice cream causes drowning? If not, suggest an alternative explanation
   (i.e., a confounding variable) for the strong association.

c. Explain why you would expect to find a positive correlation between the number of fire engines that
   respond to a fire and the amount of damage done in the fire. Does this imply the damage would be
   less extensive if fewer fire engines were dispatched? Explain.
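
The role of a confounding variable in parts a and b can be illustrated with a small simulation. This is only a sketch under invented assumptions: the temperature range, the coefficients, and the noise levels are hypothetical, and in this model ice cream sales have no effect at all on drownings.

```python
import math
import random

def pearson_r(x, y):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(390)  # fixed seed so the sketch is repeatable

# Hypothetical model for 120 months: temperature drives BOTH variables.
temps = [rng.uniform(40, 95) for _ in range(120)]
ice_cream = [10 * t + rng.gauss(0, 60) for t in temps]   # sales depend on temperature
drownings = [0.1 * t + rng.gauss(0, 1) for t in temps]   # drownings depend on temperature
# (simulated drownings are not rounded to whole counts; this is a rough model)

r = pearson_r(ice_cream, drownings)
print(f"r between ice cream sales and drownings: {r:.3f}")
```

The two simulated variables end up strongly positively correlated even though neither one influences the other, because both respond to temperature.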


Activity 27-13: Climatic Conditions
The following data, from the 1992 Statistical Abstract of the United States, pertain to a number of
climatic variables for a sample of 25 American cities. These variables measure long-term averages of:
         • January high temperature (in degrees Fahrenheit)
         • January low temperature
         • July high temperature
         • July low temperature
         • Annual precipitation (in inches)
         • Days of measurable precipitation per year
         • Annual snow accumulation
         • Percentage sunshine

    City             Jan. High   Jan. Low   July High   July Low   Precip.    Days Precip.   Snow    Sun
    Atlanta            50.4        31.5         88        69.5      50.77         115          2      61
    Baltimore          40.2        23.4        87.2       66.8       40.6         113         21.3    57
    Boston             35.7        21.6        81.8       65.1      41.51         126         40.7    58
    Chicago             29         12.9        83.7       62.6      35.82         126         38.7    55
    Cleveland          31.9        17.6        82.4       61.4      36.63         156         54.3    49
    Dallas             54.1        32.7        96.5       74.1       33.7          78         2.9     54
    Denver             43.2        16.1        88.2       58.6       15.4          89         59.8    70
    Detroit            30.3        15.6        83.3       61.3      32.62         135         41.5    53
    Houston             61         39.7        92.7       74.2      46.07         104         0.4     56
    Kansas City        34.7        16.7        88.7       68.2      37.62         104          20     62
    Los Angeles        65.7        47.8        75.3       62.8      12.01          35          0      73
    Miami              75.2        59.2         89        76.2      55.91         129          0      73
    Minneapolis        20.7         2.8         84        63.1      28.32         114         49.2    58
    Nashville          45.9        26.5        89.5       68.9       47.3         119         10.6    56
    New Orleans        60.8        41.8        90.6       73.1      61.88         114         0.2     60
    New York           37.6        25.3        85.2       68.4      47.25         121         28.4    58
    Philadelphia       37.9        22.8        82.6       67.2      41.41         117         21.3    56
    Phoenix            65.9        41.2       105.9         81       7.66          36          0      86
    Pittsburgh         33.7        18.5        82.6       61.6      36.85         154         42.8    46
    St. Louis          37.7        20.8        89.3       70.4      37.51         111         19.9    57
    Salt Lake City     36.4        19.3        92.2       63.7      16.18          90         57.8    66
    San Diego          65.9        48.9        76.2       65.7       9.9           42          0      68
    San Francisco      55.6        41.8        71.6       65.7       19.7          62          0      66
    Seattle             45         35.2        75.2       55.2      37.19         156         12.3    46
    Washington         42.3        26.8        88.5       71.4      38.63         112         17.1    56

Use Minitab to calculate the correlation coefficient between all pairs of these eight variables; the data
are stored in the file Climate.mtw. Hint: There are a total of 28 such pairs of variables. It’s probably
easiest to record the correlation values in a table similar to the following:

                     Jan. High   Jan. Low   July High   July Low    Precip.   Days Precip.    Snow    Sun
   Jan. High            xxx
   Jan. Low             xxx        xxx
   July High            xxx        xxx         xxx
   July Low             xxx        xxx         xxx         xxx
   Precip.              xxx        xxx         xxx         xxx        xxx
   Days Precip.         xxx        xxx         xxx         xxx        xxx          xxx
   Snow                 xxx        xxx         xxx         xxx        xxx          xxx         xxx
   Sun                  xxx        xxx         xxx         xxx        xxx          xxx         xxx    xxx

To compute all of the correlations simultaneously, select Stat > Basic Statistics > Correlation. Enter c2-c9
as the Variables. To simplify the output, remove the check from the box labeled Display p-values.
a. Which pair of variables has the strongest (either positive or negative) linear association? What is the
    value of the correlation between those variables?
        Variables:                                                                 correlation:



b. Which pair of variables has the weakest (either positive or negative) linear association? What is the
   value of the correlation between those variables?
       Variables:                                                                 correlation:
c. Suppose you want to predict the annual snowfall for an American city and you are allowed to look at
   that city’s averages for these other variables. Which variable would be most useful to you? Which
   variable would be least useful?
       Most useful:                                               Least useful:
d. Suppose you want to predict the average July high temperature for an American city and you are
   allowed to look at that city’s averages for these other variables. Which variable would be most
   useful to you? Which variable would be least useful?
       Most useful:                                               Least useful:
e. Use Minitab to explore the relationship between annual snowfall and annual precipitation more
   closely. Produce and comment on a scatterplot of these two variables.
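
As an alternative to Minitab, the 28 pairwise correlations can be computed with a short Python sketch. The values below are typed in from the table above rather than read from Climate.mtw; the names COLUMNS, DATA, and pearson_r are choices made for this sketch.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

COLUMNS = ["Jan. High", "Jan. Low", "July High", "July Low",
           "Precip.", "Days Precip.", "Snow", "Sun"]

# One row per city, in the same column order as the table in this activity.
DATA = {
    "Atlanta":        (50.4, 31.5, 88.0, 69.5, 50.77, 115, 2.0, 61),
    "Baltimore":      (40.2, 23.4, 87.2, 66.8, 40.6, 113, 21.3, 57),
    "Boston":         (35.7, 21.6, 81.8, 65.1, 41.51, 126, 40.7, 58),
    "Chicago":        (29.0, 12.9, 83.7, 62.6, 35.82, 126, 38.7, 55),
    "Cleveland":      (31.9, 17.6, 82.4, 61.4, 36.63, 156, 54.3, 49),
    "Dallas":         (54.1, 32.7, 96.5, 74.1, 33.7, 78, 2.9, 54),
    "Denver":         (43.2, 16.1, 88.2, 58.6, 15.4, 89, 59.8, 70),
    "Detroit":        (30.3, 15.6, 83.3, 61.3, 32.62, 135, 41.5, 53),
    "Houston":        (61.0, 39.7, 92.7, 74.2, 46.07, 104, 0.4, 56),
    "Kansas City":    (34.7, 16.7, 88.7, 68.2, 37.62, 104, 20.0, 62),
    "Los Angeles":    (65.7, 47.8, 75.3, 62.8, 12.01, 35, 0.0, 73),
    "Miami":          (75.2, 59.2, 89.0, 76.2, 55.91, 129, 0.0, 73),
    "Minneapolis":    (20.7, 2.8, 84.0, 63.1, 28.32, 114, 49.2, 58),
    "Nashville":      (45.9, 26.5, 89.5, 68.9, 47.3, 119, 10.6, 56),
    "New Orleans":    (60.8, 41.8, 90.6, 73.1, 61.88, 114, 0.2, 60),
    "New York":       (37.6, 25.3, 85.2, 68.4, 47.25, 121, 28.4, 58),
    "Philadelphia":   (37.9, 22.8, 82.6, 67.2, 41.41, 117, 21.3, 56),
    "Phoenix":        (65.9, 41.2, 105.9, 81.0, 7.66, 36, 0.0, 86),
    "Pittsburgh":     (33.7, 18.5, 82.6, 61.6, 36.85, 154, 42.8, 46),
    "St. Louis":      (37.7, 20.8, 89.3, 70.4, 37.51, 111, 19.9, 57),
    "Salt Lake City": (36.4, 19.3, 92.2, 63.7, 16.18, 90, 57.8, 66),
    "San Diego":      (65.9, 48.9, 76.2, 65.7, 9.9, 42, 0.0, 68),
    "San Francisco":  (55.6, 41.8, 71.6, 65.7, 19.7, 62, 0.0, 66),
    "Seattle":        (45.0, 35.2, 75.2, 55.2, 37.19, 156, 12.3, 46),
    "Washington":     (42.3, 26.8, 88.5, 71.4, 38.63, 112, 17.1, 56),
}

# Turn the rows into one list of 25 values per variable.
columns = {name: [row[i] for row in DATA.values()]
           for i, name in enumerate(COLUMNS)}

# Correlation for every unordered pair of distinct variables.
pairs = {(a, b): pearson_r(columns[a], columns[b])
         for a, b in combinations(COLUMNS, 2)}

# List the pairs from strongest to weakest linear association.
for (a, b), r in sorted(pairs.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{a:>12} vs {b:<12} r = {r:+.3f}")
```

Sorting by the absolute value of r puts the strongest linear associations (positive or negative) first and the weakest last, which is the ordering parts a through d ask about.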



Activity 27-14: Muscle Fatigue
Reconsider the matched-pairs study comparing muscle fatigue between men and women from Activity
23-5 (Hunter et al., 2004). In Activity 26-12, you analyzed a scatterplot of time until fatigue for men and
women.
a. Calculate the correlation coefficient between time until muscle fatigue for men and time until
    muscle fatigue for women.
b. Comment on what this correlation coefficient suggests about whether or not men and women of
    similar strength tend to have similar times until muscle fatigue.



