Bivariate data

                      2
                 VCE coverage
                 Area of study
                 Units 3 & 4 • Data analysis


                 In this chapter
                 2A   Types of data
                 2B   Back-to-back stem plots
                 2C   Parallel boxplots
                 2D   The two-way frequency table
                 2E   The scatterplot
                 2F   The q-correlation coefficient
                 2G   Pearson’s product–moment correlation coefficient
                 2H   Calculating r and the coefficient of determination
58   Further Mathematics




Types of data
         In this chapter we look at sets of data which contain two variables. We look at ways of
         displaying the data and of measuring relationships between the two variables.
            The methods we employ to do this depend entirely on the type of variables we are
         dealing with.
         Numerical and categorical data
         Examples of numerical data are:
         1. the heights of a group of teenagers
         2. the marks for a maths test
         3. the number of universities in a country
         4. ages
         5. salaries.
            As the name suggests, numerical data involve quantities which are, broadly
         speaking, measurable or countable.
            Examples of categorical data are:
         1. genders (sexes)
         2. AFL football teams
         3. religious denominations
         4. finishing positions in the Melbourne Cup
         5. municipalities
         6. ratings of 1–5 to indicate preferences for 5 different cars
         7. age groups, for example 0–9, 10–19, 20–29
         8. hair colours.
            Such categorical data, as the name suggests, have categories like masculine, feminine
         and neuter for gender, or Catholic, Anglican, Uniting, Baptist, Buddhist and so on for
         religious denomination, or 1st, 2nd, 3rd for finishing position in the Melbourne Cup.
         Note: Some numbers may look like numerical data but really be names or titles (for
         example, ratings of 1 to 5 given to different samples of cake, as in ‘This one’s a 4’, or
         the numbers on netball players’ uniforms, as in ‘she’s number 7’). These ‘titles’ are not
         countable; they place the subject in a category (with a name), so they are categorical.
            In this chapter we look at ways of measuring the relationship between:
         1. a numerical variable and a categorical variable (for example, weight and nationality)
         2. two categorical variables (for example, gender and religious denomination)
         3. two numerical variables (for example, height and weight).
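
The three pairings above determine which display is used in the sections of this chapter. A minimal sketch of that choice follows; the function name `choose_display` and its string labels are illustrative assumptions, not part of the course material.

```python
def choose_display(type_x, type_y, n_categories=None):
    """Suggest a display for a pair of variables.

    type_x, type_y: "numerical" or "categorical".
    n_categories: number of categories of the categorical variable;
    needed only when exactly one of the variables is categorical.
    """
    kinds = sorted([type_x, type_y])
    if kinds == ["categorical", "categorical"]:
        return "two-way frequency table"
    if kinds == ["numerical", "numerical"]:
        return "scatterplot"
    if kinds == ["categorical", "numerical"]:
        if n_categories is None:
            raise ValueError("n_categories is required for a mixed pair")
        if n_categories == 2:
            return "back-to-back stem plot"
        return "parallel boxplot"
    raise ValueError("types must be 'numerical' or 'categorical'")


# Height and weight: two numerical variables.
height_weight = choose_display("numerical", "numerical")
# Marks for girls and boys: numerical with a two-category variable.
marks_by_sex = choose_display("numerical", "categorical", n_categories=2)
```

For example, marks for four classes would be passed as `choose_display("numerical", "categorical", n_categories=4)`, which suggests a parallel boxplot.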
         Dependent and independent variables
         When a relationship between two variables is being examined, it is useful to
         know if one of the variables depends on the other. Often we can make a judgment about
         this but sometimes we can’t.
            Consider the case where a study compared the heights of company employees
         against their annual salaries. Common sense would suggest that the height of a
         company employee would not depend on the person’s annual salary nor would the annual
         salary of a company employee depend on the person’s height. In this case, it is not
         appropriate to designate one variable as independent and one as dependent.
            In the case where the ages of company employees are compared with their annual
         salaries, you might reasonably expect that the annual salary of an employee would
         depend on the person’s age. In this case, the age of the employee is the independent
         variable and the salary of the employee is the dependent variable.
            It is useful to identify the independent and dependent variables where possible, since
         it is the usual practice when displaying data on a graph to place the independent
         variable on the horizontal axis and the dependent variable on the vertical axis.
                                                          Chapter 2 Bivariate data        59
 remember
     1. Bivariate data are data with two variables.
     2. Numerical data involve quantities that are measurable or countable.
     3. Categorical data, as the name suggests, are data which are divided into categories.
     4. In a relationship involving two variables, if the values of one variable ‘depend’ on
        the values of another variable, then the former variable is referred to as the
        dependent variable and the latter variable is referred to as the independent variable.
     5. It is the usual practice when displaying data on a graph to place the independent
        variable on the horizontal axis and the dependent variable on the vertical axis.




                 2A           Types of data

1 Write down whether each of the following represents numerical or categorical data.
  a The heights in centimetres of a group of children
  b The diameters in millimetres of a tray of ball-bearings
  c The numbers of visitors to a display each day
  d The modes of transport that students in Year 11 take to school
  e The 10 most-watched television programs in a week
  f The occupations of a group of 30-year-olds
  g The numbers of subjects offered to VCE students at various schools
  h Life expectancies
  i Species of fish
  j Blood groups
  k Years of birth
  l Countries of birth
  m Tax brackets
2 For each of the following pairs of variables, write down which is independent and
  which is dependent. If it is not possible to identify this, then write ‘not appropriate’.
  a The age of an AFL footballer and his annual salary
  b The weight of a businessman and the number of business lunches he attends each week
  c The growth of a plant and the amount of fertiliser it receives
  d The number of books read in a week and the eye colour of the readers
  e The voting intentions of a woman and her weekly consumption of red meat
  f The number of members in a household and the size of their house
3 multiple choice
  An example of a numerical variable is:
  A attitude to 4-yearly elections (for or against)
  B year level of students
  C the total attendance at Carlton football matches
  D position in a queue at the pie stall
  E television channel numbers shown on a dial
4 multiple choice
  In a study on mice, the dependent variable was the time (in days) for which the mice
  remained alive. The independent variable would most likely have been:
  A the weight of the mice
  B the amount of food eaten each day by the mice
  C the daily dosage of an experimental drug given to the mice
  D the number of mice
  E the sex of the mice




Back-to-back stem plots
            In chapter 1, we saw how to construct a stem plot for a set of univariate data. We can
            also extend a stem plot so that it displays bivariate data. Specifically, we shall create a
            stem plot that displays the relationship between a numerical variable and a categorical
            variable. We shall limit ourselves in this section to categorical variables with just two
            categories, for example sex. The two categories are used to provide two sets of leaves,
            placed back to back on either side of the stem.
            A back-to-back stem plot is used to display bivariate data, involving a numerical
            variable and a categorical variable with 2 categories.


     WORKED Example 1
 The girls and boys in Grade 4 at Kingston Primary School submitted projects on the
 Olympic Games. The marks they obtained out of 20 are given below.




  Girls’ marks       16      17      19       15      12      16      17      19         19   16
  Boys’ marks        14      15      16       13      12      13      14      13         15   14

 Display the data on a back-to-back stem plot.
 THINK                                             WRITE
 1   Identify the highest and lowest scores        Highest score = 19
     in order to decide on the stems.              Lowest score = 12
                                                   Use a stem of 1, divide into fifths.
2   Create an unordered stem plot first. Put     Key: 1|2 = 12
    the boys’ scores on the left, and the                  Leaf Stem Leaf
    girls’ scores on the right.                           Boys       Girls
                                                                  1
                                                      3 2 3 3     1  2
                                                    4 5 4 5 4     1  5
                                                              6   1  6 7 6 7 6
                                                                  1  9 9 9
3   Now order the stem plot. The scores on      Key: 1|2 = 12
    the left should increase in value from                 Leaf Stem Leaf
    right to left, while the scores on the                Boys       Girls
    right should increase in value from left          3 3 3 2     1  2
    to right.                                       5 5 4 4 4     1  5
                                                              6   1  6 6 6 7 7
                                                                  1  9 9 9
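
The ordering step can also be sketched in code. The helper below is a hypothetical illustration (the name `back_to_back` and its `split` parameter are assumptions, not part of the text); it assumes whole-number data and splits each stem into fifths, as in the worked example.

```python
def back_to_back(left, right, split=5):
    """Return rows (left_leaves, stem, right_leaves) of an ordered
    back-to-back stem plot for whole-number data.

    Each stem (the tens part) is split into `split` equal rows;
    split=5 gives rows for leaves 0-1, 2-3, 4-5, 6-7 and 8-9.
    Left leaves are listed so that they increase from right to left.
    """
    width = 10 // split                       # leaves per row (2 for fifths)

    def row_key(x):
        return (x // 10, (x % 10) // width)   # (stem, which part of the stem)

    keys = sorted({row_key(x) for x in left + right})
    (stem, part), last = keys[0], keys[-1]
    rows = []
    while (stem, part) <= last:
        l = sorted((x % 10 for x in left if row_key(x) == (stem, part)),
                   reverse=True)
        r = sorted(x % 10 for x in right if row_key(x) == (stem, part))
        rows.append((l, stem, r))
        part += 1
        if part == split:                     # move on to the next stem
            stem, part = stem + 1, 0
    return rows


girls = [16, 17, 19, 15, 12, 16, 17, 19, 19, 16]
boys = [14, 15, 16, 13, 12, 13, 14, 13, 15, 14]

rows = back_to_back(boys, girls)              # boys on the left, girls on the right
for l, stem, r in rows:
    print(f"{' '.join(map(str, l)):>12} | {stem} | {' '.join(map(str, r))}")
```

Rows that would be empty on both sides at the top or bottom of the plot are not generated, which matches the ordered plot in step 3.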


           The back-to-back stem plot allows us to make some visual comparisons of the two
           distributions. In the above example the centre of the distribution for the girls is higher
           than the centre of the distribution for the boys. The spread of each of the distributions
           seems to be about the same. For the boys, the marks are grouped around 12–15;
           for the girls, they are grouped around 16–19. On the whole, we can
           conclude that the girls obtained better marks than the boys did.
              To get a more precise picture of the centre and spread of each of the distributions we
           can use the summary statistics discussed in chapter 1. Specifically, we are interested in:
           1. the mean and the median (to measure the centre of the distributions), and
           2. the interquartile range and the standard deviation (to measure the spread of the
              distributions).
              We saw in chapter 1 that the calculation of these summary statistics is very
           straightforward and rapid using a graphics calculator.


    WORKED Example 2
            The number of ‘how to vote’ cards handed out by various Australian Labor
            Party and Liberal Party volunteers during the course of a polling day is shown
            below.

 Labor          180     233      246     252     263      270     229      238     226     211
                193     202      210     222     257      247     234      226     214     204
 Liberal        204     215      226     253     263      272     285      245     267     275
                287     273      266     233     244      250     261      272     280     279
Display the data using a back-to-back stem plot and use this, together with summary
statistics, to compare the distributions of the number of cards handed out by the Labor
and Liberal volunteers.




 THINK                                             WRITE
 1 Construct the stem plot.                        Key: 18|0 = 180
                                                                Leaf Stem Leaf
                                                              Labor            Liberal
                                                                    0 18
                                                                    3 19
                                                                 4 2 20 4
                                                              4 1 0 21 5
                                                           9 6 6 2 22 6
                                                              8 4 3 23 3
                                                                 7 6 24 4 5
                                                                 7 2 25 0 3
                                                                    3 26 1 3 6 7
                                                                    0 27 2 2 3 5 9
                                                                         28 0 5 7
 2   Use a graphics calculator to calculate        For the Labor volunteers:
     the summary statistics: the mean, the             Mean = 227.9
     median, the standard deviation and the            Median = 227.5
     interquartile range. Enter each set of            Interquartile range = 36
     data as a separate list. (See chapter 1 on        Standard deviation = 23.9
     how to use your graphics calculator to        For the Liberal volunteers:
     calculate these values.)                          Mean = 257.5
                                                       Median = 264.5
                                                       Interquartile range = 29.5
                                                       Standard deviation = 23.4
 3   Comment on the relationship.                  From the stem plot we see that the Labor distribution
                                                   is symmetric and therefore the mean and the median
                                                   are very close, whereas the Liberal distribution is
                                                   negatively skewed.
                                                      Since the Liberal distribution is skewed, the median is a
                                                   better indicator of the centre of the distribution than
                                                   is the mean.
                                                      Comparing the medians therefore, we have the
                                                   median number of cards handed out for Labor at 228
                                                   and for Liberal at 265, which is a big difference.
                                                      The standard deviations were similar as were the
                                                   interquartile ranges. There was not a lot of difference
                                                   in the spread of the data.
                                                      In essence, the Liberal Party volunteers handed out
                                                   a lot more ‘how to vote’ cards than the Labor Party
                                                   volunteers did.
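
The summary statistics in step 2 can be checked without a graphics calculator. The sketch below rests on two assumptions that the text does not state: the quartiles are taken as the medians of the lower and upper halves of the sorted data, and the standard deviation is the sample standard deviation (dividing by n − 1). The function names `iqr` and `summary` are illustrative.

```python
from statistics import mean, median, stdev

labor = [180, 233, 246, 252, 263, 270, 229, 238, 226, 211,
         193, 202, 210, 222, 257, 247, 234, 226, 214, 204]
liberal = [204, 215, 226, 253, 263, 272, 285, 245, 267, 275,
           287, 273, 266, 233, 244, 250, 261, 272, 280, 279]


def iqr(data):
    """Interquartile range, with Q1 and Q3 taken as the medians of the
    lower and upper halves of the sorted data (the middle value is left
    out of both halves when the number of values is odd)."""
    s = sorted(data)
    half = len(s) // 2
    return median(s[-half:]) - median(s[:half])


def summary(data):
    return {
        "mean": mean(data),
        "median": median(data),
        "iqr": iqr(data),
        "sd": stdev(data),    # sample standard deviation (n - 1)
    }


labor_stats = summary(labor)
liberal_stats = summary(liberal)
for name, stats in (("Labor", labor_stats), ("Liberal", liberal_stats)):
    print(name, ", ".join(f"{k} = {v:.2f}" for k, v in stats.items()))
```

Under these assumptions the results agree with the values quoted in the worked example once they are rounded to one decimal place. A different quartile convention would change the interquartile ranges slightly.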



           remember
               1. A back-to-back stem plot displays bivariate data involving a numerical variable
                  and a categorical variable with two categories.
               2. In the ordered stem plot, the scores on the left side of the stem increase in value
                  from right to left.
               3. Together with summary statistics, back-to-back stem plots can be used for
                  comparing two distributions.

                            2B             Back-to-back stem plots

          1 The marks (out of 50) obtained for the end-of-term test by the students in German and
            French classes are given below. Display the data on a back-to-back stem plot.
            (See Worked Example 1.)

German 20 38 45 21 30 39 41 22 27 33 30 21 25 32 37 42 26 31 25 37

French    23 25 36 46 44 39 38 24 25 42 38 34 28 31 44 30 35 48 43 34

          2 The birth masses of 10 boys and 10 girls (in kilograms, to the nearest 100 grams) are
            recorded in the table below. Display the data on a back-to-back stem plot.
              Boys        3.4    5.0        4.2        3.7         4.9    3.4         3.8      4.8    3.6        4.3

              Girls       3.0    2.7        3.7        3.3         4.0    3.1         2.6      3.2    3.6        3.1

          3 The number of delivery trucks making deliveries to a supermarket each day over a
            2-week period was recorded for two neighbouring supermarkets: supermarket A and
            supermarket B. The data are shown below. (See Worked Example 2.)

              A      11    15    20     25        12    16     21        27     16        17   17    22     23    24

              B      10    15    20     25        30    35     16        31     32        21   23    26     28    29

            a Display the data on a back-to-back stem plot.
            b Use the stem plot, together with some summary statistics, to compare the
              distributions of the number of trucks delivering to supermarkets A and B.

          4 The marks out of 20 for males and females on a science test for a Year-10 class are
            given below.
              Females                  12         13          14         14          15        15     16         17

              Males                    10         12          13         14          14        15     17         19

            a Display the data on a back-to-back stem plot.
            b Use the stem plot, together with some summary statistics, to compare the
              distributions of the marks of the males and the females.

          5 The end-of-year English marks for 10 students in an English class were compared over
            2 years. The marks for 1998 and for the same students in 1999 are shown below.
              1998          30        31      35         37         39        41          41    42        43     46

              1999          22        26      27         28         30        31          31    33        34     36

            a Display the data on a back-to-back stem plot.
            b Use the stem plot, together with some summary statistics, to compare the
              distributions of the marks obtained by the students in 1998 and 1999.



      6 The age and sex of a group of people attending a fitness class are recorded below.
          Female                23       24        25        26    27      28        30        31
          Male                  22       25        30        31    36      37        42        46
         a Display the data on a back-to-back stem plot.
         b Use the stem plot, together with some summary statistics, to compare the
           distributions of the ages of the female and male members of the fitness class.




      7 The scores on a board game are recorded for a group of kindergarten children and for a
        group of children in a preparatory school.
          Kindergarten           3     13     14        25    28   32    36     41        47   50
          Prep. School           5     12     17        25    27   32    35     44        46   52
         a Display the data on a back-to-back stem plot.
         b Use the stem plot, together with some summary statistics, to compare the distributions
           of the scores of the kindergarten children and the preparatory school children.
      8 multiple choice
        The pair of variables that could be displayed on a back-to-back stem plot is:
        A the height of a student and the number of people in the student’s household
        B the time put into completing an assignment and a pass or fail score on the assignment
        C the weight of a businessman and his age
        D the religion of an adult and the person’s head circumference
        E the income bracket of an employee and the time the employee has worked for the
           company
      9 multiple choice
        A back-to-back stem plot is a useful way of displaying the relationship between:
        A the proximity to markets (km) and the cost of fresh foods on average per kilogram
        B height and head circumference
        C age and attitude to gambling (for or against)
        D weight and age
        E the money spent during a day of shopping and the number of shops visited on that day
Parallel boxplots
           We saw in the previous section that we could display relationships between a numerical
           variable and a categorical variable with just two categories, using a back-to-back stem plot.
              When we want to display a relationship between a numerical variable and a
           categorical variable with more than two categories, a parallel boxplot can be used.
              A parallel boxplot is obtained by constructing individual boxplots for each
           distribution, using a common scale.
              Construction of individual boxplots was discussed in detail in chapter 1 on univariate
           data. In this section we concentrate on comparing distributions represented by a
           number of boxplots (that is, on the interpretation of parallel boxplots).

   WORKED Example 3
           The four Year-7 classes at Western Secondary College complete the same end-of-
           year maths test. The marks, expressed as percentages for each of the students in
           the four classes, are given below.
    7A        7B         7C         7D                     7A          7B         7C           7D
    40         60        50          40                    69          78         70           69
    43         62        51          42                    63          82         72           73
    45         63        53          43                    63          85         73           74
    47         64        55          45                    68          87         74           75
    50         70        57          50                    70          89         76           80
    52         73        60          53                    75          90         80           81
    53         74        63          55                    80          92         82           82
    54         76        65          59                    85          95         82           83
    57         77        67          60                    89          97         85           84
    60         77        69          61                    90          97         89           90
 Display the data using a parallel boxplot and use this to describe any similarities or
 differences in the distributions of the marks between the four classes.
 THINK                                         WRITE/DISPLAY
  1 Create the first boxplot (for class 7A)
     on a graphics calculator using 2nd
     [STAT PLOT] and appropriate WINDOW
     settings. Using TRACE to show key
     values, sketch the first boxplot using
     pen and paper, leaving room for three
     additional plots.


                                                  [Figures: graphics calculator screens for the class 7A boxplot]


 2   Repeat step 1 for the other three classes. All four boxplots share a common scale.

     [Figure: parallel boxplots for classes 7A, 7B, 7C and 7D on a common axis,
     Maths mark (%), marked from 30 to 100]
 3   Describe the similarities and              Class 7B had the highest median mark and the
     differences between the four               range of the distribution was only 37. The
     distributions.                             lowest mark in 7B was 60.
                                                   We notice that the median of 7A’s marks is
                                                approximately 60. So, 50% of students in 7A
                                                received less than 60. This means that half of
                                                7A had scores that were less than the lowest
                                                score in 7B.
                                                   The range of marks in 7A was about the
                                                same as that of 7D with the highest scores in
                                                each about equal, and the lowest scores in each
                                                about equal. However, the median mark in 7D
                                                was higher than the median mark in 7A so,
                                                despite a similar range, students in 7D tended
                                                to receive higher marks than those in 7A.
                                                   While 7D had a top score that was higher
                                                than that of 7C, the median score in 7C was
                                                higher than that of 7D and the bottom 25% of
                                                scores in 7D were less than the lowest score in
                                                7C. In summary, 7B did best, followed by 7C
                                                then 7D and finally 7A.
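
Each boxplot is drawn from a five-number summary, so the comparison above can be checked numerically. The sketch below assumes, as one common convention, that the quartiles are the medians of the lower and upper halves of the sorted data; a graphics calculator may use a slightly different rule. The function name `five_number_summary` is illustrative.

```python
from statistics import median

marks = {
    "7A": [40, 43, 45, 47, 50, 52, 53, 54, 57, 60,
           69, 63, 63, 68, 70, 75, 80, 85, 89, 90],
    "7B": [60, 62, 63, 64, 70, 73, 74, 76, 77, 77,
           78, 82, 85, 87, 89, 90, 92, 95, 97, 97],
    "7C": [50, 51, 53, 55, 57, 60, 63, 65, 67, 69,
           70, 72, 73, 74, 76, 80, 82, 82, 85, 89],
    "7D": [40, 42, 43, 45, 50, 53, 55, 59, 60, 61,
           69, 73, 74, 75, 80, 81, 82, 83, 84, 90],
}


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of values."""
    s = sorted(data)
    half = len(s) // 2
    return (s[0], median(s[:half]), median(s), median(s[-half:]), s[-1])


summaries = {cls: five_number_summary(vals) for cls, vals in marks.items()}
for cls, (low, q1, med, q3, high) in summaries.items():
    print(f"{cls}: min={low} Q1={q1} median={med} Q3={q3} "
          f"max={high} range={high - low}")

# Classes ordered from highest to lowest median mark.
ranking = sorted(summaries, key=lambda cls: summaries[cls][2], reverse=True)
```

Ordering the classes by median gives 7B, 7C, 7D, 7A, which is the ranking reached from the boxplots.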




           remember
              1. A relationship between a numerical variable and a categorical variable with
                 more than two categories can be displayed using a parallel boxplot.
              2. A parallel boxplot is obtained by constructing individual boxplots for each
                 distribution, using a common scale.

                         2C          Parallel boxplots
          1 The heights (in cm) of students in 9A, 10A and 11A were recorded and
            are shown in the table below. (See Worked Example 3.)
              9A       10A    11A             9A     10A     11A              9A     10A      11A
              120      140    151            146     153     164             158     168      175
              126      143    153            147     156     166             160     170      180
              131      146    154            150     162     167             162     173      187
              138      147    158            156     164     169             164     175      189
              140      149    160            157     165     169             165     176      193
              143      151    163            158     167     172             170     180      199
            a Construct a parallel boxplot to show the data.
            b Use the boxplot to compare the distributions of height for the 3 classes.
          2 The amounts of money contributed annually to superannuation schemes by people in
            3 different age groups are shown below.
               20–29         30–39      40–49                    20–29       30–39         40–49
               2000          4000      10 000                    6500         7000         13 700
               3100          5200      11 200                    6700         8000         13 900
               5000          6000      12 000                    7000         9000         14 000
               5500          6300      13 300                    9200       10 300         14 300
               6200          6800      13 500                  10 000       12 000         15 000
            a Construct a parallel boxplot to show the data.
            b Use the boxplot to comment on the distributions.



                       3 The numbers of jars of vitamin A, B, C and multi-vitamins sold per week by a local
                         chemist are shown below.

                           Vitamin A         5       6      7       7      8       8       9      11     13      14
                           Vitamin B         10     10      11     12      14     15      15     15      17      19
                           Vitamin C         8       8      9       9      9      10      11      12     12      13
                           Multi-vitamins    12     13      13     15      16     16      17     19      19      20
                          Construct a parallel boxplot to display the data and use it to compare the distributions
                          of sales for the 4 types of vitamin.




                       4 multiple choice
                         The ages of the employees at 5 different companies of the same size are compared
                         using the parallel boxplots shown below.



                         [Figure: parallel boxplots of employee ages for Company A to Company E,
                         drawn on a common axis marked from 20 to 60]




                          For each of the following, select from:
                          A company A                  B company B                  C company C
                          D company D                  E company E

                             a Which company has the greatest range of ages?
                             b Which company has the greatest interquartile range of ages?
                             c Which company has the lowest median age?
                             d Which company has the greatest range of ages among its oldest 25% of employees?
The two-way frequency table
             When we are examining the relationship between two categorical variables, the two-
             way frequency table is an excellent tool.
               Consider the following example.

     WORKED Example 4
At a local shopping centre, 34 females and 23 males were asked which of the two major
political parties they preferred. Eighteen females and 12 males preferred Labor. Display
these data in a two-way table.

THINK / WRITE

 1   THINK: Draw a table. Record the respondent’s sex in the columns and party
     preference in the rows of the table.
     WRITE:
     Party preference    Female    Male    Total
     Labor
     Liberal
     Total

 2   THINK: (a) We know that 34 females and 23 males were asked. Put this
     information into the table and fill in the total.
     (b) We also know that 18 females and 12 males preferred Labor. Put this
     information in the table and find the total of people who preferred Labor.
     WRITE:
     Party preference    Female    Male    Total
     Labor                 18       12       30
     Liberal
     Total                 34       23       57

 3   THINK: Fill in the remaining cells. For example, to find the number of
     females who preferred the Liberals, subtract the number of females
     preferring Labor from the total number of females asked: 34 − 18 = 16.
     WRITE:
     Party preference    Female    Male    Total
     Labor                 18       12       30
     Liberal               16       11       27
     Total                 34       23       57



             In the above example we have a very clear breakdown of data. We know how many
             females preferred Labor, how many females preferred the Liberals, how many males
             preferred Labor and how many males preferred the Liberals.
                If we wish to compare the number of females who prefer Labor with the number of
             males who prefer Labor, we must be careful. While 12 males preferred Labor compared
             to 18 females, there were, of course, fewer males than females being asked. That is,
             only 23 males were asked for their opinion, compared to 34 females.
                To overcome this problem, we can express the figures in the table as percentages.
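The table-building steps of Worked Example 4 can also be sketched in Python. This is only an illustration, not part of the original method: the variable names and the helper `with_total` are invented here, and the counts are the ones stated in the example.

```python
# Counts stated in Worked Example 4 (names are illustrative only).
asked = {"Female": 34, "Male": 23}   # column totals: people asked
labor = {"Female": 18, "Male": 12}   # people who preferred Labor

# Remaining cells by subtraction, e.g. 34 - 18 = 16 females preferring Liberal.
liberal = {sex: asked[sex] - labor[sex] for sex in asked}


def with_total(row):
    """Return a copy of the row with its row total appended."""
    return {**row, "Total": sum(row.values())}


table = {
    "Labor": with_total(labor),
    "Liberal": with_total(liberal),
    "Total": with_total(asked),
}

# Print the completed two-way frequency table.
print(f"{'Party':<8}{'Female':>8}{'Male':>8}{'Total':>8}")
for name, row in table.items():
    print(f"{name:<8}{row['Female']:>8}{row['Male']:>8}{row['Total']:>8}")
```

The only arithmetic involved is the subtraction in step 3 and the row totals, so the completed table should match the one in the worked example.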

     WORKED Example 5
 Fifty-seven people in a local shopping centre were asked whether they preferred
 the Australian Labor Party or the Liberal Party. The results are given below.
 Convert the numbers in this table to percentages.

     Party preference    Female    Male    Total
     Labor                 18       12       30
     Liberal               16       11       27
     Total                 34       23       57

 THINK / WRITE

 1   THINK: Draw the table, omitting the ‘total’ column.
     WRITE:
     Party preference    Female    Male
     Labor
     Liberal
     Total

 2   THINK: Fill in the table by expressing the number in each cell as a
     percentage of its column’s total. For example, to obtain the percentage of
     males who prefer Labor, we divide the number of males who prefer Labor by
     the total number of males and multiply by 100%:
     12/23 × 100% = 52.2% (correct to 1 decimal place)
     WRITE:
     Party preference    Female    Male
     Labor                52.9     52.2
     Liberal              47.1     47.8
     Total               100.0    100.0



                 We could have calculated percentages from the table rows, rather than columns. To do
                 that we would, for example, have divided the number of females who preferred Labor
                 (18) by the total number of people who preferred Labor (30) and so on. The table below
                 shows this:
                                    Party preference      Female         Male    Total

                                    Labor                  60.0          40.0    100

                                    Liberal                59.3          40.7    100

                    By doing this we have obtained the percentage of Labor supporters who were
                 female (60%), and the percentage of Labor supporters who were male (40%), and
                 so on. This highlights facts different from those shown in the previous
                 table. In other words, different results can be obtained by calculating percentages from
                 a table in different ways.
                    As a general rule, when the independent variable (in this case the respondent’s sex)
                 is placed in the columns of the table, then the percentages should be calculated in
                 columns.
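To see how the two ways of calculating percentages differ, here is a minimal Python sketch using the counts from Worked Example 5. The function names `column_percentages` and `row_percentages` are assumptions made for this illustration; they do not come from the text.

```python
# Counts from Worked Example 5 (totals are recomputed, not stored).
counts = {
    "Labor": {"Female": 18, "Male": 12},
    "Liberal": {"Female": 16, "Male": 11},
}


def column_percentages(table):
    """Express each cell as a percentage of its column total (1 d.p.)."""
    columns = list(next(iter(table.values())).keys())
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        name: {c: round(100 * row[c] / col_totals[c], 1) for c in columns}
        for name, row in table.items()
    }


def row_percentages(table):
    """Express each cell as a percentage of its row total (1 d.p.)."""
    return {
        name: {c: round(100 * v / sum(row.values()), 1) for c, v in row.items()}
        for name, row in table.items()
    }


by_column = column_percentages(counts)   # sex as the independent variable
by_row = row_percentages(counts)         # breakdown within each party

print(by_column)
print(by_row)
```

With sex in the columns, `by_column` answers "what percentage of females (or males) preferred each party?", which is the comparison the general rule recommends. `by_row` instead describes the make-up of each party's supporters.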
    WORKED Example 6
Sixty-seven primary and 47 secondary school students were asked their attitude to the
number of school holidays which should be given. They were asked whether there should
be more, fewer or the same number. Five primary students and 2 secondary students
wanted fewer holidays, 29 primary and 9 secondary students thought they had enough
holidays (that is, they chose the same number) and the rest thought they needed to be
given more holidays.
  Present these data in percentage form in a two-way frequency table and use it to
compare the opinions of the primary and the secondary students.
THINK / WRITE

1   THINK: Put the data in a table. First fill in the given information, then
    find the missing information by subtracting the appropriate numbers from
    the totals.
    WRITE:
    Attitude    Primary    Secondary    Total
    Fewer           5          2           7
    Same           29          9          38
    More           33         36          69
    Total          67         47         114

2   THINK: Calculate the percentages. Since the independent variable (the level
    of the student, Primary or Secondary) has been placed in the columns of the
    table, we calculate the percentages in columns. For example, to obtain the
    percentage of primary students who wanted fewer holidays, divide the number
    of such students by the total number of primary students and multiply by
    100%. That is, 5/67 × 100% = 7.5%.
    WRITE:
    Attitude    Primary    Secondary
    Fewer          7.5        4.3
    Same          43.3       19.1
    More          49.2       76.6
    Total        100.0      100.0

3   THINK: Comment on the results.
    WRITE: Secondary students were much keener on having more holidays than
    were primary students.




            remember
              1. The two-way frequency table is an excellent tool for examining the relationship
                 between two categorical variables.
              2. If the total number of scores in each of the two categories is unequal,
                 percentages should be calculated in order to be able to analyse the table
                 properly. When the independent variable is placed in the columns of the table,
                 the percentages should be calculated in columns. That is, the numbers in each
                 column should be expressed as a percentage of that column’s total.

                                                        2D            The two-way frequency table

                         WORKED Example 4
                                    1 In a survey, 139 women and 102 men were asked whether they approved or disapproved
                                      of a proposed freeway. Thirty-seven women and 79 men approved of the freeway.
                                      Display these data in a two-way table (not as percentages).
                                    2 Students at a secondary school were asked whether the length of lessons should be
                                      45 minutes or 1 hour. Ninety-three senior students (Years 10–12) were asked and 60
                                      preferred 1-hour lessons, whereas of the 86 junior students (Years 7–9), 36 preferred
                                      1-hour periods. Display these data in a two-way table (not as percentages).
                                    3 For each of the following two-way frequency tables, complete the entries.
                                       a      Attitude       Female      Male       Total

                                              For               25             i     47

                                              Against           ii         iii       iv

                                              Total             51         v         92

                                       b      Attitude       Female      Male       Total

                                              For               i          ii        21

                                              Against           iii       21         iv

                                              Total             v         30         63

                                       c      Party preference          Female      Male

                                              Labor                        i        42%

                                              Liberal                    53%         ii

                                              Total                       iii        iv

                         WORKED Example 5
                                    4 Sixty single men and women were asked whether they prefer to live alone, or to share
                                      accommodation with friends. The results are shown below.

                                           Rent preference              Men        Women    Total

                                           Live alone                    12         23       35

                                           Share with friends             9         16       25
                                           Total                         21         39       60

                                       Convert the numbers in this table to percentages.

                                                      The information in the following
                                                      two-way frequency table relates to
                                                      questions 5 and 6. The data show the
                                                      reactions of administrative staff and
                                                      technical staff to an upgrade of the
                                                      computer systems at a large
                                                      corporation.



                                         Attitude    Administrative staff    Technical staff    Total
                                         For                  53                   98            151
                                         Against              37                   31             68
                                         Total                90                  129            219




          5 multiple choice
            From the above table, we can conclude that:
            A 53% of administrative staff were for the upgrade
            B 37% of administrative staff were for the upgrade
            C 37% of administrative staff were against the upgrade
            D 59% of administrative staff were for the upgrade
            E 54% of administrative staff were against the upgrade
          6 multiple choice
            From the above table, we can conclude that:
            A 98% of technical staff were for the upgrade
            B 65% of technical staff were for the upgrade
            C 76% of technical staff were for the upgrade
            D 31% of technical staff were against the upgrade
            E 14% of technical staff were against the upgrade
WORKED Example 6
          7 Delegates at the respective Liberal Party and Australian Labor Party conferences were
            surveyed on whether or not they believed that marijuana should be legalised. Sixty-two
            Liberal delegates were surveyed and 40 were against legalisation. Seventy-one Labor
            delegates were surveyed and 43 were against legalisation.
               Present the data in percentage form in a two-way frequency table. Comment on any
            differences between the reactions of the Liberal and Labor delegates.
          8 Sixty-one union workers were surveyed and asked whether the number of public
            holidays should be reduced. Thirty-five supported a reduction. Fifty-nine non-union
            workers were also asked and 31 supported a reduction.
               Present the data in percentage form in a two-way frequency table. Comment on any
            difference between the reactions of the union and non-union workers.

The scatterplot
         We often want to know if there is some sort of relationship between two numerical
         variables. A scatterplot, which gives a visual display of the relationship between two
         variables, provides a good starting point.
            Consider the data obtained from last year’s 12B class at Northbank Secondary
         College. Each student in this class of 29 students was asked to give an estimate of the
         average number of hours of study per week they did during Year 12. They were also
         asked the TER score they obtained.

          Average                            Average                               Average
           hours        TER                   hours       TER                       hours     TER
          of study      score                of study     score                    of study   score
             18          59                     14          54                       17        59
             16          67                     17          72                       16        76
             22          74                     14          63                       14        59
             27          90                     19          72                       29        89
             15          62                     20          58                       30        93
             28          89                     10          47                       30        96
             18          71                     28          85                       23        82
             19          60                     25          75                       26        35
             22          84                     18          63                       22        78
             30          98                     19          61

            The figure below shows the data plotted on a scatterplot.

            [Figure: scatterplot of TER score (vertical axis, 40 to 100) against average number of
            hours of study per week (horizontal axis, 10 to 30); the point (26, 35) is labelled]

            It is reasonable to think that the number of hours of study put in each week by
         students would affect their TER scores and so the number of hours of study per week is
         the independent variable and appears on the horizontal axis. The TER score is the
         dependent variable and appears on the vertical axis.
            There are 29 points on the scatterplot. Each point represents the hours studied and
         the TER score of one student.
            In analysing the scatterplot we look for a pattern in the way the points lie. Certain
         patterns tell us that certain relationships exist between the two variables. This is
         referred to as correlation. We look at what type of correlation exists and how strong it is.
            In the figure above right we see some sort of pattern: the points are spread in a rough
         corridor from bottom left to top right. We refer to data following such a direction as
         having a positive relationship. This tells us that as the average number of hours studied
         per week increases, the TER score increases.
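For readers who want to reproduce the picture without graph paper, the following Python sketch draws a rough character-grid version of the scatterplot from the table above. The function `text_scatterplot` and its `y_step` parameter are invented for this illustration, and the layout (one text column per hour, one text row per 5 TER points) is an arbitrary choice that assumes whole-number x values.

```python
# (hours of study, TER score) pairs copied from the table above.
data = [
    (18, 59), (16, 67), (22, 74), (27, 90), (15, 62), (28, 89), (18, 71),
    (19, 60), (22, 84), (30, 98), (14, 54), (17, 72), (14, 63), (19, 72),
    (20, 58), (10, 47), (28, 85), (25, 75), (18, 63), (19, 61), (17, 59),
    (16, 76), (14, 59), (29, 89), (30, 93), (30, 96), (23, 82), (26, 35),
    (22, 78),
]


def text_scatterplot(points, y_step=5):
    """Return a rough scatterplot as a list of text lines.

    One text column per unit of x (x values must be integers) and one
    text row per `y_step` units of y; each row is labelled with the
    lowest y value it covers.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    row_min, row_max = min(ys) // y_step, max(ys) // y_step
    lines = []
    for r in range(row_max, row_min - 1, -1):  # top row first
        cells = [" "] * (x_max - x_min + 1)
        for x, y in points:
            if y // y_step == r:
                cells[x - x_min] = "*"
        lines.append(f"{r * y_step:>3} |" + "".join(cells))
    return lines


plot = text_scatterplot(data)
print("\n".join(plot))
```

In the printed grid the stars should drift from the bottom left towards the top right, with one isolated star on the lowest row for the student at (26, 35).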
   The point (26, 35) is an outlier. It stands out because it is well away from the other
points and clearly is not part of the ‘corridor’ referred to above. This outlier may have
occurred because a student worked very hard but found the VCE pretty tough or perhaps the
student exaggerated the number of hours he or she worked in a week or perhaps there was a
recording error. This needs to be checked.
   We could describe the rest of the data as having a linear form as the straight line in
the diagram below indicates.

[Figure: the same scatterplot of TER score against average number of hours of study per
week, with a straight line drawn through the corridor of points]

   When describing the relationship between two variables displayed on a scatterplot, we
need to comment on:
    (a) the direction — whether it is positive or negative
    (b) the form — whether it is linear or non-linear
    (c) the strength — whether it is strong, moderate or weak.
   Below is a gallery of scatterplots showing the various patterns we look for.




       Weak, positive                 Moderate, positive                    Strong, positive
     linear relationship              linear relationship                 linear relationship




       Weak, negative                Moderate, negative                    Strong, negative
     linear relationship             linear relationship                  linear relationship




      Perfect, negative                No relationship                     Perfect, positive
     linear relationship                                                  linear relationship

   WORKED Example 7
 The scatterplot below shows the number of hours people spend at work each week and the
 number of hours people get to spend on recreational activities during the week.

 [Figure: scatterplot of hours for recreation (vertical axis, 5 to 25) against hours
 worked (horizontal axis, 10 to 70)]

    Decide whether or not a relationship exists between the variables and, if it does,
 comment on whether it is positive or negative; weak, moderate or strong; and whether or
 not it has a linear form.

 THINK
 (a) The points on the scatterplot are spread in a certain pattern, namely in a rough
     corridor from the top left to the bottom right corner. This tells us that as the
     work hours increase, the recreation hours decrease.
 (b) The corridor is straight (that is, it would be reasonable to fit a straight line
     into it).
 (c) The points are not too tight and not too dispersed either.
 (d) The pattern resembles the central diagram in the gallery of scatterplots shown
     previously.

 WRITE
 There is a moderate, negative linear relationship between the two variables.



   WORKED Example 8
             Data giving the average weekly number of hours studied by each student in 12B
             at Northbank Secondary College and the corresponding height of each student
             (to the nearest tenth of a metre) are given in the table below.

 Average                  Average                  Average                                   Average
  hours                    hours                    hours                                     hours
    of   Height              of   Height              of   Height                               of   Height
  study   (m)              study   (m)              study   (m)                               study   (m)
     18         1.5             19     2.0            20        1.9                            16     1.6
     16         1.9             22     1.9            10        1.9                            14     1.9
     22         1.7             30     1.6            28        1.5                            29     1.7
     27         2.0             14     1.5            25        1.7                            30     1.8
     15         1.9             17     1.7            18        1.8                            30     1.5
     28         1.8             14     1.8            19        1.8                            23     1.5
     18         2.1             19     1.7            17        2.1                            22     2.1
Construct a scatterplot for the data and use it to comment on the direction, form and
strength of any relationship between the number of hours studied and the height of the
students.
THINK / WRITE/DISPLAY

1   THINK: Construct the scatterplot. In this case it is almost impossible to decide
    which is the independent variable and which is the dependent variable, and
    therefore on which axis we will place the variables. In such cases, placing either
    variable on either axis is reasonable.

2   THINK: The scatterplot can be constructed using a graphics calculator:
    (a) Press Y= and CLEAR any functions.
    (b) Press 2nd [STAT PLOT] and select 4:PlotsOff. Press ENTER.
    (c) Press STAT and select 1:Edit. Press ENTER.
    (d) Clear any existing lists and enter the list of hours of study in L1 and the
        list of heights in L2.
    (e) Press 2nd [STAT PLOT] and select 1:Plot 1.
    (f) Press ENTER to turn the plot ON, and select the first icon which indicates a
        scatterplot.
    (g) For Xlist, select L1 and for Ylist select L2 and select the first symbol in
        Mark.
    (h) Press ZOOM and select 9:ZoomStat.
    (i) Press ENTER to see the scatterplot.
    DISPLAY:
    [Figure: scatterplot of height (m) (vertical axis, 1.4 to 2.2) against average
    number of hours studied each week (horizontal axis, 10 to 30)]

3   THINK: Comment on the direction of any relationship.
    WRITE: There is no relationship; the points appear to be randomly placed.

4   THINK: Comment on the form of the relationship.
    WRITE: There is no form, no linear trend, no quadratic trend, just a random
    placement of points.

5   THINK: Comment on the strength of any relationship.
    WRITE: Since there is no relationship, strength is not relevant.




           Clearly, the number of hours you study for your VCE has no effect on how tall you
           might be!
             Note that when working with the scatterplot, to change settings at any time use
           WINDOW. To identify the coordinates of individual points, use the TRACE key with
           the arrow keys.
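Readers without a graphics calculator can mimic the list-entry, window and trace steps of Worked Example 8 in Python. The sketch below is only a loose analogy: the names `L1`, `L2`, `window` and `trace` are chosen to echo the calculator, and the window is simply the smallest one containing every point, which is an assumption rather than a description of what ZoomStat actually does.

```python
# Lists named after the calculator's L1 and L2 (hours of study, heights in
# metres), read column by column from the table in Worked Example 8.
L1 = [18, 16, 22, 27, 15, 28, 18, 19, 22, 30, 14, 17, 14, 19,
      20, 10, 28, 25, 18, 19, 17, 16, 14, 29, 30, 30, 23, 22]
L2 = [1.5, 1.9, 1.7, 2.0, 1.9, 1.8, 2.1, 2.0, 1.9, 1.6, 1.5, 1.7, 1.8, 1.7,
      1.9, 1.9, 1.5, 1.7, 1.8, 1.8, 2.1, 1.6, 1.9, 1.7, 1.8, 1.5, 1.5, 2.1]

# Pair the two lists so that each student becomes one (x, y) point.
points = list(zip(L1, L2))

# A window just large enough to show every point; any padding a real
# calculator adds around the data is not reproduced here.
window = {"Xmin": min(L1), "Xmax": max(L1), "Ymin": min(L2), "Ymax": max(L2)}


def trace(index):
    """Return the coordinates of one plotted point (a stand-in for TRACE)."""
    return points[index]


print(window)
print(trace(0))
```

Keeping the two lists in the same order matters: the pairing in `points` is only meaningful if the nth entry of each list belongs to the same student.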

         remember
           1. When we are investigating if there is any sort of relationship between two
              numerical variables, a scatterplot provides a useful starting point. It gives a
              visual display of the relationship between two such variables.
           2. In analysing the scatterplot we look for a pattern in the way the points lie.
              Certain patterns tell us that certain relationships exist between the two
              variables. This is referred to as a correlation. We look at what type of
              correlation exists and how strong it is.
           3. When describing the relationship between two variables displayed on a
              scatterplot, we need to comment on:
              (a) the direction — whether it is positive or negative
              (b) the form — whether it is linear or non-linear
              (c) the strength — whether it is strong, moderate or weak.




                       2E          The scatterplot

      Have your graphics calculator at hand for the following exercise questions.
      1 For each of the following pairs of variables, write down whether or not you would
        reasonably expect a relationship to exist between the pair and, if so, comment on
        whether it would be a positive or negative association.
        a Time spent in a supermarket and money spent
        b Income and value of car driven
        c Number of children living in a house and time spent cleaning the house
        d Age and number of hours of competitive sport played per week
        e Amount spent on petrol each week and distance travelled by car each week
        f Number of hours spent in front of a computer each week and time spent playing the
            piano each week
        g Amount spent on weekly groceries and time spent gardening each week
WORKED Example 7
          2 For each of the scatterplots below, describe whether or not a relationship exists between
            the variables and, if it does, comment on whether it is positive or negative, whether it is
            weak, moderate or strong and whether or not it has a linear form.

             [Figures: six scatterplots; the plotted points are not recoverable here]
             a Haemoglobin count against age
             b Fitness level against cigarettes smoked
             c Marks at school (%) against weekly hours of study
             d Weekly expenditure on gardening magazines ($) against hours spent gardening per week
             e Hours spent using a computer per week against hours spent cooking per week
             f Time under water (s) against age

          3 multiple choice
            From the scatterplot shown at right, it would be reasonable to observe that:
            [Figure: scatterplot of y against x]
            A as the value of x increases, the value of y increases
            B as the value of x increases, the value of y decreases
            C as the value of x increases, the value of y remains the same
            D as the value of x remains the same, the value of y increases
            E there is no relationship between x and y

          4 The population of a municipality (to the nearest hundred thousand) together with
            the number of primary schools in that particular municipality is given below for
            11 municipalities. (See Worked Example 8. EXCEL Spreadsheet: Scatterplot)

             Population (000)          110  130  130  140  150  160  170  170  180  180  190
             No. of primary schools      4    4    6    5    6    8    6    7    8    9    8

            Construct a scatterplot for the data and use it to comment on the direction, form and
            strength of any relationship between the population and the number of primary
            schools.




      5 The table below contains data giving the time taken for a paving job and the cost of the job.

         Time taken (hours)      5     7     5     8    10    13    15    20    18    25    23
         Cost of job ($)      1000  1000  1500  1200  2000  2500  2800  3200  2800  4000  3000

         Construct a scatterplot for the data. Comment on whether a relationship exists between
         the time taken and the cost. If there is a relationship, describe it.

      6 The table below shows the time of booking (how many days in advance) of the tickets
        for a musical performance and the corresponding row number in A-Reserve.

          Time of booking    Row No.
                 5             15
                 6             15
                 7             15
                 7             14
                 8             14
                11             13
                13             13
                14             12
                14             10
                17             11
                20             10
                21              8
                22              5
                24              4
                25              3
                28              2
                29              2
                29              1
                30              1
                31              1

         Construct a scatterplot for the data. Comment on whether a relationship exists between
         the time of booking and the number of the row and, if there is a relationship, describe it.
The q-correlation coefficient
          The q-correlation coefficient is a measure of the strength of the association between
          two variables. In the previous section we estimated the strength of association by
          looking at a scatterplot and forming a judgment about whether the correlation between
          the variables was positive or negative and whether the correlation was weak, moderate
          or strong. The calculation of the q-correlation coefficient aids us considerably in
          making that judgment.
            To calculate the q-correlation coefficient:
          Step 1. Draw a scatterplot of the data.
          Step 2. Locate the median of the x-values. (If there are n points, the median is located
                   at the \(\frac{n+1}{2}\)th place.) Draw a vertical line through this median value.
          Step 3. Locate the median of the y-values and draw a horizontal line through this
                   median value.
          Step 4. The scatterplot is now divided into 4 sections or quadrants (hence the name
                   ‘q’-correlation coefficient).
                   (a) Label these sections A, B, C and D.
                   (b) Count the number of points in each section.
                   (c) Do not count points which are on the lines.
                   (d) The number of points in section A is denoted by a, the number of points in
                        section B is denoted by b, and so on.
                   [Figure: x- and y-axes divided into quadrants, with B upper left, A upper right,
                   C lower left and D lower right]
          Step 5. Calculate the q-correlation coefficient using the formula:
                   \[ q = \frac{(a + c) - (b + d)}{a + b + c + d} \]
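
          The five steps above can be sketched in Python. This is an illustration only and is not part
          of the original text: the function name `q_correlation` and the sample data are made up for
          the example, and the medians are found with the standard library's `statistics.median`
          (which, for an even number of points, averages the two middle values).

```python
from statistics import median


def q_correlation(xs, ys):
    """Return the q-correlation coefficient for paired data (hypothetical helper)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_med = median(xs)  # Step 2: vertical line through the median of the x-values
    y_med = median(ys)  # Step 3: horizontal line through the median of the y-values

    a = b = c = d = 0
    for x, y in zip(xs, ys):
        if x == x_med or y == y_med:
            continue  # Step 4(c): points on either median line are not counted
        if x > x_med and y > y_med:
            a += 1  # quadrant A: upper right
        elif x < x_med and y > y_med:
            b += 1  # quadrant B: upper left
        elif x < x_med and y < y_med:
            c += 1  # quadrant C: lower left
        else:
            d += 1  # quadrant D: lower right

    total = a + b + c + d
    if total == 0:
        raise ValueError("every point lies on a median line, so q is undefined")
    return ((a + c) - (b + d)) / total  # Step 5


# Made-up data: seven points rising from lower left to upper right.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2, 1, 3, 4, 6, 5, 7]
print(q_correlation(xs, ys))  # prints 1.0
```

          Here both medians are 4, the point (4, 4) lies on the lines and is skipped, and the remaining
          six points fall in quadrants A and C, so q = 6/6 = 1.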

  WORKED Example 9
Calculate the q-correlation coefficient for the data shown in the scatterplot at right.
[Figure: scatterplot of 15 points, y against x]
THINK                                                WRITE
 1 (a) Locate the median of the x-values. Note that we are talking here about the
        x-values of the data observations given. In the scatterplot shown there are
        15 points. Each point has an x-value and a y-value. To find the median of the
        x-values we look for the horizontal middle point; that is, we look for the
        \(\frac{15 + 1}{2}\) = 8th point from the left (from the right, the point will
        be the same).
    (b) Draw a vertical line through this median value. Note that there are 7 points to
        the right of this line and 7 to the left.
        WRITE: [Figure: the scatterplot with a vertical line through the median of the x-values]

 2   (a) Locate the median of the y-values. This is done in a similar way to finding
         the median of the x-values except, instead of counting from the left or right,
         we count from the top or bottom to find the 8th point.
     (b) Draw a horizontal line through this median value. Note that there are 7
         points above this line and 7 below.
         WRITE: [Figure: the scatterplot with a horizontal line through the median of the y-values]

 3   (a) Label the quadrants A, B, C and D.
     (b) Count the number of points in each section. Do not count points that are
         on the lines.
         WRITE: [Figure: the scatterplot with quadrants B (b = 0), A (a = 6), C (c = 6) and D (d = 1)]
                a = 6, b = 0, c = 6, d = 1

 4   Write the formula for calculating the q-coefficient.
         WRITE: \[ q = \frac{(a + c) - (b + d)}{a + b + c + d} \]

 5   Substitute the values of a, b, c and d into the formula and evaluate.
         WRITE: \[ q = \frac{(6 + 6) - (0 + 1)}{6 + 0 + 6 + 1} = \frac{11}{13} = 0.85 \]
                (correct to 2 decimal places)


            The value of the q-correlation coefficient in the above example indicates a strong
            correlation. The diagram below gives a rough guide to the strength of the correlation
            suggested by the value of q.


                              Value of q          Strength of association
                              0.75 to 1           Strong positive association
                              0.5 to 0.75         Moderate positive association
                              0.25 to 0.5         Weak positive association
                              –0.25 to 0.25       No association
                              –0.5 to –0.25       Weak negative association
                              –0.75 to –0.5       Moderate negative association
                              –1 to –0.75         Strong negative association
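
            The rough guide above can be written as a small lookup. The sketch below is an illustration,
            not part of the original text: the function name `describe_association` is made up, and
            because the guide does not say which band a boundary value such as 0.5 belongs to, the
            code assumes it goes to the stronger band.

```python
def describe_association(q):
    """Describe a q value using the rough guide (hypothetical helper)."""
    if not -1 <= q <= 1:
        raise ValueError("q must lie between -1 and 1")
    size = abs(q)
    if size < 0.25:
        return "no association"
    # Assumption: a boundary value (0.25, 0.5, 0.75) counts towards the stronger band.
    if size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if q > 0 else "negative"
    return f"{strength} {direction} association"


print(describe_association(11 / 13))  # strong positive association
```

            The value 11/13 is the result of Worked Example 9, which the guide places in the strong
            positive band.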
             The scatterplots below show three special values of the q-correlation coefficient.
             [Figure: three scatterplots, each divided into quadrants B (upper left), A (upper right),
             C (lower left) and D (lower right)]
             Left:    \( q = \frac{(8 + 8) - (0 + 0)}{8 + 0 + 8 + 0} = 1 \)
             Centre:  \( q = \frac{(0 + 0) - (8 + 8)}{0 + 8 + 0 + 8} = -1 \)
             Right:   \( q = \frac{(3 + 3) - (3 + 3)}{3 + 3 + 3 + 3} = 0 \)
            The sign of the q-value indicates the direction of the relationship; that is, whether
          there is a negative or positive association.
            In the cases shown above left and centre, the q-values are at both extremes. That is,
          q = 1 and −1 respectively. We would describe the variables as showing a very strong
          association. Having said that, the points are not showing a strong linear form or, for
          that matter, any linear form.
            The q-correlation coefficient merely gives us an idea of which quadrants contain the
          most points; but beyond that, the points can be in any position in the quadrants. In that
          sense, the q-correlation coefficient is a rather blunt instrument.
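
            One way to see why q is blunt is that it can be computed from the four quadrant counts alone,
            with no coordinates at all. The sketch below is an illustration under that framing: the
            helper name `q_from_counts` is made up, and `Fraction` is used only so that the results
            print as exact values.

```python
from fractions import Fraction


def q_from_counts(a, b, c, d):
    """q computed directly from the four quadrant counts (hypothetical helper)."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("no points were counted")
    return Fraction((a + c) - (b + d), total)


# The three special cases shown above, then the counts from Worked Example 9.
print(q_from_counts(8, 0, 8, 0))  # 1
print(q_from_counts(0, 8, 0, 8))  # -1
print(q_from_counts(3, 3, 3, 3))  # 0
print(q_from_counts(6, 0, 6, 1))  # 11/13
```

            Any two scatterplots with the same counts a, b, c and d give the same q, however the points
            are arranged inside each quadrant.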

  WORKED Example 10
          An investigation was made into the relationship between the time spent
          watching television in the week preceding a Maths test and the mark obtained
          (out of 20) in that Maths test. The following data were recorded.

  Time (h)    Mark        Time (h)    Mark        Time (h)    Mark
      4        15            10         8            12        10
      5        16            20         5             5         8
      5        20             5        12            20         8
     10        12            15         4            15        10
     15         8            15        12            20        10
Draw a scatterplot and calculate the q-correlation coefficient. Comment on the
relationship between the two variables.
THINK                                                       WRITE/DISPLAY
 1 Draw a scatterplot. We can use a graphics calculator to draw the scatterplot.
    (a) On the lists screen (press STAT , select EDIT and 1:Edit), enter the two lists of
        data into L1 and L2.
    (b) Press 2nd [STAT PLOT] and select 4:PlotsOff.
    (c) Press ENTER .
    (d) Press 2nd [STAT PLOT] and select 1:Plot1.
    (e) Select On, and for Type, select the first icon (scatterplot).
    (f) For Xlist, type in L1 (use 2nd [L1]); for Ylist, type in L2; for Mark, select the
        first symbol.
    (g) Press ZOOM and select 9:ZoomStat. The display now shows the scatterplot.
        WRITE/DISPLAY: [Figure: scatterplot of Maths mark against time watching TV (hours)]
 2   We can also use the graphics
     calculator to help calculate q.
     (a) Press 2nd [QUIT] and
         return to the home screen.
     (b) Press 2nd [DRAW] and
         select 4:Vertical.
     (c) Press 2nd [LIST] .
     (d) From the MATH menu,
         select 4:median(.
     (e) Type L1 (use 2nd [L1])
         at the prompt, then ENTER ,
         and the scatterplot appears
         with the vertical median line
         drawn.
     (f) Similarly, to create the
         horizontal median line,
         press 2nd [QUIT]
         and return to the home screen.
     (g) Press 2nd [DRAW] and select
         3:Horizontal.
     (h) Press 2nd [LIST] and from the MATH
         menu, select 4:median(.
     (i) Type L2 at the prompt, press ENTER
         and the scatterplot appears with the
         horizontal median line drawn as well.
 3   Count and record the number of points in each quadrant.
         WRITE: a = 1, b = 5, c = 2, d = 4
 4   Write the formula for calculating the q-correlation coefficient.
         WRITE: \[ q = \frac{(a + c) - (b + d)}{a + b + c + d} \]
 5   Substitute the values of a, b, c and d into the formula and evaluate.
         WRITE: \[ q = \frac{(1 + 2) - (5 + 4)}{1 + 5 + 2 + 4} = -\frac{6}{12} = -0.5 \]
 6   Comment on the relationship.
         WRITE: There is a moderate, negative association between the hours of television
                watched and the Maths mark obtained.
                The negative association means that as the number of hours of television
                watched prior to the test increased, the marks in the Maths test tended to
                decrease. The moderate association suggests that it may be worth
                investigating the relationship further.
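
The counts in Worked Example 10 can be checked without a calculator. The sketch below is an illustration, not part of the original text: it reads the table as 15 (time, mark) pairs, three pairs per row, and the variable names are made up for the example.

```python
from statistics import median

# (time in hours, mark out of 20) pairs from the table in Worked Example 10
data = [
    (4, 15), (10, 8), (12, 10),
    (5, 16), (20, 5), (5, 8),
    (5, 20), (5, 12), (20, 8),
    (10, 12), (15, 4), (15, 10),
    (15, 8), (15, 12), (20, 10),
]

x_med = median(t for t, _ in data)  # median time
y_med = median(m for _, m in data)  # median mark

counts = {"a": 0, "b": 0, "c": 0, "d": 0}
on_lines = []  # points lying on a median line, which are not counted
for t, m in data:
    if t == x_med or m == y_med:
        on_lines.append((t, m))
    elif t > x_med and m > y_med:
        counts["a"] += 1  # upper right
    elif t < x_med and m > y_med:
        counts["b"] += 1  # upper left
    elif t < x_med and m < y_med:
        counts["c"] += 1  # lower left
    else:
        counts["d"] += 1  # lower right

q = ((counts["a"] + counts["c"]) - (counts["b"] + counts["d"])) / sum(counts.values())

print(x_med, y_med)  # 12 10
print(counts)        # {'a': 1, 'b': 5, 'c': 2, 'd': 4}
print(on_lines)      # [(12, 10), (15, 10), (20, 10)]
print(q)             # -0.5
```

Three of the 15 points have a mark equal to the median mark of 10, which is why only 12 points appear in the denominator.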
          remember
              1. The q-correlation coefficient is a measure of the strength of the association
                 between two variables.
              2. To calculate the q-correlation coefficient:
                 Step 1. Draw a scatterplot of the data.
                 Step 2. Locate the median of the x-values and draw a vertical line through this
                           median value.
                 Step 3. Locate the median of the y-values and draw a horizontal line through
                           this median value.
                 Step 4. (a) Label the sections thus formed A, B, C and D (B upper left, A upper
                               right, C lower left, D lower right).
                           (b) Count the number of points in each section.
                           (c) Do not count points which are on the lines.
                           (d) (The number of points in section A is denoted by a, and so on.)
                 Step 5. Calculate the q-correlation coefficient using the formula:
                           \[ q = \frac{(a + c) - (b + d)}{a + b + c + d} \]
              3. The sign of the q-value indicates the direction of the relationship (whether
                 there is a negative association or a positive association) while the size of it
                 indicates the strength (whether the relationship is strong, moderate or weak).
              4. The q-correlation coefficient tells us which quadrants the points fall into,
                 but beyond that the points can be in any position in the quadrants. In that
                 sense, the q-correlation coefficient is a rather blunt instrument.




2F   The q-correlation coefficient

          1 Calculate the q-correlation coefficient for each of the sets of data shown on the
            scatterplots below. (See Worked Example 9. EXCEL Spreadsheet: q-correlation)
            [Figure: six scatterplots labelled a to f, each plotting y against x]
          2 The data given in the table below show the results of an investigation into
            the mass and the height of a certain breed of dog. (See Worked Example 10.)
            a Draw a scatterplot and calculate the q-correlation coefficient.
            b Comment on the relationship between the height and the mass of this
              breed of dog.

              Height (cm)    41    40    35    38    43    44    37    39    42    44    31
              Mass (kg)      4.5   5     4     3.5   5.5   5     5     4     4     6     3.5
                            3 The data in the table below show the number of hours spent by students who are
                              learning touch-typing and their corresponding speed in words per minute (wpm).
                              a Using a graphics calculator or otherwise, calculate the q-correlation coefficient for
                                  these data.
                              b Comment on the relationship between the number of hours spent on learning and
                                  the speed of typing.
              Time (h)       20   33   22   39   40   37   46   44   24   36   50   48   29
              Speed (wpm)    34   46   38   53   52   49   60   58   36   42   65   63   40

                            4 multiple choice
                              The q-correlation coefficient for data shown in the scatterplot at right is:
                              [Figure: scatterplot of y against x]
                              A  \(-\frac{1}{11}\)    B  \(-\frac{1}{9}\)    C  \(-\frac{5}{11}\)    D  \(\frac{5}{9}\)    E  \(\frac{9}{9}\)

                            5 multiple choice
                               A researcher calculates the q-correlation coefficient for the relationship between time
                               (in days) and the diameter (measured in mm) of a crystal that is changing in size. The
                               value is 0.82. Based on this, the correlation between time and the diameter of the
                               crystal could be described as:
                               A strong and negative
                               B strong and positive
                               C weak and positive
                               D weak and negative
                               E moderate and positive

                               WorkSHEET 2.2
Pearson’s product–moment correlation
coefficient
     We saw in the previous exercise that the q-correlation coefficient was a rather blunt
     instrument for measuring correlation between variables. A more precise tool is
     Pearson’s product–moment correlation coefficient. This coefficient is used to measure
     the strength of linear relationships between variables; the q-correlation coefficient, on
     the other hand, can be used for both linear and non-linear relationships.
        Pearson’s coefficient is therefore more specialised and can give us a much more
     precise picture of the strength of the linear relationship between two variables.
        The symbol for Pearson’s product–moment correlation coefficient is r.
        Below is a gallery of scatterplots with the corresponding value of r for each.




        [Figure: gallery of nine scatterplots, with r = 1, r = –1, r = 0, r = 0.7, r = –0.5,
        r = –0.9, r = 0.8, r = 0.3 and r = –0.2]

        The two extreme values of r (1 and −1) are shown in the first two diagrams
     respectively. It is interesting to compare these two scatterplots with those showing
     extreme values (1 and −1) of q.




        [Figure: four scatterplots, in the order q = 1, r = 1, q = –1, r = –1]

        In the four diagrams above, the scatterplots that show matching values of q and r are
     placed side by side. We see just how differently the points on the scatterplots are
     arranged and note from this that the r value gives us a much sharper impression of the
     relationship between the variables. That is, a value of r = 1 means that there is perfect
     linear association between the variables, which is not necessarily the case when q = 1!

                              Value of r          Strength of association
                              0.75 to 1           Strong positive linear association
                              0.5 to 0.75         Moderate positive linear association
                              0.25 to 0.5         Weak positive linear association
                              –0.25 to 0.25       No linear association
                              –0.5 to –0.25       Weak negative linear association
                              –0.75 to –0.5       Moderate negative linear association
                              –1 to –0.75         Strong negative linear association
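
     The contrast between q = 1 and r = 1 can be checked numerically. The sketch below is an
     illustration, not part of the original text: both data sets are made up, the helper names are
     hypothetical, and `pearson_r` uses the usual textbook definition of r, which this section
     has not yet introduced.

```python
from math import sqrt
from statistics import mean, median


def q_correlation(points):
    """q-correlation coefficient for a list of (x, y) pairs (hypothetical helper)."""
    x_med = median(x for x, _ in points)
    y_med = median(y for _, y in points)
    same = opposite = 0
    for x, y in points:
        if x == x_med or y == y_med:
            continue  # points on a median line are not counted
        if (x > x_med) == (y > y_med):
            same += 1      # quadrants A and C
        else:
            opposite += 1  # quadrants B and D
    return (same - opposite) / (same + opposite)


def pearson_r(points):
    """Pearson's r from its usual definition (an addition for this illustration)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: every point is in quadrant A or C, but the points are not on a line.
scattered = [(1, 4), (2, 1), (3, 3), (4, 2), (6, 9), (7, 6), (8, 8), (9, 7)]
# Made-up data: the same x-values placed exactly on the line y = 2x + 1.
on_a_line = [(x, 2 * x + 1) for x in (1, 2, 3, 4, 6, 7, 8, 9)]

print(q_correlation(scattered), round(pearson_r(scattered), 2))  # 1.0 0.77
print(q_correlation(on_a_line), round(pearson_r(on_a_line), 2))  # 1.0 1.0
```

     Both data sets give q = 1, but only the second, where the points lie exactly on a straight
     line, gives r = 1.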



                In describing the strength of the relationship between the variables, the rough guide
             we used with the q-correlation coefficient can also be used with Pearson’s coefficient.
             The difference, of course, is that the value of r gives us a measure of the strength of
             linear relationships specifically.

     WORKED Example 11
 For each of the following:
 i   Estimate the value of Pearson’s product–moment correlation coefficient (r) from the
     scatterplot.
ii Use this to comment on the strength and direction of the relationship between the two
   variables.

 [Figure: three scatterplots labelled a, b and c]

THINK                                            WRITE
a  1 Compare these scatterplots with those in the gallery of scatterplots shown
     previously and estimate the value of r.
        WRITE: a  i r ≈ 0.9
   2 Comment on the strength and direction of the relationship.
        WRITE:   ii The relationship can be described as a strong, positive, linear relationship.
b  Repeat steps 1 and 2 as in a.
        WRITE: b  i r ≈ −0.7
                 ii The relationship can be described as a moderate, negative, linear relationship.
c  Repeat steps 1 and 2 as in a.
        WRITE: c  i r ≈ −0.1
                 ii There is no linear relationship.


             Note that the symbol ≈ means ‘approximately equal to’. We use it instead of the = sign
             to emphasise that the value (in this case r) is only an estimate.
                In completing the worked example above, we notice that estimating the value of
             r from a scatterplot is rather like making an informed guess. In the next section of
             work we will see how to obtain the actual value of r.


           remember
               1. Pearson’s product–moment correlation coefficient is used to measure the
                  strength of a linear relationship between two variables.
               2. The symbol for Pearson’s product–moment correlation coefficient is r.
               3. The estimate of r can be obtained from the scatterplot.
2G   Pearson’s product–moment correlation coefficient
          1 What type of linear relationship does each of the following values of r suggest?
            a 0.21                b 0.65               c −1                  d −0.78
            e 1                   f 0.9                g −0.34               h −0.1
          2 [Worked example 11] For each of the following:
               i Estimate the value of Pearson’s product–moment correlation coefficient (r) from
                 the scatterplot.
              ii Use this to comment on the strength and direction of the relationship between the
                 two variables.
             [Scatterplots a to h]

          3 multiple choice
            A set of data relating the variables x and y is found to have an r value of 0.62. The
            scatterplot that could represent the data is:
            [Scatterplots A to E]

          4 multiple choice
            A set of data relating the variables x and y is found to have an r value of −0.45. A true
            statement about the relationship between x and y is:
            A There is a strong linear relationship between x and y and when the x-values
                increase, the y-values tend to increase also.
            B There is a moderate linear relationship between x and y and when the x-values
                increase, the y-values tend to increase also.
            C There is a moderate linear relationship between x and y and when the x-values
                increase, the y-values tend to decrease.
            D There is a weak linear relationship between x and y and when the x-values increase,
                the y-values tend to increase also.
            E There is a weak linear relationship between x and y and when the x-values increase,
                the y-values tend to decrease.




Calculating r and the coefficient of
determination
           Pearson’s product–moment correlation coefficient
           The formula for calculating Pearson’s correlation coefficient r is as follows:
           $$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
           where n is the number of pairs of data in the set
                  $s_x$ is the standard deviation of the x-values
                  $s_y$ is the standard deviation of the y-values
                  $\bar{x}$ is the mean of the x-values
                  $\bar{y}$ is the mean of the y-values.
              The calculation of r by hand using this formula is unnecessary. The calculation of
           r is done far more efficiently using a graphics calculator.
              There are two important limitations on the use of r. First, since r measures the
           strength of a linear relationship, it would be inappropriate to calculate r for data which
           are not linear — for example, data which a scatterplot shows to be in a quadratic form.
              Second, outliers can bias the value of r. Consequently, if a set of linear data contains
           an outlier, then r is not a reliable measure of the strength of that linear relationship.
           The calculation of r is applicable to sets of bivariate data which are known to be
           linear in form and which do not have outliers.
              Given those two provisos, it is good practice to draw a scatterplot for a set of data to
           check for a linear form and an absence of outliers before r is calculated. Having a
           scatterplot in front of you is also useful because it enables you to estimate what the value
           of r will be, as you did in exercise 2G, and so to check that your working on
           the calculator is correct.
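
           The formula and the warning about outliers can be illustrated with a short sketch.
           The function name pearson_r and both data sets are hypothetical, made up for this
           illustration. The sketch assumes that the standard deviations in the formula are
           sample standard deviations, which is what Python's statistics.stdev returns.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson's r from the formula: the sum of the products of the
    standardised deviations, divided by n - 1.
    stdev() is the sample standard deviation."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data lying exactly on the line y = x + 1.
x_clean = [1, 2, 3, 4, 5, 6]
y_clean = [2, 3, 4, 5, 6, 7]
r_clean = pearson_r(x_clean, y_clean)

# The same data with a single outlier, the point (7, 0), added.
r_outlier = pearson_r(x_clean + [7], y_clean + [0])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

           A single outlying point is enough to pull r well away from the value suggested by
           the other points, which is why the scatterplot should be checked first.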

   WORKED Example 12
 The heights (in centimetres) of 21 football players were recorded against the number of
 marks they took in a game of football. The data are shown in the table below.
     Height (cm)    Number of marks taken        Height (cm)    Number of marks taken
         184                  6                      182                  7
         194                 11                      185                  5
         185                  3                      183                  9
         175                  2                      191                  9
         186                  7                      177                  3
         183                  5                      184                  8
         174                  4                      178                  4
         200                 10                      190                 10
         188                  9                      193                 12
         184                  7                      204                 14
         188                  6
a Construct a scatterplot for the data.
b Comment on the correlation between the heights of players and the number of marks
  that they take, and estimate the value of r.
c Calculate r and use it to comment on the relationship between the heights of players
  and the number of marks they take in a game.

THINK                                            WRITE/DISPLAY

a Using a graphics calculator, construct a       a
  scatterplot. Refer to worked example 8
  in the section on scatterplots for
  directions on how to use the graphics
  calculator to draw a scatterplot.

b Comment on the correlation between the         b The data show what appears to be a linear
  variables and estimate the value of r.           form of moderate strength.
                                                   We might expect r ≈ 0.6.
c   1   Because there is a linear form and there c r = 0.86
        are no outliers, the calculation of r is
        appropriate.
        Calculate r, using a graphics calculator.
        The lists are in place from the
        scatterplot.
        First press 2nd [CATALOG], select
        DiagnosticOn and press ENTER.
        Press STAT, select CALC and then
        4:LinReg(ax+b).
        Press ENTER.
        LinReg(ax+b) appears. Type L1, L2.
        Press ENTER.
    2   The value of r = 0.86 indicates a            There is a strong positive linear association
        strong positive linear relationship.         between the height of a player and the
                                                     number of marks he takes in a game. That is,
                                                     the taller the player the more marks we
                                                     might expect him to take.
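
            As a cross-check on the calculator, the same value can be found with a few lines
            of code. The sketch below uses the data from the table in this worked example. The
            variable names are our own, and the sum-of-squares form of the formula is used
            because it is algebraically equivalent to the formula given earlier (the factors of
            n − 1 cancel). The printed value can be compared with the calculator result.

```python
from math import sqrt

# Data from worked example 12: height in cm and number of marks taken.
heights = [184, 194, 185, 175, 186, 183, 174, 200, 188, 184, 188,
           182, 185, 183, 191, 177, 184, 178, 190, 193, 204]
marks = [6, 11, 3, 2, 7, 5, 4, 10, 9, 7, 6,
         7, 5, 9, 9, 3, 8, 4, 10, 12, 14]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(marks) / n

# Sums of squared deviations and of cross-products.
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_yy = sum((y - y_bar) ** 2 for y in marks)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, marks))

# Dividing s_xy by sqrt(s_xx * s_yy) is the same as using the formula
# with s_x and s_y, because the factors of n - 1 cancel.
r = s_xy / sqrt(s_xx * s_yy)
print(f"n = {n}, r = {r:.2f}")
```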



            Correlation and causation
            In worked example 12 we saw that r = 0.86. While we are entitled to say that there is a
            strong association between the height of a footballer and the number of marks he takes,
            we cannot assert that the height of a footballer causes him to take a lot of marks. Being
            tall might assist in the taking of marks, but there will be many other factors which
            come into play — for example skill level, accuracy of passes from teammates, abilities
            of the opposing team, and so on.
               So, while establishing a high degree of correlation between two variables is very
            interesting and can often flag the need for further, more detailed investigation, it in no
            way gives us any basis to comment on whether or not one variable causes particular
            values in another variable.




             The coefficient of determination
             The coefficient of determination is given by r². It is easy to calculate:
             we simply square Pearson’s product–moment correlation coefficient (r).
             1. The coefficient of determination is useful when we have two variables which
                have a linear relationship. It tells us the proportion of variation in one variable
                which can be explained by the variation in the other variable.
             2. The coefficient of determination provides a measure of how well the linear rule
                linking the two variables (x and y) predicts the value of y when we are given the
                value of x.

     WORKED Example 13
 A set of data giving the number of police traffic patrols on duty and the number of
 fatalities for the region was recorded and a correlation coefficient of r = −0.8 was found.
 Calculate the coefficient of determination and interpret its value.
 THINK                                      WRITE
  1 Calculate the coefficient of             Coefficient of determination = r²
     determination by squaring the given                                 = (−0.8)²
     value of r.                                                         = 0.64
 2   Interpret your result.                       We can conclude from this that 64% of the
                                                  variation in the number of fatalities can be
                                                  explained by the variation in the number of police
                                                  traffic patrols on duty. This means that the number
                                                  of police traffic patrols on duty is a major factor in
                                                  predicting the number of fatalities.
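
             The arithmetic in this worked example can be sketched as follows. The function
             name is our own choice, made for illustration only.

```python
def coefficient_of_determination(r: float) -> float:
    """Square Pearson's r; the sign of r is lost in the process."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    return r ** 2


r = -0.8  # value given in worked example 13
r_squared = coefficient_of_determination(r)
print(f"r^2 = {r_squared:.2f}, that is, {r_squared:.0%} of the variation")
```

             Because r is squared, r = −0.8 and r = 0.8 give the same coefficient of
             determination; the direction of the relationship must be read from r itself.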



             remember
               1. The formula for calculating Pearson’s correlation coefficient r is as follows:
                    $$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
                    where n is the number of pairs of data in the set
                             $s_x$ is the standard deviation of the x-values
                             $s_y$ is the standard deviation of the y-values
                             $\bar{x}$ is the mean of the x-values
                             $\bar{y}$ is the mean of the y-values.
               2.   The calculation of r by hand using this formula is unnecessary. The calculation
                    of r is done far more efficiently using a graphics calculator.
               3.   The calculation of r is applicable to sets of bivariate data which are known to
                    be linear in form and which do not have outliers.
               4.   Even if we find that two variables have a very high degree of correlation, for
                    example r = 0.95, we cannot say that the value of one variable is caused by the
                    value of the other variable.
               5.   The coefficient of determination = r².
               6.   The coefficient of determination is useful when we have two variables which
                    have a linear relationship. It tells us the proportion of variation in one variable
                    which can be explained by the variation in the other variable.
                          2H            Calculating r and the coefficient of determination

          1 [Worked example 12] The yearly salary ($’000) and the number of votes polled in the
            Brownlow medal count are given below for 10 leading footballers.

               Yearly salary ($’000)   180   200   160   250   190   210   170   150   140   180
               Number of votes          24    15    33    10    16    23    14    21    31    28

            a Construct a scatterplot for the data.
            b Comment on the correlation of salary and the number of votes and make an
              estimate of r.
            c Calculate r and use it to comment on the relationship between yearly salary and
              number of votes.

          2 [Worked example 13] A set of data, obtained from 40 smokers, gives the number of
            cigarettes smoked per day and the number of visits per year to the doctor. Pearson’s
            correlation coefficient for these data was found to be 0.87. Calculate the coefficient
            of determination for the data and interpret its value.

          3 Data giving the annual advertising budgets ($’000) and the yearly profit increases (%)
            of 8 companies are shown below.

              Annual advertising budget ($’000)   11    14    15    17    20    25    25    27
              Yearly profit increase (%)          2.2   2.2   3.2   4.6   5.7   6.9   7.9   9.3

            a Construct a scatterplot for these data.
            b Comment on the correlation of the advertising budget and profit increase and make
              an estimate of r.
            c Calculate r.
            d Calculate the coefficient of determination.
            e Write down the proportion of the variation in the yearly profit increase that can be
              explained by the variation in the advertising budget.

          4 Data showing the number of tourists visiting a small country in a month and the
            corresponding average monthly exchange rate for the country’s currency against the
            American dollar are given below.

              Number of tourists (’000)   2     3     4     5     7     8     8     10
              Exchange rate               1.2   1.1   0.9   0.9   0.8   0.8   0.7   0.6



         a Construct a scatterplot for the data.
         b Comment on the correlation between the number of tourists and the exchange rate
           and give an estimate of r.
         c Calculate r.
         d Calculate the coefficient of determination.
         e Write down the proportion of the variation in the number of tourists that can be
           explained by the exchange rate.

      5 Data showing the number of people in 9 households against weekly grocery costs are
        given below.

          Number of people in household   2    5     6     3     4     5     2    6     3
          Weekly grocery costs ($)        60   180   210   120   150   160   65   200   90

         a Construct a scatterplot for the data.
         b Comment on the correlation of the number of people in a household and the weekly
           grocery costs and give an estimate of r.
         c Calculate r.
         d Calculate the coefficient of determination.
         e Write down the proportion of the variation in the weekly grocery costs that can be
           explained by the variation in the number of people in a household.

      6 Data showing the number of people on 8 fundraising committees and the annual funds
        raised are given below.

         Number of people on committee   3      6      4      8        5      7        3      6
         Annual funds raised ($)         4500   8500   6100   12 500   7200   10 000   4700   8800

         a Construct a scatterplot for these data.
         b Comment on the correlation between the number of people on a committee and the
           funds raised and make an estimate of r.
         c Calculate r.
         d Calculate the coefficient of determination.
         e Write down the proportion of the variation in the funds raised that can be explained
           by the variation in the number of people on a committee.

      The following information applies to questions 7 and 8. A set of data was obtained from
      a large group of women with children under 5 years of age. They were asked the number
      of hours they worked per week and the amount of money they spent on childcare. The results
      were recorded and the value of Pearson’s correlation coefficient was found to be 0.92.
7 multiple choice
  Which of the following is not true?
  A The relationship between the number of working hours and the amount of money
    spent on child-care is linear.
  B There is a positive correlation between the number of working hours and the
    amount of money spent on child-care.
  C The correlation between the number of working hours and the amount of money
    spent on child-care can be classified as strong.
  D As the number of working hours increases, the amount spent on child-care increases
    as well.
  E The increase in the number of hours causes the increase in the amount of money
    spent on child-care.




8 multiple choice
  Which of the following is not true?
  A The coefficient of determination is about 0.85.
  B The number of working hours is the major factor in predicting the amount of money
    spent on child-care.
  C About 85% of the variation in the number of hours worked can be explained by the
    variation in the amount of money spent on child-care.
  D Apart from number of hours worked, there could be other factors affecting the
    amount of money spent on child-care.
  E About 17/20 of the variation in the amount of money spent on child-care can be
    explained by the variation in the number of hours worked.




summary
          Types of data
          • Bivariate data are data with two variables.
          • Numerical data involve quantities which are measurable or countable.
          • Categorical data are data divided into categories.
          • In a relationship involving two variables, if the values of one variable depend on the
            values of another variable, then the former variable is referred to as the dependent
            variable and the latter variable is referred to as the independent variable.
          • When data are displayed on a graph, the independent variable is placed on the
            horizontal axis and the dependent variable is placed on the vertical axis.

          Back-to-back stem plots
          • A back-to-back stem plot displays bivariate data involving a numerical variable and
            a categorical variable with two categories.
          • Together with summary statistics, back-to-back stem plots can be used to compare
            the two distributions.

          Parallel boxplots
          • To display a relationship between a numerical variable and a categorical variable
            with more than two categories, we can use a parallel boxplot.
          • A parallel boxplot is obtained by constructing individual boxplots for each
            distribution, using a common scale.

          The two-way frequency table
          • The two-way frequency table is a tool for examining the relationship between two
            categorical variables.
          • If the total number of scores in each of the two categories is unequal, percentages
            should be calculated in order to be able to analyse the table properly.
          • When the independent variable is placed in the columns of the table, the numbers in
            each column should be expressed as a percentage of that column’s total.

          The scatterplot
          • A scatterplot gives a visual display of the relationship between two numerical
            variables.
          • In analysing the scatterplot we look for a pattern in the way the points lie. Certain
            patterns tell us that certain relationships exist between the two variables. This is
            referred to as a correlation. We look at what type of correlation exists and how
            strong it is.
          • When describing the relationship between two variables displayed on a scatterplot,
            we need to comment on:
            (a) the direction — whether it is positive or negative
            (b) the form — whether it is linear or non-linear
            (c) the strength — whether it is strong, moderate or weak.
The q-correlation coefficient
• The q-correlation coefficient gives us a measure of the strength of the association
  between two variables.
• To calculate the q-correlation coefficient:
  Step 1. Draw a scatterplot of the data.
  Step 2. Locate the median of the x-values. Draw a vertical line through this median
           value.
  Step 3. Locate the median of the y-values. Draw a horizontal line through this
           median value.
  Step 4. The scatterplot is now divided into 4 sections or quadrants.
           (a) Label these sections A (upper right), B (upper left), C (lower left)
               and D (lower right).
           (b) Count the number of points in each section.
           (c) Do not count points which are on the lines.
           (d) The number of points in section A is denoted by a, the number of
               points in section B is denoted by b, and so on.
  Step 5. Calculate the q-correlation coefficient, using the formula:
           $$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
• The sign of the q-value indicates the direction of the relationship; that is, whether
  there is a negative association or a positive association. The magnitude of q
  indicates whether the relationship is strong, moderate or weak.
• The q-correlation coefficient gives us an idea of which quadrants the points fall
  into, but beyond that the points can be in any position in the quadrants. In that
  sense the q-correlation coefficient is a rather blunt instrument.
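
The steps for the q-correlation coefficient can be sketched in code as follows. The
function name q_correlation and the eight data points are hypothetical, chosen only to
illustrate the counting. The sketch assumes that quadrant A is the upper right, B the
upper left, C the lower left and D the lower right.

```python
from statistics import median


def q_correlation(xs, ys):
    """q-correlation coefficient, following steps 1 to 5.

    Quadrant A is upper right, B upper left, C lower left and D lower
    right; points lying on either median line are not counted.
    """
    x_med, y_med = median(xs), median(ys)
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        if x == x_med or y == y_med:
            continue  # on a median line, so not counted
        if x > x_med and y > y_med:
            a += 1
        elif x < x_med and y > y_med:
            b += 1
        elif x < x_med and y < y_med:
            c += 1
        else:
            d += 1
    counted = a + b + c + d
    if counted == 0:
        raise ValueError("every point lies on a median line")
    return ((a + c) - (b + d)) / counted


# Hypothetical data: both medians are 4.5, so no point lies on a line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 1, 6, 3, 4, 5, 8, 7]
print(q_correlation(xs, ys))
```

Here three points fall in each of A and C and one in each of B and D, so
q = (6 − 2) ÷ 8.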

Pearson’s product–moment correlation coefficient
• Pearson’s product–moment correlation coefficient is used to measure the strength of
  a linear relationship between two variables.
• The symbol for Pearson’s product–moment correlation coefficient is r.
• The calculation of r is applicable to sets of bivariate data which are known to be
  linear in form and which don’t have outliers.
• The value of r can be estimated from the scatterplot.
• The formula for calculating Pearson’s correlation coefficient r is as follows:
  $$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
  where n is the number of pairs of data in the set
          $s_x$ is the standard deviation of the x-values
          $s_y$ is the standard deviation of the y-values
          $\bar{x}$ is the mean of the x-values
          $\bar{y}$ is the mean of the y-values.
• The calculation of r by hand using this formula is unnecessary. The calculation of r
  is done far more efficiently using a graphics calculator.
• Even if we find that two variables have a very high degree of correlation, for
  example r = 0.95, we cannot say that the value of one variable is caused by the
  value of the other variable.

Calculating the coefficient of determination
• The coefficient of determination = r².
• The coefficient of determination is useful when we have two variables which have
  a linear relationship. It tells us the proportion of variation in one variable which can
  be explained by the variation in the other variable.




      Chapter review
      Multiple choice
          1 [2A] An example of a categorical variable is:
            A the membership number of a club
            B the number of students at each year level of a school
            C the total attendance at Hawthorn football matches
            D the breathalyser reading of people in a restaurant
            E the monthly income for a group of people
          2 [2A] In a study on the growth of plants, conducted in controlled surroundings, the dependent variable
            was the height of the plants. The independent variable in the study would be most likely:
            A the number of people caring for the plants
            B the amount of light present
            C the number of plants in the study
            D whether the plants were deciduous or evergreen
            E rainfall
          3 [2B] One of the following pairs of variables could not be displayed on a back-to-back stem plot. It is:
            A the heights of a group of students and whether or not they like football
            B the kilometres travelled in a week and the mode of transport (car or train)
            C the weights of a group of students and their eye colour (blue or brown)
            D the annual number of trips to a doctor and whether or not the person is a smoker
            E the amount spent by each child at the tuckshop and the age of the child
          4 [2B] A back-to-back stem plot is a useful way of displaying the relationship between:
            A the number of children attending a day care centre and whether or not the centre has
               federal funding
            B height and wrist circumference
            C age and weekly income
            D weight and the number of takeaway meals eaten each week
            E the age of a car and amount spent each year on servicing it

      The information below relates to questions 5 and 6. The salaries of people working at five
      different advertising companies are shown below on the parallel boxplots.
      [Parallel boxplots for Company A to Company E; horizontal axis: Annual salary (× $1000), scale from 10 to 150]

          5 [2C] The company with the largest interquartile range is:
            A Company A                  B Company B                            C Company C
            D Company D                  E Company E
 6 [2C] The company with the lowest median is:
   A Company A                 B Company B                            C Company C
   D Company D                 E Company E
Questions 7 and 8 relate to the following information. Data showing reactions of junior staff and
senior staff to a relocation of offices are given below in a two-way frequency table.
         Attitude             Junior staff               Senior staff                Total
 For                               23                        14                       37
 Against                           31                        41                       72
 Total                             54                        55                      109
 7 [2D] From this table, we can conclude that:
   A 23% of junior staff were for the relocation
   B 42.6% of junior staff were for the relocation
   C 31% of junior staff were against the relocation
   D 62.1% of junior staff were for the relocation
   E 28.4% of junior staff were against the relocation
 8 [2D] From this table, we can conclude that:
   A 14% of senior staff were for the relocation
   B 37.8% of senior staff were for the relocation
   C 12.8% of senior staff were for the relocation
   D 72% of senior staff were against the relocation
   E 74.5% of senior staff were against the relocation
 9 [2E] The relationship between the variables x and y is shown on the scatterplot below.
   The correlation between x and y would be best described as:
   A a weak positive association
   B a weak negative association
   C a strong positive association
   D a strong negative association
   E non-existent
   [Scatterplot of y against x]
10 [2E] An investigation is made into the number of freckles on the back of a hand and the age of
   the subject. A strong association was found to exist. In this investigation, age is the
   independent variable and the number of freckles is the dependent variable. You would
   expect the association to be:
   A negative         B positive        C bivariate        D weak              E categorical
11 [2F] The q-correlation coefficient for data shown in the scatterplot above is:
   A −5/11         B −5/9         C 5/11         D 5/9         E 2/9
   [Scatterplot of y against x]
12 A researcher calculates the q-correlation coefficient for the relationship
   between time (in days) and the growth of the root of a bean plant (measured in millimetres).       2F
   The value is 0.62. Based on this, the correlation between time and the growth of the roots
   could be described as:
   A strong and negative          B strong and positive           C weak and positive
   D weak and negative            E moderate and positive
      13 A set of data relating the variables x and y is found to have an r value of −0.83. The     2G
         scatterplot that could represent this data set is:
         [Options A to E: five scatterplots of y against x, not reproduced.]

      14 A set of data relating the variables x and y is found to have an r value of 0.65. A true   2G
         statement about the relationship between x and y is:
         A There is a strong linear relationship between x and y and when the x-values increase, the
             y-values tend to increase also.
         B There is a moderate linear relationship between x and y and when the x-values increase,
             the y-values tend to increase also.
         C There is a moderate linear relationship between x and y and when the x-values increase,
             the y-values tend to decrease.
         D There is a weak linear relationship between x and y and when the x-values increase, the
             y-values tend to increase also.
         E There is a weak linear relationship between x and y and when the x-values increase, the
             y-values tend to decrease.
      15 A set of data comparing age with blood pressure is found to have a Pearson’s correlation   2H
         coefficient of 0.86. The coefficient of determination for this data would be closest to:
         A −0.86           B −0.74           C −0.43            D 0.43            E 0.74

      16 The coefficient of determination for a set of data relating age and pulse rate is 0.7. This 2H
         means that:
         A The correlation coefficient, r, for age against pulse rate is 0.7.
         B 70% of the variation in pulse rate can be explained by the variation in age.
         C 30% of the variation in pulse rate can be explained by the variation in age.
         D 49% of the variation in pulse rate can be explained by the variation in age.
         E 70% of those in the study had a pulse rate over 0.7.

      Short answer
      1 For each of the following, write down:                                                       2A
         i whether each variable in the pair is an example of numerical or categorical data
         ii which is a dependent and which is an independent variable or whether it is not appropriate
            to classify the variables as such.
        a The number of injuries in a netball season and the age of a netball player
        b The suburb and the size of a home mortgage
        c IQ and weight
2 The number of hours of counselling received by a group of 9 full-time firefighters and
  9 volunteer firefighters after a serious bushfire is given below.                                                 2B
    Full-time             2         4         3         5         2          4          6          1     3
    Volunteer             8      10          11        11        12         13         13      14       15
  a Construct a back-to-back stem plot to display the data.
  b Comment on the distributions of the number of hours of counselling of the full-time
    firefighters and the volunteers.
3 The IQs of 8 players in each of 3 different football teams were recorded and are shown below.                   2C
    Team A            120       105        140         116             98        105         130       102
    Team B            110       104        120         109            106         95         102       100
    Team C            121       115        145         130            120        114         116       123
  Display the data in parallel boxplots.
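  A boxplot is drawn from a five-number summary, so it can help to check those values before
  plotting. The sketch below is one possible way to do this and assumes the quartiles are taken
  as the medians of the lower and upper halves of the sorted data; other quartile conventions
  give slightly different values. The function name five_number_summary is illustrative only.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data, with the middle value left out when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]        # values below the median position
    upper = data[n - half:]    # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])


# IQ data copied from the table in question 3.
teams = {
    "Team A": [120, 105, 140, 116, 98, 105, 130, 102],
    "Team B": [110, 104, 120, 109, 106, 95, 102, 100],
    "Team C": [121, 115, 145, 130, 120, 114, 116, 123],
}

for name, iqs in teams.items():
    print(name, five_number_summary(iqs))
```

  Each summary gives the five values needed to draw one boxplot on a shared scale.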
4 Delegates at the respective Liberal and Labor Party conferences were surveyed on whether or
  not they believed that uranium mining should continue. Forty-five Liberal delegates were                        2D
  surveyed and 15 were against continuation. Fifty-three Labor delegates were surveyed and
  43 were against continuation.
  a Present the data as percentages in a two-way frequency table.
  b Comment on any difference between the reactions of the Liberal and Labor delegates.
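  The percentages in a two-way frequency table are calculated within each column, so that each
  column sums to 100%. A minimal sketch of that arithmetic follows. It assumes every delegate
  who was not against continuation was for it (the question does not mention undecided
  delegates), and the names surveyed, against and column_percentages are illustrative only.

```python
# Counts from question 4: 45 Liberal delegates (15 against),
# 53 Labor delegates (43 against).
surveyed = {"Liberal": 45, "Labor": 53}
against = {"Liberal": 15, "Labor": 43}


def column_percentages(totals, against_counts):
    """Return {party: {"For": pct, "Against": pct}}, rounded to 1 decimal place.

    Assumption: anyone not counted as against is counted as for.
    """
    table = {}
    for party, total in totals.items():
        n_against = against_counts[party]
        n_for = total - n_against
        table[party] = {
            "For": round(100 * n_for / total, 1),
            "Against": round(100 * n_against / total, 1),
        }
    return table


percentages = column_percentages(surveyed, against)
for party, column in percentages.items():
    print(party, column)
```

  Comparing the two columns row by row shows how the parties differ.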
5 a Construct a scatterplot for the data given in the table below.
  b Use the scatterplot to comment on any relationship which exists between the variables.                       2E
        Age              15       17         18        16        19         19         17      15      17
        Pulse rate       79       74         75        85        82         76         77      72      70

6 For the data given in question 5, calculate the q-correlation coefficient and use this to
  comment on the relationship between the two variables. (Compare your response about the                        2F
  relationship in this question with your response about the relationship in question 5, when you
  did not know the q-value.)
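  The quadrant count behind q can be checked by machine. The sketch below assumes the usual
  definition: median lines divide the scatterplot into quadrants A (upper right), B (upper left),
  C (lower left) and D (lower right), points lying on a median line are left out, and
  q = ((a + c) − (b + d)) / (a + b + c + d). The function name q_correlation is illustrative only.

```python
from statistics import median


def q_correlation(xs, ys):
    """Quadrant-count coefficient q = ((a + c) - (b + d)) / (a + b + c + d).

    a = upper right, b = upper left, c = lower left, d = lower right,
    relative to the median lines. Assumption: points that lie on either
    median line are omitted from the count.
    """
    x_med, y_med = median(xs), median(ys)
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        if x == x_med or y == y_med:
            continue  # on a median line, so not counted
        if x > x_med and y > y_med:
            a += 1
        elif x < x_med and y > y_med:
            b += 1
        elif x < x_med and y < y_med:
            c += 1
        else:
            d += 1
    counted = a + b + c + d
    if counted == 0:
        raise ValueError("every point lies on a median line")
    return ((a + c) - (b + d)) / counted


# Data copied from the table in question 5.
age = [15, 17, 18, 16, 19, 19, 17, 15, 17]
pulse = [79, 74, 75, 85, 82, 76, 77, 72, 70]
print(round(q_correlation(age, pulse), 2))
```

  With only nine points, several of which fall on a median line, q rests on very few counted
  points, so it should be interpreted with caution.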
7 For the variables shown on the scatterplot at right, give an estimate of the value of r and use               2G
  it to comment on the nature of the relationship between the two variables.
  [Scatterplot of y against x not reproduced.]

8 The table below gives data relating the percentage of lectures attended by students in a
  semester and the corresponding mark for each student in the exam for that subject.                             2H
    Lectures attended (%)    70     59     85     93     78     85     84     69     70     82
    Exam result (%)          80     62     89     98     84     91     83     72     75     85
                  a Construct a scatterplot for these data.
                  b Comment on the correlation between the lectures attended and the examination results and
                    make an estimate of r.
                  c Calculate r.
                  d Calculate the coefficient of determination.
                  e Write down the proportion of the variation in the examination results that can be explained
                    by the variation in the number of lectures attended.
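                  Parts c to e can be checked without a calculator's statistics mode. The sketch below
                  computes Pearson's r from the sums of squared deviations and then squares it to give the
                  coefficient of determination. The function name pearson_r is illustrative only, and the
                  code is a minimal check under those assumptions rather than a prescribed method.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Data copied from the table in question 8.
lectures = [70, 59, 85, 93, 78, 85, 84, 69, 70, 82]
exam = [80, 62, 89, 98, 84, 91, 83, 72, 75, 85]

r = pearson_r(lectures, exam)
r_squared = r ** 2  # coefficient of determination
print(round(r, 3), round(r_squared, 3))
```

                  Expressed as a percentage, r squared is the proportion of the variation in the exam
                  results that can be explained by the variation in lectures attended.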
                Analysis
                1 An investigation into the relationship between age and salary bracket among some employees
                  of a large computer company is made and the results are shown below.
                      Salary bracket ($’000)                        Age
                                20–39                   32 21 43 23 22 27 37
                                40–59                   29 31 37 26 33 37
                                60–79                   41 29 39 42 47 45 43 38
                                80–99                   43 48 38 37 49 51 53 59
                              100–120                   48 37 55 61
                  a   State, for each of the variables (age and salary bracket), whether it represents
                      categorical or numerical data.
                  b   State which is the independent variable and which is the dependent variable.
                  c   State which of the following you could use to display the data:
                         i back-to-back stem plot
                        ii parallel boxplot
                       iii scatterplot
                       iv two-way frequency table in percentage form
                  d   State which of the following you could calculate in order to find out more about the
                      relationship between age and salary bracket:
                         i the q-correlation coefficient
                        ii r, the Pearson product–moment correlation coefficient
                       iii the coefficient of determination
                2 An investigation similar to that in analysis task 1 is undertaken at an accounting firm to
                  explore the relationship between age and salary. The data are shown below.
                Age                              20   20   30   35   50   45   35   45   55   55   42   50   25   30   40
                Salary (nearest thousand $’s)    20   40   20   30   40   80   40   60  100   70   45   85   30   60   60
                  a State, for each of the variables (age and salary), whether it represents categorical or
                     numerical data.
                  b Display the data on a scatterplot.
                  c Describe the association between the two variables in terms of direction, form and
                     strength.
                  d Calculate the value of q.
                  e Explain whether or not it is appropriate to use Pearson’s correlation coefficient to explain
                     the relationship between age and salary.
                  f Estimate the value of Pearson’s correlation coefficient from the scatterplot.
                  g Calculate the value of this coefficient.
                  h Explain whether or not the salary of the employees is determined by their age.
                  i Calculate the value of the coefficient of determination.
                  j Explain what the coefficient of determination tells us about the relationship between age
                    and salary at this accounting firm.

				