# Bivariate data

```					Bivariate data
2
VCE coverage
Area of study
Units 3 & 4 • Data analysis

In this chapter
2A   Types of data
2B   Back-to-back stem plots
2C   Parallel boxplots
2D   The two-way frequency table
2E   The scatterplot
2F   The q-correlation coefficient
2G   Pearson’s product–moment correlation coefficient
2H   Calculating r and the coefficient of determination
58   Further Mathematics

Types of data
In this chapter we look at sets of data which contain two variables. We look at ways of
displaying the data and of measuring relationships between the two variables.
The methods we employ to do this depend entirely on the type of variables we are
dealing with.
Numerical and categorical data
Examples of numerical data are:
1. the heights of a group of teenagers
2. the marks for a maths test
3. the number of universities in a country
4. ages
5. salaries.
As the name suggests, numerical data involve quantities which are, broadly
speaking, measurable or countable.
Examples of categorical data are:
1. genders (sexes)
2. AFL football teams
3. religious denominations
4. finishing positions in the Melbourne Cup
5. municipalities
6. ratings of 1–5 to indicate preferences for 5 different cars
7. age groups, for example 0–9, 10–19, 20–29
8. hair colours.
Such categorical data, as the name suggests, have categories like masculine, feminine
and neuter for gender, or Catholic, Anglican, Uniting, Baptist, Buddhist and so on for
religious denomination, or 1st, 2nd, 3rd for finishing position in the Melbourne Cup.
Note: Some numbers may look like numerical data but really be names or titles (for
example, ratings of 1 to 5 given to different samples of cake, as in ‘This one’s a 4’, or the
numbers on netball players’ uniforms, as in ‘she’s number 7’). These ‘titles’ are not
countable; they place the subject in a category (with a name), so they are categorical.
In this chapter we look at ways of measuring the relationship between:
1. a numerical variable and a categorical variable (for example, weight and nationality)
2. two categorical variables (for example, gender and religious denomination)
3. two numerical variables (for example, height and weight).
Dependent and independent variables
When a relationship between two variables is being examined, it is useful to
know if one of the variables depends on the other. Often we can make a judgment about
this but sometimes we can’t.
Consider the case where a study compared the heights of company employees
against their annual salaries. Common sense would suggest that the height of a
company employee would not depend on the person’s annual salary, nor would the annual
salary of a company employee depend on the person’s height. In this case, it is not
appropriate to designate one variable as independent and one as dependent.
In the case where the ages of company employees are compared with their annual
salaries, you might reasonably expect that the annual salary of an employee would
depend on the person’s age. In this case, the age of the employee is the independent
variable and the salary of the employee is the dependent variable.
It is useful to identify the independent and dependent variables where possible, since
it is the usual practice when displaying data on a graph to place the independent
variable on the horizontal axis and the dependent variable on the vertical axis.
Chapter 2 Bivariate data        59
remember
1. Bivariate data are data with two variables.
2. Numerical data involve quantities that are measurable or countable.
3. Categorical data, as the name suggests, are data which are divided into categories.
4. In a relationship involving two variables, if the values of one variable ‘depend’ on
the values of another variable, then the former variable is referred to as the
dependent variable and the latter variable is referred to as the independent variable.
5. It is the usual practice when displaying data on a graph to place the independent
variable on the horizontal axis and the dependent variable on the vertical axis.

2A           Types of data

1 Write down whether each of the following represents numerical or categorical data.
a The heights in centimetres of a group of children
b The diameters in millimetres of a tray of ball-bearings
c The numbers of visitors to a display each day
d The modes of transport that students in Year 11 take to school
e The 10 most-watched television programs in a week
f The occupations of a group of 30-year-olds
g The numbers of subjects offered to VCE students at various schools
h Life expectancies
i Species of fish
j Blood groups
k Years of birth
l Countries of birth
m Tax brackets
2 For each of the following pairs of variables, write down which is independent and
which is dependent. If it is not possible to identify this, then write ‘not appropriate’.
a The age of an AFL footballer and his annual salary
b The weight of a businessman and the number of business lunches he attends each week
c The growth of a plant and the amount of fertiliser it receives
d The number of books read in a week and the eye colour of the readers
e The voting intentions of a woman and her weekly consumption of red meat
f The number of members in a household and the size of their house
3 multiple choice
An example of a numerical variable is:
A attitude to 4-yearly elections (for or against)
B year level of students
C the total attendance at Carlton football matches
D position in a queue at the pie stall
E television channel numbers shown on a dial
4 multiple choice
In a study on mice, the dependent variable was the time (in days) for which the mice
remained alive. The independent variable would most likely have been:
A the weight of the mice
B the amount of food eaten each day by the mice
C the daily dosage of an experimental drug given to the mice
D the number of mice
E the sex of the mice

Back-to-back stem plots
In chapter 1, we saw how to construct a stem plot for a set of univariate data. We can
also extend a stem plot so that it displays bivariate data. Specifically, we shall create a
stem plot that displays the relationship between a numerical variable and a categorical
variable. We shall limit ourselves in this section to categorical variables with just two
categories, for example sex. The two categories are used to provide two, back-to-back
leaves of a stem plot.
A back-to-back stem plot is used to display bivariate data, involving a numerical
variable and a categorical variable with 2 categories.

WORKED Example 1
The girls and boys in Grade 4 at Kingston Primary School submitted projects on the
Olympic Games. The marks they obtained out of 20 are given below.

Girls’ marks       16      17      19       15      12      16      17      19         19   16
Boys’ marks        14      15      16       13      12      13      14      13         15   14

Display the data on a back-to-back stem plot.
THINK 1   Identify the highest and lowest scores in order to decide on the stems.
WRITE     Highest score = 19
          Lowest score = 12
          Use a stem of 1, divided into fifths.

THINK 2   Create an unordered stem plot first. Put the boys’ scores on the left, and the
          girls’ scores on the right.
WRITE     Key: 1|2 = 12
               Leaf | Stem | Leaf
               Boys |      | Girls
                    |  1   |
            3 2 3 3 |  1   | 2
          4 5 4 5 4 |  1   | 5
                  6 |  1   | 6 7 6 7 6
                    |  1   | 9 9 9

THINK 3   Now order the stem plot. The scores on the left should increase in value from
          right to left, while the scores on the right should increase in value from left
          to right.
WRITE     Key: 1|2 = 12
               Leaf | Stem | Leaf
               Boys |      | Girls
            3 3 3 2 |  1   | 2
          5 5 4 4 4 |  1   | 5
                  6 |  1   | 6 6 6 7 7
                    |  1   | 9 9 9

The back-to-back stem plot allows us to make some visual comparisons of the two
distributions. In the above example the centre of the distribution for the girls is higher
than the centre of the distribution for the boys. The spread of each of the distributions
seems to be about the same. For the boys, the marks are grouped around the 12–15
marks; for the girls, they are grouped around the 16–19 marks. On the whole, we can
conclude that the girls obtained better marks than the boys did.
To get a more precise picture of the centre and spread of each of the distributions we
can use the summary statistics discussed in chapter 1. Specifically, we are interested in:
1. the mean and the median (to measure the centre of the distributions), and
2. the interquartile range and the standard deviation (to measure the spread of the
distributions).
We saw in chapter 1 that the calculation of these summary statistics is very
straightforward and rapid using a graphics calculator.
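The visual comparison above can be checked by hand. The working below is a sketch based
only on the marks listed in Worked Example 1; the layout is illustrative.

Girls (ordered): 12 15 16 16 16 17 17 19 19 19
    Median = (16 + 17) / 2 = 16.5
    Mean   = 166 / 10 = 16.6
Boys (ordered):  12 13 13 13 14 14 14 15 15 16
    Median = (14 + 14) / 2 = 14
    Mean   = 139 / 10 = 13.9

Both measures of centre are roughly 2.5 marks higher for the girls, in line with the
conclusion drawn from the stem plot.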

WORKED Example 2
The number of ‘how to vote’ cards handed out by various Australian Labor
Party and Liberal Party volunteers during the course of a polling day is shown
below.

Labor      180  233  246  252  263  270  229  238  226  211
           193  202  210  222  257  247  234  226  214  204
Liberal    204  215  226  253  263  272  285  245  267  275
           287  273  266  233  244  250  261  272  280  279
Display the data using a back-to-back stem plot and use this, together with summary
statistics, to compare the distributions of the number of cards handed out by the Labor
and Liberal volunteers.

THINK 1   Construct the stem plot.
WRITE     Key: 18|0 = 180
               Leaf | Stem | Leaf
              Labor |      | Liberal
                  0 |  18  |
                  3 |  19  |
                4 2 |  20  | 4
              4 1 0 |  21  | 5
            9 6 6 2 |  22  | 6
              8 4 3 |  23  | 3
                7 6 |  24  | 4 5
                7 2 |  25  | 0 3
                  3 |  26  | 1 3 6 7
                  0 |  27  | 2 2 3 5 9
                    |  28  | 0 5 7

THINK 2   Use a graphics calculator to calculate the summary statistics: the mean, the
          median, the standard deviation and the interquartile range. Enter each set of
          data as a separate list. (See chapter 1 on how to use your graphics calculator to
          calculate these values.)
WRITE     For the Labor volunteers:
              Mean = 227.9
              Median = 227.5
              Interquartile range = 36
              Standard deviation = 23.9
          For the Liberal volunteers:
              Mean = 257.5
              Median = 264.5
              Interquartile range = 29.5
              Standard deviation = 23.4

THINK 3   Comment on the relationship.
WRITE     From the stem plot we see that the Labor distribution is roughly symmetric, so its
          mean and median are very close, whereas the Liberal distribution is negatively
          skewed.
          Since the Liberal distribution is skewed, the median is a better indicator of the
          centre of the distribution than is the mean.
          Comparing the medians, the median number of cards handed out is about 228 for
          Labor and about 265 for Liberal, which is a big difference.
          The standard deviations were similar, as were the interquartile ranges, so there
          was not a lot of difference in the spread of the data.
          In essence, the Liberal Party volunteers handed out a lot more ‘how to vote’ cards
          than the Labor Party volunteers did.
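
The interquartile ranges quoted in step 2 can be reproduced by hand, assuming the
quartiles are taken as the medians of the lower and upper halves of the ordered data
(an assumption about the calculator’s method; it does give the values above).

Labor, lower half:   180 193 202 204 210 211 214 222 226 226    Q1 = (210 + 211) / 2 = 210.5
Labor, upper half:   229 233 234 238 246 247 252 257 263 270    Q3 = (246 + 247) / 2 = 246.5
    Interquartile range = 246.5 − 210.5 = 36
Liberal, lower half: 204 215 226 233 244 245 250 253 261 263    Q1 = (244 + 245) / 2 = 244.5
Liberal, upper half: 266 267 272 272 273 275 279 280 285 287    Q3 = (273 + 275) / 2 = 274
    Interquartile range = 274 − 244.5 = 29.5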

remember
1. A back-to-back stem plot displays bivariate data involving a numerical variable
and a categorical variable with two categories.
2. In the ordered stem plot, the scores on the left side of the stem increase in value
from right to left.
3. Together with summary statistics, back-to-back stem plots can be used for
comparing two distributions.

2B             Back-to-back stem plots

1 The marks (out of 50) obtained for the end-of-term test by the students in German and
French classes are given below. Display the data on a back-to-back stem plot.
(See Worked Example 1.)

German 20 38 45 21 30 39 41 22 27 33 30 21 25 32 37 42 26 31 25 37

French    23 25 36 46 44 39 38 24 25 42 38 34 28 31 44 30 35 48 43 34

2 The birth masses of 10 boys and 10 girls (in kilograms, to the nearest 100 grams) are
recorded in the table below. Display the data on a back-to-back stem plot.
Boys        3.4    5.0        4.2        3.7         4.9    3.4         3.8      4.8    3.6        4.3

Girls       3.0    2.7        3.7        3.3         4.0    3.1         2.6      3.2    3.6        3.1

3 The number of delivery trucks making deliveries to a supermarket each day over a
2-week period was recorded for two neighbouring supermarkets: supermarket A and
supermarket B. The data are shown below. (See Worked Example 2.)

A      11    15    20     25        12    16     21        27     16        17   17    22     23    24

B      10    15    20     25        30    35     16        31     32        21   23    26     28    29

a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the
distributions of the number of trucks delivering to supermarkets A and B.

4 The marks out of 20 for males and females on a science test for a Year-10 class are
given below.
Females                  12         13          14         14          15        15     16         17

Males                    10         12          13         14          14        15     17         19

a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the
distributions of the marks of the males and the females.

5 The end-of-year English marks for 10 students in an English class were compared over
2 years. The marks for 1998 and for the same students in 1999 are shown below.
1998          30        31      35         37         39        41          41    42        43     46

1999          22        26      27         28         30        31          31    33        34     36

a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the
distributions of the marks obtained by the students in 1998 and 1999.

6 The age and sex of a group of people attending a fitness class are recorded below.
Female                23       24        25        26    27      28        30        31
Male                  22       25        30        31    36      37        42        46
a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the
distributions of the ages of the female and male members of the fitness class.

7 The scores on a board game are recorded for a group of kindergarten children and for a
group of children in a preparatory school.
Kindergarten           3     13     14        25    28   32    36     41        47   50
Prep. School           5     12     17        25    27   32    35     44        46   52
a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions
of the scores of the kindergarten children and the preparatory school children.
8 multiple choice
The pair of variables that could be displayed on a back-to-back stem plot is:
A the height of a student and the number of people in the student’s household
B the time put into completing an assignment and a pass or fail score on the assignment
C the weight of a businessman and his age
E the income bracket of an employee and the time the employee has worked for the
company
9 multiple choice
A back-to-back stem plot is a useful way of displaying the relationship between:
A the proximity to markets (km) and the cost of fresh foods on average per kilogram
C age and attitude to gambling (for or against)
D weight and age
E the money spent during a day of shopping and the number of shops visited on that day
Parallel boxplots
We saw in the previous section that we could display relationships between a numerical
variable and a categorical variable with just two categories, using a back-to-back stem plot.
When we want to display a relationship between a numerical variable and a
categorical variable with more than two categories, a parallel boxplot can be used.
A parallel boxplot is obtained by constructing individual boxplots for each
distribution, using a common scale.
Construction of individual boxplots was discussed in detail in chapter 1 on univariate
data. In this section we concentrate on comparing distributions represented by a
number of boxplots (that is, on the interpretation of parallel boxplots).

WORKED Example 3
The four Year-7 classes at Western Secondary College complete the same end-of-
year maths test. The marks, expressed as percentages for each of the students in
the four classes, are given below.
7A        7B         7C         7D                     7A          7B         7C           7D
40         60        50          40                    69          78         70           69
43         62        51          42                    63          82         72           73
45         63        53          43                    63          85         73           74
47         64        55          45                    68          87         74           75
50         70        57          50                    70          89         76           80
52         73        60          53                    75          90         80           81
53         74        63          55                    80          92         82           82
54         76        65          59                    85          95         82           83
57         77        67          60                    89          97         85           84
60         77        69          61                    90          97         89           90
Display the data using a parallel boxplot and use this to describe any similarities or
differences in the distributions of the marks between the four classes.
THINK 1         Create the first boxplot (for class 7A) on a graphics calculator using 2nd
                [STAT PLOT] and appropriate WINDOW settings. Using TRACE to show key
                values, sketch the first boxplot using pen and paper, leaving room for three
                more.
WRITE/DISPLAY   [Figure: graphics calculator screens showing the boxplot for class 7A]

THINK 2         Repeat step 1 for the other three classes. All four boxplots share the
                common scale.
WRITE/DISPLAY   [Figure: parallel boxplots for classes 7A, 7B, 7C and 7D on a common scale
                from 30 to 100; horizontal axis: Maths mark (%)]

THINK 3         Describe the similarities and differences between the four distributions.
WRITE/DISPLAY   Class 7B had the highest median mark and the range of the distribution was
                only 37. The lowest mark in 7B was 60.
                We notice that the median of 7A’s marks is approximately 60. So about 50% of
                students in 7A received less than 60. This means that about half of 7A had
                scores that were less than the lowest score in 7B.
                The range of marks in 7A was about the same as that of 7D, with the highest
                scores in each about equal and the lowest scores in each about equal.
                However, the median mark in 7D was higher than the median mark in 7A so,
                despite a similar range, marks in 7D were generally higher than in 7A.
                While 7D had a top score that was higher than that of 7C, the median score
                in 7C was higher than that of 7D and the bottom 25% of scores in 7D were no
                higher than the lowest score in 7C. In summary, 7B did best, followed by 7C
                then 7D and finally 7A.
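
The comparisons in step 3 can be backed up with five-number summaries. The figures below
are a sketch worked from the marks in the table, assuming the quartiles are the medians of
the lower and upper halves of each ordered list (the calculator’s method may differ).

Class   Minimum   Q1     Median   Q3     Maximum
7A      40        51     61.5     72.5   90
7B      60        71.5   77.5     89.5   97
7C      50        58.5   69.5     78     89
7D      40        51.5   65       80.5   90

The medians fall in the order 7B, 7C, 7D, 7A, which matches the ranking given above.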

remember
1. A relationship between a numerical variable and a categorical variable with
more than two categories can be displayed using a parallel boxplot.
2. A parallel boxplot is obtained by constructing individual boxplots for each
distribution, using a common scale.

2C          Parallel boxplots
1 The heights (in cm) of students in 9A, 10A and 11A were recorded and
are shown in the table below. (See Worked Example 3.)
9A       10A    11A             9A     10A     11A              9A     10A      11A
120      140    151            146     153     164             158     168      175
126      143    153            147     156     166             160     170      180
131      146    154            150     162     167             162     173      187
138      147    158            156     164     169             164     175      189
140      149    160            157     165     169             165     176      193
143      151    163            158     167     172             170     180      199
a Construct a parallel boxplot to show the data.
b Use the boxplot to compare the distributions of height for the 3 classes.
2 The amounts of money contributed annually to superannuation schemes by people in
3 different age groups are shown below.
20–29         30–39      40–49                    20–29       30–39         40–49
2000          4000      10 000                    6500         7000         13 700
3100          5200      11 200                    6700         8000         13 900
5000          6000      12 000                    7000         9000         14 000
5500          6300      13 300                    9200       10 300         14 300
6200          6800      13 500                  10 000       12 000         15 000
a Construct a parallel boxplot to show the data.
b Use the boxplot to comment on the distributions.

3 The numbers of jars of vitamin A, B, C and multi-vitamins sold per week by a local
chemist are shown below.

Vitamin A          5    6    7    7    8    8    9   11   13   14
Vitamin B         10   10   11   12   14   15   15   15   17   19
Vitamin C          8    8    9    9    9   10   11   12   12   13
Multi-vitamins    12   13   13   15   16   16   17   19   19   20
Construct a parallel boxplot to display the data and use it to compare the distributions
of sales for the 4 types of vitamin.

4 multiple choice
The ages of the employees at 5 different companies of the same size are compared
using the parallel boxplots shown below.

[Figure: parallel boxplots of employee ages for Company A, Company B, Company C,
Company D and Company E on a common scale from 20 to 60]

For each of the following, select from:
A company A                  B company B                  C company C
D company D                  E company E

a Which company has the greatest range of ages?
b Which company has the greatest interquartile range of ages?
c Which company has the lowest median age?
d Which company has the greatest range of ages among their oldest 25% of employees?
The two-way frequency table
When we are examining the relationship between two categorical variables, the two-
way frequency table is an excellent tool.
Consider the following example.

WORKED Example 4
At a local shopping centre, 34 females and 23 males were asked which of the two major
political parties they preferred. Eighteen females and 12 males preferred Labor. Display
these data in a two-way table.

THINK 1   Draw a table. Record the respondent’s sex in the columns and party preference
          in the rows of the table.
WRITE     Party preference   Female   Male   Total
          Labor
          Liberal
          Total

THINK 2   (a) We know that 34 females and 23 males were asked. Put this information
          into the table and fill in the total.
          (b) We also know that 18 females and 12 males preferred Labor. Put this
          information in the table and find the total of people who preferred Labor.
WRITE     Party preference   Female   Male   Total
          Labor                  18     12      30
          Liberal
          Total                  34     23      57

THINK 3   Fill in the remaining cells. For example, to find the number of females who
          preferred the Liberals, subtract the number of females preferring Labor
          from the total number of females asked: 34 − 18 = 16.
WRITE     Party preference   Female   Male   Total
          Labor                  18     12      30
          Liberal                16     11      27
          Total                  34     23      57

In the above example we have a very clear breakdown of data. We know how many
females preferred Labor, how many females preferred the Liberals, how many males
preferred Labor and how many males preferred the Liberals.
If we wish to compare the number of females who prefer Labor with the number of
males who prefer Labor, we must be careful. While 12 males preferred Labor compared
to 18 females, there were, of course, fewer males than females being asked. That is,
only 23 males were asked for their opinion, compared to 34 females.
To overcome this problem, we can express the ﬁgures in the table as percentages.

WORKED Example 5
Fifty-seven people in a local shopping centre were asked whether they preferred
the Australian Labor Party or the Liberal Party. The results are given below.
Convert the numbers in this table to percentages.

Party preference   Female   Male   Total
Labor                  18     12      30
Liberal                16     11      27
Total                  34     23      57

THINK 1   Draw the table, omitting the ‘total’ column.
WRITE     Party preference   Female    Male
          Labor
          Liberal
          Total

THINK 2   Fill in the table by expressing the number in each cell as a percentage of its
          column’s total. For example, to obtain the percentage of males who prefer Labor,
          we divide the number of males who prefer Labor by the total number of males
          and multiply by 100%.
          12/23 × 100% = 52.2% (correct to 1 decimal place)
WRITE     Party preference   Female    Male
          Labor                52.9    52.2
          Liberal              47.1    47.8
          Total               100.0   100.0

We could have calculated percentages from the table rows, rather than columns. To do
that we would, for example, have divided the number of females who preferred Labor
(18) by the total number of people who preferred Labor (30) and so on. The table below
shows this:
Party preference      Female         Male    Total

Labor                  60.0          40.0    100

Liberal                59.3          40.7    100

By doing this we have obtained the percentage of Labor supporters who were female
(60%), the percentage of Labor supporters who were male (40%), and so on. This
highlights facts different from those shown in the previous
table. In other words, different results can be obtained by calculating percentages from
a table in different ways.
As a general rule, when the independent variable (in this case the respondent’s sex)
is placed in the columns of the table, then the percentages should be calculated in
columns.
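The column percentages in Worked Example 5 can be set out in full as follows. This is a
sketch of the working only, using the counts from the two-way table.

Female column:   Labor 18/34 × 100% = 52.9%     Liberal 16/34 × 100% = 47.1%
Male column:     Labor 12/23 × 100% = 52.2%     Liberal 11/23 × 100% = 47.8%

Read this way, almost the same percentage of females (52.9%) and males (52.2%) preferred
Labor, even though more females than males did so in raw numbers.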
WORKED Example 6
Sixty-seven primary and 47 secondary school students were asked their attitude to the
number of school holidays which should be given. They were asked whether there should
be more, fewer or the same number. Five primary students and 2 secondary students
wanted fewer holidays, 29 primary and 9 secondary students thought they had enough
holidays (that is, they chose the same number) and the rest thought they needed to be
given more holidays.
Present these data in percentage form in a two-way frequency table and use it to
compare the opinions of the primary and the secondary students.
THINK 1   Put the data in a table. First fill in the given information, then find the missing
          information by subtracting the appropriate numbers from the totals.
WRITE     Attitude   Primary   Secondary   Total
          Fewer            5           2       7
          Same            29           9      38
          More            33          36      69
          Total           67          47     114

THINK 2   Calculate the percentages. Since the independent variable (the level of the
          student, Primary or Secondary) has been placed in the columns of the table,
          we calculate the percentages in columns. For example, to obtain the
          percentage of primary students who wanted fewer holidays, divide the
          number of such students by the total number of primary students and
          multiply by 100%.
          That is, 5/67 × 100% = 7.5%.
WRITE     Attitude   Primary   Secondary
          Fewer          7.5         4.3
          Same          43.3        19.1
          More          49.2        76.6
          Total        100.0       100.0

THINK 3   Comment on the results.
WRITE     Secondary students were much keener on having more holidays than were primary
          students.

remember
1. The two-way frequency table is an excellent tool for examining the relationship
between two categorical variables.
2. If the total number of scores in each of the two categories is unequal,
percentages should be calculated in order to be able to analyse the table
properly. When the independent variable is placed in the columns of the table,
the percentages should be calculated in columns. That is, the numbers in each
column should be expressed as a percentage of that column’s total.

2D            The two-way frequency table

1 In a survey, 139 women and 102 men were asked whether they approved or disapproved
of a proposed freeway. Thirty-seven women and 79 men approved of the freeway.
Display these data in a two-way table (not as percentages). (See Worked Example 4.)
2 Students at a secondary school were asked whether the length of lessons should be
45 minutes or 1 hour. Ninety-three senior students (Years 10–12) were asked and 60
preferred 1-hour lessons, whereas of the 86 junior students (Years 7–9), 36 preferred
1-hour periods. Display these data in a two-way table (not as percentages).
3 For each of the following two-way frequency tables, complete the entries.
a      Attitude       Female      Male       Total

For               25             i     47

Against           ii         iii       iv

Total             51         v         92

b      Attitude       Female      Male       Total

For               i          ii        21

Against           iii       21         iv

Total             v         30         63

c      Party preference          Female      Male

Labor                        i        42%

Liberal                    53%         ii

Total                       iii        iv

4 Sixty single men and women were asked whether they prefer to live alone, or to share
accommodation with friends. The results are shown below. (See Worked Example 5.)

Rent preference        Men   Women   Total
Live alone              12      23      35
Share with friends       9      16      25
Total                   21      39      60

Convert the numbers in this table to percentages.

The information in the following
two-way frequency table relates to
questions 5 and 6. The data show the
technical staff to an upgrade of the
computer systems at a large
corporation.

Attitude          staff             staff        Total
For                  53               98          151
Against              37               31             68
Total                90              129          219

5 multiple choice
From the above table, we can conclude that:
6 multiple choice
From the above table, we can conclude that:
A 98% of technical staff were for the upgrade
B 65% of technical staff were for the upgrade
C 76% of technical staff were for the upgrade
D 31% of technical staff were against the upgrade
E 14% of technical staff were against the upgrade
7 (Worked Example 6) Delegates at the respective Liberal Party and Australian Labor Party conferences were
surveyed on whether or not they believed that marijuana should be legalised. Sixty-two
Liberal delegates were surveyed and 40 were against legalisation. Seventy-one Labor
delegates were surveyed and 43 were against legalisation.
Present the data in percentage form in a two-way frequency table. Comment on any
differences between the reactions of the Liberal and Labor delegates.
8 Sixty-one union workers were surveyed and asked whether the number of public
holidays should be reduced. Thirty-ﬁve supported a reduction. Fifty-nine non-union
workers were also asked and 31 supported a reduction.
Present the data in percentage form in a two-way frequency table. Comment on any
difference between the reactions of the union and non-union workers.

The scatterplot
We often want to know if there is some sort of relationship between two numerical
variables. A scatterplot, which gives a visual display of the relationship between two
variables, provides a good starting point.
Consider the data obtained from last year’s 12B class at Northbank Secondary College.
Each student in this class of 29 students was asked to give an estimate of the
average number of hours of study per week they did during Year 12. They were also
asked the TER score they obtained.

Average                            Average                               Average
hours        TER                   hours       TER                       hours     TER
of study      score                of study     score                    of study   score
18          59                     14          54                       17        59
16          67                     17          72                       16        76
22          74                     14          63                       14        59
27          90                     19          72                       29        89
15          62                     20          58                       30        93
28          89                     10          47                       30        96
18          71                     28          85                       23        82
19          60                     25          75                       26        35
22          84                     18          63                       22        78
30          98                     19          61

The figure below shows the data plotted on a scatterplot.

[Figure: scatterplot of TER score (vertical axis, 40 to 100) against average number of hours of study per week (horizontal axis, 10 to 30). The point (26, 35) is labelled.]

It is reasonable to think that the number of hours of study put in each week by students would affect their TER scores, and so the number of hours of study per week is the independent variable and appears on the horizontal axis. The TER score is the dependent variable and appears on the vertical axis.
There are 29 points on the scatterplot. Each point represents the hours studied and the TER score of one student.
In analysing the scatterplot we look for a pattern in the way the points lie. Certain patterns tell us that certain relationships exist between the two variables. This is referred to as correlation. We look at what type of correlation exists and how strong it is.
In the figure above we see some sort of pattern: the points are spread in a rough corridor from bottom left to top right. We refer to data following such a direction as having a positive relationship. This tells us that as the average number of hours studied per week increases, the TER score increases.
The point (26, 35) is an outlier. It stands out because it is well away from the other points and clearly is not part of the ‘corridor’ referred to above. This outlier may have occurred because a student worked very hard but found the VCE pretty tough, or perhaps the student exaggerated the number of hours he or she worked in a week, or perhaps there was a recording error. This needs to be checked.
We could describe the rest of the data as having a linear form, as the straight line in the diagram below indicates.

[Figure: the same scatterplot of TER score against average number of hours of study per week, with a straight line drawn through the corridor of points.]

When describing the relationship between two variables displayed on a scatterplot, we need to comment on:
(a) the direction — whether it is positive or negative
(b) the form — whether it is linear or non-linear
(c) the strength — whether it is strong, moderate or weak.
Below is a gallery of scatterplots showing the various patterns we look for.

[Figure: gallery of nine scatterplots in three rows.
Row 1: weak, positive linear relationship; moderate, positive linear relationship; strong, positive linear relationship.
Row 2: weak, negative linear relationship; moderate, negative linear relationship; strong, negative linear relationship.
Row 3: perfect, negative linear relationship; no relationship; perfect, positive linear relationship.]

WORKED Example 7
The scatterplot below shows the number of hours people spend at work each week and the number of hours people get to spend on recreational activities during the week.

[Figure: scatterplot of hours for recreation (vertical axis, 5 to 25) against hours worked (horizontal axis, 10 to 70).]

Decide whether or not a relationship exists between the variables and, if it does, comment on whether it is positive or negative; weak, moderate or strong; and whether or not it has a linear form.

THINK
(a) The points on the scatterplot are spread in a certain pattern, namely in a rough corridor from the top left to the bottom right corner. This tells us that as the work hours increase, the recreation hours decrease.
(b) The corridor is straight (that is, it would be reasonable to fit a straight line into it).
(c) The points are not too tight and not too dispersed either.
(d) The pattern resembles the central diagram in the gallery of scatterplots shown previously.

WRITE
There is a moderate, negative linear relationship between the two variables.

WORKED Example 8
Data giving the average weekly number of hours studied by each student in 12B
at Northbank Secondary College and the corresponding height of each student
(to the nearest tenth of a metre) are given in the table below.

Average hours  Height    Average hours  Height    Average hours  Height    Average hours  Height
of study       (m)       of study       (m)       of study       (m)       of study       (m)
18             1.5       19             2.0       20             1.9       16             1.6
16             1.9       22             1.9       10             1.9       14             1.9
22             1.7       30             1.6       28             1.5       29             1.7
27             2.0       14             1.5       25             1.7       30             1.8
15             1.9       17             1.7       18             1.8       30             1.5
28             1.8       14             1.8       19             1.8       23             1.5
18             2.1       19             1.7       17             2.1       22             2.1
Construct a scatterplot for the data and use it to comment on the direction, form and
strength of any relationship between the number of hours studied and the height of the
students.
THINK
1 Construct the scatterplot. In this case it is almost impossible to decide which is the independent variable and which is the dependent variable, and therefore on which axis we will place the variables. In such cases, placing either variable on either axis is reasonable.
2 The scatterplot can be constructed using a graphics calculator:
(a) Press Y= and CLEAR any functions.
(b) Press 2nd [STAT PLOT] and select 4:PlotsOff. Press ENTER.
(c) Press STAT and select 1:Edit. Press ENTER.
(d) Clear any existing lists and enter the list of hours of study in L1 and the list of heights in L2.
(e) Press 2nd [STAT PLOT] and select 1:Plot 1.
(f) Press ENTER to turn the plot ON, and select the first icon, which indicates a scatterplot.
(g) For Xlist, select L1 and for Ylist select L2, and select the first symbol in Mark.
(h) Press ZOOM and select 9:ZoomStat.
(i) Press ENTER to see the scatterplot.
3 Comment on the direction of any relationship.
4 Comment on the form of the relationship.
5 Comment on the strength of any relationship.

WRITE/DISPLAY
1 [Figure: scatterplot of height in metres (vertical axis, 1.4 to 2.2) against average number of hours studied each week (horizontal axis, 10 to 30).]
3 There is no relationship; the points appear to be randomly placed.
4 There is no form, no linear trend, no quadratic trend, just a random placement of points.
5 Since there is no relationship, strength is not relevant.

Clearly, the number of hours you study for your VCE has no effect on how tall you
might be!
Note that when working with the scatterplot, to change settings at any time use
WINDOW. To identify the coordinates of individual points, use the TRACE key with
the arrow keys.
remember
1. When we are investigating if there is any sort of relationship between two
numerical variables, a scatterplot provides a useful starting point. It gives a
visual display of the relationship between two such variables.
2. In analysing the scatterplot we look for a pattern in the way the points lie.
Certain patterns tell us that certain relationships exist between the two
variables. This is referred to as a correlation. We look at what type of
correlation exists and how strong it is.
3. When describing the relationship between two variables displayed on a
scatterplot, we need to comment on:
(a) the direction — whether it is positive or negative
(b) the form — whether it is linear or non-linear
(c) the strength — whether it is strong, moderate or weak.

2E          The scatterplot

Have your graphics calculator at hand for the following exercise questions.
1 For each of the following pairs of variables, write down whether or not you would
reasonably expect a relationship to exist between the pair and, if so, comment on
whether it would be a positive or negative association.
a Time spent in a supermarket and money spent
b Income and value of car driven
c Number of children living in a house and time spent cleaning the house
d Age and number of hours of competitive sport played per week
e Amount spent on petrol each week and distance travelled by car each week
f Number of hours spent in front of a computer each week and time spent playing the
piano each week
g Amount spent on weekly groceries and time spent gardening each week
2 (Worked Example 7) For each of the scatterplots below, describe whether or not a relationship exists between
the variables and, if it does, comment on whether it is positive or negative, whether it is
weak, moderate or strong and whether or not it has a linear form.

[Figures: six scatterplots, a to f.
a Haemoglobin count against age
b Fitness level against cigarettes smoked
c Marks at school (%) against weekly hours of study
d Weekly expenditure on gardening magazines ($) against hours spent gardening per week
e Hours spent using a computer per week against hours spent cooking per week
f Time under water (s) against age]

3 multiple choice
From the scatterplot shown at right, it would be reasonable to observe that:
A as the value of x increases, the value of y increases
B as the value of x increases, the value of y decreases
C as the value of x increases, the value of y remains the same
D as the value of x remains the same, the value of y increases
E there is no relationship between x and y

4 (Worked Example 8) The population of a municipality (to the nearest hundred thousand) together with
the number of primary schools in that particular municipality is given below for
11 municipalities.

Population (000)         110  130  130  140  150  160  170  170  180  180  190
No. of primary schools     4    4    6    5    6    8    6    7    8    9    8

Construct a scatterplot for the data and use it to comment on the direction, form and
strength of any relationship between the population and the number of primary
schools.

5 The table below contains data giving the time taken for a paving job and the cost of the job.
Time taken (hours)      5     7     5     8    10    13    15    20    18    25    23
Cost of job ($)      1000  1000  1500  1200  2000  2500  2800  3200  2800  4000  3000
Construct a scatterplot for the data. Comment on whether a relationship exists between
the time taken and the cost. If there is a relationship, describe it.

6 The table below shows the time of booking (how many days in advance) of the tickets
for a musical performance and the corresponding row number in A-Reserve.

Time of Row             Time of Row
booking No.             booking No.
5        15            20       10
6        15            21         8
7        15            22         5
7        14            24         4
8        14            25         3
11         13            28         2
13         13            29         2
14         12            29         1
14         10            30         1
17         11            31         1

Construct a scatterplot for the data. Comment on whether a relationship exists between
the time of booking and the number of the row and, if there is a relationship, describe it.
The q-correlation coefficient
The q-correlation coefficient is a measure of the strength of the association between
two variables. In the previous section we estimated the strength of association by
looking at a scatterplot and forming a judgment about whether the correlation between
the variables was positive or negative and whether the correlation was weak, moderate
or strong. The calculation of the q-correlation coefficient aids us considerably in
making that judgment.
To calculate the q-correlation coefficient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values. (If there are n points, the median is located at the $\frac{n+1}{2}$th place.) Draw a vertical line through this median value.
Step 3. Locate the median of the y-values and draw a horizontal line through this median value.
Step 4. The scatterplot is now divided into 4 sections or quadrants (hence the name ‘q’-correlation coefficient).
(a) Label these sections A, B, C and D.
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) The number of points in section A is denoted by a, the number of points in section B is denoted by b, and so on.
[Figure: x-y axes divided by the two median lines into quadrants, with B top left, A top right, C bottom left and D bottom right.]
Step 5. Calculate the q-correlation coefficient using the formula:

$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
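
Steps 2 to 5 above can be sketched in a few lines of Python for readers who want to check a hand count. This is only an illustrative sketch: the function name `q_correlation` and the data in `sample` are made up for this example and are not from the text. It assumes quadrant A is top right, B top left, C bottom left and D bottom right, and it takes the median of an even number of values to be halfway between the two middle values.

```python
from statistics import median

def q_correlation(points):
    """Return (a, b, c, d, q) for a list of (x, y) pairs."""
    x_med = median(x for x, _ in points)   # Step 2: median of the x-values
    y_med = median(y for _, y in points)   # Step 3: median of the y-values
    a = b = c = d = 0
    for x, y in points:                    # Step 4: count points per quadrant
        if x == x_med or y == y_med:
            continue                       # points on a median line are not counted
        if x > x_med and y > y_med:
            a += 1                         # A: top right
        elif x < x_med and y > y_med:
            b += 1                         # B: top left
        elif x < x_med and y < y_med:
            c += 1                         # C: bottom left
        else:
            d += 1                         # D: bottom right
    total = a + b + c + d
    if total == 0:
        raise ValueError("every point lies on a median line")
    q = ((a + c) - (b + d)) / total        # Step 5: the formula for q
    return a, b, c, d, q

# Hypothetical data, chosen only to exercise all four quadrants.
sample = [(1, 2), (2, 6), (3, 4), (4, 3), (5, 6), (6, 5), (7, 1)]
print(q_correlation(sample))  # (2, 1, 1, 1, 0.2)
```

In `sample` both medians are 4, so the points (3, 4) and (4, 3) lie on a median line and are left out of the count.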

WORKED Example 9
Calculate the q-correlation coefficient for the data shown in the scatterplot below.

[Figure: scatterplot of 15 points on x-y axes.]

THINK
1 (a) Locate the median of the x-values. Note that we are talking here about the x-values of the data observations given. In the scatterplot shown there are 15 points. Each point has an x-value and a y-value. To find the median of the x-values we look for the horizontal middle point; that is, we look for the $\frac{15 + 1}{2} = 8$th point from the left (from the right, the point will be the same).
(b) Draw a vertical line through this median value. Note that there are 7 points to the right of this line and 7 to the left.
2 (a) Locate the median of the y-values. This is done in a similar way to finding the median of the x-values except, instead of counting from the left or right, we count from the top or bottom to find the 8th point.
(b) Draw a horizontal line through this median value. Note that there are 7 points above this line and 7 below.
3 (a) Label the quadrants A, B, C and D.
(b) Count the number of points in each section. Do not count points that are on the lines.
4 Write the formula for calculating the q-coefficient.
5 Substitute the values of a, b, c and d into the formula and evaluate.

WRITE
1 [Figure: the scatterplot with a vertical line drawn through the median of the x-values.]
2 [Figure: the scatterplot with a horizontal line also drawn through the median of the y-values.]
3 [Figure: the scatterplot with quadrants labelled B (top left, b = 0), A (top right, a = 6), C (bottom left, c = 6) and D (bottom right, d = 1).]
a = 6, b = 0, c = 6, d = 1
4 $$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
5 $$q = \frac{(6 + 6) - (0 + 1)}{6 + 0 + 6 + 1} = \frac{11}{13} = 0.85 \text{ (correct to 2 decimal places)}$$

The value of the q-correlation coefﬁcient in the above example indicates a strong
correlation. The diagram below gives a rough guide to the strength of the correlation
suggested by the value of q.

Value of q                Strength of association
0.75 to 1                 Strong positive association
0.5 to 0.75               Moderate positive association
0.25 to 0.5               Weak positive association
–0.25 to 0.25             No association
–0.5 to –0.25             Weak negative association
–0.75 to –0.5             Moderate negative association
–1 to –0.75               Strong negative association
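
The rough guide above can be written as a small lookup, sketched here in Python. The function name `describe_q` is made up for this illustration. The guide does not say which band a boundary value such as 0.25, 0.5 or 0.75 belongs to; this sketch assumes a boundary value takes the stronger description, which agrees with Worked Example 10, where q = –0.5 is described as moderate.

```python
def describe_q(q):
    """Describe a q value in words, following the rough guide.

    Assumption: a boundary value (0.25, 0.5, 0.75) takes the stronger label.
    """
    if not -1 <= q <= 1:
        raise ValueError("q must lie between -1 and 1")
    size = abs(q)
    if size < 0.25:
        return "no association"
    if size < 0.5:
        strength = "weak"
    elif size < 0.75:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if q > 0 else "negative"
    return f"{strength} {direction} association"

print(describe_q(0.85))  # strong positive association
print(describe_q(-0.5))  # moderate negative association
```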
The scatterplots below show three special values of the q-correlation coefﬁcient.
[Figure: three scatterplots (left, centre and right), each divided into quadrants B (top left), A (top right), C (bottom left) and D (bottom right).]

Left: $$q = \frac{(8 + 8) - (0 + 0)}{8 + 0 + 8 + 0} = 1$$
Centre: $$q = \frac{(0 + 0) - (8 + 8)}{0 + 8 + 0 + 8} = -1$$
Right: $$q = \frac{(3 + 3) - (3 + 3)}{3 + 3 + 3 + 3} = 0$$
The sign of the q-value indicates the direction of the relationship; that is, whether
there is a negative or positive association.
In the cases shown above left and centre, the q-values are at both extremes. That is,
q = 1 and −1 respectively. We would describe the variables as showing a very strong
association. Having said that, the points are not showing a strong linear form or, for
that matter, any linear form.
The q-correlation coefﬁcient merely gives us an idea of which quadrants contain the
most points; but beyond that, the points can be in any position in the quadrants. In that
sense, the q-correlation coefﬁcient is a rather blunt instrument.

WORKED Example 10
An investigation was made into the relationship between the time spent
watching television in the week preceding a Maths test and the mark obtained
(out of 20) in that Maths test. The following data were recorded.

Time (h)          Mark              Time (h)         Mark       Time (h)        Mark
4              15                 10              8           12            10
5              16                 20              5            5             8
5              20                   5            12           20             8
10               12                 15              4           15            10
15                8                 15             12           20            10
Draw a scatterplot and calculate the q-correlation coefﬁcient. Comment on the
relationship between the two variables.
THINK
1 Draw a scatterplot. We can use a graphics calculator to draw the scatterplot.
(a) On the lists screen (press STAT, select EDIT and 1:Edit), enter the two lists of data into L1 and L2.
(b) Press 2nd [STAT PLOT] and select 4:PlotsOff.
(c) Press ENTER.
(d) Press 2nd [STAT PLOT] and select 1:Plot1.
(e) Select On, and for Type, select the first icon (scatterplot).
(f) For Xlist, type in L1 (use 2nd [L1]); for Ylist, type in L2; for Mark, select the first symbol.
(g) Press ZOOM and select 9:ZoomStat. The display now shows the scatterplot.
2 We can also use the graphics calculator to help calculate q.
(a) Press 2nd [QUIT].
(b) Press 2nd [DRAW] and select 4:Vertical.
(c) Press 2nd [LIST] and go to the MATH menu.
(d) Select 4:median(.
(e) Type L1 (use 2nd [L1]) at the prompt, then ENTER, and the scatterplot appears with the vertical median line drawn.
(f) Similarly, to create the horizontal median line, press 2nd [QUIT].
(g) Press 2nd [DRAW] and select 3:Horizontal.
(h) Press 2nd [LIST] and from the MATH menu select 4:median(.
(i) Type L2 at the prompt, press ENTER and the scatterplot appears with the horizontal median line drawn as well.
3 Count and record the number of points in each quadrant.
4 Write the formula for calculating the q-correlation coefficient.
5 Substitute the values of a, b, c and d into the formula and evaluate.
6 Comment on the relationship.

WRITE/DISPLAY
1 [Figure: scatterplot of Maths mark (vertical axis, 4 to 20) against time watching TV in hours (horizontal axis, 5 to 25).]
2 [Figures: calculator screens showing the scatterplot with the vertical and horizontal median lines drawn.]
3 a = 1, b = 5, c = 2, d = 4
4 $$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
5 $$q = \frac{(1 + 2) - (5 + 4)}{1 + 5 + 2 + 4} = -\frac{6}{12} = -0.5$$
6 There is a moderate, negative association between the hours of television watched and the Maths mark obtained.
The negative association means that as the number of hours of television watched prior to the test increased, the marks in the Maths test decreased. The moderate association suggests that it may be worth further investigating the association.
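
As a cross-check on the calculator working in Worked Example 10, the quadrant counts can also be tallied in Python. This is a sketch rather than part of the worked solution: the names `data`, `quadrant` and `counts` are made up here, and the data pairs are copied from the table in the example.

```python
from collections import Counter
from statistics import median

# (time in hours, mark) pairs copied from the table in Worked Example 10
data = [(4, 15), (5, 16), (5, 20), (10, 12), (15, 8),
        (10, 8), (20, 5), (5, 12), (15, 4), (15, 12),
        (12, 10), (5, 8), (20, 8), (15, 10), (20, 10)]

x_med = median(t for t, _ in data)   # median time
y_med = median(m for _, m in data)   # median mark

def quadrant(t, m):
    """Return 'A', 'B', 'C' or 'D', or None for a point on a median line."""
    if t == x_med or m == y_med:
        return None
    if m > y_med:
        return "A" if t > x_med else "B"
    return "D" if t > x_med else "C"

counts = Counter(quadrant(t, m) for t, m in data)
a, b, c, d = (counts[k] for k in "ABCD")
q = ((a + c) - (b + d)) / (a + b + c + d)

print(x_med, y_med)   # 12 10
print(a, b, c, d, q)  # 1 5 2 4 -0.5
```

Three of the 15 points, (12, 10), (15, 10) and (20, 10), lie on a median line, which is why only 12 points appear in the denominator.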
remember
1. The q-correlation coefﬁcient is a measure of the strength of the association
between two variables.
2. To calculate the q-correlation coefﬁcient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values and draw a vertical line through this
median value.
Step 3. Locate the median of the y-values and draw a horizontal line through
this median value.
Step 4. (a) Label the sections thus formed A, B, C and D (A is top right, B top left, C bottom left and D bottom right).
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) (The number of points in section A is denoted by a, and so on.)
Step 5. Calculate the q-correlation coefficient using the formula:

$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
3. The sign of the q-value indicates the direction of the relationship (whether
there is a negative association or a positive association) while the size of it
indicates the strength (whether the relationship is strong, moderate or weak).
4. The q-correlation coefﬁcient gives us an idea of into which quadrants the points
fall, but beyond that the points can be in any position in the quadrants. In that
sense, the q-correlation coefﬁcient is a rather blunt instrument.

2F          The q-correlation coefﬁcient

1 (Worked Example 9) Calculate the q-correlation coefficient for each of the sets of data shown on the scatterplots below.

[Figures: six scatterplots, a to f, each drawn on x-y axes.]

2 (Worked Example 10) The data given in the table below show the results of an investigation into
the mass and the height of a certain breed of dog.
a Draw a scatterplot and calculate the q-correlation coefficient.
b Comment on the relationship between the height and the mass of this
breed of dog.

Height (cm)   41    40   35   38    43    44   37   39   42   44   31
Mass (kg)     4.5   5    4    3.5   5.5   5    5    4    4    6    3.5
3 The data in the table below show the number of hours spent by students who are
learning touch-typing and their corresponding speed in words per minute (wpm).
a Using a graphics calculator or otherwise, calculate the q-correlation coefﬁcient for
these data.
b Comment on the relationship between the number of hours spent on learning and
the speed of typing.
Time (h)       20   33   22   39   40   37   46   44   24   36   50   48   29
Speed (wpm)    34   46   38   53   52   49   60   58   36   42   65   63   40

4 multiple choice
The q-correlation coefficient for data shown in the scatterplot at right is:
A $-\frac{1}{11}$    B $-\frac{1}{9}$    C $-\frac{5}{11}$    D $\frac{5}{9}$    E $\frac{9}{9}$
5 multiple choice
A researcher calculates the q-correlation coefficient for the relationship between time
(in days) and the diameter (measured in mm) of a crystal that is changing in size. The
value is 0.82. Based on this, the correlation between time and the diameter of the
crystal could be described as:
A strong and negative
B strong and positive
C weak and positive
D weak and negative
E moderate and positive
Pearson’s product–moment correlation coefficient
We saw in the previous exercise that the q-correlation coefﬁcient was a rather blunt
instrument for measuring correlation between variables. A more precise tool is
Pearson’s product–moment correlation coefﬁcient. This coefﬁcient is used to measure
the strength of linear relationships between variables; the q-correlation coefﬁcient, on
the other hand, can be used for both linear and non-linear relationships.
Pearson’s coefﬁcient is therefore more specialised and can give us a much more
precise picture of the strength of the linear relationship between two variables.
The symbol for Pearson’s product–moment correlation coefﬁcient is r.
Below is a gallery of scatterplots with the corresponding value of r for each.

[Figure: gallery of nine scatterplots with r = 1, r = –1, r = 0, r = 0.7, r = –0.5, r = –0.9, r = 0.8, r = 0.3 and r = –0.2.]

The two extreme values of r (1 and −1) are shown in the first two diagrams respectively.
It is interesting to compare these two scatterplots with those showing extreme
values (1 and −1) of q.

[Figure: four scatterplots, labelled q = 1, r = 1, q = –1 and r = –1.]

In the four diagrams above, the scatterplots that show matching values of q and r are placed side by side. We see just how differently the points on the scatterplots are arranged and note from this that the r value gives us a much sharper impression of the relationship between the variables. That is, a value of r = 1 means that there is perfect linear association between the variables, which is not necessarily the case when q = 1!

Value of r                Strength of association
0.75 to 1                 Strong positive linear association
0.5 to 0.75               Moderate positive linear association
0.25 to 0.5               Weak positive linear association
–0.25 to 0.25             No linear association
–0.5 to –0.25             Weak negative linear association
–0.75 to –0.5             Moderate negative linear association
–1 to –0.75               Strong negative linear association
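
The contrast between q = 1 and r = 1 can be checked numerically. The Python sketch below uses two hypothetical data sets invented for this illustration (`scattered` and `on_a_line`). The helper `pearson_r` uses the standard product–moment formula, which this section has not yet introduced, so treat it here simply as a way of obtaining r.

```python
from math import sqrt
from statistics import median

def q_value(points):
    """q-correlation coefficient; points on a median line are not counted."""
    x_med = median(x for x, _ in points)
    y_med = median(y for _, y in points)
    agree = differ = 0
    for x, y in points:
        if x == x_med or y == y_med:
            continue
        if (x > x_med) == (y > y_med):
            agree += 1    # quadrant A or C
        else:
            differ += 1   # quadrant B or D
    return (agree - differ) / (agree + differ)

def pearson_r(points):
    """Pearson's r from the standard product-moment formula."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: every point is in quadrant A or C, but not on one line.
scattered = [(1, 4), (2, 3), (3, 2), (4, 1), (6, 9), (7, 8), (8, 7), (9, 6)]
# Hypothetical data: the same x-values placed exactly on the line y = 2x + 1.
on_a_line = [(x, 2 * x + 1) for x in (1, 2, 3, 4, 6, 7, 8, 9)]

print(q_value(scattered), round(pearson_r(scattered), 2))   # 1.0 0.67
print(q_value(on_a_line), round(pearson_r(on_a_line), 2))   # 1.0 1.0
```

Both data sets give q = 1, but only the points that lie exactly on a straight line give r = 1.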

In describing the strength of the relationship between the variables, the rough guide
we used with the q-correlation coefﬁcient can also be used with Pearson’s coefﬁcient.
The difference, of course, is that the value of r gives us a measure of the strength of
linear relationships speciﬁcally.

WORKED Example 11
For each of the following:
i   Estimate the value of Pearson’s product–moment correlation coefﬁcient (r) from the
scatterplot.
ii Use this to comment on the strength and direction of the relationship between the two
variables.

[Figure: three scatterplots, labelled a, b and c]

THINK                                           WRITE

a 1 Compare these scatterplots with those       a i  r ≈ 0.9
    in the gallery of scatterplots shown
    previously and estimate the value of r.
  2 Comment on the strength and direction         ii The relationship can be described as a
    of the relationship.                             strong, positive, linear relationship.
b Repeat steps 1 and 2 as in a.                 b i  r ≈ −0.7
                                                  ii The relationship can be described as a
                                                     moderate, negative, linear relationship.
c Repeat steps 1 and 2 as in a.                 c i  r ≈ −0.1
                                                  ii There is no linear relationship.

Note that the symbol ≈ means ‘approximately equal to’. We use it instead of the = sign
to emphasise that the value (in this case r) is only an estimate.
In completing the worked example above, we notice that estimating the value of
r from a scatterplot is rather like making an informed guess. In the next section of
work we will see how to obtain the actual value of r.

remember
1. Pearson’s product–moment correlation coefﬁcient is used to measure the
strength of a linear relationship between two variables.
2. The symbol for Pearson’s product–moment correlation coefﬁcient is r.
3. An estimate of r can be obtained from the scatterplot.
Exercise 2G   Pearson’s product–moment correlation coefficient
1 What type of linear relationship does each of the following values of r suggest?
a 0.21                b 0.65               c −1                  d −0.78
e 1                   f 0.9                g −0.34               h −0.1
2 (Worked example 11) For each of the following:
i Estimate the value of Pearson’s product–moment correlation coefficient (r) from
the scatterplot.
ii Use this to comment on the strength and direction of the relationship between the
two variables.
[Figure: eight scatterplots, labelled a to h]

3 multiple choice
A set of data relating the variables x and y is found to have an r value of 0.62. The
scatterplot that could represent the data is:
[Figure: five scatterplots, labelled A to E]

4 multiple choice
A set of data relating the variables x and y is found to have an r value of −0.45. A true
statement about the relationship between x and y is:
A There is a strong linear relationship between x and y and when the x-values
increase, the y-values tend to increase also.
B There is a moderate linear relationship between x and y and when the x-values
increase, the y-values tend to increase also.
C There is a moderate linear relationship between x and y and when the x-values
increase, the y-values tend to decrease.
D There is a weak linear relationship between x and y and when the x-values increase,
the y-values tend to increase also.
E There is a weak linear relationship between x and y and when the x-values increase,
the y-values tend to decrease.

Calculating r and the coefﬁcient of
determination
Pearson’s product–moment correlation coefﬁcient
The formula for calculating Pearson’s correlation coefﬁcient r is as follows:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
where $n$ is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values.
The calculation of r by hand using this formula is unnecessary. The calculation of
r is done far more efﬁciently using a graphics calculator.
There are two important limitations on the use of r. First, since r measures the
strength of a linear relationship, it would be inappropriate to calculate r for data which
are not linear — for example, data which a scatterplot shows to be in a quadratic form.
Second, outliers can bias the value of r. Consequently, if a set of linear data contains
an outlier, then r is not a reliable measure of the strength of that linear relationship.
The calculation of r is applicable to sets of bivariate data which are known to be
linear in form and which do not have outliers.
With those two provisos, it is good practice to draw a scatterplot for a set of data to
check for a linear form and an absence of outliers before r is calculated. Having a
scatterplot in front of you is also useful because it enables you to estimate what the
value of r will be (as you did in exercise 2G), and thus you can check that your workings
on the calculator are correct.

WORKED Example 12
The heights (in centimetres) of 21 football players were recorded against the number of
marks they took in a game of football. The data are shown in the table below.
Height (cm)   Number of marks taken      Height (cm)   Number of marks taken
184           6                          182           7
194           11                         185           5
185           3                          183           9
175           2                          191           9
186           7                          177           3
183           5                          184           8
174           4                          178           4
200           10                         190           10
188           9                          193           12
184           7                          204           14
188           6
a Construct a scatterplot for the data.
b Comment on the correlation between the heights of players and the number of marks
that they take, and estimate the value of r.
c Calculate r and use it to comment on the relationship between the heights of players
and the number of marks they take in a game.

THINK                                           WRITE/DISPLAY

a Using a graphics calculator, construct a      a [Calculator screen: scatterplot of the data]
  scatterplot. Refer to worked example 8 in
  the section on scatterplots for directions
  on how to use the graphics calculator to
  draw a scatterplot.

b Comment on the correlation between the        b The data show what appears to be a linear
  variables and estimate the value of r.          form of moderate strength.
                                                  We might expect r ≈ 0.6.

c 1 Because there is a linear form and there    c
    are no outliers, the calculation of r is
    appropriate.
    Calculate r, using a graphics calculator.
    The lists are in place from the
    scatterplot.
    First press 2nd [CATALOG], select
    DiagnosticOn and press ENTER.
    Press STAT and select CALC and
    4:LinReg(ax+b).
    Press ENTER.
    LinReg(ax+b) appears. Type L1, L2.
    Press ENTER.                                  r = 0.86
  2 The value of r = 0.86 indicates a strong      There is a strong positive linear association
    positive linear relationship.                 between the height of a player and the
                                                  number of marks he takes in a game. That is,
                                                  the taller the player the more marks we
                                                  might expect him to take.
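
For readers who want to check the calculator result against the formula itself, the following Python sketch applies the formula for r directly to the data of worked example 12. It is an illustration under stated assumptions rather than part of the calculator method: the function name `pearson_r` is hypothetical, and the standard deviations are the sample standard deviations (divisor n − 1) supplied by Python's `statistics.stdev`.

```python
from statistics import mean, stdev

# Data from worked example 12: heights (cm) and number of marks taken.
heights = [184, 194, 185, 175, 186, 183, 174, 200, 188, 184, 188,
           182, 185, 183, 191, 177, 184, 178, 190, 193, 204]
marks = [6, 11, 3, 2, 7, 5, 4, 10, 9, 7, 6,
         7, 5, 9, 9, 3, 8, 4, 10, 12, 14]


def pearson_r(xs, ys):
    """Sum the products of standardised x- and y-values, then divide by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


r = pearson_r(heights, marks)
print(round(r, 2))
```

Rounded to two decimal places, the result should agree with the value r = 0.86 obtained on the calculator.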

Correlation and causation
In worked example 12 we saw that r = 0.86. While we are entitled to say that there is a
strong association between the height of a footballer and the number of marks he takes,
we cannot assert that the height of a footballer causes him to take a lot of marks. Being
tall might assist in the taking of marks, but there will be many other factors which
come into play — for example skill level, accuracy of passes from teammates, abilities
of the opposing team, and so on.
So, while establishing a high degree of correlation between two variables is very
interesting and can often ﬂag the need for further, more detailed investigation, it in no
way gives us any basis to comment on whether or not one variable causes particular
values in another variable.

The coefﬁcient of determination
The coefficient of determination is given by r². It is very easy to calculate:
we merely square Pearson’s product–moment correlation coefficient (r).
1. The coefﬁcient of determination is useful when we have two variables which
have a linear relationship. It tells us the proportion of variation in one variable
which can be explained by the variation in the other variable.
2. The coefﬁcient of determination provides a measure of how well the linear rule
linking the two variables (x and y) predicts the value of y when we are given the
value of x.

WORKED Example 13
A set of data giving the number of police trafﬁc patrols on duty and the number of
fatalities for the region was recorded and a correlation coefﬁcient of r = −0.8 was found.
Calculate the coefﬁcient of determination and interpret its value.
THINK                                           WRITE

1 Calculate the coefficient of determination    Coefficient of determination = r²
  by squaring the given value of r.                                          = (−0.8)²
                                                                             = 0.64
2 Interpret your result.                        We can conclude from this that 64% of the
                                                variation in the number of fatalities can be
                                                explained by the variation in the number of
                                                police traffic patrols on duty. This means
                                                that the number of police traffic patrols on
                                                duty is a major factor in predicting the
                                                number of fatalities.
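
The arithmetic of worked example 13 can be reproduced in a few lines of Python. This is only a sketch: the variable names are hypothetical, and the rounding is there simply to hide floating-point noise when the result is displayed.

```python
r = -0.8  # correlation coefficient given in worked example 13

# The coefficient of determination is the square of r.
coefficient_of_determination = r ** 2

# Express it as a percentage of variation explained, rounded so that
# floating-point noise does not show in the display.
percentage_explained = round(coefficient_of_determination * 100, 1)

print(f"r^2 = {coefficient_of_determination:.2f}, "
      f"about {percentage_explained:.0f}% of the variation explained")
```

Squaring removes the sign, so r = 0.8 would give the same coefficient of determination; the direction of the relationship has to be read from r itself.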

remember
1. The formula for calculating Pearson’s correlation coefﬁcient r is as follows:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
where $n$ is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values.
2.   The calculation of r by hand using this formula is unnecessary. The calculation
of r is done far more efﬁciently using a graphics calculator.
3.   The calculation of r is applicable to sets of bivariate data which are known to
be linear in form and which do not have outliers.
4.   Even if we ﬁnd that two variables have a very high degree of correlation, for
example r = 0.95, we cannot say that the value of one variable is caused by the
value of the other variable.
5.   The coefficient of determination = r².
6.   The coefﬁcient of determination is useful when we have two variables which
have a linear relationship. It tells us the proportion of variation in one variable
which can be explained by the variation in the other variable.
Exercise 2H   Calculating r and the coefficient of determination

1 (Worked example 12) The yearly salary (\$’000) and the number of votes polled in the
Brownlow medal count are given below for 10 leading footballers.

Yearly salary (\$’000)   180   200   160   250   190   210   170   150   140   180
Number of votes          24    15    33    10    16    23    14    21    31    28

a Construct a scatterplot for the data.
b Comment on the correlation of salary and the number of votes and make an
estimate of r.
c Calculate r and use it to comment on the relationship between yearly salary and
the number of votes.

2 (Worked example 13) A set of data, obtained from 40 smokers, gives the number of
cigarettes smoked per day and the number of visits per year to the doctor. The Pearson’s
correlation coefficient for these data was found to be 0.87. Calculate the coefficient of
determination for the data and interpret its value.

3 Data giving the annual advertising budgets (\$’000) and the yearly proﬁt increases (%)
of 8 companies are shown below.

Annual advertising budget (\$’000)   11    14    15    17    20    25    25    27
Yearly profit increase (%)           2.2   2.2   3.2   4.6   5.7   6.9   7.9   9.3

a Construct a scatterplot for these data.
b Comment on the correlation of the advertising budget and proﬁt increase and make
an estimate of r.
c Calculate r.
d Calculate the coefﬁcient of determination.
e Write down the proportion of the variation in the yearly proﬁt increase that can be
explained by the variation in the advertising budget.

4 Data showing the number of tourists visiting a small country in a month and the
corresponding average monthly exchange rate for the country’s currency against the
American dollar are given below.

Number of tourists (’000)   2     3     4     5     7     8     8     10
Exchange rate               1.2   1.1   0.9   0.9   0.8   0.8   0.7   0.6

a Construct a scatterplot for the data.
b Comment on the correlation between the number of tourists and the exchange rate
and give an estimate of r.
c Calculate r.
d Calculate the coefﬁcient of determination.
e Write down the proportion of the variation in the number of tourists that can be
explained by the exchange rate.

5 Data showing the number of people in 9 households against weekly grocery costs are
given below.

Number of people in household   2    5     6     3     4     5     2    6     3
Weekly grocery costs (\$)       60   180   210   120   150   160   65   200   90

a Construct a scatterplot for the data.
b Comment on the correlation of the number of people in a household and the weekly
grocery costs and give an estimate of r.
c Calculate r.
d Calculate the coefﬁcient of determination.
e Write down the proportion of the variation in the weekly grocery costs that can be
explained by the variation in the number of people in a household.

6 Data showing the number of people on 8 fundraising committees and the annual funds
raised are given below.

Number of people on committee   3      6      4      8        5      7        3      6
Annual funds raised (\$)        4500   8500   6100   12 500   7200   10 000   4700   8800

a Construct a scatterplot for these data.
b Comment on the correlation between the number of people on a committee and the
funds raised and make an estimate of r.
c Calculate r.
d Calculate the coefﬁcient of determination.
e Write down the proportion of the variation in the funds raised that can be explained
by the variation in the number of people on a committee.

The following information applies to questions 7 and 8. A set of data was obtained from
a large group of women with children under 5 years of age. They were asked the number
of hours they worked per week and the amount of money they spent on childcare. The results
were recorded and the value of Pearson’s correlation coefﬁcient was found to be 0.92.
7 multiple choice
Which of the following is not true?
A The relationship between the number of working hours and the amount of money
spent on child-care is linear.
B There is a positive correlation between the number of working hours and the
amount of money spent on child-care.
C The correlation between the number of working hours and the amount of money
spent on child-care can be classiﬁed as strong.
D As the number of working hours increases, the amount spent on child-care increases
as well.
E The increase in the number of hours causes the increase in the amount of money
spent on child-care.

8 multiple choice
Which of the following is not true?
A The coefﬁcient of determination is about 0.85.
B The number of working hours is the major factor in predicting the amount of money
spent on child-care.
C About 85% of the variation in the number of hours worked can be explained by the
variation in the amount of money spent on child-care.
D Apart from number of hours worked, there could be other factors affecting the
amount of money spent on child-care.
E About 17/20 of the variation in the amount of money spent on child-care can be
explained by the variation in the number of hours worked.

summary
Types of data
• Bivariate data are data with two variables.
• Numerical data involve quantities which are measurable or countable.
• Categorical data are data divided into categories.
• In a relationship involving two variables, if the values of one variable depend on the
values of another variable, then the former variable is referred to as the dependent
variable and the latter variable is referred to as the independent variable.
• When data are displayed on a graph, the independent variable is placed on the
horizontal axis and the dependent variable is placed on the vertical axis.

Back-to-back stem plots
• A back-to-back stem plot displays bivariate data involving a numerical variable and
a categorical variable with two categories.
• Together with summary statistics, back-to-back stem plots can be used to compare
the two distributions.

Parallel boxplots
• To display a relationship between a numerical variable and a categorical variable
with more than two categories, we can use a parallel boxplot.
• A parallel boxplot is obtained by constructing individual boxplots for each
distribution, using a common scale.

The two-way frequency table
• The two-way frequency table is a tool for examining the relationship between two
categorical variables.
• If the total number of scores in each of the two categories is unequal, percentages
should be calculated in order to be able to analyse the table properly.
• When the independent variable is placed in the columns of the table, the numbers in
each column should be expressed as a percentage of that column’s total.

The scatterplot
• A scatterplot gives a visual display of the relationship between two numerical
variables.
• In analysing the scatterplot we look for a pattern in the way the points lie. Certain
patterns tell us that certain relationships exist between the two variables. This is
referred to as a correlation. We look at what type of correlation exists and how
strong it is.
• When describing the relationship between two variables displayed on a scatterplot,
we need to comment on:
(a) the direction — whether it is positive or negative
(b) the form — whether it is linear or non-linear
(c) the strength — whether it is strong, moderate or weak.
The q-correlation coefﬁcient
• The q-correlation coefﬁcient gives us a measure of the strength of the association
between two variables.
• To calculate the q-correlation coefficient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values. Draw a vertical line through this median
value.
Step 3. Locate the median of the y-values. Draw a horizontal line through this
median value.
Step 4. The scatterplot is now divided into 4 sections or quadrants.
(a) Label these sections A, B, C and D (A is the upper right quadrant, B the upper
left, C the lower left and D the lower right).
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) The number of points in section A is denoted by a, the number of
points in section B is denoted by b, and so on.
Step 5. Calculate the q-correlation coefficient, using the formula:
$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
• The sign of the q-value indicates the direction of the relationship; that is, whether
there is a negative association or a positive association. The magnitude of q
indicates whether the relationship is strong, moderate or weak.
• The q-correlation coefficient tells us which quadrants the points fall into, but
beyond that the points can be in any position in the quadrants. The
q-correlation coefficient in that sense is a rather blunt instrument.
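
The five steps can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: the function name `q_correlation` is hypothetical, the quadrant labels follow the summary diagram (A upper right, B upper left, C lower left, D lower right), and the data are reused from worked example 12 purely so that q can be compared with the r value found there.

```python
from statistics import median


def q_correlation(xs, ys):
    """q-correlation coefficient from the quadrant counts a, b, c and d."""
    x_med, y_med = median(xs), median(ys)
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        if x == x_med or y == y_med:
            continue  # points on either median line are not counted
        if x > x_med and y > y_med:
            a += 1  # quadrant A: upper right
        elif x < x_med and y > y_med:
            b += 1  # quadrant B: upper left
        elif x < x_med and y < y_med:
            c += 1  # quadrant C: lower left
        else:
            d += 1  # quadrant D: lower right
    counted = a + b + c + d
    if counted == 0:
        raise ValueError("every point lies on a median line")
    return ((a + c) - (b + d)) / counted


# Heights (cm) and number of marks taken, from worked example 12.
heights = [184, 194, 185, 175, 186, 183, 174, 200, 188, 184, 188,
           182, 185, 183, 191, 177, 184, 178, 190, 193, 204]
marks = [6, 11, 3, 2, 7, 5, 4, 10, 9, 7, 6,
         7, 5, 9, 9, 3, 8, 4, 10, 12, 14]

q = q_correlation(heights, marks)
print(q)
```

For these data the medians are 185 cm and 7 marks, five points lie on a median line, and the counts are a = 7, b = 2, c = 6 and d = 1, so q = 10/16 = 0.625. The same data gave r = 0.86, which illustrates how differently the two coefficients can summarise one scatterplot.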

Pearson’s product–moment correlation coefﬁcient
• Pearson’s product–moment correlation coefﬁcient is used to measure the strength of
a linear relationship between two variables.
• The symbol for Pearson’s product–moment correlation coefﬁcient is r.
• The calculation of r is applicable to sets of bivariate data which are known to be
linear in form and which don’t have outliers.
• The value of r can be estimated from the scatterplot.
• The formula for calculating Pearson’s correlation coefﬁcient r is as follows:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
where $n$ is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values.
• The calculation of r by hand using this formula is unnecessary. The calculation of r
is done far more efﬁciently using a graphics calculator.
• Even if we ﬁnd that two variables have a very high degree of correlation, for
example r = 0.95, we cannot say that the value of one variable is caused by the
value of the other variable.

Calculating the coefﬁcient of determination
• The coefficient of determination = r².
• The coefﬁcient of determination is useful when we have two variables which have
a linear relationship. It tells us the proportion of variation in one variable which can
be explained by the variation in the other variable.

Chapter review
Multiple choice
1 (2A) An example of a categorical variable is:
A the membership number of a club
B the number of students at each year level of a school
C the total attendance at Hawthorn football matches
D the breathalyser reading of people in a restaurant
E the monthly income for a group of people
2 (2A) In a study on the growth of plants, conducted in controlled surroundings, the dependent
variable was the height of the plants. The independent variable in the study would most likely be:
A the number of people caring for the plants
B the amount of light present
C the number of plants in the study
D whether the plants were deciduous or evergreen
E rainfall
3 (2B) One of the following pairs of variables could not be displayed on a back-to-back stem plot. It is:
A the heights of a group of students and whether or not they like football
B the kilometres travelled in a week and the mode of transport (car or train)
C the weights of a group of students and their eye colour (blue or brown)
D the annual number of trips to a doctor and whether or not the person is a smoker
E the amount spent by each child at the tuckshop and the age of the child
4 (2B) A back-to-back stem plot is a useful way of displaying the relationship between:
A the number of children attending a day care centre and whether or not the centre has
federal funding
B height and wrist circumference
C age and weekly income
D weight and the number of takeaway meals eaten each week
E the age of a car and amount spent each year on servicing it

The information below relates to questions 5 and 6. The salaries of people working at ﬁve
different advertising companies are shown below on the parallel boxplots.
[Figure: parallel boxplots of annual salary (× \$1000) for Company A, Company B, Company C, Company D and Company E, on a common scale from 10 to 150]

5 (2C) The company with the largest interquartile range is:
A Company A     B Company B     C Company C
D Company D     E Company E
6 (2C) The company with the lowest median is:
A Company A     B Company B     C Company C
D Company D     E Company E
Questions 7 and 8 relate to the following information. Data showing reactions of junior staff and
senior staff to a relocation of ofﬁces are given below in a two-way frequency table.
Attitude             Junior staff               Senior staff                Total
For                               23                        14                       37
Against                           31                        41                       72
Total                             54                        55                      109
7 (2D) From this table, we can conclude that:
A 23% of junior staff were for the relocation
B 42.6% of junior staff were for the relocation
C 31% of junior staff were against the relocation
D 62.1% of junior staff were for the relocation
E 28.4% of junior staff were against the relocation
8 (2D) From this table, we can conclude that:
A 14% of senior staff were for the relocation
B 37.8% of senior staff were for the relocation
C 12.8% of senior staff were for the relocation
D 72% of senior staff were against the relocation
E 74.5% of senior staff were against the relocation
9 (2E) The relationship between the variables x and y is shown on the scatterplot below.
The correlation between x and y would be best described as:
[Figure: scatterplot of y against x]
A a weak positive association
B a weak negative association
C a strong positive association
D a strong negative association
E non-existent
10 (2E) An investigation is made into the number of freckles on the back of a hand and the age of
the subject. A strong association was found to exist. In this investigation, age is the
independent variable and the number of freckles is the dependent variable. You would
expect the association to be:
A negative         B positive        C bivariate        D weak              E categorical
11 (2F) The q-correlation coefficient for data shown in the scatterplot above is:
[Figure: scatterplot of y against x]
A −5/11          B −5/9          C 5/11          D 5/9          E 2/9
12 (2F) A researcher calculates the q-correlation coefficient for the relationship
between time (in days) and the growth of the root of a bean plant (measured in millimetres).
The value is 0.62. Based on this, the correlation between time and the growth of the roots
could be described as:
A strong and negative          B strong and positive           C weak and positive
D weak and negative            E moderate and positive

13 (2G) A set of data relating the variables x and y is found to have an r value of −0.83. The
scatterplot that could represent this data set is:
[Figure: five scatterplots of y against x, labelled A to E]

14 (2G) A set of data relating the variables x and y is found to have an r value of 0.65. A true
statement about the relationship between x and y is:
A There is a strong linear relationship between x and y and when the x-values increase, the
y-values tend to increase also.
B There is a moderate linear relationship between x and y and when the x-values increase,
the y-values tend to increase also.
C There is a moderate linear relationship between x and y and when the x-values increase,
the y-values tend to decrease.
D There is a weak linear relationship between x and y and when the x-values increase, the
y-values tend to increase also.
E There is a weak linear relationship between x and y and when the x-values increase, the
y-values tend to decrease.
15 (2H) A set of data comparing age with blood pressure is found to have a Pearson’s correlation
coefficient of 0.86. The coefficient of determination for these data would be closest to:
A −0.86           B −0.74           C −0.43            D 0.43            E 0.74

16 (2H) The coefficient of determination for a set of data relating age and pulse rate is 0.7. This
means that:
A The correlation coefﬁcient, r, for age against pulse rate is 0.7.
B 70% of the variation in pulse rate can be explained by the variation in age.
C 30% of the variation in pulse rate can be explained by the variation in age.
D 49% of the variation in pulse rate can be explained by the variation in age.
E 70% of those in the study had a pulse rate over 0.7.

1 (2A) For each of the following, write down:
i whether each variable in the pair is an example of numerical or categorical data
ii which is a dependent and which is an independent variable or whether it is not appropriate
to classify the variables as such.
a The number of injuries in a netball season and the age of a netball player
b The suburb and the size of a home mortgage
c IQ and weight
2 (2B) The number of hours of counselling received by a group of 9 full-time firefighters and
9 volunteer firefighters after a serious bushfire is given below.
Full-time   2   4    3    5    2    4    6    1    3
Volunteer   8   10   11   11   12   13   13   14   15
a Construct a back-to-back stem plot to display the data.
b Comment on the distributions of the number of hours of counselling of the full-time
ﬁreﬁghters and the volunteers.
3 (2C) The IQs of 8 players in each of 3 different football teams were recorded and are shown below.
Team A   120   105   140   116   98    105   130   102
Team B   110   104   120   109   106   95    102   100
Team C   121   115   145   130   120   114   116   123
Display the data in parallel boxplots.
4 (2D) Delegates at the respective Liberal and Labor Party conferences were surveyed on whether or
not they believed that uranium mining should continue. Forty-five Liberal delegates were
surveyed and 15 were against continuation. Fifty-three Labor delegates were surveyed and
43 were against continuation.
a Present data in percentages in a two-way frequency table.
b Comment on any difference between the reactions of the Liberal and Labor delegates.
5 (2E) a Construct a scatterplot for the data given in the table below.
b Use the scatterplot to comment on any relationship which exists between the variables.
Age          15   17   18   16   19   19   17   15   17
Pulse rate   79   74   75   85   82   76   77   72   70

6 (2F) For the data given in question 5, calculate the q-correlation coefficient and use this to
comment on the relationship between the two variables. (Compare your response about the
relationship in this question to your response about the relationship in question 5 when you
didn’t know the q-value.)
7 (2G) For the variables shown on the scatterplot at right, give an estimate of
the value of r and use it to comment on the nature of the relationship
between the two variables.
[Figure: scatterplot of y against x]

8 (2H) The table below gives data relating the percentage of lectures attended by students in a
semester and the corresponding mark for each student in the exam for that subject.
Lectures attended (%)   70   59   85   93   78   85   84   69   70   82
Exam result (%)         80   62   89   98   84   91   83   72   75   85

a Construct a scatterplot for these data.
b Comment on the correlation between the lectures attended and the examination results and
make an estimate of r.
c Calculate r.
d Calculate the coefﬁcient of determination.
e Write down the proportion of the variation in the examination results that can be explained
by the variation in the number of lectures attended.
Analysis
1 An investigation into the relationship between age and salary bracket among some employees
of a large computer company is made and the results are shown below.
Salary bracket ($’000)   Age
20–39                    32 21 43 23 22 27 37
40–59                    29 31 37 26 33 37
60–79                    41 29 39 42 47 45 43 38
80–99                    43 48 38 37 49 51 53 59
100–120                  48 37 55 61
a State, for each of the variables (age and salary bracket), whether it represents categorical
or numerical data.
b State which is the independent variable and which is the dependent variable.
c State which of the following you could use to display the data:
i back-to-back stem plot
ii parallel boxplot
iii scatterplot
iv two-way frequency table in percentage form
d State which of the following you could calculate in order to find out more about the
relationship between age and salary bracket:
i the q-correlation coefficient
ii r, the Pearson product–moment correlation coefficient
iii the coefficient of determination
2 An investigation similar to that in analysis task 1 is undertaken at an accounting firm to
explore the relationship between age and salary. The data are shown below.
Age                            20   20   30   35   50   45   35   45   55   55   42   50   25   30   40
Salary (nearest thousand $)    20   40   20   30   40   80   40   60  100   70   45   85   30   60   60
a State, for each of the variables (age and salary), whether it represents categorical or
numerical data.
b Display the data on a scatterplot.
c Describe the association between the two variables in terms of direction, form and
strength.
d Calculate the value of q.
e Explain whether or not it is appropriate to use Pearson’s correlation coefficient to explain
the relationship between age and salary.
f Estimate the value of Pearson’s correlation coefficient from the scatterplot.
g Calculate the value of this coefficient.
h Explain whether or not the salary of the employees is determined by their age.
i Calculate the value of the coefficient of determination.
j Explain what the coefficient of determination tells us about the relationship between age
and salary at this accounting firm.

```
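
Hand calculations for review questions 5, 6 and 8 can be checked with a short script. The Python sketch below is one possible way to do this, not the textbook's own method. It assumes the q-correlation coefficient is the quadrant count q = ((a + c) − (b + d)) / (a + b + c + d), where the quadrants are formed by the two median lines and any point lying on a median line is left out. The function names `q_correlation` and `pearson_r` are illustrative only.

```python
from math import sqrt
from statistics import mean, median


def q_correlation(xs, ys):
    """Quadrant-count coefficient (assumed definition, see lead-in).

    Points in the upper-right and lower-left quadrants (a + c) count as
    agreeing; points in the upper-left and lower-right quadrants (b + d)
    count as disagreeing. Points on either median line are skipped.
    """
    mx, my = median(xs), median(ys)
    agree = 0
    disagree = 0
    for x, y in zip(xs, ys):
        if x == mx or y == my:
            continue  # on a median line: not counted
        if (x > mx) == (y > my):
            agree += 1
        else:
            disagree += 1
    # Raises ZeroDivisionError if every point lies on a median line.
    return (agree - disagree) / (agree + disagree)


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Data from review question 5 (age and pulse rate)
age = [15, 17, 18, 16, 19, 19, 17, 15, 17]
pulse = [79, 74, 75, 85, 82, 76, 77, 72, 70]

# Data from review question 8 (lectures attended and exam result, both in %)
lectures = [70, 59, 85, 93, 78, 85, 84, 69, 70, 82]
exam = [80, 62, 89, 98, 84, 91, 83, 72, 75, 85]

q = q_correlation(age, pulse)
r = pearson_r(lectures, exam)
r_squared = r ** 2  # coefficient of determination

print(f"q (age, pulse rate)       = {q:.2f}")
print(f"r (lectures, exam result) = {r:.4f}")
print(f"r squared                 = {r_squared:.4f}")
```

The same two functions can be applied to the age and salary lists in analysis task 2. A calculator or spreadsheet may round differently in the last decimal place.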