2 Bivariate data

VCE coverage
Area of study: Units 3 & 4
• Data analysis

In this chapter
2A Types of data
2B Back-to-back stem plots
2C Parallel boxplots
2D The two-way frequency table
2E The scatterplot
2F The q-correlation coefficient
2G Pearson’s product–moment correlation coefficient
2H Calculating r and the coefficient of determination

58 Further Mathematics

Types of data

In this chapter we look at sets of data which contain two variables. We look at ways of displaying the data and of measuring relationships between the two variables. The methods we employ to do this depend entirely on the type of variables we are dealing with.

Numerical and categorical data

Examples of numerical data are:
1. the heights of a group of teenagers
2. the marks for a maths test
3. the number of universities in a country
4. ages
5. salaries.

As the name suggests, numerical data involve quantities which are, broadly speaking, measurable or countable.

Examples of categorical data are:
1. genders (sexes)
2. AFL football teams
3. religious denominations
4. finishing positions in the Melbourne Cup
5. municipalities
6. ratings of 1–5 to indicate preferences for 5 different cars
7. age groups, for example 0–9, 10–19, 20–29
8. hair colours.

Such categorical data, as the name suggests, have categories like masculine, feminine and neuter for gender, or Catholic, Anglican, Uniting, Baptist, Buddhist and so on for religious denomination, or 1st, 2nd, 3rd for finishing position in the Melbourne Cup.

Note: Some numbers may look like numerical data, but really be names or titles (for example, ratings of 1 to 5 given to different samples of cake: ‘This one’s a 4’; the numbers on netball players’ uniforms: ‘she’s number 7’). These ‘titles’ are not countable; they place the subject in a category (with a name) and so are categorical.

In this chapter we look at ways of measuring the relationship between:
1. a numerical variable and a categorical variable (for example, weight and nationality)
2. two categorical variables (for example, gender and religious denomination)
3. two numerical variables (for example, height and weight).

Dependent and independent variables

When a relationship between two variables is being examined, it is useful to know if one of the variables depends on the other. Often we can make a judgment about this but sometimes we can’t.

Consider the case where a study compared the heights of company employees against their annual salaries. Common sense would suggest that the height of a company employee would not depend on the person’s annual salary, nor would the annual salary of a company employee depend on the person’s height. In this case, it is not appropriate to designate one variable as independent and one as dependent.

In the case where the ages of company employees are compared with their annual salaries, you might reasonably expect that the annual salary of an employee would depend on the person’s age. In this case, the age of the employee is the independent variable and the salary of the employee is the dependent variable.

It is useful to identify the independent and dependent variables where possible, since it is the usual practice when displaying data on a graph to place the independent variable on the horizontal axis and the dependent variable on the vertical axis.

Chapter 2 Bivariate data 59

remember
1. Bivariate data are data with two variables.
2. Numerical data involve quantities that are measurable or countable.
3. Categorical data, as the name suggests, are data which are divided into categories.
4. In a relationship involving two variables, if the values of one variable ‘depend’ on the values of another variable, then the former variable is referred to as the dependent variable and the latter variable is referred to as the independent variable.
5. It is the usual practice when displaying data on a graph to place the independent variable on the horizontal axis and the dependent variable on the vertical axis.

2A Types of data

1 Write down whether each of the following represents numerical or categorical data.
a The heights in centimetres of a group of children
b The diameters in millimetres of a tray of ball-bearings
c The numbers of visitors to a display each day
d The modes of transport that students in Year 11 take to school
e The 10 most-watched television programs in a week
f The occupations of a group of 30-year-olds
g The numbers of subjects offered to VCE students at various schools
h Life expectancies
i Species of fish
j Blood groups
k Years of birth
l Countries of birth
m Tax brackets

2 For each of the following pairs of variables, write down which is independent and which is dependent. If it is not possible to identify this, then write ‘not appropriate’.
a The age of an AFL footballer and his annual salary
b The weight of a businessman and the number of business lunches he attends each week
c The growth of a plant and the amount of fertiliser it receives
d The number of books read in a week and the eye colour of the readers
e The voting intentions of a woman and her weekly consumption of red meat
f The number of members in a household and the size of their house

3 multiple choice
An example of a numerical variable is:
A attitude to 4-yearly elections (for or against)
B year level of students
C the total attendance at Carlton football matches
D position in a queue at the pie stall
E television channel numbers shown on a dial

4 multiple choice
In a study on mice, the dependent variable was the time (in days) for which the mice remained alive.
The independent variable would most likely have been:
A the weight of the mice
B the amount of food eaten each day by the mice
C the daily dosage of an experimental drug given to the mice
D the number of mice
E the sex of the mice

Back-to-back stem plots

In chapter 1, we saw how to construct a stem plot for a set of univariate data. We can also extend a stem plot so that it displays bivariate data. Specifically, we shall create a stem plot that displays the relationship between a numerical variable and a categorical variable. We shall limit ourselves in this section to categorical variables with just two categories, for example sex. The two categories are used to provide the two back-to-back sets of leaves of a stem plot.

A back-to-back stem plot is used to display bivariate data involving a numerical variable and a categorical variable with 2 categories.

WORKED Example 1
The girls and boys in Grade 4 at Kingston Primary School submitted projects on the Olympic Games. The marks they obtained out of 20 are given below.

Girls’ marks: 16 17 19 15 12 16 17 19 19 16
Boys’ marks: 14 15 16 13 12 13 14 13 15 14

Display the data on a back-to-back stem plot.

THINK 1: Identify the highest and lowest scores in order to decide on the stems.
WRITE: Highest score = 19
       Lowest score = 12
       Use a stem of 1, divided into fifths.

THINK 2: Create an unordered stem plot first. Put the boys’ scores on the left, and the girls’ scores on the right.
WRITE: Key: 1|2 = 12

  Leaf (Boys) | Stem | Leaf (Girls)
              |  1   |
      3 2 3 3 |  1   | 2
    4 5 4 5 4 |  1   | 5
            6 |  1   | 6 7 6 7 6
              |  1   | 9 9 9

THINK 3: Now order the stem plot. The scores on the left should increase in value from right to left, while the scores on the right should increase in value from left to right.
WRITE: Key: 1|2 = 12

  Leaf (Boys) | Stem | Leaf (Girls)
      3 3 3 2 |  1   | 2
    5 5 4 4 4 |  1   | 5
            6 |  1   | 6 6 6 7 7
              |  1   | 9 9 9

The back-to-back stem plot allows us to make some visual comparisons of the two distributions.
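The ordering step in Worked Example 1 can also be sketched in code. The following Python snippet is only an illustration, not part of the textbook's method: the function name `back_to_back_stem` and its `leaves_per_row` parameter are hypothetical, and it assumes whole-number scores whose tens digit is the stem.

```python
def back_to_back_stem(left, right, leaves_per_row=2):
    """Return the rows of an ordered back-to-back stem plot.

    Each row is (left_leaves, stem, right_leaves). A stem is split into
    rows of `leaves_per_row` consecutive leaf digits, so the default of 2
    splits each stem into fifths (0-1, 2-3, 4-5, 6-7, 8-9).
    Assumes whole-number data with the tens digit as the stem.
    """
    rows_per_stem = 10 // leaves_per_row

    def row_index(value):
        stem, leaf = divmod(value, 10)
        return stem * rows_per_stem + leaf // leaves_per_row

    all_values = list(left) + list(right)
    first = row_index(min(all_values))
    last = row_index(max(all_values))

    rows = []
    for index in range(first, last + 1):
        # Left leaves are listed largest first, so that when printed
        # right-aligned they increase in value from right to left.
        left_leaves = sorted((v % 10 for v in left if row_index(v) == index), reverse=True)
        right_leaves = sorted(v % 10 for v in right if row_index(v) == index)
        rows.append((left_leaves, index // rows_per_stem, right_leaves))
    return rows


boys = [14, 15, 16, 13, 12, 13, 14, 13, 15, 14]
girls = [16, 17, 19, 15, 12, 16, 17, 19, 19, 16]

rows = back_to_back_stem(boys, girls)
for left_leaves, stem, right_leaves in rows:
    left_text = " ".join(map(str, left_leaves))
    right_text = " ".join(map(str, right_leaves))
    print(f"{left_text:>12} | {stem} | {right_text}")
```

With the marks from Worked Example 1, the sketch produces four rows, starting at the 12–13 row because that is where the lowest score lies.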
In Worked Example 1, the centre of the distribution for the girls is higher than the centre of the distribution for the boys. The spread of each of the distributions seems to be about the same. For the boys, the marks are grouped around 12–15; for the girls, they are grouped around 16–19. On the whole, we can conclude that the girls obtained better marks than the boys did.

To get a more precise picture of the centre and spread of each of the distributions we can use the summary statistics discussed in chapter 1. Specifically, we are interested in:
1. the mean and the median (to measure the centre of the distributions), and
2. the interquartile range and the standard deviation (to measure the spread of the distributions).

We saw in chapter 1 that the calculation of these summary statistics is very straightforward and rapid using a graphics calculator.

WORKED Example 2
The number of ‘how to vote’ cards handed out by various Australian Labor Party and Liberal Party volunteers during the course of a polling day is shown below.

Labor:   180 233 246 252 263 270 229 238 226 211 193 202 210 222 257 247 234 226 214 204
Liberal: 204 215 226 253 263 272 285 245 267 275 287 273 266 233 244 250 261 272 280 279

Display the data using a back-to-back stem plot and use this, together with summary statistics, to compare the distributions of the number of cards handed out by the Labor and Liberal volunteers.

THINK 1: Construct the stem plot.
WRITE: Key: 18|0 = 180

  Leaf (Labor) | Stem | Leaf (Liberal)
             0 |  18  |
             3 |  19  |
           4 2 |  20  | 4
         4 1 0 |  21  | 5
       9 6 6 2 |  22  | 6
         8 4 3 |  23  | 3
           7 6 |  24  | 4 5
           7 2 |  25  | 0 3
             3 |  26  | 1 3 6 7
             0 |  27  | 2 2 3 5 9
               |  28  | 0 5 7

THINK 2: Use a graphics calculator to calculate the summary statistics: the mean, the median, the standard deviation and the interquartile range. Enter each set of data as a separate list. (See chapter 1 on how to use your graphics calculator to calculate these values.)
WRITE: For the Labor volunteers:
       Mean = 227.9
       Median = 227.5
       Interquartile range = 36
       Standard deviation = 23.9
       For the Liberal volunteers:
       Mean = 257.5
       Median = 264.5
       Interquartile range = 29.5
       Standard deviation = 23.4

THINK 3: Comment on the relationship.
WRITE: From the stem plot we see that the Labor distribution is symmetric and therefore the mean and the median are very close, whereas the Liberal distribution is negatively skewed. Since the Liberal distribution is skewed, the median is a better indicator of the centre of the distribution than is the mean. Comparing the medians, therefore, we have the median number of cards handed out for Labor at 228 and for Liberal at 265, which is a big difference. The standard deviations were similar, as were the interquartile ranges, so there was not a lot of difference in the spread of the data. In essence, the Liberal Party volunteers handed out a lot more ‘how to vote’ cards than the Labor Party volunteers did.

remember
1. A back-to-back stem plot displays bivariate data involving a numerical variable and a categorical variable with two categories.
2. In the ordered stem plot, the scores on the left side of the stem increase in value from right to left.
3. Together with summary statistics, back-to-back stem plots can be used for comparing two distributions.

2B Back-to-back stem plots

1 (WORKED Example 1) The marks (out of 50) obtained for the end-of-term test by the students in German and French classes are given below. Display the data on a back-to-back stem plot.

German: 20 38 45 21 30 39 41 22 27 33 30 21 25 32 37 42 26 31 25 37
French: 23 25 36 46 44 39 38 24 25 42 38 34 28 31 44 30 35 48 43 34

2 The birth masses of 10 boys and 10 girls (in kilograms, to the nearest 100 grams) are recorded in the table below. Display the data on a back-to-back stem plot.
Boys:  3.4 5.0 4.2 3.7 4.9 3.4 3.8 4.8 3.6 4.3
Girls: 3.0 2.7 3.7 3.3 4.0 3.1 2.6 3.2 3.6 3.1

3 (WORKED Example 2) The number of delivery trucks making deliveries to a supermarket each day over a 2-week period was recorded for two neighbouring supermarkets, supermarket A and supermarket B. The data are shown below.

A: 11 15 20 25 12 16 21 27 16 17 17 22 23 24
B: 10 15 20 25 30 35 16 31 32 21 23 26 28 29

a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the number of trucks delivering to supermarkets A and B.

4 The marks out of 20 for males and females on a science test for a Year-10 class are given below.

Females: 12 13 14 14 15 15 16 17
Males:   10 12 13 14 14 15 17 19

a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the marks of the males and the females.

5 The end-of-year English marks for 10 students in an English class were compared over 2 years. The marks for 1998 and for the same students in 1999 are shown below.

1998: 30 31 35 37 39 41 41 42 43 46
1999: 22 26 27 28 30 31 31 33 34 36

a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the marks obtained by the students in 1998 and 1999.

6 The age and sex of a group of people attending a fitness class are recorded below.

Female: 23 24 25 26 27 28 30 31
Male:   22 25 30 31 36 37 42 46

a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the ages of the female and male members of the fitness class.

7 The scores on a board game are recorded for a group of kindergarten children and for a group of children in a preparatory school.

Kindergarten: 3 13 14 25 28 32 36 41 47 50
Prep. School: 5 12 17 25 27 32 35 44 46 52

a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the scores of the kindergarten children and the preparatory school children.

8 multiple choice
The pair of variables that could be displayed on a back-to-back stem plot is:
A the height of a student and the number of people in the student’s household
B the time put into completing an assignment and a pass or fail score on the assignment
C the weight of a businessman and his age
D the religion of an adult and the person’s head circumference
E the income bracket of an employee and the time the employee has worked for the company

9 multiple choice
A back-to-back stem plot is a useful way of displaying the relationship between:
A the proximity to markets (km) and the cost of fresh foods on average per kilogram
B height and head circumference
C age and attitude to gambling (for or against)
D weight and age
E the money spent during a day of shopping and the number of shops visited on that day

Parallel boxplots

We saw in the previous section that we could display relationships between a numerical variable and a categorical variable with just two categories, using a back-to-back stem plot. When we want to display a relationship between a numerical variable and a categorical variable with more than two categories, a parallel boxplot can be used.

A parallel boxplot is obtained by constructing individual boxplots for each distribution, using a common scale.

Construction of individual boxplots was discussed in detail in chapter 1 on univariate data. In this section we concentrate on comparing distributions represented by a number of boxplots (that is, on the interpretation of parallel boxplots).

WORKED Example 3
The four Year-7 classes at Western Secondary College complete the same end-of-year maths test.
The marks, expressed as percentages, for each of the students in the four classes are given below.

  7A   7B   7C   7D  |  7A   7B   7C   7D
  40   60   50   40  |  69   78   70   69
  43   62   51   42  |  63   82   72   73
  45   63   53   43  |  63   85   73   74
  47   64   55   45  |  68   87   74   75
  50   70   57   50  |  70   89   76   80
  52   73   60   53  |  75   90   80   81
  53   74   63   55  |  80   92   82   82
  54   76   65   59  |  85   95   82   83
  57   77   67   60  |  89   97   85   84
  60   77   69   61  |  90   97   89   90

Display the data using a parallel boxplot and use this to describe any similarities or differences in the distributions of the marks between the four classes.

THINK 1: Create the first boxplot (for class 7A) on a graphics calculator using 2nd [STAT PLOT] and appropriate WINDOW settings. Using TRACE to show key values, sketch the first boxplot using pen and paper, leaving room for three additional plots.
WRITE/DISPLAY: [Figures FM Fig SD 02.01a and FM Fig SD 02.01b not reproduced]

THINK 2: Repeat step 1 for the other three classes. All four boxplots share the common scale.
WRITE/DISPLAY: [Figure: parallel boxplots for classes 7D, 7C, 7B and 7A on a common scale from 30 to 100; horizontal axis: Maths mark (%)]

THINK 3: Describe the similarities and differences between the four distributions.
WRITE: Class 7B had the highest median mark and the range of the distribution was only 37. The lowest mark in 7B was 60.
We notice that the median of 7A’s marks is approximately 60. So, 50% of students in 7A received less than 60. This means that half of 7A had scores that were less than the lowest score in 7B.
The range of marks in 7A was about the same as that of 7D, with the highest scores in each about equal and the lowest scores in each about equal. However, the median mark in 7D was higher than the median mark in 7A so, despite a similar range, more students in 7D received a higher mark than in 7A.
While 7D had a top score that was higher than that of 7C, the median score in 7C was higher than that of 7D and the bottom 25% of scores in 7D were less than the lowest score in 7C.
In summary, 7B did best, followed by 7C, then 7D and finally 7A.

remember
1. A relationship between a numerical variable and a categorical variable with more than two categories can be displayed using a parallel boxplot.
2. A parallel boxplot is obtained by constructing individual boxplots for each distribution, using a common scale.

2C Parallel boxplots

1 (WORKED Example 3) The heights (in cm) of students in 9A, 10A and 11A were recorded and are shown in the table below.

  9A   10A  11A  |  9A   10A  11A  |  9A   10A  11A
  120  140  151  |  146  153  164  |  158  168  175
  126  143  153  |  147  156  166  |  160  170  180
  131  146  154  |  150  162  167  |  162  173  187
  138  147  158  |  156  164  169  |  164  175  189
  140  149  160  |  157  165  169  |  165  176  193
  143  151  163  |  158  167  172  |  170  180  199

a Construct a parallel boxplot to show the data.
b Use the boxplot to compare the distributions of height for the 3 classes.

2 The amounts of money contributed annually to superannuation schemes by people in 3 different age groups are shown below.

  20–29    30–39    40–49   |  20–29    30–39    40–49
  2000     4000     10 000  |  6500     7000     13 700
  3100     5200     11 200  |  6700     8000     13 900
  5000     6000     12 000  |  7000     9000     14 000
  5500     6300     13 300  |  9200     10 300   14 300
  6200     6800     13 500  |  10 000   12 000   15 000

a Construct a parallel boxplot to show the data.
b Use the boxplot to comment on the distributions.

3 The numbers of jars of vitamin A, B, C and multi-vitamins sold per week by a local chemist are shown below.

Vitamin A:      5  6  7  7  8  8  9 11 13 14
Vitamin B:     10 10 11 12 14 15 15 15 17 19
Vitamin C:      8  8  9  9  9 10 11 12 12 13
Multi-vitamins: 12 13 13 15 16 16 17 19 19 20

Construct a parallel boxplot to display the data and use it to compare the distributions of sales for the 4 types of vitamin.

4 multiple choice
The ages of the employees at 5 different companies of the same size are compared using the parallel boxplots shown below.
[Figure: parallel boxplots of employee ages for Company A, Company B, Company C, Company D and Company E on a common scale from 20 to 60]

For each of the following, select from:
A company A
B company B
C company C
D company D
E company E

a Which company has the greatest range of ages?
b Which company has the greatest interquartile range of ages?
c Which company has the lowest median age?
d Which company has the greatest range of ages among their oldest 25% of employees?

The two-way frequency table

When we are examining the relationship between two categorical variables, the two-way frequency table is an excellent tool. Consider the following example.

WORKED Example 4
At a local shopping centre, 34 females and 23 males were asked which of the two major political parties they preferred. Eighteen females and 12 males preferred Labor. Display these data in a two-way table.

THINK 1: Draw a table. Record the respondents’ sex in the columns and party preference in the rows of the table.
WRITE:
  Party preference   Female   Male   Total
  Labor
  Liberal
  Total

THINK 2: (a) We know that 34 females and 23 males were asked. Put this information into the table and fill in the total.
(b) We also know that 18 females and 12 males preferred Labor. Put this information in the table and find the total of people who preferred Labor.
WRITE:
  Party preference   Female   Male   Total
  Labor                18      12     30
  Liberal
  Total                34      23     57

THINK 3: Fill in the remaining cells. For example, to find the number of females who preferred the Liberals, subtract the number of females preferring Labor from the total number of females asked: 34 − 18 = 16.
WRITE:
  Party preference   Female   Male   Total
  Labor                18      12     30
  Liberal              16      11     27
  Total                34      23     57

In the above example we have a very clear breakdown of data. We know how many females preferred Labor, how many females preferred the Liberals, how many males preferred Labor and how many males preferred the Liberals.

If we wish to compare the number of females who prefer Labor with the number of males who prefer Labor, we must be careful.
While 12 males preferred Labor compared to 18 females, there were, of course, fewer males than females being asked. That is, only 23 males were asked for their opinion, compared to 34 females. To overcome this problem, we can express the figures in the table as percentages.

WORKED Example 5
Fifty-seven people in a local shopping centre were asked whether they preferred the Australian Labor Party or the Liberal Party. The results are given below. Convert the numbers in this table to percentages.

  Party preference   Female   Male   Total
  Labor                18      12     30
  Liberal              16      11     27
  Total                34      23     57

THINK 1: Draw the table, omitting the ‘total’ column.
WRITE:
  Party preference   Female   Male
  Labor
  Liberal
  Total

THINK 2: Fill in the table by expressing the number in each cell as a percentage of its column’s total. For example, to obtain the percentage of males who prefer Labor, we divide the number of males who prefer Labor by the total number of males and multiply by 100%:
12/23 × 100% = 52.2% (correct to 1 decimal place)
WRITE:
  Party preference   Female   Male
  Labor               52.9    52.2
  Liberal             47.1    47.8
  Total              100.0   100.0

We could have calculated percentages from the table rows, rather than columns. To do that we would, for example, have divided the number of females who preferred Labor (18) by the total number of people who preferred Labor (30) and so on. The table below shows this:

  Party preference   Female   Male   Total
  Labor               60.0    40.0    100
  Liberal             59.3    40.7    100

By doing this we have obtained the percentage of Labor supporters who were female (60%), the percentage of Labor supporters who were male (40%), and so on. This highlights facts different from those shown in the previous table. In other words, different results can be obtained by calculating percentages from a table in different ways.

As a general rule, when the independent variable (in this case the respondent’s sex) is placed in the columns of the table, then the percentages should be calculated in columns.
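The difference between column and row percentages in Worked Example 5 can be checked with a short calculation. The Python sketch below is illustrative only; the helper names `percentages_within` and `transpose` are hypothetical and simply mirror the arithmetic described above, using the counts from the two-way table.

```python
def percentages_within(groups):
    """Express each count as a percentage of its group's total (1 decimal place).

    `groups` maps a group name to a dict of {category: count}.
    """
    result = {}
    for group, counts in groups.items():
        total = sum(counts.values())
        result[group] = {cat: round(100 * n / total, 1) for cat, n in counts.items()}
    return result


def transpose(groups):
    """Swap the roles of rows and columns in a nested dict of counts."""
    flipped = {}
    for group, counts in groups.items():
        for cat, n in counts.items():
            flipped.setdefault(cat, {})[group] = n
    return flipped


# Counts from Worked Example 5, grouped by the independent variable (sex).
by_sex = {
    "Female": {"Labor": 18, "Liberal": 16},
    "Male": {"Labor": 12, "Liberal": 11},
}

col_pct = percentages_within(by_sex)             # each column sums to 100%
row_pct = percentages_within(transpose(by_sex))  # each row sums to 100%

print(col_pct)
print(row_pct)
```

Here `col_pct` answers "what percentage of males preferred Labor?", while `row_pct` answers "what percentage of Labor supporters were male?", which is why the two tables highlight different facts.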
WORKED Example 6
Sixty-seven primary and 47 secondary school students were asked their attitude to the number of school holidays which should be given. They were asked whether there should be more, fewer or the same number. Five primary students and 2 secondary students wanted fewer holidays, 29 primary and 9 secondary students thought they had enough holidays (that is, they chose the same number) and the rest thought they needed to be given more holidays.

Present these data in percentage form in a two-way frequency table and use it to compare the opinions of the primary and the secondary students.

THINK 1: Put the data in a table. First fill in the given information, then find the missing information by subtracting the appropriate numbers from the totals.
WRITE:
  Attitude   Primary   Secondary   Total
  Fewer         5          2          7
  Same         29          9         38
  More         33         36         69
  Total        67         47        114

THINK 2: Calculate the percentages. Since the independent variable (the level of the student, Primary or Secondary) has been placed in the columns of the table, we calculate the percentages in columns. For example, to obtain the percentage of primary students who wanted fewer holidays, divide the number of such students by the total number of primary students and multiply by 100%. That is, 5/67 × 100% = 7.5%.
WRITE:
  Attitude   Primary   Secondary
  Fewer        7.5        4.3
  Same        43.3       19.1
  More        49.2       76.6
  Total      100.0      100.0

THINK 3: Comment on the results.
WRITE: Secondary students were much keener on having more holidays than were primary students.

remember
1. The two-way frequency table is an excellent tool for examining the relationship between two categorical variables.
2. If the total number of scores in each of the two categories is unequal, percentages should be calculated in order to be able to analyse the table properly. When the independent variable is placed in the columns of the table, the percentages should be calculated in columns.
That is, the numbers in each column should be expressed as a percentage of that column’s total.

2D The two-way frequency table

1 (WORKED Example 4) In a survey, 139 women and 102 men were asked whether they approved or disapproved of a proposed freeway. Thirty-seven women and 79 men approved of the freeway. Display these data in a two-way table (not as percentages).

2 Students at a secondary school were asked whether the length of lessons should be 45 minutes or 1 hour. Ninety-three senior students (Years 10–12) were asked and 60 preferred 1-hour lessons, whereas of the 86 junior students (Years 7–9), 36 preferred 1-hour periods. Display these data in a two-way table (not as percentages).

3 For each of the following two-way frequency tables, complete the entries.

a
  Attitude   Female   Male   Total
  For          25      i      47
  Against      ii      iii    iv
  Total        51      v      92

b
  Attitude   Female   Male   Total
  For          i       ii     21
  Against      iii     21     iv
  Total        v       30     63

c
  Party preference   Female   Male
  Labor                i       42%
  Liberal              53%     ii
  Total                iii     iv

4 (WORKED Example 5) Sixty single men and women were asked whether they prefer to live alone, or to share accommodation with friends. The results are shown below.

  Rent preference       Men   Women   Total
  Live alone             12     23      35
  Share with friends      9     16      25
  Total                  21     39      60

Convert the numbers in this table to percentages.

The information in the following two-way frequency table relates to questions 5 and 6. The data show the reactions of administrative staff and technical staff to an upgrade of the computer systems at a large corporation.
  Attitude   Administrative staff   Technical staff   Total
  For                 53                   98          151
  Against             37                   31           68
  Total               90                  129          219

5 multiple choice
From the above table, we can conclude that:
A 53% of administrative staff were for the upgrade
B 37% of administrative staff were for the upgrade
C 37% of administrative staff were against the upgrade
D 59% of administrative staff were for the upgrade
E 54% of administrative staff were against the upgrade

6 multiple choice
From the above table, we can conclude that:
A 98% of technical staff were for the upgrade
B 65% of technical staff were for the upgrade
C 76% of technical staff were for the upgrade
D 31% of technical staff were against the upgrade
E 14% of technical staff were against the upgrade

7 (WORKED Example 6) Delegates at the respective Liberal Party and Australian Labor Party conferences were surveyed on whether or not they believed that marijuana should be legalised. Sixty-two Liberal delegates were surveyed and 40 were against legalisation. Seventy-one Labor delegates were surveyed and 43 were against legalisation. Present the data in percentage form in a two-way frequency table. Comment on any differences between the reactions of the Liberal and Labor delegates.

8 Sixty-one union workers were surveyed and asked whether the number of public holidays should be reduced. Thirty-five supported a reduction. Fifty-nine non-union workers were also asked and 31 supported a reduction. Present the data in percentage form in a two-way frequency table. Comment on any difference between the reactions of the union and non-union workers.

The scatterplot

We often want to know if there is some sort of relationship between two numerical variables. A scatterplot, which gives a visual display of the relationship between two variables, provides a good starting point.

Consider the data obtained from last year’s 12B class at Northbank Secondary College. Each student in this class of 29 students was asked to give an estimate of the average number of hours of study per week they did during Year 12. They were also asked the TER score they obtained.

  Average hours   TER    |  Average hours   TER    |  Average hours   TER
  of study        score  |  of study        score  |  of study        score
       18          59    |       14          54    |       17          59
       16          67    |       17          72    |       16          76
       22          74    |       14          63    |       14          59
       27          90    |       19          72    |       29          89
       15          62    |       20          58    |       30          93
       28          89    |       10          47    |       30          96
       18          71    |       28          85    |       23          82
       19          60    |       25          75    |       26          35
       22          84    |       18          63    |       22          78
       30          98    |       19          61    |

[Figure: scatterplot of the data; horizontal axis: Average number of hours of study per week (10 to 30); vertical axis: TER score (40 to 100); the point (26, 35) is labelled]

The figure shows the data plotted on a scatterplot. It is reasonable to think that the number of hours of study put in each week by students would affect their TER scores, and so the number of hours of study per week is the independent variable and appears on the horizontal axis. The TER score is the dependent variable and appears on the vertical axis.

There are 29 points on the scatterplot. Each point represents the hours studied and the TER score of one student.

In analysing the scatterplot we look for a pattern in the way the points lie. Certain patterns tell us that certain relationships exist between the two variables. This is referred to as correlation. We look at what type of correlation exists and how strong it is.

In the figure we see some sort of pattern: the points are spread in a rough corridor from bottom left to top right. We refer to data following such a direction as having a positive relationship. This tells us that as the average number of hours studied per week increases, the TER score increases.

The point (26, 35) is an outlier. It stands out because it is well away from the other points and clearly is not part of the ‘corridor’ referred to above.
This outlier may have occurred because a student worked very hard but found the VCE pretty tough, or perhaps the student exaggerated the number of hours he or she worked in a week, or perhaps there was a recording error. This needs to be checked.

We could describe the rest of the data as having a linear form, as the straight line in the diagram indicates.

[Figure: the same scatterplot with a straight line drawn through the points; horizontal axis: Average number of hours of study per week (10 to 30); vertical axis: TER score (40 to 100)]

When describing the relationship between two variables displayed on a scatterplot, we need to comment on:
(a) the direction: whether it is positive or negative
(b) the form: whether it is linear or non-linear
(c) the strength: whether it is strong, moderate or weak.

Below is a gallery of scatterplots showing the various patterns we look for.

[Figure: gallery of nine scatterplots with the following panel titles]
Weak, positive linear relationship
Moderate, positive linear relationship
Strong, positive linear relationship
Weak, negative linear relationship
Moderate, negative linear relationship
Strong, negative linear relationship
Perfect, negative linear relationship
No relationship
Perfect, positive linear relationship

WORKED Example 7
The scatterplot shows the number of hours people spend at work each week and the number of hours people get to spend on recreational activities during the week. Decide whether or not a relationship exists between the variables and, if it does, comment on whether it is positive or negative; weak, moderate or strong; and whether or not it has a linear form.

[Figure: scatterplot; horizontal axis: Hours worked (10 to 70); vertical axis: Hours for recreation (5 to 25)]

THINK
(a) The points on the scatterplot are spread in a certain pattern, namely in a rough corridor from the top left to the bottom right corner. This tells us that as the work hours increase, the recreation hours decrease.
(b) The corridor is straight (that is, it would be reasonable to fit a straight line into it).
(c) The points are not too tight and not too dispersed either.
(d) The pattern resembles the central diagram in the gallery of scatterplots shown previously.
WRITE
There is a moderate, negative linear relationship between the two variables.

Worked example 8
Data giving the average weekly number of hours studied by each student in 12B at Northbank Secondary College and the corresponding height of each student (to the nearest tenth of a metre) are given in the table below.

Average hours  Height   Average hours  Height   Average hours  Height   Average hours  Height
of study       (m)      of study       (m)      of study       (m)      of study       (m)
18             1.5      19             2.0      20             1.9      16             1.6
16             1.9      22             1.9      10             1.9      14             1.9
22             1.7      30             1.6      28             1.5      29             1.7
27             2.0      14             1.5      25             1.7      30             1.8
15             1.9      17             1.7      18             1.8      30             1.5
28             1.8      14             1.8      19             1.8      23             1.5
18             2.1      19             1.7      17             2.1      22             2.1

Construct a scatterplot for the data and use it to comment on the direction, form and strength of any relationship between the number of hours studied and the height of the students.

THINK / WRITE/DISPLAY
1 Construct the scatterplot. In this case it is almost impossible to decide which is the independent variable and which is the dependent variable, and therefore on which axis we will place the variables. In such cases, placing either variable on either axis is reasonable.

[Figure: scatterplot of height (m) (vertical axis) against average number of hours studied each week (horizontal axis).]

2 The scatterplot can be constructed using a graphics calculator:
(a) Press Y= and CLEAR any functions.
(b) Press 2nd [STAT PLOT] and select 4:PlotsOff. Press ENTER.
(c) Press STAT and select 1:Edit. Press ENTER.
(d) Clear any existing lists and enter the list of hours of study in L1 and the list of heights in L2.
(e) Press 2nd [STAT PLOT] and select 1:Plot 1.
(f) Press ENTER to turn the plot ON, and select the first icon which indicates a scatterplot.
(g) For Xlist, select L1 and for Ylist select L2 and select the first symbol in Mark.
(h) Press ZOOM and select 9:ZoomStat.
(i) Press ENTER to see the scatterplot.
3 Comment on the direction of any relationship.
WRITE: There is no relationship; the points appear to be randomly placed.
4 Comment on the form of the relationship.
WRITE: There is no form, no linear trend, no quadratic trend, just a random placement of points.
5 Comment on the strength of any relationship.
WRITE: Since there is no relationship, strength is not relevant. Clearly, the number of hours you study for your VCE has no effect on how tall you might be!

Note that when working with the scatterplot, to change settings at any time use WINDOW. To identify the coordinates of individual points, use the TRACE key with the arrow keys.

Remember
1. When we are investigating if there is any sort of relationship between two numerical variables, a scatterplot provides a useful starting point. It gives a visual display of the relationship between two such variables.
2. In analysing the scatterplot we look for a pattern in the way the points lie. Certain patterns tell us that certain relationships exist between the two variables. This is referred to as a correlation. We look at what type of correlation exists and how strong it is.
3. When describing the relationship between two variables displayed on a scatterplot, we need to comment on:
(a) the direction: whether it is positive or negative
(b) the form: whether it is linear or non-linear
(c) the strength: whether it is strong, moderate or weak.

Exercise 2E The scatterplot
Have your graphics calculator at hand for the following exercise questions.

1 For each of the following pairs of variables, write down whether or not you would reasonably expect a relationship to exist between the pair and, if so, comment on whether it would be a positive or negative association.
a Time spent in a supermarket and money spent
b Income and value of car driven
c Number of children living in a house and time spent cleaning the house
d Age and number of hours of competitive sport played per week
e Amount spent on petrol each week and distance travelled by car each week
f Number of hours spent in front of a computer each week and time spent playing the piano each week
g Amount spent on weekly groceries and time spent gardening each week

2 For each of the scatterplots below, describe whether or not a relationship exists between the variables and, if it does, comment on whether it is positive or negative, whether it is weak, moderate or strong and whether or not it has a linear form. (See worked example 7.)

[Figure: six scatterplots, a to f. Axis labels recovered from the original: a fitness level against age; b haemoglobin count against cigarettes smoked; c marks at school (%) against weekly hours of study; d weekly expenditure on gardening magazines ($) against hours spent gardening per week; e hours spent using a computer per week against hours spent cooking per week; f time under water (s) against age.]

3 multiple choice
From the scatterplot shown, it would be reasonable to observe that:
A as the value of x increases, the value of y increases
B as the value of x increases, the value of y decreases
C as the value of x increases, the value of y remains the same
D as the value of x remains the same, the value of y increases
E there is no relationship between x and y

[Figure: scatterplot of y against x.]

4 The population of a municipality (to the nearest ten thousand) together with the number of primary schools in that particular municipality is given below for 11 municipalities. (See worked example 8.)

Population (’000)        110  130  130  140  150  160  170  170  180  180  190
No. of primary schools     4    4    6    5    6    8    6    7    8    9    8

Construct a scatterplot for the data and use it to comment on the direction, form and strength of any relationship between the population and the number of primary schools.

5 The table below contains data giving the time taken for a paving job and the cost of the job.

Time taken (hours)     5     7     5     8    10    13    15    20    18    25    23
Cost of job ($)     1000  1000  1500  1200  2000  2500  2800  3200  2800  4000  3000

Construct a scatterplot for the data. Comment on whether a relationship exists between the time taken and the cost. If there is a relationship, describe it.

6 The table below shows the time of booking (how many days in advance) of the tickets for a musical performance and the corresponding row number in A-Reserve.

Time of booking  Row No.   Time of booking  Row No.
 5               15        20               10
 6               15        21                8
 7               15        22                5
 7               14        24                4
 8               14        25                3
11               13        28                2
13               13        29                2
14               12        29                1
14               10        30                1
17               11        31                1

Construct a scatterplot for the data. Comment on whether a relationship exists between the time of booking and the number of the row and, if there is a relationship, describe it.

The q-correlation coefficient
The q-correlation coefficient is a measure of the strength of the association between two variables. In the previous section we estimated the strength of association by looking at a scatterplot and forming a judgment about whether the correlation between the variables was positive or negative and whether the correlation was weak, moderate or strong. The calculation of the q-correlation coefficient aids us considerably in making that judgment.

To calculate the q-correlation coefficient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values. (If there are n points, the median is located at the $\frac{n+1}{2}$th place.) Draw a vertical line through this median value.
Step 3. Locate the median of the y-values and draw a horizontal line through this median value.
Step 4.
The scatterplot is now divided into 4 sections or quadrants (hence the name ‘q’-correlation coefficient).
(a) Label these sections A, B, C and D.
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) The number of points in section A is denoted by a, the number of points in section B is denoted by b, and so on.

[Figure: the scatterplot divided by the two median lines into quadrants A (upper right), B (upper left), C (lower left) and D (lower right).]

Step 5. Calculate the q-correlation coefficient using the formula:

$q = \dfrac{(a + c) - (b + d)}{a + b + c + d}$

Worked example 9
Calculate the q-correlation coefficient for the data shown in the scatterplot.

[Figure: scatterplot of 15 points, y against x.]

THINK / WRITE
1 (a) Locate the median of the x-values. Note that we are talking here about the x-values of the data observations given. In the scatterplot shown there are 15 points. Each point has an x-value and a y-value. To find the median of the x-values we look for the horizontal middle point; that is, we look for the $\frac{15 + 1}{2}$ = 8th point from the left (from the right, the point will be the same).
(b) Draw a vertical line through this median value. Note that there are 7 points to the right of this line and 7 to the left.
2 (a) Locate the median of the y-values. This is done in a similar way to finding the median of the x-values except, instead of counting from the left or right, we count from the top or bottom to find the 8th point.
(b) Draw a horizontal line through this median value. Note that there are 7 points above this line and 7 below.
3 (a) Label the quadrants A, B, C and D.
(b) Count the number of points in each section. Do not count points that are on the lines.
WRITE: a = 6, b = 0, c = 6, d = 1
4 Write the formula for calculating the q-coefficient.
WRITE: $q = \dfrac{(a + c) - (b + d)}{a + b + c + d}$
5 Substitute the values of a, b, c and d into the formula and evaluate.
WRITE: $q = \dfrac{(6 + 6) - (0 + 1)}{6 + 0 + 6 + 1} = \dfrac{11}{13} = 0.85$ (correct to 2 decimal places)

The value of the q-correlation coefficient in the above example indicates a strong correlation. The guide below gives a rough indication of the strength of the correlation suggested by the value of q.

Value of q         Strength of association
0.75 to 1          Strong positive association
0.5 to 0.75        Moderate positive association
0.25 to 0.5        Weak positive association
−0.25 to 0.25      No association
−0.5 to −0.25      Weak negative association
−0.75 to −0.5      Moderate negative association
−1 to −0.75        Strong negative association

The scatterplots below show three special values of the q-correlation coefficient.

[Figure: three scatterplots, each divided into quadrants A, B, C and D, with the following calculations.]

$q = \dfrac{(8 + 8) - (0 + 0)}{8 + 0 + 8 + 0} = 1$

$q = \dfrac{(0 + 0) - (8 + 8)}{0 + 8 + 0 + 8} = -1$

$q = \dfrac{(3 + 3) - (3 + 3)}{3 + 3 + 3 + 3} = 0$

The sign of the q-value indicates the direction of the relationship; that is, whether there is a negative or positive association.

In the first two cases shown above, the q-values are at both extremes. That is, q = 1 and −1 respectively. We would describe the variables as showing a very strong association. Having said that, the points are not showing a strong linear form or, for that matter, any linear form. The q-correlation coefficient merely gives us an idea of which quadrants contain the most points; but beyond that, the points can be in any position in the quadrants. In that sense, the q-correlation coefficient is a rather blunt instrument.

Worked example 10
An investigation was made into the relationship between the time spent watching television in the week preceding a Maths test and the mark obtained (out of 20) in that Maths test. The following data were recorded.

Time (h)  Mark   Time (h)  Mark   Time (h)  Mark
 4        15     10         8     12        10
 5        16     20         5      5         8
 5        20      5        12     20         8
10        12     15         4     15        10
15         8     15        12     20        10

Draw a scatterplot and calculate the q-correlation coefficient.
Comment on the relationship between the two variables.

THINK / WRITE/DISPLAY
1 Draw a scatterplot. We can use a graphics calculator to draw the scatterplot.
(a) On the lists screen (press STAT, select EDIT and 1:Edit), enter the two lists of data into L1 and L2.
(b) Press 2nd [STAT PLOT] and select 4:PlotsOff.
(c) Press ENTER.
(d) Press 2nd [STAT PLOT] and select 1:Plot1.
(e) Select On, and for Type, select the first icon (scatterplot).
(f) For Xlist, type in L1 (use 2nd [L1]); for Ylist, type in L2; for Mark, select the first symbol.
(g) Press ZOOM and select 9:ZoomStat. The display now shows the scatterplot.

[Figure: scatterplot of Maths mark (vertical axis) against time watching TV in hours (horizontal axis).]

2 We can also use the graphics calculator to help calculate q.
(a) Press 2nd [QUIT] and return to the home screen.
(b) Press 2nd [DRAW] and select 4:Vertical.
(c) Press 2nd [LIST].
(d) From the MATH menu, select 4:median(.
(e) Type L1 (use 2nd [L1]) at the prompt, then ENTER, and the scatterplot appears with the vertical median line drawn.
(f) Similarly, to create the horizontal median line, press 2nd [QUIT] and return to the home screen.
(g) Press 2nd [DRAW] and select 3:Horizontal.
(h) Press 2nd [LIST] and from the MATH menu, select 4:median(.
(i) Type L2 at the prompt, press ENTER and the scatterplot appears with the horizontal median line drawn as well.
3 Count and record the number of points in each quadrant.
WRITE: a = 1, b = 5, c = 2, d = 4
4 Write the formula for calculating the q-correlation coefficient.
WRITE: $q = \dfrac{(a + c) - (b + d)}{a + b + c + d}$
5 Substitute the values of a, b, c and d into the formula and evaluate.
WRITE: $q = \dfrac{(1 + 2) - (5 + 4)}{1 + 5 + 2 + 4} = -\dfrac{6}{12} = -0.5$
6 Comment on the relationship.
WRITE: There is a moderate, negative association between the hours of television watched and the Maths mark obtained.
The negative association means that as the number of hours of television watched prior to the test increased, the marks in the Maths test tended to decrease. The moderate association suggests that it may be worth further investigating the association.

Remember
1. The q-correlation coefficient is a measure of the strength of the association between two variables.
2. To calculate the q-correlation coefficient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values and draw a vertical line through this median value.
Step 3. Locate the median of the y-values and draw a horizontal line through this median value.
Step 4. (a) Label the sections thus formed A, B, C and D.
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) The number of points in section A is denoted by a, and so on.
Step 5. Calculate the q-correlation coefficient using the formula:

$q = \dfrac{(a + c) - (b + d)}{a + b + c + d}$

3. The sign of the q-value indicates the direction of the relationship (whether there is a negative association or a positive association) while the size of it indicates the strength (whether the relationship is strong, moderate or weak).
4. The q-correlation coefficient gives us an idea of into which quadrants the points fall, but beyond that the points can be in any position in the quadrants. In that sense, the q-correlation coefficient is a rather blunt instrument.

Exercise 2F The q-correlation coefficient
1 Calculate the q-correlation coefficient for each of the sets of data shown on the scatterplots below. (See worked example 9.)

[Figure: six scatterplots, a to f, each of y against x.]

2 The data given in the table below show the results of an investigation into the mass and the height of a certain breed of dog. (See worked example 10.)
a Draw a scatterplot and calculate the q-correlation coefficient.
b Comment on the relationship between the height and the mass of this breed of dog.

Height (cm)   41   40   35   38   43   44   37   39   42   44   31
Mass (kg)    4.5    5    4  3.5  5.5    5    5    4    4    6  3.5

3 The data in the table below show the number of hours spent by students who are learning touch-typing and their corresponding speed in words per minute (wpm).
a Using a graphics calculator or otherwise, calculate the q-correlation coefficient for these data.
b Comment on the relationship between the number of hours spent on learning and the speed of typing.

Time (h)      20  33  22  39  40  37  46  44  24  36  50  48  29
Speed (wpm)   34  46  38  53  52  49  60  58  36  42  65  63  40

4 multiple choice
The q-correlation coefficient for data shown in the scatterplot is:
1 1 5 5 9 A – ----- 11 - B – -- - 9 C - – ----- 11 D -- 9 - E - -- 9

[Figure: scatterplot of y against x.]

5 multiple choice
A researcher calculates the q-correlation coefficient for the relationship between time (in days) and the diameter (measured in mm) of a crystal that is changing in size. The value is 0.82. Based on this, the correlation between time and the diameter of the crystal could be described as:
A strong and negative
B strong and positive
C weak and positive
D weak and negative
E moderate and positive

Pearson’s product–moment correlation coefficient
We saw in the previous exercise that the q-correlation coefficient was a rather blunt instrument for measuring correlation between variables. A more precise tool is Pearson’s product–moment correlation coefficient. This coefficient is used to measure the strength of linear relationships between variables; the q-correlation coefficient, on the other hand, can be used for both linear and non-linear relationships. Pearson’s coefficient is therefore more specialised and can give us a much more precise picture of the strength of the linear relationship between two variables.

The symbol for Pearson’s product–moment correlation coefficient is r.
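To make the contrast between the two coefficients concrete, here is a minimal Python sketch (not part of the original text). The helper names `q_coefficient` and `pearson_r` and the eight data points are made up for illustration. `pearson_r` assumes the standard formula for r: the sum of the products of the standardised x- and y-values, divided by n − 1.

```python
from statistics import mean, median, stdev

def q_coefficient(xs, ys):
    """q-correlation: compare counts in quadrants A and C with B and D."""
    mx, my = median(xs), median(ys)
    same = opposite = 0
    for x, y in zip(xs, ys):
        side = (x - mx) * (y - my)
        if side > 0:        # upper right (A) or lower left (C)
            same += 1
        elif side < 0:      # upper left (B) or lower right (D)
            opposite += 1
        # side == 0 means the point lies on a median line: not counted
    return (same - opposite) / (same + opposite)

def pearson_r(xs, ys):
    """Pearson's r from standardised values (sample standard deviations)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data: every point is in quadrant A or C, but the points
# are scattered within those quadrants rather than lying on a line.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [4, 1, 3, 2, 9, 6, 8, 7]

print(q_coefficient(xs, ys))        # 1.0
print(round(pearson_r(xs, ys), 2))  # 0.77
```

For these points q takes its extreme value of 1, while r is only about 0.77, which is the sense in which r gives a sharper picture of how closely the points follow a straight line.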
Below is a gallery of scatterplots with the corresponding value of r for each.

[Figure: gallery of nine scatterplots with r = 1, r = −1, r = 0, r = 0.7, r = −0.5, r = −0.9, r = 0.8, r = 0.3 and r = −0.2.]

The two extreme values of r (1 and −1) are shown in the first two diagrams respectively. It is interesting to compare these two scatterplots with those showing extreme values (1 and −1) of q.

[Figure: four scatterplots labelled q = 1, r = 1, q = −1 and r = −1.]

In the four diagrams above, the scatterplots that show matching values of q and r are placed side by side. We see just how differently the points on the scatterplots are arranged and note from this that the r value gives us a much sharper impression of the relationship between the variables. That is, a value of r = 1 means that there is perfect linear association between the variables, which is not necessarily the case when q = 1!

Value of r         Strength of association
0.75 to 1          Strong positive linear association
0.5 to 0.75        Moderate positive linear association
0.25 to 0.5        Weak positive linear association
−0.25 to 0.25      No linear association
−0.5 to −0.25      Weak negative linear association
−0.75 to −0.5      Moderate negative linear association
−1 to −0.75        Strong negative linear association

In describing the strength of the relationship between the variables, the rough guide we used with the q-correlation coefficient can also be used with Pearson’s coefficient. The difference, of course, is that the value of r gives us a measure of the strength of linear relationships specifically.

Worked example 11
For each of the following:
i Estimate the value of Pearson’s product–moment correlation coefficient (r) from the scatterplot.
ii Use this to comment on the strength and direction of the relationship between the two variables.

[Figure: three scatterplots, a, b and c.]

THINK / WRITE
a 1 Compare these scatterplots with those in the gallery of scatterplots shown previously and estimate the value of r.
WRITE: a i r ≈ 0.9
2 Comment on the strength and direction of the relationship.
WRITE: ii The relationship can be described as a strong, positive, linear relationship.
b THINK: Repeat steps 1 and 2 as in a.
WRITE: b i r ≈ −0.7
ii The relationship can be described as a moderate, negative, linear relationship.
c THINK: Repeat steps 1 and 2 as in a.
WRITE: c i r ≈ −0.1
ii There is no linear relationship.

Note that the symbol ≈ means ‘approximately equal to’. We use it instead of the = sign to emphasise that the value (in this case r) is only an estimate.

In completing the worked example above, we notice that estimating the value of r from a scatterplot is rather like making an informed guess. In the next section of work we will see how to obtain the actual value of r.

Remember
1. Pearson’s product–moment correlation coefficient is used to measure the strength of a linear relationship between two variables.
2. The symbol for Pearson’s product–moment correlation coefficient is r.
3. The estimate of r can be obtained from the scatterplot.

Exercise 2G Pearson’s product–moment correlation coefficient
1 What type of linear relationship does each of the following values of r suggest?
a 0.21  b 0.65  c −1  d −0.78  e 1  f 0.9  g −0.34  h −0.1

2 For each of the following (see worked example 11):
i Estimate the value of Pearson’s product–moment correlation coefficient (r) from the scatterplot.
ii Use this to comment on the strength and direction of the relationship between the two variables.

[Figure: eight scatterplots, a to h.]

3 multiple choice
A set of data relating the variables x and y is found to have an r value of 0.62. The scatterplot that could represent the data is:

[Figure: five scatterplots, A to E.]

4 multiple choice
A set of data relating the variables x and y is found to have an r value of −0.45. A true statement about the relationship between x and y is:
A There is a strong linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
B There is a moderate linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
C There is a moderate linear relationship between x and y and when the x-values increase, the y-values tend to decrease.
D There is a weak linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
E There is a weak linear relationship between x and y and when the x-values increase, the y-values tend to decrease.

Calculating r and the coefficient of determination

Pearson’s product–moment correlation coefficient
The formula for calculating Pearson’s correlation coefficient r is as follows:

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

where n is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values.

The calculation of r by hand using this formula is unnecessary. The calculation of r is done far more efficiently using a graphics calculator.

There are two important limitations on the use of r. First, since r measures the strength of a linear relationship, it would be inappropriate to calculate r for data which are not linear (for example, data which a scatterplot shows to be in a quadratic form). Second, outliers can bias the value of r. Consequently, if a set of linear data contains an outlier, then r is not a reliable measure of the strength of that linear relationship. The calculation of r is applicable to sets of bivariate data which are known to be linear in form and which do not have outliers.

With those two provisos, it is good practice to draw a scatterplot for a set of data to check for a linear form and an absence of outliers before r is calculated. Having a scatterplot in front of you is also useful because it enables you to estimate what the value of r will be, as you did in exercise 2G, and thus you can check that your workings on the calculator are correct.

Worked example 12
The heights (in centimetres) of 21 football players were recorded against the number of marks they took in a game of football. The data are shown in the table below.
Height (cm)  Number of marks taken   Height (cm)  Number of marks taken
184           6                      182           7
194          11                      185           5
185           3                      183           9
175           2                      191           9
186           7                      177           3
183           5                      184           8
174           4                      178           4
200          10                      190          10
188           9                      193          12
184           7                      204          14
188           6

a Construct a scatterplot for the data.
b Comment on the correlation between the heights of players and the number of marks that they take, and estimate the value of r.
c Calculate r and use it to comment on the relationship between the heights of players and the number of marks they take in a game.

THINK / WRITE/DISPLAY
a Using a graphics calculator, construct a scatterplot. Refer to worked example 8 in the section on scatterplots for directions on how to use the graphics calculator to draw a scatterplot.
b Comment on the correlation between the variables and estimate the value of r.
WRITE: The data show what appears to be a linear form of moderate strength. We might expect r ≈ 0.6.
c 1 Because there is a linear form and there are no outliers, the calculation of r is appropriate. Calculate r, using a graphics calculator. The lists are in place from the scatterplot. Firstly press 2nd [CATALOG] and select DiagnosticOn and press ENTER. Press STAT and select CALC and 4:LinReg(ax+b). Press ENTER. LinReg(ax+b) appears. Type L1, L2. Press ENTER.
WRITE: r = 0.86
2 The value of r = 0.86 indicates a strong positive linear relationship.
WRITE: There is a strong positive linear association between the height of a player and the number of marks he takes in a game. That is, the taller the player the more marks we might expect him to take.

Correlation and causation
In worked example 12 we saw that r = 0.86. While we are entitled to say that there is a strong association between the height of a footballer and the number of marks he takes, we cannot assert that the height of a footballer causes him to take a lot of marks.
Being tall might assist in the taking of marks, but there will be many other factors which come into play (for example skill level, accuracy of passes from teammates, abilities of the opposing team, and so on).

So, while establishing a high degree of correlation between two variables is very interesting and can often flag the need for further, more detailed investigation, it in no way gives us any basis to comment on whether or not one variable causes particular values in another variable.

The coefficient of determination
The coefficient of determination is given by $r^2$. Obviously, it is very easy to calculate: we merely square Pearson’s product–moment correlation coefficient (r).
1. The coefficient of determination is useful when we have two variables which have a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.
2. The coefficient of determination provides a measure of how well the linear rule linking the two variables (x and y) predicts the value of y when we are given the value of x.

Worked example 13
A set of data giving the number of police traffic patrols on duty and the number of fatalities for the region was recorded and a correlation coefficient of r = −0.8 was found. Calculate the coefficient of determination and interpret its value.

THINK / WRITE
1 Calculate the coefficient of determination by squaring the given value of r.
WRITE: Coefficient of determination $= r^2 = (-0.8)^2 = 0.64$
2 Interpret your result.
WRITE: We can conclude from this that 64% of the variation in the number of fatalities can be explained by the variation in the number of police traffic patrols on duty. This means that the number of police traffic patrols on duty is a major factor in predicting the number of fatalities.

Remember
1.
The formula for calculating Pearson’s correlation coefficient r is as follows:

$r = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

where n is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values.
2. The calculation of r by hand using this formula is unnecessary. The calculation of r is done far more efficiently using a graphics calculator.
3. The calculation of r is applicable to sets of bivariate data which are known to be linear in form and which do not have outliers.
4. Even if we find that two variables have a very high degree of correlation, for example r = 0.95, we cannot say that the value of one variable is caused by the value of the other variable.
5. The coefficient of determination = $r^2$.
6. The coefficient of determination is useful when we have two variables which have a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.

Exercise 2H Calculating r and the coefficient of determination
1 The yearly salary ($’000) and the number of votes polled in the Brownlow medal count are given below for 10 leading footballers. (See worked example 12.)

Yearly salary ($’000)   180  200  160  250  190  210  170  150  140  180
Number of votes          24   15   33   10   16   23   14   21   31   28

a Construct a scatterplot for the data.
b Comment on the correlation of salary and the number of votes and make an estimate of r.
c Calculate r and use it to comment on the relationship between yearly salary and number of votes.

2 A set of data, obtained from 40 smokers, gives the number of cigarettes smoked per day and the number of visits per year to the doctor. (See worked example 13.) The Pearson’s correlation coefficient for these data was found to be 0.87.
Calculate the coefficient of determination for the data and interpret its value.

3 Data giving the annual advertising budgets ($’000) and the yearly profit increases (%) of 8 companies are shown below.

Annual advertising budget ($’000)    11   14   15   17   20   25   25   27
Yearly profit increase (%)          2.2  2.2  3.2  4.6  5.7  6.9  7.9  9.3

a Construct a scatterplot for these data.
b Comment on the correlation of the advertising budget and profit increase and make an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the yearly profit increase that can be explained by the variation in the advertising budget.

4 Data showing the number of tourists visiting a small country in a month and the corresponding average monthly exchange rate for the country’s currency against the American dollar are given below.

Number of tourists (’000)     2    3    4    5    7    8    8   10
Exchange rate               1.2  1.1  0.9  0.9  0.8  0.8  0.7  0.6

a Construct a scatterplot for the data.
b Comment on the correlation between the number of tourists and the exchange rate and give an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the number of tourists that can be explained by the exchange rate.

5 Data showing the number of people in 9 households against weekly grocery costs are given below.

Number of people in household     2    5    6    3    4    5    2    6    3
Weekly grocery costs ($)         60  180  210  120  150  160   65  200   90

a Construct a scatterplot for the data.
b Comment on the correlation of the number of people in a household and the weekly grocery costs and give an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the weekly grocery costs that can be explained by the variation in the number of people in a household.

6 Data showing the number of people on 8 fundraising committees and the annual funds raised are given below.
Number of people on committee       3     6     4       8     5       7     3     6
Annual funds raised ($)          4500  8500  6100  12 500  7200  10 000  4700  8800

a Construct a scatterplot for these data.
b Comment on the correlation between the number of people on a committee and the funds raised and make an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the funds raised that can be explained by the variation in the number of people on a committee.

The following information applies to questions 7 and 8.
A set of data was obtained from a large group of women with children under 5 years of age. They were asked the number of hours they worked per week and the amount of money they spent on child-care. The results were recorded and the value of Pearson’s correlation coefficient was found to be 0.92.

7 multiple choice
Which of the following is not true?
A The relationship between the number of working hours and the amount of money spent on child-care is linear.
B There is a positive correlation between the number of working hours and the amount of money spent on child-care.
C The correlation between the number of working hours and the amount of money spent on child-care can be classified as strong.
D As the number of working hours increases, the amount spent on child-care increases as well.
E The increase in the number of hours causes the increase in the amount of money spent on child-care.

8 multiple choice
Which of the following is not true?
A The coefficient of determination is about 0.85.
B The number of working hours is the major factor in predicting the amount of money spent on child-care.
C About 85% of the variation in the number of hours worked can be explained by the variation in the amount of money spent on child-care.
D Apart from number of hours worked, there could be other factors affecting the amount of money spent on child-care.
E About 17/20 of the variation in the amount of money spent on child-care can be explained by the variation in the number of hours worked.

Summary

Types of data
• Bivariate data are data with two variables.
• Numerical data involve quantities which are measurable or countable.
• Categorical data are data divided into categories.
• In a relationship involving two variables, if the values of one variable depend on the values of another variable, then the former variable is referred to as the dependent variable and the latter variable is referred to as the independent variable.
• When data are displayed on a graph, the independent variable is placed on the horizontal axis and the dependent variable is placed on the vertical axis.

Back-to-back stem plots
• A back-to-back stem plot displays bivariate data involving a numerical variable and a categorical variable with two categories.
• Together with summary statistics, back-to-back stem plots can be used to compare the two distributions.

Parallel boxplots
• To display a relationship between a numerical variable and a categorical variable with more than two categories, we can use a parallel boxplot.
• A parallel boxplot is obtained by constructing individual boxplots for each distribution, using a common scale.

The two-way frequency table
• The two-way frequency table is a tool for examining the relationship between two categorical variables.
• If the total number of scores in each of the two categories is unequal, percentages should be calculated in order to be able to analyse the table properly.
• When the independent variable is placed in the columns of the table, the numbers in each column should be expressed as a percentage of that column’s total.

The scatterplot
• A scatterplot gives a visual display of the relationship between two numerical variables.
• In analysing the scatterplot we look for a pattern in the way the points lie.
Certain patterns tell us that certain relationships exist between the two variables. This is referred to as a correlation. We look at what type of correlation exists and how strong it is.
• When describing the relationship between two variables displayed on a scatterplot, we need to comment on:
(a) the direction: whether it is positive or negative
(b) the form: whether it is linear or non-linear
(c) the strength: whether it is strong, moderate or weak.

The q-correlation coefficient
• The q-correlation coefficient gives us a measure of the strength of the association between two variables.
• To calculate the q-correlation coefficient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values. Draw a vertical line through this median value.
Step 3. Locate the median of the y-values. Draw a horizontal line through this median value.
Step 4. The scatterplot is now divided into 4 sections or quadrants.
(a) Label these sections A, B, C and D.
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) The number of points in section A is denoted by a, the number of points in section B is denoted by b, and so on.
[Diagram: scatterplot axes divided by the two median lines, with quadrant A at top right, B at top left, C at bottom left and D at bottom right]
Step 5. Calculate the q-correlation coefficient, using the formula:

$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$

• The sign of the q-value indicates the direction of the relationship; that is, whether there is a negative association or a positive association. The magnitude of q indicates whether the relationship is strong, moderate or weak.
• The q-correlation coefficient gives us an idea of into which quadrants the points fall, but beyond that the points can be in any position in the quadrants. The q-correlation coefficient in that sense is a rather blunt instrument.
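The five steps for the q-correlation coefficient can be sketched in Python as a way of checking a count made by hand. This is a minimal sketch, not part of the original text: the function name `q_correlation` and the sample values are hypothetical and chosen only for illustration. It assumes quadrant A is top right, B top left, C bottom left and D bottom right, and it skips points lying on either median line, as Step 4(c) requires.

```python
from statistics import median


def q_correlation(xs, ys):
    """Return q = ((a + c) - (b + d)) / (a + b + c + d) for paired data."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")

    # Steps 2 and 3: the median lines divide the plot into quadrants.
    x_med = median(xs)
    y_med = median(ys)

    # Step 4: count the points in each quadrant.
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        if x == x_med or y == y_med:
            continue  # points on a median line are not counted
        if x > x_med and y > y_med:
            a += 1  # quadrant A: top right
        elif x < x_med and y > y_med:
            b += 1  # quadrant B: top left
        elif x < x_med and y < y_med:
            c += 1  # quadrant C: bottom left
        else:
            d += 1  # quadrant D: bottom right

    total = a + b + c + d
    if total == 0:
        raise ValueError("every point lies on a median line")

    # Step 5: apply the formula.
    return ((a + c) - (b + d)) / total


# Hypothetical illustrative data (not taken from the text).
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
print(round(q_correlation(xs, ys), 2))  # a = 2, b = 1, c = 2, d = 1, so q = 2/6, prints 0.33
```

Because only the quadrant counts enter the formula, moving a point around inside its quadrant leaves q unchanged, which is the sense in which q is a blunt instrument.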
Pearson’s product–moment correlation coefficient
• Pearson’s product–moment correlation coefficient is used to measure the strength of a linear relationship between two variables.
• The symbol for Pearson’s product–moment correlation coefficient is r.
• The calculation of r is applicable to sets of bivariate data which are known to be linear in form and which don’t have outliers.
• The value of r can be estimated from the scatterplot.
• The formula for calculating Pearson’s correlation coefficient r is as follows:

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

where n is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values
• The calculation of r by hand using this formula is unnecessary. The calculation of r is done far more efficiently using a graphics calculator.
• Even if we find that two variables have a very high degree of correlation, for example r = 0.95, we cannot say that the value of one variable is caused by the value of the other variable.

Calculating the coefficient of determination
• The coefficient of determination = r².
• The coefficient of determination is useful when we have two variables which have a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.

Chapter review

Multiple choice

1 [2A] An example of a categorical variable is:
A the membership number of a club
B the number of students at each year level of a school
C the total attendance at Hawthorn football matches
D the breathalyser reading of people in a restaurant
E the monthly income for a group of people

2 [2A] In a study on the growth of plants, conducted in controlled surroundings, the dependent variable was the height of the plants.
The independent variable in the study would most likely be:
A the number of people caring for the plants
B the amount of light present
C the number of plants in the study
D whether the plants were deciduous or evergreen
E rainfall

3 [2B] One of the following pairs of variables could not be displayed on a back-to-back stem plot. It is:
A the heights of a group of students and whether or not they like football
B the kilometres travelled in a week and the mode of transport (car or train)
C the weights of a group of students and their eye colour (blue or brown)
D the annual number of trips to a doctor and whether or not the person is a smoker
E the amount spent by each child at the tuckshop and the age of the child

4 [2B] A back-to-back stem plot is a useful way of displaying the relationship between:
A the number of children attending a day care centre and whether or not the centre has federal funding
B height and wrist circumference
C age and weekly income
D weight and the number of takeaway meals eaten each week
E the age of a car and amount spent each year on servicing it

The information below relates to questions 5 and 6.
The salaries of people working at five different advertising companies are shown below on the parallel boxplots.
[Parallel boxplots for Company A, Company B, Company C, Company D and Company E; horizontal axis: Annual salary (× $1000), from 10 to 150]

5 [2C] The company with the largest interquartile range is:
A Company A
B Company B
C Company C
D Company D
E Company E

6 [2C] The company with the lowest median is:
A Company A
B Company B
C Company C
D Company D
E Company E

Questions 7 and 8 relate to the following information.
Data showing reactions of junior staff and senior staff to a relocation of offices are given below in a two-way frequency table.
Attitude   Junior staff   Senior staff   Total
For        23             14             37
Against    31             41             72
Total      54             55             109

7 [2D] From this table, we can conclude that:
A 23% of junior staff were for the relocation
B 42.6% of junior staff were for the relocation
C 31% of junior staff were against the relocation
D 62.1% of junior staff were for the relocation
E 28.4% of junior staff were against the relocation

8 [2D] From this table, we can conclude that:
A 14% of senior staff were for the relocation
B 37.8% of senior staff were for the relocation
C 12.8% of senior staff were for the relocation
D 72% of senior staff were against the relocation
E 74.5% of senior staff were against the relocation

9 [2E] The relationship between the variables x and y is shown on the scatterplot below. The correlation between x and y would be best described as:
[Scatterplot of y against x]
A a weak positive association
B a weak negative association
C a strong positive association
D a strong negative association
E non-existent

10 [2E] An investigation is made into the number of freckles on the back of a hand and the age of the subject. A strong association was found to exist. In this investigation, age is the independent variable and the number of freckles is the dependent variable. You would expect the association to be:
A negative
B positive
C bivariate
D weak
E categorical

[Scatterplot of y against x for question 11]
11 [2F] The q-correlation coefficient for the data shown in the scatterplot above is:
A −5/11
B −5/9
C 5/11
D 5/9
E 2/9

12 [2F] A researcher calculates the q-correlation coefficient for the relationship between time (in days) and the growth of the root of a bean plant (measured in millimetres). The value is 0.62. Based on this, the correlation between time and the growth of the roots could be described as:
A strong and negative
B strong and positive
C weak and positive
D weak and negative
E moderate and positive

13 [2G] A set of data relating the variables x and y is found to have an r value of −0.83.
The scatterplot that could represent this data set is:
[Five scatterplots of y against x, labelled A to E]

14 [2G] A set of data relating the variables x and y is found to have an r value of 0.65. A true statement about the relationship between x and y is:
A There is a strong linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
B There is a moderate linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
C There is a moderate linear relationship between x and y and when the x-values increase, the y-values tend to decrease.
D There is a weak linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
E There is a weak linear relationship between x and y and when the x-values increase, the y-values tend to decrease.

15 [2H] A set of data comparing age with blood pressure is found to have a Pearson’s correlation coefficient of 0.86. The coefficient of determination for these data would be closest to:
A −0.86
B −0.74
C −0.43
D 0.43
E 0.74

16 [2H] The coefficient of determination for a set of data relating age and pulse rate is 0.7. This means that:
A The correlation coefficient, r, for age against pulse rate is 0.7.
B 70% of the variation in pulse rate can be explained by the variation in age.
C 30% of the variation in pulse rate can be explained by the variation in age.
D 49% of the variation in pulse rate can be explained by the variation in age.
E 70% of those in the study had a pulse rate over 0.7.

Short answer

1 [2A] For each of the following, write down:
i whether each variable in the pair is an example of numerical or categorical data
ii which is a dependent and which is an independent variable or whether it is not appropriate to classify the variables as such.
a The number of injuries in a netball season and the age of a netball player
b The suburb and the size of a home mortgage
c IQ and weight

2 [2B] The number of hours of counselling received by a group of 9 full-time firefighters and 9 volunteer firefighters after a serious bushfire is given below.

Full-time   2   4    3    5    2    4    6    1    3
Volunteer   8   10   11   11   12   13   13   14   15

a Construct a back-to-back stem plot to display the data.
b Comment on the distributions of the number of hours of counselling of the full-time firefighters and the volunteers.

3 [2C] The IQs of 8 players in each of 3 different football teams were recorded and are shown below.

Team A   120   105   140   116   98    105   130   102
Team B   110   104   120   109   106   95    102   100
Team C   121   115   145   130   120   114   116   123

Display the data in parallel boxplots.

4 [2D] Delegates at the respective Liberal and Labor Party conferences were surveyed on whether or not they believed that uranium mining should continue. Forty-five Liberal delegates were surveyed and 15 were against continuation. Fifty-three Labor delegates were surveyed and 43 were against continuation.
a Present the data in percentages in a two-way frequency table.
b Comment on any difference between the reactions of the Liberal and Labor delegates.

5 [2E]
a Construct a scatterplot for the data given in the table below.
b Use the scatterplot to comment on any relationship which exists between the variables.

Age          15   17   18   16   19   19   17   15   17
Pulse rate   79   74   75   85   82   76   77   72   70

6 [2F] For the data given in question 5, calculate the q-correlation coefficient and use this to comment on the relationship between the two variables. (Compare your response about the relationship in this question to your response about the relationship in question 5 when you didn’t know the q-value.)

7 [2G] For the variables shown on the scatterplot at right, give an estimate of the value of r and use it to comment on the nature of the relationship between the two variables.
[Scatterplot of y against x]
8 [2H] The table below gives data relating the percentage of lectures attended by students in a semester and the corresponding mark for each student in the exam for that subject.

Lectures attended (%)   70   59   85   93   78   85   84   69   70   82
Exam result (%)         80   62   89   98   84   91   83   72   75   85

a Construct a scatterplot for these data.
b Comment on the correlation between the lectures attended and the examination results and make an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the examination results that can be explained by the variation in the number of lectures attended.

Analysis

1 An investigation into the relationship between age and salary bracket among some employees of a large computer company is made and the results are shown below.

Salary bracket ($’000)   Age
20–39                    32, 21, 43, 23, 22, 27, 37
40–59                    29, 31, 37, 26, 33, 37
60–79                    41, 29, 39, 42, 47, 45, 43, 38
80–99                    43, 48, 38, 37, 49, 51, 53, 59
100–120                  48, 37, 55, 61

a State, for each of the variables (age and salary bracket) whether they represent categorical or numerical data.
b State which is the independent variable and which is the dependent variable.
c State which of the following you could use to display the data:
i back-to-back stem plot
ii parallel boxplot
iii scatterplot
iv two-way frequency table in percentage form
d State which of the following you could calculate in order to find out more about the relationship between age and salary bracket:
i the q-correlation coefficient
ii r, the Pearson product–moment correlation coefficient
iii the coefficient of determination

2 An investigation similar to that in analysis task 1 is undertaken at an accounting firm to explore the relationship between age and salary. The data are shown below.
Age                           20   20   30   35   50   45   35   45   55    55   42   50   25   30   40
Salary (nearest thousand $)   20   40   20   30   40   80   40   60   100   70   45   85   30   60   60

a State, for each of the variables (age and salary) whether they represent categorical or numerical data.
b Display the data on a scatterplot.
c Describe the association between the two variables in terms of direction, form and strength.
d Calculate the value of q.
e Explain whether or not it is appropriate to use Pearson’s correlation coefficient to explain the relationship between age and salary.
f Estimate the value of Pearson’s correlation coefficient from the scatterplot.
g Calculate the value of this coefficient.
h Explain whether or not the salary of the employees is determined by their age.
i Calculate the value of the coefficient of determination.
j Explain what the coefficient of determination tells us about the relationship between age and salary at this accounting firm.
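For readers who want to check a graphics calculator result, the formula for r from the chapter summary and the coefficient of determination r² can be sketched in Python. This is a minimal sketch under stated assumptions, not a method given in the text: the function names `pearson_r` and `coefficient_of_determination` are hypothetical, the small data set is made up purely for illustration, and the standard deviations are taken to be sample standard deviations (divisor n − 1), which is what the 1/(n − 1) factor in the summary formula suggests.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson's r as (1 / (n - 1)) * sum of products of standardised values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")

    x_bar, y_bar = mean(xs), mean(ys)
    # Sample standard deviations (divisor n - 1); an assumption, see lead-in.
    s_x, s_y = stdev(xs), stdev(ys)

    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


def coefficient_of_determination(r):
    """The coefficient of determination is the square of r."""
    return r ** 2


# Hypothetical, perfectly linear data (not taken from the text).
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(round(pearson_r(xs, ys), 4))  # prints 1.0

# The value r = 0.92 quoted for questions 7 and 8 of the exercise.
print(round(coefficient_of_determination(0.92), 2))  # prints 0.85
```

The second result matches the "about 0.85", or 17/20, figure used in the multiple choice options for r = 0.92. As the summary notes, r should only be calculated for data that are linear in form and free of outliers, so a scatterplot is still worth drawing first.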