Frequency Diagrams and Histograms with the TI – 83 and 84
1. Consider the data in the following table:

Commuting distances in km. for sixty people

13   2  14  18  46  17  37  47   7   6
23  12  24  31  10  25  12  12   9  27
 8   3   8  45   6  18  29  16  16  21
 1   2  34   9  12  20  19   8  14  13
14  16  17  15   4  13  41  26  40   3
16   7  28  10   4  17  11  15  36  24
If you have existing data in lists, you can delete this data by choosing STAT + EDIT + 1. This displays a window with all existing lists. Select a list label with the up-arrow. Press CLEAR and then ENTER. Enter the above data, one number at a time, until all sixty numbers are entered. Your screen should look something like the following:
Choose STAT + EDIT + 2. Choose 2nd + L1 and add a right parenthesis. Press ENTER and you are informed that L1 is sorted.
Choose STAT + EDIT + 1 again, and it will be seen that the list for L1 is sorted. Entries may be inspected by using the down-arrow.
2. Decide how many bars or classes you want in a diagram and use the following formula to determine the class width, the width of a bar in a diagram.
$$\text{class width} \approx \frac{\text{maximum value} - \text{minimum value}}{\text{desired no. of classes}}$$
In this case, x60 = 47 and x1 = 1. If the desired number of classes is 10, then

$$\text{class width} = \frac{47 - 1}{10} = 4.6 \approx 5.$$

If the class width is to be a whole number, always round up to the nearest whole number. This guarantees that all the data in the class is covered.
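The rounding rule can be checked away from the calculator. The following is a minimal Python sketch, not part of the original calculator procedure; the function name class_width is illustrative only.

```python
import math


def class_width(minimum, maximum, classes):
    """Range divided by the desired number of classes, rounded up."""
    return math.ceil((maximum - minimum) / classes)


# Values from the commuting data: minimum 1, maximum 47, ten classes.
width = class_width(1, 47, 10)
print(width)  # (47 - 1) / 10 = 4.6, rounded up to 5
```

Rounding up rather than to the nearest value is what keeps the largest data value inside the last class.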
3. Determine the class midpoint for each of the ten classes, using this formula
$$\text{Midpoint} = \frac{\text{upper limit of class} + \text{lower limit of class}}{2}$$

For instance, the first class has values of x in the interval [1, 5] and a midpoint given by

$$\frac{1 + 5}{2} = 3.$$
Thus, the second class has a midpoint given by ( 6 + 10 ) / 2 = 8 and so on, up to the tenth class with a midpoint of ( 46 + 50 ) / 2 = 48. To obtain the frequency for each class, go to Edit in STAT, count the number of entries in L1 that fall in each class, and record the tally on a sheet of paper. It is also possible to inspect this list by using the up-arrow to select L1, followed by ENTER. This makes an array visible at the bottom of the screen.
If you find it convenient, the members of this array may be inspected horizontally on the screen. The frequencies can then be entered in a table, along with class limits and midpoints.
(Table no.1)

Class      Midpoint   Frequency
 1 - 5         3          7
 6 - 10        8         11
11 - 15       13         13
16 - 20       18         11
21 - 25       23          5
26 - 30       28          4
31 - 35       33          2
36 - 40       38          3
41 - 45       43          2
46 - 50       48          2
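The hand tally behind Table no.1 can be cross-checked with a short script. This is a sketch only, assuming the sixty distances are typed in as a Python list; the names DISTANCES, tally, limits, midpoints and frequencies are illustrative and do not come from the calculator.

```python
# Commuting distances in km. for the sixty people in section 1.
DISTANCES = [
    13, 2, 14, 18, 46, 17, 37, 47, 7, 6,
    23, 12, 24, 31, 10, 25, 12, 12, 9, 27,
    8, 3, 8, 45, 6, 18, 29, 16, 16, 21,
    1, 2, 34, 9, 12, 20, 19, 8, 14, 13,
    14, 16, 17, 15, 4, 13, 41, 26, 40, 3,
    16, 7, 28, 10, 4, 17, 11, 15, 36, 24,
]
LOWEST = 1    # lower limit of the first class
WIDTH = 5     # class width from section 2
CLASSES = 10  # desired number of classes


def tally(data, lowest, width, classes):
    """Count how many values fall in each class of equal width."""
    counts = [0] * classes
    for x in data:
        counts[(x - lowest) // width] += 1
    return counts


limits = [(LOWEST + k * WIDTH, LOWEST + (k + 1) * WIDTH - 1) for k in range(CLASSES)]
midpoints = [(low + high) / 2 for low, high in limits]
frequencies = tally(DISTANCES, LOWEST, WIDTH, CLASSES)

for (low, high), mid, freq in zip(limits, midpoints, frequencies):
    print(f"{low:2d} - {high:2d}  {mid:4.0f}  {freq:3d}")
```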
4. If a frequency diagram were now drawn using this data, the first bar would be situated on the interval [0, 5] and the midpoint value of 3 would not be in the center of the bar. It is now necessary to determine xmin and xmax in WINDOW so that the midpoints are in the centers of each bar. Use the following formulas:
$$x_{\min} = \text{minimum midpoint} - \frac{\text{class width}}{2}$$

$$x_{\max} = \text{maximum midpoint} + \frac{\text{class width}}{2}$$
In the present example, xmin = 3 - ( 5 / 2 ) = 0.5 and xmax = 48 + ( 5 / 2 ) = 50.5. Thus, in WINDOW enter the following values so that the screen looks like this:
In this window, Xscl is the class width, 5, and Ymax is set to 15, since no class frequency exceeds 15.
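The two window formulas amount to half a class width of padding on each side of the midpoints. A minimal Python sketch, with the illustrative function name window_bounds (not a calculator command):

```python
def window_bounds(midpoints, width):
    """Return (xmin, xmax) so that each midpoint sits in the centre of its bar."""
    return min(midpoints) - width / 2, max(midpoints) + width / 2


# Class midpoints from Table no.1 and the class width of 5.
xmin, xmax = window_bounds([3, 8, 13, 18, 23, 28, 33, 38, 43, 48], 5)
print(xmin, xmax)  # 0.5 50.5
```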
Enter the data from Table no.1: the midpoint values in L2 and the corresponding frequencies in L3.
Press 2nd + STAT PLOT + 4 and then ENTER. This activates PlotsOff. Press 2nd + STAT PLOT + 1 and then ENTER. Make certain that Plot1 is on and choose the bar chart type of graph. Select L2 for the xlist and L3 for the frequency.
Press GRAPH and the following frequency diagram is produced.
Press 2nd + STAT PLOT + 2 and then ENTER. Choose the polygon type of graph. Select L2 for the xlist and L3 for the ylist. Select a type of mark.
Press GRAPH and a polygon is produced, connecting midpoints, which are plainly visible with the type of mark chosen here.
Done on Excel, the finished product looks like this.
[Figure: Excel version of the frequency diagram. Horizontal axis: commuting distance in km., classes 1 - 5 through 46 - 50; vertical axis: frequency (no. of people), scaled 0 to 14.]
[Remark: When using the TI-83, whole number values of x must be entered when creating a frequency diagram, while decimal values may be entered when creating a polygon or a scatter diagram.]

5. One measure of the central tendency of data is that of the arithmetic mean. In the present example, this is given by the sum of all sixty distances divided by 60, the total number of distances. If the midpoint values are used, then the mean is the sum of the products of the ten midpoint values and their corresponding frequencies, divided by 60. The mean is therefore expressed in one of two ways:
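$$m = \frac{\sum_{i=1}^{60} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_{59} + x_{60}}{60} = \frac{1 + 2 + 2 + \dots + 46 + 47}{60} = 17.52$$

or

$$m = \frac{\sum_{i=1}^{10} x_i \cdot f_i}{n} = \frac{x_1 \cdot f_1 + x_2 \cdot f_2 + \dots + x_{10} \cdot f_{10}}{60} = \frac{(3 \cdot 7) + (8 \cdot 11) + \dots + (48 \cdot 2)}{60} = 17.75$$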
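Both versions of the mean can be verified with a few lines of Python. This is a sketch for checking the arithmetic only; the list names are illustrative, and the grouped values are the midpoints and frequencies of Table no.1.

```python
# Commuting distances in km. for the sixty people in section 1.
DISTANCES = [
    13, 2, 14, 18, 46, 17, 37, 47, 7, 6,
    23, 12, 24, 31, 10, 25, 12, 12, 9, 27,
    8, 3, 8, 45, 6, 18, 29, 16, 16, 21,
    1, 2, 34, 9, 12, 20, 19, 8, 14, 13,
    14, 16, 17, 15, 4, 13, 41, 26, 40, 3,
    16, 7, 28, 10, 4, 17, 11, 15, 36, 24,
]
MIDPOINTS = [3, 8, 13, 18, 23, 28, 33, 38, 43, 48]
FREQUENCIES = [7, 11, 13, 11, 5, 4, 2, 3, 2, 2]

# Mean of the raw data: the sum of all values divided by their number.
raw_mean = sum(DISTANCES) / len(DISTANCES)

# Mean of the grouped data: midpoints weighted by class frequency.
grouped_mean = sum(x * f for x, f in zip(MIDPOINTS, FREQUENCIES)) / sum(FREQUENCIES)

print(round(raw_mean, 2), grouped_mean)  # 17.52 17.75
```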
There is a slight difference in results due to the different ways of calculating. For larger groups of data the difference would tend to disappear. The first result is the more accurate one here. The second method is preferable when large groups are considered.

6. A second measure of central tendency commonly used is the median, the data value that represents the central value of an ordered distribution. As there is no single central or middle value in the present example, the median is represented by the average of the 30th and 31st data values. The data is ordered in L1 and the two middle terms can be found by choosing STAT + EDIT and using the down-arrow to go to these terms.
Thus,

$$\text{median} = \frac{\text{sum of the two middle terms}}{2} = \frac{15 + 15}{2} = 15.$$
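The same median can be read off in Python by sorting the data and averaging the 30th and 31st values. A minimal sketch, assuming the data are held in a list named DISTANCES (an illustrative name); the standard-library function statistics.median is used only as a cross-check.

```python
import statistics

# Commuting distances in km. for the sixty people in section 1.
DISTANCES = [
    13, 2, 14, 18, 46, 17, 37, 47, 7, 6,
    23, 12, 24, 31, 10, 25, 12, 12, 9, 27,
    8, 3, 8, 45, 6, 18, 29, 16, 16, 21,
    1, 2, 34, 9, 12, 20, 19, 8, 14, 13,
    14, 16, 17, 15, 4, 13, 41, 26, 40, 3,
    16, 7, 28, 10, 4, 17, 11, 15, 36, 24,
]

ordered = sorted(DISTANCES)           # same ordering as the sorted list L1
middle = (ordered[29], ordered[30])   # 30th and 31st values (0-based indices 29, 30)
median = sum(middle) / 2

print(middle, median)
print(statistics.median(DISTANCES))   # cross-check with the standard library
```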
The median can be illustrated graphically. From STAT PLOT turn off Plot1 and Plot2. Go to Plot3 and use the following settings, which will produce a box plot. (Notice that the frequency is 1, as each value in the original table of values is read once. A frequency diagram could also be produced in this way, without grouping data as was done above. The polygon, however, could not have been produced using a frequency of 1.)
Press GRAPH + TRACE and the median is indicated in a boxplot.
Box plots will be discussed in more detail in section 8 below.

Another measure of the central tendency of the above data is the mode. The mode is the data value that occurs most often, and it is often less reliable than the mean or median. In the present case, the mode could be either 12 or 16, as both these values occur four times, the maximum number of times any one data value occurs.

7. In most applications of statistics we work with a random sample of data. Here, however, we are considering a whole population of data values. With the use of the mean, median or mode the location of a population is determined. Besides knowing the location or central tendency of a population, it is also desirable to know the spread of the population, the way in which data is dispersed about the center. One convenient way of doing this is by determining the average square of the deviation from the mean. This is called the variance, and the square root of this value is called the standard deviation of the population.

The (deviation)² of a datum from the mean = $(x - m)^2$.

The total of the (deviations)² for a given class = $f_i \cdot (x_i - m)^2$.
Let s² = variance and s = standard deviation. Then the average (deviation)² from the mean, in the present example, and the standard deviation from the mean are given by

$$s^2 = \frac{\sum_{i=1}^{60} (x_i - m)^2}{60} = \frac{\sum_{i=1}^{60} x_i^2}{60} - m^2 = \frac{26515}{60} - (17.52)^2 \approx 135$$

or

$$s^2 = \frac{\sum_{i=1}^{10} f_i \cdot (x_i - m)^2}{60} = \frac{\sum_{i=1}^{10} f_i \cdot x_i^2}{60} - m^2 = \frac{27125}{60} - (17.75)^2 \approx 137$$

$$s = \sqrt{\frac{\sum_{i=1}^{60} x_i^2}{60} - m^2} \approx 11.6 \quad \text{or} \quad s = \sqrt{\frac{\sum_{i=1}^{10} x_i^2 \cdot f_i}{60} - m^2} \approx 11.7$$
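The variance and standard deviation formulas can be checked for both the raw and the grouped data. The sketch below is illustrative only: the helper population_variance is a hypothetical name, and it implements the shortcut form used above, the mean of the squares minus the square of the mean.

```python
import math

# Commuting distances in km. for the sixty people in section 1.
DISTANCES = [
    13, 2, 14, 18, 46, 17, 37, 47, 7, 6,
    23, 12, 24, 31, 10, 25, 12, 12, 9, 27,
    8, 3, 8, 45, 6, 18, 29, 16, 16, 21,
    1, 2, 34, 9, 12, 20, 19, 8, 14, 13,
    14, 16, 17, 15, 4, 13, 41, 26, 40, 3,
    16, 7, 28, 10, 4, 17, 11, 15, 36, 24,
]
MIDPOINTS = [3, 8, 13, 18, 23, 28, 33, 38, 43, 48]
FREQUENCIES = [7, 11, 13, 11, 5, 4, 2, 3, 2, 2]


def population_variance(values, freqs=None):
    """Mean of the squares minus the square of the mean, with optional frequencies."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    mean_of_squares = sum(f * x * x for x, f in zip(values, freqs)) / n
    return mean_of_squares - mean ** 2


var_raw = population_variance(DISTANCES)
var_grouped = population_variance(MIDPOINTS, FREQUENCIES)
sd_raw = math.sqrt(var_raw)
sd_grouped = math.sqrt(var_grouped)

print(round(var_raw), round(sd_raw, 1))          # 135 11.6
print(round(var_grouped), round(sd_grouped, 1))  # 137 11.7
```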
Notice that there is a slight difference, depending on whether data are counted once for each value or grouped in classes. Fortunately, all these calculations can be had with little difficulty. Using a MODE setting of 1 decimal accuracy and choosing STAT + CALC + 1, the 1-Var Stats window is made visible. Enter the data from the first list with 2nd + L1 in the following manner.
When this is entered, the following results are obtained, in this order: x̄ (mean, m), Σx (sum of xi), Σx² (sum of xi²), Sx (not needed here), σx (standard deviation, s) and n (population size).
Using the grouped data in L2 and L3, enter these data in the following manner:
The results of this are
The only quantity that needs to be calculated manually is the variance. (See above.) Letting midpoint = xi and frequency = fi , the calculator results can be illustrated in a table, using each midpoint value and its corresponding frequency in L2 and L3.
(Table No.2)

xi       fi     xi·fi    xi²·fi
 3        7       21        63
 8       11       88       704
13       13      169      2197
18       11      198      3564
23        5      115      2645
28        4      112      3136
33        2       66      2178
38        3      114      4332
43        2       86      3698
48        2       96      4608
SUM      60     1065     27125
Calculating, we have

$$m = \frac{1065}{60} = 17.75, \quad s^2 = \frac{27125}{60} - 17.75^2 \approx 137 \quad \text{and} \quad s = \sqrt{137} \approx 11.7.$$
Lines for m = 17.75, m – s = 6.05 and m + s = 29.45 can now be added to the frequency diagram. As there are 42 data values between m - s and m + s, this amounts to 70% of the total population. In other words, 70% of the data lies within one standard deviation from the mean.
[Figure: frequency diagram with vertical lines marking m - s, m and m + s. Horizontal axis: commuting distance in km., classes 1 - 5 through 46 - 50; vertical axis: frequency (no. of people), scaled 0 to 14.]
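The count of 42 values within one standard deviation of the mean can be confirmed directly from the raw data. A minimal Python sketch, using the grouped values m = 17.75 and s = 11.7 quoted above; the variable names are illustrative.

```python
# Commuting distances in km. for the sixty people in section 1.
DISTANCES = [
    13, 2, 14, 18, 46, 17, 37, 47, 7, 6,
    23, 12, 24, 31, 10, 25, 12, 12, 9, 27,
    8, 3, 8, 45, 6, 18, 29, 16, 16, 21,
    1, 2, 34, 9, 12, 20, 19, 8, 14, 13,
    14, 16, 17, 15, 4, 13, 41, 26, 40, 3,
    16, 7, 28, 10, 4, 17, 11, 15, 36, 24,
]
M = 17.75  # mean of the grouped data
S = 11.7   # standard deviation of the grouped data

low, high = M - S, M + S  # 6.05 and 29.45
inside = [x for x in DISTANCES if low < x < high]
share = len(inside) / len(DISTANCES)

print(len(inside), f"{share:.0%}")  # 42 70%
```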
8. Other ways of measuring the spread of a population are by using the range or the interquartile range. The use of the range is often not very informative, as it is merely the difference of the maximum and minimum values of a population. In the present example, the interquartile range is determined as follows:

Interquartile range = x45 - x15 = 24 - 9 = 15.

x15 is one-fourth of the way through the population and is the upper boundary of the lower quartile; x45 is three-fourths of the way and is the lower boundary of the upper quartile. Dividing the interquartile range by two gives the quartile deviation.

$$\text{Quartile deviation} = \frac{x_{45} - x_{15}}{2} = \frac{24 - 9}{2} = 7.5$$
This number says that roughly 50% of the data lies within 7.5 units of 15, the median. In a box plot, the end of the left whisker is defined by xmin, and the end of the right whisker is defined by xmax. The first quartile boundary, Q1, the median and the third quartile boundary, Q3, define the box. The following pictures may be obtained by setting Plot3 to a box plot and using TRACE.
Enter the data from the first list with 2nd + L1 in the following manner.
Go to the bottom of the screen with the down-arrow, and the following results are visible:
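The quartile figures of this section can be cross-checked in Python by picking the 15th and 45th values of the sorted data, as in the text. This is a sketch of that particular convention only; other software may use slightly different quartile rules, and the variable names are illustrative.

```python
# Commuting distances in km. for the sixty people in section 1.
DISTANCES = [
    13, 2, 14, 18, 46, 17, 37, 47, 7, 6,
    23, 12, 24, 31, 10, 25, 12, 12, 9, 27,
    8, 3, 8, 45, 6, 18, 29, 16, 16, 21,
    1, 2, 34, 9, 12, 20, 19, 8, 14, 13,
    14, 16, 17, 15, 4, 13, 41, 26, 40, 3,
    16, 7, 28, 10, 4, 17, 11, 15, 36, 24,
]

ordered = sorted(DISTANCES)
x15 = ordered[14]  # 15th value, one-fourth of the way through
x45 = ordered[44]  # 45th value, three-fourths of the way through

value_range = ordered[-1] - ordered[0]
interquartile_range = x45 - x15
quartile_deviation = interquartile_range / 2

print(value_range, x15, x45, interquartile_range, quartile_deviation)
```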
9. Once the data of a large population has been entered, it is easy to find the values of interquartile ranges and the median, if the previous method is used. Another method for investigating these values is the use of a cumulative frequency diagram and an ogive curve. Although the population of the present example is not particularly large, the use of the cumulative frequency may be easily demonstrated. Let xi = midpoint values, ub = upper boundary of classes, fi = class frequency and cf = cumulative frequency, and refer to the next table.
(Table no.3)

xi      ub      fi     cf
 3      5.5      7      7
 8     10.5     11     18
13     15.5     13     31
18     20.5     11     42
23     25.5      5     47
28     30.5      4     51
33     35.5      2     53
38     40.5      3     56
43     45.5      2     58
48     50.5      2     60
             n = 60
Each value of cf is the sum of all the frequencies up to and including the adjacent frequency. For example, 7 + 11+ 13 + 11+ 5 + 4 = 51 or, more simply, 47 + 4 = 51, i.e., each value of cf is the sum of the value above it and the value of the adjacent frequency. Using STAT+EDIT, enter the cf values in L4. The xi values should already be entered in L2. Set ymax in WINDOW to a value of 60 and yscl to a value of 10.
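The running-total rule for cf is exactly what itertools.accumulate computes in Python. A minimal sketch, with illustrative list names, for checking the cf column before typing it into L4:

```python
from itertools import accumulate

# Class frequencies fi from Table no.3.
FREQUENCIES = [7, 11, 13, 11, 5, 4, 2, 3, 2, 2]

# Each cf value is the previous cf plus the adjacent class frequency.
cumulative = list(accumulate(FREQUENCIES))

print(cumulative)
```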
Set Plot1 to a bar chart type of graph and select xlist: L2 and Freq: L4.

Draw the cumulative frequency diagram by pressing GRAPH. This gives a cumulative frequency diagram without the ogive curve.
The ogive curve represents the cumulative proportion or cumulative percentage of the population. When the frequency polygon was plotted in section 4, the curve went through midpoints. The ogive curve, however, goes through points above the upper limit of each class. This is consistent with the idea of cumulative percentage, as we want the accumulation for the whole of each class, the percentage of values that lie below the upper boundary of a class, ub. In order to draw the ogive curve we use the values of the ub variable in Table No.3 together with the values of cf. Choose STAT + EDIT and enter the values of ub in L5.
Leave the settings in WINDOW as they are. Press 2nd + STAT PLOT and set Plot2 as a polygon type of graph with xlist: L5 and ylist: L4.
With Plot1 and Plot2 on, press GRAPH and the following graph is produced:
The ogive curve is incomplete, as the tail of the curve is missing in the first class. This can be drawn by storing L4 in L6 and inserting values of L5(1) = 0.5 and L6(1) = 0. L6 must then be selected for ylist in Plot2. Try it, if you wish, but it is not necessary for our purposes here. Press TRACE and P1 will appear in the upper left-hand corner of the screen. Press the down-button and P2 will appear. Press the right-button three times and the cursor will be at x = 15.5 and y = 31, the ub for the third class. Since 30 is 50% of the total cumulative frequency of 60, the x-value of the cursor is very close to 15, which is the value of the median.
Choose 2nd + DRAW + 2 and place the cursor at y = 30 on the ogive curve. Press ENTER and then press the left-button to draw a line, until x = 0.5 and press ENTER again. With the right-button take the cursor back to the curve, and the result is as follows:
At 50% of the total cumulative frequency, the value of x is 15, the value of the median. It is not exactly 15 on the screen due to graphical resolution. Similarly, since 75% of the total cumulative frequency is 45, the lower limit of the upper quartile, a line can be drawn to indicate Q3. With DRAW place the cursor on the ogive curve as close as possible to y = 45 and draw a line to x = 0.5. Take the cursor back to the curve and the x-value should be close to 24, the value of Q3 in section 8.
Using Excel, the cumulative frequency diagram and ogive curve appear as follows:
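[Figure: Excel cumulative frequency diagram with ogive curve and horizontal lines marking the 50% and 75% levels. Horizontal axis: commuting distance in km., class midpoints 3 to 48; vertical axis: cumulative frequency, scaled 0 to 60.]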
10. The median, the upper quartile and the lower quartile are special cases of centiles or centile points. The median is the 50th centile point and the lower boundary of the upper quartile is the 75th centile point. Let us denote centile points by C50, C75, C88 or Cp, in general.

Estimating centile points is quite easy with the use of DRAW: we had only to place the cursor at cf = 30 to find the 50th centile point on the ogive curve and from this estimate the median, x = 15. Centiles can also be calculated analytically. Refer to Table no.3 in making the following calculations.
• To find C50, calculate the cf of the centile point. (0.5 ⋅ 60 = 30)
  What class has the greatest cf < 30? (Class number 2)
  What is the ub and the cf of the 2nd class? (ub = 10.5 and cf = 18)
  Take the difference of 30 and 18. (12)
  What is the frequency of the 3rd class? (f3 = 13)
  What is the class width of the class? (class width = 5)
  Thus,

$$C_{50} = 10.5 + \left(\frac{12}{13}\right) \cdot 5 \approx 15$$
• To find C88, calculate the cf of the centile point. (0.88 ⋅ 60 = 52.8)
  What class has the greatest cf < 52.8? (Class number 6)
  What is the ub and the cf of the 6th class? (ub = 30.5 and cf = 51)
  Take the difference of 52.8 and 51. (1.8)
  What is the frequency of the 7th class? (f7 = 2)
  What is the class width of the class? (class width = 5)
  Thus,

$$C_{88} = 30.5 + \left(\frac{1.8}{2}\right) \cdot 5 = 35.$$
• On your calculators this can be done with the help of L5(6), L4(6) and L4(7). L5(6) gives x6 = 30.5, the ub of class 6. The cf of C88 = 0.88 ⋅ Ymax = 0.88 ⋅ 60 = 52.8. L4(6) = 51 and L4(7) = 53 are the cf for the 6th and 7th classes. The class width = Xscl = 5. The frequency of class 7 = L4(7) − L4(6) = 53 − 51 = 2.

  C88 = L5(6) + (0.88 ⋅ Ymax − L4(6)) ⋅ Xscl / (L4(7) − L4(6))
When entered on the TI-83, it gives the following results for C50 and C88. The formula can be reused by pressing 2nd + ENTRY and editing it.
• An even easier way of calculating centiles is by using the program CENTILE. Enter the following commands, if you want to use the program.

PROGRAM:CENTILE
:Disp "ENTER CENTILE"
:Input P               // Enter a whole number for the centile.
:(P/100)Ymax→A         // Calculate the cumulative frequency, cf, for the given centile and store as A.
:Disp "CF="
:Disp A
:Disp "CLASS"
:Input C               // Enter the class number for the class with the greatest cf < A.
:L5(C)→B               // List names must correspond to the lists being used,
:L4(C)→D               // L4 for cf and L5 for ub
:L4(C+1)→E
:(A−D)Xscl→F
:B+F/(E−D)→G
:Disp "CP="
:Disp G                // Display the calculated centile point, cp.
Thus, if you want to calculate the centile point corresponding to 44% of the total cumulative frequency, 44 is entered as the value of the centile. The class number to be entered is determined by counting up to the class whose cf is the greatest cf < 0.44 ⋅ 60 = 26.4. The results are illustrated as follows:
Using DRAW on the ogive curve, it is seen that the 44th centile lies in the third class and C44 has a value near 13.79, corresponding closely to the calculated value. It is not possible to get a closer value, given the graphics resolution.
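The interpolation used for centile points in this section can also be sketched in Python. This is not a transcription of the CENTILE program: the function name centile is illustrative, and, as an assumption beyond the program, it looks up the class itself and adds a starting point (0.5, 0) so that centiles falling in the first class are handled too.

```python
# Upper class boundaries (ub) and cumulative frequencies (cf) from Table no.3.
UB = [5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5, 40.5, 45.5, 50.5]
CF = [7, 18, 31, 42, 47, 51, 53, 56, 58, 60]
WIDTH = 5


def centile(p, ub=UB, cf=CF, width=WIDTH):
    """Estimate the p-th centile point by linear interpolation along the ogive."""
    if not 0 < p <= 100:
        raise ValueError("p must satisfy 0 < p <= 100")
    # Assumed starting point of the ogive: lower boundary of class 1, cf = 0.
    bounds = [ub[0] - width] + list(ub)
    cum = [0] + list(cf)
    target = p * cum[-1] / 100  # cf of the centile point
    # Index of the boundary with the greatest cf below the target.
    c = max(i for i, value in enumerate(cum) if value < target)
    return bounds[c] + (target - cum[c]) * width / (cum[c + 1] - cum[c])


for p in (50, 88, 44):
    print(p, round(centile(p), 2))
```

The printed values can be compared with the hand calculations for C50 and C88 and with the calculator result for C44.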
In terms of our original data, C44 ≈ 13.7 means that 44% of the commuters travel less than 13.7 km. to get to work or, conversely, that 56% travel more than 13.7 km.

11. Let the relative frequency for each class be defined by the relationship

$$g_i = \frac{f_i}{n} = \frac{f_i}{60}.$$

Thus, in the first interval

$$g_1 = \frac{f_1}{n} = \frac{7}{60} \approx 0.117 = 11.7\%$$
The mean and variance can both be expressed in terms of gi.

$$m = \sum_{i=1}^{10} x_i \cdot g_i \quad \text{and} \quad s^2 = \sum_{i=1}^{10} x_i^2 \cdot g_i - m^2.$$
Let the relative frequency density, rfd, be defined by the relationship

$$\text{rfd} = \frac{g_i}{\text{class width}} = \frac{g_i}{5}.$$

Thus, gi = rfd ⋅ class width. A histogram, where the height of a rectangle is equal to the value of rfd and the base is equal to the class width, will consist of areas equal to the various values of gi.
The sum of the areas of the rectangles is equal to one, since

$$\sum_{i=1}^{10} g_i = \frac{\sum_{i=1}^{10} f_i}{60} = \frac{60}{60} = 1.$$
The values of rfd are decimal values and cannot be entered in STAT + EDIT and used to produce a histogram. Here the values of rfd are therefore represented by whole numbers, which yield the decimal values when multiplied by $10^{-3}$. The values of gi are rounded off to 3 dp.
xi       fi      gi       rfd (× 10^-3)
 3        7     0.117         23
 8       11     0.183         37
13       13     0.217         43
18       11     0.183         37
23        5     0.083         17
28        4     0.067         13
33        2     0.033          7
38        3     0.050         10
43        2     0.033          7
48        2     0.033          7
SUM      60     1.000

(Table no.4)
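The gi and rfd columns of Table no.4 can be regenerated in Python, which also confirms that the rectangle areas add up to one. This is an illustrative sketch; exact fractions from the standard-library fractions module are used so that the sum is not blurred by rounding, and the variable names are assumptions.

```python
from fractions import Fraction

FREQUENCIES = [7, 11, 13, 11, 5, 4, 2, 3, 2, 2]
WIDTH = 5
n = sum(FREQUENCIES)

g = [Fraction(f, n) for f in FREQUENCIES]  # relative frequencies gi
rfd = [gi / WIDTH for gi in g]             # relative frequency densities
areas = [r * WIDTH for r in rfd]           # rectangle areas, equal to gi
total_area = sum(areas)

g_rounded = [round(float(gi), 3) for gi in g]  # gi rounded to 3 dp
rfd_whole = [round(r * 1000) for r in rfd]     # whole numbers to enter in L6

print(g_rounded)
print(rfd_whole)
print(total_area)
```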
The whole number values should now be entered in L6, choosing STAT + EDIT.
The values in L2, the class midpoints, remain unchanged. Clear the previous drawing with 2nd + DRAW + ClrDraw. In WINDOW set Ymax = 50 and Yscl = 5.
Choose 2nd + STAT PLOT, turn off Plots 1 and 2 and in Plot3 set the plot as follows:
Draw the histogram by pressing GRAPH.
The picture of the histogram, except for the vertical scale, is hardly distinguishable from the frequency diagram. This is not always the case, especially when class widths are unequal. Very often frequency diagrams are totally misleading in showing the relative size of a class. The value of the rfd in the third class has a value of 43 × 10 −3 , as is plainly seen when TRACE is pressed and the cursor placed in the third class. The completed histogram is illustrated here, with the mean and standard deviation added.
[Figure: completed histogram with vertical lines marking m - s, m and m + s. Horizontal axis: commuting distance in km., class midpoints 3 to 48; vertical axis: relative frequency density, rfd × 10^-3, scaled 0 to 50.]
Enter the data from L2 and L6 with STAT+CALC+1 in the following manner.
Because the values in Table no.4 are rounded, the mean and standard deviation are slightly different from the values in section 7. When STAT + CALC + 1 is chosen, the following results are obtained: m ≈ 17.9 and s ≈ 11.8.
This makes no great difference when the area between m − s and m + s is calculated. Letting lb and ub represent the lower and upper bounds of rectangular areas where the Area = (ub − lb)⋅rfd , the following results are obtained:
lb       ub       rfd      Area
 6.1    10.5     0.037    0.163
10.5    15.5     0.043    0.215
15.5    20.5     0.037    0.185
20.5    25.5     0.017    0.085
25.5    29.7     0.013    0.055
Total                     0.702

(Table no.5)
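The area calculation of Table no.5 can be reproduced by clipping each class to the interval from m − s to m + s. The following Python sketch is illustrative: the function name area_between is an assumption, the rfd values are the rounded ones from Table no.4, and the class boundaries run from 0.5 to 50.5 in steps of 5.

```python
# Class boundaries 0.5, 5.5, ..., 50.5 and rounded rfd values from Table no.4.
BOUNDS = [0.5 + 5 * k for k in range(11)]
RFD = [0.023, 0.037, 0.043, 0.037, 0.017, 0.013, 0.007, 0.010, 0.007, 0.007]


def area_between(lo, hi):
    """Histogram area between lo and hi: sum of (overlap width) * rfd per class."""
    total = 0.0
    for k, density in enumerate(RFD):
        overlap = min(hi, BOUNDS[k + 1]) - max(lo, BOUNDS[k])
        if overlap > 0:
            total += overlap * density
    return total


m, s = 17.9, 11.8  # values obtained from L2 and L6
area = area_between(m - s, m + s)
print(round(area, 3))  # 0.702
```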
This total corresponds closely to the earlier result in section 7, where 42 data values, 70% of the total population, were between m − s and m + s. In other words, about 70% of the area of the histogram, like 70% of the data, lies between m − s and m + s.

The ideas of relative frequency density and area are closely related to those of the probability density function and the area under the normal curve. One property of normally distributed data is that the data is evenly distributed on both sides of the mean value. In the present example, the data is not that evenly distributed. It is said to be skewed-right, i.e., the right hand tail of the diagram trails off to the right.