POSTED ON: 10/4/2010
FORECASTING: FURTHER COMMENTS AND SOME NOTES ON QUANTIFYING UNCERTAINTY

[Figure: SALES of SINGLES (in Millions), 1970-2000, with three fitted trend lines; the cubic equation shown on the plot is y = 0.032x^3 - 188.541x^2 + 374302.894x - 247694272.834]

At the end of this week's lab you plotted the above series, the numbers of singles sold each year from 1973 to 1997, and you fitted three trend lines. These are shown. They are a LINEAR TREND, a QUADRATIC TREND and a CUBIC TREND. The equation of the cubic is also shown on the plot.

There were questions after the lab about the use of polynomials of different orders or degrees. The degree of a polynomial is simply the highest power of x in the expression for the polynomial; e.g. in the case of the cubic shown, the degree is 3. Polynomials are very smooth curves with the property that as we increase the degree we increase the number of turning points, or maxima and minima, in the curve. The cubic above (degree 3) has two; the quadratic (degree 2) has one.

The effect of choosing higher degrees for our fitted polynomials is that we always get a better fit to the data, because we introduce a little more flexibility, or "bendiness", with each extra degree. We can see this happen above: the cubic gives a much better fit than the quadratic. So, why not choose the highest degree available? Surely this gives the best fit to our data?

To answer these questions, begin by looking at the data in the plot above and asking yourself what you would guess will happen to it. It has been rising for the last 5 years, so perhaps it will continue to do so. By how much? That's difficult, but we note that in the last 25 years sales of singles always lay between about 50 and 90 millions per year. Thus, unless there were a serious change in the market or technology or both, it seems unlikely that sales will rise much over 100 millions. Now, consider the plot below and note the forecasts from our BEST FITTING CUBIC.
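The two claims above, that each extra degree can only improve the fit while extrapolation is dominated by the highest power of x, can be checked numerically. This is a minimal sketch using made-up data of roughly the singles-series shape, not the actual sales figures.

```python
import numpy as np

# Made-up series: ten "years" of a roughly level-then-rising variable,
# standing in for the singles data (values in millions, illustrative only).
x = np.arange(10, dtype=float)
y = np.array([62.0, 58.0, 55.0, 57.0, 60.0, 65.0, 63.0, 70.0, 78.0, 85.0])

# Residual sum of squares for polynomial trends of degree 1, 2 and 3.
# Higher degrees can never fit worse, because each lower-degree model
# is a special case of the next one up.
rss = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)
    rss[degree] = float(np.sum((y - np.polyval(coeffs, x)) ** 2))

print(rss[1] >= rss[2] >= rss[3])  # the fit improves (or ties) with degree

# Extrapolating the cubic far beyond the data: the x**3 term takes over,
# so the forecast drifts away from the historical 55-85 range.
cubic = np.polyfit(x, y, 3)
print(float(np.polyval(cubic, 25)))
```

The same mechanism is what makes the cubic's 10-year forecast in the notes so implausible: goodness of fit inside the data says nothing about behaviour outside it.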
[Figure: SALES of SINGLES (in Millions), 1970-2010, showing the fitted cubic trend extrapolated forward to 2010.]

The cubic trend forecasts are simply unbelievable, at least in the current state of knowledge. The 5-year forecast is about 150 millions per year; the 10-year one exceeds 300 millions. These are not just a little unlikely, but outrageously so. This illustrates the problem with using polynomial trends to forecast: they can be made to FIT the HISTORICAL DATA very well, but this is no guide to how well they will perform in forecasting. The mathematical reason is that when we forecast from a trend we evaluate the function (here a polynomial) at a future time, i.e. we increase the x-value in the expression for our trend, and so we increase the highest power of that x-value by even more. For example, if we use the cubic above, the forecast for the year 2000 will contain a term of the form 2000*2000*2000.

So, how can we forecast sales of singles from these data using a trend curve?

[Figure: SALES of SINGLES (in Millions) against YEAR = 1990 + x, for x = 0 to 8, with a fitted straight line y = 7.0x + 36.6 (dashed) and a fitted exponential y = 42.21e^(0.10x) (solid).]

Since we wish to use a trend curve, we examine the data for some consistency near the end, or current time. Clearly, the last 5 years have shown a consistent rise, so we might try to fit a suitable trend to those data only. A straight line fitted to those data is shown above as the dashed line, and its equation is given on the plot. Thus, for each unit of x, i.e. for each year, y increases by 7 units, i.e. sales increase by 7 millions. Straight lines imply a constant increase every year; in this case, it is 7 millions per year. The solid line shows the best-fitting exponential trend. Such trends are often used because they imply change each year by a constant percentage. In this case, we see from the equation that the chosen trend implies that sales are growing at a rate of 10% per year.
Given these six years' data, both these trends appear plausible. Also, they both appear to be close to the data points in the plot. Statistically, they FIT the data equally well. So, do we need to choose between them, and, if so, how? The answer to the first question is YES: they will yield quite different forecasts, as we see below.

[Figure: SALES of SINGLES (in Millions) against YEAR = 1990 + x, for x = 0 to 18, showing the linear trend y = 7.0x + 36.6 and the exponential trend y = 42.21e^(0.10x) diverging as they are extrapolated.]

LINEAR TREND FORECAST FOR NEXT YEAR (1998)
FORECAST OF 1998'S SINGLES SALES: 6.98 * 8 + 36.64 = 92.48 Million
FORECAST OF 2008'S SINGLES SALES: 6.98 * 18 + 36.64 = 162.28 Million

EXPONENTIAL TREND FORECAST FOR NEXT YEAR (1998)
FORECAST OF 1998'S SINGLES SALES: 42.21 * e^(0.1*8) = 93.94 Million
FORECAST OF 2008'S SINGLES SALES: 42.21 * e^(0.1*18) = 255.36 Million

[Figure: SALES of LPs, CASSETTES, CDs and SINGLES (in Millions), 1973-1997.]

How else might we forecast sales of singles? Interestingly, the plot of all types of records shown above suggests that the sales of singles are fairly flat (certainly compared with the meteoric rise of CD sales). But note also that sales of singles were falling from the late '70s and only picked up again in 1992, when sales of CDs accelerated their growth. Perhaps, then, it is worth linking the sales of these two together, reflecting that the rise in singles sales is "on the coat-tails" of the rise in CD sales, using the last 6 years again, 1992-1997.

SINGLES SOLD NEXT YEAR = 0.34 * CDs SOLD THIS YEAR + 31.5 Millions

[Figure: SALES of SINGLES during 1992-1997 plotted against SALES OF CDs in the PREVIOUS YEAR, with the fitted line y = 0.34x + 31.51.]

The graph above shows a plot of singles sales during the years 1992-1997 against CD sales for the years 1991-1996, where each point pairs singles sales in a year with CD sales in the previous year. Why not choose the same year?
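The forecast arithmetic above follows directly from evaluating the two fitted equations, with x counting years after 1990:

```python
import math

# Trend equations as fitted in the notes (x = years after 1990).
def linear_trend(x):
    return 6.98 * x + 36.64

def exp_trend(x):
    return 42.21 * math.exp(0.10 * x)

for x, year in [(8, 1998), (18, 2008)]:
    print(f"{year}: linear {linear_trend(x):.2f}, "
          f"exponential {exp_trend(x):.2f} million")
# One year ahead the two forecasts are close; ten years ahead
# the exponential is roughly 90 million higher than the linear.
```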
Simply, it would not be helpful for forecasting: if the relationship we use has both in the same year, then in order to forecast singles sales next year we would need to know CD sales for next year, or forecast them first. However, if there is a relationship between CD sales this year and singles sales next year, then we need know only this year's CD sales to forecast next year's singles sales. We cannot say that any "cause & effect" relationship exists here, but we can see that the data lie very close to a straight line during these years, and hope that this relationship continues into the future.

LINEAR TREND FORECAST FOR NEXT YEAR (1998)
FORECAST OF 1998'S SINGLES SALES: 6.98 * 8 + 36.64 = 92.48 Million
FORECAST OF 2008'S SINGLES SALES: 6.98 * 18 + 36.64 = 162.28 Million

EXPONENTIAL TREND FORECAST FOR NEXT YEAR (1998)
FORECAST OF 1998'S SINGLES SALES: 42.21 * e^(0.1*8) = 93.94 Million
FORECAST OF 2008'S SINGLES SALES: 42.21 * e^(0.1*18) = 255.36 Million

SINGLES SOLD NEXT YEAR = 0.34 * CDs SOLD THIS YEAR + 31.5 Millions
CDs SOLD THIS YEAR = 158.8 Million
FORECAST OF NEXT YEAR'S SINGLES SALES: 0.34 * 158.8 + 31.5 = 54.0 + 31.5 = 85.5 Million

COMPARE THIS FORECAST WITH THE OTHER TWO:

                      1998     2008
LINEAR TREND:         92.5     162.3
EXPONENTIAL TREND:    93.9     255.4

WHICH TREND DO WE USE TO FORECAST?
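The lagged-relationship forecast is likewise a one-line evaluation of the fitted equation, this time at the known current-year CD sales:

```python
# Lagged relationship from the notes: next year's singles sales (millions)
# predicted from this year's CD sales (millions).
def singles_forecast(cd_sales_this_year):
    return 0.34 * cd_sales_this_year + 31.5

# With 158.8 million CDs sold this year:
print(f"{singles_forecast(158.8):.1f} million singles forecast for next year")
```

Because the predictor is lagged one year, this forecast needs no forecast of CD sales, only the figure already observed.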
THIS IS A SERIOUSLY DIFFICULT PROBLEM IN PRACTICE --- IT REQUIRES BOTH SKILL AND EXPERIENCE --- BUT THERE ARE SOME USEFUL RULES TO BEAR IN MIND.

GOODNESS OF FIT TO PAST HISTORY IS A USEFUL GUIDE, BUT IT MUST NOT BE THE FINAL OR ONLY CRITERION.

POLYNOMIALS OF HIGH DEGREE WILL FIT PAST DATA WELL, AND SO MAY GIVE A GOOD EXPLANATION OF PAST TRENDS, BUT ARE VERY RARELY USED FOR FORECASTING.

THE TRENDS MOST COMMONLY USED FOR FORECASTING ARE:
FLAT TREND --- INDICATING NO CHANGE OF LEVEL OVER TIME
LINEAR TREND --- INDICATING A CONSTANT RATE OF CHANGE
EXPONENTIAL TREND --- INDICATING A CONSTANT RATE OF PERCENTAGE CHANGE

THE IMPORTANT POINT IS THAT THE TREND WE CHOOSE TO PROJECT INTO THE FUTURE WILL REFLECT OUR BELIEF THAT THE FUTURE BEHAVIOUR OF THE DATA WILL BE THE SAME IN KIND AS IT WAS IN THE PAST, AND WILL NOT CHANGE OVER THE HORIZON FOR WHICH WE FORECAST. THIS IS NOT A STATISTICAL CRITERION --- AND SO CANNOT BE TESTED IN A REALLY OBJECTIVE WAY. CHOICE OF A SENSIBLE TREND REQUIRES CONSIDERABLE CARE AND REASONING, AND, FINALLY, SOME FAITH.

SUPPOSE WE CAN CHOOSE A TREND AND FEEL CONFIDENT THAT IT WILL CONTINUE IN THE FUTURE AS IT DID IN THE PAST. OUR FORECASTS ARE THE VALUES OF THE TREND AT THE FUTURE TIMES. BUT HOW CLOSE CAN WE EXPECT THEM TO BE TO THE ACTUAL VALUES WHICH OCCUR THEN?

CLEARLY, THERE IS UNCERTAINTY --- BUT OUR FORECASTS ARE REALLY OF USE ONLY IF WE CAN QUANTIFY THIS UNCERTAINTY; OTHERWISE WE WILL NOT HAVE ANY IDEA HOW ACCURATE OR RELIABLE THEY MAY BE.

WE CANNOT SOLVE THIS PROBLEM FOR FORECASTING AT THIS STAGE OF THE CLASS. HOWEVER, IT IS A GENERAL PROBLEM IN ALL STATISTICAL INFERENCE AND WE WILL MEET IT SEVERAL TIMES IN THE SECOND SEMESTER, SO WE BEGIN THE DISCUSSION HERE.

ALL FORECASTS ARE WRONG. WE CALL THE AMOUNT BY WHICH THEY DEVIATE FROM THE CORRECT VALUE THE FORECAST ERROR. WE WISH TO QUANTIFY THE UNCERTAINTY OR VARIABILITY IN FORECAST ERRORS. WE BEGIN BY CONSIDERING A MORE GENERAL PROBLEM.
THE AMBULANCE SERVICE IN A RURAL AREA IS PROVIDED FROM A SINGLE CENTRE AND FOR MANY YEARS HAS BEEN ABLE TO GUARANTEE AN AVERAGE RESPONSE TIME, i.e. THE TIME TO REACH SOMEONE MAKING AN EMERGENCY CALL, OF 10 MINUTES. HOWEVER, IN THE LAST TWO YEARS THE AREA HAS BECOME POPULAR WITH COMMUTERS, AND SO THERE HAS BEEN A LOT OF NEW BUILDING AND SUBSTANTIAL CHANGES IN TRAFFIC PATTERNS. THE SERVICE TAKES A SAMPLE OF 200 CALLS FROM ITS RECORDS FOR 1998 AND WISHES TO USE THEM TO MEASURE ITS PERFORMANCE AGAINST THE "10 MINUTE AVERAGE" CLAIM.

RESPONSE TIMES (in MINUTES)

 3.4  5.7  6.6  6.9  7.0  7.0  7.2  7.4  7.4  7.6
 7.6  7.7  8.0  8.1  8.2  8.2  8.2  8.3  8.4  8.5
 8.6  8.7  8.7  8.7  8.8  8.8  8.9  8.9  9.0  9.0
 9.0  9.1  9.2  9.2  9.3  9.3  9.4  9.5  9.6  9.6
 9.6  9.6  9.6  9.7  9.7  9.7  9.7  9.9  9.9  9.9
 9.9  9.9  9.9 10.0 10.0 10.1 10.1 10.1 10.2 10.2
10.2 10.2 10.2 10.2 10.3 10.3 10.4 10.4 10.4 10.4
10.4 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5
10.6 10.6 10.6 10.7 10.8 10.8 10.8 10.9 10.9 10.9
11.0 11.0 11.0 11.0 11.0 11.1 11.1 11.1 11.2 11.2
11.2 11.2 11.2 11.2 11.3 11.3 11.3 11.3 11.4 11.4
11.4 11.4 11.5 11.5 11.5 11.5 11.5 11.6 11.6 11.6
11.6 11.6 11.7 11.7 11.7 11.8 11.8 11.8 11.8 11.8
11.9 11.9 11.9 12.0 12.1 12.1 12.1 12.2 12.2 12.2
12.3 12.3 12.3 12.3 12.3 12.4 12.5 12.5 12.5 12.6
12.6 12.7 12.7 12.7 12.7 12.7 12.8 12.8 12.8 12.8
12.9 12.9 12.9 12.9 13.0 13.0 13.1 13.1 13.1 13.1
13.1 13.1 13.2 13.3 13.3 13.3 13.4 13.5 13.5 13.6
13.6 13.6 13.7 13.7 13.8 13.8 13.9 14.1 14.2 14.3
14.4 14.5 14.7 14.8 15.0 15.0 15.5 15.8 15.8 16.0

MEAN = 11.12 Minutes    STANDARD DEVIATION = 2.031 Minutes

In order to quantify the uncertainty present in a large set of data like these response times, we usually find that it is not enough simply to quote the mean and standard deviation. We need to find out how this uncertainty is distributed across the range of values, here from 3.4 minutes to 16 minutes. A useful way to summarize this is to divide the range into short intervals and simply count how many observations fall in each one.
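The mean and standard deviation quoted above are the usual summary pair; in Python they come from the standard library's statistics module. A sketch on a small illustrative subset of times, not the full 200-call sample:

```python
from statistics import mean, stdev

# Ten illustrative response times in minutes (the full sample in the
# notes has 200 calls, with mean 11.12 and standard deviation 2.031).
times = [8.4, 9.6, 10.2, 10.5, 11.0, 11.5, 12.3, 13.1, 14.2, 15.0]

print(f"mean = {mean(times):.2f} minutes")
print(f"standard deviation = {stdev(times):.2f} minutes")  # sample st. dev.
```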
These counts are called FREQUENCIES and the resulting information is a FREQUENCY TABLE. Using intervals of length 1 minute, the frequency table for the response times is shown below.

Range      Frequency
 2 -  3        0
 3 -  4        1
 4 -  5        0
 5 -  6        1
 6 -  7        4
 7 -  8        7
 8 -  9       18
 9 - 10       24
10 - 11       40
11 - 12       39
12 - 13       32
13 - 14       21
14 - 15        9
15 - 16        4
16 - 17        0

With a little effort, we can see from this table how the response times are distributed over the range. Most lie between 8 and 14 minutes, but there are some shorter and some longer times. Interpretation of such frequency tables is greatly aided by a simple display called a HISTOGRAM, in which the frequencies are represented by the lengths of bars, as in a bar chart.

Range      Frequency
 2 -  3        0
 3 -  4        1   X
 4 -  5        0
 5 -  6        1   X
 6 -  7        4   XXXX
 7 -  8        7   XXXXXXX
 8 -  9       18   XXXXXXXXXXXXXXXXXX
 9 - 10       24   XXXXXXXXXXXXXXXXXXXXXXXX
10 - 11       40   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
11 - 12       39   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
12 - 13       32   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
13 - 14       21   XXXXXXXXXXXXXXXXXXXXX
14 - 15        9   XXXXXXXXX
15 - 16        4   XXXX
16 - 17        0

To aid in interpretation we often compute not simply the frequencies but also the RELATIVE FREQUENCIES, which are defined thus:

RELATIVE FREQUENCY = FREQUENCY / NUMBER OF DATA

This is, of course, simply the fraction of the data which falls in each interval, rather than the actual count. Again, we often simplify the appearance of these values by transforming them into PERCENTAGES, thus:

PERCENTAGE = 100 * RELATIVE FREQUENCY

These are shown below for the above frequency table of response times.
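Building such a frequency table is a counting exercise: assign each observation to its 1-minute interval and tally. A sketch with a small made-up sample (not the notes' data):

```python
import math
from collections import Counter

# Made-up response times in minutes, purely illustrative.
times = [6.6, 7.2, 8.1, 8.7, 9.4, 9.9, 10.2, 10.5, 10.8, 11.3,
         11.6, 12.1, 12.7, 13.1, 13.6, 14.4, 9.6, 10.4, 11.2, 12.3]

# Count how many observations fall in each interval [k, k+1).
freq = Counter(math.floor(t) for t in times)

# Print the table with an X-bar histogram, as in the notes.
for k in range(min(freq), max(freq) + 1):
    print(f"{k:2d} - {k + 1:2d}  {freq[k]:3d}  {'X' * freq[k]}")
```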
Range      Frequency   Relative Frequency   Percentage
 2 -  3        0            0.000               0.0
 3 -  4        1            0.005               0.5
 4 -  5        0            0.000               0.0
 5 -  6        1            0.005               0.5
 6 -  7        4            0.020               2.0
 7 -  8        7            0.035               3.5
 8 -  9       18            0.090               9.0
 9 - 10       24            0.120              12.0
10 - 11       40            0.200              20.0
11 - 12       39            0.195              19.5
12 - 13       32            0.160              16.0
13 - 14       21            0.105              10.5
14 - 15        9            0.045               4.5
15 - 16        4            0.020               2.0
16 - 17        0            0.000               0.0

Range      Percentage
 2 -  3        0.0
 3 -  4        0.5   X
 4 -  5        0.0
 5 -  6        0.5   X
 6 -  7        2.0   XXXX
 7 -  8        3.5   XXXXXXX
 8 -  9        9.0   XXXXXXXXXXXXXXXXXX
 9 - 10       12.0   XXXXXXXXXXXXXXXXXXXXXXXX
10 - 11       20.0   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
11 - 12       19.5   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
12 - 13       16.0   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
13 - 14       10.5   XXXXXXXXXXXXXXXXXXXXX
14 - 15        4.5   XXXXXXXXX
15 - 16        2.0   XXXX
16 - 17        0.0

The importance of relative frequency (or percentage), rather than frequency, lies in the definition of PROBABILITY. If we consider a simple experiment such as the toss of a coin, it is clear that we cannot predict the outcome. We know it will be HEADS or TAILS but cannot say which, no matter how often we perform this experiment. However, we do know that if we toss a fair coin a large number of times, HEADS will occur close to half of the time, i.e. the RELATIVE FREQUENCY of HEADS will approach 0.5 or 50%. We say that the PROBABILITY that HEADS occurs is 0.5 or 50%. The PROBABILITY that a particular event occurs, or a particular value is observed, is the limiting value of the relative frequency of its occurrence in a long run of identical repetitions of the process.

Thus, if the sample of response times shown above is representative of all current response times, then we can estimate the probability of a response time falling between 10 and 12 minutes as 20 + 19.5 = 39.5%. Similarly, we would estimate the probability of an ambulance being late, i.e. a response time in excess of 10 minutes, as 72.5%, and the probability of its being more than 5 minutes late as 2%. We can construct HISTOGRAMS in Excel, and the one for the response times appears below.
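The relative-frequency and percentage definitions translate directly into code. For example, using the two busiest intervals from the table above (n = 200):

```python
# Frequencies for two intervals from the notes' table; 200 data in all.
n = 200
freq = {"10-11": 40, "11-12": 39}

rel = {k: f / n for k, f in freq.items()}   # RELATIVE FREQUENCY = FREQ / N
pct = {k: 100 * r for k, r in rel.items()}  # PERCENTAGE = 100 * REL. FREQ.

# Estimated probability of a response time between 10 and 12 minutes:
print(f"{pct['10-11'] + pct['11-12']:.1f}%")  # 20.0 + 19.5 = 39.5
```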
[Figure: HISTOGRAM of RESPONSE TIMES (Minutes), frequencies by 1-minute interval from 2-3 to 16-17.]

Note, however, that although a picture of the distribution emerges, it is oddly structured. It is really a bar chart, and the bars are well separated; but they represent consecutive numerical ranges and so should actually touch. We move towards a more useful representation by considering the curve which envelops the histogram, as in the next graph.

[Figure: HISTOGRAM of RESPONSE TIMES (Minutes), percentages by 1-minute interval, with the enveloping curve drawn over the bars.]

This curve is called the PROBABILITY CURVE of the histogram, or of the data. If we imagine having more and more data, we could make the intervals into which we count the data smaller and smaller, and achieve a finer and finer resolution of the HISTOGRAM in the form of the PROBABILITY CURVE. That is often what we try to do in general with data, i.e. identify the most suitable form of PROBABILITY CURVE for the data. We would then use mathematical tools and tables to compute the required probabilities. In general, that is a difficult problem in practice, but there is one special case where it is relatively simple. Fortunately, this is also the commonest case we meet in practice, being especially useful for data, such as these response times, which are aimed at a target value and miss symmetrically around that value. It is the NORMAL PROBABILITY CURVE.

[Figure: NORMAL PROBABILITY CURVE for RESPONSE TIMES, drawn over the range 0 to 25 minutes, with a vertical line at the mean.]

This graph displays the NORMAL PROBABILITY CURVE which corresponds to the response time data, i.e. has the same mean and standard deviation. The vertical line is placed at 11.12, the mean, and the rest of the values are symmetric around this.
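Once a normal probability curve is adopted, probabilities can be computed without tables: the normal CDF can be written with the error function from Python's math module. A sketch using the sample mean and standard deviation from above:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal curve with mean mu and st. dev. sigma."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 11.12, 2.031  # response-time mean and standard deviation

# Probability of a response time between 10 and 12 minutes under the
# fitted normal curve; compare the ~39.5% estimated from the histogram.
p = normal_cdf(12, mu, sigma) - normal_cdf(10, mu, sigma)
print(f"{100 * p:.1f}%")
```

The normal-curve answer differs a little from the histogram estimate, since the curve smooths the sample's irregularities.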
The width of the curve depends on the standard deviation, here 2.031, and we see that there are virtually no response times less than 5 minutes or greater than about 17 minutes. We can do much better than this, however, for the NORMAL PROBABILITY CURVE has a remarkable property: it depends only on knowledge of the MEAN and STANDARD DEVIATION. Thus, given these two quantities, and the fact that the data have a NORMAL PROBABILITY CURVE, we can compute the probabilities of response times in any ranges we wish directly. As an illustration of this, we simply note three basic results relating to all NORMAL PROBABILITY CURVES.

68% OF DATA LIE WITHIN 1 STANDARD DEVIATION OF THE MEAN:
[ MEAN - ST.DEV .......... MEAN .......... MEAN + ST.DEV ]

95% OF DATA LIE WITHIN 2 STANDARD DEVIATIONS OF THE MEAN:
[ MEAN - 2*ST.DEV .......... MEAN .......... MEAN + 2*ST.DEV ]

99.7% OF DATA LIE WITHIN 3 STANDARD DEVIATIONS OF THE MEAN:
[ MEAN - 3*ST.DEV .......... MEAN .......... MEAN + 3*ST.DEV ]

AMBULANCE RESPONSE TIMES: MEAN = 11.12 MINUTES, ST.DEV = 2.03 MINUTES

68% OF RESPONSE TIMES WILL LIE IN THE RANGE 11.12 - 2.03 TO 11.12 + 2.03, i.e. [ 9.09, 13.15 ]

95% OF RESPONSE TIMES WILL LIE IN THE RANGE 11.12 - 2*2.03 TO 11.12 + 2*2.03, i.e. [ 7.06, 15.18 ]

99.7% OF RESPONSE TIMES WILL LIE IN THE RANGE 11.12 - 3*2.03 TO 11.12 + 3*2.03, i.e. [ 5.03, 17.21 ]
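The three ranges above can be reproduced from the mean and standard deviation alone, which is the whole appeal of the normal curve:

```python
mu, sd = 11.12, 2.03  # response-time mean and st. dev. from the notes

# 68/95/99.7 rule: intervals of 1, 2 and 3 standard deviations
# around the mean of a normal probability curve.
intervals = {}
for k, pct in [(1, 68), (2, 95), (3, 99.7)]:
    intervals[pct] = (mu - k * sd, mu + k * sd)
    lo, hi = intervals[pct]
    print(f"{pct}% of response times lie in [{lo:.2f}, {hi:.2f}]")
```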