
International Journal of Forecasting 9 (1993) 163-172

0169-2070/93/$06.00 © 1993 Elsevier Science Publishers B.V. All rights reserved


Judgemental forecasting in times of change

Marcus O’Connor*

University of New South Wales, P.O. Box 1, Kensington, N.S.W. 2033, Australia

William Remus, Ken Griggs

University of Hawaii, 2404 Maile Way, Honolulu, HI 96822, USA

Abstract

This paper reports a study which examines the ability of people and statistical models to forecast time series which contain major discontinuities. It has often been suggested that human judgement will be superior when circumstances change dramatically and statistical models are no longer relevant. Using ten time series that contained five different discontinuities and two levels of randomness, the results indicated that people performed significantly worse than (parsimonious) statistical models. This occurred for the segments of the time series where the discontinuity was to be found and for the subsequent segment where the series was stable. People seemed to change their forecasts in response to random fluctuations in the time series, identifying a signal where it did not exist. This was especially true for the series with high variability. The implications of these results for forecasting practices are discussed.

Keywords: Judgemental forecasting; Change

1. Introduction

Our world is characterised by change and uncertainty. This makes the task of forecasting more important and, of course, more difficult. When conditions are in a state of extreme change, it may be necessary to abandon traditional methods of forecasting. However, when change is regarded as a normal part of the everyday business environment, it is important that we adopt a forecasting methodology that explicitly accounts for the anticipated change. Traditional textbook approaches [see Makridakis et al. (1983)] have emphasised the need to use well documented statistical models in the forecasting process, and then to use human judgement to adjust the forecasts for other information that could not be captured by the statistical models. Yet others have suggested that the true role of human judgement is to detect changes and to use the statistical approaches to forecast when conditions will become more stable [see Kleinmuntz (1990)]. This study examines whether people possess the suggested superiority at detecting changes in a time series forecasting task. We first review the literature about judgemental and statistical forecasting when conditions change and then describe a laboratory experiment designed to test whether people can demonstrate the alleged superiority.

* Corresponding author.

2. Review of previous literature

Surveys of the practice of forecasting have repeatedly found that human judgement is the overwhelming choice for forecasting, especially sales forecasting [Dalrymple (1987); Taranto (1989)]. This result holds true even after statistical approaches have been tried [Lawrence (1983)]. Apparently, managers feel more comfortable dealing with an estimate of their own or of a colleague [Langer (1975)]. Recently, attention has been devoted to understanding and improving judgemental point forecasts [Lawrence et al. (1985, 1986); Willemain (1989); Wright and Ayton (1987); Bunn and Wright (1991)].

Some support for the use of human judgement in forecasting has been provided in Lawrence et al. (1985). Using 111 time series from the M-Competition [Makridakis et al. (1982)], they found that partly structured eyeballing by naive subjects was as accurate as the best statistical models. Moreover, the variance of the forecast errors was significantly less than for their statistical counterparts. However, no detailed analysis was made of the conditions where judgement was particularly good or bad. We know that this time series database was characterised by model fitting errors that were considerably smaller than the forecasting errors [see Carbone and Makridakis (1986)]. This may suggest that at least some of the series may have had changes or discontinuities that could not be anticipated by the statistical models. The current study tests the proposition that some of the surprisingly good performances of judgement may have been due to the ability of people to detect changes more quickly than the statistical models were able to react.

Some studies have suggested that, because of the ability of human judgement to incorporate additional information and to detect changes, people should adjust statistical forecasts [Willemain (1989)]. Others have found that this is of little benefit [Carbone et al. (1983)]. Despite this, there is still evidence that the forecasts of econometric models are routinely adjusted by judgement and that these adjustments are important to the accuracy of the forecasts [Turner (1990)]. These adjustments are either made to the model itself or are carried out in order to incorporate the influence of external factors that have not already been assimilated [Corker et al. (1986); Turner (1990)]. Indeed, some [Brown (1988)] have suggested that this is why judgemental forecasts are so often valued. Obviously, statistical models work best when conditions are stable. Again, perhaps the reason why judgement is preferred is that, as humans, we can incorporate change more easily.

On the other hand, there are volumes of literature that suggest that, in a multivariate task, human judgement almost always produces worse results than do the statistical approaches. Kleinmuntz (1990) recently summarised the large number of 'bootstrapping' studies which have suggested that people are inconsistent in their judgement and this contributes significantly to their low accuracy. Kleinmuntz suggested that judgement is only useful in the (so-called) 'broken leg cue' situations when some external event renders the model completely useless. Dawes et al. (1989), however, suggest that times of change are rare and, therefore, that heavy reliance can usually be placed on the models. In spite of the assertions of Kleinmuntz and others, very few studies have examined the utility of the bootstrapping approaches in times of change. One exception is Remus et al. (1979), who found that, when a major change was introduced, regression-based decision rules still performed better than judgement. Furthermore, telling people that the cost function had changed made no difference!

Despite the bootstrapping literature suggesting that human judgement is flawed in multivariate tasks, the conclusion concerning judgement in univariate tasks is less damning. Some consider there has been an unconscious academic bias against human judgement [Beach et al. (1987)]. Others have demonstrated the utility of judgemental eyeballing in forecasting univariate real time series [Lawrence et al. (1985)]. Yet others report that human judgement is consistently favoured in practical forecasting [Dalrymple (1987); Taranto (1989)]. Lawrence (1983) found that most people interviewed about their forecasting practices believed they could do much better than the expensive statistical approaches. As we live in a world characterised by change, a plausible hypothesis is that people are better able to anticipate changes in the time series and to react more quickly. This study is concerned with testing that overall proposition and with determining whether the characteristics of the discontinuities introduced into the time series affect the ability of people to detect the changes.

Our research questions can then be stated as follows: Is human judgement better than the statistical approaches at detecting changes in time series? Do the characteristics of the changes occurring in the time series affect the relative utility of human judgement in detecting changes?

3. Research design

3.1. Research task

Ten time series were generated for this experiment. These were based on five different functional forms with two levels of error for each functional form. Each series can be stated in the following form:

y_ij(t) = h_i(t) + e_j,   where i = 1 to 5, j = 1 to 2,

where h_i(t) represents the five different functional forms and e_j is the random shock added to the time series. The random shocks were drawn from one of two uniform distributions: one with a mean error of 5% and the other with a mean error of 15%. These levels of randomness were chosen based on prior research suggesting that they are representative of error rates in sales forecasting [Dalrymple (1987); Taranto (1989)].

Each time series can be logically divided into four contiguous segments:

Segment 0 (periods 1-20)
This segment was 20 periods of historical data, generated with a base of 100 and error added. This data was displayed to the subjects initially so that they could assess its characteristics.

Segment 1 (periods 21-32)
This segment was a continuation of the series as displayed for the first 20 points (segment 0). However, now the subjects made forecasts in each of the periods 21-32. In this way, they became accustomed to forecasting the series. So that the subjects did not come to expect changes at period 32, we discarded a random number of data points from the series at the beginning of this segment so that (at the extremes) some subjects began forecasting at point 21 and others at point 24. In this way there were always at least 8 forecasts in segment 1, with some subjects providing as many as 12 forecasts.

Segment 2 (periods 33-40)
In this segment, the discontinuity (if any) was introduced. The nature of the discontinuity is explained below.

Segment 3 (periods 41-48)
In this segment, the time series again returned to a stable base. However, there were 3 stable bases depending on the direction of the series.

The discontinuities introduced in segment 2 were derived by characterising the changes into the following categories:
• the direction of the change (up and down);
• the type of change (the same change could be introduced in a large step, or it could be gradually introduced via a ramp).
In addition to the four functional forms described above, we included a form where the series did not change. Remember that for each of these five functional forms we derived two series for the two levels of randomness introduced, making a total of ten time series. An example of the series used appears in Fig. 1.

The five functional forms for the series were:

Step-up: After segment 1, step-up to base 120 and continue at base 120 thereafter.

Step-down: After segment 1, step-down to base 80 and continue at base 80 thereafter.

Ramp-up: After segment 1, increase by 2.5 per period for eight periods (point 40 then has base 120), then continue at base 120.

Ramp-down: After segment 1, decrease by 2.5 per period for eight periods (point 40 then has base 80), then continue at base 80.

Straight: Stay at base 100 throughout the series.

Each series represents quarterly data. Thus, the ramp-up should correspond to a 10% annual increase for 2 years in segment 2, followed by stability at that level. Ramp-down is a 10% annual decrease for 2 years. The other series can be similarly interpreted. All these series are masked with error as described earlier.

Fig. 1. Sample time series.
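As an illustration only (this is not the authors' code), the sketch below generates series of the kind described above. It assumes the stated 5% and 15% 'mean errors' refer to the mean absolute size of the uniform shocks relative to the base, i.e. uniform draws on roughly ±10% and ±30% of the base level; the function and variable names are ours.

```python
import numpy as np

BASE = 100.0       # base level of every series in segments 0 and 1
N_PERIODS = 48     # segments 0-3 together span periods 1-48


def base_level(form, t):
    """Underlying level h_i(t) at period t (1-indexed) for each of the five functional forms."""
    if form == "straight":
        return BASE
    if form == "step-up":
        return 120.0 if t >= 33 else BASE
    if form == "step-down":
        return 80.0 if t >= 33 else BASE
    if form == "ramp-up":    # +2.5 per period over periods 33-40, then stable at 120
        return min(BASE + 2.5 * max(0, t - 32), 120.0)
    if form == "ramp-down":  # -2.5 per period over periods 33-40, then stable at 80
        return max(BASE - 2.5 * max(0, t - 32), 80.0)
    raise ValueError(form)


def generate_series(form, mean_error_pct, rng):
    """Add uniform shocks whose mean absolute size is mean_error_pct of the base level.

    Assumption: a uniform draw on [-a, a] has mean absolute value a/2, so a 5% (15%)
    mean error corresponds to a half-width of 10% (30%) of the base.
    """
    half_width = 2.0 * mean_error_pct / 100.0
    return np.array([base_level(form, t) * (1.0 + rng.uniform(-half_width, half_width))
                     for t in range(1, N_PERIODS + 1)])


rng = np.random.default_rng(0)
forms = ["step-up", "step-down", "ramp-up", "ramp-down", "straight"]
ten_series = {(f, e): generate_series(f, e, rng) for f in forms for e in (5, 15)}
```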

The display of data and gathering of the subjects' forecasts was done with HyperCard software using a mouse interface running on Macintosh computers. As noted earlier, after viewing the first 20 points of the series (segment 0) subjects were required to forecast the next point in the series. When making their forecasts, subjects needed only to use the mouse to 'point' at their forecasted value and 'click' to record that value. The cursor then moved to the right ready for the next forecast. This sequence continued until the end of the series.

The subjects were required to forecast all ten series after they had gained practice with a sample series (not included in the analysis). The ten series were presented in a random order to prevent any order effects. The subjects were told that the series may contain a change and that their task was to be on the look-out for such discontinuities. They were not told to smooth the data. The average time taken to complete all the tasks was about 1 h.

The 17 subjects for the experiment were M.B.A. and Ph.D. students at the University of Hawaii. It should be noted that, whilst only 17 subjects participated in the experiment, each of those subjects completed all ten series in the study. In previous studies on time series forecasting [Lawrence et al. (1985)], these student groups have been found to perform as well as managers in forecasting time series.

Naturally, we need to emphasize that any results from this study need to be interpreted in the light of the characteristics of the experimental design. As discussed above, we used artificially generated series. We believe this was necessary to enable us to control the major independent variables (direction, segment, type and variance). Real life time series [such as those used in Makridakis et al. (1982)] rarely display the characteristics required for this study. For example, we needed to systematically control the direction of the series, the stage where the discontinuity appears and the type of discontinuity. Real life time series may often display one of the above characteristics, but rarely permit the control of the other aspects. In addition, we conducted the experiment in a laboratory setting. This enabled us to eliminate the influence of non-time-series information that would have been included in the forecasts if the time series had situational and contextual relevance. In summary, therefore, we believe that a controlled investigation of the influence of the independent variables on forecasts warranted the use of a laboratory based study.

4. Analysis methodology

The research questions required us to compare forecasts from the subjects with forecasts from classical analytic models. The two methods we chose were Single Exponential Smoothing (SES) and the Adaptive Response Rate Method (ARR).

Single exponential smoothing (SES) was chosen as it performed well in predicting the 111 time series in the M-Competition [Makridakis et al. (1982)]. We developed the SES model for each time series in the following way. Based on the first 20 points in the series, the smoothing factor was chosen for each series by selecting the factor yielding the lowest mean absolute percentage error. The best value for each smoothing factor was chosen from a range of 0.05 to 1.0 in steps of 0.05.
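A minimal sketch of this fitting procedure (ours, not the authors' code): one-step-ahead SES forecasts are computed over the first 20 points for each candidate smoothing factor in {0.05, 0.10, ..., 1.0}, and the factor giving the lowest mean absolute percentage error is retained.

```python
import numpy as np


def ses_forecasts(series, alpha):
    """One-step-ahead SES forecasts; forecasts[t] is the prediction of series[t]."""
    forecasts = np.empty(len(series))
    forecasts[0] = series[0]                 # initialise with the first observation
    for t in range(1, len(series)):
        forecasts[t] = alpha * series[t - 1] + (1 - alpha) * forecasts[t - 1]
    return forecasts


def fit_smoothing_factor(history):
    """Grid search over 0.05, 0.10, ..., 1.0, minimising MAPE of the one-step fits."""
    best_alpha, best_mape = 0.05, np.inf
    for alpha in np.arange(0.05, 1.0 + 1e-9, 0.05):
        fitted = ses_forecasts(history, alpha)
        mape = np.mean(np.abs((history[1:] - fitted[1:]) / history[1:])) * 100.0
        if mape < best_mape:
            best_alpha, best_mape = alpha, mape
    return best_alpha


# e.g. the factor is chosen on segment 0 (the first 20 points) and then held fixed:
# alpha = fit_smoothing_factor(series[:20])
```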

The adaptive response rate method (ARR) was chosen since this method has all the virtues of SES with the additional benefit that it adjusts the smoothing factor as the series changes [for an explanation of the method see Makridakis et al. (1983)]. ARR should, therefore, respond quickly to the changes in the second segment of the series presented in this experiment.
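For readers unfamiliar with ARR, the sketch below shows one common formulation, the Trigg-Leach tracking-signal scheme presented in Makridakis et al. (1983), in which the smoothing factor becomes the absolute ratio of the smoothed error to the smoothed absolute error. The choice of beta = 0.2 and the initialisation are our assumptions, not taken from the paper.

```python
import numpy as np


def arr_forecasts(series, beta=0.2):
    """Adaptive-response-rate smoothing (Trigg-Leach form): the smoothing factor is
    |smoothed error / smoothed absolute error|, so it grows when the errors become
    one-sided (e.g. after a step or ramp) and the forecasts catch up faster than SES."""
    n = len(series)
    forecasts = np.empty(n)
    forecasts[0] = series[0]
    alpha = beta                       # starting value for the adaptive factor (assumption)
    smoothed_err = smoothed_abs_err = 0.0
    for t in range(1, n):
        forecasts[t] = forecasts[t - 1] + alpha * (series[t - 1] - forecasts[t - 1])
        err = series[t] - forecasts[t]
        smoothed_err = beta * err + (1 - beta) * smoothed_err
        smoothed_abs_err = beta * abs(err) + (1 - beta) * smoothed_abs_err
        if smoothed_abs_err > 0:
            alpha = abs(smoothed_err / smoothed_abs_err)
    return forecasts
```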

Analysis of the results was undertaken using ANOVA for a four-factor factorial experimental design, where the independent variables were:
• the forecasting method (judgement, SES and ARR);
• the direction of the discontinuity in the series (up and down);
• the type of change introduced (ramp and step); and
• the segment of the series (1, 2 and 3).

Forecast accuracy was measured by absolute percentage error. This error measure was chosen as it is commonly used and readily understood [Carbone and Armstrong (1982)] and rates well in comparison with other error measures [Armstrong and Collopy (1992)].
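To make the error measure and the design concrete, here is a small sketch (ours, not the authors' analysis code) of the APE calculation and a main-effects ANOVA over the four factors; the data-frame column names are assumptions chosen for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm


def ape(actual, forecast):
    """Absolute percentage error, in percent."""
    return 100.0 * abs(actual - forecast) / abs(actual)


def four_factor_anova(df: pd.DataFrame):
    """Main-effects ANOVA of APE over the four design factors.

    Assumes a long-format frame with one row per forecast and (hypothetical) columns
    'ape', 'method', 'direction', 'change_type' and 'segment'.
    """
    model = smf.ols("ape ~ C(method) + C(direction) + C(change_type) + C(segment)",
                    data=df).fit()
    return anova_lm(model, typ=2)
```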

5. Results

5.1. The effect of method

Table 1 presents the mean absolute percentage error (APE) of the three methods (judgemental, SES and ARR) for the ten series.

Table 1
Mean absolute percentage error for each method

Direction   Type   Randomness (%)   Judgement   SES    ARR
Upward                              13.4        11.0   10.5**
            Ramp   All              13.0        10.3   10.0*
                   5                 8.2         6.5    6.2*
                   15               18.0        14.2   13.5
            Step   All              13.7        11.6   11.0
                   5                 7.8         7.3    6.4
                   15               19.4        15.9   15.7*
Downward                            17.4        14.4   12.6**
            Ramp   All              16.9        13.5   11.9**
                   5                 9.7         9.0    6.3
                   15               24.1        17.9   17.5
            Step   All              17.8        15.4   13.2
                   5                10.5        11.4    7.7
                   15               25.3        19.4   18.7
Straight           All              14.0        11.6   12.9
                   5                 7.0         4.9    5.0*
                   15               21.1        18.3   20.9
All                                 15.1        12.5   11.8***

* p < 0.05; ** p < 0.01; *** p < 0.001.


These results show that, overall, judgemental forecasts have a significantly higher APE than either SES or ARR (F = 19.4, p < 0.001). As Table 1 shows, the higher APE was reflected in most types of time series. Thus, overall, not only must we answer the first research question in the negative, but the opposite of our expectation has occurred! In the light of past claims [Bunn and Wright (1991)], this is quite surprising. However, a closer examination of Table 1 reveals that the comparative inaccuracies of judgement are found not just in the series where changes occurred but also in the series (included for control purposes) where the changes did not occur; see the Straight 5% series especially.

5.2. The effect of segment

As expected, forecast accuracy deteriorated across segments (F = 65.1, p = 0.001).

Table 2 shows that APE increased in segment 2, the segment where the discontinuities occurred. But in segment 3, where the time series returned to normal after the instability of the previous segment, APE did not return to the levels of the first segment, before the instability. As Table 2 shows, the mean APE for the third segment was only slightly less than for the unstable segment before it. Using the judgemental method, Tukey multiple comparison tests revealed that there was a significant difference (p < 0.05) in APE between the first segment and the other two segments. Thus, subsequent judgemental accuracy was affected by prior instability. However, there were no significant differences in APE between the second and third segments. Identical comparisons applied to each of the statistical forecasting methods revealed no difference in APE between the three segments. However, when the statistical methods were pooled, significant differences existed between segments 1 and 2, but there was no difference between segments 2 and 3.

Table 2
Mean APE for each method across the three time series segments

Method      Segment 1   Segment 2   Segment 3   All
Judgement   13.9        16.1        15.5        15.1
SES         11.0        13.2        14.0        12.5
ARR         10.9        12.3        12.6        11.8

So the high errors in segment 3 occurred across methods, and not just when human judgement was used.

5.3. The effect of the direction of the discontinuity

We have shown that there were significant differences in absolute error between methods and that these existed for each of the segments of the time series. But were these results consistent over the different types of time series? ANOVA revealed that there was a significant effect due to the direction of the series, both for the judgemental forecasts (F = 70.5, p < 0.001) as well as for the statistical forecasts (F = 7.7, p < 0.01). As can be seen from Table 1, the downward series had higher judgemental forecast errors than did the upward series, but this was also true for the statistical methods. Most of the problems were encountered in the last two segments of the series. As the same effect was observed for the judgemental and statistical methods, we cannot suggest that this effect was peculiar to human judgement.

5.4. The effect of the type of discontinuity

The type of discontinuity (step or ramp) had no effect on APE. This is perhaps a little surprising, as one would expect a greater lag for the stepped series than for the ramped series.

5.5. Interaction effects

There was a solitary significant interaction effect between the direction of the series and the segment (F = 28.5, p < 0.001). For the upward trending series, forecast error in segment 3 tended to return to the levels before the discontinuity was introduced (segment 1). However, for downward trending series, the errors of segment 3 remained at the same level as for the second segment. But this effect was observed for all forecast methods. Thus, it may have been due to peculiarities in the downward trending series.

6. Discussion

We have determined that judgemental forecasting was significantly worse than the statistical methods. Moreover, the effects of the instability on APE were felt not only in the period of instability (segment 2), but also in the third segment where the series returned to its initial stability.

A plausible explanation of the reasons for the judgemental errors in forecasting is that people spent too little time at the task. As our research instrument collected the time taken for each forecast, we are able to correlate time with APE. Surprisingly, there was absolutely no correlation between the time people spent in their forecasting and the error of the forecast (r = 0.006). So spending more time isolating the signal from the noise in the series made no difference to the levels of error. People did spend more time per forecast period in the first segment of the series (10.1 s), compared with the second (8.3 s) and third (8.0 s) segments (F = 72.4, p < 0.001). However, most of the extra time spent in the first segment was spent in the first few periods of forecasting. When these periods were ignored, the time spent across the segments was about the same.

Can we, therefore, gain any insights into the nature of our reasons for the ineffective behaviour of people in their one step ahead forecasting? Can some of the inefficiency be attributed to an excessive over-reaction to the actual values as the iterations progressed? To investigate this, the absolute value of the percentage changes in forecasts from period to period was calculated. Table 3 below presents the mean absolute percentage changes in forecasts for each method across each segment.

Table 3 shows that there was a significant difference in the absolute change between methods (F = 111.9, p < 0.0001), with judgemental forecasters, on average, changing their forecasts by over two times as much as their statistical counterparts. The table also shows that this relationship was consistent across the three segments of the series.

Table 3
Absolute percentage changes in forecasts across segments and across methods

Method      Segment 1   Segment 2   Segment 3   Total
Judgement   12.6        13.1        12.2        12.6
SES          4.7         3.7         4.4         4.3
ARR          3.8         4.1         2.7         3.6

There was, however, no difference in the absolute changes between segments for any one method. This is an interesting observation. One would expect that, at least for the judgemental method, the changes to forecasts in the second segment would be greater than for the segments of stability. However, the changes were remarkably consistent across the three segments. Perhaps there was a settling down period for the judgemental forecasters in the first few iterations? But when the first five periods (periods 21-25) were ignored the result did not change. The stability across the segments for each method was consistent for all directions of the series. These results were found to hold for both up and down series and for both ramped and stepped functional series forms. When compared with the statistical models, people seem, therefore, to be reacting excessively to the actual values of the series as they were revealed to them.

Thus, there seems to be excessive noise in the judgements of the human subjects. They adjust their forecasts from period to period owing to random factors. We can eliminate some of this 'noisy' behaviour by developing a regression based 'model of man' [Kleinmuntz (1990)]. As mentioned earlier, models of man have been found to outperform man himself in an overwhelming majority of cases. Whilst we are not especially interested in making this comparison, the model of man can provide insights into human behaviour. For example, although we have found that people make notoriously noisy adjustments to their forecasts (Table 3), perhaps their underlying or mean adjustment is quite good. Thus, for the ramp-up series, for example, judgemental forecasters should optimally adjust their forecasts upwards by 2.5 units each period in segment 2. We know that the absolute percentage change was rather wild, but was the mean adjustment optimal? Regression analysis revealed that subjects made a significant adjustment, albeit only for the low variance series. But the size of the adjustments was about half what it should have been. Subjects were still making linear adjustments in the third segment, the period of the series where it had returned to a steady state. And they were making these linear adjustments for both the ramped series (where it is expected), as well as for the stepped series (where heavy adjustments should be made in the initial stages of the discontinuity only). For the high variance series, the adjustments were assessed by the regression model as completely random. Apparently, the additional noise in the series tended to confuse people on the overall direction of the series. Note, however, that the level of noise in these high noise series was only 15%, an amount common in commercial forecasting experiences.
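The paper does not give the exact regression specification, but the general idea can be sketched as follows: for each subject and series, regress the period-to-period forecast changes within a segment on a constant, so the fitted intercept estimates the subject's mean adjustment (to be compared with the optimal 2.5 units per period for the ramped series in segment 2). The specification and all names below are our assumptions, for illustration only.

```python
import numpy as np
import statsmodels.api as sm


def mean_adjustment(forecasts, periods, segment_periods):
    """Estimate a subject's mean period-to-period forecast adjustment within one segment
    by regressing the forecast changes on a constant (an intercept-only regression)."""
    forecasts = np.asarray(forecasts, dtype=float)
    periods = np.asarray(periods)
    changes = np.diff(forecasts)
    in_segment = np.isin(periods[1:], list(segment_periods))
    y = changes[in_segment]
    fit = sm.OLS(y, np.ones_like(y)).fit()
    return fit.params[0], fit.pvalues[0]    # mean adjustment and its significance


# For a ramp-up series the optimal mean adjustment in segment 2 (periods 33-40) is +2.5
# units per period, and it should fall back to zero in segment 3 (periods 41-48):
# est, p = mean_adjustment(subject_forecasts, forecast_periods, range(33, 41))
```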

Although, in the aggregate, people performed badly in this task, some subjects performed quite well and others performed particularly badly. In the judgemental group there were two subjects who could be identified as extremely 'good' and two subjects who were extremely 'bad'. The 'goodies' had APEs that were significantly better than those of the 'baddies'. In examining the performance of these two groups, we noticed that the superiority existed for all series and across all three segments of each series. Compared with the 'goodies', the 'baddies' always had a greater error (F = 34.9, p < 0.001) and they changed their forecasts to a much greater extent (F = 60.7, p < 0.001). It is interesting to note that the APE of the 'goodies' was no different to that of the statistical techniques. Some people seem, therefore, to be doing something right. Others have a great deal of trouble. Perhaps the differences could be accounted for by the time (effort?) that subjects devoted to the tasks? Surprisingly, the time taken by the 'goodies' was, on average, 2 s less than that taken by the 'baddies' (F = 18.3, p < 0.001). Furthermore, there was no correlation between the time taken and the accuracy of the forecasts within either group. And the amount of time devoted to each segment of the time series demonstrated the same pattern for both groups. So, we must come to the bizarre conclusion that the people who performed best in the task spent the least time at it, and that there was no association between the time taken and the accuracy of the forecasts.

7. Conclusions and implications

Our study has shown that, in a laboratory forecasting task, people were significantly worse at forecasting time series that contained fundamental changes in behaviour, compared with the statistical methods.¹ This runs counter to the suggestions that people should demonstrate superiority in this task, especially in circumstances when conditions change. Our analysis has demonstrated that people were trying to read too much signal into a series as it changed. As a consequence, they overreacted to each new value of the series as it was revealed to them. Perhaps this was due to the fact that, in the instructional talk prior to the experiment, we emphasised the fact that the experiment was concerned with their ability to recognise changes in the series as they developed. People may have been 'hypervigilant' towards the series changing. We did mention, however, that there were some series that did not change. Nevertheless, the inescapable conclusion remains that people performed quite badly at this task. It seems that they concentrated too much on the movements of the series from period to period, instead of examining the whole series as it developed. If they had adopted this latter strategy, they would (arguably) have appreciated the influence of error on the movements in the series.

The problems people have in appreciating randomness and its influence on judgement have been well documented [O'Connor and Lawrence (1989); Hogarth (1975); Lichtenstein et al. (1982)]. It must be remembered that each student had completed at least one course in statistics prior to the experiment and was aware of the concept of random error. Perhaps a training session, where subjects were shown that the best way to forecast a series with error is to estimate the mean, could have improved their performance. But such training periods are not common in a business environment. Perhaps this finding is yet another demonstration of the acceptability of Voltaire's remark: 'Chance is a word void of sense: nothing exists without a cause'. Certainly, remarks in the financial press when key economic indicators are announced would suggest that people are looking for explanations for every movement. We rarely hear said, in association with these remarks, that there may be some random factors at work.

¹ As discussed earlier, one must always interpret the results of this study in the light of the experimental design. The use of post-graduate students in a forecasting task without contextual relevance may seem to question the generalisability of the results. However, as discussed earlier, this research design was considered necessary because of the need to adequately control the influence of the independent variables.

The results of this study contrast with those obtained in Lawrence et al. (1985), where judgemental eyeballing techniques have been shown to be comparable with statistical methods. One reason for the difference in results could relate to the use of series that were artificially developed for the purposes of the study and, arguably, ideally suited to forecasting using statistical approaches. Another possible reason relates to the use of different media to record the forecast. Unlike the pencil-and-paper approach of Lawrence et al. (1985), this study used a computer screen with mouse-based input. Casual observation of the subjects by the researchers revealed that the subjects tended to focus only on the position of the last data point on the series in relation to their last forecast (as discussed earlier), and then used the mouse to 'drag' the screen object to that point where their forecast should be recorded. Perhaps the technology took some of their attention away from the overall behaviour of the time series. These casual observations suggest some new directions for research into judgemental forecasting.

We have seen that subjects reacted in a 'knee-jerk' fashion to the time series data as it was revealed to them. These results are another illustration of the difficulty people have in appreciating the concept of randomness and its influence on behaviour. The important conclusion for business forecasting practices is to beware of the (natural?) tendency to over-react to recent data. The results reported here are surprisingly consistent with those of a recent investigation of sales forecasting practices in consumer products organisations in Australia [Lawrence et al. (1992)]. In that study, the most recent monthly forecast was often considerably less accurate than forecasts for the same month made some months before. The marketing and sales staff often imputed reasons and causes to movements in product sales that (in hindsight) were more of a random nature. Perhaps a more appropriate strategy for regular monthly sales forecasting practices is to ensure that the overall behaviour of the time series is given full consideration and to consciously distrust the temptation to react inappropriately to the most recent data.

The results reported here also suggest that people are not as good as has been suggested at detecting changes in time series. If people still seem to want to forecast judgementally [Dalrymple (1987)], they should possibly do so with the benefit of statistical aids, especially when the task is that of detecting changes in trends as they occur.

References

Armstrong, J.S. and F. Collopy, 1992, "The selection of error measures for generalizing about forecasting methods: Empirical tests", International Journal of Forecasting, 8, 69-80.
Beach, L.R., J. Christensen-Szalanski and V. Barnes, 1987, "Assessing human judgement: Has it been done, can it be done, should it be done?", in: G. Wright and P. Ayton, eds., Judgemental Forecasting (Wiley, New York).
Brown, L.D., 1988, "Comparing judgemental to extrapolative forecasts: It's time to ask why and when", International Journal of Forecasting, 4, 171-173.
Bunn, D. and G. Wright, 1991, "Interaction of judgemental and statistical forecasting methods: Issues and analysis", Management Science, 37, 501-518.
Carbone, R. and S. Armstrong, 1982, "Evaluation of extrapolative forecasting methods: Results of academicians and practitioners", Journal of Forecasting, 1, 215-217.
Carbone, R. and S. Makridakis, 1986, "Forecasting when pattern changes occur beyond the historical data", Management Science, 32, 257-271.
Carbone, R., A. Anderson, Y. Corriveau and P. Corson, 1983, "Comparing for different time series methods the value of technical expertise, individualised analysis, and judgemental adjustment", Management Science, 29, 559-566.
Corker, R.J., S. Holly and R.G. Ellis, 1986, "Uncertainty and forecast precision", International Journal of Forecasting, 2, 53-70.
Dalrymple, D.J., 1987, "Sales forecasting practices: Results of a United States survey", International Journal of Forecasting, 3, 379-392.
Dawes, R.M., D. Faust and P. Meehl, 1989, "Clinical versus actuarial judgement", Science, 243, 1668-1673.
Hogarth, R., 1975, "Cognitive processes and the assessment of subjective probability distributions", Journal of the American Statistical Association, 70, 271-294.
Kleinmuntz, B., 1990, "Why we still use our heads instead of formulas: Toward an integrative approach", Psychological Bulletin, 107, 296-310.
Langer, E.J., 1975, "The illusion of control", Journal of Personality and Social Psychology, 32, 311-328.
Lawrence, M.J., 1983, "An exploration of some practical issues in the use of quantitative forecasting models", Journal of Forecasting, 1, 169-179.
Lawrence, M.J., R.H. Edmundson and M.J. O'Connor, 1985, "An examination of the accuracy of judgemental extrapolation of time series", International Journal of Forecasting, 1, 25-35.
Lawrence, M.J., R.H. Edmundson and M.J. O'Connor, 1986, "The accuracy of combining judgemental and statistical forecasts", Management Science, 32, 1521-1532.
Lawrence, M.J., R.H. Edmundson and M.J. O'Connor, 1992, "Sales forecasting practices in consumer products organisations", Working paper, School of Information Systems, University of New South Wales, Australia.
Lichtenstein, S., B. Fischhoff and L. Phillips, 1982, "Calibration of probabilities: The state of the art to 1980", in: D. Kahneman, P. Slovic and A. Tversky, eds., Judgement under Uncertainty: Heuristics and Biases (Cambridge University Press, Cambridge).
Makridakis, S., A. Anderson, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, J. Newton, E. Parzen and R. Winkler, 1982, "The accuracy of extrapolation (time series) methods: Results of a forecasting competition", Journal of Forecasting, 1, 111-153.
Makridakis, S., S. Wheelwright and V. McGee, 1983, Forecasting: Methods and Applications, 2nd edn. (Wiley, New York).
O'Connor, M.J. and M.J. Lawrence, 1989, "An examination of the accuracy of judgemental confidence intervals in time series forecasting", Journal of Forecasting, 8, 141-155.
Remus, W.E., P.L. Carter and L.O. Jenicke, 1979, "Regression models of decision rules in unstable environments", Journal of Business Research, 7, 187-196.
Taranto, G.M., 1989, "Sales forecasting practices: Results from an Australian survey", Unpublished thesis, University of New South Wales.
Turner, D., 1990, "The role of judgement in macroeconomic forecasting", Journal of Forecasting, 9, 315-346.
Willemain, T.R., 1989, "Graphical adjustment of statistical forecasts", International Journal of Forecasting, 5, 179-185.
Wright, G. and P. Ayton, eds., 1987, Judgemental Forecasting (Wiley, New York).

