Choosing the best model and model simplification


                                                 Roger J. Brooks
                                        Department of Management Science
                                            The Management School
                                               Lancaster University
                                               Lancaster LA1 4YX

                                                Andrew M. Tobias
                                           Operational Research Group
                               School of Manufacturing and Mechanical Engineering
                                            University of Birmingham
                                             Birmingham B15 2TT.


Given the widespread acceptance of the importance of simplicity in modelling, the scarcity of research into
simplification is surprising. The model used affects all aspects of the project and there are many advantages of a
simple model. However, often simplification is not attempted, and on the (misguided) assumption that more
detailed models are necessarily more accurate and therefore better, the time available is used to build the most
complex model possible.

This paper briefly describes a number of case studies in which models at different levels of detail were built.
In each case, the ways in which the original model could be simplified are set out, along with the benefits that
can be obtained by using the simple model.

The general factors influencing the extent to which models can be simplified are then discussed. In particular,
where all that is required in terms of output is averages, it may be possible to reduce the model to such a simple
version that an analytical solution becomes feasible and the simulation redundant. In other circumstances it may be
best to build both a complex model for validation and a simplified model for the experimentation.

The choice of model used in a project is essential to the success of the project. However, little guidance is
available about how to make this choice. A modelling project can usually be split into the steps of problem
formulation, collection and analysis of data, model formulation, model construction, verification and
validation, experimentation, analysis of results, conclusions and implementation of the recommendations. The
initial choice of model constitutes model formulation and this step, along with problem formulation, stands
out as being regarded as more of an art than a science. By contrast, the other steps have been the subject of
considerable research resulting in a much more scientific approach. For example, well-established statistical
techniques can be applied to the collection and analysis of data, validation, experimentation and the analysis
of results. Where the model takes the form of a computer program, software methodologies for the efficient
construction and testing of the model may be able to be used. Even for problem formulation, techniques such as
Checkland's [1981] soft systems methodology have been developed.

The assessment of the performance of a model should cover its impact on the whole project, with the aim being
to use the model which gives the best overall performance. We have previously set out the following 11 elements
of model performance which are discussed in Brooks and Tobias [1996]:

Results
1. The extent to which the model output describes the behaviour of interest (whether it has adequate scope
    and detail).
2. The accuracy of the model's results.
3. The ease with which the model and its results can be understood.
Future use of the model
4. The portability of the model and the ease with which parts of the model can be reused in future models.
Confidence in the model (verification, validation and credibility)
5. The probability of the model containing errors.
6. The accuracy with which the model output fits the historical data.
7. The strength of the theoretical basis of the model including the quality of input data (the credibility
    of the model).
Resources required
8. The time and cost to build the model (including data collection, verification and validation).
9. The time and cost to run the model.
10. The time and cost to analyse the results of the model.
11. The hardware requirements (such as computer memory) of running the model.

The relative importance of each of these elements will vary in different projects. The actual performance of a
given model will also partly depend on how it is used in the project, with trade-offs existing between some of
the elements.

Simplification is central to modelling since any model is a simplification of the object it represents. The
choice of model can, in one sense, be viewed as choosing the most appropriate simplification. However, the
concepts of simplicity and complexity in modelling are rarely defined. We will take complexity to be some
combined measure of the number of elements, inter-connections and calculations included within the model
[Brooks and Tobias, 1996]. With respect to the model performance elements, there are a number of potential
advantages of a simpler model compared to a more complex one. The simpler model would often be expected to be
easier to understand, to contain fewer errors and be quicker and easier to build, run and analyse. However,
these relationships will not always apply [Brooks and Tobias, 1996].

The arguments for simplicity in Management Science modelling were considered by Ward [1989], who, in
particular, suggested several reasons why a client may prefer a simple model and so is more likely to implement
the results. The time available for the decision making process is often severely limited and a simple model
gives quicker results which can be more readily assimilated. This allows the client more time to consider the
alternative courses of action and more time for implementation. It also requires less effort on the client's
part. The results from a simple model are likely to be less specific than from a complex model and so may allow
more scope for incorporating the client's knowledge and preferences. Clients may also associate a quick model
with the modeller having a good understanding of the problem, and a simpler model can be more easily explained
by the client to third parties.

This paper describes several modelling projects in which simplification of existing models was attempted and
discusses the implications of the experiences.

POPULATION GENETICS

This model was built to simulate the population genetics of the self-incompatibility gene in the field poppy,
Papaver rhoeas [Lawrence et al., 1978]. The pollen from the plants is dispersed by bees and the plants have a
genetically-based mechanism that prevents plants pollinating themselves. The gene exists in a number of
different types, called alleles. Each plant contains two alleles although the pollen grains produced by the
plant each have just one of these alleles at random. When a pollen grain is deposited on a plant, if the pollen
grain's allele matches either of the plant's alleles then a reaction occurs that prevents the pollen from
fertilising the plant. The aim of the modelling was to investigate the frequency distribution of the alleles in
a typical P. rhoeas population in steady state.

The simulation model was built in FORTRAN. A preliminary simulation of a simplified idealised population had
been built [Lawrence et al., 1994] but its results were not consistent with samples from natural populations
[Campbell and Lawrence, 1981; Lawrence and O'Donnell, 1981]. Several more realistic assumptions were therefore
built into the model including realistic probability distributions for pollen and seed dispersal, the ability
of seed to lie dormant in the soil for several years and two sizes of plant. The sizes of real plants vary
considerably but, in particular, a proportion of plants tend to be very much larger than the remainder of the
population and to produce much more seed and pollen. The main output variable recorded was the variance of the
allele frequencies. A comparison of models with and without the realistic factors enabled their effects to be
assessed. Variation in plant size and seed dormancy had a considerable effect on the allele frequency variance
whereas the inclusion of realistic seed and pollen dispersal distributions (compared to uniform dispersal) had
only a small effect. The detailed results are contained in Brooks et al. [1996].

In order to identify other possible models, attempts were made to simplify the simulation model. Examination
of the mechanisms contained in the model, using some statistical analysis, resulted in the development of a
simple analytical model. This expresses the allele frequency variance as a simple function of the two most
important factors, variation in plant size and overlapping generations [Brooks et al., 1997a,b]. The analytical
model results closely matched those from the simulations. Alternative scenarios can easily be investigated with
this model and a number of further theoretical results were developed. Three intermediate model types were
identified between the analytical and the simulation models.

The SIRIUS wheat model [Jamieson et al., 1998a] is a mechanistic model, based on the known phenology of wheat,
that has been successfully tested in a variety of conditions [Jamieson et al., 1998b]. The model simulates the
growth of wheat on a daily basis using input data for the characteristics of the wheat variety and soil
conditions as well as daily temperature,
precipitation and solar radiation values. The growth of the roots, leaves and then the grain is simulated as
well as the internal processes that determine the timing of the different stages of development. A soil
sub-model is also included that determines the amount of water and nitrogen in the soil. If these amounts
become too small, the growth rate of the plant is reduced. The main variable of interest is usually yield.

The combination of SIRIUS and stochastic weather generators [Semenov et al., 1998] has been used to predict the
effect of climate change on wheat farming in the U.K. at the site scale [Wolf et al., 1996]. The aim of the
current modelling work was to extend this to the regional and national scale. As part of this work, a detailed
sensitivity analysis of SIRIUS was carried out. Combined with analysis of the internal mechanisms within the
model, this again led to the development of a much simpler model which estimates yield using four simple
equations. The relationships that make up the simple model were used as the basis of the upscaling methodology.

Two discrete event simulation models of manufacturing systems were examined, one built several years ago as
part of a student project and the other built by the company concerned. Both models were built using the
WITNESS simulation package [Lanner]. The objective, in each case, was taken to be to identify ways of improving
the throughput of the line, although the original objective was not clear for the student project model. Both
models were very detailed and both contained errors. The complexity of the student project model meant that it
only ran very slowly.

For each model, a revised model was built initially by correcting the errors and simplifying the model enough
that it was feasible to run. Analytical models were then developed for both models by calculating the effective
capacity of each machine and identifying the bottleneck. These calculations took into account set-ups,
breakdowns, rejected parts and the relative proportion of parts processed by each machine. The student project
model contained no stochasticity and the analytical model was able to reproduce the simulation model throughput
exactly. For the other model, a close match was still obtained.

DISCUSSION

Greater computing power and ease of use of model building software have meant that it is tempting to include a
lot of detail in simulation models. Tilanus [1985] found that too much complexity is often given as a reason
for failure of Management Science and Operational Research projects, with the use of a simple model commonly
mentioned in the reasons for the success.

The projects described here illustrate some of the benefits that may be obtained from a simple model. The main
advantage gained in these projects is a much greater understanding of the system. This should always be a
primary objective in a simulation project since pure black-box prediction is only of limited value as it is
difficult to apply the lessons learned in the future. A sound understanding of the system, on the other hand,
can be used and adapted even when circumstances change. Previous work has indicated that the size of the model
and number of connections may be much more important than the complexity of the detailed calculations in
determining ease of understanding [Brooks, 1996].

There can also be unexpected benefits in that the insights obtained can sometimes be applied further. In the
population genetics case, for example, the analytical model was used to derive several interesting additional
results such as the expected number of alleles that could be supported by a given population. Similarly, the
simplified wheat model was used to develop the upscaling methodology by identifying conditions under which
yield would be

Analysis of results is also made much easier. Sensitivity analysis can be easily carried out and alternative
scenarios easily investigated. This is particularly the case for the analytical models. For example, the
effective capacity equations allow changes in the bottleneck to be immediately obtained when the parameters of
one of the machines are changed.

On the other hand, it can be difficult to obtain sufficient confidence in a simple model. Certainly, a model
that omits an important factor will tend to result in a seriously flawed understanding. Complex systems may
require a complex model [Bunge, 1963]. For both the population genetics and wheat models the confidence in the
simplified models comes because the results match well with those of the detailed models. Both simplified
models exclude certain factors which might have been thought to be important without detailed testing and
analysis of the original simulation model.

It is also useful to consider the simplification processes here. In the first two examples, it would have been
very difficult to derive either the simplified population genetics or wheat models without first building the
simulation model. This is not just because the models ignore some factors but also because of the complexity of
the situation being modelled. In each of the cases, the output variable of interest was either a steady state
average or an accumulated value. This tended to allow some of the processes within the model to be averaged,
ignored or treated by statistical analysis. However, care is required in averaging the processes in a complex
model. Modelling is conducted within a particular experimental frame [Zeigler, 1976] and so models built for
other purposes such as detailed analysis of queue lengths may be harder to simplify. The simplified models will
tend to have more limited
validity and be less portable than the detailed models.

Considerable effort tends to be required for the simplification process [Rexstad and Innis, 1985]. This
certainly applied to the population genetics and wheat models where the simplified models were produced by
sensitivity analysis and examination of the detailed workings of the model. Simplification of existing models
is risky since there is no guarantee of success. However, the process itself, even if unsuccessful, should
still enhance understanding. It is also a useful undertaking in helping to verify the model [Rexstad and Innis,
1985].

Despite its importance, simplification has received relatively little attention in the modelling literature.
Zeigler [1976] distinguished four ways of simplifying a discrete event simulation model, namely dropping
unimportant parts of the model, replacing part of the model by a random variable, coarsening the range of
values taken by a variable and grouping parts of a model together. Innis and Rexstad [1983] listed seventeen
simplification techniques for general modelling which they categorised under the modelling steps of hypotheses
(identifying the important parts of the system), formulation (specifying the model), coding (building the
model) and experiments. Courtois [1985] discusses scaling issues that can lead to model decomposition. However,
none of these contributions constitutes a methodology.

CONCLUSIONS

There can be considerable benefits in using a simple simulation model. Often this can only be accomplished by
first building a more detailed model and then attempting to simplify it. This is both because it would be
difficult to derive the simple model from the problem definition and because it would not be possible to obtain
sufficient confidence without first building the detailed model. Although a number of authors [e.g. Pidd, 1998]
advocate starting with a simple model and gradually adding detail, the simplification of the resulting model is
a different process since the simplified model will often have a different form to the original. The simplified
model gives greater understanding and ease of analysis whereas the detailed model is required for cross
validation to give sufficient confidence [Murdoch et al., 1992]. The process of simplification is risky because
it is time-consuming and may not lead to an acceptable simplified model. There is a need both for a
simplification methodology and for a better understanding of the circumstances in which simplification is
likely to be successful.

REFERENCES

Brooks R.J. 1996, "A Framework for Choosing the Best Model in Mathematical Modelling and Simulation", Ph.D.
Thesis, University of Birmingham.

Brooks R.J. and Tobias A.M. 1996, "Choosing the best model: level of detail, complexity and model
performance", Mathematical and Computer Modelling, 24(4), Pp 1-14.

Brooks R.J. Tobias A.M. and Lawrence M.J. 1996, "The population genetics of the self-incompatibility
polymorphism in Papaver rhoeas. XI. The effects of limited pollen and seed dispersal, overlapping generations
and variation in plant size on the variance of S-allele frequencies in populations at equilibrium", Heredity,
76, Pp 367-376.

Brooks R.J. Tobias A.M. and Lawrence M.J. 1997a, "Time series analysis of the self-incompatibility
polymorphism. 1. Allele frequency distribution of a population with overlapping generations and variation in
plant size", Heredity, 79, Pp 350-360.

Brooks R.J. Tobias A.M. and Lawrence M.J. 1997b, "Time series analysis of the self-incompatibility
polymorphism. 2. Frequency equivalent population and the number of alleles that can be maintained in a
population", Heredity, 79, Pp 361-364.

Bunge M. 1963, The Myth of Simplicity: Problems of Scientific Philosophy, Prentice-Hall, Englewood Cliffs,
N.J.

Campbell J.M. and Lawrence M.J. 1981, "The population genetics of the self-incompatibility polymorphism in
Papaver rhoeas. II. The number and frequency of S-alleles in a natural population (R106)", Heredity, 46,
Pp 81-90.

Checkland P.B. 1981, Systems Thinking, Systems Practice, John Wiley and Sons, New York.

Courtois P.-J. 1985, "On time and space decomposition of complex structures", Communications of the ACM,
28(6), Pp 590-603.

Innis G. and Rexstad E. 1983, "Simulation model simplification techniques", Simulation, 41(1), Pp 7-15.

Jamieson P.D. Semenov M.A. Brooking I.R. and Francis G.S. 1998a, "Sirius: a mechanistic model of wheat
response to environmental variation", European Journal of Agronomy, 8, Pp 161-179.

Jamieson P.D. Porter J.R. Goudriaan J. Ritchie J.T. van Keulen H. and Stol W. 1998b, "A comparison of the
models AFRCWHEAT2, CERES-Wheat, Sirius, SUCROS2 and SWHEAT with measurements from wheat grown under drought",
Field Crops Research, 55, Pp 23-44.

Lanner Group Ltd., WITNESS User Manual, Redditch, Worcestershire, U.K.

Lawrence M.J. and O'Donnell S. 1981, "The population genetics of the self-incompatibility polymorphism in
Papaver rhoeas. III. The number and frequency of S-alleles in two further populations (R102 and R104)",
Heredity, 47,

Lawrence M.J. Azfal M. and Kendrick J. 1978, "The
genetical control of self-incompatibility in Papaver
rhoeas", Heredity, 40, Pp 239-253.

Lawrence M.J. O'Donnell S. Lane M.D. and
Marshall D.F. 1994, "The population genetics of the
self-incompatibility polymorphism in Papaver
rhoeas. VIII. Sampling effects as a possible cause of
unequal allele frequencies", Heredity, 72, Pp345-

Murdoch, W.W. McCauley E. Nisbet R.M. Gurney
W.S.C. and De Roos W.M. 1992, "Individual-based
models: combining testability and generality", In
Individual-based Models and Approaches in
Ecology:     Populations,    Communities      and
Ecosystems (edited by D. L. DeAngelis and L. J.
Gross), Chapman and Hall, New York, Pp 18-35.

Pidd M. 1998, Computer Simulation in
Management Science, 4th Edition, Wiley, New

Rexstad E. and Innis G.S. 1985, "Model
simplification – three applications", Ecological
Modelling, 27(1-2), Pp 1-13.

Semenov M.A. Brooks R.J. Barrow E.M. and
Richardson C.W. 1998, "Comparison of the WGEN
and LARS-WG stochastic weather generators for
diverse climates", Climate Research, 10, Pp 95-107.

Tilanus C.B. 1985, "Failures and successes of
quantitative methods in management", European
Journal of Operational Research, 19, Pp 170-175.

Wolf J. Evans L.G. Semenov M.A. Eckersten H.
and Iglesias A. 1996, "Comparison of wheat
simulation models under climate change. I. Model
calibration and sensitivity analyses", Climate
Research, 7(3), Pp 253-270.

Zeigler B. P. 1976, Theory of Modelling and
Simulation. John Wiley, New York.
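
As an illustration of the effective capacity calculation described for the manufacturing models, a minimal
sketch in Python is given here. The equations actually used are not reproduced in this paper, so the formula,
the field names (cycle_time, availability, setup_fraction, reject_rate, visit_fraction) and the example figures
are all assumptions made for illustration only. The sketch simply scales each machine's raw rate by the time
lost to breakdowns and set-ups and by the share of rejected parts, allows for the proportion of parts routed
through the machine, and takes the machine with the smallest resulting capacity as the bottleneck.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical machine description; every field is an illustrative assumption."""
    name: str
    cycle_time: float      # minutes to process one part while running
    availability: float    # fraction of time the machine is not broken down
    setup_fraction: float  # fraction of available time spent on set-ups
    reject_rate: float     # fraction of processed parts that are rejected
    visit_fraction: float  # fraction of the line's parts routed through this machine


def effective_capacity(m: Machine) -> float:
    """Line throughput (good parts per minute) if this machine were the only constraint."""
    productive_share = m.availability * (1.0 - m.setup_fraction)
    good_parts_per_minute = (1.0 / m.cycle_time) * productive_share * (1.0 - m.reject_rate)
    # A machine that sees only part of the flow can support a proportionally larger line throughput.
    return good_parts_per_minute / m.visit_fraction


def bottleneck(machines: list[Machine]) -> Machine:
    """The machine with the smallest effective capacity limits the line."""
    return min(machines, key=effective_capacity)


# Invented example data, not taken from either of the models discussed in the paper.
machines = [
    Machine("saw", 2.0, 0.90, 0.10, 0.00, 1.0),
    Machine("lathe", 1.5, 0.80, 0.05, 0.02, 0.5),
    Machine("grinder", 2.5, 0.95, 0.00, 0.05, 1.0),
]

for m in machines:
    print(f"{m.name}: {effective_capacity(m):.3f} parts/min")
print("bottleneck:", bottleneck(machines).name)
```

Changing the parameters of one machine and calling bottleneck again shows at once whether the constraint has
moved, which is the kind of quick what-if analysis that the analytical models make possible.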


Roger Brooks is a lecturer in Management Science
at Lancaster University. Before coming to Lancaster
he worked on an E.C. funded project investigating
the effects of climate change on European
agriculture. He received a Ph.D from Birmingham
University and a B.A. in Mathematics from Oxford
University. He is also a chartered accountant.
