                          AND SIC CRITERION
                                           Suhartono *, Subanar †, Suryo Guritno †
                              * PhD Student, Mathematics Department, Gadjah Mada University
                              Statistics Department, Sepuluh Nopember Institute of Technology
                                     † Mathematics Department, Gadjah Mada University

ABSTRACT

    The aim of this paper is to discuss and propose a procedure for model building in neural networks for time series forecasting. We focus on model selection strategies based on statistical concepts, particularly on inference for the $R^2$ incremental contribution and the SIC criterion. We employ this new procedure in a bottom-up or forward approach, which starts with a simple neural network. We use simulated data as a case study. The results show that statistical inference on the $R^2$ incremental contribution, combined with the SIC criterion, is an effective procedure for model selection in neural networks for time series forecasting.

Keywords: neural networks, model building, statistical inference, time series forecasting.

    In recent years, an impressive array of publications has appeared claiming considerable successes of neural networks (NN) in data analysis and engineering applications. The NN model is a prominent example of a flexible functional form. The use of the NN model in applied work is generally motivated by a mathematical result stating that, under mild regularity conditions, a relatively simple NN model is capable of approximating any Borel-measurable function to any given degree of accuracy (see e.g. [5, 6, 16]).

    In applications, an NN contains a limited number of parameters (weights). How to find the best NN model, i.e. how to find the right combination of the number of input variables and the number of nodes in the hidden layer, is a central topic in the NN literature and is discussed in many articles and books (see e.g. [1, 4, 12]).

    Time series forecasting has been an important application of NN from the very beginning. Lapedes and Farber [9] were among the first researchers who used an NN for time series forecasting. They explored the ability of a multi-layer perceptron to forecast a nonlinear computer-generated signal such as the Mackey-Glass differential equation.

    In general, two procedures are commonly used to find the best NN model (the optimal architecture): the “general-to-specific” or “top-down” procedure and the “specific-to-general” or “bottom-up” procedure. The “top-down” procedure starts from a complex model and then applies an algorithm to reduce the number of parameters by using some stopping criteria, whereas the “bottom-up” procedure starts from a simple model. The first procedure is also known in the literature as “pruning” (see [11]), or as the “backward” method in statistical modeling. The second procedure is also known as “constructive learning”, one of the most popular variants being “cascade correlation” (see e.g. [2, 10]), and it can be seen as the “forward” method in statistical modeling.

    The aim of this paper is to discuss and propose a new forward procedure that combines inference on the $R^2$ incremental contribution with the SIC (Schwarz Information Criterion). We emphasize the use of NN for time series forecasting.

2     FFNN OR FEEDFORWARD NEURAL NETWORKS

    Feedforward neural networks (FFNN) are the most popular NN models for time series forecasting applications. Figure 1 shows a typical three-layer FFNN used for forecasting purposes. The input nodes are the previous lagged observations, while the output provides the forecast for the future values. Hidden nodes with appropriate nonlinear transfer functions are used to process the information received by the input nodes.

    The model of FFNN in Figure 1 can be written as

$$ y_t = \alpha_0 + \sum_{j=1}^{q} \alpha_j f\Big( \sum_{i=1}^{p} \beta_{ij} y_{t-i} + \beta_{0j} \Big) + \varepsilon_t , \tag{1} $$

where $p$ is the number of input nodes, $q$ is the number of hidden nodes, and $f$ is a sigmoid transfer function such as the logistic:

$$ f(x) = \frac{1}{1 + e^{-x}} , \tag{2} $$

$\{\alpha_j, j = 0, 1, \ldots, q\}$ is a vector of weights from the hidden to output nodes and $\{\beta_{ij}, i = 0, 1, \ldots, p; j = 1, 2, \ldots, q\}$ are weights from the input to hidden nodes. Note that equation (1) indicates that a linear transfer function is employed in the output node.

    Functionally, the FFNN expressed in equation (1) is equivalent to a nonlinear AR model. This simple structure of the network model has been shown to be capable of approximating arbitrary functions (see e.g. [5, 6, 16]). However, few practical guidelines exist for building an FFNN for a time series; in particular, specifying the FFNN architecture in terms of the number of input and hidden nodes is not an easy task.

Figure 1. Architecture of neural network model with single hidden layer

    Kaashoek and Van Dijk [7] introduced a “pruning” procedure that implements three kinds of methods to find the best FFNN model: incremental contribution ($R^2$ incremental), principal component analysis, and graphical analysis. Swanson and White [14, 15], in contrast, applied a model selection criterion, SIC, in a “bottom-up” procedure that increases the number of nodes in the hidden layer and the number of input variables until the best FFNN model is found.

2.1     Incremental Contribution through $R^2$

    Kaashoek and Van Dijk [7] stated that a natural candidate for quantification of the network performance is the square of the correlation coefficient of $y$ and $\hat{y}$,

$$ R^2 = \frac{(\hat{y}' y)^2}{(y' y)(\hat{y}' \hat{y})} , \tag{3} $$

where $\hat{y}$ is the vector of network output points. The network performance with only one hidden cell deleted can be measured in a similar way. For instance, if the contribution of hidden cell $h$ is set to zero ($\alpha_h = 0$), then the network will produce an output $\hat{y}_{-h}$ with errors

$$ e_{-h} = y - \hat{y}_{-h} . \tag{4} $$

    This reduced network can be measured by the square of the correlation coefficient $R_{-h}$ between $y$ and $\hat{y}_{-h}$,

$$ R^2_{-h} = \frac{(\hat{y}_{-h}' y)^2}{(y' y)(\hat{y}_{-h}' \hat{y}_{-h})} . \tag{5} $$

Now the $R^2$ incremental contribution of hidden cell $h$ is given as

$$ R^2_{(h)} = R^2 - R^2_{-h} . \tag{6} $$

    The same procedure can be applied to reduce the number of input layer cells. In this case, $\{\hat{y}_{-i}(t)\}$ is the network output, given the network parameter estimates, without input cell $i$. When the contribution of input cell $i$ is set to zero ($\beta_{ih} = 0$, where $i = 1, 2, \ldots, p$; $h = 1, 2, \ldots, q$), the reduced network can be quantified by the square of the correlation coefficient $R_{-i}$ between $y$ and $\hat{y}_{-i}$, with

$$ R^2_{-i} = \frac{(\hat{y}_{-i}' y)^2}{(y' y)(\hat{y}_{-i}' \hat{y}_{-i})} . \tag{7} $$

The $R^2$ incremental contribution of input cell $i$ is measured by

$$ R^2_{(i)} = R^2 - R^2_{-i} . \tag{8} $$

The relative value of the $R^2$ incremental contribution can be used to evaluate whether an input or hidden cell can be omitted or not [7].

2.2     Statistical inference of $R^2$ incremental

    In this paper we propose a new forward procedure based on statistical inference of the $R^2$ incremental contribution. This approach involves three basic steps, which we now describe in turn.

 (i). Simple or Reduced model

 We begin with the simple model considered to be appropriate for the data, which in this context is called the reduced or restricted model. In this case, we first evaluate the contribution of the hidden cells. For the simple case, the reduced model is a linear model, i.e. an NN model without a hidden layer:

$$ y_t = \phi_0 + \sum_{j=1}^{p} \phi_j y_{t-j} + \varepsilon_t . \quad \text{(Reduced model)} \tag{9} $$

 We fit this reduced model and obtain the error sum of squares, denoted by SSE(R).

 (ii). Complex or Full model

 Next, we consider the complex or full model, i.e. the NN model as in equation (1). We start by fitting an NN model with a single hidden cell, i.e. $q = 1$. The error sum of squares of this full model is denoted by SSE(F). Here, we have:

$$ SSE(F) = \sum (y - \hat{y}_h)^2 . \quad \text{(Full model)} \tag{10} $$
 (iii). Test Statistic

 Kutner, Nachtsheim and Neter [8] stated that when a large-sample test concerning several parameters of the model simultaneously (i.e. $\alpha_j$ and $\beta_{ij}$ in equation (1)) is desired, we can use the same approach as for the general linear test. First, we fit the reduced model and obtain SSE(R), then we fit the full model and obtain SSE(F), and finally we calculate the test statistic:

$$ F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} . \tag{11} $$

 For large $n$, this test statistic is distributed approximately as $F(v_1 = df_R - df_F, v_2 = df_F)$ when $H_0$ holds, i.e. when the additional parameters in the full model are all equal to 0.

 Gujarati [3] showed that equation (11) can be written in terms of the $R^2$ incremental contribution as

$$ F^* = \frac{R^2_{(F)} - R^2_{(R)}}{df_R - df_F} \div \frac{1 - R^2_F}{df_F} , \tag{12.a} $$

 or

$$ F^* = \frac{R^2_{(Incremental)}}{df_R - df_F} \div \frac{1 - R^2_F}{df_F} . \tag{12.b} $$

    We repeat steps (i) to (iii) until the optimal number of hidden cells is found. Then, the forward procedure continues to find the optimal input cells. We start with the input which has the largest $R^2$. In this paper, we combine this test statistic with the SIC criterion for determining the optimal cells.

    In this paper, the proposed forward procedure is implemented using simulated data. The simulation experiment is carried out to show how the proposed NN modeling procedure works. Simulated data are generated from an ESTAR (Exponential Smooth Transition Autoregressive) model, i.e.

$$ y_t = 6.5\, y_{t-1} \exp(-0.25\, y_{t-1}^2) + u_t , \tag{13} $$

where $u_t \sim \text{nid}(0, 0.5^2)$.

    Time series and lag plots of the simulated data can be seen in Figure 2. We can observe clearly that the data follow a nonlinear autoregressive pattern at lag 1.

Figure 2. Time series and lags ($y_{t-1}$ and $y_{t-2}$) plots of simulated data

    In this section the empirical results for the proposed forward procedure are presented and discussed.

4.1    Unit hidden selection

    First, we apply the proposed forward procedure starting with an FFNN with six input variables ($y_{t-1}, y_{t-2}, \ldots, y_{t-6}$) and one constant input to find the optimal number of hidden layer cells. The results of the optimization steps are reported in Table 1.

    Table 1 shows that two hidden layer cells are the optimal result and further optimization runs are not needed. The graphs of the network output obtained by adding one hidden cell are presented in Figure 3. Then, we continue the optimization to find the optimal input cells.

Table 1. The results of the optimal unit hidden cells determination by implementing forward procedure

Figure 3. The network output by adding one unit hidden layer cell compared with actual data
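The test statistic of equations (11) and (12.a), on which the selection of hidden cells rests, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the sample size, the parameter counts, and all sums of squares are hypothetical values chosen for the example and are not taken from Table 1. The sketch also assumes that $R^2 = 1 - SSE/SST$ with the same total sum of squares for both models, which is the setting in which the two forms coincide.

```python
def f_from_sse(sse_r, sse_f, df_r, df_f):
    """Equation (11): F* computed from the error sums of squares."""
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)


def f_from_r2(r2_r, r2_f, df_r, df_f):
    """Equation (12.a): the same statistic written with R^2 values."""
    return ((r2_f - r2_r) / (df_r - df_f)) / ((1.0 - r2_f) / df_f)


# Hypothetical inputs, for illustration only (not results from the paper).
n = 100          # number of observations
sst = 250.0      # total sum of squares, shared by both models (assumption)

# Reduced model: linear model with a constant and six lags -> 7 parameters.
sse_r, k_r = 120.0, 7
# Full model: one hidden cell with six inputs and biases -> 9 parameters.
sse_f, k_f = 60.0, 9

df_r, df_f = n - k_r, n - k_f

# Assumed link between R^2 and SSE: R^2 = 1 - SSE / SST.
r2_r = 1.0 - sse_r / sst
r2_f = 1.0 - sse_f / sst

f_sse = f_from_sse(sse_r, sse_f, df_r, df_f)
f_r2 = f_from_r2(r2_r, r2_f, df_r, df_f)

print("F* from SSE form:", f_sse)
print("F* from R^2 form:", f_r2)
print("numerator df:", df_r - df_f, "denominator df:", df_f)
```

In practice the resulting value would be compared with a critical value of the $F(df_R - df_F, df_F)$ distribution; a large value favours keeping the added hidden cell.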
4.2      Input unit selection

    The results of the optimization steps for determining the optimal input cells are reported in Table 2. It shows that input 1, i.e. $y_{t-1}$, is the optimal input cell of the network. Hence, this forward procedure yields an optimal network that is an FFNN with one input cell and two hidden layer cells, or FFNN(1,2).

    In general, the results of this simulation study show that the optimal FFNN architecture yielded by this forward procedure is similar to that reported by Subanar and Suhartono [13].

Table 2. The results of the optimal unit inputs determination by implementing forward procedure

5      CONCLUSION

    Based on the results in the previous section, we can conclude that the forward procedure combining inference on the $R^2$ incremental contribution and the SIC criterion is an effective and efficient procedure for determining the best NN model applied to time series forecasting. Additionally, the results show that the proposed forward procedure gives an advantage for FFNN modeling, i.e. the building process of the FFNN model is not a black box.

REFERENCE

[1]      C. M. Bishop, Neural Networks for Pattern Recognition, Oxford: Clarendon Press, 1995.
[2]      S. E. Fahlman and C. Lebiere, The Cascade-Correlation Learning Architecture, in Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems 2, Los Altos, CA: Morgan Kaufmann Publishers, 1990, pp. 524–532.
[3]      D. N. Gujarati, Basic Econometrics, 5th edition, McGraw Hill International, New York, 1996.
[4]      S. Haykin, Neural Networks: A Comprehensive Foundation, Second edition, Prentice-Hall, Oxford,
[5]      K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2, 1989, pp. 359–366.
[6]      K. Hornik, M. Stinchcombe and H. White, Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks, Neural Networks, 3, 1990, pp. 551–560.
[7]      J. F. Kaashoek and H. K. Van Dijk, Neural Network Pruning Applied to Real Exchange Rate Analysis, Journal of Forecasting, 21, 2002, pp. 559–577.
[8]      M. H. Kutner, C. J. Nachtsheim and J. Neter, Applied Linear Regression Models, McGraw Hill International, New York, 2004.
[9]      A. Lapedes and R. Farber, Nonlinear Signal Processing using Neural Networks: Prediction and System Modelling, Technical Report LAUR-87-2662, Los Alamos National Laboratory, Los Alamos, NM, 1987.
[10]     L. Prechelt, Investigation of the CasCor Family of Learning Algorithms, Neural Networks, 10, 1997, pp. 885–896.
[11]     R. Reed, Pruning algorithms - A survey, IEEE Transactions on Neural Networks, 4, 1993, pp.
[12]     B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge,
[13]     Subanar and Suhartono, Model selection in Neural Networks by using inference of R2 incremental and Principal Component Analysis for Time Series Forecasting, Presented at the 2nd IMT-GT Regional Conference on Mathematics, Statistics and Their Applications, Universiti Sains Malaysia, 2006.
[14]     N. R. Swanson and H. White, A model-selection approach to assessing the information in the term structure using linear models and artificial neural networks, Journal of Business and Economic Statistics, 13, 1995, pp. 265–275.
[15]     N. R. Swanson and H. White, A model-selection approach to real-time macroeconomic forecasting using linear models and artificial neural networks, Review of Economics and Statistics, 79, 1997, pp. 540–550.
[16]     H. White, Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings, Neural Networks, 3, 1990, pp. 535–550.
