					Pattern Recognition in the Stock Market
Introduction
               Motivation
• Our time is limited, so it is better not to spend all of it working
• Our lifestyle costs money
• Create something that does the job for us
               MetaTrader
• An online trading platform offered through brokers
• Lets you trade foreign currencies, stocks and indexes
• MetaQuotes Language (MQL), similar to C, allows you to program buying and selling
• Can be linked with dynamic-link libraries (DLLs)
           Pattern recognition
Pattern recognition aims to classify data
(patterns) based either on a priori knowledge or
on statistical information extracted from the
patterns. The patterns to be classified are usually
groups of measurements or observations,
defining points in an appropriate
multidimensional space.


To understand is to perceive patterns
SVM
                Linear Support Vector Machines
• A direct marketing company wants to sell a new book:
  "The Art History of Florence"
• Nissan Levin and Jacob Zahavi, in Lattin, Carroll and Green (2003).
• Problem: how to identify buyers and non-buyers using the two variables:
  – Months since last purchase
  – Number of art books purchased

[Figure: scatter plot of buyers (∆) and non-buyers (●); x-axis: months since last purchase, y-axis: number of art books purchased]
                Linear SVM: Separable Case
• Main idea of SVM: separate the groups by a line.
• However: there are infinitely many lines that have zero training error…
• … which line shall we choose?

[Figure: the buyers (∆) / non-buyers (●) scatter plot; here the two groups can be separated]
                Linear SVM: Separable Case
• SVM uses the idea of a margin around the separating line.
• The thinner the margin, the more complex the model.
• The best line is the one with the largest margin.

[Figure: the same scatter plot with a separating line and its margin]
                Linear SVM: Separable Case
• The line having the largest margin is:
      w1·x1 + w2·x2 + b = 0
• where
  – x1 = months since last purchase
  – x2 = number of art books purchased
• Note:
  – w1·xi1 + w2·xi2 + b ≥ +1   for i ∈ ∆ (buyers)
  – w1·xj1 + w2·xj2 + b ≤ –1   for j ∈ ● (non-buyers)

[Figure: the separating line with its normal vector w and the margin]
                Linear SVM: Separable Case
• The width of the margin is given by:
      margin = (1 – (–1)) / √(w1² + w2²) = 2 / ||w||
• Note: maximizing the margin 2/||w|| is equivalent to minimizing ||w||, which is equivalent to minimizing ||w||²/2.

[Figure: the separating line, its normal vector w, and the margin of width 2/||w||]
    w·xi + b ≥ +1 for yi = +1
    w·xi + b ≤ –1 for yi = –1        ⇒   yi(w·xi + b) – 1 ≥ 0 for all i
          Linear SVM: Separable Case
• Maximizing the margin 2/||w|| is equivalent to minimizing ||w||, i.e. to minimizing ||w||²/2.
• The optimization problem for the SVM is therefore:
      minimize  L(w) = ||w||²/2
• subject to:
  – w1·xi1 + w2·xi2 + b ≥ +1   for i ∈ ∆
  – w1·xj1 + w2·xj2 + b ≤ –1   for j ∈ ●

[Figure: the separating line with its margin]
          Linear SVM: Separable Case
• "Support vectors" are those points that lie on the boundaries of the margin.
• The decision surface (line) is determined only by the support vectors; all other points are irrelevant.

[Figure: the scatter plot with the support vectors lying on the margin boundaries]
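As a small illustration of the separable case above (not part of the original slides), a linear SVM can be fit on a toy two-feature data set and its support vectors and margin inspected; scikit-learn and the made-up data below are assumptions.

```python
# Minimal sketch (assumes scikit-learn and NumPy; the data are invented for illustration).
import numpy as np
from sklearn.svm import SVC

# x = [months since last purchase, number of art books purchased]
X = np.array([[2, 5], [3, 4], [1, 6], [4, 5],
              [8, 1], [9, 2], [10, 0], [7, 1]], dtype=float)
y = np.array([+1, +1, +1, +1, -1, -1, -1, -1])   # +1 = buyer (∆), -1 = non-buyer (●)

# A very large C approximates the hard-margin (separable) formulation above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin width = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)   # only these points define the line
```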
      Linear SVM: Nonseparable Case
• Non-separable case: there is no line that separates the two groups without errors.
• Training set: 1000 targeted customers.
• Here, SVM minimizes L(w, C):
      L(w, C) = ||w||²/2 + C·Σi ξi
  where minimizing ||w||²/2 maximizes the margin and C·Σi ξi penalizes the training errors:
      L(w, C) = Complexity + Errors
• subject to:
  – w1·xi1 + w2·xi2 + b ≥ +1 – ξi   for i ∈ ∆
  – w1·xj1 + w2·xj2 + b ≤ –1 + ξj   for j ∈ ●
  – ξi, ξj ≥ 0

[Figure: the buyers (∆) / non-buyers (●) scatter plot; here no line separates the groups errorlessly]
• In compact form, with training vectors Xi and labels yi = ±1:
  – Decision rule:  y = sign(w·X + b)
  – Optimization:
        min over w, b:   ||w||²/2 + C·Σi [1 – yi(w·Xi + b)]
    where the sum runs over the margin and error vectors (the support vector set S)
  – For all other points:  yi(w·Xi + b) ≥ 1,  i ∉ S
  – The solution has the form  w = Σ_{i∈S} αi yi Xi,  so the decision rule can be written
        y = sign( Σ_{i∈S} αi yi (Xi·X) + b )
            Linear SVM: The Role of C

[Figure: the same data separated with C = 5 (left) and C = 1 (right)]

   Bigger C   →  increased complexity (thinner margin),
                 smaller number of errors (better fit on the data)
   Smaller C  →  decreased complexity (wider margin),
                 bigger number of errors (worse fit on the data)

• Varying C varies both the complexity and the empirical error, by affecting the optimal w and the optimal number of training errors.
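To see this trade-off concretely (an illustration added here, not from the original slides), one can fit a linear SVM for several values of C on synthetic data and compare the margin width, the number of support vectors and the training errors; scikit-learn and the generated data are assumptions.

```python
# Minimal sketch (assumes scikit-learn/NumPy; data are synthetic).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=200) > 0, 1, -1)

for C in (5.0, 1.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    errors = int(np.sum(clf.predict(X) != y))
    print(f"C={C:>4}: margin width={margin:.2f}, "
          f"support vectors={len(clf.support_)}, training errors={errors}")

# Expected tendency: bigger C -> thinner margin, fewer training errors;
# smaller C -> wider margin, more training errors.
```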
              Non-linear SVMs
• Transform the data:  x → φ(x)
• The linear algorithm depends only on the inner products x·xi, hence the transformed algorithm depends only on φ(x)·φ(xi)
• Use a kernel function K(xi, xj) such that  K(xi, xj) = φ(xi)·φ(xj)
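A quick numerical check of the kernel idea (added illustration, not from the original slides): for the degree-2 mapping φ(x) = (x1², √2·x1·x2, x2²) used on the following slides, the kernel K(x, z) = (x·z)² gives the same value as the explicit inner product in the mapped space.

```python
# Minimal sketch (NumPy only).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])   # explicit map R^2 -> R^3

def K(x, z):
    return float(np.dot(x, z)) ** 2                          # kernel in the original R^2

x = np.array([1.0, -1.0])
z = np.array([0.5, 2.0])
print(np.dot(phi(x), phi(z)))   # inner product computed in the mapped space
print(K(x, z))                  # same number, computed without ever mapping the data
```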
      Nonlinear SVM: Nonseparable Case

Mapping into a higher-dimensional space:

    (x11, x12)  →  (x11²,  √2·x11·x12,  x12²)
    (x21, x22)  →  (x21²,  √2·x21·x22,  x22²)
       …
    (xl1, xl2)  →  (xl1²,  √2·xl1·xl2,  xl2²)

Optimization task: minimize L(w, C)

    L(w, C) = ||w||²/2 + C·Σi ξi

subject to:

    w1·xi1² + w2·√2·xi1·xi2 + w3·xi2² + b ≥ +1 – ξi   for i ∈ ∆
    w1·xj1² + w2·√2·xj1·xj2 + w3·xj2² + b ≤ –1 + ξj   for j ∈ ●

[Figure: the non-separable buyers (∆) / non-buyers (●) scatter plot]
 Nonlinear SVM: Nonseparable Case

Map the data into a higher-dimensional space:  R² → R³

    (x1, x2)  →  (x1²,  √2·x1·x2,  x2²)

    ( 1,  1)  →  (1,  √2, 1)   ∆
    (–1, –1)  →  (1,  √2, 1)   ∆
    ( 1, –1)  →  (1, –√2, 1)   ●
    (–1,  1)  →  (1, –√2, 1)   ●

[Figure: left, the four points in the original (x1, x2) plane, where ∆ = {(1,1), (–1,–1)} and ● = {(–1,1), (1,–1)} are not linearly separable; right, their images in the (x1², √2·x1·x2) coordinates, where the two classes separate]
 Nonlinear SVM: Nonseparable Case

Find the optimal hyperplane in the transformed space.

[Figure: the same mapping as on the previous slide, with the optimal separating hyperplane found in the transformed (x1², √2·x1·x2, x2²) space]
 Nonlinear SVM: Nonseparable Case

Observe the decision surface in the original space (optional).

[Figure: the same data; the linear decision surface found in the transformed space corresponds to a non-linear decision surface in the original (x1, x2) space]
 Nonlinear SVM: Nonseparable Case

Dual formulation of the (primal) SVM minimization problem

    Primal:
        min over w, b, ξ:   ||w||²/2 + C·Σi ξi
        subject to:
            yi(w·xi + b) ≥ 1 – ξi
            ξi ≥ 0
            yi ∈ {+1, –1}

    Dual:
        max over α:   Σi αi – (1/2)·Σi Σj αi αj yi yj (xi·xj)
        subject to:
            0 ≤ αi ≤ C
            Σi αi yi = 0
            yi ∈ {+1, –1}
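As an added check of the dual constraints (not part of the original slides), scikit-learn's SVC solves this dual problem internally and exposes αi·yi for the support vectors as dual_coef_, so the box constraint 0 ≤ αi ≤ C and the equality Σi αi yi = 0 can be verified numerically on synthetic data.

```python
# Minimal sketch (assumes scikit-learn/NumPy; data are synthetic).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0, 1, -1)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha_y = clf.dual_coef_[0]      # alpha_i * y_i, one entry per support vector
alpha = np.abs(alpha_y)          # since |y_i| = 1, this recovers alpha_i

print("0 <= alpha_i <= C holds:", bool(np.all((alpha >= -1e-8) & (alpha <= C + 1e-8))))
print("sum_i alpha_i y_i       :", float(alpha_y.sum()))   # ~0: the dual equality constraint
```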
 Nonlinear SVM: Nonseparable Case

Dual formulation of the (primal) SVM minimization problem, in the transformed space

    With the mapping  x = (x1, x2)  →  φ(x) = (x1², √2·x1·x2, x2²), the dual is

        max over α:   Σi αi – (1/2)·Σi Σj αi αj yi yj φ(xi)·φ(xj)

    The inner product in the transformed space collapses to a kernel:

        φ(xi)·φ(xj) = (xi1², √2·xi1·xi2, xi2²)·(xj1², √2·xj1·xj2, xj2²) = (xi·xj)²

    so the dual becomes

        max over α:   Σi αi – (1/2)·Σi Σj αi αj yi yj (xi·xj)²

    subject to:   0 ≤ αi ≤ C,    Σi αi yi = 0,    yi ∈ {+1, –1}

    K(xi, xj) = φ(xi)·φ(xj)    (kernel function)
                                        Solving

• Construct and minimise the Lagrangian

      L(w, b, α) = (1/2)||w||² – Σ_{i=1..N} αi [ yi(w·xi + b) – 1 ]

  with respect to the constraints αi ≥ 0, i = 1, …, N

• Take derivatives with respect to w and b, and equate them to 0:

      ∂L(w, b, α)/∂w = w – Σ_{i=1..N} αi yi xi = 0    →  the parameters are expressed as a
                                                          linear combination of training points
      ∂L(w, b, α)/∂b = –Σ_{i=1..N} αi yi = 0

      KKT condition:  αi [ yi(w·xi + b) – 1 ] = 0     →  only the SVs have non-zero αi

• The Lagrange multipliers αi are called 'dual variables'.
  Each training point has an associated dual variable.
              Applications
• Handwritten digit recognition
    – Of interest to the US Postal Service
    – 4% error was obtained
    – only about 4% of the training data were SVs
•   Text categorisation
•   Face detection
•   DNA analysis
•   …
              Architecture of SVMs
• Nonlinear classifier (using a kernel)
• Decision function:

      f(x) = sgn( Σ_{i=1..l} vi (φ(x)·φ(xi)) + b )
           = sgn( Σ_{i=1..l} vi k(x, xi) + b )

  – φ(xi) is substituted for each training example xi
  – vi = αi yi
  – the vi are computed as the solution of a quadratic program
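To make the decision function above concrete (an added illustration, not from the original slides), the terms vi = αi·yi, the support vectors xi and the bias b of a fitted kernel SVM can be pulled out of scikit-learn's SVC and used to recompute f(x) by hand; the RBF kernel and the synthetic data below are assumptions.

```python
# Minimal sketch (assumes scikit-learn/NumPy).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)        # a circular class boundary

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def rbf(sv, x):
    return np.exp(-gamma * np.sum((sv - x)**2, axis=-1))   # k(x, x_i) for every SV

x_new = np.array([0.3, -0.2])
k = rbf(clf.support_vectors_, x_new)
f = np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]        # sum_i v_i k(x, x_i) + b

print(int(np.sign(f)), int(clf.predict([x_new])[0]))        # the two agree
```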
Artificial Neural Networks
               Neural Network
             Taxonomy of Neural Network Architectures

The architecture of a neural network refers to the arrangement of the connections between neurons (processing elements), the number of layers, and the flow of signals in the network. There are two main categories of neural network architecture: feed-forward and feedback (recurrent) neural networks.
              Neural Network
• Feed-forward network, Multilayer Perceptron
             Neural Network
• Recurrent network
   Multilayer Perceptron (MLP)

[Figure: left, an MLP with an input layer (input vector x1 … xn), a hidden layer (h1, h2, …) and an output layer (O1); right, a single neuron processing element that forms the weighted sum y = Σ wi·xi of its inputs x1 … xn and passes it through the activation function F(y)]

              MLP Structure

[Figure: MLP structure diagram]
          Backpropagation Learning
• Architecture:
   – Feedforward network of at least one layer of non-linear
     hidden nodes, e.g., # of layers L ≥ 2 (not counting the input
     layer)
   – Node function is differentiable
     most common: sigmoid function
• Learning: supervised, error driven,
  generalized delta rule
• We call this type of net a BP net
• The weight update rule
  (gradient descent approach)
• Practical considerations
• Variations of BP nets
• Applications
             Backpropagation Learning
• Notations:
  – Weights: two weight matrices:
      w(1,0): from input layer (0) to hidden layer (1)
      w(2,1): from hidden layer (1) to output layer (2)
      w(1,0)_{2,1}: the weight from node 1 in layer 0 to node 2 in layer 1
  – Training samples: pairs {(x_p, d_p) | p = 1, …, P},
    so it is supervised learning
  – Input pattern:    x_p = (x_{p,1}, …, x_{p,n})
  – Output pattern:   o_p = (o_{p,1}, …, o_{p,K})
  – Desired output:   d_p = (d_{p,1}, …, d_{p,K})
  – Error:  l_{p,k} = d_{p,k} – o_{p,k}  is the error at output node k when x_p is applied
  – Sum square error:   Σ_{p=1..P} Σ_{k=1..K} (l_{p,k})²
    This error drives learning (it changes w(1,0) and w(2,1))
                Backpropagation Learning
• Sigmoid function again:
  – Differentiable:

        S(x) = 1 / (1 + e^(–x))

        S'(x) = –(1 + e^(–x))' / (1 + e^(–x))²
              = e^(–x) / (1 + e^(–x))²
              = [1 / (1 + e^(–x))] · [e^(–x) / (1 + e^(–x))]
              = S(x)·(1 – S(x))

  – When |net| is sufficiently large, x moves into one of the two saturation regions, and the sigmoid behaves like a threshold or ramp function.

    [Figure: sigmoid curve with the two saturation regions at large |x| marked]

• Chain rule of differentiation:
  if z = f(y), y = g(x), x = h(t), then
        dz/dt = (dz/dy)(dy/dx)(dx/dt) = f'(y)·g'(x)·h'(t)
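A small numeric check of the derivative identity above (added here, not in the original slides), comparing S'(x) = S(x)(1 – S(x)) with a finite difference and showing how the derivative vanishes in the saturation regions.

```python
# Minimal sketch (NumPy only).
import numpy as np

def S(x):
    return 1.0 / (1.0 + np.exp(-x))

def S_prime(x):
    s = S(x)
    return s * (1.0 - s)        # S'(x) = S(x)(1 - S(x))

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    eps = 1e-6
    numeric = (S(x + eps) - S(x - eps)) / (2 * eps)      # central finite difference
    print(f"x={x:+5.1f}  S'(x)={S_prime(x):.6f}  finite diff={numeric:.6f}")

# At |x| = 10 the derivative is ~4.5e-5: these are the saturation regions where learning stalls.
```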
                 Backpropagation Learning

• Forward computing:
   – Apply an input vector x to the input nodes
   – Compute the output vector x(1) of the hidden layer:
         x(1)_j = S(net(1)_j) = S( Σ_i w(1,0)_{j,i} · x_i )
   – Compute the output vector o of the output layer:
         o_k = S(net(2)_k) = S( Σ_j w(2,1)_{k,j} · x(1)_j )
   – The network is then said to be a map from input x to output o
• Objective of learning:
   – Modify the two weight matrices so as to reduce the sum square error
         Σ_{p=1..P} Σ_{k=1..K} (l_{p,k})²
     for the given P training samples as much as possible (to zero if possible)
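For concreteness (an added sketch, not from the slides), here is one forward pass in the notation above, with small random weight matrices standing in for trained ones.

```python
# Minimal sketch (NumPy only; weights are invented).
import numpy as np

def S(x):                                     # sigmoid node function
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m, K = 2, 5, 3                             # input, hidden, output sizes
W10 = rng.uniform(-0.1, 0.1, size=(m, n))     # w(1,0): input  -> hidden
W21 = rng.uniform(-0.1, 0.1, size=(K, m))     # w(2,1): hidden -> output

x = np.array([0.3, -0.7])                     # one input pattern
x1 = S(W10 @ x)                               # hidden-layer output x(1)
o = S(W21 @ x1)                               # network output o
print("hidden:", x1)
print("output:", o)
```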
           Backpropagation Learning
• Idea of BP learning:
  – Update of the weights in w(2,1) (from the hidden layer to the output layer):
    the delta rule, as in a single-layer net, using the sum square error
  – The delta rule is not applicable to updating the weights in w(1,0)
    (from the input layer to the hidden layer), because we don't know the
    desired values for the hidden nodes
  – Solution: propagate the errors at the output nodes down to the hidden
    nodes; these computed errors on the hidden nodes drive the update of the
    weights in w(1,0) (again by the delta rule), hence the name error Back
    Propagation (BP) learning
  – How to compute the errors on the hidden nodes is the key
  – Error backpropagation can be continued downward if the net has more than
    one hidden layer
  – Proposed first by Werbos (1974); current formulation by
    Rumelhart, Hinton, and Williams (1986)
                  Backpropagation Learning
• Generalized delta rule:
  – Consider the sequential learning mode: for a given sample (x_p, d_p),
        E = Σ_k (l_{p,k})² = Σ_k (d_{p,k} – o_{p,k})²
  – Update the weights by gradient descent:
      for weights in w(2,1):   Δw(2,1)_{k,j} = –η · ∂E/∂w(2,1)_{k,j}
      for weights in w(1,0):   Δw(1,0)_{j,i} = –η · ∂E/∂w(1,0)_{j,i}
  – Derivation of the update rule for w(2,1):
    since E is a function of l_k = d_k – o_k, o_k is a function of net(2)_k,
    and net(2)_k is a function of w(2,1)_{k,j}, by the chain rule
        ∂E/∂w(2,1)_{k,j} = (∂E/∂o_k) · (∂o_k/∂net(2)_k) · (∂net(2)_k/∂w(2,1)_{k,j})
                   Backpropagation Learning
  – Derivation of the update rule for w(1,0)_{j,i}:
    consider hidden node j:
      the weight w(1,0)_{j,i} influences net(1)_j;
      node j sends S(net(1)_j) to all output nodes,
      so all K terms in E are functions of w(1,0)_{j,i}:

        E = Σ_k (d_k – o_k)²,    o_k = S(net(2)_k),    net(2)_k = Σ_j x(1)_j · w(2,1)_{k,j},
        x(1)_j = S(net(1)_j),    net(1)_j = Σ_i x_i · w(1,0)_{j,i}

    By the chain rule, each output node k contributes the product of factors

        (∂E/∂o_k) · (∂S(net(2)_k)/∂net(2)_k) · (∂net(2)_k/∂x(1)_j) · (∂x(1)_j/∂net(1)_j) · (∂net(1)_j/∂w(1,0)_{j,i})

    [Figure: the path from the weight w(1,0)_{j,i} through hidden node j to an output node k and its weight w(2,1)_{k,j}]
                Backpropagation Learning
– Update rules:
  for the outer-layer weights w(2,1):

        Δw(2,1)_{k,j} = η · δ_k · x(1)_j     where  δ_k = (d_k – o_k) · S'(net(2)_k)

  for the inner-layer weights w(1,0):

        Δw(1,0)_{j,i} = η · δ_j · x_i        where  δ_j = ( Σ_k δ_k · w(2,1)_{k,j} ) · S'(net(1)_j)

                                             (the term in parentheses is the weighted sum of the
                                              errors propagated back from the output layer)

Note: if S is a logistic function, then S'(x) = S(x)(1 – S(x)).
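The update rules above can be exercised end to end on one training sample; the sketch below (added here, not from the slides) uses invented data and a small 2-5-3 net, with η = 0.5 as an arbitrary illustrative learning rate.

```python
# Minimal sketch (NumPy only): one generalized-delta-rule update for a single sample (x, d).
import numpy as np

def S(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m, K = 2, 5, 3
W10 = rng.uniform(-0.1, 0.1, size=(m, n))    # w(1,0): input  -> hidden
W21 = rng.uniform(-0.1, 0.1, size=(K, m))    # w(2,1): hidden -> output
eta = 0.5                                    # learning rate (illustrative value)

x = np.array([0.3, -0.7])                    # input pattern
d = np.array([1.0, 0.0, 0.0])                # desired output

x1 = S(W10 @ x)                              # forward pass: hidden output x(1)
o = S(W21 @ x1)                              # forward pass: network output o

delta_k = (d - o) * o * (1 - o)              # delta_k = (d_k - o_k) * S'(net(2)_k)
delta_j = (W21.T @ delta_k) * x1 * (1 - x1)  # delta_j = (sum_k delta_k w(2,1)_{k,j}) * S'(net(1)_j)

W21 += eta * np.outer(delta_k, x1)           # dw(2,1)_{k,j} = eta * delta_k * x(1)_j
W10 += eta * np.outer(delta_j, x)            # dw(1,0)_{j,i} = eta * delta_j * x_i

print("error after one update:", float(np.sum((d - S(W21 @ S(W10 @ x)))**2)))
```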
         Backpropagation Learning
• Pattern classification: an example
  – Classification of myoelectric signals
       • Input pattern: 2 features, normalized to real values
         between -1 and 1
       • Output patterns: 3 classes
  – Network structure: 2-5-3
       • 2 input nodes, 3 output nodes,
       • 1 hidden layer of 5 nodes
       • η = 0.95, α = 0.4 (momentum)
  –   Error bound e = 0.05
  –   332 training samples
  –   Maximum iteration = 20,000
   –   When stopped, 38 patterns remained misclassified (see the sketch below)
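The sketch below is an added, loose reproduction of this setup (not from the original slides): it trains a 2-5-3 logistic MLP with SGD and momentum on synthetic stand-in data, since the myoelectric data set itself is not available; scikit-learn and the generated data are assumptions, and the slide's η = 0.95 is unusually large for this solver.

```python
# Minimal sketch (assumes scikit-learn/NumPy; data are synthetic stand-ins).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPClassifier

X, y = make_blobs(n_samples=332, centers=3, n_features=2, cluster_std=2.0, random_state=0)
X = X / np.abs(X).max(axis=0)            # normalize features to roughly [-1, 1]

net = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                    solver="sgd", learning_rate_init=0.95, momentum=0.4,
                    max_iter=20000, random_state=0)
net.fit(X, y)

print("training patterns misclassified:", int(np.sum(net.predict(X) != y)))
```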
          Strengths of BP Learning
• Great representation power
   – Any L2 function can be represented by a BP net
   – Many such functions can be approximated by BP
     learning (gradient descent approach)
• Easy to apply
   – Only requires that a good set of training samples is
     available
   – Does not require substantial prior knowledge or deep
     understanding of the domain itself (ill structured
     problems)
   – Tolerates noise and missing data in training samples
     (graceful degrading)
• Easy to implement the core of the learning algorithm
• Good generalization power
   – Often produce accurate results for inputs outside
     the training set
          Deficiencies of BP Learning
• Learning often takes a long time to converge
  – Complex functions often need hundreds or thousands of
    epochs
• The net is essentially a black box
  – It may provide a desired mapping between input and
    output vectors (x, o) but does not have the information of
    why a particular x is mapped to a particular o.
  – It thus cannot provide an intuitive (e.g., causal)
    explanation for the computed result.
  – This is because the hidden nodes and the learned weights
    do not have clear semantics.
     • What can be learned are operational parameters, not general,
       abstract knowledge of a domain
  – Unlike many statistical methods, there is no theoretically
    well-founded way to assess the quality of BP learning
     • What is the confidence level of o computed from input x using
       such net?
     • What is the confidence level for a trained BP net, with the
       final E (which may or may not be close to zero)?
• Problem with gradient descent approach
  – only guarantees to reduce the total error to a local
    minimum. (E may not be reduced to zero)
    • Cannot escape from the local minimum error state
    • Not every function that is representable can be
      learned
  – How bad: depends on the shape of the error surface.
    Too many valleys/wells will make it easy to be trapped
    in local minima
  – Possible remedies:
    • Try nets with different # of hidden layers and hidden
      nodes (they may lead to different error surfaces, some
      might be better than others)
    • Try different initial weights (different starting points on the
      surface)
    • Forced escape from local minima by random perturbation
      (e.g., simulated annealing)
• Generalization is not guaranteed even if the error
  is reduced to 0
  – Over-fitting/over-training problem: trained net fits the training
    samples perfectly (E reduced to 0) but it does not give accurate
    outputs for inputs not in the training set

 – Possible remedies:
    • More and better samples
    • Using smaller net if possible
    • Using larger error bound
      (forced early termination)
    • Introducing noise into samples
       – modify (x1,…, xn) to
          (x1+α1,…, xn+αn) where αi
          are small random
          displacements
    • Cross-Validation
       – leave some (~10%) samples as test data (not used for weight
         update)
       – periodically check error on test data
       – learning stops when error on test data starts to increase
• Network paralysis with sigmoid activation function
  – Saturation regions:
        S(x) = 1/(1 + e^(–x)),  so its derivative  S'(x) = S(x)(1 – S(x)) → 0  as |x| → ∞.
    When x falls into a saturation region, S(x) hardly changes its value,
    regardless of how fast the magnitude of x increases.
  – The input to a node may fall into a saturation region when some of its
    incoming weights become very large during learning. Consequently, the
    weights stop changing no matter how hard you try.
  – Possible remedies:
    • Use non-saturating activation functions
    • Periodically normalize all weights:
          w_{k,j} := w_{k,j} / ||w_{k,·}||₂   (divide each incoming weight of node k by the norm of its weight vector)
• The learning (accuracy, speed, and generalization) is highly dependent
  on a set of learning parameters
   – Initial weights, learning rate, # of hidden layers and
     # of nodes...
   – Most of them can only be determined empirically
     (via experiments)
             Practical Considerations
• A good BP net requires more than the core of the learning
  algorithms. Many parameters must be carefully selected
  to ensure a good performance.
• Although the deficiencies of BP nets cannot be
  completely cured, some of them can be eased by some
  practical means.
• Initial weights (and biases)
   – Random, e.g. in [-0.05, 0.05], [-0.1, 0.1], or [-1, 1]
      • Avoid bias in weight initialization
   – Normalize the weights of the hidden layer (w(1,0)) (Nguyen-Widrow):
      • Randomly assign initial weights to all hidden nodes
      • For each hidden node j, normalize its weights by
            w(1,0)_{j,i} := β · w(1,0)_{j,i} / ||w(1,0)_{j,·}||₂,   where  β = 0.7 · ⁿ√m,
            m = # of hidden nodes, n = # of input nodes,
        so that ||w(1,0)_{j,·}||₂ = β after normalization
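A short sketch of this Nguyen-Widrow-style normalization (added here, not from the slides): after random initialization, each hidden node's weight vector is rescaled so its norm equals β = 0.7·m^(1/n).

```python
# Minimal sketch (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 10                                  # input nodes, hidden nodes
W10 = rng.uniform(-0.5, 0.5, size=(m, n))     # random initial hidden-layer weights w(1,0)

beta = 0.7 * m ** (1.0 / n)                   # beta = 0.7 * n-th root of m
W10 = beta * W10 / np.linalg.norm(W10, axis=1, keepdims=True)

print("beta =", beta)
print("row norms after normalization:", np.linalg.norm(W10, axis=1))   # all equal to beta
```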
• Training samples:
   – Quality and quantity of training samples often determines the
     quality of learning results
   – Samples must collectively represent well the problem space
      • Random sampling
      • Proportional sampling (with prior knowledge of the problem
        space)
    – # of training patterns needed: there is no theoretically ideal
      number.
      • Baum and Haussler (1989): P = W/e, where
        W: total # of weights to be trained (depends on net structure)
          e: acceptable classification error rate
        If the net can be trained to correctly classify (1 – e/2)P of the
        P training samples, then classification accuracy of this net is
        1 – e for input patterns drawn from the same sample space
        Example: W = 27, e = 0.05, P = 540. If we can successfully
        train the network to correctly classify (1 – 0.05/2)*540 = 526
        of the samples, the net will work correctly 95% of time with
        other input.
• How many hidden layers and hidden nodes
  per layer:
  – Theoretically, one hidden layer (possibly with many
    hidden nodes) is sufficient for any L2 functions
  – There is no theoretical results on minimum necessary
    # of hidden nodes
  – Practical rule of thumb:
     • n = # of input nodes; m = # of hidden nodes
     • For binary/bipolar data: m = 2n
     • For real data: m >> 2n
  – Multiple hidden layers with fewer nodes may be trained
    faster for similar quality in some applications
– Example: compressing character bitmaps.
   • Each character is represented by a 7 by 9 pixel
     bitmap, or a binary vector of dimension 63
   • 10 characters (A – J) are used in experiment
   • Error range:
     tight: 0.1 (off: 0 – 0.1; on: 0.9 – 1.0)
     loose: 0.2 (off: 0 – 0.2; on: 0.8 – 1.0)
   • Relationship between # hidden nodes, error
     range, and convergence rate
       – relaxing error range may speed up
       – increasing # hidden nodes (to a point) may
         speed up
     error range: 0.1 hidden nodes: 10 # epochs: 400+
     error range: 0.2 hidden nodes: 10 # epochs: 200+
     error range: 0.1 hidden nodes: 20 # epochs: 180+
     error range: 0.2 hidden nodes: 20 # epochs: 90+
     no noticeable speed up when # hidden nodes increases
     to beyond 22
• Other applications.
  – Medical diagnosis
     • Input: manifestation (symptoms, lab tests, etc.)
       Output: possible disease(s)
     • Problems:
        – no causal relations can be established
        – hard to determine what should be included as
          inputs
     • Currently focus on more restricted diagnostic tasks
        – e.g., predict prostate cancer or hepatitis B based
          on standard blood test
  – Process control
     • Input: environmental parameters
       Output: control parameters
     • Learn ill-structured control functions
– Stock market forecasting
   • Input: financial factors (CPI, interest rate, etc.)
     and stock quotes of previous days (weeks)
     Output: forecast of stock prices or stock indices
     (e.g., S&P 500)
   • Training samples: stock market data of past few
     years
– Consumer credit evaluation
   • Input: personal financial information (income,
     debt, payment history, etc.)
   • Output: credit rating
– And many more
– Key for successful application
   • Careful design of input vector (including all
     important features): some domain knowledge
   • Obtain good training samples: time and other cost
              Summary of BP Nets
• Architecture
   – Multi-layer, feed-forward (full connection between
     nodes in adjacent layers, no connection within a layer)
   – One or more hidden layers with non-linear activation
     function (most commonly used are sigmoid functions)
• BP learning algorithm
   – Supervised learning (samples (xp, dp))
    – Approach: gradient descent to reduce the total error,
                   Δw = –η · ∂E/∂w
      (which is why it is also called the generalized delta rule)
    – Error terms at the output nodes are propagated back to error terms at
      the hidden nodes (which is why it is called error BP)
   – Ways to speed up the learning process
      • Adding momentum terms
      • Adaptive learning rate (delta-bar-delta)
      • Quickprop
   – Generalization (cross-validation test)
• Strengths of BP learning
   – Great representation power
   – Wide practical applicability
   – Easy to implement
   – Good generalization power
• Problems of BP learning
   – Learning often takes a long time to converge
   – The net is essentially a black box
   – Gradient descent approach only guarantees a local minimum error
   – Not every function that is representable can be learned
   – Generalization is not guaranteed even if the error is reduced to zero
   – No well-founded way to assess the quality of BP learning
   – Network paralysis may occur (learning is stopped)
   – Selection of learning parameters can only be done by trial-and-error
   – BP learning is non-incremental (to include new training samples, the
     network must be re-trained with all old and new samples)
Experiments
                      Stock Prediction
• Stock prediction is a difficult task because stock data are very noisy and
  time-varying.
• The efficient market hypothesis claims that the future price of a stock is
  not predictable from publicly available information.
• However, this theory has been challenged by many studies, and several
  researchers have successfully applied machine learning approaches such as
  neural networks to stock prediction.
      Is the Market Predictable ?
• Efficient Market Hypothesis (EMH) (Fama, 1965)
  Stock market is efficient in that the current market prices reflect all information
  available to traders, so that future changes cannot be predicted relying on past prices
  or publicly available information.

• Murphy's law : Anything that can go wrong will go wrong.

   Fama et al. (1988) showed that 25% to 40% of the variance in
   stock returns over periods of three to five years is
   predictable from past returns.

   Pesaran and Timmerman (1999) concluded that the UK stock market was
   predictable over the past 25 years.

   Saad (1998) successfully employed different neural network models
   to predict the trend of various stocks over a short-term horizon.
Optimistic report
                   Implementation
• In this paper we propose to investigate SVM, MLP and RBF network
  for the task of predicting the future trend of the 3 major stock indices
  a) Kuala Lumpur Composite Index (KLCI)
  b) Hongkong Hangseng index
  c) Nikkei 225 stock index
  using input based on technical indicators.
• This paper approaches the problem as a two-class pattern
  classification task, formulated specifically to assist investors in making
  trading decisions.
• The classifier is asked to recognise investment opportunities that
  can give a return of r% or more within the next h days (r = 3%, h = 10
  days).
            System Block Diagram
• The classifier predicts whether an increment of more than 3% in the stock
  index can be achieved within the next 10-day period.

[Block diagram: daily historical data, converted into technical analysis indicators → Classifier → "Increment achievable?" (Yes / No)]
                       Data Used
• Kuala Lumpur Composite Index (KLCI) for the period 1992-1997
• Hang Seng index (20/4/1992 - 1/9/1997)
• Nikkei 225 stock index (20/4/1982 - 1/9/1987)

TABLE 1: DESCRIPTION OF INPUT TO CLASSIFIER
[Table 1 lists the technical-indicator inputs xi, i = 1, 2, …, 12; the input dimension is n = 15]
                  Input to Classifier

    DLN(t) = sign[q(t) – q(t–N)] · ln( q(t)/q(t–N) + 1 )        (1)

where q(t) is the index level at day t and DLN(t) is the actual input to the classifier.
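As an added illustration (not from the original slides), equation (1) can be applied to a daily index series as follows; the synthetic price series and the choice of lag N are assumptions.

```python
# Minimal sketch (NumPy only; prices are synthetic).
import numpy as np

rng = np.random.default_rng(0)
q = 1000.0 * np.exp(np.cumsum(rng.normal(0, 0.01, size=300)))   # synthetic index levels q(t)

def dln(q, N):
    """DLN(t) = sign[q(t) - q(t-N)] * ln(q(t)/q(t-N) + 1), defined for t >= N."""
    qt, qlag = q[N:], q[:-N]
    return np.sign(qt - qlag) * np.log(qt / qlag + 1.0)

x = dln(q, N=5)
print(x[:5])        # the first few classifier inputs for a 5-day lag
```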
        Prediction Formulation

Consider ymax(t) as the maximum upward movement of the stock
index value within the period between t and t + h, where y(t) represents the
stock index level at day t, and let yr(t) denote this maximum upward movement
expressed as a percentage of y(t).

             Prediction Formulation
Classification
The prediction of the stock trend is formulated as a two-class
classification problem:

    yr(t) > r%   →  Class 2
    yr(t) ≤ r%   →  Class 1
              Prediction Formulation
Classification
• Let (xi, yi), 1 ≤ i ≤ N, be a set of N training examples. Each input example
  xi ∈ Rⁿ (n = 15 being the dimension of the input space) belongs to a class
  labelled by yi ∈ {+1, –1}.

[Figure: example index chart with regions labelled yi = –1 and yi = +1]
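The labelling can be sketched as follows (added illustration, not from the slides): a day is labelled +1 when the maximum upward movement of the index within the next h days exceeds r% of the current level, and –1 otherwise; the synthetic price series is an assumption.

```python
# Minimal sketch (NumPy only; r = 3%, h = 10 as in the slides).
import numpy as np

rng = np.random.default_rng(0)
y_index = 1000.0 * np.exp(np.cumsum(rng.normal(0, 0.01, size=500)))   # index level y(t)

def make_labels(y, r=0.03, h=10):
    labels = np.empty(len(y) - h, dtype=int)
    for t in range(len(y) - h):
        y_max = y[t + 1 : t + 1 + h].max()      # maximum level within the next h days
        yr = (y_max - y[t]) / y[t]              # maximum relative upward movement yr(t)
        labels[t] = +1 if yr > r else -1
    return labels

labels = make_labels(y_index)
print("fraction of +1 (buy-opportunity) days:", float((labels == 1).mean()))
```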
         Performance Measure
• True Positive (TP): the number of positive-class examples correctly predicted as positive.
• False Positive (FP): the number of negative-class examples wrongly predicted as positive.
• False Negative (FN): the number of positive-class examples wrongly predicted as negative.
• True Negative (TN): the number of negative-class examples correctly predicted as negative.
           Performance Measure

•   Accuracy  = (TP + TN) / (TP + FP + TN + FN)
•   Precision = TP / (TP + FP)
•   Recall rate (sensitivity) = TP / (TP + FN)
•   F1 = 2 · Precision · Recall / (Precision + Recall)
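For reference, the four measures above translate directly into code (an added sketch; the confusion-matrix counts below are invented).

```python
# Minimal sketch (plain Python).
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=60, fp=25, fn=40, tn=275)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```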
                    Testing Method
• A rolling window method is used to capture the training and test data.
• Each window uses Train = 600 data points followed by Test = 400 data points.

[Figure: rolling window over the time series, with a 600-point training segment followed by a 400-point test segment]
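A possible sketch of such a rolling-window split (added here, not from the slides); the step size by which the window advances is an assumption, as the slides do not state it.

```python
# Minimal sketch (NumPy only).
import numpy as np

def rolling_windows(n_samples, train=600, test=400, step=400):
    t = 0
    while t + train + test <= n_samples:
        train_idx = np.arange(t, t + train)
        test_idx = np.arange(t + train, t + train + test)
        yield train_idx, test_idx
        t += step                              # assumed step; the slides do not specify it

for k, (tr, te) in enumerate(rolling_windows(2000)):
    print(f"window {k}: train {tr[0]}-{tr[-1]}, test {te[0]}-{te[-1]}")
```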
         Experiment and Result
• Experiments are conducted to predict the stock trend of
  three major stock indexes, KLCI, Hangseng and Nikkei.
• SVM, MLP and RBF networks are used to make trend
  predictions based on both classification and regression
  approaches.
• A hypothetical trading system is simulated to find out the
  annualized profit generated based on the given
  prediction.
Experiment and Result
           Trading Performance
• A hypothetical trading system is used.
• When a positive prediction is made, one unit of money
  is invested in a portfolio reflecting the stock index. If
  the stock index increases by more than r% (r = 3%) within
  the next h days (h = 10) at some day t, the investment is
  sold at the index price of day t. If not, the investment is
  sold on day t + 1 regardless of the price. A transaction fee
  of 1% is charged for every transaction made.
• The annualised rate of return is used as the performance measure.
         Trading Performance
• Classifier Evaluation Using Hypothetical Trading
  System
Trading Performance
        Experiment and Result
• Classification Result
        Experiment and Result
• The results show that the neural network techniques perform
  better than a K-nearest-neighbour classifier. On average, SVM
  shows better overall performance than the MLP and RBF networks
  on most of the performance metrics used.
       Experiment and Result
Comparison of Receiver Operating Characteristic (ROC) curves
     Experiment and Result
• Area under the ROC curve (AUC)
                       Conclusion
• We have investigated the SVM, MLP and RBF network as
  classifiers and regressors to assess their potential for the stock trend
  prediction task.

• The support vector machine (SVM) has shown better performance
  than the MLP and RBF networks.

• The SVM classifier with probabilistic output outperforms the MLP and RBF
  networks in terms of the error-reject tradeoff.

• Both the classification and the regression model can be used for a
  profitable trend prediction system. The classification model has the
  advantage that a pattern rejection scheme can be incorporated.
This report
                 Implementation
• OnlineSVR by Francesco Parrella

• http://onlinesvr.altervista.org/

• BPN by Karsten Kutza

• http://www.neural-networks-at-your-fingertips.com/
                           Results
• Basically zero correlation between the predictions and the actual
  outcomes

• Suffered from many technical failures

• We still have faith that these methods (when applied correctly) can
  predict the future better than a random guess

• We tried many BPN topologies and many choices of input values for the
  SVM; it looks like the secret does not lie there

• Future investigation: use wavelet/noiselet coefficients as inputs
                          References
• http://www.cs.unimaas.nl/datamining/slides2009/svm_presentation.ppt

• http://merlot.stat.uconn.edu/~lynn/svm.ppt

• http://www.cs.bham.ac.uk/~axk/ML_SVM05.ppt

• http://www.stanford.edu/class/msande211/KKTgeometry.ppt

• http://www.csee.umbc.edu/~ypeng/F09NN/lecture-notes/NN-Ch3.ppt

• http://fit.mmu.edu.my/caiic/reports/report04/mmc/haris.ppt

• http://www.youtube.com/watch?v=oQ1sZSCz47w

• Google, Wikipedia and others

				