Neural Networks

					      Intro to Neural Networks

Abhijit Kedia (
                   Batch 2002-06      6th March '05
Why would anyone want a 'new'
     sort of computer?
What are (everyday) computer systems good at, and not so good at?
Good at: fast arithmetic; doing precisely what the
programmer programs them to do
Not so good at: massive parallelism, fault tolerance,
adapting to circumstances, interacting with noisy data or
data from the environment
Where can neural network systems help?
• Where we can't formulate an algorithmic solution.
• Where we can get lots of examples of the behaviour we require.
• Where we need to pick out the structure from existing data.
What is a Neural Network?
            The question
     'What is a neural network?'
             is ill-posed.
          -- Pinkus (1999)

A method of computing, based on the
   interaction of multiple connected
         processing elements
    What is a neural network?

Neural networks are a form of
multiprocessor computer system, with
   simple processing elements
   a high degree of interconnection
   simple scalar messages
   adaptive interaction between elements
   Biological Motivation

• Biological Learning Systems are built of very
  complex webs of interconnected neurons.
• Information-Processing abilities of biological
  neural systems must follow from highly parallel
  processes operating on representations that are
  distributed over many neurons
• ANNs attempt to capture this mode of computation
 The biological inspiration

• The brain has been extensively studied by
• Vast complexity prevents all but rudimentary
• Even the behaviour of an individual neuron
  is extremely complex
 What can a Neural Net do?
Compute a known function
Approximate an unknown function
Pattern Recognition
Signal Processing

Learn to do any of the above
  Brain and Machine
• The Brain
  –   Pattern Recognition
  –   Association
  –   Complexity
  –   Noise Tolerance

                            • The Machine
                              – Calculation
                              – Precision
                              – Logic
    Features of the Brain

•   Ten billion ($10^{10}$) neurons
•   Neuron switching time > $10^{-3}$ secs
•   Face recognition ~0.1 secs
•   On average, each neuron has several thousand connections
•   Hundreds of operations per second
•   High degree of parallel computation
•   Distributed representations
•   Neurons die off frequently (never replaced)
•   Problems are compensated for by massive parallelism
             Basic Concepts
A Neural Network generally maps a set of inputs to a set of outputs

The number of inputs/outputs is variable

The Network itself is composed of an arbitrary number of nodes with an
arbitrary topology

[Figure: Input 0, Input 1, ..., Input n feed the Neural Network, which produces Output 0, Output 1, ..., Output m]
                      Basic Concepts
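Definition of a node:

• A node is an element which performs the function
    $$ y = f_H\left(\sum_i w_i x_i + W_b\right) $$

[Figure: a node with inputs Input 0, Input 1, ..., Input n, weights W0, W1, ..., Wn and bias weight Wb feeding a summation]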
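The node definition above can be sketched in a few lines of Python. This is only an illustration: the names `step` and `node_output` are made up here, and the threshold convention u(x) = 1 for x >= 0 is an assumption, since the slides do not say what happens at exactly zero.

```python
def step(x):
    """Linear threshold u(x). Assumption: returns 1 when x >= 0, else 0."""
    return 1.0 if x >= 0 else 0.0


def node_output(inputs, weights, bias_weight, activation=step):
    """Compute y = f_H(sum_i(w_i * x_i) + W_b) for a single node."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias_weight
    return activation(total)


# Weighted sum is 0.5*1.0 + (-0.3)*0.0 - 0.2 = 0.3, which is above the threshold.
y = node_output([1.0, 0.0], [0.5, -0.3], -0.2)
print(y)
```

Passing a different `activation` (for example a sigmoid) changes f_H without touching the weighted-sum part.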


         Simple Perceptron
Binary logic application
$f_H(x) = u(x)$ [linear threshold]
$W_i$ = random(-1, 1)
$Y = u(W_0 X_0 + W_1 X_1 + W_b)$

[Figure: perceptron with inputs Input 0 and Input 1, weights W0 and W1, and bias weight Wb]

Now how do we train it?
          Basic Training
Perceptron learning rule
   $\Delta W_i = \eta (D - Y) X_i$

η = Learning Rate
D = Desired Output

Adjust the weights based on how well the
current weights match an objective
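A single update worked by hand may help. All numbers are hypothetical: η = 0.5, starting weights (W0, W1, Wb) = (0.2, -0.4, -0.1), input (X0, X1) = (0, 1) and desired output D = 1. Two assumptions are made that the slide does not state: u(x) = 1 for x >= 0, and the bias weight is updated as if it had a constant input of 1.

```latex
\begin{align*}
Y &= u\bigl(0.2 \cdot 0 + (-0.4) \cdot 1 + (-0.1)\bigr) = u(-0.5) = 0 \\
\Delta W_0 &= 0.5\,(1 - 0) \cdot 0 = 0 \\
\Delta W_1 &= 0.5\,(1 - 0) \cdot 1 = 0.5 \\
\Delta W_b &= 0.5\,(1 - 0) \cdot 1 = 0.5 \\
(W_0,\ W_1,\ W_b) &\leftarrow (0.2,\ 0.1,\ 0.4)
\end{align*}
```

With the new weights the same input gives u(0.1 + 0.4) = 1, which matches D, so a second presentation of this pattern would leave the weights unchanged.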
            Logic Training
Expose the network to the logical OR operation

   X0   X1   D
   0    0    0
   0    1    1
   1    0    1
   1    1    1

Update the weights after each
As the output approaches the
desired output for all cases, ΔWi will
approach 0
[Figure: W0, W1 and Wb]
Network converges on a hyper-plane decision surface
X1 = -(W0/W1)X0 - (Wb/W1)
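The OR training described above could look roughly like this in Python. It is a sketch, not the presenter's code: the function names, the seed, η = 0.1 and the choice to update after every pattern are all assumptions. The decision line is obtained by setting W0·X0 + W1·X1 + Wb = 0 and solving for X1.

```python
import random


def step(x):
    """Linear threshold u(x). Assumption: 1 when x >= 0, else 0."""
    return 1 if x >= 0 else 0


def train_perceptron(samples, eta=0.1, max_epochs=1000, seed=0):
    """Train a two-input perceptron with Delta W_i = eta * (D - Y) * X_i."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(2)]   # W_i = random(-1, 1)
    wb = rng.uniform(-1, 1)                      # bias weight, constant input 1
    for _ in range(max_epochs):
        mistakes = 0
        for (x0, x1), d in samples:
            y = step(w[0] * x0 + w[1] * x1 + wb)
            if y != d:
                mistakes += 1
                w[0] += eta * (d - y) * x0
                w[1] += eta * (d - y) * x1
                wb += eta * (d - y)
        if mistakes == 0:        # every Delta W_i was 0 for a whole epoch
            break
    return w, wb


OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, wb = train_perceptron(OR_SAMPLES)
predictions = [step(w[0] * x0 + w[1] * x1 + wb) for (x0, x1), _ in OR_SAMPLES]

# Decision line from W0*X0 + W1*X1 + Wb = 0, solved for X1.
slope, intercept = -w[0] / w[1], -wb / w[1]
print(predictions, slope, intercept)
```

Because OR is linearly separable, the loop stops once an epoch passes with no weight change.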


 Typical Activation Functions
$$ F(x) = \frac{1}{1 + e^{-k \sum_i w_i x_i}} $$
Shown for k = 0.5, 1 and 10

Using a nonlinear function which approximates a linear
threshold allows a network to approximate
nonlinear functions
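To see how k controls the steepness, here is a small Python sketch of the sigmoid above. The function name `sigmoid` and the sample points are illustrative choices; `net` stands for the weighted sum ∑ w_i x_i.

```python
import math


def sigmoid(net, k=1.0):
    """F = 1 / (1 + exp(-k * net)), where net is the weighted sum of inputs."""
    return 1.0 / (1.0 + math.exp(-k * net))


# Larger k pushes the curve towards the hard threshold around net = 0.
for k in (0.5, 1.0, 10.0):
    print(k, [round(sigmoid(net, k), 3) for net in (-1.0, 0.0, 1.0)])
```

For every k the output at net = 0 is 0.5; only the slope around that point changes.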
  Back-Propagated Delta Rule
        Networks (BP)
Inputs are put through a 'Hidden Layer' before the output layer

All nodes connected between layers

[Figure: Input 0, Input 1, ..., Input n feed the Hidden Layer (H0, H1, ..., Hm), which feeds the output nodes (O0, O1, ..., Oo) producing Output 0, Output 1, ..., Output o]
         BP Network Details
Forward Pass:
   Error is calculated from outputs
   Used to update output weights
Backward Pass:
   Error at hidden nodes is calculated by back
    propagating the error at the outputs through
    the new weights
   Hidden weights updated
                     In Matrix Form
n inputs, m hidden nodes
and q outputs

$o_{lk}$ is the output of the $l$th neuron
for the $k$th of $p$ patterns

$v_k$ is the output of the hidden layer for pattern $k$
$o_k$ is the network's actual output vector for pattern $k$
               Matrix Tricks
$$ E(A, B) = \sum_{k=1}^{p} (t_k - o_k)^T (t_k - o_k) $$
$t_k$ denotes the true (target) output vectors

The optimal weight matrix $B$ can be computed
directly if $f_H^{-1}(t)$ is known
$$ B' = f_H^{-1}(t)\, v^T (v v^T)^{*} $$

So… $E(A, B) = E(A, B(A)) = E'(A)$
     Which makes our weight space much smaller
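One way to read the formula for B' is as a linear least-squares fit. The sketch below rests on assumptions the slide does not spell out: the output layer computes $o_k = f_H(B v_k)$, $f_H$ is invertible and applied elementwise, $V = [v_1 \dots v_p]$ and $T = [t_1 \dots t_p]$ collect the hidden outputs and targets column by column, and $(\cdot)^{*}$ denotes the Moore-Penrose pseudoinverse.

```latex
\begin{align*}
f_H(B v_k) = t_k \ \text{ for all } k
  &\iff B V = f_H^{-1}(T) \\
B' &= \arg\min_B \left\lVert f_H^{-1}(T) - B V \right\rVert_F^2
    = f_H^{-1}(T)\, V^T \left( V V^T \right)^{*}
\end{align*}
```

Under these assumptions the fit is done before the nonlinearity, so it matches the minimum of E exactly only when the targets can be reproduced exactly; otherwise it is an approximation.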
  Backpropagation: Purpose
     and Implementation
  Purpose: To compute the weights of a
  feedforward multilayer neural network
  adaptively, given a set of labeled
  training examples.
  Method: By minimizing the following cost
  function (the sum of squared errors):
  $$ E = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left[ y_k^n - f_k(x^n) \right]^2 $$

where N is the total number of training examples, K is
  the total number of output units (useful for multiclass
  problems) and $f_k$ is the function implemented by the
  neural net
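As a concrete reading of the cost function, the following Python sketch computes E for a made-up set of N = 2 examples with K = 2 outputs each. The function name and the numbers are hypothetical; `outputs` plays the role of $f_k(x^n)$.

```python
def sum_squared_error(targets, outputs):
    """E = 1/2 * sum over examples n and output units k of (y_k^n - f_k(x^n))^2."""
    return 0.5 * sum(
        (y - f) ** 2
        for y_row, f_row in zip(targets, outputs)
        for y, f in zip(y_row, f_row)
    )


targets = [[1.0, 0.0], [0.0, 1.0]]   # y_k^n, one row per training example
outputs = [[0.8, 0.1], [0.3, 0.6]]   # f_k(x^n), what the network produced
error = sum_squared_error(targets, outputs)
print(error)
```

The squared differences here are 0.04, 0.01, 0.09 and 0.16, so E is half of 0.30, up to floating-point rounding.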
Backpropagation: Overview
Backpropagation works by applying the gradient
descent rule to a feedforward network.
The algorithm is composed of two parts that get
repeated over and over until a pre-set maximal
number of epochs, EPmax, is reached.
Part I, the feedforward pass: the activation
values of the hidden and then output units are
computed.
Part II, the backpropagation pass: the weights of
the network are updated, starting with the hidden
to output weights and followed by the input to
hidden weights, with respect to the sum of
squares error and through a series of weight
update rules.
Backpropagation: The Delta Rule
 For the hidden to output connections
 (easy case)
 $$ \Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}} $$
 $$ = \eta \sum_{n=1}^{N} \left[ y_k^n - f_k(x^n) \right] g'(h_k^n)\, V_j^n $$
 $$ = \eta \sum_{n=1}^{N} \delta_k^n V_j^n $$

       • $\eta$ corresponding to the learning rate
       (an extra parameter of the neural net)
       • $h_k^n = \sum_{j=0}^{M} w_{kj} V_j^n$
       • $V_j^n = g\left(\sum_{i=0}^{d} w_{ji} x_i^n\right)$ and
       • $\delta_k^n = g'(h_k^n)\,(y_k^n - f_k(x^n))$
       M is the number of hidden units and d the number of input units
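For one training pattern, the output-layer rule δ_k = g'(h_k)(y_k - f_k(x)) and Δw_kj = η δ_k V_j can be sketched as follows. The sigmoid choice for g, the function names and the numbers are assumptions made for illustration; V[0] = 1 is taken to be the bias unit implied by the sum starting at j = 0.

```python
import math


def g(h):
    """Activation function. Assumption: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-h))


def g_prime(h):
    """Derivative of the sigmoid: g(h) * (1 - g(h))."""
    s = g(h)
    return s * (1.0 - s)


def output_deltas(w_out, V, y):
    """delta_k = g'(h_k) * (y_k - g(h_k)), with h_k = sum_j w_kj * V_j."""
    deltas = []
    for w_k, y_k in zip(w_out, y):
        h_k = sum(w_kj * V_j for w_kj, V_j in zip(w_k, V))
        deltas.append(g_prime(h_k) * (y_k - g(h_k)))
    return deltas


def output_weight_changes(w_out, V, y, eta=0.5):
    """Delta w_kj = eta * delta_k * V_j for a single pattern."""
    return [[eta * d_k * V_j for V_j in V] for d_k in output_deltas(w_out, V, y)]


V = [1.0, 0.5]            # V[0] = 1 is the bias unit, V[1] a hidden activation
w_out = [[0.0, 0.0]]      # one output unit, weights start at zero
changes = output_weight_changes(w_out, V, y=[1.0])
print(changes)
```

Summing these per-pattern changes over all N patterns gives the batch form written above.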
Backpropagation: The Delta Rule II
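For the input to hidden connections
(hard case: no pre-fixed values for the hidden units)
$$ \Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} $$
$$ = -\eta \sum_{n=1}^{N} \frac{\partial E}{\partial V_j^n} \frac{\partial V_j^n}{\partial w_{ji}} \quad \text{(Chain Rule)} $$
$$ = \eta \sum_{k,n} \left[ y_k^n - f_k(x^n) \right] g'(h_k^n)\, w_{kj}\, g'(h_j^n)\, x_i^n $$
$$ = \eta \sum_{k,n} \delta_k^n w_{kj}\, g'(h_j^n)\, x_i^n $$
$$ = \eta \sum_{n=1}^{N} \delta_j^n x_i^n \quad \text{with} $$
           • $h_j^n = \sum_{i=0}^{d} w_{ji} x_i^n$
           • $\delta_j^n = g'(h_j^n) \sum_{k=1}^{K} w_{kj} \delta_k^n$
           • and all the other quantities already defined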
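The hidden-layer rule δ_j = g'(h_j) ∑_k w_kj δ_k, followed by Δw_ji = η δ_j x_i, can be sketched for one pattern like this. The sigmoid derivative, the function names and the numbers are illustrative assumptions; for simplicity `w_out[k][j]` here indexes only real hidden units, with no bias column.

```python
import math


def g_prime(h):
    """Derivative of the logistic sigmoid (assumed activation)."""
    s = 1.0 / (1.0 + math.exp(-h))
    return s * (1.0 - s)


def hidden_deltas(w_out, h_hidden, delta_out):
    """delta_j = g'(h_j) * sum_k(w_kj * delta_k): output error sent backward."""
    return [
        g_prime(h_j) * sum(w_out[k][j] * delta_out[k] for k in range(len(delta_out)))
        for j, h_j in enumerate(h_hidden)
    ]


def hidden_weight_changes(deltas, x, eta=0.5):
    """Delta w_ji = eta * delta_j * x_i for a single pattern."""
    return [[eta * d_j * x_i for x_i in x] for d_j in deltas]


w_out = [[1.0, -2.0]]       # one output unit connected to two hidden units
h_hidden = [0.0, 0.0]       # net inputs h_j of the two hidden units
delta_out = [0.125]         # delta_k already computed at the output layer
d_hid = hidden_deltas(w_out, h_hidden, delta_out)
changes = hidden_weight_changes(d_hid, x=[1.0, 2.0])
print(d_hid, changes)
```

Each hidden unit receives a share of the output error in proportion to the weight that connects it to the output.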
              BP: The Algorithm
1. Initialize the weights to small random values; create a random
   pool of all the training patterns; set EP, the number of epochs of
   training, to 0.
2. Pick a training pattern from the remaining pool of patterns and
   propagate it forward through the network.
3. Compute the deltas, $\delta_k$, for the output layer.

4. Compute the deltas, $\delta_j$, for the hidden layer by propagating the
   error backward.
5. Update all the connections such that
    $w_{ji}^{New} = w_{ji}^{Old} + \Delta w_{ji}$ and $w_{kj}^{New} = w_{kj}^{Old} + \Delta w_{kj}$

6. If any pattern remains in the pool, then go back to Step 2. If all
   the training patterns in the pool have been used, then set EP =
   EP+1, and if EP < EPMax, then create a random pool of patterns
   and go to Step 2. If EP = EPMax, then stop.
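The six steps above can be put together in a short pure-Python sketch. Everything specific is an assumption made for illustration: the sigmoid activation, three hidden units, η = 0.5, 2000 epochs, the seed, the XOR data and the function names. Weights are updated after every pattern, as in Step 5.

```python
import math
import random


def g(h):
    """Logistic sigmoid (assumed activation); note g'(h) = g(h) * (1 - g(h))."""
    return 1.0 / (1.0 + math.exp(-h))


def forward(x, w_hid, w_out):
    """Feedforward pass; index 0 of xb and V is a constant 1 acting as bias."""
    xb = [1.0] + list(x)
    V = [1.0] + [g(sum(w * xi for w, xi in zip(w_j, xb))) for w_j in w_hid]
    out = [g(sum(w * v for w, v in zip(w_k, V))) for w_k in w_out]
    return xb, V, out


def total_error(patterns, w_hid, w_out):
    """Sum-of-squares error E over all patterns."""
    return 0.5 * sum(
        (y_k - o_k) ** 2
        for x, y in patterns
        for y_k, o_k in zip(y, forward(x, w_hid, w_out)[2])
    )


def train(patterns, n_hidden=3, eta=0.5, ep_max=2000, seed=1):
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # Step 1: small random weights (one extra weight per unit for the bias).
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(ep_max):                       # Step 6: one pass = one epoch
        pool = patterns[:]
        rng.shuffle(pool)                         # random pool of patterns
        for x, y in pool:                         # Step 2: pick and propagate
            xb, V, out = forward(x, w_hid, w_out)
            # Step 3: output deltas.
            d_out = [o * (1 - o) * (y_k - o) for o, y_k in zip(out, y)]
            # Step 4: hidden deltas, using the not-yet-updated output weights.
            d_hid = [
                V[j + 1] * (1 - V[j + 1])
                * sum(w_out[k][j + 1] * d_out[k] for k in range(n_out))
                for j in range(n_hidden)
            ]
            # Step 5: update all connections.
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    w_out[k][j] += eta * d_out[k] * V[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hid[j][i] += eta * d_hid[j] * xb[i]
    return w_hid, w_out


XOR = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
w_hid, w_out = train(XOR)
print(total_error(XOR, w_hid, w_out))
```

Gradient descent on XOR can stall in a poor region of weight space, so the final error depends on the seed and on η; calling `train` with `ep_max=0` returns the initial weights for comparison.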
Hybrid LS RS/SA/GA Training
Delta rule training may converge to a local
minimum

Hybrid Global Learning (HGL) will
converge on a global minimum
   Randomize A in [-0.5, 0.5]
   Minimize the Error function E'(A)
         BP: The Momentum
 To this point, Backpropagation has the
 disadvantage of being too slow if $\eta$ is small, and
 it can oscillate too widely if $\eta$ is large.
 To solve this problem, we can add a
 momentum term to give each connection some
 inertia, forcing it to change in the direction of
 the downhill "force".
 New Delta Rule:
         $$ \Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha\, \Delta w_{pq}(t) $$
where p and q are any input and hidden, or hidden and
output, units; t is a time step or epoch; and $\alpha$ is the
momentum parameter, which regulates the amount of
inertia of the weights.
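The momentum rule can be tried on a toy problem. In this sketch the error is the hypothetical one-weight function E(w) = w², whose gradient is 2w; the function name, η = 0.1 and a momentum parameter of 0.9 are illustrative choices, not values from the slides.

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t); returns (new w, Delta w)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw


# Minimise the toy error E(w) = w**2, whose gradient is 2 * w.
w, dw = 5.0, 0.0
for t in range(200):
    w, dw = momentum_step(w, 2.0 * w, dw)
print(w)
```

The previous change is carried over into the next one, so consecutive steps in the same direction build up speed while steps in alternating directions partly cancel.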
              Other methods
Simulated Annealing
   More accurate results
   Much slower
Genetic Algorithms
   More accurate results
   Slower

For details on methods and results see:
   S. Cho, T. Chow, C. Leung, "A neural-based crowd
    estimation by hybrid global learning algorithm",
    IEEE Transactions on Systems, Man and Cybernetics,
    Part B, pp. 535-541
Alternative Activation functions
Radial Basis
    Square
    Triangle
    Gaussian!

(μ, σ) can be varied at each hidden node to guide training

[Figure: Input 0, Input 1, ..., Input n feed hidden nodes fRBF(x), which feed output nodes fH(x)]
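A Gaussian radial basis node with centre μ and width σ might be written as below. The exact form exp(-‖x - μ‖² / (2σ²)) and the function name are assumptions, since the slide names the Gaussian but does not give its formula.

```python
import math


def gaussian_rbf(x, mu, sigma):
    """Gaussian radial basis response: exp(-||x - mu||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# The response is 1 at the centre and falls off with distance from mu;
# a larger sigma makes the node respond over a wider region.
print(gaussian_rbf([1.0, 2.0], [1.0, 2.0], 0.5))
print(gaussian_rbf([0.0, 0.0], [1.0, 0.0], 0.5))
print(gaussian_rbf([0.0, 0.0], [1.0, 0.0], 2.0))
```

Moving μ shifts where the node responds and changing σ widens or narrows that region, which is how the two parameters can guide training.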
       Alternate Topologies
Inputs analyze signal
at multiple points in

RBF functions may be
used to select a
'window' in the input
      Typical Topologies
Set of inputs
Set of hidden nodes
Set of outputs
Too many nodes makes network hard to
Supervised Vs. Unsupervised
Previously discussed networks are 'supervised'
   Need to be trained ahead of time with lots of data
Unsupervised networks adapt to the input
   Applications in clustering and reducing dimensionality
   Learning may be very slow
       Self Organizing Maps
The basic Self-Organizing Map (SOM) can
 be visualized as a sheet-like neural-
 network array, the cells (or nodes) of
 which become specifically tuned to various
 input signal patterns or classes of patterns
 in an orderly fashion.
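A very small one-dimensional SOM can illustrate the tuning described above. This is a simplified sketch under stated assumptions: a row of five nodes rather than a 2-D sheet, a Gaussian neighbourhood, and a fixed learning rate and radius (practical SOMs usually shrink both over time). All names and numbers are hypothetical.

```python
import math
import random


def train_som(data, n_nodes=5, epochs=50, eta=0.3, radius=1.0, seed=0):
    """Train a 1-D row of nodes; each node holds one weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for _ in range(epochs):
        for x in data:
            # Best-matching unit: the node whose weights are closest to x.
            bmu = min(
                range(n_nodes),
                key=lambda j: sum((w - xi) ** 2 for w, xi in zip(nodes[j], x)),
            )
            for j in range(n_nodes):
                # Nodes near the winner on the row move more than distant ones.
                influence = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                nodes[j] = [w + eta * influence * (xi - w) for w, xi in zip(nodes[j], x)]
    return nodes


data = [[0.1, 0.1], [0.15, 0.2], [0.9, 0.8], [0.85, 0.95]]
nodes = train_som(data)
for node in nodes:
    print([round(w, 2) for w in node])
```

No target outputs are used anywhere: the nodes adapt to the inputs alone, which is what makes this an unsupervised method.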
          Current Applications
Investment Analysis
   Predicting movement of stocks
   Replacing earlier linear models
Signature Analysis
   Bank Checks, VISA, etc.
Process Control
   Chemistry related
   Sensor networks may gather more data than can be processed
    by operators
   Inputs: Cues from camera data, vibration levels, sound, radar,
    lidar, etc.
   Output: Number of people at a terminal, engine warning light,
    control for light switch
        How to Go About it?
Web Resources
   Neural Networks: Simon Haykin
   Neural Networks and Fuzzy Logic: B.Kosko
   Building Neural Networks: Skapura
   Neural Networks for Pattern Recognition: ??
        How to Go About it?
   Very Strong Background in Mathematics. No
    Biology Required at all.
   Linear Algebra, Calculus, Probability and
    even Fourier Series to a certain extent.
   For ECE guys: Signals and Systems, DSP.
[1] L. Smith, ed. (1996, 2001), "An Introduction to Neural
Networks", URL:
[2] Sarle, W.S., ed. (1997), Neural Network FAQ, URL:
[3] StatSoft, "Neural Networks", URL:
[4] S. Cho, T. Chow, and C. Leung, "A Neural-Based
Crowd Estimation by Hybrid Global Learning Algorithm",
IEEE Transactions on Systems, Man and Cybernetics,
Part B, No. 4. 1999.
Questions, if Any?