					Neural Networks

   Tuomas Sandholm
   Carnegie Mellon University
  Computer Science Department
     How the brain works
Synaptic connections exhibit long-term changes in
the connection strengths based on patterns seen
Comparing brains with digital computers

       Graceful degradation
       Inductive learning
       (synchronous/asynchronous)

             Notation
Single unit (neuron) of an
 artificial neural network

      in_i = Σ_j W_{j,i} a_j
Activation Functions

a_i = step_t(Σ_{j=1}^{n} W_{j,i} a_j) = step_0(Σ_{j=0}^{n} W_{j,i} a_j)

where W_{0,i} = t and a_0 = -1 (fixed)
Boolean gates can be simulated
 by units with a step function

       AND:  W = 1, W = 1,  t = 1.5
       OR:   W = 1, W = 1,  t = 0.5
       NOT:  W = -1,        t = -0.5

              g is a step function
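The three gate constructions can be checked directly in Python (a minimal sketch; the step_t unit outputs 1 iff its weighted input sum reaches the threshold t):

```python
# Sketch of the AND/OR/NOT constructions above, using step-function units.
def step(t, weighted_sum):
    """step_t: outputs 1 iff the weighted input sum reaches threshold t."""
    return 1 if weighted_sum >= t else 0

def AND(a, b):                    # W = 1, W = 1, t = 1.5
    return step(1.5, 1 * a + 1 * b)

def OR(a, b):                     # W = 1, W = 1, t = 0.5
    return step(0.5, 1 * a + 1 * b)

def NOT(a):                       # W = -1, t = -0.5
    return step(-0.5, -1 * a)
```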
         Feed-forward vs. recurrent

Recurrent networks have state (activations from previous
time steps have to be remembered): Short-term memory.
             Hopfield network
• Bidirectional symmetric (W_{i,j} = W_{j,i}) connections
• g is the sign function
• All units are both input and output units
• Activations are ±1

“Associative memory”

After training on a set of examples, a new stimulus will
cause the network to settle into an activation pattern
corresponding to the example in the training set that most
closely resembles the new stimulus.

E.g. parts of photograph

Thm. Can reliably store 0.138 * (#units) training examples.
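Associative recall can be sketched in a few lines; the Hebbian training rule and the synchronous update schedule below are illustrative choices (the slide only fixes symmetric weights, sign activation, and ±1 activations):

```python
# Minimal sketch of Hopfield-style associative memory with +/-1 activations.
def train(patterns):
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:                       # Hebbian rule: W_ij += p_i * p_j
        for i in range(n):
            for j in range(n):
                if i != j:                   # no self-connections
                    W[i][j] += p[i] * p[j]
    return W                                 # symmetric: W_ij = W_ji

def recall(W, state, steps=10):
    n = len(state)
    s = list(state)
    for _ in range(steps):                   # settle into a stored pattern
        s = [1 if sum(W[i][j] * s[j] for j in range(n)) >= 0 else -1
             for i in range(n)]
    return s
```

Starting from a corrupted version of a stored pattern, the network settles back to the pattern it most closely resembles.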
             Boltzmann machine
• Symmetric weights
• Each output is 0 or 1
• Includes units that are neither input units nor output units
• Stochastic g, i.e. some probability (as a fn of ini) that g=1

State transitions that resemble simulated annealing.

Approximates the configuration that best meets the training set.
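The stochastic g can be sketched as follows; the logistic form of the probability and the temperature parameter T are assumptions in the spirit of simulated annealing, not taken from the slide:

```python
import math
import random

# Illustrative sketch of a stochastic unit: P(g = 1) is taken to be a
# logistic function of in_i, with a temperature T that would be lowered
# over time as in simulated annealing.
def stochastic_output(in_i, T=1.0):
    p_one = 1.0 / (1.0 + math.exp(-in_i / T))   # probability that g = 1
    return 1 if random.random() < p_one else 0
```

At low temperature the unit behaves almost deterministically; at high temperature it flips more freely, which is what lets the state transitions escape poor configurations.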
Learning in ANNs is the process of tuning the weights

Form of nonlinear regression.
                    ANN topology
Representation capability vs. overfitting risk.

A feed-forward net with one hidden layer can approximate any
continuous fn of the inputs.

With 2 hidden layers it can approximate any fn at all.

The #units needed in each layer may grow exponentially

         Learning the topology
         Hill-climbing vs. genetic algorithms vs. …
         Removing vs. adding (nodes/connections).
         Compare candidates via cross-validation.
                O  step0 (W j I j )

              Implementable with one output unit
Majority fn
              Decision tree requires O(2n) nodes
   Representation capability of a perceptron
Every input can only affect the output in one direction
independent of other inputs.
E.g. unable to represent WillWait in the restaurant example.

Perceptrons can only represent linearly separable fns.
For a given problem, does one know in advance whether it is
linearly separable?
Linear separability in 3D
       Minority Function
   Learning linearly separable functions
                 Training examples used over and over!

                                                                 Err = T-O

                     W_j ← W_j + α * I_j * Err
   Variant of perceptron learning rule.
   Thm. Will learn the linearly separable target fn (if α is not too high).
   Intuition: gradient descent in a search space with no local optima
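The rule can be sketched in Python on the 3-input majority function from earlier; α and the epoch count are illustrative choices:

```python
from itertools import product

# Sketch of the perceptron learning rule:
#   W_j <- W_j + alpha * I_j * Err, with Err = T - O,
# trained on the 3-input majority function.
def majority(bits):
    return 1 if sum(bits) >= 2 else 0

def predict(W, inputs):
    I = [-1] + list(inputs)            # fixed a_0 = -1 absorbs the threshold t
    return 1 if sum(w * x for w, x in zip(W, I)) >= 0 else 0

def train_perceptron(examples, alpha=0.1, epochs=25):
    n = len(examples[0][0])
    W = [0.0] * (n + 1)                # W[0] is the threshold weight
    for _ in range(epochs):            # training examples used over and over
        for inputs, target in examples:
            I = [-1] + list(inputs)
            err = target - predict(W, inputs)
            W = [w + alpha * x * err for w, x in zip(W, I)]
    return W

examples = [(bits, majority(bits)) for bits in product([0, 1], repeat=3)]
W = train_perceptron(examples)
```

Since majority is linearly separable, the learned weights classify all eight inputs correctly.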
  Encoding for ANNs
E.g. #patrons can be none, some or full

   Local encoding:
   None=0.0, Some=0.5, Full=1.0

   Distributed encoding:
   None           1      0       0
   Some           0      1       0
   Full           0      0       1
Majority Function
Multilayer feedforward networks
     Structural credit assignment problem

                               Back propagation algorithm
                               (again, Erri=Ti-Oi)
                               Updating between hidden & output units.

                                      W_{j,i} ← W_{j,i} + α * a_j * Err_i * g'(in_i)

      Updating between input & hidden units:

       Err_j = Σ_i W_{j,i} * Err_i * g'(in_i)        (back-propagation of the error)

       W_{k,j} ← W_{k,j} + α * I_k * Err_j * g'(in_j)
      Back propagation (BP) as
       gradient descent search
A way of localizing the computation of the gradient to units.
                               E = (1/2) Σ_i (T_i - O_i)^2

                               E(w) = (1/2) Σ_i (T_i - g(Σ_j W_{j,i} a_j))^2

                                    = (1/2) Σ_i (T_i - g(Σ_j W_{j,i} g(Σ_k W_{k,j} I_k)))^2

                               ∂E/∂W_{j,i} = -a_j (T_i - O_i) * g'(Σ_j W_{j,i} a_j)

                                           = -a_j (T_i - O_i) * g'(in_i)

                               For hidden units we get

                               ∂E/∂W_{k,j} = -I_k * g'(in_j) * Σ_i W_{j,i} * Err_i * g'(in_i)

                                           = -I_k * g'(in_j) * Err_j
Observations on BP as gradient descent

 1. Minimize error  move in opposite direction of gradient

 2. g needs to be differentiable
     Cannot use sign fn or step fn
     Use e.g. the sigmoid, for which g' = g(1 - g)

 3. Gradient taken wrt. one training example at a time
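The updates can be sketched for one hidden layer with sigmoid g (so g' = g(1-g)), taking the gradient with respect to one training example at a time. The network size, learning rate, bias convention (a bias input of 1 per unit), and the XOR task are illustrative choices, not from the slides:

```python
import math
import random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))    # sigmoid, g' = g * (1 - g)

def train_xor(alpha=0.5, epochs=2000, seed=0):
    rnd = random.Random(seed)
    H = 2                                          # hidden units
    Wkj = [[rnd.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
    bj = [rnd.uniform(-1, 1) for _ in range(H)]    # hidden biases
    Wji = [rnd.uniform(-1, 1) for _ in range(H)]   # hidden -> output
    bi = rnd.uniform(-1, 1)                        # output bias
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    errors = []                                    # total squared error per epoch
    for _ in range(epochs):
        sq_err = 0.0
        for I, T in data:                          # one example at a time
            in_j = [bj[j] + sum(Wkj[j][k] * I[k] for k in range(2))
                    for j in range(H)]
            a_j = [g(x) for x in in_j]
            in_i = bi + sum(Wji[j] * a_j[j] for j in range(H))
            O = g(in_i)
            sq_err += (T - O) ** 2
            delta_i = (T - O) * O * (1 - O)        # Err_i * g'(in_i)
            # Err_j * g'(in_j), with Err_j = W_{j,i} * Err_i * g'(in_i)
            delta_j = [Wji[j] * delta_i * a_j[j] * (1 - a_j[j])
                       for j in range(H)]
            for j in range(H):                     # W_{j,i} += alpha*a_j*Err_i*g'(in_i)
                Wji[j] += alpha * a_j[j] * delta_i
            bi += alpha * delta_i
            for j in range(H):                     # W_{k,j} += alpha*I_k*Err_j*g'(in_j)
                for k in range(2):
                    Wkj[j][k] += alpha * I[k] * delta_j[j]
                bj[j] += alpha * delta_j[j]
        errors.append(sq_err)
    return errors
```

Moving against the gradient drives the per-epoch squared error down over training.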
ANN learning curve on the WillWait problem
            Expressiveness of BP

2^n/n hidden units are needed to represent arbitrary Boolean fns of n inputs
        (such a network has O(2^n) weights, and we need at least
        2^n bits to represent a Boolean fn)

Thm. Any continuous fn f: [0,1]^n -> R^m can be implemented in a
3-layer network with 2n+1 hidden units (activation fns take a
special form). [Kolmogorov]
            Efficiency of BP

Using the trained network is fast

Training is slow
       An epoch takes O(m |W|) time (m training examples, |W| weights)
       May need exponentially many epochs in #inputs
                  More on BP…
      Good on fns where output varies smoothly with input

Sensitivity to noise:
        Very tolerant of noise
        Does not give a degree of certainty in the output

       Black box

Prior knowledge:
        Hard to “prime”

No convergence guarantees
Summary of representation capabilities
 (model class) of different supervised
          learning methods
    3-layer feedforward ANN
    Decision Tree
    K-Nearest neighbor
    Version space
