Learning with Neural Networks

Artificial Intelligence
CMSC 25000
February 19, 2002
Agenda
• Neural Networks:
  – Biological analogy
• Review: single-layer perceptrons
  – Perceptron: Pros & Cons
• Neural Networks: Multilayer perceptrons
  – Neural net training: Backpropagation
  – Strengths & Limitations
• Conclusions
Neurons: The Concept

[Figure: a neuron, showing dendrites, cell body, nucleus, and axon]

• Neurons: receive inputs from other neurons (via synapses)
  – When the input exceeds a threshold, the neuron “fires”
  – It sends output along its axon to other neurons
• Brain: 10^11 neurons, 10^16 synapses
Perceptron Structure

[Figure: a single perceptron; inputs x0 = -1, x1, x2, x3, ..., xn enter with weights w0, w1, w2, w3, ..., wn and feed the output y]

Single neuron-like element
  – Binary inputs & output
  – Fires when the weighted sum of its inputs exceeds the threshold:

      y = 1 if \sum_{i=0}^{n} w_i x_i > 0, and y = 0 otherwise

  – The fixed input x0 = -1 with weight w0 compensates for the threshold

Perceptron training rule:
  Until the perceptron gives the correct output for all training samples:
    – If the perceptron is correct, do nothing
    – If the perceptron is wrong:
        • If it incorrectly says “yes”, subtract the input vector from the weight vector
        • Otherwise, add the input vector to the weight vector
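A minimal Python sketch of this training rule (not from the slides; the function name, sample format, and epoch cap are illustrative assumptions):

def perceptron_train(samples, n, max_epochs=100):
    # samples: list of (inputs, target) pairs; inputs has n entries, target is 0 or 1
    w = [0.0] * (n + 1)                          # w[0] plays the role of the threshold
    for _ in range(max_epochs):
        all_correct = True
        for inputs, target in samples:
            x = [-1.0] + list(inputs)            # prepend the fixed x0 = -1 input
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if y == target:
                continue                         # correct: do nothing
            all_correct = False
            if y == 1:                           # incorrectly said "yes": subtract input
                w = [wi - xi for wi, xi in zip(w, x)]
            else:                                # incorrectly said "no": add input
                w = [wi + xi for wi, xi in zip(w, x)]
        if all_correct:
            break                                # converged (guaranteed if linearly separable)
    return w

For example, samples = [((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1)] learns AND, but no weight vector satisfies the XOR truth table, as the following slides show.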
Perceptron Learning
• Perceptrons learn linear decision boundaries
  – E.g., in the (x1, x2) plane, a single line can separate the “+” examples from the “0” examples

[Figure: left, a linearly separable data set split by a line; right, XOR, where no single line separates the two classes]

• Guaranteed to converge, if the data are linearly separable
• Many simple functions (e.g., XOR) are NOT learnable
Neural Nets
• Multi-layer perceptrons
  – Inputs: real-valued
  – Intermediate “hidden” nodes
  – Output(s): one (or more) discrete-valued

[Figure: feed-forward network; inputs X1, X2, X3, X4 pass through two hidden layers to outputs Y1 and Y2]
                Neural Nets
• Pro: More general than perceptrons
  – Not restricted to linear discriminants
  – Multiple outputs: one classification each
• Con: No simple, guaranteed training
  procedure
  – Use greedy, hill-climbing procedure to train
  – “Gradient descent”, “Backpropagation”
Solving the XOR Problem

Network topology: 2 hidden nodes (o1, o2), 1 output (y)

[Figure: inputs x1 and x2 feed hidden nodes o1 and o2 through weights w11, w21, w12, w22; o1 and o2 feed the output node y through w13 and w23; each node also has a fixed -1 input with bias weight w01, w02, or w03]

Desired behavior:

  x1  x2  o1  o2  y
   0   0   0   0  0
   1   0   0   1  1
   0   1   0   1  1
   1   1   1   1  0

Weights:
  w11 = w12 = 1
  w21 = w22 = 1
  w01 = 3/2;  w02 = 1/2;  w03 = 1/2
  w13 = -1;   w23 = 1
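As a quick check, a small Python sketch (illustrative, not from the slides) plugs the weight table above into step-threshold units and prints the XOR truth table:

def step(z):
    return 1 if z > 0 else 0

w11 = w12 = 1.0
w21 = w22 = 1.0
w01, w02, w03 = 1.5, 0.5, 0.5
w13, w23 = -1.0, 1.0

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    o1 = step(w11 * x1 + w21 * x2 - w01)   # hidden node 1: fires only for (1, 1)
    o2 = step(w12 * x1 + w22 * x2 - w02)   # hidden node 2: fires when x1 or x2 is 1
    y = step(w13 * o1 + w23 * o2 - w03)    # output: o2 AND NOT o1 == XOR(x1, x2)
    print(x1, x2, o1, o2, y)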
              Backpropagation
• Greedy, Hill-climbing procedure
  – Weights are parameters to change
  – Original hill-climb changes one parameter/step
     • Slow
  – If smooth function, change all parameters/step
     • Gradient descent
        – Backpropagation: Computes current output, works
          backward to correct error
  Producing a Smooth Function
• Key problem:
  – Pure step threshold is discontinuous
     • Not differentiable
• Solution:
  – Sigmoid (squashed ‘s’ function): the logistic function

      z = \sum_{i} w_i x_i,        s(z) = \frac{1}{1 + e^{-z}}
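A minimal Python sketch of the logistic function and its derivative; the derivative, s(z)(1 - s(z)), is the quantity backpropagation relies on later in these notes:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)    # ds/dz = s(z) * (1 - s(z))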
            Neural Net Training
• Goal:
  – Determine how to change weights to get correct
    output
     • Large change in weight to produce large reduction
       in error
• Approach:
     •   Compute actual output: o
     •   Compare to desired output: d
     •   Determine effect of each weight w on error = d-o
     •   Adjust weights
Neural Net Example

[Figure: a 2-2-1 network from the MIT 6.034 notes (Lozano-Perez); inputs x1, x2 feed hidden units with sums z1, z2 and outputs y1, y2 through weights w11, w21, w12, w22; the hidden outputs feed the output unit with sum z3 and output y3 through w13, w23; each unit has a fixed -1 input with bias weight w01, w02, or w03]

x_i: i-th sample input vector
w: weight vector
y_i^*: desired output for the i-th sample

Sum-of-squares error over the training samples:

    E = \frac{1}{2} \sum_i (y_i^* - F(x_i, w))^2

Full expression of the output in terms of the inputs and weights:

    z_1 = w_{11} x_1 + w_{21} x_2 - w_{01}
    z_2 = w_{12} x_1 + w_{22} x_2 - w_{02}
    z_3 = w_{13} s(z_1) + w_{23} s(z_2) - w_{03}

    y_3 = F(x, w) = s(z_3) = s( w_{13} s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23} s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03} )
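The expression above translates directly into a forward pass. A sketch in Python, assuming the weights are kept in a dictionary keyed by the names in the figure (an illustrative choice, not from the notes):

import math

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    # w: dict of the weights named in the figure; each bias enters via a -1 input
    z1 = w["w11"] * x1 + w["w21"] * x2 - w["w01"]
    z2 = w["w12"] * x1 + w["w22"] * x2 - w["w02"]
    y1, y2 = s(z1), s(z2)
    z3 = w["w13"] * y1 + w["w23"] * y2 - w["w03"]
    return y1, y2, s(z3)     # hidden outputs and the network output y3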
               Gradient Descent
• Error: Sum of squares error of inputs with
  current weights
• Compute rate of change of error wrt each
  weight
  – Which weights have greatest effect on error?
  – Effectively, partial derivatives of error wrt weights
     • In turn, depend on other weights => chain rule
Gradient Descent

[Figure: error E = G(w) plotted against a weight w, with the slope dG/dw marked and a local minimum between w0 and w1]

• E = G(w)
  – Error as a function of the weights
• Find the rate of change of the error
  – Follow the steepest rate of change
  – Change weights s.t. error is minimized
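A minimal one-dimensional sketch of gradient descent, with a made-up error function G(w) = (w - 3)^2 purely for illustration:

def gradient_descent(dG_dw, w, rate=0.1, steps=100):
    for _ in range(steps):
        w = w - rate * dG_dw(w)    # step against the gradient (steepest descent)
    return w

# Example: G(w) = (w - 3)^2, so dG/dw = 2(w - 3); descent converges toward w = 3
w_min = gradient_descent(lambda w: 2 * (w - 3), w=0.0)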
Gradient of Error

    E = \frac{1}{2} \sum_i (y_i^* - F(x_i, w))^2

    y_3 = F(x, w) = s( w_{13} s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23} s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03} ) = s(z_3)

For each weight w_j:

    \frac{\partial E}{\partial w_j} = -(y^* - y_3) \frac{\partial y_3}{\partial w_j}

Note: derivative of the sigmoid:

    \frac{d s(z)}{d z} = s(z)(1 - s(z))

For a weight into the output unit:

    \frac{\partial y_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3} \cdot s(z_1) = \frac{\partial s(z_3)}{\partial z_3} \cdot y_1

For a weight into a hidden unit, the chain rule reaches back one more layer:

    \frac{\partial y_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3} \cdot w_{13} \cdot \frac{\partial s(z_1)}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3} \cdot w_{13} \cdot \frac{\partial s(z_1)}{\partial z_1} \cdot x_1

[Same 2-2-1 network as on the previous slide; from the MIT AI (6.034) lecture notes, Lozano-Perez, 2000]
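The chain-rule expressions above can be computed directly. A Python sketch for the same 2-2-1 network (the weight dictionary and function names are illustrative assumptions):

import math

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

def dy3_gradients(x1, x2, w):
    # Forward pass through the same 2-2-1 network
    y1 = s(w["w11"] * x1 + w["w21"] * x2 - w["w01"])
    y2 = s(w["w12"] * x1 + w["w22"] * x2 - w["w02"])
    y3 = s(w["w13"] * y1 + w["w23"] * y2 - w["w03"])

    ds3 = y3 * (1.0 - y3)                    # ds(z3)/dz3 = s(z3)(1 - s(z3))
    ds1 = y1 * (1.0 - y1)                    # ds(z1)/dz1
    dy3_dw13 = ds3 * y1                      # output-layer weight: dz3/dw13 = y1
    dy3_dw11 = ds3 * w["w13"] * ds1 * x1     # hidden-layer weight: chain rule, one layer deeper
    # For one sample, dE/dw_j = -(y_star - y3) * dy3_dw_j
    return dy3_dw13, dy3_dw11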
       From Effect to Update
• Gradient computation:
  – How each weight contributes to performance
• To train:
  – Need to determine how to CHANGE weight
    based on contribution to performance
  – Need to determine how MUCH change to make
    per iteration
     • Rate parameter ‘r’
        – Large enough to learn quickly
        – Small enough to reach, but not overshoot, the target values
Backpropagation Procedure

[Figure: a chain of nodes i -> j -> k with weights w_{i->j} and w_{j->k}; each node’s output o contributes a factor o(1 - o)]

• Pick rate parameter ‘r’
• Until performance is good enough:
  – Do a forward computation to calculate the outputs
  – Compute beta in the output node with
        \beta_z = d_z - o_z
  – Compute beta in all other nodes with
        \beta_j = \sum_k w_{j \to k} \, o_k (1 - o_k) \, \beta_k
  – Compute the change for all weights with
        \Delta w_{i \to j} = r \, o_i \, o_j (1 - o_j) \, \beta_j
Backprop Example

[Figure: the same 2-2-1 network, from the MIT 6.034 notes (Lozano-Perez)]

Forward prop: compute z_i and y_i given x_k, w_l

Betas (backward pass):

    \beta_3 = y_3^* - y_3
    \beta_2 = y_3 (1 - y_3) \, \beta_3 \, w_{23}
    \beta_1 = y_3 (1 - y_3) \, \beta_3 \, w_{13}

Weight updates:

    w_{03} \leftarrow w_{03} + r \, y_3 (1 - y_3) \, \beta_3 \, (-1)
    w_{02} \leftarrow w_{02} + r \, y_2 (1 - y_2) \, \beta_2 \, (-1)
    w_{01} \leftarrow w_{01} + r \, y_1 (1 - y_1) \, \beta_1 \, (-1)
    w_{13} \leftarrow w_{13} + r \, y_1 \, y_3 (1 - y_3) \, \beta_3        w_{23} \leftarrow w_{23} + r \, y_2 \, y_3 (1 - y_3) \, \beta_3
    w_{12} \leftarrow w_{12} + r \, x_1 \, y_2 (1 - y_2) \, \beta_2        w_{22} \leftarrow w_{22} + r \, x_2 \, y_2 (1 - y_2) \, \beta_2
    w_{11} \leftarrow w_{11} + r \, x_1 \, y_1 (1 - y_1) \, \beta_1        w_{21} \leftarrow w_{21} + r \, x_2 \, y_1 (1 - y_1) \, \beta_1
  Backpropagation Observations
• Procedure is (relatively) efficient
  – All computations are local
     • Use inputs and outputs of current node


• What is “good enough”?
  – Rarely reach target (0 or 1) outputs
     • Typically, train until within 0.1 of target
          Neural Net Summary
• Training:
  – Backpropagation procedure
     • Gradient descent strategy (usual problems)
• Prediction:
  – Compute outputs based on input vector & weights
• Pros: Very general, Fast prediction
• Cons: Training can be VERY slow (1000s of
  epochs), Overfitting
          Training Strategies
• Online training:
  – Update weights after each sample
• Offline (batch) training:
  – Compute error over all samples
     • Then update weights


• Online training “noisy”
  – Sensitive to individual instances
  – However, may escape local minima
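A schematic sketch contrasting the two update schedules; the grad callback (returning the per-sample error gradient) and the list-of-weights representation are illustrative assumptions:

def train_online(w, samples, grad, r):
    # grad(w, x, d): gradient of the error on one sample (supplied by the caller)
    for x, d in samples:                         # update weights after each sample
        w = [wi - r * gi for wi, gi in zip(w, grad(w, x, d))]
    return w

def train_batch(w, samples, grad, r):
    total = [0.0] * len(w)
    for x, d in samples:                         # sum the gradient over all samples first
        total = [t + gi for t, gi in zip(total, grad(w, x, d))]
    return [wi - r * ti for wi, ti in zip(w, total)]   # then update once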
            Training Strategy
• To avoid overfitting:
  – Split data into: training, validation, & test
     • Also, avoid excess weights (keep fewer weights than training samples)
• Initialize with small random weights
  – Small changes have noticeable effect
• Use offline training
  – Train until the validation-set error reaches its minimum
• Evaluate on test set
  – No more weight changes
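A sketch of this strategy as a training loop; train_epoch, val_error, and test_error are hypothetical caller-supplied helpers standing in for the offline training pass and the error measurements:

import random

def train_early_stopping(n_weights, train_epoch, val_error, test_error, r, max_epochs=1000):
    # Initialize with small random weights so small changes have a noticeable effect
    w = [random.uniform(-0.1, 0.1) for _ in range(n_weights)]
    best_w, best_val = list(w), float("inf")
    for _ in range(max_epochs):
        w = train_epoch(w, r)                    # one offline (batch) pass over the training set
        err = val_error(w)
        if err < best_val:                       # keep the weights at the validation minimum
            best_w, best_val = list(w), err
    return best_w, test_error(best_w)            # final evaluation; no more weight changes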
               Classification
• Neural networks are best suited to classification tasks
  – Single output -> binary classifier
  – Multiple outputs -> multiway classification
     • Applied successfully to learning pronunciation

  – The sigmoid pushes outputs toward binary decisions
     • Not well suited to regression
      Neural Net Conclusions
• Simulation based on neurons in brain
• Perceptrons (single neuron)
  – Guaranteed to find a linear discriminant
     • IF one exists -> fails on problems like XOR
• Neural nets (Multi-layer perceptrons)
  – Very general
  – Backpropagation training procedure
     • Gradient descent - local min, overfitting issues

				