Neural Networks
					Single Layer Neural Network

       Xingquan (Hill) Zhu
                   Outline
• Perceptron for Classification
  – Perceptron training rule
  – Why does the perceptron training rule work?
  – Gradient descent learning rule
  – Incremental stochastic gradient descent
     • Delta Rule (Adaline: Adaptive Linear Element)
     Perceptron: architecture
• We consider the architecture: feed-forward NN
  with one layer
• It is sufficient to study single layer perceptrons
  with just one neuron:




         
   
       Single layer perceptrons
  • Generalization to single layer perceptrons with
    more neurons is easy because:




                          

• The output units are independent of each other
• Each weight affects only one of the outputs
    Perceptron: Neuron Model
• The (McCulloch-Pitts) perceptron is a single-
  layer NN with a non-linear activation function,
  the sign function
  Perceptron for Classification
• The perceptron is used for binary
  classification.
• Given training examples of classes C1, C2
  train the perceptron in such a way that it
  classifies correctly the training examples:
  – If the output of the perceptron is +1 then the input
    is assigned to class C1
  – If the output is -1 then the input is assigned to C2
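A minimal sketch of this decision rule in Python; the function name and the NumPy dependency are my own additions, the slides only fix the sign convention:

```python
import numpy as np

def perceptron_classify(w, x):
    """Return +1 (class C1) if w.x >= 0, else -1 (class C2).

    w = [b, w1, ..., wm] includes the bias; x = [+1, x1, ..., xm] is the
    augmented input, matching the variable list later in these slides.
    """
    return +1 if np.dot(w, x) >= 0 else -1
```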
        Perceptron Training
• How can we train a perceptron for a
  classification task?
• We try to find suitable values for the
  weights in such a way that the training
  examples are correctly classified.
• Geometrically, we try to find a hyper-plane
  that separates the examples of the two
  classes.
       Perceptron Geometric View
   The equation below describes a (hyper-)plane in the
     input space consisting of real valued 2D vectors. The
     plane splits the input space into two regions, each of
     them describing one class.
  Decision region for C1:  w1x1 + w2x2 + w0 >= 0
  Decision boundary:       w1x1 + w2x2 + w0 = 0
  The region where w1x1 + w2x2 + w0 < 0 is assigned to C2.

  [Figure: the decision boundary w1x1 + w2x2 + w0 = 0 in the (x1, x2) plane,
   with the decision region for C1 on one side and C2 on the other.]
           Example: AND
• Here is a representation of the AND function
• White means false, black means true for the
  output
• -1 means false, +1 means true for the input




                             -1 AND -1 = false
                             -1 AND +1 = false
                             +1 AND -1 = false
                             +1 AND +1 = true
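One concrete weight setting that realizes AND under this ±1 encoding (the particular weights are my choice; any weights defining the same separating line would do):

```python
def and_perceptron(x1, x2):
    # bias w0 = -1, weights w1 = w2 = 1: the weighted sum is non-negative only for (+1, +1)
    s = -1 + x1 + x2
    return +1 if s >= 0 else -1

for x1 in (-1, +1):
    for x2 in (-1, +1):
        print(f"{x1:+d} AND {x2:+d} = {and_perceptron(x1, x2):+d}")
```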
  Example: AND continued
• A linear decision surface separates
  false from true instances
    Example: AND continued
• Watch a perceptron learn the AND function:
                Example: XOR
 • Here’s the XOR function:

                                      -1 XOR -1 = false
                                      -1 XOR +1 = true
                                      +1 XOR -1 = true
                                      +1 XOR +1 = false




Perceptrons cannot learn such linearly inseparable functions
   Example: XOR continued
• Watch a perceptron try to learn XOR
          Example

-1   -1   -1   -1   -1   -1   -1   -1
-1   -1   +1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   -1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   +1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   -1   -1   -1
                   Example
• How to train a perceptron to recognize this 3?
• Assign –1 to weights of input values that are
  equal to -1, +1 to weights of input values that
  are equal to +1, and –63 to the bias.
• Then the output of the perceptron will be +1
  when presented with a “perfect” 3, and at most
  –1 for all other patterns.
          Example

-1   -1   -1   -1   -1   -1   -1   -1
-1   -1   +1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   -1   +1   +1   +1   -1   -1
-1   +1   -1   -1   -1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   +1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   -1   -1   -1
                   Example
• What if a slightly different 3 is to be recognized,
  like the one in the previous slide?
• The original 3 with one bit corrupted matches the
  weights in 63 of the 64 positions, so the weighted sum
  is 63 - 1 = 62 and, with the bias of –63, the output is –1.
• If the bias is set to –61 then this corrupted 3 will
  also be recognized, as will all patterns with at most
  one corrupted bit.
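A small sketch that checks the bias argument above; the array layout and names are mine, the pattern is the 8x8 “3” from the earlier slide:

```python
import numpy as np

# The "perfect" 3: +1 where the pattern is on, -1 elsewhere.
three = np.array([
    [-1, -1, -1, -1, -1, -1, -1, -1],
    [-1, -1, +1, +1, +1, +1, -1, -1],
    [-1, -1, -1, -1, -1, +1, -1, -1],
    [-1, -1, -1, +1, +1, +1, -1, -1],
    [-1, -1, -1, -1, -1, +1, -1, -1],
    [-1, -1, -1, -1, -1, +1, -1, -1],
    [-1, -1, +1, +1, +1, +1, -1, -1],
    [-1, -1, -1, -1, -1, -1, -1, -1],
]).ravel()

weights = three.copy()            # weight = +1 for +1 pixels, -1 for -1 pixels

corrupted = three.copy()
corrupted[4 * 8 + 1] = +1         # flip one pixel (row 5, column 2), as on the slide

print(weights @ three - 63)       # 64 - 63 =  1  -> perfect 3 recognized
print(weights @ corrupted - 63)   # 62 - 63 = -1  -> corrupted 3 rejected
print(weights @ corrupted - 61)   # 62 - 61 =  1  -> recognized with bias -61
```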
  Perceptron: Learning Algorithm

• Variables and parameters at iteration n of
  the learning algorithm:
 x (n) = input vector
      = [+1, x1(n), x2(n), …, xm(n)]T
 w(n) = weight vector
      = [b(n), w1(n), w2(n), …, wm(n)]T
 b(n) = bias
 a(n) = actual response from the perceptron
 d(n) = desired response
 η = learning rate parameter (real number)
     • Too small an η produces slow convergence.
     • Too large an η can cause oscillations in the process.
        Perceptron Training Rule
k=1;
initialize wi(k) randomly;
while (there is a misclassified training example)
  Select the misclassified example (x(n),d(n))
  wi(k+1) = wi(k) + Δwi
     where Δwi = η {d(n) - a(n)}·xi(n);
  k = k+1;
end-while;

  η = learning rate parameter (real number)
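A runnable sketch of this rule; the function name, the in-order sweep over the data, and the 1/0 output convention of the worked example that follows are assumptions on my part:

```python
import numpy as np

def perceptron_train(X, d, w_init, eta=1.0, max_epochs=100):
    """Perceptron training rule: update the weights only on misclassified examples.

    X holds augmented inputs [+1, x1, ..., xm] per row; d holds the desired
    outputs (1 for C1, 0 for C2, as in the example below).
    """
    w = np.array(w_init, dtype=float)
    for _ in range(max_epochs):
        errors = 0
        for x_n, d_n in zip(X, d):
            a_n = 1 if w @ x_n > 0 else 0           # actual response a(n)
            if a_n != d_n:
                w = w + eta * (d_n - a_n) * x_n     # Δwi = η (d(n) - a(n)) xi(n)
                errors += 1
        if errors == 0:                             # every example classified correctly
            break
    return w
```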
                         Example

Consider the 2-dimensional training set C1 ∪ C2,

C1 = {(1,1), (1, -1), (0, -1)} with class label 1
C2 = {(-1,-1), (-1,1), (0,1)} with class label 0


Train a perceptron on C1 ∪ C2
                          Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:     {(1, -1,-1), (1, -1,1), (1, 0,1)}

Fill out this table sequentially (First pass):

Input         Weight      Desired     Actual     Update?   New
                                                           weight
  (1, 1, 1)   (1, 0, 0)       1            1       No       (1, 0, 0)
 (1, 1, -1)   (1, 0, 0)       1            1       No       (1, 0, 0)
  (1,0, -1)   (1, 0, 0)       1            1       No       (1, 0, 0)
 (1,-1, -1)   (1, 0, 0)       0            1       Yes      (0, 1, 1)
  (1,-1, 1)   (0, 1, 1)       0            0       No       (0, 1, 1)
  (1, 0, 1)   (0, 1, 1)       0            1       Yes     (-1, 1, 0)
                            Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:     {(1, -1,-1), (1, -1,1), (1, 0,1)}

Fill out this table sequentially (Second pass):

Input        Weight      Desired      Actual   Update?   New
                                                         weight
 (1, 1, 1) (-1, 1, 0)         1            0      Yes     (0, 2, 1)
(1, 1, -1) (0, 2, 1)          1            1      No      (0, 2, 1)
 (1,0, -1) (0, 2, 1)          1            0      Yes     (1, 2, 0)
(1,-1, -1) (1, 2, 0)          0            0      No      (1, 2, 0)
 (1,-1, 1) (1, 2, 0)          0            0      No      (1, 2, 0)
 (1, 0, 1) (1, 2, 0)          0            1      Yes    (0, 2, -1)
                               Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1,-1), (1, -1,1), (1, 0,1)}
Fill out this table sequentially (Third pass):

   Input        Weight       Desired   Actual   Update?   New
                                                          weight
    (1, 1, 1)   (0, 2, -1)      1         1       No      (0, 2, -1)
   (1, 1, -1)   (0, 2, -1)      1         1       No      (0, 2, -1)
    (1,0, -1)   (0, 2, -1)      1         1       No      (0, 2, -1)
   (1,-1, -1)   (0, 2, -1)      0         0       No      (0, 2, -1)
    (1,-1, 1)   (0, 2, -1)      0         0       No      (0, 2, -1)
    (1, 0, 1)   (0, 2, -1)      0         0       No      (0, 2, -1)


   At epoch 3 there are no weight changes, so the algorithm stops.
   Final weight vector: (0, 2, -1).
   The decision hyperplane is 2x1 - x2 = 0.
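Assuming the perceptron_train sketch from the earlier slide, the run below reproduces this three-pass trajectory and stops at the same weight vector:

```python
import numpy as np

X = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1],     # C1, desired output 1
              [1, -1, -1], [1, -1, 1], [1, 0, 1]])   # C2, desired output 0
d = np.array([1, 1, 1, 0, 0, 0])

w = perceptron_train(X, d, w_init=[1, 0, 0], eta=1.0)
print(w)          # [ 0.  2. -1.]  ->  decision hyperplane 2*x1 - x2 = 0
```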
           Result

[Plot: the six training examples in the (x1, x2) plane, C1 points marked + and
 C2 points marked -, separated by the decision boundary 2x1 - x2 = 0; the weight
 vector w is normal to the boundary.]
         Some Unhappiness About
           Perceptron Training
• The perceptron learning rule fails to converge if
  examples are not linearly separable
  – Can only model linearly separable classes, like
    (those described by) the following Boolean
    functions:
     • AND, OR, but not XOR
• When a perceptron gives the right answer, no
  learning takes place.
• Anything below the threshold is interpreted as
  “no”, even if it is just below the threshold.
  – Might it be better to train the neuron based on how
    far below the threshold it is?
                   Outline
• Perceptron for Classification
  – Perceptron training rule
  – Why does the perceptron training rule work?
  – Convergence theorem
  – Gradient descent learning rule
  – Incremental stochastic gradient descent
     • Delta Rule (Adaline: Adaptive Linear Element)
   Gradient Descent Learning Rule
• Gradient Descent: Consider linear unit without threshold
  and continuous output o (not just –1,1)
   – o(x)=w0 + w1 x1 + … + wm xm
   – o(x)=WX
• The squared error (where D is the set of training examples )
   – For an example (x,d) the error e(w) of the network is
                                m
        e(w)  d  o( x )  d   x jw j
                                j 0



   – And the squared error of (x,d) is         E ( w)  1 e2
                                                        2
   – The total squared error is
       • E(W)=E(w1,…,wm) = ½ (x,d)D (d-o(x))2
       • E(W)=E(w1,…,wm)= ½ (x,d)D (d-WTX)2
• Update the weight wi such that E(W)  minimum
   – Wi(k+1) = wi(k) + wi
 Gradient Descent Learning Rule
• start from an arbitrary point in the weight space
• the direction in which the error E of an example (as a
  function of the weights) is decreasing most rapidly is
  the opposite of the gradient of E:


  gradient of E(W):   ∇E(W) = [ ∂E/∂w1, …, ∂E/∂wm ]

• take a small step (of size η) in that direction

  w(k+1) = w(k) - η (gradient of E(W)) = w(k) - η ∇E(W)
Gradient Descent Learning Rule

[Figure: a single gradient-descent step in weight space, from (w1, w2)
 to (w1 + Δw1, w2 + Δw2).]
   Gradient Descent Learning Rule
• Train the wi’s such that they minimize the squared error
   – E(w1,…,wm) = ½ Σ_{n∈D} (dn - on)²

Gradient:
∇E(w) = [ ∂E/∂w0, …, ∂E/∂wm ]
Δw = -η ∇E(w)
Δwi = -η ∂E/∂wi
    = -η ∂/∂wi ½ Σn (dn - on)²
    = -η ∂/∂wi ½ Σn (dn - Σi wi xin)²
    = -η Σn (dn - on)(-xin)
    = η Σn (dn - on) xin        Gradient descent learning rule


wi(k+1) = wi(k) + Δwi
        = wi(k) + η Σn (dn - on) xin
Gradient-Descent-Learning(D, η)

  Each training example is written (x(n), dn) or ((x1n, …, xmn), dn), where (x1n, …, xmn)
  are the input values, dn is the desired output, and η is the learning rate (e.g. 0.1)

• k=1, randomly initialize wi(k), calculate E(W)
• While (E(W) unsatisfactory AND k<max_iterations)
   – Initialize each Δwi to zero
   – For each instance (x(n), dn) in D Do
       • Calculate network output on = Σi wi(k) xin
       • For each weight dimension wi(k), i=1,..,m
             – Δwi = Δwi + η (dn - on) xin
       • EndFor
   – EndFor
   – For each weight dimension wi(k), i=1,..,m
       • wi(k+1) = wi(k) + Δwi
   – EndFor
   – Calculate E(W) based on updated wi(k+1)
   – k=k+1
• EndWhile
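A possible Python sketch of this procedure, vectorized with NumPy; the error tolerance and the names are assumptions (the slide only says “E(W) unsatisfactory”):

```python
import numpy as np

def gradient_descent_train(X, d, w_init, eta=0.1, max_epochs=100, tol=1e-3):
    """Batch gradient descent for a linear unit o(x) = w.x (no threshold)."""
    w = np.array(w_init, dtype=float)
    for _ in range(max_epochs):
        o = X @ w                              # network outputs o_n for all examples
        delta_w = eta * X.T @ (d - o)          # Δw_i = η Σ_n (d_n - o_n) x_in
        w = w + delta_w                        # apply the accumulated update once per pass
        E = 0.5 * np.sum((d - X @ w) ** 2)     # total squared error E(W)
        if E < tol:                            # "E(W) unsatisfactory" made concrete as a threshold
            break
    return w
```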
                         Example

Consider the 2-dimensional training set C1 ∪ C2,

C1 = {(1,1), (1, -1), (0, -1)} with class label 1
C2 = {(-1,-1), (-1,1), (0,1)} with class label 0


Train a perceptron on C1 ∪ C2
                          Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:      {(1, -1,-1), (1, -1,1), (1, 0,1)}
η = 0.1
wi(k+1) = wi(k) + Δwi;              Δwi = η Σn (dn - on) xin
Fill out this table sequentially (First pass):
   Input         w(k)     dn     on   η(dn-on)xin      Δwi
     (1, 1, 1)   (1,0, 0)  1      1       (0, 0, 0)        (0, 0, 0)
    (1, 1, -1)   (1, 0, 0) 1      1       (0, 0, 0)        (0, 0, 0)
     (1,0, -1)   (1, 0, 0) 1      1       (0, 0, 0)        (0, 0, 0)
    (1,-1, -1)   (1, 0, 0) 0      1   (-0.1, 0.1, 0.1) (-0.1, 0.1, 0.1)
     (1,-1, 1)   (1, 0, 0) 0      1   (-0.1, 0.1, -0.1) (-0.2, 0.2, 0)
     (1, 0, 1)   (1, 0, 0) 0      1    (-0.1, 0, -0.1) (-0.3, 0.2, -0.1)
                      E(W)=3/2             w(k+1)       (0.7, 0.2, -0.1)
                           Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:      {(1, -1,-1), (1, -1,1), (1, 0,1)}
η = 0.1
wi(k+1) = wi(k) + Δwi;               Δwi = η Σn (dn - on) xin
Fill out this table sequentially (Second pass):
 Input        w(k)           dn    on    η(dn-on)xin          Δwi
  (1, 1, 1)   (0.7, 0.2, -0.1) 1   0.8   (0.02, 0.02, 0.02)     (0.02,0.02,0.02)
 (1, 1, -1)   (0.7, 0.2, -0.1) 1    1           (0,0,0)         (0.02,0.02,0.02)
  (1,0, -1)   (0.7, 0.2, -0.1) 1   0.8      (0.02,0,-0.02)        (0.04,0.02,0)
 (1,-1, -1)   (0.7, 0.2, -0.1) 0   0.6    (-0.06,0.06,0.06)    (-0.02,0.08,0.06)
  (1,-1, 1)   (0.7, 0.2, -0.1) 0   0.4   (-0.04,0.04,-0.04)    (-0.06,0.12,0.02)
  (1, 0, 1)   (0.7, 0.2, -0.1) 0   0.6      (-0.06,0,-0.06)   (-0.12,0.12,-0.04)
                     E(W)=0.96/2              w(k+1)          (0.58,0.32,-0.14)
                              Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:      {(1, -1,-1), (1, -1,1), (1, 0,1)}
η = 0.1           wi(k+1) = wi(k) + Δwi;           Δwi = η Σn (dn - on) xin
Fill out this table sequentially (Third pass):
 Input        w(k)              dn  on   η(dn-on)xin            Δwi
  (1, 1, 1)   (0.58,0.32,-0.14) 1 0.76 (0.024,0.024,0.024)     (0.024,0.024,0.024)
 (1, 1, -1)   (0.58,0.32,-0.14) 1 1.04 (-0.004,-0.004,0.004)    (0.02,0.02,0.028)
  (1,0, -1)   (0.58,0.32,-0.14) 1 0.72     (0.028,0,-0.028)       (0.048,0.04,0)
 (1,-1, -1)   (0.58,0.32,-0.14) 0   0.4   (-0.06,0.06,0.06)      (-0.012,0.1,0.06)
  (1,-1, 1)   (0.58,0.32,-0.14) 0 0.12 (-0.088,0.088,-0.088)    (-0.1,0.188,-0.028)
  (1, 0, 1)   (0.58,0.32,-0.14) 0 0.44    (-0.056,0,-0.056)    (-0.156,0.188,-0.084)
                   E(W)=0.5056/2               w(k+1)          (0.424,0.508,-0.224)
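Assuming the gradient_descent_train sketch and the X, d arrays defined in the earlier sketches, three passes reproduce the weight vectors in these tables:

```python
w = gradient_descent_train(X, d, w_init=[1, 0, 0], eta=0.1, max_epochs=3)
print(w)     # approximately [0.424, 0.508, -0.224], the third-pass result above
```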
 Gradient Descent Learning Rule

• Because the error surface contains only a
  single global minimum, this algorithm will
  converge to a weight vector with minimum
  error, regardless of whether the training
  examples are linearly separable, provided a
  sufficiently small learning rate η is used
• If η is too large
  – The search may overstep the minimum in the error surface
  – Gradually reduce the value of η as the number of
    gradient descent steps grows
           Example (large η value)
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:      {(1, -1,-1), (1, -1,1), (1, 0,1)}
η = 1
Fill out this table sequentially (First pass):

   Input          w(k)       dn        on      η(dn-on)xin      Δwi
      (1, 1, 1)    (1,0, 0)     1         1     (0, 0, 0)    (0, 0, 0)
     (1, 1, -1)    (1, 0, 0)    1         1     (0, 0, 0)    (0, 0, 0)
      (1,0, -1)    (1, 0, 0)    1         1     (0, 0, 0)    (0, 0, 0)
     (1,-1, -1)    (1, 0, 0)    0         1    (-1, 1, 1)   (-1, 1, 1)
      (1,-1, 1)    (1, 0, 0)    0         1   (-1, 1, -1)   (-2, 2, 0)
      (1, 0, 1)    (1, 0, 0)    0         1   (-1, 0, -1)   (-3, 2, -1)
                           E(W)=3/2              w(k+1)     (-2, 2, -1)
                        Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:      {(1, -1,-1), (1, -1,1), (1, 0,1)}
η = 1
Fill out this table sequentially (Second pass):
 Input         w(k)      dn         on        η(dn-on)xin      Δwi
   (1, 1, 1)   (-2,2, -1)   1          -1    (2, 2, 2)     (2, 2, 2)
  (1, 1, -1)   (-2,2, -1)   1           1    (0, 0, 0)     (2, 2, 2)
   (1,0, -1)   (-2,2, -1)   1          -1   (2, 0, -2)     (4, 2, 0)
  (1,-1, -1)   (-2,2, -1)   0          -3   (3, -3, -3)   (7, -1, -3)
   (1,-1, 1)   (-2,2, -1)   0          -5   (5, -5, 5)    (12, -6, 2)
   (1, 0, 1)   (-2,2, -1)   0          -3    (3, 0, 3)    (15, -6, 5)
                       E(W)=51/2              w(k+1)      (13, -4, 4)
                        Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:      {(1, -1,-1), (1, -1,1), (1, 0,1)}
η = 1
Fill out this table sequentially (Third pass):
 Input          w(k)      dn   on    η(dn-on)xin        Δwi
    (1, 1, 1)   (13, -4, 4) 1 13    (-12,-12,-12)
   (1, 1, -1)   (13, -4, 4) 1  5        (-4,-4,4)
    (1,0, -1)   (13, -4, 4) 1  9         (-8,0,8)
   (1,-1, -1)   (13, -4, 4) 0 13      (-13,13,13)
    (1,-1, 1)   (13, -4, 4) 0 21     (-21,21,-21)
    (1, 0, 1)   (13, -4, 4) 0 17      (-17,0,-17)    (-75,18,-25)
                   E(W)=1123/2           w(k+1)      (-62,14,-21)
With η this large, the steps overshoot the minimum and the squared error grows with every pass.
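Using the same gradient_descent_train sketch and the X, d arrays from before, the divergence tabulated above can be replayed:

```python
w = gradient_descent_train(X, d, w_init=[1, 0, 0], eta=1.0, max_epochs=3)
print(w)     # [-62. 14. -21.]; the squared error grows from 3/2 to 51/2 to 1123/2
```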
                   Outline
• Perceptron for Classification
  – Perceptron training rule
  – Why does the perceptron training rule work?
  – Convergence theorem
  – Gradient descent learning rule
  – Incremental stochastic gradient descent
     • Delta Rule (Adaline: Adaptive Linear Element)
        Incremental Stochastic
           Gradient Descent
• Batch mode : Gradient descent
   – w(k+1) = w(k) - η ∇ED(W), computed over the entire data D
   – ED(W) = ½ Σn (dn - on)²
• Incremental mode: Gradient descent
   – w(k+1) = w(k) - η ∇En(W), computed over individual training
     examples
   – En(W) = ½ (dn - on)²

Incremental Gradient Descent can approximate Batch Gradient
   Descent arbitrarily closely if η is small enough
 Weights Update Rule: incremental mode
• Computation of Gradient(E):
   ∂E(W)/∂w = ∂/∂w [ ½ (dn - on)² ] = ∂/∂w [ ½ (dn - wᵀx(n))² ]
             = -(dn - on) x(n)
• Delta rule (AdaLine: Adaptive Linear
  Elements) for weight update:

     w(k+1) = w(k) - η ∂E(W)/∂w
     w(k+1) = w(k) + η (dn - on) x(n)
 Delta Rule (AdaLine) learning algorithm

k=1;
initialize wi(k) randomly; Calculate ED(W)
while (ED(W) unsatisfactory AND k<max_iterations)
   Select an example (x(n), dn)
   Δwi = η (dn - on) xin
   wi(k+1) = wi(k) + Δwi
   Calculate ED(W)
   k = k+1;
end-while;
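A sketch of the delta rule in Python; the names, the in-order sweep over the data (instead of the rule's “select an example”), and the tolerance-based stop are assumptions:

```python
import numpy as np

def delta_rule_train(X, d, w_init, eta=0.1, max_epochs=100, tol=1e-3):
    """Incremental (stochastic) gradient descent: update after every example."""
    w = np.array(w_init, dtype=float)
    for _ in range(max_epochs):
        for x_n, d_n in zip(X, d):
            o_n = w @ x_n                        # linear output for this example
            w = w + eta * (d_n - o_n) * x_n      # Δwi = η (dn - on) xin
        E = 0.5 * np.sum((d - X @ w) ** 2)       # E_D(W) after the pass
        if E < tol:
            break
    return w
```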
                             Example
C1:      {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2:      {(1, -1,-1), (1, -1,1), (1, 0,1)}
η = 0.1
Fill out this table sequentially (First pass):
wi(k+1) = wi(k) + η(dn - on)xin

 Input      wi(k)            dn    on     η(dn-on)xin            wi(k+1)
  (1, 1, 1)     (1,0, 0)      1     1         (0, 0, 0)           (1, 0, 0)
 (1, 1, -1)     (1, 0, 0)     1     1         (0, 0, 0)           (1, 0, 0)
  (1,0, -1)     (1, 0, 0)     1     1         (0, 0, 0)           (1, 0, 0)
 (1,-1, -1)     (1, 0, 0)     0     1     (-0.1, 0.1, 0.1)     (0.9, 0.1, 0.1)
  (1,-1, 1) (0.9, 0.1, 0.1)   0    0.9 (-0.09, 0.09, -0.09)  (0.81,0.19,0.01)
  (1, 0, 1) (0.81,0.19,0.01) 0    0.82 (-0.082, 0, -0.082) (0.728, 0.19, -0.072)
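Assuming the delta_rule_train sketch above and the X, d arrays from the earlier sketches, a single pass reproduces the last row of this table:

```python
w = delta_rule_train(X, d, w_init=[1, 0, 0], eta=0.1, max_epochs=1)
print(w)     # approximately [0.728, 0.19, -0.072]
```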
  Perceptron Learning Rule VS.
     Gradient Descent Rule
The perceptron learning rule is guaranteed to succeed if
  – Training examples are linearly separable
  – The learning rate η is sufficiently small
Gradient descent learning rules
  – Are guaranteed to converge to the hypothesis with minimum
    squared error
  – Given a sufficiently small learning rate η
  – Even when the training data contain noise
  – Even when the training data are not separable by H
Comparison of Perceptron and Adaline
               Perceptron       Adaline

Architecture   Single-layer     Single-layer

Neuron         Non-linear       linear
model
Learning       Minimize         Minimize total
algorithm      number of        squared error
               misclassified
               examples

Application    Linear           Linear classification, and
               classification   regression
                   Outline
• Perceptron for Classification
  – Perceptron training rule
  – Why does the perceptron training rule work?
  – Convergence theorem
  – Gradient descent learning rule
  – Incremental stochastic gradient descent
     • Delta Rule (Adaline: Adaptive Linear Element)

				