# Neural Networks


## Single Layer Neural Network

Xingquan (Hill) Zhu
Outline
• Perceptron for Classification
– Perceptron training rule
– Why does the perceptron training rule work?
Perceptron: architecture
• We consider a feed-forward NN architecture with one layer.
• It is sufficient to study single-layer perceptrons with just one neuron.
Single layer perceptrons
• Generalization to single-layer perceptrons with more neurons is easy because:
– The output units are independent of each other
– Each weight only affects one of the outputs
Perceptron: Neuron Model
• The (McCulloch-Pitts) perceptron is a single-layer
NN with a non-linear activation function, the sign function
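In code, the neuron forms the weighted sum of its inputs plus a bias and applies the sign function. A minimal Python sketch (the function name and the example weights are illustrative, not taken from the slides):

```python
def perceptron_output(weights, bias, x):
    """Single McCulloch-Pitts neuron: sign of the weighted sum plus bias."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= 0 else -1

# Illustrative hand-chosen weights and bias
print(perceptron_output([1.0, 1.0], -1.5, [1, 1]))   # +1 -> class C1
print(perceptron_output([1.0, 1.0], -1.5, [-1, 1]))  # -1 -> class C2
```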
Perceptron for Classification
• The perceptron is used for binary
classification.
• Given training examples of classes C1, C2
train the perceptron in such a way that it
classifies correctly the training examples:
– If the output of the perceptron is +1 then the input
is assigned to class C1
– If the output is -1 then the input is assigned to C2
Perceptron Training
• How can we train a perceptron for a binary classification task?
• We try to find suitable values for the weights such
that the training examples are correctly classified.
• Geometrically, we try to find a hyper-plane
that separates the examples of the two
classes.
Perceptron Geometric View
The equation below describes a (hyper-)plane in the
input space consisting of real valued 2D vectors. The
plane splits the input space into two regions, each of
them describing one class.

Decision region for C1:  w1x1 + w2x2 + w0 >= 0
Decision boundary:       w1x1 + w2x2 + w0 = 0

[Figure: the line w1x1 + w2x2 + w0 = 0 in the (x1, x2) plane, with the decision region for C1 on one side of the boundary and C2 on the other]
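In code, assigning a 2D point to a class just means checking the sign of w1x1 + w2x2 + w0. A small sketch with illustrative weights (the boundary x1 + x2 - 1 = 0 is an arbitrary example, not from the slides):

```python
def classify(w0, w1, w2, x1, x2):
    """Assign a 2D point to C1 if w1*x1 + w2*x2 + w0 >= 0, otherwise to C2."""
    return "C1" if w1 * x1 + w2 * x2 + w0 >= 0 else "C2"

# Illustrative boundary x1 + x2 - 1 = 0  (w0 = -1, w1 = w2 = 1)
print(classify(-1, 1, 1, 2, 2))   # C1: 2 + 2 - 1 = 3 >= 0
print(classify(-1, 1, 1, 0, 0))   # C2: 0 + 0 - 1 = -1 < 0
```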
Example: AND
• Here is a representation of the AND function
• White means false, black means true for the
output
• -1 means false, +1 means true for the input

-1 AND -1 = false
-1 AND +1 = false
+1 AND -1 = false
+1 AND +1 = true
Example: AND continued
• A linear decision surface separates
false from true instances
Example: AND continued
• Watch a perceptron learn the AND function:
Example: XOR
• Here’s the XOR function:

-1 XOR -1 = false
-1 XOR +1 = true
+1 XOR -1 = true
+1 XOR +1 = false

Perceptrons cannot learn such linearly inseparable functions
Example: XOR continued
• Watch a perceptron try to learn XOR
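The separability difference can be checked directly with the perceptron training rule that the later slides define formally. The sketch below is a hypothetical helper using the ±1 input/output convention of these truth tables and a separate bias term; it learns AND in a few passes but never completes an error-free pass over XOR within the epoch cap:

```python
def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Perceptron training rule on (x, d) pairs with d in {-1, +1}.
    Returns (weights, bias, converged?)."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, d in examples:
            a = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if a != d:                      # update only on mistakes
                w = [wi + eta * (d - a) * xi for wi, xi in zip(w, x)]
                b += eta * (d - a)
                errors += 1
        if errors == 0:
            return w, b, True               # a full pass with no mistakes
    return w, b, False

AND = [((-1, -1), -1), ((-1, +1), -1), ((+1, -1), -1), ((+1, +1), +1)]
XOR = [((-1, -1), -1), ((-1, +1), +1), ((+1, -1), +1), ((+1, +1), -1)]

print(train_perceptron(AND))   # converges to a separating line
print(train_perceptron(XOR))   # converged=False: XOR is not linearly separable
```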
Example
An 8×8 pixel pattern representing the digit 3 (+1 = a black pixel, -1 = a white pixel):
-1   -1   -1   -1   -1   -1   -1   -1
-1   -1   +1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   -1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   +1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   -1   -1   -1
Example
• How to train a perceptron to recognize this 3?
• Assign –1 to weights of input values that are
equal to -1, +1 to weights of input values that
are equal to +1, and –63 to the bias.
• Then the output of the perceptron will be 1
when presented with a “perfect” 3, and at most
–1 for all other patterns.
Example
The same pattern with one corrupted pixel (row 5, column 2):
-1   -1   -1   -1   -1   -1   -1   -1
-1   -1   +1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   -1   +1   +1   +1   -1   -1
-1   +1   -1   -1   -1   +1   -1   -1
-1   -1   -1   -1   -1   +1   -1   -1
-1   -1   +1   +1   +1   +1   -1   -1
-1   -1   -1   -1   -1   -1   -1   -1
Example
• What if a slightly different 3 is to be recognized,
like the one in the previous slide?
• The original 3 with one bit corrupted would
produce a sum equal to –1.
• If the bias is set to –61 then this corrupted 3 will also be
recognized, as well as any pattern with at most one
corrupted bit.
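A small sketch of this argument in Python, with the two 8×8 patterns from the slides written out as lists and the weights simply copied from the perfect 3:

```python
# The "perfect" 3 from the slides, row by row (+1 = black pixel, -1 = white)
PERFECT_3 = [
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, +1, +1, +1, +1, -1, -1,
    -1, -1, -1, -1, -1, +1, -1, -1,
    -1, -1, -1, +1, +1, +1, -1, -1,
    -1, -1, -1, -1, -1, +1, -1, -1,
    -1, -1, -1, -1, -1, +1, -1, -1,
    -1, -1, +1, +1, +1, +1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1,
]

# The corrupted 3 from the slides: one pixel flipped (row 5, column 2)
CORRUPTED_3 = PERFECT_3.copy()
CORRUPTED_3[4 * 8 + 1] = +1

def response(pattern, bias):
    """Weights are a copy of the perfect 3, so the weighted sum counts matching
    pixels minus mismatching ones (64 for a perfect match, 62 with one flip)."""
    weights = PERFECT_3
    return sum(w * x for w, x in zip(weights, pattern)) + bias

print(response(PERFECT_3, -63))    #  1 -> recognized
print(response(CORRUPTED_3, -63))  # -1 -> rejected
print(response(CORRUPTED_3, -61))  #  1 -> recognized with the relaxed bias
```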
Perceptron: Learning Algorithm

• Variables and parameters at iteration n of
the learning algorithm:
x(n) = input vector = [+1, x1(n), x2(n), …, xm(n)]T
w(n) = weight vector = [b(n), w1(n), w2(n), …, wm(n)]T
b(n) = bias
a(n) = actual response from the perceptron
d(n) = desired response
η = learning rate parameter (real number)
• Too small an η produces slow convergence.
• Too large an η can cause oscillations in the process.
Perceptron Training Rule
k = 1;
initialize wi(k) randomly;
while (there is a misclassified training example)
    select a misclassified example (x(n), d(n))
    wi(k+1) = wi(k) + Δwi,  where Δwi = η·{d(n) - a(n)}·xi(n);
    k = k + 1;
end-while;

η = learning rate parameter (real number)
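A minimal Python sketch of this rule, using the augmented-input convention of the worked example that follows (x0 = 1 carries the bias, desired outputs are 1 for C1 and 0 for C2, and the unit outputs 1 when the weighted sum is positive); the function names are illustrative:

```python
def step(s):
    """Threshold unit used in the worked example: 1 if the weighted sum is positive."""
    return 1 if s > 0 else 0

def train_perceptron_rule(examples, w, eta=1.0, max_epochs=100):
    """Perceptron training rule with augmented inputs x = (1, x1, ..., xm).
    examples: list of (x, d) pairs with d in {0, 1}; w: initial weight list."""
    for _ in range(max_epochs):
        changed = False
        for x, d in examples:
            a = step(sum(wi * xi for wi, xi in zip(w, x)))
            if a != d:                                   # update only on mistakes
                w = [wi + eta * (d - a) * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:                                  # a full pass with no updates
            break
    return w
```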
Example

Consider the 2-dimensional training set C1 ∪ C2:

C1 = {(1, 1), (1, -1), (0, -1)} with class label 1
C2 = {(-1, -1), (-1, 1), (0, 1)} with class label 0

Train a perceptron on C1 ∪ C2.
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}

Fill out this table sequentially (First pass); the unit outputs 1 when the weighted sum w·x is positive, and 0 otherwise:

| Input       | Weight    | Desired | Actual | Update? | New weight |
|-------------|-----------|---------|--------|---------|------------|
| (1, 1, 1)   | (1, 0, 0) | 1       | 1      | No      | (1, 0, 0)  |
| (1, 1, -1)  | (1, 0, 0) | 1       | 1      | No      | (1, 0, 0)  |
| (1, 0, -1)  | (1, 0, 0) | 1       | 1      | No      | (1, 0, 0)  |
| (1, -1, -1) | (1, 0, 0) | 0       | 1      | Yes     | (0, 1, 1)  |
| (1, -1, 1)  | (0, 1, 1) | 0       | 0      | No      | (0, 1, 1)  |
| (1, 0, 1)   | (0, 1, 1) | 0       | 1      | Yes     | (-1, 1, 0) |
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}

Fill out this table sequentially (Second pass):

| Input       | Weight     | Desired | Actual | Update? | New weight |
|-------------|------------|---------|--------|---------|------------|
| (1, 1, 1)   | (-1, 1, 0) | 1       | 0      | Yes     | (0, 2, 1)  |
| (1, 1, -1)  | (0, 2, 1)  | 1       | 1      | No      | (0, 2, 1)  |
| (1, 0, -1)  | (0, 2, 1)  | 1       | 0      | Yes     | (1, 2, 0)  |
| (1, -1, -1) | (1, 2, 0)  | 0       | 0      | No      | (1, 2, 0)  |
| (1, -1, 1)  | (1, 2, 0)  | 0       | 0      | No      | (1, 2, 0)  |
| (1, 0, 1)   | (1, 2, 0)  | 0       | 1      | Yes     | (0, 2, -1) |
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}

Fill out this table sequentially (Third pass):

| Input       | Weight     | Desired | Actual | Update? | New weight |
|-------------|------------|---------|--------|---------|------------|
| (1, 1, 1)   | (0, 2, -1) | 1       | 1      | No      | (0, 2, -1) |
| (1, 1, -1)  | (0, 2, -1) | 1       | 1      | No      | (0, 2, -1) |
| (1, 0, -1)  | (0, 2, -1) | 1       | 1      | No      | (0, 2, -1) |
| (1, -1, -1) | (0, 2, -1) | 0       | 0      | No      | (0, 2, -1) |
| (1, -1, 1)  | (0, 2, -1) | 0       | 0      | No      | (0, 2, -1) |
| (1, 0, 1)   | (0, 2, -1) | 0       | 0      | No      | (0, 2, -1) |

At epoch 3 no weight changes occur ⇒ stop execution of the algorithm.
Final weight vector: (0, 2, -1) ⇒ the decision hyperplane is 2x1 - x2 = 0.
Result

[Figure: the six training points in the (x1, x2) plane, C1 marked + and C2 marked −, separated by the decision boundary 2x1 - x2 = 0 (a line through the origin with slope 2); the weight vector w is normal to the boundary]
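Reusing the train_perceptron_rule and step sketch from above, the hand-traced result can be reproduced and checked (same initial weights, learning rate, and example order as the tables):

```python
C1 = [(1, 1, 1), (1, 1, -1), (1, 0, -1)]    # augmented inputs, label 1
C2 = [(1, -1, -1), (1, -1, 1), (1, 0, 1)]   # augmented inputs, label 0
data = [(x, 1) for x in C1] + [(x, 0) for x in C2]

w = train_perceptron_rule(data, w=[1, 0, 0], eta=1)
print(w)                                     # [0, 2, -1], as in the trace above
print([step(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in data])
# [1, 1, 1, 0, 0, 0] -> every training example classified correctly
```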
Perceptron Training
• The perceptron learning rule fails to converge if
examples are not linearly separable
– Can only model linearly separable classes, like
(those described by) the following Boolean
functions:
• AND, OR, but not XOR
• When a perceptron gives the right answer, no
learning takes place.
• Anything below the threshold is interpreted as
“no”, even if it is just below the threshold.
– Might it be better to train the neuron based on how
far below the threshold it is?
Outline
• Perceptron for Classification
– Perceptron training rule
– Why does the perceptron training rule work?
– Convergence theorem
• Gradient Descent: consider a linear unit without a threshold and with continuous output o (not just –1, +1)
  – o(x) = w0 + w1x1 + … + wmxm
  – o(x) = WTX
• The squared error (where D is the set of training examples):
  – For an example (x, d) the error e(w) of the network is

    e(w) = d - o(x) = d - Σ_{j=0..m} wj xj

  – And the squared error of (x, d) is E(w) = ½ e²
  – The total squared error is
    • E(W) = E(w1, …, wm) = ½ Σ_{(x,d)∈D} (d - o(x))²
    • E(W) = E(w1, …, wm) = ½ Σ_{(x,d)∈D} (d - WTX)²
• Update the weights wi such that E(W) is minimized
  – wi(k+1) = wi(k) + Δwi
• Start from an arbitrary point in the weight space
• The direction in which the error E of an example (as a function of the weights) decreases most rapidly is the opposite of the gradient of E:

  gradient of E(W) = [∂E/∂w1, …, ∂E/∂wm]

• Take a small step (of size η) in that direction:

  w(k+1) = w(k) - η (gradient of E(W))

[Figure: one gradient step moves the weights from (w1, w2) to (w1 + Δw1, w2 + Δw2)]
• Train the wi’s such that they minimize the squared error
  – E(w1, …, wm) = ½ Σ_{n∈D} (dn - on)²

∇E(w) = [∂E/∂w0, …, ∂E/∂wm]
Δw = -η ∇E(w)
Δwi = -η ∂E/∂wi
    = -η ∂/∂wi ½ Σn (dn - on)²
    = -η ∂/∂wi ½ Σn (dn - Σi wi xin)²
    = -η Σn (dn - on)(-xin)
    = η Σn (dn - on) xin        (gradient descent learning rule)

wi(k+1) = wi(k) + Δwi
        = wi(k) + η Σn (dn - on) xin

Notation: a training example is written (x(n), dn) or ((x1n, …, xmn), dn), where (x1n, …, xmn) are the input values and dn is the desired output; η is the learning rate (e.g. 0.1).
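The derived gradient can be sanity-checked numerically: the analytic component -Σn (dn - on) xin should agree with a finite-difference estimate of ∂E/∂wi. A small Python sketch with made-up data (all names and values are illustrative):

```python
def output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def E(w, data):
    """Total squared error E(W) = 1/2 * sum_n (d_n - o_n)^2."""
    return 0.5 * sum((d - output(w, x)) ** 2 for x, d in data)

def analytic_grad(w, data, i):
    """dE/dw_i = -sum_n (d_n - o_n) * x_in, as derived above."""
    return -sum((d - output(w, x)) * x[i] for x, d in data)

# Made-up data and weights, purely to check the formula
data = [((1.0, 2.0), 1.0), ((1.0, -1.0), 0.0), ((1.0, 0.5), 0.5)]
w = [0.3, -0.2]
eps = 1e-6
for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    numeric = (E(w_plus, data) - E(w, data)) / eps   # finite-difference estimate
    print(i, analytic_grad(w, data, i), round(numeric, 5))  # the two should agree
```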

• k = 1, randomly initialize wi(k), calculate E(W)
• While (E(W) unsatisfactory AND k < max_iterations)
  – Initialize each Δwi to zero
  – For each instance (x(n), dn) in D do
    • Calculate the network output on = Σi wi(k) xin
    • For each weight dimension wi(k), i = 1, …, m
      – Δwi = Δwi + η (dn - on) xin
    • EndFor
  – EndFor
  – For each weight dimension wi(k), i = 1, …, m
    • wi(k+1) = wi(k) + Δwi
  – EndFor
  – Calculate E(W) based on the updated wi(k+1)
  – k = k + 1
• EndWhile
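A Python sketch of this batch procedure for a linear unit with augmented inputs (function name and stopping criterion are illustrative); run on the dataset of the next slides with w = (1, 0, 0) and η = 0.1, its first epoch reproduces E(W) = 3/2 and the updated weights (0.7, 0.2, -0.1) of the first-pass table:

```python
def train_linear_unit_batch(data, w, eta=0.1, max_epochs=50, target_error=1e-3):
    """Batch gradient descent for a linear unit o = w . x.
    data: list of (x, d) pairs with augmented inputs x = (1, x1, ..., xm)."""
    for _ in range(max_epochs):
        delta = [0.0] * len(w)          # weight change accumulated over all of D
        error = 0.0
        for x, d in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            error += 0.5 * (d - o) ** 2
            for i, xi in enumerate(x):
                delta[i] += eta * (d - o) * xi
        if error < target_error:        # E(W) satisfactory: stop
            break
        w = [wi + di for wi, di in zip(w, delta)]
    return w

C1 = [(1, 1, 1), (1, 1, -1), (1, 0, -1)]
C2 = [(1, -1, -1), (1, -1, 1), (1, 0, 1)]
data = [(x, 1) for x in C1] + [(x, 0) for x in C2]

w1 = train_linear_unit_batch(data, w=[1.0, 0.0, 0.0], max_epochs=1)
print([round(v, 3) for v in w1])        # [0.7, 0.2, -0.1], as in the first pass
```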
Example

Consider the 2-dimensional training set C1 ∪ C2:

C1 = {(1, 1), (1, -1), (0, -1)} with class label 1
C2 = {(-1, -1), (-1, 1), (0, 1)} with class label 0

Train a perceptron on C1 ∪ C2.
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
η = 0.1
wi(k+1) = wi(k) + Δwi;   Δwi = η Σn (dn - on) xin

Fill out this table sequentially (First pass):

| Input       | w(k)      | dn | on | η(dn - on)xin     | Δwi (accumulated)  |
|-------------|-----------|----|----|-------------------|--------------------|
| (1, 1, 1)   | (1, 0, 0) | 1  | 1  | (0, 0, 0)         | (0, 0, 0)          |
| (1, 1, -1)  | (1, 0, 0) | 1  | 1  | (0, 0, 0)         | (0, 0, 0)          |
| (1, 0, -1)  | (1, 0, 0) | 1  | 1  | (0, 0, 0)         | (0, 0, 0)          |
| (1, -1, -1) | (1, 0, 0) | 0  | 1  | (-0.1, 0.1, 0.1)  | (-0.1, 0.1, 0.1)   |
| (1, -1, 1)  | (1, 0, 0) | 0  | 1  | (-0.1, 0.1, -0.1) | (-0.2, 0.2, 0)     |
| (1, 0, 1)   | (1, 0, 0) | 0  | 1  | (-0.1, 0, -0.1)   | (-0.3, 0.2, -0.1)  |

E(W) = 3/2;   w(k+1) = (0.7, 0.2, -0.1)
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
η = 0.1
wi(k+1) = wi(k) + Δwi;   Δwi = η Σn (dn - on) xin

Fill out this table sequentially (Second pass):

| Input       | w(k)             | dn | on  | η(dn - on)xin        | Δwi (accumulated)    |
|-------------|------------------|----|-----|----------------------|----------------------|
| (1, 1, 1)   | (0.7, 0.2, -0.1) | 1  | 0.8 | (0.02, 0.02, 0.02)   | (0.02, 0.02, 0.02)   |
| (1, 1, -1)  | (0.7, 0.2, -0.1) | 1  | 1   | (0, 0, 0)            | (0.02, 0.02, 0.02)   |
| (1, 0, -1)  | (0.7, 0.2, -0.1) | 1  | 0.8 | (0.02, 0, -0.02)     | (0.04, 0.02, 0)      |
| (1, -1, -1) | (0.7, 0.2, -0.1) | 0  | 0.6 | (-0.06, 0.06, 0.06)  | (-0.02, 0.08, 0.06)  |
| (1, -1, 1)  | (0.7, 0.2, -0.1) | 0  | 0.4 | (-0.04, 0.04, -0.04) | (-0.06, 0.12, 0.02)  |
| (1, 0, 1)   | (0.7, 0.2, -0.1) | 0  | 0.6 | (-0.06, 0, -0.06)    | (-0.12, 0.12, -0.04) |

E(W) = 0.96/2;   w(k+1) = (0.58, 0.32, -0.14)
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
η = 0.1
wi(k+1) = wi(k) + Δwi;   Δwi = η Σn (dn - on) xin

Fill out this table sequentially (Third pass):

| Input       | w(k)                | dn | on   | η(dn - on)xin           | Δwi (accumulated)       |
|-------------|---------------------|----|------|-------------------------|-------------------------|
| (1, 1, 1)   | (0.58, 0.32, -0.14) | 1  | 0.76 | (0.024, 0.024, 0.024)   | (0.024, 0.024, 0.024)   |
| (1, 1, -1)  | (0.58, 0.32, -0.14) | 1  | 1.04 | (-0.004, -0.004, 0.004) | (0.02, 0.02, 0.028)     |
| (1, 0, -1)  | (0.58, 0.32, -0.14) | 1  | 0.72 | (0.028, 0, -0.028)      | (0.048, 0.04, 0)        |
| (1, -1, -1) | (0.58, 0.32, -0.14) | 0  | 0.4  | (-0.04, 0.04, 0.04)     | (0.008, 0.08, 0.04)     |
| (1, -1, 1)  | (0.58, 0.32, -0.14) | 0  | 0.12 | (-0.012, 0.012, -0.012) | (-0.004, 0.092, 0.028)  |
| (1, 0, 1)   | (0.58, 0.32, -0.14) | 0  | 0.44 | (-0.044, 0, -0.044)     | (-0.048, 0.092, -0.016) |

E(W) = 0.5056/2;   w(k+1) = (0.532, 0.412, -0.156)

• Because the error surface contains only a single global
minimum, this algorithm will converge to a weight vector
with minimum error, regardless of whether the training
examples are linearly separable, provided a sufficiently
small learning rate η is used
• If η is too large
– Gradient descent risks overstepping the minimum in the error surface
– A common remedy is to gradually reduce the value of η as the number of iterations grows
Example (large η value)
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
η = 1
Fill out this table sequentially (First pass):

| Input       | w(k)      | d | o | (d - o)xi   | Δwi (accumulated) |
|-------------|-----------|---|---|-------------|-------------------|
| (1, 1, 1)   | (1, 0, 0) | 1 | 1 | (0, 0, 0)   | (0, 0, 0)         |
| (1, 1, -1)  | (1, 0, 0) | 1 | 1 | (0, 0, 0)   | (0, 0, 0)         |
| (1, 0, -1)  | (1, 0, 0) | 1 | 1 | (0, 0, 0)   | (0, 0, 0)         |
| (1, -1, -1) | (1, 0, 0) | 0 | 1 | (-1, 1, 1)  | (-1, 1, 1)        |
| (1, -1, 1)  | (1, 0, 0) | 0 | 1 | (-1, 1, -1) | (-2, 2, 0)        |
| (1, 0, 1)   | (1, 0, 0) | 0 | 1 | (-1, 0, -1) | (-3, 2, -1)       |

E(W) = 3/2;   w(k+1) = (-2, 2, -1)
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
η = 1
Fill out this table sequentially (Second pass):

| Input       | w(k)        | d | o  | (d - o)xi    | Δwi (accumulated) |
|-------------|-------------|---|----|--------------|-------------------|
| (1, 1, 1)   | (-2, 2, -1) | 1 | -1 | (2, 2, 2)    | (2, 2, 2)         |
| (1, 1, -1)  | (-2, 2, -1) | 1 | 1  | (0, 0, 0)    | (2, 2, 2)         |
| (1, 0, -1)  | (-2, 2, -1) | 1 | -1 | (2, 0, -2)   | (4, 2, 0)         |
| (1, -1, -1) | (-2, 2, -1) | 0 | -3 | (3, -3, -3)  | (7, -1, -3)       |
| (1, -1, 1)  | (-2, 2, -1) | 0 | -5 | (5, -5, 5)   | (12, -6, 2)       |
| (1, 0, 1)   | (-2, 2, -1) | 0 | -3 | (3, 0, 3)    | (15, -6, 5)       |

E(W) = 51/2;   w(k+1) = (13, -4, 4)
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
η = 1
Fill out this table sequentially (Third pass):

| Input       | w(k)        | d | o  | (d - o)xi       | Δwi (accumulated) |
|-------------|-------------|---|----|-----------------|-------------------|
| (1, 1, 1)   | (13, -4, 4) | 1 | 13 | (-12, -12, -12) | (-12, -12, -12)   |
| (1, 1, -1)  | (13, -4, 4) | 1 | 5  | (-4, -4, 4)     | (-16, -16, -8)    |
| (1, 0, -1)  | (13, -4, 4) | 1 | 9  | (-8, 0, 8)      | (-24, -16, 0)     |
| (1, -1, -1) | (13, -4, 4) | 0 | 13 | (-13, 13, 13)   | (-37, -3, 13)     |
| (1, -1, 1)  | (13, -4, 4) | 0 | 21 | (-21, 21, -21)  | (-58, 18, -8)     |
| (1, 0, 1)   | (13, -4, 4) | 0 | 17 | (-17, 0, -17)   | (-75, 18, -25)    |

E(W) = 1123/2;   w(k+1) = (-62, 14, -21)
Local minimum
Outline
• Perceptron for Classification
– Perceptron training rule
– Why does the perceptron training rule work?
– Convergence theorem
Incremental (Stochastic) Gradient Descent
• Batch mode: gradient descent over the entire data set D
  – w(k+1) = w(k) - η ∇ED(W)
  – ED(W) = ½ Σn (dn - on)²
• Incremental mode: gradient descent over individual training examples
  – w(k+1) = w(k) - η ∇En(W)
  – En(W) = ½ (dn - on)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.
Weights Update Rule: incremental mode
∂E(W)/∂w = ∂/∂w [½ (dn - on)²] = ∂/∂w [½ (dn - wTx(n))²] = -(dn - on) x(n)

The resulting rule for the weight update:

w(k+1) = w(k) - η ∂E(W)/∂w
w(k+1) = w(k) + η (dn - on) x(n)

k = 1;
initialize wi(k) randomly; calculate ED(W)
while (ED(W) unsatisfactory AND k < max_iterations)
    select an example (x(n), dn)
    Δwi = η (dn - on) xin
    wi(k+1) = wi(k) + Δwi
    calculate ED(W)
    k = k + 1;
end-while;
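A Python sketch of the incremental rule (the function name is illustrative); with w = (1, 0, 0), η = 0.1, and the presentation order of the table below, one pass through the six examples ends near (0.728, 0.19, -0.072):

```python
def train_linear_unit_incremental(data, w, eta=0.1, passes=1):
    """Incremental (stochastic) gradient descent: update w after each example."""
    for _ in range(passes):
        for x, d in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (d - o) * xi for wi, xi in zip(w, x)]
    return w

C1 = [(1, 1, 1), (1, 1, -1), (1, 0, -1)]
C2 = [(1, -1, -1), (1, -1, 1), (1, 0, 1)]
data = [(x, 1) for x in C1] + [(x, 0) for x in C2]

w1 = train_linear_unit_incremental(data, [1.0, 0.0, 0.0])
print([round(v, 3) for v in w1])   # [0.728, 0.19, -0.072], matching the table below
```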
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
η = 0.1
wi(k+1) = wi(k) + η (d - o) xi

Fill out this table sequentially (First pass):

| Input       | wi(k)              | d | o    | η(d - o)xi           | wi(k+1)               |
|-------------|--------------------|---|------|----------------------|-----------------------|
| (1, 1, 1)   | (1, 0, 0)          | 1 | 1    | (0, 0, 0)            | (1, 0, 0)             |
| (1, 1, -1)  | (1, 0, 0)          | 1 | 1    | (0, 0, 0)            | (1, 0, 0)             |
| (1, 0, -1)  | (1, 0, 0)          | 1 | 1    | (0, 0, 0)            | (1, 0, 0)             |
| (1, -1, -1) | (1, 0, 0)          | 0 | 1    | (-0.1, 0.1, 0.1)     | (0.9, 0.1, 0.1)       |
| (1, -1, 1)  | (0.9, 0.1, 0.1)    | 0 | 0.9  | (-0.09, 0.09, -0.09) | (0.81, 0.19, 0.01)    |
| (1, 0, 1)   | (0.81, 0.19, 0.01) | 0 | 0.82 | (-0.082, 0, -0.082)  | (0.728, 0.19, -0.072) |
Perceptron Learning Rule vs. Gradient Descent
• The perceptron learning rule is guaranteed to succeed if
  – the training examples are linearly separable
  – the learning rate η is sufficiently small
• Gradient descent is guaranteed to converge to the hypothesis with minimum squared error
  – given a sufficiently small learning rate η
  – even when the training data contains noise
  – even when the training data is not separable by H

|                    | Perceptron learning rule                  | Gradient descent                     |
|--------------------|-------------------------------------------|--------------------------------------|
| Architecture       | Single-layer                              | Single-layer                         |
| Neuron model       | Non-linear (sign function)                | Linear                               |
| Learning algorithm | Minimize number of misclassified examples | Minimize total squared error         |
| Application        | Linear classification                     | Linear classification and regression |
Outline
• Perceptron for Classification
– Perceptron training rule
– Why does the perceptron training rule work?
– Convergence theorem
