Single Layer Neural Network
Xingquan (Hill) Zhu

Outline
• Perceptron for Classification
  – Perceptron training rule
  – Why does the perceptron training rule work?
  – Gradient descent learning rule
  – Incremental stochastic gradient descent
• Delta Rule (Adaline: Adaptive Linear Element)

Perceptron: Architecture
• We consider a feed-forward NN with one layer.
• It is sufficient to study single-layer perceptrons with just one neuron.
• Generalization to single-layer perceptrons with more neurons is easy because:
  – The output units are independent of each other.
  – Each weight affects only one of the outputs.

Perceptron: Neuron Model
• The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function, the sign function.

Perceptron for Classification
• The perceptron is used for binary classification.
• Given training examples of classes C1 and C2, train the perceptron so that it classifies the training examples correctly:
  – If the output of the perceptron is +1, the input is assigned to class C1.
  – If the output is -1, the input is assigned to class C2.

Perceptron Training
• How can we train a perceptron for a classification task?
• We try to find values for the weights such that the training examples are correctly classified.
• Geometrically, we try to find a hyper-plane that separates the examples of the two classes.

Perceptron: Geometric View
• For real-valued 2D input vectors, the equation w1x1 + w2x2 + w0 = 0 describes a (hyper-)plane in the input space. The plane splits the input space into two regions, each corresponding to one class:
  – Decision region for C1: w1x1 + w2x2 + w0 ≥ 0
  – Decision boundary: w1x1 + w2x2 + w0 = 0
[Figure: decision boundary in the (x1, x2) plane, with the C1 region on one side and the C2 region on the other.]

Example: AND
• Here is a representation of the AND function.
• White means false, black means true for the output.
• -1 means false, +1 means true for the input.
  -1 AND -1 = false
  -1 AND +1 = false
  +1 AND -1 = false
  +1 AND +1 = true

Example: AND (continued)
• A linear decision surface separates the false instances from the true instances.
• Watch a perceptron learn the AND function. [Animation in the original slides.]

Example: XOR
• Here is the XOR function:
  -1 XOR -1 = false
  -1 XOR +1 = true
  +1 XOR -1 = true
  +1 XOR +1 = false
• Perceptrons cannot learn such linearly inseparable functions.
• Watch a perceptron try to learn XOR. [Animation in the original slides.]

Example: Recognizing a "3"
• An 8x8 pattern of +1/-1 inputs representing the digit 3:
  -1 -1 -1 -1 -1 -1 -1 -1
  -1 -1 +1 +1 +1 +1 -1 -1
  -1 -1 -1 -1 -1 +1 -1 -1
  -1 -1 -1 +1 +1 +1 -1 -1
  -1 -1 -1 -1 -1 +1 -1 -1
  -1 -1 -1 -1 -1 +1 -1 -1
  -1 -1 +1 +1 +1 +1 -1 -1
  -1 -1 -1 -1 -1 -1 -1 -1
• How can we train a perceptron to recognize this 3?
• Assign -1 to the weights of inputs that are equal to -1, +1 to the weights of inputs that are equal to +1, and -63 to the bias.
• Then the output of the perceptron will be +1 when presented with a "perfect" 3, and at most -1 for all other patterns.

Example: Recognizing a corrupted "3"
• The same pattern with one corrupted bit (the -1 in row 5, column 2 flipped to +1):
  -1 -1 -1 -1 -1 -1 -1 -1
  -1 -1 +1 +1 +1 +1 -1 -1
  -1 -1 -1 -1 -1 +1 -1 -1
  -1 -1 -1 +1 +1 +1 -1 -1
  -1 +1 -1 -1 -1 +1 -1 -1
  -1 -1 -1 -1 -1 +1 -1 -1
  -1 -1 +1 +1 +1 +1 -1 -1
  -1 -1 -1 -1 -1 -1 -1 -1
• What if a slightly different 3 is to be recognized, like the one above?
• The original 3 with one corrupted bit produces a weighted sum of 62 - 63 = -1.
• If the bias is instead set to -61, this corrupted 3 will also be recognized, as will any pattern with a single corrupted bit.
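As a concrete illustration of the neuron model above, here is a minimal sketch (Python/NumPy, not from the slides) of a perceptron computing the AND function over {-1, +1} inputs. The weights w1 = w2 = 1 and bias w0 = -1 are one hand-picked choice that works; the slides do not specify them.

```python
import numpy as np

def perceptron_output(w, x):
    """McCulloch-Pitts perceptron: sign of the weighted sum.
    w = [w0, w1, ..., wm] (bias first), x = [1, x1, ..., xm] (augmented input)."""
    return 1 if np.dot(w, x) >= 0 else -1

# AND over {-1, +1} inputs. The weights below (w0 = -1, w1 = w2 = 1) are an
# assumed, hand-picked solution for illustration; they are not given in the slides.
w = np.array([-1.0, 1.0, 1.0])
for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(f"{x1:+d} AND {x2:+d} -> {perceptron_output(w, np.array([1.0, x1, x2]))}")
# Only the input (+1, +1) yields +1 (true); the other three patterns yield -1 (false).
```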
Perceptron: Learning Algorithm
• Variables and parameters at iteration n of the learning algorithm:
  – x(n) = input vector = [+1, x1(n), x2(n), …, xm(n)]^T
  – w(n) = weight vector = [b(n), w1(n), w2(n), …, wm(n)]^T
  – b(n) = bias
  – a(n) = actual response from the perceptron
  – d(n) = desired response
  – η = learning rate parameter (a real number)
• Too small an η produces slow convergence.
• Too large an η can cause oscillations in the process.

Perceptron Training Rule
  k = 1; initialize wi(k) randomly;
  while (there is a misclassified training example)
    Select a misclassified example (x(n), d(n))
    wi(k+1) = wi(k) + Δwi, where Δwi = η·(d(n) - a(n))·xi(n);
    k = k + 1;
  end-while;
  η = learning rate parameter (a real number)

Example
Consider the 2-dimensional training set C1 ∪ C2:
  C1 = {(1, 1), (1, -1), (0, -1)} with class label 1
  C2 = {(-1, -1), (-1, 1), (0, 1)} with class label 0
Train a perceptron on C1 ∪ C2.

Example
With augmented inputs (the leading +1 is the bias input):
  C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
  C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
Fill out this table sequentially (first pass):

  Input        Weight      Desired  Actual  Update?  New weight
  (1, 1, 1)    (1, 0, 0)   1        1       No       (1, 0, 0)
  (1, 1, -1)   (1, 0, 0)   1        1       No       (1, 0, 0)
  (1, 0, -1)   (1, 0, 0)   1        1       No       (1, 0, 0)
  (1, -1, -1)  (1, 0, 0)   0        1       Yes      (0, 1, 1)
  (1, -1, 1)   (0, 1, 1)   0        0       No       (0, 1, 1)
  (1, 0, 1)    (0, 1, 1)   0        1       Yes      (-1, 1, 0)

Example
Fill out this table sequentially (second pass):

  Input        Weight      Desired  Actual  Update?  New weight
  (1, 1, 1)    (-1, 1, 0)  1        0       Yes      (0, 2, 1)
  (1, 1, -1)   (0, 2, 1)   1        1       No       (0, 2, 1)
  (1, 0, -1)   (0, 2, 1)   1        0       Yes      (1, 2, 0)
  (1, -1, -1)  (1, 2, 0)   0        0       No       (1, 2, 0)
  (1, -1, 1)   (1, 2, 0)   0        0       No       (1, 2, 0)
  (1, 0, 1)    (1, 2, 0)   0        1       Yes      (0, 2, -1)

Example
Fill out this table sequentially (third pass):

  Input        Weight      Desired  Actual  Update?  New weight
  (1, 1, 1)    (0, 2, -1)  1        1       No       (0, 2, -1)
  (1, 1, -1)   (0, 2, -1)  1        1       No       (0, 2, -1)
  (1, 0, -1)   (0, 2, -1)  1        1       No       (0, 2, -1)
  (1, -1, -1)  (0, 2, -1)  0        0       No       (0, 2, -1)
  (1, -1, 1)   (0, 2, -1)  0        0       No       (0, 2, -1)
  (1, 0, 1)    (0, 2, -1)  0        0       No       (0, 2, -1)

At epoch 3 there are no weight changes, so the algorithm stops.
Final weight vector: (0, 2, -1). The decision hyperplane is 2x1 - x2 = 0.

Result
[Figure: the six training points in the (x1, x2) plane, C1 marked "+" and C2 marked "-", separated by the decision boundary 2x1 - x2 = 0, with the weight vector w drawn normal to the boundary.]
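The three passes above can be checked with a short script. This is a minimal sketch (Python/NumPy, not part of the original slides), assuming the convention implied by the tables: learning rate η = 1 and a threshold output of 1 when w·x > 0 and 0 otherwise; the function name is illustrative.

```python
import numpy as np

# Augmented training set from the example: x = [1, x1, x2], desired labels 0/1.
X = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1],    # class C1 -> 1
              [1, -1, -1], [1, -1, 1], [1, 0, 1]])  # class C2 -> 0
d = np.array([1, 1, 1, 0, 0, 0])

def train_perceptron(X, d, w, eta=1.0, max_epochs=100):
    """Perceptron training rule: w <- w + eta*(d(n) - a(n))*x(n) on misclassified examples."""
    for _ in range(max_epochs):
        updated = False
        for x, target in zip(X, d):
            a = 1 if np.dot(w, x) > 0 else 0   # assumed threshold convention (0/1 output)
            if a != target:
                w = w + eta * (target - a) * x
                updated = True
        if not updated:          # a full pass without updates: training has converged
            break
    return w

print(train_perceptron(X, d, w=np.array([1.0, 0.0, 0.0])))
# -> [ 0.  2. -1.], i.e. the decision hyperplane 2*x1 - x2 = 0 found in the third pass.
```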
Some Unhappiness About Perceptron Training
• The perceptron learning rule fails to converge if the examples are not linearly separable.
  – It can only model linearly separable classes, like (those described by) the following Boolean functions: AND and OR, but not XOR.
• When a perceptron gives the right answer, no learning takes place.
• Anything below the threshold is interpreted as "no", even if it is just below the threshold.
  – Might it be better to train the neuron based on how far below the threshold it is?

Outline
• Perceptron for Classification
  – Perceptron training rule
  – Why does the perceptron training rule work?
  – Convergence theorem
  – Gradient descent learning rule
  – Incremental stochastic gradient descent
• Delta Rule (Adaline: Adaptive Linear Element)

Gradient Descent Learning Rule
• Gradient descent: consider a linear unit without a threshold, with continuous output o (not just -1/+1):
  o(x) = w0 + w1x1 + … + wmxm = W^T·X
• The squared error (where D is the set of training examples):
  – For an example (x, d), the error e(w) of the network is
    e(w) = d - o(x) = d - Σ_{j=0..m} wj·xj
  – The squared error of (x, d) is E(w) = ½·e²
  – The total squared error is
    E(W) = E(w1, …, wm) = ½·Σ_{(x,d)∈D} (d - o(x))² = ½·Σ_{(x,d)∈D} (d - W^T·X)²
• Update the weights wi so that E(W) is minimized:
  wi(k+1) = wi(k) + Δwi

Gradient Descent Learning Rule
• Start from an arbitrary point in the weight space.
• The direction in which the error E of an example (as a function of the weights) decreases most rapidly is the opposite of the gradient of E:
  gradient of E(W) = [∂E/∂w1, …, ∂E/∂wm]
• Take a small step (of size η) in that direction:
  w(k+1) = w(k) - η·(gradient of E(W))
[Figure: one gradient step in weight space, from (w1, w2) to (w1 + Δw1, w2 + Δw2).]

Gradient Descent Learning Rule
• Train the wi so that they minimize the squared error
  E(w1, …, wm) = ½·Σ_{n∈D} (dn - on)²
• Gradient: ∇E(w) = [∂E/∂w0, …, ∂E/∂wm]
• Δw = -η·∇E(w), so for each weight:
  Δwi = -η·∂E/∂wi
      = -η·∂/∂wi [½·Σn (dn - on)²]
      = -η·∂/∂wi [½·Σn (dn - Σi wi·xin)²]
      = -η·Σn (dn - on)·(-xin)
      = η·Σn (dn - on)·xin
• Gradient descent learning rule:
  wi(k+1) = wi(k) + Δwi = wi(k) + η·Σn (dn - on)·xin

Gradient-Descent-Learning(D, η)
A training example is written (x(n), dn) or ((x1n, …, xmn), dn), where (x1n, …, xmn) are the input values and dn is the desired output; η is the learning rate (e.g. 0.1).
• k = 1; randomly initialize wi(k); calculate E(W)
• While (E(W) is unsatisfactory AND k < max_iterations)
  – Initialize each Δwi to zero
  – For each instance (x(n), dn) in D:
    • Calculate the network output on = Σi wi(k)·xin
    • For each weight dimension wi(k), i = 0, 1, …, m: Δwi = Δwi + η·(dn - on)·xin
  – For each weight dimension wi(k), i = 0, 1, …, m: wi(k+1) = wi(k) + Δwi
  – Calculate E(W) based on the updated wi(k+1)
  – k = k + 1
• EndWhile
(A code sketch of this batch update follows the worked examples below.)

Example
Consider the 2-dimensional training set C1 ∪ C2:
  C1 = {(1, 1), (1, -1), (0, -1)} with class label 1
  C2 = {(-1, -1), (-1, 1), (0, 1)} with class label 0
Train a perceptron on C1 ∪ C2.

Example (batch gradient descent, η = 0.1)
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
Update rule: wi(k+1) = wi(k) + Δwi, with Δwi = η·Σn (dn - on)·xin
Fill out this table sequentially (first pass):

  Input        w(k)       dn  on  η(dn-on)·x(n)       Δw (running sum)
  (1, 1, 1)    (1, 0, 0)  1   1   (0, 0, 0)           (0, 0, 0)
  (1, 1, -1)   (1, 0, 0)  1   1   (0, 0, 0)           (0, 0, 0)
  (1, 0, -1)   (1, 0, 0)  1   1   (0, 0, 0)           (0, 0, 0)
  (1, -1, -1)  (1, 0, 0)  0   1   (-0.1, 0.1, 0.1)    (-0.1, 0.1, 0.1)
  (1, -1, 1)   (1, 0, 0)  0   1   (-0.1, 0.1, -0.1)   (-0.2, 0.2, 0)
  (1, 0, 1)    (1, 0, 0)  0   1   (-0.1, 0, -0.1)     (-0.3, 0.2, -0.1)

  E(W) = 3/2;  w(k+1) = (0.7, 0.2, -0.1)

Example (batch gradient descent, η = 0.1)
Fill out this table sequentially (second pass):

  Input        w(k)              dn  on   η(dn-on)·x(n)          Δw (running sum)
  (1, 1, 1)    (0.7, 0.2, -0.1)  1   0.8  (0.02, 0.02, 0.02)     (0.02, 0.02, 0.02)
  (1, 1, -1)   (0.7, 0.2, -0.1)  1   1    (0, 0, 0)              (0.02, 0.02, 0.02)
  (1, 0, -1)   (0.7, 0.2, -0.1)  1   0.8  (0.02, 0, -0.02)       (0.04, 0.02, 0)
  (1, -1, -1)  (0.7, 0.2, -0.1)  0   0.6  (-0.06, 0.06, 0.06)    (-0.02, 0.08, 0.06)
  (1, -1, 1)   (0.7, 0.2, -0.1)  0   0.4  (-0.04, 0.04, -0.04)   (-0.06, 0.12, 0.02)
  (1, 0, 1)    (0.7, 0.2, -0.1)  0   0.6  (-0.06, 0, -0.06)      (-0.12, 0.12, -0.04)

  E(W) = 0.96/2;  w(k+1) = (0.58, 0.32, -0.14)

Example (batch gradient descent, η = 0.1)
Fill out this table sequentially (third pass):
  Input        w(k)                 dn  on    η(dn-on)·x(n)            Δw (running sum)
  (1, 1, 1)    (0.58, 0.32, -0.14)  1   0.76  (0.024, 0.024, 0.024)    (0.024, 0.024, 0.024)
  (1, 1, -1)   (0.58, 0.32, -0.14)  1   1.04  (-0.004, -0.004, 0.004)  (0.02, 0.02, 0.028)
  (1, 0, -1)   (0.58, 0.32, -0.14)  1   0.72  (0.028, 0, -0.028)       (0.048, 0.02, 0)
  (1, -1, -1)  (0.58, 0.32, -0.14)  0   0.4   (-0.04, 0.04, 0.04)      (0.008, 0.06, 0.04)
  (1, -1, 1)   (0.58, 0.32, -0.14)  0   0.12  (-0.012, 0.012, -0.012)  (-0.004, 0.072, 0.028)
  (1, 0, 1)    (0.58, 0.32, -0.14)  0   0.44  (-0.044, 0, -0.044)      (-0.048, 0.072, -0.016)

  E(W) = 0.5056/2;  w(k+1) = (0.532, 0.392, -0.156)

Gradient Descent Learning Rule
• Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small learning rate is used.
• If η is too large:
  – Gradient descent may overstep the minimum in the error surface.
  – Remedy: gradually reduce the value of η as the number of gradient descent steps grows.

Example (large learning rate, η = 1)
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
Fill out this table sequentially (first pass):

  Input        w(k)       d  o  η(d-o)·x     Δw (running sum)
  (1, 1, 1)    (1, 0, 0)  1  1  (0, 0, 0)    (0, 0, 0)
  (1, 1, -1)   (1, 0, 0)  1  1  (0, 0, 0)    (0, 0, 0)
  (1, 0, -1)   (1, 0, 0)  1  1  (0, 0, 0)    (0, 0, 0)
  (1, -1, -1)  (1, 0, 0)  0  1  (-1, 1, 1)   (-1, 1, 1)
  (1, -1, 1)   (1, 0, 0)  0  1  (-1, 1, -1)  (-2, 2, 0)
  (1, 0, 1)    (1, 0, 0)  0  1  (-1, 0, -1)  (-3, 2, -1)

  E(W) = 3/2;  w(k+1) = (-2, 2, -1)

Example (large learning rate, η = 1)
Fill out this table sequentially (second pass):

  Input        w(k)         d  o   η(d-o)·x     Δw (running sum)
  (1, 1, 1)    (-2, 2, -1)  1  -1  (2, 2, 2)    (2, 2, 2)
  (1, 1, -1)   (-2, 2, -1)  1  1   (0, 0, 0)    (2, 2, 2)
  (1, 0, -1)   (-2, 2, -1)  1  -1  (2, 0, -2)   (4, 2, 0)
  (1, -1, -1)  (-2, 2, -1)  0  -3  (3, -3, -3)  (7, -1, -3)
  (1, -1, 1)   (-2, 2, -1)  0  -5  (5, -5, 5)   (12, -6, 2)
  (1, 0, 1)    (-2, 2, -1)  0  -3  (3, 0, 3)    (15, -6, 5)

  E(W) = 51/2;  w(k+1) = (13, -4, 4)

Example (large learning rate, η = 1)
Fill out this table sequentially (third pass):

  Input        w(k)         d  o   η(d-o)·x
  (1, 1, 1)    (13, -4, 4)  1  13  (-12, -12, -12)
  (1, 1, -1)   (13, -4, 4)  1  5   (-4, -4, 4)
  (1, 0, -1)   (13, -4, 4)  1  9   (-8, 0, 8)
  (1, -1, -1)  (13, -4, 4)  0  13  (-13, 13, 13)
  (1, -1, 1)   (13, -4, 4)  0  21  (-21, 21, -21)
  (1, 0, 1)    (13, -4, 4)  0  17  (-17, 0, -17)

  Sum of updates: Δw = (-75, 18, -25)
  E(W) = 1123/2;  w(k+1) = (-62, 14, -21)

[Figure: local minimum (illustration in the original slides).]
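The two batch runs above (η = 0.1 converging, η = 1 oscillating and diverging) can be reproduced with a short script. This is a minimal sketch in Python/NumPy, not from the slides; the function name is illustrative.

```python
import numpy as np

# Augmented training set from the examples: x = [1, x1, x2], continuous target d.
X = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1],
              [1, -1, -1], [1, -1, 1], [1, 0, 1]], dtype=float)
d = np.array([1, 1, 1, 0, 0, 0], dtype=float)

def batch_gradient_descent(X, d, w, eta, epochs):
    """Batch mode: accumulate delta_w = eta * sum_n (d_n - o_n) * x(n) over the whole
    training set D, then apply a single weight update per epoch."""
    for k in range(epochs):
        o = X @ w                        # linear unit outputs o_n = w . x(n)
        E = 0.5 * np.sum((d - o) ** 2)   # total squared error with the current weights
        w = w + eta * (d - o) @ X        # one batch update
        print(f"epoch {k + 1}: E(W) = {E:.4f}, w(k+1) = {np.round(w, 3)}")
    return w

w0 = np.array([1.0, 0.0, 0.0])
batch_gradient_descent(X, d, w0, eta=0.1, epochs=3)  # E(W): 1.5, 0.48, 0.2528 (decreasing)
batch_gradient_descent(X, d, w0, eta=1.0, epochs=3)  # E(W): 1.5, 25.5, 561.5 (diverging)
```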
Outline
• Perceptron for Classification
  – Perceptron training rule
  – Why does the perceptron training rule work?
  – Convergence theorem
  – Gradient descent learning rule
  – Incremental stochastic gradient descent
• Delta Rule (Adaline: Adaptive Linear Element)

Incremental Stochastic Gradient Descent
• Batch mode (gradient descent):
  – w(k+1) = w(k) - η·∇E_D(W), computed over the entire data set D
  – E_D(W) = ½·Σn (dn - on)²
• Incremental mode (stochastic gradient descent):
  – w(k+1) = w(k) - η·∇E_n(W), computed over individual training examples
  – E_n(W) = ½·(dn - on)²
• Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.

Weights Update Rule: Incremental Mode
• Computation of the gradient for a single example (x(n), dn):
  ∂E(W)/∂w = ∂/∂w [½·(dn - on)²] = ∂/∂w [½·(dn - w^T·x(n))²] = -(dn - on)·x(n)
• Delta rule (AdaLine: Adaptive Linear Element) for the weight update:
  w(k+1) = w(k) - η·∂E(W)/∂w
  w(k+1) = w(k) + η·(dn - on)·x(n)

Delta Rule (AdaLine) Learning Algorithm
  k = 1; initialize wi(k) randomly; calculate E_D(W);
  while (E_D(W) is unsatisfactory AND k < max_iterations)
    Select an example (x(n), dn)
    Δwi = η·(dn - on)·xin
    wi(k+1) = wi(k) + Δwi
    Calculate E_D(W)
    k = k + 1;
  end-while;

Example (incremental mode, η = 0.1)
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
Update rule: wi(k+1) = wi(k) + η·(d - o)·xi
Fill out this table sequentially (first pass):

  Input        wi(k)               d  o     η(d-o)·xi             wi(k+1)
  (1, 1, 1)    (1, 0, 0)           1  1     (0, 0, 0)             (1, 0, 0)
  (1, 1, -1)   (1, 0, 0)           1  1     (0, 0, 0)             (1, 0, 0)
  (1, 0, -1)   (1, 0, 0)           1  1     (0, 0, 0)             (1, 0, 0)
  (1, -1, -1)  (1, 0, 0)           0  1     (-0.1, 0.1, 0.1)      (0.9, 0.1, 0.1)
  (1, -1, 1)   (0.9, 0.1, 0.1)     0  0.9   (-0.09, 0.09, -0.09)  (0.81, 0.19, 0.01)
  (1, 0, 1)    (0.81, 0.19, 0.01)  0  0.82  (-0.082, 0, -0.082)   (0.728, 0.19, -0.072)

(A code sketch of this incremental update appears at the end of the section.)

Perceptron Learning Rule vs. Gradient Descent Rule
• The perceptron learning rule is guaranteed to succeed if:
  – the training examples are linearly separable, and
  – the learning rate is sufficiently small.
• The gradient descent learning rule is:
  – guaranteed to converge to the hypothesis with minimum squared error,
  – given a sufficiently small learning rate,
  – even when the training data contains noise,
  – even when the training data is not separable by H.

Comparison of Perceptron and Adaline

  Aspect              Perceptron                                  Adaline
  Architecture        Single-layer                                Single-layer
  Neuron model        Non-linear                                  Linear
  Learning algorithm  Minimize number of misclassified examples   Minimize total squared error
  Application         Linear classification                       Linear classification and regression

Outline
• Perceptron for Classification
  – Perceptron training rule
  – Why does the perceptron training rule work?
  – Convergence theorem
  – Gradient descent learning rule
  – Incremental stochastic gradient descent
• Delta Rule (Adaline: Adaptive Linear Element)
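To close, here is a minimal sketch (Python/NumPy, not from the slides) of the incremental Delta Rule above; the function name is illustrative. Running one pass from w = (1, 0, 0) with η = 0.1 reproduces the last row of the AdaLine table, w = (0.728, 0.19, -0.072).

```python
import numpy as np

# Augmented training set from the AdaLine example: x = [1, x1, x2], continuous target d.
X = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1],
              [1, -1, -1], [1, -1, 1], [1, 0, 1]], dtype=float)
d = np.array([1, 1, 1, 0, 0, 0], dtype=float)

def adaline_incremental(X, d, w, eta=0.1, passes=1):
    """Delta rule in incremental (stochastic) mode: after every example,
    w <- w + eta * (d_n - o_n) * x(n), where o_n = w . x(n) is the linear output."""
    for _ in range(passes):
        for x, target in zip(X, d):
            o = np.dot(w, x)                  # linear unit: no threshold during training
            w = w + eta * (target - o) * x    # per-example (stochastic) update
    return w

w = adaline_incremental(X, d, w=np.array([1.0, 0.0, 0.0]))
print(np.round(w, 3))   # -> [ 0.728  0.19  -0.072], matching the first-pass table above
```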
