
Learning with Neural Networks
Artificial Intelligence, CMSC 25000
February 19, 2002

Agenda
• Neural Networks:
  – Biological analogy
• Review: single-layer perceptrons
• Perceptron: Pros & Cons
• Neural Networks: Multilayer perceptrons
• Neural net training: Backpropagation
• Strengths & Limitations
• Conclusions

Neurons: The Concept
[Figure: a neuron, with the dendrites, axon, nucleus, and cell body labeled]
• Neurons:
  – Receive inputs from other neurons (via synapses)
  – When the input exceeds a threshold, the neuron "fires"
  – Send output along the axon to other neurons
• Brain: 10^11 neurons, 10^16 synapses

Perceptron Structure
[Figure: a single unit with inputs x0 = -1, x1, x2, x3, ..., xn, weights w0, w1, w2, w3, ..., wn, and output y]
• Single neuron-like element
  – Binary inputs & output
  – Fires when the weighted sum of its inputs exceeds the threshold:
    $y = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, and $y = 0$ otherwise
  – The fixed input $x_0 = -1$ with weight $w_0$ compensates for the threshold
• Training: until the perceptron gives the correct output for all samples,
  – If the perceptron is correct, do nothing
  – If it is wrong and incorrectly says "yes", subtract the input vector from the weight vector
  – Otherwise, add the input vector to the weight vector

Perceptron Learning
• Perceptrons learn linear decision boundaries
[Figure: two scatter plots in the (x1, x2) plane; the first shows two classes separable by a line, the second shows the XOR pattern, which no line can separate]
• Guaranteed to converge, if the data are linearly separable
• Many simple functions (e.g. XOR) are NOT learnable

Neural Nets
• Multi-layer perceptrons
  – Inputs: real-valued
  – Intermediate "hidden" nodes
  – Output(s): one (or more) discrete-valued
[Figure: a network with inputs X1–X4, hidden layers, and outputs Y1, Y2]

Neural Nets
• Pro: More general than perceptrons
  – Not restricted to linear discriminants
  – Multiple outputs: one classification each
• Con: No simple, guaranteed training procedure
  – Use a greedy, hill-climbing procedure to train
  – "Gradient descent", "Backpropagation"

Solving the XOR Problem
• Network topology: 2 hidden nodes, 1 output
[Figure: inputs x1, x2 feed hidden nodes o1, o2 through weights w11, w21, w12, w22, with bias inputs of -1 weighted by w01, w02; the hidden nodes feed output y through w13, w23, with bias weight w03]
• Desired behavior:

  x1  x2  o1  o2  y
  0   0   0   0   0
  1   0   0   1   1
  0   1   0   1   1
  1   1   1   1   0

• Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = -1; w23 = 1
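
These weights can be checked directly. The following is a minimal Python sketch (not part of the original slides) that runs the 2-2-1 step-threshold network above on all four inputs and prints the desired-behavior table; the helper name `step` is my own.

def step(z):
    # Step-threshold unit from the perceptron slides:
    # fires (1) when the weighted sum exceeds 0, otherwise outputs 0
    return 1 if z > 0 else 0

# Weights from the "Solving the XOR Problem" slide
w11 = w12 = 1
w21 = w22 = 1
w01, w02, w03 = 3/2, 1/2, 1/2
w13, w23 = -1, 1

for x1, x2 in ((0, 0), (1, 0), (0, 1), (1, 1)):
    o1 = step(w11*x1 + w21*x2 - w01)  # hidden node 1: fires only on (1, 1)
    o2 = step(w12*x1 + w22*x2 - w02)  # hidden node 2: fires on everything but (0, 0)
    y = step(w13*o1 + w23*o2 - w03)   # output node: o2 AND NOT o1
    print(x1, x2, o1, o2, y)          # reproduces the desired-behavior table: y = x1 XOR x2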
Backpropagation
• Greedy, hill-climbing procedure
  – The weights are the parameters to change
  – The original hill-climb changes one parameter per step: slow
  – If the function is smooth, change all parameters per step: gradient descent
  – Backpropagation: compute the current output, then work backward to correct the error

Producing a Smooth Function
• Key problem:
  – The pure step threshold is discontinuous, so it is not differentiable
• Solution:
  – The sigmoid (squashed "s" function), i.e. the logistic function:
    $z = \sum_{i=1}^{n} w_i x_i$,  $s(z) = \frac{1}{1 + e^{-z}}$

Neural Net Training
• Goal:
  – Determine how to change the weights to get the correct output
  – A large change in a weight should produce a large reduction in the error
• Approach:
  – Compute the actual output: o
  – Compare it to the desired output: d
  – Determine the effect of each weight w on the error d - o
  – Adjust the weights

Neural Net Example
[Figure: a 2-input, 2-hidden-node, 1-output network; inputs x1, x2 feed hidden sums z1, z2 (outputs y1, y2) through w11, w21, w12, w22, with bias weights w01, w02 on -1 inputs; the hidden outputs feed the output sum z3 (output y3) through w13, w23, with bias weight w03]
• Notation: $x_i$ is the $i$th sample input vector, $w$ is the weight vector, and $y_i^*$ is the desired output for the $i$th sample
• Sum-of-squares error over the training samples:
  $E = \frac{1}{2} \sum_i (y_i^* - F(x_i, w))^2$
• Full expression of the output in terms of the inputs and weights:
  $y_3 = F(x, w) = s(w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03})$
(From the MIT 6.034 notes, Lozano-Perez)

Gradient Descent
• Error: the sum-of-squares error of the inputs with the current weights
• Compute the rate of change of the error with respect to each weight
  – Which weights have the greatest effect on the error?
  – Effectively, the partial derivatives of the error with respect to the weights
  – These in turn depend on other weights => chain rule

Gradient Descent
[Figure: the error G(w) plotted against a weight w, showing the slope dG/dw, successive weight values w0, w1, and local minima]
• E = G(w): the error as a function of the weights
• Find the rate of change of the error
• Follow the steepest rate of change
• Change the weights so that the error is minimized

Gradient of Error
• $E = \frac{1}{2} \sum_i (y_i^* - F(x_i, w))^2$, with
  $y_3 = F(x, w) = s(w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03})$
• $\frac{\partial E}{\partial w_j} = -\sum_i (y_i^* - y_3) \frac{\partial y_3}{\partial w_j}$
• Note: derivative of the sigmoid: $\frac{ds(z)}{dz} = s(z)(1 - s(z))$
• For a weight into the output node:
  $\frac{\partial y_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3} \frac{\partial z_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3}\, s(z_1) = \frac{\partial s(z_3)}{\partial z_3}\, y_1$
• For a weight into a hidden node, the chain rule extends one more step:
  $\frac{\partial y_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3} \frac{\partial z_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3} \frac{\partial z_3}{\partial z_1} \frac{\partial z_1}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, w_{13}\, \frac{\partial s(z_1)}{\partial z_1}\, x_1$
(From the MIT AI lecture notes, Lozano-Perez 2000)

From Effect to Update
• Gradient computation: how much each weight contributes to performance
• To train:
  – Need to determine how to CHANGE each weight based on its contribution to performance
  – Need to determine how MUCH change to make per iteration
• Rate parameter 'r':
  – Large enough to learn quickly
  – Small enough to reach, but not overshoot, the target values

Backpropagation Procedure
[Figure: three nodes in sequence, i -> j -> k, with weights $w_{i \to j}$, $w_{j \to k}$ and outputs $o_i$, $o_j$, $o_k$]
• Pick a rate parameter 'r'
• Until performance is good enough:
  – Do the forward computation to calculate the output
  – Compute beta in the output node: $\beta_z = d_z - o_z$
  – Compute beta in all other nodes: $\beta_j = \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k$
  – Compute the change for all weights: $\Delta w_{i \to j} = r\, o_i\, o_j (1 - o_j)\, \beta_j$

Backprop Example
[Figure: the same 2-2-1 network as in the Neural Net Example slide]
• Forward prop: compute $z_i$ and $y_i$ given $x_k$, $w_l$
• Betas:
  $\beta_3 = y_3^* - y_3$
  $\beta_2 = w_{23}\, y_3 (1 - y_3)\, \beta_3$
  $\beta_1 = w_{13}\, y_3 (1 - y_3)\, \beta_3$
• Weight updates:
  $w_{03} \leftarrow w_{03} + r\,(-1)\, y_3 (1 - y_3)\, \beta_3$
  $w_{02} \leftarrow w_{02} + r\,(-1)\, y_2 (1 - y_2)\, \beta_2$
  $w_{01} \leftarrow w_{01} + r\,(-1)\, y_1 (1 - y_1)\, \beta_1$
  $w_{13} \leftarrow w_{13} + r\, y_1\, y_3 (1 - y_3)\, \beta_3$
  $w_{23} \leftarrow w_{23} + r\, y_2\, y_3 (1 - y_3)\, \beta_3$
  $w_{12} \leftarrow w_{12} + r\, x_1\, y_2 (1 - y_2)\, \beta_2$
  $w_{22} \leftarrow w_{22} + r\, x_2\, y_2 (1 - y_2)\, \beta_2$
  $w_{11} \leftarrow w_{11} + r\, x_1\, y_1 (1 - y_1)\, \beta_1$
  $w_{21} \leftarrow w_{21} + r\, x_2\, y_1 (1 - y_1)\, \beta_1$
(From the MIT 6.034 notes, Lozano-Perez)

Backpropagation Observations
• The procedure is (relatively) efficient
  – All computations are local: they use only the inputs and outputs of the current node
• What is "good enough"?
  – The network rarely reaches the target (0 or 1) outputs exactly
  – Typically, train until the outputs are within 0.1 of the targets

Neural Net Summary
• Training:
  – Backpropagation procedure: gradient descent strategy (with its usual problems)
• Prediction:
  – Compute the outputs from the input vector & weights
• Pros: very general; fast prediction
• Cons: training can be VERY slow (1000's of epochs); overfitting

Training Strategies
• Online training:
  – Update the weights after each sample
• Offline (batch) training:
  – Compute the error over all samples, then update the weights
• Online training is "noisy":
  – Sensitive to individual instances
  – However, it may escape local minima

Training Strategy
• To avoid overfitting:
  – Split the data into training, validation, & test sets
  – Also, avoid excess weights (fewer weights than samples)
• Initialize with small random weights
  – Small changes then have a noticeable effect
• Use offline training
  – Train until the validation-set error reaches its minimum
• Evaluate on the test set
  – No more weight changes at this point

Classification
• Neural networks are best suited to classification tasks
  – A single output -> binary classifier
  – Multiple outputs -> multiway classification
  – Applied successfully to learning pronunciation
• The sigmoid pushes outputs toward a binary classification
  – Not good for regression

Neural Net Conclusions
• Simulation based on neurons in the brain
• Perceptrons (single neuron)
  – Guaranteed to find a linear discriminant IF one exists -> problem: XOR
• Neural nets (multi-layer perceptrons)
  – Very general
  – Backpropagation training procedure
  – Gradient descent: local minima and overfitting issues
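
As a closing illustration, here is a minimal Python sketch (not from the original slides) of online backpropagation for the same 2-2-1 sigmoid network, trained on XOR using the beta and weight-update rules above. The rate parameter value, epoch count, random seed, and initialization range are assumptions chosen for illustration; as the summary notes, gradient descent can still stall in a local minimum.

import math
import random

def s(z):
    # Logistic sigmoid from the "Producing a Smooth Function" slide
    return 1.0 / (1.0 + math.exp(-z))

random.seed(1)
# Small random initial weights, per the "Training Strategy" slide
w11, w21, w01, w12, w22, w02, w13, w23, w03 = (
    random.uniform(-0.5, 0.5) for _ in range(9))

samples = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]  # XOR
r = 0.5  # rate parameter 'r' (assumed value)

for epoch in range(20000):  # epoch count is an assumption
    for (x1, x2), y_star in samples:  # online training: update after each sample
        # Forward computation
        y1 = s(w11*x1 + w21*x2 - w01)
        y2 = s(w12*x1 + w22*x2 - w02)
        y3 = s(w13*y1 + w23*y2 - w03)
        # Betas, as on the Backprop Example slide
        b3 = y_star - y3
        b1 = w13 * y3*(1 - y3) * b3
        b2 = w23 * y3*(1 - y3) * b3
        # Weight updates: w(i->j) += r * o_i * o_j * (1 - o_j) * beta_j
        w13 += r * y1 * y3*(1 - y3) * b3
        w23 += r * y2 * y3*(1 - y3) * b3
        w03 += r * (-1) * y3*(1 - y3) * b3
        w11 += r * x1 * y1*(1 - y1) * b1
        w21 += r * x2 * y1*(1 - y1) * b1
        w01 += r * (-1) * y1*(1 - y1) * b1
        w12 += r * x1 * y2*(1 - y2) * b2
        w22 += r * x2 * y2*(1 - y2) * b2
        w02 += r * (-1) * y2*(1 - y2) * b2

for (x1, x2), y_star in samples:
    y1 = s(w11*x1 + w21*x2 - w01)
    y2 = s(w12*x1 + w22*x2 - w02)
    y3 = s(w13*y1 + w23*y2 - w03)
    print(x1, x2, round(y3, 2), y_star)  # outputs rarely hit 0/1 exactly; aim for within ~0.1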
