					Lecture 13 – Perceptrons

     Machine Learning
                  Last Time
• Hidden Markov Models
  – Sequential modeling represented in a Graphical
    Model




                      Today
• Perceptrons
• Leading to
  – Neural Networks
  – aka Multilayer Perceptron Networks
  – But more accurately: Multilayer Logistic
    Regression Networks




 Review: Fitting Polynomial Functions
• Fitting nonlinear 1-D functions
• Polynomial:

• Risk:




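The model and risk appeared as figures in the original slides; a standard formulation, consistent with the notation used later in the lecture (θ for weights, tᵢ for targets, N training points), is:

  y(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_D x^D = \sum_{d=0}^{D} \theta_d x^d

  R(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( t_i - y(x_i) \right)^2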
 Review: Fitting Polynomial Functions
• Order-D polynomial regression for 1D
  variables is the same as D-dimensional linear
  regression
• Extend the feature vector from a scalar.

• More generally



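A sketch of the feature-vector extension described above, first for the polynomial case and then for arbitrary basis functions ϕ_d:

  \mathbf{x} = (1, x, x^2, \dots, x^D)^T \quad\Rightarrow\quad y(x) = \theta^T \mathbf{x}

  \text{More generally: } \phi(x) = \left(\phi_0(x), \phi_1(x), \dots, \phi_D(x)\right)^T, \qquad y(x) = \theta^T \phi(x)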
     Neuron inspires Regression
• Graphical Representation of linear regression
  – McCulloch-Pitts Neuron
  – Note: Not a graphical model




      Neuron inspires Regression
• Edges multiply the signal (xi) by some weight (θi).
• Nodes sum inputs
• Equivalent to Linear Regression




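In symbols (a reconstruction of the figure's equation, using the names given above):

  y = \sum_i \theta_i x_i = \theta^T \mathbf{x}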
     Introducing Basis functions
• Graphical representation of feature extraction



• Edges multiply the signal by a weight.
• Nodes apply a function, ϕd.




     Extension to more features
• Graphical representation of feature extraction




          Combining function
• How do we construct the neural output?




               Linear Neuron

          Combining function
• Sigmoid function or Squashing function




                Logistic Neuron

Classification using the same metaphor.
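The squashing function here is the logistic sigmoid, so the logistic neuron's output is (a standard formulation, not copied from the slide):

  \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad y(x) = \sigma\!\left(\theta^T \phi(x)\right)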
    Logistic Neuron optimization
• Minimizing R(θ) is more difficult




  Bad News: There is no “closed-form” solution.

           Good News: It’s convex.

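The risk itself was a figure in the original; the usual convex choice for a logistic neuron, and the one consistent with the convexity claim above, is the negative log-likelihood (cross-entropy) with targets tᵢ ∈ {0, 1}:

  R(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ t_i \ln y_i + (1 - t_i) \ln (1 - y_i) \right], \qquad y_i = \sigma\!\left(\theta^T \phi(x_i)\right)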
         Aside: Convex Regions
• Convex: for any pair of points xa and xb within
  a region, every point xc on the line segment
  between xa and xb is also in the region
        Aside: Convex Functions
• Convex function: for any pair of points xa and xb,
  the chord connecting (xa, f(xa)) and (xb, f(xb))
  lies on or above the function; equivalently, the
  region above the graph is convex
       Aside: Convex Functions
• Convex functions have a single minimum: any
  local minimum is the global minimum!
• How does this help us?
• (Nearly) guaranteed optimality of gradient
  descent
            Gradient Descent
• The gradient is defined (though we can’t set it
  to zero and solve for θ directly)
• It points in the direction of fastest increase
            Gradient Descent
• Gradient points in the direction of fastest
  increase
• To minimize R, move in the opposite direction




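A standard gradient-descent update matching this description, with learning rate η:

  \theta^{(k+1)} = \theta^{(k)} - \eta \, \nabla_\theta R\!\left(\theta^{(k)}\right)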
            Gradient Descent
• Initialize Randomly
• Update with small steps
• (nearly) guaranteed to converge to the
  minimum




             Gradient Descent
• Initialize Randomly
• Update with small steps
• Can oscillate if η is too large




            Gradient Descent
• Initialize Randomly
• Update with small steps
• Can stall if the gradient is ever 0 when not at the minimum




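A minimal sketch of the procedure the preceding slides animate, written for the linear-regression (squared-error) risk; NumPy, the function name, and the step count are illustrative assumptions, not from the slides:

import numpy as np

def gradient_descent(Phi, t, eta=0.1, n_steps=1000):
    """Minimize R(theta) = mean((t - Phi @ theta)**2) by gradient descent.
    Phi: (N, D) feature matrix, t: (N,) targets, eta: learning rate."""
    n, d = Phi.shape
    theta = 0.01 * np.random.randn(d)        # initialize randomly
    for _ in range(n_steps):
        residual = t - Phi @ theta           # prediction errors
        grad = -2.0 / n * Phi.T @ residual   # gradient of the squared-error risk
        theta = theta - eta * grad           # small step opposite the gradient
    return theta

Too large an η makes the iterates oscillate or diverge, too small makes progress slow, and the loop stalls wherever the gradient is 0, which for a convex risk only happens at the minimum.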
          Back to Neurons




Linear Neuron




                    Logistic Neuron



                 Perceptron
• Classification squashing function




                   Perceptron

• Strictly classification error
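With labels tᵢ ∈ {-1, +1}, a common way to write the perceptron's hard-threshold output (a standard formulation, not copied from the slide) is:

  y(x) = \operatorname{sign}\!\left(\theta^T \phi(x)\right) = \begin{cases} +1 & \text{if } \theta^T \phi(x) \ge 0 \\ -1 & \text{otherwise} \end{cases}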
           Classification Error
• Only count errors when a classification is
  incorrect.



Sigmoid leads to greater-than-zero error on correct classifications.
           Classification Error
• Only count errors when a classification is
  incorrect.



Perceptrons use strictly classification error.
             Perceptron Error
• Can’t do gradient descent on this: the gradient of
  the step-shaped classification error is zero (or
  undefined) everywhere.




Perceptrons use strictly classification error.
                Perceptron Loss
• With classification loss:
• Define Perceptron Loss.
   – Loss calculated for each misclassified data point




                             vs



   – Now piecewise linear risk rather than step function

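The equations were figures in the original; the standard perceptron criterion matching this description, summed over the set M of misclassified points with targets tᵢ ∈ {-1, +1}, is:

  E_P(\theta) = -\sum_{i \in M} \theta^T \phi(x_i)\, t_i, \qquad \nabla_\theta E_P(\theta) = -\sum_{i \in M} \phi(x_i)\, t_i

Each misclassified point contributes a positive penalty that is linear in θ, which is what makes the risk piecewise linear rather than a step function.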
              Perceptron Loss
• Perceptron Loss.
  – Loss calculated for each misclassified data point




  – Nice gradient descent



 Perceptron vs. Logistic Regression
• Logistic Regression has a hard
  time eliminating all errors
  – Errors: 2



• Perceptrons often do better here: only
  misclassified points contribute to the loss, so
  errors carry increased importance.
  – Errors: 0
     Stochastic Gradient Descent
• Instead of calculating the gradient over all
  points and then updating,
• Update the weights for each misclassified
  point during training.

• Update rule (see below):
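A standard stochastic-gradient step on the perceptron criterion, applied only when xᵢ is misclassified (η is the learning rate, tᵢ ∈ {-1, +1}):

  \theta^{(\tau+1)} = \theta^{(\tau)} + \eta\, \phi(x_i)\, t_i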
      Online Perceptron Training
• Online Training
  – Update weights for each data point.
• Iterate over the points xi:
  – If xi is correctly classified, leave θ unchanged
  – Else, apply the update rule above (a sketch follows below)
• Theorem: If the xi in X are linearly separable, then
  this process converges to a θ* with zero training
  error in a finite number of steps.
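A minimal sketch of online perceptron training as described above, assuming labels in {-1, +1} and a bias feature already appended to each xᵢ; the function name and stopping rule are illustrative assumptions:

import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Online perceptron training: update theta only on misclassified points.
    X: (N, D) feature matrix (bias column included), t: (N,) labels in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, t_i in zip(X, t):
            if t_i * (theta @ x_i) <= 0:          # misclassified (or on the boundary)
                theta = theta + eta * t_i * x_i   # perceptron update
                errors += 1
        if errors == 0:                           # zero training error: stop
            break
    return theta

If the classes are linearly separable, this loop stops with zero training errors after finitely many updates (the convergence theorem on this slide); otherwise it runs until max_epochs.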
            Linearly Separable
• Two classes of points are linearly separable,
  iff there exists a line such that all the points of
  one class fall on one side of the line, and all
  the points of the other class fall on the other
  side of the line




                   Next
• Multilayer Neural Networks





				