Support Vector Machine
Chan-Gu Gang, MK Hasan and Ha-Yong Jung
2004/11/17

Learning Theory

 Objective:
        Given objects from two classes and a new object, assign the new
         object to one of the two classes
 This is Binary Pattern Recognition (Binary Classification)

        xi : pattern, case, input, instance, ..
        X : domain (where the values of xi are taken from)
        yi : label, target, output

 In order to map xi values to yi values, we need a notion of
  similarity for the labels yi and for the patterns in X
        For yi : trivial
        For X : ?
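 In the standard notation, the training data are pairs of patterns and labels:

        (x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}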

                         Similarity Measure

 A similarity measure k, given two patterns x and x', returns a real number
  characterizing their similarity
 Such a function k is called a kernel
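 Formally, a kernel is a function

        k : X \times X \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x')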

   Simple Example of Similarity Measure :
              Dot Product
 Dot Product of Vectors
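   For vectors x, x' \in \mathbb{R}^N, the dot product is

        \langle x, x' \rangle = \sum_{i=1}^{N} [x]_i \, [x']_i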

 However, the patterns xi are not yet in the form of vectors.
   The patterns xi don't exist in dot product space yet.
   They can be any kind of object.
 To use the dot product as a similarity measure, we first
   transform the patterns into vectors in a dot product space H

 Three benefits of the transformation (into vector form)
    it lets us define a similarity measure from the dot product in H
    it lets us deal with the patterns geometrically, so we can apply linear
     algebra and analytic geometry
    it gives us the freedom to choose the mapping phi
        enables a large variety of similarity measures and learning algorithms
        lets us change the representation into one that is more suitable for
         the given problem
    Simple Pattern Recognition Algorithm

 Basic idea: Assign a previously unseen pattern to the class with the
  closer mean.
 Means of two classes (+,-)

 With c the midpoint of c+ and c-,

 x will be classified into the class with the closer mean
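 In the usual notation, the class means and the resulting classification rule are

        c_{+} = \frac{1}{m_{+}} \sum_{\{i :\, y_i = +1\}} x_i, \qquad
        c_{-} = \frac{1}{m_{-}} \sum_{\{i :\, y_i = -1\}} x_i

        y = \operatorname{sgn}\big( \langle x - c, \; c_{+} - c_{-} \rangle \big), \qquad
        c = \tfrac{1}{2} (c_{+} + c_{-})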

                             Decision Function

 From the above formulas, the decision function can be written as
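 In the standard expanded form (b is the offset term):

        y = \operatorname{sgn}\Big( \tfrac{1}{m_{+}} \sum_{\{i :\, y_i = +1\}} \langle x, x_i \rangle
            \; - \; \tfrac{1}{m_{-}} \sum_{\{i :\, y_i = -1\}} \langle x, x_i \rangle \; + \; b \Big),
        \qquad b = \tfrac{1}{2} \big( \| c_{-} \|^2 - \| c_{+} \|^2 \big)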



                Parzen Windows Estimators

 Condition for the resulting Decision Function to be Bayes Classifier
        The class means have the same distance to the origin
            b=0

        k is a probability density function
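 When k is a density (e.g. a Gaussian), the two class-conditional Parzen window estimates are

        p_{+}(x) = \frac{1}{m_{+}} \sum_{\{i :\, y_i = +1\}} k(x, x_i), \qquad
        p_{-}(x) = \frac{1}{m_{-}} \sum_{\{i :\, y_i = -1\}} k(x, x_i)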

 The new sample is labeled according to which of p+(x) and p-(x) is larger


                         Can be generalized to
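 In its standard general form, with individual weights a_i and an offset b:

        f(x) = \operatorname{sgn}\Big( \sum_{i=1}^{m} y_i \, a_i \, k(x, x_i) + b \Big)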

     To Make the Classification Technique
             More Sophisticated
 Two ways to make it more sophisticated
        selection of the patterns on which the kernels are centered
            remove the influence of patterns that are very far away from the
             decision boundary, either because we expect that they will not
             improve the generalization error of the decision function, or to
             reduce the computational cost of evaluating it
        choice of the weights ai that are placed on the individual kernels in
         the decision function
            in the simple algorithm above, the weights are fixed to (1/m+) or (1/m-)
            allowing the weights to vary gives a much greater variety of
             decision functions

                      Some Insights from
                  Statistical Learning Theory
 If some exceptions are allowed, the decision boundary is ambiguous

 An almost linear separation of the classes misclassifies the two
  outliers, and also other "easy" points which are so close to the
  decision boundary that the classifier really should be able to get
  them right
 A compromise gets most points right, without putting too much trust
  in any individual data point

                           More into
                  Statistical Learning Theory
 Let us put the above intuitive arguments into a mathematical framework
 Assumption: the data are generated independently from some (unknown)
  probability distribution P(x,y)
       iid (independent and identically distributed)
 Goal: find a function f that will correctly classify unseen
  examples (x,y)
 Measurement of correctness:
        zero-one loss function
           C(x,y,f(x)) := 0.5*|f(x)-y|

 Without a restriction on the set of functions from which
  we choose our estimate f, it might not generalize well.

                  Training Error & Test Error

 Minimizing the training error (empirical risk) does not imply a small
  test error (risk)
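 In the usual notation, with the zero-one loss from above, the risk and the empirical risk are

        R[f] = \int \tfrac{1}{2} \, | f(x) - y | \; dP(x, y), \qquad
        R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \tfrac{1}{2} \, | f(x_i) - y_i |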

 Restrict the set of functions from which f is chosen to one that has a
  capacity suitable for the amount of available training data

                         VC Dimension

 Each function of the class separates the patterns in a
  certain way.
   Labels are in {+1, -1}
   At most 2^m different labelings are possible for m patterns
 “Shatter”: when a function class can realize all 2^m separations,
  it is said to shatter the m points
 VC Dimension: the largest m such that there exists a set of m points
  which the class can shatter (infinity if no such m exists)
        a one-number summary of a learning machine’s capacity
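 A standard example: for oriented hyperplanes in \mathbb{R}^n,

        \mathrm{VCdim}\big( \{\, x \mapsto \operatorname{sgn}(\langle w, x \rangle + b) \,\} \big) = n + 1

  e.g. in the plane, three points in general position can be shattered by lines, but no four points can.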

                         Example of VC Bound
 If h<m is the VC dimension of the class of functions that the learning
  machine can implement,
     For all functions of that class, independent of the underlying probability
       distribution generating the data, the following holds
          with a probability of at least 1-delta
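 In its standard form, the bound reads (phi is the confidence term):

        R[f] \;\le\; R_{\mathrm{emp}}[f] + \phi(h, m, \delta), \qquad
        \phi(h, m, \delta) = \sqrt{ \frac{ h \left( \ln \frac{2m}{h} + 1 \right) + \ln \frac{4}{\delta} }{ m } }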

 To reproduce the random labeling by correctly separating all training
  examples, this machine will require a large VC Dimension h.
   phi(h,m,delta) will be large
   Small training error does not guarantee a small test error
 To get nontrivial predictions from the bound,
    the function class must be restricted so that its capacity is small enough
    yet the class should be large enough to provide functions that can model
     the dependencies hidden in P(x,y)
 The choice of the set of functions is crucial for learning from data
Kernel Machine
                 Hyperplane Classifier
 We have a set of points
 Each point belongs to class +1 or class -1
 The points are linearly separable
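 In the usual notation, a separating hyperplane and its decision function are

        \{\, x \mid \langle w, x \rangle + b = 0 \,\}, \qquad
        f(x) = \operatorname{sgn}\big( \langle w, x \rangle + b \big)

  and linear separability means some (w, b) satisfies y_i (\langle w, x_i \rangle + b) > 0 for all i.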

                         A Point Set

                         Growing Ball

                    Growing Ball
             Several hyperplanes exist

               Growing Balls
        Bigger balls, fewer hyperplanes

                    Growing Balls
              A single hyperplane is left

                          Growing Ball
                         Support vectors

                 Why Maximum Margin
 Generalization capability increases with increasing margin
        We are skipping the proof of this statement
 The problem can be solved using quadratic programming techniques,
  which are quite efficient
 A single global optimum exists.
        This was a turning point for choosing Support Vector Machines
         instead of Neural Networks as the tool.

            How to get the hyperplane with
                 maximum margin
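 In the standard primal formulation, the maximum-margin hyperplane solves

        \min_{w, b} \; \tfrac{1}{2} \, \| w \|^2
        \quad \text{subject to} \quad y_i \big( \langle w, x_i \rangle + b \big) \ge 1, \quad i = 1, \ldots, m

  and the resulting margin is 2 / \| w \|.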

        How to get optimum margin
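 In the standard dual (quadratic programming) formulation, with Lagrange multipliers a_i:

        \max_{a} \; \sum_{i=1}^{m} a_i \; - \; \tfrac{1}{2} \sum_{i, j = 1}^{m} a_i a_j \, y_i y_j \, \langle x_i, x_j \rangle
        \quad \text{subject to} \quad a_i \ge 0, \quad \sum_{i=1}^{m} a_i y_i = 0

  The solution has the form w = \sum_i a_i y_i x_i ; only the support vectors have a_i > 0.
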

     What if the points are not linearly separable?
 We can map the points into some higher-dimensional space using some
  nonlinear transformation so that the points become linearly separable
  in that space.

    How to avoid the computation of the
      mapping into the higher dimension
 We only ever use dot products of the input vectors.
 Let Φ(x) be the function that maps an input vector x to a vector in the
  higher-dimensional space.
 If we can compute k(xi,xj) = Φ(xi)·Φ(xj) without calculating Φ(xi) and
  Φ(xj) individually, then we avoid the cost of explicitly mapping the input
  vectors into the higher-dimensional space, while still being able to use the
  previous formulation, now with a nonlinear decision boundary in input space.
 This k(x,y) is called the kernel function.
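 As a small illustration (not from the original slides), the following Python/NumPy
  sketch checks that, for the 2nd-degree polynomial kernel on R^2, the kernel value
  computed in input space equals the dot product of the explicitly mapped vectors:

    import numpy as np

    # Explicit feature map for the homogeneous 2nd-degree polynomial kernel on R^2:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that <phi(x), phi(y)> = (<x, y>)^2.
    def phi(x):
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def poly_kernel(x, y):
        # Kernel evaluated directly in input space -- no explicit mapping needed.
        return np.dot(x, y) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 0.5])

    print(np.dot(phi(x), phi(y)))   # dot product in the feature space H
    print(poly_kernel(x, y))        # same value, computed in input space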

    Formulation using kernel function
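 In the standard kernelized form, every dot product is replaced by the kernel:

        \max_{a} \; \sum_{i} a_i \; - \; \tfrac{1}{2} \sum_{i, j} a_i a_j \, y_i y_j \, k(x_i, x_j)
        \quad \text{subject to} \quad a_i \ge 0, \quad \sum_{i} a_i y_i = 0

        f(x) = \operatorname{sgn}\Big( \sum_{i} a_i y_i \, k(x_i, x) + b \Big)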

Some Applications
                     Text Categorization
 Why is it needed?
        As the volume of electronic information increases, there is growing interest in
         developing tools to help people better find, filter, and manage these resources.

 What is it?
        The assignment of natural-language texts to one or more predefined categories
         based on their contents

 Text Categorization
        Representing Text: a document is represented as a bag of words
        Feature Selection: needed because the feature dimension is too large
        Machine Learning: train a classifier on the resulting feature vectors
         (a sketch follows below)

 Extended Applications
        Patent classification, Spam-mail filtering, Categorization of Web pages, Automatic
         essay grading(?), …
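 A minimal sketch of the bag-of-words + linear SVM pipeline (illustration only, not
  part of the original experiments; it assumes the scikit-learn library and uses
  made-up toy documents and labels):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Toy corpus: each document is labeled with one of two categories (illustrative only).
    train_docs = ["wheat prices rise on export news",
                  "central bank cuts interest rate",
                  "corn and wheat harvest forecast",
                  "stocks fall as interest rates climb"]
    train_labels = ["grain", "finance", "grain", "finance"]

    # Represent each text as a (sparse, high-dimensional) bag-of-words vector.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs)

    # Train a linear SVM on the sparse document vectors.
    clf = LinearSVC()
    clf.fit(X_train, train_labels)

    # Predicted category for a new document.
    X_new = vectorizer.transform(["interest rate decision expected"])
    print(clf.predict(X_new))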

     Text Categorization with SVM
 Conventional Learning Methods
        Naïve Bayes Classifier
        Rocchio Algorithm
        Decision Tree Classifier
        k-Nearest Neighbors

 Experiments
        Test Collections
           Reuters-21578 dataset
                    9603 training, 3299 test, 90 categories, 9947 distinct terms
                    Direct correspondence, single category
              Ohsumed corpus
                    10000 training, 10000 test, 23 MeSH “diseases” categories, 15561 distinct terms
                    Less direct correspondence, multiple categories

     Text Categorization with SVM

 Results
        On the Reuters collection, SVMs achieve the best results almost
         independently of the choice of parameters, with no overfitting
        SVM is better than k-NN on 62 of the 90 Reuters categories (20 ties),
         which is a significant improvement according to the binomial sign test
        On the Ohsumed collection, SVM outperforms k-NN on all 23 categories
     Text Categorization with SVM
 Why Should SVMs Work Well for Text Categorization?

        High dimensional input space
            SVMs use overfitting protection which does not necessarily depend on the
             number of features

        Few irrelevant features
           Even features ranked lowest still contain considerable information and are
            somewhat relevant
             a good classifier should combine many features

        Document vectors are sparse
           In the mistake-bound model, additive algorithms, which have an inductive
            bias similar to that of SVMs, are well suited for problems with dense
            concepts and sparse instances

        Most text categorization problems are linearly separable
           The idea of SVMs is to find such linear (or polynomial, RBF, etc) separators

     Kernel Methods for Document filtering (MIT)

     Ranking
      1.   Adaptive T11F/U-assessor 1st
      2.   Batch T11F/U-assessor/intersection 1st
      3.   Routing assessor/intersection 1st

     Feature: Words in documents
          Filtering: digits and words occurring fewer than two times are removed
          Title has double weight

     Applying various kernels
          Second-order perceptron (2)
          SVM uneven margin
          SVM + new threshold-selection (3)

     Conclusion
          Good ranking except on the intersection topics
               more complex kernels gave poorer results
          Performance varies from category to category

                         Face Detection
 We can define the face-detection problem as follows:

     1.   Given as input an arbitrary image, which could be a digitized video
          signal or a scanned photograph,

     2.   determine whether there are any human faces in the image,

     3.   and if there are, return an encoding of their location.

     4.   The encoding in this system is to fit each face in a bounding box
          defined by the image coordinates of the corners.

 It can be extended to many applications
         Face-recognition, HCI, surveillance systems, …

    Applying SVMs to Face Detection
 Overview of overall process

       Training on a database of face and
        nonface patterns (fixed size), using
        a support vector machine.
       Testing candidate image locations
        for local patterns that appear like
        faces, using a classification
        procedure that determines whether
        a given local image pattern is a face.

       the face-detection problem is treated as
        a classification problem: faces or nonfaces.

    Applying SVMs to Face Detection
       The SVM face-detection system
     1. Rescale the input image several times
     2. Cut 19x19 window patterns out of the scaled image
     3. Preprocess the window using masking, light correction and
        histogram equalization
     4. Classify the pattern using the SVM
     5. If the class corresponds to a face, draw a rectangle around
        the face in the output image
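 A schematic Python sketch of steps 1-5 (illustration only, not the authors'
  implementation; clf is assumed to be a trained SVM classifier whose predict()
  returns +1 for faces, and preprocess a hypothetical function performing masking,
  light correction and histogram equalization):

    import numpy as np

    def detect_faces(image, clf, preprocess, scales=(1.0, 0.8, 0.64), step=2):
        """Schematic sliding-window face detector (sketch only).

        image      : 2-D numpy array (grayscale); scales <= 1.0 assumed.
        clf        : trained classifier; predict() returns +1 (face) or -1 (nonface).
        preprocess : masking, light correction and histogram equalization.
        """
        detections = []
        for scale in scales:                       # 1. rescale the input image several times
            h = int(image.shape[0] * scale)
            w = int(image.shape[1] * scale)
            # naive nearest-neighbour rescaling, to keep the sketch dependency-free
            rows = (np.arange(h) / scale).astype(int)
            cols = (np.arange(w) / scale).astype(int)
            scaled = image[rows][:, cols]

            for i in range(0, h - 19, step):       # 2. cut 19x19 window patterns
                for j in range(0, w - 19, step):
                    window = scaled[i:i + 19, j:j + 19]
                    x = preprocess(window).reshape(1, -1)     # 3. preprocess the window
                    if clf.predict(x)[0] == 1:                # 4. classify with the SVM
                        # 5. record a bounding box in original image coordinates
                        detections.append((int(i / scale), int(j / scale), int(19 / scale)))
        return detections
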
    Applying SVMs to Face Detection
 Experimental results on static images
         Set A: 313 high-quality images, with the same number of faces
         Set B: 23 images of mixed quality, with a total of 155 faces

    Applying SVMs to Face Detection
 Extension to a real-time system

                          [Figure: an example of skin detection using SVMs
                           in the real-time PC color system]
                          Conclusion
 Single-layer neural networks have a simple and efficient learning
  algorithm, but very limited expressive power.

 Multilayer networks, on the other hand, are much more expressive
  but are very hard to train.

 Kernel machines overcome this problem: they can be trained easily
  and, at the same time, can represent complex nonlinear functions.

 Kernel machines are very effective in handwriting recognition,
  text categorization, and face recognition.
