# Support Vector Machine

Chan-Gu Gang, MK Hasan, and Ha-Yong Jung
2004/11/17

## Introduction

### Learning Theory

- Objective: given two classes of objects and a new object, assign the new object to one of the two classes.
- This is binary pattern recognition (binary classification).
  - x_i : pattern, case, input, instance, ...
  - X : domain (the set from which the values of x_i are taken)
  - y_i : label, target, output
- In order to map x_i values to y_i values, we need a notion of similarity both in X and in the set of labels.
  - For the labels y_i this is trivial (two labels are either equal or different).
  - For X it is not obvious; this is where a similarity measure comes in.

### Similarity Measure

- A similarity measure k takes two patterns x and x′ and returns a real number characterizing their similarity.
- The function k is called a kernel.

### Simple Example of a Similarity Measure: the Dot Product

- The dot product of two vectors is a simple similarity measure (written out below).
- However, the patterns x_i are not yet in the form of vectors; they do not live in a dot product space and can be any kind of object.
- To use the dot product as a similarity measure, we first transform the patterns into vectors in a dot product space H via a mapping φ.

- Three benefits of this transformation into vector form:
  - It lets us define the similarity measure from the dot product in H.
  - It lets us deal with the patterns geometrically, so we can apply linear algebra and analytic geometry.
  - The freedom to choose the mapping φ enables a large variety of similarity measures and learning algorithms, and lets us change the representation into one that is more suitable for the given problem.
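
A compact reconstruction of the definitions the slide refers to (dot product in ℝⁿ, feature map Φ, and the kernel it induces):

```latex
% Dot product of two vectors in R^n
\langle x, x' \rangle = \sum_{i=1}^{n} [x]_i \, [x']_i

% Feature map into a dot product space H, and the kernel it induces
\Phi : X \to H, \qquad k(x, x') = \langle \Phi(x), \Phi(x') \rangle
```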
### Simple Pattern Recognition Algorithm

- Basic idea: assign a previously unseen pattern x to the class with the closer mean.
- The means of the two classes (+, -) are c+ = (1/m+) Σ_{i: yi=+1} xi and c- = (1/m-) Σ_{i: yi=-1} xi, where m+ and m- are the numbers of positive and negative training patterns.
- Let c be the midpoint of c+ and c-, i.e., c = (c+ + c-)/2.
- Then x is classified into the class with the closer mean: y = sgn ⟨x − c, c+ − c-⟩.
### Decision Function

- Starting from y = sgn ⟨x − c, c+ − c-⟩ and substituting c = (c+ + c-)/2 gives
  y = sgn(⟨x, c+⟩ − ⟨x, c-⟩ + b), with offset b = (1/2)(‖c-‖² − ‖c+‖²).
- Substituting the class means and writing all dot products through the kernel k gives the result
  y = sgn( (1/m+) Σ_{i: yi=+1} k(x, xi) − (1/m-) Σ_{i: yi=-1} k(x, xi) + b ),
  as sketched in code below.
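A minimal Python sketch of this mean-based decision function, assuming the patterns are given as NumPy vectors and using a plain dot-product kernel (both are my assumptions, not from the slides):

```python
import numpy as np

def linear_kernel(x, z):
    """Dot-product similarity measure k(x, z) = <x, z>."""
    return float(np.dot(x, z))

def decision_function(x, X, y, k=linear_kernel):
    """Classify x by the 'closer class mean' rule, written entirely via kernels.

    X: (m, d) array of training patterns, y: (m,) array of labels in {+1, -1}.
    """
    pos, neg = X[y == +1], X[y == -1]
    m_pos, m_neg = len(pos), len(neg)
    # <x, c+> and <x, c-> expressed as averages of kernel evaluations
    s_pos = sum(k(x, xi) for xi in pos) / m_pos
    s_neg = sum(k(x, xi) for xi in neg) / m_neg
    # offset b = (||c-||^2 - ||c+||^2) / 2, also via kernel evaluations
    b = 0.5 * (sum(k(xi, xj) for xi in neg for xj in neg) / m_neg**2
               - sum(k(xi, xj) for xi in pos for xj in pos) / m_pos**2)
    return int(np.sign(s_pos - s_neg + b))

# Tiny made-up example
X = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, +1, +1])
print(decision_function(np.array([3.5, 4.0]), X, y))  # closer to the + mean, prints 1
```
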
### Parzen Windows Estimators

- Conditions under which the resulting decision function is the Bayes classifier:
  - The class means have the same distance to the origin, so that b = 0.
  - The kernel k has the form of a probability density function (e.g., a Gaussian centered on a training point).
- In that case the two sums in the decision function are Parzen window estimates p+(x) and p-(x) of the class-conditional densities (written out below), and the new sample is labeled according to which of p+(x) and p-(x) is larger.

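The two estimates, written out in standard notation (my reconstruction of the slide's formulas):

```latex
p_+(x) = \frac{1}{m_+} \sum_{i:\, y_i = +1} k(x, x_i), \qquad
p_-(x) = \frac{1}{m_-} \sum_{i:\, y_i = -1} k(x, x_i)

% With b = 0 the mean-based decision function reduces to
y = \operatorname{sgn}\bigl( p_+(x) - p_-(x) \bigr)
```
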
### Generalization

- The decision function can be generalized to a weighted kernel expansion (see below).

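In standard notation, the generalized decision function is (a reconstruction of the slide's formula):

```latex
y = \operatorname{sgn}\Bigl( \sum_{i=1}^{m} \alpha_i \, k(x, x_i) + b \Bigr)
```

where the weights α_i and the offset b are learned from the data.
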
### To Make the Classification Technique More Sophisticated

- Two ways to make it more sophisticated:
  - Selection of the patterns on which the kernels are centered.
    - Remove the influence of patterns that are very far from the decision boundary, either because we expect they will not improve the generalization error of the decision function, or to reduce the computational cost of evaluating it.
  - Choice of the weights a_i placed on the individual kernels in the decision function.
    - In the simple algorithm above the weights are only (1/m+) or (1/m-); allowing a greater variety of weights gives a more flexible classifier.

### Some Insights from Statistical Learning Theory

The slide contrasts three possible decision boundaries:

- One allows some exceptions ("outliers"), but then the boundary itself is ambiguous.
- One gives an almost linear separation of the classes, but misclassifies the two outliers and also some "easy" points that lie so close to the decision boundary that the classifier really should be able to get them right.
- A compromise boundary gets most points right, without putting too much trust in any individual point.

### More on Statistical Learning Theory

- Goal: put the above intuitive arguments into a mathematical framework.
- Assumption: the data (x, y) are generated i.i.d. (independent and identically distributed) from a probability distribution P(x, y).
- Goal: find a function f that will correctly classify unseen examples (x, y).
- Measurement of correctness: the zero-one loss function
  C(x, y, f(x)) := (1/2) |f(x) − y|,
  which is 0 when f(x) = y and 1 when f(x) ≠ y, for labels in {+1, -1}.
- Without any restriction on the set of functions from which we choose our estimate f, even a function that fits the training data perfectly might not generalize well.

### Training Error and Test Error

- Minimizing the training error (the empirical risk) does not by itself imply a small test error (the risk); see the definitions below.
- We therefore restrict the set of functions from which f is chosen to one whose capacity is suitable for the amount of available training data.
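
Written out (standard definitions, reconstructed from the loss defined two slides earlier):

```latex
R[f] = \int \tfrac{1}{2}\,|f(x) - y| \; dP(x, y)
\qquad \text{(risk / test error)}

R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \tfrac{1}{2}\,|f(x_i) - y_i|
\qquad \text{(empirical risk / training error)}
```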

### VC Dimension

- Each function of the class separates the patterns in a certain way.
- Since the labels are {+1, -1}, there are at most 2^m different labelings of m patterns.
- "Shatter": when a function class can realize all 2^m separations, it is said to shatter the m points.
- VC dimension: the largest m such that there exists a set of m points which the class can shatter, or infinity if no such m exists.
  - For example, linear separators in the plane can shatter 3 points in general position but not 4, so their VC dimension is 3.
- The VC dimension is a one-number summary of a learning machine's capacity.
### Example of a VC Bound

- If h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, independently of the underlying probability distribution generating the data, with probability at least 1 − δ the risk is bounded by the empirical risk plus a confidence term φ(h, m, δ) (see the bound below).
- To reproduce a random labeling by correctly separating all training examples, a machine needs a large VC dimension h; then φ(h, m, δ) is large, and a small training error does not guarantee a small test error.
- To get nontrivial predictions from the bound, the function class must be restricted so that its capacity is small enough.
- At the same time, the class should be large enough to provide functions that can model the dependencies hidden in P(x, y).
- The choice of the set of functions is therefore crucial for learning from data.
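
One standard form of such a bound (my reconstruction; the slide's exact confidence term may differ in its constants):

```latex
R[f] \;\le\; R_{\mathrm{emp}}[f] \;+\;
\underbrace{\sqrt{\frac{h\left(\ln\frac{2m}{h} + 1\right) + \ln\frac{4}{\delta}}{m}}}_{\phi(h,\, m,\, \delta)}
```
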
## Kernel Machine Classifier

### Hyperplane Classifier

- We have a set of points.
- Each point belongs to class +1 or class -1.
- The points are linearly separable, i.e., some hyperplane ⟨w, x⟩ + b = 0 puts all +1 points on one side and all -1 points on the other.

### Growing Balls: a Geometric Picture

(A sequence of figures; only the captions are preserved here.)

- A point set.
- Growing ball.
- Growing ball: several hyperplanes exist.
- Growing balls: bigger balls, fewer hyperplanes.
- Growing balls: a single hyperplane is left.
- Growing ball: support vectors.

### Why Maximum Margin

- Generalization capability increases with increasing margin.
  - We skip the proof of this statement.
- The problem can be solved using quadratic programming, which is quite efficient (a small example follows below).
- A single global optimum exists.
  - This is a key reason for choosing a Support Vector Machine instead of a neural network as a tool.

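As an illustration (mine, not from the slides), a few lines using scikit-learn's SVC, which solves this quadratic program internally; the toy data set is made up:

```python
import numpy as np
from sklearn.svm import SVC

# A tiny, made-up linearly separable data set
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A linear SVM with a very large C approximates the hard-margin solution
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("weight vector w:", clf.coef_[0])
print("bias b:", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
print("prediction for [5, 5]:", clf.predict([[5.0, 5.0]])[0])  # expected: 1
```
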
### How to Get the Hyperplane with Maximum Margin: Formulation
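
In standard notation, the hard-margin problem these slides build up to is (a reconstruction, assuming the separable case):

```latex
\min_{w,\, b} \;\; \tfrac{1}{2} \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \bigl( \langle w, x_i \rangle + b \bigr) \ge 1, \qquad i = 1, \dots, m
```

Since the margin equals 2/‖w‖, minimizing ‖w‖ maximizes the margin.
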
### What If the Points Are Not Linearly Separable

- We can map the points into some higher-dimensional space using a nonlinear transformation so that the points become linearly separable in that higher-dimensional space.

### How to Avoid the Computation of the Map into Higher Dimensions

- In the formulation, the input vectors only ever appear through dot products.
- Let Φ(x) be the function that maps an input vector x_i to a vector in some higher-dimensional space.
- If we can compute k(x_i, x_j) = Φ(x_i) · Φ(x_j) without calculating Φ(x_i) and Φ(x_j) individually, then we save the cost of mapping the input vectors into the higher-dimensional space, while still being able to use the previous formulation, now yielding a nonlinear decision boundary in the input space. A small numeric check follows below.
- This k(x, y) is called the kernel function.

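A small numeric check of this idea (my example, not from the slides): for the degree-2 polynomial kernel k(x, z) = (x · z)², the explicit feature map Φ lists the degree-2 monomials, and the two computations agree.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel computed directly in input space."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))   # dot product in the 3-D feature space: 121.0
print(poly2_kernel(x, z))       # same value, computed without mapping: (x . z)^2 = 121.0
```
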
### Formulation Using the Kernel Function

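The kernelized (dual) problem and the resulting decision function, in standard notation (again a reconstruction of the slide's formulas):

```latex
\max_{\alpha} \;\; \sum_{i=1}^{m} \alpha_i
- \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0

% Resulting decision function
f(x) = \operatorname{sgn}\Bigl( \sum_{i=1}^{m} \alpha_i y_i \, k(x, x_i) + b \Bigr)
```
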
## Some Applications

### Text Categorization

- Why is it needed?
  - As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources.
- What is it?
  - The assignment of natural-language texts to one or more predefined categories based on their contents.
- The text categorization pipeline (a small code sketch follows below):
  - Representing text: a bag of words per document.
  - Feature selection: needed because the feature dimension is very large.
  - Machine learning on the resulting feature vectors.
- Extended applications:
  - Patent classification, spam-mail filtering, categorization of Web pages, automatic ...

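A minimal sketch of such a bag-of-words plus SVM pipeline using scikit-learn (the tiny corpus and the two categories are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up miniature corpus with two categories
docs = [
    "stock prices rose on strong earnings",
    "the central bank raised interest rates",
    "the team won the championship game",
    "the striker scored twice in the final",
]
labels = ["finance", "finance", "sports", "sports"]

# Bag-of-words (tf-idf) representation followed by a linear SVM
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["interest rates and stock markets"]))  # expected: ['finance']
```
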
### Text Categorization with SVM

- Conventional learning methods compared against:
  - Naïve Bayes classifier
  - Rocchio algorithm
  - Decision tree classifier
  - k-nearest neighbors
- Experiments: test collections
  - Reuters-21578 dataset
    - 9,603 training documents, 3,299 test documents, 90 categories, 9,947 distinct terms
    - Direct correspondence between terms and categories; a single category per document
  - Ohsumed corpus
    - 10,000 training documents, 10,000 test documents, 23 MeSH "diseases" categories, 15,561 distinct terms
    - Less direct correspondence; multiple categories per document

### Text Categorization with SVM: Results

(Results shown as figures on the slide; the surviving annotations:)

- On Reuters-21578, almost all SVMs perform better than the conventional methods, independent of the choice of parameters, and show no overfitting.
- The best SVM is better than k-NN on 62 of the 90 categories (20 ties), which is a significant improvement according to the binomial sign test.
- On Ohsumed, the SVM outperforms k-NN on all 23 categories.
### Why Should SVMs Work Well for Text Categorization?

- High-dimensional input space
  - SVMs use overfitting protection that does not necessarily depend on the number of features.
- Few irrelevant features
  - Even the lowest-ranked features still contain considerable information and are somewhat relevant, so a good classifier should combine many features.
- Document vectors are sparse
  - Mistake-bound analyses suggest that additive algorithms, which have an inductive bias similar to SVMs, are well suited to problems with dense concepts and sparse instances.
- Most text categorization problems are linearly separable
  - The idea of SVMs is precisely to find such linear (or polynomial, RBF, etc.) separators.

### TREC-11

- Kernel Methods for Document Filtering (MIT)
- Ranking
  - Batch T11F/U, assessor/intersection topics: 1st
  - Routing, assessor/intersection topics: 1st
- Features: words in the documents
  - Filtering: digits and words occurring fewer than two times are removed
  - Title words are given double weight
- Applying various kernels
  - Second-order perceptron (2)
  - SVM with uneven margin
  - SVM + new threshold selection (3)
- Conclusion
  - Good ranking except on the intersection topics
  - More complex kernels, poorer results
  - Performance varies by category

### Face Detection

- We can define the face-detection problem as follows:
  1. Given as input an arbitrary image, which could be a digitized video signal or a scanned photograph,
  2. determine whether there are any human faces in the image,
  3. and if there are, return an encoding of their location.
- The encoding in this system fits each face into a bounding box defined by the image coordinates of its corners.
- Face detection can be extended to many applications: face recognition, HCI, surveillance systems, ...

### Applying SVMs to Face Detection

- Overview of the overall process:
  - Training an SVM on a database of face and nonface patterns of a fixed size.
  - Testing candidate image locations for local patterns that look like faces, using a classification procedure that decides whether a given local image pattern is a face.
- In other words, the face-detection problem is cast as a binary classification problem: faces versus nonfaces.

### Applying SVMs to Face Detection: the System

The SVM face-detection system proceeds in five steps:

1. Rescale the input image several times.
2. Cut 19x19 window patterns out of the scaled image.
3. Preprocess the pattern with light correction and histogram equalization.
4. Classify the pattern using the SVM.
5. If the class corresponds to a face, draw a rectangle around the face in the output image.
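
A schematic sketch of this loop in Python (mine, not the paper's code; the rescaling and preprocessing routines are crude stand-ins, and `classify` stands for the trained SVM):

```python
import numpy as np

def rescale(img, scale):
    """Crude nearest-neighbor rescaling (placeholder for a real resizing routine)."""
    h, w = img.shape
    rows = (np.arange(int(h * scale)) / scale).astype(int)
    cols = (np.arange(int(w * scale)) / scale).astype(int)
    return img[np.ix_(rows, cols)]

def preprocess(patch):
    """Placeholder for light correction and histogram equalization: just normalize."""
    return (patch - patch.mean()) / (patch.std() + 1e-8)

def detect_faces(image, classify, window=19, step=2, scales=(1.0, 0.8, 0.64)):
    """Sliding-window detection in the spirit of the five steps above.

    `classify(patch)` should return +1 for a face and -1 otherwise (e.g., a trained SVM).
    """
    detections = []
    for scale in scales:                                   # 1. rescale the image several times
        scaled = rescale(image, scale)
        h, w = scaled.shape
        for top in range(0, h - window + 1, step):
            for left in range(0, w - window + 1, step):
                patch = scaled[top:top + window, left:left + window]   # 2. cut 19x19 windows
                if classify(preprocess(patch)) == +1:      # 3. preprocess, 4. classify with the SVM
                    # 5. record the bounding box in original-image coordinates
                    detections.append((int(left / scale), int(top / scale), int(window / scale)))
    return detections
```
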
### Applying SVMs to Face Detection: Experimental Results

- Experimental results on static images:
  - Set A: 313 high-quality images, containing the same number of faces.
  - Set B: 23 images of mixed quality, with a total of 155 faces.

### Applying SVMs to Face Detection: Real-Time Extension

- Extension to a real-time system.
- (Figures: an example of the skin-detection module implemented using SVMs, and face detection running on the PC-based color real-time system.)
## Summary

- Single-layer neural networks have simple and efficient learning algorithms, but very limited expressive power.
- Multilayer networks, on the other hand, are much more expressive but are hard to train.
- Kernel machines overcome this trade-off: they can be trained easily (a convex problem with a single global optimum) and at the same time can represent complex nonlinear functions.
- Kernel machines perform very well in handwriting recognition, text categorization, and face recognition.
