How the SVM classifies instances:
The SVM constructs a decision function that is represented in “dual space” by:
D(x) = \sum_{k=1}^{p} \alpha_k K(x_k, x) + b
D(x) is the decision function.
p is the number of training examples in the training set.
α_k is a learned parameter associated with the k’th training example.
K is the kernel function, which takes the k’th training example and the current input x.
b is a learned bias which is the same across all examples.
So here, the α_k and b are the parameters which are learned.
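To make the dual-form decision function concrete, here is a minimal sketch in Python. The training set, the α values, and b are made-up toy numbers, not learned parameters:

```python
import numpy as np

def linear_kernel(x_k, x):
    """K(x_k, x) = x_k . x  (the mapping phi taken as the identity)."""
    return np.dot(x_k, x)

def decision(x, train_X, alpha, b, kernel=linear_kernel):
    """D(x) = sum over k of alpha_k * K(x_k, x), plus the bias b."""
    return sum(a_k * kernel(x_k, x) for a_k, x_k in zip(alpha, train_X)) + b

# Toy values: in a real SVM, alpha and b come from training.
train_X = np.array([[1.0, 1.0], [-1.0, -1.0]])
alpha = np.array([0.5, -0.5])   # positive for the class-A example, negative for class B
b = 0.0

print(decision(np.array([2.0, 0.5]), train_X, alpha, b))  # 2.5 -> positive, so class A
```

The sign of the result is the classification; its magnitude is how far the input sits from the decision boundary.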
The kernel function used in the paper looks like this:
K(x_k, x) = \phi(x_k) \cdot \phi(x)
So we are taking the dot product of mapped versions of both the current training example
vector and the given input vector to be classified. But let’s leave the mappings aside for
a moment and take φ to be the identity. Then we have the dot product of x_k and x, which
can be written as:
x_k \cdot x = |x_k|\,|x| \cos(\theta)
where θ is the angle between the two vectors and |·| is the magnitude of a vector. This
is of course obvious to anyone who knows linear algebra, but it may help those who do
not.
So, if two vectors point in completely opposite directions, you have cos(180°) = -1, and
you get a negative result with the maximum magnitude. Conversely, if the two vectors are
very similar, the angle is near 0, so you get cos(0) = 1.
Your kernel function then gives a positive result if the given input example should be
given the same label as the current training example. It is negative if they are in opposite
classes. The magnitude of K() gives the degree of this similarity or difference.
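The cosine relationship above is easy to check numerically. This sketch verifies the identity for two vectors pointing in exactly opposite directions (the vectors themselves are arbitrary examples):

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([-3.0, -4.0])          # points in exactly the opposite direction

dot = np.dot(u, v)
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))

print(dot)        # -25.0: negative, with the maximum possible magnitude
print(cos_theta)  # -1.0, i.e. cos(180 degrees)
```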
How is this translated into the decision function?
If you consider two classes, A and B, an instance is in class A if the decision function
yields a positive result, otherwise it is in class B.
So the decision function must be adjusted so that every training example with label B
gives a negative result (and positive for one with an A label). This is accomplished
through the adjustment of the α_k parameters. For a training instance x_k with label
y_k = B, its α_k should be negative, so that it contributes a negative term for a similar
input x (where K(x_k, x) is positive). If the input instance were instead very different,
the kernel value would be negative and the term added to the sum would be positive. That
is appropriate: a very different input belongs to class A, not B, so it should push D(x)
toward a positive value.
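To see how the sign of α_k steers the sum, here is a small sketch. The example vectors and the α value are invented purely for illustration:

```python
import numpy as np

# Hypothetical class-B training example, so its alpha is negative.
x_k = np.array([1.0, 0.0])
alpha_k = -0.8

similar   = np.array([0.9, 0.1])    # points roughly the same way as x_k
different = np.array([-1.0, 0.0])   # points the opposite way

# A similar input gets a negative contribution (pushes D(x) toward class B);
# a very different input gets a positive contribution (pushes D(x) toward class A).
print(alpha_k * np.dot(x_k, similar))
print(alpha_k * np.dot(x_k, different))
```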
How the SVM learns:
The α parameters are adjusted so as to maximize the margin: the distance between the
separating hyperplane D(x) = 0 and the nearest training examples of each class.
I will describe how the margin of the decision function is learned in direct space, where the
decision function is:
D(x) = w \cdot \phi(x) + b
where w is a vector of weights of the same dimensionality as the mapped vectors φ(x) (this
is just like the weighted sum computed by a node in a neural network).
The distance between the hyperplane and the example x is:

D(x) / |w|
The objective is to find w such that |w| = 1 and the margin M is maximized, subject to the
constraint that the distance for every element which we can now express as ykD(xk) is
greater than or equal to M (yk here is used to compensate for the sign, depending on the
class, so yA = 1 and yB = -1). The support vectors will be where the distance is equal to
the margin. So the bound M* for this maximal margin is equal to the distance of the
closest instance, and this all becomes a minimax problem:
M^* = \max_{w,\,|w|=1} \; \min_k \; y_k D(x_k)
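The inner minimum can be sketched directly: for a candidate weight vector, compute the signed distance y_k D(x_k) / |w| of every training example and take the smallest. The data and the weight vector below are toy values chosen so the classes are linearly separable:

```python
import numpy as np

def margin(w, b, X, y):
    """min over k of y_k * D(x_k) / |w|: the distance of the closest example
    to the hyperplane D(x) = 0 (dividing by |w| enforces the |w| = 1 scaling)."""
    norm = np.linalg.norm(w)
    return min(y_k * (np.dot(w, x_k) + b) / norm for x_k, y_k in zip(X, y))

# Toy separable data: class A (y = +1) on the right, class B (y = -1) on the left.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

print(margin(np.array([1.0, 0.0]), 0.0, X, y))  # 2.0: the closest examples sit at distance 2
```

Maximizing this quantity over all unit-length w is what the learning procedure does; the examples that attain the minimum are the support vectors.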
It is possible to derive an equivalent expression in dual space, which is where the decision
function relies on a kernel, as I discussed above, but I don’t feel that I have the
background to explain it.
The article that explains all of this is by Boser, Guyon, and Vapnik, “A Training
Algorithm for Optimal Margin Classifiers.” You can find this article at portal.acm.org.