Good afternoon, everybody. Welcome to my candidature presentation for my MPhil degree. Today I will talk about adapting kernel-based methods: from multiclass SVM to semi-supervised spectral analysis. I am Wuzhili, my principal supervisor is Dr. Li, and my co-supervisor is Dr. Tang.
Now let's take a look at the outline of my presentation first. At the beginning, I will explain what machine learning is, and the three categories of learning: purely supervised, unsupervised, and the newly emerging semi-supervised learning. After that, I will explain the support vector machine, which is a successful kernel-based supervised learning method. Because the SVM is only a binary classifier, the output code approach is usually used to handle multi-class problems, and I will introduce this approach in the presentation. After the SVM section, I will present the effective unsupervised spectral clustering algorithm and a semi-supervised extension to the spectral method. Finally, I will give the conclusion and prospects.
So, what is machine learning? As a branch of artificial intelligence, machine learning has been around for a long time, but it did not receive enough attention until the last decade of the past century. In 1997, Tom Mitchell defined machine learning as the study of computer algorithms that improve automatically through experience. Over the past several years, there have been many successful applications of machine learning approaches, such as data mining on large databases, bioinformatics, and information extraction. The following sentence is taken from Science magazine in 2001, written by Dennis DeCoste. It says that machine learning "will lead to appropriate, partial automation of every element of scientific method, from hypothesis generation to model construction to decisive experimentation." Now let's see the three categories of learning.
The first one is supervised learning. With training samples available, the task is to predict the class of unknown patterns. There are two types of supervised learning: one is regression and the other is classification. Regression maps the data to infinitely many possible labels, while classification, usually known as pattern recognition, maps the data to a finite set of labels. The second type of learning is unsupervised, because no labeled data are present during the learning process. A common task in unsupervised learning is clustering, that is, grouping the data into several clusters. The third category of learning is a little bit newer but of great importance. It is called semi-supervised learning. Although there is no clear definition for it, it usually refers to situations where labeled data are limited in quantity but unlabeled data are abundant. For example, there are a great many data for webpage categorization and gene expression analysis, but the labeling process is too tedious or too expensive.
After this general introduction to machine learning, I will illustrate the definition of the kernel matrix, which plays a major role in my presentation. We are given a dataset X with n data points, where each xi is a d-dimensional feature vector and yi is its label, which can be one, minus one, or unknown if the data point is not labeled. Now we construct a symmetric n times n matrix K, where each entry Kij measures the pairwise relation between data points xi and xj. For example, the following exponential inverse distance function is a similarity measure between any two data points xi and xj. With some rigorous mathematics, we can define what a kernel matrix is: if the above K is positive semidefinite, it is a kernel matrix. The special and interesting point about a kernel matrix K is that any entry Kij can be regarded as the inner product of two vectors, phi of xi and phi of xj, where phi of xi is a nonlinear mapping of xi into a space of much higher dimension. That is to say, the kernel function maps each data point from the lower-dimensional input space into a higher-dimensional feature space.
To make the idea of the kernel matrix clearer, I will show you a toy example. Here are five patterns taken from an IQ quiz problem: which is the odd one out? The label y is totally unknown because this is an unsupervised learning problem. To answer the question, the machine learning strategy is to do clustering. But for now I want to show how to build the matrix. To distinguish the patterns, we first define their features. For example, we might score the patterns by symmetry, curvature, and aesthetic appeal. Here is the dataset X. Of course, you can have your own scoring scheme. Now we adopt the kernel function mentioned before, the exponential inverse squared distance function, and we obtain a 5 times 5 kernel matrix.
After talking about the kernel matrix, we can go to the SVM section. The SVM has been an effective classifier on the basis of three characteristics: the first is the large margin, the second is quadratic programming, and the third is the kernel trick. Look at the diagram. We are given two labeled datasets, one set of positive samples marked in blue and one set of negative samples marked in red. We wish to construct a separating plane, xw plus b equals zero, such that there exist two parallel planes: xw plus b equals 1, called the positive plane, and xw plus b equals minus 1, called the negative plane. These are chosen so that all positive samples lie on the right-hand side of the positive plane and all negative samples lie on the left-hand side of the negative plane. It can easily be calculated that the margin between the two planes is 2 over the Euclidean norm of w. Our objective is to maximize the margin.
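In symbols, the large-margin objective just described can be sketched as follows. This assumes labels yi in {+1, -1} and the same plane xw plus b equals zero as in the diagram; the index i and the norm notation are assumptions added here for illustration.

```latex
\max_{w,\,b}\ \frac{2}{\lVert w \rVert}
\quad\text{subject to}\quad
y_i\,(x_i w + b) \ge 1, \qquad i = 1,\dots,n .
```

The constraint says that each positive sample sits on or beyond the positive plane and each negative sample on or beyond the negative plane.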
To maximize the margin, we need some mathematical transformations so that the problem becomes easy to solve. First, we change the objective from a maximization to a minimization with the same constraints. Then we relax the separability constraints to permit some data points to violate them, by adding a penalty term to the objective function. For example, we introduce as the penalty term C times the sum of the violations of the data points. With the help of well-established optimization theory and techniques, we obtain the following quadratic form by using the Lagrangian optimization method. We now maximize a function of alpha subject to linear constraints: the sum of alpha i times yi is zero, and each alpha i is nonnegative and bounded above by C, where C is the penalty constant. It is easy to see that if we strictly require the positive and negative samples to be separable, C goes to positive infinity. The above QP problem can be solved by many standard QP routines, or by faster algorithms that exploit the special linear constraints, such as working set selection, shrinking, and SMO. Although there is no clear bound on the complexity, some algorithms run faster than N squared.
Now we come to the third characteristic of the SVM: the kernel trick. Observe that the quadratic program contains the inner product between xi and xj. By simply substituting the corresponding entry of the kernel matrix for this inner product, we can claim that we are constructing the SVM separating plane in a higher-dimensional space rather than in the input space. So what is the benefit of such a substitution? Informally speaking, it is believed that data points are more sparsely distributed in a higher-dimensional space and thus more separable, so we can rely less on the penalty constant C. For those familiar with quadratic programming, the following notation might make you feel more comfortable.
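For reference, the quadratic program described above can be written in matrix form roughly as follows. This is the standard soft-margin dual with the kernel matrix K already substituted for the inner products; the symbol Q and the all-ones vector 1 are assumed notation and may differ from what is on the slide.

```latex
\max_{\alpha}\ \mathbf{1}^{\top}\alpha - \tfrac{1}{2}\,\alpha^{\top} Q\,\alpha
\quad\text{subject to}\quad
y^{\top}\alpha = 0, \quad 0 \le \alpha_i \le C,
\qquad\text{where } Q_{ij} = y_i\,y_j\,K_{ij} .
```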
Finally, we can take a look at the solution for alpha. All data points fall into three types according to their alpha. Data points that are correctly separated and lie outside the two parallel planes have alpha equal to zero. The second type, the data points that lie on the two parallel planes, have alpha between zero and C. The points that violate the separability constraints have alpha equal to C. The last two types of data points are called support vectors because they support the final decision function. The decision function, xw plus b, can be expressed as a combination of alpha, y, and the inner products of data points, and the constant b can be determined once all the alphas are obtained. By simply taking the sign, we get the label of any new data point. The graph on the right illustrates the nonseparable case, where the data points circled with dotted lines are the support vectors. The graph on the left is an example classified by an SVM with the Gaussian kernel. It is a challenging problem.
Though effective, the SVM is only a binary classifier, and many real problems are multi-class. Some researchers have proposed the multiclass SVM using the output coding approach. First let's take a look at the output code. Those who work on communication theory or computer graphics might know a lot about output codes. For example, the following output code can be found when doing line clipping in computer graphics. Back to our multi-class SVM problem, suppose the four lines are four SVM decision boundaries, two vertical and two horizontal. The 2D plane is thus divided into nine regions, and each region is represented by a four-bit code, where each bit indicates on which side of one SVM line the region lies. For example, the leftmost three regions have the first bit equal to 1 because they are all on the left side of the left vertical line. The other bits are specified in the same way. Now I will try to explain what ECOC is in my own words.
Suppose that for an m-class task we train n SVMs. The n SVM hyperplanes divide the space into many subspaces, possibly as many as 2 to the power n. Each subspace is encoded by an n-bit code, and because the number of classes might be less than the number of subspaces, each class is a union of several subspaces. So if you change several bits of the code, you might still get the same class.
Now I will present the ECOC SVM algorithm to you. It might be too conceptual, but later I will illustrate it with graphics. Given a 3-class classification problem, we represent each class by a 3-bit code and thus form a 3 times 3 code matrix M. We then train 3 SVMs, each using a different data split specified by one column of the code matrix. For example, if we adopt the one-per-class code, we train the first SVM according to the first column of the OPC matrix. The first column is the transpose of 1 0 0. It means that we take the first of the three classes of data as the positive samples for SVM training and take the remaining two classes as negative samples. If we use the pairwise coupling code, the first SVM will try to separate the first class of data from the second class, while the third class is not used. Finally, for each new data point, we use the three SVMs together to do prediction by assigning the data point a 3-bit code, and we classify it to the class with the closest code.
For example, when using the one-per-class code, we train the first SVM by following the first column of the code matrix. The red points are taken as the positive samples and the points in the other two colors are the negative samples. The SVM tries to separate the positive and negative samples. After we obtain the three SVMs, we can use them to do 3-class classification.
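The decoding step just described, assigning a new data point to the class with the closest code, can be sketched in a few lines of Python. This is a minimal illustration assuming Hamming distance as the code distance and a hand-written one-per-class matrix; the function name `hamming_decode` and the example bit strings are hypothetical, and the three trained SVMs are replaced by made-up outputs.

```python
import numpy as np

# One-per-class (OPC) code matrix for 3 classes.
# Row = codeword of a class, column = data split used to train one binary SVM.
OPC = np.array([[1, 0, 0],
                [0, 1, 0],
                [0, 0, 1]])

def hamming_decode(bits, code_matrix):
    """Return (indices of the closest classes, Hamming distance to every class)."""
    bits = np.asarray(bits)
    dists = np.sum(code_matrix != bits, axis=1)      # bits that differ, per class
    closest = np.flatnonzero(dists == dists.min())   # more than one index means a tie
    return closest.tolist(), dists.tolist()

# Hypothetical 3-bit outputs of the three SVMs for two new data points.
print(hamming_decode([1, 0, 0], OPC))  # exactly matches the code of class 0
print(hamming_decode([1, 1, 0], OPC))  # equally close to class 0 and class 1
```

When the function returns more than one closest class, the code alone cannot decide, and some other rule is needed to break the tie.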
But OPC obviously has weak error-correcting ability, because there are many ambiguous regions, such as the regions 110, 000, 101, and 011: each of them is equally close, in code distance, to more than one of the three class codes 100, 010, and 001. The pairwise coupling code is different because each SVM only tries to separate two classes of data, so in total m times m minus 1 over 2 support vector machines are needed for an m-class task. But it has stronger error-correcting ability. The only ambiguous region is the one coded 000.
Under the framework of ECOC, we propose the Hadamard code for the multi-class SVM. Here are two code examples, of size 4 and 8. The first column is excluded because all of its entries are the same, so we train m minus 1 SVMs for an m-class task. The Hadamard code has been proved to have strong error-correcting ability because it has large row separation and all its columns are mutually orthogonal, but it is not easy to illustrate graphically here. Instead, we present some experimental results on the UCI repository. From the table, we can conclude that the Hadamard code is superior. The comparison with the pairwise coupling code is not listed here. What I can tell you is that the PWC code achieves good results too, because it uses many more SVMs.
Finally, I will give a summary of the ECOC approach. Code selection is important but not easy. For example, the Hadamard code is good for its economical number of SVMs and for its accuracy, but each SVM is hard to train because the split is not aware of the natural boundaries between the classes of data. The pairwise code trains too many SVMs. The OPC code uses an unbalanced split, but it is popular because it is easy to understand. The decoding method is also important. A weighted decoding scheme may help the OPC code. For example, instead of assigning the data point to the class with the closest code, we can use a probability measure related to the decision function. Now I will talk about another kernel-based method: spectral analysis.
It is named after the operation of eigendecomposition. So again, we present our matrix. In spectral clustering, the matrix is nearly the same as the kernel matrix presented before. It is also n times n and symmetric, but the pairwise relation is limited to a similarity measure. The exponential inverse squared distance meets this requirement. The similarity matrix can come from kernel functions, be made manually, or be defined in feature space by using the kernel trick. Geometrically, the similarity matrix can be transformed into an undirected weighted graph whose vertices represent the data points and whose edges are weighted by the entries of the similarity matrix K.
Now the clustering problem can be reformulated as a graph minimum cut. We partition the graph into two disjoint node sets A and B. If a node is in set A, we label it positive, otherwise negative. The graph cut is the sum of the weights of the edges connecting set A and set B. Mathematically, it can be expressed in the following form: if yi equals yj, the term contributed to the sum is zero, but if yi does not equal yj, the weight of the edge connecting xi and xj is added. Our objective is to minimize the graph cut. It is easy to see that the vector of all ones, or of all minus ones, gives the objective function its trivial minimum. To prevent the trivial minimum, we simply add a constraint, that the sum of the yi equals zero, to enforce a balanced cut. But such a constraint makes the optimization problem NP-hard, so it is only possible to seek approximate solutions.
Now we do some transformation first. Construct a matrix D with the same size as K, where most entries of D are zero and the diagonal of D holds the row sums of K. Then construct another matrix L, which equals D minus K. L is a positive semidefinite matrix. The original objective function can then be expressed, up to a constant factor, as the product of y transpose, L, and y.
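The construction of D and L, and the link between the graph cut and y transpose L y, can be checked numerically with a small sketch. It assumes a Gaussian form for the exponential inverse squared distance with a width parameter `sigma`, and four made-up 2D points; the function names are hypothetical. For labels of plus or minus one, the cut equals y transpose L y divided by 4.

```python
import numpy as np

def similarity_matrix(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); sigma is an assumed width."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def laplacian(K):
    """L = D - K, where D is diagonal and holds the row sums of K."""
    return np.diag(K.sum(axis=1)) - K

def graph_cut(K, y):
    """Sum of the weights K_ij over all pairs whose labels differ."""
    differ = y[:, None] != y[None, :]
    return K[differ].sum() / 2.0   # the mask counts every pair twice

# Toy data: two tight groups of two points each (made up for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
y = np.array([1, 1, -1, -1])

K = similarity_matrix(X)
L = laplacian(K)
print(graph_cut(K, y), y @ L @ y / 4.0)   # the two numbers agree
print(np.linalg.eigvalsh(L).min())        # about zero: L is positive semidefinite
```

The all-ones vector gives y transpose L y equal to zero, which is the trivial minimum mentioned above.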
But such a transformation trick does not reduce the complexity of the problem, so we now approximate it by spectral decomposition. We relax y from binary integers to real numbers and restrict the sum of squares of y to equal n. Then the objective function value, up to a constant factor, is bounded below by lambda 2, where lambda 2 is the smallest nonzero eigenvalue of the matrix L and v2 is the corresponding eigenvector. V2 is the approximation of the y we want, and lambda 2 gives the approximate minimum. Because zero is the smallest eigenvalue of L and its corresponding eigenvector is the vector of all ones, we know that the entries of v2 sum to zero, since the eigenvectors are mutually orthogonal. Once we obtain v2, we can simply split its entries at the median to get a balanced cut. It has also been shown that a cut at the p-th percentile works well if we want a split with ratio p. Besides v2, the third eigenvector, the fourth, and so on can be used together as the basis of our clustering.
This algorithm has been used in many areas, such as molecular physics, computer vision, high-performance computing, and data clustering. There are many variants of the spectral graph cut, like the normalized cut, ratio cut, and recursive cut. Connections have also been drawn with other algorithms such as PCA, locally linear embedding, multidimensional scaling, and random walks. Here are two simple clustering examples produced by the spectral algorithm.
After talking about spectral clustering, we now take a look at how the spectral method can be applied to semi-supervised learning. There have been many studies in the field of semi-supervised learning. For example, Blum proposed the co-training algorithm in 1998. The basic idea of co-training is to split the data by feature and try to find the latent labels of the unlabeled data.
In the same year, Bennett proposed a semi-supervised SVM that uses a variant of the SVM and does linear optimization over the whole dataset. In 1999, Joachims invented the transductive SVM. It is a complicated algorithm that iteratively adds new data points into the SVM. Recently, Cristianini tried to modify the metric of the kernel matrix by using its spectrum. There is also some research on semi-supervised learning using graph cuts.
Here is the problem formulation. In the graph formed from K, there are two sets of labeled data, one set of positive points, s, and one set of negative points, t. We want to get a graph minimum cut that respects this label information. We simply add a supersource, capital S, and a supersink, capital T, into the graph, linked to the labeled points with capacity w equal to positive infinity. Now we have a classic S-T minimum cut problem. It is well known that there are many deterministic S-T cut algorithms based on maximum flow. This kind of maximum flow approach was studied in a paper written by Blum in 2001. Although many algorithms are available, and some of them are very efficient, they all suffer from outliers in the graph and have difficulty producing a balanced cut. Also, the output of those maximum flow algorithms is only a discrete binary label, not a probability measure.
We conducted some studies on semi-supervised learning using spectral methods. The spectral methods systematically decompose a matrix, a balanced cut can always be obtained, and the result is insensitive to outliers. It can be very fast. The eigenvector is a continuous output, which can be used as a probability measure. The capacities for the supersource and supersink can be further adapted to reflect different confidence in the labeled data. Here are the comparative results. The left graph is the comparison among the transductive SVM, the ordinary SVM, and the extended spectral method.
The transductive SVM and the spectral method are comparable, and both of them are better than the ordinary SVM. The right graph shows that the spectral method outperforms the maximum flow approach.
Finally, the presentation is about to finish. In my presentation, I presented kernel-based methods including the SVM and spectral clustering. I explained the ECOC SVM and suggested the Hadamard code for use in the multi-class SVM. I also explained semi-supervised learning and gave a simple extension to the spectral method which has very good performance in semi-supervised learning. My further MPhil study will still concentrate on kernel-based methods. There are many questions to be investigated further, such as the relation between the SVM and spectral methods, better ECOC codes, and new approaches to semi-supervised learning.