Semi-Supervised Clustering II
Jieping Ye
Department of Computer Science and Engineering, Arizona State University
http://www.public.asu.edu/~jye02

Outline of lecture
• Overview of search-based semi-supervised clustering
• Similarity-based semi-supervised clustering
  – Altered similarity matrix
  – Similarity metric learning
  – Altered Euclidean distance
• Combination of search-based and similarity-based semi-supervised clustering
• Summary

Overview of search-based semi-supervised clustering
• Enforce constraints (must-link, cannot-link) obtained from the labeled data during clustering [Wagstaff:ICML01].
  – COP K-Means
• Use the labeled data to initialize the clusters in an iterative refinement algorithm such as K-Means [Basu:ICML02].
  – Seeded K-Means
  – Constrained K-Means

Other search-based algorithms
• PCK-Means (Basu et al.): K-Means with penalty terms for violated constraints, where w is the penalty matrix.
• Kernel-based semi-supervised clustering (Kulis et al.): kernel K-Means with a penalty for violating a constraint and a reward for satisfying one.

Similarity-based semi-supervised clustering
• Train an adaptive similarity function to fit the labeled data.
• Use a standard clustering algorithm with the trained similarity function to cluster the unlabeled data.
• Adaptive similarity functions:
  – Altered similarity matrix [Kamvar:IJCAI03]
  – Trained Mahalanobis distance [Xing:NIPS02]
  – Altered Euclidean distance [Klein:ICML02]
• Clustering algorithms:
  – Spectral clustering [Kamvar:IJCAI03]
  – Complete-link agglomerative clustering [Klein:ICML02]
  – K-Means [Xing:NIPS02]

Altered similarity matrix
• Paper: Spectral learning. Kamvar et al.
• Graph-based clustering:
  – W: similarity matrix
  – D: degree matrix (diagonal matrix of the row sums of W)
• Key idea: alter the similarity matrix W based on the domain knowledge.

Overview of graph-based (spectral) clustering
• Associate each data item with a vertex in a weighted graph, where the weights on the edges between elements are large if the elements are similar and small if they are not.
• Cut the graph into connected components with relatively large interior weights by cutting edges with relatively low weights.
• Clustering becomes a graph-cut problem.
  – Many different algorithms exist.

Overview of spectral clustering
W: similarity matrix of size n x n.
1. Compute the similarity matrix W and the degree matrix D.
2. Form the normalized matrix D^{-1/2} W D^{-1/2}.
3. Form the matrix Y whose columns are the first K eigenvectors of D^{-1/2} W D^{-1/2} (those with the largest eigenvalues).
4. Normalize Y so that all rows have unit length.
5. Run K-Means on the rows of Y to obtain the K clusters (Ng, Jordan, and Weiss, NIPS'02), or apply an iterative optimization to obtain the partition matrix (Yu and Shi, ICCV'03).
Note that (Kamvar, IJCAI'03) applied a different spectral clustering algorithm.

Semi-supervised spectral clustering
1. Compute the similarity matrix W and the degree matrix D.
2. For each must-link pair (i, j), set W_ij = W_ji = 1.
3. For each cannot-link pair (i, j), set W_ij = W_ji = 0.
4. Form the normalized matrix D^{-1/2} W D^{-1/2}.
5. Form the matrix Y whose columns are the first K eigenvectors of D^{-1/2} W D^{-1/2} (those with the largest eigenvalues).
6. Normalize Y so that all rows have unit length.
7. Run K-Means on the rows of Y to obtain the K clusters (Ng, Jordan, and Weiss, NIPS'02), or apply an iterative optimization to obtain the partition matrix (Yu and Shi, ICCV'03).
Note that (Kamvar, IJCAI'03) applied a different spectral clustering algorithm. A small NumPy sketch of the procedure above follows.
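The sketch below is a minimal NumPy/scikit-learn illustration of the constrained spectral procedure listed above, not the exact algorithm of Kamvar et al.; the function name, the recomputation of the degree matrix after altering W, and the small numerical guards are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def constrained_spectral_clustering(W, must_links, cannot_links, K):
    """Semi-supervised spectral clustering with an altered similarity matrix."""
    W = W.copy()
    for i, j in must_links:                # must-link: W_ij = W_ji = 1
        W[i, j] = W[j, i] = 1.0
    for i, j in cannot_links:              # cannot-link: W_ij = W_ji = 0
        W[i, j] = W[j, i] = 0.0

    # Degree matrix, recomputed from the altered W (an assumption; the
    # slide computes D before W is altered).
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ W @ D_inv_sqrt        # D^{-1/2} W D^{-1/2}

    # Columns of Y are the K eigenvectors with the largest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    Y = eigvecs[:, -K:]

    # Normalize the rows of Y to unit length, then cluster the rows.
    Y = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=K, n_init=10).fit_predict(Y)

Here W could be, for example, a Gaussian kernel affinity matrix over the data points; the returned array contains one cluster label per point.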
Distance metric learning
• Paper: Distance metric learning, with application to clustering with side-information. E. Xing, et al.
• Given a set S of pairs known to be similar and a set D of pairs known to be dissimilar, compute a distance metric that respects these two sets.
• Define a new distance measure of the form
    d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)},  A \succeq 0,
  which amounts to a linear transformation of the original data (replace each x by A^{1/2} x and use the ordinary Euclidean distance).
• Note: under a naive formulation the optimal A is a rank-one matrix, i.e., the data are projected onto a single line.
• Compute A by solving the following minimization problem:
    \min_A \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2
    subject to \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1 and A \succeq 0.
• Two different cases: A is diagonal, or A is a full matrix.
• Algorithm: gradient ascent combined with iterative projection.

[Figures omitted: a 2-D semi-supervised clustering example, the distances transformed by the learned metric, the clustering result with the trained metric, and additional experimental figures. Source: E. Xing, et al.]

Observations from experiments
• Using a learned diagonal or full metric leads to significantly improved performance over naïve K-Means.
• In most cases, using a learned metric together with constrained K-Means also outperforms constrained K-Means alone, sometimes by a very large margin.
• Having more side-information typically leads to metrics that give better performance.

Altered Euclidean distance
• Paper: From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering. D. Klein et al.
• Two types of constraints: must-links and cannot-links.
• Clustering algorithm: hierarchical (agglomerative) clustering.

Overview of hierarchical clustering
• Agglomerative versus divisive.
• Basic agglomerative algorithm:
  1. Compute the distance matrix.
  2. Let each data point be a cluster.
  3. Repeat:
  4.   Merge the two closest clusters.
  5.   Update the distance matrix.
  6. Until only a single cluster remains.
• The key operation is the update of the distance between two clusters.

How to define the inter-cluster distance
• MIN (single link)
• MAX (complete link)
• Group average
• Distance between centroids
All are computed from the pairwise distance matrix.

Must-link constraints
• Set the distance between each must-link pair to zero.
• Derive a new metric by running an all-pairs shortest-paths algorithm on the modified distances.
  – The result is still a metric.
  – It remains faithful to the original metric.
  – Computational complexity: O(N^2 C), where C is the number of points involved in must-link constraints and N is the total number of points.

New distance matrix based on must-link constraints
• Hierarchical clustering can be carried out based on the new distance matrix.
• What is missing? The cannot-link constraints have not yet been used.

Cannot-link constraints
• Run hierarchical clustering with complete link (MAX): the distance between two clusters is determined by the largest pairwise distance.
• Set the distance between each cannot-link pair to a very large value, larger than any other entry in the matrix.
• The new distance matrix no longer defines a metric, but the approach works very well in practice.

Constrained complete-link clustering algorithm
• Derive a new distance matrix based on both types of constraints, then run complete-link agglomerative clustering. A small SciPy sketch follows the illustration below.

Illustration
Initial distance matrix (5 points):

        1     2     3     4     5
  1   0.0   0.2   0.5   0.1   0.8
  2         0.0   0.4   0.2   0.6
  3               0.0   0.3   0.2
  4                     0.0   0.5
  5                           0.0

Must-links: 1-2 and 3-4. Cannot-link: 2-3.

New distance matrix based on the must-link constraints:

        1     2     3     4     5
  1   0.0   0.0   0.1   0.1   0.8
  2         0.0   0.2   0.2   0.6
  3               0.0   0.0   0.2
  4                     0.0   0.2
  5                           0.0

The cannot-link pair 2-3 is then assigned a large distance (0.9, larger than any other entry). Running complete-link hierarchical clustering on the resulting matrix first merges 1 with 2 and 3 with 4, giving the cluster-level distances

         {1,2}  {3,4}   {5}
  {1,2}   0.0    0.9    0.8
  {3,4}          0.0    0.2
  {5}                   0.0

so 1 and 2 form one cluster, and 3 and 4 form another cluster.
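The following is a minimal SciPy sketch of the constrained complete-link procedure above, a sketch under stated assumptions rather than Klein et al.'s exact implementation: the function name is invented, the cannot-link distance is set to the largest entry plus one, and Floyd-Warshall is written out directly so that zero must-link distances are treated as genuine zero-weight edges.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def constrained_complete_link(dist, must_links, cannot_links, k):
    """Constrained complete-link clustering on a pairwise distance matrix."""
    D = np.asarray(dist, dtype=float).copy()

    # Must-links: force the pair distance to zero, then restore consistency
    # with an all-pairs shortest-paths pass (Floyd-Warshall).
    for i, j in must_links:
        D[i, j] = D[j, i] = 0.0
    for m in range(D.shape[0]):
        D = np.minimum(D, D[:, [m]] + D[[m], :])

    # Cannot-links: assign a distance larger than any other entry
    # (the result is no longer a metric, but works well with complete link).
    big = D.max() + 1.0
    for i, j in cannot_links:
        D[i, j] = D[j, i] = big

    # Complete-link (MAX) agglomerative clustering on the new matrix.
    Z = linkage(squareform(D, checks=False), method='complete')
    return fcluster(Z, t=k, criterion='maxclust')

# The toy example from the illustration above (points are 0-indexed here):
dist = np.array([[0.0, 0.2, 0.5, 0.1, 0.8],
                 [0.2, 0.0, 0.4, 0.2, 0.6],
                 [0.5, 0.4, 0.0, 0.3, 0.2],
                 [0.1, 0.2, 0.3, 0.0, 0.5],
                 [0.8, 0.6, 0.2, 0.5, 0.0]])
labels = constrained_complete_link(dist, [(0, 1), (2, 3)], [(1, 2)], k=2)
# Points 0 and 1 end up in one cluster; points 2 and 3 (with 4) end up in the other.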
Combining similarity-based and search-based semi-supervised clustering
• Paper: Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering. Basu, et al.
• With small amounts of training data, seeded/constrained (search-based) clustering tends to do better than similarity-based clustering.
• With larger amounts of labeled data, similarity-based clustering tends to do better.
• Combining both approaches outperforms either individual approach.

Generalized K-Means model
• In the K-Means clustering formulation, using the matrix A (that is, the distance \|x - y\|_A) is equivalent to a generalized K-Means model in which all Gaussians share a common covariance matrix A^{-1}.
• Maximizing the complete-data log-likelihood under this model is equivalent to minimizing the objective function
    \sum_i \|x_i - \mu_{l_i}\|_A^2 - \log\det(A),
  where the second term arises from the normalizing constant of a Gaussian with covariance matrix A^{-1}.

Joint objective function
• Adding the constraint-violation penalties gives
    J = \sum_i \|x_i - \mu_{l_i}\|_A^2 - \log\det(A)
        + \sum_{(x_i, x_j) \in M} w_{ij} f_M(x_i, x_j) 1[l_i \neq l_j]
        + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} f_C(x_i, x_j) 1[l_i = l_j]
  – M / C: the sets of must-link / cannot-link constraints
  – w_{ij}, \bar{w}_{ij}: weights of the constraints based on external knowledge
  – 1[true] = 1, 1[false] = 0
  – f_M, f_C: weights of the constraints based on the distance between the objects (should f_M(x, y) be larger or smaller when d(x, y) is larger?)
• A small code sketch that evaluates an objective of this form is appended after the reading list at the end of these notes.

The main algorithm
• Initialize the centroids.
• Repeat until convergence:
  – Assign each point to a cluster.
  – Estimate the cluster means.
  – Update the metric A.

Summary
• Semi-supervised clustering is a way of combining labeled and unlabeled data in learning.
• Search-based and similarity-based methods are two alternative approaches.
• They have useful applications in many domains.
• Experimental results for these applications illustrate their utility.
• They can be combined to produce even better results.

Next class
• Topics
  – Semi-supervised learning
• Readings
  – Learning with Local and Global Consistency
    http://www.public.asu.edu/~jye02/CLASSES/Fall-2005/PAPERS/LLGC.pdf
  – Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions
    http://www.cs.wisc.edu/~jerryzhu/pub/zgl.pdf
  – Regularization and Semi-supervised Learning on Large Graphs
    http://www.public.asu.edu/~jye02/CLASSES/Fall-2005/PAPERS/reg_colt.pdf
  – Semi-supervised Graph Clustering: A Kernel Approach
    http://www.cs.utexas.edu/users/inderjit/public_papers/kernel_icml.pdf
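To make the joint objective above concrete, here is a small NumPy sketch that evaluates an objective of that form for a given assignment. The concrete penalty functions (a violated must-link costs more when the pair is far apart, a violated cannot-link costs more when the pair is close together), the scalar weights w and w_bar in place of per-pair weights, and the function name are illustrative assumptions rather than the exact choices of Basu et al.

import numpy as np

def joint_objective(X, labels, centers, A, must_links, cannot_links,
                    w=1.0, w_bar=1.0):
    """Metric K-Means cost - log det(A) + must-link and cannot-link penalties."""
    def d2(x, y):                          # squared distance under the metric A
        diff = x - y
        return float(diff @ A @ diff)

    # Distortion term plus the Gaussian normalization term.
    J = sum(d2(x, centers[l]) for x, l in zip(X, labels))
    J -= np.log(np.linalg.det(A))

    # Largest pairwise squared distance, used to turn "closeness" into a penalty.
    n = len(X)
    d2_max = max(d2(X[i], X[j]) for i in range(n) for j in range(i))

    for i, j in must_links:                # violated must-link: worse when far apart
        if labels[i] != labels[j]:
            J += w * d2(X[i], X[j])
    for i, j in cannot_links:              # violated cannot-link: worse when close together
        if labels[i] == labels[j]:
            J += w_bar * (d2_max - d2(X[i], X[j]))
    return J

In the full algorithm this quantity would be driven down by alternating the three updates listed above: reassign the points, re-estimate the means, and update the metric A.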