Semi-Supervised Clustering II

Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02

Outline of lecture

• Overview of search-based semi-supervised clustering

• Similarity-based semi-supervised clustering
   – Altered similarity matrix
   – Similarity metric learning
   – Altered Euclidean distance

• Combination of search-based and similarity-based semi-
  supervised clustering

• Summary
 Overview of search-based semi-supervised
                clustering
• Enforcing constraints (must-link, cannot-link) on the labeled
  data during clustering [Wagstaff:ICML01].
   – COP K-Means

• Use the labeled data to initialize clusters in an iterative
  refinement algorithm (e.g., K-Means) [Basu:ICML02].
   – Seeded K-Means
   – Constrained K-Means
             Other search-based algorithms

• PC K-Means (Basu et al.): K-Means with constraint-violation penalties. The objective adds penalty terms to the usual distortion:

    J = \sum_{x_i} \|x_i - \mu_{l_i}\|^2
        + \sum_{(x_i, x_j) \in M} w_{ij} \, 1[l_i \neq l_j]
        + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} \, 1[l_i = l_j]

  where w is the penalty matrix.

• Kernel-based semi-supervised clustering (Kulis et al.): builds on Kernel K-Means; constraint information is folded into the kernel matrix as a reward for must-link pairs and a penalty for cannot-link pairs.
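To make the penalty structure concrete, here is a small Python sketch that evaluates a PC K-Means-style penalized objective for a fixed assignment; the argument names (`w`, `w_bar`, `must_links`, `cannot_links`) are illustrative rather than the paper's notation.

```python
import numpy as np

def pc_kmeans_objective(X, labels, centroids, must_links, cannot_links, w, w_bar):
    """Distortion plus weighted penalties for violated must-link and
    cannot-link constraints (sketch of a PC K-Means-style objective)."""
    cost = sum(np.sum((X[i] - centroids[labels[i]]) ** 2) for i in range(len(X)))
    cost += sum(w[i, j] for i, j in must_links if labels[i] != labels[j])
    cost += sum(w_bar[i, j] for i, j in cannot_links if labels[i] == labels[j])
    return cost
```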
Similarity-Based Semi-Supervised Clustering
 • Train an adaptive similarity function to fit the labeled data.
 • Use a standard clustering algorithm with the trained
   similarity function to cluster the unlabeled data.
 • Adaptive similarity functions:
    – Altered similarity matrix [Kamvar:IJCAI03]
    – Trained Mahalanobis distance [Xing:NIPS02]
    – Altered Euclidean distance [Klein:ICML02]
 • Clustering algorithms:
    – Spectral clustering [Kamvar:IJCAI03]
    – Complete-link agglomerative [Klein:ICML02]
    – K-means [Xing:NIPS02]
               Altered similarity matrix

• Paper: Spectral learning. Kamvar et al.

• Graph based clustering
   – W: similarity matrix
   – D: degree matrix (diagonal matrix whose entries are the row sums of W)


• Key idea: alter the similarity matrix W based on the
  domain knowledge.
  Overview of graph-based (spectral) clustering

• Associate each data item with a vertex in a weighted graph,
  where the weights on the edges between elements are large if the
  elements are similar and small if they are not.

• Cut the graph into connected components with relatively large
  interior weights by cutting edges with relatively low weights.

• Clustering becomes a graph cut problem.
   – Many different algorithms
Overview of spectral clustering

(Figure omitted.) W: similarity matrix of size n x n.
Overview of spectral clustering

   1. Compute the similarity matrix W and the degree matrix D.
   2. Form L = D^{-1/2} W D^{-1/2}.
   3. Form the matrix Y whose columns are the first K eigenvectors of D^{-1/2} W D^{-1/2}.
   4. Normalize Y so that all the rows have unit length.
   5. Run K-Means on the rows of Y to get the K clusters (Ng, Jordan, and Weiss, NIPS'02), or apply an iterative optimization to get the partition matrix (Yu and Shi, ICCV'03).

Note that (Kamvar, IJCAI'03) applied a different spectral clustering algorithm.
Semi-supervised spectral clustering

1. Compute the similarity matrix W and the degree matrix D.
2. For each must-link pair (i, j), set W_{ij} = W_{ji} = 1.
3. For each cannot-link pair (i, j), set W_{ij} = W_{ji} = 0.
4. Form L = D^{-1/2} W D^{-1/2}.
5. Form the matrix Y whose columns are the first K eigenvectors of D^{-1/2} W D^{-1/2}.
6. Normalize Y so that all the rows have unit length.
7. Run K-Means on the rows of Y to get the K clusters (Ng, Jordan, and Weiss, NIPS'02), or apply an iterative optimization to get the partition matrix (Yu and Shi, ICCV'03).

Note that (Kamvar, IJCAI'03) applied a different spectral clustering algorithm.
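A minimal Python sketch of the recipe above (numpy + scikit-learn), assuming a precomputed similarity matrix and 0-indexed constraint pairs; it follows the Ng/Jordan/Weiss-style steps listed on this slide rather than Kamvar et al.'s exact algorithm, and recomputes the degrees from the altered W.

```python
import numpy as np
from sklearn.cluster import KMeans

def constrained_spectral_clustering(W, must_links, cannot_links, k):
    """Alter the similarity matrix with the constraints, then run
    Ng/Jordan/Weiss-style spectral clustering (a sketch, not Kamvar
    et al.'s exact normalization)."""
    W = W.copy().astype(float)
    for i, j in must_links:                      # maximal similarity for must-link pairs
        W[i, j] = W[j, i] = 1.0
    for i, j in cannot_links:                    # zero similarity for cannot-link pairs
        W[i, j] = W[j, i] = 0.0
    d = np.maximum(W.sum(axis=1), 1e-12)         # degrees, recomputed from the altered W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ W @ D_inv_sqrt              # D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    Y = vecs[:, -k:]                             # first K (largest) eigenvectors
    Y = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```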
                Distance metric learning

Paper: Distance metric learning, with application to
clustering with side-information. E. Xing, et al.

Given two sets of pairs S and D:

  S = {(x_i, x_j) : x_i and x_j are known to be similar}
  D = {(x_i, x_j) : x_i and x_j are known to be dissimilar}

Compute a distance metric which respects these two sets.
Distance metric learning

Define a new distance measure of the form:

  d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)}

This is equivalent to a linear transformation of the original data (x -> A^{1/2} x) followed by the ordinary Euclidean distance.
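For concreteness, a small numpy sketch of this distance and its "linear transformation" view; A is assumed to be a symmetric positive semi-definite matrix.

```python
import numpy as np

def mahalanobis(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a symmetric PSD matrix A."""
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ A @ diff))

def transform(X, A):
    """Equivalent view: map each x to A^{1/2} x, after which the plain
    Euclidean distance equals d_A on the original points."""
    w, V = np.linalg.eigh(A)                     # eigendecomposition of A
    A_half = V @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ V.T
    return np.asarray(X) @ A_half.T
```

Any standard clustering algorithm (e.g., K-Means) run on transform(X, A) is therefore implicitly using the learned metric.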
Distance metric learning

A naive formulation that uses squared distances for the dissimilar pairs as well (e.g., requiring the sum of squared distances over D to be at least 1) has a degenerate solution: the optimal A is a rank-one matrix, i.e., the data are projected onto a line.
Distance metric learning

Compute A by solving the following minimization problem:

  min_A  \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2
  s.t.   \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \geq 1,   A \succeq 0

Two different cases:
  • A is diagonal
  • A is a full matrix
Algorithm: gradient ascent + iterative projection
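As a toy illustration only: the diagonal case minimizes g(a) = sum_S ||x_i - x_j||_A^2 - log(sum_D ||x_i - x_j||_A) over a >= 0, which the paper solves with Newton-Raphson; the sketch below substitutes plain projected gradient descent, and the step size and iteration count are arbitrary assumptions.

```python
import numpy as np

def learn_diagonal_metric(X, S, D, lr=0.05, n_iter=500, eps=1e-12):
    """Projected-gradient sketch for the diagonal case: A = diag(a), a >= 0."""
    X = np.asarray(X, dtype=float)
    a = np.ones(X.shape[1])                          # start from the identity metric
    dS = np.array([X[i] - X[j] for i, j in S]) ** 2  # per-feature squared diffs, similar pairs
    dD = np.array([X[i] - X[j] for i, j in D]) ** 2  # per-feature squared diffs, dissimilar pairs
    grad_S = dS.sum(axis=0)                          # gradient of the first (linear) term
    for _ in range(n_iter):
        dist_D = np.sqrt(dD @ a + eps)               # ||x_i - x_j||_A for dissimilar pairs
        grad_D = (dD / (2.0 * dist_D[:, None])).sum(axis=0) / dist_D.sum()
        a -= lr * (grad_S - grad_D)                  # descend on g(a)
        a = np.maximum(a, 0.0)                       # project onto a >= 0
    return np.diag(a)
```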
Semi-supervised clustering example (figures omitted)

• Similarity-based clustering of the example data
• Distances transformed by the learned metric
• Clustering result with the trained metric

Source: E. Xing, et al. Distance metric learning.
            Observations from experiments

• Main observations
   – Using a learned diagonal or full metric leads to significantly
     improved performance over naïve K-Means

   – In most cases, using a learned metric with constrained K-Means
     also outperforms using constrained K-Means alone, sometimes by
     a very large margin.

   – Having more side-information typically leads to metrics giving
     better performance.
Altered Euclidean distance

• Paper: From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering. D. Klein et al.

• Two types of constraints: must-links and cannot-links
• Clustering algorithm: hierarchical clustering

Constraints (illustration omitted)
Overview of Hierarchical Clustering Algorithm

 •   Agglomerative versus divisive

 •   Basic algorithm of agglomerative clustering
     1.   Compute the distance matrix
     2.   Let each data point be a cluster
     3.   Repeat
     4.        Merge the two closest clusters
     5.        Update the distance matrix
     6.   Until only a single cluster remains

 •   The key operation is the update of the distance between two
     clusters (a minimal sketch follows below)
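A minimal SciPy sketch of the plain (unconstrained) complete-link version; the constrained variants later in the lecture only change the distance matrix that is fed to the linkage step.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def complete_link_clusters(X, k):
    """Plain agglomerative clustering with the complete-link (MAX) rule."""
    Z = linkage(pdist(X), method='complete')   # repeatedly merge the two closest clusters
    return fcluster(Z, t=k, criterion='maxclust')
```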
How to Define Inter-Cluster Distance

Given the pairwise distance matrix over points p1, p2, p3, p4, p5, ..., how do we define the distance between two clusters?

•   MIN
•   MAX
•   Group Average
•   Distance Between Centroids
Must-link constraints

• Set the distance between each must-link pair to zero.

• Derive a new metric by running an all-pairs-shortest-distances algorithm.
   – The result is still a metric.
   – It is faithful to the original metric.

   – Computational complexity: O(N^2 C)
       • C: number of points involved in must-link constraints
       • N: total number of points
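A Python sketch of the propagation step, assuming the input distances form a square symmetric matrix satisfying the triangle inequality; only points that appear in must-link constraints are used as intermediate vertices, matching the O(N^2 C) cost.

```python
import numpy as np

def impose_must_links(dist, must_links):
    """Set must-link distances to zero, then restore consistency with an
    all-pairs-shortest-path pass routed only through constrained points."""
    dist = np.asarray(dist, dtype=float).copy()
    involved = sorted({p for pair in must_links for p in pair})
    for i, j in must_links:
        dist[i, j] = dist[j, i] = 0.0
    for c in involved:                            # Floyd-Warshall restricted to constrained points
        dist = np.minimum(dist, dist[:, [c]] + dist[[c], :])
    return dist
```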
New distance matrix based on must-link constraints

(Figure: the updated distance matrix over points p1, ..., p5.)

Hierarchical clustering can be carried out based on the new distance matrix.

What is missing? The cannot-link constraints have not been used yet.
Cannot-link constraint

• Run hierarchical clustering with complete link (MAX).
   – The distance between two clusters is determined by the largest
     pairwise distance.

• Set the distance between each cannot-link pair to a very large value
  (effectively infinite), so that complete link never merges two clusters
  containing a cannot-link pair.

• The new distance matrix does not define a metric.
   – Works very well in practice.
Constrained complete-link clustering algorithm

Derive a new distance matrix based on both types of constraints, then run complete-link hierarchical clustering on it.
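A sketch of the full pipeline, reusing impose_must_links from the earlier sketch; the "effectively infinite" cannot-link distance is an assumption (any value larger than every other entry behaves the same under complete link).

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def constrained_complete_link(dist, must_links, cannot_links, k):
    """Propagate must-links, push cannot-link pairs far apart, then run
    complete-link hierarchical clustering on the new distance matrix."""
    dist = impose_must_links(dist, must_links)        # sketch defined earlier
    big = dist.max() * 10.0 + 1.0                     # "effectively infinite" (an assumption)
    for i, j in cannot_links:
        dist[i, j] = dist[j, i] = big
    Z = linkage(squareform(dist, checks=False), method='complete')
    return fcluster(Z, t=k, criterion='maxclust')
```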
Illustration

Five points, 1-5, connected as in the figure (omitted), with the initial distance matrix:

        1     2     3     4     5
  1     0    0.2   0.5   0.1   0.8
  2           0    0.4   0.2   0.6
  3                 0    0.3   0.2
  4                       0    0.5
  5                             0
New distance matrix

Must-links: 1–2, 3–4
Cannot-links: 2–3

Initial distance matrix:

        1     2     3     4     5
  1     0    0.2   0.5   0.1   0.8
  2           0    0.4   0.2   0.6
  3                 0    0.3   0.2
  4                       0    0.5
  5                             0

New distance matrix (after must-link propagation):

        1     2     3     4     5
  1     0     0    0.1   0.1   0.8
  2           0    0.2   0.2   0.6
  3                 0     0    0.2
  4                       0    0.2
  5                             0
Hierarchical clustering

Points 1 and 2 form a cluster, and points 3 and 4 form another cluster. The resulting cluster-level distance matrix is:

          {1,2}   {3,4}    {5}
  {1,2}     0      0.9     0.8
  {3,4}              0     0.2
  {5}                       0

(The original slide also showed the corresponding dendrogram over points 1-5.)
  Combining Similarity and Search-Based
      Semi-Supervised Clustering
• Paper: Comparing and Unifying Search-Based and Similarity-
  Based Approaches to Semi-Supervised Clustering, Basu, et al.


• With small amounts of labeled data, the seeded/constrained (search-based)
  approach tends to do better than the similarity-based approach.
• With larger amounts of labeled data, the similarity-based approach tends
  to do better.
• Combining the two approaches outperforms either individual approach.
Generalized K-Means model

In the K-Means clustering formulation, using the matrix A is equivalent to considering a generalized K-Means model in which all Gaussians share a common covariance matrix A^{-1}. Maximizing the complete-data log-likelihood under this model is equivalent to minimizing the objective function

  \sum_{x_i} \left( \|x_i - \mu_{l_i}\|_A^2 - \log(\det A) \right),

where the second term arises from the normalizing constant of a Gaussian with covariance matrix A^{-1}.
Joint objective function

• Adding constraint violation penalties:

  \sum_{x_i \in D} \left( \|x_i - \mu_{l_i}\|_A^2 - \log(\det A) \right)
  + \sum_{(x_i, x_j) \in M} w_{ij} \, f_M(x_i, x_j) \, 1[l_i \neq l_j]
  + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} \, f_C(x_i, x_j) \, 1[l_i = l_j]

   –    M / C: the sets of must-link / cannot-link constraints
   –    w_{ij}, \bar{w}_{ij}: weights of constraints based on external knowledge
   –    1[true] = 1, 1[false] = 0
   –    f_M, f_C: weights of constraints based on the distance between
        the objects (should f_M(x, y) be larger or smaller when d(x, y) is larger?)
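A sketch that evaluates this joint objective for a fixed assignment; f_M and f_C are passed in as callables, A is assumed to be symmetric positive definite so that log det(A) is well defined, and the names are illustrative.

```python
import numpy as np

def joint_objective(X, labels, centroids, A, M, C, w, w_bar, f_M, f_C):
    """Mahalanobis distortion with the log det(A) normalizer, plus
    distance-dependent penalties for violated constraints."""
    def d2(x, y):                                  # ||x - y||_A^2
        diff = x - y
        return diff @ A @ diff
    obj = sum(d2(X[i], centroids[labels[i]]) - np.log(np.linalg.det(A))
              for i in range(len(X)))
    obj += sum(w[i, j] * f_M(X[i], X[j]) for i, j in M if labels[i] != labels[j])
    obj += sum(w_bar[i, j] * f_C(X[i], X[j]) for i, j in C if labels[i] == labels[j])
    return obj
```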
The main algorithm

1. Choose the initial centroids.
2. Repeat until convergence:
   – Assign each point to a cluster.
   – Estimate the cluster means.
   – Update A.
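A high-level Python sketch of this loop, with constraint weights collapsed to scalars, unit f_M/f_C, and A updated as a regularized inverse within-cluster covariance; these are all simplifying assumptions rather than the paper's exact update rules.

```python
import numpy as np

def constrained_kmeans_with_metric(X, k, M, C, w_ml=1.0, w_cl=1.0, n_iter=20, seed=0):
    """Alternate between (1) penalized cluster assignment, (2) mean estimation,
    and (3) a metric update, as in the flow on this slide (sketch only)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(n, size=k, replace=False)].copy()   # initial centroids
    A = np.eye(d)
    labels = rng.integers(0, k, size=n)
    ml, cl = {i: [] for i in range(n)}, {i: [] for i in range(n)}
    for i, j in M:
        ml[i].append(j); ml[j].append(i)
    for i, j in C:
        cl[i].append(j); cl[j].append(i)
    for _ in range(n_iter):
        # 1. assign clusters: Mahalanobis cost plus constraint-violation penalties
        for i in range(n):
            cost = np.array([(X[i] - centroids[h]) @ A @ (X[i] - centroids[h])
                             for h in range(k)])
            for j in ml[i]:
                cost += w_ml * (np.arange(k) != labels[j])   # penalize splitting a must-link
            for j in cl[i]:
                cost += w_cl * (np.arange(k) == labels[j])   # penalize joining a cannot-link
            labels[i] = int(np.argmin(cost))
        # 2. estimate means
        for h in range(k):
            if np.any(labels == h):
                centroids[h] = X[labels == h].mean(axis=0)
        # 3. update A (regularized inverse of the within-cluster covariance)
        diffs = X - centroids[labels]
        A = np.linalg.inv(diffs.T @ diffs / n + 1e-6 * np.eye(d))
    return labels, centroids, A
```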
                       Summary

• Semi-supervised clustering is a way of combining labeled and
  unlabeled data in learning.
• Search-based and similarity-based are two alternative
  approaches.
• They have useful applications in many domains.
• Experimental results for these applications illustrate their
  utility.
• They can be combined to produce even better results.
                                  Next class

• Topics
   – Semi-supervised learning


• Readings
   – Learning with Local and Global Consistency
      • http://www.public.asu.edu/~jye02/CLASSES/Fall-2005/PAPERS/LLGC.pdf
   – Semi-Supervised Learning Using Gaussian Fields and Harmonic functions
      • http://www.cs.wisc.edu/~jerryzhu/pub/zgl.pdf
   – Regularization and Semi-supervised Learning on Large Graphs
      • http://www.public.asu.edu/~jye02/CLASSES/Fall-2005/PAPERS/reg_colt.pdf
   – Semi-supervised Graph Clustering: A Kernel Approach
      • http://www.cs.utexas.edu/users/inderjit/public_papers/kernel_icml.pdf

								