
Information Theoretic Clustering, Co-clustering and Matrix Approximations

Inderjit S. Dhillon
University of Texas, Austin

Data Mining Seminar Series, Mar 26, 2004

Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha
Clustering: Unsupervised Learning

   Grouping together of "similar" objects
       Hard Clustering -- each object belongs to a single cluster
       Soft Clustering -- each object is probabilistically assigned to clusters
Contingency Tables
   Let X and Y be discrete random variables
       X and Y take values in {1, 2, …, m} and {1, 2, …, n}
       p(X, Y) denotes the joint probability distribution—if not
        known, it is often estimated based on co-occurrence data
       Application areas: text mining, market-basket analysis,
        analysis of browsing behavior, etc.
   Key Obstacles in Clustering Contingency Tables
       High Dimensionality, Sparsity, Noise
       Need for robust and scalable algorithms
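
   As a small illustrative sketch (not from the talk), p(X, Y) is typically estimated by
   normalizing a raw co-occurrence count matrix; the arrays below are toy values chosen
   for illustration:

```python
import numpy as np

# Toy word-document co-occurrence counts (rows = words, columns = documents).
counts = np.array([[4., 2., 0.],
                   [1., 3., 5.]])

p_xy = counts / counts.sum()   # empirical joint distribution p(X, Y)
p_x = p_xy.sum(axis=1)         # marginal p(X)
p_y = p_xy.sum(axis=0)         # marginal p(Y)
```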
Co-Clustering

   Simultaneously
       Cluster rows of p(X, Y) into k disjoint groups
       Cluster columns of p(X, Y) into l disjoint groups
   Key goal is to exploit the "duality" between row and column clustering to
   overcome sparsity and noise
Co-clustering Example for Text Data

   Co-clustering clusters both words and documents simultaneously using the
   underlying co-occurrence frequency matrix

   [Figure: a word-by-document co-occurrence matrix is reordered into a
   word-cluster-by-document-cluster block structure]
Co-clustering and Information Theory

   View the "co-occurrence" matrix as a joint probability distribution over
   row & column random variables

   [Figure: the matrix over (X, Y) is compressed into a smaller matrix over the
   clustered variables (X̂, Ŷ)]

   We seek a "hard clustering" of both rows and columns such that the
   "information" in the compressed matrix is maximized.
Information Theory Concepts

   Entropy of a random variable X with probability distribution p:

       $H(p) = -\sum_x p(x) \log p(x)$

   The Kullback-Leibler (KL) Divergence or "Relative Entropy" between two
   probability distributions p and q:

       $KL(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$

   Mutual Information between random variables X and Y:

       $I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
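
   A minimal numpy sketch of these three quantities (illustrative only; the function
   names and the use of natural logarithms are assumptions, since the slides do not fix
   a log base):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x); zero-probability terms contribute 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """KL(p, q) = sum_x p(x) log(p(x) / q(x))."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def mutual_information(p_xy):
    """I(X; Y), computed as KL between the joint and the product of marginals."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())
```

   The last helper uses the identity I(X; Y) = KL(p(x, y) || p(x) p(y)), which is why
   mutual information is written in terms of the KL divergence.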
"Optimal" Co-Clustering

   Seek random variables X̂ and Ŷ, taking values in {1, 2, …, k} and {1, 2, …, l}
   respectively, such that the mutual information $I(\hat{X}; \hat{Y})$ is maximized,
   where
       X̂ = R(X) is a function of X alone, and
       Ŷ = C(Y) is a function of Y alone.
Related Work

   Distributional Clustering
       Pereira, Tishby & Lee (1993), Baker & McCallum (1998)
   Information Bottleneck
       Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001),
       Berkhin & Becher (2002)
   Probabilistic Latent Semantic Indexing
       Hofmann (1999), Hofmann & Puzicha (1999)
   Non-Negative Matrix Approximation
       Lee & Seung (2000)
Information-Theoretic Co-clustering

   Lemma: the "loss in mutual information" equals

       $I(X; Y) - I(\hat{X}; \hat{Y}) = KL\big(p(x, y)\,\|\,q(x, y)\big)
        = H(\hat{X}, \hat{Y}) + H(X | \hat{X}) + H(Y | \hat{Y}) - H(X, Y)$

   p is the input distribution
   q is an approximation to p:

       $q(x, y) = p(\hat{x}, \hat{y})\, p(x | \hat{x})\, p(y | \hat{y}), \quad x \in \hat{x},\ y \in \hat{y}$

   It can be shown that q(x, y) is a maximum entropy approximation subject to the
   cluster constraints.
Example

   p(x, y):

       .05  .05  .05   0    0    0
       .05  .05  .05   0    0    0
        0    0    0   .05  .05  .05
        0    0    0   .05  .05  .05
       .04  .04   0   .04  .04  .04
       .04  .04  .04   0   .04  .04

   Row clusters {x1, x2}, {x3, x4}, {x5, x6} give p(x | x̂):

       .5   0    0
       .5   0    0
        0   .5   0
        0   .5   0
        0    0   .5
        0    0   .5

   Column clusters {y1, y2, y3}, {y4, y5, y6} give p(y | ŷ):

       .36  .36  .28   0    0    0
        0    0    0   .28  .36  .36

   The compressed joint distribution p(x̂, ŷ):

       .3   0
        0   .3
       .2   .2

   The resulting approximation q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ):

       .054  .054  .042   0     0     0
       .054  .054  .042   0     0     0
        0     0     0    .042  .054  .054
        0     0     0    .042  .054  .054
       .036  .036  .028  .028  .036  .036
       .036  .036  .028  .028  .036  .036

   The number of parameters that determine q(x, y) is (m - k) + (kl - 1) + (n - l).
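
   The construction of q(x, y) above can be reproduced with a short script. This is an
   illustrative sketch (the array names R, C and the helper functions are assumptions,
   not the talk's code); it also checks the lemma I(X; Y) − I(X̂; Ŷ) = KL(p || q)
   numerically:

```python
import numpy as np

p = np.array([[.05, .05, .05, 0,   0,   0  ],
              [.05, .05, .05, 0,   0,   0  ],
              [0,   0,   0,   .05, .05, .05],
              [0,   0,   0,   .05, .05, .05],
              [.04, .04, 0,   .04, .04, .04],
              [.04, .04, .04, 0,   .04, .04]])
R = np.array([0, 0, 1, 1, 2, 2])   # row-cluster assignment of each row
C = np.array([0, 0, 0, 1, 1, 1])   # column-cluster assignment of each column
k, l = 3, 2

# Compressed joint distribution p(x̂, ŷ).
p_hat = np.array([[p[np.ix_(R == i, C == j)].sum() for j in range(l)] for i in range(k)])

# Conditionals p(x | x̂) and p(y | ŷ) (each row/column contributes only within its own cluster).
p_x, p_y = p.sum(axis=1), p.sum(axis=0)
p_x_given_xhat = p_x / p_hat.sum(axis=1)[R]
p_y_given_yhat = p_y / p_hat.sum(axis=0)[C]

# q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ).
q = p_hat[np.ix_(R, C)] * np.outer(p_x_given_xhat, p_y_given_yhat)

def mutual_information(t):
    r, c = t.sum(axis=1, keepdims=True), t.sum(axis=0, keepdims=True)
    m = t > 0
    return np.sum(t[m] * np.log(t[m] / (r * c)[m]))

def kl(a, b):
    m = a > 0
    return np.sum(a[m] * np.log(a[m] / b[m]))

print(np.round(q, 3))                                               # matches the q(x, y) shown above
print(mutual_information(p) - mutual_information(p_hat), kl(p, q))  # the two losses agree
```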
Decomposition Lemma

   Question: How to minimize $KL(p(x, y)\,\|\,q(x, y))$?
   The following lemma reveals the answer:

       $KL\big(p(x, y)\,\|\,q(x, y)\big) = \sum_{\hat{x}} \sum_{x \in \hat{x}} p(x)\, KL\big(p(y|x)\,\|\,q(y|\hat{x})\big)$

   where

       $q(y|\hat{x}) = p(y|\hat{y})\, p(\hat{y}|\hat{x}) = p(y|\hat{y}) \sum_{x \in \hat{x}} p(\hat{y}|x)\, p(x|\hat{x})$

   Note that q(y | x̂) may be thought of as the "prototype" of row cluster x̂.

   Similarly,

       $KL\big(p(x, y)\,\|\,q(x, y)\big) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(y)\, KL\big(p(x|y)\,\|\,q(x|\hat{y})\big)$
Co-Clustering Algorithm

   [Step 1] Set i = 1. Start with an initial co-clustering (R_i, C_i). Compute q^(i,i).

   [Step 2] For every row x, assign it to the cluster x̂ that minimizes
            $KL\big(p(y|x)\,\|\,q^{(i,i)}(y|\hat{x})\big)$

   [Step 3] We now have (R_{i+1}, C_i). Compute q^(i+1,i).

   [Step 4] For every column y, assign it to the cluster ŷ that minimizes
            $KL\big(p(x|y)\,\|\,q^{(i+1,i)}(x|\hat{y})\big)$

   [Step 5] We now have (R_{i+1}, C_{i+1}). Compute q^(i+1,i+1). Set i = i + 1 and
            iterate Steps 2-5 until convergence.
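
   A minimal, self-contained sketch of this alternating procedure (illustrative only,
   not the Bow-based implementation used in the experiments; it assumes every row,
   column and cluster keeps positive probability mass, and uses a fixed iteration count
   rather than a convergence test):

```python
import numpy as np

def prototypes(p, R, C, k, l):
    """Row-cluster prototypes: a (k x n) matrix whose x̂-th row is q(y | x̂) = p(y | ŷ) p(ŷ | x̂)."""
    p_y = p.sum(axis=0)
    p_y_given_yhat = p_y / np.array([p_y[C == j].sum() for j in range(l)])[C]
    proto = np.zeros((k, p.shape[1]))
    for i in range(k):
        rows = p[R == i]                      # all rows x belonging to cluster x̂ = i
        p_yhat_given_xhat = np.array([rows[:, C == j].sum() for j in range(l)]) / rows.sum()
        proto[i] = p_y_given_yhat * p_yhat_given_xhat[C]
    return proto

def reassign(cond, proto):
    """For each row of cond (e.g. p(y|x)), pick the cluster minimizing KL(row || prototype)."""
    labels = np.empty(cond.shape[0], dtype=int)
    for x, px in enumerate(cond):
        m = px > 0
        labels[x] = np.argmin(np.sum(px[m] * np.log(px[m] / np.maximum(proto[:, m], 1e-12)), axis=1))
    return labels

def cocluster(p, k, l, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.integers(0, k, size=p.shape[0])   # initial row clustering    (Step 1)
    C = rng.integers(0, l, size=p.shape[1])   # initial column clustering
    p_y_given_x = p / p.sum(axis=1, keepdims=True)
    p_x_given_y = (p / p.sum(axis=0, keepdims=True)).T
    for _ in range(n_iter):
        R = reassign(p_y_given_x, prototypes(p, R, C, k, l))      # Steps 2-3: update rows
        C = reassign(p_x_given_y, prototypes(p.T, C, R, l, k))    # Steps 4-5: update columns
    return R, C
```

   On the example p(x, y) above, cocluster(p, k=3, l=2) typically recovers the row
   clusters {x1, x2}, {x3, x4}, {x5, x6} and column clusters {y1, y2, y3}, {y4, y5, y6},
   up to relabeling of the cluster indices.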
Properties of Co-clustering Algorithm

   Main Theorem: Co-clustering "monotonically" decreases the loss in mutual
   information
   Co-clustering converges to a local minimum
   Can be generalized to multi-dimensional contingency tables
   q can be viewed as a "low complexity" non-negative matrix approximation
   q preserves marginals of p, and co-cluster statistics
   Implicit dimensionality reduction at each step helps overcome sparsity &
   high-dimensionality
   Computationally economical
   [Worked example slides: the algorithm is run on the example p(x, y) above,
   starting from an arbitrary initial co-clustering. Each slide shows the current
   p(x | x̂), p(y | ŷ), compressed matrix p(x̂, ŷ) and approximation q(x, y);
   successive row and column re-assignments steadily reduce the loss in mutual
   information, and after a few iterations the algorithm converges to the row
   clusters {x1, x2}, {x3, x4}, {x5, x6} and column clusters {y1, y2, y3},
   {y4, y5, y6}, reproducing the q(x, y) shown earlier.]
Applications -- Text Classification

   Assigning class labels to text documents
   Training and Testing Phases

   [Figure: a document collection is grouped into classes Class-1 … Class-m
   (training data); a classifier learns from the training data and assigns a
   class to each new document]
Feature Clustering (dimensionality reduction)

   Feature Selection
       [Figure: a bag-of-words document becomes a length-m word vector, from
       which k words (Word#1 … Word#k) are kept]
       • Select the "best" words and throw away the rest
       • Frequency-based pruning
       • Information-criterion-based pruning

   Feature Clustering
       [Figure: a bag-of-words document becomes a length-m word vector, which is
       mapped onto k word clusters (Cluster#1 … Cluster#k)]
       • Do not throw away words; cluster words instead
       • Use the word clusters as features
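
   A small sketch of the feature-clustering step (illustrative; the function and
   variable names are assumptions, and the word-to-cluster map would come from the
   divisive clustering algorithm):

```python
import numpy as np

def cluster_features(doc_term_counts, word_cluster, k):
    """Collapse an (n_docs x m) term-count matrix into an (n_docs x k) cluster-count matrix."""
    features = np.zeros((doc_term_counts.shape[0], k))
    for j, c in enumerate(word_cluster):
        features[:, c] += doc_term_counts[:, j]   # add word j's counts to its cluster c
    return features

# Example: 3 documents, 5 words grouped into 2 word clusters.
counts = np.array([[2, 0, 1, 0, 3],
                   [0, 1, 0, 4, 0],
                   [1, 1, 1, 1, 1]])
word_cluster = np.array([0, 0, 1, 1, 1])
print(cluster_features(counts, word_cluster, k=2))
```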
Experiments

   Data sets
       20 Newsgroups data
           20 classes, 20000 documents
       Classic3 data set
           3 classes (cisi, med and cran), 3893 documents
       Dmoz Science HTML data
           49 leaves in the hierarchy
           5000 documents with 14538 words
           Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt

   Implementation Details
       Bow toolkit -- used for indexing, co-clustering, clustering and classifying
Results (20Ng)
   Classification Accuracy
    on 20 Newsgroups data
    with 1/3-2/3 test-train
    split
   Divisive clustering beats
    feature selection
    algorithms by a large
    margin
   The effect is more
    significant at lower
    number of features
Results (Dmoz)
   Classification
    Accuracy on
    Dmoz data with
    1/3-2/3 test train
    split
   Divisive Clustering
    is better at lower
    number of
    features
   Note contrasting
    behavior of Naïve
    Bayes and SVMs
Results (Dmoz)
   Naïve Bayes on Dmoz data with only 2% training data
   Note that Divisive Clustering achieves a higher maximum accuracy than IG, a
   significant increase of 13%
   Divisive Clustering performs better than IG when less training data is
   available
Hierarchical Classification

   [Figure: class hierarchy with Science at the root, splitting into Math,
   Physics and Social Science, which in turn split into leaves such as Number
   Theory, Logic, Quantum Theory, Mechanics, Economics and Archeology]

• A flat classifier builds a classifier over the leaf classes in the above hierarchy
• A hierarchical classifier builds a classifier at each internal node of the hierarchy
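
A toy sketch of the hierarchical scheme (the data structures and routing functions
below are illustrative stand-ins, not the talk's Naïve Bayes implementation):

```python
def classify_hierarchical(doc, root, children, node_classifier):
    """Route a document down the class hierarchy until a leaf class is reached."""
    node = root
    while node in children:               # internal node: ask that node's classifier
        node = node_classifier[node](doc)
    return node

# Part of the hierarchy on this slide, with dummy per-node classifiers.
children = {"Science": ["Math", "Physics", "Social Science"],
            "Math": ["Number Theory", "Logic"]}
node_classifier = {"Science": lambda doc: "Math",   # a real system would use a trained model here
                   "Math": lambda doc: "Logic"}
print(classify_hierarchical("a document about first-order logic", "Science", children, node_classifier))
```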
Results (Dmoz)

   • Hierarchical Classifier (Naïve Bayes at each node)
   • Hierarchical Classifier: 64.54% accuracy at just 10 features (Flat achieves
     64.04% accuracy at 1000 features)
   • Hierarchical Classifier improves accuracy to 68.42% from the 64.42% maximum
     achieved by flat classifiers

   [Chart: % Accuracy vs. Number of Features (5 to 10000) on Dmoz data for the
   Hierarchical, Flat(DC) and Flat(IG) classifiers]
Anecdotal Evidence

   Cluster 10                Cluster 9                 Cluster 12
   Divisive Clustering       Divisive Clustering       Agglomerative Clustering
   (rec.sport.hockey)        (rec.sport.baseball)      (rec.sport.hockey and rec.sport.baseball)

   team                      hit                       team          detroit
   game                      runs                      hockey        pitching
   play                      Baseball                  Games         hitter
   hockey                    base                      Players       rangers
   Season                    Ball                      baseball      nyi
   boston                    greg                      league        morris
   chicago                   morris                    player        blues
   pit                       Ted                       nhl           shots
   van                       Pitcher                   Pit           Vancouver
   nhl                       Hitting                   buffalo       ens

   Top few words sorted in clusters obtained by the Divisive and Agglomerative
   approaches on 20 Newsgroups data
Co-Clustering Results (CLASSIC3)

   Co-Clustering (0.9835)          1-D Clustering (0.821)

    992      4      8               847    142      44
     40   1452      7                41    954     405
      1      4   1387               275     86    1099
Results – Binary (subset of 20Ng data)

   Binary (0.852, 0.67)                     Binary_subject (0.946, 0.648)
   Co-clustering     1-D Clustering         Co-clustering     1-D Clustering

    207    31          178   104             234    11          179    94
     43   219           72   146              16   239           71   156
Precision – 20Ng data

                     Co-clustering   1-D clustering   IB-Double   IDC
   Binary                0.98            0.64            0.70       -
   Binary_Subject        0.96            0.67             -        0.85
   Multi5                0.87            0.34            0.5        -
   Multi5_Subject        0.89            0.37             -        0.88
   Multi10               0.56            0.17            0.35       -
   Multi10_Subject       0.54            0.19             -        0.55
Results: Sparsity (Binary_subject data)
Results (Monotonicity)
Conclusions

   Information-theoretic approach to clustering, co-clustering and matrix
   approximation
   Implicit dimensionality reduction at each step to overcome sparsity &
   high-dimensionality
   The theoretical approach has the potential to extend to other problems:
       Multi-dimensional co-clustering
       MDL to choose the number of co-clusters
       Generalized co-clustering via Bregman divergences
More Information

   Email: inderjit@cs.utexas.edu
   Papers are available at: http://www.cs.utexas.edu/users/inderjit
   "Divisive Information-Theoretic Feature Clustering for Text Classification",
    Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR),
    March 2003 (also KDD, 2002)
   "Information-Theoretic Co-clustering", Dhillon, Mallela & Modha, KDD, 2003
   "Clustering with Bregman Divergences", Banerjee, Merugu, Dhillon & Ghosh,
    SIAM Data Mining Proceedings, April 2004
   "A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix
    Approximation", Banerjee, Dhillon, Ghosh, Merugu & Modha, working
    manuscript, 2004

								