# Information Theoretic Clustering, Co-clustering and Matrix Approximations

Inderjit S. Dhillon
University of Texas, Austin

Data Mining Seminar Series, Mar 26, 2004

Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha
## Clustering: Unsupervised Learning

- Grouping together of "similar" objects
- Hard clustering: each object belongs to a single cluster
- Soft clustering: each object is probabilistically assigned to clusters
## Contingency Tables

- Let X and Y be discrete random variables taking values in {1, 2, ..., m} and {1, 2, ..., n}
- p(X, Y) denotes the joint probability distribution; if not known, it is often estimated from co-occurrence data
- Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
- Key obstacles in clustering contingency tables: high dimensionality, sparsity, noise
- Need for robust and scalable algorithms
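Concretely, the joint distribution is often just the normalized co-occurrence table. A toy sketch (the counts below are made-up numbers, not from the talk):

```python
import numpy as np

# Hypothetical 4-word x 3-document co-occurrence counts.
counts = np.array([[10, 8, 0],
                   [ 7, 9, 1],
                   [ 0, 1, 6],
                   [ 1, 0, 8]], dtype=float)

p_xy = counts / counts.sum()   # estimate of the joint p(X, Y)
p_x = p_xy.sum(axis=1)         # marginal p(X)
p_y = p_xy.sum(axis=0)         # marginal p(Y)
assert np.isclose(p_xy.sum(), 1.0)
```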
## Co-Clustering

- Simultaneously:
  - cluster rows of p(X, Y) into k disjoint groups
  - cluster columns of p(X, Y) into l disjoint groups
- Key goal: exploit the "duality" between row and column clustering to overcome sparsity and noise
## Co-clustering Example for Text Data

- Co-clustering clusters both words and documents simultaneously, using the underlying word-document co-occurrence frequency matrix

(Figure: a document x word matrix is reordered into document clusters and word clusters, exposing its block structure.)
## Co-clustering and Information Theory

- View the "co-occurrence" matrix as a joint probability distribution over row and column random variables $X$ and $Y$
- We seek a "hard clustering" of both rows and columns such that the "information" in the compressed matrix is maximized

(Figure: the joint distribution over $(X, Y)$ is compressed to a smaller table over the clustered variables $(\hat{X}, \hat{Y})$.)
## Information Theory Concepts

- Entropy of a random variable X with probability distribution p:

$$H(p) = -\sum_x p(x) \log p(x)$$

- The Kullback-Leibler (KL) divergence or "relative entropy" between two probability distributions p and q:

$$KL(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

- Mutual information between random variables X and Y:

$$I(X, Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
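For concreteness, a minimal numpy sketch of the three quantities (base-2 logs, with the usual convention that zero-probability terms contribute 0):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x); zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(p_xy):
    """I(X, Y) = KL(p(x, y) || p(x) p(y)) for a joint-distribution matrix."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())

# Independent variables carry no mutual information:
print(mutual_information(np.full((2, 2), 0.25)))   # 0.0
```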
## "Optimal" Co-Clustering

Seek random variables $\hat{X}$ and $\hat{Y}$, taking values in {1, 2, ..., k} and {1, 2, ..., l}, such that the mutual information $I(\hat{X}, \hat{Y})$ is maximized, where

- $\hat{X} = R(X)$ is a function of X alone
- $\hat{Y} = C(Y)$ is a function of Y alone
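In code, the objective for a candidate pair of hard assignment maps $(R, C)$ is just the mutual information of the aggregated k x l table; a small sketch (function names are mine, not from the talk):

```python
import numpy as np

def compressed_table(p_xy, R, C, k, l):
    """p(xhat, yhat): sum p(x, y) over the co-clusters Xhat = R(X), Yhat = C(Y)."""
    p_hat = np.zeros((k, l))
    np.add.at(p_hat, (R[:, None], C[None, :]), p_xy)
    return p_hat

def compressed_mi(p_xy, R, C, k, l):
    """I(Xhat, Yhat): the objective to be maximized over all maps R and C."""
    p_hat = compressed_table(p_xy, R, C, k, l)
    px = p_hat.sum(axis=1, keepdims=True)
    py = p_hat.sum(axis=0, keepdims=True)
    nz = p_hat > 0
    return np.sum(p_hat[nz] * np.log2(p_hat[nz] / (px * py)[nz]))
```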
## Related Work

- Distributional clustering: Pereira, Tishby & Lee (1993); Baker & McCallum (1998)
- Information Bottleneck: Tishby, Pereira & Bialek (1999); Slonim, Friedman & Tishby (2001); Berkhin & Becher (2002)
- Probabilistic Latent Semantic Indexing: Hofmann (1999); Hofmann & Puzicha (1999)
- Non-negative matrix approximation: Lee & Seung (2000)
## Information-Theoretic Co-clustering

- Lemma: the "loss in mutual information" equals

$$I(X, Y) - I(\hat{X}, \hat{Y}) = KL(p(x, y) \,\|\, q(x, y)) = H(\hat{X}, \hat{Y}) + H(X \mid \hat{X}) + H(Y \mid \hat{Y}) - H(X, Y)$$

- p is the input distribution
- q is an approximation to p:

$$q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y}), \qquad x \in \hat{x},\ y \in \hat{y}$$

- It can be shown that q(x, y) is the maximum entropy approximation to p subject to the cluster constraints.
As a running example, consider the joint distribution

$$p(x, y) = \begin{bmatrix}
.05 & .05 & .05 & 0 & 0 & 0 \\
.05 & .05 & .05 & 0 & 0 & 0 \\
0 & 0 & 0 & .05 & .05 & .05 \\
0 & 0 & 0 & .05 & .05 & .05 \\
.04 & .04 & 0 & .04 & .04 & .04 \\
.04 & .04 & .04 & 0 & .04 & .04
\end{bmatrix}$$

With row clusters $\hat{x}_1 = \{x_1, x_2\}$, $\hat{x}_2 = \{x_3, x_4\}$, $\hat{x}_3 = \{x_5, x_6\}$ and column clusters $\hat{y}_1 = \{y_1, y_2, y_3\}$, $\hat{y}_2 = \{y_4, y_5, y_6\}$ (so $k = 3$, $l = 2$), the three factors of the approximation are

$$p(x \mid \hat{x}) = \begin{bmatrix}
.5 & 0 & 0 \\ .5 & 0 & 0 \\ 0 & .5 & 0 \\ 0 & .5 & 0 \\ 0 & 0 & .5 \\ 0 & 0 & .5
\end{bmatrix}, \qquad
p(y \mid \hat{y}) = \begin{bmatrix}
.36 & .36 & .28 & 0 & 0 & 0 \\ 0 & 0 & 0 & .28 & .36 & .36
\end{bmatrix}, \qquad
p(\hat{x}, \hat{y}) = \begin{bmatrix}
.3 & 0 \\ 0 & .3 \\ .2 & .2
\end{bmatrix}$$

and multiplying them out gives the approximation

$$q(x, y) = \begin{bmatrix}
.054 & .054 & .042 & 0 & 0 & 0 \\
.054 & .054 & .042 & 0 & 0 & 0 \\
0 & 0 & 0 & .042 & .054 & .054 \\
0 & 0 & 0 & .042 & .054 & .054 \\
.036 & .036 & .028 & .028 & .036 & .036 \\
.036 & .036 & .028 & .028 & .036 & .036
\end{bmatrix}$$

For instance, $q(x_1, y_1) = p(\hat{x}_1, \hat{y}_1)\, p(x_1 \mid \hat{x}_1)\, p(y_1 \mid \hat{y}_1) = .3 \times .5 \times .36 = .054$.

The number of parameters that determine $q(x, y)$ is $(m - k) + (kl - 1) + (n - l)$.
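Both the lemma and the example can be checked numerically. A minimal sketch (numpy; the cluster maps R and C are hard-coded to the example's clustering) that builds q and confirms that the loss in mutual information equals $KL(p \,\|\, q)$:

```python
import numpy as np

# The 6x6 example distribution from the slides.
p = np.array([
    [.05, .05, .05, .00, .00, .00],
    [.05, .05, .05, .00, .00, .00],
    [.00, .00, .00, .05, .05, .05],
    [.00, .00, .00, .05, .05, .05],
    [.04, .04, .00, .04, .04, .04],
    [.04, .04, .04, .00, .04, .04],
])
R = np.array([0, 0, 1, 1, 2, 2])   # row-cluster map, k = 3
C = np.array([0, 0, 0, 1, 1, 1])   # column-cluster map, l = 2
k, l = 3, 2

# p(xhat, yhat): aggregate p over the co-clusters.
p_hat = np.zeros((k, l))
np.add.at(p_hat, (R[:, None], C[None, :]), p)

# p(x | xhat) and p(y | yhat) from the marginals.
p_x, p_y = p.sum(axis=1), p.sum(axis=0)
p_xhat = np.array([p_x[R == i].sum() for i in range(k)])
p_yhat = np.array([p_y[C == j].sum() for j in range(l)])
px_in_cluster = p_x / p_xhat[R]    # p(x | xhat(x))
py_in_cluster = p_y / p_yhat[C]    # p(y | yhat(y))

# q(x, y) = p(xhat, yhat) p(x | xhat) p(y | yhat) for x in xhat, y in yhat.
q = p_hat[R][:, C] * px_in_cluster[:, None] * py_in_cluster[None, :]

def mutual_information(t):
    tx, ty = t.sum(axis=1, keepdims=True), t.sum(axis=0, keepdims=True)
    nz = t > 0
    return np.sum(t[nz] * np.log2(t[nz] / (tx * ty)[nz]))

nz = p > 0
kl_pq = np.sum(p[nz] * np.log2(p[nz] / q[nz]))
loss = mutual_information(p) - mutual_information(p_hat)
print(np.isclose(loss, kl_pq))     # True: loss in mutual information = KL(p || q)
```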
## Decomposition Lemma

- Question: how do we minimize $KL(p(x, y) \,\|\, q(x, y))$?
- The following lemma reveals the answer:

$$KL(p(x, y) \,\|\, q(x, y)) = \sum_{\hat{x}} \sum_{x \in \hat{x}} p(x)\, KL(p(y \mid x) \,\|\, q(y \mid \hat{x}))$$

where, for $y \in \hat{y}$,

$$q(y \mid \hat{x}) = p(y \mid \hat{y})\, p(\hat{y} \mid \hat{x}), \qquad p(\hat{y} \mid \hat{x}) = \sum_{x \in \hat{x}} p(\hat{y} \mid x)\, p(x \mid \hat{x})$$

Note that $q(y \mid \hat{x})$ may be thought of as the "prototype" of row cluster $\hat{x}$.

- Similarly,

$$KL(p(x, y) \,\|\, q(x, y)) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(y)\, KL(p(x \mid y) \,\|\, q(x \mid \hat{y}))$$

with column-cluster prototypes $q(x \mid \hat{y}) = p(x \mid \hat{x})\, p(\hat{x} \mid \hat{y})$ for $x \in \hat{x}$.
## Co-Clustering Algorithm

- [Step 1] Set $i = 1$. Start with an initial co-clustering $(R_1, C_1)$ and compute $q^{(1,1)}$.
- [Step 2] For every row $x$, assign it to the cluster $\hat{x}$ that minimizes $KL(p(y \mid x) \,\|\, q^{(i,i)}(y \mid \hat{x}))$.
- [Step 3] We now have $(R_{i+1}, C_i)$; compute $q^{(i+1,\,i)}$.
- [Step 4] For every column $y$, assign it to the cluster $\hat{y}$ that minimizes $KL(p(x \mid y) \,\|\, q^{(i+1,\,i)}(x \mid \hat{y}))$.
- [Step 5] We now have $(R_{i+1}, C_{i+1})$; compute $q^{(i+1,\,i+1)}$. Set $i \leftarrow i + 1$ and iterate Steps 2-5 until convergence.
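A compact sketch of Steps 1-5 in Python (numpy only; the eps guards, the fixed iteration count, and the random initialization are implementation choices of mine, not from the slides). The row step computes the prototypes $q(y \mid \hat{x})$ from the Decomposition Lemma and re-assigns each row to its KL-nearest prototype; the column step reuses the same routine on the transposed table:

```python
import numpy as np

def _row_pass(p, R, C, k, l, eps=1e-12):
    """Re-assign every row x to the cluster xhat minimizing
    KL(p(y|x) || q(y|xhat)), with prototypes from the Decomposition Lemma."""
    p_x, p_y = p.sum(axis=1), p.sum(axis=0)
    p_hat = np.zeros((k, l))
    np.add.at(p_hat, (R[:, None], C[None, :]), p)            # p(xhat, yhat)
    p_yhat = np.array([p_y[C == j].sum() for j in range(l)])
    py_in_cluster = p_y / np.maximum(p_yhat[C], eps)         # p(y | yhat(y))
    # q(y | xhat) = p(y | yhat) p(yhat | xhat): one prototype row per cluster.
    proto = (p_hat / np.maximum(p_hat.sum(axis=1, keepdims=True), eps))[:, C]
    proto = proto * py_in_cluster
    new_R = np.empty_like(R)
    for x in range(p.shape[0]):
        pyx = p[x] / max(p_x[x], eps)                        # p(y | x)
        nz = pyx > 0
        costs = [np.sum(pyx[nz] * np.log2(pyx[nz] / np.maximum(proto[c, nz], eps)))
                 for c in range(k)]
        new_R[x] = int(np.argmin(costs))
    return new_R

def itcc(p, k, l, n_iters=20, seed=0):
    """Alternate the row step (Step 2) and, by row/column symmetry on the
    transposed table, the column step (Step 4), recomputing q in between."""
    rng = np.random.default_rng(seed)
    R = rng.integers(0, k, size=p.shape[0])
    C = rng.integers(0, l, size=p.shape[1])
    for _ in range(n_iters):
        R = _row_pass(p, R, C, k, l)
        C = _row_pass(p.T, C, R, l, k)
    return R, C
```

On the 6 x 6 example above, `itcc(p, 3, 2)` typically recovers the row clusters {1, 2}, {3, 4}, {5, 6} and column clusters {1, 2, 3}, {4, 5, 6} within a few passes.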
## Properties of Co-clustering Algorithm

- Main theorem: co-clustering "monotonically" decreases the loss in mutual information
- Co-clustering converges to a local minimum
- Can be generalized to multi-dimensional contingency tables
- $q$ can be viewed as a "low-complexity" non-negative matrix approximation
- $q$ preserves the marginals of $p$, as well as the co-cluster statistics (see the derivation below)
- Implicit dimensionality reduction at each step helps overcome sparsity and high dimensionality
- Computationally economical
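For instance, marginal preservation follows in one line from the form of $q$: for $x \in \hat{x}$,

$$q(x) = \sum_{y} q(x, y) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y}) = p(x \mid \hat{x}) \sum_{\hat{y}} p(\hat{x}, \hat{y}) = p(x \mid \hat{x})\, p(\hat{x}) = p(x)$$

using $\sum_{y \in \hat{y}} p(y \mid \hat{y}) = 1$; the column marginal $q(y) = p(y)$ and the co-cluster sums $q(\hat{x}, \hat{y}) = p(\hat{x}, \hat{y})$ follow the same way.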
(Slides: the algorithm run step by step on the 6 x 6 example above. Starting from a poor initial co-clustering, each row pass and column pass updates $p(x \mid \hat{x})$, $p(y \mid \hat{y})$, $p(\hat{x}, \hat{y})$ and the approximation $q(x, y)$; the loss in mutual information decreases at every step, and after a few iterations the run converges to the optimal co-clustering, with the same $q(x, y)$ shown in the example above.)
## Applications -- Text Classification

- Assigning class labels to text documents
- Training and testing phases

(Figure: a document collection grouped into classes 1 through m serves as training data; the classifier learns from it and assigns a class to each new document.)
## Feature Clustering (dimensionality reduction)

- Feature selection: represent each of the m documents as a bag-of-words vector over the vocabulary, select the "best" k words (frequency-based or information-criterion-based pruning), and throw away the rest
- Feature clustering: do not throw away words; cluster the words instead and use the k word clusters as the features (sketched below)
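A minimal sketch of the clustering route, assuming a word-to-cluster map already produced by a (co-)clustering run; the function name and toy data are mine:

```python
import numpy as np

def cluster_features(doc_term_counts, word_to_cluster, k):
    """Map an (n_docs x n_words) count matrix to (n_docs x k) cluster features
    by adding each word's count into the slot of its word cluster."""
    n_docs = doc_term_counts.shape[0]
    features = np.zeros((n_docs, k))
    for w, c in enumerate(word_to_cluster):
        features[:, c] += doc_term_counts[:, w]
    return features

# Toy usage: 2 documents, 4 words; words 0-1 in cluster 0, words 2-3 in cluster 1.
X = np.array([[3, 1, 0, 0],
              [0, 0, 2, 5]])
print(cluster_features(X, np.array([0, 0, 1, 1]), k=2))
# [[4. 0.]
#  [0. 7.]]
```

No word is discarded, so no information is thrown away outright; documents simply shrink from vocabulary-sized vectors to k cluster features.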
## Experiments

- Data sets:
  - 20 Newsgroups data: 20 classes, 20,000 documents
  - Classic3 data set: 3 classes (cisi, med and cran), 3,893 documents
  - Dmoz Science HTML data: 49 leaves in the hierarchy; 5,000 documents with 14,538 words; available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
- Implementation details:
  - Bow toolkit used for indexing, co-clustering, clustering and classification
## Results (20Ng)

- Classification accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split
- Divisive clustering beats feature-selection algorithms by a large margin
- The effect is more significant at lower numbers of features
## Results (Dmoz)

- Classification accuracy on Dmoz data with a 1/3-2/3 test-train split
- Divisive clustering is better at lower numbers of features
- Note the contrasting behavior of Naïve Bayes and SVMs
## Results (Dmoz)

- Naïve Bayes on Dmoz data with only 2% training data
- Divisive clustering achieves a higher maximum than IG, a significant 13% increase
- Divisive clustering performs better than IG when training data is scarce
## Hierarchical Classification

(Figure: a topic hierarchy rooted at Science, with children Math, Physics and Social Science; Math splits into Number Theory and Logic, Physics into Quantum Mechanics, and Social Science into Economics and Archeology.)

- A flat classifier builds a single classifier over the leaf classes of the hierarchy
- A hierarchical classifier builds a classifier at each internal node of the hierarchy
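A small sketch of the difference, assuming per-node classifiers supplied as callables (the hierarchy mirrors the figure; everything else is hypothetical):

```python
# Toy hierarchy mirroring the figure; node_classifiers maps each internal node
# to a hypothetical callable (doc, children) -> chosen child, e.g. a Naive
# Bayes model trained only to separate that node's children.
HIERARCHY = {
    "Science": ["Math", "Physics", "Social Science"],
    "Math": ["Number Theory", "Logic"],
    "Physics": ["Quantum Mechanics"],
    "Social Science": ["Economics", "Archeology"],
}

def classify_hierarchical(doc, node, node_classifiers):
    """Walk down from the root: the local classifier at each internal node
    picks one child; a leaf class is the final label."""
    children = HIERARCHY.get(node)
    if not children:
        return node                                  # leaf class reached
    child = node_classifiers[node](doc, children)    # local decision
    return classify_hierarchical(doc, child, node_classifiers)
```

Each internal node's classifier only has to separate its few children, which is why it can get away with far fewer features than a single flat classifier over all 49 leaves.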
## Results (Dmoz)

- Hierarchical classifier: Naïve Bayes at each node of the hierarchy
- The hierarchical classifier reaches 64.54% accuracy with just 10 features, while the flat classifier needs 1000 features to reach 64.04%
- The hierarchical classifier improves accuracy to 68.42%, up from the 64.42% maximum achieved by the flat classifiers

(Figure: % accuracy vs. number of features, from 5 to 10,000, on Dmoz data, comparing Hierarchical, Flat(DC) and Flat(IG).)
## Anecdotal Evidence

| Cluster 10: Divisive Clustering (rec.sport.hockey) | Cluster 9: Divisive Clustering (rec.sport.baseball) | Cluster 12: Agglomerative Clustering (rec.sport.hockey and rec.sport.baseball) |
|---|---|---|
| team | hit | team, detroit |
| game | runs | hockey, pitching |
| play | baseball | games, hitter |
| hockey | base | players, rangers |
| season | ball | baseball, nyi |
| boston | greg | league, morris |
| chicago | morris | player, blues |
| pit | ted | nhl, shots |
| van | pitcher | pit, vancouver |
| nhl | hitting | buffalo, ens |

Top few words, sorted, in clusters obtained by the divisive and agglomerative approaches on 20 Newsgroups data.
## Co-Clustering Results (CLASSIC3)

Confusion matrices (precision in parentheses):

| Co-clustering (0.9835) |  |  | 1-D Clustering (0.821) |  |  |
|---:|---:|---:|---:|---:|---:|
| 992 | 4 | 8 | 847 | 142 | 44 |
| 40 | 1452 | 7 | 41 | 954 | 405 |
| 1 | 4 | 1387 | 275 | 86 | 1099 |
## Results: Binary (subset of 20Ng data)

Confusion matrices (precision in parentheses):

| Binary: Co-clustering (0.852) |  | Binary: 1-D Clustering (0.67) |  | Binary_subject: Co-clustering (0.946) |  | Binary_subject: 1-D Clustering (0.648) |  |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 207 | 31 | 179 | 94 | 234 | 11 | 178 | 104 |
| 43 | 219 | 71 | 156 | 16 | 239 | 72 | 146 |
## Precision: 20Ng data

| Data set | Co-clustering | 1-D Clustering | IB-Double | IDC |
|---|---:|---:|---:|---:|
| Binary | 0.98 | 0.64 | 0.70 | |
| Binary_Subject | 0.96 | 0.67 | | 0.85 |
| Multi5 | 0.87 | 0.34 | 0.5 | |
| Multi5_Subject | 0.89 | 0.37 | | 0.88 |
| Multi10 | 0.56 | 0.17 | 0.35 | |
| Multi10_Subject | 0.54 | 0.19 | | 0.55 |
## Results: Sparsity (Binary_subject data)

(Figure: classification results on Binary_subject data as sparsity varies.)

## Results (Monotonicity)

(Figure: the objective decreases monotonically over the iterations of the co-clustering algorithm.)
## Conclusions

- Information-theoretic approach to clustering, co-clustering and matrix approximation
- Implicit dimensionality reduction at each step to overcome sparsity and high dimensionality
- The theoretical approach has the potential to extend to other problems:
  - multi-dimensional co-clustering
  - MDL to choose the number of co-clusters
  - generalized co-clustering via Bregman divergences
- Email: inderjit@cs.utexas.edu
- Papers are available at http://www.cs.utexas.edu/users/inderjit:
  - "Divisive Information-Theoretic Feature Clustering for Text Classification", Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD 2002)
  - "Information-Theoretic Co-clustering", Dhillon, Mallela & Modha, KDD 2003
  - "Clustering with Bregman Divergences", Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004
  - "A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation", Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004
