# Distributional Clustering of Words for Text Classification


*L. Douglas Baker and Andrew Kachites McCallum, SIGIR '98*
## Distributional Clustering

Word similarity is based on each word's distribution over class labels. Consider a Sport hierarchy with subclasses Baseball, Hockey, and Tennis:

- 'puck' and 'goalie' appear almost exclusively in Hockey documents, so their class distributions are similar.
- 'team' appears across all three classes, so its distribution is spread out.
## Distributional Clustering

- Clusters words based on their class distributions (supervised).
- The similarity between words w_t and w_s is the similarity between P(C|w_t) and P(C|w_s); a sketch of how these distributions can be estimated follows this list.
- The similarity between distributions is computed with an information-theoretic measure: the Kullback-Leibler divergence to the mean.
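
As a concrete illustration, the following minimal sketch estimates the per-word class distributions P(C|w) from labeled documents. The function name, the corpus format, and the Laplace smoothing constant `alpha` are assumptions of this writeup; the slides do not specify how the distributions are estimated.

```python
from collections import Counter, defaultdict

def class_distributions(docs, labels, alpha=1.0):
    """Estimate P(C | w) for every word in a labeled corpus.

    docs   -- list of documents, each a list of tokens
    labels -- the class label of each document
    alpha  -- Laplace smoothing constant (an assumption, not in the slides)
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)              # word -> {class: count}
    for tokens, label in zip(docs, labels):
        for w in tokens:
            counts[w][label] += 1
    dists = {}
    for w, by_class in counts.items():
        total = sum(by_class.values()) + alpha * len(classes)
        dists[w] = {c: (by_class[c] + alpha) / total for c in classes}
    return dists

# Toy corpus echoing the Sport example: 'team' ends up spread over all
# three classes, while 'puck' concentrates on Hockey.
docs = [["puck", "goalie", "team"], ["team", "pitcher"], ["serve", "team"]]
labels = ["Hockey", "Baseball", "Tennis"]
print(class_distributions(docs, labels)["team"])
```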
## Distributional Clustering

Figure: Class 8 (Autos) and Class 9 (Motorcycles).
## Kullback-Leibler Divergence

$$D(P(x) \,\|\, P(y)) = \sum_{x} P(x) \log \frac{P(x)}{P(y)}$$

Here,

$$D\bigl(P(C \mid w_t) \,\|\, P(C \mid w_s)\bigr) = \sum_{j=1}^{|C|} P(c_j \mid w_t) \log \frac{P(c_j \mid w_t)}{P(c_j \mid w_s)}$$

- D is asymmetric, and D → ∞ when P(y) = 0 and P(x) ≠ 0.
- Also, D ≥ 0.
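
A direct transcription of this formula into Python (a sketch; the function name and the dict-based representation of distributions are choices of this writeup, not the paper's):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum over x of P(x) * log(P(x) / Q(x)).

    p, q -- distributions as {outcome: probability} dicts.
    Asymmetric: kl_divergence(p, q) != kl_divergence(q, p) in general.
    """
    d = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue                  # 0 * log 0 is taken to be 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf           # D blows up when Q(x)=0 but P(x)>0
        d += px * math.log(px / qx)
    return d
```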
## Kullback-Leibler Divergence to the Mean

$$\bar{D}\bigl(P(C \mid w_t) \,\|\, P(C \mid w_s)\bigr) = P(w_t) \cdot D\bigl(P(C \mid w_t) \,\|\, P(C \mid w_t \vee w_s)\bigr) + P(w_s) \cdot D\bigl(P(C \mid w_s) \,\|\, P(C \mid w_t \vee w_s)\bigr)$$

where

$$P(C \mid w_t \vee w_s) = \frac{P(w_t)}{P(w_t) + P(w_s)} \, P(C \mid w_t) + \frac{P(w_s)}{P(w_t) + P(w_s)} \, P(C \mid w_s)$$

The Jensen-Shannon divergence is the special case of this symmetrised KL divergence with equal weights, P(w_t) = P(w_s) = 0.5.
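
In code, reusing `kl_divergence` from the sketch above (the argument names are illustrative):

```python
def kl_to_the_mean(p_t, p_s, prior_t, prior_s):
    """Weighted KL divergence to the mean distribution.

    p_t, p_s         -- P(C | w_t) and P(C | w_s) as dicts over classes
    prior_t, prior_s -- the word priors P(w_t) and P(w_s)
    With prior_t == prior_s == 0.5 this is the Jensen-Shannon divergence.
    """
    z = prior_t + prior_s
    classes = set(p_t) | set(p_s)
    # the mean distribution P(C | w_t or w_s)
    mean = {c: (prior_t * p_t.get(c, 0.0) + prior_s * p_s.get(c, 0.0)) / z
            for c in classes}
    return (prior_t * kl_divergence(p_t, mean)
            + prior_s * kl_divergence(p_s, mean))
```

Because the mean assigns positive probability wherever either argument does, the infinite-divergence case above cannot occur here.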

## Clustering Algorithm

Characteristics:

- Greedy and aggressive
- Locally optimal
- Hard clustering (each word belongs to exactly one cluster)
- Agglomerative (sketched below)
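
The following is a naive rendition of one such greedy agglomerative pass, reusing `kl_to_the_mean` from above. It recomputes all pairwise divergences at every merge, which is far slower than the optimized algorithm in the paper; the names and data structures are assumptions of this sketch.

```python
def cluster_words(dists, priors, n_clusters):
    """Greedy, hard, agglomerative clustering of words.

    dists  -- word -> P(C | word), e.g. from class_distributions
    priors -- word -> P(word)
    Repeatedly merges the pair of clusters with the smallest
    KL-divergence to the mean until n_clusters clusters remain.
    """
    # each cluster: (member words, class distribution, total prior mass)
    clusters = [([w], dict(dists[w]), priors[w]) for w in dists]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = kl_to_the_mean(clusters[i][1], clusters[j][1],
                                   clusters[i][2], clusters[j][2])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        words_i, p_i, pr_i = clusters[i]
        words_j, p_j, pr_j = clusters[j]
        z = pr_i + pr_j
        merged = {c: (pr_i * p_i.get(c, 0.0) + pr_j * p_j.get(c, 0.0)) / z
                  for c in set(p_i) | set(p_j)}
        clusters[i] = (words_i + words_j, merged, z)   # hard merge
        del clusters[j]
    return [words for words, _, _ in clusters]
```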
## Experiments

- Datasets:
  - 20 Newsgroups
  - Reuters-21578
  - Yahoo! Science hierarchy
- Compared with:
  - Supervised Latent Semantic Indexing
  - Class-based clustering
  - Feature selection by mutual information with the class variable
  - Feature selection by the Markov-blanket method
- Classifier: a naive Bayes classifier (NBC); a sketch follows this list.
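
The slides name only the classifier. As an illustration, a minimal multinomial naive Bayes over cluster-ID features might look like the following; the `word2cluster` mapping, the function names, and the smoothing are assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, word2cluster, alpha=1.0):
    """Multinomial naive Bayes over word-cluster features.

    word2cluster -- word -> cluster id, e.g. derived from cluster_words
    alpha        -- Laplace smoothing (an assumption, not in the slides)
    Returns a predict(tokens) -> label function.
    """
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    feat_counts = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        feat_counts[label].update(
            word2cluster[w] for w in tokens if w in word2cluster)
    n_feats = len(set(word2cluster.values()))

    def predict(tokens):
        feats = [word2cluster[w] for w in tokens if w in word2cluster]
        def log_posterior(c):
            total = sum(feat_counts[c].values()) + alpha * n_feats
            return (math.log(class_counts[c] / len(labels))
                    + sum(math.log((feat_counts[c][f] + alpha) / total)
                          for f in feats))
        return max(classes, key=log_posterior)

    return predict
```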
## Results
## Conclusion

- Useful semantic word clusterings
- Higher classification accuracy
- Smaller classification models

Open questions:

- Word clustering vs. feature selection?
- What if the data is noisy?
- What if the data is sparse?
