Distributional Clustering of
Words for Text Classification


L. Douglas Baker
Andrew Kachites McCallum
SIGIR '98
         Distributional Clustering
   Word similarity based on class-label distribution: two words are similar if the documents containing them are spread over the class labels in similar proportions
   Example: the class distributions of ‘puck’ and ‘goalie’ (hockey-specific words) resemble each other far more than they resemble that of a general word like ‘team’
        Distributional Clustering
   Clustering words based on their class distributions (supervised)
   Similarity between w_t and w_s ≈ similarity between P(C|w_t) and P(C|w_s)
   An information-theoretic measure of similarity between distributions:
    Kullback-Leibler divergence to the mean
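A small sketch of how P(C|w), the class-label distribution of a word, might be estimated from a labeled corpus (illustrative only; the function name, data layout, and Laplace smoothing are my assumptions, not taken from the paper):

```python
from collections import Counter

def class_distribution(docs, labels, word, num_classes):
    """Estimate P(C|w): how occurrences of `word` are spread over the
    class labels. Laplace (+1) smoothing avoids zero probabilities."""
    counts = Counter()
    for doc, label in zip(docs, labels):
        counts[label] += doc.count(word)   # one count per occurrence
    total = sum(counts.values()) + num_classes
    return [(counts[c] + 1) / total for c in range(num_classes)]

# toy corpus: class 0 = hockey, class 1 = autos
docs = [["puck", "goalie", "team"], ["team", "engine"], ["engine", "car"]]
labels = [0, 0, 1]
print(class_distribution(docs, labels, "team", 2))   # -> [0.75, 0.25]
```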
Distributional Clustering

[Figure: example class distributions P(C|w), shown for Class 8: Autos and Class 9: Motorcycles]
Distributional Clustering
 Kullback-Leibler Divergence

   D( P(C|w_t) || P(C|w_s) ) = Σ_j P(c_j|w_t) · log[ P(c_j|w_t) / P(c_j|w_s) ]

Here, C is the class variable and P(c_j|w_t) is the probability of class c_j given an occurrence of word w_t.

D is asymmetric, and D → ∞ whenever P(c_j|w_s) = 0 while P(c_j|w_t) ≠ 0 for some class c_j.
Also, D ≥ 0, with equality only when the two distributions are identical.
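The KL divergence just defined can be computed directly; a minimal sketch (the helper name and toy distributions are mine, not from the slides):

```python
import math

def kl(p, q):
    """D(P || Q) = sum_j p_j * log(p_j / q_j)."""
    d = 0.0
    for pj, qj in zip(p, q):
        if pj == 0:
            continue           # 0 * log 0 is treated as 0
        if qj == 0:
            return math.inf    # q_j = 0 while p_j > 0: divergence blows up
        d += pj * math.log(pj / qj)
    return d

p, q = [0.8, 0.2], [0.5, 0.5]
print(kl(p, q), kl(q, p))            # asymmetric: the two values differ
print(kl([1.0, 0.0], [0.0, 1.0]))    # inf
print(kl(p, p))                      # 0.0: D >= 0, equality iff P = Q
```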
 Kullback-Leibler Divergence to the Mean

   D(w_t, w_s) = [P(w_t) / (P(w_t)+P(w_s))] · D( P(C|w_t) || P(C|w_t ∨ w_s) )
               + [P(w_s) / (P(w_t)+P(w_s))] · D( P(C|w_s) || P(C|w_t ∨ w_s) )

Where the mean distribution is the weighted average

   P(C|w_t ∨ w_s) = [P(w_t) / (P(w_t)+P(w_s))] · P(C|w_t)
                  + [P(w_s) / (P(w_t)+P(w_s))] · P(C|w_s)

Jensen-Shannon divergence is the special case of this
symmetrised KL divergence with equal weights, P(w_t) = P(w_s) = 0.5.
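The KL divergence to the mean can be sketched as follows (a simplified illustration; the function names are mine, and `kl` is the plain KL divergence defined on the previous slide):

```python
import math

def kl(p, q):
    # plain KL divergence; terms with p_j = 0 contribute 0
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def kl_to_the_mean(p_t, p_s, pw_t, pw_s):
    """Weighted KL divergence to the mean of P(C|w_t) and P(C|w_s);
    pw_t and pw_s play the role of the word priors P(w_t), P(w_s)."""
    a = pw_t / (pw_t + pw_s)
    b = pw_s / (pw_t + pw_s)
    mean = [a * pt + b * ps for pt, ps in zip(p_t, p_s)]  # P(C | w_t v w_s)
    return a * kl(p_t, mean) + b * kl(p_s, mean)

# equal weights give the Jensen-Shannon divergence
p, q = [0.9, 0.1], [0.2, 0.8]
print(kl_to_the_mean(p, q, 0.5, 0.5))
print(kl_to_the_mean(q, p, 0.5, 0.5))   # symmetric under equal weights
```

Unlike plain KL divergence, this quantity is always finite: the mean is zero in a component only where both distributions are.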
        Clustering Algorithm

Characteristics:
- Greedy and aggressive
- Finds a local optimum (no global guarantee)
- Hard clustering (each word belongs to exactly one cluster)
- Agglomerative (built by repeatedly merging the most similar clusters)
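The greedy agglomerative loop can be sketched like this (a toy version under stated assumptions: it merges pairwise from singletons rather than following the paper's fixed-size variant, and the merged class distribution is an unweighted average):

```python
import math

def kl(p, q):
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def distance(c1, c2):
    # KL divergence to the mean with equal weights (simplifying assumption)
    mean = [(a + b) / 2 for a, b in zip(c1["dist"], c2["dist"])]
    return 0.5 * kl(c1["dist"], mean) + 0.5 * kl(c2["dist"], mean)

def cluster(words, dists, k):
    """Greedy, hard, agglomerative clustering down to k clusters."""
    clusters = [{"words": [w], "dist": d} for w, d in zip(words, dists)]
    while len(clusters) > k:
        # greedy step: merge the two most similar clusters
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merged = {"words": clusters[i]["words"] + clusters[j]["words"],
                  "dist": [(a + b) / 2 for a, b in
                           zip(clusters[i]["dist"], clusters[j]["dist"])]}
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)  # hard: each word lives in exactly one cluster
    return clusters

words = ["puck", "goalie", "engine", "car"]
dists = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]]
print([c["words"] for c in cluster(words, dists, 2)])
```

Each merge is final (never revisited), which is what makes the result only locally optimal.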
                       Experiments
   Datasets:
        20 Newsgroups
        Reuters-21578
        Yahoo! Science hierarchy
   Compared with:
        Supervised Latent Semantic Indexing
        Class-based clustering
        Feature selection by mutual information with the class variable
        Feature selection by the Markov-blanket method
   Classifier: naive Bayes (NBC)
Results
                      Conclusion
   Produces useful semantic word clusterings
   Higher classification accuracy
   Smaller classification models

           Word clustering vs. feature selection?

Open questions: what if the data is noisy? What if it is sparse?

				