Distributional Clustering of Words for Text Classification

L. Douglas Baker
Andrew Kachites McCallum
SIGIR '98
Distributional Clustering

- Word similarity based on class label distribution

[Figure: topic hierarchy with Sport split into Baseball, Hockey, and Tennis]

- 'puck' and 'goalie' appear almost only in Hockey documents, so their class distributions are similar
- 'team' appears across all three sports, so its distribution differs from both
Distributional Clustering

- Cluster words based on their class distributions (supervised)
- Similarity between w_t and w_s = similarity between P(C|w_t) and P(C|w_s)
- An information-theoretic measure calculates the similarity between distributions:
- Kullback-Leibler divergence to the mean
Distributional Clustering

[Figure: class-conditional word distributions P(C|w); Class 8: Autos and Class 9: Motorcycles]
Kullback-Leibler Divergence

D(P(x) \| P(y)) = \sum_x P(x) \log \frac{P(x)}{P(y)}

Here,

D(P(C|w_t) \| P(C|w_s)) = \sum_{j=1}^{|C|} P(c_j|w_t) \log \frac{P(c_j|w_t)}{P(c_j|w_s)}

- D is asymmetric, and D → ∞ when P(y) = 0 and P(x) ≠ 0
- Also, D ≥ 0
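
A minimal numeric sketch of this divergence in Python; the word distributions below are illustrative toy values, not figures from the paper:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) between two discrete distributions over the classes.

    eps guards the logarithm; as noted above, D grows without bound
    when some q_j = 0 while p_j > 0. Terms with p_j = 0 contribute 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

# Toy class distributions P(C|w) over (Baseball, Hockey, Tennis)
p_puck   = [0.05, 0.90, 0.05]
p_goalie = [0.10, 0.80, 0.10]
p_team   = [0.33, 0.34, 0.33]

print(kl_divergence(p_puck, p_goalie))  # small: similar distributions
print(kl_divergence(p_puck, p_team))    # larger: dissimilar distributions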
Kullback-Leibler Divergence to the Mean

D(P(C|w_t) \| P(C|w_s)) = P(w_t) \, D(P(C|w_t) \| P(C|w_t \vee w_s)) + P(w_s) \, D(P(C|w_s) \| P(C|w_t \vee w_s))

where

P(C|w_t \vee w_s) = \frac{P(w_t)}{P(w_t) + P(w_s)} P(C|w_t) + \frac{P(w_s)}{P(w_t) + P(w_s)} P(C|w_s)

The Jensen-Shannon divergence is the special case of this symmetrised KL divergence with equal weights, P(w_t) = P(w_s) = 0.5.
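
A companion sketch of the divergence to the mean, reusing kl_divergence and the toy distributions from the previous block; the priors passed in are illustrative:

def kl_to_the_mean(p_t, p_s, prior_t, prior_s):
    """Weighted KL divergence of two class distributions to their mean.

    prior_t and prior_s play the role of P(w_t) and P(w_s); with equal
    weights of 0.5 this is exactly the Jensen-Shannon divergence.
    """
    p_t, p_s = np.asarray(p_t, dtype=float), np.asarray(p_s, dtype=float)
    z = prior_t + prior_s
    mean = (prior_t * p_t + prior_s * p_s) / z   # P(C | w_t v w_s)
    return (prior_t * kl_divergence(p_t, mean)
            + prior_s * kl_divergence(p_s, mean))

print(kl_to_the_mean(p_puck, p_goalie, 0.5, 0.5))  # small for similar words
print(kl_to_the_mean(p_puck, p_team, 0.5, 0.5))    # larger for dissimilar words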

Clustering Algorithm

Characteristics:
- Greedy, aggressive
- Locally optimal
- Hard clustering
- Agglomerative
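
A minimal sketch of such a greedy agglomerative pass, reusing kl_to_the_mean from above; the cluster cap M and the bookkeeping details are assumptions, not taken from the slides:

import itertools

def cluster_words(word_dists, word_priors, M=50):
    """Greedy agglomerative hard clustering of words (sketch).

    Each cluster is a (class distribution, prior) pair. Every word
    enters as its own cluster; once there are more than M clusters,
    the two with the smallest KL divergence to their mean are merged.
    Words belong to exactly one cluster (hard), and merges are never
    undone (greedy, locally optimal).
    """
    clusters = []
    for dist, prior in zip(word_dists, word_priors):
        clusters.append((np.asarray(dist, dtype=float), prior))
        if len(clusters) > M:
            # find the most similar pair of clusters and merge it
            i, j = min(itertools.combinations(range(len(clusters)), 2),
                       key=lambda ij: kl_to_the_mean(
                           clusters[ij[0]][0], clusters[ij[1]][0],
                           clusters[ij[0]][1], clusters[ij[1]][1]))
            (p_i, w_i), (p_j, w_j) = clusters[i], clusters[j]
            merged = ((w_i * p_i + w_j * p_j) / (w_i + w_j), w_i + w_j)
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
    return clusters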
Experiments

- Datasets:
  - 20 Newsgroups
  - Reuters-21578
  - Yahoo Science Hierarchy
- Compared with:
  - Supervised Latent Semantic Indexing
  - Class-based clustering
  - Feature selection by mutual information with the class variable
  - Feature selection by the Markov-blanket method
- Classifier: naive Bayes (NBC)
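
To make the classifier line concrete, a hedged sketch of naive Bayes over cluster features; the word_to_cluster map and the Laplace smoothing are assumptions for illustration:

from collections import defaultdict
import math

def train_nb(docs, labels, word_to_cluster, n_clusters):
    """Naive Bayes with word clusters as features (sketch).

    docs is a list of token lists; word_to_cluster maps each vocabulary
    word to a cluster id, shrinking the per-class parameters from |V|
    down to n_clusters (hence the smaller models noted in the conclusions).
    """
    counts = defaultdict(lambda: [1.0] * n_clusters)  # Laplace smoothing
    n_docs_in = defaultdict(int)
    for doc, y in zip(docs, labels):
        n_docs_in[y] += 1
        for w in doc:
            if w in word_to_cluster:
                counts[y][word_to_cluster[w]] += 1
    model = {}
    for c, vec in counts.items():
        z = sum(vec)
        model[c] = (math.log(n_docs_in[c] / len(docs)),
                    [math.log(v / z) for v in vec])
    return model

def classify(model, doc, word_to_cluster):
    """Return the class maximizing log P(c) + sum_i log P(cluster_i | c)."""
    return max(model, key=lambda c: model[c][0] + sum(
        model[c][1][word_to_cluster[w]] for w in doc if w in word_to_cluster))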
Results

[Results figures omitted]
Conclusion

- Useful semantic word clusterings
- Higher classification accuracy
- Smaller classification models

Open questions:
- Word clustering vs. feature selection?
- What if the data is noisy? Sparse?