Distributional Clustering of Words for Text Classification

L. Douglas Baker
Andrew Kachites McCallum
SIGIR '98
Distributional Clustering

- Word similarity is based on the class-label distribution
- Example classes: Sport, with subclasses Baseball, Hockey, Tennis
- 'puck' and 'goalie' occur almost exclusively in Hockey documents, so their class distributions are similar; 'team' is spread across the sport classes
Distributional Clustering

- Cluster words based on their class distributions (supervised: class labels are used)
- Similarity between w_t and w_s is taken to be the similarity between P(C|w_t) and P(C|w_s) (a sketch of estimating these distributions follows below)
- An information-theoretic measure compares the distributions: the Kullback-Leibler divergence to the mean
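Since everything downstream compares the class distributions P(C|w_t) and P(C|w_s), here is a minimal sketch (not the authors' code) of estimating P(C|w) from a labelled corpus; the toy corpus, variable names, and smoothing constant are illustrative assumptions.

# Minimal sketch: estimate P(C|w) from word/class co-occurrence counts.
# Toy corpus, names, and smoothing are assumptions for illustration only.
from collections import Counter, defaultdict

import numpy as np

# (tokenised document, class label) pairs -- hypothetical data
corpus = [
    (["puck", "goalie", "team", "ice"], "hockey"),
    (["bat", "pitcher", "team"], "baseball"),
    (["serve", "racquet", "team"], "tennis"),
]

classes = sorted({label for _, label in corpus})
counts = defaultdict(Counter)                 # counts[word][class] = co-occurrence count
for tokens, label in corpus:
    for w in tokens:
        counts[w][label] += 1

def class_distribution(word, smoothing=1e-12):
    """P(C|word): normalised class counts (tiny smoothing avoids zero entries)."""
    row = np.array([counts[word][c] for c in classes], dtype=float) + smoothing
    return row / row.sum()

print(classes)                                # ['baseball', 'hockey', 'tennis']
print("P(C|'puck') =", class_distribution("puck"))
print("P(C|'team') =", class_distribution("team"))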
Distributional Clustering

[Figure: Class 8: Autos and Class 9: Motorcycles]
Kullback-Leibler Divergence

    D(P(x) \,\|\, P(y)) = \sum_{x} P(x) \log \frac{P(x)}{P(y)}

Here,

    D(P(C \mid w_t) \,\|\, P(C \mid w_s)) = \sum_{j=1}^{|C|} P(c_j \mid w_t) \log \frac{P(c_j \mid w_t)}{P(c_j \mid w_s)}

- D is asymmetric, and D → ∞ when P(y) = 0 while P(x) ≠ 0
- Also, D ≥ 0
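The sketch below computes this plain KL divergence between two class distributions and shows the asymmetry noted above; the example distributions and the eps guard are illustrative assumptions.

# Minimal sketch of D(p || q) = sum_j p_j log(p_j / q_j); illustrative only.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Asymmetric; grows without bound as q_j -> 0 while p_j > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                              # terms with p_j = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

p_t = [0.7, 0.2, 0.1]                         # e.g. P(C|w_t) over three classes
p_s = [0.5, 0.3, 0.2]                         # e.g. P(C|w_s)

print(kl_divergence(p_t, p_s))                # != kl_divergence(p_s, p_t): D is asymmetric
print(kl_divergence(p_s, p_t))
print(kl_divergence([1.0, 0.0], [0.0, 1.0]))  # blows up (capped only by eps) when q has a zero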
Kullback-Leibler Divergence to the Mean

    D(P(C \mid w_t) \,\|\, P(C \mid w_s)) = P(w_t)\, D(P(C \mid w_t) \,\|\, P(C \mid w_t \vee w_s)) + P(w_s)\, D(P(C \mid w_s) \,\|\, P(C \mid w_t \vee w_s))

where

    P(C \mid w_t \vee w_s) = \frac{P(w_t)}{P(w_t) + P(w_s)}\, P(C \mid w_t) + \frac{P(w_s)}{P(w_t) + P(w_s)}\, P(C \mid w_s)

The Jensen-Shannon divergence is the special case of this symmetrised KL divergence in which P(w_t) = P(w_s) = 0.5.
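A minimal sketch of this divergence to the mean is given below: each word's class distribution is compared against their prior-weighted mixture, which makes the measure symmetric in (w_t, w_s) and keeps it finite even when one distribution has zeros; the numbers and names are illustrative assumptions.

# Minimal sketch of the KL divergence to the mean; illustrative values only.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def kl_to_the_mean(p_t, p_s, pw_t, pw_s):
    """P(w_t)*D(P(C|w_t) || mean) + P(w_s)*D(P(C|w_s) || mean), mean = prior-weighted mixture."""
    p_t, p_s = np.asarray(p_t, dtype=float), np.asarray(p_s, dtype=float)
    mean = (pw_t * p_t + pw_s * p_s) / (pw_t + pw_s)         # P(C | w_t v w_s)
    return pw_t * kl_divergence(p_t, mean) + pw_s * kl_divergence(p_s, mean)

p_t = [0.7, 0.2, 0.1]
p_s = [0.0, 0.8, 0.2]                                        # a zero entry is no longer a problem
print(kl_to_the_mean(p_t, p_s, pw_t=0.02, pw_s=0.03))
print(kl_to_the_mean(p_s, p_t, pw_t=0.03, pw_s=0.02))        # same value: symmetric
# With pw_t = pw_s = 0.5 this is exactly the Jensen-Shannon divergence.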

Clustering Algorithm

Characteristics (see the sketch below):
- Greedy, aggressive
- Locally optimal
- Hard clustering
- Agglomerative
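The sketch below shows one greedy, hard, agglomerative pass: the two clusters whose divergence to the mean is smallest are merged repeatedly until a target number of clusters remains. It is a simplified variant under assumed inputs; the paper's exact bookkeeping (e.g. keeping a fixed number of clusters while words are considered in order of mutual information with the class) is not reproduced here.

# Minimal sketch: greedy agglomerative hard clustering of words; illustrative only.
import itertools

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def distance(c1, c2):
    """KL divergence to the mean between two (class-distribution, prior) clusters."""
    (p1, w1), (p2, w2) = c1, c2
    mean = (w1 * p1 + w2 * p2) / (w1 + w2)
    return w1 * kl_divergence(p1, mean) + w2 * kl_divergence(p2, mean)

def merge(c1, c2):
    """Hard merge: the new cluster's distribution is the prior-weighted mixture."""
    (p1, w1), (p2, w2) = c1, c2
    return ((w1 * p1 + w2 * p2) / (w1 + w2), w1 + w2)

def cluster_words(word_dists, word_priors, n_clusters):
    """word_dists: {word: P(C|word)}, word_priors: {word: P(word)}."""
    clusters = {frozenset([w]): (np.asarray(word_dists[w], dtype=float), word_priors[w])
                for w in word_dists}
    while len(clusters) > n_clusters:
        # Greedy step: merge the locally closest pair (each word stays in exactly one cluster).
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]))
        clusters[a | b] = merge(clusters.pop(a), clusters.pop(b))
    return list(clusters)

dists = {"puck":   [0.05, 0.90, 0.05],
         "goalie": [0.05, 0.85, 0.10],
         "bat":    [0.80, 0.10, 0.10],
         "team":   [0.40, 0.40, 0.20]}
priors = {w: 0.25 for w in dists}
print(cluster_words(dists, priors, n_clusters=2))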
Experiments

- Datasets:
  - 20 Newsgroups
  - Reuters-21578
  - Yahoo! Science hierarchy
- Compared with:
  - Supervised Latent Semantic Indexing
  - Class-based clustering
  - Feature selection by mutual information with the class variable
  - Feature selection by the Markov-blanket method
- Classifier: naive Bayes (NBC); a sketch of cluster features feeding a naive Bayes classifier follows below
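To show how the word clusters feed the classifier, here is a minimal sketch in which each word is mapped to its cluster id and a multinomial naive Bayes model is trained on cluster counts instead of word counts; scikit-learn's MultinomialNB, the toy cluster map, and the toy documents are illustrative assumptions, not the authors' setup.

# Minimal sketch: naive Bayes over bag-of-clusters features; illustrative only.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

word_to_cluster = {"puck": 0, "goalie": 0, "bat": 1, "pitcher": 1, "team": 2}   # assumed given
n_clusters = 3

def cluster_counts(tokens):
    """Bag-of-clusters feature vector for one tokenised document."""
    x = np.zeros(n_clusters)
    for t in tokens:
        if t in word_to_cluster:
            x[word_to_cluster[t]] += 1
    return x

train_docs = [(["puck", "goalie", "team"], "hockey"),
              (["bat", "pitcher", "team"], "baseball")]
X = np.array([cluster_counts(tokens) for tokens, _ in train_docs])
y = [label for _, label in train_docs]

clf = MultinomialNB().fit(X, y)
print(clf.predict([cluster_counts(["goalie", "team", "ice"])]))   # -> ['hockey']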
Results
Conclusion

- Useful semantic word clusterings
- Higher classification accuracy
- Smaller classification models

- Word clustering vs. feature selection?
- What if the data is noisy? Or sparse?
