Distributional Clustering of Words for Text Classification
L. Douglas Baker, Andrew Kachites McCallum
SIGIR '98

Distributional Clustering
- Word similarity based on class-label distributions
- e.g., 'puck' and 'goalie' have class distributions similar to 'team'

Distributional Clustering
- Clustering words based on their class distributions (supervised)
- Similarity between w_t and w_s = similarity between P(C|w_t) and P(C|w_s)
- Information-theoretic measure of similarity between distributions: Kullback-Leibler divergence to the mean (sketched in code after the outline)

Distributional Clustering
- Example: word distributions over Class 8 (Autos) and Class 9 (Motorcycles)

Kullback-Leibler Divergence
- D(P || Q) = Σ_x P(x) log ( P(x) / Q(x) )
- D is asymmetric, and D → ∞ when Q(x) = 0 while P(x) ≠ 0
- D ≥ 0

Kullback-Leibler Divergence to the Mean
- D(P(C|w_t) ∨ P(C|w_s)) = π_t · D(P(C|w_t) || P_mean) + π_s · D(P(C|w_s) || P_mean)
- where P_mean = π_t · P(C|w_t) + π_s · P(C|w_s) and π_t = P(w_t) / (P(w_t) + P(w_s))
- Jensen-Shannon divergence is the special case of this symmetrised KL divergence with equal weights, P(w_t) = P(w_s) = 0.5

Clustering Algorithm
Characteristics (sketched in code after the outline):
- Greedy and aggressive
- Locally optimal
- Hard clustering
- Agglomerative

Experiments
Datasets:
- 20 Newsgroups
- Reuters-21578
- Yahoo Science hierarchy
Compared with:
- Supervised Latent Semantic Indexing
- Class-based clustering
- Feature selection by mutual information with the class variable
- Feature selection by the Markov-blanket method
Classifier: naive Bayes (NBC)

Results

Conclusion
- Useful semantic word clusterings
- Higher classification accuracy with smaller classification models
- Word clustering vs. feature selection?
- What if the data is noisy? Sparse?
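A minimal Python sketch of the divergence measures above. The function names and the example class distributions for 'puck' and 'goalie' are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    Asymmetric, non-negative; diverges when Q(x) = 0 while P(x) != 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_to_the_mean(p_t, p_s, pi_t=0.5, pi_s=0.5):
    """Weighted KL divergence of two class distributions to their mixture.
    With pi_t = pi_s = 0.5 this is the Jensen-Shannon divergence:
    symmetric, bounded, and finite even when the supports differ."""
    p_t, p_s = np.asarray(p_t, dtype=float), np.asarray(p_s, dtype=float)
    p_mean = pi_t * p_t + pi_s * p_s
    return pi_t * kl_divergence(p_t, p_mean) + pi_s * kl_divergence(p_s, p_mean)

# Hypothetical class distributions P(C|w) over three classes:
p_puck = [0.7, 0.2, 0.1]
p_goalie = [0.6, 0.3, 0.1]
print(kl_to_the_mean(p_puck, p_goalie))  # small value -> similar words
```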
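For the clustering slide, a hedged sketch of greedy agglomerative hard clustering of words by class distribution, reusing kl_to_the_mean from the sketch above. It is a simplification: the paper's algorithm keeps only a fixed number of candidate clusters at a time for efficiency, while this version compares all pairs at every merge.

```python
from itertools import combinations
import numpy as np

def cluster_words(word_dists, word_priors, n_clusters):
    """Greedy agglomerative hard clustering of words.

    word_dists:  dict word -> P(C|word), a sequence over classes
    word_priors: dict word -> P(word)
    Repeatedly merges the pair of clusters with the smallest weighted
    KL divergence to the mean until n_clusters clusters remain."""
    # Each cluster: (member words, prior mass, mixture distribution P(C|cluster))
    clusters = [({w}, word_priors[w], np.asarray(word_dists[w], dtype=float))
                for w in word_dists]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            _, m_i, p_i = clusters[i]
            _, m_j, p_j = clusters[j]
            pi_i = m_i / (m_i + m_j)
            d = kl_to_the_mean(p_i, p_j, pi_i, 1.0 - pi_i)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        words_i, m_i, p_i = clusters[i]
        words_j, m_j, p_j = clusters[j]
        merged = (words_i | words_j, m_i + m_j,
                  (m_i * p_i + m_j * p_j) / (m_i + m_j))  # hard, greedy merge
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [words for words, _, _ in clusters]
```

The merge is "aggressive" in the slides' sense: once two clusters are joined, their distributions are averaged and never split again, which is exactly what makes the procedure hard, greedy, and only locally optimal.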