An Adaptive Affinity Propagation Document Clustering

Yancheng He, Qingcai Chen, Xiaolong Wang, Ruifeng Xu, Xiaohua Bai, Xianjun Meng
Dept. of Computer Science and Technology
Harbin Institute of Technology Shenzhen Graduate School
Shenzhen, P.R. China
heyanch@gmail.com, qingcai.chen@gmail.com, wangxl@insun.hit.edu.cn, csrfxu@gmail.com, baixh@insun.hit.edu.cn

Abstract

The standard affinity propagation clustering algorithm has one limitation: it is hard to know which value of the parameter "preference" yields an optimal clustering solution. To overcome this limitation, this paper proposes an adaptive affinity propagation method. The method first determines the range of "preference", then searches that range for a value that optimizes the clustering result. We apply the method to document clustering and compare it with standard affinity propagation and K-Means clustering on real data sets. Experimental results show that the proposed method produces better clustering results.

1. Introduction

There are many document clustering methods, such as partition clustering, hierarchical clustering, Self-Organizing Maps clustering and suffix tree clustering [1]. K-means is the most popular partition clustering method [2]. Standard K-means first selects K initial cluster centers at random, then assigns every data point to its nearest cluster center, and finally updates all the cluster centers. This process iterates until the difference between the cluster centers of consecutive rounds falls below a certain threshold. However, standard K-Means has two limitations: 1) the number of clusters must be specified in advance; 2) the clustering result is sensitive to the initial cluster centers.

A new clustering approach named affinity propagation (AP) [3] has been proposed recently. AP performs well in many applications, such as image categorization [4], gene expression analysis and text summarization [3], and speaker clustering [5]. This paper applies it to document clustering. Unlike other methods, AP simultaneously considers all data points as potential exemplars. Viewing each data point as a node in a network, AP recursively transmits real-valued messages along the edges of the network until a good set of centers is determined. Rather than requiring that the number of clusters be pre-specified, affinity propagation takes as input a real number $s(k,k)$ for each data point k, so that data points with larger values of $s(k,k)$ are more likely to be chosen as exemplars. These values are referred to as "preferences" and are a kind of self-similarity. The number of identified exemplars is influenced by the values of the input preferences.

Frey suggested that, without any prior knowledge, the preference be set to the median of the input similarities ($p_m$). In most cases, however, $p_m$ does not lead to an optimal clustering solution. Wang proposed the adAP algorithm [6] to solve this problem. adAP searches an interval of preferences defined relative to $p_m$ to maximize the Silhouette index [7] (described in Section 2.3) of the clustering result. However, the upper bound of the search space is obtained from experimental experience and lacks a theoretical foundation.

To solve these problems, this paper proposes an adaptive affinity propagation clustering algorithm that determines the range of the preference, searches it for the optimal value, and is then applied to document clustering. Experimental results show that our method is better than the standard affinity propagation and K-Means clustering methods.

2. Adaptive Affinity Propagation Clustering

This section introduces the adaptive affinity propagation algorithm, which is developed from the standard affinity propagation clustering method.

2.1 Affinity Propagation Clustering

AP takes a collection of real-valued similarities between data points as input. The similarity $s(i,k)$ indicates how well the data point with index k is suited to be the cluster center for data point i. Generally speaking, AP can be viewed as searching over valid configurations of labels $c = \{c_1, c_2, \ldots, c_n\}$ to minimize the energy $E(c) = -\sum_i s(i, c_i)$. The process of AP can be regarded as message passing in a factor graph. Two kinds of messages are exchanged between data points: responsibility and availability. The responsibility $r(i,k)$, sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i. The availability $a(i,k)$, sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar [3]. The procedure of standard affinity propagation clustering is shown in Figure 1.

Algorithm 1 Affinity Propagation
Input: $s(i,k)$: the similarity between point i and point k ($i \neq k$)
       $p(j)$: the preference of data point j, $p(j) = s(j,j)$
Output: the clustering result
Step 1: Initialize the availabilities to zero:
$$a(i,k) = 0$$
Step 2: Update the responsibilities:
$$r(i,k) = s(i,k) - \max_{k' \,\text{s.t.}\, k' \neq k} \{a(i,k') + s(i,k')\}$$
Step 3: Update the availabilities:
$$a(i,k) = \min\Big\{0,\; r(k,k) + \sum_{i' \,\text{s.t.}\, i' \notin \{i,k\}} \max\{0, r(i',k)\}\Big\}$$
$$a(k,k) = \sum_{i' \,\text{s.t.}\, i' \neq k} \max\{0, r(i',k)\}$$
Step 4: Terminate the message passing after a fixed number of iterations or when the changes in the messages fall below a threshold; otherwise go to Step 2.
Figure 1. The procedure of affinity propagation clustering

Availabilities and responsibilities are combined to make the exemplar decisions. For point i, the value of k that maximizes $a(i,k) + r(i,k)$ either identifies point i as an exemplar (when k = i) or identifies the data point that is the exemplar for point i. Numerical oscillations must also be considered when updating the messages. Therefore, each message is set to λ times its value from the previous iteration plus 1−λ times its newly computed value. The damping factor λ should be at least 0.5 and less than 1.

2.2 Computing the Range of Preferences

Affinity propagation tries to maximize the net similarity [8]. Net similarity is a score for how well the exemplars explain the data: it sums all similarities between data points and their exemplars, where the similarity of an exemplar to itself is the preference of that exemplar. Since affinity propagation aims to maximize the net similarity and tests for each data point whether it is an exemplar, a method for computing the range of preferences can be derived [8], as shown in Figure 2.

Algorithm 2 Preference Range Computing
Input: $s(i,k)$: the similarity between point i and point k ($i \neq k$)
Output: the maximal and minimal values of the preferences: $p_{max}$, $p_{min}$
Step 1: Initialize $s(k,k)$ to zero:
$$s(k,k) = 0$$
Step 2: Compute the maximal value of the preferences:
$$p_{max} = \max\{s(i,k)\}$$
Step 3: Compute the minimal value of the preferences:
Step 3.1: Compute the net similarity when the number of clusters is 1:
$$\mathit{dpsim1} = \max_{j} \Big\{\sum_{i} s(i,j)\Big\}$$
Step 3.2: Compute the net similarity when the number of clusters is 2:
$$\mathit{dpsim2} = \max_{i \neq j} \Big\{\sum_{k} \max\{s(i,k), s(j,k)\}\Big\}$$
Step 3.3: Compute the minimal value of the preferences:
$$p_{min} = \mathit{dpsim1} - \mathit{dpsim2}$$
Figure 2. The procedure of computing the range of preferences

The maximum preference ($p_{max}$) in the range is the value that clusters the N data points into N clusters. It equals the maximum similarity, because with any lower preference it would be better for the data point associated with that maximum similarity to be a cluster member rather than an exemplar.

The derivation of $p_{min}$ is similar to that of $p_{max}$. Suppose there is a particular preference $p'$ such that the optimal net similarity for one cluster (k=1) and the optimal net similarity for two clusters (k=2) are the same. With two clusters, the optimal net similarity is obtained by searching through all possible pairs of exemplars, and its value is $\mathit{dpsim2} + 2p'$. With one cluster, the optimal net similarity is $\mathit{dpsim1} + p'$. The minimum preference $p_{min}$ leads to clustering the N data points into one cluster. Since affinity propagation maximizes the net similarity, a single cluster is chosen when $\mathit{dpsim1} + p' \geq \mathit{dpsim2} + 2p'$, that is, when $p' \leq \mathit{dpsim1} - \mathit{dpsim2}$. $p_{min}$ is no more than $p'$; therefore the minimum value of the preferences is $p_{min} = \mathit{dpsim1} - \mathit{dpsim2}$.

2.3 Adaptive Affinity Propagation Clustering

After computing the range of preferences, we can scan the preference space to find the optimal clustering result. Different preferences lead to different clustering results, and cluster validation techniques are used to evaluate which clustering result is optimal for the data set. The preference step is important for scanning the space adaptively. We define it in Formula 1:
$$p_{step} = \frac{p_{max} - p_{min}}{N \cdot 0.1 \sqrt{K + 50}} \tag{1}$$
In order to sample the whole space, we set the base scanning step to $\frac{p_{max} - p_{min}}{N}$. However, a fixed step cannot meet the different requirements of cases with more clusters and cases with fewer clusters, because the clustering result is more sensitive to the preference when there are more clusters. We therefore adopt an adaptive step similar to Wang's [6] and introduce an adaptive coefficient $q = \frac{1}{0.1\sqrt{K + 50}}$. In this way, $p_{step}$ is set dynamically from the number of clusters K: when K is large, $p_{step}$ is small, and vice versa.

In this paper, we take the global silhouette index as the validity index. The silhouette was introduced by Rousseeuw [7] as a general graphical aid for the interpretation and validation of cluster analysis. It measures how well a data point is classified when it is assigned to a cluster, according to both the tightness of the clusters and the separation between them.

The global silhouette index is defined as:
$$GS = \frac{1}{n_c} \sum_{j=1}^{n_c} S_j \tag{2}$$
where the local silhouette index is defined as:
$$S_j = \frac{1}{r_j} \sum_{i=1}^{r_j} \frac{b(i) - a(i)}{\max\{b(i), a(i)\}} \tag{3}$$
Here $n_c$ is the number of clusters, $r_j$ is the number of objects in class j, $a(i)$ is the average distance between object i and the other objects in the same class j, and $b(i)$ is the minimum average distance between object i and the objects of another class, that is, the average distance to the class closest to class j.

Figure 3 shows the procedure of the adaptive affinity propagation clustering method. The largest global silhouette index indicates the best clustering quality and the optimal number of clusters [7][9]. A series of Sil values, corresponding to clustering results with different numbers of clusters, is calculated, and the optimal clustering result is the one for which Sil is largest.

Algorithm 3 Adaptive affinity propagation clustering
Input: $s(i,k)$: the similarity between point i and point k ($i \neq k$)
Output: the clustering result
Step 1: Apply the Preference Range algorithm to compute the range of preferences: $[p_{min}, p_{max}]$
Step 2: Initialize the preference:
$$\mathit{preference} = p_{min} - p_{step}$$
Step 3: Update the preference:
$$\mathit{preference} = \mathit{preference} + p_{step}$$
Step 4: Apply the Affinity Propagation algorithm to generate K clusters
Step 5: Terminate when Sil is largest; otherwise go to Step 3.
Figure 3. The procedure of adaptive affinity propagation clustering

3. Adaptive Affinity Propagation Document Clustering

This section discusses adaptive affinity propagation document clustering, which applies the adaptive affinity propagation algorithm to clustering documents, combined with the Vector Space Model (VSM).
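Before the algorithm is applied to documents, the parts of Section 2 that do not depend on message passing can be illustrated with a minimal Python sketch: the preference range of Algorithm 2, the scanning step of Formula 1, and the global silhouette index of Formulas 2 and 3. The function names are hypothetical and not from the paper. The sketch assumes that the denominator of Formula 1 reads N * 0.1 * sqrt(K + 50), that similarities are given as a square list of lists, and that a singleton cluster contributes a silhouette of 0; none of these choices is confirmed by the source.

```python
from itertools import combinations
from math import sqrt


def preference_range(S):
    """Algorithm 2: return (p_min, p_max) for a square similarity matrix S."""
    n = len(S)
    # Step 1: copy S with the self-similarities s(k, k) set to zero.
    Z = [[0.0 if i == k else S[i][k] for k in range(n)] for i in range(n)]
    # Step 2: p_max is the largest similarity between two distinct points.
    p_max = max(Z[i][k] for i in range(n) for k in range(n) if i != k)
    # Step 3.1: best net similarity with a single exemplar j.
    dpsim1 = max(sum(Z[i][j] for i in range(n)) for j in range(n))
    # Step 3.2: best net similarity over all pairs of exemplars (i, j).
    dpsim2 = max(
        sum(max(Z[i][k], Z[j][k]) for k in range(n))
        for i, j in combinations(range(n), 2)
    )
    # Step 3.3: p_min = dpsim1 - dpsim2.
    return dpsim1 - dpsim2, p_max


def preference_step(p_min, p_max, n_points, n_clusters):
    """Formula 1, assuming the denominator is N * 0.1 * sqrt(K + 50)."""
    return (p_max - p_min) / (n_points * 0.1 * sqrt(n_clusters + 50))


def global_silhouette(D, labels):
    """Formulas 2-3 for a distance matrix D and one cluster label per object."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    local = []
    for label, members in clusters.items():
        total = 0.0
        for i in members:
            same = [m for m in members if m != i]
            if not same:
                continue  # assumption: a singleton contributes silhouette 0
            a = sum(D[i][m] for m in same) / len(same)
            b = min(
                sum(D[i][m] for m in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
        local.append(total / len(members))  # S_j, Formula 3
    return sum(local) / len(local)  # GS, Formula 2


# Small usage example on a toy symmetric similarity matrix.
S = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.2], [0.1, 0.2, 0.0]]
p_min, p_max = preference_range(S)
step = preference_step(p_min, p_max, n_points=len(S), n_clusters=2)
print(p_min, p_max, step)
```

The pair search in Step 3.2 is quadratic in the number of points, which is acceptable for collections of the size used in this paper (about a hundred documents) but would need an approximation for much larger ones.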
3.1 Document Representation

The Vector Space Model is the most common of the many models for representing documents. In VSM, every document is represented as a vector $V(d) = (t_1, w_1(d);\; t_2, w_2(d);\; \ldots;\; t_m, w_m(d))$, where $t_i$ is a word item and $w_i(d)$ is the weight of $t_i$ in document d. The most widely used weighting scheme is Term Frequency with Inverse Document Frequency (TF-IDF). Because documents differ in length, the raw weights are not comparable across documents, so we normalize TF-IDF as shown in Formula 4:
$$w_{ik} = \frac{tf_{ik} \cdot \log\left(\frac{N}{df_k} + 0.01\right)}{\sqrt{\sum_{j=1}^{m} \left(tf_{ij} \cdot \log\left(\frac{N}{df_j} + 0.01\right)\right)^2}} \tag{4}$$
where $w_{ik}$ is the weight of word item k in document i, $tf_{ik}$ is the frequency of word item k in document i, $df_k$ is the number of documents in which word item k appears, m is the number of word items, and N is the number of documents in the whole collection.

3.2 Document Clustering Using Adaptive Affinity Propagation

After pre-processing the documents, we need a method to measure the similarity between different documents. We use cosine similarity in our experiment. The similarity between documents $d_1$ and $d_2$ is computed as in Formula 5:
$$sim(d_1, d_2) = \frac{\sum_{k=1}^{m} w_{d_1 k}\, w_{d_2 k}}{\sqrt{\sum_{k=1}^{m} w_{d_1 k}^2}\; \sqrt{\sum_{k=1}^{m} w_{d_2 k}^2}} \tag{5}$$
After computing the similarity between every two documents, we create a similarity matrix S. Taking S as the input of adaptive affinity propagation, we finally obtain the clustering result for the document collection.

4. Evaluation and Experiment Results

4.1 Dataset

The experimental data come from the text classification corpus of Fudan University [11]. We randomly selected documents from the corpus to make up data sets S1 and S2. S1 consists of 100 documents, 20 from each of five classes (Sports, Military, Environment, Politics and Economy). The class information of S2 is shown in Table 1. S1 has fewer classes, each with more examples, whereas S2 has more classes, each with fewer examples and with differing sizes.

Table 1. Data distribution in data set S2

Document Class     No.    Document Class     No.
C39-Sports         13     C23-Mine            7
C38-Politics        8     C29-Transport      13
C37-Military        9     C31-Environment     9
C11-Space          11     C32-Agriculture    15
C15-Energy          9     C35-Law             4
C19-Computer       11     C36-Medical        16
C5-Education        6     Total             129

4.2 Evaluation Method

To evaluate the quality of the clustering results, we use F-measure [10] and purity.

F-Measure: The F-measure is a harmonic combination of the precision and recall values widely used in information retrieval. In general, the larger the F-measure, the better the clustering result. For cluster j and class i, the recall and precision are defined as:
$$R(i,j) = \frac{n_{ij}}{n_i} \tag{6}$$
$$P(i,j) = \frac{n_{ij}}{n_j} \tag{7}$$
where $n_{ij}$ denotes the number of documents of class i that are placed in cluster j, $n_i$ is the number of documents in class i, and $n_j$ is the number of documents in cluster j. The F-measure of cluster j and class i is defined as:
$$F(i,j) = \frac{2 \cdot R(i,j) \cdot P(i,j)}{R(i,j) + P(i,j)} \tag{8}$$
For the entire clustering result, the F-measure is the weighted sum of the maximum F-measure of each class:
$$F = \sum_{i} \frac{n_i}{n} \max_{j}\{F(i,j)\} \tag{9}$$
where n is the total number of documents.

Purity: The purity of a cluster is the ratio of the documents of its majority category to all the documents in the cluster. The purity of cluster r is defined as:
$$P(S_r) = \frac{1}{n_r} \max_{i}\,(n_r^i) \tag{10}$$
where $n_r$ is the number of documents in cluster r and $n_r^i$ is the number of documents of class i in cluster r. The overall purity of the clustering solution is the weighted sum of the individual cluster purities:
$$Purity = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r) \tag{11}$$
In general, the larger the purity, the better the clustering solution.

4.3 Result and Discussion

We apply three methods to the two data sets: K-Means, standard affinity propagation (AP), and the proposed adaptive affinity propagation clustering (AAPC). The results are shown in Table 2 and Table 3. The results of AAPC are much better than those of K-Means in both Purity and F-Measure. Table 2 and Table 3 also show a limitation of standard affinity propagation: it produces many more clusters than there are classes. Because of this, the recall of affinity propagation is small, and therefore its F-Measure is not high. The F-Measure of the result generated by AAPC is the highest among the three methods. Moreover, compared with K-Means, AAPC does not need the number of clusters to be specified first; and compared with affinity propagation, the number of clusters found by AAPC is closer to the number of classes.

Table 2. The results on data set S1 using different algorithms

Algorithm    Count of clusters    Purity      F-Measure
K-Means       5                   0.66        0.663654
AP           19                   0.87931     0.689195
AAPC          6                   0.875789    0.818775

Table 3. The results on data set S2 using different algorithms

Algorithm    Count of clusters    Purity      F-Measure
K-Means      13                   0.472868    0.440124
AP           22                   0.898305    0.782709
AAPC         13                   0.814815    0.819313

5. Conclusions

This paper proposes an adaptive affinity propagation clustering method. It overcomes a limitation of the standard affinity propagation clustering algorithm, which sets the preferences to the median of the similarities and therefore often fails to yield optimal clustering results. The algorithm first computes the range of preferences, and then searches that range for the preference value that generates the optimal clustering result. Finally, we apply the method to document clustering, combined with the vector space model.

The experimental results demonstrate that, compared with K-Means and affinity propagation clustering, adaptive affinity propagation clustering can achieve higher precision in clustering documents. Moreover, unlike K-Means clustering, AAPC does not need the number of clusters to be specified beforehand: it computes an optimal preference from the distribution of the similarities and then generates an optimal clustering result.

References

[1] Zamir, O., Etzioni, O.: Web Document Clustering: A Feasibility Demonstration. In: Proceedings of the 21st International ACM SIGIR, 1998, 46-54.
[2] MacQueen, J.: Some Methods for Classification and Analysis of Multivariate Observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, 281-297.
[3] Frey, B.J., Dueck, D.: Clustering by Passing Messages between Data Points. Science, 2007, 315(5814), 972-976.
[4] Dueck, D., Frey, B.J.: Non-metric Affinity Propagation for Unsupervised Image Categorization. In: IEEE International Conference on Computer Vision, 2007.
[5] Zhang, X., Gao, J., Lu, P., Yan, Y.H.: A Novel Speaker Clustering Algorithm via Supervised Affinity Propagation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008, 4369-4372.
[6] Wang, K.J., Zhang, J.Y., Li, D.: Adaptive Affinity Propagation Clustering. Acta Automatica Sinica, 2007, 33(12), 1242-1246.
[7] Rousseeuw, P.J.: Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math., 1987, 20, 53-65.
[8] http://www.psi.toronto.edu/affinitypropagation/faq.html
[9] Dudoit, S., Fridlyand, J.: A Prediction-based Resampling Method for Estimating the Number of Clusters in a Dataset. Genome Biology, 2002, 3(7), 0036.1-0036.21.
[10] Zhao, Y., Karypis, G., Kumar, V.: A Comparison of Document Clustering Functions for Document Clustering. Machine Learning, 2004, 55(3), 311-331.
[11] http://www.nlp.org.cn/categories/default.php?cat_id=16
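The document representation of Section 3 (Formulas 4 and 5) and the evaluation measures of Section 4.2 (Formulas 6 to 11) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, documents are assumed to be already tokenized into lists of word items, and the natural logarithm is used in Formula 4, which is harmless because the normalization cancels any constant factor that a different base would introduce.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(docs):
    """Formula 4: one normalized {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df[t] = number of documents in which term t appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: f * log(n_docs / df[t] + 0.01) for t, f in tf.items()}
        norm = sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


def cosine(u, v):
    """Formula 5: cosine similarity of two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def f_measure(classes, clusters):
    """Formulas 6-9: class-size-weighted sum of the best F(i, j) per class."""
    n = len(classes)
    n_i = Counter(classes)
    n_j = Counter(clusters)
    n_ij = Counter(zip(classes, clusters))
    total = 0.0
    for i, size_i in n_i.items():
        best = 0.0
        for j, size_j in n_j.items():
            overlap = n_ij[(i, j)]
            if overlap:
                recall = overlap / size_i      # Formula 6
                precision = overlap / size_j   # Formula 7
                best = max(best, 2 * recall * precision / (recall + precision))
        total += size_i / n * best             # Formula 9
    return total


def purity(classes, clusters):
    """Formulas 10-11: fraction of documents in their cluster's majority class."""
    majority = Counter()
    for (r, _), count in Counter(zip(clusters, classes)).items():
        majority[r] = max(majority[r], count)
    # The n_r weights of Formula 11 cancel the 1/n_r of Formula 10.
    return sum(majority.values()) / len(classes)


# Small usage example: two toy documents, then a toy clustering of six items.
vectors = tfidf_vectors([["sports", "game"], ["sports", "law"]])
print(cosine(vectors[0], vectors[1]))
print(f_measure(["a", "a", "a", "b", "b", "b"], [0, 0, 1, 1, 1, 1]))
print(purity(["a", "a", "a", "b", "b", "b"], [0, 0, 1, 1, 1, 1]))
```

Applying `cosine` to every pair of vectors returned by `tfidf_vectors` gives the similarity matrix S that Section 3.2 passes to adaptive affinity propagation.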