An Adaptive Affinity Propagation Document Clustering


    Yancheng He, Qingcai Chen, Xiaolong Wang, Ruifeng Xu, Xiaohua Bai, Xianjun
                     Dept. of Computer Science and Technology
              Harbin Institute of Technology Shenzhen Graduate School
                                 Shenzhen, P.R. China

                     Abstract
    The standard affinity propagation clustering algorithm suffers from one limitation: it is hard to know the value of the parameter "preference" that yields an optimal clustering solution. To overcome this limitation, this paper proposes an adaptive affinity propagation method. The method first finds the range of the "preference", and then searches the "preference" space for a value that optimizes the clustering result. We apply the method to document clustering and compare it with the standard affinity propagation and K-Means clustering methods on real data sets. Experimental results show that the proposed method obtains better clustering results.

1. Introduction
    There are many document clustering methods, such as partition clustering, hierarchical clustering, Self-Organizing Maps clustering and suffix tree clustering [1]. K-Means is the most popular partition clustering method [2]. The standard K-Means method first specifies K initial cluster centers randomly, then assigns every data point to its nearest cluster center, and finally updates all the cluster centers. This process iterates until the difference between the cluster centers of consecutive rounds falls below a certain threshold. However, the standard K-Means method has two limitations: 1) the number of clusters needs to be specified in advance; 2) the clustering result is sensitive to the initial cluster centers.
    A new clustering approach named affinity propagation (AP) [3] has been proposed recently. AP performs well in many applications such as image categorization [4], gene expression analysis, text summarization [3] and speaker clustering [5]. This paper applies it to document clustering. Unlike other methods, AP simultaneously considers all data points as potential exemplars. Viewing each data point as a node in a network, it recursively transmits real-valued messages along the edges of the network until a good set of exemplars emerges.
    Rather than requiring that the number of clusters be pre-specified, affinity propagation takes as input a real number s(k, k) for each data point k, so that data points with larger values of s(k, k) are more likely to be chosen as exemplars. These values s(k, k) are referred to as "preferences", a kind of self-similarity. The number of identified exemplars is influenced by the values of the input preferences. Frey suggested that, without any prior knowledge, the preference be set as the median of the input similarities (p_m).
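The median-of-similarities default can be illustrated with a short sketch; the similarity values below are made up for illustration (negative squared Euclidean distance is a common choice):

```python
import numpy as np

# Hypothetical pairwise similarities for 4 data points.
S = np.array([[ 0.0, -2.0, -5.0, -9.0],
              [-2.0,  0.0, -4.0, -8.0],
              [-5.0, -4.0,  0.0, -1.0],
              [-9.0, -8.0, -1.0,  0.0]])

# Frey and Dueck's default: set every preference s(k, k) to the
# median of the off-diagonal input similarities.
off_diag = S[~np.eye(len(S), dtype=bool)]
p_m = np.median(off_diag)
np.fill_diagonal(S, p_m)
print(p_m)  # -4.5
```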
    In most cases, however, p_m does not lead to an optimal clustering solution. Wang proposed the adAP algorithm [6] to solve this problem. adAP searches a preference space with upper bound p_m/2 to maximize the Silhouette value [7], described in Section 2.3, of the clustering result. However, that upper bound of the search space is obtained from experimental experience and lacks a theoretical foundation.
    To solve these problems, this paper proposes an adaptive affinity propagation clustering algorithm that determines the range of the preference, searches for its optimal value, and then applies it to document clustering. Experimental results show that our method outperforms the standard affinity propagation and K-Means clustering methods.

2. Adaptive Affinity Propagation Clustering
    This section introduces the adaptive affinity propagation algorithm, which is developed from the standard affinity propagation clustering method.

2.1 Affinity Propagation Clustering
    AP takes a collection of real-valued similarities between data points as input. The similarity s(i, k) indicates how well the data point with index k is suited to serve as the cluster center for data point i. Generally speaking, AP can be viewed as searching over valid configurations of the labels c = {c1, c2, ...} to minimize the energy E(c) = -Σ_i s(i, c_i). The process of AP can be regarded as a message-passing process in a factor graph. Two kinds of messages are exchanged between data points: responsibilities and availabilities. The responsibility r(i, k), sent from data point i to candidate exemplar k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i. The availability a(i, k), sent from candidate exemplar k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar [3]. The procedure of the standard affinity propagation clustering is shown in Figure 1.

Algorithm 1  Affinity Propagation
Input: s(i, k): the similarity between point i and point k (i ≠ k)
       p(j): the preference of data point j, p(j) = s(j, j)
Output: the clustering result
Step 1: Initialize the availabilities to zero: a(i, k) = 0
Step 2: Update the responsibilities:
        r(i, k) <- s(i, k) - max_{k' s.t. k' ≠ k} {a(i, k') + s(i, k')}
Step 3: Update the availabilities:
        a(i, k) <- min{0, r(k, k) + Σ_{i' s.t. i' ∉ {i, k}} max{0, r(i', k)}}
        a(k, k) <- Σ_{i' s.t. i' ≠ k} max{0, r(i', k)}
Step 4: Terminate the message passing after a fixed number of iterations or once the changes in the messages fall below a threshold; otherwise go to Step 2.
     Figure 1. The algorithm procedure of affinity propagation clustering

    Availabilities and responsibilities can be combined to make the exemplar decisions. For point i, the value of k that maximizes a(i, k) + r(i, k) either identifies point i itself as an exemplar (when k = i) or identifies the data point that is the exemplar for point i. When updating the messages, numerical oscillations must also be considered: each message is therefore set to λ times its previous value plus (1 - λ) times its newly computed value, where λ is at least 0.5 and less than 1.
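The update rules of Figure 1 can be written out directly in NumPy. The following is a teaching sketch with the uniform damping described above, not an optimized implementation:

```python
import numpy as np

def affinity_propagation(S, lam=0.5, max_iter=200):
    """Minimal sketch of the message passing in Figure 1.
    S is an n x n similarity matrix whose diagonal holds the preferences."""
    n = S.shape[0]
    A = np.zeros((n, n))  # availabilities a(i, k)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    for _ in range(max_iter):
        # Step 2: r(i,k) <- s(i,k) - max_{k' != k} {a(i,k') + s(i,k')}
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[np.arange(n), best]
        AS[np.arange(n), best] = -np.inf      # exclude k' = k for the winner
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), best] = S[np.arange(n), best] - second
        R = lam * R + (1 - lam) * R_new       # damping
        # Step 3: a(i,k) <- min{0, r(k,k) + sum_{i' not in {i,k}} max{0, r(i',k)}}
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())    # keep r(k,k) un-thresholded
        col = Rp.sum(axis=0)
        A_new = np.minimum(0, col[None, :] - Rp)
        # a(k,k) <- sum_{i' != k} max{0, r(i',k)}
        np.fill_diagonal(A_new, np.maximum(R, 0).sum(axis=0)
                                - np.maximum(R.diagonal(), 0))
        A = lam * A + (1 - lam) * A_new
    # Exemplar decision: for each i, the k maximizing a(i,k) + r(i,k).
    return np.argmax(A + R, axis=1)
```

On two well-separated pairs of points with a median preference, the returned exemplar indices group each pair together.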
2.2 Computing the Range of Preferences
    Affinity propagation tries to maximize the net similarity [8]. The net similarity is a score for explaining the data; it represents how appropriate the chosen exemplars are. The score sums all similarities between data points and their exemplars (the similarity between an exemplar and itself is the preference of that exemplar). Since affinity propagation aims at maximizing the net similarity and tests every data point as a potential exemplar, a method for computing the range of preferences can be derived [8], as shown in Figure 2.

Algorithm 2  Preference Range Computation
Input: s(i, k): the similarity between point i and point k (i ≠ k)
Output: the maximal and minimal values of the preference: p_max, p_min
Step 1: Initialize s(k, k) to zero: s(k, k) = 0
Step 2: Compute the maximal value of the preference:
        p_max = max_{i≠k} {s(i, k)}
Step 3: Compute the minimal value of the preference:
    Step 3.1: Compute the net similarity when the number of clusters is 1:
        dpsim1 = max_j {Σ_i s(i, j)}
    Step 3.2: Compute the net similarity when the number of clusters is 2:
        dpsim2 = max_{i≠j} {Σ_k max{s(i, k), s(j, k)}}
    Step 3.3: Compute the minimal value of the preference:
        p_min = dpsim1 - dpsim2
     Figure 2. The procedure of computing the range of preferences

    The maximum preference (p_max) in the range is the value that clusters the N data points into N clusters, and it equals the maximum similarity: for any preference lower than that, the objective would be better if the data point associated with that maximum similarity were assigned as a cluster member rather than as an exemplar.
    The derivation of p_min is similar to that of p_max. Suppose there is a particular preference p' such that the optimal net similarity for one cluster (k = 1) and the optimal net similarity for two clusters (k = 2) are the same. If there are two clusters, the optimal net similarity can be obtained by searching through all possible pairs of exemplars, and its value is dpsim2 + 2p'. If there is one cluster, the optimal net similarity is dpsim1 + p'. The minimum preference p_min leads to clustering the N data points into one cluster. Since affinity propagation aims at maximizing the net similarity, setting dpsim1 + p' = dpsim2 + 2p' gives p' = dpsim1 - dpsim2. p_min is no more than p'; therefore, the minimum value of the preference is p_min = dpsim1 - dpsim2.
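Algorithm 2 translates directly into a few lines of NumPy. This is a sketch; the pair search in Step 3.2 is a brute-force loop over all O(n^2) exemplar pairs:

```python
import numpy as np

def preference_range(S):
    """Compute [p_min, p_max] following Algorithm 2 (Figure 2).
    S is an n x n similarity matrix; its diagonal is ignored."""
    S = S.copy()
    np.fill_diagonal(S, 0.0)                 # Step 1: s(k, k) = 0
    n = S.shape[0]
    mask = ~np.eye(n, dtype=bool)
    p_max = S[mask].max()                    # Step 2: largest off-diagonal similarity
    # Step 3.1: best net similarity with a single exemplar j
    dpsim1 = S.sum(axis=0).max()
    # Step 3.2: best net similarity with an exemplar pair (i, j)
    dpsim2 = -np.inf
    for i in range(n):
        for j in range(i + 1, n):
            dpsim2 = max(dpsim2, np.maximum(S[i, :], S[j, :]).sum())
    # Step 3.3
    p_min = dpsim1 - dpsim2
    return p_min, p_max
```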
2.3 Adaptive Affinity Propagation Clustering
    After computing the range of preferences, we can scan the preference space to find the optimal clustering result. Different preferences lead to different clustering results, and cluster validation techniques are used to evaluate which clustering result is optimal for the data set.
    The preference step is important for scanning the space adaptively. We denote it as Formula 1:

        p_step = (p_max - p_min) / (N * (0.1*K + 50))        (1)

    In order to sample the whole space, we set the base scanning step to (p_max - p_min)/N. However, this fixed step cannot meet the different requirements of different cases, such as more clusters versus fewer clusters, because the more-clusters case is more sensitive than the fewer-clusters case. We therefore adopt an adaptive step method similar to Wang's [6], introducing an adaptive coefficient q = 1/(0.1*K + 50). In this way, the value of p_step is set dynamically according to the current count of clusters K: when K is large, p_step is small, and vice versa.
    In this paper, we take the global silhouette index as the validity index. The silhouette was introduced by Rousseeuw [7] as a general graphical aid for the interpretation and validation of cluster analysis; it provides a measure of how well a data point is classified when it is assigned to a cluster, according to both the tightness of the clusters and the separation between them.
    The global silhouette index is defined as follows:

        GS = (1/nc) * Σ_{j=1..nc} S_j        (2)

where the local silhouette index is defined as:

        S_j = (1/r_j) * Σ_{i=1..r_j} (b(i) - a(i)) / max{b(i), a(i)}

where r_j is the count of the objects in class j, a(i) is the average distance between object i and the other objects in the same class j, and b(i) is the minimum average distance between object i and the objects in the class closest to class j.
    Figure 3 shows the procedure of the adaptive affinity propagation clustering method. The largest global silhouette index indicates the best clustering quality and the optimal number of clusters [7][9]. A series of Sil values, corresponding to clustering results with different numbers of clusters, is calculated, and the optimal clustering result is the one for which Sil is largest.

Algorithm 3  Adaptive Affinity Propagation
Input: s(i, k): the similarity between point i and point k (i ≠ k)
Output: the clustering result
Step 1: Apply the Preference Range algorithm to compute the range of preferences: [p_min, p_max]
Step 2: Initialize the preference: preference = p_min + p_step
Step 3: Update the preference: preference = preference + p_step
Step 4: Apply the Affinity Propagation algorithm to generate K clusters
Step 5: Repeat Steps 3 and 4 until the whole range has been scanned; output the clustering result whose Sil is largest.
     Figure 3. The procedure of adaptive affinity propagation clustering
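Algorithm 3 can be sketched with scikit-learn's AffinityPropagation standing in for the paper's own AP implementation. The similarity-to-distance conversion for the silhouette computation and the damping value 0.9 are assumptions of this sketch:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

def adaptive_ap(S, p_min, p_max):
    """Scan [p_min, p_max] with the adaptive step of Formula 1 and keep
    the clustering whose silhouette (Sil) is largest."""
    n = S.shape[0]
    # Silhouette needs distances; this assumes S holds negative squared
    # Euclidean distances off the diagonal.
    D = np.sqrt(np.maximum(-S, 0))
    np.fill_diagonal(D, 0.0)
    best_sil, best_labels = -1.0, None
    preference = p_min
    while preference < p_max:
        ap = AffinityPropagation(affinity="precomputed", damping=0.9,
                                 preference=preference, random_state=0).fit(S)
        labels, k = ap.labels_, len(set(ap.labels_))
        if 2 <= k < n:  # silhouette is undefined for k = 1 or k = n
            sil = silhouette_score(D, labels, metric="precomputed")
            if sil > best_sil:
                best_sil, best_labels = sil, labels
        # Formula 1: smaller steps when the current cluster count K is large
        preference += (p_max - p_min) / (n * (0.1 * k + 50))
    return best_labels, best_sil
```

On a toy data set the scan recovers the natural two-cluster partition, since that partition has the largest silhouette.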
3. Adaptive Affinity Propagation Document Clustering
    This section discusses adaptive affinity propagation document clustering, which applies the adaptive affinity propagation algorithm to clustering documents, combined with the Vector Space Model (VSM).

3.1 Document Representation
    The Vector Space Model is the most common model for representing documents. In VSM, every document is represented as a vector V(d) = (t1, w1(d); t2, w2(d); ...; tm, wm(d)), where ti is a word item and wi(d) is the weight of ti in document d. The most widely used weighting scheme is Term Frequency with Inverse Document Frequency (TF-IDF). Since the lengths of documents differ, the raw item weights are not comparable, so we normalize TF-IDF as shown in Formula 4:

        w_ik = tf_ik * log(N/df_k + 0.01) / sqrt( Σ_{k=1..m} (tf_ik * log(N/df_k + 0.01))^2 )        (4)

where w_ik is the weight of word item k in document i, tf_ik is the frequency of word item k in document i, df_k is the number of documents in which feature word item k appears, and N is the number of documents in the whole collection.

3.2 Document Clustering Using Adaptive Affinity Propagation
    After pre-processing the documents, we need a method to measure the similarity between documents; we use cosine similarity in our experiments. The similarity between documents d1 and d2 is computed as Formula 5:

        sim(d1, d2) = Σ_{k=1..m} w_d1k * w_d2k / ( sqrt(Σ_{k=1..m} w_d1k^2) * sqrt(Σ_{k=1..m} w_d2k^2) )        (5)

    After computing the similarity between every two documents, we can create a similarity matrix S. Taking S as the input of adaptive affinity propagation, we finally obtain the clustering result of the document collection.

4. Evaluation and Experimental Results
4.1 Dataset
    The experimental data come from the text classification corpus of Fudan University [11]. We randomly select documents from the corpus to make up data sets S1 and S2. Data set S1 consists of 100 documents; each of its classes (Sports, Military, Environment, Politics and Economy) contains 20 documents. The class information of data set S2 is shown in Table 1. S1 has fewer classes but more examples per class, while S2 has more classes, each with a different and smaller number of examples.

        Table 1. Data distribution in data set S2
Document Class    No.    Document Class     No.
C39-Sports        13     C23-Mine           7
C38-Politics      8      C29-Transport      13
C37-Military      9      C31-Environment    9
C11-Space         11     C32-Agriculture    15
C15-Energy        9      C35-Law            4
C19-Computer      11     C36-Medical        16
C5-Education      6      Total              129

4.2 Evaluation Method
    To evaluate the quality of the clustering results, we use the F-measure [10] and purity.
    F-measure: The F-measure is a harmonic combination of the precision and recall values used in information retrieval, and it is widely used. In general, the larger the F-measure, the better the clustering result.
    For cluster j and class i, the recall and precision are defined below:

        R(i, j) = n_ij / n_i        (6)

        P(i, j) = n_ij / n_j        (7)

where n_ij denotes the number of documents of class i that fall in cluster j, n_i is the number of documents in class i, and n_j is the number of documents in cluster j.
    The F-measure of cluster j and class i is defined as follows:

        F(i, j) = 2 * R(i, j) * P(i, j) / (R(i, j) + P(i, j))        (8)

    For the entire clustering result, the F-measure is obtained as the weighted sum of the maximum F-measure of each class:

        F = Σ_i (n_i / n) * max_j {F(i, j)}        (9)

where n is the total number of documents.
    Purity: The purity of a cluster measures the ratio of documents of the major category to all documents in the cluster. The purity of cluster r is defined as:

        P(S_r) = (1/n_r) * max_i {n_ri}        (10)

    The overall purity of the clustering solution is obtained as a weighted sum of the individual cluster purities:

        Purity = Σ_{r=1..k} (n_r / n) * P(S_r)

In general, the larger the purity value, the better the clustering solution.

4.3 Results and Discussion
    We apply the three methods to clustering the two data sets. The results are shown in Table 2 and Table 3. From the clustering results, we can see that the result of AAPC is much better than that of K-Means, in both purity and F-measure. From Table 2 and Table 3, we can also see a limitation of standard affinity propagation: the number of clusters it produces is much larger than the number of classes. Because of this, the recall of affinity propagation is small, and therefore its F-measure is not high. The F-measure of the result generated by AAPC is the highest among the three methods. Moreover, compared with K-Means, AAPC does not need the number of clusters to be specified first; and compared with affinity propagation, the number of clusters found by AAPC is closer to the number of classes.

        Table 2. The results on data set S1 using different algorithms
Algorithm    Count of clusters    Purity      F-Measure
K-Means      5                    0.66        0.663654
AP           19                   0.87931     0.689195
AAPC         6                    0.875789    0.818775

        Table 3. The results on data set S2 using different algorithms
Algorithm    Count of clusters    Purity      F-Measure
K-Means      13                   0.472868    0.440124
AP           22                   0.898305    0.782709
AAPC         13                   0.814815    0.819313

5. Conclusions
    This paper proposes an adaptive affinity propagation clustering method, which overcomes the limitation of the standard affinity propagation algorithm that setting the preference to the median of the similarities cannot always yield optimal clustering results. The algorithm first computes the range of preferences and then searches this space for the preference value that generates the optimal clustering result. Finally, we apply the method to document clustering, combined with the vector space model.
    The experimental results demonstrate that, compared with K-Means and affinity propagation clustering, adaptive affinity propagation clustering achieves higher precision in clustering documents.
Moreover, unlike K-Means clustering, AAPC does not need to specify the number of clusters beforehand: it can compute an optimal preference from the distribution of the similarities and then generate an optimal clustering result.

References
[1]  Zamir, O., Etzioni, O.: Web Document Clustering: A Feasibility Demonstration. In: Proceedings of the 21st International ACM SIGIR Conference, 1998, 46-54.
[2]  MacQueen, J.: Some Methods for Classification and Analysis of Multivariate Observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, 281-297.
[3]  Frey, B.J., Dueck, D.: Clustering by Passing Messages between Data Points. Science 2007, 315(5814), 972-976.
[4]  Dueck, D., Frey, B.J.: Non-metric Affinity Propagation for Unsupervised Image Categorization. In: IEEE International Conference on Computer Vision, 2007.
[5]  Zhang, X., Gao, J., Lu, P., Yan, Y.H.: A Novel Speaker Clustering Algorithm via Supervised Affinity Propagation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008, 4369-4372.
[6]  Wang, K.J., Zhang, J.Y., Li, D.: Adaptive Affinity Propagation Clustering. Acta Automatica Sinica, 2007, 33(12), 1242-1246.
[7]  Rousseeuw, P.J.: Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53-65.
[9]  Dudoit, S., Fridlyand, J.: A Prediction-Based Resampling Method for Estimating the Number of Clusters in a Dataset. Genome Biology, 2002, 3(7), 0036.1-0036.21.
[10] Zhao, Y., Karypis, G., Kumar, V.: A Comparison of Document Clustering Functions for Document Clustering. Machine Learning, 2004, 55(3), 311-331.
[11] Text classification corpus of Fudan University, ...at_id=16