A Divisive Hierarchical Structural Clustering Algorithm for Networks

            Nurcan Yuruk, Mutlu Mete, Xiaowei Xu                        Thomas A. J. Schweiger
             University of Arkansas at Little Rock                        Acxiom Corporation
             {nxyuruk, mxmete, xwxu}@ualr.edu                          Tom.Schweiger@acxiom.com

Abstract

Many systems in science, engineering and nature can be modeled as networks. Examples are the internet, metabolic networks and social networks. Network clustering algorithms, which aim to find the hidden structure of a network, are important for making sense of complex networked data. In this paper we present a new clustering method for networks. The proposed algorithm can find the hierarchical structure of clusters without requiring any input parameters. Experiments using real data demonstrate the outstanding performance of the new method.
1. Introduction

Many systems in science, engineering and nature can be modeled as networks consisting of nodes and links that represent real entities and the relationships between them. Examples are social networks, biological networks and the internet. Network clustering aims to find clusters in a network, an important task for uncovering hidden structure in messy, otherwise hard-to-comprehend networks. A cluster can be a community such as a clique of terrorists in a social network, or a group of molecules sharing similar functions in a biological network.

In this paper we present DHSCAN, a Divisive Hierarchical Structural Clustering Algorithm for Networks that iteratively removes links in ascending order of a structural similarity measure. The network is divided into disconnected components by the removal of links. The iterative divisive procedure produces a dendrogram showing the hierarchical structure of the clusters. Additionally, the divisive procedure stops at the maximum of a similarity-based modularity that is a slightly modified version of Newman's modularity [1]. Therefore, our algorithm has two main advantages: (1) it can find the hierarchical structure of clusters; (2) it does not require any parameters.

This paper is organized as follows. After a brief review of related work in section 2, the proposed algorithm is described in section 3. We evaluate the proposed algorithm using real networks whose expected cluster structures are known to us; the experimental results are presented in section 4. We conclude the paper with some future work in section 5.

2. Related work

The goal of network clustering is to partition a network into clusters. Due to the immense need, the network clustering problem has been studied in many science and engineering disciplines for many years. In this section we focus on recent and commonly used algorithms.

The min-max cut method [2] seeks to partition a graph G = {V, E} into two clusters A and B. The principle of min-max clustering is to minimize the number of connections between A and B while maximizing the number of connections within each. A cut is defined as the number of edges that would have to be removed to isolate the vertices in cluster A from those in cluster B. The min-max cut algorithm searches for the clustering that creates two clusters whose cut is minimized while the number of remaining edges is maximized.

A pitfall of this method is that cutting a single vertex out of the graph will probably achieve the optimum. Therefore, in practice, the optimization must be accompanied by some constraint, such as requiring A and B to be of equal or similar size, i.e. |A| ≈ |B|. Such constraints are not always appropriate; for example, in social networks some communities are much larger than others.

To amend this issue, the normalized cut [3] was proposed, which normalizes the cut by the total number of connections between each cluster and the rest of the graph. Therefore, cutting out one vertex or some small part of the graph will no longer always yield an optimum.
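
To make the two objectives concrete, here is a small toy sketch in Python (our own illustration, not code from the cited papers; the graph and the partition are invented): the cut counts edges crossing the partition, while the normalized cut of [3] divides the cut by the total degree of each side.

    # Toy sketch (illustration only): cut size vs. normalized cut for one partition.
    import networkx as nx

    G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])  # two triangles plus a bridge
    A, B = {0, 1, 2}, {3, 4, 5}                                             # a candidate two-way partition

    def vol(S):
        # total degree ("volume") of a vertex set
        return sum(d for _, d in G.degree(S))

    cut = sum(1 for u, v in G.edges() if (u in A) != (v in A))   # edges crossing between A and B
    ncut = cut / vol(A) + cut / vol(B)                           # normalized cut in the Shi-Malik form [3]
    print(cut, round(ncut, 3))                                   # 1 crossing edge, 1/7 + 1/7 ≈ 0.286
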
Both the min-max cut and the normalized cut methods partition a graph into two clusters. To divide a graph into k clusters, one has to adopt a top-down approach, splitting the graph into two clusters, and then further splitting these clusters, and so on, until k clusters have
been detected. There is no guarantee of the optimality of such recursive clustering. There is no measure of how many clusters should be produced when k is unknown. There is no indicator of when to stop the bisection procedure.

Recently, modularity was proposed as a quality measure of network clustering [1]. For a clustering of a graph into k clusters, the modularity is defined as:

    Q_n = \sum_{s=1}^{k} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]    (2.1)

where L is the number of edges in the graph, l_s is the number of edges between vertices within cluster s, and d_s is the sum of the degrees of the vertices in cluster s. The modularity of a clustering of a graph is the fraction of all edges that lie within each cluster minus the fraction that would lie within each cluster if the graph's vertices were randomly connected. The optimal clustering is achieved when the modularity is maximized. Modularity is defined such that it is zero for two extreme cases: when all vertices are partitioned into a single cluster, and when the vertices are clustered at random. Note that modularity measures the quality of any network clustering, whereas the normalized cut and min-max cut measures assess only the quality of a clustering into two clusters.
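
As a sketch of how Q_n in (2.1) can be evaluated for a given partition (our own illustration; the helper name and the toy graph are ours, not from the paper):

    # Sketch: evaluate the modularity Q_n of Eq. (2.1) for a given partition of a graph.
    import networkx as nx

    def newman_modularity(G, clusters):
        # clusters: an iterable of vertex sets that partition the vertices of G
        L = G.number_of_edges()
        Q = 0.0
        for s in clusters:
            l_s = G.subgraph(s).number_of_edges()       # edges with both ends inside cluster s
            d_s = sum(d for _, d in G.degree(s))        # sum of the degrees of the vertices in s
            Q += l_s / L - (d_s / (2 * L)) ** 2
        return Q

    # Two triangles joined by a bridge edge, split into the two triangles.
    G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
    print(round(newman_modularity(G, [{0, 1, 2}, {3, 4, 5}]), 3))   # 0.357
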
Finding the maximum Q_n is NP-complete. Instead of performing an exhaustive search, various optimization approaches have been proposed. For example, a greedy method based on hierarchical agglomerative clustering is proposed in [4], which is faster than many competing algorithms: its running time on a graph with n vertices and m edges is O(md log n), where d is the depth of the dendrogram describing the hierarchical cluster structure.

Although the modularity-based algorithms can find good clusters, they fail to identify nodes that play special roles, such as hubs and outliers, in networks. Hubs connecting many clusters are responsible for spreading ideas or diseases in social networks; outliers are nodes only marginally connected to clusters. Recently we proposed a similarity-based modularity [5] defined as:

    Q_s = \sum_{i=1}^{k} \left[ \frac{IS_i}{TS} - \left( \frac{DS_i}{TS} \right)^2 \right]    (2.2)

where k is the number of clusters, IS_i is the total similarity of vertices within cluster i, DS_i is the total similarity between vertices in cluster i and any vertices in the graph, and TS is the total similarity between any two vertices in the graph. The similarity of two vertices is defined by a structural similarity measure:

    \sigma(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)| \, |\Gamma(w)|}}    (2.3)

where Γ(v) is the set of direct neighbors of v. A genetic algorithm is developed in [5] to find the optimal clustering of a network by maximizing the similarity-based modularity. Although that algorithm can find both clusters and hubs in networks, it does not scale well to large networks.
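
The following sketch shows one way σ in (2.3) and Q_s in (2.2) might be evaluated for a given clustering (our own illustration, not the genetic-algorithm optimizer of [5]; whether self-pairs and ordered pairs are counted in IS_i, DS_i and TS is not spelled out here, so the sketch simply sums over all ordered vertex pairs consistently):

    # Sketch: structural similarity (2.3) and similarity-based modularity Q_s (2.2).
    import math
    import networkx as nx

    def gamma(G, v):
        # Gamma(v): the neighbors of v plus v itself
        return set(G.neighbors(v)) | {v}

    def sigma(G, v, w):
        # structural (cosine) similarity of Eq. (2.3)
        gv, gw = gamma(G, v), gamma(G, w)
        return len(gv & gw) / math.sqrt(len(gv) * len(gw))

    def similarity_modularity(G, clusters):
        # Q_s of Eq. (2.2); O(n^2) pairwise similarities, intended only for small graphs
        nodes = list(G.nodes())
        sim = {(v, w): sigma(G, v, w) for v in nodes for w in nodes}
        TS = sum(sim.values())                                # total similarity over all pairs
        Q = 0.0
        for c in clusters:
            IS = sum(sim[v, w] for v in c for w in c)         # similarity within cluster c
            DS = sum(sim[v, w] for v in c for w in nodes)     # similarity from c to the whole graph
            Q += IS / TS - (DS / TS) ** 2
        return Q
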
Most recently, we proposed SCAN, a Structural Clustering Algorithm for Networks, in [6]. SCAN can efficiently find clusters, hubs as well as outliers in very large networks by visiting each node exactly once. However, it requires two parameters that may be difficult for users to determine.

3. The algorithm

In this section we present DHSCAN, a Divisive Hierarchical Structural Clustering Algorithm for Networks that can find the hierarchical structure of clusters in networks without requiring any parameters.

We focus on simple, undirected and un-weighted graphs G = {V, E}, where V is a set of vertices and E is a set of unordered pairs of distinct vertices, called edges.

Our method is based on common neighbors. Two vertices are assigned to a cluster according to how they share neighbors. This makes sense when you consider social communities: people who share many friends form a community, and the more friends they have in common, the more intimate the community.

The structure of a vertex can be described by its neighborhood. A formal definition of vertex structure is given as follows.

DEFINITION 1 (VERTEX STRUCTURE)
Let v ∈ V; the structure of v is defined by its neighborhood, denoted by Γ(v):

    Γ(v) = {w ∈ V | (v, w) ∈ E} ∪ {v}

The structure similarity between vertices can be measured by the normalized number of common neighbors, also known as the cosine similarity measure commonly used in information retrieval.

DEFINITION 2 (STRUCTURE SIMILARITY)
Let v, w ∈ V; the structure similarity of v and w is defined by their common neighborhood normalized by the geometric mean of the neighborhood sizes, denoted by σ(v, w):

    \sigma(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)| \, |\Gamma(w)|}}

Every edge can be represented by its two end vertices. Therefore, we can define the structure of an edge by the structure similarity of its two end vertices.

DEFINITION 3 (EDGE STRUCTURE)
Let v, w ∈ V and e = (v, w) ∈ E; the structure of e is defined by the structural similarity of v and w, denoted by κ(e) = σ(v, w).

Vertices in the same cluster have a higher structural similarity than vertices from different clusters. Therefore, intra-cluster edges, i.e. the edges connecting vertices of the same cluster, have a larger edge structure than inter-cluster edges, i.e. the edges connecting vertices of different clusters. Our clustering algorithm aims to cluster vertices by identifying both intra- and inter-cluster edges. For this reason, we sort the edges in ascending order of edge structure and iteratively classify them by edge structure. We use two sets for the classified edges: set B holds inter-cluster edges and set W holds intra-cluster edges. At the beginning of the algorithm, all edges are initialized as intra-cluster edges and stored in set W. In each iteration, the edge with minimal edge structure is moved from W to B. If the clusters change as a result, the modularity Q defined in (2.2) is updated for the changed clusters. Additionally, the edge structures are updated for the edges directly connected to the moved edge. The procedure terminates when all edges have been moved from W to B. The result of our clustering algorithm is a hierarchy of clusters that can be represented as a dendrogram, and the optimal clustering is found at the maximal Q. The pseudo-code of our algorithm is presented in Figure 1.

   ALGORITHM DHSCAN(G = {V, E})
   // all edges are initialized as intra-cluster edges;
   W := E; B := ∅; i := 0; Qi := 0;
   while W ≠ ∅ do {
   // Move edge with minimal structure;
      remove e := min_struct(W) from W;
      insert e into B;
      find all connected components in W;
      if (number of components increased) {
         i := i+1;
         define each component in W as a cluster;
         plot level i of dendrogram;
         calculate Qi;
      }
   }
   // Get the optimal clustering;
      cut the dendrogram at maximal Q value;
   end DHSCAN.

              Figure 1 DHSCAN algorithm
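
A runnable Python sketch of this procedure is given below (our own reading of Figure 1, reusing the sigma and similarity_modularity helpers sketched in section 2; the choices to recompute similarities on the current working graph and to evaluate Qs on the original graph are our assumptions, not details stated in the paper):

    # Sketch of the divisive procedure of Figure 1 (our interpretation, not the authors' code).
    import networkx as nx

    def dhscan(G):
        H = G.copy()                                    # working graph: its remaining edges play the role of W
        levels = []                                     # one (clusters, Q_s) entry per dendrogram level
        n_comp = nx.number_connected_components(H)
        while H.number_of_edges() > 0:
            # edge with minimal structure, recomputed on the current working graph
            u, v = min(H.edges(), key=lambda e: sigma(H, *e))
            H.remove_edge(u, v)                         # move it from W to B
            c = nx.number_connected_components(H)
            if c > n_comp:                              # the removal created a new dendrogram level
                n_comp = c
                clusters = [set(comp) for comp in nx.connected_components(H)]
                levels.append((clusters, similarity_modularity(G, clusters)))  # Q_s on the original graph
        return max(levels, key=lambda level: level[1])  # cut the dendrogram at the maximal Q_s
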
4. Experiments

In this section, we evaluate the DHSCAN algorithm using three real datasets: the well-known Zachary karate dataset [7], the American College Football network [1] and the Books about US politics dataset [8]. The performance of DHSCAN is compared with the modularity-based algorithm of [4].
We use the adjusted Rand index (ARI) [9] as our measure of agreement between the clustering found by a particular algorithm and the true clustering of the network. It is defined as:

    ARI = \frac{ \sum_{i,j} \binom{n_{ij}}{2} - \left[ \sum_i \binom{n_{i.}}{2} \sum_j \binom{n_{.j}}{2} \right] / \binom{n}{2} }
               { \frac{1}{2} \left[ \sum_i \binom{n_{i.}}{2} + \sum_j \binom{n_{.j}}{2} \right] - \left[ \sum_i \binom{n_{i.}}{2} \sum_j \binom{n_{.j}}{2} \right] / \binom{n}{2} }

where n_{ij} is the number of vertices in both cluster x_i and cluster y_j, and n_{i.} and n_{.j} are the numbers of vertices in clusters x_i and y_j respectively. The ARI has an expected value of 0 for a random clustering and equals 1 for perfect agreement between the two clusterings.
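
In practice the index can be computed directly from these contingency counts or with an off-the-shelf routine; a minimal sketch using scikit-learn (our own choice of library, not one mentioned in the paper):

    # Sketch: adjusted Rand index between true classes and a found clustering.
    from sklearn.metrics import adjusted_rand_score

    true_labels  = [0, 0, 0, 1, 1, 1]   # ground-truth class per vertex (toy example)
    found_labels = [0, 0, 1, 1, 1, 1]   # cluster label assigned by some algorithm
    print(adjusted_rand_score(true_labels, found_labels))   # about 0.32 for this toy labeling
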
The Zachary karate-club dataset was compiled by Zachary [7] during a two-year observation period and has been widely studied in the literature [1], [4]. It is a social friendship network between members of a karate club at a US university. The club later split into two groups due to a conflict in club management. By analyzing the relationships between members of the club, we try to assign club members to the two distinct groups that only became observable after the actual split occurred. The result of the DHSCAN algorithm is shown on the graph in Figure 2. Shapes (round and rectangle) denote the true classes of the club members and colors denote the clusters found by DHSCAN. The result is exactly the same as the one obtained by Newman [4], with only one misclassified node, namely node 10.

Figure 2 Zachary karate data
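
As a usage illustration (our own, combining the earlier sketches with the karate-club graph that ships with networkx; this is not the authors' implementation and is not guaranteed to reproduce the exact clustering reported here):

    # Usage sketch: run the DHSCAN sketch from section 3 on Zachary's karate-club network.
    import networkx as nx

    G = nx.karate_club_graph()                 # the 34-member karate-club friendship network
    clusters, Qs = dhscan(G)                   # dhscan, sigma, similarity_modularity from the earlier sketches
    print(f"Q_s = {Qs:.2f}, {len(clusters)} clusters found")
    for c in clusters:
        print(sorted(c))
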
DHSCAN is a divisive hierarchical clustering algorithm and produces a hierarchical clustering of nodes represented by a tree structure called a dendrogram. The dendrogram for the Zachary karate
dataset is shown as an example in Figure 3. The optimal clustering is obtained by cutting the dendrogram at the level of maximal modularity Qs as defined in (2.2). For this particular example, a Qs value of 0.43 divides the entire dataset into two distinct groups, with only one node (node 10) misclassified and a corresponding ARI value of 0.88. The results presented for all three datasets are obtained in the same manner, by cutting the tree where Qs is maximized.

Figure 3 Dendrogram of Zachary karate data

The second example is the classification of books about US politics. We use the dataset of books about US politics compiled by Valdis Krebs [8]. There is a link between two books if they are co-purchased frequently enough by the same buyers. The vertices have been given values "l", "n", or "c" to indicate whether they are "liberal", "neutral", or "conservative". These alignments were assigned separately by Newman [10]. The true classes corresponding to liberal, neutral and conservative are denoted by round, diamond and rectangle shapes respectively in Figure 4.

Figure 4 Political book data

DHSCAN clustering results are represented by different colors: liberals are in blue, neutrals in gray and conservatives in red. The large overlap between shapes and colors indicates a good clustering yielded by DHSCAN. The achieved adjusted Rand index value of 0.64 is the same for both DHSCAN and the modularity-based clustering algorithm by Clauset et al [4].

The last dataset used in our experiment is a social network: detecting the communities (or conferences) of American college football teams [1]. The graph represents the schedule of Division I-A games for the 2000 season. The National Collegiate Athletic Association (NCAA) divides 115 college football teams into 12 conferences. In addition, there are 5 independent teams (Utah State, Navy, Notre Dame, Connecticut and Central Florida) that do not belong to any conference. The question is how to find the conferences from a graph that represents the schedule of games played by all teams. We presume that, because teams in the same conference are more likely to play each other, the conference system can be recovered as structure despite the significant amount of inter-conference play.

Figure 5 illustrates the college football network, where each vertex represents a team and an edge connects two teams if they played a game. Each conference is represented using a color and an integer as the conference ID.
Figure 5 College football data

The clustering results of the DHSCAN algorithm are presented in Figure 6, which demonstrates a good match with the original conference system we are seeking. Most of the errors are in the lower-right part, where two separate conferences are merged together, which we believe is caused by confusing edge structure. However, the achieved ARI value of 0.79 is still significantly larger than that of the modularity-based algorithm by Clauset et al [4], which is 0.50.

Figure 6 Result of DHSCAN on college football data

We measure the accuracy of the clustering using the adjusted Rand index (ARI). The ARI of DHSCAN and the modularity-based algorithm for the three datasets is presented in Table 1.

Table 1 Adjusted Rand index comparison

                      DHSCAN    Modularity-based
  Zachary karate       0.88          0.88
  Political books      0.64          0.64
  College football     0.79          0.50

The comparison above demonstrates an improved accuracy of DHSCAN over the modularity-based algorithm on the college football dataset. Both DHSCAN and the modularity-based method achieve equivalent results on the Zachary karate and political book datasets.

To demonstrate the efficiency of finding the optimal clustering of networks, we present plots of the Qs and ARI values for each iteration of the algorithm on all three datasets in Figure 7, Figure 8 and Figure 9 respectively. The plots demonstrate that Qs and ARI are positively correlated; thus a high Qs value indicates a high ARI value, which in turn means a high-quality clustering result. In our experiments, the DHSCAN algorithm is halted as soon as the Qs values start to decline, because the maximal Qs value also yields the highest ARI value. Note that DHSCAN achieves an optimal clustering quickly, after the removal of only 9 out of 73, 61 out of 421 and 197 out of 601 edges respectively, indicating good efficiency and speed of clustering.

5. Conclusion

In this paper we proposed a simple divisive hierarchical clustering algorithm, DHSCAN, that iteratively removes links in ascending order of a structural similarity measure. The network is divided into clusters by the removal of links. The iterative divisive procedure produces a dendrogram showing the hierarchical structure of the clusters. Additionally, the divisive procedure stops at the maximum of the similarity-based modularity (a slightly modified version of Newman's modularity). Therefore, our algorithm has two main advantages: (1) it can find the hierarchical structure of clusters; (2) it does not require any parameters.

As future work we will compare DHSCAN with more algorithms for clustering very large networks. Additionally, we will apply our algorithm to analyze very large biological networks such as metabolic and protein interaction networks.

Figure 7 Qs-ARI behavior for Zachary karate data
Figure 8 Qs-ARI behavior for political book data

Figure 9 Qs-ARI behavior for college football data

6. References

[1] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks", Physical Review E 69, 026113 (2004).
[2] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A min-max cut algorithm for graph partitioning and data clustering", Proc. of 2001 IEEE International Conference on Data Mining, San Jose, CA, November 29 - December 2, 2001.
[3] J. Shi and J. Malik, "Normalized cuts and image segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, 2000.
[4] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks", Physical Review E 70, 066111 (2004).
[5] Z. Feng, X. Xu, N. Yuruk and T. A. J. Schweiger, "A Novel Similarity-based Modularity Function for Graph Partitioning", to be published in Proc. of 9th International Conference on Data Warehousing and Knowledge Discovery, Regensburg, Germany, September 3-7, 2007.
[6] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: a Structural Clustering Algorithm for Networks", to be published in Proc. of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, August 12-15, 2007.
[7] W. W. Zachary, "An information flow model for conflict and fission in small groups", Journal of Anthropological Research 33, 452-473 (1977).
[8] http://www.orgnet.com/.
[9] L. Hubert and P. Arabie, "Comparing Partitions", Journal of Classification 2, 193-218, 1985.
[10] http://www-personal.umich.edu/~mejn/netdata/.