A Divisive Hierarchical Structural Clustering Algorithm for Networks

Nurcan Yuruk, Mutlu Mete, Xiaowei Xu
University of Arkansas at Little Rock
{nxyuruk, mxmete, xwxu}@ualr.edu

Thomas A. J. Schweiger
Acxiom Corporation
Tom.Schweiger@acxiom.com

Abstract

Many systems in science, engineering and nature can be modeled as networks. Examples are the Internet, metabolic networks and social networks. Network clustering algorithms, which aim to find hidden structures in networks, are important for making sense of complex networked data. In this paper we present a new clustering method for networks. The proposed algorithm can find the hierarchical structure of clusters without requiring any input parameters. Experiments on three real networks with known cluster structure show that the method matches or exceeds the accuracy of a modularity-based algorithm.

1. Introduction

Many systems in science, engineering and nature can be modeled as networks consisting of nodes and links that represent real entities and the relationships between them. Examples are social networks, biological networks and the Internet. Network clustering aims to find clusters in the network, which is an important task for finding hidden structures in messy, otherwise hard to comprehend networks. A cluster can be a community such as a clique of terrorists in a social network, or a group of molecules sharing similar functions in a biological network.

In this paper we present DHSCAN, a Divisive Hierarchical Structural Clustering Algorithm for Networks, which iteratively removes links in ascending order of a structural similarity measure. The network is divided into disconnected components by the removal of links. The iterative divisive procedure produces a dendrogram showing the hierarchical structure of the clusters. Additionally, the divisive procedure stops at the maximum of a similarity-based modularity, which is a slightly modified version of Newman's modularity [1]. Therefore, our algorithm has two main advantages: (1) it can find the hierarchical structure of clusters; (2) it does not require any parameters.

This paper is organized as follows. After a brief review of related work in section 2, the proposed new algorithm is described in section 3. We evaluate the proposed algorithm using real networks whose expected cluster structures are known to us. The experiment results are presented in section 4. We conclude the paper with some future work in section 5.

2. Related work

The goal of network clustering is to partition the network into clusters. Because of its many applications, the network clustering problem has been studied in many science and engineering disciplines for many years. In this section we focus on recent and commonly used algorithms.

The min-max cut method [2] seeks to partition a graph G = {V, E} into two clusters A and B. The principle of min-max clustering is to minimize the number of connections between A and B and to maximize the number of connections within each. A cut is defined as the number of edges that would have to be removed to isolate the vertices in cluster A from those in cluster B. The min-max cut algorithm searches for the clustering that creates two clusters whose cut is minimized while maximizing the number of remaining edges.

A pitfall of this method is that, if one cuts out a single vertex from the graph, one will probably achieve the optimum. Therefore, in practice, the optimization must be accompanied by some constraint, such as requiring A and B to be of equal or similar size, |A| ≈ |B|. Such constraints are not always appropriate; for example, in social networks some communities are much larger than others.

To amend the issue, a normalized cut [3] was proposed, which normalizes the cut by the total number of connections between each cluster and the rest of the graph. Therefore, cutting out one vertex or some small part of the graph will no longer always yield an optimum.

Both the min-max cut and the normalized cut methods partition a graph into two clusters. To divide a graph into k clusters, one has to adopt a top-down approach, splitting the graph into two clusters, then further splitting these clusters, and so on, until k clusters have been detected. There is no guarantee of the optimality of recursive clustering. There is no measure of the number of clusters that should be produced when k is unknown, and there is no indicator of when to stop the bisection procedure.

Recently, modularity was proposed as a quality measure of network clustering [1]. For a clustering of a graph with k clusters, the modularity is defined as:

$$Q_n = \sum_{s=1}^{k} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right] \tag{2.1}$$

where L is the number of edges in the graph, l_s is the number of edges between vertices within cluster s, and d_s is the sum of the degrees of the vertices in cluster s. The modularity of a clustering of a graph is the fraction of all edges that lie within each cluster minus the fraction that would lie within each cluster if the graph's vertices were randomly connected. Optimal clustering is achieved when the modularity is maximized. Modularity is defined such that it is zero for two extreme cases: when all vertices are partitioned into a single cluster, and when the vertices are clustered at random. Note that the modularity measures the quality of any network clustering, whereas the normalized cut and min-max cut measure only the quality of a clustering into two clusters.

Finding the maximum Q_n is NP-complete. Instead of performing an exhaustive search, various optimization approaches have been proposed. For example, a greedy method based on a hierarchical agglomerative clustering algorithm is proposed in [4], which is faster than many competing algorithms: its running time on a graph with n vertices and m edges is O(md log n), where d is the depth of the dendrogram describing the hierarchical cluster structure.

Although the modularity-based algorithms can find good clusters, they fail to identify nodes playing special roles, such as hubs and outliers, in networks. Hubs, which connect many clusters, are responsible for spreading ideas or diseases in social networks. Outliers are nodes marginally connected to clusters. Recently we proposed a similarity-based modularity [5], defined as:

$$Q_s = \sum_{i=1}^{k} \left[ \frac{IS_i}{TS} - \left( \frac{DS_i}{TS} \right)^2 \right] \tag{2.2}$$

where k is the number of clusters, IS_i is the total similarity of vertices within cluster i, DS_i is the total similarity between vertices in cluster i and any vertices in the graph, and TS is the total similarity between any two vertices in the graph. The similarity of two vertices is given by a structural similarity measure, which is defined as:

$$\sigma(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)|\,|\Gamma(w)|}} \tag{2.3}$$

where Γ(v) is the set of direct neighbors of v (together with v itself; see Definition 1 in section 3). A genetic algorithm is developed in [5] to find the optimal clustering of networks by maximizing the similarity-based modularity. Although that algorithm can find both clusters and hubs in networks, it does not scale well to large networks.

Most recently, we proposed SCAN, a Structural Clustering Algorithm for Networks [6]. SCAN can efficiently find clusters, hubs and outliers in very large networks by visiting each node exactly once. However, it requires two parameters that may be difficult for users to determine.

3. The algorithm

In this section we present a Divisive Hierarchical Structural Clustering Algorithm for Networks (DHSCAN) that can find the hierarchical structure of clusters in networks without requiring any parameters. We focus on simple, undirected and unweighted graphs G = {V, E}, where V is a set of vertices and E is a set of unordered pairs of distinct vertices, called edges.

Our method is based on common neighbors. Two vertices are assigned to a cluster according to how they share neighbors. This makes sense when one considers social communities. People who share many friends form a community, and the more friends they have in common, the more intimate the community.

The structure of a vertex can be described by its neighborhood. A formal definition of vertex structure is given as follows.

DEFINITION 1 (VERTEX STRUCTURE)
Let v ∈ V. The structure of v is defined by its neighborhood, denoted by Γ(v):

$$\Gamma(v) = \{ w \in V \mid (v, w) \in E \} \cup \{ v \}$$

The structure similarity between vertices can be measured by normalized common neighbors, which is also called the cosine similarity measure and is commonly used in information retrieval.

DEFINITION 2 (STRUCTURE SIMILARITY)
Let v, w ∈ V. The structure similarity of v and w is defined by their common neighborhood normalized by the geometric mean of the neighborhood sizes, denoted by σ(v, w):

$$\sigma(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)|\,|\Gamma(w)|}}$$

Every edge can be represented by its two end vertices. Therefore, we can define the structure of an edge by the structure similarity of its two end vertices.

DEFINITION 3 (EDGE STRUCTURE)
Let v, w ∈ V and e = (v, w) ∈ E. The structure of e is defined by the structural similarity of v and w, denoted by κ(e) = σ(v, w).

Vertices in the same cluster have a higher structural similarity than vertices from different clusters. Therefore, intra-cluster edges, i.e. the edges connecting vertices of the same cluster, have a larger edge structure than inter-cluster edges, i.e. the edges connecting vertices of different clusters. Our clustering algorithm aims to cluster vertices by identifying both intra-cluster and inter-cluster edges. For this reason, we sort the edges in ascending order of edge structure and iteratively classify edges based on the edge structure. We use two sets for the classified edges: set B is for inter-cluster edges and set W is for intra-cluster edges. All edges are initialized as intra-cluster edges and stored in set W at the beginning of the algorithm. In each iteration, the edge with minimal edge structure is moved from W to B. If the clusters change as a result, the modularity Q defined in (2.2) is updated for the changed clusters. Additionally, the edge structures are updated for the edges that are directly connected to the moved edge. The procedure terminates when all edges have been moved from W to B. The result of our clustering algorithm is a hierarchy of clusters, which can be represented as a dendrogram. The optimal clustering is the one with maximal Q. The pseudo-code of our algorithm is presented in Figure 1.

    ALGORITHM DHSCAN(G = <V, E>)
      // all edges are initialized as intra-cluster edges;
      W := E; B := ∅; i := 0; Qi := 0;
      while W ≠ ∅ do {
        // move the edge with minimal structure;
        remove e := min_struct(W) from W;
        insert e into B;
        find all connected components in W;
        if (number of components increased) {
          i := i + 1;
          define each component in W as a cluster;
          plot level i of dendrogram;
          calculate Qi;
        }
      }
      // get the optimal clustering;
      cut the dendrogram at maximal Q value;
    end DHSCAN.

Figure 1 DHSCAN algorithm

4. Experiments

In this section, we evaluate the DHSCAN algorithm using three real datasets: the well-known Zachary karate dataset [7], the American college football network [1] and Books about US politics [8]. The performance of DHSCAN is compared with the modularity-based algorithm of [4].

We use the adjusted Rand index (ARI) [9] as our measure of agreement between the clustering found by a particular algorithm and the true clustering of the network. It is defined as:

$$\mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[ \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2} \right] - \left[ \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} \right] \Big/ \binom{n}{2}}$$

where n_{ij} is the number of vertices in both cluster x_i and cluster y_j, n_{i·} and n_{·j} are the numbers of vertices in cluster x_i and cluster y_j respectively, and n is the total number of vertices. ARI is at most 1: ARI = 1 means perfect agreement, while values near 0 mean agreement no better than chance (the index can become slightly negative for clusterings that agree less than chance).

The Zachary karate-club dataset was compiled by Zachary [7] over a two-year observation period and has been widely studied in the literature [1], [4]. It is a social friendship network between members of a karate club at a US university. The club later split into two groups because of a conflict in club management. By analyzing the relationships between members of the club, we try to assign club members to the two distinct groups, which were only observable after the actual split occurred. The result of the DHSCAN algorithm is shown on the graph in Figure 2. Shapes (round and rectangle) denote the true classes of the club members and colors denote the clusters found by DHSCAN. The result is exactly the same as the one obtained by the modularity-based algorithm [4], with only one misclassified node, namely node 10.

Figure 2 Zachary karate data

DHSCAN is a divisive hierarchical clustering algorithm and produces a hierarchical clustering of nodes represented by a tree structure called a dendrogram. The dendrogram for the Zachary karate dataset is shown as an example in Figure 3. The optimal clustering is obtained by cutting the dendrogram at the level of maximal modularity Q_s defined in (2.2). For this particular example, the maximal Q_s value of 0.43 divides the entire dataset into two distinct groups with only one node (node 10) misclassified, corresponding to an ARI value of 0.88. The results presented for all three datasets are obtained in the same manner, by cutting the tree where Q_s is maximized.

Figure 3 Dendrogram of Zachary karate data

The second example is the classification of books about US politics. We use the dataset of Books about US politics compiled by Valdis Krebs [8]. There is a link between two books if they are co-purchased frequently enough by the same buyers. The vertices have been given the values "l", "n" or "c" to indicate whether they are "liberal", "neutral" or "conservative". These alignments were assigned separately by Newman [10]. The true classes liberal, neutral and conservative are denoted by round, diamond and rectangle shapes respectively in Figure 4.

Figure 4 Political book data

DHSCAN clustering results are represented by different colors: liberals are in blue, neutrals in gray and conservatives in red. The large overlap between shapes and colors indicates a good clustering by DHSCAN. The achieved adjusted Rand index value of 0.64 is the same for both DHSCAN and the modularity-based clustering algorithm by Clauset et al. [4].

The last dataset used in our experiments is a social network in which the task is to detect the communities (or conferences) of American college football teams [1]. The graph represents the schedule of Division I-A games for the 2000 season. The National Collegiate Athletic Association (NCAA) divides 115 college football teams into 12 conferences. In addition, there are 5 independent teams (Utah State, Navy, Notre Dame, Connecticut and Central Florida) that do not belong to any conference. The question is how to find the conferences from a graph that represents the schedule of games played by all teams. We presume that, because teams in the same conference are more likely to play each other, the conference system can be mapped as a structure despite the significant amount of inter-conference play.

Figure 5 illustrates the college football network, where each vertex represents a team and an edge connects two teams if they played in a game. Each conference is represented using a color and an integer as the conference ID.
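The evaluation protocol used in this section (run the divisive procedure, cut at maximal Q_s, then score the result against the known classes with ARI) can be sketched on a toy graph. The following is only a minimal illustration under stated assumptions, not the authors' implementation: the names `sigma`, `q_s`, `dhscan` and `adjusted_rand_index` are hypothetical, the sums in (2.2) are assumed to run over ordered pairs of distinct vertices with similarities measured on the original graph, edge structures are simply recomputed on the remaining edge set W at every step, and ties are broken by vertex id.

```python
from collections import Counter
from itertools import combinations
from math import comb, sqrt


def sigma(adj, v, w):
    """Structure similarity of eq. (2.3); Gamma(v) includes v itself (Definition 1)."""
    gv, gw = adj[v] | {v}, adj[w] | {w}
    return len(gv & gw) / sqrt(len(gv) * len(gw))


def components(adj):
    """Connected components of a graph stored as {vertex: set of neighbors}."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def q_s(clusters, sim):
    """Similarity-based modularity of eq. (2.2).
    Assumption: IS, DS and TS are sums over ordered pairs of distinct vertices."""
    ts = 2 * sum(sim.values())
    total = 0.0
    for c in clusters:
        inside = 2 * sum(s for pair, s in sim.items() if pair <= c)
        touching = sum(s * len(pair & c) for pair, s in sim.items())
        total += inside / ts - (touching / ts) ** 2
    return total


def dhscan(vertices, edges):
    """Divisive loop of Figure 1; returns (clusters at maximal Q_s, that Q_s)."""
    full = {v: set() for v in vertices}
    for u, v in edges:
        full[u].add(v)
        full[v].add(u)
    # Assumption: Q_s uses similarities measured on the original graph.
    sim = {frozenset(p): sigma(full, *p) for p in combinations(vertices, 2)}
    w = {v: set(nbrs) for v, nbrs in full.items()}  # intra-cluster edge set W
    remaining = {frozenset(e) for e in edges}
    best = components(w)
    best_q, n_comp = 0.0, len(best)
    while remaining:
        # Recomputing sigma on W refreshes the edges adjacent to earlier removals.
        e = min(remaining, key=lambda e: (sigma(w, *e), sorted(e)))
        remaining.remove(e)
        u, v = e
        w[u].discard(v)
        w[v].discard(u)
        comps = components(w)
        if len(comps) > n_comp:  # a new dendrogram level
            n_comp = len(comps)
            q = q_s(comps, sim)
            if q > best_q:
                best_q, best = q, comps
    return best, best_q


def adjusted_rand_index(a, b):
    """ARI of two label sequences (at least two items), per the formula above."""
    n2 = comb(len(a), 2)
    s_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    s_a = sum(comb(c, 2) for c in Counter(a).values())
    s_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = s_a * s_b / n2
    denom = (s_a + s_b) / 2 - expected
    return 1.0 if denom == 0 else (s_ij - expected) / denom


if __name__ == "__main__":
    # Toy graph: two triangles joined by a single bridge edge (2, 3).
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    clusters, q = dhscan(list(range(6)), edges)
    found = [next(i for i, c in enumerate(clusters) if v in c) for v in range(6)]
    print(clusters, round(q, 3), adjusted_rand_index([0, 0, 0, 1, 1, 1], found))
```

On this toy graph the bridge has the smallest edge structure (0.5), so it is removed first and the two triangles form the level with the highest Q_s. Recomputing connected components and all similarities at every step keeps the sketch short but is far from efficient; it is meant for reading, not for large networks.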
Figure 5 College football data

The clustering result of the DHSCAN algorithm is presented in Figure 6, which demonstrates a good match with the original conference system we are seeking. Most errors are in the lower-right part, where two separate conferences are merged together, which we believe is caused by confused edge structure. However, the achieved ARI value of 0.79 is still significantly larger than that of the modularity-based algorithm by Clauset et al. [4], which is 0.50.

Figure 6 Result of DHSCAN on college football data

We measure the accuracy of the clustering using the adjusted Rand index (ARI). The ARI values of DHSCAN and the modularity-based algorithm for the three datasets are presented in Table 1.

Table 1 Adjusted Rand index comparison

                      DHSCAN    Modularity-based
    Zachary-karate    0.88      0.88
    Political books   0.64      0.64
    College football  0.79      0.50

The comparison above demonstrates an improved accuracy of DHSCAN over the modularity-based algorithm on the college football dataset. DHSCAN and the modularity-based method achieve equivalent results on the Zachary karate and political book datasets.

To demonstrate the efficiency of finding the optimal clustering of networks, we plot the Q_s and ARI values for each iteration of the algorithm on the three datasets in Figure 7, Figure 8 and Figure 9 respectively. The plots show that Q_s and ARI are positively correlated, so a high Q_s value indicates a high ARI value and therefore a high-quality clustering result. In our experiments, the DHSCAN algorithm is halted when the Q_s values start to decline, because the maximal Q_s value yields the highest ARI value as well. Note that DHSCAN reaches the optimal clustering quickly, after removing only 9 out of 73, 61 out of 421 and 197 out of 601 edges on the Zachary karate, political book and college football datasets respectively, which indicates good efficiency and speed of clustering.

Figure 7 Q_s and ARI behavior for Zachary karate data (x-axis: iteration of removals; curves: Q_s and ARI)

Figure 8 Q_s and ARI behavior for political book data (x-axis: iteration of removals; curves: Q_s and ARI)

Figure 9 Q_s and ARI behavior for college football data (x-axis: iteration of removals; curves: Q_s and ARI)

5. Conclusion

In this paper we proposed a simple divisive hierarchical clustering algorithm, DHSCAN, that iteratively removes links in ascending order of a structural similarity measure. The network is divided into clusters by the removal of links. The iterative divisive procedure produces a dendrogram showing the hierarchical structure of the clusters. Additionally, the divisive procedure stops at the maximum of the similarity-based modularity, a slightly modified version of Newman's modularity. Therefore, our algorithm has two main advantages: (1) it can find the hierarchical structure of clusters; (2) it does not require any parameters.

As future work we will compare DHSCAN with more algorithms for clustering very large networks. Additionally, we will apply our algorithm to analyze very large biological networks such as metabolic and protein interaction networks.

6. References

[1] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks", Physical Review E 69, 026113 (2004).
[2] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A min-max cut algorithm for graph partitioning and data clustering", Proc. of 2001 IEEE International Conference on Data Mining, San Jose, CA, November 29 – December 2, 2001.
[3] J. Shi and J. Malik, "Normalized cuts and image segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, 2000.
[4] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks", Physical Review E 70, 066111 (2004).
[5] Z. Feng, X. Xu, N. Yuruk, and T. A. J. Schweiger, "A Novel Similarity-based Modularity Function for Graph Partitioning", to be published in Proc. of 9th International Conference on Data Warehousing and Knowledge Discovery, Regensburg, Germany, September 3-7, 2007.
[6] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: a Structural Clustering Algorithm for Networks", to be published in Proc. of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, August 12-15, 2007.
[7] W. W. Zachary, "An information flow model for conflict and fission in small groups", Journal of Anthropological Research 33, 452–473 (1977).
[8] http://www.orgnet.com/.
[9] L. Hubert and P. Arabie, "Comparing partitions", Journal of Classification, 193–218, 1985.
[10] http://www-personal.umich.edu/~mejn/netdata/.