(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010

Constraint-free Optimal Meta Similarity Clusters Using Dynamic Minimum Spanning Tree

S. John Peter, Assistant Professor, Department of Computer Science and Research Center, St. Xavier's College, Palayamkottai, Tamil Nadu, India. jaypeeyes@rediffmail.com
S. P. Victor, Associate Professor, Department of Computer Science and Research Center, St. Xavier's College, Palayamkottai, Tamil Nadu, India. victorsp@rediffmail.com

ABSTRACT

Clustering is the process of discovering groups of objects such that objects of the same group are similar and objects belonging to different groups are dissimilar. A number of clustering algorithms exist that can solve the problem of clustering, but most of them are very sensitive to their input parameters. Therefore it is very important to evaluate their results. The minimum spanning tree clustering algorithm is capable of detecting clusters with irregular boundaries. In this paper we propose a constraint-free minimum spanning tree based clustering algorithm. The algorithm constructs a hierarchy from top to bottom. At each hierarchical level, it optimizes the number of clusters, from which the proper hierarchical structure of the underlying dataset can be found. The algorithm uses a new cluster validation criterion based on the geometric property of the data partition in order to find the proper number of clusters at each level. The algorithm works in two phases. The first phase of the algorithm creates clusters with guaranteed intra-cluster similarity, whereas the second phase creates a dendrogram using these clusters as objects, with guaranteed inter-cluster similarity. The first phase uses a divisive approach, whereas the second phase uses an agglomerative approach. In this paper we use both approaches in the algorithm to find Optimal Meta similarity clusters.

Keywords: Euclidean minimum spanning tree, Subtree, Clustering, Eccentricity, Center, Hierarchical clustering, Dendrogram, Cluster validity, Cluster Separation

I. INTRODUCTION

The problem of determining the correct number of clusters in a data set is perhaps the most difficult and ambiguous part of cluster analysis. The "true" number of clusters depends on the "level" at which one views the data. Another problem is due to methods that may yield a "correct" number of clusters for a "bad" classification [10]. Furthermore, it has been emphasized that mechanical methods for determining the optimal number of clusters should not ignore the fact that the overall clustering process has an unsupervised nature and that its fundamental objective is to uncover the unknown structure of a data set, not to impose one. For these reasons, one should be well aware of the explicit and implicit assumptions underlying the actual clustering procedure before the number of clusters can be reliably estimated; otherwise the initial objective of the process may be lost. As a solution, Hardy [10] recommends that the determination of the optimal number of clusters should be made by using several different clustering methods that together produce more information about the data. By forcing a structure onto a data set, important and surprising facts about the data will likely remain undiscovered. In some applications the number of clusters is not a problem, because it is predetermined by the context [11].
Then the goal is to obtain a clustering, a mechanical partition of the particular data using a fixed number of clusters. Such a process is not intended for inspecting new and unexpected facts arising from the data. Hence, splitting up a homogeneous data set in a "fair" way is a much more straightforward problem than the analysis of hidden structures in a heterogeneous data set. The clustering algorithms of [15, 21] partition the data set into k clusters without knowing the homogeneity of the groups. Hence the principal goal of these clustering problems is not to uncover novel or interesting facts about the data.

Numerical methods can usually provide only guidance about the true number of clusters, and the final decision is often an ad hoc one based on prior assumptions and domain knowledge. Therefore, the choice between different numbers of clusters is often made by comparing several alternatives, and the final decision is a subjective problem that can be solved in practice only by humans. Nevertheless, a number of methods for the objective assessment of cluster validity have been developed and proposed. Because the recognition of cluster structures is difficult, especially in high-dimensional spaces, various visualization techniques can also be of valuable help to the cluster analyst.

The hierarchical clustering approaches are related to graph-theoretic clustering. Clustering algorithms using a minimal spanning tree take advantage of the MST. The MST ignores many possible connections between the data patterns, so the cost of clustering can be decreased. The MST-based clustering algorithm is known to be capable of detecting clusters of various shapes and sizes [34]. Unlike traditional clustering algorithms, the MST clustering algorithm does not assume a spherical structure of the underlying data. The EMST clustering algorithm [23, 34] uses the Euclidean minimum spanning tree of a graph to produce the structure of point clusters in the n-dimensional Euclidean space. Clusters are detected so as to achieve some measure of optimality, such as minimum intra-cluster distance or maximum inter-cluster distance [2]. The EMST algorithm has been widely used in practice.

Clustering by minimal spanning tree can be viewed as a hierarchical clustering algorithm which follows a divisive approach. Using this method, an MST is first constructed for the given input. There are different methods to produce a group of clusters. If the number of clusters k is given in advance, the simplest way to obtain k clusters is to sort the edges of the minimum spanning tree in descending order of their weights and remove the k-1 heaviest edges [2, 33].
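This simple k-clustering scheme is easy to illustrate. The following Python sketch is ours rather than part of any cited algorithm; it assumes numpy and scipy are available, builds the EMST of a point set, and cuts the k-1 heaviest edges so that the surviving subtrees become the k clusters.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def emst_k_clusters(points, k):
    # Pairwise Euclidean distances define the complete weighted graph.
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist).tocoo()    # the n-1 EMST edges
    order = np.argsort(mst.data)                 # lightest edge first
    keep = order[: len(mst.data) - (k - 1)]      # drop the k-1 heaviest edges
    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)
    # Each connected component of the pruned forest is one cluster.
    n_found, labels = connected_components(pruned, directed=False)
    return labels

labels = emst_k_clusters(np.random.rand(50, 2), k=3)   # 50 random 2-D points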
All existing clustering algorithms require a number of parameters as their inputs, and these parameters can significantly affect the cluster quality. Our algorithm does not require a predefined cluster number. In this paper we want to avoid experimental methods and advocate the idea of need-specific as opposed to care-specific, because users always know the needs of their applications. We believe it is a good idea to allow users to define their desired similarity within a cluster and to allow them some flexibility to adjust the similarity if an adjustment is needed. Our algorithm produces clusters of n-dimensional points with a naturally approximate intra-cluster distance.

Given a connected, undirected graph G = (V, E), where V is the set of nodes, E is the set of edges between pairs of nodes, and a weight w(u, v) specifies the weight of the edge (u, v) for each edge (u, v) ∈ E, a spanning tree is an acyclic subgraph of G which contains all vertices of G. The Minimum Spanning Tree (MST) of a weighted graph is the minimum-weight spanning tree of that graph. Several well-established algorithms exist to solve the minimum spanning tree problem [24, 19, 20]. The cost of constructing a minimum spanning tree is O(m log n), where m is the number of edges in the graph and n is the number of vertices. More efficient algorithms for constructing MSTs have also been extensively researched [18, 5, 13]; these algorithms promise close to linear time complexity under different assumptions. A Euclidean minimum spanning tree (EMST) is a spanning tree of a set of n points in a metric space (E^n), where the length of an edge is the Euclidean distance between a pair of points in the point set.

Geometric notions of centrality are closely linked to the facility location problem. The distance matrix D can be computed rather efficiently using Dijkstra's algorithm with time complexity O(|V|^2 ln |V|) [29]. The eccentricity of a vertex x in G and the radius ρ(G) are defined, respectively, as

e(x) = max_{y ∈ V} d(x, y) and ρ(G) = min_{x ∈ V} e(x)

The center of G is the set

C(G) = {x ∈ V | e(x) = ρ(G)}

C(G) is the center for the "emergency facility location problem", and it is always contained in a single block of G. The length of the longest path in the graph is called the diameter of the graph G. We can define the diameter D(G) as

D(G) = max_{x ∈ V} e(x)

The diameter set of G is

Dia(G) = {x ∈ V | e(x) = D(G)}
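Because these quantities are defined on a tree, they can be computed directly from the definitions. The following minimal Python sketch (ours, for illustration only) takes a weighted tree as a list of (u, v, weight) triples and returns e(x), ρ(G), C(G), D(G) and Dia(G).

from collections import defaultdict

def tree_distances(adj, src):
    # In a tree a single traversal yields the unique-path distances from src.
    dist, stack = {src: 0.0}, [src]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

def center_and_diameter(edges):
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    ecc = {x: max(tree_distances(adj, x).values()) for x in adj}  # e(x)
    rho = min(ecc.values())                                       # rho(G)
    diam = max(ecc.values())                                      # D(G)
    center = [x for x, e in ecc.items() if e == rho]              # C(G)
    dia_set = [x for x, e in ecc.items() if e == diam]            # Dia(G)
    return ecc, rho, center, diam, dia_set

# A small path-shaped tree: 1 --2.0-- 2 --1.0-- 3 --4.0-- 4
print(center_and_diameter([(1, 2, 2.0), (2, 3, 1.0), (3, 4, 4.0)]))
# -> center is [3] with radius 4.0; diameter 7.0 realized by vertices 1 and 4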
An important objective of hierarchical cluster analysis is to provide a picture of the data that can easily be interpreted. A picture of a hierarchical clustering is much easier for a human being to comprehend than a list of abstract symbols. A dendrogram is a special type of tree structure that provides a convenient way to represent a hierarchical clustering; it consists of layers of nodes, each representing a cluster.

Hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. An agglomerative algorithm for hierarchical clustering starts with a disjoint clustering, which places each of the n objects in an individual cluster [1]. The hierarchical clustering algorithm being employed dictates how the proximity matrix or proximity graph should be interpreted to merge two or more of these trivial clusters, thus nesting the trivial clusters into a second partition. The process is repeated to form a sequence of nested clusterings in which the number of clusters decreases as the sequence progresses, until a single cluster containing all n objects, called the conjoint clustering, remains [1].

Nearly all hierarchical clustering techniques that include a tree structure have two shortcomings: (1) they do not properly represent hierarchical relationships, and (2) once data are assigned improperly to a given cluster they cannot later be reevaluated and placed in another cluster.

In this paper, we propose a new clustering algorithm, the Dynamically Growing Minimum Spanning Tree (DGMST), which can overcome these shortcomings. The algorithm optimizes the number of clusters at each hierarchical level with a cluster validation criterion during the minimum spanning tree construction process. The hierarchy constructed by the algorithm can then properly represent the hierarchical structure of the underlying dataset, which improves the accuracy of the final clustering result.

Our DGMST clustering algorithm addresses the issues of undesired clustering structure and an unnecessarily large number of clusters. Our algorithm does not require a predefined cluster number. The algorithm constructs an EMST of a point set and removes the inconsistent edges that satisfy the inconsistency measure. The process is repeated to create a hierarchy of clusters until the optimal number of clusters (regions) is obtained. Hence the title! In Section 2 we review some of the existing work on cluster validity and graph-based clustering algorithms. In Section 3 we propose the DGMST algorithm, which produces an optimal number of clusters together with a dendrogram over the cluster of clusters; hence we name these new clusters Optimal Meta similarity clusters. Finally, in the conclusion we summarize the strengths of our methods and possible improvements.

II. RELATED WORK

Determining the true number of clusters, also known as the cluster validation problem, is a fundamental problem in cluster analysis. Many approaches to this problem have been proposed [25, 32, 10]. Two kinds of indexes have been used to validate clusterings [6, 7]: one based on relative criteria and the other based on external and internal criteria. The first approach is to choose the best result from a set of clustering results according to a prespecified criterion. Although the computational cost of this approach is light, human intervention is required to find the best number of clusters. The DGMST algorithm tries to find the proper number of clusters automatically, which makes the first approach unsuitable for clustering validation in the DGMST algorithm.

The second approach is based on statistical tests and involves computations of both inter-cluster and intra-cluster quality to determine the proper number of clusters. The evaluation of the criteria can be completed automatically; however, the computational cost of this type of cluster validation is very high, so this kind of approach is also not suitable for the DGMST algorithm when it is used to cluster a large dataset. A successful and practical cluster validation criterion for the DGMST algorithm on large datasets must have modest computational cost and must be easy to evaluate automatically.
As noted above, clustering by minimal spanning tree can be viewed as a hierarchical clustering algorithm which follows the divisive approach. Clustering algorithms based on minimum and maximum spanning trees have been extensively studied. Avis [3] found an O(n^2 log^2 n) algorithm for the min-max diameter-2 clustering problem. Asano, Bhattacharya, Keil and Yao [2] later gave an optimal O(n log n) algorithm using maximum spanning trees for minimizing the maximum diameter of a bipartition. The problem becomes NP-complete when the number of partitions is beyond two [17]. Asano, Bhattacharya, Keil and Yao also considered the clustering problem in which the goal is to maximize the minimum inter-cluster distance; they gave a k-partition of a point set by removing the k-1 longest edges from the minimum spanning tree constructed from that point set [2]. The identification of inconsistent edges causes problems in the MST clustering algorithm. There exist numerous ways to divide clusters successively, but there is no suitable choice for all cases.

Zahn [34] proposes to construct an MST of the point set and delete the inconsistent edges, that is, the edges whose weights are significantly larger than the average weight of the nearby edges in the tree. Zahn's inconsistency measure is defined as follows. Let e denote an edge in the MST of the point set, v1 and v2 be the end nodes of e, and w be the weight of e. A depth-d neighborhood N of an end node v of an edge e is defined as the set of all edges that belong to all paths of length d originating from v, excluding the paths that include the edge e. Let N1 and N2 be the depth-d neighborhoods of the nodes v1 and v2. Let Ŵ_N1 be the average weight of edges in N1 and σ_N1 be its standard deviation. Similarly, let Ŵ_N2 be the average weight of edges in N2 and σ_N2 be its standard deviation. The inconsistency measure requires that one of the following three conditions hold:

1. w > Ŵ_N1 + c × σ_N1 or w > Ŵ_N2 + c × σ_N2

2. w > max(Ŵ_N1 + c × σ_N1, Ŵ_N2 + c × σ_N2)

3. w / max(c × σ_N1, c × σ_N2) > f

where c and f are preset constants. All the edges of the tree that satisfy the inconsistency measure are considered inconsistent and are removed from the tree. This results in a set of disjoint subtrees, each representing a separate cluster.
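To make the measure concrete, the following Python sketch (our illustration, not Zahn's code) tests a single MST edge (u, v) of weight w against the three conditions. The depth d and the constants c and f are the preset parameters, and our reading of condition 3, w divided by max(c × σ_N1, c × σ_N2) and compared against f, is an assumption.

from collections import defaultdict
import statistics

def neighborhood_weights(adj, start, banned, depth):
    # Weights of edges on paths of at most `depth` hops from `start`,
    # excluding paths that would cross the banned edge {u, v}.
    weights, frontier, seen = [], [start], {start}
    for _ in range(depth):
        nxt = []
        for a in frontier:
            for b, w in adj[a]:
                if {a, b} == banned or b in seen:
                    continue
                weights.append(w)
                seen.add(b)
                nxt.append(b)
        frontier = nxt
    return weights

def is_inconsistent(adj, u, v, w, c=2.0, f=2.0, depth=2):
    n1 = neighborhood_weights(adj, u, {u, v}, depth)
    n2 = neighborhood_weights(adj, v, {u, v}, depth)
    if not n1 or not n2:
        return False                     # leaf edge: no neighborhood to test
    m1, s1 = statistics.fmean(n1), statistics.pstdev(n1)
    m2, s2 = statistics.fmean(n2), statistics.pstdev(n2)
    cond1 = w > m1 + c * s1 or w > m2 + c * s2       # condition 1
    cond2 = w > max(m1 + c * s1, m2 + c * s2)        # condition 2
    denom = max(c * s1, c * s2)
    cond3 = denom > 0 and w / denom > f              # condition 3 (as read above)
    return cond1 or cond2 or cond3

# adj maps each MST node to its (neighbor, weight) list:
adj = defaultdict(list)
for a, b, w in [(1, 2, 1.0), (2, 3, 1.1), (3, 4, 6.0), (4, 5, 1.0), (5, 6, 0.9)]:
    adj[a].append((b, w))
    adj[b].append((a, w))
print(is_inconsistent(adj, 3, 4, 6.0))   # the long bridge edge -> True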
Paivinen [22] proposed a Scale Free Minimum Spanning Tree (SFMST) clustering algorithm, which constructs scale-free networks and outputs clusters containing highly connected vertices and those connected to them.

The MST clustering algorithm has been widely used in practice. Xu (Ying), Olman and Xu (Dong) [33] use an MST to cluster multidimensional gene expression data. They point out that an MST-based clustering algorithm does not assume that data points are grouped around centers or separated by a regular geometric curve; thus the shape of the cluster boundary has little impact on the performance of the algorithm. They described three objective functions and corresponding clustering algorithms for computing a k-partition of a spanning tree for a predefined k > 0. The first algorithm simply removes the k-1 longest edges so that the total weight of the subtrees is minimized. The second objective function is defined to minimize the total distance between the center and each data point in the cluster; the corresponding algorithm removes the first k-1 edges from the tree, which creates a k-partition.

The clustering algorithm proposed by S. C. Johnson [16] uses a proximity matrix as input data. The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones; it is simplified by assuming no ties in the proximity matrix. A graph-based algorithm using single-link and complete-link methods was proposed by Hubert [12], who used threshold graphs for the formation of hierarchical clusterings. An algorithm for single-link hierarchical clustering that begins with the minimum spanning tree for G(∞), a proximity graph containing n(n-1)/2 edges, was proposed by Gower and Ross [14]. Later, Hansen and Delattre [9] proposed another hierarchical algorithm based on graph coloring.

Many different methods for determining the number of clusters have been developed. Hierarchical clustering methods provide direct information about the number of clusters by clustering objects on a number of different hierarchical levels, which are then presented by a graphical tree structure known as a dendrogram. One may apply some external criteria to validate the solutions on different levels, or use the dendrogram visualization for determining the best cluster structure.

The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. In general terms, there are three approaches to investigating cluster validity [31]. The first is based on external criteria: we evaluate the results of a clustering algorithm against a pre-specified structure which is imposed on the data set and which reflects our intuition about the clustering structure of the data. The second approach is based on internal criteria: the clustering results are evaluated in terms of quantities that involve the vectors of the data set themselves (e.g., the proximity matrix). The third approach is based on relative criteria: the basic idea is the evaluation of a clustering structure by comparing it to other clustering schemes produced by the same algorithm but with different input parameter values.

The selection of the correct number of clusters is actually a kind of validation problem. A large number of clusters provides a more complex "model", whereas a small number may approximate the data too much. Hence, several methods and indices have been developed for the problem of cluster validation and selection of the number of clusters [27, 8, 26, 28, 30]. Many of them are based on the within-group and between-group distances.

III. OUR CLUSTERING ALGORITHM

A tree is a simple structure for representing binary relationships, and any connected component of a tree is called a subtree. Through the MST representation, we can convert a multi-dimensional clustering problem into a tree partitioning problem, i.e., finding a particular set of tree edges and then cutting them. Representing a set of multi-dimensional data points as a simple tree structure will clearly lose some of the inter-data relationships. However, many clustering algorithms have proved that no essential information is lost for the purpose of clustering; this is achieved through a rigorous proof that each cluster corresponds to one subtree, which does not overlap the representing subtree of any other cluster. The clustering problem is therefore equivalent to the problem of identifying these subtrees by solving a tree partitioning problem. The inherent cluster structure of a point set in a metric space is closely related to how objects or concepts are embedded in the point set. In practice, the approximate number of embedded objects can sometimes be acquired with the help of domain experts; other times this information is hidden and unavailable to the clustering algorithm. In this section we present a clustering algorithm which produces an optimal number of clusters.
A. DGMST Clustering Algorithm

Given a point set S in E^n, the hierarchical method starts by constructing a Minimum Spanning Tree (MST) from the points in S. The weight of each edge in the tree is the Euclidean distance between its two end points, so we name this MST EMST1. Next, the average weight Ŵ of the edges in the entire EMST1 and its standard deviation σ are computed; any edge with W > Ŵ + σ, or the current longest edge, is removed from the tree. This leads to a set of disjoint subtrees ST = {T1, T2, …} (the divisive approach), and each of these subtrees Ti is treated as a cluster. Oleksandr Grygorash et al. proposed a minimum spanning tree based clustering algorithm [21] which generates k clusters. Our previous algorithm [15] generates k clusters with centers, which are used to produce Meta similarity clusters. Both of these minimum spanning tree based algorithms assume the desired number of clusters in advance. In practice, determining the number of clusters is often coupled with discovering the cluster structure. Hence we propose a new algorithm, the Dynamically Growing Minimum Spanning Tree (DGMST) algorithm, which does not require a predefined cluster number.

The algorithm works in two phases. The first phase partitions the EMST1 into subtrees (clusters/regions), as sketched below. The centers of the clusters or regions are identified using the eccentricity of points; these points are the representative points for each subtree in ST. A point ci is assigned to cluster i if ci ∈ Ti. The group of center points is represented as C = {c1, c2, …, ck}. These center points c1, c2, …, ck are connected and again a minimum spanning tree, EMST2, is constructed, as shown in Figure 4. A Euclidean distance between a pair of clusters can be represented by a corresponding weighted edge. Our algorithm is also based on the minimum spanning tree, but it is not limited to two-dimensional points. There are two classical kinds of clustering problems: one that minimizes the maximum intra-cluster distance and one that maximizes the minimum inter-cluster distance. Our algorithm produces clusters with both intra-cluster and inter-cluster similarity. The second phase of the algorithm converts the minimum spanning tree EMST2 into a dendrogram, which can be used to interpret inter-cluster distances. This new algorithm is neither a single link clustering algorithm (SLCA) nor a complete link clustering algorithm (CLCA) type of hierarchical clustering; it is based instead on the distance between the centers of clusters. This approach leads to new developments in hierarchical clustering. The level function L records the proximity at which each clustering is formed. The levels in the dendrogram tell us the least amount of similarity by which points in different clusters differ. This piece of information can be very useful in several medical and image processing applications.
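The first-phase splitting rule can be sketched compactly. The fragment below is our illustration rather than the reference implementation; it assumes EMST1 is given as (u, v, weight) triples over points indexed 0..n-1 and removes every edge heavier than Ŵ + σ together with the current longest edge, returning the resulting regions.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def split_emst1(n_points, emst1_edges):
    u, v, w = (np.array(t) for t in zip(*emst1_edges))
    w_bar, sigma = w.mean(), w.std()
    # Keep an edge only if it is not heavier than W-bar + sigma and is not
    # the current longest edge of EMST1.
    keep = (w <= w_bar + sigma) & (w < w.max())
    forest = coo_matrix((w[keep], (u[keep], v[keep])),
                        shape=(n_points, n_points))
    n_regions, labels = connected_components(forest, directed=False)
    return n_regions, labels      # each connected subtree is one region

# Toy EMST1 over 6 points: one long inconsistent edge joins two groups.
edges = [(0, 1, 1.0), (1, 2, 1.2), (2, 3, 7.0), (3, 4, 1.1), (4, 5, 0.9)]
print(split_emst1(6, edges))      # -> (2, array([0, 0, 0, 1, 1, 1]))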
Here, we use a cluster validation criterion based on the geometric characteristics of the clusters, in which only the inter-cluster metric is used. The DGMST algorithm is a nearest centroid-based clustering algorithm which creates subtrees (clusters/regions) of the data space. The algorithm partitions a set S of data in data space D into n regions (clusters). Each region is represented by a centroid reference vector. If we let p be the centroid representing a region (cluster), all data within the region (cluster) are closer to the centroid p of the region than to any other centroid q:

R(p) = {x ∈ D | dist(x, p) ≤ dist(x, q) for all q}

Thus, the problem of finding the proper number of clusters of a dataset can be transformed into the problem of finding the proper regions (clusters) of the dataset. Here, we use the MST as a criterion to test the inter-cluster property. Based on this observation, we use a cluster validation criterion called Cluster Separation (CS) in the DGMST algorithm [4].

Cluster Separation (CS) is defined as the ratio between the minimum and maximum edge of the MST, i.e.,

CS = Emin / Emax

where Emax is the maximum-length edge of the MST, which represents the two centroids at maximum separation, and Emin is the minimum-length edge of the MST, which represents the two centroids that are nearest to each other. CS therefore represents the relative separation of the centroids, and its value ranges from 0 to 1. A low value of CS means that two centroids are too close to each other and the corresponding partition is not valid. A high CS value means the partition of the data is even and valid. In practice, we predefine a threshold to test the CS. If the CS is greater than the threshold, the partition of the dataset is valid, and we again partition the data set by creating subtrees (clusters/regions). This process continues until the CS becomes smaller than the threshold; at that point, the proper number of clusters is the current number of clusters minus one. The CS criterion finds the proper binary relationship among clusters in the data space. The setting of the threshold value for the CS is practical and depends on the dataset: the higher the threshold, the smaller the resulting number of clusters. Generally, the value of the threshold will be > 0.8 [4]. Figure 3 shows the CS value versus the number of clusters in hierarchical clustering; the CS value falls below 0.8 when the number of clusters is 5, so the proper number of clusters for that data set is 4. Furthermore, the computational cost of CS is very light because the number of subclusters is small. This makes the CS criterion practical for the DGMST algorithm when it is used for clustering large datasets.

Figure 3: Number of Clusters vs. Cluster Separation

Algorithm: DGMST()
Input: S, the point set
Output: dendrogram with optimal number of clusters

Let e1 be an edge in the EMST1 constructed from S
Let e2 be an edge in the EMST2 constructed from C
Let We be the weight of e1
Let σ be the standard deviation of the edge weights in EMST1
Let ST be the set of disjoint subtrees of EMST1
Let nc be the number of clusters

1. Construct EMST1 from S
2. Compute the average weight Ŵ of all the edges of EMST1
3. Compute the standard deviation σ of the edges of EMST1
4. ST = ∅; nc = 1; C = ∅
5. Repeat
6.   For each e1 ∈ EMST1
7.     If (We > Ŵ + σ) or (e1 is the current longest edge)
8.       Remove e1 from EMST1
9.       ST = ST ∪ {T'}   // T' is the new disjoint subtree (region)
10.      nc = nc + 1
11.      Compute the center Ci of each Ti using the eccentricity of points
12.      C = ∪_{Ti ∈ ST} {Ci}
13.  Construct EMST2 T from C
14.  Emin = get-min-edge(T)
15.  Emax = get-max-edge(T)
16.  CS = Emin / Emax
17. Until CS < 0.8
18. Begin with the disjoint clusters at level L(0) = 0 and sequence number m = 0
19. While (T has some edge)
20.   e2 = get-min-edge(T)   // the least dissimilar pair of clusters
21.   (i, j) = get-vertices(e2)
22.   Increment the sequence number m = m + 1; merge clusters i and j into a single cluster to form the next clustering m; set the level of this cluster to L(m) = weight of e2
23.   Update T by forming a new vertex that combines the vertices i and j
24. Return the dendrogram with the optimal number of clusters
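Lines 13-17 of the pseudocode amount to evaluating CS on the tree of cluster centers. A minimal sketch of that check follows (ours; the example edge weights are made up): an EMST2 whose edges are nearly equal scores close to 1, an even and valid partition, while one short edge next to long ones signals two centers that are too close.

def cluster_separation(emst2_weights):
    # CS = Emin / Emax over the edges of the tree of cluster centers.
    return min(emst2_weights) / max(emst2_weights)

print(cluster_separation([3.9, 4.0, 4.2]))   # ~0.93: even, valid partition
print(cluster_separation([0.4, 4.0, 4.2]))   # ~0.10: two centers too close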
Our DGMST algorithm works in two phases. The outcome of the first phase (lines 1-17) is the optimal number of clusters together with their centers. The algorithm first constructs EMST1 from the set of points S (line 1). The average weight of the edges and their standard deviation are computed (lines 2-3). Inconsistent edges are identified and removed from EMST1 to generate the subtrees T' (lines 7-9). The center of each subtree (cluster/region) is computed at line 11. Using the cluster/region center points, another minimum spanning tree, EMST2, is constructed (line 13). Using the new evaluation criterion, the optimal number of clusters/regions is identified (lines 14-16). Lines 6-16 of the algorithm are repeated until the optimal number of clusters is obtained. Figure 1 shows a typical example of an EMST1 constructed from a point set S, in which inconsistent edges are removed to create subtrees (clusters/regions). Our algorithm finds the center of each cluster, which will be useful in many applications, and it finds the optimal number of clusters or cluster structures. Figure 2 shows a possible distribution of the points into two cluster structures with center vertices 5 and 3.

Figure 1: Clusters connected through points - EMST1

Figure 2: Two Clusters/regions with Center points 5 and 3

We use the graph of Figure 4 as an example to illustrate the second phase (lines 18-24) of the algorithm. The second phase of the DGMST algorithm constructs the minimum spanning tree T from the point set C = {c1, c2, c3, …, ck} and converts T into the dendrogram shown in Figure 5. It places all the disjoint clusters at level 0 (line 18). It then checks whether T still contains some edge (line 19). If so, it finds the minimum edge e2 (line 20) and the vertices i, j of the edge e2 (line 21). It then merges the vertices (clusters) and forms a new vertex (the agglomerative approach). At the same time the sequence number is increased by one and the level of the new cluster is set to the edge weight (line 22). Finally, the update of the minimum spanning tree is performed at line 23. Lines 20-23 of the algorithm are repeated until the minimum spanning tree T has no edges. The dendrogram with the optimal number of clusters as objects is then generated. The objects within the clusters are compact, and the clusters are well separated, as shown in Figure 4.

Figure 4: EMST2 from 4 region/cluster center points

Figure 5: Dendrogram for Optimal Meta cluster
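The second phase is a sequence of cheapest-edge merges over EMST2. The sketch below is our minimal illustration of lines 18-24; it substitutes a union-find structure for the explicit vertex-combination step of line 23, and the four centers and edge weights are hypothetical.

def emst2_dendrogram(n_centers, emst2_edges):
    parent = list(range(n_centers))        # each center starts as its own cluster
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    merges = []                            # (level L(m), cluster i, cluster j)
    for u, v, w in sorted(emst2_edges, key=lambda e: e[2]):  # min edge first
        i, j = find(u), find(v)
        merges.append((w, i, j))           # the level of merge m is the weight
        parent[i] = j                      # combine vertices i and j
    return merges

# Hypothetical EMST2 over four cluster centers (cf. Figure 4):
print(emst2_dendrogram(4, [(0, 1, 2.5), (1, 2, 6.0), (2, 3, 4.0)]))
# -> [(2.5, 0, 1), (4.0, 2, 3), (6.0, 1, 3)]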
IV. CONCLUSION

Our DGMST clustering algorithm does not assume any predefined cluster number. The algorithm gradually finds clusters, with a center for each cluster; these clusters ensure guaranteed intra-cluster similarity. Our algorithm does not require the user to select and try various parameter combinations in order to get the desired output. The DGMST clustering algorithm uses a new cluster validation criterion, based on the geometric property of the partitioned regions/clusters, to produce the optimal number of "true" clusters with a center for each of them. Our algorithm also generates a dendrogram, which is used to find the relationships among the optimal number of clusters; the inter-cluster distances between clusters/regions are shown in the dendrogram. This will be very useful in many applications. All of this looks nice from a theoretical point of view; from a practical point of view, however, there is still some room for improvement in the running time of the clustering algorithm, which could perhaps be accomplished by using some appropriate data structure. In the future we will explore and test our proposed clustering algorithm in various domains. The DGMST algorithm uses both divisive and agglomerative approaches to find Optimal Meta similarity clusters. We will further study the rich properties of EMST-based clustering methods in solving different clustering problems.

REFERENCES

[1] Anil K. Jain, Richard C. Dubes, "Algorithms for Clustering Data", Michigan State University, Prentice Hall, Englewood Cliffs, New Jersey 07632, 1988.

[2] T. Asano, B. Bhattacharya, M. Keil and F. Yao, "Clustering algorithms based on minimum and maximum spanning trees", in Proceedings of the 4th Annual Symposium on Computational Geometry, pages 252-257, 1988.

[3] D. Avis, "Diameter partitioning", Discrete and Computational Geometry, 1:265-276, 1986.

[4] Feng Luo, Latifur Khan, Farokh Bastani, I-Ling Yen, and Jizhong Zhou, "A dynamically growing self-organizing tree (DGSOT) for hierarchical clustering gene expression profiles", Bioinformatics, Vol. 20, No. 16, pp. 2605-2617, 2004.

[5] M. Fredman and D. Willard, "Trans-dichotomous algorithms for minimum spanning trees and shortest paths", in Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science, pages 719-725, 1990.

[6] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "On clustering validation techniques", J. Intell. Inform. Systems, 17, 107-145, 2001.

[7] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "Clustering validity checking methods: part II", SIGMOD Record, 31, 19-27, 2002.

[8] G. Hamerly and C. Elkan, "Learning the k in k-means", in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, eds., MIT Press, Cambridge, MA, 2004.

[9] P. Hansen and M. Delattre, "Complete-link cluster analysis by graph coloring", Journal of the American Statistical Association, 73, 397-403, 1978.

[10] A. Hardy, "On the number of clusters", Computational Statistics and Data Analysis, 23, pp. 83-96, 1996.

[11] T. Hastie, R. Tibshirani, and J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference and Prediction", Springer-Verlag, 2001.

[12] L. J. Hubert, "Min and max hierarchical clustering using asymmetric similarity measures", Psychometrika, 38, 63-72, 1973.

[13] H. Gabow, T. Spencer and R. Tarjan, "Efficient algorithms for finding minimum spanning trees in undirected and directed graphs", Combinatorica, 6(2):109-122, 1986.

[14] J. C. Gower and G. J. S. Ross, "Minimum spanning trees and single-linkage cluster analysis", Applied Statistics, 18, 54-64, 1969.
[15] S. John Peter, S. P. Victor, "A Novel Algorithm for Meta similarity clusters using Minimum spanning tree", International Journal of Computer Science and Network Security, Vol. 10, No. 2, pp. 254-259, 2010.

[16] S. C. Johnson, "Hierarchical clustering schemes", Psychometrika, 32, 241-254, 1967.

[17] D. Johnson, "The NP-completeness column: An ongoing guide", Journal of Algorithms, 3:182-195, 1982.

[18] D. Karger, P. Klein and R. Tarjan, "A randomized linear-time algorithm to find minimum spanning trees", Journal of the ACM, 42(2):321-328, 1995.

[19] J. Kruskal, "On the shortest spanning subtree and the travelling salesman problem", in Proceedings of the American Mathematical Society, pp. 48-50, 1956.

[20] J. Nesetril, E. Milkova and H. Nesetrilova, "Otakar Boruvka on minimum spanning tree problem: Translation of both the 1926 papers, comments, history", DMATH: Discrete Mathematics, 233, 2001.

[21] Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, "Minimum Spanning Tree Based Clustering Algorithms", in Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'06), 2006.

[22] N. Paivinen, "Clustering with a minimum spanning tree of scale-free-like structure", Pattern Recogn. Lett., 26(7):921-930, 2005.

[23] F. Preparata and M. Shamos, "Computational Geometry: An Introduction", Springer-Verlag, New York, NY, USA, 1985.

[24] R. Prim, "Shortest connection networks and some generalizations", Bell Systems Technical Journal, 36:1389-1401, 1957.

[25] R. Rezaee, B. P. F. Lelieveldt and J. H. C. Reiber, "A new cluster validity index for the fuzzy C-mean", Pattern Recogn. Lett., 19, 237-246, 1998.

[26] D. M. Rocke and J. J. Dai, "Sampling and subsampling for cluster analysis in data mining: With applications to sky survey data", Data Mining and Knowledge Discovery, 7, pp. 215-232, 2003.

[27] S. Salvador and P. Chan, "Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms", in Proceedings of the Sixteenth IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), Los Alamitos, CA, USA, IEEE Computer Society, pp. 576-584, 2004.

[28] S. Still and W. Bialek, "How many clusters? An information-theoretic perspective", Neural Computation, 16, pp. 2483-2506, 2004.

[29] Stefan Wuchty and Peter F. Stadler, "Centers of Complex Networks", 2006.

[30] C. Sugar and G. James, "Finding the number of clusters in a data set: An information theoretic approach", Journal of the American Statistical Association, 98, pp. 750-763, 2003.

[31] S. Theodoridis, K. Koutroubas, "Pattern Recognition", Academic Press, 1999.

[32] R. Tibshirani, G. Walther and T. Hastie, "Estimating the number of clusters in a dataset via the gap statistic", J. R. Stat. Soc. Ser. B, 63, 411-423, 2001.
[33] Y. Xu, V. Olman and D. Xu, "Minimum spanning trees for gene expression data clustering", Genome Informatics, 12:24-33, 2001.

[34] C. Zahn, "Graph-theoretical methods for detecting and describing gestalt clusters", IEEE Transactions on Computers, C-20:68-86, 1971.

BIOGRAPHY OF AUTHORS

S. John Peter is working as an Assistant Professor in Computer Science at St. Xavier's College (Autonomous), Palayamkottai, Tirunelveli. He earned his M.Sc. and M.Phil. degrees from Bharathidasan University, Tiruchirappalli. He is currently pursuing a Ph.D. in Computer Science at Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India. He has published research papers on clustering algorithms in various national and international journals. E-mail: jaypeeyes@rediffmail.com

Dr. S. P. Victor earned his M.C.A. degree from Bharathidasan University, Tiruchirappalli. The M.S. University, Tirunelveli, awarded him a Ph.D. degree in Computer Science for his research in parallel algorithms. He is the Head of the Department of Computer Science and the Director of the Computer Science Research Centre, St. Xavier's College (Autonomous), Palayamkottai, Tirunelveli. The M.S. University, Tirunelveli, and Bharathiar University, Coimbatore, have recognized him as a research guide. He has published research papers in international and national journals and in conference proceedings, and has organized conferences and seminars at the national and state level. E-mail: victorsp@rediffmail.com
