(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010
Constraint-free Optimal Meta Similarity Clusters Using Dynamic Minimum Spanning Tree

S. John Peter
Assistant Professor
Department of Computer Science and Research Center
St. Xavier's College, Palayamkottai
Tamil Nadu, India.
jaypeeyes@rediffmail.com

S.P. Victor
Associate Professor
Department of Computer Science and Research Center
St. Xavier's College, Palayamkottai
Tamil Nadu, India.
victorsp@rediffmail.com




ABSTRACT — Clustering is a process of discovering groups of objects such that the objects of the same group are similar, and objects belonging to different groups are dissimilar. A number of clustering algorithms exist that can solve the problem of clustering, but most of them are very sensitive to their input parameters. Therefore it is very important to evaluate the results of these algorithms. The minimum spanning tree clustering algorithm is capable of detecting clusters with irregular boundaries. In this paper we propose a constraint-free minimum spanning tree based clustering algorithm. The algorithm constructs a hierarchy from top to bottom. At each hierarchical level, it optimizes the number of clusters, from which the proper hierarchical structure of the underlying dataset can be found. The algorithm uses a new cluster validation criterion based on the geometric property of the data partition of the data set in order to find the proper number of clusters at each level. The algorithm works in two phases. The first phase of the algorithm creates clusters with guaranteed intra-cluster similarity, whereas the second phase creates a dendrogram using the clusters as objects, with guaranteed inter-cluster similarity. The first phase of the algorithm uses a divisive approach, whereas the second phase uses an agglomerative approach. In this paper we use both approaches in the algorithm to find Optimal Meta similarity clusters.

Keywords: Euclidean minimum spanning tree, Subtree, Clustering, Eccentricity, Center, Hierarchical clustering, Dendrogram, Cluster validity, Cluster Separation

I. INTRODUCTION

The problem of determining the correct number of clusters in a data set is perhaps the most difficult and ambiguous part of cluster analysis. The "true" number of clusters depends on the "level" at which one views the data. Another problem is due to methods that may yield the "correct" number of clusters for a "bad" classification [10]. Furthermore, it has been emphasized that mechanical methods for determining the optimal number of clusters should not ignore the fact that the overall clustering process has an unsupervised nature and that its fundamental objective is to uncover the unknown structure of a data set, not to impose one. For these reasons, one should be well aware of the explicit and implicit assumptions underlying the actual clustering procedure before the number of clusters can be reliably estimated; otherwise the initial objective of the process may be lost. As a solution for this, Hardy [10] recommends that the determination of the optimal number of clusters should be made by using several different clustering methods that together produce more information about the data. By forcing a structure onto a data set, the important and surprising facts about the data will likely remain undiscovered.

In some applications the number of clusters is not a problem, because it is predetermined by the context [11]. Then the goal is to obtain a mechanical partition for particular data using a fixed number of clusters. Such a process is not intended for inspecting new and unexpected facts arising from the data.




Hence, splitting up a homogeneous data set in a "fair" way is a much more straightforward problem than the analysis of hidden structures in a heterogeneous data set. The clustering algorithms of [15, 21] partition the data set into k clusters without knowing the homogeneity of the groups. Hence the principal goal of these clustering problems is not to uncover novel or interesting facts about the data.

Numerical methods can usually provide only guidance about the true number of clusters, and the final decision is often an ad hoc one based on prior assumptions and domain knowledge. Therefore, the choice between different numbers of clusters is often made by comparing several alternatives, and the final decision is a subjective problem that can be solved in practice only by humans. Nevertheless, a number of methods for objective assessment of cluster validity have been developed and proposed. Because the recognition of cluster structures is difficult, especially in high-dimensional spaces, various visualization techniques can also be of valuable help to the cluster analyst.

Consider a connected, undirected graph G = (V, E), where V is the set of nodes, E is the set of edges between pairs of nodes, and a weight w(u, v) specifies the weight of each edge (u, v) ∈ E. A spanning tree is an acyclic subgraph of a graph G which contains all vertices from G. The Minimum Spanning Tree (MST) of a weighted graph is the minimum-weight spanning tree of that graph. Several well-established algorithms exist to solve the minimum spanning tree problem [24, 19, 20]. The cost of constructing a minimum spanning tree is O(m log n), where m is the number of edges in the graph and n is the number of vertices. More efficient algorithms for constructing MSTs have also been extensively researched [18, 5, 13]; these algorithms promise close to linear time complexity under different assumptions. A Euclidean minimum spanning tree (EMST) is a spanning tree of a set of n points in a metric space (E^n), where the length of an edge is the Euclidean distance between a pair of points in the point set.
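As an illustration (not the authors' code), an EMST can be built in a few lines of Python with SciPy; the array points is a hypothetical input:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def build_emst(points):
    # Full Euclidean distance matrix, then its minimum spanning tree;
    # the result is a sparse (n, n) matrix holding the n-1 tree edges.
    return minimum_spanning_tree(squareform(pdist(points)))

points = np.random.rand(20, 3)   # 20 hypothetical points in 3-D space
emst = build_emst(points)
print(emst.nnz)                  # 19 edges for 20 points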
The hierarchical clustering approaches are related to graph-theoretic clustering. Clustering algorithms using a minimal spanning tree take advantage of the MST. The MST ignores many possible connections between the data patterns, so the cost of clustering can be decreased. The MST-based clustering algorithm is known to be capable of detecting clusters with various shapes and sizes [34]. Unlike traditional clustering algorithms, the MST clustering algorithm does not assume a spherical shape structure of the underlying data. The EMST clustering algorithm [23, 34] uses the Euclidean minimum spanning tree of a graph to produce the structure of point clusters in the n-dimensional Euclidean space. Clusters are detected so as to achieve some measure of optimality, such as minimum intra-cluster distance or maximum inter-cluster distance [2]. The EMST algorithm has been widely used in practice.

Clustering by minimal spanning tree can be viewed as a hierarchical clustering algorithm which follows a divisive approach. In this method an MST is first constructed for the given input. There are different methods to produce groups of clusters. If the number of clusters k is given in advance, the simplest way to obtain k clusters is to sort the edges of the minimum spanning tree in descending order of their weights and remove the k-1 edges with the heaviest weights [2, 33].
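The k-clustering rule just described is easy to state in code. A minimal sketch (with an illustrative helper name k_clusters) follows; it reuses SciPy's MST and connected-components routines:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def k_clusters(points, k):
    # Build the EMST, then drop its k-1 heaviest edges; the surviving
    # forest has exactly k connected components (the clusters).
    mst = minimum_spanning_tree(squareform(pdist(points))).tocoo()
    keep = np.argsort(mst.data)[:len(mst.data) - (k - 1)]
    forest = np.zeros((len(points), len(points)))
    forest[mst.row[keep], mst.col[keep]] = mst.data[keep]
    n_comp, labels = connected_components(forest, directed=False)
    return labels    # one cluster label per point; n_comp == k

labels = k_clusters(np.random.rand(30, 2), k=3)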




All existing clustering algorithms require a number of parameters as their inputs, and these parameters can significantly affect the cluster quality. Our algorithm does not require a predefined cluster number. In this paper we want to avoid experimental methods and advocate the idea of need-specific as opposed to care-specific, because users always know the needs of their applications. We believe it is a good idea to allow users to define their desired similarity within a cluster and to give them some flexibility to adjust the similarity if needed. Our algorithm produces clusters of n-dimensional points with a naturally approximate intra-cluster distance.

Geometric notions of centrality are closely linked to the facility location problem. The distance matrix D can be computed rather efficiently using Dijkstra's algorithm with time complexity O(|V|^2 ln |V|) [29]. The eccentricity of a vertex x in G and the radius ρ(G) are defined, respectively, as

e(x) = max{ d(x, y) : y ∈ V }   and   ρ(G) = min{ e(x) : x ∈ V }

The center of G is the set

C(G) = { x ∈ V | e(x) = ρ(G) }

C(G) is the center of the "emergency facility location problem", and it always consists of a single block of G. The length of the longest path in the graph is called the diameter of the graph G. We can define the diameter D(G) as

D(G) = max{ e(x) : x ∈ V }

The diameter set of G is

Dia(G) = { x ∈ V | e(x) = D(G) }
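These quantities are straightforward to compute once all-pairs distances are available. A minimal sketch (assuming a connected weighted graph given as a SciPy sparse matrix) is:

import numpy as np
from scipy.sparse.csgraph import shortest_path

def centrality(graph):
    d = shortest_path(graph, directed=False)  # distance matrix D via csgraph
    ecc = d.max(axis=1)                       # e(x) = max over y of d(x, y)
    radius = ecc.min()                        # rho(G) = min over x of e(x)
    diameter = ecc.max()                      # D(G) = max over x of e(x)
    center = np.flatnonzero(ecc == radius)    # C(G): vertices of minimum e(x)
    return ecc, radius, center, diameter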
                                                             optimal numbers of clusters (regions) are
An important objective of hierarchical cluster analysis is to provide a picture of the data that can easily be interpreted. A picture of a hierarchical clustering is much easier for a human being to comprehend than a list of abstract symbols. A dendrogram is a special type of tree structure that provides a convenient way to represent hierarchical clustering. A dendrogram consists of layers of nodes, each representing a cluster.

Hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. An agglomerative algorithm for hierarchical clustering starts with a disjoint clustering, which places each of the n objects in an individual cluster [1]. The hierarchical clustering algorithm being employed dictates how the proximity matrix or proximity graph should be interpreted to merge two or more of these trivial clusters, thus nesting the trivial clusters into a second partition. The process is repeated to form a sequence of nested clusterings in which the number of clusters decreases as the sequence progresses, until a single cluster containing all n objects, called the conjoint clustering, remains [1].
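For a concrete feel of this merge sequence, a brief illustration on hypothetical data (using SciPy's stock hierarchy routines, not anything specific to this paper):

import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.random.rand(6, 2)
Z = linkage(points, method='single')  # one row per merge: (i, j, level, size)
print(Z)                              # 5 merges collapse 6 singleton clusters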
                                                             to find the proper number of clusters
Nearly all hierarchical clustering techniques that include the tree structure have two shortcomings: (1) they do not properly represent hierarchical relationships, and (2) once data are assigned improperly to a given cluster, they cannot later be reevaluated and placed in another cluster.

In this paper, we propose a new clustering algorithm: the Dynamically Growing Minimum Spanning Tree (DGMST), which can overcome these shortcomings. The algorithm optimizes the number of clusters at each hierarchical level with a cluster validation criterion during the minimum spanning tree construction process. The hierarchy constructed by the algorithm can then properly represent the hierarchical structure of the underlying dataset, which improves the accuracy of the final clustering result.

Our DGMST clustering algorithm addresses the issues of undesired clustering structure and an unnecessarily large number of clusters. Our algorithm does not require a predefined cluster number. The algorithm constructs an EMST of a point set and removes the inconsistent edges that satisfy the inconsistency measure. The process is repeated to create a hierarchy of clusters until the optimal number of clusters (regions) is obtained; hence the title. In Section 2 we review some of the existing work on cluster validity and graph-based clustering algorithms. In Section 3 we propose the DGMST algorithm, which produces an optimal number of clusters with a dendrogram for the cluster of clusters; hence we name the result Optimal Meta similarity clusters. Finally, in the conclusion we summarize the strengths of our methods and possible improvements.

II. RELATED WORK

Determining the true number of clusters, also known as the cluster validation problem, is a fundamental problem in cluster analysis. Many approaches to this problem have been proposed [25, 32, 10]. Two kinds of indexes have been used to validate a clustering [6, 7]: one based on relative criteria, and the other based on external and internal criteria. The first approach is to choose the best result from a set of clustering results according to a prespecified criterion. Although the computational cost of this approach is light, human intervention is required to find the best number of clusters. The DGMST algorithm tries to find the proper number of clusters automatically, which makes the first approach unsuitable for cluster validation in the DGMST algorithm.

The second approach is based on statistical tests and involves computing both inter-cluster and intra-cluster quality to determine the proper number of clusters. The evaluation of the criteria can be completed automatically. However, the computational cost of this type of cluster validation is very high, so this kind of approach is also not suitable for the DGMST algorithm when it is used to cluster a large dataset. A successful and practical cluster validation criterion for the DGMST algorithm on large datasets must have modest computational cost and must be easy to evaluate automatically.




Clustering by minimal spanning tree can be viewed as a hierarchical clustering algorithm which follows the divisive approach. Clustering algorithms based on minimum and maximum spanning trees have been extensively studied. Avis [3] found an O(n^2 log^2 n) algorithm for the min-max diameter-2 clustering problem. Asano, Bhattacharya, Keil and Yao [2] later gave an optimal O(n log n) algorithm using maximum spanning trees for minimizing the maximum diameter of a bipartition. The problem becomes NP-complete when the number of partitions is more than two [17]. Asano, Bhattacharya, Keil and Yao also considered the clustering problem in which the goal is to maximize the minimum inter-cluster distance. They gave a k-partition of a point set by removing the k-1 longest edges from the minimum spanning tree constructed from that point set [2]. The identification of inconsistent edges causes problems in the MST clustering algorithm. There exist numerous ways to divide clusters successively, but there is no choice suitable for all cases.
Zahn [34] proposes to construct an MST of the point set and delete the inconsistent edges, i.e., the edges whose weights are significantly larger than the average weight of the nearby edges in the tree. Zahn's inconsistency measure is defined as follows. Let e denote an edge in the MST of the point set, let v1 and v2 be the end nodes of e, and let w be the weight of e. A depth-d neighborhood N of an end node v of an edge e is defined as the set of all edges that belong to paths of length d originating from v, excluding paths that include the edge e. Let N1 and N2 be the depth-d neighborhoods of the nodes v1 and v2. Let Ŵ_N1 be the average weight of edges in N1 and σ_N1 be their standard deviation; similarly, let Ŵ_N2 be the average weight of edges in N2 and σ_N2 be their standard deviation. The inconsistency measure requires that one of the following three conditions hold:

1. w > Ŵ_N1 + c × σ_N1  or  w > Ŵ_N2 + c × σ_N2

2. w > max(Ŵ_N1 + c × σ_N1, Ŵ_N2 + c × σ_N2)

3. w / max(c × σ_N1, c × σ_N2) > f

where c and f are preset constants. All the edges of the tree that satisfy the inconsistency measure are considered inconsistent and are removed from the tree.
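Zahn's test is simple to express in code. The sketch below (our illustration, not Zahn's original) represents the tree as a dict from each vertex to a list of (neighbor, weight) pairs; depth plays the role of d, and c and f are the preset constants:

import statistics

def neighborhood_weights(tree, start, banned, depth):
    # Weights of edges on paths of length <= depth from `start`,
    # never crossing to `banned` (the other end of the tested edge).
    weights, frontier, seen = [], [start], {start, banned}
    for _ in range(depth):
        nxt = []
        for u in frontier:
            for v, w in tree[u]:
                if v not in seen:
                    seen.add(v)
                    weights.append(w)
                    nxt.append(v)
        frontier = nxt
    return weights

def is_inconsistent(tree, v1, v2, w, c=2.0, f=3.0, depth=2):
    stats = []
    for a, b in ((v1, v2), (v2, v1)):
        ws = neighborhood_weights(tree, a, b, depth)
        mean = statistics.mean(ws) if ws else 0.0
        sd = statistics.pstdev(ws) if len(ws) > 1 else 0.0
        stats.append((mean, sd))
    (m1, s1), (m2, s2) = stats
    cond1 = w > m1 + c * s1 or w > m2 + c * s2    # condition 1
    cond2 = w > max(m1 + c * s1, m2 + c * s2)     # condition 2
    denom = max(c * s1, c * s2)
    cond3 = denom > 0 and w / denom > f           # condition 3
    return cond1 or cond2 or cond3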




This results in a set of disjoint subtrees, each representing a separate cluster. Paivinen [22] proposed a Scale Free Minimum Spanning Tree (SFMST) clustering algorithm, which constructs scale-free networks and outputs clusters containing highly connected vertices and those connected to them.

The MST clustering algorithm has been widely used in practice. Xu (Ying), Olman and Xu (Dong) [33] use an MST to represent multidimensional gene expression data. They point out that an MST-based clustering algorithm does not assume that data points are grouped around centers or separated by a regular geometric curve; thus the shape of the cluster boundary has little impact on the performance of the algorithm. They described three objective functions and the corresponding cluster algorithms for computing a k-partition of a spanning tree for a predefined k > 0. The first algorithm simply removes the k-1 longest edges so that the total weight of the subtrees is minimized. The second objective function is defined to minimize the total distance between the center and each data point in a cluster; the corresponding algorithm removes k-1 edges from the tree, which creates k partitions.

The clustering algorithm proposed by S. C. Johnson [16] uses a proximity matrix as input data. The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones. The algorithm is simplified by assuming no ties in the proximity matrix. Graph-based algorithms using single-link and complete-link methods were proposed by Hubert [12], who used threshold graphs for the formation of hierarchical clusterings. An algorithm for single-link hierarchical clustering that begins with the minimum spanning tree of G(∞), a proximity graph containing n(n-1)/2 edges, was proposed by Gower and Ross [14]. Later, Hansen and Delattre [9] proposed another hierarchical algorithm based on graph coloring.

Many different methods for determining the number of clusters have been developed. Hierarchical clustering methods provide direct information about the number of clusters by clustering objects on a number of different hierarchical levels, which are then presented by a graphical tree structure known as a dendrogram. One may apply some external criteria to validate the solutions on different levels, or use the dendrogram visualization to determine the best cluster structure.
The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. In general terms, there are three approaches to investigating cluster validity [31]. The first is based on external criteria. This implies that we evaluate the results of a clustering algorithm against a pre-specified structure, which is imposed on the data set and reflects our intuition about the clustering structure of the data set. The second approach is based on internal criteria. In this case the clustering results are evaluated in terms of quantities that involve the vectors of the data set themselves (e.g., the proximity matrix). The third approach is based on relative criteria. Here the basic idea is the evaluation of a clustering structure by comparing it to other clustering schemes produced by the same algorithm but with different input parameter values.

The selection of the correct number of clusters is actually a kind of validation problem. A large number of clusters provides a more complex "model", whereas a small number may approximate the data too coarsely. Hence, several methods and indices have been developed for the problem of cluster validation and selection of the number of clusters [27, 8, 26, 28, 30]. Many of them are based on the within-group and between-group distances.
                                                               Minimum Spanning Tree algorithm (DGMST),
III. OUR CLUSTERING ALGORITHM

A tree is a simple structure for representing binary relationships, and any connected component of a tree is called a subtree. Through the MST representation, we can convert a multi-dimensional clustering problem into a tree partitioning problem, i.e., finding a particular set of tree edges and then cutting them. Representing a set of multi-dimensional data points as a simple tree structure will clearly lose some of the inter-data relationships. However, many clustering algorithms have shown that no essential information is lost for the purpose of clustering: there are rigorous proofs that each cluster corresponds to one subtree, which does not overlap the subtree representing any other cluster. The clustering problem is therefore equivalent to the problem of identifying these subtrees by solving a tree partitioning problem. The inherent cluster structure of a point set in a metric space is closely related to how objects or concepts are embedded in the point set. In practice, the approximate number of embedded objects can sometimes be acquired with the help of domain experts; at other times this information is hidden and unavailable to the clustering algorithm. In this section we present a clustering algorithm which produces an optimal number of clusters.

A. DGMST Clustering Algorithm

Given a point set S in E^n, the hierarchical method starts by constructing a Minimum Spanning Tree (MST) from the points in S. The weight of each edge in the tree is the Euclidean distance between its two end points, so we name this MST EMST1. Next, the average weight Ŵ of the edges in the entire EMST1 and their standard deviation σ are computed; any edge with weight W > Ŵ + σ, or the current longest edge, is removed from the tree. This leads to a set of disjoint subtrees ST = {T1, T2, …} (the divisive approach). Each of these subtrees Ti is treated as a cluster. Oleksandr Grygorash et al. proposed a minimum spanning tree based clustering algorithm [21] which generates k clusters. Our previous algorithm [15] generates k clusters with centers, which are used to produce Meta similarity clusters. Both of these minimum spanning tree based algorithms assume the desired number of clusters in advance. In practice, determining the number of clusters is often coupled with discovering the cluster structure. Hence we propose a new algorithm, the Dynamically Growing Minimum Spanning Tree algorithm (DGMST), which does not require a predefined cluster number. The algorithm works in two phases. The first phase partitions EMST1 into subtrees (clusters/regions). The center of each cluster or region is identified using the eccentricity of points; these center points are representative points for their subtrees in ST. A point ci is assigned to a cluster i if ci ∈ Ti. The group of center points is represented as C = {c1, c2, …, ck}. These center points are then connected and another minimum spanning tree, EMST2, is constructed, as shown in Figure 4. The Euclidean distance between a pair of clusters can be represented by a corresponding weighted edge. Our algorithm is also based on the minimum spanning tree but is not limited to two-dimensional points. There are two kinds of clustering problems: one that minimizes the maximum intra-cluster distance and one that maximizes the minimum inter-cluster distance. Our algorithm produces clusters with both intra-cluster and inter-cluster similarity. The second phase of the algorithm converts the minimum spanning tree EMST2 into a dendrogram, which can be used to interpret inter-cluster distances.
This new algorithm is neither a single-link clustering algorithm (SLCA) nor a complete-link clustering algorithm (CLCA) type of hierarchical clustering; it is based on the distance between the centers of clusters. This approach leads to new developments in hierarchical clustering. The level function L records the proximity at which each clustering is formed. The levels in the dendrogram tell us the least amount of similarity by which points between clusters differ. This piece of information can be very useful in several medical and image processing applications.

Here, we use a cluster validation criterion based on the geometric characteristics of the clusters, in which only the inter-cluster metric is used. The DGMST algorithm is a nearest centroid-based clustering algorithm which creates regions or subtrees (clusters/regions) of the data space. The algorithm partitions a set S of data D in the data space into n regions (clusters). Each region is represented by a centroid reference vector. If we let p be the centroid representing a region (cluster), all data within the region (cluster) are closer to the centroid p of the region than to any other centroid q:

R(p) = { x ∈ D | dist(x, p) ≤ dist(x, q) for all q }
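In code, this region assignment is just a nearest-centroid rule. A minimal sketch (data and centroids are hypothetical arrays):

import numpy as np

def assign_regions(data, centroids):
    # x belongs to R(p) iff dist(x, p) <= dist(x, q) for every centroid q.
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)    # index of the nearest centroid per point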
Thus, the problem of finding the proper number of clusters of a dataset can be transformed into the problem of finding the proper regions (clusters) of the dataset. Here, we use the MST as a criterion to test the inter-cluster property. Based on this observation, we use a cluster validation criterion called Cluster Separation (CS) in the DGMST algorithm [4].

Cluster Separation (CS) is defined as the ratio between the minimum and maximum edge of the MST, i.e.,

CS = Emin / Emax,

where Emax is the maximum-length edge of the MST, which represents the two centroids at maximum separation, and Emin is the minimum-length edge of the MST, which represents the two centroids nearest to each other. The CS thus represents the relative separation of centroids. The value of CS ranges from 0 to 1. A low value of CS means that two centroids are too close to each other and the corresponding partition is not valid; a high CS value means the partition of the data is even and valid. In practice, we predefine a threshold to test the CS. If the CS is greater than the threshold, the partition of the dataset is valid, and we partition the data set again by creating further subtrees (clusters/regions). This process continues until the CS becomes smaller than the threshold; at that point, the proper number of clusters is taken to be the current number of clusters minus one. The CS criterion finds the proper binary relationship among clusters in the data space. The setting of the threshold value for the CS is a practical matter and depends on the dataset; the higher the value of the threshold, the smaller the resulting number of clusters. Generally, the value of the threshold will be > 0.8 [4]. Figure 3 shows the CS value versus the number of clusters in hierarchical clustering. The CS value is < 0.8 when the number of clusters is 5; thus, the proper number of clusters for the data set is 4. Furthermore, the computational cost of CS is light because the number of subclusters is small. This makes the CS criterion practical for the DGMST algorithm when it is used for clustering large datasets.




Algorithm: DGMST()
Input: S, the point set
Output: dendrogram with optimal number of clusters

Let e1 be an edge in the EMST1 constructed from S
Let e2 be an edge in the EMST2 constructed from C
Let We be the weight of e1
Let σ be the standard deviation of the edge weights in EMST1
Let ST be the set of disjoint subtrees of EMST1
Let nc be the number of clusters

1. Construct EMST1 from S
2. Compute the average weight Ŵ of all the edges in EMST1
3. Compute the standard deviation σ of the edge weights in EMST1
4. ST = ∅; nc = 1; C = ∅;
5. Repeat
6.   For each e1 ∈ EMST1
7.     If (We > Ŵ + σ) or (e1 is the current longest edge)
8.       Remove e1 from EMST1
9.       ST = ST ∪ {T'}   // T' is a new disjoint subtree (region)
10.      nc = nc + 1
11.      Compute the center ci of each Ti using eccentricity of points
12.      C = ∪ {ci | Ti ∈ ST}
13.      Construct an EMST2 T from C
14.      Emin = get-min-edge(T)
15.      Emax = get-max-edge(T)
16.      CS = Emin / Emax
17. Until CS < 0.8
18. Begin with the disjoint clusters at level L(0) = 0, with sequence number m = 0
19. While (T has some edge)
20.   e2 = get-min-edge(T)   // least dissimilar pair of clusters
21.   (i, j) = get-vertices(e2)
22.   Increment the sequence number m = m + 1; merge clusters i and j into a single cluster to form clustering m, and set the level of this cluster to L(m) = weight of e2;
23.   Update T by forming a new vertex that combines the vertices i and j;
24. Return the dendrogram with the optimal number of clusters
Figure 1 shows a typical example of an EMST1 constructed from a point set S, in which inconsistent edges are removed to create subtrees (clusters/regions). Our algorithm finds the center of each cluster, which will be useful in many applications, and it finds the optimal number of clusters or cluster structures. Figure 2 shows a possible distribution of the points in two cluster structures, with center vertices 5 and 3.

Figure 1: Clusters connected through points - EMST1

Figure 2: Two clusters/regions with center points 5 and 3

Figure 3: Number of clusters vs. Cluster Separation

Our DGMST algorithm works in two phases. The outcome of the first phase (lines 1-17) of the algorithm is an optimal number of clusters together with their centers. It first constructs EMST1 from the set of points S (line 1). The average weight of the edges and their standard deviation are computed (lines 2-3). Inconsistent edges are identified and removed from EMST1 to generate subtrees T' (lines 7-9). The center of each subtree (cluster/region) is computed at line 11. Using the cluster/region center points, another minimum spanning tree, EMST2, is constructed (line 13). Using the new evaluation criterion, the optimal number of clusters/regions is identified (lines 14-16). Lines 6-16 of the algorithm are repeated until the optimal number of clusters is obtained. We use the graph of Figure 4 as an example to illustrate the second phase (lines 18-24) of the algorithm.
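A condensed Python sketch of this first phase follows. It is our illustration under simplifying assumptions (edges are cut heaviest-first, which realizes the "current longest edge" rule, and subtree centers are approximated as minimum-eccentricity points under Euclidean distance), not the authors' implementation:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def dgmst_phase1(points, threshold=0.8):
    # Cut EMST1 edges heaviest-first (each cut edge is the current longest,
    # so edges with weight > mean + std are removed early); stop when CS
    # over the centers' MST drops below the threshold.
    d = squareform(pdist(points))
    mst = minimum_spanning_tree(d).tocoo()
    edges = sorted(zip(mst.data, mst.row, mst.col), reverse=True)
    cs, cut = 1.0, 0
    while cs >= threshold and cut < len(edges) - 1:
        cut += 1                                   # remove current longest edge
        adj = np.zeros_like(d)
        for w, i, j in edges[cut:]:                # surviving forest
            adj[i, j] = w
        _, labels = connected_components(adj, directed=False)
        centers = []
        for c in range(cut + 1):                   # one center per subtree
            idx = np.flatnonzero(labels == c)
            ecc = d[np.ix_(idx, idx)].max(axis=1)  # eccentricity within subtree
            centers.append(idx[ecc.argmin()])      # minimum-eccentricity point
        emst2 = minimum_spanning_tree(d[np.ix_(centers, centers)]).data
        cs = emst2.min() / emst2.max()             # CS = Emin / Emax
    return labels, centers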

The second phase of the DGMST algorithm constructs the minimum spanning tree T from the point set C = {c1, c2, c3, …, ck} and converts T into a dendrogram, as shown in Figure 5. It places all the disjoint clusters at level 0 (line 18). It then checks whether T still contains some edge (line 19). If so, it finds the minimum edge e2 (line 20) and the vertices i, j of the edge e2 (line 21). It then merges the two vertices (clusters) and forms a new vertex (the agglomerative approach). At the same time, the sequence number is increased by one and the level of the new cluster is set to the edge weight (line 22). Finally, the update of the minimum spanning tree is performed at line 23. Lines 20-23 of the algorithm are repeated until the minimum spanning tree T has no edges. The dendrogram with the optimal number of clusters as objects is then generated. The objects within the clusters are compact, and the clusters are well separated, as shown in Figure 4.
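Because EMST2 is a tree, this phase amounts to sorting its edges and merging endpoint clusters lightest-first. A brief sketch of this merge loop (our illustration; emst2_edges is a hypothetical list of (weight, i, j) tuples over the k centers):

def build_dendrogram(emst2_edges, k):
    # Returns merge records (m, i, j, level): at step m, clusters i and j
    # fuse at dendrogram level L(m) = weight of the merging edge.
    parent = list(range(k))              # union-find over cluster ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for m, (w, i, j) in enumerate(sorted(emst2_edges), start=1):
        parent[find(i)] = find(j)        # combine vertices i and j
        merges.append((m, i, j, w))      # level L(m) = edge weight
    return merges

print(build_dendrogram([(1.2, 0, 1), (3.5, 1, 2), (2.0, 2, 3)], k=4))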




Figure 4: EMST2 from 4 region/cluster center points

Figure 5: Dendrogram for the Optimal Meta cluster

IV. CONCLUSION

Our DGMST clustering algorithm does not assume any predefined cluster number. The algorithm gradually finds clusters, with a center for each cluster. These clusters ensure guaranteed intra-cluster similarity. Our algorithm does not require the user to select and try various parameter combinations in order to get the desired output. Our DGMST clustering algorithm uses a new cluster validation criterion based on the geometric property of the partitioned regions/clusters to produce the optimal number of "true" clusters, with a center for each of them. Our algorithm also generates a dendrogram which is used to find the relationships between the optimal number of clusters; the inter-cluster distances between clusters/regions are shown in the dendrogram. This will be very useful in many applications. All of this looks nice from a theoretical point of view; however, from a practical point of view there is still some room for improvement in the running time of the clustering algorithm. This could perhaps be accomplished by using some appropriate data structure. In the future we will explore and test our proposed clustering algorithm in various domains. The DGMST algorithm uses both divisive and agglomerative approaches to find Optimal Meta similarity clusters. We will further study the rich properties of EMST-based clustering methods in solving different clustering problems.

REFERENCES

[1] Anil K. Jain and Richard C. Dubes, "Algorithms for Clustering Data", Prentice Hall, Englewood Cliffs, New Jersey, 1988.

[2] T. Asano, B. Bhattacharya, M. Keil and F. Yao, "Clustering algorithms based on minimum and maximum spanning trees", in Proceedings of the 4th Annual Symposium on Computational Geometry, pp. 252-257, 1988.

[3] D. Avis, "Diameter partitioning", Discrete and Computational Geometry, 1:265-276, 1986.

[4] Feng Luo, Latifur Khan, Farokh Bastani, I-Ling Yen and Jizhong Zhou, "A dynamically growing self-organizing tree (DGSOT) for hierarchical clustering gene expression profiles", Bioinformatics, Vol. 20, No. 16, pp. 2605-2617, 2004.

[5] M. Fredman and D. Willard, "Trans-dichotomous algorithms for minimum spanning trees and shortest paths", in Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science, pp. 719-725, 1990.

[6] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "On clustering validation techniques", J. Intell. Inform. Systems, 17:107-145, 2001.

[7] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "Clustering validity checking methods: part II", SIGMOD Record, 31:19-27, 2002.

[8] G. Hamerly and C. Elkan, "Learning the k in k-means", in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul and B. Schölkopf, eds., MIT Press, Cambridge, MA, 2004.

[9] P. Hansen and M. Delattre, "Complete-link cluster analysis by graph coloring", Journal of the American Statistical Association, 73:397-403, 1978.




[10] A. Hardy, "On the number of clusters", Computational Statistics and Data Analysis, 23:83-96, 1996.

[11] T. Hastie, R. Tibshirani and J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference and Prediction", Springer-Verlag, 2001.

[12] L. J. Hubert, "Min and max hierarchical clustering using asymmetric similarity measures", Psychometrika, 38:63-72, 1973.

[13] H. Gabow, T. Spencer and R. Tarjan, "Efficient algorithms for finding minimum spanning trees in undirected and directed graphs", Combinatorica, 6(2):109-122, 1986.

[14] J. C. Gower and G. J. S. Ross, "Minimum spanning trees and single-linkage cluster analysis", Applied Statistics, 18:54-64, 1969.

[15] S. John Peter and S. P. Victor, "A novel algorithm for Meta similarity clusters using minimum spanning tree", International Journal of Computer Science and Network Security, Vol. 10, No. 2, pp. 254-259, 2010.

[16] S. C. Johnson, "Hierarchical clustering schemes", Psychometrika, 32:241-254, 1967.

[17] D. Johnson, "The NP-completeness column: an ongoing guide", Journal of Algorithms, 3:182-195, 1982.

[18] D. Karger, P. Klein and R. Tarjan, "A randomized linear-time algorithm to find minimum spanning trees", Journal of the ACM, 42(2):321-328, 1995.

[19] J. Kruskal, "On the shortest spanning subtree and the travelling salesman problem", in Proceedings of the American Mathematical Society, pp. 48-50, 1956.

[20] J. Nesetril, E. Milkova and H. Nesetrilova, "Otakar Boruvka on minimum spanning tree problem: translation of both the 1926 papers, comments, history", Discrete Mathematics, 233, 2001.

[21] Oleksandr Grygorash, Yan Zhou and Zach Jorgensen, "Minimum spanning tree based clustering algorithms", in Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'06), 2006.

[22] N. Paivinen, "Clustering with a minimum spanning tree of scale-free-like structure", Pattern Recognition Letters, 26(7):921-930, 2005.

[23] F. Preparata and M. Shamos, "Computational Geometry: An Introduction", Springer-Verlag, New York, NY, USA, 1985.

[24] R. Prim, "Shortest connection networks and some generalizations", Bell Systems Technical Journal, 36:1389-1401, 1957.

[25] R. Rezaee, B. P. F. Lelieveldt and J. H. C. Reiber, "A new cluster validity index for the fuzzy c-mean", Pattern Recognition Letters, 19:237-246, 1998.

[26] D. M. Rocke and J. J. Dai, "Sampling and subsampling for cluster analysis in data mining: with applications to sky survey data", Data Mining and Knowledge Discovery, 7:215-232, 2003.

[27] S. Salvador and P. Chan, "Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms", in Proceedings of the Sixteenth IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), Los Alamitos, CA, USA, IEEE Computer Society, pp. 576-584, 2004.

[28] S. Still and W. Bialek, "How many clusters? An information-theoretic perspective", Neural Computation, 16:2483-2506, 2004.

[29] Stefan Wuchty and Peter F. Stadler, "Centers of complex networks", 2006.

[30] C. Sugar and G. James, "Finding the number of clusters in a data set: an information theoretic approach", Journal of the American Statistical Association, 98:750-763, 2003.

[31] S. Theodoridis and K. Koutroubas, "Pattern Recognition", Academic Press, 1999.

[32] R. Tibshirani, G. Walther and T. Hastie, "Estimating the number of clusters in a dataset via the gap statistic", J. R. Stat. Soc. Ser. B, 63:411-423, 2001.




[33] Y. Xu, V. Olman and D. Xu, "Minimum spanning trees for gene expression data clustering", Genome Informatics, 12:24-33, 2001.

[34] C. Zahn, "Graph-theoretical methods for detecting and describing gestalt clusters", IEEE Transactions on Computers, C-20:68-86, 1971.



BIOGRAPHY OF AUTHORS

S. John Peter is working as an Assistant Professor in Computer Science at St. Xavier's College (Autonomous), Palayamkottai, Tirunelveli. He earned his M.Sc. degree from Bharathidasan University, Tiruchirappalli, and also earned his M.Phil. from Bharathidasan University, Tiruchirappalli. He is now pursuing a Ph.D. in Computer Science at Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India. He has published research papers on clustering algorithms in various national and international journals.
E-mail: jaypeeyes@rediffmail.com

Dr. S. P. Victor earned his M.C.A. degree from Bharathidasan University, Tiruchirappalli. The M.S. University, Tirunelveli, awarded him the Ph.D. degree in Computer Science for his research in parallel algorithms. He is the Head of the Department of Computer Science and the Director of the Computer Science Research Centre, St. Xavier's College (Autonomous), Palayamkottai, Tirunelveli. The M.S. University, Tirunelveli, and Bharathiar University, Coimbatore, have recognized him as a research guide. He has published research papers in international and national journals and in conference proceedings. He has organized conferences and seminars at the national and state level.
E-mail: victorsp@rediffmail.com



