A Robust -knowledge guided fusion of clustering Ensembles

					                                                               (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                    Vol. 8, No. 4, July 2010

      A Robust -knowledge guided fusion of clustering
                       Ensembles

                          Anandhi R J                                                     Dr Natarajan Subramaniyan
              Research Scholar, Dept of CSE,                                                 Professor, Dept of ISE
                   Dr MGR University,                                                      PES Institute of Technology
                      Chennai, India                                                            Bangalore, India
                 rjanandhi@hotmail.com                                                      snatarajan44@gmail.com


Abstract- Discovering interesting, implicit knowledge and general relationships in geographic information databases is very important for understanding and using spatial data. Spatial clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. In this paper, we find that a guided approach to combining the outputs of the various clusterers reduces the intensive computations and also yields robust clusters. We discuss our proposed layered cluster merging technique for spatial datasets and use it in our three-phase clustering combination technique. At the first level, m heterogeneous ensembles are run against the same spatial data set to generate results B1…Bm. The major challenge in fusion of ensembles is the generation of the voting matrix or proximity matrix, which is of the order of n^2, where n is the number of data points. This is very expensive in both time and space for spatial datasets. Instead, in our method, we compute a symmetric clusterer compatibility matrix of order (m x m), where m is the number of clusterers and m << n, using the cumulative similarity between the clusters of the clusterers. This matrix is used for identifying which two clusterers, if considered for fusion initially, will provide more information gain. As we travel down the layered merge, for every layer we calculate a factor called Degree of Agreement (DOA), based on the agreed clusterers. Using the updated DOA at every layer, the movement of unresolved, unsettled data elements is handled at a much reduced computational cost. In addition, we prune the datasets after every (m-1)/2 layers, using the knowledge gained in the previous layer. This helps in faster convergence compared to the existing cluster aggregation techniques. The correctness and efficiency of the proposed cluster ensemble algorithm are demonstrated on real-world datasets available in the UCI data repository.

    Keywords- Clustering ensembles, Spatial Data mining, Degree of Agreement, Cluster Compatibility matrix.

                          I.   INTRODUCTION
With a variety of applications, large amounts of spatial and related non-spatial data are collected and stored in Geographic Information Databases. Spatial data mining [1] (i.e., discovering interesting, implicit knowledge and general relationships in large spatial databases) is an important task for understanding and using these spatial data. With the rapid growth in size and number of available databases in commercial, industrial, administrative and other applications, it is necessary and interesting to examine how to extract knowledge automatically from huge amounts of data. Very large data sets present a challenge for both humans and machine learning algorithms. Machine learning algorithms can be inundated by the flood of data and become very slow in knowledge extraction. Moreover, along with the large amount of data available, there is also a compelling need for producing results accurately and fast.

Efficiency and scalability are, indeed, the key issues when designing data mining systems for very large data sets. Through the extraction of knowledge in databases, large databases will serve as a rich, reliable source for knowledge generation and verification, and the discovered knowledge can be applied to information management, query processing, decision-making, process control and many other applications. Therefore, data mining has been considered one of the most important topics in databases by many database researchers.

Spatial data describes information related to the space occupied by objects. It consists of 2D or 3D points, polygons etc., or points in some d-dimensional feature space. It can be either discrete or continuous. Discrete spatial data might be a single point in multi-dimensional space, while continuous spatial data spans a region of space. This data might consist of medical images or map regions, and it can be managed through spatial databases [8].

Clustering [17] groups analogous elements in a data set according to their similarity, such that elements in each cluster are similar while elements from different clusters are dissimilar. It does not require class label information about the data set because it is inherently a data-driven approach. Consequently, the most interesting and well-developed method of manipulating and cleaning spatial data in preparation for spatial data mining analysis is clustering, which has been recognized as a primary data mining method for knowledge discovery in spatial databases [4-7].

Clustering fusion is the integration of results from various clustering algorithms using a consensus function to yield stable results. Clustering fusion approaches are receiving increasing attention for their capability of improving clustering performance. At present, the usual operational mechanism for




                                                                      284                              http://sites.google.com/site/ijcsis/
                                                                                                       ISSN 1947-5500
clustering fusion is the combining of clusterer outputs. One tool for such combining or consolidation of results from a portfolio of individual clustering results is a cluster ensemble [13]. It has been shown to be useful in a variety of contexts such as “Quality and Robustness” [3], “Knowledge Reuse” [13,14], and “Distributed Computing” [9].

The rest of the paper is organized as follows. The related work is in section 2. The proposed knowledge guided fusion ensemble technique is in section 3. In section 4, we present the experimental test platform and results with discussion. Finally, we conclude with a summary and our planned future work in this area of research.

                      II.   RELATED WORK

A. Literature on Clustering Algorithms
Many clustering algorithms have been developed, and they can be roughly classified into hierarchical approaches and non-hierarchical approaches. Non-hierarchical approaches can also be divided into four categories: partitioning methods, density-based methods, grid-based methods, and model-based methods. Hierarchical algorithms can be further divided into agglomerative and divisive algorithms, corresponding to bottom-up and top-down strategies for building a hierarchical clustering tree.

Spatial data mining, or knowledge discovery in spatial databases, refers to the extraction, from spatial databases, of implicit knowledge, spatial relations, or other patterns that are not explicitly stored [8, 10]. The large size and high dimensionality of spatial data make the complex patterns that lurk in the data hard to find. It is expected that the coming years will witness a very large number of objects that are location enabled to varying degrees. Spatial clustering [8] has been used as an important process in areas such as geographic analysis, exploring data from sensor networks, traffic control, and environmental studies. Spatial data clustering has been identified as an important technique for many applications, and several techniques have been proposed over the past decade based on density-based strategies, random walks, grid-based strategies, and brute-force exhaustive searching methods [5]. This paper deals with fusion of spatial cluster ensembles using a guided approach to reduce the space complexity of such fusion algorithms.

Spatial data is about instances located in a physical space. Spatial clustering aims to group similar objects into the same group considering the spatial attributes of the objects. The existing spatial clustering algorithms in the literature focus exclusively either on the spatial distances or on minimizing the distance of object attribute pairs, i.e., either the locations are considered as just another attribute or the non-spatial attribute distances are ignored. Much activity in spatial clustering focuses on clustering objects based on their location nearness to each other [5]. Finding clusters in spatial data is an active research area, and the current non-spatial clustering algorithms are applied to the spatial domain, with recent applications and results reported on the effectiveness and scalability of algorithms [8, 16].

Partitioning algorithms are best suited to problems where minimization of a distance function is required, and a common measure used in such algorithms is the Euclidean distance. Recently a new set of spatial clustering algorithms has been proposed, which represents a faster method to find clusters with overlapping densities. DBSCAN, GDBSCAN and DBRS are density-based spatial clustering algorithms, but they each perform best only on particular types of datasets [17].

However, these algorithms also ignore the non-spatial attribute participation and require user-defined parameters. For large-scale spatial databases, the current density-based clustering algorithms can be expensive, as they require a large volume of memory because they operate over the entire database. Another disadvantage is that the input parameters required by these algorithms are based on experimental evaluations. There is large interest in automating the general-purpose clustering approach so that it works without user intervention. However, it is difficult to expect accurate results from any one of these algorithms, as each has its own shortfalls.

B. Literature on Clustering Ensembles
A clustering ensemble is a method to combine several runs of different clustering algorithms to get an optimal partition of the original dataset. Given a dataset X = {x1, x2, .., xn}, a cluster ensemble is a set of clustering solutions, represented as P = P1, P2, .., Pr, where r is the ensemble size, i.e. the number of clusterings in the ensemble. The clustering-ensemble approach first gets the results of M clusterers, then sets up a consensus function to fuse each label vector and obtain the final labeled vector. The goal of a cluster ensemble is to combine the clustering results of multiple clustering algorithms to obtain better-quality and more robust clustering results. Even though many clustering algorithms have been developed, not much work has been done on cluster ensembles in the data mining and machine learning community.

Strehl and Ghosh [13,14] proposed a hypergraph-partitioned approach to combine different clustering results by treating each cluster in an individual clustering algorithm as a hyperedge. All three proposed algorithms approach the problem by first transforming the set of clusterings into a hypergraph representation. The Cluster-based Similarity Partitioning Algorithm (CSPA) uses the relationship between objects in the same cluster to establish a measure of pairwise similarity. In the HyperGraph Partitioning Algorithm (HGPA), the maximum mutual information objective is approximated with a constrained minimum cut objective. In their Meta-CLustering Algorithm (MCLA), the objective of integration is viewed as a cluster correspondence problem.

Kai Kang, Hua-Xiang Zhang and Ying Fan [6] formulated the process of cooperation between component clusterers and proposed a novel cluster ensemble learning technique based on dynamic cooperation (DCEA). The approach is mainly concerned with how the component clusterers fully cooperate in the process of training component clusterers.

Fred and Jain [2] used a co-association matrix to form the final partition. They applied hierarchical (single-link) clustering to the co-association matrix. Zeng, Tang, Garcia-Frias and Gao [18] proposed an adaptive meta-clustering approach for



combining different clustering results by using a distance matrix.

C. Fusion Framework
We begin our discussion of the guided ensembles fusion framework by presenting our notation. Let us consider a set of n data objects, D = { v1 . . . vn }. A clustering C of dataset D is a partition of D into k disjoint sets C1 … Ck. In the sequel we consider m clusterings; we write Bi to denote the ith clustering, and ki for the number of clusters of Bi. In the clustering fusion problem the task is to find a clustering that maximizes agreement with a number of already-existing clusterings [4].

D. Definitions
   • Fusion Joint set, FJij:
   Fusion Joint set FJij refers to the set of matching pairs of the ith clusterer's jth cluster. For instance, FJ12 refers to the probable fusion spot for the first clusterer's clusters with the second clusterer's cluster. It is used for deciding where the fusion is most likely to yield optimal preciseness of clusters.

   • Clusterer Compatibility matrix: CCM (m X m)
   The Clusterer Compatibility matrix is an m X m symmetric matrix, where m is the total number of clusterers considered for fusion.

   • CCM[i][j]
   Integer value d representing the maximum information gained through the summation of the cardinalities of the intersection elements of the matching pairs of clusterers found in Fusion Joint Set FJ[i][j].

   • Degree Of Agreement Factor: (DOA)
   The degree of agreement factor is the ratio of the index of the merging level to the total number of clusterers. This DOA value is cumulative until it reaches the threshold level DOATh, a user-assigned value indicating the majority required for decision making. Under a normal scenario, DOATh is set to 50% of the number of clusterers.

   • Degree of Shadow factor: (DOS)
   The degree of shadow factor is the maximized value of the intersection of the two minimum bounding circles of k clusters with the ith cluster from a different clustering.

         III.   KNOWLEDGE GUIDED ENSEMBLE FUSION
In this section we discuss our proposed layered cluster ensemble fusion, guided by the knowledge gained during the merging process. The first phase of the algorithm is the preparation of B heterogeneous ensembles. This is done by executing different clustering algorithms against the same spatial data set to generate partitioning results. For our experimental purposes, we have also generated homogeneous ensembles by partitioning the spatial data horizontally/vertically into n subgroups and using them as the input to our ensemble algorithm. Either way, individual partitions in each ensemble are sequentially generated.

A. Selection of clusterings for prime fusion
Any layered approach has the drawback of being dependent on which clusterer is considered for initial fusion. This sensitivity is a major bottleneck in determining the accuracy of the outputs. In our approach, however, we compute an m x m symmetric clusterer compatibility matrix, where CCM[i][j] indicates the summary of information gain when the ith clusterer and the jth clusterer are merged. In this way we use heuristics to direct the fusion in the right direction.

B. Resolution for Label Correspondence Problem
The other issues in fusion of cluster ensembles are the label correspondence problem and the merging technique used for fusion. In the second phase, we address the label correspondence problem. The clustering results are combined in layered pairs, called the fusion joint set, FJmk. The criterion for merging can be any one of the Fusion Joint Identification Techniques, i.e., overshadowing, or use of the highest cardinality in the intersection set along with the add-on knowledge gathered from such association.

The first approach uses the degree of shadow that one cluster has on another. This is computed using the smallest circle or minimum covering circle approach, which is the mathematical problem of computing the smallest circle that contains all of a given set of points in the Euclidean plane. For each cluster of the clusterers in the two layers, we first compute the minimum bounding circle and its diameter, from which the degree of shadow (DOS) is computed. The aim is to find the clusters in different layers whose shadow overlap is maximized and then assign them to the matching pair set. This method finds the most appropriate clusters belonging to the two clusterings for the forthcoming fusion phase.

The second approach uses a heuristic greedy computation of mutual information to decide on the degree of compatibility. Mutual information is used when we need to decide which amongst candidate variables are closest to a particular variable. The higher the mutual information, the 'closer' the two variables are. It is the amount of information 'contained' in Y about X.

Let X and Y be the random variables described by the cluster labelings λ(a) and λ(b), with k(a) and k(b) groups respectively. Let I(X; Y) denote the mutual information between X and Y, and H(X) denote the entropy of X, i.e., a measure of the uncertainty associated with X. The chain rule for entropy states that

H(X_{1:n}) = H(X_1) + H(X_2 | X_1) + ... + H(X_n | X_{1:n−1})                          (1)

When X_{1:n} are independent, identically distributed (i.i.d.), then H(X_{1:n}) = n H(X_1). From Eqn 1, we have

        H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)




H(X) − H(X|Y ) = H(Y ) − H(Y |X).                                          (2)      The initial phase of the fusion starts with finding Bm Clusterers,
                                                                                                                                                B




This difference can be interpreted as the reduction in                                 using m different clustering algorithms. Next stage is to find
uncertainty in X after we know Y, or vice versa. It is thus                            the clusterers amongst Bm , with maximum compatibility
                                                                                                                   B




known as the information gain, or more commonly the mutual                             matrix index, for merging, so that they yield maximum
information between X and Y. Finding the maximum MI for                                knowledge for further fusions. When the merging happens,
the clusters in the clusterings is a combinatorial optimization                        based on the Fusion Joint Set, for each merged data point, the
problem, requiring an exhaustive search through all possible                           degree of agreement (DOA) is calculated.. For example, if the
clusterings with k labels. This is formidable since for n objects                      total number of clusterers are 5, then all the data points that
and k partitions there are approximately kn /k! for n >> k. For                        get merged at level 1, will have DOA as 1/5 = 0.2. This DOA
example, there are 171,798,901 ways to form four groups of                             value will be treated as the increment factor for every future
16 objects. Hence, instead of the complete greedy solution, we                         fusion. And also the DOA value in the corresponding DOA
have incorporated some heuristics so that cluster accuracy will                        vector be cumulative till it reaches the threshold level DOATh.
improve amidst the cost savings in terms of space &                                    Once the DOA of any point in the cluster crosses the threshold,
computations. As we travel the length and breadth of the                               it can be affirmed to belong to a particular cluster result and
ensemble space, we try to reduce the kn /k! Combinations, by                           will be treated as a strong link. . Thus, the normal voting
reuse of the cumulative information gain. This way, when n                             procedure with huge voting matrix, to confirm the majority
>> k, as in most of the cases of spatial data, the solutions can                       does not arise at all in our method.
be reached much faster.                                                                This final layer merge with the earlier combined clusters will
                                                                                       yield the robust combined result. This approach is not
C. Knowledge Guided fusion of clusterings – An Excerpt                                 computationally intensive, as we tend to use the first law of
                                                                                       geography in merging layer by layer. And also the computation
  Input:
                                                                                       of voting matrix is avoided.
  D – the input data in 2-dimensional feature space
                                                                                       The three levels of the technique: fusion joint identification,
  Layer : Group of Clusterers B1 to Bm ;
  Levels : List of clusters k1 to kn
                                                                                       guided fusion and resolving low voted data points are all
  CCM[i][j] : Clusterer compatability for ith clusterer with jth clusterer.            executed sequentially. They do not interfere with each other,
  Step1: Form B1k1 to Bikn clusters from D using B1 to Bm clusterers, each             but they just receive the results from the previous levels. No
  clusterer generating k clusters                                                      feedback process happens, and the algorithm terminates after
  Step 2; Compute Clusterer compatability matrix, whose entries are the
  aggregated cardinality values of the intersecting elements set of the
                                                                                       the completion of all procedures.
  clusterers.
  Step 3: Identify and select the harmonizing clusterers for fusion from the                     IV.   EXPERIMENTAL PLATFORM & RESULTS
  CCM matrix. as TobeMerged_Layers
  Step 4: Set DOA_Increment Factor as 1/ m .                                           A. The Test Platform
  Step 5: Find fusion joints for TobeMerged_Layers , (FJ12 1 .. FJ1 k) , using
  degree of Shadow overlap or maximizing the information gain of
                                                                                       Ensemble Creation : In order to predigest the analysis, the
  probable merge.                                                                      paper uses five representative clustering methods to produce
  Step 6: For every pair in the fusion joint Set, FJi k,                               five heterogeneous ensembles or clustering members, viz.
   Do{                                                                                 DBSCAN, k-means, PAM, Fuzzy K Means and Ward’s
         ClustData[i]      Union of Data points of the pair
         Initialize Vector_DOA with DOA_Increment Factor
                                                                                       algorithm. K-means is a very simple yet very powerful and
         Append it to to Vector_CData [i]                                              efficient iterative technique to partition a large data set into k
         For each element in the intersection set between Pairs,                       disjoint clusters. The objective criterion used in the algorithm
              DOA[i]       DOA[i] + DOA_Increment Factor                               is typically the squared-error function. DBSCAN method
          Increment the vector index I by 1 & merge_ layer by 2
       } until i <= k; //normal merge for m/2 layers
performs well with attribute data and with spatial data. Partitioning around medoids (PAM) is mostly preferred for its scalability and is hence useful for spatial data. The latest development in clustering is the use of fuzziness, so we have added Fuzzy C-means (FCM) as one of the clusterers in order to obtain a robust partition in the end result. These clustering techniques, along with different cluster sizes, form the input for our knowledge guided fusion technique.

Data Source: Most ensemble methods use sampling techniques to select the data for their experimental platform, but this heuristic results in the loss of some inherent data clusters, thereby reducing the quality of the clusters. We have tried to avoid sampling and to involve the whole dataset. For our experiments we have used the datasets available in the data repository of the University of California, Irvine.

  Step 7: Repeat steps 2 to 6 till merge_layer < m/2;
  // finalize the cluster elements at layer i and at level k
   do { If (Vector_DOA > DOA Th)
          Strong links k ← Corresponding elements of Vector_CData
        Else
          Weak links k ← Corresponding elements of Vector_CData
   } until all pairs at layer i are resolved
  Step 8: Using the strong links, finalize Final_Kluster k and continue gathering votes for the weak links.
  Step 9: From the (m/2 + 1)th layer, perform the pruned merging, where the strong links are pruned for the confirmed data points when they reappear. Data points in the weak links could be noise data points (Noise_Elements k), as their inherent votes were below the threshold value.
  Step 10: Return the robust clusters obtained from the m clusterers, Final_Kluster k and Noise_Elements k.
Figure 3.3. Excerpt of the guided fusion of ensembles
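As an illustration only, the thresholding in Steps 7 to 9 of Figure 3.3 could be sketched in Python as follows. The function names split_links and finalize, the toy agreement scores, and the threshold of 0.6 are hypothetical and are not taken from the implementation used in the experiments. The sketch assumes that Vector_DOA holds one agreement score per candidate group and that Vector_CData holds the corresponding elements, and it reads "pruning" as dropping already confirmed points from the weak links.

```python
def split_links(vector_doa, vector_cdata, doa_threshold):
    """Split candidate element groups into strong and weak links.

    A group is a strong link when its agreement score exceeds the
    threshold (Step 7); otherwise it is kept as a weak link.
    """
    strong, weak = [], []
    for doa, elements in zip(vector_doa, vector_cdata):
        target = strong if doa > doa_threshold else weak
        target.append(set(elements))
    return strong, weak


def finalize(strong_links, weak_links):
    """Confirm strong-link elements and collect candidate noise points.

    Elements of strong links form the final cluster (Step 8). Points that
    only appear in weak links are reported as possible noise (Step 9);
    points already confirmed are pruned from that set.
    """
    final_cluster = set().union(*strong_links)
    noise = set().union(*weak_links) - final_cluster
    return final_cluster, noise


# Toy input (assumed values): three candidate groups with their scores.
vector_doa = [0.9, 0.4, 0.75]
vector_cdata = [{1, 2, 3}, {3, 7}, {4, 5}]

strong, weak = split_links(vector_doa, vector_cdata, doa_threshold=0.6)
final_cluster, noise = finalize(strong, weak)
```

In this toy run, point 3 is confirmed through a strong link and is therefore removed from the weak link in which it reappears, so only point 7 remains as a candidate noise element.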
Metrics: We used the classification accuracy (CA) to measure the accuracy of an ensemble as the agreement between the
ensemble partition and the "true" partition. The classification accuracy is commonly used for evaluating clustering results. CA for a partition is calculated as the proportion of correctly labeled objects in the data set under the best re-labeling of the clusters. We have also measured the intra cluster and inter cluster density before and after using our cluster ensemble approach, which serves as a metric for the preciseness of the cluster groups so formed.

B. Validation of fusion Results
As clustering is a kind of learning without guidance, that is, unsupervised classification, it is difficult to evaluate the clustering quality. However, when class information is available for the data, it can be considered to express some of the inner distribution characteristics of the data to a certain degree. If such class information is not used during clustering, it can be used to evaluate the clustering result. A cluster label is taken to correspond to the known category with which it shares the largest number of objects. Thus many cluster labels might correspond to the same category, whereas one cluster label cannot correspond to many categories. The clustering results can then be evaluated against the class information.

The test results with the Iris dataset, Wine dataset, Half rings and Spiral dataset (Courtesy: UCI data repository) are promising and show better cluster accuracy. Two parameters were computed to verify our algorithm: the Cluster Correctness Factor (CCF) and the space complexity of the fusion of ensembles. The benchmark datasets mentioned above were tested with this technique and the CCF was found to be 100% in all cases. Normally, ensembling algorithms compute a voting matrix whose size is of the order of n², where n is the number of data points. Due to the knowledge guided fusion, along with the unique inherent voting scheme, the space complexity has been reduced to the order of n. This has a major impact not only on memory requirements but also on the total number of matrix computations.

C. Comparison of the Experimental Results
In our approach of knowledge guided fusion, we have combined the results of several independent cluster runs by computing the inherent voting of their results. Our phased, knowledge guided voting approach helps us to overcome the computationally infeasible simultaneous combination of all partitions and also increases the cluster accuracy (Figure 4.3.1). With the help of our scheme, we have shown that the consensus partition indeed converges to the true partition. With our technique, the inter cluster density (Figure 4.3.3) has been reduced by almost 40% when compared against the other clustering algorithms. For the benchmark Iris dataset it is around 11.47, whereas our cluster miner produces 6.77, implying that the latter has produced better clusters in terms of inter cluster density. We have observed that the intra cluster density (Figure 4.3.2) has increased, implying that the cluster quality has improved due to the guided approach used for ensemble fusion. For the standard benchmark Iris dataset, the intra cluster density achieved using normal clustering methods is 5.1283, whereas our technique generated an intra cluster density of 5.7125, implying that we have generated more precise clusters.

Figure 4.3.1. Comparison of error rates of the fused ensembles
Figure 4.3.2. Intra cluster density (with Iris dataset)
Figure 4.3.3. Inter cluster density (with Iris dataset)

V. CONCLUSION AND FUTURE WORK
In this paper we addressed the relabeling problem found in most cluster ensembles and resolved it without much computation, using the notion from the first law of geography. The cluster ensemble is a very general framework that enables a wide range of applications. We have applied the proposed guided cluster merging technique to spatial databases. The main issues in spatial databases are the cardinality of the data points and the increased number of dimensions. Most of the existing ensemble algorithms have to generate a voting matrix of at least the order of n², or an expensive graphical representation in which the number of vertices is equal to the number of data points. When n is very large, which is common in spatial datasets, this restriction is a major bottleneck in obtaining robust clusters in reasonable time and with high accuracy.

Our algorithm has resolved the relabeling using layered merging, guided by the gained information. Once elements move from the strong links to the final clusters, they do not participate in further computations. Hence, the computational cost is also greatly reduced. Usage of the cluster compatibility matrix gives us a good head start in the fusion process, which otherwise is a matter of sheer randomness.

The key goal of spatial data mining is to automate the knowledge discovery process. It is important to note that in this study it has been assumed that the user has a good knowledge of the data and of the hierarchies used in the mining process. The crucial input of deciding the value of k still affects the quality of the resultant clusters. Domain specific a priori knowledge can be
used as guidance for deciding the value k. We feel that semi supervised clustering using the domain knowledge could improve the quality of the mined clusters. We have used heterogeneous clusterers for our testing, but the approach can be tested with more new combinations of spatial clustering algorithms as base clusterers. This will help in exploring more natural clusters.

First, we identified several non-spatial datasets which are normally used as benchmarks for data clustering. Then we tested how our layer based methodology can work with spatial data. This setup must be exercised with larger datasets available in GIS areas and with satellite images. We evaluated our work and can conclude that, for targeting a specific platform and incorporating the spatial feature space, our automated layered merge approach is able to provide the necessary correctness with more efficiency, both in space requirements and in matrix computations. However, more work should be carried out to provide support for more real life data from satellites and for incomplete data. Future work in the short term will focus on how to acquire such datasets and continue with more testing, in spite of current security concerns in distributing such data.

ACKNOWLEDGMENT
This work has been partly done in the labs of The Oxford College of Engineering, Bangalore, where the author is currently working as a Professor in the department of Computer Science & Engineering. The authors would like to express their sincere gratitude to the Management and Principal of The Oxford College of Engineering for the support rendered during the testing of some of our modules. They also express their thanks to the University of California, Irvine, for the huge data repository made available for testing our knowledge guided approach of fusion of ensembles.

REFERENCES
[1]   M. Ester, H. Kriegel, J. Sander, X. Xu, "Clustering for Mining in Large Spatial Databases", Special Issue on Data Mining, KI-Journal, Tech Publishing, Vol. 1, 98.
[2]   A.L.N. Fred and A.K. Jain, "Data Clustering using Evidence Accumulation", in Proc. of the 16th International Conference on Pattern Recognition, ICPR 2002, Quebec City.
[3]   A.L.N. Fred and A.K. Jain, "Robust data clustering", in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, USA, 2003.
[4]   V. Filkov and S. Skiena, "Integrating microarray data by consensus clustering", in International Conference on Tools with Artificial Intelligence, 2003.
[5]   K. Koperski and J. Han, "Discovery of spatial Rules in Geographic Information Databases", Proc. 4th Intl Symposium on Large Spatial Databases, pp. 47-66, 95.
[6]   Kai Kang, Hua-Xiang Zhang, Ying Fan, "A Novel Clusterer Ensemble Algorithm Based on Dynamic Cooperation", IEEE 5th International Conf. on Fuzzy Systems and Knowledge Discovery, 2008.
[7]   C.J. Matheus, P.K. Chan, and G. Piatetsky-Shapiro, "Systems for Knowledge Discovery in Databases", IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 903-913, 1993.
[8]   R.T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining", Proc. 20th Int. Conf. on Very Large Data Bases, pp. 144-155, Santiago, Chile, 1994.
[9]   B.H. Park and H. Kargupta, "Distributed Data Mining", in The Handbook of Data Mining, Ed. Nong Ye, Lawrence Erlbaum Associates, 2003.
[10]  J. Roddick and B.G. Lees, "Paradigms for Spatial and Spatio-Temporal Data Mining", in Geographic Data Mining and Knowledge Discovery, Taylor & Francis, 2001.
[11]  Su-lan Zhai, Bin Luo, Yu-tang Guo, "Fuzzy Clustering Ensemble Based on Dual Boosting", Fourth International Conference on Fuzzy Systems and Knowledge Discovery, 07.
[12]  H. Samet, "Spatial Data Models and Query Processing", in Modern Database Systems: The Object Model, Interoperability, and Beyond, Addison Wesley/ACM Press, Reading, MA, 1994.
[13]  A. Strehl, J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining multiple partitions", Journal of Machine Learning Research, 3: 583-618, 2002.
[14]  A. Strehl, J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining partitionings", in Proc. of 11th National Conference on Artificial Intelligence, NCAI, Edmonton, Alberta, Canada, pp. 93-98, 2002.
[15]  Y. Tao, J. Zhang, D. Papadias, and N. Mamoulis, "An Efficient Cost Model for Optimization of Nearest Neighbor Search in Low and Medium Dimensional Spaces", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, pp. 1169-1184, 2004.
[16]  X. Wang and H.J. Hamilton, "Clustering Spatial Data in the Presence of Obstacles", International Journal on Artificial Intelligence Tools, vol. 14, no. 1-2, pp. 177-198, 2005.
[17]  R. Xu and D. Wunsch II, "Survey of clustering algorithms", IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645-678, 2005.
[18]  J. Zhang, "Polygon-based Spatial clustering and its application in watershed study", MS Thesis, University of Nebraska-Lincoln, December 2004.
[19]  Y. Zeng, J. Tang, J. Garcia-Frias, and G.R. Gao, "An Adaptive Meta-Clustering Approach: Combining The Information From Different Clustering Results", CSB2002 IEEE Computer Society Bioinformatics Conference Proceeding.

AUTHORS PROFILE
RJ Anandhi is a PhD student in the department of Computer Science & Engineering at Dr M G R University. She is currently working as a professor in the Department of Computer Science at Oxford College of Engineering, Bangalore. She completed her BE degree at Bharatiyar University and her MTech degree at Pondicherry Central University. Her research interests are in Spatial Data mining and ANT algorithms.

Dr. Natarajan initially worked in the Defence Research and Development Laboratory (DRDL) for five years in the area of software development for defence missions. He then worked for 28 years in the National Remote Sensing Agency (NRSA) in areas pertaining to DIP and GIS for several remote sensing missions such as IRS-1A, IRS-1B, IRS-1C, IKONOS and LANDSAT. As Project Manager of the Ground Control Point Library (GCPL) Project, he completed the task of computing cm level accuracy for 3000 locations within India, which is being used for cartographic satellite missions. He was the Deputy Project Director of Large Scale Mapping (LSM) of the Department of Space. Dr Natarajan has published about fifteen papers in National/International Conferences and Journals. His research interests are Data mining, GIS and Spatial Databases.

				