A Robust Knowledge-Guided Fusion of Clustering Ensembles

(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 8, No. 4, July 2010
Anandhi R J
Research Scholar, Dept of CSE, Dr MGR University, Chennai, India
rjanandhi@hotmail.com

Dr Natarajan Subramaniyan
Professor, Dept of ISE, PES Institute of Technology, Bangalore, India
snatarajan44@gmail.com
Abstract— Discovering interesting, implicit knowledge and general relationships in geographic information databases is very important for understanding and using spatial data. Spatial clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. In this paper, we show that a guided approach to combining the outputs of the various clusterers reduces the intensive computations and also yields robust clusters. We discuss our proposed layered cluster merging technique for spatial datasets and use it in our three-phase clustering combination technique. At the first level, m heterogeneous ensembles are run against the same spatial data set to generate the results $B_1 \ldots B_m$. The major challenge in the fusion of ensembles is the generation of the voting matrix or proximity matrix, which is of the order of $n^2$, where n is the number of data points. This is very expensive in both time and space for spatial datasets. Instead, in our method, we compute a symmetric clusterer compatibility matrix of order (m x m), where m is the number of clusterers and m << n, using the cumulative similarity between the clusters of the clusterers. This matrix is used to identify which two clusterers, if considered for fusion initially, will provide more information gain. As we travel down the layered merge, for every layer we calculate a factor called Degree of Agreement (DOA), based on the agreed clusterers. Using the updated DOA at every layer, the movement of unresolved, unsettled data elements is handled at a much reduced computational cost. In addition, we prune the datasets after every (m-1)/2 layers, using the knowledge gained in the previous layer. This helps in faster convergence compared to the existing cluster aggregation techniques. The correctness and efficiency of the proposed cluster ensemble algorithm are demonstrated on real-world datasets available in the UCI data repository.

Keywords- Clustering ensembles, Spatial Data mining, Degree of Agreement, Cluster Compatibility matrix.

I. INTRODUCTION
With a variety of applications, large amounts of spatial and related non-spatial data are collected and stored in Geographic Information Databases. Spatial data mining [1] (i.e., discovering interesting, implicit knowledge and general relationships in large spatial databases) is an important task for understanding and using these spatial data. With the rapid growth in size and number of available databases in commercial, industrial, administrative and other applications, it is necessary and interesting to examine how to extract knowledge automatically from huge amounts of data. Very large data sets present a challenge for both humans and machine learning algorithms. Machine learning algorithms can be inundated by the flood of data and become very slow in knowledge extraction. Moreover, along with the large amount of data available, there is also a compelling need to produce results accurately and fast.

Efficiency and scalability are, indeed, the key issues when designing data mining systems for very large data sets. Through the extraction of knowledge in databases, large databases will serve as a rich, reliable source for knowledge generation and verification, and the discovered knowledge can be applied to information management, query processing, decision-making, process control and many other applications. Therefore, data mining has been considered one of the most important topics in databases by many database researchers.

Spatial data describes information related to the space occupied by objects. It consists of 2D or 3D points, polygons etc., or points in some d-dimensional feature space. It can be either discrete or continuous. Discrete spatial data might be a single point in multi-dimensional space, while continuous spatial data spans a region of space. This data might consist of medical images or map regions, and it can be managed through spatial databases [8].

Clustering [17] groups analogous elements in a data set according to their similarity, such that elements in each cluster are similar while elements from different clusters are dissimilar. It does not require class label information about the data set because it is inherently a data-driven approach. Clustering is therefore the most interesting and well-developed method of manipulating and cleaning spatial data in order to prepare it for spatial data mining analysis, and it has been recognized as a primary data mining method for knowledge discovery in spatial databases [4-7].

Clustering fusion is the integration of results from various clustering algorithms using a consensus function to yield stable results. Clustering fusion approaches are receiving increasing attention for their capability of improving clustering performance. At present, the usual operational mechanism for
284 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
clustering fusion is the combining of clusterer outputs. One tool for such combining or consolidation of results from a portfolio of individual clustering results is a cluster ensemble [13]. It was shown to be useful in a variety of contexts such as "Quality and Robustness" [3], "Knowledge Reuse" [13,14], and "Distributed Computing" [9].

The rest of the paper is organized as follows. The related work is in Section 2. The proposed knowledge guided fusion ensemble technique is in Section 3. In Section 4, we present the experimental test platform and results with discussion. Finally, we conclude with a summary and our planned future work in this area of research.

II. RELATED WORK

A. Literature on Clustering Algorithms
Many clustering algorithms have been developed, and they can be roughly classified into hierarchical and non-hierarchical approaches. Non-hierarchical approaches can be divided into four categories: partitioning methods, density-based methods, grid-based methods, and model-based methods. Hierarchical algorithms can be further divided into agglomerative and divisive algorithms, corresponding to bottom-up and top-down strategies for building a hierarchical clustering tree.

Spatial data mining, or knowledge discovery in spatial databases, refers to the extraction, from spatial databases, of implicit knowledge, spatial relations, or other patterns that are not explicitly stored [8, 10]. The large size and high dimensionality of spatial data make the complex patterns that lurk in the data hard to find. It is expected that the coming years will witness a very large number of objects that are location enabled to varying degrees. Spatial clustering [8] has been used as an important process in areas such as geographic analysis, exploring data from sensor networks, traffic control, and environmental studies. Spatial data clustering has been identified as an important technique for many applications, and several techniques have been proposed over the past decade based on density-based strategies, random walks, grid-based strategies, and brute-force exhaustive searching methods [5]. This paper deals with the fusion of spatial cluster ensembles using a guided approach to reduce the space complexity of such fusion algorithms.

Spatial data is about instances located in a physical space. Spatial clustering aims to group similar objects into the same group considering the spatial attributes of the objects. The existing spatial clustering algorithms in the literature focus exclusively either on the spatial distances or on minimizing the distance between object attribute pairs, i.e., either the locations are considered as just another attribute or the non-spatial attribute distances are ignored. Much activity in spatial clustering focuses on clustering objects based on their location nearness to each other [5]. Finding clusters in spatial data is an active research area, and the current non-spatial clustering algorithms are applied to the spatial domain, with recent applications and results reported on the effectiveness and scalability of the algorithms [8, 16].

Partitioning algorithms are best suited to problems where minimization of a distance function is required, and a common measure used in such algorithms is the Euclidean distance. Recently, a new set of spatial clustering algorithms has been proposed, which represents a faster method to find clusters with overlapping densities. DBSCAN, GDBSCAN and DBRS are density-based spatial clustering algorithms, but they each perform best only on particular types of datasets [17].

However, these algorithms also ignore the participation of non-spatial attributes and require user-defined parameters. For large-scale spatial databases, the current density-based clustering algorithms can be expensive, as they require a large volume of memory because they operate over the entire database. Another disadvantage is that the input parameters required by these algorithms are based on experimental evaluations. There is large interest in automating the general-purpose clustering approach without user intervention. However, it is difficult to expect accurate results from these algorithms, as each one has its own shortfalls.

B. Literature on Clustering Ensembles
A clustering ensemble is a method to combine several runs of different clustering algorithms to get an optimal partition of the original dataset. Given a dataset $X = \{x_1, x_2, \ldots, x_n\}$, a cluster ensemble is a set of clustering solutions, represented as $P = \{P_1, P_2, \ldots, P_r\}$, where r is the ensemble size, i.e. the number of clusterings in the ensemble. The clustering-ensemble approach first gets the results of m clusterers, then sets up a consensus function to fuse each label vector and obtain the final labeled vector. The goal of a cluster ensemble is to combine the clustering results of multiple clustering algorithms to obtain better-quality and more robust clustering results. Even though many clustering algorithms have been developed, not much work has been done on cluster ensembles in the data mining and machine learning community.

Strehl and Ghosh [13,14] proposed a hypergraph-partitioning approach to combine different clustering results by treating each cluster in an individual clustering as a hyperedge. All three of their proposed algorithms approach the problem by first transforming the set of clusterings into a hypergraph representation. The Cluster-based Similarity Partitioning Algorithm (CSPA) uses the relationship between objects in the same cluster to establish a measure of pairwise similarity. In the HyperGraph Partitioning Algorithm (HGPA), the maximum mutual information objective is approximated with a constrained minimum cut objective. In their Meta-CLustering Algorithm (MCLA), the objective of integration is viewed as a cluster correspondence problem.

Kai Kang, Hua-Xiang Zhang and Ying Fan [6] formulated the process of cooperation between component clusterers and proposed a novel cluster ensemble learning technique based on dynamic cooperation (DCEA). The approach is mainly concerned with how the component clusterers fully cooperate while they are being trained.

Fred and Jain [2] used a co-association matrix to form the final partition. They applied hierarchical (single-link) clustering to the co-association matrix. Zeng, Tang, Garcia-Frias and Gao [19] proposed an adaptive meta-clustering approach for
combining different clustering results by using a distance matrix.

C. Fusion Framework
We begin our discussion of the guided ensemble fusion framework by presenting our notation. Let us consider a set of n data objects, $D = \{v_1, \ldots, v_n\}$. A clustering C of the dataset D is a partition of D into k disjoint sets $C_1, \ldots, C_k$. In the sequel we consider m clusterings; we write $B_i$ to denote the ith clustering, and $k_i$ for the number of clusters of $B_i$. In the clustering fusion problem the task is to find a clustering that maximizes the related items with a number of already-existing clusterings [4].

D. Definitions
• Fusion Joint set, $FJ_{ij}$:
Fusion Joint set $FJ_{ij}$ refers to the set of matching pairs of the ith clusterer's jth cluster. For instance, $FJ_{12}$ refers to the probable fusion spot for the first clusterer's clusters with the second clusterer's clusters. It is used for deciding where the fusion is most likely to yield optimal preciseness of clusters.

• Clusterer Compatibility matrix: CCM (m X m)
The Clusterer Compatibility matrix is an m X m symmetric matrix, where m is the total number of clusterers considered for fusion.

• CCM[i][j]
Integer value d representing the maximum information gained through the summation of the cardinalities of the intersection elements of the matching pairs of clusterers found in the Fusion Joint Set FJ[i][j].

• Degree Of Agreement Factor: (DOA)
The degree of agreement factor is the ratio of the index of the merging level to the total number of clusterers. This DOA value is cumulative until it reaches the threshold level $DOA_{Th}$, a user-assigned value indicating the majority required for decision making. Under a normal scenario, $DOA_{Th}$ is set to 50% of the number of clusterers.

• Degree of Shadow factor: (DOS)
The degree of shadow factor is the maximized value of the intersection of the two minimum bounding circles of the k clusters with the ith cluster from a different clustering.

III. KNOWLEDGE GUIDED ENSEMBLE FUSION
In this section we discuss our proposed layered cluster ensemble fusion, guided by the knowledge gained during the merging process. The first phase of the algorithm is the preparation of B heterogeneous ensembles. This is done by executing different clustering algorithms against the same spatial data set to generate partitioning results. For our experimental purposes, we have also generated homogeneous ensembles by partitioning the spatial data horizontally/vertically into n subgroups and used them as the input to our ensemble algorithm. Either way, the individual partitions in each ensemble are sequentially generated.

A. Selection of clusterings for prime fusion
Any layered approach has the drawback of being dependent on which clusterer is considered for the initial fusion. This sensitivity is a major bottleneck in deciding the accuracy of the outputs. In our approach, we compute an m x m symmetric clusterer compatibility matrix, where CCM[i][j] indicates the summary of the information gain when the ith clusterer and the jth clusterer are merged. This way we have used heuristics to direct the fusion in the right direction.

B. Resolution for Label Correspondence Problem
The other issues in the fusion of cluster ensembles are the label correspondence problem and the merging technique used for fusion. In the second phase, we address the label correspondence problem. The clustering results are combined in layered pairs, called the fusion joint set, $FJ_{mk}$. The criterion for merging can be any one of the Fusion Joint Identification Techniques, i.e., overshadowing or the highest cardinality in the intersection set, along with the add-on knowledge gathered from such association.

The first approach uses the degree of shadow that one cluster has on another. This is computed using the smallest circle or minimum covering circle approach, which is the mathematical problem of computing the smallest circle that contains all of a given set of points in the Euclidean plane. For each cluster of the clusterers in two layers, we first compute the minimum bounding circle and its diameter, from which the Degree of Shadow (DOS) is computed. The aim is to find the clusters in different layers whose shadow overlap is maximized and then assign them to the matching pair set. This method finds the most appropriate clusters belonging to two clusterings for the forthcoming fusion phase.

The second approach uses a heuristic greedy approach based on mutual information to decide on the degree of compatibility. Mutual information is used when we need to decide which amongst candidate variables are closest to a particular variable. The higher the mutual information, the 'closer' the two variables are. It is the amount of information 'contained' in Y about X.

Let X and Y be the random variables described by the cluster labelings $\lambda^{(a)}$ and $\lambda^{(b)}$, with $k^{(a)}$ and $k^{(b)}$ groups respectively. Let I(X; Y) denote the mutual information between X and Y, and H(X) denote the entropy of X, i.e., a measure of the uncertainty associated with X. The chain rule for entropy states that

$$H(X_{1:n}) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_{1:n-1}) \qquad (1)$$

When $X_{1:n}$ are independent, identically distributed (i.i.d.), then $H(X_{1:n}) = nH(X_1)$. From Eqn. 1, we have

$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$$
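The two expansions of H(X, Y) above can be checked numerically. The following is a minimal sketch and not part of the proposed method: the label vectors `x` and `y` are made-up stand-ins for two cluster labelings, entropies are estimated from empirical label frequencies, and the helper names `entropy` and `conditional_entropy` are hypothetical.

```python
from collections import Counter
from math import isclose, log2


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) = sum over g of p(g) * H(target | given = g)."""
    n = len(target)
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())


# Hypothetical labelings of eight objects by two clusterers (assumed data).
x = [0, 0, 0, 1, 1, 1, 2, 2]
y = [0, 0, 1, 1, 1, 2, 2, 2]

h_joint = entropy(list(zip(x, y)))               # H(X, Y)
via_x = entropy(x) + conditional_entropy(y, x)   # H(X) + H(Y|X)
via_y = entropy(y) + conditional_entropy(x, y)   # H(Y) + H(X|Y)

# Reduction in uncertainty about one labeling once the other is known.
gain_xy = entropy(x) - conditional_entropy(x, y)  # H(X) - H(X|Y)
gain_yx = entropy(y) - conditional_entropy(y, x)  # H(Y) - H(Y|X)

print(isclose(h_joint, via_x), isclose(h_joint, via_y), isclose(gain_xy, gain_yx))
```

Both routes to the joint entropy agree up to floating-point error, and so do the two differences, which is why either one can serve as the compatibility score between a pair of labelings.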
Rearranging the two expansions of H(X, Y) gives

$$H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). \qquad (2)$$

This difference can be interpreted as the reduction in uncertainty in X after we know Y, or vice versa. It is thus known as the information gain, or more commonly the mutual information (MI) between X and Y. Finding the maximum MI for the clusters in the clusterings is a combinatorial optimization problem, requiring an exhaustive search through all possible clusterings with k labels. This is formidable, since for n objects and k partitions there are approximately $k^n/k!$ such clusterings for n >> k. For example, there are 171,798,901 ways to form four groups of 16 objects. Hence, instead of the complete greedy solution, we have incorporated some heuristics so that cluster accuracy improves alongside the cost savings in space and computation. As we travel the length and breadth of the ensemble space, we try to reduce the $k^n/k!$ combinations by reusing the cumulative information gain. This way, when n >> k, as in most cases of spatial data, the solutions can be reached much faster.

C. Knowledge Guided fusion of clusterings – An Excerpt

Input:
    D : the input data in 2-dimensional feature space
    Layer : group of clusterers B1 to Bm
    Levels : list of clusters k1 to kn
    CCM[i][j] : clusterer compatibility of the ith clusterer with the jth clusterer
Step 1: Form B1k1 to Bikn clusters from D using the B1 to Bm clusterers, each clusterer generating k clusters.
Step 2: Compute the clusterer compatibility matrix, whose entries are the aggregated cardinality values of the intersecting element sets of the clusterers.
Step 3: Identify and select the harmonizing clusterers for fusion from the CCM matrix as TobeMerged_Layers.
Step 4: Set DOA_Increment_Factor to 1/m.
Step 5: Find the fusion joints for TobeMerged_Layers, (FJ12 1 .. FJ1 k), using the degree of shadow overlap or by maximizing the information gain of the probable merge.
Step 6: For every pair in the fusion joint set FJi k,
    do {
        ClustData[i] ← union of the data points of the pair
        Initialize Vector_DOA with DOA_Increment_Factor
        Append it to Vector_CData[i]
        For each element in the intersection set between the pair,
            DOA[i] ← DOA[i] + DOA_Increment_Factor
        Increment the vector index i by 1 and merge_layer by 2
    } until i > k;    // normal merge for m/2 layers
Step 7: Repeat Steps 2 to 6 while merge_layer < m/2;
    // finalize the cluster elements at layer i and at level k
    do {
        if (Vector_DOA > DOATh)
            Strong_links ← corresponding elements of Vector_CData
        else
            Weak_links k ← corresponding elements of Vector_CData
    } until all pairs at layer i are resolved
Step 8: Using Strong_links, finalize Final_Kluster k and continue gathering votes for the weak links.
Step 9: From the (m/2 + 1)th layer, perform the pruned merging, where the strong links are pruned for the confirmed data points when they reappear. Data points in the weak links could be noise data points (Noise_Elements k), as their inherent votes were below the threshold value.
Step 10: Return the robust clusters obtained from the m clusterers, Final_Kluster k and Noise_Elements k.

Figure 3.3. Excerpt of the guided fusion of ensembles

The initial phase of the fusion starts with finding the clusterings $B_1 \ldots B_m$, using m different clustering algorithms. The next stage is to find the clusterers amongst $B_1 \ldots B_m$ with the maximum compatibility matrix index for merging, so that they yield maximum knowledge for further fusions. When the merging happens, based on the Fusion Joint Set, the degree of agreement (DOA) is calculated for each merged data point. For example, if the total number of clusterers is 5, then all the data points that get merged at level 1 will have a DOA of 1/5 = 0.2. This DOA value is treated as the increment factor for every future fusion. The DOA value in the corresponding DOA vector is cumulative until it reaches the threshold level $DOA_{Th}$. Once the DOA of any point in the cluster crosses the threshold, the point can be affirmed to belong to a particular cluster result and is treated as a strong link. Thus, the normal voting procedure with a huge voting matrix to confirm the majority does not arise at all in our method.

The final layer merge with the earlier combined clusters yields the robust combined result. This approach is not computationally intensive, as we use the first law of geography in merging layer by layer, and the computation of the voting matrix is avoided.

The three levels of the technique (fusion joint identification, guided fusion, and resolving low-voted data points) are all executed sequentially. They do not interfere with each other; each just receives the results from the previous level. No feedback process happens, and the algorithm terminates after the completion of all procedures.

IV. EXPERIMENTAL PLATFORM & RESULTS

A. The Test Platform
Ensemble Creation: In order to simplify the analysis, the paper uses five representative clustering methods to produce five heterogeneous ensembles or clustering members, viz. DBSCAN, k-means, PAM, Fuzzy C-Means and Ward's algorithm. K-means is a very simple yet powerful and efficient iterative technique to partition a large data set into k disjoint clusters. The objective criterion used in the algorithm is typically the squared-error function. The DBSCAN method performs well with attribute data and with spatial data. Partitioning around medoids (PAM) is mostly preferred for its scalability and is hence useful for spatial data. The latest development in clustering is the use of fuzziness, and we have added Fuzzy C-Means (FCM) as one of the clusterers, so that we get a robust partition in the end result. These clustering techniques, along with different cluster sizes, form the input for our knowledge guided fusion technique.

Data Source: Most ensemble methods use sampling techniques to select the data for their experimental platform, but this heuristic results in losing some inherent data clusters, thereby reducing the quality of the clusters. We have tried to avoid sampling and involve the whole dataset. For our experiments we have used the datasets available in the data repository of the University of California, Irvine.
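With five base clusterers in place, the bookkeeping that precedes the first fusion can be illustrated on toy data. The sketch below is only one possible reading of the technique, under stated assumptions: the five label vectors are hand-made stand-ins for clusterer outputs, fusion joints are found by a greedy one-to-one matching on intersection size (an assumed simplification of the "highest cardinality in intersection set" criterion), and all function names are hypothetical.

```python
from itertools import combinations


def clusters_of(labels):
    """Map each cluster label to the set of point indices carrying it."""
    members = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, set()).add(idx)
    return members


def fusion_joints(labels_a, labels_b):
    """Greedily pair clusters of two clusterers by largest intersection.

    Assumption: a one-to-one matching on intersection cardinality stands in
    for the fusion joint identification step. Returns the shared point sets.
    """
    ca, cb = clusters_of(labels_a), clusters_of(labels_b)
    overlaps = sorted(
        ((len(pa & pb), a, b) for a, pa in ca.items() for b, pb in cb.items()),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_a, used_b, joints = set(), set(), []
    for size, a, b in overlaps:
        if size and a not in used_a and b not in used_b:
            joints.append(ca[a] & cb[b])
            used_a.add(a)
            used_b.add(b)
    return joints


def compatibility_matrix(ensemble):
    """Symmetric m x m matrix of summed intersection sizes over fusion joints."""
    m = len(ensemble)
    ccm = [[0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        joints = fusion_joints(ensemble[i], ensemble[j])
        ccm[i][j] = ccm[j][i] = sum(len(shared) for shared in joints)
    return ccm


# Hypothetical label vectors for 8 points from m = 5 clusterers (assumed data).
ensemble = [
    [0, 0, 0, 1, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0, 2, 2],  # same partition as the first, relabelled
    [0, 0, 1, 1, 1, 2, 2, 2],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
m = len(ensemble)
ccm = compatibility_matrix(ensemble)

# Prime fusion: the pair of clusterers with the largest CCM entry.
prime = max(combinations(range(m), 2), key=lambda p: ccm[p[0]][p[1]])

# First-layer DOA: every point inside a fusion joint gains 1/m.
doa_increment = 1 / m
doa = [0.0] * len(ensemble[0])
for shared in fusion_joints(ensemble[prime[0]], ensemble[prime[1]]):
    for point in shared:
        doa[point] += doa_increment
```

The matrix has only m x m entries regardless of the number of data points, and the DOA vector has one entry per point, which is the source of the space saving over an n x n voting matrix. After this single layer every matched point sits at 1/5 = 0.2, so with a threshold of 50% none of them would yet count as a strong link; later layers would add further increments.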
Metrics: We used the classification accuracy (CA) to measure
the accuracy of an ensemble as the agreement between the
ensemble partition and the "true" partition. Classification accuracy is commonly used for evaluating clustering results. To guarantee the best re-labeling of the clusters, the proportion of correctly labeled objects in the data set is calculated as the CA for the partition. We have also measured the intra-cluster and inter-cluster density before and after using our cluster ensemble approach, as a metric for the preciseness of the cluster groups so formed.

B. Validation of fusion Results
As clustering is a kind of learning without guidance, i.e. unsupervised classification, it is difficult to evaluate the clustering efficiency. But when class information for the data is available, it can be considered that some inner distribution characteristics of the data are expressed to a certain degree. If such class information is not used by the clustering, it can be used to evaluate the clustering effect. If the number of objects shared by a certain clustering label of the labeled vectors and a certain known category is the largest, this clustering label corresponds to this known category. Thus many clustering labels might correspond to the same category, whereas one clustering label cannot correspond to many categories. The clustering results can therefore be evaluated using the class information.

The test results with the Iris dataset, Wine dataset, Half-rings and Spiral dataset (Courtesy: UCI data repository) are promising and show better cluster accuracy. Two parameters were computed to verify our algorithm: the Cluster Correctness Factor (CCF) and the space complexity of the fusion of ensembles. The benchmark datasets mentioned above were tested with this technique, and the CCF was found to be 100% in all cases. Normally, in all ensemble algorithms, a voting matrix is computed which is of the order of $n^2$, where n is the number of data points. But due to the knowledge guided fusion along with the unique inherent voting scheme, the space complexity has been reduced to the order of n. This has a major impact not only on memory requirements but also on the total number of matrix computations.

C. Comparison of the Experimental Results
In our approach of knowledge guided fusion, we have combined the results of several independent cluster runs by computing the inherent voting of their results. Our phased, knowledge guided voting approach helps us to overcome the computationally infeasible simultaneous combination of all partitions and also increases the cluster accuracy (Figure 4.3.1). With the help of our scheme, we have shown that the consensus partition indeed converges to the true partition. With our technique, the inter-cluster density (Figure 4.3.3) has been reduced by almost 40% when compared against the other clustering algorithms. For the benchmark Iris dataset it is around 11.47 with the other algorithms, while our cluster miner produces 6.77, implying that the latter has produced better clusters in terms of inter-cluster density. We have observed that the intra-cluster density (Figure 4.3.2) has increased, implying that the cluster quality has improved due to the guided approach used for ensemble fusion. For the standard benchmark Iris dataset, the intra-cluster density achieved using normal clustering methods is 5.1283, whereas our technique generated an intra-cluster density of 5.7125, implying that we have generated more precise clusters.

Figure 4.3.1 Comparison of error rates of the fused ensembles
Figure 4.3.2 Intra cluster density (with Iris dataset)
Figure 4.3.3 Inter cluster density (with Iris dataset)

V. CONCLUSION AND FUTURE WORK
In this paper we addressed the relabeling problem found in most cluster ensembles and resolved it without much computation, using the notion of the first law of geography. The cluster ensemble is a very general framework that enables a wide range of applications. We have applied the proposed guided cluster merging technique to spatial databases. The main issue in spatial databases is the cardinality of the data points and also the increased dimensions. Most of the existing ensemble algorithms have to generate a voting matrix of at least the order of $n^2$, or an expensive graphical representation with as many vertices as there are data points. When n is very large, as is common in spatial datasets, this restriction is a very big bottleneck in obtaining robust clusters in reasonable time and with high accuracy.

Our algorithm has resolved the relabeling using layered merging guided by the gained information. Once elements move from strong links to final clusters, they do not participate in further computations. Hence, the computational cost is also hugely reduced. Use of the clusterer compatibility matrix gives us a good head start in the fusion process, which otherwise is a matter of sheer randomness.

The key goal of spatial data mining is to automate the knowledge discovery process. It is important to note that in this study it has been assumed that the user has a good knowledge of the data and of the hierarchies used in the mining process. The crucial input of deciding the value of k still affects the quality of the resultant clusters. Domain-specific a priori knowledge can be
used as guidance for deciding the value k. We believe that
semi-supervised clustering using domain knowledge could
improve the quality of the mined clusters. We used
heterogeneous clusterers in our testing, but the approach can
also be tested with new combinations of spatial clustering
algorithms as base clusterers, which would help in exploring
more natural clusters.

First, we identified several non-spatial datasets that are
commonly used as benchmarks for data clustering. Then we
tested how our layer-based methodology works with spatial
data. This setup still needs to be exercised on larger datasets
available in GIS areas and on satellite images. From our
evaluation we conclude that, when targeting a specific platform
and incorporating the spatial feature space, our automated
layered merge approach provides the necessary correctness with
greater efficiency, both in space requirements and in matrix
computations. However, more work is needed to support
real-life satellite data and incomplete data. In the short term,
future work will focus on acquiring such datasets and on
further testing, despite current security concerns about
distributing such data.

ACKNOWLEDGMENT
This work has been partly done in the labs of The Oxford
College of Engineering, Bangalore, where the first author is
currently working as a Professor in the Department of
Computer Science & Engineering. The authors would like to
express their sincere gratitude to the Management and
Principal of The Oxford College of Engineering for the
support rendered during the testing of some of our modules.
They also thank the University of California, Irvine, for the
large data repository made available for testing our
knowledge-guided approach to the fusion of ensembles.

AUTHORS PROFILE
RJ Anandhi is a PhD student in the Department of Computer
Science & Engineering at Dr M G R University. She is currently
working as a professor in the Department of Computer Science
at Oxford College of Engineering, Bangalore. She completed her
BE degree at Bharatiyar University and her MTech degree at
Pondicherry Central University. Her research interests are in
spatial data mining and ANT algorithms.

Dr. Natarajan initially worked at the Defence Research and
Development Laboratory (DRDL) for five years in the area of
software development for defence missions. He then worked for
28 years at the National Remote Sensing Agency (NRSA) in areas
pertaining to DIP and GIS for several remote sensing missions
such as IRS-1A, IRS-1B, IRS-1C, IKONOS and LANDSAT. As Project
Manager of the Ground Control Point Library (GCPL) Project, he
completed the task of computing cm-level accuracy for 3000
locations within India, which are being used for cartographic
satellite missions. He was the Deputy Project Director of Large
Scale Mapping (LSM) of the Department of Space. Dr. Natarajan
has published about fifteen papers in national and
international conferences and journals. His research interests
are data mining, GIS and spatial databases.