The International Journal of Computer Science and Information Security is a monthly periodical on research articles in general computer science and information security which provides a distinctive technical perspective on novel technical research work, whether theoretical, applicable, or related to implementation. Target Audience: IT academics, university IT faculties; and business people concerned with computer science and security; industry IT departments; government departments; the financial industry; the mobile industry and the computing industry. Coverage includes: security infrastructures, network security: Internet security, content protection, cryptography, steganography and formal methods in information security; multimedia systems, software, information systems, intelligent systems, web services, data mining, wireless communication, networking and technologies, innovation technology and management. Thanks for your contributions in July 2010 issue and we are grateful to the reviewers for providing valuable comments. IJCSIS July 2010 Issue (Vol. 8, No. 4) has an acceptance rate of 36 %.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

A Robust Knowledge-Guided Fusion of Clustering Ensembles

Anandhi R J
Research Scholar, Dept of CSE, Dr MGR University, Chennai, India
rjanandhi@hotmail.com

Dr Natarajan Subramaniyan
Professor, Dept of ISE, PES Institute of Technology, Bangalore, India
snatarajan44@gmail.com

Abstract: Discovering interesting, implicit knowledge and general relationships in geographic information databases is very important for understanding and using spatial data. Spatial clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. In this paper, we show that a guided approach to combining the outputs of various clusterers reduces the intensive computation involved and yields robust clusters. We describe our proposed layered cluster merging technique for spatial datasets and use it in our three-phase clustering combination technique. At the first level, m heterogeneous ensembles are run against the same spatial dataset to generate results B_1 ... B_m. The major challenge in the fusion of ensembles is the generation of the voting matrix or proximity matrix, which is of the order of n^2, where n is the number of data points. This is very expensive in both time and space for spatial datasets. Instead, our method computes a symmetric clusterer compatibility matrix of order (m x m), where m is the number of clusterers and m << n, using the cumulative similarity between the clusters of the clusterers. This matrix is used to identify which two clusterers, if fused first, will provide the most information gain. As we travel down the layered merge, we calculate for every layer a factor called the Degree of Agreement (DOA), based on the agreed clusterers. Using the updated DOA at every layer, the movement of unresolved, unsettled data elements is handled at a much reduced computational cost. In addition, we prune the datasets after every (m-1)/2 layers, using the knowledge gained in the previous layer. This leads to faster convergence than existing cluster aggregation techniques. The correctness and efficiency of the proposed cluster ensemble algorithm are demonstrated on real-world datasets available in the UCI data repository.

Keywords: Clustering ensembles, Spatial data mining, Degree of Agreement, Cluster Compatibility matrix.

I. INTRODUCTION

With a variety of applications, large amounts of spatial and related non-spatial data are collected and stored in Geographic Information Databases. Spatial data mining [1] (i.e., discovering interesting, implicit knowledge and general relationships in large spatial databases) is an important task for understanding and using these spatial data. With the rapid growth in the size and number of databases available in commercial, industrial, administrative and other applications, it is necessary and interesting to examine how to extract knowledge automatically from huge amounts of data. Very large datasets present a challenge for both humans and machine learning algorithms. Machine learning algorithms can be inundated by the flood of data and become very slow at knowledge extraction. Moreover, along with the large amount of data available, there is a compelling need to produce results accurately and fast.

Efficiency and scalability are, indeed, the key issues when designing data mining systems for very large datasets. Through the extraction of knowledge in databases, large databases will serve as a rich, reliable source for knowledge generation and verification, and the discovered knowledge can be applied to information management, query processing, decision-making, process control and many other applications. Therefore, data mining has been considered one of the most important topics in databases by many database researchers.

Spatial data describes information related to the space occupied by objects. It consists of 2D or 3D points, polygons etc., or points in some d-dimensional feature space. It can be either discrete or continuous. Discrete spatial data might be a single point in multi-dimensional space, while continuous spatial data spans a region of space. This data might consist of medical images or map regions, and it can be managed through spatial databases [8].

Clustering [17] groups analogous elements of a dataset according to their similarity, such that elements in each cluster are similar while elements from different clusters are dissimilar. It does not require class label information about the dataset because it is inherently a data-driven approach. Clustering is therefore the most interesting and well-developed method of manipulating and cleaning spatial data in preparation for spatial data mining analysis, and it has been recognized as a primary data mining method for knowledge discovery in spatial databases [4-7].

Clustering fusion is the integration of results from various clustering algorithms, using a consensus function, to yield stable results. Clustering fusion approaches are receiving increasing attention for their capability of improving clustering performance. At present, the usual operational mechanism for clustering fusion is the combining of clusterer outputs.
One tool for such combining, or consolidation of results from a portfolio of individual clustering results, is a cluster ensemble [13]. It has been shown to be useful in a variety of contexts such as "Quality and Robustness" [3], "Knowledge Reuse" [13, 14], and "Distributed Computing" [9].

The rest of the paper is organized as follows. Related work is reviewed in Section II. The proposed knowledge guided fusion ensemble technique is described in Section III. In Section IV, we present the experimental test platform and the results, with discussion. Finally, Section V concludes with a summary and our planned future work in this area of research.

II. RELATED WORK

A. Literature on Clustering Algorithms

Many clustering algorithms have been developed, and they can be roughly classified into hierarchical and non-hierarchical approaches. Non-hierarchical approaches can be divided into four categories: partitioning methods, density-based methods, grid-based methods, and model-based methods. Hierarchical algorithms can be further divided into agglomerative and divisive algorithms, corresponding to bottom-up and top-down strategies for building a hierarchical clustering tree.

Spatial data mining, or knowledge discovery in spatial databases, refers to the extraction from spatial databases of implicit knowledge, spatial relations, or other patterns that are not explicitly stored [8, 10]. The large size and high dimensionality of spatial data make the complex patterns that lurk in the data hard to find. It is expected that the coming years will witness a very large number of objects that are location enabled to varying degrees. Spatial clustering [8] has been used as an important process in areas such as geographic analysis, exploring data from sensor networks, traffic control, and environmental studies. Spatial data clustering has been identified as an important technique for many applications, and several techniques have been proposed over the past decade based on density-based strategies, random walks, grid-based strategies, and brute-force exhaustive searching methods [5]. This paper deals with the fusion of spatial cluster ensembles using a guided approach to reduce the space complexity of such fusion algorithms.

Spatial data is about instances located in a physical space. Spatial clustering aims to group similar objects into the same group, considering the spatial attributes of the objects. The existing spatial clustering algorithms in the literature focus exclusively either on the spatial distances or on minimizing the distance between object attribute pairs; i.e., the locations are treated as just another attribute, or the non-spatial attribute distances are ignored. Much activity in spatial clustering focuses on clustering objects based on their location nearness to each other [5]. Finding clusters in spatial data is an active research area, and current non-spatial clustering algorithms are being applied to the spatial domain, with recent applications and results reported on the effectiveness and scalability of the algorithms [8, 16].

Partitioning algorithms are best suited to problems where minimization of a distance function is required, and a common measure used in such algorithms is the Euclidean distance. Recently, a new set of spatial clustering algorithms has been proposed that offers a faster way to find clusters with overlapping densities. DBSCAN, GDBSCAN and DBRS are density-based spatial clustering algorithms, but each performs best only on particular types of datasets [17].

However, these algorithms also ignore the participation of non-spatial attributes and require user-defined parameters. For large-scale spatial databases, current density-based clustering algorithms can be expensive, as they require a large volume of memory because they operate over the entire database. Another disadvantage is that the input parameters required by these algorithms are based on experimental evaluations. There is great interest in automating the general-purpose clustering approach without user intervention. However, it is difficult to expect accurate results from any one of these algorithms, as each has its own shortfalls.

B. Literature on Clustering Ensembles

A clustering ensemble is a method of combining several runs of different clustering algorithms to get an optimal partition of the original dataset. Given a dataset X = {x1, x2, ..., xn}, a cluster ensemble is a set of clustering solutions, represented as P = {P1, P2, ..., Pr}, where r is the ensemble size, i.e. the number of clusterings in the ensemble. The clustering-ensemble approach first obtains the results of M clusterers, then sets up a common understanding function to fuse each vector and obtain the final labeled vector. The goal of a cluster ensemble is to combine the clustering results of multiple clustering algorithms to obtain better-quality and more robust clustering results. Even though many clustering algorithms have been developed, not much work has been done on cluster ensembles in the data mining and machine learning community.

Strehl and Ghosh [13, 14] proposed a hypergraph-partitioning approach to combine different clustering results by treating each cluster in an individual clustering algorithm as a hyperedge. All three of their proposed algorithms approach the problem by first transforming the set of clusterings into a hypergraph representation. The Cluster-based Similarity Partitioning Algorithm (CSPA) uses the relationship between objects in the same cluster to establish a measure of pairwise similarity. In the HyperGraph Partitioning Algorithm (HGPA), the maximum mutual information objective is approximated with a constrained minimum cut objective. In their Meta-CLustering Algorithm (MCLA), the objective of integration is viewed as a cluster correspondence problem.

Kai Kang, Hua-Xiang Zhang and Ying Fan [6] formulated the process of cooperation between component clusterers and proposed a novel cluster ensemble learning technique based on dynamic cooperation (DCEA). The approach is mainly concerned with how the component clusterers fully cooperate in the process of training the component clusterers.

Fred and Jain [2] used a co-association matrix to form the final partition. They applied hierarchical (single-link) clustering to the co-association matrix. Zeng, Tang, Garcia-Frias and Gao [19] proposed an adaptive meta-clustering approach for combining different clustering results by using a distance matrix.

C. Fusion Framework

We begin our discussion of the guided ensemble fusion framework by presenting our notation. Consider a set of n data objects, D = {v1, ..., vn}. A clustering C of dataset D is a partition of D into k disjoint sets C1, ..., Ck. In the sequel we consider m clusterings; we write Bi to denote the ith clustering, and ki for the number of clusters of Bi. In the clustering fusion problem, the task is to find a clustering that agrees as much as possible with a number of already-existing clusterings [4].
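To make the cost argument concrete, the following Python sketch builds the pairwise co-association (voting) matrix that approaches such as Fred and Jain [2] rely on, using the notation above (m clusterings B_i over n objects). The function name `co_association` and the toy label vectors are illustrative assumptions, not code from the cited work. The point is only that this matrix has n x n entries, whereas a clusterer-level matrix has m x m entries.

```python
from itertools import combinations


def co_association(labelings):
    """Return the n x n co-association matrix for m label vectors.

    labelings: list of m lists, each of length n; labelings[i][p] is the
    cluster label that clustering B_i assigns to object v_p.
    Entry [p][q] is the fraction of clusterings that place v_p and v_q
    in the same cluster.
    """
    m = len(labelings)
    n = len(labelings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for p in range(n):
        matrix[p][p] = 1.0  # an object always co-occurs with itself
    for p, q in combinations(range(n), 2):
        votes = sum(1 for labels in labelings if labels[p] == labels[q])
        matrix[p][q] = matrix[q][p] = votes / m
    return matrix


# Toy ensemble (hypothetical labels): m = 3 clusterings of n = 5 objects.
ensemble = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
coassoc = co_association(ensemble)
entries_object_level = len(coassoc) * len(coassoc[0])  # n * n
entries_clusterer_level = len(ensemble) ** 2           # m * m
print(entries_object_level, entries_clusterer_level)
```

For a spatial dataset with, say, hundreds of thousands of points and a handful of clusterers, the first count grows quadratically with n while the second stays fixed, which is the motivation for working at the clusterer level.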
In the x m symmetric clusterer compatibility matrix, where clustering fusion problem the task is to find a clustering that CCM[i][j] indicates the summary of information gain when ith maximizes the related items with a number of already-existing clusterer and jth clusterer are merged. This way we have used clusterings [4]. heuristics to direct the fusion in the right direction. D. Definations B. Resolution for Label Correspondence Problem • Fusion Joint set, FJij: The other issues in fusion of cluster ensembles are label correspondence problem and the merging technique used for Fusion Joint set, FJij refers to set of matching pairs of ith fusion. At the second phase, we address the label clusterer’s jth cluster. For instance, FJ12 refers to probable correspondence problem. These clustering results are fusion spot for first clusterer’s clusters with second clusterer’s cluster. It will be used for deciding where the combined in layered pairs, called fusion joints set, FJmk . The fusion is most likely to yield optimal preciseness of criteria of merging can be any one of the Fusion Joint clusters. Identification Techniques i.e., overshadowing or usage of highest cardinality in intersection set along with usage of add- • Clusterer Compatibility matrix: CCM (m X m) on knowledge gathered from such association. Clusterer Compatibility matrix is a m X m symmetric First approach uses the degree of shadow that one cluster has matrix where m is the total number of clusterers, considered on other. This is computed using the smallest circle or for fusion. minimum covering circle approach, which is a mathematical problem of computing the smallest circle that contains all of a • CCM[i][j] given set of points in the Euclidean plane. 
Each cluster of the Integer value d representing the maximum information clusterer in two layers first compute the minimum bounding gained through the summation of intersection elements circle and the diameter of such circle, using which the degree cardinality of the matching pairs of clusterer found in of Shadow (DOS) is computed. The aim is to find the clusters Fusion Joint Set, FJ[i][j]. in different layers whose shadow overlap is maximized and then assign it to the matching pair set. This method finds the • Degree Of Agreement Factor: (DOA) most appropriate clusters belonging to a two clustererings for Degree of agreement factor is the ratio of the index of the forthcoming fusion phase. merging level to the total number of clusterers. And also Second approach uses the usage of heuristic greedy approach this DoA value will be cumulative till it reaches the in computing mutual information theory to decide on the threshold level DoATh, an user assigned value indicating degree of compatibility. Mutual information is used when we the majority required for decision making. Under normal need to decide, which amongst candidate variables are closest scenario, DoATh will be set as 50% of the number of to a particular variable. Higher the mutual information, more clusterers. the two variables are 'closer'. It is the amount of information • Degree of Shadow factor : (DOS) 'contained' in Y about X. Degree of shadow factor is the maximized value of the Let X and Y be the random variables described by the cluster intersection of the two minimum bounding circles of k labeling λ(a) and λ(b) , with k(a) and k(b) groups respectively. clusters with ith cluster from a different clustering. Let I(X; Y) denote the mutual information between X and Y, and H(X) denote the entropy of X, i.e, a measure of the III. KNOWLEDGE GUIDED ENSEMBLE FUSION uncertainty associated with a X. 
The chain rule for entropy In this section we discuss our proposed layered cluster states that ensemble fusion guided by the gained knowledge during the merging process. The first phase of the algorithm is the H(X1:n)= H(X1)+H(X2|X1)+...+H(Xn|X1:n−1) (1) preparation of B heterogeneous ensembles. This is done by executing different clustering algorithms against the same When X1:n are independent ,identically distributed (i.i.d.), spatial data set to generate partitioning results. For our then H(X1:n) = nH(X1). From Eqn 1, we have experimental purpose, we have also generated homogenous ensembles by partitioning the spatial data horizontally/ H(X, Y ) = H(X) + H(Y |X) = H(Y ) + H(X|Y ) 286 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010 H(X) − H(X|Y ) = H(Y ) − H(Y |X). (2) The initial phase of the fusion starts with finding Bm Clusterers, B This difference can be interpreted as the reduction in using m different clustering algorithms. Next stage is to find uncertainty in X after we know Y, or vice versa. It is thus the clusterers amongst Bm , with maximum compatibility B known as the information gain, or more commonly the mutual matrix index, for merging, so that they yield maximum information between X and Y. Finding the maximum MI for knowledge for further fusions. When the merging happens, the clusters in the clusterings is a combinatorial optimization based on the Fusion Joint Set, for each merged data point, the problem, requiring an exhaustive search through all possible degree of agreement (DOA) is calculated.. For example, if the clusterings with k labels. This is formidable since for n objects total number of clusterers are 5, then all the data points that and k partitions there are approximately kn /k! for n >> k. For get merged at level 1, will have DOA as 1/5 = 0.2. 
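The quantities in this subsection can be checked numerically. The Python sketch below computes the entropy of a label vector, the mutual information I(X; Y) = H(X) + H(Y) - H(X, Y) (which equals H(X) - H(X|Y) by Eqn. 2), and the exact number of ways to split n objects into k non-empty groups (a Stirling number of the second kind), which k^n/k! approximates. The function names and the toy label vectors are assumptions made for illustration; entropies are in bits.

```python
from collections import Counter
from math import comb, factorial, log2


def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), equal to H(X) - H(X|Y)."""
    joint = list(zip(x, y))  # each pair is one outcome of (X, Y)
    return entropy(x) + entropy(y) - entropy(joint)


def stirling2(n, k):
    """Exact number of partitions of n objects into k non-empty groups."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1))
    return total // factorial(k)


# Hypothetical labelings of four objects.
a = [0, 0, 1, 1]
b = [1, 1, 0, 0]   # same partition as a, with the labels swapped
c = [0, 1, 0, 1]   # a partition that is independent of a

mi_same = mutual_information(a, b)
mi_independent = mutual_information(a, c)
exact_count = stirling2(16, 4)       # 171798901, the figure quoted above
approx_count = 4 ** 16 / factorial(4)
print(mi_same, mi_independent, exact_count, approx_count)
```

Relabeling a partition leaves the mutual information unchanged, which is why it is a convenient compatibility measure when the label correspondence between clusterers is unknown.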
C. Knowledge Guided Fusion of Clusterings: An Excerpt

Input:
    D: the input data in 2-dimensional feature space
    Layer: group of clusterers B1 to Bm; Levels: list of clusters k1 to kn
    CCM[i][j]: clusterer compatibility of the ith clusterer with the jth clusterer

Step 1: Form clusters B1k1 to Bikn from D using clusterers B1 to Bm, each clusterer generating k clusters.
Step 2: Compute the clusterer compatibility matrix, whose entries are the aggregated cardinality values of the intersecting element sets of the clusterers.
Step 3: Identify and select the harmonizing clusterers for fusion from the CCM matrix as TobeMerged_Layers.
Step 4: Set DOA_Increment_Factor to 1/m.
Step 5: Find the fusion joints for TobeMerged_Layers, (FJ12 1 .. FJ1 k), using the degree of shadow overlap or by maximizing the information gain of the probable merge.
Step 6: For every pair in the fusion joint set FJi k,
    do {
        ClustData[i] ← union of the data points of the pair
        Initialize Vector_DOA with DOA_Increment_Factor
        Append it to Vector_CData[i]
        For each element in the intersection set between the pair,
            DOA[i] ← DOA[i] + DOA_Increment_Factor
        Increment the vector index i by 1 and merge_layer by 2
    } while i <= k;    // normal merge for m/2 layers
Step 7: Repeat Steps 2 to 6 while merge_layer < m/2;
    // finalize the cluster elements at layer i and at level k
    do {
        if (Vector_DOA > DOATh)
            Strong_links ← corresponding elements of Vector_CData
        else
            Weak_links_k ← corresponding elements of Vector_CData
    } until all pairs at layer i are resolved
Step 8: Using Strong_links, finalize Final_Kluster_k and continue gathering votes for the weak links.
Step 9: From the (m/2 + 1)th layer, perform the pruned merging, where the strong links are pruned for the confirmed data points when they reappear. Data points in the weak links could be noise data points (Noise_Elements_k), as their inherent votes were below the threshold value.
Step 10: Return the robust clusters obtained from the m clusterers, Final_Kluster_k, and Noise_Elements_k.

Figure 3.3. Excerpt of the guided fusion of ensembles

The initial phase of the fusion starts with finding the clusterers B1 to Bm, using m different clustering algorithms. The next stage is to find the clusterers among B1 to Bm with the maximum compatibility matrix index for merging, so that they yield maximum knowledge for further fusions. When the merging happens, based on the Fusion Joint Set, the degree of agreement (DOA) is calculated for each merged data point. For example, if the total number of clusterers is 5, then all the data points that get merged at level 1 will have a DOA of 1/5 = 0.2. This DOA value is treated as the increment factor for every future fusion. The DOA value in the corresponding DOA vector is cumulative until it reaches the threshold level DOATh. Once the DOA of any point in the cluster crosses the threshold, the point can be affirmed to belong to a particular cluster result and is treated as a strong link. Thus, the normal voting procedure with a huge voting matrix to confirm the majority does not arise at all in our method.

The final layer merge with the earlier combined clusters yields the robust combined result. This approach is not computationally intensive, as we use the first law of geography in merging layer by layer. The computation of the voting matrix is also avoided.

The three levels of the technique (fusion joint identification, guided fusion, and resolving low-voted data points) are all executed sequentially. They do not interfere with each other; each just receives the results from the previous level. No feedback process happens, and the algorithm terminates after the completion of all procedures.
IV. EXPERIMENTAL PLATFORM & RESULTS

A. The Test Platform

Ensemble Creation: To simplify the analysis, the paper uses five representative clustering methods to produce five heterogeneous ensembles, or clustering members, viz. DBSCAN, k-means, PAM, Fuzzy C-Means and Ward's algorithm. K-means is a very simple yet very powerful and efficient iterative technique for partitioning a large dataset into k disjoint clusters. The objective criterion used in the algorithm is typically the squared-error function. The DBSCAN method performs well with attribute data and with spatial data. Partitioning around medoids (PAM) is mostly preferred for its scalability and is hence useful for spatial data. The latest development in clustering is the use of fuzziness, and we have added Fuzzy C-Means (FCM) as one of the clusterers so that we get a robust partition in the end result. These clustering techniques, along with different cluster sizes, form the input for our knowledge guided fusion technique.

Data Source: Most ensemble methods use sampling techniques to select the data for their experimental platform, but this heuristic results in losing some inherent data clusters, thereby reducing the quality of the clusters. We have tried to avoid sampling and involve the whole dataset. For our experiments we have used the datasets available in the data repository of the University of California, Irvine.

Metrics: We used the classification accuracy (CA) to measure the accuracy of an ensemble as the agreement between the ensemble partition and the "true" partition. The classification accuracy is commonly used for evaluating clustering results. To guarantee the best re-labeling of the clusters, the proportion of correctly labeled objects in the dataset is calculated as the CA for the partition. We have also measured the intra-cluster and inter-cluster density before and after use of our cluster ensemble approach, as a metric for the preciseness of the cluster groups so formed.
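As an illustration of the CA metric, the following Python sketch searches over all one-to-one re-labelings of the predicted clusters and reports the best proportion of correctly labeled objects. The function name, the brute-force search (practical only for a small number of clusters) and the toy labels are assumptions for illustration; the paper does not specify how the best re-labeling is found.

```python
from itertools import permutations


def classification_accuracy(predicted, truth):
    """Best proportion of matching labels over one-to-one re-labelings.

    predicted: cluster label of each object in the ensemble partition.
    truth: known class label of each object.
    If there are more clusters than classes, the surplus clusters are
    mapped to None and therefore never count as correct.
    """
    pred_labels = sorted(set(predicted))
    pool = sorted(set(truth))
    pool += [None] * max(0, len(pred_labels) - len(pool))
    best = 0
    for assignment in permutations(pool, len(pred_labels)):
        mapping = dict(zip(pred_labels, assignment))
        correct = sum(1 for p, t in zip(predicted, truth) if mapping[p] == t)
        best = max(best, correct)
    return best / len(truth)


# Hypothetical example: six objects, three known classes.
truth = ["a", "a", "a", "b", "b", "c"]
predicted = [2, 2, 1, 1, 1, 0]
ca = classification_accuracy(predicted, truth)
print(ca)
```

Here the best re-labeling maps cluster 2 to class "a", cluster 1 to "b" and cluster 0 to "c", leaving one object mislabeled.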
B. Validation of Fusion Results

As clustering is unsupervised classification, a kind of study without guidance, it is difficult to evaluate clustering quality directly. However, the class information of the data can be regarded as expressing, to a certain degree, the inner distribution of the data. If this class information is not used during clustering, it can be used to evaluate the clustering result. A clustering label is taken to correspond to the known category with which it shares the largest number of objects. Thus many clustering labels may correspond to the same category, whereas one clustering label cannot correspond to many categories. The clustering results can then be evaluated using the class information.

The test results with the Iris dataset, Wine dataset, Half-rings and Spiral dataset (courtesy: UCI data repository) are promising and show better cluster accuracy. Two parameters were computed to verify our algorithm: the Cluster Correctness Factor (CCF) and the space complexity of the fusion of ensembles. The benchmark datasets mentioned above were tested with this technique, and the CCF was found to be 100% in all cases. Normally, ensemble algorithms compute a voting matrix of the order of n^2, where n is the number of data points. With knowledge guided fusion and its inherent voting scheme, the space complexity is reduced to the order of n. This has a major impact not only on memory requirements but also on the total number of matrix computations.

C. Comparison of the Experimental Results

In our approach of knowledge guided fusion, we have combined the results of several independent cluster runs by computing an inherent voting of their results. Our phased, knowledge guided voting helps us overcome the computationally infeasible simultaneous combination of all partitions and also increases the cluster accuracy (Figure 4.3.1). With our scheme, we have shown that the consensus partition indeed converges to the true partition.

The inter-cluster density (Figure 4.3.3) is reduced by about 40% with our technique when compared against the other clustering algorithms. For the benchmark Iris dataset it is around 11.47 for the other algorithms, whereas our cluster miner produces 6.77, implying that the latter produces better clusters in terms of inter-cluster density. We have observed that the intra-cluster density (Figure 4.3.2) has increased, implying that the cluster quality has improved due to the guided approach used for ensemble fusion. For the standard benchmark Iris dataset, the intra-cluster density achieved using normal clustering methods is 5.1283, whereas our technique generated an intra-cluster density of 5.7125, implying that we have generated more precise clusters.

Figure 4.3.1. Comparison of error rates of the fused ensembles
Figure 4.3.2. Intra cluster density (with Iris dataset)
Figure 4.3.3. Inter cluster density (with Iris dataset)

V. CONCLUSION AND FUTURE WORK

In this paper we addressed the relabeling problem found in most cluster ensembles and resolved it without much computation, using the notion of the first law of geography. The cluster ensemble is a very general framework that enables a wide range of applications. We have applied the proposed guided cluster merging technique to spatial databases. The main issues in spatial databases are the cardinality of the data points and the increased number of dimensions. Most existing ensemble algorithms have to generate a voting matrix of at least the order of n^2, or an expensive graphical representation whose number of vertices equals the number of data points. When n is very large, as is common in spatial datasets, this restriction is a very big bottleneck in obtaining robust clusters in reasonable time and with high accuracy.

Our algorithm resolves the relabeling using layered merging guided by the gained information. Once elements move from strong links to final clusters, they do not participate in further computations. Hence, the computational cost is also hugely reduced. Use of the cluster compatibility matrix gives us a good head start in the fusion process, which otherwise is a matter of sheer randomness.

The key goal of spatial data mining is to automate the knowledge discovery process. It is important to note that this study assumes that the user has a good knowledge of the data and of the hierarchies used in the mining process. The crucial input of deciding the value of k still affects the quality of the resultant clusters. Domain-specific a priori knowledge can be used as guidance for deciding the value of k. We feel that semi-supervised clustering using domain knowledge could improve the quality of the mined clusters. We have used heterogeneous clusterers for our testing, but the approach can be tested with more new combinations of spatial clustering algorithms as base clusterers. This will ensure the exploration of more natural clusters.
First, we identified several non-spatial datasets that are normally used as benchmarks for data clustering. Then we tested how our layer-based methodology can work with spatial data. This setup must be exercised with larger datasets available in GIS areas and with satellite images. We evaluated our work and can conclude that, for targeting a specific platform and incorporating the spatial feature space, our automated layered merge approach is able to provide the necessary correctness with more efficiency, both in space and in matrix computations. However, more work should be carried out to support more real-life data from satellites and incomplete data. Future work in the short term will focus on how to acquire such datasets and on continuing with more testing, in spite of current security concerns in distributing such data.

ACKNOWLEDGMENT

This work has been partly done in the labs of The Oxford College of Engineering, Bangalore, where the first author is currently working as a Professor in the Department of Computer Science & Engineering. The authors would like to express their sincere gratitude to the Management and Principal of The Oxford College of Engineering for their support rendered during the testing of some of our modules. They also express their thanks to the University of California, Irvine, for the huge data repository made available for testing our knowledge guided approach to the fusion of ensembles.

REFERENCES

[1] M. Ester, H. Kriegel, J. Sander, and X. Xu, "Clustering for Mining in Large Spatial Databases", Special Issue on Data Mining, KI-Journal, Tech Publishing, Vol. 1, 1998.
[2] A.L.N. Fred and A.K. Jain, "Data Clustering using Evidence Accumulation", in Proc. of the 16th International Conference on Pattern Recognition, ICPR 2002, Quebec City.
[3] A.L.N. Fred and A.K. Jain, "Robust data clustering", in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, USA, 2003.
[4] V. Filkov and S. Skiena, "Integrating microarray data by consensus clustering", in International Conference on Tools with Artificial Intelligence, 2003.
[5] K. Koperski and J. Han, "Discovery of Spatial Rules in Geographic Information Databases", Proc. 4th Intl Symposium on Large Spatial Databases, pp. 47-66, 1995.
[6] Kai Kang, Hua-Xiang Zhang, and Ying Fan, "A Novel Clusterer Ensemble Algorithm Based on Dynamic Cooperation", IEEE 5th International Conf. on Fuzzy Systems and Knowledge Discovery, 2008.
[7] C.J. Matheus, P.K. Chan, and G. Piatetsky-Shapiro, "Systems for Knowledge Discovery in Databases", IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 903-913, 1993.
[8] R.T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining", Proc. 20th Int. Conf. on Very Large Data Bases, pp. 144-155, Santiago, Chile, 1994.
[9] B.H. Park and H. Kargupta, "Distributed Data Mining", in The Handbook of Data Mining, Ed. Nong Ye, Lawrence Erlbaum Associates, 2003.
[10] J. Roddick and B.G. Lees, "Paradigms for Spatial and Spatio-Temporal Data Mining", in Geographic Data Mining and Knowledge Discovery, Taylor & Francis, 2001.
[11] Su-lan Zhai, Bin Luo, and Yu-tang Guo, "Fuzzy Clustering Ensemble Based on Dual Boosting", Fourth International Conference on Fuzzy Systems and Knowledge Discovery, 2007.
[12] H. Samet, "Spatial Data Models and Query Processing", in Modern Database Systems: The Object Model, Interoperability, and Beyond, Addison Wesley / ACM Press, Reading, MA, 1994.
[13] A. Strehl and J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining multiple partitions", Journal of Machine Learning Research, 3: 583-618, 2002.
[14] A. Strehl and J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining partitionings", in Proc. of 11th National Conference on Artificial Intelligence, NCAI, Edmonton, Alberta, Canada, pp. 93-98, 2002.
[15] Y. Tao, J. Zhang, D. Papadias, and N. Mamoulis, "An Efficient Cost Model for Optimization of Nearest Neighbor Search in Low and Medium Dimensional Spaces", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, pp. 1169-1184, 2004.
[16] X. Wang and H.J. Hamilton, "Clustering Spatial Data in the Presence of Obstacles", International Journal on Artificial Intelligence Tools, vol. 14, no. 1-2, pp. 177-198, 2005.
[17] R. Xu and D. Wunsch II, "Survey of clustering algorithms", IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645-678, 2005.
[18] J. Zhang, "Polygon-based Spatial Clustering and its Application in Watershed Study", MS Thesis, University of Nebraska-Lincoln, December 2004.
[19] Y. Zeng, J. Tang, J. Garcia-Frias, and G.R. Gao, "An Adaptive Meta-Clustering Approach: Combining the Information from Different Clustering Results", CSB2002 IEEE Computer Society Bioinformatics Conference Proceedings.

AUTHORS PROFILE

RJ Anandhi is a PhD student in the Department of Computer Science & Engineering at Dr M G R University. She is currently working as a professor in the Department of Computer Science at Oxford College of Engineering, Bangalore. She completed her BE degree at Bharatiyar University and her MTech degree at Pondicherry Central University. Her research interests are in spatial data mining and ANT algorithms.

Dr. Natarajan initially worked at the Defence Research and Development Laboratory (DRDL) for five years in the area of software development for defence missions. He then worked for 28 years at the National Remote Sensing Agency (NRSA) in areas pertaining to DIP and GIS for several remote sensing missions such as IRS-1A, IRS-1B, IRS-1C, IKONOS and LANDSAT. As Project Manager of the Ground Control Point Library (GCPL) Project, he completed the task of computing cm-level accuracy for 3000 locations within India, which is being used for cartographic satellite missions. He was the Deputy Project Director of Large Scale Mapping (LSM) of the Department of Space. Dr Natarajan has published about fifteen papers in national and international conferences and journals. His research interests are data mining, GIS and spatial databases.