Document Sample

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 5, 2011 Potential Research into Spatial Cancer Database by Using Data Clustering Techniques N. Naga Saranya, Dr.M. Hemalatha, Research Scholar (C.S), Karpagam University, Coimbatore- Head, Department of Software Systems, 641021, Karpagam University, Coimbatore-641021, Tamilnadu, India. Tamilnadu,India. E-mail: nachisaranya01@gmail.com E-mail: hema.bioinf@gmail.com Abstract— Data mining, the taking out of hidden analytical liver, lungs, kidney, cervix, prostate testis, bladder, blood, information from large databases. Data mining tools forecast borne, breast and many others. There has been huge future trends and behaviors, allowing businesses to build development in the clinical data from past decades, so we practical, knowledge-driven decisions. This paper discusses the need proper data analysis techniques for more sophisticated data analytical tools and data mining techniques to analyze data. It allows users to analyze data from many different dimensions or methods of data exploration. In this study, we are using angles, sort it, and go over the relationships identified. Here we different data mining technique for effective implementation are analyzing the medical data as well as spatial data. Spatial of clinical data. The main aim of this work is to discover data mining is the process of difficult to discover patterns in various data mining techniques on clinical and spatial data geographic data. Spatial data mining is measured a more difficult sets. Several data mining techniques are pattern recognition, face than traditional mining because of the difficulties associated clustering, association, and classification. Our Proposed work with analyzing objects with concrete existences in space and time. is on medical spatial datasets by using clustering techniques. Here we applied clustering techniques to form the efficient There are fast and enormous numbers of clustering algorithms cluster in discrete and continuous spatial medical database. The are developed for large datasets such as CURE, MAFIA, clusters of random shapes are created if the data is continuous in natural world. Furthermore, this application investigated data DBSCAN, CLARANS, BIRCH, and STING. mining techniques (clustering techniques) such as Exclusive clustering and hierarchical clustering on the spatial data set to generate the well-organized clusters. The tentative results showed II. CLUSTERING ALGORITHMS AND TECHNIQUES IN that there are certain particulars that are evolved and can not be DATA MINING apparently retrieved as of raw data. The process of organizing objects into groups whose Keywords- Data Mining, Spatial Data Mining, Clustering members are similar in some way is called clustering. So, the Techniques, K-means, HAC, Standard Deviation,Medical Database, goal of clustering is to conclude the essential grouping in a set Cancer Patients, Hidden Analytical. of unlabeled data. Various kinds of Clustering algorithms are partitioning-based clustering, hierarchical algorithms, density I. INTRODUCTION based clustering and grid based clustering. Recently many commercial data mining clustering A. Partitioning Algorithm techniques have been developed and their usage is increasing tremendously to achieve desired goal. Researchers are putting their best hard work to reach the fast and well-organized K-Means is one of the simplest unsupervised learning algorithm for the abstraction of spatial medical data sets. algorithms that solve the well known clustering problem. Cancer has become one of the foremost causes of deaths in Fig.1. shows the K-Means algorithm is composed of the India. An analysis of most recent data has shown that over 7 following steps belongs to centroid: lakh new cases of cancer and 3 lakh deaths occur annually due 1. It classifies a given dataset through certain number of to cancer in India. Cancer has striven against near clusters (assume k clusters). These points are first group insurmountable obstacles of financial difficulties and an centroids. almost indifferent ambience, to fulfill the objectives of its 2. Grouping is done based on the Euclidean's distance. founder, bringing to the poorest in the land the most refined 3. And the centroids are formed on the basis of mean value of scientific technology and excellent patient care. Furthermore, that object group. cancer is a preventable disease if it is analyzed at an early 4. The steps 2 & 3 repeats until the centroids no longer move. stage. There are different sites of cancer such as oral, stomach, 168 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 5, 2011 which top down strategy is used to cluster the objects. In this method the larger clusters are divided into smaller clusters until each object forms cluster of its own. Fig.2 shows simple example of hierarchical clustering. C. Density Based Clustering Algorithm It is a clustering technique to develop clusters of arbitrary shapes. They are different types of density based clustering techniques such as DBSCAN, SNN, OPTICS and DENCLUE. DBSCAN algorithm: The DBSCAN algorithm was early introduced by Ester, et al. [Ester1996], and relies on a density- based notion of clusters. Clusters are recognized by looking at the density of points. Regions with a high density of points Figure 1: Work Flow of Partition based cluster algorithms depict the existence of clusters whereas regions with allow density of points indicate clusters of noise or clusters of outliers. This algorithm is particularly suited to deal with large B. Hierarchical Clustering Algorithms datasets, with noise, and is able to identify clusters with different sizes and shapes. The hierarchical clustering functions essentially in The algorithm: The key idea of the DBSCAN algorithm is combine closest clusters until the desired number of clusters is that, for each point of a cluster, the neighborhood of a given achieved. This sort of hierarchical clustering is named radius has to contain at least a minimum number of points, that agglomerative since it joins the clusters iteratively. There is is, the density in the neighborhood has to exceed some also a divisive hierarchical clustering that does a turn around predefined threshold. This algorithm requires three input process, every data item start in the same cluster and then it is parameters: divided in slighter groups (JAIN, MURTY, FLYNN, 1999). - k, the neighbors list size; The distance capacity between clusters can be done in - Eps, the radius that delimitate the neighborhood area of a numerous ways, and that's how hierarchical clustering point (Eps neighborhood); algorithms of single, common and totally differ. Many - MinPts, the minimum number of points that must exist in the hierarchical clustering algorithms have an interesting property Eps-neighborhoods. that the nested sequence of clusters can be graphically represented with a tree, called a 'dendrogram' (CHIPMAN, The clustering process is based on the classification of the TIBSHIRANI, 2006). There are two approaches to points in the dataset as core hierarchical clustering: we can go from the bottom up, points, border points and noise points, and on the use of grouping small clusters into larger ones, or from the top down, density relations between points splitting big clusters into small ones. These are called (directly density-reachable, density-reachable, density- agglomerative and divisive clustering, respectively. connected [Ester, 1996] [2] ) to form the clusters. D. SNN algorithm The SNN algorithm [Ertoz, 2003] [3]) the same as DBSCAN, is a density-based clustering algorithm. The main difference between this algorithm and DBSCAN is that it defines the similarity between points by looking at the number of nearest neighbors that two points share. Using this similarity measure in the SNN algorithm, the density is defined as the sum of the similarities of the nearest neighbors of a point. Points with high density become core points, while points with low density represent noise points. All remainder Figure 2: Hierarchical Clustering points that are strongly similar to a specific core points will represent a new clusters. 1) Agglomerative approach is the clustering technique in which bottom up strategy is used to cluster the The algorithm: The SNN algorithm needs three inputs objects. It merges the atomic clusters into larger and larger parameters: until all the objects are merged into single cluster. - K, the neighbors’ list size; 2) Divisive approach is the clustering technique in - Eps, the threshold density; 169 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 5, 2011 - MinPts, the threshold that define the core points. 2. Cell Density Calculation for each cell. After defining the input parameters, the SNN 3. Form of the cells according to their densities. algorithm first finds the K nearest neighbors of each point of 4. Identify the cluster centers according to their result. the dataset. Then the similarity between pairs of points is 5. Finally Traversal of neighboring cells. calculated in terms of how many nearest neighbors the two points share. Using this similarity measure, the density of each III. EXPERIMENTAL RESULTS point can be calculated as being the numbers of neighbors with Here we have taken several series of Datasets by which the number of shared neighbors is equal or greater than using several websites and direct surveys. And we conclude Eps (density threshold). Next, the points are classified as being applicable pattern detection for medical diagnosis. core points, if the density of the point is equal or greater than Cancer Database (SEER Datasets): The web site called www- MinPts (core point threshold). At this point, the algorithm has dep.iarc.fr/ globocan/database.htm consist of datasets. It all the information needed to start to build the clusters. contain number of cancer patients those who registered themselves in this. The dataset consists of basic attributes such Optics: OPTICS (Ordering Points to Identify the Clustering as sex, age, marital status, height and weight. The data of age Structure) is the clustering technique in which the augmented group was taken from (20 - 75+) years in this group major order of the datasets for cluster analysis. Optics built dataset- cancers were examined. A total of male and female cases were using density based clustering structure. The advantage of examined for the various cancers. The data were collected and using optics is it in not sensitive to parameters input values substantial distribution was found for Incidence and Mortality through the user it automatically generates the number of by Sex and Cancer site. Perhaps analysis suggests that they clusters. were more male cases those who were suffering from cancer as per opposite sex. Denclue: DENCLUE (Clustering Based on Density In this study, the data was taken from SEER datasets Distribution Function) is the clustering technique in which the which has record of cancer patients from the year 1975-2008. clustering method is dependent on density distribution Spatial dataset consists of location collected include remotely function. A cluster is denned by a local maximum of the sensed images, geographical information with spatial estimated density function. Data points are assigned to clusters attributes such as location, digital sky survey data, mobile by hill climbing, i.e. points going to the same local maximum phone usage data, and medical data. The five major cancer are put into the same cluster. The disadvantage of Denclue 1.0 areas such as lung, kidney, bones, small intestine and liver is, that the used hill climbing may make unnecessary small were experimented. After this data mining algorithms were steps in the beginning and never converges exactly to the applied on the data sets such as K-means, SOM and maximum, it just comes close. The clustering technique is Hierarchical clustering technique. The database analysis was basically based on influence function (data point impact on its done using XLMiner tool kit. Fig.3 represents the statistical neighborhood), the overall density of data space can be diagram for representation between number of male and calculated as the sum of influence functions applied to data female cases for cancer. points) and cluster can be calculated using density attractors The data consists of discrete data sets with following (local maxima of the overall density function). attribute value types of cancer, male cases, female cases, cases of death pertaining to specific cancer. They were around 21 cancers that have been used as the part of analysis. The E. Grid Based Clustering XLMiner tool doesn’t take the discrete value so it has to be transformed into continuous attribute value. The Grid Based clustering algorithm, to form a grid 8 structure it partitions the data space into a finite number of 7 cell. After that performs all clustering operations are obtained 6 2000 5 grid structure. It is a well-organized clustering algorithm, but 4 2002 2004 its effect is gravely partial by the size of the cells. Grid-based 3 2006 2008 approaches are well-liked for mining clusters in a large 2 1 multidimensional space in which clusters are regarded as 0 denser regions than their environs. The computational Small intestine Lung and Bronchus Bones and Joints Kidney and Renal Pelvis difficulty of most clustering algorithms is at least linearly Figure 3 : Female and Male cases of Cancer comparative to the size of the data set. The great advantage of grid-based clustering is its important decrease of the . The data was subdivided into X, Y values and the computational complexity, especially for clustering very huge result was formed using K-means and HAC clustering data sets [8]. In general, a distinctive grid-based clustering algorithm. In XLMiner, the low level clusters are formed algorithm consists of the following five basic steps (Grabusts using K-MEANS and SOM then HAC clustering builds the and Borisov, 2002) [7] : Dendrogram using the low level clusters. Fig.3 specifies the 1. Grid Structure Creation i.e., splitting the data space into a number of clusters for both sexes , in this male is more finite number of cells. affected compare to the opposite sex. 170 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 5, 2011 8 7 TABLE1: Input of Hierarchical Clustering 6 2000 Data 5 2002 4 2004 ['both si 3 2006 Input data dendrogram.xls']'Sheet1'!$A$ 2 2008 2:$F$5 1 # Records in the input data 4 0 Input variables normalized Yes Small intestine Lung and Bones and Kidney and Bronchus Joints Renal Pelvis Data Type Raw data Figure 4 : Female Cases of cancer # Selected Variables Selected variables X1 X2 X3 X4 10 9 8 Parameters/Options 7 2000 Draw dendrogram Yes 6 2002 5 2004 Show cluster membership Yes 4 2006 # Clusters 3 3 2008 2 Selected Similarity Euclidean distance 1 measure 0 Small intestine Lung and Bones and Kidney and Selected clustering method Average group linkage Bronchus Joints Renal Pelvis Figure 5: Male cases of Cancer TABLE2: Clustering Stages The Fig.4 and Fig.5 specifies the number of cluster for male and female suffering from different cancers. This Stage Cluster 1 Cluster 2 Distance sample is collected from the patient who couldn’t stay alive 1 1 3 0.079582 with the disease. The result of the analysis shows that the male 2 1 4 1.69234 ratio was large in percentage while compared to the opposite 3 1 2 3.146232 sex. Possibly by analyzing the collected data we can enlarge Elapsed Time: certain measures for the improved procurement of this disease. Overall (secs) 3.00 . Fig.6 specifies the number of death in both males and female cases of death due to cancer using XLMiner. Table 3 presents HAC (hierarchical agglomerative clustering) 30 in which the cluster were determined with appropriate size. 25 Clusters are subdivided in to many sub clusters and the 20 2008 attributes are Xn, (n= 1,2,3,4,5). In this we predicted the 2006 15 2004 clusters by using hierarchical clustering. 2002 10 2000 5 TABLE3: Hierarchical Clustering – Predicted Clusters 0 Small intestine Lung and Bones and Kidney and Bronchus Joints Renal Pelvis Figure 6: Female and Male death cases of Cancer The K-means method is an efficient technique for clustering large data sets and is used to determine the size of The Figure 7 represents the dendrogram in which the each cluster. The input of the hierarchical clustering shown in dataset has been partitioned into three clusters with the K- Table1, and it contain the data, variables, parameters. These means. are all calculated by the distance measure which is in side the Dendrogram(Average group linkage) hierarchical clustering. Here Xn, (n= 1,2,3,4,5) are the selected variables which is placed in the datasets. After this the HAC 3.5 4000 (hierarchical agglomerative clustering), is used on our datasets 3 in which we have used tree based partition method in which 2.5 the results has shown as clustering stages and its elapsed time 2 D is t a n c e in Table 2. The HAC has proved to have for better results than 1.5 other clustering methods. The principal component analysis 1 technique has been used to visualize the data. The X, Y 0.5 coordinates recognize the point position of objects. The coordinates were used and the clusters were determined by 0 0 1 1 2 3 3 4 4 2 5 0 appropriate attribute value. The mean and standard deviation Figure 7: Dendrogram of each cluster was determined. 171 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 5, 2011 The HAC clustering algorithm is applied on K-means to The cluster compactness has been determined by generate the dendrogram. In a dendrogram, the elements are standard deviation where the cluster becomes compact when grouped together in one cluster when they have the closest standard deviation value decreases and if the value of standard values of all elements available. The cluster 2 and cluster 3 are deviation increases the cluster becomes dispersed. combined together in the diagram. Analyzes is done in the subdivisions of clusters. IV. Dicussion TABLE 4: Representation of Cluster Mean and Standard Deviation This paper focuses on clustering algorithms such as Cluster_HAC_1=c_hac_1 Cluster_HAC_1=c_hac_2 HAC and K-Means in which, HAC is applied on K-means to determine the number of clusters. The quality of cluster is Exampl [ 59.3 Example [ 7.4 improved, if HAC is applied on K-means. The paper has es %] 16 s %] 2 referenced and discussed the issues on the specified algorithms Test Test for the data analysis. The analysis does not include missing Grou Overra Att - valu Grou Overra records. The application can be used to demonstrate how data value p l Desc e p l mining technique can be combined with medical data sets and can be efficiently established in modifying the various cancer Continuous attributes : Mean Continuous attributes : Mean related research. (StdDev) (StdDev) Year 82.50 80.59 64.90 55.78 V. Conclusion ofDiagno This study clearly shows that data mining techniques are sis 0.5 (4.76) (24.14) moral 2.9 (0.99) (4.52) promising for cancer datasets. Our future work will be related 11.79 13.03 13.60 6.53 to missing values and applying various algorithms for the fast - implementation of records. In addition, the research would be Att Desc 1.7 (4.34) (4.47) mliver 2.7 (0.00) (3.72) focusing on spatial data clustering to develop a new spatial data mining algorithm. Once our tool will be implemented as a 24.98 26.61 27.60 19.57 - mstomac complete data analysis environment in the cancer registry of fstomach 2.1 (5.31) (4.84) h 2.7 (2.97) (4.22) SEER datasets, we aim at transferring the tool to related domains, thus showing the flexibility and extensibility of the 17.15 19.57 21.05 13.03 underlying basic concepts and system architecture. mstomac - h 3.5 (1.90) (4.22) fliver 2.6 (1.77) (4.47) References 18.88 19.83 22.55 19.83 - flungs 3.6 (1.08) (1.64) flungs 2.4 (0.78) (1.64) 1. Rao, Y.N, Sudir Gupta and S.P. Agarwal 2003. National Cancer Control Programme: Current status and strategies, 4.20 6.53 70.00 62.13 50 years of cancer control in India, NCD Section, Director - General of Health. mliver 3.9 (1.92) (3.72) mkidney 2.1 (0.99) (5.30) 2. Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996) A 13.90 14.56 77.75 66.97 density-based algorithm for discovering clusters in large - spatial databases. In the Proceedings of the International mlungs 3.9 (0.68) (1.05) fkidney 2.1 (2.05) (7.28) Conference on Knowledge Discovery and Data Mining (KDD.96), Portland, Oregon, pp. 226-231. 58.71 62.13 32.20 26.61 mkidney -4 (3.83) (5.30) fstomach 1.7 (1.56) (4.84) 3. [Ertöz et al. 2003] Ertöz, L., Steinbach, M., Kumar, V.: 62.14 66.97 Year 0.50 80.59 “Finding Clusters of Different Sizes, Shapes, and - ofDiagno (24.14 Densities in Noisy, High Dimensional Data”; In Proc. of fkidney 4.1 (5.17) (7.28) sis -4.8 (0.71) ) SIAM Int. Conf. on Data Mining (2003), 1-12. 4. Aberer, Karl. 2001. P-Grid: A self-organizing access Table 4 characterizes the cluster according to the structure for P2P information systems. In Proc. mean and standard deviation of each object and cluster were International Conference on Cooperative Information determined. The primary comparison in between cluster 1 Systems, pp. 179-194. Springer. objects. The second comparison was between the objects of 5. Bar-Yossef, Ziv, and Maxim Gurevich. 2006. Random cluster 2 and cluster 1.The third comparison was determined in sampling from a search engine's index. In Proc. WWW, between cluster 3 and cluster 1. The results show the mean and pp. 367-376. ACM Press. DOI:doi.acm. org/10.1145/ standard deviation of each cluster and also among the objects 135777.1135833. in each cluster. The cluster 1 has the lowest number of cancer 6. Ng R.T., and Han J. 1994. Efficient and Effective cases the cluster 2 has average number of cancer cases where Clustering Methods for Spatial Data Mining, Proc. as the cluster 3 has large number of cancer cases. 20th Int. Conf. on Very Large Data Bases, 144-155. Santiago, Chile. 172 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 5, 2011 7. Grabust P., Borisov A. Using grid-clustering methods in Experience in teaching and published Twenty seven papers in data classification// Proceedings of the International International Journals and also presented seventy papers in Conference on Parallel Computing & Electrical various National conferences and one international Engineering- PARELEC'2002. - Warsaw, Poland, 2002 - conference. Area of research is Data Mining, Software p. 425-426. Engineering, Bioinformatics, Neural Network. Also reviewer 8. H. Pilevar, M. Sukumar, “GCHL: A grid-clustering in several National and International journals. algorithm for high- dimensional very large spatial data bases”, Pattern Recognition Letters 26(2005), pp.999- 1010. 9. Jones C., et al., 2002. Spatial Information Retrieval and Geographical Ontologies: An Overview of the SPIRIT Project [C]. In proceedings: 25th ACM Conference of the Special Interest Group in Information Retrieval, pp387- 388. 10. Processing of Spatial Joins, Proc. ACM SIGMOD Int. Conf. on Management of Data, Minneapolis, MN, 1994, pp. 197- 208. 11. T. Zhang, R. Ramakrishnan, and M. L1nvy, B1RCH: An Efficient Data C1ustering 12. Method for Very Large Databases, Proc. ACM SIGMOD Int’L Conf. On Management of Data, ACM Press, pp. 103-114 (1996). 13. M. Ester, H. Kriegel, J. Sander, and X. Xu. “A Density- Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, In Proc. of 2nd Int. Conf. on KDD, 1996, pp. 226-231. 14. Wang, Yang, R. Muntz, Wei Wang and Jiong Yang and Richard R. Muntz “STING: A Statistical Information Grid Approach to Spatial Data Mining”, In Proc. of 23rd Int. Conf. on VLDB, 1997, pp. 186-195. 15. Ian H. Witten; Eibe Frank (2005). "Data Mining: Practical machine learning tools and techniques, 2nd Edition". Morgan Kaufmann, San Francisco. 16. M J Horner, L A G Ries, M Krapcho, N Neyman, R Aminou, N Howlader, et al. (2009) SEER Cancer Statistics Review , 1975-2007, Based on November 2008 SEER data Submission. 17. Gondek D, Hofmann T (2007) Non- redundant data clustering. Knowl Inf Syst 12(1):1–24. Authors Profile N.Naga Saranya received the first degree in Mathematics from Periyar University in 2006, Tamilnadu, India. She obtained her master degree in Computer Applications from Anna University in 2009, Tamilnadu, India. She is currently pursuing her Ph.D. degree Under the guidance of Dr. M.Hemalatha, Head, Dept of Software Systems, Karpagam University, Tamilnadu, India. Dr.M.Hemaltha completed MCA MPhil., PhD in Computer Science and Currently working as a AsstProfessor and Head , Dept of Software systems in Karpagam University. Ten years of 173 http://sites.google.com/site/ijcsis/ ISSN 1947-5500

DOCUMENT INFO

Shared By:

Categories:

Tags:
Data Mining, Spatial Data Mining, Clustering Techniques, K-means, Standard Deviation, Medical Database, Cancer Patients, Hidden Analytical

Stats:

views: | 152 |

posted: | 6/5/2011 |

language: | English |

pages: | 6 |

OTHER DOCS BY ijcsiseditor

How are you planning on using Docstoc?
BUSINESS
PERSONAL

By registering with docstoc.com you agree to our
privacy policy and
terms of service, and to receive content and offer notifications.

Docstoc is the premier online destination to start and grow small businesses. It hosts the best quality and widest selection of professional documents (over 20 million) and resources including expert videos, articles and productivity tools to make every small business better.

Search or Browse for any specific document or resource you need for your business. Or explore our curated resources for Starting a Business, Growing a Business or for Professional Development.

Feel free to Contact Us with any questions you might have.