					                                                               (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                         Vol. 9, No. 5, 2011

   Potential Research into Spatial Cancer Database
        by Using Data Clustering Techniques

N. Naga Saranya
Research Scholar (C.S), Karpagam University, Coimbatore-641021, Tamilnadu, India.
E-mail: nachisaranya01@gmail.com

Dr. M. Hemalatha
Head, Department of Software Systems, Karpagam University, Coimbatore-641021, Tamilnadu, India.
E-mail: hema.bioinf@gmail.com

Abstract— Data mining is the extraction of hidden analytical information from large databases. Data mining tools forecast future trends and behaviors, allowing businesses to make practical, knowledge-driven decisions. This paper discusses data analytical tools and data mining techniques used to analyze data: they allow users to examine data from many different dimensions or angles, sort it, and summarize the relationships identified. Here we analyze medical data as well as spatial data. Spatial data mining is the process of discovering patterns in geographic data. It is considered more difficult than traditional mining because of the difficulties associated with analyzing objects that have a concrete existence in space and time. We applied clustering techniques to form efficient clusters in discrete and continuous spatial medical databases; clusters of arbitrary shape are created when the data is continuous in nature. Furthermore, this application investigated clustering techniques such as exclusive clustering and hierarchical clustering on the spatial data set to generate well-organized clusters. The experimental results showed that certain particulars emerge that cannot be retrieved directly from the raw data.

Keywords- Data Mining, Spatial Data Mining, Clustering Techniques, K-means, HAC, Standard Deviation, Medical Database, Cancer Patients, Hidden Analytical.

                       I.    INTRODUCTION

Recently many commercial data mining clustering techniques have been developed, and their usage is increasing tremendously. Researchers are working hard to arrive at fast and well-organized algorithms for the abstraction of spatial medical data sets. Cancer has become one of the foremost causes of death in India: an analysis of the most recent data has shown that over 7 lakh new cancer cases and 3 lakh cancer deaths occur annually in India. Cancer care has striven against near-insurmountable obstacles of financial difficulty and an almost indifferent ambience to fulfil the objectives of its founders, bringing to the poorest in the land the most refined scientific technology and excellent patient care. Furthermore, cancer is a preventable disease if it is detected at an early stage. There are different sites of cancer such as oral, stomach, liver, lung, kidney, cervix, prostate, testis, bladder, blood, bone, breast and many others. There has been huge growth in clinical data over the past decades, so we need proper data analysis techniques and more sophisticated methods of data exploration. In this study we use different data mining techniques for the effective analysis of clinical data. The main aim of this work is to explore various data mining techniques on clinical and spatial data sets. Data mining techniques include pattern recognition, clustering, association, and classification. Our proposed work applies clustering techniques to medical spatial datasets. A number of fast clustering algorithms have been developed for large datasets, such as CURE, MAFIA, DBSCAN, CLARANS, BIRCH, and STING.

     II.   CLUSTERING ALGORITHMS AND TECHNIQUES IN DATA MINING

The process of organizing objects into groups whose members are similar in some way is called clustering. The goal of clustering is thus to discover the essential grouping in a set of unlabeled data. The main kinds of clustering algorithms are partitioning-based clustering, hierarchical algorithms, density-based clustering and grid-based clustering.

A. Partitioning Algorithm

K-Means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. Fig.1 shows the K-Means algorithm, which is composed of the following centroid-based steps:
1. It partitions a given dataset into a certain number of clusters (assume k clusters); k points are chosen as the initial group centroids.
2. Each point is assigned to a group based on its Euclidean distance to the centroids.
3. The centroids are then recomputed as the mean value of each group of objects.
4. Steps 2 and 3 repeat until the centroids no longer move.

                                                                      168                              http://sites.google.com/site/ijcsis/
                                                                                                       ISSN 1947-5500
        Figure 1: Work Flow of Partition based cluster algorithms

B. Hierarchical Clustering Algorithms

Hierarchical clustering essentially works by combining the closest clusters until the desired number of clusters is achieved. This sort of hierarchical clustering is named agglomerative, since it joins clusters iteratively. There is also a divisive hierarchical clustering that reverses the process: every data item starts in the same cluster, which is then divided into smaller groups (JAIN, MURTY, FLYNN, 1999). The distance between clusters can be measured in numerous ways, and that is how hierarchical clustering algorithms (single, complete and average linkage) differ. Many hierarchical clustering algorithms have an interesting property: the nested sequence of clusters can be graphically represented with a tree, called a 'dendrogram' (CHIPMAN, TIBSHIRANI, 2006). There are two approaches to hierarchical clustering: we can go from the bottom up, grouping small clusters into larger ones, or from the top down, splitting big clusters into small ones. These are called agglomerative and divisive clustering, respectively.

                    Figure 2: Hierarchical Clustering

        1) Agglomerative approach: the clustering technique in which a bottom-up strategy is used to cluster the objects. It merges the atomic clusters into larger and larger ones until all the objects are merged into a single cluster.
        2) Divisive approach: the clustering technique in which a top-down strategy is used to cluster the objects. In this method the larger clusters are divided into smaller clusters until each object forms a cluster of its own. Fig.2 shows a simple example of hierarchical clustering.

C. Density Based Clustering Algorithm

Density-based clustering develops clusters of arbitrary shapes. There are different types of density-based clustering techniques, such as DBSCAN, SNN, OPTICS and DENCLUE.

DBSCAN algorithm: The DBSCAN algorithm was first introduced by Ester et al. [Ester1996] and relies on a density-based notion of clusters. Clusters are recognized by looking at the density of points: regions with a high density of points depict the existence of clusters, whereas regions with a low density of points indicate noise or outliers. This algorithm is particularly suited to dealing with large datasets with noise, and is able to identify clusters of different sizes and shapes.

The algorithm: The key idea of the DBSCAN algorithm is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points; that is, the density in the neighborhood has to exceed some predefined threshold. The algorithm requires three input parameters:
- k, the neighbors list size;
- Eps, the radius that delimits the neighborhood area of a point (Eps-neighborhood);
- MinPts, the minimum number of points that must exist in the Eps-neighborhood.

The clustering process is based on the classification of the points in the dataset as core points, border points and noise points, and on the use of density relations between points (directly density-reachable, density-reachable, density-connected [Ester, 1996] [2]) to form the clusters.

D. SNN algorithm

The SNN algorithm ([Ertoz, 2003] [3]), like DBSCAN, is a density-based clustering algorithm. The main difference between this algorithm and DBSCAN is that it defines the similarity between points by looking at the number of nearest neighbors that the two points share. Using this similarity measure, the density in SNN is defined as the sum of the similarities of the nearest neighbors of a point. Points with high density become core points, while points with low density become noise points. All remaining points that are strongly similar to a specific core point form a new cluster.

The algorithm: The SNN algorithm needs three input parameters:
- K, the neighbors' list size;
- Eps, the threshold density;
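The shared-nearest-neighbor similarity and core-point test that SNN builds on can be sketched as follows. This is a minimal illustration: restricting the count to mutual neighbors (the "sparsified" neighbor graph) is an assumption of this sketch, and building the actual clusters from the core points is omitted.

```python
import numpy as np

def snn_core_points(points, K, Eps, MinPts):
    # Pairwise Euclidean distances; a point is not its own neighbor.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    # Each point's K nearest neighbors (the neighbors' list of size K).
    knn = [set(np.argsort(row)[:K].tolist()) for row in d]
    density = np.zeros(len(points), dtype=int)
    for i in range(len(points)):
        for j in knn[i]:
            # Similarity = number of shared nearest neighbors, counted here
            # only for mutual neighbors (sparsification assumed in this sketch);
            # the density of i is the number of neighbors sharing >= Eps of them.
            if i in knn[j] and len(knn[i] & knn[j]) >= Eps:
                density[i] += 1
    # Core points: density at or above the MinPts threshold.
    return density >= MinPts
```

For a tight group of points plus one far-away outlier, the group members share their neighbor lists and become core points, while the outlier's neighbor relations are not mutual and it is left with zero density.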

- MinPts, the threshold that defines the core points.

After defining the input parameters, the SNN algorithm first finds the K nearest neighbors of each point of the dataset. Then the similarity between pairs of points is calculated in terms of how many nearest neighbors the two points share. Using this similarity measure, the density of each point is calculated as the number of neighbors with which the number of shared neighbors is equal to or greater than Eps (the density threshold). Next, a point is classified as a core point if its density is equal to or greater than MinPts (the core point threshold). At this point, the algorithm has all the information needed to start building the clusters.

Optics: OPTICS (Ordering Points To Identify the Clustering Structure) is a clustering technique that computes an augmented ordering of the dataset for cluster analysis, built on a density-based clustering structure. The advantage of OPTICS is that it is not sensitive to the parameter values input by the user; it automatically generates the number of clusters.

Denclue: DENCLUE (Clustering Based on Density Distribution Function) is a clustering technique in which the clustering depends on a density distribution function. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e. points going to the same local maximum are put into the same cluster. The disadvantage of DENCLUE 1.0 is that the hill climbing may make unnecessarily small steps in the beginning and never converges exactly to the maximum; it only comes close. The technique is based on an influence function (the impact of a data point on its neighborhood); the overall density of the data space is calculated as the sum of the influence functions applied to the data points, and clusters can be determined using density attractors (local maxima of the overall density function).

E. Grid Based Clustering

The grid-based clustering algorithm partitions the data space into a finite number of cells to form a grid structure; all clustering operations are then performed on the obtained grid structure. It is a well-organized clustering algorithm, but its effectiveness is strongly affected by the size of the cells. Grid-based approaches are popular for mining clusters in a large multidimensional space, in which clusters are regarded as regions denser than their surroundings. The computational complexity of most clustering algorithms is at least linearly proportional to the size of the data set. The great advantage of grid-based clustering is its significant reduction of the computational complexity, especially for clustering very large data sets [8]. In general, a typical grid-based clustering algorithm consists of the following five basic steps (Grabusts and Borisov, 2002) [7]:
1. Grid structure creation, i.e., splitting the data space into a finite number of cells.
2. Cell density calculation for each cell.
3. Sorting of the cells according to their densities.
4. Identification of the cluster centers according to the result.
5. Traversal of neighboring cells.

                     III. EXPERIMENTAL RESULTS

Here we have taken several series of datasets from several websites and direct surveys, and we performed applicable pattern detection for medical diagnosis.

Cancer Database (SEER Datasets): The web site www-dep.iarc.fr/globocan/database.htm consists of datasets containing the number of cancer patients who registered themselves. The dataset consists of basic attributes such as sex, age, marital status, height and weight. Data for the age group 20 to 75+ years were taken, and within this group the major cancers were examined. A total of male and female cases were examined for the various cancers. The data were collected, and a substantial distribution was found for incidence and mortality by sex and cancer site. The analysis suggests that more male cases were suffering from cancer than female cases.

In this study, the data were taken from SEER datasets, which record cancer patients from the years 1975-2008. The spatial dataset consists of location data including remotely sensed images, geographical information with spatial attributes such as location, digital sky survey data, mobile phone usage data, and medical data. Five major cancer areas, namely lung, kidney, bones, small intestine and liver, were examined. Data mining algorithms such as K-means, SOM and hierarchical clustering were then applied to the data sets. The database analysis was done using the XLMiner tool kit. Fig.3 presents the statistical diagram comparing the number of male and female cancer cases.

The data consist of discrete data sets with the following attribute values: type of cancer, male cases, female cases, and cases of death pertaining to the specific cancer. Around 21 cancers were used as part of the analysis. The XLMiner tool does not take discrete values, so they had to be transformed into continuous attribute values.

Figure 3: Female and Male cases of Cancer
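Steps 1-3 of the generic grid-based algorithm described above (grid creation, cell-density counting and density thresholding) can be sketched as follows for two-dimensional data. The bin count and density threshold here are arbitrary example values, and the cluster-center identification and neighbor traversal of steps 4-5 are omitted.

```python
import numpy as np

def dense_cells(points, bins=4, threshold=2):
    # Step 1: grid structure creation -- split the bounding box of the
    # data into a bins x bins structure of cells.
    # Step 2: cell density calculation -- count the points per cell.
    counts, x_edges, y_edges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    # Step 3: keep the cells whose density reaches the threshold; cluster
    # centers would then be sought among these dense cells and their
    # neighboring cells traversed (steps 4-5, not shown).
    return np.argwhere(counts >= threshold)   # (row, col) indices of dense cells
```

A tight group of points concentrated in one cell shows up as a single dense cell, while isolated points fall below the threshold and are ignored.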

Figure 4: Female Cases of cancer

Figure 5: Male cases of Cancer

Fig.4 and Fig.5 specify the number of clusters for males and females suffering from different cancers. This sample is collected from patients who did not survive the disease. The result of the analysis shows that the male ratio was larger in percentage compared to the opposite sex. By analyzing the collected data, we can possibly develop certain measures for the improved prevention of this disease. Fig.6 specifies the number of deaths due to cancer in both male and female cases, using XLMiner.

Figure 6: Female and Male death cases of Cancer

The K-means method is an efficient technique for clustering large data sets and is used here to determine the size of each cluster. The input of the hierarchical clustering is shown in Table 1; it contains the data, variables and parameters, which are all governed by the distance measure used inside the hierarchical clustering. Here Xn (n = 1,2,3,4,5) are the selected variables placed in the datasets. After this, HAC (hierarchical agglomerative clustering) is applied to our datasets using a tree-based partition method; the results are shown as clustering stages with the elapsed time in Table 2. HAC proved to give better results than the other clustering methods. The principal component analysis technique was used to visualize the data: the X, Y coordinates identify the positions of the objects, and the clusters were determined by the appropriate attribute values. The mean and standard deviation of each cluster was determined.

TABLE 1: Input of Hierarchical Clustering

Data
  Input data                    ['both dendrogram.xls']'Sheet1'!$A$
  # Records in the input data   4
  Input variables normalized    Yes
  Data Type                     Raw data
Variables
  Selected variables            X1  X2  X3  X4
Parameters/Options
  Draw dendrogram               Yes
  Show cluster membership       Yes
  # Clusters                    3
  Selected Similarity measure   Euclidean distance
  Selected clustering method    Average group linkage

TABLE 2: Clustering Stages

  Stage   Cluster 1   Cluster 2   Distance
  1       1           3           0.079582
  2       1           4           1.69234
  3       1           2           3.146232

  Elapsed Time: Overall (secs)   3.00

Table 3 presents the HAC (hierarchical agglomerative clustering) result, in which the clusters were determined with the appropriate size. The clusters are subdivided into many sub-clusters, and the attributes are Xn (n = 1,2,3,4,5). From these we predicted the clusters using hierarchical clustering.

TABLE 3: Hierarchical Clustering - Predicted Clusters

Figure 7 represents the dendrogram (average group linkage) in which the dataset has been partitioned into three clusters with the K-means.

Figure 7: Dendrogram
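The per-cluster mean and standard deviation used here to judge compactness can be computed as in this minimal sketch (any cluster labeling, e.g. a K-means output, can be fed in):

```python
import numpy as np

def cluster_stats(points, labels):
    # Mean and standard deviation of every cluster; a smaller standard
    # deviation indicates a more compact cluster, a larger one a more
    # dispersed cluster.
    stats = {}
    for c in np.unique(labels):
        members = points[labels == c]
        stats[int(c)] = (members.mean(axis=0), members.std(axis=0))
    return stats
```

Comparing the standard deviations across clusters then shows directly which clusters are compact and which are dispersed.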

The HAC clustering algorithm is applied on K-means to                                   The cluster compactness has been determined by
generate the dendrogram. In a dendrogram, the elements are                     standard deviation where the cluster becomes compact when
grouped together in one cluster when they have the closest                     standard deviation value decreases and if the value of standard
values of all elements available. The cluster 2 and cluster 3 are              deviation increases the cluster becomes dispersed.
combined together in the diagram. Analyzes is done in the
subdivisions of clusters.
TABLE 4: Representation of Cluster Mean and Standard Deviation

Cluster_HAC_1 = c_hac_1 (16 examples, 59.3%)
Continuous attributes: Mean (StdDev)

Att Desc           Test value   Group           Overall
YearofDiagnosis       0.5       82.50 (4.76)    80.59 (24.14)
fliver               -1.7       11.79 (4.34)    13.03 (4.47)
fstomach             -2.1       24.98 (5.31)    26.61 (4.84)
mstomach             -3.5       17.15 (1.90)    19.57 (4.22)
flungs               -3.6       18.88 (1.08)    19.83 (1.64)
mliver               -3.9        4.20 (1.92)     6.53 (3.72)
mlungs               -3.9       13.90 (0.68)    14.56 (1.05)
mkidney              -4.0       58.71 (3.83)    62.13 (5.30)
fkidney              -4.1       62.14 (5.17)    66.97 (7.28)

Cluster_HAC_1 = c_hac_2 (2 examples, 7.4%)
Continuous attributes: Mean (StdDev)

Att Desc           Test value   Group           Overall
moral                 2.9       64.90 (0.99)    55.78 (4.52)
mliver                2.7       13.60 (0.00)     6.53 (3.72)
mstomach              2.7       27.60 (2.97)    19.57 (4.22)
fliver                2.6       21.05 (1.77)    13.03 (4.47)
flungs                2.4       22.55 (0.78)    19.83 (1.64)
mkidney               2.1       70.00 (0.99)    62.13 (5.30)
fkidney               2.1       77.75 (2.05)    66.97 (7.28)
fstomach              1.7       32.20 (1.56)    26.61 (4.84)
YearofDiagnosis      -4.8        0.50 (0.71)    80.59 (24.14)

         Table 4 characterizes the clusters according to the
mean and standard deviation determined for each object and
cluster. The primary comparison is between the objects of
cluster 1. The second comparison is between the objects of
cluster 2 and cluster 1, and the third between cluster 3 and
cluster 1. The results show the mean and standard deviation of
each cluster and also among the objects in each cluster.
Cluster 1 has the lowest number of cancer cases, cluster 2 has
an average number of cancer cases, and cluster 3 has a large
number of cancer cases.

                          IV. Discussion

         This paper focuses on clustering algorithms such as
HAC and K-Means, in which HAC is applied on K-means to
determine the number of clusters. The quality of the clusters is
improved when HAC is applied on K-means. The paper has
referenced and discussed the issues of the specified algorithms
for the data analysis. The analysis does not include missing
records. The application demonstrates how data mining
techniques can be combined with medical data sets and can be
efficiently applied in various cancer-related research.

                          V. Conclusion

   This study clearly shows that data mining techniques are
promising for cancer datasets. Our future work will address
missing values and apply various algorithms for faster
processing of records. In addition, the research will focus on
spatial data clustering to develop a new spatial data mining
algorithm. Once our tool is implemented as a complete data
analysis environment on the cancer registry SEER datasets, we
aim at transferring the tool to related domains, thus showing
the flexibility and extensibility of the underlying basic
concepts and system architecture.

                          References

1.   Rao, Y.N., Sudir Gupta and S.P. Agarwal (2003). National
     Cancer Control Programme: Current status and strategies,
     50 years of cancer control in India, NCD Section, Director
     General of Health.
2.   Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996). A
     density-based algorithm for discovering clusters in large
     spatial databases. In Proceedings of the International
     Conference on Knowledge Discovery and Data Mining
     (KDD'96), Portland, Oregon, pp. 226-231.
3.   Ertöz, L., Steinbach, M., Kumar, V. (2003). "Finding
     Clusters of Different Sizes, Shapes, and Densities in
     Noisy, High Dimensional Data". In Proc. of SIAM Int.
     Conf. on Data Mining, pp. 1-12.
4.   Aberer, Karl (2001). P-Grid: A self-organizing access
     structure for P2P information systems. In Proc.
     International Conference on Cooperative Information
     Systems, pp. 179-194. Springer.
5.   Bar-Yossef, Ziv, and Maxim Gurevich (2006). Random
     sampling from a search engine's index. In Proc. WWW,
     pp. 367-376. ACM Press. DOI: doi.acm.org/10.1145/
     135777.1135833.
6.   Ng, R.T., and Han, J. (1994). Efficient and Effective
     Clustering Methods for Spatial Data Mining. Proc.
     20th Int. Conf. on Very Large Data Bases, pp. 144-155,
     Santiago, Chile.

7.   Grabust, P., Borisov, A. (2002). Using grid-clustering
     methods in data classification. In Proceedings of the
     International Conference on Parallel Computing &
     Electrical Engineering (PARELEC'2002), Warsaw,
     Poland, pp. 425-426.
8.   H. Pilevar, M. Sukumar, "GCHL: A grid-clustering
     algorithm for high-dimensional very large spatial data
     bases", Pattern Recognition Letters 26 (2005), pp. 999-
9.   Jones, C., et al. (2002). Spatial Information Retrieval and
     Geographical Ontologies: An Overview of the SPIRIT
     Project. In Proceedings: 25th ACM Conference of the
     Special Interest Group in Information Retrieval, pp. 387-
10.  Processing of Spatial Joins, Proc. ACM SIGMOD
     Int. Conf. on Management of Data, Minneapolis,
     MN, 1994, pp. 197-208.
11.  T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH:
     An Efficient Data Clustering Method for Very Large
     Databases", Proc. ACM SIGMOD Int'l Conf. on
     Management of Data, ACM Press, pp. 103-114 (1996).
12.  M. Ester, H. Kriegel, J. Sander, and X. Xu, "A Density-
     Based Algorithm for Discovering Clusters in Large
     Spatial Databases with Noise", In Proc. of 2nd Int. Conf.
     on KDD, 1996, pp. 226-231.
13.  Wei Wang, Jiong Yang, and Richard R. Muntz, "STING:
     A Statistical Information Grid Approach to Spatial Data
     Mining", In Proc. of 23rd Int. Conf. on VLDB, 1997,
     pp. 186-195.
14.  Ian H. Witten, Eibe Frank (2005). "Data Mining: Practical
     Machine Learning Tools and Techniques", 2nd Edition.
     Morgan Kaufmann, San Francisco.
15.  M.J. Horner, L.A.G. Ries, M. Krapcho, N. Neyman, R.
     Aminou, N. Howlader, et al. (2009). SEER Cancer
     Statistics Review, 1975-2007, based on November 2008
     SEER data submission.
16.  Gondek, D., Hofmann, T. (2007). Non-redundant data
     clustering. Knowl Inf Syst 12(1):1-24.

                       Authors Profile

N. Naga Saranya received her first degree in Mathematics
from Periyar University, Tamilnadu, India, in 2006. She
obtained her master's degree in Computer Applications from
Anna University, Tamilnadu, India, in 2009. She is currently
pursuing her Ph.D. degree under the guidance of Dr.
M. Hemalatha, Head, Dept. of Software Systems, Karpagam
University, Tamilnadu, India.

Dr. M. Hemalatha completed her MCA, M.Phil., and Ph.D. in
Computer Science and is currently working as an Asst.
Professor and Head, Dept. of Software Systems, Karpagam
University. She has ten years of teaching experience, has
published twenty-seven papers in international journals, and
has presented seventy papers in various national conferences
and one international conference. Her areas of research are
Data Mining, Software Engineering, Bioinformatics, and
Neural Networks. She is also a reviewer for several national
and international journals.
