					                                                                             International Journal of Computer Information Systems,
                                                                                                                 Vol. 3, No. 2, 2011

               An Empirical Comparison of Data Mining
                       Classification Methods
                Angeline Christobel . Y                                                  Dr. Sivaprakasam
                    Research Scholar                                               Department of Computer Science
                  Karpagam University                                                    Sri Vasavi College
              Coimbatore, Tamil Nadu, India                                           Erode, Tamil Nadu, India

Abstract: In recent years, huge amounts of data have become available
due to the rapid development of technology. Data mining is a field
developed as a means of extracting information and knowledge from
databases to discover patterns. Classification is a data mining task
that assigns similar data objects to the same class. It is a form of
supervised learning, as it assigns class labels to data objects based
on the relationship between the data items and a predefined class
label. In this paper we perform a comparative study of the performance
of the C4.5, Naïve Bayes, SVM and KNN classification algorithms. The
algorithms are evaluated on Accuracy, Sensitivity, Specificity and
Error rate.

    Key Words: Data Mining, Classification, Performance

                    I.       INTRODUCTION
Data mining is the extraction of hidden predictive information from
large databases [1]. It uses well established statistical and machine
learning techniques to build models that predict some behavior of the
data. Data mining tasks can be classified into two categories:
descriptive and predictive data mining. Descriptive data mining
provides information to understand what is happening inside the data
without a predetermined idea. Predictive data mining allows the user
to submit records with unknown field values, and the system will guess
the unknown values based on previous patterns discovered from the
database.

Data mining models can be categorized according to the tasks they
perform:
 1. Classification and Prediction
 2. Clustering
 3. Association Rules

Classification and prediction are predictive models, whereas
clustering and association rules are descriptive models.
Classification and prediction are two forms of data analysis that can
be used to extract models describing important data classes or to
predict future data trends. Classification is the task of examining
the features of a newly presented object and assigning it to one of a
predefined set of classes. Prediction is the construction and use of a
model to assess the class of an unlabeled object, or to assess the
value or value ranges of an attribute that a given object is likely to
have [3, 4]. The most popular classification and prediction methods
are:

 1. Decision Trees
 2. Rule based
 3. Bayesian
 4. Support Vector Machines
 5. Artificial Neural Network
 6. Ensemble methods
 7. Lazy Learners

Decision tree induction is the learning of a decision tree from
class-labeled training tuples.
A rule based classifier is a technique for classifying records using a
collection of “if … then” rules.
Bayesian classifiers are statistical classifiers based on Bayes’
theorem.
Support Vector Machines have their roots in statistical learning
theory and have shown promising empirical results in many
applications.
An Artificial Neural Network is a computational model based on
biological neural networks.
An Ensemble method constructs a set of base classifiers from training
data and performs classification by taking a vote on the predictions
made by each base classifier.
Lazy learning is a learning method in which the system defers
generalization of the training data until a query is received.

The main objective of this paper is to compare the C4.5, Naïve Bayes,
SVM and KNN algorithms on the data set “Wisconsin-breast-cancer”,
obtained from the UCI Machine Learning Repository, based on Accuracy,
Sensitivity, Specificity and Error rate. These algorithms are among
the top 10 algorithms in data mining [14].

                II. CLASSIFICATION ALGORITHMS
Classification is the process of finding a model that describes and
distinguishes data classes or concepts, so that the model can be used
to predict the class of objects whose class label is unknown [2].
Classification algorithms have a wide range of applications, such as
churn prediction, fraud detection, artificial intelligence and credit
card rating.

       August Issue                                         Page 24 of 107                                   ISSN 2229 5208

The C4.5, Naïve Bayes, SVM and KNN algorithms are discussed below.

                       A) C4.5 Algorithm
C4.5 is a popular and powerful decision tree classification algorithm
developed by Ross Quinlan. It is the successor of ID3. It constructs
the decision tree with a ‘divide and conquer’ strategy. It addresses
the problems of unavailable values, continuous attribute value ranges,
pruning of decision trees and rule derivation. In C4.5, each node in a
tree is associated with a set of cases. These cases are also assigned
weights to take into account unknown attribute values. At the
beginning, only the root is present; it is associated with the whole
training set, and all the weights are equal to one. At each node the
divide and conquer algorithm is executed, trying to exploit the
locally best choice, with no backtracking allowed. In building a
decision tree, we deal with training sets that have records with
unknown attribute values by considering only those records where those
attribute values are available. We can classify records that have
unknown attribute values by estimating the probability of the various
possible results. C4.5 produces trees with a variable number of
branches per node. When a discrete variable is chosen as the splitting
attribute in C4.5, there will be one branch for each value of the
attribute [5, 9].

                B) Naïve Bayes Algorithm
The Naïve Bayes classifier produces probability estimates rather than
predictions. For each class value it estimates the probability that a
given instance belongs to that class. The advantage of the Naïve Bayes
classifier is that it requires only a small amount of training data to
estimate the parameters necessary for classification. It assumes that
the effect of an attribute value on a given class is independent of
the values of the other attributes. This assumption is called class
conditional independence [2].

The Naïve Bayesian classifier [2] works as follows:

1. Each data sample is represented by an n-dimensional feature vector,
X = (x1, x2, …, xn), depicting n measurements made on the sample from
n attributes, respectively A1, A2, …, An.

2. Suppose that there are m classes, C1, C2, …, Cm. Given an unknown
data sample, X, the classifier will predict that X belongs to the
class having the highest posterior probability, conditioned on X. That
is, the Naïve Bayesian classifier assigns an unknown sample X to the
class Ci if and only if:

         P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized
is called the maximum posteriori hypothesis. By Bayes’ theorem,

         P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be
maximized. If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely, i.e.
P(C1) = P(C2) = … = P(Cm). Note that the class prior probabilities may
be estimated by

         P(Ci) = si / s

where si is the number of training samples of class Ci and s is the
total number of training samples.

4. Given data sets with many attributes, it would be extremely
computationally expensive to compute P(X|Ci). In order to reduce
computation in evaluating P(X|Ci) P(Ci), the naïve assumption of class
conditional independence is made. This presumes that the values of the
attributes are conditionally independent of one another, given the
class label of the sample, i.e., that there are no dependence
relationships among the attributes. Thus

         P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated
from the training samples, where xk refers to the value of attribute
Ak for sample X, which may be categorical or continuous valued.

5. In order to classify an unknown sample X, P(X|Ci) P(Ci) is
evaluated for each class Ci. Sample X is then assigned to the class Ci
if and only if

         P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i

In other words, it is assigned to the class Ci for which P(X|Ci) P(Ci)
is the maximum.

C) SVM (Support Vector Machine) Algorithm
This algorithm, introduced by Vapnik et al. [11], is a very powerful
method that has been applied in a wide variety of applications. The
basic concept in SVM is the hyperplane classifier, or linear
separability. To achieve linear separability, SVM applies two basic
ideas: margin maximization and kernels, that is, mapping the input
space to a higher-dimensional space, the feature space.
SVM is an algorithm with strong regularization properties, that is,
the optimization procedure maximizes predictive accuracy while
automatically avoiding over-fitting of the training data. Neural
networks and radial basis functions, both popular data mining
techniques, have the same functional form as SVM models; however,
neither of these algorithms has the well-founded theoretical approach
to regularization that forms the basis of SVM.
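The steps of the Naïve Bayesian classifier in Section II-B can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: all attributes are categorical, probabilities are estimated by plain counting (no smoothing), and the function names and the toy weather data are hypothetical and unrelated to the experiments in this paper.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Estimate P(Ci) = si / s and the counts needed for P(xk | Ci)."""
    s = len(labels)
    class_counts = Counter(labels)                       # si for each class Ci
    priors = {c: si / s for c, si in class_counts.items()}
    # value_counts[(c, k)][v] = number of class-c samples whose k-th attribute equals v
    value_counts = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for k, v in enumerate(x):
            value_counts[(c, k)][v] += 1
    return priors, class_counts, value_counts

def classify(x, priors, class_counts, value_counts):
    """Return the class Ci that maximizes P(X | Ci) * P(Ci), plus all scores."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for k, v in enumerate(x):
            # Class conditional independence: multiply the per-attribute estimates.
            score *= value_counts[(c, k)][v] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (outlook, windy) -> play
samples = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
           ("rainy", "no"), ("sunny", "no"), ("rainy", "yes")]
labels = ["yes", "no", "no", "yes", "yes", "no"]

priors, class_counts, value_counts = train_naive_bayes(samples, labels)
predicted, scores = classify(("sunny", "no"), priors, class_counts, value_counts)
print(predicted, scores)
```

Because P(X) is the same for every class, the sketch compares only the products P(X|Ci) P(Ci) and never computes the posterior itself. A value that never occurs with a class gives that class a score of zero, which is one reason practical implementations usually add some form of smoothing.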

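The kernel idea used by SVM, mapping the input space to a higher-dimensional feature space in which the classes become linearly separable, can be illustrated with a small Python sketch. The mapping, the four data points and the hyperplane below are hypothetical and hand-picked for illustration; no SVM optimization or margin maximization is performed here.

```python
def feature_map(x1, x2):
    """Hypothetical explicit mapping from the 2-D input space to a 3-D feature space."""
    return (x1, x2, x1 * x2)

# XOR-style data: no straight line in the (x1, x2) plane separates the two classes.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [1, -1, -1, 1]

# Hand-picked hyperplane w . z + b = 0 in feature space (an assumption, not a trained model).
w, b = (0.0, 0.0, 1.0), 0.0

def decision(x1, x2):
    """Value of the linear model evaluated in feature space."""
    z = feature_map(x1, x2)
    return sum(wi * zi for wi, zi in zip(w, z)) + b

predictions = [1 if decision(*p) > 0 else -1 for p in points]
margins = [y * decision(*p) for p, y in zip(points, labels)]
print(predictions, margins)
```

In the mapped space a single linear model classifies all four points correctly, which is not possible in the original two-dimensional space. A kernel function lets an SVM work in such a feature space without computing the mapping explicitly.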

SVM projects the input data into a kernel space and then builds a
linear model in this kernel space. A classification SVM model attempts
to separate the target classes with the widest possible margin. A
regression SVM model tries to find a continuous function such that the
maximum number of data points lie within an epsilon-wide tube around
it. Different types of kernels and different kernel parameter choices
can produce a variety of decision boundaries (classification) or
function approximators (regression).

D) KNN (K-Nearest Neighbor) Algorithm
KNN classification classifies instances based on their similarity. It
is one of the most popular algorithms for pattern recognition. It is a
type of lazy learning, where the function is only approximated locally
and all computation is deferred until classification.
An object is classified by a majority vote of its neighbors. K is
always a positive integer. The neighbors are selected from a set of
objects for which the correct classification is known.

The KNN algorithm is as follows:

    1.    Determine K, i.e., the number of nearest neighbors.
    2.    Using the distance measure, calculate the distance
          between the query instance and all the training samples.
    3.    Sort the distances of all the training samples and
          determine the nearest neighbors based on the K-th minimum
          distance.
    4.    Since KNN is supervised learning, get the categories of
          the training data for the sorted values which fall under K.
    5.    The predicted value is determined by the majority of the
          nearest neighbors.

         III. PERFORMANCE EVALUATION
Classifier performance depends on the characteristics of the data to
be classified. Various empirical tests can be performed to compare
classifiers, such as holdout, random sub-sampling, k-fold cross
validation and the bootstrap method. In this study we have selected
k-fold cross validation for evaluating the classifiers.

In k-fold cross validation, the initial data are randomly partitioned
into k mutually exclusive subsets or folds d1, d2, …, dk, each
approximately equal in size. Training and testing are performed k
times. In the first iteration, subsets d2, …, dk collectively serve as
the training set in order to obtain a first model, which is tested on
d1; the second iteration is trained on subsets d1, d3, …, dk and
tested on d2; and so on [2].

Performance of the selected algorithms is measured for Accuracy,
Sensitivity, Specificity and Error rate from the confusion matrix
obtained.

The Accuracy, Sensitivity, Specificity and Error rate can be defined
as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Error rate = (FP + FN) / (TP + FP + TN + FN)

where TP is the number of True Positives
      TN is the number of True Negatives
      FP is the number of False Positives
      FN is the number of False Negatives

                IV. EXPERIMENTAL RESULTS
In this paper 10-fold cross validation is applied for evaluating the
performance of the classifiers. The data mining classification models
were developed using the data mining classification tool Weka version
3.7. We have used the dataset “Wisconsin-breast-cancer”, which is
obtained from the UCI Machine Learning Repository [13]. An attribute
selection algorithm was applied to the dataset to preprocess the data.
The dataset contains 699 instances and 10 attributes.
Table 1 shows the Accuracy, Sensitivity, Specificity and Error rate of
the C4.5, Naïve Bayes, SVM and KNN algorithms.

Figure 1 shows the graphical representation of difference in Accuracy.

Figure 2 shows the graphical representation of difference in
Sensitivity.

Figure 3 shows the graphical representation of difference in
Specificity.

Figure 4 shows the graphical representation of difference in Error
rate.

Table 1: Comparison of Data Mining Models

Algorithms     Accuracy   Sensitivity   Specificity   Error rate
C4.5           94.56%     95.63%        92.53%        5.43%
Naïve Bayes    95.99%     95.19%        97.51%        4.00%
SVM            96.99%     97.37%        96.26%        3.00%
KNN            95.13%     96.72%        92.11%        4.86%
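The four measures defined in Section III can be computed directly from the counts of a confusion matrix. The following Python sketch shows the arithmetic; the function name and the counts in the usage example are hypothetical and are not the counts behind the results reported in this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Sensitivity, Specificity and Error rate from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "error_rate": (fp + fn) / total,
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Accuracy and Error rate share the same denominator and their numerators together cover every case, so the two always sum to one.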

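The k-fold partitioning described in Section III can be sketched in Python as follows. This is a plain, unstratified illustration and not the procedure implemented in Weka; the function name and the random seed are assumptions, while the sample count of 699 and k = 10 are taken from the experimental setup.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition of the data
    folds = [indices[i::k] for i in range(k)]     # k mutually exclusive folds d1, ..., dk
    for i in range(k):
        test = folds[i]                           # fold di is held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n_samples=699, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each instance is used for testing exactly once and for training in the remaining k - 1 iterations, and the fold sizes differ by at most one instance.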
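The five steps of the KNN algorithm in Section II-D can be sketched in Python as follows. The Euclidean distance is an assumption, since the text does not fix a distance measure, and the function name, the training points and the class labels "A" and "B" are hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, training_points, training_labels, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 2: distance between the query instance and every training sample.
    distances = [(math.dist(query, point), label)
                 for point, label in zip(training_points, training_labels)]
    # Step 3: sort by distance and keep the k nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: collect the categories of those neighbors.
    votes = Counter(label for _, label in nearest)
    # Step 5: the prediction is the majority category.
    return votes.most_common(1)[0][0]

# Hypothetical two-attribute training data with two classes.
training_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                   (6.0, 6.0), (7.0, 7.5), (6.5, 5.5)]
training_labels = ["A", "A", "A", "B", "B", "B"]

# Step 1: K is chosen by the user; an odd K avoids ties between two classes.
print(knn_classify((2.0, 2.0), training_points, training_labels, k=3))
print(knn_classify((6.0, 6.5), training_points, training_labels, k=3))
```

No model is built in advance: all the work happens inside `knn_classify` when a query arrives, which is what makes KNN a lazy learner.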

Figure 1: Comparison graph based on Accuracy
[Chart “Difference in Classification Accuracy”: Accuracy rate (%) by Algorithm (C4.5, Naïve Bayes, SVM, KNN)]

Figure 2: Comparison graph based on Sensitivity
[Chart “Difference in Classification Sensitivity”: by Algorithm (C4.5, Naïve Bayes, SVM, KNN)]

Figure 3: Comparison graph based on Specificity
[Chart “Difference in Classification Specificity”: by Algorithm (C4.5, Naïve Bayes, SVM, KNN)]

Figure 4: Comparison graph based on Error rate
[Chart “Difference in Classification Error rate”: Error rate (%) by Algorithm (C4.5, Naïve Bayes, SVM, KNN)]

Our results show that, of the C4.5, Naïve Bayes, SVM and KNN
algorithms, SVM gives the best classification. The error rate of SVM
is the lowest, and its accuracy and sensitivity are higher than those
of the other three models.

                V. CONCLUSION AND SCOPE FOR FURTHER
                           ENHANCEMENTS
In this paper, the performance of C4.5, Naïve Bayes, SVM and KNN is
compared. The experiments were conducted on the dataset
“Wisconsin-breast-cancer”. Classification Accuracy, Sensitivity,
Specificity and Error rate are validated by the 10-fold cross
validation method. Our study shows that the Support Vector Machine
turned out to be the best classifier. In future we intend to improve
the performance of these classification techniques by creating a meta
model which will be used to predict breast cancer in patients.

                          VI. REFERENCES
1.    Kietikul Jearanaitanakij, “Classifying Continuous Data Set by
      ID3 Algorithm”, Proceedings of the Fifth International
      Conference on Information, Communication and Signal Processing,
      2005.
2.    J. Han and M. Kamber, “Data Mining: Concepts and Techniques”,
      Morgan Kaufmann Publishers, USA.
3.    Agrawal, R., Imielinski, T., Swami, A., “Database Mining: A
      Performance Perspective”, IEEE Transactions on Knowledge and
      Data Engineering, pp. 914-925, December 1993.
4.    Chen, M., Han, J., Yu, P.S., “Data Mining: An Overview from a
      Database Perspective”, IEEE Transactions on Knowledge and Data
      Engineering, Vol. 8, No. 6, December 1996.


5.    Quinlan, J. R., “C4.5: Programs for Machine Learning”, Morgan
      Kaufmann Publishers, 1993.
6.    S.B. Kotsiantis, “Supervised Machine Learning: A Review of
      Classification Techniques”, Informatica 31 (2007), pp. 249-268.
7.    J. R. Quinlan, “Improved use of continuous attributes in C4.5”,
      Journal of Artificial Intelligence Research, 4:77-90, 1996.
8.    J.R. Quinlan, “Induction of decision trees”, in Jude W. Shavlik,
      Thomas G. Dietterich (Eds.), Readings in Machine Learning,
      Morgan Kaufmann, 1990. Originally published in Machine Learning,
      vol. 1, 1986, pp. 81-106.
9.    Salvatore Ruggieri, “Efficient C4.5”, IEEE Transactions on
      Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444,
      2002.
10.   P. Domingos, M. Pazzani, “On the optimality of the simple
      Bayesian classifier under zero-one loss”, Machine Learning
      29(2-3) (1997), pp. 103-130.
11.   Vapnik, V.N., “The Nature of Statistical Learning Theory”, 1st
      ed., Springer-Verlag, New York, 1995.
12.   Michael J. Sorich, John O. Miners, Ross A. McKinnon, David A.
      Winkler, Frank R. Burden, and Paul A. Smith, “Comparison of
      linear and nonlinear classification algorithms for the
      prediction of drug and chemical metabolism by human
      UDP-Glucuronosyltransferase Isoforms”.
13.   UCI Machine Learning Repository.
14.   Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang
      Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu,
      Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand,
      Dan Steinberg, “Top 10 algorithms in data mining”, Springer,
      2007.
15.   Niculescu-Mizil, A., & Caruana, R. (2005), “Predicting good
      probabilities with supervised learning”, Proc. 22nd
      International Conference on Machine Learning (ICML'05).
16.   Thair Nu Phyu, “Survey of Classification Techniques in Data
      Mining”, International MultiConference of Engineers and Computer
      Scientists 2009, Vol. I, IMECS 2009, Hong Kong.
17.   Arbach, L., Reinhardt, J.M., Bennett, D.L., Fallouh, G. (Iowa
      Univ., Iowa City, IA, USA), “Mammographic masses classification:
      comparison between backpropagation neural network (BNN), K
      nearest neighbors (KNN), and human readers”, 2003 IEEE CCECE.
18.   D. Xhemali, C.J. Hinde, and R.G. Stone, “Naïve Bayes vs.
      Decision Trees vs. Neural Networks in the Classification of
      Training Web Pages”, International Journal of Computer Science,
      4(1):16-23, 2009.

                        AUTHOR’S PROFILE

Ms. Angeline Christobel is working as an Asst. Professor in AMA
International University, Bahrain. Her research interest is in Data
mining.

Dr. Sivaprakasam is working as a Professor in Sri Vasavi College,
Erode, Tamil Nadu, India. His research interests include Data mining,
Internet Technology, Web & Caching Technology, Communication Networks
and Protocols, Content Distributing Networks.
