
International Journal of Computer Information Systems, Vol. 3, No. 2, 2011

An Empirical Comparison of Data Mining Classification Methods

Angeline Christobel Y., Research Scholar, Department of Computer Science, Karpagam University, Coimbatore, Tamil Nadu, India, angeline_christobel@yahoo.com
Dr. Sivaprakasam, Sri Vasavi College, Erode, Tamil Nadu, India, psperode@yahoo.com

Abstract— In recent years, huge amounts of data have become available due to the rapid development of technologies. Data mining is a field developed as a means of extracting information and knowledge from databases to discover patterns. Classification is a data mining task used in grouping similar data objects together. It can be defined as supervised learning, as it assigns class labels to data objects based on the relationship between the data items and a pre-defined class label. In this paper we perform a comparative study of the performance of the C4.5, Naïve Bayes, SVM and KNN classification algorithms. The algorithms are evaluated based on Accuracy, Sensitivity, Specificity and Error rate.

Key Words: Data Mining, Classification, Performance

I. INTRODUCTION

Data mining is the extraction of hidden predictive information from large databases [1]. It uses well-established statistical and machine learning techniques to build models that predict some behavior of the data. Data mining tasks can be classified into two categories: descriptive and predictive data mining. Descriptive data mining provides information to understand what is happening inside the data without a predetermined idea. Predictive data mining allows the user to submit records with unknown field values, and the system will guess the unknown values based on previous patterns discovered from the database.

Data mining models can be categorized according to the tasks they perform:
1. Classification and Prediction
2. Clustering
3. Association Rules

Classification and prediction are predictive models, while clustering and association rules are descriptive models. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification is the task of examining the features of a newly presented object and assigning it to one of a predefined set of classes. Prediction is the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given object is likely to have [3, 4]. The most popular classification and prediction methods are:

1. Decision Trees
2. Rule based
3. Bayesian
4. Support Vector Machines
5. Artificial Neural Network
6. Ensemble methods
7. Lazy Learners

Decision tree induction is the learning of a decision tree from class-labeled training tuples. A rule-based classifier is a technique for classifying records using a collection of "if … then" rules. Bayesian classifiers are statistical classifiers and are based on Bayes' theorem. Support Vector Machines have their roots in statistical learning theory and have shown promising empirical results in many applications. An Artificial Neural Network is a computational model based on biological neural networks. An Ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier. Lazy learning is a learning method where the system delays generalizing the training data until a query is received.

The main objective of this paper is to compare the C4.5, Naïve Bayes, SVM and KNN algorithms on the data set "Wisconsin-breast-cancer", obtained from the UCI Machine Learning Repository, based on Accuracy, Sensitivity, Specificity and Error rate. These algorithms are among the top 10 algorithms in data mining [14].

II. CLASSIFICATION ALGORITHMS

Classification is a process of finding a model that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown [2]. Classification algorithms have a wide range of applications, such as churn prediction, fraud detection, artificial intelligence, and credit card rating. The C4.5, Naïve Bayes, SVM and KNN algorithms are discussed below.

A) C4.5 Algorithm

C4.5 is a popular and powerful decision tree classification algorithm developed by Ross Quinlan. It is a successor of ID3. It constructs the decision tree with a "divide and conquer" strategy. It addresses the problems of unavailable values, continuous attribute value ranges, pruning of decision trees and rule derivation. In C4.5, each node in a tree is associated with a set of cases, and these cases are assigned weights to take unknown attribute values into account. At the beginning, only the root is present; it is associated with the whole training set, and all the weights are equal to one. At each node the divide and conquer algorithm is executed, trying to exploit the locally best choice, with no backtracking allowed. In building a decision tree, records with unknown attribute values are handled by considering, for each split, only those records where the attribute values are available. Records with unknown attribute values can still be classified by estimating the probability of the various possible results. C4.5 produces a tree with a variable number of branches per node: when a discrete variable is chosen as the splitting attribute, there is one branch for each value of the attribute [5, 9].
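To make the tree-induction procedure concrete, the following minimal Python sketch trains a decision tree on the Wisconsin breast cancer data bundled with scikit-learn. This is an illustrative stand-in, not the paper's setup: the experiments below used Weka 3.7 (whose J48 classifier implements C4.5), scikit-learn's DecisionTreeClassifier is a CART-style learner rather than C4.5 itself, and load_breast_cancer ships the diagnostic variant of the dataset (569 instances, 30 features) rather than the 699-instance original used in the paper.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Diagnostic Wisconsin data, a stand-in for the paper's 699-instance set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion="entropy" selects information-gain style splits, the impurity
# family on which C4.5's gain ratio is built (CART defaults to Gini).
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy: %.4f" % tree.score(X_test, y_test))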
B) Naïve Bayes Algorithm

The Naïve Bayes classifier produces probability estimates rather than hard predictions: for each class value, it estimates the probability that a given instance belongs to that class. The advantage of the Naïve Bayes classifier is that it only requires a small amount of training data to estimate the parameters necessary for classification. It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence [2].

The Naïve Bayesian classifier [2] works as follows:

1. Each data sample is represented by an n-dimensional feature vector X = (x1, x2, …, xn), depicting n measurements made on the sample from n attributes A1, A2, …, An respectively.

2. Suppose that there are m classes, C1, C2, …, Cm. Given an unknown data sample X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the Naïve Bayesian classifier assigns an unknown sample X to the class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X).

3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, i.e. P(C1) = P(C2) = … = P(Cm). The class prior probabilities may also be estimated by P(Ci) = si / s, where si is the number of training samples of class Ci and s is the total number of training samples.

4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci) directly. In order to reduce the computation in evaluating P(X|Ci), the naïve assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another given the class label of the sample, i.e., that there are no dependence relationships among the attributes. Thus

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).

The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training samples, where xk refers to the value of attribute Ak for sample X, which may be categorical or continuous valued.

5. In order to classify an unknown sample X, P(X|Ci) P(Ci) is evaluated for each class Ci. Sample X is then assigned to the class Ci if and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.

In other words, X is assigned to the class Ci for which P(X|Ci) P(Ci) is the maximum.
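As a minimal sketch of steps 1 to 5 (again using the scikit-learn copy of the Wisconsin diagnostic data as a stand-in for the paper's Weka workflow), GaussianNB estimates the priors P(Ci) and per-attribute Gaussian conditionals P(xk|Ci) from the training set and assigns each sample to the class with the largest posterior:

from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Steps 3-4: fit() estimates P(Ci) from class frequencies and a Gaussian
# P(xk|Ci) per attribute, relying on class conditional independence.
nb = GaussianNB().fit(X, y)

# Step 5: predict_proba() returns the posteriors P(Ci|X) for each class;
# predict() assigns the class with the maximum posterior.
print(nb.predict_proba(X[:1]))
print(nb.predict(X[:1]))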
C) SVM (Support Vector Machine) Algorithm

This algorithm, introduced by Vapnik et al. [11], is a very powerful method that has been applied in a wide variety of applications. The basic concept in SVM is the hyperplane classifier, or linear separability. To achieve linear separability, SVM applies two basic ideas: margin maximization and kernels, that is, mapping the input space to a higher-dimensional feature space.

SVM is an algorithm with strong regularization properties; that is, the optimization procedure maximizes predictive accuracy while automatically avoiding over-fitting of the training data. Neural networks and radial basis functions, both popular data mining techniques, have the same functional form as SVM models; however, neither of these algorithms has the well-founded theoretical approach to regularization that forms the basis of SVM.

SVM projects the input data into a kernel space and then builds a linear model in this kernel space. A classification SVM model attempts to separate the target classes with the widest possible margin. A regression SVM model tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it. Different types of kernels and different kernel parameter choices can produce a variety of decision boundaries (classification) or function approximators (regression).
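The following minimal sketch, assuming scikit-learn's SVC in place of the Weka implementation actually used in the experiments, shows the two ideas named above: the RBF kernel supplies the implicit mapping to a higher-dimensional feature space, and the parameter C controls the trade-off between margin width and training error (regularization):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for margin-based methods; the RBF kernel performs the
# implicit higher-dimensional mapping, C the margin/error trade-off.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("Test accuracy: %.4f" % svm.score(X_test, y_test))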
D) KNN (K-Nearest Neighbor) Algorithm

KNN classification classifies instances based on their similarity. It is one of the most popular algorithms for pattern recognition. It is a type of lazy learning where the function is only approximated locally and all computation is deferred until classification. An object is classified by a majority vote of its neighbors; K is always a positive integer. The neighbors are selected from a set of objects for which the correct classification is known.

The KNN algorithm is as follows (a sketch implementing these steps is given after the list):

1. Determine K, i.e., the number of nearest neighbors.
2. Using the distance measure, calculate the distance between the query instance and all the training samples.
3. Sort the distances of all the training samples and determine the nearest neighbors based on the K-th minimum distance.
4. Since KNN is supervised learning, get the class labels of the training samples that fall among the K nearest.
5. The predicted value is decided by the majority vote of the nearest neighbors.
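A minimal sketch of these five steps in plain Python with NumPy. The Euclidean distance is an assumption on our part (the paper does not name a distance measure), and knn_predict is our own illustrative helper rather than a library function:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    # Step 1: k, the number of nearest neighbors, is given as a parameter.
    # Step 2: distance between the query instance and all training samples.
    dists = np.linalg.norm(X_train - query, axis=1)
    # Step 3: sort the distances and keep the k nearest training samples.
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: collect the neighbors' class labels and take a majority vote.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two 2-D classes, one query point near class 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1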
III. PERFORMANCE EVALUATION

Classifier performance depends on the characteristics of the data to be classified. Various empirical tests can be performed to compare classifiers, such as holdout, random sub-sampling, k-fold cross validation and the bootstrap method. In this study we have selected k-fold cross validation for evaluating the classifiers.

In k-fold cross validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds d1, d2, …, dk, each approximately equal in size. Training and testing are performed k times. In the first iteration, subsets d2, …, dk collectively serve as the training set in order to obtain a first model, which is tested on d1; the second iteration is trained on subsets d1, d3, …, dk and tested on d2; and so on [2].

Performance of the selected algorithms is measured in terms of Accuracy, Sensitivity, Specificity and Error rate, computed from the confusion matrix obtained. These measures are defined as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Error rate = (FP + FN) / (TP + FP + TN + FN)

where TP is the number of True Positives, TN is the number of True Negatives, FP is the number of False Positives, and FN is the number of False Negatives.
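As a minimal sketch, assuming scikit-learn in place of Weka and its bundled diagnostic Wisconsin data, the four measures can be computed from the pooled confusion matrix of a 10-fold cross validation as follows:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# Pool the out-of-fold predictions from 10-fold cross validation.
y_pred = cross_val_predict(SVC(), X, y, cv=10)

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
accuracy    = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
error_rate  = (fp + fn) / (tp + fp + tn + fn)
print(f"Accuracy={accuracy:.4f} Sensitivity={sensitivity:.4f} "
      f"Specificity={specificity:.4f} Error rate={error_rate:.4f}")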
IV. EXPERIMENTAL RESULTS

In this paper, 10-fold cross validation is applied for evaluating the performance of the classifiers. The classification models were developed using the data mining tool Weka, version 3.7. We have used the dataset "Wisconsin-breast-cancer", obtained from the UCI machine learning repository [13]. An attribute selection algorithm was applied to the dataset to preprocess the data. The dataset contains 699 instances and 10 attributes.

Table 1 shows the Accuracy, Sensitivity, Specificity and Error rate of the C4.5, Naïve Bayes, SVM and KNN algorithms. Figures 1 to 4 show the graphical representation of the differences in Accuracy, Sensitivity, Specificity and Error rate, respectively.

Table 1: Comparison of Data Mining Models

Algorithm      Accuracy   Sensitivity   Specificity   Error rate
C4.5           94.56%     95.63%        92.53%        5.43%
Naïve Bayes    95.99%     95.19%        97.51%        4.00%
SVM            96.99%     97.37%        96.26%        3.00%
KNN            95.13%     96.72%        92.11%        4.86%

[Figure 1: Comparison graph based on Accuracy]
[Figure 2: Comparison graph based on Sensitivity]
[Figure 3: Comparison graph based on Specificity]
[Figure 4: Comparison graph based on Error rate]

Our results show that, of the C4.5, Naïve Bayes, SVM and KNN algorithms, SVM performs the best classification: the error rate of SVM is the lowest, and its accuracy and sensitivity are very high compared to the other three models.

V. CONCLUSION AND SCOPE FOR FURTHER ENHANCEMENTS

In this paper, the performance of C4.5, Naïve Bayes, SVM and KNN is compared. The experiments were conducted on the dataset "Wisconsin-breast-cancer". Classification Accuracy, Sensitivity, Specificity and Error rate are validated by the 10-fold cross validation method. Our study shows that the Support Vector Machine turned out to be the best classifier. In future, we intend to improve the performance of these classification techniques by creating a meta model, which will be used to predict breast cancer disease in patients.

VI. REFERENCES

1. Kietikul Jearanaitanakij, "Classifying Continuous Data Set by ID3 Algorithm", Proceedings of the Fifth International Conference on Information, Communications and Signal Processing, 2005.
2. J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, USA, 2006.
3. Agrawal, R., Imielinski, T., Swami, A., "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
4. Chen, M., Han, J., Yu, P.S., "Data Mining: An Overview from Database Perspective", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
5. Quinlan, J.R., "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, 1993.
6. S.B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica, 31 (2007), pp. 249-268.
7. J.R. Quinlan, "Improved Use of Continuous Attributes in C4.5", Journal of Artificial Intelligence Research, 4:77-90, 1996.
8. J.R. Quinlan, "Induction of Decision Trees", in J.W. Shavlik and T.G. Dietterich (Eds.), Readings in Machine Learning, Morgan Kaufmann, 1990. Originally published in Machine Learning, Vol. 1, 1986, pp. 81-106.
9. Salvatore Ruggieri, "Efficient C4.5", IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444, 2002.
10. P. Domingos, M. Pazzani, "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss", Machine Learning, 29(2-3) (1997), pp. 103-130.
11. Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
12. Michael J. Sorich, John O. Miners, Ross A. McKinnon, David A. Winkler, Frank R. Burden, and Paul A. Smith, "Comparison of Linear and Nonlinear Classification Algorithms for the Prediction of Drug and Chemical Metabolism by Human UDP-Glucuronosyltransferase Isoforms".
13. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
14. Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg, "Top 10 Algorithms in Data Mining", Springer, 2007.
15. Niculescu-Mizil, A., and Caruana, R., "Predicting Good Probabilities with Supervised Learning", Proceedings of the 22nd International Conference on Machine Learning (ICML'05), 2005.
16. Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, Hong Kong.
17. Arbach, L., Reinhardt, J.M., Bennett, D.L., Fallouh, G., "Mammographic Masses Classification: A Comparison between Backpropagation Neural Network (BNN), K Nearest Neighbors (KNN), and Human Readers", 2003 IEEE CCECE.
18. D. Xhemali, C.J. Hinde, and R.G. Stone, "Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages", International Journal of Computer Science, 4(1):16-23, 2009.

AUTHOR'S PROFILE

Ms. Angeline Christobel is working as an Assistant Professor at AMA International University, Bahrain. Her research interest is in data mining.

Dr. Sivaprakasam is working as a Professor at Sri Vasavi College, Erode, Tamil Nadu, India. His research interests include data mining, Internet technology, web and caching technology, communication networks and protocols, and content distribution networks.
