(IJCSIS) International Journal of Computer Science and Information Security, Vol. 10, No. 9, September 2012

Comparison of Supervised Learning Techniques for Binary Text Classification

Hetal Doshi, Dept of Electronics and Telecommunication, KJSCE, Vidyavihar, Mumbai - 400077, India, hsdoshi@gmail.com
Maruti Zalte, Dept of Electronics and Telecommunication, KJSCE, Vidyavihar, Mumbai - 400077, India, mbzalte@rediffmail.com

Abstract — An automated text classifier is a useful assistant in information management. In this paper, the supervised learning techniques Naïve Bayes, Support Vector Machine (SVM) and K Nearest Neighbour (KNN) are implemented for classifying selected categories from the 20 Newsgroup and WebKB datasets. Two weighting schemes for representing documents are employed and compared. Results show that the effectiveness of a weighting scheme depends on the nature of the dataset and the modeling approach adopted. Accuracy of the classifiers can be improved by using more training documents. Naïve Bayes mostly performs better than SVM and KNN when the number of training documents is small, while the average improvement of SVM with additional training documents is larger than that of Naïve Bayes and KNN. Accuracy of KNN is lower than that of Naïve Bayes and SVM. A procedure for selecting the optimum classifier for a given dataset using cross-validation is verified, and a procedure for identifying probably misclassified documents is developed.

Keywords - Naïve Bayes, SVM, KNN, supervised learning, text classification.

I. INTRODUCTION

Manually organizing a large set of electronic documents into required categories/classes can be extremely taxing, time consuming, expensive and is often not feasible. Text classification, also known as text categorization, deals with the assignment of text to a particular category from a predefined set of categories, based on the words of the text. Text classification combines the concepts of Machine Learning and Information Retrieval. Machine Learning is a field of Artificial Intelligence (AI) that deals with the development of techniques or algorithms that let computers understand and extract patterns in given data. Various applications of machine learning in the fields of speech recognition, computer vision, robot control etc. are discussed in [1]. Text classification finds applications in various domains like knowledge management, human resource management, sorting of online information, emails, information technology and the internet [2]. Text classification can be implemented using various supervised and unsupervised machine learning techniques [3]. Various performance parameters for binary text classification evaluation are discussed in [4]. Accuracy is the evaluation parameter for the classifiers implemented in this paper.

A binary text classifier is a function that maps input feature vectors x to output class/category labels y ∈ {1, 0}. The aim is to learn the function f from an available labeled training set of N input-output pairs (xi, yi), i = 1...N [5]. This is called supervised learning, as opposed to unsupervised learning, which does not use a labeled training set. There are two ways of implementing a classifier model. In a discriminative model, the aim is to learn a function that computes the class posterior p(y|x) directly, thus discriminating between the different classes given the input. In a generative model, the aim is to learn the class-conditional density p(x|y) for each value of y together with the class priors p(y), and then compute the class posterior by applying Bayes rule [5]:

    p(y|x) = p(x|y) p(y) / p(x)                                    (1)

This is known as a generative model because it specifies a way to generate the feature vector x for each possible class y. The Naïve Bayes classifier is an example of a generative model, while SVM is an example of a discriminative model. KNN adopts a different approach from Naïve Bayes and SVM: in KNN, calculations are deferred until actual classification, and no model is built from the training examples.

In this paper, section II explains Naïve Bayes and the TF (Term Frequency) and TF*IDF (Term Frequency * Inverse Document Frequency) weighting schemes. Section III explains SVM and its quadratic programming optimization problem. Section IV describes the KNN classifier and its distance computation method. Section V provides the implementation steps for a binary text classifier. Section VI discusses result analysis, followed by conclusions in section VII.
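As a numeric illustration of the generative decision rule in (1): with invented priors p(y) and class-conditional likelihoods p(x|y) for a single observed document x, the posterior follows directly (all numbers below are hypothetical).

```python
# Generative classification via Bayes rule: p(y|x) = p(x|y) p(y) / p(x).
# Priors and likelihoods are made-up numbers for illustration only.
priors = {"C1": 0.6, "C0": 0.4}        # p(y)
likelihood = {"C1": 0.02, "C0": 0.05}  # p(x|y) for one observed document x

evidence = sum(likelihood[c] * priors[c] for c in priors)  # p(x)
posterior = {c: likelihood[c] * priors[c] / evidence for c in priors}

# Assign the label of the category with the larger posterior.
label = max(posterior, key=posterior.get)
print(label, round(posterior[label], 3))
```

Even though C1 has the larger prior, the larger likelihood under C0 wins here, which is exactly the trade-off the Bayes rule encodes.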
II. NAÏVE BAYES

Naïve Bayes (NB) is based on a probabilistic model that uses a collection of labeled training documents to estimate the parameters of the model; every new document is then classified using Bayes rule, by selecting the category that is most likely to have generated the new example [8]. The principles of Bayes theorem and its application in developing a Naïve Bayes classifier are discussed in [6]. Naïve Bayes takes a simple approach in both its training and classification phases [7]. The Naïve Bayes model assumes that all attributes of the training documents are independent of each other given the context of the class. Reference [8] describes the differences and details of the two models of Naïve Bayes, the Bernoulli multivariate model and the multinomial model. Its results show that the accuracy achieved by the multinomial model is better than that achieved by the Bernoulli multivariate model for large vocabularies. The multinomial model of Naïve Bayes is therefore selected for implementation in this paper; its procedure is described in [9] and [10].

If document Di (representing the input feature vector x) is to be classified (i = 1...N), the learning algorithm should be able to classify it in the required category Ck (representing the output class y). The category can be either C1, the category with label 1, or C0, the category with label 0. In the multinomial model, the document feature vector captures word frequency information and not just word presence or absence. In this model, a biased V-sided dice is considered, where each side of the dice represents a word Wt with probability p(Wt|Ck), t = 1...V. At each position in the document the dice is rolled and a word is inserted. A document is thus generated as a bag of words, which records which words are present in the document and their frequencies of occurrence. Mathematically this is achieved by defining Mi as the multinomial model feature vector for the ith document Di. Mit is the frequency with which word Wt occurs in document Di, and ni is the total number of words in Di. The vocabulary size V is defined as the number of unique words found in the documents. The training documents are scanned to obtain the following counts:

N: number of documents
Nk: number of documents of class Ck, for both classes

Let Zik = 1 when Di has class Ck and Zik = 0 otherwise. The likelihoods p(Wt|Ck) are then estimated as [9]

    p(Wt|Ck) = Σi Zik Mit / Σs Σi Zik Mis                          (2)

If a particular word does not appear in a category, the probability calculated by (2) becomes zero. To avoid this problem, Laplace smoothing is applied [9]:

    p(Wt|Ck) = (1 + Σi Zik Mit) / (V + Σs Σi Zik Mis)              (3)

The priors are estimated as

    p(Ck) = Nk / N                                                 (4)

After training is performed and the parameters are ready, for every new unlabelled document Dj the posterior probability for each category is estimated as [9]

    p(Ck|Dj) ∝ p(Ck) Πt p(Wt|Ck)^Mjt                               (5)

The calculation in (5) is done for both categories and the results are compared. Depending on which of the two values is greater, the label of that category is assigned to the testing document. The assigned labels are compared with the true labels of the testing documents to evaluate accuracy.

Mit represents the Term Frequency (TF) representation, in which the frequency of occurrence of a particular word Wt in a given document Di is captured (local information). The TF representation, however, scales up frequent terms and scales down rare terms, which are mostly more informative than high-frequency terms. The basic intuition is that a word occurring frequently in many documents is not a good discriminator. The TF*IDF weighting scheme helps solve this problem: it measures how important a word is to a document in a collection by incorporating both local and global information, taking into consideration not only the isolated term but also the term within the document collection [4]. Let

NFt = document frequency, i.e. the number of documents containing term t, and N = number of documents. Then

NFt / N = probability of selecting a document containing the queried term from the collection, and

log(N / NFt) = Inverse Document Frequency, IDFt, which represents global information. In practice, 1 is added to NFt to avoid division by zero in some cases.

Multiplying TF values with IDF values combines local and global information, so the weight of a term = TF * IDF. This is commonly referred to as TF*IDF weighting. Because longer documents, with more terms and higher term frequencies, tend to produce larger dot products than smaller documents, which can skew and bias similarity measures, normalization is recommended. A very common normalization method is dividing the weights by the L2 norm of the document. The L2 norm of the vector representing a document is simply the square root of the dot product of the document vector with itself.
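Equations (2)-(5) and the TF*IDF weighting described above can be condensed into a short sketch. The toy corpus below is invented for illustration; the smoothing and IDF forms follow the section (add-one Laplace smoothing, IDF computed as log(N/(1+NFt)), and unseen test words simply ignored).

```python
import math

# Toy labeled corpus (invented): class 1 = "sport", class 0 = "science".
docs = [(["ball", "team", "win"], 1),
        (["team", "score", "ball"], 1),
        (["circuit", "signal"], 0),
        (["signal", "noise", "circuit"], 0)]
vocab = sorted({w for words, _ in docs for w in words})
V = len(vocab)

# --- Multinomial Naive Bayes training: eqs (3) and (4) ---
def train_nb(docs):
    counts = {0: {w: 0 for w in vocab}, 1: {w: 0 for w in vocab}}
    n_docs = {0: 0, 1: 0}
    for words, y in docs:
        n_docs[y] += 1
        for w in words:
            counts[y][w] += 1
    # Laplace-smoothed likelihoods p(Wt|Ck), eq (3)
    like = {y: {w: (1 + counts[y][w]) / (V + sum(counts[y].values()))
                for w in vocab} for y in (0, 1)}
    priors = {y: n_docs[y] / len(docs) for y in (0, 1)}  # eq (4)
    return like, priors

# --- Classification: log form of the posterior in eq (5) ---
def classify_nb(words, like, priors):
    scores = {}
    for y in (0, 1):
        s = math.log(priors[y])
        for w in words:
            if w in like[y]:          # each occurrence adds one log term
                s += math.log(like[y][w])
        scores[y] = s
    return max(scores, key=scores.get)

# --- TF*IDF with L2 normalization, as described above ---
def tfidf(doc_words, all_docs):
    N = len(all_docs)
    vec = {}
    for w in set(doc_words):
        tf = doc_words.count(w)
        nf = sum(1 for d, _ in all_docs if w in d)  # document frequency NFt
        vec[w] = tf * math.log(N / (1 + nf))        # IDF with the +1 guard
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}    # L2 normalization

like, priors = train_nb(docs)
pred = classify_nb(["ball", "score"], like, priors)
```

After L2 normalization every document vector has unit length, so dot products between documents reduce to cosine similarities regardless of document length.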
III. SUPPORT VECTOR MACHINE

An SVM model is a representation of the training documents as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. Certain properties of text, such as high-dimensional feature spaces, few irrelevant features (i.e. dense concept vectors) and sparse document vectors, are well handled by SVM, making it suitable for document classification [11].

The Support Vector Machine classification algorithm is based on a maximum margin training algorithm. It finds a decision function D(x) for pattern vectors x of dimension V belonging to either of the two categories, 1 and 0 (-1). The input to the training algorithm is a set of N examples xi with labels yi, i.e. (x1, y1), (x2, y2), ..., (xN, yN).

Fig. 1 shows a two-dimensional feature space with vectors belonging to one of the two categories; two dimensions are used for simplicity in understanding the maximum margin training algorithm. The objective is to make the margin M separating the two categories as wide as possible: SVM maximizes the margin around the separating hyperplane. The decision function is fully specified by a subset of the training samples, the support vectors.

Figure 1: Maximum margin solution in two-dimensional space

To obtain the classifier boundary in terms of w and b, two hyperplanes are defined: the plus hyperplane w·x + b = +1 and the minus hyperplane w·x + b = -1, which are the borders of the maximum margin. The distance between the plus and minus hyperplanes is the margin M, which is to be maximized:

    M = 2 / ||w||                                                  (6)

The margin 2/||w|| is maximized subject to the constraints

    yi (w·xi + b) ≥ 1,  i = 1...N                                  (7)

Sometimes the vectors are not linearly separable, as indicated in fig. 2.

Figure 2: Non-linearly separable vector points

Hence there is a need to soften the constraint that the data points lie on the correct side of the plus and minus hyperplanes, i.e. some data points are allowed to violate these constraints, preferably by a small amount. This approach also helps improve generalization and is called the soft margin SVM. In this approach, slack variables ξi are introduced, as shown in the quadratic programming model

    minimize   (1/2) ||w||² + C Σi ξi
    subject to yi (w·xi + b) ≥ 1 - ξi,  ξi ≥ 0                     (8)

The value C is a regularization parameter which trades off how large a margin is preferred against the number of training examples that violate the margin, and by what amount [12]. The optimum value of C is obtained by cross-validation.

The standard way to proceed with the mathematical analysis is to convert the soft margin SVM problem (8) into an equivalent Lagrangian dual problem, which is maximized [12], [15] with respect to the Lagrange multipliers αi:

    maximize   Σi αi - (1/2) Σi Σj αi αj yi yj (xi·xj)
    subject to 0 ≤ αi ≤ C,  Σi αi yi = 0                           (9)

The bias b is obtained by applying the decision function to two arbitrary supporting patterns, x1 belonging to C1 and x2 belonging to C0 [13]:

    b = -(1/2) w·(x1 + x2)                                         (10)

LIBSVM, a library for Support Vector Machines, is used for implementing SVM for text classification in this paper [14]. Its objective is to help users easily apply SVM to their respective applications. Reference [17] provides practical aspects of implementing the SVM algorithm and of using a linear kernel for text classification, which involves a large number of features.
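The paper trains its SVMs with LIBSVM [14]; the snippet below is not LIBSVM but a deliberately minimal sketch of the primal form of the soft-margin objective in (8), minimized by stochastic subgradient descent on an invented, linearly separable 2-D toy set. The data, learning rate and epoch count are all assumptions for illustration; C plays the same margin/violation trade-off role described above.

```python
# Minimal primal soft-margin linear SVM: hinge loss + subgradient descent.
# Toy 2-D data (invented); labels are +1 / -1 as in the maximum-margin setup.
data = [((2.0, 2.0), 1), ((3.0, 1.5), 1), ((2.5, 3.0), 1),
        ((-2.0, -2.0), -1), ((-3.0, -1.5), -1), ((-2.5, -3.0), -1)]

C, lr, epochs = 1.0, 0.01, 500
w, b = [0.0, 0.0], 0.0

for _ in range(epochs):
    for (x1, x2), y in data:
        margin = y * (w[0] * x1 + w[1] * x2 + b)
        # Subgradient of (1/2)||w||^2 + C * max(0, 1 - y(w.x + b))
        if margin < 1:                      # point violates the margin
            w[0] += lr * (C * y * x1 - w[0])
            w[1] += lr * (C * y * x2 - w[1])
            b += lr * C * y
        else:                               # only the regularizer acts
            w[0] -= lr * w[0]
            w[1] -= lr * w[1]

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

preds = [predict(x) for x, _ in data]
```

On this separable toy set the learned w settles where every point satisfies y(w·x + b) ≥ 1, i.e. outside the plus/minus hyperplanes of (6)-(7); with noisy data, C would control how many points are allowed inside the margin.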
IV. K NEAREST NEIGHBOUR

K nearest neighbour is a pattern recognition technique that classifies a document by evaluating the closest training examples in the multidimensional feature space. It is a type of instance-based or lazy learning, in which the decision function is approximated locally and all computations are deferred until classification. A document is classified by a majority vote of its neighbours, the document being assigned to the category most common amongst its K nearest neighbours.

Reference [16] explains how K nearest neighbour can be applied to text classification and the importance of the value of K for a given problem. KNN is based on calculating the distance of the query document from the training documents [18]. Cosine similarity is selected for distance measurement, as documents are represented as vectors in the multidimensional feature space; the distance between documents Da and Db is calculated as

    cos(Da, Db) = (Da · Db) / (||Da|| ||Db||)                      (11)

For improving the accuracy of the KNN classifier, the optimum value of K is important. K is the parameter indicating the number of nearest neighbours to be considered for label computation. The accuracy of the KNN classifier is severely affected by the presence of noisy and irrelevant features, and the best value of K is data dependent. A larger value of K reduces the effect of noise and results in a smoother, less locally sensitive decision function, but it makes the boundaries between the classes less distinct, resulting in misclassification. The optimum value of K is found by cross-validation.

V. IMPLEMENTATION STEPS OF BINARY TEXT CLASSIFIER

There are primarily three steps in implementing a binary text classifier, using MATLAB(TM) as a tool:

1. Feature extraction: The text document, comprising the words on the basis of which classification is performed, is converted into a matrix format capturing the properties of the words found in the documents. This can be done in two ways, TF or TF*IDF. A matrix is created in which the number of rows equals the number of documents, the number of columns equals the number of words in the dictionary defined for the classification task, and each element is the TF or TF*IDF weight of a word in the respective document.

2. Training the classifier: During the training phase, the classifier is provided with the training documents along with their labels. The classifier develops a model representing the pattern by which the training documents are related to their labels on the basis of the words appearing in the documents. Parameter tuning is performed using cross-validation.

3. Testing the classifier: On the basis of the developed model, the classifier predicts labels for the testing documents. Classification accuracy is assessed by comparing the predicted labels with the true labels:

    Accuracy = number of correctly classified testing documents / total number of testing documents

VI. RESULT ANALYSIS

For implementation and evaluation of the text classifiers, the 20 Newsgroup and WebKB datasets, available at [20], are used. The difference in the nature of the datasets is described in [19]. Within each dataset there are two groups. For 20 Newsgroup, group 1 is 'rec.sport.baseball' & 'rec.sport.hockey' and group 2 is 'sci.electronics' & 'sci.med'. For the WebKB dataset, group 1 is Faculty and Course and group 2 is Student and Course. Accuracy evaluation is performed for all four groups using both TF and TF*IDF document representation.

A. Comparison of TF and TF*IDF weighting schemes

Total accuracy for the TF and TF*IDF representations is obtained after performing 10-fold cross-validation for different values of C for the SVM classifier and K for the KNN classifier. In the 10-fold cross-validation process, the entire training set for a given group is divided into 10 subsets of almost equal size. Sequentially, one subset is tested using the classifier trained on the remaining 9 subsets, so the process is repeated 10 times, and the total accuracy is evaluated. Because the division of the training set into 10 almost equal parts is performed randomly, three iterations are performed and the average is calculated to obtain the optimum value of C for SVM and K for KNN, i.e. the values providing the highest cross-validation accuracy. SVM and KNN are implemented using these optimum values of C and K. The comparison of the TF and TF*IDF representations is shown in Table I for 20 Newsgroup group 1, Table II for 20 Newsgroup group 2, Table III for WebKB group 1 and Table IV for WebKB group 2.

TABLE I. COMPARISON OF TF AND TF*IDF REPRESENTATION FOR 20 NEWSGROUP GROUP 1
(Training docs: 1197, Testing docs: 796)

    Document        Accuracy %
    representation  NB       SVM              KNN
    TF              97.236   93.844 (C=1)     85.302 (K=1)
    TF*IDF          97.613   97.362 (C=1)     96.231 (K=30)

TABLE II. COMPARISON OF TF AND TF*IDF REPRESENTATION FOR 20 NEWSGROUP GROUP 2
(Training docs: 1185, Testing docs: 789)

    Document        Accuracy %
    representation  NB       SVM              KNN
    TF              96.831   92.269 (C=1)     79.214 (K=1)
    TF*IDF          96.451   96.198 (C=1)     93.156 (K=10)

TABLE III. COMPARISON OF TF AND TF*IDF REPRESENTATION FOR WEBKB GROUP 1
(Training docs: 1358, Testing docs: 684)

    Document        Accuracy %
    representation  NB       SVM              KNN
    TF              97.368   97.807 (C=0.01)  94.298 (K=10)
    TF*IDF          97.222   98.538 (C=1)     94.152 (K=45)

TABLE IV. COMPARISON OF TF AND TF*IDF REPRESENTATION FOR WEBKB GROUP 2
(Training docs: 1697, Testing docs: 854)

    Document        Accuracy %
    representation  NB       SVM              KNN
    TF              97.658   97.19 (C=0.1)    94.262 (K=10)
    TF*IDF          96.956   98.009 (C=1)     94.496 (K=40)
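The cosine measure in (11) and the cross-validation search for K described above can be sketched together. The toy vectors, fold count and candidate K values below are invented for illustration; real experiments would use the TF or TF*IDF document matrices of section V and 10 folds.

```python
import math
from collections import Counter

def cosine(a, b):
    # Eq (11): cosine similarity between two document vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train, k):
    # Majority vote among the k training vectors closest by cosine similarity.
    neigh = sorted(train, key=lambda t: cosine(query, t[0]), reverse=True)[:k]
    return Counter(y for _, y in neigh).most_common(1)[0][0]

# Invented toy vectors: class 1 near (1, 0), class 0 near (0, 1).
train = [((1.0, 0.1), 1), ((0.9, 0.2), 1), ((1.0, 0.0), 1), ((0.8, 0.1), 1),
         ((0.1, 1.0), 0), ((0.2, 0.9), 0), ((0.0, 1.0), 0), ((0.1, 0.8), 0)]

def cross_validate_k(train, k, folds=4):
    # Leave-one-fold-out accuracy, mirroring the 10-fold procedure of VI-A.
    correct = 0
    for i in range(folds):
        test_part = train[i::folds]
        train_part = [t for j, t in enumerate(train) if j % folds != i]
        correct += sum(knn_predict(x, train_part, k) == y for x, y in test_part)
    return correct / len(train)

# Pick the K with the highest cross-validation accuracy, then classify.
best_k = max((1, 3, 5), key=lambda k: cross_validate_k(train, k))
pred = knn_predict((0.95, 0.05), train, best_k)
```

The same grid-search-by-cross-validation loop applies unchanged to the SVM regularization parameter C: train on 9 folds, test on the held-out fold, and keep the parameter value with the highest average accuracy.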
TF*IDF emphasizes the weight of low-frequency terms. For SVM and KNN, the difference in accuracy between the TF*IDF and TF representations is more pronounced for the 20 Newsgroup dataset than for the WebKB dataset, as seen from Tables I-IV. This results from the fact that the contribution of low-frequency words to text categorization is significant in the 20 Newsgroup dataset compared to the WebKB dataset [19]. SVM and KNN rely on the spatial distribution of documents in the multidimensional feature space; these techniques attempt to solve the classification problem by spatial means. SVM tries to find a hyperplane in that space separating the categories, and its classification model depends on the support vectors. KNN tries to compute which K training examples are closest to the testing document. TF*IDF weighting strongly affects this spatial domain, helping SVM and KNN perform better.

For the Naïve Bayes classifier, it is observed that accuracy does not vary much between the TF and TF*IDF representations. Naïve Bayes builds a classifier based on the probabilities of words and their relative occurrences in the different categories, and all the training documents are used for building the model under both TF and TF*IDF representation. Hence its performance is high for the TF representation and does not change much for TF*IDF. The contribution of low-frequency words to text categorization is not significant in WebKB compared to 20 Newsgroup [19]; hence, for the WebKB dataset, the performance difference between the TF and TF*IDF representations is not significant for any of the three classifiers.

B. Comparison over increasing training documents

The comparison of all the classifiers (TF*IDF representation) for increasing numbers of training documents is shown in fig. 3 for 20 Newsgroup group 1, fig. 4 for 20 Newsgroup group 2, fig. 5 for WebKB group 1 and fig. 6 for WebKB group 2.

Figure 3: Comparison of Naïve Bayes, SVM and KNN for 20 Newsgroup group 1
Figure 4: Comparison of Naïve Bayes, SVM and KNN for 20 Newsgroup group 2
Figure 5: Comparison of Naïve Bayes, SVM and KNN for WebKB group 1
Figure 6: Comparison of Naïve Bayes, SVM and KNN for WebKB group 2

The results show that the accuracy of a classifier depends on the number of training documents, and that a larger number of training documents can increase the accuracy of the classification task. In the case of Naïve Bayes, the increase in classification accuracy with larger training size results from the improved accuracy of the probability estimates, as more possibilities are covered with more training documents. In the case of SVM, the increase in classification accuracy with larger training size results from obtaining a hyperplane that provides a more generalized solution, avoiding an over-fitted solution. In the case of KNN, the increase in classification accuracy with larger training size results from the fact that with many training documents, the effect of noisy training examples on classification accuracy is reduced, as is the locally sensitive nature of the KNN classifier.

Naïve Bayes builds a classifier based on the probabilities of words and their relative occurrences in the different categories; this inherently requires less data than SVM. SVM builds a hyperplane separating the two categories with maximum margin, so it needs more training documents, especially training documents close to the hyperplane, to develop its accuracy. Thus NB performs better on datasets with little training data. With more training documents, SVM acquires more data around the separating hyperplane and learns the data better; these training documents close to the hyperplane are the support vectors. Naïve Bayes uses word counts/frequencies as features to distinguish between the classes; each word provides a probability that the document is in a particular class, and the individual probabilities are combined to arrive at a final decision. Adding more words (as a result of adding more training documents) is therefore not expected to drastically change the performance of Naïve Bayes, and this is what is observed. As a result, with more training documents the average performance of SVM improves more, relative to the improvement in Naïve Bayes.

KNN, on the other hand, considers the entire multidimensional feature space as a whole and obtains the labels for testing documents on the basis of the nearest neighbour concept. Classification is thus not done on the basis of model building and depends on local information, as emphasis is given to the K nearest neighbours for label computation. Its accuracy is therefore mediocre compared to Naïve Bayes and SVM.

The results show that the accuracy of the Naïve Bayes classifier is usually better than that of SVM and KNN when the number of training documents is small. As the training set size increases, SVM classification accuracy becomes comparable to Naïve Bayes and in certain cases better. KNN is observed to have lower accuracy than Naïve Bayes and SVM. The average improvement of SVM with more training documents is also larger than that of KNN and Naïve Bayes, as seen in figs. 3-6.

C. Selection of classifier

The performance of a classifier can be predicted using cross-validation results, shown in Table V for the TF representation and Table VI for the TF*IDF representation. The cross-validation (CV) results for the Naïve Bayes, SVM and KNN classifiers, together with the classification accuracy results for all three classifiers, are provided in Tables V and VI.

TABLE V. ACCURACY AND CROSS-VALIDATION RELATIONSHIP FOR TF REPRESENTATION

    Group                  Naïve Bayes           SVM                   KNN
                           Accuracy %  CV %      Accuracy %  CV %      Accuracy %  CV %
    20 Newsgroup group 1   97.236      98.747    93.844      96.825    85.302      94.627
    20 Newsgroup group 2   96.831      98.819    92.269      97.258    79.214      91.702
    WebKB group 1          97.368      96.533    97.807      97.619    94.298      93.813
    WebKB group 2          97.658      97.346    97.19       96.778    94.262      95.129

TABLE VI. ACCURACY AND CROSS-VALIDATION RELATIONSHIP FOR TF*IDF REPRESENTATION

    Group                  Naïve Bayes           SVM                   KNN
                           Accuracy %  CV %      Accuracy %  CV %      Accuracy %  CV %
    20 Newsgroup group 1   97.613      98.218    97.362      99.249    96.231      97.882
    20 Newsgroup group 2   96.451      98.677    96.198      98.790    93.156      97.720
    WebKB group 1          97.222      96.392    98.538      97.913    94.152      94.035
    WebKB group 2          96.956      97.584    98.009      98.035    94.496      96.719

In real-life applications, the true labels of the testing documents are not available. To ensure that the classifier selection is optimum for a given dataset, it is advisable to perform cross-validation on the training dataset using the different classifiers, and to select for the text classification task the classifier that gives the best cross-validation results. It is seen from Tables V and VI that, mostly, whenever a classifier gives relatively better cross-validation performance, it also gives better accuracy on the testing data. Cross-validation thus helps estimate how accurately the predictive model of a classifier will generalize to a testing dataset, compared to the other classifiers.

D. Identification of misclassified documents

In real-life applications, labels for query documents are not available. SVM may give the best performance, but identifying misclassified documents is not possible with one classifier alone. After classification is performed by all three classifiers, their 0/1 results for each testing document are added, giving each testing document a rank of 0, 1, 2 or 3. Rank 0 means all three classifiers assigned label 0 to that document; rank 3 means all three classifiers assigned label 1. Rank 1 means two of the three classifiers assigned label 0, and rank 2 means two of the three classifiers assigned label 1. A discrepancy between the classifier results is therefore observed when a testing document is assigned rank 1 or rank 2. The documents with rank 1 or rank 2 are indicated to the user; this indication alerts the user to documents which may be misclassified. For the labels themselves, the SVM predictions are used. Table VII summarizes this procedure for identification of misclassified documents.

TABLE VII. IDENTIFICATION OF MISCLASSIFIED DOCUMENTS

    Group                  Percentage of documents   Number of documents    Number of misclassified
                           flagged as probably       misclassified by SVM   documents identified
                           misclassified
    20 Newsgroup group 1   3.89                      21                     11
    20 Newsgroup group 2   4.94                      30                     9
    WebKB group 1          5.11                      10                     2
    WebKB group 2          4.91                      17                     5

It is seen from Table VII that it is possible to identify some of the misclassified documents using the combination of the results of the three classifiers. Around 5% of the documents are flagged as probably misclassified.
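The rank-based flagging procedure above can be sketched in a few lines; the three label vectors below are invented example predictions.

```python
# Rank-based flagging of probably misclassified documents.
# Each classifier outputs a 0/1 label per test document; rank = sum of labels.
# Rank 1 or 2 means the three classifiers disagree, so the document is flagged.
nb_labels  = [1, 0, 1, 1, 0]   # invented example predictions
svm_labels = [1, 0, 0, 1, 0]
knn_labels = [1, 1, 0, 1, 0]

ranks = [a + b + c for a, b, c in zip(nb_labels, svm_labels, knn_labels)]
flagged = [i for i, r in enumerate(ranks) if r in (1, 2)]
final_labels = svm_labels      # SVM predictions are kept as the output labels
```

Documents with rank 0 or 3 are unanimous; if all three classifiers agree on a wrong label, that error cannot be detected by this scheme, which is exactly why only some of the SVM misclassifications in Table VII are identified.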
As actual labels are available for the testing documents in these experiments, it is seen that, out of the total documents misclassified by SVM, some are identified. Not all are identified, because the remainder were ranked 0 or 3, indicating that none of the classifiers could assign the correct label to those documents.

VII. CONCLUSIONS

Implementation and evaluation of Naïve Bayes, SVM and KNN on categories from the 20 Newsgroup and WebKB datasets using two different weighting schemes resulted in several conclusions. The effectiveness of the weighting scheme used to represent documents depends on the nature of the dataset and on the modeling approach adopted by the classifier. Classification accuracy can be improved using more training documents, which helps produce a more generalized solution covering more possibilities. Naïve Bayes mostly performs better than SVM and KNN when the number of training documents is small. The average improvement of SVM with more training documents is larger than that of KNN and Naïve Bayes. Parameter tuning for SVM and KNN using cross-validation assists in achieving a generalized solution suitable for a given dataset. Classification in KNN is not done on the basis of model building and depends only on local information; its classification accuracy is therefore lower than that of Naïve Bayes and SVM. A procedure to select a suitable classifier for a given dataset using cross-validation is verified. A procedure for identifying probably misclassified documents is developed by combining the results of the three classifiers, as they adopt different approaches to the text classification problem.

REFERENCES
[1] T. Mitchell, "The Discipline of Machine Learning", Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Technical Report CMU-ML-06-108.
[2] Vishal Gupta and Gurpreet S. Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1, August 2009, pp. 60-76.
[3] George Tzanis, Ioannis Katakis, Ioannis Partalas and Ioannis Vlahavas, "Modern Applications of Machine Learning", in Proceedings of the 1st Annual SEERC Doctoral Student Conference (DSC 2006), pp. 1-10.
[4] Fabrizio Sebastiani, "Text Categorization", in Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, pp. 109-129.
[5] Kevin P. Murphy, "Naïve Bayes Classifiers", Technical report, Department of Computer Science, University of British Columbia, 2006.
[6] Haiyi Zhang and Di Li, "Naïve Bayes Text Classifier", in 2007 IEEE International Conference on Granular Computing, pp. 708-711.
[7] S. L. Ting, W. H. Ip and Albert H. C. Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?", International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011, pp. 37-46.
[8] Andrew McCallum and Kamal Nigam, "A Comparison of Event Models for Naïve Bayes Text Classification", in Learning for Text Categorization: Papers from the AAAI Workshop, AAAI Press, 1998, pp. 41-48, Technical Report WS-98-05.
[9] Steve Renals, "Text Classification using Naïve Bayes", Learning and Data Lecture 7, Informatics 2B. Available online: http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b11-learnlec07-nup.pdf
[10] "Generative Learning Algorithms", Lecture Notes 2 for CS229, Department of Computer Science, Stanford University. Available online: http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf
[11] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", in Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
[12] Brian C. Lovell and Christian J. Walder, "Support Vector Machines for Business Applications", in Business Applications and Computational Intelligence, Idea Group Publishers, 2006.
[13] Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM Press, 1992, pp. 144-152.
[14] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: A Library for Support Vector Machines", Department of Computer Science, National Taiwan University, Taipei, Taiwan, 2001.
[15] Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[16] KNN classification details, available online at http://www.mathworks.in/help/toolbox/bioinfo/ref/knnclassify.html
[17] C.-W. Hsu, C.-C. Chang and C.-J. Lin, "A Practical Guide to Support Vector Classification", Technical report, Department of Computer Science, National Taiwan University, 2003.
[18] Zhijie Liu, Xueqiang Lv, Kun Liu and Shuicai Shi, "Study on SVM Compared with the Other Text Classification Methods", in 2010 Second International Workshop on Education Technology and Computer Science, pp. 219-222.
[19] R. Bekkerman, R. El-Yaniv, N. Tishby and Y. Winter, "Distributional Word Clusters vs. Words for Text Categorization", Journal of Machine Learning Research, Vol. 3, 2003, pp. 1183-1208.
[20] Datasets used in this paper, available online at http://web.ist.utl.pt/~acardoso/datasets/

AUTHORS PROFILE

Hetal Doshi received the B.E. (Electronics) degree in 2003 from the University of Mumbai and is currently pursuing her M.E. (Electronics and Telecommunication) at K. J. Somaiya College of Engineering (KJSCE), Vidyavihar, Mumbai, India. She has been in the teaching profession for the last 8 years and is working as Assistant Professor at KJSCE. Her areas of interest are education technology, text mining and signal processing.

Maruti Zalte received the M.E. (Electronics and Telecommunication) degree in 2006 from Govt. College of Engineering, Pune. He has been in the teaching profession for the last 9 years and is working as Associate Professor at KJSCE. His areas of interest are digital signal processing and VLSI technology. He is currently holding the post of Dean, Students Affairs, KJSCE.