

									                                                        (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                      Vol. 10, No.9, September, 2012

       Comparison of Supervised Learning Techniques
               for Binary Text Classification
                           Hetal Doshi                                                           Maruti Zalte
        Dept of Electronics and Telecommunication                                 Dept of Electronics and Telecommunication
                    KJSCE, Vidyavihar                                                         KJSCE, Vidyavihar
                 Mumbai - 400077, India                                                    Mumbai - 400077, India

    Abstract — An automated text classifier is a useful assistant in
information management. In this paper, supervised learning
techniques, namely Naïve Bayes, Support Vector Machine (SVM)
and K Nearest Neighbour (KNN), are implemented for classifying
certain categories from the 20 Newsgroup and WebKB datasets.
Two weighting schemes for representing documents are employed
and compared. Results show that the effectiveness of a weighting
scheme depends on the nature of the dataset and the modeling
approach adopted. The accuracy of the classifiers can be improved
by using a larger number of training documents. Naïve Bayes
mostly performs better than SVM and KNN when the number of
training documents is small. The average improvement of SVM
with more training documents is larger than that of Naïve Bayes
and KNN. The accuracy of KNN is lower than that of Naïve Bayes
and SVM. A procedure to evaluate the optimum classifier for a
given dataset using cross-validation is verified, and a procedure
for identifying probably misclassified documents is developed.

    Keywords- Naïve Bayes, SVM, KNN, Supervised learning and
text classification.

                      I.     INTRODUCTION
    Manually organizing a large set of electronic documents
into required categories/classes can be extremely taxing, time
consuming, expensive and is often not feasible. Text
classification, also known as text categorization, deals with
the assignment of text to a particular category from a
predefined set of categories, based on the words of the text.
    Text classification combines the concepts of Machine
Learning and Information Retrieval. Machine Learning is a
field of Artificial Intelligence (AI) that deals with the
development of techniques or algorithms that let computers
understand and extract patterns in the given data. Various
applications of machine learning in the fields of speech
recognition, computer vision, robot control etc. are discussed
in [1]. Text classification finds applications in various
domains like Knowledge Management, Human Resource
Management, sorting of online information, emails,
information technology and the internet [2]. Text
classification can be implemented using various supervised
and unsupervised machine learning techniques [3]. Various
performance parameters for binary text classification
evaluation are discussed in [4]. Accuracy is the evaluation
parameter for the classifiers implemented in this paper.
    A binary text classifier is a function that maps input
feature vectors x to output class/category labels y ∈ {1, 0}.
The aim is to learn the function f from an available labeled
training set of N input-output pairs (xi, yi), i = 1…N [5].
This is called supervised learning, as opposed to
unsupervised learning, which does not involve a labeled
training set. There are two ways of implementing a classifier
model. In a discriminative model, the aim is to learn a
function that computes the class posterior p(y|x); thus it
discriminates between different classes given the input. In a
generative model, the aim is to learn the class-conditional
density p(x|y) for each value of y, as well as the class priors
p(y), and then, by applying Bayes rule, compute the class
posterior, as shown below [5],

    p(y|x) = p(x|y) p(y) / Σy' p(x|y') p(y')                 (1)

    This is known as a generative model, as it specifies a way
to generate the feature vector x for each possible class y.
The Naïve Bayes classifier is an example of a generative
model, while SVM is an example of a discriminative model.
KNN adopts a different approach than Naïve Bayes and SVM:
in KNN, calculations are deferred until actual classification,
and no model is built from the training examples.
    In this paper, section II explains Naïve Bayes and the TF
(Term Frequency) and TF*IDF (Term Frequency * Inverse
Document Frequency) weighting schemes. Section III
explains SVM and its quadratic programming optimization
problem. Section IV describes the KNN classifier and its
distance computation method. Section V provides the
implementation steps for a binary text classifier. Section VI
discusses result analysis, followed by conclusions in
section VII.

                     II.      NAÏVE BAYES
    Naïve Bayes (NB) is based on a probabilistic model that
uses a collection of labeled training documents to estimate
the parameters of the model; every new document is then
classified using Bayes rule, by selecting the category that is
most likely to have generated the new example [8]. The
principles of Bayes theorem and its application in developing
a Naïve Bayes classifier are discussed in [6]. Naïve Bayes
has a simple approach in its training and classification
phases [7]. The Naïve Bayes model assumes that all the
attributes of the training documents are independent of each
other given the context of the class. Reference [8]

                                                                                                       ISSN 1947-5500
describes the differences and details of the two models of
Naïve Bayes: the Bernoulli multivariate model and the
multinomial model. Its results show that the accuracy
achieved by the multinomial model is better than that
achieved by the Bernoulli multivariate model for a large
vocabulary. The multinomial model of Naïve Bayes is
selected for implementation in this paper; its procedure is
described in [9] and [10].

    If a document Di (representing the input feature vector x)
is to be classified (i = 1…N), the learning algorithm should
be able to classify it in the required category Ck (representing
the output class y). The category can be either C1, the
category with label 1, or C0, the category with label 0. In the
multinomial model, the document feature vector captures
word frequency information and not just the presence or
absence of words. In this model, a biased V-sided dice is
considered, and each side of the dice represents a word Wt
with probability p(Wt|Ck), t = 1…V. At each position in the
document the dice is rolled and a word is inserted. Thus a
document is generated as a bag of words, which records
which words are present in the document and their frequency
of occurrence. Mathematically, this is achieved by defining
Mi as the multinomial model feature vector for the i-th
document Di. Mit is the frequency with which word Wt occurs
in document Di, and ni is the total number of words in Di.
The vocabulary size V is defined as the number of unique
words found in the documents. The training documents are
scanned to obtain the following counts:

    N: number of documents
    Nk: number of documents of class Ck, for both classes

    These counts are used to estimate the likelihoods p(Wt|Ck)
and the priors p(Ck). Let Zik = 1 when Di has class Ck and
Zik = 0 otherwise. With N the total number of documents, the
likelihoods are estimated as [9]

    p(Wt|Ck) = (Σi Zik Mit) / (Σs Σi Zik Mis)                (2)

    If a particular word does not appear in a category, the
probability calculated by (2) becomes zero. To avoid this
problem, Laplace smoothing is applied [9]:

    p(Wt|Ck) = (1 + Σi Zik Mit) / (V + Σs Σi Zik Mis)        (3)

    The priors are estimated as

    p(Ck) = Nk / N                                           (4)

    After training is performed and the parameters are ready,
for every new unlabelled document Dj the posterior
probability for each category is estimated as [9]

    p(Ck|Dj) ∝ p(Ck) Πt p(Wt|Ck)^Mjt                         (5)

    The calculation above is done for both categories and the
results are compared; the label of the category with the larger
value is assigned to the testing document. The assigned labels
are compared with the true labels of the testing documents to
evaluate accuracy.

    Mit represents the Term Frequency (TF) representation
method, in which the frequency of occurrence of a particular
word Wt in a given document Di is captured (local
information). But the TF representation has the problem that
it scales up frequent terms and scales down rare terms, which
are often more informative than high-frequency terms. The
basic intuition is that a word that occurs frequently in many
documents is not a good discriminator. A weighting scheme
can help solve this problem. TF*IDF provides information
about how important a word is to a document in a collection.
The TF*IDF weighting scheme does this by incorporating
both local and global information, since it considers not only
the isolated term but also the term within the document
collection [4].

    Let NFt be the document frequency, i.e. the number of
documents containing term t, and N the number of
documents. NFt/N is then the probability of selecting a
document containing the queried term from the collection,
and log(N/NFt) is the Inverse Document Frequency, IDFt,
which represents the global information. 1 is added to NFt to
avoid division by zero in some cases.

    Multiplying TF values by IDF values combines the local
and the global information; the weight of a term is therefore
TF * IDF, commonly referred to as TF*IDF weighting. Since
longer documents, with more terms and higher term
frequencies, tend to get larger dot products than shorter
documents, which can skew and bias similarity measures,
normalization is recommended. A very common
normalization method is dividing the weights by the L2 norm
of the document. The L2 norm of the vector representing a
document is simply the square root of the dot product of the
document vector with itself.

                III.   SUPPORT VECTOR MACHINE
    An SVM model is a representation of the training
documents as points in space, mapped so that the examples
of the separate categories are divided by a clear gap that is
as wide as possible. New examples are then mapped into the
same space and predicted to belong to a category based on
which side of the gap they fall. Certain properties of text,
like high-dimensional feature spaces, few irrelevant features
(i.e. a dense concept vector) and sparse document vectors,
are well handled by SVM, making it suitable for document
classification [11].
    The Support Vector Machine classification algorithm is
based on a maximum margin training algorithm. It finds a
decision function D(x) for pattern vectors x of dimension V
belonging to either of the two categories 1 and 0 (-1). The
input to the training algorithm is a set of N examples xi with
labels yi, i.e. (x1, y1), (x2, y2), (x3, y3)..(xN, yN).
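As a concrete illustration, the training and classification procedure of the multinomial Naïve Bayes model described in section II can be sketched in Python. This is a minimal sketch on a hypothetical toy corpus, not the MATLAB implementation used in the paper; log probabilities are used so that multiplying many small likelihoods does not underflow.

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes training: estimate Laplace-smoothed
    likelihoods p(Wt|Ck) and priors p(Ck) from labeled documents.
    docs: list of token lists; labels: 0/1 per doc; vocab: list of words."""
    V = len(vocab)
    params = {}
    for k in (0, 1):
        class_docs = [d for d, y in zip(docs, labels) if y == k]
        counts = Counter(w for d in class_docs for w in d)     # per-word counts in class k
        total = sum(counts[w] for w in vocab)                  # total vocab words in class k
        log_prior = math.log(len(class_docs) / len(docs))      # log p(Ck) = log(Nk / N)
        # Laplace smoothing: (1 + count) / (V + total), so a word unseen
        # in a category never gets zero probability
        log_like = {w: math.log((1 + counts[w]) / (V + total)) for w in vocab}
        params[k] = (log_prior, log_like)
    return params

def classify_nb(doc, params):
    """Assign the label whose log posterior (log prior plus summed word
    log likelihoods) is larger; out-of-vocabulary words are ignored."""
    scores = {}
    for k, (log_prior, log_like) in params.items():
        scores[k] = log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(scores, key=scores.get)
```

For example, training on a few sports-themed documents labeled 1 and court-themed documents labeled 0 lets `classify_nb` label a new word list by comparing the two log posteriors, exactly as in the category comparison step described above.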

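To make the soft margin objective of section III concrete, the sketch below minimizes the primal objective (1/2)||w||² + C Σi max(0, 1 − yi(w·xi + b)) by full-batch sub-gradient descent. This is a pedagogical stand-in, not the dual quadratic programming solver used by LIBSVM; the learning rate, epoch count and toy data are hypothetical choices.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Full-batch sub-gradient descent on the primal soft-margin objective
    (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
    X: list of feature vectors; y: labels in {+1, -1}."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)              # gradient of the regularizer (1/2)||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # xi violates the margin: hinge loss active
                for j in range(dim):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(w, b, x):
    """Decision function D(x): sign of w.x + b, with 0 mapped to +1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Larger C penalizes margin violations more heavily, trading a wider margin against training errors, which is exactly the trade-off tuned by cross-validation in section VI.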
    Fig. 1 shows a two-dimensional feature space with vectors
belonging to one of the two categories. A two-dimensional
space is considered for simplicity in understanding the
maximum margin training algorithm. The objective is to make
the margin M separating the two categories as wide as
possible: SVM maximizes the margin around the separating
hyperplane. The decision function is fully specified by a
subset of the training samples (the support vectors).

       Figure 1: Maximum margin solution in two dimensional space

    To obtain the classifier boundary in terms of w and b,
two hyperplanes are defined: the plus hyperplane
w·x + b = +1 and the minus hyperplane w·x + b = -1,
which are the borders of the maximum margin. The distance
between the plus and minus hyperplanes is called the margin
M, which is to be maximized:

    M = 2 / ||w||                                            (6)

    The margin 2/||w|| is to be maximized, subject to the
constraints

    yi (w·xi + b) ≥ 1,  i = 1…N                              (7)

    Sometimes the vectors are not linearly separable, as
indicated in fig. 2 below.

              Figure 2: Non-linearly separable vector points

    Hence there is a need to soften the constraint that these
data points lie on the correct side of the plus and minus
hyperplanes, i.e. some data points are allowed to violate these
constraints, preferably by a small amount. This approach also
helps improve generalization and is called soft margin SVM.
In this approach, slack variables ξi are introduced, as shown
in the quadratic programming model

    minimize    (1/2) ||w||² + C Σi ξi                       (8)

    subject to

    yi (w·xi + b) ≥ 1 - ξi,  ξi ≥ 0,  i = 1…N

    The value of C is a regularization parameter which trades
off how large a margin is preferred against the number of
training set examples that violate this margin and by what
amount [12]. The optimum value of C is obtained using the
cross-validation process. The way to proceed with the
mathematical analysis is to convert the soft margin SVM
problem (8) into an equivalent Lagrangian dual problem,
which is to be maximized [12] & [15] using the Lagrange
multipliers αi:

    maximize    Σi αi - (1/2) Σi Σj αi αj yi yj (xi·xj)

    subject to the constraints

    0 ≤ αi ≤ C,  Σi αi yi = 0                                (9)

    The bias b is obtained by applying the decision function
to two arbitrary supporting patterns, x1 belonging to C1 and
x2 belonging to C0 [13]:

    b = -(1/2) (w·x1 + w·x2)                                 (10)

    LIBSVM is a library for Support Vector Machines (SVM),
used in this paper for implementing SVM for text
classification [14]. Its objective is to help users easily apply
SVM to their respective applications. Reference [17] provides
the practical aspects involved in implementing the SVM
algorithm and in using a linear kernel for text classification,
which involves a large number of features.

                  IV.    K NEAREST NEIGHBOUR
    K nearest neighbour is one of the pattern recognition
techniques employed for classification, based on evaluating
the closest training examples in the multidimensional feature
space. It is a type of instance-based or lazy learning, where
the decision function is approximated locally and all
computations are deferred until classification. A document is
classified by a majority vote of its neighbours, the document
being assigned to the category most common amongst its K
nearest neighbours. Reference [16] explains how K Nearest
Neighbour can be applied to text classification and the
importance of the K value for a given problem. KNN is based
on calculating the distance of the query document from the
training documents

[18]. Cosine similarity is selected for distance measurement,
as documents are represented as vectors in the
multidimensional feature space, and the distance between
documents is calculated as

    cos(Di, Dj) = (Di · Dj) / (||Di|| ||Dj||)                (11)

    For improving the accuracy of the KNN classifier, the
optimum value of K is important. K is the parameter which
indicates the number of nearest neighbours to be considered
for label computation. The accuracy of the KNN classifier is
severely affected by the presence of noisy and irrelevant
features. The best value of K is data dependent. A larger
value of K reduces the effect of noise and results in a
smoother, less locally sensitive decision function, but it
makes the boundaries between the classes less distinct,
resulting in misclassification. The optimum value of K is
found by cross-validation.

                    V.    IMPLEMENTATION
    There are primarily three steps in implementing a binary
text classifier using MATLAB TM as a tool:

    1. Feature extraction: In this step, the text document,
comprising the words on the basis of which classification is
performed, is converted into a matrix format capturing the
properties of the words found in the documents. This can be
done in two ways: TF or TF*IDF. A matrix is created where
the number of rows equals the number of documents, the
number of columns represents the number of words in the
dictionary defined for the given classification task, and each
element of the matrix represents the TF or TF*IDF weight in
the respective document.
    2. Training the classifier: During the training phase,
the classifier is provided with the training documents along
with their labels. The classifier develops a model representing
the pattern with which the training documents are related to
their labels on the basis of the words appearing in the
documents. Parameter tuning is performed using the
cross-validation process.
    3. Testing the classifier: On the basis of the developed
model, the classifier predicts labels for the testing documents.
The accuracy of classification is assessed by comparing the
predicted labels with the true labels:
Accuracy = number of correctly classified testing
documents / total number of testing documents.

                   VI.   RESULT ANALYSIS
    For the implementation and evaluation of the text
classifiers, the 20 Newsgroup and WebKB datasets, available
at [20], are used. The difference in the nature of the datasets
is described in [19]. Within each dataset, two groups are
used. For 20 Newsgroup, group 1 consists of ‘’ & ‘’ and
group 2 of ‘sci.electronics’ & ‘’. For the WebKB dataset,
there are also two groups, group 1: Faculty and Course, and
group 2: Student and Course. Accuracy evaluation is
performed for all four groups using the TF and TF*IDF
document representations.

A. Comparison of TF and TF*IDF weighting scheme
    The total accuracy for the TF and TF*IDF representations
is obtained after performing a 10-fold cross-validation process
for different values of C for the SVM classifier and K for the
KNN classifier. In the 10-fold cross-validation process, the
entire training set for a given group is divided into 10 subsets
of almost equal size. Sequentially, one subset is tested using
the classifier trained on the remaining 9 subsets; the process
is thus repeated 10 times and the total accuracy is evaluated.
As the division of the training set into 10 almost equal parts
is performed randomly, three iterations are performed and the
average is calculated to obtain the optimum C value for SVM
and K value for KNN, i.e. the values providing the highest
cross-validation accuracy. SVM and KNN are implemented
using these optimum values of C and K. The comparison of
the TF and TF*IDF representations is shown in Table I for
20 Newsgroup group 1, Table II for 20 Newsgroup group 2,
Table III for WebKB group 1 and Table IV for WebKB
group 2.

 TABLE I.   COMPARISON OF TF AND TF*IDF REPRESENTATION FOR 20
                       NEWSGROUP GROUP 1

            Training docs: 1197  Testing docs: 796

    Document representation | NB (%)  | SVM (%) | KNN (%)
    TF     (C=1, K=1)       | 97.236  | 93.844  | 85.302
    TF*IDF (C=1, K=30)      | 97.613  | 97.362  | 96.231

 TABLE II.  COMPARISON OF TF AND TF*IDF REPRESENTATION FOR 20
                       NEWSGROUP GROUP 2

            Training docs: 1185  Testing docs: 789

    Document representation | NB (%)  | SVM (%) | KNN (%)
    TF     (C=1, K=1)       | 96.831  | 92.269  | 79.214
    TF*IDF (C=1, K=10)      | 96.451  | 96.198  | 93.156

 TABLE III.  COMPARISON OF TF AND TF*IDF REPRESENTATION FOR
                        WEBKB GROUP 1

            Training docs: 1358  Testing docs: 684

    Document representation | NB (%)  | SVM (%) | KNN (%)
    TF     (C=0.01, K=10)   | 97.368  | 97.807  | 94.298
    TF*IDF (C=1, K=45)      | 97.222  | 98.538  | 94.152
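The 10-fold cross-validation used to pick the optimum C and K values reported in these tables can be sketched generically as follows. The fold construction and the toy threshold classifier in the usage below are hypothetical illustrations; any train-and-test routine (SVM with a candidate C, KNN with a candidate K) can be plugged in.

```python
import random

def k_fold_accuracy(data, labels, train_and_test, k=10, seed=0):
    """Average accuracy over k folds: each fold in turn is held out for
    testing while the classifier is trained on the remaining k-1 folds."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # random division into folds
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_X = [data[i] for i in idx if i not in held_out]
        train_y = [labels[i] for i in idx if i not in held_out]
        preds = train_and_test(train_X, train_y, [data[i] for i in fold])
        correct += sum(p == labels[i] for p, i in zip(preds, fold))
    return correct / len(data)

def select_param(data, labels, candidates, make_classifier, k=10):
    """Return the hyperparameter value (e.g. C for SVM or K for KNN)
    with the highest cross-validation accuracy."""
    scores = {c: k_fold_accuracy(data, labels, make_classifier(c), k)
              for c in candidates}
    return max(scores, key=scores.get)
```

Averaging several such runs with different random fold divisions, as the paper does over three iterations, smooths out the randomness of the split.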

 TABLE IV.  COMPARISON OF TF AND TF*IDF REPRESENTATION FOR
                        WEBKB GROUP 2

            Training docs: 1697  Testing docs: 854

    Document representation | NB (%)  | SVM (%) | KNN (%)
    TF     (C=0.1, K=10)    | 97.658  | 97.19   | 94.262
    TF*IDF (C=1, K=40)      | 96.956  | 98.009  | 94.496
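KNN classification with cosine similarity, as described in section IV, can be sketched on sparse term-weight vectors as follows. This is a minimal illustration on hypothetical toy vectors, not the MATLAB implementation used for the experiments.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts
    mapping term -> weight)."""
    dot = sum(wt * v.get(t, 0.0) for t, wt in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, train_vecs, train_labels, k):
    """Majority vote among the k training documents most similar
    (by cosine) to the query document."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda dv: cosine(query, dv[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Note that no model is built at training time: the full training set is scanned at classification time, which is the lazy-learning behaviour described in section IV.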

    TF*IDF emphasizes the weight of low-frequency terms.
For SVM and KNN, the difference in accuracy between the
TF*IDF and TF representations is more pronounced in the
20 Newsgroup dataset than in the WebKB dataset, as seen
from Tables I - IV. This results from the fact that the
contribution of low-frequency words to text categorization is
significant in the 20 Newsgroup dataset as compared to the
WebKB dataset [19]. SVM and KNN rely on the spatial
distribution of documents in the multidimensional feature
space; these techniques attempt to solve the classification
problem by spatial means. SVM tries to find a hyperplane in
that space separating the categories, and its classification
model depends on the support vectors. KNN computes which
K training examples are closest to the testing document.
TF*IDF weighting strongly affects this spatial representation,
helping SVM and KNN perform better.
    For the Naïve Bayes classifier, it is observed that
performance in terms of accuracy does not vary much
between the TF and TF*IDF representations. Naïve Bayes
builds a classifier based on the probability of words and their
relative occurrences in different categories, and all the
training documents are used for building the model under
both the TF and TF*IDF representations. Hence its
performance is high for the TF representation and does not
change much for TF*IDF.
    The contribution of low-frequency words to text
categorization is not significant in WebKB as compared to
20 Newsgroup [19]. Hence, for the WebKB dataset, the
performance difference between the TF and TF*IDF
representations is not significant for any of the three
classifiers.

B. Comparison over increasing training documents
    The comparison of all the classifiers (TF*IDF
representation) for an increasing number of training
documents is shown in fig. 3 for 20 Newsgroup group 1,
fig. 4 for 20 Newsgroup group 2, fig. 5 for WebKB group 1
and fig. 6 for WebKB group 2.

      Figure 3: Comparison of Naïve Bayes, SVM and KNN for 20
                        Newsgroup Group 1

      Figure 4: Comparison of Naïve Bayes, SVM and KNN for 20
                        Newsgroup Group 2

     Figure 5: Comparison of Naïve Bayes, SVM and KNN for WebKB
                             Group 1
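The TF*IDF weighting with L2 normalization described in section II, whose effect on the spatial classifiers is analyzed above, can be sketched as follows. Here 1 is added to the document frequency inside the IDF throughout, one reading of the paper's note about avoiding division by zero; the toy documents in the test are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute L2-normalized TF*IDF vectors for tokenized documents,
    with weight = TF * log(N / (1 + NFt)), where NFt is the number of
    documents containing term t and N is the number of documents."""
    N = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))                       # document frequency NFt per term
    vectors = []
    for d in docs:
        tf = Counter(d)                         # local information: term frequency
        w = {t: tf[t] * math.log(N / (1 + df[t])) for t in tf}
        norm = math.sqrt(sum(x * x for x in w.values()))
        # divide by the L2 norm so long documents do not dominate
        # dot-product similarity measures
        vectors.append({t: x / norm for t, x in w.items()} if norm else w)
    return vectors
```

A term appearing in many documents gets a small (or even zero) IDF and so contributes little after weighting, which is exactly the de-emphasis of poor discriminators described in section II.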

                    Figure 6: Comparison of Naïve Bayes, SVM and KNN for WebKB Group 2

    Results show that the accuracy of the classifier is dependent on the number of training documents, and a larger number of training documents can increase the accuracy of the classification task. In the case of Naïve Bayes, the increase in classification accuracy with larger training size is the result of improvement in the accuracy of probability estimation with more training documents, as larger possibilities are covered.

    In the case of SVM, the increase in classification accuracy with larger training size is the result of obtaining a hyper-plane which provides a more generalized solution, thus avoiding an over-fitted solution.

    In the case of KNN, the increase in classification accuracy with larger training size results from the fact that with a large number of training documents, the effect of noisy training examples on the classification accuracy reduces. It also reduces the locally sensitive nature of the KNN classifier.

    Results show that the accuracy of the Naïve Bayes classifier is usually better than that of SVM and KNN when the number of training documents is small. But as the training set size increases, SVM classification accuracy becomes comparable to Naïve Bayes and in certain cases becomes better. KNN is observed to have accuracy lower than Naïve Bayes and SVM. Also, the average amount of improvement in SVM with more training documents is better than that of KNN and Naïve Bayes, as seen from fig. 3, fig. 4, fig. 5 and fig. 6.

    Naïve Bayes builds a classifier based on the probability of words and their relative occurrences in different categories. This inherently requires less data than SVM. SVM builds a hyper-plane to separate the two categories with maximum margin. Thus, it needs more training documents, especially training documents close to the hyper-plane, to develop its accuracy. Thus, Naïve Bayes performs better for datasets with a low number of training documents.

    SVM requires building a hyper-plane which best separates the two categories. With more training documents, it acquires more data around the separating hyper-plane to provide a better solution. Thus, SVM learns the data better as more training documents are provided, especially training data close to the hyper-plane. These training documents are the support vectors.

    Naïve Bayes uses word counts/frequencies as features to distinguish between categories. Each word provides a probability that the document is in a particular class. The individual probabilities are combined to arrive at a final decision. It is expected that adding more words (as a result of adding more training documents) would not drastically change the performance level of Naïve Bayes, and this is what is observed. As a result, with more training documents the average performance of SVM improves more, relative to the improvement in Naïve Bayes.

    KNN, on the other hand, considers the entire multi-dimensional feature space as a whole and obtains the labels for testing documents on the basis of the nearest-neighbour concept. Hence classification is done not on the basis of model building and is dependent on local information, as emphasis is given to the K nearest neighbours for label computation. Thus its accuracy is mediocre compared to Naïve Bayes and SVM.

C. Selection of classifier
    Performance of the classifier can be predicted using cross-validation results, which are shown in Table V for the TF representation and Table VI for the TF*IDF representation. Cross-validation (CV) results for the Naïve Bayes, SVM and KNN classifiers and classification accuracy results for all three classifiers are provided below in Tables V and VI.
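The per-word probability combination described above for Naïve Bayes can be sketched in a few lines. This is a minimal stdlib sketch of a multinomial model with Laplace smoothing on toy data, not the paper's code; the helper names and the two-category example are illustrative assumptions:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    # per-category word counts; alpha is the Laplace smoothing constant
    counts = {c: Counter() for c in set(labels)}
    priors = Counter(labels)
    for doc, c in zip(docs, labels):
        counts[c].update(doc.split())
    vocab = {w for cnt in counts.values() for w in cnt}
    return counts, priors, vocab, alpha

def predict_nb(model, doc):
    counts, priors, vocab, alpha = model
    n = sum(priors.values())
    best, best_lp = None, -math.inf
    for c, cnt in counts.items():
        total = sum(cnt.values())
        lp = math.log(priors[c] / n)        # log prior of the category
        for w in doc.split():
            # smoothed per-word likelihood; unseen words fall back to alpha
            lp += math.log((cnt[w] + alpha) / (total + alpha * len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

model = train_nb(["cheap loans", "win money", "meeting today", "project report"],
                 [1, 1, 0, 0])
```

Each word contributes one (smoothed) log-likelihood term, so an extra training document shifts these probability estimates only gradually, consistent with the modest accuracy gains observed for Naïve Bayes as the training set grows.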


                            TABLE V.      CROSS-VALIDATION AND CLASSIFICATION ACCURACY RESULTS (TF REPRESENTATION)

                           Group                        Naïve Bayes                      SVM                          KNN
                     TF representation          Accuracy %        CV %       Accuracy %       CV %       Accuracy %       CV %
                    20 Newsgroup group 1          97.236         98.747        93.844        96.825        85.302        94.627
                    20 Newsgroup group 2          96.831         98.819        92.269        97.258        79.214        91.702
                       WebKB group 1              97.368         96.533        97.807        97.619        94.298        93.813
                       WebKB group 2              97.658         97.346        97.19         96.778        94.262        95.129


                          TABLE VI.      CROSS-VALIDATION AND CLASSIFICATION ACCURACY RESULTS (TF*IDF REPRESENTATION)

                           Group                        Naïve Bayes                      SVM                          KNN
                   TF*IDF representation        Accuracy %        CV %       Accuracy %       CV %       Accuracy %       CV %
                    20 Newsgroup group 1          97.613         98.218        97.362        99.249        96.231        97.882
                    20 Newsgroup group 2          96.451         98.677        96.198        98.790        93.156        97.720
                       WebKB group 1              97.222         96.392        98.538        97.913        94.152        94.035
                       WebKB group 2              96.956         97.584        98.009        98.035        94.496        96.719
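The CV percentages in Tables V and VI come from cross-validating each classifier on the training set. A generic k-fold loop can be sketched as follows (stdlib only; the `fit`/`predict` callables and the majority-class baseline are illustrative assumptions, not the paper's setup):

```python
def kfold_indices(n, k):
    # split 0..n-1 into k contiguous folds of near-equal size
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def cross_validate(fit, predict, X, y, k=5):
    # average held-out accuracy over k folds; the classifier with the
    # best CV score is the one to select for the classification task
    accs = []
    for test_idx in kfold_indices(len(X), k):
        held = set(test_idx)
        model = fit([x for i, x in enumerate(X) if i not in held],
                    [t for i, t in enumerate(y) if i not in held])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accs.append(hits / len(test_idx))
    return sum(accs) / len(accs)

# trivial baseline: always predict the majority training label
fit = lambda X, y: max(set(y), key=y.count)
predict = lambda m, x: m
```

Running `cross_validate` once per classifier and selecting the highest score mirrors the selection procedure discussed in this section, since no test labels are consulted.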

    In real-life applications, true labels of the testing documents are not available. To ensure that classifier selection is optimum for a given dataset, it is advisable to perform cross-validation on the training dataset using the different classifiers. Observing the results of the cross-validation process, the classifier which provides the best results should be selected for the text classification task. It is seen from Tables V and VI that mostly, whenever a classifier gives relatively better cross-validation performance, it also gives better accuracy results on testing data. Cross-validation helps in estimating how accurately a predictive model of a classifier will generalize to a testing dataset compared to other classifiers.

D. Identification of misclassified documents
    In real-life applications, labels for query documents are not available. SVM may give the best performance, but identifying the misclassified documents is not possible with one classifier. After performing the classification operation with all three classifiers, the results of all three classifiers are added and ranks are given to the testing documents. Thus every testing document is assigned a rank which is 0, 1, 2 or 3. Rank 0 means all three classifiers have assigned label 0 to that testing document. Rank 3 means all three classifiers have assigned label 1 to that testing document. Rank 1 means two out of the three classifiers have assigned label 0 to the testing document. Rank 2 means two out of the three classifiers have assigned label 1 to that testing document. Hence, discrepancy between classifier results is observed when the testing document is assigned rank 1 or rank 2.

    The documents with rank 1 and rank 2 are indicated to the user. This indication alerts the user about the documents which may be misclassified. For labels, the SVM-predicted labels are considered. Table VII summarizes this procedure for identification of misclassified documents.

                                       TABLE VII.      IDENTIFICATION OF MISCLASSIFIED DOCUMENTS

                           Group             Percentage of documents flagged     Number of documents       Number of misclassified
                                             as probably misclassified           misclassified by SVM      documents identified
                    20 Newsgroup group 1                  3.89                            21                          11
                    20 Newsgroup group 2                  4.94                            30                           9
                       WebKB group 1                      5.11                            10                           2
                       WebKB group 2                      4.91                            17                           5
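The rank-based flagging summarized in Table VII reduces to a few lines of code: the rank of a document is the sum of the three predicted binary labels, and ranks 1 and 2 mark disagreement. A stdlib sketch (the toy label lists are assumptions for illustration):

```python
def flag_disagreements(nb_labels, svm_labels, knn_labels):
    # rank = sum of the three binary labels per document:
    # 0 or 3 means all classifiers agree, 1 or 2 means disagreement
    flagged = []
    for i, labels in enumerate(zip(nb_labels, svm_labels, knn_labels)):
        rank = sum(labels)
        if rank in (1, 2):
            flagged.append(i)   # alert the user; report the SVM label
    return flagged

# three classifiers' predictions for five test documents
nb  = [0, 1, 1, 0, 1]
svm = [0, 1, 0, 0, 1]
knn = [0, 0, 1, 0, 1]
```

Here documents 1 and 2 are flagged as probably misclassified, and their reported labels are taken from `svm`; documents with rank 0 or 3 pass through unflagged even if all three classifiers happen to be wrong.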

It is seen from Table VII that it is possible to identify a few of the misclassified documents using the combination of the results of the three classifiers. Around 5% of documents are flagged as probably misclassified documents. As actual labels are available for the testing documents, it is seen that, out of the total documents misclassified by SVM, some are identified. Not all are identified, because the rest were ranked 0 or 3, indicating that none of the classifiers could assign correct labels to those documents.

                       VII. CONCLUSIONS

    Implementation and evaluation of Naïve Bayes, SVM and KNN on categories from the 20 Newsgroup and WebKB datasets using two different weighting schemes resulted in several conclusions. Effectiveness of the weighting schemes to represent documents depends on the nature of the dataset and also on the modeling approach adopted by the classifiers. Classification accuracy of the classifier can be improved using more training documents, thus helping in a more generalized solution covering larger possibilities. Naïve Bayes mostly performs better than SVM and KNN when the number of training documents is small. The average amount of improvement in SVM with more training documents is better than that of KNN and Naïve Bayes. Parameter tuning in the case of SVM and KNN using cross-validation assists in achieving a generalized solution suitable for a given dataset. Classification in KNN is not done on the basis of model building and is dependent only on local information. Thus its classification accuracy is lower

than Naïve Bayes and SVM. A procedure to evaluate the suitable classifier for a given dataset using the cross-validation process is verified. A procedure for identifying the probably misclassified documents is developed by combining the results of the three classifiers, as they adopt different approaches to the text classification problem.

                          REFERENCES

[1] T. Mitchell, "The Discipline of Machine Learning", Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Technical report CMU-ML-06-108.
[2] Vishal Gupta and Gurpreet S. Lehal, "A Survey of Text Mining Techniques and Applications", in Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1, August 2009, pp. 60-76.
[3] George Tzanis, Ioannis Katakis, Ioannis Partalas, and Ioannis Vlahavas, "Modern Applications of Machine Learning", in Proceedings of the 1st Annual SEERC Doctoral Student Conference - DSC 2006, pp. 1-10.
[4] Fabrizio Sebastiani, "Text categorization", in: Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, pp. 109-129.
[5] Kevin P. Murphy, "Naïve Bayes classifier", Technical report, Department of Computer Science, University of British Columbia, 2006.
[6] Haiyi Zhang and Di Li, "Naïve Bayes Text Classifier", in 2007 IEEE International Conference on Granular Computing, pp. 708-711.
[7] S. L. Ting, W. H. Ip, and Albert H. C. Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?", in International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011, pp. 37-46.
[8] Andrew McCallum and Kamal Nigam, "A Comparison of Event Models for Naïve Bayes Text Classification", in Learning for Text Categorization: Papers from the AAAI Workshop, AAAI Press, 1998, pp. 41-48, Technical report WS-98-05.
[9] Steve Renals, "Text Classification using Naïve Bayes", Learning and Data lecture 7, Informatics 2B. Available online.
[10] "Generative learning algorithms", lecture notes 2 for CS229, Department of Computer Science, Stanford University. Available online.
[11] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", in Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
[12] Brian C. Lovell and Christian J. Walder, "Support Vector Machines for Business Applications", in Business Applications and Computational Intelligence, Idea Group Publishers, 2006.
[13] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM Press, 1992, pp. 144-152.
[14] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: A Library for Support Vector Machines", Department of Computer Science, National Taiwan University, Taipei, Taiwan, 2001.
[15] Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[16] KNN classification details. Available online.
[17] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A Practical Guide to Support Vector Classification", Technical report, Department of Computer Science, National Taiwan University, 2003.
[18] Zhijie Liu, Xueqiang Lv, Kun Liu, and Shuicai Shi, "Study on SVM Compared with the Other Text Classification Methods", in 2010 Second International Workshop on Education Technology and Computer Science, pp. 219-222.
[19] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional word clusters vs. words for text categorization", in Journal of Machine Learning Research, Vol. 3, 2003, pp. 1183-1208.
[20] Dataset used in this paper available online.

                        AUTHORS PROFILE

Hetal Doshi received the B. E. (Electronics) degree in 2003 from the University of Mumbai and is currently pursuing her M. E. (Electronics and Telecommunication) at K. J. Somaiya College of Engineering (KJSCE), Vidyavihar, Mumbai, India. She has been in the teaching profession for the last 8 years and is working as an Assistant Professor at KJSCE. Her areas of interest are Education Technology, Text Mining and Signal Processing.

Maruti Zalte received the M. E. (Electronics and Telecommunication) degree in 2006 from Govt. College of Engineering, Pune. He has been in the teaching profession for the last 9 years and is working as an Associate Professor at KJSCE. His areas of interest are Digital Signal Processing and VLSI technology. He is currently holding the post of Dean, Students Affairs, KJSCE.
