									                                              (IJCSIS) International Journal of Computer Science and Information Security,
                                              Vol. 8, No. 7, October 2010

Using Fuzzy Support Vector Machine in Text
 Categorization Based on Reduced Matrices
                                               Vu Thanh Nguyen1
                            University of Information Technology HoChiMinh City, VietNam

Abstract - In this article, the authors present results comparing Fuzzy Support Vector Machines (FSVM) with a Fuzzy Support Vector Machine combined with Latent Semantic Indexing and Random Indexing on reduced matrices (FSVM_LSI_RI). Our results show that FSVM_LSI_RI provides better Precision and Recall than FSVM. In this experiment a corpus comprising 3299 documents from the Reuters-21578 corpus was used.

Keywords – SVM, FSVM, LSI, RI

                I. INTRODUCTION
Text categorization is the task of assigning a text to one or more of a set of predefined categories. As with most other natural language processing applications, representational factors are decisive for the performance of the categorization. By far the most common representational scheme in text categorization is the Bag-of-Words (BoW) approach, in which a text is represented as a vector t of word weights, t = (w1, ..., wn), where the wi are the weights of the words in the text. The BoW representation ignores all semantic or conceptual information; it simply looks at surface word forms. Modern BoW is based on three models: the Boolean model, the Vector Space model, and the Probability model.
There have been attempts at deriving more sophisticated representations for text categorization, including the use of n-grams or phrases (Lewis, 1992; Dumais et al., 1998), or augmenting the standard BoW approach with synonym clusters or latent dimensions (Baker and McCallum, 1998; Cai and Hofmann, 2003). However, none of the more elaborate representations manage to significantly outperform the standard BoW approach (Sebastiani, 2002). In addition, they are typically more expensive to compute.
The authors therefore introduce a new method for producing concept-based representations for natural language data. This method is a combination of Random Indexing (RI) and Latent Semantic Indexing (LSI); the computation time for Singular Value Decomposition on an RI-reduced matrix is almost halved compared to LSI alone. The authors use this method to create concept-based representations for a standard text categorization problem, and use the representations as input to a FSVM classifier. The categorization results are compared to those reached using standard BoW representations in the Vector Space Model (VSM), and the authors also demonstrate how the performance of the FSVM can be improved by combining representations.

          II. VECTOR SPACE MODEL (VSM) ([14])
1. Data Structuring
In the Vector Space Model, documents are represented as vectors in t-dimensional space, where t is the number of indexed terms in the collection. The term weights are evaluated as:

                    wij = lij * gi * nj

where:
- lij denotes the local weight of term i in document j,
- gi is the global weight of term i in the document collection,
- nj is the normalization factor for document j.
The local weight is:

                    lij = log(1 + fij)

where:
- fij is the frequency of term i in document j,
- pij is the probability of term i occurring in document j (used in computing the global weight gi).
2. Term-Document Matrix
The VSM is implemented by forming a term-document matrix: an m×n matrix A, where m is the number of terms and n is the number of documents.

        ⎛ d11  d12  ...  d1n ⎞
        ⎜ d21  d22  ...  d2n ⎟
    A = ⎜  .    .    .    .  ⎟
        ⎜  .    .    .    .  ⎟
        ⎝ dm1  dm2  ...  dmn ⎠

where:
- each row of the term-document matrix corresponds to a term,
- each column corresponds to a document.
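The weighting scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are ours, the global weight gi and normalization nj are simply set to 1, and tokenization is naive whitespace splitting.

```python
import math
from collections import Counter

def local_weight(freq):
    """l_ij = log(1 + f_ij), the local weight of term i in document j."""
    return math.log(1 + freq)

def term_document_matrix(docs):
    """Build an m x n term-document matrix with w_ij = l_ij * g_i * n_j,
    taking g_i = n_j = 1 for brevity; rows are terms, columns documents."""
    vocab = sorted({t for d in docs for t in d.split()})
    counts = [Counter(d.split()) for d in docs]
    return vocab, [[local_weight(c[t]) for c in counts] for t in vocab]

docs = ["grain price up", "crude oil price price"]
vocab, A = term_document_matrix(docs)
# "price" occurs twice in the second document, so its weight there is log(3)
```

A full scheme would multiply in a global weight (e.g. an entropy weight built from the probabilities pij) and a document normalization factor.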

                                                                                             ISSN 1947-5500

- dij is the weight associated with term i in document j.

       III. LATENT SEMANTIC INDEXING (LSI) ([4])
The vector space model presented in Section II suffers from the curse of dimensionality: as the problem size increases, the processing time required to construct the vector space and to query the document space increases as well. In addition, the vector space model exclusively measures term co-occurrence; that is, the inner product between two documents is nonzero if and only if there is at least one shared term between them. Latent Semantic Indexing (LSI) is used to overcome the problems of synonymy and polysemy.
1. Singular Value Decomposition (SVD) ([5]-[9])
LSI is based on a mathematical technique called Singular Value Decomposition (SVD). The SVD decomposes a term-by-document matrix A into three matrices: a term-by-dimension matrix U, a singular-value matrix Σ, and a document-by-dimension matrix VT. The purpose of the SVD analysis is to detect semantic relationships in the document collection. The decomposition is performed as follows:

                    A = U Σ VT

where:
- U is an orthogonal m×m matrix whose columns are the left singular vectors of A,
- Σ is a diagonal matrix whose diagonal holds the singular values of A in descending order,
- V is an orthogonal n×n matrix whose columns are the right singular vectors of A.
To generate a rank-k approximation Ak of A, where k << r, each matrix factor is truncated to its first k columns. That is, Ak is computed as:

                    Ak = Uk Σk VkT

where:
- Uk is the m×k matrix whose columns are the first k left singular vectors of A,
- Σk is the k×k diagonal matrix whose diagonal is formed by the k leading singular values of A,
- Vk is the n×k matrix whose columns are the first k right singular vectors of A.
The approximation Ak created in LSI is important because it captures the associations between terms as they are used across the documents, removing the harmful influence that variation in term usage has on index-based search [6], [7], [8]. Because the k-dimensional LSI space (k << r) removes unimportant differences in meaning, keywords that often appear together in documents are mapped to nearly the same region of the k-dimensional space, even when they do not appear simultaneously in the same document.
2. Drawbacks of the LSI model
• SVD is often treated as a "magical" process.
• SVD is computationally expensive.
• The initial "huge matrix step".
• Linguistically agnostic.

       IV. RANDOM INDEXING (RI) ([6],[10])
Random Indexing is an incremental vector space model that is computationally less demanding (Karlgren and Sahlgren, 2001). The Random Indexing model reduces dimensionality by, instead of giving each word a whole dimension, giving each word a random vector of much lower dimensionality than the total number of words in the text.
Random Indexing differs from the basic vector space model in that it does not give each word an orthogonal unit vector. Instead, each word is given a vector of length 1 in a random direction. The dimension of this random vector is chosen to be smaller than the number of words in the collection, with the end result that not all words are orthogonal to each other, since the rank of the matrix will not be high enough. This can be formulated as AT = Ã, where A is the original d×w word-document matrix of the basic vector space model, T is a w×k matrix of random vectors representing the mapping between each word wi and its k-dimensional random vector, and Ã is A projected down into d×k dimensions. A query is then matched by first multiplying the query vector with T, and then finding the column in Ã that gives the best match. T is constructed by, for each row of T (each corresponding to a word), selecting n different positions; n/2 of these are assigned the value 1/√n, and the rest are assigned −1/√n. This ensures unit length, and that the vectors are distributed evenly over the unit sphere of dimension k (Sahlgren, 2005).
An even distribution ensures that every pair of vectors has a high probability of being nearly orthogonal. Information is lost in this process (by the pigeonhole principle, and because the rank of the reduced matrix is lower). However, when used on a matrix with very few nonzero elements, the induced error decreases, since the likelihood of a conflict within each document, and between documents, decreases. Using Random Indexing on a matrix therefore introduces a certain error into the results. These errors are introduced by words that spuriously match other words, i.e., pairs of words whose corresponding vectors have a nonzero scalar product; such pairs create false positive matches. False negatives can also be created by words whose corresponding vectors cancel each other out.
Advantages of Random Indexing:
• Based on Pentti Kanerva's theories on Sparse Distributed Memory.
• Uses distributed representations to accumulate context vectors.
• An incremental method that avoids the "huge matrix step".

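The construction of T and the projection AT = Ã can be sketched as follows. This is a toy illustration under our own naming, not the authors' code; we assume n = 8 nonzero entries per random vector and a small random matrix A in place of real document data.

```python
import numpy as np

def random_mapping(n_words, k, n=8, seed=0):
    """Build the w x k matrix T: each row (one per word) gets n nonzero
    entries at random positions, half +1/sqrt(n) and half -1/sqrt(n),
    so every row is a unit vector pointing in a random direction."""
    rng = np.random.default_rng(seed)
    T = np.zeros((n_words, k))
    val = 1.0 / np.sqrt(n)
    for i in range(n_words):
        pos = rng.choice(k, size=n, replace=False)
        T[i, pos[: n // 2]] = val
        T[i, pos[n // 2:]] = -val
    return T

# A is the d x w document-word matrix; A_tilde = A @ T is only d x k.
A = np.random.default_rng(1).random((5, 1000))   # 5 documents, 1000 words
T = random_mapping(1000, k=100)
A_tilde = A @ T                                  # reduced to 5 x 100

# An SVD (LSI) can then be run on the much smaller reduced matrix:
U, S, Vt = np.linalg.svd(A_tilde, full_matrices=False)
A_k = (U[:, :2] * S[:2]) @ Vt[:2]                # rank-2 approximation
```

Running the SVD on Ã rather than A is the RI+LSI combination described in the next section.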

            V. COMBINING RI AND LSI
We have seen the advantages and disadvantages of both RI and LSI: RI is efficient in terms of computation time but does not preserve as much information as LSI; LSI, on the other hand, is computationally expensive, but produces highly accurate results, in addition to capturing the underlying semantics of the documents. As mentioned earlier, a hybrid algorithm was proposed that combines the two approaches to benefit from the advantages of both. The algorithm works as follows:
• First the data is pre-processed with RI down to a lower dimension k1.
• Then LSI is applied to the reduced, lower-dimensional data, to further reduce it to the desired dimension, k2.
This algorithm should improve the running time of LSI and the accuracy of RI. As mentioned earlier, the time complexity of SVD is O(cmn) for large, sparse datasets. It is reasonable, then, to assume that a lower dimensionality will result in faster computation, since the cost depends on the dimensionality m.

1. Support Vector Machines
The support vector machine is a very specific class of algorithms, characterized by the use of kernels, the absence of local minima, the sparseness of the solution, and the capacity control obtained by acting on the margin, or on other "dimension independent" quantities such as the number of support vectors.
Let S = {(xi, yi)}, i = 1, ..., l, be a training sample set, where xi ∈ Rn with corresponding binary class labels yi ∈ {1, −1}. Let φ be a non-linear mapping from the original data space to a high-dimensional feature space; we therefore replace the sample points xi and xj with their images φ(xi) and φ(xj), respectively.
Let w and b be the weight and the bias of the separating hyperplane. We define a hyperplane that may act as the decision surface in feature space:

                    w · φ(x) + b = 0

To separate the data linearly in the feature space, the decision function must satisfy the following constraint conditions. The optimization problem is:

                    min (1/2)||w||2 + C Σi ξi                    (2)

Subject to:

                    yi (w · φ(xi) + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, ..., l

where ξi is a slack variable introduced to relax the hard-margin constraints and the regularization constant C > 0 implements the trade-off between the maximal margin of separation and the classification error.
To solve the optimization problem, we introduce the following Lagrange function:

    L(w, b, ξ, α, β) = (1/2)||w||2 + C Σi ξi − Σi αi [yi (w · φ(xi) + b) − 1 + ξi] − Σi βi ξi

where αi ≥ 0 and βi ≥ 0 are the Lagrange multipliers. Differentiating L with respect to w, b and ξi, and setting the results to zero, the optimization problem (2) translates into the following simple dual problem.
Maximize:

                    W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj)                    (4)

Subject to:

                    Σi αi yi = 0,  0 ≤ αi ≤ C,  i = 1, ..., l

where K(xi, xj) = (φ(xi), φ(xj)) is a kernel function satisfying Mercer's theorem.
Let α* be the optimal solution of (4), with corresponding weight and bias w* and b*, respectively. According to the Karush-Kuhn-Tucker (KKT) conditions, the solution of the optimization problem (4) must satisfy:

                    αi* [yi (w* · φ(xi) + b*) − 1 + ξi] = 0,  i = 1, ..., l

where the αi* are nonzero only for a subset of the vectors xi, called the support vectors.
Finally, the optimal decision function is:

                    f(x) = sgn( Σi αi* yi K(xi, x) + b* )

2. Fuzzy Support Vector Machines ([12]-[13])
Consider the aforementioned binary training set S. We choose a proper membership function and obtain si, the fuzzy membership value of the training point xi. The training set S then becomes the fuzzy training set S′:

                    S′ = {(xi, yi, si)},  i = 1, ..., l

where xi ∈ Rn, the binary class labels yi ∈ {1, −1}, and 0 ≤ si ≤ 1. Then, the quadratic programming problem for classification can be described as follows:
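The effect of the memberships si can be sketched numerically. This is our own toy illustration, not the paper's implementation: we drop the bias term b (so the equality constraint on the dual variables disappears), use a linear kernel, and solve the box-constrained dual by simple projected gradient ascent.

```python
import numpy as np

def fsvm_dual(X, y, s, C=10.0, lr=0.01, iters=2000):
    """Train a bias-free fuzzy SVM via projected gradient ascent on the dual.

    Each point i has fuzzy membership s[i] in (0, 1]; its dual variable
    alpha_i is box-constrained to [0, s[i] * C], so points with small
    membership have less influence (the FSVM modification)."""
    K = X @ X.T                         # linear kernel matrix
    Q = (y[:, None] * y[None, :]) * K   # Q_ij = y_i y_j K(x_i, x_j)
    alpha = np.zeros(len(y))
    for _ in range(iters):
        grad = 1.0 - Q @ alpha          # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, s * C)
    return alpha

def predict(X_train, y_train, alpha, X_new):
    return np.sign((alpha * y_train) @ (X_train @ X_new.T))

# toy separable data; the last point is deemed less reliable (s = 0.2)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
s = np.array([1.0, 1.0, 1.0, 0.2])
alpha = fsvm_dual(X, y, s)
labels = predict(X, y, alpha, X)   # each training point on its own side
```

The only difference from the ordinary SVM dual is the per-sample upper bound s[i] * C instead of a uniform C.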


                    min (1/2)||w||2 + C Σi si ξi                    (18)

Subject to:

                    yi (w · φ(xi) + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, ..., l

where C > 0 is the penalty parameter and ξi is a slack variable. The fuzzy membership si expresses the attitude of the corresponding point xi toward one class. A smaller si reduces the effect of the slack variable ξi in problem (18), so the corresponding point xi is treated as less important.
By using the KKT conditions and Lagrange multipliers, we can form the following equivalent dual problem.
Maximize:

                    W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj)

Subject to:

                    Σi αi yi = 0,  0 ≤ αi ≤ si C,  i = 1, ..., l

If αi > 0, the corresponding point xi is a support vector. Moreover, if 0 < αi < si C, the support vector xi lies on the margin of the separating surface; if αi = si C, the support vector xi belongs to the error samples. The decision function of the corresponding optimal separating surface then becomes:

                    f(x) = sgn( Σi αi* yi K(xi, x) + b* )

where K(xi, x) is a kernel function.

                    VII. EXPERIMENT
We investigate the performance of two techniques: (1) FSVM on the original matrix, where the Vector Space Model is used, and (2) FSVM on a matrix where Random Indexing is used to reduce the dimensionality of the matrix before singular value decomposition. Performance is measured as calculation time as well as precision and recall. We have used a subset of the Reuters-21578 text corpus. The subset comprises 3299 documents covering the 5 most frequent categories: earn, acquisition, money, grain, crude.

                                  F-score
     No   Classifier        FSVM        FSVM+LSI+RI
     1    Earn              0.97           0.993
     2    Acquisition       0.93           0.965
     3    Money             0.95           0.972
     4    Grain             0.79           0.933
     5    Crude             0.92           0.961
          Average           0.912          0.9648

     Table 1: The experiment results of FSVM and FSVM+LSI+RI

                    VIII. CONCLUSION
This article introduces Fuzzy Support Vector Machines for text categorization based on reduced matrices, using Latent Semantic Indexing combined with Random Indexing. Our results show that FSVM_LSI_RI provides better Precision and Recall than FSVM. Due to time limits, only 5 categories were covered in the experiments. Future directions include using this scheme to classify students' ideas at the University of Information Technology, HoChiMinh City.

                    REFERENCES
[1].    April Kontostathis (2007), "Essential Dimensions of Latent Semantic Indexing", Department of Mathematics and Computer Science, Ursinus College, Proceedings of the 40th Hawaii International Conference on System Sciences, 2007.
[2].    Cherukuri Aswani Kumar, Suripeddi Srinivas (2006), "Latent Semantic Indexing Using Eigenvalue Analysis for Efficient Information Retrieval", Int. J. Appl. Math. Comp. Sci., 2006, Vol. 16, No. 4, pp. 551-558.
[3].    David A. Hull (1994), Information Retrieval Using Statistical Classification, Doctor of Philosophy thesis, Stanford University.
[4].    Gabriel Oksa, Martin Becka and Marian Vajtersic (2002), "Parallel SVD Computation in Updating Problems of Latent Semantic Indexing", Proceedings of the ALGORITMY 2002 Conference on Scientific Computing, pp. 113-120.
[5].    Katarina Blom (1999), Information Retrieval Using the Singular Value Decomposition and Krylov Subspaces, Department of Mathematics, Chalmers University of Technology, S-412 Goteborg, Sweden.
[6].    Kevin Erich Heinrich (2007), Automated Gene Classification using Nonnegative Matrix Factorization on Biomedical Literature, Doctor of Philosophy thesis, The University of Tennessee, Knoxville.
[7].    Miles Efron (2003), Eigenvalue-Based Estimators for Optimal Dimensionality Reduction in Information Retrieval, ProQuest Information and Learning Company.
[8].    Michael W. Berry, Zlatko Drmac, Elizabeth R. Jessup (1999), "Matrices, Vector Spaces, and Information Retrieval", SIAM Review, Vol. 41, No. 2, pp. 335-352.
[9].    Nordianah Ab Samat, Masrah Azrifah Azmi Murad, Muhamad Taufik Abdullah, Rodziah Atan (2008), "Term Weighting Schemes Experiment Based on SVD for Malay Text Retrieval", Faculty of Computer Science and Information Technology, University Putra Malaysia, IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 10, October 2008.
[10].   Jussi Karlgren and Magnus Sahlgren (2001), "From Words to Understanding", in Y. Uesaka, P. Kanerva, and H. Asoh, editors, Foundations of Real-World Intelligence, chapter 26, pages 294-308, Stanford: CSLI Publications.
[11].   Magnus Rosell, Martin Hassel, Viggo Kann (2009), "Global Evaluation of Random Indexing through Swedish Word Clustering Compared to the People's Dictionary of Synonyms".


[12].   Shigeo Abe and Takuya Inoue (2002), "Fuzzy Support Vector Machines for Multiclass Problems", ESANN 2002 proceedings, pp. 113-118.
[13].   Shigeo Abe and Takuya Inoue (2001), "Fuzzy Support Vector Machines for Pattern Classification", in Proceedings of the International Joint Conference on Neural Networks (IJCNN '01), volume 2, pp. 1449-1454.
[14].   T. Joachims (1998), "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", in Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pp. 137-142.

                    AUTHORS PROFILE
The author was born in 1969 in Da Nang, VietNam. He graduated from the University of Odessa (USSR) in 1992, specializing in Information Technology. He completed his doctoral thesis in 1996 at the Academy of Science of Russia, specializing in IT. He is now the Dean of Software Engineering at the University of Information Technology, VietNam National University HoChiMinh City.
Research interests: Knowledge Engineering, Information Systems and Software Engineering.
