(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 7, October 2010

Using Fuzzy Support Vector Machines in Text Categorization Based on Reduced Matrices

Vu Thanh Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam
email: nguyenvt@uit.edu.vn

Abstract - In this article, the authors compare results from a Fuzzy Support Vector Machine (FSVM) and a Fuzzy Support Vector Machine combined with Latent Semantic Indexing and Random Indexing on reduced matrices (FSVM_LSI_RI). Our results show that FSVM_LSI_RI provides better Precision and Recall than FSVM. In this experiment a corpus comprising 3299 documents from the Reuters-21578 corpus was used. The categorization results are compared to those reached using standard BoW representations in the Vector Space Model (VSM), and the authors also demonstrate how the performance of the FSVM can be improved by combining representations.

Keywords - SVM, FSVM, LSI, RI

I. INTRODUCTION
Text categorization is the task of assigning a text to one or more of a set of predefined categories. As with most other natural language processing applications, representational factors are decisive for the performance of the categorization. The incomparably most common representational scheme in text categorization is the Bag-of-Words (BoW) approach, in which a text is represented as a vector t of word weights, such that t = (w1, ..., wn), where wi is the weight of word i in the text.

The BoW representation ignores all semantic or conceptual information; it simply looks at the surface word forms. Modern BoW approaches are based on three models: the Boolean model, the Vector Space model, and the Probability model. There have been attempts at deriving more sophisticated representations for text categorization, including the use of n-grams or phrases (Lewis, 1992; Dumais et al., 1998), or augmenting the standard BoW approach with synonym clusters or latent dimensions (Baker and McCallum, 1998; Cai and Hofmann, 2003). However, none of the more elaborate representations manages to significantly outperform the standard BoW approach (Sebastiani, 2002). In addition, they are typically more expensive to compute.

The authors therefore introduce a new method for producing concept-based representations for natural language data. This method is a combination of Random Indexing (RI) and Latent Semantic Indexing (LSI); the computation time for Singular Value Decomposition on an RI-reduced matrix is almost halved compared to LSI alone. The authors use this method to create concept-based representations for a standard text categorization problem, and feed the representations as input to an FSVM classifier.

II. VECTOR SPACE MODEL (VSM) ([14])
1. Data structuring
In the Vector Space Model, documents are represented as vectors in t-dimensional space, where t is the number of indexed terms in the collection. Term weights are evaluated by the function:

wij = lij * gi * nj

where:
- lij denotes the local weight of term i in document j,
- gi is the global weight of term i in the document collection,
- nj is the normalization factor for document j.

The local weight is

lij = log(1 + fij)

where:
- fij is the frequency of term i in document j,
- pij is the probability of term i occurring in document j (it enters the global weight gi).

2. Term-document matrix
The VSM is implemented by forming a term-document matrix, an m×n matrix where m is the number of terms and n is the number of documents:

    ⎛ d11  d12  ...  d1n ⎞
A = ⎜ d21  d22  ...  d2n ⎟
    ⎜  :    :          :  ⎟
    ⎝ dm1  dm2  ...  dmn ⎠

where:
- each row of the term-document matrix corresponds to a term,
- each column of the term-document matrix corresponds to a document,
- dij is the weight associated with term i in document j.
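As a concrete illustration of the weighting scheme above, the sketch below builds a small term-document matrix with the local weights lij = log(1 + fij); the global and normalization factors are taken as 1 for simplicity, and the function and variable names are illustrative, not from the paper:

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Build an m x n matrix of weights w_ij = log(1 + f_ij),
    where f_ij is the frequency of term i in document j."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    A = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for term, f in Counter(doc).items():
            A[index[term]][j] = math.log(1 + f)
    return vocab, A

docs = [["grain", "price", "grain"], ["price", "crude", "oil"]]
vocab, A = term_document_matrix(docs)
# "grain" occurs twice in document 0, so its weight there is log(1 + 2)
```

Each column of A is then the vector representation of one document in the t-dimensional term space.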
III. LATENT SEMANTIC INDEXING (LSI) ([1]-[4])
The vector space model presented in section II suffers from the curse of dimensionality: as problems grow in size, the processing time required to construct the vector space and to query the document space grows as well. In addition, the vector space model exclusively measures term co-occurrence — the inner product between two documents is nonzero if and only if they share at least one term. Latent Semantic Indexing (LSI) is used to overcome the problems of synonymy and polysemy.

1. Singular Value Decomposition (SVD) ([5]-[9])
LSI is based on a mathematical technique called Singular Value Decomposition (SVD), which decomposes a term-by-document matrix A into three matrices: a term-by-dimension matrix U, a singular-value matrix Σ, and a document-by-dimension matrix V^T. The purpose of the SVD analysis is to detect semantic relationships in the document collection. The decomposition is performed as follows:

A = U Σ V^T

where:
- U is an orthogonal m×m matrix whose columns are the left singular vectors of A,
- Σ is a diagonal matrix whose diagonal holds the singular values of A in descending order,
- V is an orthogonal n×n matrix whose columns are the right singular vectors of A.

To generate a rank-k approximation Ak of A, where k << r, each matrix factor is truncated to its first k columns. That is, Ak is computed as:

Ak = Uk Σk Vk^T

where:
- Uk is the m×k matrix whose columns are the first k left singular vectors of A,
- Σk is the k×k diagonal matrix whose diagonal is formed by the k leading singular values of A,
- Vk is the n×k matrix whose columns are the first k right singular vectors of A.

In LSI, the approximation Ak of A is what matters: it captures the combinations of relationships between the terms used in the documents, while removing the harmful influence that shifts in term usage have on index-based search [6], [7], [8]. Because a k-dimensional LSI space (k << r) is used, unimportant differences in "meaning" are removed: keywords that often appear together in documents end up close together in the k-dimensional LSI space, even when they never appear in the same document.

2. Drawbacks of the LSI model
• SVD is often treated as a "magical" process.
• SVD is computationally expensive.
• The initial "huge matrix step".
• Linguistically agnostic.

IV. RANDOM INDEXING (RI) ([6],[10])
Random Indexing is an incremental vector space model that is computationally less demanding (Karlgren and Sahlgren, 2001). The Random Indexing model reduces dimensionality by, instead of giving each word a whole dimension, giving each word a random vector of much lower dimensionality than the total number of words in the text.

Random Indexing differs from the basic vector space model in that it does not give each word an orthogonal unit vector. Instead, each word is given a vector of length 1 in a random direction. The dimension of this random vector is chosen to be smaller than the number of words in the collection, with the end result that not all words are orthogonal to each other, since the rank of the matrix will not be high enough. This can be formulated as A T = Ã, where A is the original d×w document-by-word matrix as in the basic vector space model, T is a w×k matrix of random vectors representing the mapping between each word wi and its k-dimensional random vector, and Ã is A projected down into d×k dimensions. A query is then matched by first multiplying the query vector by T, and then finding the row of Ã (i.e. the document) that gives the best match. T is constructed by, for each row of T (each corresponding to a word, i.e. a column of A), selecting n different positions: n/2 of these are assigned the value 1/√n, and the rest are assigned −1/√n. This ensures unit length, and that the vectors are distributed evenly over the unit sphere in dimension k (Sahlgren, 2005).

An even distribution ensures that every pair of vectors is orthogonal with high probability. Information is lost during this process (by the pigeonhole principle, the rank of the reduced matrix is lower). However, when used on a matrix with very few nonzero elements, the induced error decreases, as the likelihood of a conflict within each document, and between documents, decreases.
Using Random Indexing on a matrix will nevertheless introduce a certain error into the results. These errors are introduced by words that "match" other words, i.e. whose corresponding random vectors have a nonzero scalar product. In the matrix this shows up as false positive matches, created for every word whose vector has a nonzero scalar product with some vector in the vector space of the matrix. False negatives can also be created, by words whose corresponding vectors cancel each other out.

Advantages of Random Indexing:
• Based on Pentti Kanerva's theories on Sparse Distributed Memory.
• Uses distributed representations to accumulate context vectors.
• An incremental method that avoids the "huge matrix step".

V. COMBINING RI AND LSI
We have seen the advantages and disadvantages of both RI and LSI: RI is efficient in terms of computational time but does not preserve as much information as LSI; LSI, on the other hand, is computationally expensive, but produces highly accurate results, in addition to capturing the underlying semantics of the documents.
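The combination of the two techniques can be sketched end-to-end: project the document-by-word matrix down to an intermediate dimension with sparse random index vectors, then apply a truncated SVD on the smaller matrix. This is a minimal illustration under assumed toy dimensions (d, w, k1, k2, n are example values, and `numpy.linalg.svd` stands in for whatever SVD routine an implementation would use):

```python
import numpy as np

rng = np.random.default_rng(0)

d, w = 6, 50      # documents x words
k1, k2 = 20, 3    # RI target dimension, then LSI target dimension (k2 < k1 < w)

A = rng.random((d, w))  # document-by-word matrix

# Step 1 (RI): build the w x k1 random mapping T, each row with n nonzero
# entries, half +1/sqrt(n) and half -1/sqrt(n).
n = 4
T = np.zeros((w, k1))
for row in T:
    pos = rng.choice(k1, size=n, replace=False)
    row[pos[: n // 2]] = 1 / np.sqrt(n)
    row[pos[n // 2:]] = -1 / np.sqrt(n)
A_tilde = A @ T  # d x k1 projection: the SVD now runs on a smaller matrix

# Step 2 (LSI): truncated SVD of the reduced matrix, keeping k2 dimensions.
U, s, Vt = np.linalg.svd(A_tilde, full_matrices=False)
A_k = (U[:, :k2] * s[:k2]) @ Vt[:k2, :]  # rank-k2 approximation of A_tilde
docs_k2 = U[:, :k2] * s[:k2]             # d x k2 concept representation
```

The d×k2 rows of `docs_k2` are the concept-based document representations that would be fed to the classifier.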
As mentioned earlier, a hybrid algorithm has been proposed that combines the two approaches to benefit from the advantages of both. The algorithm works as follows:
• First the data is pre-processed with RI down to a lower dimension k1.
• Then LSI is applied to the reduced, lower-dimensional data, to further reduce the data to the desired dimension, k2.
This algorithm should improve the running time of LSI, and the accuracy of RI. As mentioned earlier, the time complexity of the SVD is O(cmn) for large, sparse datasets. It is reasonable, then, to assume that a lower dimensionality will result in a faster computation, since the complexity depends on the dimensionality m.

VI. TEXT CATEGORIZATION
1. Support Vector Machines
The support vector machine is a very specific class of algorithms, characterized by the use of kernels, the absence of local minima, the sparseness of the solution, and the capacity control obtained by acting on the margin, or on other "dimension independent" quantities such as the number of support vectors.

Let S = {(x1, y1), ..., (xl, yl)} be a training sample set, where xi ∈ R^n with corresponding binary class labels yi ∈ {1, -1}. Let φ be a non-linear mapping from the original data space to a high-dimensional feature space; we therefore replace the sample points xi and xj with their images φ(xi) and φ(xj), respectively. Let w and b be the weight and the bias of the separating hyperplane. We define a hyperplane that acts as the decision surface in feature space as follows:

w · φ(x) + b = 0

To separate the data linearly in the feature space, the decision function must satisfy the following constraint conditions. The optimization problem is:

Minimize: (1/2)||w||^2 + C Σi ξi
Subject to: yi(w · φ(xi) + b) >= 1 - ξi, ξi >= 0, i = 1, ..., l          (2)

where ξi is a slack variable introduced to relax the hard-margin constraints and the regularization constant C > 0 implements the trade-off between the maximal margin of separation and the classification error.

To solve this optimization problem, we introduce the following Lagrange function:

L(w, b, ξ, α, β) = (1/2)||w||^2 + C Σi ξi - Σi αi [yi(w · φ(xi) + b) - 1 + ξi] - Σi βi ξi

where αi >= 0 and βi >= 0 are Lagrange multipliers. Differentiating L with respect to w, b and ξi, and setting the results to zero, the optimization problem (2) translates into the following simple dual problem.

Maximize: Σi αi - (1/2) Σi Σj αi αj yi yj K(xi, xj)
Subject to: Σi αi yi = 0, 0 <= αi <= C, i = 1, ..., l                    (4)

where K(xi, xj) = (φ(xi), φ(xj)) is a kernel function satisfying Mercer's theorem.

Let α* be the optimal solution of (4), with corresponding weight and bias w* and b*, respectively. According to the Karush-Kuhn-Tucker (KKT) conditions, the solution of the optimal problem (4) must satisfy

αi* [yi(w* · φ(xi) + b*) - 1 + ξi] = 0

where the αi* are nonzero only for a subset of the vectors xi, called support vectors. Finally, the optimal decision function is

f(x) = sgn( Σi αi* yi K(xi, x) + b* )

2. Fuzzy Support Vector Machines ([12]-[13])
Consider the aforementioned binary training set S. We choose a proper membership function and obtain si, the fuzzy membership value of the training point xi. The training set S then becomes the fuzzy training set S':

S' = {(x1, y1, s1), ..., (xl, yl, sl)}

where xi ∈ R^n, with corresponding binary class labels yi ∈ {1, -1} and 0 <= si <= 1. The quadratic programming problem for classification can then be described as follows:

Minimize: (1/2)||w||^2 + C Σi si ξi
Subject to: yi(w · φ(xi) + b) >= 1 - ξi, ξi >= 0, i = 1, ..., l          (18)

where C > 0 is the penalty parameter and ξi is a slack variable. The fuzzy membership si expresses the attitude of the corresponding point xi toward one class: a smaller si reduces the effect of the parameter ξi in problem (18), so the corresponding point xi is treated as less important.

By using the KKT conditions and Lagrange multipliers, we can form the following equivalent dual problem.

Maximize: Σi αi - (1/2) Σi Σj αi αj yi yj K(xi, xj)
Subject to: Σi αi yi = 0, 0 <= αi <= si C, i = 1, ..., l

If αi > 0, the corresponding point xi is a support vector. Moreover, if 0 < αi < si C, the support vector xi lies on the margin of the separating surface; if αi = si C, the support vector xi is an error sample. The decision function of the corresponding optimal separating surface then becomes

f(x) = sgn( Σi αi* yi K(xi, x) + b* )

where K(xi, x) is the kernel function.
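The effect of the membership values si can be illustrated with a minimal linear FSVM trained by subgradient descent on the weighted hinge loss. This is a simplified sketch rather than the authors' implementation: it uses a linear kernel, a fixed learning rate, and illustrative names throughout.

```python
import numpy as np

def train_fuzzy_linear_svm(X, y, s, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2)||w||^2 + C * sum_i s_i * xi_i with a linear kernel,
    where the fuzzy membership s_i in [0, 1] down-weights the slack
    penalty of less important training points."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:  # point violates the (soft) margin
                w -= lr * (w - C * s[i] * y[i] * X[i])
                b += lr * C * s[i] * y[i]
            else:           # only the regularizer contributes
                w -= lr * w
    return w, b

# Two separable clusters; the last point is given a low membership,
# so its slack variable would be penalized only half as much.
X = np.array([[2.0, 2.0], [3.0, 2.5], [-2.0, -2.0], [-3.0, -2.5]])
y = np.array([1, 1, -1, -1])
s = np.array([1.0, 1.0, 1.0, 0.5])
w, b = train_fuzzy_linear_svm(X, y, s)
pred = np.sign(X @ w + b)
```

Setting all si = 1 recovers the ordinary soft-margin SVM of problem (2); lowering si toward 0 lets a suspect or noisy point be ignored, which is exactly the role of the bound 0 <= αi <= si C in the dual.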
VII. EXPERIMENT
We investigate the performance of two techniques: (1) an FSVM classifier on the original matrix, where the Vector Space Model is used, and (2) an FSVM classifier on a matrix where Random Indexing is used to reduce the dimensionality of the matrix before singular value decomposition. Performance is measured as calculation time as well as precision and recall. We have used a subset of the Reuters-21578 text corpus. The subset comprises 3299 documents from the 5 most frequent categories: earn, acquisition, money, grain, crude.

                            F-score
No   Classifier        FSVM     FSVM+LSI+RI
1    Earn              0.97     0.993
2    Acquisition       0.93     0.965
3    Money             0.95     0.972
4    Grain             0.79     0.933
5    Crude             0.92     0.961
     Average           0.912    0.9648

Table 1: The experiment results of the FSVM and FSVM+LSI+RI classifiers.

VIII. CONCLUSION
This article introduces Fuzzy Support Vector Machines for text categorization based on reduced matrices, using Latent Semantic Indexing combined with Random Indexing. Our results show that FSVM_LSI_RI provides better Precision and Recall than FSVM. Due to time limits, experiments were run on only the 5 categories above. Future directions include using this scheme to classify students' ideas at the University of Information Technology, Ho Chi Minh City.

REFERENCES
[1]. April Kontostathis (2007), "Essential Dimensions of Latent Semantic Indexing", Department of Mathematics and Computer Science, Ursinus College, Proceedings of the 40th Hawaii International Conference on System Sciences, 2007.
[2]. Cherukuri Aswani Kumar, Suripeddi Srinivas (2006), "Latent Semantic Indexing Using Eigenvalue Analysis for Efficient Information Retrieval", Int. J. Appl. Math. Comp. Sci., 2006, Vol. 16, No. 4, pp. 551-558.
[3]. David A. Hull (1994), Information Retrieval Using Statistical Classification, Doctor of Philosophy Degree, Stanford University.
[4]. Gabriel Oksa, Martin Becka and Marian Vajtersic (2002), "Parallel SVD Computation in Updating Problems of Latent Semantic Indexing", Proceedings of the ALGORITMY 2002 Conference on Scientific Computing, pp. 113-120.
[5]. Katarina Blom (1999), Information Retrieval Using the Singular Value Decomposition and Krylov Subspaces, Department of Mathematics, Chalmers University of Technology, S-412 Goteborg, Sweden.
[6]. Kevin Erich Heinrich (2007), Automated Gene Classification using Nonnegative Matrix Factorization on Biomedical Literature, Doctor of Philosophy Degree, The University of Tennessee, Knoxville.
[7]. Miles Efron (2003), Eigenvalue-Based Estimators for Optimal Dimensionality Reduction in Information Retrieval, ProQuest Information and Learning Company.
[8]. Michael W. Berry, Zlatko Drmac, Elizabeth R. Jessup (1999), "Matrices, Vector Spaces, and Information Retrieval", SIAM Review, Vol. 41, No. 2, pp. 335-362.
[9]. Nordianah Ab Samat, Masrah Azrifah Azmi Murad, Muhamad Taufik Abdullah, Rodziah Atan (2008), "Term Weighting Schemes Experiment Based on SVD for Malay Text Retrieval", Faculty of Computer Science and Information Technology, University Putra Malaysia, IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 10, October 2008.
[10]. Jussi Karlgren and Magnus Sahlgren (2001), "From Words to Understanding", in Y. Uesaka, P. Kanerva, and H. Asoh, editors, Foundations of Real-World Intelligence, chapter 26, pages 294-308, Stanford: CSLI Publications.
[11]. Magnus Rosell, Martin Hassel, Viggo Kann (2009), "Global Evaluation of Random Indexing through Swedish Word Clustering Compared to the People's Dictionary of Synonyms".
[12]. Shigeo Abe and Takuya Inoue (2002), "Fuzzy Support Vector Machines for Multiclass Problems", ESANN 2002 proceedings, pp. 113-118.
[13]. Shigeo Abe and Takuya Inoue (2001), "Fuzzy Support Vector Machines for Pattern Classification", in Proceedings of the International Joint Conference on Neural Networks (IJCNN '01), volume 2, pp. 1449-1454.
[14]. T. Joachims (1998), "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", in Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pp. 137-142.

AUTHORS PROFILE
The author was born in 1969 in Da Nang, Vietnam. He graduated from the University of Odessa (USSR) in 1992, specializing in Information Technology, and defended his doctoral thesis in 1996 at the Academy of Science of Russia, specializing in IT. He is now the Dean of Software Engineering at the University of Information Technology, Vietnam National University, Ho Chi Minh City. Research interests: Knowledge Engineering, Information Systems and Software Engineering.