                                             Journal of Computational Science and Engineering

2050-2311/Copyright © 2012 IE Enterprises Ltd.                               Jour. of Comp. Sci. and Eng.
All rights reserved                                                          Vol. 1, No. 1, 0009–0013, 2012

  The Algorithm Research of Genetic Algorithm Combining with
                 Text Feature Selection Method
                               Yuping Fang a, Ken Chen b, Chenhong Luo a,*
         a Yunnan Normal University, College of Vocational and Technical Education, Kunming, Yunnan 650032, China
         b Yunnan Normal University, Department of Computer Science, Kunming, Yunnan 650032, China


The traditional methods of text feature selection are analyzed and their respective advantages and disadvantages
compared in detail. Based on the self-optimizing character of the genetic algorithm, an improved text feature
selection scheme is proposed. First, a common text feature selection method (DF, IG, MI, CHI) is used to select the
candidate text features; these are then screened by a genetic algorithm, and finally the feature items suited to text
classification are selected. The experimental results show that performance is significantly improved.

Keywords: Feature Selection; Dimensionality Reduction; Evaluation Function; Genetic Algorithm

1. Introduction
   How to effectively organize and manage information, and how to find the information that users need quickly,
accurately and comprehensively are major challenges faced by those in the current information science and
technology fields. Text classification refers to the process in which texts are divided into the relevant predefined
categories according to their content under the given classification system. It is an important component of text
mining [1] and plays a significant role in improving the speed and accuracy of text retrieval.
   Text classification comprises three steps: building the vector model of the text, text feature selection, and
classifier training [2]. To balance computational time against classification accuracy, feature selection must be
performed, aiming at dimensionality reduction without damaging classification performance.

   * Corresponding author. Yuping Fang
   E-mail address:
   September 2012
                     Yuping Fang et al / Journal of Computational Science and Engineering 1:1 (2012) 0009–0013

2. Text feature selection
   Feature selection is a dimensionality reduction measure applied in the original feature space: from a group of
candidate features, it selects the most effective features, those contributing most to the text information, to form
an optimal feature subset. From the optimization point of view, feature selection is a process of finding an
optimal combination of features. The goal of text feature selection is to achieve the same or better classification
results with fewer features; therefore, among individuals with the same expressive ability, the fewer features the
better.
   The statistical methods currently used for feature selection fall into two groups: frequency-based methods,
such as feature frequency and document frequency; and information-theoretic methods, such as mutual
information, information gain, expected cross entropy, the χ² statistic, correlation coefficients, and the weight of
text evidence. The following four methods will be used below:

2.1 Document Frequency (DF)

   Document frequency can be expressed as:

            the number of documents in which feature t appears
DF(t) = ----------------------------------------------------------                           (1)
            the total number of documents in the training set

   It is the simplest evaluation function, and the small amount of computation it requires is its greatest
advantage. The theoretical assumption behind the DF evaluation function is that features which occur with
lower frequency carry less information, but this assumption is obviously incomplete. Therefore DF is generally
not used directly in practical applications, but serves as a baseline evaluation function.
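As an illustration (ours, not part of the paper's implementation), the DF measure of Eq. (1) can be sketched in Python, representing each document as a set of terms:

```python
from typing import List, Set

def document_frequency(docs: List[Set[str]], term: str) -> float:
    """DF(t) per Eq. (1): the fraction of training documents containing t."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if term in d) / len(docs)

# Toy example: "war" appears in 2 of 3 documents, so DF = 2/3.
docs = [{"war", "army"}, {"film", "star"}, {"war", "film"}]
print(document_frequency(docs, "war"))
```

Selecting by DF then amounts to keeping the terms whose value exceeds a chosen cutoff.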

2.2 Information Gain (IG)

   IG is a feature selection method widely used in the field of machine learning. From the information-theory
point of view, it divides the sample space according to the presence of each individual feature, then filters and
selects the effective features according to how much information is gained. IG can be expressed as:

IG(t) = − Σi P(ci) log P(ci) + P(t) Σi P(ci|t) log P(ci|t) + P(t̄) Σi P(ci|t̄) log P(ci|t̄)        (2)

   In the formula, P(ci|t) indicates the probability that a text belongs to category ci when feature t appears in the
text; P(ci|t̄) indicates the probability that a text belongs to ci when word t does not appear in the text; P(ci)
indicates the probability of occurrence of category ci; P(t) indicates the probability that word t appears in the
texts of the entire training set.
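Eq. (2) can be sketched as follows (our illustration; documents are sets of terms, logarithms are base 2, and 0·log 0 is taken as 0):

```python
import math
from typing import List, Set

def information_gain(docs: List[Set[str]], labels: List[str], term: str) -> float:
    """IG(t) per Eq. (2): category entropy minus the conditional entropy of
    the category given the presence or absence of the term."""
    n = len(docs)
    cats = sorted(set(labels))
    has = [term in d for d in docs]
    p_t = sum(has) / n
    # Prior entropy: -sum P(c) log P(c)
    ig = 0.0
    for c in cats:
        p_c = labels.count(c) / n
        ig -= p_c * math.log2(p_c)
    # + P(t) sum P(c|t) log P(c|t)  and  + P(~t) sum P(c|~t) log P(c|~t)
    for mask, p_side in ((has, p_t), ([not h for h in has], 1 - p_t)):
        total = sum(mask)
        for c in cats:
            cnt = sum(1 for m, l in zip(mask, labels) if m and l == c)
            if total and cnt:
                p = cnt / total
                ig += p_side * p * math.log2(p)
    return ig

# A term that perfectly separates two equally likely categories gains 1 bit.
docs = [{"war"}, {"war"}, {"film"}, {"film"}]
labels = ["mil", "mil", "ent", "ent"]
print(information_gain(docs, labels, "war"))
```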

2.3 Mutual Information (MI, Mutual Information)

   MI is the concept of information theory used to measure the degree of interdependence between the two
signals in a message. In the field of feature selection, the mutual information between the feature t and the
category reflects the relevance between the characteristics and the category. Feature t which appears high
probability in a category and appears low probability in the other categories will receive higher mutual
information. MI can be expressed as;
  Where each indicator is the same as that of above formula (2).
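In practice the probabilities of Eq. (3) are estimated from document counts; a sketch (ours) using the usual count-based approximation, with A, B, C as defined in Section 2.4:

```python
import math

def mutual_information(A: int, B: int, C: int, N: int) -> float:
    """MI(t, ci) of Eq. (3) in its common count-based estimate:
    MI ~ log( A*N / ((A + C) * (A + B)) ), where A = documents of category ci
    containing t, B = documents outside ci containing t, C = documents of ci
    without t, and N = total documents. A, A+B and A+C must be positive here;
    a real system would smooth the counts."""
    return math.log((A * N) / ((A + C) * (A + B)))

# An independent term and category give MI = log(1) = 0.
print(mutual_information(10, 10, 10, 40))
```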

2.4 χ² Statistic (CHI)

   CHI can be expressed as:

χ²(t, ci) = N (AD − BC)² / [ (A + C)(B + D)(A + B)(C + D) ]                                   (4)

   where A is the number of documents in which feature t and category ci occur together; B is the number of
documents in which feature t appears but which do not belong to category ci; C is the number of documents
belonging to ci in which feature t does not appear; D is the number of documents in which neither feature t nor
category ci occurs; and N is the total number of texts.
   The χ² method holds that the non-independence relationship between feature t and the text category follows
a χ² distribution with one degree of freedom. It rests on the following assumption: high-frequency words,
whether in the specified category or in other categories of text, help to determine whether an article belongs to
category ci.
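Eq. (4) translates directly into code (our sketch, using the contingency counts A, B, C, D defined above):

```python
def chi_square(A: int, B: int, C: int, D: int) -> float:
    """chi^2(t, ci) per Eq. (4): N (AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

print(chi_square(1, 1, 1, 1))    # independent counts give 0.0
print(chi_square(10, 0, 0, 10))  # perfectly dependent counts give 20.0
```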
   The basic idea of the four feature selection methods is the same: compute the statistical measure for every
feature word, set a threshold T, filter out the features whose measure is smaller than T, and keep the rest as the
effective features. Table 1 shows the respective advantages and disadvantages of the four methods.
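The shared thresholding scheme can be written in a few lines (our sketch; the score dictionary and threshold are hypothetical, standing in for the output of any of DF/IG/MI/CHI):

```python
from typing import Dict, Set

def select_features(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep the feature words whose statistical measure reaches the threshold T."""
    return {t for t, s in scores.items() if s >= threshold}

# Hypothetical scores produced by one of the four measures:
scores = {"war": 0.9, "the": 0.1, "film": 0.5}
print(sorted(select_features(scores, 0.5)))  # ['film', 'war']
```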

Table 1. Comparison of the advantages and disadvantages of the feature selection methods

Selection method     Advantage                                      Disadvantage
Document             Low computational complexity; capable of       Does not meet the widely accepted theory of
frequency (DF)       large-scale classification tasks; the          information retrieval: ignores the role of
                     simplest feature selection method.             low-frequency words.
Information gain     Widely used feature selection method in        If a word does not occur, the effectiveness of
(IG)                 machine learning.                              information gain is greatly reduced.
Mutual               Considers low-frequency words, which           Easily leads to over-learning; does not consider
information (MI)     bring some amount of information.              negative correlation; ignores the dispersion and
                                                                    concentration of features, resulting in
                                                                    over-fitting to individual features.
χ² statistic (CHI)   Considers a word's "negative correlation";     Statistically expensive and ineffective for
                     good classification results.                   low-frequency words.

3. Combining the genetic algorithm with traditional text feature selection methods

   The genetic algorithm imitates the biological process of natural selection and evolution; it is a random,
population-based global optimization algorithm. It encodes the parameters of the problem to be solved,
generates an initial group of solutions in the solution space, and gradually evolves toward the global optimum
through the genetic operations of selection, crossover and mutation.
   As a mature method, the genetic algorithm has been discussed extensively in the literature; for further details
please refer to [1, 2, 3]. The experimental idea of this article is: use the traditional text feature selection methods
(DF, IG, MI, CHI) to select the text features, then screen them with the genetic algorithm, and eventually obtain
the feature items suited to text classification.

3.1 The combined algorithm of the genetic algorithm and the text feature selection methods

Input: the set of entries obtained after word segmentation.
Output: the text feature set.
Algorithm description:
[T1]. Use the Chinese Academy of Sciences' word segmentation system to segment the text and obtain the
entry set T;
[T2]. Apply equations (1), (2), (3) and (4) to T to compute the traditional text feature selection measures; the
result is T1;
[T3]. Encode T1's entries as chromosomes for the genetic algorithm: an entry that appears is coded 1 and an
entry that does not appear is coded 0, giving strings over {0, 1};
[T4]. Recompute the weights with the TF-IDF formula:

wi = tfi × log(N / ni)                                                                        (5)

where N is the number of all documents, ni is the number of documents containing term ti, and tfi is the
frequency with which the entry appears in document d.
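Step [T4] can be sketched as follows (assuming the standard TF-IDF form wi = tfi · log(N / ni), which matches the symbol definitions above; the paper's exact variant is not recoverable from the extraction):

```python
import math

def tfidf_weight(tf_i: float, n_i: int, N: int) -> float:
    """wi = tfi * log(N / ni): term frequency scaled by inverse document
    frequency, per the symbol definitions of step [T4]."""
    return tf_i * math.log(N / n_i)

# A term occurring 3 times in its document and found in 1 of 10 documents:
print(tfidf_weight(3, 1, 10))
```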
[T5]. Fitness function:

fit(si) = log [ ( Σi,j=1 (Tik · Tjk) + 1 ) / Σi,j=1 √(Tik² · Tjk²) ]                          (6)

where Tik and Tjk are elements of the vectors Ti and Tj.
[T6]. Selection operator. This paper uses the roulette selection method; its basic idea is that the probability of
each individual being selected is directly proportional to its fitness.
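Step [T6] can be sketched as follows (our illustration; fitness values are assumed non-negative):

```python
import random
from typing import List, Sequence

def roulette_select(population: Sequence, fitnesses: List[float], rng=random):
    """Roulette-wheel selection: spin a pointer on a wheel whose slot sizes are
    proportional to fitness, so selection probability is proportional to fitness."""
    total = sum(fitnesses)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for individual, f in zip(population, fitnesses):
        acc += f
        if acc >= r:
            return individual
    return population[-1]  # guard against floating-point shortfall
```

With all the weight on one individual, the wheel always lands on it; otherwise fitter individuals are simply drawn more often.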
[T7]. Crossover operator. This article adopts the insert crossover method; the specific algorithm is as follows:
1) Randomly select the parent samples and determine the insertion point and the gene fragment;
2) Insert the gene fragment;
3) Delete the duplicate genes.
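One possible reading of steps 1)–3), with chromosomes represented as ordered lists of feature words (our sketch; the paper gives only the outline):

```python
import random

def insert_crossover(parent_a, parent_b, rng=random):
    """Insert crossover per step [T7]: take a gene fragment from one parent,
    insert it at a random point in the other, then delete duplicate genes
    (keeping first occurrences)."""
    i, j = sorted(rng.sample(range(len(parent_b) + 1), 2))
    fragment = parent_b[i:j]                  # 1) gene fragment from one parent
    point = rng.randrange(len(parent_a) + 1)  #    insertion point in the other
    child = parent_a[:point] + fragment + parent_a[point:]  # 2) insert
    seen, dedup = set(), []
    for g in child:                           # 3) delete duplicates
        if g not in seen:
            seen.add(g)
            dedup.append(g)
    return dedup
```

Whatever the random choices, the child contains every gene of the first parent, no duplicates, and nothing outside the two parents.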
[T8]. Mutation operator. Randomly select a chromosome; according to the entry weights, pick a gene (i.e., a
feature word) by the roulette method and delete it; then randomly select from the glossary a gene that is not yet
in the chromosome and put it at that location, thus forming a new member of the next generation.
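Step [T8] can be sketched as follows (our illustration; chromosomes are lists of feature words and `weights` maps each word to its TF-IDF weight):

```python
import random

def mutate(chromosome, glossary, weights, rng=random):
    """Mutation per step [T8]: choose a gene by roulette over the entry
    weights, delete it, and put in its place a random glossary word that is
    not yet in the chromosome."""
    total = sum(weights[g] for g in chromosome)
    r = rng.uniform(0.0, total)
    acc, idx = 0.0, len(chromosome) - 1
    for k, g in enumerate(chromosome):        # roulette pick of the gene to drop
        acc += weights[g]
        if acc >= r:
            idx = k
            break
    candidates = [w for w in glossary if w not in chromosome]
    child = list(chromosome)
    if candidates:
        child[idx] = rng.choice(candidates)   # replacement at that location
    return child
```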

4. Experimental Results and Conclusion
   The test used 8,200 articles divided into five categories: politics, military affairs, entertainment, education
and livelihood. 3,500 of the articles were used as the training set and the rest as the test set.
   Document frequency (DF), information gain (IG), mutual information (MI) and the χ² statistic (CHI) were
tested under the Rainbow system. The genetic algorithm was implemented as a VC++ program. Average
accuracy, which is also the evaluation measure of the Rainbow system, was used as the evaluation indicator; it
is the mean correct-classification rate over repeated experiments on the given training set. The experimental
results are shown in Table 2.

Table 2. Result comparison of the feature selection methods

                         Feature Selection Method                  Average Accuracy (%)
                         Document Frequency (DF)                   33.58
                         Information Gain (IG)                     85.36
                         Mutual Information (MI)                   31.25
                         χ² Statistic (CHI)                        52.14
                         Genetic Algorithm (GA)                    78.28
                         DF+GA                                     41.54
                         IG+GA                                     88.65
                         MI+GA                                     42.65
                         CHI+GA                                    65.32
   For the identification of text features, the three most common evaluation indicators were applied in the test:
precision (P), recall rate (R) and the aggregative indicator value (F). Their definitions are as follows:
(1) Feature identification precision, the ratio of the identified character strings that are indeed features:

        the number of feature words which are correctly identified
P = -------------------------------------------------------------- × 100%
       the total number of feature words determined by the system

(2) Feature identification recall rate, the ratio of the text feature words that are identified:

        the number of feature words which are correctly identified
R = -------------------------------------------------------------- × 100%
           the total number of feature words in the corpus

(3) Aggregative indicator value F:

        2 × P × R
F = -----------------
          P + R
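The three indicators can be computed directly from the identified and reference feature-word sets (our sketch; the two sets are hypothetical):

```python
from typing import Set, Tuple

def feature_metrics(identified: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall rate and aggregative indicator F (in percent),
    following the three definitions above."""
    correct = len(identified & gold)
    p = 100.0 * correct / len(identified) if identified else 0.0
    r = 100.0 * correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# 2 of 3 identified words are correct (P ~ 66.7); 2 of 4 gold words found (R = 50).
p, r, f = feature_metrics({"war", "army", "the"}, {"war", "army", "film", "star"})
print(p, r, f)
```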
   The system test, implemented as a VC++ program, used the same corpus as above. The results are shown in
Table 3.

Table 3. Comparison of experiment results

              Feature Selection Algorithm                  Precision            Recall Rate               Aggregative
                                                                                                          Indicator F
              Document Frequency (DF)                      78.12                69.36                     73.48
              Information Gain (IG)                        88.54                82.98                     85.67
              Mutual Information (MI)                      82.11                79.41                     80.74
              χ² Statistic (CHI)                           87.32                81.84                     84.49
              Genetic Algorithm (GA)                       80.39                80.12                     80.25
              DF+GA                                        79.68                75.15                     77.35
              IG+GA                                        89.94                88.57                     89.25
              MI+GA                                        84.28                83.35                     83.81
              CHI+GA                                       88.64                83.51                     86.00
   The experiments show that, on the evaluation indicators, combining the genetic algorithm with a feature
selection method significantly improves the filtering process compared with using either the feature selection
method or the genetic algorithm alone. However, the combination costs a large amount of time; this is a point to
be improved in future research.
   Among the four combinations of selection methods and the genetic algorithm, IG+GA is the most effective,
followed by CHI+GA; MI+GA is the poorest; DF+GA runs the fastest.


Acknowledgments
   This work is supported by the Humanity and Science Foundation of the Ministry of Education under Grant
No. 09YJC870001, and by the Natural Science Foundation of the Yunnan Education Department of China under
Grant No.

References
[1] Zhaoqi Bian, Xuegong Zhang. Pattern Recognition [M]. Beijing: Tsinghua University Press, 2000.
[2] Jianchao Xu, Ming Hu. Chinese Web text characteristics of acquisition and classification [J]. Computer Engineering, 2005, 31(8): 24-26.
[3] Miettinen K, Neittaanmaki P, Makela M M. Evolutionary Algorithms in Engineering and Computer Science [M]. New York: Wiley, 1999.
[4] Guoliang Chen, Xufa Wang, Zhenquan Zhuang, et al. Genetic Algorithm and Its Application [M]. Beijing: People's Posts and Telecommunications Press, 1996.
[5] McCallum, Andrew Kachites. "Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering" [DB/OL]. 1996 (2003-02-08).


   Yuping Fang, born 1977, female, Han, from Dali, Yunnan; Master's degree, lecturer. Main research
directions: natural language processing, text mining, etc. Email:
