									                                                              (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                 Vol. 9, No. 1, 2011

An Unsupervised Feature Selection Method Based on Genetic Algorithm

Nasrin Sheikhi, Amirmasoud Rahmani
Department of Computer Engineering
Islamic Azad University of Iran, Research and Science Branch
Ahvaz, Iran

Reza Veisisheikhrobat
National Iranian South Oil Company (NISOC)
Ahvaz, Iran

Abstract- In this paper we describe a new unsupervised feature selection method for text clustering. The method introduces a new kind of feature that we call a multi term feature: a combination of terms whose length may vary. We design a genetic algorithm to find the multi term features that have maximum discriminating power.

Keywords-multi term feature; discriminating power; genetic algorithm; fitness function

                       I.    INTRODUCTION
     Reducing the dimensionality of a problem is, in many real-world problems, an essential step before any analysis of the data. The general criterion for reducing the dimensionality is the desire to preserve most of the relevant information of the original data according to some optimality criteria. Dimensionality reduction, or feature selection, has been an active research area in the pattern recognition, statistics and data mining communities. The main idea of feature selection is to choose a subset of the input features by eliminating features with little or no predictive information. In particular, feature selection removes irrelevant features, increases the efficiency of learning tasks, improves learning performance and enhances the comprehensibility of learned results [2].
     Depending on whether class label information is required, feature selection is either supervised or unsupervised.
     Feature selection has been well studied in supervised classification [3]. For clustering analysis, however, it is a fairly recent research topic and a challenging problem, for two reasons. First, it is not easy to define a good criterion for evaluating the quality of a candidate feature subset, because accurate labels of the items are absent. Second, optimizing the defined criterion requires an exponentially increasing number of feature subset evaluations, which is impractical if the data set has a large number of features.
     Some methods for unsupervised feature selection have been proposed in the literature, such as document frequency (DF), term contribution (TC), Term Variance Quality (TVQ), Term Variance (TV), and others. In most of these methods a criterion is defined to evaluate the relevance of a single term for clustering, and, depending on how much dimensionality reduction is required, a number of the most relevant features is selected.
     In this paper we propose a novel feature selection method that evaluates the discriminating power of sets of terms, instead of raw terms, as features.
     The main idea of this method is that a feature that is irrelevant by itself may become relevant when used with other features. We therefore describe a new kind of feature named Multi Term Feature (MTF), which is a feature made from a combination of terms.
     We use a genetic algorithm to search the large space of different multi term features and find the most relevant of them. To achieve this goal we designed a fitness function that estimates the discriminating power of MTFs.
     The rest of this paper is organized as follows: the next section describes two methods for evaluating the relevance of MTFs. Section III explains how the genetic algorithm is used to find the best MTFs. Experimental results are presented in Section IV, and a conclusion is given in Section V.

                  II.   EVALUATE RELEVANCE OF MULTI TERM FEATURES
     Because in many cases a single term cannot determine the subject of a document very well, we use MTFs to find the terms that best determine the clusters of documents. We must therefore define criteria for evaluating the relevance of MTFs. First, we must determine when an MTF appears in a document.
     We define an appearance threshold for this purpose: the minimum number of terms of the MTF that must appear in a document for the MTF itself to be considered to appear in that document.
     The two criteria that we define for evaluating the discriminating power of MTFs are as follows:

A. Modified Term Variance
     Term variance is one of the methods used to evaluate the quality of a term in a dataset for clustering documents. It is defined as follows:

$$v(t_i) = \sum_{j=1}^{N} \left[ f_{ij} - \bar{f}_i \right]^2 \tag{1}$$

     Here $f_{ij}$ is the frequency of term $t_i$ in document $j$, $\bar{f}_i$ is the average frequency of $t_i$ over all documents, and $N$ is the number of documents. In this method, terms that have a high frequency but are not uniformly distributed over the documents have a high TV value. We modified the TV method so that it can be used with MTFs, as given in equation (2).
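As a brief illustration of equation (1), the following Python sketch computes the term variance of a few terms and ranks them. The toy frequency table and the names `term_variance`, `term_doc`, `scores` and `ranked` are hypothetical and serve only to show the behaviour described above: a frequent but unevenly distributed term scores high, while a uniformly distributed term scores zero.

```python
from statistics import mean


def term_variance(freqs):
    """Equation (1): sum over documents of the squared deviation of a
    term's frequency from its mean frequency."""
    avg = mean(freqs)
    return sum((f - avg) ** 2 for f in freqs)


# Hypothetical toy data: one list of per-document frequencies per term.
term_doc = {
    "oil": [5, 0, 0, 7],   # frequent and unevenly distributed
    "the": [3, 3, 3, 3],   # frequent but uniformly distributed
    "rare": [0, 1, 0, 0],  # infrequent
}

scores = {term: term_variance(freqs) for term, freqs in term_doc.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```
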

                                                                                                     ISSN 1947-5500
     The modified term variance of an MTF is defined as follows:

$$v(MTF_i, th) = \sum_{j=1}^{N} \left[ vf_{ij,th} - \overline{vf}_{i,th} \right]^2 \tag{2}$$

     In this relation $vf_{ij,th}$ is the frequency of the ith MTF in document j with appearance threshold th, and $\overline{vf}_{i,th}$ is the average frequency of the ith MTF over all documents. The frequency of an MTF is measured by the following equation:

                                                                      (3)

     This relation uses the following quantities: the kth term of the MTF; m, the number of terms of the MTF; the jth document of the dataset; the number of different terms of the MTF that appear in that document; and length, the length of the MTF. It also uses a logical function that returns TRUE if the document contains the MTF and FALSE otherwise.
     $\overline{vf}_{i,th}$ is measured as follows:

$$\overline{vf}_{i,th} = \frac{1}{N} \sum_{j=1}^{N} vf_{ij,th} \tag{4}$$

B. Dependency Between Terms
     Another criterion that we define to evaluate the relevance of MTFs is the dependency between the terms of an MTF, which is measured by this equation:

                                                                      (5)

     Our goal is to find MTFs that have high discriminating power, so we look for MTFs whose terms belong to the same subject and, most of the time, appear in the documents of that subject.
     The dependency between the terms of an MTF is the ratio of the sum of the MTF's frequency over all documents to the sum of the frequencies of the MTF's terms. This value shows whether the terms of the MTF mostly appear together in documents or separately.

               III.   USING GENETIC ALGORITHM
     As already mentioned, our goal is to find the best MTFs for determining the clusters of documents. Because of the large number of MTFs that can be extracted from documents, a search algorithm that can handle a very large search space is needed.
     A genetic algorithm is well suited to finding good solutions among the large number of candidate solutions in the search space of a problem. We therefore use a genetic algorithm as our search strategy to find the most discriminating MTFs in the search space.

A. Chromosomes
     Each chromosome in this method is an MTF and can have a different length. Each gene of the chromosome is a term of the MTF, so the chromosome is represented as a set of terms rather than a binary code.

B. Initial population
     The initial population is a set of a specified number of chromosomes; as stated in the algorithm of Section III.E, each is an MTF made of terms selected at random from the documents.

C. Crossover and Mutation
     The genetic algorithm generates new solutions by recombining the genes of the current best solutions. This is accomplished through the crossover and mutation operators. In a one-point crossover, the crossing point is selected at random and the genes on one side of it are exchanged between the chromosomes. In our model the chromosomes have different lengths, so the crossover method differs as well.
     In this method a crossing point is selected at random on each of the two parent chromosomes, and then one side of the chromosomes is exchanged. The two chromosomes that result from this kind of crossover therefore do not necessarily have equal lengths.
     The mutation operator selects one gene position in the chromosome at random and replaces that gene with a term selected at random from the documents.

D. Fitness function
     The objective function is the cornerstone of the genetic process. We designed the following fitness function to explore the space of solutions:

                                                                      (6)

     For the ith chromosome, this function combines the following quantities: the fitness value of the chromosome, its modified term variance value for a given appearance threshold, the value of the dependency between its terms, and its length.
     The modified term variance and the dependency between the terms of an MTF are described in Section II. The remaining part of the fitness function is the length of the chromosome.

     During the genetic process the length of the chromosomes increases because crossover is applied to the population. As the length increases, the probability that the chromosome is present in a document decreases, and as a result so do its modified term variance and dependency values. The fitness value of longer chromosomes therefore decreases, so the algorithm tends to select shorter chromosomes and drifts back toward the usual single-term methods.
     By adding the chromosome length to the fitness function, we give longer chromosomes a better chance of being selected as relevant chromosomes.
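The two criteria that feed the fitness function can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the appearance test follows the appearance threshold defined in Section II, but the exact MTF frequency of equation (3) is not available here, so `mtf_frequency` uses a hypothetical stand-in (the summed counts of the MTF's terms in a document where the MTF appears, and 0 otherwise). The function names and the toy documents are likewise hypothetical.

```python
from collections import Counter


def mtf_present(mtf, doc_counts, th):
    """Appearance threshold: the MTF appears in a document when at least
    `th` of its distinct terms occur in that document."""
    return sum(1 for t in set(mtf) if doc_counts[t] > 0) >= th


def mtf_frequency(mtf, doc_counts, th):
    # ASSUMPTION: stand-in for equation (3), which is not reproduced here.
    if not mtf_present(mtf, doc_counts, th):
        return 0
    return sum(doc_counts[t] for t in set(mtf))


def modified_term_variance(mtf, docs_counts, th):
    """Equation (2): squared deviation of the MTF frequency from its mean,
    summed over all documents."""
    freqs = [mtf_frequency(mtf, d, th) for d in docs_counts]
    avg = sum(freqs) / len(freqs)
    return sum((f - avg) ** 2 for f in freqs)


def dependency(mtf, docs_counts, th):
    """Ratio of the summed MTF frequency to the summed frequencies of the
    MTF's individual terms, over all documents."""
    together = sum(mtf_frequency(mtf, d, th) for d in docs_counts)
    total = sum(d[t] for d in docs_counts for t in set(mtf))
    return together / total if total else 0.0


# Hypothetical toy corpus of tokenized documents.
docs = [
    "oil price crude oil".split(),
    "crude oil barrel".split(),
    "wheat price harvest".split(),
]
docs_counts = [Counter(d) for d in docs]

print(dependency(("oil", "crude"), docs_counts, th=2))
print(dependency(("oil", "price"), docs_counts, th=2))
```

With this stand-in, terms that always occur together ("oil", "crude") obtain a dependency of 1, while terms that mostly occur apart ("oil", "price") obtain a lower value, which matches the verbal description of the dependency criterion.
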
Figure 1. Precision comparison on reut2-001 (AA). Horizontal axis: number of selected terms (1%, 5%, 10%, 20%, 25%, 30%); series: GA and TV.

E. The proposed algorithm
    The designed algorithm is as follows:
       1. An initial set of solutions is established at random. This population contains chromosomes that are MTFs made of terms selected at random from the documents.
       2. The fitness value of each chromosome is computed with the fitness function, and the stopping criteria are tested. As a general criterion, the genetic process is stopped when the maximum fitness does not increase over a few iterations.
       3. Selection, mutation and crossover are applied to the population.
       4. The new population is generated and the iterative process loops back to step 2.
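The four steps above can be sketched as a small Python loop. Several details are assumptions made for illustration and are not specified in the paper: the selection scheme (the better half of the population is kept), the parameter names and defaults, the removal of repeated terms inside a chromosome, the safety cap `max_generations`, and the toy fitness function, which merely stands in for equation (6).

```python
import random


def unique(terms):
    """Drop repeated terms, keeping order (a chromosome is a set of terms)."""
    return list(dict.fromkeys(terms))


def crossover(p1, p2, rng):
    """Variable-length crossover: an independent random cut point on each
    parent, tails exchanged, so the children may differ in length."""
    c1 = rng.randint(1, len(p1))
    c2 = rng.randint(1, len(p2))
    return unique(p1[:c1] + p2[c2:]), unique(p2[:c2] + p1[c1:])


def mutate(chrom, vocabulary, rng):
    """Replace one randomly chosen gene with a random term."""
    child = list(chrom)
    child[rng.randrange(len(child))] = rng.choice(vocabulary)
    return unique(child)


def run_ga(vocabulary, fitness, pop_size=20, max_len=3, patience=5,
           mutation_rate=0.1, max_generations=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of MTFs.
    population = [rng.sample(vocabulary, rng.randint(1, max_len))
                  for _ in range(pop_size)]
    best, best_fit, stale = None, float("-inf"), 0
    for _ in range(max_generations):
        # Step 2: evaluate fitness and test the stopping criterion.
        scored = sorted(population, key=fitness, reverse=True)
        top_fit = fitness(scored[0])
        if top_fit > best_fit:
            best, best_fit, stale = scored[0], top_fit, 0
        else:
            stale += 1
        if stale >= patience:
            break
        # Step 3: selection (ASSUMPTION: keep the better half), then
        # crossover and mutation.
        parents = scored[: pop_size // 2]
        population = list(parents)
        while len(population) < pop_size:
            a, b = rng.sample(parents, 2)
            for child in crossover(a, b, rng):
                if rng.random() < mutation_rate:
                    child = mutate(child, vocabulary, rng)
                population.append(child)
        # Step 4: the new population replaces the old one.
        population = population[:pop_size]
    return best, best_fit


# Hypothetical vocabulary and toy fitness, for demonstration only.
vocabulary = ["oil", "crude", "barrel", "wheat", "price", "harvest"]
target = {"oil", "crude"}


def toy_fitness(chrom):
    return len(set(chrom) & target) - len(set(chrom) - target)


best, best_fit = run_ga(vocabulary, toy_fitness)
print(best, best_fit)
```

Choosing each cut point between 1 and the parent's length keeps the head of every child non-empty, so no chromosome degenerates into an empty MTF.
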

                 IV.   EXPERIMENTAL RESULTS
    The following experiments compare the proposed genetic model with the term variance method.
    We chose three datasets from Reuters-21578, each of which has 1000 documents.
    We chose K-means as the clustering algorithm. Since K-means is easily influenced by the selection of the initial centroids, we randomly produced 10 sets of initial centroids for each dataset and averaged the performance over the 10 runs as the final clustering performance.
    We use Average Accuracy (AA) and F1-Measure (F1), as defined in [3], as clustering validity criteria for evaluating the accuracy of the clustering results. The results on the reut2-001, reut2-002 and reut2-003 datasets are shown in Fig. 1 to Fig. 6.
    From these figures, we can see that the proposed algorithm improves the clustering accuracy in most of the experiments.

Figure 2. Precision comparison on reut2-001 (F1). Vertical axis: accuracy; horizontal axis: number of selected terms (1%, 5%, 10%, 20%, 25%, 30%); series: GA and TV.

Figure 3. Precision comparison on reut2-002 (AA). Horizontal axis: number of selected terms (1%, 5%, 10%, 20%, 25%, 30%); series: GA and TV.
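The horizontal axis of the figures is the percentage of selected terms (1% to 30%). Keeping the top-scoring percentage of features can be sketched in Python as below. The function name `select_top_percent`, the rounding rule (round up, keep at least one feature) and the toy scores are assumptions made for illustration; the scores could come from TV or from the fitness of the MTFs.

```python
def select_top_percent(scores, percent):
    """Return the names of the top `percent` percent of features,
    ranked by descending score."""
    n = len(scores)
    # Integer ceiling of n * percent / 100, with at least one feature kept.
    k = max(1, (n * percent + 99) // 100)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical relevance scores for ten features.
scores = {f"t{i}": float(i) for i in range(10)}

for percent in (1, 5, 10, 20, 25, 30):
    print(percent, select_top_percent(scores, percent))
```
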

Figure 4. Precision comparison on reut2-002 (F1). Horizontal axis: number of selected terms (1%, 5%, 10%, 20%, 25%, 30%).

Figure 5. Precision comparison on reut2-003 (AA). Horizontal axis: number of selected terms (1%, 5%, 10%, 20%, 25%, 30%); series: GA and TV.

Figure 6. Precision comparison on reut2-003 (F1). Horizontal axis: number of selected terms (1%, 5%, 10%, 20%, 25%, 30%); series: GA and TV.

                        V.      CONCLUSION
    In this paper we described a new feature selection method based on a genetic algorithm. We use a new kind of feature, which we call an MTF and which is a set of terms, and we define criteria for evaluating the relevance of these features. The experimental results show that the proposed method can improve the accuracy of clustering.

    The authors thank the National Iranian Oil Company (NIOC) and the National Iranian South Oil Company (NISOC) for their help and financial support.

                        REFERENCES
[1]    Sullivan, D., Document warehousing and text mining, John Wiley, New York, 2001.
[2]    Miller, T., Data and text mining: a business applications approach, Prentice Hall, New York, 2005.
[3]    Liu, L. and Kang, J. and Yu, J. and Wang, Z., “A comparative study on unsupervised feature selection methods for text clustering”, Proceeding of NLP-KE, Vol. 9, pp. 597-601, 2005.
[4]    Aliane, H., “An ontology based approach to multilingual information retrieval”, IEEE Information and Communication Technologies, Vol. 1, pp. 1732-1737, 2006.
[5]    Yu Lee, L. and Soo, V., “Ontology-based information retrieval and extraction”, International Conference on Information Technology: Research and Education, Vol. 8, pp. 265-269, 2005.
[6]    Nayyeri, A. and Oroumchian, F., “FuFaIR: a fuzzy farsi information retrieval system”, IEEE International Conference on Computer Systems and Applications, Vol. 3, pp. 1126-1130, 2006.
[7]    Desjardins, G. and Proulx, R. and Godin, R., “An auto-associative neural network for information retrieval”, IEEE International Joint Conference on Neural Networks, Vol. 9, pp. 3492-3498, 2006.
[8]    Tian, Q., “A foundational perspective for visual information retrieval”, Multimedia IEEE, Vol. 13, pp. 90-92, 2006.
[9]    Brunner, J. and Naudet, Y. and Latour, T., “Information retrieval in multimedia: exploiting MPEG-7 metadata by the use of ontologies and fuzzy thematic spaces”, Proceedings of the Sixth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA’05), Vol. 7, pp. 1-6, 2005.
[10]   Dong, A. and Li, H., “Multi-ontology based multimedia annotation for domain-specific information retrieval”, Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (SUTC’06), Vol. 9, pp. 1-8, 2006.
[11]   Heo, S. and Motoyuki, S. and Ito, A. and Makino, S., “An effective music information retrieval method using three-dimensional continuous DP”, IEEE Transactions on Multimedia, Vol. 8, No. 3, pp. 633-639, 2006.
[12]   Chang, C. and Kayed, M., “A survey of web information extraction systems”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 10, pp. 1411-1428, 2006.
[13]   Kim, J. and Moldovan, D., “Acquisition of linguistic patterns for knowledge-based information extraction”, IEEE Transactions on Knowledge and Data Engineering, Vol. 7, No. 5, pp. 713-724, 1995.
[14]   Ramshaw, A. and Weischeldel, M., “Information extraction”, IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 7, pp. 969-972, 2005.
[15]   Lam, M. and Gong, Z., “Web information extraction”, Proceedings of the 2005 IEEE International Conference on Information Acquisition, Vol. 1, pp. 569-601, 2005.
[16]   Yang, S. and Wu, X. and Deng, Z. and Zhang, M. and Yang, D., “Relative term-frequency based feature selection for text categorization”, Proceeding of the IEEE first international Conference on Machine Learning and Cybernetics, Vol. 4, pp. 1432-1436, 2002.
[17]   Yang, Y. and Pedersen, J., “A comparative study on feature selection in text categorization”, Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 412-420, 1997.
[18]   Prabowo, R. and Thelwall, M., “A comparison of feature selection methods for an evolving RSS feed corpus”, Information Processing and Management, Vol. 42, pp. 1491-1512, 2006.
[19]   How, B. and Narayanan, K., “An empirical study of feature selection for text categorization based on Term weightage”, Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI’04), Vol. 2, pp. 1-4, 2004.
[20]   Li, S. and Zong, C., “A new approach to feature selection for text categorization”, IEEE Proceeding of NLP-KE'05, Vol. 9, pp. 626-630,
[21]   Mitchell, T., Machine Learning, McGraw-Hill, Washington, 1997.
[22]   Dong, Y. and Han, K., “A comparison of several ensemble methods for text categorization”, Proceedings of the 2004 IEEE International Conference on Services Computing (SCC’04), Vol. 4, pp. 1-4, 2004.
[23]   Soucy, P. and Mineau, G., “A simple KNN algorithm for text categorization”, Vol. 8, pp. 647-648, 2001.
[24]   Namburu, S. and Tu, H. and Luo, J. and Pattipati, R., “Experiments on supervised learning algorithms for text categorization”, IEEE Aerospace Conference, pp. 1-8, 2005.
[25]   Goldberg, J.L., “CDM: An approach to learning in text categorization”, Proceedings of Seventh International Conference on Tools with Artificial Intelligence, Vol. 9, pp. 258-265, 1995.
