

                        DATA MINING

                                         Department of Computer science
                                          Bellary Engineering College
                                         Bellary, Karnataka State, India

                                         Department of Computer science
                                  Sri Jayachamarajendra College of Engineering
                                         Mysore, Karnataka State, India

               In recent years many applications of data mining deal with high-dimensional data
               (a very large number of features), which imposes a high computational cost as well
               as the risk of "overfitting". In such cases it is common practice to adopt a feature
               selection method to improve the generalization accuracy, and feature selection has
               become a focus of research in data mining wherever high-dimensional data arise.
               We propose in this paper a novel feature selection method based on a two-stage
               analysis of the Fisher Ratio and Mutual Information. The two-stage analysis is
               carried out in the feature domain to reject the noisy feature indexes and to select
               the most informative combination from the remaining ones. In the approach, we
               develop two practical solutions that avoid the difficulties of using high-dimensional
               Mutual Information in the application: the clustering of feature indexes using the
               cross Mutual Information, and the estimation of the latter based on conditional
               empirical PDFs. The effectiveness of the proposed method is evaluated with an
               SVM classifier on datasets from the UCI Machine Learning Repository.
               Experimental results show that the proposed method is superior to some classical
               feature selection methods and achieves higher prediction accuracy with a small
               number of features. The results are highly promising.

               Keywords: Pattern recognition, feature selection, data mining, Fisher ratio, mutual information

1   INTRODUCTION

     Feature selection, the process of choosing a subset of features from the original ones, is frequently used as a preprocessing technique in data mining [6, 7]. It has proven effective in reducing dimensionality, improving mining efficiency, increasing mining accuracy, and enhancing result comprehensibility [4, 5]. Feature selection methods broadly fall into the wrapper model and the filter model [5]. The wrapper model uses the predictive accuracy of a predetermined mining algorithm to determine the goodness of a selected subset; it is computationally expensive for data with a large number of features [4, 5]. The filter model separates feature selection from classifier learning and relies on general characteristics of the training data to select feature subsets that are independent of any mining algorithm. By reducing the number of features, one can both reduce overfitting of learning methods and increase the computation speed of prediction (Guyon and Elisseeff, 2003) [1]. We focus in this paper on the selection of a few features among many in a context of classification. Our main interest is to design an efficient filter, both from a statistical and from a computational point of view.
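To make the filter idea concrete, a univariate ranking of the kind described above can be sketched as follows. This is an illustrative sketch, not the implementation used in this paper; the toy data and the two-class Fisher-score form (µ1 − µ2)² / (σ1² + σ2²) are assumptions chosen for the example.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score for a two-class problem:
    (mu1 - mu2)^2 / (var1 + var2), computed column by column."""
    X1, X2 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return num / den

rng = np.random.default_rng(0)
# Toy data: feature 0 separates the classes, feature 1 is pure noise.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([5.0, 0.0], 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
scores = fisher_scores(X, y)
ranking = np.argsort(scores)[::-1]   # best feature first
```

Keeping the top entries of `ranking` gives the filter's feature subset; as noted above, such a univariate ranking ignores dependencies between features.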

                     Ubiquitous Computing and Communication Journal                                            1
    The most standard filters rank features according to their individual predictive power, which can be estimated by various means such as the Fisher score (Furey et al., 2000), the Kolmogorov-Smirnov test, Pearson correlation (Miyahara and Pazzani, 2000), or mutual information (Battiti, 1994; Bonnlander and Weigend, 1996; Torkkola, 2003) [3]. Selection based on such a ranking does not ensure weak dependency among the features and can lead to redundant, and thus less informative, selected families. To catch dependencies between features, a criterion based on decision trees has recently been proposed (Ratanamahatana and Gunopulos, 2003) [2]. Features which appear in binary trees built with the standard C4.5 algorithm are likely to be either individually informative (those at the top) or conditionally informative (deeper in the trees). The drawbacks of such a method are its computational cost and its sensitivity to overfitting.
    The combination of the Fisher Ratio and Mutual Information is the principal merit of the proposed method. From a theoretical point of view, the Fisher Ratio analysis is able to select the most discriminative (i.e., least noisy) feature components, but it cannot provide the "best" combination of the selected components because there may be cross correlations among them. On the other hand, Mutual Information without the Fisher Ratio analysis would be able to provide maximum independence between the feature components, but it might mistakenly select noisy components which degrade the system. In this work we show that the combination of the Fisher Ratio and Mutual Information greatly improves the performance of the classifier.
    In the approach, we develop two flexible and practical solutions that avoid the difficulties of applying high-dimensional Mutual Information. First, we use the two-dimensional Mutual Information as a distance measurement to cluster the pre-selected components into groups; the best subset is then chosen by picking the component with the best Fisher Ratio score in each group. Second, the two-dimensional Mutual Information itself is estimated from conditional empirical PDFs.
    The remainder of this paper is organized as follows. In Section 2 we briefly describe the theory of Fisher Ratio analysis and Mutual Information maximization, the proposed feature selection method, and the Support Vector Machine (SVM) classifier. Section 3 contains experimental results. Section 4 concludes this work.

2   FEATURE SELECTION BASED ON FISHER RATIO AND MUTUAL INFORMATION

    In this section we discuss the following components of the proposed method: 1) the pre-selection of feature components based on their Fisher Ratio (FR) scores, 2) the final selection by Mutual Information, and 3) the SVM classifier.

2.1 Fisher Ratio Analysis
    Suppose that the two classes under investigation have, on a feature component domain, means µ1, µ2 and variances Σ1, Σ2 respectively. The Fisher Ratio is defined as the ratio of the between-class variance to the within-class variance,

        FR = (µ1 − µ2)² / (Σ1 + Σ2).                                            (1)

The maximum of class separation (discriminative level) is obtained when FR is maximized, i.e. for the components

        f* = arg max_f FR(f).                                                   (2)

2.2 Feature Selection by Mutual Information Maximization
    Mutual Information maximization is a natural idea for subset selection, since it can provide a maximum-information combination of the pre-selected components. However, a big problem is that the estimation of high-dimensional mutual information requires a very large number of observations to be accurate, and this is often not available.
    To solve this problem we develop a flexible solution using only the two-dimensional cross Mutual Information. This measurement is used as a distance to cluster the feature components into groups; the best component of each group is then selected by means of the Fisher Ratio.

2.2.1 Clustering the Pre-Selected Components for the Selection
    The Mutual Information based clustering is an iterative procedure, like vector quantization, and might therefore be sensitive to the initial setting. However, taking into account the fact that the largest contribution is expected to come from the best pre-selected component, the initialization is set as follows.
1. Fix the feature component with the best Fisher Ratio score; calculate the cross Mutual Information from every other component to it; sort the estimated sequence;
2. Set the initial cluster centers uniformly from the sorted sequence;

3. Classify the components according to the lowest cross Mutual Information to the centers;
4. Recompute the center of each group;
5. Repeat (3) and (4) until the centers do not change;
6. Select in each group the component with the highest FR score.

2.2.2 Two-Dimensional Mutual Information Estimation
    We now turn our attention to the estimation of the cross Mutual Information. The conventional method estimates the two-dimensional Mutual Information through the marginal and joint histograms,

        I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]                 (3)

where x, y denote observations of the random variables X and Y; in our case, X and Y are a pair of feature components. The typical problems of the estimation (3) are its complexity and the possible presence of null bins in the conventional joint histogram. In this work we develop an estimation method using the empirical conditional distributions. We first rewrite (3) as

        I(X; Y) = Σ_y p(y) Σ_x p(x|y) log [ p(x|y) / p(x) ].                    (4)

Eq. (4) can be simplified by clustering the y into M clusters and estimating the conditional densities in each:

        I(X; Y) ≈ Σ_i (k_i / N) Σ_x p(x|i) log [ p(x|i) / p(x) ]

where i is the cluster index and k_i is the number of observations in cluster i. For clustering y we apply a fast method based on order statistics. We briefly describe this idea as follows. Given an observed sequence Y = {y1, y2, …, yN} and a number M, a set of M order statistics is defined as

        {y(n_1), y(n_2), …, y(n_M)},    n_i = ⌈iN/M⌉,                           (5)

where y(1) ≤ y(2) ≤ … ≤ y(N) is the sorted sequence of Y and i = 1, 2, …, M. The clustering of the sequence Y into group number i is given by matching each sample to the inequality

        y(n_{i-1}) < y ≤ y(n_i).

The conditional density for each group, p(y|i), is estimated by a "piecewise" density function using the order statistics, calculated from the clustered samples in each group. For example, the "piecewise" pdf of the sequence Y with the order statistics in (5) is

        p(y) = 1 / [ M (y(n_i) − y(n_{i-1})) ],    y(n_{i-1}) < y ≤ y(n_i).     (6)

The selection of the parameter M is determined by a bias-variance tradeoff; typically, M = √N.

2.3 Support Vector Machine (SVM) Classifier
    The Support Vector Machine (SVM) is based on the statistical learning theory developed by Vapnik [9, 10]. It has been used extensively for the classification of data. SVM was originally designed for binary classification, and how to effectively extend it to multi-class classification is still an ongoing research issue [11]. The most common way to build a k-class SVM is by constructing and combining several binary classifiers [12]. The representative ensemble schemes are One-Against-All and One-Versus-One. In One-Against-All, k binary classifiers are trained, each of which separates one class from the other k−1 classes; given a test sample X, the binary classifier with the largest output determines the class label of X. One-Versus-One constructs k(k−1)/2 binary classifiers whose outputs are aggregated to make the final decision. The decision tree formulation is a variant of One-Against-All based on a decision tree. The error-correcting output code is a general representation of the One-Against-All and One-Versus-One formulations, which uses error-correcting codes for encoding the outputs [8]. The One-Against-All approach, in combination with SVM, provides better classification accuracy than the others [11]; consequently we applied the One-Against-All approach in our experiments.

3   EXPERIMENTAL RESULTS

    Experiments are carried out on well-known data sets from the UCI Machine Learning Repository. In the experiments, the original partition of the datasets into training and test sets is used whenever information about the data split is available; in the absence of a separate test set, 10-fold cross validation is used for calculating the classification accuracy. The multi-class SVM classifier (one-against-all) is implemented using MATLAB. The kernel chosen for the multi-class SVM classifier is the RBF kernel, K(x_i, x_j) = exp(−γ ||x_i − x_j||²). The kernel control parameter γ is taken as 0.01 and the regularization parameter C is fixed at 100.
    In this section we address the performance of our proposed approach in terms of classification accuracy. We compare two feature selection methods (the proposed method and MIM). Features selected by Mutual Information Maximization (MIM) are not ensured to be weakly dependent, and this can lead to redundant and poorly informative families of
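Steps 1–6 above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the cross Mutual Information is estimated with a plain joint histogram, the components are clustered on their one-dimensional MI-to-best values rather than on a full pairwise MI distance, and the bin count, toy data, and hypothetical FR scores are assumptions.

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Histogram estimate of I(X;Y): sum over bins of p(x,y) log[p(x,y)/(p(x)p(y))]."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal of X
    py = pxy.sum(axis=0, keepdims=True)      # marginal of Y
    mask = pxy > 0                           # skip null bins
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def select_components(F, fr_scores, n_groups, iters=50):
    """Steps 1-6: group the columns of F by their cross MI to the best-FR
    component (a 1-D simplification), then keep the best-FR member per group."""
    best = int(np.argmax(fr_scores))                                          # step 1
    mi = np.array([mutual_info(F[:, best], F[:, j]) for j in range(F.shape[1])])
    order = np.argsort(mi)
    centers = mi[order[np.linspace(0, len(order) - 1, n_groups, dtype=int)]]  # step 2
    for _ in range(iters):
        labels = np.argmin(np.abs(mi[:, None] - centers[None, :]), axis=1)    # step 3
        new = np.array([mi[labels == g].mean() if np.any(labels == g) else centers[g]
                        for g in range(n_groups)])                            # step 4
        if np.allclose(new, centers):                                         # step 5
            break
        centers = new
    return [int(max(np.flatnonzero(labels == g), key=lambda j: fr_scores[j]))
            for g in range(n_groups) if np.any(labels == g)]                  # step 6

rng = np.random.default_rng(1)
F = rng.normal(size=(500, 5))
F[:, 1] = F[:, 0] + 0.1 * rng.normal(size=500)   # component 1 is redundant with 0
fr = np.array([3.0, 1.0, 2.0, 0.5, 0.4])         # hypothetical FR scores
picked = select_components(F, fr, n_groups=2)
```

In this toy run the redundant component 1 falls into the same group as component 0 and is discarded in favor of it, which is exactly the independence-versus-discrimination tradeoff the procedure targets.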

features. These feature selection methods are evaluated according to the classification accuracy achieved by the SVM classifier.
     A comparison of the classification accuracy achieved by the feature selection methods is made in Table 1 for each data collection. Real-life public domain data sets from the UCI Machine Learning Repository are used in our experiments. Our proposed approach achieves better classification accuracy than the competing method, MIM.

Table 1: Comparison of feature selection methods.

      Data set         Feature selection    Classification
                       method               accuracy (%)
      Iris             MIM                  95.12
      (D = 4)          Proposed             98.16
      Ionosphere       MIM                  92.0
      (D = 32)         Proposed             94.18
      Isolet           MIM                  85.63
      (D = 617)        Proposed             87.24
      Multiple         MIM                  82.46
      Features
      (D = 649)        Proposed             84.67
      Arrhythmia       MIM                  90.23
      (D = 195)        Proposed             93.46

4   CONCLUSION

    We have presented a simple and very efficient scheme for feature selection in a context of classification. The proposed method selects a small subset of features that carries as much information as possible. The hybrid feature selection method described in this paper is able to reduce the number of features selected as well as to increase the classification rate. The method does not select a feature similar to already picked ones, even if that feature is individually powerful, since it does not carry additional information about the class to predict. Thus, the criterion ensures a good tradeoff between independence and discrimination. The experiments we have conducted show that the proposed method outperforms MIM on all five data sets considered. Combined with the SVM classifier, the scores we obtained are comparable to or better than those of state-of-the-art techniques.

REFERENCES

[1] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, Vol. 3, pp. 1157–1182, 2003.
[2] C. A. Ratanamahatana and D. Gunopulos. Feature selection for the naive Bayesian classifier using decision trees. Applied Artificial Intelligence, Vol. 17(5-6), pp. 475–487, 2003.
[3] K. Torkkola. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, Vol. 3, pp. 1415–1438, 2003.
[4] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, Vol. 97(1-2), pp. 245–271, 1997.
[5] R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, Vol. 97(1-2), pp. 273–324, 1997.
[6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis: An International Journal, Vol. 1(3), pp. 131–156, 1997.
[7] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Boston: Kluwer Academic Publishers, 1998.
[8] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, Vol. 2, pp. 263–286, 1995.
[9] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, Vol. 20(3), pp. 273–297, 1995.
[10] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[11] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, Vol. 5, pp. 101–141, 2004.
[12] C. W. Hsu and C. J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, Vol. 13(2), pp. 415–425, 2002.
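The One-Against-All scheme used for these experiments can be sketched as follows. As an assumption for the sake of a self-contained example, a ridge-regularized least-squares scorer stands in for the binary SVM; the paper's own experiments use an RBF-kernel SVM implemented in MATLAB, and the cluster centers below are toy data.

```python
import numpy as np

def ova_fit(X, y, n_classes):
    """One-Against-All: train one binary scorer per class (here regularized
    least squares standing in for a binary SVM)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])           # add a bias column
    W = []
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)                 # class c vs. the rest
        w = np.linalg.solve(Xb.T @ Xb + 1e-6 * np.eye(Xb.shape[1]), Xb.T @ t)
        W.append(w)
    return np.array(W)

def ova_predict(X, W):
    """The binary scorer with the largest output determines the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(Xb @ W.T, axis=1)

rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)
W = ova_fit(X, y, 3)
acc = (ova_predict(X, W) == y).mean()
```

Swapping the least-squares scorer for an RBF-kernel SVM leaves the One-Against-All wrapper unchanged, which is why the scheme combines easily with any binary classifier.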
