(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010

           A Comparative Study of Microarray Data
         Classification with Missing Values Imputation
Kairung Hengpraphrom (1), Sageemas Na Wichian (2), and Phayung Meesad (3)

(1) Department of Information Technology, Faculty of Information Technology
(2) Department of Social and Applied Science, College of Industrial Technology
(3) Department of Teacher Training in Electrical Engineering, Faculty of Technical Education
King Mongkut's University of Technology North Bangkok
1518 Piboolsongkram Rd., Bangsue, Bangkok 10800, Thailand
kairung2004@yahoo.com, sgm@kmutnb.ac.th, pym@kmutnb.ac.th

Abstract—Incomplete data is an important problem in data mining: when values are missing, the downstream analysis becomes less effective, and most algorithms for statistical data analysis require a complete data set. Microarray data usually consist of a small number of samples with very high dimensionality and a number of missing values. Many missing-value imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between the imputation method and classification accuracy. In this paper we carry out experiments on the Colon Cancer dataset to evaluate the effectiveness of four methods for missing-value imputation: the Row average method, KNN imputation, KNNFS imputation, and Multiple Linear Regression imputation. The considered classifier is the Support Vector Machine (SVM).

Keywords: KNN, Regression, Microarray, Imputation, Missing Values

                        I.  INTRODUCTION

   Microarray data represent the expression of thousands of genes at the same time. As with many types of experimental data, expression data obtained from microarray experiments are frequently peppered with missing values (MVs) that may occur for a variety of reasons, such as insufficient resolution, image corruption, dust or scratches on the slide, or errors in the experimental process. Many data mining techniques have been proposed to identify regulatory patterns or similarities in expression under similar conditions. For the analysis to be efficient, techniques such as classification [1-3] and clustering [4-5] require that the microarray data be complete, with no missing values [6]. One solution to the missing-data problem is to repeat the experiment, but this is time consuming and very expensive [7]. Replacing the missing values with zero or with an average value can be helpful instead of eliminating the records that contain missing values [8], but these two simple methods are not very effective.
   Consequently, many algorithms have been developed to accurately impute MVs in microarray experiments. For example, K-Nearest Neighbor, Singular Value Decomposition, and Row average imputation have been proposed to estimate missing values in microarrays, and KNN imputation was found to be the best of the three [9]. However, there is still room for improvement, and many further imputation techniques have been proposed. Troyanskaya et al. [9] evaluated KNN imputation against Singular Value Decomposition and Row average methods; the results showed that KNN imputation is better than the Row average method. Oba et al. [10] proposed an imputation method called Bayesian Principal Component Analysis (BPCA) and reported that BPCA estimates missing values better than KNN and SVD. Another efficient method was proposed by Zhou et al. [11]: it automatically selects genes for estimating the missing values using linear and nonlinear regression, and its key benefit is fast estimation. Kim et al. [12] proposed local least squares (LLS) imputation, which exploits the similarity structure of the data in a least squares optimization and is very robust. Later, Robust Least Squares Estimation with Principal Components (RLSP) was proposed by Yoon et al. [13] to improve on the previous methods; RLSP showed better performance than KNN, LLS, and BPCA. In these studies the NRMSE is calculated to measure imputation performance, since the original values are known.
   Many missing-value imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between the imputation method and classification accuracy. In this paper, we carry out a model-based analysis to investigate how different properties of a dataset influence imputation and classification, and how imputation affects classification performance. We compare four imputation algorithms, the Row average method, KNN imputation, KNNFS imputation, and Multiple Linear Regression imputation, to measure how well the imputed dataset preserves the discriminative power residing in the original dataset. The Support Vector Machine (SVM) is used as the classifier in this work.
   The remainder of this paper is organized as follows. Section II provides theory and related work. The details of the proposed methodology are given in Section III. Section IV presents the simulation and comparison results. Finally, concluding remarks are given in Section V.

                        II.  RELATED WORK

A. Microarray Data
   Every cell of a living organism contains a full set of chromosomes and identical genes. Only a portion of these genes is turned on, and it is this expressed subset that confers distinctive properties on each cell category.
   There are two main application forms of DNA microarray technology: 1) identification of sequence (gene/gene mutation) and 2) determination of the expression level (abundance) of the genes of one sample, or comparison of gene transcription in two or more different kinds of cells. In data preparation, DNA microarrays are small, solid supports onto which the sequences from thousands of different genes are attached at fixed locations. The supports themselves are usually glass microscope slides, about the size of two side-by-side little fingers, but can also be silicon chips or nylon membranes. The DNA is printed, spotted, or synthesized directly onto the support. With the aid of a computer, the amount of mRNA bound to the spots on the microarray is precisely measured, which generates a profile of gene expression in the cell. This process usually produces many missing values, which reduce the efficiency of the downstream computational analysis [14].

B. K-Nearest Neighbor (KNN)
   Due to its simplicity, the K-Nearest Neighbor (KNN) method is one of the best-known methods for imputing missing values in microarray data. The KNN method imputes a missing value by selecting genes whose expression values are similar to those of the gene of interest. The steps of KNN imputation are as follows.
   Step 1: Choose the K genes that are most similar to the gene with the missing value (MV). To estimate the missing value x_ij of the ith gene in the jth sample, K genes are selected whose expression vectors are similar to the expression of gene i in the samples other than j.
   Step 2: Measure the distance between two expression vectors x_i and x_j using the Euclidean distance over the observed components. The Euclidean distance between x_i and x_j can be calculated from (1):

   d_{ij} = \mathrm{dist}(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}    (1)

where dist(x_i, x_j) is the Euclidean distance between x_i and x_j; n is the number of features (dimensions) of the microarray; and x_{ik} is the kth feature of x_i.
   Step 3: Estimate the missing value as the average of the corresponding entries in the K selected expression vectors (the K nearest neighbors), using (2):

   \hat{x}_{ij} = \frac{1}{K}\sum_{k=1}^{K} X_k, \qquad X_k = X_{i=1,\ldots,M} \mid d_i \in \{d_1, d_2, \ldots, d_K\}    (2)

where \hat{x}_{ij} is the estimated missing value of the ith gene in the jth sample; d_i is the ith rank in distance among the neighbors; X_k is the entry of the kth-ranked nearest-neighbor gene expression; and M is the total number of samples in the training data.
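   For concreteness, the following is a minimal sketch of the KNN imputation procedure described above, assuming a genes-by-samples NumPy matrix with missing entries stored as NaN; the function name knn_impute and the default K are illustrative choices, not the authors' implementation.

import numpy as np

def knn_impute(data, k=10):
    """Impute missing entries (NaN) in a genes-by-samples matrix.

    For each gene row with a missing value, the k most similar genes
    (Euclidean distance over the jointly observed samples, Eq. 1) are
    found and the missing entry is replaced by their average (Eq. 2).
    """
    imputed = data.copy()
    for i, j in zip(*np.where(np.isnan(data))):
        target = data[i]
        # Samples observed in the target gene, excluding the missing column j.
        mask = ~np.isnan(target)
        mask[j] = False
        dists = []
        for g in range(data.shape[0]):
            if g == i or np.isnan(data[g, j]):
                continue  # a neighbour must itself have a value in sample j
            common = mask & ~np.isnan(data[g])
            if not common.any():
                continue
            d = np.sqrt(np.sum((target[common] - data[g, common]) ** 2))
            dists.append((d, data[g, j]))
        dists.sort(key=lambda t: t[0])
        neighbours = [v for _, v in dists[:k]]
        if neighbours:
            imputed[i, j] = np.mean(neighbours)
    return imputed

Because estimates are written to a copy of the matrix, earlier imputations do not influence later ones, which matches the description above.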
C. The Algorithm of KNNFS
   The KNNFS algorithm combines KNN-based feature selection with KNN-based imputation and proceeds as follows [15]; a code sketch is given after the list.
   Phase 1: Feature Selection
      Step 1: Initialize KF, the number of features to select;
      Step 2: Calculate the feature distance between each feature X_j, j = 1, ..., col, and X_miss (the feature that contains missing values) using (1);
      Step 3: Sort the feature distances in ascending order;
      Step 4: Select the KF features with the minimum distances.
   Phase 2: Imputation of Missing Values
      Step 5: Initialize KC, the number of samples to use;
      Step 6: Using only the KF selected features, calculate the sample distance between each row R_i, i = 1, ..., row, and R_miss (the row that contains the missing value) using (1);
      Step 7: Sort the sample distances in ascending order;
      Step 8: Select the KC samples with the minimum distances;
      Step 9: Estimate the missing value as the average of the KC most similar values, using (2).
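   The sketch below illustrates this two-phase procedure under some assumptions: the data are held in a rows-by-features NumPy matrix with NaN for missing entries, and the names knnfs_impute, kf, and kc are hypothetical; it is not the implementation from [15].

import numpy as np

def euclid(a, b):
    """Euclidean distance over entries observed in both vectors (Eq. 1)."""
    m = ~np.isnan(a) & ~np.isnan(b)
    return np.sqrt(np.sum((a[m] - b[m]) ** 2)) if m.any() else np.inf

def knnfs_impute(data, kf=50, kc=10):
    """Two-phase KNNFS imputation for a rows-by-features matrix with NaNs."""
    imputed = data.copy()
    for i, j in zip(*np.where(np.isnan(data))):
        # Phase 1: pick the kf features closest to the feature with the MV.
        feat_d = [(euclid(data[:, j], data[:, c]), c)
                  for c in range(data.shape[1]) if c != j]
        feats = [c for _, c in sorted(feat_d)[:kf]]
        # Phase 2: over those features, pick the kc rows closest to row i.
        row_d = [(euclid(data[i, feats], data[r, feats]), r)
                 for r in range(data.shape[0])
                 if r != i and not np.isnan(data[r, j])]
        rows = [r for _, r in sorted(row_d)[:kc]]
        if rows:
            imputed[i, j] = np.mean(data[rows, j])
    return imputed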




D. Multiple Linear Regression
   Multiple linear regression (MLR) models the linear relationship between a dependent variable and one or more independent variables. The dependent variable is sometimes called the predictand, and the independent variables are called the predictors. The model expresses the value of the predictand as a linear function of one or more predictors plus an error term:

   y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \cdots + b_k x_{i,k} + e_i    (3)

where x_{i,k} is the value of the kth predictor for case i; b_0 is the regression constant; b_k is the coefficient of the kth predictor; k is the total number of predictors; y_i is the predictand for case i; and e_i is the error term.
   The model (3) is estimated by least squares, which yields parameter estimates such that the sum of squared errors is minimized. The resulting prediction equation is

   \hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i,1} + \hat{b}_2 x_{i,2} + \cdots + \hat{b}_k x_{i,k}    (4)

where the variables are defined as in (3), except that the hat denotes estimated values.
                        III.  THE EXPERIMENTAL DESIGN

   To compare the performance of the KNN, Row average, Regression, and KNNFS imputation algorithms, the NRMSE was used to measure the experimental results. The missing-value estimation techniques were tested by randomly removing data values and then computing the estimation error. In the experiments, between 1% and 10% of the values were removed from the dataset at random. Next, the four imputation algorithms mentioned above were applied separately to fill in the missing values, and the imputed (complete) data were then used for accuracy measurement, in terms of both NRMSE and classification accuracy with an SVM classifier. The overall process is shown in Fig. 1.

Figure 1. Simulation flow chart: complete data → generate artificial missing values → data with missing values → feature selection and missing-value estimation → imputed data → classification accuracy and NRMSE.
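   To make the first step of Fig. 1 concrete, the helper below removes a chosen fraction of entries from the complete matrix at random, so that the imputed values can later be compared with the known originals; the function name make_missing and the use of NaN markers are assumptions for illustration.

import numpy as np

def make_missing(complete, rate, seed=0):
    """Return a copy of `complete` with `rate` (e.g. 0.01-0.10) of the
    entries replaced by NaN, plus the flat indices of the removed entries."""
    rng = np.random.default_rng(seed)
    n_miss = int(round(rate * complete.size))
    idx = rng.choice(complete.size, size=n_miss, replace=False)
    corrupted = complete.astype(float)
    corrupted.flat[idx] = np.nan
    return corrupted, idx

For example, corrupted, removed = make_missing(X, 0.05) would remove 5% of the entries of X.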

   To test the effectiveness of the different imputation algorithms, the Colon Cancer dataset was used. The data were collected from 62 patients: 40 tumor and 22 normal cases. The dataset has 2,000 selected genes; it is clean and contains no missing values.
   The effectiveness of missing-value imputation was measured by the Normalized Root Mean Squared Error (NRMSE) [12], as shown in equation (5):

   \mathrm{NRMSE} = \frac{\sqrt{\mathrm{mean}\left[(y_{guess} - y_{ans})^2\right]}}{\mathrm{std}\left[y_{ans}\right]}    (5)

where y_guess is the estimated value, y_ans is the prototype gene's original value, and std[y_ans] is the standard deviation of the prototype gene's values.
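   A direct transcription of (5), assuming y_guess and y_ans are NumPy arrays holding the imputed and the original values at the artificially removed positions:

import numpy as np

def nrmse(y_guess, y_ans):
    """Normalized root mean squared error (Eq. 5)."""
    return np.sqrt(np.mean((y_guess - y_ans) ** 2)) / np.std(y_ans)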
                        IV.  THE EXPERIMENTAL RESULTS

   To evaluate the effectiveness of the imputation methods, the NRMSE values were computed for each algorithm as described above. Each experiment was repeated 10 times and the average is reported as the result. The experimental results are shown in Table I and Fig. 2.
   Table I and Fig. 2 show the NRMSE of the estimation error for the Colon Tumor data. The results show that the Regression method has a lower NRMSE than the other methods.

TABLE I. NORMALIZED ROOT MEAN SQUARE ERROR OF MISSING-VALUE IMPUTATION FOR COLON CANCER DATA

   % Miss     Row       KNN      KNNFS    Regression
      1      0.6363    0.5486    0.4990     0.4049
      2      0.6121    0.5366    0.4918     0.4103
      3      0.6319    0.5606    0.5173     0.4282
      4      0.6339    0.5621    0.5169     0.4251
      5      0.6301    0.5673    0.5267     0.4410
      6      0.6281    0.5634    0.5212     0.4573
      7      0.6288    0.5680    0.5254     0.4415
      8      0.6382    0.5882    0.5534     0.4548
      9      0.6310    0.5858    0.5481     0.4418
     10      0.6296    0.5849    0.5483     0.4450

Figure 2. Normalized root mean square error of missing-value imputation for Colon Cancer data.

   The classification accuracy obtained with the SVM classifier is summarized in Table II and Fig. 3. The experimental results show that the accuracy of the Row average method ranges between 82.10% and 83.39%, the neighbour-based methods (KNN, KNNFS) give results between 82.90% and 84.77%, and the Regression method ranges between 82.90% and 84.84%.

TABLE II. ACCURACY (%) OF THE SVM CLASSIFIER FOR COLON CANCER DATA CLASSIFICATION

   % Miss     Row      KNN     KNNFS    Regression
      1      83.39    84.03    84.35      84.84
      2      83.23    84.35    84.03      84.19
      3      83.06    83.87    83.71      84.84
      4      82.74    84.19    83.87      83.71
      5      82.62    84.23    84.77      83.51
      6      82.90    82.90    82.74      83.87
      7      82.42    83.87    83.87      84.19
      8      82.10    83.39    83.23      84.03
      9      83.23    84.35    84.68      84.35
     10      82.26    83.55    83.71      82.90

Figure 3. Accuracy of the SVM classifier for Colon Cancer data.
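   For completeness, here is a sketch of how the SVM accuracy on an imputed dataset could be computed; the paper does not report the kernel, parameters, or validation scheme, so the linear kernel and 10-fold cross-validation below are assumptions only.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def svm_accuracy(imputed, labels):
    """Mean cross-validated accuracy of an SVM on the imputed matrix
    (samples x genes). Kernel and CV scheme are illustrative choices."""
    clf = SVC(kernel="linear", C=1.0)
    scores = cross_val_score(clf, imputed, labels, cv=10, scoring="accuracy")
    return float(np.mean(scores))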




                        V.  CONCLUSION

   This research studies the effect of missing-value imputation methods on classification problems using a model-based approach. Four imputation methods (Row average, KNN, KNNFS, and Regression) are compared in terms of the resulting classification accuracy, using the Colon Cancer data.
   To evaluate the performance of the imputation methods, we randomly removed between 1% and 10% of the known expression values from the complete matrices, imputed the missing values, and assessed the performance using the NRMSE.
   The results show that the Row average method is considerably less effective than the other methods in terms of NRMSE, and it also gives the lowest classification accuracy with the SVM classifier. Among the remaining methods, although the Regression method yields the best performance in terms of NRMSE, the differences in classification accuracy are small.

                        REFERENCES

[1]  M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares Jr., and D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines", Proc. Natl. Acad. Sci. USA, vol. 97, pp. 262-267, 2000.
[2]  X. L. Ji, J. L. Ling, and Z. R. Sun, "Mining gene expression data using a novel approach based on hidden Markov models", FEBS Letters, vol. 542, pp. 125-131, 2003.
[3]  O. Alter, P. O. Brown, and D. Botstein, "Singular value decomposition for genome-wide expression data processing and modeling", Proc. Natl. Acad. Sci. USA, vol. 97, pp. 10101-10106, 2000.
[4]  M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci. USA, vol. 95, pp. 14863-14868, 1998.
[5]  P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation", Proc. Natl. Acad. Sci. USA, vol. 96, pp. 2907-2912, 1999.
[6]  E. Wit and J. McClure, Statistics for Microarrays: Design, Analysis and Inference, West Sussex: John Wiley and Sons Ltd, pp. 65-69, 2004.
[7]  M. S. B. Sehgal, I. Gondal, and L. S. Dooley, "Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data", Bioinformatics, vol. 21, pp. 2417-2423, 2005.
[8]  A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson Jr., L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, L. M. Staudt, et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature, vol. 403, pp. 503-511, 2000.
[9]  O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics, vol. 17, pp. 520-525, 2001.
[10] S. Oba, M. A. Sato, I. Takemasa, M. Monden, K. Matsubara, and S. Ishii, "A Bayesian missing value estimation method for gene expression profile data", Bioinformatics, vol. 19, pp. 2088-2096, 2003.
[11] X. Zhou, X. Wang, and E. R. Dougherty, "Missing-value estimation using linear and non-linear regression with Bayesian gene selection", Bioinformatics, vol. 19, pp. 2302-2307, 2003.
[12] H. Kim, G. H. Golub, and H. Park, "Missing value estimation for DNA microarray gene expression data: local least squares imputation", Bioinformatics, vol. 21, pp. 187-198, 2005.
[13] D. Yoon, E. K. Lee, and T. Park, "Robust imputation method for missing values in microarray data", BMC Bioinformatics, vol. 8, Suppl. 2, S6, 2007.
[14] J. Quackenbush, "Microarray data normalization and transformation", Nature Genetics Supplement, vol. 32, pp. 496-501, 2002.
[15] P. Meesad and K. Hengpraprohm, "Combination of KNN-based feature selection and KNN-based missing-value imputation of microarray data", Proc. 3rd International Conference on Innovative Computing, Information and Control (ICICIC 2008), p. 341, 2008.


