(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

A Comparative Study of Microarray Data Classification with Missing Values Imputation

Kairung Hengpraphrom (1), Sageemas Na Wichian (2) and Phayung Meesad (3)
(1) Department of Information Technology, Faculty of Information Technology
(2) Department of Social and Applied Science, College of Industrial Technology
(3) Department of Teacher Training in Electrical Engineering, Faculty of Technical Education
King Mongkut's University of Technology North Bangkok
1518 Piboolsongkram Rd., Bangsue, Bangkok 10800, Thailand
kairung2004@yahoo.com, sgm@kmutnb.ac.th, pym@kmutnb.ac.th

Abstract: Incomplete data is an important problem in data mining because it makes the downstream analysis less effective. Most algorithms for statistical data analysis need a complete set of data. Microarray data usually consist of a small number of samples with high dimensionality and a number of missing values. Many missing value imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between the missing value imputation method and classification accuracy. In this paper we carry out experiments with the Colon Cancer dataset to evaluate the effectiveness of four methods for missing value imputation: the Row average method, KNN imputation, KNNFS imputation and the Multiple Linear Regression imputation procedure. The classifier considered is the Support Vector Machine (SVM).

Keywords: KNN, Regression, Microarray, Imputation, Missing Values

I. INTRODUCTION

Microarray data represent the expression of thousands of genes at the same time. As with many types of experimental data, expression data obtained from microarray experiments are frequently peppered with missing values (MVs) that may occur for a variety of reasons, such as insufficient resolution, image corruption, dust, scratches on the slide, or errors in the experimental process. Many data mining techniques have been proposed to identify regulatory patterns or similarities in expression under similar conditions. For the analysis to be efficient, data mining techniques such as classification [1-3] and clustering [4-5] require the microarray data to be complete, with no missing values [6]. One solution to the missing data problem is to repeat the experiment, but this is time consuming and very expensive [7]. Replacing the missing values by zero or by an average value can be helpful as an alternative to eliminating the records with missing values [8], but these two simple methods are not very effective.

Consequently, many algorithms have been developed to impute MVs accurately in microarray experiments. For example, Troyanskaya et al. [9] compared three estimators, K-Nearest Neighbor (KNN) imputation, Singular Value Decomposition (SVD) based imputation, and the Row average method, and found KNN imputation to be the best of the three; in particular, it performed better than the Row average method. However, there is still room for improvement, and many other imputation techniques have been proposed. Oba et al. [10] proposed an imputation method called Bayesian Principal Component Analysis (BPCA) and claimed that BPCA can estimate the missing values better than KNN and SVD. Another efficient method was proposed by Zhou et al. [11]. The method automatically selects gene parameters for the estimation of missing values and uses linear and nonlinear regression; its key benefit is quick estimation. Kim et al. [12] proposed local least squares (LLS) imputation, which exploits the similarity structure of the data through least squares optimization and is very robust. Later, Robust Least Squares Estimation with Principal Components (RLSP) was proposed by Yoon et al. [13] to improve the efficiency of the previous methods. The RLSP imputation method showed better performance than KNN, LLS, and BPCA. In such studies the normalized root mean squared error (NRMSE) is used to measure imputation performance, which is possible because the original values are known.

Many missing value imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between the missing value imputation method and classification accuracy. In this paper, we carry out a model-based analysis to investigate how different properties of a dataset influence imputation and classification, and how imputation affects classification performance. We compare four imputation algorithms (the Row average method, KNN imputation, KNNFS imputation and Multiple Linear Regression imputation) to measure how well the imputed dataset preserves the discriminative power residing in the original dataset. The Support Vector Machine (SVM) is used as the classifier in this work.

The remainder of this paper is organized as follows. Section II provides theory and related work. The details of the proposed methodology are given in Section III. Section IV presents the simulation and comparison results. Finally, concluding remarks are given in Section V.

II. RELATED WORK

A. Microarray Data

Every cell of a living organism contains a full set of chromosomes and identical genes. Only a portion of these genes is turned on, and it is this expressed subset that confers distinctive properties on each cell category.

The two most important applications of DNA microarray technology are 1) identification of sequence (gene/gene mutation) and 2) determination of the expression level (abundance) of genes in one sample, or comparison of gene transcription in two or more different kinds of cells. In data preparation, DNA microarrays are small, solid supports onto which the sequences from thousands of different genes are attached at fixed locations. The supports themselves are usually glass microscope slides, the size of two side-by-side small fingers, but can also be silicon chips or nylon membranes. The DNA is printed, spotted, or actually synthesized directly onto the support. With the aid of a computer, the amount of mRNA bound to the spots on the microarray is precisely measured, which generates a profile of gene expression in the cell. This generating process usually produces many missing values, which reduces the efficiency of the downstream computational analysis [14].

B. K-Nearest Neighbor (KNN)

Due to its simplicity, the K-Nearest Neighbor (KNN) method is one of the best-known methods for imputing missing values in microarray data. The KNN method imputes missing values by selecting genes with expression values similar to those of the gene of interest. The steps of KNN imputation are as follows.

Step 1: Choose the K genes that are most similar to the gene with the missing value (MV). In order to estimate the missing value x_ij of the ith gene in the jth sample, K genes are selected whose expression vectors are similar to the expression of gene i in samples other than j.

Step 2: Measure the distance between two expression vectors x_i and x_j by using the Euclidean distance over the observed components in the jth sample. The Euclidean distance between x_i and x_j can be calculated from (1):

$$d_{ij} = \mathrm{dist}(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \tag{1}$$

where dist(x_i, x_j) is the Euclidean distance between samples x_i and x_j; n is the number of features or dimensions of the microarray; and x_ik is the kth feature of sample x_i.

Step 3: Estimate the missing value as the average of the corresponding entries in the selected K nearest expression vectors by using (2):

$$\hat{x}_{ij} = \frac{1}{K} \sum_{k=1}^{K} X_k, \qquad X_k \in \{\, X_i,\ i = 1, \dots, M \mid d_i \in \{d_1, d_2, \dots, d_K\} \,\} \tag{2}$$

where \hat{x}_{ij} is the estimated missing value of the ith gene in the jth sample; d_i is the ith-ranked neighbor distance; X_k is the input matrix entry of the kth-ranked nearest-neighbor gene expression; and M is the total number of samples in the training data.

C. The Algorithm of KNNFS

The algorithm that combines KNN-based feature selection with KNN-based imputation is as follows [15].

Phase 1: Feature Selection
Step 1: Initialize the number of features, KF;
Step 2: Calculate the feature distance between Xj, j = 1, …, col, and Xmiss (the feature with missing values) by using (1);
Step 3: Sort the feature distances in ascending order;
Step 4: Select the KF features with the minimum distances;

Phase 2: Imputation of Missing Values
Step 5: Initialize the number of samples, KC;
Step 6: Use the KF selected features to calculate the sample distance between Ri, i = 1, …, row, and Rmiss (the row with missing values) by using (1);
Step 7: Sort the sample distances in ascending order;
Step 8: Select the KC samples with the minimum distances;
Step 9: Use the KC samples to estimate the missing value as the average of the KC most similar values by using (2).

D. Multiple Linear Regression

Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable and one or more independent variables. The dependent variable is sometimes also called the predictand, and the independent variables are called the predictors.

The model expresses the value of the predictand variable as a linear function of one or more predictor variables and an error term:

$$y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \dots + b_k x_{i,k} + e_i \tag{3}$$

where
x_{i,j} is the value of the jth predictor in case i,
b_0 is the regression constant,
b_j is the coefficient on the jth predictor,
k is the total number of predictors,
y_i is the predictand in case i, and
e_i is the error term.
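As a concrete illustration of how a model of the form (3) could be used for imputation, the following minimal Python sketch treats the gene with the missing entry as the predictand and a few other genes as predictors, with the remaining samples acting as the cases. This is only a sketch under stated assumptions, not the procedure used in this study: the way predictor genes are chosen is not specified here, so the `predictor_rows` argument and the helper names `fit_mlr`, `predict_mlr` and `impute_entry` are hypothetical. The coefficients are obtained with NumPy's `numpy.linalg.lstsq`, that is, by ordinary least squares.

```python
import numpy as np


def fit_mlr(X, y):
    """Fit y_i = b0 + b1*x_i1 + ... + bk*x_ik + e_i by ordinary least squares.

    X has one row per case and one column per predictor.
    Returns the coefficient vector [b0, b1, ..., bk].
    """
    # A leading column of ones makes the first coefficient the constant b0.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Evaluate the fitted linear function for each row (case) of X."""
    design = np.column_stack([np.ones(len(X)), X])
    return design @ coef


def impute_entry(data, i, j, predictor_rows):
    """Estimate data[i, j] in a genes x samples matrix (NaN = missing).

    Assumption: the predictor genes are supplied by the caller in
    `predictor_rows` and are observed in sample j.
    """
    cols = [c for c in range(data.shape[1]) if c != j]
    y = data[i, cols]                          # predictand: gene i
    X = data[np.ix_(predictor_rows, cols)].T   # cases x predictors
    # Use only the cases in which predictand and predictors are all observed.
    complete = ~np.isnan(y) & ~np.isnan(X).any(axis=1)
    coef = fit_mlr(X[complete], y[complete])
    x_new = data[predictor_rows, j][None, :]   # predictor values in sample j
    return float(predict_mlr(coef, x_new)[0])
```

For example, `impute_entry(data, 0, 5, [1, 2])` would estimate the entry of gene 0 in sample 5 from genes 1 and 2. With only a few dozen samples and thousands of genes, the number of predictors has to stay well below the number of complete cases for the fit to be determined.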
The model (3) is estimated by least squares, which yields parameter estimates such that the sum of squared errors is minimized. The resulting prediction equation is

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i,1} + \hat{b}_2 x_{i,2} + \dots + \hat{b}_k x_{i,k} \tag{4}$$

where the variables are defined as in (3), except that "^" denotes estimated values.

III. THE EXPERIMENTAL DESIGN

To compare the performance of the KNN, Row average, Regression, and KNNFS imputation algorithms, the NRMSE was used to measure the experimental results. The missing value estimation techniques were tested by randomly removing data values and then computing the estimation error. In the experiments, between 1% and 10% of the values were removed from the dataset at random. Next, the four imputation algorithms mentioned above were applied separately to estimate the missing values, and the imputed (complete) data were then used for accuracy measurement: the NRMSE and the classification accuracy of the SVM classifier. The overall process is shown in Fig. 1.

[Figure 1. Simulation flow chart: complete data -> generate artificial missing values -> data with missing values -> feature selection method and missing-value estimation method -> imputed data -> NRMSE and classification accuracy.]

To test the effectiveness of the different imputation algorithms, the Colon Cancer dataset was used. The data were collected from 62 patients: 40 tumor and 22 normal cases. The dataset has 2,000 selected genes. It is clean and contains no missing values.

The effectiveness of missing value imputation was computed by the Normalized Root Mean Squared Error (NRMSE) [12], as shown in (5):

$$\mathrm{NRMSE} = \frac{\sqrt{\mathrm{mean}\left[(y_{\mathrm{guess}} - y_{\mathrm{ans}})^2\right]}}{\mathrm{std}\left[y_{\mathrm{ans}}\right]} \tag{5}$$

where
y_guess is the estimated value,
y_ans is the prototype gene's value, and
std[y_ans] is the standard deviation of the prototype gene's values.

IV. THE EXPERIMENTAL RESULTS

To evaluate the effectiveness of the imputation methods, the NRMSE values were computed for each algorithm as described above. The experiment was repeated 10 times and the average is reported as the result. Table I and Fig. 2 show the NRMSE of the estimation error for the Colon Cancer data. The Regression method has the lowest NRMSE at every missing rate (0.4049 to 0.4573), followed by KNNFS (0.4918 to 0.5534), KNN (0.5366 to 0.5882), and the Row average method (0.6121 to 0.6382).

TABLE I. NORMALIZED ROOT MEAN SQUARE ERROR OF MISSING-VALUE IMPUTATION FOR COLON CANCER DATA

% Missing   Row      KNN      KNNFS    Regression
1           0.6363   0.5486   0.4990   0.4049
2           0.6121   0.5366   0.4918   0.4103
3           0.6319   0.5606   0.5173   0.4282
4           0.6339   0.5621   0.5169   0.4251
5           0.6301   0.5673   0.5267   0.4410
6           0.6281   0.5634   0.5212   0.4573
7           0.6288   0.5680   0.5254   0.4415
8           0.6382   0.5882   0.5534   0.4548
9           0.6310   0.5858   0.5481   0.4418
10          0.6296   0.5849   0.5483   0.4450

[Figure 2. Normalized root mean square error of missing value imputation for Colon Cancer data.]

The classification accuracy obtained with the SVM classifier is summarized in Table II and Fig. 3. The experimental results show that the accuracy of the Row average method ranges between 82.10% and 83.39%, while the neighbor-based methods (KNN, KNNFS) give results between 82.74% and 84.77%, and the Regression method ranges between 82.90% and 84.84%.

TABLE II. ACCURACY (%) OF SVM CLASSIFIER FOR COLON CANCER DATA CLASSIFICATION

% Missing   Row     KNN     KNNFS   Regression
1           83.39   84.03   84.35   84.84
2           83.23   84.35   84.03   84.19
3           83.06   83.87   83.71   84.84
4           82.74   84.19   83.87   83.71
5           82.62   84.23   84.77   83.51
6           82.90   82.90   82.74   83.87
7           82.42   83.87   83.87   84.19
8           82.10   83.39   83.23   84.03
9           83.23   84.35   84.68   84.35
10          82.26   83.55   83.71   82.90

[Figure 3. Accuracy of SVM classifier for Colon Cancer data.]

V. CONCLUSION

This research studies the effect of MV imputation methods on classification problems, using a model-based approach. Four imputation methods (Row average, KNN, KNNFS, Regression) are compared in terms of classification accuracy on the Colon Cancer data.

To evaluate the performance of the imputation methods, we randomly removed between 1% and 10% of the known expression values from the complete matrices, imputed the MVs, and assessed the performance by using the NRMSE.

The results show that the Row average method is much less effective than the other methods in terms of NRMSE. It also gives the lowest classification accuracy with the SVM classifier at nine of the ten missing rates. Among the other methods, although Regression yields the best performance in terms of NRMSE, its classification accuracy is not notably different from that of KNN and KNNFS.

REFERENCES

[1] M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. J. Ares, D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines", Proc Natl Acad Sci USA, vol. 97, pp. 262-267, 2000.
[2] X. L. Ji, J. L. Ling, Z. R. Sun, "Mining gene expression data using a novel approach based on hidden Markov models", FEBS Letters, vol. 542, pp. 125-131, 2003.
[3] O. Alter, P. O. Brown, D. Botstein, "Singular value decomposition for genome-wide expression data processing and modeling", Proc Natl Acad Sci USA, vol. 97, pp. 10101-10106, 2000.
[4] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, "Cluster analysis and display of genome-wide expression patterns", Proc Natl Acad Sci USA, vol. 95, pp. 14863-14868, 1998.
[5] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, T. R. Golub, "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation", Proc Natl Acad Sci USA, vol. 96, pp. 2907-2912, 1999.
[6] E. Wit and J. McClure, "Statistics for Microarrays: Design, Analysis and Inference", West Sussex: John Wiley and Sons Ltd, pp. 65-69, 2004.
[7] M. S. Sehgal, L. Gondal, L. S. Dooley, "Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data", Bioinformatics, vol. 21, pp. 2417-2423, 2005.
[8] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. J. Hudson, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, L. M. Staudt, et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature, vol. 403, pp. 503-511, 2000.
[9] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics, vol. 17, pp. 520-525, 2001.
[10] S. Oba, M. A. Sato, I. Takemasa, M. Monden, K. I. Matsubara, S. Ishii, "A Bayesian missing value estimation method for gene expression profile data", Bioinformatics, vol. 19, pp. 2088-2096, 2003.
[11] X. B. Zhou, X. D. Wang, E. R. Dougherty, "Missing-value estimation using linear and non-linear regression with Bayesian gene selection", Bioinformatics, vol. 19, pp. 2302-2307, 2003.
[12] H. Kim, G. H. Golub, H. Park, "Missing value estimation for DNA microarray gene expression data: local least squares imputation", Bioinformatics, vol. 21, pp. 187-198, 2005.
[13] D. Yoon, E. K. Lee, T. Park, "Robust imputation method for missing values in microarray data", BMC Bioinformatics, vol. 8, suppl. 2, p. S6, 2007.
[14] J. Quackenbush, "Microarray data normalization and transformation", Nature Genetics Supplement, vol. 32, pp. 496-501, 2002.
[15] P. Meesad and K. Hengpraprohm, "Combination of KNN-Based Feature Selection and KNN-Based Missing-Value Imputation of Microarray Data", 2008 3rd International Conference on Innovative Computing Information and Control, pp. 341, 2008.
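Supplementary sketch. The evaluation protocol of Section III (random removal of entries, KNN imputation by equations (1) and (2), and the NRMSE of equation (5)) can be illustrated with the short Python example below. It is a minimal sketch under stated assumptions rather than the code used for the reported results: the function names `remove_random`, `knn_impute` and `nrmse` are hypothetical, the data are synthetic instead of the Colon Cancer dataset, distances are taken over the samples observed in both genes, `numpy.std` is used with its default population form, and the SVM classification step is omitted.

```python
import numpy as np


def remove_random(data, fraction, rng):
    """Return a copy of `data` with the given fraction of entries set to NaN."""
    out = data.copy()
    n_missing = int(round(fraction * data.size))
    idx = rng.choice(data.size, size=n_missing, replace=False)
    out.flat[idx] = np.nan
    return out


def knn_impute(data, k=5):
    """Fill NaNs in a genes x samples matrix with the mean of k nearest genes."""
    filled = data.copy()
    missing = np.isnan(data)
    for i, j in zip(*np.nonzero(missing)):
        dists = []
        for g in range(data.shape[0]):
            if g == i or missing[g, j]:
                continue  # a neighbour must be observed in sample j
            shared = ~missing[i] & ~missing[g]  # samples observed in both genes
            if not shared.any():
                continue
            diff = data[i, shared] - data[g, shared]
            dists.append((np.sqrt(np.sum(diff ** 2)), g))  # equation (1)
        dists.sort()
        nearest = [g for _, g in dists[:k]]
        if nearest:
            filled[i, j] = np.mean(data[nearest, j])  # equation (2)
    return filled


def nrmse(guess, answer):
    """Equation (5): root mean squared error divided by std of the true values."""
    return np.sqrt(np.mean((guess - answer) ** 2)) / np.std(answer)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in data: 3 expression patterns, 10 noisy genes each.
    base = rng.normal(size=(3, 20))
    full = np.repeat(base, 10, axis=0) + 0.05 * rng.normal(size=(30, 20))
    for percent in (1, 5, 10):
        holed = remove_random(full, percent / 100, rng)
        mask = np.isnan(holed)
        filled = knn_impute(holed, k=3)
        print(percent, nrmse(filled[mask], full[mask]))
```

Only the artificially removed entries enter the NRMSE, so the score reflects imputation error alone. Averaging the score over repeated random removals, as done in Section IV, reduces its dependence on which entries happen to be removed.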