(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 6, 2010

A Hybrid PSO-SVM Approach for the Haplotype Tagging SNP Selection Problem

Min-Hui Lin, Department of Computer Science and Information Engineering, Dahan Institute of Technology, Sincheng, Hualien County, Taiwan, Republic of China
Chun-Liang Leu, Department of Information Technology, Ching Kuo Institute of Management and Health, Keelung, Taiwan, Republic of China

Abstract—Due to the large number of single nucleotide polymorphisms (SNPs), it is essential to use only a subset of all SNPs, called haplotype tagging SNPs (htSNPs), when searching for relationships between complex diseases and SNPs in biomedical research. In this paper, a PSO-SVM model that hybridizes particle swarm optimization (PSO) and the support vector machine (SVM) with feature selection and parameter optimization is proposed to select the htSNPs appropriately. Several public datasets of different sizes are used to compare the proposed approach with previously published methods. The computational results validate the effectiveness and performance of the proposed approach: high prediction accuracy can be obtained with fewer htSNPs.

Keywords: Single Nucleotide Polymorphisms (SNPs), Haplotype Tagging SNPs (htSNPs), Particle Swarm Optimization (PSO), Support Vector Machine (SVM).

I. INTRODUCTION

The large number of single nucleotide polymorphisms (SNPs) in the human genome provides essential tools for finding associations between sequence variation and complex diseases. A description of the SNPs on a chromosome is called a haplotype. Each element of a haplotype string is 0 or 1, where 0 denotes the major allele and 1 denotes the minor allele. The genotype is the combined information of the two haplotypes on the homologous chromosomes, and it is prohibitively expensive to determine the haplotypes of an individual directly. Usually, each element of a genotype string is 0, 1, or 2, where 0 represents a site homozygous for the major allele, 1 represents a site homozygous for the minor allele, and 2 represents a heterozygous site. The genotyping cost is driven by the number of SNPs typed. In order to reduce this cost, a small number of haplotype tagging SNPs (htSNPs) that predict the remaining SNPs are needed.

The haplotype tagging SNP selection problem has become a very active research topic and is promising for disease association studies. Several computational algorithms have been proposed in the past few years, which can be divided into two categories: block-based and block-free methods. Block-based methods [1-2] first partition the human genome into haplotype blocks, within which haplotype diversity is limited, and then search for subsets of tagging SNPs within each block. A main drawback of block-based methods is that there is no standard definition of a block and no consensus about how the blocks should be partitioned. Algorithmic frameworks that select a minimum informative set of SNPs without any reference to haplotype blocks are called block-free methods [3]. In the literature [4-5], feature selection techniques have been adopted to solve the tagging SNP selection problem and have achieved promising results.

Feature selection algorithms may be broadly categorized into two groups: the filter approach and the wrapper approach. The filter approach selects highly ranked features based on a statistical score as a preprocessing step; it is relatively cheap computationally because it does not involve the induction algorithm. The wrapper approach, on the contrary, uses the induction algorithm directly to evaluate feature subsets. It generally outperforms the filter method in terms of classification accuracy, but it is computationally more intensive. The support vector machine (SVM) [6] is a useful technique for data classification. A practical difficulty in using an SVM is the selection of its parameters, such as the penalty parameter C of the error term and the kernel parameter γ of the RBF kernel function; an appropriate choice of parameters yields better generalization performance.

In this paper, a hybrid PSO-SVM model that incorporates particle swarm optimization (PSO) and the support vector machine (SVM) with feature selection and parameter optimization is proposed to select the htSNPs appropriately. Several public benchmark datasets are used to compare the proposed approach with other published methods. Experimental results validate the effectiveness of the proposed approach: high prediction accuracy can be obtained with fewer htSNPs. The remainder of the paper is organized as follows. Section 2 introduces the problem formulation. Section 3 describes PSO and the SVM classifier. Section 4 presents the particle representation, the fitness measurement, and the proposed hybrid system procedure. Three public benchmark problems are used to validate the proposed approach, and the comparison results are described in Section 5. Finally, conclusions are made in Section 6.
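The 0/1/2 genotype encoding described in the introduction is a direct element-wise combination of the two haplotype strings. A minimal sketch (the function name is ours, for illustration only):

```python
def genotype(h1, h2):
    """Combine two 0/1 haplotype strings into a genotype string:
    0 = homozygous major (0/0), 1 = homozygous minor (1/1),
    2 = heterozygous (the two haplotypes differ at the site)."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]
```

For example, the haplotypes 0110 and 0101 combine into the genotype 0, 1, 2, 2.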
II. PROBLEM FORMULATION

As shown in Figure 1, assume that a dataset U consists of n haplotypes {h_i}, 1 ≤ i ≤ n, each with p different SNPs {S_j}, 1 ≤ j ≤ p, so that U is an n×p matrix. Each row of U is a haplotype h_i and each column represents a SNP S_j. The element d_{i,j} ∈ {0,1} denotes the j-th SNP of the i-th haplotype. Our goal is to determine a minimum-size set V = {v_k}, k ∈ {1, 2, ..., p}, g = |V|, of selected SNPs (htSNPs), in which each variable v_k corresponds to the k-th SNP of the haplotypes in U, so as to predict the remaining unselected SNPs with minimum prediction error. The size of V is bounded by a user-defined value R (g ≤ R). The selected SNPs are called haplotype tagging SNPs (htSNPs), while the remaining unselected ones are called tagged SNPs. Thus, the selection of V is judged by how well it predicts the remaining unselected SNPs, and the number g of selected SNPs is minimized according to the prediction error calculated in leave-one-out cross-validation (LOOCV) experiments [7].

Figure 1. The haplotype tagging SNP selection problem: the n×p matrix U whose rows are the haplotypes h_1, ..., h_n, whose columns are the SNPs S_1, ..., S_p, and whose entries are d_{i,j}.

III. RELATED WORKS

A. Particle Swarm Optimization

PSO is an optimization method originally developed by Kennedy and Eberhart [8]. It models the sociological behavior associated with bird flocking and is one of the evolutionary computation techniques. In PSO, each solution is a 'bird' in the flock and is referred to as a 'particle'; a particle is analogous to a chromosome in a GA. Each particle traverses the search space looking for the global optimum. The basic PSO update is:

v_{id}^{k+1} = w · v_{id}^{k} + c_1 · r_1 · (pb_{id}^{k} − x_{id}^{k}) + c_2 · r_2 · (gb_{id}^{k} − x_{id}^{k})   (1)

x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1}   (2)

where d = 1, 2, ..., D and i = 1, 2, ..., S; D is the dimension of the problem space, S is the size of the population, and k is the iteration index. v_{id}^{k} is the velocity of the i-th particle, x_{id}^{k} is its current position, pb_{id}^{k} is the particle's best (pbest) solution achieved so far, and gb_{id}^{k} is the global best (gbest) solution obtained so far by any particle in the population. r_1 and r_2 are random values in the range [0,1], c_1 and c_2 are learning factors, usually c_1 = c_2 = 2, and w is an inertia factor. A large inertia weight facilitates global exploration, while a small one favors local exploitation. To obtain a more refined solution, a general rule of thumb is to set the inertia weight initially to the maximum w_max = 0.9 and decrease it gradually to the minimum w_min = 0.4.

According to the search behavior of PSO, the gbest value is an important clue leading particles toward the global optimum, yet solutions inevitably fall into local minima while particles search for better ones. To allow exploration of areas that may yield more promising solutions, a mutation-like disturbance operation is inserted between Eq. (1) and Eq. (2). The disturbance operation randomly selects k dimensions (1 ≤ k ≤ number of problem dimensions) of m particles (1 ≤ m ≤ number of particles) and adds Gaussian noise to their moving vectors (velocities). The disturbance moves the selected dimensions of these particles in unexpected directions, independent of previous experience, allowing a particle to jump out of a local search region and explore unsearched areas.

According to the velocity and position update formulas above, the basic process of the PSO algorithm is as follows:
1) Initialize the swarm by randomly generating initial particles.
2) Evaluate the fitness of each particle in the population.
3) Compare the particles' fitness values to identify the pbest and gbest solutions.
4) Update the velocity of all particles using Eq. (1).
5) Apply the disturbance operator to the moving vectors (velocities).
6) Update the position of all particles using Eq. (2).
7) Repeat Steps 2 to 6 until a termination criterion is satisfied (e.g., the number of iterations reaches the predefined maximum or a sufficiently good fitness value is obtained).

The authors of [8] also proposed a discrete binary version that allows the PSO algorithm to operate in discrete problem spaces.
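The update rules of Eqs. (1)-(2), with the disturbance step inserted between them, can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code; the noise scale `sigma` and the default choices of `m` and `k` are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, pbest, gbest, w, c1=2.0, c2=2.0, m=2, k=1, sigma=0.1):
    """One PSO iteration: Eq. (1) velocity update, Gaussian disturbance
    on k dimensions of m randomly chosen particles, then Eq. (2) position update."""
    S, D = x.shape
    r1, r2 = rng.random((S, D)), rng.random((S, D))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. (1)
    for i in rng.choice(S, size=m, replace=False):             # disturbance
        dims = rng.choice(D, size=k, replace=False)
        v[i, dims] += rng.normal(0.0, sigma, size=k)
    x = x + v                                                  # Eq. (2)
    return x, v
```

In a full run this step is iterated while pbest/gbest are refreshed from the fitness values, as in Steps 2-7 above.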
In the flocking and is one of the evolutionary computation techniques. binary PSO (BPSO), the particle’s personal best and global In the PSO, each solution is a ‘bird’ in the flock and is referred best is updated as in continuous value. The major different to as a ‘particle’. A particle is analogous to a chromosome in between discrete PSO with continuous version is that velocities GA. Each particle traverses the search space looking for the of the particles are rather defined in terms of probabilities that a global optimum. The basic PSO algorithm is as follow: bit whether change to one. By this definition, a velocity must vid+1 = w ⋅ vid + c1 ⋅ r1 ⋅ ( pbid − xid ) + c2 ⋅ r2 ⋅ ( gbid − xid ) k k k k k k (1) be restricted within the range [Vmin , Vmax ] . If vid+1 ∉ (Vmin , Vmax ) k then vid+1 = max(min(Vmax , vid+1 ), Vmin ) . The new particle position k k xid+1 = vid+1 + xid k k k (2) is calculated using the following rule: 61 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No.6, 2010 If rand ( ) < S (vid+1 ) , then xid+1 = 1 ; else xid+1 = 0 k k k (3) Linear SVM can be generalized to non-linear SVM via a mapping function Φ , which is also called the kernel function, and the training data can be linearly separated by applying the 1 , where S (vid+1 ) = k k +1 (4) linear SVM formulation. The inner product (Φ ( xi ) ⋅ Φ ( x j )) is 1 + e − vid calculated by the kernel function k ( xi , x j ) for given training The function S (vid ) is a sigmoid limiting transformation and data. By introducing the kernel function, the non-linear SVM rand() is a random number selected from a uniform distribution (optimal hyper-plane) has the following forms: in [0, 1]. Note that the BPSO is susceptible to sigmod function m saturation which occurs when velocity values are either too f ( x, α * , b* ) = ∑ yiα i* < Φ ( x) ⋅ Φ ( xi ) > +b* large or too small. 
For a velocity of zero, it is a probability of i =1 50% for the bit to flip. m (8) = ∑ yiα k ( x, xi ) + b * i * i =1 B. Support Vector Machine Classifier Though new kernel functions are being proposed by SVM starts from a linear classifier and searches the optimal researchers, there are four basic kernels as follows. hyper-plane with maximal margin. The main motivating criterion is to separate the various classes in the training set • Linear: k ( xi , x j ) = xiT x j (9) with a surface that maximizes the margin between them. It is an approximate implementation of the structural risk minimization • Polynomial: k ( xi , x j ) = (γ xiT x j + r ) d , γ > 0 (10) induction principle that aims to minimize a bound on the generalization error of a model. • RBF: k ( xi , x j ) = exp(−γ || xi − x j ||2 ), γ > 0 (11) Given a training set of instance-label pairs • Sigmoid: k ( xi , x j ) = tanh(γ xiT x j + r ) (12) ( xi , yi ), i = 1, 2,..., m where xi ∈ R n and yi ∈ {+1, −1} . The generalized linear SVM finds an optimal separating hyper- where γ , r and d are kernel parameters. Radial basis function plane f ( x) = w ⋅ x + b by solving the following optimization (RBF) is a common kernel function as Eq. (11). In order to problem: improve classification accuracy, the kernel parameter γ in the kernel function should be properly set. m 1 T Minimize w w + C ∑ ξi w , b ,ξ 2 i =1 (5) IV. METHODS Subject to : yi (< w ⋅ xi > +b) + ξi − 1 ≥ 0, ξi ≥ 0 As the htSNPs selection problem mentioned above in Section 2, the notations and definitions are used to present our where C is a penalty parameter on the training error, and ξi is proposed method. In the dataset U of n×p matrix, each row the non-negative slack variables. 
This optimization model can (haplotypes) can be viewed as a learning instance belonging to be solved using the Lagrangian method, which maximizes the a class and each column (SNPs) are attributes or features based same dual variables Lagrangian LD (α ) (6) as in the separable on which sequences can be classified into class. Given the case. values of g htSNPs of an unknown individual x and the known full training samples from U, a SNP prediction process can be m 1 m treated as the problem of selecting tagging SNPs as a feature Maximize LD (α ) = ∑ α i − ∑ α iα j yi y j < xi ⋅ x j > selection problem to predict the non-selected tagging SNPs in x. α i =1 2 i , j =1 (6) Thus, the tagging SNPs selection can be transformed to solve m Subject to : 0 ≤ α i ≤ C , i = 1, 2,..., m and ∑ α i yi = 0 i =1 for a binary classification of vectors with g coordinates by using the support vector machine classifier. Here, an effective PSO-SVM model that hybridizes the particle swarm To solve the optimal hyper-plane, a dual Lagrangian optimization and support vector machine with feature selection LD (α ) must be maximized with respect to non-negative α i and parameter optimization is proposed to appropriately select m the htSNPs. The particle representation, fitness definition, under the constraint ∑α y i =1 i i = 0 and 0 ≤ α i ≤ C . The penalty disturbance strategy for PSO operation and system procedure for the proposed hybrid model are described as follows. parameter C is a constant to be chosen by the user. A larger value of C corresponds to assigning a higher penalty to the errors. After the optimal solution α i* is obtained, the optimal A. Particle Representation * * hyper-plane parameters w and b can be determined. The The RBF kernel function is used in the SVM classifier to optimal decision hyper-plane f ( x, α * , b* ) can be written as: implement our proposed method. The RBF kernel function requires that only two parameters, C and γ should be set. 
Using the RBF kernel for the SVM, the parameters C and γ and the SNPs viewed as input features must be optimized simultaneously in our proposed PSO-SVM hybrid system. The particle representation therefore consists of three parts: C and γ are continuous variables, and the SNPs mask consists of discrete variables. For feature selection, if n_f features are available to choose from, then n_f + 2 decision variables must be adopted in each particle.

Table 1 shows the particle representation of our design. Particle i has dimension n_f + 2, where n_f is the number of SNPs (features) and varies between datasets. x_{i,1}, ..., x_{i,n_f} ∈ {0,1} denote the SNPs mask, x_{i,n_f+1} carries the parameter value C, and x_{i,n_f+2} carries the parameter value γ. If x_{i,k} = 1 (k = 1, 2, ..., n_f), the k-th SNP is selected on the i-th particle, and vice versa.

TABLE I. The representation of particle i.

    Discrete variables (SNPs mask)        Continuous variables
    x_{i,1}  x_{i,2}  ...  x_{i,n_f}      C: x_{i,n_f+1}    γ: x_{i,n_f+2}

A random-key encoding method [9] is applied in the PSO algorithm. Random-key encoding is generally used as an order-based encoding scheme in which the value of the random key is the genotype and the decoded value is the phenotype. Note that each mask entry x_{i,k}, 1 ≤ k ≤ n_f, is assigned a random number on (0, 1) and is decoded in ascending order of its value. In the PSO learning process, entries counted larger tend to evolve closer to 1 and those counted smaller tend to evolve closer to 0. Therefore, a repair mechanism such as the particle amendment in [5], used to guarantee the number of htSNPs after the PSO update process, is not required.

B. Fitness Measurement

In order to compare the performance of our proposed approach with the published methods SVM/STSA [4] and BPSO [5], leave-one-out cross-validation is used as the fitness measurement. The prediction accuracy is measured as the percentage of correctly predicted SNP values among the non-selected SNPs. In the LOOCV experiments, each haplotype sequence is removed in turn from dataset U, and the htSNPs selected using only the remaining haplotypes are used to predict the tagged SNP values of the removed sequence. This procedure is repeated so that each haplotype in U serves once as the validation data.

Figure 2. The flowchart of the proposed PSO-SVM model: the dataset is split by LOOCV into training and testing sets; the particle's htSNPs mask selects the feature subset F; the particle's C and γ parameterize the SVM classifier, which is trained and tested; the fitness calculation feeds the PSO operation until the termination check passes, yielding the optimized C, γ, and feature subset F.
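The LOOCV fitness measurement above can be sketched as follows. This is an illustrative skeleton, not the authors' code: `predict_fn` stands in for the SVM-based prediction of the tagged SNPs from the htSNP values, and its signature is our assumption.

```python
import numpy as np

def loocv_accuracy(U, mask, predict_fn):
    """Leave-one-out prediction accuracy of a htSNP mask.

    U          : (n, p) 0/1 haplotype matrix
    mask       : length-p boolean array, True where a SNP is a selected htSNP
    predict_fn : given the training haplotypes, the mask, and the held-out
                 haplotype's htSNP values, returns its predicted tagged SNPs
    """
    n, p = U.shape
    correct, total = 0, 0
    for i in range(n):                       # each haplotype is held out once
        train, held = np.delete(U, i, axis=0), U[i]
        pred = predict_fn(train, mask, held[mask])
        correct += int(np.sum(pred == held[~mask]))
        total += int(np.sum(~mask))
    return correct / total
```

Any predictor with this interface can be plugged in; even a per-column majority vote exercises the loop end to end.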
C. The Proposed Hybrid System Procedure

Figure 2 shows the system architecture of our proposed hybrid model. Based on the particle representation and fitness measurement described above, the details of the proposed hybrid PSO-SVM procedure are as follows:

Procedure PSO-SVM hybrid model
1) Data preparation. A given dataset U is split into training and testing sets by the leave-one-out cross-validation process; the training and testing sets are denoted U_TR and U_TE, respectively.
2) PSO initialization and parameter setting. Set the PSO parameters, including the number of iterations, the number of particles, the velocity limits, the particle dimension, and the disturbance rate. Generate initial particles comprising the feature mask, C, and γ.
3) htSNP (feature) subset selection. Select the input features of the training set according to the feature mask represented in the particle from step 2; the feature subset is thereby determined.
4) SVM model training and testing. Using the parameters C and γ represented in the particle, train the SVM classifier on the training set; the prediction accuracy of the SVM on the testing set is then evaluated by LOOCV.
5) Fitness calculation. For each particle, evaluate the fitness value as the prediction accuracy obtained in the previous step. The best fitness value is stored to provide feedback to the PSO evolution toward fitter particles in the next generation.
6) Termination check. When the maximal evolutionary epoch is reached, the program ends; otherwise, go to the next step.
7) PSO operation. In the evolution process, the discrete-valued and continuous-valued dimensions of the PSO, together with the disturbance operator, are updated to search for better solutions.

TABLE II. Number of selected htSNPs required by SVM/STSA [4], BPSO [5], and the proposed PSO-SVM to reach each prediction accuracy on three real haplotype datasets.

    Prediction accuracy (%)   80  85  90  91  92  93  94  95  96  97  98  99
    5q31 (103 SNPs)
      SVM/STSA                 1   1   3   3   4   5   6   8  10  22  42  51
      BPSO                     1   1   2   3   4   5   6   7   9  14  29  42
      PSO-SVM                  1   1   2   2   3   4   5   6   7  10  23  36
    TRPM8 (101 SNPs)
      SVM/STSA                 1   1   2   5   5   6   7   8  10  15  15  24
      BPSO                     1   1   2   5   5   6   7   8   9  13  14  22
      PSO-SVM                  1   1   2   4   4   5   6   7   8  11  13  21
    LPL (88 SNPs)
      SVM/STSA                 2   3   4  10  13  20  25  30  35  39  42  47
      BPSO                     2   3   6   9  12  16  18  21  25  28  31  37
      PSO-SVM                  2   3   4   7  10  12  13  17  20  22  26  31
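The hybrid procedure can be condensed into a compact loop. A minimal sketch under stated assumptions: `fitness` abstracts the SVM-with-LOOCV evaluation of steps 3-5, the random-key mask keeps a fixed number g of selected htSNPs, the C and γ clipping ranges follow the experimental settings of Section 5, and the disturbance step is omitted for brevity; none of this is the authors' actual code.

```python
import numpy as np

rng = np.random.default_rng(1)

def decode(p, n_f, g):
    """Random-key decoding: the g largest mask keys mark the selected htSNPs;
    the last two entries carry C and gamma, clipped to the search ranges."""
    mask = np.zeros(n_f, dtype=bool)
    mask[np.argsort(p[:n_f])[-g:]] = True
    return (mask,
            float(np.clip(p[n_f], 1e-2, 1e4)),      # C in [10^-2, 10^4]
            float(np.clip(p[n_f + 1], 1e-4, 1e4)))  # gamma in [10^-4, 10^4]

def pso_svm(fitness, n_f, g, S=20, iters=50, w_max=0.9, w_min=0.4):
    """Skeleton of the PSO-SVM loop: decode -> evaluate -> update pbest/gbest."""
    D = n_f + 2
    x, v = rng.random((S, D)), np.zeros((S, D))
    pbest = x.copy()
    pfit = np.array([fitness(*decode(p, n_f, g)) for p in x])
    gbest = pbest[np.argmax(pfit)].copy()
    for it in range(iters):
        w = w_max - (w_max - w_min) * it / iters                     # Eq. (13)
        r1, r2 = rng.random((S, D)), rng.random((S, D))
        v = w * v + 2.0 * r1 * (pbest - x) + 2.0 * r2 * (gbest - x)  # Eq. (1)
        x = x + v                                                    # Eq. (2)
        fit = np.array([fitness(*decode(p, n_f, g)) for p in x])
        improved = fit > pfit
        pbest[improved], pfit[improved] = x[improved], fit[improved]
        gbest = pbest[np.argmax(pfit)].copy()
    return decode(gbest, n_f, g)
```

By construction the decoded mask always selects exactly g SNPs, which is why no repair mechanism is needed after the PSO update.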
V. EXPERIMENTAL RESULTS AND COMPARISONS

To validate the performance of the developed hybrid approach, three public experimental SNP datasets [4], 5q31, TRPM8, and LPL, are used to compare the proposed approach with previously published methods. Where missing data exist in the haplotype datasets, the GERBIL [4-5] program is used to resolve them. The chromosome 5q31 dataset came from a 616-kilobase region of human chromosome 5q31 and contains 103 SNPs. The TRPM8 dataset, which consists of 101 SNPs, was obtained from HapMap. The human lipoprotein lipase (LPL) gene dataset was derived from the haplotypes of 71 individuals typed over 88 SNPs.

Our implementation was carried out in Matlab 7.3, a mathematical development environment, by extending LIBSVM, originally designed by Chang and Lin [10]. The empirical evaluation was performed on an Intel Pentium IV CPU running at 3.4 GHz with 2 GB RAM. Through initial experiments, the PSO parameter values were set as follows. The swarm size is 200 particles. The search ranges of the continuous-dimension parameters are C ∈ [10^−2, 10^4] and γ ∈ [10^−4, 10^4]. For the discrete particle dimensions of the feature mask, we set [V_min, V_max] = [−6, 6], which yields a probability range of [0.0025, 0.9975] under the sigmoid limiting transformation of Eq. (4). Both the cognition learning factor c_1 and the social learning factor c_2 are set to 2. The disturbance rate is 0.05, and the number of generations is 600. The inertia weight factors are w_min = 0.4 and w_max = 0.9. The linearly decreasing inertia weight is set as in Eq. (13), where i_now is the current iteration and i_max is the predefined maximum iteration:

w = w_max − (w_max − w_min) · (i_now / i_max)   (13)

To compare the proposed PSO-SVM approach with SVM/STSA [4] and BPSO [5] on the three haplotype datasets by LOOCV experiments, the computational results — the numbers of selected htSNPs required for each prediction accuracy — are summarized in Table 2. As noted in [4], it is striking that only one SNP suffices for 80% prediction accuracy in the 5q31 and TRPM8 datasets. In practice, guessing every SNP as 0 would already yield a prediction accuracy of 72.5% for the 5q31 dataset and 79.3% for the TRPM8 dataset; it is therefore reasonable that one appropriately selected htSNP correctly predicts 80% of the remaining non-selected SNPs. It is evident that the proposed PSO-SVM hybrid model achieves higher prediction accuracy with fewer selected htSNPs on all three haplotype datasets. In general, the prediction accuracy increases with the number of selected htSNPs. Figures 3 to 5 show that the numbers of selected htSNPs on the haplotype datasets grow with the required prediction accuracy and that the PSO-SVM algorithm performs very well on the haplotype tagging SNP selection problem in the three test cases.

Figure 3. Prediction accuracy versus the number of selected htSNPs on the 5q31 dataset.

VI. CONCLUSION

In this paper, a hybrid PSO-SVM model that combines particle swarm optimization (PSO) and the support vector machine (SVM) with feature selection and parameter optimization is proposed to solve the haplotype tagging SNP selection problem effectively. Several public datasets of different sizes are used to compare PSO-SVM with the previously published SVM/STSA and BPSO methods. The experimental results show the effectiveness of the proposed approach: high prediction accuracy can be obtained with a smaller number of haplotype tagging SNPs by the hybrid PSO-SVM system.

REFERENCES

[1] K. Zhang, M. Deng, T. Chen, M. Waterman and F. Sun, "A dynamic programming algorithm for haplotype block partitioning," Proc. Natl. Acad. Sci., vol. 99, pp. 7335-7339, 2002.
[2] K. Zhang, F. Sun, S. Waterman and T.
Chen, "Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data," Am. J. Hum. Genet., vol. 73, pp. 63-73, 2003.
[3] H. Avi-Itzhak, X. Su and F. de la Vega, "Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity," in Proceedings of the Pacific Symposium on Biocomputing, vol. 8, pp. 466-477, 2003.
[4] J. He and A. Zelikovsky, "Informative SNP selection methods based on SNP prediction," IEEE Transactions on NanoBioscience, vol. 6, pp. 60-67, 2007.
[5] C.-H. Yang, C.-H. Ho and L.-Y. Chuang, "Improved tag SNP selection using binary particle swarm optimization," IEEE Congress on Evolutionary Computation (CEC 2008), pp. 854-860, 2008.
[6] V. N. Vapnik, The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995.
[7] E. Halperin, G. Kimmel and R. Shamir, "Tag SNP selection in genotype data for maximizing SNP prediction accuracy," Bioinformatics, vol. 21, pp. 195-203, 2005.
[8] J. Kennedy and R. C. Eberhart, "A discrete binary version of the particle swarm algorithm," in Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics, Piscataway, NJ, 1997, pp. 4104-4109.
[9] J. C. Bean, "Genetics and random keys for sequencing and optimization," ORSA J. Comput., vol. 6, pp. 154-160, 1994.
[10] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

Figure 4. Prediction accuracy versus the number of selected htSNPs on the TRPM8 dataset.
Figure 5. Prediction accuracy versus the number of selected htSNPs on the LPL dataset.