A Hybrid PSO-SVM Approach for Haplotype Tagging SNP Selection Problem

					                                                              (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                        Vol. 8, No.6, 2010

         A Hybrid PSO-SVM Approach for Haplotype
              Tagging SNP Selection Problem
                         Min-Hui Lin
     Department of Computer Science and Information Engineering,
     Dahan Institute of Technology,
     Sincheng, Hualien County, Taiwan, Republic of China

                         Chun-Liang Leu
     Department of Information Technology,
     Ching Kuo Institute of Management and Health,
     Keelung, Taiwan, Republic of China

Abstract—Due to the large number of single nucleotide polymorphisms (SNPs), it is essential to use only a subset of all SNPs, called haplotype tagging SNPs (htSNPs), for finding the relationship between complex diseases and SNPs in biomedical research. In this paper, a PSO-SVM model that hybridizes particle swarm optimization (PSO) and the support vector machine (SVM) with feature selection and parameter optimization is proposed to appropriately select the htSNPs. Several public datasets of different sizes are considered to compare the proposed approach with previously published methods. The computational results validate the effectiveness of the proposed approach: high prediction accuracy can be obtained with fewer htSNPs.

   Keywords: Single Nucleotide Polymorphisms (SNPs), Haplotype Tagging SNPs (htSNPs), Particle Swarm Optimization (PSO), Support Vector Machine (SVM).

                    I.    INTRODUCTION

    The large number of single nucleotide polymorphisms (SNPs) in the human genome provides the essential tools for finding the association between sequence variation and complex diseases. A description of the SNPs in each chromosome is called a haplotype. The string element of each haplotype is 0 or 1, where 0 denotes the major allele and 1 denotes the minor allele. The genotype is the combined information of the two haplotypes on the homologous chromosomes, and it is prohibitively expensive to determine the haplotypes of an individual directly. Usually, the string element of a genotype is 0, 1, or 2, where 0 represents a site homozygous for the major allele, 1 represents a site homozygous for the minor allele, and 2 represents a heterozygous site. The genotyping cost is affected by the number of SNPs typed. In order to reduce this cost, a small number of haplotype tagging SNPs (htSNPs) that predict the rest of the SNPs is needed.

    The haplotype tagging SNP selection problem has become a very active research topic and is promising in disease association studies. Several computational algorithms have been proposed in the past few years, which can be divided into two categories: block-based and block-free methods. Block-based methods [1-2] first partition the human genome into haplotype blocks, within which haplotype diversity is limited, and then search for subsets of tagging SNPs within each block. A main drawback of block-based methods is that there is no standard definition of a block and no consensus about how blocks should be partitioned. Methods that select a minimum informative set of SNPs without any reference to haplotype blocks are called block-free methods [3]. In the literature [4-5], feature selection techniques were adopted to solve the tagging SNP selection problem and achieved promising results.

    Feature selection algorithms may be broadly categorized into two groups: the filter approach and the wrapper approach. The filter approach selects highly ranked features based on a statistical score as a preprocessing step; filter methods are computationally cheap since they do not involve the induction algorithm. The wrapper approach, on the contrary, uses the induction algorithm itself to evaluate feature subsets. It generally outperforms the filter approach in classification accuracy but is computationally more intensive. The Support Vector Machine (SVM) [6] is a useful technique for data classification. A practical difficulty in using the SVM is the selection of parameters such as the penalty parameter C of the error term and the kernel parameter γ of the RBF kernel function; an appropriate choice of parameters is needed to obtain good generalization performance.

    In this paper, a hybrid PSO-SVM model that incorporates Particle Swarm Optimization (PSO) and the Support Vector Machine (SVM) with feature selection and parameter optimization is proposed to appropriately select the htSNPs. Several public benchmark datasets are considered to compare the proposed approach with other published methods. Experimental results validate the effectiveness of the proposed approach: high prediction accuracy can be obtained with fewer htSNPs. The remainder of the paper is organized as follows: Section 2 introduces the problem formulation. Section 3 describes the PSO and the SVM classifier. In Section 4, the particle representation, fitness measurement, and the proposed hybrid system procedure are presented. Three public benchmark problems are used to validate the proposed

                                                                     60                             http://sites.google.com/site/ijcsis/
                                                                                                    ISSN 1947-5500
approach and the comparison results are described in Section 5. Finally, conclusions are made in Section 6.

                  II. PROBLEM FORMULATION

    As shown in Figure 1, assume that dataset U consists of n haplotypes {h_i}, 1 ≤ i ≤ n, each with p different SNPs {S_j}, 1 ≤ j ≤ p, so that U is an n×p matrix. Each row in U indicates the haplotype h_i and each column in U represents the SNP S_j. The element d_{i,j} denotes the j-th SNP of the i-th haplotype, d_{i,j} ∈ {0,1}. Our goal is to determine a minimum-size set of g selected SNPs (htSNPs), V = {v_k}, k ∈ {1, 2, ..., p}, g = |V|, in which each variable v_k corresponds to the k-th SNP of the haplotypes in U, so as to predict the remaining unselected SNPs with a minimum prediction error. The size of V must not exceed a user-defined value R (g ≤ R); the selected SNPs are called haplotype tagging SNPs (htSNPs), while the remaining unselected ones are named tagged SNPs. Thus, the selection of the set V of htSNPs is based on how well it predicts the remaining unselected SNPs, and the number g of selected SNPs is minimized according to the prediction error calculated in leave-one-out cross-validation (LOOCV) experiments [7].

                S_1        S_2      ...  S_j      ...  S_{p-1}      S_p
      h_1     [ d_{1,1}    d_{1,2}  ...  d_{1,j}  ...  d_{1,p-1}    d_{1,p}   ]
      h_2     [ d_{2,1}    d_{2,2}  ...  d_{2,j}  ...  d_{2,p-1}    d_{2,p}   ]
      ...
      h_i     [ d_{i,1}    d_{i,2}  ...  d_{i,j}  ...  d_{i,p-1}    d_{i,p}   ]
      ...
      h_{n-1} [ d_{n-1,1}  d_{n-1,2} ... d_{n-1,j} ... d_{n-1,p-1}  d_{n-1,p} ]
      h_n     [ d_{n,1}    d_{n,2}  ...  d_{n,j}  ...  d_{n,p-1}    d_{n,p}   ]   (n×p)

              Figure 1 The haplotype tagging SNP selection problem.

                    III. RELATED WORKS

A. Particle Swarm Optimization

    The PSO is an optimization method originally developed by Kennedy and Eberhart [8]. It models the sociological behavior associated with bird flocking and is one of the evolutionary computation techniques. In the PSO, each solution is a 'bird' in the flock and is referred to as a 'particle'; a particle is analogous to a chromosome in a GA. Each particle traverses the search space looking for the global optimum. The basic PSO update rules are as follows:

    v_{id}^{k+1} = w · v_{id}^k + c_1 · r_1 · (pb_{id}^k − x_{id}^k) + c_2 · r_2 · (gb_{id}^k − x_{id}^k)    (1)

    x_{id}^{k+1} = v_{id}^{k+1} + x_{id}^k    (2)

where d = 1, 2, ..., D, i = 1, 2, ..., S; D is the dimension of the problem space, S is the size of the population, and k is the iteration count. v_{id}^k is the i-th particle's velocity, x_{id}^k is the current particle solution, pb_{id}^k is the i-th particle's best (pbest) solution achieved so far, and gb_{id}^k is the global best (gbest) solution obtained so far by any particle in the population. r_1 and r_2 are random values in the range [0,1], c_1 and c_2 are learning factors, usually c_1 = c_2 = 2, and w is an inertia factor. A large inertia weight facilitates global exploration, while a small one tends toward local exploration. In order to achieve a more refined solution, a general rule of thumb suggests that the initial inertia value be set to the maximum w_max = 0.9 and gradually decreased to the minimum w_min = 0.4.

    According to the searching behavior of the PSO, the gbest value is an important clue in leading particles to the globally optimal solution, yet it is unavoidable for solutions to fall into local minima while particles try to find better ones. In order to allow exploration of areas that may produce more potential solutions, a mutation-like disturbance operation is inserted between Eq. (1) and Eq. (2). The disturbance operation randomly selects k dimensions (1 ≤ k ≤ problem dimensions) of m particles (1 ≤ m ≤ particle numbers) and adds Gaussian noise to their moving vectors (velocities). The disturbance moves the affected particles in unexpected directions along the selected dimensions, independently of previous experience; it lets particles jump out of local searches and explore unsearched areas.

    According to the velocity and position update formulas above, the basic process of the PSO algorithm is given as follows:

    1.) Initialize the swarm by randomly generating initial particles.
    2.) Evaluate the fitness of each particle in the population.
    3.) Compare the particles' fitness values to identify both the pbest and gbest values.
    4.) Update the velocity of all particles using Eq. (1).
    5.) Add the disturbance operator to the moving vectors (velocities).
    6.) Update the position of all particles using Eq. (2).
    7.) Repeat Steps 2 to 6 until a termination criterion is satisfied (e.g., the number of iterations reaches the pre-defined maximum or a sufficiently good fitness value is obtained).

    The authors of [8] proposed a discrete binary version that allows the PSO algorithm to operate in discrete problem spaces. In the binary PSO (BPSO), the particle's personal best and global best are updated as in the continuous version. The major difference from the continuous version is that the velocities of the particles are defined as the probabilities that a bit changes to one. Under this definition, a velocity must be restricted to the range [V_min, V_max]: if v_{id}^{k+1} ∉ (V_min, V_max), then v_{id}^{k+1} = max(min(V_max, v_{id}^{k+1}), V_min). The new particle position is calculated using the following rule:

    If rand() < S(v_{id}^{k+1}), then x_{id}^{k+1} = 1; else x_{id}^{k+1} = 0    (3)

    where S(v_{id}^{k+1}) = 1 / (1 + e^{−v_{id}^{k+1}})    (4)

The function S(v_{id}) is a sigmoid limiting transformation and rand() is a random number drawn from a uniform distribution on [0, 1]. Note that the BPSO is susceptible to sigmoid-function saturation, which occurs when velocity values are either too large or too small; for a velocity of zero, the probability of the bit flipping is 50%.

B. Support Vector Machine Classifier

    The SVM starts from a linear classifier and searches for the optimal hyper-plane with maximal margin. The main motivating criterion is to separate the classes of the training set with a surface that maximizes the margin between them. It is an approximate implementation of the structural risk minimization induction principle, which aims to minimize a bound on the generalization error of a model.

    Given a training set of instance-label pairs (x_i, y_i), i = 1, 2, ..., m, where x_i ∈ R^n and y_i ∈ {+1, −1}, the generalized linear SVM finds an optimal separating hyper-plane f(x) = w · x + b by solving the following optimization problem:

    Minimize_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^{m} ξ_i    (5)
    Subject to: y_i(<w · x_i> + b) + ξ_i − 1 ≥ 0, ξ_i ≥ 0

where C is a penalty parameter on the training error and the ξ_i are non-negative slack variables. This optimization model can be solved using the Lagrangian method, which maximizes the dual Lagrangian L_D(α) in (6), as in the separable case:

    Maximize_α  L_D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j <x_i · x_j>    (6)
    Subject to: 0 ≤ α_i ≤ C, i = 1, 2, ..., m, and Σ_{i=1}^{m} α_i y_i = 0

    To solve for the optimal hyper-plane, the dual Lagrangian L_D(α) must be maximized with respect to the non-negative α_i under the constraints Σ_{i=1}^{m} α_i y_i = 0 and 0 ≤ α_i ≤ C. The penalty parameter C is a constant to be chosen by the user; a larger value of C corresponds to assigning a higher penalty to errors. After the optimal solution α_i* is obtained, the optimal hyper-plane parameters w* and b* can be determined, and the optimal decision hyper-plane f(x, α*, b*) can be written as:

    f(x, α*, b*) = Σ_{i=1}^{m} y_i α_i* <x_i · x> + b* = w* · x + b*    (7)

    The linear SVM can be generalized to a non-linear SVM via a mapping function Φ, after which the mapped training data can be linearly separated by applying the linear SVM formulation. The inner product (Φ(x_i) · Φ(x_j)) is calculated by the kernel function k(x_i, x_j) for the given training data. By introducing the kernel function, the non-linear SVM (optimal hyper-plane) has the following form:

    f(x, α*, b*) = Σ_{i=1}^{m} y_i α_i* <Φ(x) · Φ(x_i)> + b*
                 = Σ_{i=1}^{m} y_i α_i* k(x, x_i) + b*    (8)

Though new kernel functions are being proposed by researchers, there are four basic kernels, as follows:

    •  Linear: k(x_i, x_j) = x_i^T x_j    (9)
    •  Polynomial: k(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0    (10)
    •  RBF: k(x_i, x_j) = exp(−γ ||x_i − x_j||^2), γ > 0    (11)
    •  Sigmoid: k(x_i, x_j) = tanh(γ x_i^T x_j + r)    (12)

where γ, r and d are kernel parameters. The radial basis function (RBF) of Eq. (11) is a common kernel function. In order to improve classification accuracy, the kernel parameter γ in the kernel function should be properly set.

                        IV. METHODS

    For the htSNP selection problem described in Section 2, the notation and definitions above are used to present our proposed method. In the n×p dataset matrix U, each row (haplotype) can be viewed as a learning instance belonging to a class and each column (SNP) as an attribute or feature based on which sequences can be classified. Given the values of the g htSNPs of an unknown individual x and the known full training samples from U, the SNP prediction process can be treated as a feature selection problem: selecting the tagging SNPs so as to predict the non-selected (tagged) SNPs in x. Thus, tagging SNP selection can be transformed into a binary classification of vectors with g coordinates, solved by the support vector machine classifier. Here, an effective PSO-SVM model that hybridizes particle swarm optimization and the support vector machine with feature selection and parameter optimization is proposed to appropriately select the htSNPs. The particle representation, fitness definition, disturbance strategy for the PSO operation, and the system procedure of the proposed hybrid model are described as follows.

A. Particle Representation

    The RBF kernel function is used in the SVM classifier to implement our proposed method. The RBF kernel requires that only two parameters, C and γ, be set. When using the RBF kernel for the SVM, the parameters C and γ and the SNPs, viewed as input features, must be optimized

simultaneously for our proposed PSO-SVM hybrid system. The
particle representation consists of three parts including: C
and γ are the continuous variables, and the SNPs mask are the                                   Data Preprocessing

discrete variables. For the feature selection, if n f features are                                                                Dataset
required to decide which features are chosen, then n f +2
decision variables in each particle must be adopted.                                                                       Split dataset by LOOCV

    Table 1 shows the particle representation of our design.
The representation of particle i with dimension of n f + 2 ,                                                      Training set                  Testing set

where n f is the number of SNPs (features) that varies from
                                                                                              htSNPs (Feature) Selection
different datasets. xi ,1 ~ xi , n f ∈ {0,1} denotes the SNPs mask,
                                                                                                                   Selected htSNPs (features) subset F
xi , n f +1 indicates the parameter value C , xi , n f + 2 represents the                                            PSO parameter : htSNPs mask
parameter value γ . If xi , k = 1, k = 1, 2,..., n f represents the k-th
SNP on the i-th particle to be selected, and vice versa.                                                    Testing set tagged SNPs       Training set htSNPs

                  TABLE I.           The particle i representation.                             SVM parameter Optimization

            Discrete-variables                       Continuous-variables                                               Training SVM classifier
                                                                                                                       PSO parameters : C and r
                SNPs mask                              C                  γ
                                                                                                                           Learned SVM classifier
            xi ,1 xi ,2 L xi , n f                  xi , n f +1       xi , n f + 2
                                                                                                PSO operation

    A random key encoding method [9] is applied in the PSO                                                                   Fitness calculation
algorithm. Generally, random key encoding is used for an
order-based encoding scheme where the value of random key is                                                                                           No       PSO
                                                                                                                             Termination check ?
the genotype and the decoding value is the phenotype. Note                                                                                                    operation
that the particle in each {xi , k }1≤ k ≤ n f is assigned a random                                                                        Yes

number on (0, 1), and to decode in ascending order with regard                                                   Optimized C , r , and feature subset F
to its value. In the PSO learning process, the particle to be
counted larger tends to evolve closer to 1 and those to be
counted smaller tends to evolve closer to 0. Therefore, a repair                                       Figure 2 The flowchart of the proposed PSO-SVM model.
mechanism such as particle amendment in [5] to guarantee the
number of htSNPs after update process in PSO is not required.
                                                                                              measurement mentioned above, details of the proposed hybrid
B. Fitness Measurement

    In order to compare the performance of our proposed approach with the previously published methods SVM/STSA [4] and BPSO [5], leave-one-out cross-validation (LOOCV) is used for the fitness measurement. The prediction accuracy is measured as the percentage of correctly predicted SNP values among the non-selected SNPs. In the LOOCV experiments, each haplotype sequence is removed from the dataset U one at a time, the htSNPs are selected using only the remaining haplotypes, and those tagging SNPs are used to predict the SNP values of the removed sequence. This procedure is repeated so that each haplotype in U serves once as the validation data.

C. The Proposed Hybrid System Procedure

    Figure 2 shows the system architecture of our proposed hybrid model. Based on the particle representation and fitness measurement, the steps of the PSO-SVM procedure are described as follows:

Procedure PSO-SVM hybrid model

    1.) Data preparation
    The given dataset U is split into training and testing sets by the leave-one-out cross-validation process. The training and testing sets are denoted U_TR and U_TE, respectively.

    2.) PSO initialization and parameter setting
    Set the PSO parameters, including the number of iterations, the number of particles, the velocity limits, the particle dimension and the disturbance rate. Generate the initial particles, each comprising the feature mask, C and γ.

    3.) Selected htSNPs (features) subset
    Select the input features of the training set according to the feature mask represented in the particle from step 2); this determines the feature subset.
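Steps 1)-3) above can be sketched in Python. This is a minimal illustration, not the paper's Matlab implementation; the particle layout (mask bits first, then C and γ) and all function names are our own assumptions.

```python
import numpy as np

def decode_particle(particle, n_snps):
    """Split one particle into a binary htSNP mask and the SVM
    hyper-parameters C and gamma (hypothetical layout: the first
    n_snps entries are mask bits, the last two are C and gamma)."""
    mask = particle[:n_snps].astype(bool)
    C, gamma = particle[n_snps], particle[n_snps + 1]
    return mask, C, gamma

def loocv_splits(U):
    """Step 1: leave-one-out splits of the haplotype matrix U;
    each haplotype serves exactly once as the test sequence."""
    for i in range(len(U)):
        yield np.delete(U, i, axis=0), U[i:i + 1]

# Toy haplotype matrix: 4 haplotypes x 5 SNPs (0/1 alleles).
U = np.array([[0, 1, 0, 1, 1],
              [0, 1, 1, 1, 0],
              [1, 0, 0, 0, 1],
              [1, 0, 1, 0, 0]])

particle = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 10.0, 0.5])
mask, C, gamma = decode_particle(particle, n_snps=5)

n_splits = 0
for U_tr, U_te in loocv_splits(U):
    selected = U_tr[:, mask]   # step 3: htSNP (feature) subset of the training set
    n_splits += 1
```

In the full procedure, `selected` would be fed to the SVM training of step 4) with the decoded C and γ.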
                                                                                         63                                    http://sites.google.com/site/ijcsis/
                                                                                                                               ISSN 1947-5500
TABLE II. Results comparing PSO-SVM with SVM/STSA [4] and BPSO [5] on three real haplotype datasets. Each entry is the number of selected htSNPs required to reach the given prediction accuracy (%).

Dataset (num of SNPs)   Method      80  85  90  91  92  93  94  95  96  97  98  99
5q31 (103)              SVM/STSA     1   1   3   3   4   5   6   8  10  22  42  51
                        BPSO         1   1   2   3   4   5   6   7   9  14  29  42
                        PSO-SVM      1   1   2   2   3   4   5   6   7  10  23  36
TRPM8 (101)             SVM/STSA     1   1   2   5   5   6   7   8  10  15  15  24
                        BPSO         1   1   2   5   5   6   7   8   9  13  14  22
                        PSO-SVM      1   1   2   4   4   5   6   7   8  11  13  21
LPL (88)                SVM/STSA     2   3   4  10  13  20  25  30  35  39  42  47
                        BPSO         2   3   6   9  12  16  18  21  25  28  31  37
                        PSO-SVM      2   3   4   7  10  12  13  17  20  22  26  31
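The claim that PSO-SVM dominates the other methods can be sanity-checked directly against the published numbers; a small sketch using the 5q31 row of Table II:

```python
# Numbers of selected htSNPs on the 5q31 dataset, copied from Table II
# (one count per prediction-accuracy level 80..99%).
accuracies = [80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
htsnps_5q31 = {
    "SVM/STSA": [1, 1, 3, 3, 4, 5, 6, 8, 10, 22, 42, 51],
    "BPSO":     [1, 1, 2, 3, 4, 5, 6, 7,  9, 14, 29, 42],
    "PSO-SVM":  [1, 1, 2, 2, 3, 4, 5, 6,  7, 10, 23, 36],
}

# At every accuracy level, PSO-SVM needs no more htSNPs than either
# competing method on this dataset.
for counts in htsnps_5q31.values():
    assert all(p <= c for p, c in zip(htsnps_5q31["PSO-SVM"], counts))
```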

    4.) SVM model training and testing
    Based on the parameters C and γ represented in the particle, train the SVM classifier on the training dataset; the prediction accuracy of the SVM on the testing dataset is then evaluated by LOOCV.

    5.) Fitness calculation
    For each particle, evaluate the fitness value as the prediction accuracy obtained in the previous step. The best fitness value is stored and fed back into the PSO evolution process, guiding the search toward particles of higher fitness in the next generation.

    6.) Termination check
    When the maximal evolutionary epoch is reached, the program ends; otherwise, go to the next step.

    7.) PSO operation
    In the evolution process, the discrete-valued and continuous-valued dimensions of the PSO, together with the disturbance operator, are updated to search for better solutions.

V. EXPERIMENTAL RESULTS AND COMPARISONS

    To validate the performance of the developed hybrid approach, three public experimental SNP datasets [4], 5q31, TRPM8 and LPL, are used to compare the proposed approach with the previously published methods. When missing data exist in the haplotype datasets, the GERBIL [4-5] program is used to resolve them. The 5q31 dataset comes from a 616-kilobase region of human chromosome 5q31 and contains 103 SNPs. The TRPM8 dataset, which consists of 101 SNPs, was obtained from HapMap. The human lipoprotein lipase (LPL) dataset was derived from the haplotypes of 71 individuals typed over 88 SNPs.

    Our implementation was carried out in Matlab 7.3, a mathematical development environment, by extending LIBSVM, originally designed by Chang and Lin [10]. The empirical evaluation was performed on an Intel Pentium IV CPU running at 3.4 GHz with 2 GB of RAM. Through initial experiments, the parameter values of the PSO were set as follows. The swarm size is set to 200 particles. The search ranges of the continuous-type dimension parameters are C ∈ [10^-2, 10^4] and γ ∈ [10^-4, 10^4]. For the discrete-type dimensions (the feature mask), we set [V_min, V_max] = [-6, 6], which yields a range of [0.0025, 0.9975] under the sigmoid limiting transformation of Eq. (4). Both the cognition learning factor c1 and the social learning factor c2 are set to 2. The disturbance rate is 0.05, and the number of generations is 600. The inertia weight factors are w_min = 0.4 and w_max = 0.9. The linearly decreasing inertia weight is given by Eq. (13), where i_now is the current iteration and i_max is the pre-defined maximum iteration.

        w = w_max - (i_now / i_max)(w_max - w_min)                    (13)

    To compare the proposed PSO-SVM approach with SVM/STSA [4] and BPSO [5] on the three haplotype datasets, LOOCV experiments were performed; the numbers of selected htSNPs required for each level of prediction accuracy are summarized in Table II. As noted in [4], it may seem surprising that 80% prediction accuracy can be achieved with a single SNP on the 5q31 and TRPM8 datasets. In practice, however, simply guessing every SNP value as 0 already yields a prediction accuracy of 72.5% on the 5q31 dataset and 79.3% on the TRPM8 dataset, so a single well-chosen htSNP that correctly predicts 80% of the remaining non-selected SNPs is plausible. The proposed PSO-SVM hybrid model achieves higher prediction accuracy with fewer selected htSNPs on all three haplotype datasets. In general, the prediction accuracy increases with the number of selected htSNPs. Figures 3 to 5 show that the number of selected htSNPs grows with the required prediction accuracy and that the PSO-SVM algorithm performs very well on the haplotype tagging SNP selection problem in all three test cases.
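The parameter settings above, together with Eq. (13) and the sigmoid limiting transformation, can be sketched in Python. This is an illustrative sketch only; the function names are ours, not the paper's.

```python
import math
import random

def inertia(i_now, i_max, w_min=0.4, w_max=0.9):
    # Eq. (13): linearly decreasing inertia weight, from w_max at the
    # first iteration down to w_min at the final iteration.
    return w_max - (i_now / i_max) * (w_max - w_min)

def sigmoid(v):
    # Sigmoid limiting transformation for the discrete (feature-mask)
    # dimensions; v in [-6, 6] maps to roughly [0.0025, 0.9975].
    return 1.0 / (1.0 + math.exp(-v))

def update_velocity(v, x, pbest, gbest, w, c1=2.0, c2=2.0):
    # Standard PSO velocity update for one dimension, with the
    # cognition and social learning factors c1 = c2 = 2 as in the text.
    r1, r2 = random.random(), random.random()
    return w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
```

For example, with i_max = 600 generations, `inertia(0, 600)` gives 0.9 and `inertia(600, 600)` gives 0.4, matching the stated w_max and w_min.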
VI. CONCLUSION

    In this paper, a hybrid PSO-SVM model that combines particle swarm optimization (PSO) and a support vector machine (SVM) with feature selection and parameter optimization is proposed to solve the haplotype tagging SNP selection problem effectively. Several public datasets of different sizes are used to compare PSO-SVM with the previously published SVM/STSA and BPSO methods. The experimental results demonstrate the effectiveness of the proposed approach: the hybrid PSO-SVM system attains high prediction accuracy with fewer haplotype tagging SNPs.

Figure 3. Comparison of prediction accuracy versus the number of selected htSNPs on the 5q31 dataset.

Figure 4. Comparison of prediction accuracy versus the number of selected htSNPs on the TRPM8 dataset.

Figure 5. Comparison of prediction accuracy versus the number of selected htSNPs on the LPL dataset.

                              REFERENCES
[1]  K. Zhang, M. Deng, T. Chen, M. Waterman and F. Sun, "A dynamic programming algorithm for haplotype block partitioning," Proc. Natl. Acad. Sci., vol. 99, pp. 7335-7339, 2002.
[2]  K. Zhang, F. Sun, S. Waterman and T. Chen, "Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data," Am. J. Hum. Genet., vol. 73, pp. 63-73, 2003.
[3]  H. Avi-Itzhak, X. Su and F. de la Vega, "Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity," in Proceedings of the Pacific Symposium on Biocomputing, vol. 8, pp. 466-477, 2003.
[4]  J. He and A. Zelikovsky, "Informative SNP selection methods based on SNP prediction," IEEE Transactions on NanoBioscience, vol. 6, pp. 60-67, 2007.
[5]  C.-H. Yang, C.-H. Ho and L.-Y. Chuang, "Improved tag SNP selection using binary particle swarm optimization," in IEEE Congress on Evolutionary Computation (CEC 2008), pp. 854-860, 2008.
[6]  V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[7]  E. Halperin, G. Kimmel and R. Shamir, "Tag SNP selection in genotype data for maximizing SNP prediction accuracy," Bioinformatics, vol. 21, pp. 195-203, 2005.
[8]  J. Kennedy and R. C. Eberhart, "A discrete binary version of the particle swarm algorithm," in Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics, Piscataway, NJ, 1997.
[9]  J. C. Bean, "Genetics and random keys for sequencing and optimization," ORSA J. Comput., vol. 6, pp. 154-160, 1994.
[10] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

