Microarray Data Analysis - Classification

Chien-Yu Chen
Graduate School of Biotechnology and Bioinformatics, Yuan Ze University
Class Prediction

• It is conjectured that it ought to be possible to differentiate among the tumor classes by studying and contrasting their gene expression profiles.
• We can apply class prediction (supervised classification) techniques to develop a classification rule that discriminates between the classes.
• This rule can then be used to predict the class of a new tumor of unknown class based on its gene expression profile.
Model of Microarray Data Sets

[Figure: the data set as a matrix whose rows are samples grouped by class (Class 1: S11, S12, S13, ...; Class 2: S21, S22, S23, ...; Class 3: S31, S32, S33, ...) and whose columns are genes (Gene 1, Gene 2, ..., Gene n); each entry v_{i,j} is the expression value of gene j in sample i]

(This and several of the following slides are adapted from the lecture notes of Dr. Yen-Jen Oyang.)
Learning From Training Data

• A classification task usually involves training and testing data that consist of data instances.
• Each instance in the training set contains one "target value" (the class label) and several "attributes" (features).
Testing Procedure

• The goal of a classifier is to produce a model that predicts the target value of the instances in the testing set, for which only the attributes are given.
Cross Validation

• Most data classification algorithms require some parameters to be set, e.g. k in the KNN classifier and the tree-pruning threshold in a decision tree.
• One way to find an appropriate parameter setting is through k-fold cross validation, normally with k = 10.
• In k-fold cross validation, the training data set is divided into k subsets. Then k runs of the classification algorithm are conducted, with each subset serving as the test set once, while the remaining (k-1) subsets are used as the training set.
• The parameter values that yield the maximum accuracy in cross validation are then adopted.

• In the cross validation process, we set the parameters of the classifier to a particular combination of values of interest and then evaluate how good that combination is under one of the following schemes (a sketch follows this slide).
• With the leave-one-out cross validation scheme, we predict the class of each sample using all of the remaining samples as the training data set.
• With 10-fold cross validation, we evenly divide the training data set into 10 subsets. Each time, we test the prediction accuracy on one of the 10 subsets using the other 9 subsets as the training set.
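A minimal sketch of 10-fold and leave-one-out cross validation using scikit-learn. The synthetic matrix X, the labels y, and the choice of a 3-NN classifier are illustrative assumptions, not part of the original lecture.

```python
# Sketch: estimating accuracy by 10-fold and leave-one-out cross validation.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))   # 60 samples, 100 gene expression values (synthetic)
y = np.repeat([0, 1, 2], 20)     # three hypothetical tumor classes

clf = KNeighborsClassifier(n_neighbors=3)

# 10-fold cross validation: each of the 10 subsets serves as the test set once.
kfold_acc = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold accuracy: %.3f" % kfold_acc.mean())

# Leave-one-out: each sample is predicted from all remaining samples.
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy: %.3f" % loo_acc.mean())
```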
Instance-Based Learning

• In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, ..., vm) and assign the new instance to the class that has the most instances among those k nearest samples.
• Classifiers that adopt instance-based learning are commonly called k-nearest-neighbor (KNN) classifiers.

• The basic version of the KNN classifier works only for data sets with numerical attribute values. However, extensions have been proposed for handling data sets with categorical attributes.
• If the number of training samples is sufficiently large, then it can be shown statistically that the KNN classifier can approach the accuracy achievable by learning from the training data set.
• However, if the number of training samples is not large enough, the KNN classifier may not work well (a minimal sketch of a KNN classifier follows).
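A minimal from-scratch sketch of the idea described above: the query is assigned to the majority class among its k nearest training samples. The small two-attribute training set and its labels are hypothetical.

```python
# Sketch: a basic KNN classifier on numerical attributes, written from scratch.
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Assign `query` to the majority class among its k nearest training samples."""
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                          # indices of the k nearest samples
    votes = Counter(train_y[nearest])
    return votes.most_common(1)[0][0]

train_X = np.array([[1.0, 1.2], [0.9, 1.1], [5.0, 5.2], [5.1, 4.9]])
train_y = np.array(["O", "O", "X", "X"])
print(knn_predict(train_X, train_y, np.array([1.1, 1.0]), k=3))  # -> "O"
```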
   If the data set is noiseless, then the 1NN classifier should work
  well. In general, the noisier the data set is, the larger k should be set. However, the optimal value of k should be determined through cross validation.
• The ranges of the attribute values should be normalized before the KNN classifier is applied. There are two common normalization approaches (see the sketch after this slide):

  w = (v - v_{min}) / (v_{max} - v_{min})

  w = (v - \mu) / \sigma, where \mu and \sigma^2 are the mean and the variance of the attribute values, respectively.
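A small sketch of the two normalization schemes above, applied column-wise with NumPy; the example matrix X is made up for illustration.

```python
# Sketch: min-max scaling and z-score standardization of attribute values.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 300.0]])

# Min-max: w = (v - v_min) / (v_max - v_min), mapping each attribute into [0, 1].
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score: w = (v - mu) / sigma, using each attribute's mean and standard deviation.
z_score = (X - X.mean(axis=0)) / X.std(axis=0)

print(min_max)
print(z_score)
```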
Example of the KNN Classifiers

[Figure: training samples of classes "X" and "O" together with a test instance]

• If a 1NN classifier is employed, then the test instance is predicted to be of class "X".
• If a 3NN classifier is employed, then the test instance is predicted to be of class "O".
Alternative Similarity Functions

• Let <v_{r,1}, v_{r,2}, ..., v_{r,n}> and <v_{t,1}, v_{t,2}, ..., v_{t,n}> be the gene expression vectors, i.e. the feature vectors, of samples S_r and S_t, respectively. Then the following alternative similarity functions can be employed:
  – Euclidean distance:

    dissimilarity = \sqrt{\sum_{h=1}^{n} (v_{r,h} - v_{t,h})^2}
  – Cosine:

    similarity = \frac{\sum_{h=1}^{n} v_{r,h} v_{t,h}}{\sqrt{\sum_{h=1}^{n} v_{r,h}^2}\ \sqrt{\sum_{h=1}^{n} v_{t,h}^2}}

  – Correlation coefficient:

    similarity = \frac{\sum_{h=1}^{n} (v_{r,h} - \mu_r)(v_{t,h} - \mu_t)}{(n-1)\,\sigma_r \sigma_t}, where

    \mu_r = \frac{1}{n}\sum_{h=1}^{n} v_{r,h},  \mu_t = \frac{1}{n}\sum_{h=1}^{n} v_{t,h},

    \sigma_r = \sqrt{\frac{1}{n-1}\sum_{h=1}^{n} (v_{r,h} - \mu_r)^2},  \sigma_t = \sqrt{\frac{1}{n-1}\sum_{h=1}^{n} (v_{t,h} - \mu_t)^2}

  (a sketch computing these functions follows)
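A sketch of the three functions above in NumPy; the two expression vectors are hypothetical.

```python
# Sketch: Euclidean dissimilarity, cosine similarity, and correlation similarity.
import numpy as np

def euclidean_dissimilarity(v_r, v_t):
    return np.sqrt(np.sum((v_r - v_t) ** 2))

def cosine_similarity(v_r, v_t):
    return np.sum(v_r * v_t) / (np.sqrt(np.sum(v_r ** 2)) * np.sqrt(np.sum(v_t ** 2)))

def correlation_similarity(v_r, v_t):
    n = len(v_r)
    mu_r, mu_t = v_r.mean(), v_t.mean()
    sigma_r = np.sqrt(np.sum((v_r - mu_r) ** 2) / (n - 1))
    sigma_t = np.sqrt(np.sum((v_t - mu_t) ** 2) / (n - 1))
    return np.sum((v_r - mu_r) * (v_t - mu_t)) / ((n - 1) * sigma_r * sigma_t)

v_r = np.array([2.0, 4.0, 6.0, 8.0])
v_t = np.array([1.0, 3.0, 5.0, 9.0])
print(euclidean_dissimilarity(v_r, v_t), cosine_similarity(v_r, v_t), correlation_similarity(v_r, v_t))
```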
Importance of Feature Selection

• Inclusion of features that are not correlated with the classification decision may make the problem even more complicated.
• For example, in the data set shown on the following page, including the feature corresponding to the y-axis causes an incorrect prediction of the test instance if a 3NN classifier is employed.
[Figure: "o" and "x" samples plotted in the x-y plane, separated by the vertical line x = 10, with a test instance near the boundary]

• It is apparent that the "o"s and "x"s are separated by x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.
Feature Selection for Microarray Data Analysis

• In microarray data analysis, it is highly desirable to identify those genes that are correlated with the classes of the samples.
• For example, the Leukemia data set contains 7129 genes, and we want to identify those genes that distinguish the different disease types (a sketch follows).
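One possible sketch of gene selection: ranking genes by a one-way ANOVA F-test with scikit-learn's SelectKBest. The F-test criterion, the synthetic data, and the choice of keeping 50 genes are assumptions made for illustration; the lecture does not prescribe a specific ranking statistic.

```python
# Sketch: ranking genes by how well they separate the classes (ANOVA F-test).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(72, 7129))      # 72 samples x 7129 genes (shape of the Leukemia set; values synthetic)
y = rng.integers(0, 2, size=72)      # two hypothetical disease types

selector = SelectKBest(score_func=f_classif, k=50)    # keep the 50 top-ranked genes
X_reduced = selector.fit_transform(X, y)
top_genes = np.argsort(selector.scores_)[::-1][:50]   # indices of the selected genes
print(X_reduced.shape, top_genes[:10])
```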
Parameter Setting through Cross Validation

• When carrying out data classification, we normally need to set one or more parameters of the classification algorithm.
• For example, we need to set the value of k for the KNN classifier.
• The typical approach is to conduct cross validation to find the optimal value, as in the sketch below.
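A minimal sketch of choosing k for the KNN classifier by 10-fold cross validation with a grid search; the candidate values of k and the synthetic data are assumptions.

```python
# Sketch: selecting the KNN parameter k by 10-fold cross validation.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 100))
y = np.repeat([0, 1, 2], 20)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},   # candidate k values
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"], "cv accuracy:", search.best_score_)
```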
Linearly Separable and Non-Linearly Separable

• The example above shows a linearly separable case.
• The following is an example of a non-linearly separable case.
[Figure: a non-linearly separable data set]
An Example of a Non-Separable Case

[Figure: a data set in which the two classes cannot be perfectly separated]
Support Vector Machines (SVM)

• Over the last few years, the SVM has become established as one of the preferred approaches to many problems in pattern recognition and regression estimation.
• In most cases, the generalization performance of SVMs has been found to be equal to, or much better than, that of conventional methods.
Optimal Separation

• Suppose we are interested in finding out how to separate a set of training data vectors that belong to two distinct classes.
• If the data are separable in the input space, there may exist many hyperplanes that achieve such a separation.
• We are interested in finding the optimal hyperplane classifier: the one with the maximal margin of separation between the two classes (see the sketch below).
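A minimal sketch of fitting a maximum-margin (linear-kernel) SVM with scikit-learn; the synthetic two-class data and C = 1.0 are illustrative assumptions.

```python
# Sketch: training a linear SVM, i.e. a maximum-margin separating hyperplane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 50)),
               rng.normal(2.0, 1.0, size=(30, 50))])   # two roughly separable clouds
y = np.array([0] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0)   # linear kernel: a hyperplane in the input space
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```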
Kernels

• A mapping \Phi : R^N \rightarrow F takes the input vectors into a feature space F (see the sketch below).
• The kernel function is defined as k(x, y) := (\Phi(x) \cdot \Phi(y)).
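A small sketch of the kernel idea: for a hand-picked quadratic feature map \Phi, the polynomial kernel k(x, y) = (x · y)^2 equals the inner product \Phi(x) · \Phi(y) computed in feature space. This particular map is an assumption chosen for illustration.

```python
# Sketch: a kernel computes an inner product in feature space without mapping explicitly.
import numpy as np

def phi(v):
    """Quadratic feature map for a 2-d vector: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed directly in the input space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)))   # inner product in feature space
print(poly_kernel(x, y))        # same value, computed via the kernel
```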
A Practical Guide to SVM Classification

• http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Graphic Interface

• http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html#GUI
Specificity/Selectivity vs. Sensitivity

• Sensitivity
  – True positives / (True positives + False negatives)
• Selectivity
  – True positives / (True positives + False positives)
• Specificity
  – True negatives / (True negatives + False positives)
• Accuracy
  – (True positives + True negatives) / (True positives + True negatives + False positives + False negatives)

  (a sketch computing these measures follows)
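A sketch computing the four measures above from a confusion matrix with scikit-learn; the label vectors are hypothetical.

```python
# Sketch: sensitivity, selectivity, specificity, and accuracy for a two-class task.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

# With labels=[0, 1], the confusion matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)              # true positive rate
selectivity = tp / (tp + fp)              # also known as precision
specificity = tn / (tn + fp)              # true negative rate
accuracy    = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, selectivity, specificity, accuracy)
```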
HW3 – Example

				