# Microarray Data Analysis - Classification

Chien-Yu Chen
Graduate School of Biotechnology and Bioinformatics, Yuan Ze University
## Class Prediction

- It is reasonable to conjecture that tumor classes can be differentiated by studying and contrasting their gene expression profiles.
- Class prediction (supervised classification) techniques can be applied to develop a classification rule that discriminates among the classes.
- The resulting rule can then be used to predict the class of a new tumor of unknown class from its gene expression profile.
## Model of Microarray Data Sets

A data set is organized as a matrix whose columns are genes (Gene 1, Gene 2, ..., Gene n) and whose rows are samples grouped by class (Class 1: S11, S12, S13, ...; Class 2: S21, S22, S23, ...; Class 3: S31, S32, S33, ...). The entry v_i,j is the expression value of gene j in sample i.
Lecture notes of Dr. Yen-Jen Oyang
## Learning from Training Data

- A classification task usually involves training and testing data, each consisting of a number of data instances.
- Each instance in the training set contains one "target value" (the class label) and several "attributes" (features).
## Testing Procedure

- The goal of a classifier is to produce a model that predicts the target values of instances in the testing set, given only their attributes.
## Cross Validation

- Most data classification algorithms require some parameters to be set, e.g. k in the KNN classifier and the tree-pruning threshold in the decision tree.
- One way to find an appropriate parameter setting is through k-fold cross validation, normally with k = 10.
- In k-fold cross validation, the training data set is divided into k subsets. Then k runs of the classification algorithm are conducted, with each subset serving as the test set once and the remaining (k - 1) subsets as the training set.
- The parameter values that yield the maximum accuracy in cross validation are then adopted.
- In the cross validation process, we set the parameters of the classifier to a particular combination of values of interest and then evaluate how good that combination is under one of the following schemes.
- With the leave-one-out cross validation scheme, we attempt to predict the class of each sample using the remaining samples as the training data set.
- With 10-fold cross validation, we evenly divide the training data set into 10 subsets. Each time, we test the prediction accuracy on one of the 10 subsets using the other 9 subsets as the training set.
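The k-fold procedure described above can be sketched in a few lines of Python (a minimal illustration; `train_and_score` stands in for whatever classification algorithm and parameter combination is being evaluated):

```python
import random

def k_fold_splits(data, k=10, seed=0):
    """Shuffle the data and partition it into k near-equal subsets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]

def cross_validate(data, train_and_score, k=10):
    """Run k rounds: each subset serves as the test set once, while the
    remaining (k - 1) subsets form the training set.
    Returns the mean accuracy over the k rounds."""
    folds = k_fold_splits(data, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```

To tune a parameter such as k in a KNN classifier, one would call `cross_validate` once per candidate value and adopt the value with the highest mean accuracy.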
## Naive Bayes Classifier

- The naive Bayes classifier assigns an instance s_k with attribute values (A_1 = v_1, A_2 = v_2, ..., A_m = v_m) to the class C_i with the maximum Prob(C_i | (v_1, v_2, ..., v_m)) over all i.
- The naive Bayes classifier exploits Bayes' rule and assumes independence of the attributes.
- Likelihood of s_k belonging to C_i:

$$\mathrm{Prob}(C_i \mid v_1, v_2, \ldots, v_m) = \frac{P(v_1, v_2, \ldots, v_m \mid C_i)\, P(C_i)}{P(v_1, v_2, \ldots, v_m)}$$

- Likelihood of s_k belonging to C_j:

$$\mathrm{Prob}(C_j \mid v_1, v_2, \ldots, v_m) = \frac{P(v_1, v_2, \ldots, v_m \mid C_j)\, P(C_j)}{P(v_1, v_2, \ldots, v_m)}$$

- Therefore, when comparing Prob(C_i | (v_1, v_2, ..., v_m)) and Prob(C_j | (v_1, v_2, ..., v_m)), we only need to compute P((v_1, v_2, ..., v_m) | C_i) P(C_i) and P((v_1, v_2, ..., v_m) | C_j) P(C_j), since the denominator P(v_1, v_2, ..., v_m) is the same for both.
- Under the assumption of independent attributes:

$$P(v_1, v_2, \ldots, v_m \mid C_j) = P(A_1 = v_1 \mid C_j) \times P(A_2 = v_2 \mid C_j) \times \cdots \times P(A_m = v_m \mid C_j) = \prod_{h=1}^{m} P(A_h = v_h \mid C_j)$$

- Furthermore, P(C_j) can be computed by:

$$P(C_j) = \frac{\text{number of training samples belonging to } C_j}{\text{total number of training samples}}$$
## An Example of the Naive Bayes Classifier

The weather data, with counts and probabilities:

| attribute   | value    | count (yes) | count (no) | P(value \| yes) | P(value \| no) |
|-------------|----------|-------------|------------|-----------------|----------------|
| outlook     | sunny    | 2 | 3 | 2/9 | 3/5 |
| outlook     | overcast | 4 | 0 | 4/9 | 0/5 |
| outlook     | rainy    | 3 | 2 | 3/9 | 2/5 |
| temperature | hot      | 2 | 2 | 2/9 | 2/5 |
| temperature | mild     | 4 | 2 | 4/9 | 2/5 |
| temperature | cool     | 3 | 1 | 3/9 | 1/5 |
| humidity    | high     | 3 | 4 | 3/9 | 4/5 |
| humidity    | normal   | 6 | 1 | 6/9 | 1/5 |
| windy       | false    | 6 | 2 | 6/9 | 2/5 |
| windy       | true     | 3 | 3 | 3/9 | 3/5 |
| play        | yes / no | 9 | 5 | 9/14 | 5/14 |

A new day:

| outlook | temperature | humidity | windy | play |
|---------|-------------|----------|-------|------|
| sunny   | cool        | high     | true  | ?    |
- Likelihood of yes: (2/9) × (3/9) × (3/9) × (3/9) × (9/14) ≈ 0.0053
- Likelihood of no: (3/5) × (1/5) × (4/5) × (3/5) × (5/14) ≈ 0.0206
- Therefore, the prediction is No.
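The comparison above can be reproduced directly from the counts in the weather table (a minimal Python sketch; only the numerator P(v_1, ..., v_m | C) P(C) is computed, since the denominator is common to both classes):

```python
# Counts from the weather table: counts[attribute][value] = (yes, no)
counts = {
    "outlook":     {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "temperature": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "humidity":    {"high": (3, 4), "normal": (6, 1)},
    "windy":       {"false": (6, 2), "true": (3, 3)},
}
class_counts = {"yes": 9, "no": 5}
total = 14

def likelihood(instance, cls):
    """P(v1, ..., vm | C) * P(C) under the independence assumption."""
    idx = 0 if cls == "yes" else 1
    p = class_counts[cls] / total              # prior P(C)
    for attr, value in instance.items():
        p *= counts[attr][value][idx] / class_counts[cls]
    return p

new_day = {"outlook": "sunny", "temperature": "cool",
           "humidity": "high", "windy": "true"}
l_yes = likelihood(new_day, "yes")             # ~0.0053
l_no = likelihood(new_day, "no")               # ~0.0206
prediction = "yes" if l_yes > l_no else "no"   # -> "no"
```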
## The Naive Bayes Classifier for Data Sets with Numerical Attribute Values

- One common practice for handling numerical attribute values is to assume a normal distribution for each numerical attribute.
The numeric weather data with summary statistics:

| attribute   | class | values                             | mean | std dev |
|-------------|-------|------------------------------------|------|---------|
| temperature | yes   | 83, 70, 68, 64, 69, 75, 75, 72, 81 | 73   | 6.2     |
| temperature | no    | 85, 80, 65, 72, 71                 | 74.6 | 7.9     |
| humidity    | yes   | 86, 96, 80, 65, 70, 80, 70, 90, 75 | 79.1 | 10.2    |
| humidity    | no    | 85, 90, 70, 95, 91                 | 86.2 | 9.7     |

The categorical attributes are unchanged: outlook (sunny 2/9, 3/5; overcast 4/9, 0/5; rainy 3/9, 2/5), windy (false 6/9, 2/5; true 3/9, 3/5), play (9/14, 5/14).
Let x_1, x_2, ..., x_n be the values of a numerical attribute in the training data set.

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2}$$

$$f(w) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(w-\mu)^2}{2\sigma^2}}$$
For example,

$$f(\text{temperature} = 66 \mid \text{Yes}) = \frac{1}{\sqrt{2\pi}\cdot 6.2}\, e^{-\frac{(66-73)^2}{2\cdot 6.2^2}} \approx 0.0340$$

- Likelihood of Yes = (2/9) × 0.0340 × 0.0221 × (3/9) × (9/14) ≈ 0.000036
- Likelihood of No = (3/5) × 0.0291 × 0.038 × (3/5) × (5/14) ≈ 0.000136
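The density computation above can be checked in a few lines of Python, using the table's statistics (temperature given yes: mean 73, std 6.2; humidity given yes: mean 79.1, std 10.2; the new day's numeric values 66 and 90 are those used in the slide's computation):

```python
import math

def gaussian(x, mean, std):
    """Normal density f(x), as assumed for numerical attributes."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

f_temp = gaussian(66, 73.0, 6.2)    # f(temperature = 66 | Yes) ~ 0.0340
f_hum = gaussian(90, 79.1, 10.2)    # f(humidity = 90 | Yes)    ~ 0.0221

# Likelihood of Yes for (sunny, temperature 66, humidity 90, windy true):
l_yes = (2 / 9) * f_temp * f_hum * (3 / 9) * (9 / 14)   # ~0.000036
```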
## Instance-Based Learning

- In instance-based learning, we take the k nearest training samples of a new instance (v_1, v_2, ..., v_m) and assign the new instance to the class that has the most instances among those k nearest training samples.
- Classifiers that adopt instance-based learning are commonly called KNN (k-nearest-neighbor) classifiers.
- The basic version of the KNN classifier works only for data sets with numerical values. However, extensions have been proposed for handling data sets with categorical attributes.
- If the number of training samples is sufficiently large, then it can be proved statistically that the KNN classifier can deliver the accuracy achievable with learning from the training data set.
- However, if the number of training samples is not large enough, the KNN classifier may not work well.
- If the data set is noiseless, then the 1NN classifier should work well. In general, the noisier the data set, the higher k should be set. However, the optimal k value should be determined through cross validation.
- The ranges of attribute values should be normalized before the KNN classifier is applied. There are two common normalization approaches:

$$w = \frac{v - v_{\min}}{v_{\max} - v_{\min}}$$

$$w = \frac{v - \mu}{\sigma},$$

where μ and σ² are the mean and the variance of the attribute values, respectively.
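A minimal sketch of a KNN classifier with min-max normalization (illustrative code, not from the lecture; `math.dist` computes the Euclidean distance between two points):

```python
import math
from collections import Counter

def min_max_normalize(vectors):
    """Rescale each attribute to [0, 1]: w = (v - v_min) / (v_max - v_min)."""
    mins = [min(col) for col in zip(*vectors)]
    maxs = [max(col) for col in zip(*vectors)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vec, mins, maxs)]
        for vec in vectors
    ]

def knn_predict(train, labels, query, k=3):
    """Assign query to the majority class among its k nearest training samples."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

For z-score normalization, one would instead subtract the attribute mean and divide by its standard deviation.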
## Example of the KNN Classifiers

- If a 1NN classifier is employed, the test instance is predicted as class "X".
- If a 3NN classifier is employed, the test instance is predicted as class "O".
## Alternative Similarity Functions

Let <v_r,1, v_r,2, ..., v_r,n> and <v_t,1, v_t,2, ..., v_t,n> be the gene expression vectors, i.e. the feature vectors, of samples S_r and S_t, respectively. Then the following alternative similarity functions can be employed:

- Euclidean distance:

$$\text{dissimilarity} = \sqrt{\sum_{h=1}^{n} (v_{r,h} - v_{t,h})^2}$$
- Cosine:

$$\text{similarity} = \frac{\sum_{h=1}^{n} v_{r,h}\, v_{t,h}}{\sqrt{\sum_{h=1}^{n} v_{r,h}^2}\; \sqrt{\sum_{h=1}^{n} v_{t,h}^2}}$$

- Correlation coefficient:

$$\text{similarity} = \frac{\sum_{h=1}^{n} (v_{r,h} - \mu_r)(v_{t,h} - \mu_t)}{(n-1)\,\sigma_r\,\sigma_t}, \quad \text{where}$$

$$\mu_r = \frac{1}{n}\sum_{h=1}^{n} v_{r,h}, \qquad \mu_t = \frac{1}{n}\sum_{h=1}^{n} v_{t,h}$$

$$\sigma_r = \sqrt{\frac{1}{n-1}\sum_{h=1}^{n} (v_{r,h} - \mu_r)^2}, \qquad \sigma_t = \sqrt{\frac{1}{n-1}\sum_{h=1}^{n} (v_{t,h} - \mu_t)^2}$$
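The three similarity functions can be written directly from the formulas (a Python sketch; each function takes two feature vectors of equal length):

```python
import math

def euclidean_dissimilarity(r, t):
    """Euclidean distance between the two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, t)))

def cosine_similarity(r, t):
    """Dot product normalized by the two vector norms."""
    dot = sum(a * b for a, b in zip(r, t))
    norm_r = math.sqrt(sum(a * a for a in r))
    norm_t = math.sqrt(sum(b * b for b in t))
    return dot / (norm_r * norm_t)

def correlation_similarity(r, t):
    """Pearson correlation coefficient of the two vectors."""
    n = len(r)
    mu_r, mu_t = sum(r) / n, sum(t) / n
    cov = sum((a - mu_r) * (b - mu_t) for a, b in zip(r, t)) / (n - 1)
    sd_r = math.sqrt(sum((a - mu_r) ** 2 for a in r) / (n - 1))
    sd_t = math.sqrt(sum((b - mu_t) ** 2 for b in t) / (n - 1))
    return cov / (sd_r * sd_t)
```

Note that the cosine and correlation functions are similarities (higher means more alike), while the Euclidean function is a dissimilarity.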
## Importance of Feature Selection

- Inclusion of features that are not correlated with the classification decision may make the problem even more complicated.
- For example, in the data set shown on the following page, inclusion of the feature corresponding to the y-axis causes incorrect prediction of the marked test instance, if a 3NN classifier is employed.
[Figure: a scatter plot of "o" and "x" samples separated by the vertical line x = 10, with a marked test instance to be classified.]

- It is apparent that the "o"s and "x"s are separated by x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.
## Feature Selection for Microarray Data Analysis

- In microarray data analysis, it is highly desirable to identify the genes that are correlated with the classes of samples.
- For example, the Leukemia data set contains 7129 genes. We want to identify the genes that distinguish the different disease types.
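One simple way to identify such genes is to rank each gene by a class-separation score. The signal-to-noise statistic |μ₁ − μ₂| / (σ₁ + σ₂) is a common choice for two-class microarray data; this is an illustrative choice, not a score prescribed by the slides:

```python
import math

def signal_to_noise(values_class1, values_class2):
    """Score a gene by |mean1 - mean2| / (std1 + std2); higher means the
    gene separates the two classes better."""
    def stats(xs):
        m = sum(xs) / len(xs)
        s = math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))
        return m, s
    m1, s1 = stats(values_class1)
    m2, s2 = stats(values_class2)
    return abs(m1 - m2) / (s1 + s2)

def top_genes(expr_by_gene, labels, n=50):
    """Rank genes by the score and keep the n best.
    expr_by_gene maps a gene name to its expression values across samples;
    labels gives each sample's class (1 or 0)."""
    def score(gene):
        vals = expr_by_gene[gene]
        c1 = [v for v, y in zip(vals, labels) if y == 1]
        c2 = [v for v, y in zip(vals, labels) if y == 0]
        return signal_to_noise(c1, c2)
    return sorted(expr_by_gene, key=score, reverse=True)[:n]
```

The selected genes would then be the only features fed to the classifier.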
## Parameter Setting through Cross Validation

- When carrying out data classification, we normally need to set one or more parameters associated with the data classification algorithm.
- For example, we need to set the value of k for the KNN classifier.
- The typical approach is to conduct cross validation to find the optimal value.
## Linearly Separable and Non-Linearly Separable

- The example above shows a linearly separable case.
- The following is an example of a non-linearly separable case.
## An Example of a Non-Separable Case
## Support Vector Machines (SVM)

- Over the last few years, the SVM has been established as one of the preferred approaches to many problems in pattern recognition and regression estimation.
- In most cases, SVM generalization performance has been found to be equal to, or much better than, that of conventional methods.
## Optimal Separation

- Suppose we are interested in separating a set of training data vectors that belong to two distinct classes.
- If the data are separable in the input space, there may exist many hyperplanes that achieve such a separation.
- We are interested in finding the optimal hyperplane classifier: the one with the maximal margin of separation between the two classes.
## A Practical Guide to SVM Classification

- http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
## Graphic Interface

- http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html#GUI
## Specificity/Selectivity vs. Sensitivity

- Sensitivity: True positives / (True positives + False negatives)
- Selectivity: True positives / (True positives + False positives)
- Specificity: True negatives / (True negatives + False positives)
- Accuracy: (True positives + True negatives) / (True positives + True negatives + False positives + False negatives)
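The four quantities follow directly from the confusion-matrix counts (a small Python sketch; tp, tn, fp, fn denote true positives, true negatives, false positives, and false negatives):

```python
def sensitivity(tp, fn):
    """TP / (TP + FN): fraction of actual positives recovered."""
    return tp / (tp + fn)

def selectivity(tp, fp):
    """TP / (TP + FP): also known as precision or positive predictive value."""
    return tp / (tp + fp)

def specificity(tn, fp):
    """TN / (TN + FP): fraction of actual negatives recovered."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """(TP + TN) / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)
```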
