Microarray Data Analysis - Classification


Chien-Yu Chen
Graduate School of Biotechnology and
Bioinformatics, Yuan Ze University
Class Prediction
We conjecture that tumor classes can be
differentiated by studying and contrasting their
gene expression profiles.
Class prediction, i.e. supervised classification,
can be applied to develop a classification rule
that discriminates among the classes.
The resulting rule can be used to predict the
class of a new tumor of unknown class based
on its gene expression profile.
Model of Microarray Data Sets
                  Gene 1   Gene 2   ...   Gene n
Class 1   S11
          S12
          S13
          :
Class 2   S21              v_{i,j}
          S22
          S23
          :
Class 3   S31
          S32
          S33
(v_{i,j}: expression value of the j-th gene in the i-th sample)
Lecture notes of Dr. Yen-Jen Oyang
Learning From Training Data
A classification task usually involves training
and testing data, each consisting of a number
of data instances.
Each instance in the training set contains one
"target value" (class label) and several
"attributes" (features).
Testing Procedure
The goal of a classifier is to produce a model
that predicts the target values of the data
instances in the testing set, for which only the
attributes are given.
Cross Validation
Most data classification algorithms require some
parameters to be set, e.g. k in KNN classifier and the
tree pruning threshold in the decision tree.
One way to find an appropriate parameter setting is
through k-fold cross validation, normally with k=10
(this k is unrelated to the k of the KNN classifier).
In k-fold cross validation, the training data set is
divided into k subsets. The classification algorithm
is then run k times, with each subset serving as the
test set once while the remaining (k-1) subsets
serve as the training set.
The parameter values that yield maximum accuracy
in cross validation are then adopted.
In the cross validation process, we set the
parameters of the classifier to a particular
combination of values that we are interested in
and then evaluate how good the combination is
using one of several alternative schemes.
With the leave-one-out cross validation scheme,
we attempt to predict the class of each sample
using the remaining samples as the training data
set.
With 10-fold cross validation, we evenly divide
the training data set into 10 subsets. Each time,
we test the prediction accuracy of one of the 10
subsets using the other 9 subsets as the training
set.
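The two schemes above differ only in the number of subsets, which the following minimal sketch tries to make concrete. The function name `kfold_splits` is hypothetical, and the sketch assumes the samples have already been shuffled (it simply cuts the index range into consecutive blocks).

```python
def kfold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Assumption: samples are already in random order, so consecutive
    blocks of indices are acceptable folds.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the samples as evenly as possible: the first
    # (n_samples % k) folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 10-fold cross validation on 20 samples: each test fold holds 2 samples.
ten_fold = list(kfold_splits(20, 10))

# Leave-one-out is the special case where k equals the number of samples.
leave_one_out = list(kfold_splits(5, 5))
```

Each sample appears in exactly one test fold, so every sample is predicted once by a model that never saw it during training.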
Instance-Based Learning
In instance-based learning, we take the k nearest
training samples of a new instance (v1, v2, …,
vm) and assign the new instance to the class
that has the most instances among those k
nearest training samples.
Classifiers that adopt instance-based learning
are commonly called KNN classifiers.
The basic version of the KNN classifier
works only for data sets with numerical attributes.
However, extensions have been proposed for
handling data sets with categorical attributes.
If the number of training samples is
sufficiently large, then it can be proved
statistically that the KNN classifier can deliver
the accuracy achievable with learning from
the training data set.
However, if the number of training samples is
not large enough, the KNN classifier may not
work well.
If the data set is noiseless, then the 1NN classifier should work
well. In general, the noisier the data set is, the higher k
should be set. However, the optimal value of k should be
determined through cross validation.
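The ranges of attribute values should be normalized before the
KNN classifier is applied. There are two common normalization
approaches, where v is the original attribute value and w is the
normalized value:
\[ w = \frac{v - v_{\min}}{v_{\max} - v_{\min}} \]
\[ w = \frac{v - \mu}{\sigma} \]
where $\mu$ and $\sigma^2$ are the mean and the variance
of the attribute values, respectively.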
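The two normalization approaches above can be sketched as follows. The function names are hypothetical, and the sketch assumes the sample variance (divisor n-1) for the second approach, which the slide does not specify.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to [0, 1] using w = (v - v_min) / (v_max - v_min)."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("all values are identical; the range is zero")
    return [(v - v_min) / (v_max - v_min) for v in values]


def z_score_normalize(values):
    """Standardize values using w = (v - mu) / sigma.

    Assumption: sigma is the sample standard deviation (divisor n - 1).
    """
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; sigma is zero")
    return [(v - mu) / sigma for v in values]


# Made-up expression values of one gene across three samples.
gene_values = [2.0, 4.0, 6.0]
scaled = min_max_normalize(gene_values)
standardized = z_score_normalize(gene_values)
```

Normalization is applied per attribute (per gene), so that an attribute with a large numerical range does not dominate the distance computation.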
Example of the KNN Classifiers
If a 1NN classifier is employed, then the
test instance is predicted as "X".
If a 3NN classifier is employed, then the
test instance is predicted as "O".
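The contrast between 1NN and 3NN can be reproduced with a minimal KNN sketch. The coordinates below are made up for illustration (the original figure is not available), and `knn_predict` is a hypothetical helper using Euclidean distance and a simple majority vote.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Return the majority class among the k training samples nearest to query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training samples: one "X" very close to the query,
# two "O" samples slightly farther away, and one distant "X".
train_x = [(1.0, 1.0), (1.6, 1.0), (1.0, 1.6), (4.0, 4.0)]
train_y = ["X", "O", "O", "X"]
query = (1.1, 1.0)

prediction_1nn = knn_predict(train_x, train_y, query, k=1)
prediction_3nn = knn_predict(train_x, train_y, query, k=3)
print(prediction_1nn, prediction_3nn)
```

With k=1 the single closest sample decides the class; with k=3 the two "O" neighbors outvote it.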
Alternative Similarity Functions
Let <v_{r,1}, v_{r,2}, …, v_{r,n}> and <v_{t,1}, v_{t,2}, …, v_{t,n}>
be the gene expression vectors, i.e. the
feature vectors, of samples S_r and S_t,
respectively. Then, the following alternative
similarity functions can be employed:
– Euclidean distance:
\[ \text{dissimilarity} = \sqrt{\sum_{h=1}^{n} \left( v_{r,h} - v_{t,h} \right)^2} \]
– Cosine:
\[ \text{Similarity} = \frac{\sum_{h=1}^{n} v_{r,h}\, v_{t,h}}{\sqrt{\sum_{h=1}^{n} v_{r,h}^2}\; \sqrt{\sum_{h=1}^{n} v_{t,h}^2}} \]
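– Correlation coefficient:
\[ \text{Similarity} = \frac{\frac{1}{n-1} \sum_{h=1}^{n} \left( v_{r,h} - \mu_r \right) \left( v_{t,h} - \mu_t \right)}{\sigma_r\, \sigma_t} \]
, where
\[ \mu_r = \frac{1}{n} \sum_{h=1}^{n} v_{r,h}, \qquad \mu_t = \frac{1}{n} \sum_{h=1}^{n} v_{t,h} \]
\[ \sigma_r = \sqrt{\frac{1}{n-1} \sum_{h=1}^{n} \left( v_{r,h} - \mu_r \right)^2}, \qquad \sigma_t = \sqrt{\frac{1}{n-1} \sum_{h=1}^{n} \left( v_{t,h} - \mu_t \right)^2} \]
The factor 1/(n-1) in the numerator matches the normalization
used in $\sigma_r$ and $\sigma_t$, so the similarity lies in [-1, 1].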
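The three functions can be sketched in plain Python as follows. The function names are hypothetical, and the correlation is computed as the Pearson coefficient with the n-1 normalization, which is an assumption about the intended definition.

```python
import math


def euclidean_distance(r, t):
    """Dissimilarity: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, t)))


def cosine_similarity(r, t):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(r, t))
    norm_r = math.sqrt(sum(a * a for a in r))
    norm_t = math.sqrt(sum(b * b for b in t))
    return dot / (norm_r * norm_t)


def correlation_similarity(r, t):
    """Pearson correlation coefficient (assumed n - 1 normalization)."""
    n = len(r)
    mu_r, mu_t = sum(r) / n, sum(t) / n
    sigma_r = math.sqrt(sum((a - mu_r) ** 2 for a in r) / (n - 1))
    sigma_t = math.sqrt(sum((b - mu_t) ** 2 for b in t) / (n - 1))
    covariance = sum((a - mu_r) * (b - mu_t) for a, b in zip(r, t)) / (n - 1)
    return covariance / (sigma_r * sigma_t)


# Made-up expression vectors of two samples over three genes.
s_r = [1.0, 2.0, 3.0]
s_t = [2.0, 4.0, 6.0]       # same profile shape, doubled magnitude
s_u = [3.0, 2.0, 1.0]       # reversed profile

dist_rt = euclidean_distance(s_r, s_t)
cos_rt = cosine_similarity(s_r, s_t)
corr_rt = correlation_similarity(s_r, s_t)
corr_ru = correlation_similarity(s_r, s_u)
```

In this example S_r and S_t are far apart in Euclidean terms but have cosine and correlation similarity of 1, which illustrates why the choice of similarity function matters when profile shape is more relevant than magnitude.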
Importance of Feature Selection
Including features that are not correlated with
the classification decision can make the
problem harder and degrade prediction accuracy.
For example, in the data set shown on the
following slide, inclusion of the feature
corresponding to the y-axis causes incorrect
prediction of the marked test instance
if a 3NN classifier is employed.
[Figure: scatter plot of "o" and "x" samples in the x-y plane, with the vertical line x=10 marked]
It is apparent that the "o"s and "x"s are separated by
x=10. If only the attribute corresponding to the x-axis
were selected, then the 3NN classifier would predict
the class of the test instance correctly.
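The effect described above can be imitated with a small sketch. The coordinates are made up (the original figure is not available) and `knn_predict` is a hypothetical helper that restricts the distance computation to a chosen subset of attributes.

```python
import math
from collections import Counter


def knn_predict(train, query, k, features):
    """Majority vote among the k nearest samples, using only the
    attribute positions listed in `features`."""
    def project(point):
        return tuple(point[i] for i in features)

    q = project(query)
    ranked = sorted(train, key=lambda item: math.dist(project(item[0]), q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (x, y) samples: every "o" has x < 10, every "x" has x > 10.
# The y values of the "o" samples are widely scattered (irrelevant noise).
train = [
    ((9.0, 1.0), "o"), ((9.3, 9.5), "o"), ((8.8, 0.5), "o"), ((9.1, 9.0), "o"),
    ((10.5, 5.2), "x"), ((10.8, 4.9), "x"), ((11.0, 5.5), "x"),
]
test_instance = (9.5, 5.0)   # x < 10, so its true class is "o"

with_both_features = knn_predict(train, test_instance, k=3, features=(0, 1))
x_feature_only = knn_predict(train, test_instance, k=3, features=(0,))
print(with_both_features, x_feature_only)
```

With both attributes, the noisy y values push all "o" samples far away and the three nearest neighbors are "x"; with the x attribute alone, the three nearest neighbors are "o".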
Feature Selection for Microarray
Data Analysis
In microarray data analysis, it is highly
desirable to identify those genes that are
correlated to the classes of samples.
For example, in the Leukemia data set, there
are 7129 genes. We want to identify those
genes that lead to different disease types.
Parameter Setting through Cross
Validation
When carrying out data classification, we
normally need to set one or more parameters
associated with the data classification
algorithm.
For example, we need to set the value of k
with the KNN classifier.
The typical approach is to conduct cross
validation to find out the optimal value.
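Choosing k for a KNN classifier by cross validation can be sketched as below. This is a minimal illustration that assumes leave-one-out cross validation, a made-up one-dimensional data set with two noisy samples, and hypothetical helper names; ties in accuracy are resolved in favor of the smallest candidate k.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training samples nearest to query."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k):
    """Predict each sample from all the others and return the hit rate."""
    correct = 0
    for i, (point, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, point, k) == label:
            correct += 1
    return correct / len(samples)


# Made-up data: class "A" clusters near 1.3, class "B" near 3.3,
# plus one mislabeled-looking sample inside each cluster (noise).
samples = [
    ((1.0,), "A"), ((1.2,), "A"), ((1.4,), "A"), ((1.6,), "A"), ((3.1,), "A"),
    ((3.0,), "B"), ((3.2,), "B"), ((3.4,), "B"), ((3.6,), "B"), ((1.5,), "B"),
]

candidates = [1, 3, 5]
accuracy = {k: leave_one_out_accuracy(samples, k) for k in candidates}
# max() returns the first maximal candidate, i.e. the smallest best k.
best_k = max(candidates, key=lambda k: accuracy[k])
print(accuracy, best_k)
```

On this noisy toy data the 1NN classifier is misled by the two outlying samples, while larger k values smooth them out, in line with the earlier remark that noisier data calls for a larger k.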
Linearly Separable and Non-Linearly
Separable
The example above shows a linearly
separable case.
The following is an example of a non-linearly
separable case.
An Example of a Non-Separable Case
Support Vector Machines (SVM)
Over the last few years, the SVM has been
established as one of the preferred
approaches to many problems in pattern
recognition and regression estimation.
In most cases, the generalization
performance of the SVM has been found to be
equal to or better than that of
conventional methods.
Optimal Separation
Suppose we are interested in finding out how
to separate a set of training data vectors that
belong to two distinct classes.
If the data are separable in the input space,
there may exist many hyperplanes that can
do such a separation.
We are interested in finding the optimal
hyperplane classifier – the one with the
maximal margin of separation between the
two classes.
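The notion of margin can be made concrete with a short sketch that compares two separating hyperplanes on made-up two-dimensional data. It assumes a hyperplane written as w·x + b = 0 and class labels +1 and -1; `geometric_margin` is a hypothetical helper, and the sketch only measures margins, it does not solve the SVM optimization problem.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any training point to the hyperplane
    w.x + b = 0. A positive value means every point is on its correct side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Made-up training vectors from two classes.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 1.0)]
labels = [-1, -1, +1, +1]

# Two hyperplanes that both separate the data: x1 = 2 and x1 = 1.5.
margin_a = geometric_margin((1.0, 0.0), -2.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), -1.5, points, labels)
print(margin_a, margin_b)
```

Both hyperplanes classify the training data correctly, but the first leaves a larger gap to the nearest training vector and is therefore preferred under the maximal-margin criterion.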
Kernels
\[ \Phi : \mathbb{R}^N \to F \]
\[ k(x, y) := \left( \Phi(x) \cdot \Phi(y) \right) \]
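As an illustration of the definition above, the sketch below checks numerically that a kernel evaluated in the input space equals a dot product in a feature space. The degree-2 polynomial kernel on two-dimensional inputs is chosen here as an assumption for illustration; the slide does not name a specific kernel.

```python
import math


def phi(x):
    """Explicit feature map from R^2 to R^3 for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without leaving the input space."""
    return dot(x, y) ** 2


x = (1.0, 2.0)
y = (3.0, 4.0)

kernel_value = poly2_kernel(x, y)          # (1*3 + 2*4)^2
feature_space_value = dot(phi(x), phi(y))  # dot product after mapping
print(kernel_value, feature_space_value)
```

The two values agree up to floating-point rounding, which is the point of a kernel: the classifier can work in the feature space F while only ever evaluating k(x, y) on the original vectors.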
A Practical Guide to SVM
Classification
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Support Vector Machines (SVM)
Graphic Interface
http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html#GUI
Specificity/Selectivity vs.
Sensitivity
Sensitivity
– True positives / (True positives + False negatives)
Selectivity
– True positives / (True positives + False positives)
Specificity
– True negatives / (True negatives + False positives)
Accuracy
– (True positives + True negatives) / (True positives + True
negatives + False positives + False negatives)
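The four measures can be computed directly from the confusion counts, as in the following minimal sketch. The function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four measures from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "selectivity": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical test-set outcome: 50 positive and 50 negative samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are unbalanced, since accuracy alone can look high even if one class is rarely predicted correctly.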
HW3 – Example