Posted on: 9/14/2011
Microarray Data Analysis - Classification
Chien-Yu Chen
Graduate School of Biotechnology and Bioinformatics, Yuan Ze University

Class Prediction
It can be conjectured that tumor classes can be differentiated by studying and contrasting their gene expression profiles. We can apply class prediction, or supervised classification, techniques to develop a classification rule that discriminates among them. The rule can then be used to predict the class of a new tumor of unknown class from its gene expression profile.

Model of Microarray Data Sets
[Figure: the data set drawn as a matrix. Columns are Gene 1, Gene 2, ..., Gene n. Rows are samples grouped by class: S11, S12, S13, ... in Class 1; S21, S22, S23, ... in Class 2; S31, S32, S33, ... in Class 3. A generic entry is labeled v_{i,j}.]
Lecture notes of Dr. Yen-Jen Oyang

Learning From Training Data
A classification task usually involves training data and testing data, each consisting of data instances. Each instance in the training set contains one "target value" (class label) and several "attributes" (features).

Testing procedure
The goal of a classifier is to produce a model that predicts the target values of the data instances in the testing set, for which only the attributes are given.

Cross Validation
Most data classification algorithms require some parameters to be set, e.g. k in the KNN classifier and the tree pruning threshold in the decision tree. One way to find an appropriate parameter setting is k-fold cross validation, normally with k=10 (this k is the number of folds, not the k of the KNN classifier). In k-fold cross validation, the training data set is divided into k subsets. Then k runs of the classification algorithm are conducted, with each subset serving as the test set once while the remaining (k-1) subsets are used as the training set. The parameter values that yield the maximum accuracy in cross validation are then adopted.

In the cross validation process, we set the parameters of the classifier to a particular combination of values that we are interested in and then evaluate how good that combination is under one of several alternative schemes.
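The k-fold procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy one-dimensional data set, the interleaved way of forming folds, and the names `knn_predict`, `cross_val_accuracy`, and `select_k` are all hypothetical and do not come from the lecture notes. The algorithm being tuned is a small nearest-neighbor classifier (majority vote among the `k_neighbors` closest training samples), and the parameter being chosen is `k_neighbors`.

```python
import math
from collections import Counter


def knn_predict(train, query, k_neighbors):
    """Majority vote among the k_neighbors training samples nearest to query."""
    # train is a list of (feature_vector, label) pairs.
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k_neighbors]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(data, k_neighbors, n_folds):
    """Fraction of samples predicted correctly over n_folds cross validation runs."""
    # Assumption: fold f holds every n_folds-th sample, starting at index f.
    folds = [data[f::n_folds] for f in range(n_folds)]
    correct = 0
    for f, test_fold in enumerate(folds):
        # The remaining (n_folds - 1) subsets form the training set.
        train = [s for g, fold in enumerate(folds) if g != f for s in fold]
        correct += sum(knn_predict(train, x, k_neighbors) == y
                       for x, y in test_fold)
    return correct / len(data)


def select_k(data, candidates, n_folds):
    """Return the candidate parameter value with the highest accuracy."""
    return max(candidates,
               key=lambda k: cross_val_accuracy(data, k, n_folds))


# Hypothetical toy data: class "A" near 0, class "B" near 10,
# plus one noisy "B" sample placed inside the "A" region.
TOY_DATA = [
    ((0.0,), "A"), ((1.0,), "A"), ((2.1,), "A"), ((3.3,), "A"), ((2.6,), "B"),
    ((10.0,), "B"), ((11.0,), "B"), ((12.1,), "B"), ((13.3,), "B"), ((14.6,), "B"),
]

for k in (1, 3, 7):
    print(k, cross_val_accuracy(TOY_DATA, k, n_folds=5))
print("selected:", select_k(TOY_DATA, [1, 3, 7], n_folds=5))
```

On this toy set the middle setting scores best: a single neighbor is misled by the noisy sample, while a neighborhood that is too large reaches into the other class.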
With the leave-one-out cross validation scheme, we attempt to predict the class of each sample using the remaining samples as the training data set. With 10-fold cross validation, we evenly divide the training data set into 10 subsets. Each time, we test the prediction accuracy on one of the 10 subsets using the other 9 subsets as the training set.

Naive Bayes Classifier
The naive Bayes classifier assigns an instance $s_k$ with attribute values $(A_1 = v_1, A_2 = v_2, \ldots, A_m = v_m)$ to the class $C_i$ that maximizes $\mathrm{Prob}(C_i \mid v_1, v_2, \ldots, v_m)$ over all $i$. The naive Bayes classifier exploits Bayes's rule and assumes independence of the attributes.

Likelihood of $s_k$ belonging to $C_i$:
$$\mathrm{Prob}(C_i \mid v_1, v_2, \ldots, v_m) = \frac{P(v_1, v_2, \ldots, v_m \mid C_i)\, P(C_i)}{P(v_1, v_2, \ldots, v_m)}$$
Likelihood of $s_k$ belonging to $C_j$:
$$\mathrm{Prob}(C_j \mid v_1, v_2, \ldots, v_m) = \frac{P(v_1, v_2, \ldots, v_m \mid C_j)\, P(C_j)}{P(v_1, v_2, \ldots, v_m)}$$
The denominators are identical. Therefore, when comparing $\mathrm{Prob}(C_i \mid v_1, v_2, \ldots, v_m)$ and $\mathrm{Prob}(C_j \mid v_1, v_2, \ldots, v_m)$, we only need to compute $P(v_1, v_2, \ldots, v_m \mid C_i)\, P(C_i)$ and $P(v_1, v_2, \ldots, v_m \mid C_j)\, P(C_j)$.

Under the assumption of independent attributes,
$$P(v_1, v_2, \ldots, v_m \mid C_j) = P(A_1 = v_1 \mid C_j)\, P(A_2 = v_2 \mid C_j) \cdots P(A_m = v_m \mid C_j) = \prod_{h=1}^{m} P(A_h = v_h \mid C_j)$$
Furthermore, $P(C_j)$ can be computed by
$$P(C_j) = \frac{\text{number of training samples belonging to } C_j}{\text{total number of training samples}}$$

An Example of the Naive Bayes Classifier
The weather data, with counts and probabilities (the yes and no columns refer to the class "play"):

attribute    value     yes  no   fraction (yes)  fraction (no)
outlook      sunny     2    3    2/9             3/5
outlook      overcast  4    0    4/9             0/5
outlook      rainy     3    2    3/9             2/5
temperature  hot       2    2    2/9             2/5
temperature  mild      4    2    4/9             2/5
temperature  cool      3    1    3/9             1/5
humidity     high      3    4    3/9             4/5
humidity     normal    6    1    6/9             1/5
windy        false     6    2    6/9             2/5
windy        true      3    3    3/9             3/5
play                   9    5    9/14            5/14

A new day: outlook = sunny, temperature = cool, humidity = high, windy = true, play = ?

Likelihood of yes:
$$\frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.0053$$
Likelihood of no:
$$\frac{3}{5} \times \frac{1}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.0206$$
Therefore, the prediction is No.

The Naive Bayes Classifier for Data Sets with Numerical Attribute Values
One common practice for handling numerical attribute values is to assume that each numerical attribute follows a normal distribution.

The numeric weather data with summary statistics:

attribute    value     yes  no   fraction (yes)  fraction (no)
outlook      sunny     2    3    2/9             3/5
outlook      overcast  4    0    4/9             0/5
outlook      rainy     3    2    3/9             2/5
windy        false     6    2    6/9             2/5
windy        true      3    3    3/9             3/5
play                   9    5    9/14            5/14

attribute    class  values                              mean   std dev
temperature  yes    83, 70, 68, 64, 69, 75, 75, 72, 81  73     6.2
temperature  no     85, 80, 65, 72, 71                  74.6   7.9
humidity     yes    86, 96, 80, 65, 70, 80, 70, 90, 75  79.1   10.2
humidity     no     85, 90, 70, 95, 91                  86.2   9.7

Let $x_1, x_2, \ldots, x_n$ be the values of a numerical attribute in the training data set. Then
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}, \qquad f(w) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(w - \mu)^2}{2\sigma^2}}$$

For example,
$$f(\text{temperature} = 66 \mid \text{Yes}) = \frac{1}{\sqrt{2\pi} \times 6.2}\, e^{-\frac{(66 - 73)^2}{2 \times 6.2^2}} = 0.0340$$
Likelihood of Yes:
$$\frac{2}{9} \times 0.0340 \times 0.0221 \times \frac{3}{9} \times \frac{9}{14} = 0.000036$$
Likelihood of No:
$$\frac{3}{5} \times 0.0291 \times 0.038 \times \frac{3}{5} \times \frac{5}{14} = 0.000136$$

Instance-Based Learning
In instance-based learning, we take the k nearest training samples of a new instance $(v_1, v_2, \ldots, v_m)$ and assign the new instance to the class that has the most instances among those k nearest training samples. Classifiers that adopt instance-based learning are commonly called KNN classifiers.

The basic version of the KNN classifier works only for data sets with numerical values. However, extensions have been proposed for handling data sets with categorical attributes. If the number of training samples is sufficiently large, then it can be proved statistically that the KNN classifier can deliver the accuracy achievable with learning from the training data set.
If the number of training samples is not large enough, however, the KNN classifier may not work well.

If the data set is noiseless, then the 1NN classifier should work well. In general, the noisier the data set is, the higher k should be set. However, the optimal value of k should be determined through cross validation. The ranges of the attribute values should be normalized before the KNN classifier is applied. There are two common normalization approaches, mapping an attribute value $v$ to a normalized value $w$:
$$w = \frac{v - v_{\min}}{v_{\max} - v_{\min}}$$
$$w = \frac{v - \mu}{\sigma}$$
where $\mu$ and $\sigma^2$ are the mean and the variance of the attribute values, respectively.

Example of the KNN Classifiers
[Figure: training samples labeled "X" and "O" around a test instance.]
If a 1NN classifier is employed, then the prediction for the test instance is "X". If a 3NN classifier is employed, then the prediction for the test instance is "O".

Alternative Similarity Functions
Let $\langle v_{r,1}, v_{r,2}, \ldots, v_{r,n} \rangle$ and $\langle v_{t,1}, v_{t,2}, \ldots, v_{t,n} \rangle$ be the gene expression vectors, i.e. the feature vectors, of samples $S_r$ and $S_t$, respectively. Then the following alternative similarity functions can be employed:
– Euclidean distance:
$$\text{dissimilarity} = \sqrt{\sum_{h=1}^{n} (v_{r,h} - v_{t,h})^2}$$
– Cosine:
$$\text{similarity} = \frac{\sum_{h=1}^{n} v_{r,h}\, v_{t,h}}{\sqrt{\sum_{h=1}^{n} v_{r,h}^2}\, \sqrt{\sum_{h=1}^{n} v_{t,h}^2}}$$
– Correlation coefficient:
$$\text{similarity} = \frac{1}{n-1} \cdot \frac{\sum_{h=1}^{n} (v_{r,h} - \mu_r)(v_{t,h} - \mu_t)}{\sigma_r\, \sigma_t}$$
where
$$\mu_r = \frac{1}{n} \sum_{h=1}^{n} v_{r,h}, \qquad \mu_t = \frac{1}{n} \sum_{h=1}^{n} v_{t,h}$$
$$\sigma_r = \sqrt{\frac{1}{n-1} \sum_{h=1}^{n} (v_{r,h} - \mu_r)^2}, \qquad \sigma_t = \sqrt{\frac{1}{n-1} \sum_{h=1}^{n} (v_{t,h} - \mu_t)^2}$$

Importance of Feature Selection
Including features that are not correlated with the classification decision may make the problem more complicated. For example, in the data set shown in the following figure, including the feature corresponding to the y-axis causes an incorrect prediction for the marked test instance if a 3NN classifier is employed.

[Figure: scatter plot of "o" and "x" samples with axes x and y and a vertical line at x=10.]
It is apparent that the "o"s and the "x"s are separated by x=10.
If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.

Feature Selection for Microarray Data Analysis
In microarray data analysis, it is highly desirable to identify those genes that are correlated with the classes of the samples. For example, the Leukemia data set contains 7129 genes. We want to identify those genes that lead to different disease types.

Parameter Setting through Cross Validation
When carrying out data classification, we normally need to set one or more parameters associated with the data classification algorithm. For example, we need to set the value of k for the KNN classifier. The typical approach is to conduct cross validation to find the optimal value.

Linearly Separable and Non-Linearly Separable
The example above shows a linearly separable case. The following is an example of a non-linearly separable case.
[Figure: a non-linearly separable data set.]

An Example of a Non-Separable Case
[Figure: a non-separable data set.]

Support Vector Machines (SVM)
Over the last few years, the SVM has been established as one of the preferred approaches to many problems in pattern recognition and regression estimation. In most cases, the generalization performance of SVMs has been found to be either equal to or much better than that of conventional methods.

Optimal Separation
Suppose we are interested in finding out how to separate a set of training data vectors that belong to two distinct classes. If the data are separable in the input space, there may exist many hyperplanes that can perform such a separation. We are interested in finding the optimal hyperplane classifier: the one with the maximal margin of separation between the two classes.
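To make the idea of a margin concrete, the sketch below scores two hand-picked separating hyperplanes on a tiny two-class data set and keeps the one with the larger margin. It is not an SVM training procedure: the four sample points, the two candidate hyperplanes, and the function name `margin` are assumptions chosen for illustration. The margin of a hyperplane $w \cdot x + b = 0$ is taken here as the smallest value of $y_i (w \cdot x_i + b) / \lVert w \rVert$ over the training samples, with labels $y_i \in \{+1, -1\}$.

```python
import math


def margin(w, b, samples):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0.

    samples is a list of (x, y) pairs with y in {+1, -1}. A positive
    result means every sample lies on the correct side of the hyperplane.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in samples
    )


# Hypothetical two-class training data in the plane.
SAMPLES = [
    ((1.0, 1.0), -1), ((2.0, 0.0), -1),
    ((4.0, 4.0), +1), ((5.0, 3.0), +1),
]

# Two hyperplanes (w, b) that both separate the classes.
CANDIDATES = {
    "x1 = 2.5": ((1.0, 0.0), -2.5),
    "x1 + x2 = 5": ((1.0, 1.0), -5.0),
}

for name, (w, b) in CANDIDATES.items():
    print(name, round(margin(w, b, SAMPLES), 4))

# Keep the candidate with the largest margin.
BEST = max(CANDIDATES, key=lambda name: margin(*CANDIDATES[name], SAMPLES))
print("larger margin:", BEST)
```

Both candidates classify every training point correctly, yet the first passes very close to one sample. An SVM searches over all hyperplanes for the one that maximizes this quantity rather than comparing a fixed list.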
A Practical Guide to SVM Classification
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

[Figure: Support Vector Machines (SVM)]

Graphic Interface
http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html#GUI

Specificity/Selectivity vs. Sensitivity
– Sensitivity: True positives / (True positives + False negatives)
– Selectivity: True positives / (True positives + False positives)
– Specificity: True negatives / (True negatives + False positives)
– Accuracy: (True positives + True negatives) / (True positives + True negatives + False positives + False negatives)
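The four measures above follow directly from the counts of a two-class confusion matrix. The short sketch below computes them; the function name `classification_metrics` and the example counts are hypothetical, and the code assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four measures from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives and false negatives. Assumes no denominator is zero.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "selectivity": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts for 100 test samples.
METRICS = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in METRICS.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are unbalanced, because a classifier that always predicts the majority class can still reach a high accuracy.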