
```
MAS 622J Course Project

Classification of Affective States - GP Semi-Supervised Learning, SVM and kNN

Hyungil Ahn (hiahn@media.mit.edu)

Objective & Dataset
• Objective: recognize the affective states of a child solving a puzzle
• Affective dataset
  - 1024 features from face, posture, and game
  - 3 affective states, with labels annotated by teachers:
    High interest (61), Low interest (59), Refreshing (16)



Binary Classification
High interest (61 samples) vs. Low Interest or Refreshing (75 samples)



Approaches

- Semi-supervised learning: Gaussian Process (GP)
- Support Vector Machine (SVM)
- k-Nearest Neighbor (kNN, k = 1)

GP Semi-Supervised Learning


Given [equation missing], predict the labels of the unlabeled points.

Assume the data and the data-generation process:
- X: inputs
- y: vector of labels
- t: vector of hidden soft labels
- Each label is binary [equation missing]
- Final classifier: y = sign[t]
- Define a similarity function [equation missing]

Infer [equation missing] given [equation missing]

GP Semi-Supervised Learning
Infer [equation missing] given [equation missing] → Bayesian model

- [equation missing]: prior of the classifier
- [equation missing]: likelihood of the classifier given the labeled data

GP Semi-Supervised Learning
→ How to model the prior and the likelihood?

The prior: using a GP, [equation missing]
(Soft labels vary smoothly across the data manifold!)

The likelihood: [equation missing]

GP Semi-Supervised Learning




EP (Expectation Propagation) → approximates the posterior as a Gaussian
Select the hyperparameters {kernel width σ, labeling error rate ε} that maximize the evidence!

Advantage of using EP → we get the evidence as a side product
EP estimates the leave-one-out predictive performance without performing any expensive cross-validation.





Support Vector Machine

  

- OSU SVM toolbox
- RBF kernel
- Hyperparameter {C, σ} selection → use leave-one-out validation!

kNN (k = 1)


The label of a test point follows that of its nearest training point.
This algorithm is simple to implement, and its accuracy can be used as a baseline. Sometimes, however, it gives a good result!





Split of the dataset & Experiment


GP Semi-supervised learning

- Randomly select labeled data (p % of the overall data), use the remaining data as unlabeled data, and predict the labels of the unlabeled data (in this setting, unlabeled data == test data)
- 50 tries for each p (p = 10, 20, 30, 40, 50)
- Each time, select the hyperparameter that maximizes the evidence from EP


SVM and kNN

- Randomly select train data (p % of the overall data), use the remaining data as test data, and predict the labels of the test data
- 50 tries for each p (p = 10, 20, 30, 40, 50)
- For the SVM, leave-one-out validation for hyperparameter selection was performed on the train data only

GP – evidence & accuracy
[Figure: recognition accuracy on unlabeled points and log evidence, plotted against sigma (hyperparameter) from 0 to 6]

The case of percentage of train points per class = 50 % (average over 10 tries)
(Note) An offset was added to the log evidence to plot all curves in the same figure.
Max of Rec Accuracy ≈ Max of Log Evidence → find the optimal hyperparameter by using the evidence from EP

SVM – hyperparameter selection
[Figure: evidence from leave-one-out validation over the two hyperparameter axes, Log(C) and Log(1/kernel parameter)]

Select the hyperparameters {C, σ} that maximize the evidence from leave-one-out validation!

Classification Accuracy
[Figure: percentage of recognition on unlabeled (or test) points versus percent of labeled (or train) points per class, for GP, kNN (k = 1), and SVM]

As expected, kNN is bad at a small number of train points and better at a large number of train points.

SVM has good accuracy even when the number of train points is small. Why?
GP has bad accuracy when the number of train points is small. Why?

Analysis-SVM
Why does SVM give good test accuracy even when the number of train points is small?

[Figure, two panels, both plotted against percent of train points per class: (1) number of support vectors and number of train points; (2) CV accuracy rate, test accuracy rate, and # SVs / # train pts]

The best I can tell…
1. {# Support Vectors} / {# of Train Points} is high in this task, in particular when the percentage of train points is low. The support vectors decide the decision boundary, but it is not guaranteed that the SV ratio is strongly related to the test accuracy. It is known only that {Leave-one-out CV error} is at most {# Support Vectors} / {# of Train Points}.

2. The CV accuracy rate is high even when the number of train points is small, and the CV accuracy rate is closely related to the test accuracy rate.

Analysis-GP
Why does GP give bad test accuracy when the number of train points is small?

[Figure, two panels: recognition accuracy (unlabeled) and log evidence, plotted against sigma (hyperparameter) from 0 to 6]

- Percentage of train points per class = 50 %: Max of Rec Accuracy ≈ Max of Log Evidence
- Percentage of train points per class = 10 %: the log evidence curve is flat → fails to find the optimal sigma!

Conclusion
• GP
  - Small number of train points → bad accuracy
  - Large number of train points → good accuracy
• SVM
  - Regardless of the number of train points → good accuracy
• kNN (k = 1)
  - Small number of train points → bad accuracy
  - Large number of train points → good accuracy

```
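
The split-and-evaluate protocol described for the kNN (k = 1) baseline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the project: the toy data from `make_toy_data` is a hypothetical two-dimensional stand-in for the 1024-feature dataset (only the 61 vs. 75 class sizes are taken from the slides), and all function names are invented for this example. Each try draws p % of the overall data as train points and tests on the rest.

```python
import math
import random


def nearest_neighbor_label(x, train_points, train_labels):
    """Return the label of the training point closest to x (kNN with k = 1)."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(x, train_points[i]))
    return train_labels[best]


def random_split_accuracy(points, labels, percent_train, rng):
    """One try: train on percent_train % of the data, test on the remainder."""
    indices = list(range(len(points)))
    rng.shuffle(indices)
    n_train = max(1, round(len(points) * percent_train / 100))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    train_points = [points[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    correct = sum(
        nearest_neighbor_label(points[i], train_points, train_labels) == labels[i]
        for i in test_idx
    )
    return correct / len(test_idx)


def mean_accuracy(points, labels, percent_train, tries=50, seed=0):
    """Average test accuracy over repeated random splits (50 tries by default)."""
    rng = random.Random(seed)
    total = sum(random_split_accuracy(points, labels, percent_train, rng)
                for _ in range(tries))
    return total / tries


def make_toy_data(seed=0):
    """Hypothetical stand-in data: 61 positive and 75 negative 2-D points."""
    rng = random.Random(seed)
    pos = [(rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)) for _ in range(61)]
    neg = [(rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)) for _ in range(75)]
    return pos + neg, [1] * 61 + [-1] * 75


if __name__ == "__main__":
    points, labels = make_toy_data()
    for p in (10, 20, 30, 40, 50):
        print(p, round(mean_accuracy(points, labels, p), 3))
```

Fixing the seed makes the 50 splits reproducible, so the same splits could in principle be reused when comparing against other classifiers.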
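
The bound quoted in the SVM analysis can be written out explicitly. The notation here is introduced for illustration and does not come from the slides: n is the number of train points, N_SV is the number of support vectors, and f^(-i) is the classifier trained with point i held out.

```latex
\[
\hat{\varepsilon}_{\mathrm{LOO}}
  \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\!\left[ f^{(-i)}(x_i) \neq y_i \right]
  \;\le\; \frac{N_{\mathrm{SV}}}{n}
\]
```

The inequality holds because removing a point that is not a support vector leaves the SVM solution unchanged, so that point is still classified correctly; only held-out support vectors can contribute leave-one-out errors. When N_SV / n is close to 1, as it is here for small train sets, the bound is loose, which is consistent with the remark that the SV ratio alone does not determine the test accuracy.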