Final Exam (not cumulative) Next Tuesday Dec. 12, 7-8:15 PM 1105 SC (This Room)
Miscellaneous Topics
CS446-Fall ’06
Topics Since Midterm
Statistical Learning: Parameterized Models; Generative Models; Discriminative Models; Likelihood function
Bayes: Bayes Rule; Bayes Networks; Naïve Bayes
Estimation: Maximum Likelihood; Maximum A Posteriori; Conjugate Priors; Frequentist / Bayesian Statistics
Nonparametric classifiers: Nearest Neighbor; k Nearest Neighbor; Kernel Smoothing; Bias / Variance
Information Theory: Conditional Information; Mutual Information; KL Divergence
Dimensionality Reduction: LDA, PCA, MDS, FA; Locally Linear Embedding; Neural Network Derived Features
Model Selection: Fit / Regularization; AIC, BIC; Kolmogorov Complexity, MDL; K-fold Cross Validation; Leave-one-out Cross Validation; Learning Curve; ROC Curve
Ensembles: Bayes Optimal Classifier; Bagging; Boosting; Weak Learning; Margin Distribution
ANNs; Backpropagation; Logistic Regression; K-means Clustering; Expectation Maximization; Jordan's talk
Dimensionality Reduction
Many approaches
Supervised
  Feature selection
  Linear Discriminant Analysis (LDA/Fisher; we’ve seen this already)
  …
Unsupervised - what does this mean?
  Principal Component Analysis (PCA)
  Multidimensional Scaling
  Factor Analysis
  …
Neural Net application
Also some nonlinear dimensionality reduction
Linear Discriminant Analysis (LDA)
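Introduce a “new” feature
Linear combination of old features
The new feature should maximize distance between classes
And simultaneously minimize variance within classes
Maximize
$$\frac{|m_P - m_N|^2}{s_P^2 + s_N^2}$$
where $m_P$, $m_N$ are the class means and $s_P^2$, $s_N^2$ the within-class variances, all measured along the new feature
Project points into subspace and repeat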
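The criterion above can be sketched for two classes of 2-D points. This is a minimal illustration, not the course's implementation: it assumes the standard closed-form maximizer (direction proportional to the inverse pooled within-class scatter times the mean difference), uses scatter (sum of squared deviations) for $s^2$, and assumes the pooled scatter matrix is invertible. All function names and the data are hypothetical.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def scatter2(points):
    """Within-class scatter matrix [[sxx, sxy], [sxy, syy]] as a tuple (sxx, sxy, syy)."""
    mx, my = mean2(points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxx, sxy, syy

def fisher_direction(pos, neg):
    """Direction S_W^{-1} (m_P - m_N), using the closed-form 2x2 inverse."""
    (px, py), (nx, ny) = mean2(pos), mean2(neg)
    a, b, c = (p + n for p, n in zip(scatter2(pos), scatter2(neg)))  # pooled S_W
    det = a * c - b * b          # assumed nonzero
    dx, dy = px - nx, py - ny
    return ((c * dx - b * dy) / det, (a * dy - b * dx) / det)

def fisher_criterion(pos, neg, w):
    """|m_P - m_N|^2 / (s_P^2 + s_N^2), measured after projecting onto w."""
    proj_p = [w[0] * x + w[1] * y for x, y in pos]
    proj_n = [w[0] * x + w[1] * y for x, y in neg]
    m_p = sum(proj_p) / len(proj_p)
    m_n = sum(proj_n) / len(proj_n)
    s_p = sum((v - m_p) ** 2 for v in proj_p)
    s_n = sum((v - m_n) ** 2 for v in proj_n)
    return (m_p - m_n) ** 2 / (s_p + s_n)

# Hypothetical data: two elongated classes lying side by side.
pos = [(1.0, 2.0), (2.0, 3.1), (3.0, 3.9), (4.0, 5.0)]
neg = [(2.0, 1.0), (3.0, 2.1), (4.0, 2.9), (5.0, 4.0)]
w = fisher_direction(pos, neg)
```

On this data the Fisher direction scores higher than the plain difference of class means, because it also accounts for the shape of the within-class spread.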
Unsupervised – do not use class label
Principal Component Analysis (PCA)
  Like LDA but maximize variance in X
Factor Analysis
  Observed raw features are imagined to be derived
  Introduce K latent features that “cause” the observed features
  Estimate them
Multidimensional Scaling
  Suppose we know distances between examples
  Find a map in K dimensions that supports
    Placement of examples
    Computes desired distances
Principal Component Analysis (PCA)
Find direction of maximum variance
Project
Repeat
[Figure: scatter plot on X and Y axes with principal directions labeled 1 and 2]
This is an eigenvalue problem; we are looking for eigenvectors
Principal Component Analysis
Eigenvector (of the data covariance matrix) with the largest eigenvalue is the direction of greatest variance
Second largest is greatest remaining variance, etc.
They are orthogonal and form the new (linear) features
Use eigenvectors of the largest k eigenvalues, or down to some relative size of eigenvalues
Previous LDA example: first LDA component vs. first PCA component
Is this good?
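The "find direction of maximum variance, project, repeat" recipe can be sketched with power iteration on the covariance matrix. This is one simple way to solve the eigenvalue problem, chosen here because it needs only the standard library; a linear algebra package would normally be used instead. Function names and the data are hypothetical, and the random start vector is assumed not to be orthogonal to the eigenvector being sought.

```python
import math
import random

def covariance(data):
    """Sample covariance matrix of the rows of `data`."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

def top_eigenpair(matrix, rng, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v

def pca(data, k, seed=0):
    """Return the first k (variance, direction) pairs, largest variance first."""
    rng = random.Random(seed)
    cov = covariance(data)
    d = len(cov)
    components = []
    for _ in range(k):
        lam, v = top_eigenpair(cov, rng)
        components.append((lam, v))
        # Deflation: remove the variance already explained along v, then repeat.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components

# Hypothetical data: long axis along (0.6, 0.8), short axis along (-0.8, 0.6).
data = [[3.0, 4.0], [-3.0, -4.0], [-0.8, 0.6], [0.8, -0.6]]
components = pca(data, k=2)
```

The first returned direction lines up with the long axis of the data and the second with the short axis; the signs of the eigenvectors are arbitrary.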
Multidimensional Scaling (MDS)
Given distances between examples
Position examples in a lower dimensional metric space faithful to these distances
Example: flying times between pairs of cities (Atlanta, Chicago, Denver, Houston, Los Angeles, Miami, New York, San Francisco, Seattle, Washington D.C.)
Works particularly well here. Why?
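A minimal sketch of one MDS variant, classical (Torgerson) MDS, which the slide does not name specifically. It assumes the input is a symmetric matrix of distances that are close to Euclidean: squared distances are double-centered into a matrix of inner products, and the top k eigenvectors give the coordinates. Power iteration is used only to keep the example within the standard library; the function name and the data are hypothetical.

```python
import math
import random

def classical_mds(dist, k, iterations=500, seed=0):
    """Embed n items in k dimensions from an n x n symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products (Gram matrix).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        # Power iteration for the current largest eigenpair of the Gram matrix.
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))   # guard against slightly negative values
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Hypothetical check: distances taken from four corners of a 3 x 4 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]
embedded = classical_mds(dist, k=2)
```

The recovered coordinates match the original points only up to rotation, reflection, and translation, so the meaningful check is that pairwise distances are reproduced.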
Factor Analysis (FA)
Assume the observed features are really manifestations of some number k of more primitive factors
These factors are unobservable
Assume they are uncorrelated linear combinations of the original features
In matrix form: $X = \mu + LF + \epsilon$
Where
  $X$ is an example
  $\mu$ is the mean of each feature
  $L$ is a matrix of factor loadings
  $F$ are the factors
  $\epsilon$ is a noise term
[Figure: diagram relating Features, Factors, and Class]
Similar to PCA except for the noise term $\epsilon$, which can aid in post hoc interpretation of the factors
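One consequence of the matrix form can be worked out under the usual textbook assumptions, which are not stated on the slide: the factors have zero mean and identity covariance, the noise has zero mean and a diagonal covariance, and factors and noise are uncorrelated. The symbol $\Psi$ for the noise covariance is introduced here for illustration.

```latex
\begin{aligned}
\mathrm{Cov}(X) &= \mathrm{Cov}(\mu + LF + \epsilon) \\
                &= L\,\mathrm{Cov}(F)\,L^{\top} + \mathrm{Cov}(\epsilon) \\
                &= L L^{\top} + \Psi
\end{aligned}
```

Under these assumptions the covariance between different features comes entirely from the shared factors, while $\Psi$ absorbs the variance specific to each feature.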
Non-linear Dimensionality Reduction
Linear is limited: many classifiers easily handle linear transformations anyway
Potential big gains in nonlinear transformations
But a rich space with huge potential for overfitting
Kernel PCA – PCA with kernel functions
Locally Linear Embedding (LLE) – currently very popular
Note that distance is measured along the lower dimensional manifold; assumptions: smoothness, density
Using Neural Networks
Use hidden layers to learn lower dimensional features
Couple the example features to both the input and output
Learns to reproduce the input features
Hidden layer is floating
Limit the number of hidden units
Hidden nodes learn the best nonlinear transformations that reproduce the input features
[Figure: network with layers Features, Hidden Layer, Features]
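A minimal sketch of the idea, assuming the smallest possible network: d inputs, one tanh hidden unit, and d linear outputs trained by batch gradient descent to reproduce the inputs. The architecture, learning rate, epoch count, and data are illustrative choices, not taken from the slides.

```python
import math
import random

def train_autoencoder(data, epochs=2000, lr=0.05, seed=0):
    """Train a d -> 1 -> d network to reproduce its input.

    Returns (params, losses); losses[e] is the mean reconstruction error
    measured just before the weight update of epoch e.
    """
    rng = random.Random(seed)
    d, n = len(data[0]), len(data)
    w_in = [rng.uniform(-0.5, 0.5) for _ in range(d)]    # input -> hidden
    b_h = 0.0                                            # hidden bias
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(d)]   # hidden -> output
    b_out = [0.0] * d                                    # output biases
    losses = []
    for _ in range(epochs):
        g_in, g_out, g_bout = [0.0] * d, [0.0] * d, [0.0] * d
        g_bh, loss = 0.0, 0.0
        for x in data:
            h = math.tanh(sum(w * xi for w, xi in zip(w_in, x)) + b_h)
            err = [w_out[j] * h + b_out[j] - x[j] for j in range(d)]
            loss += 0.5 * sum(e * e for e in err)
            # Backpropagation: output layer first, then back through tanh.
            delta_h = sum(err[j] * w_out[j] for j in range(d)) * (1.0 - h * h)
            for j in range(d):
                g_out[j] += err[j] * h
                g_bout[j] += err[j]
                g_in[j] += delta_h * x[j]
            g_bh += delta_h
        losses.append(loss / n)
        w_in = [w - lr * g / n for w, g in zip(w_in, g_in)]
        w_out = [w - lr * g / n for w, g in zip(w_out, g_out)]
        b_out = [b - lr * g / n for b, g in zip(b_out, g_bout)]
        b_h -= lr * g_bh / n
    return (w_in, b_h, w_out, b_out), losses

# Hypothetical data: 3-D points that really lie along a single direction.
direction = (0.5, 1.0, -0.5)
data = [[c * (i - 5) / 5.0 for c in direction] for i in range(11)]
params, losses = train_autoencoder(data)
```

After training, the activation of the single hidden unit serves as the learned one-dimensional feature for each example.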
Model Selection
Comparing error rate alone on models of different complexity is not fair
Consider the Likelihood Function
  It will tend to prefer a more complex model
  Why?
Overfitting
Need regularization
  Penalty to compensate for complexity
Richer model families are more likely to find a good fit by accident
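A small sketch of why likelihood alone favors the richer model, using a hypothetical coin-flip data set. The fair-coin model has no free parameters; the biased-coin model has one, fitted by maximum likelihood. Because the fair coin is a special case of the biased coin, the fitted model can never score lower.

```python
import math

def bernoulli_log_likelihood(heads, tails, p):
    """ln L of a coin with P(heads) = p for the observed counts (0 < p < 1)."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Hypothetical data: 100 flips.
heads, tails = 54, 46

# Model A: fair coin, no free parameters.
ll_fair = bernoulli_log_likelihood(heads, tails, 0.5)

# Model B: biased coin, one free parameter set to its maximum likelihood value.
p_hat = heads / (heads + tails)
ll_fitted = bernoulli_log_likelihood(heads, tails, p_hat)

print(f"ln L (fair)   = {ll_fair:.4f}")
print(f"ln L (fitted) = {ll_fitted:.4f}")
```

The fitted model wins on likelihood even if the coin is in fact fair, which is the sense in which a richer family finds a good fit by accident.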
Information Criteria
L is the likelihood; k is number of parameters; N is number of examples
Prefer the higher scoring models
Akaike Information Criterion (AIC)
  $\mathrm{AIC} = \ln(L) - k$   (fit: $\ln(L)$; complexity penalty: $k$)
Bayesian Information Criterion (BIC)
  $\mathrm{BIC} = \ln(L) - \frac{k \ln(N)}{2}$
Minimum Description Length (MDL) is the same as BIC
Kolmogorov complexity
  Learning = Data Compression
  Compression bounds; bound test accuracy from training alone
  Luckiness Framework; PAC-Bayes
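A short sketch of scoring models with these criteria in the "higher is better" convention. BIC is written with the k ln(N)/2 penalty, the scaling that is consistent with AIC = ln(L) – k. The coin-flip data and the function names are hypothetical.

```python
import math

def aic_score(log_likelihood, k):
    """AIC in the higher-is-better form: fit minus complexity penalty."""
    return log_likelihood - k

def bic_score(log_likelihood, k, n):
    """BIC in the higher-is-better form; the penalty grows with ln(n)."""
    return log_likelihood - k * math.log(n) / 2.0

# Hypothetical data: 54 heads in 100 flips, scored under two models.
heads, tails = 54, 46
n = heads + tails
p_hat = heads / n
candidates = {
    # name: (maximized log-likelihood, number of free parameters)
    "fair coin": (n * math.log(0.5), 0),
    "biased coin": (heads * math.log(p_hat) + tails * math.log(1.0 - p_hat), 1),
}

best_by_aic = max(candidates, key=lambda m: aic_score(*candidates[m]))
best_by_bic = max(candidates, key=lambda m: bic_score(*candidates[m], n))
```

Here the biased coin has the higher raw likelihood, but its gain is smaller than either penalty, so both criteria select the fair coin.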
Cross Validation
Good For
  Setting Parameters
  Choosing Models
  Evaluating a Learner
Data Resampling Technique
  Different partition sets of the training data are somewhat independent
  Overlap introduces some bias; this can be estimated if necessary
  In statistics: bootstrap, jackknife
Computing a learning curve
Classifier performance as a function of the amount of training data
Desire confidences
Perhaps error bars: usually a 95% interval based on the standard error of the mean
Need multiple runs but have limited data
Each point is generated by cross validation
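A minimal sketch of turning repeated runs into one learning-curve point with an error bar. It assumes the common normal approximation, mean ± 1.96 standard errors, for the 95% interval; with only a handful of runs a t-based multiplier would be wider. The accuracy values and names are hypothetical.

```python
import math
import statistics

def mean_with_error_bar(scores, z=1.96):
    """Return (mean, half_width), where half_width = z * standard error of the mean."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, z * sem

# Hypothetical accuracies from repeated runs at each training-set size.
runs_by_size = {
    50: [0.70, 0.74, 0.68, 0.72],
    100: [0.80, 0.82, 0.78, 0.80],
    200: [0.86, 0.85, 0.87, 0.86],
}
curve = {size: mean_with_error_bar(scores) for size, scores in runs_by_size.items()}

for size, (mean, half_width) in sorted(curve.items()):
    print(f"{size:4d} examples: {mean:.3f} +/- {half_width:.3f}")
```

Each entry of `curve` is one plotted point: the mean gives the height and the half-width gives the error bar.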
Cross Validation
k-fold cross validation
  Partition the data D into k equal disjoint sets: $d_1, d_2, \ldots, d_k$
  For i = 1 to k
    Train on $D - d_i$
    Test on $d_i$
  Generates a population of results
    Can compute average performance
    Confidence measures
  Most popular / most standard is k = 10
  When k = |D| it is called “leave-one-out cross validation”
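The loop above can be sketched as follows. The `train` and `evaluate` callables, the majority-label learner, and the data are hypothetical placeholders; any learner with the same interface could be substituted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(examples, labels, k, train, evaluate, seed=0):
    """Return one score per fold: train on D - d_i, test on d_i."""
    folds = k_fold_indices(len(examples), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train([examples[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        scores.append(evaluate(model,
                               [examples[j] for j in test_idx],
                               [labels[j] for j in test_idx]))
    return scores

# Hypothetical learner: always predict the majority label seen in training.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)

examples = list(range(20))
labels = [1] * 14 + [0] * 6
scores = cross_validate(examples, labels, k=5, train=train_majority, evaluate=accuracy)
mean_accuracy = sum(scores) / len(scores)
```

Setting k equal to the number of examples turns the same function into leave-one-out cross validation.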
Cross Validation
Every example gets used as a test example exactly once
Every example gets used as a training example k-1 times
Test sets are independent but training sets overlap significantly
The hypotheses are generated using (k-1)/k of the training data
With resampling, “paired” statistical design can be used to compare two or more learners
Paired tests are statistically stronger since outcome variations due to the test set are identical in each fold
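A minimal sketch of a paired comparison, assuming two learners were evaluated on the same folds. The statistic is the paired t statistic on per-fold differences; the accuracy values and the function name are hypothetical, and the differences are assumed not to be all identical (otherwise the standard error is zero).

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences (same folds for both learners)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / sem

# Hypothetical per-fold accuracies for two learners on the same 5 folds.
learner_a = [0.82, 0.79, 0.88, 0.75, 0.91]
learner_b = [0.80, 0.78, 0.85, 0.73, 0.89]
t_stat = paired_t_statistic(learner_a, learner_b)
```

The fold-to-fold swings (0.75 to 0.91) are much larger than the gap between the learners, yet the differences are consistent, so the paired statistic is large. It would usually be compared against a t distribution with k-1 degrees of freedom, keeping in mind that overlapping training sets make the folds only approximately independent.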
ROC Curve
Often a classifier can be adjusted to have more false positives or more false negatives
This can be used to hide weaknesses of the classifier
Receiver Operating Characteristic Curve
  Prob. of True Positive vs. Prob. of False Positive as sensitivity is increased
The – (left peak) and + (right peak) populations overlap
The classification boundary is the vertical line
The relevant areas are labeled:
  TP: true positives = Red + Purple
  FP: false positives = Pink + Purple
  TN: true negatives = Dark Blue + Light Blue
  FN: false negatives = Light Blue
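Moving the classification boundary and recording the two rates at each position traces out the curve. A minimal sketch, assuming each example has a real-valued score and is called positive when its score is at or above the threshold; the scores, labels, and function name are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate), one per threshold.

    labels are 1 (positive) or 0 (negative). Thresholds run from strictest
    to most lenient, so sensitivity increases along the list.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                     # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
roc = roc_points(scores, labels)
```

Each point is one possible operating setting of the same classifier, which is why a single error rate can hide how the classifier behaves at other settings.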