Supervised Classification
Bayesian Discriminant Analysis
• This supervised learning technique uses Bayes’
rule but differs in philosophy from the well-known
work of Aitken, Taroni, et al.
• Bayes’ rule:
$$\Pr(\text{grp i.d.} \mid x) = \frac{\Pr(x \mid \text{grp i.d.})\,\Pr(\text{grp i.d.})}{\Pr(x)}$$
• Pr(grp i.d.) is the prior probability. This can be a problem!
• Pr is probability
• Equation means: “How does the probability of an
item being a member of a group change, given
evidence x?”
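The update described by Bayes’ rule can be sketched numerically. This is a minimal illustration, not part of the original slides: the function name `posterior` and the likelihood and prior values are made up for the example.

```python
def posterior(likelihoods, priors):
    """Return Pr(grp j | x) for every group j using Bayes' rule.

    likelihoods[j] is Pr(x | grp j); priors[j] is Pr(grp j).
    """
    # Numerator of Bayes' rule for each group: Pr(x | grp j) * Pr(grp j)
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    # Denominator Pr(x): sum of the numerators over all groups
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical values: evidence x is three times as likely under group 1
post = posterior([0.30, 0.10], [0.5, 0.5])
print(post)
```

With equal priors the posterior simply follows the likelihoods; changing the priors shifts the answer, which is why the choice of prior matters.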
Bayesian Discriminant Analysis
• Bayes’ rule can be turned into a classification
rule:
If: $$\Pr(x \mid \text{grp 1})\,\Pr(\text{grp 1}) > \Pr(x \mid \text{grp 2})\,\Pr(\text{grp 2})$$
=> Choose group 1
*If priors are both 0.5,
decision boundaries are
where curves cross
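A small sketch of this classification rule, assuming 1D normal densities for Pr(x | grp). The group means, standard deviations, and priors below are hypothetical values chosen for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a 1D normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def choose_group(x, groups):
    """Pick the group with the largest Pr(x | grp) * Pr(grp).

    groups maps a group name to a (mean, sd, prior) tuple.
    """
    def weighted_density(name):
        mean, sd, prior = groups[name]
        return normal_pdf(x, mean, sd) * prior

    return max(groups, key=weighted_density)


# Equal priors: the boundary is where the two density curves cross (x = 1)
equal = {"grp 1": (0.0, 1.0, 0.5), "grp 2": (2.0, 1.0, 0.5)}
print(choose_group(0.9, equal), choose_group(1.1, equal))

# Unequal priors move the boundary toward the less probable group
unequal = {"grp 1": (0.0, 1.0, 0.9), "grp 2": (2.0, 1.0, 0.1)}
print(choose_group(1.5, unequal))
```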
Bayes-Gaussian Discriminant Analysis
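• If the data is multivariate normal and the groups
share the same covariance structure, the decision rule
becomes:
$$\underset{j = 1,\dots,k}{\arg\min}\; d_{\text{grp } j}(x_{unk})$$
with the “distance” defined as:
$$d_j(x_{unk}) = -\bar{x}_j^T S_{pl}^{-1} x_{unk} + \frac{1}{2}\bar{x}_j^T S_{pl}^{-1} \bar{x}_j - \ln \Pr(\text{grp } j)$$
and the pooled covariance matrix (like an average of the group covariance matrices):
$$S_{pl} = \frac{1}{n-k}\sum_{i=1}^{k}(n_i - 1)\,S_i$$
where $\bar{x}_j$ is the mean of group j, $n_i$ is the number of observations in group i, and n is the total number of observations.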
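• Note that if the data is just 1D this is just an
equation for a line:
$$d_j(x_{unk}) = \underbrace{-\frac{\bar{x}_j}{s^2}}_{\text{slope}}\, x_{unk} + \underbrace{\frac{\bar{x}_j^2}{2s^2} - \ln \Pr(\text{grp } j)}_{\text{intercept}}$$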
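The 1D linear rule can be sketched in a few lines. This is an illustrative example, not code from the course scripts: it uses the arg-min sign convention, where the smallest d_j wins, and the function names and training values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_variance(samples):
    """Pooled variance: sum of (n_i - 1) * s_i^2 over groups, divided by n - k."""
    n = sum(len(s) for s in samples)
    k = len(samples)
    return sum((len(s) - 1) * variance(s) for s in samples) / (n - k)


def linear_distance(x_unk, group_mean, s2, prior):
    """1D linear 'distance': a straight line in x_unk for each group."""
    slope = -group_mean / s2
    intercept = group_mean ** 2 / (2.0 * s2) - math.log(prior)
    return slope * x_unk + intercept


def classify(x_unk, samples, priors):
    """Return the index of the group with the smallest distance."""
    s2 = pooled_variance(samples)
    d = [linear_distance(x_unk, mean(s), s2, p) for s, p in zip(samples, priors)]
    return d.index(min(d))


# Hypothetical training data: group means 2 and 6, pooled variance 1
samples = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]
priors = [0.5, 0.5]
print(classify(3.5, samples, priors), classify(4.5, samples, priors))
```

With equal priors the two lines cross midway between the group means, so that point is the decision boundary.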
Bayes-Gaussian Discriminant Analysis
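• If the data is multivariate normal but the groups
have different covariance structures, the decision rule is
the same but the “decision distance” becomes:
$$d_j(x_{unk}) = \frac{1}{2} x_{unk}^T S_j^{-1} x_{unk} - \bar{x}_j^T S_j^{-1} x_{unk} + \frac{1}{2}\bar{x}_j^T S_j^{-1} \bar{x}_j - \ln \Pr(\text{grp } j)$$
The first term is the new quadratic term, and each group uses its own covariance matrix $S_j$. A full Gaussian derivation also adds $\frac{1}{2}\ln|S_j|$ to $d_j$.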
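• Note that if the data is just 1D this is an equation
for a parabola:
$$d_j(x_{unk}) = \underbrace{\frac{1}{2s_j^2}}_{a}\, x_{unk}^2 + \underbrace{\left(-\frac{\bar{x}_j}{s_j^2}\right)}_{b} x_{unk} + \underbrace{\frac{\bar{x}_j^2}{2s_j^2} - \ln \Pr(\text{grp } j)}_{c}$$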
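A minimal sketch of the 1D quadratic rule, using the arg-min convention (smallest d_j wins). One labeled addition: the constant c also includes the ½ ln s_j² term from the full Gaussian log-density, which the parabola above does not show. Without it, a group with a large variance is favored nearly everywhere. The names and numbers are hypothetical.

```python
import math


def quadratic_coefficients(group_mean, s2_j, prior, include_log_var=True):
    """Coefficients (a, b, c) of the parabola d_j(x) = a*x^2 + b*x + c."""
    a = 1.0 / (2.0 * s2_j)
    b = -group_mean / s2_j
    c = group_mean ** 2 / (2.0 * s2_j) - math.log(prior)
    if include_log_var:
        # Assumption beyond the slide: log-variance term of the Gaussian density
        c += 0.5 * math.log(s2_j)
    return a, b, c


def quadratic_distance(x_unk, coeffs):
    """Evaluate the parabola at x_unk."""
    a, b, c = coeffs
    return a * x_unk ** 2 + b * x_unk + c


def classify(x_unk, all_coeffs):
    """Return the index of the group with the smallest distance."""
    d = [quadratic_distance(x_unk, co) for co in all_coeffs]
    return d.index(min(d))


# Hypothetical groups: same mean, variances 1 and 9, equal priors
narrow = quadratic_coefficients(0.0, 1.0, 0.5)
wide = quadratic_coefficients(0.0, 9.0, 0.5)
print(classify(0.5, [narrow, wide]), classify(4.0, [narrow, wide]))
```

Points near the shared mean go to the narrow group and points far from it go to the wide group, a split that no single straight-line boundary can produce.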
Bayes-Gaussian Discriminant Analysis
• The “quadratic” version is always called
quadratic discriminant analysis, QDA
• The “linear” version is called by a number of
names!
• linear discriminant analysis, LDA
• Some combination of the above with the words
Gaussian or classification
• A number of techniques use the name LDA!
• Important to specify the equations used to tell the
difference!
Bayes-Gaussian Discriminant Analysis
Groups have similar covariance structure: linear discriminant rule should work well
Groups have different covariance structure: quadratic discriminant rule may work better
Canonical Variate Analysis
• This supervised technique is called Linear
Discriminant Analysis (LDA) in R
• Also called Fisher linear discriminant analysis
• CVA is closely related to linear Bayes-Gaussian
discriminant analysis
• Works on a principle similar to PCA: Look for
“interesting directions in data space”
• CVA: Find directions in space which best separate
groups.
• Technically: find directions which maximize the ratio of
between-group to within-group variation
Canonical Variate Analysis
Project on PC1:
Not necessarily good
group separation!
Project on CV1:
Good group separation!
Note: The number of CVs is (number of groups − 1) or p (the number of variables),
whichever is smaller
Canonical Variate Analysis
• Use the between-group to within-group covariance
matrix, $W^{-1}B$, to find directions of best group
separation (CVA loadings, $A_{cv}$):
$$W^{-1} B\, A_{cv} = A_{cv}\, \Lambda$$
where $\Lambda$ is the diagonal matrix of eigenvalues.
• CVA can be used for dimension reduction.
• Caution! These “dimensions” are not at right
angles (i.e. not orthogonal)
• CVA plots can thus be distorted from reality
• Always check loading angles!
• Caution! CVA will not work well with highly
correlated data
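The eigenproblem behind the CVA loadings can be sketched with NumPy. This is an illustration under stated assumptions, not the course's R code: W and B are built here as within-group and between-group sums-of-squares matrices, which changes the eigenvalue scale but not the directions, and the function name and data are hypothetical.

```python
import numpy as np


def cva_loadings(X, labels):
    """Return (eigenvalues, loadings) of W^-1 B, largest eigenvalue first."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))  # within-group scatter
    B = np.zeros((p, p))  # between-group scatter
    for g in groups:
        Xg = X[labels == g]
        group_mean = Xg.mean(axis=0)
        centered = Xg - group_mean
        W += centered.T @ centered
        diff = (group_mean - grand_mean).reshape(-1, 1)
        B += len(Xg) * (diff @ diff.T)
    # Solve W^-1 B without forming the inverse explicitly
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]
    n_cv = min(len(groups) - 1, p)  # (#groups - 1) or p, whichever is smaller
    return eigvals.real[order][:n_cv], eigvecs.real[:, order][:, :n_cv]


# Hypothetical data: large spread along variable 1, groups differ in variable 2
X = [[0, 0.0], [4, 0.2], [8, -0.2], [12, 0.1],
     [0, 1.0], [4, 1.2], [8, 0.8], [12, 1.1]]
labels = ["a"] * 4 + ["b"] * 4
vals, A_cv = cva_loadings(X, labels)
print(vals, A_cv)
```

In this example PC1 would follow the high-variance first variable, while CV1 loads mainly on the second variable, which is the one that separates the groups. If W is close to singular, as with highly correlated variables, the solve step becomes unstable.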
Canonical Variate Analysis
[Figure panels: 2D CVA of gasoline data set; 2D PCA of gasoline data set]
Canonical Variate Analysis
• Distance metric used in CVA to assign group
i.d. of an unknown data point:
$$\underset{j = 1,\dots,k}{\arg\min}\; d_{\text{grp } j}(x_{unk})$$
$$d_j(x_{unk}) = -\bar{x}_j^T A_{cv} A_{cv}^T x_{unk} + \frac{1}{2}\bar{x}_j^T A_{cv} A_{cv}^T \bar{x}_j - \ln \Pr(\text{grp } j)$$
• If data is Gaussian and group covariance structures
are the same then CVA classification is the same as
Bayes-Gaussian classification.
*Now Exercise:
Explore some data sets with:
lda_group_explore.R
Try simple supervised classification with:
lda_group_predict.R
lda_group_predict2.R
Partial Least Squares Discriminant Analysis
• PLS-DA is a supervised discrimination
technique and very popular in chemometrics
• Works well with highly correlated variables (like in
spectroscopy)
• Lots of correlation causes CVA to fail!
• Group labels coded into a “response matrix” Y
• PLS searches for directions of maximum covariance in
X and Y.
• Loadings for X can be used like PCA loadings
• Dimension reduction
• Loading plots
Partial Least Squares Discriminant Analysis
[Figure panels: 2D PLS of gasoline data set; 2D PCA of gasoline data set]
Partial Least Squares Discriminant Analysis
• Group assignments of observation vectors are
made by interpreting Y scores.
• Typically a “soft-max” function is used.
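A minimal sketch of how a soft-max can turn predicted Y scores into a group assignment. How a particular PLS-DA implementation applies the soft-max may differ; the function names and score values here are hypothetical.

```python
import math


def softmax(scores):
    """Map a list of scores to positive weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def assign_group(y_scores, group_names):
    """Assign the group whose Y score has the largest soft-max weight."""
    probs = softmax(y_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return group_names[best], probs


# Hypothetical predicted Y scores for one observation vector, three groups
group, probs = assign_group([0.10, 0.80, 0.05], ["grp 1", "grp 2", "grp 3"])
print(group, probs)
```

The soft-max keeps the ordering of the Y scores, so the assigned group is always the one with the largest score; the weights add a relative measure of how clear the assignment is.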
[Figure: Y-scores plotted for the observation vectors]
*Now Exercise:
Try plsda.R