# Discriminant Analysis Concepts

- Used to predict group membership from a set of continuous predictors.
- Think of it as MANOVA in reverse: in MANOVA we asked whether groups are significantly different on a set of linearly combined responses.
- The same responses can be used to predict group membership.
- Determine how continuous variables can be linearly combined to best classify a subject into a group.
- A better term may be "separation."
- Slightly different is "classification," where we seek rules that allocate new subjects into established classes.
- Logistic regression is a competitor.
## Classification

- Two populations, $\pi_1$ and $\pi_2$.
- We have measurements $\mathbf{x}' = [x_1\ x_2\ \ldots\ x_p]$ on each of the individuals concerned.
- Given a new value of $\mathbf{x}$ for an unknown individual, the problem is how best to classify this individual.
## Illustration

[Figure: two overlapping density curves $f_1(x)$ and $f_2(x)$ over classification regions $R_1$ and $R_2$. The area of $f_2$ falling in $R_1$ is the probability of misclassifying a population 2 member into population 1; the area of $f_1$ falling in $R_2$ is the probability of misclassifying a population 1 member into population 2.]
## Misclassification

The probability that an individual from $\pi_1$ is wrongly classified is

$$\int_{R_2} f_1(\mathbf{x})\,d\mathbf{x} = P(2 \mid 1)$$

and the probability that an individual from $\pi_2$ is wrongly classified is

$$\int_{R_1} f_2(\mathbf{x})\,d\mathbf{x} = P(1 \mid 2).$$
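For univariate normal populations these two error integrals reduce to normal tail areas and can be checked numerically. A minimal sketch (the particular means, common standard deviation, and cut-off are invented illustration values, not from the slides):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Illustrative setup: pi_1 = N(1, 1), pi_2 = N(-1, 1), with regions
# R1 = {x >= 0} and R2 = {x < 0}.
mu1, mu2, sigma = 1.0, -1.0, 1.0
boundary = 0.0

# P(2|1): integral of f1 over R2 (the lower tail of population 1)
p_2_given_1 = normal_cdf(boundary, mu1, sigma)

# P(1|2): integral of f2 over R1 (the upper tail of population 2)
p_1_given_2 = 1.0 - normal_cdf(boundary, mu2, sigma)

print(p_2_given_1, p_1_given_2)
```

By symmetry of this setup both error probabilities are about 0.159.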
## Four Possibilities

- Assume $p_1$ and $p_2$ are the prior probabilities of $\pi_1$ and $\pi_2$, respectively.

|                | Classified $\pi_1$ | Classified $\pi_2$ |
|----------------|--------------------|--------------------|
| Actual $\pi_1$ | $P(1\mid 1)\,p_1$  | $P(2\mid 1)\,p_1$  |
| Actual $\pi_2$ | $P(1\mid 2)\,p_2$  | $P(2\mid 2)\,p_2$  |
## Costs

- In general there is a cost associated with misclassification.
- Assume the cost of a correct classification is zero.
- $C(2\mid 1)$ is the cost of misclassifying a $\pi_1$ individual as a $\pi_2$ individual.
- $C(1\mid 2)$ is the cost of misclassifying a $\pi_2$ individual as a $\pi_1$ individual.

|                | Classified $\pi_1$ | Classified $\pi_2$ |
|----------------|--------------------|--------------------|
| Actual $\pi_1$ | $0$                | $C(2\mid 1)$       |
| Actual $\pi_2$ | $C(1\mid 2)$       | $0$                |
## Expected Cost of Misclassification (ECM)

$$\mathrm{ECM} = C(2\mid 1)\,P(2\mid 1)\,p_1 + C(1\mid 2)\,P(1\mid 2)\,p_2$$

Goal: minimize the ECM.

It can be shown that the ECM is minimized if $R_1$ contains those values of $\mathbf{x}$ for which

$$C(1\mid 2)\,p_2\,f_2(\mathbf{x}) - C(2\mid 1)\,p_1\,f_1(\mathbf{x}) \le 0$$

and excludes those $\mathbf{x}$ for which the expression above is $> 0$. In other words, $R_1$ is the set of points $\mathbf{x}$ for which

$$\frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{p_2\,C(1\mid 2)}{p_1\,C(2\mid 1)} \qquad (1)$$

so when $\mathbf{x}$ satisfies this inequality we classify the corresponding individual into $\pi_1$. Conversely, since $R_2$ is the complement of $R_1$, $R_2$ is the set for which

$$\frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \frac{p_2\,C(1\mid 2)}{p_1\,C(2\mid 1)}$$

and an individual whose $\mathbf{x}$ vector satisfies this inequality is allocated to $\pi_2$.
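The minimum-ECM rule is a direct comparison of the density ratio against the cost/prior threshold. A sketch for two univariate normal populations with a common variance (the parameter values are invented for illustration):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def classify_min_ecm(x, mu1, mu2, sigma, p1, p2, c21, c12):
    """Allocate to pi_1 iff f1(x)/f2(x) >= (p2*C(1|2)) / (p1*C(2|1))."""
    ratio = normal_pdf(x, mu1, sigma) / normal_pdf(x, mu2, sigma)
    threshold = (p2 * c12) / (p1 * c21)
    return 1 if ratio >= threshold else 2

# Equal priors and equal costs: the boundary sits midway between the means.
g_mid = classify_min_ecm(0.5, 1.0, -1.0, 1.0, 0.5, 0.5, 1.0, 1.0)   # -> 1
g_low = classify_min_ecm(-0.5, 1.0, -1.0, 1.0, 0.5, 0.5, 1.0, 1.0)  # -> 2

# Raising C(1|2), the cost of calling a pi_2 member pi_1, shrinks R1,
# so a mildly pi_1-looking point is now sent to pi_2.
g_cost = classify_min_ecm(0.3, 1.0, -1.0, 1.0, 0.5, 0.5, 1.0, 5.0)  # -> 2
```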
Assuming $\mathbf{x}$ has a multivariate normal distribution, i.e.

$$\mathbf{x} \sim N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}) \text{ in population } i \quad (i = 1, 2)$$

(note that this implies the same covariance matrix applies to each population), we have

$$\frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} = \frac{\exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\,\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right]}{\exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\,\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right]}$$

and the general rule (1), after taking natural logs and some rearrangement, can be shown to be equivalent to

$$\boldsymbol{\beta}'\mathbf{x} - \boldsymbol{\beta}'\!\left(\frac{\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2}{2}\right) \ge c$$

where

$$\boldsymbol{\beta} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = \begin{bmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_p\end{bmatrix}, \text{ say}$$

(correspondingly, $\boldsymbol{\beta}' = [\beta_1, \beta_2, \ldots, \beta_p] = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\,\boldsymbol{\Sigma}^{-1}$), and

$$c = \ln\!\left[\frac{C(1\mid 2)\,p_2}{C(2\mid 1)\,p_1}\right].$$
## Priors

- Typically, information is not available on the prior probabilities $p_1$ and $p_2$.
- They are usually taken to be equal, making $c$ a function only of the ratio of the two costs.
- If, in addition, the misclassification costs $C(1\mid 2)$ and $C(2\mid 1)$ are equal, then $c = 0$.
Ordinarily $\boldsymbol{\Sigma}$, $\boldsymbol{\mu}_1$, and $\boldsymbol{\mu}_2$ are not known and must be estimated from the data by $\mathbf{S}$, $\bar{\mathbf{x}}_1$, and $\bar{\mathbf{x}}_2$, respectively, so we use

$$\mathbf{S}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) \text{ for } \boldsymbol{\beta}, \text{ etc.,}$$

where $\mathbf{S}^{-1}$ is taken to be the inverse of

$$\mathbf{S}_{\text{pooled}} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2}$$

and $\mathbf{S}_1$ and $\mathbf{S}_2$ are the sample covariance matrices for the two groups (populations), respectively.
## Minimum ECM for Two Normals

Allocate $\mathbf{x}_0$ to $\pi_1$ if

$$\hat{\boldsymbol{\beta}}'\mathbf{x}_0 - \hat{\boldsymbol{\beta}}'\!\left(\frac{\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2}{2}\right) \ge c$$
## Linear Discriminant Function

$$\boldsymbol{\beta}'\mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\,\boldsymbol{\Sigma}^{-1}\mathbf{x}$$

is called the linear discriminant function of $\mathbf{x}$. This linear combination of $\mathbf{x}$ summarizes all of the information in $\mathbf{x}$ that is available for discriminating between the two populations.
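A sketch of the sample version of the rule using the estimates above; the tiny data set is invented purely for illustration, and c = 0 (equal priors and costs) is assumed:

```python
import numpy as np

# Invented two-group training data (rows are observations)
X1 = np.array([[2.0, 1.0], [3.0, 2.0], [4.0, 1.5], [3.5, 0.5]])
X2 = np.array([[-1.0, 0.0], [0.0, -1.0], [-2.0, 0.5], [-0.5, -0.5]])

n1, n2 = len(X1), len(X2)
xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)
S2 = np.cov(X2, rowvar=False)

# S_pooled = [(n1-1)S1 + (n2-1)S2] / (n1 + n2 - 2)
S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# beta_hat = S_pooled^{-1} (xbar1 - xbar2)
beta_hat = np.linalg.solve(S_pooled, xbar1 - xbar2)
midpoint_score = beta_hat @ (xbar1 + xbar2) / 2.0

def allocate(x0):
    """1 if beta_hat'x0 >= beta_hat'(xbar1 + xbar2)/2, else 2 (c = 0)."""
    return 1 if beta_hat @ x0 >= midpoint_score else 2
```

Each group mean is allocated back to its own population, as the rule guarantees whenever $\mathbf{S}_{\text{pooled}}$ is positive definite.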
## Unequal Covariance Matrices

Allocate $\mathbf{x}_0$ to $\pi_1$ if

$$-\tfrac{1}{2}\,\mathbf{x}_0'\,(\mathbf{S}_1^{-1} - \mathbf{S}_2^{-1})\,\mathbf{x}_0 + (\bar{\mathbf{x}}_1'\mathbf{S}_1^{-1} - \bar{\mathbf{x}}_2'\mathbf{S}_2^{-1})\,\mathbf{x}_0 - k \ge c$$

where

$$k = \tfrac{1}{2}\ln\!\left(\frac{|\mathbf{S}_1|}{|\mathbf{S}_2|}\right) + \tfrac{1}{2}\left(\bar{\mathbf{x}}_1'\mathbf{S}_1^{-1}\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2'\mathbf{S}_2^{-1}\bar{\mathbf{x}}_2\right)$$
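The quadratic rule transcribes directly into code. The summary statistics below are invented to illustrate it; allocation compares the score against c, taken here as 0:

```python
import numpy as np

# Invented summary statistics for the two groups
xbar1, S1 = np.array([0.0, 0.0]), np.diag([1.0, 1.0])
xbar2, S2 = np.array([2.0, 0.0]), np.diag([4.0, 4.0])

S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)

# k = (1/2) ln(|S1|/|S2|) + (1/2)(xbar1'S1^{-1}xbar1 - xbar2'S2^{-1}xbar2)
k = (0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
     + 0.5 * (xbar1 @ S1inv @ xbar1 - xbar2 @ S2inv @ xbar2))

def quadratic_score(x0):
    """Allocate x0 to pi_1 when this score is >= c (here c = 0)."""
    return (-0.5 * x0 @ (S1inv - S2inv) @ x0
            + (xbar1 @ S1inv - xbar2 @ S2inv) @ x0
            - k)
```

Each group mean scores on its own side of the boundary, so the rule behaves sensibly on this toy input.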

## Fisher's Discriminant Function

Allocate $\mathbf{x}_0$ to $\pi_1$ if

$$(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\,\mathbf{S}_{\text{pooled}}^{-1}\,\mathbf{x}_0 \ge \tfrac{1}{2}\,(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\,\mathbf{S}_{\text{pooled}}^{-1}\,(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2)$$

Note: the $p$-variate standardized distance between two vectors is defined as

$$\Delta_X(\bar{\mathbf{x}}_1, \bar{\mathbf{x}}_2) = \max_{\mathbf{a} \in \mathbb{R}^p,\ \mathbf{a} \ne \mathbf{0}} \frac{\left|\mathbf{a}'(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)\right|}{\left(\mathbf{a}'\,\mathbf{S}_{\text{pooled}}\,\mathbf{a}\right)^{1/2}}$$

For this problem the maximum is attained at $\mathbf{a} = \mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$.
## Linear Discriminant Function, Alternative View

The linear combination of $\mathbf{x}$, say $y = \boldsymbol{\beta}'\mathbf{x}$, is called a linear discriminant function if

$$\Delta_y(\boldsymbol{\beta}'\boldsymbol{\mu}_1, \boldsymbol{\beta}'\boldsymbol{\mu}_2) = \Delta_X(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2).$$
## Example

Suppose

$$\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 0 \\ 0 & 9 \end{bmatrix}, \quad \boldsymbol{\mu}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \boldsymbol{\mu}_2 = \begin{bmatrix} -1 \\ 0 \end{bmatrix}.$$

[Figure: plot of the linear discriminant direction for this example, axes from $-4$ to $4$. The unscaled $\boldsymbol{\beta} = \begin{bmatrix} 0.5 \\ 0.0 \end{bmatrix}$.]
## Example With Correlation

[Figure: the same example but with a correlation of 0.6 between the two variables, axes from $-4$ to $4$. The unscaled $\boldsymbol{\beta} = \begin{bmatrix} 0.7813 \\ -0.3125 \end{bmatrix}$.]

- How separated are the "mean scores" when projected onto the discriminant line? (ans: 1.25 units)
- How separated are the "mean scores" when projected onto the x-axis? (ans: 1.0 units)
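Both example β vectors and the two projected separations can be reproduced numerically (the numbers come from the slides; only the code is new):

```python
import numpy as np

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
d = mu1 - mu2

# Uncorrelated case: Sigma = diag(4, 9) gives the unscaled beta = (0.5, 0)
Sigma_a = np.array([[4.0, 0.0], [0.0, 9.0]])
beta_a = np.linalg.solve(Sigma_a, d)

# Correlation 0.6 with sd's 2 and 3, so the covariance is 0.6*2*3 = 3.6
Sigma_b = np.array([[4.0, 3.6], [3.6, 9.0]])
beta_b = np.linalg.solve(Sigma_b, d)

def separation(a, Sigma):
    """Standardized distance between the mean scores along direction a."""
    return abs(a @ d) / np.sqrt(a @ Sigma @ a)

sep_beta = separation(beta_b, Sigma_b)                 # 1.25 units
sep_xaxis = separation(np.array([1.0, 0.0]), Sigma_b)  # 1.0 units
```

Along the discriminant direction the separation equals $\sqrt{d'\boldsymbol{\Sigma}^{-1}d} = 1.25$; along the x-axis it is $2/\sqrt{4} = 1.0$, matching the slide answers.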
## More Than Two Groups

Sample variance/covariance matrix:

$$\mathbf{S}_X = \frac{1}{n-1} \sum_{i=1}^{g} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}})(\mathbf{x}_{ij} - \bar{\mathbf{x}})'$$

Among-groups sums of squares and cross-products matrix:

$$\mathbf{H} = \sum_{i=1}^{g} n_i\,(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})'$$

Pooled within-groups sums of squares and cross-products matrix:

$$\mathbf{E} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)'$$
These satisfy

$$(n-1)\,\mathbf{S}_X = \mathbf{H} + \mathbf{E}$$

- Note: this is the same decomposition we used with MANOVA.
Let $\mathbf{X}$, $\mathbf{S}_X$, $\mathbf{E}$, and $\mathbf{H}$ be defined as above. Suppose $\mathbf{E}$ is positive definite, and denote $\mathbf{A} = [\mathbf{a}_1\ \mathbf{a}_2\ \cdots\ \mathbf{a}_{k-1}]$. Then

$$\operatorname*{arg\,max}_{\mathbf{a} \in \mathbb{R}^p,\ \mathbf{a}'\mathbf{S}_X\mathbf{A} = \mathbf{0}'} \left(\frac{\mathbf{a}'\mathbf{H}\mathbf{a}}{\mathbf{a}'\mathbf{E}\mathbf{a}}\right) = \mathbf{a}_k,$$

where $\mathbf{a}_k$ is the eigenvector of $\mathbf{E}^{-1}\mathbf{H}$ corresponding to the $k$th largest eigenvalue.

This is referred to as canonical discriminant analysis.
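A sketch of the computation on a tiny invented three-group data set: build H and E as defined above, verify the MANOVA decomposition, and take the leading eigenvector of E⁻¹H as the first canonical discriminant direction:

```python
import numpy as np

# Invented data: g = 3 groups of bivariate observations
groups = [
    np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 2.5]]),
    np.array([[4.0, 0.0], [5.0, 1.0], [4.5, 0.5]]),
    np.array([[0.0, -2.0], [1.0, -3.0], [0.5, -2.5]]),
]
X = np.vstack(groups)
n = len(X)
xbar = X.mean(axis=0)

# Among-groups matrix H and pooled within-groups matrix E
H = sum(len(g) * np.outer(g.mean(axis=0) - xbar, g.mean(axis=0) - xbar)
        for g in groups)
E = sum(np.outer(x - g.mean(axis=0), x - g.mean(axis=0))
        for g in groups for x in g)

S_X = np.cov(X, rowvar=False)   # (n-1) S_X should equal H + E

# Eigenvectors of E^{-1}H, ordered by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(E, H))
order = np.argsort(eigvals.real)[::-1]
a1 = eigvecs[:, order[0]].real   # first canonical discriminant direction
```

The eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ are nonnegative because the matrix is similar to the positive semidefinite $\mathbf{E}^{-1/2}\mathbf{H}\mathbf{E}^{-1/2}$.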
## Canonical Correlation Analysis

- A statistical technique to identify and measure the association between two sets of variables.
- Multiple regression can be interpreted as a special case of such an analysis.
- The "multiple correlation coefficient," $R$, can be thought of as the maximum correlation attainable between the dependent variable and a linear combination of the independent variables.
- CCA is an extension of the multiple $R$ of multiple regression.
- In CCA, there can be multiple response variables.
- The canonical correlations are the maximum correlations between a linear combination of the responses and a linear combination of the predictor variables.
## Canonical Correlations

Suppose

$$\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix} \sim N\!\left( \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \right)$$

where $\mathbf{x}_1 = (x_{11}, \ldots, x_{1q})'$ and $\mathbf{x}_2 = (x_{21}, \ldots, x_{2,p-q})'$.

Note that $\mathrm{Var}(\mathbf{x}_1) = \boldsymbol{\Sigma}_{11}$ is $q \times q$, $\mathrm{Var}(\mathbf{x}_2) = \boldsymbol{\Sigma}_{22}$ is $(p-q) \times (p-q)$, $\mathrm{Cov}(\mathbf{x}_1, \mathbf{x}_2) = \boldsymbol{\Sigma}_{12}$ is $q \times (p-q)$, $\mathrm{Cov}(\mathbf{x}_2, \mathbf{x}_1) = \boldsymbol{\Sigma}_{21}$ is $(p-q) \times q$, and $\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}'$.
## The First Canonical Correlation

- Find $\mathbf{a}_1$ and $\mathbf{b}_1$ (vectors of constants) such that $\mathrm{corr}(\mathbf{a}_1'\mathbf{x}_1, \mathbf{b}_1'\mathbf{x}_2)$ is as large as possible.
- Let $U_1 = \mathbf{a}_1'\mathbf{x}_1$ and $V_1 = \mathbf{b}_1'\mathbf{x}_2$, and call them canonical variables.
- Then $\mathrm{Var}(U_1) = \mathbf{a}_1'\boldsymbol{\Sigma}_{11}\mathbf{a}_1$, $\mathrm{Var}(V_1) = \mathbf{b}_1'\boldsymbol{\Sigma}_{22}\mathbf{b}_1$, and $\mathrm{Cov}(U_1, V_1) = \mathbf{a}_1'\boldsymbol{\Sigma}_{12}\mathbf{b}_1$.
The correlation between $U_1$ and $V_1$ is

$$\mathrm{Corr}(U_1, V_1) = \frac{\mathrm{Cov}(U_1, V_1)}{\sqrt{\mathrm{Var}(U_1)\,\mathrm{Var}(V_1)}} = \frac{\mathbf{a}_1'\boldsymbol{\Sigma}_{12}\mathbf{b}_1}{\sqrt{\mathbf{a}_1'\boldsymbol{\Sigma}_{11}\mathbf{a}_1}\,\sqrt{\mathbf{b}_1'\boldsymbol{\Sigma}_{22}\mathbf{b}_1}}$$
## Finding the Correlation

Let $\rho_1 = \max_{\mathbf{a}_1 \ne \mathbf{0},\ \mathbf{b}_1 \ne \mathbf{0}} \mathrm{corr}(U_1, V_1)$. It can be shown that:

- $\rho_1^2$ is the largest eigenvalue of $\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$;
- $\mathbf{a}_1$ is the eigenvector corresponding to $\rho_1^2$;
- $\mathbf{b}_1$ is the eigenvector corresponding to the largest eigenvalue of $\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}$; this largest eigenvalue is also $\rho_1^2$.

Note that $0 \le \rho_1 \le 1$.