# Discrimination and Classification by yurtgc548


## Discrimination
Situation: we have two or more populations π1, π2, … (each possibly p-variate normal). The populations are known (or we have data from each population). We have data for a new case (population unknown), and we want to identify the population to which the new case belongs.
## The Basic Problem

Suppose that the data for a new case, x1, … , xp, has joint density function either

π1: f(x1, … , xp) or
π2: g(x1, … , xp).

We want to make one of the decisions

D1: classify the case into π1 (f is the correct distribution), or
D2: classify the case into π2 (g is the correct distribution).
## The Two Types of Errors

1.  Misclassifying the case into π1 when it actually lies in π2.
Let P[1|2] = P[D1|π2] = the probability of this type of error.

2.  Misclassifying the case into π2 when it actually lies in π1.
Let P[2|1] = P[D2|π1] = the probability of this type of error.

This is analogous to the Type I and Type II errors of hypothesis testing.
Note: a discrimination scheme is defined by splitting p-dimensional space into two regions:

1.    C1 = the region where we make the decision D1 (the decision to classify the case into π1).

2.    C2 = the region where we make the decision D2 (the decision to classify the case into π2).

There are several approaches to determining the regions C1 and C2, all concerned with taking into account the probabilities of misclassification, P[2|1] and P[1|2]:

1.   Set up the regions C1 and C2 so that one of the probabilities of misclassification, P[2|1] say, is fixed at some low acceptable value α, and accept the resulting level of the other probability of misclassification, P[1|2] = β.

2.   Set up the regions C1 and C2 so that the total probability of misclassification

     P[Misclassification] = P[1] P[2|1] + P[2] P[1|2]

     is minimized, where P[1] = P[the case belongs to π1] and P[2] = P[the case belongs to π2].

3.   Set up the regions C1 and C2 so that the total expected cost of misclassification

     E[Cost of Misclassification] = c2|1 P[1] P[2|1] + c1|2 P[2] P[1|2]

     is minimized, where c2|1 = the cost of misclassifying the case into π2 when the case belongs to π1, and c1|2 = the cost of misclassifying the case into π1 when the case belongs to π2.

4.   Set up the regions C1 and C2 so that the two types of error are equal: P[2|1] = P[1|2].
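Minimizing the expected cost in approach 3 leads to the classical rule: classify the case into π1 exactly when the likelihood ratio f(x)/g(x) is at least (c1|2 P[2])/(c2|1 P[1]). A minimal sketch of that rule (the function name and all numbers below are illustrative assumptions, not from the source):

```python
def classify(f_x, g_x, p1, p2, c21, c12):
    """Minimum expected-cost rule: decide D1 (classify into population pi1)
    when the likelihood ratio f(x)/g(x) meets the cost/prior threshold
    (c[1|2] * P[2]) / (c[2|1] * P[1])."""
    threshold = (c12 * p2) / (c21 * p1)
    return "D1" if f_x / g_x >= threshold else "D2"

# Equal costs and equal priors reduce to comparing the densities directly.
print(classify(f_x=0.30, g_x=0.10, p1=0.5, p2=0.5, c21=1.0, c12=1.0))
```

Note that approach 2 is the special case c2|1 = c1|2, and with equal priors as well the rule simply picks the population with the larger density at x.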
Computer security example:

π1: valid users
π2: imposters

P[2|1] = P[identifying a valid user as an imposter]

P[1|2] = P[identifying an imposter as a valid user]

P[1] = P[valid user]
P[2] = P[imposter]

c2|1 = the cost of identifying the user as an imposter when the user is a valid user.
c1|2 = the cost of identifying the user as a valid user when the user is an imposter.
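With hypothetical numbers for this security setting (every value below is an assumption chosen for illustration), the expected cost of a given screening rule is a one-line computation:

```python
# Hypothetical numbers: most users are valid, and admitting an imposter
# costs far more than wrongly challenging a valid user.
p_valid, p_imposter = 0.99, 0.01          # P[1], P[2]
c_flag_valid, c_admit_imposter = 1.0, 100.0  # c[2|1], c[1|2]
p_flag_valid, p_admit_imposter = 0.02, 0.05  # P[2|1], P[1|2] for some rule

# E[Cost] = c[2|1] P[1] P[2|1] + c[1|2] P[2] P[1|2]
expected_cost = (c_flag_valid * p_valid * p_flag_valid
                 + c_admit_imposter * p_imposter * p_admit_imposter)
print(expected_cost)
```

Even though imposters are rare here, their high admission cost means the second term dominates, which is why such systems tolerate a higher rate of challenging valid users.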
This problem can also be viewed as a hypothesis-testing problem:

H0: π1 is the correct population
HA: π2 is the correct population

with

P[2|1] = α,

P[1|2] = β,

Power = 1 − β.
## The Neyman-Pearson Lemma

Suppose that the data x1, … , xn has joint density function

f(x1, … , xn; θ),

where θ is either θ1 or θ2. Let

g(x1, … , xn) = f(x1, … , xn; θ1) and
h(x1, … , xn) = f(x1, … , xn; θ2).

We want to test

H0: θ = θ1 (g is the correct distribution) against
HA: θ = θ2 (h is the correct distribution).

The Neyman-Pearson lemma states that the Uniformly Most Powerful (UMP) test of size α is to reject H0 if

$$\frac{h(x_1,\dots,x_n)}{g(x_1,\dots,x_n)} \ge k_\alpha$$

and accept H0 if

$$\frac{h(x_1,\dots,x_n)}{g(x_1,\dots,x_n)} < k_\alpha,$$

where k_α is chosen so that the test is of size α.
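As a concrete sketch: for a single observation from N(θ, 1) with θ1 = 0 and θ2 = 1, the ratio h/g has a simple closed form, and k_α can be approximated by Monte Carlo simulation under H0 (the sample size, seed, and α below are illustrative choices, not from the source):

```python
import math
import random

random.seed(0)

def likelihood_ratio(x, mu0=0.0, mu1=1.0):
    """h(x)/g(x) for one N(mu, 1) observation:
    exp((mu1 - mu0) * x + (mu0**2 - mu1**2) / 2)."""
    return math.exp((mu1 - mu0) * x + (mu0**2 - mu1**2) / 2.0)

# Approximate k_alpha: the (1 - alpha) quantile of the ratio under H0.
alpha = 0.05
h0_sample = sorted(likelihood_ratio(random.gauss(0.0, 1.0))
                   for _ in range(100_000))
k_alpha = h0_sample[int((1 - alpha) * len(h0_sample))]

# By construction the test "reject when ratio >= k_alpha" has size ~ alpha.
size = sum(lr >= k_alpha for lr in h0_sample) / len(h0_sample)
```

Because the ratio here is monotone in x, this test is equivalent to rejecting when x exceeds the upper-α point of N(0, 1), which is how the lemma reduces to familiar z-tests in the normal case.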
Proof: Let C be the critical region of any test of size α, and let C* be the critical region of the likelihood-ratio test above, so that both tests have size α:

$$\int_{C^*} g\,d\mathbf{x} = \int_{C} g\,d\mathbf{x} = \alpha.$$

We want to show that the likelihood-ratio test is at least as powerful, i.e.

$$\int_{C^*} h\,d\mathbf{x} \ge \int_{C} h\,d\mathbf{x}.$$

Note: h ≥ k_α g on C* while h < k_α g on its complement $\bar{C}^*$, hence

$$\int_{C^* \cap \bar{C}} h\,d\mathbf{x} \ge k_\alpha \int_{C^* \cap \bar{C}} g\,d\mathbf{x}$$

and

$$\int_{C \cap \bar{C}^*} h\,d\mathbf{x} \le k_\alpha \int_{C \cap \bar{C}^*} g\,d\mathbf{x}.$$

Since both critical regions have g-probability α,

$$\int_{C^* \cap \bar{C}} g\,d\mathbf{x} = \alpha - \int_{C^* \cap C} g\,d\mathbf{x} = \int_{C \cap \bar{C}^*} g\,d\mathbf{x}.$$

Thus

$$\int_{C^* \cap \bar{C}} h\,d\mathbf{x} \ge k_\alpha \int_{C \cap \bar{C}^*} g\,d\mathbf{x} \ge \int_{C \cap \bar{C}^*} h\,d\mathbf{x},$$

and

$$\int_{C^*} h\,d\mathbf{x} \ge \int_{C} h\,d\mathbf{x}$$

when we add the common quantity

$$\int_{C^* \cap C} h\,d\mathbf{x}$$

to both sides. Q.E.D.
## Fisher's Linear Discriminant Function

Suppose that x = (x1, … , xp)′ is data from a p-variate normal distribution with mean vector either

$$\boldsymbol{\mu}_1 \ (\text{population } \pi_1) \quad\text{or}\quad \boldsymbol{\mu}_2 \ (\text{population } \pi_2).$$

The covariance matrix Σ is the same for both populations π1 and π2. The Neyman-Pearson lemma states that we should classify into populations π1 and π2 using the likelihood ratio

$$\lambda = \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} = \exp\!\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right\}.$$

That is, make the decision

D1: the population is π1

if λ ≥ k_α, or, taking logarithms,

$$-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) \ge \ln k_\alpha,$$

or, expanding the quadratic forms and cancelling the common term x′Σ⁻¹x,

$$(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}(\boldsymbol{\mu}_1+\boldsymbol{\mu}_2) \ge \ln k_\alpha.$$

Finally we make the decision

D1: the population is π1

if

$$\ell(\mathbf{x}) = \mathbf{a}'\mathbf{x} \ge m,$$

where

$$\mathbf{a} = \Sigma^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2), \qquad m = \ln k_\alpha + \tfrac{1}{2}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}(\boldsymbol{\mu}_1+\boldsymbol{\mu}_2).$$

The function

$$\ell(\mathbf{x}) = (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x}$$

is called Fisher's linear discriminant function.
In the case where the population parameters are unknown but estimated from data, replace μ1, μ2, and Σ by the sample mean vectors x̄1, x̄2 and the pooled sample covariance matrix

$$S_{\text{pooled}} = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2},$$

giving the estimated Fisher's linear discriminant function

$$\hat{\ell}(\mathbf{x}) = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'S_{\text{pooled}}^{-1}\,\mathbf{x}.$$
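A minimal NumPy sketch of the sample-based discriminant, assuming two training samples `X1` and `X2` and using the pooled sample covariance as the estimate of the common Σ (the synthetic data below are illustrative, not the lecture's example):

```python
import numpy as np

def fisher_ldf(X1, X2):
    """Estimate Fisher's linear discriminant from two training samples.
    Returns the coefficient vector a = S_pooled^{-1}(xbar1 - xbar2) and the
    midpoint cutoff m = a'(xbar1 + xbar2)/2 (i.e. taking k_alpha = 1);
    classify a new case x into population 1 when a'x >= m."""
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    a = np.linalg.solve(S_pooled, xbar1 - xbar2)
    m = a @ (xbar1 + xbar2) / 2.0
    return a, m

# Synthetic (hypothetical) bivariate normal training data for illustration.
rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([2.0, 2.0], np.eye(2), size=50)
X2 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=50)
a, m = fisher_ldf(X1, X2)

# A new case near the population-1 mean should score above the cutoff.
decision = "pi1" if a @ np.array([2.0, 2.0]) >= m else "pi2"
```

Using `np.linalg.solve` rather than explicitly inverting S_pooled is the numerically preferred way to compute Σ⁻¹(x̄1 − x̄2).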
## Example 2

Annual financial data are collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data on the four variables

• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales)

are given in the following table.
## Examples using SPSS
