Discrimination and Classification
Discrimination
Situation:
We have two or more populations π1, π2, … (possibly p-variate normal).
The populations are known (or we have data from each population).
We have data for a new case (population unknown), and we want to identify the population to which the new case belongs.
The Basic Problem
Suppose that the data from a new case, x1, … , xp, has joint density function either
    π1: f(x1, … , xp) or
    π2: g(x1, … , xp).
We want to make the decision
    D1: Classify the case in π1 (f is the correct distribution), or
    D2: Classify the case in π2 (g is the correct distribution).
The Two Types of Errors

1. Misclassifying the case in π1 when it actually lies in π2.
   Let P[1|2] = P[D1|π2] = the probability of this type of error.

2. Misclassifying the case in π2 when it actually lies in π1.
   Let P[2|1] = P[D2|π1] = the probability of this type of error.

This is similar to Type I and Type II errors in hypothesis testing.
Note:
A discrimination scheme is defined by splitting p-dimensional space into two regions:

1. C1 = the region where we make the decision D1
   (the decision to classify the case in π1).

2. C2 = the region where we make the decision D2
   (the decision to classify the case in π2).

There are several approaches to determining the regions C1 and C2, all concerned with taking into account the probabilities of misclassification P[2|1] and P[1|2]:


1. Set up the regions C1 and C2 so that one of the probabilities of misclassification, P[2|1] say, is held at some low acceptable value α, and accept the resulting level of the other probability of misclassification, P[1|2] = β.
2. Set up the regions C1 and C2 so that the total probability of misclassification

       P[Misclassification] = P[1] P[2|1] + P[2] P[1|2]

   is minimized, where
       P[1] = P[the case belongs to π1]
       P[2] = P[the case belongs to π2].
3. Set up the regions C1 and C2 so that the total expected cost of misclassification

       E[Cost of Misclassification] = c2|1 P[1] P[2|1] + c1|2 P[2] P[1|2]

   is minimized (a sketch of this rule follows the list), where
       P[1] = P[the case belongs to π1]
       P[2] = P[the case belongs to π2]
       c2|1 = the cost of misclassifying the case in π2 when the case belongs to π1
       c1|2 = the cost of misclassifying the case in π1 when the case belongs to π2.
4. Set up the regions C1 and C2 so that the two types of error are equal:

       P[2|1] = P[1|2].
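Minimizing the expected cost in approach 3 leads to a likelihood ratio rule: make decision D1 exactly when f(x)/g(x) ≥ (c1|2 P[2]) / (c2|1 P[1]); approach 2 is the special case c2|1 = c1|2 = 1. A minimal Python sketch of this rule, where the densities, priors, and costs are hypothetical and chosen only for illustration:

from scipy.stats import norm

p1, p2 = 0.7, 0.3                   # priors P[1], P[2] (hypothetical)
c21, c12 = 1.0, 5.0                 # costs c2|1, c1|2 (hypothetical)
f = norm(loc=0.0, scale=1.0).pdf    # density under pi1
g = norm(loc=2.0, scale=1.0).pdf    # density under pi2

def classify(x):
    # Decision D1 iff the likelihood ratio f/g meets the
    # cost/prior threshold (c1|2 P[2]) / (c2|1 P[1]).
    threshold = (c12 * p2) / (c21 * p1)
    return "pi1" if f(x) / g(x) >= threshold else "pi2"

print(classify(0.5))   # "pi1"
print(classify(0.8))   # "pi2": the high cost c1|2 pulls the boundary
                       # from x = 1 down to about x = 0.62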
Computer security:
    π1: Valid users
    π2: Imposters

P[2|1] = P[identifying a valid user as an imposter]
P[1|2] = P[identifying an imposter as a valid user]
P[1] = P[valid user]
P[2] = P[imposter]
c2|1 = the cost of identifying the user as an imposter when the user is a valid user.
c1|2 = the cost of identifying the user as a valid user when the user is an imposter.
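For concreteness, take hypothetical numbers (not from the original example): P[1] = 0.95, P[2] = 0.05, c2|1 = 1, c1|2 = 50, and a scheme achieving P[2|1] = 0.02 and P[1|2] = 0.10. Then

    E[Cost] = (1)(0.95)(0.02) + (50)(0.05)(0.10) = 0.019 + 0.25 = 0.269,

so even though imposters are rare, the expensive P[1|2] error dominates the expected cost, and the regions should be chosen to shrink C1.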
This problem can be viewed as a hypothesis testing problem:

    H0: π1 is the correct population
    HA: π2 is the correct population

with

    P[2|1] = α
    P[1|2] = β
    Power = 1 − β.
The Neyman-Pearson Lemma
Suppose that the data x1, … , xn has joint density function

    f(x1, … , xn; θ),

where θ is either θ1 or θ2. Let

    g(x1, … , xn) = f(x1, … , xn; θ1) and
    h(x1, … , xn) = f(x1, … , xn; θ2).

We want to test
    H0: θ = θ1 (g is the correct distribution) against
    HA: θ = θ2 (h is the correct distribution).
The Neyman-Pearson Lemma states that the Uniformly Most Powerful (UMP) test of size α is to reject H0 if

    \lambda(x_1, \ldots, x_n) = \frac{h(x_1, \ldots, x_n)}{g(x_1, \ldots, x_n)} \ge k_\alpha

and accept H0 if

    \lambda(x_1, \ldots, x_n) = \frac{h(x_1, \ldots, x_n)}{g(x_1, \ldots, x_n)} < k_\alpha,

where k_\alpha is chosen so that the test is of size α, i.e. P[\lambda \ge k_\alpha \mid H_0] = \alpha.
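A minimal Python sketch of this test; the two densities, the sample size, and α are hypothetical, and k_α is approximated as the (1 − α) quantile of λ under H0 by simulation rather than derived in closed form:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha = 0.05
g = norm(loc=0.0, scale=1.0)   # density under H0 (theta = theta1)
h = norm(loc=1.0, scale=1.0)   # density under HA (theta = theta2)

def lam(x):
    # likelihood ratio lambda = h(x1,...,xn) / g(x1,...,xn)
    return np.prod(h.pdf(x)) / np.prod(g.pdf(x))

# Approximate k_alpha: simulate lambda under H0 and take its
# (1 - alpha) quantile, so that P[lambda >= k_alpha | H0] ~ alpha.
n = 10
sims = np.array([lam(g.rvs(size=n, random_state=rng)) for _ in range(20000)])
k_alpha = np.quantile(sims, 1 - alpha)

x = h.rvs(size=n, random_state=rng)   # a new sample (actually from HA)
print("reject H0" if lam(x) >= k_alpha else "accept H0")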
Proof: Let C be the critical region of any test of size α, and let

    C_\alpha = \{ (x_1, \ldots, x_n) : \lambda = h/g \ge k_\alpha \}

be the critical region of the likelihood ratio test. We want to show that the likelihood ratio test is at least as powerful, i.e. that

    \int_{C_\alpha} h \, d\mathbf{x} \ge \int_{C} h \, d\mathbf{x}.

Note:

    C_\alpha = (C_\alpha \cap C) \cup (C_\alpha \cap \bar{C}) and C = (C \cap C_\alpha) \cup (C \cap \bar{C}_\alpha),

hence

    \int_{C_\alpha} h \, d\mathbf{x} = \int_{C_\alpha \cap C} h \, d\mathbf{x} + \int_{C_\alpha \cap \bar{C}} h \, d\mathbf{x}

and

    \int_{C} h \, d\mathbf{x} = \int_{C \cap C_\alpha} h \, d\mathbf{x} + \int_{C \cap \bar{C}_\alpha} h \, d\mathbf{x}.

Thus, since h \ge k_\alpha g on C_\alpha and h < k_\alpha g outside C_\alpha,

    \int_{C_\alpha \cap \bar{C}} h \, d\mathbf{x} \ge k_\alpha \int_{C_\alpha \cap \bar{C}} g \, d\mathbf{x}

and

    \int_{C \cap \bar{C}_\alpha} h \, d\mathbf{x} \le k_\alpha \int_{C \cap \bar{C}_\alpha} g \, d\mathbf{x}.

Both tests have size α, so \int_{C_\alpha} g \, d\mathbf{x} = \int_{C} g \, d\mathbf{x} = \alpha; subtracting the common term \int_{C_\alpha \cap C} g \, d\mathbf{x} from both gives

    \int_{C_\alpha \cap \bar{C}} g \, d\mathbf{x} = \int_{C \cap \bar{C}_\alpha} g \, d\mathbf{x}.

Thus

    \int_{C_\alpha \cap \bar{C}} h \, d\mathbf{x} \ge k_\alpha \int_{C_\alpha \cap \bar{C}} g \, d\mathbf{x} = k_\alpha \int_{C \cap \bar{C}_\alpha} g \, d\mathbf{x} \ge \int_{C \cap \bar{C}_\alpha} h \, d\mathbf{x},

and the desired inequality follows when we add the common quantity

    \int_{C_\alpha \cap C} h \, d\mathbf{x}

to both sides.  Q.E.D.
Fisher's Linear Discriminant Function
Suppose that x = (x1, … , xp)′ is data from a p-variate normal distribution with mean vector either

    \boldsymbol{\mu}_1 (population π1) or \boldsymbol{\mu}_2 (population π2).

The covariance matrix Σ is the same for both populations π1 and π2.
The Neyman-Pearson Lemma states that we should classify into populations π1 and π2 using the likelihood ratio

    \lambda = \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})}
            = \exp\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)
                         +\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right\}.

That is, make the decision
    D1: population is π1
if λ ≥ k_α, or

    \ln\lambda = (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x}
               - \tfrac{1}{2}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}(\boldsymbol{\mu}_1+\boldsymbol{\mu}_2) \ge \ln k_\alpha,

or

    \mathbf{a}'\mathbf{x} \ge K,

where

    \mathbf{a} = \Sigma^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)

and

    K = \ln k_\alpha + \tfrac{1}{2}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}(\boldsymbol{\mu}_1+\boldsymbol{\mu}_2).

Finally we make the decision
    D1: population is π1
if

    \ell(\mathbf{x}) = \mathbf{a}'\mathbf{x} = (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x} \ge K.

The function

    \ell(\mathbf{x}) = (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x}

is called Fisher's linear discriminant function.

In the case where the populations are unknown but estimated from data, replace \boldsymbol{\mu}_1, \boldsymbol{\mu}_2 by the sample mean vectors \bar{\mathbf{x}}_1, \bar{\mathbf{x}}_2 and Σ by the pooled sample covariance matrix

    S_{pooled} = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2},

giving the estimated Fisher's linear discriminant function

    \hat{\ell}(\mathbf{x}) = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'S_{pooled}^{-1}\mathbf{x}.
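A minimal Python sketch of the estimated rule; the training samples here are randomly generated placeholders, and the cutoff K is taken at the midpoint ½ â′(x̄1 + x̄2), i.e. ln k_α = 0 (equal priors and costs):

import numpy as np

rng = np.random.default_rng(1)
# placeholder training samples from two hypothetical populations
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=50)   # pi1
X2 = rng.multivariate_normal([2, 1], np.eye(2), size=60)   # pi2

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
# pooled sample covariance matrix
S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
            + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

a_hat = np.linalg.solve(S_pooled, xbar1 - xbar2)   # a = S^{-1}(xbar1 - xbar2)
K = 0.5 * a_hat @ (xbar1 + xbar2)                  # midpoint cutoff

def classify(x):
    # decision D1 iff the discriminant score a'x is at least K
    return "pi1" if a_hat @ x >= K else "pi2"

print(classify(np.array([0.2, -0.1])))   # -> "pi1"
print(classify(np.array([1.9, 1.2])))    # -> "pi2"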
Example 2
Annual financial data are collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data on four variables,
• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales),
are given in the following table.
Examples using SPSS

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:0
posted:7/24/2013
language:English
pages:30