# Classification or Cluster Analysis

Discrimination and Classification
Discrimination
Situation:
We have two or more populations π1, π2, etc.
(possibly p-variate normal).
The populations are known (or we have data
from each population).
We have data for a new case (population
unknown) and we want to identify the
population to which the new case belongs.
The Basic Problem
Suppose that the data from a new case x1, … , xp
has joint density function either:
π1: g(x1, … , xp) or
π2: h(x1, … , xp)
We want to make the decision
D1: classify the case in π1 (g is the
correct distribution), or
D2: classify the case in π2 (h is the
correct distribution).
The Two Types of Errors

1.  Misclassifying the case in π1 when it actually lies
in π2.
Let P[1|2] = P[D1|π2] = the probability of this type of error.

2.  Misclassifying the case in π2 when it actually lies
in π1.
Let P[2|1] = P[D2|π1] = the probability of this type of error.

This is similar to Type I and Type II errors in hypothesis
testing.
Note:
A discrimination scheme is defined by splitting p-dimensional
space into two regions:

1.    C1 = the region where we make the decision D1
(the decision to classify the case in π1).

2.    C2 = the region where we make the decision D2
(the decision to classify the case in π2).
There are several approaches to determining the
regions C1 and C2, all concerned with taking into
account the probabilities of misclassification P[2|1] and
P[1|2]:

1.   Set up the regions C1 and C2 so that one of the
probabilities of misclassification, P[2|1] say, is at
some low acceptable value α, and accept the resulting
level of the other probability of misclassification,
P[1|2] = β.

2.   Set up the regions C1 and C2 so that the total
probability of misclassification

P[Misclassification] = P[1] P[2|1] + P[2] P[1|2]

is minimized, where
P[1] = P[the case belongs to π1] and
P[2] = P[the case belongs to π2].

3.   Set up the regions C1 and C2 so that the total
expected cost of misclassification

E[Cost of Misclassification] = ECM = c2|1 P[1] P[2|1] + c1|2 P[2] P[1|2]

is minimized, where P[1] and P[2] are as above and
c2|1 = the cost of misclassifying the case in π2
when the case belongs to π1, and
c1|2 = the cost of misclassifying the case in π1
when the case belongs to π2.
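The ECM criterion above can be written out directly as a small function; the numbers in the example call are hypothetical illustrations, not from the text.

```python
# Expected cost of misclassification:
# ECM = c[2|1] P[1] P[2|1] + c[1|2] P[2] P[1|2]

def ecm(p1, p2, p21, p12, c21=1.0, c12=1.0):
    """p1, p2: prior probabilities P[1], P[2];
    p21 = P[2|1], p12 = P[1|2]; c21, c12: misclassification costs."""
    return c21 * p1 * p21 + c12 * p2 * p12

# Hypothetical example: equal priors, error rates 5% and 10%, equal unit costs,
# giving 0.5 * 0.05 + 0.5 * 0.10 = 0.075.
print(ecm(0.5, 0.5, 0.05, 0.10))
```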
The Optimal Classification Rule
The Neyman-Pearson Lemma
Suppose that the data x1, … , xp has joint density
function
f(x1, … , xp ; θ)
where θ is either θ1 or θ2.
Let
g(x1, … , xp) = f(x1, … , xp ; θ1) and
h(x1, … , xp) = f(x1, … , xp ; θ2)

We want to make the decision
D1: θ = θ1 (g is the correct distribution) against
D2: θ = θ2 (h is the correct distribution)
Then the optimal regions (minimizing ECM, the expected
cost of misclassification) for making the decisions D1
and D2 respectively are

$$C_1 = \left\{ (x_1, \dots, x_p) : \frac{L(\theta_1)}{L(\theta_2)} = \frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} \ge k \right\}$$

and

$$C_2 = \left\{ (x_1, \dots, x_p) : \frac{L(\theta_1)}{L(\theta_2)} = \frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} < k \right\}$$

where

$$k = \frac{c_{1|2}\,P[2]}{c_{2|1}\,P[1]}$$
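The optimal rule above compares the likelihood ratio g/h with the threshold k. As a minimal sketch, the two densities below are assumed univariate normals with illustrative parameters, not anything specified in the text.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, g, h, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    """Minimum-ECM rule: decide D1 iff g(x)/h(x) >= k = c[1|2] P[2] / (c[2|1] P[1])."""
    k = (c12 * p2) / (c21 * p1)
    return "D1" if g(x) / h(x) >= k else "D2"

g = lambda x: normal_pdf(x, 0.0, 1.0)   # population pi_1 (hypothetical)
h = lambda x: normal_pdf(x, 3.0, 1.0)   # population pi_2 (hypothetical)
print(classify(0.4, g, h))   # a point near mu_1 is classified D1
```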
Proof: ECM = E[Cost of Misclassification]
= c2|1 P[1] P[2|1] + c1|2 P[2] P[1|2]

Writing $\int_A$ for the multiple integral $\int_A \cdots \int \; dx_1 \cdots dx_p$,

$$P[1|2] = \int_{C_1} h(x_1, \dots, x_p)\,dx_1 \cdots dx_p$$

$$P[2|1] = \int_{C_2} g(x_1, \dots, x_p)\,dx_1 \cdots dx_p = 1 - \int_{C_1} g(x_1, \dots, x_p)\,dx_1 \cdots dx_p$$

Hence

$$\mathrm{ECM} = c_{2|1} P[1] \left( 1 - \int_{C_1} g(x_1, \dots, x_p)\,dx_1 \cdots dx_p \right) + c_{1|2} P[2] \int_{C_1} h(x_1, \dots, x_p)\,dx_1 \cdots dx_p$$

Therefore

$$\mathrm{ECM} = c_{2|1} P[1] + \int_{C_1} \left[ c_{1|2} P[2]\, h(x_1, \dots, x_p) - c_{2|1} P[1]\, g(x_1, \dots, x_p) \right] dx_1 \cdots dx_p$$

Thus ECM is minimized if C1 contains all of the points
(x1, …, xp) such that the integrand is negative:

$$c_{1|2} P[2]\, h(x_1, \dots, x_p) - c_{2|1} P[1]\, g(x_1, \dots, x_p) < 0$$

that is,

$$\frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} > \frac{c_{1|2}\,P[2]}{c_{2|1}\,P[1]}$$
The Neyman-Pearson Lemma
(another proof)
Suppose that the data x1, … , xn has joint density
function
f(x1, … , xn ; θ)
where θ is either θ1 or θ2.
Let
g(x1, … , xn) = f(x1, … , xn ; θ1) and
h(x1, … , xn) = f(x1, … , xn ; θ2)

We want to test
H0: θ = θ1 (g is the correct distribution) against
HA: θ = θ2 (h is the correct distribution)
The Neyman-Pearson Lemma states that the Uniformly
Most Powerful (UMP) test of size α is to reject H0 if

$$\frac{L(\theta_2)}{L(\theta_1)} = \frac{h(x_1, \dots, x_n)}{g(x_1, \dots, x_n)} \ge k_\alpha$$

and accept H0 if

$$\frac{L(\theta_2)}{L(\theta_1)} = \frac{h(x_1, \dots, x_n)}{g(x_1, \dots, x_n)} < k_\alpha$$

where k_α is chosen so that the test is of size α.
Proof: Let C be the critical region of any test of size α.
Let

$$C^* = \left\{ (x_1, \dots, x_n) : \frac{h(x_1, \dots, x_n)}{g(x_1, \dots, x_n)} \ge k_\alpha \right\}$$

Writing $\int_A$ for the multiple integral $\int_A \cdots \int \; dx_1 \cdots dx_n$, both tests have size α:

$$\int_{C^*} g(x_1, \dots, x_n)\,dx_1 \cdots dx_n = \int_{C} g(x_1, \dots, x_n)\,dx_1 \cdots dx_n = \alpha$$

We want to show that

$$\int_{C^*} h(x_1, \dots, x_n)\,dx_1 \cdots dx_n \ge \int_{C} h(x_1, \dots, x_n)\,dx_1 \cdots dx_n$$

Note:

$$C^* = \left( C^* \cap C \right) \cup \left( C^* \cap \bar{C} \right) \quad\text{and}\quad C = \left( C^* \cap C \right) \cup \left( \bar{C}^* \cap C \right)$$

hence

$$\int_{C^*} g = \int_{C^* \cap C} g + \int_{C^* \cap \bar{C}} g = \alpha$$

and

$$\int_{C} g = \int_{C^* \cap C} g + \int_{\bar{C}^* \cap C} g = \alpha$$

Thus

$$\int_{C^* \cap \bar{C}} g = \int_{\bar{C}^* \cap C} g$$

Now

$$\int_{C^* \cap \bar{C}} g \le \frac{1}{k_\alpha} \int_{C^* \cap \bar{C}} h \qquad \text{since } g(x_1, \dots, x_n) \le \tfrac{1}{k_\alpha} h(x_1, \dots, x_n) \text{ in } C^*$$

and

$$\int_{\bar{C}^* \cap C} g \ge \frac{1}{k_\alpha} \int_{\bar{C}^* \cap C} h \qquad \text{since } g(x_1, \dots, x_n) > \tfrac{1}{k_\alpha} h(x_1, \dots, x_n) \text{ in } \bar{C}^*$$

Combining these with the previous equality,

$$\frac{1}{k_\alpha} \int_{C^* \cap \bar{C}} h \ge \int_{C^* \cap \bar{C}} g = \int_{\bar{C}^* \cap C} g \ge \frac{1}{k_\alpha} \int_{\bar{C}^* \cap C} h$$

Thus

$$\int_{C^* \cap \bar{C}} h \ge \int_{\bar{C}^* \cap C} h$$

and, adding the common quantity

$$\int_{C^* \cap C} h(x_1, \dots, x_n)\,dx_1 \cdots dx_n$$

to both sides,

$$\int_{C^*} h(x_1, \dots, x_n)\,dx_1 \cdots dx_n \ge \int_{C} h(x_1, \dots, x_n)\,dx_1 \cdots dx_n \qquad \text{Q.E.D.}$$
Fisher's Linear Discriminant Function
Suppose that x = (x1, … , xp)′ is data from a p-variate
normal distribution with mean vector either μ1 or μ2.
The covariance matrix Σ is the same for both
populations π1 and π2:

$$g(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} \left|\Sigma\right|^{1/2}} \, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)}$$

$$h(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} \left|\Sigma\right|^{1/2}} \, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)}$$

The Neyman-Pearson Lemma states that we should
classify into populations π1 and π2 using the ratio

$$\lambda = \frac{g(\mathbf{x})}{h(\mathbf{x})} = e^{\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)}$$
That is, make the decision
D1: population is π1
if λ ≥ k, or

$$\ln \lambda = \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \ge \ln k$$

or

$$(\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \ge 2 \ln k$$

Expanding both quadratic forms (the x′Σ⁻¹x terms cancel):

$$2 (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 \ge 2 \ln k$$

and

$$(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} \mathbf{x} \ge \ln k + \tfrac{1}{2} \left( \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 \right)$$

Finally we make the decision
D1: population is π1
if

$$a' \mathbf{x} \ge K$$

where

$$a = \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \quad\text{and}\quad K = \ln k + \tfrac{1}{2} \left( \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 \right)$$

and

$$k = \frac{c_{1|2}\,P[2]}{c_{2|1}\,P[1]}$$

Note: k = 1 and ln k = 0 if c1|2 = c2|1 and P[1] = P[2], and then

$$K = \tfrac{1}{2} \left( \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 \right) = \tfrac{1}{2} (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)' \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
The function

$$a' \mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} \mathbf{x}$$

is called Fisher's linear discriminant function.

[Figure: the two populations π1 and π2 with means μ1 and μ2, separated by the boundary a′x = (μ1 − μ2)′Σ⁻¹x = K.]
In the case where the population parameters are unknown
but estimated from data, Fisher's linear discriminant
function becomes

$$\hat{a}' \mathbf{x} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)' S^{-1} \mathbf{x}$$

where x̄1 and x̄2 are the sample mean vectors and S is the
pooled sample covariance matrix.
[Figure: a pictorial representation of Fisher's procedure for two populations: a scatter of the two samples π1 and π2 in the (x1, x2) plane, with the line â′x = K̂ separating the region "classify as π1" from "classify as π2".]
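The sample version can be sketched in pure Python for p = 2, here using a few rows of the riding-mower data from Example 1. The pooled covariance S = ((n1−1)S1 + (n2−1)S2)/(n1+n2−2) is the usual pooled estimate, an assumption since the text does not spell out how S is formed.

```python
# Estimated Fisher discriminant: a_hat = S^{-1}(xbar1 - xbar2), K_hat with k = 1.

def mean(X):
    return [sum(x[j] for x in X) / len(X) for j in range(2)]

def cov(X, m):
    n = len(X)
    return [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in X) / (n - 1)
             for j in range(2)] for i in range(2)]

def inv2(S):
    (a, b), (c, e) = S
    det = a * e - b * c
    return [[e / det, -b / det], [-c / det, a / det]]

X1 = [[20.0, 9.2], [28.5, 8.4], [21.6, 10.8], [29.0, 11.8]]   # owners (Example 1)
X2 = [[17.6, 10.4], [14.4, 10.2], [16.4, 8.8], [11.0, 9.4]]   # nonowners
m1, m2 = mean(X1), mean(X2)
n1, n2 = len(X1), len(X2)
S1, S2 = cov(X1, m1), cov(X2, m2)
# pooled covariance (assumed form): S = ((n1-1)S1 + (n2-1)S2) / (n1+n2-2)
S = [[((n1 - 1) * S1[i][j] + (n2 - 1) * S2[i][j]) / (n1 + n2 - 2)
     for j in range(2)] for i in range(2)]
Sinv = inv2(S)
diff = [m1[j] - m2[j] for j in range(2)]
a_hat = [sum(Sinv[i][j] * diff[j] for j in range(2)) for i in range(2)]
K_hat = 0.5 * sum((m1[j] + m2[j]) * a_hat[j] for j in range(2))   # with k = 1
x = [29.0, 11.8]                                                  # an owner's point
decision = "pi_1" if sum(a_hat[j] * x[j] for j in range(2)) >= K_hat else "pi_2"
print(a_hat, decision)
```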
Example 1

π1: Riding-mower owners           π2: Nonowners

x1 (Income        x2 (Lot size     x1 (Income     x2 (Lot size
in $1000s)        in 1000 sq ft)   in $1000s)     in 1000 sq ft)

20.0              9.2            25.0             9.8
28.5              8.4            17.6            10.4
21.6             10.8            21.6             8.6
20.5             10.4            14.4            10.2
29.0             11.8            28.0             8.8
36.7              9.6            16.4             8.8
36.0              8.8            19.8             8.0
27.6             11.2            22.0             9.2
23.0             10.0            15.8             8.2
31.0             10.4            11.0             9.4
17.0             11.0            17.0             7.0
27.0             10.0            21.0             7.4
[Figure: scatter plot of Lot Size (in thousands of square feet) versus Income (in thousands of dollars), distinguishing riding-mower owners from nonowners.]
Example 2
Annual financial data are collected for firms
approximately 2 years prior to bankruptcy and for
financially sound firms at about the same point in
time. The data on the four variables
• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales)
are given in the following table:
Bankrupt Firms                            Nonbankrupt Firms
x1      x2      x3          x4               x1     x2       x3          x4
Firm   CF/TD     NI/TA     CA/CL    CA/NS    Firm   CF/TD     NI/TA     CA/CL    CA/NS
1     -0.4485   -0.4106   1.0865   0.4526    1      0.5135    0.1001   2.4871   0.5368
2     -0.5633   -0.3114   1.5314   0.1642    2      0.0769    0.0195   2.0069   0.5304
3      0.0643    0.0156   1.0077   0.3978    3      0.3776    0.1075   3.2651   0.3548
4     -0.0721   -0.0930   1.4544   0.2589    4      0.1933    0.0473   2.2506   0.3309
5     -0.1002   -0.0917   1.5644   0.6683    5      0.3248    0.0718   4.2401   0.6279
6     -0.1421   -0.0651   0.7066   0.2794    6      0.3132    0.0511   4.4500   0.6852
7      0.0351    0.0147   1.5046   0.7080    7      0.1184    0.0499   2.5210   0.6925
8     -0.6530   -0.0566   1.3737   0.4032    8     -0.0173    0.0233   2.0538   0.3484
9      0.0724   -0.0076   1.3723   0.3361    9      0.2169    0.0779   2.3489   0.3970
10     -0.1353   -0.1433   1.4196   0.4347   10      0.1703    0.0695   1.7973   0.5174
11     -0.2298   -0.2961   0.3310   0.1824   11      0.1460    0.0518   2.1692   0.5500
12      0.0713    0.0205   1.3124   0.2497   12     -0.0985   -0.0123   2.5029   0.5778
13      0.0109    0.0011   2.1495   0.6969   13      0.1398   -0.0312   0.4611   0.2643
14     -0.2777   -0.2316   1.1918   0.6601   14      0.1379    0.0728   2.6123   0.5151
15      0.1454    0.0500   1.8762   0.2723   15      0.1486    0.0564   2.2347   0.5563
16      0.3703    0.1098   1.9914   0.3828   16      0.1633    0.0486   2.3080   0.1978
17     -0.0757   -0.0821   1.5077   0.4215   17      0.2907    0.0597   1.8381   0.3786
18      0.0451    0.0263   1.6756   0.9494   18      0.5383    0.1064   2.3293   0.4835
19      0.0115   -0.0032   1.2602   0.6038   19     -0.3330   -0.0854   3.0124   0.4730
20      0.1227    0.1055   1.1434   0.1655   20      0.4875    0.0910   1.2444   0.1847
21     -0.2843   -0.2703   1.2722   0.5128   21      0.5603    0.1112   4.2918   0.4443
22      0.2029    0.0792   1.9936   0.3018
23      0.4746    0.1380   2.9166   0.4487
24      0.1661    0.0351   2.4527   0.1370
25      0.5808    0.0371   5.0594   0.1268
Examples using SPSS
Classification or Cluster Analysis

Situation
• Have multivariate (or univariate) data from
one or several populations (the number of
populations is unknown)
• Want to determine the number of populations
and identify the populations
Example
Table: Numerals in eleven languages

English Norwegian Danish Dutch         German    French Spanish       Italian   Polish Hungarian     Finnish

one       en      en        een        ein       un      uno       uno     jeden       egy           yksi
two        to       to     twee       zwei     deux        dos      due      dwa      ketto          kaksi
three      tre      tre      drie       drei    trois      tres       tre      trzy   harom          kolme
four     fire     fire      vier       vier   quatre   cuarto    quattro   cztery      negy           neua
five    fem      fem         vijf      funf     cinq    cinco    cinque      piec         ot          viisi
six   seks     seks        zes     sechs       six      seix       sei    szesc        hat         kuusi
seven       sju     syv     zeven     sieben      sept     siete     sette siedem          het    seitseman
eight     atte     otte      acht       acht     huit    ocho        otto   osiem      nyole    kahdeksan
nine       ni       ni    negen       neun     neuf    nueve       nove dziewiec     kilenc     yhdeksan
ten        ti       ti     tien      zehn       dix     diez      dieci dziesiec        tiz   kymmenen
Distance Matrix
Distance = # of numerals (1 to 10) differing in first letter

      E   N   Da  Du  G   Fr  Sp  I   P   H   Fi
E     0
N     2   0
Da    2   1   0
Du    7   5   6   0
G     6   4   5   5   0
Fr    6   6   6   9   7   0
Sp    6   6   5   9   7   2   0
I     6   6   5   9   7   1   1   0
P     7   7   6   10  8   5   3   4   0
H     9   8   8   8   9   10  10  10  10  0
Fi    9   9   9   9   9   9   9   9   9   8   0
Hierarchical Clustering Methods
The following are the steps in the agglomerative Hierarchical
clustering algorithm for grouping N objects (items or variables).
1.   Start with N clusters, each consisting of a single entity,
and an N × N symmetric matrix (table) of distances (or
similarities) D = (dij).
2.   Search the distance matrix for the nearest (most similar)
pair of clusters. Let the distance between the "most
similar" clusters U and V be dUV.
3.   Merge clusters U and V. Label the newly formed cluster
(UV). Update the entries in the distance matrix by

a)   deleting the rows and columns corresponding to
clusters U and V and
b)   adding a row and column giving the distances
between cluster (UV) and the remaining clusters.
4.   Repeat steps 2 and 3 a total of N−1 times. (All objects
will be in a single cluster at termination of this algorithm.)
Record the identity of the clusters that are merged and the
levels (distances or similarities) at which the mergers
take place.
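The four steps above can be sketched in a few lines. This is a minimal sketch using single linkage for step 3b (other linkages are discussed next); the three 1-D items in the example are hypothetical.

```python
# Agglomerative hierarchical clustering (single linkage), following steps 1-4.

def agglomerate(items, d):
    """items: list of labels; d(i, j): distance between two items."""
    clusters = [frozenset([i]) for i in items]        # step 1: N singleton clusters

    def cdist(u, v):                                  # single linkage: nearest pair
        return min(d(i, j) for i in u for j in v)

    merges = []
    while len(clusters) > 1:
        # step 2: search for the nearest (most similar) pair of clusters
        u, v = min(((a, b) for k, a in enumerate(clusters) for b in clusters[k + 1:]),
                   key=lambda pair: cdist(*pair))
        # step 3: merge U and V, recording the level of the merger
        merges.append((set(u), set(v), cdist(u, v)))
        clusters = [c for c in clusters if c not in (u, v)] + [u | v]
    return merges                                     # step 4: N-1 mergers in total

pts = {"a": 0.0, "b": 1.0, "c": 5.0}   # hypothetical 1-D items
merges = agglomerate(list(pts), lambda i, j: abs(pts[i] - pts[j]))
print(merges)
```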
Different methods of computing inter-cluster distance
(illustrated in the figure for clusters {1, 2} and {3, 4, 5}):

• Single linkage: the distance between two clusters is the
smallest pairwise distance between a point in one cluster
and a point in the other (d24 in the figure).

• Complete linkage: the distance is the largest such
pairwise distance (d15 in the figure).

• Average linkage: the distance is the average of all
pairwise distances, (d13 + d14 + d15 + d23 + d24 + d25)/6
in the figure.
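The three linkage rules can be written out for two clusters given as lists of labels and a pairwise distance function; the 1-D points in the example are hypothetical.

```python
# Inter-cluster distances for clusters U and V, with pairwise distance d(i, j).

def single_linkage(U, V, d):
    """Smallest pairwise distance between the two clusters."""
    return min(d(i, j) for i in U for j in V)

def complete_linkage(U, V, d):
    """Largest pairwise distance between the two clusters."""
    return max(d(i, j) for i in U for j in V)

def average_linkage(U, V, d):
    """Average of all pairwise distances between the two clusters."""
    return sum(d(i, j) for i in U for j in V) / (len(U) * len(V))

pts = {1: 0.0, 2: 1.0, 3: 4.0, 4: 5.0, 5: 7.0}   # hypothetical 1-D points
d = lambda i, j: abs(pts[i] - pts[j])
# For clusters {1, 2} and {3, 4, 5}, average linkage is
# (d13 + d14 + d15 + d23 + d24 + d25) / 6, matching the figure.
print(average_linkage([1, 2], [3, 4, 5], d))
```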
Example
To illustrate the single linkage algorithm, we consider the
hypothetical distance matrix between pairs of five objects given
below:

D = {dik}:

      1    2    3    4    5
1     0
2     9    0
3     3    7    0
4     6    5    9    0
5     11   10   2    8    0
Treating each object as a cluster, the clustering
begins by merging the two closest items (3 & 5).
To implement the next level of clustering we
need to compute the distances between cluster
(35) and the remaining objects:
d(35)1 = min{3,11} = 3
d(35)2 = min{7,10} = 7
d(35)4 = min{9,8} = 8
The new distance matrix becomes:

       (35)  1    2    4
(35)    0
1       3    0
2       7    9    0
4       8    6    5    0

The next two closest clusters ((35) & 1) are
merged to form cluster (135). Distances between
this cluster and the remaining clusters become:
d(135)2 = min{7,9} = 7
d(135)4 = min{8,6} = 6
The distance matrix now becomes:

        (135)  2    4
(135)    0
2        7     0
4        6     5    0
Continuing, the next two closest clusters (2 & 4)
are merged to form cluster (24).
The distance between this cluster and the remaining
cluster becomes:
d(135)(24) = min{d(135)2, d(135)4} = min{7,6} = 6
The final distance matrix now becomes:

         (135)  (24)
(135)     0
(24)      6     0

At the final step, clusters (135) and (24) are merged to
form the single cluster (12345) of all five items.
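The worked example above can be reproduced in a few lines, starting from the 5-object distance matrix from the text and merging step by step under single linkage:

```python
# Single-linkage clustering of the five objects from the example.

D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def d(i, j):
    return D[(min(i, j), max(i, j))]

def cdist(u, v):                    # single linkage: nearest pair of points
    return min(d(i, j) for i in u for j in v)

clusters = [frozenset([i]) for i in range(1, 6)]
levels = []
while len(clusters) > 1:
    u, v = min(((a, b) for k, a in enumerate(clusters) for b in clusters[k + 1:]),
               key=lambda pair: cdist(*pair))
    levels.append(cdist(u, v))
    clusters = [c for c in clusters if c not in (u, v)] + [u | v]

print(levels)   # merger levels [2, 3, 5, 6], as in the text
```

The merger levels 2, 3, 5, 6 are the heights at which the dendrogram joins (3 5), then 1, then (2 4), then everything.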
The results of this algorithm can be summarized
graphically in the following "dendrogram".
Dendrograms

for clustering the 11 languages on the
basis of the ten numerals
Example 2: Public Utility data

variables

Company                         X1     X2 X3 X4            X5    X6      X7        X8

1      Arizona Public Service            1.06    9.2   151    54.4 1.6     9077     0.0      0.628
2      Boston Edison Co                  0.89   10.3   202    57.9 2.2     5088    25.3      1.555
3      Central Louisiana Electric Co     1.43   15.4   113    53.0 3.4     9212     0.0      1.058
4      Commonwealth Edison Co            1.02   11.2   168    56.0 0.3     6423    34.3      0.700
5      Consolidated Edison Co (NY)       1.49    8.8   192    51.2 1.0     3300    15.6      2.044
6      Florida Power & Light Co          1.32   13.5   111    60.0 -2.2   11127    22.5      1.241
7      Hawaiian Electric Co              1.22   12.2   175    67.6 2.2     7642     0.0      1.652
8      Idaho Power Co                    1.10    9.2   245    57.0 3.3    13082     0.0      0.309
9      Kentucky Utilities Co             1.34   13.0   168    60.4 7.2     8406     0.0      0.862
10     Madison Gas & Electric Co         1.12   12.4   197    53.0 2.7     6455    39.2      0.623
11     Nevada Power Co                   0.75    7.5   173    51.5 6.5    17441     0.0      0.768
12     New England Electric Co           1.13   10.9   178    62.0 3.7     6154     0.0      1.897
13     Northern States Power Co          1.15   12.7   199    53.7 6.4     7179    50.2      0.527
14     Oklahoma Gas & Electric Co        1.09   12.0    96    49.8 1.4     9673     0.0      0.588
15     Pacific Gas & Electric Co         0.96    7.6   164    62.2 -0.1    6468     0.9      1.400
16     Puget Sound Power & Light Co      1.16    9.9   252    56.0 9.2    15991     0.0      0.620
17     San Diego Gas & Electric Co       0.76    6.4   136    61.9 9.0     5714     8.3      1.920
18     The Southern Co                   1.05   12.6   150    56.7 2.7    10140     0.0      1.108
19     Texas Utilities Co                1.16   11.7   104    54.0 -2.1   13507     0.0      0.636
20     Wisconsin Electric Power Co       1.20   11.8   148    59.9 3.5     7287    41.1      0.702
21     United Illuminating Co            1.04    8.6   204    61.0 3.5     6650     0.0      2.116
22     Virginia Electric & Power Co      1.07    9.3   174    54.3 5.9    10093    26.6      1.306

X1: Fixed charge coverage ratio (income/debt)           X2: Rate of return on capital
X3: Cost per KW capacity in place                       X4: Annual load factor
X5: Peak KWH demand growth from 1974 to 1975            X6: Sales (KWH per year)
X7: Percent Nuclear                                     X8: Total fuel costs (cents per KWH)
Table: Distances between 22 Utilities

Firm
number   1      2      3      4      5      6      7      8      9      10      11      12      13      14      15      16      17      18      19   20   21   22

1       0.00
2       3.10   0.00
3       3.68   4.92   0.00
4       2.46   2.16   4.11   0.00
5       4.12   3.85   4.47   4.13   0.00
6       3.61   4.22   2.99   3.20   4.60   0.00
7       3.90   3.45   4.22   3.97   4.60   3.35   0.00
8       2.74   3.89   4.99   3.69   5.16   4.91   4.36   0.00
9       3.25   3.96   2.75   3.75   4.49   3.73   2.80   3.59   0.00
10      3.10   2.71   3.93   1.49   4.05   3.83   4.51   3.67   3.57    0.00
11      3.49   4.79   5.90   4.86   6.46   6.00   6.00   3.46   5.18    5.08    0.00
12      3.22   2.43   4.03   3.50   3.60   3.74   1.66   4.06   2.74    3.94    5.21    0.00
13      3.96   3.43   4.39   2.58   4.76   4.55   5.01   4.14   3.66    1.41    5.31    4.50    0.00
14      2.11   4.32   2.74   3.23   4.82   3.47   4.91   4.34   3.82    3.61    4.32    4.34    4.39    0.00
15      2.59   2.50   5.16   3.19   4.26   4.07   2.93   3.85   4.11    4.26    4.74    2.33    5.10    4.24    0.00
16      4.03   4.84   5.26   4.97   5.82   5.84   5.04   2.20   3.63    4.53    3.43    4.62    4.41    5.17    5.18    0.00
17      4.40   3.62   6.36   4.89   5.63   6.10   4.58   5.43   4.90    5.48    4.75    3.50    5.61    5.56    3.40    5.56    0.00
18      1.88   2.90   2.72   2.65   4.34   2.85   2.95   3.24   2.43    3.07    3.95    2.45    3.78    2.30    3.00    3.97    4.43    0.00
19      2.41   4.63   3.18   3.46   5.13   2.58   4.52   4.11   4.11    4.13    4.52    4.41    5.01    1.88    4.03    5.23    6.09    2.47    0.00
20      3.17   3.00   3.73   1.82   4.39   2.91   3.54   4.09   2.95    2.05    5.35    3.43    2.23    3.74    3.78    4.82    4.87    2.92    3.90 0.00
21      3.45   2.32   5.09   3.88   3.64   4.63   2.68   3.98   3.74    4.36    4.88    1.38    4.94    4.93    2.10    4.57    3.10    3.19    4.97 4.15 0.00
22      2.51   2.42   4.11   2.58   3.77   4.03   4.00   3.24   3.21    2.56    3.44    3.00    2.74    3.51    3.35    3.46    3.63    2.55    3.97 2.62 3.01 0.00
Dendrogram
Cluster Analysis of N = 22 Utility companies