Bayesian Decision Theory (Classification)




           Lecturer: 虞台文
Contents
   Introduction
   Generalized Bayesian Decision Rule
   Discriminant Functions
   The Normal Distribution
   Discriminant Functions for the Normal Populations
   Minimax Criterion
   Neyman-Pearson Criterion
Introduction
What is Bayesian Decision Theory?

   A mathematical foundation for decision making.

   It uses a probabilistic approach to help make decisions (e.g., classification)
   so as to minimize the risk (cost).
Preliminaries and Notations

   \omega_i \in \{\omega_1, \omega_2, \ldots, \omega_c\} : a state of nature
   P(\omega_i) : prior probability
   x : feature vector
   p(x | \omega_i) : class-conditional density
   P(\omega_i | x) : posterior probability
Bayesian Rule
P(\omega_i | x) = \frac{p(x | \omega_i) P(\omega_i)}{p(x)}

p(x) = \sum_{j=1}^{c} p(x | \omega_j) P(\omega_j)
Decision
P(\omega_i | x) = \frac{p(x | \omega_i) P(\omega_i)}{p(x)}

p(x) is the same for every class, so it is unimportant in making the decision.

D(x) = \arg\max_i P(\omega_i | x)
Decision        D(x) = \arg\max_i P(\omega_i | x)

Decide \omega_i if P(\omega_i | x) > P(\omega_j | x) for all j \ne i.

Equivalently, decide \omega_i if p(x | \omega_i) P(\omega_i) > p(x | \omega_j) P(\omega_j) for all j \ne i.

Special cases:
1. P(\omega_1) = P(\omega_2) = \cdots = P(\omega_c):
   decide \omega_i if p(x | \omega_i) > p(x | \omega_j) for all j \ne i.
2. p(x | \omega_1) = p(x | \omega_2) = \cdots = p(x | \omega_c):
   decide \omega_i if P(\omega_i) > P(\omega_j) for all j \ne i.

Two Categories
Decide \omega_1 if P(\omega_1 | x) > P(\omega_2 | x); otherwise decide \omega_2.

Decide \omega_1 if p(x | \omega_1) P(\omega_1) > p(x | \omega_2) P(\omega_2); otherwise decide \omega_2.

Special cases:
1. P(\omega_1) = P(\omega_2):
   decide \omega_1 if p(x | \omega_1) > p(x | \omega_2); otherwise decide \omega_2.
2. p(x | \omega_1) = p(x | \omega_2):
   decide \omega_1 if P(\omega_1) > P(\omega_2); otherwise decide \omega_2.




Example

[Figure: decision regions R_1 and R_2 for equal priors, P(\omega_1) = P(\omega_2),
 and for P(\omega_1) = 2/3, P(\omega_2) = 1/3]

Decide \omega_1 if p(x | \omega_1) P(\omega_1) > p(x | \omega_2) P(\omega_2); otherwise decide \omega_2.
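
To make this rule concrete, here is a minimal Python sketch of the two-category decision
"decide \omega_1 if p(x | \omega_1) P(\omega_1) > p(x | \omega_2) P(\omega_2)". The 1-D Gaussian
class-conditional densities, their parameters, and the priors are invented for illustration;
they are not the ones behind the figures above.

```python
# Minimal sketch of the two-category Bayes rule with assumed 1-D Gaussian
# class-conditional densities; means, variances and priors are made up.
from scipy.stats import norm

priors = {"w1": 2/3, "w2": 1/3}                 # P(w1), P(w2)
likelihoods = {"w1": norm(loc=0.0, scale=1.0),  # p(x|w1)
               "w2": norm(loc=2.0, scale=1.0)}  # p(x|w2)

def decide(x):
    """Return the class with the larger p(x|wi) * P(wi)."""
    scores = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    return max(scores, key=scores.get)

print(decide(0.5))   # falls in R1 -> 'w1'
print(decide(3.0))   # falls in R2 -> 'w2'
```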
Classification Error

P(\text{error}) = \int p(\text{error}, x) \, dx = \int P(\text{error} | x) \, p(x) \, dx

Consider two categories:
    Decide \omega_1 if P(\omega_1 | x) > P(\omega_2 | x); otherwise decide \omega_2.

P(\text{error} | x) = \begin{cases} P(\omega_2 | x) & \text{if we decide } \omega_1 \\ P(\omega_1 | x) & \text{if we decide } \omega_2 \end{cases} = \min[\, P(\omega_1 | x), \, P(\omega_2 | x) \,]
Generalized Bayesian Decision Rule
The Generalization

   \{\omega_1, \omega_2, \ldots, \omega_c\} : a set of c states of nature
   \{\alpha_1, \alpha_2, \ldots, \alpha_a\} : a set of a possible actions

   \lambda_{ij} = \lambda(\alpha_i | \omega_j) : the loss incurred for taking action \alpha_i
   when the true state of nature is \omega_j (it can be zero).

We want to minimize the expected loss (the risk) in making a decision.
Conditional Risk

R(\alpha_i | x) = \sum_{j=1}^{c} \lambda(\alpha_i | \omega_j) P(\omega_j | x) = \sum_{j=1}^{c} \lambda_{ij} P(\omega_j | x)

Given x, this is the expected loss (risk) associated with taking action \alpha_i.
0/1 Loss Function

\lambda(\alpha_i | \omega_j) = \begin{cases} 0 & \text{if } \alpha_i \text{ is the correct decision for } \omega_j \\ 1 & \text{otherwise} \end{cases}

With this loss, R(\alpha_i | x) = \sum_{j \ne i} P(\omega_j | x) = 1 - P(\omega_i | x),
the probability of error when deciding \omega_i, i.e., R(\alpha_i | x) = P(\text{error} | x).
Decision

R(\alpha_i | x) = \sum_{j=1}^{c} \lambda(\alpha_i | \omega_j) P(\omega_j | x) = \sum_{j=1}^{c} \lambda_{ij} P(\omega_j | x)

Bayesian decision rule:

\alpha(x) = \arg\min_i R(\alpha_i | x)
Overall Risk

R = \int R(\alpha(x) | x) \, p(x) \, dx,   where \alpha(\cdot) is the decision function.

Bayesian decision rule:  \alpha(x) = \arg\min_i R(\alpha_i | x)

   It is the optimal rule, i.e., it minimizes the overall risk.
   The resulting overall risk is called the Bayes risk.
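
As a concrete illustration of \alpha(x) = \arg\min_i R(\alpha_i | x), the sketch below evaluates
the conditional risks from a loss matrix and a vector of posteriors and picks the
minimum-risk action. The loss values and posteriors are assumptions made up for the example.

```python
# Sketch of the Bayesian decision rule alpha(x) = argmin_i R(alpha_i | x),
# where R(alpha_i | x) = sum_j lambda_ij P(w_j | x).  Loss matrix and
# posteriors are illustrative assumptions.
import numpy as np

# lam[i, j] = loss for taking action alpha_i when the true state is w_j
lam = np.array([[0.0, 2.0],    # alpha_1
                [1.0, 0.0]])   # alpha_2

def bayes_action(posteriors):
    """posteriors: array of P(w_j|x); returns index of the minimum-risk action."""
    risks = lam @ posteriors          # R(alpha_i|x) for every action
    return int(np.argmin(risks))

print(bayes_action(np.array([0.7, 0.3])))  # -> 0 (take alpha_1)
print(bayes_action(np.array([0.2, 0.8])))  # -> 1 (take alpha_2)
```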
Two-Category Classification

States of nature: \{\omega_1, \omega_2\}.   Actions: \{\alpha_1, \alpha_2\}.

Loss function \lambda_{ij} = \lambda(\alpha_i | \omega_j):

              | \omega_1       \omega_2
   \alpha_1   | \lambda_{11}   \lambda_{12}
   \alpha_2   | \lambda_{21}   \lambda_{22}

R(\alpha_1 | x) = \lambda_{11} P(\omega_1 | x) + \lambda_{12} P(\omega_2 | x)
R(\alpha_2 | x) = \lambda_{21} P(\omega_1 | x) + \lambda_{22} P(\omega_2 | x)
Two-Category Classification

Perform \alpha_1 if R(\alpha_2 | x) > R(\alpha_1 | x); otherwise perform \alpha_2.

R(\alpha_2 | x) > R(\alpha_1 | x)
\Leftrightarrow \lambda_{21} P(\omega_1 | x) + \lambda_{22} P(\omega_2 | x) > \lambda_{11} P(\omega_1 | x) + \lambda_{12} P(\omega_2 | x)
\Leftrightarrow (\lambda_{21} - \lambda_{11}) P(\omega_1 | x) > (\lambda_{12} - \lambda_{22}) P(\omega_2 | x)

Both factors (\lambda_{21} - \lambda_{11}) and (\lambda_{12} - \lambda_{22}) are ordinarily positive
(a wrong decision costs more than the corresponding correct one).

The posterior probabilities are scaled before comparison.  Since

P(\omega_i | x) = \frac{p(x | \omega_i) P(\omega_i)}{p(x)}

and p(x) is common to both sides, it is irrelevant to the comparison.
Two-Category Classification

Perform \alpha_1 if R(\alpha_2 | x) > R(\alpha_1 | x); otherwise perform \alpha_2.

(\lambda_{21} - \lambda_{11}) P(\omega_1 | x) > (\lambda_{12} - \lambda_{22}) P(\omega_2 | x)
\Leftrightarrow (\lambda_{21} - \lambda_{11}) p(x | \omega_1) P(\omega_1) > (\lambda_{12} - \lambda_{22}) p(x | \omega_2) P(\omega_2)
\Leftrightarrow \frac{p(x | \omega_1)}{p(x | \omega_2)} > \frac{(\lambda_{12} - \lambda_{22}) P(\omega_2)}{(\lambda_{21} - \lambda_{11}) P(\omega_1)}

(This slide will be recalled later.)

The left-hand side is the likelihood ratio; the right-hand side is a threshold.

Perform \alpha_1 if   \frac{p(x | \omega_1)}{p(x | \omega_2)} > \frac{(\lambda_{12} - \lambda_{22}) P(\omega_2)}{(\lambda_{21} - \lambda_{11}) P(\omega_1)}
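
A short sketch of the same rule in likelihood-ratio form; the losses, priors, and Gaussian
densities below are illustrative assumptions, not values from the slides.

```python
# Sketch of the two-category likelihood-ratio rule with an assumed 0/1-style
# loss matrix, equal priors and 1-D Gaussian class-conditional densities.
from scipy.stats import norm

lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0
P1, P2 = 0.5, 0.5
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)      # p(x|w1), p(x|w2)

threshold = (lam12 - lam22) * P2 / ((lam21 - lam11) * P1)

def decide(x):
    # perform alpha_1 when the likelihood ratio exceeds the threshold
    return "w1" if p1.pdf(x) / p2.pdf(x) > threshold else "w2"

print(decide(0.2), decide(1.8))
```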
Discriminant Functions

How do we define discriminant functions?

The Multicategory Classification

[Figure: block diagram — the input x feeds discriminant functions g_1(x), g_2(x), ..., g_c(x);
 their maximum determines the action \alpha(x), e.g., a class label]

The g_i(x)'s are called the discriminant functions.

Assign x to \omega_i if g_i(x) > g_j(x) for all j \ne i.

If f(\cdot) is a monotonically increasing function, then the f(g_i(\cdot))'s are also
discriminant functions (they induce the same decision rule).
Simple Discriminant Functions

Minimum-risk case:
    g_i(x) = -R(\alpha_i | x)

Minimum-error-rate case:
    g_i(x) = P(\omega_i | x)
    g_i(x) = p(x | \omega_i) P(\omega_i)
    g_i(x) = \ln p(x | \omega_i) + \ln P(\omega_i)
Decision Regions
R_i = \{\, x \mid g_i(x) > g_j(x), \ \forall j \ne i \,\}

Two-category example

    Decision regions are
    separated by decision
    boundaries.
The Normal Distribution
Basics of Probability

Discrete random variable X (assume integer-valued):
    Probability mass function (pmf):  p(x) = P(X = x)
    Cumulative distribution function (cdf):  F(x) = P(X \le x) = \sum_{t=-\infty}^{x} p(t)

Continuous random variable X:
    Probability density function (pdf):  p(x) (or f(x)); its value is not a probability.
    Cumulative distribution function (cdf):  F(x) = P(X \le x) = \int_{-\infty}^{x} p(t) \, dt
Expectations

Let g be a function of the random variable X.

E[g(X)] = \begin{cases} \sum_{x=-\infty}^{\infty} g(x) \, p(x) & X \text{ is discrete} \\ \int_{-\infty}^{\infty} g(x) \, p(x) \, dx & X \text{ is continuous} \end{cases}

The kth moment:  E[X^k].   The 1st moment:  \mu_X = E[X].
The kth central moment:  E[(X - \mu_X)^k].

Fact:  \mathrm{Var}[X] = E[X^2] - (E[X])^2
Important Expectations

Mean:
\mu_X = E[X] = \begin{cases} \sum_{x=-\infty}^{\infty} x \, p(x) & X \text{ is discrete} \\ \int_{-\infty}^{\infty} x \, p(x) \, dx & X \text{ is continuous} \end{cases}

Variance:
\sigma_X^2 = \mathrm{Var}[X] = E[(X - \mu_X)^2] = \begin{cases} \sum_{x=-\infty}^{\infty} (x - \mu_X)^2 p(x) & X \text{ is discrete} \\ \int_{-\infty}^{\infty} (x - \mu_X)^2 p(x) \, dx & X \text{ is continuous} \end{cases}
Entropy

The entropy measures the fundamental uncertainty in the value of points selected
randomly from a distribution.

H[X] = \begin{cases} -\sum_{x=-\infty}^{\infty} p(x) \ln p(x) & X \text{ is discrete} \\ -\int_{-\infty}^{\infty} p(x) \ln p(x) \, dx & X \text{ is continuous} \end{cases}

Properties of the Gaussian distribution:
1. It maximizes the entropy (among distributions with a given mean and variance).
2. Central limit theorem.
Univariate Gaussian Distribution

X ~ N(\mu, \sigma^2)

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

E[X] = \mu,   \mathrm{Var}[X] = \sigma^2

[Figure: bell-shaped density p(x) centered at \mu, with \pm\sigma, \pm 2\sigma, \pm 3\sigma marked on the x-axis]
Random Vectors

X = (X_1, X_2, \ldots, X_d)^T,   a d-dimensional random vector,   X : \Omega \to \mathbb{R}^d

Vector mean:   \mu = E[X] = (\mu_1, \mu_2, \ldots, \mu_d)^T

Covariance matrix:

\Sigma = E[(X - \mu)(X - \mu)^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}
Multivariate Gaussian Distribution

X ~ N(\mu, \Sigma),   a d-dimensional random vector

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

E[X] = \mu,   E[(X - \mu)(X - \mu)^T] = \Sigma
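
The density above can be evaluated directly; the sketch below does so using the
log-determinant for numerical stability. The mean and covariance are arbitrary example values.

```python
# Sketch: evaluate the multivariate normal density
# p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)).
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)            # log |Sigma|
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)^T Sigma^-1 (x-mu)
    return np.exp(-0.5 * (d * np.log(2 * np.pi) + logdet + quad))

mu = np.array([0.0, 1.0])                           # example values
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```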
Properties of N(\mu, \Sigma)

X ~ N(\mu, \Sigma), a d-dimensional random vector.

Let Y = A^T X, where A is a d \times k matrix.  Then

Y ~ N(A^T \mu, A^T \Sigma A)
On Parameters of N(\mu, \Sigma)

X ~ N(\mu, \Sigma),   X = (X_1, X_2, \ldots, X_d)^T

\mu = E[X] = (\mu_1, \mu_2, \ldots, \mu_d)^T,   \mu_i = E[X_i]

\Sigma = E[(X - \mu)(X - \mu)^T] = [\sigma_{ij}]_{d \times d}

\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)] = \mathrm{Cov}(X_i, X_j);   X_i \perp X_j \Rightarrow \sigma_{ij} = 0

\sigma_{ii} = \sigma_i^2 = E[(X_i - \mu_i)^2] = \mathrm{Var}(X_i)

More On Covariance Matrix

\Sigma is symmetric and positive semidefinite.

\Sigma = \Phi \Lambda \Phi^T = \Phi \Lambda^{1/2} \Lambda^{1/2} \Phi^T = (\Phi \Lambda^{1/2})(\Phi \Lambda^{1/2})^T

\Phi : orthonormal matrix whose columns are the eigenvectors of \Sigma.
\Lambda : diagonal matrix of the eigenvalues.
Whitening Transform

X ~ N(\mu, \Sigma);   Y = A^T X is a linear transform, so Y ~ N(A^T \mu, A^T \Sigma A).

Let A_w = \Phi \Lambda^{-1/2}.   Then

A_w^T \Sigma A_w = (\Phi \Lambda^{-1/2})^T (\Phi \Lambda^{1/2})(\Phi \Lambda^{1/2})^T (\Phi \Lambda^{-1/2}) = I

A_w^T X ~ N(A_w^T \mu, I)

[Figure: samples of N(\mu, \Sigma) before and after the whitening projection]
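
A small numerical sketch of the whitening transform A_w = \Phi \Lambda^{-1/2} built from the
eigendecomposition of \Sigma; the covariance matrix here is an arbitrary example.

```python
# Sketch of the whitening transform A_w = Phi Lambda^(-1/2).
import numpy as np

Sigma = np.array([[4.0, 1.0],          # example covariance matrix
                  [1.0, 2.0]])

eigvals, Phi = np.linalg.eigh(Sigma)   # Sigma = Phi Lambda Phi^T
A_w = Phi @ np.diag(eigvals ** -0.5)   # A_w = Phi Lambda^(-1/2)

# A_w^T Sigma A_w should be (numerically) the identity matrix.
print(np.round(A_w.T @ Sigma @ A_w, 6))

# Applying y = A_w^T x to samples of N(mu, Sigma) gives identity covariance.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=10000)
Y = X @ A_w                            # rows y_i^T = (A_w^T x_i)^T
print(np.round(np.cov(Y, rowvar=False), 2))
```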
Mahalanobis Distance

X ~ N(\mu, \Sigma)

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

r^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)

The factor in front of the exponential is constant; the density depends on x only through
the value of r^2, the squared Mahalanobis distance from x to \mu.

[Figure: contours of constant density are ellipsoids of constant r^2]
Discriminant Functions for the Normal Populations
Minimum-Error-Rate Classification

g_i(x) = \ln p(x | \omega_i) + \ln P(\omega_i)

X_i ~ N(\mu_i, \Sigma_i):

p(x | \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)
Minimum-Error-Rate Classification

Three cases:
   Case 1: \Sigma_i = \sigma^2 I
       Classes are centered at different means; the feature components are
       pairwise independent and have the same variance.
   Case 2: \Sigma_i = \Sigma
       Classes are centered at different means but share the same covariance.
   Case 3: \Sigma_i arbitrary (\Sigma_i \ne \Sigma_j in general).

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)
                                                     1    1
                                                     Σ            I
Case 1. i =                                              
                                                     i         2
                                       2I

                1
 g i ( x)             || x  μ i ||2  ln P(i )
           2 2
            1
          2 (xT x  2μT x  μT μ i )  ln P(i )
           2
                         i     i

              irrelevant
            1        1 T                       
 g i ( x)  2 μ x   
                    T
                             μ i μ i  ln P(i )
                       2 2
                    i
                                               
                                          irrelevant

          1                          d       1
gi (x)   (x  μ i ) Σi (x  μ i )  ln 2  ln | Σi |  ln P(i )
                     T 1

          2                          2       2
Case 1: \Sigma_i = \sigma^2 I

g_i(x) = w_i^T x + w_{i0},   a linear discriminant, with

w_i = \frac{1}{\sigma^2} \mu_i,   w_{i0} = -\frac{1}{2\sigma^2} \mu_i^T \mu_i + \ln P(\omega_i)

Case 1: \Sigma_i = \sigma^2 I — boundary between \omega_i and \omega_j

The boundary satisfies g_i(x) = g_j(x):

w_i^T x + w_{i0} = w_j^T x + w_{j0}
\Rightarrow (w_i - w_j)^T x = w_{j0} - w_{i0}
\Rightarrow (\mu_i - \mu_j)^T x - \frac{1}{2} (\mu_i^T \mu_i - \mu_j^T \mu_j) = -\sigma^2 \ln \frac{P(\omega_i)}{P(\omega_j)}
\Rightarrow (\mu_i - \mu_j)^T x = \frac{1}{2} (\mu_i - \mu_j)^T (\mu_i + \mu_j) - \sigma^2 \frac{(\mu_i - \mu_j)^T (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}

The decision boundary is a hyperplane perpendicular to the line joining the means, at a
location determined by \sigma^2 and the priors.


Case 1: \Sigma_i = \sigma^2 I — boundary between \omega_i and \omega_j

w^T (x - x_0) = 0,   with

w = \mu_i - \mu_j
x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\| \mu_i - \mu_j \|^2} \ln \frac{P(\omega_i)}{P(\omega_j)} \, (\mu_i - \mu_j)

The first term of x_0 is the midpoint of the means; the second term vanishes if P(\omega_i) = P(\omega_j).

[Figure: the boundary is a hyperplane through x_0, normal to w = \mu_i - \mu_j]

Case 1: \Sigma_i = \sigma^2 I

x_0 = \frac{1}{2} (\mu_1 + \mu_2) - \frac{\sigma^2}{\| \mu_1 - \mu_2 \|^2} \ln \frac{P(\omega_1)}{P(\omega_2)} \, (\mu_1 - \mu_2)

If P(\omega_1) = P(\omega_2), the boundary passes through the midpoint of the means and the rule
reduces to a minimum-distance classifier (template matching): assign x to the class whose
mean is nearest.

If P(\omega_1) \ne P(\omega_2), the boundary shifts away from the more probable class.

[Figures/Demo: decision boundaries for P(\omega_1) = P(\omega_2) and for P(\omega_1) \ne P(\omega_2)]
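
Below is a sketch of the Case 1 linear discriminant g_i(x) = w_i^T x + w_{i0}; with equal priors
it reduces to the nearest-mean rule described above. The means, \sigma^2, and priors are
assumed example values.

```python
# Sketch of the Case-1 linear discriminant with
# w_i = mu_i / sigma^2 and w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(w_i).
import numpy as np

mus = np.array([[0.0, 0.0],            # mu_1, mu_2 (example values)
                [3.0, 3.0]])
sigma2 = 1.0
priors = np.array([0.5, 0.5])

def g(x):
    w = mus / sigma2                                            # w_i
    w0 = -np.sum(mus**2, axis=1) / (2 * sigma2) + np.log(priors)  # w_i0
    return w @ x + w0

def classify(x):
    return int(np.argmax(g(x)))        # equal priors -> nearest-mean rule

print(classify(np.array([0.4, 0.2])))  # -> 0
print(classify(np.array([2.5, 2.8])))  # -> 1
```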
Case 2: \Sigma_i = \Sigma

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i)

The quadratic term is the squared Mahalanobis distance from x to \mu_i; the terms
-\frac{d}{2}\ln 2\pi and -\frac{1}{2}\ln|\Sigma| are common to all classes (irrelevant), and
\ln P(\omega_i) is irrelevant if P(\omega_i) = P(\omega_j) for all i, j.

Expanding,

g_i(x) = -\frac{1}{2} (x^T \Sigma^{-1} x - 2\mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i) + \ln P(\omega_i)

and dropping the class-independent term x^T \Sigma^{-1} x,

g_i(x) = w_i^T x + w_{i0},   with   w_i = \Sigma^{-1} \mu_i,   w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)
Case 2: \Sigma_i = \Sigma — boundary between \omega_i and \omega_j

g_i(x) = w_i^T x + w_{i0}, so the boundary g_i(x) = g_j(x) is again a hyperplane:

w^T (x - x_0) = 0,   with

w = \Sigma^{-1} (\mu_i - \mu_j)
x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\ln [P(\omega_i) / P(\omega_j)]}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \, (\mu_i - \mu_j)

The hyperplane is generally not perpendicular to the line joining the means.

[Figures/Demo: linear boundaries for classes sharing the covariance \Sigma]
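
Here is a sketch of the Case 2 discriminant with a shared covariance matrix; \Sigma, the class
means, and the priors are example values, not those of the demo.

```python
# Sketch of the Case-2 discriminant (shared covariance):
# w_i = Sigma^-1 mu_i, w_i0 = -0.5 mu_i^T Sigma^-1 mu_i + ln P(w_i).
import numpy as np

Sigma = np.array([[1.0, 0.3],          # example shared covariance
                  [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = np.array([[0.0, 0.0],            # example class means
                [2.0, 1.0]])
priors = np.array([0.6, 0.4])

def g(x):
    w = mus @ Sigma_inv                # rows are w_i^T = mu_i^T Sigma^-1
    w0 = -0.5 * np.einsum('ij,jk,ik->i', mus, Sigma_inv, mus) + np.log(priors)
    return w @ x + w0

print(int(np.argmax(g(np.array([1.5, 0.5])))))
```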

Case 3: \Sigma_i \ne \Sigma_j (arbitrary)

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)

g_i(x) = x^T W_i x + w_i^T x + w_{i0},   with

W_i = -\frac{1}{2} \Sigma_i^{-1}
w_i = \Sigma_i^{-1} \mu_i
w_{i0} = -\frac{1}{2} \mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)

The quadratic term x^T W_i x, absent in Cases 1 and 2, no longer cancels, so the decision
surfaces are hyperquadrics, e.g.,
   hyperplanes
   hyperspheres
   hyperellipsoids
   hyperhyperboloids
Case 3: \Sigma_i \ne \Sigma_j

Non-simply connected decision regions can arise in one dimension for Gaussians having
unequal variance.

[Figures/Demo: quadric decision boundaries for several pairs of Gaussians with unequal
covariance matrices]
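
Below is a sketch of the Case 3 quadratic discriminant g_i(x) = x^T W_i x + w_i^T x + w_{i0};
all parameter values are invented for illustration.

```python
# Sketch of the Case-3 quadratic discriminant with
# W_i = -0.5 Sigma_i^-1, w_i = Sigma_i^-1 mu_i,
# w_i0 = -0.5 mu_i^T Sigma_i^-1 mu_i - 0.5 ln|Sigma_i| + ln P(w_i).
import numpy as np

params = [  # (mu_i, Sigma_i, P(w_i)) -- example values
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[3.0, 1.0], [1.0, 2.0]]), 0.5),
]

def g_i(x, mu, Sigma, prior):
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.linalg.slogdet(Sigma)[1]
          + np.log(prior))
    return x @ W @ x + w @ x + w0

def classify(x):
    return int(np.argmax([g_i(x, *p) for p in params]))

print(classify(np.array([1.0, 1.5])))
```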
Multi-Category Classification
Minimax Criterion
Bayesian Decision Rule: Two-Category Classification (recalled)

Decide \omega_1 if   \frac{p(x | \omega_1)}{p(x | \omega_2)} > \frac{(\lambda_{12} - \lambda_{22}) P(\omega_2)}{(\lambda_{21} - \lambda_{11}) P(\omega_1)}
(likelihood ratio on the left, threshold on the right).

The minimax criterion deals with the case in which the prior probabilities are unknown.
Basic Concept on Minimax

Choose the decision rule that minimizes the overall risk under the worst-case prior
probabilities, i.e., minimize the maximum possible overall risk.

R(\alpha_1 | x) = \lambda_{11} P(\omega_1 | x) + \lambda_{12} P(\omega_2 | x)
R(\alpha_2 | x) = \lambda_{21} P(\omega_1 | x) + \lambda_{22} P(\omega_2 | x)

Overall Risk

R = \int R(\alpha(x) | x) \, p(x) \, dx
  = \int_{R_1} R(\alpha_1 | x) \, p(x) \, dx + \int_{R_2} R(\alpha_2 | x) \, p(x) \, dx
  = \int_{R_1} [\lambda_{11} P(\omega_1 | x) + \lambda_{12} P(\omega_2 | x)] \, p(x) \, dx
  + \int_{R_2} [\lambda_{21} P(\omega_1 | x) + \lambda_{22} P(\omega_2 | x)] \, p(x) \, dx

Using P(\omega_i | x) \, p(x) = p(x | \omega_i) P(\omega_i):

Overall Risk

R = \int_{R_1} [\lambda_{11} P(\omega_1) p(x | \omega_1) + \lambda_{12} P(\omega_2) p(x | \omega_2)] \, dx
  + \int_{R_2} [\lambda_{21} P(\omega_1) p(x | \omega_1) + \lambda_{22} P(\omega_2) p(x | \omega_2)] \, dx

Substituting P(\omega_2) = 1 - P(\omega_1):

R = \int_{R_1} \{\lambda_{11} P(\omega_1) p(x | \omega_1) + \lambda_{12} [1 - P(\omega_1)] p(x | \omega_2)\} \, dx
  + \int_{R_2} \{\lambda_{21} P(\omega_1) p(x | \omega_1) + \lambda_{22} [1 - P(\omega_1)] p(x | \omega_2)\} \, dx

  = \lambda_{12} \int_{R_1} p(x | \omega_2) \, dx + \lambda_{22} \int_{R_2} p(x | \omega_2) \, dx
  + \lambda_{11} P(\omega_1) \int_{R_1} p(x | \omega_1) \, dx - \lambda_{12} P(\omega_1) \int_{R_1} p(x | \omega_2) \, dx
  + \lambda_{21} P(\omega_1) \int_{R_2} p(x | \omega_1) \, dx - \lambda_{22} P(\omega_1) \int_{R_2} p(x | \omega_2) \, dx

using   \int_{R_1} p(x | \omega_i) \, dx + \int_{R_2} p(x | \omega_i) \, dx = 1.




Overall Risk

For a fixed decision boundary (fixed regions R_1 and R_2), the overall risk is a linear
function of the prior, R[P(\omega_1)] = a \, P(\omega_1) + b:

R[P(\omega_1)] = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{R_1} p(x | \omega_2) \, dx
  + P(\omega_1) \Big[ (\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11}) \int_{R_2} p(x | \omega_1) \, dx - (\lambda_{12} - \lambda_{22}) \int_{R_1} p(x | \omega_2) \, dx \Big]

Both the constant term and the coefficient of P(\omega_1) depend on the setting of the
decision boundary.  For a particular P(\omega_1), this is the overall risk of that boundary.
Overall Risk

For the minimax solution, choose the decision boundary so that the bracketed coefficient
of P(\omega_1) is zero:

(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11}) \int_{R_2} p(x | \omega_1) \, dx - (\lambda_{12} - \lambda_{22}) \int_{R_1} p(x | \omega_2) \, dx = 0

Then the overall risk is independent of the value of P(\omega_i), and the remaining constant
term equals R_{mm}, the minimax risk.
Minimax Risk

R_{mm} = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{R_1} p(x | \omega_2) \, dx
       = \lambda_{11} + (\lambda_{21} - \lambda_{11}) \int_{R_2} p(x | \omega_1) \, dx

Next, use the 0/1 loss function.
Error Probability

With the 0/1 loss function (\lambda_{11} = \lambda_{22} = 0, \lambda_{12} = \lambda_{21} = 1), the overall risk becomes
the error probability:

P_{error}[P(\omega_1)] = \int_{R_1} p(x | \omega_2) \, dx
  + P(\omega_1) \Big[ \int_{R_2} p(x | \omega_1) \, dx - \int_{R_1} p(x | \omega_2) \, dx \Big]
Minimax Error-Probability

For the minimax solution under 0/1 loss, the decision boundary is set so that the two
conditional error probabilities are equal:

P_{mm}(\text{error}) = \int_{R_1} p(x | \omega_2) \, dx = \int_{R_2} p(x | \omega_1) \, dx

i.e., P(\text{decide } \omega_1 | \omega_2) = P(\text{decide } \omega_2 | \omega_1).

[Figure: class-conditional densities p(x | \omega_1) and p(x | \omega_2) over regions R_1 and R_2,
with the two shaded error areas equal]
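
For 1-D Gaussian class-conditional densities (assumed parameters below), the minimax
boundary under 0/1 loss can be found numerically by solving for the threshold at which the
two conditional errors coincide, as in this sketch.

```python
# Numeric sketch of the minimax boundary under 0/1 loss: choose the
# threshold t so that
#   integral over R1 of p(x|w2) dx = integral over R2 of p(x|w1) dx.
# The 1-D Gaussian densities are illustrative assumptions.
from scipy.stats import norm
from scipy.optimize import brentq

p1 = norm(0.0, 1.0)     # p(x|w1); decide w1 on R1 = {x < t}
p2 = norm(3.0, 1.5)     # p(x|w2); decide w2 on R2 = {x >= t}

def error_gap(t):
    miss = p2.cdf(t)             # integral over R1 of p(x|w2) dx
    false_alarm = 1 - p1.cdf(t)  # integral over R2 of p(x|w1) dx
    return miss - false_alarm

t_mm = brentq(error_gap, -10, 10)   # root: the two errors coincide
print(t_mm, p2.cdf(t_mm))           # minimax threshold and P_mm(error)
```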
Neyman-Pearson Criterion
Bayesian Decision Rule: Two-Category Classification (recalled)

Decide \omega_1 if   \frac{p(x | \omega_1)}{p(x | \omega_2)} > \frac{(\lambda_{12} - \lambda_{22}) P(\omega_2)}{(\lambda_{21} - \lambda_{11}) P(\omega_1)}
(likelihood ratio on the left, threshold on the right).

The Neyman-Pearson criterion deals with the case in which both the loss function and the
prior probabilities are unknown.
Signal Detection Theory
   Signal detection theory evolved from the development of communications and radar
    equipment in the first half of the last century.

   It migrated to psychology, initially as part of sensation and perception, in the
    1950s and 60s, as an attempt to understand some features of human behavior when
    detecting very faint stimuli that were not being explained by traditional theories
    of thresholds.
The situation of interest
   A person is faced with a stimulus (signal) that is very faint or confusing.

   The person must make a decision: is the signal there or not?

   What makes this situation confusing and difficult is the presence of other activity
    that is similar to the signal. Let us call this activity noise.
Example

Noise is present both in the environment and in the sensory system of the observer.

The observer reacts to the momentary total activation of the sensory system, which
fluctuates from moment to moment, as well as responding to environmental stimuli,
which may include a signal.
Example

   A radiologist is examining a CT scan, looking for evidence of a tumor.
   This is a hard job, because there is always some uncertainty.

   There are four possible outcomes:
    –  hit (tumor present and doctor says "yes")
    –  miss (tumor present and doctor says "no")
    –  false alarm (tumor absent and doctor says "yes")
    –  correct rejection (tumor absent and doctor says "no")
    Misses and false alarms are the two types of error.

Signal detection theory was developed to help us understand how a continuous and
ambiguous signal can lead to a binary yes/no decision.

The Four Cases

                               Signal (tumor)
                        Absent (\omega_1)                      Present (\omega_2)
 Decision  No  (\alpha_1)   P(\alpha_1|\omega_1): correct rejection   P(\alpha_1|\omega_2): miss
           Yes (\alpha_2)   P(\alpha_2|\omega_1): false alarm         P(\alpha_2|\omega_2): hit

Discriminability:   d' = \frac{| \mu_2 - \mu_1 |}{\sigma}
Decision Making

[Figure: two overlapping densities, "noise" (\omega_1) and "noise + signal" (\omega_2), separated by d';
a criterion (decision bias, based on expectancy) splits the axis into "No" (\alpha_1) and "Yes" (\alpha_2);
the area of the signal density beyond the criterion is the hit rate P(\alpha_2|\omega_2), and the area of
the noise density beyond it is the false-alarm rate P(\alpha_2|\omega_1)]
ROC Curve (Receiver Operating Characteristic)

[Figure: ROC curve — hit rate P_H = P(\alpha_2|\omega_2) plotted against false-alarm rate
P_{FA} = P(\alpha_2|\omega_1) as the criterion is swept]
Neyman-Pearson Criterion

NP:  maximize P_H   subject to   P_{FA} \le a

[Figure: the NP operating point on the ROC curve — the highest hit rate with P_{FA} \le a]
Likelihood Ratio Test

\phi(x) = \begin{cases} 0 & \dfrac{p(x|\omega_1)}{p(x|\omega_2)} > T, \quad x \in R_1 = \{x \mid p(x|\omega_1) > T p(x|\omega_2)\} \\[4pt] 1 & \dfrac{p(x|\omega_1)}{p(x|\omega_2)} < T, \quad x \in R_2 = \{x \mid p(x|\omega_1) < T p(x|\omega_2)\} \end{cases}

where T is a threshold that meets the P_{FA} constraint (\le a).

How do we determine T?

P_{FA} = E[\phi(X) | \omega_1],   P_H = E[\phi(X) | \omega_2]
Likelihood Ratio Test

P_{FA} = \int_{R_2} p(x | \omega_1) \, dx = \int \phi(x) \, p(x | \omega_1) \, dx = E[\phi(X) | \omega_1]

P_H = \int_{R_2} p(x | \omega_2) \, dx = \int \phi(x) \, p(x | \omega_2) \, dx = E[\phi(X) | \omega_2]

[Figure: the densities p(x|\omega_1) and p(x|\omega_2) over regions R_1 and R_2; the shaded areas
on the R_2 side are P_{FA} and P_H]
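
For the special case of 1-D Gaussian densities with equal variance (an assumption of this
sketch, not of the slides), the likelihood ratio is monotone in x, so the constraint
P_{FA} \le a directly fixes a cut point on x and hence the threshold T.

```python
# Sketch: choosing the Neyman-Pearson threshold for assumed 1-D Gaussian
# densities with equal variance.  Here p(x|w1)/p(x|w2) is monotone
# decreasing in x, so R2 = {x > t} and P_FA <= a fixes t; T is the
# likelihood ratio evaluated at t.
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 2.0, 1.0        # w1: noise only, w2: signal present
p1, p2 = norm(mu1, sigma), norm(mu2, sigma)
a = 0.05                               # allowed false-alarm rate

t = p1.ppf(1 - a)                      # P(x > t | w1) = a
T = p1.pdf(t) / p2.pdf(t)              # corresponding likelihood-ratio threshold

P_FA = 1 - p1.cdf(t)                   # = a by construction
P_H = 1 - p2.cdf(t)                    # hit rate at this operating point
print(T, P_FA, P_H)
```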
Neyman-Pearson Lemma

Consider the rule above with T chosen to give P_{FA}(\phi) = a.  There is no decision rule \phi'
such that P_{FA}(\phi') \le a and P_H(\phi') > P_H(\phi).

Proof)  Let \phi' be any decision rule with P_{FA}(\phi') = E[\phi'(X) | \omega_1] \le a.  Consider

\int [\phi(x) - \phi'(x)] \, [T p(x | \omega_2) - p(x | \omega_1)] \, dx \ge 0.

The integrand is nonnegative everywhere: where \phi(x) = 1 we have T p(x|\omega_2) - p(x|\omega_1) \ge 0
and \phi(x) - \phi'(x) \ge 0; where \phi(x) = 0 we have T p(x|\omega_2) - p(x|\omega_1) \le 0 and
\phi(x) - \phi'(x) \le 0.

Expanding the integral,

T \left[ \int \phi(x) p(x|\omega_2) dx - \int \phi'(x) p(x|\omega_2) dx \right] - \left[ \int \phi(x) p(x|\omega_1) dx - \int \phi'(x) p(x|\omega_1) dx \right] \ge 0

T [ P_H(\phi) - P_H(\phi') ] \ge P_{FA}(\phi) - P_{FA}(\phi') \ge 0

Since T > 0, it follows that P_H(\phi) \ge P_H(\phi').  \square
				