Logistic Regression


                   10701/15781 Recitation
                      February 5, 2008



Parts of the slides are from previous years’ recitation and lecture notes,
and from Prof. Andrew Moore’s data mining tutorials.
            Discriminative Classifier

• Learn P(Y|X) directly
• Logistic regression for binary classification:

  P(Y = 1 | X, w) = \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))}

  P(Y = 0 | X, w) = \frac{\exp(-(w_0 + \sum_i w_i X_i))}{1 + \exp(-(w_0 + \sum_i w_i X_i))}

Note: Generative classifier: learn P(X|Y), P(Y) to get P(Y|X) under
some modeling assumption, e.g. P(X|Y) ~ N(\mu_y, 1), etc.
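
As a quick illustration of the binary model above (not on the original slides), here is a minimal NumPy sketch that evaluates P(Y=1|X,w) for one example; the weights and feature values are made up for the demonstration.

    import numpy as np

    def p_y1_given_x(x, w0, w):
        # P(Y=1 | X=x, w) = 1 / (1 + exp(-(w0 + sum_i w_i x_i)))
        z = w0 + np.dot(w, x)
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical weights and a single feature vector, for illustration only.
    w0, w = -1.0, np.array([2.0, -0.5])
    x = np.array([1.5, 0.3])
    p1 = p_y1_given_x(x, w0, w)
    print("P(Y=1|X,w) =", p1, " P(Y=0|X,w) =", 1.0 - p1)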
                Decision Boundary

• For which X is P(Y = 1 | X, w) ≥ P(Y = 0 | X, w)?

  \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))} \ge \frac{\exp(-(w_0 + \sum_i w_i X_i))}{1 + \exp(-(w_0 + \sum_i w_i X_i))}

  1 \ge \exp(-(w_0 + \sum_i w_i X_i))

  w_0 + \sum_i w_i X_i \ge 0

  Linear classification rule!


     Decision boundary from NB?
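
A small numerical sketch (the weights and random points are assumed, not from the slides) checking that comparing the two class probabilities gives the same prediction as the linear rule w_0 + \sum_i w_i X_i \ge 0:

    import numpy as np

    def p_y1(x, w0, w):
        return 1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x))))

    rng = np.random.default_rng(0)
    w0, w = 0.5, np.array([1.0, -2.0])

    for _ in range(5):
        x = rng.normal(size=2)
        prob_rule = int(p_y1(x, w0, w) >= 1.0 - p_y1(x, w0, w))  # compare P(Y=1) vs P(Y=0)
        linear_rule = int(w0 + np.dot(w, x) >= 0)                # linear classification rule
        print(x, prob_rule, linear_rule)                         # the two predictions agree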
               LR more generally

• In the more general case, where Y ∈ {1, ..., K}:

for k < K:

  P(Y = k | X, w) = \frac{\exp(w_{k0} + \sum_{i=1}^n w_{ki} X_i)}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}

for k = K:

  P(Y = K | X, w) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}
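
A minimal sketch of this multiclass form (the parameter values are hypothetical), with class K acting as the reference class whose weights are implicitly zero:

    import numpy as np

    def multiclass_lr_probs(x, W0, W):
        # P(Y=k | x, w) for k = 1..K, with class K as the reference class.
        # W0: length K-1 vector of intercepts w_{k0}; W: (K-1, n) matrix of weights w_{ki}.
        scores = np.exp(W0 + W @ x)                     # exp(w_{k0} + sum_i w_{ki} x_i), k < K
        denom = 1.0 + scores.sum()                      # 1 + sum_{j<K} exp(...)
        return np.append(scores / denom, 1.0 / denom)   # classes 1..K-1, then class K

    # Hypothetical parameters for K = 3 classes and n = 2 features.
    W0 = np.array([0.1, -0.3])
    W = np.array([[1.0, -1.0],
                  [0.5,  2.0]])
    x = np.array([0.2, 0.7])
    probs = multiclass_lr_probs(x, W0, W)
    print(probs, probs.sum())   # the K probabilities sum to 1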
                     How to learn P(Y|X)

• Logistic regression:

  P(Y = 1 | X, w) = \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))}

• Maximize conditional log likelihood:

  \ln \prod_l P(Y^l | X^l, w) = \sum_l \ln P(Y^l | X^l, w)

  = \sum_l (Y^l - 1)\left(w_0 + \sum_{i=1}^n w_i X_i^l\right) - \ln\left(1 + \exp\left(-\left(w_0 + \sum_{i=1}^n w_i X_i^l\right)\right)\right)




• Good news: concave function of w
• Bad news: no closed-form solution → gradient ascent
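
A sketch (toy data invented for illustration) that evaluates the conditional log likelihood written above for a given weight vector:

    import numpy as np

    def conditional_log_likelihood(X, Y, w0, w):
        # sum_l (Y^l - 1)(w0 + w . x^l) - ln(1 + exp(-(w0 + w . x^l)))
        z = w0 + X @ w                        # one score per training example
        return np.sum((Y - 1.0) * z - np.log1p(np.exp(-z)))

    # Toy data, assumed for illustration only.
    X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.5]])
    Y = np.array([1.0, 0.0, 1.0])
    print(conditional_log_likelihood(X, Y, w0=0.0, w=np.array([1.0, -1.0])))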
       Gradient ascent (/descent)

• General framework for finding a maximum (or minimum) of a continuous
  (differentiable) function, say f(w)
  - Start with some initial value w^{(1)} and compute the gradient vector
    \nabla f(w^{(1)})
  - The next value w^{(2)} is obtained by moving some distance from w^{(1)}
    in the direction of steepest ascent, i.e., along the gradient (for
    descent, move along the negative of the gradient)

  w^{(k+1)} = w^{(k)} + \eta^{(k)} \nabla f(w^{(k)})
               Gradient ascent for LR

  l(w) = \ln \prod_l P(Y^l | X^l, w)
       = \sum_l (Y^l - 1)\left(w_0 + \sum_{i=1}^n w_i X_i^l\right) - \ln\left(1 + \exp\left(-\left(w_0 + \sum_{i=1}^n w_i X_i^l\right)\right)\right)

  \frac{\partial l(w)}{\partial w_i} = \sum_l X_i^l \left(Y^l - P(Y^l = 1 | X^l, w)\right)

Iterate until change < threshold:

  For all i,   w_i \leftarrow w_i + \eta \sum_l X_i^l \left(Y^l - P(Y^l = 1 | X^l, w)\right)
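
A runnable sketch of this training loop under assumptions not stated on the slide: a fixed step size, a small synthetic two-feature dataset, and an intercept w_0 updated with its own gradient \sum_l (Y^l - P(Y^l=1|X^l,w)); the name train_lr and the data are made up for the example.

    import numpy as np

    def train_lr(X, Y, eta=0.01, tol=1e-6, max_iter=20000):
        # Gradient ascent on the conditional log likelihood of logistic regression.
        w0, w = 0.0, np.zeros(X.shape[1])
        for _ in range(max_iter):
            p1 = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # P(Y^l=1 | X^l, w) for every l
            err = Y - p1                               # Y^l - P(Y^l=1 | X^l, w)
            dw0, dw = err.sum(), X.T @ err             # gradient w.r.t. w0 and each w_i
            w0, w = w0 + eta * dw0, w + eta * dw       # ascent step
            if eta * np.sqrt(dw0 ** 2 + dw @ dw) < tol:   # iterate until change < threshold
                break
        return w0, w

    # Synthetic two-feature data, invented purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    Y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(float)
    print(train_lr(X, Y))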
                 Regularization

• Overfitting is a problem, especially when the data is very high
  dimensional and the training data is sparse
• Regularization: use a "penalized log likelihood function" which
  penalizes large values of w

  \sum_l \ln P(Y^l | X^l, w) - \frac{\lambda}{2} \|w\|^2

• The modified gradient ascent update:

  w_i \leftarrow w_i + \eta \sum_l X_i^l \left(Y^l - P(Y^l = 1 | X^l, w)\right) - \eta \lambda w_i
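
The same training-loop sketch as before, with the penalty term folded into the update; the step size, the \lambda values, the data, and the choice to leave the intercept unpenalized are all assumptions made for this illustration.

    import numpy as np

    def train_lr_l2(X, Y, eta=0.01, lam=1.0, n_iter=10000):
        # Gradient ascent on the L2-penalized conditional log likelihood.
        w0, w = 0.0, np.zeros(X.shape[1])
        for _ in range(n_iter):
            p1 = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
            err = Y - p1
            w0 = w0 + eta * err.sum()                   # intercept left unpenalized (an assumption)
            w = w + eta * (X.T @ err) - eta * lam * w   # ... - eta * lambda * w_i
        return w0, w

    # Same kind of synthetic data as before, for illustration only.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    Y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(float)
    print("lambda = 0:", train_lr_l2(X, Y, lam=0.0))
    print("lambda = 5:", train_lr_l2(X, Y, lam=5.0))    # stronger penalty shrinks the weights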
• Applet:
  http://www.cs.technion.ac.il/~rani/LocBoost/
                 NB vs LR

• Consider Y boolean, X continuous, X = (X_1, ..., X_n)

• Number of parameters
  - NB: 4n + 1 (class prior, plus a mean and a variance for each X_i under each class)
  - LR: n + 1
• Parameter estimation method
  - NB: uncoupled (each parameter estimated independently, in closed form)
  - LR: coupled (all weights estimated jointly, via optimization)
                     NB vs LR

• Asymptotic comparison (# training examples → ∞)

• When model assumptions are correct
  - NB and LR produce identical classifiers

• When model assumptions are incorrect
  - LR is less biased: it does not assume conditional independence
  - Therefore LR is expected to outperform NB
