# Logistic Regression

10701/15781 Recitation
February 5, 2008

Parts of these slides are from previous years' recitation and lecture notes, and from Prof. Andrew Moore's data mining tutorials.
## Discriminative Classifier

- Learn P(Y|X) directly.
- Logistic regression for binary classification:

$$P(Y=1 \mid X, w) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_i w_i X_i)\big)}$$

$$P(Y=0 \mid X, w) = \frac{\exp\!\big(-(w_0 + \sum_i w_i X_i)\big)}{1 + \exp\!\big(-(w_0 + \sum_i w_i X_i)\big)}$$

- Note: a generative classifier instead learns P(X|Y) and P(Y) to get P(Y|X) under some modeling assumption, e.g. P(X|Y) ~ N(μ_y, 1), etc.
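
As a concrete illustration of the two probabilities above, here is a minimal NumPy sketch (not from the slides; the names `p_y1_given_x`, `w`, `w0`, and `x` are illustrative placeholders):

```python
import numpy as np

def p_y1_given_x(x, w, w0):
    """P(Y=1 | X=x, w) = 1 / (1 + exp(-(w0 + sum_i w_i * x_i)))."""
    z = w0 + np.dot(w, x)               # linear score w0 + sum_i w_i x_i
    return 1.0 / (1.0 + np.exp(-z))     # logistic (sigmoid) of the score

def p_y0_given_x(x, w, w0):
    """P(Y=0 | X=x, w) = 1 - P(Y=1 | X=x, w)."""
    return 1.0 - p_y1_given_x(x, w, w0)

# Example: two features, arbitrary weights.
x = np.array([0.5, -1.2])
w = np.array([2.0, 1.0])
print(p_y1_given_x(x, w, w0=0.3))       # a probability in (0, 1)
```
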
## Decision Boundary

- For which X is P(Y=1|X,w) ≥ P(Y=0|X,w)?

$$\frac{1}{1 + \exp\!\big(-(w_0 + \sum_i w_i X_i)\big)} \;\ge\; \frac{\exp\!\big(-(w_0 + \sum_i w_i X_i)\big)}{1 + \exp\!\big(-(w_0 + \sum_i w_i X_i)\big)}$$

$$\Leftrightarrow\; 1 \ge \exp\!\big(-(w_0 + \sum_i w_i X_i)\big) \;\Leftrightarrow\; w_0 + \sum_i w_i X_i \ge 0$$

- A linear classification rule!
- Decision boundary from NB?
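
A small sketch, under the same notation, showing that thresholding P(Y=1|X,w) at 0.5 and checking the sign of w0 + Σ_i w_i X_i give the same prediction (the helper names below are hypothetical, not from the slides):

```python
import numpy as np

def predict_by_threshold(x, w, w0):
    """Predict 1 iff P(Y=1|x,w) >= P(Y=0|x,w), i.e. P(Y=1|x,w) >= 0.5."""
    z = w0 + np.dot(w, x)
    return int(1.0 / (1.0 + np.exp(-z)) >= 0.5)

def predict_by_linear_rule(x, w, w0):
    """Predict 1 iff w0 + sum_i w_i x_i >= 0 (the linear decision boundary)."""
    return int(w0 + np.dot(w, x) >= 0)

# The two rules agree on any input.
rng = np.random.default_rng(0)
w, w0 = rng.normal(size=3), 0.1
for _ in range(5):
    x = rng.normal(size=3)
    assert predict_by_threshold(x, w, w0) == predict_by_linear_rule(x, w, w0)
```
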
## LR more generally

- In the more general case where Y ∈ {1, ..., K}:

for k < K,
$$P(Y=k \mid X, w) = \frac{\exp\!\big(w_{k0} + \sum_{i=1}^{n} w_{ki} X_i\big)}{1 + \sum_{j=1}^{K-1} \exp\!\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)}$$

for k = K,
$$P(Y=K \mid X, w) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\!\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)}$$
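
A hedged sketch of the K-class form above, treating class K as the reference class whose weights are implicitly zero; the array shapes and names (`W`, `b`) are assumptions for illustration:

```python
import numpy as np

def multiclass_lr_probs(x, W, b):
    """P(Y=k | x, w) for k = 1..K.

    W has shape (K-1, n) and b has shape (K-1,): weights for classes 1..K-1.
    Class K is the reference class with score fixed to 0 (exp(0) = 1).
    """
    scores = b + W @ x                    # w_k0 + sum_i w_ki x_i for k < K
    expo = np.exp(scores)
    denom = 1.0 + expo.sum()              # 1 + sum_{j<K} exp(score_j)
    probs = np.append(expo, 1.0) / denom  # last entry is class K
    return probs                          # entries sum to 1

# Example with K = 3 classes and n = 2 features.
x = np.array([1.0, -0.5])
W = np.array([[0.2, 1.0], [-0.3, 0.4]])
b = np.array([0.1, -0.2])
print(multiclass_lr_probs(x, W, b), multiclass_lr_probs(x, W, b).sum())
```
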
## How to learn P(Y|X)

- Logistic regression:

$$P(Y=1 \mid X, w) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_i w_i X_i)\big)}$$

- Maximize the conditional log likelihood (with binary labels Y^l ∈ {0, 1}):

$$\ln \prod_l P(Y^l \mid X^l, w) = \sum_l \ln P(Y^l \mid X^l, w) = \sum_l \Big[ Y^l \big(w_0 + \sum_{i=1}^{n} w_i X_i^l\big) - \ln\!\Big(1 + \exp\big(w_0 + \sum_{i=1}^{n} w_i X_i^l\big)\Big) \Big]$$

- Good news: this is a concave function of w.
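
To make the objective concrete, a minimal sketch of the conditional log likelihood written straight from the formula above (the toy data and the use of `np.logaddexp` for ln(1 + exp(·)) are my choices, not from the slides):

```python
import numpy as np

def conditional_log_likelihood(X, Y, w, w0):
    """sum_l [ Y^l * z^l - ln(1 + exp(z^l)) ], where z^l = w0 + sum_i w_i X_i^l."""
    z = w0 + X @ w                        # linear score, one per example
    # np.logaddexp(0, z) = ln(1 + exp(z)), computed in a numerically stable way
    return np.sum(Y * z - np.logaddexp(0.0, z))

# Toy data: 4 examples, 2 features, binary labels.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
Y = np.array([0, 0, 1, 1])
print(conditional_log_likelihood(X, Y, w=np.array([1.0, -0.5]), w0=-1.5))
```
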

## Gradient Ascent

- General framework for finding a maximum (or minimum) of a continuous (differentiable) function, say f(w).
- Starting from w^(1), the next value w^(2) is obtained by moving some distance from w^(1) in the direction of steepest ascent, i.e., along the gradient ∇f(w^(1)) (to find a minimum, move along the negative of the gradient instead):

$$w^{(k+1)} = w^{(k)} + \eta^{(k)}\,\nabla f\big(w^{(k)}\big)$$

- For logistic regression, the objective is

$$\ell(w) = \ln \prod_l P(Y^l \mid X^l, w) = \sum_l \Big[ Y^l \big(w_0 + \sum_{i=1}^{n} w_i X_i^l\big) - \ln\!\Big(1 + \exp\big(w_0 + \sum_{i=1}^{n} w_i X_i^l\big)\Big) \Big]$$

  with gradient

$$\frac{\partial \ell(w)}{\partial w_i} = \sum_l X_i^l \Big(Y^l - P(Y^l = 1 \mid X^l, w)\Big)$$

- Iterate until the change is below a threshold: for all i,

$$w_i \leftarrow w_i + \eta \sum_l X_i^l \Big(Y^l - P(Y^l = 1 \mid X^l, w)\Big)$$
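
Putting the gradient and the update rule together, a minimal batch gradient ascent sketch; the step size `eta`, stopping threshold `tol`, and iteration cap are illustrative choices, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, Y, eta=0.1, tol=1e-6, max_iter=10_000):
    """Batch gradient ascent on the conditional log likelihood."""
    n_examples, n_features = X.shape
    w, w0 = np.zeros(n_features), 0.0
    for _ in range(max_iter):
        p = sigmoid(w0 + X @ w)          # P(Y=1 | X^l, w) for every example
        err = Y - p                      # Y^l - P(Y^l = 1 | X^l, w)
        grad_w = X.T @ err               # sum_l X_i^l * err^l, per feature i
        grad_w0 = err.sum()              # bias acts like a constant feature
        w += eta * grad_w
        w0 += eta * grad_w0
        if max(np.max(np.abs(grad_w)), abs(grad_w0)) * eta < tol:
            break                        # change in every w_i below threshold
    return w, w0

# Toy usage.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([0, 0, 1, 1])
print(fit_logistic_regression(X, Y))
```
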
## Regularization

- Overfitting is a problem, especially when the data is very high-dimensional and the training data is sparse.
- Regularization: use a "penalized log likelihood function" that penalizes large values of w:

$$\sum_l \ln P(Y^l \mid X^l, w) \;-\; \frac{\lambda}{2}\,\|w\|^2$$

- The update rule becomes

$$w_i \leftarrow w_i + \eta \Big[ \sum_l X_i^l \big(Y^l - P(Y^l = 1 \mid X^l, w)\big) - \lambda w_i \Big]$$

- Applet: http://www.cs.technion.ac.il/~rani/LocBoost/
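
In code, the penalty only adds a −λ·w term to the gradient step; a hedged sketch of the modified update (the value of `lam` is illustrative, and the bias is left unpenalized here as a common convention, not something the slides specify):

```python
import numpy as np

def regularized_step(X, Y, w, w0, eta=0.1, lam=0.01):
    """One gradient ascent step on the penalized log likelihood.

    Gradient of  sum_l ln P(Y^l|X^l,w) - (lam/2) * ||w||^2  w.r.t. w_i is
    sum_l X_i^l (Y^l - P(Y^l=1|X^l,w)) - lam * w_i.
    """
    p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
    err = Y - p
    w_new = w + eta * (X.T @ err - lam * w)   # penalty shrinks the weights
    w0_new = w0 + eta * err.sum()             # bias left unpenalized
    return w_new, w0_new
```
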
## NB vs LR

- Consider Y boolean and X continuous, X = (X1, ..., Xn).
- Number of parameters:
  - NB:
  - LR:
- Parameter estimation method:
  - NB: uncoupled
  - LR: coupled
## NB vs LR

- Asymptotic comparison (number of training examples → infinity):
  - When the model assumptions are correct: NB and LR produce identical classifiers.
  - When the model assumptions are incorrect: LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB.