Logistic Regression
10701/15781 Recitation, February 5, 2008
Parts of these slides are from previous years' recitation and lecture notes, and from Prof. Andrew Moore's data mining tutorials.

Discriminative Classifier
- Learn P(Y|X) directly.
- Logistic regression for binary classification:
  P(Y=1 \mid X, w) = \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))}
  P(Y=0 \mid X, w) = \frac{\exp(-(w_0 + \sum_i w_i X_i))}{1 + \exp(-(w_0 + \sum_i w_i X_i))}
- Note: a generative classifier instead learns P(X|Y) and P(Y) to get P(Y|X) under some modeling assumption, e.g. P(X|Y) ~ N(\mu_y, 1), etc.

Decision Boundary
- For which X is P(Y=1|X,w) >= P(Y=0|X,w)?
  \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))} \ge \frac{\exp(-(w_0 + \sum_i w_i X_i))}{1 + \exp(-(w_0 + \sum_i w_i X_i))}
  \iff 1 \ge \exp(-(w_0 + \sum_i w_i X_i))
  \iff w_0 + \sum_i w_i X_i \ge 0
- Linear classification rule! (See the prediction sketch at the end of these notes.)
- What is the decision boundary from NB?

LR more generally
- In the more general case where Y \in \{1, \dots, K\}:
  for k < K:
  P(Y=k \mid X, w) = \frac{\exp(w_{k0} + \sum_{i=1}^n w_{ki} X_i)}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}
  for k = K:
  P(Y=K \mid X, w) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}

How to learn P(Y|X)
- Logistic regression: P(Y=1 \mid X, w) = \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))}
- Maximize the conditional log likelihood:
  \ln \prod_l P(Y^l \mid X^l, w) = \sum_l \ln P(Y^l \mid X^l, w)
  = \sum_l (Y^l - 1)\Big(w_0 + \sum_{i=1}^n w_i X_i^l\Big) - \ln\Big(1 + \exp\big(-(w_0 + \sum_{i=1}^n w_i X_i^l)\big)\Big)
- Good news: this is a concave function of w.
- Bad news: there is no closed-form solution, so use gradient ascent.

Gradient ascent (/descent)
- A general framework for finding a maximum (or minimum) of a continuous (differentiable) function, say f(w).
- Start with some initial value w^{(1)} and compute the gradient vector \nabla f(w^{(1)}).
- The next value w^{(2)} is obtained by moving some distance from w^{(1)} in the direction of steepest ascent, i.e., along the gradient (for a minimum, descend along the negative of the gradient):
  w^{(k+1)} = w^{(k)} + \eta^{(k)} \nabla f(w^{(k)})

Gradient ascent for LR
- The conditional log likelihood:
  l(w) = \sum_l \ln P(Y^l \mid X^l, w) = \sum_l (Y^l - 1)\Big(w_0 + \sum_{i=1}^n w_i X_i^l\Big) - \ln\Big(1 + \exp\big(-(w_0 + \sum_{i=1}^n w_i X_i^l)\big)\Big)
- Its partial derivatives:
  \frac{\partial l(w)}{\partial w_i} = \sum_l X_i^l \big(Y^l - P(Y^l=1 \mid X^l, w)\big)
- Iterate until the change is below a threshold: for all i,
  w_i \leftarrow w_i + \eta \sum_l X_i^l \big(Y^l - P(Y^l=1 \mid X^l, w)\big)

Regularization
- Overfitting is a problem, especially when the data is very high dimensional and the training data is sparse.
- Regularization: use a "penalized log likelihood function" which penalizes large values of w:
  \sum_l \ln P(Y^l \mid X^l, w) - \frac{\lambda}{2} \|w\|^2
- The modified gradient ascent update:
  w_i \leftarrow w_i + \eta \sum_l X_i^l \big(Y^l - P(Y^l=1 \mid X^l, w)\big) - \eta \lambda w_i
  (See the training sketch at the end of these notes.)

Applet
- http://www.cs.technion.ac.il/~rani/LocBoost/

NB vs LR
- Consider Y boolean, X continuous, X = (X_1, ..., X_n).
- Number of parameters: NB: ___  LR: ___
- Parameter estimation method: NB: uncoupled; LR: coupled.

NB vs LR
- Asymptotic comparison (number of training examples -> infinity):
- When the model assumptions are correct, NB and LR produce identical classifiers.
- When the model assumptions are incorrect, LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB.
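
The following is a minimal Python sketch of the binary model and linear decision rule from the "Discriminative Classifier" and "Decision Boundary" slides. The function names, the NumPy dependency, and the example weights are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def p_y1(x, w0, w):
    """P(Y=1 | X=x, w) = 1 / (1 + exp(-(w0 + sum_i w_i * x_i)))."""
    z = w0 + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, w0, w):
    """Predict Y=1 exactly when w0 + sum_i w_i * x_i >= 0,
    i.e. when P(Y=1|x,w) >= P(Y=0|x,w): a linear decision rule."""
    return 1 if (w0 + np.dot(w, x)) >= 0 else 0

# Example with made-up weights; the decision boundary is the line w0 + w.x = 0.
w0, w = -1.0, np.array([2.0, -0.5])
x = np.array([0.8, 0.4])
print(p_y1(x, w0, w), classify(x, w0, w))
```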
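
A similar sketch for the multiclass form on the "LR more generally" slide: each class k < K has its own weight vector, class K serves as the reference class with numerator 1, and all classes share the same denominator. The array layout and function name are assumptions made here for illustration.

```python
import numpy as np

def p_y_given_x(x, W):
    """W has shape (K-1, n+1): one row (w_j0, w_j1, ..., w_jn) per class j < K.
    Returns the length-K vector [P(Y=1|x,w), ..., P(Y=K|x,w)]."""
    xb = np.concatenate([[1.0], x])                 # prepend 1 for the w_j0 term
    scores = np.exp(W @ xb)                         # exp(w_j0 + sum_i w_ji x_i), j < K
    denom = 1.0 + scores.sum()                      # shared denominator
    return np.concatenate([scores, [1.0]]) / denom  # class K's numerator is 1

# Example with K=3 classes and n=2 features.
W = np.array([[ 0.1, 1.0, -0.5],
              [-0.2, 0.3,  0.8]])
print(p_y_given_x(np.array([1.0, 2.0]), W))
```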
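
Finally, a sketch of the gradient-ascent training loop from the "Gradient ascent for LR" and "Regularization" slides, using the update w_i <- w_i + eta * [sum_l X_i^l (Y^l - P(Y^l=1|X^l,w)) - lam * w_i]. The learning rate, iteration cap, convergence threshold, and the choice not to penalize w_0 are illustrative assumptions rather than part of the original notes.

```python
import numpy as np

def train_lr(X, Y, eta=0.1, lam=0.0, tol=1e-6, max_iter=10000):
    """Maximize the (optionally L2-penalized) conditional log likelihood by
    batch gradient ascent. X is (num_examples, n); Y is 0/1 of length num_examples."""
    n_examples, n_features = X.shape
    Xb = np.hstack([np.ones((n_examples, 1)), X])   # prepend 1 so w[0] plays the role of w_0
    w = np.zeros(n_features + 1)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))         # P(Y^l = 1 | X^l, w) for each example l
        grad = Xb.T @ (Y - p)                       # sum_l X_i^l (Y^l - p^l) for each i
        grad[1:] -= lam * w[1:]                     # L2 penalty; w_0 left unpenalized here
        step = eta * grad
        w += step
        if np.max(np.abs(step)) < tol:              # "iterate until change < threshold"
            break
    return w

# Toy usage: two clusters separable along the first feature.
X = np.array([[0.0, 0.0], [0.2, 1.0], [1.0, 0.3], [1.2, 1.0]])
Y = np.array([0, 0, 1, 1])
print(train_lr(X, Y, eta=0.5, lam=0.01))
```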