Logistic regression Outline Outline Outline
Document Sample


Outline
Logistic regression Logistic regression
Logistic regression in R
Marco Baroni
Further topics
Linear models in R
Outline Outline
Logistic regression Logistic regression
Introduction Introduction
The model The model
Estimation Estimation
Looking at and comparing fitted models Looking at and comparing fitted models
Logistic regression in R Logistic regression in R
Further topics Further topics
Modeling discrete response variables Examples: binomial responses
Is statement X rated as “acceptable” in the following
condition(s)?
Does sentence S, that has features Y, W and Z, display
In a very large number of problems in cognitive science phenomenon X? (linguistic corpus data!)
and related fields Is it common for shoppers to decide to purchase the good
the response variable is categorical, often binary (yes/no; X given these conditions?
acceptable/not acceptable; phenomenon takes place/does
not take place) Did subject make more errors in this condition?
potentially explanatory factors (independent variables) are How many people answer YES to question X in the survey
categorical, numerical or both
Do old women like X more than young men?
Did the subject feel pain in this condition?
How often was reaction X triggered by these conditions?
Do children with characteristics X, Y and Z tend to have
autism?
Examples: multinomial responses Binomial and multinomial logistic regression models
Discrete response variable with natural ordering of the
Problems with binary (yes/no, success/failure,
levels:
happens/does not happen) dependent variables are
Ratings on a 6-point scale
Depending on the number of points on the scale, you might
handled by (binomial) logistic regression
also get away with a standard linear regression Problems with more than one discrete output are handled
Subjects answer YES, MAYBE, NO by
Subject reaction is coded as FRIENDLY, NEUTRAL, ordinal logistic regression, if outputs have natural ordering
ANGRY multinomial logistic regression otherwise
The cochlear data: experiment is set up so that possible The output of ordinal and especially multinomial logistic
errors are de facto on a 7-point scale regression tends to be hard to interpret, whenever possible
Discrete response variable without natural ordering: I try to reduce the problem to a binary choice or a set of
Subject decides to buy one of 4 different products binary choices
We have brain scans of subjects seeing 5 different objects, E.g., if output is yes/maybe/no, treat “maybe” as “yes”
and we want to predict seen object from features of the and/or as “no”
scan Multiple binomial regressions instead of a single
We model the chances of developing 4 different (and multinomial regression
mutually exclusive) psychological syndromes in terms of a Here, I focus entirely on the binomial case
number of behavioural indicators
Don’t be afraid of logistic regression! Don’t be afraid of logistic regression!
Logistic regression is less popular than linear regression
This might be due in part to historical reasons
the formal theory of generalized linear models is relatively If it is natural to cast your problem in terms of a discrete
recent: it was developed in the early 70s variable, you should go ahead and use logistic regression
the iterative maximum likelihood methods used for fitting
logistic regression models require more computational Logistic regression might be trickier to work with than linear
power than solving the least squares equations regression, but it’s still much better than pretending that the
Results of logistic regression are not as straightforward to variable is continuous or artificially re-casting the problem
understand and interpret as linear regression results in terms of a continuous response
Finally, there might also be a bit of prejudice against
discrete data as less “scientifically credible” than
hard-science-like continuous measurements
The Machine Learning angle The Machine Learning angle
Classification of a set of observations into 2 or more
discrete categories is a central task in Machine Learning
Same setting of logistic regression, except that emphasis is
The classic supervised learning setting:
placed on predicting the class of unseen data, rather than
Data points are represented by a set of features, i.e.,
on the significance of the effect of the features/independent
discrete or continuous explanatory variables
The “training” data also have a label indicating the class of variables (that are often too many – hundreds or more – to
the data point, i.e., a discrete binomial or multinomial be analyzed singularly) in discriminating the classes
dependent variable Indeed, logistic regression is also a standard technique in
A model (e.g., in the form of weights assigned to the Machine Learning, where it is sometimes known as
dependent variables) is fitted to the training data Maximum Entropy
The trained model is then used to predict the class of
unseen data points (where we know the values of the
features, but we do not have the label)
Outline Ordinary linear regression
Logistic regression
Introduction The by now familiar model:
The model
Estimation y = β0 + β1 x1 + β2 × x2 + ... + βn xn +
Looking at and comparing fitted models
Why will this not work if variable is binary (0/1)?
Logistic regression in R Why will it not work if we try to model proportions instead
of responses (e.g., proportion of YES-responses in
condition C)?
Further topics
Modeling log odds ratios From probabilities to log odds ratios
Following up on the “proportion of YES-responses” idea,
let’s say that we want to model the probability of one of the
two responses (which can be seen as the population
5
proportion of the relevant response for a certain choice of
the values of the dependent variables)
Probability will range from 0 to 1, but we can look at the
logarithm of the odds ratio instead:
logit(p)
0
p
logit(p) = log
1−p
This is the logarithm of the ratio of probability of
−5
1-response to probability of 0-response
It is arbitrary what counts as a 1-response and what counts
as a 0-response, although this might hinge on the ease of
interpretation of the model (e.g., treating YES as the
1-response will probably lead to more intuitive results than
treating NO as the 1-response) 0.0 0.2 0.4 0.6 0.8 1.0
Log odds ratios are not the most intuitive measure (at least
p
for me), but they range continuously from −∞ to +∞
The logistic regression model From log odds ratios to probabilities
1.0
Predicting log odds ratios:
logit(p) = β0 + β1 x1 + β2 x2 + ... + βn xn
0.8
From probabilities to log odds ratios:
0.6
p
logit(p) = log
1−p
p
0.4
Back to probabilities:
elogit(p)
0.2
p=
1 + elogit(p)
0.0
Thus:
eβ0 +β1 x1 +β2 x2 +...+βn xn
p= −10 −5 0 5 10
1 + eβ0 +β1 x1 +β2 x2 +...+βn xn
logit(p)
Probabilities and responses What is the relation between p and the response?
1.0
q q q
q q qqq
Given a single response at a specific setting of the
0.8
independent variables, p is the p parameter of a Bernoulli
probability distribution for whether the response is 1 (YES,
0.6
success, etc.) or 0 (NO, failure, etc.)
Probability of the two possible values for the response
p
random variable (1 and 0):
0.4
P(y ) = py (1 − p)(1−y )
0.2
Expected value: E(y ) = p
Variance: var (y ) = p(1 − p)
0.0
q
qq q q q q q
−10 −5 0 5 10
logit(p)
What is the relation between p and the response? Bernoulli vs. binomial
If independent variables are categorical, data might be
Given n responses for a certain setting of the independent stored in Bernoulli format (one row per observation, with 1
variables, p is the p parameter of a binomial distribution or 0 response) or in aggregate binomial format (one row
bin(p, n) for the number y of 1 responses per combination of the independent variables, with one
Probability of a number y of 1 responses given n column for the number of successes for that combination,
responses in the specific setting: and one column for the observations with that combination)
Below, I assume Bernoulli format, that is more natural
n y
p(y ) = p (1 − p)(n−y ) when collecting data, and typically more meaningful in the
y presence of continuous independent variables (when there
Expected value: E(y ) = np might be mostly one observation per combination of the
independent variables anyway)
Variance: var (y ) = np(1 − p)
Almost nothing changes, but deviance can only be
Single response (Bernoulli) scenario is simply a special
interpreted as a measure of goodness of fit with a χ2
case of aggregate binomial setting with n = 1
distribution when data are analyzed in the aggregate
binomial format
Generalized linear models Ordinary linear regression
as a generalized linear model
Logistic regression is an instance of wider family of
generalized linear models
Somewhat brutally, in a generalized linear model:
the value of a certain quantity η for a certain setting of the Systematic component: η = x β
independent variables (a row vector x) is computed by a
Link function is identity:
weighted linear combination of the explanatory variables
(the systematic component): η = x β
η = E(y )
the responses are modeled by a probability distribution that
has an expected value E(y ) (the random component)
Random component is normal distribution with mean µ and
A link function connects the random and systematic
variance σ 2
components by establishing that η is a function of the
expected value E(y ), i.e.: η = f (E(y )) This is why we add the 0-centered error term in the linear
regression model
General framework that uses same fitting techniques to
estimate models for different kinds of data
More about the general model in Luigi’s classes
Logistic regression as a generalized linear model Variance in the random component
Systematic component: η = x β
The systematic component is η = x β for both ordinary and
Link function: logistic regression (and any other generalized linear model)
E(y ) E(y ) In ordinary regression, random component is normally
η = f (E(y )) = logit(E(y )) = log = log
1 − E(y ) 1 − E(y ) distributed with mean η and variance σ 2
Putting the pieces together, a specific response is modeled
Random component is Bernoulli distribution with expected
by x β + , where the term comes from the random
value parameter E(y ), i.e., p
component (it is sampled from a normal distribution with
Given E(y ), i.e., p, observations have a Bernoulli mean 0 and variance σ 2 )
distribution with variance p(1 − p)
Variance in the random component Bernoulli variance
as a function of the expected value p
In logistic regression, the random component has a
Bernoulli distribution with p = logit(η) and variance
0.25
p(1 − p): there is no independent σ 2 parameter, and no
term from the random component
The response generation happens by picking 1 with
0.20
probability p and 0 otherwise; variance in responses is
entirely determined by p, i.e., η
0.15
On a related point, the errors, computed as response − p, variance
are not expected to be normally distributed with the same
variance across settings of the independent variables: we
0.10
expect a larger variance (and consequently a larger
squared error) when the expected p is close to .5, dropping
0.05
as we move towards the extremes (p = 0 or p = 1)
0.00
0.0 0.2 0.4 0.6 0.8 1.0
Outline Maximum likelihood estimation
Logistic regression
Introduction For any generalized linear model, we estimate the values
The model of the β parameters by maximum likelihood estimation, i.e.,
Estimation searching across possible settings of the β for that
Looking at and comparing fitted models combination that makes the sample responses most
probable
Logistic regression in R ˆ
Given a set of responses y , we search for the β estimate
that maximizes
Further topics ˆ
p(y |β)
Maximum likelihood estimation Iteratively reweighted least squares
We minimize squared error by giving more weight to errors
at levels of the independent variables where variance
In ordinary linear regression maximizing the likelihood is around p should be low (high and low ps)
equivalent to minimizing the sum of squared errors across This requires estimates of p, that in turn require estimates
the board (and consequently the estimated variance of of the β coefficients
errors) Iterative procedure:
In logistic regression, the errors are not expected to have make a guess at the ps (e.g., set them to the proportions of
the same variance: we should have high variance for p positive responses at each level)
near .5, lower variance towards the extremes use guesses to estimate the variances and hence the
weights (inversely proportional to the variances)
Leads to (iteratively) (re)weighted least squares (IRWLS) use the new weights to estimate the βs, derive new p
method, where errors are penalized more where we expect estimates, go back to the second step until convergence
less variance around p In the logistic regression setting (and in other generalized
linear models), IRWLS is equivalent to the
Newton-Rhapson optimization method to maximize
likelihood
Iteratively reweighted least squares Outline
Iterative search for best parameter tuning instead of closed Logistic regression
form equation: Introduction
No guarantee that estimation will converge on a vector of β The model
values that are actually maximizing the likelihood (“local Estimation
maximum” problem) Looking at and comparing fitted models
Computationally more demanding than ordinary least
squares
Possibly unstable Logistic regression in R
Note that ordinary least squares can be interpreted as a
special case of weighted least squares with fixed and Further topics
uniform weights
Interpreting the βs Interpreting the βs
ˆ ˆ
Asymptotically, βi |/s.e.(βi ) follows a standard normal
distribution if βi is 0
The standard errors are the squared roots of the diagonal
p is not a linear function of the βs elements of:
The same β will have a more drastic impact on p towards [X T diag(var (pi ))X ]−1
ˆ
the center of the p range than near the extremes (recall the where, if there are m data points in the sample,
S shape of the p curve) diag(var (pi )) is a m × m matrix with pi (1 − pi ) in the ith
ˆ ˆ ˆ
As a rule of thumb (the “divide by 4” rule), β/4 is an upper diagonal element
(More involved than standard error of ordinary regression
bound on the difference in p brought about by a unit
coefficients because variance changes with p)
difference on the corresponding explanatory variable
Roughly, if a β is more than 2 standard errors away from 0,
we can say that the corresponding explanatory variable
has an effect that is significantly different from 0 (at
α = 0.05)
Goodness of fit Binned goodness of fit
Measures such as R 2 based on residual errors are not
very informative
Goodness of fit can be inspected visually by grouping the
One intuitive measure of fit is the error rate, given by the ps into equally wide bins (0-0.1,0.1-0.2, . . . ) and plotting
ˆ
proportion of data points in which the model assigns p > .5
ˆ
the average p predicted by the model for the points in each
ˆ
to 0-responses or p < .5 to 1-responses
ˆ
bin vs. the observed proportion of 1-responses for the data
This can be compared to baseline in which the model
always predicts 1 if majority of data points are 1 or 0 if points in the bin
majority of data points are 0 (baseline error rate given by We can also compute a R 2 or other goodness of fit
proportion of minority responses over total) measure on these binned data
Some information lost (a .9 and a .6 prediction are treated Won’t try it here, see the plot.logistic.fit.fnc()
equally) function from languageR
Other measures of fit proposed in the literature, no widely
agreed upon standard
Model selection Log likelihood
We explore two ways to select models:
Via a statistical significance test for nested models based Various measures of absolute or relative fit of logistic
on the notion of deviance regression models (and other generalized linear models)
There is a relation between the deviance-based test used in are based on the logarithm of the likelihood of a fitted
logistic regression (and with other generalized linear models) model, i.e., the maximum likelihood value for the relevant
and the F-test used for ordinary linear regression model
Via a criterion-based method that favours the model with
For a data set of m rows with Bernoulli (1 or 0) responses,
the lowest Akaike Information Criterion (AIC) score
where yi is the ith response and pi is the model-produced
ˆ
In both approaches, the notion of log likelihood of a model p for the ith row, the (maximized) log likelihood of the
fitted by maximum likelihood plays a crucial role (it is model M is given by:
desirable that a model assigns a high probability to the
seen data) i=m
Moreover, in both approaches the number of parameters L(y , M) = {log(ˆ yi ) + log[(1 − p)1−yi ]}
p ˆ
i=1
used by a model counterbalances the likelihood (a high
likelihood is less impressive if we can tune a lot of Note that the log likelihood is always ≤ 0 (because
parameters, and a model with many parameters is more log 1 = 0)
prone to overfitting)
Deviance Deviance
L(y , M) is the (maximum) log likelihood for a model M with If data are arranged with a separate row for each Bernoulli
n parameters (βs to estimate) on a sample of m response, there are m rows and pi is the model-fitted p for
ˆ
observations with responses y = (y1 , y2 , · · · , ym ) the ith row, the computation of deviance reduces to:
L(y , My ) is the (maximum) log likelihood for the saturated
i=m
model that has one parameter per response (i.e., m
parameters), so it can fit the observed responses perfectly −2 p p
{ˆi [log(ˆi ) − log(1 − pi )] + log(1 − pi )}
ˆ ˆ
i=1
The deviance of a maximum-likelihood-fitted model M is
−2 times the difference in log likelihoods between M and When the data are arranged in this way (rather than by
My (equivalently, the log likelihood ratio): counts of successes and observations per level) and/or
some independent variables are continuous, the
D(y , M) = −2[L(y , M) − L(y , My )] significance of the deviance from the saturated model
cannot be assessed with a χ2 test as we will do below
The better the model is at predicting the actual responses,
when comparing two nested models: comparing nested
the lower the deviance will be; conversely, the larger the
models will be the main use of deviance for our purposes
deviance, the worse the fit
Deviance Deviance
Consider two nested models where the larger model Ml
uses all the independent variables in the smaller model Ms The log likelihood ratio, or equivalently the difference in
plus some extra ones; the difference in parameters (β) to deviance, approximates a χ2 distribution with the difference
estimate will be nl − ns in number of parameters nl − ns as df’s (with some caveats
Maximizing the likelihood on the same observations y we do not worry about here: in general, approximation is
using a subset of the parameters cannot yield a higher better if difference in parameter number is not too large)
maximum than with the larger set, thus: Intuitively, the larger the difference in deviance between Ms
and Ml , the less likely is is that the improvement in
L(y , Ms ) ≤ L(y , Ml )
likelihood is just due to the fact that Ml has more
A log likelihood ratio test that the improvement in likelihood parameters to play with, although this difference will be the
with Ml is not due to chance is given by: less surprising, the more extra-parameters Ml has
A model can also be compared against the baseline “null”
−2[L(y , Ms ) − L(y , Ml )] = −2[L(y , Ms ) − L(y , My )] model that always predicts the same p (given by the
−2[L(y , Ml ) − L(y , My )] proportion of 1-responses in the data) and has only one
= D(y , Ms ) − D(y , Ml ) parameter (the fixed predicted value)
Akaike Information Criterion (AIC) Outline
The AIC is a measure derived from information-theoretic
arguments that provides a trade-off between the fit and the
complexity of a model
Given a model M with n parameters achieving maximum Logistic regression
log likelihood L(y , M) on some data y , the AIC of the
model is: Logistic regression in R
AIC = −2L(y , M) + 2n Preparing the data and fitting the model
The lower the AIC, the better the model Practice
The negative log likelihood term penalizes models with a
poor fit, whereas the parameter count term penalizes Further topics
complex models
AIC does not tell us anything about “significant”
differences, but models with lower AIC can be argued to be
“better” than those with higher AIC (what matters is the
rank, not the absolute difference)
AIC can also be applied to choose among non-nested
models, as long as the fitted data are the same, of course
Outline Back to the Graffeo et al.’s discount study
Fields in the discount.txt file
Logistic regression
subj Unique subject code
sex M or F
Logistic regression in R
Preparing the data and fitting the model age NB: contains some NA
Practice presentation absdiff (amount of discount), result (price after
discount), percent (percentage discount)
Further topics product pillow, (camping) table, helmet, (bed) net
choice Y (buys), N (does not buy) → the discrete
response variable
Preparing the data Logistic regression in R
Read the file into an R data-frame, look at the summaries,
etc.
Note in the summary of age that R “understands” NAs
> sex_age_pres_prod.glm<-glm(choice~sex+age+
(i.e., it is not treating age as a categorical variable)
presentation+product,family="binomial")
We can filter out the rows containing NAs as follows:
> e<-na.omit(d) > summary(sex_age_pres_prod.glm)
Compare summaries of d and e
na.omit can also be passed as an option to the modeling
functions, but I feel uneasy about that
Attach the NA-free data-frame
Selected lines from the summary() output Selected lines from the summary() output
Deviance
Estimated β coefficients, standard errors and z scores
(β/std. error):
Coefficients:
Estimate Std. Error z value Pr(>|z|) For the “null” model and for the current model:
sexM -0.332060 0.140008 -2.372 0.01771 *
age -0.012872 0.006003 -2.144 0.03201 * Null deviance: 1453.6 on 1175 degrees of freedom
presentationpercent 1.230082 0.162560 7.567 3.82e-14 *** Residual deviance: 1284.3 on 1168 degrees of freedom
presentationresult 1.516053 0.172746 8.776 < 2e-16 ***
Note automated creation of binary dummy variables:
Difference in deviance (169.3) is much higher than
discounts presented as percents and as resulting values
are significantly more likely to lead to a purchase than difference in parameters (7), suggesting that the current
discounts expressed as absolute difference (the default model is significantly better than the null model
level)
use relevel() to set another level of a categorical
variable as default
Deviance: do it yourself Selected lines from the summary() output
AIC and Fisher Scoring iterations
i=m
D(y , M) = −2 p p
{ˆi [log(ˆi ) − log(1 − pi )] + log(1 − pi )}
ˆ ˆ
i=1
The Akaike Information Criterion score:
AIC: 1300.3
> fitted_ps<-fitted.values(sex_age_pres_prod.glm)
> -2 * sum(fitted_ps * (log(fitted_ps)
And how many iterations of the IRWLS procedure were
- log(1-fitted_ps)) + log(1-fitted_ps))
performed before convergence (large numbers might cue a
[1] 1284.251
problem):
> ys<-as.numeric(choice)-1 Number of Fisher Scoring iterations: 4
> -2 * length(ys) * (mean_y * (log(mean_y)
- log(1-mean_y)) + log(1-mean_y))
[1] 1453.619
The AIC: do it yourself Comparing models with the deviance test
Let us add a presentation by interaction term:
> ys<-as.numeric(choice)-1 > interaction.glm<-glm(choice~sex+age+presentation+
product+sex:presentation,family="binomial")
> fitted_ps<-fitted.values(sex_age_pres_prod.glm)
Are the extra-parameters justified?
> ll<-sum(log(fitted_ps^ys)
+ log((1-fitted_ps)^(1-ys))) > anova(sex_age_pres_prod.glm,interaction.glm,
test="Chisq")
...
> -2*ll + Resid. Df Resid. Dev Df Deviance P(>|Chi|)
2*length(coefficients(sex_age_pres_prod.glm)) 1 1168 1284.25
[1] 1300.251 2 1166 1277.68 2 6.57 0.04
Apparently, yes (although summary(interaction.glm)
suggests just a marginal interaction between sex and the
percentage dummy variable)
Comparing models by AIC Error rate
The model makes an error when it assigns p > .5 to
observation where choice is N or p < .5 to observation
where choice is Y:
> sum((fitted(interaction.glm)>.5 & choice=="N") |
(fitted(interaction.glm)<.5 & choice=="Y")) /
Again, the model with the interactions looks better: length(choice)
[1] 0.2712585
> sex_age_pres_prod.glm$aic
[1] 1300.251
Compare to error rate by baseline model that always
> interaction.glm$aic guesses the majority choice:
[1] 1297.682
> table(choice)
choice
N Y
363 813
> sum(choice=="N")/length(choice)
[1] 0.3086735
Improvement in error rate is nothing to write home about. . .
Bootstrap estimation Outline
Recompute logistic model fitted with lrm(), the logistic
regression fitting function from the Design package:
> interaction.glm<-lrm(choice~sex+age Logistic regression
+presentation+product+sex:presentation,
x=TRUE,y=TRUE) Logistic regression in R
Validation using the logistic model estimated by lrm() and Preparing the data and fitting the model
1,000 iterations: Practice
> validate(interaction.glm,B=1000)
When fed a logistic model, validate() returns various Further topics
measures of fit we have not discussed: see, e.g., Baayen
or Harrell
Independently of the interpretation of the measures, the
size of the optimism indices gives a general idea of the
amount of overfitting (not dramatic in this case)
The Navarrete et al.’s The cwcc.txt data-set
Cumulative Within-Category Cost data
subj A unique id for each of the 20 subjects
Part of a larger study by Eduardo Navarrete, Brad Mahon item The specific objects presented in the pictures
and Alfonso Caramazza; we just look at one very specific (pears, pianos, houses, etc.)
aspect of their Experiment 1 data category 12 superordinate categories (animals, body parts,
A well-known and robust effect in picture naming makes buildings, etc.)
subjects slower at naming objects as a (linear) function of ordpos The ordinal position of the item within its category
how many objects from the same category they saw before in the current block (is this the first, second, . . . ,
(you are slower at naming the second mammal, slower still fifth animal?)
to name the third, etc.) block Each subject sees 4 repetitions of the whole
The question here: does this “cumulative within-category stimulus set, presented in different orders
cost” effect also impact error rate? Are subjects more likely response The picture naming latency in milliseconds (or
to mis-label/fail to name an object as a function of the error for wrong or anomalous responses)
numbers of objects in the same category they already error Did the subject gave a wrong/anomalous
saw? response? (0: no; 1: yes)
Practice time Outline
Using logistic regression, model the probability of making Logistic regression
an error in function of category, ordinal position within
category and block Logistic regression in R
Which effects do you find?
Do they make sense? Further topics
Is there an interaction, e.g., between ordinal position and
block?
Further topics
All stuff you can do in R
More generalized linear models: multinomial, ordinal
logistic regression, Poisson regression, . . .
Too many independent variables? Automated selection
procedures, Ridge/regularized regression, PCA of the
independent variables. . .
Item and subject issues? Mixed models with random
effects
No independent variable (unlabeled data)? Clustering,
PCA. . .
Fancier machine learning: Support Vector Machines. . .
Related docs
Get documents about "