Logistic regression Outline Outline Outline

Document Sample
scope of work template
							                                              Outline




                      Logistic regression        Logistic regression

                                                 Logistic regression in R
                              Marco Baroni
                                                 Further topics

                         Linear models in R




Outline                                       Outline


   Logistic regression                           Logistic regression
      Introduction                                  Introduction
      The model                                     The model
      Estimation                                    Estimation
      Looking at and comparing fitted models         Looking at and comparing fitted models

   Logistic regression in R                      Logistic regression in R

   Further topics                                Further topics
Modeling discrete response variables                                       Examples: binomial responses

                                                                                 Is statement X rated as “acceptable” in the following
                                                                                 condition(s)?
                                                                                 Does sentence S, that has features Y, W and Z, display
      In a very large number of problems in cognitive science                    phenomenon X? (linguistic corpus data!)
      and related fields                                                          Is it common for shoppers to decide to purchase the good
          the response variable is categorical, often binary (yes/no;            X given these conditions?
          acceptable/not acceptable; phenomenon takes place/does
          not take place)                                                        Did subject make more errors in this condition?
          potentially explanatory factors (independent variables) are            How many people answer YES to question X in the survey
          categorical, numerical or both
                                                                                 Do old women like X more than young men?
                                                                                 Did the subject feel pain in this condition?
                                                                                 How often was reaction X triggered by these conditions?
                                                                                 Do children with characteristics X, Y and Z tend to have
                                                                                 autism?



Examples: multinomial responses                                            Binomial and multinomial logistic regression models
      Discrete response variable with natural ordering of the
                                                                                 Problems with binary (yes/no, success/failure,
      levels:
                                                                                 happens/does not happen) dependent variables are
          Ratings on a 6-point scale
               Depending on the number of points on the scale, you might
                                                                                 handled by (binomial) logistic regression
               also get away with a standard linear regression                   Problems with more than one discrete output are handled
          Subjects answer YES, MAYBE, NO                                         by
          Subject reaction is coded as FRIENDLY, NEUTRAL,                            ordinal logistic regression, if outputs have natural ordering
          ANGRY                                                                      multinomial logistic regression otherwise
          The cochlear data: experiment is set up so that possible               The output of ordinal and especially multinomial logistic
          errors are de facto on a 7-point scale                                 regression tends to be hard to interpret, whenever possible
      Discrete response variable without natural ordering:                       I try to reduce the problem to a binary choice or a set of
          Subject decides to buy one of 4 different products                     binary choices
          We have brain scans of subjects seeing 5 different objects,                E.g., if output is yes/maybe/no, treat “maybe” as “yes”
          and we want to predict seen object from features of the                    and/or as “no”
          scan                                                                       Multiple binomial regressions instead of a single
          We model the chances of developing 4 different (and                        multinomial regression
          mutually exclusive) psychological syndromes in terms of a              Here, I focus entirely on the binomial case
          number of behavioural indicators
Don’t be afraid of logistic regression!                                    Don’t be afraid of logistic regression!


       Logistic regression is less popular than linear regression
       This might be due in part to historical reasons
           the formal theory of generalized linear models is relatively           If it is natural to cast your problem in terms of a discrete
           recent: it was developed in the early 70s                              variable, you should go ahead and use logistic regression
           the iterative maximum likelihood methods used for fitting
           logistic regression models require more computational                  Logistic regression might be trickier to work with than linear
           power than solving the least squares equations                         regression, but it’s still much better than pretending that the
       Results of logistic regression are not as straightforward to               variable is continuous or artificially re-casting the problem
       understand and interpret as linear regression results                      in terms of a continuous response
       Finally, there might also be a bit of prejudice against
       discrete data as less “scientifically credible” than
       hard-science-like continuous measurements




The Machine Learning angle                                                 The Machine Learning angle


       Classification of a set of observations into 2 or more
       discrete categories is a central task in Machine Learning
                                                                                  Same setting of logistic regression, except that emphasis is
       The classic supervised learning setting:
                                                                                  placed on predicting the class of unseen data, rather than
           Data points are represented by a set of features, i.e.,
                                                                                  on the significance of the effect of the features/independent
           discrete or continuous explanatory variables
           The “training” data also have a label indicating the class of          variables (that are often too many – hundreds or more – to
           the data point, i.e., a discrete binomial or multinomial               be analyzed singularly) in discriminating the classes
           dependent variable                                                     Indeed, logistic regression is also a standard technique in
           A model (e.g., in the form of weights assigned to the                  Machine Learning, where it is sometimes known as
           dependent variables) is fitted to the training data                     Maximum Entropy
           The trained model is then used to predict the class of
           unseen data points (where we know the values of the
           features, but we do not have the label)
Outline                                                                    Ordinary linear regression


   Logistic regression
      Introduction                                                               The by now familiar model:
      The model
      Estimation                                                                                 y = β0 + β1 x1 + β2 × x2 + ... + βn xn +
      Looking at and comparing fitted models
                                                                                 Why will this not work if variable is binary (0/1)?
   Logistic regression in R                                                      Why will it not work if we try to model proportions instead
                                                                                 of responses (e.g., proportion of YES-responses in
                                                                                 condition C)?
   Further topics




Modeling log odds ratios                                                   From probabilities to log odds ratios
       Following up on the “proportion of YES-responses” idea,
       let’s say that we want to model the probability of one of the
       two responses (which can be seen as the population




                                                                                                5
       proportion of the relevant response for a certain choice of
       the values of the dependent variables)
       Probability will range from 0 to 1, but we can look at the
       logarithm of the odds ratio instead:



                                                                                     logit(p)
                                                                                                0
                                             p
                            logit(p) = log
                                           1−p
       This is the logarithm of the ratio of probability of

                                                                                                −5
       1-response to probability of 0-response
            It is arbitrary what counts as a 1-response and what counts
            as a 0-response, although this might hinge on the ease of
            interpretation of the model (e.g., treating YES as the
            1-response will probably lead to more intuitive results than
            treating NO as the 1-response)                                                           0.0   0.2    0.4       0.6   0.8   1.0
       Log odds ratios are not the most intuitive measure (at least
                                                                                                                        p
       for me), but they range continuously from −∞ to +∞
The logistic regression model                                      From log odds ratios to probabilities




                                                                                 1.0
      Predicting log odds ratios:

                logit(p) = β0 + β1 x1 + β2 x2 + ... + βn xn




                                                                                 0.8
      From probabilities to log odds ratios:




                                                                                 0.6
                                              p
                           logit(p) = log
                                             1−p




                                                                             p
                                                                                 0.4
      Back to probabilities:

                                      elogit(p)




                                                                                 0.2
                               p=
                                    1 + elogit(p)




                                                                                 0.0
      Thus:
                             eβ0 +β1 x1 +β2 x2 +...+βn xn
                      p=                                                               −10     −5         0        5     10
                           1 + eβ0 +β1 x1 +β2 x2 +...+βn xn
                                                                                                      logit(p)

Probabilities and responses                                        What is the relation between p and the response?
              1.0




                                q        q         q
                                                q q qqq

                                                                         Given a single response at a specific setting of the
              0.8




                                                                         independent variables, p is the p parameter of a Bernoulli
                                                                         probability distribution for whether the response is 1 (YES,
              0.6




                                                                         success, etc.) or 0 (NO, failure, etc.)
                                                                         Probability of the two possible values for the response
          p




                                                                         random variable (1 and 0):
              0.4




                                                                                             P(y ) = py (1 − p)(1−y )
              0.2




                                                                         Expected value: E(y ) = p
                                                                         Variance: var (y ) = p(1 − p)
              0.0




                            q
                      qq q q q           q q

                    −10        −5        0           5        10
                                     logit(p)
What is the relation between p and the response?                           Bernoulli vs. binomial

                                                                                 If independent variables are categorical, data might be
      Given n responses for a certain setting of the independent                 stored in Bernoulli format (one row per observation, with 1
      variables, p is the p parameter of a binomial distribution                 or 0 response) or in aggregate binomial format (one row
      bin(p, n) for the number y of 1 responses                                  per combination of the independent variables, with one
      Probability of a number y of 1 responses given n                           column for the number of successes for that combination,
      responses in the specific setting:                                          and one column for the observations with that combination)
                                                                                 Below, I assume Bernoulli format, that is more natural
                                 n y
                       p(y ) =     p (1 − p)(n−y )                               when collecting data, and typically more meaningful in the
                                 y                                               presence of continuous independent variables (when there
      Expected value: E(y ) = np                                                 might be mostly one observation per combination of the
                                                                                 independent variables anyway)
      Variance: var (y ) = np(1 − p)
                                                                                 Almost nothing changes, but deviance can only be
      Single response (Bernoulli) scenario is simply a special
                                                                                 interpreted as a measure of goodness of fit with a χ2
      case of aggregate binomial setting with n = 1
                                                                                 distribution when data are analyzed in the aggregate
                                                                                 binomial format



Generalized linear models                                                  Ordinary linear regression
                                                                           as a generalized linear model
      Logistic regression is an instance of wider family of
      generalized linear models
      Somewhat brutally, in a generalized linear model:
          the value of a certain quantity η for a certain setting of the         Systematic component: η = x β
          independent variables (a row vector x) is computed by a
                                                                                 Link function is identity:
          weighted linear combination of the explanatory variables
          (the systematic component): η = x β
                                                                                                              η = E(y )
          the responses are modeled by a probability distribution that
          has an expected value E(y ) (the random component)
                                                                                 Random component is normal distribution with mean µ and
          A link function connects the random and systematic
                                                                                 variance σ 2
          components by establishing that η is a function of the
          expected value E(y ), i.e.: η = f (E(y ))                                   This is why we add the 0-centered error term in the linear
                                                                                      regression model
      General framework that uses same fitting techniques to
      estimate models for different kinds of data
      More about the general model in Luigi’s classes
Logistic regression as a generalized linear model                       Variance in the random component



      Systematic component: η = x β
                                                                              The systematic component is η = x β for both ordinary and
      Link function:                                                          logistic regression (and any other generalized linear model)
                                              E(y )           E(y )           In ordinary regression, random component is normally
       η = f (E(y )) = logit(E(y )) = log             = log
                                            1 − E(y )       1 − E(y )         distributed with mean η and variance σ 2
                                                                              Putting the pieces together, a specific response is modeled
      Random component is Bernoulli distribution with expected
                                                                              by x β + , where the term comes from the random
      value parameter E(y ), i.e., p
                                                                              component (it is sampled from a normal distribution with
      Given E(y ), i.e., p, observations have a Bernoulli                     mean 0 and variance σ 2 )
      distribution with variance p(1 − p)




Variance in the random component                                        Bernoulli variance
                                                                        as a function of the expected value p
      In logistic regression, the random component has a
      Bernoulli distribution with p = logit(η) and variance




                                                                                           0.25
      p(1 − p): there is no independent σ 2 parameter, and no
      term from the random component
          The response generation happens by picking 1 with




                                                                                           0.20
          probability p and 0 otherwise; variance in responses is
          entirely determined by p, i.e., η


                                                                                           0.15
      On a related point, the errors, computed as response − p,                 variance
      are not expected to be normally distributed with the same
      variance across settings of the independent variables: we
                                                                                           0.10



      expect a larger variance (and consequently a larger
      squared error) when the expected p is close to .5, dropping
                                                                                           0.05




      as we move towards the extremes (p = 0 or p = 1)
                                                                                           0.00




                                                                                                  0.0   0.2   0.4   0.6   0.8   1.0
Outline                                                              Maximum likelihood estimation


   Logistic regression
      Introduction                                                         For any generalized linear model, we estimate the values
      The model                                                            of the β parameters by maximum likelihood estimation, i.e.,
      Estimation                                                           searching across possible settings of the β for that
      Looking at and comparing fitted models                                combination that makes the sample responses most
                                                                           probable
   Logistic regression in R                                                                                               ˆ
                                                                           Given a set of responses y , we search for the β estimate
                                                                           that maximizes
   Further topics                                                                                         ˆ
                                                                                                     p(y |β)




Maximum likelihood estimation                                        Iteratively reweighted least squares
                                                                           We minimize squared error by giving more weight to errors
                                                                           at levels of the independent variables where variance
       In ordinary linear regression maximizing the likelihood is          around p should be low (high and low ps)
       equivalent to minimizing the sum of squared errors across           This requires estimates of p, that in turn require estimates
       the board (and consequently the estimated variance of               of the β coefficients
       errors)                                                             Iterative procedure:
       In logistic regression, the errors are not expected to have             make a guess at the ps (e.g., set them to the proportions of
       the same variance: we should have high variance for p                   positive responses at each level)
       near .5, lower variance towards the extremes                            use guesses to estimate the variances and hence the
                                                                               weights (inversely proportional to the variances)
       Leads to (iteratively) (re)weighted least squares (IRWLS)               use the new weights to estimate the βs, derive new p
       method, where errors are penalized more where we expect                 estimates, go back to the second step until convergence
       less variance around p                                              In the logistic regression setting (and in other generalized
                                                                           linear models), IRWLS is equivalent to the
                                                                           Newton-Rhapson optimization method to maximize
                                                                           likelihood
Iteratively reweighted least squares                                    Outline


      Iterative search for best parameter tuning instead of closed         Logistic regression
      form equation:                                                          Introduction
          No guarantee that estimation will converge on a vector of β         The model
          values that are actually maximizing the likelihood (“local          Estimation
          maximum” problem)                                                   Looking at and comparing fitted models
          Computationally more demanding than ordinary least
          squares
          Possibly unstable                                                Logistic regression in R
      Note that ordinary least squares can be interpreted as a
      special case of weighted least squares with fixed and                 Further topics
      uniform weights




Interpreting the βs                                                     Interpreting the βs

                                                                                                  ˆ      ˆ
                                                                               Asymptotically, βi |/s.e.(βi ) follows a standard normal
                                                                               distribution if βi is 0
                                                                                    The standard errors are the squared roots of the diagonal
      p is not a linear function of the βs                                          elements of:
      The same β will have a more drastic impact on p towards                                         [X T diag(var (pi ))X ]−1
                                                                                                                     ˆ
      the center of the p range than near the extremes (recall the                  where, if there are m data points in the sample,
      S shape of the p curve)                                                       diag(var (pi )) is a m × m matrix with pi (1 − pi ) in the ith
                                                                                               ˆ                           ˆ       ˆ
      As a rule of thumb (the “divide by 4” rule), β/4 is an upper                  diagonal element
                                                                                    (More involved than standard error of ordinary regression
      bound on the difference in p brought about by a unit
                                                                                    coefficients because variance changes with p)
      difference on the corresponding explanatory variable
                                                                               Roughly, if a β is more than 2 standard errors away from 0,
                                                                               we can say that the corresponding explanatory variable
                                                                               has an effect that is significantly different from 0 (at
                                                                               α = 0.05)
Goodness of fit                                                                  Binned goodness of fit

      Measures such as R 2 based on residual errors are not
      very informative
                                                                                      Goodness of fit can be inspected visually by grouping the
      One intuitive measure of fit is the error rate, given by the                     ps into equally wide bins (0-0.1,0.1-0.2, . . . ) and plotting
                                                                                      ˆ
      proportion of data points in which the model assigns p > .5
                                                              ˆ
                                                                                      the average p predicted by the model for the points in each
                                                                                                    ˆ
      to 0-responses or p < .5 to 1-responses
                         ˆ
                                                                                      bin vs. the observed proportion of 1-responses for the data
          This can be compared to baseline in which the model
          always predicts 1 if majority of data points are 1 or 0 if                  points in the bin
          majority of data points are 0 (baseline error rate given by                 We can also compute a R 2 or other goodness of fit
          proportion of minority responses over total)                                measure on these binned data
      Some information lost (a .9 and a .6 prediction are treated                     Won’t try it here, see the plot.logistic.fit.fnc()
      equally)                                                                        function from languageR
      Other measures of fit proposed in the literature, no widely
      agreed upon standard




Model selection                                                                 Log likelihood
      We explore two ways to select models:
          Via a statistical significance test for nested models based                  Various measures of absolute or relative fit of logistic
          on the notion of deviance                                                   regression models (and other generalized linear models)
               There is a relation between the deviance-based test used in            are based on the logarithm of the likelihood of a fitted
               logistic regression (and with other generalized linear models)         model, i.e., the maximum likelihood value for the relevant
               and the F-test used for ordinary linear regression                     model
          Via a criterion-based method that favours the model with
                                                                                      For a data set of m rows with Bernoulli (1 or 0) responses,
          the lowest Akaike Information Criterion (AIC) score
                                                                                      where yi is the ith response and pi is the model-produced
                                                                                                                        ˆ
      In both approaches, the notion of log likelihood of a model                     p for the ith row, the (maximized) log likelihood of the
      fitted by maximum likelihood plays a crucial role (it is                         model M is given by:
      desirable that a model assigns a high probability to the
      seen data)                                                                                            i=m

      Moreover, in both approaches the number of parameters                                    L(y , M) =         {log(ˆ yi ) + log[(1 − p)1−yi ]}
                                                                                                                       p                 ˆ
                                                                                                            i=1
      used by a model counterbalances the likelihood (a high
      likelihood is less impressive if we can tune a lot of                           Note that the log likelihood is always ≤ 0 (because
      parameters, and a model with many parameters is more                            log 1 = 0)
      prone to overfitting)
Deviance                                                              Deviance

     L(y , M) is the (maximum) log likelihood for a model M with           If data are arranged with a separate row for each Bernoulli
     n parameters (βs to estimate) on a sample of m                        response, there are m rows and pi is the model-fitted p for
                                                                                                            ˆ
     observations with responses y = (y1 , y2 , · · · , ym )               the ith row, the computation of deviance reduces to:
     L(y , My ) is the (maximum) log likelihood for the saturated
                                                                                      i=m
     model that has one parameter per response (i.e., m
     parameters), so it can fit the observed responses perfectly                  −2          p       p
                                                                                            {ˆi [log(ˆi ) − log(1 − pi )] + log(1 − pi )}
                                                                                                                    ˆ               ˆ
                                                                                      i=1
     The deviance of a maximum-likelihood-fitted model M is
     −2 times the difference in log likelihoods between M and              When the data are arranged in this way (rather than by
     My (equivalently, the log likelihood ratio):                          counts of successes and observations per level) and/or
                                                                           some independent variables are continuous, the
                  D(y , M) = −2[L(y , M) − L(y , My )]                     significance of the deviance from the saturated model
                                                                           cannot be assessed with a χ2 test as we will do below
     The better the model is at predicting the actual responses,
                                                                           when comparing two nested models: comparing nested
     the lower the deviance will be; conversely, the larger the
                                                                           models will be the main use of deviance for our purposes
     deviance, the worse the fit



Deviance                                                              Deviance
     Consider two nested models where the larger model Ml
     uses all the independent variables in the smaller model Ms            The log likelihood ratio, or equivalently the difference in
     plus some extra ones; the difference in parameters (β) to             deviance, approximates a χ2 distribution with the difference
     estimate will be nl − ns                                              in number of parameters nl − ns as df’s (with some caveats
     Maximizing the likelihood on the same observations y                  we do not worry about here: in general, approximation is
     using a subset of the parameters cannot yield a higher                better if difference in parameter number is not too large)
     maximum than with the larger set, thus:                               Intuitively, the larger the difference in deviance between Ms
                                                                           and Ml , the less likely is is that the improvement in
                          L(y , Ms ) ≤ L(y , Ml )
                                                                           likelihood is just due to the fact that Ml has more
     A log likelihood ratio test that the improvement in likelihood        parameters to play with, although this difference will be the
     with Ml is not due to chance is given by:                             less surprising, the more extra-parameters Ml has
                                                                           A model can also be compared against the baseline “null”
        −2[L(y , Ms ) − L(y , Ml )] = −2[L(y , Ms ) − L(y , My )]          model that always predicts the same p (given by the
                                        −2[L(y , Ml ) − L(y , My )]        proportion of 1-responses in the data) and has only one
                                    = D(y , Ms ) − D(y , Ml )              parameter (the fixed predicted value)
Akaike Information Criterion (AIC)                                      Outline
       The AIC is a measure derived from information-theoretic
       arguments that provides a trade-off between the fit and the
       complexity of a model
       Given a model M with n parameters achieving maximum                  Logistic regression
       log likelihood L(y , M) on some data y , the AIC of the
       model is:                                                            Logistic regression in R
                            AIC = −2L(y , M) + 2n                              Preparing the data and fitting the model
       The lower the AIC, the better the model                                 Practice
            The negative log likelihood term penalizes models with a
            poor fit, whereas the parameter count term penalizes             Further topics
            complex models
       AIC does not tell us anything about “significant”
       differences, but models with lower AIC can be argued to be
       “better” than those with higher AIC (what matters is the
       rank, not the absolute difference)
            AIC can also be applied to choose among non-nested
            models, as long as the fitted data are the same, of course


Outline                                                                 Back to the Graffeo et al.’s discount study
                                                                        Fields in the discount.txt file




   Logistic regression
                                                                                     subj Unique subject code
                                                                                     sex M or F
   Logistic regression in R
      Preparing the data and fitting the model                                        age NB: contains some NA
      Practice                                                              presentation absdiff (amount of discount), result (price after
                                                                                        discount), percent (percentage discount)
   Further topics                                                                product pillow, (camping) table, helmet, (bed) net
                                                                                  choice Y (buys), N (does not buy) → the discrete
                                                                                         response variable
Preparing the data                                                  Logistic regression in R


      Read the file into an R data-frame, look at the summaries,
      etc.
      Note in the summary of age that R “understands” NAs
                                                                        > sex_age_pres_prod.glm<-glm(choice~sex+age+
      (i.e., it is not treating age as a categorical variable)
                                                                          presentation+product,family="binomial")
      We can filter out the rows containing NAs as follows:
      > e<-na.omit(d)                                                   > summary(sex_age_pres_prod.glm)
      Compare summaries of d and e
          na.omit can also be passed as an option to the modeling
          functions, but I feel uneasy about that
      Attach the NA-free data-frame




Selected lines from the summary() output                            Selected lines from the summary() output
                                                                    Deviance

      Estimated β coefficients, standard errors and z scores
      (β/std. error):
      Coefficients:
                           Estimate Std. Error z value Pr(>|z|)                For the “null” model and for the current model:
      sexM                -0.332060   0.140008 -2.372 0.01771 *
      age                 -0.012872   0.006003 -2.144 0.03201 *                    Null deviance: 1453.6     on 1175    degrees of freedom
      presentationpercent 1.230082    0.162560   7.567 3.82e-14 ***            Residual deviance: 1284.3     on 1168    degrees of freedom
      presentationresult   1.516053   0.172746   8.776 < 2e-16 ***
      Note automated creation of binary dummy variables:
                                                                               Difference in deviance (169.3) is much higher than
      discounts presented as percents and as resulting values
      are significantly more likely to lead to a purchase than                  difference in parameters (7), suggesting that the current
      discounts expressed as absolute difference (the default                  model is significantly better than the null model
      level)
          use relevel() to set another level of a categorical
          variable as default
Deviance: do it yourself                                                   Selected lines from the summary() output
                                                                           AIC and Fisher Scoring iterations



                     i=m
     D(y , M) = −2          p       p
                           {ˆi [log(ˆi ) − log(1 − pi )] + log(1 − pi )}
                                                   ˆ               ˆ
                     i=1
                                                                                     The Akaike Information Criterion score:
                                                                                     AIC: 1300.3
   > fitted_ps<-fitted.values(sex_age_pres_prod.glm)
   > -2 * sum(fitted_ps * (log(fitted_ps)
                                                                                     And how many iterations of the IRWLS procedure were
     - log(1-fitted_ps)) + log(1-fitted_ps))
                                                                                     performed before convergence (large numbers might cue a
   [1] 1284.251
                                                                                     problem):
   > ys<-as.numeric(choice)-1                                                        Number of Fisher Scoring iterations: 4
   > -2 * length(ys) * (mean_y * (log(mean_y)
     - log(1-mean_y)) + log(1-mean_y))
   [1] 1453.619




The AIC: do it yourself                                                    Comparing models with the deviance test
                                                                                     Let us add a presentation by interaction term:

   > ys<-as.numeric(choice)-1                                                        > interaction.glm<-glm(choice~sex+age+presentation+
                                                                                       product+sex:presentation,family="binomial")
   > fitted_ps<-fitted.values(sex_age_pres_prod.glm)
                                                                                     Are the extra-parameters justified?
   > ll<-sum(log(fitted_ps^ys)
     + log((1-fitted_ps)^(1-ys)))                                                    > anova(sex_age_pres_prod.glm,interaction.glm,
                                                                                       test="Chisq")
                                                                                     ...
   > -2*ll +                                                                           Resid. Df Resid. Dev   Df Deviance P(>|Chi|)
      2*length(coefficients(sex_age_pres_prod.glm))                                  1      1168     1284.25
   [1] 1300.251                                                                      2      1166     1277.68   2     6.57      0.04


                                                                                     Apparently, yes (although summary(interaction.glm)
                                                                                     suggests just a marginal interaction between sex and the
                                                                                     percentage dummy variable)
Comparing models by AIC                                            Error rate
                                                                          The model makes an error when it assigns p > .5 to
                                                                          observation where choice is N or p < .5 to observation
                                                                          where choice is Y:

                                                                          > sum((fitted(interaction.glm)>.5 & choice=="N") |
                                                                               (fitted(interaction.glm)<.5 & choice=="Y")) /
      Again, the model with the interactions looks better:                     length(choice)
                                                                          [1] 0.2712585
      > sex_age_pres_prod.glm$aic
      [1] 1300.251
                                                                          Compare to error rate by baseline model that always
      > interaction.glm$aic                                               guesses the majority choice:
      [1] 1297.682
                                                                          > table(choice)
                                                                          choice
                                                                            N   Y
                                                                          363 813
                                                                          > sum(choice=="N")/length(choice)
                                                                          [1] 0.3086735


                                                                          Improvement in error rate is nothing to write home about. . .

Bootstrap estimation                                               Outline

      Recompute logistic model fitted with lrm(), the logistic
      regression fitting function from the Design package:
      > interaction.glm<-lrm(choice~sex+age                           Logistic regression
        +presentation+product+sex:presentation,
        x=TRUE,y=TRUE)                                                Logistic regression in R
      Validation using the logistic model estimated by lrm() and         Preparing the data and fitting the model
      1,000 iterations:                                                  Practice
      > validate(interaction.glm,B=1000)
      When fed a logistic model, validate() returns various           Further topics
      measures of fit we have not discussed: see, e.g., Baayen
      or Harrell
      Independently of the interpretation of the measures, the
      size of the optimism indices gives a general idea of the
      amount of overfitting (not dramatic in this case)
The Navarrete et al.’s                                                The cwcc.txt data-set
Cumulative Within-Category Cost data
                                                                                 subj A unique id for each of the 20 subjects
      Part of a larger study by Eduardo Navarrete, Brad Mahon                   item The specific objects presented in the pictures
      and Alfonso Caramazza; we just look at one very specific                        (pears, pianos, houses, etc.)
      aspect of their Experiment 1 data                                     category 12 superordinate categories (animals, body parts,
      A well-known and robust effect in picture naming makes                         buildings, etc.)
      subjects slower at naming objects as a (linear) function of            ordpos The ordinal position of the item within its category
      how many objects from the same category they saw before                       in the current block (is this the first, second, . . . ,
      (you are slower at naming the second mammal, slower still                     fifth animal?)
      to name the third, etc.)                                                  block Each subject sees 4 repetitions of the whole
      The question here: does this “cumulative within-category                        stimulus set, presented in different orders
      cost” effect also impact error rate? Are subjects more likely        response The picture naming latency in milliseconds (or
      to mis-label/fail to name an object as a function of the                      error for wrong or anomalous responses)
      numbers of objects in the same category they already                      error Did the subject gave a wrong/anomalous
      saw?                                                                            response? (0: no; 1: yes)



Practice time                                                         Outline




      Using logistic regression, model the probability of making         Logistic regression
      an error in function of category, ordinal position within
      category and block                                                 Logistic regression in R
      Which effects do you find?
      Do they make sense?                                                Further topics
      Is there an interaction, e.g., between ordinal position and
      block?
Further topics
All stuff you can do in R




           More generalized linear models: multinomial, ordinal
           logistic regression, Poisson regression, . . .
           Too many independent variables? Automated selection
           procedures, Ridge/regularized regression, PCA of the
           independent variables. . .
           Item and subject issues? Mixed models with random
           effects
           No independent variable (unlabeled data)? Clustering,
           PCA. . .
           Fancier machine learning: Support Vector Machines. . .

						
Related docs