Methodology A

MRes
Wednesday 11th March 2009

Logistic regression



              Programme
• A short talk.
• Break for coffee.
• A short class exercise.




              Background
• Logistic regression is a special kind of
  regression designed for a specific type of
  situation.
• To understand it, you must be clear about
  some fundamentals of ORDINARY LEAST
  SQUARES (OLS) regression.
• I’ll review those first, before I talk about
  logistic regression itself.
                 A study
• In a study of the effects of media violence,
  children were measured on their Actual
  violence and their Exposure to screen
  violence.
• Here is a scatterplot of Actual violence
  against Exposure.



The scatterplot

• Each point in the plot represents one child.
• The coordinates are the child’s scores on Exposure to and Actual violence.
• A statistical ASSOCIATION between Exposure to and Actual violence is evident from the elliptical shape of the cloud of points.

[Figure: the scatterplot of Actual violence against Exposure]
              Regression

• Regression is a set of statistical
  techniques enabling the researcher to
  exploit an association among variables to
  PREDICT the values of one variable from
  those of others.



           Some key terms
• The variable we are trying to predict or
  account for is the CRITERION, TARGET
  or DEPENDENT VARIABLE (DV).
• The predictors are the INDEPENDENT
  VARIABLES (IVs) or REGRESSORS.
• In our current example, the DV is Actual
  violence and the IV is Exposure to screen
  violence.
The REGRESSION LINE

[Figure: the scatterplot with the regression line drawn through the points]
The regression line

The line has the equation ŷ = b0 + b1x, where b1 is the SLOPE and b0 is the INTERCEPT (the constant).
Filling in the values

[Figure: the regression equation with the estimated slope (.74) and constant filled in]
        Interpretation of slope
       or regression coefficient
• The slope is the average change in the DV that results from a change of one unit on the IV.
• In our example, slope = .74.
• So, an increase of one unit of Exposure produces, on average, an increase of .74 units in Actual violence.


        The ‘best-fitting’ line

• The regression line of Actual violence
  upon Exposure is the uniquely ‘best-fitting’
  line according to what is known as the
  LEAST SQUARES criterion.




                 Residuals
• John scored 9 on Exposure and 8 on
  Actual.
• John’s predicted score from regression ŷ
  is the point on the line above the value 9
  on the x-axis.
• The error in prediction is (y - ŷ ), a quantity
  known as the RESIDUAL score e.
• John’s residual score is shown.

[Figure: John’s residual, the vertical distance of his point from the regression line]

Least squares criterion of goodness-of-fit

e = y − ŷ

Σe² = Σ(y − ŷ)² = a minimum

• The sum of the squares of the residuals (the second formula) is a minimum.
• There is a mathematical solution to the problem of finding values for the slope and the constant that meet the criterion.
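As an aside, the least-squares solution can be computed directly. Here is a minimal Python sketch (the Exposure and Actual scores below are invented for illustration):

    import numpy as np

    # Hypothetical Exposure (x) and Actual violence (y) scores
    x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
    y = np.array([1.5, 3.0, 4.5, 5.0, 8.0])

    # Closed-form least-squares estimates:
    # slope b1 = cov(x, y) / var(x); constant b0 = My - b1 * Mx
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()

    y_hat = b0 + b1 * x      # predicted scores
    e = y - y_hat            # residuals
    print(b1, b0, np.sum(e ** 2))  # no other line gives a smaller sum of squares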
        Ordinary least-squares
          (OLS) regression

• This approach to regression is known as
  ORDINARY LEAST SQUARES (OLS)
  regression.
• There are other kinds of regression (such
  as LOGISTIC REGRESSION, today’s
  topic) that do not work in this way.


     Regression and correlation
• Regression and correlation are two sides of the same
  associative coin.
• The stronger the association, the narrower will be the
  elliptical scatterplot, the higher will be the value of the
  correlation coefficient and the smaller will be the
  residuals from regression.
• THE CORRELATION AND THE REGRESSION
  COEFFICIENT ALWAYS HAVE THE SAME SIGN.
• For fixed values of the variances of x and y, the greater
  the value of r, the steeper will be the slope of the
  regression line, i.e., the greater will be the value of b1.
• The slope of the regression line b1 and r are related
  according to …

Relation between the regression coefficient and the correlation coefficient

b1 = r × (sy / sx)

where sy and sx are the standard deviations of y and x.
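A quick numerical check of this relation in Python (again with invented scores):

    import numpy as np

    x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])   # hypothetical IV scores
    y = np.array([1.5, 3.0, 4.5, 5.0, 8.0])   # hypothetical DV scores

    r = np.corrcoef(x, y)[0, 1]
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

    # The slope equals r * (sy / sx), and always has the same sign as r
    print(b1, r * y.std(ddof=1) / x.std(ddof=1))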
The coefficient of determination (r²)

• The square of the Pearson correlation is known as the COEFFICIENT OF DETERMINATION.
• It is so called because r² is the proportion of the variance of y that is accounted for by regression upon x.
Coefficient of determination

r² = (variance of ŷ) / (variance of y)
  Prediction without regression
• Suppose you know nothing of the association
  between x and y.
• But you are told that the mean of the target
  variable y has a certain value My.
• You are asked to predict values of y for various
  values of x.
• It can be shown that your best strategy is to
  guess the value of My, irrespective of the value
  of x.
• This is termed INTERCEPT-ONLY prediction.

           A baseline model
• In multiple regression and several other related
  techniques, the first step is to formulate a
  baseline model, which takes no account of any
  association among the variables.
• The baseline model is the equivalent of
  guessing the mean every time.
• This is ‘Step 0’ in several SPSS regression and
  modelling routines.
• Step 0 provides a comparison or baseline for the
  evaluation of later models that include one or
  more of the IVs.

            Two or more IVs:
            multiple regression
• We could try to predict a person’s actual
  violence not only from exposure to screen
  violence, but also from additional variables, such
  as number of years of education and other
  characteristics of the parents.
• We should then have to determine the relative
  importance of the various IVs and whether we
  needed to include all of them in the regression
  model.
• These are problems in MULTIPLE
  REGRESSION.
Multiple regression

ŷ = b0 + b1x1 + b2x2 + … + bpxp
  Partial regression coefficients

• In multiple regression, a PARTIAL
  REGRESSION COEFFICIENT is the
  estimated average change in the DV
  resulting from an increase of one unit in
  one particular IV with ALL THE OTHER
  IVs HELD CONSTANT.


             The multiple
       correlation coefficient R
• The MULTIPLE CORRELATION
  COEFFICIENT (R) is the correlation
  between the target variable y and the
  corresponding predictions of y from
  regression ŷ.

• R can never take a negative value.


   Coefficient of determination in
        multiple regression


• In multiple regression, the COEFFICIENT
  OF DETERMINATION is the square of the
  multiple correlation coefficient.




         The case of one IV
• The multiple correlation coefficient is
  defined even in simple regression, where
  there is only one IV.
• Here, remembering that R can never be
  negative, it takes the ABSOLUTE VALUE
  of the Pearson correlation between x and
  y, even when r has a negative value.
• So in SPSS, R is included in the output for
  simple regression.
The coefficient of multiple determination R²

• In multiple regression, the coefficient of determination, the proportion of variance of the target variable y that is accounted for by regression, is R², the square of the multiple correlation coefficient.
        What if the DV is a set of
              categories?
• Simple and multiple OLS regression assume that the DV and IVs consist of measures on an interval scale with units. The term CONTINUOUS VARIABLE is used for this sort of DV.
• But suppose we want to predict whether a
  person will suffer from a heart attack or contract
  a certain illness with known risk factors.
• Here, we are predicting not a VALUE, but
  CATEGORY MEMBERSHIP.

Regression with a categorical DV

 The two most commonly used
 techniques are:


    1.Logistic regression
    2.Discriminant analysis


       Discriminant analysis
• If all (or most) IVs are continuous, you
  might consider using DISCRIMINANT
  ANALYSIS (DA).
• But the DA model makes assumptions
  about the distributions of the IVs (such as
  multivariate normality) which data sets
  often fail to satisfy.
• Moreover, DA doesn’t like qualitative IVs,
  such as sex or nationality.
         Logistic regression

• Logistic regression makes fewer
  assumptions than does discriminant
  analysis.
• Logistic regression, moreover, is happy
  with qualitative IVs; in fact, logistic
  regression is happy even if ALL the IVs
  are qualitative.
        A research question
• It is suspected that smoking and drinking
  are risk factors in the incidence of a pre-
  morbid blood condition, characterised by
  the presence of an antibody.
• Records of the incidence of the condition
  in 100 patients are available, together with
  estimates of the amount they smoke and
  drink.

The data

[Figure: the data set, showing the condition (Yes/No) and the Smoking and Alcohol scores for the 100 patients]
Let’s find out how many of the
 patients have the condition.




Forty-four patients
have the condition




The regression model assumes …

• Either you have the disease or you don’t.
• As smoking and alcohol increase, however, we assume that the probability of developing the condition increases CONTINUOUSLY as a function of the IVs.
• In logistic regression, we estimate the probability of the condition with the LOGISTIC REGRESSION FUNCTION.
• If the estimated probability exceeds a cut-off (usually 0.5), the case is classified by the program as a Yes, rather than a No.
A logistic regression function

[Figure: graph of a logistic regression function, an S-shaped curve rising from 0 to 1]
Logistic regression function

p = e^Z / (1 + e^Z)

where Z is a linear function of the IVs.
                   The odds
• In an EXPERIMENT OF CHANCE (tossing a
  coin, rolling a die) the ODDS in favour of an
  event is the number of ways in which the event
  could occur, divided by the number of ways in
  which it could fail to occur.
• If a die is rolled, there is one way of getting a six and there are five ways of not getting a six.
• The odds in favour of a six are therefore 1 to 5, or 1/5.
    Odds in favour of antibody
• Suppose we know that out of 100 people,
  44 have a certain antibody in their blood.
  We select a person at random from this
  group.
• There are 44 ways of selecting a person
  with the antibody; and 56 ways of
  selecting someone without it.
• The ODDS in favour of the person having
  the antibody are 44 to 56 or 44/56.
         The log odds (logit)
• The odds measure suffers from
  ASYMMETRY OF RANGE.
• Unlikely events have odds between 0 and
  1; likely events can have huge odds.
• The LOG ODDS (LOGIT) is the natural
  logarithm (log to the base e) of the odds.
• Logit = ln(odds) = logₑ(odds).

       When the logit is zero

• Suppose the odds were 50 to 50
  (50/50 =1).
• Since the log of 1 is zero (e0 = 1), a logit
  of zero means that the odds for are equal
  to the odds against.



          Range of the logit

• The logit has a symmetrical range: a
  positive sign means the odds are in
  favour; a negative sign means the odds
  are against.
• The logit has no upper or lower limit: it has
  an unlimited range of values.


                Example
• The odds in favour of a case having the
  antibody are 44/56 = 11/14.
• Logit = ln(11/14) = –.24
• The event is less likely than not, hence the
  negative sign.
• If the odds in favour were 56/44, the logit
  would be ln(56/44) = +.24.
• Notice the symmetry of magnitude about the neutral point: odds of 1 correspond to a logit of zero.
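These logits are easily checked in Python:

    import math

    odds_for = 44 / 56             # odds in favour of the antibody
    odds_against = 56 / 44         # odds against

    print(math.log(odds_for))      # ln(44/56) = -0.24...
    print(math.log(odds_against))  # ln(56/44) = +0.24..., symmetric about zero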
                   Probability
• A probability is a measure of likelihood ranging
  from 0 (an impossibility) to 1 (a certainty).
• The classical definition of probability, like that of
  the odds, also arises in the context of an
  experiment of chance.
• The probability p of an event is the number of
  ways it can happen divided by the TOTAL
  number of outcomes.
• The probability of a six when a die is rolled is
  1/6.

Relationship between the probability and the odds

• A probability and the odds are both measures of likelihood.
• They are related according to the equations:

  odds = p / (1 − p)        p = odds / (1 + odds)
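A minimal sketch of the two conversions in Python (the function names are mine, for illustration):

    def odds_from_p(p):
        """Odds = p / (1 - p)."""
        return p / (1.0 - p)

    def p_from_odds(odds):
        """Probability = odds / (1 + odds)."""
        return odds / (1.0 + odds)

    print(odds_from_p(0.44))     # 0.7857... = 44/56
    print(p_from_odds(44 / 56))  # back to 0.44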
[Figures: a class exercise on logs and antilogs, a graph of the antilog (exponential) function, and the answers]
Odds as antilogs

• A number such as the odds can be written as an ANTILOG, that is, the base e raised to the power of the natural log of the odds (the logit):

  odds = e^ln(odds) = e^logit
The logistic regression function revisited

p = odds / (1 + odds) = e^Z / (1 + e^Z)

where the exponent Z is the logit.
The logit

• The logit is assumed to be a linear function Z of the independent variables:

  Z = b0 + b1X1 + b2X2 + …
Interpretation of a logistic regression coefficient

• The partial regression coefficient b is the increase in the LOG ODDS or LOGIT arising from an increase of one unit in the independent variable.
• The log of a product is the SUM of the logs: ln(new odds) = ln(old odds) + b.
• So the antilog of the partial regression coefficient, e^b, is the factor by which the original odds must be MULTIPLIED to give the new odds when the IV increases by a unit.
Change in the odds

new odds = old odds × e^b
Example

• Suppose that for Smoking, b = 1.1. An increase of one smoking unit (e.g. 10 cigarettes) increases the logit (the log odds) by 1.1.
• So the original odds are MULTIPLIED by e^1.1 = 3.0.
               Summary
• In terms of the ODDS, an increase of one
  unit in the IV MULTIPLIES the original
  odds by the ANTILOG of b, that is, by eb,
  or exp(b).
• Exp(1.1) = 3.0
• So an increase of one smoking unit results
  in the odds being MULTIPLIED by 3, that
  is, the event is THREE times as likely to
  happen.
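Checking the multiplier in Python:

    import math

    b = 1.1                   # coefficient for Smoking, from the example
    factor = math.exp(b)      # antilog of b
    print(factor)             # ~3.0

    old_odds = 44 / 56
    print(old_odds * factor)  # the odds after one more smoking unit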
              The problem
• In the logit equation, we must find values
  of the constant and partial regression
  coefficients such that correct assignment
  to categories is maximised.




      No mathematical solution
• In logistic regression, there is no equivalent of the
  formulae for the intercept and coefficients in OLS
  regression.
• A ‘brute force’ computing algorithm is used whereby,
  starting at arbitrary values of the coefficients, the values
  are progressively adjusted to try to arrive at a set which
  maximises the likelihood of obtaining the observed
  frequencies.
• In a process known as ITERATION, estimates of the
  parameters are calculated again and again in the hope
  that they will ‘converge’ to stable values.
• IT DOESN’T ALWAYS HAPPEN!
• We must therefore check that this ‘convergence’ really
  has been achieved by examining the ITERATION
  HISTORY in the SPSS output.
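By way of illustration only, here is a minimal sketch of the same iterative maximum-likelihood fitting in Python with the statsmodels library (not SPSS), on simulated data; the coefficient values are invented:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 100
    smoking = rng.normal(0.0, 1.0, n)
    alcohol = rng.normal(0.0, 1.0, n)
    z = -1.4 + 2.3 * smoking - 0.1 * alcohol       # hypothetical logit
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-z)))  # observed Yes/No condition

    X = sm.add_constant(np.column_stack([smoking, alcohol]))
    result = sm.Logit(y, X).fit(disp=True)         # disp=True prints the iteration log

    # The analogue of inspecting SPSS's iteration history:
    print(result.mle_retvals["converged"])
    print(result.params)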
         Potential difficulties

• The algorithm will not run successfully if
  the IVs are too highly correlated. This is
  the familiar MULTICOLLINEARITY
  PROBLEM sometimes encountered in
  OLS regression.



                Centring
• As with OLS multiple regression, it is a
  good idea to CENTRE variables, by
  subtracting the mean from each score.
• Centring leaves the correlations among
  the variables unchanged.
• This move makes the algorithm more
  robust to substantial correlations among
  the variables.
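Centring is a one-line operation; a sketch with invented scores:

    import numpy as np

    smoking = np.array([0.0, 5.0, 10.0, 20.0, 30.0])   # hypothetical raw scores
    smoking_c = smoking - smoking.mean()               # centred scores

    print(smoking_c.mean())                            # 0.0: the mean is removed
    print(smoking.std(ddof=1), smoking_c.std(ddof=1))  # the spread is unchanged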
             Covariates

In SPSS logistic regression dialogs, IVs
that are continuous variables are known
as COVARIATES.




 Always ask for the ITERATION
HISTORY, so that you can check
whether the algorithm was able to
   arrive at a stable estimate.



             Dire warning!

• Should the iteration history show failure to
  converge, the results of the analysis can
  be ridiculous!
• The effects of failure to converge are not
  limited to the IV concerned: they can mess
  up the whole analysis!


           Fitting a model

• The goodness-of-fit of a model is
  measured by a log likelihood chi-square
  statistic.




   Step 0 in logistic regression
• We know that 44/100 people have the condition.
• Armed only with this fact, and with no knowledge
  of any associations there might be among the
  variables, we shall maximise our hit rate if we
  predict ABSENCE of the condition for ANY
  person selected at random.
• This, in logistic regression, is the equivalent of
  intercept-only (no-regression) prediction in OLS
  regression: you just guess My, whatever the
  value of x.

Here is the logistic regression output for Step 0

[SPSS output: the Step 0 classification table; predicting No for every case classifies 56% of cases correctly]
The Nagelkerke R² statistic

• The Nagelkerke statistic is the counterpart of the coefficient of determination R² in OLS multiple regression.
• It is a measure of the proportion of the total variation in incidence of the blood condition accounted for by regression.
Classification Table (Step 1) (a)

                                      Predicted
                                 Blood Condition       Percentage
  Observed                       No          Yes        Correct
  Blood Condition    No          51            5          91.1
                     Yes         10           34          77.3
  Overall Percentage                                      85.0

  a. The cut value is .500

A regression model is now applied. The hit rate using the regression model is 85%: obviously much better than the ‘no-regression’ hit rate of 56%.
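The overall hit rate follows directly from the counts in the table:

    # Correct predictions sit on the diagonal of the classification table
    correct = 51 + 34             # true No's plus true Yes's
    total = 51 + 5 + 10 + 34      # all 100 cases
    print(100 * correct / total)  # 85.0, versus the Step 0 baseline of 56%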
The Wald statistic

• The WALD STATISTIC tests a regression coefficient for significance.
• The null hypothesis is that, in the population, the coefficient is zero.
• The Wald statistic is B²/SE² (not B/SE, as Andy Field says on page 224 of his book) and is distributed approximately as chi-square.
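For example, for the Smoking coefficient in the table that follows:

    B, SE = 2.264, 0.513     # coefficient and standard error for Smoking
    wald = B ** 2 / SE ** 2  # the Wald statistic
    print(wald)              # ~19.5, matching the SPSS output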
Variables in the Equation (Step 1) (a)

                  B        S.E.      Wald     df     Sig.    Exp(B)
  Smoking        2.264     .513     19.490     1     .000     9.623
  Alcohol        -.078     .085       .846     1     .358      .925
  Constant      -1.394     .373     13.979     1     .000      .248

  a. Variable(s) entered on step 1: Smoking, Alcohol.

The Exp(B) value for Smoking (9.623) is the antilog of the coefficient of Smoking in the logit equation: increasing Smoking by one unit MULTIPLIES the odds in favour of occurrence by about 10.
The logit equation

Z = b0 + b1X1 + b2X2 = −1.394 + 2.264(Smoking) − .078(Alcohol)
Logistic function

p = e^Z / (1 + e^Z), with Z = −1.394 + 2.264(Smoking) − .078(Alcohol)
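Putting the two equations together, a sketch of the fitted model in Python (the input values are invented):

    import math

    def predicted_probability(smoking, alcohol):
        """p(condition) from the fitted logit equation."""
        z = -1.394 + 2.264 * smoking - 0.078 * alcohol
        return math.exp(z) / (1.0 + math.exp(z))

    # A case is classified as Yes when p exceeds the cut value of .5
    p = predicted_probability(1.0, 1.0)  # one unit of each IV
    print(p, p > 0.5)                    # ~0.69, classified as Yes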
Conclusion

• The incidence of the blood condition is indeed predictable from regression: the model raises the hit rate from 56% to 85%.
• Smoking contributes significantly to the model.
• Alcohol does not contribute significantly to the model.
             The next step

• This session has been merely an
  introduction to the technique of logistic
  regression.
• The next step is to do some further
  reading.



Getting started

• There’s an elementary section on logistic regression in:
  – Kinnear, P., & Gray, C. (2008). SPSS 16 made simple. Hove: Psychology Press. Chapter 14.
• This is mainly a practical, get-started guide, but there is an outline of the rationale of the technique as well.
          Sage paperbacks
• Menard, S. (2002). Applied logistic
  regression analysis (2nd ed.). London:
  Sage.

• Jaccard, J. (2001). Interaction effects in
  logistic regression. London: Sage.



• Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon. Chapter 10.

• Field, A. (2005). Discovering statistics
  using SPSS for Windows: Advanced
  techniques for the beginner (2nd ed.).
  London: Sage. Chapter 6.

								