Analyses of Categorical Dependent Variables

					Analyses Involving Categorical Dependent Variables
When dependent variables are categorical, chi-square analysis is frequently used. Example question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets? The dependent variable is Death: No (0) vs. Yes (1).

Crosstabs
[DataSet0]

So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.

Comments on Chi-square analyses

What's good?
1. The analysis is appropriate. It hasn't been supplanted by something else.
2. The results are usually easy to communicate, especially to lay audiences.
3. A DV with a few more than 2 categories can be easily analyzed.
4. An IV with only a few more than 2 categories can be easily analyzed.

What's bad?
1. Incorporating more than one independent variable is awkward, requiring multiple tables.
2. Certain tests, such as tests of interactions, can't be performed when you have more than one IV.
3. Chi-square analyses can't be done when you have continuous IVs unless you categorize the continuous IVs, which goes against the recommendation NOT to dichotomize continuous variables, because you lose power.

Alternatives to the Chi-square test. We'll focus on dichotomous (two-valued) DVs.
1. Techniques based on linear regression
   a. Multiple Regression. Regress the dichotomous DV onto the continuous IVs.
   b. Discriminant Analysis (equivalent to MR when the DV is dichotomous)

   Problems with regression-based methods when the dependent variable is dichotomous and the independent variable is continuous:
   1. The assumption is that the underlying relationship between Y and X is linear. But when Y has only two values, how can that be?
   2. Y-hat when Y is continuous is a realizable value of Y. But when Y has only two values, what is a Y-hat?
   3. Linear techniques assume that variability about the regression line is homogeneous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.
   4. Residuals won't be normally distributed.
   5. The regression line will extend below 0 in the negative direction and beyond 1 in the positive.
2. Logistic Regression
3. Probit analysis

Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We'll focus on it.

The Logistic Regression Equation

Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1. When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We'll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1. The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1). So we're conceptualizing Y-hat as the probability that Y is 1.

The equation for simple Logistic Regression (analogous to Predicted Y = B0 + B1*X in linear regression) is

P(Y=1) = 1 / (1 + e^-(B0 + B1*X)) = e^(B0 + B1*X) / (e^(B0 + B1*X) + 1)
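To make the formula concrete, here is a minimal Python sketch (purely illustrative B0 and B1 values, not taken from any analysis below) showing that both algebraic forms give the same value and that the result always lies strictly between 0 and 1:

    import math

    def logistic_p(x, b0, b1):
        """P(Y = 1) for simple logistic regression with intercept b0 and slope b1."""
        z = b0 + b1 * x
        return 1.0 / (1.0 + math.exp(-z))   # same as math.exp(z) / (math.exp(z) + 1)

    # Arbitrary illustrative coefficients; every printed probability is between 0 and 1.
    for x in (-3, -1, 0, 1, 3):
        print(x, round(logistic_p(x, b0=0.0, b1=1.0), 3))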

The logistic regression equation defines an S-shaped (Ogive) curve, that rises from 0 to 1. It is never negative and never larger than 1. The curve of the equation . . . B0: B0 is analogous to the linear regression “constant” , i.e., intercept parameter. B0 defines the "height" of the curve. B0 is an elevation parameter. Also called a difficulty parameter in some applications.
[Figure: P(Y) plotted against X from -3.00 to 3.00 for three curves - Prob1: B0 = 0, Prob2: B0 = 1, Prob3: B0 = 2 - showing that larger B0 shifts ("elevates") the curve.]

B1: B1 is analogous to the slope of the linear regression line. B1 defines the “steepness” of the curve. It is sometimes called a discrimination parameter. The larger the value of B1, the “steeper” the curve, the more quickly it goes from 0 to 1.
[Figure: P(Y=1) plotted against X from -3.00 to 3.00 for three curves - Prob4: B1 = 1, Prob5: B1 = 2, Prob6: B1 = 3 - showing that larger B1 makes the curve steeper.]

Note that there is a MAJOR difference between the linear regression and logistic regression curves - - The logistic regression lines asymptote at 0 and 1. They’re bounded by 0 and 1. But the linear regression lines extend below 0 on the left and above 1 on the right. If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.


Example

P(Y) = .09090909   Odds of Y = .09090909/.90909091 = .1:1 or 1:10.   Y is 1/10th as likely to occur as to not occur.
P(Y) = .50         Odds of Y = .5/.5 = 1:1.                          Y is as likely to occur as to not occur.
P(Y) = .8          Odds of Y = .8/.2 = 4:1.                          Y is 4 times more likely to occur than to not occur.
P(Y) = .99         Odds of Y = .99/.01 = 99:1.                       Y is 99 times more likely to occur than to not occur.

Taking the natural log of the odds gives ln[P(Y=1) / (1 - P(Y=1))] = B0 + B1*X. So logistic regression is logistic in probability but linear in the log odds.
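A small sketch of that last point, again with made-up coefficients: the probability column is S-shaped in X, but the log-odds column is a straight line, increasing by exactly B1 for every 1-unit step in X.

    import math

    def logistic_p(x, b0=0.5, b1=1.2):   # arbitrary illustrative coefficients
        return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

    for x in range(-2, 3):
        p = logistic_p(x)
        odds = p / (1 - p)
        # log-odds rises by b1 (= 1.2) with each unit step in x
        print(f"x={x:+d}  P={p:.3f}  odds={odds:.3f}  log-odds={math.log(odds):.3f}")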


Crosstabs and Logistic Regression Applied to the same 2x2 situation
The FFROSH data. The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables – first semester GPA excluding the seminar course and whether a student continued into the 2nd semester. The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and is equal to 1 for students who retained to the immediately following spring semester and 0 for those who did not. The analysis reported here was a serendipitous finding regarding the time at which students register for school. It has been my experience that those students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester. After examining the distribution of the times students registered prior to the first day of class we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG – for EARLy REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150 day value was chosen after inspection of the 1st semester GPA data.) So the analysis that follows examines the relationship of RETAINED to EARLIREG, retention to the 2nd semester to early registration. The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION. First, univariate analyses . . .
GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'. Fre var=retained earlireg.
retained
         Frequency   Percent   Valid Percent   Cumulative Percent
.00            552      11.6            11.6                 11.6
1.00          4201      88.4            88.4                100.0
Total         4753     100.0           100.0

earlireg
         Frequency   Percent   Valid Percent   Cumulative Percent
.00           2316      48.7            48.7                 48.7
1.00          2437      51.3            51.3                100.0
Total         4753     100.0           100.0


crosstabs retained by earlireg /cells=cou col /sta=chisq.

Crosstabs
Case Processing Summary Cases Valid N RETAINED * EARLIREG 4753 Percent 100.0% N 0 Missing Percent .0% N 4753 Total Percent 100.0%

RETAINED * EARLIREG Crosstabulation
                                  EARLIREG = .00   EARLIREG = 1.00     Total
RETAINED .00    Count                        367               185       552
                % within EARLIREG          15.8%              7.6%     11.6%
RETAINED 1.00   Count                       1949              2252      4201
                % within EARLIREG          84.2%             92.4%     88.4%
Total           Count                       2316              2437      4753
                % within EARLIREG         100.0%            100.0%    100.0%

So, 92.4% of those who registered early were retained, compared to 84.2% of those who registered late.

Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square           78.832(b)     1                    .000
Continuity Correction(a)      78.030       1                    .000
Likelihood Ratio              79.937       1                    .000
Fisher's Exact Test                                                                    .000                   .000
Linear-by-Linear Association  78.815       1                    .000
N of Valid Cases                4753
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 268.97.

The same analysis using Logistic Regression
logistic regression retained WITH earlireg.

Logistic Regression
Case Processing Summary
Unweighted Cases(a)                          N     Percent
Selected Cases   Included in Analysis     4753       100.0
                 Missing Cases               0          .0
                 Total                    4753       100.0
Unselected Cases                             0          .0
Total                                     4753       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value   Internal Value
.00                           0
1.00                          1

The display to the left is a valuable check to make sure that your “1” is the same as the Logistic Regression procedure’s “1”.

The Logistic Regression procedure fits the logistic regression model to the data. It estimates the parameters of the logistic regression equation. That equation is

P(Y) = 1 / (1 + e^-(B0 + B1X))

It performs the estimation in two stages. The first stage estimates only B0, so the model fit to the data in the first stage is simply

P(Y) = 1 / (1 + e^-B0)

SPSS labels the various stages of the estimation procedure "Blocks". In Block 0, a model with only B0 is estimated.


Block 0: Beginning Block
Classification Table(a,b) - Step 0
                          Predicted RETAINED
Observed RETAINED          .00      1.00     Percentage Correct
.00                          0       552                     .0
1.00                         0      4201                  100.0
Overall Percentage                                         88.4
a. Constant is included in the model.
b. The cut value is .500

Explanation of the above table: The program computes Y-hat for each case using the logistic regression formula with the estimate of B0. If Y-hat is <= 0.5, that case is recorded as a predicted 0. If Y-hat is > 0.5, the program records that case as a predicted 1. It then creates the above table of the number of actual 1's and 0's vs. predicted 1's and 0's. The prediction equation for Block 0 is Y-hat = 1/(1 + e^-2.030). Recall that B1 is not yet in the equation. This means that Y-hat is a constant, equal to .8839 for each case. (I got this by entering the prediction equation into a calculator.) Since Y-hat for each case is > 0.5, all predictions are 1, which is why the above table has only predicted 1's. Sometimes this table is more useful than it was in this case.
Variables in the Equation B Step 0 Constant 2.030 S.E. .045 Wald 2009.624 df 1 Sig. .000 Exp(B) 7.611

The above box is the Logistic Regression equivalent of the "Coefficients Box" in regular regression analysis. The test statistic is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is (B/SE)^2. So (2.030/.045)^2 = 2,035, which would be 2009.624 if the two coefficients were represented with greater precision. Exp(B) is the odds ratio: more later.
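A quick check of both of these numbers, assuming only the rounded values printed in the output:

    import math

    b0, se = 2.030, 0.045                        # Block 0 constant and its standard error, as printed
    print(round(1 / (1 + math.exp(-b0)), 4))     # about 0.8839, the constant-only Y-hat
    print(round((b0 / se) ** 2, 0))              # about 2035; SPSS's 2009.624 uses the unrounded B and SE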
Variables not in the Equation Score 78.832 78.832 df 1 1 Sig. .000 .000

Step 0

Variables Overall Statistics

EARLIREG

The “Variables not in the Equation” gives information on each independent variable that is not in the equation. Specifically, it tells you whether or not the variable would be “significant” if it were added to the equation. In this case, it’s telling us that EARLIREG would contribute significantly to the equation if it were added to the equation, which is what SPSS does next . . .


Block 1: Method = Enter – Determining whether Y-hat changes with X
Omnibus Tests of Model Coefficients - Step 1
        Chi-square   df   Sig.
Step        79.937    1   .000
Block       79.937    1   .000
Model       79.937    1   .000

Note that the chi-square value is almost the same as the chi-square value from the CROSSTABS analysis.

Whew – three chi-square statistics. “Step”: Ignore for now. “Block”: Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block. Note that the chi-square is identical to the Likelihood ratio chi-square printed in the Chi-square Box in the CROSSTABS output. “Model”: Ignore for now
Model Summary - Step 1
-2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
         3334.212                   .017                  .033

The value under “-2 Log likelihood” is a test of how well the model fit the data in an absolute sense. Values closer to 0 represent better fit. But goodness of fit is complicated by sample size. The R Square values are measures analogous to “percent of variance accounted for”. All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.
Classification Table a Predicted RETAINED Step 1 Observed RETAINED Overall Percentage a. The cut value is .500 .00 .00 1.00 0 0 1.00 552 4201 Percentage Correct .0 100.0 88.4

The above table is the revised version of the table presented in Block 0. Note that since X is a dichotomous variable here, there are only two y-hat values. They are

P(Y) = 1 / (1 + e^-(B0 + B1*0)) = .842 (see below)

and

P(Y) = 1 / (1 + e^-(B0 + B1*1)) = .924 (see below)

As we'll see below, in both cases the y-hat was > .5, so predicted Y in the table was 1 for all cases.

Variables in the Equation - Step 1(a)
             B     S.E.      Wald   df   Sig.   Exp(B)
EARLIREG   .830    .095    75.719    1   .000    2.292
Constant  1.670    .057   861.036    1   .000    5.311
a. Variable(s) entered on step 1: EARLIREG.

The prediction equation is Y-hat = 1 / (1 + e^-(1.670 + .830*EARLIREG)). Since EARLIREG has only two values, those students who registered early will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*1)) = .924. Those who registered late will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*0)) = 1/(1 + e^-1.670) = .842. Since both predicted values are above .5, all the cases were predicted to be retained in the table on the previous page.

Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0. Recall that the odds of 1 are P(Y=1)/(1-P(Y=1)). The odds ratio is

Odds ratio = (Odds when X=1) / (Odds when X=0) = [.924/(1-.924)] / [.842/(1-.842)] = 12.158 / 5.329 = 2.29.
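The same numbers can be reproduced from the printed (rounded) coefficients; a minimal check in Python, where the last digit may differ slightly from SPSS because of rounding:

    import math

    b0, b1 = 1.670, 0.830                        # Constant and EARLIREG coefficients from the output
    p_late  = 1 / (1 + math.exp(-b0))            # EARLIREG = 0
    p_early = 1 / (1 + math.exp(-(b0 + b1)))     # EARLIREG = 1
    odds_ratio = (p_early / (1 - p_early)) / (p_late / (1 - p_late))
    print(round(p_late, 3), round(p_early, 3))                 # 0.842  0.924
    print(round(odds_ratio, 2), round(math.exp(b1), 3))        # about 2.29 either way; Exp(B) = e**B1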

So a person who registered early had odds of being retained that were 2.29 times the odds of a person registering late being retained. Graphical representation of what we’ve just found. The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur.) The curve is analogous to the straight line plot in a regular regression analysis.
[Figure: plot of Y-hat (YHAT) vs. X from -6.00 to 4.00. The two plotted points are the predicted values for the two possible values of EARLIREG (0 and 1); the curve drawn through them is the theoretical logistic relationship over a wider, hypothetical range of X values.]

Discussion
1. When there is only one dichotomous predictor, CROSSTABS and LOGISTIC REGRESSION give the same significance results, although each gives different ancillary information. BUT, as mentioned above . . .
2. CROSSTABS cannot be used to analyze relationships in which the X variable is continuous.
3. CROSSTABS can be used in a rudimentary fashion to analyze relationships between a dichotomous Y and 2 or more categorical X's, but the analysis IS rudimentary and is laborious. No tests of interactions are possible. The analysis involves inspection and comparison of multiple tables.
4. CROSSTABS, of course, cannot be used when there is a mixture of continuous and categorical IV's.
5. LOGISTIC REGRESSION can be used to analyze all the situations mentioned in 2-4 above.
6. So CROSSTABS should be considered for the very simplest situations involving one categorical predictor. But LOGISTIC REGRESSION is the analytic technique of choice when there are two or more categorical predictors and when there are one or more continuous predictors.


Logistic Regression with one Continuous Independent Variable
The data analyzed here represent the relationship of Pancreatitis Diagnosis to measures of Amylase and Lipase. Both Amylase and Lipase levels are tests that can predict the occurrence of Pancreatitis. Generally, it is believed that the larger the value of either, the greater the likelihood of Pancreatitis. The objective here is to determine which alone is the better predictor of the diagnosis and to determine if both are needed. Because the distributions of both predictors were skewed, logarithms of the actual Amylase and Lipase values were used for this handout and for some of the following handouts. This handout illustrates the analysis of the relationship of Pancreatitis diagnosis to only amylase. The name of the dependent variables is PANCGRP. It is 1 if the person is diagnosed with Pancreatitis. It is 0 otherwise. Distributions of logamy and loglip – still somewhat positively skewed.

[Figure: histograms of logamy (Mean = 2.0267, Std. Dev. = 0.50269, N = 306) and loglip (Mean = 2.3851, Std. Dev. = 0.82634, N = 306).]

The logamy and loglip scores are highly positively correlated. For that reason, it may be that once either is in the equation, adding the other won’t significantly increase the fit of the model.

[Figure: scatterplot of loglip vs. logamy, showing the strong positive correlation between the two predictors.]

1. Scatterplots with individual cases.
Relationship of Pancreatitis Diagnosis to log(Amylase)
[Figure: scatterplot of PANCGRP vs. LOGAMY for individual cases. Y values are 0 or 1; X values are continuous. A straight (linear) line of best fit is drawn through the points.]
This graph represents a primary problem with visualizing results when the dependent variable is a dichotomy. It is difficult to see the relationship that may very well be represented by the data. One can see, however, that when log amylase is low, there are more 0’s (no Pancreatitis) and when log amylase is high there are more 1’s (presence of Pancreatitis). The line through the scatterplot is the linear line of best fit. It was easy to generate. It represents the relationship of probability of Pancreatitis to log amylase that would be assumed if a linear regression were conducted. But, the logistic regression analysis assumes that the relationship between probability of Pancreatitis to log amylase is different. The relationship assumed by the logistic regression analysis would be an S-shaped curve, called an ogive. Below are the same data, this time with the line of best fit generated by the logistic regression analysis through it. While neither line fits the observed points well in the middle, it’s easy to see that the logistic line fits better at small and at large values of log amylase.
[Figure: the same scatterplot of Pancreatitis Diagnosis (PANCGRP) vs. LOGAMY, this time with the predicted-probability curve from the logistic regression drawn through the points.]

2. Grouping cases to show a relationship when the DV is a dichotomy. The plots above were plots of individual cases. Each point represented the DV value of a case (0 or 1) vs. that person's IV value (log amylase value). The problem was that the plot didn't really show the relationship because the DV could take on only two values. When the DV is a dichotomy, it may be profitable to form groups of cases with similar IV values and plot the proportion of 1's within each group vs. the IV value for that group. To illustrate this, groups were formed for every .2 increase in log amylase. That is, the values 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, and 3.8 were used as group midpoints. Each case was assigned to a group based on how close that case's log amylase value was to the group midpoint. So, for example, all cases between 1.5 and 1.7 were assigned to the 1.6 group. Then the proportion of 1's within each group was computed. The figure below is a plot of the proportion of 1's within each group vs. the group midpoints. Note that the points form a curve, quite a bit like the ogival form from the logistic regression analysis shown on the previous page.
[Figure: plot of PROBPANC, the proportion of Pancreatitis diagnoses within each group, vs. LOGAMYGP, the group midpoints.]

Note that the plot of proportions of Pancreatitis diagnoses within groups is not linear. The proportions increase in an ogival (S-shaped) fashion, with asymptotes at 0 and 1. This, of course, is a violation of the linear relationship which linear regression analysis assumes.

The plot of proportions above suggests that the S-shaped curve of the logistic regression model may better represent the increase in probability of Pancreatitis than the straight line curve of the linear regression model. The analyses that follow illustrate the application of both analyses to the data.
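Before moving to those analyses, here is a minimal Python sketch of the grouping step just described. The cases list is hypothetical stand-in data, not the actual file; only the rounding-to-the-nearest-.2 logic mirrors what was done above.

    from collections import defaultdict

    # Hypothetical (log amylase, diagnosis) pairs standing in for the real cases.
    cases = [(1.55, 0), (1.63, 0), (1.98, 0), (2.07, 1), (2.42, 0), (2.55, 1), (3.11, 1)]

    groups = defaultdict(list)
    for logamy, y in cases:
        midpoint = round(round(logamy / 0.2) * 0.2, 1)   # nearest multiple of .2, e.g. 1.5-1.7 -> 1.6
        groups[midpoint].append(y)

    for midpoint in sorted(groups):
        ys = groups[midpoint]
        print(midpoint, sum(ys) / len(ys))               # proportion of 1's within the group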


3. Linear Regression analysis of the logamy data
REGRESSION /MISSING LISTWISE /STATISTICS COEFF OUTS R ANOVA /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT pancdiag /METHOD=ENTER logamy /SCATTERPLOT=(*ZRESID ,*ZPRED ) /RESIDUALS HIST(ZRESID) NORM(ZRESID) .

Regression
Variables Entered/Removed(b)
Model 1   Variables Entered: LOGAMY(a)   Variables Removed: (none)   Method: Enter

Model Summary(b)
Model 1   R = .755(a)   R Square = .570   Adjusted R Square = .568   Std. Error of the Estimate = .2569

ANOVA(b)
Model 1       Sum of Squares    df   Mean Square         F     Sig.
Regression            22.230     1        22.230   336.706   .000(a)
Residual              16.770   254     6.602E-02
Total                 39.000   255

a. Predictors: (Constant), LOGAMY
b. Dependent Variable: PANCGRP

The linear relationship of pancdiag to logamy is strong. But as we'll see, the logistic relationship is even stronger.

Coefficients(a)
             Unstandardized B   Std. Error   Standardized Beta         t    Sig.
(Constant)            -1.043         .069                         -15.125   .000
LOGAMY                  .635         .035                .755      18.350   .000
a. Dependent Variable: PANCGRP

Thus, the predicted linear relationship of probability of Pancreatitis to log amylase is Predicted probability of Pancreatitis = -1.043 + 0.635 * log amylase.


The following are the usual linear regression diagnostics.
Casewise Diagnostics(a)
Case Number   Std. Residual   PANCGRP
54                    3.016      1.00
77                    3.343      1.00
85                    3.419      1.00
97                    3.218      1.00
a. Dependent Variable: PANCGRP

Nothing particularly unusual here.

Residuals Statistics(a)
                         Minimum   Maximum          Mean   Std. Deviation     N
Predicted Value           -.1044    1.4256         .1875            .2953   256
Residual                  -.5998     .8786   -1.3848E-16            .2564   256
Std. Predicted Value       -.989     4.193          .000           1.000    256
Std. Residual             -2.334     3.419          .000            .998    256
a. Dependent Variable: PANCGRP

Or here.

[Figure: histogram of the regression standardized residuals, Dependent Variable: PANCGRP (Mean = 0.00, Std. Dev = 1.00, N = 256.00).]

The histogram of residuals is not particularly unusual.

[Figure: Normal P-P plot of the standardized residuals (expected cumulative probability vs. observed cumulative probability).]

Although there is a clear bend from the expected linear line, this is not particularly diagnostic.


[Figure: scatterplot of regression standardized residuals vs. regression standardized predicted values, Dependent Variable: PANCGRP.]

This is an indicator that there is something amiss. The plot of residuals vs. predicted values is supposed to form a classic 0-correlation scatterplot, with no unusual shape. This is clearly unusual.

Computation of y-hats for the groups. I had SPSS compute the Y-hat for each of the group mid-points discussed on page 3. I then plotted both the observed group proportion of 1’s that was shown on the previous page and the Y-hat for each group. Of course, the Y-hats are in a linear relationship with log amylase. Note that the solid points don’t really represent the relationship shown by the open symbols. Note also that the solid points extend above 1 and below 0. But the observed proportions are bound by 1 and 0.
compute mrgpyhat = -1.043 + .635*logamygp.
execute.
GRAPH
  /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc mrgpyhat (PAIR)
  /MISSING=LISTWISE .

Graph

[Figure: overlay scatterplot of MRGPYHAT and PROBPANC vs. LOGAMYGP. The MRGPYHAT points are the predicted proportions of Pancreatitis diagnosis within groups from the linear regression; note that these predictions extend below 0 and above 1. The PROBPANC points are the observed proportions of Pancreatitis diagnoses within groups.]

4. Logistic Regression Analysis of logamy data
logistic regression pancdiag with logamy.

Logistic Regression
Case Processing Summary
Unweighted Cases(a)                          N     Percent
Selected Cases   Included in Analysis      256        83.7
                 Missing Cases              50        16.3
                 Total                     306       100.0
Unselected Cases                             0          .0
Total                                      306       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value            Internal Value
.00 No Pancreatitis                    0
1.00 Pancreatitis                      1

SPSS's Logistic Regression procedure always performs the analysis in at least two steps, which it calls Blocks. Recall that the logistic prediction formula is P(Y) = 1 / (1 + e^-(B0 + B1X)). In the first block, labeled Block 0, only B0 is entered into the equation. In this B0-only equation, it is assumed that the probability of a 1 is a constant, equal to the overall proportion of 1's for the whole sample. Obviously this model will generally be incorrect, since typically we'll be working with data in which the probability of a 1 increases as the IV increases. But this model serves as a useful baseline against which to assess subsequent models, all of which do assume that the probability of a 1 increases as the IV increases.


For each block the Logistic Regression procedure automatically prints a 2x2 table of predicted and observed 1's and 0's. For all of these tables, a case is classified as a predicted 1 if its Y-hat (predicted probability) exceeds 0.5; otherwise it is classified as a predicted 0. Since only the constant is estimated here, the predicted probability for every case is simply the proportion of 1's in the sample, which is 48/256 = 0.1875. Since that's less than 0.5, every case is predicted to be a 0 for this constant-only model. A case is classified as a Predicted 0 if the y-hat for that case is less than or equal to .5. A case is classified as a Predicted 1 if the y-hat for that case is larger than .5.
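A small sketch of why the constant-only Y-hat equals the sample proportion: the estimate of B0 is simply the log odds (logit) of that proportion, so running the proportion through the logistic formula recovers it (the constant itself appears in the Block 0 output below).

    import math

    p = 48 / 256                                 # proportion of Pancreatitis cases in this sample
    b0 = math.log(p / (1 - p))                   # about -1.466, the Block 0 constant below
    print(round(b0, 3), round(1 / (1 + math.exp(-b0)), 4))   # -1.466  0.1875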

Block 0: Beginning Block

Classification Table(a,b) - Step 0
                                  Predicted No Pancreatitis   Predicted Pancreatitis   Percentage Correct
Observed No Pancreatitis                               208                        0                100.0
Observed Pancreatitis                                   48                        0                   .0
Overall Percentage                                                                                  81.3
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation B -1.466 S.E. .160 Wald 83.852 df 1 Sig. .000 Exp(B) .231

Step 0

Constant

The test that is recommended is the Wald test. The p-value of .000 says that the value of B0 is significantly different from 0. The predicted probability of a 1 here is

P(1) = 1 / (1 + e^-(-1.466)) = 1 / (1 + 4.332) = 1 / 5.332 = 0.1875, the observed proportion of 1's.

Variables not in the Equation Score 145.884 145.884 df 1 1 Sig. .000 .000

Step 0

Variables Overall Statis tics

LOGAMY

The “Variables not in the Equation” box says that if log amylase were added to the equation, it would be significant.


Block 1: Method = Enter In this block, log amylase is added to the equation.
Omnibus Tests of Model Coefficients - Step 1
        Chi-square   df   Sig.
Step       151.643    1   .000
Block      151.643    1   .000
Model      151.643    1   .000

Step: The procedure can perform stepwise regression from a set of covariates. The Chi-square step tests the significance of the increase in fit of the current set of covariates vs. those in the previous set. Block: The significance of the increase in fit of the current model vs. the last Block. We’ll focus on this. Model: Tests the significance of the increase in fit of the current model vs. the “B0 only” model.
Model Summary - Step 1
-2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
           95.436                   .447                  .722

In the following classification table, for each case, the predicted probability of 1 is evaluated and compared with 0.5. If that probability is > 0.5, the case is a predicted 1, otherwise it’s a predicted 0.
Classification Table(a) - Step 1
                                  Predicted No Pancreatitis   Predicted Pancreatitis   Percentage Correct
Observed No Pancreatitis                               200                        8        96.2  (specificity)
Observed Pancreatitis                                   14                       34        70.8  (sensitivity, power)
Overall Percentage                                                                         91.4
a. The cut value is .500

Specificity: Proportion of Y=0 cases that test labels as 0. (Percentage of correct predictions of people who don’t have the disease.) Sensitivity: Proportion of Y=1 cases that test labels as 1. (Percentage of correct predictions of people who did have the disease.)
Variables in the Equation - Step 1(a)
               B    S.E.     Wald   df   Sig.    Exp(B)
LOGAMY       6.898  1.017   45.972   1   .000   990.114
Constant   -16.020  2.227   51.744   1   .000      .000
a. Variable(s) entered on step 1: LOGAMY.

Analogous to “Coefficients” box in Regression

This is the equation: y-hat = 1 / (1 + e^-(-16.0203 + 6.8978*log amylase))
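Using those printed coefficients, a short Python sketch of the fitted curve; note where the predicted probability crosses .5:

    import math

    b0, b1 = -16.0203, 6.8978                   # coefficients from the Variables in the Equation box
    yhat = lambda logamy: 1 / (1 + math.exp(-(b0 + b1 * logamy)))

    for logamy in (1.5, 2.0, 2.5, 3.0):
        print(logamy, round(yhat(logamy), 3))
    # P(Y=1) crosses .5 where B0 + B1*logamy = 0, i.e. at logamy = -b0/b1, about 2.32.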

5. Computing Predicted proportions for the groups defined on page 3.
To show that the relationship assumed by the logistic regression analysis is a better representation of the relationship than the linear, I computed probability of 1 for each of the group midpoints from page 3. The figure below is a plot of those probabilities and the observed proportion of 1’s vs. the group midpoints. Compare this figure with that on page 6 to see how much better the logistic regression relationship fits the data than does the linear relationship.
compute lrgpyhat = 1/(1+exp(-(-16.0203 + 6.8978*logamygp))).

GRAPH
  /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc lrgpyhat (PAIR)
  /MISSING=LISTWISE .

Graph

[Figure: overlay scatterplot of LRGPYHAT (predicted proportions from the logistic regression) and PROBPANC (observed proportions) vs. LOGAMYGP.]

Compare this graph with the one immediately above. Note that the predicted proportions correspond much more closely to the observed proportions here.


6. Another way of comparing predicted vs. observed. I computed residuals for all cases. Recall that a residual is Y – Y-hat. For these data, Y’s were either 1 or 0. Y-hats are probabilities. First, I computed Y-hats for all cases, using both the linear equation and the logistic equation. .
compute mryhat = -1.043 + .635*logamy. compute lryhat = 1/(1+exp(-(-16.0203 + 6.8978*logamy))).

Now residuals are computed .
compute mrresid = pancdiag - mryhat. compute lrresid = pancdiag - lryhat.

frequencies variables = mrresid lrresid /histogram /format=notable.

Frequencies
Histogram

[Figure: histogram of MRRESID, the residuals from the linear regression (Mean = .00, Std. Dev = .26, N = 256.00).]

This is the distribution of residuals for the linear multiple regression. It's like the plot on page 3, except these are actual residuals, not Z's of residuals. Note that there are many large residuals - large negative and large positive.

[Figure: scatterplot of PANCGRP vs. LOGAMY with the linear line of best fit; points above the line have positive residuals, points below have negative residuals.]

The residuals above are simply distances of the observed points from the best fitting line, in this case a straight line.


Histogram

[Figure: histogram of LRRESID, the residuals from the logistic regression (Mean = .00, Std. Dev = .24, N = 256.00).]

This is the distribution of residuals for the logistic regression. Note that most of them are virtually 0.

[Figure: scatterplot of PANCGRP and the logistic predicted values vs. LOGAMY.]

The residuals above are simply distances of the observed points from the best fitting line, in this case a logistic line. The points which are circled are those with near-0 residuals.

What these two sets of figures show is that the vast majority of residuals from the logistic regression analysis were virtually 0, while for the linear regression, there were many residuals that were substantially different from 0. So the logistic regression analysis has modeled the Y’s better than the linear regression.


Logistic Regression - Logamy revisited: Focus on the Logistic Regression Output
logistic regression variables = pancgrp with logamy.

Logistic Regression
Case Processing Summary
Unweighted Cases(a)                          N     Percent
Selected Cases   Included in Analysis      256        83.7
                 Missing Cases              50        16.3
                 Total                     306       100.0
Unselected Cases                             0          .0
Total                                      306       100.0
a. If weight is in effect, see classification table for the total number of cases.

All cases have to have valid values of both the dependent variable and the independent variable to be included in the analysis. Some 50 cases had missing values of either one or the other, leaving only 256 valid cases for the analysis.

Dependent Variable Encoding
Original Value            Internal Value
.00 No Pancreatitis                    0
1.00 Pancreatitis                      1

Be sure to make certain that your “0” and “1” are the same as logistic regression’s “0” and “1”.


Block 0: Beginning Block
Classification Table(a,b) - Step 0
                                  Predicted No Pancreatitis   Predicted Pancreatitis   Percentage Correct
Observed No Pancreatitis                               208                        0       100.0  (specificity)
Observed Pancreatitis                                   48                        0          .0  (sensitivity)
Overall Percentage                                                                         81.3
a. Constant is included in the model.
b. The cut value is .500

Specificity: The ability to identify cases that don't have the disease. Sensitivity: The ability to identify cases that do have the disease.

A classification table is produced for each model tested. In this case, the model contained only the constant, B0. (See "Variables in the Equation" on the next page.) The predicted Y is

P(Y) = 1 / (1 + e^-B0)

For these data, B0 = -1.4663 (see below), so P(Y) = 1 / (1 + e^-(-1.4663)) = 1 / (1 + 4.333) = .1875.

Each case for which P(Y) <= .5 is predicted to be 0. Each case for which P(Y) > .5 is predicted to be 1. When the constant is the only parameter estimated, all cases have the same P(Y), .1875 in this instance.

Variables in the Equation B -1.466 S.E. .160 Wald 83.852 df 1 Sig. .000 Exp(B) .231

Step 0

Constant

The Wald statistic is (B/SE)^2. It tests the null hypothesis that the coefficient (B0 in this case) is 0 in the population. That null is rejected here.
Variables not in the Equation Score 145.884 145.884 df 1 1 Sig. .000 .000

Step 0

Variables Overall Statis tics

LOGAMY


Block 1: Method = Enter
Omnibus Tests of Model Coefficients - Step 1
        Chi-square   df   Sig.
Step       151.643    1   .000
Block      151.643    1   .000
Model      151.643    1   .000

Each chi-square tests the significance of the increase in your ability to predict the dependent variable.

The Step Chi-Square tests the significance of improvement (or decrement) in fit over the immediately previous model. It is applicable when stepwise entry of independent variables within a block has been specified. It will be printed after each variable is entered or removed. Again, larger is better. The Block Chi-Square tests the significance of improvement (or decrement) in fit over the model specified in the previous block of independent variables, if there was one. It is only applicable when two or more blocks of independent variables have been specified. Again, larger is better. It's analogous to the F-change statistic in linear regression. The Model Chi-Square statistic tests the significance of the improvement in fit of the current model over a model containing just the constant, B0. For this chi-square, larger is better. It is analogous to the overall F statistic in linear regression output.
Model Summary - Step 1
-2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
           95.436                   .447                  .722

-2 Log Likelihood is a goodness of fit measure (0 is best) computed using a particular set of assumptions. The R Square measures are analogous to R-square in regular regression. Each is computed using a different set of assumptions, which accounts for the difference in their values.
Classification Table(a) - Step 1
                                  Predicted No Pancreatitis   Predicted Pancreatitis   Percentage Correct
Observed No Pancreatitis                               200                        8        96.2  (specificity)
Observed Pancreatitis                                   14                       34        70.8  (sensitivity)
Overall Percentage                                                                         91.4
a. The cut value is .500

In this classification table, since every case potentially had a different value of logamy, a unique Y-hat was generated for each case. If Y-hat was <= .5, a prediction of 0 was recorded. If Y-hat was > .5, a prediction of 1 was recorded. Note the increase in % of correct classifications over the "constant only" model above.


The Wald statistic is (Bi / SE of Bi)^2.
Variables in the Equation - Step 1(a)
               B    S.E.     Wald   df   Sig.    Exp(B)
LOGAMY       6.898  1.017   45.972   1   .000   990.114
Constant   -16.020  2.227   51.744   1   .000      .000
a. Variable(s) entered on step 1: LOGAMY.

Exp(B) is the ratio of the odds when the independent variable increases by 1, where the odds that Y=1 are P(Y=1) / (1 - P(Y=1)). So when logamy increases by 1, the odds of Pancreatitis are 990.114 times greater.
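A sketch of that interpretation with the rounded coefficients: the odds ratio for a 1-unit increase in logamy is exp(B1), no matter where the increase starts (small differences from the printed 990.114 come from rounding B).

    import math

    b0, b1 = -16.020, 6.898                  # rounded coefficients from the output above
    odds = lambda x: math.exp(b0 + b1 * x)   # odds of Y = 1, since the log-odds equal b0 + b1*x

    for x in (2.0, 2.5, 3.0):
        print(round(odds(x + 1) / odds(x), 1))   # about 990 regardless of the starting x
    print(round(math.exp(b1), 1))                # exp(B1), essentially the printed Exp(B)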


Logistic Regression: Two Continuous predictors
LOGISTIC REGRESSION VAR=pancgrp /METHOD=ENTER logamy loglip /CLASSPLOT /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

[DataSet3] G:\MdbT\InClassDatasets\amylip.sav Logistic Regression
Case Processing Summary Unweighted Cases Selected Cases
a

N 256 50 306 0 306

Percent 83.7 16.3 100.0 .0 100.0

Included in Analysis Missing Cases Total

Unselected Cases Total

a. If weight is in effect, see classification table for the total number of cases.
Dependent Variable Encoding Original Value .00 No Pancreatitis 1.00 Pancreatitis Internal Value 0 1

Block 0: Beginning Block
Classification Table
a,b

Predicted Pancreatitis Diag nosis (DV) Step 0 Observed Pancreatitis Diag nosis (DV) Overall Percentage a. Constant is included in the model. b. The cut value is .500
Variables in the Equation B Step 0 Constan t -1.466 S.E. .160 Wald 83.852 df 1 Sig. .000 Exp(B) .231

No Pancreatitis Pancreatitis

No Pancreatitis 208 48

Pancreatitis 0 0

Percentage Correct 100.0 .0 81.3

Based on the equation with only the constant, B0.
Variables not in the Equation Score 145.884 161.169 165.256 df 1 1 2 Sig. .000 .000 .000

Step 0

Variables Overall Statistics

LOGAMY LOGLIP

Each p-value tells you whether or not the variable would be significant if entered BY ITSELF. That is, each of the above p-values should be interpreted on the assumption that only 1 of the variables would be entered.

Block 1: Method = Enter
Omnibus Tests of Model Coefficients Chi-sq uare 170.852 170.852 170.852 df 2 2 2 Sig. .000 .000 .000

Step 1

Step Block Model

Model Summary Cox & Snell R Square .487 Nagelk erke R Square .787

Step 1

-2 Log likelihood 76.228

Classification Table a Predicted Pancreatitis Diag nosis (DV) Step 1 Observed Pancreatitis Diag nosis (DV) Overall Percentage a. The cut value is .500 No Pancreatitis Pancreatitis No Pancreatitis 204 10 Pancreatitis 4 38 Percentage Correct 98.1 79.2 94.5

Specificity Sensitivity

Variables in the Equation - Step 1(a)
               B    S.E.     Wald   df   Sig.   Exp(B)
LOGAMY       2.659  1.418    3.518   1   .061   14.286
LOGLIP       2.998   .844   12.628   1   .000   20.043
Constant   -14.573  2.251   41.907   1   .000     .000
a. Variable(s) entered on step 1: LOGAMY, LOGLIP.

Interpretation of the coefficients . . . B: Not easily interpretable. Expected increase in log odds for a one-unit increase in IV. SE: Standard error of the estimate of Bi. Wald: Test statistic. Sig: p-value associated with test statistic. Note that LOGAMY does NOT (officially) add significantly to prediction over and above the prediction afforded by LOGLIP. Exp(B): Odds ratio for a one-unit increase in IV among persons equal on the other IV. Person one unit higher on IV will have Exp(B) greater odds of having Y=1.
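A sketch of the Exp(B) interpretation with the printed coefficients: among persons equal on LOGAMY, a 1-unit increase in LOGLIP multiplies the odds by exp(B for LOGLIP), and the starting values do not matter.

    import math

    b0, b_amy, b_lip = -14.573, 2.659, 2.998    # coefficients from the two-predictor output
    odds = lambda amy, lip: math.exp(b0 + b_amy * amy + b_lip * lip)

    # Holding logamy fixed, a 1-unit increase in loglip multiplies the odds by exp(b_lip):
    print(round(odds(2.0, 3.0) / odds(2.0, 2.0), 2))   # about 20, the printed Exp(B) for LOGLIP up to rounding
    print(round(math.exp(b_lip), 2))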


Classification Plots
Step number: 1 Observed Groups and Predicted Probabilities 80 ┼ ┼ │N │ │N │ F │N │ R 60 ┼N ┼ E │N │ Q │N │ U │NN │ E 40 ┼NN ┼ N │NN │ C │NNN │ Y │NNN │ 20 ┼NNN ┼ │NNN P│ │NNN NN P│ │NNNNNNNNNNNP N P PP PP│ Predicted ─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────── Prob: 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 Group: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP Predicted Probability is of Membership for Pancreatitis The Cut Value is .50 Symbols: N - No Pancreatitis P - Pancreatitis Each Symbol Represents 5 Cases.

The plot above is misleading because many cases are not represented in it. Only those cases which happened to be close enough to other cases that a full group could be formed are represented. So, for example, those relatively few cases whose y-hats were close to .5 are not seen in the above plot, because there were not enough of them to make a group. Here's the same information obtained as dot plots of Y-hats with PANCGRP as a Row Panel Variable. For the most part, the patients who did not get Pancreatitis had small predicted probabilities while the patients who did get it had high predicted probabilities, as you would expect. There were, however, a few patients who did get it who had small values of Y-hat. Those patients are dragging down the sensitivity of the test. Note that these patients don't show up on the CASEPLOT produced by the LOGISTIC REGRESSION procedure.


Here’s another equivalent representation of what the authors of the program were trying to show. The Ns on the left represent the distribution of predicted probabilities for those who didn’t have pancreatitis. The Ps on the right are the distribution of predicted probabilities for those who did have it. The programmers took both distributions and combined them into one, using different symbols to represent the two groups.

[Figure: histogram of predicted probabilities for pancgrp = No Pancreatitis (Mean = 0.0515, Std. Dev. = 0.1242, N = 208) and for pancgrp = Pancreatitis (Mean = 0.7767, Std. Dev. = 0.3312, N = 48).]


Visualizing the equation with two predictors

With one predictor, a simple scatterplot of YHATs vs. X will show the relationship between Y and X implied by the model. For two-predictor models, a 3-D scatterplot is required. Here's how the graph below was produced: Graphs -> Interactive -> Scatterplot. . .

[Figure: 3-D scatterplot of YHAT (vertical axis) vs. LOGLIP and LOGAMY from the logistic model.]

The graph shows the general ogival relationship of YHAT on the vertical to LOGLIP and LOGAMY. But the relationships really aren't apparent until the graph is rotated. Don't ask me to demonstrate rotation. SPSS now does not offer the ability to rotate the graph interactively. It used to offer such a capability, but it's been removed.


The same graph, but with the Linear Regression Y-hats plotted vs. loglip and logamy.

[Figure: 3-D scatterplot of the linear regression Y-hats vs. LOGLIP and LOGAMY.]


Representing Relationships with a Table –
compute logamygp2 = rnd(logamy/.5)*.5.
(The COMPUTE above rounds logamy to the nearest .5.)

logamygp2
         Frequency   Percent   Valid Percent   Cumulative Percent
1.50           123      40.2            40.2                 40.2
2.00           105      34.3            34.3                 74.5
2.50            46      15.0            15.0                 89.5
3.00            21       6.9             6.9                 96.4
3.50            10       3.3             3.3                 99.7
4.00             1        .3              .3                100.0
Total          306     100.0           100.0

LOGAMY and LOGLIP groups were created by rounding values of LOGAMY and LOGLIP to the nearest .5.

compute loglipgp2 = rnd(loglip/.5)*.5.
loglipgp2
         Frequency   Percent   Valid Percent   Cumulative Percent
.50              1        .3              .3                   .3
1.00             6       2.0             2.0                  2.3
1.50            45      14.7            14.7                 17.0
2.00           125      40.8            40.8                 57.8
2.50            49      16.0            16.0                 73.9
3.00            30       9.8             9.8                 83.7
3.50            20       6.5             6.5                 90.2
4.00            20       6.5             6.5                 96.7
4.50             8       2.6             2.6                 99.3
5.00             2        .7              .7                100.0
Total          306     100.0           100.0

Here’s the loglipase grouping.

means pancgrp yhatamylip by logamygp2 by loglipgp2.

Here’s the top of a very long two way table of mean Y-hat values for each combination of logamylase group and loglipase group. Below, this table is “prettified”.


The above MEANS output, put into a 2-way table in Word. The entry in each cell is the expected probability of contracting pancreatitis at the combination of logamylase and loglipase represented by the cell.

[Table: mean predicted probability of Pancreatitis for each combination of LOGAMY (rows, 1.5 to 4.0) and LOGLIP (columns, .5 to 4.5). The probabilities rise from essentially .00 at low values of both predictors (lower left) to 1.00 at high values of both (upper right); for example, at LOGAMY = 2.5 the probability is .03 when LOGLIP = 1.5 but .97 when LOGLIP = 4.0. Cells for combinations that did not occur are empty.]

This table shows the joint relationship of predicted Y to LOGAMY and LOGLIP. Move from the lower left of the table to the upper right.

It also shows the partial relationships of each.

Partial relationship of YHAT to LOGLIP - Move across any row. So, for example, if your logamylase were 2.5, your chances of having pancreatitis would be only .03 if your loglipase were 1.5. But at the same 2.5 value of logamylase, your chances would be .97 if your loglipase value were 4.0.

Partial relationship of YHAT to LOGAMY - Move up any column.

Empty cells show that there are certain combinations of LOGAMY and LOGLIP that are very unlikely.


Logistic Regression 3: One Categorical IV with 3 categories
The data here are the FFROSH data – freshmen from 1987-1992. The dependent variable is RETAINED – whether a student went directly to the 2nd semester. The independent variable is NRACE – the ethnic group recorded for the student. It has three values: 1: White; 2: African American 3: Oriental

Indicator coding is dummy coding. Here, Category 1 (White) is used as the reference category.
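A minimal sketch of what Indicator(1) coding does here, mirroring the Categorical Variables Codings box in the output below (the coding itself is produced by SPSS; this just spells out the 0/1 pattern):

    # Indicator (dummy) coding with category 1 (WHITE) as the reference group.
    codes = {"WHITE": (0, 0), "BLACK": (1, 0), "ORIENTAL": (0, 1)}   # (nrace(1), nrace(2))
    for group, (d1, d2) in codes.items():
        print(f"{group:8s}  nrace(1)={d1}  nrace(2)={d2}")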


LOGISTIC REGRESSION retained /METHOD = ENTER nrace /CONTRAST (nrace)=Indicator(1) /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

Logistic Regression
Case Processing Summary Unweighted Cases Selected Cases
a

N 4697 56 4753 0 4753

Included in Analysis Missing Cases Total

Percent 98.8 1.2 100.0 .0 100.0

Unselected Cases Total

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding Original Value .00 1.00 Internal Value 0 1

Categorical Variables Codings
nrace NUMBERIC WHITE/BLACK/ORIENTAL RACE CODE   Frequency   Parameter coding (1)    (2)
1.00 WHITE                                           3987                   .000   .000
2.00 BLACK                                            626                  1.000   .000
3.00 ORIENTAL                                          84                   .000  1.000

Block 0: Beginning Block
Classification Table a,b Predicted retained Step 0 Observed retained Overall Percentage a. Constant is included in the model. b. The cut value is .500 .00 .00 1.00 0 0 1.00 545 4152 Percentage Correct .0 100.0 88.4

SPSS’s coding of the independent variable here is important. Note that Whites are the 0,0 group. The first group coding variable compares Blacks with Whites. The 2nd compares Orientals with Whites.

Variables in the Equation B Step 0 Constant 2.031 S.E. .046 Wald 1986.391 df 1 Sig. .000 Exp(B) 7.618

Variables not in the Equation Score 6.680 2.433 3.903 6.680 df 2 1 1 2 Sig. .035 .119 .048 .035

Step 0

Variables

nrace nrace(1) nrace(2)

Overall Statistics

SPSS first prints p-value information for the collection of group coding variables representing the categorical factor. Then it prints p-value information for each GCV separately. None of this information should be taken literally when categorical variables are being analyzed.

Block 1: Method = Enter
Omnibus Tests of Model Coefficients Chi-sq uare 7.748 7.748 7.748 df 2 2 2 Sig. .021 .021 .021

Step 1

Step Block Model

Model Summary Cox & Snell R Square .002 Nagelkerke R Square .003

Step 1

-2 Log likelihood 3364.160a

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
a

Classification Table

Predicted retained Step 1 Observed retained Overall Percentage a. The cut value is .500 .00 .00 1.00 0 0 1.00 545 4152 Percentage Correct .0 100.0 88.4

Variables in the Equation - Step 1(a)
             B     S.E.       Wald   df   Sig.   Exp(B)
nrace                        6.368    2   .041
nrace(1)    .237   .143      2.741    1   .098    1.268
nrace(2)   1.007   .515      3.829    1   .050    2.737
Constant   1.989   .049   1669.869    1   .000    7.306
a. Variable(s) entered on step 1: nrace.

So the bottom line is that
0) There are significant differences in likelihood of retention to the 2nd semester between the groups (p = .041).
1) Blacks are not significantly more likely to be retained than Whites, although the difference approaches significance (p = .098).
2) Orientals are significantly more likely to be retained than Whites (p = .050).


Logistic Regression: Three Continuous predictors – FFROSH Data
The data used for this are data on freshmen from 1987-1992. The dependent variable is RETAINED – whether student went directly into the 2nd semester or not. Predictors (covariates in logistic regression) are HSGPA, ACT composite, and Overall attempted hours in the first semester, excluding the freshman seminar course. GET FILE='E:\MdbR\FFROSH\ffrosh.sav'. logistic regression retained with hsgpa actcomp oatthrs1.

Logistic Regression
Case Processing Summary Unweighted Cases Selected Cases
a

N 4852 0 4852 0 4852

Percent 100.0 .0 100.0 .0 100.0

Included in Analysis Missing Cases Total

Dependent Variable Encoding Original .00 Value 1.00 Internal Value 0 1

Unselected Cases Total

a. If weight is in effect, see classification table for the total number of cases.

Block 0: Beginning Block
Classification Table
a,b

Predicted RETAINED Step 0 Observed RETAINED Overall Percentage a. Constant is included in the model. b. The cut value is .500 .00 .00 1.00 0 0 1.00 620 4232 Percentage Correct .0 100.0 87.2

Specificity Sensitivity

Variables in the Equation B Step 0 Constan t 1.921 S.E. .043 Wald 1994.988 df 1 Sig. .000 Exp(B) 6.826

Variables not in the Equation Score 225.908 44.653 274.898 385.437 df 1 1 1 3 Sig. .000 .000 .000 .000

Step 0

Variables

HSGPA ACTCOMP OATTHRS1

Overall Statistics

Recall that the p-values are those that would be obtained if a variable were put BY ITSELF into the equation.


Block 1: Method = Enter
Omnibus Tests of Model Coefficients Chi-sq uare 381.011 381.011 381.011 df 3 3 3 Sig. .000 .000 .000

Step 1

Step Block Model

Model Summary Cox & Snell R Square .076 Nagelkerke R Square .141

Step 1

-2 Log likelihood 3327.365

Classification Table a Predicted RETAINED Step 1 Observed RETAINED Overall Percentage a. The cut value is .500 .00 .00 1.00 35 16 1.00 585 4216 Percentage Correct 5.6 99.6 87.6

Specificity Sensitivity

Variables in the Equation - Step 1(a)
              B    S.E.      Wald   df   Sig.   Exp(B)
HSGPA      1.077   .101   112.767    1   .000    2.935
ACTCOMP    -.022   .014     2.637    1   .104     .978
OATTHRS1    .148   .012   146.487    1   .000    1.160
Constant  -2.225   .308    52.362    1   .000     .108
a. Variable(s) entered on step 1: HSGPA, ACTCOMP, OATTHRS1.

Note that while ACTCOMP would have been significant by itself, without controlling for HSGPA and OATTHRS1, when controlling for those two variables it is not significant. So, the bottom line is that
1) Among persons equal on ACTCOMP and OATTHRS1, those with larger HSGPAs were more likely to go directly into the 2nd semester.
2) Among persons equal on HSGPA and OATTHRS1, there was no significant relationship of likelihood of retention to ACTCOMP. Among persons equal on HSGPA and OATTHRS1, those with higher ACTCOMP were not significantly more likely to be retained than those with lower ACTCOMP. Note that there are other variables that could be controlled for, and that this relationship might "become" significant when those variables are controlled.
3) Among persons equal on HSGPA and ACTCOMP, those who took more hours in the first semester were more likely to go directly to the 2nd semester. What does this mean????

The FFROSH Full Analysis
From the report to the faculty – Output from Macintosh Version 6.

---------------------- Variables in the Equation ----------------------
Variable        B      S.E.       Wald   df   Sig       R    Exp(B)
AGE         -.0950    .0532     3.1935    1   .0739  -.0180   .9094
NSEX         .2714    .0988     7.5486    1   .0060   .0388  1.3118
  After adjusting for differences associated with the other variables, Males were more likely to enroll in the second semester.
NRACE1      -.4738    .1578     9.0088    1   .0027  -.0436   .6227
  After adjusting for differences associated with the other variables, Whites were less likely to enroll in the second semester.
NRACE2       .1168    .1773      .4342    1   .5099   .0000  1.1239
HSGPA        .8802    .1222    51.8438    1   .0000   .1162  2.4114
  After adjusting for differences associated with the other variables, those with higher high school GPA's were more likely to enroll in the second semester.
ACTCOMP     -.0239    .0161     2.1929    1   .1387  -.0072   .9764
OATTHRS1     .1588    .0124   164.4041    1   .0000   .2098  1.1721
  After adjusting for differences associated with the other variables, those with higher attempted hours were more likely to enroll in the second semester.
EARLIREG     .2917    .1011     8.3266    1   .0039   .0414  1.3387
  After adjusting for differences associated with the other variables, those who registered six months or more before the first day of school were more likely to enroll in the second semester.
NADMSTAT    -.2431    .1226     3.9330    1   .0473  -.0229   .7842
POSTSEM     -.1092    .0675     2.6206    1   .1055  -.0130   .8965
PREYEAR2    -.0461    .0853      .2924    1   .5887   .0000   .9549
PREYEAR3     .1918    .0915     4.3952    1   .0360   .0255  1.2114
  After adjusting for differences associated with the other variables, those who enrolled in 1991 were more likely to enroll in the second semester than others enrolled before 1990.
POSYEAR2    -.0845    .0977      .7467    1   .3875   .0000   .9190
POSYEAR3    -.1397    .0998     1.9585    1   .1617   .0000   .8696
HAVEF101     .4828    .1543     9.7876    1   .0018   .0459  1.6206
  After adjusting for differences associated with the other variables, those who took the freshman seminar were more likely to enroll in the second semester than those who did not.
Constant    -.1075   1.1949      .0081    1   .9283
Variables in the Equation - Step 1(a)
               B     S.E.      Wald   df   Sig.   Exp(B)
age          -.099    .053    3.461    1   .063     .905
nsex          .257    .099    6.726    1   .010    1.294
nrace                        19.394    2   .000
nrace(1)     -.944    .487    3.749    1   .053     .389
nrace(2)     -.337    .504     .446    1   .504     .714
hsgpa         .852    .123   48.204    1   .000    2.344
actcomp      -.021    .016    1.676    1   .195     .979
oatthrs1      .159    .012  163.499    1   .000    1.173
earlireg      .316    .102    9.640    1   .002    1.372
admstat(1)    .253    .123    4.222    1   .040    1.288
postsem      -.115    .068    2.880    1   .090     .891
y1988        -.048    .086     .306    1   .580     .954
y1989         .177    .092    3.737    1   .053    1.194
y1991        -.078    .098     .633    1   .426     .925
y1992        -.124    .101    1.511    1   .219     .884
havef101      .967    .152   40.364    1   .000    2.629
Constant     -.032   1.228     .001    1   .979     .968
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.

This is from SPSS V15. There are slight differences in the numbers, not due to changes in the program but due to slight differences in the data. I believe some cases were dropped between when the V6 and V15 analyses were performed. NRACE was coded differently in the V15 analysis.


The full FFROSH Analysis in Version 15 of SPSS
logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat postsem y1988 y1989 y1991 y1992 havef101 /categorical nrace admstat.

Logistic Regression
[DataSet1] G:\MdbR\FFROSH\ffrosh.sav

Case Processing Summary Unweighted Cases Selected Cases
a

N 4781 71 4852 0 4852

Included in Analysis Missing Cases Total

Percent 98.5 1.5 100.0 .0 100.0

Unselected Cases Total

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding Original Value .00 1.00 Internal Value 0 1

Categorical Variables Codings(a)
                                                                      Parameter coding
                                                       Frequency        (1)       (2)
nrace NUMBERIC WHITE/BLACK/ORIENTAL   1.00 WHITE            4060      1.000      .000
RACE CODE                             2.00 BLACK             636       .000     1.000
                                      3.00 ORIENTAL           85       .000      .000
admstat NUMERIC ADMISSTION            AP                    3292      1.000
STATUS CODE                           CD                    1489       .000
a. This coding results in indicator coefficients.


Block 0: Beginning Block
Classification Table a,b Predicted retained Step 0 Observed retained Overall Percentage a. Constant is included in the model. b. The cut value is .500 .00 .00 1.00 0 0 1.00 610 4171 Percentage Correct .0 100.0 87.2

Variables in the Equation B Step 0 Constant 1.922 S.E. .043 Wald 1966.810 df 1 Sig. .000 Exp(B) 6.838

Variables not in the Equation Score 27.445 3.147 7.322 5.864 3.261 223.532 46.129 273.644 86.855 119.994 13.295 1.049 11.532 .486 .102 40.186 528.012 df 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 15 Sig. .000 .076 .026 .015 .071 .000 .000 .000 .000 .000 .000 .306 .001 .486 .750 .000 .000

Step 0

Variables

age nsex nrace nrace(1) nrace(2) hsgpa actcomp oatthrs1 earlireg admstat(1) postsem y1988 y1989 y1991 y1992 havef101

Overall Statistics


Block 1: Method = Enter
Omnibus Tests of Model Coefficients Chi-sq uare 494.704 494.704 494.704 df 15 15 15 Sig. .000 .000 .000

Step 1

Step Block Model

Model Summary Cox & Snell R Square .098 Nagelkerke R Square .184

Step 1

-2 Log likelihood 3155.842a

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table a Predicted retained Step 1 Observed retained Overall Percentage a. The cut value is .500 .00 .00 1.00 79 33 1.00 531 4138 Percentage Correct 13.0 99.2 88.2

Variables in the Equation - Step 1(a)
               B     S.E.      Wald   df   Sig.   Exp(B)
age          -.099    .053    3.461    1   .063     .905
nsex          .257    .099    6.726    1   .010    1.294
nrace                        19.394    2   .000
nrace(1)     -.944    .487    3.749    1   .053     .389
nrace(2)     -.337    .504     .446    1   .504     .714
hsgpa         .852    .123   48.204    1   .000    2.344
actcomp      -.021    .016    1.676    1   .195     .979
oatthrs1      .159    .012  163.499    1   .000    1.173
earlireg      .316    .102    9.640    1   .002    1.372
admstat(1)    .253    .123    4.222    1   .040    1.288
postsem      -.115    .068    2.880    1   .090     .891
y1988        -.048    .086     .306    1   .580     .954
y1989         .177    .092    3.737    1   .053    1.194
y1991        -.078    .098     .633    1   .426     .925
y1992        -.124    .101    1.511    1   .219     .884
havef101      .967    .152   40.364    1   .000    2.629
Constant     -.032   1.228     .001    1   .979     .968
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.
