Correlation and Multiple Regression
Robert K. Toutkoushian
Educational Leadership and Policy Studies
Objectives of Module
• Review statistical procedures such as
correlation and multiple regression
• Examine ways in which these procedures
can be applied to institutional research
• Practice using SPSS to implement these procedures
• Discuss more involved procedures and
• Aim for a “middle ground” in terms of
difficulty (higher UG/lower G level)
• Focus more on intuition behind procedures
rather than proofs & derivations
• Assume familiarity with descriptive stats
and hypothesis testing
• STRONGLY encourage questions at any time
Covariance and Correlation
Both measure the extent to which two
variables “move together.” They differ
only in units of measure:
• Positive covariance/correlation: Both
variables tend to move in the same
direction
• Negative covariance/correlation: The
variables tend to move in opposite
directions
• When looking for correlations, you may
have to first reorder one of the variables
• If two variables are related, then knowing
the value of one variable may help with
guesses as to the value of the other (e.g.,
retention and SAT/ACT scores)
• “Correlation” does not imply “causation”!
Calculating the Covariance
• Calculate the means for X and Y (denoted “x-
bar” and “y-bar”)
• Subtract the mean for X from each X value
and repeat for Y
• Multiply the differences together for each
observation, then sum and divide by degrees
of freedom (n-1)
Covariance = -132,000/(4-1) = -44,000
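The three steps above can be sketched in a few lines of Python. The data below are hypothetical (the slide's own table is not reproduced here), and the function name `sample_covariance` is an illustrative choice.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n  # mean of X ("x-bar")
    y_bar = sum(ys) / n  # mean of Y ("y-bar")
    # Multiply the deviations from the means, observation by observation, then sum
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / (n - 1)  # divide by degrees of freedom


# Hypothetical data, assumed only for illustration (n = 4)
tuition = [2000, 3000, 4000, 5000]
applications = [900, 800, 650, 600]

cov_xy = sample_covariance(tuition, applications)
print(cov_xy)  # negative: the two variables move in opposite directions
```

As on the slide, the sum of cross-products is divided by n - 1 = 3, and the sign of the result gives the direction of the relationship.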
Properties of correlation coefficient:
• A “standardized” measure of covariance
that ranges between -1 and +1
• Positive: 0 < r ≤ +1
• Negative: -1 ≤ r < 0
• No correlation: r = 0
• Cov(x,y) and r will have the same sign
• Stronger relationship as r moves away from zero
• Calculate cov(x,y) as before
• Calculate standard deviations for X and Y
• Divide cov(x,y) by product of standard
deviations
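A minimal Python sketch of this calculation follows. The SAT and retention figures are hypothetical, chosen only to illustrate that rescaling a variable changes the covariance but not the correlation.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by the product of st. devs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


# Hypothetical data: average SAT score and retention rate for four cohorts
sat = [900, 1000, 1100, 1200]
retention = [0.70, 0.74, 0.81, 0.85]

r = correlation(sat, retention)
# Expressing retention in percentage points rescales the covariance,
# but the correlation is unit-free and stays the same.
r_scaled = correlation(sat, [100 * y for y in retention])
print(r, r_scaled)
```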
Correlations can be useful in IR when one
variable of interest is unobservable, and a
correlated variable is observable:
• College performance (correlated with HS
• Faculty experience (correlated with age, years
• Teaching quality (correlated with student
• Weak correlations are less useful for
• Correlations vary across factors, so it is
difficult to compare across factors (e.g.,
stock prices and faculty salaries)
• May be multiple factors affecting a single
factor of interest
• Does not measure non-linear relationships
Class Example #1
Filename TUITION.SAV contains data on
average public tuition rates, state
appropriations, and median family income by
state in 1994. In SPSS:
• Calculate the means and standard deviations
for these three variables.
• Calculate the covariances and correlations
between state appropriations and (a) public
tuition rates, (b) median family income.
Linear Regression (“OLS”)
• Objective: find the best linear (“straight line”)
relationship between two or more variables.
• Ordinary Least Squares (OLS) is the technique
most often used to choose the best line.
• This linear relationship is based on the
covariance between two variables.
• Regression analysis requires the analyst to
specify the direction of causation.
Advantages of Linear Regression
• Can predict/forecast one variable (Y)
based on values of another variable (X)
• Can perform hypothesis tests to determine
if X affects Y
• Can control for differences in Y due to X
• Very flexible with regard to functional form,
model specification, etc.
Example: Gender Equity in
Your President asks you to examine faculty salaries at your
institution and determine if there is a gender equity problem.
Descriptive stats show that on average men earn more than women.
• How can you control for salary differences due to justifiable
factors such as experience, productivity?
• How can you determine if the remaining pay difference is
large enough to conclude that this is a problem?
Ordinary Least Squares
• Slope = β in population, b in sample
• Error term (ε or e) encompasses effects of all omitted factors
• Parameters in the population model are unobservable
• Sample line is what you estimate with OLS
Assumptions in Linear Regression
• The error term has a mean of zero and
a constant variance
• The errors are unrelated to each other
• The errors are unrelated to the
independent variable(s) in the model
• The error term is normally distributed
(needed for hypothesis testing)
Ordinary Least Squares
OLS specifies that the “best” line is the one that minimizes
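the sum of squared errors (minimize $\sum e_i^2$)
Slope (b) = $\dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
Intercept (a) = $\bar{Y} - b\bar{X}$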
Notes on OLS:
The slope formula is the covariance between
X and Y divided by the variance for X
The slope and covariance will always have
the same sign
• b > 0 indicates a positive relationship
• b < 0 indicates a negative relationship
• b = 0 indicates no linear relationship
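The notes above can be sketched in Python as follows. This is an illustrative implementation of the textbook formulas (slope = covariance / variance of X, intercept = y-bar minus slope times x-bar), not SPSS's own routine, and the data are made up.

```python
def ols_line(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # The (n - 1) divisors in cov(x,y) and var(x) cancel, so sums suffice
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope: same sign as the covariance
    a = y_bar - b * x_bar    # intercept: the line passes through the means
    return a, b


# Hypothetical data lying exactly on the line Y = 1 + 2X
a, b = ols_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Because the denominator is a sum of squares, it is always positive, which is why the slope and the covariance share a sign.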
Example: An IR analyst is asked to help
forecast applications. She believes there
is a relationship between HS grads and
resident applications each year
Regression line: Ŷ = -358.28 + 0.29X
• Interpretation: For each additional HS grad,
predicted applications will rise by 0.29.
• The intercept may not have much meaning.
• Can predict applications given projections of
HS grads. If HS grads = 36,000, then
Ŷ = -358.28 + 0.29(36,000) = 10,082
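The forecast above is a direct substitution into the estimated line. A small sketch, using the coefficients from the slide and a hypothetical function name:

```python
def predict_applications(hs_grads, intercept=-358.28, slope=0.29):
    """Point prediction from the estimated line Y-hat = a + b*X."""
    return intercept + slope * hs_grads


# Projection used on the slide: 36,000 high school graduates
predicted = predict_applications(36_000)
print(round(predicted))
```

Each additional 1,000 graduates raises the prediction by 290 applications, which is the slope interpretation restated at a more practical scale.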
Goodness of Fit
• Measures the strength of the relationship between X and Y
• R-squared (or coefficient of determination):
proportion of total deviation in Y that is “explained”
• R-squared is bounded between 0 and 1 (R² = 1 if
perfect fit, R² = 0 if no fit)
• R-squared = square of correlation coefficient (with
only one X variable in the model)
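The quantities behind R-squared can be written out as follows. This sketch assumes the abbreviations used later in the slides mean TSS = total sum of squares, RSS = regression (explained) sum of squares, and ESS = error sum of squares; some textbooks swap the last two labels.

```latex
\begin{aligned}
TSS &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad
RSS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad
ESS = \sum_{i=1}^{n} e_i^2 \\
TSS &= RSS + ESS \\
R^2 &= \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}
\end{aligned}
```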
More on R-squared...
• When there is no covariance, the slope of the
regression line is zero and R² = 0.
• Adding variables to the regression model will
almost always raise R², but this does not mean
that the resulting model is “better”
• Adjusted R² attempts to correct for this, but no
longer has the same interpretation
• R² varies depending on the dependent variable.
Do not use this to compare regression models
with different Y's.
Predicting Resident Applications
Note that HS grads account for 88.5%
of the total deviation in applications.
Class Example #2
Using TUITION.SAV, in SPSS:
• Calculate a regression line showing how
median income affects average tuition
• Calculate R², TSS, RSS, ESS, and corr(x,y).
SPSS syntax:
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT tuition
  /METHOD=ENTER income .
Equation: Tuition = 313.119 + 0.0719*Income
Hypothesis Testing for β
• In most situations in the social sciences, it is
rarely known for sure if X affects Y
• A hypothesis test can be used to determine
if the data provide sufficient evidence of a
• For most variables, the sample slope “b” will
not exactly equal zero. How far from zero
must it be in order to safely conclude that β ≠ 0?
Steps in Hypothesis Testing
• Specify null (H0) and alternative (HA)
hypotheses
• Identify test statistic and find critical
value(s) based on degrees of freedom and
significance level
• Calculate test statistic and compare to
critical value(s)
Common Hypotheses for β
• β = 0 (X has no effect on Y)
• β > 0 (X has a positive effect on Y)
• β < 0 (X has a negative effect on Y)
• β ≠ 0 (X has some effect on Y…+ or - )
Choose two hypotheses that are mutually
exclusive and exhaustive.
The null hypothesis (H0) should always
contain some form of equal sign.
Test Statistic for β
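If $\varepsilon \sim N(0, \sigma^2)$, then $b \sim N(\beta, \mathrm{Var}(b))$.
Therefore the t-ratio, $t = \dfrac{b - \beta}{\mathrm{se}(b)}$, will follow a Student t-distribution with n-k degrees of freedom (k = # parameters to be estimated).
The t-ratio is defined as the random variable minus its mean (when H0 is true), divided by its standard deviation (standard error).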
Notes on Hypothesis Testing
• The t-ratio simply counts the # standard
deviations the slope is from zero (“distance”)
• The greater the distance, the less likely you
would have found the value of b if β = 0.
• For significance tests, since β = 0 under H0, the t-ratio is
the slope divided by its standard deviation (or
standard error)
Example: t-ratio of +2.40
This shows that there is only a 1.3% chance of finding a t-ratio of 2.40 or greater if in fact β = 0. Therefore, if you found a t-value this high, it is unlikely that β = 0.
R2 = 0.025
TSS = 1.9E+10
ESS = 1.9E+10
se = √(ESS/826) = 4766
Do undergraduate enrollments have a significant effect on average costs per student?
Null Hypothesis: β = 0, Alternative Hypothesis: β ≠ 0
For 826 df, 1% significance level, reject the null when the calculated t-ratio
exceeds 2.575 in absolute value.
P-value = Probability of drawing a more extreme sample value
given that the null hypothesis is true:
P-value = Pr(b < -0.175) = Pr(t < -4.577) = 0.000
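The decision rule in this example can be sketched in Python. The slope (-0.175), t-ratio (-4.577), and critical value (2.575) are taken from the slide; the function name is hypothetical, and the standard error is backed out from the reported numbers rather than read from SPSS output.

```python
def reject_null(t_ratio, critical_value):
    """Two-sided test: reject H0 (beta = 0) when |t| exceeds the critical value."""
    return abs(t_ratio) > critical_value


b = -0.175          # sample slope reported on the slide
t = -4.577          # reported t-ratio
critical = 2.575    # 1% significance level, 826 df (from the slide)

# Since t = b / se(b) under H0, the implied standard error is b / t
implied_se = b / t
print(round(implied_se, 4), reject_null(t, critical))
```

By the same rule, the earlier example's t-ratio of +2.40 would not lead to rejection at the 1% level, since 2.40 is below 2.575.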
Units of Measurement
The significance levels of any variable will
not be influenced by the units of measure
used for X or Y
• The coefficient represents the # units
change in Y due to a one-unit change in X
• When the units of measure change, both
the coefficients and standard errors
change proportionately (t-ratio remains the
same)
The regression model can be used to derive
predictions of Y given values of X(s)
• Point estimates are found by substituting X into
the equation and solving for Y (“I predict that the
grad rate will be 70%”)
• Interval estimates are predictions that Y will fall
within a certain interval (“I am 95% certain that the
grad rate will be between 68% and 72%”)
• Interval estimates are more conservative, and
convey the uncertainty in predictions.
Two Types of Intervals
C.I. For expected value (“mean”) of Y
• For given X, what is the predicted average value of Y?
C.I. For a single value of Y
• For given X, what is the predicted single value of Y
(more uncertainty, so wider interval)
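For a model with one X variable, the two intervals can be written as below. This is the standard textbook form rather than anything shown on the slides; $s_e$ denotes the standard error of the regression, $X_0$ the value of X at which the prediction is made, and $t_{\alpha/2,\,n-2}$ the critical t-value.

```latex
\begin{aligned}
\text{Mean of } Y:\quad
  & \hat{Y}_0 \pm t_{\alpha/2,\,n-2}\; s_e
    \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} \\
\text{Single value of } Y:\quad
  & \hat{Y}_0 \pm t_{\alpha/2,\,n-2}\; s_e
    \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}
\end{aligned}
```

The extra "1" under the square root is the source of the additional width for a single value, and both intervals widen as $X_0$ moves away from $\bar{X}$.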
The two methods yield very similar intervals. Most
IR applications use C.I.'s for single value.
Intervals can be obtained in SPSS using the “save”
Predict HS Grads in New
An IR analyst is charged with developing a model to
help predict changes in HS grads in the state
File AIR1.SAV in SPSS has two vars: # HS grads in
year t (HSGRAD), and # 2nd grade enrollments in
year t-10 (GRADE2).
• Find correlation between HSGRAD and GRADE2
• Estimate a regression model
• Form point and 95% CI estimates of high school
grads for the next ten years.
Under statistics > correlate > bivariate:
Note that r = +0.959, cov(x,y) = 701,160 (n=12)
Under statistics > regression > linear:
In 2006, the model predicts there will be 14,919 high school graduates.
We can be 95% certain that in 2006 there will be between 14,185 and 15,652 high school graduates.
In most IR applications, the dependent variable may
be influenced by multiple factors:
• Grad rate = f(avg. SAT, gender composition, avg.
HS rank, % students on campus,...)
• Faculty Salary = f(education, experience,
• Education Costs = f(enrollments, research
intensity, student/faculty ratio,...)
Assumptions in Multiple Regression
• Error term has a mean of zero and
constant variance
• Error terms are unrelated to each other
• Error term is unrelated to independent variables
• Error term is normally distributed
• Independent variables are not collinear
with each other (no “multicollinearity”)
Ordinary Least Squares
Least Squares Estimates
• The coefficients are referred to as “partial
effects” because they show the effect of
one variable on Y holding other vars
constant
• The OLS formula takes into account any
relationships between the X variables. For
this reason, the coefficients usually
change when variables are dropped/added
Other Stats in Multiple Regression
• Hypothesis tests for significance of coefficients
can be performed as before, except degrees of
freedom change (n-k-1, where k here counts only the X variables).
• Goodness of fit measures are calculated as
before. R-squared now represents the %
deviation in Y explained by all X's together. Thus,
R² usually rises as X's are added.
• Confidence intervals and point estimates can be
calculated as before.
Example: Average Public Tuition
An IR analyst is asked to help explain why
there are variations across states in
their tuition rates at public institutions.
She feels that factors such as state aid
given to students and state
appropriations help account for these
• Open the file TUITION.SAV in SPSS.
Question 1: How do state appropriations affect average tuition?
State appropriations account for 13.5% of differences in tuition.
Question 2: How do state appropriations and aid to students
affect average tuition?
These two variables
account for 40.4% of
differences in tuition.
A $1 increase in appropriations reduces tuition by 22.6 cents,
holding constant state aid per student.
Extensions of Regression
So far, we have only considered linear
models where X's and Y's were
continuous. We will now examine how to handle:
• Categorical X's
• Interactions among X's
• Non-linear relationships between X and Y
There are many examples of independent variables
that are not numerical (ex: gender, race,
institution attended, attitudes/beliefs)
“Likert scale” variables (assign #'s to categorical
responses) should not be used in regression
models in their present form due to problems in
interpreting changes in units.
• Slope = # units change in Y due to a one-unit
change in X (but Likert #'s are artificial)
However, categorical X's can be used if they are
first recoded into “dummy variables”
• Dummy variable: has only two values (0,1)
• Need to specify an assignment rule. Can be used
for categorical, Likert, and continuous variables.
• The variable can now be used in regression
• It does not matter which group is assigned 1
• Coef represents the difference in intercepts for the two groups
• Must omit one of the dummy variables for a
construct to avoid multicollinearity
Examples of Assignment
Let X = 1 if (0 otherwise):
• Teaches in Psychology Department
• Enrolled in public university
• Family income exceeds $100,000
• Student is “very satisfied” with the
quality of instruction
• Student graduated from campus
• Student dropped out of college
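Assignment rules like these can be sketched in Python as a function that maps each observation to 0 or 1. The helper name `make_dummy` and the sample values are hypothetical; in SPSS the same recoding would be done with its own transformation commands.

```python
def make_dummy(values, rule):
    """Apply an assignment rule: 1 if the rule holds for a value, 0 otherwise."""
    return [1 if rule(v) else 0 for v in values]


# Continuous variable -> dummy (family income exceeds $100,000)
incomes = [45_000, 120_000, 100_000, 250_000]
high_income = make_dummy(incomes, lambda inc: inc > 100_000)

# Likert-type response -> dummy (student is "very satisfied")
satisfaction = ["very satisfied", "satisfied", "neutral", "very satisfied"]
very_satisfied = make_dummy(satisfaction, lambda s: s == "very satisfied")

print(high_income)
print(very_satisfied)
```

Note that an income of exactly $100,000 is coded 0 here because the rule says "exceeds"; the assignment rule has to settle such boundary cases explicitly.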
Note: Both equations
have the same slope
Question: Does living
on campus matter?
It is possible that the joint occurrence of two X's
has an effect on Y separate from each X's
individual effect:
• Academic performance of students with high
SAT scores and HS ranks
• State appropriations for higher ed in states
with low incomes and high tax rates
• The salary increase from promotions for men
and women may be different
In these examples, there is something
special about the joint occurrence of two
X's
• To test these assertions, an “interaction
variable” can be created and added to the
model
• Interaction variables are created by
defining a third variable as the product of
the two variables in question.
The interaction variable is then added to the regression model and
treated as any other variable:
To find the effect of x1 on y, you need to differentiate the equation
with respect to x1:
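A worked version of the interaction model and its derivative is sketched below. The coefficient names $b_1$, $b_2$, $b_3$ are assumed for illustration, following the sample notation (a, b, e) used earlier in the slides.

```latex
\begin{aligned}
y &= a + b_1 x_1 + b_2 x_2 + b_3 (x_1 x_2) + e \\
\frac{\partial y}{\partial x_1} &= b_1 + b_3 x_2
\end{aligned}
```

The effect of $x_1$ on $y$ is no longer a single number: it depends on the level of $x_2$, and $b_3 = 0$ corresponds to no interaction.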
Regression analysis can also be used in
situations where X has a non-linear
relationship with Y
• Linear: The change in Y due to a one-unit
change in X is constant.
• Non-linear: The change in Y due to a
one-unit change in X can vary with the
level of X.
Graphs of Non-linear Functions
[Figure panels: Exponential: Y = exp(X); Logarithmic: Y = ln(X)]
Graphs of Quadratic Functions
[Figure panels: “Maximize” Y; “Minimize” Y]
Possible IR Examples
Exponential: Implies that as X increases, Y
increases at a faster rate.
• Y = salary, X = years of experience
Logarithmic: Implies that as X increases, Y
increases at a slower rate.
• Y = college GPA, X = hrs/week studying
• Y = retention rate, X = avg. student SAT
Possible IR Quadratic
“Maximize” Y: There is some value of X at
which Y is maximized.
• Y = Tuition revenue, X = tuition rate
• Y = Student gains, X = class size
“Minimize” Y: There is some value of X at
which Y is minimized.
• Y = costs/student, X = enrollments
Using Non-linear Functions
• Regression analysis requires a linear
relationship between X and Y.
• When there is a non-linear relationship, you can
transform one or more variables and then use
the transformed variables in the regression
model
• As long as there is a linear relationship between
the transformed variables, regression analysis is
appropriate
• The coefficient estimate for β, multiplied by 100,
represents the approximate percentage change in Y
due to a one-unit increase in x.
• The variable x always has the same directional
effect on Y (positive or negative)
• The change in Y due to a change in x increases
at an increasing rate
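These properties follow from the exponential (semi-log) specification, sketched here. The equation is not reproduced in the source, so the form below, with intercept $\alpha$, is an assumption based on the bullets above.

```latex
\begin{aligned}
Y &= \exp(\alpha + \beta x + \varepsilon)
  \quad\Longleftrightarrow\quad
  \ln Y = \alpha + \beta x + \varepsilon \\
\frac{\partial Y}{\partial x} &= \beta\, Y \\
\%\Delta Y &\approx 100\,\beta
  \qquad \text{(exact: } 100\,(e^{\beta} - 1)\text{)}
\end{aligned}
```

The approximation is close when $\beta$ is small; for example, $\beta = 0.05$ gives an exact change of about 5.13% rather than 5%.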
Natural Log Function
The natural log function is the inverse of the exponential
function: ln (exp (X)) = X
This can also be used for a subset of X’s.
“Double-Log” Function: Elasticities
If X is believed to have a quadratic effect on Y, then create
a new variable as the square of X and add this to the
model.
The change in Y due to a one-unit change in X1 would
be found by differentiating the equation with respect to
X1.
Hill-shaped if β3 < 0, U-shaped if β3 > 0, linear if β3 = 0
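A sketch of the quadratic model and its derivative follows. The indexing (β3 on the squared term) is taken from the line above; the presence of a second variable $X_2$ with coefficient $\beta_2$ is an assumption made so that the numbering works out.

```latex
\begin{aligned}
Y &= \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \varepsilon \\
\frac{\partial Y}{\partial X_1} &= \beta_1 + 2\beta_3 X_1 \\
\frac{\partial Y}{\partial X_1} = 0
  \;&\Longrightarrow\;
  X_1^{*} = -\frac{\beta_1}{2\beta_3}
\end{aligned}
```

$X_1^{*}$ is the value of $X_1$ at which $Y$ is maximized (if $\beta_3 < 0$) or minimized (if $\beta_3 > 0$).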
More on Quadratic
• The value of X that maximizes or
minimizes Y can have important
implications. This is found by solving for
X in the first-derivative.
• Higher-order functions (ex., cubic) can
also be used in regression. They can
yield better representations of
relationships, but are harder to explain
SPSS Exercise: Faculty
An IR analyst is asked to investigate if female faculty are
paid less than comparable males. She draws a sample
of 432 faculty and creates these variables:
• Salary = monthly base salary (in dollars)
• Rank = 1 if Full, 2 if Associate, 3 if Assistant
• Gender = 1 if male, 0 otherwise
• Prevexp = days of experience before current job
• Npleave = days of non-professional leave
• Potenexp = days since highest degree
• Nine12 = 1 if nine-month appointment, 0 otherwise
• Cite85 = Citations in 1985 to all publications
Open the SPSS system file FACSAL.SAV:
• Estimate a regression model showing how
gender affects salary. How do these results
compare to a two-sample t-test?
• How do your findings change when potential
experience and citations are added?
• An economist argues that salaries rise
exponentially with potential experience,
citations, and gender. How can this be
Answer to first task...
Note: the mean difference is $916, which has a t-value of 6.227 and is statistically significant.
Answer to second task...
Answer to third task...
• The VP for Finance argues that individuals
with high experience levels often get
smaller percentage salary increases than
others. How could this be addressed (use
same function as in previous example)?
• A female faculty member claims that
women face discrimination in part because
they are rewarded less for each citation
they receive. How could you test this?
Answer to fourth task...
Answer to fifth task...
For most IR problems, there are many
alternative models from which to choose.
How should the “best” model be selected?
• Begin with published studies that look at the
same (similar) Y's. What variables and
functional forms do they use?
• Is there a theory that can be used to guide:
human capital theory > salary models
median voter theory > state funding for HE
Tinto's model > student retention
More model selection
• Better to include too many factors than to omit
important variables (“omitted variable bias”)
• Can estimate several competing model
specifications and compare results. Be careful
not to simply select model with the most
• Keep in mind trade-off between simplicity and
accuracy. A simple model is worth its weight in
gold when explaining to decisionmakers!
Faculty Salary Example
• Return to FACSAL.SAV and create a
dummy variable for full professors
• Estimate a model explaining salary as a
function of gender, then gender and full
• Estimate a model explaining salary as a
function of gender, full professor, and
Problems in Regression
There are three main problems which may
arise in multiple regression:
• Autocorrelation
• Heteroscedasticity
• Multicollinearity
We will briefly discuss what each means,
how they can be detected, and what can
be done about them when they occur.
Autocorrelation can occur in time-series data when the error in
one period is related to the error in the next.
• Violates the assumption E(εiεj) = 0 for i ≠ j
• Causes the computer to calculate incorrect
standard errors, thereby affecting t-ratios. Usually,
st.errs are too small, so t-ratios are too high
(making X appear significant when it isn't.)
• Possible IR Examples: Predicting applications, HS
grads, state funding for HE.
The Durbin-Watson test calculates a “d-statistic” that reflects the
correlation among subsequent error terms:
$d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$
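A minimal Python sketch of the Durbin-Watson d-statistic is given below, using the standard definition (sum of squared changes in successive residuals divided by the sum of squared residuals). The residual series are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals, ordered in time
persistent = [1, 1, -1, -1]    # errors tend to repeat their sign
alternating = [1, -1, 1, -1]   # errors flip sign every period

print(durbin_watson(persistent))
print(durbin_watson(alternating))
```

The statistic is roughly 2(1 - ρ), where ρ is the correlation between successive errors: values near 2 suggest no autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.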
If autocorrelation is detected, it can be corrected
through transforming the data to yield correct
standard errors (“generalized least squares”).
• Cochrane-Orcutt and Prais-Winsten are two commonly
used methods
• Standard “autocorrelation” option in SPSS does
not do this. Use SPSS Trends or another
package
• Keep in mind that autocorrelation affects the
standard errors and not coefficients.
Heteroscedasticity may occur in cross-section data when the variance
of the error term is related to one or more
independent variables (σi² not constant).
• Affects standard errors, and hence t-ratios (but
not coefficient estimates)
Potential IR examples:
• Effects of enrollments on average costs
• Effect of tax revenues on state appropriations
• Effect of program size on expenditures
Graph of Heteroscedasticity
As X increases, the possible errors become
• Visual: Plot residuals against the variable
thought to be causing the problem.
• Park-Glejser test: Estimate model and save
residuals. Regress the log of squared
residuals against the log of variable thought
to cause the problem.
• Other tests: White (1980), Goldfeld-Quandt.
• SPSS will not do these by default (must do
by hand or with other software).
• Weighted least squares: Weight
observations by the variable causing
heteroscedasticity. However, you must
know the form.
• For example, if σi² = σ²X1i, then dividing
each observation by the square root of X1
will yield correct standard errors.
• An option that does not require knowing
the form of heteroscedasticity is White's
heteroscedasticity-consistent standard errors.
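The weighting in the example above can be sketched for a model with one X variable. The model form and the symbols $\alpha$, $\beta$ are assumed for illustration; the variance structure $\sigma_i^2 = \sigma^2 X_{1i}$ is the one stated in the example.

```latex
\begin{aligned}
Y_i &= \alpha + \beta X_{1i} + \varepsilon_i,
  \qquad \mathrm{Var}(\varepsilon_i) = \sigma^2 X_{1i} \\
\frac{Y_i}{\sqrt{X_{1i}}}
  &= \alpha \frac{1}{\sqrt{X_{1i}}} + \beta \sqrt{X_{1i}}
   + \frac{\varepsilon_i}{\sqrt{X_{1i}}} \\
\mathrm{Var}\!\left(\frac{\varepsilon_i}{\sqrt{X_{1i}}}\right)
  &= \frac{\sigma^2 X_{1i}}{X_{1i}} = \sigma^2
\end{aligned}
```

After the transformation the error variance is constant, so OLS on the transformed variables gives correct standard errors.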
Multicollinearity arises when there is an
extremely high correlation between two or
more independent variables in the model.
• The coefficients become imprecise and unstable (OLS remains
unbiased, but the stats program cannot tell how to assign
proper weights to the collinear variables)
• Standard errors increase, making t-ratios
smaller
Potential IR examples include: (1) effect of
current and previous experience on faculty
salaries, (2) effect of SAT score and high
school rank on academic performance, (3)
effect of family income and wealth on student
demand for higher education.
• A significant correlation between X's does
not necessarily lead to multicollinearity. Only
when the correlation is very high does this
become a problem.
There is no universally-accepted test for
multicollinearity.
• Variance inflation factors (VIF) estimate how much
the standard errors increase due to correlation with
other X's. No single “cutoff point” for VIFs.
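For reference, the usual definition of the variance inflation factor for the j-th independent variable is sketched below, where $R_j^2$ is the R-squared from regressing $X_j$ on all the other X's. The formula is the standard textbook one and does not appear in the source.

```latex
VIF_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{se}(b_j) \text{ is inflated by a factor of } \sqrt{VIF_j}
```

When $X_j$ is unrelated to the other X's, $R_j^2 = 0$ and $VIF_j = 1$; as $R_j^2$ approaches 1, the VIF grows without bound.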
Signs of multicollinearity include:
• Two similar variables have widely different effects
on Y (e.g., only one is signif.)
• The standard errors are large
To test, drop one of the variables from the model and
compare results. If the coef and st. err. change
considerably, this may be a problem.
There is also no uniformly-accepted
solution to this problem. However, you
can drop one of the problem variables
from the model.
• Multicollinearity may not be an important
issue if the collinearity occurs between
Return to the faculty salary data and create
a new variable:
• newpot = potenexp / 365 (“years of
exper”) and add this to the regression
• Then, make slight changes to first two
data points: change “27.02” to “13” and
change “19.01” to “27”.
• Estimate regression model again, using
gender, potenexp, newpot
Using gender and potenexp:
Using gender, potenexp and newpot:
Variable POTENEXP drops out of the equation because it
is perfectly correlated with NEWPOT.
Using gender, potenexp,
newpot (after changes)
Gender is significant throughout all three models.
Standard errors are about forty-three times larger than before!
Thus far, we have considered instances
where Y was continuous and unbounded.
However, there are many situations where
this is violated:
• Individual student data are often
dichotomous (0,1) variables: 1 if graduate, 1
if return, 1 if apply/enroll.
• Some data are discrete counts: number of
journal articles or citations, number of times
a student changes his/her major
Problems with OLS when Y is Dichotomous
• Predictions can be > 1 or < 0
• Coefficients may be biased
• Heteroscedasticity is present (σ² = P(1-P))
• Error term is not normally distributed (only
two possible values), so hypothesis tests
are unreliable
Of these problems, the last is the most
Maximum Likelihood Estimation
In this instance, there are advantages to
using a technique (MLE) in place of OLS.
• MLE: Find the coefficients that maximize
the likelihood of generating the
observations on Y in the sample.
• Recall that OLS chooses the coefficients
that minimize the sum of squared errors.
Logit and probit analysis
When Y = (0,1), the two most commonly-used
functional forms in MLE are the cumulative
logistic distribution (“logit analysis” or
“logistic regression”) and the cumulative
normal distribution (“probit analysis”).
• The two choices usually yield similar results
• Each avoids the four problems noted with OLS
For logistic regression, the following functional form is used:
$\ln\left(\dfrac{P}{1-P}\right) = a + b_1X_1 + b_2X_2 + \dots$
where P = probability that Y=1
All you have to do, however, is create the dummy variable for
Y and tell SPSS to use logistic regression to estimate the
model. SPSS will create the log odds ratio for you.
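The link between the log odds and the probability P can be sketched in Python. The function names and the coefficient values below are hypothetical, not output from the faculty dataset; the sketch only shows how a fitted log-odds equation maps back to a probability between 0 and 1.

```python
import math


def log_odds(p):
    """Log odds (logit) of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def probability_from_log_odds(z):
    """Invert the logit: P = 1 / (1 + exp(-z)), always between 0 and 1."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients for a one-variable model: ln[P/(1-P)] = a + b1*X1
a, b1 = -2.0, 0.5
for x1 in (0, 4, 8):
    z = a + b1 * x1
    print(x1, round(probability_from_log_odds(z), 3))

# exp(b1) is the odds ratio: the factor by which the odds change per unit of X1
odds_ratio = math.exp(b1)
```

Unlike OLS predictions, the implied probabilities can never fall below 0 or above 1, and a positive coefficient always moves the probability upward.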
• The coefficients from logistic regression are
hard to interpret and explain.
• Focus on the signs of the coefficients:
– If the sign is positive and significant, then as X
increases, the probability that Y=1 will also
increase.
– If the sign is negative and significant, then as X
increases, the probability that Y=1 will
decrease.
– If the coefficient is not significant, then there is no
evidence that X affects the probability that Y=1.
Example: Faculty Rank
• Return to the faculty dataset, and estimate a
logistic regression model to explain whether a
faculty member is a Full professor: (under
“Regression / Binary Logistic”)
• X's include gender, potenexp, prevexp, and
cite85
• Need to create a dummy variable for Full
professors
• SPSS Probit module is different than used
Wald Chi-square statistic: (coefficient / standard error)²
Note that these Chi-square values are the square of the standard t-ratios.
[Output columns: Effect of X on log odds; Standard errors; Odds Ratio]
Results from rank analysis:
• Since the coefficient for GENDER is
positive and significant, it means that men
are more likely than women to hold the
rank of Full professor after controlling for
experience and citations.
• The positive and significant coef for
CITE85 means that a faculty member is
more likely to be a Full Prof as citations
increase.
Final Exam: SATDATA.SAV
File contains data on 1,999 NH high school seniors
in 1996 who have taken the SAT
• ASSOC = 1 if highest planned degree is Associate
• MA = 1 if highest planned degree is Master's
• PHD = 1 if highest planned degree is Doctorate
• MALE = 1 if male
• FIRSTGEN = 1 if 1st generation
• INCOME = family income
• INCOME2 = income squared
• PUBHS = 1 if attend public HS
• SATCOMB = Combined SAT score
• SATCOMB2 = SAT squared
• ANYAP = 1 if taken any AP course
• GRADEAVG = high school GPA
• UNH = 1 if sent SAT score to UNH
• KSC = 1 if sent SAT score to KSC
• PSC = 1 if sent SAT score to PSC
• How do family income, student ability,
and student intentions affect whether a
student submits SAT scores to UNH, KSC,
or PSC?
• Do SAT takers from poor families and/or
first generation families do worse on the
SAT than other students?