AICP EXAM REVIEW Quantitative Methods
Pennsylvania Planning Association March 2004 Kurt Paulsen Temple University Kurt.Paulsen@temple.edu http://astro.temple.edu/~paulsenx
Outline
Sampling and Surveys
Statistics
Demographics and population analysis
Economic analysis techniques
Impact Analysis/Decision tools
  Cost-benefit analysis
  Fiscal impact analysis
Sampling and Surveys
Where have we come so far?
In our data analysis so far this semester, we have assumed the data come from “good sources”, that is they are scientifically based, readily available, without error, etc. In reality, we need to peer behind the methods used to collect the data before we can really analyze the data. (metadata) And, often, we have to think about how to collect data ourselves
Acknowledgements …
Sources for this lecture:
Babbie, E. 2004. The Practice of Social Research, 10th ed. Wadsworth/Thomson Learning. Chapter 7 (The Logic of Sampling) and Chapter 9 (Survey Research).
Lohr, S. 1999. Sampling: Design and Analysis. Duxbury Press. (More advanced mathematically.)
…and disclaimers…
After this 3 hour lecture, you will know a lot about how surveys and samples are constructed, such that you can be an intelligent consumer of this type of research. However, you do not know enough to put together your own sample. You do not know what you do not know. Most likely, you would need further study or hire a professional polling organization.
…more disclaimers
Most university associated research, as well as some research funded by government and/or non-profit money requires “Institutional Review Board” and/or “Human Subjects Committee” approval of survey design and compliance with ethical research protocols before research begins.
Check out the requirements before you begin conducting a survey.
Sampling
As we have been thinking all along in statistics and probability theory, we start with an interest in some parameter of a population. However, collecting data on the entire population is neither feasible nor necessary. Instead, we collect a sample, and calculate “sample statistics”. Recall that a sample statistic is simply an estimated population parameter. Research focuses on collecting a sample of observations to make inferences about a general population.
Sampling
The process in research of selecting observations is called “sampling.”
The logic of probability theory can help us determine how to sample (random sampling) and how many to sample (calculate “margin of error”).
Sampling
Non-random sampling
There are special cases where we would substitute non-random sampling for random/probability sampling.
Purposive/judgemental sampling
Snowball sampling
Quota sampling
“Surveys” and sampling
A lot of what passes for “samples” or “surveys” these days is unscientific
Reader/viewer/listener response
Internet polling/voting
“Man on the street”
The “non-survey” survey…
Probability sampling
Probability sampling is simply selecting samples based on probability theory. This is the only way that our sample results can be used to generalize to a larger population. Probability sampling is the primary method of selecting large, representative samples for social research. Conducted properly, probability sampling is the best method available for reducing potential bias in data collection.
Probability sampling
Roughly (according to Babbie), in order for a sample to provide a useful description of the total population, a sample of individuals must contain essentially the same variation(s) that exist in the population.
Put another way, a sample should be representative.
“Representativeness”
No formal, scientific meaning.
Should include those characteristics reasonably considered to be relevant to the interest of the study.
Definitions
Sampling frame: who is chosen to be surveyed/sampled. The “list” of units/elements composing a population from which a sample is selected. If a sample is to be representative, the sampling frame should include all (or nearly all) of the population. One begins with a list of elements (units of the population). Usually individuals, but could be households, etc.
Probability Sampling
The purpose, then, is to select a set of elements from a population in such a way that descriptions of those elements accurately portray the total population.
Probability theory then allows us to estimate the accuracy and representativeness of the sample.
Probability theory and sample size
Let's think back to our discussion of the Central Limit Theorem (a result which was important in our descriptive and inferential statistics). For sampling, if many independent, random samples could be selected from a population, the resulting sample statistics (technically: the “sampling distribution”) would be distributed around the “true” population parameter in some known way.
Probability theory and sample size
This knowledge of probability then allows us to estimate the “sampling error” – which is the degree of error which is expected….AS A FUNCTION OF SAMPLE SIZE.
That crazy central limit theorem again…as sample size increases, what happens to the sampling error?
Probability Sampling
EPSEM (equal probability of selection method):
A sample design in which each member of a population has the same chance of being selected into the sample.
If EPSEM holds, then we can call the sample a “random” sample…or more technically, a probability sample
Sampling Error and Sample Size
Now for the moment you all have been waiting for…because you love the probability theory which keeps popping up…. Sampling error: The error expected in probability sampling…composed of three factors:
Parameter Sample Size Standard error
Sampling error
Probability theory gives us a way to estimate how closely sample statistics will be dispersed around the “true” population parameter.
$$ s = \sqrt{\frac{p \, q}{n}} $$
s = sampling error, p and q are population parameters for the binomial (q = 1 - p), n = number of cases
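A minimal sketch of the sampling error formula in Python, assuming the standard binomial form s = sqrt(p*q/n) with q = 1 - p. The function name `sampling_error` and the example values are illustrative and not part of the lecture.

```python
import math

def sampling_error(p, n):
    """Standard error of a sample proportion: sqrt(p * q / n), with q = 1 - p."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    q = 1 - p
    return math.sqrt(p * q / n)

# Hypothetical example: a 50/50 split measured on 400 cases.
print(round(sampling_error(0.5, 400), 3))  # 0.025
```

Quadrupling n halves the sampling error, which is the square-root relationship the following slides build on.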
Sampling error
The above discussion assumes, of course, that we know the population parameters….
But like everything else so far in this class, we actually don't know any of the population parameters. That's why we are conducting research!! So, we have to estimate them from samples.
Confidence and significance
Recall from our discussion on confidence levels that we can define a 95 percent confidence interval as p=0.95, with alpha = 0.05. Alpha=1-p, p=1-alpha. We can define either the significance level or the confidence level. They are mathematically identical. Different people use different methods, but be careful in your language.
Back to…..CONFIDENCE INTERVALS!
We begin by specifying a confidence level. We have sometimes described the confidence level as the complement of the significance level. If we want to be 95 percent confident that our true population parameter falls within some interval, we construct a 95 percent confidence interval, or choose a 5 percent significance level.
Significance levels
Let‟s define the error which we will “tolerate” in our sample in terms of probability:
$$ P(|\bar{x} - \mu| \le e) = 1 - \alpha $$
Or:
$$ P(|\bar{x} - \mu| \le e) = p $$
Where e is the “margin of error”, alpha is the significance level.
Margin of error
So in the “typical” 95 percent confidence level/5 percent significance level, we are looking at:
$$ P(|\bar{x} - \mu| \le e) = 0.95 $$
We read this as telling us that, for a margin of error “e”, we are 95 percent confident that the true population value mu lies within e of the sample statistic x-bar.
Margin of error
When you read a poll or survey in a report and are told the “margin of error”, you only know “e”. If they do not also tell you the “confidence level” you can assume it is at the 95 percent confidence level. Consider this report from the Philadelphia Inquirer about their poll on likely votes for Street/Katz in the mayor's race…..a good example of actually explaining to readers what the poll is!
Philadelphia voters survey
“This poll was conducted by Mason-Dixon Polling & Research, Inc. of Washington, D.C. from October 21 through October 23, 2003. A total of 800 registered city voters were interviewed by telephone. All stated they were likely to vote in the November general election. The margin for error, according to standards customarily used by statisticians, is no more than plus or minus 3.5 percentage points. This means that there is a 95 percent probability that the "true" figure would fall within that range if all voters were surveyed. The margin for error is higher for any subgroup, such as a racial or party grouping.”
Philadelphia voters survey
Let's look at those numbers more carefully…hinting toward our calculations.
N=number surveyed, in this case 800
95 percent confidence level
Margin of error of +/- 3.5 points.
How did they get these numbers?
Philadelphia voters survey
In this case, they use a well known result that, in large surveys, the margin of error is approximately equal to:
$$ \frac{1}{\sqrt{n}} $$
This square root of n on the bottom shows up a lot in statistics!!!!!
Philadelphia voters survey
Since n=800, square root of n≈28.28. 1 divided by 28.28 is ≈ 0.035
But what if we wanted to be 99 percent confident???????
More on the “margin of error”
The margin of error is, simply, half the width of the confidence interval. For example, suppose I survey and find 50 percent indicate they will vote for a candidate. If the “margin of error” is +/- 4 points, then the 95 percent confidence interval is [46,54]. That is, we are 95 percent confident that the true percentage of voters intending to vote for this candidate lies within a range from 46 to 54.
Back to confidence intervals
Do you remember the formula….Oh joy!
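$$ P\left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha $$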
Confidence intervals
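Let's make the formula above into a 95 percent confidence interval. (Anything here look familiar?)
$$ P\left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right) = 0.95 $$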
Confidence intervals
Let's substitute in what we already know
$$ P\left( \bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}} \right) = 0.95 $$
Let’s rewrite this in terms of our “margin of error” idea (plus or minus)
$$ P\left( \mu \in \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}} \right) = 0.95 $$
Confidence intervals and margin of error
Thus….by the “magic” of probability theory, the margin of error, e, is simply:
$$ e = z_{\alpha/2} \frac{\sigma}{\sqrt{n}} $$
$$ e = 1.96 \frac{\sigma}{\sqrt{n}} $$
For 95 percent confidence
Margin of error
Given the formula above, you can:
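Figure out the margin of error, given n
Figure out the required sample size, n, for a targeted margin of error
$$ e = z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \quad\Longrightarrow\quad n = \frac{z_{\alpha/2}^{2} \, \sigma^{2}}{e^{2}} $$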
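The sample size formula can be sketched in Python as follows. The function name `required_sample_size` is illustrative, and the default sigma = 0.5 is an assumption of this sketch (the worst case for a yes/no survey question); pass your own sigma if you have an estimate.

```python
import math

def required_sample_size(e, sigma=0.5, z=1.96):
    """Smallest n with z * sigma / sqrt(n) <= e, i.e. n = (z * sigma / e)^2 rounded up."""
    if e <= 0:
        raise ValueError("margin of error must be positive")
    return math.ceil((z * sigma / e) ** 2)

# Respondents needed for a +/- 5 point margin at 95 percent confidence
# (sigma = 0.5 is assumed, not given by the formula itself).
print(required_sample_size(0.05))  # 385
```

Rounding up is a deliberate choice: rounding down would leave the margin of error slightly above the target.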
Margin of error
But what about sigma??? A neat result for proportions: when the variable is a yes/no response, the population variance (sigma squared) = p(1-p) (or p x q).
Now if we don't know p and q, we can place an upper bound on the variance….what is the highest it could be?
Margin of error
What value of p maximizes the value p(1-p)? p=0.5!!....which leads to p(1-p)=0.25 So, putting it all together and substituting:
$$ e = 1.96 \sqrt{\frac{0.25}{n}} = \frac{0.98}{\sqrt{n}} \approx \frac{1}{\sqrt{n}} $$
Margin of error
Thus, in the most simple cases, the margin of error in a sample, with 95 percent confidence, is 1/square root of number of respondents.
SAMPLE SIZE    MARGIN OF ERROR
10             0.316
20             0.224
30             0.183
40             0.158
50             0.141
60             0.129
70             0.120
80             0.112
90             0.105
100            0.100
150            0.082
200            0.071
250            0.063
300            0.058
400            0.050
500            0.045
600            0.041
700            0.038
800            0.035
900            0.033
1000           0.032
1250           0.028
1500           0.026
1750           0.024
2000           0.022
3000           0.018
4000           0.016
5000           0.014
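The table values follow from the 1/sqrt(n) rule of thumb. The sketch below recomputes a few rows and compares them with the fuller formula z * sqrt(p(1-p)/n); the function name `margin_of_error` is illustrative, and z = 2.576 is the standard normal value for 99 percent confidence.

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 400, 800, 1000):
    rule_of_thumb = 1 / math.sqrt(n)              # value shown in the table
    at_95 = margin_of_error(n)                    # 95 percent, worst case p = 0.5
    at_99 = margin_of_error(n, z=2.576)           # 99 percent confidence
    print(f"{n:5d}  {rule_of_thumb:.3f}  {at_95:.3f}  {at_99:.3f}")
```

For the same n, asking for 99 percent confidence widens the margin of error, since the multiplier rises from 1.96 to 2.576.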
[Figure: Diminishing Returns to Sample Size Increases. Margin of error (vertical axis, 0.000 to 0.350) plotted against sample size (horizontal axis, 0 to 5000).]
Sampling
Once you have figured out the sample size you need to get the precision you want, you need to go about constructing your sampling frame and choosing your sampling method. You need to figure out:
Can I construct a list of all possible respondents?
Tradeoffs between cost, coverage, and precision
Simple Random Sampling
If you can make a complete list of your target population then you can use simple random sampling (SRS). SRS assigns a random number to each of the units in a population, and then uses a random number generator to choose the appropriate number of respondents for analysis. Researchers until recently have avoided this procedure because it took a lot of time to look up random numbers in a book.
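With a complete list in hand, simple random sampling takes only a few lines. This is a sketch using Python's standard `random` module; the frame of household IDs and the function name `simple_random_sample` are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(frame), n)

# Hypothetical frame: household IDs 1..1000.
households = range(1, 1001)
sample = simple_random_sample(households, 50, seed=42)
print(len(sample), len(set(sample)))  # 50 50
```

Sampling is without replacement, so no household can be selected twice.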
Systematic sampling
Every kth unit in a list is selected for inclusion in the sample.
To be effective, there needs to be no systematic order in the list. Empirically, virtually identical to simple random sampling.
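A sketch of systematic sampling in Python, assuming the common variant with a random start inside the first interval and k = (frame size) // (sample size). The lecture does not specify how the start is chosen; the function name is illustrative.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th unit after a random start, where k = len(frame) // n."""
    units = list(frame)
    k = len(units) // n
    if k < 1:
        raise ValueError("sample size exceeds frame size")
    start = random.Random(seed).randrange(k)  # random start in 0..k-1
    return units[start::k][:n]

# Hypothetical frame of 1000 list positions; every 20th unit is selected.
picked = systematic_sample(range(1000), 50, seed=1)
print(len(picked), picked[1] - picked[0])  # 50 20
```

If the list repeats in a cycle whose length matches k, every selected unit shares the same position in the cycle, which is why the ordering of the list matters.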
Stratified Sampling
“The grouping of units composing the population into homogeneous groups (strata) before sampling.” May be used in conjunction with simple random sampling, systematic, or cluster sampling. Improves representativeness, in terms of the stratification variables
Stratified sampling
Choice of stratification variables
Depends on what is available Depends on what variables you think matter for responses, and which variables you are interested in In most socio-economic-demographic type research conducted by planners, obvious candidates are race, income, education, gender, location, etc.
Stratified sampling
Groups/strata need to have:
within-group homogeneity between-group heterogeneity
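One way to sketch stratified sampling in Python, assuming proportional allocation (the same sampling fraction in every stratum) and simple random sampling within each stratum. The lecture does not prescribe an allocation rule; the function name and the example frame are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum, using SRS within strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # group units by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical frame of (person_id, location) records: 100 city, 300 suburb.
frame = [(i, "city" if i % 4 == 0 else "suburb") for i in range(400)]
chosen = stratified_sample(frame, stratum_of=lambda u: u[1], fraction=0.1, seed=7)
print(sum(1 for u in chosen if u[1] == "city"),
      sum(1 for u in chosen if u[1] == "suburb"))  # 10 30
```

The sample reproduces the 1:3 city/suburb split of the frame exactly, which is the sense in which stratification improves representativeness on the stratification variable.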
(Multistage) Cluster sampling
“A multistage sampling in which natural groups (clusters) are sampled initially, with the members of each selected group being subsampled afterwards.”
Used when impractical to compile an exhaustive list
A multistage process of listing and sampling, listing and sampling, listing and sampling…
(Multistage) Cluster sampling
Highly efficient, but tradeoff is in accuracy
Subject to errors at each of the stages.
For a given sample size, if the number of clusters increases, then the number of elements selected from each cluster decreases.
General guidelines: maximize # of clusters while decreasing elements. Balance these guidelines with administrative and cost constraints.
General guideline: Population researchers aim at selecting 5 households per census block.
Probability Proportionate to Size Sampling
PPS (Probability proportionate to size sampling) refers to multistage cluster samples in which clusters are selected, not with equal probabilities but with probabilities proportionate to their size.
Disproportionate Sampling and Weighting
You may want to sample certain subpopulations disproportionately, especially if you believe that subgroup numbers will be so small as to make statistical analysis difficult. Many national level surveys “oversample” minority households Overall analysis thus involves assigning weights to disproportionate observations.
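A small sketch of how weighting undoes oversampling, assuming the common choice of weight = population share / sample share for each group. The groups, shares, and responses are invented for illustration.

```python
def weighted_mean(values, weights):
    """Weighted average: sum(value * weight) / sum(weight)."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical: group A is 10 percent of the population but half the sample.
# All of group A answers 1 ("yes"), all of group B answers 0 ("no").
responses = [1, 1, 1, 0, 0, 0]
weights = [0.1 / 0.5] * 3 + [0.9 / 0.5] * 3  # population share / sample share

unweighted = sum(responses) / len(responses)
print(round(unweighted, 2), round(weighted_mean(responses, weights), 2))  # 0.5 0.1
```

The unweighted figure overstates "yes" because group A is oversampled; the weighted figure recovers group A's 10 percent population share.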
Sampling and surveys
Sampling can be used for research design for almost any research question. For example, you can collect samples in various environmental and ecological fields. The use of sampling most common to planning would be selecting respondents to administer a survey.
Surveys in planning
“Survey before Plan” -- Geddes
Survey techniques play important role in history of planning
The “Booth Survey” of poverty in east London
W.E.B. Du Bois in Philadelphia
Surveys in planning
When we wish to find out something about the people/place for whom we are planning, and we don't have the data already available, we need to collect original data. We conduct surveys to:
Understand important issues
Identify public opinion
Measure attitudes
Justify, and cloak in the legitimacy of public participation, what public officials have already decided ahead of time should occur
Survey research
In planning applications, much socioeconomic and demographic data is already widely available, both for analysis, and to serve as a baseline for survey research Surveys are most frequently used, then, to gauge public attitudes and public opinions around planning related issues. In the textbook world, you conduct a detailed survey before a comprehensive planning process.
Surveys - process
Begin with a definition of your research question: what do you want to know or find out?
Operationalize your research questions with a questionnaire.
Writing clear and coherent questions is crucial.
Decide what type of questions to include.
Pre-test your questionnaire
Surveys - process
Decide:
Desirable margin of error, sample size Form of sampling Method of delivery Acceptable response rate
Administer questionnaire Monitor Analyze
Types of questions
Open-ended. Questions for which the respondent is asked to provide his or her answers. Closed-ended. Questions in which the respondent is asked to select an answer from among a list provided by the researcher. A good survey should include both, although there are strengths and weaknesses to both.
Closed-ended questions
Easier to code for data analysis. Response categories must be:
Exhaustive Mutually exclusive
Open-ended questions
Allows respondents to give their own answers
May uncover relationships and attitudes of which the researcher was unaware before the study
Difficult to code and analyze
Frequently, researchers will code open-ended questions for analysis.
Writing questions
Define terms
e.g “We define working full time if you work more than 35 hours per week.”…
Make items clear and concise as possible Use as neutral language as possible Avoid double-barreled questions Avoid negatives Keep each question to one concept
More on survey design
Respondents must be competent to answer. Surveys should be designed in such a way as to elicit truthful and honest answers. Make respondents comfortable to express views which may be considered socially undesirable and/or about subjects of personal matters. Assure confidentiality Avoid biased items and terms
More on survey design
Question order matters. When different people read your question, they should all basically understand the same meaning from the same wording. When different people respond to your question, the same answers should mean relatively the same thing.
IMPORTANT
PRETEST YOUR QUESTIONNAIRE!!! PRETEST YOUR QUESTIONNAIRE!!! PRETEST YOUR QUESTIONNAIRE!!! PRETEST YOUR QUESTIONNAIRE!!! PRETEST YOUR QUESTIONNAIRE!!! PRETEST YOUR QUESTIONNAIRE!!!
Types of questionnaires
Self-administered. Respondents themselves actually complete the questionnaire. The most common form is the mail-out, mail-back survey.
Interviewer-administered. An interviewer asks questions of the respondents, and records the respondent's answers.
Face-to-face interview Telephone interview
Relatively new: computer administered.
Each type is subject to its own biases and response rates.
Mail surveys
Send letter of introduction, survey and method to mail in to researcher. Provide mechanism for reminder card, and follow up to improve response rate. 3 mailings seems most efficient. 2-3 weeks intervals between mailings.
Response rate
Whenever you are reporting a survey, you should ALWAYS include your response rate. The response rate is the number of people who respond to a survey divided by the number selected to be in the sample. What is adequate? Babbie's rules of thumb:
50%= adequate, 60%= good, 70%= very good
I think he's being a little optimistic. However, for surveys being conducted with outside funding sources, there may be standards for response rates. The federal government has some standards which require 80-85 percent response rate in surveys.
Face to face interviews
Higher response rates, but most costly. Some national level surveys use trained face to face interviewers.
Telephone interviews
Random digit dialing on computers reduces the biases of unlisted numbers. Increasing public frustration with fake surveys leads to increasing refusal rates. Safer and cheaper than face to face interviews.
Strengths and weaknesses of surveys
Strengths:
Surveys make large samples possible Surveys can measure public attitudes and perceptions Standardization allows statistical analysis Surveys are widely used in America
Weaknesses
Standardized questionnaires can be superficial and complex, cater to lowest common denominator. Ignores social context of life, can be artificial Subject to a number of potential biases.
While good survey design can reduce potential biases, you cannot completely eliminate all potential biases.
Potential biases in surveys
Selection bias:
Some portion of targeted population is not in sample
Non-response bias:
Does the distribution of characteristics of non-respondents equal that of respondents?
Measurement bias:
When the measuring instrument or survey design systematically over- or underestimates.
Potential biases in surveys
People:
Misrepresent facts, overstate income, etc.
Don't remember
Understand different words to mean different things
Are confused
Don't want to give a socially undesirable answer
Potential biases in samples
Hypothetical choice bias:
Respondents are asked what they would do or would like to do, without having to actually choose.
Survey respondents frequently say that they would like more services and lower taxes….
Utilizing national surveys
There are a number of nationwide, well designed surveys which may be of interest and use in planning and research. The caution is: although they are representative of the nation as a whole, there may not be a sufficient sample size to generalize about your town/city/county/region.
National surveys of interest to planners
American Housing Survey.
http://www.census.gov/hhes/www/ahs.html
American Community Survey.
http://www.census.gov/acs/www/
Current Population Survey.
http://www.bls.census.gov/cps/cpsmain.htm
National household travel survey
http://www.bts.gov/nhts
Statistics
The foundation of statistics is probability. May be helpful to review some probability theory.
The probability of an event is the relative likelihood or relative frequency of an event, defined over a sample space of all possible outcomes. A “random variable” is a variable whose values occur at random, that is from random draws of a probability distribution. Random doesn't mean that values can't be known or predicted. Variables are things such as age, income, etc. Random simply means they come from some underlying probability distribution.
Statistics on the AICP exam
Most often, it is knowing which test or which procedure to use when. Occasionally, some of the less well known statistical tests are thrown in as wrong answers to distract you. Half of the battle is just learning the terms, and recognizing the right situation to use which statistic
Statistics/Probability help on the web
http://davidmlane.com/hyperstat/index.html
http://www.tufts.edu/%7Egdallal/LHSP.HTM
http://espse.ed.psu.edu/statistics/Investigating.htm
http://trochim.human.cornell.edu/
http://bmj.bmjjournals.com/statsbk/
http://www2.sjsu.edu/faculty/gerstman/StatPrimer/
http://www.businessbookmall.com/free-stuff-statistics.htm
http://www.stats.gla.ac.uk/steps/glossary/presenting_data.html
Probability distributions
Generally, a probability distribution is a plot of the probabilities, with the values on the horizontal (x) axis and the associated probabilities on the vertical (y) axis. Probability distributions can be:
Discrete: Only take on a countable number of values (examples: Binomial, Poisson)
Continuous: probabilities are assigned to a range of continuous values (examples: normal, Student's t, Chi-squared, F).
[Figure: Binomial Distribution, p=0.5, n=20]
[Figure: Normal curve, mean=0, sigma=0.5]
[Figure: Normal curve, mean=0, sigma=1]
As you increase the standard deviation, the distribution becomes more spread out.
Probability and Statistics
Why waste all this time with probability? Because “statistics” are merely estimated parameters of probability distributions. Statistics, then, are ways of summarizing the information in a probability distribution or a random variable. Two main types of statistics
Descriptive Inferential
Descriptive Statistics
A descriptive statistic is any summary of the values of a set of data (variable) or a sample. That is, they are calculated as ways to “describe” or “summarize” your data. 4 main types
1. Percentiles and quartiles 2. Measures of Central Tendency 3. Measures of Dispersion or Variability 4. Measures of distribution shape.
Descriptive Statistics
1. Percentiles and quartiles.
The p-th percentile is the value of a given distribution such that p-percent of the distribution is less than or equal to that value.
$$ \Pr(X \le x) = p $$
Most commonly used are the 25th percentile (1st quartile) and the 75th percentile (3rd quartile) The 50th percentile has more common name: the MEDIAN
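The definition above can be turned directly into code. This sketch uses the "nearest rank" rule (the smallest data value with at least p percent of the data at or below it), which is one of several conventions; statistical packages usually interpolate between values and can give slightly different answers. The income figures are hypothetical.

```python
import math

def percentile(data, p):
    """Nearest-rank percentile: smallest value with at least p percent of data <= it."""
    ordered = sorted(data)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

incomes = [22, 25, 31, 38, 40, 47, 52, 60]  # hypothetical, in $1,000s

print(percentile(incomes, 25), percentile(incomes, 50), percentile(incomes, 75))  # 25 38 47
```

With an even number of observations, the conventional median averages the two middle values (here 39), so the nearest-rank 50th percentile of 38 differs slightly.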
Descriptive Statistics
2. Measures of Central Tendency (how “centered” is your distribution).
Most common is the MEAN, also called the “AVERAGE.” Sometimes written as $\bar{x}$. To derive the mean, add up all values and divide by the number of values.
$$ \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} $$
or
$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} $$
Descriptive Statistics
MODE is simply the most frequently occurring value. (the tallest bar on a histogram).
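The three measures of central tendency can be compared on one small data set with Python's standard `statistics` module. The household sizes are invented for illustration.

```python
import statistics

# Hypothetical household sizes.
household_sizes = [1, 2, 2, 3, 4, 6]

mean_size = statistics.mean(household_sizes)      # sum of values / number of values
median_size = statistics.median(household_sizes)  # 50th percentile (middle value)
mode_size = statistics.mode(household_sizes)      # most frequently occurring value

print(mean_size, median_size, mode_size)
```

Here the mean is 3, the median is 2.5, and the mode is 2: the single large household pulls the mean above the other two measures.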
Descriptive Statistics
3. Measures of Dispersion or Variability (how spread out your distribution is)
IQR (inter-quartile range): the range between the 25th percentile and the 75th percentile
Standard Deviation and Variance
Variance = Standard deviation squared
Standard deviation = square root of variance
Variance = σ^2; standard deviation = σ
Variance
No, not that much abused land use phenomenon from the ZHB, but a measure of how spread out the data are around the mean:
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
This is the “sample variance”. The “population variance” has n instead of n-1.
Calculate the standard deviation by hand?
Suppose you are stuck on a desert island or in an AICP or a GRE exam. Although rare, a standard deviation question has come up. Remember, since it's multiple choice, you only need to approximate the standard deviation. Make 3 columns by hand. Column 1 is the numbers, column 2 is deviations and column 3 is “squareds.”
1. Calculate the mean (add all together and divide by n)
2. Subtract the mean from each number
3. Square each deviation
4. Sum up the squared deviations
5. Take value from Step 4 and divide by (n-1). This is the variance.
6. Take the square root of the value from Step 5 to get the standard deviation.

Numbers        Deviation from Mean    Squared
2              -3                     9
6              1                      1
8              3                      9
4              -1                     1
5 <== Mean                            20 <== Sum

$$ s^2 = 20/(n-1) = 20/3 = 6.67 $$
$$ s = \sqrt{6.67} \approx 2.58 $$
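The three-column hand calculation can be checked with a few lines of Python. This sketch uses the same four numbers (2, 6, 8, 4) and mirrors the columns one by one; the variable names are illustrative.

```python
import math

numbers = [2, 6, 8, 4]

mean = sum(numbers) / len(numbers)            # column 1 total / n
deviations = [x - mean for x in numbers]      # column 2: number minus mean
squared = [d ** 2 for d in deviations]        # column 3: "squareds"

sample_variance = sum(squared) / (len(numbers) - 1)  # divide by n - 1
sample_sd = math.sqrt(sample_variance)               # standard deviation

print(mean, sum(squared), round(sample_variance, 2), round(sample_sd, 2))
# 5.0 20.0 6.67 2.58
```

Dividing by n instead of n - 1 would give the population variance, 20/4 = 5.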
Coefficient of Variation
Standard deviation divided by the mean (a mean-standardized standard deviation)
$$ CV = \frac{s}{\bar{x}} $$
Inferential Statistics
These are statistics which allow us to draw conclusions about the data or to test hypotheses. Generally in the framework:
Confidence intervals Hypothesis Testing
Confidence Intervals
Because our observations and samples are “random” variables, there is always the possibility that the value we calculate (e.g. the mean) is not actually true, but a fluke. We can use probability theory to establish how confident we are (confidence level) that the true value lies within a specified range (confidence interval). Most common is a 95 percent confidence interval. This means that if we were to take 100 samples (instead of 1) and build an interval from each, we would expect about 95 of those 100 intervals to capture the “true” population mean.
Confidence Intervals
Key point: Confidence intervals are SAMPLE SIZE DEPENDENT. As the sample size increases, the interval in which we are p-percent confident (e.g. 95 percent confident) gets narrower and narrower. I can't imagine the AICP exam would ask you to calculate a confidence interval, so I won't bother with the formula.
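The sample size dependence is easy to see numerically. This sketch reuses the 95 percent interval from the sampling section (mean plus or minus 1.96 times sigma over the square root of n) and assumes the standard deviation is known or well estimated; the function name and the numbers are hypothetical.

```python
import math

def confidence_interval_95(mean, sd, n):
    """95 percent interval for a mean: mean +/- 1.96 * sd / sqrt(n)."""
    half_width = 1.96 * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Same hypothetical mean (50) and standard deviation (10), growing sample size.
for n in (25, 100, 400):
    low, high = confidence_interval_95(50, 10, n)
    print(n, round(low, 2), round(high, 2))
```

Each fourfold increase in n halves the width of the interval.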
Hypothesis Testing
We formulate and test hypotheses about the data.
1 sample hypothesis testing. Most common is to ask, does this variable equal some value. E.g. We hypothesize that the value of the mean I.Q. of practicing planners is 120 2 sample hypothesis testing. We want to compare two samples or two variables. (Can be that the two values are equal or that one is higher). E.g. We hypothesize that the mean I.Q. of practicing planners is not equal to the mean I.Q. of local township supervisors. Or, we hypothesize that the average income of town X is greater than the average income of town Y.
Hypothesis testing
6 steps in all hypothesis testing
1. Formulate “null hypothesis” H0
2. Formulate “alternative hypothesis” Ha
3. Identify correct test statistic. This is 90 percent of the difficulty.
4. Identify rejection region.
5. Calculate test statistic
6. Compare to rejection region. Either:
Reject the Null and accept the Alternative
Accept the Null or, more precisely, “fail to reject the null”
Choosing the right test statistic
If the null hypothesis is true, how is the test statistic distributed (from what probability distribution does the test statistic come)? There are, most commonly, only 4 types of test statistics. The names of the test statistics come from the probability distribution of the test statistic. The confusion comes in that some of the test statistics have people's names associated with them, and different disciplines call them different names.
E.g. The Pearson Chi-Squared Test. Sometimes people simply call this “The Chi-squared test.” But there are lots of “Chi-squared tests.” Any test statistic which is distributed as a “chi-squared” probability function is a “chi-squared test”!
Test statistics
1. Z-test. When the test statistic is distributed as a STANDARD NORMAL.
You only use this when you already know the population standard deviation (that is, you don't estimate the standard deviation from the sample). Rarely used except in statistics classes as an illustration.
Test statistics
2. t-test. When the test statistic is distributed as a STUDENT'S t-distribution.
Used for 1 sample tests of hypothesis that estimated parameter takes on certain values. Mostly used to test regarding an estimated mean or a regression coefficient. Examples:
Is the mean housing price in Upper Pothole township = $167,000?
Is the mean income in Lower Podunk township > $50,000?
Is the estimated regression coefficient = 0? We generally test all regression coefficients with H0 that = 0. This is also called a “statistical significance” test, or more precisely a test that the estimated coefficient is statistically significantly different from zero.
T-test continued
Used in 2 sample means test. (sometimes called the “comparison of means test”). The question being asked is whether data drawn from two samples shows statistically significant differences. Examples:
H0: Mean income Bucks = Mean income Chester
H0: Mean income 'burbs > mean income city.
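To make the comparison of means concrete, here is a sketch that computes a two-sample t statistic by hand. It uses Welch's version, which does not assume the two samples have equal variances; the lecture does not say which variant is intended, so that choice is an assumption. The income figures are invented.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """t statistic for a difference in means, not assuming equal variances."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / se

# Hypothetical household incomes in $1,000s for two counties.
county_a = [52, 55, 61, 58, 54]
county_b = [48, 50, 47, 53, 52]

print(round(welch_t(county_a, county_b), 2))  # 3.08
```

The statistic is then compared with the rejection region from a t table (or a p-value from a statistical package) to decide whether to reject the null of equal means.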
Interruption:Types of variables
The t-test and z-test are used on continuous variables (variables that can take on a continuous range of values). What about variables which are CATEGORICAL? That is, they take on values according to certain categories. Examples: marital status, region, educational attainment, what type of car do you drive, responses to most survey questions.
Interruption:Types of variables
Categorical data can be of two types:
1. Ordinal: the categories can be placed in an ordered relationship. E.g. Educational attainment: no college, some college, college, graduate school. Likert-scales on surveys. "strongly agree", "somewhat agree" "neither agree nor disagree”, “somewhat disagree”, “strongly disagree.” 2. Nominal (Non-ordinal). No logical ordered relationship. E.g. What region you live in (Northeast, South, etc.) What type of car you drive (Ford, Hyundai, Toyota, GM)
Test statistics
3. Chi-squared tests. The most commonly used in hypothesis testing is the Pearson Chi-squared test, commonly called simply a chi-squared test. First, we summarize two categorical variables through a “cross-tabulation table,” also called a “pivot table” or a “contingency table”. You then test whether there is a relationship between these two variables. The null hypothesis is no relationship or “independence”
E.g. Is there a (statistical)relationship between region and educational attainment Is there a relationship between perceptions of environmental quality and car-type ownership
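The Pearson chi-squared statistic for a cross-tabulation can be sketched in plain Python: compare each observed cell count with the count expected under independence, (row total x column total) / grand total. The function name and the 2x2 table are hypothetical.

```python
def pearson_chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical cross-tab: rows = region (2 categories), columns = college degree (yes/no).
crosstab = [[30, 20],
            [20, 30]]
print(pearson_chi_squared(crosstab))  # (4.0, 1)
```

With 1 degree of freedom the 5 percent critical value is 3.841, so a statistic of 4.0 would reject the null of independence at that level.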
Test statistics
4. F-test. (Not used often, unlikely to be on the AICP exam.) When the test statistic is distributed as an F-distribution.
When the test compares if the VARIANCES of two samples are equal.
Used in ANOVA tests of proportion of variance.
Used in regression as a joint test that all the coefficients are statistically significant.
Testing relationships
Frequently, we want to test the relationship between two variables. Most common is CORRELATION. That is, do the two variables MOVE IN THE SAME DIRECTION.
Positive correlation. As one variable increases, the other increases Negative correlation. As one variable increases, the other decreases.
Correlation
Correlation measures the strength of the relationship between two variables.
Technically: it measures the linear relationship between two variables.
The most common measure of correlation is the Pearson Correlation Coefficient, usually simply called “correlation” or given by the letter “r”. It ranges from –1 (negative correlation) to 0 (no correlation) to +1 (positive correlation.) The closer the absolute value of the correlation coefficient is to 1, the stronger the correlation. The correlation coefficient does not depend on the units in which the variables are measured
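A short sketch of the Pearson correlation coefficient in plain Python, mainly to show that r does not change when the units change. The house data are invented for illustration and the function name is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: house size and sale price.
square_feet = [1000, 1500, 2000, 2500, 3000]
price = [70000, 95000, 130000, 150000, 190000]

r_dollars = pearson_r(square_feet, price)
# Same prices expressed in thousands of dollars: r should be unchanged.
r_thousands = pearson_r(square_feet, [p / 1000 for p in price])
print(round(r_dollars, 4), round(r_thousands, 4))
```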
Correlation
Most statistical packages and Excel can calculate a correlation coefficient. Usually, a “p-value” is also reported. As in hypothesis testing, a low p-value (less than 0.05) indicates statistical significance. Thus the correlation coefficient can tell you a) whether there is a linear correlation and b) the strength of that correlation. Another correlation coefficient sometimes seen is called “Spearman's” or “Spearman's Rank Correlation” or simply “Rank correlation.” It is a “nonparametric” test. (It does not assume any “parametric” structure.)
Tests of association and measures of association
A large branch of statistics, common in psychology, sociology and some biostatistics, is based on testing the association between different types of variables and measuring the strength of that association. These measures are particularly used with various forms of categorical data, depending on their structure. I include them here only because they may appear on the exam as false leads.
Tests of association
Tests of association simply measure whether the variables are statistically related. The Pearson chi-squared test from earlier is the most common TEST of association. It does not measure the STRENGTH of association. Other tests of association: Continuity-adjusted Chi-squared; Likelihood-Ratio Chi-squared
Measures of Association
Statistics which show the direction and/or magnitude of relationships between pairs of non-continuous (discrete) variables. I doubt the AICP exam will require you to know which ones are for ordinal and which ones for nominal data, so I just list them here for purposes of acquaintance. Recall that CORRELATION is the most common MEASURE of association
Measures of Association
Yule's Q
Phi
Gamma (also called Goodman-Kruskal Gamma)
Cramér's V
Tau-b (also called Kendall's tau-b)
Tau-c (also called Stuart's tau-c)
Somers' d
Other “obscure” tests/statistics
The following are also test statistics which are either highly specialized to particular applications/fields, or are “non-parametric” and thus would only show up as false answers. Durbin-Watson test, Wald test, Likelihood ratio test, Moran's I, Wilcoxon signed rank test, Bonferroni correction, mean squared error (MSE), mean absolute deviation (MAD), Mann-Whitney test, Fisher's exact, Kaplan-Meier, Breinich-Slusser, Breslow-Gehan
Regression
There is a virtually limitless number of different “types” of regression, such that the term itself is quite vague. Often, though, when people say “regression” they simply mean “linear regression.” Linear regression is, informally, a line fitted between two (or more) variables to estimate the linear relationship between them. The line is “fit” according to some criterion such that it is a “best fit.” The most common criterion is to minimize the sum of squared errors.
[Figure: PRICE (vertical axis, 50,000 to 250,000) plotted against SQUARE FEET (horizontal axis, 500 to 4,000), with fitted line y = 4781.9 + 61.367x]
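For a single predictor, the “minimize the sum of squared errors” criterion has a closed-form answer. The sketch below fits such a line in plain Python. The three data points are invented so the arithmetic is easy to follow; they are not the data behind the price and square-feet figure, and the function name is hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared vertical distances between the data and the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical houses: size in square feet and sale price in dollars.
square_feet = [1000, 2000, 3000]
price = [70000, 125000, 195000]

intercept, slope = fit_line(square_feet, price)
best_sse = sum_squared_errors(square_feet, price, intercept, slope)
# Any other line, e.g. one with a slightly different slope, fits worse.
other_sse = sum_squared_errors(square_feet, price, intercept, slope + 1)
print(intercept, slope, best_sse < other_sse)
```

Here the fitted line is price = 5000 + 62.5 × square feet.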
Regression
In a regression analysis, we try to “explain” the variation in a DEPENDENT variable with a number of INDEPENDENT variables. These variables are known by different names: DEPENDENT = OUTCOME = LEFT HAND SIDE variable INDEPENDENT = PREDICTOR = RIGHT HAND SIDE variable
Regression
A regression equation simply looks like this:
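$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k + \varepsilon$
$y$: DEPENDENT VARIABLE
$\beta_0, \beta_1, \dots, \beta_k$: COEFFICIENTS
$x_1, \dots, x_k$: INDEPENDENT/PREDICTOR VARIABLES
$\varepsilon$: ERROR TERM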
Regression
Overly general approach:
Is the DEPENDENT variable continuous?
YES=proceed with standard regression (OLS) NO=must utilize another form of regression.
If Dependent variable is a (0,1) variable use Probit or Logit If Dependent variable is count data, use Poisson regression
Perform standard linear regression (OLS) Utilize diagnostic tests to see if:
Regression assumptions are violated
If violated, perform various corrections
Regression
Regression Output:
1. Measures of regression's “significance”
2. Measures of regression's “goodness of fit”
3. Estimated coefficients and associated “standard errors”
4. Predicted values and residuals
EXAMPLE REGRESSION OUTPUT FROM EXCEL, DEPENDENT VARIABLE = PRICE

SUMMARY OUTPUT

Regression Statistics (MEASURES OF GOODNESS OF FIT)
Multiple R           0.8682
R Square             0.7538
Adjusted R Square    0.7450
Standard Error       19212.5385
Observations         117

ANOVA (Regression Significance)
              df     SS              MS             F              Significance F
Regression    4      126547844717    31636961179    85.70876957    0.00
Residual      112    41341623146     369121635.2
Total         116    167889467863

Coefficients and Standard Errors
              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     15101.41       7625.94          1.98     0.05      -8.40       30211.22
Square Feet   52.54          4.22             12.46    0.00      44.19       60.90
Age           -398.52        158.48           -2.51    0.01      -712.54     -84.51
Features      2034.89        1401.00          1.45     0.15      -741.02     4810.80
CornerLot     14506.00       4948.04          2.93     0.00      4702.09     24309.91
Output 1. Regression Significance. An F-test is performed of the null hypothesis that all the coefficients are zero (insignificant.) Higher F-test statistics imply greater degrees of overall regression significance. The significance of F, also called its p-value should be less than 0.05. A very low F or a significance greater than 0.05 implies a very poor regression.
ANOVA
              df     SS              MS             F              Significance F
Regression    4      126547844717    31636961179    85.70876957    0.00
Residual      112    41341623146     369121635.2
Total         116    167889467863
Output 2. “goodness of fit.” The most commonly accepted goodness of fit measure is R-squared, sometimes called the “coefficient of determination.” It measures the proportion of variation in the DEPENDENT variable explained by a linear-combination of the INDEPENDENT variables. As such, it measures how good your regression has “explained” the DEPENDENT variable. Adjusted R squared adjusts for the number of independent variables.
Regression Statistics
Multiple R           0.8682
R Square             0.7538
Adjusted R Square    0.7450
Standard Error       19212.5385
Observations         117
My regression has “explained” about 75 percent of the variation of house prices in my sample.
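The goodness-of-fit and significance numbers in the Excel output can all be recomputed from the ANOVA sums of squares. The sketch below does that in Python using the figures from the example output above; the variable names are my own, and the adjusted R-squared formula used is the standard one, 1 - (1 - R²)(n - 1)/(n - k - 1).

```python
import math

# Values taken from the example Excel output (dependent variable = PRICE).
ss_regression = 126547844717
ss_residual = 41341623146
n_obs = 117        # observations
k = 4              # independent variables (regression df)

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)

ms_regression = ss_regression / k
ms_residual = ss_residual / (n_obs - k - 1)
f_stat = ms_regression / ms_residual
std_error = math.sqrt(ms_residual)   # "Standard Error" of the regression

print(round(r_squared, 4), round(adj_r_squared, 4),
      round(f_stat, 2), round(std_error, 2))
```

The results match the reported R Square (0.7538), Adjusted R Square (0.7450), F (about 85.71) and Standard Error (about 19212.54).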
Regression output
Estimated coefficients and associated standard errors.
Interpretation of coefficients: How much change (on average, measured in the UNITS of the DEPENDENT variable) in the DEPENDENT variable would be associated with a 1-unit increase in THIS INDEPENDENT variable, holding all else constant. Standard errors are then used to generate t-tests of the null hypothesis that THIS coefficient is zero. If you reject the null hypothesis that this coefficient is zero, you conclude that this coefficient is “significant” or, more precisely, “statistically significant.”
Output 3. Determine which coefficients are statistically significant. If p-value < 0.05, then the coefficient is significant. Significant coefficients are highlighted.
              Coefficients   Standard Error   t Stat   P-value
Intercept     15101.41       7625.94          1.98     0.05
Square Feet   52.54          4.22             12.46    0.00
Age           -398.52        158.48           -2.51    0.01
Features      2034.89        1401.00          1.45     0.15
CornerLot     14506.00       4948.04          2.93     0.00
Step 2. Interpret significant coefficients. Example: Holding all else constant, a 1-unit increase in square feet (in this case, a 1-square-foot increase in house size) is associated with a $52.54 increase in house price. A corner lot commands a $14,506 premium.
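A small Python sketch of how the coefficients are used, based on the example output. The t statistic is simply the coefficient divided by its standard error, and a predicted price is the intercept plus each coefficient times its variable. The coding of the variables (Age in years, Features as a count, CornerLot as 0 or 1) and the example house are assumptions for illustration.

```python
# Coefficients and standard errors from the example regression output.
coefficients = {"Intercept": 15101.41, "SquareFeet": 52.54, "Age": -398.52,
                "Features": 2034.89, "CornerLot": 14506.00}
standard_errors = {"Intercept": 7625.94, "SquareFeet": 4.22, "Age": 158.48,
                   "Features": 1401.00, "CornerLot": 4948.04}

# t statistic = coefficient / standard error (small differences from the
# printed table come from rounding in the displayed values).
t_stats = {name: coefficients[name] / standard_errors[name]
           for name in coefficients}

def predict_price(square_feet, age, features, corner_lot):
    """Predicted price for a house; corner_lot is assumed to be 0 or 1."""
    return (coefficients["Intercept"]
            + coefficients["SquareFeet"] * square_feet
            + coefficients["Age"] * age
            + coefficients["Features"] * features
            + coefficients["CornerLot"] * corner_lot)

# Hypothetical house: 2,000 square feet, 20 years old, 3 features.
interior = predict_price(2000, 20, 3, 0)
corner = predict_price(2000, 20, 3, 1)
print(round(interior, 2), round(corner - interior, 2))
```

Holding everything else constant, the two predictions differ by exactly the CornerLot coefficient.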
Define research, formulate hypotheses
Gather data / collect sample
Descriptive Statistics
  1. Percentiles, quartiles, median
  2. Central tendency (mean, mode)
  3. Dispersion (variance, standard deviation, coefficient of variation)
Inferential Statistics
  Confidence Intervals
  Measures of Association
    Correlation
    For categorical variables: Phi, Cramér's V, etc.
  Hypothesis tests
    1 sample?
      Test about mean: t-test
      Test about variance: χ2 test
    2 samples?
      Continuous?
        Test about means: t-test
        Test about variances: F-test
      Categorical?
        Pearson χ2 test
REGRESSION
Demographic methods
Catching Up from Statistics
How many planning methods professors does it take to screw in a light bulb?
Answer: 3.26 +/- 1.43, with 95 percent confidence.
Demography
The scientific study of human populations, primarily with respect to their size, structure and development
For planners, we would want to include location
Demography and planning
PA MPC Article III, Sec. 301.2
“In preparing the comprehensive plan, the planning agency shall make careful…analyses of housing, demographic, and economic characteristics and trends.”
Demography
To demographers, you only do a few interesting things in your life (undergo a “demographic event”):
Be born Give birth Move Die
The rest is details, I guess…
Demography
This gives rise to the 3 major components of demographic analysis:
Fertility Mortality Migration
Demography
Each of these components can be modeled as complexly or simply as needed.
Simple e.g.: Calculate “crude” rates Complex: disaggregated regression model of migration decisions by age, race, sex, education, and economic structure
Demography - components
What influences fertility?
Age Education Economic status Rates and timing of marriage/cohabitation Religion …and many others
Demography - components
What influences mortality?
Age Gender Income/Socio-economic status Genetics Behavior Access to health care….. ….and of course many more
Demography - components
What influences migration/immigration?
Age Economic structure/opportunities Public policies Quality of life, cost of living Climate ….and many more of course!
Demography - components
You probably noticed that age plays a significant role in all three components. Demography frequently employs life-cycle/temporal models. People's probabilities of giving birth, dying, migrating, etc. exhibit clear age effects.
Demography
An important way demographers model “age” is to group all those with a similar age into a “cohort.”
Demography
Three basic types of demographic analysis used by planners
Descriptive – tools, data, and methods to describe the population of an area Trends analysis – look at how demographic data has changed over time Projections – estimates of future population and population structure
Descriptive Demography
You can access/acquire demographic data for a place from a number of sources:
PASDC: http://pasdc.hbg.psu.edu
Census: http://www.census.gov
Private vendors
Descriptive Demography
After acquiring data for a place, you can calculate all of the descriptive statistics we learned earlier in the course, and use the statistics to tell a story about a place
Descriptive demography
Demographic-specific methods
Age-Sex pyramids
Dependency ratio = (children + seniors) divided by total working-age population
  Or: non-working population divided by working population
  Some researchers specify under 15 + over 65 as “dependent”
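The dependency ratio is a one-line calculation. In the Python sketch below the age cut-offs (under 15 and 65 plus) follow the convention mentioned above, and the population counts are invented for illustration.

```python
def dependency_ratio(under_15, age_15_to_64, age_65_plus):
    """(children + seniors) divided by the working-age population."""
    return (under_15 + age_65_plus) / age_15_to_64

# Hypothetical community: counts by broad age group.
ratio = dependency_ratio(under_15=2200, age_15_to_64=6500, age_65_plus=1260)
print(round(ratio, 3), "dependents per working-age person, or",
      round(ratio * 100, 1), "per 100")
```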
Age-sex pyramids
Age-sex pyramids are graphic representations of the age and sex distribution of a population. They show the percentage of a population for each sex by age category/cohort. Typically they present data in 5-year bins, although this is dependent on your data.
Bin data – a note
A lot of demographic data we have is not individual data, but rather counts of the number of people who fall into a certain category.
Travel time (25 to 29 minutes) Age (birth to 4, 5 to 9 …)
Some of our descriptive statistics have biases when used on bin data
Bin data – descriptive statistics
Consider calculating the mean travel time for an area based on knowing the number of people who fall into each category of time intervals. How do we do it? To perform calculations, we have to make assumptions about the distribution of data within the bin.
Consider this travel time bin: 339 people in Richland Twp, Bucks Co., travel between 25 and 29 minutes to work.
Option 1: all 339 people at the mid-point, 27 minutes?
OR
Option 2: 67.8 people at each of 25, 26, 27, 28 and 29 minutes?
Bin data
How do our two options differ in contributions to overall average and standard deviation calculations? Recall the formula for an average:
$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$
Option 1 = 339*27 = 9153
Option 2 = 67.8*25 + 67.8*26 + … + 67.8*29 = 9153
Bin data
So, in this example, these two options are equivalent in their effects on the overall mean…. But what about the standard deviation?
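Ok, but what if, instead, the real data looked like this:
Non-symmetric distribution within a bin (axis from 25 to 29 minutes): 300 people at 25.8 minutes and 39 people at 28.1 minutes.
25.8*300 + 28.1*39 = 8835.9 (compared with 9153 under either option above)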
Bin data
What are we left with? We have to make assumptions about the distribution of data within bins, but we have no a priori way of determining which assumption is “best”, and we must understand that our output depends on our assumptions.
Bin data strategy
State your assumptions clearly.
Easiest solution is to assume all members of the bin share the mid-point value.
If exact results are important enough, calculate at least 3 ways and present all 3 results (sensitivity analysis).
You could choose all lower values and then all upper values for upper and lower “bounds.”
Don't believe standard deviation calculations from bin data!
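The sensitivity analysis suggested above is easy to script. This Python sketch takes the 339-person, 25-to-29-minute bin from the example and computes the mean and standard deviation under several within-bin assumptions. The helper name is hypothetical, and the standard deviation is the population version (dividing by N), which is an assumption on my part.

```python
import math

def mean_and_sd(points):
    """points is a list of (value, number_of_people) pairs.
    Returns the mean and the population standard deviation."""
    total = sum(count for _, count in points)
    mean = sum(value * count for value, count in points) / total
    variance = sum(count * (value - mean) ** 2 for value, count in points) / total
    return mean, math.sqrt(variance)

bin_count = 339  # people in the 25-to-29-minute bin

assumptions = {
    "midpoint": [(27, bin_count)],
    "uniform": [(minute, bin_count / 5) for minute in range(25, 30)],
    "skewed": [(25.8, 300), (28.1, 39)],
    "lower bound": [(25, bin_count)],
    "upper bound": [(29, bin_count)],
}

results = {name: mean_and_sd(points) for name, points in assumptions.items()}
for name, (mean, sd) in results.items():
    print(f"{name:12s} mean = {mean:6.3f}  sd = {sd:5.3f}")
```

The mid-point and uniform assumptions give the same mean (27) but different standard deviations (0 versus about 1.41), which is why standard deviations computed from bin data deserve suspicion.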
Back to age-sex pyramids
Age-sex pyramids provide a visual representation of population structure. Consider the following three national level pyramids from the U.S., Bangladesh and Germany.
Age-Sex Pyramids
You can also produce pyramids for population forecasts, to show the underlying demographic structure.
Constructing an Age-Sex Pyramid
See Excel Spread Sheet…Population Example.xls on class server Step 1. Download age-sex data from the Census Bureau website or other data server. (factfinder.census.gov)
Find variable P12 (Census 2000): Sex by age (total population) If you wanted to, you could find sex by age for different racial/ethnic groups
Age-Sex Pyramids
Step 2. Unzip downloaded data, and transpose data into columns.
Step 3. Clean data such that it is in 5-year age bins with consistent labels.
Step 4. Multiply Males by -1 (so they plot on the left side of the pyramid).
Paste in a format appropriate to Chart Wizard.
Make a stacked bar chart (the second option under bar charts in Chart Wizard).
Special Excel Trick
Reformat the axis to eliminate negative numbers. Double click on the bottom axis, choose the “number” tab, choose custom, and in the box type:
0;0
Click OK.
[Figure: Age-Sex Pyramid, Lower Macungie Township, Lehigh County, PA, 2000. MALES on the left, FEMALES on the right; age categories from 0 to 4 up to 85 plus; horizontal axis in persons, 0 to 1000 for males and 0 to 1200 for females.]
Population Example
Calculate Male/Female ratio, overall and for each age category. Graph this to show general trends. Calculate overall dependency ratio. (For example = 0.532) This means there are about 53 dependents for every 100 working-aged persons.
[Figure: MALE/FEMALE RATIO BY AGE, LOWER MACUNGIE TOWNSHIP, LEHIGH COUNTY PA, 2000. Vertical axis: Male/Female Ratio (0.4 to 1.2); horizontal axis: Age Category (0 to 4 through 85 plus).]
[Figure: MALE/FEMALE RATIO, UNITED STATES, 2000 CENSUS. Vertical axis: ratio (0.4 to 1.1); horizontal axis: age category (0 to 4 through 85 and over).]
Demography – trend analysis
Often, an analysis and visual presentations of the demographic trends in a community can tell a story and highlight significant planning issues. Recommended:
Tufte, The Visual Display of Quantitative Information
The Planner's Use of Information
Trends – quantitative analysis
One of the more common simple quantitative analyses of past trends is to calculate “rates of change” (works not just in demography) Rate formula:
$r = \frac{X_{t+1} - X_t}{X_t}$
Calculating Rates
The population of a census tract in Montgomery County grew from:
1990 population: 2371 2000 population: 3353
$r = \frac{3353 - 2371}{2371} = 0.4142$
We read this as saying the population growth rate from 1990 to 2000 was 41.42 percent
Growth rates example
If the growth rate from 1990 to 2000 was 41.42 percent, what was the average annual growth rate?
If you said 41.42 / 10 = 4.142 percent per year, that would be….WRONG – but a common mistake Why? COMPOUNDING!!!
Rates of growth (and decline)
Additional information is presented in the Word document: “Growth Rates and Decline Rates.doc”
So….
This is the formula
$g = \left(\frac{V_n}{V_0}\right)^{1/t} - 1$
So what is the average annual growth rate of population for this Montgomery Census Tract?
$g = \left(\frac{3353}{2371}\right)^{1/10} - 1 = 0.03526$
We read this as a 3.526 percent average annual growth rate.
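The Montgomery County tract example can be checked in a few lines of Python. The sketch computes the ten-year rate, the average annual (compound) rate, and shows why dividing by 10 is wrong: compounding the naive 4.142 percent for ten years overshoots the actual 2000 population. Variable names are my own.

```python
pop_1990, pop_2000 = 2371, 3353
years = 10

# Total rate of change over the decade.
total_rate = (pop_2000 - pop_1990) / pop_1990

# Average annual growth rate, allowing for compounding.
annual_rate = (pop_2000 / pop_1990) ** (1 / years) - 1

# The common mistake: dividing the ten-year rate by ten.
naive_rate = total_rate / years

# Grow the 1990 population forward ten years at each rate.
check_compound = pop_1990 * (1 + annual_rate) ** years
check_naive = pop_1990 * (1 + naive_rate) ** years

print(round(total_rate, 4), round(annual_rate, 5))
print(round(check_compound), round(check_naive))
```

The compound rate reproduces the 2000 population of 3353, while the naive rate lands well above it.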
Growth rates
Calculated growth rates can be used for:
Estimating population between two censuses Projecting population based on constant growth rate assumption
Final word on growth rates
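If r is the rate of population growth, then a population will double in approximately ln(2)/r years! (The formula is exact when growth compounds continuously; with annual compounding the exact figure is ln(2)/ln(1+r).)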
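A quick Python check of the doubling-time rule, using the tract's 3.526 percent annual rate. ln(2)/r is exact only when growth compounds continuously; for a rate compounded once a year the exact doubling time is ln(2)/ln(1+r). The function names are hypothetical.

```python
import math

def doubling_time_continuous(r):
    """Doubling time in years when growth at rate r compounds continuously."""
    return math.log(2) / r

def doubling_time_annual(r):
    """Doubling time in years when growth at rate r compounds once a year."""
    return math.log(2) / math.log(1 + r)

r = 0.03526  # average annual growth rate from the census tract example
rule_of_thumb = doubling_time_continuous(r)
exact_annual = doubling_time_annual(r)
print(round(rule_of_thumb, 2), round(exact_annual, 2))
```

For small growth rates the two answers are close (here roughly 19.7 versus 20.0 years), so ln(2)/r is a handy approximation.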
Population Projections and Forecasts
Clarify some definitions, keeping consistency with Census Bureau:
Estimate: indirect measurement of population for the past, between Censuses, based on births, deaths and migration figures.
Released for July 1st of previous year
Projections: estimates of population for future years.
Population Projections
Two basic types
Top-down
  Applying constant rates
  Curve-fitting/extrapolation techniques
Bottom-up
  Cohort-component model
  Distributed housing unit method
There are also regression models, not covered in this class
Top down projections
Constant rate projections:
1. Calculate average annual growth rate, using most recent data. (e.g. 1990-2000). 2. Apply growth rate to future years.
Incidentally, this is also called a “geometric” curve
Top-down projections
Constant increment projections:
1. Calculate total number of people added per year. 2. Apply constant increment to future years.
Also called “linear” curve
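The two top-down projections can be compared side by side. This Python sketch reuses the census tract figures from the growth-rate example (2371 in 1990, 3353 in 2000) and projects 2010 under each assumption. The function names are hypothetical.

```python
def project_constant_rate(base_pop, annual_rate, years):
    """Geometric projection: the same percentage growth every year."""
    return base_pop * (1 + annual_rate) ** years

def project_constant_increment(base_pop, annual_increment, years):
    """Linear projection: the same number of people added every year."""
    return base_pop + annual_increment * years

pop_1990, pop_2000 = 2371, 3353

# Step 1: calculate the rate and the increment from the most recent data.
annual_rate = (pop_2000 / pop_1990) ** (1 / 10) - 1
annual_increment = (pop_2000 - pop_1990) / 10

# Step 2: apply each to future years (here, 10 years past 2000).
geometric_2010 = project_constant_rate(pop_2000, annual_rate, 10)
linear_2010 = project_constant_increment(pop_2000, annual_increment, 10)
print(round(geometric_2010), round(linear_2010))
```

The geometric curve gives about 4742 people in 2010 and the linear curve gives 4335, so the choice of assumption matters even over one decade.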
Curve-fitting/extrapolation
Idea is relatively straightforward:
1.Plot the data
In this case, population from previous time periods
2. Fit a curve to the data
Made easier in Excel with “insert trendline”
3. Derive the equation of the fitted curve 4. Use the equation to calculate future values.
Curve-fitting/extrapolation
Step 1. Plot the data.
[Figure: US Population. Vertical axis: Persons (millions), 0 to 300; horizontal axis: Year, 1790 to 2010.]
Curve-fitting
Step 2. Fit a curve. Types of Curves: The most commonly used curves in population analysis are:
Linear, geometric, exponential, modified exponential, polynomial, logarithmic, logistic, and Gompertz. The number of possible curves is virtually limitless!
Curves
1. Linear
$y = a + bx$
a=intercept, b=slope
Curves
2. Geometric
$y = ab^x$
We can do a logarithmic transformation to get:
$\ln y = \ln a + x \ln b$
Curves
3. Polynomial
Called a 2nd degree polynomial because highest exponent is 2:
$y = a + bx + cx^2$
Can create an n-th degree polynomial:
$y = a + bx + cx^2 + dx^3 + ex^4 + \dots$, continuing up to a term in $x^n$
Curves
4. Exponential
$y = ae^{bx}$
5. Modified Exponential
$y = c + ab^x$
Curves
6. Gompertz
$y = ca^{b^x}$
Curves
7. Logistic
$y = \frac{1}{c + ab^x}$
Fitting curves to data
See examples on Population Projections.xls Worksheet: “fittingcurves” with population data from US.
Graph data Fit curve Get equation Extrapolate
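The fit, get equation, extrapolate sequence can also be done outside Excel. The Python sketch below fits the linear curve by least squares and the geometric curve through its logarithmic transformation, then extrapolates both. The population series is synthetic (exactly 2 percent growth per year from 100), chosen so the fitted geometric parameters are known in advance; it is not the US data in the spreadsheet, and this is not a description of how Excel's trendline works internally.

```python
import math

def fit_linear(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b

def fit_geometric(x, y):
    """Fit y = a * b**x by fitting a straight line to ln(y); returns (a, b)."""
    ln_a, ln_b = fit_linear(x, [math.log(yi) for yi in y])
    return math.exp(ln_a), math.exp(ln_b)

# Synthetic data: x is years since the base year.
years_since_base = [0, 10, 20, 30, 40]
population = [100 * 1.02 ** t for t in years_since_base]

# Get the equations.
a, b = fit_geometric(years_since_base, population)
lin_a, lin_b = fit_linear(years_since_base, population)

# Extrapolate to year 50.
forecast_50 = a * b ** 50
linear_forecast_50 = lin_a + lin_b * 50
print(round(a, 2), round(b, 4), round(forecast_50, 1), round(linear_forecast_50, 1))
```

Because the synthetic data really are geometric, the linear curve extrapolates to a lower number than the geometric one.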
Bottom-up projections
The most commonly used in planning is the “cohort-component” method Cohort=age group Component=the three components of demography (fertility, mortality, migration)
The Master Demographic Equation
Put all three components together:
$POP_{t+1} = POP_t + \text{Births} - \text{Deaths} + \text{Inmigration} - \text{Outmigration}$
Cohort Component Technique
Step 1. Get age-sex data
Step 2. Acquire “vital records” data
  Birth rates
  Death rates
Step 3. Calculate survival rates.
Step 4. “Age the survivors”: move to next bin for next period
Step 5. Calculate births
Cohort Component Technique
Step 6. Allocate births to males/females
Step 7. Project population
Step 8. Model migration as a residual.
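The steps above can be sketched in a few lines of Python. This is a deliberately stripped-down illustration, not the spreadsheet models used in class: it tracks females only, uses three age bins whose width equals the projection period, applies birth rates to the starting population, and ignores infant mortality. All rates and counts are invented.

```python
def project_one_period(population, survival, birth_rates, female_share_of_births):
    """Advance a female population by one period equal to the bin width."""
    n = len(population)
    projected = [0.0] * n
    # Steps 3-4: apply survival rates and age the survivors into the next bin.
    for i in range(n - 1):
        projected[i + 1] += population[i] * survival[i]
    # The last bin is open-ended, so its survivors stay in it.
    projected[n - 1] += population[n - 1] * survival[n - 1]
    # Steps 5-6: births to women in each bin, then keep the female share.
    births = sum(p * rate for p, rate in zip(population, birth_rates))
    projected[0] = births * female_share_of_births
    return projected

# Hypothetical inputs (youngest bin first).
females = [100, 100, 100]
survival = [0.99, 0.98, 0.50]
birth_rates = [0.0, 0.5, 0.0]     # births per woman per period

# Step 7: project the population forward one period.
projected = project_one_period(females, survival, birth_rates, 0.5)

# Step 8: migration as a residual, i.e. whatever the natural-change
# projection fails to explain about an observed (here invented) total.
observed_total = 280
net_migration = observed_total - sum(projected)
print(projected, net_migration)
```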
Cohort Component Technique
Two files for utilization, with different techniques of handling migration.
Cohort_Component_Example.xls NEW AND IMPROVED COHORT COMPONENT MODEL.xls
Basic techniques of regional economic analysis
Topics Covered
Using employment data to provide a snapshot “story” of your regional economy
Economic Base Multipliers
Location Quotients
Shift-Share Analysis
Input-Output
Employment Data
Most techniques of analyzing a regional economy look at employment data Main national sources
County Business Patterns (Census) BEA-REIS (Commerce) Bureau of Labor Statistics (CEW-ES202)
Data includes both number of jobs and average wages earned Most data is at county level. Employment data at municipal level is highly problematic.
Example from Lehigh County, Pennsylvania
First slide: NAICS data from County Business Patterns Second slide: Data extracted from Bureau of Economic Analysis Regional Economic Information System REIS utilizes older SIC codes
Lehigh County Employment, by NAICS Industry (2000)
[Figure: bar chart of employment in thousands (axis 0 to 30) for the following NAICS industries: Health care and social assistance; Manufacturing; Retail trade; Admin, support, waste mgt, remediation services; Accommodation & food services; Management of companies & enterprises; Wholesale trade; Other services (except public administration); Finance & insurance; Construction; Educational services; Professional, scientific & technical services; Transportation & warehousing; Information; Auxiliaries (exc corporate, subsidiary & regional mgt); Utilities; Real estate & rental & leasing; Arts, entertainment & recreation]
Lehigh County Employment, by SIC Industry (2000)
[Figure: bar chart of employment in thousands (axis 0 to 80) for the following SIC industries: Government and government enterprises; Services; Finance, insurance, and real estate; Retail trade; Wholesale trade; Transportation and public utilities; Manufacturing; Construction]
Comparing data sources
Note the large component of “service” sector employment in the 1-digit SIC aggregations. Which method better characterizes this regional economy, NAICS or SIC data?
Data Problems in analyzing economic change
Looking particularly at online sources
Official shift for statistics from the SIC system to the NAICS system
Problem is comparability of older data
“Bridge files” are provided but they are complex and tedious
BEA-REIS provides historical data through 2000 at only the 1-digit SIC code level
Data problems (cont)
New Jersey State Data Center provides historical employment data using SIC codes in a difficult-to-use text format
Pennsylvania Dept. of Labor and Industry's website only provides data back to 1998
BLS Covered Employment Wage data only based on NAICS from 1997 to present
County Business Patterns was based on SIC (through 1997), now is based on NAICS
Data problems (cont)
County Business Patterns has downloadable files going back to 1988 under 2-digit SIC categories Always make sure you know the source of your data and whether it is seasonally adjusted
Economic Base theory and regional economic growth
Economic “base” techniques divide regional industries into two groups
Basic or export sectors Non-basic or local sectors
Assumes that export or base industries drive regional economic growth Relatively simple to calculate, generates straightforward impact and prediction tools
Economic base theory
Rationale: exports from a region represent competitive or comparative advantages in technology or cost Export industries drive regional growth through multiplier effects, backward and forward “linkages” Emphasizes the “open” quality of small regional economies
Economic base theory
Many criticisms and problems
Non-spatial: as the size of the region studied grows relative to the national economy, the economic base declines
Ignores factor mobility, migration
Makes pretty strong assumptions about which industries are export oriented
Misrepresents services, technology
Demand-side driven model of economics
Economic base approaches
Many limitations, but widely used Relatively simple to use and understand Three main approaches we will study
Location Quotients Shift-Share Analysis Input/Output
Defining the export base
Direct (survey) approaches
Out-flow surveys (goods and services) In-flow surveys (income, payments received, receipts)
Expensive, time consuming
Data quality, disclosure issues
Not often used
Defining the export base
Indirect methods
Assumption Approach Location Quotients Minimum Requirements
Assumption approach
Assigns industries to either “basic” or “non-basic” sectors based on assumptions of production
Typically assume agricultural, extractive (mining) and manufacturing industries are exporters Typically assume remaining industries are local, or supportive industries
Can be a reasonable assumption for specific industries and/or based on local knowledge
Minimum requirements approach
Not widely used anymore
Choose a reference or peer region as a “minimum requirement” region
Assume a region meets all of local demand first, then exports
Employment above the “minimum” needed to meet local demand represents export employment
Location Quotient approach
Commonly used, relatively easy to find data and calculate Can be based on various data sources
Employment data Income data, by source Consumption data
Most common usage is with employment data Location Quotients are used to tell us the amount of export employment in each industry
Location Quotient Approach
Caveats and Assumptions
Analysis is most informative at more disaggregated levels of industry data For example, 2-digit SIC or 3-digit NAICS codes All economic base techniques are dependent on the size of the region being studied Ignores differences in: regional income levels; technology; productivity; economies of size, scale and scope
Location Quotients
For industry i:
$LQ_i = \frac{\%\ \text{local employment}_i}{\%\ \text{national employment}_i}$
Or:
$LQ_i = \frac{\text{Local Employment}_i / \text{Total Local Employment}}{\text{National Employment}_i / \text{Total National Employment}}$
Location Quotients
LQ = 1: Regional employment proportion in industry i is same as national proportion
LQ < 1: Regional employment proportion in industry i is less than national proportion
LQ > 1: Regional employment proportion in industry i is greater than national proportion
Location Quotients
Export employment:
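$X_{i,r} = \left(1 - \frac{1}{LQ_i}\right) \times E_{i,r}$
$X_{i,r} = \left[\frac{E_{i,r}}{E_{i,n}} - \frac{E_{total,r}}{E_{total,n}}\right] \times E_{i,n}$
$X_{i,r} = \left[\frac{E_{i,r}}{E_{total,r}} - \frac{E_{i,n}}{E_{total,n}}\right] \times E_{total,r}$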
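The three export-employment expressions are algebraically the same quantity. The Python sketch below computes a location quotient and then export employment all three ways to confirm they agree. The employment figures are invented for illustration (they are not the Lehigh County data), and the function names are hypothetical.

```python
def location_quotient(e_ir, e_total_r, e_in, e_total_n):
    """Regional share of industry i divided by national share of industry i."""
    return (e_ir / e_total_r) / (e_in / e_total_n)

def export_employment_forms(e_ir, e_total_r, e_in, e_total_n):
    """Export employment in industry i, computed with all three expressions."""
    lq = location_quotient(e_ir, e_total_r, e_in, e_total_n)
    form_1 = (1 - 1 / lq) * e_ir
    form_2 = (e_ir / e_in - e_total_r / e_total_n) * e_in
    form_3 = (e_ir / e_total_r - e_in / e_total_n) * e_total_r
    return form_1, form_2, form_3

# Hypothetical industry: 3,000 of 100,000 regional jobs,
# 2,000,000 of 100,000,000 national jobs.
lq = location_quotient(3000, 100000, 2000000, 100000000)
forms = export_employment_forms(3000, 100000, 2000000, 100000000)
print(round(lq, 2), [round(x, 1) for x in forms])
```

Here the industry holds 3 percent of regional jobs against 2 percent nationally, so LQ = 1.5 and about 1,000 of the 3,000 jobs count as export employment. When LQ is below 1 the expressions return a negative number, which this sketch leaves as is.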
Using Location Quotients to Identify Basic Sectors – Example from Lehigh County PA
Utilize 3-digit NAICS codes, private employers, employment numbers, 2001
Download data from BLS, Covered Employment and Wages
Calculate LQs for all industries
27 industries have LQs > 1
10 highest LQ industries in Lehigh County, PA
312 Beverage and tobacco product manufacturing
325 Chemical manufacturing
314 Textile product mills
493 Warehousing and storage
221 Utilities
334 Computer and electronic product manufacturing
485 Transit and ground passenger transportation
339 Miscellaneous manufacturing
524 Insurance carriers and related activities
622 Hospitals
10 largest “export-employment” industries, Lehigh County, PA
Hospitals (4192)
Chemical manufacturing (3711)
Insurance carriers and related activities (3345)
Computer and electronic product manufacturing (3238)
Nursing and residential care facilities (2537)
Ambulatory health care services (1656)
Utilities (1632)
Warehousing and storage (1411)
Miscellaneous manufacturing (1197)
Beverage and tobacco product manufacturing (921)
Economic Base Multipliers
$\text{Base Multiplier} = \frac{\text{Total Activity}}{\text{Basic Activity}}$
The regional economic base multiplier measures the total impact on a region’s economy caused by a change of one unit of export activity
Base multipliers can be based on employment, income, or output
Economic Base Multipliers
Returning to our Lehigh County example, and using export employment calculated with Location Quotients
Total Activity (employment) = 156,709
Basic Activity (employment) = 29,040
Base Multiplier = Total/Basic = 5.40
For each 1 job created in a basic sector, total employment rises by about 5.40 jobs: the basic job itself plus about 4.40 additional non-basic jobs
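The multiplier arithmetic for the Lehigh County example, as a short Python sketch. Because the multiplier is total employment divided by basic employment, it counts the basic job itself, so the number of additional non-basic jobs per basic job is the multiplier minus one. The 100-job scenario is hypothetical.

```python
total_employment = 156709   # total activity, Lehigh County example
basic_employment = 29040    # export (basic) employment from location quotients

base_multiplier = total_employment / basic_employment
non_basic_per_basic_job = base_multiplier - 1

# Hypothetical impact question: a new plant adds 100 basic-sector jobs.
new_basic_jobs = 100
total_jobs_supported = new_basic_jobs * base_multiplier
print(round(base_multiplier, 2), round(non_basic_per_basic_job, 2),
      round(total_jobs_supported))
```

The ratio works out to about 5.40, so 100 new basic jobs would be associated with roughly 540 jobs in total, under the strong assumption that the multiplier stays constant.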
Shift-share analysis
Technique for analyzing sources of change in the regional economy A descriptive tool, no real causal structure, not good for use in forecasting – contra Klosterman book, where described as a “projection” technique Controversial, yet simple -- often used, often misused
Shift-share analysis
Disaggregate regional employment change into 3 component parts:
1. National (Growth) Share (NS): changes in the regional economy attributable to changes in the national economy
2. Industrial Mix Share (IM): changes in the regional economy attributable to the mix of industries
3. Regional/Local Shift (RS): changes in regional employment due to local factors, or regional competitiveness
Shift-share analysis
National (growth) share: estimates the total employment in industry i in the region if industry i in the region grows at the same rate as the nation.
$NS_{i,r} = E^{t_0}_{i,r} \times \frac{E^{t_1}_{n}}{E^{t_0}_{n}}$
Shift-share analysis
Industry Mix Share: estimates change in employment in industry i based on the difference in growth rates between industry i nationally and the entire national economy
$IM_{i,r} = E^{t_0}_{i,r} \times \left(\frac{E^{t_1}_{i,n}}{E^{t_0}_{i,n}} - \frac{E^{t_1}_{n}}{E^{t_0}_{n}}\right)$
Shift-share analysis
Regional “shift”: estimates change in employment in industry i in the region based on the difference in growth rates between industry i in the region and industry i nationally.
$RS_{i,r} = E^{t_0}_{i,r} \times \left(\frac{E^{t_1}_{i,r}}{E^{t_0}_{i,r}} - \frac{E^{t_1}_{i,n}}{E^{t_0}_{i,n}}\right)$
Shift-share analysis
Putting it all together:
$E^{t_1}_{i,r} = NS_{i,r} + IM_{i,r} + RS_{i,r}$
Or, measuring change:
$\Delta E^{t_{0,1}}_{i,r} = \Delta NS_{i,r} + IM_{i,r} + RS_{i,r}$
where $\Delta NS_{i,r} = NS_{i,r} - E^{t_0}_{i,r}$
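A Python sketch of the shift-share decomposition for a single industry, using the three component definitions from the preceding slides. The employment numbers are invented, and the function name is hypothetical. The point is to verify that the three components add back up to the region's end-of-period employment in the industry.

```python
def shift_share(e_ir_t0, e_ir_t1, e_in_t0, e_in_t1, e_n_t0, e_n_t1):
    """Return (NS, IM, RS) for industry i in region r.

    e_ir = regional employment in industry i
    e_in = national employment in industry i
    e_n  = total national employment
    """
    national_growth = e_n_t1 / e_n_t0
    industry_growth = e_in_t1 / e_in_t0
    regional_growth = e_ir_t1 / e_ir_t0
    ns = e_ir_t0 * national_growth
    im = e_ir_t0 * (industry_growth - national_growth)
    rs = e_ir_t0 * (regional_growth - industry_growth)
    return ns, im, rs

# Hypothetical industry: region 1,000 -> 1,200 jobs; the industry nationally
# grows 10 percent; the national economy as a whole grows 5 percent.
ns, im, rs = shift_share(1000, 1200, 50000, 55000, 1000000, 1050000)
change_from_national_growth = ns - 1000
print(round(ns), round(im), round(rs), round(ns + im + rs))
```

Of the 200 jobs gained, 50 are attributed to national growth, 50 to the industry outperforming the national economy, and 100 to the region outperforming the industry nationally.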
Shift-share analysis
Interpreting results
National share: simply tells you the extent to which the national economy grew or declined
Industry mix: when summed over all industries, tells you if the mix of industries in your region is growing or declining (relatively)
Regional shift: tells you if your region is strong or lagging in this industry relative to the nation
Input-Output Analysis
Acknowledgement and disclaimer: Some of the images and slides displayed in this presentation were shamelessly “borrowed” from my regional economics instructor, Dave Marcouiller at U. of Wisconsin. I saw no reason to reinvent what I thought were good slides and explanations of I/O
Today’s objectives
Focus on conceptual understanding of the Input-Output system
Understanding the system of accounts, I/O tables Basic understanding of the mathematics of I/O Interpretation of multipliers
Most I-O modeling is performed by experts and therefore we will focus on planners as “consumers” of the output of the analysis
Input-Output Analysis
Developed as an inter-industry accounting technique, a regional accounting system Relates “final demand” sectors to inputs from industrial sectors Demand driven: demand for final goods determined exogenously
From income and product accounts to input-output model
Recall from the study of national income and product accounts:
Gross National Income=Gross National Product
Or: Total factor payments in the economy =total spent on consumption, investment, government, and net exports
Same principles hold at a regional economy
Regional economic accounting
$$L + N = C + I + G + (E - M)$$
Where L=labor, N=rents, C=consumption, I=investment, G=government, E=exports, M=imports
Rewrite, moving imports to the left-hand side:
$$L + N + M = C + I + G + E$$
The left hand side is regional value added, while the right hand side represents final demand
Remember that total output = total outlays
In expanded form, this equation looks like:
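$$\begin{aligned}
X_1 &= z_{1,1} + z_{1,2} + z_{1,3} + \dots + z_{1,n} + Y_1 \\
X_2 &= z_{2,1} + z_{2,2} + z_{2,3} + \dots + z_{2,n} + Y_2 \\
X_3 &= z_{3,1} + z_{3,2} + z_{3,3} + \dots + z_{3,n} + Y_3 \\
&\;\;\vdots \\
X_n &= z_{n,1} + z_{n,2} + z_{n,3} + \dots + z_{n,n} + Y_n
\end{aligned}$$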
A matrix is simply a way of organizing information, and is a way of representing a system of equations
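$$A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\
\vdots & \vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn}
\end{bmatrix}$$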
This matrix A has m rows and n columns, and is therefore an (m x n) matrix. It can be represented as:
A (m x n)
Matrices
A one row matrix is called a row-vector
A one column matrix is called a column-vector
Again, a matrix has m rows and n columns
A matrix can represent a series of equations
For example, recall from a few slides ago our expanded equation:
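$$\begin{aligned}
X_1 &= z_{1,1} + z_{1,2} + z_{1,3} + \dots + z_{1,n} + Y_1 \\
X_2 &= z_{2,1} + z_{2,2} + z_{2,3} + \dots + z_{2,n} + Y_2 \\
X_3 &= z_{3,1} + z_{3,2} + z_{3,3} + \dots + z_{3,n} + Y_3 \\
&\;\;\vdots \\
X_n &= z_{n,1} + z_{n,2} + z_{n,3} + \dots + z_{n,n} + Y_n
\end{aligned}$$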
This can be written in matrix form as:
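$$X_{(n \times 1)} = Z_{(n \times n)} \, \mathbf{i}_{(n \times 1)} + Y_{(n \times 1)}$$
where $\mathbf{i}$ is a column vector of ones, so that $Z\mathbf{i}$ sums each row of Z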
Input-Output: The transactions table
The first step is to represent all our sectors’ relationships in a transactions table. This will include “intermediate” or “interindustry” demand sectors, as well as “final” demand sectors
Transactions Table
                      Purchasing   Purchasing   Final       Gross
                      Sector 1     Sector 2     Demand      Output
Producing Sector 1    Z11          Z12          Y1          X1
Producing Sector 2    Z21          Z22          Y2          X2
Value Added           W1           W2           WY          Y (ΣYi)
Gross Outlays         X1           X2           Y (ΣYi)     X (ΣXi)
Another way to represent this:
These three basic quadrants define an Input-Output Table
Production structure in I-O
Inter-industry flows from i to j are wholly dependent on the output of j
Production takes place under “fixed proportion” or “Leontief” production functions.
Prices are “represented” in the production coefficients, also called “technical coefficients”:
$$a_{ij} = \frac{z_{ij}}{x_j}$$
The aij’s are called technical coefficients
Input Output
In working with your I-O table, the technical coefficients are derived by dividing each element z_ij of your Z matrix by x_j, the gross output of the purchasing sector j (the total of column j in the transactions table). Within each column, the technical coefficients plus the value-added share sum, by definition, to 1. The table of technical coefficients is called a “direct requirements table”.
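As a minimal sketch of this step, the snippet below builds a direct requirements table from a two-sector transactions table, following a_ij = z_ij / x_j. The dollar flows and gross outputs are hypothetical numbers chosen for illustration, and value added is assumed to be whatever remains of each sector's outlays after interindustry purchases.

```python
# Hypothetical two-sector transactions table (dollar flows).
# Z[i][j] = sales from producing sector i to purchasing sector j.
Z = [[150.0, 500.0],
     [200.0, 100.0]]
X = [1000.0, 2000.0]  # gross output (= gross outlays) of each sector, assumed

n = len(X)

# Technical coefficients: each flow divided by the purchasing sector's output.
A = [[Z[i][j] / X[j] for j in range(n)] for i in range(n)]

# Value added is the part of each sector's outlays not spent on other sectors.
value_added = [X[j] - sum(Z[i][j] for i in range(n)) for j in range(n)]

# Each column of coefficients plus its value-added share accounts for 100%
# of the purchasing sector's outlays.
column_totals = [
    sum(A[i][j] for i in range(n)) + value_added[j] / X[j] for j in range(n)
]

for row in A:
    print([round(a, 2) for a in row])
print([round(t, 6) for t in column_totals])
```

Reading down a column of A gives the inputs sector j needs from each producing sector per dollar of its own output.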
Let’s get back to the math of how this all works. Recall that previously we had this relationship:
$$X = Z\mathbf{i} + Y$$
(these are matrices; $\mathbf{i}$ is a column vector of ones that sums each row of Z)
But now we have created the matrix of technical coefficients, aij, the matrix of which we will call A. We can now rewrite the equation above as:
$$X = AX + Y$$
Let’s rearrange this to get the X’s on the same side, but be careful: matrix algebra is not the same as regular algebra.
$$X - AX = Y$$
In order to factor out the X, we need to use a matrix algebra trick
Define an “identity matrix” to be:
$$I = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$
This functions like a 1 in regular algebra.
The diagonal elements are all 1, the off-diagonals are all 0.
We can thus factor out the X’s from this equation:
$$X - AX = Y$$
To yield:
$$(I - A)X = Y$$
We can isolate X on the left hand side by pre-multiplying by the inverse:
$$X = (I - A)^{-1} Y$$
The expression $(I - A)^{-1}$ is frequently called the “Leontief Inverse”. We can use this expression for the predictive or multiplier form of input-output analysis:
$$\Delta X = (I - A)^{-1} \Delta Y$$
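To make the mechanics concrete, here is a small sketch for a two-sector model in plain Python. The coefficient matrix, final demand figures, and helper names are hypothetical. A real model with many sectors would normally invert (I - A) with a linear algebra library or a spreadsheet rather than the closed-form 2x2 inverse used here.

```python
def leontief_inverse_2x2(A):
    """Return (I - A)^-1 for a 2x2 technical coefficient matrix A."""
    a, b = 1.0 - A[0][0], -A[0][1]
    c, d = -A[1][0], 1.0 - A[1][1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("(I - A) is singular and cannot be inverted")
    # Standard closed-form inverse of a 2x2 matrix.
    return [[d / det, -b / det],
            [-c / det, a / det]]


def mat_vec(M, v):
    """Multiply matrix M by column vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


# Hypothetical direct requirements table and final demand.
A = [[0.15, 0.25],
     [0.20, 0.05]]
Y = [350.0, 1700.0]

L = leontief_inverse_2x2(A)

# X = (I - A)^-1 Y: gross output needed to deliver final demand Y.
X = mat_vec(L, Y)

# Multiplier form: extra output needed if final demand for sector 1 rises by 100.
delta_Y = [100.0, 0.0]
delta_X = mat_vec(L, delta_Y)

print([round(x, 2) for x in X])
print([round(x, 2) for x in delta_X])
```

The output change in sector 1 exceeds the 100 of new final demand because it includes the indirect purchases that both sectors make from each other.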
Input-Output Model
an example in Microsoft Excel inputoutputexample.xls
Input-Output Multipliers
Where you “close” the model refers to which elements of the (I-A) table you include in the Leontief inverse
Closure only through industry sectors gives “simple” or “Type 1” multipliers, which capture:
- Direct effects
- Indirect effects
Closure through households and/or governments gives “total” or “Type 2” multipliers, which capture:
- Direct, indirect and induced effects
Input-Output Multipliers
Output multipliers: measure requirements needed from all sectors to deliver one additional dollar of sector i output to final demand
Income multipliers: measure the total change in income from a one-dollar change in sector i output resulting from a change in final demand
Employment multipliers: measure total changes in employment required from all sectors given a one-unit change in employment in sector i needed to satisfy changes in final demand
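As an illustration of the first definition, a simple output multiplier for a sector is commonly computed as the sum of that sector's column in the Leontief inverse. The sketch below assumes a hypothetical, rounded two-sector Leontief inverse closed only through industry sectors, so the results are “simple” (Type 1 style) multipliers.

```python
# Hypothetical Leontief inverse for two sectors, rounded to four decimals.
# Entry [i][j] = output required from sector i per dollar of final demand
# for sector j.
leontief_inverse = [[1.2541, 0.3300],
                    [0.2640, 1.1221]]

n = len(leontief_inverse)

# Output multiplier for sector j: total output from all sectors needed to
# deliver one more dollar of sector j output to final demand (column sum).
output_multipliers = [
    sum(leontief_inverse[i][j] for i in range(n)) for j in range(n)
]

print([round(m, 2) for m in output_multipliers])  # roughly 1.52 and 1.45
```

Because each multiplier already includes the initial dollar of final demand (the direct effect), adding the direct spending on top of the multiplied total would double count it.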
Input-Output Multipliers
Interpretation:
Multipliers are region specific
Since multipliers include “direct” effects, be careful not to double count the impact
Multipliers do NOT represent the “turnover” of dollars in a local economy but rather the net result of interindustry purchases and increases in income from outside
Time for the impact to occur?
Regional Input-Output Models
Also available from the Bureau of Economic Analysis (Dept. of Commerce) is the Regional Input-Output Modeling System (RIMS II)
RIMS II Handbook contains Four Case Studies
Using RIMS multipliers for economic impact analysis
Extending Input Output
Inter-regional Input Output
Social Accounting Matrices
Other regional economic modeling approaches
Regional Econometric (forecasting)
Computable General Equilibrium
Impact Analysis/Decision tools
Cost-Benefit Analysis
Fiscal Impact Analysis (also called Cost-Revenue Analysis)
Cost-Benefit Analysis
Good resources for further study:
Circular A-94: Guidelines and Discount Rates for Benefit-Cost Analysis of Federal Programs, Executive Office of the President, Office of Management and Budget
“A student’s guide to cost-benefit analysis for natural resources”, http://ag.arizona.edu/classes/rnr485
Lecture notes from Prof. Allen Bellas, U. of Washington at: http://faculty.washington.edu/bellas/cba/
Cost-benefit Analysis
A method of project evaluation – an “aggregative” approach
A decision “tool” not a decision “rule”
Conceptually straightforward
Cloaks public actions in “rationality”
Required by Executive Order 12291 and others, cf. OMB Circular A-94
Cost-Benefit Analysis
At its most simple level, CBA involves:
1. Calculate the costs of the project
2. Calculate the benefits of the project
IF BENEFITS > COSTS, undertake project
IF COSTS > BENEFITS, don’t undertake project
OR: choose project with highest benefit/cost ratio
Of course, it’s never this simple…
Cost-benefit analysis
Complications of this “simple” world
Ethical complications: whose benefits? Whose costs? Interpersonal comparisons? Future generations?
Practical complications: the “consumers” of a CBA have preconceived expectations of results
Practical complications:
- Aggregate over different types of costs and benefits
- Aggregate over time / choice of a discount rate
- Assign “prices” to non-market goods
- Risk/Uncertainty – Sensitivity Analysis
Some CBA relatives to consider
Cost-Effectiveness Analysis
Risk-Cost Analysis, Risk-Risk Analysis
Fiscal Impact Analysis/Cost-revenue analysis
Internal Rate of Return (IRR)
Willingness to Pay/Willingness to Accept
Methods of valuing environmental amenities
- Contingent Valuation
- Travel Cost Method
- Hedonic Price method
Steps in a CBA
1. Define scope of project
2. Identify alternatives
3. Clarify issues of “standing”
4. Explicitly specify assumptions
5. List impacts of each alternative
6. Assign values to impacts
7. Discount future values to present values
8. Account for uncertainty/risk
9. Compare benefits to costs
The normative basis of CBA
Pareto-Optimality and Welfare Economics
Kaldor-Hicks Criterion
“The principle of maximizing net present value of benefits is based on the premise that gainers could fully compensate the losers and still be better off” --OMB Circular A-94
Cost-Benefits: The Basics
Criteria depend on scope of problem:
If scope is accept or reject one project (this is most common):
- Accept if NPV > 0
- Reject if NPV < 0
If scope is choosing between 2+ projects:
- Choose project with maximum “net benefits”, or
- Choose project with highest Benefit/Cost ratio
Given a project, choose its scale:
- Set MB = MC (marginal benefits = marginal cost)
Cost-Benefits: The Basics
Definitions:
Present Value (PV) = the value today of a payment or a stream of payments in the future.
Future Value (FV) = the value at a time in the future of an amount today
Discounting: reducing a future value to present value by means of a discount rate
Compounding: increasing a present value to a future value by means of an interest rate
Internal Rate of Return (IRR): what discount rate sets the present value of benefits and costs equal. (NPB=NPC)
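These definitions can be tied together in a short sketch. The cash flows and the 7 percent rate below are hypothetical, and the IRR search uses simple bisection under the assumption that net present value changes sign exactly once between the lower and upper bounds; spreadsheet IRR functions may use different numerical methods.

```python
def npv(rate, net_flows):
    """Net present value of net_flows, where net_flows[t] is benefits minus
    costs in year t and year 0 is today (undiscounted)."""
    return sum(flow / (1.0 + rate) ** t for t, flow in enumerate(net_flows))


def irr(net_flows, low=0.0, high=1.0, tol=1e-8):
    """Discount rate at which NPV is zero, found by bisection.

    Assumes NPV changes sign exactly once on [low, high].
    """
    if npv(low, net_flows) * npv(high, net_flows) > 0:
        raise ValueError("NPV does not change sign on the search interval")
    while high - low > tol:
        mid = (low + high) / 2.0
        if npv(low, net_flows) * npv(mid, net_flows) <= 0:
            high = mid  # the root lies in [low, mid]
        else:
            low = mid   # the root lies in [mid, high]
    return (low + high) / 2.0


# Hypothetical project: 1,000 cost today, 400 net benefit in each of 3 years.
flows = [-1000.0, 400.0, 400.0, 400.0]

project_npv = npv(0.07, flows)  # assumed 7 percent discount rate
project_irr = irr(flows)

print(round(project_npv, 2))  # positive, so accept under the NPV rule
print(round(project_irr, 4))  # between 9 and 10 percent
```

A project with a positive NPV at the chosen discount rate will also have an IRR above that rate, provided the flows change sign only once.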
Cost-Benefits: The Basics
Comparison between two alternatives is “with/without” NOT “before/after”
State assumptions clearly
Perform sensitivity analysis, particularly concerning the choice of the discount rate
What are “costs”? What are “benefits”?
Avoid “double-counting”
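With/Without vs. Before/After: What is policy “impact”?
[Figure: “Value” (vertical axis) plotted against Time (horizontal axis), with a policy intervention at time t and the levels a, b, and c marked on the value axis.]
c-b is with/without; c-a is before/after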
Steps in Cost-Benefit Analysis
1. Define scope of project
-define project exactly, identify likely impacts
-engineering and financial cost estimates
2. Define alternatives of the project
-may be “reject” project and do nothing
Steps in Cost-Benefit Analysis
3. Clarify issues of standing
-whose benefits and costs are considered in the analysis; must be consistent
-often is given by the “client”
4. Explicitly specify assumptions
“Analyses should be explicit about the underlying assumptions used to arrive at estimates of future benefits and costs. The analysis should include a statement of the assumptions, the rationale behind them, and a review of their strengths and weaknesses. Key data and results, such as year by year estimates of benefits and costs, should be reported to promote independent analysis and review.” -OMB Circular A-94
Steps in Cost-Benefit Analysis
5. List impacts of each alternative
Types of impacts:
-Direct impacts
-Indirect impacts
Note: DO NOT USE MULTIPLIERS
-Externalities and spillovers
Steps in Cost-Benefit Analysis
6. Monetize (assign value to) all benefits and costs.
-Costs should reflect opportunity costs
-Costs should reflect market costs where possible
-Costs should reflect willingness to pay (WTP) wherever possible
-Variety of techniques for valuing non-market amenities
Steps in Cost-Benefit Analysis
7. DISCOUNTING
-Choosing the appropriate discount rate
-The choice of the discount rate is quite controversial. Ideally, the discount rate should be “the social discount rate” (SDR)
-Mathematical mechanics of discounting in Excel
More on “discounting”
Discounting implicitly contains adjustments for at least 3 factors:
Time preference
Opportunity cost of capital
Risk and uncertainty
Discounting
The formula to convert today’s money into tomorrow's:
$$V_t = V_0 (1 + r)^{t}$$
The formula to convert tomorrow's money into today’s:
$$V_0 = V_t (1 + r)^{-t}$$
where r is THE DISCOUNT RATE
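A brief sketch of the two formulas, and of why the choice of discount rate matters so much, follows. The function names, dollar amounts, and the set of rates compared are illustrative assumptions only.

```python
def future_value(v0, rate, t):
    """Compound today's amount v0 forward t years at the given rate."""
    return v0 * (1.0 + rate) ** t


def present_value(vt, rate, t):
    """Discount an amount vt received t years from now back to today."""
    return vt * (1.0 + rate) ** (-t)


# Compounding and discounting are inverse operations.
fv = future_value(100.0, 0.07, 10)
round_trip = present_value(fv, 0.07, 10)
print(round(fv, 2), round(round_trip, 2))

# Sensitivity: the present value of 1,000 received 30 years from now
# shrinks quickly as the assumed discount rate rises.
for rate in (0.03, 0.07, 0.10):
    print(rate, round(present_value(1000.0, rate, 30), 2))

sensitivity = [present_value(1000.0, rate, 30) for rate in (0.03, 0.07, 0.10)]
```

Running the comparison for a long-lived project is one simple form of the sensitivity analysis recommended for the discount rate.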
Even more on discounting
Most government agencies have pre-set discount rules which are to be used, so you take what is given to you.
Federal guidelines: start with 7 percent as discount rate. (OMB Circular, cf. appendix C)
Why 7 percent? “Approximates the marginal pretax rate of return on an average investment in the private sector in recent years.”
Fiscal Impact Analysis
In PA, cf. http://cax.aers.psu.edu/residentialimpact/ (even though it is Penn State…)
Fiscal impact analysis is used to estimate the costs and revenues of a proposed land development.
Ideally it is used not in NIMBY type opposition to development, but in accurate capital improvements planning (CIP).
Fiscal Impact Analysis
The most commonly used form is the “per capita multiplier” method.
Quite simply, it uses average per capita revenue and cost figures, and multiplies these by the estimated number of new residents/school children in a proposed development.
The difference between AVERAGE and MARGINAL
Ignores actual infrastructure capacity and capital expenditures.
Fiscal Impact Analysis
Steps in the per-capita multiplier method
1. Estimate number of expected new residents and new school children to reside in new development.
How? Use statewide averages, or locally derived averages. (From census data or local school enrollment data)
2. Estimate additional school spending as:
New school children * per pupil expenditures
Fiscal Impact Analysis
3. Estimate additional municipal (non-school) expenditures.
Generally use “current” not capital expenditures.
Estimate current spending on roads, police/safety/fire, administration, etc. and divide by current population totals to get per capita expenditures
Multiply number of new residents * per capita expenditures
4. Calculate net fiscal impact:
Net revenues minus net costs.
5. Answer: Keep out as many school kids as possible.
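The arithmetic of the per capita multiplier method can be sketched as follows. Every figure is hypothetical, the function and parameter names are invented for illustration, and expected new revenue is simply taken as an input because the steps above do not spell out how it is estimated.

```python
def per_capita_fiscal_impact(new_residents, new_pupils, per_pupil_cost,
                             municipal_spending, current_population,
                             new_revenue):
    """Net fiscal impact of a development using average (per capita) costs."""
    # Step 2: additional school spending.
    school_cost = new_pupils * per_pupil_cost
    # Step 3: current (non-capital) municipal spending per resident,
    # applied to the new residents.
    per_capita_cost = municipal_spending / current_population
    municipal_cost = new_residents * per_capita_cost
    # Step 4: revenues minus costs.
    net_impact = new_revenue - (school_cost + municipal_cost)
    return school_cost, municipal_cost, net_impact


# Hypothetical development and community figures.
school_cost, municipal_cost, net_impact = per_capita_fiscal_impact(
    new_residents=280,
    new_pupils=60,
    per_pupil_cost=9_000.0,
    municipal_spending=5_000_000.0,
    current_population=10_000,
    new_revenue=600_000.0,
)
print(school_cost, municipal_cost, net_impact)
```

Because the method relies on average costs, a negative result like this one says nothing about whether existing schools or roads have spare capacity, which is the average versus marginal caveat noted earlier.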
Fiscal Impact Methods
Other types of fiscal impact methods
1. Case Study method. Perform detailed case studies and/or solicit expert opinion from local and school district officials to estimate costs.
2. Service Standard method. Focuses on requirements of service levels to service proposed developments. (If done properly, would look at existing service levels and capacity.)
3. Comparable City method. Rarely used. Look at comparable developments in comparable cities.
Fiscal Impact Analysis
For non-residential developments, two methods to assess impacts:
1. Proportional valuation method. The denominator for assessing revenues and costs is not people (i.e. per capita) but the proportion of proposed development’s property valuation relative to community.
2. Employment Anticipation method. Estimate number of employees to be serviced. Then use similar techniques to estimate costs and revenues per employee.
NOTE: Non-residential uses do not send kids to school.