Basic Quantitative Tools (Prof. Campbell)
last updated: Friday, March 17, 2008

Each tool is listed with the data needed (tool: data needed).

(INFERENTIAL STATISTICS -- inferring from a sample to a population using the laws of probability)

How confident can one be that the sample mean (or proportion) represents the population as a whole?
  confidence interval (mean): one interval variable
  inverse: given a specific confidence interval, what is the needed sample size?
  confidence interval (proportion): one nominal variable

Do differences found in a sample (a subset of the population) reflect differences in the population as a whole? (commonly used to generalize from survey results)
  chi-square: two categorical variables
  difference of means: an interval variable divided into two categories
  difference of proportions: a nominal variable divided into two categories
  ANOVA (Analysis of Variance): an interval variable divided into three or more categories

What is the relationship between two variables?
  correlation analysis (including an example of ecological fallacy): two interval variables

How many total jobs are dependent on basic (export-based) jobs?
  Multiplier: number of basic (export) jobs, number of total jobs (export + locally serving)

What is the relative concentration of local employment by sector?
  Location Quotients: employment (total and by sector) for both the locality and the nation

How can we estimate interaction (e.g., trade, traffic) between two cities?
  Gravity Model: population of two cities, distance, constant

How do we measure growth over time?
  Growth Rates (3 types): population levels over time

How do we compare costs and benefits (e.g., of a project) over time?
  Cost-benefit analysis: quantified costs and benefits for each year, discount rate
c672515f-24ea-49e6-9a63-dfa6309fbbec.xls Overview 9/2/2009 10:22 PM
calculate a confidence interval (with interval data)
that is, how confident are you that your sample estimate comes close to the population mean?

Data needed: one interval variable (enter data in yellow cells)
  sample mean (X-bar)
  std dev of sample (s)
  sample size (n)
  value of t-score for .025 (two-tail test) -- from t-table or let Excel calculate

$$\mu = \bar{X} \pm t_{.025}\,\frac{s}{\sqrt{n}}$$

Data   Hhd Income
 1      24,000
 2      36,000
 3      12,000
 4      74,000
 5      46,000
 6      27,000
 7      23,000
 8      69,000
 9     107,000
10      53,000
11      29,000
12      34,000
13      43,000
14      28,000
15      24,000
16      43,000

MEAN    42,000
STDEV   24,105
n       16
t       2.131
set the confidence level (2-tail): 0.05

SO:
u = 42,000 +/- 12,845
lower end of confidence interval   29,155
upper end of confidence interval   54,845
range                              25,690

[Chart: Confidence Interval; x-axis from 0 to 100,000]
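The interval calculation on this sheet can be sketched in a few lines of Python. This is only an illustration, not the workbook's own method: the function name `mean_confidence_interval` is hypothetical, and the critical t value (2.131 for 15 degrees of freedom) is typed in from the sheet rather than computed.

```python
import math
import statistics

def mean_confidence_interval(data, t_crit):
    """Return (mean, half_width, lower, upper) for a two-tailed interval.

    t_crit is assumed to be looked up elsewhere (t-table or Excel).
    """
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)               # sample std dev (n - 1 denominator)
    half_width = t_crit * s / math.sqrt(n)   # t * s / sqrt(n)
    return x_bar, half_width, x_bar - half_width, x_bar + half_width

# The 16 household incomes from the sheet.
incomes = [24000, 36000, 12000, 74000, 46000, 27000, 23000, 69000,
           107000, 53000, 29000, 34000, 43000, 28000, 24000, 43000]

x_bar, half_width, lower, upper = mean_confidence_interval(incomes, t_crit=2.131)
print(f"{x_bar:,.0f} +/- {half_width:,.0f}  ({lower:,.0f} to {upper:,.0f})")
```

With t rounded to 2.131 the half-width comes out a few dollars below the sheet's 12,845, presumably because Excel carries more digits of t.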
calculate a confidence interval
that is, how confident are you that your sample estimate comes close to the population mean?
Here we will skip using the raw data and instead calculate with the summary data (mean, std dev., n)

Data needed: one interval variable (enter data in yellow cells)
  sample mean (X-bar)
  std dev of sample (s)
  sample size (n)
  value of t-score for .025 (two-tail test) -- from t-table or let Excel calculate

$$\mu = \bar{X} \pm t_{.025}\,\frac{s}{\sqrt{n}}$$

MEAN    42,000
STDEV   5,000
n       384
t       1.966
set the confidence level (2-tail): 0.05

SO:
u = 42,000 +/- 502
lower end of confidence interval   41,498
upper end of confidence interval   42,502
range                              1,003

[Chart: Confidence Interval; x-axis from 0 to 100,000]
calculate a minimum sample size needed to achieve a specific confidence interval range
that is, how confident are you that your sample estimate comes close to the population mean?
Here we will skip using the raw data and instead calculate with the summary data (mean, std dev.)

Data needed: one interval variable (enter data in yellow cells)
  sample mean (X-bar)
  std dev of sample (s)
  c (the +/- value of the confidence interval)
  value of t-score for .025 (two-tail test) -- from t-table or let Excel calculate

MEAN    42,000
STDEV   5,000
c (confidence interval range)   500
t       1.960
set the confidence level (2-tail): 0.05

here is the formula to calculate a confidence interval:

$$\mu = \bar{X} \pm t_{.025}\,\frac{s}{\sqrt{n}}$$

Setting the +/- term equal to c and solving for n (sample size):

$$c = t_{.025}\,\frac{s}{\sqrt{n}}$$

...leads to this equation (so, to estimate sample size, you need to know Stdev, the confidence interval, and the value of t):

$$n = \left(\frac{t_{.025}\; s}{c}\right)^{2}$$

given values of stdev and c and confidence level, we calculate "n":
sample size needed   384

SO:
u = 42,000 +/- 500
lower end of confidence interval   41,500
upper end of confidence interval   42,500
range                              1,000

NOTES:
1. For the value of "t", we simply assumed a large sample size (t --> Z), e.g., for 95% confidence interval (2-tailed), t = 1.96.
2. We are also assuming a large population size (M), so that n/M --> 0.

[Chart: Confidence Interval; x-axis from 0 to 100,000]
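As a quick check of the sample-size formula, the following sketch solves c = t * s / sqrt(n) for n. The function name `required_sample_size` is hypothetical, and t = 1.96 is the large-sample assumption stated in the sheet's notes.

```python
import math

def required_sample_size(s, c, t_crit=1.96):
    """Solve c = t * s / sqrt(n) for n; the result is not yet rounded."""
    return (t_crit * s / c) ** 2

# Values from the sheet: std dev 5,000 and a +/- value (c) of 500.
n_raw = required_sample_size(s=5000, c=500)
print(round(n_raw, 2))    # unrounded sample size
print(math.ceil(n_raw))   # rounded up to a whole number of cases
```

The sheet reports 384, which is the unrounded value rounded to the nearest whole number; rounding up instead is the more cautious choice if the interval must not exceed +/- c.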
calculate a confidence interval using proportions (nominal data)

Data needed: one nominal variable (proportions) (enter data in yellow cells)
  sample proportion (P)
  sample size (n)

the population proportion is p; for large n:

$$p = P \pm 1.96\,\sqrt{\frac{P(1-P)}{n}}$$

set the confidence level (2-tail): 0.05
P   50%
n   100
t   1.984

SO:                                          in percent
p = 0.500 +/- 0.099                          9.9%
lower end of confidence interval   0.401     40.1%
upper end of confidence interval   0.599     59.9%
range                              0.198     19.8%

[Chart: Confidence Interval; x-axis from 0% to 100%]
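A minimal sketch of the proportion interval follows. The function name and the `multiplier` parameter are hypothetical; the default of 1.96 is the large-n value in the formula, while the call below passes the sheet's t = 1.984 so that the result lines up with the 0.099 shown.

```python
import math

def proportion_confidence_interval(p, n, multiplier=1.96):
    """Return (lower, upper, half_width) for a sample proportion p and sample size n."""
    half_width = multiplier * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width, half_width

# Values from the sheet: P = 50%, n = 100, t = 1.984.
lower, upper, half_width = proportion_confidence_interval(0.50, 100, multiplier=1.984)
print(f"+/- {half_width:.3f}  ({lower:.3f} to {upper:.3f})")
```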
Chi-Square
CHI-SQUARE TEST (EXCEL: FUNCTION): does the distribution of outcomes (observed) significantly differ from a random distribution (expected)?
(enter data in yellow cells)

ACTUAL (OBSERVED)
          city   suburb   rural   total
strong    2      1        1       4
medium    1      2        1       4
weak      1      1        2       4
total     4      4        4       12

PREDICTED/EXPECTED (based on multiplying row and column totals)
          city     suburb   rural    total
strong    1.3333   1.3333   1.3333   4
medium    1.3333   1.3333   1.3333   4
weak      1.3333   1.3333   1.3333   4
total     4        4        4        12

Chi-square test (Calculated by Excel): "CHITEST"
### (probability of this sample outcome if no difference in population)
range: 0 to 1

Difference between predicted and actual
          city      suburb    rural     total
strong    -0.6667   0.3333    0.3333    0
medium    0.3333    -0.6667   0.3333    0
weak      0.3333    0.3333    -0.6667   0
total     0         0         0         0

A further unlabeled 3 x 3 grid of values appears on the sheet:
20   8    20
16   12   12
12   16   4

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

f_o = observed frequencies
f_e = expected frequencies
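The expected frequencies and the chi-square statistic can be reproduced with a short sketch. The function name `chi_square` is hypothetical. Turning the statistic into a probability (what Excel's CHITEST reports) needs the chi-square distribution and is not attempted here.

```python
def chi_square(observed):
    """Return (statistic, degrees_of_freedom, expected) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(observed, expected)
                    for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Observed table from the sheet (rows: strong, medium, weak; columns: city, suburb, rural).
observed = [[2, 1, 1],
            [1, 2, 1],
            [1, 1, 2]]
statistic, df, expected = chi_square(observed)
print(statistic, df)
```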
The t distribution is used for hypothesis testing with small samples (e.g., smaller than about 100 cases).
The t distribution is similar to the Z distribution, but is "flatter" because of the smaller sample size.
When the sample size gets large (e.g., over 50-100), the t distribution approaches that of the Z distribution (a normal curve).

Probabilities and t-scores for various degrees of freedom (two-tail test)
(probability of this outcome if no difference in population)

t-score   d.f. 5   d.f. 10   d.f. 50   d.f. 1000
tails     2        2         2         2
0         1.000    1.000     1.000     1.000
0.1       0.924    0.922     0.921     0.920
0.2       0.849    0.845     0.842     0.842
0.3       0.776    0.770     0.765     0.764
0.4       0.706    0.698     0.691     0.689
0.5       0.638    0.628     0.619     0.617
0.6       0.575    0.562     0.551     0.549
0.7       0.515    0.500     0.487     0.484
0.8       0.460    0.442     0.427     0.424
0.9       0.409    0.389     0.372     0.368
1         0.363    0.341     0.322     0.318
1.1       0.321    0.297     0.277     0.272
1.2       0.284    0.258     0.236     0.230
1.3       0.250    0.223     0.200     0.194
1.4       0.220    0.192     0.168     0.162
1.5       0.194    0.165     0.140     0.134
1.6       0.170    0.141     0.116     0.110
1.7       0.150    0.120     0.095     0.089
1.8       0.132    0.102     0.078     0.072
1.9       0.116    0.087     0.063     0.058
2         0.102    0.073     0.051     0.046
2.1       0.090    0.062     0.041     0.036
2.2       0.079    0.052     0.032     0.028
2.3       0.070    0.044     0.026     0.022
2.4       0.062    0.037     0.020     0.017
2.5       0.054    0.031     0.016     0.013
2.6       0.048    0.026     0.012     0.009
2.7       0.043    0.022     0.009     0.007
2.8       0.038    0.019     0.007     0.005
2.9       0.034    0.016     0.006     0.004
3         0.030    0.013     0.004     0.003

The 0.05 level is by convention used as the threshold of statistical significance (though sometimes we use an even more strict level, such as 0.01 or even 0.001).

[Chart: probability (y-axis, 0.000 to 1.000) against standardized sample differences (t-scores) (x-axis, 0 to 3); one curve per degrees of freedom: 5, 10, 50, 1000]
the larger the sample size, the lower the value of the critical t ... when the sample size gets large (e.g., over 50 - 100), then the critical t level (.05, 2 tail) approaches 1.96
difference of means

Small Standard Deviation (Factor: 50, 45)
Case       Male Income   Female Income
1          69,000        49,000
2          77,000        67,000
3          46,000        69,000
4          59,000        64,000
5          55,000        30,000
6          50,000        68,000
7          38,000        73,000
8          63,000        61,000
9          50,000        61,000
10         56,000        48,000
11         74,000        72,000
12         50,000        57,000
Mean       57,250        59,917
Std Dev.   11,702        12,428

Larger Standard Deviation (Factor: 80, 75)
Case       Male Income   Female Income
1          40,000        72,000
2          42,000        34,000
3          83,000        65,000
4          100,000       34,000
5          100,000       86,000
6          86,000        86,000
7          104,000       67,000
8          70,000        64,000
9          37,000        79,000
10         62,000        78,000
11         88,000        85,000
12         72,000        83,000
Mean       73,667        69,417
Std Dev.   24,092        18,372

[Charts, one per data set: male mean and female mean; x-axis from 0 to 100,000]

t-Test: Two-Sample Assuming Equal Variances (Small Standard Deviation)
                               Male Income   Female Income
Mean                           57250         59916.66667
Variance                       136931818.2   154446969.7
Observations                   12            12
Pooled Variance                145689393.9
Hypothesized Mean Difference   0
df                             22
t Stat                         -0.541

t-Test: Two-Sample Assuming Equal Variances (Larger Standard Deviation)
                               Male Income   Female Income
Mean                           73666.6667    69416.6667
Variance                       580424242     337537879
Observations                   12            12
Pooled Variance                458981061
Hypothesized Mean Difference   0
df                             22
t Stat                         0.486
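The equal-variance t statistic reported by Excel can be reproduced by hand. The sketch below uses the twelve male and female incomes of the "Small Standard Deviation" data set; the function name `pooled_t` is hypothetical, and the p-value is left out because it needs the t distribution.

```python
import math
import statistics

def pooled_t(sample_1, sample_2):
    """Two-sample t statistic assuming equal variances.

    Returns (t_stat, df, pooled_variance).
    """
    n1, n2 = len(sample_1), len(sample_2)
    v1, v2 = statistics.variance(sample_1), statistics.variance(sample_2)
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    std_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    t_stat = (statistics.mean(sample_1) - statistics.mean(sample_2)) / std_error
    return t_stat, df, pooled_variance

male = [69000, 77000, 46000, 59000, 55000, 50000,
        38000, 63000, 50000, 56000, 74000, 50000]
female = [49000, 67000, 69000, 64000, 30000, 68000,
          73000, 61000, 61000, 48000, 72000, 57000]

t_stat, df, pooled_variance = pooled_t(male, female)
print(round(t_stat, 3), df)
```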
Diff of Proportions
Percent of Residents Who Own a Car (city vs. suburban residents)
Question: does the difference found in the sample reflect a difference among the entire population? [the research ... difference is due merely to random sample ... there is no difference in the population as a whole.
t-score   -1.86871
Prob-t    0.068649
If t < -2 or t > 2, then it is "statistically significant" at the .05 level. That is, there is less than a 5% chance that one could get this difference in the sample drawn from a population where there is no difference between city and suburban residents.
Diff of Proportions (2)
Here, if given just the mean, n of cases

Mean                             10.0%    20.0%
n of cases                       150      120
degrees of freedom (n1+n2-2)     268

t-score
Numerator:      -10.0%
pu              0.14444   (see Blalock, p. 234)
sqrt(pu*qu)     0.35154
denominator =   0.04305
t-score         -2.3226
Prob-t          0.0209

Note that as the mean values deviate from 50%, we can be more accurate:
e.g., compare 10% to 20%, vs. 40% to 50%, or 80% to 90%
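The intermediate values on this sheet (pu, the denominator and the t-score) follow from the pooled-proportion form of the test. A minimal sketch, with a hypothetical function name and the sheet's two proportions and sample sizes as inputs:

```python
import math

def difference_of_proportions(p1, n1, p2, n2):
    """Return (t_score, pooled_proportion, std_error) for two sample proportions."""
    pu = (p1 * n1 + p2 * n2) / (n1 + n2)   # pooled proportion
    qu = 1 - pu
    std_error = math.sqrt(pu * qu) * math.sqrt(1 / n1 + 1 / n2)
    return (p1 - p2) / std_error, pu, std_error

t_score, pu, std_error = difference_of_proportions(0.10, 150, 0.20, 120)
print(round(t_score, 4), round(pu, 5), round(std_error, 5))
```

Converting the t-score to the Prob-t shown on the sheet requires the t distribution with n1 + n2 - 2 degrees of freedom, which is not included in this sketch.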
ANOVA
AUTO MILES DRIVEN PER WEEK

The last two columns repeat the City and Suburban headings on the sheet, with values in ascending order.

Case   City Residents   Suburban Residents   Rural Residents   City Residents   Suburban Residents
1      20               50                   40                0                17
2      0                80                   50                0                23
3      50               90                   60                12               24
4      100              350                  70                18               50
5      70               240                  80                18               60
6      35               120                  90                20               65
7      12               90                   100               24               70
8      150              80                   100               35               80
9      120              70                   20                35               80
10     0                60                   30                42               85
11     18               90                   40                50               90
12     35               111                  50                66               90
13     42               122                  60                67               90
14     67               133                  250               70               96
15     95               144                  170               75               111
16     66               155                  120               77               120
17     77               96                   150               95               122
18     123              23                   170               100              133
19     0                65                   180               120              144
20     18               24                   111               123              155
21     24               17                   130               150              240
22     75               85                   75                200              350
Mean   54.4             104.3                97.5              54.4             104.3

[Chart: AUTO MILES DRIVEN PER WEEK; urban, suburban and rural values with group means]

Anova: Single Factor
SUMMARY
Groups               Count   Sum    Average      Variance
City Residents       22      1397   63.5         2672.64286
Suburban Residents   22      2295   104.318182   5471.65584
Rural Residents      22      2146   97.5454545   3391.11688

ANOVA
Source of Variation   SS           df   MS           F            P-value      F crit
Between Groups        21054.6364   2    10527.3182   2.73782547   0.07241657   3.14280868
Within Groups         242243.727   63   3845.13853
Total                 263298.364   65
Why use ANOVA?
Use it in situations where you are comparing the means from more than two groups.
In a difference of means test, you compare x2-x1. For more than two groups, you can't compare x3-x2-x1, so you look at the variation (sum of squares) within vs. between groups.
Intuitively, sample groups with low internal variation, but high variation across groups, will likely represent real differences in the population as a whole, while sample groups with high internal variation and low variation across groups have a greater chance of representing populations with no real differences.
Anova: Single Factor
SUMMARY
Groups               Count   Sum    Average      Variance
City Residents       22      1197   54.4090909   1890.82468
Suburban Residents   22      2295   104.318182   5471.65584
Rural Residents      22      2146   97.5454545   3391.11688

ANOVA
Source of Variation   SS           df   MS           F            P-value      F crit
Between Groups        32248.5758   2    16124.2879   4.49829595   0.01492455   3.14280868
Within Groups         225825.545   63   3584.53247
Total                 258074.121   65
[Chart: AUTO MILES DRIVEN PER WEEK by group (city, suburban, rural), with group means marked; x-axis from 0 to 400]

$$F = \frac{SS_{between} / d.f._{between}}{SS_{within} / d.f._{within}}$$

SSbetween = sum of squares between the groups
SSwithin = sum of squares within the groups
d.f. = degrees of freedom
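The F ratio can be worked out directly from the sums of squares. The sketch below is an illustration with a hypothetical function name; it uses the 22 weekly-mileage values per group as listed in the data table (city sum 1197) and compares against the Excel output that uses the same city sum.

```python
def one_way_anova(groups):
    """Return (ss_between, ss_within, df_between, df_within, f_stat)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Between: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within: how far each case sits from its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

city = [20, 0, 50, 100, 70, 35, 12, 150, 120, 0, 18,
        35, 42, 67, 95, 66, 77, 123, 0, 18, 24, 75]
suburban = [50, 80, 90, 350, 240, 120, 90, 80, 70, 60, 90,
            111, 122, 133, 144, 155, 96, 23, 65, 24, 17, 85]
rural = [40, 50, 60, 70, 80, 90, 100, 100, 20, 30, 40,
         50, 60, 250, 170, 120, 150, 170, 180, 111, 130, 75]

ss_between, ss_within, df_between, df_within, f_stat = one_way_anova([city, suburban, rural])
print(round(ss_between, 4), round(ss_within, 3), round(f_stat, 4))
```

Comparing F with the critical value (F crit on the sheet) or converting it to a P-value needs the F distribution, which this sketch leaves to Excel or a statistics library.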
First data set
Case   x     y
1      0     0
2      0.1   0.1
3      0.2   0.2
4      0.3   0.3
5      0.4   0.3
6      0.5   0.4
7      0.6   0.5
8      0.7   0.6
9      0.8   0.7
10     0.8   0.8
11     0.9   0.9
12     1     0.9
correlation   +0.99
F             ###
p value       0.00

Second data set (only the x column survives)
Case   x
1      0.2
2      0.3
3      0.4
4      0.4
5      0.5
6      0.5
7      0.5
8      0.5
9      0.6
10     0.6
11     0.7
12     0.8
correlation
F
p value

[Charts, one per data set: scatter of y against x; both axes from 0 to 1]
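The correlation shown for the first data set can be checked with the usual Pearson formula. This is a sketch with a hypothetical function name; it assumes the sheet's "correlation" is the Pearson coefficient, which the text does not state explicitly.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# The twelve (x, y) cases of the first data set.
x = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.8, 0.9, 1]
y = [0, 0.1, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9]
r = pearson_r(x, y)
print(round(r, 2))
```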
correlation (range: -1 to +1)
Cost-benefit
... time preferences.
Present value (PV) = B(t) / (1+r)^t
where B(t) is the benefit in year t, r is the discount rate.
Net Present Value (NPV) = ∑ (B(t) - C(t)) / (1+r)^t
where B is benefits and C is costs.

$$NPV = \sum_{t=0}^{n} \frac{B_t - C_t}{(1+r)^t}$$

B_t = benefits in year t
C_t = costs in year t
t = year
NPV = net present value (benefits adjusted for cost)
r = discount rate (e.g., 6% per year or 0.06)

why is money worth less in the future?
1 people are impatient (and mortal)
2 opportunity cost of investing the capital elsewhere.
The argument for discounting is referred to as the 'marginal productivity of capital' argument, the use of the word 'marginal' indicating that it is the productivity of additional units of capital that is relevant. [99]
AND THE TRICK IS TO INCLUDE ENVIRONMENTAL COSTS AND BENEFITS. [99]
if ∑ (B(t) - C(t) ± E(t)) / (1+r)^t > 0, then the project is a net good project.

The Problems with Discounting for the Environment
a way to shift heavy costs to future generations.
note: it is hard to shift capital costs to future generations, since lenders want paybacks, e.g., 30 year loans. but it is easier to shift non-monetary costs to the future, since the lenders are around to complain! they don't have a contractual agr...
1 actual damage may be far larger than the discounted value.
2 long-term benefits are also not strongly valued (even though today's actions are required for those 50 years from now to enjoy them). i.e., they should not be discounted like capital.
3 will lead to greater exhaustion of exhaustible resources, esp. with a high discount rate.
However: "There is, in fact, no unique relationship between high discount rates and environmental deterioration." [103]
How to select a discount rate: simply the rate of economic growth for a nation? the interest rate? [104]
Taking sustainability into account:
EX: "require that any environmental damage be compensated by projects specifically designed to improve the environment." [106]
note how the r can really change the outcome, especially if costs and benefits patterns vary over time. (see graph).

EXAMPLE
The last two columns are the net benefit discounted for present value, (B(t) - C(t)) / (1+r)^t, and the cumulative Net Present Value (NPV), ∑ (B(t) - C(t)) / (1+r)^t.

t    Benefit B(t)   Cost C(t)   Net Benefit B(t)-C(t)   discount rate r   (1+r)^t   discounted net benefit   cumulative NPV
0    0              1,000,000   -1,000,000              0.02              1.00      -1,000,000               -1,000,000
1    100,000        100,000     0                       0.02              1.02      0                        -1,000,000
2    110,000        100,000     10,000                  0.02              1.04      9,612                    -990,388
3    120,000        100,000     20,000                  0.02              1.06      18,846                   -971,542
4    130,000        100,000     30,000                  0.02              1.08      27,715                   -943,827
5    140,000        100,000     40,000                  0.02              1.10      36,229                   -907,597
6    150,000        100,000     50,000                  0.02              1.13      44,399                   -863,199
7    160,000        100,000     60,000                  0.02              1.15      52,234                   -810,965
8    170,000        100,000     70,000                  0.02              1.17      59,744                   -751,221
9    180,000        100,000     80,000                  0.02              1.20      66,940                   -684,280
10   190,000        100,000     90,000                  0.02              1.22      73,831                   -610,449
11   200,000        100,000     100,000                 0.02              1.24      80,426                   -530,023
12   210,000        100,000     110,000                 0.02              1.27      86,734                   -443,288
13   220,000        100,000     120,000                 0.02              1.29      92,764                   -350,525
14   230,000        100,000     130,000                 0.02              1.32      98,524                   -252,001
15   240,000        100,000     140,000                 0.02              1.35      104,022                  -147,979
16   250,000        100,000     150,000                 0.02              1.37      109,267                  -38,712
17   260,000        100,000     160,000                 0.02              1.40      114,266                  75,554
18   270,000        100,000     170,000                 0.02              1.43      119,027                  194,581
19   280,000        100,000     180,000                 0.02              1.46      123,558                  318,139
20   290,000        100,000     190,000                 0.02              1.49      127,865                  446,003

Compare front-loading and backloading costs and changing discount rates.
[Chart: Benefit, Cost, Net Benefit and Cumulative Net Present Value (NPV) by Year; y-axis from -1,500,000 to 1,500,000]
the year when the green line crosses over the x axis (where y=0) is the year when the cumulative impact shifts from a net cost to a net benefit.
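The EXAMPLE table can be regenerated with a few lines of code, which also makes it easy to try other discount rates. This is a sketch only: the function name `cumulative_npv` is hypothetical, and the benefit and cost streams are rebuilt from the pattern in the table (benefits rising by 10,000 a year from 100,000, costs of 1,000,000 in year 0 and 100,000 a year afterwards).

```python
def cumulative_npv(net_benefits, r):
    """Running sum of (B(t) - C(t)) / (1 + r)**t for t = 0, 1, 2, ..."""
    running, out = 0.0, []
    for t, net in enumerate(net_benefits):
        running += net / (1 + r) ** t
        out.append(running)
    return out

benefits = [0] + [100_000 + 10_000 * (t - 1) for t in range(1, 21)]
costs = [1_000_000] + [100_000] * 20
net = [b - c for b, c in zip(benefits, costs)]

cumulative = cumulative_npv(net, r=0.02)
# First year in which the cumulative NPV turns positive.
break_even_year = next(t for t, v in enumerate(cumulative) if v > 0)
print(round(cumulative[-1]), break_even_year)
```

Re-running with a higher r (say 0.06) pushes the break-even year later or removes it altogether, which is the sensitivity to r that the notes point out.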
MEASURES OF INEQUALITY/DISPARITY: how to calculate a Gini Coefficient

gini   0.386
RANGE: 0 (PERFECT EQUALITY); 1 (PERFECT INEQUALITY)
n      20

Insert income amounts for each of the 20 people here -- be sure to arrange from LOW to HIGH.
Do NOT enter data in any of the other columns -- those are calculated.
Try entering both a fairly equal income distribution -- and then try a broadly unequal one.

Person "i"   Income X(i)   x(i) (calculated)   CUMULATIVE x(i) (calculated)   x(i)*i (calculated)
1            1,000         0.003               0.003                          0.00
2            3,000         0.009               0.013                          0.02
3            4,000         0.013               0.025                          0.04
4            5,000         0.016               0.041                          0.06
5            6,000         0.019               0.059                          0.09
6            8,000         0.025               0.084                          0.15
7            8,000         0.025               0.109                          0.18
8            9,000         0.028               0.138                          0.23
9            11,000        0.034               0.172                          0.31
10           12,000        0.038               0.209                          0.38
11           14,000        0.044               0.253                          0.48
12           17,000        0.053               0.306                          0.64
13           19,000        0.059               0.366                          0.77
14           21,000        0.066               0.431                          0.92
15           23,000        0.072               0.503                          1.08
16           27,000        0.084               0.588                          1.35
17           29,000        0.091               0.678                          1.54
18           32,000        0.100               0.778                          1.80
19           33,000        0.103               0.881                          1.96
20           38,000        0.119               1.000                          2.38
SUM          320000        1                                                  14.36
mean                       0.05

[Chart: GINI COEFFICIENT; series CUMULATIVE X and LINE OF EQUALITY; y-axis from 0.000 to 1.000, x-axis persons 0 to 20]
the LORENZ CURVE -- see how the curve deviates from the line of equality as the gini coefficient ...

source of formula and text: U.S. Census Bureau. The Changing Shape of the Nation's Income Distribution, 1947-1998, Current Population Reports, by Arthur F. Jones Jr. and Daniel H. Weinberg (Issued June 2000)
http://www.census.gov/prod/2000pubs/p60-204.pdf
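One way to reproduce the sheet's result from its own columns is sketched below. The formula used, G = (2/n) * sum(i * x(i)) - (n + 1)/n with x(i) the income share of the i-th person in ascending order, is an assumption: it matches the "x(i)*i" column and gives the sheet's 0.386, but the sheet itself does not spell the formula out. The function name `gini` is hypothetical.

```python
def gini(incomes):
    """Gini coefficient from individual incomes (assumed formula, see lead-in)."""
    xs = sorted(incomes)                      # arrange from LOW to HIGH
    n, total = len(xs), sum(xs)
    # Sum of rank * income share, i.e. the sheet's "x(i)*i" column total.
    weighted = sum(i * x / total for i, x in enumerate(xs, start=1))
    return 2 * weighted / n - (n + 1) / n

# The 20 incomes from the sheet.
incomes = [1000, 3000, 4000, 5000, 6000, 8000, 8000, 9000, 11000, 12000,
           14000, 17000, 19000, 21000, 23000, 27000, 29000, 32000, 33000, 38000]
g = gini(incomes)
print(round(g, 3))
```

Passing a list of identical incomes returns 0 (perfect equality), which is a quick sanity check on the formula.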