Crash Course in Elementary Statistical Methods


```
Aim: understanding the relationship between two or more variables.

Examples:

(1) how technology, institutions, education or health affect
the growth of output and income in a country;

(2) how changes in household income affect the children's
schooling attainment;

(3) how more funding to schools affects the children's performance;

(4) how better health affects the nutrition of poor people, etc.

Elementary Statistical Methods

Regression analysis is a statistical technique
that allows the exploration of possible
interrelationships between variables.

Variables

Assume we have two variables: x and y,
and we want to study the relationship
between them.
x = annual income in the family
y = school enrollment of children within the family

Data

Data: may be collected at different levels,
e.g. family, village, district, country, etc.

Data

Cross-sectional data: observations collected at
the same point in time but across different units
(families, villages, countries etc)

Data

Time-series data: observations collected for the same unit
but over different time periods.
Panel data: observations collected for the same set of units
over several time periods.

Empirical Analysis
Example:

Estimating the effect of income on educational
attainment
Cross-sectional data on income and enrollment alone
may not be enough.
There might be important (unobserved) differences between
families that obscure the "pure" effect of income on enrollment.
Or we might have excluded variables from the regression
that are correlated with family income and that have
their own effect on children's enrollment
→ biased estimate of the effect.

Empirical Analysis

Example of such a bias:
Families with higher income may also have more progressive
parents => the estimated effect of income is biased upwards.
Controlling for parental attributes => the estimated effect
of income on education would be smaller.

Summary statistics
The mean:
The sum of all observations of the relevant
variable divided by the total number of
observations (n):

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

Summary statistics

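The variance:
We want to know whether the different observations lie
more or less close to the mean (i.e. whether they are
clustered together) or far from it (i.e. whether they are
dispersed).
=> add up the squared differences of the observations from
the mean and divide by the number of observations (n):

$$ \mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

where \bar{x} is the mean of x. The differences are squared
because the raw differences from the mean always sum to zero.
(Some texts divide by n - 1 instead of n.)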

Summary statistics

The standard deviation:
The square root of the variance. It makes the units
comparable to those in which the variable was
originally measured:

$$ \mathrm{sd}(x) = \sqrt{\mathrm{var}(x)} $$

Summary statistics
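Correlation:
Our main goal is to understand whether two
(or more) variables move together, i.e.
whether they covary. A measure of this is the covariance:

$$ \mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$

where \bar{x} and \bar{y} are the means of x and y.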

Summary statistics

Interpretation of covariance:
If yi tends to exceed its mean when xi exceeds its mean,
then the covariance will be positive.
Similarly, if xi tends to fall short of its mean
when yi exceeds its mean, then the covariance will be
negative.

Regression analysis

We are interested in finding out the form of the
relationship between variables x and y,
not just whether they are correlated.

We want to study the marginal impact of x on y:
by how much does an increase in x appear to affect y?
This is the general question in regression analysis.

Preliminary test: Scatter diagrams

Regression analysis

First: decide which is the "causal" variable
and which is the variable that is affected by
the movements of the "causal variable".

Convention:
let x stand for the causal variable =
independent variable
let y stand for the dependent variable.

Regression analysis

Second: Construct a diagram in which you
have the independent variable on the
horizontal axis and the dependent variable on
the vertical axis.

Look at the scatter plot and study the
potential relationship.

Scatter plot

The basics of regression

Suppose we think that the relationship
between x and y can be described as a linear
relationship.

What does this mean?
Each one-unit change in x is associated with the same
change in y, whatever the starting value of x.
With a positive relationship, for example, y increases when x
increases and decreases when x decreases.

The basics of regression

A (linear) equation:

y = α + βx

where α is the constant (the intercept) and β measures the
effect of x on y.
When x = 0:
y = α, and y increases (or decreases) by the amount β for
each additional unit increase (or decrease) in the value of
x.

The basics of regression

Given a set of observations, regression
analysis finds the straight line that best
fits the data. The values of α and β are
then estimated from that best-fit line.

"Best fit" = the actual data points should not lie
very far away from the line.

The basics of regression

β > 0 => upward-sloping line
β < 0 => downward-sloping line

[Figure: scatter plot with the “best fit” line; slope = β]

The basics of regression

When you run the regression in a statistical
program, e.g. Stata,
the program gives you the optimally chosen values for α and β,
where β is the slope of the line that best fits the
data.

β is called the regression coefficient. It tells us about
the strength of the influence of x on y:
a high value of β implies that a small change in x can bring
about a large change in y;
a low value of β implies the opposite.

Multivariate regressions:

A regression with more than ONE independent
variable:

y = α + βx + γz

where y = children's enrollment, x = family income, z =
parental education, and γ is the coefficient on z

Interpretation:
β now tells us the effect on y of a change in x when
the value of z is held constant.

Can the estimated coefficient β be
trusted?
Think of a large set of (x,y) observations that we
might have access to, but what we really have in our
hands is a subset or a sample of these
observations.

Our sample allows us to construct estimates of α
and β of the true relationship that we believe is "out
there".

Our estimates are random variables: their values depend on
which observations from the entire set happen to be in our sample.

Can the estimated coefficient β be
trusted?
The statistical program will calculate how precise or how
significant our estimates are:
how confident can we be that our estimated
value of β is close to the true β?

Hypothesis testing

The underlying hypothesis which you want to
test is called the null hypothesis.

Null hypothesis: H₀: β = 0,
Alternative hypothesis: HA: β ≠ 0.

We want to test the null hypothesis that family
income has NO effect on children's
enrollment.

Hypothesis testing
Example:

A regression on family income and children's school
enrollment gives us an estimate of β

=> We want to know whether β is significantly different
from 0.

We form the hypotheses H₀: β = 0 and HA: β ≠ 0.

Using the sample data, compute the test statistic (t-value):
the estimated β divided by its standard error. It tells us
whether we can reject H₀ or not.

Hypothesis testing

To put it simply:
if the absolute value of the t-value is greater than about 2,
we reject the null hypothesis that β = 0.

=> This means that we can be confident that the
effect of, e.g., family income on children's schooling is
NOT 0.

Then, look at the estimated β to see whether it is
positive or negative.

Hypothesis testing

“Estimate is significant at the 5% level”:

The null hypothesis is rejected using a test that has
at most a 5% probability of rejecting the null
hypothesis when it is in fact true.
We can therefore be fairly confident that we have not
rejected a true null hypothesis.

Randomized evaluations

“What would have happened to this person's
behavior if she had been subjected to an
alternative policy?”
Would she work more if marginal taxes were lower?
Would she earn less if she had not gone to school?
Would she have had higher test scores if she had had
proper textbooks at school?

Randomized evaluations

Example:

YiT = the average test scores of children in a given
school i if the school has textbooks

YiC = the average test scores of children in the same
school i if the school has no textbooks

We are interested in the difference YiT- YiC , which is the
effect of having textbooks for school i.

Randomized evaluations

We will never know the effect of having
textbooks on a school in particular BUT we
may hope to learn the average effect that it
will have on schools:

E[YiT - YiC ].

Randomized evaluations

Difference =

E[YiT | School has textbooks] – E[YiC | School has no textbooks]

= E[YiT | T] – E[YiC | C].

Randomized evaluations

The problem is:
there may be systematic differences between schools with
textbooks and schools without textbooks.
E.g. schools with textbooks might have better teachers,
more money, etc.

If we only run a regression of test scores on
textbooks, we will not get the “causal” effect because
we are not controlling for other variables such as
teachers, funding, etc.
The estimate is then said to be biased.

Randomized evaluations

How do we eliminate the bias in the
estimate?

One way to do this is to randomly decide
which schools get textbooks and which
do not.

Randomized evaluation:

Evaluating policy programs, e.g. textbook provision
to schools, deworming medicine for children in primary
schools, etc.

Think about medical experiments - some people are
given the drug and some are not...

Ideal set-up to evaluate the effect of a policy X on
outcome Y

Randomized evaluation

A sample of N individuals is selected from a
population. This sample is then randomly divided
into two groups:

(1) Treatment group

(2) Control group

The Treatment group is treated with policy X while
the Control group is not.

Randomized evaluation

The effect of policy X is measured by the difference
in empirical means of Y between the Treatment and
Control groups:
D = Ê[Y | T] − Ê[Y | C]

(where Ê denotes the empirical mean)

Randomized evaluation

Key: Since Treatment has been randomly
assigned, the two groups are on average similar in
their other characteristics and hence your estimate is
going to be unbiased.


```
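The calculations described in the slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's own code (the slides use Stata): the function names and all numbers are hypothetical, the variance and covariance divide by n, and the line is fitted by ordinary least squares, which the slides do not name explicitly.

```python
import math
from statistics import fmean


def variance(values):
    """Average squared deviation from the mean (dividing by n)."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])


def covariance(xs, ys):
    """Average product of the deviations of x and y from their means."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def fit_line(xs, ys):
    """Least-squares fit of y = alpha + beta * x.

    Returns (alpha, beta, t_value), where t_value tests H0: beta = 0.
    """
    n = len(xs)
    beta = covariance(xs, ys) / variance(xs)
    alpha = fmean(ys) - beta * fmean(xs)
    residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    # Residual variance divides by n - 2: two parameters were estimated.
    residual_var = sum(r * r for r in residuals) / (n - 2)
    # n * variance(xs) is the sum of squared deviations of x.
    se_beta = math.sqrt(residual_var / (n * variance(xs)))
    return alpha, beta, beta / se_beta


def difference_in_means(treatment, control):
    """Estimate D as the gap between the two empirical means."""
    return fmean(treatment) - fmean(control)


# Hypothetical data: family income (x) and school enrollment (y).
income = [1, 2, 3, 4, 5]
enrollment = [2, 4, 5, 4, 5]
alpha, beta, t_value = fit_line(income, enrollment)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}, t = {t_value:.2f}")

# Hypothetical test scores from a randomized textbook program.
treated_scores = [62, 70, 81, 77]
control_scores = [58, 66, 71, 69]
print(f"D = {difference_in_means(treated_scores, control_scores):.1f}")
```

With only five observations the "t-value greater than 2" rule is a rough guide at best; it is meant for reasonably large samples.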