Crash Course in Elementary Statistical Methods

 Aim: to understand the relationship between two or more variables,
 for example:

 (1) how technology, institutions, education or health affect
 growth of output and income in a country;

 (2) how income changes within the household affect the children's
 schooling attainment;

 (3) how more funding to schools affects the children's performance;

 (4) how better health affects the nutrition of poor people, etc.

Elementary Statistical Methods

 Regression analysis is a statistical technique
 that allows the exploration of possible
 interrelationships between variables.


 Assume we have two variables: x and y,
 and we want to study the relationship
 between them.
    x = annual income in the family
    y = school enrollment of children within the family


 Data: may be collected at different levels,
 e.g. family, village, district, country, etc.


   Cross-sectional data: observations collected at
   the same point in time but across different units
   (families, villages, countries etc)


   Time-series / Panel data: observations collected
   for the same unit(s) over different time periods
   (panel data follow several units over time).

Empirical Analysis

 Estimating the effect of income on education:
 Cross-sectional data on income and enrollment alone
 may not be enough.
 There might be important (unobserved) differences between
 families that obscure the "pure" effect of income on enrollment.
 OR, we might have excluded some variables from the
 regression that are correlated with family income
 and that have their own effect on children's enrollment
 → biased estimate of the effect.

Empirical Analysis

 Bias can be caused by, for example:
   Families with higher income => more progressive
   parents => the estimated effect of income is biased upwards.
   Controlling for parental attributes => the estimated effect
   of income on education would be smaller.

Summary statistics
 The mean:
 The sum of all observations of the relevant
 variable divided by the total number of
 observations (n):

 $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Summary statistics

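 The variance:
 We want to know whether the different observations lie
 more or less close to the mean (i.e. whether they are
 clustered together) or far from it (i.e. whether they are
 spread out)
 => add up the squared differences of the observations from
 the mean (squared, so that positive and negative differences
 do not cancel) and divide by n:

 $$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

 where $\bar{x}$ is the mean.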

Summary statistics

 The standard deviation:
 the square root of the variance. It makes the units
 comparable to those in which the variable originally
 was measured:

 $$\sigma = \sqrt{\sigma^2}$$
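 A minimal sketch of the three summary statistics in Python. The income figures are made up for illustration, and dividing by n (rather than n - 1) is an assumption that follows the definition of the mean above.

```python
import math

def mean(values):
    """Sum of all observations divided by the number of observations n."""
    return sum(values) / len(values)

def variance(values):
    """Average squared deviation from the mean (divides by n, an assumption)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def std_dev(values):
    """Square root of the variance, in the variable's original units."""
    return math.sqrt(variance(values))

# Hypothetical annual family incomes (made-up numbers, in thousands).
incomes = [2, 4, 4, 4, 5, 5, 7, 9]

print(mean(incomes))      # 5.0
print(variance(incomes))  # 4.0
print(std_dev(incomes))   # 2.0
```

 The variance is in squared units (thousands squared); the standard deviation is back in thousands, like the incomes themselves.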

Summary statistics
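 Our main goal is to understand whether two
 (or more) variables move together, i.e.
 whether they covary. The covariance of x and y:

 $$\mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

 where $\bar{x}$ and $\bar{y}$ are the means of x and y.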

Summary statistics

 Interpretation of covariance:
   If yi tends to exceed its mean when xi exceeds its mean as
   well, then the covariance will be positive.
   Similarly, if xi tends to fall short of its mean
   when yi exceeds its mean, then the covariance will be
   negative.
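 The sign rule can be checked on a small example. This is only a sketch: the data are made up, and the covariance is computed with a divisor of n, which is an assumption.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Average product of paired deviations from the two means (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: x = family income, y = years of schooling (made-up numbers).
income = [1, 2, 3, 4, 5]
schooling_up = [2, 4, 5, 4, 5]    # tends to be above its mean when income is
schooling_down = [5, 4, 5, 4, 2]  # tends to be below its mean when income is above

cov_up = covariance(income, schooling_up)
cov_down = covariance(income, schooling_down)
print(cov_up, cov_down)  # positive, then negative
```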

Regression analysis

 We are interested in finding out the form of the
 relationship between variables x and y
   not just in whether they are correlated.

 We want to study the marginal impact of x on y:
   by how much does an increase in x appear to affect y?
   This is the general question in regression analysis.

 Preliminary test: Scatter diagrams

Regression analysis

 First: decide which is the "causal" variable
 and which is the variable that is affected by
 the movements of the "causal variable".

 let x stand for the causal variable =
 independent variable
 let y stand for the dependent variable.

Regression analysis

 Second: Construct a diagram in which you
 have the independent variable on the
 horizontal axis and the dependent variable on
 the vertical axis.

 Look at the scatter plot and study the
 potential relationship.

Scatter plot

The basics of regression

 Suppose we think that the relationship
 between x and y can be described as linear.

 What does this mean?
   A one-unit change in x always changes y by the same
   amount, whatever the level of x.
   E.g. if the relationship is positive, y increases when x
   increases and decreases when x decreases.

The basics of regression

 A (linear) equation:

    y = α + βx

 where α is the constant and β describes the effect of
 x on y.
 When x = 0, y = α; y then increases (or decreases) by the amount β for
 each additional unit increase (or decrease) in the value of x.

The basics of regression

 Given a set of observations, regression
 analysis finds the straight line that is the
 best fit to the data. The values of α and β are
 then estimated from that best-fit line.

 "Best fit" = the actual data points should not be
 very far away from the line.
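 One common way to make "best fit" precise is ordinary least squares: choose α and β to minimise the sum of squared vertical distances between the data points and the line. The slides do not name the criterion, so treating it as least squares is an assumption, and the data below are made up.

```python
def mean(values):
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Return (alpha, beta) of the least-squares line y = alpha + beta * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    beta = sxy / sxx        # slope: co-movement of x and y over spread of x
    alpha = my - beta * mx  # intercept: the line passes through the two means
    return alpha, beta

def sum_sq_residuals(xs, ys, alpha, beta):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]   # hypothetical x values
ys = [2, 4, 5, 4, 5]   # hypothetical y values
alpha, beta = fit_line(xs, ys)
print(alpha, beta)
print(sum_sq_residuals(xs, ys, alpha, beta))
```

 Moving either α or β away from the fitted values increases the sum of squared residuals, which is the sense in which this line is the "best" one.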

The basics of regression

 Slope = β
   β > 0 => upward-sloping line
   β < 0 => downward-sloping line

“Best fit” line

The basics of regression

 Running the regression in a statistical
 program, e.g. STATA:
   the program gives you the optimally chosen value for β,
   i.e. the slope of the line that best fits the data.

 β is called the regression coefficient. It tells us about
 the strength of the influence of x on y:
   a high value of β implies that a small change in x can bring
   about a large change in y;
   a low value of β implies the opposite.

Multivariate regressions:

 A regression with more than ONE independent
 variable:

    y = α + βx + γz

 where y = children's enrollment, x = family income, z =
 parental education, and γ is the coefficient on z.

 β now tells us the effect on y of a change in x when
 the value of z is held constant.
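 A small sketch of why holding z constant matters. The data are constructed (not real) so that y = 1 + 2x + 3z exactly and z is positively correlated with x. The name gamma for the coefficient on z and the closed-form least-squares solution for two regressors are choices made for this illustration.

```python
def mean(values):
    return sum(values) / len(values)

def cross_dev(us, vs):
    """Sum of products of deviations from the means."""
    mu, mv = mean(us), mean(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))

def simple_slope(xs, ys):
    """Slope from regressing y on x alone."""
    return cross_dev(xs, ys) / cross_dev(xs, xs)

def two_regressor_fit(xs, zs, ys):
    """Least-squares (alpha, beta, gamma) for y = alpha + beta*x + gamma*z."""
    sxx, szz, sxz = cross_dev(xs, xs), cross_dev(zs, zs), cross_dev(xs, zs)
    sxy, szy = cross_dev(xs, ys), cross_dev(zs, ys)
    det = sxx * szz - sxz ** 2  # zero only if x and z are perfectly collinear
    beta = (szz * sxy - sxz * szy) / det
    gamma = (sxx * szy - sxz * sxy) / det
    alpha = mean(ys) - beta * mean(xs) - gamma * mean(zs)
    return alpha, beta, gamma

x = [1, 2, 3, 4, 5, 6]                             # hypothetical family income
z = [1, 1, 2, 2, 3, 4]                             # hypothetical parental education
y = [1 + 2 * xi + 3 * zi for xi, zi in zip(x, z)]  # constructed outcome

beta_alone = simple_slope(x, y)
alpha_hat, beta_hat, gamma_hat = two_regressor_fit(x, z, y)
print(beta_alone)           # larger than 2: picks up part of the effect of z
print(beta_hat, gamma_hat)  # close to 2 and 3
```

 Leaving z out attributes part of its effect to x, which is the upward bias described in the Empirical Analysis slides.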

Can the estimated coefficient β be
 Think of a large set of (x,y) observations that we
 might have access to, but what we really have in our
 hands is a subset or a sample of these observations.

 Our sample allows us to construct estimates of α
 and β of the true relationship that we believe is "out
 there".

 Our estimates are random variables: their values depend on
 which observations from the entire set happen to be in our sample.

Can the estimated coefficient β be
 Statistics tells us how precise or how
 significant our estimates are:
 how confident can we be that our estimated
 value of β is close to the true β?

Hypothesis testing

 The underlying hypothesis which you want to
 test is called the null hypothesis.

 Null hypothesis: H₀: β = 0,
 Alternative hypothesis: HA: β ≠ 0.

 We want to test the null hypothesis that family
 income has NO effect on children's school enrollment.

Hypothesis testing

 A regression on family income and children's school
 enrollment gives us an estimate of β

 => We want to know whether β is significantly different
 from 0.

 We form the hypotheses H₀: β = 0 and HA: β ≠ 0.

 Using the sample data, we compute a test of whether we
 can reject H₀ or not. Doing this gives a test
 statistic (the t-value).

Hypothesis testing

 To put it simply:
 if the t-value is greater than 2 in absolute value, we reject
 the null hypothesis that β = 0.

 => This means that we can be confident that the
 effect of, e.g., family income on children's schooling is
 NOT 0.

 Then, look at the estimated β to see whether it is
 positive or negative.
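 A sketch of how the t-value for β can be computed by hand in a one-variable regression: the estimated slope divided by its standard error. The standard-error formula is the usual least-squares one, which the slides do not spell out, and both data sets are made up.

```python
import math

def slope_t_test(xs, ys):
    """Return (beta, standard error of beta, t-value) for y = alpha + beta*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = my - beta * mx
    # Residual variance uses n - 2 because two parameters were estimated.
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    return beta, se_beta, beta / se_beta

income = list(range(1, 11))                         # hypothetical x
enrol_related = [3, 4, 6, 5, 8, 9, 9, 12, 11, 14]   # rises with income
enrol_unrelated = [5, 7, 4, 6, 5, 6, 4, 7, 5, 6]    # no clear pattern

beta_r, se_r, t_r = slope_t_test(income, enrol_related)
beta_u, se_u, t_u = slope_t_test(income, enrol_unrelated)
print(beta_r, t_r)  # |t| well above 2: reject beta = 0
print(beta_u, t_u)  # |t| below 2: cannot reject beta = 0
```

 With very few observations the cutoff of 2 is too lenient; it is a rule of thumb that works for moderately large samples.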

Hypothesis testing

 “Estimate is significant at the 5% level”:

   The null hypothesis is rejected, and if the null
   hypothesis were in fact true there would be less than
   a 5% probability of rejecting it by mistake.
   We can be fairly confident that we have not rejected a
   true null hypothesis.
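 The meaning of the 5% level can be illustrated by simulation: draw many samples in which x truly has no effect on y, and count how often the rule "reject if the t-value exceeds 2 in absolute value" rejects anyway. This is only a sketch; the sample size, number of samples, seed and normally distributed noise are all assumptions.

```python
import math
import random

def slope_t_value(xs, ys):
    """t-value of the least-squares slope in a regression of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = my - beta * mx
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    return beta / math.sqrt(rss / (n - 2) / sxx)

def false_rejection_rate(n_samples=2000, n_obs=50, seed=0):
    """Share of samples in which a true null (beta = 0) is rejected."""
    rng = random.Random(seed)
    xs = list(range(1, n_obs + 1))
    rejections = 0
    for _ in range(n_samples):
        ys = [rng.gauss(0, 1) for _ in xs]  # y is pure noise, unrelated to x
        if abs(slope_t_value(xs, ys)) > 2:
            rejections += 1
    return rejections / n_samples

rate = false_rejection_rate()
print(rate)  # should come out in the neighbourhood of 0.05
```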

Randomized evaluations

 “What would have happened to this person's
 behavior if she had been subjected to an
 alternative policy?”
   would she work more if marginal taxes were lower?
   would she earn less if she had not gone to school?
   would she have had higher test scores if she had had
   proper textbooks at school?

Randomized evaluations


 YiT = the average test scores of children in a given
 school i if the school has textbooks

 YiC = the average test scores of children in the same
 school i if the school has no textbooks

 We are interested in the difference YiT- YiC , which is the
 effect of having textbooks for school i.

Randomized evaluations

  We will never know the effect of having
  textbooks on a school in particular BUT we
  may hope to learn the average effect that it
  will have on schools:

                    E[YiT - YiC ].

Randomized evaluations

Difference =

  E[YiT | School has textbooks] – E[YiC | School has no textbooks]

  = E[YiT | T] – E[YiC | C].

Randomized evaluations

  The problem is:
    there may be systematic differences between schools with
    textbooks and schools without textbooks.
    E.g. schools with textbooks might have better teachers,
    more money, etc.

  If we only run a regression of test scores on
  textbooks, we will not get the “causal” effect because
  we are not controlling for other variables such as
  teachers, funding, etc.
  The estimate is then said to be biased.

Randomized evaluations

 How do we eliminate the bias in the estimate?

 One way to do this is to randomly decide
 which schools get textbooks and which
 do not.

Randomized evaluation:

 Evaluating policy programs, e.g. textbook provision
 to schools, deworming medicine for children in primary
 schools, etc.

 Think about medical experiments - some people are
 given the drug and some are not...

 Ideal set-up to evaluate the effect of a policy X on
 outcome Y

Randomized evaluation

 A sample of N individuals is selected from a
 population. This sample is then randomly divided
 into two groups:

 (1) Treatment group

 (2) Control group

 The Treatment group is treated with policy X while
 the Control group is not.

Randomized evaluation

 The effect of policy X is measured by the difference
 in empirical means of Y between the Treatment and
 Control groups:

          D = Ê[Y | T] − Ê[Y | C]

 (where Ê denotes the empirical mean)

Randomized evaluation

 Key: Since Treatment has been randomly
 assigned, the two groups are similar on average in other
 characteristics and hence the estimate is
 unbiased.
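 A simulated illustration of the textbook example. Everything here is hypothetical: the unobserved "quality" of a school, the true effect of 5 points, the base score of 50 and the number of schools are invented for the sketch. It compares the difference in means D when better schools are the ones that have textbooks with D when textbooks are assigned by coin flip.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(n_schools=10000, true_effect=5.0, seed=0):
    """Return (D under self-selection, D under random assignment)."""
    rng = random.Random(seed)
    # Unobserved school quality (teachers, funding) that also raises scores.
    quality = [rng.gauss(0, 10) for _ in range(n_schools)]

    def score(q, has_textbooks):
        return 50 + q + (true_effect if has_textbooks else 0.0)

    def diff_in_means(assignment):
        treated = [score(q, True) for q, t in zip(quality, assignment) if t]
        control = [score(q, False) for q, t in zip(quality, assignment) if not t]
        return mean(treated) - mean(control)

    selected = [q > 0 for q in quality]                 # better schools get books
    randomized = [rng.random() < 0.5 for _ in quality]  # coin flip per school
    return diff_in_means(selected), diff_in_means(randomized)

d_selected, d_randomized = simulate()
print(d_selected)    # far above the true effect: quality differences leak in
print(d_randomized)  # close to the true effect of 5
```

 Under random assignment the two groups have roughly the same average quality, so the difference in means reflects the textbooks alone.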