Bootstrapping

Lynn Lethbridge

SHRUG
November 2010
What is Bootstrapping?
• A method to estimate a statistic’s sampling distribution

• Bootstrap samples are drawn repeatedly, with replacement,
  from the original data

• From each new sample, the statistic is re-calculated and
  saved in a dataset (i.e., 200 bootstrap samples give 200 statistics)

• The standard error of the statistic is calculated as the
  standard deviation of the bootstrap statistics

• Bootstrapping is not used for the point estimate, which still
  comes from the original sample
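The steps above can be sketched in a few lines. This is a minimal illustration in Python (not the SAS used in the rest of the talk), with made-up data, a hypothetical helper name `bootstrap_se`, and the median chosen arbitrarily as the statistic.

```python
import random
import statistics

def bootstrap_se(data, statistic, n_boot=200, seed=2010):
    """Standard deviation of `statistic` over n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # draw n observations with replacement from the original data
        sample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(sample))
    return statistics.stdev(replicates), replicates

# Made-up illustration data (not from the slides)
values = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.9]

# The point estimate comes from the original sample, not from the bootstrap
point_estimate = statistics.median(values)
se, reps = bootstrap_se(values, statistics.median)
print(point_estimate, round(se, 3))
```

With 200 bootstrap samples, `reps` holds 200 medians and `se` is their standard deviation.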
When to Use Bootstrapping
• The statistic’s distribution has no clear analytical solution
    - e.g., Gini coefficient, poverty intensity
• To test for sensitivity
• Complex survey design (not a simple random sample)
    - e.g., Statistics Canada surveys use a stratified, multistage
      design
        - Households within clusters within strata are selected
        - Observations will not be independent, so variance calculated
          the usual way will be underestimated
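As an illustration of the first case, here is a hedged Python sketch that bootstraps a standard error for a Gini coefficient. The incomes are made up, the sorted-rank formula is one common way to compute the Gini, and simple resampling is assumed (it does not account for a complex survey design).

```python
import random
import statistics

def gini(incomes):
    """Gini coefficient via the sorted-rank formula."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

rng = random.Random(42)
incomes = [12, 18, 25, 31, 40, 47, 55, 68, 90, 150]  # made-up values

gini_point = gini(incomes)          # point estimate from the original sample
gini_reps = []
for _ in range(500):
    # resample the incomes with replacement and recompute the statistic
    resample = [rng.choice(incomes) for _ in incomes]
    gini_reps.append(gini(resample))

gini_se = statistics.stdev(gini_reps)  # bootstrap standard error
print(round(gini_point, 3), round(gini_se, 3))
```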
Two Programs
• One is ‘traditional’ bootstrapping
    - re-sampling from the original sample
• The second is bootstrapping using Statistics Canada
  survey data
    - Statistics Canada does the re-sampling heavy lifting in
      most of its surveys
    - Use the bootstrap weights provided to calculate the
      standard error
Program 1
• Project where we examined the effect of trade on
  ‘poverty intensity’ in Canada and the US
• Used state/province-level measures in regression
  analysis
• Used bootstrapping to measure the robustness of results
  given a different mix of policies
    - Our dataset consists of 61 unique observations of states
      and provinces. Re-sample to see whether the results would change if
      we had a different make-up of regions
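The idea of re-sampling whole observations and re-estimating the regression each time can be sketched as follows. This Python illustration is a deliberate simplification: it uses simulated data, a single regressor, and unweighted least squares, whereas the SAS program uses several regressors and a weight. All names here are hypothetical.

```python
import random
import statistics

def ols_slope(pairs):
    """Slope of a simple least-squares line through (x, y) pairs."""
    mx = statistics.fmean(x for x, _ in pairs)
    my = statistics.fmean(y for _, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / sxx

rng = random.Random(1997)
# 61 simulated "regions": y = 0.5 + 2x + noise (made-up data)
regions = [(i / 10, 0.5 + 2.0 * (i / 10) + rng.gauss(0, 0.5)) for i in range(61)]

slope_point = ols_slope(regions)    # point estimate from the original sample
slope_reps = []
for _ in range(300):
    # draw 61 regions with replacement, keeping each (x, y) pair together
    resample = [regions[rng.randrange(len(regions))] for _ in regions]
    slope_reps.append(ols_slope(resample))

slope_se = statistics.stdev(slope_reps)  # bootstrap standard error of the slope
print(round(slope_point, 3), round(slope_se, 3))
```

Keeping each observation's x and y together is what lets the resampled datasets mimic "a different make-up of regions".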
/** run the regression with original sample to get
point estimates */

proc reg data=orig.pov97
outest=work.estpoint(keep=intercept lmurate
aveuiben tradeimp tradeexp sambearn can);

model sst = lmurate   aveuiben tradeimp tradeexp
sambearn can;
weight invse;
title " 1997";
run;

proc transpose data=work.estpoint
out=work.estpoint2(drop=_label_ rename=(col1=coef));
run;
/* put the sample size in a macro variable */

proc means data=orig.pov97 noprint;
var year;
output out=work.out n=totnum;
run;

data _null_;
    set work.out;
    /* symputx trims the leading blanks that symput would keep */
    call symputx('totnum', totnum);
run;
/** make a temporary file of original dataset */
data work.pov97;
    set orig.pov97;
run;

/* initiate bootstrap dataset   */

data work.boot97fin;
    set _null_;
run;

options nonotes;

/* create macro for number of bootstraps   */

%let bt=1000;
%macro boot;

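/** construct a new sample of 61 observations,
randomly drawn with replacement */

data work.boot;
    do i=1 to &totnum;
        /* random row number between 1 and the sample size; the seed
           depends on the replicate number x set by the calling macro */
        _p=ceil(ranuni(i+&x)*&totnum);
        /* read that row directly from the original data */
        set work.pov97 point=_p;
        if _error_ then abort;
        output;
    end;
    /* stop is required because direct access never reaches end of file */
    stop;
run;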
/* estimate coefficients from bootstrap sample*/

proc reg data=work.boot noprint
outest=work.est(keep=intercept lmurate   aveuiben
tradeimp tradeexp sambearn can);

model sst = lmurate   aveuiben tradeimp tradeexp
sambearn can;
weight invse;
title " 1997";
run;

/** add coefficients to dataset    */

data work.boot97fin;
    set work.boot97fin work.est;
run;


%mend boot;
/** invoke the boot macro 1000 times */

%macro reps;

%do x=1 %to &bt;

    %boot;
%end;

%mend reps;

%reps;
options notes;

/** calculate the standard deviation of each
bootstrapped coefficient */

proc means data=work.boot97fin n mean std;
output out=work.std std=intercept lmurate aveuiben
tradeimp tradeexp sambearn can;
run;

proc transpose data= work.std
(drop=_type_ _freq_)out=work.std2(drop=_label_
rename=(col1=se));
 run;
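/** merge point estimates with standard errors (matched row by row,
so both datasets must list the coefficients in the same order) and
calculate statistics */

data work.final;
    merge work.estpoint2 work.std2;

    /* ratio of the point estimate to its bootstrap standard error */
    t=coef/se;
    /* two-sided p-value from the standard normal distribution */
    pvalue=(1-probnorm(abs(t)))*2;

run;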

 proc print data= work.final;
 run;
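The final data step divides each point estimate by its bootstrap standard error and takes a two-sided p-value from the standard normal. A small Python sketch of that arithmetic, using the intercept's coefficient and bootstrap standard deviation as reported in this program's output; the function name is an assumption for illustration.

```python
import math

def normal_two_sided_p(coef, se):
    """Return (ratio, p-value), mirroring (1 - probnorm(abs(t))) * 2."""
    t = coef / se
    # erfc(|t| / sqrt(2)) equals 2 * (1 - Phi(|t|)) for the standard normal
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

# intercept: point estimate and bootstrap standard deviation from the output
t, p = normal_two_sided_p(0.056482, 0.0305142)
print(round(t, 3), round(p, 4))
```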
                       Parameter Estimates

                  Parameter      Standard
Variable    DF     Estimate         Error    t Value    Pr > |t|

Intercept    1      0.05648       0.02317       2.44      0.0181
lmurate      1      0.06210       0.01433       4.33      <.0001
aveuiben     1  -0.00009479    0.00003002      -3.16      0.0026
tradeimp     1     -0.07186       0.12541      -0.57      0.5690
tradeexp     1      0.02107       0.13190       0.16      0.8737
sambearn     1     -0.06155       0.04973      -1.24      0.2212
can          1     -0.03489       0.02739      -1.27      0.2081
                              1997

                      The MEANS Procedure

Variable     Label          N            Mean         Std Dev
--------------------------------------------------------------
Intercept    Intercept   1000       0.0581707       0.0305142
lmurate                  1000       0.0616976       0.0178248
aveuiben                 1000    -0.000101532     0.000037820
tradeimp                 1000      -0.0258204       0.1743886
tradeexp                 1000      -0.0355008       0.1880651
sambearn                 1000      -0.0635708       0.0673242
can                      1000      -0.0228619       0.0402765
--------------------------------------------------------------
Obs       _NAME_          coef        se        t          pvalue

      1       intercept    0.056482   0.03051    1.85102     0.06417
      2       lmurate      0.062098   0.01782    3.48378     0.00049
      3       aveuiben    -0.000095   0.00004   -2.50627     0.01220
      4       tradeimp    -0.071862   0.17439   -0.41208     0.68028
      5       tradeexp     0.021066   0.18807    0.11202     0.91081
      6       sambearn    -0.061547   0.06732   -0.91419     0.36062
      7       can         -0.034891   0.04028   -0.86628     0.38634
Program 2
• Project using the National Longitudinal Survey of
  Children and Youth (NLSCY)

• Examined the effect of having a child with disabilities
  on the health of mothers and fathers

• Ordered probit using the Statistics Canada NLSCY
  bootstrap weights to estimate standard errors
Weighting
• Many survey datasets include sampling weights so that
  results will represent the population

• The mechanics of using bootstrap weights are the
  same as for sampling weights

• Each individual in the survey has a sample weight and all
  of the bootstrap weights

• Re-estimate your model or statistic over and over, using
  a different bootstrap weight each time
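A minimal Python sketch of these mechanics, using a weighted mean as the statistic. The outcomes, the sampling weights, and the four sets of bootstrap weights are all made up for illustration; a real survey file would supply many more bootstrap weights per respondent.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted mean of values using the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Made-up records: one outcome and one sampling weight per respondent
outcome = [3, 1, 4, 2, 5, 3]
sample_weight = [10, 20, 15, 25, 10, 20]

# Hypothetical bootstrap weights: one list per replicate, one entry per respondent
bootstrap_weights = [
    [12, 18, 0, 30, 11, 29],
    [0, 25, 17, 22, 14, 22],
    [15, 0, 20, 27, 9, 29],
    [9, 24, 16, 0, 21, 30],
]

# Point estimate uses the sampling weight
point = weighted_mean(outcome, sample_weight)

# Re-estimate the statistic once per bootstrap weight
replicates = [weighted_mean(outcome, bsw) for bsw in bootstrap_weights]

# Standard error: standard deviation of the replicate estimates
se = statistics.stdev(replicates)
print(point, round(se, 3))
```

The data never change between replicates; only the weight column does, which is why the mechanics match an ordinary weighted analysis.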
Bootstrap Weight Derivation

Re-sampling -> A Miracle Occurs -> Bootstrap Weights
/** macros to indicate the dependent variable and
independent variables */

%let depvar=momhealth00;
%let indepvars=hhdis00 momage00 momlthigh00
momcertdip00 momunivdeg00
 momimm eqinc00    hhchlt500 kids01700 momvg94 momg94
momfp94 momsmokesdaily00;



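 /** separate macro variable for the intercepts and independent
variables; PROC MEANS assigns the std= names by position, so the
intercepts are listed in the order they appear in the OUTEST=
dataset (Intercept_5 first) */

 %let allrhs=intercept_5 intercept_4 intercept_3 intercept_2
&indepvars;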
/*** get point estimates using sample weight   */

proc logistic data=nlscy.age615validboot descending
outest=work.point(keep=&allrhs);
model &depvar= &indepvars / link=normit maxiter=50 rsq;
 weight dwtcwd1l / norm;
 where validdis=1;
 title " mom 2000 ";
 run;

 /** transpose the dataset which contains the point
estimates */
proc transpose data=work.point
out=work.pointtrans(drop=_label_ rename=(col1=coef));
run;
/** copy the data to the WORK library */

data work.age615validboot;
     set nlscy.age615validboot;
run;

 /** create empty dataset for coefficients    */

 data work.probitboot;
    set _null_;
 run;


 %global bt;
 %let bt=1000;   /** 1000 bootstrap weights
                    provided;*/
%macro boot;

options nonotes;
%do i=1 %to &bt;

proc logistic data=work.age615validboot noprint
descending
outest=work.est(keep=&allrhs);
model &depvar =&indepvars / link=normit maxiter=50 rsq;
 weight bsw&i / norm;
 where validdis=1;
 title " mom 2000 ";
 run;

data work.probitboot;
   set work.probitboot work.est;
run;

%end;
options notes;
%mend boot;

%boot;
/** calculate the standard deviation */

 proc means data=work.probitboot n mean std ;
 output out=work.std std=&allrhs;
 run;

 proc transpose data=work.std(drop=_type_
_freq_) out=work.std2(drop=_label_
rename=(col1=se));
 run;
data work.final;
    merge work.pointtrans work.std2;

   /** Wald chi square */

       z=coef/se;

        chi=z*z;

        pvaluechi=1-probchi(chi,1);

run;

proc print data=work.final;
title " married moms";
run;
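The Wald test in the last data step squares the ratio of the coefficient to its bootstrap standard error and compares it with a chi-square distribution with 1 degree of freedom. A short Python sketch of the same arithmetic, using the hhdis00 coefficient and bootstrap standard error reported in this program's output; the function name is hypothetical.

```python
import math

def wald_chi_square(coef, se):
    """Return (chi-square statistic, p-value) for a 1-df Wald test."""
    z = coef / se
    chi = z * z
    # upper tail of a chi-square with 1 df equals erfc(sqrt(chi / 2))
    p = math.erfc(math.sqrt(chi / 2))
    return chi, p

# hhdis00: point estimate and bootstrap standard error from the output
chi, p = wald_chi_square(0.30519, 0.09732)
print(round(chi, 3), round(p, 5))
```

With one degree of freedom this p-value is identical to the two-sided normal p-value for z.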
            Analysis of Maximum Likelihood Estimates

                                     Standard         Wald
Parameter           DF   Estimate       Error   Chi-Square   Pr > ChiSq

Intercept 5          1    -2.9050      0.1513     368.5150       <.0001
Intercept 4          1    -2.0956      0.1451     208.6086       <.0001
Intercept 3          1    -1.0202      0.1429      50.9855       <.0001
Intercept 2          1     0.2247      0.1424       2.4906       0.1145
hhdis00              1     0.3052      0.0427      51.1371       <.0001
momage00             1    0.00579     0.00314       3.4098       0.0648
momlthigh00          1     0.1499      0.0583       6.6078       0.0102
momcertdip00         1    -0.0731      0.0384       3.6231       0.0570
momunivdeg00         1    -0.1781      0.0433      16.9065       <.0001
momimm               1     0.3377      0.0419      64.9256       <.0001
eqinc00              1   -2.95E-6    6.018E-7      24.0756       <.0001
hhchlt500            1    -0.1872      0.0876       4.5628       0.0327
kids01700            1    -0.1262      0.0161      61.0665       <.0001
momvg94              1     0.6181      0.0350     312.6018       <.0001
momg94               1     1.1116      0.0458     589.8279       <.0001
momfp94              1     1.5644      0.0912     294.0294       <.0001
momsmokesdaily00     1     0.1706      0.0430      15.7629       <.0001
                 The MEANS Procedure

Variable               N            Mean         Std Dev
---------------------------------------------------------
Intercept_5         1000      -2.9650753       0.3107804
Intercept_4         1000      -2.1470196       0.2770212
Intercept_3         1000      -1.0465351       0.2621726
Intercept_2         1000       0.2091371       0.2622451
hhdis00             1000       0.2846419       0.0973226
momage00            1000       0.0057067       0.0055820
momlthigh00         1000       0.1293874       0.0932894
momcertdip00        1000      -0.0739417       0.0772243
momunivdeg00        1000      -0.1852935       0.0980241
momimm              1000       0.3191519       0.1181139
eqinc00             1000    -3.090889E-6    1.1721765E-6
hhchlt500           1000      -0.1760001       0.1143188
kids01700           1000      -0.1148346       0.0351904
momvg94             1000       0.6399775       0.0754143
momg94              1000       1.1403891       0.1000578
momfp94             1000       1.6089774       0.1664408
momsmokesdaily00    1000       0.1618192       0.0882162
---------------------------------------------------------
Obs   _NAME_                  coef        se       chi    pvaluechi

  1   intercept_5         -2.90503   0.31078    87.376      0.00000
  2   intercept_4         -2.09565   0.27702    57.228      0.00000
  3   intercept_3         -1.02021   0.26217    15.143      0.00010
  4   intercept_2          0.22473   0.26225     0.734      0.39147
  5   hhdis00              0.30519   0.09732     9.834      0.00171
  6   momage00             0.00579   0.00558     1.076      0.29961
  7   momlthigh00          0.14987   0.09329     2.581      0.10815
  8   momcertdip00        -0.07309   0.07722     0.896      0.34390
  9   momunivdeg00        -0.17806   0.09802     3.300      0.06930
 10   momimm               0.33771   0.11811     8.175      0.00425
 11   eqinc00             -0.00000   0.00000     6.346      0.01176
 12   hhchlt500           -0.18722   0.11432     2.682      0.10149
 13   kids01700           -0.12618   0.03519    12.857      0.00034
 14   momvg94              0.61807   0.07541    67.169      0.00000
 15   momg94               1.11157   0.10006   123.417      0.00000
 16   momfp94              1.56445   0.16644    88.349      0.00000
 17   momsmokesdaily00     0.17064   0.08822     3.742      0.05307
Thank you for your attention!

				