          Weighting sample surveys with Bascula

          Harm Jan Boonstra
         Statistics Netherlands
• General overview
  – Calibration/weighting
  – Estimation and variance estimation
• Demonstration with example data from the
  Dutch Labour Force Survey (LFS)
• Other applications at Statistics Netherlands
• Part of Blaise (current version 4.7), a
  general system for computer-assisted survey
  processing developed at Statistics Netherlands
• History: predecessor LINWEIGHT
  developed by Jelke Bethlehem in the 1980s
             Main features
• Calibration: computation of weights using
  auxiliary information encoded in a
  weighting model
• Estimation of (sub)population totals, means,
  proportions and ratios
• Variance estimation: Taylor linearisation
  and balanced repeated replication (BRR) for
  several sampling designs
• Reduction of MSE
  – Reduction of (non-response) bias
  – Reduction of sampling variance
• Calibration to auxiliary totals for
  consistency with known population totals
• A single set of weights
  – Easy tabulation
  – Mutual consistency between estimated tables
      ‘Small sample’ problems
• Full consistency with register data or data from
  related surveys can usually not be achieved
  (overfitting). Not all information can be used at
  the same time.
• Weighting can be ineffective for (small) domain

  For sufficiently large samples weighting is an
  effective and convenient way to improve
  Weighting/calibration methods in Bascula
Based on the general regression (GREG) estimator:
• Poststratification, e.g. Region x AgeClass
• Ratio estimator, e.g. AgeClass x Income
• Linear weighting, e.g. Region + AgeClass x
Based on Iterative Proportional Fitting (IPF):
• Multiplicative weighting, e.g. Region +
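Multiplicative weighting by iterative proportional fitting can be sketched in a few lines. This is a minimal illustration, not Bascula's implementation: the function name `rake`, the toy sample and the margin totals are all assumptions made for the example, which uses a weighting model of the form Region + AgeClass.

```python
def rake(sample, design_weights, margins, max_iter=100, tol=1e-10):
    """Iterative proportional fitting of weights to known marginal totals.

    sample:  list of dicts mapping variable name -> category
    margins: dict mapping variable name -> {category: population total}
    """
    weights = list(design_weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, totals in margins.items():
            # Current weighted total per category of this variable.
            current = {cat: 0.0 for cat in totals}
            for unit, w in zip(sample, weights):
                current[unit[var]] += w
            # Rescale the weights so that this margin is reproduced exactly.
            for i, unit in enumerate(sample):
                factor = totals[unit[var]] / current[unit[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


# Hypothetical toy data: five respondents, two auxiliary variables.
sample = [
    {"Region": "North", "AgeClass": "young"},
    {"Region": "North", "AgeClass": "old"},
    {"Region": "South", "AgeClass": "young"},
    {"Region": "South", "AgeClass": "old"},
    {"Region": "South", "AgeClass": "old"},
]
margins = {
    "Region": {"North": 40.0, "South": 60.0},
    "AgeClass": {"young": 45.0, "old": 55.0},
}
weights = rake(sample, [20.0] * 5, margins)
```

Because each step multiplies the weights by a positive factor, the resulting weights stay positive, which is one practical difference from linear weighting.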
     Further weighting options

• Bounding of weights for linear weighting,
  Huang and Fuller algorithm
• Consistent linear weighting, e.g. for equal
  weights within households, Lemaître and
  Dufour method
              Estimation of totals
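• Based on the calibration weights: $\hat{Y}_{cal} = \sum_{i \in s} w_i y_i$

• General regression estimator: $w_i = d_i g_i$

  $$\hat{Y}_{regr} = \hat{Y}_{HT} + \hat{B}^t (X - \hat{X}_{HT}) = \sum_{i \in s} w_i y_i$$

  $$g_i = 1 + \frac{x_i^t}{\sigma_i^2} \Big( \sum_{j \in s} \frac{d_j x_j x_j^t}{\sigma_j^2} \Big)^{-1} (X - \hat{X}_{HT})$$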
• Also ratios of totals, means, proportions, subclasses
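The g-weights of the general regression estimator can be computed directly from the design weights and the auxiliary vectors. The sketch below is an illustration under stated assumptions, not Bascula code: the names `solve` and `greg_weights`, the toy data, and the default of equal variance parameters (`sigma2` set to 1 for every unit) are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def greg_weights(x, d, pop_totals, sigma2=None):
    """Return w_i = d_i * g_i such that sum_i w_i x_i equals pop_totals."""
    n, p = len(x), len(pop_totals)
    sigma2 = sigma2 or [1.0] * n
    # Horvitz-Thompson estimates of the auxiliary totals.
    x_ht = [sum(d[i] * x[i][k] for i in range(n)) for k in range(p)]
    # T = sum_j d_j x_j x_j^t / sigma_j^2
    t = [[sum(d[i] * x[i][k] * x[i][l] / sigma2[i] for i in range(n))
          for l in range(p)] for k in range(p)]
    # lam = T^{-1} (X - X_HT), so that g_i = 1 + x_i^t lam / sigma_i^2.
    lam = solve(t, [pop_totals[k] - x_ht[k] for k in range(p)])
    g = [1.0 + sum(x[i][k] * lam[k] for k in range(p)) / sigma2[i]
         for i in range(n)]
    return [d[i] * g[i] for i in range(n)]


# Hypothetical toy data: an intercept and one continuous auxiliary variable.
x = [[1.0, 20.0], [1.0, 30.0], [1.0, 40.0], [1.0, 50.0], [1.0, 60.0]]
d = [10.0] * 5
pop_totals = [52.0, 2200.0]
y = [3.0, 4.0, 6.0, 7.0, 9.0]

w = greg_weights(x, d, pop_totals)
y_cal = sum(wi * yi for wi, yi in zip(w, y))  # calibration estimate of the total of y
```

The weights depend only on the auxiliary information, not on `y`, which is why a single set of weights can be reused for every target variable.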
         Variance estimation
• Direct/Taylor method (HT and GREG only)
• Balanced Repeated Replication (BRR)
Sampling designs supported:
• Stratified two-stage element or cluster
  design with simple random sampling
  without replacement in both stages
• Stratified multistage cluster designs with
  replacement in the first stage and unequal
              Taylor variance
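• Taylor linearisation:

  $$\hat{Y}_{regr} = \hat{Y}_{HT} + \hat{B}^t (X - \hat{X}_{HT})$$

  $$v(\hat{Y}_{regr}) = \mathrm{var}\Big( \sum_{i \in s} \frac{y_i - \hat{B}^t x_i}{\pi_i} \Big) = \mathrm{var}\Big( \sum_{i \in s} \frac{e_i}{\pi_i} \Big)$$

• Modified variance estimator (default in Bascula):

  $$v(\hat{Y}_{regr}) = \mathrm{var}\Big( \sum_{i \in s} w_i e_i \Big) = \mathrm{var}\Big( \sum_{i \in s} \frac{g_i e_i}{\pi_i} \Big)$$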
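The difference between the plain and the modified Taylor variance estimator is easiest to see in a special case. The sketch below assumes simple random sampling without replacement and the ratio estimator, viewed as a GREG estimator with one auxiliary variable and variance parameter proportional to x, so that every unit has the same g-weight X / X_HT. The function name and the data are hypothetical, and the variance formula used is the standard one for a total under this design, not necessarily what Bascula computes internally.

```python
def ratio_taylor_variance(y, x, x_total, pop_size, modified=True):
    """Ratio estimate of the total of y and its Taylor variance under SRSWOR."""
    n = len(y)
    d = pop_size / n                 # design weight 1 / pi_i, equal for all units
    y_ht = d * sum(y)
    x_ht = d * sum(x)
    b = y_ht / x_ht                  # estimated regression coefficient
    g = x_total / x_ht               # g-weight, identical for all units here
    e = [yi - b * xi for yi, xi in zip(y, x)]        # residuals e_i
    z = [g * ei if modified else ei for ei in e]     # g_i e_i or e_i
    z_bar = sum(z) / n
    s2 = sum((zi - z_bar) ** 2 for zi in z) / (n - 1)
    # SRSWOR variance estimator of the total sum_i z_i / pi_i.
    variance = pop_size ** 2 * (1 - n / pop_size) * s2 / n
    estimate = y_ht + b * (x_total - x_ht)
    return estimate, variance


# Hypothetical toy data: n = 5 units drawn from a population of N = 100.
y = [12.0, 15.0, 9.0, 20.0, 14.0]
x = [10.0, 14.0, 8.0, 18.0, 12.0]
est_mod, v_mod = ratio_taylor_variance(y, x, x_total=1300.0, pop_size=100)
est_plain, v_plain = ratio_taylor_variance(y, x, x_total=1300.0, pop_size=100,
                                           modified=False)
```

In this special case the modified estimator simply scales the plain one by the squared g-weight; with several auxiliary variables the g-weights differ per unit and the two estimators differ in a less trivial way.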
                 BRR variance

      
  $$v_{BRR}(\hat{Y}_{regr}) = \frac{1}{\varepsilon^2 R} \sum_{r=1}^{R} \big( \hat{Y}_{regr,r}^{(\varepsilon)} - \hat{Y}_{regr} \big)^2$$

• R balanced half samples (partially balanced
  if R < #strata)
• Fay factor $\varepsilon$
• Grouped BRR (more than 2 PSUs per
  stratum allowed)
  – Artificial strata
  – Repeated grouping
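A minimal sketch of balanced repeated replication with a Fay factor follows. It assumes exactly two PSUs per stratum, takes the half-sample pattern from a Sylvester-type Hadamard matrix, and perturbs the weights of the two PSUs by factors (1 + fay) and (1 - fay), with `fay=1.0` giving standard BRR. The function names, the data layout and this parametrisation of the Fay factor are assumptions for illustration, not Bascula's interface.

```python
def hadamard(order):
    """Sylvester construction of a Hadamard matrix; order must be a power of two."""
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def brr_variance(strata, estimator, fay=1.0):
    """BRR variance of estimator(records), with records a list of (weight, y).

    strata: list of (psu_a, psu_b); each PSU is a list of (weight, y) tuples.
    """
    n_strata = len(strata)
    order = 1
    while order <= n_strata:         # need one non-constant column per stratum
        order *= 2
    h = hadamard(order)
    full = [rec for psu_a, psu_b in strata for rec in psu_a + psu_b]
    theta = estimator(full)
    total = 0.0
    for r in range(order):
        replicate = []
        for s, (psu_a, psu_b) in enumerate(strata):
            sign = h[r][s + 1]       # skip the constant first column
            replicate += [(w * (1 + sign * fay), y) for w, y in psu_a]
            replicate += [(w * (1 - sign * fay), y) for w, y in psu_b]
        total += (estimator(replicate) - theta) ** 2
    return total / (order * fay ** 2)


def weighted_total(records):
    return sum(w * y for w, y in records)


# Hypothetical toy data: three strata with two PSUs each.
strata = [
    ([(10, 3), (10, 5)], [(10, 4), (10, 6)]),
    ([(20, 2)], [(20, 5)]),
    ([(5, 7), (5, 1)], [(5, 2), (5, 2)]),
]
v_brr = brr_variance(strata, weighted_total)
v_fay = brr_variance(strata, weighted_total, fay=0.5)
```

For a linear estimator such as a weighted total, full balance makes the result independent of the Fay factor; the factor matters for non-linear estimators, where it keeps every unit in every replicate.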
• Sample data file: ASCII (fixed column or
  separated), Blaise, other OLE DB compatible
• Blaise meta information; the Blaise Textfile Wizard
  helps create a data model for ASCII files
• Tables of population totals
• Selection of weighting scheme and other
  parameters that influence the weighting
• Some additional input required for estimation and
  variance estimation: target tables and sampling
  design details
        Data integrity checks
• Consistency of set of population tables
• Sample counts per cell do not exceed
  population counts
• Enough sample observations for each cell in
  weighting model
• Inclusion weights/sampling fractions
  compatible with sampling design specified
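Checks of this kind are simple to express in code. The sketch below covers only the cell-count checks and a basic consistency check on population tables (equal grand totals); the function names, the `min_obs` threshold and the toy tables are hypothetical and do not reflect how Bascula performs or reports its checks.

```python
def check_cells(sample_counts, population_counts, min_obs=1):
    """Return a list of problems found per cell of the weighting model."""
    problems = []
    for cell, pop in population_counts.items():
        n = sample_counts.get(cell, 0)
        if n > pop:
            problems.append(
                f"{cell}: sample count {n} exceeds population count {pop}")
        if n < min_obs:
            problems.append(
                f"{cell}: only {n} sample observations (minimum {min_obs})")
    return problems


def tables_consistent(tables, tol=0.0):
    """Check that all population tables add up to the same grand total."""
    grand_totals = [sum(table.values()) for table in tables.values()]
    return max(grand_totals) - min(grand_totals) <= tol


# Hypothetical toy input.
population = {
    ("North", "young"): 120,
    ("North", "old"): 80,
    ("South", "young"): 3,
}
sample = {("North", "young"): 12, ("South", "young"): 5}
problems = check_cells(sample, population, min_obs=5)

tables = {
    "Region": {"North": 200, "South": 3},
    "AgeClass": {"young": 123, "old": 80},
}
consistent = tables_consistent(tables)
```

A fuller consistency check would also compare margins shared between tables, not just the grand totals.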
• Set of final and correction weights (written to the
  sample file and to a separate weights file)
• Optionally: fitted values $\hat{y}_i = \hat{B}^t x_i$
• Tables of estimates (including estimates of
  standard errors) in export file; format compatible
  with population data file
     Example: Dutch Labour Force Survey
• Rotating panel design with five waves; CAPI in
  first wave, CATI in subsequent waves
• CATI data are first calibrated to the initial CAPI
  panel on the most important target variable
  (employment in several categories) to reduce
  panel attrition bias
• Weighted CATI data are then combined with CAPI
  data and jointly calibrated to the population totals
  of the weighting scheme
  Region44 x Age4 x Sex2 + Age21 x Sex2 + Age5 x
  MarStat2 + Sex2 x Age5 x Ethnicity8 + CWI3
 Dacseis software evaluation report
            on Bascula:
‘Bascula is a part of Blaise (an integrated system for
  survey processing), and it might not be reasonable
  to purchase Blaise only for the use of Bascula.
  When having Blaise available, Bascula provides
  an advanced weighting tool (linear or
  multiplicative weighting) with abilities for proper
  variance estimation based on Taylor’s
  linearisation. When the basic order of the weight
  and estimate calculations of Bascula is understood,
  the operations can be carried out quite easily.’
• menu-based interactive version
• from Blaise’s script language Manipula
• from most modern programming languages,
  e.g. VB, VBA, Delphi, C++, C#
• from other software able to act as
  automation client, e.g. S-Plus
Bascula component (dll) can be used to
 automate weighting/estimation processes
For recurring weighting/estimation
 processes, batch processing, integration into
 production systems
Build custom tools utilizing Bascula’s
 Tools that use Bascula component
• Tool that integrates imputation/outlier
  detection and handling/weighting for the
  Production Statistics
• Tool for analysing results of experiments
• Tool for repeated weighting
• Simple simulation tools
  – Variance estimation (Dacseis)
  – GREG as input for small area estimators
            Repeated weighting
•    Practical sequential approach to make tables of
     estimates consistent between data sources
•    Two step procedure
    1. Start with GREG estimates
    2. Adjust these estimates such that they are consistent
       with register totals (not used in the weighting scheme
       of GREG) and possibly with previously estimated
       marginal tables from a combination of surveys.
[Figure: Software tool. Labelled elements: Micro, Rectangular datasets, Meta database; Dataset, weighting model, population; Bascula; Estimates]
   Source: Systemdocumentation VRD, V.Snijders
       Use of Bascula at Statistics Netherlands
• Labour Force Survey
• Repeated weighting for the Social Statistical
• Survey on Household Incomes
• Budget Survey
• Survey on Living Conditions
• Production Statistics
and more
  Survey on Household Incomes
• Calibration on both person totals and household
  totals, both obtained from municipal registrations
• Consistent linear weighting:
   Region29 x Age8 x Sex2 +
   Region29 x HouseholdType9 x OneHH

OneHH is an auxiliary variable that sums to one over
 each household
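The idea behind consistent linear weighting can be sketched as follows: each person's auxiliary vector is replaced by the household mean, and ordinary linear weighting is applied to those means, so that all members of a household receive the same weight while both person and household totals are reproduced. The example is a simplified illustration with two auxiliary variables (a hypothetical `female` indicator and OneHH = 1 / household size); the function name and data are assumptions, and the design weights are assumed equal within households.

```python
from collections import defaultdict


def consistent_linear_weights(persons, design_weights, totals):
    """persons: list of (household_id, x1, x2); returns one weight per person."""
    # Step 1: replace each person's auxiliary vector by the household mean.
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for hh, x1, x2 in persons:
        sums[hh][0] += x1
        sums[hh][1] += x2
        sums[hh][2] += 1
    xbar = [(sums[hh][0] / sums[hh][2], sums[hh][1] / sums[hh][2])
            for hh, _, _ in persons]
    # Step 2: linear weighting on the household means (2 x 2 system).
    pairs = list(zip(design_weights, xbar))
    t11 = sum(d * a * a for d, (a, b) in pairs)
    t12 = sum(d * a * b for d, (a, b) in pairs)
    t22 = sum(d * b * b for d, (a, b) in pairs)
    r1 = totals[0] - sum(d * a for d, (a, b) in pairs)
    r2 = totals[1] - sum(d * b for d, (a, b) in pairs)
    det = t11 * t22 - t12 * t12
    lam1 = (t22 * r1 - t12 * r2) / det
    lam2 = (t11 * r2 - t12 * r1) / det
    return [d * (1.0 + a * lam1 + b * lam2) for d, (a, b) in pairs]


# Hypothetical toy data: (household, female indicator, OneHH).
persons = [
    ("h1", 1, 1 / 2), ("h1", 0, 1 / 2),
    ("h2", 1, 1.0),
    ("h3", 0, 1 / 3), ("h3", 0, 1 / 3), ("h3", 1, 1 / 3),
    ("h4", 0, 1.0),
]
totals = (330.0, 420.0)   # assumed known: number of women, number of households
weights = consistent_linear_weights(persons, [100.0] * len(persons), totals)
```

Because OneHH sums to one within each household, the weighted sum of OneHH over persons equals the weighted number of households, which is how a household total enters a person-level weighting model.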
        Production Statistics
• Continuous auxiliary variables available
  from Tax Office; categorical variables from
  Business Register
• Weighting scheme:
  Activity x SizeClass x Source x Tax +
  Activity x SizeClass x Source
• Variable Source indicates whether tax info
  can be matched to surveyed businesses
• Priorities for further development have not
  been very high in the last three years, but
  that may change
• Possible extensions: variance structure,
  Newton-Raphson for exponential method,
  two-phase regression estimator, synthetic
  estimation for subpopulations, small area
