Improving by chenshu


United Nations Statistical Institute for Asia and the Pacific (SIAP)
Asian Development Bank (ADB)
Country Training Workshops on MDGs and
Use of Administrative Data Systems for Statistical Purposes
RETA6356: Improving Administrative Data Sources for the Monitoring of MDGIs

         Improving administrative
              data quality

     How to improve data quality?
   Needs assessment of both users and providers of data
   Work on methodology
   Integration into the statistical information system
   Capacity building / training
   Periodic review of questionnaires, forms, registers,
    instructions and quality measures
   Developing new methodologies:
      e.g., out-of-school and completion indicators for
       education data

How to improve data quality internally?
   Provide an exhaustive instruction manual that
    should include:
     Instructions for providers
     Guidelines on the coverage and definitions used in a
     Technical specifications

   Improve survey instrument

                  Data validation
   Developing/improving guidelines for data processing
   Data cleaning: identifying outliers
       Comparison with historical series
       Cross comparison: time and series
   Consistency
   Contact the data provider in case of inconsistency
   Check for other reliable sources

 Data validation: Methods for outlier detection
 Standard deviation (mean)
 Quartile deviation (box plot)
 Median absolute deviation (modified z-score)
      Avoid the standard deviation and the ordinary z-score, since outliers
       affect both the mean and the standard deviation.
 Choice depends on the type of data:
            Whether the data set is clean or not
            Whether the data are normally distributed
            Dependent or independent
            Presence of multiple outliers

      Tukey box plot: example

      [Figure: Tukey box plot example]



   Exercise on 5 number summary…
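
The five-number summary behind a Tukey box plot can be sketched as follows. This is a minimal illustration, not the workshop's own exercise: the enrolment figures are hypothetical, the function names are invented for the example, and the "inclusive" quartile convention and the usual 1.5 x IQR fence are assumptions (other quartile conventions give slightly different Q1 and Q3).

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions differ slightly.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def tukey_outliers(values, k=1.5):
    """Return the values lying outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical enrolment counts reported by one school over nine years.
enrolment = [410, 415, 420, 425, 430, 435, 440, 445, 900]

print(five_number_summary(enrolment))
print(tukey_outliers(enrolment))
```

Values flagged this way are candidates for checking with the data provider, not automatic deletions.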

       Modified z-score: using the Median
       Absolute Deviation (MAD)
 The median absolute deviation is calculated and used in
  place of the standard deviation in z-score calculations.
 The test heuristic states that an observation with a modified
  z-score greater than three and a half in absolute value should be
  labeled as an outlier.
 The test is reliable since the parameters used to calculate the
  modified z-score are minimally affected by the outliers.

 $$M_i = \frac{0.6745\,(x_i - x_m)}{\mathrm{MAD}}, \qquad \mathrm{MAD} = \operatorname{median}_i \lvert x_i - x_m \rvert$$

 where $x_m$ is the median of the observations.
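
A minimal sketch of the modified z-score test, assuming the usual form in which each deviation from the median is multiplied by 0.6745 and divided by the MAD, and an observation is flagged when the absolute score exceeds 3.5. The series is hypothetical and the function names are illustrative only.

```python
import statistics


def modified_z_scores(values):
    """Return the modified z-score of every observation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values equal the median: the score is undefined.
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


def mad_outliers(values, threshold=3.5):
    """Return the observations whose absolute modified z-score exceeds threshold."""
    scores = modified_z_scores(values)
    return [v for v, m in zip(values, scores) if abs(m) > threshold]


# Hypothetical completion rates (%) for one country over seven years.
series = [52.1, 53.0, 52.7, 53.4, 35.2, 53.9, 54.3]

print(mad_outliers(series))
```

Because both the median and the MAD barely move when one value is extreme, the suspicious year stands out clearly, which is the reason the slides prefer this test to the ordinary z-score.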

              Data validation more...
Outlier methods proposal:
     Time series outlier detection: look at the data series
      themselves using the modified Z score using the
      Median absolute deviation
     Cross section outlier detection: Create clusters of
      countries and compare the data using the
      appropriate indicators with the modified Z score:
          Per capita: Using countries having the same population
          Expenditure; income: using GDP
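
The cross-section proposal can be sketched as: group countries into clusters, then apply the modified z-score inside each cluster. The country labels, cluster names and figures below are hypothetical, and the function names are invented for this illustration.

```python
import statistics
from collections import defaultdict


def modified_z(value, group):
    """Modified z-score of one value relative to its comparison group."""
    med = statistics.median(group)
    mad = statistics.median([abs(v - med) for v in group])
    if mad == 0:
        # Degenerate group: only values different from the median stand out.
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad


def flag_by_cluster(records, threshold=3.5):
    """records: list of (country, cluster, value). Return flagged countries."""
    clusters = defaultdict(list)
    for _, cluster, value in records:
        clusters[cluster].append(value)
    return [
        country
        for country, cluster, value in records
        if abs(modified_z(value, clusters[cluster])) > threshold
    ]


# Hypothetical indicator values, with countries clustered by population size.
records = [
    ("A", "small", 4.1), ("B", "small", 4.4), ("C", "small", 3.9),
    ("D", "small", 4.2), ("E", "small", 9.8),
    ("F", "large", 6.0), ("G", "large", 6.3), ("H", "large", 5.8),
    ("I", "large", 6.1),
]

print(flag_by_cluster(records))
```

Comparing each country only with similar countries keeps a value that is normal for a large country from being flagged simply because it differs from small ones.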

Techniques used to improve data
   Imputations (deriving missing values)
   Small area estimation techniques( using
   Estimations and projections
       Time series
       Regression
   Other techniques
       Brass method
       Using ratios…

                        DATA EDITING

   Imputations
      Develop possible formulas going through questionnaires

      Develop automation software

      Identify methods to incorporate them in data

      Involve data analyst, questionnaire designers, sample
       designers for the development of formulas
   Imputations
      Hot deck imputation (most recent)
      Cold deck imputation (most distant)
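
A minimal sketch of the two imputation styles, under stated assumptions: "hot deck" is read here as a sequential hot deck that borrows the most recently seen valid value from the current file, and "cold deck" as borrowing from an older source such as a previous round. Real hot deck procedures usually sort or group records first so that donors resemble recipients. All data and function names are hypothetical.

```python
def hot_deck(values):
    """Sequential hot deck: fill each None with the last valid value seen."""
    filled, donor = [], None
    for v in values:
        if v is None:
            filled.append(donor)  # stays None if no donor has been seen yet
        else:
            donor = v
            filled.append(v)
    return filled


def cold_deck(values, previous_source):
    """Cold deck: fill each None from the same position in an older data set."""
    return [old if v is None else v for v, old in zip(values, previous_source)]


# Hypothetical pupil counts per school; None marks a missing report.
current_round = [120, None, 135, None, None, 150]
previous_round = [118, 122, 130, 140, 143, 149]

print(hot_deck(current_round))
print(cold_deck(current_round, previous_round))
```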

    Small area estimations
   Use data from larger areas to develop models
        Use combined data sources: census/admin data of
         small areas with survey data for large areas
        e.g., develop regression equations for the poverty ratio (Y)
             Y = f(bX)
             Y is known only for large areas
             X is known for small and large areas
        Estimate the b values to fit the model
        Use the model to estimate Y for small areas (since the b
         values and X are known)
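
The steps above can be sketched with a simple linear model. This is an illustration only: the slides leave the form of f open, so the straight line Y = a + bX, the single auxiliary variable and all figures are assumptions, and practical small area models are usually richer.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Large areas: X from census/admin data, Y (poverty ratio, %) from a survey.
# All numbers are hypothetical.
x_large = [10, 20, 30, 40]
y_large = [14, 20, 30, 36]

a, b = fit_line(x_large, y_large)

# Small areas: only X is known, so Y is estimated from the fitted model.
x_small = {"district 1": 25, "district 2": 50}
y_small = {name: a + b * x for name, x in x_small.items()}

print(a, b)
print(y_small)
```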

Use of Time series and Regression
   To estimate values for missing/unknown
       Use time series models (Trend based or auto
        regressive models)
       Use dependent independent variable method
       Compare results with other information and validate

   A best estimate is more useful than a blank.
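
As a sketch of the simplest trend-based fill, missing years between two known values can be estimated by linear interpolation. This is only one assumed option among the time series models mentioned; gaps at the start or end of a series need a projection instead and are left blank here. The series and the function name are hypothetical.

```python
def interpolate_gaps(series):
    """series: dict mapping year -> value or None. Fill interior gaps linearly."""
    years = sorted(series)
    known = [y for y in years if series[y] is not None]
    filled = dict(series)
    for year in years:
        if series[year] is not None:
            continue
        before = [y for y in known if y < year]
        after = [y for y in known if y > year]
        if not before or not after:
            continue  # leading or trailing gap: needs projection, not interpolation
        y0, y1 = before[-1], after[0]
        step = (series[y1] - series[y0]) / (y1 - y0)
        filled[year] = series[y0] + (year - y0) * step
    return filled


# Hypothetical net enrolment rates (%); None marks a year with no report.
rates = {2000: 80.0, 2001: None, 2002: None, 2003: 86.0, 2004: None}

print(interpolate_gaps(rates))
```

The filled values should still be compared with other information before they are published, as the list above recommends.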

Brass method

               Vital Rates(VR) Method
This method uses only birth and death data as symptomatic
variables. We define the following symbols for a small area and a
larger area for which the population estimate Pt is ascertained
from official sources

                              LOGIC OF VR

Small area:
   $\hat{p}_t$ = population estimate at year t
   $b_0$ = births for the census year
   $d_0$ = deaths for the census year
   $b_t$ = births for the current year (known from official register)
   $d_t$ = deaths for the current year (known from official register)

Larger area:
   $\hat{P}_t$ = population estimate at year t (known from official register)
   $B_t$ = births at year t (known from official register)
   $D_t$ = deaths at year t (known from official register)

Crude birth and death rates:
   Small area:  $r_{1t} = b_t / p_t$  and  $r_{2t} = d_t / p_t$
   Larger area: $\hat{R}_{1t} = B_t / \hat{P}_t$  and  $\hat{R}_{2t} = D_t / \hat{P}_t$

Updating factors:
   Small area:  $\theta_1 = r_{1t} / r_{10}$  and  $\theta_2 = r_{2t} / r_{20}$
   Larger area: $\phi_1 = R_{1t} / R_{10}$  and  $\phi_2 = R_{2t} / R_{20}$

   Note the assumption that updating factors are considered
   equal in small and larger areas

Estimates:
   Larger area: $\hat{\phi}_1 = \hat{R}_{1t} / R_{10}$  and  $\hat{\phi}_2 = \hat{R}_{2t} / R_{20}$
   Small area:  $\hat{r}_{1t} = \hat{\theta}_1 r_{10}$  and  $\hat{r}_{2t} = \hat{\theta}_2 r_{20}$,
                with $\hat{\theta}_1 = \hat{\phi}_1$ and $\hat{\theta}_2 = \hat{\phi}_2$
                $\hat{p}_t = b_t / \hat{r}_{1t}$  and  $\hat{p}_t = d_t / \hat{r}_{2t}$

The estimated population in the small area is the average of the two estimates:

   $$\hat{p}_t = \frac{1}{2}\left(\frac{b_t}{\hat{r}_{1t}} + \frac{d_t}{\hat{r}_{2t}}\right)$$
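
A worked sketch of the Vital Rates calculation. The census-year populations (`p0`, `P0`) and the larger-area census-year births and deaths (`B0`, `D0`) are not listed among the symbols in the slides; they are assumed here because the census-year rates cannot be formed without them. The function name and all figures are hypothetical.

```python
def vital_rates_estimate(b0, d0, p0, bt, dt, B0, D0, P0, Bt, Dt, Pt):
    """Estimate the current small-area population by the Vital Rates method."""
    # Census-year crude birth and death rates for the small area.
    r10, r20 = b0 / p0, d0 / p0
    # Crude rates for the larger area in the census year and the current year.
    R10, R20 = B0 / P0, D0 / P0
    R1t, R2t = Bt / Pt, Dt / Pt
    # Updating factors from the larger area, assumed equal for the small area.
    factor1, factor2 = R1t / R10, R2t / R20
    # Updated small-area rates for the current year.
    r1t_hat, r2t_hat = factor1 * r10, factor2 * r20
    # Average of the birth-based and death-based population estimates.
    return 0.5 * (bt / r1t_hat + dt / r2t_hat)


estimate = vital_rates_estimate(
    b0=250, d0=80, p0=10_000,            # small area, census year
    bt=275, dt=84,                       # small area, current year
    B0=24_000, D0=8_000, P0=1_000_000,   # larger area, census year
    Bt=25_300, Dt=8_360, Pt=1_100_000,   # larger area, current year
)

print(round(estimate))
```

The birth-based and death-based estimates generally differ, which is why the method averages them rather than relying on either one alone.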
          Metadata: what and why?
   Metadata is data about data
   Metadata includes documentation on
       Concepts
       Scope
       Classifications
       Basis of recording
       Data sources
       Statistical methodologies adopted
       Annotated differences from international standards, guidelines
        and good practices

     MDGs monitoring: at which level?
   Reporting and monitoring MDGs at the
    national level is a good start
   The Millennium Declaration is about
    improving the conditions of people in
    member states
   There is a need to monitor MDGs at the sub-
    national level
   But this is feasible only if data at lower levels
    are readily available
   Data at lower levels of disaggregation:
     Allow for targeted socioeconomic policy
      decision-making and programme formulation
     Allow planners and policy makers to be able to

         That some locales require more support for
          educational programmes
         Others require disproportionate investment in HIV
          treatment or malaria abatement

   Most administrative data (Health, Education,
    Access to Water, Sanitation…) and Census data
    can be disaggregated at lower levels
   For data from a household survey (HS), the survey needs to be
    large enough to yield reliable estimates at lower levels
   This increases the cost of obtaining the information, both
    in terms of human and financial resources
   For this reason, few HSs provide data at the sub
    national level

   Opportunities include:
     Increased demand for data at lower levels
     Geographical Information Systems (GIS)
      technology (poverty mapping)
     Collaboration between Central Statistical Offices
      and sub national statistical institutions

   Sub national data disaggregation needs adequate
    sample sizes in HSs
   Disaggregation to sub national levels needs
    corresponding responsibilities
   Need to use MDG process to support
    strengthening of disaggregation opportunities
