Improving by chenshu


   United Nations Statistical Institute for Asia and the Pacific (SIAP)
                    Asian Development Bank (ADB)
               Country Training Workshops on MDGs and
     Use of Administrative Data Systems for Statistical Purposes
  RETA6356: Improving Administrative Data Sources for the Monitoring of MDGIs



         Improving administrative
              data quality


                                                                                1
     How to improve data quality?
   Needs assessment of both users and providers of data
   Work on methodology
   Being part of the statistical information system
   Capacity building / training
   Periodic review of questionnaires, forms, registers,
    instructions and quality measures
   Developing new methodologies:
      e.g. out-of-school and completion indicators for
       education data


                                                           2
How to improve data quality internally?
   Provide an exhaustive instruction manual that
    should include:
     Instructions for providers
     Guidelines on the coverage and definitions used in a
      questionnaire
     Technical specifications

   Improve survey instrument



                                                             3
                  Data validation
   Developing/improving guidelines for data processing
   Data cleaning: identifying outliers
       Comparison with historical series
       Cross comparison: time and series
   Consistency
   Contact the data provider in case of inconsistency
   Check for other reliable sources




                                                          4
 Data validation: Methods for outlier
              detection
 Standard deviation (mean)
 Quartile deviation (box plot)

 Median absolute deviation (modified Z score)
      Avoid the standard deviation and Z score, since outliers affect
       both the mean and the standard deviation.
 Choice depends on the type of data:
            Whether the data set is clean or not
            Whether the data are normally distributed
            Dependent or independent
            Presence of multiple outliers
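
The warning about the standard deviation and Z score can be illustrated with a minimal sketch. The series and the cut-off of 3 are assumptions chosen for illustration (a common convention, not taken from the slides). The single extreme value inflates both the mean and the standard deviation, so it escapes detection.

```python
import statistics

# Hypothetical series with one obvious outlier (100).
values = [10, 11, 12, 11, 10, 12, 11, 100]

mean = statistics.mean(values)
sd = statistics.pstdev(values)  # population standard deviation

# Classical z-score of every observation.
z_scores = [(x - mean) / sd for x in values]

# The outlier drags the mean up and inflates the standard deviation,
# so its own z-score stays below the assumed cut-off of 3.
flagged = [x for x, z in zip(values, z_scores) if abs(z) > 3]
print(flagged)
```

With this series nothing is flagged, which is the masking effect the slide warns about.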


                                                                       5
      Tukey box plot: EXAMPLE
[Figure: Tukey box plot labelled "Titles"; vertical axis ticks at 50, 100, 150 and 200; points beyond the whiskers are marked as outliers]




                                        6
   Exercise on 5 number summary…
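
As a possible aid for the exercise, the sketch below computes a five-number summary and the Tukey fences used in a box plot. The sample values, the function names and the multiplier k = 1.5 (the usual box plot convention) are assumptions for illustration. Quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`; other quartile conventions give slightly different fences.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]

def tukey_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical sample, not taken from the workshop data.
sample = [48, 52, 55, 57, 60, 61, 63, 65, 68, 70, 150]
summary = five_number_summary(sample)
outliers = tukey_outliers(sample)
print(summary, outliers)
```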




                                    7
       Modified z-score: using the Median
       Absolute Deviation (MAD)
 The median absolute deviation (MAD) is calculated and used in
  place of the standard deviation in z-score calculations.
 The test heuristic states that an observation with a modified
  z-score greater than 3.5 should be labeled as an
  outlier.
 Reliable test, since the parameters used to calculate the
  modified z-score are minimally affected by the outliers.

$$ Z_i = \frac{0.6745\,(x_i - x_m)}{\mathrm{MAD}} $$
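
A minimal sketch of the modified z-score follows, assuming x_m denotes the median of the series. The function names and the sample series are hypothetical. Taking the absolute value before comparing with 3.5 is an assumption, so that unusually low values are flagged as well as unusually high ones.

```python
import statistics

def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for every value."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        # More than half of the values equal the median: the score is undefined.
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

def mad_outliers(values, threshold=3.5):
    """Return the values whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [x for x, z in zip(values, scores) if abs(z) > threshold]

# Hypothetical series with one extreme value.
series = [10, 11, 12, 11, 10, 12, 11, 100]
print(mad_outliers(series))
```

Because the median and the MAD barely move when one value is extreme, the value 100 is flagged here.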


                                                                  8
              Data validation more...
Outlier methods proposal:
     Time series outlier detection: look at the data series
      themselves using the modified Z score using the
      Median absolute deviation
     Cross section outlier detection: Create clusters of
      countries and compare the data using the
      appropriate indicators with the modified Z score:
          Per capita: using countries with similar population
          Expenditure; income: using GDP
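
The cross-section proposal can be sketched as follows: scale each country's figure by an appropriate denominator (here GDP), then apply the modified Z score within one cluster. The country labels, the figures and the function name are hypothetical, and the cluster is assumed to have been formed beforehand.

```python
import statistics

# Hypothetical cluster: country -> (expenditure, GDP), same currency unit.
cluster = {
    "A": (4.0, 100.0),
    "B": (9.0, 200.0),
    "C": (2.5, 50.0),
    "D": (6.0, 150.0),
    "E": (30.0, 120.0),
    "F": (3.6, 80.0),
}

def flag_cross_section(records, threshold=3.5):
    """Flag countries whose expenditure/GDP ratio is an outlier in the cluster."""
    ratios = {name: exp / gdp for name, (exp, gdp) in records.items()}
    med = statistics.median(ratios.values())
    mad = statistics.median([abs(r - med) for r in ratios.values()])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return sorted(
        name for name, r in ratios.items()
        if abs(0.6745 * (r - med) / mad) > threshold
    )

print(flag_cross_section(cluster))
```

For a per capita indicator, the same function could be fed (value, population) pairs instead.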




                                                                    9
Techniques used to improve data
   Imputations (deriving missing values)
   Small area estimation techniques (using
    models)
   Estimations and projections
       Time series
       Regression
   Other techniques
       Brass method
       Using ratios…

                                              10
                        DATA EDITING

   Imputations
      Develop possible formulas going through questionnaires

      Develop automation software

      Identify methods to incorporate them in data

      Involve data analyst, questionnaire designers, sample
       designers for the development of formulas
   Imputations
      Hot deck imputation (most recent)
      Cold deck imputation (most distant)
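
A minimal sketch of the two imputation styles, under a stated reading of the slide: "most recent" is taken to mean the last valid record processed in the current file (sequential hot deck), and "most distant" a value taken from an earlier source such as a previous round. Function names and data are hypothetical.

```python
def hot_deck(values):
    """Replace each missing value (None) with the last observed value before it."""
    filled, donor = [], None
    for v in values:
        if v is None:
            filled.append(donor)  # stays None when no donor has been seen yet
        else:
            donor = v
            filled.append(v)
    return filled

def cold_deck(values, reference):
    """Replace each missing value with the value at the same position in an earlier source."""
    return [r if v is None else v for v, r in zip(values, reference)]

current = [5, None, 7, None, None]
previous_round = [4, 6, 8, 9, 10]  # hypothetical earlier data set

print(hot_deck(current))
print(cold_deck(current, previous_round))
```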




                                                          11
    Small area estimations
   Use data from larger areas to develop models
        Use combined data sources: census/admin data of
         small areas with survey data for large areas
        e.g. develop regression equations for poverty ratio (Y)
             Y = f(bX)
             Y is known only for large areas
             X is known for small and large areas
        Estimate the b values to fit the model
        Use the model to estimate Y for small areas (since the b
         values and X are known)
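
The steps above can be sketched with a simple linear model Y = a + bX fitted by ordinary least squares. The linear form, the single covariate, the area names and all figures are assumptions for illustration; the slide only states Y = f(bX).

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Large areas: X from census/admin data, Y (poverty ratio) from a survey.
large_x = [10.0, 20.0, 30.0, 40.0]
large_y = [15.0, 25.0, 35.0, 45.0]

# Small areas: only X is known.
small_x = {"district_1": 12.0, "district_2": 35.0}

a, b = fit_line(large_x, large_y)
small_y = {name: a + b * x for name, x in small_x.items()}
print(small_y)
```

In practice the model would use several covariates and its fit would need checking before the small-area estimates are used.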

                                                                  12
Use of Time series and Regression
   To estimate values for missing/unknown
    domains:
       Use time series models (Trend based or auto
        regressive models)
       Use the dependent/independent variable (regression) method
       Compare results with other information and validate


   A best estimate is more useful than a blank.
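
As one possible trend-based approach, the sketch below fits a straight line to the observed years and uses it to fill the gaps. The function name, the years and the values are hypothetical, and a linear trend is only one of the models mentioned above.

```python
def fill_with_linear_trend(years, values):
    """Fill missing values (None) with a least-squares linear trend over time."""
    known = [(t, v) for t, v in zip(years, values) if v is not None]
    if len(known) < 2:
        raise ValueError("at least two observed points are needed to fit a trend")
    n = len(known)
    mean_t = sum(t for t, _ in known) / n
    mean_v = sum(v for _, v in known) / n
    stt = sum((t - mean_t) ** 2 for t, _ in known)
    stv = sum((t - mean_t) * (v - mean_v) for t, v in known)
    slope = stv / stt
    intercept = mean_v - slope * mean_t
    return [v if v is not None else intercept + slope * t
            for t, v in zip(years, values)]

years = [2000, 2001, 2002, 2003, 2004, 2005]
observed = [100.0, 104.0, None, 112.0, 116.0, None]
filled = fill_with_linear_trend(years, observed)
print(filled)
```

The filled values should then be compared with other information, as the slide recommends.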

                                                              13
Brass method




               14
               Vital Rates(VR) Method
This method uses only birth and death data as symptomatic
variables. We define the following symbols for a small area and a
larger area for which the population estimate $\hat{P}_t$ is ascertained
from official sources.

                              LOGIC OF VR
Small area
   $\hat{p}_t$ = population estimate at year t
   $b_0$ = births for the census year
   $d_0$ = deaths for the census year
   $b_t$ = births for the current year (known from official register)
   $d_t$ = deaths for the current year (known from official register)
Larger area
   $\hat{P}_t$ = population estimate at year t (known from official register)
   $B_t$ = births at year t (known from official register)
   $D_t$ = deaths at year t (known from official register)
                                                                          15
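Crude birth and death rates
   Small area:   $r_{1t} = \dfrac{b_t}{p_t}$ and $r_{2t} = \dfrac{d_t}{p_t}$
   Larger area:  $\hat{R}_{1t} = \dfrac{B_t}{\hat{P}_t}$ and $\hat{R}_{2t} = \dfrac{D_t}{\hat{P}_t}$
Updating factors
   Small area:   $\theta_1 = \dfrac{r_{1t}}{r_{10}}$ and $\theta_2 = \dfrac{r_{2t}}{r_{20}}$
   Larger area:  $\theta_1 = \dfrac{R_{1t}}{R_{10}}$ and $\theta_2 = \dfrac{R_{2t}}{R_{20}}$
   Note the assumption that updating factors are considered
   equal in small and larger areas
Estimates
   Larger area:  $\hat{\theta}_1 = \dfrac{\hat{R}_{1t}}{R_{10}}$ and $\hat{\theta}_2 = \dfrac{\hat{R}_{2t}}{R_{20}}$
   Small area:   $\hat{r}_{1t} = \hat{\theta}_1 r_{10}$ and $\hat{r}_{2t} = \hat{\theta}_2 r_{20}$
                 $\hat{p}_t = \dfrac{b_t}{\hat{r}_{1t}}$ and $\hat{p}_t = \dfrac{d_t}{\hat{r}_{2t}}$
Combine these two estimates to get
   $\hat{p}_t = \dfrac{1}{2}\left(\dfrac{b_t}{\hat{r}_{1t}} + \dfrac{d_t}{\hat{r}_{2t}}\right)$
                                                                          16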
[Figure: diagram labelled "Large Area" and "Small Area"]
Estimated population in the small area is the average of the two
estimates:
   $\hat{p}_t = \dfrac{1}{2}\left(\dfrac{b_t}{\hat{r}_{1t}} + \dfrac{d_t}{\hat{r}_{2t}}\right)$
                                                  17
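
The Vital Rates calculation on the preceding slides can be sketched as below. The function name and all figures are hypothetical. The census-year populations `p0` and `P0` and the larger-area census-year births and deaths `B0` and `D0` do not appear in the symbol list; they are assumed here because the census-year rates r10, r20, R10 and R20 cannot be computed without them.

```python
def vr_estimate(b0, d0, p0, bt, dt, B0, D0, P0, Bt, Dt, Pt):
    """Vital Rates estimate of the current small-area population."""
    # Census-year crude birth and death rates in the small area.
    r10, r20 = b0 / p0, d0 / p0
    # Updating factors taken from the larger area (assumed equal in both areas).
    theta1 = (Bt / Pt) / (B0 / P0)
    theta2 = (Dt / Pt) / (D0 / P0)
    # Estimated current rates in the small area.
    r1t, r2t = theta1 * r10, theta2 * r20
    # Average of the birth-based and death-based population estimates.
    return 0.5 * (bt / r1t + dt / r2t)

# Hypothetical figures for one small area and its larger area.
estimate = vr_estimate(
    b0=300, d0=100, p0=10_000, bt=310, dt=95,
    B0=28_000, D0=9_000, P0=1_000_000,
    Bt=27_500, Dt=8_800, Pt=1_100_000,
)
print(round(estimate))
```

A quick sanity check: if births, deaths and the larger-area population are unchanged since the census, the estimate equals the census population `p0`.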
          What is metadata, and why?
   Metadata is data about data
   Metadata includes documentation on:
       Concepts
       Scope
       Classifications
       Basis of recording
       Data sources
       Statistical methodologies adopted
       Annotated differences from international standards, guidelines and
        good practices



                                                                         18
     MDGs monitoring: at which level?
   Reporting and monitoring MDGs at the
    national level is a good start
   The Millennium Declaration is about
    improving the conditions of people in
    member states
   There is a need to monitor MDGs at the sub-
    national level
   But this is feasible only if data at lower levels
    are readily available
                                                  19
                     Advantages
   Data at lower levels of disaggregation:
     Allow for targeted socioeconomic policy
      decision-making and programme formulation
     Allow planners and policy makers to identify that:
         Some locales require more support for
          educational programmes
         Others require disproportionate investment in HIV
          treatment or malaria abatement


                                                         20
                    Challenges
   Most administrative data (Health, Education,
    Access to Water, Sanitation…) and Census data
    can be disaggregated at lower levels
   For data from a household survey (HS), the sample needs to be large
    enough to yield reliable estimates at lower levels
   This increases the cost of obtaining the information, both
    in terms of human and financial resources
   For this reason, few HSs provide data at the sub
    national level


                                                         21
                 Opportunities
   Opportunities include:
     Increased demand for data at lower levels
     Geographical Information Systems (GIS)
      technology (poverty mapping)
     Collaboration between Central Statistical Offices
      and sub national statistical institutions




                                                    22
   Sub national data disaggregation needs adequate
    sample sizes in HSs
   Disaggregation to sub national levels needs
    corresponding responsibilities
   Need to use MDG process to support
    strengthening of disaggregation opportunities




                                                      23

								