# Improving by chenshu

```
United Nations Statistical Institute for Asia and the Pacific (SIAP)
Country Training Workshops on MDGs and Use of Administrative Data Systems for Statistical Purposes
RETA6356: Improving Administrative Data Sources for the Monitoring of MDGIs

data quality

1
How to improve data quality?
   Needs assessment of both users and providers of data
   Work on methodology
   Integration into the statistical information system
   Capacity building / training
   Periodic review of questionnaires, forms, registers, instructions and quality measures
   Developing new methodologies
      e.g., out-of-school and completion indicators for education data

2
How to improve data quality internally?
   Provide an exhaustive instruction manual that should include:
      Instructions for providers
      Guidelines on the coverage and definitions used in a questionnaire
      Technical specifications
   Improve the survey instrument

3
Data validation
   Developing/improving guidelines for data processing
   Data cleaning: identifying outliers
   Comparison with historical series
   Cross-comparison: across time and across series
   Consistency checks
   Contact the data provider in case of inconsistency
   Check against other reliable sources

4
Data validation: Methods for outlier detection
   Standard deviation (mean)
   Quartile deviation (box plot)
   Median absolute deviation (modified Z score)
      Avoid the standard deviation and Z score, since outliers affect both the mean and the standard deviation.
   Choice depends on the type of data:
      Whether the data set is clean or not
      Whether the data are normally distributed
      Whether observations are dependent or independent
      Presence of multiple outliers

5
Tukey box plot : EXAMPLE
[Figure: Tukey box plot; vertical axis from 50 to 220; horizontal axis "Titles"; points beyond the whiskers labelled "Outliers"]

6
   Exercise on the five-number summary…

7
Modified z-score: using the median
   The median absolute deviation (MAD) is calculated and used in place of the standard deviation in z-score calculations.
   The test heuristic states that an observation with a modified z-score greater than 3.5 in absolute value should be labeled as an outlier.
   The test is reliable because the parameters used to calculate the modified z-score are minimally affected by the outliers.

   Z_i = \frac{0.6745\,(x_i - x_m)}{\mathrm{MAD}}

   where x_m is the median of the x_i and \mathrm{MAD} = \mathrm{median}_i\,|x_i - x_m|.

8
Data validation more...
Outlier methods proposal:
   Time series outlier detection: look at the data series themselves, using the modified Z score based on the median absolute deviation
   Cross-section outlier detection: create clusters of countries and compare the data on the appropriate indicators using the modified Z score:
      Per capita indicators: cluster countries with similar population
      Expenditure and income indicators: cluster countries using GDP

9
Techniques used to improve data
   Imputations (deriving missing values)
   Small area estimation techniques (using models)
   Estimations and projections
   Time series
   Regression
   Other techniques
   Brass method
   Using ratios…

10
DATA EDITING

   Imputations
      Develop possible formulas by going through the questionnaires
      Develop automation software
      Identify methods to incorporate them in the data
      Involve data analysts, questionnaire designers and sample designers in the development of formulas
   Imputations
      Hot deck imputation (most recent)
      Cold deck imputation (most distant)

11
Small area estimations
   Use data from larger areas to develop models
   Use combined data sources: census/admin data for small areas with survey data for large areas
   e.g., develop regression equations for the poverty ratio (Y)
      Y = f(bX)
      Y is known only for large areas
      X is known for small and large areas
      Estimate the b values and fit the model
      Use the model to estimate Y for small areas (since the b values and X are known)

12
Use of Time series and Regression
   To estimate values for missing/unknown domains:
      Use time series models (trend-based or autoregressive models)
      Use the dependent/independent variable (regression) method
      Compare results with other information and validate

   A best estimate is more useful than a blank.

13
Brass method

14
Vital Rates(VR) Method
This method uses only birth and death data as symptomatic variables. We define the following symbols for a small area and a larger area for which the population estimate \hat{P}_t is ascertained from official sources.

LOGIC OF VR
Small area
   \hat{p}_t = population estimate at year t (unknown, to be estimated)
   b_0 = births for the census year
   d_0 = deaths for the census year
   b_t = births for the current year (known from official register)
   d_t = deaths for the current year (known from official register)
Larger area
   \hat{P}_t = population estimate at year t (known from official register)
   B_t = births at year t (known from official register)
   D_t = deaths at year t (known from official register)
15
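Crude birth and death rates
   Small area:   r_{1t} = b_t / p_t  and  r_{2t} = d_t / p_t
   Larger area:  \hat{R}_{1t} = B_t / \hat{P}_t  and  \hat{R}_{2t} = D_t / \hat{P}_t
Updating factors
   Small area:   \theta_1 = r_{1t} / r_{10}  and  \theta_2 = r_{2t} / r_{20}
   Larger area:  \theta_1 = R_{1t} / R_{10}  and  \theta_2 = R_{2t} / R_{20}
Note the assumption that updating factors are considered equal in small and larger areas.
Estimates
   Larger area:  \hat{\theta}_1 = \hat{R}_{1t} / R_{10}  and  \hat{\theta}_2 = \hat{R}_{2t} / R_{20}
   Small area:   \hat{r}_{1t} = \hat{\theta}_1 r_{10}  and  \hat{r}_{2t} = \hat{\theta}_2 r_{20}
   Small area:   \hat{p}_t = b_t / \hat{r}_{1t}  and  \hat{p}_t = d_t / \hat{r}_{2t}

Combine these two estimates to get
   \hat{p}_t = \frac{1}{2} \left( \frac{b_t}{\hat{r}_{1t}} + \frac{d_t}{\hat{r}_{2t}} \right)
16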
[Figure: large area and small area]
Estimated population in the small area is the average of the two estimates:
   \hat{p}_t = \frac{1}{2} \left( \frac{b_t}{\hat{r}_{1t}} + \frac{d_t}{\hat{r}_{2t}} \right)
17
   Metadata is data on data
   Concepts
   Scope
   Classifications
   Basis of recording
   Data sources
   Annotated differences from international standards, guidelines and good practices

18
MDGs monitoring: at which level?
   Reporting and monitoring MDGs at the national level is a good start
   The Millennium Declaration is about improving the conditions of people in member states
   There is a need to monitor MDGs at the sub-national level
   But this is feasible only if data at lower levels are available
19
   Data at lower levels of disaggregation:
      Allow for targeted socioeconomic policy decision-making and programme formulation
      Allow planners and policy makers to identify:
         That some locales require more support for educational programmes
         That others require disproportionate investment in HIV treatment or malaria abatement

20
Challenges
   Most administrative data (health, education, ...) can be disaggregated at lower levels
   For data from a household survey (HS), the survey needs to be large enough to yield reliable estimates at lower levels
   This increases the cost of obtaining the information, in terms of both human and financial resources
   For this reason, few HSs provide data at the sub-national level

21
Opportunities
   Opportunities include:
      Increased demand for data at lower levels
      Geographical Information Systems (GIS) technology (poverty mapping)
      Collaboration between Central Statistical Offices and sub-national statistical institutions

22
   Sub-national data disaggregation needs adequate sample sizes in HSs
   Disaggregation to sub-national levels needs corresponding responsibilities
   Need to use the MDG process to support strengthening of disaggregation opportunities

23

```
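
The two outlier rules named in the slides (the modified z-score with a 3.5 cutoff, and the Tukey box plot) can be sketched in a few lines of Python. This is only an illustration: the function names and the sample series are hypothetical, and the 1.5 × IQR fence is the usual Tukey convention rather than something stated in the slides.

```python
from statistics import median, quantiles


def modified_z_scores(values):
    """Return 0.6745 * (x_i - median) / MAD for every observation."""
    med = median(values)
    mad = median(abs(x - med) for x in values)  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero, so the modified z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]


def mad_outliers(values, threshold=3.5):
    """Observations whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [x for x, z in zip(values, scores) if abs(z) > threshold]


def tukey_outliers(values, k=1.5):
    """Observations outside [Q1 - k*IQR, Q3 + k*IQR] (assumed k = 1.5)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical indicator series with one suspicious value.
series = [100, 102, 98, 101, 99, 103, 220]
print(mad_outliers(series))    # → [220]
print(tukey_outliers(series))  # → [220]
```

Both rules rely on the median and quartiles, which a single extreme value barely moves. That is the reason the slides give for preferring them over the mean and standard deviation.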
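
The Vital Rates method from slides 14 to 16 can be traced step by step in a short Python sketch. The slides use the census-year rates r10, r20, R10 and R20 without listing their inputs, so the census-year populations `p0` and `P0` and the larger-area census-year counts `B0` and `D0` are assumptions here, as are the function name and all the figures.

```python
def vital_rates_estimate(b0, d0, p0, bt, dt, B0, D0, P0, Bt, Dt, Pt):
    """Estimate the current small-area population with the Vital Rates method.

    Lower-case arguments refer to the small area, upper-case to the larger
    area; suffix 0 is the census year and suffix t the current year.
    """
    # Census-year crude birth and death rates (assumed inputs p0, P0, B0, D0).
    r10, r20 = b0 / p0, d0 / p0
    R10, R20 = B0 / P0, D0 / P0
    # Current-year crude rates for the larger area, whose population is known.
    R1t, R2t = Bt / Pt, Dt / Pt
    # Updating factors, assumed equal in the small and the larger area.
    theta1, theta2 = R1t / R10, R2t / R20
    # Estimated current-year rates for the small area.
    r1t_hat, r2t_hat = theta1 * r10, theta2 * r20
    # Average of the birth-based and the death-based population estimates.
    return 0.5 * (bt / r1t_hat + dt / r2t_hat)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
estimate = vital_rates_estimate(
    b0=200, d0=80, p0=10_000, bt=198, dt=88,
    B0=20_000, D0=8_000, P0=1_000_000,
    Bt=19_800, Dt=8_800, Pt=1_100_000,
)
print(round(estimate))  # → 11000
```

In this example the birth-based and death-based estimates happen to agree. With real data they usually differ, and averaging them is how the method combines the two.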