Ethiopia-2007 Census Data Capture_Processing

					Ethiopian 2007 CENSUS DATA


                APRIL, 2008
 Background Information
 A Population and Housing Census is the
  largest data capturing exercise a country can undertake.
 It involves capturing millions of forms.
 The Central Statistics Agency (CSA) started using older
   techniques, such as punched card readers, as early as the 1960s.
 Two Population and Housing Censuses have so
  far been conducted in Ethiopia.
 The first Population and Housing Census was
  carried out in 1984.
Background Information Cont’d . . .
  During the 1984 Census:
     Data capture was done by manual keyboard-based
     entry using a mainframe computer
     FORMSPEC data entry system was used

     It took more than 2 years to capture the data
      for about 42 million people.

  In the case of the 1994 Census:
     Data capture was again done by manual
      keyboard entry, using PCs
     CENTRY data entry system (IMPS) was used
Background Information Cont’d . . .
 It took about 18 months to capture the data for the
  population of about 53 million.
 About 180 data entry clerks were involved
 Around 90 PCs were used
 The entry work was done on 2-shift basis
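
The figures quoted for the two earlier censuses can be turned into rough capture rates. The sketch below is only back-of-the-envelope arithmetic: "more than 2 years" is taken as 24 months (an assumption, so the 1984 rate is an upper bound), and the per-clerk figure is available for 1994 only because that is the only census for which the number of clerks is given.

```python
# Figures quoted in the slides. "More than 2 years" is taken as 24 months,
# which is an assumption, so the 1984 rate is an upper bound.
CENSUSES = {
    1984: {"population": 42_000_000, "months": 24},
    1994: {"population": 53_000_000, "months": 18},
}


def records_per_month(population: int, months: int) -> float:
    """Average number of person records captured per calendar month."""
    return population / months


for year, c in CENSUSES.items():
    rate = records_per_month(c["population"], c["months"])
    print(f"{year}: about {rate:,.0f} person records per month")

# 1994 only: the slides give 180 clerks, so a per-clerk rate can be derived.
per_clerk_1994 = records_per_month(53_000_000, 18) / 180
print(f"1994: about {per_clerk_1994:,.0f} person records per clerk per month")
```

These are averages over the whole capture period and say nothing about verification passes or downtime.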
Some Limitations of the Keyboard Manual
Entry Method

  Time consuming
    Does not allow the availability of timely data
    The data will be weaker in representing the
     current or existing situation
  Subject to additional non-sampling errors
    Human error due to manual keying
    Due to the volume of the data, a 100% verification,
     as in the case of sample surveys, is difficult.
Limitations Cont . . .

  Requires a great deal of human resources

       Large number of data entry operators
         and equipment required
The Need for Alternative Solutions
  The need to have timely census results and
   the limitations discussed above forced the
   Agency to look for other alternatives
        This is especially important for large
         volumes of data, as in a census.

  Hence the need to use the Scanning Technology
The Scanning Technology

  The Scanning Technology in general implements
   two basic techniques
        Mark recognition, like the Optical Mark
        Reader (OMR)
        Character recognition, like the Optical
        Character Recognition (OCR), and the
        Intelligent Character Recognition (ICR)
Scanning Technology Cont . . .

 OMR is the recognition of shaded marks (blobs) on the form
         The positioning of these blobs on a form
          determines the alphanumeric characters
          they represent
 Character recognition is the recognition of
  alphanumeric characters on forms; it is of
  2 types:
         OCR which is the recognition of machine printed
          characters and . .
Scanning Technology Cont . . .

 ICR which refers to the capture of
     hand-printed characters from a form

 For scanning of the 2007 Census the Optical
  Mark Reader (OMR) technique has been selected
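
The OMR principle described above, where the position of a shaded blob determines the character it represents, can be illustrated with a minimal sketch. The layout is an assumption for illustration only: each digit of a field is one column of ten bubbles (rows 0 to 9). This is not the actual layout of the census forms, nor how the DRS software stores its decodes.

```python
from typing import List, Optional


def decode_digit_column(column: List[bool]) -> Optional[int]:
    """Return the digit encoded by one column of ten bubbles (rows 0-9).

    The row index of the single shaded bubble is the digit. None is
    returned when the column has no mark or more than one mark.
    """
    marked_rows = [row for row, shaded in enumerate(column) if shaded]
    return marked_rows[0] if len(marked_rows) == 1 else None


def decode_field(columns: List[List[bool]]) -> Optional[str]:
    """Decode a multi-digit field, or None if any column is unreadable."""
    digits = [decode_digit_column(col) for col in columns]
    if any(d is None for d in digits):
        return None
    return "".join(str(d) for d in digits)


def column_for(digit: int) -> List[bool]:
    """Helper for the example: a column with only the given digit shaded."""
    return [row == digit for row in range(10)]


age_field = [column_for(3), column_for(7)]
print(decode_field(age_field))  # prints 37
```

A field that comes back as `None` is the kind of case that would be passed on to key-correction.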

 The Scanning Technology we use:

  PhotoScribe Series PS900 Scanners

      (DRS Scanning Technology Product)
      Photo Scribe Series PS900
 High speed Imaging Mark Reader

 Windows XP professional

 CD-R/RW drive

 Network connectivity

 A TFT monitor, Keyboard, mouse

 Speed: up to 8,500 forms / hour
The Scanning Process in General
It mainly involves:
    Scanning / Data Capture – including IMAGE capturing

    Validation and Key-correction of scanned data

    Exporting the scanned and key-corrected data
     into ASCII or Text format
            The format suitable for electronic processing
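
As an illustration of the export step, the sketch below writes one record as a fixed-width ASCII line. The field names and widths are hypothetical and are not the record layout used for the census; the slides state only that the data is exported to ASCII or text format.

```python
# Hypothetical record layout: (field name, width in characters).
LAYOUT = [("ea_id", 8), ("household", 3), ("person", 2), ("sex", 1), ("age", 2)]


def to_fixed_width(record: dict) -> str:
    """Format one record as a fixed-width ASCII line, zero-padding numbers."""
    parts = []
    for name, width in LAYOUT:
        value = str(record[name]).zfill(width)
        if len(value) > width:
            raise ValueError(f"{name}={value!r} exceeds width {width}")
        parts.append(value)
    return "".join(parts)


record = {"ea_id": 1020304, "household": 12, "person": 3, "sex": 2, "age": 7}
line = to_fixed_width(record)
print(line)  # prints 0102030401203207
```

Fixed-width lines of this kind can be read by a data dictionary that gives each item a start position and length.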
   Learning from Experiences of
      Other Countries
 Study tour made to two African countries
   Tanzania

      To learn from their successes
      Data capture of the 2002 Census of Tanzania
       was done in about 26 days
      General report tables were produced within
       3 months from the start of the scanning
 Experiences of Other Countries . . .
 Ghana
   To learn from their difficulties
   Data capture of the 2000 Census took about
    6 months - ( forms from 29,000 EAs)

   3 Scanners were used (Kodak, Fujitsu)
        The larger scanner was Kodak 500D
           Speed: About 500 forms/min

   Power failure was one of the major problems
       Some data was lost as a result
       A large generator was installed to minimize
        the effect of the frequent power cuts
Major Benefits of the Scanning Technology

   Significant decrease in time required to capture
    the data
   This helps to get timely data
      Users’ need satisfied (policy makers, planners,
       researchers, etc.)
   No need to worry about storing millions of forms
    for a long time in the future
      Scanning captures the whole content of a
       questionnaire in an electronic image format
Requirements for Effective Scanning
  Proper training
     Both on Hardware and Software
     This helps to “own” the technology
        Being able to use the technology after the
          departure of the trainers / technical advisors
  A reliable Network System
  A well organized space for forms and data flow
   is required
Data Processing Center
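[Figure: layout of the Data Processing Center, showing the numbered flow of forms (1 to 4) between the areas labelled Warehouse, Retrieval, Registering & Organizing EA's Received from the Field, Store, Receiving, Registering EA's for Scanning, Waiting Room, Scanning Room and Key-Correction]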

Requirements for Effective Scanning - - -
  Proper file management and care
     Checking Batch (EA) IDs and orientation of forms
     Ensuring the EA code on each box is the
      same as the one on the questionnaires
     Proper recording of the incoming and
      outgoing questionnaires
     Close attention in detecting errors in the
      scanning process is required
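
The check that the EA code on each box matches the code on the questionnaires inside it can be sketched as a simple comparison. The EA codes below are invented, and the function is a hypothetical illustration rather than part of the scanning software.

```python
from typing import List


def mismatched_forms(box_ea_code: str, form_ea_codes: List[str]) -> List[int]:
    """Return positions (0-based) of forms whose EA code differs from the box."""
    return [i for i, code in enumerate(form_ea_codes) if code != box_ea_code]


box = "01020304"
forms = ["01020304", "01020304", "01020403", "01020304"]
print(mismatched_forms(box, forms))  # prints [2]
```

An empty list means every questionnaire in the box carries the box's EA code.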
Requirements for Effective Scanning - - -
    Ensuring the proper paper throughput
     through the scanner

    Ensuring smooth running of the scanning machines
         Maintenance
         Cleaning (daily)
    An arrangement to minimize the effect of Power
     Interruption is required
 Major Activities Accomplished in the
     Course of the Census Taking

 Data from the Pilot Census was successfully scanned
  (OMR), key-corrected, exported to text format,
  tabulated and tested.
 One scanner (PS 900 Photo Scribe) was used to
  capture the pilot data
 Technical experts from the DRS company assisted in
  capturing, validating and exporting the pilot data
 Training in scanning technology was given :
       16 professionals were trained
 Major Activities Accomplished - - -
 Hardware and Software training conducted
    The training in general took about 7 working days
    SOSKITW for Windows :- a DRS software package
      for scanning was introduced
    Components of the SOSKITW Software :

          SOSGen : - used to generate scanning
           decodes for completed OMR forms (How
           marks on forms are interpreted and stored)

          SOSInp : - used to scan, validate and export
                      scanned data.
Major Activities Accomplished - - -
  Equipment purchased and installed
     10 additional PS900 iM2 DRS Scanners
     16 high capacity PC’s for key-correction

  Census data processing work plan prepared
     Recruitment of temporary staff
     Staff training (scanning technology, CSPro)
     Retrieval and organization of completed forms
     Scanning and validation
     Computer editing and tabulation
 (For each activity: duration and responsible body are indicated)
 Major Activities Accomplished - - -
 Census data processing teams organized
      Batch header database group
      Scanning and validation team
      Technical desk heads
      Shift supervisors
      Two senior programmers responsible for
       the overall scanning process
      Other sub-professional staff assigned
            4 batch header scanning technicians
            16 data validation workers
Major Activities Accomplished - - -
  The scanning room organized
  An air conditioner for the scanning room installed
  A high capacity automatic generator installed to
   ensure uninterrupted power supply
  Batch Header Database organized
     EA Control Forms completed in 2 parts during dispatch
            Same EA ID on both parts of the control form
            Same Enumerator Number on each part
            No. of Households in the EA filled-in
     The scannable part detached and scanned in office
Completed Census Forms
  Completed forms retrieved from the field
           (about 90,000 EA’s)
  Reception and organization of filled-in forms
        About 33 teams for registering and
         organizing forms were organized
        3 persons assigned per team
  Retrieval of each EA checked and registered
  Presence of all form types checked (each EA)
  Control forms are also used to check the
   completeness of EA’s
Completed Census Forms - - -
 Types of the 2007 Census Forms
       Short questionnaires
       Long questionnaires
       Household Listing Forms
       Summary Forms
       Community Level Forms

       EA Control Forms      (Batch Header Forms)
             EA ID’s and no. of households filled-in
             Unique Enumerator No. assigned
             Scanned   to create EA Database
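
The check that all form types are present for each EA can be sketched as a set difference. The labels are illustrative short names for the form types listed above, and the sketch assumes every EA is expected to return every type, which the slides do not state explicitly.

```python
# Form types listed for the 2007 Census; the short labels are illustrative.
REQUIRED_FORM_TYPES = {
    "short_questionnaire",
    "long_questionnaire",
    "household_listing",
    "summary",
    "community_level",
    "ea_control",
}


def missing_form_types(received: set) -> set:
    """Return the required form types that were not received for an EA."""
    return REQUIRED_FORM_TYPES - received


received = {"short_questionnaire", "household_listing", "summary", "ea_control"}
print(sorted(missing_form_types(received)))
# prints ['community_level', 'long_questionnaire']
```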
[Images: Long Questionnaire, Batch Control Form, Summary Form]
Actual Scanning Process - Census Forms
  Organized forms taken from store to the waiting room
  Batch Header information printed and associated with
   its respective EA box
  The existence of each EA verified
  Checked EAs sent to the scanning room
  Scanned forms are finally sent back to the stores
  Captured data are validated and key-corrected
  Key-correction involved checking and correcting:
        Missing marks
        Multi-marks
        Partial marks
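
The three kinds of problem handled in key-correction can be illustrated by classifying one question from the shading of its bubbles. The thresholds are hypothetical and the logic is a simplified sketch; it is not how the DRS software flags marks.

```python
from typing import List

# Hypothetical thresholds on the fraction of a bubble that is shaded.
BLANK_BELOW = 0.15  # below this a bubble counts as unmarked
FULL_FROM = 0.60    # at or above this a bubble counts as a clear mark


def classify_question(densities: List[float]) -> str:
    """Classify a single-answer question from its bubble shading fractions."""
    full = [d for d in densities if d >= FULL_FROM]
    partial = [d for d in densities if BLANK_BELOW <= d < FULL_FROM]
    if len(full) > 1:
        return "multi-mark"
    if partial:
        return "partial mark"
    if not full:
        return "missing mark"
    return "ok"


print(classify_question([0.02, 0.90, 0.05]))  # prints ok
print(classify_question([0.00, 0.00, 0.00]))  # prints missing mark
print(classify_question([0.80, 0.70, 0.00]))  # prints multi-mark
print(classify_question([0.10, 0.40, 0.00]))  # prints partial mark
```

Anything other than `ok` would be shown to an operator together with the scanned image.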
Actual Scanning Process - - -
  Scanned and validated data is exported to TEXT format
     Format suitable for computer editing and tabulation

  Backup of the scanned / captured data is taken :
        on the Database Server
        externally, on high capacity tape cartridges
              HP Ultrium
              Data Cartridge
              400 GB
Actual Scanning Process - - -
  All Census forms have been scanned :
        The scanning of the 10 sedentary Regions
         was carried out from mid-Aug. 2007 to
         mid-Dec. 2007
        The scanning for Affar and Somali Regions
         took about one month including checking
             (mid Jan - mid Feb 2008)
  44 scanning operators were assigned
  11 scanners used
  2 shifts per day, 7 days per week
  Validation and key-correction of the scanned
    data is done
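
For orientation, the rated scanner speed and the staffing figures above give a theoretical upper bound on daily throughput. The shift length of 8 hours is an assumption (the slides do not give it), and real throughput is lower because of batch handling, cleaning, jams and key-correction.

```python
SCANNERS = 11
FORMS_PER_HOUR = 8_500  # rated maximum per scanner, from the slides
SHIFTS_PER_DAY = 2
HOURS_PER_SHIFT = 8     # assumption: shift length is not given in the slides


def max_forms_per_day(scanners=SCANNERS, forms_per_hour=FORMS_PER_HOUR,
                      shifts=SHIFTS_PER_DAY, hours_per_shift=HOURS_PER_SHIFT):
    """Upper bound on forms scanned per day if every scanner ran at rated speed."""
    return scanners * forms_per_hour * shifts * hours_per_shift


print(f"{max_forms_per_day():,} forms per day at rated speed")
# prints 1,496,000 forms per day at rated speed
```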
Census Forms Scanning Process
[Images: Scanning, Key-Correction]
Data Cleaning / Computer Editing
 Scanned, key-corrected and exported data

 Batch Edit Program based on Edit Specs provided by
  subject matter specialists developed and run on the data.

 The software to be used in editing the data is the Census
   and Survey Processing System (CSPro)

 The Batch Edit Application (.bch) is the component of
  CSPro used to clean the data through editing and
  imputation
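
The idea of editing and imputation can be illustrated with a generic sketch: a range check on age, with invalid or missing values replaced by the last valid value seen (a simple "hot deck"). This is written in Python for illustration only; it is not CSPro syntax, and the range and starting donor value are assumptions, not the Edit Specs used for the census.

```python
from typing import List, Optional


def edit_ages(ages: List[Optional[int]], low: int = 0, high: int = 120,
              start_donor: int = 20) -> List[int]:
    """Replace missing or out-of-range ages with the last valid age seen."""
    donor = start_donor  # used only if the first record is invalid
    edited = []
    for age in ages:
        if age is not None and low <= age <= high:
            donor = age           # a valid value becomes the new donor
            edited.append(age)
        else:
            edited.append(donor)  # impute from the most recent valid record
    return edited


print(edit_ages([34, None, 7, 250, 61]))  # prints [34, 34, 7, 7, 61]
```

Real edit specifications normally condition the donor on related variables rather than using the previous record alone.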
Report Generation / Tabulation
  Raising factors attached to the edited long
   questionnaire data
  Tabulation programs (in CSPro) are prepared and
  Tables in accordance with the Tabulation Plan will be produced
  Final data will be organized in various formats
  Final data will be sent to the Central Databank for
   archiving and dissemination purposes.
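
How raising factors enter tabulation can be shown with a small weighted count. The sketch assumes the raising factor is the inverse of the sampling fraction (total households divided by households given the long questionnaire); the slides do not say how the factors were actually computed, and the numbers are invented.

```python
from collections import defaultdict


def raising_factor(total_households: int, sampled_households: int) -> float:
    """Inverse of the sampling fraction (an assumed definition)."""
    return total_households / sampled_households


def weighted_table(records, key):
    """Sum the weights of the records within each category of `key`."""
    table = defaultdict(float)
    for rec in records:
        table[rec[key]] += rec["weight"]
    return dict(table)


weight = raising_factor(total_households=150, sampled_households=30)  # 5.0
records = [
    {"sex": "F", "weight": weight},
    {"sex": "M", "weight": weight},
    {"sex": "F", "weight": weight},
]
print(weighted_table(records, "sex"))  # prints {'F': 10.0, 'M': 5.0}
```

Each sampled record counts as `weight` units, so the table estimates totals for the whole population rather than for the sample.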
Problems Encountered
I. Scanning :
    A batch might slip through unscanned during data capture

    A batch might also be scanned in parts only

    Misplacement of scanned forms in wrong boxes

    Limited storage space on the scanning machines
       Scanners become full, which makes scanning difficult
       Scanned images should constantly be moved to the
         storage server

    Scanned images can sometimes not be located
      on the storage server
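
Batches that slip through unscanned or are scanned only in part can be detected by reconciling scanned counts against the expected counts per EA. The sketch below is one possible way to do this; the EA IDs and counts are invented, and it assumes that an expected number of forms per EA can be derived from the batch header database.

```python
def reconcile(expected: dict, scanned: dict) -> dict:
    """Classify each EA as 'unscanned', 'partial', 'complete' or 'excess'."""
    status = {}
    for ea_id, n_expected in expected.items():
        n_scanned = scanned.get(ea_id, 0)
        if n_scanned == 0:
            status[ea_id] = "unscanned"
        elif n_scanned < n_expected:
            status[ea_id] = "partial"
        elif n_scanned == n_expected:
            status[ea_id] = "complete"
        else:
            status[ea_id] = "excess"  # possibly forms from another box
    return status


expected = {"EA001": 120, "EA002": 95, "EA003": 110, "EA004": 80}
scanned = {"EA001": 120, "EA003": 60, "EA004": 83}
for ea_id, state in reconcile(expected, scanned).items():
    print(ea_id, state)
# prints:
# EA001 complete
# EA002 unscanned
# EA003 partial
# EA004 excess
```

An `excess` result is one way that forms misplaced in the wrong box could show up.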
Problems Encountered - - -
II. Key Correction:
     Problems in retrieving scanned images for key
      correction were encountered
     Key correction took longer time as it is done
     The key correction process, as stated earlier, was
      based on fixing:
            Missing marks
            Multi-marks
            Partial marks
Problems Encountered - - -
III. Processing the data :
     Large volume of data – takes long time (8 hrs)

     Frequent power failure highly affects the
      processing sessions

     The tabulation component of CSPro software
      sometimes fails unpredictably
        (It is a newly developed tabulation system)
In summary :
  Registration and organization of all completed Census
   Forms done
  The scanning and key correction of the Census
   questionnaires completed
  The scanning of the Household Listing forms is done

  Draft Census preliminary results have been produced

 Additional Comment:
  Quick manual review (editing and coding) of the
  filled-in forms might be needed prior to the scanning