
Archiving and disseminating census microdata: The IPUMS

					   Roundtable on Archiving and Disseminating
official statistics with a focus on census microdata
          Example: IPUMS-International
                 http://www.ipums.org
                          ***
 Robert McCaa, Professor of Population History
         and Wendy L. Thomas, Archivist,
    University of Minnesota Population Center
                   rmccaa@umn.edu
 This .ppt, docs, & additional information at:
  www.hist.umn.edu/~rmccaa/ipums-africa
              Our common fate on a crowded planet:
        new forms of global cooperation are required.
           We must engage interdisciplinary research
                      combining theory and practice.
--Jeffrey D. Sachs, Common Wealth (Penguin 2008)
           A Census Microdata Revolution

1. Preserve all microdata and documentation
   Product (tables and microdata)
   Process (of conducting census and producing census
      microdata)
2. Integrate microdata and metadata
3. Disseminate to researchers world-wide
Conclusion: strengths, challenges, 7 golden rules
           A Census Microdata Revolution

1. Preserve all census microdata and documentation
   product and process:
    1960s – present
    ~100 countries (80 have endorsed IPUMS MoU)
    ~400 censuses (219 are entrusted to IPUMS)
2. Integrate: both microdata and metadata
3. Disseminate to researchers world-wide— “extracts”
   of database: countries, censuses, sub-populations,
   sample size, variables
                 IPUMS-International Today
               dark green = already integrated:
   35 countries, 111 censuses, 263 million person records
green = to be integrated: 39 countries, 103 censuses, 150 million person records




                            Mollweide projection
       IPUMS dissemination calendar (see handout)
      samples for 35 countries available now, 74 soon
» Europe 10:4
    » Available (10): Austria, Belarus, France, Greece, Hungary, Netherlands, Portugal,
      Romania, Spain, UK
    » Soon (4): Germany, Czech Republic, Slovenia, Switzerland
» Americas (funding renewed July 1) 11:11
    » Available (11): Argentina, Brazil, Canada, Chile, Colombia, Costa Rica, Ecuador,
      Mexico, Panama, USA, Venezuela
    » Soon (11): Bolivia, Cuba, Dominican Republic, El Salvador, Guatemala,
      Honduras, Nicaragua, Paraguay, Peru, Puerto Rico, Uruguay
» Africa 6:11
    » Available (6): Egypt, Ghana, Kenya, Rwanda, South Africa, Uganda
    » Soon (11): Botswana, Ethiopia, Guinea (Conakry), Madagascar, Malawi, Mali,
       Mauritius, Sierra Leone, Sudan, Tanzania, Zambia
» Asia 8:13
    » Available (8): Cambodia, China, Iraq, Israel, Malaysia, Palestine, Philippines,
      Vietnam
    » Soon (13): Armenia, Bangladesh, Fiji, India, Indonesia, Jordan, Kyrgyz
      Republic, Mongolia, Nepal, Pakistan, Thailand, Turkmenistan
                         IPUMS timeline
» 1995: IPUMS-USA first release of integrated microdata
           IPUMS-USA continues: 1850-2000 + ACS samples
»   1999: IPUMS-International funded
»   2002: 1st International release (7 countries, including
    Colombia and Mexico)
»   2006: 20 countries, 63 censuses
»   2008: 35 countries, 111 censuses
    » ~263 million person records
    » Two thousand users
» 2013: ~70 countries, ~200 censuses
    » 214 sets of microdata are already entrusted to MPC
    » Coming: Germany (8), Switzerland (4), Bangladesh (2), Cuba (1)...
      1. Preserve (Archive)
IPUMS Global workshop, ISI (Lisbon, Aug 2007)
Microdata: Archiving & Disseminating
• The producer’s perspective (official statisticians):
   – Archiving:
      • Comprehensive preservation of both data and documentation
        (metadata) with easily searchable indices
      • Continually updated with technological innovation—hardware,
        software (doc, pdf, txt, xls, jpg, etc.) and wet-ware
   – Disseminating: the web revolution
• The consumer’s perspective (researchers)
   – Access: locate and use on the web without obstacles
   – Disseminating: free access to anyone, anywhere, anytime
     (access postponed is access denied)
• What are your interests?
Microdata: Archiving & Disseminating
Our perspective:
• “Archiving Census Microdata and Documentation:
  Preserving Memory, Increasing Stakeholders” (UNSD
  NYC, 2001) – copy of paper at ~rmccaa/ipums-africa
  – Long term, 7 keys: readable, intelligible, identifiable,
    encapsulated, understandable, reconstructable, authentic
  – What to preserve: the product and the process
  – How to assess future value: stakeholders, future impact,
    anticipated use, informing the future
  – Challenges: archive, plan, trained staff, external repository
       Preservation, the problem:
1973 census tapes of Sudan were at risk!
        A Solution: Data recovery
(by a specialized data recovery company)
   Data recovery. Example: Bangladesh Bureau of Statistics
   (1981 census, 276 tapes, recovery in Aug. '08)
   Microdata on this tape were recovered!!
   >3,000 tapes recovered: 1971 Germany, 1980 Mexico,
   Mali 76, Sudan 73 and many more
                 Census Microdata: 1950s
             few countries archived microdata
(a country in green indicates microdata exist for the decade)
  see: www.hist.umn.edu/~rmccaa/IUMSI/country6.htm




                           Mollweide projection
          Census Microdata: 1960s
               The Americas:
in the vanguard for preservation of microdata




                   Mollweide projection
                       Census Microdata: 1970s
  the preservation of microdata was almost universal in the Americas
       and was becoming widespread in Europe, Africa and Asia




Mali, 1976: census microdata recovered from old Bernoulli boxes

                               Mollweide projection
                      Census Microdata: 1980s
           The preservation of microdata became generalized




Ghana, 1984: census microdata recovered from floppy discs!

                                Mollweide projection
     Census Microdata: 1990s
many countries preserved microdata
 (or are disposed to recover them)




              Mollweide projection
              Census Microdata: 2000s
            many countries have microdata
(or are disposed to make them available for research)




                       Mollweide projection
    Inventory of census microdata archived by region
         and decade (% of censuses conducted)

    Region/continent               Countries  2000s  1990s  1980s  1970s  1960s
    Latin America                  21         100%   100%   89%    81%    72%
    North America                  27         91%    72%    64%    24%    8%
    Africa                         58         15%    22%    25%    15%    2%
    Asia                           44         ?%     54%    31%    30%    13%
    Europe                         46         ?%     67%    55%    41%    13%
    Pacific (pop. > 0.5 million)   7          100%   100%   100%   43%    29%
Note: cases confirmed by the corresponding official statistical institute. Some
datasets remain to be certified. Some countries have not responded to the invitation to
inventory their stocks of data.
Source: http://www.hist.umn.edu/~rmccaa/IPUMS/country6.htm
      7 Essential Types of Metadata for Each Census
          See IPUMS Documentation (“Table 1”)
1.    Census Questionnaires (forms): dwellings,
      households, persons, mortality, migration, etc.
2.    Enumerator instructions
3.    Data Dictionaries (layouts)
4.    Codebooks
     a. Geographic codes
     b. Occupation / Industry / Education codes
5.    Data processing protocols
6.    Official Statistics
7.    Official Reports (Analytical, Technical, Methodological)
7 Essential Types of Metadata for Each Census
               Example: Ghana
   www.hist.umn.edu/~rmccaa/ipums-africa
7 Essential Types of Metadata for Each Census
         Example: Guinea (Conakry)
   www.hist.umn.edu/~rmccaa/ipums-africa
     2. Integration:
Microdata and Metadata
     IPUMS integration of metadata and
                microdata
» Comprehensive documentation, including
  » Data dictionaries and codebooks
  » Complete original source documentation in the official
    language:
     questionnaires, manuals, etc.
  » All translated to English (from the German--thanks again to
    Martin Podehl!!) and converted into metadatabase for each
    census
» Integration ≠ standardization
  » Composite codes (11, 12, 21, 22…) ≠ serial codes (1, 2, 3, …)
     (see next slide)
       IPUMS—Microdata integration method:
          composite codes (multiple digits)
             retains not only significant distinctions
            but also integrates comparable concepts
                                                           Chile            México
Code   Label                                          1992     2002    1990     2000
0      NIU                                             X        X       X        X
       ACTIVE (In Labor Force)
100     EMPLOYED, not specified                        ·           ·    ·            ·
110      At work                                       X           X    X            X
111       At work, and 'student'                       ·           ·    ·            X
112       At work, and 'housework'                     ·           ·    ·            X
113       At work, and 'seeking work'                  ·           ·    ·            X
114       At work, and 'retired'                       ·           ·    ·            X
115       At work, and 'no work'                       ·           ·    ·            X
116       At work, and 'other'                         ·           ·    ·            X
117       At work, family holding, not specified       ·           ·    ·            ·
118       At work, family holding, not agricultural    ·           ·    ·            ·
119       At work, family holding, agricultural        ·           ·    ·            ·
120     Have job, not at work last week                X           X    X            X
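
One way to read the composite codes in the table above: each extra digit adds detail, so a code can be truncated to a coarser level whenever a sample lacks the finer distinction. The following is a minimal Python sketch under that reading. The function name `collapse` and the `LABELS` dictionary are illustrative assumptions, not part of any IPUMS tool.

```python
# Labels copied from the employment-status table above (three-digit composite codes).
LABELS = {
    100: "EMPLOYED, not specified",
    110: "At work",
    111: "At work, and 'student'",
    112: "At work, and 'housework'",
    120: "Have job, not at work last week",
}


def collapse(code: int, digits: int) -> int:
    """Keep the first `digits` digits of a three-digit composite code; zero-fill the rest."""
    if not 1 <= digits <= 3:
        raise ValueError("digits must be 1, 2 or 3")
    factor = 10 ** (3 - digits)
    return (code // factor) * factor


# In the table, only Mexico 2000 distinguishes 111 and 112; the other samples record 110.
detailed = [111, 112, 110, 120]
two_digit = [collapse(c, 2) for c in detailed]  # comparable across all four samples
one_digit = [collapse(c, 1) for c in detailed]  # broadest level: employed
```

A serial coding scheme (1, 2, 3, ...) would not allow this kind of truncation, because the digits would carry no hierarchy.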
        IPUMS—Microdata integration method:
           composite codes (multiple digits)
              retains not only significant distinctions
             but also integrates comparable concepts
Goal of integration coding scheme:
Assist each researcher in making informed decisions on comparability, not to attempt
to make the one best decision for all researchers.
                     Metadata: Employment Status
                     Translation Table for Employment Status

Harmonized Codes and Labels                           Source Data Codes (selected samples)

Code  Label                                           Col   Col   Fra   Fra   Ken   Mex   Mex   US    Viet  Viet
                                                      1964  1993  1962  1975  1999  1970  2000  1960  1989  1999
0000  N/A                                             *,5   B     *     B     BB    0     BB    00    B     B,1
      ACTIVE (In Labor Force)
1000    EMPLOYED, not specified                       1     ·     ·     ·     ·     ·     ·     ·     1     ·
1100      At work                                     ·     4     1     1     01    1     10    10    ·     ·
1101        At work, and 'student'                    ·     ·     ·     ·     ·     ·     14    ·     ·     ·
1102        At work, and 'housework'                  ·     ·     ·     ·     ·     ·     15    ·     ·     ·
1103        At work, and 'seeking work'               ·     ·     ·     ·     ·     ·     13    ·     ·     ·
1104        At work, and 'retired'                    ·     ·     ·     ·     ·     ·     16    ·     ·     ·
1105        At work, and 'no work'                    ·     ·     ·     ·     ·     ·     18    ·     ·     ·
1106        At work, public emergency                 ·     ·     ·     ·     ·     ·     ·     11    ·     ·
1107        At work, family holding, not specified    ·     ·     ·     ·     ·     ·     ·     ·     ·     ·
1108        At work, family holding, not agricultural ·     ·     ·     ·     03    ·     ·     ·     ·     ·
1109        At work, family holding, agricultural     ·     ·     ·     ·     04    ·     ·     ·     ·     ·
1110        Working and studying (France)             ·     ·     ·     ·     ·     ·     ·     ·     ·     ·
1200      Have job, not at work last week             ·     3     ·     ·     02    ·     20    12    ·     ·
1300      Armed forces                                ·     ·     ·     ·     ·     ·     ·     13    ·     ·
1301        Armed forces, at work                     ·     ·     ·     ·     ·     ·     ·     14    ·     ·
1302        Armed forces, not at work last week       ·     ·     ·     ·     ·     ·     ·     15    ·     ·
1303        Military trainee (France)                 ·     ·     8     6     ·     ·     ·     ·     ·     ·
2000    UNEMPLOYED, not specified                     2     ·     ·     3     05    2     30    20    ·     ·
2001        Unemployed (Vietnam)                      ·     ·     ·     ·     ·     ·     ·     ·     4     5
2002        Worked less than 6 months, permanent job  ·     ·     ·     ·     ·     ·     ·     ·     2     ·
2003        Worked less than 6 months, temporary job  ·     ·     ·     ·     ·     ·     ·     ·     6     ·
2100      Unemployed, experienced worker              ·     1     ·     ·     ·     ·     ·     21    ·     ·
2101        Seeking work, worked less than 3 months   ·     ·     2     ·     ·     ·     ·     ·     ·     ·
2102        Seeking work, worked 3 to 6 months        ·     ·     3     ·     ·     ·     ·     ·     ·     ·
2103        Seeking work, worked 6 to 12 months       ·     ·     4     ·     ·     ·     ·     ·     ·     ·
2104        Seeking work, worked more than 1 year     ·     ·     5     ·     ·     ·     ·     ·     ·     ·
2105        Seeking work, experience unspecified      ·     ·     6     ·     ·     ·     ·     ·     ·     ·
2200      Unemployed, new worker                      ·     2     7     ·     ·     ·     ·     22    ·     ·
3000  INACTIVE (Not in Labor Force)                   ·     ·     ·     ·     ·     ·     ·     30    ·     ·
3100    Housework                                     3     6     ·     ·     10    3     50    31    6     2
3200    Unable to work/disabled                       7     7     ·     ·     09    ·     70    32    7     4
3300    In school                                     4     5     9     5     07    ·     40    33    5     3
3400    Retirees and living on rent                   8     ·     ·     ·     ·     ·     60    ·     ·     ·
3401      Living on rent payments                     ·     ·     ·     ·     ·     ·     ·     ·     ·     ·
3402      Retirees/pensioners                         ·     8     ·     4     08    ·     ·     ·     ·     ·
3500    Elderly                                       6     ·     ·     ·     ·     ·     ·     ·     ·     ·
3600    No work available/discouraged                 ·     ·     ·     ·     06    ·     ·     ·     ·     ·
3700    Inactive, other reasons                       9     0     0     0     11    4     80    34    ·     6
9000  UNKNOWN/MISSING                                 ·     9     ·     ·     00    9     99    ·     ·     9

· = no source code shown for this category in that sample.
Note: In the source data columns: a comma indicates more than one code was coded to the respective IPUMS-International

EMPSTAT
Employment status

Description
EMPSTAT indicates whether or not the respondent was part of the labor force -- working or seeking work -- over a specified period of time. Depending on the sample, EMPSTAT can also convey further information.

The first digit of EMPSTAT is fully comparable, and classifies the population into three groups: employed, unemployed, and inactive. The combination of employed and unemployed yields the total labor force. The second and third digits of EMPSTAT preserve additional information available for some countries and census years but not for others.

Employment status is sometimes referred to in other sources as "activity status."

Comparability -- General
The age of persons to whom the question applies varies across the samples (see Universe).

The reference period for the employment status question varies. For most samples, employment status was reported with respect to the day of the census or…
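
The translation table can be read as a lookup from (sample, source code) to a harmonized EMPSTAT code, with the first digit giving the fully comparable group. A minimal Python sketch follows, assuming the Mexico 1970 and Mexico 2000 columns as transcribed from the slide. The names `TRANSLATION`, `harmonize` and `general_group` are hypothetical and do not describe IPUMS software.

```python
# Source-to-harmonized mappings transcribed from the Mexico columns of the
# translation table. Source codes are kept as strings, as they appear on the slide.
TRANSLATION = {
    ("Mex", 1970): {"0": 0, "1": 1100, "2": 2000, "3": 3100, "4": 3700, "9": 9000},
    ("Mex", 2000): {
        "BB": 0, "10": 1100, "14": 1101, "15": 1102, "13": 1103, "16": 1104,
        "18": 1105, "20": 1200, "30": 2000, "50": 3100, "70": 3200,
        "40": 3300, "60": 3400, "80": 3700, "99": 9000,
    },
}

# First digit of the harmonized code: the level described as fully comparable.
GENERAL = {0: "N/A", 1: "employed", 2: "unemployed", 3: "inactive", 9: "unknown/missing"}


def harmonize(sample, source_code):
    """Translate one source code into its harmonized EMPSTAT code."""
    return TRANSLATION[sample][source_code]


def general_group(empstat):
    """Return the group named by the first digit of a four-digit EMPSTAT code."""
    return GENERAL[empstat // 1000]


code_2000 = harmonize(("Mex", 2000), "14")  # At work, and 'student'
code_1970 = harmonize(("Mex", 1970), "3")   # Housework
```

Keeping the source code next to the harmonized one lets a researcher check any recode against the original census documentation.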
                     Metadata: Employment Status, example: Mexico
         Integrate: retain all significant detail, harmonize everything
         Not standardize: force square pegs in round holes

Comparability -- Mexico
The universe and reference period are fully comparable across the Mexico samples.

The 1970 Census did not provide detail on the inactive population except for "houseworkers," while the later samples have numerous subcategories.

In 1990, the employment status question refers to "Principal Activity" and therefore under-reports secondary economic activity by students, housewives, family-workers, the semi-retired, and others.

The 2000 Census sought to overcome deficiencies in reporting work status for people whose primary activity was not work (students, housewives, retirees, etc.), but who in fact were working according to international definitions. A second question introduced for the first time in 2000 sought to capture this secondary economic activity. For strict comparability with earlier Mexican censuses, this recovered activity (codes 1101-1106) should be considered "inactive."
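
The Mexico comparability note amounts to a recode that a researcher applies at analysis time. A minimal Python sketch, assuming the four-digit harmonized codes and using the general code 3000 (INACTIVE) as the recode target. Both the function name and the choice of 3000 are illustrative assumptions. The note itself says only that these codes "should be considered inactive."

```python
def strict_mexico_empstat(empstat: int) -> int:
    """Recode the secondary activity recovered in Mexico 2000 (1101-1106) as inactive.

    Assumption: 3000 (INACTIVE, Not in Labor Force) stands in for "inactive";
    the comparability note does not name a specific target code.
    """
    if 1101 <= empstat <= 1106:
        return 3000
    return empstat


# Harmonized codes as they might appear in a Mexico 2000 extract (made-up values).
mexico_2000 = [1100, 1101, 1103, 1200, 3100]
strict = [strict_mexico_empstat(code) for code in mexico_2000]
```

The original codes stay in the data, so the recode is a choice made per analysis and can be reversed.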
IPUMS integrated metadata: Instantly, compare text &/or
  image of enumeration forms and instructions for any
   combination of countries and censuses (example:
               educational attainment)
                    In addition…
» Microdata: new high precision samples not
  only for contemporary censuses but also for
  historical ones (before the 1990s)
» Systematic metadata for all variables
   » Universes
   » Definitions
   » Comparability
   » Dynamic System—facilitates comparing the
    wording of questionnaires and instructions for any
    combination of countries and censuses
3. Dissemination
                    - Caution -
• IPUMS microdata are anonymized samples.
   – They are for advanced analysis and research.
   – Use of statistical software is required.
   – Statistical software provides great power.
   – “With great power, comes great responsibility.”
• IPUMS samples are for analysis.
• IPUMS samples are not official statistics.
                   6 steps using
 https://international.ipums.org/international:
1. Log on with password
2a. Study documentation
2b. Design extract
3. Receive email; log on with password
4. Download extract (SSL encrypted)
5. Unzip data
6. Analyze (also SAS, STATA)
             Conclusion:
IPUMS Strengths and Challenges plus
7 golden rules for promoting microdata
               revolution
The IPUMS team (Feb. 2008)




                            Steven Ruggles, inventor of IPUMS,
                            Professor of History, and Director of
                            the Minnesota Population Center


(Not present: computer gurus, some researchers,
  and others who were too busy for a photo!)
               IPUMS-International strengths

1.   Uniform legal authorization with national statistical
     authorities
2.   Access restricted to academics with need who agree to abide
     by stringent confidentiality protections
3.   Sanctions against individual and institution—denial of access
     to all microdata for the entire institution
4.   Experienced integration teams
5.   Proven web-based distribution system
6.   High user satisfaction with microdata & metadata
7.   Sustainable funding: NSF, NIH
                  5 Challenges

1. Microdata to recover (30 countries), integrate
     (60 countries)
2.   2010 round of censuses (~100 countries)
3.   Tabulator (research tool—not official stats)
4.   GIS
5.   High security laboratory for sensitive,
     comprehensive microdata
                       7 golden rules for
                the global microdata revolution
1. Respect “restricted-access” conditions of use:
   »   protect confidentiality
   »   “share” data only with registered users
2. Study both source documentation and metadata:
   »   Original source: census forms, instructions to enumerators, etc.
   »   Integrated metadata: samples, variables, comparability discussions
3. Construct extracts judiciously:
   »   extract only needed countries, censuses, variables, sub-pops
   »   use sample size &/or “subsamp” features to keep samples small
4. Use weights:
   either households or individuals (geographical strata = power)
5. Analyze carefully:
   proper statistical techniques, keeping in mind data quality, sample error
6. Cite properly: IPUMS and National Statistical Agencies
7. Share publications: IPUMS and National Statistical Agencies
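
Golden rule 4 can be illustrated with a weighted tabulation. The sketch below is plain Python under stated assumptions: the column name `perwt` for the person weight and all record values are made up for illustration, and the helper `weighted_share` is hypothetical.

```python
def weighted_share(records, weight_key, predicate):
    """Share of the weighted population for which `predicate` holds."""
    total = sum(r[weight_key] for r in records)
    if total == 0:
        raise ValueError("weights sum to zero")
    matching = sum(r[weight_key] for r in records if predicate(r))
    return matching / total


# Made-up person records; weights differ, as they can across geographic strata.
sample = [
    {"empstat": 1100, "perwt": 10.0},
    {"empstat": 3100, "perwt": 10.0},
    {"empstat": 1100, "perwt": 20.0},
    {"empstat": 2000, "perwt": 10.0},
]

# Employed = first digit of the harmonized code equals 1.
employed_weighted = weighted_share(sample, "perwt", lambda r: r["empstat"] // 1000 == 1)
employed_unweighted = sum(r["empstat"] // 1000 == 1 for r in sample) / len(sample)
```

In this toy sample the weighted share (30 of 50) differs from the unweighted one (2 of 4), which is why the weights matter.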
  Thank you!!

rmccaa@umn.edu

				