Data Product Production (DPP) Technical Design
Data Access and Dissemination System (DADS)
DPP

Version No: 1.1
Version Publishing Date: September 15, 2004

Author: <name removed>
Owner: <name removed>
Client: Bureau of Census (BOC)

Contract Number: 50-YABC-7-66012


The only authorized copy of this document is the on-line version maintained in the DADS repository. Users must ensure that this or
any other copy of a controlled document is current and complete prior to use. The document owner must authorize all changes. Users
should discard obsolete copies.
                                     DOCUMENT ADMINISTRATION
Ensure that this document is current. Printed documents and locally copied files may become obsolete
due to changes to the master document.
The document original is located in the project’s documentation repository at the following file address:
          I:\Special Projects\DPP Documentation\08 WProd\as-is document and attachments

   i.     Revision History Log
The maintenance of this Revision History Log is mandatory. Document users discovering that the last
logged revision date is more than 120 days old should assume the content is dated and should alert the
document’s owner.

Revision Number   Revision Date   Summary of Changes                                              Team - Author
1.0               08/03/04        Initial version for customer review                             PH
1.1               09/15/04        Revised per comments from customer review meeting 08/17/2004.   PH




Table 1: Revision History Log.


  ii.     Identification
This document is identified as the Data Product Production (DPP) Technical Design. The production and
maintenance of this document is the responsibility of the DADS Business Architecture Team.

 iii.     Document References




                                                        TABLE OF CONTENTS
1.   INTRODUCTION ................................................................................................................................... 6
     1.1.  Scope ...................................................................................................................................... 6
     1.2.  Audience.................................................................................................................................. 6
     1.3.  Purpose ................................................................................................................................... 6
     1.4.  How to use this document ....................................................................................................... 6
2.   FUNCTIONAL AND NONFUNCTIONAL REQUIREMENTS ............................................................... 7
3.   SOFTWARE ARCHITECTURE OF THE DPP SYSTEM .................................................................... 13
     3.1. Software component chart .................................................................................................... 13
     3.2. Flowchart of the DPP System ............................................................................................... 14
     3.3. Description of major input and output files ............................................................................ 14
          3.3.1.      Detail data files (HDF, HEDF, SDF, SEDF) (inputs to the DPP system) .......................................... 14
          3.3.2.      Geography files (inputs) ................................................................................................................... 18
          3.3.3.      Independent tabulations (inputs) ...................................................................................................... 20
          3.3.4.      Summary Files (outputs) .................................................................................................................. 21
     3.4.          Driver files .............................................................................................................................. 26
          3.4.1.      How Driver Files are used in the DPP system ................................................................................. 26
          3.4.2.      How Driver Files are searched for in the DPP system ..................................................................... 26
          3.4.3.      Inventory of driver files ..................................................................................................................... 27
     3.5.          DPP COTS Software Components ....................................................................................... 35
          3.5.1.      Korn Shell ........................................................................................................................................ 36
          3.5.2.      SAS .................................................................................................................................................. 40
          3.5.3.      Space-Time Research ..................................................................................................................... 41
          3.5.4.      Perl................................................................................................................................................... 45
          3.5.5.      Python, and JPython ........................................................................................................................ 46
          3.5.6.      Java ................................................................................................................................................. 46
          3.5.7.      IBM VisualAge TeamConnection ..................................................................................................... 46
4.   ABOUT THE FUNCTIONAL CAPABILITIES OF THE DPP SYSTEM .............................................. 51
     4.1.  About Detail databases ......................................................................................................... 51
          4.1.1.      Preparing for a Detail Database Build .............................................................................................. 51
          4.1.2.      Building a Detail Database ............................................................................................................... 59
          4.1.3.      Post-Database-Build Validation........................................................................................................ 63
     4.2.          About the Divide and Conquer Approach to tabulation ......................................................... 64
          4.2.1.      Structuring the DPP system to use an integrated logical file system................................................ 64
          4.2.2.      Waves .............................................................................................................................................. 65
          4.2.3.      Other Uses of Modified Operational Materials ................................................................................. 66
          4.2.4.      Creating Waves................................................................................................................................ 66
          4.2.5.      Restarting Waves – Failed jobs and Reruns .................................................................................... 67
     4.3.          About the Division of code and labor .................................................................................... 67
     4.4.          About Geography .................................................................................................................. 70
          4.4.1.      Assembling the DGF ........................................................................................................................ 70
          4.4.2.      Processing Geography File .............................................................................................................. 71
          4.4.3.      Using Output From Geography Processing ..................................................................................... 72
     4.5.          About Geographic recoding ................................................................................................... 72
          4.5.1.      Producing a Full Geography Recode ............................................................................................... 72
          4.5.2.      Modifying a Geography Recode ....................................................................................................... 74
          4.5.3.      Tabulating with a geography recode ................................................................................................ 78

     4.6.          About Hand-off ...................................................................................................................... 80
          4.6.1.      AFF .................................................................................................................................................. 81
          4.6.2.      ACSD ............................................................................................................................................... 81
          4.6.3.      Review ............................................................................................................................................. 82
          4.6.4.      Internal ............................................................................................................................................. 83
     4.7.          About Iterations and iteration recodes .................................................................................. 84
          4.7.1.      Data Fields for Iterations .................................................................................................................. 84
          4.7.2.      Defining an Iteration ......................................................................................................................... 85
          4.7.3.      STR Specific Observations .............................................................................................................. 86
          4.7.4.      Iteration in an Operational Sense ..................................................................................................... 86
     4.8.          About Logging ....................................................................................................................... 87
          4.8.1.      SAS Naming Conventions ................................................................................................................ 87
          4.8.2.      Non-SAS Naming Conventions ........................................................................................................ 88
          4.8.3.      Structure of Logs File Directories ..................................................................................................... 88
          4.8.4.      Directory $DPPwork/$DPPenv/logs ................................................................................................. 88
          4.8.5.      Directory $DPPwork/$DPPenv/logs/<State Name>/<YYYYMMDD> ............................................... 88
          4.8.6.      Directory $DPPwork/$DPPenv/logs/<State Name> ......................................................................... 89
          4.8.7.      Log File $DPPwork/$DPPenv/log..................................................................................................... 89
     4.9.          About Median processing ...................................................................................................... 89
          4.9.1.      State-Level Median Processing........................................................................................................ 89
          4.9.2.      US-Level Median Processing ........................................................................................................... 90
     4.10.         About the Parameterization of the DPP System – metadata as driver files .......................... 90
     4.11.         About Quality Assurance ....................................................................................................... 91
     4.12.         About Status reporting ........................................................................................................... 96
     4.13.         About Tabulation ................................................................................................................... 99
          4.13.1.       What is Tabulation? ....................................................................................................................... 99
          4.13.2.       Detail Databases .......................................................................................................................... 100
          4.13.3.       Tabulating Medians ...................................................................................................................... 100
     4.14.         About Thresholding ............................................................................................................. 102
          4.14.1.       Approaches to Thresholding ........................................................................................................ 102
          4.14.2.       Iterations optimization .................................................................................................................. 103
          4.14.3.       Implementation - SIPHC .............................................................................................................. 103
     4.15.         About TXD’s and recodes ................................................................................................... 106
          4.15.1.       Understanding TXDs .................................................................................................................... 106
          4.15.2.       How TXDs are used in the DPP Production System .................................................................... 106
          4.15.3.       Parts of a TXD .............................................................................................................................. 106
          4.15.4.       Why Customize TXDs in DPP ...................................................................................................... 109
          4.15.5.       TXD Parts in the Build .................................................................................................................. 109
          4.15.6.       TXD Parts Built Dynamically ........................................................................................................ 110
          4.15.7.       How a dynamic TXD is built ......................................................................................................... 111
          4.15.8.       Important Points about TXDs ....................................................................................................... 111
     4.16.         About US processing and aggregation ............................................................................... 115
          4.16.1.       Geography ................................................................................................................................... 115
          4.16.2.       Creating a US-Level Detail Database .......................................................................................... 116
          4.16.3.       Tabulation .................................................................................................................................... 116
          4.16.4.       Aggregation .................................................................................................................................. 116
          4.16.5.       Medians ....................................................................................................................................... 116
5.   HARDWARE / FIRMWARE ARCHITECTURE OF THE DPP SYSTEM .......................................... 118
     5.1. Description of hardware components .................................................................................. 118
      5.2.         Backup/failover architecture ................................................................................................ 119
6.    SECURITY-RELATED ARCHITECTURE OF THE DPP SYSTEM .................................................. 121
      6.1.  Logical environments (dev, pa, uat, test, prod, sprod) ........................................................ 121
      6.2.  User groups and the use of generic, non-login-enabled accounts ...................................... 121
      6.3.  The assignment of read/write access to users .................................................................... 122
7.    USING THE DPP SYSTEM TO PRODUCE A NEW PRODUCT ..................................................... 125
      7.1.     The life cycle of a product in the DPP System .................................................................... 125
      7.2.     Using multiple instances of the DPP system ....................................................................... 126
      7.3.     Preparing the driver files ..................................................................................................... 126
      7.4.     Notes on sourcing new txd’s ............................................................................................... 127
      7.5.     Adding a new product to the Cookbook .............................................................................. 128
      7.6.     Interacting with the Team Connection environment ............................................................ 128
      7.7.     Planning how the production workload will be submitted .................................................... 128
      7.8.     Preparing the hardware/firmware environment for pa/uat/test/prod/sprod processing ....... 130
         7.8.1. Resource estimation (disk space, run time, etc.) .................................................................... 131
          7.8.2.      Striped disk areas .......................................................................................................................... 131
          7.8.3.      Mounting disk areas across multiple platforms............................................................................... 133
          7.8.4.      Linking to files generated in other products, versus recreating them ............................................. 134
      7.9.         Executing the DPP System for a product ............................................................................ 134
8.    MAINTAINING THE DPP SYSTEM .................................................................................................. 137
      8.1.  Testing/installing versions/fixes to operating system software and to COTS software ....... 137
      8.2.  User account maintenance .................................................................................................. 137
      8.3.  Adding and/or removing hardware ...................................................................................... 139
      8.4.  Supporting the long-term file retention policy ...................................................................... 140
      8.5.  Maintenance Reboots and Re-establishing Mount Points .................................................. 140
9.    ARTIFACTS AND KEY METRICS OF RECENT PRODUCTS ........................................................ 142
10.      APPENDIX ................................................................................................................................... 145
      10.1.  Glossary of Terms ............................................................................................................... 145
      10.2.  Appended Notes .................................................................................................................. 148
11.      INDEX .......................................................................................................................................... 149
      11.1.  Index of Tables .................................................................................................................... 149
      11.2.  Index of Figures ................................................................................................................... 151




                                                 1. INTRODUCTION

1.1.      Scope
This document is intended as a comprehensive guide to the DPP subsystem of DADS, as it exists in the
summer of 2004. Specifically, it documents the software system which was delivered to the Bureau of the
Census by IBM on April 28, 2004, in the build identified as DPP2001_286.

1.2.      Audience
The audience for this document is the technical staff who will use, maintain, and possibly enhance the DPP system in the
future. It describes the hardware and software architecture of the DPP system, and provides instructions for producing new
products from the Census 2000 data and for maintaining the DPP system itself.

1.3.      Purpose
The purpose of this documentation is to provide reference materials to support the maintenance and use
of the DPP system. It is written as a bridge between higher-level architecture documents and the code of
the DPP system. It is comparable to an automotive shop manual.
This documentation contains a description of the system and its components which will aid in
understanding how the DPP system should work, how to make it produce a new product, and where to
target modifications if functionality needs to be changed.
The expected life of the information in this document is the same as that of the DPP system itself. It is
expected that the DPP system that is documented herein will remain relatively unchanged for the rest of
its life, and will be obsolete by 2010.

1.4.      How to use this document
This document is primarily a reference document.
For information on the DPP system itself, consult the sections on software and hardware architecture.
To use the DPP system to produce a product, start with the information under “Using the DPP system to
produce a new product,” and refer to the sections on software and hardware architecture as needed.
On a regular basis, use the information in “Maintaining the DPP System” to keep the DPP system
functional for future use. Refer to the sections on software and hardware architecture as needed.




            2. FUNCTIONAL AND NONFUNCTIONAL REQUIREMENTS
The DPP requirements are contained in these two historical documents:
•   DADS 2000 Data Product Production Functional Requirements Content Version
    File creation date: 09/08/2000. Filename: DPP Functional Requirements 1.4.doc
•   DADS 2000 - DPP Non-functional Requirements
    File creation date: 02/01/2000. Filename: DPP_NFR5.DOC
For the convenience of the reader, the pertinent Functional Use Cases have been extracted and are
summarized below:


Use Case: DPP01.0 Describe Product
Initiating Actor or Event: DPP Production Operator
Termination: Product has been described.
Description: Sets parameters for a specific product.
Notes:
A product is initialized by associating a product name, such as “PL”, “SF1”, or “SD”, with parameters that contain the information
used to customize the product being produced. Automated processes use this customization information to prevent re-entry
and to prevent errors.
A partial list of customization information (see the sketch at the end of this use case):
•   Table Names.
•   Number of cells in each table.
•   Number of iterations (such as 52 states.)
•   Use HEDF – Y/N.
•   Match DPP output files to ACSD dissemination files.
The following items will be tested post September 2000:
•   Implement thresholds – Y/N.
•   Use SEDF – Y/N.
Detailed Requirements:
Describe Product.
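
The following is a minimal sketch of what such a set of product-description parameters might look like. The keys, values, and dictionary format are illustrative assumptions, not the actual DPP driver-file format (driver files are covered in section 3.4).

    # Hypothetical product description; the real DPP system stores this
    # information in driver files, not in Python code.
    product_description = {
        "product": "SF1",                   # product name, e.g. "PL", "SF1", "SD"
        "table_names": ["P1", "P2"],        # hypothetical table identifiers
        "cells_per_table": {"P1": 1, "P2": 6},
        "iterations": 52,                   # number of iterations, such as 52 states
        "use_hedf": False,                  # Use HEDF - Y/N
        "match_to_acsd": True,              # match DPP output files to ACSD files
    }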

Use Case: DPP02.0 Acquire Data Files and Detail Metadata Files
Initiating Actor or Event: DPP Production Operator
Other Actors: Data Provider
Termination: The use case terminates successfully when the data are ready for data preparation and database build.
Description: The DPP Production Operator receives notice from the Data Provider that data files are available for acquisition. The
DPP Production Operator acquires, logs, verifies and installs the data for subsequent processing.
Notes: Data files are:
          Adjusted and unadjusted, HEDF detail data files.
          Detail File Metadata files.
          Adjusted and unadjusted, 100% files.
          DPP Geography files.
          Product Mapping files (Analyzer – Summary File, Internal Summary File, and prior products.)
The following items will be tested post September 2000:
          Adjusted and unadjusted SEDF detail data files.
          Adjusted and unadjusted, Sample Analyzer Tally files.

Detailed Requirements:
•   Receive notification of file availability.
•   Acquire data.
•   Create a means of verifying the source and integrity of each file, if one was not already received from the data provider
    (see the sketch after this list).
•   Log acquired data.
•   Verify source and integrity of data file.
•   Validate data file contents.
•   Preserve versions of previous files, as needed.
•   Place data in appropriate location for subsequent processing.
•   Update internal match mapping file, if necessary.
•   Update Analyzer match mapping file, if necessary.
•   Update the prior Product Match Mapping file, if necessary.
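
A minimal sketch of one way to meet the verify-source-and-integrity requirement, assuming an MD5 checksum is delivered alongside each data file. The sidecar-file convention and the file names are assumptions for illustration only.

    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        """Compute the MD5 digest of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical names: the detail file and its checksum sidecar.
    expected = open("hdf_alabama.md5").read().split()[0]
    actual = md5_of("hdf_alabama.dat")
    if actual != expected:
        raise ValueError("integrity check failed for hdf_alabama.dat")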

Use Case: DPP04.0 Data file preparation and Database Build
Initiating Actor or Event: DPP Production Operator
Termination: An HEDF or SEDF database is ready for table composition and tabulation, geography records are ready for
Production Recodes, and Geography Stub files are ready for output processing. The SEDF database will be tested post September
2000.
Description: Acquired data files are prepared for the database build process. The prepared files are verified for accuracy. The
prepared files are then used as input to the Detail File database build process.
Detailed Requirements:
•   Restructure and reformat metadata files and geography data for database creation.
•   Create Production Geography Recode file for use in the tabulations, and GeoRecode files for use in creating Review
    Materials.
•   Create Geography Header files and Geo ID equivalency file.
•   Perform validation and internal consistency checks against data and geography files. This includes consistency with
    previous product deliveries for geography files; consistency between the DPP Geography File and the block header records
    of the detail file; and consistency between the DPP Geography File and the AFF Geography bucket.
•   Check that, for all DPP GEO files, land and water area values for higher summary areas equal the sum of the values from
    the blocks that compose the higher areas (see the sketch after this list).
•   Create database.
•   Install database.
•   Examine database for correctness by performing simple tabulations to match to EDITALS, and by eyeballing metadata in the database.
•   Provide database access to users.
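
A minimal sketch of the land/water consistency check named above: for each higher-level summary area, the land and water areas should equal the sums over that area's component blocks. The record structure and field names here are illustrative assumptions, not the real DPP Geography File layout.

    from collections import defaultdict

    def check_land_water(blocks, summary_areas):
        """blocks: iterable of dicts with 'parent_geoid', 'land', 'water'.
        summary_areas: dict mapping geoid -> dict with 'land' and 'water'.
        Returns the list of geoids whose totals do not match."""
        sums = defaultdict(lambda: {"land": 0, "water": 0})
        for blk in blocks:
            sums[blk["parent_geoid"]]["land"] += blk["land"]
            sums[blk["parent_geoid"]]["water"] += blk["water"]
        return [geoid for geoid, area in summary_areas.items()
                if sums[geoid]["land"] != area["land"]
                or sums[geoid]["water"] != area["water"]]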

Use Case: DPP05.0 Compose Tables
Initiating Actor or Event: DPP Table Composer
Termination: Table has been released for production.
Description: The DPP Table Composer creates the table components and composes tables according to the Product
Specifications. The DPP Table Composer coordinates a peer review and releases the table into production.
Notes: Table composition capabilities must enable the DPP Table Composer to create all the tables in the data product according to
the pertinent Product Specification, and to satisfy other operational needs such as support for conditional rounding and
computation of land and water areas for population-size-code-based geocomponents. Where this is not completely possible, the DPP
system must allow completion of the tabulation requirement as a post-processing step that must occur prior to creation of Review Materials.
Detailed Requirements:
•   Create table components (Recodes, UDFs.)
•   Select Components.
•   Layout table.
•   Create derivations (Derived measures, calculations and accumulations.)
•   Save composed table.
•   Save a composed table as text, change it with a text editor, and use it with no loss of tabulation functionality.
•   Coordinate a peer review of composed table.
•   Release table components into production.
•   Release table into production.

Use Case: DPP06.0 Create Production Tabulation
Initiating Actor or Event: DPP Production Operator
Termination: All Product tables submitted have been successfully run.
Notes: Product tables are a cross product of the composed product table, the Detail File database, and the associated geography
file. (A sketch of this cross product follows the requirements below.)
Detailed Requirements:
Merge composed tables with product geographies, run against the appropriate database, and save the data results. This generates
result tables for specified geographies. The Production Operator may run all of the tables in the product, or may only run some of
the tables. Results of previous tabulations will be retained for analysis until explicitly deleted.
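
A minimal sketch of the cross product described in the note above: each composed table is run for each product geography against the appropriate database, and each result is saved per table/geography pair. The function names and signatures are illustrative assumptions, not the DPP system's actual interfaces.

    def run_product(tables, geographies, run_tabulation):
        """Run every composed table against every product geography.
        run_tabulation(table, geo) is assumed to execute one tabulation
        against the appropriate database and return its saved result."""
        results = {}
        for table in tables:        # the operator may submit all tables or a subset
            for geo in geographies:
                results[(table, geo)] = run_tabulation(table, geo)
        return results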

Use Case: DPP07.0 Create Review Materials and Handoff Materials
Initiating Actor or Event: DPP Production Operator
Other Actors: POP & HHES Reviewer
Termination: Notification is given that Review materials are ready.
Detailed Requirements:
•   Perform post processing, as required, for instance:
    -   Reformat (e.g., “-85” becomes 85.1.)
The following items will be tested post September 2000:
•   Perform conditional rounding.
•   Filter product based on computed population thresholds.
•   Insert land and water areas for population-size-code-based geocomponents in the Summary File geographic data segment.
•   Create Summary File:
    -   For PL ONLY: Insert the count tabulated as PL1(1) into the POP100 field of the geographic portion of the PL Summary
        File prior to preparation of Review Materials. The intention is to have POP100 contain the unadjusted count on the
        Unadjusted Summary File and the adjusted count on the Adjusted Summary File. Overwrite HU100, if no value is present,
        with zero. (A sketch of this substitution follows this list.)
    -   For Unadjusted SF-1 ONLY: Use the ‘unadjusted’ POP100 and POP Size Code fields in place of the ‘adjusted’ POP100
        and POP Size Code fields (both from the DPP Geo File) in the geographic portion of the Summary File, performing the
        substitution before preparing Review Materials.
•   Perform internal match according to specifications.
•   Perform Analyzer match according to specifications.
•   Perform prior product match according to specifications.
•   Create Full and First Occurrence SuperCROSS and SAS Review Materials, containing only tables which have not failed
    any of the Matches (Internal, Analyzer, and Prior Product) or have been otherwise specified by the operator for
    inclusion or exclusion. Indicate those lines which have been subject to the Analyzer and/or prior product match.
•   Make Review Materials available for review.
•   Notify POP & HHES Reviewer that Review Materials are available.
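
A minimal sketch of the PL-only POP100 substitution described above: the count tabulated as PL1(1) replaces POP100 in each geographic record, and a missing HU100 is overwritten with zero. The field names, the record-as-dictionary representation, and the pl1_counts lookup are illustrative assumptions.

    def fix_pl_geo_record(geo_record, pl1_counts):
        """geo_record: dict for one geographic record, keyed by field name.
        pl1_counts: dict mapping logical record number -> tabulated PL1(1)."""
        geo_record["POP100"] = pl1_counts[geo_record["LOGRECNO"]]
        if not geo_record.get("HU100"):       # no value present
            geo_record["HU100"] = 0           # overwrite with zero
        return geo_record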

Use Case: DPP11.0 Hand Off Product Files
Initiating Actor or Event: DPP Production Operator
Other Actors: Product Recipient.
Termination: Handoff material is available.
Description: Creates the Handoff list and makes the set of files that constitutes the Handoffs available. The contents of the Product
Handoffs vary based upon the recipient of the Handoffs.
Notes: Includes: notifications/transmittals, files, documentation, metadata, and means of verifying the source and integrity of the
files.


•   AFF receives:
    -   Summary Files.
    -   Product Documentation File (PDF.)
    -   Accuracy of the Data File (PDF.)
    -   Additional metadata files.
•   ACSD receives:
    -   Summary Files.
Detailed Requirements:
•   Ensure that materials are available.
•   Notify product recipients.

Use Case: DPP12.0 Maintain Production Status
Initiating Actor or Event: DPP Production Status Maintainer. System.
Termination: The DPP Production Status Maintainer and/or System maintains current status of production processing.
Description: The DPP Production Status Maintainer uses the system to maintain the production status.
Associations: All other use cases “uses” this use case to record production status.
Notes:
Examples of how production work will be tracked in geography-u/a-product units (runs) are:
•   “Indiana Unadjusted PL” or
•   “Puerto Rico Adjusted HSF-1” or
•   “US Adjusted HSF-2.”
Processing steps may include:
•   Compose Product Table.
•   Acquire DPP Geography File.
•   Verify DPP Geography File.
•   Acquire HEDF (or SEDF.)
•   Verify HEDF (or SEDF) counts.
•   Build database.
•   Produce Tables for product.*
•   Create Summary File.
•   Perform Summary File matches.
•   Create Review Materials.
•   Hand off Review Materials to POP/HHES.
•   Receive POP/HHES approval of entire geo-u/a-product unit.
•   Hand off product files to ACSD for CD-ROM creation.
•   Quality Check the product, as presented on CD-ROM.
* - “Produce Tables for product” will be further tracked on a table-by-table basis.
Detailed Requirements:
Check production logs and record production processing. The status of each processing step may include:
•   Started (with start date.)
•   Not Started.
•   Completed (with completion date.)
If required, a “Completed” item can be changed to “Restarted.”
Processing steps may be repeated and should be tracked accordingly (see the sketch below).
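
A minimal sketch of one way to track a run's processing steps with the status values listed above; the data structure and function are illustrative assumptions, not the DPP system's actual status-tracking mechanism.

    import datetime

    # One tracked unit (run), e.g. "Indiana Unadjusted PL".
    run = {"unit": "Indiana Unadjusted PL", "steps": {}}

    def record_status(run, step, status):
        """status: 'Not Started', 'Started', 'Completed', or 'Restarted'.
        Steps may be repeated, so every status change is kept."""
        history = run["steps"].setdefault(step, [])
        history.append((status, datetime.date.today()))

    record_status(run, "Build database", "Started")
    record_status(run, "Build database", "Completed")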



Use Case: DPP13.0 Report on Production and Milestones Status

Initiating Actor or Event: DPP Production Status Reporter
Termination: The DPP Production Status Reporter has obtained information about the current status of production processing or
milestones reports.
Description: The DPP Production Status Reporter uses the system to report on production status, which may include status
reporting to upper level management.
Notes:
The use case notes for DPP12.0 Maintain Production Status apply here as well.
Detailed Requirements:
Inquire upon production processing. The status of each processing step may include:
•   Started (with start date.)
•   Not Started.
•   Completed (with completion date.)
If required, a “Completed” item can be changed to “Restarted.”
Processing steps may be repeated and should be reported accordingly.
Produce reports, based on templates provided by the DPP operations staff, showing which logged processing steps were started,
completed, or halted due to error, for defined milestones, by product and state/national level.

Use Case: DPP15.0 Investigate Inputs, Sourcing, Production Processing, and Handoff Trail
Initiating Actor or Event: DPP Production Researcher
Termination: DPP Production Researcher has completed investigation and reported results.
Description: A DPP Production Researcher uses the system to investigate the inputs, sourcing, production processing, and
handoff trail of data in Review Materials and presented in disseminated products. The investigation is performed in response to
questions from data users inside and outside the BOC (for up to 10 years after the product was produced.)
Detailed Requirements:
Access detail processing logs and system programs.
The DPP production researcher will use a DPP program to produce a history of logged operational events by product and state/national.

Use Case: DPP16.0 Perform Ad Hoc Query
Initiating Actor or Event: DPP (or other BOC) Ad Hoc Query User
Termination: The requested tabulation has been performed and results presented.
Description: A DPP (or other BOC) Ad Hoc Query User uses the DPP system to perform ad hoc tabulations of detail data.
Notes:
Logging is not performed for queries of this type.
Detailed Requirements:
Select database.
Compose table.
Execute tabulations.

Use Case: DPP20.0 Create Population Size Code Reference Files
Initiating Actor or Event: DPP Production Operator
Termination: Reference files are created.
Detailed Requirements:
The DPP Production Operator will create reference files using PL results and geography files, according to a specification provided
by the Population Division.

Use Case: DPP21.0 Submit Population Size Code Reference Files for Review
Initiating Actor or Event: DPP Production Operator
Termination: Reference files are available for review.
Detailed Requirements:
POP and HHES will review the reference files, which must be provided to them for this purpose.

                          3. SOFTWARE ARCHITECTURE OF THE DPP SYSTEM
This section of the document describes:
•   the process and data flow of the DPP system itself;
•   which parts of the DPP system implement the various DPP requirements;
•   the major input and output files of the DPP system;
•   the driver files which are used by the DPP system to produce various products; and
•   how COTS software is used in the DPP system.
It answers the question, “What does the DPP system itself do?”

3.1.            Software component chart
The high-level DPP software component diagram is shown here for the convenience of the reader.



[Figure: Data flows from DSCMO (Detail Files & Metadata), GEO (Geography Files), and POP/HHES (Product Specifications &
Metadata) into Create STR Database, Create Geographic Recode, and the manual composition of Cl Definitions and Table
Definitions under Table Composition (STR). These steps produce the Microdata (STR), the Geography Recode, Cl.txt, and the
Template TXD. Tabulation runs against the Tabulation SuperCROSS Server through local-mode and remote-mode catalogues
(the remote-mode catalogue allows a client to access the database), merging TXDs into the Tabulation TXD and producing CSV
output. Verification performs the Internal, Prior-Product, and Analyzer Matches against the Analyzer (Verification) Tabulations
and yields Match Reports. Create Summary File produces the Summary Files, GeoHeader, and AFF Header, which are handed
off to AFF and ACSD. Diagram version: 1.0, 7/2004.]

Figure 1: DPP Software High-level Component Diagram




3.2.      Flowchart of the DPP System
A flowchart of the DPP system, organized by script and stage of script, is contained in a separate file in
the subdirectory ~/Supplemental Materials/additional content/DPPFlow.vsd. It shows process flow, as
well as input and output files and logical branching. It is a Visio chart, and is provided separately in order
to maintain full Visio functionality.

3.3.      Description of major input and output files
There are three major types of input files to the DPP system: Detail data files, DPP Geography Files, and
Independent Tabulation Files. Detail data files are the files which contain the characteristics of people
and housing units from the census operations. DPP Geography Files contain the information which the
DPP system uses to tabulate each housing unit and person into the appropriate summary levels.
Independent Tabulation Files (also called Analyzer Files) contain the results of independent tabulations
which are used for checking against the DPP results.
There is only one major type of output file from the DPP system: the Summary File. “Summary File” is a logical designation; in
fact, a Summary File is a collection of two or more physical files.
3.3.1. Detail data files (HDF, HEDF, SDF, SEDF) (inputs to the DPP system)
Detail files capture the information collected/derived from the Census short and long forms. The data
providers make them available on a flow basis by state. DPP uses the Get script to ftp these files to
/dpp2/ftp/dec. The files are hierarchical, and contain 5 record types and hundreds of fields. The
following diagram shows the hierarchy:
    Block 1 (record type 1)
       Housing Unit 1 in block 1 (record type 2)
             Person 1 in housing unit 1 (record type 3)
             Person 2 in housing unit 1
             …
             Person n in housing unit 1
       Housing Unit 2 in block 1
       Group Quarters 1 in block 1 (record type 4)
             Person 1 in group quarters (record type 5)
             Person 2 in group quarters
             …
             Person n in group quarters
    Block 2
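
A minimal sketch of walking such a hierarchical file, assuming the record-type code (“1” through “5”) is the first character of each line; that field position is an assumption for illustration, so consult the actual record layouts before relying on it.

    RECORD_TYPES = {
        "1": "block",
        "2": "housing unit",
        "3": "housing unit person",
        "4": "group quarters",
        "5": "group quarters person",
    }

    def count_records(path):
        """Tally records by type, tracking the current block as the
        parent context for the records nested beneath it."""
        counts = dict.fromkeys(RECORD_TYPES.values(), 0)
        current_block = None
        with open(path, "r") as detail_file:
            for line in detail_file:
                rectype = line[:1]            # assumed position of record type
                if rectype not in RECORD_TYPES:
                    continue                  # skip blank or unknown lines
                if rectype == "1":
                    current_block = line      # all following records belong here
                counts[RECORD_TYPES[rectype]] += 1
        return counts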

More information on each type of detail data file is provided in the following sections.
3.3.1.1.       100% Detail Files
The 100% data contains information collected/derived from the short form (sent to 5 of 6 households in
the country), plus “short form” questions from the long form (sent to 1 of 6 households in the country).
There are two flavors of the 100% detail file – the Hundred Percent Detail File (HDF) and the Hundred
Percent Edited Detail File (HEDF). The HDF does not include any adjustments, while the HEDF contains
additional GQ and GQ person records used for adjustments to the census. One product was produced
from the HEDF, an Adjusted Redistricting product. All other 100% products were produced from the HDF.
The 100% detail data files range in size from just over 100MB (DC) to almost 7GB (CA). The record
counts for the Census 2000 HDFs are shown in the following table:
STATE                   BLOCKS    HOUSING     HOUSING UNIT   GROUP      GROUP QUARTERS
                                  UNITS       PERSONS        QUARTERS   PERSONS
Alabama                 175220    1963711     4332380        2449       114720
Alaska                  21874     260978      607583         923        19349
Arizona                 158294    2189189     5020782        2669       109850
Arkansas                141178    1173043     2599492        1650       73908
California              533163    12214549    33051894       24679      819754
Colorado                141040    1808037     4198306        2612       102955
Connecticut             53835     1385975     3297626        2153       107939
Delaware                17483     343072      759017         510        24583
District of Columbia    5674      274845      536497         784        35562
Florida                 362499    7302947     15593433       8657       388945
Georgia                 214576    3281737     7952631        4010       233822
Hawaii                  18990     460542      1175755        1416       35782
Idaho                   88452     527824      1262457        860        31496
Illinois                366137    4885615     12097512       6125       321781
Indiana                 201321    2532319     5902331        3901       178154
Iowa                    168075    1232511     2822155        2631       104169
Kansas                  173107    1131200     2606468        2037       81950
Kentucky                122141    1750927     3926965        2739       114804
Louisiana               139867    1847181     4333011        3369       135965
Maine                   56893     651901      1240011        1474       34912
Maryland                79128     2145283     5162430        3852       134056
Massachusetts           109997    2621989     6127881        5203       221216
Michigan                258925    4234279     9688555        7863       249889
Minnesota               200222    2065946     4783596        4195       135883
Mississippi             136150    1161953     2749244        2056       95414
Missouri                241532    2442017     5433153        3834       162058
Montana                 99018     412633      877433         911        24762
Nebraska                133692    722668      1660445        1535       50818
Nevada                  60831     827457      1964582        827        33675
New Hampshire           34728     547024      1200247        1015       35539
New Jersey              141342    3310275     8219529        5559       194821
New Mexico              137055    780579      1782739        1188       36307
New York                298506    7679307     18395996       12338      580461
North Carolina          232403    3523944     7795432        5527       253881
North Dakota            84351     289677      618569         649        23631
Ohio                    277807    4783051     11054019       6543       299121
Oklahoma                176064    1514400     3338279        2290       112375
Oregon                  156232    1452709     3343908        2960       77491
Pennsylvania            322424    5249750     11847753       12489      433301
Rhode Island            21023     439837      1009503        752        38816
South Carolina          143919    1753670     3876975        2844       135037
South Dakota            77951     323208      726426         826        28418
Tennessee               182203    2439443     5541337        3231       147946
Texas                   675062    8157575     20290711       10965      561109
Utah                    74704     768594      2192689        1220       40480
Vermont                 24824     294382      588067         573        20760
Virginia                145399    2904192     6847117        3898       231398
Washington              170871    2451075     5757739        4086       136382
West Virginia           81788     844623      1765197        1179       43147
Wisconsin               200348    2321144     5207717        4098       155958
Wyoming                 67264     223854      479699         519        14083
Puerto Rico             56781     1418476     3761836        1613       46774
Table 2: Record counts for HDF files

The HDF and HEDF record layouts are identical and are explained in the following list:
•   The block records are record type 1, and they contain 63 fields worth of geographic attributes. Examples are the key
    fields (state, county, tract, block) and other fields like land area, place code, and Consolidated Metropolitan
    Statistical Area code.
•   The housing unit (HU) records are record type 2, and they contain 40 fields worth of housing unit attributes.
•   The housing unit person records are record type 3, and they contain 77 fields worth of housing unit person attributes.
•   The group quarters (GQ) records are record type 4, and they contain 7 fields worth of group quarters attributes.
•   The group quarters person records are record type 5, and they contain 88 fields worth of group quarters person attributes.
3.3.1.2.       Sample Detail Files
The sample data contains information collected/derived from the long form (sent to 1 of 6 households in
the country). It’s much richer in content than the hundred percent data.
There are two flavors of the sample detail files – the Sample Detail File (SDF) and the Sample Edited
Detail File (SEDF). The SDF is a preliminary version of the sample file and was used to produce the
DSMD product. The SEDF is the final version of the sample detail file and was used to produce all the
publicly released Sample products (e.g., SF3, SF4, AIAN).
The sample detail data files range in size from about 40MB (DC) to 2.5GB (CA). The record counts for the
Census 2000 SEDFs are shown in the following table:
                                                                                                  GROUP
                                                     HOUSING            HOUSING UNIT   GROUP      QUARTERS
              STATE                 BLOCKS           UNITS              PERSONS        QUARTERS   PERSONS
          Alabama                   175220           304848             681047         1865       12201
          Alaska                    21874            56687              125170         606        2242


         Arizona                   158294           293222             676386    1814    11158
         Arkansas                  141178           232338             512665    1211    6624
         California                533163           1619612            4352820   15672   81535
         Colorado                  141040           278009             632054    1890    12384
         Connecticut               53835            191427             456708    1666    13703
         Delaware                  17483            52031              108122    372     2160
         District of Columbia      5674             35054              69587     488     2677
         Florida                   362499           905954             1911836   6105    40100
         Georgia                   214576           449759             1088960   3080    24538
         Hawaii                    18990            73018              184636    925     4421
         Idaho                     88452            97806              226265    663     3769
         Illinois                  366137           751457             1857398   4895    43092
         Indiana                   201321           375080             881056    3000    22982
         Iowa                      168075           266044             606045    2070    13426
         Kansas                    173107           211991             482121    1580    9970
         Kentucky                  122141           293026             656364    2028    13905
         Louisiana                 139867           288832             676869    2382    14961
         Maine                     56893            157542             272815    1054    5049
         Maryland                  79128            288569             694751    2371    13019
         Massachusetts             109997           355013             824156    3849    25785
         Michigan                  258925           820423             1779024   5602    28225
         Minnesota                 200222           484259             1090591   3244    17864
         Mississippi               136150           193170             458756    1501    10830
         Missouri                  241532           457135             1003880   3065    19695
         Montana                   99018            98809              204277    646     3180
         Nebraska                  133692           167824             380848    1195    6654
         Nevada                    60831            99905              236189    607     3541
         New Hampshire             34728            102169             208824    775     4192
         New Jersey                141342           459790             1133490   3771    21248
         New Mexico                137055           120314             272632    805     3951
         New York                  298506           1156242            2714664   9019    68089
         North Carolina            232403           562133             1214120   4311    30209
         North Dakota              84351            75941              159643    523     3320
         Ohio                      277807           758606             1758304   5167    34799
         Oklahoma                  176064           320872             700366    1804    13260
         Oregon                    156232           222812             510488    2138    8873

          Pennsylvania              322424           954861             2134198   8634        49573
          Rhode Island              21023            57857              132235    534         3029
          South Carolina            143919           263758             580934    2147        15636
          South Dakota              77951            81059              179593    632         3943
          Tennessee                 182203           356624             805026    2476        17794
          Texas                     675062           1239057            3042214   7766        54453
          Utah                      74704            129104             363610    863         5027
          Vermont                   24824            83581              157656    509         2747
          Virginia                  145399           410411             955713    3088        24850
          Washington                170871           358613             838331    2709        15276
          West Virginia             81788            153927             320657    897         4771
          Wisconsin                 200348           536352             1177467   3025        19587
          Wyoming                   67264            42547              91617     363         1954
          Puerto Rico               56781            221186             582905    1202        4308
Table 3: Record counts for SEDFs

The SDF and SEDF record layouts are identical and are explained in the following list:
       •   For block records, please refer to the 100% description.
       •   The housing unit (HU) records are record type 2, and they contain 52 fields of housing unit
           attributes.
       •   The housing unit person (HUP) records are record type 3, and they contain 234 fields of
           housing unit person attributes.
       •   The group quarters (GQ) records are record type 4, and they contain 15 fields of group
           quarters attributes.
       •   The group quarters person (GQP) records are record type 5, and they also contain 234 fields
           of group quarters person attributes. The record layout is identical to record type 3.
There were also two supplemental sample detail files. These files are also hierarchical (but do not
contain block records) and contain many additional recodes that are necessary for tabulations. The first
supplemental file was merged with the SEDF during detail data preparation and used to produce sample
products like SF3 and SF4. The second supplemental file was also merged with the SEDF and was used
to produce the School Districts products.
3.3.2. Geography files (inputs)
DPP prepares a specification for each product’s geography files (format attached). We send this
specification to GEO, who prepares the files with the correct geographies, in product sort/nest order, with
the appropriate fields filled. GEO makes them available on a flow basis by state/US, and drops them off
in /dpp2/ftp/geo. The files are pipe (“|”) delimited, flat, non-hierarchical, and contain nearly one
hundred fields.
There are separate deliveries for blocks and for the state/US level geographies, as explained below.
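To make the format concrete, the sketch below reads one pipe-delimited GEO delivery into per-record
dictionaries. The function name and the idea of caller-supplied field names are ours; the real files carry
nearly one hundred fields whose order is fixed by the geography specification.

    # Illustrative reader for one flat, pipe-delimited GEO delivery file.
    import csv

    def read_geo_file(path, field_names):
        with open(path, newline="") as geo_file:
            for row in csv.reader(geo_file, delimiter="|"):
                # one dict per geography record, keyed by the spec's fields
                yield dict(zip(field_names, row))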
Block Files
Block files were delivered just a few times and reused for later products. Sometimes blocks appear in the
final product (e.g., PL, uSF1), but most of the time they’re used to build the geography recodes and detail
databases and do not appear in the final product. The following table shows the basic product name and
the set of blocks that were used to create the product:

            Basic Product              Source of blocks
          PL                           PL blocks
          SF1                          Each SF1 product (SF1, SF1A, SF1F,
                                       SF1UR) used its own block delivery
          SF2                          SF2 and SF2A used the SF1 delivery;
                                       SF2F used the SF1F delivery
          PF3                          SF1 delivery
          SF3                          SF1F delivery
          SF4                          SF1F delivery
          CD108                        CD108 delivery
          SD                           SF1F delivery
          AIAN                         SF1F delivery
          Table 4: Source of blocks for each basic product

Block files were delivered several times. For the PL and SF1 state products, in which block was a
summary level, the blocks were delivered within the summary level geography delivery rather than
separately. Across deliveries, the number of block records did not change, but the content of the fields on
individual block records did change, and so did the order of the blocks. Block order was important
because the DPP system matches block records to detail file records (to transfer geographic codes)
before it builds a database.
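Conceptually, the match is a single sequential pass that works only because both inputs arrive in the
same block order. The sketch below is our simplification with hypothetical field names, not the actual
DPP code:

    def transfer_geo_codes(blocks, detail_block_records):
        # Both streams are in identical block order, so records pair up 1:1.
        for block, record in zip(blocks, detail_block_records):
            record["geo_codes"] = block["geo_codes"]   # hypothetical field
            yield record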
Block records represent the lowest common denominator geography (blocks), so the greatest number of
fields will be filled for blocks. For example, every block has a valid FIPS State code, FIPS County Code,
Census Tract, and Block Suffix. These fields are null for an MSA record.
State and National Level Files
Geographies for a state-level product are delivered by state, and geographies for a national-level product
are delivered in a single US file. One exception to this rule is the PF3 product, where a GEO delivery was
not required. For PF3, we used the SF1 geography file set (blocks and other geographies) to create a
new set of inputs.
These files contain a record for every geography in every summary level plus geocomponent in the
product. For example, the CD108 product for Indiana contains the following geographies:
            CD108
           Summary
             Level            Geocomponent             Meaning                          Count
          040                 00                       State                            1
          500                 00,01,43,52-             State-Congressional District     9 (per geo
                              59,64-                   (108th)                          component)
                              71,84,89-95
          510                 00                       State-Congressional District     106
                                                       (108th) – County
          511                 00                       State-Congressional District     1500
                                                       (108th) – County – Census
                                                       Tract
          521                 00                       State-Congressional District     1034
                                                       (108th) – County – County
                                                       Subdivision
          531                 00                       State-Congressional District     637
                                                       (108th) – Place/Remainder
          541                 00                       State-Congressional District     3
                                                       (108th) – Consolidated City
          542                 00                       State-Congressional District     17
                                                       (108th) – Consolidated City –
                                                       Place Within Consolidated City




          550                 00                       State-Congressional District     1
                                                       (108th) – American Indian
                                                       Area/Alaska Native
                                                       Area/Hawaiian Home Land
          551                 00                       State-Congressional District     1
                                                       (108th) – American Indian
                                                       Area/Alaska Native Area
                                                       (Reservation or Statistical
                                                       Entity Only)
          552                 00                       State-Congressional District     0
                                                       (108th) – American Indian
                                                       Area/Alaska Native Area (Off-
                                                       Reservation Trust Land
                                                       Only)/Hawaiian Home Land
          553                 00                       State-Congressional District     1
                                                       (108th) – American Indian
                                                       Area/Alaska Native
                                                       Area/Hawaiian Home Land –
                                                       Tribal Subdivision/Remainder
          554                 00                       State-Congressional District     1
                                                       (108th) – American Indian
                                                       Area/Alaska Native Area
                                                       (Reservation or Statistical
                                                       Entity Only) – Tribal
                                                       Subdivision/Remainder
          555                 00                       State-Congressional District     0
                                                       (108th) – American Indian
                                                       Area/Alaska Native Area (Off-
                                                       Reservation Trust Land Only) –
                                                       Tribal Subdivision/Remainder
          560                 00                       State-Congressional District     0
                                                       (108th) – Alaska Native
                                                       Regional Corporation
          Table 5: CD108 summary levels for Indiana

3.3.3. Independent tabulations (inputs)
One of the quality checks which was incorporated into the DPP process was to compare the results of the
DPP system to the results of an independently programmed tabulation system. That independent
tabulation system was programmed by a different staff, using different computer hardware, a different
operating system, and different software.
Five sets of independent tabulations were performed by DSCMO, for each of the states, and delivered to
DPP. The DPP system matched each independently tabulated cell to one or more corresponding cells in
a DPP product, according to instructions contained in driver files. The match was performed for all
geographic summary levels for which independent tabulations were available. Tab stage 5100 performed
the match and prepared a report showing the results of the match. All differences were resolved.
The independent tabulation files were flat, ASCII files.
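Conceptually, the stage 5100 match reduces to the comparison sketched below. The dictionary-based
cell addressing and the cell_map argument are our simplifications of the instructions carried in the driver
files:

    def compare_tabulations(independent_cells, product_cells, cell_map):
        # cell_map: independent cell address -> corresponding product cells
        differences = []
        for address, independent_value in independent_cells.items():
            product_value = sum(product_cells[c] for c in cell_map[address])
            if product_value != independent_value:
                differences.append((address, independent_value, product_value))
        return differences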
The first and second independent tabulations were tabulations of data in the HDF and the HEDF. They
were called the 100% Analyzer tabulations. Their format is described to the DPP system in the driver files
HDFAnalyzerTableInfo.txt and HEDFAnalyzerTableInfo.txt.
The third and fourth independent tabulations were tabulations of data in the SDF and the SEDF. They
were called the Sample Analyzer tabulations. Their format is described to the DPP system in the driver
files SDFAnalyzerTableInfo.txt and SEDFAnalyzerTableInfo.txt.
The fifth independent tabulation was a tabulation of data from the SEDF combined with data from the
SEDF School District Supplement. They were called the Independent tabulations. Their format is
described to the DPP system in the driver files IndCHTableInfo.txt, IndCOTableInfo.txt,
IndCPTableInfo.txt, IndHCTableInfo.txt, IndPCTableInfo.txt, and IndTTTableInfo.txt.
3.3.4. Summary Files (outputs)
The ultimate output of the DPP system is a product consisting of a set of files known as Summary Files.
The format of the Summary Files is described in this memo:
     Details of the Construction of the 2000 Decennial Summary Files Revision 3
     File creation date: 05/20/2004. Filename: DetailsR3-draft.wpd
The content of the memo is included here for the convenience of the reader:


General Principles in Constructing the 2000 Decennial Summary Files
The naming convention used for the summary files identified each file uniquely, and fields within each file also identified it uniquely.
         This included a character in the FILEID to indicate whether the data tabulated were adjusted or unadjusted (u/a), and a character to
         indicate product revisions (if necessary) after initial public release. Some of the summary data products were ‘released’ several
         times, for different geographic sets, i.e., there were both state and national (US) releases of some products. All products released
         officially were tabulated from unadjusted data. One product, the Redistricting product, was also tabulated from adjusted data and
         ‘released’ because of a court order.
Each summary data product release consisted of a ‘set’ of files for that {state/US | u/a} version, containing the ~400-character geoheader and
        the product data, in matrix/cell order.
          FILEID and STUSAB, on both the geoheader file and on the data files, were filled according to the chart included later in this
                   document, and according to the {state/US | u/a} contents.
          One file in each ‘set’ contained the geoheaders for the appropriate summary levels, in product sequence; the other files in the set
                     were data files and contained up to 250 data cells each (enabling users to import the data into COTS packages).
          Each data file contained the following subset of fields from the geoheader, plus a 2-character file sequence number (CIFSN). This
                   made them uniquely identifiable, and link-able to the information in the geoheader file in that set:
                                FILEID, STUSAB, CHARITER, CIFSN, LOGRECNO
          For all products (including the ‘iterated’ products SF2, SF4, AIAN, and the School District products), the geoheader file occurred
                     once for each {state/US | u/a} release, with CHARITER zero-filled.
If a revision had been issued to a set of Summary Files, we would have indicated it in the filename.
Filenames were assigned to identify the contents of the file within an 8.3 (ffffffff.ext) name, like this:
          Public files:
                     for the PUBLIC GEOHEADER file:
                                                                 <POSTAL-STUSAB>geo<rev>.<3-Ch-FILEID>
                     for the data files:
                                                                 <POSTAL-STUSAB><CHARITER><CIFSN><rev>.<3-Ch-FILEID>
          Internal-to-Census files:
                     for the INTERNAL GEOHEADER:
                                                                 <POSTAL-STUSAB>igeo<rev>.<3-Ch-FILEID>
                     for the file for AFF relating LOGRECNO to GEOID:
                                                                 <POSTAL-STUSAB>aff<rev>.<3-Ch-FILEID>
          The codes used in the filenames were:
                     <POSTAL-STUSAB> represented the postal (alphabetic) equivalent of the STUSAB
                     ‘<rev>’ stood for an optional 1-character alphabetic revision indicator that was absent in the initial file release, and would
                                have been ‘a’ for the first revision, ‘b’ for the second revision, etc.
                     ‘<3-Ch-FILEID>’ stood for a 3-character version of the FILEID. The actual codes used are listed later in this document in
                               the ‘Codes in the data files...’ section.
          In the filename (only), we dropped blanks from FILEID.
          Alphabetic characters in the filenames were lower-case, with the exception of iteration codes in the AIAN product.
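As a worked illustration of the convention above (the function names are ours; the convention is the
memo's): for the uSF1 Alabama release, the geoheader file is algeo.uf1, and calling
public_data_name("AL", "000", "01", "uf1") yields al00001.uf1 for the first data segment.

    def public_geoheader_name(postal_stusab, fileid3, rev=""):
        # <POSTAL-STUSAB>geo<rev>.<3-Ch-FILEID>, lower-case per the memo
        return f"{postal_stusab.lower()}geo{rev}.{fileid3}"

    def public_data_name(postal_stusab, chariter, cifsn, fileid3, rev=""):
        # <POSTAL-STUSAB><CHARITER><CIFSN><rev>.<3-Ch-FILEID>
        return f"{postal_stusab.lower()}{chariter}{cifsn}{rev}.{fileid3}"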
CHARITER was assigned to each file this way:

          On the geoheader file, in all products, CHARITER contained zeros.
          On the data files for non-iterated products like PL, SF1, and SF3, CHARITER contained zeros.
          On the data files for iterated products like SF2 and SF4, CHARITER contained the characteristic iteration number assigned by POP,
                   right-justified with leading zeros.
LOGRECNO was assigned to each record in a product release, as a sequential number, starting at 0000001 for the first geographic entity.
      LOGRECNO was assigned consistently to records for that same geographic entity in each of the data files in the product release, so
      that a user could link all the data for each geographic entity.
          If a particular data file in a release set contained no data for a particular geographic entity, that
          FILEID/STUSAB/CHARITER/CIFSN/LOGRECNO was not present in that data file.
Data file partitions were determined based on POP’s matrix definition and these rules:
          A matrix did not span two data files unless it was too big to fit in one.
          If a matrix would not fit in one data file, it was split at a logical cell.
          Data for P and H matrices were not in the same data file as data for PCT and/or HCT matrices.
The coverage of all the summary data products themselves included all geographic entities defined for the product, including those in which
         there is no population and/or no housing units. So there will be some geographies in each geoheader file for which there are no
         data in any of the data files.
Geoheader files were ASCII files containing fixed length fields.
          Some of the names of geographic entities in the geoheader (for PR only) contained diacritical marks, like these:
                 á      Á      é    É     ú     Ú     ü     í    Í              ó      Ó
Summary Data files were ASCII files with variable length fields, delimited by commas (‘,’).
          The data fields were numeric.
          Positive data was unsigned; negative data was preceded by a ‘-’ character.
          Some data contained explicit decimal places, which was described in the product documentation. Most data with explicit decimals
                  was actual data (like median age 43.6) but some had special documented meanings (for instance, median age value
                  ‘115.1’ meant ‘over 115’ in some matrices).
          No data contained implied decimal places.
          The largest number contained in any data field was 14 digits (i.e. ‘99999999999999’).
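To make the linking rules concrete, the sketch below indexes one comma-delimited data file by
LOGRECNO so its records can be joined back to the geoheader file of the same release set. The
function name is ours, and the fixed-width geoheader parsing is omitted because its field offsets are
product-specific.

    import csv

    def index_data_file(path):
        cells_by_logrecno = {}
        with open(path, newline="") as data_file:
            for row in csv.reader(data_file):
                # fields 1-5 are FILEID, STUSAB, CHARITER, CIFSN, LOGRECNO
                cells_by_logrecno[row[4]] = row[5:]   # up to 250 data cells
        return cells_by_logrecno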




                                          Codes in the Data Files and Filenames for various Product Releases

                                                                                  3-Ch      FILEID      STUSAB            CHARITER        CIFSN
                                                                                  FILEID    (6 chars)   (2 chars)         (3 chars)       (2 chars)
 PL
       ~ Unadjusted PL Summary Files, for 52 states                            upl       ‘uPL’ ‘AL’ - ‘WY’           ‘000'          ‘01'-’02'
       ~ Adjusted PL Summary Files, for 52 states                              apl       ‘aPL’ ‘AL’ - ‘WY’           ‘000'          ‘01'-’02'
SF1
       ~ SF1 Summary Files (unadjusted), for 52 states                         uf1       ‘uSF1’ ‘AL’ - ‘WY’           ‘000'            ‘01'-’39'
       ~ SF1 Summary Files (unadjusted), US w/o urban/rural (“advance”)        u1        ‘uSF1A’ ‘US’                  ‘000'            ‘01'-’39'
       ~ SF1 Summary Files (unadjusted), US with urban/rural (“final”)         uf1       ‘uSF1F’ ‘US’                  ‘000'            ‘01'-’39'
SF1 Supplement      (urban / rural, P2 & H2)

       ~ SF1 Supplement Summary Files (unadjusted), for 52 states              ur1       ‘uSF1UR’ ‘AL’ - ‘WY’           ‘000'            ‘01'
SF2
      ~ SF2 Summary Files (unadjusted), for 52 states                          uf2       ‘uSF2’ ‘AL’ - ‘WY’           ‘000'-‘463'      ‘01'-’04'
      ~ SF2 Summary Files (unadjusted), US w/o urban/rural (“advance”)         u2        ‘uSF2A’ ‘US’                  ‘000'-’463'      ‘01'-’04'
      ~ SF2 Summary Files (unadjusted), US with urban/rural (“final”)          uf2       ‘uSF2F’ ‘US’                  ‘000'-’463'      ‘01'-’04'
SF2 Supplement  (PCT 37)

       ~ SF2 Supplement Summary Files (unadjusted), for 52 states              sf2       ‘uSF2’ ‘AL’ - ‘WY’           ‘000'-‘463'      ‘01'
SF3
       ~ SF3 Summary Files (unadjusted), for 52 states                         uf3       ‘uSF3’ ‘AL’ - ‘WY’           ‘000'            ‘01'-’76'
       ~ SF3 Summary Files (unadjusted), US (with urban/rural)                 uf3       ‘uSF3’ ‘US’                  ‘000'            ‘01'-’76'
SF4
       ~ SF4 Summary Files (unadjusted), for 52 states                         uf4       ‘uSF4’ ‘AL’ - ‘WY’           ‘000'-’585'      ‘01'-’38'
       ~ SF4 Summary Files (unadjusted), US (with urban/rural)                 uf4       ‘uSF4’ ‘US’                  ‘000'-’585'      ‘01'-’38'
108th Congressional District

       ~ 108th CD 100-Percent Summary Files (unadjusted), for 52 states        h08       ‘u108_H’    ‘AL’ - ‘WY’        ‘000'         ‘01'-’39'
       ~ 108th CD Sample Summary Files (unadjusted), for 52 states             s08       ‘u108_S’    ‘AL’ - ‘WY’        ‘000'         ‘01'-’76'
AIAN

       ~ American Indian Alaska Native Summary Files (unadj), US (with urban/rural)
                                                                           ai1           ‘AIAN’    ‘US’               ‘001'-’74D' ‘01'-’38'
School District Tabulations

       ~ School District Summary Files, Total (unadjusted), US                 sd0       ‘SDTT’             ‘US’      ‘000'          ‘01'-‘80'
       ~ School District Summary Files, Children’s Own (unadjusted), US        sd1       ‘SDCO’             ‘US’      ‘001'-‘006'    ‘01'-‘80'
       ~ School District Summary Files, H-holds w Chldrn (unadj), US           sd2       ‘SDHC’            ‘US’      ‘001'-‘006'    ‘01'-‘80'
       ~ School District Summary Files, Parents with Children (unadj), US      sd3       ‘SDPC’             ‘US’      ‘001'-‘006'    ‘01'-‘80'
       ~ School District Summary Files, Children’s H-holds (unadj), US         sd4       ‘SDCH’             ‘US’      ‘000'-‘006'    ‘01'-‘80'
       ~ School District Summary Files, Children’s Parents (unadj), US         sd5       ‘SDCP’             ‘US’      ‘001'-‘006'    ‘01'-‘80'
       ~SDCO Supplemental Race/Ethnicity File (unadjusted), US                 ss1       ‘SDCOSS’             ‘US’      ‘011'-‘146'    ‘01'-‘02'
109th Congressional District

      ~ 109th CD 100-Percent Summary Files (unadjusted), for 52 states         h09       ‘u109_H’ ‘AL’ - ‘WY’           ‘000'         ‘01'-’39'
      ~ 109th CD Sample Summary Files (unadjusted), for 52 states              s09       ‘u109_S’ ‘AL’ - ‘WY’           ‘000'         ‘01'-’76'
110th Congressional District

       ~ 110th CD 100-Percent Summary Files (unadjusted), for 52 states        h10       ‘u110_H’ ‘AL’ - ‘WY’           ‘000'         ‘01'-’39'
       ~ 110th CD Sample Summary Files (unadjusted), for 52 states             s10       ‘u110_S’ ‘AL’ - ‘WY’           ‘000'         ‘01'-’76'

       Figure 2: Codes in the Data Files and Filenames for various Product Releases (attachment to Details of the Construction of the
       2000 Decennial Summary Files memorandum)



                                  Number of Geographic Entities in state-level Census 2000 products

                                            PL           SF1    SF1 Supplement     SF2 & SF2 Suppl.      SF3       SF4    108th Congress
                                   (.upl, .apl)       (.uf1)            (.ur1)         (.uf2, .sf2)   (.uf3)    (.uf4)      (.h08, .s08)

AL - Alabama                          203680        197569           383313             7211     28033      7616         2463

AK - Alaska                            28156          30484           54032             3595      9625      3527         1443

AZ - Arizona                          175716        176444           342565             5460     21741      5757         2260

AR - Arkansas                         169974        164530           317832             8227     29677      7859         2667

CA - California                       584824        631670          1209477           28619      117891     29964       11966

CO - Colorado                         170522        158818           307951             5551     21886      5986         2006

CT - Connecticut                       64452          64643          123427             3291     13568      3788         1326

DE - Delaware                          21727          20302           39189              987      3758      1180             335

DC - District of Columbia               6547           7423           13766              561      2056       563             219

FL - Florida                          388957        410183           797280           14963      61146      16072        6095

GA - Georgia                          249693        242366           470488             9040     35853      10256        3823

HI - Hawaii                            23726          24101           45035             1733      6468      1917             777

ID - Idaho                             99140          95277          186676             2327      8692      2674             790

IL - Illinois                         475862        430857           829274           21289      81130      22145        7357

IN - Indiana                          246366        232049           448680           10165      39150      11102        3545

IA - Iowa                             198001        198307           379834           11235      37397      10914        3676

KS - Kansas                           203524        196840           379938             7871     29120      7418         3170

KY - Kentucky                         133287        141621           272324             6187     24438      7101         2274

LA - Louisiana                        178245        165857           319238             8836     34231      9870         2554

ME - Maine                             62332          64623          124381             2668      9524      3000         1102

MD - Maryland                         101055          98763          187298             6283     24841      6886         2452

MA - Massachusetts                    132920        130064           249179             6019     24846      6873         2340

MI - Michigan                         308725        303623           582643           14307      55056      15947        5542

MN - Minnesota                        236944        238104           454177           13414      46181      14215        5413

MS - Mississippi                      161250        153467           299546             5508     23188      6198         1615

MO - Missouri                         287518        279300           539294           12513      47379      13118        4194

MT - Montana                          103223        106704           208465             2746      9325      2910             899

NE - Nebraska                         155967        152215           293404             6558     22475      6647         2505

NV - Nevada                            71843          67537          131077             2202      8130      2335             912

NH - New Hampshire                     37973          39620           76386             1655      6285      1972             667

NJ - New Jersey                       187308        167849           320097             8003     32144      9310         3619

NM - New Mexico                       152252        147187           288456             3381     12628      3522         1275

NY - New York                         405299        365908           692515           21046      80605      22816        8543


 NC - North Carolina                    274235          271152             525408              12537            51063              13262             4375

 ND - North Dakota                          96249        99656             189478                2997           18113               2864             2518

 OH - Ohio                              311643          332716             637680              17951            69387              19531             6507

 OK - Oklahoma                          203262          202095             386442              10237            30493              10547             2439

 OR - Oregon                            164246          170380             333521                4443           18043               4915             1587

 PA - Pennsylvania                      395771          376355             719528              18897            66930              22299             8035

 RI - Rhode Island                          26032        24159              46617                  887           3907               1031                377

 SC - South Carolina                    169032          160171             312577                5087           21083               5674             1885

 SD - South Dakota                          92051        91761             174263                4175           16294               3996             2182

 TN - Tennessee                         212667          204050             396797                6737           27866               7562             2620

 TX - Texas                             773915          750624            1460789              22970            93640              24968             8560

 UT - Utah                                  91255        84551             163621                3296           12190               3511             1082

 VT - Vermont                               27280        28589              54714                1352            4682               1559                544

 VA - Virginia                          172515          171531             329129                8080           32972               8990             3141

 WA - Washington                        229491          195665             378325                7672           30771               8154             2687

 WV - West Virginia                     101076           94161             182584                3667           16390               4142             1131

 WI - Wisconsin                         223037          234503             451231              11456            43027              12523             4362

 WY - Wyoming                               73364        71266             140240                1362            5121               1409                454

 PR - Puerto Rico                           83356        73625             139300                6462           23820               8218             2130



    52 state total
                                       9747485         9541315           18389481             413716          1594259             446613          156440
Table 6: Number of Geographic Entities in state-level Census 2000 products (attachment to Details of the Construction of the 2000
Decennial Summary Files memorandum)



                                      Number of Geographic Entities in US-level Census 2000 products

         SF1 Advance      SF1 Final        SF2 Advance      SF2 Final        SF3             SF4             AIAN           School Districts
         National         National         National         National         (.uf3)          (.uf4)          (.ai1)         (.sd0, .sd1, .sd2, .sd3,
         (.u1)            (.uf1)           (.u2)            (.uf2)                                                          .sd4, .sd5, .ss1)

  US             225995            497515            160658             264919            487093             250736               862             18328
Table 7: Number of Geographic Entities in US-level Census 2000 products (attachment to Details of the Construction of the 2000
Decennial Summary Files memorandum)




3.4.      Driver files
3.4.1. How Driver Files are used in the DPP system
Driver files are used to parameterize the DPP system. They allow analysts to execute and customize the
DPP system without programming. In some DPP documentation, drivers are called “operational material”
and the act of editing them is called “operational programming.” The DPP Operations team creates a
complete set of operational material (including TXDs and recodes, in addition to driver files) to fully define
a DPP product.
The introduction of new products may require the creation of new drivers, or new columns may be added
to existing driver files. Driver files are pipe-delimited (|) text files. Over time, fields may become obsolete,
but for backward compatibility, these fields are not removed. Historically, when new fields are added,
older driver files are not updated. This can be confusing, and running a new DEV build against an older
set of drivers may fail. For this reason, the DPPinstallations.txt driver file pairs valid DEV build/OPS build
combinations under a single number called the build ID.
From a programming perspective, developers should interact with the driver files through the include files
(discussed in the COTS section). This code provides a level of encapsulation from the formatting details
of the files.
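The sketch below illustrates the kind of encapsulation the include files provide; it is our stand-in, not the
actual include-file code:

    def read_driver(path):
        with open(path) as driver:
            for line in driver:
                line = line.rstrip("\n")
                if line:                       # skip blank lines
                    yield line.split("|")      # obsolete columns ride along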
Some driver files are used by all products, most notably Products.txt. Most driver files, however, are
unique to a product – for example, the coverage driver file. Our notation is to refer to these product-
specific driver files as <Product><DriverFileName>.txt, where <Product> must be one of the entries in
column 1 of Products.txt. For example, <Product>TableInfo.txt could stand for uPLTableInfo.txt or
AIANTableInfo.txt.
From the official OPS release directory $DPPopsbld/drivers, users can copy the driver files into their
work environment and modify them to customize the behavior of the DPP system in many useful ways
(such as waves), as described elsewhere. Here are some simple examples:
                           To Do This                                             Edit this driver file
Run the system in table waves or run a subset of the tables in a             TableInfo.txt
product
Run the system in iterations waves or run a subset of the                    Iterations.txt
iterations in a product
Run a mini-US consisting of a few states and the US                          Coverage.txt
Table 8: Examples of Driver File Customization

3.4.2. How Driver Files are searched for in the DPP system
Since a single user environment may have multiple copies of the driver files, the DPP system locates
driver files using the following rule (sketched in code after the list):
       •   For scripts executed with the –w option (wave), driver files are searched for first in the user-
           supplied wave directory; failing that, the $DPPdrivers directory is searched. The search ends as
           soon as the requested driver file is found. (This behavior is similar to the Korn shell CDPATH
           environment variable, which defines a colon-separated list of directories to search when
           resolving relative directories with the UNIX cd command.)
       •   For scripts executed without the –w option, driver files are searched for only in the $DPPdrivers
           directory.
       •   In all cases, failure to locate a driver file is a fatal error, and the system terminates with an
           error message.
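The rule can be restated in a few lines. This is a sketch with our function name; the production scripts
are Korn shell, not Python:

    import os, sys

    def locate_driver(name, wave_dir=None):
        # With -w, try the wave directory first, then $DPPdrivers;
        # without -w, only $DPPdrivers. A miss is fatal.
        search_path = ([wave_dir] if wave_dir else []) + [os.environ["DPPdrivers"]]
        for directory in search_path:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate               # first hit wins
        sys.exit(f"FATAL: driver file {name} not found in {search_path}")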
These details are transparent to the user of the DPP system. Supporting a search path means the wave
directory only needs to contain files specific to that wave. Files common to all waves should reside in
$DPPdrivers.



3.4.3. Inventory of driver files
The following table lists all of the driver files in the DPP system. In all cases, <Product> refers to the
official name of a DPP product, and corresponds to an entry in column 1 of Products.txt. Not all products
require all driver files.
                    Driver File                                                         Role
Products.txt                                             Product registry file
PriorProduct.txt                                         List of product/prior-product pairs and summary levels that
                                                         should be matched between products by Tab stage 5300.
                                                         Each match entry requires a corresponding <Product>-
                                                         <PriorProduct>_map.txt driver file
<Product>Coverage.txt                                    List of states in a product, which may include the United
                                                         States (US), Puerto Rico (PR), and the District of Columbia
                                                         (DC)
<Product>SequenceChart.txt                               List of valid summary level/geocomponent combinations in
                                                         this product; each pair is marked as “in state product only”
                                                         (SO), “in US product only” (NO), or appears in “both state
                                                         and US product” (NS)
<Product>GeoContent.txt                                  Basic information about the geographic content of the
                                                         product.
<Product>GeoIDInfo.txt                                   Rules to determine which blocks belong to which summary
                                                         levels; also contains the structure of the geographic
                                                         hierarchy as defined in the Summary Level Sequence Chart.
<Product>GeoLCDSumLev.txt                                n/a
<Product>GeoTractSumLev.txt                              n/a
<Product>TableInfo.txt                                   List of tables in product, number of cells, summary file
                                                         segment assignments, and processing instructions (e.g.,
                                                         median tables, split tables, etc.)
<Product><State>TableInfo.txt                            State-specific version of TableInfo.txt. Now obsolete; the
                                                         Tab script was updated to automatically look for state-
                                                         specific TXDs when creating the production TXD
<Product>TableInfoForMatches.txt                         Same as TableInfo.txt, but presents an image of the official
                                                         product version of the tables (with no combined tables). If not
                                                         present, the system uses TableInfo.txt
<Product>TableInfoForHandoff.txt                         n/a
<Product>Handoff.txt                                     n/a
<Product>Iterations.txt                                  List of iterations in product
<Product>IterationDBLogic.txt                            n/a
<Product>IterationsForHandoff.txt                        List of iterations that Handoff script will deliver.
<Product>IterationsForSIPHC.txt                          n/a
<Product>ReportAddresses.txt                             List of email addresses to receive output from the Status script
<Product>_Rollup_HighLevel.txt                           Used by the VerifyRollup program - list of summary
                                                         level/geocomponent pairs to verify at a high level (e.g., all
                                                         05000 summary levels - County - sum to the value of all
                                                         04000 summary levels - State). Not all summary levels roll
                                                         up naturally to their parent (because of thresholding or part-
                                                         of relationships).
<Product>_Rollup_LowLevel.txt                            Used by the VerifyRollup program - summary
                                                         level/geocomponent pairs to verify at a low level (e.g., each
                                                         05000 summary level - County - sums to the value of the
                                                         specific parent to which it belongs). Not all summary
                                                         levels roll up naturally to their parent (because of
                                                         thresholding or part-of relationships).
Internal-<Product>_map.txt                               List of match formulas to execute within a product
<Product>-<PriorProduct>_map.txt                         List of formulas, optionally for specific iterations, executed
                                                         for list of summary levels from PriorProduct.txt, that should
                                                         match between two products.
<DetailFile>AnalyzerTableInfo.txt                        n/a
<DetailFile>Analyzer-<Product>_map.txt                   List of match formulas to execute between analyzer and
                                                         current product summary files
DGFConsistency_CommonVars.txt                            List of geography fields in common between DPP
                                                         Geography File (DGF) and detail record; used in Tab stage
                                                         1000
Extracts.txt                                             Used only in uPF3 product to create an extract of a
                                                         summary file
SDTT-uSF3_geo_map.txt                                    School District Geography Equivalencies for Matching File.
                                                         Used by the Prior-Product program for School District product
                                                         SDTT. Provides the ability to map specific SDTT GEO_Ids
                                                         to specific uSF3 GEO_Ids. The file contains a list of
                                                         geographic mappings between SDTT and uSF3 places,
                                                         county subdivisions, and counties for the school district
                                                         summary levels (950, 960, 970)
Table 9: Inventory of DPP Driver Files

3.4.3.1.       Products.txt
This file lists the products within an area and environment, along with the detail file that is the source of a
product’s data, and a variety of other parameters. The file is named Products.txt and is delimited by the
pipe (|) character.
Col          Field Name                                                           Description
1      Product Name              The product name defined by DPP operations or product sponsor.
2      Detail Files              Space-separated list of detail file names used by this product, e.g., “HDF SEDF”
3      File ID                   Suffix of product summary file, e.g., “usf4”
4      DB Build Indicator        Build databases for this product, and if so, how? Valid values:
                                      •   A indicates all, that is, state & national
                                      •   N indicates national only
                                      •   S indicates state only
                                 Anything else indicates to reuse databases from a prior product. Example “A uSF1F”
5      SIPHC Threshold           The old-style (SF2) SIPHC population threshold, e.g., “100:SIPHC”
6      Geographic                Instructions about what substitutions to perform in Geographic Header Record (by overwriting
       Substitutions             original values from Geography Division). A space-separated list of symbolic rules of the form <old
                                 value>:<new value>, e.g., “UASC:UASCU HU100:H1 POP100:P1”
7      OBSOLETE                  Subsumed by PriorProduct.txt; valid value is “OBSOLETE”
8      OBSOLETE                  Subsumed by PriorProduct.txt; valid value is “OBSOLETE”
9      OBSOLETE                  Subsumed by PriorProduct.txt; valid value is “OBSOLETE”
10     Summary Levels for        Space-separated list of summary levels to compare during internal match, e.g., “040 500 510 511
       Internal Match            521 531 541 542 550 551 552 553 554 555 560”
11     Base STR Port             Review Materials are no longer used; valid value is “OBSOLETE”
12     Summary level for         Review Materials are no longer used; valid value is “OBSOLETE”
       review material TXDs
13     Perform Analyzer          Boolean indicator - perform Analyzer match for this product? Values are “Y” or “N”
       Match
14     Linked Product Name       Linked product -- one that is computed at the same time as this product. Used only for uSF2.
                                 Value is “uSF2A”.
15     Supplemental Recode       Supplemental detail file indicator. Valid values are “SUPP” or “SUPP2”
       File Names
16     Review Material           Summary Levels for Review Materials (also for review material recode generation). OBSOLETE.
       Summary Levels
17     Associated Product        Used for geographic component 49 processing in SF3 and SF4. Must be a valid product in
                                 Products.txt. Example “uSF4gc49”
18     Iterated Product Flag     Boolean flag – is this an iterated product? Valid values are “Y” or “N”
19     File ID for Summary       School District only. Example “xd0”
       Files with
       Thresholding Applied
20     File ID for Summary       School District only. Example “xd1”
       Files with Special Tab
       Rounding Applied
21     Geographic Sets           Space-separated list of summary levels groups (each group contains a colon-separated list of
                                 summary levels) that define geographic sets of product; School District only. Example
                                 “010:250:040:050 950 960 970”
22     Threshold                 Specify as <Threshold Count>:<Threshold Source>. Example: “50:SDCOSS1_US_P2_011”

23     Geo Header Handoff          Boolean indicator – Handoff geographic header? Valid values are “Y” or “N”
       Indicator
24     Split National/State        Boolean indicator - N indicates CSVs for the product are not separated into .csv, .n and .n+s files; Y
       contribution from CSV       indicates CSVs for the product are separated into .csv, .n and .n+s files. Valid values are “Y” or “N”
25     Handoff directory           Name of handoff directory. Default value: value in column 1. Example “108h”
26     Iterated database           Boolean indicator - Build iteration databases? Works in conjunction with columns 2 and 4. Valid
       Indicator                   values are “Y” or “N”
27     FileID                      FileID that goes into the geo, igeo, and aff files (and the data segments). Added because AIAN001
                                   and AIAN both need “AIAN” as the FileID, whereas the SF program had been using the Product
                                   name instead
Table 10: Syntax of Products.txt

3.4.3.2.       Prior-Product File
The prior-products file lists the prior-products that are suitable for matching to the current product. The file
is named PriorProduct.txt and is delimited by the pipe (|) character.
Col          Field Name                                                        Description
1      MatchID                     Sequential number
2      Product                     Products.txt column 1 entry for the product
3      Product Level               Are we matching current product’s state or US summary files? Values are “state” or “US”
4      Prior-Product               Products.txt column 1 entry for prior-product
5      Prior-Product Level         Are we matching prior-product’s state or US summary files? Values are “state” or “US”
6      Prior-Product area          OBSOLETE
7      Prior-Product               DPP area where prior-product is located, e.g., “prod”
       environment
8      Summary Levels to           List of summary levels to compare between products. The optional colon-notation <product
       match                       summary level>[:<prior-product summary level>] can be used to transform summary levels for
                                   comparison.
9      Use Geo-Mapping             Boolean indicator to perform geo mapping. Used only by School Districts “TT” product to map
                                   explicit GEO_Ids between School District and uSF3. Values are “Y” or “N”.
Table 11: Syntax of PriorProduct.txt
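For illustration only, a hypothetical PriorProduct.txt entry (the values below are invented, not taken from
an actual driver file) matching a product’s state summary files against a prior-product’s state summary
files might look like the following. Column 6 is obsolete and left empty; column 8 lists summary levels
040 and 050 to compare:
    1|uSF2|state|uSF1|state||prod|040 050|N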

3.4.3.3.       Product Coverage File
There should be one product-coverage file for each product listed in the Products.txt file. The product-
coverage file lists sites (states) and type (adjusted/unadjusted) for a product defined within a given DPP
area and environment. The file is named <Product>Coverage.txt and is delimited by the pipe (|)
character.
Col           Field Name                                                         Description
1      State Postal Code               Two-letter abbreviation
2      State Name
3      Type                            Valid values - U or A (unadjusted or adjusted detail file)
4      Extended Type                   Valid values - unadjusted or adjusted
5      OBSOLETE column                 Obsolete columns are never removed
6      State FIPS Code                 Two-digit code, e.g., “01”
7      Batch size                      No longer used; the tabulation batch size is automatically set to the largest value possible.
8      Number of Recode Splits         A value of 1 means the geographic recode is not split. Used only for very large states like CA,
                                       TX or FL whose recode may exceed the 2 GB SuperSERVER limit.
9      Recode Reduction Flag           Boolean flag to dehydrate the geographic recode. Dehydration removes duplicate
                                       geographies from the geographic recode. Valid values are “DEHYD” or “NODEHYD”
Table 12: Syntax of <Product>Coverage.txt
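For illustration only, a hypothetical <Product>Coverage.txt entry for an unadjusted Alabama detail file
(the values are invented) might look like the following. Column 5 is obsolete and column 7 (batch size)
is no longer used, so both are empty; the single recode split and NODEHYD flag indicate the geographic
recode is neither split nor dehydrated:
    AL|Alabama|U|unadjusted||01||1|NODEHYD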

3.4.3.4.       Product Summary Level Sequence Chart File
This driver file lists all the Summary Level and Geo-Component combinations in a Product. Some
Products tabulate values at only the State or National level. For these Products the following two codes
are used to classify their Summary Level/Geo-Component combinations:
     SO - State Only
     NO - National Only
Some Products tabulate both State and National numbers simultaneously. For these Products the
following code is used to classify their Summary Level/Geo-Component combinations:
   NS - National State
The file is named <Product>SequenceChart.txt and is delimited by the pipe (|) character.

Col         Field Name                                                      Description
1      Summary                   Valid five-character summary level/geocomponent taken from the Product’s summary level sequence
       Level/Geocomponent        chart.
2      Which Products            Indicates where summary level/geocomponent combination appears - only in State product (SO),
                                 only in National or US product (NO), or appears in both the state and National products (NS).
Table 13: Syntax of <Product>SequenceChart.txt
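For illustration only, hypothetical <Product>SequenceChart.txt entries (invented values) might look like
the following, where 04000 is summary level 040 with geocomponent 00:
    04000|NS
    05000|SO
    01000|NO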

3.4.3.5.       Product Geographic Content Files
The content file is a version of the matrix created by the Summary Level Working Group and incorporated
in the product specifications covering geographic field applicability by summary level. This is also called
the product “star chart”.
The file is named <Product>GeoContent.txt and is delimited by the pipe (|) character.
Col         Field Name                                                       Description
1
Table 14: Syntax of <Product>GeoContent.txt

3.4.3.6.       GeoID Composition File
The GeoID information file dictates the composition of the GeoID geographic identifier for a product for
different summary levels. The file is named <Product>GeoIDInfo.txt and is delimited by the pipe (|)
character. To create this file, review the product hierarchy, determine how many sub-hierarchies are
present, and identify exactly which codes contribute to each member of each sub-hierarchy.
At one time (before uSF3), this hierarchy was built into the SXV4 detail database as a set of categories.
Because of the size and complexity of these hierarchies, that practice was discontinued with uSF3.
Col         Field Name                                                       Description
1      Line Number               Indicates the line number
2      Parent Line Number        Contains the parent line number for this entry in the sub-hierarchy. An asterisk (*) is used when
                                 there is no parent line number
3      Sub-Hierarchy             Contains the sub-hierarchy number
       Number
4      Summary Level             Contains the summary level
5      Geocomponent              Contains the geocomponent
6      GeoID Length              Contains the geoid length for this summary level
7      GeoID Fields              Contains the fields that make up the geoid for this summary level. An exception is when no fields
                                 contribute to the geoid (dummy should be used).
8      Last Item Hierarchy       Contains a Y if this is the last item in the hierarchy, and null otherwise
       Indicator
9      SAS code                  Applies to non-00 geocomponents only! It contains the phrase that defines this geocomponent,
                                 ready to be added to a SAS if condition
Table 15: Syntax of <Product>GeoIDInfo.txt

3.4.3.7.       <Product>GeoLCDSumLev.txt
The file is named <Product>GeoLCDSumLev.txt and has only one record and one field.
Col        Field Name                                                      Description
1      Lowest Common             Three digit summary level code of the lowest common denominator (always block) summary level,
       Denominator               e.g., “100”
       Summary Level
Table 16: Syntax of <Product>GeoLCDSumLev.txt

3.4.3.8.       <Product>GeoTractSumLev.txt
The file is named <Product>GeoTractSumLev.txt and has only one record and one field.
Col        Field Name                                                      Description
1      Census Tract              The summary level of a census tract – “140”
       Summary Level
Table 17: Syntax of <Product>GeoTractSumLev.txt

3.4.3.9.       Product Table Information File
The product table information file describes the tables that comprise a product defined within an area and
environment. There should be one product table-information file for each product listed in the
area/environment's Products.txt file. The file is named <Product>TableInfo.txt and is delimited by the
pipe (|) character.
Col         Field Name                                                        Description
1      Table ID                  Name of table as defined in product specification.
2      Number of Cells           Number of cells in the table, excluding the GeoID - i.e., only data cells are to be considered
3      Summary File              The summary file segment that the table resides in. For tables that span segments, a syntax of the
       (Partition) Number        form 7[164]:8[165] indicates a split between two segments
4      Conditional Rounding      1=CR, empty=don't round
       Indicator
5      Decimal Places            Number of decimal places or blank for 0
6      Reviewer Groups           The space-delimited list of reviewer group(s) that each product table belongs to:
                                            Race: Racial Statistics Branch
                                            Hispanic: Ethnic & Hispanic Statistics Branch
                                            Age-Sex-GQ: Special Populations Staff
                                            Family: Fertility & Family Statistics Branch
                                            Urban-Rural: Population Distribution Branch
                                            Housing: Physical and Social Characteristics Branch
7      Table Formula             The calculation required to produce the cells for this table
8      Sub-Tables                "Subtables" to be tabulated in lieu of the designated table, with results combined to produce the
                                 designated table. Subtables are listed in the form <subtable>:<number of cells>, e.g., P13D:232.
                                 Used for median tables, where an underlying distribution also needs to be tabulated.
9      TABCALC or Median         TABCALC tables are both tabulated and post-processed. MEDIAN tables require special
       Indicator                 processing at the US level.
10     Conditional Rounding      This column held the formula for conditional rounding in SF3. Before release of SF3, the business
       Formula                   rules changed, and the formula was no longer needed. Obsolete.
11     Tabulation Database       Which database should this table be tabulated against?
12     Special Tab Rounding      Is this table subject to special tab rounding? Currently only a feature in School Districts, this applies
                                 unconditional-type rounding at the summary file level.
Table 18: Syntax of <Product>TableInfo.txt

3.4.3.10. Product Table Information File for State
The file is named <Product><State>TableInfo.txt and is delimited by the pipe (|) character. It has the
same format as <Product>TableInfo.txt, but is only to be used when processing a state <State>.
3.4.3.11. Product Table Information File for Matches
The file is named <Product>TableInfoForMatches.txt and is delimited by the pipe (|) character. It has the
same format as <Product>TableInfo.txt and is used only by the post Tab stage 4000 match programs. It’s
necessary only if the product contains tables that had to be broken into smaller parts because they were
too big to tabulate in one operation. One example is table PCT42 in uSF3. Here are the
uSF3TableInfo.txt entries for PCT42:
   PCT42A|1|28||0|Income||PCT42A_D:39 PCT42A:1|MEDIAN|||
   PCT42B|7|28||0|Income||PCT42B_D:156 PCT42B:7|MEDIAN|||
   PCT42C|7|28||0|Income||PCT42C_D:156 PCT42C:7|MEDIAN|||
Here are the uSF3TableInfoForMatches.txt entries for the same combined table:
   PCT42|15|28||0|Income||PCT42_D:312 PCT42A:15|MEDIAN|||
The “for matches” view represents the final summary file version of the table.
3.4.3.12. Product Table Information File for Handoff
The file is named <Product>TableInfoForHandoff.txt and is delimited by the pipe (|) character.
Col         Field Name                                                        Description
1
Table 19: Syntax of <Product>TableInfoForHandoff.txt

3.4.3.13. Product Handoff File
The Product Handoff driver file lists the AFF Data Warehouse hand-off metadata files that are required for
a product as specified in the “American FactFinder 2000 Metadata ASCII Import File Specification.” DPP
no longer delivers product metadata to AFF, and this driver file is obsolete.
The file is named <Product>Handoff.txt and is delimited by the pipe (|) character.

Col          Field Name                                                      Description
1      File Name                 Name of AFF metadata file
2      Optional/Mandatory        optional (O)/mandatory (M)
       Flag
3      External File             List of space delimited numbers indicating fields that may contain external file references
       Reference
Table 20: Syntax of <Product>Handoff.txt

3.4.3.14. Iterations File
The iterations file lists race-ancestry-ethnicity values for SF2, SF4 and AIAN computation, and the various
types of relevancy iterations for the School District products. The iteration number is three characters
long and should be left-padded with “0” to the full width. Iteration numbers do not have to be consecutive
or purely numeric. The file is named <Product>Iterations.txt and is delimited by the pipe (|) character.
Col          Field Name                                                      Description
1      Iteration                 Three Character Iteration Code, e.g., “001”
2      Iteration Label           Iteration Name, e.g., “American Indian and Alaska Native alone”
3      Obsolete
4      Weight                    Relative integer weight or "cost" of this wave for tabulation purposes
5      SAS SIPHC Logic           SAS code logic used by SIPHCHandoff to determine if this iteration passes thresholding
6      n/a                       n/a
7      n/a                       n/a
Table 21: Syntax of <Product>Iterations.txt
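For illustration only, a hypothetical <Product>Iterations.txt entry (the weight value is invented, and the
obsolete, SAS SIPHC logic, and n/a fields are left empty) might look like the following:
    001|American Indian and Alaska Native alone||1|||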

3.4.3.15. <Product>IterationDBLogic.txt
The file is named <Product>IterationDBLogic.txt and is delimited by the pipe (|) character.
Col         Field Name                                                       Description
1
Table 22: Syntax of <Product>IterationDBLogic.txt

3.4.3.16. <Product>IterationsForHandoff.txt
The file is named <Product>IterationsForHandoff.txt and is delimited by the pipe (|) character.
Col         Field Name                                                       Description
1
Table 23: Syntax of <Product>IterationsForHandoff.txt

3.4.3.17. <Product>IterationsForSIPHC.txt
The file is named <Product>IterationsForSIPHC.txt and is delimited by the pipe (|) character.
Col         Field Name                                                       Description
1
Table 24: Syntax of <Product>IterationsForSIPHC.txt

3.4.3.18. Report Addresses Driver File
The Report Addresses driver file lists e-mail addresses to which the reports are sent when the Status and
Handoff scripts are run with the –m option.
The file is named <Product>ReportAddresses.txt and is delimited by the pipe (|) character.
Col        Field Name                                                         Description
1      Report Key                The file has two fields – report key and addresses – delimited by the pipe (|) character. The valid
                                 report key values are status for the Status report and acsd, aff, affpr, internal, and review to
                                 designate one of the Handoff script destination-parameter values. In general, the report key identifies
                                 the report’s consumer: a Handoff script destination, “status” for the Status script, or another value
                                 defined by some other script or program.
2      Email Addresses           Space or comma-delimited list of Email Addresses to receive report
Table 25: Syntax of <Product>ReportAddresses.txt
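For illustration only, hypothetical <Product>ReportAddresses.txt entries (the addresses are invented)
might look like the following:
    status|jbond007@example.gov
    aff|jbond007@example.gov,dpp-team@example.gov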

3.4.3.19. <Product>_Rollup_HighLevel.txt
This driver is used by the Verify Rollup program to manually verify that selected geography hierarchies
are additive (not all geography hierarchies are additive – that’s why all tabulations are always done at the
block level). Generally, Verify Rollup can be used only in non-threshold products. The program checks
only the rollup of the first cell in the product, assumed to be P1.
High level indicates the program will verify that all instances of the parent summary level/geocomponent
match the sum of their children.
An example from AIAN_Rollup_HighLevel.txt:
   01000|04000
   38000|38500
The file is named <Product>_Rollup_HighLevel.txt and is delimited by the pipe (|) character.
Col                                                              Field Name
1      Parent Summary Level/Geocomponent
2      Child Summary Level/Geocomponent
Table 26: Syntax of <Product>_Rollup_HighLevel.txt

3.4.3.20. <Product>_Rollup_LowLevel.txt
This driver is used by the Verify Rollup program to manually verify that selected geography hierarchies
are additive (not all geography hierarchies are additive – that’s why all tabulations are always done at the
block level). Generally, Verify Rollup can be used only in non-threshold products. The program checks
only the rollup of the first cell in the product, assumed to be P1.
Low level indicates the program will verify that every instance of the parent summary
level/geocomponent matches the sum of its children (e.g., for cell P1, the sum of the counties in Maryland
equals the Maryland state total).
An example from AIAN_Rollup_LowLevel.txt:
   01000|04000
The file is named <Product>_Rollup_LowLevel.txt and is delimited by the pipe (|) character.
Col                                                               Field Name
1      Parent Summary Level/Geocomponent
2      Child Summary Level/Geocomponent
Table 27: Syntax of <Product>_Rollup_LowLevel.txt

3.4.3.21. Product Internal Mapping File
The product internal mapping file describes mapping aggregations of Summary File cells to other
Summary File cells. There are two columns delimited by a comma: a match-to expression and a
matching expression. Product cells in the expressions are listed as tableID(cell) or tableID(cell1…cellN)
or tableID(cell1+cellN). Cells from more than one table can be included in the matching aggregate with an
expression of the form table1(cells)+table2(cells) or table1(cells)-table2(cells).
The file is named Internal-<Product>_map.txt and is delimited by the comma (,) character.
Col        Field Name                                                     Description
1      Product formula 1         Computation to perform on summary file cells of current product, e.g., P100(1)
2      Product formula 2         Computation to perform on summary file cells of current product, e.g.,
                                 P98(6+9+16+19+26+29+36+39)
Table 28: Syntax of Internal-<Product>_map.txt
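Combining the two example formulas from Table 28, a complete Internal-<Product>_map.txt line
(illustrative only; the cell mapping itself is invented) that asserts cell 1 of table P100 equals the
aggregate of the listed P98 cells would read:
    P100(1),P98(6+9+16+19+26+29+36+39)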

3.4.3.22. Prior-Product Match Specification
The prior-product mapping file describes mapping aggregations of Summary File cells to Summary File
cells from a prior data product, for instance, from the uSF1 product files to the uPL product files. The file
is named <Product1>-<Product2>_map.txt and is delimited by the comma (,) character.
There are four columns delimited by commas: a match-to expression, a matching expression, a match-to
iteration number, a matching expression iteration number. Product cells in the expressions are listed as
tableID(cell) or tableID(cell1…cellN) or tableID(cell1+cellN). Cells from more than one table can be
included in the matching aggregate with an expression of the form table1(cells)+table2(cells) or
table1(cells)-table2(cells).


Another supported variant is the ability to match Geo Header fields, or match a Geo Header field to a
table formula.
Col          Field Name                                                      Description
1      Product formula           Computation to perform on summary file cells of the product, or a Geo Header field name.
2      Prior-Product formula     Computation to perform on summary file cells of the prior-product, or a Geo Header field name.
3      Product Iteration         The iteration numbers should be left-padded with 0s to three characters. For example, use 018
                                 rather than 18. For products that don't have iterations, include the delimiter but leave the field
                                 empty.
                                 If Column 1 is a Geo Header field, the value of this column is “geo”
4      Prior-Product Iteration   The iteration numbers should be left-padded with 0s to three characters. For example, use 018
                                 rather than 18. For products that don't have iterations, include the delimiter but leave the field
                                 empty.
                                 If Column 2 is a Geo Header field, the value of this column is “geo”
Table 29: Syntax of <Product1>-<Product2>_map.txt
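For illustration only, two hypothetical <Product1>-<Product2>_map.txt lines (invented values): the first
matches cell 1 of table P1 in both products for products without iterations (the iteration fields are
present but empty); the second shows the Geo Header variant, assuming AREALAND is a Geo Header
field in both products:
    P1(1),P1(1),,
    AREALAND,AREALAND,geo,geo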

3.4.3.23. Detail file Analyzer Table Information File
A trimmed-down version of <Product>TableInfo.txt. The file is named <DetailFile>AnalyzerTableInfo.txt
and is delimited by the pipe (|) character.
Col         Field Name                                                    Description
1      Table Name                Table Name
2      Number of Cells           Number of Cells in Table
3      Number of Implied         Number of Implied Decimal Places in Analyzer file
       Decimal Places
Table 30: Syntax of <DetailFile>AnalyzerTableInfo.txt

3.4.3.24. Analyzer - Product Mapping File
An analyzer is an independent tabulation, performed by the Population Division, of some subset of the
product’s cells on a detail file. The analyzer is delivered somewhat like a “prior-product” and is checked
as part of the post-Tab stage 4000 match programs.
The file is named <DetailFile>Analyzer-<Product>_map.txt and is delimited by the pipe (|) character. The
format is identical to the <Product>-<PriorProduct>_map.txt driver file. An example from SEDFAnalyzer-
uSF4_map.txt:
      #SF4 to Sample Analyzer Match
      #PCT3. SEX BY AGE [209] ,
      #Total population = PCT3(1),
      #Total Population = iteration code 001 ,
      PCT3(1),P1(1),001,
      PCT3(1),P5(1...4),001,
      PCT3(1),P34(1...504),001,
      PCT3(1),P55(1...32),001,
      PCT3(1),P56(1...62),001,
      PCT3(1),P67(1...26),001,
      PCT3(1),P109(1...4),001,
      PCT3(1),P110(1...48),001,
      PCT3(1),P111(1...48),001,
      PCT3(1),P112(1...2),001,
      PCT3(1),P115(1...28),001,
      PCT3(1),P280(1+2),001,
      PCT3(1),P281(1+2),001,
      PCT3(1),P286(1...7),001,
      PCT3(1),P287(1...116),001,
      #White Alone = iteration code 002,
      PCT3(1),P24(1),002,
      PCT3(1),P27(1...5),002,
      PCT3(1),P34(1+64+127+190+253+316+379+442),002,
3.4.3.25. DGFConsistency_CommonVars.txt
The file is named DGFConsistency_CommonVars.txt and has only one field.

Col          Field Name                                                        Description
1       SAS Variable in          E.g.,
        Common Between                     state
        Detail File and DPP                aianhh
        DGF File                           aianhhfp
                                           anrc
                                           arealand
                                           areawatr
Table 31: Syntax of DGFConsistency_CommonVars.txt

3.4.3.26. Summary File Extract Driver File
This obsolete driver file is documented in the attached “DPP Operations Guide.Operational
Material.6.4c.pdf”. The file is named Extracts.txt and is delimited by the pipe (|) character.
3.4.3.27. School District Geography Equivalencies for Matching Files
This file is used by the Prior-Product match program. It contains a list of SDTT GeoIDs and uSF3
GeoIDs that are physically the same place, but have different summary levels (e.g., many school districts
are counties, county subdivisions or places). The file is named SDTT-uSF3_geo_map.txt and is delimited
by the pipe (|) character.
Col           Field Name                                            Description
1       School District GeoID    E.g., 97000US0100005
2       USF3 GeoID               E.g., 16000US0100988
3       Name of School           E.g., “ALBERTVILLE CITY SCHOOL DISTRICT”
        District GeoID
Table 32: Syntax of SDTT-uSF3_geo_map.txt

3.5.      DPP COTS Software Components
The DPP system is a general-purpose data product production system. It produces Census data
products (namely, Summary Files) based on product specifications defined by POP/HHES and detailed
data files generated from the 2000 Decennial Census. It is a programmed system integrated with the
following commercial off-the-shelf (COTS) products:
 COTS Component                           Manufacturer                                      Role
UNIX Korn shell                 IBM                                          Standard part of IBM AIX
                                                                             operating system. Glue code
                                                                             that wraps DPP system
                                                                             functionality and provides
                                                                             operational control.
SAS                             SAS Institute                                Data processing tool
snbu                            Space-Time Research (STR)                    Command-line SXV4 database
                                                                             builder tool
scstools                        Space-Time Research                          Command-line tool to create
                                                                             SuperSERVER database
                                                                             catalogues
scs                             Space-Time Research                          Command-line tool to start a
                                                                             SuperSERVER server
                                                                             associated with a catalogue file
SuperCROSS                      Space-Time Research                          A Windows application used to
                                                                             compose tables and perform
                                                                             ad-hoc tabulations
ss2ps                           Space-Time Research                          The production system; a
                                                                             command-line batch tabulation
                                                                             engine
Java                            IBM Corporation                              Used to support the
                                                                             SuperSERVER suite
Python                          Open-source software                         Used to support the
                                                                             SuperSERVER suite; supplied
                                                                             as part of the STR product
                                                                             installation
Jpython                        Open-source software                          Used to support the
                                                                             SuperSERVER suite; supplied
                                                                             as part of the STR product
                                                                             installation
IBM VisualAge                  IBM Corporation                               DPP source-code control
TeamConnection                                                               system
Gzip                           Open-source software                          Used to compress summary
                                                                             files for handoff to AFF
pkzip25                        PKWare                                        Used to compress summary
                                                                             files for handoff to ACS
Table 33: DPP COTS Components

3.5.1. Korn Shell
The UNIX Korn shell is the glue that holds the DPP system together. The Korn shell scripts are located in
the following directories:
         $DPPscripts/scripts – contains the main DPP system scripts. They handle the details of
          logging, error handling, security, and invoking the SAS and STR software components on the
          user’s behalf.
         $DPPinclude/include – contains helper Korn shell functions used throughout the DPP
          system scripts
         $DPPutil/util – miscellaneous utility scripts; some are run only as part of build deployment;
          others are minor utility programs that create waves or perform integrity checks on match
          specification driver files.
3.5.1.1.       Environment Variables
The DPP system uses Korn shell environment variables extensively to parameterize and define the user’s
runtime context. Environmental variables insulate programs from physical environment details. The
DPPsetup script defines these variables. Since DPPsetup is sourced into the user’s environment, these
variables are globally available to all scripts. Depending on the situation, users may override these
variables by exporting new values (the most common case is to override $DPPdrivers to work with a
local copy of the DPP driver files). Developers also may override $DPPscripts to work with a local copy
of the shell scripts during development. Other DPP scripts may define additional variables that are valid
only for the duration of the script’s execution. The following are the most important global variables:
    Variable                                                      Meaning
DPPhome                       Default location of the following key source-code control artifacts:
                                   DPPinstallations.txt (text configuration file)
                                   SSinstallations.txt (text configuration file)
                                   Directories (shell script)
                                   DPPsetup (shell script)
                                   DEV builds (e.g., DPP2001_277)
                                   OPS builds (e.g., DPP2000_OPS_219)
DPProot                       User root directory (e.g., /dpp2/prod)
DPPwork                       User work area (e.g., /dpp2/prod)
DPPenv                        User environment (uSF3)
DPPscripts                    Directory location of main shell scripts
DPPprog                       Directory location of SAS programs
DPPinclude                    Directory location of shell script include files
DPPutil                       Directory location of shell script utility files
DPPtmp                        Temporary directory in user’s environment
DPPdebug                      Enable to save intermediate files normally deleted at the end of a program,
                              particularly intermediate SAS datasets; false by default.
DPPdevbld                     The name of the selected DEV build
DPPopsbld                     The name of the selected OPS build
DPPdrivers                    The default location of the DPP driver files
DPPinstallation               Pipe-delimited configuration line from source-code control file
                              DPPinstallations.txt that indicates the default DEV and OPS build
                              combination for the user’s context
SSinstallation                Pipe-delimited configuration line from source-code control file
                              SSinstallations.txt that indicates the default SuperSERVER version to run
                              in user’s context (multiple versions of SuperSERVER may be installed on
                              the same machine)
GS_DriverPath                 A colon-separated list of directories to search for driver files
GS_Area                       User’s work area - dev, pa, uat, test, prod, or sprod
GS_Machine                    Current machine hostname (e.g., dpp2)
GS_Suser                      Current su login ID (e.g., dppdev)
GS_User                       User’s login ID (e.g., jbond007)
SAS_ROOT                      Location of SAS executable (e.g., /usr/lppp/sas820_32)
SSRel                         Default SuperSERVER package (default value of 999 indicates to use the
                              default package for this machine as defined in SSinstallations.txt)
Table 34: Important Global DPP Environment Variables
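For example, to work with a local copy of the DPP driver files (the most common override mentioned
above), a user might export a new value before running any scripts. The directory path below is
invented for illustration:
    # Work against a local copy of the driver files instead of the default set
    export DPPdrivers=/dpp2/prod/mydrivers
Because the variable is exported, all DPP scripts run in the session will pick up the local copy.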

3.5.1.2.       Role of Include Files
The DPP system makes extensive use of Korn shell functions. Functions allow for a modular object-
oriented programming style, code reuse, and consistent, centralized error checking and logging. The
functions are located in the $DPPinclude/include directory. By convention, every DPP script
“sources in” the include functions and starts with the following two lines:
     #!/usr/bin/ksh
     . $DPPinclude/include/F_Common.sh
The F_Common.sh script is a list of all the modules in the include directory. As new modules are added
to the system, the F_Common.sh module should be updated to enable all DPP scripts to benefit from
new functionality.
3.5.1.3.       Stage Convention
By convention, most DPP scripts are divided into sections called “stages.” The most complex staged
scripts are ProcessGeo and Tab. Stages are numbered sequentially and each performs a single step in
the DPP process. Users can optionally specify the start and end stage with the command-line parameters
–s and –e, respectively. Running a program without stage parameters will sequentially run all the stages.
Not all stages are applicable to all products.
 Stage                                                 Purpose
1000          Source-file verification, data preparation; same as omitting the -s switch
1025          Assemble data for US DB build
2000          Detail-file SuperCROSS DB build & installation
2050          Catalogue Detail-file SuperCROSS DB for ad-hoc Tabulation
3000          Catalogue Detail-file SuperCROSS DB for production Tabulation and Tabulate
Table 35: Example Stages from the Tab script

Stages play an important part in the DPP divide-and-conquer approach to tabulation. A massive product
like Summary File 4 is not run by a single invocation of the DPP system. Instead, the product is broken
into smaller pieces and the computing work is spread across multiple machines, each with multiple CPUs.
Recommendations for which scripts and which stages can be run in parallel are given in the DPP
Operational Cookbook on a per-product basis.
The following design principles define how stages are used throughout the DPP scripts:
         If the script determines a particular stage is not applicable to the current product (determined by
          the product’s driver files), the stage should log a message and be skipped. This is a “do no harm”
          philosophy.


         New products may require new stages to be defined. For this reason, stage numbers should be
          defined with gaps of 1000 between them to allow room for growth.
The general form of a DPP script with stages is the following, where the default start stage is the first
stage, and the default end stage is the final stage:
   <script> [-s <start stage>] [-e <end stage>]
Running a script with no stage parameters sequentially runs all the stages.
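For example, using the Tab stages listed in Table 35, the following invocation (a sketch, not an exact
production command line, which may require additional parameters) would run only the detail-file
database build and ad-hoc cataloguing stages:
    Tab -s 2000 -e 2050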
3.5.1.4.       Variable Naming Conventions
DPP Korn shell variables use a naming convention, <scope><type>_<name>, where <scope> and
<type> are defined in the following tables. A general comment about naming conventions – the Korn
shell lacks many of the features of a full-fledged programming language, like strong typing and variable
scoping rules. Nevertheless, the DPP naming conventions, along with the include mechanism, are
extremely effective tools for building reusable and robust shell scripts.
<scope>                                              Meaning
G             Global – variable has been exported and is available to sub-shells and included
              functions.
L             Local – scope is limited to current function or shell script
Table 36: DPP Korn Shell Variable Scoping Convention

 <type>          Meaning                                  Example/Valid Values
S             String               “Iterations”
A             Array                A new-line separated list of values suitable for iteration
L             List                 A space-separated list of values suitable for iteration
F             File                 A relative or absolute UNIX file name
D             Directory            A relative or absolute UNIX directory name
B             Boolean              True if the string equals TRUE and false if the string equals
                                   FALSE
UC            Upper Case           Using typeset –u, the variable has been forced to be all
              String               uppercase
LC            Lower Case           Using typeset –l, the variable has been forced to be all
              String               lowercase
Table 37: DPP Korn Shell Variable Naming Convention
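A few hypothetical declarations (the names and values are invented) that follow the convention:
    LS_Product="uSF3"              # Local String
    LL_States="VT NH ME"           # Local space-separated List
    typeset -u LUC_Type="u"        # Local Upper Case String; value becomes "U"
    export GB_Overwrite=TRUE       # Global Boolean, available to sub-shells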

3.5.1.5.       Function Naming Conventions
DPP Korn shell functions use a naming convention, F<type>_<name>, where <type> is a Korn shell
data type, defined in Table 37. In shell, the only way to return a value from a function (other than setting
a global variable) is to execute the function and capture its standard output stream. For example, to
execute the function F_Which and capture its return value:
   LF_Products=$(F_Which Products.txt)
Functions with no return value use the convention F_<name>.
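A minimal sketch of a function that follows the convention (the function name and body are invented for
illustration):
    # FS_ prefix: a function that returns a String on standard output
    function FS_DefaultProduct
    {
        print "uSF3"
    }
    LS_Product=$(FS_DefaultProduct)   # capture the return value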
3.5.1.6.       Structure of Logs File Directories
Because of the extensive logging done by the DPP system, a series of subdirectories is created to
manage the large number of files created by the system. Too many files in a single directory make it
difficult and slow to find files; the AIX backup software also may experience time-out problems with
massive directories.
               Log Location                                                                  Purpose
$DPPwork/$DPPenv/logs                                                        Central location for all logs files in a
                                                                             user’s product processing environment
$DPPwork/$DPPenv/logs/<STATE Postal Code>                                    Central location for all logs for a
                                                                             specific state
$DPPwork/$DPPenv/logs/<STATE Postal                                          Central location for all SAS logs for a
Code>/<DATE>                                                                 specific state on a specific date
$DPPwork/$DPPenv/log                                                         A single file. Every script logs its start
                                                                             and end status in the log file. The DPP
                                                                             status-reporting infrastructure uses this
                                                                             information to reconstruct the progress
                                                                             of the current product.
Table 38: Summary of DPP Log Directories

3.5.1.7.       Log File Naming Conventions
Every DPP shell script automatically redirects its standard output and error stream to a log file stored in
the $DPPwork/$DPPenv/logs directory. These log files are invaluable when debugging a failed script.
The log file name is unique and provides an audit trail of how and when the script was run. Since the
number of log files can grow very large, the log directory is segmented by state subdirectories. For
example, the log files for state runs of Vermont are stored in the $DPPwork/$DPPenv/logs/VT
directory.
The log file-naming algorithm uses the current state’s Postal code, stage, product, process ID, and date
to form a unique file name. As an example, the name of a Summary File 3 Vermont ProcessGeo log file:
    $DPPwork/$DPPenv/logs/VT/ProcessGeo.VT.U.uSF3.20040503-122650_s45e45_58392
3.5.1.8.       Working with Temporary Files
Because DPP processing can be spread over multiple CPUs and machines sharing a common file
system, care is necessary to make sure two instances of the same program don’t overwrite or corrupt
each other’s intermediate output. Use the shell function F_CreateTemporaryFileName to create file
names guaranteed to be unique across machines. All temporary files should be stored in the user’s
$DPPtmp location to avoid accidental collisions with other users on the same machine.
Below is an example of a temporary STR catalogue name that incorporates a random string, the user
name, the date, and the process ID:
    $DPPwork/$DPPenv/STR/catalogue/pg.VTdpp1.aaaDmOvaa.dppdev.20040430-
    182148.22178
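A sketch of how such a name might be generated and used, assuming F_CreateTemporaryFileName
prints a unique name on standard output (its real signature may differ):
    # Generate a name that is unique across machines and processes,
    # and keep the intermediate file under the user's $DPPtmp area.
    LF_Temp=$DPPtmp/$(F_CreateTemporaryFileName)
    sort "$LF_Input" > "$LF_Temp"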
3.5.1.9.       Korn Shell Script Design Principles
The following design principles are used throughout the DPP scripts:
          Scripts should never overwrite their input files. Input files and output files should have different
           names. It is, however, permissible to delete intermediate output files once it has been determined
          they are no longer needed.
         The standard input and output of all DPP scripts is automatically redirected to a log file in a
          standard location with a uniform naming convention.
         DPP scripts check for input files before performing a task (such as invoking a SAS program) and
          check for output files immediately after a task completes (for example, verifying that the tabulation
          output file was created as expected).
         Unless the –o (overwrite) flag is used, a DPP script will fail if the expected output file(s) already
          exist. This prevents inadvertently overwriting processing results.
         SAS programs invoked from a shell script create a log file separate from the main log file. SAS
          log files are generally voluminous and are stored in a separate subdirectory.
         The DPP system locates product driver files along a search path defined by the environmental
          variable $GS_DriverPath. This feature is useful when running the system in waves, where
          each wave is a separate runtime context that requires its own view of the driver files.
         Every DPP driver file has a corresponding include file with functions for parsing and iterating its
          contents. This pseudo object-oriented approach gives developers the freedom to modify the
          structure of the driver file without breaking existing code.
         Scripts should never parse a driver file directly. Always use the include file functions. If a new
          driver file is added to the system, create a new include file to encapsulate its behavior.
          It is extremely important to check the return code of every UNIX command. The Korn shell
           technique of using set –e (which aborts a script whenever a UNIX command fails) was not
           used, so manual checking is required (set –e has some usability issues because historically
           not all UNIX commands agree on how to flag an error). Basic commands like mv, cp or head can
           fail under certain circumstances (for example, the head command fails when a file contains lines
           longer than 2048 characters). The DPP system defines shell functions like F_chmod and F_rm
           to encapsulate common UNIX commands and perform the necessary error handling. In several
           cases, Perl is used to perform these operations, since Perl does not have arbitrary line length
           limits. A sketch of this checking pattern appears after this list.
         Be careful using the UNIX shell wildcard expansion character *. Many UNIX commands fail if
          there are too many command-line parameters. This is most likely to happen when * is used in a
          directory with a large number of files. The safest approach is to iterate over the files in the
          directory with a standard construct like the following:
    ls | while read LS_Value; do print "$LS_Value"; done
      Always quote variables, since some variable values – like “North Carolina” – contain spaces. In
        other words, use “${LS_State}” instead of ${LS_State}, since embedded spaces can cause
        unexpected problems.
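A minimal sketch of the manual return-code checking pattern described above (the error handling is
simplified; production scripts use the DPP logging and wrapper functions instead, and LF_Input and
LF_Output are assumed to have been set earlier in the script):
    cp "$LF_Input" "$LF_Output"
    if (( $? != 0 )); then
        print "ERROR: copy of $LF_Input failed"
        exit 1
    fi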
3.5.2. SAS
Base SAS version 8.2 is used extensively in the DPP system for data and metadata processing. SAS is
run in 32-bit mode. SAS is the main ETL (extract-transform-load) tool in the DPP system. Some of the
key functions performed by base SAS:
         Process the detail data and metadata – HDF_PrepData.sas, SEDF_PrepData.sas,
          SEDF_SD_PrepData.sas
         Create the geographic recode – DGF_ProcessGeography_new.sas
         Create the summary file – SF_CreateSummaryFile.sas
         Run the internal, prior, and analyzer match procedure – SF_MatchInternal.sas,
          SF_MatchPrior.sas, SF_MatchAnalyzer.sas
         Process the geographic recode - SplitRecode.sas, SplitRecodeByGeoSets.sas,
          ReduceRecode.sas, ReduceRecode_Dehydrate.sas, Rehydrate.sas
Users of the DPP system never invoke SAS directly – it is always invoked as part of a shell script stage.
The $DPPscripts/dppsas script is a wrapper that invokes SAS in a uniform manner. It defines the
SAS log file, the SAS working directory, and the SAS configuration and autoexec file.
3.5.2.1.       SAS Log File Naming Conventions
In the DPP system, SAS is invoked by the $DPPscripts/dppsas script, which defines a standard log
file name for SAS. Because a given shell script may invoke SAS many times (by iterating over states, or
tables, or characteristic iterations), SAS logs are stored in date-stamped subdirectories. Here’s a typical
example of a ProcessGeo SAS log file:
    $DPPwork/$DPPenv/logs/VT/20040503/ProcessGeo.VT.U.uSF3.20040503-
    121124_s40e40_41644.ReduceRecode_58568.saslog
3.5.2.2.       SAS Design Principles
         For best performance, DPP SAS programs should use striped disk space for I/O intensive file
          operations. A standard part of the DPP operational setup is to prepare a large area of striped
          disk for this purpose. (See Section 7.8.2, Striped Disk Areas, for more information. The DPP
           Production Cookbook also provides guidance on how to set up a striped area.)
         Like shell scripts, SAS programs use a standard log file naming convention (see
          $DPPscripts/dppsas for details).


        Like shell scripts, care must be taken when creating temporary SAS datasets in a multiple
         machine environment with a shared file system. The following code snippet is used in most DPP
          SAS programs to make sure temporary datasets, when written to a common striped area, don’t
         conflict with other running instances of the same program:
   /*-------------------------------------------------
     * Create a temporary directory to allow this program
    * to run in waves for tables and/or iterations.
    *-------------------------------------------------*/
   X "mkdir &WorkLib./&Machine..&SYSJOBID.";
   LIBNAME Work "&WorkLib./&Machine..&SYSJOBID.";
    The DPPdebug environmental variable can be used to debug a SAS program by saving
      intermediate SAS datasets.
        Temporary SAS datasets are deleted at the end of SAS programs. If the SAS program fails, the
         temporary files are usually not cleaned up – users should monitor this situation to make sure
         failed program don’t consume large amounts of disk space. (See Section 7.9, Executing the DPP
         System for a product, for more information.)
3.5.3. Space-Time Research
The DPP system uses the SuperSTAR product suite from Space-Time Research (STR), based in
Melbourne, Australia. SuperSTAR is an industry-standard tabulation system used by statistical agencies
around the world:
        US Census Bureau
        Australian Bureau of Statistics (ABS)
        Statistics New Zealand
        Office of National Statistics (UK)
        Statistics Sweden
        Statistics Finland
        Statistics Switzerland
        Statistics Poland
        General Registry Office (Scotland)
        Statistics South Africa
        Statistics Indonesia

SuperSTAR compresses large quantities of microdata into a proprietary format called an SXV4
database file (using the snbu program). The database is accessed by a very high performance query and
tabulation engine (called ss2ps), and is driven by an easy-to-use Windows GUI client (called
SuperCROSS). The following table lists the features of the SuperSTAR system.
Hierarchical databases            SuperCROSS can operate on hierarchical data structures such as
                                  Census databases where individuals make up households, which
                                  make up housing units. Tables can be created counting all or any
                                  levels of the database and calculations can be performed across
                                  levels, such as average number of people per housing unit.
High speed tabulation             SuperCROSS will cross-tabulate in excess of 1 million records per
                                  second on a PC. Tabulation performance will vary depending on
                                  equipment used and type of table being created.
n-dimensional tables              Any number of fields can be placed in table columns, rows or wafers
                                  (sheets). The fields can be arranged in any order by dragging and
                                  dropping them around the table.
Sub-nested or                     Fields can either be sub-nested or concatenated in tables. Dragging
concatenated fields               and dropping will automatically sub-nest; dragging and dropping
                                  while holding the Ctrl key will cause concatenation. Concatenated
                                  fields can be sub-nested with other fields.
Join Tables                       Two or more tables from a different or the same database can be
                                  joined together by dragging and dropping columns or rows between
                                  the tables. A merge and sort is performed on one common axis,
                                  which ensures data integrity.
Access unit records                Underlying each cell in a table is the unit record data. By clicking on a
                                   cell or a group of cells, a user, if given permission, can view the unit
                                   records making up the cell. This ability allows agencies to use
                                   SuperCROSS as a macro-editing tool. Tables can be created from
                                   surveys to identify any erroneous-looking data. For example, a table of
                                   Age by Income could be created to see if any children have income.
                                   These records can then be identified for correction.
Perform calculations               Common calculations can be performed on tables including
                                   percentages, averages, and subtotals. The calculations can be linked
                                   to fields and saved for later use.
Create synthetic fields            Synthetic fields can be defined and created on-the-fly. Synthetic fields
                                   can be created based on other synthetic fields and they can be stored
                                   for later use with other tables. Synthetic fields include:
                                   Multi level - Fields created that count or sum across hierarchies in the
                                   database, for example: Number of elderly males living in a dwelling
                                   where individual details are stored in the personal record and dwelling
                                   details in the dwelling record.
                                   Single level - Fields that are created at one record level of the
                                   database based on multiple conditions, for example: Life expectancy
                                   with High, Medium and Low categories where each category is based
                                   on age, race and sex.
                                   Range - A synthetic field can be created that puts values into defined
                                   ranges from a field such as income where income is collected as
                                   actual values.
                                   Weight - Allows a value to be given for a field that has ranges
                                   defined. It turns a field such as income, where incomes are stored as
                                   ranges, into actual values that can be summed.
                                   Math - Allows calculations to be performed on summable fields.
                                   Time - Creates fields such as number of days since account last used
                                   based on date / time fields.
                                   Quantiles - Creates fields such as quantiles where the ranges are
                                   dynamically created based on population distribution.
Format tables                      SuperCROSS can produce publication-standard tables with sub-nested
                                   fields in rows and columns. All headings carry through to
                                   following pages and areas like the wafer, table heading and footnotes
                                   can all be edited.
Export data                        Data can be exported from SuperCROSS for use in many other
                                   packages, including MS Excel, CSV, tab-delimited text, and HTML.
                                   copied using the system clipboard.
Re-use table                       Table components such as recodes (reducing the number of
components                         categories in a field by selection and/or aggregation) and synthetic
                                   fields can be saved and used with other tables. These components
                                   can be saved into public directories where all users can access them.
                                   These saved components are then available for immediate use in
                                   future tables.
Batch production                   Tables can be batched to run in production mode where large
                                   numbers of tables are submitted to a server for tabulations. The
                                   Production Module also enables the same table to be run for many
                                   areas, time spans or subject populations.
Automatic zero                     When large potentially sparse tables are created the resulting data
suppression                        can be collapsed to only those rows or columns with nonzero values,
                                   using zero suppression.
Confidentiality routines           SuperCROSS will accept many confidentiality and random rounding
                                   algorithms, which can be applied during or after tabulation.
Table 39: Key Features of the SuperSTAR System

3.5.3.1.       STR Design Principles
The following design principles are used when dealing with STR:

         Based on the product specification, subject matter experts use the SuperCROSS Windows GUI
          to compose tables. There is one table per product matrix table definition. DPP calls these the
          “template TXDs” because, at runtime, the templates are merged with other recodes to produce
          the actual TXD used for production tabulation. Table definitions are saved as textual table
          definition (TXD) files, which are stored in the source-code control system in the
          $DPPopsbld/SXtables directory.
         Subject matter experts also compose the race, ancestry, ethnicity, or tribal recodes used in
          iterated products like Summary File 2, Summary File 4, American Indian and Alaska Native
          (AIAN), and the various School District special tabulations. These characteristic iteration (CI)
          recodes are stored in the source control system in the $DPPopsbld/SXrecodes directory.
         The geographic recode is how the system conveys to the ss2ps production system which
          geographies to tabulate. Generally, there’s one geographic recode per state, although many
          other cases are possible, as described elsewhere. The geographic recode is produced in Stage
          2 of ProcessGeo.
3.5.3.2.       32-bit Nature of ss2ps – 2 GB limit
The STR production system ss2ps is a 32-bit application. It can address a maximum of 2 GB of user
memory. There are three scenarios under which the DPP system can encounter the 2 GB limit and fail:
         Extremely large geographic recodes – this problem occurred in SF3 while tabulating table P1 for
          California. To work around it, the DPP system splits the geographic recode into N smaller parts,
          tabulates each part, and then combines the results back into a full CSV file. This process is
          called recode splits (a sketch of the workaround appears after this list). The relatively new
          features of recode reduction (removing geographies with no blocks) and recode dehydration
          (identifying duplicate geographies and tabulating only the minimal set) have made this problem
          less common.
         Extremely large SXV4 databases – for a given tabulation, STR loads into memory the referenced
          columns for every record in the database. This in-memory, column-wise approach accounts for
          SuperSERVER’s speed. For large databases, however, loading the referenced columns for
          every record in the database may exceed the 2 GB limit. This is the primary reason DPP
          doesn’t build a single national-level database for products like Summary File 1 or Summary File 3,
          but instead builds 52 state-level databases. While it’s technically possible to build a national
          database with SNBU, it’s impossible to tabulate against the resulting database. AIAN was the
          first DPP product that was able to build a national database (by using a clever scheme to load
          only those households that were determined beforehand to contribute to the tabulation).
         Extremely complex tables (usually medians) – this problem occurred in SF3, tabulating table
          PCT42 for New York. Similar to recode splits, the table was broken into smaller pieces and the
          DPP infrastructure was modified to handle split tables. It was changed to use the additional
          driver files TableInfoForMatches.txt and TableInfoForHandoff.txt so that PCT42 could be listed as
          3 tables (PCT42A, PCT42B, and PCT42C) in TableInfo.txt.
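A minimal sketch of the recode-split workaround described in the first bullet above follows. The helper SplitGeoRecode, the file and table names, and the output-file handling are assumptions for illustration; the ss2ps flags follow the recode caching usage documented in section 3.5.3.4:

    # Split the oversized geographic recode into N smaller parts, tabulate
    # each part, then combine the results back into a full CSV file.
    N=4
    SplitGeoRecode GeoRecode.txt $N                  # hypothetical helper
    i=1
    while [[ $i -le $N ]] ; do
       print 'DBID "Census Database"'  > recodelist.$i.txt
       print "GeoRecode.part$i.txt"   >> recodelist.$i.txt
       ss2ps -ct LOCAL_SERVER -dp scs.dat -un user -pw password \
             -tn P1 -rl recodelist.$i.txt
       mv P1.csv P1.part$i.csv                       # assumed output name
       i=$(( i + 1 ))
    done
    cat P1.part*.csv > P1.csv                        # recombined tabulation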
3.5.3.3.       Programmatic Generation of Tabulation TXDs
The template TXDs under source control, in directory $DPPopsbld/SXtables, are merged with the
geographic recode and characteristic iteration recode (if the product is iterated) during the production
tabulation process, in Tab Stage 3000, as shown in the following figure:




[Figure 3 (flow chart): the template TXD (SXtables), the geographic recode, and the characteristic iteration recode (SXrecodes) are merged; a fix-up process then produces the production TXD, which drives the tabulation that writes a CSV file, after which the production TXD is deleted.]

                                    Figure 3: Flow Chart of Programmatic Generation of TXD

The fix-up process is used in cases where the business rules for merging the characteristic iteration are
complex (such as School Districts) and require special programming (e.g., by modifying the universe
based on the current summary level and iteration).
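As an illustration of the merge step, here is a minimal sketch. Only the path of the MergeRecodeIntoTXD program is documented (see section 3.5.4); its argument order and the recode file names are assumptions:

    # Merge a template TXD with a geographic recode and a CI recode to
    # produce the production TXD used for tabulation.
    template=$DPPopsbld/SXtables/P1.txd
    georecode=$DPPopsbld/SXrecodes/GeoRecode_DE.txt    # hypothetical name
    cirecode=$DPPopsbld/SXrecodes/CI_001.txt           # hypothetical name
    $DPPscripts/scripts/MergeRecodeIntoTXD $template $georecode $cirecode \
       > P1_production.txd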
3.5.3.4.       Recode Caching
At IBM’s request, STR added recode caching to the ss2ps production system during the development of
Summary File 3. The feature allows the production TXD to contain a reference to an external geographic
recode file, instead of its contents. Since geographic recodes can be extremely large (hundreds of
megabytes), this was a significant disk space saving.
STR provided the following description of this capability:
    Connection based Recode Caching
    Introduction
    It is common that some recodes and user-defined fields (udfs) are shared among many txd files. Currently these recodes/udfs
    need to be repeatedly defined in each txd file, which requires ss2ps to repeatedly parse them and upload recode/udf
    information onto the server for each of the recodes. This is inefficient, particularly with very large geo-recodes.
    Connection-based Recode Caching aims to address this problem by allowing users to define recodes and udfs outside a txd
    file. The txd file can then use recode/udf names to reference their definitions specified in the recode files. The recodes and
    udfs defined in the recode files will be cached in the server until the connection is lost and will be available for use any time
    within that connection.
    Usage
    The following optional parameter will be added:
    [-rl RecodeListFileName]
    where RecodeListFileName is a text file that tells ss2ps which recode files are to be loaded for which databases.
    RecodeListFileName must be of the following format:
          DBID <FirstDatabaseID>
          <RecodeFileName1>
          <RecodeFileName2>
          …
          …
          DBID <SecondDatabaseID>
          <RecodeFileName3>
 Date Last Printed: 3/29/13                                                                                            Page 43 of 149

 Location: C:\Docstoc\Working\pdf\2e7681c3-9c04-433d-8c09-653e94ec9430.doc
         <RecodeFileName4>
         …
         …
         DBID …
         …
         …
    RecodeFileNames must be paths relative to the folder containing the RecodeListFile.
    Example:
         DBID “Financial Database”
         ..\My Recodes\GeoRecode.txt
         AgeRecode.txt
          DBID “Census Database”
         ..\My Recodes\GeoRecode.txt
         OccupationRecode.txt
    A recode file is a SuperCROSS textual file that defines recodes and user-defined fields. These textual recode/udf files contain
    the definitions of recodes and/or udfs (RCD…END RCD and UDF…END UDF), exactly the same as in a TXD file, with the
    following header section on top:
         HEADER
                 ESCAPE_CHAR & ITEM_BY_CODE
         END HEADER
    Use Cases
    Processing Single Table:
          ss2ps -ct LOCAL_SERVER -dp scs.dat -un user -pw password -tn table -rl recodelist.txt

          ss2ps -ct REMOTE_SERVER -sn server -po 8000 -un user -pw password -tn table -rl recodelist.txt
     Processing batch of tables:
          ss2ps -ct LOCAL_SERVER -dp scs.dat -un user -pw password -ip /usr/tables/input -op /usr/tables/output -tl tables.txt -rl recodelist.txt

          ss2ps -ct REMOTE_SERVER -sn server -po 8000 -un user -pw password -ip /usr/tables/input -op /usr/tables/output -tl tables.txt -rl recodelist.txt


    In the situation where a txd file has a udf or recode defined with the same name as one in the recode files, the definition from
    the txd file is used.
    Current Limitations:
    Don’t use the multithread option with connection-based recode caching; it is currently not supported.

3.5.3.5.       Role of Include Files
Two include modules, F_STR.sh and F_Servers.sh, encapsulate all of the DPP system interaction
with the STR software components (with the exception of SuperCROSS, which is an interactive Windows
GUI application used for table composition and ad-hoc tabulations).
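The following is a minimal sketch of the wrapper style such an include module might provide; the function name, arguments, environment variables, and dot-source path are assumptions, not the documented contents of F_STR.sh, and the ss2ps flags follow the usage documented in section 3.5.3.4:

    # Sketch of a function an include module such as F_STR.sh might define:
    function RunProductionTabulation
    {
       typeset table=$1 recodelist=$2
       ss2ps -ct LOCAL_SERVER -dp scs.dat -un $DPPUSER -pw $DPPPASS \
             -tn $table -rl $recodelist
    }

    # Callers dot-source the module and invoke its functions:
    . F_STR.sh
    RunProductionTabulation P1 recodelist.txt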
3.5.4. Perl
Perl is used minimally in the DPP system. Several utility programs and include functions use Perl to
perform functions that are difficult in the shell, like generating random numbers. The TXD merge
program, $DPPscripts/scripts/MergeRecodeIntoTXD, is written in Perl. The version of Perl used
on DPP had several problems that made it unsuitable for general-purpose use:
        Perl v5.x on AIX can’t open files larger than 2 GB – some summary file segments are larger than
         2 GB.
        Perl integers can’t handle the full range of numbers found in the Summary Files. As a
         workaround, a BigInt package was used, which was cumbersome and slow.
3.5.5. Python and JPython
The STR SNBU database builder and the ss2ps production system use Python and JPython. The ss2ps
program is written in C++ and contains an embedded Python interpreter. STR uses Python to compute
derived measures like the Census Pareto interpolation algorithm for medians.

STR provides the Python and JPython software on the product CD.
3.5.6. Java
Only the STR SNBU database builder uses Java. SNBU generates and compiles Java code to actually
build the SXV4 database. DPP uses the IBM JDK for AIX, version 1.3.1. Since databases can be larger
than 2 GB, the JDK used must be able to create files larger than 2 GB.
3.5.7. IBM VisualAge TeamConnection
TeamConnection is the source-code control system used by the DPP system. It consists of the following
components:
         IBM TeamConnection Client – runs on AIX and Windows, in both command-line and GUI mode
         IBM TeamConnection Server – runs on AIX on the dpp1 machine with a backing store of DB2
TeamConnection was discontinued as an IBM product in January 2002. At that time, IBM recommended
that TeamConnection customers migrate to the Rational ClearCase and ClearQuest products. Every
DADS group except DPP did migrate to Rational (now owned by IBM) as the standard source-code
control system.
The Rational AIX software in 2002-2003 was unstable and deemed too unreliable to use during that
critical period of SF3 and SF4 processing. As a result, the DPP team continues to use TeamConnection
and accepts the potential risk of using unsupported software. In practice, TeamConnection is a mature
and stable product, and we’ve had no problems of any kind in over three years. Historically, every
release of DPP code is also available on-line in the /releases directory and on tape backup.
Therefore, even if TeamConnection fails, every committed version of the DPP system can be retrieved
from tape backup or disk.
Briefly, TeamConnection supports the following concepts (see the document ~/Supplemental
Materials/COTS reference materials/TeamConnection Quick Start.doc for more details):
         Code is organized into projects called releases. DPP has two main releases: DPP2001 for the
          development team, and DPP2000_OPS for the operations team.
         Files are stored in releases.
         Files are checked in and out of the source-code control system using work areas.
         A work area is associated with a defect or feature.
         Work areas are grouped into a level.
         A committed level represents a reproducible code baseline, which consists of a group of work
          areas.
         DPP users (pa, test, uat, and prod) use only committed levels of code. Developers (dev) may
          use uncommitted code levels for development purposes.
The DPP build procedures are documented below.
TeamConnection is used by the DPP system for the following tasks:
         Support the DEV code release DPP2001 for all developer-related files
         Support the OPS code release DPP2000_OPS for all operational materials
         Support the DPP build deployment procedure, which extracts committed levels of code from the
          DEV and OPS releases and stores them on the /releases file system (available on all DPP
          machines).
         Support the entire DPP team to record and track the status of features and defects against the
          system.




3.5.7.1.      Steps to Conduct a Development Build from TeamConnection

# Team Connection Login : On dpp* as dpp in /home/dpp
set -o vi
export PATH=$PATH:/usr/teamc/bin
export LIBPATH=$LIBPATH:/usr/teamc/lib
export TC_FIRST_LOGIN=jbond007
export TC_BECOME=jbond007
export TC_FIRST_USER=jbond007
export TC_HOME=/usr/teamc
export TC_TOP=/dpp1
export TC_DBPATH=/home/census
export TC_FAMILY=census@dpp1@9011
export TC_USER=jbond007
cd /home/dpp
clear; teamc tclogin -logout ; teamc tclogin -login

# Team Connection Login : On dpp1 as dppdev in /dpp1/dev/2001
set -o vi
export PATH=$PATH:/usr/teamc/bin
export LIBPATH=$LIBPATH:/usr/teamc/lib
export TC_FIRST_LOGIN=jbond007
export TC_BECOME=jbond007
export TC_FIRST_USER=jbond007
export TC_HOME=/usr/teamc
export TC_TOP=/dpp1
export TC_DBPATH=/home/census
export TC_FAMILY=census@dpp1@9011
export TC_USER=jbond007
cd /dpp1/dev/2001
clear; teamc tclogin -logout ; teamc tclogin -login

# On dpp1 as dppdev
# Env Setup
. /usr/lpp/DPP/DPPsetup dev 2001 53
export root=/dpp1/dev/2001
export file=DPPinstallations.txt
export component="2001-BLD"
export partcomponent=2001
export release=DPP2001
export owner=jbond007
export refresh=
export feature=

# On dpp1 as dppdev
# Build Ids
# Use "let=a" only in case where DEV build doesn't change but ops build does
#RELEASE DPP2001
export let=; export devlet=; export dev="286" ; export ops="222"
#export let=a; export devlet=; export dev="236" ; export ops="196"

# On dpp1 as dppdev

# dpp1 create feature
export id="$(( $dev+100 ))$let" ; export dev="$dev$devlet" ;
print $dev $ops $id

cd $root
export remarks="DEV - Builds DEV $dev and OPS $ops (ID=$id)";
export abstract="$remarks"
print $remarks
teamc Feature -open        \
   -component $component   \
   -remarks "$remarks"     \
   -abstract "$abstract"   \
   -release "$release"     \
   -prefix f -verbose > /tmp/tc.$$
if [[ $? -eq 0 ]]; then
    cat /tmp/tc.$$
    export feature=$(sed -e '/^A new fea/d' /tmp/tc.$$ | awk '{print $6}' | sed -e 's/.$//')
else
    cat /tmp/tc.$$
fi
teamc Feature -accept $feature -answer new_function -verbose


# change release back to DPP2001 if working on maintenance release
export release=DPP2001
teamc WorkArea -create -defect $feature -release $release -owner $owner -verbose
teamc part -checkout $file -workarea $feature -release $release -relative $root 2> /tmp/ee.$$
if [[ $? -ne 0 ]] ; then
   export refresh=$(grep '^ *work area ' /tmp/ee.$$ | awk '{print $3}' | sed -e 's/\.//')
   rm /tmp/ee.$$
   teamc WorkArea -refresh $feature -release $release -source $refresh -verbose
   teamc part -checkout $file -workarea $feature -release $release -relative $root 2> /tmp/ee.$$
fi

# On dpp1 as dppdev
# dpp1 generate text for DPPinstallations.txt
print >> $file
for machine in dpp2 dpp1 ; do
print "$machine|$id|/usr/lpp/DPP/DPP2001_$dev|/usr/lpp/DPP/DPP2001_$dev|/usr/lpp/DPP/DPP2000_OPS_$ops/drivers|"
print "$machine|999|/usr/lpp/DPP/DPP2001_$dev|/usr/lpp/DPP/DPP2001_$dev|/usr/lpp/DPP/DPP2000_OPS_$ops/drivers|"
done >> $file

# On dpp1 as dppdev
# dpp1 manual steps
vi $file
diff $file ~dpp
grep $id $file

# On dpp1 as dppdev
# dpp1 checkin DPPinstallations.txt
if [[ -n $refresh ]]; then
   teamc part -checkin $file -workarea $feature -release $release -relative $root -force -verbose
else
   teamc part -checkin $file -workarea $feature -release $release -relative $root -verbose
fi
print "export feature=$feature; export refresh=$refresh"

# On dpp* as dpp
# Paste above text from dpp1


# On dpp* as dpp
# Extract DPPinstallations.txt
#teamc tclogin -logout ; teamc tclogin -login
. /usr/lpp/DPP/DPPsetup dev 2001 53
export root=$PWD
export file=DPPinstallations.txt
export component="2001-BLD"
export release=DPP2001
export owner=jbond007
teamc part -extract $file -workarea $feature -release $release -relative $root -verbose
clear; diff $file /usr/lpp/DPP

# On dpp1 as dppdev
# Complete fix record
teamc Fix -complete -workarea $feature -release $release -component $partcomponent -verbose

# On dpp1 as dppdev
# TC Workarea commands

# reset release back if working on maintenance release
#export release=DPP2001_AIAN

#If doing a dev build continue, otherwise stop.
export type=integrate
export IntegrateWorkAreas=$(teamc Report -raw -view WorkareaView -where "releaseName in ('$release') and state in ('$type')" | cut -f1 -d'|' | tr '\n' ' ')
print $IntegrateWorkAreas

# TC Driver commands
export driver=DPP2001_$dev
print $driver
export driverType=prodDPP
teamc Driver -create $driver -release $release -type $driverType -verbose
teamc DriverMember -create -driver $driver -release $release -workarea $IntegrateWorkAreas -verbose

#teamc Report -raw -view DriverView -where "releaseName in ('$release') order by addDate, commitDate"

# TC Driver commands: check results manually
teamc Driver -check $driver -release $release -verbose

# TC Driver commands (do only after Check is successful)
teamc Driver -commit $driver -release $release -verbose
teamc Driver -complete $driver -release $release -verbose

teamc Driver -long -view "$driver" -release "$release" | Mail -s "DPP $driver Content" james.bond@census.gov

# List of WorkAreas in this build
x=
for i in $IntegrateWorkAreas ; do
   x="$x'$i',"
done
x=$(print $x | sed -e 's/.$//')

teamc Report -raw -view ChangeView -where "workAreaName in ($x) and releaseName in ('$release') order by workAreaName ,releaseName ,pathName" # | Mail -s "DPP $driver Changes" james.bond@census.gov

teamc Report -raw -view ChangeView -where "workAreaName in ($x) and releaseName in ('$release') order by workAreaName ,releaseName ,pathName" > /tmp/$$.txt
cat /tmp/$$.txt | grep -v '|link|' | cut -d'|' -f2,3,6,7,12 | Mail -s "DPP $driver Changes" james.bond@census.gov

# Check if any build-related files are out-of-sync
cd /home/dpp
for file in DPPinstallations.txt SSinstallations.txt DPPsetup Directories ; do
   if ! diff $file /usr/lpp/DPP > /dev/null ; then
      print "File $file is different"
   fi
done # | Mail -s "Diffs on $(hostname)" james.bond@census.gov
   4. ABOUT THE FUNCTIONAL CAPABILITIES OF THE DPP SYSTEM
This section describes important concepts regarding the functional capabilities of the DPP system. Some
subsections address the implementation of complex requirements, while others expand on topics that are
complex enough to need further description. It answers the question “Tell me about….”

4.1.      About Detail databases
Detail databases contain a record for every block, housing unit/GQ, person, and person response (for
certain fields) along with many supporting metadata tables. The process behind building a detail
database is explained in the following sections.
4.1.1. Preparing for a Detail Database Build
A great deal of preparation goes into each detail database build, as depicted in this graphic:




[Figure 4 (graphic): the inputs to a detail database build (detail files, detail file metadata, and geography) feed the preparation steps described below.]
Figure 4: Overview of Detail Database Build process

These steps are explained below.
4.1.1.1.       Input Files
Detail Files
Detail files are discussed elsewhere in this document.
Metadata
Detail file metadata is delivered in the form of Excel worksheets and Word documents:
         Excel worksheets list the record types, one per sheet. Each sheet contains the following
          columns: record type, variable number, beginning column, ending column, variable name,
          description, lo value, high value, value description, explanatory note, alpha or numeric indicator,
          recode indicator, deleted variable indicator, and date of change. Here is an example of the
          metadata for the QSEX variable (last 5 columns omitted):

           RT     NO          BEG       END       LEN        VAR       DESC   LO   HI      V DESC
          3       3011        40        40        1          QSEX      Sex    1            Male
          3       3011        40        40        1          QSEX      Sex    2            Female
Table 40: Example of QSEX metadata delivered via Excel worksheet; last 5 columns omitted

         Word documents are delivered for some variables when the list of valid values is considered too
          large for the spreadsheet. These Word documents are formatted for print, and therefore need to
          be manipulated in order to extract the list of valid values.
This metadata is used to prepare the textual database definition (TDD) files, which capture the structure
of the database and the classifications. The process is very time consuming, mostly due to the format of
the metadata: TDD creation takes two to three person-weeks, depending on the type of detail file and the
number of supplemental detail files.
Geography
Geography is discussed elsewhere in this document.
4.1.1.2.        Timing
Detail files are produced by DSCMO and are released on a flow basis by state. Geography files are
produced by GEO and are also released on a flow basis by state. We may get more than one version of
each file since errors are sometimes detected by the provider or by DPP.
The input file delivery schedule, along with the product release schedule, causes us to start tabulating
immediately using state-level databases rather than waiting to build a US-level database (where
possible).
4.1.1.3.        Pre-Data-Preparation Validation
There are three pre-data-preparation validation routines in the system:
CheckData Program
During the detail data retrieval process in the Get script, a SAS program checks the values for most
fields against the expected valid values (found in the TDD, which is explained later). This program
captures every error, rather than stopping at the first error as the database builder (snbu) does. This
error report allows us to report all data issues to the provider at one time. Sometimes, the resolution is to correct the delivered
metadata but often detail files are redelivered due to the findings of this program.
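The actual CheckData program is written in SAS; the following awk sketch only illustrates the approach of reporting every invalid value rather than stopping at the first one. The file names and layouts are assumptions (tdd_values.txt lines: <field number>|<valid value>):

    awk -F'|' '
       FNR == NR { valid[$1, $2] = 1 ; next }      # load TDD valid values
       {
          for (f = 2; f <= NF; f++)                # field 1 assumed to be a record id
             if (!((f, $f) in valid))
                printf "record %d, field %d: invalid value \"%s\"\n", FNR, f, $f
       }' tdd_values.txt detail.dat > CheckData.rpt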
VerifyCounts Program
This program is the first of two Tab script stage 1000 validation programs that are run prior to data
preparation. The program compares the HU100 and POP100 counts on the detail file block records to
the sum of the actual housing and person records for each block:
         Each HU and HU person record counts as one.
         The GQ records were not counted since GQs are not included in the HU100 count.
         The GQ person records were weighted per the rules for adjusted/unadjusted files. On the
          adjusted detail files, GQ person records with qgqtyp=’099’ were counted as –1 (otherwise as +1).
          On the unadjusted detail files, GQ person records with qgqtyp not equal to ‘099’ or ‘098’ were
          counted as +1 (otherwise 0).
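A minimal sketch of these weighting rules follows; apart from the qgqtyp values, the function and its arguments are illustrative assumptions:

    # Weight contributed by one GQ person record toward the POP100 check.
    function GQPersonWeight
    {
       typeset filetype=$1 qgqtyp=$2
       if [[ $filetype == adjusted ]] ; then
          if [[ $qgqtyp == 099 ]] ; then print -- -1 ; else print 1 ; fi
       else   # unadjusted
          if [[ $qgqtyp == 099 || $qgqtyp == 098 ]] ; then print 0 ; else print 1 ; fi
       fi
    }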
A report file is produced, which lists the actual counts, expected counts, and the total number of errors:




    Verification of SDF Counts
    STATE: DE
    TYPE:   U
    25JUL03

                              HU100        HWTs tallied   POP100       PWTs tallied
           Block ID           from block   within         from block   within
                              record       block          record       block
    100010401001000           8            12             14           21
    100010401001001           14           15             34           32
    100010401001002           6            7              17           20
    .
    <many rows skipped>
    .
    100050519002094           7            8              21           18
    100050519002095           14           7              17           13
    100050519002098           5            8              14           4

           SUM BLOCKS          SUM HU100    SUM HCNT      SUM POP100   SUM PCNT
           13302               338091       338091        770643       770643

        Sum of blocks with HU mismatched : xxxxx
        Sum of blocks with POP mismatched : yyyyy
    zzzzz errors/mismatches/differences found.

    Table 41: Example of VerifyCounts report file (numbers have been altered)

The numbers in the above report have been altered for each block, yet they still convey the point:
because this is a sample file, the housing and person weights tallied within a block match the block
record’s HU100 and POP100 only coincidentally. The grand totals of the housing and person weights for
the state, however, do match the HU100 and POP100 totals.
Note that the same report for a 100% detail file would show matches for all blocks.
DF-DGFConsistency Program
This program is the second of two Tab script stage 1000 validation programs that are run prior to data
preparation. The program compares the contents of the detail file block records to the DGF block
records.
A report file is produced, which lists input file record counts, mismatch counts, match counts, detailed
information for any mismatches, and total number of errors:




    SEDF - DGF Consistency Check

    SEDF: /dpp2/prod/AIAN/datafiles/SEDF_DE_U
    DGF: /dpp2/prod/AIAN/datafiles/DGF_AIAN_DE

    Site: DE
    Date: 25 July 2003
    Time: 16:45
    Brief/Verbose Mode: B
    Compare POP & HU: N

             Consistency Checking Report:
    /dpp2/prod/AIAN/reports/SEDF_AIAN_DGF_Consistency_DE.rpt

    ***********************************************************************

        # of BLOCKS ON SEDF FILE: xxxxx
        # of BLOCKS ON DGF (FOR THIS SITE): xxxxx
        # of BLOCKS ON DGF BUT NOT ON SEDF: 0
        # of BLOCKS ON SEDF BUT NOT ON DGF: 0
        # of MATCHING BLOCKS (I.E., THEY ARE ON BOTH FILES): xxxxx

    NUMBER OF NONMATCHES BY VARIABLE:

        state NONMATCHES:      0
        aianhh NONMATCHES:     0
        intptlat NONMATCHES:   0
        .
        <many rows skipped>
        .
        intptlon NONMATCHES:   xxxxx
        <many rows skipped>

    xxxxx errors/mismatches/differences found.


    Table 42: Example of DF-DGFConsistency report file



4.1.1.4.          Data Preparation
Data preparation occurs in Tab stage 1000. The outputs are listed below:
    Type of Output    File Naming Convention
    Block             <Detail File>_<State>_2000_snub_<unadjusted|adjusted>.1.dat
    Hu/GQ             <Detail File>_<State>_2000_snub_<unadjusted|adjusted>.2.dat
    Person            <Detail File>_<State>_2000_snub_<unadjusted|adjusted>.3.dat
    QRACE             <Detail File>_<State>_2000_snub_<unadjusted|adjusted>_QRACEMULTI.dat
    QANCES            <Detail File>_<State>_2000_snub_<unadjusted|adjusted>_QANCESMULTI.dat
    ANCES             <Detail File>_<State>_2000_snub_<unadjusted|adjusted>_ANCESMULTI.dat
            Table 43: Outputs from data preparation

The main steps in the data preparation process are described below:
Adding Geographic Fields
We cannot completely trust the contents of the geographic fields on the detail file block records, so the
first task in data preparation is to merge about 50 fields from the DPP geography files into the detail file
block records. Sometimes we write both versions of a field to the output data file (e.g., sdelm from the
block record and gsdelm from the geography file). When we do retain both versions of a field, it’s usually
because the analyzers are produced using detail file geography codes, and we need the detail file
version of a field to match the analyzer version when doing ad-hoc queries. Usually, however, we write
only the field from the geography file to the output file.
We also add a unique block number to the block records. This field, called UNQBLKVS, is referenced in
the geographic recodes. It’s calculated using the following equation:
                                             <State> * 10000000 + counter
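For example, a minimal sketch of this calculation (using Delaware's state FIPS code of 10; the file names and record layout are assumptions):

    # Append a unique block number (UNQBLKVS) to each block record.
    state=10
    counter=0
    while read blockrec ; do
       counter=$(( counter + 1 ))
       print "$blockrec|$(( state * 10000000 + counter ))"
    done < blocks.dat > blocks_unqblkvs.dat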
Supplemental Detail File
The data preparation step includes a process where we merge additional recodes in from another file.
We call these files “supplements,” and they contain the same record types as the detail files (except block
records). This was done for three database builds as explained below:
           Database     Built for This Product   Also Used By These Products                        Supplement File(s) Used
           HDF          uPL                      uSF1 (all), uSF2 (all), uSF3, SDTT, uSF4, u108_H   N/A
           HEDF         aPL                      N/A                                                N/A
           SDF          uPF3                     N/A                                                SDF Supplement
           SEDF         uSF3                     uSF4, AIAN, u108_S                                 SEDF Supplement One
           SEDF_SD      SDCO                     SDxx (except SDTT), SDCOSS, SDCOSP                 SEDF Supplement One and Two
          Table 44: Mapping of databases and supplement files used

Merging Record Types
In data preparation, we perform two kinds of record type merges:
         Housing unit and group quarters records are merged to form the HU/GQ record type.
         Housing unit person and group quarters person records are merged to form the Person record
          type.
We merge these record types together for ease of use in SuperCROSS. However, many fields exist for
only one of the two record types involved in the merge. After the merge, the records that previously did
not have a field are filled with spaces. This is dangerous, since spaces may already be a valid value for
the field in question, so a new valid value is needed as a flag. Often the new valid value is “9,” and the
meaning is either “Not Applicable (GQ Record)” or “Not Applicable (HU Record).”
We can use the field HHT for an example of this situation. HHT is on the HU record type, but not on the
GQ record type. The valid values for HHT are:
          Value        Meaning
         0             NA (Vacant)
         1             Married couple family household
         2             Other family household: Male householder
         3             Other family household: Female householder
         4             Nonfamily household: Male householder: Living alone
         5             Nonfamily household: Male householder: Not living alone
         6             Nonfamily household: Female householder: Living alone
         7             Nonfamily household: Female householder: Not living alone
          Table 45: Valid values for HHT prior to merging record types

Since we’re merging the HU and GQ record types into the HU/GQ record type, we need to create a
new valid value for the GQ records:



          Value        Meaning
         9             Not Applicable (GQ Record)
          Table 46: An additional valid value for HHT after merging record types

Working With Existing Fields
Many existing detail file fields need to be manipulated:
         Are blanks valid values? – Often, fields contain blanks as a valid value. The data preparation
          program replaces these blanks with another character (determined on a field-by-field basis, but
          many times ‘8’ or ‘B’ is the replacement character). Why? If we were to allow whitespace
          characters as valid values, we might miss an entire category of persons, housing units, or group
          quarters in the user interface.
         Does the field contain monetary values? – Monetary value fields are split into two fields during
          data preparation. One remains numeric and is used to sum, and the new field is a classification
          and is often used for universe restrictions. An example is “HINC-Household Total Income.” The
          new version of HINC is called “HINC_CODE-Household Total Income” with these valid values:
          0,"Not in universe (No income, NA, vacant)"
          1,"$1 or break even"
          2,"$2 to $9,999,998"
          3,"$9,999,999 or more"
          4,"Loss of $1 to $59,998"
          5,"Loss of $59,999 or more"
          9,"Not applicable (GQ record)"
Creating New Fields
We sometimes need to create new fields in order to make things easier in SuperCROSS. These new
fields are often summation options. An example for the HDF is the “Official Persons” summation option,
which is created based on business rules and is used to “count” persons. The HDF is a 100% detail file,
so each person record is self-representing and receives a weight of 1 for this summation option.
Multi Response Fields
Most tabulations place a person into a single category, therefore counting that person once. Multi
response fields allow the user to count each person more than once. We accomplish this by creating files
during data preparation that list each of the responses for each of these fields:
          Field         Meaning                             Maximum Responses      Detail File(s)
         QRACE          Race                                8                      100% and Sample
         QANCES         Ancestry                            2                      Sample
         ANCES          Corrected Ancestry                  2                      Sample
          Table 47: Multi-Response Fields

The SuperSTAR software suite understands the concept of multi-response fields and can use these in
tabulations.
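A minimal sketch of this expansion, writing one output record per response (only three of the possible responses are shown, and the input layout and file names are assumptions):

    # Emit one QRACE_MULTI record for each non-empty response.
    while IFS='|' read personid qrace1 qrace2 qrace3 ; do
       for r in $qrace1 $qrace2 $qrace3 ; do
          [[ -n $r ]] && print "$personid|$r"
       done
    done < persons.dat > QRACEMULTI.dat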
4.1.1.5.       Post-Data-Preparation Validation
This section describes verification done manually. It is not part of the DPP system.
In the development area, the post-ETL data validation consists of a manual frequency distribution
verification process. Before the PrepData program manipulates any detail data, we perform frequency
distributions for most fields. We do the same towards the end of the PrepData program. We then
manually compare the counts and look for expected changes (e.g., blanks being converted to another
value) and unexpected changes (an error in the data preparation process). Example counts are shown
below for the RISSTAT field:




         Value       Meaning                                        Before Data Prep            After Data Prep
         0           No response or not in                          101                         101
                     universe
         1           Occupied                                       188000                      188000
         2           Occupied – continuation                        11                          11
                     (forms attached)
         3           Vacant – Regular                               4321                        4321
         4           Vacant – Usual home                            1023                        1023
                     elsewhere
         5           Demolished/Burned out                          212                         212
         6           Cannot locate                                  45                          45
         7           Duplicate                                      23                          23
         8           Nonresidential                                 1044                        1044
         9           Other (open to elements,                       74                          74
                     condemned, under
                     construction, etc.)
         G           Not Applicable (GQ                             Not a valid                 52
                     Record)                                        value
          Table 48: Example of post-data prep validation for one field; numbers are synthetic

The value “G” is added during data preparation to represent GQ records, which do not have the RISSTAT
field (the merging of record types is explained earlier in this section under “Data Preparation”).
4.1.1.6.       Generic Textual Database Definition
A textual database definition (TDD) is Space-Time Research’s proprietary way of describing a database.
In DPP, we check a generic set of TDD files into TeamConnection for each type of detail database we
want to build (e.g., SEDF, SEDF_SD, etc.). This generic TDD contains “tags” (e.g., <dbid>) that are
replaced when running the script that creates custom TDDs for each state. Generally, these tags are
replaced with information specific to that state (directories, database ids, etc.).
The naming convention of the generic set of TDD files includes the type of detail database. For example,
the DBCatalog file for the HDF is named HDFDBCatalog.csv. All of the generic TDD files are described
below:
Project File
The project file lists the database filename prefix and directory information for the build:
    prefix=<dbid>
    smbdir=.
    sxv4dir=<sxv4 directory>
    tdbdir=<specific directory>
    javadir=<specific directory>
DBCatalog
The DBCatalog file contains global database info like database id, input file record separator (0D 0A
translates to carriage return and newline), and the path to the input:
    <dbid>,0D 0A,0,1
    path=<directory>
DBColumns
The DBColumns file contains a record for every column from every table in the database. Here is an
example from the block record:
    Block,RT_DPP1-Record Type for DPP,RT_DPP1-Record Type for DPP,string,1,0,1
The line above tells us that there is a string field named “RT_DPP1-Record Type for DPP” in the block
record; it starts at column 0 and is 1 character in length.



DBDelim
The DBDelim file defines the column delimiters for the value set text files. An example for the QAGE
value set is:
    QAGEVS,2c,22
The above means that the column separator is a comma (2C is hex for comma) and that double quotes
(22 is hex for double quote) may surround a column of data.
DBFiles
The DBFiles file contains a record for every source file, which includes facts, classifications, and
mandatory TDD files. Here is a subset of the file:
    Flat:
    QAGEVS,<general directory>/QAGEVS.csv
This tells us the location of the QAGEVS source file.
DBForeignKeys
The DBForeignKeys file has an entry for every foreign key for the fact and classification tables. An
exmple entry for the “QAGE-Age” field is:
    Person,"QAGE-Age":QAGEVS,code
The above indicates that the QAGE-Age column from the Person table is a foreign key, and that it
references the primary key named “code” in the QAGEVS table.
DBPrimaryKeys
The DBPrimaryKeys file has an entry for every primary key for every kind of database table (fact,
classification, mandatory).
    QAGEVS,code
The above entry tells us that the field named “code” is the primary key for the QAGEVS table.
DBTables
The DBTables file has an entry for every database table. The example for QAGE is:
    QAGEVS,F

This entry tells us that the QAGEVS table is represented in a flat file.
BINS
The BINS file is required for our detail database builds, but is empty.
CLASSIFICATIONS
This file lists the value set table names along with the columns. The example for QAGE is:
    QAGEVS,code,name
DATABASE_LEVEL
The contents of this file are always a single character, N.
FACTS
The FACTS file lists the database levels. The HDF version looks like:
    Block
    HU/GQ
    Person
    QRACE_MULTI
MEASURES
Measures wind up being summable fields in SuperCROSS.
A processed 100% detail file contains 11 measures, e.g. Official Persons, NP-Number of Persons at this
Unit, and the land and water areas from both the block detail records and the DGF block records.
A processed sample detail file contains the same measures as above, but is also much richer in content.
Therefore, several dozen more summable fields are included as measures. Some examples are Annual
Cost of Electricity, Aggregate Household Wages and Salary Income, and Income Deficit for Families.
__SUPER_CHANNEL__
The __SUPER_CHANNEL__ file is required for database builds and always contains the following
entries:
    FACTS,FACTS
    CLASSIFICATIONS,CLASSIFICATIONS
    MEASURES,MEASURES
    BINS,BINS
    DATABASE_LABEL,DATABASE_LABEL
    TABLE_LABEL,TABLE_LABEL
    COLUMN_LABEL,COLUMN_LABEL
    COUNT_DEFAULT,COUNT_DEFAULT
4.1.2. Building a Detail Database
The detail database build process is usually quite easy, since most of the time and effort is spent in the
preparation stage. The build process mixes all of the ingredients described earlier and outputs one “sxv4”
file, which is the database.
4.1.2.1.       Customizing the Textual Database Definition
Databases are created for each state (and sometimes the US), and a set of TDD files must be created for
each. The PrepMetadata script, which is run early on in setting up an environment, does this
customization for us. This script does a search and replace on tags embedded in the generic TDD, adds
geography-related rows to some files, and writes the customized TDD files to a state-specific directory.
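A minimal sketch of this tag substitution for the project file follows; the sed expressions and file names are assumptions based on the tags and values shown in the tables below:

    dbid=HDFMDU
    sxv4dir=/dpp2/prod/uSF1F/STR/databases
    tdddir=/dpp2/prod/uSF1F/TDD/HDF/MD/unadjusted
    sed -e "s|<dbid>|$dbid|g" \
        -e "s|<sxv4 directory>|$sxv4dir|g" \
        -e "s|<specific directory>|$tdddir|g" \
        generic.proj > $tdddir/$dbid.proj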
This section shows an example of the customization for each file:
Project File
    Generic Project File Entry              Change?   Customized Project File Entry
prefix=<dbid>                               Yes       prefix=HDFMDU
smbdir=.                                    No        smbdir=.
sxv4dir=<sxv4 directory>                    Yes       sxv4dir=/dpp2/prod/uSF1F/STR/databases
tdbdir=<specific directory>                 Yes       tdbdir=/dpp2/prod/uSF1F/TDD/HDF/MD/unadjusted
javadir=<specific directory>                Yes       javadir=/dpp2/prod/uSF1F/TDD/HDF/MD/unadjusted
Table 49: Customized Project File from Detail Database TDD

DBCatalog
   Generic DBCatalog File Entry             Change?   Customized DBCatalog File Entry
<dbid>,0D 0A,0,1                            Yes       HDFMDU,0D 0A,0,1
path=<directory>                            Yes       path=/dpp2/prod/uSF1F/TDD/HDF/MD/unadjusted
Table 50: Customized DBCatalog File from Detail Database TDD

DBColumns
The DBColumns file does not require customization for a particular state, but two geography-related
records are added during customization:




  Generic DBColumns File Entry                        Change?   Customized DBColumns File Entry
Block,RT_DPP1-Record Type for DPP,RT_DPP1-Record      No        Block,RT_DPP1-Record Type for DPP,RT_DPP1-Record
Type for DPP,string,1,0,1                                       Type for DPP,string,1,0,1
                                                      Yes       U,code,category key,long,60,0,0
                                                      Yes       U,name,category name,string,128,1,0
Table 51: Customized DBColumns File from Detail Database TDD

The “U” table is the value set table for the UNQBLKVS field on the Block table. This field is referenced in
the geography recode. In an earlier version of the DPP system, many more product-specific geographic
entries would have been added to the DBColumns file to support geography hierarchies.
DBDelim
The DBDelim file does not require customization for a particular state, but a geography-related record is
added during customization:
   Generic DBDelim File Entry                       Change?                 Customized DBDelim File Entry
QAGEVS,2c,22                                        No              QAGEVS,2c,22
                                                    Yes             U,2c,22
Table 52: Customized DBDelim File from Detail Database TDD

This “U” table is just like other tables - the column separator is a comma (2C is hex for comma) and
double quotes (22 is hex for double quote) may surround a column of data.
DBFiles
The DBFiles file requires customization for a particular state and we add one geography-related record:
    Generic DBFiles File Entry              Change?   Customized DBFiles File Entry
Flat:                                       No        Flat:
QAGEVS,<general directory>/QAGEVS.csv       Yes       QAGEVS,/dpp2/prod/uSF1F/TDD/HDF/general/QAGEVS.csv
                                            Yes       U,/dpp2/prod/uSF1F/TDD/HDF/MD/U
Table 53: Customized DBFiles File from Detail Database TDD

DBForeignKeys
The DBForeignKeys file does not require customization, given the way we currently build detail databases.
DBPrimaryKeys
The DBPrimaryKeys file customization adds one geography-related record:
Generic DBPrimaryKeys File Entry                    Change?             Customized DBPrimaryKeys File Entry
QAGEVS,code                                         No              QAGEVS,code
                                                    Yes             U,code
Table 54: Customized DBPrimaryKeys File from Detail Database TDD

DBTables
The DBTables file customization adds one geography-related record:
   Generic DBTables File Entry                      Change?                Customized DBTables File Entry
QAGEVS,F                                            No              QAGEVS,F
                                                    Yes             U,F
Table 55: Customized DBTables File from Detail Database TDD

BINS
The BINS file does not require customization.
CLASSIFICATIONS
The CLASSIFICATIONS file customization adds one geography-related record:
 Generic CLASSIFICATIONS File Entry              Change?        Customized CLASSIFICATIONS File Entry
QAGEVS,code,name                                    No              QAGEVS,code,name
                                                    Yes             U,code,name
Table 56: Customized CLASSIFICATIONS File from Detail Database TDD

DATABASE_LEVEL
The contents of this file are always a single character, N, and are never customized.
FACTS
The FACTS file does not require customization.
MEASURES
The MEASURES file does not require customization.
__SUPER_CHANNEL__
The __SUPER_CHANNEL__ file does not require customization.
4.1.2.2.       Running SNBU
The Space-Time Research database builder is named snbu. We start snbu by issuing the following
command:
    snbu -in:tdd -out:sxv4 <project file>
The utility writes a Java program, compiles it, and then runs it to build the database. These steps are
explained briefly below:
Writing the Java Program
The snbu utility loads the TDD files into a text driver and writes a Java program to the javadir directory
(defined in the project file). This program contains a Java method that creates and “channels” (i.e., loads)
every table into the database. The text driver finds the relationships between the tables, works out the
order in which the tables must be channeled to the database driver, and ensures that referenced tables are
copied before referring tables.
Compiling and Running the Java Program
After the java program is written, the snbu utility compiles and runs the java program. This channels all of
mandatory, classification, and fact tables into the database. A snippet of the log from the AIAN US
database is included below:
    snbu version snbu-1_8
    Not reading .snbu file from /home/dppprod
    Project file: SEDFUSUAIAN.proj
    assigning prefix='SEDFUSUAIAN'
    assigning smbdir='.'
    assigning sxv4dir='/dpp2/prod/AIAN/STR/databases'
    assigning tdbdir='/dpp2/prod/AIAN/TDD/SEDF/US/unadjusted'
    assigning javadir='/dpp2/prod/AIAN/TDD/SEDF/US/unadjusted'
    facts: ['Block Record', 'HU/GQ Record', 'Person Record', 'QRACE_MULTI',
    'QANCES_MULTI', 'ANCES_MULTI']
    bins: {}
    /usr/java131a/bin/javac -J-mx128m -d /dpp2/prod/AIAN/TDD/SEDF/US/unadjusted -classpath /usr/lpp/STR/package53/jpython/jpython.jar::/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/jar/junk.jar:/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/jar/sxv4.jar:/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/jar/sctextdriver.jar /dpp2/prod/AIAN/TDD/SEDF/US/unadjusted/ChannelSEDFUSUAIAN.java

    /usr/java131a/bin/java -Djava.library.path=/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/lib -Dsxv4driver.home=/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/lib -classpath /dpp2/prod/AIAN/TDD/SEDF/US/unadjusted:/usr/lpp/STR/package53/jpython/jpython.jar::/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/jar/junk.jar:/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/jar/sxv4.jar:/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/jar/sctextdriver.jar ChannelSEDFUSUAIAN
    Cleaning up after last sxv4 use and creating directories...
    No cleaning needed - /dpp2/prod/AIAN/STR/databases/SEDFUSUAIAN.sxv4 does not exist
    ...clean
    Loading str.jdbc.sxv4.Driver from file:/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/jar/sxv4.jar
    ...loaded
    Connecting jdbc:sxv4:/dpp2/prod/AIAN/STR/databases/SEDFUSUAIAN
    ...connected
    Loading str.jdbc.sctextdriver.JDBCLayer.SCTextDriver from file:/usr/lpp/STR/package53/SuperSTAR2/SNBU/V1.8/jar/sctextdriver.jar
    ...loaded
    Connecting jdbc:sctextdriver:/dpp2/prod/AIAN/TDD/SEDF/US/unadjusted/SEDFUSUAIAN
    ...connected
   started "__SUPER_CHANNEL__"

   --   channeled 8 rows to "__SUPER_CHANNEL__"
   --      0hrs 0min 0sec
   started "FACTS"

   --   channeled 6 rows to "FACTS"
   --      0hrs 0min 0sec
   started "CLASSIFICATIONS"

   --   channeled 499 rows to "CLASSIFICATIONS"
   --      0hrs 0min 0sec
   started "MEASURES"

   --   channeled 46 rows to "MEASURES"
   --      0hrs 0min 0sec
   started "DATABASE_LABEL"

   --   channeled 1 rows to "DATABASE_LABEL"
   --      0hrs 0min 0sec
   Creating tables...
   ...created all tables
   Channeling tables...
   started "CNPLVS"

   --   channeled 10000 rows to "CNPLVS"
   --      0hrs 0min 0sec
   .
   <many lines skipped>
   .
   --   channeled 0 rows to "SOILDVS"
   --      0hrs 0min 0sec
   ...channeled tables
   Total time to channel TDB _only_:
    ***        0hrs x min y sec                 ***

4.1.3. Post-Database-Build Validation
This section describes validation which was performed manually. It is not part of the DPP system.
After a database is built, development uses SuperCROSS to perform frequency distributions. We
manually check these frequency distributions against those done at the end of the PrepData program.
Any differences are a clear indication that an error exists in the build process. Example counts are shown
below for the RISSTAT field:
         Value   Meaning                                        Before DB Build   After DB Build
         0       No response or not in universe                 101               101
         1       Occupied                                       188000            188000
         2       Occupied – continuation (forms attached)       11                11
         3       Vacant – Regular                               4321              4321
         4       Vacant – Usual home elsewhere                  1023              1023
         5       Demolished/Burned out                          212               212
         6       Cannot locate                                  45                45
         7       Duplicate                                      23                23
         8       Nonresidential                                 1044              1044
         9       Other (open to elements, condemned,            74                74
                 under construction, etc.)
         G       Not Applicable (GQ Record)                     52                52
          Table 57: Example of post-database build validation; numbers are synthetic
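Although this comparison is performed by eye, the idea is mechanical: the distribution taken before the
build must match the one taken after. The following minimal Python sketch (illustrative only; the DPP
process itself uses SuperCROSS and manual review) shows the comparison. The function name and the
synthetic counts are assumptions, not part of the DPP system.

    # Hypothetical helper illustrating the manual check: compare a frequency
    # distribution taken before the database build (from PrepData) with one
    # taken after (from SuperCROSS). All counts below are synthetic.
    def compare_distributions(before: dict, after: dict) -> list:
        """Return (value, before_count, after_count) for every mismatch."""
        mismatches = []
        for value in sorted(set(before) | set(after)):
            b, a = before.get(value, 0), after.get(value, 0)
            if b != a:
                mismatches.append((value, b, a))
        return mismatches

    # Synthetic RISSTAT counts mirroring Table 57.
    prepdata   = {"0": 101, "1": 188000, "2": 11, "G": 52}
    supercross = {"0": 101, "1": 188000, "2": 11, "G": 52}

    for value, b, a in compare_distributions(prepdata, supercross):
        print(f"RISSTAT={value}: before={b} after={a}  <-- build error suspected")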

4.1.3.1.       Cataloguing
One of the functions of the scstools utility is to catalogue databases so that the SuperCROSS client and
the production system may use them. We run the scstools utility by using the following command:
   scstools -db <catalog> -un <user> -pw <password> -file <catalog script>
The catalog script provides commands to delete and add folders and databases, and to assign
permissions. Here is an example for the AIAN US database:
    delete "SEDFUSUAIAN"
    delete SEDFUSU_folder
    add root SEDFUSU_folder "SEDFUSU Folder" F
    permission add all SEDFUSU_folder
    add SEDFUSU_folder "SEDFUSUAIAN" "SEDFUSUAIAN DB (CR: 01/01/04)" D
    /dpp2/prod/AIAN/STR/databases/SEDFUSUAIAN.sxv4 XTAB
    permission add all "SEDFUSUAIAN"
    print
    exit
Cataloguing takes place in Tab stage 2050 for ad-hoc queries (using SuperCROSS) and in Tab stage 3000
for production tabulations. When SNBU or SuperCROSS Server opens the catalogue, every database in
the catalogue is verified.
4.1.3.2.       Creating TXDs
The creation and release of a development detail database bootstraps the TXD creation process in DPP
OPS. This development detail database has passed all of the data validation routines described earlier
and is expected to be identical to the production version of the detail database.




4.1.3.3.       Production Tabulation
Production tabulations access the database catalogue via the production system (ss2ps). The ss2ps
utility reads the catalogue, loads basic information about all of the referenced databases, loads a
geography recode referenced during the startup phase, loads any recodes mentioned in the tabulation
request, tabulates one or more tables, and writes output.
4.1.3.4.       Ad-Hoc Tabulation
DPP team members can use SuperCROSS to access and tabulate databases that have been catalogued
in a SuperSERVER server.

4.2.      About the Divide and Conquer Approach to tabulation
All work on the DPP system is ultimately focused on creating accurate summary files as quickly as
possible. Due to the sheer size of the Decennial products, production processing for a single product can
run continuously for days, weeks, or in some cases, months on end. Therefore, virtually all products
require multiple executions of the DPP system. The DPP system supports a “divide-and-conquer”
approach to creating products that gives the DPP system operator considerable control over how a
product is run.
For example, Summary File 4 contains 2,433,869,249,280 (2.4 trillion) data cells. With thresholding and
various performance enhancements, the actual number of tabulated cells was reduced to 78,744,099,280
(78 billion). Despite this 96% reduction in the number of cells to tabulate, SF4 ran continuously for
approximately three weeks – 24 hours a day, across three UNIX machines, each with 24 processors, 96
GB of memory, and a large pool of shared disk space.
Although the general order in which commands are run is documented in the Cookbook, there’s
considerable flexibility in the system so the actual runtime sequence of commands is left to the operator.
The goal of production workload submission is to achieve maximum throughput by keeping the assigned
computer(s) fully engaged, but not over-committed, every minute of the day, until done.
Each stage of each script in the DPP system has a different performance profile. Some stages are I/O
intensive (for example, ProcessGeo stage 20); others are memory intensive (for example, ss2ps
tabulation in Tab stage 3000). A proven successful approach has been to keep a mix of these
performance-types active in each computer at the same time. Too many I/O intensive processes can
lead to severe performance degradation with long I/O wait-times. Likewise, too many ss2ps processes
can overwhelm virtual memory and cause swapping.
The DPP system leaves these runtime workload decisions to the operator. By ensuring that the
performance profile and waveability of each stage are documented, the operator can construct a runtime
strategy that optimizes the creation of each product on the available hardware.
4.2.1. Structuring the DPP system to use an integrated logical file system
A user’s DPP environment lives under a directory structure called $DPPwork/$DPPenv. As an example,
a PROD user running uSF4 uses /dpp2/prod/uSF4 as the root of their work environment. To support
the distribution of work across multiple computers, the DPP system assumes this directory structure is
available across all machines involved in the creation of this product. There are three types of file
systems in the DPP system:
       Name                  Type of File System                       Purpose
 Shared File System    Local file system, NFS exported for       Used to share files hosted on a local
                       use on other machines                     machine
 Private File System   Local file system, not NFS exported       Used to store temporary files that do
                                                                 not need to be visible across machines
 Remote File System    NFS mounted file system                   Used to access files hosted on a
                                                                 remote computer
Table 58: Types of DPP File Systems

Here’s an example of how file systems were used in the tabulation of Summary File 4:


           Machine                                      File System           Local or remote file system?
dpp2.dads.census.gov                          /dpp2/prod/uSF4                Shared File System
dpp2.dads.census.gov                          /dpp2/prod/uSF4/SXoutput       Shared File System
dpp2.dads.census.gov                          /dpp2/prod/uSF4/datafiles      Shared File System
dpp2.dads.census.gov                          /dpp2/prod/uSF4/sf             Private File System
dpp2.dads.census.gov                          /dpp2/prod/uSF4/geo            Private File System

dpp3.dads.census.gov                          /dpp2/prod/uSF4                Remote File System
dpp3.dads.census.gov                          /dpp2/prod/uSF4/SXoutput       Remote File System
dpp3.dads.census.gov                          /dpp2/prod/uSF4/datafiles      Remote File System
dpp3.dads.census.gov                          /dpp2/prod/uSF4/sf             Private File System
dpp3.dads.census.gov                          /dpp2/prod/uSF4/geo            Private File System

dpp4.dads.census.gov                          /dpp2/prod/uSF4                Remote File System
dpp4.dads.census.gov                          /dpp2/prod/uSF4/SXoutput       Remote File System
dpp4.dads.census.gov                          /dpp2/prod/uSF4/datafiles      Remote File System
dpp4.dads.census.gov                          /dpp2/prod/uSF4/sf             Private File System
dpp4.dads.census.gov                          /dpp2/prod/uSF4/geo            Private File System
Table 59: Example of NFS disk layout for SF4

For performance reasons, there may be multiple file systems under /dpp2/prod/uSF4 – for example, a
separate striped geo file system is usually created to support I/O-intensive SAS programs. Although the
number and type of file systems vary by product, the general principle of exporting and mounting NFS file
systems across the production machines to create a unified, consistent view of the product work
environment remains the same.
When running work across multiple machines, operators need to be careful not to accidentally run the
same job twice, whether on the same machine or on different machines. The result is usually chaos as
competing scripts attempt to create the same output files. There is no underlying job control architecture
in the DPP system, so anomalous situations like this are very hard to prevent.
The DPP system operator needs a clear strategy about how a product will be distributed across multiple
machines and processors for maximum efficiency. The most common approach is to divide the 52 states
into groups and isolate the running of a group to a single machine – e.g., run (CA, PA, VT) on dpp2, (NY,
MN, OH) on dpp3, etc. Although processing for a single state could be distributed across multiple
machines, there's usually no need to do this. Late in the production cycle, as states finish processing and
extra machine CPU cycles become available, the operator could decide to run a large state like CA across
multiple machines to finish faster, but that tends to be the exception, not the rule.
4.2.2. Waves
A wave is a subset or other variation of a full product. Waves have many uses in the DPP system, but
their primary role is to support a divide-and-conquer approach to product creation that maximizes the
available processing power.
Waves break a task into separate, smaller parts that can be run in parallel. On a multiprocessor machine,
the total time to execute a “waveable” task is reduced. To the greatest extent possible, each stage of
each script is designed to maximize its capability to be parallelized. Waves can be spread across multiple
machines to further enhance performance and throughput.
Not all processing steps are waveable. Some activities have to be run sequentially (e.g., you can't start
computing US medians until all the states are fully processed); likewise, not all steps are suitable for
parallelism – for example, there's no way to parallelize the creation of an SXV4 database, which is a fixed
set of sequential tasks.
For steps that are waveable, there are usually two ways to use parallelism:
         Wave by Tables – break the TableInfo.txt driver file into parts, and assign each wave a different
          batch of tables
         Wave by Iteration – for iterated products, to leverage geo-recode caching, break the Iterations.txt
          driver file into parts, and assign each wave a different batch of iterations.
The DPP Production Cookbook should document, for every product:
         The waveability of each stage of each script
         Whether each stage is waveable by tables, by iterations, by both, or by neither
         A recommended approach for production (e.g., for SF4, many steps were waveable by table or
          by iteration, but the production recommendation was to create 336 iteration waves, since this
          makes better use of geo recode caching)
         The performance profile of each stage of each script (although the profile characteristics often
          don't become apparent until the system is actually run during DEV and PA testing)
4.2.3. Other Uses of Modified Operational Materials
In addition to waves for the purpose of increased performance or throughput, there are other cases where
a user may want to override the default driver files. The most common example occurs during
development and testing to create “mini” or “slimmed-down” versions for testing purposes (for example, to
quickly test the US functionality of the DPP system, you can edit the Coverage driver file to make a small
“three state” version of the United States). By convention, these situations with manually edited driver
files are also called waves, though their purpose is to run a modified version of the full system rather than
to improve performance.
4.2.4. Creating Waves
In the DPP system, waves (subsets or variations) are implemented via operational programming –
namely, having each “wave” use different operational material (namely, driver files). As documented in
the Cookbook, there are two utility scripts to help create the necessary operational materials:
         $DPPutil/util/CreateWaves – a script that creates wave directories by breaking
          TableInfo.txt into smaller parts
         $DPPutil/util/CreateIterationWaves – a script that creates wave directories by breaking
          Iterations.txt into smaller parts
Each wave runs in a separate runtime context, but shares a common file system with its peers. By
convention, DPP scripts that support waves use the -w command-line option to specify a wave directory.
The default wave directory is the $DPPdrivers directory. Using the -w option, a script first looks in the
wave directory; failing that, it defaults to the $DPPdrivers directory. Here's a simple example of running
the Tab script in three waves:
    Tab -w $DPPwork/$DPPenv/wave1 … &
    Tab -w $DPPwork/$DPPenv/wave2 … &
    Tab -w $DPPwork/$DPPenv/wave3 … &
Driver files are searched for first in the wave directory (wave1, wave2, or wave3) and then in the
$DPPdrivers directory. This means the wave directory only needs to contain the files unique to that wave
(usually TableInfo.txt or Iterations.txt); the driver files common to all waves (like Products.txt or
Coverage.txt) can reside in $DPPdrivers.
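For illustration, here is a minimal Python sketch of the two mechanisms just described: dealing a driver
file out into wave directories (the real CreateWaves and CreateIterationWaves utilities are shell scripts),
and the -w search order that each wave-aware DPP script applies when locating a driver file. The
function names and the round-robin split are assumptions for the sketch, not the actual implementation.

    import os

    def create_waves(driver_path: str, wave_root: str, n_waves: int) -> None:
        """Split a driver file (e.g., TableInfo.txt) across n wave directories."""
        with open(driver_path) as f:
            lines = f.readlines()
        name = os.path.basename(driver_path)
        for i in range(n_waves):
            wave_dir = os.path.join(wave_root, f"wave{i + 1}")
            os.makedirs(wave_dir, exist_ok=True)
            # Deal out every n-th line so the waves get similar-sized batches.
            with open(os.path.join(wave_dir, name), "w") as out:
                out.writelines(lines[i::n_waves])

    def resolve_driver(name: str, wave_dir: str, drivers_dir: str) -> str:
        """Mimic the -w search order: wave directory first, then $DPPdrivers."""
        candidate = os.path.join(wave_dir, name)
        return candidate if os.path.exists(candidate) else os.path.join(drivers_dir, name)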
4.2.5. Restarting Waves – Failed jobs and Reruns
While breaking a task into waves can increase performance or throughput, it introduces additional
operational complexity. The system operator needs to monitor waves for failure and take appropriate
action. Waves can fail for many reasons:
         Hardware error – machine crash, disk failure, network failure, NFS timeout, etc.
         Running out of disk space, inodes, or /tmp space
         Software error – a bug in a DPP script or program
         COTS bug – STR, SAS, or Korn shell bugs
         Operational material error – a malformed driver file, a TXD with a syntax error, a TDD bug (e.g., a
          missing valid value), database build failure

         Operator kills job
         Environment error – a job can fail if the UNIX environment is not set up properly – e.g., a missing
          NFS mount or missing mount privileges
         Running out of memory – AIX may kill your job in low-memory situations
Each case requires corrective action, after which the wave can be restarted. Consider the following three
waves:
   Tab -w $DPPwork/$DPPenv/wave1 -s 3000 -e 4000 … &
   Tab -w $DPPwork/$DPPenv/wave2 -s 3000 -e 4000 … &
   Tab -w $DPPwork/$DPPenv/wave3 -s 3000 -e 4000 … &
Suppose wave 1 fails because of low disk space. After corrective action, you can restart this single wave:
   Tab -w $DPPwork/$DPPenv/wave1 -s 3000 -e 4000 … &
However, if the wave failed only in stage 4000 (Summary File creation), it may be possible to save
considerable time by restarting the wave only from the point of failure:
    Tab -w $DPPwork/$DPPenv/wave1 -s 4000 -e 4000 … &
This type of targeted restart strategy usually requires manual investigation of the log files to determine
where and why the failure occurred. Each stage should also document its restart point – the predecessor
stage that needs to be rerun if a given stage fails. Unless documented otherwise, each stage is its own
restart point.

4.3.      About the Division of code and labor
The DPP team is broken into four sub-teams:
         DEV – the development team. Responsibilities include design and implementation of the DPP
          system based on BOC requirements; writing the DPP Production Cookbook; unit testing; working
          with PA and OPS to resolve defects and improve performance; and ownership of
          TeamConnection administration.
         PA – the product assurance team. Responsibilities include deploying TeamConnection builds on
          AIX; testing the DPP system from DEV; testing system runability; performing basic content
          checking; creating test plans and test reports; verifying readiness of the DPP system to enter
          UAT testing.
         OPS – the operations team. Responsibilities include running UAT, TEST, SPROD, and PROD
          activities; creating official TXDs, characteristic recodes (CI), and driver files.
         SA – the AIX UNIX system administration team. Responsible for all UNIX system administration
          tasks, including account management, backups, and performance monitoring and tuning.
Artifacts Owned or Controlled by DEV
The DEV team owns the TeamConnection release DPP2001. Each build creates a new committed level
in the DPP2001 release. The level names are numbered sequentially – e.g., DPP2001_210,
DPP2001_211, DPP2001_212, etc. Each level contains the following items (note that obsolete files are
not included in this list):
 Directory or File                     Purpose                                                         Official Version?
 DPPsetup                              Defines the user's runtime environment                                  Y
 Directories                           Run as part of initial runtime environment setup                        Y
 DPP Operations Guide.Production       Operational runtime instruction manual, with a separate                 Y
 Cookbook.doc                          section for each product
 DPPinstallations.txt                  Pipe-delimited configuration file that enumerates the valid             Y
                                       DEV and OPS build combinations on each machine
 SSinstallations.txt                   Pipe-delimited configuration file that enumerates the valid             Y
                                       SuperSERVER versions installed on each machine (multiple
                                       versions of SuperSERVER may be installed on the same
                                       machine)
 include/                              Supporting shell scripts                                                Y
 scripts/                              Shell scripts                                                           Y
 programs/                             SAS programs                                                            Y
 util/                                 Miscellaneous shell scripts                                             Y
 datafiles/                            Contains the pretabulated population-based size code                    Y
                                       lookup file
 TDD/                                  Official database definition files for the SuperSERVER SNBU             Y
                                       database builder – metadata for HDF, SEDF, School
                                       Districts, and various US-level median databases
 drivers/                              Unofficial starter versions of driver files to bootstrap the            N
                                       development process
 SXtables/                             Unofficial starter versions of TXDs to bootstrap the                    N
                                       development process
 SXrecodes/                            Unofficial starter versions of characteristic iteration                 N
                                       recodes to bootstrap the development process
Table 60: Contents of DEV Release DPP2001

Artifacts Owned or Controlled by PA
The PA team does not own any TeamConnection release, but is responsible for deploying builds to the
AIX environment:
                         Build Artifact                                                     Description
/releases                                                               Initial location where code is extracted from
                                                                        TeamConnection
/usr/lpp/DPP/DPP2001_<DEV_BLD_NUMBER>                                   Deployed DEV build
/usr/lpp/DPP/DPP2000_OPS_<OPS_BLD_NUMBER>                               Deployed OPS build
/usr/lpp/DPP/DPPsetup                                                   Latest version of DPPsetup script
/usr/lpp/DPP/Directories                                                Latest version of Directories script
/usr/lpp/DPP/SSinstallations.txt                                        Latest version of SSinstallations.txt
/usr/lpp/DPP/DPPinstallations.txt                                       Latest version of DPPinstallations.txt
Table 61: DPP Build Directories

Artifacts Owned or Controlled by OPS
The OPS team owns the TeamConnection release DPP2000_OPS. Each build creates a new committed
level in the DPP2000_OPS release. The level names are numbered sequentially – e.g.,
DPP2000_OPS_220, DPP2000_OPS_221, DPP2000_OPS_222, etc. Each level contains the following
items:




 Directory or File                                              Purpose                               Official Version?
 drivers/                                                       Official version of driver files              Y
 SXtables/<PRODUCT>/general/<product>-<TableID>[<state>].txd    Official versions of state-based              Y
                                                                TXDs
 SXtables/<PRODUCT>/median/<product>-<TableID>.txd              Official versions of US-based TXDs            Y
 SXtables/<PRODUCT>/metadata/<product>-<TableID>.txt            Official versions of table stub               Y
                                                                metadata used to build review
                                                                materials (now obsolete) and build
                                                                metadata handoff for AFF (now
                                                                obsolete)
 SXrecodes/<PRODUCT>/ci/                                        Official versions of characteristic           Y
                                                                iteration recodes
Table 62: Contents of OPS Release DPP2000_OPS

In summary, the DPP sub-teams have the following roles and responsibilities:
         DEV team
               o    Design and implementation of the DPP system
               o    Store all version controlled objects in TeamConnection release DPP2001
               o    Create starter versions of the operational material – TXDs, driver files, SXtables, and
                    SXrecodes
               o    Document how to run the system in the Production Cookbook
               o    Correct defects identified by PA and OPS
               o    Manage all interactions with Space-Time Research, including reporting defects and
                    receiving, installing, and testing all software updates
               o    Manage technical support issues with the Census SAS Branch to resolve defects and
                    questions about use of the base SAS system
               o    Work with the DPP system administration team as a product moves through PA and OPS
                    to monitor system performance and take corrective action as necessary to improve the
                    performance of specific products
               o    Administer TeamConnection – create releases, levels, and components as needed
               o    Work with the performance engineer to help tune and monitor DPP system performance
         PA team
               o    Verify correctness of Production Cookbook
               o    Verify runability (from first to last stage) of every DPP product
               o    Perform basic content checking (the details of which are agreed upon with DEV and
                    OPS) on tabulation output
               o    Record defects as problems are identified
               o    Document work with test plans and test reports
               o    Work with system administration, DEV, and OPS to perform preliminary performance
                    profiling of each DPP product
               o    Deploy committed DEV and OPS code to AIX
               o    Work with performance engineer to help tune and monitor DPP system performance
         OPS team
               o    Serve as the DPP system operator for all UAT, TEST, SPROD, and PROD activities

               o    Create official versions of operational material (using starter versions from DEV) - TXDs,
                    driver files, SXtables, and SXrecodes
               o    Store all version controlled files in TeamConnection release DPP2000_OPS
               o    Manage all interactions with BOC and product sponsors
               o    Run all production jobs and verify the correctness of data before release to AFF, ACSD,
                    and/or the product sponsor
         System Administration team
               o    UNIX system administration, account management, NFS mounts, backups, system
                    performance, security
               o    Provide a performance engineer to coordinate all production planning activities, including
                    disk setup, disk tuning, performance monitoring, operating system tuning, etc.

4.4.      About Geography
Geography input files and recodes are described elsewhere in this document, but this section picks up
where those leave off. It describes the processing and use of the geography files, and therefore covers
parts of the Get, ProcessGeo, and Tab scripts.
4.4.1. Assembling the DGF
We use the Get script to interact with the input geography files. This script is customized for each product
so that it can assemble records from all of the different geography files; it also allows us to add population
size codes and perform verification.
4.4.1.1.       Obtaining the Correct Geography Records
The logic behind obtaining the correct geography records varies by product. Here is an example of what
happens for SF3 Maryland:
      Step within Get   Meaning                                           Why?
      1                 Concatenate SF3 MD state records and SF1F MD      State records are tabbed for the MD SF3
                        block records into the output file                product. Block records are needed to make
                                                                          the geography recode.
      2                 Obtain all SF3 records from the US file that      We want to tab the state-level SF3 and the
                        start with: US, MD, or pipe ("|")                 US-level SF3 at the same time, so we find
                                                                          all US records that could be wholly or
                                                                          partially contained within MD.
      3                 From the set of records obtained in step 2,       The state and US SF3 products contain some
                        remove records that contain any of the           overlapping geographies, and we're ensuring
                        following: 04000US, 05000US, 06000US, 07000US,    that we don't tabulate the same geography
                        16000US, 15500US, 17000US, 17200US, 23000US,      twice.
                        50000US
      4                 Concatenate records from step 3 to the file       We'll then have a complete file for SF3 MD
                        created in step 1                                 tabulation, which includes tabulation for
                                                                          the MD portion of US geography.
Table 63: Example of how we assemble the SF3 MD geography file
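The record selection in steps 2 and 3 amounts to a prefix test followed by a summary-level screen. The
Python sketch below illustrates the idea on records treated as raw text lines; the actual logic lives in the
product-specific Get script, and the field handling here is an assumption for illustration.

    # Illustrative filter for SF3 MD, steps 2 and 3. Treats each geography
    # record as a raw line; the geoid prefixes to exclude come from the text.
    EXCLUDED_GEOIDS = ("04000US", "05000US", "06000US", "07000US", "16000US",
                       "15500US", "17000US", "17200US", "23000US", "50000US")

    def select_us_records_for_state(us_records, state="MD"):
        kept = []
        for rec in us_records:
            # Step 2: keep US records that could be wholly or partially in MD.
            if not (rec.startswith("US") or rec.startswith(state)
                    or rec.startswith("|")):
                continue
            # Step 3: drop summary levels already covered by the state product,
            # so no geography is tabulated twice.
            if any(g in rec for g in EXCLUDED_GEOIDS):
                continue
            kept.append(rec)
        return kept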

4.4.1.2.       Inserting Population Size Codes
The previous section describes the assembly of the geography file, but often products need another
modification of the geographies before they are ready for use. We need to update population size codes
in the assembled geography file since some geographic components use them for proper geography
recode definition. They are also included in the “geography header” portion of the summary files.
4.4.1.3.       Verifying the Assembled File
The CheckContents program checks the contents of the geography file against the GeoContent driver file,
which is a matrix of required fields by summary level.
    CHECKING GEO FILE /dpp2/prod/uSF1UR/datafiles/DGF_uSF1UR_NJ OF uSF1UR OF New Jersey

                                              AS OF Apr 11, 2003;0:50:09

    Wrong size at line 162801 .Field name= PLACESC .Field size= 2 .Geo value=    .Geo size= 0
    NJ|391|00|39100US34560218130|1|2|22|34||||||#####|T1|3||||||||||||||||||5602|23|70|Y|||||Y|||||||||||||||||     106114348|      30987006|Dover township|A|Y|||+39978850|-074147473|44|W|||||||||####||||18130||||||||||

    The number of records in geo file is : xxxxxx

    1     errors/mismatches/differences found.
The error indicates that the Place Size Code (PLACESC) is required but was not found for a particular
summary level 391 record.
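Conceptually, CheckContents is a required-field matrix lookup. The sketch below models the GeoContent
driver file as a mapping from summary level to required field names; the field names and structures are
assumptions for illustration only.

    # Sketch of a CheckContents-style check: every field required for a
    # record's summary level must be present and non-empty.
    REQUIRED_BY_SUMLEV = {
        "391": ["PLACESC"],           # Place Size Code required at summary level 391
        "050": ["STATE", "COUNTY"],   # illustrative entry
    }

    def check_record(line_no: int, sumlev: str, record: dict) -> list:
        """Return an error message for each required field that is empty."""
        errors = []
        for field in REQUIRED_BY_SUMLEV.get(sumlev, []):
            if not record.get(field, "").strip():
                errors.append(f"Wrong size at line {line_no}: field {field} "
                              f"required for summary level {sumlev} but empty")
        return errors

    print(check_record(162801, "391", {"PLACESC": ""}))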
4.4.2. Processing Geography File
The ProcessGeo script does many things with the geography file, some of which are discussed in great
detail elsewhere in this document (e.g., geography recodes). The remaining processes are highlighted in
this section.
4.4.2.1.       Master Geography SAS Dataset
ProcessGeo stage 10 creates what we call the master geography SAS dataset. Many programs
throughout the DPP system use this dataset. The dataset is basically the geography file from section
4.4.1.1 in SAS form, but with several additional fields. The additional fields directly contribute to detail
data preparation, US aggregation, and summary file creation. These are explained further in section
4.4.3.
4.4.2.2.       Land and Water Area Verification
As discussed elsewhere in this document, ProcessGeo stage 10 outputs a geography recode. This stage
also outputs a hierarchical file that has two record types – non-block and block – and is used to ensure
the correctness of the geography recode. Each non-block record (e.g., a county) is the parent of 0 or
more blocks. For an explanation of how we equate non-blocks to blocks, refer to the Geography Recode
section.
The geography database build process is completely independent of the detail database build process
discussed elsewhere in this document. However, the build process requires the same basic materials -
database definition input files (TDD) and fact data (the hierarchical geography file).
After the database is built, we run a query (TXD) against it to check whether the sums of land area, water
area, POP100, etc., over the blocks match the values on the non-block record. Some mismatches are
expected; for example, when we tabulate the MD portion of the US geography (01000US), the summed
land area equals that of MD rather than the entire US.
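The check reduces to grouping blocks under their parent record and comparing sums. A minimal Python
sketch with assumed record structures and field names (the real check is a TXD query against the built
database):

    # Sum selected fields over each parent's blocks and compare with the
    # values stored on the non-block (parent) record. Structures are assumed.
    from collections import defaultdict

    def verify(non_blocks, blocks, fields=("arealand", "areawatr", "pop100")):
        sums = defaultdict(lambda: defaultdict(int))
        for blk in blocks:
            for f in fields:
                sums[blk["parent"]][f] += blk[f]
        for parent in non_blocks:
            for f in fields:
                if sums[parent["geoid"]][f] != parent[f]:
                    print(f"{parent['geoid']}: {f} blocks={sums[parent['geoid']][f]} "
                          f"parent={parent[f]} (mismatch - may be expected)")

    blocks = [{"parent": "05000US99001", "arealand": 5, "areawatr": 1, "pop100": 40},
              {"parent": "05000US99001", "arealand": 7, "areawatr": 0, "pop100": 60}]
    parents = [{"geoid": "05000US99001", "arealand": 12, "areawatr": 1, "pop100": 100}]
    verify(parents, blocks)   # prints nothing: blocks and parent agree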



4.4.3. Using Output From Geography Processing
The Tab script uses output from ProcessGeo in many ways, some of which are discussed in great detail
elsewhere in this document (e.g., detail databases). The remaining processes are described in this
section.
4.4.3.1.       Detail Database Preparation
The block-level dataset from above is used in a detail database preparation match process. The detail
file block records need a unique number, and that number is assigned in the processing described earlier.
This unique number, which is eventually called UNQBLKVS in the detail database, is referenced in the
production geography recodes. For more information, refer to the Geography Recodes and Detail
Database sections.
4.4.3.2.       US Aggregation
A derivative of the main SAS dataset above is used in US aggregation. The main dataset is subset in
ProcessGeo stage 30 so that it contains only the US records. These records are in proper order for the
US product. This subset dataset is then used in US aggregation to ensure the aggregate results are
written out in CSV format in the correct order. For more information, refer to the US Processing and
Aggregation section.
4.4.3.3.       Summary File Creation
Summary file creation uses the master geography SAS dataset to obtain dozens of geography fields that
must be written to the internal and external geography header files. For more information, refer to the
Summary File Creation and Summary File Output sections.

4.5.      About Geographic recoding
4.5.1. Producing a Full Geography Recode
Geography is the largest dimension of the tabulation puzzle, and geography recodes (or, geo recodes)
allow us to tabulate geography efficiently using Space-Time Research's production system (ss2ps). A full
geography recode means that all expected geographies for the product(s) we want to tabulate are
present in the recode, along with the block decomposition of each geography.
The geo recode that is produced is called a fastmap recode. The main difference between a fastmap
recode and a normal recode is the format, and therefore the time it takes to load the recode into memory.
Space-Time Research developed the fastmap recode because the performance of loading a normal geo
recode was poor. The fastmap recode is close to an array-based recode definition, as the examples
below show.
4.5.1.1.       Inputs to Geography Recode Creation
There are several ingredients needed to make a geo recode:
DGF
The DGF contains all of the geography records of interest for the product(s) we're tabulating, plus the
blocks (which are usually not part of a data product). The file contains about 90 fields, and the mandatory
fields for each summary level are found in the GeoContent driver file.
The DGF is described in detail in the Geography section.
GeoIDInfo driver file
This file served two purposes, although one is now obsolete. The sole current purpose of the file is to
list the key fields and instructions needed to map geographic entities in summary level/geographic
component combinations to their respective blocks.
For geo recode generation, the relevant columns of the driver file are summary level, geographic
component, length, key fields, and geographic component rules.
Each geographic entity from each non-block summary level/geographic component combination (e.g.,
050-01, or county-urban portion) is matched up with its blocks based on the Key Fields and Geographic
Component Rules columns. For 050-01, the Key Fields are state-county and the geographic component
rule is ur='U'.
For each 050-01 record, we pick out the contributing blocks by matching on state and county. We also
require that the "ur" field is 'U'.
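As a sketch of that matching rule in Python (the field names follow the text; the real match is driven by
the GeoIDInfo columns rather than hard-coded as here):

    # Match a 050-01 (county - urban portion) entity to its blocks: the Key
    # Fields (state, county) must match, and the component rule ur='U' holds.
    def blocks_for_entity(entity, blocks, key_fields=("state", "county")):
        return [
            blk for blk in blocks
            if all(blk[k] == entity[k] for k in key_fields)  # Key Fields match
            and blk["ur"] == "U"                             # rule: ur='U'
        ]

    county = {"state": "99", "county": "001"}
    blocks = [
        {"state": "99", "county": "001", "ur": "U", "blkid": 0},
        {"state": "99", "county": "001", "ur": "R", "blkid": 1},  # rural: excluded
        {"state": "99", "county": "003", "ur": "U", "blkid": 2},  # wrong county
    ]
    print([b["blkid"] for b in blocks_for_entity(county, blocks)])   # -> [0]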
4.5.1.2.       Geography Recode Format
The geography recode contains four sections:
 Section   Line(s) From Recode                         Explanation
 1         RECODE "recode name" FROM "UNQBLK-Unique    Recode name and recode source field.
           Block ID for SuperCROSS"
 2         RESULT "geo_id 0"                           One RESULT line for each geography we want in the
           ...                                         output; software settings can control whether these
           RESULT "geo_id N-1"                         lines are output (e.g., nulls can be omitted). The
                                                       number of result lines ranges from a few hundred or
                                                       thousand for some state-based products to well over
                                                       half a million for larger state-based products.
                                                       Results are referenced by index numbers 0 to N-1.
 3         F block_number_in_db result_line_number     A detailed mapping of blocks to result lines
                                                       (geographies). Each time a block contributes to a
                                                       result, it is listed here. This section is often
                                                       quite large: for example, 533,163 F lines (blocks)
                                                       are needed to define the state of California, which
                                                       is just one result line.
 4         END RECODE                                  End of recode.
Table 64: Geography Recode Format

4.5.1.3.       Example Geography Recode
The geo recodes are created in ProcessGeo stage 10. Assume we have a small state with only 2
counties, 1 place, and 3 blocks, and the data product for this small state requires five results (the state, 2
counties, and the place record for geographic components 91 and 00). The following is the full recode for
this small state:
 Line From Recode                                  Explanation
 RECODE "ProductX STATE Production Geo Recode"     Recode name and recode source field
   FROM "UNQBLK-Unique Block ID for SuperCROSS"
 RESULT "04000US99"                                First of five results - aka result "0"
 RESULT "05000US99001"                             Second of five results - aka result "1"
 RESULT "05000US99003"                             Third of five results - aka result "2"
 RESULT "16091US9900005"                           Fourth of five results - aka result "3"
 RESULT "16000US9900005"                           Fifth of five results - aka result "4"
 F 0 0                                             These three lines tell us that blocks 0, 1, and 2
 F 1 0                                             all contribute to result line 0 (result for
 F 2 0                                             04000US99)
 F 0 1                                             Tells us that block 0 contributes to result line 1
                                                   (result for 05000US99001)
 F 1 2                                             These two lines tell us that blocks 1 and 2
 F 2 2                                             contribute to result line 2 (result for
                                                   05000US99003)
 F 0 4                                             Tells us that block 0 contributes to result line 4
                                                   (result for 16000US9900005)
 END RECODE                                        End recode
 Note that the result for 16091US9900005 has no F lines. This situation does occur, and is
 explained later.
Table 65: Example of a Geography Recode
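Because the fastmap format is line oriented, it is easy to read programmatically. The following Python
sketch parses the example above into the RESULT geoids and the block-to-result mapping; the function
name is an assumption, and real recodes are of course far larger.

    def parse_fastmap(lines):
        """Collect RESULT geoids and (block, result) pairs from a fastmap recode."""
        results, f_lines = [], []
        for line in lines:
            line = line.strip()
            if line.startswith("RESULT"):
                results.append(line.split('"')[1])         # geoid between quotes
            elif line.startswith("F "):
                _, block, result = line.split()
                f_lines.append((int(block), int(result)))  # block -> result index
        return results, f_lines

    recode = [
        'RECODE "ProductX STATE Production Geo Recode" FROM "UNQBLK-Unique Block ID for SuperCROSS"',
        'RESULT "04000US99"', 'RESULT "05000US99001"', 'RESULT "05000US99003"',
        'RESULT "16091US9900005"', 'RESULT "16000US9900005"',
        'F 0 0', 'F 1 0', 'F 2 0', 'F 0 1', 'F 1 2', 'F 2 2', 'F 0 4',
        'END RECODE',
    ]
    results, f_lines = parse_fastmap(recode)
    print(results[0], f_lines[:3])   # 04000US99 [(0, 0), (1, 0), (2, 0)]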

4.5.1.4.       Ordering of results
One of the benefits of using a fastmap recode is that we can easily control the ordering of the recode
result lines.
For products with state and national components, the geo recode preserves the order of the state product
geographies. The national-only product geographies are placed at the end of the recode, and the
national product geographies are put into product order after tabulation.
For state-only or national-only products, the geographies in the recode are in product order.
4.5.2. Modifying a Geography Recode
4.5.2.1.       Reducing a geography recode
Recode reduction occurs in ProcessGeo stage 40. Reducing a recode involves scanning the recode to
look for result lines that have no associated blocks (F lines), which means that no blocks contribute to the
result. If no blocks contribute to a result, we remove it from the recode and renumber the result line
references in the F lines.
Why would a result line have no associated blocks? One example is that a geographic component does
not apply to an area, an obvious example being the geographic component for Oklahoma Tribal Statistical
Areas (91). This geographic component only applies to a couple of summary levels in Oklahoma, so no
block outside of Oklahoma can possibly contribute to a geographic entity with geographic component 91.
Another example has to do with the way we process products with state and national components. In
order to be efficient, we combine the processing of the national and state areas through the tabulation
stage. By combining the processing, we're also adding result lines to the geo recodes. Some of the
national areas are screened out while assembling the DGF for the state, since we know based on the
state code that it's impossible for these national areas to be in the state of interest. Other national areas
are placed into the DGF, and therefore the geo recode, since we cannot determine if they are wholly
outside of the state of interest. A result of this is that we often have hundreds or thousands of result lines
with no associated blocks, so we remove these result lines from the recode.
The SF4 product has state and national components, and is a good indicator of the maximum amount of
reduction that can take place in a recode. Let's look at SF4 Maryland and Virginia:
 State           Input RESULT lines (full recode)   Output RESULT lines (reduced recode)   Reduction
 MD (Maryland)   32202                              10496                                  67.4%
 VA (Virginia)   34484                              12745                                  63.0%
Table 66: Example of Impact of Geographic Recode Reduction

The percentage reduction is a good indicator of the boost in performance – meaning that the reduced
recode for MD or VA would allow the tabulation stage to run a little over twice as fast compared to using
the full recode for tabulation.
The example recode has one result - for geo_id 16091US9900005 - that has no associated F lines. Here
is what the recode looks like after removing this result:
 Full Recode                                      Change?      Reduced Recode
 RECODE "ProductX STATE Production Geo Recode"    No           RECODE "ProductX STATE Production Geo Recode"
   FROM "UNQBLK-Unique Block ID for SuperCROSS"                  FROM "UNQBLK-Unique Block ID for SuperCROSS"
 RESULT "04000US99"                               No           RESULT "04000US99"
 RESULT "05000US99001"                            No           RESULT "05000US99001"
 RESULT "05000US99003"                            No           RESULT "05000US99003"
 RESULT "16091US9900005"                          Removed
 RESULT "16000US9900005"                          No           RESULT "16000US9900005"
 F 0 0                                            No           F 0 0
 F 1 0                                            No           F 1 0
 F 2 0                                            No           F 2 0
 F 0 1                                            No           F 0 1
 F 1 2                                            No           F 1 2
 F 2 2                                            No           F 2 2
 F 0 4                                            Renumbered   F 0 3
 END RECODE                                       No           END RECODE
Table 67: Example of Impact of Recode Reduction on number of lines in Geography Recode

Note the renumbered F line in the reduced recode on the right. Since we dropped RESULT 3 (the fourth
result, as numbering starts at 0), the result line reference F 0 4 was changed to F 0 3.
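Reduction itself is a small transformation. Continuing the parser sketch above, here is an illustrative
Python version of the reduce step (the production implementation differs; this only demonstrates the
drop-and-renumber logic):

    def reduce_recode(results, f_lines):
        """Drop RESULT entries with no F lines and renumber the references."""
        used = {res for _, res in f_lines}                  # results with blocks
        keep = [i for i in range(len(results)) if i in used]
        renumber = {old: new for new, old in enumerate(keep)}
        return ([results[i] for i in keep],
                [(blk, renumber[res]) for blk, res in f_lines])

    results = ["04000US99", "05000US99001", "05000US99003",
               "16091US9900005", "16000US9900005"]
    f_lines = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 2), (2, 2), (0, 4)]
    print(reduce_recode(results, f_lines))
    # 16091US9900005 (index 3) is dropped, and F 0 4 becomes F 0 3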
4.5.2.2.       Dehydrating a geography recode
Recode dehydration occurs in ProcessGeo stage 45. Dehydrating a recode involves finding RESULT
lines that share the same block list and removing all but one of those results from the geo recode. That
single result is retained for tabulation, and it represents all of the other results that share the same block
list.
Why do we do this? Each geographic area can be known by more than one identifier (geoid). Since all
tabulations of these areas will result in identical numbers, we don't need to tabulate all of the areas.
Often, the duplicate identifiers are present as a result of tabulating the state and national portions of a
product in the same geo recode. An example tabulation may require totals for the US, region, division,
and state (let's say Maryland). Although these areas are not equivalent in reality, we usually process
geography and tabulate by state. Therefore, the Maryland portion of the US, the Maryland portion of the
South Region, the Maryland portion of the South Atlantic Division, and the state of Maryland all have the
same block list. In dehydration, we detect this and keep one of them (perhaps Maryland).
Let's once again look at SF4 MD and VA to get an idea of how many RESULT lines might be dropped
due to dehydration:
 State           Input RESULT lines    Output RESULT lines    Reduction   Cumulative Reduction
                 (reduced recode)      (dehydrated recode)
 MD (Maryland)   10496                 5035                   52.0%       84.4%
 VA (Virginia)   12745                 6168                   48.4%       82.1%
Table 68: Example of Impact of Geographic Recode Dehydration

The percentage savings is once again a good indicator of the boost in performance – meaning that the
dehydrated recode for MD or VA would allow the tabulation stage to run approximately twice as fast
compared to using the reduced recode for tabulation. The cumulative savings shows that the dehydrated
recode might allow the tabulation stage to run as much as six times faster than using the full recode.
Dehydration is not appropriate for School Districts products, since who/what we're counting varies by
groups of summary levels and the retained geography could easily be in a different group than the
geographies it represents.
The example recode has two results - for geoids 05000US99001 and 16000US9900005 - that have an
identical set of F lines. Here is what the recode looks like after removing the result for 16000US9900005
from the reduced recode:




 Reduced Recode                                   Change?   Dehydrated Recode
 RECODE "ProductX STATE Production Geo Recode"    No        RECODE "ProductX STATE Production Geo Recode"
   FROM "UNQBLK-Unique Block ID for SuperCROSS"               FROM "UNQBLK-Unique Block ID for SuperCROSS"
 RESULT "04000US99"                               No        RESULT "04000US99"
 RESULT "05000US99001"                            No        RESULT "05000US99001"
 RESULT "05000US99003"                            No        RESULT "05000US99003"
 RESULT "16000US9900005"                          Removed
 F 0 0                                            No        F 0 0
 F 1 0                                            No        F 1 0
 F 2 0                                            No        F 2 0
 F 0 1                                            No        F 0 1
 F 1 2                                            No        F 1 2
 F 2 2                                            No        F 2 2
 F 0 3                                            Removed
 END RECODE                                       No        END RECODE
Table 69: Example of Impact of Recode Dehydration on number of lines in Geography Recode

Another output from the dehydration process is a space-delimited flat file that lists each original
geography along with the geography that represents it:
    04000US99        04000US99
    05000US99001     05000US99001
    05000US99003     05000US99003
    16000US9900005   05000US99001
Note that the geography 05000US99001 appears twice in the second column - it represents itself, and
also represents 16000US9900005. This file is used in a post-tabulation process called rehydration, where
the tabulation results are expanded out to include all geographies from the recode that was input to the
dehydration process.
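In sketch form, dehydration groups results by their block lists, keeps one representative per group, and
records the mapping used later by rehydration. An illustrative Python version (the production logic is a
SAS/script implementation; this only demonstrates the grouping idea):

    def dehydrate(results, f_lines):
        """Keep one result per distinct block list; return kept indexes and
        the (original geography, representative geography) rehydration map."""
        blocks_of = {i: [] for i in range(len(results))}
        for blk, res in f_lines:
            blocks_of[res].append(blk)
        rep_of, mapping, keep = {}, [], []
        for i, geoid in enumerate(results):
            key = tuple(sorted(blocks_of[i]))
            if key not in rep_of:            # first result with this block list
                rep_of[key] = geoid
                keep.append(i)
            mapping.append((geoid, rep_of[key]))
        return keep, mapping

    results = ["04000US99", "05000US99001", "05000US99003", "16000US9900005"]
    f_lines = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 2), (2, 2), (0, 3)]
    keep, mapping = dehydrate(results, f_lines)
    for orig, rep in mapping:
        print(orig, rep)   # 16000US9900005 is represented by 05000US99001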
4.5.2.3.       Splitting a geography recode
Recode splitting occurs in ProcessGeo stage 50. Sometimes the combination of the base TXD definition
plus a reduced or dehydrated geo recode causes the tabulation software to exceed the 32-bit process
size limit of 2 GB. When this happens, one option is to split the geo recode into pieces. If we need to split
a geo recode, we can control the number of splits at the state level by modifying a column in the
Coverage driver file. An entry of "1" in the Coverage file means recode splitting will not take place; an
entry greater than 1 indicates the number of pieces the recode must be split into. The details of splitting a
recode – including the number of records in each split – are taken care of by a SAS program.
Splitting a recode, which is rarely necessary, increases overall tabulation time slightly compared to using
one recode. Why? There are fixed costs associated with tabulation (e.g., production system startup), and
those fixed costs are repeated for each recode split.
The following table shows how our example dehydrated recode might be split into two pieces:




 Original (Dehydrated) Recode                      Split Recode (2 pieces)
 RECODE "ProductX STATE Production Geo Recode"     RECODE "ProductX STATE Production Geo Recode"
   FROM "UNQBLK-Unique Block ID for SuperCROSS"      FROM "UNQBLK-Unique Block ID for SuperCROSS"
 RESULT "04000US99"                                RESULT "04000US99"
 RESULT "05000US99001"                             RESULT "05000US99001"
 RESULT "05000US99003"                             F 0 0
 F 0 0                                             F 1 0
 F 1 0                                             F 2 0
 F 2 0                                             F 0 1
 F 0 1                                             END RECODE
 F 1 2
 F 2 2                                             RECODE "ProductX STATE Production Geo Recode"
 END RECODE                                          FROM "UNQBLK-Unique Block ID for SuperCROSS"
                                                   RESULT "05000US99003"
                                                   F 1 0
                                                   F 2 0
                                                   END RECODE
Table 70: Example of Splitting a Geography Recode

Note how the first two result lines from the dehydrated recode appear in the first split, and the last result
line from the dehydrated recode appears in the second split. The split program also takes care of
renumbering the result line references in the "F" lines (refer to F 2 0 in the 2nd split).
4.5.2.4.       Geography recode sets
The creation of geo recode sets occurs in ProcessGeo stage 70. Recode sets were used in the School
District (SD) set of products, and are similar to recode splits. Recode sets are mandatory for the SD
products, but recode splits are needed for just a few states in the massive products. Also, since recode
sets are mandatory for SD, we cannot compare their performance to any other kind of recode.
Why are recode sets needed? The SD enrollment iterations define categories of relevant children.
Unlike the characteristic iterations in SF2 or SF4, the definition of relevancy varies by geography. This
geo-relevancy dependency requires the geographic recode to be broken into four parts, based on
summary level groupings defined in column 21 of Products.txt. These new geo recodes are called geo-
sets, and each table (e.g., P1) must be tabbed separately against each geo set (since the enrollment
iteration recode is different).
Geo Set   Summary Level(s)     Database field needed to define iteration
1         010,250,040,050      SD_CH1-Person is child in universe
2         950                  SDESD_RCH1-Person is ESD-relevant child in universe
3         960                  SDSSD_RCH1-Person is SSD-relevant child in universe
4         970                  SDUSD_RCH1-Person is USD-relevant child in universe
Table 71: Example of Geography Recode Sets for School District tabulations, Relevant Children

4.5.2.5.       Geography recodes per characteristic iteration
The creation of geo recodes for characteristic iterations occurs in ProcessGeo stage 60. For several
products, we produced custom geo recodes for each characteristic iteration (CI). These CI recodes differ
from recode sets since each recode in a recode set has different iteration logic, while CI recodes have the
same iteration logic.
Why produce these CI recodes? Rather than tabulate all geography for all characteristic iterations -
which would be very time consuming - we went through a process called SIPHC to determine (based on
unweighted pop counts) which geographies had to be tabulated for each CI.

Let's continue with the SF4 MD and VA example to see how many RESULT lines might be dropped due
to SIPHC:
State            Iteration   Input RESULT lines       Output RESULT lines   Reduction   Cumulative Reduction
                             (in dehydrated recode)   (in CI recode)                    (compared to full recode)
MD (Maryland)    001         5035                     3984                  20.9%
                 004                                  2076                  58.8%
                 006                                  115                   97.7%
                 585                                  82                    98.4%
                 All                                  Average: 228          95.5%       99.3%

VA (Virginia)    001         6168                     5264                  14.7%
                 004                                  2616                  57.6%
                 006                                  130                   97.9%
                 585                                  88                    98.6%
                 All                                  Average: 267          95.7%       99.2%
Table 72: Example of Impact of using Characteristic Iteration Geographic Recodes

The % savings across the entire set of CI geo recodes for both MD and VA is better than 95%. When
compared to using the full recode, our savings increase to over 99%.
However, there are even more savings not shown above. The SIPHC process also determines which CIs
don't need to be tabulated for a state; a CI can be skipped when there are no people in that CI category.
For MD, we were able to skip 16 of the 336 SF4 iterations, and for VA, 14 of the 336. For the average
state in SF4, we were able to skip well over 100 iterations.
The end result, even after including fixed costs associated with tabulation (e.g., production system
startup), is approximately a 99% reduction in run time compared to using the full recode and no SIPHC.
4.5.3. Tabulating with a geography recode
Tabulation uses one of the many forms of a geography recode described above. We've seen in the
previous section that this can be anywhere from a "full" recode to a recode that has been reduced,
dehydrated, split, and customized for CIs. The tabulation and post-tabulation steps are covered below.
4.5.3.1.        Using Geo Recodes in the Production System
Regardless of the recode being used, when it comes time to tabulate we take a base table description
(TXD), a geo recode, and possibly other information (such as a CI recode), create a custom TXD, and
submit it for tabulation; a minimal sketch follows the list below. The custom TXD contains a reference to
the geo recode rather than the entire recode. The geo recode is read once by the tabulation production
system, held in memory, and used for a number of tabulations. Once the tabulation job is finished, the
recode is released from memory and the production system shuts down.
         Customize base TXD with reference to the geography recode. Other changes made to the TXD
          at this point include the addition of a CI recode (if applicable) and the customization of the
          database ID.
         Submit a batch of tables to the production system. The batch can include one or more tables.
          The production system is initiated so that it does the following:
                 o    Read the geography recode into memory
                 o    Tabulate all of the tables in the batch, writing csv output after each
                 o    Release the recode from memory

                 o      Shut down
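The flow above can be sketched in ksh. The sed token GEO_RECODE_REF, the file names, and the
submit command are illustrative assumptions, not the actual DPP implementation:
    # Point the base TXD at the geo recode (and, if applicable, a CI
    # recode and the customized database ID), producing a custom TXD.
    sed -e "s|GEO_RECODE_REF|${GeoRecodeFile}|" \
        -e "s|DATABASE_ID|${DatabaseID}|" base/P1.txd > custom/P1.txd
    # Submit the batch; the production system reads the recode into
    # memory once, tabulates each table writing csv output, releases
    # the recode, and shuts down.
    submit_tab_batch custom/*.txd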
4.5.3.2.       Post-Tabulation Geo Recode Events
There are two geography recode related events that take place after tabulation.
Clean up after Split Recodes
If recode splits were used, Tab stage 3000 produces an output file per recode split. This stage also
cleans up after itself by making one csv file from the csv files generated by tabulating the recode splits.
Continuing with the split recode example from above, and assuming one data cell is produced by our
tabulation, we have two csv inputs and one output file:


 Input CSV (split 1):                Merged output CSV:
   04000US99,99999                     04000US99,99999
   05000US99001,44444                  05000US99001,44444
                                       05000US99003,55555
 Input CSV (split 2):
   05000US99003,55555
Figure 5: Merging results of tabulation when recode splits are used

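A minimal ksh sketch of the merge, assuming illustrative per-split file names; because the splits partition
the RESULT lines in order, simple concatenation preserves the geographic order:
    # Combine the per-split csv outputs into one product csv file.
    cat P1_split1.csv P1_split2.csv > P1.csv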
Rehydrate Tabulation Results
If recode dehydration was used, the csv file contains a record for each unique geographic area in the
product (in terms of the blocks that compose the geographic areas).
The rehydration process, which is Tab stage 3020, reverses the dehydration process. Continuing with
our dehydration example above, the rehydration process looks like this:

Rehydration Instructions (the dehydration mapping file):
  04000US99          04000US99
  05000US99001       05000US99001
  05000US99003       05000US99003
  16000US9900005     05000US99001

Input CSV:                          Rehydrated CSV:
  04000US99,99999                     04000US99,99999
  05000US99001,44444                  05000US99001,44444
  05000US99003,55555                  05000US99003,55555
                                      16000US9900005,44444

Figure 6: Rehydrating tabulation results when a dehydrated recode is used

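A minimal awk sketch of the rehydration join, assuming illustrative file names. The first pass loads the
tabulated values by geography; the second pass emits one line per original geography using its
representative's values:
    # rehydration_map.txt holds "<original geo> <representative geo>" per line.
    awk 'NR == FNR { i = index($0, ","); val[substr($0, 1, i-1)] = substr($0, i+1); next }
         { print $1 "," val[$2] }' P1.csv rehydration_map.txt > P1_rehydrated.csv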
4.6.      About Hand-off
Handoff is generally the final step in the production of a product. It is the process by which an official set
of files is conveyed to a specific destination. Handoff is achieved via a ksh script named Handoff.
Specifically, Handoff does one or more of the following, based on command line parameters:
         Verifies that all of the required files are present.
         Renames the files according to the destination specification.
         Packages the files according to the destination specification.
         Links a set of files to the product summary files and sets access privileges.
A DPP product comprises a set of summary files, a public geo header file, an internal geo header file,
and an AFF GeoID equivalency file. The number of geo files and summary files varies by product but
can reach into the millions. Like most of the DPP ksh scripts, Handoff uses several driver files. The
specific set of files that Handoff may convey to the target destination is listed below. Every destination is
delivered a set of Summary Files; all other files are optional, depending on the destination:
         A set of Summary Files
         A set of public geo header files
         A set of internal geo header files
         A set of AFF GeoID equivalency files
         A First Occurrence DataSet (SAS format).
         A set of .don files (used for record count verification).
         A file set report.
Handoff is performed for a specific state (or the US). This is useful for several reasons.
         During the early processing of a product, a state or several states can be 'handed off' to other
          groups for an initial review of the product. These small handoffs aid in product-wide error
          detection.
         Some products are quite large and can take months to produce. Source files are delivered on a
          flow basis and processing occurs on a flow basis, so it is only prudent to allow Handoff to occur
          on a flow basis as well. The groups receiving the files are better equipped to handle a
          reasonable number of files.
         Handling several million files in a single processing stream is risky. Breaking the file set into
          manageable chunks permits a reasonable restart point in the case of error.
The software used to zip the files requires that the Handoff script be run in the UNIX foreground. That is,
if the Handoff script is invoked with the zip option (-z) it cannot be placed in the background.
There are four destinations for Handoff (AFF, ACSD, REVIEW, and INTERNAL), each with its own
requirements and specifications. The destination "Internal" applies to older products.
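Only the -z (zip) option is documented above; the destination and state arguments in this invocation
sketch are illustrative assumptions about the command line:
    # Hypothetical invocation: hand off Maryland's files to the ACSD
    # destination with zipped output. Because -z is used, the script
    # must run in the UNIX foreground.
    Handoff -d ACSD -s MD -z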




4.6.1. AFF
          Base directory                     /data1/ftp/dec/2000/<ProductLocation>/data

                                             <ProductLocation> is a product specific directory name and
                                             must exist and must have read/write permissions granted to the
                                             user executing Handoff.
          Zipped Files                       Optional

                                             AFF prefers unzipped files.
          GeoID Equivalency                  Yes
          Public Geo Header                  Yes
          Internal Geo Header                No
          .don files                         Yes
                                             AFF expects a .don file for each .dat file, Public Geo Header file,
                                             and GeoID Equivalency file. The .don file contains the row count
                                             of the corresponding .dat file.
          Checksum report                    No
          First occurrence SAS               No
          dataset
Figure 7: Hand-off specifics for destination: AFF
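As a minimal sketch, a .don file could be produced from a row count, assuming an illustrative file name:
    # Write the row count of the .dat file into its companion .don file.
    wc -l < P1.dat > P1.don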

4.6.2. ACSD
          Base directory                     /dpp2/ftp/acsd/<SUGroup>/<ST>


                                             <ST> is the 2 character uppercase state abbreviation, for
                                             example MD.
          Zipped Files                       Optional

                                             ACSD prefers zipped files.
          GeoID Equivalency                  No
          Public Geo Header                  Yes
          Internal Geo Header                No
          .don files                         No
          Checksum report                    Yes

                                             ACSD requires a checksum file that contains the name, row
                                             count, and byte count for each file handled by the handoff
                                             process. It is created in the $DPPReport directory and copied to
                                             the target destination unzipped.
          First occurrence SAS               No
          dataset
Figure 8: Hand-off specifics for destination: ACSD
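A minimal ksh sketch of producing one checksum line per file, assuming illustrative file names and a
simple space-delimited format:
    # Record each handled file's name, row count, and byte count in the
    # checksum report written to the $DPPReport directory.
    for f in *.dat; do
        print "$f $(wc -l < "$f") $(wc -c < "$f")"
    done > "${DPPReport}/checksum.rpt"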




4.6.3. Review
          Base directory                     /dpp2/ftp/<SUReview>/<Product>/<State>


                                             <SUReview> is a UNIX group set according to the user
                                             executing the Handoff script.
                                             <Product> is a product specific directory name based on the
                                             product name.
<State> is a state (or US) specific directory name. The name is
                                             fully spelled out, for example, "New Jersey".


                                             /dpp2/ftp/<SUReview>/<Product> is a product specific directory
                                             name and must exist and must have read/write permissions
                                             granted to the user executing Handoff.
                                             /dpp2/ftp/<SUReview>/<Product>/<State> are sub-directories
                                             created by Handoff.
          Zipped Files                       No
          GeoID Equivalency                  No
          Public Geo Header                  Yes
          Internal Geo Header                Yes
          .don files                         No
          Checksum report                    No
          First occurrence SAS               Yes for iteration 000 or 001 (A product can only have one or the
          dataset                            other).
Figure 9: Hand-off specifics for destination: Review




4.6.4. Internal
          Base directory                        /dpp2/internal/<SUGroup>/<Product>/<State>


                                                <SUGroup> is a UNIX group set according to the user executing
                                                the Handoff script.
                                                <Product> is a product specific directory name based on the
                                                product name.
<State> is a state (or US) specific directory name. The name is
                                                fully spelled out, for example, "New Jersey".


                                                /dpp2/internal/<SUGroup>/<Product> is a product specific
                                                directory name and must exist and must have read/write
                                                permissions granted to the user executing Handoff.
                                                /dpp2/internal/<SUGroup>/<Product>/<State> are sub-
                                                directories created by Handoff.
          Zipped Files                          No
          GeoID Equivalency                     Yes
          Public Geo Header                     No
          Internal Geo Header                   Yes
          .don files                            No
          Checksum report                       No
          First occurrence SAS                  Yes for iteration 000 or 001 (A product can only have one or the
          dataset                               other).
Figure 10: Hand-off specifics for destination: Internal

Specification Key
          Base directory                        UNIX directory where the files will be copied or linked to.
          Zipped Files                          Indicates if the destination can handle zipped files.
          GeoID Equivalency                     Does this destination want to receive the AFF Geo file?
          Public Geo Header                     Does this destination want to receive the Public Geo Header?
          Internal Geo Header                   Does this destination want to receive the Internal Geo Header?
          .don files                            Does this destination want to receive a set of .don files? If so,
                                                the files are listed.
          Checksum report                       Does this destination want to receive a checksum file? If so, the
                                                file contents are listed.
          First occurrence SAS                  Does this destination want to receive the First occurrence SAS
          dataset                               dataset?
Figure 11: Key for Hand-off specifics figures




4.7.      About Iterations and iteration recodes
Products are either based on the total population alone or are iterated over a set of race, ethnicity,
ancestry, or relevancy groups. In the latter case, the product's tableset is tabulated not only for the total
population but also for each group in the set. These group definitions are collectively called Iterations.
For example, Iterations 000 and 001 refer to the total population, Iteration 012 refers to 'Asian Alone',
and Iteration 008 refers to 'Alaska Native alone'.
From a dimensional perspective, Iteration is a product dimension and, like geography, applies to all
tables. For some products this dimension has only one value (000, the total population); for others it
may have over a thousand values.
Iterations require special handling in the DPP system and a certain amount of code is dedicated to this
aspect. However, the concept of an Iteration also provides a point of leverage for handling large products
and/or utilizing the parallel processing capabilities of the DPP servers.
4.7.1. Data Fields for Iterations
Each Person record in the detail databases has several fields which describe that person's race and
ancestry. The table below lists the data source, the field, and the valid number of responses per person.
Upstream processing in Decennial guaranteed that each person had at least one response per field.
Data Source    Field                  Possible number of responses   Examples
HDF, SEDF      CENRACE                1                              Black alone; AIAN-Asian-SOR; White-Black-AIAN-Asian
HDF, SEDF      QRACE                  1-8                            German, Irish, Italian
SEDF           ANCES                  1-2                            Syrian, Czech, Irish, Jamaican
SEDF           QANCES (never used)    1-2
Figure 12: Data Fields for Iterations

In relational terms, these fields are best modeled as follows. The Person entity carries the multi-response
fields in denormalized form: CENRACE, QRACE1 through QRACE8, ANCES1 and ANCES2, and
QANCES1 and QANCES2. Three multi-response entities (QRACE_MULTI, ANCES_MULTI, and
QANCES_MULTI) relate to the Person entity and model the same responses in normalized form.
Figure 13: Relational model of multi-response fields

4.7.2. Defining an Iteration
Iterations are defined by the POP division and are based on the data fields delineated above. Some
Iterations are quite simple while others are verbose and/or complex. In the DPP system each Iteration is
defined and stored in STR SuperCROSS format, which is plain ASCII text and relatively intelligible. The
largest definition is over 5,000 lines long. A few examples follow. These examples use the ER model
above and have been translated from STR SuperCROSS syntax to a more relational-like syntax.
4.7.2.1.       Example of an iteration recode definition of low complexity
This example is very simple to understand and implement. It uses the CENRACE field to determine if a
person qualifies as 'Black Alone'. This field is recoded before it gets to DPP.
Iteration 004 (Black Alone)
Relational Syntax:
CENRACE = 2
4.7.2.2.       Example of an iteration recode definition of medium complexity
This example is a bit more complex and there is no CENRACE recode for it. Conceptually, if a person
replies 410 or 411 to any of the eight possible races, then that person is considered part of iteration 036.
Iteration 036 (Chinese, except Taiwanese, alone or in any combination)
Relational Syntax:
QRACE_MULTI in (410,411)
4.7.2.3.       Example of an iteration recode definition of high complexity
This example is relatively complex. Items to note:
         There is no exclusive CENRACE recode for it; however, CENRACE is one of the qualifiers.
         Value '000' for QRACEx means no response.
           There are 639 values that qualify as American Indian (e.g., A01, A05, etc.).
           The 'alone' word in the iteration title is very important. In order to qualify for this iteration a person
            must have responded with 1 or more of the 639 values - and no other value. Consider the
            following example where all persons have CENRACE=3 and QRACE 111 is NOT one of the 639
            values:
Person     QRACE1       QRACE2        QRACE3         QRACE4        QRACE5    QRACE6   QRACE7    QRACE8       Qualifies?
A          A01          000           000            000           000       000      000       000          Yes
B          A01          A05           A06            A10           A11       A18      A19       A20          Yes
C          A01          A05           000            000           000       000      000       000          Yes
D          111          000           000            000           000       000      000       000          No
E          111          A01           000            000           000       000      000       000          No
F          A01          111           000            000           000       000      000       000          No

Figure 14: Using QRACE values in iteration definitions

Iteration 007 (American Indian alone)
Relational Syntax:
         CENRACE = 3
and QRACE1 in ('A01','A05', …636 other values…, 'M42')
and QRACE2 in ('000','A01','A05', …636 other values…, 'M42')
and QRACE3 in ('000','A01','A05', …636 other values…, 'M42')
and QRACE4 in ('000','A01','A05', …636 other values…, 'M42')
and QRACE5 in ('000','A01','A05', …636 other values…, 'M42')
and QRACE6 in ('000','A01','A05', …636 other values…, 'M42')
and QRACE7 in ('000','A01','A05', …636 other values…, 'M42')
and QRACE8 in ('000','A01','A05', …636 other values…, 'M42')
4.7.3. STR Specific Observations
It is worth noting that the multi-response nature of iterations (probably) causes some performance
problems for relational databases but is handled well by STR. However, there are some limitations.
The low complexity example (above) is quite simple and easy to accomplish in STR, and for that matter,
in a relational database.
The medium complexity example (above) is quite simple and easy to accomplish in STR but would most
likely cause some performance problems in a relational database.
The high complexity example (above) is problematic all around. In order for STR to be able to handle this
example, the QRACEx fields must be denormalized into the Person entity. In other words, the built-in
multi-response function of STR cannot handle the 'alone' aspect, so the code must be 'denormalized' and
know that there are 8 QRACE fields to code for. If this were attempted in a relational database, then either
a similar approach would have to be taken or a relatively complex (and slow) SQL statement would have
to be formed.
4.7.4. Iteration in an Operational Sense
For operational purposes, Iterations provide a natural break point in an environment where products can
take nearly a machine-year to complete. Therefore, most of the scripts in the DPP system use the
<Product>Iterations.txt file to limit their scope of work. Coupled with the built-in concept of waving,
multiple invocations of the same script can work on different Iterations simultaneously, thereby leveraging
the parallel processing power of the DPP servers; a conceptual sketch follows.
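The waving mechanism itself is built into the DPP scripts; the following conceptual ksh sketch, with
hypothetical script arguments, merely illustrates running one invocation per Iteration in parallel:
    # Launch one Tab invocation per iteration listed in the product's
    # iteration driver file, then wait for the whole wave to finish.
    while read Iteration; do
        Tab "$Product" "$State" "$Iteration" &
    done < "${Product}Iterations.txt"
    wait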


4.8.      About Logging
The DPP system is initiated and controlled by a set of KSH scripts. These scripts, in turn, may call other
programs which include but are not limited to other KSH scripts, SAS programs, STR programs, and Perl
scripts. The KSH scripts have been developed to display detailed (and sometimes voluminous)
information about the script's progress. Capturing that information from the KSH scripts, along with any
other information returned by called programs, constitutes the bulk of logging. A separately maintained
central file (named log) is written to by the KSH scripts and summarizes all script-based activity in one
file. In summary, logging is achieved in three ways:
         All KSH scripts are developed in a manner such that a log file is created early in the script for
          each invocation of that script. All output (standard and error) is redirected to that logfile (and
          thereby preserved).
         All KSH scripts are developed in a manner such that major check points are logged into a single
          file (named log). A separate file is maintained for each product environment. Specifically the
          following events are written to the file named log:
              start of the script
              successful completion of a stage (if the script has a stage concept)
              end of the script (successful)
              error termination of a script.
         All output from called programs (SAS, STR, etc) is captured into a log file for each invocation of
          that program. All output (standard and error) is redirected to that logfile (and thereby preserved).
4.8.1. SAS Naming Conventions
As mentioned above, every SAS program writes to a unique log file. The log file-naming specification is
as follows:
     ${DPPwork}/${DPPenv}/logs/${LS_StateName}/$(date
     '+%Y%m%d')/${Program}.${LS_Product}.${LS_State_UC}.${LS_Type_UC}.$(date
     '+%Y%m%d-%H%M%S')_$$.saslog
${DPPwork}                 Product environment root directory. For example, '/dpp2/prod'.
${DPPenv}                  The product environment, usually the name of the product. For example, 'uSF3'.
${LS_StateName}            The state name. For example, 'South Carolina'.
$(date '+%Y%m%d')          The system date in the format specified. For example, '20031229', which
                           is December 29, 2003.
${Program}                 The name of the calling script, for example, 'Tab'.
${LS_Product}              Product being processed, for example, 'uSF3'.
${LS_State_UC}             Two character postal code for the state in upper case. The United States
                           is represented as 'US'. For example, 'MD'.
${LS_Type_UC}              Type. 'U' for unadjusted. 'A' for adjusted.
$(date '+%Y%m%d-%H%M%S')   The system date in the format specified. For example, '20031229-
                           034516', which is December 29, 2003 at 03:45:16 am.
$$                         The system process identifier of the invoked SAS program. Sometimes
                           referred to as the PID. For example, '15435'.
A complete example of a uSF4 log file for a SAS call from the Tab script would be
/dpp2/prod/uSF4/logs/Vermont/20040417/Tab.uSF4.VT.U.20040417-
110406_951413.saslog


4.8.2. Non-SAS Naming Conventions
As mentioned above, every DPP shell script automatically redirects its standard output and error stream
to a unique log file. The log file-naming specification is as follows:
     ${DPPwork}/${DPPenv}/logs/${LS_StateName}/${Program}.${LS_Product}.${LS_Sta
     te_UC}.${LS_Type_UC}.$(date '+%Y%m%d-
     %H%M%S')_s${StartingStage}e${EndingStage}_$$
${DPPwork}                 Product environment root directory. For example, '/dpp2/prod'.
${DPPenv}                  The product environment, usually the name of the product. For example, 'uSF3'.
${LS_StateName}            The state name. For example, 'South Carolina'.
${Program}                 Name of the script, for example, 'ProcessGeo'.
${LS_Product}              Product being processed, for example, 'uSF3'.
${LS_State_UC}             Two character postal code for the state in upper case. The United States
                           is represented as 'US'. For example, 'MD'.
${LS_Type_UC}              Type. 'U' for unadjusted. 'A' for adjusted.
$(date '+%Y%m%d-%H%M%S')   The system date in the format specified. For example, '20031229-
                           034516', which is December 29, 2003 at 03:45:16 am.
${StartingStage}           If the script has the concept of stages, then the starting stage. If none is
                           specified when the script is called, stage '0' is used.
${EndingStage}             If the script has the concept of stages, then the ending stage. If none is
                           specified when the script is called, the last stage number is used.
$$                         The system process identifier of the invoked script. Sometimes referred
                           to as the PID. For example, '15435'.
A complete example of a uSF4 log file for ProcessGeo stages 40 to 45 for Vermont would be
/dpp2/prod/uSF4/logs/Vermont/ProcessGeo.uSF4.VT.U.20040503-
122650_s40e45_58392

4.8.3. Structure of Logs File Directories
Because of the sheer number of log files generated by some DPP products, a series of subdirectories is
created that divides the log files among several directories. The reason for doing this is simple
practicality: too many files in a directory make it difficult and slow to locate individual files, and it has also
been observed that the AIX backup software may experience time-out problems with massive directories.
The current approach was devised through trial and error and works for the largest DPP products. As a
note, uSF4 produced approximately 135,000 log files.
4.8.4. Directory $DPPwork/$DPPenv/logs
Description:        Root directory for all logs files in the product environment. In general, files are not written
                    to this directory. Files are written to sub directories.
Writers:            None
4.8.5. Directory $DPPwork/$DPPenv/logs/<State Name>
Description:        Central location for all non-SAS logs for a specific state.
Writers:            Most KSH scripts, STR programs, Perl scripts, etc.
4.8.6. Directory $DPPwork/$DPPenv/logs/<State Name>/<YYYYMMDD>
Description:        Central location for all SAS logs for a specific state on a specific date. There are potentially
                    too many SAS log files to fit into the $DPPwork/$DPPenv/logs/<State Name> directory, so the
                    SAS log files had to be further divided by date.

Writers:            SAS Programs.


4.8.7. Log File $DPPwork/$DPPenv/log
Description:      Each product environment has this single file. Every script logs its start and end status in the log
file. The DPP status-reporting infrastructure uses this information to reconstruct the progress of the current product.
Writers: Most KSH scripts.
Contents: The central log file contains the following data items in plain-text, variable-length, pipe (|) delimited
                fields:
                    Date-time in the form, e.g. Tue Mar 28 15:27:32 EST 2000
                    User (eight-character "James Bond" ID)
                    Platform
                    Process ID
                    Program/Event
                    Product
                    State (2-character Postal code)
                    Type [A | U]
                    Log point [ start | stage | error | end | rerun ]
                    Stage/Return code [ 0 | <stage number> | $ErrCode ]
                    Log file name (script log file)
                    Comment
Examples: Below are two sample log file entries from the ProcessGeo script; the first marks the start of the script,
and the second marks the successful completion of stage 40. The entries are long, so they wrap in this document.
    20040114-161556|jbond007|dpp2.dads.census.gov|70416|ProcessGeo|AIAN|US|
    U|start|0|/dpp1/dev/IM_AIAN/logs/US/ProcessGeo.US.U.AIAN.20040114-
    161556_s40e40_70416|

    20040114-161556|jbond007|dpp2.dads.census.gov|70416|ProcessGeo|AIAN|US|
    U|stage|40|/dpp1/dev/IM_AIAN/logs/US/ProcessGeo.US.U.AIAN.20040114-
    161556_s40e40_70416|
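Given the pipe-delimited layout above (the log point is the ninth field), an ad hoc check for error
terminations could look like the following; the DPPStatus script described later provides the full reporting:
    # List central-log entries whose log point is 'error' for one
    # product environment.
    awk -F'|' '$9 == "error"' ${DPPwork}/${DPPenv}/log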


4.9.       About Median processing
Quartile processing includes medians, lower quartiles, and upper quartiles. For this discussion, they will
all be referred to as medians.
Medians require a special processing stream due to the inability to build and use a US-level detail
database. The state-level and US-level processing streams are explained below.
4.9.1. State-Level Median Processing
Each median product table (matrix) has two TXDs sourced in SuperCROSS. One contains instructions to
tabulate the numbers for the base distribution; the other contains the same instructions as the first, plus
the instructions for the median calculation.
Appearance of TXD          Example     Change?   Appearance of TXD          Example
for Base Distribution      Pop Count             for Median Calculation     Median
Total                      200         Hidden
 Male                      95          Hidden
  Age 0-19                 20          Hidden
  Age 20-39                40          Hidden
  Age 40-59                21          Hidden
  Age 60-79                9           Hidden
  Age 80-99                5           Hidden
  Age 100 or more          0           Hidden
 Female                    105         Hidden
  Age 0-19                 17          Hidden
  Age 20-39                48          Hidden
  Age 40-59                20          Hidden
  Age 60-79                15          Hidden
  Age 80-99                2           Hidden
  Age 100 or more          3           Hidden
                                       Added     Total                      34.3
                                       Added     Male                       33.5
                                       Added     Female                     34.8
Table 73: Example of State Base Distribution and Median TXDs; numbers are synthetic

4.9.2. US-Level Median Processing
This section highlights the major steps in US-level median processing. A complete discussion of US
processing and aggregation can be found elsewhere in this document.
US-level median processing begins by running Tab stage 3200, which aggregates the state-level
numbers from the state-level distribution TXDs. This data is fed into a database build process along with
the appropriate TDD and geography files, producing a distribution database. We tabulate TXDs against
this distribution database, but the TXDs are not the same as those mentioned in state-level processing
since the database has an entirely different structure. The TXDs we submit are close in appearance to
the "TXD for Median Calculation" in section 4.9.1.

4.10. About the Parameterization of the DPP System – metadata as driver files
The DPP system is a general-purpose data product production system. It produces “products” (namely,
Summary Files) based on product specifications defined by POP/HHES, geography data from Geography
Division, and detailed data files generated from the 2000 Decennial Census. Each product brought
specific business requirements, which were generalized into reusable architectural capabilities in the
DPP system. A key design decision was to abstract and parameterize these capabilities into text-based
driver files. Subsequent products with the same business requirements can then leverage these
capabilities by setting the proper driver file attributes. In earlier DPP documentation, this is called
"operational programming" and the driver files are referred to as "operational material." An advantage of
this approach is that the development time for a typical DPP product has decreased over the years. Very
complex products like Supplemental School Districts were developed rapidly by reusing code and driver
file building blocks from previous products.
The following table illustrates Decennial products that introduced new requirements. In each case, the
DPP system architecture was extended to include this new capability.
      Product                                          Characteristics/Innovations
Public Law (PL)                   The ability to deal with two sets of detail databases (adjusted and
                                   unadjusted) was built into the system.
Summary File 1                    The ability to combine several tables into one for tabulation, and
(SF1)                              break apart the results after tabulation (“split tables”).
                                  A product with 52 state summary files and a single United States
                                   summary file. An approach to aggregate the US results from the 52
                                   states was developed.
                                  An approach to tabulate US medians from the aggregated state
                                   distributions was developed.
                                  Introduced prior-product program to match specific cell and
                                   geography combinations between product summary files.
                        Updated system to insert Population-based size codes into the Geo
                          Header.
                        Added support for geographic components.
                       Support for PCT table – filter out certain geographies from PCT tables
                         (everything below census tracts)
                       Develop the ability to do an advanced and final national variation of
                         the base SF1 product
Summary File 2         The first iterated DPP product. The iteration concept was introduced
(SF2)                    throughout the DPP system.
                       The SIPHC process – performing an initial tabulation to figure out
                         which subsequent tabulations were unnecessary (because they
                         would be suppressed by thresholding) – was a significant
                         performance optimization.
                       Support iterations in the internal and prior-product matching
                         programs.
Summary File 3         Support for an associated product (geographic component 49).
(SF3)                  The ability to split and cache a geographic recode for performance
                         was introduced.
                        Support for two databases – the ability to tab against the sample and
                          100% databases in the same product.
Summary File 1         Building on the geographic recode optimizations of SF3, duplicate
Supplemental             geographies were identified and removed from the recode (“recode
                         reduction” or “dehydration”, and its inverse, “rehydration”).
Summary File 4         The SIPHC process, developed for SF2, was generalized and the
(SF4)                    concept of mini-products to support SIPHC was introduced. The
                         generalized SIPHC process was leveraged in AIAN.
                       The ability to split a single table across summary file segments.
                       The ability to store logs and reports in state-specific subdirectories.
                       Hardening the system to deal with literally millions of intermediate and
                         output files.
American Indian and  AIAN was the first product to build a national database. This was
Native American          possible because the database included only those households that
(AIAN)                   contained relevant persons.
School Districts       A business requirement required tabulation rules that varied by
                         summary level. This was achieved by creating geographic recode
                         sets, which were similar to split recodes – but driven by business
                         rules, not performance. Each geographic recode set was submitted
                         to SuperCROSS as a separate tabulation.
                       Extended Prior-Product match program to allow matches between
                         specific GEO_Ids between products (called “geo matching”). Useful
                         because many, but not all, school district summary levels (950, 960,
                          970) are identical to existing county, county subdivision, or place
                          geographic areas.
Table 74: The development of new functionality for Decennial 2000 Products

4.11. About Quality Assurance
100% complete, error-free, accurate processing is a requirement for the DPP system. Because of the
time required to prepare many products, early detection of any error is very important. Therefore, quality
checking of many different types is incorporated into the DPP system at many different points. What
follows is a chronological description of the quality assurance measures in the DPP system.
         All scripts (and stages of scripts) in the DPP system perform the following checks, and stop as
          soon as an error is detected (a sketch of a typical preamble follows this list):
              checks on the parameters with which they have been invoked;

             checks on the presence of necessary driver files;
             checks on the existence of required input files;
             checks that no files will be overwritten, unless “overwrite” has been specified as a parameter;
             checks that sufficient disk space exists for the planned processing;
             checks, at the end of each stage, that the intended output files have, in fact, been created.
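A minimal sketch of such a preamble, in the spirit of the checks above; ErrorExit and CheckDiskSpace
are hypothetical helper names, not actual DPP functions:
    # Hypothetical stage preamble: stop as soon as any check fails.
    [[ $# -ge 3 ]]       || ErrorExit "usage: <script> Product State Type"
    [[ -r $DriverFile ]] || ErrorExit "missing driver file: $DriverFile"
    [[ -r $InputFile ]]  || ErrorExit "missing input file: $InputFile"
    if [[ -e $OutputFile && $Overwrite != "Y" ]]; then
        ErrorExit "refusing to overwrite: $OutputFile"
    fi
    CheckDiskSpace "$WorkDir" || ErrorExit "insufficient disk space in $WorkDir"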
         All scripts record their start and end (and error termination, if possible) in the central log file for
          the product. The DPPStatus script reads the central log file for a product and prepares three
          kinds of reports on processing, including clear identification of processing stages that have
          error terminated. DPPStatus is run at will by the staff conducting these operations.
        Stage 20 of the Get script checks Detail data files (HDF, HEDF, SDF, SEDF) and DPP
         Geography files for valid values and structure. It prepares a report on its findings.
        The ProcessGEO script tabulates AreaLand and AreaWater from the block records for each
         summary level in the product, and compares the sums to the AreaLand and AreaWater provided
         on the summary level records by Geography Division. It prepares a report on its findings.
        Stage 1000 of the Tab script (source-file verification) compares the fields that are common
         between the block records of the Detail file, and the block records in the DGF file, and prepares a
         report showing differences in any of them (program DF-DGFConsistency.sas). It also sums the
         housing weights and person weights in each block, and compares them to the totals provided on
         the block records (programs ~VerifyCounts.sas). It prepares a report on its findings.
        Stage 2000 of the Tab script (database build) compares the contents of all fields in the Detail file,
         to the valid values in the TDD files (DPP metadata files) when snbu, the STR database builder
         runs. If an error is detected, it is reported, and processing stops.
        Stage 4200 of the Tab script (Summary File rollup verification) checks a set of Summary Files for
         conformance to certain known geographic relationships. It sums four cells according to two
         geographic hierarchies which are specified in driver files, and prepares a report containing any
         differences that are found. The four cells are the first cell in the product (which is either 100% or
         Sample population), AreaLand, AreaWater, and POP100.
             One of the two geographic hierarchies is called HighLevel, and this is an example of the
              HighLevel comparison: the four sums from all census tract summary level records, and the
              four sums from all county summary level records, should be equal. The DPP system
              prepares a report on its findings.
             The second of the two geographic hierarchies is called LowLevel, and this is an example of
              the LowLevel comparison: the four sums from all census tract summary level records in each
              county should equal the figures that are in the product for that county. The system prepares
              a report on its findings.
         Stage 5100 of the Tab script (Analyzer Match) compares cells in a set of Summary Files against
          comparable cells in a file which has been tabulated by an independent group, for all geographic
         entities which are common between the two. The geographies that are common, as well as the
         list of comparable cells, are contained in the driver files. A report is prepared by the DPP system
         on the comparison.
        Stage 5200 of the Tab script (Internal Match) compares cells (and sums of cells) in a set of
         Summary Files against other cells (and sums of cells) in the same set of Summary files, for every
         geographic entity, for every iteration. This is done to ensure consistency within the product itself.
         The list of comparisons to be performed is contained in the driver files. A report is prepared by
         the DPP system on the comparison.
        Stage 5300 of the Tab script (Prior Product Match) compares cells in a set of Summary Files
         against comparable cells in a previously-approved set of Summary Files, for all geographic
         entities which are common between the two, for every iteration. This is done to ensure
         consistency among products. The geographies that are common, as well as the list of

         comparable cells, are contained in the driver files. A report is prepared by the DPP system on the
         comparison.
        Stage 5400 of the Tab script (Verify GeoHeader/DPPGeo Files) compares the contents of the
         geoheader file in a Summary File set to the contents of the original file delivered by Geography
         Division for the product to ensure that all codes that should have been transferred are equal. A
         report is prepared by the DPP system reporting the result of the comparison.
         The DPP system was used on several occasions to produce a "QC" product, when a product
          contained geographic levels which had not been present in previous products. Special driver files
         and TXD’s were developed and deployed to support the QC product. The basic point of the QC
         product was to tabulate Pop100, HU100, AreaLand, and AreaWater, using the same geographic
         recode which was used to tabulate the product itself, and compare those figures to those in the
         resulting set of Summary Files.
        The staff conducting operations reviewed the content of all reports mentioned above for every set
         of Summary Files produced, before submitting the product to the sponsor for review and
         approval. To the extent possible, they reviewed them as soon as they were produced.
        The staff conducting these operations also monitored the contents of the SXoutput subdirectory
         for the creation of error logs which occasionally were the only indication of ss2ps tabulation
         engine errors.
        In addition, the disk system was monitored by the staff conducting these operations proactively to
         avoid problems that would be caused by overloading resources, and / or by running out of disk
         space in file systems.
The following document contains details on the Analyzer, Internal, and Prior Product matches that were
performed for various products:
         Census 2000 Data Products: Matching values in summary files to ensure consistency
         File creation date: 02/18/2004. Filename: ComparingC2KSummaryFiles.xls
The contents of this document have been inserted below for the convenience of the reader.




When a summary file was created, its data cells were compared to other data cells in that product, and to data cells in previously created products. "Compared" means that every opportunity to match a cell (or a sum of cells) in the new product to another cell (or sum of cells) was executed, for every possible geographic entity. The table shows which files were matched, and how many cell comparisons were performed for each geographic entity.

[Table: "Comparing values in Census 2000 summary files - to ensure consistency", parts 1-3, dated 2/18/2004. The original spreadsheet is a matrix: each row is a Summary File product created with the DPP system, each column is a Summary File product or other tabulation used for matching, and each populated cell gives the number of cell comparisons performed for each geographic entity. The per-cell comparison counts range from single digits to several million; see the source spreadsheet (ComparingC2KSummaryFiles.xls) for the full grid.

Rows (Summary File products created with the DPP system): P Redistricting (PL); I Population Size Code Tabulations; P SF1 - 52 States; P SF1 - Advance US; P SF1 - Final US; P SF1 Supplement (U/R, H2/P2) - 52 States; I Pop & Housing Tabs (SIPHC) for SF2 thresholding; P SF2 - 52 States; P SF2 - Advance US; P SF2 - Final US; P SF2 Supplement (PCT5) - 52 States; I Preliminary Sample Tabs for Demographic Sample Stratification; P SF3 - 52 States; P SF3 - Final US; P 108th Congressional District - SF1 tab - 52 States; P 108th Congressional District - SF3 tab - 52 States; I 100% Housing Tabs for uSF4 thresholding; I 100% Pop Tabs for uSF4 thresholding; I Sample Pop Tabs for uSF4 thresholding; P SF4 - 52 States; P SF4 - US; I 100% Housing Tabs for AIAN thresholding; I 100% Pop Tabs for AIAN thresholding; I Sample Pop Tabs for AIAN thresholding; P AIAN - US; P School Districts: Total POP - US; P School Districts: Children's Own Characteristics - US; P School Districts: Household Characteristics - US; P School Districts: Parents' Characteristics - US; P School Districts: Characteristics of Children's Households - US; P School Districts: Characteristics of Children's Parents - US; P School Districts: Children's Own Characteristics, Iterated - US.

Columns (Summary File products and other tabulations used for matching): Internal Match; 100% Editals**; 100% Analyzer Tabs**; Sample Editals**; Sample Analyzer Tabs**; uPL - 52 states; uSF1 - 52 states; uSF1A - US; uSF1F; uSF2; uSF2A; uSF2F; uSF3 - 52 states; uSF3 - US; uSF4SIS; SF4 - 52 States; SF4 - US; AIANSIS; Independent SD tabs**; SDCO.

P = public product; I = internal product.
* these matches were performed manually. All other matches were performed by computer.
** these tabulations were performed by DSCMO. All other tabulations were performed by DADSO.]
4.12. About Status reporting
The DPP system prepares three kinds of reports on the status of operations. They are prepared by the
DPPStatus script. Each report is a snapshot of what has happened as of the moment at which it is run.
All scripts record their start and end (and error termination, if possible) in the central log file for the
product. The DPPStatus script uses this information to reconstruct the progress of the current product
and displays it in the Status and ExtendedStatus reports.
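The reconstruction is essentially a scan of the central log for start/end/error records. The sketch below (Python) illustrates the idea; the log line format and stage names are hypothetical, not the actual DPP log layout.

    # Illustrative sketch only: each log line is assumed to carry a timestamp,
    # a script/stage name, and an event (START, END, or ERROR), separated by |.
    from collections import defaultdict

    def reconstruct_status(log_lines):
        """Return {script: 'C' | 'S' | 'X'}, following the report legend
        (C = Complete, S = Started, X = Errors detected)."""
        status = defaultdict(lambda: None)
        for line in log_lines:
            _ts, script, event = line.strip().split("|")
            if event == "ERROR":
                status[script] = "X"
            elif event == "START" and status[script] is None:
                status[script] = "S"
            elif event == "END" and status[script] == "S":
                status[script] = "C"
        return dict(status)

    log = ["2002-07-22 09:01|Tab-3100|START",
           "2002-07-22 11:47|Tab-3100|END",
           "2002-07-22 12:02|Tab-3200|START"]
    print(reconstruct_status(log))   # {'Tab-3100': 'C', 'Tab-3200': 'S'}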
The DPPStatus script also examines the contents of the /SXoutput subdirectories and prepares the
CountCSVFiles report on the progress of tabulations by counting the files which have been created.
The three reports provide information at three levels of detail. Brief descriptions and examples are given
below. The examples are based on real reports, but the contents have been edited to show typical
behavior and patterns.
The Status report is the highest level of the three reports. It summarizes activity for the steps from data
acquisition through sponsor approval, and highlights the status of interactions between DPP and the other
BOC groups involved in production.
                                      Census 2000 uSF3 Production Status Report
Report run: Aug 26, 2002 10:23:28
(Legend: C = Complete; S = Started; X = Errors detected)

 State           Geo file     Detail file   Analyzer   Tabulation   Analyzer   Review        Data          Handoff   Handoff
 FIPS/Postal     received     received      received   performed    matched    materials     cleared by    to ACSD   to AFF
 Name            & verified   & verified                                       released to   POP &
                                                                               POP & HHES    HHES

TOTAL                 52            52             52            52          0       0        0        45              45

01 AL Alabama C07/22/02 X07/22/02 C07/22/02 C07/24/02 C07/25/02 C07/26/02                              C07/29/02 C07/26/02

02 AK Alaska          C07/22/02 X07/22/02 C07/22/02 C07/22/02 C07/22/02 C07/22/02                      C07/28/02 C07/22/02

04 AZ Arizona         C07/23/02 X07/23/02 C07/23/02 C08/11/02 C08/11/02 C08/12/02                      C08/21/02 C08/12/02

          :                   :           :              :             :         :       :        :          :              :


Figure 15: Example of a Status report

The Extended Status report focuses on progress through each of the 25 steps in the Tab script. (This
information is summarized in the 5th and 6th columns of the Status report, described above.)
Note that it reports on Tab stages 6000-6200, which are obsolete. The Extended Status report is
designed to be printed on legal-size paper in landscape orientation, and has been reformatted in the
example below to fit on this letter-size page.
Note that different stages of the Tab script were run for state processing than for US processing.
                               Census 2000 uSF3 Production Tab Extended_Status Report
Report run: Aug 26, 2002 10:25:19
(Legend: C = Complete; S = Started; X = Errors detected)

 State     1000 =    2000 =    2100 =    2100 =    2900 =   3000 =     3050 =    3100 =     3200 =      3300 =   3400 =   3500 =   3600 =
 FIPS/     Source    Detail    Detail    Detail    TXD      Insert     Split     Tabulate   Aggregate   Prep     Median   Cat      Split
 Postal/   file      file      DB Cat    file      Setup    geo        Results              National    median   Build    SF1F     Tables
 Name      verify    Super               Super              recodes                         median                        Median
           data      Cross               Cross              into
           prep      DB                  DB                 Super
                     Build               Install            Cross
                                                            tables
                                                            (TXDs)

TOTAL       52         52          52     52        52         52        0          52           1       1        1      1        53

01 AL C072202 C072202 C081502 C081502 C081502 C081502                             C081502                                       C081502

02 AK C072302 C072402 C072402 C072502 C072602 C072702                             C080202                                       C080202

   :        :           :          :       :         :          :        :           :           :       :        :       :        :

00 US                                                                                         C082302 C082002 C082102 C082102 C080202

Figure 16: Example of Extended Status report (first 14 columns)




 3700 =          4000 =     4100 =        4200 =   5000 =    5100 =     5200 =     5300 =    6000 =       6100 =    6200 =
 Unconditional   Summary    Verify        Verify   Summary   Analyzer   Internal   Prior     Generate     Build     Catalog
 Rounding        File       Composition   Rollup   Data      Match      Match      Product   Super        Super     RM DB
                 Creation                          Verify                          Match     Cross        Cross
                                                                                             Review       RM DB
                                                                                             Materials

       53        53            0          52        53              0        53          53          0   0    0

C081502      C081502                    C081502 C081502                 C081502 C081502

C080302      C081502                    C081502 C081502                 C081502 C081502

       :         :             :           :             :          :        :           :

C082302      C082402                              C082602               C082602 C082602

Figure 17: Example of Extended Status report (last 11 columns)



The CountCSVFiles report contains relatively simple information showing how many files of specific
types have been created. Various stages of the Tab script (3000 through 3700) create files with different
file extensions, and the column headings refer to those extensions. This report is useful while a
state/US is being tabulated; the number of files created should keep growing until it reaches the expected
total for that product.
Note that CA (California) has many more files than most states, because the California geography was
'split' into four pieces for tabulation.
Note that PR (Puerto Rico) has a few more files than most states, because special versions of several
tables were used for Puerto Rico only.
Note that the number and types of files created for the US run were unique.
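The counting itself is straightforward, as in the sketch below (Python). The per-state directory name is hypothetical, and the report's actual columns group files more specifically than plain extensions; the sketch only illustrates the counting idea.

    # Minimal sketch: count output files by extension under one state's
    # SXoutput directory, in the spirit of the CountCSVFiles report.
    from collections import Counter
    from pathlib import Path

    def count_output_files(state_dir):
        counts = Counter()
        for f in Path(state_dir).iterdir():
            if f.is_file():
                counts[f.suffix.lstrip(".")] += 1   # e.g. 'csv', 'log'
        return counts

    # Hypothetical per-state path under the directory named in the report:
    # count_output_files("/dpp2/prod/uSF3/SXoutput/uSF3/state/unadjusted/01_AL")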
Report run on 26AUG02 at 10:25 for Product uSF3
The report is based on information located at:
/dpp2/prod/uSF3/SXoutput/uSF3/state/unadjusted/

                  csv        csv                    csv          csv           csv    csv cr   csv cr   error       latest
 State            n+s         n          csv        _bk        formula         cr       n       ns      log         date

01     AL       725        725         949          86             0            0     0     0       0         N/A
02     AK       725        725         949          86             0            0     0     0       0         N/A
04     AZ       725        725         949          86             0            0     0     0       0         N/A
05     AR       725        725         949          86             0            0     0     0       0         N/A
06     CA       725        725        3849          86             0            0     0     0       0         N/A
08     CO       725        725         949          86             0            0     0     0       0         N/A
:       :         :          :            :          :             :            :      :     :       :          :
72 PR           748        748          963         86             0            0     0     0       0         N/A
   US             0          0          948         86             0            0     0     0       0         N/A
Figure 18: Example of CountCSVFile report

4.13. About Tabulation
4.13.1. What is Tabulation?
Tabulation is a loose term. The DPP system is sometimes referred to as a tabulation system, and at a high
level that is true. There is also the term 'cross tabulation', which implies some type of dimensional
analysis. This section does not attempt to split hairs on the subject, but rather to convey the following
important points about tabulation in DPP:
        • Core tabulation is similar to, but separate from, Post Tabulation.
        • Core tabulation is similar to OLAP.
        • Detail data is an input to core tabulation.
        • Production TXDs are inputs to core tabulation but are not part of the DPP system builds.
        • TXDs are integral to understanding core tabulation.
   TXD Templates (.txd)     \
   Iteration Definitions     +-->  Form Production TXDs --+
     (.txt)                  /                            |
   Geo Recodes (.txt)       /                             v
                                              Core Tabulation (STR) -->  Tabulation Output (.csv)
   Detail Databases (.sxv4) --------------------------^                          |
                                                                                 v
                                                                   Post Tabulation (SAS) -->  Post Tabulation Output (.csv)
Figure 19: The manipulation of TXDs and Recodes for tabulation

4.13.1.1. Core Tabulation
'Core tabulation' in the DPP system generally refers to what occurs during Tab Stages 3000 and 3200.
Most of the computational effort up to these stages goes into preparing the necessary inputs for core
tabulation. Strictly speaking, Tab Stages 3000 and 3200 do not perform core tabulation themselves so
much as they coordinate core tabulation by STR.
Core tabulation is performed by software from STR. In practice this comes down to calling the ss2ps
program repeatedly until all tabulations are complete. The details of how tabulation is performed by STR
are unknown to the DPP system, but generally speaking the ss2ps program reads in a TXD and forms a
cube based on the instructions contained in the TXD. When the cube is complete, the results are written to
a flat ASCII text file.
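The coordination amounts to a loop over production TXDs. The sketch below (Python) shows the shape of that loop; the ss2ps command-line arguments are placeholders, since the actual invocation is not documented here.

    # Sketch of the coordination role described above; the ss2ps argument
    # list is a placeholder, not the real interface.
    import subprocess
    from pathlib import Path

    def run_core_tabulation(txd_dir, output_dir):
        """Call ss2ps once per production TXD until all tabulations complete."""
        for txd in sorted(Path(txd_dir).glob("*.txd")):
            result = subprocess.run(["ss2ps", str(txd)], cwd=output_dir)
            if result.returncode != 0:
                # errors may surface only as error logs in SXoutput
                print(f"ss2ps reported an error for {txd.name}")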
4.13.1.2. Post Tabulation
Once core tabulation is complete, some 'post tabulation' rules may be applied to the data. These rules
vary by product; examples include universal rounding and special tab rounding. The primary reason to
implement these rules in a separate step is traceability: by implementing a rule in post tabulation, the
'before' number can be compared to the 'after' number. The DPP system uses SAS to perform most of the
post tabulation steps.
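Although DPP performs this step in SAS, the before/after traceability idea can be illustrated with a short sketch (Python; the file layout and the rounding rule are hypothetical):

    # Hypothetical post-tabulation step: apply a rounding rule while keeping
    # the pre-rule value available for comparison (the traceability argument).
    import csv

    def post_tabulate(in_csv, out_csv, round_to=5):
        with open(in_csv, newline="") as fin, open(out_csv, "w", newline="") as fout:
            reader, writer = csv.reader(fin), csv.writer(fout)
            for row in reader:
                before = int(row[-1])                        # tabulated cell value
                after = round_to * round(before / round_to)  # round to nearest 5
                writer.writerow(row[:-1] + [before, after])  # keep both values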
4.13.2. Detail Databases
Tabulation is based on detail data. That data is modeled in a star-like schema and stored in a proprietary
STR format (known as .sxv4 databases). Additional details about detail databases can be found in the
dedicated Detail Databases section of this document.
4.13.3. Tabulating Medians
4.13.3.1. Medians in general
Medians are relatively difficult to compute on large record sets. For that reason, and perhaps for other
statistical reasons, the Census Bureau implements median calculations by applying a mathematical
function to a ranged (binned) distribution.
Requirements for Median Specification TXDs
The median specification in a TXD has some specific requirements that are noteworthy.
Please refer to the code pieces in the next two figures for this section.
    RECODE "YRBLT" FROM "SYRBLT-About when was this building first built?"
    RESULT LOW 1930 MID 1939 HIGH 1940 SUBSTITUTION_STRING "1939" "1939 or
    earlier"
    RESULT LOW 1940 MID 1944.5 HIGH 1950 "1940 to 1949"
    RESULT LOW 1950 MID 1954.5 HIGH 1960 "1950 to 1959"
    RESULT LOW 1960 MID 1964.5 HIGH 1970 "1960 to 1969"
    RESULT LOW 1970 MID 1974.5 HIGH 1980 "1970 to 1979"
    RESULT LOW 1980 MID 1984.5 HIGH 1990 "1980 to 1989"
    RESULT LOW 1990 MID 1992 HIGH 1995 "1990 to 1994"
    RESULT LOW 1995 MID 1996.5 HIGH 1999 "1995 to 1998"
    RESULT LOW 1999 MID 1999.5 HIGH 2000 "1999 to 2000"
    MAP CODE "1" TO "1999 to 2000"
    MAP CODE "2" TO "1995 to 1998"
    MAP CODE "3" TO "1990 to 1994"
    MAP CODE "4" TO "1980 to 1989"
    MAP CODE "5" TO "1970 to 1979"
    MAP CODE "6" TO "1960 to 1969"
    MAP CODE "7" TO "1950 to 1959"
    MAP CODE "8" TO "1940 to 1949"
    MAP CODE "9" TO "1939 or earlier"
    END RECODE
Figure 20: Code from the RECODE section of a TXD, specifying distribution intervals for a median.

    pareto("1939 or earlier&0":"1999 to 2000&0";1000000;0.5)
Figure 21: A function call from the DERIVE section of a TXD specifying a median calculation

The "pareto" code in the second figure is a function call from the DERIVE section of a TXD. It specifies
that a median calculation is to be performed on the distribution specified by the code in the first figure.
The first parameter to the "pareto" function lists the starting and ending distribution intervals. The second
parameter is the break point that selects linear or logarithmic interpolation. The third parameter indicates
the quantile to compute: a value of .5 results in a median, whereas a value of .25 results in the lower
quartile.
Linear versus Pareto interpolation
The pareto function in STR uses either linear interpolation or Pareto (logarithmic) interpolation,
depending on the 2nd parameter. If the median falls within a distribution interval whose range
(HIGH - LOW) is less than the 2nd parameter, then linear interpolation is used. Otherwise, Pareto
(logarithmic) interpolation is used. In general, although each case is specified by the Bureau, income
distribution intervals with a range >= 2,500 use Pareto (logarithmic) interpolation, and all other
distribution intervals use linear interpolation. In the above example, the 2nd parameter (1,000,000)
effectively forces the STR pareto function to use linear interpolation for all distribution intervals.
The MID value
The MID value is ignored by the pareto function.
Order of distribution intervals in a TXD
The distribution intervals must be mapped in a certain order.
For year-based distributions, the results must be mapped from the highest value to lowest value, as
shown in the example above. For all other distributions, the results must be mapped from lowest value to
highest value.
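To make the selection rule concrete, here is a small sketch (Python). It illustrates the rule described above and is not STR's implementation; in particular, the logarithmic branch uses plain log-scale interpolation as a stand-in for STR's Pareto formula, and the interval data is hypothetical.

    # Illustrative sketch: compute a median from a binned distribution,
    # choosing linear vs. logarithmic interpolation by the breakpoint rule.
    import math

    def binned_quantile(intervals, breakpoint, q=0.5):
        """intervals: ordered list of (low, high, count). If (high - low) is
        less than breakpoint, interpolate linearly; otherwise logarithmically.
        q=0.5 yields the median, q=0.25 the lower quartile."""
        total = sum(c for _, _, c in intervals)
        target = q * total
        cum = 0.0
        for low, high, count in intervals:
            if cum + count >= target:
                frac = (target - cum) / count        # position within interval
                if (high - low) < breakpoint:        # linear interpolation
                    return low + frac * (high - low)
                # logarithmic interpolation (stand-in for STR's Pareto formula)
                return math.exp(math.log(low) + frac * (math.log(high) - math.log(low)))
            cum += count
        return intervals[-1][1]

    # Hypothetical income distribution; with breakpoint 2,500, intervals with
    # range >= 2,500 are interpolated logarithmically, as described above.
    dist = [(1, 10000, 120), (10000, 25000, 340), (25000, 50000, 410), (50000, 100000, 130)]
    print(binned_quantile(dist, breakpoint=2500))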
4.13.3.2. US Level Medians
The AIX implementation of the STR server product (ss2ps) handles state-level median calculations within
the hardware and software limitations of the system. However, US-level calculations exceed the
limitations of STR, and a different approach is required: calculate the distributions at the state level during
Tab Stage 3000, aggregate those distributions in Tab Stage 3100, and then apply the appropriate median
algorithm to the aggregated distribution to arrive at the US medians in Tab Stage 3200. Although simple
in concept, this approach involves additional tabulations, code, and file handling.
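The state-to-US aggregation step amounts to summing the state-level interval counts before applying the median function, as in the sketch below (Python; state names and counts are hypothetical):

    # Sketch of the US-level approach: aggregate state-level distributions
    # (Tab Stage 3100), then apply the median algorithm to the result
    # (Tab Stage 3200). Counts shown are hypothetical.
    from collections import Counter

    state_dists = {
        "AL": {"1939 or earlier": 900, "1940 to 1949": 450, "1950 to 1959": 700},
        "AK": {"1939 or earlier": 120, "1940 to 1949": 80,  "1950 to 1959": 210},
    }

    us_dist = Counter()
    for counts in state_dists.values():
        us_dist.update(counts)          # interval counts add across states
    # the median function is then applied to us_dist to produce the US median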

4.14. About Thresholding
Thresholding is the process by which a set of rules is applied to a product to suppress data in order to
ensure confidentiality. Not all products are thresholded; some products by their very nature do not need
to be, for example SF1.
Thresholding can significantly reduce a product's size. The most extreme example is SF4: the
unthresholded product size is 2.4 trillion cells, whereas after thresholding was applied the product shrank
to 79 billion cells. Thresholding suppressed roughly 96% of the product!
In one sense thresholding adds a layer of complexity to each product: additional tabulations may be
required and, at a minimum, additional processing steps are required. However, thresholding is
sometimes the silver bullet that makes a product computationally feasible. As an example, consider SF4,
which took almost 4 months to complete. Assuming linear computational effort, it would have taken about
10 years to complete (4 months / 0.033) without SIPHC thresholding.
4.14.1. Approaches to Thresholding
In the DPP system, thresholding may be applied either before mainstream tabulation occurs or after it
occurs. There are several names and acronyms for each approach; the most common are 'brute force'
and SIPHC.
4.14.1.1. Brute Force
The brute force method tabulates the entire product and then determines which geographies to suppress.
This approach is 'brute force' in that no attempt is made to optimize the tabulation: all numbers are
computed, and then cells are discarded based on the thresholding rules. This approach is acceptable for
smaller products but is not practical for large products. For example, this was the approach used for the
School Districts product.
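A minimal sketch of the brute-force shape follows (Python; tabulate() and passes_threshold() are hypothetical stand-ins, not DPP routines):

    # Brute force: tabulate every geo/ci combination first, then discard
    # cells per the thresholding rules.
    def brute_force(geos, iterations, tabulate, passes_threshold):
        results = {}
        for geo in geos:
            for ci in iterations:
                results[(geo, ci)] = tabulate(geo, ci)   # compute everything
        # suppression happens only after all numbers are computed
        return {k: v for k, v in results.items() if passes_threshold(k)}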
4.14.1.2. SIPHC
SIPHC is short for "State Iterated Pop and Housing Counts", a name coined during the production of
SF2. The SIPHC approach used in SF2 is fundamentally the same as that used for more recent products,
but mechanically they are vastly different. Before mainstream tabulation occurs, SIPHC determines which
cells need to be tabulated - and therefore which cells can be ignored. This approach minimizes the
number of tabulations for the product.
Each product has a geographic dimension, which is generally, by far, the largest dimension. Iterated
products also have an iteration dimension, which ranges in size from a few to just over a thousand. Most
iterated products are thresholded, whereas most non-iterated products are not. Therefore, thresholded
products have a geographic dimension and generally an iteration dimension. SIPHC is quite effective
because the rules for thresholding cut along these two conforming dimensions. The DPP system is
specifically coded to loop (or skip) over iterations and then only process geographies that pass
thresholding (for that iteration); a sketch of this loop follows the figure notes below.
Refer to the illustration below. The general concept is sketched in three dimensions: Geography forms
the rows, Iterations form the columns, and the Product table cells form the 3rd dimension. Thresholding
rules for Census 2000 are based on the cube face (commonly referred to as the geo/ci combination).
Thresholding scans the cube face and determines which combinations pass the thresholding rule
(illustrated by shaded squares). Thus, for each geo/ci combination that passes thresholding, all table cells
(illustrated by the three-dimensional shaded areas) need to be tabulated. In a product such as SF4, only
3% of the geo/ci combinations were shaded.
[Figure: a cube in which Geography (G1-G16; a very large dimension) forms the rows, Iteration (I1-I6)
forms the columns, and Table Cell forms the third dimension. On the cube face, an unshaded square
represents a geo/ci combination that does NOT pass validation; a shaded square represents a geo/ci
combination that passes validation, so all table cells are tabulated for that geo/ci combination.]
Figure 22: Three-dimensional representation of thresholding

Some items to note:
- Iteration I1 is usually total population (designated in DPP as 000 or 001) and therefore will have
  the most geographies that pass thresholding. Why? Because, if any geography is going to pass
  thresholding, it will be for the iteration representing the total population. Therefore, starting on the
  leftmost edge of the cube face, if the geo/ci combination does not pass thresholding then none of
  the other iterations should pass thresholding. Geographies G6, G12, and G13 are examples of this
  situation.
- Geography G1 is usually the United States (with geocomponent 00) and therefore a similar
  concept can be applied to the cube face along the top edge. If an iteration does not pass
  thresholding for geography G1 then none of the other geographies should pass thresholding.
  Iteration I4 is an example of this situation.
4.14.2.             Iterations optimization
SIPHC also optimizes tabulation by avoiding iterations that do not meet the threshold at the US level.
Several iterations in SF4 did not meet the thresholding rules at the US level. Therefore, subsequent
processing steps (using the output from SIPHC) completely skipped processing for these iterations.
4.14.3.             Implementation - SIPHC
SIPHC evolved over the production of several products and the process has been refined to a point
where it is quite modular and robust. There is always the possibility that new products will require
additional programming, but if the rules are similar to those of prior products then implementing SIPHC
in a product is a matter of driver file configuration and additional operational steps.
[Figure: SIPHC mini-products (Mini-product 1, Mini-product 2, ...) each produce summary-file-like
output - the tabulated data needed to evaluate the thresholding rules. SIPHC processing applies the
thresholding rules to that output, producing a SAS dataset (the cube face truth table) and the list of
thresholded iterations. Product processing then consumes these outputs: ProcessGeo Stage 60, other
steps, Summary File creation, and further steps.]
Figure 23: Overview of SIPHC creation and usage

There are basically 3 parts to SIPHC processing.
    1. Create the SIPHC mini-products.
    2. Apply the thresholding rules (to the mini-product output).
    3. Incorporate the outputs from SIPHC into the product processing stream.
4.14.3.1. Create the SIPHC mini-products
The SIPHC mini-products use the DPP system to create a set of summary files that contain the necessary
information to make a thresholding decision. In terms of Figure 22 above, the same cube face (geo/ci
combination) is considered. Based on the thresholding rules, a set of TXDs in the mini-products computes
1 or more cells for each geo/ci combination.
4.14.3.2. Apply the thresholding rules (to the mini-product output)
This is the central part of thresholding. Thresholding is implemented using several driver files (SIPHC.txt,
<Product>Iterations.txt), a shell script (SIPHCHandoff), and a SAS program (SIPHC_Handoff.sas). The
data input comes from the mini-products (above). The final result of processing is 2 sets of files. Below is
a specification for uSF4. Note that the crucial elements in each file are Boolean; that is, when the
thresholding rules are applied the decision is either true or false.
The handoff between the uSF4SI product and the uSF4 product will be handled by 2 file sets.
Each fileset will have 53 files, for a total of 106 files.
Fileset 1
53 SAS datasets:
- One dataset for each state; contains (only) state-only geos and state/national geos
- One dataset for US; contains (only) national-only geos
For each Geography:
- The HU000 value (Housing Unit and Group Quarters count; different from HU100)
- The HPOP (unweighted count) value for Iterations 1 - 250 (HPOP_001 should be equivalent to POP100)
- The SCNT (unweighted count) value for Iterations 1 - 336
- The final threshold decision for Iterations 1 - 336

Each dataset has one row per geography with the following columns:

    Geo
    HU_000
    HPOP_001 ... HPOP_463              (Iterations 1 ... 250)
    SCNT_001 ... SCNT_585              (Iterations 1 ... 336)
    Threshold_001 ... Threshold_585    (Iterations 001 ... 336; see Note 1)

Figure 24: SIPHC Fileset 1 - file contents

Note 1: Boolean Indicator (T/F)
Fileset 2
53 ASCII text files, each with exactly 336 rows:
- One file for each state.
- One file for US.
For each iteration (1-336):
- The final decision whether tabulation is required.
- The final decision whether the CI should have a summary file set.
- The HPOP (unweighted count) value for Iterations 1 - 250
- The SCNT (unweighted count) value for Iterations 1 - 336
- The US HPOP (unweighted count) value for Iterations 1 - 250
- The US SCNT (unweighted count) value for Iterations 1 - 336

Iteration   Tabulate?   Create Summary   HPOP   SCNT   US HPOP   US SCNT
                        File Set?
001         T           T                150    125    5000      4000
002         T           F                 20      5     100        50
003         F           F                125     40     200        45
...
585         F           F                        40                45

Figure 25: SIPHC Fileset 2 - file contents

Note: A value of Tabulate?=F and CreateSummaryFileSet?=T is not valid. However, a value of
Tabulate?=T and CreateSummaryFileSet?=F is valid. This would be an example of an iteration
that does not meet the thresholding requirements at the state level but meets them at the US
level. Therefore, the cells need to be tabulated at the state level so that they can be summed
into a US-level product.
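
As a hedged sketch of this rule (field and function names are illustrative, not the actual file layout):

    def check_fileset2_row(iteration, tabulate, create_summary_file_set):
        # Tabulate?=F with CreateSummaryFileSet?=T is never valid.
        if not tabulate and create_summary_file_set:
            raise ValueError(f"Iteration {iteration}: cannot create a summary "
                             "file set without tabulating")
        # Tabulate?=T with CreateSummaryFileSet?=F is the case where state-level
        # cells are tabulated only so they can be summed into the US product.
        return tabulate and not create_summary_file_set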
4.14.3.3. Incorporate outputs from SIPHC into the product processing stream
The main product uses the outputs from SIPHC primarily to implement thresholding but also for
performance enhancement. From ProcessGeo Stage 60 onwards, the shell scripts use fileset 2 to
skip over iterations that are not applicable to tabulation or the final product. ProcessGeo Stage 60 uses
fileset 1 to create the minimum set of geographic recodes (for use in tabulation), and Summary file
creation uses fileset 1 to limit the geographic header that is released to the public.

4.15. About TXDs and recodes
4.15.1.             Understanding TXDs
A TXD is a textual definition of a SuperCross query; TXD is short for TeXtual Definition. It is the STR
equivalent of a SQL statement. Once a tabulation is formed in the SuperCross client, the user has the
ability to save the definition of that tabulation to a file in either binary format (.scs) or ASCII format (.txd).
TXD files can be viewed and edited using a standard ASCII file editor. The syntax is defined by STR and
reflects their philosophy and thought process for dimensional queries. If SQL is the language for relational
databases then TXD is the language for dimensional queries - at least from STR's viewpoint.
4.15.2.             How TXDs are used in the DPP Production System
A small 'sourcing' database is used to create the production TXDs; Vermont is a common choice. The
TXDs are sourced against this database for the total population (equivalent to Iteration 000 or 001). In
other words, the Iteration and Geography dimensions are not used at this point in the process. The target
database (e.g. MD, VT), Iteration dimension (e.g. 001, 012), and Geo dimension (e.g. 04000US12) are
added to TXDs in a systematic fashion during the production of a product.
4.15.3.             Parts of a TXD
This section is not a comprehensive explanation of the contents of a TXD or of the TXD language;
please refer to the appropriate STR manuals for that. This section attempts to convey how the DPP
manipulates a TXD to form a production TXD.
Below is an example TXD framework with all of the details removed. Each section defines a different
aspect of the tabulation. A short explanation of each section follows.
     HEADER
     END HEADER
     RECODE
     END RECODE
     TABLE
     ROW
              DIM
              END DIM
              DERIVE
              END DERIVE
              AXIS_MAP
                     ORDER
                     END ORDER
              END AXIS_MAP
    END ROW
    SUBJECT
    END SUBJECT
    END TABLE
Figure 26: TXD framework

4.15.3.1. Header
The Header section appears at the beginning of the TXD. It contains global settings.
Static aspects:          There are numerous settings such as ROW_WIDTH which do not have an impact on
                         batch tabulation.
Dynamic aspects: The keyword DBID is completely overwritten during TXD formation to point to the
                  correct database.
4.15.3.2. Recodes
Each TXD contains from 1 to N recode definitions. Each definition is delimited by a RECODE and END
RECODE keyword pair. Note that a recode defined in this section is not actually used unless it is
referenced in the TABLE-DIM, TABLE-WAFER, or TABLE-SUBJECT sections. Some recode definitions
are only a few lines long whereas others are hundreds of lines long.
There are basically 3 types of recode definitions:
- Recodes that define which dimensional values are of interest, and a mapping of those values to a
  new set of values, for the purpose of defining an edge of the cube. This is the equivalent of the
  select clause (with a decode statement) of a SQL statement and, to some degree, the where
  clause. For example:

[Figure: the Age dimension recoded as 'Child 5 and above in 4 bands'. The recode maps the dimension
values into four bands - Ages 6-10, Ages 11-14, Ages 15-16, and Ages 17-20 - and the remaining values
(the youngest and the oldest ages) are excluded from tabulation.]

Figure 27: Example of a SuperCROSS recode of Age
- Recodes that define which dimensional values are of interest for the purpose of limiting the query
  to a universe. This is the equivalent of the where clause of a SQL statement.
- Recodes that define which summable columns are of interest and what kind of mathematical
  process to apply to them, for the purpose of defining the cube metric. This is the equivalent of the
  select clause (with a group function) of a SQL statement.
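
Returning to the first recode type: in conventional programming terms, the Age recode of Figure 27 is simply a mapping function. A sketch (band labels taken from the figure; everything else is illustrative):

    def age_recode(age):
        # Map an age to one of the four bands in Figure 27; values outside
        # the bands are excluded from tabulation (returned as None).
        bands = [(6, 10, "Ages 6-10"), (11, 14, "Ages 11-14"),
                 (15, 16, "Ages 15-16"), (17, 20, "Ages 17-20")]
        for low, high, label in bands:
            if low <= age <= high:
                return label
        return None  # excluded from tabulation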
Static aspects:          Recodes in the base TXD are retained.
Dynamic aspects:         The following recodes may be added by the TXD creation process:
                         Characteristic Iteration Recode
                         School District Recode
                         Geo Recode (reference to)
4.15.3.3. TABLE
The TABLE section is broken down into the DIM, DERIVE, AXIS, SUBJECT, and WAFER sub-sections.
TABLE-DIM:
This section indicates which recodes (from the RECODE sections) are used to define the dimensions of
the table.
Static aspects:          The production system does not alter existing DIM sections in the template TXD.
Dynamic aspects:         None
TABLE-AXIS:
This section serves two purposes. First, if the default order, as per the DIM sections, is not acceptable,
this section defines the order in which the cells should appear. Second, if some cells are hidden, they are
filtered in this section.
TABLE-DERIVE:
Derivations (e.g. medians) are defined here.
TABLE-SUBJECT:
This section indicates which recodes (from the RECODE sections) are used to define the universe for this
table and which recodes (from the RECODE sections) define the summation options.
Static aspects:          The production system does not alter existing SUBJECT references.
Dynamic aspects:         If an iterated product, the production system adds a line to this section that
                         references the Iteration's recode (from the RECODE section) thereby adding the
                         Iteration definition to the universe.
TABLE-WAFER:
This section is added by the production system. The Geographic dimension is added as the wafer. In
STR terms this is simply another axis, although from a production system standpoint it is important to
note because tabulation output is by geography. That is, when the tabulation output .csv is created, the
geography dimension is output as a row. For example, tabulated cells for the United States are put on a
single line, as are the tabulated cells for Maryland.
4.15.4.             Why Customize TXDs in DPP
Some products require millions of individual tabulations. For example, SF4 had approximately 6 million.
Creating 6 million TXDs by hand and putting them into a development build is impractical. Therefore, a
TXD is created for each product table and put into a development build. These TXDs become the
templates from which millions of production TXDs are formed.
4.15.5.             TXD Parts in the Build
4.15.5.1. Key
<dev build>              is the root directory to the development build. For example,
                         /usr/lpp/DPP/DPP2001_279.
<ops build>              is the root directory to the operations (OPS) build. For example,
                         /usr/lpp/DPP/DPP2000_OPS_222.
<tablename>              is the tablename in the product, for example p13.
<product>                 is the name of the product in lower case. For example, usf3.
<Product>                 is the name of the product in mixed case. For example, uSF3.
<PRoption>                is an optional 'PR' that indicates that the base TXD is specifically for Puerto Rico.
<Iteration>               is the iteration code. For example '012'.
<GeoSet>                  is the geo set. For example '2'.
<ST>                      is a 2 character state (or US for United States) abbreviation. For example, 'MD'.
<host>                    is the host name. For example, 'dpp1'.
<random string>          is a random string assigned so as to ensure that file names are unique (within a
                         host). For example, 'aaanTkFya'.
4.15.5.2. (State) TXD templates for a product.
Each product has a set of State TXDs. Generally there is one TXD for each table in the product, however
there is some variation. For example, sometimes a table is so large that it needs to be tabulated in
smaller pieces so pct119 could be broken up into pct119a, pct119b, and pct119c.
Location in the build.
     <ops build>/SXtables/<Product>/general/<product>-<tablename><PRoption>.txd
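
Using the key in section 4.15.5.1, the location can be composed mechanically. A sketch (the example values come from the key itself, not from a real build):

    ops_build = "/usr/lpp/DPP/DPP2000_OPS_222"   # <ops build>
    Product, product = "uSF3", "usf3"            # <Product>, <product>
    tablename, PRoption = "p13", ""              # <tablename>; "PR" for Puerto Rico
    path = (f"{ops_build}/SXtables/{Product}/general/"
            f"{product}-{tablename}{PRoption}.txd")
    # -> /usr/lpp/DPP/DPP2000_OPS_222/SXtables/uSF3/general/usf3-p13.txd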
4.15.5.3. (State) TXD distribution templates for the product.
For products that have US-level geographies and medians, a set of State-level TXDs tabulates each
State's contribution to the US distribution. Since these tabulations are performed at the state level, a set
of TXDs is required to tabulate that information.
Location in the build.
     <ops build>/SXtables/<Product>/general/<product>-<tablename><PRoption>_d.txd
4.15.5.4. (US) Median TXD Templates
For products that have US-level geographies and medians, a US-level distribution database is created
(details of that process are documented in another part of this document). A set of TXDs is used to
tabulate the US-level medians from that US-level distribution database.
Location in the build.
     <ops build>/SXtables/<Product>/median/<product>-<tablename><PRoption>.txd
4.15.5.5. Iteration Recodes
For Iterated products the definition of each Iteration is stored in a .txt file. Each file contains the STR text
required to define a recode (as per above). These files are basically inserted into the template TXD to
form an Iteration specific TXD.
Location in the build.
     <ops build>/SXrecodes/<Product>/ci/ci-<Iteration>-<GeoSet>.txt
4.15.5.6. GEO.TXD
This TXD is used by ProcessGeo to tabulate information from the Geo Database.
Location in the build.
     <ops build>/SXtables/geo/general/GEO.txd
4.15.6.             TXD Parts Built Dynamically
4.15.6.1. Geo Recode
The geo recode is built by ProcessGeo, saved as a .txt file, and is later used by ss2ps. Similar in concept
to the Iteration recodes (above), the Geo recode is inserted into the TXD to form a geo-specific version of
the TXD. Whereas the Iteration recode is added to the universe, the geo recode is added as a dimension.
A Geo recode can be millions of lines long, and merging and handling such a file is unwieldy. Therefore, a
reference to the Geo recode is inserted into the TXD. This STR-specific concept is known as recode
caching.
The Geo recode is dynamically built and stored as:
     <product home>/SXrecodes/<PRODUCT>/<ST>/<recodefilename>
4.15.6.2. Tabulation TXDs
The base tabulation unit for the production system is a call to ss2ps. ss2ps processes a set of TXDs.
There are slight differences between the TXDs from Tab Stage 3000 and the TXDs from Tab Stage 3200.
For Tab Stage 3000, the production TXDs are created as:
     <product home>/SXtables/<PRODUCT>/<ST>/unadjusted/
     <TABLENAME>_<TableSplit>_<ST>_U_i<Iteration>_<geoset>_3000.txd
For Tab Stage 3200, the production TXDs are created as:
     <product home>/SXtables/<PRODUCT>/US/unadjusted/
     <TABLENAME>_<TableSplit>_<ST>_U_i<Iteration>_<geoset>_3000.txd
4.15.6.3. Recode Caching List
These small text files are only a few lines long and basically serve as the link between the geo recode
and the reference in the production TXD. This implementation is STR-based; it would be simpler if this
file did not exist and the information were coded directly into the TXD. The only cached recode that DPP
uses is a geo recode. For each ss2ps call, the Tab script determines which geo recode is required, and
that information is put into this small file. It is created and used by Tab Stage 3000 only.
<product home>/SXrecodes/<PRODUCT>/<ST>/<host>.<random string>.<geo set>
4.15.7.             How a dynamic TXD is built
Production TXDs are formed (and used in tabulation) in Tab Stages 3000 and 3200 by applying a series
of edits to template TXDs. Additional files are created and used in the tabulation process because not all
of the required information (namely the geo recode) is contained in the production TXD.
Generally speaking, the following steps describe the process to build a production TXD; a sketch in code
follows the list. This entire process is controlled by the Tab script. Note that the Tab script is executed for
a specific State.
- For each iteration (if iterated), loop.
  - For each geo split (if the geography has been split), loop.
    - At this point the geo recode is known, so create the recode caching list file.
    - Identify all tabulation TXDs that need to be constructed. For each TXD, get the template TXD and loop:
      - Update the DBID.
      - Insert the iteration .txt file (if iterated) as a DIM section.
      - Insert a reference to the iteration recode in the SUBJECT section.
      - Create a DIM section for the geo recode. Since recode caching will be used, this is a
        reference to the recode caching file.
      - Create a WAFER section and insert a reference to the geo recode.
    - Call ss2ps, referencing the list of TXDs to be tabulated and the recode caching file.
    - Remove the production TXD files.
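
A sketch of that edit sequence (a naive, string-based simplification in Python; the real edits are line-oriented manipulations performed by the Tab shell script, and the placement of the inserted sections is more involved than shown):

    def build_production_txd(template, dbid, iteration_recode, geo_cache_ref):
        # 1. Overwrite the DBID keyword to point at the target database.
        lines = [f"DBID {dbid}" if ln.strip().startswith("DBID") else ln
                 for ln in template.splitlines()]
        if iteration_recode is not None:
            # 2. Insert the iteration recode and reference it from SUBJECT,
            #    adding the iteration definition to the universe.
            lines += ["DIM", iteration_recode, "END DIM"]
            lines += ["SUBJECT", iteration_recode, "END SUBJECT"]
        # 3. Reference the cached geo recode: as a DIM section and as the
        #    WAFER (geography) axis, via the recode caching list file.
        lines += ["DIM", geo_cache_ref, "END DIM"]
        lines += ["WAFER", geo_cache_ref, "END WAFER"]
        return "\n".join(lines)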
4.15.8.             Important Points about TXDs
4.15.8.1. TXDs are not saved
This behavior is a throwback to a time when production TXDs were too large to retain. Recode caching
has made it practical to retain the production TXDs, although to date that change has not been made.
The primary reason to save TXDs would be traceability.
4.15.8.2. Geo Recodes
Geo recodes are quite large and one of the goals of Tab Stage 3000 is to load each geo recode once and
perform all calculations based on that recode in one ss2ps processing stream. Given a large number of
tabulations or single CPU availability this is by far the most efficient way to handle the geo dimension.
4.15.8.3. Recode Caching
The size of the geo recode was identified as an operational hurdle and a performance bottleneck jointly
by IBM and STR. At IBM's request, STR implemented recode caching. Although recode caching could
apply to any recode, it has, by far, the greatest implications for the geo recode. The DPP system uses
recode caching only for the geo recode.
Background on Geo Recodes:
The geo recode is central to the following discussion and some background may be helpful.
- A recode is a textual mapping of 2 or more items.
- A geo recode is a textual mapping of geographies to their constituent Census blocks.
- A geo recode file is a stored version of a recode. It is a plain ASCII file.
- There are 2 parts to the geo recode file. The first part defines the geographies that are to be
  tabulated (and are to appear in the tabulation output). The second part maps Census blocks to
  geographies. As a note, a block can be mapped to more than one geography.
For tabulation purposes Geo recodes are specific to a State/CI. For example, for SF4 there are potentially
336*53=17,808 such recodes.
Geo recodes can be large. For uSF3, the TX geo recode file has 53,239,513 lines and is 735 MB in size!
Geo recodes are constructed by ProcessGeo and either merged, referenced, or made persistent by the
DPP system (depending on the package version and operational parameters).
Loading, parsing, and applying a geo recode is no trivial matter, and it is probably the single largest
factor in tabulation performance.
Batch Size:
The STR batch size determines how many tables are processed in a single call to ss2ps, the UNIX
executable that the DPP system calls to perform an STR tabulation. For example, there are
approximately 800 tables in uSF3. A call to ss2ps with a batch size of 800 would process all 800 tables
serially.


The batch size option has been in STR packages for some time; however, packages 16 and 52 had a bug
that forced us to use a batch size equal to 1.
Note that each Iteration has its own customized Geo Recode file. Therefore multiple iterations cannot be
tabulated in a single call to ss2ps.
Large batch sizes are not always beneficial. Determining the 'optimal' batch size depends on several
factors:
- What is the performance goal? Best throughput for the entire product, or completion by some
  interval (for example, by state)?
- How many CPUs are available?
- How many tables does the product have?
- Are tables equally difficult, or are there easy tables and difficult tables?
- How many 'States' are in the product?
- Are States equally difficult, or are there smaller States and larger States?
Given the number of permutations to the questions above, it is possible for the optimal batch size to
range from 1 to 9999. The following examples are extremes but demonstrate that setting the batch size
to 9999 is not always optimal.
Batch Size Example 1:
Throughput target = n/a; there is only 1 state in this example.
Available CPUs = 24
Tables = 10
Table Difficulty = All Equal
Target States = VT
State Difficulty = n/a; there is only 1 state.
In this case it would be best to set the batch size to 1 and to concurrently tabulate all 10 tables (the DPP
term for this is a 'wave'), thereby utilizing as many CPUs as possible. The additional overhead of running
10 ss2ps calls will be erased by using 10 CPUs.
Batch Size Example 2:
Throughput target=The entire product
Available CPUs=24
Tables=800
Table Difficulty=All Equal
Target States=All 52 states
State Difficulty=medium
In this case it would be best to set the batch size to 9999 and to concurrently submit 24 (or more) States
in descending order of size.
Batch Size Example 3:
Throughput target=The entire product
Available CPUs=1
Tables=800
Table Difficulty=All Equal
Target States=All 52 states
State Difficulty=medium
In this case it would be best to set the batch size to 9999, and submit States serially.
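
The trade-off in these three examples can be condensed into a rough heuristic (a sketch only; in practice the decision is made by operators weighing the factors listed above):

    def choose_batch_size(cpus, tables, states):
        # One state with spare CPUs: batch size 1, wave the tables so that
        # every CPU is used (Example 1). Otherwise: large batches, with the
        # states themselves as the unit of concurrency (Examples 2 and 3).
        if states == 1 and cpus >= tables:
            return 1, tables          # (batch size, concurrent ss2ps calls)
        return 9999, min(cpus, states)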
Caching Options:
There are 3 caching options in STR: No Recode Caching, Non-persistent Recode Caching, and
Persistent Recode Caching. The DPP system implements Non-persistent Recode Caching. Availability
by version:

              No Recode Caching    Non-persistent Recode Caching           Persistent Recode Caching
Package 16    Yes                  No                                      No
Package 51    Yes                  Yes (batch size was limited to 1, so    No
                                   the benefit was limited)
Package 53    Yes                  Yes                                     Yes

No Recode Caching:
The default, available in all versions. In this scenario, for each TXD the DPP production system:
- Reads the TXD (a small file)
- Reads the geo recode (for Texas, a 753 MB file)
- Merges the TXD and geo recode, and writes the result to form a new TXD (a 753 MB file)
- Submits the new TXD to ss2ps; ss2ps reads the TXD (for Texas, a 753 MB file), parses the geo
  recode, and tabulates the result
- Deletes the new TXD (for Texas, a 753 MB file)
For uSF3 that process was repeated once for each table - approximately 800 times per state! Needless
to say, there is a tremendous amount of CPU and I/O overhead in this process.
Recode Caching (non-persistent):
Recode caching was available in package 51 but was basically unusable since the batch size was
restricted to 1. Package 53 is capable of large batch sizes.
If the batch size = 1, for each TXD the DPP production system:
- Reads the TXD (a small file)
- Inserts a small reference to the geo recode and writes the result to form a new TXD (a small file)
- Submits the new TXD to ss2ps; ss2ps reads and parses the geo recode (for Texas, a 753 MB file)
  and tabulates the result
- Deletes the new TXD (a small file)
If the batch size = 40, for each set of 40 TXDs the DPP production system:
- Reads the 40 TXDs (small files)
- Inserts a small reference to the geo recode and writes the results to form new TXDs (small files)
- Submits the new TXDs to ss2ps; ss2ps reads and parses the geo recode (for Texas, a 753 MB
  file) - note that this is done only once - and then tabulates the result for each of the 40 TXDs
- Deletes the new TXDs (small files)
Even with a batch size of 1 this approach is superior to the 'No Recode Caching' scenario for the simple
reason that large TXD files do not have to be manipulated. When the batch size is increased (in this
example to 40) the gain is even more pronounced.
Recode Caching (persistent):
Persistent recode caching is available in package 53. In this situation the geo recode (for Texas, a
753 MB file) is submitted to scstools (an STR component). It is read once, parsed once, and stored in a
binary format (a 175 MB file).
If the batch size = 1, for each TXD the DPP production system:
- Reads the TXD (a small file)
- Inserts a small reference to the geo recode and writes the result to form a new TXD (a small file)
- Submits the new TXD to ss2ps; ss2ps reads the parsed geo recode (for Texas, a 175 MB file) and
  tabulates the result
- Deletes the new TXD (a small file)
If the batch size = 40, for each set of 40 TXDs the DPP production system:
- Reads the 40 TXDs (small files)
- Inserts a small reference to the geo recode and writes the results to form new TXDs (small files)
- Submits the new TXDs to ss2ps; ss2ps reads the parsed geo recode (for Texas, a 175 MB file) -
  note that this is done only once - and then tabulates the result for each of the 40 TXDs
- Deletes the new TXDs (small files)
With a batch size of 1 this approach is superior to the 'Recode Caching (non-persistent)' scenario
because the geo recode only has to be parsed once, and the parsed file is approximately ¼ the size of
the non-parsed recode. When the batch size is increased (in this example to 40) the gain is less
pronounced.
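
Using the uSF3 Texas figures quoted above, a back-of-the-envelope comparison of the geo-recode megabytes handled per state under each option (a sketch; it ignores the small TXD files and all other I/O, and treats each handling of the recode as one pass):

    RAW_MB, PARSED_MB, TABLES, BATCH = 753, 175, 800, 40

    # No recode caching: merge, write, and re-read the recode for every table.
    no_caching = TABLES * 3 * RAW_MB                      # 1,807,200 MB

    # Non-persistent caching: ss2ps parses the raw recode once per batch.
    non_persistent = (TABLES // BATCH) * RAW_MB           # 15,060 MB

    # Persistent caching: parse once with scstools, then read the smaller
    # binary file once per batch.
    persistent = RAW_MB + (TABLES // BATCH) * PARSED_MB   # 4,253 MB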

4.16. About US processing and aggregation
This section focuses on standard US-level product processing, which involves tabulating against state-
level databases. However, the exception to this rule - tabulating most AIAN iterations against a US-level
detail database - is also covered.
4.16.1.             Geography
The ProcessGeo script is used to create all outputs that are required to process a US-level product,
including those for the exception noted above. The outputs are explained below.
4.16.1.1. US-Level Detail Database Preparation: Scan Detail Data
ProcessGeo stage 5 scans a state-level detail data file and flags the blocks of interest that are needed for
a US-level detail database build. This stage was run for the AIAN product, where we determined that it
would be practical to build a US-level detail database consisting of the appropriate records (blocks,
housing units/GQs [and their associated person records] if the HU/GQ contained any AIAN person).
The scan process is in the geography portion of the DPP system since it keeps track of the blocks that
contain the appropriate records. This output file is used in the next section (4.16.1.2).
4.16.1.2. US-Level Detail Database Preparation: Creating a US-Level DGF
Section 4.16.1.1 creates the input for this section, which is ProcessGeo stage 7. This stage assembles a
DGF by looping over the state-level DGFs and keeping (a) all non-block records (these appear in the
final product) and (b) all block records flagged in section 4.16.1.1. This DGF is fed into the next section
(4.16.1.3).
4.16.1.3. US-Level Detail Database Preparation: Creating a Geography Recode
Section 4.16.1.2 creates the input for this section, which is ProcessGeo stage 10 (creating geography
recodes, etc). This process is covered in section Example Geography Recode using a state-level product
as an example, but the process is the same for the US. Please refer to Example Geography Recode for
more information.
4.16.1.4. Land and Water Area Verification
This section corresponds to ProcessGeo stage 20, which creates and verifies land and water areas for
state-level products. It’s necessary to mention this stage in this section of the document since the outputs
are used in section 4.16.1.5.
4.16.1.5. Creating US-Level Geography Outputs
For all US-level products besides AIAN (most iterations), ProcessGeo Stage 30 is the normal entry point
after Get has been run to obtain the DGF. This stage creates three main outputs that help us to create a
US-level product. These outputs are explained below:
Master Geography SAS Dataset
The master geography SAS dataset for the US lists all of the US-level product geography in product
order. It serves the same purpose as the state version of the dataset, which is discussed in the
Processing Geography File section.
Land and Water Area Aggregation
US processing requires that we aggregate the state-level land and water areas from ProcessGeo stage
20 (discussed in section 4.16.1.4). These land and water areas are used in summary file creation (just
like the state land and water areas are used).
Land and Water Area Verification
This step compares the aggregated land and water areas (from the previous paragraph) to the land and
water areas from the Master Geography SAS Dataset and produces a report.
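
A sketch of that comparison (data structures and the tolerance are assumptions; the actual report is produced from SAS datasets):

    def verify_areas(aggregated, master, tolerance=0.0):
        # Both arguments map a geography id to a (land_area, water_area) pair.
        report = []
        for geo, (land, water) in master.items():
            agg = aggregated.get(geo)
            if agg is None:
                report.append((geo, "missing from aggregation"))
            elif abs(agg[0] - land) > tolerance or abs(agg[1] - water) > tolerance:
                report.append((geo, "area mismatch"))
        return report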
4.16.2.             Creating a US-Level Detail Database
This section applies to the AIAN product only.
4.16.2.1. Data Preparation
Data preparation for detail databases is covered in Input Files, but that section focuses on the state-level
detail database build and does not explain what happens if a US-level database needs to be built.
The only difference in Tab stage 1000 for a US database build is that an additional code path is executed.
This code path includes only the records of interest from the source detail files, as determined by the
scan process that’s documented in section 4.16.1.1.
4.16.2.2. Assemble Data for Detail Database Build
Tab stage 1025 takes the state-level detail database inputs from section 4.16.2.1 and concatenates them
together so they can be used in a detail database build.
From this point on, the US detail database build process is identical to the process covered in Building a
Detail Database.
4.16.3.             Tabulation
This section applies to all US-level products (for AIAN, only iteration 001).
Tabulating against a state-level or US-level detail database makes no difference to the main tabulation
stage, which is Tab Stage 3000. For a brief explanation of tabulation, refer to About Tabulation.
4.16.4.             Aggregation
Aggregation is required when we tabulate a US-level product against state-level detail databases. The
aggregation occurs in Tab stage 3200, and requires that all tabulations for all states have been finished.
The stage iterates over the product tables, taking each state’s csv for the table and aggregating the
numbers. The process also orders the data per the master geography SAS dataset (refer to section
4.16.1.5) and writes a csv for each product table.
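
A minimal sketch of this stage (the CSV layout - geography id in the first field, table cells in the rest - is an assumption for illustration):

    import csv

    def aggregate_table(state_csvs, geo_order, out_path):
        totals = {}
        for path in state_csvs:                    # one csv per state
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    geo, cells = row[0], [float(c) for c in row[1:]]
                    acc = totals.setdefault(geo, [0.0] * len(cells))
                    for i, c in enumerate(cells):
                        acc[i] += c                # sum cells across states
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            for geo in geo_order:                  # master geography dataset order
                if geo in totals:
                    writer.writerow([geo] + totals[geo])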
4.16.5.             Medians
US medians are calculated against a distribution database, which is different from a detail database. The
following Tab stages are run to prepare and tabulate against a distribution database:
Preparing TDD for Distribution Database
Tab stage 3300 prepares the TDD for the US distribution database, customizing the TDD under source
control with directories, etc., much like the customization that occurs in Customizing the Textual Database
Definition.
Preparing Geography for Distribution Database
Tab stage 3325 prepares geography for the US distribution database, creating a value set for the US
geography that can be referenced in tabulation.
Preparing Aggregated Data for Distribution Database
Tab stage 3350 prepares the aggregated distribution data for database build, and is a rather simple
reformat of the data.
Building the Distribution Database
Tab stage 3400 builds the distribution database. The database build process is identical to the process
described in section Running SNBU.
Cataloguing the Distribution Database
Tab stage 3450 catalogues the distribution database so that we can tabulate. The catalogue process is
described in section Cataloguing.
Tabulating Medians Against the Distribution Database
Tab stage 3500 tabulates the US median TXDs against the distribution database and writes out csv files.
 5. HARDWARE / FIRMWARE ARCHITECTURE OF THE DPP SYSTEM

5.1.     Description of hardware components
Currently, the DPP hardware environment consists of:
- One RS/6000 M80 (DPP1) and one RS/6000 S80 (DPP2).
- DPP1 has 8 CPUs. DPP2 has 24 CPUs.
- The DPP2 S80 (7017 S80 AB10D) has 98 GB of physical memory. The DPP1 M80 (7026 M80
  F95CD) has 16 GB of physical memory.
- Both systems are Fibre-switch attached to an IBM Enterprise Storage Server (ESS).
- ESS3 has 5.7 TB of disk space.
- DPP1 has 1.07 TB of SSA disks; four VGs are RAID 5 disk arrays. (Note: the external SSA disk
  drawers are physically located inside the DPP2 cabinet.)
- A 3584 tape library (224 TB capacity) is attached to DPP1 for Tivoli backup/recovery purposes.
- The two RS/6000 machines are connected to each other with a private 100 Mb Ethernet network
  interface, with interface names gigdpp1 and gigdpp2.
When the size and runtime estimates for uSF4 were calculated, it was determined that additional CPU
power was needed. For a brief time, two additional P680s were included in the configuration. They have
since been removed and have become part of the American FactFinder hardware.
Below is a diagram which describes the DPP hardware and network architecture. All of the DPP
equipment is located in Bowie in MOD 2.
                                       Figure 28: DPP Hardware and Network Architecture



5.2.     Backup/failover architecture
Unlike American FactFinder, DPP does not have full redundancy capability. If the DPP2 production
machine were to become unusable, an impromptu plan would be necessary based on the current
production urgency.
Both DPP servers participate in the TSM disaster recovery procedures. These procedures create
redundant tape copies of the entire server, which are located in a fireproof vault offsite in Suitland.
Approximately once a month, scripts are run to update the inventory of offsite tapes, generate a list of
tapes ready to go offsite, generate a list of empty tapes for return to the library, and update the TSM
database. As of this time, there is no firm hardware contingency in place where these tapes could be
restored if a catastrophe did occur.
         6. SECURITY-RELATED ARCHITECTURE OF THE DPP SYSTEM

  6.1.       Logical environments (dev, pa, uat, test, prod, sprod)
To avoid confusion, accidental overwrites, and space usage problems, each environment of DPP is
provided its own disk space and file structure. The only cross-references to files outside of an
environment's structure are soft links to read-only files needed for inputs or prior-product comparisons.
The structure is owned by the generic account associated with the environment. For example, the
dppuat generic account is the owner of the /dpp2/uat/* disk space. That account has sole write capability
while all others in the uat group have only read access. Below is an example of the uat area during the
School Districts Special tabulation testing.
dpp2$ ls -la uat
drwxr-sr-x 10 dppuat                   uat              512 May 13 13:55 uat
dpp2$ cd uat
dpp2$ ls -la
total 80
drwxr-sr-x 10 dppuat                   uat              512    May    13   13:55   .
drwxr-xr-x 11 root                     system           512    Apr    14   10:30   ..
drwxr-x--- 16 dppuat                   uat              512    May    12   11:05   SDCOSP1
drwxr-x--- 16 dppuat                   uat              512    May    12   13:46   SDCOSP2
drwxr-x--- 16 dppuat                   uat              512    May    12   14:07   SDCOSP3
drwxr-x--- 16 dppuat                   uat              512    May    13   13:28   SDCOSP4
drwxr-x--- 16 dppuat                   uat              512    May    13   13:40   SDCOSP5
drwxr-x--- 16 dppuat                   uat              512    May    13   13:55   SDCOSP6
drwxrwx---   2 dppuat                  uat              512    May    06   09:39   lost+found     (an AIX special file)
lrwxrwxrwx   1 dppuat                  uat               15    May    06   09:40   uSF1 -> /dpp2/prod/uSF1 (soft link)
lrwxrwxrwx   1 dppuat                  uat               16    May    06   09:40   uSF1A -> /dpp2/prod/uSF1A
  Figure 29: Example showing ownership of /dpp2/uat disk space

  The disk space allocated is typically on separate volume groups to facilitate changing, expanding, or
  relocating the storage.
  In addition to the file structure for each environment, a striped logical volume disk area is assigned to
  each environment if requested. These areas are considered temporary so they may/may not exist on
  disk at any given time.

  6.2.       User groups and the use of generic, non-login-enabled accounts
DPP uses work groups to allow assignment of rights and permissions to sets of users rather than to
individuals. Users can be added or removed from groups without having to change the properties of
objects. "Reviewers" are a special category of user; their ability is limited to read access to products
which have been prepared through the DPP system.
  The following are DPP-system groups on the AIX platform:
      dpp      – DPP staff needing read access to development materials
      pa       – DADS Product Assurance team members
      prod     – staff needing read access to DPP production materials
      sprod – staff needing read access to DPP special-product production materials
      test     – staff needing read access to DPP test run materials
      uat      – staff needing read access to DPP user acceptance testing materials
  Reviewer Groups:
             dreview – development data product reviewers
             review – production data product reviewers including POP and ACSD

          pareview – product assurance data product reviewers
          sreview – special-product production data product reviewers
          treview – test-state production data product reviewers
          ureview – user-acceptance testing data product reviewers

6.3.      The assignment of read/write access to users
Typical JamesBondID user accounts do not have write access to any DPP work areas. Only the generic,
su-only accounts (dppprod, dppdev, etc.) have write access.
Be aware that a recent BOC standard was issued that will affect the use of DPP generic su-only
accounts. Presently, the generic accounts are used by multiple people at the same time. The new
standard states that only one user may be accessing a generic account at any particular time.
(http://cww2.census.gov/it/ssd/itsupp/docs/27_0_0UserNamingConvention.pdf)
Below are tables which depict the access each account should be granted:
                                                                                 Non-Review Groups
Username                            Default
                                    Group         staff   acsd    dec       dpp   geo   oracle   pa   prod   sprod     test   uat

Role-based (su-only) Accounts:

dadscm                              pa             x

dpp                                 dpp            x       x       x         x     x              x    x      x         x      x

dppdev                              dpp            x       x       x         x     x

dpppa                               pa             x       x       x               x              x

dppprod                             prod           x       x       x               x                   x

dppsprod                            prod           x       x       x               x                          x

dpptest                             prod           x       x       x               x                                    x

dppuat                              uat            x       x       x               x                                           x

oracle                              oracle         x                                     x

Named User Accounts (JamesBondIDs) <belonging to a functional group>

<DPP Operations>                    prod           x       x       x         x     x              x    x      x         x      x

<DPP development>                   dpp            x       x       x         x     x              x    x      x         x      x

<Geography Division>                geo            x                               x

<Product Assurance>                 pa             x       x       x         x     x              x    x      x         x      x

Figure 30: Group membership for DPP users (in non-review groups)



                                                                             Review Groups
Username                                Default
                                        Group        staff         acsd      dec    sprod    test         uat


Role-based (su-only) Accounts:

dadscm                                  pa

dpp                                     dpp          x             x         x      x        x            x

dppdev                                  dpp          x

dpppa                                   pa                         x

dppprod                                 prod                                 x

dppsprod                                prod                                        x

dpptest                                 prod                                                 x

dppuat                                  uat                                                               x

oracle                                  oracle


Named User Accounts (JamesBondIDs) <belonging to a functional group>:

<DPP Operations>                        prod         x             x         x      x        x            x

<DPP development>                       dpp          x             x         x      x        x            x

<Geography Division>                    geo

<Product Assurance>                     pa           x             x         x      x        x            x

<special-product user>                  sreview                                     x

Figure 31: Group membership for DPP users (in review groups)




        7. USING THE DPP SYSTEM TO PRODUCE A NEW PRODUCT

7.1.      The life cycle of a product in the DPP System
This is the process that has typically been followed during the customization of the DPP system for a new
product. These steps have always been followed iteratively.
At the earliest possible date, staff of BOC and IBM meet to determine the best way to produce the new
product; call it SF-X. At that time, there are usually no final written specifications for either the subject-
matter content or the geographic content of the desired product; there may, or may not, be drafts. SF-X
is described by BOC staff, verbally or by e-mail, by comparing it to previous products. For instance, "SF-
X will be like SF3, but it will have different geographies."
The initial meeting(s) on SF-X result in technical questions, and in an initial plan for customizing the DPP
system to make SF-X. Typical questions are:
- What will this product be called? What prior product does it most closely resemble?
- Is the data source 100% data, or Sample data, or both? Will a new detail-file database be built, or
  will an existing one be used? Will there be new supplemental recodes?
- Is this a US product, or a state product, or both?
- Will a new delivery of geography from GEO be required? If so, will it be a delivery of summary levels
  only, or will blocks be delivered also?
- Is thresholding required? If so, how will it be implemented? Will an SIPHC activity be required?
- Will existing table definitions (TXDs) be used, or will new ones be developed?
- Is there a requirement for Special Tab Rounding?
- When SF-X summary files have been created, what will they be matched to? Is there a requirement
  for mapping summary levels to other summary levels in the matching?
- Will the existing DPP system be able to produce SF-X, or are there new requirements that will
  require changes to the DPP system?
- What testing is appropriate?
- What is the QC plan for SF-X production?
The BOC staff may answer some of the questions, or may refer them to the product's sponsor. Other
questions cover schedule and budget; confirmation of new functionality (if any); and the details of
subject-matter and geographic content. When work is authorized to begin, these questions often remain
in flux. The initial plan usually firms up the kind of meta-information that is stored in the Products.txt
driver file. It also includes a rough plan for any software changes that might be required. The
development of the initial plan is critical to success, because it determines the resources that will be
required for SF-X. The initial plan is outlined in the Approach document.
During the development stage, a complete set of driver files is developed for the product by the
developers. It is used by them for unit testing and system testing. Any changes needed to the DPP
system itself are also made and tested. The Cookbook, which contains instructions for the execution of
commands to invoke the DPP system, is updated with the specifics for SF-X. The Approach document is
revised if necessary. During this phase, configuration management for both driver files and code is
handled through Team Connection releases (DPP2001_*). Multiple 'builds' usually occur as the product
wends its way through development and testing.
Concurrently, certain driver files in DPP2001_* are reviewed by the BOC Ops staff, and possibly
changed, and are then also placed under configuration management in Team Connection
(DPP2000_OPS_*).
Product Assurance testing often begins before development is complete. The Product Assurance group
uses the Cookbook section on SF-X, with advice from both developers and BOC staff, to devise a Test
Plan. The advice often focuses on subsetting input files and/or driver files to create both a quick-
running, repeatable runnability test and a more comprehensive content test.
During Product Assurance (PA) testing, the instructions in the Cookbook for SF-X are executed by the
Product Assurance staff using the contents of DPP2001_* and DPP2000_OPS_*. Output and processing
are reviewed, defects are entered into Team Connection, modifications are made, and testing is
repeated. PA testing is conducted in the disk environment /dpp2/pa.
User Acceptance testing (UAT) often begins before PA testing and development are complete. During
UAT testing, the instructions in the Cookbook for SF-X are executed by BOC Ops staff using the contents
of DPP2001_* and DPP2000_OPS_*. Output and processing are reviewed, defects are entered,
modifications are made, and testing is repeated. UAT testing is conducted in the disk environment
/dpp2/uat.
Test Production (TEST) is an optional phase, often used for larger products, in which SF-X can be
produced for several states for submission to the product sponsor for approval, prior to the start of
Production. During TEST, the instructions in the Cookbook for SF-X are executed by BOC Ops staff
using the contents of DPP2001_* and DPP2000_OPS_*. Output and processing are reviewed, and are
submitted to the sponsor of SF-X for review. If any defects are found, they are entered, modifications are
made, and Test is repeated. TEST processing is conducted in the disk environment /dpp2/test.
The output of Test may be forwarded for release to the public, via the Hand-off mechanism to AFF and
ACSD, but usually the Test states are re-processed in Production prior to public release.
During Production (PROD or SPROD), the instructions in the Cookbook for SF-X are executed by BOC
Ops staff using the contents of DPP2001_* and DPP2000_OPS_*. Output and processing are reviewed,
and are submitted to the sponsor of SF-X for review. Defects are not expected, but are handled through
the same process as other defects if they are discovered. The output of Production is submitted to the
sponsor of SF-X for review, and if approved, is forwarded for release to the public, via the Hand-off
mechanism to AFF and ACSD.

7.2.      Using multiple instances of the DPP system
Virtually all products require multiple executions of the DPP system. However, not all of the steps in the
DPP system need to be run in every execution. The DPP system is designed so that individual stages may be chosen from among its components.
The requirements of each product are different. Often, it is possible to meet the requirements of a
particular product by configuring the execution of the DPP system in a number of different arrangements.
In that case, the decision on which configuration to use should also be based on resource factors like run-
time, disk usage, etc.
For instance, the 108th Congressional District product produced two sets of summary files for each of 52 states, one set containing 100% data and one set containing Sample data. To do this, the DPP system was run 104 times, once for each kind of data for each state.
         Review the approaches and flowcharts used for previous products (section on Artifacts of
          Previous Products) for other examples.
But certain modules of the DPP system were not run in those 104 executions for the 108th Congressional District product. For instance, Tab Stages 1000 – 2050 were not run. Those stages create a SuperCROSS database, and, because the inputs to that process hadn’t changed, databases that had been created previously were re-used.
         Refer to the section on software architecture for the functionality of each stage in the DPP
          system.

7.3.      Preparing the driver files
Usually, driver files are prepared by modifying the driver files used for previous products. The best
practice is to use the latest OPS build as the source for these files.
While the developers are responsible for initial development of these files, usually developers and BOC
staff work collaboratively to refine and perfect them.
The content and syntax of each driver file must be perfect. Even the absence of a carriage return at the
end of the last line of a driver file can cause the DPP system to malfunction. The DPP system performs
very little validity checking on the contents of the driver files.
The early stages of testing usually identify most of the defects in driver files. Defects discovered later in
the life cycle can be very expensive to correct. An attempt should be made to eradicate defects as soon
as possible by additional measures, such as manual review, or by instituting additional automated
procedures.
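One such automated measure, sketched below, is a shell check for the trailing-newline problem described above. The driver-file path is a hypothetical example; substitute the directory for the product at hand.

    # Flag any driver file whose last byte is not a newline, since the DPP
    # system can malfunction on such files. (Path is illustrative only.)
    for f in /dpp2/pa/uSF_X/drivers/*.txt
    do
        if [ -n "$(tail -c 1 "$f")" ]; then
            echo "WARNING: $f is missing a trailing newline"
        fi
    done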
Certain driver files, most notably the Internal and Prior Product Match specifications, are subject to change.

7.4.      Notes on sourcing new txd’s
The txd’s contain tabulation instructions for each matrix in a product. Their content is crucial. Some of
the txd’s may be developed by contractor staff, but their final content is the responsibility of the BOC Ops
group. Note that the txd’s are modified by the DPP system in a controlled fashion (changing the database
name; merging the geography recode; merging an iteration restriction) before being submitted to ss2ps
for tabulation.
These are notes / guidelines on table sourcing which may prove helpful to new staff:
         Adhere to the file-naming conventions. The names of the txd’s must match those in TableInfo.txt.
         It is often better to modify an existing, proven txd from a prior product, rather than to source a txd
          from scratch. Modifications can be made either through the SuperCROSS gui, or with a text
          editor. If a text editor is used, you should load the resulting txd into the SuperCROSS gui to verify
          that it loads and runs, and to check that the tabulated results are correct.
         Whenever the SuperCROSS gui is used, it must be attached to a SuperCROSS database which
          contains the fields needed for the tabulation. The gui obtains the field names, and the description
          of valid values for each, from the SuperCROSS database. If a SuperCROSS database does not
exist with the appropriate fields defined, priority must be given to creating one.
         A txd contains the name of the SuperCROSS database with which it was created. Before using a
          txd in the gui, edit the txd to change the state name if necessary.
         If sourcing a txd from scratch, start by defining the universe, and confirming that the tabulated
          result is correct by comparing it to a known figure. Then identify the lowest-level cells in the table.
          In most cases, the sum of the data for the lowest-level cells will equal the universe total (the first
          line of the table). Disregarding the total line and all subtotal lines, define the lowest-level cells
          using the codes in fields in the database, and using recodes created from them using the
          SuperCROSS gui. Retabulate frequently during sourcing to make sure that the results of
sourcing make sense. Where possible, compare these results to known figures to confirm accuracy.
         Once the lowest-level cells have been sourced, derive the totals and subtotals using the ‘axis
          derivation’ function. Do not source them independently. The theory behind the internal and inter-
          product matching assumes that totals and subtotals within a matrix are faithful sums.
Don’t ever remove (to “look behind”) a median derivation unless you’ve saved the txd, or you will
          lose all the work that went into defining the derivation.
         Hide lines in a table as the very last step in constructing a txd, or you will become very confused.
          Hiding cells is a mechanism by which the user can suppress data cells which should not appear
          in the product, such as the tabulation of a distribution in a median table.
         There is a long formula which has been inserted in certain txd’s. The main part of the txd
tabulates aggregates and substitutes jam values. The formula rounds all results except the jam values. Be careful when working with these txd’s not to disturb the formula, and test the resulting txd to confirm that every tabulated number is handled correctly.
         Some of the tabulations of Puerto Rico required that different codes be used than were used for
          tabulations of the fifty states and DC. Always check to see if a special version is needed for
          Puerto Rico.

7.5.      Adding a new product to the Cookbook
During unit testing, the developers should add a section to the Cookbook which describes the actions to
be taken to produce SF-X. During PA, UAT, and TEST testing, testers should follow the instructions in
the Cookbook, and lodge defects against it when necessary. Follow the pattern used for other products.
The Cookbook is under configuration management in the DPP2001_* release.

7.6.      Interacting with the Team Connection environment
The DPP system uses two distinct “releases” in Team Connection to control elements from two distinct
sources.
DPP2001 is the release used for controlling the Cookbook, programs, scripts, and TDD’s. It is also a
repository for unofficial versions of other items needed during unit and system testing, such as driver files,
iteration recodes, and txd’s.
DPP2000_OPS is the release used for controlling the official versions of driver files, iteration recodes,
and txd’s.
The basic process for changing code in either kind of release is to open a defect (or a feature), accept it,
open a work area, place the correction into the work area, close the work area, mark it ready for
build, and then create a build. See Section 3.5.7, IBM VisualAge TeamConnection, and the attached
document “TeamConnection Quick Start.doc,” for more on the use of Team Connection.

7.7.      Planning how the production workload will be submitted
All work on the DPP system is ultimately focused on creating accurate summary files as quickly as
possible. Due to the sheer size of decennial products, production processing for a single product has
been known to run all day, every day, on multiple computers, for months on end.
All work using the DPP system is ultimately triggered by a human being who types system commands on
a console. The general order in which the commands will be typed is apparent from the Cookbook, but
there is an additional level of sequencing which has been left to the operator.
For instance, if detail file databases for 52 states are to be built before tabulations are started: in what
order should the 52 database builds be submitted? How many of them can be submitted at a time?
Should the tabulations be initiated automatically at the conclusion of each successful database build, or
should they be initiated manually?
The goal of production workload submission is to achieve maximum throughput by keeping the assigned
computer(s) fully engaged, but not over-committed, every minute of every day, until done. Each of the
stages of each of the scripts in the DPP system has a different performance profile. One technique which
has been followed successfully to date has been to keep a mix of these performance-types active in each
computer at any one time (to the extent possible). The other technique which has been followed
successfully to date has been to identify processing errors as soon as possible, immediately stop that
stream of processing while research is done, and fill the computers with other work.
During initial planning, development, and testing, the system’s performance characteristics should be
considered. As much as possible, the testing workload should mimic that planned for production, in order
to identify performance issues. If throughput blockages develop, remedial action should be taken. In the
past, these types of remediation have proven useful: modifying scripts/programs; modifying workload
submission method; using striped disk for high i/o processes; adding hardware.
During testing, the techniques planned for “sunny day” production are usually implemented by
developing operational scripts which group the instructions in the Cookbook in the planned increments.
This is done to reduce the risk of error in the construction, sequencing, and entry of the commands. Also,
any wave driver subdirectories and files that will be used in production are created and used during
testing. A directory named $DPPwork/$DPPenv/OpsNotes is usually created to contain the operational
scripts. Also, a running file named $DPPwork/$DPPenv/OpsNotes/Activity is usually created to contain
notes on which scripts were submitted, and when, and what other actions were taken.
Here are guidelines for use when planning how the production workload will be submitted, and what the
contents of the operational scripts and the wave driver subdirectories should be:
        Assess the available computer resources, the work to be performed, and the urgency of
         completing the product, and make a rough guess at how long it would take to run production if the
         instructions in the Cookbook were followed with no creativity. If that’s an acceptable time period,
         then form the scripts that way.
        Nail down which builds will be used. There must be a record in /home/dpp/DPPinstallations.txt
         that links the desired DPP2001 and DPP2000_OPS builds to an installation number and
         authorizes them to be used together on DPP1 or DPP2. The installation number will have to be
         entered correctly, manually, as a parameter to the DPPSetup command, every time the operator
         wants to access the test or production environment. Getting it right is critical.
        It’s a good idea to create as few operational scripts as possible, yet enough operational scripts
         that the workload can be submitted to the computers smoothly.
        It’s a good idea to have as few activities running as possible, on any one server at any one
         moment. This is, in general, for maintaining control. For instance, it is better to run tabulations
         for five states on each of three servers, and later submit tabulations for additional states as the
         original five complete, than to submit tabulations for all 52 states. This method allows the work
         load to be fed into the servers to maintain a steady state of productivity. It also allows early
identification of systemic errors (for instance, if an incorrect version of a txd were tabulated, the match reports could detect it, and processing could be stopped). It also provides limited protection
         from system problems like running out of disk space, because the affected activities could be
         stopped and other activities submitted while the impact was investigated, keeping the servers
         fully engaged.
        Create a separate set of scripts for each ‘product’ environment which will be used in the creation
         of product SF-X. For instance, for the AIAN product, operations were conducted in five different
         ‘product’ environments. Each of the five ‘products’ had a different set of scripts.
        (for each ‘product’:) Decide whether the tabulations will be waved by table or not. Generally, the
         more tables there are in a product, and/or the more geographic entities there are in a product, the
         longer tabulation will take, and the more valuable waving will be.
        (for each ‘product’:) Generally, you should create one script to run all the one-time setup
         functions. These include creating the directory structure, creating wave driver subdirectories (if
         desired), creating links, etc.
        (for each ‘product’:) Examine the Cookbook instructions carefully to understand the
         predecessor/successor relationships, in particular the points where two or more activities must
         complete before proceeding.
            If there are iterations in the product, there is a requirement to complete Summary File Creation
             for iteration 001 (Tab stage 4000) before beginning to create the summary file for any other
             iteration. This is because the logical record numbers assigned for iteration 001 are used for
             all other iterations in a product.
            If the tabulations will be waved by table, there is a requirement to complete Unconditional
             Rounding (Tab stage 3700) for all tables (in that iteration) before beginning to create the
             summary file (for that iteration).
            If there is US processing, there may be a requirement to complete Splitting csv Output (Tab
             stage 3050) for all 52 states (for each iteration) before starting Aggregation (Tab stage 3200)
             for the US for (that iteration).
            If there is associated product (gc49) processing, there is a requirement to complete Rehydrate
             (Tab stage 3020) in both the main product and the associated product (for each iteration)
             before starting Incorporating Results from Associated Products (Tab stage 3025) (for that
             iteration) for the main product.
             Then look for other predecessor/successor relationships which you can leverage. Just as
              there are points where you must complete two or more activities before proceeding, there will
              be points where you would prefer to run subsequent tasks in parallel. For instance, once a
              summary file has been created, you can either run Verify Rollup, the Internal Match, the
              Analyzer Match, the Prior Product Match(es) in sequence or in parallel. If timing is important,
              run them in parallel.
         Note that it will be incumbent upon the operator to determine the success or failure of each state
          of each script. Plan to be able to control the work at that level.
When testing is complete, the operational scripts are ready to be migrated to the production environment.
Do not migrate wave drivers to the production environment. Recreate them in the production
environment so that they are based on the production driver files.
Here are some tips on planning the contents of the operational scripts and the wave driver subdirectories:
         Use the operational scripts from previous products as the base for new operational scripts. Many
          fairly fancy techniques were developed during production for SF4, AIAN, and CD108, which may
          be useful for a new product.
         Observe the behavior of the computer(s) while these scripts are running, and make changes to
          maximize throughput.
         Use the techniques listed under “Executing the DPP System for a Product” to identify
          opportunities for improvement while testing.
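To make the guidelines above concrete, here is a minimal sketch of what one increment of an operational script might look like. It assumes a hypothetical per-state Cookbook command, here called BuildDB, and the $DPPwork/$DPPenv conventions described earlier; the actual commands and batch sizes come from the product’s Cookbook and from the submission guidelines in the section on executing the DPP system.

    #!/bin/ksh
    # Hypothetical sketch: submit database builds for a small batch of
    # states, then wait, keeping the count within the 8-10-at-a-time
    # guidance. "BuildDB" stands in for the real Cookbook command.
    for st in VT DE WY ND SD
    do
        nohup BuildDB $st > $DPPwork/$DPPenv/OpsNotes/build_$st.log 2>&1 &
    done
    wait    # block until the whole batch completes
    echo "batch complete: $(date)" >> $DPPwork/$DPPenv/OpsNotes/Activity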

7.8. Preparing the hardware/firmware environment for pa/uat/test/prod/sprod
processing
The tasks that go into preparing the RS/6000 DPP servers for a job submission will vary depending on
the environment. The system administrator (SA) is responsible for the majority of tasks performed. The
SA may need to do cleanup or archiving from previous products in order to acquire enough available
disks depending on the space estimates. Disk groupings (Volume Groups or VGs) are created, used,
reused, and reconfigured as needed for the next upcoming event. The SA gets direction from the various
environment leaders prior to deleting or archiving any files. During initial planning meetings and
especially during pa reviews, the SA must determine which disk areas are most heavily written to and
which areas are most likely to grow substantially. This will help determine how to subdivide the area into
separate file systems for the production environment.
The pa and uat environments are less restrictive and require less involvement from the system
administrator. The pa and uat operators are basically provided 2 large disk areas, also referred to as ‘file
system mount points’. One area is a standard logical volume with a journaled file system (jfs) covering
one or more disks and the second area is a striped logical volume with a file system covering multiple
disks. Striped disk areas are covered in more detail below. Uat and pa are given large areas that the
operators can manipulate at will, ie, creating or deleting subdirectories and links. Multiple runs using
different build versions could be contained within one pa or uat area. As mentioned earlier, pa and uat
runs may contain only a subset of states or iterations and thus require less disk space. Test, sprod, and
prod are progressively more restrictive and may contain multiple mount points or several file systems
combined to create the disk area needed. Here is an example showing the contrasting look of the various
areas by mount points.
Example of PA area for running uSF3. Disk space requested may have been approximately 420Gb total.
          mount point(s) approx. size                                   remarks
          /dpp2/pa             320Gb               - operator creates/modifies subdirectories and links as desired
                                                   - may contain subdirectories generated by various builds.
          /striped_pa          120Gb               - target filesystem for write intensive directories which have
                                                   been linked out of /dpp2/pa, ie, /dpp2/pa/geo would actually
                                                   be a link to /striped_pa/geo
Example of Prod area for running uSF3. Disk space requested may have been 1.1 terabytes, an area much too large to be contained within one single file system.
          mount points                              approx. size                         remarks
          /dpp2/prod/uSF3                          300Gb                - production scripts generate subdirectories
          /dpp2/prod/uSF3/SXoutput                 400Gb                - separate file system is used due to volatile
                                                                        growth of this area
          /prod_striped                            400 Gb               - target filesystem for write intensive directories
                                                                        which have been linked out of /dpp2/prod/uSF3,
                                                                        ie, dpp2/prod/geo
When using IBM Enterprise Storage Server disk arrays (ESS aka Shark), the real term should be ‘logical
unit numbers’ (luns) rather than disks. However, luns and disks are used interchangeably for the
purposes of this document.
There are two important parameters to consider when defining the filesystem. The uSF4 product
generated a tremendous number of small files. Careful calculations were required to determine the best
Number of Bits Per Inode (NBPI) to use. The NBPI had to be small enough to allow for millions of files
while at the same time large enough to allow a 400-800 Gb filesystem. The second parameter required is
the MIND option. It is inserted in the Mount Options field. This is a bug workaround provided by IBM for
large filesystems.
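A minimal command-line sketch of such a filesystem creation follows. All of the values shown (volume group name, mount point, size, and NBPI) are illustrative assumptions, not the figures actually used for any product; the MIND workaround is applied through the mount options attribute.

    # Create a large jfs with a hand-picked NBPI (size is in 512-byte
    # blocks; 800000000 blocks is roughly 400Gb). Values are assumptions.
    crfs -v jfs -g prodvg -m /dpp2/prod/uSF_X/SXoutput -a size=800000000 -a nbpi=4096
    # Apply the MIND mount option, then mount the filesystem.
    chfs -a options=mind /dpp2/prod/uSF_X/SXoutput
    mount /dpp2/prod/uSF_X/SXoutput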
Once the jfs has been created and mounted from the command line, the proper user/group permissions
are applied for the particular environment. Eligible users include dppprod, dpppa, dppuat, dppsprod, and
dpptest. Individual jamesbond IDs should never own these functional areas. The user is granted full read, write, and execute (rwx) permission and the group is assigned read and execute only. Effectively,
chmod 750 <mount point>
The SA maintains several scripts to monitor the system during important jobs. The scripts monitor paging
space, filesystem disk usage, inode usage, input/output statistics, run queues, etc. The scripts are modified
for each product and/or environment as necessary. Occasionally, the environment leader may request
that a certain parameter be watched and information written to a log for analysis. Most often, the SA
captures information related to system statistics that may be needed to answer questions later.
Finally, immediately prior to start of production (or any long running job), a machine reboot is performed.
This is necessary to clear system caches and get optimum performance.
7.8.1. Resource estimation (disk space, run time, etc.)
Disk space, run times, and, potentially, the number of machines required for each product get refined as the test runs move from the development environment to product assurance and through user acceptance testing.
For small products, such as School Districts, pa testing has incorporated the full coverage of all states.
The estimates are simply taken from their results. Logs contain date and time stamps for each stage of
the run and provide run time estimates. In preparation for larger, more intense products such as uSF3
and uSF4, a performance specialist has been enlisted to assist with extrapolating the numbers from a
subset of test runs. Assignments of small, medium, and large states are made for each product and
every effort is made to run a good sampling through product assurance testing.
7.8.2. Striped disk areas
When a product or a particular code version generates extreme disk write activity, it is recommended that
striped disk areas be defined. The need for striped areas is evident when input/output wait times exceed
roughly 30 percent. Most often, sas modules and geography processing (Get stage 10, Tab stages 4000,
5200, and 5300) require output directories to be striped disk areas.
In the /striped_pa filesystem example above, the 120 gigabyte area would probably consist of four thirty-
two gigabyte luns (4x32) totaling 128Gb (total PPs equals 1904@64Mb). The jfs filesystem is spread
over the disks in equal physical partitions (PPs). Therefore, unlike regular jfs filesystems, growing a
striped filesystem is restricted to available remaining space across all disks involved. Underestimated
striped areas are difficult to recover from once production has begun. Another advance calculation required is the Number of LOGICAL PARTITIONS. This should be a multiple of the total number of luns involved; in this example it should be a multiple of four for most efficient use of disk space. Since AIX must create the jfslog using one PP when the first filesystem is created on any VG, Number of LOGICAL PARTITIONS must not exceed the Total PPs of the VG minus one. This, in effect, forces you to subtract one PP from each lun when doing the math. For example:
 4 disks x 32Gb = 128Gb
 PP size = 64Mb, or 476 PPs per disk, effectively giving you 475 x 4 = 1900 PPs, or 121.6Gb
To create the striped area, the SA first locates several same-sized disks and builds the VG or adds to an
existing one. For this example, to create the PA striped area of 128Gb over four disks, the SA would use
the System Management Interface Tool (SMIT) gui or SMITTY text-based menus. Two specific menu
items must be modified to achieve a striped logical volume: 1) PHYSICAL VOLUME names, and 2) Stripe size. In all cases, the stripe size selected for DPP was 64. This selection is based on VG PP size
and numbers of disks. (Reference IBM Redbook XXXXX for further discussion on selecting appropriate
stripe sizes).
All of the luns which make up the intended striped logical volume should be selected by pressing F4 on
the PHYSICAL VOLUME names item.
Below is the smitty mklv menu with the three important items highlighted:
                             Add a Logical Volume
    Type or select values in entry fields.
    Press Enter AFTER making all desired changes.
                                                              [Entry Fields]
      Logical volume NAME                     [stripedpalv]
    * VOLUME GROUP name                       stripedpavg
    * Number of LOGICAL PARTITIONS            [1900]
    PHYSICAL VOLUME names                     [vpath51,vpath52,vpath97,vpath10]
                                       (press F4 to get list and select luns)
    Logical volume TYPE                           []
     POSITION on physical volume                  middle
    RANGE of physical volumes                     minimum
    MAXIMUM NUMBER of PHYSICAL VOLUMES            []
    to use for allocation
    Number of COPIES of each logical              1
    partition
    Mirror Write Consistency?                     yes
    Allocate each logical partition copy          yes
    on a SEPARATE physical volume?
    RELOCATE the logical volume during            yes
    reorganization?
     Logical volume LABEL                         []
     MAXIMUM NUMBER of LOGICAL PARTITIONS         [512]
     Enable BAD BLOCK relocation?                 yes
    SCHEDULING POLICY for reading/writing         parallel
    logical partition copies
    Enable WRITE VERIFY?                          no
    File containing ALLOCATION MAP                []
    Stripe Size?                                  [64] (press F4 to select 64k)

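For reference, here is what an equivalent single mklv command might look like, using the same values as the SMIT screen above. This is a sketch; the SMIT path remains the documented procedure.

    # -y names the logical volume, -S sets the 64K stripe size, 1900 is
    # the logical partition count, followed by the participating luns.
    mklv -y stripedpalv -S 64K stripedpavg 1900 vpath51 vpath52 vpath97 vpath10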
While it is acceptable to mix regular jfs filesystems with striped filesystems on the same VG, cleanup and management of the disk space gets more complicated. It is best to keep them on separate VGs when possible. In the PA example, you will note that 7.4Gb of disk space is unused. It would be allowable to create a 7.4Gb filesystem using that leftover disk space; however, when cleanup and archive occur, you may be left with a 7.4Gb jfs that won’t allow you to reuse the 121.6Gb space as you would wish. Whether the 7.4Gb filesystem will be temporary/disposable or long-term is one consideration when deciding to co-mingle filesystems on a volume group.
7.8.3. Mounting disk areas across multiple platforms
The DPP hardware resources grew to include four RS/6000 servers and two ESS sharks (total of 17Tb) at
one point. While DPP1 always remains the development/test machine and DPP2 remains the primary
production machine, DPP3 and DPP4 were flexible and available for use where most needed. Due to the
dynamic availability of the ESS disks, the SA can move Volume Groups and/or reassign luns among the
servers quite easily.
There were typically three approaches for sharing disks and data between the servers:
1) export/import VGs (unplug and move)
2) NFS mounting across the network
3) combination of NFS mounting and local striped disks.
Export/import of a VG was used when the environment had begun running on a particular machine and had to be relocated to another machine in midstream, with an investment of effort and data that could not be discarded and restarted. This occurred most often with PA testing. Exporting/importing a
VG is a standard AIX feature and no customization was done for DPP. However, in the ESS environment
host access/ownership of the luns became an additional concern.
DPP relies heavily on NFS mounted filesystems which share existing data between servers and between
environments. A gigabit router and private network between DPP machines was implemented. The most
prevalent scenario is where DPP2 makes completed product files available (NFS exports) as inputs for prior product matches across all other environments (dev, pa, test, uat, prod itself). When these
environments are running on different machines, NFS mounts from DPP2 must be created. Extreme
caution to avoid write access to the files must be taken. Because NFS exports are a security risk,
restrictive availability is important. Read-only access is provided to limited hostnames whenever possible.
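As a sketch of how that restriction might look on AIX, the /etc/exports entry below limits one product area to read-only access from a single host on the private network; the host and path shown are illustrative assumptions.

    # /etc/exports entry: read-only, restricted to one private-network host
    /dpp2/prod/uSF2A -ro,access=gigdpp3

    exportfs -va    # re-export everything in /etc/exports after editing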
During one production run, all of the production filesystems in use on DPP2 were NFS exported to a
second machine, DPP3, to shorten the runtime. Here is a table depicting this configuration:
          DPP2 (filesystems resident)                    ------ >    DPP3 (filesystems nfs mounted)
          /dpp2/prod/uSF2A                                           gigdpp2:/dpp2/prod/uSF2A
          JFS, exported to DPP3 explicitly                           NFS mounted
          All reads/writes performed locally                         All reads/writes performed across the network
  (gigdppx is the private gigabit network interface between the four machines)
This configuration only worked with small products that did not have extensive write activity.
The third configuration became the optimum way of running Production jobs. One server, often DPP2,
was the primary machine holding the static, non-temporary, read-mostly directories. The directories were
NFS exported and mounted to other servers. The dynamic, write-intensive, and “mostly” temporary
directories were created locally on each machine involved. Links and strict naming conventions ensured that the desired output was written to the proper location. Below is an example of each type of filesystem in this
mixed configuration:
DPP2                        mounted as          DPP3                        mounted as
/dpp2/prod/uSF4             JFS (local)         dpp2:/dpp2/prod/uSF4        NFS
/prod_striped/geo           JFS (local)         /prod_striped/geo           JFS (local)

On both machines, the /geo subdirectory was always a link which looked like this:
/dpp2/prod/uSF4/geo                ---- >                   /prod_striped/geo
Thereby, on either server, the link would resolve locally, ie, when you issued the command
          >cd /dpp2/prod/uSF4/geo
you would be placed on the LOCAL /prod_striped/geo directory.
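The link itself would be created once on each server, for example (a sketch; the paths are those from the example above):

    # On each machine, point the product geo subdirectory at local striped disk:
    ln -s /prod_striped/geo /dpp2/prod/uSF4/geo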
The reference to “mostly temporary” above means that some /geo files were generated but were needed
as inputs some time later during production. These files could not be deleted until the product was
absolutely completed.
7.8.4. Linking to files generated in other products, versus recreating them
Using links to access files which exist in other product areas became the obvious choice for two reasons: duplicating or copying files into other environments allowed for errors, incorrect versions, and unintentional overwrites, while links to files, particularly production area files, save disk space and avoid confusion. Access is read-only, and ownership with write privileges is restricted to the dppprod generic account.
An example of linking files is the SDCOSP product which required inputs from the previously completed
product – SDCO, specifically the SXrecodes directory. The links in SDCOSP and the target files in
SDCO for Alaska appear as follows:
Directory /dpp2/prod/SDCOSP1/SXrecodes/SDCOSP1/AK contains this link
AK_SDCOSP1_prod_recode.txt.full -> /dpp2/prod/SDCO/SXrecodes/SDCO/AK/AK_SDCO_prod_recode.txt.full
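A hypothetical command sequence that would create this link (run as the account owning the SDCOSP1 area):

    cd /dpp2/prod/SDCOSP1/SXrecodes/SDCOSP1/AK
    ln -s /dpp2/prod/SDCO/SXrecodes/SDCO/AK/AK_SDCO_prod_recode.txt.full AK_SDCOSP1_prod_recode.txt.full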

7.9.      Executing the DPP System for a product
This section is written for production, but it is applicable, to a lesser degree, to testing done in /pa, /uat,
and /test. The major difference between those environments and /prod (or /sprod) is the rigor with which
the user must adhere to configuration management. Work in all of these environments is done solely
through a named account with write privilege to the area. While using the named account, the operator
has the computer’s permission to modify all files in the area. So, while working in /prod, the operator
must be vigilant.
During testing, the operator may choose to modify files to test certain functions, or to speed testing.
During production operations, however, the user must make only the sorts of changes mentioned below,
in order to ensure the integrity of results.
This documents an approach that has worked in the past. However, it is not the only way that valid
results could be achieved.
Assumptions for this section: The main product is SF_X. The computer to be used is dpp2. Disk space
has been allocated. The inode recommendations have been implemented. Striped disk has been
prepared. The DPP builds have been deployed, and the installation number that links them is 789.
Operational scripts will be used. The operator knows the password to dppprod, which is the named
account for /prod.
Prepare Operational Scripts:
Prepare operational scripts to use in submitting the workload. It is not possible to make a general list of the scripts to be developed, or of the tasks that can be run in sequence, as each product has its own sequence.
That information is in the Cookbook. Modify operational scripts used for previous products, or write them
from scratch. Use them to implement the Cookbook for uSF_X.
Prepare the environment for processing:
The first time (and every time) you work in /prod/uSF_X, do these three things:
     Log on to dpp2 as yourself.
     su dppprod
     set -o vi
In the main uSF_X area, do these one-time, preparatory non-Cookbook tasks:
    mkdir /dpp2/prod/uSF_X/OpsScripts
    mkdir /dpp2/prod/uSF_X/OpsNotes
Copy/modify/create the operational scripts into /dpp2/prod/uSF_X/OpsScripts, and start making
notes in /dpp2/prod/uSF_X/OpsNotes/Activity.
Submitting the workload:
Before running any of the operational scripts, always su to a named account, and run the DPPsetup
command using the correct software installation numbers, for example:
   su dppprod
   . ~dpp/DPPsetup         prod uSF_X        53 789
Once the environment variables have been set, run the operational scripts in the appropriate sequence.
Here are techniques for monitoring the progress of production.
        Keep a paper version of what has been submitted. Keep it up-to-date. It is fairly convenient to
         make notations on a printed version of the reports from DPPStatus.
        Run the DPPStatus program from time to time. It summarizes information in
         /dpp2/prod/uSF_X/logs, and counts files being generated during Tab stages 3000 - 3700.
         Compare each new set of reports to the paper record, and reconcile any differences.
        If any processing stream has experienced errors, stop the stream immediately if it has not
         stopped itself. Submit some other work to keep the computer busy, and begin investigating the
         source and impact of the errors to the stopped stream. After determining the necessary action,
         resubmit the processing stream. If significant work was completed successfully, create and use a
         modified operational script to re-start just before the error condition occurred. When in doubt
         about a stage of processing, err on the side of caution: run it again.
        Use the ps -eaf command to gauge the load on the computer(s), and watch to see the impact
         of that load on disk space, especially the subdirectories which contain temporary files.
        Use the df -k command to monitor the use of disk space. Monitor each mount point that is in
         use.
        Here are some guidelines which have worked well on a 24 processor computer for how much
         DPP work can be submitted:
            At most, run ProcessGEO for 8-10 states at a time. They do a lot of i/o.
            At most, run database builds (Tab stage 2100) for 8-10 states at a time. They also do a lot of
             i/o.
            At most, have 120 processes active at a time. Add together the figures in the first two columns from the command vmstat 5 5 to get this number (a one-line helper appears after this list). Any more than this could lead to problems.
            At the least, have 40 processes active at a time. Add together the figures in the first two
             columns from the command vmstat 5 5 to get this number. Any fewer than 40, and you will
             not be fully utilizing the computer.
            Plan work so that the computer will stay busy all weekend without manual intervention.
             Monday through Thursday, submit most of the shorter running jobs. On Friday afternoons, fill
             the computer with long-running jobs.
        Monitor the SAR report. The “% user” should be very high, the “% waiting for i/o” should be low or medium, the “% system” should be relatively low, and the “% idle” should be under 5. Note: processing done by DPP scripts (as opposed to processing done by DPP programs) is attributed to the “% system,” so a higher figure for the “% system” may not indicate a problem if processing is concentrated in certain stages.
        Monitor the logs frequently. The DPP system will create a file with ‘err’ in its name in the /log directory if an error has occurred.
        Monitor the files in /reports frequently. Each report needs to be examined for any indication of
         problems. Each product is different, so develop expectations for the content of each type of
         report during testing, in consultation with developers and subject matter experts. During
production, because some products produce literally thousands of reports, you may want to develop scripts to parse each type of report, instead of looking at them manually.
        For major products, the Census Bureau typically selected several states as Test States. One or
         more of these (often Vermont) was usually processed repeatedly in UAT, with as many defect
         correction cycles as necessary. Then the Test States were run in TEST and submitted to the
         product sponsor for review. They were rerun in TEST with as many defect correction cycles as
         necessary, until the results were approved by the product sponsor. Then they were run in
production. The summary files resulting from each of these runs were compared to the result of the previous run using the utility program SFDiff, to confirm that the only things that changed were those which were supposed to change.
        Whenever running production, it is a good idea to select a small subset of the product (for
         example, two small states), and push them through the system quickly, so that all logs and
         reports can be examined while other states are in the early stages of processing.
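Two small helpers implied by the guidelines above are sketched here; the log path is illustrative, and the product directory should be substituted as appropriate.

    # Estimate the number of active processes: sum the first two vmstat
    # columns (r + b) from the final sample, per the 40-120 guideline.
    vmstat 5 5 | tail -1 | awk '{print $1 + $2, "active processes"}'

    # Scan the log directory for files with 'err' in their names that
    # changed within the last day.
    find /dpp2/prod/uSF_X/logs -name '*err*' -mtime -1 -print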
                                 8. MAINTAINING THE DPP SYSTEM

8.1. Testing/installing versions/fixes to operating system software and to COTS
software
Prior approval is needed from the BOC manager overseeing DPP systems before installing any new software or updates. Information should be provided to all interested parties regarding the impact of the update, and an agreed-upon installation time is approved. Whenever possible, previous versions should not be overwritten, and a plan to revert back should be drafted in the event the change has a negative impact.
For new versions of STR, SAS, JDK, etc., separate filesystems are created so that older versions can be
removed eventually and disk space recovered more efficiently.
Updates, patches, new builds, and anything of an alpha or beta nature, are always applied to DPP1 first.
The update will be put on DPP2 after approximately two weeks, or when sufficient testing has been done, depending on the size and nature of the update. Security-related patches are applied as soon as
possible, usually one or two days after DPP1.
The directory that holds SA working files is /admin/working. It exists on both DPP1 and DPP2. However,
for very large OS updates it may be NFS mounted from DPP1 to DPP2 for space savings. SMIT (or
smitty text based) screens should be used for installing software updates. Typically, from the directory
where the software update resides, the command used is “smitty update_all.” Below is a sample
update_all screen:
                                Update Installed Software to Latest Level (Update All)
    Type or select values in entry fields.
    Press Enter AFTER making all desired changes.
                                                                     [Entry Fields]
    * INPUT device / directory for software                           .
    * SOFTWARE to update                                              _update_all
      PREVIEW only? (update operation will NOT occur)                 yes
      COMMIT software updates?                                        no
      SAVE replaced files?                                            yes
      AUTOMATICALLY install requisite software?                       yes
      EXTEND file systems if space needed?                            yes
      VERIFY install and check file sizes?                            no
      DETAILED output?                                                no
      Process multiple volumes?                                       yes
                             Note: The dot entry for *INPUT device means ‘current directory’.
Any software in an “applied” state must be “committed” before newer versions will install. The “Preview
Only” option should be run first; then any conflicts resolved before proceeding. Careful attention should
be paid to the resulting SMIT log which will indicate any problems. SMIT Command Status should
display “OK” results.
Based on the nature of the software update, developers, testers, and SAs should perform a sampling of
tasks to assure that the updates have not adversely impacted the DPP system.

8.2.      User account maintenance
Requests for creating, updating, or closing user accounts must come from BOC management, especially when additional privileges are gained by the action. Except for the generic user accounts, all accounts should have very limited access to production files, barring rare cases. There is a small number of user
accounts that exist for non-DADSO staff; these include POP/Housing data reviewers and geography
users who transfer/upload files to DPP.
When a user leaves the DADSO staff, the following actions should be taken:
- user’s jamesbond account is locked, password is changed, any type of login is denied.
 (NOTE: the account should NEVER be deleted per BOC Security Guidelines).
- user is removed from all groups, except staff.
- cron and at (batch) queues are checked for user entries.
- any generic accounts the user had access to must have passwords changed.
- user’s home directories are not deleted until it has been determined that no remaining files are needed.
The SA will monitor the disk usage of the /home filesystem and notify users when it becomes too full.
Users taking the majority of disk space are notified directly that they should clean up their directories.
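A quick way to identify those users, sketched here, is to rank home directories by size:

    # List the ten largest /home consumers (sizes in KB):
    du -sk /home/* | sort -rn | head -10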
For security reasons, BOC guidelines request that all users log out of their sessions at the end of each
day. Some exceptions are made for long running interactive jobs. Screens should be locked whenever a
user is away from their desk for any length of time.
Users’ /home directories and subdirectories are backed up on a daily basis. However, users are discouraged from using their home directory to store program-related files. Shared areas providing group access are provided instead; this helps avoid multiple copies of various versions of files and large tarfiles.
In addition to the DPP server to server NFS connections, many users choose to access DPP filesystems
from their PC desktop. This is possible through the PC_NFS daemon which runs on the unix server and
the InterDrive client on the desktop. The user must have a valid username/password on the unix server
and the nfs exported filesystem must not be limited to a specific hostname or IP address. Steps to
access a unix directory from the PC are:
         Double click My Network Places -> Entire Network -> Entire Contents -> InterDrive -> NFS
          Servers I Have Configured.
         Select the DPP server (you may need to add them if none exist). A listing of all NFS exported
          filesystems will appear along with printer queues.
         Double clicking on a folder icon will invoke a username/password login screen.
Below is an example of NFS exports for DPP2. Notice that the products uSF1 and uSF1A are available
on the network, as well as UREVIEW for POP reviewers.
The WS_LOGS filesystem contains copies of apache logs and is accessed by PC software, WebTrends, to generate AFF website usage reports.
Figure 32: NFS exports from DPP2 to a desktop.

Since the BOC uses dynamic IP addressing for desktops and often does not provide reverse IP lookup,
the IP address of the user's PC must be entered into the unix server's /etc/hosts file. The indication that
the IP cannot be resolved and the /etc/hosts file needs updating is the pop-up error:
          "\\dpp2\<directory>\" is not accessible"
          "Network Access is Denied"
Again, NFS exports without limitations are a security concern and should be used cautiously.

8.3.      Adding and/or removing hardware
Any decision to add or remove hardware should include a lengthy review of how the change will affect the various environments, ie, development, test, production, etc. Justification for adding hardware should include how the additional resources will be divided among the groups. The plan may include detailed steps to move users and data to a totally new environment or server. When the 12Tb ESS was removed, several months of detailed planning were required to shrink, archive, and move data around.
All interested parties should be notified well in advance of any hardware changes. Beyond the DPP users
and managers, the BOC Bowie Computer Center must be included, particularly if the hardware change
requires floor space, power or network drops, or any changes to equipment inventory at that facility. In
the event cables are needed to connect the hardware, determine the proper cable and whether Bowie or
the vendor should be the supplier.
Assuming all approvals for the physical hardware changes have been acquired, next steps are to
ascertain the required microcode levels and device drivers needed to support the hardware
change/upgrade. The SA may need the assistance of the IBM hardware engineer for these software
installations depending upon the hardware involved. The hardware engineer should be consulted in
almost all discussions regarding hardware changes.
The IBM RS/6000 server and the AIX OS are very tolerant of hardware configuration changes. Most
often, the system may remain up and running when the activity occurs. However, a reboot is most likely
needed to implement the change. Of course, system backups (mksysb) should be taken for quick
recovery in the event of a failure. Immediately following hardware configuration changes, another system
backup should be taken to capture current configurations.
Once new hardware is installed and available, minimal testing should occur to make sure it performs as
expected and is reliable.

8.4.      Supporting the long-term file retention policy
All DADS servers are backed up nightly using Tivoli Storage Manager (TSM), “incremental forever” policy.
The current version plus two previous versions of every file are backed up in the TSM database which
resides on DPP1. In addition to disk storage pools, DPP1 has a model 3584 tape library attached. All
production area files are backed up when created and/or changed. A few specific directories (/tmp,
.netscape_cache, etc.) are excluded because the files are either temporary or backed up by a system
backup instead. Once a production file has been deleted from disk, the final version remains on the
backup media indefinitely. Dev, pa, and uat environments have backup policies that retain fewer versions
and the final version has a 365 day retention period.
To restore a file, invoke the tsm client and issue the following command:
tsm> restore “/<full pathname to file>”
See the Tivoli Storage Manager documentation for additional options available for restoring.
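For example, a sketch of restoring an entire log directory (the path is illustrative; -subdir=yes picks up files in subdirectories):

    tsm> restore -subdir=yes "/dpp2/prod/uSF_X/logs/*"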
In addition to regular backups, completed production directories are archived to tape. Archiving does a
complete backup of the filesystem specified and the files are not subject to the policies which control
regular backups such as number of versions, retention times, etc. To allow for reuse of disk space,
completed product directories are shrunk by removing temporary or unnecessary interim files. If any
deleted files are needed on-line, they can be restored from backup tapes at a rate of approximately 20Gb per hour; the number and sizes of the files will affect the restore rate.
In the event files are needed long after the unix software and hardware are obsolete, the most important directories of the DPP environment were copied to DVD for long-term storage. These files are in ASCII text format to assure they will be supported/accessible by most future software packages. The directories
include:
/releases/*
/dpp2/ftp/dec                  (Due to size, California HDF had to be gzipped on an NTFS PC prior)
/dpp2/ftp/geo
/dpp2/prod/<product>/product/*                     (All uSF4 files were gzipped prior to copy)
/dpp2/prod/<product>/logs/*
The software used to produce the DVD copies is Veritas RecordNow DX and the DVDs are 4.7Gb in size.
The unix directories were NFS exported and mounted as a networked drive on the PC with DVD writer
attached. A binder containing information and the index of files is located in the DPP common area.

8.5.      Maintenance Reboots and Re-establishing Mount Points
IBM recommends the RS/6000 servers be rebooted on a regular basis. The interval is determined by the way in which the server is utilized, ie, heavily loaded database servers would be rebooted more frequently than a sparsely used general-purpose server. For the DPP environment, the servers should be rebooted approximately every 30 to 45 days. This cleans out temporary files, refreshes microcode, flushes caches, etc.
Everyone affected by the reboot should be notified by email and/or during a regular meeting. Environment leaders will agree to a date/time window. Users outside of DADSO, such as Bowie monitors and POP and HHES reviewers, should also be notified by a BOC manager. If the reboot includes installing patches or updates, then a plan for testing should be considered.
The reboots take about 45 minutes per server, so expected downtime is approximately two hours. The size of the server, the number of devices, the amount of memory, the number of large Volume Groups, and the nfs mounts required all affect the reboot duration. Thus, DPP2 always takes longer than DPP1. Due to the tangled web of nfs mounts between machines, the reboots are done separately, or staggered at a minimum. Some precautionary steps should be taken prior to reboot to avoid problems:
         NFS mounts not being used should be unmounted and deleted
         a system snapshot should be generated and saved locally; commands would include
               o    lsvg, lsvg -l `lsvg`, df -kI, lsvpcfg, mount, cat /etc/exports, cat /etc/filesystems
         determine that the internal tape drive is empty using 'tctl -f /dev/rmt0 rewind'
         verify the bootlist
         issue the warning command: wall "The system is coming down in 5 minutes - please log off"
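A minimal sketch of the snapshot step, assuming a scratch location under /tmp (the output path is an
assumption; the commands are those from the checklist above), is:
      #!/bin/ksh
      # Sketch: save a pre-reboot system snapshot locally.
      # The output path is illustrative only.
      SNAP=/tmp/preboot.$(hostname).$(date +%Y%m%d)
      {
            lsvg                      # list volume groups
            lsvg -l $(lsvg)           # logical volumes in each VG
            df -kI                    # filesystem sizes and mount points
            lsvpcfg                   # vpath device configuration
            mount                     # mounted filesystems
            cat /etc/exports          # NFS exports
            cat /etc/filesystems      # filesystem stanzas
      } > $SNAP 2>&1
      print "Snapshot saved to $SNAP"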
While waiting for a server to reboot, issue the following loop from another Unix server to check progress:
      while true
      do
            ping -c1 <servername>
            sleep 20
      done
When the ping no longer fails with "100% packet loss", the server is up; interrupt the loop with Ctrl-C.
After the server reboots successfully, several steps are necessary to validate the integrity of the system
(a sketch of a verification script follows the list):
         reissue the snapshot commands listed above and check that all VGs and filesystems are mounted
               o    pay particular attention to filesystems that are sub-mounted below other filesystems
                               Example - /dpp2/ftp/review
                                            /dpp2/ftp/review/uSF4
         issue the command 'exportfs -va' and review the NFS mounts between both servers, making sure
               there is content under the directory headings
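A sketch of the post-reboot checks, assuming the pre-reboot snapshot above was saved to a known
path (the paths and the sub-mount example are illustrative), might be:
      #!/bin/ksh
      # Sketch: re-run the snapshot commands, compare against the
      # pre-reboot snapshot, and confirm that sub-mounts have content.
      SNAP=/tmp/preboot.$(hostname).20040915   # pre-reboot snapshot (example name)
      NOW=/tmp/postboot.$(hostname)
      { lsvg; lsvg -l $(lsvg); df -kI; mount; cat /etc/exports; } > $NOW 2>&1
      diff $SNAP $NOW                          # investigate any unexpected differences
      exportfs -va                             # re-export all filesystems, verbosely
      ls /dpp2/ftp/review/uSF4 | head          # a sub-mount should not be empty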




                                    9.      ARTIFACTS AND KEY METRICS OF RECENT PRODUCTS
Many useful documents were produced during the development and processing of each product. They
include:
         the product specifications (for tables and geography) from Population Division;
         the specifications for DPP Geography files that were sent from DADS to Geography Division;
         the DPP Approach documents that were developed for each product;
         the flowcharts that were specific to each product;
         the processing progress reports.
There are also documents that describe the COTS software in more detail.
These documents have been gathered into product subdirectories under /Supplemental Materials.
Examining the documents that supported a particular product can further understanding of the
capabilities of the DPP system.


In addition, a chart summarizing key metrics for the 2000 Decennial products is included below (Figure 33).




Last updated 07/29/2004. In the chart below, the THRSHLD columns give the counts remaining after
thresholding, and THRSHLD (%) is the percentage of Geo/CI combinations that survive thresholding.
                                                                                                     THRSHLD         THRSHLD          THRSHLD         THRSHLD (%)
              Tables      Table        CIs       State        National         Geo/CI           State         National          Geo/CI           Geo/CI          Unthresholded          Actual
   Product             Matrix Cells           Geographies    Geographies     Combinations    Geographies     Geographies      Combinations     Combinations       Product Size       Product Size
AIAN             323       7,880      1,085              -           1,089       1,181,565              -            862             14,590               1.23      9,310,732,200       114,969,200
CD108H           286       8,113         1        156,440                -        156,440         156,440             -             156,440           100.00        1,269,197,720     1,269,197,720
CD108S           813      16,520         1        156,440                -        156,440         156,440             -             156,440           100.00        2,584,388,800     2,584,388,800
SDCH             805      16,486         6               -          18,328        109,968               -          18,328           102,858            93.53        1,812,932,448     1,695,716,988
SDCO             805      16,486         6               -          18,328        109,968               -          18,328           102,858            93.53        1,812,932,448     1,695,716,988
SDCOSP1           10         285         6               -          18,328        109,968               -          16,401            98,406            89.49           31,340,880        28,045,710
SDCOSP2           10         285         6               -          18,328        109,968               -           3,922            23,532            21.40           31,340,880         6,706,620
SDCOSP3           10         285         6               -          18,328        109,968               -           1,365             8,190               7.45         31,340,880         2,334,150
SDCOSP4           10         285         6               -          18,328        109,968               -           1,635             9,810               8.92         31,340,880         2,795,850
SDCOSP5           10         285         6               -          18,328        109,968               -           2,457            14,742            13.41           31,340,880         4,201,470
SDCOSP6           10         285         6               -          18,328        109,968               -           4,952            29,712            27.02           31,340,880         8,467,920
SDCOSS1           10         285         6               -          18,328        109,968               -           3,228            19,368            17.61           31,340,880         5,519,880
SDCOSS10          10         285         6               -          18,328        109,968               -           1,365             8,190               7.45         31,340,880         2,334,150
SDCOSS11          10         285         6               -          18,328        109,968               -           1,635             9,810               8.92         31,340,880         2,795,850
SDCOSS12          10         285         6               -          18,328        109,968               -            154                 924              0.84         31,340,880          263,340
SDCOSS13          10         285         6               -          18,328        109,968               -            225              1,350               1.23         31,340,880          384,750
SDCOSS14          10         285         6               -          18,328        109,968               -           2,264            13,584            12.35           31,340,880         3,871,440
SDCOSS2           10         285         6               -          18,328        109,968               -            349              2,094               1.90         31,340,880          596,790
SDCOSS3           10         285         6               -          18,328        109,968               -            204              1,224               1.11         31,340,880          348,840
SDCOSS4           10         285         6               -          18,328        109,968               -                58              348              0.32         31,340,880            99,180
SDCOSS5           10         285         6               -          18,328        109,968               -                20              120              0.11         31,340,880            34,200
SDCOSS6           10         285         6               -          18,328        109,968               -           2,958            17,748            16.14           31,340,880         5,058,180
SDCOSS7           10         285         6               -          18,328        109,968               -           1,110             6,660               6.06         31,340,880         1,898,100
SDCOSS8           10         285         6               -          18,328        109,968               -          16,401            98,406            89.49           31,340,880        28,045,710
SDCOSS9           10         285         6               -          18,328        109,968               -           3,922            23,532            21.40           31,340,880         6,706,620
SDCP             805      16,486         6               -          18,328        109,968               -          18,328           102,858            93.53        1,812,932,448     1,695,716,988
SDHC             805      16,486         6               -          18,328        109,968               -          18,328           102,858            93.53        1,812,932,448     1,695,716,988
SDPC             805      16,486         6               -          18,328        109,968               -          18,328           102,858            93.53        1,812,932,448     1,695,716,988
SDTT             805      16,486         1               -          18,328         18,328               -          18,328            18,328           100.00          302,155,408       302,155,408
SF1              286       8,113         1      9,541,315                -       9,541,315       9,541,315            -           9,541,315           100.00       77,408,688,595    77,408,688,595
SF1A             286       8,113         1               -         225,995        225,995               -         225,995           225,995           100.00        1,833,497,435     1,833,497,435
SF1F             286       8,113         1               -         497,515        497,515               -         497,515           497,515           100.00        4,036,339,195     4,036,339,195
SF1UR              2          12         1     18,389,481                -      18,389,481      18,389,481            -          18,389,481           100.00          220,673,772       220,673,772
SF2               47         766       250        482,188                -     120,547,000        413,716             -           5,204,625               4.32     92,339,002,000     3,986,742,750
SF2A              47         766       250               -         185,659      46,414,750              -         160,658         2,245,993               4.84     35,553,698,500     1,720,430,638
SF2F              47         766       250               -         373,989      93,497,250              -         264,919         4,394,180               4.70     71,618,893,500     3,365,941,880
SF2X               1          49       250        482,188                -     120,547,000        413,716             -           5,204,625               4.32      5,906,803,000       255,026,625
SF3              813      16,520         1      1,594,262          487,093       2,081,355       1,594,262        487,093         2,081,355           100.00       27,677,328,630    27,677,328,630
SF4              323       7,880       336        545,191          374,055     308,866,656         446,613        250,736         9,992,906             3.24     2,433,869,249,280   78,744,099,280
uPL                4         288         1      9,747,485              -         9,747,485       9,747,485            -           9,747,485           100.00         2,807,275,680    2,807,275,680


Figure 33: Key Metrics of 2000 Decennial Products
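As a worked reading of one row of Figure 33: for AIAN, 14,590 thresholded Geo/CI combinations out of
1,181,565 possible combinations gives 14,590 / 1,181,565 ≈ 1.23%, the value in the THRSHLD (%)
column; the actual product size of 114,969,200 is likewise about 1.23% of the unthresholded
9,310,732,200.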
                                                   10.       APPENDIX

10.1. Glossary of Terms
               Term                                                         Definition
A.C.E.                                Accuracy and Coverage Evaluation, basis for adjustment of detail data to
                                      compensate for undercount and overcount.
AIX                                   Advanced Interactive Executive. The operating system, a version of Unix
                                      produced by IBM, in use on all the DADS DPP machines.
Area                                  As used in this document, “area” has one of six defined values:
                                      dev for development
                                      pa for product-assurance testing
                                      uat for user-acceptance testing
                                      test for test-state production
                                      prod for production
                                      sprod for special-product production
                                      Other areas may be created as needed. The purpose is to segregate
                                      functionally distinct work. In practice, these areas are the root of separate
                                      directory-tree branches, for instance, /dpp1/dev and /dpp1/test, and are
                                      captured in the $DPProot, $DPPwork, and $DPPprog (Unix shell)
                                      environment variables actually used by DPP-system programs and scripts.
Comma-separated value                 Variable-length file with variable-length, comma-delimited fields.
(CSV) format
Detail file                           An edited decennial data file that contains geographic (Block), Group
                                      Quarters (GQ), Housing Unit (HU), and GQ & HU Person records.
DGF                                   File constructed by the DPP system from one or more DPP Geography
                                      Files, which often also contains Population Size Codes.
DPP Geography File                    File given to DPP from Geographic Division with data on all basic
                                      geographic areas, that is, blocks, and on higher-level geographic summary
                                      areas.
Driver file                           A file containing information about detail files, Analyzers, and products that
                                      is used by the DPP system in processing.
Environment                           The DPP system uses “environments” to allow products processed within
                                      an area (see above) to be grouped. The environment value is determined
                                      by the DPP Operations Staff and communicated to the Configuration
                                      Manager, who creates directories. Examples of environments one might
                                      create are DRSSF for Dress Rehearsal Sample Summary File, UAT for
                                      User Acceptance Testing, and PL for the Public Law product.
                                      In practice, environments are used with areas as the basis for the
                                      $DPProot, $DPPwork, and $DPPprog (Unix shell) environment variables
                                      actually used by DPP-system programs and scripts.
Flow basis                            Provision of data and/or processing and/or release of results piece-by-
                                      piece, for instance, state-by-state, rather than all at once.
Geographic component                  Geographic components represent that portion of their geographic entity
                                      that has some particular characteristic or population size. For instance, the
                                      data found on the 'urban' geographic component (geo component code of
                                      '01') for a state represents that portion of the state that is defined as being
                                      'urban'. In truth, the application of this definition for tabulation purposes
                                      requires that each block, the lowest geographic unit, be examined to
                                      determine its definition and thus its inclusion in the geographic component.
                                      All those blocks that qualify then become the basis for tabulating the data
                                      for the geographic component.
GeoID                                 Geographic Identifier: A field composed of Summary Level, Geographic
                                      Component, State, County, Tract, Block, and other fields that uniquely
                                      identify a given geographic area.
HDF                                   The Hundred Percent Detail File, which is composed of individual records of
                                      information on persons and housing units for the 100 percent census data
                                      items from the census questionnaires. These files are used for tabulation
                                      purposes and are not released to the public.
HEDF                                  The Hundred Percent Edited Detail File, which is composed of individual
                                      records of information on persons and housing units for the 100 percent
                                      census data items from the census questionnaires plus adjustment records.
                                      These files are used for tabulation purposes and are not released to the
                                      public.
Hundred Percent Summary               A Census 2000 Dress Rehearsal product file known in Census 2000 as
File (HSF)                            SF1.
Metadata                              Data that describes the structure and meaning of data files and fields.
P.L.                                  Public Law 94-171 – The public law requiring the Census Bureau to
                                      provide selected decennial census data tabulations to the states by April 1
                                      of the year following the census. These tabulations are used by the states
                                      to redefine the areas included in each congressional district and the areas
                                      in other districts used for state and local elections, a process called
                                      redistricting.
                                      Public Law 105-119 – The public law requiring the Census Bureau to make
                                      publicly available a second version of the data mandated by PL 94-171.
                                      This second set of data does not include the corrections for overcount and
                                      undercount measured in the Accuracy and Coverage Evaluation. In all
                                      other respects, this version of the data and the data with corrections for
                                      A.C.E. will be identical.
                                      Public Law 103-430 – The public law that amends Title 13, U.S. Code, to
                                      allow designated local and tribal officials access to the address information
                                      in the master address file to verify its accuracy and completeness. This law
                                      also requires the U.S. Postal Service to provide its address information to
                                      the Census Bureau to improve the master address file.
Population Size Code                  Population Size Codes can be thought of as recodes of exact population
                                      counts into distinct ranges that are represented by two-digit codes. For
                                      instance, if a geographic entity has an exact population count of 243, its
                                      count falls in the range of 200 to 249, which equates to a size code of '05'.
Population Size Code                  A file containing Population Size Codes which were set for six geographic
Reference File                        entities based on the 2000 Census 100% population counts.
Product                               This term refers to a set of summary file tabulations, such as SF1 or SF3.
Recode                                A version of a database field that maps (recodes) individual or sets of
                                      values to other values.
Redistricting Product                 This refers to the 100% tabulation product used for defining new voting
                                      districts. Also known as “PL,” because its creation is mandated by Public
                                      Law 94-171.
Review                                Quality control of a product before release.
SEDF                                  Sample Edited Detail File, a data file containing 100 percent and sample
                                      characteristics for housing units and persons from the census long form,
                                      which goes to one in six households. The file is used for tabulation purposes
                                      only and is not released to the public.
SF1                                   A Census 2000 100% tabulation data product consisting of 286 tables.
SF2                                   A Census 2000 100% tabulation data product consisting of 47 tables
                                      iterated over 250 race/ethnicity/ancestry universes.
SF3                                   A Census 2000 Sample tabulation data product consisting of approximately
                                      700 tables.
SF4                                   A Census 2000 Sample tabulation data product consisting of approximately
                                      50 tables iterated over several hundred race/ethnicity/ancestry universes.
Summary File                          The Census 2000 hand-off format: a set of files containing geographic
                                      information and product data. The Summary File is segmented into a
                                      number of files to make it more manageable. One file segment contains
                                      geographic data. The remainder of the segments contain summary data.
                                      Data cells from summary tables are mapped to SF segments according to
                                      information in the table-information driver file.
Summary Format File (SFF) A hand-off format describing product tables, used by DPP and AFF in the
                          Census 2000 Dress Rehearsal. SFFs are not generated by the Census
                          2000 DPP system; the format that replaces them is meta0420.txt.
Summary level                         A geographic area at which a data product is tabulated.
Summary Tape File (STF)               The 1990 decennial census hand-off format, which was the basis for the
                                      current Summary File format.
SuperCHANNEL (SC)                     SuperSTAR II database-builder software, which is used to build
                                      SuperCROSS 4 databases. The current version is 1.2.
SuperCROSS (SX)                       The tabulation software chosen for Census 2000 tabulations. The current
                                      Windows version is 3.7 update 3 (build 55). The tabulation engine for
                                      Census 2000 will run on the RS/6000; the current version of the
                                      SuperCROSS Server is 1.5.5.
SuperCROSS Production                 A SuperSTAR II software component that queues TXD tabulation job files to
System (Table Manager)                the SuperCROSS Server and writes tabular macro-data output files from
                                      tabulated data returned by the server. The current version is 1.0.13.
SuperMART Builder (SMB)               SuperSTAR I database-builder software, which is used to build
                                      SuperCROSS 3 databases. SuperMART Builder input files may also be
                                      used by SuperCHANNEL to build SuperCROSS 4 databases.
SuperSTAR                             STR product suite consisting of several products.
Table                                 A table, in DPP usage, is described in a product specification, composed in
                                      SuperCROSS, and tabulated with SuperCROSS.
Tabular macro-data format             A SuperCROSS tabulation-out format that contains a row/column/wafer
                                      label and a vector of data values for the given row/column/wafer.

 TDD                                   SuperCROSS textual data definition (TDD) files, which are the direct input
                                       format for SuperCHANNEL.
 TXD                                   SuperCROSS textual-table format. Textual representation of the
                                       SuperCROSS table. A TXD does not contain data.
 UDF                                   SuperCROSS user-defined field. A redefined data base field or non-data
                                       base field created in SuperCROSS and used in table composition.
 Wave                                  A term coined by the DPP team to designate driver files for a set of
                                       tables, geographic areas, or files that together form a subset of a product.
                                       Waves are created by copying the driver-file directory from $DPPdrivers
                                       and editing one or more driver files for a product. A wave can be defined
                                       for use in only one production step, or for use in multiple steps by altering
                                       as many applicable wave driver files as are needed to cover the steps.
                                       Wave processing (the use of wave driver files) is kicked off with the wave
                                       option, -w <wave-driver directory>, passed to selected DPP system scripts.
Table 75: Glossary of terms.

10.2. Appended Notes
<This appendix is optional. It may contain technical notes, notes regarding potential revisions, or needed
contact information, such as for third parties or special source references.>




                                                              11.         INDEX

11.1. Index of Tables
Table 1: Revision History Log. ...................................................................................................................... 2
Table 2: Record counts for HDF files ......................................................................................................... 16
Table 3: Record counts for SEDFs ............................................................................................................ 18
Table 4: Source of blocks for each basic product ....................................................................................... 19
Table 5: CD108 summary levels for Indiana ............................................................................................... 20
Table 6: Number of Geographic Entities in state-level Census 2000 products (attachment to Details of
    the Construction of the 2000 Decennial Summary Files memorandum) ............................................. 25
Table 7: Number of Geographic Entities in US-level Census 2000 products (attachment to Details of the
    Construction of the 2000 Decennial Summary Files memorandum) ................................................... 26
Table 8: Examples of Driver File Customization ......................................................................................... 26
Table 9: Inventory of DPP Driver Files ........................................................................................................ 28
Table 10: Syntax of Products.txt ................................................................................................................. 29
Table 11: Syntax of PriorProduct.txt ........................................................................................................... 29
Table 12: Syntax of <Product>Coverage.txt ............................................................................................... 29
Table 13: Syntax of <Product>SequenceChart.txt ..................................................................................... 30
Table 14: Syntax of <Product>GeoContent.txt ........................................................................................... 30
Table 15: Syntax of <Product>GeoIDInfo.txt .............................................................................................. 30
Table 16: Syntax of <Product>GeoLCDSumLev.txt ................................................................................... 30
Table 17: Syntax of <Product>GeoTractSumLev.txt .................................................................................. 31
Table 18: Syntax of <Product>TableInfo.txt ................................................................................................ 31
Table 19: Syntax of <Product>TableInfoForHandoff.txt ............................................................................. 32
Table 20: Syntax of <Product>Handoff.txt .................................................................................................. 32
Table 21: Syntax of <Product>Iterations.txt ................................................................................................ 32
Table 22: Syntax of <Product>IterationDBLogic.txt .................................................................................... 32
Table 23: Syntax of <Product>IterationsForHandoff.txt .............................................................................. 32
Table 24: Syntax of <Product>IterationsForSIPHC.txt ............................................................................... 32
Table 25: Syntax of <Product>ReportAddresses.txt ................................................................................... 33
Table 26: Syntax of <Product>_Rollup_HighLevel.txt ................................................................................ 33
Table 27: Syntax of <Product>_Rollup_LowLevel.txt ................................................................................. 33
Table 28: Syntax of Internal-<Product>_map.txt ........................................................................................ 33
Table 29: Syntax of <Product1>-<Product2>_map.txt ................................................................................ 34
Table 30: Syntax of <DetailFile>AnalyzerTableInfo.txt ............................................................................... 34
Table 31: Syntax of DGFConsistency_CommonVars.txt ............................................................................ 35
Table 32: Syntax of SDTT-uSF3_geo_map.txt ........................................................................................... 35
Table 33: DPP COTS Components ............................................................................................................ 36
Table 34: Important Global DPP Environment Variables ............................................................................ 37
Table 35: Example Stages from the Tab script ........................................................................................... 38
Table 36: DPP Korn Shell Variable Scoping Convention ........................................................................... 38
Table 37: DPP Korn Shell Variable Naming Convention ............................................................................ 38
Table 38: Summary of DPP Log Directories ............................................................................................... 39
Table 39: Key Features of the SuperSTAR System ................................................................................... 43
Table 40: Example of QSEX metadata delivered via Excel worksheet; last 5 columns omitted ................ 52
Table 41: Example of VerifyCounts report file (numbers have been altered) ............................................. 53
Table 42: Example of DF-DGFConsistency report file ................................................................................ 54
Table 43: Outputs from data preparation .................................................................................................... 54
Table 44: Mapping of databases and supplement files used...................................................................... 55
Table 45: Valid values for HHT prior to merging record types .................................................................... 55
Table 46: An additional valid value for HHT after merging record types .................................................... 56
Table 47: Multi-Response Fields ................................................................................................................ 56
Table 48: Example of post-data prep validation for one field; numbers are synthetic ................................ 57
Table 49: Customized Project File from Detail Database TDD................................................................... 59
Table 50: Customized DBCatalog File from Detail Database TDD ............................................................ 59
Table 51: Customized DBColumns File from Detail Database TDD .......................................................... 60
Table 52: Customized DBDelim File from Detail Database TDD................................................................ 60
Table 53: Customized DBFiles File from Detail Database TDD ................................................................. 60
Table 54: Customized DBPrimaryKeys File from Detail Database TDD .................................................... 60
Table 55: Customized DBTables File from Detail Database TDD .............................................................. 60
Table 56: Customized CLASSIFICATIONS File from Detail Database TDD .............................................. 61
Table 57: Example of post-database build validation; numbers are synthetic ............................................ 63
Table 58: Types of DPP File Systems ........................................................................................................ 65
Table 59: Example of NFS disk layout for SF4 ........................................................................................... 65
Table 60: Contents of DEV Release DPP2001 .......................................................................................... 68
Table 61: DPP Build Directories ................................................................................................................. 68
Table 62: Contents of OPS Release DPP2000_OPS ................................................................................. 69
Table 63: Example of how we assemble the SF3 MD geography file ........................................................ 71
Table 64: Geography Recode Format ........................................................................................................ 73
Table 65: Example of a Geography Recode ............................................................................................... 74
Table 66: Example of Impact of Geographic Recode Reduction ................................................................ 74
Table 67: Example of Impact of Recode Reduction on number of lines in Geography Recode ................ 75
Table 68: Example of Impact of Geographic Recode Dehydration............................................................. 75
Table 69: Example of Impact of Recode Dehydration on number of lines in Geography Recode ............. 76
Table 70: Example of Splitting a Geography Recode ................................................................................. 77

Table 71: Example of Geography Recode Sets for School District tabulations, Relevant Children ........... 77
Table 72: Example of Impact of using Characteristic Iteration Geographic Recodes ................................ 78
Table 73: Example of State Base Distribution and Median TXDs; numbers are synthetic ........................ 90
Table 74: The development of new functionality for Decennial 2000 Products .......................................... 91
Table 75: Glossary of terms. ..................................................................................................................... 148



11.2. Index of Figures
Figure 1: DPP Software High-level Component Diagram .......................................................................... 13
Figure 2: Codes in the Data Files and Filenames for various Product Releases (attachment to Details of
    the Construction of the 2000 Decennial Summary Files memorandum) ............................................. 24
Figure 3: Flow Chart of Programmatic Generation of TXD ........................................................................ 44
Figure 4: Overview of Detail Database Build process ............................................................................... 51
Figure 5: Merging results of tabulation when recode splits are used......................................................... 79
Figure 6: Rehydration Tabulation results when a dehydrated recode is used ........................................... 79
Figure 7: Hand-off specifics for destination: AFF ....................................................................................... 81
Figure 8: Hand-off specifics for destination: ACSD ................................................................................... 81
Figure 9: Hand-off specifics for destination: Review .................................................................................. 82
Figure 10: Hand-off specifics for destination: Internal ............................................................................... 83
Figure 11: Key for Hand-off specifics figures ............................................................................................. 83
Figure 12: Data Fields for Iterations ........................................................................................................... 84
Figure 13: Relational model of multi-response fields ................................................................................. 85
Figure 14: Using QRACE values in iteration definitions ............................................................................ 86
Figure 15: Example of a Status report ....................................................................................................... 97
Figure 16: Example of Extended Status report (first 14 columns) ............................................................. 98
Figure 17: Example of Extended Status report (last 11 columns) ............................................................. 98
Figure 18: Example of CountCSVFile report .............................................................................................. 99
Figure 19: The manipulation of TXDs and Recodes for tabulation .......................................................... 100
Figure 20: Code from the RECODE section of a TXD, specifying distribution intervals for a median..... 101
Figure 21: A function call from the DERIVE section of a TXD specifying a median calculation .............. 101
Figure 22: Three-dimensional representation of thresholding .................................................................. 103
Figure 23: Overview of SIPHC creation and usage ................................................................................ 104
Figure 24: SIPHC Fileset 1 - file contents ................................................................................................ 105
Figure 25: SIPHC Fileset 2 – file contents ............................................................................................... 106
Figure 26: TXD framework ....................................................................................................................... 107
Figure 27: Example of a SuperCROSS recode of Age ........................................................................... 108
Figure 28: DPP Hardware and Network Architecture .............................................................................. 119
Figure 29: Example showing ownership of /dpp2/uat disk space ............................................................ 121

Figure 30: Group membership for DPP users (in non-review groups) .................................................... 123
Figure 31: Group membership for DPP users (in review groups) ............................................................ 124
Figure 32: Figure showing NFS exports from DPP2 to a desktop. .......................................................... 139
Figure 33: Key Metrics of 2000 Decennial Products ................................................................................ 144



