          ONS Classification Coding
                Tools Project

       Occupation Classification Workshop
          RSS, London, 21 June 2004

                Nigel Swier

               Overview of ONS Coding
                    Tools Project
    • Aim: To select and “operationalise” a standard tool for assigning
      classification codes to verbatim text responses given in answer
      to a question
    • Scope:
        – For all classifications (except ICD10 for cause of death coding),
          including occupation (SOC) and industry (SIC)
        – Both automatic and interactive coding functionality
        – Development of the selected tool into a component so that it can
          be used within the new ONS technical architecture
    • Context: Part of the ONS Statistical Infrastructure Development
      Project (itself part of the ONS Statistical Modernisation
      Programme)

             ONS formed in 1996

    Central Statistical Office (CSO)
      + Office of Population Censuses and Surveys (OPCS)
      => Office for National Statistics (ONS)


            Statistical Modernisation
               Programme (SMP)
     Inherited Infrastructure:                ONS vision:
    • Multiple databases              Single repository (Oracle)
    • Multiple development tools      Java (J2EE)
    • Proliferation of statistical    Standard statistical tools and
      tools and methods               methods (e.g. coding tool)
    • Poor metadata                   Corporate metadata system
    • Paper-based dissemination       Web-based dissemination
    • Risky statistical systems       Robust statistical systems

    • £75 million to deliver SMP (2003-2006)

                      Statistical Value Chain
    Data Collection      Operations on     Operations on          Dissemination
                         Unit Data         Aggregate Data
    • Survey design      • Editing         • Time series
    • Survey case        • Imputation      • Tabulation
      management         • Coding          • Disclosure Control
                                           • Weighting
                                           • Estimation

                    Common ONS Statistical Tools

                       Corporate ONS Repository for Data (CORD)

                                ONS Metadata Repository

    Benefits of Statistical Modernisation

    • Robust statistical systems
    • Automated workflow:
        – More rapid publishing of statistical outputs
        – Improved efficiency
        – Improved job satisfaction
    • Data will be a corporate resource. Along with improved
      metadata, it will allow ONS to leverage greater value from data
    • Reduced licensing and IT support costs
    • Reduced staff training costs and easier transferability of staff

                      Evaluation criteria

    • Functionality
        – Automatic and interactive coding
        – Able to handle simple and complex classifications
        – Dependent coding
    •   Performance (coding/agreement rates)
    •   Technical (fit with new ONS technical environment)
    •   Supplier support
    •   Impact on ONS outputs
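
The performance criterion can be made concrete with a small sketch. It assumes the usual definitions, which the slides do not spell out: the coding rate is the share of responses the tool codes automatically, and the agreement rate is the share of those automatically coded responses whose code matches a reference (e.g. expert) code. The function name, the codes and the sample data are hypothetical.

```python
from typing import Optional, Sequence


def coding_and_agreement_rates(
    auto_codes: Sequence[Optional[str]],
    reference_codes: Sequence[str],
) -> tuple[float, float]:
    """Return (coding rate, agreement rate) for one test file.

    auto_codes[i] is None when the tool left response i uncoded.
    """
    if len(auto_codes) != len(reference_codes):
        raise ValueError("inputs must have the same length")
    # Keep only the responses the tool coded automatically.
    coded = [(a, r) for a, r in zip(auto_codes, reference_codes) if a is not None]
    coding_rate = len(coded) / len(auto_codes) if auto_codes else 0.0
    agreed = sum(1 for a, r in coded if a == r)
    agreement_rate = agreed / len(coded) if coded else 0.0
    return coding_rate, agreement_rate


# Made-up example: 5 responses, 4 coded automatically, 3 of those agree.
auto = ["4113", "4113", None, "9233", "2411"]
ref = ["4113", "4150", "5241", "9233", "2411"]
print(coding_and_agreement_rates(auto, ref))  # (0.8, 0.75)
```

Reporting the two rates together matters because a tool can raise its coding rate simply by accepting weaker matches, at the cost of agreement.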

        Evaluating and selecting the tool

    •   Started (in earnest) January 2003
    •   Establish detailed evaluation criteria
    •   Investigate tools and identify a shortlist (ACTR, PDC)
    •   Obtain software, prepare knowledge bases for testing,
        and prepare test data
    •   Testing (automatic coding performance)
    •   Analysis of results
    •   Evaluate supplier comments and tool functionality
    •   Compilation of scores
    •   Final Report (Completed December 2003) => recommendation
        to select ACTR

                ACTR - the selected tool
    •   Automated Coding by Text Recognition
    •   Developed by Statistics Canada
    •   Used by Lockheed Martin for the Census 2001 Processing System
    •   Automatic and interactive coding
    •   Consists of coding engine and maintenance tools; customer builds
        and tunes the coding index
    •   Generic: Can code a range of classifications
    •   Flexible: Allows different coding strategies, thresholds
    •   Has API and has been ported to UNIX/Windows
    •   Multiple coding databases
    •   Dependent coding using filters
    •   Powerful parsing capabilities

    • Manipulation of text using global rules
        – Normalise, or reduce variation in text
        – Tune coding application
    • Examples:
        – Replace/delete string
        – Replace/delete word, (synonym list)
        – Delete clause
    • Applied to both reference files (i.e. coding index) and input
    • Parsing data + coding index = Knowledge base
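
The rule types listed above can be sketched as a small normalisation function. This is only an illustration of the idea: the rules, word lists and function name are hypothetical, and ACTR's own parsing files use their own syntax, which is not reproduced here.

```python
import re

# Hypothetical rules, for illustration only.
STRING_REPLACEMENTS = {"&": " AND ", "-": " "}            # replace/delete string
WORD_SYNONYMS = {"MGR": "MANAGER", "ASST": "ASSISTANT"}   # replace word (synonym list)
DELETED_WORDS = {"THE", "A", "AN", "OF", "TO"}            # delete word


def parse(text: str) -> str:
    """Normalise a verbatim response or a coding index entry."""
    text = text.upper()
    # Delete clause: here, any parenthesised text.
    text = re.sub(r"\([^)]*\)", " ", text)
    for old, new in STRING_REPLACEMENTS.items():
        text = text.replace(old, new)
    words = []
    for word in re.findall(r"[A-Z0-9]+", text):
        word = WORD_SYNONYMS.get(word, word)
        if word not in DELETED_WORDS:
            words.append(word)
    return " ".join(words)


print(parse("Asst. sales mgr (part-time)"))  # ASSISTANT SALES MANAGER
print(parse("Assistant Sales-Manager"))      # ASSISTANT SALES MANAGER
```

Running the same function over both the index entries and the input text is what lets differently worded descriptions meet on a common form before matching.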

             ACTR matching algorithm

    • Matching always follows parsing.
    • Step 1: Find direct matches and assign codes
    • Step 2: Find indirect matches (using Hellerman algorithm)
       – match scores based on word frequencies across index
       – unmatched words ignored (although more unmatched words lower
         the score)
       – no fuzzy matching (except through parsing rules)
    • Step 3: Assign codes based on user defined match parameters.
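
The three steps can be sketched as follows. This is a simplified stand-in, not the Hellerman algorithm itself: it uses a generic inverse-frequency word weighting to illustrate scores "based on word frequencies across the index". The index entries, codes, and the `threshold` and `margin` parameters are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical parsed index: phrase -> code (codes are made up).
INDEX = {
    "SALES MANAGER": "A1",
    "SALES ASSISTANT": "B2",
    "OFFICE MANAGER": "A3",
    "CARE ASSISTANT": "C4",
}

# Words that are rarer across the index carry more weight.
_freq = Counter(w for phrase in INDEX for w in set(phrase.split()))
WEIGHT = {w: math.log(1 + len(INDEX) / n) for w, n in _freq.items()}


def score(response_words: set, entry_words: set) -> float:
    """Weighted overlap in [0, 1]; unmatched words on either side lower it."""
    matched = sum(WEIGHT[w] for w in response_words & entry_words)
    # Assumption: response words absent from the index get a nominal weight of 1.0.
    total = sum(WEIGHT.get(w, 1.0) for w in response_words | entry_words)
    return matched / total if total else 0.0


def code_response(parsed: str, threshold: float = 0.6, margin: float = 0.1):
    """Return (code or None, best score) for an already parsed response."""
    # Step 1: direct match on the parsed text.
    if parsed in INDEX:
        return INDEX[parsed], 1.0
    # Step 2: indirect match, scoring every index entry.
    words = set(parsed.split())
    ranked = sorted(
        ((score(words, set(p.split())), c) for p, c in INDEX.items()),
        reverse=True,
    )
    # Step 3: accept only a clear winner, as set by the match parameters.
    best, runner_up = ranked[0], ranked[1]
    if best[0] >= threshold and best[0] - runner_up[0] >= margin:
        return best[1], best[0]
    return None, best[0]


print(code_response("SALES MANAGER"))             # ('A1', 1.0)
print(code_response("SENIOR OFFICE MANAGER")[0])  # A3
print(code_response("MANAGER")[0])                # None
```

A response left uncoded in step 3 would go on to interactive coding; tightening or loosening the two parameters trades coding rate against agreement rate.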

            Building knowledge base for
                     SOC 2000
    •   Based on SOC 2000 index
    •   Obtain test/tuning data (Census 1991 recoded descriptions)
    •   Development of parsing strategy
    •   Iterative development
    •   Index partitioned into 2 “contexts”
        – Main index entries
        – Default index
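
One way to picture the two contexts is as two lookup tables consulted in turn. The slides do not say how the contexts are used, so the main-then-default order below is an assumption, and the entries, codes and function name are made up for illustration.

```python
from typing import Optional

# Hypothetical contexts; entries and codes are illustrative only.
MAIN_INDEX = {"BANK MANAGER": "M1", "FARM MANAGER": "M2"}
DEFAULT_INDEX = {"MANAGER": "D9"}  # catch-all entries for under-specified text


def code_with_contexts(parsed: str) -> Optional[tuple[str, str]]:
    """Return (code, context), trying the main index before the default index."""
    for name, index in (("main", MAIN_INDEX), ("default", DEFAULT_INDEX)):
        if parsed in index:
            return index[parsed], name
    return None


print(code_with_contexts("BANK MANAGER"))  # ('M1', 'main')
print(code_with_contexts("MANAGER"))       # ('D9', 'default')
```

Recording which context produced the code makes it possible to review default-coded cases separately.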

                   ACTR shortcomings

    • Non-linguistic: ignores word order, so it cannot distinguish
      “Clerk to the Council” from “Council Clerk”, which are not equivalent
    • No “fuzzy matching” (although particular cases of missing
      spaces and misspellings can be handled through parsing)
    • Longer text strings difficult to code automatically
    • No classification mapping facility
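
The word-order shortcoming can be illustrated with a minimal sketch. It assumes that matching works on an unordered set of words once a parsing rule has deleted the connecting words; the word list and function name are hypothetical.

```python
# Hypothetical delete-word rule applied during parsing.
DELETED_WORDS = {"TO", "THE"}


def bag_of_words(text: str) -> frozenset:
    """Reduce text to an unordered set of words, as order-blind matching would."""
    return frozenset(w for w in text.upper().split() if w not in DELETED_WORDS)


# The two job titles collapse to the same set of words.
print(bag_of_words("Clerk to the Council") == bag_of_words("Council Clerk"))  # True
```

Because both titles reduce to the same set, no match score can separate them; only a dedicated parsing rule or interactive coding can.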

                          Next steps?

    • Short term: Building knowledge bases
    • Medium term: Implementing ACTR in individual business areas:
       – ASHE (Earnings) for coding occupation in April 2005
       – IDBR (Industry)
    • Medium/Long term: “Operationalising” ACTR in the new ONS
      environment, including CORD etc.

    The End

