          ONS Classification Coding
                Tools Project

       Occupation Classification Workshop
          RSS, London, 21 June 2004



                Nigel Swier




-
               Overview of ONS Coding
                    Tools Project
    • Aim: To select and “operationalise” a standard tool for assigning
      classification codes to verbatim text responses given in answer
      to a question
    • Scope:
        – All classifications (except ICD10 for cause of death coding),
          including occupation (SOC) and industry (SIC)
        – Both automatic and interactive coding functionality
        – Development of the selected tool into a component so that it
          can be used within the new ONS technical architecture
    • Context: Part of the ONS Statistical Infrastructure Development
      Project (itself part of the ONS Statistical Modernisation
      Programme).

-
             ONS formed in 1996

    • Central Statistical Office
    • Office of Population Censuses and Surveys (OPCS)
    • Employment Department
        => Office for National Statistics



-
            Statistical Modernisation
               Programme (SMP)
    Inherited Infrastructure:               ONS vision:
    • Multiple databases                    Single repository (Oracle)
    • Multiple development tools            Java (J2EE)
    • Proliferation of statistical tools    Standard statistical tools and
      and methods                           methods (e.g. coding tool)
    • Poor metadata                         Corporate metadata system
    • Paper-based dissemination             Web-based dissemination
    • Risky statistical systems             Robust statistical systems


    • £75 million to deliver SMP (2003-2006)

-
                      Statistical Value Chain
    Data Collection         Operations on       Operations on           Dissemination
                            Unit Data           Aggregate Data
    • Survey design         • Editing           • Time series
    • Survey case           • Imputation        • Tabulation
      management            • Coding            • Disclosure Control
                                                • Weighting
                                                • Estimation

                    Common ONS Statistical Tools


                       Corporate ONS Repository for Data (CORD)

                                ONS Metadata Repository


-
    Benefits of Statistical Modernisation

    • Robust statistical systems
    • Automated workflow:
        – More rapid publishing of statistical outputs
        – Improved efficiency
        – Improved job satisfaction
    • Data will be a corporate resource. Along with improved
      metadata, this will allow ONS to leverage greater value from its
      data holdings
    • Reduced licensing and IT support costs
    • Reduced staff training costs and easier transferability of staff


-
                      Evaluation criteria

    • Functionality
        – Automatic and interactive coding
        – Able to handle simple and complex classifications
        – Dependent coding
    •   Performance (coding/agreement rates)
    •   Technical (fit with new ONS technical environment)
    •   Supplier support
    •   Impact on ONS outputs
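
The performance criterion can be made concrete with a small calculation. The sketch below assumes the usual reading of the two measures: the coding rate is the share of responses the tool codes automatically, and the agreement rate is the share of those automatic codes that match a reference (for example, expert-assigned) code. The function name and the sample codes are illustrative only and do not come from the evaluation itself.

```python
def coding_and_agreement_rates(auto_codes, reference_codes):
    """Return (coding_rate, agreement_rate) for one test file.

    auto_codes: code assigned automatically, or None if left uncoded.
    reference_codes: the code a human coder assigned to the same response.
    """
    if len(auto_codes) != len(reference_codes):
        raise ValueError("both lists must cover the same responses")

    total = len(auto_codes)
    # Keep only the responses that the tool actually coded.
    coded = [(a, r) for a, r in zip(auto_codes, reference_codes) if a is not None]

    coding_rate = len(coded) / total if total else 0.0
    agreement_rate = sum(a == r for a, r in coded) / len(coded) if coded else 0.0
    return coding_rate, agreement_rate


# Made-up codes for four responses; the second was left uncoded.
auto = ["4150", None, "2314", "4123"]
reference = ["4150", "9233", "2315", "4123"]
rates = coding_and_agreement_rates(auto, reference)
```

The two measures pull against each other: loosening the match thresholds usually raises the coding rate and lowers the agreement rate, so both need to be reported together.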




-
        Evaluating and selecting the tool

    •   Started (in earnest) January 2003
    •   Establish detailed evaluation criteria
    •   Investigate tools and identify a shortlist (ACTR, PDC)
    •   Obtain software; prepare knowledge bases and test data for
        testing
    •   Testing (automatic coding performance)
    •   Analysis of results
    •   Evaluate supplier comments and tool functionality
    •   Compilation of scores
    •   Final Report (Completed December 2003) => recommendation
        to select ACTR


-
                ACTR - the selected tool
    •   Automated Coding by Text Recognition
    •   Developed by Statistics Canada
    •   Used by Lockheed Martin for the Census 2001 Processing System
    •   Automatic and interactive coding
    •   Consists of coding engine and maintenance tools; customer builds
        and tunes the coding index
    •   Generic: Can code a range of classifications
    •   Flexible: Allows different coding strategies, thresholds
    •   Has API and has been ported to UNIX/Windows
    •   Multiple coding databases
    •   Dependent coding using filters
    •   Powerful parsing capabilities

-
                                 Parsing
    • Manipulation of text using global rules
        – Normalise, or reduce variation in text
        – Tune coding application
    • Examples:
        – Replace/delete string
        – Replace/delete word (synonym list)
        – Delete clause
    • Applied to both reference files (i.e. coding index) and input
      files.
    • Parsing data + coding index = Knowledge base
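
The rule types listed above can be sketched as a small normalisation function. This is an illustration of the idea only, not ACTR's rule syntax or engine: the rule tables, the `parse` function and the sample codes are all hypothetical.

```python
import re

# Hypothetical rule tables. Real ACTR parsing rules are configured differently.
STRING_REPLACEMENTS = {"&": " and ", "-": " "}       # replace/delete string
WORD_REPLACEMENTS = {"asst": "assistant",            # replace word (synonym list)
                     "mgr": "manager",
                     "snr": "senior"}
DELETED_WORDS = {"a", "an", "the", "of", "to"}       # delete word
CLAUSE_PATTERNS = [re.compile(r"\(.*?\)")]           # delete clause (parentheses)


def parse(text: str) -> str:
    """Normalise a description by applying the global rules in a fixed order."""
    text = text.lower()
    for pattern in CLAUSE_PATTERNS:
        text = pattern.sub(" ", text)
    for old, new in STRING_REPLACEMENTS.items():
        text = text.replace(old, new)
    text = re.sub(r"[^a-z0-9 ]", " ", text)          # drop remaining punctuation
    words = [WORD_REPLACEMENTS.get(w, w) for w in text.split()]
    return " ".join(w for w in words if w not in DELETED_WORDS)


# The same rules are applied to the coding index and to the input text.
raw_index = {"Assistant Manager": "X1", "Clerk to the Council": "X2"}  # made-up codes
parsed_index = {parse(desc): code for desc, code in raw_index.items()}
code = parsed_index.get(parse("Asst. Mgr (part-time)"))
```

Because the index and the responses pass through identical rules, a change to a rule shifts both sides together, which is what makes tuning the rules an effective way to tune the coding application.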




-
             ACTR matching algorithm

    • Matching always follows parsing.
    • Step 1: Find direct matches and assign codes
    • Step 2: Find indirect matches (using Hellerman algorithm)
       – match scores based on word frequencies across index
       – unmatched words ignored (although more unmatched words lower
         the score)
       – no fuzzy matching (except through parsing rules)
    • Step 3: Assign codes based on user defined match parameters.
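
The three steps can be sketched as follows. The scoring function is a stand-in chosen for illustration: it weights words by inverse frequency across the index and penalises unmatched words, which follows the description above, but it is not the Hellerman formula that ACTR uses. The index, the codes, the threshold values and all function names are assumptions, and the input is taken to be already parsed.

```python
from collections import Counter

# Hypothetical mini index with made-up codes.
INDEX = {
    "council clerk": "C1",
    "bank clerk": "C2",
    "bank manager": "M1",
    "primary school teacher": "T1",
}


def words(text):
    # Word order is ignored, so a description is reduced to its set of words.
    return frozenset(text.lower().split())


ENTRIES = [(words(desc), code) for desc, code in INDEX.items()]
DIRECT = dict(ENTRIES)
WORD_FREQ = Counter(w for entry_words, _ in ENTRIES for w in entry_words)


def weight(word):
    # Assumed weighting: words that are rare across the index count for more.
    return 1.0 / max(WORD_FREQ[word], 1)


def score(input_words, entry_words):
    # Shared weight relative to total weight, so unmatched words on either
    # side lower the score. Equals 1.0 only when the word sets are identical.
    shared = sum(weight(w) for w in input_words & entry_words)
    total = sum(weight(w) for w in input_words) + sum(weight(w) for w in entry_words)
    return 2 * shared / total if total else 0.0


def assign_code(text, accept=0.7, review=0.4):
    input_words = words(text)
    # Step 1: direct match.
    if input_words in DIRECT:
        return "direct", DIRECT[input_words], 1.0
    # Step 2: indirect match, keeping the best-scoring index entry.
    best_score, best_code = max((score(input_words, ew), c) for ew, c in ENTRIES)
    # Step 3: user-defined thresholds decide what happens to the best match.
    if best_score >= accept:
        return "indirect", best_code, best_score
    if best_score >= review:
        return "review", best_code, best_score   # candidate for interactive coding
    return "uncoded", None, best_score
```

With these assumed thresholds, `assign_code("senior bank manager")` is accepted as an indirect match to the "bank manager" entry, while `assign_code("clerk")` is sent for review because a single common word is weak evidence.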




-
            Building knowledge base for
                     SOC 2000
    •   Based on SOC 2000 index
    •   Obtain test/tuning data (Census 1991 recoded descriptions)
    •   Development of parsing strategy
    •   Iterative development
    •   Index partitioned into 2 “contexts”
        – Main index entries
        – Default index




-
                   ACTR shortcomings

    • Non-linguistic, ignores word order (e.g. it treats “Clerk to the
      Council” and “Council Clerk” as the same, although they are not
      equivalent)
    • No “fuzzy matching” (although particular cases of missing
      spaces and misspellings can be handled through parsing)
    • Longer text strings difficult to code automatically
    • No classifications mapping facility




-
                          Next steps?

    • Short term: Building knowledge bases
    • Medium term: Implementing ACTR in individual business areas:
       – ASHE (Earnings) for coding occupation in April 2005
       – IDBR (Industry)
    • Medium/Long term: “Operationalising” ACTR in the new ONS
      environment, including CORD etc.




-
    The End



