					Taxonomy Strategies LLC




                 FAQs About Taxonomies &
                        Metadata

                         Joseph A. Busch & Ron Daniel, Jr.




May 16, 2005   Copyright 2005 Taxonomy Strategies LLC. All rights reserved.
  Agenda
  9:00          Who are we?
  9:10          What are taxonomies & metadata?
  9:30          What kinds of taxonomies are there, and what do I need?
  9:40          How do I get a good taxonomy?
  10:05         How do I associate the taxonomy with content?
  10:30         Break
  10:45         What do taxonomies and metadata have to do with search?
  11:15         How can I sell my management on a taxonomy project?
  11:45         Any more questions?
  12:00         Adjourn


Taxonomy Strategies LLC   The business of organized information           2
  Who is Joseph Busch?

 Over 25 years in the business of organized information
      Founder, Taxonomy Strategies
      Director, Solutions Architecture, Interwoven
      VP, Infoware, Metacode Technologies
      Program Manager, Getty Foundation
      Manager, Pricewaterhouse

 Metadata and taxonomies community leadership
      President, American Society for Information Science & Technology
      Director, Dublin Core Metadata Initiative
      Adviser, National Research Council Computer Science and
       Telecommunications Board
      Reviewer, National Science Foundation Division of Information and
       Intelligent Systems
      Founder, Networked Knowledge Organization Systems/Services



  Who is Ron Daniel, Jr.?

   Over 15 years in the business of metadata & automatic
     classification
        Principal, Taxonomy Strategies
        Standards Architect, Interwoven
        Senior Information Scientist, Metacode Technologies
        Technical Staff Member, Los Alamos National Laboratory

   Metadata and taxonomies community leadership
        Chair, PRISM (Publishers Requirements for Industry Standard Metadata)
         working group
        Acting Chair, XML Linking working group
        Member, RDF working groups
        Co-editor, PRISM, XPointer, 3 IETF RFCs, and Dublin Core 1 & 2
         reports.




  Who has Taxonomy Strategies worked with?
  Government
        Commodity Futures Trading Commission
        Defense Intelligence Agency
        ERIC
        Federal Aviation Administration
        Federal Reserve Bank of Atlanta
        Forest Service
        GSA Office of Citizen Services (www.firstgov.gov)
        Head Start
        Infocomm Development Authority of Singapore
        NASA (nasataxonomy.jpl.nasa.gov)
        Small Business Administration
        Social Security Administration
        USDA Economic Research Service
        USDA e-Government Program (www.usda.gov)
  Commercial
        Allstate Insurance
        Blue Shield of California
        Debevoise & Plimpton
        Halliburton
        Hewlett Packard
        Motorola
        PeopleSoft
        PricewaterhouseCoopers
        Siderean Software
        Sprint
        Time Inc.
  Commercial subcontracts
        Agency.com – Top financial services
        Critical Mass – Fortune 50 retailers
        Deloitte Consulting – Big credit card
        Gistics/OTB – Direct selling giant
  International orgs & Non-profits
        CEN
        IDEAlliance
        IMF
        OCLC



  What we do




                             Organize Stuff
  Who are you? What do you want out of today?

   Government / NGO / SME / Global 2000?

   IT / Library & IM / Public Affairs / Product Management
     / Engineering / HR & Finance / Other?

   Webmaster / Technical / Researcher / Editorial /
     Supervisory / Executive?

   Competing session – Search & Content Management:
     Putting the Puzzle Pieces Together
        What brought you HERE instead of THERE?




  What is metadata? Different definitions

   Library & Information
     Science
        Author/Title/Subject
        Controlled Vocabularies for
         Subject Codes (e.g. Dewey)
        Authority Files for Author
         Names

   Database
        Tables/Columns/
         Datatypes/Relationships
        References for some values




  What is metadata? Another view of Dublin Core



   [Diagram: four groups of Dublin Core elements plotted on two axes, "Difficult to Generate" and "Functionality"]

   Subject metadata – What & Why:
        Subject, Description, Coverage
   Use metadata – How can it be used:
        Rights & Permissions
   Asset metadata – Who, Where & When:
        Title, Creator, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language
   Relational metadata – Links between and to:
        Relation

   Better resource description = Better navigation & discovery
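The four-way grouping on this slide can be written down as a small lookup table. The sketch below is only an illustration: the names `DC_GROUPS`, `group_of` and `split_record` are hypothetical, and the slide's "Rights & Permissions" is mapped to the Dublin Core element `Rights` as an assumption.

```python
from typing import Dict, List

# The 15 Dublin Core elements, grouped as on the slide.
# "Rights" stands in for the slide's "Rights & Permissions" (assumption).
DC_GROUPS: Dict[str, List[str]] = {
    "asset": ["Title", "Creator", "Publisher", "Contributor", "Date",
              "Type", "Format", "Identifier", "Source", "Language"],
    "subject": ["Subject", "Description", "Coverage"],
    "use": ["Rights"],
    "relational": ["Relation"],
}


def group_of(element: str) -> str:
    """Return the metadata group a Dublin Core element belongs to."""
    for group, elements in DC_GROUPS.items():
        if element in elements:
            return group
    raise KeyError(f"not a Dublin Core element: {element}")


def split_record(record: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Split a flat element -> value record into one sub-record per group."""
    out: Dict[str, Dict[str, str]] = {group: {} for group in DC_GROUPS}
    for element, value in record.items():
        out[group_of(element)][element] = value
    return out
```

Splitting a record this way makes it easy to see which fields come almost for free from a content management system (asset) and which need editorial effort (subject).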



      Are there extensions to the Dublin Core?

Elements            Refinements                                      Encodings   Types
1.    Identifier    Abstract                    Is referenced by     Box         Collection
2.    Title         Access rights               Is replaced by       DCMIType    Dataset
3.    Creator       Alternative                 Is required by       DDC         Event
4.    Contributor   Audience                    Issued               IMT         Image
5.    Publisher     Available                   Is version of        ISO3166     Interactive Resource
6.    Subject       Bibliographic citation      License              ISO639-2    Moving Image
7.    Description   Conforms to                 Mediator             LCC         Physical Object
8.    Coverage      Created                     Medium               LCSH        Service
9.    Format        Date accepted               Modified             MESH        Software
10.   Type          Date copyrighted            Provenance           Period      Sound
11.   Date          Date submitted              References           Point       Still Image
12.   Relation      Education level             Replaces             RFC1766     Text
13.   Source        Extent                      Requires             RFC3066
14.   Rights        Has format                  Rights holder        TGN
15.   Language      Has part                    Spatial              UDC
                    Has version                 Table of contents    URI
                    Is format of                Temporal             W3CDTF
                    Is part of                  Valid


 What is metadata: A scheme for recipes
     Element            Data Type   Length     Source                        Purpose
   Asset Metadata
     Unique ID          Integer     Fixed      System supplied               Basic accountability
     Recipe Title       String      Variable   Licensed Content              Text search & results display
     Recipe summary     String      Variable   Licensed Content              Content
     Main Ingredients   List        Variable   Main Ingredients vocabulary   Key index to retrieve & aggregate recipes, & generate shopping list
   Subject Metadata
     Meal Types         List        Variable   Meal Types vocab              Browse or group recipes & filter search results
     Cuisines           List        Variable   Cuisines                      Browse or group recipes & filter search results
     Courses            List        Variable   Courses vocab                 Browse or group recipes & filter search results
     Cooking Method     Flag        Fixed      Cooking vocab                 Browse or group recipes & filter search results
   Link Metadata
     Recipe Image       Pointer     Variable   Product Group                 Merchandize products
   Use Metadata
     Rating             String      Variable   Licensed Content              Filter, rank, & evaluate recipes
     Release Date       Date        Fixed      Product Group                 Publish & feature new recipes
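As a rough illustration, the recipe scheme above could be expressed as a typed record. This is a minimal sketch, not the scheme's actual implementation: the class and function names are hypothetical, the "Flag" and "Pointer" types are modeled as plain strings by assumption, and the two helpers only hint at the purposes listed in the table (shopping list, filtering by a subject facet).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Recipe:
    # Asset metadata
    unique_id: int                      # system supplied
    title: str
    summary: str
    main_ingredients: List[str] = field(default_factory=list)
    # Subject metadata (values would come from controlled vocabularies)
    meal_types: List[str] = field(default_factory=list)
    cuisines: List[str] = field(default_factory=list)
    courses: List[str] = field(default_factory=list)
    cooking_method: Optional[str] = None  # "Flag" on the slide; a string here (assumption)
    # Link metadata
    image: Optional[str] = None           # "Pointer" on the slide; e.g. a path or URL (assumption)
    # Use metadata
    rating: Optional[str] = None
    release_date: Optional[date] = None


def shopping_list(recipes: List[Recipe]) -> List[str]:
    """Aggregate the main ingredients of several recipes, without duplicates."""
    return sorted({item for recipe in recipes for item in recipe.main_ingredients})


def filter_by_cuisine(recipes: List[Recipe], cuisine: str) -> List[Recipe]:
    """Keep only the recipes tagged with the given cuisine."""
    return [recipe for recipe in recipes if cuisine in recipe.cuisines]
```

Keeping the four metadata groups visible in the record makes it clearer which fields drive search, which drive browsing, and which drive publishing.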



  What is a taxonomy? Systematics view
   Biological taxonomy places an organism in one and only one place.

        Kingdom    Animalia
        Phylum     Chordata
        Class      Mammalia
        Order      Carnivora
        Family     Canidae
        Genus      Canis
        Species    C. familiaris

        Linnaeus …

   Pragmatic
        But most of the time things belong to more than one category.

        [Diagram: Dogs shown under Pets, Mammals, and Farm Animals]




  Are there other organizational schemes?

   Type                    Remarks
   Synonym Ring            Connects a series of terms together
                           Treats them as equivalent for search purposes
   Authority File          Used to control variant names with a preferred term
                           Typically used for names of countries, individuals,
                            organizations
   Classification Scheme   An arrangement of knowledge
                           Does not follow taxonomy rules
                           Usually enumerated; e.g., LC or Dewey
   Thesaurus               Expresses semantic relationships of:
                                  Hierarchy (broader & narrower terms)
                                  Equivalence (synonyms)
                                  Associative (related terms)
   Ontology                Resembles faceted taxonomy but uses richer semantic
                            relationships among terms and attributes and strict
                            specification rules
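To make the simpler schemes in this table concrete, here is a minimal sketch of a synonym ring used for query expansion, an authority file used to normalize variant names, and one thesaurus entry. All terms and names are made-up examples; the BT/NT/UF/RT keys follow the usual thesaurus abbreviations for broader, narrower, used-for and related terms.

```python
from typing import Dict, List, Set

# Hypothetical synonym rings: every term in a ring is equivalent for search.
SYNONYM_RINGS: List[Set[str]] = [
    {"car", "automobile", "auto"},
    {"film", "movie", "motion picture"},
]

# Hypothetical authority file: variant name (lowercased) -> preferred term.
AUTHORITY_FILE: Dict[str, str] = {
    "usa": "United States",
    "u.s.": "United States",
    "united states of america": "United States",
}

# Hypothetical thesaurus entry with hierarchical (BT/NT), equivalence (UF)
# and associative (RT) relationships.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "dogs": {"BT": ["mammals"], "NT": ["working dogs"],
             "UF": ["canines"], "RT": ["pets"]},
}


def expand_query(term: str) -> Set[str]:
    """Return every term in the same synonym ring, or just the term itself."""
    key = term.strip().lower()
    for ring in SYNONYM_RINGS:
        if key in ring:
            return set(ring)
    return {key}


def preferred_name(name: str) -> str:
    """Map a variant name to its preferred form; unknown names pass through."""
    cleaned = name.strip()
    return AUTHORITY_FILE.get(cleaned.lower(), cleaned)
```

A ring has no preferred term, so expansion returns the whole set; an authority file does have one, so lookup returns a single value.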
  Another point of view ….

                        Taxonomies                                                Ontologies
   (Vocabularies)       Synonym Rings    Authority Files    Classification Schemes    Thesauri
                        Simple                                                    Complex
   (Relationships)      Equivalence                   Hierarchical               Associative

   Source: Amy Warner. Metadata and Taxonomies for a More Flexible Information Architecture
   (http://www.lexonomy.com/presentations/metadataAndTaxonomies.ppt)

Taxonomic metadata – e-Forms example
   Metadata Elements, each with its Taxonomy:

   Agency
        0001 Legislative
        1000 Judicial
        1100 Executive Office of Pres
        0003 Exec Depts
             1200 Agriculture
             1300 Commerce
             9700 Defense
             9100 Education
             8900 Energy
             7500 HHS
             7000 DHS
             8600 HUD
             1400 Interior
             1500 Justice
             1600 Labor
             1900 State
             6900 Transport
             2000 Treasury
             3600 Veterans
        Ind Agencies
        Intl Orgs

   Form Type
        Application, Approval, Claim, Information request, Information submission,
        Instructions, Legal filing, Payment, Procurement, Renewal, Reservation,
        Service request, Test, Other input, Other transaction

   Industry Impact
        00 Generic, 11 Agriculture, 21 Mining, 22 Utilities, 23 Construct,
        31-33 Manuf, 42 Wholesale, 44-45 Retail, 48-49 Trans, 51 Info, 52 Finance,
        54 Profession, 55 Mgmt, 56 Support, 61 Education, 62 Health Care, 71 Arts,
        72 Hospitality, 81 Other Services, 92 Public Admin

   Jurisdiction
        Federal, State +, Local +, Other +

   BRM Impact
        Citizen Srvcs, Social Srvs, Defense, Disasters, Econ Dev, Education, Energy,
        Env Mgmt, Law Enf, Judicial, Correctional, Health, Security, Income Sec,
        Intelligence, Intl Affairs, Nat Resour, Transport, Workforce, Science,
        Delivery, Support, Management

   Keyword Topic
        Agriculture & food, Commerce, Communications, Education, Energy, Env pro,
        Foreign rels, Govt, Health & safety, Housing & comm dev, Labor, Law,
        Named grps, National def, Nat resources, Recreation, Sci & tech,
        Social pgms, Transport

   Audience
        All, General, Citizen, Business, Govt, Employee, Native American,
        Non-resident, Tourist, Special group
  Why use faceted taxonomies?

   4 independent categories
       of 10 nodes each have
       the same discriminatory
       power as one hierarchy
       of 10,000 nodes (10^4)
          Easier to maintain
          Can be easier to
            navigate
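The arithmetic behind this claim can be checked directly. The sketch below builds four placeholder facets of ten nodes each (the node names are invented) and counts both the distinct combinations they can express and the number of nodes someone actually has to maintain.

```python
from itertools import product

FACET_COUNT = 4
NODES_PER_FACET = 10

# Four independent facets with ten placeholder nodes each.
facets = [
    [f"facet{f}-node{n}" for n in range(NODES_PER_FACET)]
    for f in range(FACET_COUNT)
]

# Every document gets one node from each facet, so the number of distinct
# classifications is the size of the Cartesian product: 10 * 10 * 10 * 10.
combinations = sum(1 for _ in product(*facets))

# The maintenance burden is only the total number of nodes: 10 + 10 + 10 + 10.
nodes_to_maintain = sum(len(facet) for facet in facets)

print(combinations, nodes_to_maintain)
```

A single hierarchy would need 10,000 leaf categories to draw the same distinctions, while the faceted version needs 40 nodes, which is the "easier to maintain" point on the slide.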




  Agenda
  9:00         Who are we?
  9:10         What are taxonomies & metadata?
  9:30         What kinds of taxonomies are there, and what do I need?
  9:40         How do I get a good taxonomy?
                Can I get a taxonomy off-the-shelf or create one with software?
                How do you know it is good?
                How do you build or modify to make it good?
  10:05 How do I associate the taxonomy with content?
  10:30 Break
  10:45 What do taxonomies and metadata have to do with search?
  11:15 How can I sell my management on a taxonomy project?
  11:45 Any more questions?
  12:00 Adjourn

  How do I get a good Taxonomy? – Seven practical
  rules
  1) Incremental, extensible process that identifies and enables
        users, and engages stakeholders.

  2) Quick implementation that provides measurable results as
        quickly as possible.

  3) Not monolithic—has separately maintainable facets.

  4) Re-uses existing IP as much as possible.

  5) A means to an end, and not the end in itself.

  6) Not perfect, but it does the job it is supposed to do—such as
        improving search and navigation.

  7) Improved over time, and maintained.


  Can I get a taxonomy off the shelf?

   Sure:
      www.taxonomywarehouse.com
      There are usually license fees, but they will be less than
       the effort to develop an equivalent taxonomy.
      The voice of experience says these will usually not be
       what you want.

   We recommend:
     Adopt a faceted approach.
     Reuse existing (esp. internal) vocabularies for as many
      of the facets as reasonable.
     Plan on doing full-custom “Content Type” and “Subject”
      taxonomies.


  Sources for 8 common taxonomies
     Taxonomy                             Definition                          Potential Sources
   Organization            Organizational structure.              FIPS 95-2, U.S. Government Manual, Your
                                                                  organizational structure, etc.
   Content Type            Structured list of the various types   DC Types, AGLS Document Type, AAT
                           of content being managed or used.      Information Forms, Your records management
                                                                  policy, etc.
   Industry                Broad market categories such as        FIPS 66, SIC, NAICS, Your market segments,
                           lines of business, life events, or     etc.
                           industry codes.
   Location                Place of operations or                 FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics
                           constituencies.                        Div, US Postal Service, Your sales regions, etc.
   Function                Functions and processes                FEA Business Reference Model, Enterprise
                           performed to accomplish mission        Ontology, AAT Functions, Your business
                           and goals.                             functions, etc.
   Topic                   Business topics relevant to your       Federal Register Thesaurus, NAL Agricultural
                           mission & goals.                       Thesaurus, LCSH, Your research areas, etc.
   Audience                Subset of constituents to whom a       GEM, ERIC Thesaurus, IEEE LOM, Your
                           piece of content is directed or        psychographics or personas, etc.
                           intended to be used.
   Products &              Names of products/programs &           ERP system, Your products and services, etc.
   Services                services.


  What about automatically created taxonomies?

   Documents can be
     ‘clustered’ based on
     similarities and differences.

   Problems:
        Typically only a single
         hierarchy
        No overall plan
        Results hard for people to
         navigate




                                                                  What does “North” mean on this map?



  What should I expect from automatic taxonomy
  construction software?
   Software can scan large quantities of
     content and extract statistically significant
     words and phrases.

   Example: Archive of 10 publications was
     analyzed for topics significant to ‘copyright’.

   Software does a poor job of
      de-duplication
      turning those significant words and phrases
       into a larger structure
      discriminating between gold and garbage

   Software is good for
      getting an understanding of the key phrases
       in a large amount of content
      providing test cases for evaluating a
       taxonomy

   Source: Sample data courtesy of Randy Marcinko and nStein.
  How can I test a Taxonomy? – Qualitative methods

   Method               Process                     Validation
   Walk-throughs        Show and explain            Approach
                                                    Consistency to rules
                                                    Appropriateness to task
   Usability Testing    Contextual analysis         Tasks are completed successfully
                        (card sorting, scenario     Time to complete task is reduced
                        testing, etc.)
   User Satisfaction    Survey                      Reaction to new interface
                                                    Reaction to search results
   Tagging samples      Tag sample content with     Content ‘fit’
                        taxonomy                    Fills out content inventory
                                                    Training materials for people & algorithms
                                                    Basis for quantitative methods

Quantitative Method – How evenly does it divide the
content?
   Background:
     Documents do not distribute uniformly across categories
     Zipf (1/x) distribution is expected behavior
     80/20 rule in action (actually 70/20 rule)

   [Chart: Measured and Expected Distribution of Top 10 Content Types in Library of Congress Database. Y-axis: Number of Records (0 to 350,000). X-axis: content types, including Congresses, Periodicals, Biography, Maps, Fiction, Exhibitions, Juvenile literature, Bibliography, Statistics.]
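The "expected" side of such a comparison can be sketched in a few lines. This is a minimal illustration that assumes a pure 1/x law normalized to the collection size; the function names are hypothetical, and no measured Library of Congress counts are reproduced here.

```python
from typing import List


def zipf_expected(total: float, categories: int) -> List[float]:
    """Expected item count per rank if counts fall off as 1/rank."""
    weights = [1.0 / rank for rank in range(1, categories + 1)]
    scale = total / sum(weights)
    return [weight * scale for weight in weights]


def top_share(counts: List[float], fraction: float = 0.2) -> float:
    """Share of all items held by the largest `fraction` of categories."""
    ranked = sorted(counts, reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Expected distribution for a hypothetical collection of 1,000,000 records
# spread over 100 categories.
expected = zipf_expected(1_000_000, 100)
print(f"top 20% of categories hold {top_share(expected):.0%} of the records")
```

Under these assumptions the top 20 of 100 categories hold a little under 70% of the records, which is at least consistent with the "70/20" remark above. Measured counts per category can be passed to `top_share` in the same way and compared against the expected list rank by rank.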
   Methodology:




                                                                                                                                        Top 10 Content Types

     Part of alpha test of ‘content type’ for
      corporate intranet
     115 URLs selected at random from                                                      Measured and Expected Distribution of Content Types in an
                                                                                                                   Intranet
      search index were manually categorized.
      Inaccessible files and ‘junk’ were                                              25

                                                                                      20
      removed


                                                                    # Documents
                                                                                      15                                                                                                                                     Measured


   Results:
                                                                                                                                                                                                                             Expected
                                                                                      10

                                                                                       5

     Results were slightly more uniform than                                          0




                                                                                                             News & Events




                                                                                                                                                                                           Unclassified
                                                                                                                                 Manuals &




                                                                                                                                              Marketing &




                                                                                                                                                            Procedures &
                                                                                                                             Communications




                                                                                                                                                                           Presentations
                                                                                            People, Groups




                                                                                                                                                                                                          Proposals, Plans
       the Zipf distribution, which is better than




                                                                                                                                                            Regulations,
                                                                                                                                  Materials
                                                                                                                                   Learning

                                                                                                                              Operations &




                                                                                                                                                                                             Other &




                                                                                                                                                                                                            & Schedules
                                                                                                                                                                             Papers &
                                                                                                                                                              Policies,
                                                                                                                                                Sales
                                                                                              & Places




                                                                                                                                                                                                             Programs,
                                                                                                                                Internal
       expected
                                                                                                                                        Content Type




Taxonomy Strategies LLC   The business of organized information                                                                                                                                                                     26
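The "expected" series in a comparison like this can be sketched as a Zipf (1/rank) curve scaled to the sample size. The following is a minimal illustration, not the presenters' calculation: the 115 sampled URLs come from the slide, while the choice of 10 categories and the measured counts are hypothetical.

```python
def zipf_expected(total_docs, num_categories):
    """Expected document count per rank if category sizes follow a 1/rank law."""
    weights = [1.0 / rank for rank in range(1, num_categories + 1)]
    norm = sum(weights)
    return [total_docs * w / norm for w in weights]


# 115 is the sample size from the slide; 10 categories is an assumption.
expected = zipf_expected(115, 10)

# Hypothetical measured counts, sorted from largest to smallest category.
measured = [22, 18, 15, 13, 12, 10, 9, 7, 5, 4]

for rank, (m, e) in enumerate(zip(measured, expected), start=1):
    print(f"rank {rank:2d}: measured {m:3d}   expected {e:5.1f}")
```

A measured curve that is flatter than the expected one means documents are spread more evenly across the content types.
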
Quantitative Method – How intuitive (repeatable) are the
categorizations?
 Methodology: Closed Card Sort
      For alpha test of a grocery site
      15 testers put each of 100 best-selling products into one of 10
       pre-defined categories
      Products that fewer than 14 of 15 testers put into the
       same category were flagged
 Results:
              % of                Cumulative %
             Testers               of Products
               15/15                      54%
               14/15                      70%
               13/15                      77%
               12/15                      83%
               11/15                      85%
              <11/15                     100%
 Examples:
      “Cocoa Drinks – Powder” is best categorized in both “Beverages”
       and “Grocery”.
      In the trade, “Corn Tortillas” are a Dairy item!

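The tally behind a closed card sort can be sketched as follows. This is a small illustration with invented votes from 5 testers rather than the 15 used in the test; the product names are borrowed from the slide, but the votes, the function names, and the threshold passed in are assumptions.

```python
from collections import Counter


def agreement(votes):
    """Return (most common category, number of testers who chose it)."""
    category, count = Counter(votes).most_common(1)[0]
    return category, count


def flag_products(sort_results, min_agree):
    """Products where fewer than min_agree testers chose the same category."""
    return sorted(
        product
        for product, votes in sort_results.items()
        if agreement(votes)[1] < min_agree
    )


def cumulative_table(sort_results, num_testers, levels):
    """Percent of products with at least `level` testers in agreement."""
    counts = [agreement(votes)[1] for votes in sort_results.values()]
    return {
        f"{level}/{num_testers}": 100.0 * sum(c >= level for c in counts) / len(counts)
        for level in levels
    }


# Hypothetical votes: one list entry per tester.
sort_results = {
    "Milk": ["Dairy"] * 5,
    "Orange Juice": ["Beverages"] * 5,
    "Cocoa Drinks - Powder": ["Beverages", "Beverages", "Grocery", "Grocery", "Beverages"],
    "Corn Tortillas": ["Bakery", "Grocery", "Bakery", "Bakery", "Bakery"],
}

flagged = flag_products(sort_results, min_agree=5)
table = cumulative_table(sort_results, num_testers=5, levels=[5, 4, 3])
print(flagged)
print(table)
```
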
Quantitative Method – How does taxonomy “shape”
match that of content?
 Background:
    Hierarchical taxonomies allow
      comparison of “fit” between content
      and taxonomy areas

 Methodology:
    25,380 resources tagged with
      taxonomy of 179 terms. (Avg. of 2
      terms per resource)
    Counts of terms and documents
      summed within taxonomy hierarchy

 Results:
    Roughly Zipf distributed (top 20
      terms: 79%; top 30 terms: 87%)
    Mismatches between term% and
      document% flagged

      Term Group                                  % Terms   % Docs
      Administrators                                  7.8     15.8
      Community Groups                                2.8      1.8
      Counselors                                      3.4      1.4
      Federal Funds Recipients and Applicants         9.5     34.4
      Librarians                                      2.8      1.1
      News Media                                      0.6      3.1
      Other                                           7.3      2.0
      Parents and Families                            2.8      6.0
      Policymakers                                    4.5     11.5
      Researchers                                     2.2      3.6
      School Support Staff                            2.2      0.2
      Student Financial Aid Providers                 1.7      0.7
      Students                                       27.4      7.0
      Teachers                                       25.1     11.4
      Source: Courtesy Keith Stubbs, U.S. Dept. of Ed.

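One way to flag term%/document% mismatches is to compare the two shares as a ratio. The sketch below uses the figures from the table; the factor-of-3 threshold and the function name are assumptions chosen for illustration, not the rule used in the original analysis.

```python
# (% of terms, % of documents) per term group, from the table on the slide.
TERM_GROUPS = {
    "Administrators": (7.8, 15.8),
    "Community Groups": (2.8, 1.8),
    "Counselors": (3.4, 1.4),
    "Federal Funds Recipients and Applicants": (9.5, 34.4),
    "Librarians": (2.8, 1.1),
    "News Media": (0.6, 3.1),
    "Other": (7.3, 2.0),
    "Parents and Families": (2.8, 6.0),
    "Policymakers": (4.5, 11.5),
    "Researchers": (2.2, 3.6),
    "School Support Staff": (2.2, 0.2),
    "Student Financial Aid Providers": (1.7, 0.7),
    "Students": (27.4, 7.0),
    "Teachers": (25.1, 11.4),
}


def flag_mismatches(groups, factor=3.0):
    """Flag groups whose document share differs from term share by `factor` or more."""
    flagged = {}
    for name, (pct_terms, pct_docs) in groups.items():
        ratio = pct_docs / pct_terms
        if ratio >= factor:
            flagged[name] = "many documents, few terms"
        elif ratio <= 1.0 / factor:
            flagged[name] = "many terms, few documents"
    return flagged


for name, reason in flag_mismatches(TERM_GROUPS).items():
    print(f"{name}: {reason}")
```

Groups with many documents and few terms are candidates for more detailed terms; groups with many terms and few documents may be over-built.
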
  How do large corporations typically extend the
  Dublin Core?

       [Bar chart]
          Doc Types                100%
          Products & Services       86%
          Roles                     57%

       Base: 20 corporate information managers

       Source: CEN/ISSS Workshop on Dublin Core. Guidance information for the deployment of
       Dublin Core metadata in Corporate Environments
       (http://www.cenorm.be/cenorm/businessdomains/businessdomains/isss/cwa/cwa15247.asp)

  Agenda
  9:00         Who are we?
  9:10         What are taxonomies & metadata?
  9:30         What kinds of taxonomies are there, and what do I need?
  9:40         How do I get a good taxonomy?
  10:05        How do I associate the taxonomy with content?
                How are we going to populate metadata elements with complete and consistent
                   values?
                  What can we expect to get from automatic classifiers?
                  What kinds of tools do people use?
                  How do different automatic classification tools compare?
                  What else should I keep in mind?
  10:30        Break
  10:45        What do taxonomies and metadata have to do with search?
  11:15        How can I sell my management on a taxonomy project?
  11:45        Any more questions?
  12:00        Adjourn

  General remarks on tagging

   Province of authors (SMEs) or editors?

   Taxonomy often highly granular to meet task and re-use needs.

   Vocabulary dependent on originating department.

   The more tags there are (and the more values for each tag), the
     more hooks to the content.

   If there are too many, authors will resist and use “general” tags
     (if available)

   Automatic classification tools exist and are valuable, but their
     results are not as good as what humans can produce.
        “Semi-automated” is best.
        Degree of human involvement is a cost/benefit tradeoff.


  What methods do large companies use to create &
  maintain metadata?
       [Bar chart]
          Forms                     71%
          Distributed Production    57%
          Centralized production    43%
          Not Automated             43%

    Base: 20 corporate information managers

       Source: CEN/ISSS Workshop on Dublin Core. Guidance information for the deployment of
       Dublin Core metadata in Corporate Environments
       (http://www.cenorm.be/cenorm/businessdomains/businessdomains/isss/cwa/cwa15247.asp)

  How do tools compare? Analyst viewpoint



       [Chart: Content Volumes (low to high) vs. Accuracy Level (low to high)]

  What accuracy should we expect from an automatic
  classifier?
   Classification performance is
     measured by “inter-cataloger
     agreement”
        Trained librarians agree less than 80%
         of the time
        Errors are subtle differences in
         judgment, or big goofs

   Automatic classification struggles to
     match human performance
        Exception: Entity recognition can
         exceed human performance

   Classifier performance limited by
     algorithms available, which is limited
     by development effort

   Very wide variance in one vendor’s
     performance depending on who does
     the implementation, and how much
     time they have to do it

   [Chart: Accuracy vs. Development Effort/Licensing Expense, with points for
    Regexps and Trained Librarians and the gap between them labeled
    “potential performance gain”]
        1) 80/20 tradeoff where 20% of effort
           gives 80% of performance.
        2) Smart implementation of inexpensive
           tools will outperform naive
           implementations of world-class tools.

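Inter-cataloger agreement can be approximated as the share of documents on which two catalogers assign the same category, averaged over all pairs. The sketch below uses simple percent agreement; the cataloger names and category assignments are hypothetical, and a real study might prefer a chance-corrected measure.

```python
from itertools import combinations


def pairwise_agreement(assignments):
    """Mean fraction of documents on which each pair of catalogers agrees.

    assignments maps a cataloger name to a list with one category per document,
    in the same document order for every cataloger.
    """
    rates = []
    for first, second in combinations(assignments, 2):
        labels_a, labels_b = assignments[first], assignments[second]
        matches = sum(a == b for a, b in zip(labels_a, labels_b))
        rates.append(matches / len(labels_a))
    return sum(rates) / len(rates)


# Hypothetical categories assigned to the same five documents.
assignments = {
    "cataloger_a": ["Policy", "News", "Manual", "Policy", "News"],
    "cataloger_b": ["Policy", "News", "Manual", "Manual", "News"],
    "cataloger_c": ["Policy", "Memo", "Manual", "Policy", "News"],
}

score = pairwise_agreement(assignments)
print(f"mean pairwise agreement: {score:.0%}")
```

The same function can score an automatic classifier by treating its output as one more cataloger and comparing it against the human assignments.
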
  How do tools compare? Pragmatic viewpoint


       [Chart: Content Volumes (low to high) vs. Accuracy Level (low to high)]

  What kind of metadata creation and maintenance
  process is needed?
   Even ‘purely’ automatic
     meta-tagging systems need
     a manual error correction
     procedure.
        Should add a QA sampling
         mechanism

   Tagging models:
        Author-generated
        Central librarians
        Hybrid – central auto-tagging
         service, distributed manual
         review and correction

   [Diagram: Sample of ‘author-generated’ metadata workflow.
    Steps: Compose in Template; Automatically fill-in metadata; Submit to CMS;
    Approve/Edit metadata; Review content (Problem? Y/N); Copy Edit content
    (Problem? Y/N); Hard Copy; Web site.
    Roles: Tagging Tool, Analyst, Editor, Copywriter, Sys Admin]

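A QA sampling mechanism can be as simple as pulling a random subset of recently tagged items for manual review and tracking how many needed correction. The sketch below is one possible shape; the 5% rate, the function names, and the item IDs are assumptions.

```python
import random


def qa_sample(item_ids, rate=0.05, minimum=1, seed=None):
    """Pick a random subset of tagged items for manual metadata review."""
    rng = random.Random(seed)
    size = min(len(item_ids), max(minimum, round(len(item_ids) * rate)))
    return sorted(rng.sample(item_ids, size))


def error_rate(review_results):
    """review_results maps item ID to True when the reviewer corrected the metadata."""
    return sum(review_results.values()) / len(review_results)


# Hypothetical batch of 200 auto-tagged items.
batch = list(range(200))
to_review = qa_sample(batch, rate=0.05, seed=1)
print(f"{len(to_review)} items selected for review")

# Hypothetical review outcome for four items.
print(error_rate({101: True, 102: False, 103: False, 104: False}))
```
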
  Tagging tool example: Interwoven MetaTagger




   [Screenshot callouts]
      Manual form fill-in w/ check boxes, pull-down lists, etc.
      Auto keyword & summarization

    Tagging tool example: Interwoven MetaTagger
   [Screenshot callouts]
      Auto-categorization
      Rules & pattern matching
      Parse & lookup (recognize names)

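The general idea behind "rules & pattern matching" and "parse & lookup" can be sketched with regular expressions and a name list. This is a generic illustration and does not show how MetaTagger itself works; the rules, category names, and name list are hypothetical.

```python
import re

# Hypothetical categorization rules: category name -> pattern that triggers it.
RULES = {
    "Policies & Procedures": re.compile(r"\b(policy|policies|procedure)s?\b", re.I),
    "News & Events": re.compile(r"\b(press release|announce[sd]?|event)\b", re.I),
}

# Hypothetical lookup list of names to recognize.
KNOWN_NAMES = ["Interwoven", "BellSouth"]


def categorize(text):
    """Return every category whose rule matches the text."""
    return [category for category, pattern in RULES.items() if pattern.search(text)]


def recognize_names(text):
    """Return the known names that appear in the text as whole words."""
    return [
        name
        for name in KNOWN_NAMES
        if re.search(r"\b" + re.escape(name) + r"\b", text)
    ]


sample = "Updated travel policy announced by BellSouth"
print(categorize(sample))
print(recognize_names(sample))
```
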
  Where do I put the metadata?

   Where can I store metadata?
     In the content – HTML Headers, File properties, etc.
     In a centralized repository – Search index, Metadata database, etc.

   Where should I store metadata? It depends.
     If you are moving files through a process, putting it in the file keeps it
      from getting dropped at system borders.
     If you are doing search across multiple documents, it has to be at
      least copied out of the files.
     If you make copies of files and modify them, consistent in-file
      metadata will be impossible.

   The real question is not where to STORE the metadata but how to
     MAINTAIN it.
        Web CMS as an example
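
Copying in-file metadata out to a central store can be sketched with Python's standard `html.parser` module. The example reads `<meta>` tags from an HTML header; the `DC.title` and `DC.subject` names follow a common Dublin Core-in-HTML convention, and the sample page is hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Elements may repeat, so keep a list of values per name.
            self.metadata.setdefault(name, []).append(content)


def extract_metadata(html_text):
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.metadata


# Hypothetical page with metadata stored in the HTML header.
PAGE = """<html><head>
<title>Travel Policy</title>
<meta name="DC.title" content="Travel Policy">
<meta name="DC.subject" content="Travel">
<meta name="DC.subject" content="Expenses">
</head><body>...</body></html>"""

print(extract_metadata(PAGE))
```

The resulting dictionary is what a search index or metadata database would store, which is also where the maintenance question arises: the copy must be refreshed whenever the file changes.
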


Taxonomy Strategies LLC   The business of organized information               39
  Agenda
  9:00        Who are we?
  9:10        What are taxonomies & metadata?
  9:30        What kinds of taxonomies are there, and what do I need?
  9:40        How do I get a good taxonomy?
  10:05       How do I associate the taxonomy with content?
  10:30       Break
  10:45 What do taxonomies and metadata have to do with search?
                 Does adding a taxonomy mean replacing my search engine?
                 How are they used behind the scenes in a search implementation?
                 How are they used in the Search UI to aid searching?
                 How can we make our current search engine better?
  11:15 How can I sell my management on a taxonomy project?
  11:45 Any more questions?
  12:00 Adjourn


  How to fix search? … Add metadata to search on!
        “Adding metadata to unstructured content allows it to be managed
         like structured content. Applications that use structured content work
         better.”

        “Enriching content with structured metadata is critical for supporting
         search and personalized content delivery.”

        “Content that has been adequately tagged with metadata can be
         leveraged in usage tracking, personalization and improved
         searching.”

        “Better structure equals better access: Taxonomy serves as a
         framework for organizing the ever-growing and changing information
         within a company. The many dimensions of taxonomy can greatly
         facilitate Web site design, content management, and search
         engineering. If well done, taxonomy will allow for structured Web
         content, leading to improved information access.”

  How does Google do so well without metadata?

   They don’t. They just use particular types of metadata:
      Number of incoming links
      PageRank for each incoming link
      Text of incoming links
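
These three kinds of link metadata can be derived from nothing more than a list of links. The sketch below collects incoming links and their anchor text, then runs a simplified textbook PageRank iteration; it is not Google's actual system, and the pages, links, and parameter values are hypothetical.

```python
from collections import defaultdict

# Hypothetical link data: (source page, target page, anchor text).
LINKS = [
    ("a.html", "c.html", "expense policy"),
    ("b.html", "c.html", "travel expenses"),
    ("c.html", "a.html", "home"),
    ("b.html", "a.html", "home page"),
]


def incoming_links(links):
    """Map each target page to its list of (source page, anchor text)."""
    incoming = defaultdict(list)
    for source, target, anchor in links:
        incoming[target].append((source, anchor))
    return incoming


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links."""
    pages = sorted({page for source, target, _ in links for page in (source, target)})
    out_links = defaultdict(set)
    for source, target, _ in links:
        out_links[source].add(target)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            # A page with no outgoing links spreads its rank over all pages.
            targets = out_links[page] or set(pages)
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


incoming = incoming_links(LINKS)
ranks = pagerank(LINKS)
for page in sorted(ranks):
    anchors = [anchor for _, anchor in incoming[page]]
    print(page, len(incoming[page]), round(ranks[page], 3), anchors)
```
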




  Dublin Core framework for corporate use

    Not just 15 elements
    A framework to enable cross-resource exploration and
         use

                                                                  Dublin Core is the framework
                                                                  for “integration metadata”
                                                                  at BellSouth




        Source: Courtesy of Todd Stephens, BellSouth


 What about Search? Integration Metadata
 Element           Dublin Core       Data Type  Length    Req./Repeat  Source                        Purpose

 Asset Metadata
 Unique ID         dc:identifier     Integer    Fixed     1            System supplied               Basic accountability
 Recipe Title      dc:title          String     Variable  1            Licensed Content              Text search & results display
 Recipe summary    dc:description    String     Variable  1            Licensed Content              Content
 Main Ingredients  X                 List       Variable  ?            Main Ingredients vocabulary   Key index to retrieve & aggregate
                                                                                                     recipes, & generate shopping list

 Subject Metadata (purpose: Browse or group recipes & filter search results)
 Meal Types        X                 List       Variable  *            Meal Types vocab
 Cuisines          X                 List       Variable  *            Cuisines
 Courses           X                 List       Variable  *            Courses vocab
 Cooking Method    X                 Flag       Fixed     *            Cooking vocab

 Link Metadata
 Recipe Image      dcterms:hasPart   Pointer    Variable  ?            Product Group                 Merchandize products

 Use Metadata
 Rating                              String     Variable  1            Licensed Content              Filter, rank, & evaluate recipes
 Release Date      dc:date           Date       Fixed     1            Product Group                 Publish & feature new recipes

 Legend: ? – 1 or more; * – 0 or more
 dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”

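The Req./Repeat column can be turned into a simple completeness check on each record. The sketch below follows the slide's legend as printed (? for 1 or more, * for 0 or more, 1 for exactly one); the sample record, its values, and the function name are hypothetical.

```python
# Cardinality codes per element, taken from the Req./Repeat column.
SCHEMA = {
    "Unique ID": "1",
    "Recipe Title": "1",
    "Recipe summary": "1",
    "Main Ingredients": "?",
    "Meal Types": "*",
    "Cuisines": "*",
    "Courses": "*",
    "Cooking Method": "*",
    "Recipe Image": "?",
    "Rating": "1",
    "Release Date": "1",
}

# (minimum, maximum) number of values per code; None means no upper limit.
LIMITS = {"1": (1, 1), "?": (1, None), "*": (0, None)}


def validate(record):
    """Return a list of problems; an empty list means the record is complete."""
    problems = []
    for element, code in SCHEMA.items():
        values = record.get(element, [])
        low, high = LIMITS[code]
        if len(values) < low:
            problems.append(f"{element}: needs at least {low} value(s)")
        elif high is not None and len(values) > high:
            problems.append(f"{element}: allows at most {high} value(s)")
    return problems


# Hypothetical record; every element holds a list of values.
record = {
    "Unique ID": ["10482"],
    "Recipe Title": ["Corn Tortilla Soup"],
    "Recipe summary": ["A quick weeknight soup."],
    "Main Ingredients": ["corn tortillas", "chicken stock"],
    "Cuisines": ["Mexican"],
    "Recipe Image": ["images/10482.jpg"],
    "Rating": ["4"],
    "Release Date": ["2005-05-16"],
}

print(validate(record))
```
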
  Agenda
  9:00         Who are we?
  9:10         What are taxonomies & metadata?
  9:30         What kinds of taxonomies are there, and what do I need?
  9:40         How do I get a good taxonomy?
  10:05 How do I associate the taxonomy with content?
  10:30 Break
  10:45 What do taxonomies and metadata have to do with search?
  11:15 How can I sell my management on a taxonomy project?
  11:45 Any more questions?
  12:00 Adjourn

  How do I sell Management on a Taxonomy Project?

   Don’t sell “metadata” or “taxonomy”, sell the vision of
     what you want to be able to do.

   Clearly understand what the problem is and what the
     opportunities are.

   Do the calculus (costs and benefits)

   Design the taxonomy (in terms of level of effort, LOE) in
     relation to the value at hand.




  Fundamentals of metadata ROI

   Tagging content using metadata and a taxonomy is a
     cost, not a benefit.

   There is no benefit without exposing the tagged
     content to users in some way that cuts costs or
     improves revenues.

   Putting metadata and a taxonomy into operation
     requires UI changes and/or backend system changes,
     as well as data changes.

   You need to determine those changes, and their costs,
     as part of the ROI.

  What are the typical metadata ROI scenarios?

   Catalog site
     Increased sales.
     Increased productivity.

   Customer support
     Cutting costs.
     Increased sales.

   Compliance
     Avoiding penalties.

   Knowledge worker productivity
      Less time searching, more time working.

  Metadata ROI: Catalog site




                                                                  Guided Navigation
                                                                   2-3 clicks to product
                                                                   No dead ends
                                                                  http://www.tesco.com/winestore

  Metadata ROI: Catalog site

    Increased sales
          Product findability.
          Product cross-sells and up-sells.
          Customer loyalty.

    Enterprise portal cost: $6M

    1-5% increase in sales
          $57.6B sales (’04): $576M to $2.9B/year
          $2.1B net income (’04): $21M to $105M/year

    1-5% increase in productivity: $155M to $776M/year
          $50K average cost per employee
          310,400 employees (’04)

    Source: Proforma based on Hoover’s data.

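The catalog-site ranges are plain percentage arithmetic on the inputs listed on the slide. A minimal sketch, with a hypothetical helper name, that recomputes the sales and productivity ranges:

```python
def gain_range(base, low_pct=1, high_pct=5):
    """Return the (low, high) annual gain for a percentage improvement on base."""
    return base * low_pct / 100, base * high_pct / 100


# Inputs from the slide: $57.6B sales, 310,400 employees at $50K average cost.
sales_low, sales_high = gain_range(57.6e9)
productivity_low, productivity_high = gain_range(50_000 * 310_400)

print(f"sales:        ${sales_low / 1e6:,.0f}M to ${sales_high / 1e6:,.0f}M per year")
print(f"productivity: ${productivity_low / 1e6:,.0f}M to ${productivity_high / 1e6:,.0f}M per year")
```

Against either range, the $6M portal cost is small, which is the point of the comparison.
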
  Metadata ROI: Customer support model
   [Screenshot callouts]
      Help on search page, not a click away.
      Type and go to search for specific policies
      Policy categories for browsing
      Refine search offered with results
      Good search results for policy topics, e.g., “pets”

  Metadata ROI: Customer support model

    Self service                                                 Manual processing
            Fewer customer calls.                                     100,000 documents
            Faster, more accurate CSR                                 2 pages per document
               responses through better                                $4 per page
               information access.
                                                                       $800K

    25-50% service efficiency
         increase
            300K customer service calls
             per month
            $6 cost per call                                         $5.4M to $10.8M/yr

     1-5% increased sales
            $18.6B sales (’04)                                    $186M to $930M/year
            ($761M) net income (’04)                              ($575M) to $169M/year

                                                                      Source: Proforma based on Hoover’s data.
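
The customer support figures on this slide follow from simple arithmetic, sketched below. All variable names are illustrative assumptions; the inputs come from the slide. The net income range is an inference from the printed numbers: it matches reported net income plus the added sales, which treats every added sales dollar as profit, a simplification.

```python
# Cost side: manual metadata processing.
documents = 100_000
pages_per_document = 2
cost_per_page = 4                          # dollars
manual_processing_cost = documents * pages_per_document * cost_per_page

# Benefit side: 25-50% service efficiency increase.
calls_per_month = 300_000
cost_per_call = 6                          # dollars
annual_call_cost = calls_per_month * 12 * cost_per_call
service_savings = (annual_call_cost * 25 / 100, annual_call_cost * 50 / 100)

# Benefit side: 1-5% increased sales (2004 figures).
sales = 18_600_000_000                     # $18.6B sales
net_income = -761_000_000                  # ($761M) net loss
added_sales = (sales * 1 / 100, sales * 5 / 100)

# Inferred from the slide: net income range = net income + added sales.
net_income_range = (net_income + added_sales[0], net_income + added_sales[1])

print(f"Manual processing: ${manual_processing_cost / 1e3:,.0f}K")
print(f"Service savings: ${service_savings[0] / 1e6:.1f}M to ${service_savings[1] / 1e6:.1f}M per year")
print(f"Added sales: ${added_sales[0] / 1e6:,.0f}M to ${added_sales[1] / 1e6:,.0f}M per year")
print(f"Net income: {net_income_range[0] / 1e6:,.0f}M to {net_income_range[1] / 1e6:,.0f}M dollars per year")
```

Under these inputs, even the low end of the service savings is several times the one-time processing cost.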
  Metadata ROI: Compliance

    Avoiding penalties for
       breaching regulations
          SOX: up to 5 years in jail
          SOX: up to $5M
    Following required
       procedures

    Loss of company
          $100B revenue (’00)                                            $100B


    Loss of partner companies
       Arthur Andersen


                                                                      Source: Proforma based on Hoover’s data.
  Knowledge workers spend up to 2.5 hours
  each day looking for information …



           [Pie chart of knowledge-worker time: Communicating, Searching, Creating]



           … But find what they are looking for only 40% of
           the time.
                                                                                 — Kit Sims Taylor

  High cost of not finding information

        “The amount of time wasted in futile searching for vital
            information is enormous, leading to staggering costs …”

                                                                  — Sue Feldman

  High cost of poor classification

           Poor classification costs a 10,000-user organization
            $10M each year—about $1,000 per employee.

                                                                  — Jakob Nielsen, useit.com

                  But “better search” itself is a weak ROI

  Knowledge workers spend more time re-creating
  existing content than creating new content




           [Pie chart of knowledge-worker time: Communicating; Searching; Creating new content 9%; Recreating existing content 26%]


                                                                             — Kit Sims Taylor

  Metadata ROI: Productivity

    Decreased cost to market                                      Enterprise document
          Decreased development cost                              management system cost
          Increased R&D productivity                                $10M
          Reduced time for sales &
           marketing
    1-5% decrease in drug
       development cost
          $800M/drug
                                                                   $8M to $40M/drug
    5-10% increase in R&D
       productivity
          13% of revenue
          $39B in sales (’04)                                     $254M to $507M/year
    10-20% decrease in time
       for sales & marketing
          13% of revenue                                          $507M to $1.01B/year
                                                                     Source: Proforma based on Hoover’s data.
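
A minimal sketch of the productivity arithmetic, using only the inputs stated on this slide. The helper pct_range and all variable names are illustrative assumptions. Where a computed range differs from a printed figure, the slide may reflect rounding or assumptions that are not stated.

```python
def pct_range(base, low_pct, high_pct):
    """Return (low, high) dollar amounts for a percentage range applied to base."""
    return base * low_pct / 100, base * high_pct / 100

# Inputs quoted on the slide.
drug_development_cost = 800_000_000   # $800M per drug
sales = 39_000_000_000                # $39B in sales (2004)
rd_share_pct = 13                     # R&D stated as 13% of revenue
sm_share_pct = 13                     # sales & marketing stated as 13% of revenue
edms_cost = 10_000_000                # enterprise document management system, $10M

rd_spend = sales * rd_share_pct / 100
sm_spend = sales * sm_share_pct / 100

dev_savings = pct_range(drug_development_cost, 1, 5)   # per drug
rd_gain = pct_range(rd_spend, 5, 10)                   # per year
sm_savings = pct_range(sm_spend, 10, 20)               # per year

print(f"Development savings: ${dev_savings[0] / 1e6:,.1f}M to ${dev_savings[1] / 1e6:,.1f}M per drug")
print(f"R&D productivity: ${rd_gain[0] / 1e6:,.1f}M to ${rd_gain[1] / 1e6:,.1f}M per year")
print(f"Sales & marketing: ${sm_savings[0] / 1e6:,.1f}M to ${sm_savings[1] / 1e6:,.1f}M per year")
print(f"Low-end R&D gain is {rd_gain[0] / edms_cost:.1f}x the system cost")
```

The last line compares the most conservative annual R&D gain with the one-time system cost.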
  Metadata ROI: Executive Mandate

    There is no ROI out of the box
    Just someone with a vision
                                                          …and the budget to make it happen.

    What’s really needed?
          Demos and proofs of value, so that a stronger cost-benefit
           argument can be made for continuing the work.




  Productivity, loyalty, and revenue have provided the
  ROI




  Intranet has provided the best ROI




    [Bar chart of ROI by initiative, in the order shown: Intranet; Web/online customer sales; Web dev infrastructure; Web/online business sales; Middleware to link Web to ERP; Extranet/supply chain; e-billing/payment systems; Wireless Web access; e-marketplace/portal; None]




  Agenda
  9:00          Who are we?
  9:10          What are taxonomies & metadata?
  9:30          What kinds of taxonomies are there, and what do I need?
  9:40          How do I get a good taxonomy?
  10:05         How do I associate the taxonomy with content?
  10:30         Break
  10:45         What do taxonomies and metadata have to do with search?
  11:15         How can I sell my management on a taxonomy project?
  11:45         Any more questions?
  12:00         Adjourn

Taxonomy Strategies LLC




                                         Contact Info

                                                  Ron Daniel
                                              925-368-8371
                          rdaniel@taxonomystrategies.com


                                              Joseph Busch
                                              415-377-7912
                          jbusch@taxonomystrategies.com



May 16, 2005   Copyright 2005 Taxonomy Strategies LLC. All rights reserved.

				