Taxonomy Development Workshop

					Text Analytics Workshop
        Development
                Tom Reamy
        Chief Knowledge Architect
                KAPS Group
Knowledge Architecture Professional Services
        http://www.kapsgroup.com
Agenda
 Development - Foundation
 Case Study 1 – Internet News
 Case Study 2 – Tale of two taxonomies
 Case Study 3 – Software Evaluation and Beyond
   –   BBN Motivations
   –   Amgen 2 – clustering, auto-taxonomy
   –   GAO Taxonomy – from terms to rules
 Exercises



Text Analytics Platform
4 Basic Contexts
 Ideas – Content Structure
   –   Language and Mind of your organization
   –   Applications - exchange meaning, not data
 People – Company Structure
   –   Communities, Users
   –   Central team - establish standards, facilitate
 Activities – Business processes and procedures
 Technology
   –   CMS, Search, portals, taxonomy tools
   –   Applications – BI, CI, Text Mining


Text Analytics Development: Foundation

 Articulated Information Management Strategy (K Map)
   – Content and Structures and Metadata
   – Search, ECM, applications - and how used in Enterprise
   – Community information needs and Text Analytics Team
 POC establishes the preliminary foundation
   – Need to expand and deepen
   – Content – full range, basis for rules-training
   – Additional SMEs – content selection, refinement
 Taxonomy – starting point for categorization (is it suitable?)
 Databases – starting point for entity catalogs

Knowledge Architecture Audit:
Knowledge Map
Stage                    Activities                             Outcome
Project Foundation       Meetings, work groups, overview        General outline
Contextual Interviews    High level: process, community         Broad context
Information Interviews   Info behaviors of business processes   Deep details
App/Content Catalog      Technology and content                 Deep details
User Survey              All 4 dimensions                       Complete picture
Strategy Document        Meetings, work groups                  New foundation




Taxonomy Development Process:
Progressive Refinement
Stage                    Activities                         Outcome
Taxonomy Model           Buy/find, work groups, overview    General outline
Information Interviews   Info behaviors, card sorts         Preliminary taxonomy
Content Analysis         Bottom-up prototypes               Taxonomy 1.0
Refine                   Interviews, evaluate               Taxonomy 1.0-1.9
Map Community            Refine, interviews                 Taxonomy 2.0
Governance Plan          Develop, refine                    Taxonomy




Text Analytics Development: Categorization Process

 Starter Taxonomy
   –   If no taxonomy, develop initial high level (see Chart)
 Analysis of taxonomy – suitable for categorization
   –   Structure – not too flat, not too large
   –   Orthogonal categories
 Content Selection
   –   Map of all anticipated content
   –   Selection of training sets – if possible
   –   Automated selection of training sets – taxonomy nodes as first
       categorization rules – apply and get content
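
A minimal sketch of the automated selection step, assuming each taxonomy node label is used as a literal phrase rule. The function name select_training_sets, the node labels, and the sample documents are hypothetical illustrations, not part of any specific tool:

```python
import re
from collections import defaultdict

def select_training_sets(node_labels, documents, min_hits=1):
    """Assign each document to every taxonomy node whose label appears in it.

    node_labels: list of node label strings (hypothetical starter taxonomy).
    documents:   dict mapping a document id to its text.
    Returns a dict mapping node label -> list of matching document ids.
    """
    training_sets = defaultdict(list)
    for label in node_labels:
        # Whole-phrase, case-insensitive match on the node label.
        pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
        for doc_id, text in documents.items():
            if len(pattern.findall(text)) >= min_hits:
                training_sets[label].append(doc_id)
    return dict(training_sets)

# Hypothetical sample data for illustration only.
nodes = ["Marine Science", "Petrology"]
docs = {
    "d1": "A survey of marine science programs and marine toxins.",
    "d2": "Petrology notes on igneous rock samples.",
    "d3": "Quarterly budget report.",
}
candidate_sets = select_training_sets(nodes, docs)
```

The candidate sets produced this way are only a starting point: documents that never mention the node label are missed, so the sets still need SME review before they are used for rule building.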



Text Analytics Development: Categorization Process

 First Round of Categorization Rules
 Term building – from content – basic set of terms that
    appear often / important to content
   Add terms to rule, apply to broader set of content
   Repeat for more terms – get recall-precision “scores”
   Repeat, refine, repeat, refine, repeat
   Get SME feedback – formal process – scoring
   Get SME feedback – human judgments
   Test against more, new content
   Repeat until “done” – 90%?
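
The recall-precision "scores" in the cycle above can be sketched as follows, assuming a rule is simply a list of terms that fires when any term occurs in a document. The function names and sample stories are hypothetical; a commercial categorization tool would use its own rule syntax and scoring:

```python
def rule_matches(terms, text):
    """A rule fires when any of its terms occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)

def score_rule(terms, labeled_docs):
    """Return (recall, precision) of a term rule against labeled documents.

    labeled_docs: list of (text, is_relevant) pairs, e.g. judged by SMEs.
    """
    true_pos = false_pos = false_neg = 0
    for text, is_relevant in labeled_docs:
        fired = rule_matches(terms, text)
        if fired and is_relevant:
            true_pos += 1
        elif fired and not is_relevant:
            false_pos += 1
        elif not fired and is_relevant:
            false_neg += 1
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    return recall, precision

# Hypothetical test collection for a "Healthcare" category.
docs = [
    ("Hospital reports new vaccine trial", True),
    ("Clinic expands patient care", True),
    ("Vaccine stocks rally on Wall Street", False),
    ("City council approves budget", False),
]
round_1 = score_rule(["vaccine"], docs)             # first round of terms
round_2 = score_rule(["vaccine", "patient"], docs)  # after adding a term
```

In this toy collection, adding the term "patient" raises recall from 0.5 to 1.0 and precision from 0.5 to about 0.67. Each further round of the cycle repeats this comparison on a broader set of content.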


Text Analytics Development: Entity Extraction Process

 Facet Design – from KA Audit, K Map
 Find and Convert catalogs:
   –   Organization – internal resources
   –   People – corporate yellow pages, HR
   –   Include variants
   –   Scripts to convert catalogs – programming resource
 Build initial rules – follow categorization process
   –   Differences – scale, “score”
   –   Recall – find all entities
   –   Precision – correct assignment to entity class
   –   Issue – disambiguation – Ford company, person, car
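
A rough sketch of catalog-based extraction with variants, assuming the converted catalogs end up as a mapping from canonical name to entity class and known variants. The CATALOG entries and function names are hypothetical examples, not the format of any particular product:

```python
import re

# Hypothetical catalog: canonical name -> (entity class, known variants).
CATALOG = {
    "Environmental Protection Agency": ("Organization", ["EPA", "U.S. EPA"]),
    "Ford Motor Company": ("Organization", ["Ford Motor", "Ford Motor Co."]),
    "Gerald Ford": ("Person", ["President Ford"]),
}

def build_lookup(catalog):
    """Flatten the catalog into (surface form, canonical, class), longest first."""
    entries = []
    for canonical, (entity_class, variants) in catalog.items():
        for form in [canonical] + variants:
            entries.append((form, canonical, entity_class))
    # Longest forms first so "Ford Motor Co." wins over "Ford Motor".
    return sorted(entries, key=lambda e: len(e[0]), reverse=True)

def extract_entities(text, lookup):
    """Return sorted (canonical, class) pairs found in the text."""
    found = set()
    remaining = text
    for form, canonical, entity_class in lookup:
        pattern = re.compile(r"(?<!\w)" + re.escape(form) + r"(?!\w)")
        if pattern.search(remaining):
            found.add((canonical, entity_class))
            # Blank out the match so shorter variants do not re-match it.
            remaining = pattern.sub(" ", remaining)
    return sorted(found)

lookup = build_lookup(CATALOG)
entities = extract_entities(
    "The EPA cited Ford Motor Co. after President Ford left office.", lookup
)
```

Note that a bare "Ford" is deliberately absent from the catalog: on its own it could be the company, the person, or the car, so resolving it needs context rules rather than a simple lookup.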

Case Study - Background
 Inxight Smart Discovery
 Multiple Taxonomies
   –   Healthcare – first target
   –   Travel, Media, Education, Business, Consumer Goods
 Content – 800+ Internet news sources
   –   5,000 stories a day
 Application – Newsletters
   –   Editors using categorized results
   –   Easier than full automation



Case Study - Approach
 Initial High Level Taxonomy
   – Auto generation – very strange – not usable
   – Editors High Level – sections of newsletters
   – Editors & Taxonomy Pros - Broad categories & refine
 Develop Categorization Rules
   – Multiple Test collections
   – Good stories, bad stories – close misses - terms
 Recall and Precision Cycles
   – Refine and test – taxonomists – many rounds
   – Review – editors – 2-3 rounds
 Repeat – about 4 weeks

Case Study - Issues
 Taxonomy Structure
   –   Aggregate nodes vs. independent nodes
   –   Children Nodes – subset – rare
 Depth of taxonomy and complexity of rules
   –   Trade-off between the need to update and the usefulness of categories
 Multiple avenues - Facets – source – New York Times –
  can put into rules or make it a facet to filter results
 When to use filter or terms – experimental
 Recall more important than precision – editors' role
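
The choice between putting the source into a rule and treating it as a facet can be illustrated with a small sketch. It assumes each story carries a "source" metadata field; the story records and function names are hypothetical:

```python
def categorize(story, rule_terms):
    """Term rule: the story belongs to the category if any term appears."""
    text = story["text"].lower()
    return any(term in text for term in rule_terms)

def filter_by_facet(stories, facet, value):
    """Facet filter: keep stories whose metadata facet equals the value."""
    return [s for s in stories if s.get(facet) == value]

# Hypothetical stories for illustration only.
stories = [
    {"id": 1, "source": "New York Times", "text": "Hospital merger announced"},
    {"id": 2, "source": "Other Source", "text": "Hospital opens new wing"},
    {"id": 3, "source": "New York Times", "text": "Stocks close higher"},
]

# Step 1: the categorization rule stays purely about content.
healthcare = [s for s in stories if categorize(s, ["hospital"])]
# Step 2: the source is applied afterwards as a facet filter on the results.
nyt_healthcare = filter_by_facet(healthcare, "source", "New York Times")
```

Keeping the source out of the rule leaves the rule reusable across sources, while the facet lets editors narrow the results when they want to. Which works better in practice is, as noted above, a matter of experiment.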


Case Study – Lessons Learned
 Combination of SME and Taxonomy pros
 Combination of Features – Entity extraction, terms,
  Boolean, filters, facts
 Training sets and “find similar” are weakest
   –   Somewhat useful during development for terms
 No best answer – taxonomy structure, format of rules
   –   Need custom development
 Plan for ongoing refinement
 This stuff actually works!


Enterprise Environment – Case Studies
 A Tale of Two Taxonomies
  –   It was the best of times, it was the worst of times
 Basic Approach
  –   Initial meetings – project planning
  –   High level K map – content, people, technology
  –   Contextual and Information Interviews
  –   Content Analysis
  –   Draft Taxonomy – validation interviews, refine
  –   Integration and Governance Plans

Enterprise Environment – Case One – Taxonomy, 7 facets

 Taxonomy of Subjects / Disciplines:
   –   Science > Marine Science > Marine microbiology > Marine toxins
 Facets:
   –   Organization > Division > Group
   –   Clients > Federal > EPA
   –   Instruments > Environmental Testing > Ocean Analysis > Vehicle
   –   Facilities > Division > Location > Building X
   –   Methods > Social > Population Study
   –   Materials > Compounds > Chemicals
   –   Content Type – Knowledge Asset > Proposals
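
One possible way to hold a subject taxonomy plus facets in code is a nested dictionary built from "A > B > C" paths. This is only a sketch; the helper names are hypothetical and the paths are taken from the examples above:

```python
def parse_path(path):
    """Split 'A > B > C' into ['A', 'B', 'C']."""
    return [part.strip() for part in path.split(">")]

def add_path(tree, path):
    """Insert one path into a nested-dict tree, creating nodes as needed."""
    node = tree
    for label in parse_path(path):
        node = node.setdefault(label, {})
    return tree

def ancestors(path):
    """Return every prefix of the path, broadest first."""
    parts = parse_path(path)
    return [" > ".join(parts[: i + 1]) for i in range(len(parts))]

facets = {}
for p in [
    "Organization > Division > Group",
    "Clients > Federal > EPA",
    "Materials > Compounds > Chemicals",
]:
    add_path(facets, p)

# A document is tagged independently in the subject taxonomy and in each facet.
document_tags = {
    "subject": "Science > Marine Science > Marine microbiology > Marine toxins",
    "clients": "Clients > Federal > EPA",
}
subject_levels = ancestors(document_tags["subject"])
```

Expanding a tag into its ancestor levels lets a search for "Science > Marine Science" also find documents tagged at the deeper "Marine toxins" node.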



Enterprise Environment – Case One – Taxonomy, 7 facets

 Project Owner – KM department – included RM, business
  process
 Involvement of library - critical
 Realistic budget, flexible project plan
 Successful interviews – build on context
   –   Overall information strategy – where taxonomy fits
 Good Draft taxonomy and extended refinement
   –   Software, process, team – train library staff
   –   Good selection and number of facets
 Final plans and hand off to client


Enterprise Environment – Case Two – Taxonomy, 4 facets

 Taxonomy of Subjects / Disciplines:
   –   Geology > Petrology
 Facets:
   – Organization > Division > Group
   – Process > Drill a Well > File Test Plan
   – Assets > Platforms > Platform A
   – Content Type > Communication > Presentations




Enterprise Environment – Case Two – Taxonomy, 4 facets

 Environment Issues
   – Value of taxonomy understood, but not the complexity
     and scope
   – Under-budgeted, understaffed
   – Location – not KM – tied to RM and software
       • Solution looking for the right problem
   – Importance of an internal library staff
   – Difficulty of merging internal expertise and taxonomy




Enterprise Environment – Case Two – Taxonomy, 4 facets

 Project Issues
   –   Project mindset – not infrastructure
   –   Wrong kind of project management
        • Special needs of a taxonomy project
        • Importance of integration – with team, company
   –   Project plan more important than results
        • Rushing to meet deadlines doesn’t work with semantics as
          well as software




Enterprise Environment – Case Two – Taxonomy, 4 facets

 Research Issues
   –   Not enough research – and wrong people
   –   Interference of non-taxonomy – communication
   –   Misunderstanding of research – wanted tinker toy connections
        • Interview 1 implies conclusion A
 Design Issues
   –   Not enough facets
   –   Wrong set of facets – business not information
   –   Ill-defined facets – too complex internal structure



Taxonomy Development
Conclusion: Risk Factors
 Political-Cultural-Semantic Environment
   –   Not simple resistance - more subtle
        • Re-interpretation of specific conclusions and sequence of
          conclusions / relative importance of specific recommendations
 Understanding project scope
 Access to content and people
   –   Enthusiastic access
 Importance of a unified project team
   –   Working communication as well as weekly meetings



Text Analytics Development
Case Study 3 – POC – Government Agency
 Demo of SAS – Teragram / Enterprise Content
  Categorization




Conclusion
 Enterprise Context – strategic, self knowledge
 Importance of a good foundation
   –   Importance of Taxonomy Structure – mapped to use
   –   POC a head start on development
 Importance of Text Analytics Vision / Strategy
   –   Infrastructure resource, not a project
 Balance of expertise and local knowledge
 Importance of Usability for refinement cycles
 Difference of taxonomy and categorization
   –   Concepts vs. text in documents

             Questions?
                Tom Reamy
          tomr@kapsgroup.com
                KAPS Group
Knowledge Architecture Professional Services
        http://www.kapsgroup.com

				