The Future is: Text Analytics

Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Agenda
 Introduction – Elements of Text Analytics
   – Related semantic elements
   – Categorization, Extraction, Summarization – ECC
   – Sentiment Analysis, Expertise Analysis
 Taxonomy / Text Analytics Software
   – Variety of Vendors / Features
   – Selecting Software – Two Phase, Proof of Concept
 Development – Taxonomy, Categorization, Faceted Metadata
 Text Analytics and Library Services
 Conclusions and Resources


KAPS Group: General
   Knowledge Architecture Professional Services
   Virtual Company: Network of consultants – 8-10
   Partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc.
   Consulting, Strategy, Knowledge architecture audit
   Services:
     – Taxonomy/Text Analytics development, consulting, customization
     – Technology Consulting – Search, CMS, Portals, etc.
     –   Evaluation of Enterprise Search, Text Analytics
     – Metadata standards and implementation
     – Knowledge Management: Collaboration, Expertise, e-learning
     – Applied Theory – Faceted taxonomies, complexity theory, natural
       categories


Introduction to Text Analytics
Taxonomic Resources
 Thesauri, Controlled Vocabulary, Product Catalogs
   – Resources to build on
   – Indexing not categorization

 Taxonomy
   –   Foundation for Categorization
   –   Browse – classification scheme
   –   Formal – Is-Child-Of, Is-Part-Of
   –   Large taxonomies - MeSH – indexing all topics
   –   Small is better – for categorization and faceted navigation



Introduction to Text Analytics
Metadata
 Metadata standards – Dublin Core - Mostly syntactic not semantic
    – Description – static or dynamic / human or computer (summarization)
    – Semantic – keywords – very poor performance
    – Derived metadata – from link analysis, URLs – only as good as the
      existing content and structure
 Best Bets – high level categorization-search
    –   Human judgments – very labor intensive
 Facets – classes of metadata
    –   Standard - People, Organization, Document type-purpose
    –   Requires huge amounts of metadata
    –   Product catalogs – if you have them

Introduction to Text Analytics
Text Analytics Features
 Noun Phrase Extraction
   – Catalogs with variants, rule based dynamic
   – Multiple types, custom classes – entities, concepts, events
   – Feeds facets
 Summarization
   –   Customizable rules, map to different content
 Fact Extraction
   – Relationships of entities – people-organizations-activities
   – Ontologies – triples, RDF, etc.
 Sentiment Analysis
   –   Rules – Products and their features and phrases
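
To make the fact extraction bullet more concrete, here is a minimal Python sketch that pulls people-organization relationships out of text as subject-predicate-object triples. The pattern, the `worksFor` predicate, and the names in the sample sentence are all hypothetical; production extractors rely on far richer linguistic rules than a single regular expression.

```python
import re

# Hypothetical pattern: "<First Last> works for <Capitalized Org Name>"
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) works for "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)

def extract_triples(text):
    """Return (subject, predicate, object) triples for each pattern match."""
    return [
        (m.group("person"), "worksFor", m.group("org"))
        for m in PATTERN.finditer(text)
    ]

triples = extract_triples(
    "Jane Doe works for Acme Widgets. John Roe works for Globex."
)
for triple in triples:
    print(triple)
```

Triples in this shape could be serialized to RDF or loaded into an ontology tool, which is the kind of use the "Ontologies" bullet points to.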

Introduction to Text Analytics
Text Analytics Features
 Auto-categorization
     –  Training sets – Bayesian, Vector space
     – Terms – literal strings, stemming, dictionary of related terms
     – Rules – simple – position in text (Title, body, url)
     – Semantic Network – Predefined relationships, sets of rules
     – Boolean – Full search syntax – AND, OR, NOT
     – Advanced – DIST (#), SENTENCE, NOTIN, MINOC
  This is the most difficult to develop, and fundamental
  Build on a Taxonomy
  Combine with Extraction
     – If any of a list of entities and other words
     – Build dynamic rules with categorization capabilities – disambiguation
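
As a rough illustration of how the rule operators above combine, here is a minimal Python sketch of one categorization rule. The function names (`tokenize`, `within_distance`, `matches_rule`), the "Drug Safety" category, and the rule itself are hypothetical. The sketch does not reproduce any vendor's rule syntax, only the general idea of Boolean terms plus a DIST-style word-distance test.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within_distance(tokens, term_a, term_b, max_gap):
    """DIST-style test: do the two terms occur within max_gap tokens of each other?"""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

def matches_rule(text):
    """Hypothetical rule for a 'Drug Safety' category:
    (adverse within 3 words of event) AND (drug OR medication) AND NOT veterinary
    """
    tokens = tokenize(text)
    vocab = set(tokens)
    return (
        within_distance(tokens, "adverse", "event", 3)
        and ("drug" in vocab or "medication" in vocab)
        and "veterinary" not in vocab
    )

docs = [
    "The drug trial reported one adverse cardiac event.",
    "An adverse event was logged for a veterinary drug.",
    "The medication shipped on time.",
]
results = [matches_rule(d) for d in docs]
print(results)
```

Only the first document satisfies all three clauses: the second is excluded by the NOT term and the third never mentions the adverse-event phrase.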



Varieties of Taxonomy/ Text Analytics Software
 Taxonomy Management
   –   Synaptica, Mondeca
 Taxonomy Platform
   –   Clear Forest, Data Harmony, Concept Searching, Expert System
 Text Analytics Platform
   –   SAS-Teragram, IBM, SAP-Inxight, Smart Logic, GATE-Open Source
 Content Management
   –   Nstein, Interwoven, Documentum, etc.
 Embedded – Search
   –   FAST, Autonomy, Endeca, Exalead, etc.
 Specialty
   –   Sentiment Analysis – Lexalytics, Attensity
Evaluating Text Analytics Software – Process

 Start with Self Knowledge
   –   Think Big, Start Small, Scale Fast
 Eliminate the unfit
   –   Filter One – Ask Experts – reputation, research – Gartner, etc.
        • Market strength of vendor, platforms, etc.
        • Feature scorecard – minimum, must have, filter to top 3
   –   Filter Two – Technology Filter – match to your overall scope
       and capabilities – Filter not a focus
   –   Filter Three – In-Depth Demo – 3-6 vendors
 Deep POC (2) – advanced, integration, semantics
 Focus on working relationship with vendor.
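
The feature scorecard step in Filter One can be sketched as a weighted ranking with a must-have cutoff. Everything in this Python example is an assumption made for illustration: the vendor names are placeholders, and the features, weights, and 0-5 scores are invented rather than taken from any real evaluation.

```python
def shortlist(vendors, weights, must_have, top_n=3):
    """Drop vendors missing any must-have feature, then rank by weighted score."""
    eligible = {
        name: scores
        for name, scores in vendors.items()
        if all(scores.get(feature, 0) > 0 for feature in must_have)
    }
    ranked = sorted(
        eligible,
        key=lambda name: sum(
            weights[f] * eligible[name].get(f, 0) for f in weights
        ),
        reverse=True,
    )
    return ranked[:top_n]

# Hypothetical weights and 0-5 feature scores
weights = {"categorization": 3, "extraction": 2, "sentiment": 1}
must_have = ["categorization"]
vendors = {
    "Vendor A": {"categorization": 4, "extraction": 3, "sentiment": 2},
    "Vendor B": {"categorization": 0, "extraction": 5, "sentiment": 5},
    "Vendor C": {"categorization": 5, "extraction": 3, "sentiment": 1},
    "Vendor D": {"categorization": 3, "extraction": 3, "sentiment": 0},
    "Vendor E": {"categorization": 2, "extraction": 1, "sentiment": 1},
}
top = shortlist(vendors, weights, must_have)
print(top)
```

Vendor B scores well on two features but is eliminated outright because it lacks the must-have, which is the point of filtering before ranking.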

Text Analytics Development: Categorization Process

 Starter Taxonomy
   –   If no taxonomy, develop initial high level
 Analysis of taxonomy – suitable for categorization
   –   Structure – not too flat, not too large
   –   Orthogonal categories
   –   Software analysis of Content - Clusters
 Content Selection
   –   Map of all anticipated content
   –   Selection of training sets – if possible
   –   Automated selection of training sets – taxonomy nodes as first
       categorization rules – apply and get content
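
One way to read "taxonomy nodes as first categorization rules" is to treat each node label as a literal search phrase and collect the documents it hits as candidate training content. The Python sketch below assumes exactly that. The node labels, document IDs, and texts are hypothetical, and a real tool would likely add stemming and synonyms.

```python
import re

def select_training_candidates(taxonomy_nodes, documents):
    """Assign each document to every node whose label appears in its text."""
    candidates = {node: [] for node in taxonomy_nodes}
    for doc_id, text in documents.items():
        for node in taxonomy_nodes:
            # Whole-phrase, case-insensitive match on the node label
            pattern = r"\b" + re.escape(node) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                candidates[node].append(doc_id)
    return candidates

nodes = ["Enterprise Search", "Content Management"]
documents = {
    "d1": "Our enterprise search rollout covers three portals.",
    "d2": "Content management and enterprise search share metadata.",
    "d3": "Quarterly travel policy update.",
}
candidates = select_training_candidates(nodes, documents)
print(candidates)
```

Documents that match no node (here `d3`) are a useful signal too: they may indicate gaps in the starter taxonomy.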

Text Analytics Development: Categorization Process

 First Round of Categorization Rules
 Term building – from content – basic set of terms that
    appear often / important to content
  Add terms to rule, apply to broader set of content
  Repeat for more terms – get recall-precision “scores”
  Test, Refine, Test, Refine – repeat
  Get SME feedback – formal process – scoring
  Get SME feedback – human judgments
  Test against more, new content
  Repeat until “done” – 90%?
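
The recall-precision "scores" in the test-and-refine loop can be computed by comparing what a rule assigned against what SMEs judged correct. This Python sketch uses the standard definitions. The document IDs and judgments are made up for illustration.

```python
def precision_recall(predicted, relevant):
    """Score one category rule.

    predicted: set of doc ids the rule assigned to the category
    relevant:  set of doc ids SMEs judged as belonging to the category
    """
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical results for one rule after a test round
predicted = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
precision, recall = precision_recall(predicted, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking both numbers per round shows whether a refinement traded recall for precision or improved both, which helps decide when a rule is "done".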


Text Analytics Development: Entity Extraction Process

 Facet Design – from Initial Research
 Find and Convert catalogs:
   –   Organization – internal resources
   –   People – corporate yellow pages, HR
   –   Include variants
   –   Scripts to convert catalogs – programming resource
 Build initial rules – follow categorization process
   –   Differences – scale, “score”
   –   Recall – find all entities
   –   Precision – correct assignment to entity class
   –   Issue – disambiguation – “Ford” as company, person, or car
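
Converting a catalog with variants into extraction rules can be sketched in a few lines of Python. The catalog entries, entity classes, and names here are hypothetical placeholders, and the sketch covers only literal matching, not the disambiguation issue noted above.

```python
import re

# Hypothetical catalog: canonical name -> (entity class, known variants)
CATALOG = {
    "Acme Corporation": ("Organization", ["Acme Corp.", "ACME"]),
    "Jane Doe": ("People", ["J. Doe", "Doe, Jane"]),
}

def build_lookup(catalog):
    """Map every lowercased surface form to (canonical name, entity class)."""
    lookup = {}
    for canonical, (entity_class, variants) in catalog.items():
        for form in [canonical] + variants:
            lookup[form.lower()] = (canonical, entity_class)
    return lookup

def extract_entities(text, lookup):
    """Return sorted, unique (canonical name, entity class) pairs found in text."""
    found = set()
    lowered = text.lower()
    for form, entity in lookup.items():
        # Look-arounds instead of \b so forms ending in punctuation still match
        pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(entity)
    return sorted(found)

lookup = build_lookup(CATALOG)
entities = extract_entities("J. Doe presented the ACME roadmap.", lookup)
print(entities)
```

Because every variant resolves to one canonical name, recall improves without fragmenting the facet values shown to users.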

Text Analytics Development: Sentiment Analysis

 Similar to Categorization / Hybrid
 Start with Content
   – Training sets – negative, positive, neutral
   – Software extracts a large set of general sentiment terms
   – Combine rules with statistical methods
   – Edit down to a smarter set
 Design Hierarchy of Entities and Features
   – Very Content Dependent
   – Software generates words and phrases by feature
 Edit words and phrases – test and refine cycles
 Apply – Social Media, surveys, Voice of Customer
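
A hierarchy of features with edited words and phrases might look like the following Python sketch. The product features, phrases, and polarity values are invented for illustration. A simplifying assumption is that each phrase belongs to exactly one feature, so a hit on the phrase is credited to that feature without checking that the feature name appears nearby.

```python
# Hypothetical hierarchy: product feature -> phrases with polarity scores
FEATURE_PHRASES = {
    "battery": {"lasts all day": 1, "drains fast": -1},
    "screen": {"crisp": 1, "scratches easily": -1},
}

def score_features(text):
    """Sum phrase polarities per feature; features with no hits are omitted."""
    lowered = text.lower()
    scores = {}
    for feature, phrases in FEATURE_PHRASES.items():
        hits = [
            polarity for phrase, polarity in phrases.items() if phrase in lowered
        ]
        if hits:
            scores[feature] = sum(hits)
    return scores

def label(score):
    """Turn a numeric score into a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = "The battery drains fast, but the screen is crisp."
scores = score_features(review)
labels = {feature: label(value) for feature, value in scores.items()}
print(labels)
```

Scoring by feature rather than by document keeps a mixed review like this one from averaging out to "neutral".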

Text Analytics and Library Services
 Taxonomy is Dead, Long Live Text Analytics
   –   Taxonomy – big, formal, hard to maintain, indexing or
       browsing
   –   Text Analytics – small taxonomy + rules, targeted and
       designed for use with facets, hard to develop-easy to maintain,
       categorization and facets
 What’s in a Name?
   –   Enterprise Content Categorization
   –   Sentiment Analysis
   –   Semantic Web – Combining unstructured content and data
 What’s in a Name?
   –   Categorization Taxonomy??? Catonomy???
Text Analytics and Library Services

 Basic direction – same as before only better
 Text Analytics Development - Librarians are a Critical piece
    –   Categorization rules, facet design, ontology design-development
 Content Management
    – Hybrid model of metadata – Software suggests
    – Humans – cognitively easier to agree or disagree than to generate
    – Librarian – develop Text Analytics, facilitate, monitor and refine
 Search – metadata driven
    – Faceted navigation – requires more metadata - especially in the
      enterprise
    – Search Based Applications
 More Cognitive Science / Linguistics – Less Library Science
    –   Smaller taxonomies in conjunction with Facets
Conclusions

 Text Analytics can fulfill the promise of taxonomy and metadata
  (and Search and CM and Info Age)
 Librarians’ First Step – you’re here
 Hybrid – SMEs, Librarians, Software
 Software – match to content and applications
    –   For Enterprise – Platform software is best
 Future Directions
    – Starter Taxonomies, catalogs
    – Embedded Applications
    – Semantic Web + Unstructured Content
 Expertise Analysis – level of words, basic level categories
    – Taxonomy/Ontology Development
    – Search, CM, KM, Social Media
Resources

 Text Analytics Takes Business Insight to New Depths
   –   Leslie Owens, Forrester – October 2009
 Semantic Software Technologies
   –   Linda Moulton, Gilbane Group – July 2010
 Text Analytics – Seth Grimes
   –   http://www.textanalyticsnews.com/text-mining-conference/index.shtml
 Sem Tech conference
   –   http://semtech2010.semanticuniverse.com/
 Enterprise Content Categorization – The Business Strategy for a
  Semantic Infrastructure – Tom Reamy
 Enterprise Content Categorization – How to Successfully Choose,
  Develop, and Implement a Semantic Strategy – Tom Reamy

             Questions?
                Tom Reamy
          tomr@kapsgroup.com
                KAPS Group
Knowledge Architecture Professional Services
        http://www.kapsgroup.com

				