UIMA and Semantic Search 1Q2007

UIMA | IBM Research

   UIMA and Semantic Search
     Introductory Overview
          www.ibm.com/research/uima




David A. Ferrucci
Senior Manager, Semantic Analysis and Integration
Chief Architect, UIMA
IBM T.J. Watson Research Center




                                        © 2006 IBM Corporation – All Rights Reserved

Semantic Match vs. Keyword Match

SEARCH: Going rate for leasing a billboard near Triborough Bridge

Top hits from popular search engines miss the mark…

Keywords may match, BUT the WRONG content is returned and the right content is MISSED.





SEARCH: Going rate for leasing a billboard near Triborough Bridge

Matched: “Going”, “Bridge”, “Rate”, “Leasing”, etc.

But all misses.





SEARCH: Going rate for leasing a billboard near Triborough Bridge

Another search engine: some different hits, but still all misses.

Makes you wonder… Could be out there?





Remarkably… with some location semantics, we can quickly find hi-res examples
of the area of interest, but NOT the information we need.





Overlaying Semantics vs. Extracting Knowledge: Improve Search Recall

SEARCH: Going rate for leasing a billboard near Triborough Bridge
    Semantics detected in the query: Rate, Billboard, Bronx
    Relations: Rate_For (Rate, Billboard); Located_In (Bronx)

No keywords in common, but a good “hit”:

    “…We were offered $250,000/year in 2001 for an outdoor sign in Hunts Point
    overlooking the Bruckner expressway. …”
    Semantics detected in the passage: Rate, Billboard, Bronx
    Relations: Rate_For (Rate, Billboard); Located_In (Bronx)

Analysis can detect semantic types to improve search precision and recall


Overlaying Semantics vs. Extracting Knowledge: Improve Search Precision

SEARCH: Going rate for leasing a billboard near Triborough Bridge
    Semantics detected in the query: Rate, Billboard, Bronx
    Relations: Rate_For (Rate, Billboard); Located_In (Bronx)

Common keywords, but a bad semantic match:

    “…Simon and Garfunkel's "The 59th Street Bridge Song" was rated highly by
    the Billboard magazine in the 60's…”
    Semantics detected in the passage: Song Title, Queens, Magazine

Analysis can detect semantic types to improve search precision and recall




UIMA – Quick Overview
Architecture, Software Framework and Tooling


         Enabling Semantic Analysis
         The Foundation of Semantic Search






Analytics Bridge the Unstructured & Structured Worlds

Unstructured Information → UIMA (Text and Multi-Modal Analytics) → Structured Information

Unstructured Information: Text, Chat, Email, Audio, Video
•    High-Value
•    Most Current
•    Fastest Growing
...BUT...
•    Buried in Huge Volumes (Noise)
•    Implicit Semantics
•    Inefficient Search

UIMA: Discover Relevant Semantics → Build into Structure
•    Docs, Emails, Phone Calls, Reports
•    Topics, Entities, Relationships
•    People, Places, Org, Times, Events
•    Customer Opinions, Products, Problems
•    Threats, Chemicals, Drugs, Drug Interactions....

Structured Information: Indices, DBs, KBs
•    Explicit Semantics
•    Efficient Search
•    Focused Content
...BUT...
•    Slow Growing
•    Narrow Coverage
•    Less Current/Relevant

Analytics: The kinds of things they do
 • Independently developed
 • From an increasing number of sources
 • Different technologies & interfaces
 • Highly specialized & fine grained

Analysis Capabilities
 • Language, Speaker Identifiers
 • Tokenizers
 • Classifiers
 • Part of Speech Detectors
 • Document Structure Detectors
 • Parsers, Translators
 • Named-Entity Detectors
 • Face Recognizers
 • Relationship Detectors

Capability Specializations
 • Modality
 • Human Language
 • Domain of Interest
 • Source: Style and Format
 • Input/Output Semantics
 • Privacy/Security
 • Precision/Recall Tradeoffs
 • Performance/Precision Tradeoffs...

The right analysis for the job will likely be a best-of-breed
combination integrating capabilities across many dimensions.
UIMA’s basic building blocks are Annotators. They iterate over an artifact to discover new
types based on existing ones and update the Common Analysis Structure (CAS) for
downstream processing.

The UIMA CAS representation is now aligned with the XMI standard.

Common Analysis Structure (CAS)

Artifact (e.g., Document):
    Fred Center is the CEO of Center Micros

Analysis Results (i.e., Artifact Metadata), layered over the artifact:
    Parser:        NP, VP, PP
    Named Entity:  Person (Fred Center), Organization (Center Micros)
    Relationship:  CeoOf (Arg1:Person, Arg2:Org)
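
The layering above can be sketched as stand-off annotations over the artifact text. This is a minimal illustration in plain Python, not the UIMA API: the `Cas` and `Annotation` classes, the toy gazetteer, and the "CEO of" rule are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type_name: str            # e.g. "Person", "Organization", "CeoOf"
    begin: int                # character offset where the span starts
    end: int                  # character offset just past the span
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Hypothetical stand-in for a CAS: the artifact plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type_name, begin, end, **features):
        ann = Annotation(type_name, begin, end, features)
        self.annotations.append(ann)
        return ann

    def select(self, type_name):
        return [a for a in self.annotations if a.type_name == type_name]

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


def named_entity_annotator(cas):
    # Toy gazetteer lookup standing in for a real named-entity detector.
    for name, type_name in (("Fred Center", "Person"),
                            ("Center Micros", "Organization")):
        start = cas.text.find(name)
        if start >= 0:
            cas.add(type_name, start, start + len(name))


def ceo_of_annotator(cas):
    # Discovers a new type (CeoOf) from existing ones (Person, Organization).
    for person in cas.select("Person"):
        for org in cas.select("Organization"):
            if person.end <= org.begin and "CEO of" in cas.text[person.end:org.begin]:
                cas.add("CeoOf", person.begin, org.end, arg1=person, arg2=org)


cas = Cas("Fred Center is the CEO of Center Micros")
named_entity_annotator(cas)
ceo_of_annotator(cas)
relation = cas.select("CeoOf")[0]
print(cas.covered_text(relation.features["arg1"]), "->",
      cas.covered_text(relation.features["arg2"]))
```

The second annotator never re-reads the raw text for names; it works from the Person and Organization annotations already in the shared structure, which is the point of passing one CAS along the pipeline.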





     • Analyzed by a collection of text analytics
     • Detected Semantic Entities and Relations Highlighted
     • Represented in UIMA Common Analysis Structure (CAS)






     UIMA: Unstructured Information Management Architecture
 • Open Software Architecture and Emerging Standard
     – Platform-independent standard for interoperable text and multi-modal analytics
     – Under development: UIMA Standards Technical Committee initiated under OASIS
     – http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima
 • Software Framework Implementation
     – SDK available on IBM alphaWorks
       • http://www.alphaworks.ibm.com/tech/uima
     – Tools, Utilities, Runtime, Extensive Documentation
       •   Creation, Integration, Discovery, Deployment of analytics
       •   Java, C++, Perl, Python (others possible)
       •   Supports co-located and service-oriented deployments (e.g., SOAP)
       •   Cross-language, high-performance APIs to a common data structure (CAS)
     – Embeddable on Systems Middleware (e.g., ActiveMQ, WebSphere, DB2)
 • Apache UIMA open-source project
     – http://incubator.apache.org/uima/





UIMA: Pluggable Framework, User-defined Workflows
CAS: Common UIMA Data Representation & Interchange
Aligned with OMG & W3C standards (i.e., XMI, SOAP, RDF)

Sources: Text, Chat, Email, Speech, Video

Connect, Read & Segment Sources (any UIMA-compliant Readers, Segmenters)
     – Web Crawler
     – File System Reader
     – Streaming Speech Segmenter
  → CAS →
Analyze Content, Assign Task-Relevant Semantics (any UIMA-compliant Analysis Engine(s))
     – Video Object Detector
     – Transcription Engine
     – Entity & Relation Detector(s)
     – Deep Parser
     – Arabic-English Translator (Web Service)
  → CAS →
Index or Process Results (any UIMA-compliant CAS Consumer(s))
     – Index Tokens & Annotations in IR Engine Index
     – Index Entities & Relations in RDB or OWL KB

Result stores: Relational Database, OWL Knowledge-Base, Text IR Engine Index, Video Search Index

Query side: Query Interface(s), Query Services, Relevant Knowledge, End-User Application Interfaces



UIMA Component Architecture

Sources (Text, Chat, Email, Audio, Video)
  → Collection Processing Engine (CPE)
      Collection Reader
        → CAS →
      Aggregate Analysis Engine
          Analysis Engine (Annotator) → CAS → Analysis Engine (Annotator)
          Flow Controller
        → CAS →
      CAS Consumer(s)
      Flow Controller
  → Ontologies, Indices, DBs, Knowledge Bases

Key: Framework Construction, Developer Codes
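
To make the roles in this architecture concrete, here is a minimal sketch in plain Python. It does not use the UIMA SDK: the function names, the dict-based "CAS", and the fixed-order flow are assumptions chosen for illustration only.

```python
def collection_reader(documents):
    """Collection Reader: turn each source document into a fresh CAS (here a dict)."""
    for doc_id, text in documents.items():
        yield {"id": doc_id, "text": text, "annotations": []}


def token_annotator(cas):
    """Annotator: record (type, begin, end) for each whitespace-delimited token."""
    pos = 0
    for token in cas["text"].split():
        begin = cas["text"].index(token, pos)
        pos = begin + len(token)
        cas["annotations"].append(("Token", begin, pos))


def capitalized_annotator(cas):
    """Annotator that builds on the Token annotations already in the CAS."""
    for type_name, begin, end in list(cas["annotations"]):
        if type_name == "Token" and cas["text"][begin].isupper():
            cas["annotations"].append(("Capitalized", begin, end))


def aggregate_engine(annotators):
    """Aggregate Analysis Engine with a trivial fixed-order flow."""
    def process(cas):
        for annotate in annotators:
            annotate(cas)
        return cas
    return process


class CountingConsumer:
    """CAS Consumer: accumulates collection-level results across CASes."""
    def __init__(self):
        self.counts = {}

    def consume(self, cas):
        for type_name, _, _ in cas["annotations"]:
            self.counts[type_name] = self.counts.get(type_name, 0) + 1


def run_cpe(documents, engine, consumers):
    """Collection Processing Engine: reader → analysis → consumers."""
    for cas in collection_reader(documents):
        engine(cas)
        for consumer in consumers:
            consumer.consume(cas)


docs = {"d1": "Fred Center is the CEO", "d2": "an outdoor sign in Hunts Point"}
consumer = CountingConsumer()
run_cpe(docs, aggregate_engine([token_annotator, capitalized_annotator]), [consumer])
print(consumer.counts)
```

In this sketch the developer supplies only the annotators and the consumer; the reader loop, the aggregate, and the CPE driver play the part of the framework.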



 CAS Multiplier: A generalization of the Collection Reader

     Annotator:          CAS → CAS’
     CAS Multiplier:     CAS → CAS1, CAS2, …, CASn
     Collection Reader:  Text, Chat, Email, Audio, Video → CAS1, CAS2, …, CASn
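
The three shapes can be contrasted in a few lines of plain Python. This is only a sketch of the idea, not UIMA code: the dict-based CAS, the sentence splitter, and the id scheme are assumptions invented for the example.

```python
def annotator(cas):
    """Annotator: one CAS in, the same CAS out, updated in place."""
    cas["length"] = len(cas["text"])
    return cas


def cas_multiplier(cas):
    """CAS Multiplier: one CAS in, zero or more new CASes out (here one per sentence)."""
    for n, sentence in enumerate(cas["text"].split(". "), start=1):
        sentence = sentence.strip().rstrip(".")
        if sentence:
            yield {"id": f'{cas["id"]}.{n}', "parent": cas["id"], "text": sentence}


def collection_reader(sources):
    """Collection Reader: no CAS in, a stream of new CASes out."""
    for n, text in enumerate(sources, start=1):
        yield {"id": f"doc{n}", "text": text}


sources = ["First sentence. Second sentence.", "Only one."]
segments = [annotator(segment)
            for cas in collection_reader(sources)
            for segment in cas_multiplier(cas)]
for segment in segments:
    print(segment["id"], segment["text"], segment["length"])
```

Seen this way, a Collection Reader is the special case of a multiplier whose input is the raw source rather than an existing CAS.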





 Example: Inline Segmenter

     CAS Multiplier:  CAS → CAS1, CAS2, …, CASn

     Video Analyzer → (Video Segment) → Key Frame Extractor (Segmenter)
         → (KeyFrame 1, …, KeyFrame N) → Key Frame Classifier → (Key Frame + Metadata)




Eclipse Development Tools for Creating, Describing, Composing and Testing
Component Analytics (in UIMA SDK)





UIMA Component Repository at CMU http://uima.lti.cs.cmu.edu/index.html
 http://sith.watson.ibm.com








Semantic Search Overview



         Using UIMA to Advance Search and Discovery





     The Semantic Search 4-Step Program

     1. UIMA Corpus-Analysis (input: Corpus)
          Detect the Semantic Content in the Corpus
          Build the Semantic Signatures
     2. Semantic Search Index
          Index the text and the Semantic Signatures
     3. UIMA Query-Analysis (input: User Query)
          Detect Semantic Signatures in Query
          Automatically generate Semantic Search queries for the back-end
     4. Semantic Search Engine (e.g., OmniFind)
          SIAPI: Efficiently retrieve documents matching Semantic Signature
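
The four steps can be walked through on the billboard example from the earlier slides. This is a toy sketch in plain Python under loud assumptions: the regular-expression rules stand in for real UIMA analytics, a "semantic signature" is reduced to a set of type names, and nothing here reflects how SIAPI or OmniFind actually work.

```python
import re

# Hypothetical pattern → semantic type rules; more specific patterns come first.
RULES = [
    (r"billboard magazine", "Magazine"),
    (r"billboard|outdoor sign", "Billboard"),
    (r"going rate|\$[\d,]+/year", "Rate"),
    (r"triborough bridge|hunts point", "Bronx"),
    (r"59th street bridge", "Queens"),
]
STOPWORDS = {"a", "an", "the", "for", "in", "by", "and", "was", "we", "were", "near"}


def keywords(text):
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def semantic_signature(text):
    """Steps 1 and 3: detect the semantic types present in a text."""
    remaining = text.lower()
    signature = set()
    for pattern, type_name in RULES:
        if re.search(pattern, remaining):
            signature.add(type_name)
            remaining = re.sub(pattern, " ", remaining)  # consume the matched span
    return signature


def build_index(corpus):
    """Step 2: index the text (keywords) and the semantic signatures."""
    return {doc_id: {"keywords": keywords(text),
                     "signature": semantic_signature(text)}
            for doc_id, text in corpus.items()}


def keyword_search(index, query):
    wanted = keywords(query)
    return sorted(d for d, entry in index.items() if wanted & entry["keywords"])


def semantic_search(index, query):
    """Step 4: retrieve documents whose signature covers the query's signature."""
    wanted = semantic_signature(query)
    return sorted(d for d, entry in index.items()
                  if wanted and wanted <= entry["signature"])


corpus = {
    "sign": "We were offered $250,000/year in 2001 for an outdoor sign in "
            "Hunts Point overlooking the Bruckner expressway.",
    "song": "Simon and Garfunkel's \"The 59th Street Bridge Song\" was rated "
            "highly by the Billboard magazine in the 60's",
}
index = build_index(corpus)
query = "Going rate for leasing a billboard near Triborough Bridge"
print("keyword: ", keyword_search(index, query))
print("semantic:", semantic_search(index, query))
```

With these toy rules the keyword search returns only the song passage (it shares "bridge" and "billboard" with the query), while the semantic search returns only the outdoor-sign passage, which shares no keywords with the query.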






 UIMA Pipeline for Keyword Search

     Collection of Text Docs
       → Collection Processing Engine
           Collection Reader:  File Reader
           Analysis Engines:   Analysis Aggregate (SimpleTokenAndSentence Annotator)
           CAS Consumers:      KW Search Indexer → KW Search Index
                               XcasCas Writer → Local File System




UIMA Pipeline for Semantic Search

     Collection of Text Docs
       → Collection Processing Engine
           Collection Reader:  File Reader
           Analysis Engines:   Analysis Aggregate (SimpleTokenAndSentence Annotator,
                               Named-Entity Annotator)
           CAS Consumers:      Semantic Search Indexer → Semantic Search Index
                               XcasCas Writer → Local File System





We index and search over tokens AND the semantic annotations

“first” is an ambiguous term; so is “center”.
We are looking for these terms, but with the particular semantics detected by
the UIMA analytics and indexed in the semantic search engine.

The JuruXML Query Language Exploits the Results of Analysis

KeyWord Query:            “first”
Semantic Search Query:    <organization> first </organization> (1)
                          “first” as it appears in the name of an organization
Semantic Search Query:    <ceo_of> <person> Center </person> </ceo_of>
                          “Center” appearing in the name of a person who is the ceo of something

(1) XMLFragment Syntax
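
One way to read these queries is "find the term, but only where it sits inside annotations of the named types". The sketch below is a toy interpretation in plain Python, not the JuruXML or SIAPI implementation: the three documents, their hand-written annotation offsets, and the matching rule are all assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical indexed documents: text plus (type, begin, end) annotation spans.
DOCS = {
    "a": {"text": "Fred Center is the CEO of Center Micros",
          "annotations": [("person", 0, 11), ("organization", 26, 39),
                          ("ceo_of", 0, 39)]},
    "b": {"text": "The first train stops at the center of town",
          "annotations": []},
    "c": {"text": "First Union hired Ann Lee",
          "annotations": [("organization", 0, 11), ("person", 18, 25)]},
}


def parse_fragment(query):
    """Return (term, enclosing types from outermost to innermost)."""
    node = ET.fromstring(query)
    path = [node.tag]
    while len(node):              # descend while the element has children
        node = node[0]
        path.append(node.tag)
    return node.text.strip(), path


def covered(annotations, path, span):
    """True if span lies inside nested annotations of the types in path."""
    outer = (0, float("inf"))
    for type_name in path:
        candidates = [(b, e) for t, b, e in annotations
                      if t == type_name
                      and outer[0] <= b and e <= outer[1]
                      and b <= span[0] and span[1] <= e]
        if not candidates:
            return False
        outer = candidates[0]
    return True


def search(query):
    """Plain keyword query, or an XML-fragment-style query with type constraints."""
    if query.lstrip().startswith("<"):
        term, path = parse_fragment(query.strip())
    else:
        term, path = query.strip(), []
    word = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(doc_id for doc_id, doc in DOCS.items()
                  if any(covered(doc["annotations"], path, m.span())
                         for m in word.finditer(doc["text"])))


print(search("first"))
print(search("<organization> first </organization>"))
print(search("<ceo_of> <person> Center </person> </ceo_of>"))
```

In this toy corpus the bare keyword matches both the train sentence and the company name, while the typed queries keep only the occurrence whose enclosing annotations fit.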







 Concluding Remarks
 • Raising the Search Bar
    – Known-Item Search must evolve into Knowledge Gathering and Synthesis
    – Semantic Search can improve precision and recall
    – Graceful degradation: Worst-Case should be keyword search
 • Semantic Analysis is Key
    – Perfect, consistent or massive manual semantic annotation is NOT likely
    – Automated annotation is essential
      • Many annotators must emerge
      • Must be easy to discover, combine, aggregate and deploy
      • UIMA is an enabling Integration Platform
 • Approximate Semantics
    – Universal semantic consensus won’t happen
    – But approximations can work to improve search applications
    – Improve precision, recall and density across artifact boundaries



