                Experiences with UIMA from a User’s Perspective



                         Dietmar Rösner,
                         Manuela Kunze,
                         Hany Mahgoub


University of Magdeburg, Knowledge Based Systems and Document Processing
                                                 Overview

                                     • Introduction

                                     • GATE

                                     • UIMA

                                     • Conclusion




Rösner, Kunze, Mahgoub: Experiences with UIMA from a User’s Perspective   2
                                              Introduction

    • November 2005; Version 1.2.3 of UIMA is available


      "IBM’s Unstructured Information Management Architecture
         (UIMA) is an architecture and software framework for
        creating, discovering, composing and deploying a broad
       range of multi-modal analysis capabilities and integrating
                     them with search technologies."




                                              Introduction




                                                      really?

                                              Introduction

     • similarity/comparison of GATE and UIMA
            – frameworks
            – results are documents + annotations
            – pipeline processing


     • steps:
            – task definition
            – one corpus




                              Evaluation Topics/Points
     • ease of getting acquainted with the system?
            – quality of documentation: completeness, clarity, up-to-dateness, …?
            – tutorials, use cases, …?
     • processing and linguistic resources?
            – lexica, Gazetteer lists, tools
     • tools for resource maintenance and extension?
            – quality: self-explanatory, robust, comfortable
     • speed of processing?
     • single docs vs. large corpora?
     • limitations, suggestions for improvement?
     • support for im-/export of a variety of document formats?



                                Task of the Experiment
     • process a corpus of web sites
            – to detect and extract information relevant for tourists
                • opening times of museums, prices of hotels, …
     • corpus:
            – 30 tourism web sites about Egypt
            – an additional 20 web sites about Washington, New York, London
     • output:
            – Prolog facts for a reasoner
            – Questions:
               • Which museum is now open?
               • …
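
     • sketch: answering "Which museum is now open?"

The question above can be illustrated with a minimal sketch in Python. It assumes the extracted facts are available as (name, start, end) triples with ISO 8601 timestamps, in the style of the museumopen facts shown on the UIMA Application slide. The names `Fact` and `open_museums` and the second sample museum are hypothetical; the original system passed Prolog facts to a reasoner instead.

```python
from datetime import datetime
from typing import NamedTuple


class Fact(NamedTuple):
    """One opening interval, mirroring museumopen(Name, Start, End)."""
    name: str
    start: datetime
    end: datetime


def open_museums(facts, now):
    """Return the sorted names of museums with an interval containing `now`."""
    return sorted({f.name for f in facts if f.start <= now <= f.end})


# Sample data: the first entry follows the Fraunces Tavern Museum facts
# from the slides; the second museum is made up for illustration.
facts = [
    Fact("Fraunces Tavern Museum",
         datetime.fromisoformat("2005-12-01T12:00:00"),
         datetime.fromisoformat("2005-12-01T17:00:00")),
    Fact("Example Museum",
         datetime.fromisoformat("2005-12-01T09:00:00"),
         datetime.fromisoformat("2005-12-01T11:00:00")),
]

print(open_museums(facts, datetime(2005, 12, 1, 13, 30)))
```

A reasoner would answer the same question by unifying the current time against the stored intervals; the list comprehension is only a stand-in for that step.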



                             Excerpts from the Corpus
     • The Egyptian Museum is open the hours: 9am-5pm daily

     • The Military Museum is open the hours: Summer: 8am-5:30pm;
       winter: 8am-4:30pm

     • Palace Museum is open the hours: 8am-5:30pm
       (summer) 8am-4:30pm (winter)

     • 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri

     • …


              GATE: General Architecture for Text
                       Engineering
     • a suite of tools for language processing and information extraction

     • rule-based modular IE system (ANNIE)

     • language and domain-independent processing resources

     • open and extensible architecture

     • aims to provide uniform access to various linguistic and ontological
       resources

     • http://gate.ac.uk/



                GATE: General Architecture for Text
                         Engineering
     • a software infrastructure for NLP researchers; based on
       three main elements:

            – an architecture
                  • describing the components composing a language processing
                    system


            – a framework
                  • could be used as a basis for building such systems


            – a graphical development environment
                  • a set of tools and
                  • components for language engineers


                GATE: General Architecture for Text
                         Engineering

     • GATE is distributed with an IE system called ANNIE
            – relies on finite state algorithms and the Java Annotation Pattern
              Engine (JAPE) language

            – comprising a set of core Processing Resources (PRs):
                  •   Tokeniser
                  •   Gazetteers
                  •   POS tagger
                  •   Sentence Splitter
                  •   Semantic Tagger (JAPE transducer)
                  •   Orthomatcher (orthographic coreference)
                  •   …




                                           GATE: ANNIE




                       [Cunningham et al.: Developing Language Processing Components with GATE; Version 3 (a User Guide)]

                                        GATE Application
     • several Processing Resources: Tokenizer, Hash Gazetteer (with
       new/extended Gazetteer lists), JAPE Transducer

     • pipeline: ANNIE English Tokenizer → Gazetteer lists → JAPE Transducer
            – Gazetteer lists: names of museums, fragments of times and restrictions
            – JAPE rules: to annotate
                  • interval of times and restrictions
                  • museum

     • example input:
            ...
            *The Military Museum*
            Summer: 8am-5:30pm; Winter: 9pm-5pm …

                         Museum information in JAPE

     Rule: egyptmuseums
     (
        ({SpaceToken})
        ({Token.kind == word})
        ({SpaceToken})
        {Lookup.majorType == org_base} // from gazetteer lists
        ({SpaceToken})?
        (({Token.kind == punctuation})|({Token.kind == word})|({SpaceToken}))*
        ({timeinfo}) // annotation by JAPE
     )
     :museum
     -->
      :museum.sight = {rule = "egyptmuseums"}

     • timeinfo: defined by the JAPE rules transducer, detects patterns like:
            – 9am-5pm, 6pm-9pm
            – 8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm
            – 5:00PM-7:00PM, 10:00am-5:00pm
            – ….
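
     • sketch: the time patterns as a regular expression

The time patterns listed on this slide (9am-5pm, 8:30am-4:30pm, 5:00PM-7:00PM, …) can be approximated with a single regular expression. The following is a sketch in Python, not the JAPE grammar used in the experiment; the names `INTERVAL`, `to_24h` and `find_intervals` are hypothetical, and the conversion to 24-hour notation is an added convenience.

```python
import re

# One clock reading: hour, optional minutes, am/pm marker.
TIME = r"(\d{1,2})(?::(\d{2}))?\s*([ap]m)"
INTERVAL = re.compile(TIME + r"\s*-\s*" + TIME, re.IGNORECASE)


def to_24h(hour, minute, meridiem):
    """Convert a 12-hour clock reading to an 'HH:MM' string."""
    hour = int(hour) % 12              # 12am and 12pm both start from 0
    if meridiem.lower() == "pm":
        hour += 12
    return f"{hour:02d}:{int(minute or 0):02d}"


def find_intervals(text):
    """Return (start, end) pairs in 24-hour notation for each interval."""
    return [(to_24h(*m.groups()[:3]), to_24h(*m.groups()[3:]))
            for m in INTERVAL.finditer(text)]


print(find_intervals("8:30am-4:30pm (summer) 5:00PM-7:00PM (winter)"))
```

The expression only covers intervals joined by a plain hyphen; day ranges such as Sat-Wed are not handled and would need a separate pattern.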


                       GATE: Presentation of Results
     • figure callouts:
            – Museums Information
            – Annotations
            – type and location of every extracted annotation on document


                                          GATE: Results
     • information annotated in the documents:
            –   names of museums, hotels
            –   names of tourist places in Egypt
            –   times, time intervals
            –   time restrictions
            –   prices, intervals of prices (hotel prices and museum prices)
            –   names of pharaohs, queens




                                       GATE: Evaluation
  documentation?
     - good
     - illustrative examples (tutorial), but not enough, especially about JAPE rules
     - can be used without knowledge of Java programming
     - but experience with Java programming is an advantage when using it in
       JAPE rules



                                       GATE: Evaluation
  processing and linguistic resources?
     - many processing resources available (ANNIE)
          - tokenisers
          - POS taggers
          - parsers
          - gazetteers
          - sentence splitter
          - …
     - additional PRs:
          - gazetteer collector
          - PRs for Machine Learning
          - various exporters
          - annotation set transfer etc.


                                       GATE: Evaluation
  tools for resource maintenance and extension?
     - editor for gazetteer lists
     - corpus manager
     - text editor and debugger for JAPE rules



                                       GATE: Evaluation
  speed of processing?
     - there is no measurement of processing time in the GATE tool



                                       GATE: Evaluation
  single docs vs. large corpora?
     - corpus pipeline vs. document pipeline



                                       GATE: Evaluation
  limitations, suggestions for improvement?
     - no limitations:
          - everything is possible, but it is not necessary to implement it yourself
     - for getting started:
          - processing and linguistic resources available within the distribution



                                       GATE: Evaluation
  im-/export of document formats?
     - import:
          - supports a variety of document formats: HTML, RTF, email, SGML and
            plain text
               - in all cases the format is analysed and converted into a single
                 unified model of annotation
     - export:
          - documents, corpora and annotations in databases of various sorts
          - required: Java application (CREOLE)


                     UIMA: Unstructured Information
                       Management Architecture
     • a software architecture for developing and deploying
       unstructured information management (UIM) applications

     • UIM application: a software system
            – analyse large volumes of unstructured information to
                  • discover,
                  • organize, and
                  • deliver relevant knowledge to the end user


     • software architecture which specifies
            – component interfaces, data representations, …

     • http://www.research.ibm.com/UIMA/


            UIMA: Unstructured Information Management
                           Architecture




     • Collection Reader: interfaces to a collection of data items (e.g., documents) to be
       analyzed. Collection Readers return CASes that contain the documents to analyze,
       possibly along with additional metadata.

     • CAS Initializer: may be used by a Collection Reader to populate a CAS from a
       document. An example of a CAS Initializer is an HTML parser that de-tags an HTML
       document and also inserts paragraph annotations (determined from <P> tags in the
       original HTML) into the CAS.

     • Analysis Engine: takes a CAS, analyzes its contents, and produces an enriched CAS.
       Analysis Engines can be recursively composed of other Analysis Engines (called an
       Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.

     • CAS Consumer: consumes the enriched CAS that was produced by the sequence of
       Analysis Engines before it, and produces an application-specific data structure,
       such as a search engine index or database.

     CAS: Common Analysis Structure
     CPM: Collection Processing Manager

[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
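
     • sketch: the flow from Collection Reader to CAS Consumer

The component roles described on this slide can be illustrated with a minimal sketch in plain Python. It is not the UIMA API, which is a Java framework: the `CAS` class and the functions `collection_reader`, `time_annotator` and `index_consumer` are hypothetical stand-ins that only mirror the roles of Collection Reader, CAS Initializer, Analysis Engine and CAS Consumer.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class CAS:
    """Toy Common Analysis Structure: text plus (type, begin, end) spans."""
    text: str = ""
    annotations: list = field(default_factory=list)


class HtmlInitializer(HTMLParser):
    """Toy CAS Initializer: de-tags HTML and records <p> spans."""

    def __init__(self):
        super().__init__()
        self.cas = CAS()
        self._p_start = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._p_start = len(self.cas.text)

    def handle_endtag(self, tag):
        if tag == "p" and self._p_start is not None:
            self.cas.annotations.append(
                ("Paragraph", self._p_start, len(self.cas.text)))
            self._p_start = None

    def handle_data(self, data):
        self.cas.text += data


def collection_reader(documents):
    """Toy Collection Reader: yields one initialized CAS per HTML document."""
    for html in documents:
        initializer = HtmlInitializer()
        initializer.feed(html)
        initializer.close()
        yield initializer.cas


def time_annotator(cas):
    """Toy Analysis Engine: enriches the CAS with TimeInterval annotations."""
    pattern = r"\d{1,2}(?::\d{2})?[ap]m-\d{1,2}(?::\d{2})?[ap]m"
    for m in re.finditer(pattern, cas.text):
        cas.annotations.append(("TimeInterval", m.start(), m.end()))
    return cas


def index_consumer(cases):
    """Toy CAS Consumer: maps each annotation type to the texts it covers."""
    index = {}
    for cas in cases:
        for kind, begin, end in cas.annotations:
            index.setdefault(kind, []).append(cas.text[begin:end])
    return index


docs = ["<html><body><p>The Egyptian Museum is open the hours: "
        "9am-5pm daily</p></body></html>"]
index = index_consumer(time_annotator(cas) for cas in collection_reader(docs))
print(index)
```

In UIMA itself these components are declared through XML descriptors and run by the Collection Processing Manager; the generator chain here only imitates that ordering.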
           UIMA: Unstructured Information Management
                          Architecture

   • Analysis Engine (AE):
          – a component that analyzes artifacts (e.g. documents) and infers
            information about them

          – consists of two parts:
                • Java classes (typically packaged as one or more JAR files) and
                • AE descriptors (one or more XML files)
                      – the configuration settings for the Analysis Engine as well as
                      – a description of the AE’s input and output requirements.




     [Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]



                                       UIMA Application
     • several annotators (like a pipeline), using regular expressions:
            – restrictions
            – interval of times
            – time pattern: window covering two time intervals and a restriction
            – museum pattern: window covering a museum and opening hours
            – museum information

     • example input:
            ...
            *Fraunces Tavern Museum*
            54 Pearl St. - 1-212-425-1778
            Tuesday-Friday, 12pm–5pm; …

     • Prolog facts:
            museumopen('Fraunces Tavern Museum ',
                       '2005-12-01T12:00:00', '2005-12-01T17:00:00').
            museumopen('Fraunces Tavern Museum ',
                       '2005-12-02T12:00:00', '2005-12-02T17:00:00').
            museumopen('Fraunces Tavern Museum ',
                       '2005-12-03T12:00:00', '2005-12-03T17:00:00').
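
     • sketch: writing museumopen facts

How a detected museum name and opening interval might be turned into museumopen facts can be illustrated with a minimal sketch in Python. The slides do not say how the dates were chosen, so the sketch simply emits one fact per consecutive day from a given start date, without applying the weekday restriction. `quote_atom` and `museumopen_facts` are hypothetical names.

```python
from datetime import date, datetime, time, timedelta


def quote_atom(value):
    """Quote a string as a Prolog atom, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def museumopen_facts(name, first_day, days, opens, closes):
    """Yield one museumopen/3 fact per day, with ISO 8601 timestamps."""
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        start = datetime.combine(day, opens).isoformat()
        end = datetime.combine(day, closes).isoformat()
        yield (f"museumopen({quote_atom(name)}, "
               f"{quote_atom(start)}, {quote_atom(end)}).")


facts = list(museumopen_facts("Fraunces Tavern Museum", date(2005, 12, 1), 3,
                              time(12, 0), time(17, 0)))
for fact in facts:
    print(fact)
```

Doubling single quotes keeps names such as King's Museum valid as quoted atoms.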
                                           UIMA: Results
     • information annotated in the documents:
            –   names of museums, hotels
            –   times, time intervals
            –   time restrictions
            –   prices, intervals of prices (hotel prices)
            –   keywords for museum category
            –   names of pharaohs (annotated with a correction of misspellings)
     • hotel and museum information is exported into Prolog facts and
       into a short textual summary
            – templates filled with the detected information
                  • hotels: Price information about Cosmopolitan Hotel : $157
                  • museums:
                      *** *Fraunces Tavern Museum* ***
                      Open from 12:00:00 to 17:00:00;
                      Restriction: Tuesday-Friday
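
     • sketch: filling the summary templates

The template step above can be illustrated with a minimal sketch in Python. The template strings are modelled on the two sample outputs shown on this slide; the slot names and the `fill` helper are hypothetical, and the original implementation is not shown in the slides.

```python
# Templates modelled on the sample summaries; slot names are assumptions.
HOTEL_TEMPLATE = "Price information about {name} : {price}"
MUSEUM_TEMPLATE = ("*** *{name}* ***\n"
                   "Open from {opens} to {closes};\n"
                   "Restriction: {restriction}")


def fill(template, **slots):
    """Fill a template with the detected information."""
    return template.format(**slots)


hotel = fill(HOTEL_TEMPLATE, name="Cosmopolitan Hotel", price="$157")
museum = fill(MUSEUM_TEMPLATE, name="Fraunces Tavern Museum",
              opens="12:00:00", closes="17:00:00",
              restriction="Tuesday-Friday")
print(hotel)
print(museum)
```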


                                       UIMA: Evaluation
  documentation?
     - good
     - illustrative examples (tutorial)
     - completeness: some topics are described only very briefly
          - prior knowledge about Java and Eclipse is helpful



                                       UIMA: Evaluation
  processing and linguistic resources?
     - annotators only from tutorial
          - sentence annotation
          - word annotation
          - date/time annotators
          - examples for using regular expressions etc.
     - external resources can be integrated:
          - lexical resources as external resources (text files)
          - existing processing resources
               - implementation of an interface is necessary


                                       UIMA: Evaluation
  tools for resource maintenance and extension?
     - specific Eclipse component editors or
     - simple text editors



                                       UIMA: Evaluation
  speed of processing?
     - faster than GATE?
     - in CPE: detailed information about processing time for each module



                                       UIMA: Evaluation
  single docs vs. large corpora?
     - Collection Reader
          - document(s) from a directory
     - adapt extensions into preprocessing (CAS Initializer)
          - e.g., extraction of text fragments from an HTML document



                                       UIMA: Evaluation
  limitations, suggestions for improvement?
     • no limitations:
          – everything is possible, but implementation or interfacing is left to
            the user
     • wish:
          – more processing and linguistic resources within the distribution



                                       UIMA: Evaluation
  im-/export of document formats?
     - import: CAS Initializer
     - export: CAS Consumer
          - transform annotations into any other format
          - export of
               - document + annotations
               - only annotations
     - required: Java application


                                               Conclusion
     • intended use

            – GATE: academic/scientific application
                  • tools available
                  • comfortable GUI

            – UIMA: more commercial
                  • plain framework
                  • simplified definition of (complex) results structures
                  • simplified pre- and postprocessing of annotations

     • in sum: incommensurable




                                               Conclusion
     • both are extensible

     • no final judgement on whether to use GATE or UIMA
            – depends on
                  • your task
                        – task description
                        – expected results
                        – which processing resources are necessary
                  • your interface preferences
                        – prefer the Eclipse environment (or other Java editors)
                        – prefer a comfortable GUI



     • or use both

                                               Conclusion
     • found in the UIMA Forum:
          I see UIMA and GATE as complementary rather than competitive, and each
          can gain from the strengths of the other.

          GATE was originally developed as a research tool, and has features suited
          to rapid prototyping of text processing code, like JAPE (a language for
          defining finite-state transducers over annotations on a document).

          UIMA is more targetted at robust deployment of applications, with strong
          typing of feature structures and better support for distributed processing.

          We're currently working on writing a translation layer to allow UIMA analysis
          components to be used in GATE and vice-versa. It's not in a releasable
          state just yet, but we hope to release something in the near future. Keep
          your eye on http://gate.ac.uk/ for details.

                                                                          Ian Roberts (GATE developer)

