					 Software Architecture for
Language Engineering (SALE)
       – where next?
  http://gate.ac.uk/   http://nlp.shef.ac.uk/

             Hamish Cunningham

      IBM TJ Watson,   1st August 2003
           Structure of the Talk
1. SALE and its context
   • Definitions
   • The Knowledge Economy and HLT
   • Software Lifecycle
2. GATE, a General Architecture for Text Engineering
   • History
   • Summary of Features and Principles
   • Component-based development
   • Unicode support
   • Measurement
   • CREOLE: some components
   • Users and Projects
3. Where Next (give up and go home)?
   • Future context
   • Desirables
   • Conclusion
              SALE: definitions
• Computational Linguistics: science of language that
  uses computation as an investigative tool.
• Natural Language Processing: science of computation
  whose subject matter is data structures and algorithms
  for human language processing.
• Language Engineering: building systems whose cost
  and outputs are measurable and predictable.
• Software Architecture: macro-level organisational
  principles for families of systems. In this context the
  term is also used to mean infrastructure.
• SALE: software infrastructure, architecture and
  development tools for applied NLP and LE.

   The Knowledge Economy and
         Human Language
Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and
  indexing will be prevalent in almost all information-rich
• through 2012 more than 95% of human-to-computer
  information input will involve textual language

A contradiction: formal knowledge in semantics-based
systems vs. ambiguous informal natural language

The challenge: to reconcile these two opposing tendencies
IE and Knowledge: Closing the Language Loop
[Figure: Human Language and Formal Knowledge (ontologies and instance bases; Web; Semantic Grid) joined in a loop; surviving labels: (M)NLG, CLIE]
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE

             Software lifecycle in
            collaborative research
Project Proposal: We love each other. We can work so well together. We can hold
workshops on Santorini together. We will solve all the problems of AI that our
predecessors were too stupid to.

Analysis and Design: Stop work entirely, for a period of reflection and
recuperation following the stress of attending the kick-off meeting in Luxembourg.

Implementation: Each developer partner tries to convince the others that program
X that they just happen to have lying around on a dusty disk-drive meets the
project objectives exactly and should form the centrepiece of the demonstrator.

Integration and Testing: The lead partner gets desperate and decides to hard-
code the results for a small set of examples into the demonstrator, and have a fail-
safe crash facility for unknown input ("well, you know, it's still a prototype...").

Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard
problems, and how if we had another grant we could go on to transform information
processing the World over (or at least the European business travel industry).
    Where did GATE come from?
Early to mid-1990s (e.g. in TIPSTER):
• Increasing trend towards multi-site collaborative projects
• Role of engineering in scalable, reusable, and portable HLT
• Support for large data, in multiple media, languages, formats,
  and locations
• Lower cost of creation of language processing components
• Promote quantitative evaluation metrics via tools and a level
  playing field

GATE history:
• 1996 – 2002: GATE version 1, proof of concept
• March 2002: version 2, rewritten in Java, component based,
  LGPL, more users
• Fall 2003: new development cycle
                     GATE is...
• An architecture: a macro-level organisational picture for LE
  software systems.
• A framework: for programmers, GATE is an object-oriented
  class library that implements the architecture.
• A development environment: for language engineers,
  computational linguists et al, a graphical development

GATE comes with...
• Some free components... ...and wrappers for other people's
• Tools for: evaluation; visualise/edit; persistence; IR; IE;
  dialogue; ontologies; etc.
• Free software (LGPL). Download at
          Architectural principles

• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse
  XML support, integration of Protégé, Jena, Weka...)
• (Almost) everything is a component, and component sets
  are user-extendable
• (Almost) all operations are available both from API and GUI

  Component-based development
CREOLE: a Collection of REusable Objects for
Language Engineering:
• Java Beans: an OO way of chunking software
• GATE components: modified Java Beans with
  XML configuration
• The minimal component = 10 lines of Java, 10
  lines of XML, 1 URL

Why bother?
• Allows the system to load arbitrary language
  processing components

             CREOLE lifecycle
• Bootstrap: stub Java class, Makefile, config
• Registration: URL / JAR / creole.xml
• Instantiation: class loading, parameterisation,
  bean object creation
    – load-time parameters, e.g. a document’s charset
    – run-time parameters, e.g. a parser’s lexicon
Three types of beans (not a new religion!):
• Language Resources, e.g. doc, corpus, lexicon
• Processing Resource, e.g. tagger, stat modeller
• Visual Resource, e.g. doc editor, syntax editor
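
The instantiation step (class loading, bean object creation, parameterisation) can be sketched in plain Java. This is only an illustration under stated assumptions: `LifecycleSketch`, `SuffixTagger` and `instantiate` are hypothetical names, parameters are limited to strings, and GATE's real factory and creole.xml handling are not reproduced here.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the instantiation step; not GATE's own factory. */
final class LifecycleSketch {

    /** Stand-in Processing Resource bean with one String parameter. */
    public static final class SuffixTagger {
        private String suffix = "";

        public SuffixTagger() { }

        public void setSuffix(String suffix) { this.suffix = suffix; }

        public String tag(String token) { return token + suffix; }
    }

    /**
     * Load the class by name, call its no-argument constructor, then apply
     * each parameter through the matching setXxx(String) bean setter.
     */
    static Object instantiate(String className, Map<String, String> params) {
        try {
            Class<?> cls = Class.forName(className);
            Object bean = cls.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String> p : params.entrySet()) {
                String name = p.getKey();
                String setter = "set" + Character.toUpperCase(name.charAt(0))
                        + name.substring(1);
                Method m = cls.getMethod(setter, String.class);
                m.invoke(bean, p.getValue());
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<String, String>();
        params.put("suffix", "/NN");
        SuffixTagger tagger =
                (SuffixTagger) instantiate(SuffixTagger.class.getName(), params);
        System.out.println(tagger.tag("corpus"));
    }
}
```

Because the bean is located by name and configured through setters, the host system needs no compile-time knowledge of the component, which is the point of the "Why bother?" bullet.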

        Language Resources (LRs)
• GATE LRs are documents, ontologies, corpora,
  lexicons, ……
• LRs can be associated with DataStores (Oracle,
  PostgreSQL, XML, Java Serialisation)
• Documents / corpora:
    –    Diverse document formats: text, html, XML, email,
         RTF, SGML
    –    Optional format-preserving markup analyse / save
• Standoff annotation model (start, end, type,
  features), derivative of TIPSTER, compatible
  with ATLAS and XCES
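
A minimal sketch of the standoff idea, assuming character offsets with an exclusive end: the text is stored once and annotations only point into it. The class and method names here are hypothetical and are not GATE's own annotation classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of standoff annotation. */
final class StandoffSketch {

    /** One annotation: offsets into the text, a type, and a feature map. */
    static final class Annotation {
        final int start;   // inclusive character offset
        final int end;     // exclusive character offset
        final String type;
        final Map<String, String> features = new HashMap<String, String>();

        Annotation(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** The text itself is never modified; the span is looked up on demand. */
    static String coveredText(String text, Annotation a) {
        return text.substring(a.start, a.end);
    }

    /** Select the annotations of one type. */
    static List<Annotation> ofType(List<Annotation> all, String type) {
        List<Annotation> result = new ArrayList<Annotation>();
        for (Annotation a : all) {
            if (a.type.equals(type)) {
                result.add(a);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hamish Cunningham spoke at IBM.";
        List<Annotation> annotations = new ArrayList<Annotation>();
        Annotation person = new Annotation(0, 17, "Person");
        person.features.put("rule", "FullName");
        annotations.add(person);
        annotations.add(new Annotation(27, 30, "Organization"));

        for (Annotation a : annotations) {
            System.out.println(a.type + " [" + a.start + "," + a.end + "): "
                    + coveredText(text, a) + " " + a.features);
        }
    }
}
```

Keeping annotations separate from the text lets several overlapping or competing analyses coexist over the same document without rewriting its markup.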

      Processing Resources (PRs)
• Algorithmic components known as PRs – beans
  with execute methods.
• Controllers: execute a set of PRs
  –   SerialController: sequential run of arbitrary PR set
  –   SerialAnalyserController: analyser PRs over corpus
  –   Conditional controllers: execution depends on features
  –   Parallel controller?
• PRs + Controller = Applications
• Application parameterisation state can be saved
  and restored, and used for embedding / batching
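
The "PRs + Controller = Applications" idea can be sketched with simplified stand-ins. `SimplePR`, `SimpleSerialController`, `Doc` and the two example PRs are hypothetical names chosen for this illustration; annotations are reduced to strings to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-ins for PRs and a serial controller; not GATE's classes. */
final class PipelineSketch {

    /** A document carrying its text plus the annotations added by each PR. */
    static final class Doc {
        final String text;
        final List<String> annotations = new ArrayList<String>();

        Doc(String text) { this.text = text; }
    }

    /** A PR is a bean with an execute method. */
    interface SimplePR {
        void execute(Doc doc);
    }

    /** Splits on whitespace and records one Token annotation per word. */
    static final class WhitespaceTokeniser implements SimplePR {
        public void execute(Doc doc) {
            for (String word : doc.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    doc.annotations.add("Token:" + word);
                }
            }
        }
    }

    /** Relies on the tokeniser having run first: marks capitalised tokens. */
    static final class CapitalisedTagger implements SimplePR {
        public void execute(Doc doc) {
            List<String> added = new ArrayList<String>();
            for (String a : doc.annotations) {
                if (a.startsWith("Token:") && Character.isUpperCase(a.charAt(6))) {
                    added.add("Capitalised:" + a.substring(6));
                }
            }
            doc.annotations.addAll(added);
        }
    }

    /** Runs its PRs in the order they were added, over every document. */
    static final class SimpleSerialController {
        private final List<SimplePR> prs = new ArrayList<SimplePR>();

        SimpleSerialController add(SimplePR pr) {
            prs.add(pr);
            return this;
        }

        void run(List<Doc> corpus) {
            for (Doc doc : corpus) {
                for (SimplePR pr : prs) {
                    pr.execute(doc);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Doc> corpus = new ArrayList<Doc>();
        corpus.add(new Doc("GATE runs in Java"));
        new SimpleSerialController()
                .add(new WhitespaceTokeniser())
                .add(new CapitalisedTagger())
                .run(corpus);
        System.out.println(corpus.get(0).annotations);
    }
}
```

The controller holds the ordering and the PRs hold the logic, so the same PRs can be recombined into different applications without code changes.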

         Visual Resources (VRs)

VRs (2): Coreference

VRs (3): Syntax

           Editing Multilingual Data
GATE Unicode Kit (GUK)
    Complements Java’s facilities

• Support for defining
  Input Methods (IMs)

• currently 30 IMs
  for 17 languages

• Pluggable in other
  applications (e.g.
      Processing Multilingual Data
All processing, visualisation and editing tools use GUK

            Performance Evaluation
• At document level – annotation diff
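
As a rough sketch of what a document-level diff can report, assume annotations are compared strictly (same start, end and type) and summarised as precision, recall and F-measure. The string encoding of annotations and all names below are assumptions made for the example; partial matches are ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical strict-match annotation diff; partial overlaps are ignored. */
final class AnnotationDiffSketch {

    /**
     * Compares response annotations against a key (gold standard).
     * Each annotation is encoded as "start:end:type".
     * Returns {precision, recall, f1}.
     */
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<String>(key);
        correct.retainAll(response);  // annotations present in both sets

        double precision = response.isEmpty()
                ? 0.0 : (double) correct.size() / response.size();
        double recall = key.isEmpty()
                ? 0.0 : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<String>(Arrays.asList(
                "0:17:Person", "27:30:Organization", "40:46:Location", "50:54:Date"));
        Set<String> response = new HashSet<String>(Arrays.asList(
                "0:17:Person", "27:30:Organization", "60:65:Person"));
        double[] s = score(key, response);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", s[0], s[1], s[2]);
    }
}
```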

             Regression Test
At corpus level – corpus benchmark tool –
tracking system’s performance over time

             More CREOLE

1.   JAPE, FSTs over annotations
2.   ANNIE, A Nearly-New IE system
3.   DAML+OIL, Protégé, Ontology-Aware IE
4.   Information Retrieval, Lucene
5.   WordNet
6.   Machine Learning support

            FSTs over annotations
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing

 Rule: Company1
 Priority: 25
 (
     ( {Token.orthography == upperInitial} )+
     {Lookup.kind == companyDesignator}
 ):match
 -->
 :match.NamedEntity = { kind=company, rule="Company1" }
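
To make the pattern concrete, here is a plain-Java sketch of what the Company1 rule looks for: one or more upperInitial tokens followed by a company designator. It is not how JAPE is implemented; rule priorities, matching styles and the separate Token and Lookup annotation types are collapsed into two boolean flags per token, and every name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written equivalent of the Company1 pattern, for illustration only. */
final class CompanyRuleSketch {

    static final class Tok {
        final String text;
        final boolean upperInitial;       // stands in for Token.orthography == upperInitial
        final boolean companyDesignator;  // stands in for Lookup.kind == companyDesignator

        Tok(String text, boolean upperInitial, boolean companyDesignator) {
            this.text = text;
            this.upperInitial = upperInitial;
            this.companyDesignator = companyDesignator;
        }
    }

    /** Builds a token; upperInitial means first letter upper case, rest lower case. */
    static Tok tok(String text, boolean designator) {
        boolean upper = text.length() > 0
                && Character.isUpperCase(text.charAt(0))
                && text.substring(1).equals(text.substring(1).toLowerCase());
        return new Tok(text, upper, designator);
    }

    /** Returns the text of the leftmost match, or null if the pattern never matches. */
    static String findCompany(List<Tok> toks) {
        for (int i = 0; i < toks.size(); i++) {
            int j = i;
            while (j < toks.size() && toks.get(j).upperInitial) {
                j++;  // toks[i..j-1] are all upperInitial
                if (j < toks.size() && toks.get(j).companyDesignator) {
                    StringBuilder sb = new StringBuilder();
                    for (int k = i; k <= j; k++) {
                        if (k > i) sb.append(' ');
                        sb.append(toks.get(k).text);
                    }
                    return sb.toString();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Tok> toks = new ArrayList<Tok>();
        toks.add(tok("We", false));
        toks.add(tok("visited", false));
        toks.add(tok("Canon", false));
        toks.add(tok("Europe", false));
        toks.add(tok("Ltd", true));
        toks.add(tok("today", false));
        System.out.println(findCompany(toks));
    }
}
```

The declarative rule expresses in five lines what takes a nested loop by hand, which is the "low-overhead development" argument made on the slide.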
     Info Extraction Components
The ANNIE system – a reusable and easily extendable set of

Populating Ontologies with IE

Protégé and Ontology Management

        Information Retrieval
Currently based on the Lucene IR engine

         WordNet support

     Machine Learning support
• Uses classification.
   [Attr1, Attr2, Attr3, … Attrn] → Class
• Classifies annotations.
   (Documents can be classified as well using a simple
• Annotations of a particular type are selected as
• Attributes refer to instance annotations.
• Attributes have a position relative to the instance
  annotation they refer to.

Attributes can be:
    – Boolean
       The [lack of] presence of an annotation of a
         particular type [partially] overlapping the referred
         instance annotation.
    – Nominal
       The value of a particular feature of the referred
         instance annotation. The complete set of
         acceptable values must be specified a priori.
    – Numeric
       The numeric value (converted from String) of a
         particular feature of the referred instance
Machine Learning PR in GATE.
Has two functioning modes:
   – training
   – application
Uses an XML file for configuration:
  <?xml version="1.0" encoding="windows-1252"?>
       <DATASET> … </DATASET>


Soon: Torch? YASMET? TIMBL?

Attributes Position

                 Instances type: Token

          Standard Use Scenario
Training
• Prepare training annotations.
• Run the ML PR in training mode.
• Export the dataset as .arff and perform experiments using the WEKA
  interface in order to find the best attribute set / algorithm /
  algorithm options.
• Update the configuration file.
• Run the ML PR again to collect the actual data.
• [ Save the learnt model. ]

Application
• [ Load the previously saved model. ]
• Run the ML PR in application mode.
• [ Save the learnt model. ]

        Using Other ML Libraries
The MLEngine Interface

• void addTrainingInstance(List attributes)
      Adds a new training instance to the dataset.
• Object classifyInstance(List attributes)
      Classifies a new instance.
• void init()
      This method will be called after an engine is created and
  has its dataset and options set.
• void setDatasetDefinition(DatasetDefintion definition)
      Sets the definition for the dataset used.
• void setOptions(org.jdom.Element options)
      Sets the options from an XML JDom element.
• void setOwnerPR(ProcessingResource pr)
      Registers the PR using the engine with the engine.
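
A toy engine shows the shape of a wrapper for another ML library. This sketch uses a reduced, hypothetical interface (`SimpleEngine`) with only `init`, `addTrainingInstance` and `classifyInstance`; the three setters listed above are omitted because they depend on GATE and JDOM types. It also assumes, purely for the example, that the class label is the last element of each training attribute list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Majority-class baseline behind a reduced, hypothetical engine interface. */
final class MajorityEngineSketch {

    /** Cut-down stand-in for the engine interface described on the slide. */
    interface SimpleEngine {
        void init();
        void addTrainingInstance(List<Object> attributes);
        Object classifyInstance(List<Object> attributes);
    }

    /** Ignores the attributes and always predicts the most frequent class seen. */
    static final class MajorityClassEngine implements SimpleEngine {
        private Map<Object, Integer> counts;

        public void init() {
            counts = new HashMap<Object, Integer>();
        }

        public void addTrainingInstance(List<Object> attributes) {
            // Assumption: the class label is the last element of the list.
            Object label = attributes.get(attributes.size() - 1);
            Integer seen = counts.get(label);
            counts.put(label, seen == null ? 1 : seen + 1);
        }

        public Object classifyInstance(List<Object> attributes) {
            Object best = null;
            int bestCount = -1;
            for (Map.Entry<Object, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;  // null if nothing has been trained
        }
    }

    public static void main(String[] args) {
        MajorityClassEngine engine = new MajorityClassEngine();
        engine.init();
        engine.addTrainingInstance(Arrays.<Object>asList("upperInitial", "Person"));
        engine.addTrainingInstance(Arrays.<Object>asList("upperInitial", "Person"));
        engine.addTrainingInstance(Arrays.<Object>asList("allCaps", "Organization"));
        System.out.println(engine.classifyInstance(Arrays.<Object>asList("lowercase")));
    }
}
```

A real wrapper would delegate the two instance methods to the external library and translate the attribute list into that library's own instance format.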
  A bit of a nuisance (GATE users)
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Imperial College, London, the University of Manchester, UMIST, the
  University of Karlsruhe, Vassar College, the University of Southern
  California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter,
  EMILLE, Poesia...

GATE team projects.
Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research
  e-science
Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• EMILLE: S. Asian languages corpus
• ACE / TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging
Future:
• Five new projects (below)
                 Where Next (1)?
• Can Universities cope with the long term?
• User survey
• Future context:
   – SEKT: Knowledge Management
   – KnowledgeWeb: OntoWeb II
   – PrestoSpace: audiovisual preservation (FSTs for users?)
   – hTechSight: knowledge portal for petrochemicals
   – ETCSL: Electronic Text Corpus of Sumerian Language
   – DERI: Digital Enterprise Research Institute
   – PhDs: INK, PIE

                 Where Next (2)?
• Some desirables:
   – Corpus tools (ANNIC in progress)
   – Audiovisual documents
   – WS-based backend server, for ML, active learning etc.
   – Better dialogue support (cf. AMITIES, Galaxy)
   – Better MT support
   – PDF documents
   – JAPE debugger, editor, 101 language extensions (e.g.
     quantified ops, deletion, ontology callouts)
   – Cleverer treatment of large documents in the GUI
   – PR reloading

GATE is:
• Addressing the need for scalable, reusable, and portable
  HLT solutions
• Supporting large data, in multiple media, languages, formats,
  and locations
• Lowering the cost of creation of new language processing
  components
• Promoting quantitative evaluation metrics via tools and a
  level playing field
• Promoting experimental repeatability by developing and
  supporting free software

