
                    GRISP: A Massive Multilingual Terminological Database
                          for Scientific and Technical Domains


   Patrice Lopez and Laurent Romary
               INRIA & HUB – IDSL
patrice_lopez@hotmail.com    laurent.romary@inria.fr
                                  Overview

• GRISP (Generic Research Insight in Scientific and technical Publications)
    – Multiple scientific and technical fields
    – Multilingual (en, fr, de)
    – Built from the compilation of open resources
        • Sound conceptual model
        • Mapping across a variety of domains
        • Use of structural constraints
        • Machine learning techniques for controlling the fusion process
    – Our sources: MeSH, UMLS, Specialist Lexicon, Gene Ontology, ChEBI,
      WordNet, WOLF, SUMO, IPC, Wikipedia


    – Result: several million terms, concepts, semantic relations and definitions.
              Why are we doing all this?
• Terminology is the main vehicle by which technical and scientific units of
  knowledge are represented and conveyed (30-80%; Ahmad, 1996)
• Application to a large collection of multilingual and multi-domain patent
  documents
• Two underlying considerations:
    – Cost of manually maintained terminological resources
        • Cf. Biosis, IATE, TermScience
            – Khayari et al., 2006: Modeling the heterogeneity of resources
    – A lot of available resources online, based on heterogeneous
      organizational principles


• Underlying vision: integrating knowledge engineering into current state-of-the-art
  information retrieval and classification systems
     Merging terminological resources
• Related to the fusion of ontologies
   – Ontologies are usually relatively small in size
       • Semi-automatic methods: McGuinness et al., 2000
       • Fully automatic methods
           – Madhavan et al., 2001: exploit structural and linguistic matching
           – Doan et al., 2001: Machine learning techniques (concepts and properties)
           – Gal et al., 2005: fuzzy logic methods

• Existing work on merging classification systems
           – Wang et al., 2008: Merging of subject headings in Digital Libraries

• Automatic merging techniques for heterogeneous
  terminologies have not yet been investigated
   – Much richer linguistic content
   – No formal organization of concepts
       • Do not model facts or assertions
                     A quick reminder
• Terminological resources
  – Approximation of lexical semantics in specialized fields
  – Based on a concept-to-term (onomasiological) model
  – Naturally multilingual (term grouping according to languages)
  – Existing standards
      • ISO 704: editorial principles for building up a terminological resource
      • ISO 16642: Abstract model for representing terminological databases
          – Romary, 2001
      • ISO 30042: A concrete XML syntax (TBX)
  – Note: terminology standards do not standardize terminologies!
           Target terminological model
• Multiple languages
• Multiple terms
   – Variants, abbreviations, inflexions
• Multiple descriptions
   – E.g. multiple definitions, complementing each other
   – Additional information: illustrations, formulae, etc.
• Basic conceptual relations
• Local metadata
   – Provides management information attached to the various terminological
     description levels (e.g. origin, validation level, register)
   – Allows the creation of views (e.g. all MeSH entries; cf. Khayari et al.,
     2006)
• And yes, ISO 16642 (TMF) can do all this!
   – Main issue: identifying the relevant data category in the various source
     terminologies
                 Merging terminologies, merging models

[Diagram: several source TMF models mapped onto a single target model]
                           TMF in a nutshell

[Diagram: the TMF metamodel hierarchy: a Terminological Data Collection (TDC) contains Terminological Entries, each Terminological Entry contains Language Sections, and each Language Section contains Term Sections]

+ any kind of local metadata (origin, certainty, accessibility)
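
As an illustration only, here is a minimal data-structure sketch of this metamodel in Python; the class and field names are assumptions made for readability, not the GRISP or ISO 16642 implementation:

# Minimal sketch of the TMF metamodel hierarchy (ISO 16642). Names are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TermSection:
    term: str
    metadata: Dict[str, str] = field(default_factory=dict)      # e.g. origin, certainty

@dataclass
class LanguageSection:
    lang: str                                                    # e.g. "en", "fr", "de"
    terms: List[TermSection] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

@dataclass
class TerminologicalEntry:                                       # one entry = one concept
    languages: List[LanguageSection] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

@dataclass
class TerminologicalDataCollection:                              # TDC = the whole resource
    entries: List[TerminologicalEntry] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)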
                 Merging terminologies, merging models

[Diagram: the /definition/ data category of each source TMF model is mapped onto the corresponding data category of the target model]
                 Identifying domains

• Theoretical background
  – Non-ambiguity of a term within a domain
  – E.g. 129 domains in MeSH
• GRISP
  – Set of 76 reference domains (see table 1)
      • Scientific and technical domains of Wordnet Domains (Magnini and
        Cavaglià, 2000)
      • Organised as a hierarchy
  – Manual mapping from resource-specific domains to our reference
    set (a fragment is sketched below)
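
A hypothetical fragment of such a mapping table, written as Python for illustration; the category/domain pairs and target labels below are examples in the spirit of WordNet Domains, not the actual 76-domain table:

# Illustrative mapping from resource-specific categories to GRISP reference domains.
# Both the pairs and the target labels are examples, not the real mapping table.
DOMAIN_MAPPING = {
    ("MeSH", "Organic Chemicals"): "chemistry",
    ("MeSH", "Cardiovascular Diseases"): "medicine",
    ("IPC", "C07"): "chemistry",                 # IPC class C07: organic chemistry
    ("Gene Ontology", "molecular_function"): "biology",
}

def reference_domain(source, category):
    """Return the GRISP reference domain mapped to a source-specific category, if any."""
    return DOMAIN_MAPPING.get((source, category))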
                       Merging concepts
• Identification of common concepts across terminological
  sources – core principles
   – Baseline: same term + same domain = same concept
   – Difficulties: Conflicting domain mapping, high polysemy of term
     variants and incorrectly positioned concepts (e.g. Wikipedia)
       • Wrongly merged concepts
       • Loss of precision in concept descriptions
   – Revised: same preferred term + same domain = same concept
   – Source conformance rule: separated concepts in a given source
     cannot be further merged (by transitivity)
       • Not applied to WordNet, IPC and Wikipedia
   – Smoothing the rules with machine learning techniques (a sketch of the base rules follows below)
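
A minimal sketch of these merging rules in Python; the concept attributes (terms, preferred_term, domains, source, source_id) are assumptions made for the example, not the actual GRISP data model:

# Sketch of the rule-based merging decisions described above.
RELAXED_SOURCES = {"WordNet", "IPC", "Wikipedia"}   # source conformance rule not applied here

def merge_baseline(c1, c2):
    """Baseline rule: same term + same domain = same concept."""
    return bool(set(c1.terms) & set(c2.terms)) and bool(set(c1.domains) & set(c2.domains))

def merge_revised(c1, c2):
    """Revised rule: same preferred term + same domain = same concept."""
    return (c1.preferred_term == c2.preferred_term
            and bool(set(c1.domains) & set(c2.domains)))

def source_conformant(c1, c2):
    """Concepts kept separate within one trusted source must not be merged."""
    if c1.source == c2.source and c1.source not in RELAXED_SOURCES:
        return c1.source_id == c2.source_id
    return True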
           Concept merging as a machine learning process

[Diagram: candidate concept pairs drawn from the concept pool are described by feature vectors and passed to a binary merging decision]

      SVM (Support Vector Machine) and MLP (Multi-Layer Perceptron)
                       binary classification models
                           Training process
• Training features (a feature-extraction sketch follows after this list)
       • (f1-2) sources (e.g. S1=“MeSH”, S2=“Wikipedia”)
       • (f3) Number of common domains between the two concepts
       • (f4) Number of same source-specific categorizations
       • (f5) Boolean indicating if both preferred terms are identical
       • (f6) Boolean indicating if both preferred terms are identical after stemming
       • (f7) Ratio of identical terms given all terms
       • (f8) Similarity measure of the definition texts, after stemming and based on negative KL
         divergence
       • (f9) Number of domains of the merged concept
       • (f10) Number of words of the longest common terms

• Training data
   – Wikipedia – MeSH mapping
   – PASCAL database (INIST)
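
A sketch of how such a concept pair could be turned into a feature vector and used to train a classifier, using scikit-learn purely as an illustration; the concept attributes, the stem() placeholder and the source encoding are assumptions, and f8 (definition similarity) is passed in precomputed:

# Illustrative feature vector (f1-f10) for a candidate concept pair.
from sklearn.svm import SVC

SOURCES = ["MeSH", "UMLS", "Gene Ontology", "ChEBI", "WordNet", "IPC", "Wikipedia"]

def stem(s):                          # placeholder for a real stemmer (e.g. Porter)
    return s.lower().rstrip("s")

def pair_features(c1, c2, definition_similarity):
    terms1, terms2 = set(c1.terms), set(c2.terms)
    common_terms = terms1 & terms2
    return [
        SOURCES.index(c1.source), SOURCES.index(c2.source),        # f1-f2: sources
        len(set(c1.domains) & set(c2.domains)),                    # f3: common domains
        len(set(c1.categories) & set(c2.categories)),              # f4: common categorizations
        int(c1.preferred_term == c2.preferred_term),               # f5: identical preferred terms
        int(stem(c1.preferred_term) == stem(c2.preferred_term)),   # f6: identical after stemming
        len(common_terms) / max(len(terms1 | terms2), 1),          # f7: ratio of identical terms
        definition_similarity,                                     # f8: e.g. negative KL divergence
        len(set(c1.domains) | set(c2.domains)),                    # f9: domains of merged concept
        max((len(t.split()) for t in common_terms), default=0),    # f10: longest common term length
    ]

# Training (sketch): X = [pair_features(a, b, sim) for each labelled pair],
# y = 1 for "merge", 0 for "keep separate"; then: clf = SVC(kernel="rbf").fit(X, y)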
                           Result overview
Merger          Concepts     Terms        Sem. Rel.
Aggregation     1,503,818    3,140,726      970,864
Merg. Rule 1    1,457,538    3,157,179    1,022,303
Merg. Rule 2    1,476,508    3,114,711      971,218
SVM             1,450,688    3,195,118    1,088,446
MLP             1,451,710    3,192,325    1,081,955

Overall content:
• 596,865 definitions
• 1,321,988 source-specific categorizations of concepts
• 20,000 acronyms
• 14,268 chemical formulas
• 12,375 chemical structure identifiers


  • Observations:
      – Small number of actual merges (cf. product names, chemical and
        medical entities)
      – Merging relevant for frequently used concepts
                                   Evaluation
Merger              Wiki/MeSH           PASCAL        (cov. = coverage, acc. = accuracy)

Merging Rule 1      cov. 0.6464         cov. 0.5358
                    acc. 0.9497         acc. 0.9371
Merging Rule 2      cov. 0.3607         cov. 0.2735
                    acc. 0.9949         acc. 0.9916
SVM                 cov. 0.8642         cov. 0.6203
                    acc. 0.9698         acc. 0.9522
MLP                 cov. 0.8607         cov. 0.6178
                    acc. 0.9748         acc. 0.9515

•   Random subset of 10% of the merging examples extracted from Wikipedia/MeSH
    mappings and from the PASCAL terminology
•   Merging Rule 2 produces almost perfect merges but with very low coverage
•   Merging Rule 1 extends the coverage at the cost of a relatively high merging-error rate
•   Machine learning approaches further extend the coverage while maintaining high precision
GRISP browser: radial engine

[Screenshots of the GRISP browser]
                    Application: Patatras
• PATATRAS (PATent and Article Tracking, Retrieval and
  AnalysiS)
• Context: CLEF-IP competition
  – Prior art search task (EPO documents)
  – 1.9 million documents in English, French and German (more than 3 billion
    words)
  – Ranked first for all subtasks of the evaluation track among 14 participants
    (Roda et al., 2009)
• Conceptual indexing of the CLEF-IP corpus
  – Development of a term annotator based on GRISP (a simplified sketch follows after this list)
      • Term variant matching after POS + lemmatization
      • Concept disambiguation based on IPC classes of the documents
      • 1.1 million different terms identified
      • 176 million annotations
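
A much simplified sketch of such an annotator: greedy longest-match lookup of GRISP term variants over a lemmatized token sequence, with IPC-based disambiguation only hinted at; the term_index structure and the ipc_classes attribute are assumptions made for the example, not the actual PATATRAS annotator:

# Simplified term annotator over a lemmatized document.
def annotate(lemmas, term_index, doc_ipc_classes=frozenset(), max_len=6):
    """term_index maps tuples of lemmas to lists of candidate GRISP concepts."""
    annotations, i = [], 0
    while i < len(lemmas):
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            candidates = term_index.get(tuple(lemmas[i:i + n]), [])
            # crude disambiguation: keep candidates compatible with the document's IPC classes
            candidates = [c for c in candidates
                          if not doc_ipc_classes or (c.ipc_classes & doc_ipc_classes)]
            if candidates:
                annotations.append((i, i + n, candidates))   # (start, end, concepts)
                i += n
                break
        else:
            i += 1
    return annotations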
                    Results: Patatras

• Significant accuracy improvements for CLEF-IP
   – Combination of word-based and concept-based ranked results with a
     regression model (see the sketch below)




[Chart: accuracy improvements of the combined ranking, based on 10,000 queries]
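
A sketch of one way such a combination could be implemented, here as a linear regression over the two retrieval scores; this is an assumption made for illustration, not the actual PATATRAS combination model:

# Combining word-based and concept-based retrieval scores with a regression model.
import numpy as np
from sklearn.linear_model import LinearRegression

def combine_rankings(word_scores, concept_scores, model):
    """word_scores / concept_scores: dicts mapping doc_id -> score for one query."""
    docs = sorted(set(word_scores) | set(concept_scores))
    X = np.array([[word_scores.get(d, 0.0), concept_scores.get(d, 0.0)] for d in docs])
    combined = model.predict(X)
    return sorted(zip(docs, combined), key=lambda p: p[1], reverse=True)

# Training (sketch): each row of X_train holds [word_score, concept_score] for a
# relevance-judged (query, document) pair, y_train the relevance label; then:
# model = LinearRegression().fit(X_train, y_train)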
                          Epilogue

• Online tool
  – Contact: patrice_lopez@hotmail.com
• Free resource
  – Based on the freely available subset of resources
• Constant evolution
  – Maintenance according to evolution of our sources
  – Addition of further sources

								