Wordnets for information retrieval:
   a hole in one!


            Piek Vossen
            OmniPaper seminar, Leuven
            December 3rd, 2004


18-7-2010                 Irion Technologies (c)
Content
    WordNet, EuroWordNet, Global Wordnet
    Why should we use wordnets?
    How do we use wordnets?
    Why are wordnets not enough?
    Conceptual indexing
    Conceptual matching
    Conceptual dialogue
    Demos


Princeton WordNet

    Developed by George Miller and his team at
     Princeton University, as the implementation
     of a mental model of the lexicon
    Organized around the notion of a synset: a
     set of synonyms in a language that represent
     a single concept
    Semantic relations between concepts
    Currently covers about 100,000 concepts and
     120,000 English words

Wordnet Model
    Concept record   Gloss                     Words (sense number)           Relation
    rec: 12345       financial institute       bank (1)
    rec: 54321       side of a river           bank (2)
    rec: 9876        small string instrument   fiddle, violin (1)             type-of rec: 25876
    rec: 65438       musician playing violin   violin (2), fiddler, violist   type-of rec: 42654
    rec: 42654       musician
    rec: 35576       string of instrument      string (1)                     part-of rec: 25876
    rec: 29551       underwear                 string (2)
    rec: 25876       string instrument
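
The model on this slide (concepts, the words that express them, and relations between concepts) can be sketched as a small data structure. This is a minimal Python illustration, not the actual WordNet database format: the `Synset` class and the `senses` helper are hypothetical names, and the type-of and part-of targets are read off the slide layout rather than stated in it.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    rec: int                                        # concept record number from the slide
    gloss: str
    words: list = field(default_factory=list)       # synonyms that express the concept
    relations: dict = field(default_factory=dict)   # relation name -> target record


SYNSETS = [
    Synset(12345, "financial institute", ["bank"]),
    Synset(54321, "side of a river", ["bank"]),
    Synset(9876, "small string instrument", ["fiddle", "violin"], {"type-of": 25876}),
    Synset(65438, "musician playing violin", ["violin", "fiddler", "violist"], {"type-of": 42654}),
    Synset(42654, "musician"),
    Synset(35576, "string of instrument", ["string"], {"part-of": 25876}),
    Synset(29551, "underwear", ["string"]),
    Synset(25876, "string instrument"),
]


def senses(word):
    """Return every synset that lists `word`: one entry per meaning."""
    return [s for s in SYNSETS if word in s.words]


for s in senses("bank"):
    print(s.rec, s.gloss)
```

An ambiguous word such as "bank" returns two synsets, which is exactly the situation that the later disambiguation steps have to resolve.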




EuroWordNet

    EU project (1996-1999) to develop wordnets
     for 8 European languages
    Each wordnet is linked to the English wordnet
     that functions as an Interlingua
    Cross-lingual wordnet database, where you
     can go from a synset in one language to a
     synset in any other language
    Coverage: 10,000 – 50,000 synsets and up to
     55,000 words in Dutch, German, French,
     Spanish, Italian, Estonian and Czech

 EuroWordNet Model
    [Figure: four language-specific wordnets linked through the Inter-Lingual-Index]
    Language-specific wordnets (each with its own Lexical Items Table):
           English: move, go, ride, drive
           Dutch: bewegen, gaan, rijden, berijden
           Spanish: mover, transitar, cabalgar, jinetear, conducir
           Italian: andare, muoversi, guidare, cavalcare
    Inter-Lingual-Index: ILI-record {drive}
    Language-independent modules:
           Domains: Traffic (Air, Road)
           Ontology: 2OrderEntity (Location, Dynamic)
    Link types:
           I = Language Independent link
           II = Link from Language Specific to Inter-Lingual-Index
           III = Language Dependent Link


    Example of EuroWordNet structure
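    [Figure: semantic network around the concepts "treat" and "operate"]
    Concepts: organism, being, person, doctor, sick person / patient, organ, stomach,
     disease, stomach disease, treat, operate, scalpel, to get well
    Relation labels: Causes, Patient, Agent, Instrument, Involves, Part of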




Global Wordnet Association
    EuroWordNet: English, German, Spanish, French, Italian, Dutch, Czech, Estonian
    BalkaNet: Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian
    Other wordnets: Danish, Norwegian, Swedish, Portuguese, Korean, Russian, Basque,
     Catalan, Thai, Arabic, Polish, Welsh, Chinese, 20 Indian languages,
     Brazilian Portuguese, Hebrew, Latvian, Persian, Kurdish, Avestan, Baluchi,
     Hungarian
    http://www.globalwordnet.org


      Why should we use wordnets?




Why are wordnets not used by Internet
Search Engines?
    Without wordnets recall is very low but this does not
     seem to be a problem:
           There is too much information on the Internet to handle
            anyway;
           There is redundancy of information, i.e. it is expressed in
            any conceivable way and any conceivable language;
           Whatever you type in, you always get many results;
    Google approach:
           All content words should occur (boolean AND);
           Pigeon ranking: pages to which many people link are on
            top, which shows what others know;


Why should wordnets be used?
    Cross-lingual retrieval is not possible unless you
     map words across languages;
    Very specific questions still give no results if the
     query is formulated differently from the answer, e.g.
     Google:
           “evaluate web of concepts for OmniPaper search system”
            (3 results)
           “evaluate web of concepts for OmniPaper search engine”
            (0 results)
    Small-scale indexes have no redundancy, so there will
     be no results for queries formulated differently;
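
The "search system" versus "search engine" problem above can be illustrated with a small synonym expansion. This is a sketch under stated assumptions: the `SYNONYMS` table is a hypothetical stand-in for a wordnet lookup, and treating "system" and "engine" as synonyms is an assumption made only for this example.

```python
from itertools import product

# Hypothetical synonym sets; a real system would read these from a wordnet.
SYNONYMS = {
    "system": {"system", "engine"},
    "engine": {"engine", "system"},
}


def expand(query):
    """Return all query variants obtained by swapping words for their synonyms."""
    options = [sorted(SYNONYMS.get(w, {w})) for w in query.lower().split()]
    return {" ".join(combo) for combo in product(*options)}


def search(query, index):
    """Return the documents that contain at least one variant of the query."""
    variants = expand(query)
    return [doc for doc in index if any(v in doc.lower() for v in variants)]


index = ["Evaluate web of concepts for OmniPaper search system"]
print(search("OmniPaper search engine", index))
```

Without the expansion step the query finds nothing in this one-document index, because the index holds only one formulation of the answer.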

 Language technology: a hole in one!


    [Figure: the query "golf club(s)" enters a funnel labelled with thesaurus and
     semantic network, and "golf clubs" comes out]




      How do we use wordnets?




   How to get the correct recall?
    [Figure: index entries for "cell" matched at different levels of analysis]
    Levels: No NLP; Morphology; Wordnet full expansion
    Index entries: cell [prison], cell [phone], cell [tissue], police cells,
     cell-division, cellular
    Expansions: jail, prison; mobile phone; neuron, growth
    Index disambiguation: Communication, Legal, Biology
    Index & query disambiguation: Biology




      Why are wordnets not
      enough?




Words out of context:

    Traditional search paradigm focuses on
     document/page retrieval and not on phrase
     retrieval:
           Dominant meanings will overrule other meanings:
               “Internet services on Java” gives no results for the island
                Java, only for the software.
           Compositional differences are neglected:
               “toxic medication” versus “medication against
                intoxication”,
               “animal party” versus “party animal”
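
Why compositional differences are neglected can be shown in a few lines. The sketch below compares an order-free bag-of-words view, as used in document retrieval, with an ordered phrase view; the function names are hypothetical.

```python
def bag_of_words(phrase):
    """Order-free representation, as used in document-level retrieval."""
    return frozenset(phrase.lower().split())


def ordered(phrase):
    """Order-preserving representation, as needed for phrase retrieval."""
    return tuple(phrase.lower().split())


a, b = "animal party", "party animal"
print(bag_of_words(a) == bag_of_words(b))  # True: the two phrases look identical
print(ordered(a) == ordered(b))            # False: the phrases differ
```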

Where are we heading?

    There is a growing need for more precision
     and more complex applications to find more
     fine-grained facts regardless of ‘form’
           Information retrieval (IR): documents
             Classification: topics
                  Information extraction (IE): facts
                      Multimodal human-machine interfaces (speech, mobile, chat)
                      Question-answering systems (QA): simple human-machine interface
                      Dialogue systems: iterative human-machine interface
                      Intelligent machines (reason, decisions): intelligent human-machine
                       interface
           Summarization -> Multidoc summaries -> Language generation ->
            Machine translation


Approach
    Multilingual wordnet database and morpho-syntactic
     processing are used to decompose text to concept
     elements:
      -> maximum recall;

    Word-sense-disambiguation at index and query side:
      -> reduce noise;

    Synonym selection:
      -> reduce more noise;

    Match query phrases with document phrases:
      -> match concept combinations in context

    Intelligent dialogues to create context at the user side:
      -> match intended meanings



Cut out the noise from a multilingual
semantic network
    Concept selection
           Assign domain labels and selectional patterns to
            documents and phrases
           Select word meanings within domains and
            patterns
    Synonym selection
           Most frequent synonyms for selected concepts
           Co-occurrence relations
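
Selecting the most frequent synonyms for a concept can be sketched as follows. This is a minimal illustration, assuming that frequency is counted over a tokenized corpus; the `select_synonyms` function, its `top_n` parameter and the sample corpus are hypothetical.

```python
from collections import Counter


def select_synonyms(synonyms, corpus_tokens, top_n=2):
    """Keep only the `top_n` synonyms that occur most often in the corpus."""
    counts = Counter(t for t in corpus_tokens if t in synonyms)
    return [word for word, _ in counts.most_common(top_n)]


corpus = "the pub was full so we went to a bar and then another pub".split()
print(select_synonyms({"pub", "bar", "tavern", "saloon"}, corpus))
```

Synonyms that never occur in the corpus are dropped, so they cannot add noise to the expanded query.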



  Wordnet: Domain information
    Concept record   Gloss                         Words (sense number)   Relation
    rec: 12345       financial institute           bank (1)
    rec: 54321       river side                    bank (2)
    rec: 9876        small string instrument       violin (1)             type-of rec: 25876
    rec: 65438       musician playing a violin     violin (2), violist    type-of rec: 42654
    rec: 42654       musician
    rec: 35576       string of an instrument       string (1)             part-of rec: 25876
    rec: 29551       underwear                     string (2)
    rec: 25876       string instrument
    Domains: Clothing; Culture (Music); Sport (Ball sports, Winter sports); Finance
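
How domain information narrows down the meanings of a word can be sketched as follows. This is a minimal illustration: the domain assigned to each concept is an assumption, because the slide shows the domains but the extracted text does not show which concept links to which domain. The `SENSES` table and the `select_senses` function are hypothetical names.

```python
# (record, gloss, domain) per word sense. The domain labels are assumed.
SENSES = {
    "bank": [(12345, "financial institute", "Finance"),
             (54321, "river side", None)],
    "string": [(35576, "string of an instrument", "Music"),
               (29551, "underwear", "Clothing")],
}


def select_senses(word, domain):
    """Prefer the senses labelled with `domain`; fall back to all senses."""
    candidates = SENSES.get(word, [])
    in_domain = [sense for sense in candidates if sense[2] == domain]
    return in_domain or candidates


print(select_senses("string", "Music"))
```

The fallback keeps every sense when the domain gives no evidence, so that selection reduces noise without losing recall.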




  Domain based concept selection
    [Figure: two sources of training material for domain classification]
    WordNet/Semnet: a domain is a set of concepts; the words of its synsets, glosses
     and examples are exported (e.g. for Sport) to train the TwentyOne classifier
    IST-project MEANING: more contexts + domain; text grouped by domains is used to
     train a text classifier
    Both classify an un-seen document, as input for concept selection:
           Microworld: Sport
           Phrase: financial scandal Juventus
                Nanoworld: Finance
           Phrase: Players boycott the match
                Nanoworld: Sport



When to apply what strategy?
    [Figure: share of word types and word tokens in document(s) by degree of polysemy
     (from 1 up to 40 senses), divided over the Microworld, Nanoworld and Factotum
     strategies]
    Example words:
           ball - goal - game - score
           eat - food
           be - person - have - begin - stop - part

Conceptual Indexing
    Structure: Document (Microworld = sport) > Phrase in context (Nanoworld = finance)
     > Word form1..N > Concept1..N
    Steps:
           Assign domain label to document
           Extract phrases
           Assign domain label to phrase in context
           Normalize word
           Resolve compounds and derivations
           Apply multiword lookup
           Select concepts within Nanoworld & Microworld
           Restrict synonym expansion
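
The indexing steps on this slide can be sketched as a small pipeline. This is a simplified illustration, not the actual indexing system: domain labels are assigned by plain keyword overlap, phrases are split on sentence punctuation, and compound resolution, multiword lookup and synonym expansion are left out. The `DOMAIN_WORDS` and `LEXICON` tables and all function names are hypothetical.

```python
import re

# Hypothetical domain vocabularies and concept lexicon, for illustration only.
DOMAIN_WORDS = {
    "sport": {"match", "players", "juventus", "boycott"},
    "finance": {"financial", "scandal", "bank"},
}
LEXICON = {  # normalized word -> list of (concept id, domain)
    "match": [("match#game", "sport"), ("match#lighter", None)],
    "bank": [("bank#institute", "finance"), ("bank#river", None)],
}


def tokens(text):
    """Normalize words: lower-case, letters only."""
    return re.findall(r"[a-z]+", text.lower())


def classify(words):
    """Pick the domain whose vocabulary overlaps most with `words`, if any."""
    scores = {d: len(vocab & set(words)) for d, vocab in DOMAIN_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def index_document(text):
    microworld = classify(tokens(text))            # domain label for the document
    entries = []
    for phrase in re.split(r"[.!?]\s*", text):     # naive phrase extraction
        words = tokens(phrase)
        if not words:
            continue
        nanoworld = classify(words) or microworld  # domain label for the phrase
        concepts = []
        for word in words:
            senses = LEXICON.get(word, [])
            kept = [c for c, d in senses
                    if d is not None and d in (nanoworld, microworld)]
            concepts.extend(kept or [c for c, _ in senses])
        entries.append((phrase.strip(), nanoworld, concepts))
    return microworld, entries


print(index_document("Financial scandal at Juventus. Players boycott the match."))
```

In this example the document as a whole is labelled sport, while the first phrase is labelled finance, which mirrors the Microworld and Nanoworld distinction.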


Conceptual query analysis

    Structure: Query (Nanoworld = finance) > Word form1..N > Concept1..N
    Steps:
           Assign domain label to query
           Normalize word
           Resolve compounds and derivations
           Apply multiword lookup
           Select concepts within Nanoworld




   Conceptual matching
    [Figure: the concepts of a document phrase (Document: Microworld = sport; Phrase:
     Nanoworld = finance; Word form1..N; Concept1..N) are matched against the concepts
     of the query (Query: Nanoworld = finance; ?Context; Word form1..N; Concept1..N)]


Phrase-score:
 • number matching concepts
 • matching nanoworlds
 • matching nanoworld-microworlds
 • fuzzy word match: potatos, potatoes, Afganistan & afghanistan;
   café, cafe, Café, CaFé, CAFÉ, café-noir
 • flexion and derivation: depart, departure, departures, departing, departings
 • multiwords and compounds: mensenrechtenactivistenleider, human rights
 • original word, synonym or translation: café, pub, bar, coffee shop, tea room;
   United States of America, US, USA, VS, Amerika; Pays-Bas, Holland, the Netherlands
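
Two ingredients of the phrase score can be sketched as follows: folding case and accents so that variants such as "café" and "CAFÉ" match, and counting shared concepts with a bonus for matching nanoworlds. The weighting is an assumption, since the slide only lists the ingredients. The function names and the `nanoworld_bonus` parameter are hypothetical.

```python
import unicodedata


def normalize(word):
    """Fold case and strip diacritics, so that 'CAFÉ' and 'café' both give 'cafe'."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def phrase_score(query, phrase, nanoworld_bonus=1.0):
    """Number of shared concepts, plus a bonus when the nanoworlds match.

    The weighting is an assumption made for this sketch.
    """
    shared = set(query["concepts"]) & set(phrase["concepts"])
    score = float(len(shared))
    if shared and query["nanoworld"] == phrase["nanoworld"]:
        score += nanoworld_bonus
    return score


query = {"nanoworld": "finance", "concepts": ["scandal", "club"]}
phrase = {"nanoworld": "finance", "concepts": ["scandal", "player"]}
print(normalize("CAFÉ"), phrase_score(query, phrase))
```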

How to create more context?

    Replace the front-end by an intelligent dialogue
     system;
    Users are invited to ask questions in Natural
     Language;
    The system uses the linguistic structure to infer
     valuable information about information states;
    The system evaluates the answers (results);
    Context history is built up and used to find more
     precise results or adjust results;


  Conceptual Dialogue system
    Dialogue:
      • What can I do for you?
      • I want to book a holiday.
      • Can you provide me with more details?
      • Nice apartment with swimming pool.
      • There are two arrangements that might be what you are looking for.
        Have a look at F or G.
      • No, I would like something near the sea!
      • Perhaps H and I are a better option?
      • Do you also have fly & drive arrangements?
      • Yes but not within your first selection.
      • And without swimming pool?
      • Please have a look at E
    [Figure: the Dialogue Manager uses a Classifier Engine and a Retrieval Engine over
     the arrangements A to I, grouped under Active holidays, Winter holidays,
     Apartments, Fly & Drive, Summer holidays and Camping]
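
How a context history can refine results over several turns can be sketched as follows. This is a toy illustration, not the dialogue system itself: the features per arrangement are invented so that the example runs, the slide only names the arrangements, and the ranking that makes the real system propose F or G first is not reproduced. The `Dialogue` class and its `tell` method are hypothetical.

```python
# Hypothetical arrangement features; the slide only names arrangements A to I.
ARRANGEMENTS = {
    "E": {"apartment", "sea"},
    "F": {"apartment", "pool"},
    "G": {"apartment", "pool"},
    "H": {"apartment", "pool", "sea"},
    "I": {"apartment", "pool", "sea"},
}


class Dialogue:
    def __init__(self):
        self.required = set()   # context history: features the user asked for
        self.excluded = set()   # features the user rejected

    def tell(self, require=(), exclude=()):
        """Add a user turn to the context and return the matching arrangements."""
        self.required |= set(require)
        self.excluded |= set(exclude)
        self.required -= self.excluded
        return sorted(
            name for name, feats in ARRANGEMENTS.items()
            if self.required <= feats and not (self.excluded & feats)
        )


d = Dialogue()
print(d.tell(require=["apartment", "pool"]))  # first selection
print(d.tell(require=["sea"]))                # "something near the sea"
print(d.tell(exclude=["pool"]))               # "and without swimming pool?"
```

Each turn only adds to or corrects the stored context, so the user never has to restate the full request.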


Demos

    Cross-lingual retrieval where queries in 6
     languages can be matched with a conceptual
     index
    Dutch dialogue system, where the complete
     context is used to guide users to information
     step by step




Cross-lingual retrieval system

    Antonya: portal of environmental information
    More than 3000 URLs crawled (mostly in the
     Netherlands)
    Indexing languages: English, German,
     French, Dutch, Spanish, Italian
    Search languages: English, German, French,
     Dutch, Spanish, Italian
    http://www.antonya.net

Conceptual Dialogue system

    Service desk for the city of Nijmegen
    256 products on their website, which are
     services for citizens
    Dialogue system to analyse user queries and
     evaluate information states (33 different
     states)
    Classification system trained with the
     documents to find and retrieve answers
    http://kundera.irion.nl/burgerloket/

            Thank you for your attention!





								