Query Expansion


                      Joo-Eun Feit

                     January 8, 2010

Wikipedia Mining                       1

                   A knowledge-based search engine powered by Wikipedia
                   A Wikipedia-Based Multilingual Retrieval Model

                             Query Expansion: Why?

                   → low recall
                   → low precision
                   leads to term mismatch problem

         A Knowledge-Based Search Engine Powered
                       by Wikipedia
                   Milne, D.N. and Witten, I.H. and Nichols, D.M.

      A Knowledge-Based Search Engine Powered by Wikipedia

                   Search interface Koru
                   Helps users refine their search queries interactively
                   Wikipedia-based
                       manually crafted documents

              Wikipedia as Knowledge Base: Wikisauri

                   Generic Thesaurus
                   Automatically derived
                   Wikipedia’s articles as concepts
                   Wikipedia’s hyperlinks describe the relations between the concepts
                   Using disambiguation pages and redirects

           Wikisauri: Measuring Semantic Relatedness

                   Quantifies the strength of relatedness between two Wikipedia articles
                   Weighting and comparing hyperlinks
                   Using the vector space model


        Weighted value w for the link a→b:
                   $w(a \to b) = |a \to b| \times \log\left( \frac{t}{\sum_{x=1}^{t} |x \to b|} \right)$
                   t: total number of articles in Wikipedia
        The vectors for two articles x and y are given by:
                   $x = (w(x \to l_1), w(x \to l_2), \ldots, w(x \to l_n))$
                   $y = (w(y \to l_1), w(y \to l_2), \ldots, w(y \to l_n))$
                   $\{l_i \mid i = 1, \ldots, n\}$: the set of all links within articles x and y
        Determine the strength of relatedness:
                   Summing up the weights of the links that are common to both articles
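
        The link weighting and the relatedness score above can be sketched in a few
        lines of Python. This is only an illustration under stated assumptions: the
        garbled weight formula is read as |a → b| × log(t / Σ |x → b|), |a → b| is
        taken to be the number of times a links to b, and "summing up the weights of
        common links" is read as adding both articles' weights for every shared link
        target. The toy graph LINKS and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy link graph: LINKS[a][b] = number of times article a links to b.
LINKS = {
    "Jaguar": Counter({"Cat": 2, "Animal": 1}),
    "Lion": Counter({"Cat": 1, "Animal": 1, "Africa": 1}),
    "Car": Counter({"Engine": 3}),
    "Cat": Counter({"Animal": 1}),
}


def article_count(links):
    """t: every article that appears as a link source or a link target."""
    articles = set(links)
    for outgoing in links.values():
        articles.update(outgoing)
    return len(articles)


def link_weight(links, a, b):
    """w(a -> b) = |a -> b| * log(t / sum over x of |x -> b|) (assumed reading)."""
    count = links.get(a, Counter())[b]
    if count == 0:
        return 0.0
    inbound = sum(outgoing[b] for outgoing in links.values())
    # Note: the weight turns negative if b receives more links than there are articles.
    return count * math.log(article_count(links) / inbound)


def relatedness(links, x, y):
    """Sum the weights of the links that x and y have in common."""
    common = set(links.get(x, {})) & set(links.get(y, {}))
    return sum(link_weight(links, x, l) + link_weight(links, y, l) for l in common)


print(relatedness(LINKS, "Jaguar", "Lion"))  # shares "Cat" and "Animal"
print(relatedness(LINKS, "Jaguar", "Car"))   # no shared link targets
```

        Links to rarely referenced articles contribute more than links to articles
        that almost everything points to, which is the intuition behind the
        logarithmic factor.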

                            Wikisauri: Choosing Terms

                   Corpus-specific thesauri
                   Choosing which terms should be included
                   → Parsing document collection
                   → Every noun/noun phrase serves as a potential term
                   → If there is a matching Wikipedia article the article will be
                   included in the list of concepts
                   Disambiguation of terms
                   → Using whole sentence as context
                   → Measuring the semantic relatedness of the nouns
                   → If a sentence does not allow a disambiguation, the
                   surrounding paragraph serves as context
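
        The disambiguation step can be illustrated with a small Python sketch. It
        assumes that a relatedness score between Wikipedia articles is already
        available; the lookup table RELATEDNESS, the sense inventory SENSES and the
        function names are hypothetical and only stand in for the real data. A term
        is resolved against the sentence first, and only if the sentence does not
        single out one sense is the surrounding paragraph used.

```python
# Hypothetical relatedness scores between pairs of Wikipedia articles.
RELATEDNESS = {
    frozenset({"Jaguar (animal)", "Rainforest"}): 0.8,
    frozenset({"Jaguar Cars", "Engine"}): 0.9,
}

# Hypothetical candidate articles (senses) for an ambiguous term.
SENSES = {"jaguar": ["Jaguar (animal)", "Jaguar Cars"]}


def relatedness(a, b):
    """Look up the relatedness of two articles; unknown pairs score 0."""
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def best_sense(term, context):
    """Return the sense most related to the context, or None if it is a tie."""
    scores = {s: sum(relatedness(s, c) for c in context) for s in SENSES[term]}
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] == ranked[1]:
        return None  # the context does not allow a disambiguation
    return max(scores, key=scores.get)


def disambiguate(term, sentence_context, paragraph_context):
    """Try the sentence first, then fall back to the surrounding paragraph."""
    return best_sense(term, sentence_context) or best_sense(term, paragraph_context)


print(disambiguate("jaguar", ["Rainforest"], ["Engine"]))
print(disambiguate("jaguar", ["Speed"], ["Speed", "Engine"]))
```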

                   Wikisauri: Construction

                                Wikisauri: Relations

            1      Synonyms
                   → Redirect links
            2      Is-a-related-topic
                   → Weighted hyperlinks

                            Weighting Topics and Relations

        Every topic in Wikisauri is weighted by:
                        $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$
                        $idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$

        Every relation is weighted by:
                   The previously mentioned semantic relatedness measure
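
        The topic weighting can be illustrated with standard tf-idf in Python. This
        is a minimal sketch, assuming each document is given as a list of the topics
        it mentions and that tf and idf are multiplied as usual; the sample documents
        and function names are hypothetical.

```python
import math


def tf(topic, doc):
    """Term frequency: occurrences of the topic divided by all topic occurrences."""
    return doc.count(topic) / len(doc) if doc else 0.0


def idf(topic, docs):
    """Inverse document frequency: log(|D| / number of documents containing the topic)."""
    df = sum(1 for d in docs if topic in d)
    # Assumption: a topic that occurs in no document gets weight 0.
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(topic, doc, docs):
    return tf(topic, doc) * idf(topic, docs)


# Hypothetical documents, each reduced to the topics found in it.
DOCS = [
    ["Koru", "Wikipedia", "Koru"],
    ["Wikipedia", "Search engine"],
    ["Wikipedia"],
]

print(tf_idf("Koru", DOCS[0], DOCS))       # frequent here, rare elsewhere
print(tf_idf("Wikipedia", DOCS[0], DOCS))  # occurs in every document
```

        A topic that occurs in every document receives an idf of zero and therefore
        does not help to distinguish documents.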


                   Comparison of Koru against another version of Koru without [...]
                   12 participants were observed while interacting with the two versions
                   Every participant had to perform five tasks with Koru (topic
                   browsing) and the lighter version of Koru (keyword searching)
                   The participants had to judge whether the retrieved
                   documents were strongly relevant, weakly relevant or [...]

                   Example Retrieval Task

         Wikipedia-Based Multilingual Retrieval Model
                   Potthast, M. and Stein, B. and Anderka, M.

         Wikipedia-Based Multilingual Retrieval Model

                   Based on Cross-Language Explicit Semantic Analysis (CL-ESA)
                   Measures the cross-language similarity of concepts
                   Explicit: concept-driven, predetermined
                   Wikipedia as comparable corpus

               Monolingual Explicit Semantic Analysis

                   D* → Document collection
                       Broad range
                       Reasonable length
                   ϕ → Cosine similarity measure
                       $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
                   d → n-dimensional concept vector
                   n → size of D*
                   v → vector space model representation of d ∈ D*

                         ESA : Document Representation

                   An n-dimensional concept vector d for a document d reflects
                   its similarity to each document in D*
                   Collection-relative representation
                   d = (ϕ(v, v1 * ), . . . , ϕ(v, vn * ))
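
        The collection-relative representation can be sketched in Python as follows.
        The example assumes raw term counts as the vector space model and a tiny,
        made-up index collection INDEX in place of D*; a real system would more
        likely use tf-idf weights and Wikipedia articles. All names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as term -> weight mappings."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bag_of_words(text):
    """Vector space model representation v (assumption: raw term counts)."""
    return Counter(text.lower().split())


# Hypothetical index collection D* with n = 3 documents.
INDEX = [
    "cats are small felines",
    "cars have engines and wheels",
    "felines hunt at night",
]


def concept_vector(text, index_docs):
    """d = (phi(v, v1*), ..., phi(v, vn*)): similarity of the text to each index document."""
    v = bag_of_words(text)
    return [cosine(v, bag_of_words(doc)) for doc in index_docs]


print(concept_vector("felines hunt", INDEX))
```

        The resulting vector has one dimension per index document, so its length
        depends on the size of D* and not on the vocabulary.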

                         ESA : Similarity of Documents

                   Cosine similarity of two concept vectors

                   ϕ(d, d’ )

             Cross Language-Explicit Semantic Analysis

                   L = {L1, …, Lm} → set of languages
                   Di* → document collection of language i
                   C = {c1, …, cn} → set of concept descriptors
                   D* = {D1*, …, Dm*} → set of document collections
                        Concept-aligned comparable corpus
                        The i-th document of every document collection in D*
                        describes the same concept ci

                              CL-ESA cont’d

        Each document written in language L is represented as an ESA
        vector by using the document collection of the same language L.
        The similarity of two documents d and d’ written in different
        languages is computed by ϕ(d, d’).
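
        The cross-language comparison can be sketched in Python under a few
        assumptions: the concept-aligned collections are tiny made-up word lists
        instead of Wikipedia articles, documents are represented by raw term counts,
        and all names are hypothetical. Each text is mapped to an ESA vector using
        the collection of its own language, and the two ESA vectors are then
        compared directly, because dimension i refers to the same concept in both
        languages.

```python
import math
from collections import Counter

# Hypothetical concept-aligned collections: the i-th entry of each language
# describes the same concept (0: cat, 1: car).
INDEX = {
    "en": ["cat feline animal pet", "car engine vehicle road"],
    "de": ["katze tier haustier", "auto motor fahrzeug strasse"],
}


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as key -> weight mappings."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def esa_vector(text, lang):
    """Concept vector of a text, built with the index collection of its own language."""
    v = Counter(text.lower().split())
    return {i: cosine(v, Counter(doc.split())) for i, doc in enumerate(INDEX[lang])}


def cross_language_similarity(text_a, lang_a, text_b, lang_b):
    """phi(d, d'): cosine similarity of the two language-specific ESA vectors."""
    return cosine(esa_vector(text_a, lang_a), esa_vector(text_b, lang_b))


print(cross_language_similarity("the cat is a pet", "en", "die katze ist ein haustier", "de"))
print(cross_language_similarity("the cat is a pet", "en", "das auto hat einen motor", "de"))
```

        No translation step is involved; the alignment of the index collections
        alone makes the vectors comparable.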

                       Evaluation : Two Test Collections

                   1000 translation-aligned documents from a parallel corpus
                   1000 concept-aligned documents
                   1000 not concept-aligned documents, no language link
                   → Each for English and German

                                  Experiment 1 & 2

        Experiment 1: Cross-Language Ranking
        Given an aligned document d, its cross-language concept-aligned or
        translation-aligned counterpart is ranked as the most similar
        document in 91% of cases and ranks in the top ten in > [...]
        Experiment 2: Bilingual Rank Correlation
        Two kinds of rankings:
            1      Cross-language ranking with respect to concept c1
            2      Mono-language ranking with respect to concept c1
        The two rankings are compared using a rank correlation
        coefficient, e.g. Spearman’s ρ; their correlation is 0.76
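
        How two rankings are compared with Spearman’s ρ can be shown with a short
        Python sketch. It assumes both rankings contain the same documents without
        ties, so the closed form ρ = 1 − 6 Σ d² / (n (n² − 1)) applies; the document
        identifiers are hypothetical and the example does not reproduce the reported
        value of 0.76.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items, each exactly once")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two ranked items")
    position_b = {doc: rank for rank, doc in enumerate(ranking_b)}
    # Sum of squared rank differences over all items.
    d_squared = sum((rank - position_b[doc]) ** 2 for rank, doc in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings of the same four documents with respect to one concept.
cross_language = ["d1", "d2", "d3", "d4"]
mono_language = ["d2", "d1", "d3", "d4"]

print(spearman_rho(cross_language, mono_language))
```

        A value of 1 means both rankings agree completely, and a value of −1 means
        one ranking is the exact reverse of the other.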

                                      Experiment 3

        Cross-language Similarity Distribution
                   Shows the distribution of pairwise similarities of
                   concept-aligned and translation-aligned documents
                   On average, high similarities for both kinds of aligned documents

        Two kinds of similarity measure:
                   Both use vector space models
                   Different ways to compute the angle between vectors and to
                   represent the content of Wikipedia articles
            1      A knowledge-based search engine powered by Wikipedia
                       Vector: Based on weighted hyperlinks
                       Similarity Measure: Sums up the weights of common links
            2      A Wikipedia-Based Multilingual Retrieval Model
                       Vector: Reflects the similarity of a document to each
                       document in the index collection
                       Similarity Measure: Uses the cosine similarity measure

                   Thank you!


               E. Gabrilovich and S. Markovitch.
               Computing semantic relatedness using Wikipedia-based explicit semantic analysis.
               In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 6–12, 2007.

               D. Milne.
               Computing semantic relatedness using Wikipedia link structure.
               In Proceedings of the New Zealand Computer Science Research Student Conference, NZ CSRSC, volume 7.
               Citeseer, 2007.

               D. Milne, I.H. Witten, and D.M. Nichols.
               Extracting corpus-specific knowledge bases from Wikipedia.
               Technical report, Citeseer, 2007.

               D.N. Milne, I.H. Witten, and D.M. Nichols.
               A knowledge-based search engine powered by Wikipedia.
               In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management,
               pages 445–454. ACM, 2007.

               M. Potthast, B. Stein, and M. Anderka.
               A Wikipedia-based multilingual retrieval model.
               Lecture Notes in Computer Science, 4956:522–530, 2008.

               P. Sorg and P. Cimiano.
               Cross-lingual information retrieval with explicit semantic analysis.
               In Working Notes of the Annual CLEF Meeting, 2008.
