					                   Query Expansion

                      Joo-Eun Feit


                     January 8, 2010




Wikipedia Mining                       1
                                         Outline




                   Introduction
                   A knowledge-based search engine powered by Wikipedia
                   A Wikipedia-Based Multilingual Retrieval Model




                             Query Expansion: Why?




                   Synonymy
                   → low recall
                   Polysemy
                   → low precision
                   leads to term mismatch problem




         A Knowledge-Based Search Engine Powered
                       by Wikipedia
                   Milne, D.N. and Witten, I.H. and Nichols, D.M.




      A Knowledge-Based Search Engine Powered by
                      Wikipedia



                   Search interface Koru
                   Helps users refine their search queries interactively
                   Wikipedia-based
                       domain-independent
                       manually crafted documents




              Wikipedia as Knowledge Base: Wikisauri



                   Generic Thesaurus
                   Automatically derived
                   Wikipedia’s articles as concepts
                   Wikipedia’s hyperlinks describe the relation between the
                   concepts
                   Using disambiguation pages and redirects




           Wikisauri: Measuring Semantic Relatedness




                   Quantifies the strength of relatedness between two Wikipedia
                   articles
                   Weights and compares hyperlinks
                   Uses the vector space model




                                           Definitions

        Weighted value w for the link a → b:
                   $w(a \to b) = |a \to b| \times \log\left( \sum_{x=1}^{t} \frac{t}{|x \to b|} \right)$
                   t: total number of articles in Wikipedia
        The vectors for articles x and y are given by:
                   $x = (w(x \to l_1), w(x \to l_2), \ldots, w(x \to l_n))$
                   $y = (w(y \to l_1), w(y \to l_2), \ldots, w(y \to l_n))$
                   n: number of all links within articles x and y, $\{l_i \mid i = 1, \ldots, n\}$
        Determine the strength of relatedness:
                   Sum up the weights of the links that are common to both
                   articles
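
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in Koru: the names `link_weights` and `relatedness` and the toy link graph are hypothetical, |a → b| is taken to be the number of times a links to b, articles that never link to b are skipped in the sum, and "summing up the weights of common links" is read as adding both articles' weights for every shared link target.

```python
import math
from collections import Counter


def link_weights(links_from, source):
    """Weights w(source -> b) for every article b that `source` links to.

    links_from maps each article to the list of articles it links to
    (repeated entries mean repeated links). Hypothetical data layout.
    """
    t = len(links_from)  # total number of articles
    counts = {a: Counter(targets) for a, targets in links_from.items()}
    weights = {}
    for b, n_ab in counts[source].items():
        # Sum t / |x -> b| over all articles x that link to b at least once
        # (assumption: articles without a link to b are skipped).
        total = sum(t / c[b] for c in counts.values() if c[b] > 0)
        weights[b] = n_ab * math.log(total)
    return weights


def relatedness(weights_x, weights_y):
    """Sum the weights of the links that occur in both articles."""
    common = weights_x.keys() & weights_y.keys()
    # Assumption: both articles' weights are added for each common link.
    return sum(weights_x[l] + weights_y[l] for l in common)


# Toy link graph (hypothetical): A and B both link to C.
links = {"A": ["C", "D"], "B": ["C"], "C": [], "D": ["C"]}
score = relatedness(link_weights(links, "A"), link_weights(links, "B"))
```

With this toy graph, A and B share only the link to C, so their score consists of the two weights for that link.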



                            Wikisauri: Choosing Terms

                   Corpus specific thesauri
                   Choosing which terms should be included
                   → Parsing document collection
                   → Every noun/noun phrase serves as a potential term
                   → If there is a matching Wikipedia article, it is
                   included in the list of concepts
                   Disambiguation of terms
                   → Using the whole sentence as context
                   → Measuring the semantic relatedness of the nouns
                   → If a sentence is not sufficient for disambiguation, the
                   surrounding paragraph serves as context



                   Wikisauri: Construction




                                Wikisauri: Relations




            1      Synonyms
                   → Redirect links
            2      Is-a-related-topic
                   → Weighted hyperlinks




                            Weighting Topics and Relations



        Every topic in Wikisauri is weighted by:
                   tf-idf
                        $\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$
                        $\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$

                   The previously mentioned semantic relatedness measure
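
As a worked illustration of the tf-idf weighting, the following sketch computes tf and idf for a tiny tokenized collection. The function name `tf_idf` and the example documents are hypothetical; the natural logarithm is assumed, since the slide does not state the base.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf} dict per document.

    docs is a list of token lists (hypothetical input format).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # sum over k of n_{k,j}
        scores.append({
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return scores


docs = [["wiki", "search", "wiki"], ["search", "engine"]]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "search" here, gets an idf of log(1) = 0 and therefore a weight of zero.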




                                        Evaluation


                   Comparison of Koru against a version of Koru without the
                   thesaurus
                   12 participants were observed while interacting with the two
                   systems
                   Every participant had to perform five tasks with Koru (topic
                   browsing) and the lighter version of Koru (keyword searching)
                   The participants had to judge whether the retrieved
                   documents were strongly relevant, weakly relevant, or
                   irrelevant




                   Example Retrieval Task




                   Results




         Wikipedia-Based Multilingual Retrieval Model
                   Potthast, M. and Stein, B. and Anderka, M.




         Wikipedia-Based Multilingual Retrieval Model




                   Based on Cross-Language Explicit Semantic Analysis (CL-ESA)
                   Measures the cross language similarity of concepts
                   Explicit : concept-driven, predetermined
                   Wikipedia as comparable corpus




               Monolingual Explicit Semantic Analysis


                   D* → Document collection
                       Broad range
                       Reasonable length
                   ϕ → Cosine similarity measure
                       $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
                   d → n-dimensional concept vector
                   n → size of D*
                   v → vector space model representation of d ∈ D*




                         ESA : Document Representation




                   An n-dimensional concept vector d for a document d reflects
                   its similarity to each document in D*
                   Collection-relative representation
                   $\mathbf{d} = (\varphi(\mathbf{v}, \mathbf{v}_1^*), \ldots, \varphi(\mathbf{v}, \mathbf{v}_n^*))$
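
The collection-relative representation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `cosine` and `concept_vector` and the two-document index collection are hypothetical, and raw term counts stand in for the vector space model representation, which in practice would typically use a weighting such as tf-idf.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def concept_vector(doc_tokens, index_docs):
    """n-dimensional concept vector: similarity to each index document.

    Assumption: raw term counts serve as the vector space model.
    """
    v = Counter(doc_tokens)
    return [cosine(v, Counter(d)) for d in index_docs]


# Toy index collection D* with n = 2 documents (hypothetical).
index_docs = [["cat", "pet"], ["car", "road"]]
d = concept_vector(["cat", "cat"], index_docs)
```

Each component of `d` says how similar the document is to one index document, so the vector has as many dimensions as D* has documents.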




                         ESA : Similarity of Documents




                   Cosine similarity of two concept vectors

                   ϕ(d, d')




             Cross-Language Explicit Semantic Analysis



                   $L = \{L_1, \ldots, L_m\}$ → set of languages
                   $D_i^*$ → document collection of language $L_i$
                   $C = \{c_1, \ldots, c_n\}$ → set of concept descriptors
                   $D^* = \{D_1^*, \ldots, D_m^*\}$ → set of document collections
                        Concept-aligned comparable corpus
                        The j-th document of every document collection $D_i^* \in D^*$
                        describes the same concept $c_j$




                              CL-ESA cont’d




        Each document written in language L is represented as an ESA
        vector by using the document collection of the same language L.
        The similarity of two documents d and d' written in different
        languages is computed by ϕ(d, d').
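
A minimal sketch of this cross-language comparison, assuming two tiny concept-aligned index collections in which position i describes the same concept in both languages. All names (`esa_vector`, `cl_esa_similarity`) and the toy data are hypothetical, and raw term counts again stand in for a proper vector space model.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def esa_vector(tokens, index_collection):
    """ESA vector: similarity of a document to each index document
    of its own language, keyed by concept position."""
    v = Counter(tokens)
    return {i: cosine(v, Counter(d)) for i, d in enumerate(index_collection)}


def cl_esa_similarity(doc_a, index_a, doc_b, index_b):
    """Compare two documents in different languages via their ESA vectors."""
    return cosine(esa_vector(doc_a, index_a), esa_vector(doc_b, index_b))


# Concept-aligned toy collections (hypothetical): position i = same concept.
index_en = [["cat", "pet"], ["car", "road"]]
index_de = [["katze", "haustier"], ["auto", "strasse"]]
same = cl_esa_similarity(["cat"], index_en, ["katze"], index_de)
different = cl_esa_similarity(["cat"], index_en, ["auto"], index_de)
```

The two documents never share a word. They become comparable only because both ESA vectors are indexed by the same aligned concepts.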




                       Evaluation : Two Test Collections




                   1000 translation-aligned documents from a parallel corpus
                   (JRC-Acquis)
                   1000 concept-aligned documents
                   1000 non-concept-aligned documents (no language link)
                   → Each for English and German




                                  Experiment 1 & 2

        Experiment 1: Cross-Language Ranking
        Given an aligned document d, its cross-language concept-aligned or
        translation-aligned counterpart is ranked as the most similar
        document in 91% of cases and within the top ten in more than
        99% of cases
        Experiment 2: Bilingual Rank Correlation
        Two kinds of rankings:
            1      Cross-language ranking with respect to concept c1
            2      Mono-language ranking with respect to concept c1
        The two rankings are compared using a rank correlation
        coefficient, e.g. Spearman's ρ; their correlation is 0.76
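
To make the rank comparison concrete, here is a small sketch of Spearman's ρ for two lists of similarity scores over the same documents. It uses the textbook formula for rankings without ties, which is an assumption; the function names and scores are hypothetical and are not data from the experiment.

```python
def ranks(scores):
    """Rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(scores_a, scores_b):
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(scores_a)
    ra, rb = ranks(scores_a), ranks(scores_b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical similarity scores of three documents under two rankings.
cross_language = [0.9, 0.5, 0.1]
mono_language = [0.8, 0.6, 0.2]
rho = spearman_rho(cross_language, mono_language)
```

Identical orderings give ρ = 1 and exactly reversed orderings give ρ = -1, so the reported 0.76 indicates that the two rankings agree strongly but not perfectly.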


                                      Experiment 3




        Cross-language Similarity Distribution
                   Shows the distribution of pairwise similarities of
                   concept-aligned and translation-aligned documents
                   On average, high similarities for both kinds of aligned
                   documents




                   Results




                                        Conclusion
        Two kinds of similarity measures:
                   Both use vector space models
                   Different ways to compute the angle between vectors and to
                   represent the content of Wikipedia articles
            1      A knowledge-based search engine powered by Wikipedia
                       Monolingual
                       Vector: Based on weighted hyperlinks
                       Similarity measure: Sums up the weights of common
                       hyperlinks
            2      A Wikipedia-Based Multilingual Retrieval Model
                       Multilingual
                       Vector: Reflects the similarity of a document to each
                       document in the collection D*
                       Similarity measure: Uses the cosine similarity measure


                   Thank you!




                                                   Bibliography


               E. Gabrilovich and S. Markovitch.
               Computing semantic relatedness using Wikipedia-based explicit semantic analysis.
               In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 6–12, 2007.

               D. Milne.
               Computing semantic relatedness using Wikipedia link structure.
               In Proceedings of the New Zealand Computer Science Research Student Conference, NZ CSRSC, volume 7.
               Citeseer, 2007.

               D. Milne, I.H. Witten, and D.M. Nichols.
               Extracting corpus specific knowledge bases from Wikipedia.
               Technical report, Citeseer, 2007.

               D.N. Milne, I.H. Witten, and D.M. Nichols.
               A knowledge-based search engine powered by Wikipedia.
               In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management,
               pages 445–454. ACM, 2007.

               M. Potthast, B. Stein, and M. Anderka.
               A Wikipedia-based multilingual retrieval model.
               Lecture Notes in Computer Science, 4956:522–530, 2008.

               P. Sorg and P. Cimiano.
               Cross-lingual information retrieval with explicit semantic analysis.
               In Working Notes of the Annual CLEF Meeting, 2008.

