					Evaluating XML retrieval:
   The INEX initiative
          Mounia Lalmas
   Queen Mary University of London

 Information retrieval
 (Content-oriented) XML retrieval

 Evaluating information retrieval
 Evaluating XML retrieval: INEX
           Information retrieval

Example of a user information need:

  “Find all documents about sailing charter agencies that
  (1) offer sailing boats in the Greek islands, and (2) are
  registered with the RYA. The documents should contain
  boat specification, price per week, e-mail and other
  contact details.”

A formal representation of an information need constitutes
a query
         Information retrieval

IR is concerned with the representation,
storage, and organisation of, and access to,
repositories of information, usually in the
form of documents.

Primary goal of an IR system
   “Retrieve all the documents which are relevant
   (useful) to a user query, while retrieving as few
   non-relevant documents as possible.”
  Conceptual model for IR

      Documents                                Query

   Indexing                                   Formulation

Document representation                 Query representation

                   Retrieval function

                    Retrieval results                     feedback
Structured Document Retrieval
 Traditional IR is about finding documents relevant to a user’s
  information need, e.g. an entire book.

 SDR allows users to retrieve document components that are more
  focused on their information needs, e.g. a chapter of a book
  instead of an entire book.

 The structure of documents is exploited to identify which document
  components to retrieve.
                Structured Documents
 Linear order of words, sentences, paragraphs …
 Hierarchy or logical structure of a book’s chapters, sections …
 Links (hyperlink), cross-references, citations …
 Temporal and spatial relationships in multimedia documents
                Structured Documents

 Explicit structure formalised through document representation
  standards (Mark-up Languages)

 Layout: LaTeX (publishing), HTML (Web)
     <b><font size=+2>SDR</font></b>
     <img src="qmir.jpg" border=0>

 Structure: SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting)
     <paragraph>… </paragraph>
     <paragraph>… </paragraph>
   </subsection>
 </section>

 Content/Semantic: RDF, DAML + OIL, OWL (semantic web)
 <Book rdf:about="book">
   <rdf:author=".."/>
XML: eXtensible Mark-up Language

   Meta-language (user-defined tags) currently being
    adopted by the W3C as the document format language
   Used to describe content and structure (and not layout)
   Grammar described in a DTD (used for validation)

    <lecture>
       <title> Structured Document Retrieval </title>
       <author> <fnm> Smith </fnm> <snm> John </snm> </author>
       <chapter>
          <title> Introduction into XML retrieval </title>
          <paragraph> …. </paragraph>
       </chapter> …
    </lecture>

    <!ELEMENT lecture (title, author+, chapter+)>
    <!ELEMENT author (fnm*, snm)>
    <!ELEMENT fnm (#PCDATA)>
XML: eXtensible Mark-up Language

  Use of XPath notation to refer to the XML structure
chapter/title: title is a direct sub-component of chapter
//title: any title
chapter//title: title is a direct or indirect sub-component of chapter
chapter/paragraph[2]: the second paragraph that is a direct sub-component of a chapter
chapter/*: all direct sub-components of a chapter

              <title> Structured Document Retrieval </title>
              <author> <fnm> Smith </fnm> <snm> John </snm> </author>
              <chapter>
                 <title> Introduction into SDR </title>
                 <paragraph> …. </paragraph>
              </chapter> …
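
The path expressions above can be tried out with Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. This is a minimal sketch: the sample document is an assumption modelled on the lecture example (the second paragraph and the nested section are added for illustration), and //title is written .//title because ElementTree paths are relative to the element being searched.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, modelled on the lecture example.
LECTURE = """<lecture>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>First paragraph</paragraph>
    <paragraph>Second paragraph</paragraph>
    <section><title>Nested section</title></section>
  </chapter>
</lecture>"""

root = ET.fromstring(LECTURE)


def texts(path):
    """Return the text of every element matched by path, relative to root."""
    return [element.text for element in root.findall(path)]


# chapter/title: titles that are direct sub-components of a chapter
direct_titles = texts("chapter/title")

# //title: any title (written .//title relative to the root element)
all_titles = texts(".//title")

# chapter//title: titles that are direct or indirect sub-components
nested_titles = texts("chapter//title")

# chapter/paragraph[2]: the second paragraph directly under a chapter
second_paragraph = texts("chapter/paragraph[2]")

# chapter/*: all direct sub-components of a chapter
chapter_children = [element.tag for element in root.findall("chapter/*")]

print(direct_titles, all_titles, nested_titles, second_paragraph, chapter_children)
```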
      Querying XML documents
 Content-only (CO) queries

   'open standards for digital video in distance learning'

 Content-and-structure (CAS) queries

  //article [about(., 'formal methods verify correctness aviation
       [about(.,'case study application model checking theorem proving')]

 Structure-only (SA) queries

Conceptual model for XML retrieval

      Structured documents                     Content + structure

            Documents                                Query

     Indexing (tf, idf, acc)                      Formulation

     Document representation                  Query representation
  (inverted file + structure index)

                         Retrieval function
                   (matching content + structure)

                          Retrieval results                     feedback

                  Presentation of related components
Content-oriented XML retrieval

 Return document components of varying granularity (e.g. a book, a
 chapter, a section, a paragraph, a table, a figure, etc.), relevant
 to the user’s information need with regard to both content and
 structure.
Content-oriented XML retrieval

Retrieve the best components according
to content and structure criteria:
 INEX: most specific component that satisfies the query, while
  being exhaustive to the query

 Shakespeare study: best entry points, which are components
  from which many relevant components can be reached through

 ???
                                     Article                 ?XML,?retrieval

                      0.4                           0.2          ?authoring

             Title                  Section 1             Section 2
                       0.6                    0.4                     0.4
            0.9 XML                 0.5 XML                 0.2 XML
            0.4 retrieval                                    0.7 authoring

No fixed retrieval unit + nested document components +
different types of document components
      how to obtain document and collection statistics?
      which component is a good retrieval unit?
      which components contribute best to content of Article?
      how to estimate?
      how to aggregate?
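
One way to make the "how to aggregate?" question concrete is a weighted sum of the term weights of the child components. The sketch below is only an illustration: the term weights are those shown for Title, Section 1 and Section 2 in the figure above, while the single down-weighting factor ACC = 0.5 and the function name aggregate are assumptions, not values or methods taken from the slides.

```python
# Term weights per component, as shown in the figure above.
CHILD_WEIGHTS = {
    "Title": {"XML": 0.9, "retrieval": 0.4},
    "Section 1": {"XML": 0.5},
    "Section 2": {"XML": 0.2, "authoring": 0.7},
}

# Hypothetical down-weighting factor applied when a term weight is
# propagated from a child to its parent (an assumption for illustration).
ACC = 0.5


def aggregate(children, acc=ACC):
    """Sum each term's child weights, scaled by acc, into a parent weight."""
    parent = {}
    for weights in children.values():
        for term, weight in weights.items():
            parent[term] = parent.get(term, 0.0) + acc * weight
    return parent


article = aggregate(CHILD_WEIGHTS)
for term in sorted(article):
    print(f"{term}: {article[term]:.2f}")
```

Other choices (a different factor per edge, a maximum instead of a sum, or a probabilistic combination) would answer the same question differently; the sketch fixes one option only to show where the decisions lie.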
                        Approaches …
 vector space model, Bayesian network, collection statistics,
  language model, cognitive model, proximity search, tuning,
  belief model, boolean model, relevance feedback, parameter estimation,
  logistic regression, probabilistic model, component statistics,
  term statistics, extending DB model, natural language processing

 The goal of an IR system
     retrieve as many relevant documents as possible and as few non-
     relevant documents as possible

 Comparative evaluation of technical performance of IR
  systems = effectiveness
     ability of the IR system to retrieve relevant documents and suppress
     non-relevant documents

 Effectiveness
     combination of recall and precision

   A document is relevant if it “has significant and
    demonstrable bearing on the matter at hand”.

   Common assumptions:
       Objectivity
       Topicality
       Binary nature
       Independence
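            Recall / Precision

[Figure: Venn diagram of the document collection showing the retrieved
 set, the relevant set, and their overlap (retrieved and relevant)]

\[
\text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}
\]

\[
\text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}}
\]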
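
The two ratios translate directly into set operations. The following is a minimal sketch; the function names and the toy document identifiers are assumptions chosen for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical toy example: 4 documents retrieved, 3 relevant, 2 in common.
toy_retrieved = {"a", "b", "c", "d"}
toy_relevant = {"b", "d", "e"}

print(precision(toy_retrieved, toy_relevant))
print(recall(toy_retrieved, toy_relevant))
```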
              Recall / Precision
        relevant documents for a given query
                 {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

rank   doc     precision   recall
1      d123    1/1         1/10
2      d84
3      d56     2/3         2/10
4      d6
5      d8
6      d9      3/6         3/10
7      d511
8      d129
9      d187
10     d25     4/10        4/10
11     d48
12     d250
13     d113
14     d3      5/14        5/10

[Figure: recall/precision graph]
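
The precision and recall values at each rank where a relevant document appears can be recomputed from the ranking and the relevant set given above. This is a small sketch; the function name is an assumption, and the values are kept as unreduced fractions so that they can be compared with the table directly.

```python
RELEVANT = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

RANKING = ["d123", "d84", "d56", "d6", "d8", "d9", "d511",
           "d129", "d187", "d25", "d48", "d250", "d113", "d3"]


def precision_recall_at_relevant_ranks(ranking, relevant):
    """Return (rank, doc, precision, recall) for each relevant document found."""
    rows = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # precision = relevant seen so far / documents seen so far
            # recall    = relevant seen so far / all relevant documents
            rows.append((rank, doc, f"{hits}/{rank}", f"{hits}/{len(relevant)}"))
    return rows


rows = precision_recall_at_relevant_ranks(RANKING, RELEVANT)
for row in rows:
    print(*row)
```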
                Test collection
 Document collection = the documents themselves
     depends on the task, e.g. evaluating web retrieval requires a
     collection of HTML documents.

 Queries / requests
     simulate real user information needs.

 Relevance judgements
     state, for each query, which documents are relevant.

 See TREC, CLEF, etc.
 Evaluation of XML retrieval: INEX
 Evaluating the effectiveness of content-oriented XML retrieval

 Collaborative effort: participants contribute to the development of
  the collection
      relevance assessments

 Similar methodology to TREC, but adapted to XML retrieval

 40+ participants worldwide

 Workshop in Schloss Dagstuhl in December (20+ institutions)
         INEX Test Collection
 Documents (~500MB), which consist of 12,107 articles in XML
  format from the IEEE Computer Society; 8 million elements

 INEX 2002
    30 CO and 30 CAS queries
    inex_eval metric

 INEX 2003
    36 CO and 30 CAS queries
    CAS queries are defined according to an enhanced subset of XPath
    inex_eval and inex_eval_ng metrics

 INEX 2004 is just starting
                 Relevance in XML

An element is relevant if it “has significant and
 demonstrable bearing on the matter at hand”

Common assumptions in IR
      Objectivity
      Topicality
      Binary nature
      Independence

[Figure: document tree with an article at the root, sections 1 to 3
 below it, and paragraphs 1 and 2 below a section]
                 Relevance in INEX

 all sections relevant → article very relevant
 all sections relevant → article better than sections
 one section relevant → article less relevant
 one section relevant → section better than article

[Figure: an article and its sections]

   Exhaustivity
            how exhaustively a document component discusses the query: 0, 1,
            2, 3
   Specificity
            how focused the component is on the query: 0, 1, 2, 3
   Relevance
            (3,3), (2,3), (1,1), (0,0), …
   Relevance assessment task
 Completeness
       Element → parent element, child elements

 Consistency
       Parent of a relevant element must also be relevant, although to a
        different extent
       Exhaustivity increases going up the document tree
       Specificity decreases going up the document tree

 Use of an online interface
       Assessing a query takes a week!
       Average of 2 topics per participant

[Figure: document tree with an article at the root, sections 1 to 3
 below it, and paragraphs 1 and 2 below a section]
 Only participants that complete the assessment task have access to the
    Recall / precision - based
      quantisation functions to obtain one relevance value

       expected search length
       penalise overlap
       consider size
 Others
    expected ratio of relevant
    cumulated gain-based metrics
    tolerance to irrelevance
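
A quantisation function maps an (exhaustivity, specificity) pair to the single relevance value that recall/precision-based metrics need. The sketch below is illustrative only: the strict function counts an element as relevant only when it is assessed (3,3), and the weights in the graded table are an assumption modelled on the generalised quantisation reported for INEX 2003; they should be checked against the official definition before use. All names are hypothetical.

```python
VALID_LEVELS = range(4)  # exhaustivity and specificity both take values 0..3


def strict_quantisation(exhaustivity, specificity):
    """Relevant (1.0) only if highly exhaustive and highly specific."""
    return 1.0 if (exhaustivity, specificity) == (3, 3) else 0.0


# Assumed graded weights (to be verified against the INEX definition).
GRADED_WEIGHTS = {
    (3, 3): 1.0,
    (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
    (1, 3): 0.5, (2, 2): 0.5, (2, 1): 0.5,
    (1, 2): 0.25, (1, 1): 0.25,
    (0, 0): 0.0,
}


def graded_quantisation(exhaustivity, specificity):
    """Map an assessment pair to a graded relevance value in [0, 1]."""
    if exhaustivity not in VALID_LEVELS or specificity not in VALID_LEVELS:
        raise ValueError("exhaustivity and specificity must be in 0..3")
    # Pairs missing from the table are treated as non-relevant here.
    return GRADED_WEIGHTS.get((exhaustivity, specificity), 0.0)


for pair in [(3, 3), (2, 3), (1, 1), (0, 0)]:
    print(pair, strict_quantisation(*pair), graded_quantisation(*pair))
```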
             Lessons learnt
 Good definition of relevance

 Expressing CAS queries was not easy

 Relevance assessment process must be “improved”

 Further development on metrics needed

 User studies required
 XML retrieval is not just about the effective retrieval of
  XML documents, but also about how to evaluate

 INEX 2004 tracks
          Relevance feedback
          Interactive
          Heterogeneous collection
          Natural language query