          Overview of an Information
          Retrieval System: Terrier


          Ashish
                Overview
• Structural view
  – Indexing
  – Retrieval


• Extend
• Setup
• Run
              IR Systems
• Terrier
  – Academic/ research
  – Open source
• Lucene-Nutch
  – Commercial/ research
  – Open source
                   Terrier
•   Developed at the University of Glasgow
•   Open source
•   OS-independent: written in Java
•   Easy to learn
•   Easy to extend
    – modular
                Subfolders -1
• etc/
   – Configuration files
• bin/
   – Scripts to compile and run Terrier
• lib/
   – Java libraries: JAR files containing the Terrier
     system
                     Subfolders -2
• src/
  – Java source files and user-written plugins
• doc/
  – Javadoc for Terrier and for extended components
• var/
  – index/
     • Index files
  – results/
     • Results and evaluation
• share/
  – Shared resources such as stopwords, lexicon etc.
Indexing
                    Tokenization
•   Identifying words
    – Based on space
    – Handling special characters such as -, $,
      digits, etc.
    – Sometimes a space is not the word separator
      •   German, Chinese
    – Agglutinative languages
      •   Marathi
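
The space-based case above can be sketched in a few lines. This is a minimal illustration, not Terrier's tokeniser: the `tokenize` helper and its pattern are assumptions made for this example, and it deliberately ignores the harder cases listed above (languages where a space is not the word separator, agglutinative languages).

```python
import re

# Hypothetical helper, not part of Terrier.
# A token is any run of letters, digits or underscores; spaces, hyphens,
# currency signs and other punctuation all act as separators.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Split text into lower-cased tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


tokens = tokenize("State-of-the-art IR costs $20, says Dr. Smith")
print(tokens)
```

Note how the hyphenated phrase is split into four tokens and the digits survive as a token of their own; whether that is desirable depends on the collection.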
           Term Pipelining
• Stemming/ finding root
  – ate -> eat
• Stopword removal
  – is, was, I, in etc.
• Abbreviations
  – Dr -> Doctor
• Normalisation
  – color vs. colour
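
The pipeline idea can be sketched as a chain of small functions, each of which rewrites a term or drops it. Everything here is illustrative: the lookup tables are toy stand-ins built from the slide examples (a real system would use a proper stemmer and stopword list), and `run_pipeline` is a hypothetical name, not a Terrier class.

```python
# Toy resources taken from the examples on this slide (assumptions).
STOPWORDS = {"is", "was", "i", "in"}
ABBREVIATIONS = {"dr": "doctor"}
SPELLING_VARIANTS = {"colour": "color"}
ROOTS = {"ate": "eat"}  # stand-in for a real stemmer


def remove_stopword(term):
    """Return None to signal that the term should be dropped."""
    return None if term in STOPWORDS else term


def expand_abbreviation(term):
    return ABBREVIATIONS.get(term, term)


def normalise(term):
    return SPELLING_VARIANTS.get(term, term)


def find_root(term):
    return ROOTS.get(term, term)


def run_pipeline(tokens, stages):
    """Pass every token through each stage in order."""
    kept = []
    for term in tokens:
        for stage in stages:
            term = stage(term)
            if term is None:  # a stage discarded the term
                break
        else:
            kept.append(term)
    return kept


stages = [remove_stopword, expand_abbreviation, normalise, find_root]
terms = run_pipeline(["i", "ate", "in", "dr", "colour"], stages)
print(terms)
```

Keeping each stage as an independent function mirrors the modular design: stages can be reordered, removed or replaced without touching the others.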
         Index – data structures
• Direct Index
   – stores the identifiers of terms that appear in each document and
     the corresponding frequencies.
• Document Index
   – stores information about each document, for example the
     document length and identifier.
• Inverted Index
   – stores the posting lists, i.e. the identifiers of the documents and
     their corresponding term frequencies.
• Lexicon
   – stores the collection vocabulary and the corresponding
     document and term frequencies.
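
To make the four structures concrete, the sketch below builds all of them in memory from two toy documents. It is a simplification under stated assumptions: plain dictionaries stand in for Terrier's on-disk files, and `build_index` and the field names are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build the four structures from {docno: [term, ...]}."""
    document_index = {}                # doc id -> docno and length
    direct_index = {}                  # doc id -> {term id: frequency}
    inverted_index = defaultdict(dict) # term id -> {doc id: frequency}
    lexicon = {}                       # term -> term id, document and term frequency
    term_ids = {}

    for doc_id, (docno, terms) in enumerate(docs.items()):
        document_index[doc_id] = {"docno": docno, "length": len(terms)}
        direct_index[doc_id] = {}
        for term, tf in Counter(terms).items():
            term_id = term_ids.setdefault(term, len(term_ids))
            direct_index[doc_id][term_id] = tf
            inverted_index[term_id][doc_id] = tf
            entry = lexicon.setdefault(term, {"term_id": term_id, "df": 0, "tf": 0})
            entry["df"] += 1   # one more document contains the term
            entry["tf"] += tf  # occurrences across the whole collection

    return document_index, direct_index, dict(inverted_index), lexicon


docs = {
    "d1": ["president", "election", "president"],
    "d2": ["election", "result"],
}
document_index, direct_index, inverted_index, lexicon = build_index(docs)
print(lexicon["election"])
```

The direct and inverted indexes hold the same frequencies viewed from opposite directions: per document versus per term.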
Extending the indexing process
• Tokenisation:
  – uk.ac.gla.terrier.indexing.*Document
• Term Pipelines:
  – uk.ac.gla.terrier.terms.*
        Retrieval
(Diagram: a query is run against the Index)
        Scoring and Ranking
• Score: S(di,qj)
• Documents are ranked (sorted) according
  to the score
• Presented to the user in decreasing order
  of S(di,qj)
  – Scoring model
     • e.g. TF-IDF
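
As a worked example of scoring and ranking, the sketch below uses one simple TF-IDF variant, S(d, q) = sum over query terms of tf(t, d) * ln(N / df(t)). This is an assumption chosen for brevity; it is not claimed to be the exact formula of Terrier's TF-IDF model, and `tf_idf_rank` is a hypothetical name.

```python
import math
from collections import Counter


def tf_idf_rank(query_terms, docs):
    """Score each document against the query and sort by decreasing score."""
    n_docs = len(docs)
    counts = {docno: Counter(terms) for docno, terms in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    scores = {}
    for docno, term_counts in counts.items():
        score = sum(
            term_counts[t] * math.log(n_docs / df[t])
            for t in query_terms
            if term_counts[t] > 0
        )
        if score > 0:
            scores[docno] = score

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "d1": ["president", "election", "president"],
    "d2": ["election", "result"],
    "d3": ["cricket", "result"],
}
ranking = tf_idf_rank(["president", "election"], docs)
for docno, score in ranking:
    print(docno, round(score, 3))
```

Here d1 ranks first because it contains the rarer term "president" twice; d3 matches no query term and is left out of the result set.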
               Matching Process
• Input
  – Query and weighting model
• Output
  – Ranked result set
      • Weighting model
           – Hiemstra-LM
• Uses
  – Term Score Modifiers
      • uk.ac.gla.terrier.matching.tsms
  – Document Score Modifiers
      • uk.ac.gla.terrier.matching.dsms
• Extend
  – uk.ac.gla.terrier.matching.models
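
The flow of the matching process (query and weighting model in, ranked result set out, with modifiers applied along the way) can be sketched as below. This only illustrates the plug-in idea: `match`, `raw_tf` and `boost_mumbai` are hypothetical names, the weighting model is a trivial placeholder rather than Hiemstra-LM, and term score modifiers are omitted for brevity.

```python
def match(query_terms, postings, weighting_model, doc_score_modifiers=()):
    """Accumulate per-document scores, apply modifiers, return a ranking."""
    scores = {}
    for term in query_terms:
        for docno, tf in postings.get(term, {}).items():
            scores[docno] = scores.get(docno, 0.0) + weighting_model(tf)

    # Each document score modifier sees all scores and may adjust them.
    for modify in doc_score_modifiers:
        scores = modify(scores)

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def raw_tf(tf):
    """Placeholder weighting model: the score is the term frequency."""
    return float(tf)


def boost_mumbai(scores):
    """Example document score modifier: double scores of 'Mumbai' documents."""
    return {
        docno: score * 2 if docno.startswith("Mumbai") else score
        for docno, score in scores.items()
    }


postings = {"president": {"Mumbai1": 2, "Delhi1": 3}}
plain = match(["president"], postings, raw_tf)
boosted = match(["president"], postings, raw_tf, [boost_mumbai])
print(plain)
print(boosted)
```

Swapping the weighting model or adding a modifier changes the ranking without touching the matching loop itself.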
                    Input
• Corpus
  – Very large set of documents
• Topics
  – Queries representing user need
• Relevance Results
  – Set of judgments per query per document
                          Document format
<doc>
<docno>Mumbai85B7FB3BB9.htm.txt</docno>

<text> राज्यपालांनी घेतली राष्ट्रपती, उपराष्ट्रपतींची भेट

   मुंबई, ता. २१ - राज्यपाल एस. एम. कृष्णा यांनी आज राष्ट्रपती प्रतिभा
   पाटील आणि उपराष्ट्रपती डॉ. हमीद अन्सारी यांची दिल्ली येथे भेट
   घेतली. राष्ट्रपती, उपराष्ट्रपतिपदी निवड झाल्याच्या पार्श्वभूमीवर
   राज्यपालांनी भेट घेऊन त्यांचे स्वागत केले. आज दुपारी राष्ट्रपती
   भवन येथे श्रीमती प्रतिभा पाटील यांची भेट घेतल्यानंतर त्यांनी
   हरियाना भवन येथे जाऊन उपराष्ट्रपतींची भेट घेतली.
</text>

</doc>
                       Topic format
<top>
<num>5
<title>भारतीय राष्ट्रपती निवडणूक २००७
<desc>भारताच्या राष्ट्रपती निवडणूकीशी संबंधित मुद्दे व घटना.
<narr>राष्ट्रपतींची निवडणूक, उमेदवारांविरूध्द केलेली गलिच्छ
   राजकीय चिखलफेक आणि आपल्या निकटच्या
   उमेदवाराचा पराभव करून प्रतिभा पाटील ह्यांचे
   भारताच्या सर्वप्रथम महिला राष्ट्रपती (अध्यक्ष) म्हणून
   निवडून येणे ह्या-विषयीची माहिती संबंधित कागदपत्रात
   असावयास हवी.
</top>
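
Topic fields are not closed by end tags, so each field runs until the next field tag begins. The sketch below reads one topic on that assumption. `parse_topic` is a hypothetical helper, not Terrier's topic reader, and the sample uses short English stand-in text for illustration.

```python
import re

# Field tags that may appear inside a <top> block.
FIELD_TAG = re.compile(r"<(num|title|desc|narr)>")


def parse_topic(block):
    """Return {field name: text} for one <top>...</top> block."""
    # Splitting on a capturing group keeps the tag names in the result:
    # [text before first tag, tag1, text1, tag2, text2, ...]
    parts = FIELD_TAG.split(block)
    fields = {}
    for tag, text in zip(parts[1::2], parts[2::2]):
        text = text.replace("</top>", "")
        fields[tag] = " ".join(text.split())  # collapse line breaks and indents
    return fields


sample = """<top>
<num>5
<title>Indian presidential election 2007
<desc>Issues and events related to
   the election.
</top>"""

topic = parse_topic(sample)
print(topic)
```

Multi-line fields are joined into one string, which is usually what a query builder needs.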
           Relevance Judgement
• Each line: query id, Q0, document id, relevance judgement (0 or 1)
           .
           .
           13 Q0 1100019.cms.txt 0
           13 Q0 1102914.cms.txt 0
           13 Q0 1104294.cms.txt 0
           13 Q0 1104312.cms.txt 1
           13 Q0 1110418.cms.txt 0
           13 Q0 1123377.cms.txt 0
           13 Q0 1124813.cms.txt 1
           13 Q0 1126006.cms.txt 1
           .
           .
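
The judgement lines above can be loaded and used to evaluate a ranking, as sketched below. This is a minimal illustration only: `parse_qrels` and `precision_at_k` are hypothetical helpers, the ranked list is invented for the example, and Terrier's own evaluation reports more measures than this.

```python
def parse_qrels(lines):
    """Return {query id: {document id: judgement}} from judgement lines."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:  # skip blank or filler lines
            continue
        query_id, _second_column, docno, judgement = parts
        qrels.setdefault(query_id, {})[docno] = int(judgement)
    return qrels


def precision_at_k(ranked_docnos, judged, k):
    """Fraction of the top k results judged relevant (unjudged counts as 0)."""
    top = ranked_docnos[:k]
    return sum(judged.get(docno, 0) for docno in top) / k


qrel_lines = [
    "13 Q0 1100019.cms.txt 0",
    "13 Q0 1104312.cms.txt 1",
    "13 Q0 1124813.cms.txt 1",
    "13 Q0 1126006.cms.txt 1",
]
qrels = parse_qrels(qrel_lines)

# Invented system output for query 13, best result first.
ranked = ["1104312.cms.txt", "1100019.cms.txt", "1124813.cms.txt", "1126006.cms.txt"]
p_at_4 = precision_at_k(ranked, qrels["13"], 4)
print(p_at_4)
```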
           Configuration files
• etc/terrier.properties
  – UTF-8 settings, stemmer, index name, etc.
• etc/trec.topic.list
  – Set topics/queries
• etc/trec.models
  – Set matching/retrieval model
• etc/trec.qrels
  – Set relevance judgement file path
               Running Terrier
• Already compiled
• To recompile
  – bin/compile.sh
• Setup corpus
  – bin/trec_setup.sh "<corpus folder path>"
• Index
  – bin/trec_terrier.sh -i
• Retrieval
  – bin/trec_terrier.sh -r
• Evaluate
  – bin/trec_terrier.sh -e "<result file>"
                Reference
• http://ir.dcs.gla.ac.uk/terrier/doc/
• http://ir.dcs.gla.ac.uk/wiki/Terrier