IR2009_midterm_sol

1. Many web sites provide users with a push service for document distribution, in which documents are
   distributed to subscribers based on their predefined criteria. Please specify how such a service is
   implemented. (10 points)

   Ans: This service is document routing, which works as follows.


   Profile side:
      Profile of Multiple Detection Needs
        -> Convert Detection Need to System-Specific Query
        -> Build Index From Queries
        -> Routing Profile Index

   Document side:
      Document to be Routed
        -> Pre-processing: Stemmer, Phrase List, Stop List
        -> Compare Document (against the Routing Profile Index)
        -> List of Profiles to Put Document into
                                                                                          #
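
   The routing flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the profile names, the `min_overlap` threshold, and the function names are all hypothetical, and stemming and phrase detection are left out for brevity.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative stop list


def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_profile_index(profiles):
    """Build the routing profile index: query term -> set of profile ids."""
    index = defaultdict(set)
    for profile_id, need in profiles.items():
        for term in preprocess(need):
            index[term].add(profile_id)
    return index


def route(document, index, min_overlap=1):
    """Return the profiles sharing at least min_overlap distinct terms with the document."""
    hits = defaultdict(int)
    for term in set(preprocess(document)):
        for profile_id in index.get(term, ()):
            hits[profile_id] += 1
    return sorted(p for p, n in hits.items() if n >= min_overlap)


# Hypothetical subscriber profiles (detection needs written as keyword queries).
profiles = {"alice": "information retrieval evaluation", "bob": "stock market news"}
index = build_profile_index(profiles)
print(route("A survey of evaluation in information retrieval", index))
```

   The key design point is that the index is built over the profiles rather than over the documents, so each incoming document acts as the "query" against a fairly static set of subscriber needs.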

2. Suppose we have a collection that consists of the four documents given in the table below. What is
   the final ranking of the documents for the query "click metal" under the given information retrieval
   model? Please explain how your results are derived. Assume "the" and "here" are stop words.

   (a) Classic vector model (10 points)

   Ans: First, remove the stop words "the" and "here". We get the following index terms.

   Terms: [click, go, shears, boys, metal]
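   $\mathrm{idf}_{click} = \log\frac{4}{3}$, $\mathrm{idf}_{go} = \log\frac{4}{1}$, $\mathrm{idf}_{shears} = \log\frac{4}{2}$, $\mathrm{idf}_{boys} = \log\frac{4}{1}$, $\mathrm{idf}_{metal} = \log\frac{4}{2}$

   $\vec{d_1} = [\frac{4}{7} \times \log\frac{4}{3}, \frac{1}{7} \times \log\frac{4}{1}, \frac{1}{7} \times \log\frac{4}{2}, \frac{1}{7} \times \log\frac{4}{1}, 0]$
   $\vec{d_2} = [1 \times \log\frac{4}{3}, 0, 0, 0, 0]$
   $\vec{d_3} = [0, 0, 0, 0, 1 \times \log\frac{4}{2}]$
   $\vec{d_4} = [\frac{1}{3} \times \log\frac{4}{3}, 0, \frac{1}{3} \times \log\frac{4}{2}, 0, \frac{1}{3} \times \log\frac{4}{2}]$
   $\vec{q} = [(0.5 + \frac{0.5 \times 1/2}{1/2}) \times \log\frac{4}{3}, 0, 0, 0, (0.5 + \frac{0.5 \times 1/2}{1/2}) \times \log\frac{4}{2}]$

       Then, according to cosine similarity,

   $Sim(\vec{d_i}, \vec{q}) = \frac{\vec{d_i} \cdot \vec{q}}{|\vec{d_i}| \times |\vec{q}|}$

   $Sim(\vec{d_1}, \vec{q}) = 0.1856$
   $Sim(\vec{d_2}, \vec{q}) = 0.3833$
   $Sim(\vec{d_3}, \vec{q}) = 0.9236$
   $Sim(\vec{d_4}, \vec{q}) = 0.7346$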

       Therefore, the final ranking list of vector space model is (d3 > d4 > d2 > d1). #
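
       The cosine scores can be checked with a short script. This is a sketch, not part of the original answer: the per-document term counts are read off the tf numerators and denominators in the vectors above (the document table itself is not reproduced here), and tf is normalized by document length as in those vectors.

```python
import math

# Term counts after stop-word removal, inferred from the vectors in the answer.
docs = {
    "d1": {"click": 4, "go": 1, "shears": 1, "boys": 1},
    "d2": {"click": 2},
    "d3": {"metal": 1},
    "d4": {"click": 1, "shears": 1, "metal": 1},
}
terms = ["click", "go", "shears", "boys", "metal"]
N = len(docs)
idf = {t: math.log(N / sum(1 for d in docs.values() if t in d)) for t in terms}


def doc_vector(counts):
    """tf (count / document length) times idf, one component per index term."""
    length = sum(counts.values())
    return [counts.get(t, 0) / length * idf[t] for t in terms]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Both query terms occur equally often, so 0.5 + 0.5 * f / max_f = 1
# and each query weight reduces to the idf of the term.
query = [idf[t] if t in ("click", "metal") else 0.0 for t in terms]

scores = {name: cosine(doc_vector(counts), query) for name, counts in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 4))
```

       Because cosine similarity is scale-invariant, the base of the logarithm does not affect the scores.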

(b) Fuzzy set model (10 points)

Ans: First, remove the stop words "the" and "here". According to the fuzzy set model,

   $\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$

   $Sim(q, d_j) = \min(\mu_{i,j})$

where $c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$ is the correlation between terms $k_i$ and $k_l$, $n_i$ is the number of documents containing $k_i$, and $n_{i,l}$ is the number of documents in which $k_i$ and $k_l$ co-occur.

   $\mu_{click,d_1} = 1 - (1 - c_{click,click}) \times (1 - c_{click,go}) \times (1 - c_{click,shears}) \times (1 - c_{click,boys})$
   $= 1 - (1 - \frac{2}{3+3-2}) \times (1 - \frac{1}{3+1-1}) \times (1 - \frac{2}{3+2-2}) \times (1 - \frac{1}{3+1-1})$
   $= \frac{25}{27}$

   $\mu_{metal,d_1} = 1 - (1 - c_{metal,click}) \times (1 - c_{metal,go}) \times (1 - c_{metal,shears}) \times (1 - c_{metal,boys})$
   $= 1 - (1 - \frac{1}{2+3-1}) \times (1 - \frac{0}{2+1-0}) \times (1 - \frac{1}{2+2-1}) \times (1 - \frac{0}{2+1-0})$
   $= \frac{1}{2}$

   $Sim(q, d_1) = \min\{\mu_{click,d_1}, \mu_{metal,d_1}\} = \frac{1}{2}$

Similarly,

   $\mu_{click,d_2} = \frac{1}{2}$, $\mu_{metal,d_2} = \frac{1}{4}$
   $Sim(q, d_2) = \min\{\mu_{click,d_2}, \mu_{metal,d_2}\} = \frac{1}{4}$

   $\mu_{click,d_3} = \frac{1}{4}$, $\mu_{metal,d_3} = 0$
   $Sim(q, d_3) = \min\{\mu_{click,d_3}, \mu_{metal,d_3}\} = 0$

   $\mu_{click,d_4} = \frac{7}{8}$, $\mu_{metal,d_4} = \frac{1}{2}$
   $Sim(q, d_4) = \min\{\mu_{click,d_4}, \mu_{metal,d_4}\} = \frac{1}{2}$

Therefore, the ranking list of the fuzzy set model is (d1 = d4 > d2 > d3). #
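
The membership values can be reproduced with exact fractions. This sketch rests on two assumptions that are not stated in the answer: the term counts are inferred from the tf values used in part (a), and a term is taken to co-occur with itself only in documents where it appears more than once. The second assumption is what matches the worked numbers (it gives 2/(3+3-2) for click with click and 0 for metal with metal); the more usual convention sets the self-correlation to 1.

```python
from fractions import Fraction

# Term counts after stop-word removal, inferred from the worked answer.
docs = {
    "d1": {"click": 4, "go": 1, "shears": 1, "boys": 1},
    "d2": {"click": 2},
    "d3": {"metal": 1},
    "d4": {"click": 1, "shears": 1, "metal": 1},
}


def n_docs(term):
    """Number of documents containing the term."""
    return sum(1 for counts in docs.values() if term in counts)


def co_count(i, l):
    """Number of documents in which terms i and l co-occur."""
    if i == l:
        # Assumed convention: self co-occurrence needs at least two occurrences.
        return sum(1 for counts in docs.values() if counts.get(i, 0) > 1)
    return sum(1 for counts in docs.values() if i in counts and l in counts)


def correlation(i, l):
    n_il = co_count(i, l)
    return Fraction(n_il, n_docs(i) + n_docs(l) - n_il)


def membership(term, doc):
    """mu_{term,doc} = 1 - product over the terms of doc of (1 - c_{term,l})."""
    product = Fraction(1)
    for l in docs[doc]:
        product *= 1 - correlation(term, l)
    return 1 - product


def similarity(query_terms, doc):
    return min(membership(t, doc) for t in query_terms)


for doc in docs:
    print(doc, similarity(["click", "metal"], doc))
```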


(c) Query likelihood language model (10 points)

Ans: First, remove the stop words "the" and "here". According to the query likelihood language model,

   $Sim(d, q) = P(d \mid q) \propto P(d) \prod_{t \in q} ((1 - \lambda) P(t \mid M_c) + \lambda P(t \mid M_d))$

Here, we assume $\lambda = 0.5$; that is, each term probability is an equal mixture of the document model $M_d$ and the collection model $M_c$. Taking the prior $P(d)$ as uniform, documents are ranked by $P(q \mid d)$.

   $P(q \mid d_1) = (0.5 \times \frac{7}{13} + 0.5 \times \frac{4}{7}) \times (0.5 \times \frac{2}{13} + 0.5 \times \frac{0}{7}) = 0.0427$
   $P(q \mid d_2) = (0.5 \times \frac{7}{13} + 0.5 \times \frac{2}{2}) \times (0.5 \times \frac{2}{13} + 0.5 \times \frac{0}{2}) = 0.0592$
   $P(q \mid d_3) = (0.5 \times \frac{7}{13} + 0.5 \times \frac{0}{1}) \times (0.5 \times \frac{2}{13} + 0.5 \times \frac{1}{1}) = 0.1553$
   $P(q \mid d_4) = (0.5 \times \frac{7}{13} + 0.5 \times \frac{1}{3}) \times (0.5 \times \frac{2}{13} + 0.5 \times \frac{1}{3}) = 0.1062$

Therefore, the final ranking list of the LM is (d3 > d4 > d2 > d1). Without the $P(t \mid M_c)$ smoothing (i.e.,
$\lambda = 1$), the ranking list is (d4 > d1 = d2 = d3). #
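
The four probabilities can be verified with the following sketch. The term counts are an assumption inferred from the fractions in the answer (for example 4/7 for click in d1 and 7/13 for click in the collection); the function and variable names are illustrative.

```python
# Term counts after stop-word removal, inferred from the fractions in the answer.
docs = {
    "d1": {"click": 4, "go": 1, "shears": 1, "boys": 1},
    "d2": {"click": 2},
    "d3": {"metal": 1},
    "d4": {"click": 1, "shears": 1, "metal": 1},
}

collection_counts = {}
for counts in docs.values():
    for term, n in counts.items():
        collection_counts[term] = collection_counts.get(term, 0) + n
collection_len = sum(collection_counts.values())  # 13 tokens in total


def query_likelihood(query, doc, lam=0.5):
    """P(q | d) with linear interpolation between document and collection models."""
    counts = docs[doc]
    doc_len = sum(counts.values())
    p = 1.0
    for term in query:
        p_doc = counts.get(term, 0) / doc_len
        p_col = collection_counts.get(term, 0) / collection_len
        p *= (1 - lam) * p_col + lam * p_doc
    return p


query = ["click", "metal"]
scores = {d: query_likelihood(query, d) for d in docs}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, {d: round(s, 4) for d, s in scores.items()})
```

Calling `query_likelihood(query, d, lam=1.0)` switches the smoothing off, which leaves d4 as the only document with a non-zero score.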

3. Okapi BM25 term weighting has been used quite widely and successfully across a range of collections
   and search tasks. What is the theoretical basis of the Okapi weighting? How does it work? (10
   points)

   Ans: 1) The theoretical basis of BM25 is the Probability Ranking Principle (PRP). #

       2) Similar to the Binary Independence Model (BIM), BM25 aims to estimate the probability
   $P(R \mid d, q)$. Unlike BIM, it also takes into account term frequency and document length,
   two features often necessary for modern full-text search collections. Derived from the odds
   $O(R \mid d, q) = P(R = 1 \mid d, q) / P(R = 0 \mid d, q)$, the weighting formula of BM25 is:

   $RSV_d = \sum_{t \in q} \log\frac{N}{df_t} \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1 ((1 - b) + b \times (L_d / L_{ave})) + tf_{td}} \cdot \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}$

       where k1, k3, and b are tunable parameters. In general, we search for values of these
       parameters that maximize performance on a separate development test collection (either
       manually or with optimization methods such as grid search or something more advanced), and
       then use the obtained parameters on the actual test collection. #
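
       A direct transcription of the formula into Python may make the roles of k1, b, and k3 easier to see. This is a sketch only: the default parameter values are illustrative placeholders rather than values given in the answer, and the argument names are hypothetical.

```python
import math


def bm25_rsv(query_tf, doc_tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75, k3=8.0):
    """Retrieval status value of one document for one query (BM25).

    query_tf / doc_tf map terms to their frequency in the query / document,
    df maps terms to document frequency, n_docs is the collection size.
    """
    score = 0.0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        if tf_d == 0 or term not in df:
            continue  # a term absent from the document contributes nothing
        idf = math.log(n_docs / df[term])
        # Document side: saturating tf, normalized by relative document length.
        doc_part = (k1 + 1) * tf_d / (k1 * ((1 - b) + b * doc_len / avg_len) + tf_d)
        # Query side: saturating query term frequency.
        query_part = (k3 + 1) * tf_q / (k3 + tf_q)
        score += idf * doc_part * query_part
    return score


df = {"click": 3, "metal": 2}
short = bm25_rsv({"click": 1}, {"click": 1}, doc_len=5, avg_len=5, df=df, n_docs=4)
long_ = bm25_rsv({"click": 1}, {"click": 1}, doc_len=20, avg_len=5, df=df, n_docs=4)
print(short, long_)
```

       Setting b = 0 disables length normalization, while b = 1 scales term frequency fully by relative document length; k1 and k3 control how quickly repeated occurrences in the document and the query stop adding to the score.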
4. Explain why polysemy and synonymy are problems for vector space model and propose a possible
   solution. (15 points)

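Ans:   Refer to the slides of Lecture-3, p.86-p.88.
       The vector space model matches queries and documents on literal index terms. Synonyms
       (different words, same meaning) therefore fail to match, which lowers recall, while polysemous
       words (same word, different meanings) match in the wrong sense, which lowers precision.
       One possible solution is to apply the latent semantic indexing (LSI) model.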

5. Consider an information need for which there are four relevant documents in the collection. Contrast
   two systems run on this collection. Their top ten results are judged for relevance as follows. Here R
   and N denote relevant and non-relevant, respectively.
       System 1: R N R N N N N N R R
       System 2: N R N N R R R N N N
    (a) What is the mean average precision (MAP) of each system? (5 points)
    (b) What is the R-precision of each system? (5 points)
    (c) Sketch precision versus recall figure to compare the retrieval performance of these two
         systems. Assume 11-point average precision is adopted. (5 points)

Ans:   (a)           System-1: (1+2/3+3/9+4/10)/4 = 0.6
                     System-2: (1/2+2/5+3/6+4/7)/4 ≒ 0.49
       (b)           There are 4 relevant documents.
                     System-1: 2/4
                     System-2: 1/4
       (c)           [Figure: precision (y-axis, 0 to 1.2) versus recall level (x-axis, 0 to 1 in
                     steps of 0.1), with one curve each for System-1 and System-2.]
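
       The three measures can be computed as follows. This is a sketch with illustrative function names; with a single information need, MAP reduces to the average precision of that one ranking, and the interpolated precision at recall level r is taken as the highest precision observed at any recall of at least r.

```python
def average_precision(ranking, n_relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            total += hits / rank
    return total / n_relevant


def r_precision(ranking, n_relevant):
    """Precision within the top n_relevant results."""
    return ranking[:n_relevant].count("R") / n_relevant


def eleven_point_interpolated(ranking, n_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points, hits = [], 0  # (recall, precision) at each relevant document
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            points.append((hits / n_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]


system1 = "R N R N N N N N R R".split()
system2 = "N R N N R R R N N N".split()
for name, run in (("System-1", system1), ("System-2", system2)):
    print(name, round(average_precision(run, 4), 2), r_precision(run, 4))
    print([round(p, 2) for p in eleven_point_interpolated(run, 4)])
```

       Under this interpolation rule System-1 starts at 1.0 and steps down to 0.4, while System-2 stays flat at 4/7, so the curves cross between recall levels 0.5 and 0.6.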


6. Modeling the relationships among persons is one of the important applications of social networking. We
   may postulate that two persons appearing in the same context may have some relationship. Given a
   specific person, please extend the concepts of association clusters, metric clusters and scalar
   clusters to construct social networks for this person. (15 points)

Ans:   Refer to the slides of Lecture-5, p.29-p.33.
       association cluster: co-occurrences of pairs of terms in documents.
       metric cluster: distance factor between two terms.
       scalar cluster: terms with similar neighborhoods have some synonymity relationship.
       In the case of constructing social network, terms represent persons and documents represent
       specific contexts such as conference events or courses in a university.
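
       The mapping from terms to persons can be sketched as follows. The contexts and names are made up for illustration, and the three correlation functions follow the usual unnormalized textbook definitions, which may differ in detail from the lecture slides.

```python
import math

# Hypothetical contexts: each is an ordered list of person mentions,
# e.g. the speaker order of a conference session.
contexts = [
    ["ann", "bob", "cat"],
    ["ann", "bob"],
    ["cat", "dan"],
    ["bob", "dan", "ann"],
]
people = sorted({p for ctx in contexts for p in ctx})


def association(u, v):
    """Association correlation: sum over contexts of freq(u) * freq(v)."""
    return sum(ctx.count(u) * ctx.count(v) for ctx in contexts)


def metric(u, v):
    """Metric correlation: sum of 1 / distance over mention pairs in the same context."""
    total = 0.0
    for ctx in contexts:
        for i, a in enumerate(ctx):
            for j, b in enumerate(ctx):
                if a == u and b == v and i != j:
                    total += 1 / abs(i - j)
    return total


def scalar(u, v):
    """Scalar correlation: cosine between the association vectors of u and v."""
    su = [association(u, x) for x in people]
    sv = [association(v, x) for x in people]
    dot = sum(a * b for a, b in zip(su, sv))
    return dot / (math.sqrt(sum(a * a for a in su)) * math.sqrt(sum(b * b for b in sv)))


def neighbours(person, correlation, k=2):
    """The k people most strongly linked to `person` under the given correlation."""
    others = [p for p in people if p != person]
    return sorted(others, key=lambda p: correlation(person, p), reverse=True)[:k]


for corr in (association, metric, scalar):
    print(corr.__name__, neighbours("ann", corr))
```

       The social network for the given person is then the set of top-ranked neighbours, with the correlation value as the edge weight.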

7. Describe the difference between vector space relevance feedback and probabilistic relevance
   feedback. (10 points)

Ans:   Refer to Lecture-5, p.13 and p.18.
       Vector space relevance feedback performs both query expansion and term reweighting, but the
       reweighting scheme is not optimal.
       Probabilistic relevance feedback performs no query expansion, but its term reweighting is optimal.
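
       To illustrate the vector space side, here is a sketch of the standard Rocchio update, one common form of vector space relevance feedback. Whether the lecture uses exactly this form is an assumption, and the alpha, beta, and gamma defaults are illustrative only.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query as a dict of term weights.

    query, and each document in relevant / non_relevant, is a dict term -> weight.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)

    new_query = {}
    for term in terms:
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # negative weights are dropped
            new_query[term] = weight
    return new_query


original = {"click": 1.0}
updated = rocchio(
    original,
    relevant=[{"click": 1.0, "metal": 1.0}],
    non_relevant=[{"go": 1.0}],
)
print(updated)
```

       The term "metal" enters the query although the user never typed it (query expansion), and "click" gains weight (term reweighting). A probabilistic scheme would instead keep the original query terms and only re-estimate their weights from the judged documents.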

				