Posted on: 2/27/2011. Public Domain.
1. Many web sites provide users a push service of document distribution, in which documents are distributed to subscribers based on their predefined criteria. Please specify how such a service is implemented. (10 points)

Ans: This service is document routing, which works as follows. A profile of multiple detection needs is maintained for the subscribers, and each detection need is converted into a system-specific query. In a pre-processing step, a profile index is built from these queries using a stemmer, a phrase list, and a stop list. When a document to be routed arrives, the routing step compares it against the profile index and outputs the list of profiles into which the document is put.

[Figure: block diagram of the document-routing process described above.]

#

2. Suppose we have a collection that consists of the four documents given in the table below. What is the final ranking of the documents for the query "click metal" under the given information retrieval model? Please explain how your results are derived. Assume "the" and "here" are stop words.

(a) Classic vector model (10 points)

Ans: First, remove the stop words "the" and "here". The remaining index terms are [click, go, shears, boys, metal], with inverse document frequencies

  idf_click = log(4/3), idf_go = log(4/1), idf_shears = log(4/2), idf_boys = log(4/1), idf_metal = log(4/2)

The document vectors (normalized tf × idf) are

  d1 = [(4/7)·log(4/3), (1/7)·log(4/1), (1/7)·log(4/2), (1/7)·log(4/1), 0]
  d2 = [1·log(4/3), 0, 0, 0, 0]
  d3 = [0, 0, 0, 0, 1·log(4/2)]
  d4 = [(1/3)·log(4/3), 0, (1/3)·log(4/2), 0, (1/3)·log(4/2)]

and the query vector, using the augmented weighting (0.5 + 0.5·tf/max tf)·idf with tf = 1/2 for each query term, is

  q = [(0.5 + 0.5·(1/2)/(1/2))·log(4/3), 0, 0, 0, (0.5 + 0.5·(1/2)/(1/2))·log(4/2)]

Then, by cosine similarity Sim(d_i, q) = (d_i · q) / (|d_i| |q|):

  Sim(d1, q) ≈ 0.1856
  Sim(d2, q) ≈ 0.3833
  Sim(d3, q) ≈ 0.9236
  Sim(d4, q) ≈ 0.7346

Therefore, the final ranking under the vector space model is d3 > d4 > d2 > d1.

#

(b) Fuzzy set model (10 points)

Ans: First, remove the stop words "the" and "here".
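As a numerical check on part (a), the tf-idf vectors and cosine similarities can be reproduced with a short Python sketch; the term-frequency values below are read off the vectors given in part (a), and the rest uses only the standard library:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# idf for the index terms [click, go, shears, boys, metal]; N = 4 documents
idf = [math.log(4/3), math.log(4/1), math.log(4/2), math.log(4/1), math.log(4/2)]

# normalized term frequencies, read off the document vectors in part (a)
tf = {
    "d1": [4/7, 1/7, 1/7, 1/7, 0],
    "d2": [1, 0, 0, 0, 0],
    "d3": [0, 0, 0, 0, 1],
    "d4": [1/3, 0, 1/3, 0, 1/3],
}
docs = {name: [t * w for t, w in zip(v, idf)] for name, v in tf.items()}

# query "click metal": augmented weighting (0.5 + 0.5*tf/max_tf) with tf = 1/2
w_q = 0.5 + 0.5 * (1/2) / (1/2)
q = [w_q * idf[0], 0, 0, 0, w_q * idf[4]]

sims = {name: cosine(vec, q) for name, vec in docs.items()}
ranking = sorted(sims, key=sims.get, reverse=True)
print({n: round(s, 4) for n, s in sims.items()})
print(ranking)  # ['d3', 'd4', 'd2', 'd1']
```

The script reproduces the four similarity values and the ranking d3 > d4 > d2 > d1 given above.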
According to the fuzzy set model,

  μ_{i,j} = 1 − Π_{k_l ∈ d_j} (1 − c_{i,l}),  Sim(q, d_j) = min_i μ_{i,j}

where c_{i,l} is the correlation factor between index terms k_i and k_l. For d1:

  μ_{click,d1} = 1 − (1 − c_{click,click})(1 − c_{click,go})(1 − c_{click,shears})(1 − c_{click,boys}) = 25/27
  μ_{metal,d1} = 1 − (1 − c_{metal,click})(1 − c_{metal,go})(1 − c_{metal,shears})(1 − c_{metal,boys}) = 1/2
  Sim(q, d1) = min{μ_{click,d1}, μ_{metal,d1}} = 1/2

Similarly,

  μ_{click,d2} = 1/2, μ_{metal,d2} = 1/4, Sim(q, d2) = 1/4
  μ_{click,d3} = 1/4, μ_{metal,d3} = 0, Sim(q, d3) = 0
  μ_{click,d4} = 7/8, μ_{metal,d4} = 1/2, Sim(q, d4) = 1/2

Therefore, the ranking under the fuzzy model is d1 = d4 > d2 > d3.

#

(c) Query likelihood language model (10 points)

Ans: First, remove the stop words "the" and "here". According to the query likelihood language model,

  Sim(d, q) = P(d | q) ∝ P(d) · Π_{t ∈ q} ((1 − λ)·P(t | M_c) + λ·P(t | M_d))

Here we assume λ = 0.5; that is, the model is a mixture of the document model and the collection model with weight λ. With collection counts of 7 for "click" and 2 for "metal" out of 13 tokens:

  P(q | d1) = (0.5·7/13 + 0.5·4/7) · (0.5·2/13 + 0.5·0/7) ≈ 0.0427
  P(q | d2) = (0.5·7/13 + 0.5·2/2) · (0.5·2/13 + 0.5·0/2) ≈ 0.0592
  P(q | d3) = (0.5·7/13 + 0.5·0/1) · (0.5·2/13 + 0.5·1/1) ≈ 0.1553
  P(q | d4) = (0.5·7/13 + 0.5·1/3) · (0.5·2/13 + 0.5·1/3) ≈ 0.1062

Therefore, the final ranking under the language model is d3 > d4 > d2 > d1. Without the P(t | M_c) smoothing (i.e., λ = 1), the ranking is d4 > d1 = d2 = d3.

#

3. Okapi BM25 term weighting has been used quite widely and successfully across a range of collections and search tasks. What is the theoretical basis of the Okapi weighting? How does it work? (10 points)

Ans: 1) The theoretical basis of BM25 is the Probability Ranking Principle (PRP).

#

2) Like the Binary Independence Model (BIM), BM25 aims at estimating the probability P(R | d, q). However, it further takes term frequency and document length into account, two features often necessary for modern full-text search collections.
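As an illustration of how this weighting works in practice, here is a minimal Python sketch of BM25 scoring. The parameter defaults k1 = 1.2, b = 0.75, k3 = 8 are conventional choices rather than values from this document, and the toy statistics in the usage lines, borrowed from the question-2 collection, are for illustration only:

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_len, N, df,
               k1=1.2, b=0.75, k3=8.0):
    """Okapi BM25: idf weight x tf/length factor x query-tf factor, per term."""
    score = 0.0
    for t, tfq in query_tf.items():
        if df.get(t, 0) == 0:
            continue
        idf = math.log(N / df[t])
        tfd = doc_tf.get(t, 0)
        doc_part = ((k1 + 1) * tfd) / (k1 * ((1 - b) + b * doc_len / avg_len) + tfd)
        query_part = ((k3 + 1) * tfq) / (k3 + tfq)
        score += idf * doc_part * query_part
    return score

# toy statistics from the question-2 collection (N = 4 docs, 13 tokens total)
df = {"click": 3, "go": 1, "shears": 2, "boys": 1, "metal": 2}
q = {"click": 1, "metal": 1}
s_d1 = bm25_score(q, {"click": 4, "go": 1, "shears": 1, "boys": 1}, 7, 13/4, 4, df)
s_d3 = bm25_score(q, {"metal": 1}, 1, 13/4, 4, df)
print(s_d1, s_d3)  # d3 scores higher than d1 for "click metal"
```

The length-normalization factor (1 − b) + b·(L_d / L_ave) dampens raw term frequency in long documents, which is exactly the behavior described above.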
Derived from the odds O(R | d, q) = P(R = 1 | d, q) / P(R = 0 | d, q), the weighting formula of BM25 can be obtained as follows:

  RSV_d = Σ_{t ∈ q} log(N/df_t) · [(k1 + 1)·tf_td] / [k1·((1 − b) + b·(L_d / L_ave)) + tf_td] · [(k3 + 1)·tf_tq] / [k3 + tf_tq]

where k1, k3, and b are tunable parameters. In general, we search for values of these parameters that maximize performance on a separate development test collection (either manually or with optimization methods such as grid search or something more advanced), and then use the obtained parameters on the actual test collection.

#

4. Explain why polysemy and synonymy are problems for the vector space model and propose a possible solution. (15 points)

Ans: Refer to the slides of Lecture-3, pp. 86-88. Polysemy hurts precision: a query term with several meanings also matches documents that use it in an unintended sense. Synonymy hurts recall: since distinct terms are treated as orthogonal dimensions, a relevant document that expresses the query concept with different words receives a low similarity score. One possible solution is the latent semantic indexing (LSI) model, which maps terms and documents into a lower-dimensional latent concept space where related terms are brought close together.

5. Consider an information need for which there are four relevant documents in the collection. Contrast two systems run on this collection. Their top ten results are judged for relevance as follows, where R and N denote relevant and non-relevant, respectively.

  System 1: R N R N N N N N R R
  System 2: N R N N R R R N N N

(a) What is the mean average precision (MAP) of each system? (5 points)
(b) What is the R-precision of each system? (5 points)
(c) Sketch a precision-versus-recall figure to compare the retrieval performance of these two systems. Assume 11-point average precision is adopted. (5 points)

Ans:
(a) System-1: (1 + 2/3 + 3/9 + 4/10) / 4 = 0.6
    System-2: (1/2 + 2/5 + 3/6 + 4/7) / 4 ≈ 0.49
(b) There are 4 relevant documents.
    System-1: 2/4
    System-2: 1/4
(c) [Figure: 11-point interpolated precision-recall curves for System-1 and System-2, with precision on the y-axis and recall level (0 to 1) on the x-axis.]

6. Modeling the relationships among persons is one of the important applications of social networking. We may postulate that two persons appearing in the same context may have some relationship. Given a specific person, please extend the concepts of association clusters, metric clusters, and scalar clusters to construct social networks for this person.
(15 points)

Ans: Refer to the slides of Lecture-5, pp. 29-33.
- Association cluster: based on co-occurrences of pairs of terms in documents.
- Metric cluster: based on a distance factor between two terms.
- Scalar cluster: terms with similar neighborhoods are assumed to have some synonymity relationship.

To construct a social network, terms represent persons and documents represent specific contexts, such as conference events or courses in a university.

7. Describe the difference between vector space relevance feedback and probabilistic relevance feedback. (10 points)

Ans: Refer to Lecture-5, p. 13 and p. 18. Vector space relevance feedback performs both query expansion and term reweighting, but its reweighting scheme is not optimal. Probabilistic relevance feedback performs no query expansion; its term reweighting, however, is optimal.
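The vector-space side of this contrast, the Rocchio algorithm, can be sketched as follows. This is an illustrative implementation, not the lecture's exact formulation; the parameter values alpha = 1.0, beta = 0.75, gamma = 0.15 and the toy vectors are assumptions:

```python
# Rocchio relevance feedback (vector space): expand and reweight the query
# toward judged-relevant document vectors and away from non-relevant ones:
#   q_new = alpha*q + (beta/|Dr|)*sum(d in Dr) - (gamma/|Dn|)*sum(d in Dn)
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    dims = len(q)
    q_new = [alpha * w for w in q]
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        c = coef / len(docs)
        for d in docs:
            for i in range(dims):
                q_new[i] += c * d[i]
    # negative term weights are usually clipped to zero
    return [max(w, 0.0) for w in q_new]

# toy example over 3 index terms: term 2 is pulled into the query
# (expansion), term 3 is pushed down by the non-relevant document
q = [1.0, 0.0, 0.0]
rel = [[0.5, 0.5, 0.0]]
nonrel = [[0.0, 0.0, 1.0]]
print(rocchio(q, rel, nonrel))  # [1.375, 0.375, 0.0]
```

This makes the contrast concrete: the modified query gains weight on terms it never contained (expansion) and has all weights rescaled (reweighting), whereas probabilistic relevance feedback only re-estimates weights for the original query terms.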