Information Retrieval and Web Search Lecture

         Performance Evaluation of Information Retrieval Systems
        Why System Evaluation?
• There are many retrieval models, algorithms, and systems; which one is
  the best?
• What is the best component for:
  – Ranking function (dot-product, cosine, …)
  – Term selection (stopword removal, stemming…)
  – Term weighting (TF, TF-IDF,…)

• How far down the ranked list will a user need
  to look to find some/all relevant documents?

   Difficulties in Evaluating IR Systems

• Effectiveness is related to the relevancy of retrieved
  items.
• Relevancy is not typically binary but continuous.
• Even if relevancy is binary, it can be a difficult
  judgment to make.
• Relevancy, from a human standpoint, is:
   –   Subjective: Depends upon a specific user’s judgment.
   –   Situational: Relates to user’s current needs.
   –   Cognitive: Depends on human perception and behavior.
   –   Dynamic: Changes over time.
         Human Labeled Corpora
            (Gold Standard)
• Start with a corpus of documents.
• Collect a set of queries for this corpus.
• Have one or more human experts
  exhaustively label the relevant documents
  for each query.
• Typically assumes binary relevance
  judgments.
• Requires considerable human effort for
  large document/query corpora.

                     Precision and Recall

[Diagram: the entire document collection divided into four regions: retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, and not retrieved & irrelevant.]

    recall = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}}

    precision = \frac{\text{number of relevant documents retrieved}}{\text{total number of retrieved documents}}
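A minimal sketch, in Python, of how these two measures might be computed for a single query. The function name and the set-based representation are illustrative, not from the lecture; the document IDs reuse the worked example that appears a few slides later (the sixth relevant ID is a stand-in, since the slides do not list it).

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given document-ID sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Top 4 retrieved documents; 6 relevant documents exist, 3 of them retrieved.
print(precision_recall({588, 589, 576, 590}, {588, 589, 590, 592, 772, 101}))
# -> (0.75, 0.5)
```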
              Precision and Recall

• Precision
  – The ability to retrieve top-ranked documents that are mostly relevant.
  – Measures how correct the returned results are.
• Recall
  – The ability of the search to find all of the relevant items in the corpus.
  – Measures how completely the relevant items in the corpus are retrieved.
     Determining Recall is Difficult

• Total number of relevant items is
  sometimes not available:
  – Sample across the database and perform
    relevance judgment on these items.
  – Apply different retrieval algorithms to the same
    database for the same query. The aggregate of
    relevant items is taken as the total relevant set.



        Trade-off between Recall and Precision

[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1). The ideal system sits in the upper-right corner. A system in the upper-left returns only relevant documents but misses many useful ones; a system in the lower-right returns most relevant documents but includes lots of junk.]

    precision = \frac{\text{number of relevant documents retrieved}}{\text{total number of retrieved documents}}

    recall = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}}
    Computing Recall/Precision Points
• For a given query, produce the ranked list of
  retrievals.

• Adjusting a threshold on this ranked list produces
  different sets of retrieved documents, and therefore
  different recall/precision measures.

• Mark each document in the ranked list that is
  relevant according to the gold standard.

• Compute a recall/precision pair for each position in
  the ranked list that contains a relevant document.
 Computing Recall/Precision Points: An Example

Let the total number of relevant documents = 6. Check each new recall point:

  n    doc #   relevant   recall/precision at this rank
  1    588     x          R = 1/6 = 0.167;  P = 1/1 = 1
  2    589     x          R = 2/6 = 0.333;  P = 2/2 = 1
  3    576
  4    590     x          R = 3/6 = 0.5;    P = 3/4 = 0.75
  5    986
  6    592     x          R = 4/6 = 0.667;  P = 4/6 = 0.667
  7    984
  8    988
  9    578
  10   985
  11   103
  12   591
  13   772     x          R = 5/6 = 0.833;  P = 5/13 = 0.38
  14   990

One relevant document is missing from the list, so 100% recall is never reached.
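A small sketch (helper names are assumed, not from the slides) that reproduces the table above: it walks down the ranked list and emits a (recall, precision) pair at every rank that holds a relevant document.

```python
def recall_precision_points(ranked_docs, relevant, total_relevant=None):
    """(recall, precision) at each rank that contains a relevant document."""
    relevant = set(relevant)
    total = total_relevant if total_relevant is not None else len(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / total, hits / rank))
    return points

ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
for r, p in recall_precision_points(ranked, {588, 589, 590, 592, 772}, total_relevant=6):
    print(f"R = {r:.3f}   P = {p:.3f}")
# R = 0.167 P = 1.000 | R = 0.333 P = 1.000 | R = 0.500 P = 0.750
# R = 0.667 P = 0.667 | R = 0.833 P = 0.385
```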
   Interpolating a Recall/Precision Curve

• Interpolate a precision value for each standard recall level:
  – r_j ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
  – r_0 = 0.0, r_1 = 0.1, …, r_10 = 1.0
• The interpolated precision at the j-th standard recall level is the maximum
  known precision at any recall level between the j-th and (j + 1)-th level:

    P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)
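A short sketch (function name illustrative) of the interpolation step. For each standard recall level it takes the precision of the first observed (recall, precision) point whose recall reaches that level, and 0 when the level is never reached; this simplification of the max formula reproduces the interpolated values in the worked examples on the following slides.

```python
def interpolate(points, levels=None):
    """points: list of (recall, precision) pairs in rank order."""
    if levels is None:
        levels = [i / 10 for i in range(11)]       # 0.0, 0.1, ..., 1.0
    interpolated = []
    for level in levels:
        candidates = [p for r, p in points if r >= level - 1e-9]
        interpolated.append(candidates[0] if candidates else 0.0)
    return interpolated

# Example 1 below: 4 relevant documents, found at ranks 1, 3, 5, and 6.
points = [(0.25, 1.0), (0.5, 2/3), (0.75, 3/5), (1.0, 4/6)]
print([round(p, 2) for p in interpolate(points)])
# -> [1.0, 1.0, 1.0, 0.67, 0.67, 0.67, 0.6, 0.6, 0.67, 0.67, 0.67]
```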
             Interpolating a Recall/Precision Curve: An Example

[Figure: interpolated precision (vertical axis, 0 to 1.0) plotted against recall (horizontal axis, 0 to 1.0).]
     Average Recall/Precision Curve

• Typically average performance over a large
  set of queries.
• Compute average precision at each standard
  recall level across all queries.
• Plot average precision/recall curves to
  evaluate overall system performance on a
  document/query corpus.


   Example 1: Total no. of relevant documents: 4

                                Actual
       Rank     Relevant     Precision     Recall
        1         Yes
        2         No
        3         Yes
        4         No
        5         Yes
        6         Yes
        7         No
        8         No
        9         No
        10        No

  Interpolated
   Recall
   Precision
   Example 1: Total no. of relevant documents: 4

                                Actual
       Rank     Relevant     Precision     Recall
        1         Yes          1/1          0.25
        2         No           1/2          0.25
        3         Yes          2/3          0.5
        4         No           2/4          0.5
        5         Yes          3/5          0.75
        6         Yes          4/6          1
        7         No           4/7          1
        8         No           4/8          1
        9         No           4/9          1
        10        No           4/10         1

  Interpolated
   Recall      0     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
   Precision   1     1      1      2/3    2/3    2/3    3/5    3/5    4/6    4/6    4/6
   Interpolating a Recall/Precision Curve

[Figure: interpolated precision for Example 1 plotted against the standard recall levels.]

   Recall      0     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
   Precision   1     1      1      2/3    2/3    2/3    3/5    3/5    4/6    4/6    4/6
    Example 2: Total no. of relevant documents: 5

                                Actual
       Rank     Relevant     Precision     Recall
        1         Yes
        2         No
        3         Yes
        4         No
        5         Yes
        6         No
        7         No
        8         No
        9         No
        10        No

  Interpolated
   Recall
   Precision
    Example 2: Total no. of relevant documents: 5

                                Actual
       Rank     Relevant     Precision     Recall
        1         Yes          1/1          0.2
        2         No           1/2          0.2
        3         Yes          2/3          0.4
        4         No           2/4          0.4
        5         Yes          3/5          0.6
        6         No           3/6          0.6
        7         No           3/7          0.6
        8         No           3/8          0.6
        9         No           3/9          0.6
        10        No           3/10         0.6

  Interpolated
   Recall      0     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
   Precision   1     1      1      2/3    2/3    3/5    3/5    0      0      0      0
      Interpolating a Recall/Precision Curve

[Figure: interpolated precision for Example 2 plotted against the standard recall levels.]

   Recall      0     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
   Precision   1     1      1      2/3    2/3    3/5    3/5    0      0      0      0
       Average Recall/Precision Curve: An Example

[Figure: the interpolated precision curves for Example 1 (Precision1) and Example 2 (Precision2) plotted against the standard recall levels, together with their average (Avg).]
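A minimal sketch (not part of the slides) of the averaging step, using the interpolated curves of Example 1 and Example 2 above as the two per-query inputs.

```python
precision1 = [1, 1, 1, 2/3, 2/3, 2/3, 3/5, 3/5, 4/6, 4/6, 4/6]   # Example 1
precision2 = [1, 1, 1, 2/3, 2/3, 3/5, 3/5, 0,   0,   0,   0]     # Example 2

# Average precision at each standard recall level across the two queries.
average = [sum(pair) / len(pair) for pair in zip(precision1, precision2)]
print([round(p, 2) for p in average])
# -> [1.0, 1.0, 1.0, 0.67, 0.67, 0.63, 0.6, 0.3, 0.33, 0.33, 0.33]
```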
       Compare Two or More Systems

• The curve closest to the upper right-hand corner of the graph indicates the
  best performance.

[Figure: interpolated precision versus recall for two systems, "NoStem" and "Stem"; the curve lying closer to the upper right corner performs better.]
      Other Recall/Precision Measures

• Single-value measures:
  –   F-measure
  –   E-measure
  –   Fallout rate
  –   ESL (Expected Search Length)
  –   ASL (Average Search Length)
                F-Measure

• One measure of performance that takes into
  account both recall and precision.
• Harmonic mean of recall and precision:

    F = \frac{2PR}{P + R} = \frac{2}{\frac{1}{R} + \frac{1}{P}}
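A one-function sketch (name illustrative) of the formula above, checked against the rank-4 point of the earlier example (P = 0.75, R = 0.5).

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.75, 0.5), 2))   # -> 0.6
```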
 Computing Recall/Precision Points: F-Measure, An Example

Let the total number of relevant documents = 6. Check each new recall point:

  n    doc #   relevant   recall/precision/F at this rank
  1    588     x          R = 1/6 = 0.167;  P = 1/1 = 1;      F = 0.28
  2    589     x          R = 2/6 = 0.333;  P = 2/2 = 1;      F = 0.5
  3    576
  4    590     x          R = 3/6 = 0.5;    P = 3/4 = 0.75;   F = 0.6
  5    986
  6    592     x          R = 4/6 = 0.667;  P = 4/6 = 0.667;  F = 0.667
  7    984
  8    988
  9    578
  10   985
  11   103
  12   591
  13   772     x          R = 5/6 = 0.833;  P = 5/13 = 0.38;  F = 0.521
  14   990
 E Measure (parameterized F Measure)
• A variant of the F measure that allows weighting emphasis on precision or
  recall:

    E = \frac{(1 + \beta^2) P R}{\beta^2 P + R} = \frac{1 + \beta^2}{\frac{\beta^2}{R} + \frac{1}{P}}

• The value of β controls the trade-off:
  – β = 1: Equally weight precision and recall (E = F).
  – β > 1: Weight recall more.
  – β < 1: Weight precision more.
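A corresponding sketch (name illustrative) of the parameterized measure; β = 1 reduces to the plain F-measure.

```python
def e_measure(precision, recall, beta=1.0):
    """Parameterized F measure; larger beta puts more weight on recall."""
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(e_measure(0.75, 0.5), 2))           # beta = 1 -> 0.6 (same as F)
print(round(e_measure(0.75, 0.5, beta=2), 2))   # emphasizes recall -> 0.54
```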
                      Fallout Rate
• Problems with both precision and recall:
    – Number of irrelevant documents in the
      collection is not taken into account.
    – Recall is undefined when there is no
      relevant document in the collection.
    – Precision is undefined when no document is
      retrieved.
    Fallout = \frac{\text{no. of nonrelevant items retrieved}}{\text{total no. of nonrelevant items in the collection}}
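A minimal sketch (names and numbers illustrative) of the fallout computation for one query.

```python
def fallout(retrieved, relevant, collection_size):
    """Fraction of the collection's nonrelevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    nonrelevant_retrieved = len(retrieved - relevant)
    total_nonrelevant = collection_size - len(relevant)
    return nonrelevant_retrieved / total_nonrelevant if total_nonrelevant else 0.0

# 10 documents retrieved, 3 of them relevant; 6 relevant documents exist in a
# collection of 1000, so 994 nonrelevant documents in total.
print(round(fallout(range(1, 11), {1, 3, 5, 200, 300, 400}, 1000), 4))  # -> 0.007
```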
                   Other Measures
• Expected Search Length [Cooper 1968]: the average number of documents that
  must be examined to retrieve a given number i of relevant documents.
  – N: maximum number of relevant documents
  – e_i: expected search length for i

    ESL = \frac{\sum_{i=1}^{N} i \cdot e_i}{\sum_{i=1}^{N} i}
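A rough sketch under simplifying assumptions (a single concrete ranking rather than an expectation over orderings, with e_i read as the number of documents examined to reach the i-th relevant one); it only illustrates how the weighted sum above combines the e_i and is not Cooper's full definition.

```python
def expected_search_length(ranked_docs, relevant):
    """Weighted combination of e_i, as in the ESL formula above (illustrative)."""
    relevant = set(relevant)
    # e_i: rank at which the i-th relevant document is reached in this ranking
    e = [rank for rank, doc in enumerate(ranked_docs, start=1) if doc in relevant]
    if not e:
        return 0.0
    weights = range(1, len(e) + 1)
    return sum(i * ei for i, ei in zip(weights, e)) / sum(weights)

ranked = [588, 589, 576, 590, 986, 592]
print(expected_search_length(ranked, {588, 589, 590, 592}))   # -> 4.1
```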
                     Five Types of ESL
• Type 1: A user may just want the answer to a very specific factual question
  or a single statistic. Only one relevant document is needed to satisfy the
  search request.
• Type 2: A user may actually want only a fixed number of relevant documents
  for a query, for example six.
• Type 3: A user may wish to see all documents relevant to the topic.
• Type 4: A user may want to sample a subject area as in Type 2, but wish to
  specify the ideal size of the sample as some proportion, say one-tenth, of
  the relevant documents.
• Type 5: A user may wish to read all relevant documents if there are fewer
  than five, and exactly five if there are more than five.
              Other Measures (cont.)
• Average Search Length [Losee 1998]: the expected position of a relevant
  document in the ordered list of all documents.
  – N: total number of documents
  – Q: probability that the ranking is optimal (perfect)
  – A: expected proportion of all documents examined in order to reach the
    average position of a relevant document in an optimal ranking

    ASL = N\,[\,QA + (1 - Q)(1 - A)\,]
                    Problems

• F-measure, E-measure, ESL, and ASL are single-value measures, but:
  – They are not easy to compute, or the data required for the measure are
    typically not available (e.g. ASL).
  – They do not work well in a web search environment.
                 RankPower

• RankPower is an effective measure for interactive information search
  systems such as the web.
• It takes into consideration both the placement of the relevant documents
  and the number of relevant documents in a set of retrieved documents for a
  given query.
                RankPower (cont.)
• Some definitions:
  – For a given query, N documents are returned.
  – Among the returned documents, R_N is the set of relevant documents, with
    |R_N| = C_N < N.
  – Each relevant document in R_N is placed at location L_i.
  – The average rank of the returned relevant documents is

    R_{avg}(N) = \frac{\sum_{i=1}^{C_N} L_i}{C_N}
              RankPower (cont.)
• RankPower definition:

    RankPower(N) = \frac{R_{avg}(N)}{C_N} = \frac{\sum_{i=1}^{C_N} L_i}{C_N^2}
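A minimal sketch (names illustrative) of the definition above, checked against the first worked example on the next slide.

```python
def rank_power(ranked_docs, relevant):
    """Average rank of the relevant documents divided by their count."""
    ranks = [rank for rank, doc in enumerate(ranked_docs, start=1)
             if doc in set(relevant)]
    if not ranks:
        return float("inf")                  # no relevant document returned
    return sum(ranks) / len(ranks) ** 2      # R_avg(N) / C_N

ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
print(round(rank_power(ranked, {588, 589, 590, 592, 772}), 2))   # -> 1.04
```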
            Computing RankPower: An Example

  n    doc #   relevant
  1    588     x
  2    589     x
  3    576
  4    590     x
  5    986
  6    592     x
  7    984
  8    988
  9    578
  10   985
  11   103
  12   591
  13   772     x
  14   990

    RankPower = \frac{1 + 2 + 4 + 6 + 13}{5^2} = 1.04
            Computing RankPower: An Example (cont.)

  n    doc #   relevant
  1    588     x
  2    589     x
  3    576
  4    590     x
  5    986
  6    592     x
  7    772     x
  8    988
  9    578
  10   985
  11   103
  12   591
  13   984
  14   990

    RankPower = \frac{1 + 2 + 4 + 6 + 7}{5^2} = 0.8
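Reusing the rank_power sketch from above on this second list, where the fifth relevant document (772) moves up from rank 13 to rank 7:

```python
ranked = [588, 589, 576, 590, 986, 592, 772, 988, 578, 985, 103, 591, 984, 990]
print(rank_power(ranked, {588, 589, 590, 592, 772}))   # -> 0.8
```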
                        Examples
• Compare two systems, each of which returns a list of 10 documents.
• System A has two relevant documents listed 1st and 2nd, giving a RankPower
  of 0.75.
• Let's examine some scenarios in which system B can match or surpass
  system A.
  – If system B returns 3 relevant documents, then unless two of the three are
    listed 1st and 2nd, it is less favored than A, since the two best cases
    are (1+3+4)/3² = 0.89 and (2+3+4)/3² = 1, both greater than A's 0.75
    (smaller values are better).
  – System B needs to have 6 relevant documents in its top-10 list to beat A
    if it does not capture the 1st and 2nd places.
                    RankPower (cont.)
• Some properties:
  – It is a function of two variables: the individual ranks of the relevant
    documents and the number of relevant documents.

  – For a fixed C_N, the earlier the relevant documents are listed, the more
    favorable (smaller) the value.

  – If the number of returned documents N increases and the number of relevant
    documents among them also increases, the average rank increases without
    bound.

  – In the ideal case where every returned document is relevant, the average
    rank is simply (N + 1)/2.
                          Assignment
• Apply performance evaluation (Recall/Precision, F-Measure, and RankPower) to
  google.com, yahoo.com, and ask.com using the following queries:
   – Query 1: “Relevant Documents”
       • Documents related to IR systems.
   – Query 2: “Program”
       • Documents related to programming languages and IT.
   – Query 3: “Database”
       • Documents related to DBMS and IT.
• For each (system, query) pair, plot the recall/precision curve assuming that
  the number of relevant documents is 50.
• For each system, plot the average recall/precision (ARP) curve.
• Plot a combined ARP curve for the three systems.
                Useful URL

• Bow: A Toolkit for Statistical Language
  Modeling, Text Retrieval, Classification
  and Clustering.

http://www-2.cs.cmu.edu/%7Emccallum/bow/



