



Lecture 8:
  Search Engine Evaluation
   Motivation

   Traditional Issues: Recall and Precision

   New Features in Search Engine Evaluation




    Motivation

   There are many search engines on the
    market; which one is best for your needs?
   A search engine may use several models
    (e.g., Boolean or vector), different indexing data
    structures, different user interfaces, etc.
       Which combination is the best one?


    Two major aspects:

   Efficiency: speed
   Effectiveness: how good is the result? (quality)
   Speed is rather technical and relatively easy to
    evaluate.
   Effectiveness is much more difficult to judge.
   Our focus will be on effectiveness evaluation.




Relevancy
   Effectiveness is related to relevancy of
    documents retrieved
   Relevancy, from a human judgment
    standpoint, is
       subjective - depends upon a specific user’s
        judgment
       situational - relates to user’s requirement
       cognitive - depends on human perception and
        behavior
       temporal - changes over time




Threshold method
   Relevancy is not a binary value but a
    continuous function.
   If the user considers the relevancy value of a
    document to exceed a threshold (the threshold
    may not exist, and if it does, it is chosen by the
    user), the document is deemed relevant;
    otherwise it is deemed irrelevant.
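As a rough illustration (not part of the lecture), here is a minimal Python sketch of turning continuous relevancy values into binary judgments; the threshold value 0.6 and the scores are arbitrary assumptions.

def judge_relevant(relevancy_scores, threshold=0.6):
    """Turn continuous relevancy values into binary relevant/irrelevant
    judgments, given a user-chosen threshold (0.6 is illustrative only)."""
    return {doc: score >= threshold for doc, score in relevancy_scores.items()}

# Example: with these made-up scores, the user deems d1 and d3 relevant.
scores = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
print(judge_relevant(scores))   # {'d1': True, 'd2': False, 'd3': True}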




    Recall and Precision
   Two important metrics for evaluating the
    relevance of documents returned by an
    IR system




Parameters:
   Given a query, an IR system will return a
    set of documents as the answer.
   R is the relevant set for the query.
   A is the returned answer set.
   |R| and |A| denote the cardinalities of these
    sets.
   D denotes the set of all docs.




Document Space
  The entire document collection D is partitioned by the
  relevant set R and the retrieved (answer) set A:

                     relevant (in R)               irrelevant (not in R)
  retrieved (in A)   retrieved & relevant          retrieved & irrelevant
  not retrieved      not retrieved but relevant    not retrieved & irrelevant




      Definition of Recall
Recall = |R ∩ A| / |R| , between 0 and 1
       = (number of relevant documents retrieved) /
         (total number of relevant documents)

•If Recall = 1, the system retrieved all relevant documents.
•If Recall = 0, no relevant document was retrieved (every retrieved document is irrelevant).
•What is a simple way to achieve Recall = 1?




      Definition of Precision

Precision = |R ∩ A| / |A| , between 0 and 1
          = (number of relevant documents retrieved) /
            (total number of documents retrieved)

   If Precision = 1, all retrieved documents are relevant.
   If Precision = 0, all retrieved documents are irrelevant.
   How to achieve Precision = 1?
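A minimal Python sketch of both definitions (not from the lecture); the sets reuse the example that appears later in these slides.

def recall(relevant, retrieved):
    """|R ∩ A| / |R|: fraction of all relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

def precision(relevant, retrieved):
    """|R ∩ A| / |A|: fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

R = {"d1", "d2", "d3", "d4", "d5"}   # relevant set
A = {"d3", "d6", "d1", "d4"}         # answer set returned by the system
print(recall(R, A), precision(R, A))  # 0.6 0.75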




      Roles of Recall and Precision

   Recall measures the ability of the search to find all
    of the relevant items in the database.
   Precision
      evaluates the correlation of the query with the
       database
      is an indirect measure of the completeness of the
       indexing algorithm




    Availability of Data
   Among these numbers
      only two are always available for Internet IR:

            total number of items retrieved: |A|
            number of relevant items retrieved: |R ∩ A|
       the total number of relevant items, |R|, is usually
        not available



Evaluation of Precision and Recall

     Precision can be evaluated exactly by
      dividing |R ∩ A| by |A|, since both
      numbers are available.
     Recall cannot be evaluated exactly in
      general, since it is defined as the ratio of
      |R ∩ A| to |R|, where the latter is
      usually not available.



Approximate Estimation of Recall

     Randomly pick a set F of documents.
     Heuristic argument (a sampling technique
      from statistics): the proportion of
      R ∩ A in R is approximately the same as the
      proportion of R ∩ A ∩ F in R ∩ F.
     Recall can therefore be estimated by the ratio of
      |R ∩ A ∩ F| to |R ∩ F|.



Estimation of Precision
     Even though Precision can be evaluated exactly,
      doing so may be costly, since |R ∩ A| can be huge for
      Internet IR and R is often determined subjectively
      by people:
         It is laborious to determine R ∩ A.
         Claims can be costly to verify.
     Again, we may randomly pick a set F of
      documents and estimate precision by the ratio
      of |R ∩ A ∩ F| to |A ∩ F|.
     A ∩ F is relatively small, so it is much easier to
      find and to verify R ∩ A ∩ F.
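A minimal sketch of this sampling idea (not from the lecture); the relevance-judgment function and the sample size are assumptions standing in for the expensive human judgments.

import random

def estimate_recall_and_precision(retrieved, all_docs, is_relevant, sample_size=1000):
    """Estimate recall and precision from a random sample F of the collection.

    is_relevant(doc) stands in for the human judgment that defines R;
    only the sampled documents need to be judged."""
    F = random.sample(list(all_docs), min(sample_size, len(all_docs)))
    rel_in_sample = [d for d in F if is_relevant(d)]                        # R ∩ F
    rel_retrieved_in_sample = [d for d in rel_in_sample if d in retrieved]  # R ∩ A ∩ F
    retrieved_in_sample = [d for d in F if d in retrieved]                  # A ∩ F

    est_recall = (len(rel_retrieved_in_sample) / len(rel_in_sample)
                  if rel_in_sample else 0.0)
    est_precision = (len(rel_retrieved_in_sample) / len(retrieved_in_sample)
                     if retrieved_in_sample else 0.0)
    return est_recall, est_precision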




        Dual objectives for IR systems
   We want both precision and recall to be one.
   Unfortunately, precision and recall tend to move
    in opposite directions!
   Given a system:
       Broadening a query will increase recall but lower
        precision.
       Increasing the number of documents returned has the
        same effect.
   Different queries may yield different values of
    recall, so we use the average over a chosen set of
    queries.




     An Example
   Consider a query for which the relevant set is
    R={d1,d2,d3,d4,d5} out of a set D of 10 docs.
   Assume that a given IR system returned
    A={d3,d6,d1,d4}.
   Recall = 3/5 = 60%, and Precision = 3/4 = 75%.
   How do we visualize the relationship between
    Recall and Precision when ranking is considered?




   More examples
R={d1,d2,d3,d4,d5}
A={d3,d6,d1,d4}
 {d3} yields 100% Precision at 20% Recall

 {d3,d6} yields 50% Precision at 20% Recall

 {d3,d6,d1} yields 66% Precision at 40% Recall

 {d3,d6,d1,d4} yields 75% Precision at 60% Recall
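The same computation as a Python sketch (not from the lecture), stepping through the prefixes of the ranked answer list.

def precision_recall_at_ranks(relevant, ranked_answers):
    """For each prefix of the ranked answer list, report (recall, precision)."""
    points, hits = [], 0
    for k, doc in enumerate(ranked_answers, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))  # (recall, precision)
    return points

R = {"d1", "d2", "d3", "d4", "d5"}
A = ["d3", "d6", "d1", "d4"]          # ranked order matters here
for r, p in precision_recall_at_ranks(R, A):
    print(f"recall={r:.0%}  precision={p:.0%}")
# recall=20%  precision=100%
# recall=20%  precision=50%
# recall=40%  precision=67%
# recall=60%  precision=75%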




            A figure for the previous examples
            [Figure: precision (%) plotted against recall (%) for the four
            prefixes of the ranked list above: (20,100), (20,50), (40,66), (60,75).]




     The recall-precision curve
Usually, the relationship between Recall and Precision
  turns out to be shaped like this:
                  [Figure: a typical recall-precision curve; precision
                  falls as recall increases.]




    An objective criterion
   One system is better than another
       if at each recall level, it is more precise than
        the other,
       or at each precision level, it recalls more than
        the other.
   Note: such a dominance relation may not always exist.




      Interpolation method
Interpolated P x R curve
 Usual procedure: use 11 standard recall levels:
  0%, 10%, 20%, …, 100%
 The interpolated precision at a standard recall level is the
  highest precision observed at that or any higher recall value.
  This guarantees the curve is non-increasing.




      Recall and Precision
   That is to say, let r_j, j ∈ {0,1,2,...,10}, denote the
    j-th standard recall level
    (e.g., r_5 refers to the 50% recall level). Then

          P(r_j) = max { P(r) : r ≥ r_j }

   That is, we use the upper envelope of the precision function.
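A sketch of the 11-point interpolation in Python (not from the lecture); it takes the (recall, precision) points produced by a ranked list and returns the interpolated precision at the standard recall levels.

def interpolate_11_point(points):
    """points: list of (recall, precision) pairs from a ranked result list.
    Returns interpolated precision at recall levels 0%, 10%, ..., 100%:
    the maximum precision observed at any recall >= the standard level."""
    interpolated = []
    for j in range(11):
        level = j / 10
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

# Using the ranked example above: A = [d3, d6, d1, d4], R = {d1,...,d5}.
pts = [(0.2, 1.0), (0.2, 0.5), (0.4, 2/3), (0.6, 0.75)]
print(interpolate_11_point(pts))
# [1.0, 1.0, 1.0, 0.75, 0.75, 0.75, 0.75, 0.0, 0.0, 0.0, 0.0]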


   An example (the chart is drawn on the next slide):

   Recall (%)   Observed Precision (%)   Interpolated Precision (%)
        0              --                        100
       10             100                        100
       20              50                         60
       30              60                         60
       40              57                         57
       50              42                         50
       60              46                         50
       70              50                         50
       80              50                         50
       90              47                         47
      100              45                         45




Recall and Precision

             [Figure: interpolated precision (%) plotted against the 11
             standard recall levels (%), using the values in the table above.]




     Exercise
   Consider R={d1,d2,d3,d4,d5} and
   A1={d3,d5,d1,d4,d2,d6,d7,d8,d9,d0}
   A2={d6,d7,d8,d9,d0,d1,d2,d3,d4,d5}
   A3={d3,d5,d0}
   Draw the Interpolated P X R Charts for all three
    cases




Difficulty in constructing the P x R curve
    Using figures based on recall and precision
     requires a priori knowledge of R, i.e., the set of
     relevant documents.
    However, this is not possible in general.
    In addition to the statistical sampling method,
     another way to overcome this difficulty is to use
     sample data for which R is known in
     advance.




     TREC Model
   TREC (see http://trec.nist.gov/) maintains about
    6 GB of SGML-tagged text, together with queries and
    their answers, for evaluation purposes.
   The answers to the queries are obtained manually
    in advance.
   IR systems are tested against them and
    evaluated accordingly.




Other metrics:
    Fallout is concerned with retrieved but non-
     relevant docs: F = |A - R| / |D - R|.
    Single-value measures:
     Average precision at seen relevant docs: compute
      the precision every time a relevant doc is found
      and report the overall average.
     R-precision: the precision after the top |R| documents
      have been retrieved, where |R| is the total number of
      relevant docs.
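A minimal Python sketch of these measures for a ranked answer list (not from the lecture; document identifiers are illustrative).

def fallout(relevant, retrieved, all_docs):
    """|A - R| / |D - R|: fraction of the non-relevant docs that were retrieved."""
    nonrelevant = all_docs - relevant
    return len(retrieved - relevant) / len(nonrelevant) if nonrelevant else 0.0

def average_precision_at_seen_relevant(relevant, ranked_answers):
    """Average of the precision values computed each time a relevant doc appears."""
    precisions, hits = [], 0
    for k, doc in enumerate(ranked_answers, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(relevant, ranked_answers):
    """Precision after the top |R| documents have been retrieved."""
    cutoff = ranked_answers[:len(relevant)]
    return sum(1 for d in cutoff if d in relevant) / len(relevant)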




     Another alternative measure:
   Given the j-th doc in the ranking, let r_j and p_j be the
    recall and precision at that position.
   Van Rijsbergen proposed the following measure:
          E_j = 1 - (1 + b^2) / (b^2/r_j + 1/p_j)
   b is a parameter set by the user;
    if b = 1, E_j = 1 - 2 / (1/r_j + 1/p_j).
   Positions with high precision and high recall have a low
    E value, whereas positions with low precision and low
    recall have a high E value.
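A sketch of this measure in Python (not from the lecture); b defaults to 1, matching the special case above.

def e_measure(recall_j, precision_j, b=1.0):
    """Van Rijsbergen's E at ranking position j:
    E_j = 1 - (1 + b^2) / (b^2/r_j + 1/p_j).
    Low E means both recall and precision are high at that position."""
    if recall_j == 0 or precision_j == 0:
        return 1.0   # worst possible value when either quantity is zero
    return 1.0 - (1.0 + b * b) / (b * b / recall_j + 1.0 / precision_j)

print(e_measure(0.6, 0.75))   # ≈ 0.333 for the running example at rank 4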




       Its interpretation
   If b > 1, the measure places more weight on recall.
   If b < 1, the measure places more weight on precision.
   The main aspect of the measure E is that it evaluates
    each position in the ranking, not just the whole
    answer set, so "anomalies" can be seen.


User-oriented Measures of Performance

It is also important to take into account what
different users feel about the answer sets.
Different users may find the same answer set to be of different
usefulness; this is especially true if they already know (to
different degrees) the answers they "should" obtain.
In addition to R and A, let us also consider the
following subsets of R:



         K: the set of relevant documents already known to
          the user, and
         U: the set of relevant documents that were not known
          to the user and were retrieved.

     [Figure: the relevant set R and the answer set A, with R split into the
     docs known to the user (K) and the relevant retrieved docs not previously
     known to the user (U).]




        Coverage and Novelty
   C = |A ∩ K| / |K| is the coverage of the answer set.
       A high coverage ratio means that the system is finding
        most of what the user was expecting.
   N = |U| / (|K| + |U|) is the novelty of the answer set.
       A high novelty ratio means that the user is finding many
        relevant docs that were not known before.
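A sketch of both ratios using the definitions above (not from the lecture); K and U would come from the user's own judgments.

def coverage(answer_set, known_relevant):
    """C = |A ∩ K| / |K|: how much of what the user already knew was found."""
    return (len(answer_set & known_relevant) / len(known_relevant)
            if known_relevant else 0.0)

def novelty(known_relevant, unknown_relevant_retrieved):
    """N = |U| / (|K| + |U|), with U the relevant retrieved docs the user did not know."""
    total = len(known_relevant) + len(unknown_relevant_retrieved)
    return len(unknown_relevant_retrieved) / total if total else 0.0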


        New Features in Search Engine
        Evaluation
   The number of web pages (docs) on the web was estimated
    at more than two billion as of April 2001
       (www.searchenginewatch.com).
   It is almost impossible to get all relevant web
    pages from the Internet.
   Web pages are dynamic: some of them will
    disappear or be updated tomorrow.


    New Features in Search Engine
    Evaluation
 Even for the same query, different users may desire
  different results.
 Users tend to use short queries.

 Two main issues in search engine evaluation:

Search Engine Coverage
Search Engine Effectiveness




        Search Engine Coverage
Some published approaches to estimating coverage
are based on the number of hits for certain queries
as reported by the services themselves.
       For example, the method used by Search Engine Showdown:
       http://www.searchengineshowdown.com/stats/fast300.shtml
To compare the sizes of the search engine databases, the
study uses 25 specific queries that meet the criteria listed
below.
The results of each query are verified when possible, and
only the hits that can be displayed are counted.




Query criteria:
1.   Only single words are used to avoid any
     variation in the processing of multiple
     term searches
2.   Terms were drawn from a variety of
     reference books that cover different
     fields.




     Selection of Query Terms
3.   Any term used must find fewer than 1,000 results in the
     AltaVista Advanced Search, since numbers higher than
     that cannot be verified on AltaVista.
4.   Since Northern Light automatically searches both the
     English plural and singular forms of words, query terms
     were chosen that cannot generally be made plural. This
     was checked by pluralizing the word and running a
     search on AltaVista or Fast. Only those terms whose
     plural form found zero results were used.




                                        Data from: Aug. 14, 2001

                  Search Engine    Showdown Estimate (millions)   Claim (millions)
                  Google                   730                          1,000
                  Fast                     552                            623
                  WISEnut                  510                          1,400
                  Northern Light           369                            322
                  Hotbot                   364                            500
                  AltaVista                346                            500
                  MSN Search               334                            500
http://www.searchengineshowdown.com/stats/sizeest.shtml




      Estimation of coverage
1 For all the queries, check all results retrieved by
  each search engine, and obtain the total count of
  valid web pages for each search engine.

2 Based on the known sizes of the Northern Light and Fast
  search engines, estimate each search engine's
  coverage.


Another algorithm (Krishna Bharat and Andrei Broder):




Conditional Probability Method
Let Pr(A) represent the probability that an element
belongs to the set A and let Pr(A & B|A) represent
the conditional probability that an element belongs
to both sets given that it belongs to A. Then,
Pr(A & B|A) = Size(A & B)/Size(A) and similarly,
Pr(A & B|B) = Size(A & B)/Size(B), and
therefore
Size(A)/Size(B) = Pr(A & B|B) / Pr(A & B|A).




    Two major procedures
To implement this idea we need two procedures:

Sampling: A procedure for picking pages uniformly
  at random from the index of a particular engine.

Checking: A procedure for determining whether a
 particular page is indexed by a particular engine.




     The solution
Overlap estimate: The fraction of E1's database
  indexed by E2 is estimated by: Fraction of URLs
  sampled from E1 found in E2.
Size comparison: For search engines E1 and E2,
  the ratio Size(E1)/Size(E2) is estimated by:
  Fraction of URLs sampled from E2 found in E1·
   ----------------------------------------------------------
  Fraction of URLs sampled from E1 found in E2
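A sketch of the Bharat-Broder estimate in Python (not from the lecture); is_indexed_by stands in for the checking procedure and is a hypothetical placeholder, as are the sample lists.

def overlap_fraction(sample_urls, other_engine, is_indexed_by):
    """Fraction of URLs sampled from one engine that another engine also indexes.
    Assumes sample_urls is non-empty."""
    found = sum(1 for url in sample_urls if is_indexed_by(other_engine, url))
    return found / len(sample_urls)

def size_ratio(sample_e1, sample_e2, e1, e2, is_indexed_by):
    """Estimate Size(E1)/Size(E2) as
    (fraction of E2's sample found in E1) / (fraction of E1's sample found in E2)."""
    frac_e2_in_e1 = overlap_fraction(sample_e2, e1, is_indexed_by)
    frac_e1_in_e2 = overlap_fraction(sample_e1, e2, is_indexed_by)
    return frac_e2_in_e1 / frac_e1_in_e2 if frac_e1_in_e2 else float("inf")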




    Search Engine Effectiveness

How to evaluate the quality of the results retrieved by
different search engines?
Manual evaluation

Benefit: accuracy with respect to the user's expectations
Drawback: subjective and time-consuming
Automatic evaluation
Much better at adapting to the fast-changing Web and
search engines, as well as to the large amount of information
on the web.




 Automatic evaluation of effectiveness

In [Longzhuang Li, Yi Shang, and Wei Zhang],
Two sample query sets were used:
(a) The TKDE set containing 1383 queries derived from the
index terms of papers published in the IEEE Transactions on
Knowledge and Data Engineering between January 1995 and
June 2000
(b) The TPDC set containing 1726 queries derived from the
index terms of papers published in the IEEE Transactions on
Parallel and Distributed Systems between January 1995 and
February 2000




     Search Engine Effectiveness
For each query, the top 20 hits from each search engine are
analysed.
To compute the relevance scores, each hit is followed to
retrieve the corresponding Web document.
The scores are calculated based on four models:

       (a) Vector Space Model
       (b) Okapi Similarity Measurement (Okapi)
       (c) Cover Density Ranking (CDR)
       (d) Three-Level Scoring Method (TLS)




        Search Engine Effectiveness

Rank the search engines by their average relevance
scores, computed using each of the four scoring methods
above.
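A sketch of this last ranking step in Python (not from the study); the per-hit scores would come from one of the four scoring models above, and the engine names and numbers below are placeholders, not actual results.

def rank_engines(scores_by_engine):
    """scores_by_engine maps an engine name to the relevance scores of its top
    hits over all queries; engines are ranked by average score, best first."""
    averages = {engine: sum(s) / len(s) for engine, s in scores_by_engine.items() if s}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Placeholder data for illustration only:
print(rank_engines({"EngineA": [0.8, 0.6, 0.9], "EngineB": [0.5, 0.7, 0.4]}))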

   Result:

Google is always the best.