									Sogang University A. I. Lab.

        Effective site finding using link anchor information

                         Nick Craswell and David Hawking
            CSIRO Mathematical and Information Sciences, Canberra, Australia
                                 Stephen Robertson
                               Microsoft Research, UK
                                SIGIR’01, to appear

                                   Sung Hae, Jun
                                 Artificial Intelligence Lab.
                                 Dept. of Computer Science
                                     Sogang University
                                         Seoul, Korea
       Introduction (1/2)

  • Link-based ranking is popular
               With search engines: Google, Fast
               With researchers: HITS, PageRank
  • The task: to find the main entry point of a specific Web site
               This paper: “named site finding”
               In our experiments, ranking based on link anchor text is
               twice as effective as ranking based on document content.
  • This opens a rich new area for effectiveness improvement,
  where traditional methods fail.

       Introduction (2/2)

  • Link methods
               rank a page using its incoming and outgoing links
  • Content methods
               rank a page using its text content
  • Past TREC experiments have found that link information
  does not enhance retrieval effectiveness.
  In particular, TREC-8 Small and Large Web Tracks found link methods to be no
  better than non-link methods.

       The site finding problem

  • The “topic” of a site might be quite broad.
         Yahoo! ( covers a broad range
  of subject matter and provides a range of services.
  • A site finding task is one where the user wants to find a
  particular site, and their query is an attempt to specify which
  site that is.
  • A search system succeeds in the task if it returns the entry
  page of the required site : the “correct answer”.

      Site finding examples
        Sometimes, the user types the name of a site in order to find its URL

Named site finding                           Not named site finding
Where can I find Hotmail?                    How does a modem work?
Where is the official Michael Schumacher     What should I consider when purchasing
home page?                                   a PC for under $2,000?
Where can I find the web site for Toshiba?   Who was Cleopatra?
Where is the fun site dating patterns        What is mp3 and where can I learn more
analyzer?                                    about it?
Where is the official Star Wars site?        Why do dogs have wet noses?
                                             Where is the Taj Mahal?

Named site finding: the user knows which site they want, but not its location (URL).

       Link-based ranking methods (1/2)

   A hypertext link is a relationship between two documents or two parts of the
   same document.

   On the Web, the source document would contain text such as:
                 <A HREF="">ACM Site</A>
   The target document :
   The link’s anchor text : ACM Site
   If the user selects the anchor, their browser will display the target.
   A ranking method, given a query and a set of documents, generates a
   ranked list of documents. In the site-finding task, the entry page of the
   described site should appear as close as possible to the top of the list.
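
   As a rough illustration of where anchor text comes from, the sketch below
   pulls (target URL, anchor text) pairs out of a source document with Python's
   standard html.parser module. The class name AnchorExtractor, the helper
   extract_links, and the example.org URL are hypothetical; the slides do not
   say how the authors parsed their collections.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML source document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <A HREF=...> works.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors with a missing or empty href
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(html):
    """Return a list of (target_url, anchor_text) for every link in html."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical source document; example.org stands in for a real target URL.
page = '<p>See the <A HREF="http://example.org/">ACM Site</A> for details.</p>'
links = extract_links(page)
```

   Here links holds one pair: the target URL and the anchor text "ACM Site".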

       Link-based ranking methods (2/2)

    [Figure: link source, possible anchors, and targets; link-based ranking
    methods; the ranked list]

    Link methods can be divided into three classes, depending on which of these
    alternative assumptions they rely on:

                 1. recommendation
                 2. topic locality
                 3. anchor description ( this paper )

       Three classes of link methods

• The recommendation assumption is that by linking to a target, a
page author is recommending it. Accordingly, a page with high in-degree is
highly recommended, and should be ranked more highly.

• The topic locality assumption is that pages connected by links are
more likely to be about the same topic than those which are not.

• The anchor description assumption is that the anchor text of a link
describes its target. Using the example link mentioned previously, the anchor
text “ACM site” is describing
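
The recommendation assumption can be illustrated with the simplest possible
link measure, in-degree. The sketch below is only that illustration: it is not
HITS or PageRank, and the function name rank_by_in_degree and the single-letter
page names are hypothetical.

```python
from collections import Counter


def rank_by_in_degree(links):
    """Rank target pages by how many distinct pages link to them.

    links is an iterable of (source_url, target_url) pairs.
    Returns a list of (target_url, in_degree), highest in-degree first.
    """
    # Count each source at most once per target and ignore self-links.
    unique_links = {(src, dst) for src, dst in links if src != dst}
    in_degree = Counter(dst for _, dst in unique_links)
    # Sort by descending in-degree, breaking ties by URL for a stable order.
    return sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link graph: pages a and b link to x, page c links to y.
example_links = [("a", "x"), ("b", "x"), ("c", "y"), ("a", "x"), ("x", "x")]
ranking = rank_by_in_degree(example_links)
```

Under the recommendation assumption, x (two distinct recommenders) would rank
above y (one).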

       TREC experiments with links
  • The Text Retrieval Conference (TREC) has primarily
  concentrated on subject searches, performed over news and
  government documents. It has recently expanded to new
  search tasks, such as question answering, and new
  document sets, including several Web collections.
  • The Web collections are based on a 1997 Internet Archive
  (http: // crawl of over 50 million pages. The
  100 gigabyte VLC2 collection is an 18.5 million document
  subset. The WT2g and WT10g collections are 0.25 and 1.25
  million page subsets respectively.

     Link-based ranking


A. Problem: Named site finding
B. Solution: Anchor text propagation [wwww, Google]
C. Experiments: Link solution twice as effective

     Site finding samples (this paper)

     Different from TREC ad hoc

     • TREC ad hoc:
                  Topical query → Relevant documents
     • Named site finding:
                  Site name query → The site’s URL
     • Site finding is useful for forgotten URLs (known item
     search, I’m feeling lucky) or for visiting a new site
     (“suspected item search”)

     Evaluation methodology
 1. Choose corpus : Choose a fixed test corpus of hypertext documents.
 2. Identify query pairs : Identify a set of <query, site> pairs (for example
 <sigir 2001,>), numbering perhaps 100. Each
 represents a user typing a query, in order to find a particular site entry page.
 3. Run methods : For each method being evaluated, run the queries over
 the corpus.
 4. Examine results : For each query, examine pooled results to identify
 equivalent URLs (e.g. mirror sites)
 5. Measure effectiveness : Apply some effectiveness measure. In case of
 multiple equivalent correct answers, measure according to the top ranked
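
 Steps 4 and 5 can be sketched as follows, assuming the effectiveness measure
 is the number of queries answered correctly at rank one (the figure reported
 on the results slides). The function names and the example.org URLs are
 hypothetical.

```python
def top_correct_rank(ranked_urls, correct_urls):
    """Return the 1-based rank of the best-placed correct answer, or None.

    correct_urls is the set of equivalent correct URLs for the query
    (for example the entry page plus its mirrors).
    """
    for rank, url in enumerate(ranked_urls, start=1):
        if url in correct_urls:
            return rank
    return None


def success_at_one(runs):
    """Count queries whose top-ranked result is a correct answer.

    runs is a list of (ranked_urls, correct_urls) pairs, one per query.
    """
    return sum(1 for ranked, correct in runs
               if top_correct_rank(ranked, correct) == 1)


# Hypothetical results for three queries.
runs = [
    (["http://a.example.org/", "http://b.example.org/"],
     {"http://a.example.org/"}),                      # correct at rank 1
    (["http://x.example.org/", "http://mirror.example.org/"],
     {"http://site.example.org/", "http://mirror.example.org/"}),  # rank 2
    (["http://y.example.org/"], {"http://z.example.org/"}),        # not found
]
hits = success_at_one(runs)
```

 With these runs, only the first query counts as a success at rank one; the
 second is found at rank two through a mirror, and the third is not found.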

     Content method used here
      Okapi BM25 applied to e.g. 18.5 million documents in VLC2.
        Excite Home World’s best news, FREE! Chocolates & Wine Cards & Music
        Flowers Gifts Excite Search Twice the power of the competition. Search the
        entire Web Search NewsTracker Search Excite Web Reviews Search Usenet
        newsgroups Search Tips Advanced Search Submit For info on destinations
        around the globe, visit Reference People Finder Yellow Pages Email
        Lookup Travel Search Shareware Resources Free Start Page Bookmark
        Excite Excite Direct New to the Net? Free Search Engine Information Help
        Feedback Advertising Add URL About Excite Jobs at Excite Excite Web
        Reviews Our insights into the . . .
        . . . [106 more words]

          The anchor method does not use any of this text.
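
      For reference, a minimal sketch of BM25 scoring over tokenized documents.
      This is one common formulation; the parameter values k1=1.2 and b=0.75,
      the non-negative idf variant, and the function name bm25_scores are
      assumptions, since the slides do not give the exact variant or settings
      used in the experiments.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a common BM25 variant.

    query is a list of terms; docs maps doc_id -> list of tokens.
    k1 and b are assumed defaults, not values taken from the slides.
    """
    if not docs:
        return {}
    n_docs = len(docs)
    # Fall back to 1.0 if every document is empty, to avoid dividing by zero.
    avg_len = (sum(len(tokens) for tokens in docs.values()) / n_docs) or 1.0

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Length normalization: longer documents need more occurrences.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Non-negative idf variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[doc_id] = score
    return scores


# Hypothetical three-document collection.
docs = {
    "d1": ["excite", "search", "engine"],
    "d2": ["chocolates", "wine", "cards"],
    "d3": ["search", "tips"],
}
scores = bm25_scores(["excite"], docs)
```

      The content method scores the text of each page this way; the link
      method applies the same kind of scoring to anchor documents instead.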

      Link method used here
     Okapi BM25 applied to e.g. 44.1 million VLC2 anchor documents.
   Each anchor document contains all the anchor texts of a page’s incoming
   links. If 7332 pages link to a page with the anchor text “excite”, that
   word is added 7332 times to the anchor document.

   Example anchor document:

       Count     Anchor text
       7332      excite
       910       excite netsearch
       227       excite search
       200       excite!
       168       e xcite
       154       view
       140       excite home
       86        excite search engine
       66        excite search:
       49        exite
       . . . [440 more lines]
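
   The construction of anchor documents can be sketched as below, assuming
   links are available as (source, target, anchor text) triples. Lowercasing
   the anchor text, the function names, and the example.org URLs are
   assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_anchor_documents(links):
    """Group incoming anchor texts by target page.

    links is an iterable of (source_url, target_url, anchor_text) triples.
    Returns a dict mapping target_url -> Counter of anchor texts.
    """
    anchor_docs = defaultdict(Counter)
    for _source, target, anchor in links:
        # Normalize case and whitespace (an assumption of this sketch).
        text = " ".join(anchor.lower().split())
        if text:
            anchor_docs[target][text] += 1
    return anchor_docs


def anchor_document_tokens(anchor_counts):
    """Expand anchor counts into the token list of the anchor document.

    Each anchor text contributes its words once per incoming link.
    """
    tokens = []
    for text, count in anchor_counts.items():
        tokens.extend(text.split() * count)
    return tokens


# Hypothetical links; example.org stands in for real pages.
target = "http://target.example.org/"
links = [
    ("http://p1.example.org/", target, "Excite"),
    ("http://p2.example.org/", target, "excite"),
    ("http://p3.example.org/", target, "Excite  Search"),
    ("http://p4.example.org/", "http://other.example.org/", "view"),
]
anchor_docs = build_anchor_documents(links)
tokens = anchor_document_tokens(anchor_docs[target])
```

   In this example the word "excite" appears three times in the target's
   anchor document, once for each incoming link whose anchor text contains it.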

       VLC2: Random sites

                               VLC2 results for 100 site
                               entry pages,
                               chosen randomly through
                               page selection and
                               For 35 of the 100 queries, the
                               anchor method returned the
                               correct answer at rank one,
                               compared with 15 for the
                               content method.

                                   35/100 vs 15/100

      VLC2: Yahoo!-listed sites

                          VLC2 results for 100 Yahoo!-listed sites.
                                    62/100 vs 27/100

     ANU: Directory-listed sites

              University results for 100 sites within the institution.
                                68/100 vs 21/100
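
              As a quick check on the rank-one counts from the three result
              slides, the snippet below computes the anchor-to-content ratio
              for each query set. The variable names are illustrative only.

```python
# Rank-one successes out of 100 queries: (anchor method, content method).
results = {
    "VLC2 random sites": (35, 15),
    "VLC2 Yahoo!-listed sites": (62, 27),
    "ANU directory-listed sites": (68, 21),
}

# Ratio of anchor successes to content successes for each query set.
ratios = {name: anchor / content
          for name, (anchor, content) in results.items()}

for name, ratio in ratios.items():
    print(f"{name}: anchor/content = {ratio:.2f}")
# VLC2 random sites: anchor/content = 2.33
# VLC2 Yahoo!-listed sites: anchor/content = 2.30
# ANU directory-listed sites: anchor/content = 3.24
```

              All three ratios are above two, in line with the "twice as
              effective" summary.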


  • Anchor information is more useful than content on this site
  finding task.
  • The biggest difference between link and content methods
  is at rank one.
  • In future experiments it will be interesting to test other link
  and content methods, and combinations of methods.


  • Anchors: Good evidence for finding named sites
  • The link anchor method was approximately twice as
  effective as the content method.
  Future work:
  • Can we improve on this, e.g. by using anchor+content?
  • Using the methodology of the present study as a basis,
  there are a great many aspects of this important problem
  to be investigated in future work.
