					Web Document Ranking

          Sérgio Nunes

     DEI, Faculdade de Engenharia
        Universidade do Porto



    SSIIM, MIEIC, 2010/11
Overview of concepts and techniques
    for ranking web documents
The World Wide Web
The Web



 The World Wide Web is a distributed information system
 unprecedented in many ways — in size, in lack of central
 coordination, and in the diversity of users’ backgrounds.

 The first published vision of a large-scale distributed
 hypertext system can be traced back to Vannevar Bush’s
 seminal article “As We May Think” (1945).
Web Growth




 Web pages >> web hosts.

 Altavista reported an index of 30 million web pages in 1995.
 At least 11.5 billion indexable web pages in 2005 [Gulli et al.].

 How can we estimate the size of the web?
Authority Problem


 Several factors have led to the mass adoption of the web as a
 publishing medium — from anonymous individuals to
 professional organizations.

 The lack of a central authority or coordination, the simplicity
 of the underlying technology, and the easy access to free web
 publishing tools mean that anybody can publish anything.

 How can we assess the reliability of content found on the web?

 Which pages can we trust?
Web Directories


 A web directory is a hierarchical structure, organized by
 topics, containing selected web sites — e.g. dmoz.org.

 In the early days of the web, these directories were very
 popular — human editors selected the highest quality pages
 for each category.

 This approach quickly became infeasible at web scale.
 Additionally, it implied a strong semantic agreement between
 the directory's editors and its users.
Search Engines

 First generation search engines were based on classic keyword
 matching techniques developed for text search. The main
 challenge was dealing with the size of the web.

 While classic text search techniques produced usable results,
 the overall quality was questionable due to the nature of web
 content.

 Most notably, the web has no central editorial control, no
 publishing standards, a high degree of content duplication, and
 some content published with malicious intent (i.e. spam).
Web’s Size

 Estimating the size of the web is not a trivial problem — e.g.
 the number of dynamic web pages is technically infinite.

 The deep web is estimated to be several orders of magnitude
 bigger than the surface web.

 The size of the surface web was considered to be 170 TB in
 2003. The deep web was several orders of magnitude bigger,
 with approximately 90,000 TB.

 “How Much Information? 2003”
 http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
SPAM

On the web, spam is an issue of major importance.
At its root, spam exists due to commercial motivations — e.g.
achieving better rankings in search engines. There is a wide
range of techniques for web spam, from simple to highly
sophisticated.

Keyword stuffing: repetition of high-value keywords in the content.
Cloaking (masking): showing different content to search engines than to users.
Link spam: artificial links created using hidden links, link farms, etc.


Web search engines operate in an adversarial information
retrieval environment (research topic).
SPAM Example

 1. Scrape content from real web documents: blogs,
    Wikipedia, news sites, etc.

 2. Mix and generate synthetic content to avoid duplicate
    detection.

 3. Insert keywords and phrases.

 4. Replace or insert links to sites being “promoted”.

 5. Publish content on the web using free publishing
    platforms (e.g. wordpress, blogspot, comments, etc).
The Web Graph
 The web is usually modeled as a directed graph, where each
 web page is a node and each link is a directed edge.



[Figure: a three-page web graph with nodes A, B, and C connected by directed links.]




 The hyperlinks that point to a page are called in-links and
 those originating in the page are called out-links. The number
 of in-links to a page is called in-degree.
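 A minimal sketch of this model in Python, assuming a plain
 dictionary as adjacency list (the page names are illustrative):

    # Web graph as an adjacency list: each page maps to its out-links.
    web_graph = {
        "A": ["B", "C"],   # page A links to B and C
        "B": ["C"],
        "C": ["A"],
    }

    def in_degree(graph, page):
        """In-degree: number of pages whose out-links include `page`."""
        return sum(1 for links in graph.values() if page in links)

    def out_degree(graph, page):
        """Out-degree: number of out-links originating in `page`."""
        return len(graph.get(page, []))

    print(in_degree(web_graph, "C"))   # 2
    print(out_degree(web_graph, "A"))  # 2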
The Bowtie Model
[Figure: the bow-tie model of the web. Pages in IN link into a strongly connected core (SCC), which links out to OUT; TENDRILS hang off IN and OUT, a TUBE connects IN directly to OUT, and DC denotes disconnected components.]
 A web surfer can pass from any page in IN to any page in SCC by
 following hyperlinks. Likewise, from any page in SCC to any page
 in OUT. SCC is a strongly connected core.
 “Graph structure in the Web” (2000) http://dx.doi.org/10.1016/S1389-1286(00)00083-9
Web Ranking
Web Document Ranking


 Web documents can be ranked in a static, absolute way or
 ranked in a given context.

 The static ranking of documents is typically called
 query-independent — i.e. each document has a weight regardless
 of any query or context. E.g.: the best document on the World
 Wide Web.

 In query-dependent ranking, each document has a different
 weight depending on the query or context being analyzed.
 E.g.: the best document for learning how to cook.
Signals

 Documents are scored (i.e. ranked) using various sources of
 information, usually called features or, more generically,
 signals. A multitude of signals can be identified:
 Query-independent signals:
   ▶ Length of document
   ▶ Age of document
   ▶ Number of incoming links
   ▶ Number of outgoing links
   ▶ Document’s host domain
   ▶ Document’s language

 Query-dependent signals:
   ▶ Number of query terms
   ▶ Time of query
   ▶ Query terms in document
   ▶ Query terms in collection
   ▶ Query terms in document title
   ▶ Query’s language

 Google reportedly uses more than 200 signals in their ranking.
Types of Signals


 The signals available in a collection of web documents can be
 divided into two groups depending on their origin.

 The signals obtained directly from the document are named
 document-based signals. E.g.: term frequency, doc length, etc.

 Signals obtained from the Web are named web-based signals.
 E.g.: number of citations, anchor text, etc.

 Web search engines have access to other sources of signals:
 click data, external collections, etc.
Document-based Signals
Term Frequency

 The number of occurrences of a term in a document is a
 signal typically used in text retrieval. However, the web is an
 adversarial information retrieval environment.

[Figure: three versions of the same document with increasing keyword stuffing of the term "flowers": TF("flowers") = 3, TF("flowers") = 10, and TF("flowers") = ∞.]
Inverse Document Frequency
 Terms that appear in fewer documents of a collection have
 more discriminative power, thus are given a higher weight.

     IDF(term) = |documents in collection| / |documents containing term|




 IDF measures the general importance of a term. Combined with
 term frequency, it yields the classic tf.idf measure.
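 A sketch of the definition above combined into tf.idf; note that
 the classic formulation usually applies a logarithm to the IDF
 ratio, which the definition above omits:

    def idf(term, collection):
        """IDF as defined above: |collection| / |documents containing term|.
        (In practice, tf.idf usually takes the log of this ratio.)"""
        containing = sum(1 for doc in collection if term in doc.split())
        return len(collection) / containing if containing else 0.0

    def tf_idf(term, doc, collection):
        """Term frequency in `doc` weighted by the term's IDF."""
        return doc.split().count(term) * idf(term, collection)

    docs = ["flores num domingo", "porto by night", "estranho domingo"]
    print(tf_idf("flores", docs[0], docs))  # 1 * (3/1) = 3.0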
Term Position
 The position of a term within an HTML file has an impact on its
 meaning and importance. Terms within title or strong tags,
 for example, are rendered with emphasis.


[Figure: two versions of the same document; in one, "flowers" appears only in the body text, in the other it also appears in the title.]
Term Position
 Regardless of the HTML structure, should terms in different
 positions have different weights?


[Figure: two versions of the same document; in one, "flowers" appears early in the text, in the other only near the end.]
Web-based Signals
Host Structure
 Web documents on the same host are related to each other.

 A document in a high-value host like www.bbc.co.uk should
 be valued higher than one in www.besttopnews.com.

 The location of a document in a site’s structure is an important
 signal. Documents that are closer to the root of a site are
 typically more important.
Anchor Text
 A citation between web documents is expressed with an HTML
 anchor tag, whose required content is the anchor text. The text
 used in anchor tags is one of the most valuable signals.

 <a href="http://www.amazon.com">amazon</a>


[Figure: several pages link to www.amazon.com using the anchor texts "amazon", "sucks", "books", and "books".]
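 A sketch of anchor-text extraction with Python’s standard
 html.parser (a real crawler would also resolve relative URLs and
 record which page each anchor came from):

    from html.parser import HTMLParser

    class AnchorExtractor(HTMLParser):
        """Collect (href, anchor text) pairs from an HTML document."""
        def __init__(self):
            super().__init__()
            self.href = None
            self.text = []
            self.anchors = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.href = dict(attrs).get("href")
                self.text = []

        def handle_data(self, data):
            if self.href is not None:
                self.text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self.href is not None:
                self.anchors.append((self.href, "".join(self.text).strip()))
                self.href = None

    parser = AnchorExtractor()
    parser.feed('<a href="http://www.amazon.com">amazon</a>')
    print(parser.anchors)  # [('http://www.amazon.com', 'amazon')]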
Link Analysis


 Link analysis has many aspects in common with the field of
 bibliometrics, more specifically citation analysis.

 Central assumption → a link is an endorsement.
 A hyperlink from page A to page B represents a vote for page
 B by the creator of page A.

 Simply using the in-degree of a page as a measure of its
 importance would be easy to manipulate (e.g. link spam).
PageRank


 Originated at Stanford and used by Google.

 The PageRank algorithm depends on the link structure of the
 web graph and assigns a score between 0 and 1 to each page.

 The PageRank weight is a query-independent score.


 “The PageRank Citation Ranking: Bringing Order to the Web”
 Larry Page, Sergey Brin, Rajeev Motwani and Terry Winograd (1998)
PageRank Random Surfer

[Figure: a small web graph where each node is annotated with the number of visits it receives from a random surfer.]

 1. Consider a random surfer visiting web pages, following a
    random out-link at each step.

 2. Eventually, the nodes with a higher in-degree will be
    visited more often.

 3. The idea behind PageRank is that pages that have more
    visits are more important.
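 A minimal random-surfer simulation, assuming a toy five-page
 graph (node names and the step count are illustrative):

    import random
    from collections import Counter

    def random_surf(graph, steps=100_000, seed=42):
        """Follow a random out-link at each step; jump to a random
        page when the current page has no out-links."""
        rng = random.Random(seed)
        pages = list(graph)
        page = rng.choice(pages)
        visits = Counter()
        for _ in range(steps):
            visits[page] += 1
            out_links = graph[page]
            page = rng.choice(out_links) if out_links else rng.choice(pages)
        return visits

    graph = {"A": ["D"], "B": ["A", "E"], "C": ["A"],
             "D": ["B"], "E": ["A", "C", "D"]}
    print(random_surf(graph).most_common())  # visit counts approximate importance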
PageRank Calculation
              PR(A) = (1 − d) + d × Σ_{p ∈ In(A)} PR(p) / |Out(p)|




[Figure: a toy graph annotated with the PageRank value of each node, computed with d = 1.]




 The computation is performed iteratively until the change
 between iterations falls below a minimum threshold.
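 A sketch of this iterative computation using the formula above
 (d = 0.85 is a commonly reported damping value; the graph is the
 five-page example of the next slide):

    def pagerank(graph, d=0.85, tol=1e-6, max_iter=100):
        """Iterate PR(A) = (1 - d) + d * sum(PR(p) / |Out(p)|) over the
        in-links of A until the largest per-page change is below `tol`.
        Dangling pages (no out-links) are ignored in this sketch."""
        pr = {page: 1.0 for page in graph}
        for _ in range(max_iter):
            new_pr = {
                page: (1 - d) + d * sum(pr[p] / len(graph[p])
                                        for p in graph if page in graph[p])
                for page in graph
            }
            if max(abs(new_pr[p] - pr[p]) for p in graph) < tol:
                return new_pr
            pr = new_pr
        return pr

    graph = {"A": ["D"], "B": ["A", "E"], "C": ["A"],
             "D": ["B"], "E": ["A", "C", "D"]}
    print(pagerank(graph))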
PageRank Example

[Figure: a five-page graph with links A→D, B→A, B→E, C→A, D→B, E→A, E→C, E→D.]

 The PageRank equations for this graph (omitting the damping
 factor) are:

     PR(A) = PR(B)/2 + PR(C)/1 + PR(E)/3
     PR(B) = PR(D)/1
     PR(C) = PR(E)/3
     PR(D) = PR(A)/1 + PR(E)/3
     PR(E) = PR(B)/2
HITS

 Hyperlink-Induced Topic Search (HITS) was proposed by
 Jon Kleinberg in 1999.

 HITS is an algorithm that uses the link structure of the web
 to produce two query-dependent scores — an authority score
 and a hub score.

 An authority is a page with many citations from hubs.
 A hub is a page that cites a large number of authorities.

 Three major differences from PageRank:
 (1) it is computed at query time (!); (2) it produces two values
 for each page; (3) it is applied to subsets of the web.
HITS Calculation

 1. Select a collection of documents related to a query.

 2. Iteratively calculate authority and hub values for each
    document.

                 Authority(A) = Σ_{p ∈ In(A)} Hub(p)

                 Hub(A) = Σ_{p ∈ Out(A)} Authority(p)
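 A sketch of the two update rules, adding the usual normalization
 step after each pass, which the formulas above omit (same toy
 graph format as before):

    import math

    def hits(graph, iterations=20):
        """Iteratively update authority and hub scores, normalizing
        each pass so the values do not grow without bound."""
        auth = {p: 1.0 for p in graph}
        hub = {p: 1.0 for p in graph}
        for _ in range(iterations):
            # Authority(A): sum of hub scores of pages linking to A.
            auth = {a: sum(hub[p] for p in graph if a in graph[p]) for a in graph}
            # Hub(A): sum of authority scores of pages A links to.
            hub = {a: sum(auth[p] for p in graph[a]) for a in graph}
            a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
            h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return auth, hub

    graph = {"A": ["D"], "B": ["A", "E"], "C": ["A"],
             "D": ["B"], "E": ["A", "C", "D"]}
    auth, hub = hits(graph)
    print(auth, hub)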
Scoring

 With so many signals, how to obtain a single ranking score?

     Score(P) = α × S1(P) + β × S2(P) + γ × S3(P) + ...


  1. Manual tuning by experts based on real-data
     measurements.

  2. Use machine-learning methods to automatically build
     ranking formulas: learning to rank / machine-learned
     relevance.
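 A minimal sketch of such a linear combination; the signal names,
 values, and weights below are purely illustrative:

    def score(signals, weights):
        """Weighted linear combination of ranking signals for one page."""
        return sum(weights[name] * value for name, value in signals.items())

    # Hypothetical per-page signal values and hand-tuned weights.
    signals = {"pagerank": 0.35, "tf_idf": 2.4, "anchor_match": 1.0}
    weights = {"pagerank": 0.5, "tf_idf": 0.3, "anchor_match": 0.2}
    print(score(signals, weights))  # 0.175 + 0.72 + 0.2 = 1.095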
Search Engines
Discovering Information


 There are two broad categories of services for facilitating the
 discovery of information on the web.

 Full-Text Search Engines
 Generically known as web search engines, these services crawl
 the web, index its contents and rank the documents.

 Web Directories
 Topic-oriented collections, maintained by human editors.
Search Engine Architecture

[Figure: search engine architecture. A crawler fetches pages from the web; an indexer parses them and builds on-disk indices; ranking and search components use the indices to answer user queries.]
Crawler

 Includes the software that finds and fetches web pages.
 Multiple and distributed crawlers operate simultaneously.

 First generation search engines had a scheduled periodic crawl
 of the web. In current search engines, crawlers operate
 continuously — e.g. very popular and dynamic documents are
 crawled multiple times a day.

 The number of pages on the web is effectively infinite, thus the
 crawler must decide which pages to crawl and which to skip.

 A crawler must be robust and polite.
 A crawler should be distributed, scalable, efficient, fresh,
 quality-targeted and extensible.
robots.txt
 www.publico.pt/robots.txt:

    User-agent: *
    Disallow: /ADS/
    Disallow: /banners/
    Disallow: /bartoon/
    Disallow: /bdt/
    Disallow: /bin/
    Disallow: /calvin_and_hobbes/
    Disallow: /cinecartaz/
    Disallow: /desportohtml/
    Disallow: /emprego/
    Disallow: /especial/
    Disallow: /img/
    Disallow: /includeKimus/
    Disallow: /lazer/
    Disallow: /mail/
    Disallow: /static/
    Disallow: /xsl/

 www.google.com/robots.txt:

    User-agent: *
    Disallow: /search
    Disallow: /groups
    Disallow: /images
    Disallow: /catalogs
    Disallow: /catalogues
    Disallow: /news
    Allow: /news/directory
    Disallow: /nwshp
    Disallow: /setnewsprefs?
    Disallow: /index.html?
    Disallow: /?
    Disallow: /addurl/image?
    Disallow: /pagead/
    Disallow: /relpage/
    Disallow: /relcontent
    Disallow: /imgres
    Disallow: /imglanding
    Disallow: /keyword/
    Disallow: /u/
    Disallow: /univ/
    Disallow: /cobrand
    ...
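 Politeness includes honoring these rules. A sketch using
 Python’s standard urllib.robotparser with a fragment of the
 Google rules above (the Allow line is listed before the Disallow
 it overrides, because this parser applies the first matching rule):

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /search",
        "Allow: /news/directory",
        "Disallow: /news",
    ]

    rp = RobotFileParser()
    rp.parse(rules)
    print(rp.can_fetch("*", "http://www.google.com/search"))          # False
    print(rp.can_fetch("*", "http://www.google.com/news/directory"))  # True
    print(rp.can_fetch("*", "http://www.google.com/news/top"))        # False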
Indexer
 Indices are data structures designed for fast reading.
 The index is the biggest component of a search engine.

 Web documents are parsed and separated into tokens. This is
 a very challenging task due to the diversity of the web: file
 formats, language ambiguity, word boundaries, etc.


                        a          →   d1, ...
                        domingo    →   d1, d17, d30
                        estranho   →   d2
                        flores     →   d1, d3, d5
                        porto      →   d4, d18
                        ...




 Research challenges in: size optimization, parallelism,
 maintenance, lookup speed, etc.
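 A minimal inverted-index sketch matching the structure above
 (tokenization is naive; the documents and IDs are illustrative):

    from collections import defaultdict

    def build_index(docs):
        """Map each token to the sorted list of document IDs containing it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in text.lower().split():
                index[token].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    docs = {"d1": "flores num domingo", "d2": "estranho", "d3": "flores"}
    index = build_index(docs)
    print(index["flores"])  # ['d1', 'd3']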
Ranking and Presentation




[Figure: a query enters a black box ("magic") and the top 10 documents come out, in a few milliseconds.]




 For a given query, documents are ordered combining hundreds
 of signals. Additionally, ads are selected ($) and snippets are
 produced for each document. All in a few milliseconds.
Business

     “1% of the web search market is worth over $1 billion”

 Search engines’ business model is based on advertising.

 The first business models were based on small per-view charges.
 Ads were published indiscriminately, resulting in low
 conversion rates.

 The use of targeted advertising (ads are related to searches)
 resulted in much higher conversion rates. Advertisers bid on
 query terms and pay-per-click.

 Search engines operate complex systems that try to maximize
 revenue by selecting which ads to display.
Summary


 The World Wide Web didn’t exist 20 years ago.

 The Web is scientifically young and combines research from
 many different fields, not just technology.

 There are many open problems, and many more yet to be posed.

 Some currently hot topics: learning to rank, wisdom of the
 crowds, social media, real-time search, contextual search, HCIR.
Thank You
http://www.fe.up.pt/~ssn
References



  ▶   Introduction to Information Retrieval (2008)
      Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
      http://www.informationretrieval.org

  ▶   Web Information Retrieval (2009)
      Nick Craswell and David Hawking

				