
					Search Engines & Question Answering

             Giuseppe Attardi
             Università di Pisa
   Web Search
    – Search engines
    – Architecture
    – Crawling: parallel/distributed, focused
    – Link analysis (Google PageRank)
    – Scaling
Top Online Activities
    Email: 96%
    Web Search: 88%
    Product Info.: 72%

         Source: Jupiter Communications, 2000
Pew Study (US users July 2002)
 Total Internet users = 111 M
 Do a search on any given day = 33 M
 Have used Internet to search = 85%

Search on the Web
   Corpus: the publicly accessible Web, static + dynamic

   Goal: Retrieve high quality results relevant to the user’s need
     – (not docs!)
   Need
     – Informational – want to learn about something (~40%)
                                         Low hemoglobin
     – Navigational – want to go to that page (~25%)
                                         United Airlines
     – Transactional – want to do something (web-mediated) (~35%)
         • Access a service (Tampere weather)
         • Downloads (Mars surface images)
         • Shop (Nikon CoolPix)
      – Gray areas
         • Find a good hub (Car rental Finland)
         • Exploratory search: “see what’s there”

   Static pages (documents)
    – text, mp3, images, video, ...
   Dynamic pages = generated
    on request
    – data base access
    – “the invisible web”
    – proprietary content, etc.

URL = Universal Resource Locator: access method :// host name / page name
   Immense amount of content
     – 2-10B static pages, doubling every 8-12 months
     – Lexicon Size: 10s-100s of millions of words
   Authors galore (1 in 4 hosts run a web server)

   Languages/Encodings
    – Hundreds (thousands ?) of languages, W3C encodings: 55
      (Jul01) [W3C01]
    – Home pages (1997): English 82%, Next 15: 13% [Babe97]
     – Google (mid 2001): English: 53%, JGCFSKRIP (Japanese, German, Chinese, French, Spanish, Korean, Russian, Italian, Portuguese): 30%
   Document & query topic
    Popular Query Topics (from 1 million Google queries, Apr 2000)
     Arts           14.6%          Arts: Music               6.1%
     Computers      13.8%          Regional: North America   5.3%
     Regional       10.3%          Adult: Image Galleries    4.4%
     Society        8.7%           Computers: Software       3.4%
     Adult          8%             Computers: Internet       3.2%
     Recreation     7.3%           Business: Industries      2.3%
     Business       7.2%           Regional: Europe          1.8%
     …              …              …                         …
Rate of change
[Cho00] 720K pages from 270 popular
 sites sampled daily from Feb 17 –
 Jun 14, 1999
   Mathematically, what
   does this seem to be?
Web idiosyncrasies
   Distributed authorship
    – Millions of people creating pages with their
      own style, grammar, vocabulary, opinions,
      facts, falsehoods …
    – Not all have the purest motives in providing
      high-quality information - commercial motives
      drive “spamming” - 100s of millions of pages.
    – The open web is largely a marketing tool.
        • IBM’s home page does not contain the word “computer”.
Other characteristics
   Significant duplication
    – Syntactic - 30%-40% (near) duplicates
          [Brod97, Shiv99b]
    – Semantic - ???
   High linkage
     – ~ 8 links/page on average
   Complex graph topology
    – Not a small world; bow-tie structure [Brod00]
   More on these corpus characteristics later
    – how do we measure them?
Web search users
   Ill-defined queries
     – Short (AV 2001: 2.54 terms avg; 80% < 3 words)
     – Imprecise terms
     – Sub-optimal syntax (80% of queries without operators)
     – Low effort
   Wide variance in
     – Needs
     – Expectations
     – Knowledge
     – Bandwidth
   Specific behavior
     – 85% look over one result screen only (mostly above the fold)
     – 78% of queries are not modified (one query/session)
     – Follow links – “the scent of information” …
Evolution of search engines
   First generation -- use only “on page” text data (1995–1997: AltaVista, Excite, Lycos, etc.)
     – Word frequency, language

   Second generation -- use off-page, web-specific data (from 1998; made popular by Google, but now used by everyone)
     – Link (or connectivity) analysis
     – Click-through data (what results people click on)
     – Anchor text (how people refer to this page)

   Third generation -- answer “the need behind the query” (still experimental)
     – Semantic analysis -- what is this about?
     – Focus on user need, rather than on query
     – Context determination
     – Helping the user
     – Integration of search and text analysis
Third generation search engine:
answering “the need behind the query”
 Query language determination
 Different ranking
  –(if the query is in Japanese, do not return English pages)
 Hard   & soft matches
  –Personalities (triggered on names)
  –Cities (travel info, maps)
  –Medical info (triggered on names and/or medical terms)
  –Stock quotes, news (triggered on stock symbols)
  –Company info, …
 Integration   of Search and Text Analysis
Answering “the need behind the query”
Context determination

   Context determination
     –   spatial (user location/target location)
     –   query stream (previous queries)
     –   personal (user profile)
     –   explicit (vertical search, family friendly)
     –   implicit (use AltaVista from AltaVista France)
   Context use
     – Result restriction
     – Ranking modulation
The spatial context - geo-search
   Two aspects
     – Geo-coding
        • encode geographic coordinates to make search effective
     – Geo-parsing
        • the process of identifying geographic context.
   Geo-coding
     – Geometrical hierarchy (squares)
     – Natural hierarchy (country, state, county, city, zip-codes, etc)
   Geo-parsing
     – Pages (infer from phone nos, zip, etc). About 10% feasible.
     – Queries (use dictionary of place names)
     – Users
        • From IP data
     – Mobile phones
        • In its infancy, many issues (display size, privacy, etc)
Geo-search examples: AltaVista “barry bonds”, Lycos “palo alto”, Northern Light (now Divine Inc)
Helping the user

 UI
 spellchecking
 query refinement
 query suggestion
 context transfer …
Context sensitive spell check
 Search Engine Architecture

[Diagram: Crawlers, steered by Crawl Control, feed pages into the Page Repository; the Indexer builds the Text and Structure indexes; the Ranking module and Query Engine answer queries and return results.]
 Crawler
 Crawler control
 Indexes – text, structure, utility
 Page repository
 Indexer
 Collection analysis module
 Query engine
 Ranking module

             “Hidden Treasures”
   The page repository is a scalable storage
    system for web pages
   Allows the Crawler to store pages
   Allows the Indexer and Collection
    Analysis to retrieve them
   Similar to other data storage systems –
    DB or file systems
   Does not have to provide some of the
    other systems’ features: transactions,
    logging, directory
Storage Issues
 Scalability and seamless load distribution
 Dual access modes
    – Random access (used by the query engine for
      cached pages)
    – Streaming access (used by the Indexer and
      Collection Analysis)
 Large bulk update – reclaim old space,
  avoid access/update conflicts
 Obsolete pages - remove pages no longer
  on the web
Designing a Distributed Web Repository
 Repository designed to work over a
  cluster of interconnected nodes
 Page distribution across nodes
 Physical organization within a node
 Update strategy
Page Distribution
 How to choose a node to store a page
 Uniform distribution – any page can
  be sent to any node
 Hash distribution policy – hash page
  ID space into node ID space
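The hash-distribution policy can be sketched in a few lines; `node_for_page` and the choice of MD5 over the URL as the page ID are illustrative assumptions, not the repository's actual scheme:

```python
import hashlib

def node_for_page(url, num_nodes):
    """Hash-distribution policy: map a page's ID (here, its URL)
    uniformly into the node ID space [0, num_nodes)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes
```

Because the hash spreads page IDs uniformly, each node receives roughly 1/num_nodes of the pages, giving the seamless load distribution mentioned above.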
Organization Within a Node
    Several operations required
     – Add / remove a page
     – High speed streaming
     – Random page access
    Hashed organization
     – Treat each disk as a hash bucket
     – Assign according to a page’s ID
    Log organization
     – Treat the disk as one file, and add the page at the end
     – Support random access using a B-tree
    Hybrid
     – Hash map a page to an extent and use log structure
       within an extent.
Distribution Performance

                 Log   Hashed   Hashed-log
 Streaming        ++     -         +
 Random access    +-     ++        +-
 Page addition    ++     -         +
         Update Strategies

 Updates are generated by the crawler
 Several characteristics
    – Time in which the crawl occurs and
      the repository receives information
    – Whether the crawl’s information
      replaces the entire database or
      modifies parts of it
Batch vs. Steady
   Batch mode
    – Periodically executed
    – Allocated a certain amount of time
   Steady mode
    – Run all the time
     – Always send results back to the repository
Partial vs. Complete Crawls
   A batch mode crawler can
    – Do a complete crawl every run, and replace
      entire collection
    – Recrawl only a specific subset, and apply
      updates to the existing collection – partial
   The repository can implement
    – In place update
       • Quickly refresh pages
    – Shadowing, update as another stage
       • Avoid refresh-access conflicts
Partial vs. Complete Crawls
 Shadowing resolves the conflicts between updates and reads on the repository
 A batch-mode crawler suits shadowing well
 A steady crawler suits in-place updates
The Indexer Module
Creates two indexes:
 Text (content) index: uses “traditional” indexing methods such as inverted indexing
 Structure (links) index: uses a directed graph of pages and links; sometimes also creates an inverted graph
The Link Analysis Module
Uses the 2 basic indexes created by
the indexer module in order to
assemble “Utility Indexes”
e.g. : A site index.
Inverted Index
 A set of inverted lists, one per index term (word)
 Inverted list of a term: a sorted list of the locations in which the term appears
 Posting: a pair (w, l) where w is a word and l is one of its locations
 Lexicon: holds all the index’s terms, with statistics about each term (not the postings)
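These definitions can be made concrete in a minimal sketch; the tokenizer (lowercased whitespace split) and the lexicon statistic (document frequency only) are simplifying assumptions:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping doc_id -> text.
    Returns (index, lexicon): index[term] is a sorted list of postings
    (doc_id, position); lexicon[term] holds per-term statistics."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))   # one posting per occurrence
    for postings in index.values():
        postings.sort()                          # sorted list of locations
    lexicon = {term: {"df": len({d for d, _ in postings})}
               for term, postings in index.items()}
    return dict(index), lexicon
```

A real engine would compress the postings lists and keep the lexicon in memory; this sketch only shows the term → postings → lexicon structure.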
   Index build must be:
     – Fast
     – Economical
   (unlike traditional index building)
 Incremental indexing must be supported
 Storage: compression vs. speed
Index Partitioning
A distributed text index can be built by:
 Local inverted file (IFL)
     – Each node contains a disjoint, random subset of the pages
     – The query is broadcast to all nodes
     – The result is the join of the per-node answers
   Global inverted file (IFG)
     – Each node is responsible only for a subset of the terms in the collection
     – A query is sent only to the appropriate nodes
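The IFL query path can be sketched as follows; representing each node's inverted file as a plain dict from term to page IDs is an assumption for illustration:

```python
def query_local_partition(node_indexes, term):
    """IF_L query: broadcast the term to every node (each node holds an
    inverted file over its own disjoint pages) and join the answers."""
    results = []
    for index in node_indexes:          # broadcast to all nodes
        results.extend(index.get(term, []))
    return sorted(results)              # join the partial answers
```

Under IFG, by contrast, the routing step would pick the single node owning the term, and no join would be needed.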
Indexing, Conclusion
 Web page indexing is complicated by its scale (millions of pages, hundreds of gigabytes)
 Challenges: incremental indexing and personalization
   Google (Nov 2002):
     – Number of pages: 3 billion
     – Refresh interval: 1 month (≈ 1200 pages/s)
     – Queries/day: 150 million ≈ 1700 q/s
 Avg page size:10KB
 Avg query size: 40 B
 Avg result size: 5 KB
 Avg links/page: 8
Size of Dataset
   Total raw HTML data size: 3 G pages × 10 KB = 30 TB
 Inverted index ≈ corpus = 30 TB
 With 3:1 compression: (30 TB + 30 TB) / 3 = 20 TB on disk
Single copy of index
   Index
     – (10 TB) / (100 GB per disk) = 100 disks
   Documents
     – (10 TB) / (100 GB per disk) = 100 disks
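The sizing arithmetic above can be checked in a few lines (variable names are illustrative; the figures are the slides' own):

```python
# Back-of-the-envelope sizing, reproducing the slides' arithmetic.
TB = 10**12
raw_html = 3 * 10**9 * 10 * 10**3            # 3 G pages x 10 KB = 30 TB
inverted_index = raw_html                    # index ~ same size as corpus
on_disk = (raw_html + inverted_index) / 3    # 3:1 compression -> 20 TB
disks_per_part = (on_disk / 2) / (100 * 10**9)   # 100 GB disks -> 100 each
```

So a single copy needs 100 disks for the index and 100 for the documents, exactly as listed above.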
Query Load
 1700 queries/sec
 Rule of thumb: 20 q/s per CPU
     – 1700 / 20 = 85 clusters to answer queries
 Cluster: 100 machines
 Total = 85 × 100 = 8500 machines
 Document servers
     – Snippet generation: 1000 snippets/s per machine
     – (1700 q/s × 10 snippets / 1000) × 100 = 1700 machines
 Redirector: 4000 req/s
 Bandwidth limit: 1100 req/s
 Server: 22 q/s each
 Cluster: 50 nodes = 1100 q/s ≈ 95 million q/day
Scaling the Index

[Diagram: queries pass through a hardware-based load balancer to replicated Google Web Servers; each web server consults the spell checker and the ad server, and fans the query out to the index servers and document servers.]
Pooled Shard Architecture

[Diagram: a web server sends each query through an index load balancer to one of index servers 1…K on the index server network (1 Gb/s links); for each shard 1…N, the index server goes through an intermediate load balancer to a pool of shard index servers (SIS) S1…SN on the pool network (100 Mb/s links).]
Replicated Index Architecture

[Diagram: a web server sends each query through an index load balancer (1 Gb/s links) to one of index servers 1…M on the index server network; each index server owns a full index replica, served by its own shard index servers (SIS) for shards S1…SN (100 Mb/s links).]
Index Replication
 100 Mb/s bandwidth
 20 TB × 8 bits / 100 Mb/s ≈ 1.6 × 10^6 s
 At 100 Mb/s only about 1 TB can be copied per full day, so replicating the 20 TB index over a single link takes weeks; replication must run in parallel across many links
First generation ranking
   Extended Boolean model
    –   Matches: exact, prefix, phrase,…
    –   Operators: AND, OR, AND NOT, NEAR, …
    –   Fields: TITLE:, URL:, HOST:,…
    –   AND is somewhat easier to implement, maybe
        preferable as default for short queries
   Ranking
    – TF like factors: TF, explicit keywords, words in
      title, explicit emphasis (headers), etc
    – IDF factors: IDF, total word count in corpus,
      frequency in query log, frequency in language
Second generation search engine
   Ranking -- use off-page, web-specific data
     – Link (or connectivity) analysis
     – Click-through data (what results people click on)
     – Anchor text (how people refer to this page)
   Crawling
     – Algorithms to create the best possible corpus
Connectivity analysis

 Idea: mine hyperlink information in the Web
 Assumptions:
   – Links often connect related pages
   – A link between pages is a recommendation: “people vote with their links”
Citation Analysis
   Citation frequency
   Co-citation coupling frequency
     – Co-citations with a given author measure “impact”
     – Co-citation analysis [Mcca90]
   Bibliographic coupling frequency
     – Articles that cite the same articles are related
   Citation indexing
     – Who is a given author cited by? (Garfield [Garf72])
   Pinski and Narin: influence weights
Query-independent ordering
 First generation: using link counts as
  simple measures of popularity
 Two basic suggestions:
    – Undirected popularity:
      • Each page gets a score = the number of in-
        links plus the number of out-links (3+2=5)
    – Directed popularity:
      • Score of a page = number of its in-links (3)
Query processing
 First retrieve all pages meeting the
  text query (say venture capital)
 Order these by their link popularity
  (either variant on the previous page)
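This two-step query processing can be sketched directly; `retrieve_and_rank`, the dict-based text index, and the single-term query are illustrative assumptions:

```python
def retrieve_and_rank(text_index, in_links, query_term):
    """First retrieve all pages matching the text query, then order
    them by directed popularity (in-link count)."""
    matches = text_index.get(query_term, [])
    return sorted(matches, key=lambda p: in_links.get(p, 0), reverse=True)
```

The undirected variant would use in-link plus out-link counts as the sort key instead.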
Spamming simple popularity
 Exercise: How do you spam each of the following heuristics so your page gets a high score?
 Each page gets a score = the number of in-links plus the number of out-links
 Score of a page = number of its in-links
PageRank scoring
   Imagine a browser doing a random walk on web pages:
     – Start at a random page
     – At each step, go out of the current page along one of the links on that page, chosen uniformly at random (e.g., with probability 1/3 each for three out-links)
   “In the steady state” each page has a long-term visit rate – use this as the page’s score
Not quite enough
   The web is full of dead ends
     – The random walk can get stuck in dead ends
     – Then it makes no sense to talk about long-term visit rates

Teleporting
 At each step, with probability 10%, jump to a random web page
 With remaining probability (90%), go out on a random link
     – If no out-link, stay put in this case
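The teleporting walk can be simulated to estimate the visit rates empirically; `simulate_walk` and the dict-based graph are illustrative assumptions, and the stay-put rule for dead ends follows the slide:

```python
import random

def simulate_walk(out_links, steps=200_000, teleport=0.10, seed=42):
    """Simulate the teleporting surfer; return empirical visit rates."""
    rng = random.Random(seed)
    pages = list(out_links)
    page = rng.choice(pages)
    visits = dict.fromkeys(pages, 0)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < teleport:           # 10%: jump to a random page
            page = rng.choice(pages)
        elif out_links[page]:                 # 90%: follow a random out-link
            page = rng.choice(out_links[page])
        # else: no out-link -> stay put, per the slide
    return {p: v / steps for p, v in visits.items()}
```

With teleporting, every page is visited with non-zero long-term rate even when the graph has dead ends.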
Result of teleporting
 Now cannot get stuck locally
 There is a long-term rate at which
  any page is visited (not obvious, will
  show this)
 How do we compute this visit rate?
PageRank
 Tries to capture the notion of the “importance of a page”
 Uses backlinks for ranking
 Avoids trivial spamming: distributes a page’s “voting power” among the pages it links to
 An “important” page linking to a page will raise its rank more than a “not important” one
Simple PageRank
   Given by:

     r(i) = Σ_{j ∈ B(i)} r(j) / N(j)

     B(i): set of pages linking to i
     N(j): number of outgoing links from j

 Well defined if the link graph is strongly connected
 Based on the “random surfer model”: the rank of a page equals the probability of the surfer being at that page
Computation Of PageRank (1)

     r = A^t r,   r = [r(1), r(2), …, r(m)]

     a_{i,j} = 1 / N(i)   if i points to j
             = 0          otherwise
Computation of PageRank (2)
 Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv
 Hence r is an eigenvector of A^t for eigenvalue 1
 If G is strongly connected then r is unique
Computation of PageRank (3)
   Simple PageRank can be computed by iteration:

     1. s ← any random vector
     2. r ← A^t s
     3. if ||r − s|| < ε goto 5
     4. s ← r; goto 2
     5. r is the PageRank vector
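The iteration above can be sketched in Python; `simple_pagerank` and the dict-of-out-links graph representation are illustrative assumptions, and the strong-connectivity precondition is taken from the slides:

```python
def simple_pagerank(out_links, eps=1e-8):
    """Power iteration r <- A^t s, as in the boxed algorithm.
    Assumes the link graph is strongly connected."""
    pages = list(out_links)
    s = {p: 1.0 / len(pages) for p in pages}     # step 1: start vector
    while True:
        r = dict.fromkeys(pages, 0.0)            # step 2: r <- A^t s
        for i, links in out_links.items():
            for j in links:
                r[j] += s[i] / len(links)        # a_{i,j} = 1/N(i)
        if sum(abs(r[p] - s[p]) for p in pages) < eps:
            return r                             # steps 3 and 5: converged
        s = r                                    # step 4: iterate
```

On a simple 3-cycle a → b → c → a, every page ends with rank 1/3, as the symmetry suggests.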
PageRank Example
Practical PageRank: Problem
   Web is not a strongly connected
    graph. It contains:
    – “Rank Sinks”: cluster of pages without
      outgoing links. Pages outside cluster
      will be ranked 0.
    – “Rank Leaks”: a page without outgoing
      links. All pages will be ranked 0.
Practical PageRank: Solution
 Remove all rank leaks
 Add a decay factor d to Simple PageRank:

     r(i) = (1 − d) / m + d · Σ_{j ∈ B(i)} r(j) / N(j)     (m = number of pages)

   Based on the “bored surfer model”
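A sketch of the practical variant follows; the value d = 0.85 and the choice to spread a leak's rank evenly over all pages are assumptions (the slides only say leaks must be handled):

```python
def pagerank(out_links, d=0.85, iters=100):
    """Practical PageRank with decay factor d ("bored surfer"):
    r(i) = (1 - d)/m + d * sum_{j in B(i)} r(j)/N(j).
    The uniform term rescues rank sinks; a leak's rank is spread evenly."""
    pages = list(out_links)
    m = len(pages)
    r = dict.fromkeys(pages, 1.0 / m)
    for _ in range(iters):
        nxt = dict.fromkeys(pages, (1.0 - d) / m)
        for i, links in out_links.items():
            share = d * r[i] / (len(links) or m)
            for j in (links or pages):       # leak: distribute to everyone
                nxt[j] += share
        r = nxt
    return r
```

Setting d = 1 and requiring strong connectivity recovers Simple PageRank.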
HITS: Hypertext Induced Topic Search

   A query-dependent technique
   Produces two scores:
      – Authority: a page most likely to be relevant to a given query
      – Hub: a page that points to many authorities
    Contains two parts:
      – Identifying the focused subgraph
      – Link analysis
HITS: Identifying The Focused Subgraph

 Subgraph creation from a t-sized initial page set:

     1. R ← t initial pages
     2. S ← R
     3. for each page p ∈ R:
        (a) include in S all the pages that p points to
        (b) include in S (up to a maximum d) all pages that point to p
     4. S holds the focused subgraph

   (d reduces the influence of extremely popular pages like Yahoo!)
HITS: Link Analysis
   Calculates authority & hub scores (a_i and h_i) for each page in S:

     1. Initialize a_i, h_i (1 ≤ i ≤ n) arbitrarily
     2. Repeat until convergence:
        (a) a_i ← Σ_{j ∈ B(i)} h_j   (for all pages)
        (b) h_i ← Σ_{j ∈ F(i)} a_j   (for all pages)
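The two-step iteration can be sketched as follows; `hits`, the dict-of-out-links representation, and the per-round L2 normalization (needed so the scores converge rather than grow) are illustrative assumptions:

```python
def hits(out_links, iters=50):
    """Iterative authority/hub computation on the focused subgraph S.
    out_links[p] = pages p points to (F(p)); B(i) is derived from it."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    back = {p: [] for p in pages}                 # B(i): who points at i
    for p, qs in out_links.items():
        for q in qs:
            back[q].append(p)
    a = dict.fromkeys(pages, 1.0)
    h = dict.fromkeys(pages, 1.0)
    for _ in range(iters):
        a = {p: sum(h[j] for j in back[p]) for p in pages}            # (a)
        h = {p: sum(a[j] for j in out_links.get(p, [])) for p in pages}  # (b)
        na = sum(v * v for v in a.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in h.values()) ** 0.5 or 1.0
        a = {p: v / na for p, v in a.items()}     # normalize each round
        h = {p: v / nh for p, v in h.items()}
    return a, h
```

Two hub pages pointing at one target make that target the top authority and score the two hubs equally.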
HITS: Link Analysis Computation

 The scores can equivalently be computed as eigenvectors:

     a = A^t h  ⟹  a = A^t A a
     h = A a    ⟹  h = A A^t h

 a: vector of authorities’ scores
 h: vector of hubs’ scores
 A: adjacency matrix, with a_{i,j} = 1 if i points to j
Markov chains
 A Markov chain consists of n states, plus an n × n transition probability matrix P
 At each step, we are in exactly one of the states
 For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i
   (a self-loop P_ii > 0 is OK)
Markov chains

 Clearly, for all i, Σ_{j=1}^{n} P_ij = 1

 Markov chains are abstractions of random walks
 Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain
Ergodic Markov chains
   A Markov chain is ergodic if
     – you have a path from any state to any other state
     – you can be in any state at every time step, with non-zero probability
Ergodic Markov chains
   For any ergodic Markov chain, there
    is a unique long-term visit rate for
    each state
    – Steady-state distribution
 Over a long time-period, we visit
  each state in proportion to this rate
 It doesn’t matter where we start
 Probability vectors
  A probability (row) vector x = (x_1, …, x_n) tells us where the walk is at any point
  E.g., (0 0 … 1 … 0 0), with the 1 in position i (1 ≤ i ≤ n), means we’re in state i

More generally, the vector x = (x_1, …, x_n) means the walk is in state i with probability x_i, with

     Σ_{i=1}^{n} x_i = 1
Change in probability vector
 If the probability vector is x = (x_1, …, x_n) at this step, what is it at the next step?
 Recall that row i of the transition probability matrix P tells us where we go next from state i
 So from x, our next state is distributed as xP
Computing the visit rate
   The steady state looks like a vector of probabilities a = (a_1, …, a_n):
     – a_i is the probability that we are in state i

 [Figure: a two-state chain; in this example a_1 = 1/4 and a_2 = 3/4]
How do we compute this vector?
 Let a = (a_1, …, a_n) denote the row vector of steady-state probabilities
 If our current position is described by a, then the next step is distributed as aP
 But a is the steady state, so a = aP
 Solving this matrix equation gives us a
     – So a is the (left) eigenvector for P
     – (Corresponds to the “principal” eigenvector of P with the largest eigenvalue)
One way of computing a
   Recall, regardless of where we start, we
    eventually reach the steady state a
   Start with any distribution (say x=(10…0))
   After one step, we’re at xP
   after two steps at xP2 , then xP3 and so on
   “Eventually” means for “large” k, xPk = a
   Algorithm: multiply x by increasing
    powers of P until the product looks stable
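The algorithm can be sketched on the two-state example; the concrete transition matrix below is an assumption chosen so its steady state is a = (1/4, 3/4), since the slide only shows the answer:

```python
def power_iterate(P, x, k=100):
    """Repeatedly multiply a start distribution x by P: x, xP, xP^2, ...
    For an ergodic chain this converges to the steady-state vector a."""
    n = len(P)
    for _ in range(k):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x

# Assumed two-state chain whose steady state is a = (1/4, 3/4).
P = [[0.25, 0.75],
     [0.25, 0.75]]
a = power_iterate(P, [1.0, 0.0])
```

Regardless of the start vector, the product settles at (1/4, 3/4), matching the earlier example.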
Lempel: SALSA
   By applying the ergodic theorem, Lempel proved that:
     – a_i is proportional to the number of incoming links
Pagerank summary
   Preprocessing:
    – Given graph of links, build matrix P
    – From it compute a
    – The entry ai is a number between 0 and
      1: the PageRank of page i
   Query processing:
    – Retrieve pages meeting query
    – Rank them by their PageRank
    – Order is query-independent
The reality
   Pagerank is used in Google, but so
    are many other clever heuristics