Search Engines & Question Answering

             Giuseppe Attardi
             Università di Pisa
Topics
   Web Search
    – Search engines
    – Architecture
    – Crawling: parallel/distributed, focused
    – Link analysis (Google PageRank)
    – Scaling
Top Online Activities
   Email                   96%
   Web Search              88%
   Product Info. Search    72%

         Source: Jupiter Communications, 2000
Pew Study (US users July 2002)
 Total Internet users = 111 M
 Do a search on any given day = 33 M
 Have used Internet to search = 85%

    http://www.pewinternet.org/reports/toc.asp?Report=64
Search on the Web
   Corpus: the publicly accessible Web: static + dynamic
   Goal: retrieve high-quality results relevant to the user’s need
     – (not docs!)
   Need
     – Informational – want to learn about something (~40%)
        • Low hemoglobin
     – Navigational – want to go to that page (~25%)
        • United Airlines
     – Transactional – want to do something (web-mediated) (~35%)
        • Access a service: Tampere weather
        • Downloads: Mars surface images
        • Shop: Nikon CoolPix
     – Gray areas
        • Find a good hub: Car rental Finland
        • Exploratory search: “see what’s there”
Results

   Static pages (documents)
    – text, mp3, images, video, ...
   Dynamic pages = generated on request
    – database access
    – “the invisible web”
    – proprietary content, etc.
Terminology

   URL = Uniform Resource Locator

   http://www.cism.it/cism/hotels_2001.htm

   Access method: http
   Host name: www.cism.it
   Page name: /cism/hotels_2001.htm
Scale
   Immense amount of content
     – 2-10B static pages, doubling every 8-12 months
     – Lexicon Size: 10s-100s of millions of words
   Authors galore (1 in 4 hosts run a web server)




                                          http://www.netcraft.com/Survey
Diversity
   Languages/Encodings
    – Hundreds (thousands?) of languages; W3C encodings: 55 (Jul 01)
      [W3C01]
    – Home pages (1997): English 82%, next 15 languages: 13% [Babe97]
    – Google (mid 2001): English: 53%, JGCFSKRIP: 30%
   Document & query topic
    Popular Query Topics (from 1 million Google queries, Apr 2000)
     Arts           14.6%          Arts: Music               6.1%
     Computers      13.8%          Regional: North America   5.3%
     Regional       10.3%          Adult: Image Galleries    4.4%
     Society        8.7%           Computers: Software       3.4%
     Adult          8%             Computers: Internet       3.2%
     Recreation     7.3%           Business: Industries      2.3%
     Business       7.2%           Regional: Europe          1.8%
     …              …              …                         …
Rate of change
   [Cho00]: 720K pages from 270 popular sites, sampled daily from
    Feb 17 to Jun 14, 1999
   Mathematically, what does this seem to be?
Web idiosyncrasies
   Distributed authorship
    – Millions of people creating pages with their
      own style, grammar, vocabulary, opinions,
      facts, falsehoods …
    – Not all have the purest motives in providing
      high-quality information - commercial motives
      drive “spamming” - 100s of millions of pages.
    – The open web is largely a marketing tool.
        • IBM’s home page does not contain the word “computer”.
Other characteristics
   Significant duplication
    – Syntactic - 30%-40% (near) duplicates
          [Brod97, Shiv99b]
    – Semantic - ???
   High linkage
     – ~8 links/page on average
   Complex graph topology
    – Not a small world; bow-tie structure [Brod00]
   More on these corpus characteristics later
    – how do we measure them?
Web search users
   Ill-defined queries
     – Short (AV 2001: 2.54 terms on average, 80% < 3 words)
     – Imprecise terms
     – Sub-optimal syntax (80% of queries without operators)
     – Low effort
   Specific behavior
     – 85% look at one result screen only (mostly above the fold)
     – 78% of queries are not modified (one query/session)
     – Follow links – “the scent of information” ...
   Wide variance in
     – Needs
     – Expectations
     – Knowledge
     – Bandwidth
Evolution of search engines
   First generation – use only “on page” text data (1995-1997: AltaVista,
    Excite, Lycos, etc.)
     – Word frequency, language
   Second generation – use off-page, web-specific data (from 1998; made
    popular by Google, but now used by everyone)
     – Link (or connectivity) analysis
     – Click-through data (what results people click on)
     – Anchor text (how people refer to this page)
   Third generation – answer “the need behind the query” (still
    experimental)
     – Semantic analysis – what is this about?
     – Focus on user need, rather than on query
     – Context determination
     – Helping the user
     – Integration of search and text analysis
Third generation search engine:
answering “the need behind the query”
 Query language determination
 Different ranking
   – (if the query is Japanese, do not return English pages)
 Hard & soft matches
   – Personalities (triggered on names)
   – Cities (travel info, maps)
   – Medical info (triggered on names and/or results)
   – Stock quotes, news (triggered on stock symbol)
   – Company info, …
 Integration of search and text analysis
Answering “the need behind the query”

   Context determination
     –   spatial (user location/target location)
     –   query stream (previous queries)
     –   personal (user profile)
     –   explicit (vertical search, family friendly)
     –   implicit (use AltaVista from AltaVista France)
   Context use
     – Result restriction
     – Ranking modulation
The spatial context - geo-search
   Two aspects
     – Geo-coding
        • encode geographic coordinates to make search effective
     – Geo-parsing
        • the process of identifying geographic context.
   Geo-coding
     – Geometrical hierarchy (squares)
     – Natural hierarchy (country, state, county, city, zip-codes, etc)
   Geo-parsing
     – Pages (infer from phone numbers, zip codes, etc.); about 10% feasible
     – Queries (use a dictionary of place names)
     – Users
        • from IP data
     – Mobile phones
        • in its infancy; many issues (display size, privacy, etc.)
Example: AltaVista “barry bonds”
Example: Lycos “palo alto”
Geo-search example – Northern Light (now Divine Inc.)
Helping the user

 UI
 spellchecking
 query refinement
 query suggestion
 context transfer …
Context sensitive spell check
 Search Engine Architecture

   (Diagram: Crawlers, directed by Crawl Control, fetch pages into the
   Page Repository. The Indexer builds the Text and Structure indexes,
   and Link Analysis derives ranking signals and crawl priorities. At
   query time the Query Engine looks up Queries in the indexes, Ranking
   orders the matches, and Snippet Extraction pulls page summaries from
   the Document Store to produce the Results.)
Terms
 Crawler
 Crawler control
 Indexes – text, structure, utility
 Page repository
 Indexer
 Collection analysis module
 Query engine
 Ranking module
Repository


             “Hidden Treasures”
Storage
   The page repository is a scalable storage
    system for web pages
   Allows the Crawler to store pages
   Allows the Indexer and Collection
    Analysis to retrieve them
   Similar to other data storage systems –
    DB or file systems
   Does not have to provide some of the
    other systems’ features: transactions,
    logging, directory
Storage Issues
 Scalability and seamless load distribution
 Dual access modes
    – Random access (used by the query engine for
      cached pages)
    – Streaming access (used by the Indexer and
      Collection Analysis)
 Large bulk update – reclaim old space,
  avoid access/update conflicts
 Obsolete pages - remove pages no longer
  on the web
Designing a Distributed Web
Repository
 Repository designed to work over a
  cluster of interconnected nodes
 Page distribution across nodes
 Physical organization within a node
 Update strategy
Page Distribution
 How to choose a node to store a
  page
 Uniform distribution – any page can
  be sent to any node
 Hash distribution policy – hash page
  ID space into node ID space
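A minimal sketch of the hash distribution policy in Python (the node
count and the choice of hash function are illustrative assumptions, not
from the slides):

```python
import hashlib

NUM_NODES = 16  # assumed repository cluster size, for illustration

def node_for_page(page_id: str, num_nodes: int = NUM_NODES) -> int:
    """Hash the page ID space into the node ID space."""
    digest = hashlib.md5(page_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# The same page always lands on the same node:
print(node_for_page("http://www.cism.it/cism/hotels_2001.htm"))
```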
Organization Within a Node
    Several operations required
     – Add / remove a page
     – High speed streaming
     – Random page access
    Hashed organization
     – Treat each disk as a hash bucket
     – Assign according to a page’s ID
    Log organization
     – Treat the disk as one file, and add the page at the end
     – Support random access using a B-tree
    Hybrid
     – Hash map a page to an extent and use log structure
       within an extent.
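As a rough illustration of the log organization described above, here is
a toy Python store that appends pages to one file and supports random
access through an offset index (a plain dict stands in for the B-tree; a
real node would persist it):

```python
class LogStore:
    """Log organization: treat the disk as one file, add each page at
    the end, and support random access via an offset index."""

    def __init__(self, path: str):
        self.path = path
        self.index = {}  # page_id -> (offset, length); B-tree stand-in
        open(path, "ab").close()  # create the log file if missing

    def add(self, page_id: str, content: bytes) -> None:
        with open(self.path, "ab") as log:
            self.index[page_id] = (log.tell(), len(content))
            log.write(content)

    def get(self, page_id: str) -> bytes:
        offset, length = self.index[page_id]
        with open(self.path, "rb") as log:
            log.seek(offset)
            return log.read(length)
```

Streaming access is then just a sequential read of the log file, which is
what makes this layout fast for the Indexer.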
Distribution Performance

                              Log   Hashed   Hashed Log
 Streaming performance        ++    -        +
 Random access performance    +-    ++       +-
 Page addition                ++    -        +
         Update Strategies

 Updates are generated by the crawler
 Several characteristics
    – Time in which the crawl occurs and
      the repository receives information
    – Whether the crawl’s information
      replaces the entire database or
      modifies parts of it
Batch vs. Steady
   Batch mode
    – Periodically executed
    – Allocated a certain amount of time
   Steady mode
    – Run all the time
    – Always send results back to the
      repository
Partial vs. Complete Crawls
   A batch mode crawler can
    – Do a complete crawl every run, and replace
      entire collection
    – Recrawl only a specific subset, and apply
      updates to the existing collection – partial
      crawl
   The repository can implement
    – In place update
       • Quickly refresh pages
    – Shadowing, update as another stage
       • Avoid refresh-access conflicts
Partial vs. Complete Crawls
 Shadowing resolves the conflicts
  between updates and read for the
  queries
 Batch-mode crawling works well with shadowing
 A steady crawler works well with in-place updates
Indexing
The Indexer Module
Creates two indexes:
 Text (content) index: uses “traditional” indexing methods such as
  inverted indexing
 Structure (links) index: uses a directed graph of pages and links;
  sometimes an inverted graph is also created
The Link Analysis Module
Uses the two basic indexes created by the indexer module to assemble
“utility indexes”, e.g. a site index.
Inverted Index
 A set of inverted lists, one per index term (word)
 Inverted list of a term: a sorted list of locations in which the term
  appears
 Posting: a pair (w, l) where w is a word and l is one of its locations
 Lexicon: holds all the index’s terms, with statistics about each term
  (not the postings)
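A minimal sketch of these definitions in Python (whitespace tokenization
is a simplifying assumption):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> text. Returns (postings, lexicon): postings
    maps each term to its inverted list of (doc_id, position) postings;
    the lexicon keeps per-term statistics, not the postings."""
    postings = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for pos, word in enumerate(text.lower().split()):
            postings[word].append((doc_id, pos))  # lists stay sorted
    lexicon = {term: {"doc_freq": len({d for d, _ in plist}),
                      "total_occurrences": len(plist)}
               for term, plist in postings.items()}
    return dict(postings), lexicon

postings, lexicon = build_inverted_index({
    1: "web search engines index the web",
    2: "search engines rank web pages",
})
print(postings["search"])  # [(1, 1), (2, 0)]
print(lexicon["web"])      # {'doc_freq': 2, 'total_occurrences': 3}
```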
Challenges
   Index build must be:
    – Fast
    – Economic
   (unlike traditional index building)
 Incremental Indexing must be
  supported
 Storage: compression vs. speed
Index Partitioning
Distributed text indexing can be done with:
 Local inverted files (IFL)
    – Each node contains a disjoint, random subset of the pages
    – A query is broadcast to all nodes
    – The result joins the per-node answers
   Global inverted files (IFG)
    – Each node is responsible only for a subset of the terms in the
      collection
    – A query is sent only to the appropriate node
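The two partitioning schemes can be sketched in a few lines of Python
(toy data; `hash()` routing is consistent only within one process run):

```python
def query_ifl(term, node_indexes):
    """Local inverted files: each node indexes its own pages, so the
    query is broadcast to every node and the answers are joined."""
    hits = set()
    for index in node_indexes:          # broadcast
        hits |= index.get(term, set())
    return hits

def owner(term, num_nodes):
    return hash(term) % num_nodes       # term -> responsible node

def query_ifg(term, term_indexes):
    """Global inverted file: each node owns a subset of the terms, so
    the query goes only to the node responsible for this term."""
    return term_indexes[owner(term, len(term_indexes))].get(term, set())

# Toy collection: page 1 on node 0, page 2 on node 1 (IFL)...
nodes = [{"web": {1}, "search": {1}}, {"search": {2}, "ranking": {2}}]
# ...and the same postings redistributed by term (IFG)
terms = [dict(), dict()]
for term, pages in [("web", {1}), ("search", {1, 2}), ("ranking", {2})]:
    terms[owner(term, 2)][term] = pages

print(query_ifl("search", nodes), query_ifg("search", terms))  # {1, 2} {1, 2}
```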
Indexing, Conclusion
 Indexing web pages is complicated by the scale (millions of pages,
  hundreds of gigabytes)
 Challenges: incremental indexing and personalization
Scaling
Scaling
   Google (Nov 2002):
    – Number of pages: 3 billion
     – Refresh interval: 1 month (≈1200 pages/sec)
     – Queries/day: 150 million ≈ 1700 q/s
 Avg page size: 10 KB
 Avg query size: 40 B
 Avg result size: 5 KB
 Avg links/page: 8
Size of Dataset
   Total raw HTML data size:
    3 G × 10 KB = 30 TB
 Inverted index ≈ corpus = 30 TB
 Using 3:1 compression:
    10 TB (pages) + 10 TB (index) = 20 TB on disk
Single copy of index
   Index
    – (10 TB) / (100 GB per disk) = 100 disks
   Documents
    – (10 TB) / (100 GB per disk) = 100 disks
Query Load
 1700 queries/sec
 Rule of thumb: 20 q/s per CPU
    – 1700 / 20 = 85 index replicas (clusters) needed to answer queries
 Cluster: 100 machines (one full copy of the index)
 Total = 85 × 100 = 8500 machines
 Document servers
    – Snippet generation: 1000 snippets/s per machine
    – 1700 q/s × 10 snippets / 1000 = 17 replicas × 100 machines = 1700
Limits
 Redirector: 4000 req/s
 Bandwidth: 1100 req/s
 Server: 22 q/s each
 Cluster: 50 nodes = 1100 q/s = 95 million q/day
Scaling the Index

   (Diagram: Queries arrive at a hardware-based load balancer, which
   distributes them across replicated Google Web Servers; each web
   server calls the spell checker, the ad server, the index servers and
   the document servers to assemble a results page.)
Pooled Shard Architecture

   (Diagram: the Web Server sends a query through an index load
   balancer to one of index servers 1..K over the index server network;
   1 Gb/s backbone links, 100 Mb/s server links. Over the pool network,
   each index server then queries, via intermediate load balancers
   1..N, one shard index server (SIS) from each shard’s pool; the pool
   for shard i holds replicas of shard S_i.)
Replicated Index Architecture

   (Diagram: the Web Server sends a query through an index load
   balancer to one of index servers 1..M; 1 Gb/s backbone links,
   100 Mb/s server links. Each index server holds a full copy of the
   index, with shards S1..SN on its local shard index servers (SIS).)
Index Replication
 100 Mb/s bandwidth
 20 TB × 8 b/B / 100 Mb/s ≈ 1.6 × 10^6 s ≈ 18 days
 Even at 1 Gb/s, copying 20 TB takes nearly two full days
Ranking
First generation ranking
   Extended Boolean model
    –   Matches: exact, prefix, phrase,…
    –   Operators: AND, OR, AND NOT, NEAR, …
    –   Fields: TITLE:, URL:, HOST:,…
    –   AND is somewhat easier to implement, maybe
        preferable as default for short queries
   Ranking
    – TF like factors: TF, explicit keywords, words in
      title, explicit emphasis (headers), etc
    – IDF factors: IDF, total word count in corpus,
      frequency in query log, frequency in language
Second generation search engine
   Ranking -- use off-page, web-specific
    data
    – Link (or connectivity) analysis
    – Click-through data (What results people
      click on)
    – Anchor-text (How people refer to this
      page)
   Crawling
    – Algorithms to create the best possible
      corpus
Connectivity analysis

 Idea: mine the hyperlink information in the Web
 Assumptions:
  – Links often connect related pages
  – A link between pages is a recommendation: “people vote with their
    links”
Citation Analysis
   Citation frequency
   Co-citation coupling frequency
    – Co-citation with a given author measures “impact”
    – Co-citation analysis [Mcca90]
   Bibliographic coupling frequency
    – Articles that cite the same articles are related
   Citation indexing
    – Who is a given author cited by? (Garfield [Garf72])
   Pinski and Narin
Query-independent ordering
 First generation: using link counts as
  simple measures of popularity
 Two basic suggestions:
    – Undirected popularity:
      • Each page gets a score = the number of in-
        links plus the number of out-links (3+2=5)
    – Directed popularity:
      • Score of a page = number of its in-links (3)
Query processing
 First retrieve all pages meeting the
  text query (say venture capital)
 Order these by their link popularity
  (either variant on the previous page)
Spamming simple popularity
 Exercise: How do you spam each of
  the following heuristics so your page
  gets a high score?
 Each page gets a score = the number
  of in-links plus the number of out-
  links
 Score of a page = number of its in-
  links
PageRank scoring
   Imagine a browser doing a random walk on web pages:
    – Start at a random page
    – At each step, go out of the current page along one of the links
      on that page, equiprobably (e.g. probability 1/3 each when the
      page has three out-links)
   “In the steady state” each page has a long-term visit rate – use
    this as the page’s score
Not quite enough
   The web is full of dead ends
    – The random walk can get stuck in dead ends
    – Then it makes no sense to talk about long-term visit rates
Teleporting
 At each step, with probability 10%,
  jump to a random web page
 With remaining probability (90%), go
  out on a random link
    – If no out-link, stay put in this case
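A numpy sketch of the resulting transition probabilities, following the
slide’s choices (10% teleport; stay put on pages without out-links):

```python
import numpy as np

def teleport_walk_matrix(adj, teleport=0.10):
    """adj[i][j] = 1 if page i links to page j. With probability
    `teleport` jump to a uniformly random page; otherwise follow a
    random out-link, staying put if the page has none."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)
    follow = np.where(out_degree[:, None] > 0,
                      adj / np.maximum(out_degree, 1.0)[:, None],
                      np.eye(n))                 # dead end: stay put
    return teleport / n + (1 - teleport) * follow

P = teleport_walk_matrix([[0, 1], [0, 0]])       # page 1 is a dead end
print(P.sum(axis=1))                             # rows sum to 1
```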
Result of teleporting
 Now cannot get stuck locally
 There is a long-term rate at which
  any page is visited (not obvious, will
  show this)
 How do we compute this visit rate?
PageRank
 Tries to capture the notion of the “importance of a page”
 Uses backlinks for ranking
 Avoids trivial spamming: distributes a page’s “voting power” among
  the pages it links to
 An “important” page linking to a page raises that page’s rank more
  than a “not important” one does
Simple PageRank
   Given by:

      r(i) = \sum_{j \in B(i)} r(j) / N(j)

    where
      B(i): set of pages linking to i
      N(j): number of outgoing links from j
 Well defined if the link graph is strongly connected
 Based on the “random surfer model”: the rank of a page equals the
  probability of the surfer being at that page
Computation of PageRank (1)

      r ← A^T r
      r = [r(1), r(2), ..., r(m)]
      A_{i,j} = 1/N(i)  if i points to j
      A_{i,j} = 0       otherwise
Computation of PageRank (2)
 Given a matrix A, an eigenvalue c and the corresponding eigenvector
  v are defined by Av = cv
 Hence r is an eigenvector of A^T for the eigenvalue 1
 If G is strongly connected then r is unique
Computation of PageRank (3)
   Simple PageRank can be computed by:

      1. s ← any random vector
      2. r ← A^T s
      3. if ||r − s|| < ε goto 5
      4. s ← r; goto 2
      5. r is the PageRank vector
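The iteration above transcribes directly into numpy; this sketch
assumes, as the slides do, a strongly connected graph and the
row-stochastic matrix A_{i,j} = 1/N(i) defined two slides back:

```python
import numpy as np

def simple_pagerank(A, eps=1e-9):
    """Iterate r <- A^T s until ||r - s|| < eps (steps 1-5 above)."""
    n = A.shape[0]
    s = np.full(n, 1.0 / n)      # step 1: any starting vector
    while True:
        r = A.T @ s              # step 2
        if np.linalg.norm(r - s) < eps:
            return r             # step 5
        s = r                    # step 4

# 3-page cycle 0 -> 1 -> 2 -> 0: strongly connected, each N(i) = 1
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(simple_pagerank(A))        # ~[1/3, 1/3, 1/3]
```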
PageRank Example
Practical PageRank: Problem
   Web is not a strongly connected
    graph. It contains:
    – “Rank Sinks”: cluster of pages without
      outgoing links. Pages outside cluster
      will be ranked 0.
    – “Rank Leaks”: a page without outgoing
      links. All pages will be ranked 0.
Practical PageRank: Solution
 Remove all rank leaks
 Add a decay factor d to Simple PageRank:

      r(i) = (1 − d)/m + d \sum_{j \in B(i)} r(j) / N(j)

   where m is the total number of pages
 Based on the “bored surfer model”: with probability 1 − d the surfer
  jumps to a random page instead of following a link
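Folding the decay factor into the same iteration gives the practical
variant (a sketch of the formula above, with an assumed typical
d = 0.85):

```python
import numpy as np

def pagerank(A, d=0.85, eps=1e-9):
    """Damped update r <- (1 - d)/m + d * A^T r, where A is the
    row-stochastic link matrix and m the number of pages."""
    m = A.shape[0]
    r = np.full(m, 1.0 / m)
    while True:
        r_next = (1 - d) / m + d * (A.T @ r)
        if np.linalg.norm(r_next - r) < eps:
            return r_next
        r = r_next
```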
HITS: Hypertext Induced Topic Search

   A query-dependent technique
   Produces two scores:
     – Authority: a page most likely to be relevant to a given query
     – Hub: a page that points to many authorities
   Consists of two parts:
     – Identifying the focused subgraph
     – Link analysis
HITS: Identifying the Focused Subgraph

 Subgraph creation from a t-sized page set:

      1. R ← t initial pages
      2. S ← R
      3. for each page p ∈ R
         (a) include in S all the pages that p points to
         (b) include in S (up to a maximum d) all the pages that point to p
      4. S holds the focused subgraph

   (d reduces the influence of extremely popular pages like yahoo.com)
HITS: Link Analysis
   Calculates authority & hub scores (a_i and h_i) for each page in S:

      1. Initialize a_i, h_i (1 ≤ i ≤ n) arbitrarily
      2. Repeat until convergence:
         (a) a_i ← \sum_{j \in B(i)} h_j   (for all pages)
         (b) h_i ← \sum_{j \in F(i)} a_j   (for all pages)
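A compact numpy sketch of this iteration (the L2 normalization in each
round is the standard way to make the scores converge; the slides leave
it implicit):

```python
import numpy as np

def hits(A, iters=50):
    """A[i][j] = 1 if page i points to page j, for the pages in S.
    a_i sums the hub scores of pages pointing to i; h_i sums the
    authority scores of the pages that i points to."""
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h              # (a) authorities from hubs
        a /= np.linalg.norm(a)
        h = A @ a                # (b) hubs from authorities
        h /= np.linalg.norm(h)
    return a, h
```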
HITS: Link Analysis Computation

   The scores can be computed as eigenvectors:

      a = A^T h  ⟹  a = A^T A a
      h = A a    ⟹  h = A A^T h

   where
    a: vector of authorities’ scores
    h: vector of hubs’ scores
    A: adjacency matrix, in which A_{i,j} = 1 if page i points to page j
Markov chains
 A Markov chain consists of n states, plus an n × n transition
  probability matrix P
 At each step, we are in exactly one of the states
 For 1 ≤ i, j ≤ n, the matrix entry P_{ij} tells us the probability of
  j being the next state, given we are currently in state i
  (self-loops P_{ii} > 0 are OK)
Markov chains

 Clearly, for all i, \sum_{j=1}^{n} P_{ij} = 1
 Markov chains are abstractions of random walks
 Exercise: represent the teleporting random walk from 3 slides ago as
  a Markov chain
Ergodic Markov chains
   A Markov chain is ergodic if
    – there is a path from any state to any other
    – you can be in any state at every time step, with non-zero
      probability (a chain that strictly alternates between even and
      odd states is not ergodic)
Ergodic Markov chains
   For any ergodic Markov chain, there
    is a unique long-term visit rate for
    each state
    – Steady-state distribution
 Over a long time-period, we visit
  each state in proportion to this rate
 It doesn’t matter where we start
 Probability vectors
 A probability (row) vector x = (x_1, ..., x_n) tells us where the
  walk is at any point
 E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i, means we’re in
  state i with certainty
 More generally, the vector x = (x_1, ..., x_n) means the walk is in
  state i with probability x_i, so \sum_{i=1}^{n} x_i = 1
Change in probability vector
 If the probability vector is x = (x_1, ..., x_n) at this step, what
  is it at the next step?
 Recall that row i of the transition probability matrix P tells us
  where we go next from state i
 So from x, our next state is distributed as xP
Computing the visit rate
   The steady state looks like a vector of probabilities
    a = (a_1, ..., a_n):
    – a_i is the probability that we are in state i

   Example: two states, where state 1 stays put with probability 1/4
   and moves to state 2 with probability 3/4, while state 2 moves back
   to state 1 with probability 1/4 and stays put with probability 3/4.
   For this chain, a_1 = 1/4 and a_2 = 3/4.
How do we compute this vector?
 Let a = (a_1, ..., a_n) denote the row vector of steady-state
  probabilities
 If our current position is described by a, then the next step is
  distributed as aP
 But a is the steady state, so a = aP
 Solving this matrix equation gives us a
    – So a is the (left) eigenvector for P
    – (It corresponds to the “principal” eigenvector of P, the one with
      the largest eigenvalue)
One way of computing a
   Recall, regardless of where we start, we
    eventually reach the steady state a
   Start with any distribution (say x=(10…0))
   After one step, we’re at xP
   after two steps at xP2 , then xP3 and so on
   “Eventually” means for “large” k, xPk = a
   Algorithm: multiply x by increasing
    powers of P until the product looks stable
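Trying this on the two-state example from a few slides back (this
particular chain converges in a single step, since both rows of P are
equal):

```python
import numpy as np

P = np.array([[0.25, 0.75],     # state 1: stay 1/4, move 3/4
              [0.25, 0.75]])    # state 2: move back 1/4, stay 3/4
x = np.array([1.0, 0.0])        # start in state 1
for _ in range(20):
    x = x @ P                   # next step is distributed as xP
print(x)                        # -> [0.25, 0.75], the steady state a
```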
Lempel: SALSA
   By applying the ergodic theorem, Lempel and Moran (SALSA) proved
    that:
    – a_i is proportional to the number of incoming links
PageRank summary
   Preprocessing:
    – Given graph of links, build matrix P
    – From it compute a
    – The entry ai is a number between 0 and
      1: the PageRank of page i
   Query processing:
    – Retrieve pages meeting query
    – Rank them by their PageRank
    – Order is query-independent
The reality
   PageRank is used in Google, but so are many other clever heuristics