Crawling

                Giuseppe Attardi
           Dipartimento di Informatica
               Università di Pisa
 Download a set of Web pages
 The set typically consists of all pages
  reachable by following links from a root
 Crawling is performed periodically
 Goals:
    – Find new pages
    – Keep pages fresh
    – Select “good” pages
Web Dynamics
   Size
    – ~10 billion Public Indexable pages
    – 10 kB/page → ~100 TB
    – Doubles every 18 months
   Dynamics
    – 33% change weekly
    – 8% new pages every week
    – 25% new links every week
Weekly change

                Fetterly, Manasse, Najork, Wiener 2003
Crawling Issues
   How to crawl?
    – Quality: “Best” pages first
     – Efficiency: Avoid duplication (or near
       duplication)
     – Etiquette: Robots.txt, server load concerns

    How much to crawl? How much to index?
     – Coverage: How big is the Web? How much do
       we cover?
     – Relative Coverage: How much do competitors
       have?
Basic crawler operation
 Begin with known “seed” pages
 Add them to a queue
 Extract a URL from the queue
    – Fetch it
    – Parse it and extract the URLs it points to
    – Add any non-visited extracted URLs to
      the queue
   Repeat until queue is empty
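The loop above can be sketched in a few lines (a minimal single-machine sketch: `fetch_links` is a hypothetical stand-in for the fetch-and-parse steps, and the toy `graph` replaces the real Web):

```python
from collections import deque

def crawl(seeds, fetch_links):
    """Breadth-first crawl: fetch_links(url) stands in for
    fetch + parse and returns the URLs a page points to."""
    queue = deque(seeds)              # known "seed" pages
    visited = set(seeds)
    order = []
    while queue:                      # repeat until queue is empty
        url = queue.popleft()         # extract a URL from the queue
        order.append(url)
        for link in fetch_links(url):     # fetch and parse
            if link not in visited:       # add non-visited URLs
                visited.add(link)
                queue.append(link)
    return order

# Toy link graph standing in for the Web
graph = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"], "D": []}
print(crawl(["A"], lambda u: graph.get(u, [])))   # ['A', 'B', 'C', 'D']
```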
Simple picture – complications
   Web crawling isn’t feasible with one machine
    – All of the above steps distributed
   Even non-malicious pages pose challenges
    – Latency/bandwidth to remote servers vary
    – Robots.txt stipulations
        • How “deep” should you crawl a site’s URL hierarchy?
    – Site mirrors and duplicate pages
   Malicious pages
    – Spam pages
    – Spider traps – incl. dynamically generated ones
   Politeness – don’t hit a server too often
Robots.txt
    Protocol for giving spiders (“robots”)
     limited access to a website, originally
     from 1994
    Website announces its request on
     what can(not) be crawled
     – For a server, create a file robots.txt
     – This file specifies access restrictions
Robots.txt example
   No robot should visit any URL
    starting with "/yoursite/temp/",
    except the robot called “IxeBot":

User-agent: *
Disallow: /yoursite/temp/

User-agent: IxeBot
Disallow:
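The example above can be exercised with Python's standard `urllib.robotparser`; parsing the rules from a string here stands in for fetching the server's real robots.txt:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: IxeBot
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Any other robot is barred from /yoursite/temp/ ...
print(rp.can_fetch("SomeBot", "http://example.com/yoursite/temp/x.html"))  # False
# ... while IxeBot (an empty Disallow means "allow everything") may enter
print(rp.can_fetch("IxeBot", "http://example.com/yoursite/temp/x.html"))   # True
```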
Crawling Issues
 Crawl strategies
 Distributed crawling
 Refresh strategies
 Filtering duplicates
 Mirror detection
Crawl Strategies
 Where do we crawl next?

       [Diagram: URLs crawled and parsed feed newly
        extracted URLs into the queue (the frontier)]

Crawl Order

 Want best pages first
 Potential quality measures:
     • Final In-degree
     • Final PageRank
 Crawl heuristics:
    • Breadth First Search (BFS)
    • Partial Indegree
    • Partial PageRank
    • Random walk
Breadth-First Crawl
    Basic idea:
         start at a set of known URLs
         explore in “concentric circles” around these URLs

[Figure: concentric circles: start pages, distance-one pages, distance-two pages]

    used by broad web search engines
    balances load between servers
Web Wide Crawl (328M pages) [Najo01]

        BFS crawling brings in high quality
        pages early in the crawl
Stanford Web Base (179K) [Cho98]

[Chart: overlap with best x% (by PageRank) vs. x% crawled by ordering metric O(u)]
Queue of URLs to be fetched
 What constraints dictate which queued
  URL is fetched next?
 Politeness – don’t hit a server too often,
  even from different threads of your spider
 How far into a site you’ve crawled already
    – For most sites, stay within ≤ 5 levels of the URL hierarchy
   Which URLs are most promising for
    building a high-quality corpus
    – This is a graph traversal problem:
    – Given a directed graph you’ve partially visited,
      where do you visit next?
Where do we crawl next?
 Keep all crawlers busy
 Keep crawlers from treading on each
  others’ toes
    – Avoid fetching duplicates repeatedly
 Respect politeness/robots.txt
 Avoid getting stuck in traps
 Detect/minimize spam
 Get the “best” pages
    – What’s best?
    – Best for answering search queries
Where do we crawl next?
   Complex scheduling optimization
    problem, subject to all the constraints
    – Plus operational constraints (e.g., keeping all
      machines load-balanced)
   Scientific study – limited to specific aspects
     – Which ones?
     – What do we measure?
    What are the compromises in distributed crawling?
Page selection
 Importance metric
 Web crawler model
 Crawler method for choosing the page to download next
Importance Metrics
 Given a page P, define how “good”
  that page is
 Several metric types:
    – Interest driven
    – Popularity driven
    – Location driven
    – Combined
Interest Driven
   Define a driving query Q
   Find textual similarity between P and Q

   Define a word vocabulary t1…tn
   Define a vector for P and Q:
    – Vp, Vq = <w1,…,wn>
       • wi = 0 if ti does not appear in the document
       • wi = IDF(ti) = 1 / number of pages containing ti
   Importance: IS(P) = Vp · Vq (cosine measure)
    Finding IDF requires going over the entire web
    Estimate IDF from pages already visited, to calculate IS’(P)
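A minimal sketch of these definitions (the names `idf_weights` and `importance_is` are illustrative, and IDF is estimated from the visited pages only, as the last bullet suggests):

```python
import math

def idf_weights(pages):
    """IDF(t) = 1 / number of pages containing t (slide's definition)."""
    df = {}
    for words in pages:
        for t in set(words):
            df[t] = df.get(t, 0) + 1
    return {t: 1.0 / n for t, n in df.items()}

def importance_is(page, query, idf):
    """IS(P): cosine of the IDF-weighted vectors of page P and query Q."""
    vp = {t: idf.get(t, 0.0) for t in set(page)}
    vq = {t: idf.get(t, 0.0) for t in set(query)}
    dot = sum(w * vq.get(t, 0.0) for t, w in vp.items())
    norm = math.sqrt(sum(w * w for w in vp.values())) * \
           math.sqrt(sum(w * w for w in vq.values()))
    return dot / norm if norm else 0.0

# IDF estimated only from pages crawled so far
idf = idf_weights([["web", "crawl"], ["web", "index"]])
```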
Popularity Driven
   How popular a page is:
    – Backlink count
      • IB(P) = the number of pages containing a
        link to P
      • Estimate by previous crawls: IB’(P)
    – More sophisticated metric, e.g.
      PageRank: IR(P)
Location Driven
   IL(P): a function of the URL to P
     – Words appearing in the URL
     – Number of “/” in the URL
    Easily evaluated, requires no data
     from previous crawls
Combined Metrics
 IC(P): a function of several other metrics
 Allows using local metrics for a first
  stage and estimated metrics for a
  second stage
 IC(P) = a*IS(P) + b*IB(P) + c*IL(P)
Focused Crawling
Focused Crawling (Chakrabarti)
 Distributed federation of focused crawlers
 Supervised topic classifier
 Controls priority of URLs on the unvisited frontier
 Trained on document samples from a
  Web directory (Dmoz)
Basic Focused Crawler
DOM tree analysis
Learning Algorithm
 Naïve Bayes
 Page u is modeled as bag of words
    { <feature, freq> }
    feature = < term, distance >
   Start the baseline crawler from the URLs
    in one topic
   Fetch up to 20000-25000 pages
   For each pair of fetched pages (u,v), add
    item to the training set of the apprentice
   Train the apprentice
   Start the enhanced crawler from the same
    set of pages
   Fetch about the same number of pages
 Chakrabarti claims the focused crawler is
  superior to breadth-first
 Suel claims the contrary, and that the
  argument was based on experiments
  with poorly performing crawlers
Distributed Crawling
 Centralized Parallel Crawler
 Distributed
 P2P
Parallel Crawlers
   A parallel crawler consists of multiple
    crawling processes communicating via
    local network (intra-site parallel crawler) or
    Internet (distributed crawler)

   Setting: we have a number of c-proc’s
    – c-proc = crawling process
   Goal: we wish to crawl the best pages with
    minimum overhead
      Crawler-process distribution

   c-procs at geographically distant locations: distributed crawler
   c-procs on the same local network: intra-site parallel crawler
Distributed model
   Crawlers may be running in diverse
    geographic locations
    – Periodically update a master index
    – Incremental update so this is “cheap”
      • Compression, differential update etc.
    – Focus on communication overhead
      during the crawl
   Also results in dispersed WAN load
Issues and benefits

   Issues:
    – overlap: minimization of multiply downloaded pages
    – quality: depends on the crawl strategy
    – communication bandwidth: minimization

   Benefits:
    – scalability: for large-scale web-crawls
    – costs: use of cheaper machines
    – network-load dispersion and reduction: by
      dividing the web into regions and crawling
      only the nearest pages

Coordination among c-proc's:
   1.   Independent: no coordination, every process follows its
        extracted links
   2.   Dynamic assignment: a central coordinator dynamically
        divides the web into small partitions and assigns each
        partition to a process
   3.   Static assignment: Web is partitioned and assigned
        without central coordinator before the crawl starts
c-proc’s crawling the web

[Diagram: each c-proc keeps its own sets of URLs crawled and URLs in
 the queue; when a c-proc extracts a new URL, which c-proc gets it?
 Communication: by URLs passed between c-procs]
Static assignment

Links from one partition to another (inter-
   partition links) can be handled in one of three modes:
1.   Firewall mode:
     a process does not follow
     any inter-partition link
2.   Cross-over mode:
     a process also follows
     inter-partition links and
     so discovers more pages
     in its partition
3.   Exchange mode:
     processes exchange inter-
     partition URLs; this mode
     needs communication

[Figure: two partitions with pages a–e and f–i, showing intra- and
 inter-partition links]
Classification of parallel crawlers
If exchange mode is used, communication can be
     limited by:
   –   Batch communication: every process collects some URLs
       and sends them in a batch
   –   Replication: the k most popular URLs are replicated at each
       process and are not exchanged (previous crawl or on the fly)

Some ways to partition the Web:
   –   URL-hash based: many inter-partition links
   –   Site-hash based: reduces the inter partition links
   –   Hierarchical: .com domain, .net domain …
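A site-hash assignment can be sketched as follows (an illustrative sketch, not the partitioning function of any particular crawler): hashing the host rather than the full URL keeps in-site links inside one partition:

```python
import hashlib
from urllib.parse import urlsplit

def assign(url, n_procs):
    """Site-hash static assignment: every URL of a host maps to the
    same crawling process, so in-site links stay intra-partition."""
    host = (urlsplit(url).netloc or "").lower()
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % n_procs

# All URLs on one site land on the same c-proc:
assert assign("http://example.com/a", 4) == assign("http://example.com/b/c", 4)
```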
Static assignment: comparison

              Coverage   Overlap   Quality   Communication

  Firewall      Bad       Good      Bad          Good

  Cross-over    Good      Bad       Bad          Good

  Exchange      Good      Good     Good          Bad
UBI Crawler
  – Full distribution: identical agents / no central coordinator
  – Balanced locally computable assignment:
     • each URL is assigned to one agent
      • each agent can compute the URL assignment locally
     • distribution of URLs is balanced
  – Scalability:
     • number of crawled pages per second and per agent are
       independent of the number of agents
  – Fault tolerance:
     • URLs are not statically distributed
     • distributed reassignment protocol not reasonable
UBI Crawler: Assignment Function
A: set of agent identifiers
L ⊆ A: set of alive agents
m: total number of hosts
δ_L: assigns each host h to an alive agent: δ_L(h) ∈ L

 Balance: each agent should be responsible for approximately the
  same number of hosts:

               |δ_L⁻¹(a)| ≈ m / |L|

 Contravariance: if the number of agents grows, the portion of the
  web crawled by each agent must shrink:

               L ⊆ L′  ⟹  δ_L′⁻¹(a) ⊆ δ_L⁻¹(a)

               L ⊆ L′ ∧ δ_L′(h) ∈ L  ⟹  δ_L(h) = δ_L′(h)
Consistent Hashing
 Each bucket is replicated k times and each replica is mapped
  randomly on the unit circle
 Hashing a key: compute a point on the unit circle and find the nearest
  replica

L = {a,b}, L‘ = {a,b,c}, k = 3, hosts = {0,1,..,9}

[Figure: two unit circles, with 3 replicas of each agent in L = {a,b}
 (left) and L′ = {a,b,c} (right); hosts 0..9 are hashed onto the
 circles and assigned to the nearest replica]

δ_L⁻¹(a) = {1,4,5,6,8,9}        δ_L′⁻¹(a) = {4,5,6,8}
δ_L⁻¹(b) = {0,2,3,7}            δ_L′⁻¹(b) = {0,2,7}
                                δ_L′⁻¹(c) = {1,3,9}

Balancing: hash function and random number generator
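A compact sketch of consistent hashing along these lines (illustrative code, not UbiCrawler's implementation; MD5 stands in for the hash and random-number generator mentioned above):

```python
import hashlib
from bisect import bisect, insort

def _point(s):
    # map a string to a point on the (discretised) unit circle
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHash:
    """Each agent is replicated k times on the circle; a host is
    assigned to the first replica found moving clockwise."""
    def __init__(self, agents, k=3):
        self.ring = []
        for agent in agents:
            for i in range(k):
                insort(self.ring, (_point(f"{agent}#{i}"), agent))

    def assign(self, host):
        i = bisect(self.ring, (_point(host),)) % len(self.ring)
        return self.ring[i][1]

# Adding agent c only moves hosts onto c (contravariance):
old, new = ConsistentHash(["a", "b"]), ConsistentHash(["a", "b", "c"])
for h in map(str, range(10)):
    assert new.assign(h) in (old.assign(h), "c")
```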
UBI Crawler: fault tolerance
Up to now: no metrics for estimating the fault tolerance of
  distributed crawlers

   – Each agent has its own view of the set of alive agents (views
     can be different), but two agents will never dispatch the same
     host to two different agents.
                [Diagram: agents a, b, c, d; agent c dies and its hosts
                 (e.g. host 1) are reassigned to the surviving agents]

   – Agents can be added dynamically in a self-stabilizing way
Evaluation metrics (1)
                        N I
1.   Overlap: Overlap 

     N: total number of fetched pages
     I: number of distinct fetched pages
        • minimize the overlap
1.   Coverage:    Coverage

     U: total number of Web pages
      • maximize the coverage
Evaluation metrics (2)
3.   Communication Overhead:   Overhead = M / P

     M: number of exchanged messages
     P: number of downloaded pages
       • minimize the overhead

4.   Quality:   Quality = Σ_p PageRank(p),
     summed over the downloaded pages p

      • maximize the quality
      • backlink count / oracle crawler
   40M URL graph – Stanford WebBase
     – Open Directory URLs as seeds
    Should be considered a small Web
Firewall mode coverage
    The price of crawling in firewall mode is lost coverage
Crossover mode overlap
    Demanding more coverage drives up the overlap
     Exchange mode communication
        Communication overhead is sublinear

Cho’s conclusion
   up to 4 crawling processes running in firewall
     mode provide good coverage
   firewall mode not appropriate when:
    – > 4 crawling processes
    – download only a small subset of the Web and quality of
      the downloaded pages is important
   exchange mode
    – consumes < 1% network bandwidth for URL exchanges
    – maximizes the quality of the downloaded pages
   By replicating 10,000 - 100,000 popular URLs,
    communication overhead reduced by 40%
Crawler Models
   A crawler
    – Tries to visit more important pages first
    – Only has estimates of importance metrics
     – Can only download a limited number of pages
   How well does a crawler perform?
    – Crawl and Stop
    – Crawl and Stop with Threshold
Crawl and Stop
 A crawler stops after visiting K pages
 A perfect crawler
    – Visits pages with ranks R1,…,Rk
    – These are called Top Pages
   A real crawler
    – Visits only M < K top pages

 Performance rate:   P_CS = M / K
 For a random crawler:   P_CS = K / N   (N = total number of pages)
Crawl and Stop with Threshold
   A crawler stops after visiting K pages
   Top pages are pages with a metric higher than G
   A crawler visits V top pages
   Metric: percent of top pages visited:   P_ST = V / T
     (T = number of top pages, i.e. pages with metric above G)
    Perfect crawler:   P_ST = min(K, T) / T
    Random crawler:    P_ST = K / N   (N = total number of pages)
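A small worked example of the crawl-and-stop performance rate, i.e. the fraction of the K top pages found among the first K pages fetched (the page names are hypothetical):

```python
def crawl_and_stop(visit_order, top_pages):
    """P_CS: fraction of the K top pages found among the first
    K pages fetched (K = number of top pages)."""
    k = len(top_pages)
    return len(set(visit_order[:k]) & set(top_pages)) / k

# A crawler that fetched p1, p7, p2 first, when the top-3 pages
# are p1, p2, p3, achieves P_CS = 2/3; a perfect crawler scores 1,
# and a random crawler is expected to score K/N.
assert crawl_and_stop(["p1", "p7", "p2"], ["p1", "p2", "p3"]) == 2 / 3
```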
Ordering Metrics
 The crawlers queue is prioritized
  according to an ordering metric
 The ordering metric is based on an
  importance metric
     – Location metrics – directly
     – Popularity metrics – via estimates from
       previous crawls
     – Similarity metrics – via estimates from the
       text fetched so far
When to stop
 Up to a certain depth
 Up to a certain amount from each site
 Up to a maximum number of pages
Example crawlers
 Mercator (Altavista, Java)
 WebBase (Garcia-Molina, Cho)
 PolyBot (Suel, NY Polytechnic)
 Grub
 UbiCrawler (Vigna, Uni. Milano)
Crawler Architecture
• application issues
  requests to manager
• manager does DNS
  and robot exclusion
• manager schedules
  URL on downloader
• downloader gets file
  and puts it on disk
• application is notified
  of new files
• application parses new
  files for hyperlinks
• application sends data
  to storage component
Scaling up

•   20 machines
•   1500 pages/s?
•   depends on
    crawl strategy
•   hash to nodes
•   based on site
    (because of robots exclusion)
Scheduler Data Structures

• when to insert new URLs into internal structures?
IXE Crawler

[Architecture diagram: components CrawlInfo, Crawler, Parser, Cache,
 Host queues (synchronized objects), Feeder, Retriever (select()-based),
 Hosts, Robots, <UrlInfo>, Citations]
Feeder
 Splits URLs by host
 Distributes them to Retrievers
 Keeps hosts on a wait queue for a grace period
Retriever
 Modified version of the cURL library
 Asynchronous DNS resolver
 Handles up to 1024 transfers at once
 Keeps connections alive, transfers
  multiple files on one connection
  (connection setup can be 25% of
  retrieval time)
Persistent Storage
 Implemented by metaprogramming
  as IXE object-relational tables
 Tuned to handle concurrent access,
  with low granularity locking
 Locking completely transparent to clients
 < 1% of overall runtime
 > 30% of execution time
 Uses a ThreadPool
Locking and Concurrency
   Portable layer (Unix, Windows):
    – Thread
    – ThreadGroup
    – ThreadPool
    – Lock
    – Condition
    – ConditionWait
 bool Get(T& res) {
   LockUp our(_lock);                    // scoped lock (RAII)
   if (empty()) {
      if (closed)
         return false;
      ConditionWait(not_empty, &_lock);  // block until a Put() signals
      if (closed)
         return false;
   }
   res = front();
   pop();                                // consume the element
   return true;
 }
In Memory Communication
 Channel<T>
 Pipe<T, Q = queue<T> >
 MultiPipe<T, Q = queue<T> >
 Asynchronous Put(obj)
 Synchronous Get(obj)
 Peek()
 Close()
Performance Comparison
                  nutch             IXE
 Language         Java              C++
 Parallelism      threads: 300      asynch IO: 1 thread,
                                    500 connections
 DNS resolution   synchronous       asynchronous
 Page analysis    stages            concurrent
 Download speed   19 pag/sec        120 pag/sec
 Bandwidth        1.3 Mb/s peak     24 Mb/s peak
Crawl Performance
   Single PC:
    – Peak 120 pages per second (24 Mb/s)
    – 5 million pages/day
   Full crawl of 2 billion pages:
     – 400 days on 1 PC at 25 Mb/s, or
    – 4 days, 100 PCs, 2.5 Gb/s bandwidth
Crawler Console
 Web pushlet (aka AJAX) graphical interface
 Real time bandwidth usage graph
 Scheduled Windows Service
URL compression
 10 billion URLs = 550 GB
 Must map URLs to unique IDs
 Simple trick: use undocumented
  feature of zlib
    – Supply dictionary to decompress()
    – Create dictionary by compressing a
      number of URL’s with typical patterns
    – Save dictionary for separate use
    – Achieves >50% compression rate
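The zlib dictionary trick can be sketched with Python's `zlib` module, which exposes the same `zdict` feature; the dictionary bytes below are an illustrative stand-in for one built from real URL samples:

```python
import zlib

# Hypothetical dictionary of typical URL substrings; in practice it
# would be derived from a sample of real crawled URLs.
URL_DICT = b"http://www.https://www..com/.org/.net/index.html?id=&page="

def compress_url(url: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=URL_DICT)
    return c.compress(url) + c.flush()

def decompress_url(data: bytes) -> bytes:
    d = zlib.decompressobj(zdict=URL_DICT)   # same dictionary on both sides
    return d.decompress(data) + d.flush()

url = b"http://www.example.com/index.html?id=42"
assert decompress_url(compress_url(url)) == url
```

The dictionary must be saved and supplied on both the compress and decompress sides, as the slide notes.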
 Page avg. 13KB, 8 links per page
 From 20 million pages, >100 million
  links: several days to crawl
Related work
   Mercator (Heydon/Najork, DEC/Compaq)
     – used in AltaVista
     – centralized system (2-CPU Alpha with RAID disks)
     – URL-seen test by fast disk access and caching
     – one thread per HTTP connection
     – completely in Java, with pluggable components
    Atrax: distributed extension to Mercator
     – combines several Mercators
     – URL hashing and off-line URL check
Related Work (cont.)
   early Internet Archive crawler (circa ’96)
     – uses hashing to partition URLs between crawlers
     – Bloom filter for “URL seen” structure
    early Google crawler (1998)
    P2P crawlers (and others)
    Cho/Garcia-Molina (WWW 2002)
     – study of overhead/quality tradeoff in parallel crawlers
     – difference: we scale services separately, and
       focus on single-node performance
     – in our experience, parallel overhead is low
Open Issues
   Measuring and tuning peak performance
     – need simulation environment
     – eventually reduces to parsing and network performance
     – to be improved: space, fault-tolerance (Xactions?)
    Highly distributed crawling
     – highly distributed? (maybe)
     – hybrid? (different services)
     – few high-performance sites? (several Universities)
   Recrawling and focused crawling strategies
    – what strategies?
    – how to express?
    – how to implement?
Refresh Strategies
Page Refresh
 Make sure pages are up-to-date
 Many possible strategies
     – Uniform refresh

                    ∀ i, j:   f_i = f_j

     – Proportional to change frequency

                    ∀ i, j:   λ_i / f_i = λ_j / f_j

       (λ_i = change rate of page i, f_i = its refresh frequency)

   Need to define a metric
Freshness Metric
   Freshness
     F (ei , t )  1 if fresh, 0 otherwise
     F (S , t ) 
                       F (e , t )
                      i 1

   Age of pages
     A(ei , t )  time since modified
     A( S , t ) 
                       A(e , t )
                      i 1
Average Freshness
 Freshness changes over time
 Take the average freshness over a
  long period of time

        f ( S , t )  lim  f ( S , t )dt
                       t  t 0
Refresh Strategy
 Crawlers can refresh only a certain
  amount of pages in a period of time
 The page download resource can be
  allocated in many ways
 The proportional refresh policy
  allocates the resource proportionally
  to the pages’ change rates
   The collection contains 2 pages
    – E1 changes 9 times a day
    – E2 changes once a day
    – Simplified change model
       • The day is split into 9 equal intervals, and E1
         changes once on each interval
       • E2 changes once during the entire day
       • The only unknown is when the pages change within
         the intervals
 The crawler can download a page a day
 Goal is to maximize the freshness
Example (2)
Example (3)
   Which page do we refresh?
     – If we refresh E2 at midday
        • If E2 changes in the first half of the day and we refresh
          at midday, it remains fresh for the rest of the day
            – 50% chance of a 0.5 day freshness increase
            – 50% chance of no increase
            – Expectancy: 0.25 day freshness increase
     – If we refresh E1 at midday
        • If E1 changes in the first half of its interval and we
          refresh at midday (which is the middle of that
          interval), it remains fresh for the rest of the
          interval = 1/18 of a day
            – 50% chance of a 1/18 day freshness increase
            – 50% chance of no increase
            – Expectancy: 1/36 day freshness increase
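The two expectancies can be checked with a one-line calculation (assuming, as in the example, a uniformly distributed change time within the interval and a refresh at the midpoint):

```python
def expected_gain(interval_days):
    """Expected freshness gained by one refresh at the midpoint of an
    interval in which the page changes once, uniformly at random:
    50% chance the change already happened (the page then stays fresh
    for the remaining half interval), 50% chance of no gain."""
    return 0.5 * (interval_days / 2)

assert expected_gain(1.0) == 0.25                    # E2: one change per day
assert abs(expected_gain(1 / 9) - 1 / 36) < 1e-12    # E1: 9 changes per day
```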
Example (4)
 This gives a nice estimation
 But things are more complex in reality
    – Not sure that a page will change within
      an interval
    – Have to worry about age
   Using a Poisson model shows a
    uniform policy always performs
    better than a proportional one
Comparing Policies

                  Freshness      Age
 Proportional          0.12 400 days
 Uniform               0.57 5.6 days
 Optimal               0.62 4.3 days

      Based on Statistics from experiment
      and revisit frequency of every month
Duplicate Detection
Duplicate/Near-Duplicate Detection

 Duplicate: Exact match with fingerprints
 Near-Duplicate: Approximate match
  – Overview
     • Compute syntactic similarity with an edit-
       distance measure
      • Use a similarity threshold to detect near-duplicates
         – e.g., Similarity > 80% ⇒ documents are “near duplicates”
         – Not transitive, though sometimes used transitively
Computing Near Similarity
   Features:
    – Segments of a document (natural or artificial
      breakpoints) [Brin95]
     – Shingles (word N-Grams) [Brin95, Brod98]
       “a rose is a rose is a rose” ⇒ 4-grams:
       a_rose_is_a, rose_is_a_rose, is_a_rose_is
   Similarity Measure
    – TF*IDF [Shiv95]
    – Set intersection [Brod98]
      |Intersection| / |Union|

          Jaccard measure
Shingles + Set Intersection
   Computing exact set intersection of
     shingles between all pairs of
     documents is infeasible
    – Approximate using a cleverly chosen
      subset of shingles from each (a sketch)
Shingles + Set Intersection
   Estimate |intersection| / |union| based on
    a short sketch ([Brod97, Brod98])
     – Create a “sketch vector” (e.g. of size 200) for
       each document
     – Documents which share more than t (say 80%)
       corresponding vector elements are similar
      – For doc D, sketch[i] is computed as follows:
         • Let f map all shingles in the universe to 0..2^m (e.g., f =
           fingerprinting)
         • Let π_i be a specific random permutation on 0..2^m
         • Pick sketch[i] := MIN π_i(f(s)) over all shingles s in D
Computing Sketch[i] for Doc1

    – Start with the document’s 64-bit shingle fingerprints,
      points on the line 0..2^64
    – Permute the number line with π_i
    – Pick the min value as Sketch[i]

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

    – Apply the same permutation π_i to the shingles of both
      documents and compare the two minima A and B
    – Test for 200 random permutations: π_1, π_2,… π_200

A = B iff the shingle with the MIN value in the union of Doc1 and
Doc2 is common to both (i.e. lies in the intersection)

This happens with probability:
  |intersection| / |union|
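A runnable sketch of the sketch-vector scheme (illustrative: MD5 fingerprints stand in for f, and random linear maps modulo a prime stand in for the permutations π_i):

```python
import hashlib, random

def shingles(text, n=4):
    w = text.split()
    return {" ".join(w[i:i + n]) for i in range(len(w) - n + 1)}

def f(s):
    # deterministic 60-bit fingerprint of a shingle
    return int(hashlib.md5(s.encode()).hexdigest()[:15], 16)

M = 2**61 - 1  # prime modulus for the random linear "permutations"

def make_perms(count, seed=0):
    rnd = random.Random(seed)
    return [(rnd.randrange(1, M), rnd.randrange(M)) for _ in range(count)]

def sketch(doc_shingles, perms):
    # sketch[i] = MIN over all shingles s of pi_i(f(s))
    return [min((a * f(s) + b) % M for s in doc_shingles) for a, b in perms]

def similarity(s1, s2):
    # fraction of agreeing sketch positions estimates |inter| / |union|
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

perms = make_perms(200)
d1 = sketch(shingles("a rose is a rose is a rose"), perms)
d2 = sketch(shingles("a rose is a flower which is a rose"), perms)
sim = similarity(d1, d2)   # estimates the true Jaccard similarity (here 1/8)
```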
Mirror Detection
Mirror Detection

   Mirroring is systematic replication of Web
    pages across hosts
    – Single largest cause of duplication on the Web
   URL1 and URL2 are mirrors iff
       For all (or most) paths p such that when
        URL1/p exists
        URL2/p exists as well
       with identical (or near identical) content, and
        vice versa
Example of mirrors
    – Structural Classification of Proteins (SCOP)
   Bottom-up [Cho 2000]
    – Group near duplicates into clusters
    – Merge clusters of same cardinality and
      corresponding linkage
   Top-down [Bhar99, Bhar00c]
    – Select features
    – Compute list of pairs
    – Host pair validation by sampling
Mirror detection benefits
   Smart crawling
    – Fetch from the fastest or freshest server
    – Avoid duplication
   Better connectivity analysis
    – Combine inlinks
    – Avoid double counting outlinks
 Avoid redundancy in query results
 Proxy caching
References
    Mercator:
   K. M. Risvik and R. Michelsen, Search engines and web
     dynamics, Computer Networks, vol. 39, pp. 289-302, June 2002
   WebBase:
   PolyBot:
   Grub:
   UbiCrawler:
Lots of practical issues
 URL normalization
 Embedded session ID
 Embedded URLs
 Cookie loops (e.g. inner page
  redirects to main page + cookie)
 Malformed HTML
 Avoid spam
 Access free registration sites
 Handling URL queries
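URL normalization, the first issue above, can be sketched with the standard `urllib.parse` helpers (the session-ID parameter names and the rules chosen are illustrative assumptions, not a complete canonicalizer):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_KEYS = {"sessionid", "sid", "phpsessid", "jsessionid"}  # assumption

def normalize(url):
    """Canonicalise a URL: lowercase scheme/host, drop default ports
    (80/443, a simplification), strip fragments and embedded session
    IDs, and sort the remaining query parameters."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    query = urlencode(sorted((k, v) for k, v in parse_qsl(parts.query)
                             if k.lower() not in SESSION_KEYS))
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", query, ""))

print(normalize("HTTP://Example.COM:80/a?sid=XYZ&b=2&a=1#top"))
# -> http://example.com/a?a=1&b=2
```

Normalizing before the URL-seen test prevents the same page from being queued under many spellings of its URL.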
