Crawling
                Giuseppe Attardi
           Dipartimento di Informatica
               Università di Pisa
Crawling
 Download a set of Web pages
 The set typically consists of all pages
  reachable by following links from a root
  set
 Crawling is performed periodically
 Goals:
    – Find new pages
    – Keep pages fresh
    – Select “good” pages
Web Dynamics
   Size
    – ~10 billion publicly indexable pages
    – ~10 kB/page → ~100 TB total
    – Doubles every 18 months
   Dynamics
    – 33% change weekly
    – 8% new pages every week
    – 25% new links every week
Weekly change
[Chart: weekly change of Web pages; Fetterly, Manasse, Najork, Wiener 2003]
Crawling Issues
   How to crawl?
    – Quality: “Best” pages first
    – Efficiency: Avoid duplication (or near
      duplication)
    – Etiquette: Robots.txt, Server load concerns


   How much to crawl? How much to
    index?
    – Coverage: How big is the Web? How much do
      we cover?
    – Relative Coverage: How much do competitors
      have?
Basic crawler operation
 Begin with known “seed” pages
 Add them to a queue
 Extract a URL from the queue
    – Fetch it
    – Parse it and extract the URLs it points to
    – Add any extracted URLs not yet visited to
      the queue
   Repeat until queue is empty
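A minimal single-threaded sketch of this loop in C++ (matching the crawler code later in these slides); fetch_page() and extract_urls() are hypothetical stubs standing in for a real HTTP client and HTML parser:

#include <deque>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real HTTP fetcher and link extractor.
std::string fetch_page(const std::string& url) { return ""; }
std::vector<std::string> extract_urls(const std::string& html) { return {}; }

void crawl(const std::vector<std::string>& seeds) {
    std::deque<std::string> queue(seeds.begin(), seeds.end());
    std::set<std::string> visited(seeds.begin(), seeds.end());
    while (!queue.empty()) {
        std::string url = queue.front();      // FIFO queue gives breadth-first order
        queue.pop_front();
        std::string html = fetch_page(url);
        for (const std::string& u : extract_urls(html))
            if (visited.insert(u).second)     // insert() reports whether u was new
                queue.push_back(u);
    }
}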
Simple picture – complications
   Web crawling isn’t feasible with one machine
    – All of the above steps distributed
   Even non-malicious pages pose challenges
    – Latency/bandwidth to remote servers vary
    – Robots.txt stipulations
        • How “deep” should you crawl a site’s URL hierarchy?
    – Site mirrors and duplicate pages
   Malicious pages
    – Spam pages
     – Spider traps, incl. dynamically generated ones
   Politeness – don’t hit a server too often
Robots.txt
   Protocol for giving spiders (“robots”)
    limited access to a website, originally
    from 1994
    – www.robotstxt.org/wc/norobots.html
   Website announces its request on
    what can(not) be crawled
    – For a URL, create a file
      URL/robots.txt
    – This file specifies access restrictions
Robots.txt example
   No robot should visit any URL
    starting with "/yoursite/temp/",
    except the robot called “IxeBot":

User-agent: *
Disallow: /yoursite/temp/

User-agent: IxeBot
Disallow:
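A minimal sketch of how a crawler could apply these records, assuming the file has already been parsed into (user-agent, disallow-prefix) groups; the RobotsRecord type and allowed() helper are illustrative, not part of any real library:

#include <string>
#include <vector>

struct RobotsRecord {
    std::string user_agent;              // "*" or a robot name, e.g. "IxeBot"
    std::vector<std::string> disallow;   // path prefixes; an empty list allows everything
};

bool allowed(const std::vector<RobotsRecord>& records,
             const std::string& agent, const std::string& path) {
    const RobotsRecord* chosen = nullptr;
    for (const auto& r : records) {
        if (r.user_agent == agent) { chosen = &r; break; }  // specific record wins
        if (r.user_agent == "*" && !chosen) chosen = &r;    // otherwise fall back to "*"
    }
    if (!chosen) return true;                // no applicable record: everything allowed
    for (const auto& prefix : chosen->disallow)
        if (!prefix.empty() && path.compare(0, prefix.size(), prefix) == 0)
            return false;                    // path starts with a disallowed prefix
    return true;
}

With the example above, allowed(records, "IxeBot", "/yoursite/temp/x") is true, while any other agent is excluded from /yoursite/temp/.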
Crawling Issues
 Crawl strategies
 Distributed crawling
 Refresh strategies
 Filtering duplicates
 Mirror detection
Crawl Strategies
 Where do we crawl next?
[Diagram: the crawl frontier; URLs crawled and parsed vs. URLs still in the queue, within the Web]
Crawl Order

 Want best pages first
 Potential quality measures:
     • Final In-degree
     • Final PageRank
 Crawl heuristics:
    • Breadth First Search (BFS)
    • Partial Indegree
    • Partial PageRank
    • Random walk
Breadth-First Crawl
    Basic idea:
         start at a set of known URLs
         explore in “concentric circles” around these URLs

[Diagram: start pages, distance-one pages, distance-two pages]
    used by broad web search engines
    balances load between servers
Web Wide Crawl (328M pages) [Najo01]


        BFS crawling brings in high quality
        pages early in the crawl
Stanford Web Base (179K) [Cho98]
[Chart: overlap with the best x% of pages by indegree (y-axis) vs. x% crawled in order O(u) (x-axis)]
Queue of URLs to be fetched
 What constraints dictate which queued
  URL is fetched next?
 Politeness – don’t hit a server too often,
  even from different threads of your spider
 How far into a site you’ve crawled already
    – Most sites, stay at ≤ 5 levels of URL hierarchy
   Which URLs are most promising for
    building a high-quality corpus
    – This is a graph traversal problem:
    – Given a directed graph you’ve partially visited,
      where do you visit next?
Where do we crawl next?
 Keep all crawlers busy
 Keep crawlers from treading on each
  others’ toes
    – Avoid fetching duplicates repeatedly
 Respect politeness/robots.txt
 Avoid getting stuck in traps
 Detect/minimize spam
 Get the “best” pages
    – What’s best?
    – Best for answering search queries
Where do we crawl next?
   Complex scheduling optimization
    problem, subject to all the constraints
    listed
    – Plus operational constraints (e.g., keeping all
      machines load-balanced)
   Scientific study – limited to specific
    aspects
    – Which ones?
    – What do we measure?
   What are the compromises in distributed
    crawling?
Page selection
 Importance metric
 Web crawler model
 Crawler method for choosing page to
  download
Importance Metrics
 Given a page P, define how “good”
  that page is
 Several metric types:
    – Interest driven
    – Popularity driven
    – Location driven
    – Combined
Interest Driven
   Define a driving query Q
   Find textual similarity between P and Q

   Define a word vocabulary t1…tn
   Define a vector for P and Q:
    – Vp, Vq = <w1,…,wn>
       • wi = 0 if ti does not appear in the document
       • wi = IDF(ti) = 1 / number of pages containing ti
   Importance: IS(P) = Vp * Vq (cosine product)
   Finding IDF requires going over the entire web
   Estimate IDF from the pages already visited, to
    compute IS’
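A hedged sketch of IS’(P) as defined above: both vectors carry weight IDF(t_i) on the terms they contain, and IS’ is their dot product; doc_freq is an assumed counter of how many visited pages contain each term, and the interest() helper is illustrative:

#include <set>
#include <string>
#include <unordered_map>

double interest(const std::set<std::string>& page_terms,
                const std::set<std::string>& query_terms,
                const std::unordered_map<std::string, long>& doc_freq) {
    double score = 0.0;
    for (const auto& t : query_terms) {
        if (!page_terms.count(t)) continue;             // w_i is 0 in the page vector
        auto it = doc_freq.find(t);
        double df = (it != doc_freq.end() && it->second > 0) ? it->second : 1.0;
        double idf = 1.0 / df;                          // IDF(t) = 1 / #pages containing t
        score += idf * idf;                             // w_i(P) * w_i(Q)
    }
    return score;       // divide by |Vp| |Vq| to obtain a true cosine
}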
Popularity Driven
   How popular a page is:
    – Backlink count
      • IB(P) = the number of pages containing a
        link to P
       • Estimate from previous crawls: IB’(P)
    – More sophisticated metric, e.g.
      PageRank: IR(P)
Location Driven
   IL(P): A function of the URL to P
     – Words appearing in the URL
     – Number of “/” in the URL
    Easily evaluated; requires no data
     from previous crawls
Combined Metrics
 IC(P): a function of several other
  metrics
 Allows using local metrics for first
  stage and estimated metrics for
  second stage
 IC(P) = a*IS(P) + b*IB(P) + c*IL(P)
Focused Crawling
Focused Crawling (Chakrabarti)
 Distributed federation of focused
  crawlers
 Supervised topic classifier
 Controls priority of unvisited frontier
 Trained on document samples from
  Web directory (Dmoz)
Basic Focused Crawler
DOM tree analysis
Learning Algorithm
 Naïve Bayes
 Page u is modeled as bag of words
    { <feature, freq> }
    feature = < term, distance >
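A hedged sketch of how such a Naive Bayes model could score a fetched page to set the priority of its outlinks; topic_score() and its parameters are illustrative, with the log-prior, per-feature log-likelihoods and smoothing value assumed to come from training on the Dmoz samples:

#include <map>
#include <string>
#include <utility>

using Feature = std::pair<std::string, int>;       // <term, distance>

double topic_score(const std::map<Feature, int>& bag,   // page as { <feature, freq> }
                   double log_prior,                    // log P(topic), from training
                   const std::map<Feature, double>& log_likelihood,  // log P(feature | topic)
                   double log_unseen) {                 // smoothed value for unseen features
    double s = log_prior;
    for (const auto& f : bag) {
        auto it = log_likelihood.find(f.first);
        s += f.second * (it != log_likelihood.end() ? it->second : log_unseen);
    }
    return s;       // higher score: enqueue this page's outlinks with higher priority
}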
Comparisons
   Start the baseline crawler from the URLs
    in one topic
   Fetch up to 20000-25000 pages
   For each pair of fetched pages (u,v), add
    item to the training set of the apprentice
   Train the apprentice
   Start the enhanced crawler from the same
    set of pages
   Fetch about the same number of pages
Results
Controversy
 Chakrabarti claims the focused crawler is
  superior to breadth-first
 Suel claims the contrary, arguing that the
  claim was based on experiments with
  poorly performing crawlers
Distributed Crawling
Approaches
 Centralized Parallel Crawler
 Distributed
 P2P
Parallel Crawlers
   A parallel crawler consists of multiple
    crawling processes communicating via
    local network (intra-site parallel crawler) or
    Internet (distributed crawler)
    – http://www2002.org/CDROM/refereed/108/index.html

   Setting: we have a number of c-proc’s
    – c-proc = crawling process
   Goal: we wish to crawl the best pages with
    minimum overhead
      Crawler-process distribution
[Diagram: a distributed crawler runs c-procs at geographically distant
 locations; a central parallel crawler runs them on the same local network]
Distributed model
   Crawlers may be running in diverse
    geographic locations
    – Periodically update a master index
    – Incremental update so this is “cheap”
      • Compression, differential update etc.
    – Focus on communication overhead
      during the crawl
   Also results in dispersed WAN load
Issues and benefits

Issues:
   – overlap: minimize downloading the same page
     multiple times
   – quality: depends on the crawl strategy
   – communication bandwidth: minimize it

Benefits:
  – scalability: for large-scale web-crawls
  – costs: use of cheaper machines
  – network-load dispersion and reduction: by
    dividing the web into regions and crawling
    only the nearest pages
Coordination

A parallel crawler consists of multiple crawling
    processes communicating via local network
    (intra-site parallel crawler) or Internet (distributed
    crawler)

Coordination among c-proc's:
   1.   Independent: no coordination, every process follows its
        extracted links
   2.   Dynamic assignment: a central coordinator dynamically
        divides the web into small partitions and assigns each
        partition to a process
   3.   Static assignment: Web is partitioned and assigned
        without central coordinator before the crawl starts
c-proc’s crawling the web
[Diagram: each c-proc has its own sets of URLs crawled and URLs in
 queues; which c-proc gets a newly extracted URL?
 Communication: by URLs passed between c-procs.]
Static assignment

Links from one partition to another (inter-partition links) can be handled in:
1.   Firewall mode: a process does not follow any inter-partition link
2.   Cross-over mode: a process also follows inter-partition links, and so
     discovers more pages in its partition
3.   Exchange mode: processes exchange inter-partition URLs; this mode
     needs communication
[Diagram: two partitions of the Web graph, with intra- and inter-partition links]
Classification of parallel crawlers
If exchange mode is used, communication can be
     limited by:
   –   Batch communication: every process collects some URLs
       and sends them in a batch
   –   Replication: the k most popular URLs are replicated at each
       process and are not exchanged (previous crawl or on the fly)


Some ways to partition the Web:
   –   URL-hash based: many inter-partition links
   –   Site-hash based: reduces the inter partition links
   –   Hierarchical: .com domain, .net domain …
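A minimal sketch of site-hash based assignment: all URLs on the same host map to the same c-proc, keeping most links intra-partition; host_of() and owner() are illustrative helpers and the URL parsing is deliberately simplified:

#include <functional>
#include <string>

std::string host_of(const std::string& url) {        // crude host extraction
    auto start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    auto end = url.find('/', start);
    return url.substr(start, end == std::string::npos ? std::string::npos : end - start);
}

size_t owner(const std::string& url, size_t num_cprocs) {
    return std::hash<std::string>{}(host_of(url)) % num_cprocs;   // index of the c-proc
}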
Static assignment: comparison

              Coverage   Overlap   Quality   Communication

  Firewall      Bad       Good      Bad          Good

  Cross-over    Good      Bad       Bad          Good

  Exchange      Good      Good      Good         Bad
UBI Crawler
Features:
  – Full distribution: identical agents / no central coordinator
  – Balanced locally computable assignment:
     • each URL is assigned to one agent
      • each agent can compute the URL assignment locally
     • distribution of URLs is balanced
  – Scalability:
     • number of crawled pages per second and per agent are
       independent of the number of agents
  – Fault tolerance:
     • URLs are not statically distributed
     • distributed reassignment protocol not reasonable
UBI Crawler: Assignment Function
A: set of agent identifiers
L ⊆ A: set of alive agents
m: total number of hosts
δ_L: assigns each host h to an alive agent: δ_L(h) ∈ L
Requirements:
 Balance: each agent should be responsible for approximately the
  same number of hosts:
                  |δ_L⁻¹(a)| ≈ m / |L|
 Contravariance: if the number of agents grows, the portion of the
  web crawled by each agent must shrink:

               L ⊆ L′  ⟹  δ_L′⁻¹(a) ⊆ δ_L⁻¹(a)

               L ⊆ L′ ∧ δ_L′(h) ∈ L  ⟹  δ_L′(h) = δ_L(h)
Consistent Hashing
 Each bucket is replicated k times and each replica is mapped
  randomly on the unit circle
 Hashing a key: compute a point on the unit circle and find the nearest
  replica

Example: L = {a,b}, L′ = {a,b,c}, k = 3, hosts = {0,1,…,9}

[Diagram: the k = 3 replicas of each agent mapped onto the unit circle,
 before and after agent c joins; each host goes to the nearest replica]

  δ_L⁻¹(a) = {1,4,5,6,8,9}        δ_L′⁻¹(a) = {4,5,6,8}
  δ_L⁻¹(b) = {0,2,3,7}            δ_L′⁻¹(b) = {0,2,7}
                                  δ_L′⁻¹(c) = {1,3,9}

Contravariance:  L ⊆ L′  ⟹  δ_L′⁻¹(a) ⊆ δ_L⁻¹(a)
                 L ⊆ L′ ∧ δ_L′(h) ∈ L  ⟹  δ_L′(h) = δ_L(h)
Balancing: hash function and random number generator
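A minimal sketch of this scheme, assuming a generic 64-bit hash in place of the paper's random permutations; the class and function names are illustrative. Each agent gets k replica points on the (discretized) unit circle, and a host is assigned to the owner of the first replica at or after its hash point:

#include <cstdint>
#include <functional>
#include <map>
#include <string>

uint64_t hash64(const std::string& s) {            // placeholder hash function
    return std::hash<std::string>{}(s);
}

class ConsistentHash {
    std::map<uint64_t, std::string> ring_;         // replica point -> agent id
public:
    void add_agent(const std::string& agent, int k = 3) {
        for (int i = 0; i < k; ++i)
            ring_[hash64(agent + "#" + std::to_string(i))] = agent;
    }
    void remove_agent(const std::string& agent) {  // e.g. the agent died
        for (auto it = ring_.begin(); it != ring_.end(); )
            it = (it->second == agent) ? ring_.erase(it) : std::next(it);
    }
    std::string assign(const std::string& host) const {
        auto it = ring_.lower_bound(hash64(host)); // nearest replica clockwise
        if (it == ring_.end()) it = ring_.begin(); // wrap around the circle
        return it->second;
    }
};

Adding an agent only takes hosts away from the others, and removing one only reassigns its own hosts, which is exactly the contravariance property above.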
UBI Crawler: fault tolerance
Up to now: no metrics for estimating the fault tolerance of
  distributed crawlers

   – Each agent has its own view of the set of alive agents (views
     can be different), but two agents will never dispatch the same
     host to two different agents
[Diagram: agents a, b, c, d; agent c has died, yet the remaining
 agents’ assignments stay consistent]
   – Agents can be added dynamically in a self-stabilizing
     way
Evaluation metrics (1)

1.   Overlap:    Overlap = (N − I) / N

     N: total number of fetched pages
     I: number of distinct fetched pages
        • minimize the overlap

2.   Coverage:   Coverage = I / U

     U: total number of Web pages
      • maximize the coverage
Evaluation metrics (2)

3.   Communication Overhead:    Overhead = M / P

     M: number of exchanged messages
     (URLs)
     P: number of downloaded pages
       • minimize the overhead

4.   Quality:    Quality = (1/N) Σ_i PageRank(p_i)

      • maximize the quality
      • backlink count / oracle crawler
Experiments
   40M URL graph – Stanford Webbase
    – Open Directory (dmoz.org) URLs as
      seeds
   Should be considered a small Web
Firewall mode coverage
   The price of crawling in firewall mode
Crossover mode overlap
   Demanding coverage drives up
    overlap
     Exchange mode communication
        Communication overhead, per downloaded URL, is sublinear
[Chart: communication overhead per downloaded URL]
Cho’s conclusion
   fewer than 4 crawling processes running in firewall
    mode provide good coverage
   firewall mode not appropriate when:
    – > 4 crawling processes
    – download only a small subset of the Web and quality of
      the downloaded pages is important
   exchange mode
    – consumes < 1% network bandwidth for URL exchanges
    – maximizes the quality of the downloaded pages
   By replicating 10,000 - 100,000 popular URLs,
    communication overhead reduced by 40%
Resources
   www.robotstxt.org/wc/norobots.html
   www2002.org/CDROM/refereed/108/index.html
   www2004.org/proceedings/docs/1p595.pdf
Crawler Models
   A crawler
    – Tries to visit more important pages first
    – Only has estimates of importance metrics
    – Can only download a limited amount
   How well does a crawler perform?
    – Crawl and Stop
    – Crawl and Stop with Threshold
Crawl and Stop
 A crawler stops after visiting K pages
 A perfect crawler
    – Visits pages with ranks R1,…,Rk
    – These are called Top Pages
   A real crawler
    – Visits only M < K top pages

 Performance rate: P_CS = M / K

 For a random crawler: P_CS = K / N
  (N = total number of Web pages)
Crawl and Stop with Threshold
   A crawler stops after visiting K pages
   Top pages are pages with a metric higher than G;
    T is the number of top pages
   A crawler visits V top pages
   Metric: percent of top pages visited   P_ST = V / T
   Perfect crawler    P_ST = K / T   (1 if K ≥ T)
   Random crawler     P_ST = (T · K / N) / T = K / N
Ordering Metrics
 The crawlers queue is prioritized
  according to an ordering metric
 The ordering metric is based on an
  importance metric
    – Location metrics - directly
     – Popularity metrics - via estimates according to
       previous crawls
    – Similarity metrics – via estimates according to
      anchor
When to stop
 Up to a certain depth
 Up to a certain amount from each
  site
 Up to a maximum number of pages
Literature
 Mercator (Altavista, Java)
 WebBase (Garcia-Molina, Cho)
 PolyBot (Suel, NY Polytechnic)
 Grub (grub.org)
 UbiCrawler (Vigna, Uni. Milano)
Crawler Architecture
  PolyBot
• application issues
  requests to manager
• manager does DNS
  and robot exclusion
• manager schedules
  URL on downloader
• downloader gets file
  and puts it on disk
• application is notified
  of new files
• application parses new
  files for hyperlinks
• application sends data
  to storage component
Scaling up


•   20 machines
•   1500 pages/s?
•   depends on
    crawl strategy
•   hash URLs to nodes
•   based on site
    (because of robot exclusion)
Scheduler Data Structures
[Diagram: PolyBot scheduler data structures]
• when to insert new URLs into internal structures?
IXE Crawler
[Architecture diagram. Components: Crawler (with CrawlInfo), Table<UrlInfo>
 and Citations tables, Parser, Cache, Feeder, Scheduler, UrlEnumerator,
 Hosts and Robots tables, per-host queues, and Retrievers multiplexing
 transfers with select(). Legend: thread, synchronization object, memory, disk.]
Scheduler
 Splits URLs by hosts
 Distributes to Retrievers
 Keeps hosts on a wait queue for a grace
  period (see the sketch below)
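A hedged sketch of such a per-host grace period (not the actual IXE data structures; HostQueues and its members are illustrative): a host becomes eligible again only after `grace` has elapsed since its last fetch, and a min-heap orders hosts by the time they become ready.

#include <chrono>
#include <deque>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

class HostQueues {
    std::unordered_map<std::string, std::deque<std::string>> pending_;  // host -> URLs
    using Entry = std::pair<Clock::time_point, std::string>;            // <ready_at, host>
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> ready_;  // min-heap
    std::chrono::seconds grace_;
public:
    explicit HostQueues(std::chrono::seconds grace) : grace_(grace) {}

    void add(const std::string& host, const std::string& url) {
        if (pending_[host].empty())
            ready_.push({Clock::now(), host});     // newly seen host: eligible at once
        pending_[host].push_back(url);
    }
    // Fills `url` and returns true only if some host is past its grace period.
    bool next(std::string& url) {
        if (ready_.empty() || ready_.top().first > Clock::now())
            return false;
        std::string host = ready_.top().second;
        ready_.pop();
        url = pending_[host].front();
        pending_[host].pop_front();
        if (!pending_[host].empty())
            ready_.push({Clock::now() + grace_, host});  // keep the host waiting
        return true;
    }
};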
Retriever
 Modified version of cURL library
 Asynchronous DNS resolver
 Handles up to 1024 transfers at once
 Keeps connections alive, transferring
  multiple files over one connection
  (connection setup can be 25% of
  retrieval time)
Persistent Storage
 Implemented by metaprogramming
  as IXE object-relational tables
 Tuned to handle concurrent access,
  with low granularity locking
 Locking completely transparent to
  application
 < 1% of overall runtime
Parser
 > 30% of execution time
 Uses a ThreadPool
Locking and Concurrency
   Portable layer (Unix, Windows):
    – Thread
    – ThreadGroup
    – ThreadPool
    – Lock
    – Condition
    – ConditionWait
Pipe
 bool Get(T& res) {
     LockUp our(_lock);                      // RAII guard on the pipe lock
     while (empty()) {                       // loop guards against spurious wakeups
        if (closed)
           return false;                     // closed and drained: nothing left
        ConditionWait(not_empty, &_lock);    // release lock, block until Put/Close
     }
     res = front();
     pop_front();
     return true;
 }
In Memory Communication
 Channel<T>
 Pipe<T, Q = queue<T> >
 MultiPipe<T, Q = queue<T> >
 Asynchronous Put(obj)
 Synchronous Get(obj)
 Peek()
 Close()
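A self-contained re-sketch of the same Pipe idea using only the standard library (std::mutex / std::condition_variable) rather than IXE's own portable Lock and ConditionWait classes: Put() never blocks, Get() blocks until an item arrives or the pipe is closed.

#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T, typename Q = std::deque<T>>
class Pipe {
    Q queue_;
    std::mutex lock_;
    std::condition_variable not_empty_;
    bool closed_ = false;
public:
    void Put(const T& item) {                       // asynchronous
        { std::lock_guard<std::mutex> g(lock_); queue_.push_back(item); }
        not_empty_.notify_one();
    }
    bool Get(T& res) {                              // synchronous
        std::unique_lock<std::mutex> g(lock_);
        not_empty_.wait(g, [&] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;           // closed and drained
        res = queue_.front();
        queue_.pop_front();
        return true;
    }
    void Close() {                                  // wake every blocked consumer
        { std::lock_guard<std::mutex> g(lock_); closed_ = true; }
        not_empty_.notify_all();
    }
};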
Performance Comparison

                    nutch                  IXE
 Language           Java                   C++
 Parallelism        threads: 300           asynch I/O: 1 thread,
                                           500 connections
 DNS resolution     synchronous            asynchronous
 Page analysis      stages                 concurrent
 Download speed     19 pages/sec           120 pages/sec
 Bandwidth          1.3 Mb/s peak          24 Mb/s peak
Crawl Performance
   Single PC:
    – Peak 120 pages per second (24 Mb/s)
    – 5 million pages/day
   Full crawl of 2 billion pages:
     – 400 days on 1 PC at 25 Mb/s, or
    – 4 days, 100 PCs, 2.5 Gb/s bandwidth
Crawler Console
 Web pushlet (aka AJAX) graphical
  monitor
 Real time bandwidth usage graph
 Scheduled Windows Service
URL compression
 10 billion URLs = 550 GB
 Must map URL to unique Ids
 Simple trick: use undocumented
  feature of zlib
    – Supply dictionary to decompress()
     – Create dictionary by compressing a
       number of URLs with typical patterns
    – Save dictionary for separate use
    – Achieves >50% compression rate
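A hedged sketch of the trick, using the standard zlib calls deflateSetDictionary / inflateSetDictionary; the dictionary string and the function names here are only illustrative and error handling is minimal:

#include <cstring>
#include <string>
#include <vector>
#include <zlib.h>

// Illustrative preset dictionary: substrings that recur in typical URLs.
static const std::string kDict = "http://www.https://index.html.com/.org/.net/?id=";

std::string compress_url(const std::string& url) {
    z_stream zs;
    std::memset(&zs, 0, sizeof(zs));
    deflateInit(&zs, Z_BEST_COMPRESSION);
    deflateSetDictionary(&zs, reinterpret_cast<const Bytef*>(kDict.data()), kDict.size());
    std::vector<Bytef> out(deflateBound(&zs, url.size()));
    zs.next_in  = reinterpret_cast<Bytef*>(const_cast<char*>(url.data()));
    zs.avail_in = url.size();
    zs.next_out  = out.data();
    zs.avail_out = out.size();
    deflate(&zs, Z_FINISH);                        // single-shot compression
    std::string packed(reinterpret_cast<char*>(out.data()), zs.total_out);
    deflateEnd(&zs);
    return packed;
}

std::string decompress_url(const std::string& packed) {
    z_stream zs;
    std::memset(&zs, 0, sizeof(zs));
    inflateInit(&zs);
    std::vector<Bytef> out(4096);                  // assume URLs shorter than 4 kB
    zs.next_in  = reinterpret_cast<Bytef*>(const_cast<char*>(packed.data()));
    zs.avail_in = packed.size();
    zs.next_out  = out.data();
    zs.avail_out = out.size();
    int rc = inflate(&zs, Z_FINISH);
    if (rc == Z_NEED_DICT) {                       // inflate asks for the preset dictionary
        inflateSetDictionary(&zs, reinterpret_cast<const Bytef*>(kDict.data()), kDict.size());
        rc = inflate(&zs, Z_FINISH);
    }
    std::string url(reinterpret_cast<char*>(out.data()), zs.total_out);
    inflateEnd(&zs);
    return url;
}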
Statistics
 Page avg. 13KB, 8 links per page
 From 20 million pages >100 million
  links: several days to crawl
Related work
   Mercator (Heydon/Najork, DEC/Compaq)
    – used in AltaVista
    – centralized system (2-CPU Alpha with RAID
      disks)
    – URL-seen test by fast disk access and caching
    – one thread per HTTP connection
    – completely in Java, with pluggable
      components
   Atrax: distributed extension to Mercator
    – combines several Mercators
    – URL hashing and off-line URL check
Related Work (cont.)
   early Internet Archive crawler (circa 96)
    – uses hashing to partition URLs between
      crawlers
    – bloom filter for “URL seen” structure
   early Google crawler (1998)
   P2P crawlers (grub.org and others)
   Cho/Garcia-Molina (WWW 2002)
    – study of overhead/quality tradeoff in parallel
      crawlers
    – difference: we scale services separately, and
      focus on single-node performance
    – in our experience, parallel overhead low
Open Issues
   Measuring and tuning peak performance
    – need simulation environment
    – eventually reduces to parsing and network
    – to be improved: space, fault-tolerance (Xactions?)
   Highly Distributed crawling
    – highly distributed (e.g., grub.org) ? (maybe)
     – hybrid? (different services)
    – few high-performance sites? (several Universities)
   Recrawling and focused crawling strategies
    – what strategies?
    – how to express?
    – how to implement?
Refresh Strategies
Page Refresh
 Make sure pages are up-to-date
 Many possible strategies
     – Uniform refresh:
                   ∀ i, j:  f_i = f_j
     – Proportional to change frequency:
                   ∀ i, j:  λ_i / f_i = λ_j / f_j
       (λ_i = change rate of page i, f_i = its refresh frequency)

    Need to define a metric
Freshness Metric
   Freshness
     F(e_i, t) = 1 if e_i is fresh at time t, 0 otherwise

     F(S, t) = (1/N) Σ_{i=1..N} F(e_i, t)

   Age of pages
     A(e_i, t) = time since e_i was last modified (0 if fresh)

     A(S, t) = (1/N) Σ_{i=1..N} A(e_i, t)
Average Freshness
 Freshness changes over time
 Take the average freshness over a
  long period of time

        F̄(S) = lim_{t→∞} (1/t) ∫_0^t F(S, t) dt
Refresh Strategy
 Crawlers can refresh only a certain
  amount of pages in a period of time
 The page download resource can be
  allocated in many ways
 The proportional refresh policy
  allocates the resource proportionally
  to the pages’ change rates
Example
   The collection contains 2 pages
    – E1 changes 9 times a day
    – E2 changes once a day
    – Simplified change model
       • The day is split into 9 equal intervals, and E1
         changes once on each interval
       • E2 changes once during the entire day
       • The only unknown is when the pages change within
         the intervals
 The crawler can download a page a day
 Goal is to maximize the freshness
Example (2)
Example (3)
   Which page do we refresh?
     – If we refresh E2 at midday
        • If E2 changes in the first half of the day and we refresh
          at midday, it remains fresh for the remaining half of the day
            – 50% chance of a 0.5 day freshness increase
            – 50% chance of no increase
            – Expected freshness increase: 0.25 days
     – If we refresh E1 at midday
        • If E1 changes in the first half of its interval and we
          refresh at midday (which is the middle of an
          interval), it remains fresh for the remaining half of the
          interval = 1/18 of a day
            – 50% chance of a 1/18 day freshness increase
            – 50% chance of no increase
            – Expected freshness increase: 1/36 days
Example (4)
 This gives a nice estimation
 But things are more complex in real
  life
    – Not sure that a page will change within
      an interval
    – Have to worry about age
   Using a Poisson model shows a
    uniform policy always performs
    better than a proportional one
Comparing Policies

                Freshness    Age
 Proportional     0.12       400 days
 Uniform          0.57       5.6 days
 Optimal          0.62       4.3 days

      Based on statistics from the experiment,
      with a revisit frequency of once a month
Duplicate Detection
Duplicate/Near-Duplicate Detection

 Duplicate: Exact match with fingerprints
 Near-Duplicate: Approximate match
  – Overview
     • Compute syntactic similarity with an edit-
       distance measure
     • Use similarity threshold to detect near-
       duplicates
        – e.g., Similarity > 80% => Documents are “near
          duplicates”
        – Not transitive though sometimes used transitively
Computing Near Similarity
   Features:
    – Segments of a document (natural or artificial
      breakpoints) [Brin95]
    – Shingles (word N-Grams) [Brin95, Brod98]
     “a rose is a rose is a rose” =>
       a_rose_is_a
          rose_is_a_rose
               is_a_rose_is
   Similarity Measure
    – TF*IDF [Shiv95]
    – Set intersection [Brod98]
      |Intersection| / |Union|

          Jaccard measure
Shingles + Set Intersection
   Computing exact set intersections of
    shingles between all pairs of
    documents is infeasible
    – Approximate using a cleverly chosen
      subset of shingles from each (a sketch)
Shingles + Set Intersection
   Estimate |intersection| / |union| based on
    a short sketch ([Brod97, Brod98])
     – Create a “sketch vector” (e.g. of size 200) for
       each document
     – Documents which share more than t (say 80%)
       corresponding vector elements are similar
     – For doc D, sketch[i] is computed as follows:
        • Let f map all shingles in the universe to 0..2^m (e.g. f =
          fingerprinting)
        • Let π_i be a specific random permutation on 0..2^m
        • Pick sketch[i] := MIN π_i(f(s)) over all shingles s in D
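A hedged sketch of this computation, simulating each random permutation π_i with a seeded 64-bit mix (a common substitute for true permutations); shingle fingerprints f(s) are assumed precomputed and the function names are illustrative:

#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

uint64_t mix(uint64_t x, uint64_t seed) {       // stand-in for permutation pi_i
    x ^= seed;
    x *= 0x9E3779B97F4A7C15ULL;                 // 64-bit mixing constant
    x ^= x >> 32;
    return x;
}

// sketch[i] = min over all shingles s in D of pi_i(f(s))
std::vector<uint64_t> sketch(const std::vector<uint64_t>& shingle_fps,
                             const std::vector<uint64_t>& seeds) {   // e.g. 200 seeds
    std::vector<uint64_t> sk(seeds.size(), std::numeric_limits<uint64_t>::max());
    for (uint64_t fp : shingle_fps)
        for (size_t i = 0; i < seeds.size(); ++i)
            sk[i] = std::min(sk[i], mix(fp, seeds[i]));
    return sk;
}

// Fraction of matching entries estimates |intersection| / |union|;
// documents with resemblance above t (say 0.8) are near-duplicates.
double resemblance(const std::vector<uint64_t>& a, const std::vector<uint64_t>& b) {
    size_t equal = 0;
    for (size_t i = 0; i < a.size(); ++i) equal += (a[i] == b[i]);
    return double(equal) / a.size();
}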
Computing Sketch[i] for Doc1
[Diagram: Document 1’s shingles on the 0..2^64 number line]
     Start with 64-bit shingles
     Permute on the number line with π_i
     Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Diagram: Documents 1 and 2 on the 0..2^64 number line; A and B are
 their respective minima under permutation π_i]

               Are these equal?
  Test for 200 random permutations: π_1, π_2, …, π_200
However…
[Diagram: Documents 1 and 2 on the 0..2^64 number line; minima A and B]

A = B iff the shingle with the MIN value in the union of Doc1 and
Doc2 is common to both (i.e. lies in the intersection)

This happens with probability:
  |intersection| / |union|
Mirror Detection
Mirror Detection

   Mirroring is systematic replication of Web
    pages across hosts
    – Single largest cause of duplication on the Web
   URL1 and URL2 are mirrors iff
       for all (or most) paths p, whenever
        URL1/p exists,
        URL2/p exists as well,
       with identical (or near-identical) content,
        and vice versa
Example of mirrors
   http://www.elsevier.com/ and
    http://www.elsevier.nl/
   Structural Classification of Proteins
    – http://scop.mrc-lmb.cam.ac.uk/scop
    – http://scop.berkeley.edu/
    – http://scop.wehi.edu.au/scop
    – http://pdb.weizmann.ac.il/scop
    – http://scop.protres.ru/
Approaches
   Bottom-up [Cho 2000]
    – Group near duplicates into clusters
    – Merge clusters of same cardinality and
      corresponding linkage
   Top-down [Bhar99, Bhar00c]
    – Select features
    – Compute list of pairs
    – Host pair validation by sampling
Mirror detection benefits
   Smart crawling
    – Fetch from the fastest or freshest server
    – Avoid duplication
   Better connectivity analysis
    – Combine inlinks
    – Avoid double counting outlinks
 Avoid redundancy in query results
 Proxy caching
References
   Mercator:
    – http://research.compaq.com/SRC/mercator
   K. M. Risvik and R. Michelsen, Search engines and web
    dynamics, Computer Networks, vol. 39, pp. 289--302, June
    2002, http://citeseer.ist.psu.edu/risvik02search.html
   WebBase:
    – http://www-diglib.stanford.edu/~testbed/doc2/WebBase
   PolyBot:
    – http://cis.poly.edu/polybot
   Grub:
    – www.grub.org
   UbiCrawler:
    – http://ubi.imc.pi.cnr.it/projects/ubicrawler
Lots of practical issues
 URL normalization
 Embedded session ID
 Embedded URLs
 Cookie loops (e.g. inner page
  redirects to main page + cookie)
 Malformed HTML
 Avoid spam
 Access free registration sites
 Handling URL queries
Questions?

				