Crawling the Web:
 problems and techniques


Claudio Scordino
Ph.D. Student
Computer Science Department - University of Pisa
May 2004
                                  Outline

• Introduction

• Crawler architectures
   - Increasing the throughput

• What pages we do not want to fetch

   - Spider traps

   - Duplicates

   - Mirrors
                          Introduction

Job of a crawler (or spider): fetching Web pages and
bringing them to a computer where they will be analyzed



The algorithm is conceptually simple, but…

      …it's a complex and underestimated activity
                      Famous Crawlers
• Mercator (Compaq, Altavista)
   Java

   Modular (components loaded dynamically)

   Priority-based scheduling for URL downloads

     - The algorithm is a pluggable component

   Different processing modules for different contents
   Checkpointing

     - Allows the crawler to recover its state after a failure

     - In a distributed crawler, checkpointing is performed by the Queen
                      Famous Crawlers

• GoogleBot (Stanford, Google)
    C/C++

• WebBase (Stanford)

• HiWE: Hidden Web Exposer (Stanford)

• Heritrix (Internet Archive)

    http://www.crawler.archive.org/
                      Famous Crawlers
• Sphinx
   Java
   Visual and interactive environment
   Relocatable: capable of executing on a remote host
   Site-specific
     -   Customizable crawling
     -   Classifiers: site-specific content analyzers
         1. Links to follow
         2. Parts to process
     - Not scalable
                              Crawler Architecture

[Diagram: the URL frontier (seeded with URLs, organized by host) feeds a scheduler that consults crawl metadata and a load monitor; retrievers perform DNS lookups and HTTP fetches from the Internet; the parser extracts and normalizes HREFs and emits citations; extracted URLs pass through a URL filter and a duplicate URL eliminator before re-entering the frontier.]
           Web masters annoyed

Web Server administrators could be annoyed by:


  1. Server overload
     - Solution: per-server queues


  2. Fetching of private pages
     - Solution: Robot Exclusion Protocol
     - File: /robots.txt
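
A minimal sketch of honouring the Robot Exclusion Protocol with Python's standard urllib.robotparser (the site, paths and user-agent name are hypothetical):

```python
import urllib.robotparser

# Fetch and parse the site's /robots.txt, then test URLs against it.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

for url in ["http://www.example.com/public/page.html",
            "http://www.example.com/private/page.html"]:
    if rp.can_fetch("MyCrawler", url):      # our (hypothetical) user agent
        print("allowed:", url)
    else:
        print("disallowed by robots.txt:", url)
```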
Crawler Architecture

[Diagram: the same architecture, refined with per-server queues and a robots.txt module in front of the retrievers.]
                         Mercator's scheduler

 • FRONT-END: prioritizes URLs with a value between 1 and k

 • BACK-END: ensures politeness (no server overload)
    - Queues containing URLs of only a single host
    - A table specifies when a server may be contacted again
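
A minimal sketch of such a two-level frontier, assuming FIFO front queues and a fixed politeness delay (class and parameter names are illustrative, not Mercator's):

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class TwoLevelFrontier:
    """Front-end: priority queues 1..k. Back-end: one queue per host,
    plus a heap saying when each host may be contacted again."""

    def __init__(self, politeness_delay=2.0):
        self.front = defaultdict(deque)   # priority -> URLs
        self.back = defaultdict(deque)    # host -> URLs of that host only
        self.next_contact = []            # heap of (earliest_contact_time, host)
        self.delay = politeness_delay

    def add(self, url, priority=1):
        self.front[priority].append(url)

    def _refill(self):
        # Move URLs from the prioritized front queues into per-host back queues.
        for priority in sorted(self.front):
            while self.front[priority]:
                url = self.front[priority].popleft()
                host = urlsplit(url).netloc
                if not self.back[host]:
                    heapq.heappush(self.next_contact, (time.time(), host))
                self.back[host].append(url)

    def next_url(self):
        self._refill()
        if not self.next_contact:
            return None
        ready_at, host = heapq.heappop(self.next_contact)
        time.sleep(max(0.0, ready_at - time.time()))          # politeness
        url = self.back[host].popleft()
        if self.back[host]:                                    # reschedule the host
            heapq.heappush(self.next_contact, (time.time() + self.delay, host))
        return url

frontier = TwoLevelFrontier()
frontier.add("http://example.com/a", priority=1)
frontier.add("http://example.com/b", priority=2)
print(frontier.next_url(), frontier.next_url())
```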
     Increasing the throughput

Parallelize the process to fetch many pages at
the same time (~thousands per second).



       Possible levels of parallelization: DNS, HTTP, Parsing
       Domain Name resolution




Problem: DNS requires time to resolve the server
  hostname
    Domain Name resolution
1. Asynchronous DNS resolver:
  • Concurrent handling of multiple outstanding
    requests

  • Not provided by most UNIX implementations of
    gethostbyname

  • GNU ADNS library
     • http://www.chiark.greenend.org.uk/~ian/adns/

  • Mercator reduced the fraction of each thread's elapsed
    time spent in DNS from 87% to 25%
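
For illustration, a sketch of keeping many lookups outstanding at once with asyncio's resolver (not the GNU ADNS library named above; hostnames are examples):

```python
import asyncio

async def resolve_all(hostnames):
    loop = asyncio.get_running_loop()
    lookups = [loop.getaddrinfo(host, 80) for host in hostnames]   # all in flight
    results = await asyncio.gather(*lookups, return_exceptions=True)
    for host, result in zip(hostnames, results):
        if isinstance(result, Exception):
            print(host, "-> resolution failed:", result)
        else:
            print(host, "->", result[0][4][0])                     # first address

asyncio.run(resolve_all(["www.example.com", "www.example.org"]))
```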
    Domain Name resolution
2. Customized DNS component:
  • Caching server with persistent cache largely
    residing in memory

  • Prefetching

      • Hostnames are extracted from HREFs and requests are
        made to the caching server

     • Does not wait for resolution to be completed
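
A sketch of this idea, assuming a plain dictionary as the in-memory cache and a small thread pool as the resolver (names and sizes are illustrative):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

dns_cache = {}                                   # hostname -> IP (or None on failure)
resolver_pool = ThreadPoolExecutor(max_workers=16)

def _resolve(hostname):
    try:
        dns_cache[hostname] = socket.gethostbyname(hostname)
    except socket.gaierror:
        dns_cache[hostname] = None               # remember failures too

def prefetch(hostname):
    """Issue a lookup in the background without waiting for it to complete."""
    if hostname not in dns_cache:
        resolver_pool.submit(_resolve, hostname)

# Hostnames extracted while parsing HREFs (hypothetical examples):
for host in ["www.example.com", "www.example.org"]:
    prefetch(host)
```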
Crawler Architecture

[Diagram: the DNS component expanded into an asynchronous DNS prefetch client, a DNS cache and a DNS resolver client, placed in front of the per-server queues and the robots.txt module.]
                            Page retrieval

Problem: HTTP requires time to fetch a page


  1. Multithreading
     • Blocking system calls (synchronous I/O)
     • pthreads multithreading library
     • Used in Mercator, Sphinx, WebRace
     • Sphinx uses a monitor to determine the optimal
       number of threads at runtime
     • Mutual exclusion overhead
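
A sketch of this approach: a pool of worker threads, each performing blocking fetches (thread count and URLs are illustrative; this is not Mercator's code):

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Synchronous I/O: the calling thread blocks until the page arrives.
    with urllib.request.urlopen(url, timeout=10) as response:
        return url, response.read()

urls = ["http://example.com/", "http://example.org/"]
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, body in pool.map(fetch, urls):
        print(url, len(body), "bytes")
```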
                          Page retrieval
2. Asynchronous sockets
  • not blocking the process/thread
  •   select monitors several sockets at the same time
  • Does not need mutual exclusion since it performs a
    serialized completion of threads (i.e. the code that
    completes processing the page is not interrupted by
    other completions).
  • Used in IXE (1024 connections at once)
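
A sketch of the select-based style in a single thread, using Python's selectors module (hosts are examples and error handling is kept minimal; this is not IXE's implementation):

```python
import selectors
import socket

HOSTS = ["example.com", "example.org"]
sel = selectors.DefaultSelector()
responses = {host: b"" for host in HOSTS}

for host in HOSTS:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    sock.connect_ex((host, 80))                       # non-blocking connect
    request = ("GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host).encode()
    sel.register(sock, selectors.EVENT_WRITE, (host, request))

while sel.get_map():
    events = sel.select(timeout=10)
    if not events:                                    # nothing progressed: give up
        break
    for key, mask in events:
        sock = key.fileobj
        host, request = key.data
        if mask & selectors.EVENT_WRITE:              # connected: send the request
            sock.sendall(request)
            sel.modify(sock, selectors.EVENT_READ, key.data)
        else:                                         # response data (or EOF)
            chunk = sock.recv(4096)
            if chunk:
                responses[host] += chunk
            else:                                     # server closed the connection
                sel.unregister(sock)
                sock.close()

for host, body in responses.items():
    print(host, len(body), "bytes")
```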
                       Page retrieval
3. Persistent connection
  • Multiple documents requested on a single
    connection
  • Feature of HTTP 1.1
  • Reduces the number of HTTP connection setups
  • Used in IXE
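
A sketch of reusing one HTTP/1.1 connection for several documents (host and paths are hypothetical; a server may still close the connection between requests):

```python
import http.client

conn = http.client.HTTPConnection("example.com")   # HTTP/1.1 by default
for path in ["/", "/a.html", "/b.html"]:
    conn.request("GET", path)
    response = conn.getresponse()
    body = response.read()          # drain the body before reusing the connection
    print(path, response.status, len(body), "bytes")
conn.close()
```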
IXE Crawler
                              IXE Parser
• Problem: parsing requires 30% of execution
  time


• Possible solution: distributed parsing
                                       IXE Parser

[Diagram: the URL Table Manager ("Crawler") holds a Table<UrlInfo> mapping DocIDs to URLs; the Parser sends the URLs it extracts to the manager, receives the corresponding DocIDs back, keeps them in a local cache, and emits citations.]
                            A distributed parser

[Diagram: the Table<UrlInfo> is partitioned across several table managers (Table 1 Manager, Table 2 Manager, ...); each of the N parsers hashes every extracted URL to locate the responsible manager; a local cache HIT yields the DocID immediately, while a MISS queries the manager, which returns the existing DocID or assigns a new one; a scheduler decides which parser handles each URL, and the resulting citations are collected.]
                  A distributed parser
• Does this solution scale?
   - High traffic on the main link

• Suppose that:
   - Average page size = 10KB
   - Average out-links per page = 10
   - URL size = 40 characters (40 bytes)
   - DocID size = 5 bytes
• X = throughput (pages per second)
• N = number of parsers
                                A distributed parser

  • Bandwidth for web pages:
       - X * 10KB * 8 bits/byte = X * 10 * 1024 * 8 = 81920*X bps

  • Bandwidth for messages (hit):
       - (X/N pages per parser) * (10 out-links per page) * ((40+5) bytes per DocID request and reply) * (8 bits/byte) * (N parsers) = 3600*X bps

  • Using 100 Mbps: X = 1226 pages per second
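
The arithmetic can be checked directly; the sketch below assumes the slide treats 100 Mbps as 100 * 2^20 bits per second:

```python
page_bits_per_page = 10 * 1024 * 8          # 81920 bits of page data per fetch
message_bits_per_page = 10 * (40 + 5) * 8   # 3600 bits of DocID requests/replies
link_capacity = 100 * 2**20                 # "100 Mbps"

X = link_capacity // (page_bits_per_page + message_bits_per_page)
print(X, "pages per second")                # 1226
```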
  What we don’t want to fetch

1. Spider traps

2. Duplicates

   2.1 Different URLs for the same page

   2.2 Already visited URLs

   2.3 Same document on different sites

   2.4 Mirrors
      • At least 10% of the hosts are mirrored
                              Spider traps
• Spider trap: hyperlink graph constructed
  unintentionally or malevolently to keep a
  crawler trapped


  1. Infinitely “deep” Web sites
     • Problem: using CGI it is possible to generate an
       infinite number of pages
     • Solution: check the URL length
                                     Spider traps
2. Large number of dummy pages
  • Example:
    http://www.troutbums.com/Flyfactory/flyfactory/flyfactory/hatchline/hatchline/flyfactory/hatchline/flyfactory/hatchline/flyfactory/flyfactory/flyfactory/hatchline/flyfactory/hatchline/

  • Solution: disable crawling

     • a guard removes from consideration any
       URL from a site which dominates the
       collection
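
A sketch combining the two guards above, a URL-length check and a per-host page budget (both thresholds are illustrative):

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256          # guards against "infinitely deep" sites
MAX_PAGES_PER_HOST = 10_000   # guards against sites that dominate the collection
pages_per_host = Counter()

def should_fetch(url):
    if len(url) > MAX_URL_LENGTH:
        return False
    host = urlsplit(url).netloc
    if pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False
    pages_per_host[host] += 1
    return True
```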
                      Avoid duplicates
• Problem almost nonexistent in classic IR



• Duplicate content
  • wastes resources (index space)

  • annoys users
                         Virtual Hosting
• Problem: Virtual Hosting
  • Allows mapping different sites to a single IP address
  • Could be used to create duplicates
  • Feature of HTTP 1.1

  Example: http://www.cocacola.com and http://www.coke.com
  both map to the IP address 129.33.45.163

• Rely on canonical hostnames (CNAMEs)
  provided by DNS
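
For illustration, a sketch of asking the resolver for a host's primary name (whether a CNAME is actually exposed depends on the DNS configuration; the hostname is an example):

```python
import socket

# gethostbyname_ex returns (primary_hostname, alias_list, address_list).
canonical, aliases, addresses = socket.gethostbyname_ex("www.example.com")
print("canonical:", canonical)
print("aliases:", aliases)
print("addresses:", addresses)
```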
             Already visited URLs
• Problem: how to recognize an already visited
  URL ?


  • The page is reachable by many paths


  • We need an efficient Duplicate URL Eliminator
              Already visited URLs
1. Bloom Filter
   • Probabilistic data structure for set membership testing
   • Each URL is mapped by n hash functions to n positions of a bit
     vector; inserting a URL sets those bits, and a URL is reported
     as already seen if all of its bits are set

   • Problem: false positives
      • new URLs marked as already seen
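
A minimal Bloom filter sketch for already-seen URLs (the bit-vector size and number of hash functions are illustrative, and the hash functions are derived from MD5 for simplicity):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=8 * 1024 * 1024, n_hashes=4):
        self.m = m_bits
        self.k = n_hashes
        self.bits = bytearray(m_bits // 8)          # the bit vector

    def _positions(self, url):
        # Derive k positions from k salted MD5 digests of the URL.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/a.html")
print("http://example.com/a.html" in seen)   # True
print("http://example.com/b.html" in seen)   # False (with high probability)
```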
              Already visited URLs
2. URL hashing
  • MD5

       URL  →  MD5  →  128-bit digest

  • Using a 64-bit hash function, a billion URLs require 8GB
     - Does not fit in memory
     - Using the disk limits the crawling rate to 75
       downloads per second
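
A sketch of the digest computation, truncating the 128-bit MD5 digest to a 64-bit fingerprint as in the figure above (the URL is an example):

```python
import hashlib

url = "http://www.example.com/index.html"
digest = hashlib.md5(url.encode()).digest()        # 128-bit digest
fingerprint = int.from_bytes(digest[:8], "big")    # keep only 64 bits
print(len(digest) * 8, "bit digest; 64-bit fingerprint:", fingerprint)
```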
               Already visited URLs
3. Two-level hash function
   • The crawler is likely to explore URLs within the
     same site
   • Relative URLs create a spatiotemporal locality of
     access
   • Exploit this kind of locality using a cache

        Key: 24 bits for Hostname+Port, 40 bits for Path
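
A sketch of such a composite key, assuming MD5-derived hashes; URLs from the same host share the high 24 bits, which is what gives the cache its locality:

```python
import hashlib
from urllib.parse import urlsplit

def two_level_hash(url):
    parts = urlsplit(url)
    host_hash = hashlib.md5(parts.netloc.encode()).digest()
    path_hash = hashlib.md5(parts.path.encode()).digest()
    host_bits = int.from_bytes(host_hash, "big") >> (128 - 24)   # 24-bit host key
    path_bits = int.from_bytes(path_hash, "big") >> (128 - 40)   # 40-bit path key
    return (host_bits << 40) | path_bits                         # 64-bit composite key

print(hex(two_level_hash("http://www.example.com:80/a/b.html")))
```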
       Content based techniques
• Problem: how to recognize duplicates based
  on the page contents?




1. Edit distance
   • Number of edit operations (insertions, deletions, replacements)
     required to transform one document into the other
   • Cost: l1*l2, where l1 and l2 are the lengths of the
     documents: impractical!
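
For reference, a standard dynamic-programming sketch; the l1*l2 table it fills row by row is exactly why the cost is quadratic:

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # replacement
        prev = curr
    return prev[-1]

print(edit_distance("crawler", "crawling"))   # 3
```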
      Content based techniques

2. Hashing
  • A digest associated with each crawled page
  • Used in Mercator
  • Cost: one seek in the index for each new crawled
    page

  Problem: pages could have minor syntactic
  differences!
     • site maintainer's name, latest update
     • anchors modified
     • different formatting
        Content based techniques
3. Shingling
• Shingle (or q-gram): contiguous subsequence of
  tokens taken from document d
   • representable by a fixed length integer
   • w-shingle: shingle of width w


• S(d,w): w-shingling of document d
   •   unordered set of distinct w-shingles contained in
       document d
    Content based techniques
   Sentence:    a rose is a rose is a rose

     Tokens:    a, rose, is, a, rose, is, a, rose

 4-shingles:    (a,rose,is,a) (rose,is,a,rose) (is,a,rose,is) (a,rose,is,a) (rose,is,a,rose)

     S(d,4):    (a,rose,is,a) (rose,is,a,rose) (is,a,rose,is)
    Content based techniques

• Each token = 32 bits

• w = 10 (a suitable value)  ⇒  each w-shingle = 320 bits

• S(d,10) = a set of 320-bit numbers

• We can hash the w-shingles and keep 500
  bytes of digests for each document
      Content based techniques
• Resemblance of documents d1 and d2:

      r(d1,d2) = |S(d1,w) ∩ S(d2,w)| / |S(d1,w) ∪ S(d2,w)|      (Jaccard coefficient)

• Eliminate pages that are too similar (pages whose
  resemblance value is close to 1)
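
A sketch of w-shingling and resemblance on the slide's example (whitespace tokenization is an assumption):

```python
def shingles(text, w):
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(d1, d2, w):
    s1, s2 = shingles(d1, w), shingles(d2, w)
    return len(s1 & s2) / len(s1 | s2)          # Jaccard coefficient

d = "a rose is a rose is a rose"
print(shingles(d, 4))                           # the 3 distinct 4-shingles of S(d,4)
print(resemblance(d, "a rose is a rose", 4))    # 2/3: two shared shingles out of three
```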
                                   Mirrors
 URL anatomy:  http://www.research.digital.com/SRC/
     - access method: http
     - hostname: www.research.digital.com
     - path: /SRC/

 • Precision = relevant retrieved docs / retrieved docs
                                       Mirrors
1. URL String based
  • Vector Space model: term vector matching to
    compute the likelihood that a pair of hosts are mirrors
  • terms with df(t) < 100
                                         Mirrors
a) Hostname matching       27%
   • Terms: substrings of the hostname
   • Term weighting:




        len(t) = number of segments obtained by breaking
           the term at '.' characters


  • This weighting favours substrings composed of many
    segments, which are very specific
                                        Mirrors
b) Full path matching      59%
   • Terms: entire paths
   • Term weighting:

                              mdf = max df(t) over all terms t in the collection

  Connectivity based filtering stage (+19%):
      • Idea: mirrors share many common paths
      • Test, for each common path, whether it has the
        same set of out-links on both hosts
      • Remove hostnames from local URLs
                                       Mirrors
c) Positional word bigram matching     72%
   • Terms creation:
      • Break the path into a list of words by treating
        '/' and '.' as breaks
      • Eliminate non-alphanumeric characters
      • Replace digits with '*' (effect similar to
        stemming)
      • Combine successive pairs of words in the list
      • Append the ordinal position of the first word
                                Mirrors
   conferences/d299/advanceprogram.html

   Words: conferences, d*, advanceprogram, html

   Positional Word Bigrams:
      conferences_d*_0
      d*_advanceprogram_1
      advanceprogram_html_2
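
A sketch reproducing these steps (the exact tokenization details are assumptions):

```python
import re

def positional_word_bigrams(path):
    words = [w for w in re.split(r"[/.]", path) if w]         # break at '/' and '.'
    words = [re.sub(r"[^a-zA-Z0-9]", "", w) for w in words]   # drop non-alphanumerics
    words = [re.sub(r"\d+", "*", w) for w in words]           # digits -> '*'
    return [f"{words[i]}_{words[i + 1]}_{i}"                  # word pair + position
            for i in range(len(words) - 1)]

print(positional_word_bigrams("conferences/d299/advanceprogram.html"))
# ['conferences_d*_0', 'd*_advanceprogram_1', 'advanceprogram_html_2']
```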
                                          Mirrors
2. Host connectivity based 45%
  • Consider all documents on a host as a single large
    document
  • Graph:
     • host → node
      • document on host A pointing to a document on
        host B → directed edge from A to B
      • Idea: two hosts are likely to be mirrors if their
        nodes point to the same nodes
   • Term vector matching
      - Terms: the set of nodes that a host's node points to
                                          References
S. Chakrabarti, Mining the Web: Analysis of Hypertext and Semi Structured
Data, Morgan Kaufmann, 2002. Pages 17-43, 71-72.


S. Brin and L. Page, The anatomy of a large-scale hypertextual Web
search engine, Proceedings of the 7th World Wide Web Conference
(WWW7), 1998.


A. Heydon and M. Najork, Mercator: A scalable, extensible Web crawler,
World Wide Web Conference, 1999.


K. Bharat, A. Broder, J. Dean, and M. R. Henzinger, A comparison of
Techniques to Find Mirrored Hosts on the WWW, Journal of the
American Society for Information Science, 2000.
                                            References
A. Heydon and M. Najork, High performance Web Crawling, SRC Research
Report 173, Compaq Systems Research Center, 26 September 2001.


R. C. Miller and K. Bharat, SPHINX: a framework for creating personal,
site-specific web crawlers, Proceedings of the 7th World Wide Web
Conference, 1998.


D. Zeinalipour-Yazti and M. Dikaiakos. Design and Implementation of a
Distributed Crawler and Filtering Processor, Proceedings of the 5th
Workshop on Next Generation Information Technologies and Systems
(NGITS 2002), June 2002.

				