Evaluation of Crawling Policies for a Web-Repository Crawler

Frank McCown & Michael L. Nelson
Old Dominion University
Norfolk, Virginia, USA

HT'06, Odense, Denmark
August 23, 2006
Alternate Models of Preservation

• Lazy Preservation
   – Let Google, IA et al. preserve your website
• Just-In-Time Preservation
   – Find a “good enough” replacement web page
• Shared Infrastructure Preservation
   – Push your content to sites that might preserve it
• Web Server Enhanced Preservation
   – Use Apache modules to create archival-ready resources



[image: interlaced knots, from http://www.proex.ufes.br/arsm/knots_interlaced.htm]
                   Outline
• Web page threats
• Web Infrastructure
• Warrick
  – architectural description
  – crawling policies
  – future work




[Slide: web page threats, illustrated with images of a black hat (http://img.webpronews.com/securitypronews/110705blackhat.jpg), a computer virus (http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg), and a hard drive head crash (http://www.datarecoveryspecialist.com/images/head-crash-2.jpg)]
Crawling the Web and web repositories

[Diagram: in web crawling, a crawler pulls resources from the World Wide Web into repositories (Repo1, Repo2, ..., Repon); in web-repository crawling, a crawler pulls resources back out of those repositories into a single repo.]
How much of the Web is indexed?

[Figure: search-engine index coverage, from "The Indexable Web is More than 11.5 billion pages" by Gulli and Signorini (WWW'05)]

• GYM (Google, Yahoo, MSN) intersection is less than 43%
Traditional Web Crawler

[Diagram: seed URLs initialize the frontier; the crawler takes a URL from the frontier, downloads the resource from the Web, stores it in the repo, marks the URL as visited, extracts URLs from the resource, and adds unvisited ones back to the frontier.]
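The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration of the traditional model, not the authors' code; the naive link extraction and page limit are simplifications.

    from collections import deque
    from urllib.parse import urljoin
    import re
    import urllib.request

    def crawl(seed_urls, max_pages=100):
        """Minimal traditional web crawler: frontier -> download -> store -> extract URLs."""
        frontier = deque(seed_urls)   # URLs waiting to be downloaded
        visited = set()               # URLs already attempted
        repo = {}                     # url -> downloaded content

        while frontier and len(repo) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                                   # unreachable or non-text resource
            repo[url] = page
            # Naive link extraction; a real crawler parses the HTML.
            for href in re.findall(r'href="([^"]+)"', page):
                link = urljoin(url, href)                  # resolve relative links
                if link not in visited:
                    frontier.append(link)
        return repo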
Web-Repository Crawler

[Diagram: the same loop, except resources are downloaded from web repositories (search engine caches, the Internet Archive) rather than from the live Web.]
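The only structural change from the traditional crawler above is the download step: instead of fetching a URL from the live Web, the crawler asks each web repository for its copy. A minimal sketch, with the repositories modeled as plain functions since the real interfaces (search engine caches, the Internet Archive) are not shown here:

    from typing import Callable, List, Optional

    # Each repository is modeled as a function mapping a URL to its cached copy,
    # or None if the repository has nothing for that URL. In a real crawler these
    # would wrap the Google/Yahoo/MSN cache and Internet Archive lookups.
    Repository = Callable[[str], Optional[str]]

    def download_from_repos(url: str, repos: List[Repository]) -> Optional[str]:
        """Web-repository version of the download step: try each repo in turn and
        return the first copy found (a fuller version would prefer canonical copies
        over thumbnails and HTML conversions, per the table later in the talk)."""
        for query in repos:
            resource = query(url)
            if resource is not None:
                return resource
        return None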
Cached Image

[screenshot omitted]
Cached PDF

[Screenshots: the canonical PDF (http://www.fda.gov/cder/about/whatwedo/testtube.pdf) alongside the MSN, Yahoo, and Google cached versions, which are HTML conversions]
Web Repository Characteristics

Type                              MIME type                       File ext   Google   Yahoo   MSN   IA
HTML text                         text/html                       html       C        C       C     C
Plain text                        text/plain                      txt, ans   M        M       M     C
Graphic Interchange Format        image/gif                       gif        M        M       ~R    C
Joint Photographic Experts Group  image/jpeg                      jpg        M        M       ~R    C
Portable Network Graphic          image/png                       png        M        M       ~R    C
Adobe Portable Document Format    application/pdf                 pdf        M        M       M     C
JavaScript                        application/javascript          js         M                M     C
Microsoft Excel                   application/vnd.ms-excel        xls        M        ~S      M     C
Microsoft PowerPoint              application/vnd.ms-powerpoint   ppt        M        M       M     C
Microsoft Word                    application/msword              doc        M        M       M     C
PostScript                        application/postscript          ps         M        ~S            C

C   Canonical version is stored
M   Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~R  Indexed but not retrievable
~S  Indexed but not stored
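If a crawler wants to use this table to decide which repository to ask first for a given resource type, the table maps naturally onto a small lookup structure. A sketch under that assumption; the repository keys and the preference heuristic are illustrative, not necessarily what Warrick does, and only a few rows are transcribed:

    # Storage codes from the table: 'C' canonical, 'M' modified (thumbnail or
    # HTML conversion), '~R' indexed but not retrievable, '~S' indexed but not stored.
    STORAGE = {
        "text/html":                {"google": "C", "yahoo": "C",  "msn": "C",  "ia": "C"},
        "image/jpeg":               {"google": "M", "yahoo": "M",  "msn": "~R", "ia": "C"},
        "application/pdf":          {"google": "M", "yahoo": "M",  "msn": "M",  "ia": "C"},
        "application/vnd.ms-excel": {"google": "M", "yahoo": "~S", "msn": "M",  "ia": "C"},
        # ...remaining rows of the table
    }

    def repo_preference(mime_type):
        """Order repositories so canonical copies are tried before modified ones,
        and unretrievable/unstored entries come last (illustrative heuristic only)."""
        rank = {"C": 0, "M": 1, "~R": 2, "~S": 2}
        codes = STORAGE.get(mime_type, {})
        return sorted(["ia", "google", "yahoo", "msn"],
                      key=lambda repo: rank.get(codes.get(repo), 3))

    print(repo_preference("application/pdf"))   # ['ia', 'google', 'yahoo', 'msn']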
Limitations

Web crawling:
• Limit hit rate per host
• Websites periodically unavailable
• Portions of website off-limits (robots.txt, passwords)
• Deep web
• Spam
• Duplicate content
• Flash and JavaScript interfaces
• Crawler traps

Web-repository crawling:
• Limit hit rate per repo
• Limited hits per day (API query quotas)
• Repos periodically unavailable
• Flash and JavaScript interfaces
• Can only recover what repos have stored
• Lossy format conversions (thumbnail images, HTML-ized PDFs, etc.)
                   Warrick
• First developed in fall of 2005
• Available for download at
  http://www.cs.odu.edu/~fmccown/warrick/
• www2006.org – first lost website reconstructed
  (Nov 2005)
• DCkickball.org – first website someone else
  reconstructed without our help (late Jan 2006)
• www.iclnet.org – first website we reconstructed
  for someone else (mid Mar 2006)
• Internet Archive officially endorses Warrick (mid
  Mar 2006)

How Much Did We Reconstruct?

["Lost" website: resources A, B, C, D, E, F arranged in a tree rooted at A.
 Reconstructed website: A and E come back identical, B and C come back changed (B', C'),
 D and F are missing (the link that pointed to D now points to an old resource G,
 and F cannot be found), and G is an added resource.]

Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
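The experiments that follow summarize each reconstruction with a "reconstruction vector" over these categories. A rough sketch of how such a vector could be computed, assuming (this is an assumption; see the tech report cited later in the deck for the exact definition) that changed and missing are measured against the original site and added against the reconstructed site:

    def reconstruction_vector(identical, changed, missing, added):
        """(changed/original, missing/original, added/reconstructed) -- assumed definition."""
        original = identical + changed + missing        # resources on the lost site
        reconstructed = identical + changed + added     # resources the crawler brings back
        return (changed / original,
                missing / original,
                added / reconstructed)

    # Counts from the example above: identical {A, E}, changed {B, C}, missing {D, F}, added {G}
    print(reconstruction_vector(2, 2, 2, 1))   # (0.333..., 0.333..., 0.2)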
Reconstruction Diagram

[Chart: identical 50%, changed 33%, missing 17%, added 20%]
  Initial Experiment - April 2005
• Crawl and reconstruct 24 sites in 3 categories:
  1. small (1-150 resources)
  2. medium (150-499 resources)
  3. large (500+ resources)
• Calculate reconstruction vector for each site
• Results: mostly successful at recovering HTML
• Observations: many wasted queries; disconnected
  portions of websites are unrecoverable
• See:
   – McCown et al. Reconstructing websites for the lazy
     webmaster. Tech Report, 2005. http://arxiv.org/abs/cs.IR/0512069
   – Smith et al. Observed web robot behavior on decaying web
     subsites. D-Lib Magazine, 12(2), Feb 2006.


Missing Disconnected Resources

[Diagram: resources with no links from the recovered portion of a site cannot be discovered by following links alone]
                   Lister Queries
• Problem with initial version of Warrick: wasted
  queries
   –   Internet Archive: Do you have X? No
   –   Google: Do you have X? No
   –   Yahoo: Do you have X? Yes
   –   MSN: Do you have X? No
• What if we first ask each web repository “What
  resources do you have?” We call these “lister
  queries”.
• How many repository requests will this save?
• How many more resources will this discover?
• What other problems will this help solve?
          Lister Queries cont.
• Search engines
  – site:www.foo.org
  – Limited to first 1000 results or less
• Internet Archive
  – http://web.archive.org/web/*/http://www.foo.org/*
  – Not all URLs reported are actually accessible
  – Results are given in groups of 100 or fewer
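A minimal sketch of how these lister queries would be formed. Only the query strings shown on this slide are constructed; how they are actually submitted (API calls vs. page scraping) and how the paginated results are walked is omitted:

    def lister_queries(hostname):
        """Build the 'what resources do you have for this site?' queries described above."""
        site_query = f"site:{hostname}"                                   # Google, Yahoo, MSN
        ia_listing = f"http://web.archive.org/web/*/http://{hostname}/*"  # Internet Archive
        return site_query, ia_listing

    print(lister_queries("www.foo.org"))
    # ('site:www.foo.org', 'http://web.archive.org/web/*/http://www.foo.org/*')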
          URL Canonicalization
• How do we know if URL X is pointing to the
  same resource as URL Y?
• Web crawlers use several strategies that we
  may borrow:
  –   Convert to lowercase
  –   Remove www prefix
  –   Remove session IDs
  –   etc.
• All web repositories have different
  canonicalization policies which lister queries can
  uncover
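A minimal sketch of the canonicalization strategies listed above. The session-ID parameter names are an illustrative assumption, and only the scheme and host are lowercased here (see the case-sensitivity note on the next slide for why paths are left alone):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # Query parameters commonly used for session IDs (illustrative, not exhaustive).
    SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}

    def canonicalize(url):
        """Normalize a URL so two spellings of the same resource compare equal."""
        scheme, netloc, path, query, _fragment = urlsplit(url.strip())
        scheme = scheme.lower()
        netloc = netloc.lower()
        if netloc.startswith("www."):            # remove the www prefix
            netloc = netloc[4:]
        # Drop session-ID parameters, keep everything else in order.
        kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
                if k.lower() not in SESSION_PARAMS]
        if path == "":
            path = "/"
        return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

    print(canonicalize("HTTP://WWW.Foo.org/Bar.html?PHPSESSID=abc123&x=1"))
    # http://foo.org/Bar.html?x=1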
Missing ‘www’ Prefix

[screenshot omitted]
          Case Sensitivity
• Some web servers run on case-insensitive
  file systems (e.g., IIS on Windows)
• http://foo.org/bar.html is equivalent to
  http://foo.org/BAR.html
• MSN always ignores case, Google and
  Yahoo do not



Crawling Policies

1. Naïve Policy - Do not issue lister queries; recover
   only resources whose URLs are found in recovered
   pages.
2. Knowledgeable Policy - Issue lister queries, but
   still recover only resources whose URLs are found
   in recovered pages.
3. Exhaustive Policy - Issue lister queries and recover
   all resources found in all repositories (a repository
   dump).
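A sketch of how the three policies change the crawler's behavior. The policy names are from the slide; everything else (the return structure, the lister_query callable) is assumed for illustration:

    from enum import Enum

    class Policy(Enum):
        NAIVE = 1          # no lister queries; follow links found in recovered pages
        KNOWLEDGEABLE = 2  # lister queries, but still recover only linked resources
        EXHAUSTIVE = 3     # lister queries, recover everything the repositories list

    def initial_frontier(seed_urls, policy, lister_query):
        """Decide what the crawler starts with and what it knows is stored.

        lister_query() is assumed to return every URL the repositories report
        for the site (the 'Stored URLs' box in the next diagram).
        """
        if policy is Policy.NAIVE:
            return list(seed_urls), None                    # discover URLs only by link extraction
        stored = set(lister_query())                        # ask each repo what it has
        if policy is Policy.KNOWLEDGEABLE:
            return list(seed_urls), stored                  # stored URLs used to avoid wasted lookups
        return list(seed_urls) + sorted(stored), stored     # EXHAUSTIVE: recover everything listed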
Web-Repository Crawler using Lister Queries

[Diagram: the same crawler loop, with a "Stored URLs" component populated by lister queries against the web repositories; under the exhaustive policy the stored URLs are fed into the frontier along with the seed URLs.]
             Experiment
• Download all 24 websites (from first
  experiment)
• Perform 3 reconstructions for each site
  using the 3 crawling policies
• Compute reconstruction vectors for each
  reconstruction



Reconstruction Statistics

[Table of reconstruction statistics per crawling policy omitted]

Efficiency Ratio

[Charts: efficiency ratio per policy, computed once over all resources and once excluding ‘added’ resources]

Efficiency ratio = total recovered resources / total repository requests
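The efficiency ratio is simply recovery yield per repository request. A tiny worked example; the numbers below are made up, and the actual per-policy values are in the charts omitted above:

    def efficiency_ratio(recovered, requests):
        """Efficiency ratio = total recovered resources / total repository requests."""
        return recovered / requests

    # Hypothetical illustration: recovering 90 resources with 300 requests
    # is less efficient than recovering 80 resources with 150 requests.
    print(efficiency_ratio(90, 300))   # 0.3
    print(efficiency_ratio(80, 150))   # 0.533...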
        Summary of Findings
• Naïve policy
  – Recovers nearly as many non-added resources as
    the knowledgeable and exhaustive policies
  – Issues highest number of repository requests
• Knowledgeable policy
  – Issues fewest number of requests per reconstruction
  – Has the highest efficiency ratio when only non-added
    resources are counted
• Exhaustive policy
  – Recovers significantly more added resources than the
    other two policies
  – Highest efficiency ratio

Website “Hijacking”

[Screenshot omitted]

• Soft 404s
• “Cache poisoning”
• Warrick should detect soft 404s
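Soft 404s are pages that return HTTP 200 for URLs that no longer exist. One common detection heuristic (not necessarily what Warrick implements) is to request a URL that almost certainly does not exist on the same site and compare the response to the page being tested:

    import random
    import string
    import urllib.error
    import urllib.request
    from urllib.parse import urljoin

    def looks_like_soft_404(page_url, page_body):
        """Heuristic soft-404 check: fetch a random, surely-nonexistent URL on the
        same site; if the server answers 200 with a very similar body, the site is
        probably serving the same placeholder page for every request."""
        junk = "".join(random.choices(string.ascii_lowercase, k=20)) + ".html"
        probe_url = urljoin(page_url, "/" + junk)
        try:
            resp = urllib.request.urlopen(probe_url, timeout=10)
        except urllib.error.HTTPError:
            return False                     # a real 404 (or other error status) came back
        probe_body = resp.read().decode("utf-8", "replace")
        # Crude similarity test; a real implementation would compare content shingles.
        return abs(len(probe_body) - len(page_body)) < 0.1 * max(len(page_body), 1)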
Other Web Obstacles

• Effective “do not preserve” tags:
  – Flash
  – AJAX
  – HTTP POST
  – session IDs, cookies, etc.
  – cloaking
  – URLs that change based on traversal patterns
     • Lutkenhouse, Nelson, Bollen, Distributed, Real-Time
       Computation of Community Preferences, Proceedings of
       ACM Hypertext '05
       – http://doi.acm.org/10.1145/1083356.1083374
                 Conclusion
• Web sites can be reconstructed by accessing
  the caches of the Web Infrastructure
  – Some URL canonicalization issues can be tackled
    using lister queries
  – Multiple policies are available, depending on
    reconstruction requirements
• Much work to be done
  – capturing server-side information
  – moving from a descriptive model to a prescriptive &
    predictive model


				