

Evaluation of Crawling Policies for a Web-Repository Crawler

   Frank McCown & Michael L. Nelson
            Old Dominion University
             Norfolk, Virginia, USA

             Odense, Denmark
              August 23, 2006
Alternate Models of Preservation

• Lazy Preservation
   – Let Google, IA et al. preserve your website
• Just-In-Time Preservation
   – Find a “good enough” replacement web page
• Shared Infrastructure Preservation
   – Push your content to sites that might preserve it
• Web Server Enhanced Preservation
   – Use Apache modules to create archival-ready resources

Outline
• Web page threats
• Web Infrastructure
• Warrick
  – architectural description
  – crawling policies
  – future work

Web Page Threats

[Images: black hat, virus, failed hard drive]
Crawling the Web and Web Repositories

[Diagram: web crawling pulls resources from the World Wide Web into repositories; web-repository crawling pulls resources back out of those repositories.]
       How much of the Web is indexed?

• GYM (Google, Yahoo, MSN) intersection is less than 43%

Figure from "The Indexable Web is More than 11.5 billion pages" by Gulli and Signorini (WWW'05)
Traditional Web Crawler

[Diagram: seed URLs initialize the frontier; the crawler downloads each frontier URL from the Web, stores it in a repository, extracts its links, and adds unvisited URLs to the frontier.]
Web-Repository Crawler

[Diagram: the same architecture, except resources are requested from web repositories (search-engine caches, Internet Archive) rather than from the live Web.]
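The two loops differ only in where resources come from. A minimal Python sketch of the web-repository version, assuming a hypothetical repo.fetch(url) interface that returns a cached copy or None:

```python
import re
from urllib.parse import urljoin

def extract_links(html, base):
    """Very rough href extraction; a real crawler would use an HTML parser."""
    return [urljoin(base, href) for href in re.findall(r'href="([^"]+)"', html)]

def reconstruct(seed_urls, repos):
    """Web-repository crawl loop: identical in shape to a traditional
    crawler, but resources come from repository caches, not the live Web."""
    frontier = list(seed_urls)
    visited = set()
    recovered = {}                       # url -> recovered resource

    while frontier:
        url = frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        for repo in repos:               # try each repository in turn
            body = repo.fetch(url)       # hypothetical repo interface
            if body is not None:
                recovered[url] = body
                for link in extract_links(body, base=url):
                    if link not in visited:
                        frontier.append(link)
                break                    # first hit wins (sketch only)
    return recovered
```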
Cached Image

[Screenshot of a cached image]
Cached PDF

[Screenshots: MSN version, Yahoo version, Google version]
Web Repository Characteristics

Type                              MIME type                       File ext   Google   Yahoo   MSN   IA
HTML text                         text/html                       html       C        C       C     C
Plain text                        text/plain                      txt, ans   M        M       M     C
Graphic Interchange Format        image/gif                       gif        M        M       ~R    C
Joint Photographic Experts Group  image/jpeg                      jpg        M        M       ~R    C
Portable Network Graphic          image/png                       png        M        M       ~R    C
Adobe Portable Document Format    application/pdf                 pdf        M        M       M     C
JavaScript                        application/javascript          js         M                M     C
Microsoft Excel                   application/vnd.ms-excel        xls        M        ~S      M     C
Microsoft PowerPoint              application/vnd.ms-powerpoint   ppt        M        M       M     C
Microsoft Word                    application/msword              doc        M        M       M     C
PostScript                        application/postscript          ps         M        ~S            C

C    Canonical version is stored
M    Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~R   Indexed but not retrievable
~S   Indexed but not stored
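Read as a routing policy, the table says which repository to try first for each file type: canonical copies beat thumbnails and HTML conversions. A sketch that simply transcribes the rows above (best_repos is our own illustrative helper, not part of Warrick):

```python
# Storage codes from the table: C = canonical, M = modified,
# ~R = indexed but not retrievable, ~S = indexed but not stored.
STORAGE = {
    # ext:  (Google, Yahoo,  MSN,   IA)
    "html": ("C",   "C",    "C",   "C"),
    "txt":  ("M",   "M",    "M",   "C"),
    "gif":  ("M",   "M",    "~R",  "C"),
    "jpg":  ("M",   "M",    "~R",  "C"),
    "png":  ("M",   "M",    "~R",  "C"),
    "pdf":  ("M",   "M",    "M",   "C"),
    "js":   ("M",   None,   "M",   "C"),
    "xls":  ("M",   "~S",   "M",   "C"),
    "ppt":  ("M",   "M",    "M",   "C"),
    "doc":  ("M",   "M",    "M",   "C"),
    "ps":   ("M",   "~S",   None,  "C"),
}
REPOS = ("Google", "Yahoo", "MSN", "IA")

def best_repos(ext):
    """Repositories ordered by fidelity for this type: canonical copies
    first, then modified (thumbnail / HTML-converted) copies; repos that
    cannot return the resource (~R, ~S, blank) are skipped."""
    codes = STORAGE.get(ext, (None,) * 4)
    rank = {"C": 0, "M": 1}
    usable = [(rank[c], r) for c, r in zip(codes, REPOS) if c in rank]
    return [r for _, r in sorted(usable)]

print(best_repos("gif"))   # ['IA', 'Google', 'Yahoo']  (MSN is only ~R)
```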
Web crawling vs. web-repo crawling

Web crawling:
• Limit hit rate per host
• Websites periodically unavailable
• Portions of website off-limits (robots.txt, passwords)
• Deep web
• Spam
• Duplicate content
• Flash and JavaScript interfaces
• Crawler traps

Web-repo crawling:
• Limit hit rate per repo
• Limited hits per day (API query quotas)
• Repos periodically unavailable
• Flash and JavaScript interfaces
• Can only recover what repos have stored
• Lossy format conversions (thumbnail images, HTMLized PDFs, etc.)
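The first two repository-side constraints (hit-rate limits and daily API quotas) are easy to enforce mechanically. A sketch with illustrative numbers, not any engine's actual terms:

```python
import time

class RepoThrottle:
    """Per-repository politeness: wait between hits and honor a daily
    API query quota. The defaults here are made up for illustration."""
    def __init__(self, min_delay=1.0, daily_quota=1000):
        self.min_delay = min_delay
        self.daily_quota = daily_quota
        self.used_today = 0
        self.last_hit = 0.0

    def acquire(self):
        """Block until a request is allowed; False if today's quota is spent."""
        if self.used_today >= self.daily_quota:
            return False
        wait = self.min_delay - (time.time() - self.last_hit)
        if wait > 0:
            time.sleep(wait)
        self.last_hit = time.time()
        self.used_today += 1
        return True
```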
Warrick

• First developed in the fall of 2005
• Available for download at
• – first lost website reconstructed (Nov 2005)
• – first website someone else reconstructed without our help (late Jan 2006)
• – first website we reconstructed for someone else (mid Mar 2006)
• Internet Archive officially endorses Warrick (mid Mar 2006)
How Much Did We Reconstruct?

[Diagram: a "lost" website (resources A-F) beside its reconstruction (A, B', C', E, F, G); the reconstruction is missing the link to D, points to old resource G, and F can't be found.]

Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
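Given the original and recovered sites as URL-to-content-hash maps, the four categories are plain set operations. A sketch; the exact-match similarity test and the normalization of the reconstruction vector are our assumptions, not necessarily the paper's definitions:

```python
def categorize(original, recovered):
    """original, recovered: dicts mapping URL -> content hash."""
    identical = {u for u in original
                 if u in recovered and recovered[u] == original[u]}
    changed   = {u for u in original
                 if u in recovered and recovered[u] != original[u]}
    missing   = set(original) - set(recovered)
    added     = set(recovered) - set(original)
    return identical, changed, missing, added

def reconstruction_vector(original, recovered):
    """One plausible formulation: changed and missing as fractions of
    the original site, added as a fraction of the recovered site."""
    _, changed, missing, added = categorize(original, recovered)
    return (len(changed) / len(original),
            len(missing) / len(original),
            len(added) / len(recovered) if recovered else 0.0)

orig = {"/": "h1", "/a": "h2", "/b": "h3"}
rec  = {"/": "h1", "/a": "hX", "/new": "h9"}
print(categorize(orig, rec))   # ({'/'}, {'/a'}, {'/b'}, {'/new'})
```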
Reconstruction Diagram

[Pie chart of one reconstruction: 20% added, 33% changed]
Initial Experiment - April 2005

• Crawl and reconstruct 24 sites in 3 categories:
  1. small (1-150 resources)
  2. medium (150-499 resources)
  3. large (500+ resources)
• Calculate a reconstruction vector for each site
• Results: mostly successful at recovering HTML
• Observation: many wasted queries; disconnected portions of websites are unrecoverable
• See:
  – McCown et al. Reconstructing websites for the lazy webmaster. Tech Report, 2005.
  – Smith et al. Observed web robot behavior on decaying web subsites. D-Lib Magazine, 12(2), Feb 2006.
Missing Disconnected Resources

[Diagram]
Lister Queries

• Problem with initial version of Warrick: wasted queries
  – Internet Archive: Do you have X? No
  – Google: Do you have X? No
  – Yahoo: Do you have X? Yes
  – MSN: Do you have X? No
• What if we first ask each web repository "What resources do you have?" We call these "lister queries."
• How many repository requests will this save?
• How many more resources will this discover?
• What other problems will this help solve?
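In code, a lister query turns many per-URL probes into one upfront listing per repository. A sketch, where repo.list_stored_urls() is a hypothetical wrapper around something like a site: search or an Internet Archive index query:

```python
def plan_requests(frontier, repos):
    """Ask each repository once for the set of URLs it holds, then only
    issue per-URL requests that can succeed; 'Do you have X?' probes for
    URLs a repository does not hold are skipped entirely."""
    holdings = {repo: set(repo.list_stored_urls()) for repo in repos}
    plan = {}
    for url in frontier:
        plan[url] = [repo for repo in repos if url in holdings[repo]]
    return plan   # url -> repositories worth querying
```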
          Lister Queries cont.
• Search engines
  – Limited to first 1000 results or less
• Internet Archive
  – Not all URLs reported are actually accessible
• Results are given in groups of 100 or less

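Those two limits shape the fetch loop: page through results 100 at a time and stop at the roughly 1000-result ceiling. A sketch, with repo.search() standing in for the engine's actual API:

```python
def lister_query(repo, domain, page_size=100, max_results=1000):
    """Page through a repository's listing in groups of <= 100 results,
    stopping at the engine's result ceiling."""
    urls = []
    for offset in range(0, max_results, page_size):
        batch = repo.search(f"site:{domain}", start=offset, count=page_size)
        urls.extend(batch)
        if len(batch) < page_size:     # short page: no more results
            break
    return urls
```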
          URL Canonicalization
• How do we know if URL X is pointing to the
  same resource as URL Y?
• Web crawlers use several strategies that we
  may borrow:
  –   Convert to lowercase
  –   Remove www prefix
  –   Remove session IDs
  –   etc.
• All web repositories have different canonicalization policies, which lister queries can help reveal
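A sketch of the borrowed strategies in Python; the session-ID parameter list is illustrative, and note that lowercasing the path is only safe against case-insensitive servers (see the Case Sensitivity slide below):

```python
import re
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Sketch of the strategies on the slide; real crawlers apply many
    more rules, and each repository has its own policy."""
    scheme, netloc, path, query, _ = urlsplit(url)
    netloc = netloc.lower()
    if netloc.startswith("www."):          # remove the www prefix
        netloc = netloc[4:]
    # Drop common session-ID parameters (illustrative list only).
    query = "&".join(p for p in query.split("&")
                     if p and not re.match(r"(?i)(sessionid|sid|phpsessid)=", p))
    path = path.lower()                    # only safe for case-insensitive hosts
    return urlunsplit((scheme.lower(), netloc, path, query, ""))

print(canonicalize("http://WWW.Example.com/Index.HTML?PHPSESSID=abc&x=1"))
# http://example.com/index.html?x=1
```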
Missing 'www' Prefix

[Screenshot]
Case Sensitivity

• Some web servers run on case-insensitive file systems (e.g., IIS on Windows)
• is equivalent to
• MSN always ignores case; Google and Yahoo do not
          Crawling Policies
1. Naïve Policy - Do not issue lister
   queries; only recover links that are found
   in recovered pages.
2. Knowledgeable Policy - Issue lister
   queries but only recover links that are
   found in recovered pages.
3. Exhaustive Policy - Issue lister queries
   and recover all resources found in all
   repositories. (Repository dump)
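A sketch of how the three policies differ operationally; repo.list_stored_urls and repo.holdings are hypothetical names, not Warrick's actual interface:

```python
def build_initial_frontier(seeds, repos, policy):
    """How each policy seeds the crawl."""
    frontier = list(seeds)
    if policy in ("knowledgeable", "exhaustive"):
        for repo in repos:
            # lister queries: learn what each repository holds
            repo.holdings = set(repo.list_stored_urls())
    if policy == "exhaustive":
        # repository dump: recover everything any repo has stored,
        # whether or not a recovered page links to it
        for repo in repos:
            frontier.extend(repo.holdings)
    return frontier

def should_query(repo, url, policy):
    """Naive: probe every repo for every URL. The other two policies
    skip repos that lister queries showed do not hold the URL."""
    return policy == "naive" or url in repo.holdings
```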
Web-Repository Crawler Using Lister Queries

[Diagram: under the exhaustive policy, lister queries seed the frontier with all stored URLs in addition to the seed URLs; the crawler then requests resources from the repositories as before.]
Second Experiment

• Download all 24 websites (from the first experiment)
• Perform 3 reconstructions for each site using the 3 crawling policies
• Compute reconstruction vectors for each
Reconstruction Statistics

Efficiency ratio = total recovered resources / total repository requests

[Charts: efficiency ratio per policy, computed once over all resources and once excluding 'added' resources]
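A worked example of the ratio with made-up numbers (not the paper's data):

```python
# Hypothetical counts for one reconstruction, for illustration only:
recovered_all   = 120      # resources recovered, including 'added' ones
recovered_core  = 95       # recovered resources excluding 'added'
requests_issued = 400      # total repository requests

print(recovered_all  / requests_issued)   # 0.30    efficiency, all resources
print(recovered_core / requests_issued)   # 0.2375  efficiency, non-added only
```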
Summary of Findings

• Naive policy
  – Recovers nearly as many non-added resources as the knowledgeable and exhaustive policies
  – Issues the highest number of repository requests
• Knowledgeable policy
  – Issues the fewest requests per reconstruction
  – Has the highest efficiency ratio when only non-added resources are counted
• Exhaustive policy
  – Recovers significantly more added resources than the other two policies
  – Has the highest overall efficiency ratio
Website “Hijacking”

[Screenshot]

Soft 404

Warrick should detect soft 404s.
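One common heuristic for detecting soft-404 hosts is to probe with a URL that cannot exist; this is our sketch of that idea, not necessarily what Warrick implements:

```python
import uuid
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

def serves_soft_404s(site_url):
    """Request a URL on the host that almost certainly does not exist.
    A 200 answer means the server issues 'soft 404s', so its 200
    responses for missing pages cannot be trusted."""
    scheme, netloc, *_ = urlsplit(site_url)
    probe = urlunsplit((scheme, netloc, "/" + uuid.uuid4().hex, "", ""))
    try:
        with urlopen(probe, timeout=10) as resp:
            return resp.status == 200      # 200 for junk => soft 404s
    except HTTPError:
        return False                       # a real 404/410 came back
    except URLError:
        return False                       # host unreachable; cannot tell
```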
Other Web Obstacles

• Effective “do not preserve” tags:
  – Flash
  – AJAX
  – HTTP POST
  – session IDs, cookies, etc.
  – cloaking
  – URLs that change based on traversal patterns
    • Lutkenhouse, Nelson, Bollen. Distributed, Real-Time Computation of Community Preferences. Proceedings of ACM Hypertext '05.
Conclusions

• Web sites can be reconstructed by accessing the caches of the Web Infrastructure
  – Some URL canonicalization issues can be tackled using lister queries
  – Multiple policies are available depending on reconstruction requirements
• Much work remains to be done
  – capturing server-side information
  – moving from a descriptive model to a prescriptive & predictive model
