					 Lazy Preservation: Reconstructing
Websites from the Web Infrastructure

            Frank McCown
        Advisor: Michael L. Nelson

           Old Dominion University
         Computer Science Department
            Norfolk, Virginia, USA


        Dissertation Defense
               October 19, 2007
              Outline
• Motivation
• Lazy preservation and the Web
  Infrastructure
• Web repositories
• Responses to 10 research questions
• Contributions and Future Work


                                       2
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg                     3
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
              Preservation: Fortress Model

       5 easy steps for preservation:
      1.        Get a lot of $
      2.        Buy a lot of disks, machines, tapes, etc.
      3.        Hire an army of staff
      4.        Load a small amount of data
      5.        “Look upon my archive ye Mighty, and
                despair!”




                                                                                                                                               4
Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt   Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
…I was doing a little “maintenance” on one
of my sites and accidentally deleted my
entire database of about 30 articles. After I
finished berating myself for being so stupid,
I realized that my hosting company would
have a backup, so I sent an email asking
them to restore the database. Their reply
stated that backups were “coming
soon”…OUCH!
                                                6
     Web
Infrastructure
         Lazy Preservation
• How much preservation can be had for
  free? (Little to no effort for web
  producer/publisher before website is lost)
• High-coverage preservation of works of
  unknown importance
• Built atop unreliable, distributed members
  which cannot be controlled
• Usually limited to crawlable web
                                               8
  Dissertation Objective

To demonstrate the feasibility of
using the WI as a preservation
service – lazy preservation – and
to evaluate how effectively this
previously unexplored service can
be utilized for reconstructing lost
websites.

                                      9
          Research Questions
                 (Dissertation p. 3)
1. What types of resources are typically stored in the WI
   search engine caches, and how up-to-date are the
   caches?
2. How successful is the WI at preserving short-lived web
   content?
3. How much overlap is there with what is found in search
   engine caches and the Internet Archive?
4. What interfaces are necessary for a member of the WI
   (a web repository) to be used in website reconstruction?
5. How does a web-repository crawler work, and how can
   it reconstruct a lost website from the WI?
                                                         10
     Research Questions cont.
6. What types of websites do people lose, and how
   successful have they been recovering them from the
   WI?
7. How completely can websites be reconstructed from the
   WI?
8. What website attributes contribute to the success of
   website reconstruction?
9. Which members of the WI are the most helpful for
   website reconstruction?
10. What methods can be used to recover the server-side
    components of websites from the WI?
                                                      11
WI Preliminaries:
Web Repositories




                    12
       How much of the Web is indexed?
                                                                Internet Archive?




                                                                                                     13
Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
Cached Image




               16
               Cached PDF
     http://www.fda.gov/cder/about/whatwedo/testtube.pdf



Figure: the canonical PDF alongside the MSN, Yahoo, and Google cached versions.
   Types of Web Repositories
• Depth of holdings
  – Flat – only maintain last version of resource
    crawled
  – Deep – maintain multiple versions, each with
    a timestamp
• Access to holdings
  – Dark – no outside access to resources
  – Light – minimal access restrictions

                                                    18
               Accessing the WI
• Screen-scraping the web user interface
  (WUI)
• Application programming interface (API)
• WUIs and APIs do not always produce the
  same responses; the APIs may be pulling
  from smaller indexes1


1 McCown & Nelson, Agreeing to Disagree: Search Engines and their Public Interfaces, JCDL 2007.
                                                                19
     Research Questions 1-3:
      Characterizing the WI

• Experiment 1: Observe the WI finding and
  caching new web content that is decaying.

• Experiment 2: Examine the contents of the
  WI by randomly sampling URLs



                                          20
Timeline of Web Resource




                           21
         Web Caching Experiment
 • May – Sept 2005
 • Create 4 websites composed of HTML, PDFs,
   and images
     –   http://www.owenbrau.com/
     –   http://www.cs.odu.edu/~fmccown/lazy/
     –   http://www.cs.odu.edu/~jsmit/
     –   http://www.cs.odu.edu/~mln/lazp/
 • Remove pages each day
 • Query Google, MSN, and Yahoo (GMY) every day using identifiers

McCown et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
                                                               22
              Observations
• Internet Archive found nothing
• Google was the most useful web
  repository from a preservation perspective
  – Quick to find new content
  – Consistent access to cached content
  – Lost content reappeared in cache long after it
    was removed
• Images are slow to be cached, and
  duplicate images are not cached
                                                 27
         Experiment: Sample Search
               Engine Caches

 • Feb 2007
 • Submitted 5200 one-term queries to Ask,
   Google, MSN, and Yahoo
 • Randomly selected 1 result from the first 100
 • Downloaded each resource and its cached page
 • Checked for overlap with the Internet Archive

McCown and Nelson, Characterization of Search Engine Caches, Archiving 2007.
                                                               28
Distribution of Top Level Domains




                                    29
Cached Resource Size Distributions

Figure: distributions of cached resource sizes; marked values: 976 KB, 977 KB, 1 MB, and 215 KB.

                                         30
          Cache Freshness and
               Staleness

Timeline: a resource is fresh once it is crawled and cached, becomes stale after it changes on the web server, and is fresh again once it is re-crawled and re-cached.

  Staleness = max(0, Last-Modified HTTP date – cached date)


                                                                31
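A brief worked example of the staleness measure, using hypothetical dates that are not drawn from the experiment: suppose a page was crawled and cached on 3 Feb 2007 and, when sampled, its Last-Modified header reports 10 Feb 2007. Then

    staleness = max(0, 10 Feb 2007 – 3 Feb 2007) = 7 days

A copy whose Last-Modified date is earlier than its cached date has staleness 0 and is considered fresh.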
         Cache Staleness
• 46% of resources had a Last-Modified header
• 71% of these also had a cached date
• 16% were at least 1 day stale




                                         32
Overlap with Internet Archive




                                33
Overlap with Internet Archive




On average, 46% of the URLs sampled from the search engines were also archived in the Internet Archive.
                                     34
Research Question 4 of 10:
  Repository Interfaces

 Minimum interface requirement:
“What resource r do you have stored for the URI u?”

        r ← getResource(u)
(see the sketch after this slide)


                                     35
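For illustration only, a minimal Perl sketch of a flat repository wrapped behind this interface. The package name, the caller-supplied cache-URL builder, and the use of LWP::UserAgent are assumptions for the sketch, not Warrick's actual implementation:

    # Minimal sketch of the flat-repository interface (illustrative only).
    # The constructor takes a code ref that maps a URI to the repository's
    # lookup URL, an assumption, since each repository addresses its
    # cached copies differently.
    package WebRepository;
    use strict;
    use warnings;
    use LWP::UserAgent;

    sub new {
        my ($class, %args) = @_;
        return bless {
            name           => $args{name},
            make_cache_url => $args{make_cache_url},   # code ref: URI -> lookup URL
            ua             => LWP::UserAgent->new,
        }, $class;
    }

    # r <- getResource(u): return the stored copy of URI u, or undef if none
    sub getResource {
        my ($self, $u) = @_;
        my $resp = $self->{ua}->get( $self->{make_cache_url}->($u) );
        return $resp->is_success ? $resp->decoded_content : undef;
    }

    1;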
    Deep Repositories


“What resource r do you have stored for the URI u at datestamp d?”

      r ← getResource(u, d)



                                     36
        Lister Queries


“What resources R do you have stored from the site s?”

        R ← getAllUris(s)



                                      37
   Other Interface Commands
• Get the list of dates D stored for URI u
     D ← getResourceList(u)
• Get the crawl date d for URI u
     d ← getCrawlDate(u)
(these calls are combined in the sketch after this slide)




                                         39
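A hedged sketch of how these calls might be combined with getResource: among the repositories holding a copy of u, prefer the most recently crawled one. This is only one plausible selection rule for the sketch; Warrick's actual crawling policies are evaluated later (HYPERTEXT 2006). It assumes repository objects like the WebRepository sketch above, with crawl dates comparable as epoch seconds:

    # Illustrative selection rule: return the most recently crawled copy
    # of $u among repositories implementing getResource and getCrawlDate.
    use strict;
    use warnings;

    sub freshest_copy {
        my ($u, @repos) = @_;
        my ($best_copy, $best_date);
        for my $repo (@repos) {
            my $copy = $repo->getResource($u) or next;   # nothing stored for u
            my $date = $repo->getCrawlDate($u) || 0;     # unknown date sorts last
            if (!defined $best_date || $date > $best_date) {
                ($best_copy, $best_date) = ($copy, $date);
            }
        }
        return $best_copy;    # undef if no repository holds u
    }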
Research Question 5 of 10:
 Web-Repository Crawling




                             40
Web-repository Crawler




                         41
•   Written in Perl
•   First version completed in Sept 2005
•   Made available to the public in Jan 2006
•   Run as a command line program
    warrick.pl --recursive --debug --output-file log.txt
     http://foo.edu/~joe/
• Or on-line using the Brass queuing system
     http://warrick.cs.odu.edu/
                                                       42
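Building on the repository interface sketched above, a simplified and hypothetical crawl loop: seed the frontier with the starting URL plus any lister-query results, recover each URI from the repositories, and extract further links from recovered HTML. The regex-based link extraction and helper names are assumptions for the sketch, not Warrick's code:

    # Simplified web-repository crawl loop (sketch only).
    use strict;
    use warnings;
    use URI;

    sub reconstruct_site {
        my ($start_url, @repos) = @_;
        my %recovered;                  # canonical URI -> recovered content
        my @frontier = ($start_url);

        # Seed with lister-query results where the repository supports them.
        push @frontier,
            map { $_->can('getAllUris') ? $_->getAllUris($start_url) : () } @repos;

        while ( my $u = shift @frontier ) {
            my $canon = URI->new($u)->canonical->as_string;
            next if exists $recovered{$canon};

            my $content = freshest_copy($canon, @repos);   # helper sketched earlier
            next unless defined $content;
            $recovered{$canon} = $content;

            # Naive link extraction from recovered HTML; stay within the site.
            while ( $content =~ /href\s*=\s*["']([^"'#]+)["']/gi ) {
                my $link = URI->new_abs($1, $canon)->canonical->as_string;
                push @frontier, $link if index($link, $start_url) == 0;
            }
        }
        return \%recovered;
    }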
Research Question 6 of 10:
     Warrick usage




                             44
Average: 38.2%




            45
  Research Questions 7 and 8:
  Reconstruction Effectiveness
• Problem with usage data: Difficult to determine
  how successful reconstructions actually are
  – Brass tells Warrick to recover all resources, even if
    not part of “current” website
  – When were websites actually lost?
  – Were URLs spelled correctly? Spam?
  – Need the actual website to compare against the reconstruction,
    especially when trying to determine which factors affect a
    website's recoverability

                                                            47
    Measuring the Difference
Apply a recovery vector to each resource:
              (rc, rm, ra) = (changed, missing, added)

Compute a difference vector for the website by aggregating the per-resource recovery vectors (a plausible formalization follows this slide)
                                          49
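One plausible formalization, stated as an assumption consistent with the reconstruction diagram and the later definition success = 1 – dm (the dissertation's exact normalization may differ): for a website with n original resources, of which nc are recovered but changed and nm are missing, and with na additional resources recovered that were not part of the original site,

    D = (dc, dm, da) = ( nc / n,  nm / n,  na / (n – nm + na) )

where n – nm + na is the number of resources in the reconstructed website. The diagram on the next slide, for example, would correspond to D = (0.33, 0.17, 0.20) under this reading.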
Reconstruction Diagram

Example: identical 50%, changed 33%, missing 17%; added 20%
                           50
McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
                                                                51
    Reconstruction Experiment
• 300 websites chosen randomly from Open
  Directory Project (dmoz.org)
• Crawled and reconstructed each website
  every week for 14 weeks
• Examined change rates, age, decay,
  growth, recoverability


McCown and Nelson, Factors Affecting Website Reconstruction from the Web Infrastructure, JCDL 2007.
                                                              52
Success of
website
recovery
each week

*On average, 61% of a
website was recovered on
any given week.
                           53
Recovery by TLD




                  54
    Which Factors Are Significant?

•   External backlinks    •   Query string params
•   Internal backlinks    •   Age
•   Google’s PageRank     •   Resource birth rate
•   Hops from root page   •   TLD
•   Path depth            •   Website size
•   MIME type             •   Size of resources


                                                    55
        Regression Analysis
• No surprises: all variables are significant,
  but the overall model explains only about half
  of the observations
• Three most significant variables:
  PageRank, hops, and age (R-squared = 0.1496)



                                                 56
                 Observations
• Most of the sampled websites were relatively
  stable
   – One third of the websites never lost a single resource
   – Half of the websites never added any new resources
• The typical website could expect to get back 61%
  of its resources if it were lost today (77% of textual
  resources, 42% of images, and 32% of other resources)
• How to improve recovery from the WI? Improve
  PageRank, decrease the number of hops to
  resources, and create stable URLs

                                                          57
    Research Question 9 of 10:
   Web Repository Contributions

Figure: contributions of each web repository to reconstruction, shown for experimental results and for real usage data.
                                  58
     Research Question 10 of 10:
Recovering the web server’s components

Diagram: the web server's static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) and the dynamic pages it generates are recoverable from the Web Infrastructure; the server-side components (config files, Perl scripts, database) are not recoverable.

                                                         59
  Injecting Server Components into
           Crawlable Pages




Diagram: server components are erasure-coded into blocks that are injected into the site's HTML pages; recovering at least m of the blocks is enough to restore the components (a simplified injection sketch follows this slide).
                                                60
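A simplified Perl sketch of the injection step only, under stated assumptions: the server components have already been archived (and, in the real system, erasure-coded); here plain fixed-size chunking stands in for a true erasure code, and all file names are hypothetical. Each block is base64-encoded and appended to an HTML page as a comment so that ordinary web crawlers pick it up:

    # Sketch: spread base64-encoded blocks of a component archive across
    # the site's HTML pages as comments. File names are hypothetical, and
    # plain chunking is used in place of a real erasure code.
    use strict;
    use warnings;
    use MIME::Base64 qw(encode_base64);

    my $archive    = 'site-backup.tar.gz';     # hypothetical archive of server components
    my @pages      = glob('htdocs/*.html');    # hypothetical crawlable pages
    my $block_size = 32 * 1024;                # 32 KB per block
    die "no HTML pages found\n" unless @pages;

    open my $in, '<:raw', $archive or die "open $archive: $!";
    my $i = 0;
    while ( read($in, my $block, $block_size) ) {
        my $page = $pages[ $i % @pages ];      # round-robin across pages
        open my $out, '>>', $page or die "append $page: $!";
        print {$out} "\n<!-- lazyp-block $i\n", encode_base64($block), "-->\n";
        close $out;
        $i++;
    }
    close $in;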
  Server Encoding Experiment
• Create a digital library using the EPrints
  software and populate it with 100 research papers
• Monarch DL: http://blanche-03.cs.odu.edu/
• Encode the EPrints server components (Perl
  scripts, MySQL database, config files) and
  inject them into all HTML pages
• Reconstruct each week
                                          61
   Web resources
recovered each week




                      63
             Contributions
1. A novel solution to the pervasive problem of
   website loss: lazy preservation, after-the-fact
   recovery that requires little to no work from
   the content creator
2. The WI is characterized: how it consumes and
   retains new web content, the types of resources
   it holds, and the overlap between flat and deep
   repositories

                                              65
         Contributions cont.
3. A model of resource availability is developed,
   covering a resource from its initial creation
   to its potential unavailability
4. A new type of crawler is developed: the web-
   repository crawler. Its architecture, interfaces
   for crawling web repositories, rules for
   canonicalizing URLs, and three crawling policies
   are evaluated

                                               66
         Contributions cont.
5. A statistical model is developed for measuring a
   reconstructed website, along with a reconstruction
   diagram for summarizing reconstruction success
6. The three most significant variables determining
   how successfully a web resource can be recovered
   from the WI are identified: Google's PageRank,
   hops from the root page, and resource age
                                          67
         Contributions cont.
7. A novel solution for recovering a website's
   server-side components from the WI is proposed
   and experimentally validated
8. A website reconstruction service is created that
   is currently being used by the public to
   reconstruct more than 100 lost websites a month


                                              68
             Future Work
• Improvements to Warrick: use more web
  repositories, improve URL discovery, handle soft 404s
• Determine or predict loss: save websites when
  they are detected to be about to disappear or to
  have already disappeared
• Investigate other sources of lazy preservation,
  such as browser caches
• More extensive overlap studies of the WI
                                               69
             Related Publications
• Deep web
   – IEEE Internet Computing 2006
• Link rot
   – IWAW 2005
• Lazy Preservation / WI
   – D-Lib Magazine 2006
   – WIDM 2006
   – Archiving 2007
   – Dynamics of Search Engines: An Introduction (chapter)
   – Content Engineering (chapter)
   – International Journal on Digital Libraries 2007
• Search engine contents and interfaces
   – ECDL 2005
   – WWW 2007
   – JCDL 2007
• Obsolete web file formats
   – IWAW 2005
• Warrick
   – HYPERTEXT 2006
   – Archiving 2007
   – JCDL 2007
   – IWAW 2007
   – Communications of the ACM 2007 (to appear)
                                                                           70
           Thank You

Can’t wait until I’m
old enough to run
     Warrick!




                       71
  Some Difference Vectors

   D = (changed, missing, added)
(0,0,0) – Perfect recovery
(1,0,0) – All resources are recovered but changed
(0,1,0) – All resources are lost
(0,0,1) – All recovered resources are at new URIs
                                         75
How Much Change is a Bad Thing?
      Lost          Recovered




                                76
How Much Change is a Bad Thing?
      Lost          Recovered




                                77
Assigning Penalties

A penalty adjustment (Pc, Pm, Pa) can be applied to each resource's recovery vector, or to the website's difference vector.
                           78
        Defining Success

           success = 1 – dm

Equivalent to the percentage of recovered resources
(worked example follows)

Scale: 0 (less successful) to 1 (more successful)
                                               79
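Applying this to the reconstruction diagram example shown earlier (missing = 17%, so dm = 0.17):

    success = 1 – dm = 1 – 0.17 = 0.83

That is, about 83% of the original website's resources were recovered in some form (identical or changed).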
Recovery of Textual Resources




                            80
Birth and Decay




                  81
Recovery of HTML Resources




                             82
Recovery by Age




                  83
           Mild Correlations
• Hops and
  – website size (0.428)
  – path depth (0.388)
• Age and # of query params (-0.318)
• External links and
  – PageRank (0.339)
  – Website size (0.301)
  – Hops (0.320)
                                       84
Similarity vs. Staleness




                           86

				