Evaluation of Crawling Policies for a Web-Repository Crawler
Document Sample


Lazy Preservation: Reconstructing
Websites from the Web Infrastructure
Frank McCown
Advisor: Michael L. Nelson
Old Dominion University
Computer Science Department
Norfolk, Virginia, USA
Dissertation Defense
October 19, 2007
Outline
• Motivation
• Lazy preservation and the Web
Infrastructure
• Web repositories
• Responses to 10 research questions
• Contributions and Future Work
2
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg 3
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
Preservation: Fortress Model
5 easy steps for preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and
despair!”
4
Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
…I was doing a little “maintenance” on one
of my sites and accidentally deleted my
entire database of about 30 articles. After I
finished berating myself for being so stupid,
I realized that my hosting company would
have a backup, so I sent an email asking
them to restore the database. Their reply
stated that backups were “coming
soon”…OUCH!
6
Web
Infrastructure
Lazy Preservation
• How much preservation can be had for
free? (Little to no effort for web
producer/publisher before website is lost)
• High-coverage preservation of works of
unknown importance
• Built atop unreliable, distributed members
which cannot be controlled
• Usually limited to crawlable web
8
Dissertation Objective
To demonstrate the feasibility of
using the WI as a preservation
service – lazy preservation – and
to evaluate how effectively this
previously unexplored service can
be utilized for reconstructing lost
websites.
9
Research Questions
(Dissertation p. 3)
1. What types of resources are typically stored in the WI
search engine caches, and how up-to-date are the
caches?
2. How successful is the WI at preserving short-lived web
content?
3. How much overlap is there with what is found in search
engine caches and the Internet Archive?
4. What interfaces are necessary for a member of the WI
(a web repository) to be used in website reconstruction?
5. How does a web-repository crawler work, and how can
it reconstruct a lost website from the WI?
10
Research Questions cont.
6. What types of websites do people lose, and how
successful have they been recovering them from the
WI?
7. How completely can websites be reconstructed from the
WI?
8. What website attributes contribute to the success of
website reconstruction?
9. Which members of the WI are the most helpful for
website reconstruction?
10.What methods can be used to recover the server-side
components of websites from the WI?
11
WI Preliminaries:
Web Repositories
12
How much of the Web is indexed?
Internet Archive?
13
Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
14
15
Cached Image
16
Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
canonical
MSN version Yahoo version Google version
Types of Web Repositories
• Depth of holdings
– Flat – only maintain last version of resource
crawled
– Deep – maintain multiple versions, each with
a timestamp
• Access to holdings
– Dark – no outside access to resources
– Light – minimal access restrictions
18
Accessing the WI
• Screen-scraping the web user interface
(WUI)
• Application programming interface (API)
• WUIs and APIs do not always produce the
same responses; the APIs may be pulling
from smaller indexes1
1McCown & Nelson, Agreeing to Disagree: Search Engines and
their Public Interfaces, JCDL 2007 19
Research Questions 1-3:
Characterizing the WI
• Experiment 1: Observe the WI finding and
caching new web content that is decaying.
• Experiment 2: Examine the contents of the
WI by randomly sampling URLs
20
Timeline of Web Resource
21
Web Caching Experiment
• May – Sept 2005
• Create 4 websites composed of HTML, PDFs,
and images
– http://www.owenbrau.com/
– http://www.cs.odu.edu/~fmccown/lazy/
– http://www.cs.odu.edu/~jsmit/
– http://www.cs.odu.edu/~mln/lazp/
• Remove pages each day
• Query GMY every day using identifiers
McCown et al., Lazy Preservation: Reconstructing Websites by
22
Crawling the Crawlers, ACM WIDM 2006.
23
24
25
26
Observations
• Internet Archive found nothing
• Google was the most useful web
repository from a preservation perspective
– Quick to find new content
– Consistent access to cached content
– Lost content reappeared in cache long after it
was removed
• Images are slow to be cached, and
duplicate images are not cached
27
Experiment: Sample Search
Engine Caches
• Feb 2007
• Submitted 5200 one-term queries to Ask,
Google, MSN, and Yahoo
• Randomly selected 1 result from first 100
• Download resource and cached page
• Check for overlap with Internet Archive
McCown and Nelson, Characterization of Search Engine Caches, 28
Archiving 2007.
Distribution of Top Level Domains
29
Cached Resource Size Distributions
976 KB 977 KB
1 MB 215 KB
30
Cache Freshness and
Staleness
Fresh Stale Fresh
time
crawled and changed on crawled and
cached web server cached
Staleness = max(0, Last-modified HTTP header – cached date)
31
Cache Staleness
• 46% of resource had Last-Modified header
• 71% also had cached date
• 16% were at least 1 day stale
32
Overlap with Internet Archive
33
Overlap with Internet Archive
Ave of 46%
URLs from
search
engines were
archived
34
Research Question 4 of 10:
Repository Interfaces
Minimum interface requirement:
What resource r do you have stored
for the URI u?“
r getResource(u)
35
Deep Repositories
What resource r do you have stored
for the URI u at datestamp d?“
r getResource(u, d)
36
Lister Queries
What resources R do you have stored
from the site s?
R getAllUris(s)
37
38
Other Interface Commands
• Get list of dates D stored for URI u
D getResourceList(u)
• Get crawl date d for URI u
d getCrawlDate(u)
39
Research Question 5 of 10:
Web-Repository Crawling
40
Web-repository Crawler
41
• Written in Perl
• First version completed in Sept 2005
• Made available to the public in Jan 2006
• Run as a command line program
warrick.pl --recursive --debug --output-file log.txt
http://foo.edu/~joe/
• Or on-line using the Brass queuing system
http://warrick.cs.odu.edu/
42
43
Research Question 6 of 10:
Warrick usage
44
Ave 38.2%
45
46
Research Questions 7 and 8:
Reconstruction Effectiveness
• Problem with usage data: Difficult to determine
how successful reconstructions actually are
– Brass tells Warrick to recover all resources, even if
not part of “current” website
– When were websites actually lost?
– Were URLs spelled correctly? Spam?
– Need actual website to compare against
reconstruction, especially if wanting to determine
which factors determine website’s recoverability
47
48
Measuring the Difference
Apply Recovery Vector for each resource
(rc, rm, ra)
changed missing added
Compute Difference Vector for website
49
Reconstruction Diagram
added
changed
20%
33%
identical missing
17%
50%
50
McCown and Nelson, Evaluation of Crawling Policies for a Web-
51
Repository Crawler, HYPERTEXT 2006
Reconstruction Experiment
• 300 websites chosen randomly from Open
Directory Project (dmoz.org)
• Crawled and reconstructed each website
every week for 14 weeks
• Examined change rates, age, decay,
growth, recoverability
McCown and Nelson, Factors Affecting Website Reconstruction
52
from the Web Infrastructure, JCDL 2007
Success of
website
recovery
each week
*On average, 61% of a
website was recovered on
any given week.
53
Recovery by TLD
54
Which Factors Are Significant?
• External backlinks • Query string params
• Internal backlinks • Age
• Google’s PageRank • Resource birth rate
• Hops from root page • TLD
• Path depth • Website size
• MIME type • Size of resources
55
Regression Analysis
• No surprises: all variables are significant,
but overall model only explains about half
of the observations
• Three most significant variables:
PageRank, hops and age (R-squared =
0.1496)
56
Observations
• Most of the sampled websites were relatively
stable
– One third of the websites never lost a single resource
– Half of the websites never added any new resources
• The typical website can expect to get back 61%
of its resources if it were lost today (77% textual,
42% images and 32% other)
• How to improve recovery from WI? Improve
PageRank, decrease number of hops to
resources, create stable URLs
57
Research Question 9 of 10:
Web Repository Contributions
Experimental
results
Real usage
data
58
Research Question 10 of 10:
Recovering the web server’s components
Static files
(html files, PDFs,
images, style sheets,
Javascript, etc.) Web
Infrastructure
Recoverable
config
Perl Dynamic
script page
Database
Not Recoverable
59
Web Server
Injecting Server Components into
Crawlable Pages
Erasure codes
HTML pages Recover at least
m blocks
60
Server Encoding Experiment
• Create a digital library using Eprints
software and populate with 100 research
papers
• Monarch DL: http://blanche-03.cs.odu.edu/
• Encode Eprints server components (Perl
scripts, MySQL database, config files) and
inject into all HTML pages
• Reconstruct each week
61
62
Web resources
recovered each week
63
64
Contributions
1. Novel solution to pervasive problem of
website loss: lazy preservation, after-the-
fact recovery for little to no work required
for the content creator
2. WI is characterized: behavior to consume
and retain new web content, types of
resources it contains, overlap between
flat and deep repositories
65
Contributions cont.
3. Model for resource availability is
developed from initial creation to its
potential unavailability
4. Developed new type of crawler: web-
repository crawler. Architecture,
interfaces for crawling web repositories,
rules for canonicalizing URLs, three
crawling policies are evaluated
66
Contributions cont.
5. Developed statistical model to measure
reconstructed website, reconstruction
diagram to summarize reconstruction
success.
6. Discovered the three most significant
variables that determine how successfully
a web resource will be recovered from the
WI: Google's PageRank, hops from the
root page, resource age.
67
Contributions cont.
7. Proposed and experimentally validated a
novel solution to recover a website's
server components from WI
8. Created website reconstruction service
which is currently being used by the
public to reconstruct more than 100 lost
websites a month
68
Future Work
• Improvements to Warrick: increase used
repositories, discovery of URLs, soft 404s
• Determining or predicting loss- save
websites if detecting they are about to or
already have disappeared
• Investigate other sources of lazy
preservation: browser caches
• More extensive overlap studies of WI
69
Related Publications
• Deep web • Search engine contents and
– IEEE Internet Computing 2006 interfaces
– ECDL 2005
– WWW 2007
• Link rot
– JCDL 2007
– IWAW 2005
• Lazy Preservation / WI • Obsolete web file formats
– IWAW 2005
– D-Lib Magazine 2006
– WIDM 2006
– Archiving 2007 • Warrick
– Dynamics of Search Engines: An – HYPERTEXT 2006
Introduction (chapter) – Archiving 2007
– Content Engineering (chapter) – JCDL 2007
– International Journal on Digital – IWAW 2007
Libraries 2007
– Communications of the ACM 2007
(to appear)
70
Thank You
Can’t wait until I’m
old enough to run
Warrick!
71
72
74
Some Difference Vectors
D = (changed, missing, added)
(0,0,0) – Perfect recovery
(1,0,0) – All resources are recovered
but changed
(0,1,0) – All resources are lost
(0,0,1) – All recovered resources
are at new URIs 75
How Much Change is a Bad Thing?
Lost Recovered
76
How Much Change is a Bad Thing?
Lost Recovered
77
Assigning Penalties
Penalty Adjustment
(Pc, Pm, Pa)
Apply to each resource
Or Difference vector
78
Defining Success
success = 1 – dm
Equivalent to percent of recovered resources
0 1
Less More
successful successful
79
Recovery of Textual Resources
80
Birth and Decay
81
Recovery of HTML Resources
82
Recovery by Age
83
Mild Correlations
• Hops and
– website size (0.428)
– path depth (0.388)
• Age and # of query params (-0.318)
• External links and
– PageRank (0.339)
– Website size (0.301)
– Hops (0.320)
84
Similarity vs. Staleness
86
Get documents about "