The Management of a
Website’s Historical Resources
College of Business
San Francisco State University
• An organization’s websites change
constantly to reflect the dynamic nature of
its environment, causing changes in website
structure, contents, and the supporting
technologies.
Types of Change
• Website structure:
– Causing web pages’ URLs to change
• Website content:
– Changes to web pages:
• Insertions, deletions, modifications
– Changes to content databases
What Are a Website’s Historical Resources?
• Outdated URLs
• Outdated web pages:
– Web page snapshots
– Content database snapshots
– Deleted web pages
• Replaced technologies
The Objective of Managing Historical Resources
• The major objective of managing
historical resources is to satisfy users’ needs
for historical information by enabling the
website to recreate or retrieve web page
snapshots.
– A web page snapshot is the state of a web page at
a specific point in time.
Factors Affecting The State Of A Web Page
• Content factors:
– Web page code
– The state of internal resources it references:
• Images, style sheets, components, script files, etc.
– The state of external resources it references:
• External resources are files that are not managed by the web
site but can be referenced in creating the web site’s web pages.
• Environment factors:
– Web site host environment variables:
• System clock
– Web technologies implemented on the server-side
as well as on the client-side
Levels of Web Page Snapshot
• Level 1 snapshot: A level 1 snapshot is the
state of the web document code at snaptime.
– Creating level 1 snapshot enables a web site to trace the
changes to the web document code over time.
• Level 2 snapshot: A level 2 snapshot is a level 1
snapshot with the additional requirement that all
the internal resources it references are at least
level 1 snapshots at the same snaptime.
– Referencing database snapshots
• Level 3 snapshot: A level 3 snapshot is a level 2
snapshot with the additional requirement that all
the external resources it references are at least
level 2 snapshots at the snaptime.
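The three level definitions can be read as a recursive check. The Python sketch below is one possible reading, not the authors' implementation. The Resource class, its fields, and the snapshot_level function are hypothetical names introduced for illustration, and the sketch assumes that references between resources contain no cycles.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """Hypothetical model of a web document or a resource it references."""
    name: str
    code_captured: bool  # True if the document code was saved at snaptime
    internal: List["Resource"] = field(default_factory=list)
    external: List["Resource"] = field(default_factory=list)


def snapshot_level(r: Resource) -> int:
    """Return 0 (no snapshot) or the level 1-3 that r satisfies."""
    if not r.code_captured:
        return 0
    # Level 2 requires every internal resource to be at least level 1.
    if not all(snapshot_level(i) >= 1 for i in r.internal):
        return 1
    # Level 3 requires every external resource to be at least level 2.
    if not all(snapshot_level(e) >= 2 for e in r.external):
        return 2
    return 3
```

A resource that references nothing satisfies the higher-level requirements vacuously, so under this reading a captured image or style sheet counts as level 3.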
Enforcing Environment Factors Page
• (1) Plus 0: If both environment factors are
not enforced.
• (2) Plus 1: If the host variables are reset to
the snapshot time.
• (3) Plus 2: If web technologies are
compatible with the technologies at the
snapshot time.
• (4) Plus 3: If both factors are enforced.
Possible Levels of Snapshot States
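Combining the three content levels with the four Plus cases gives the candidate snapshot states. The Python sketch below enumerates them under the assumption that every combination is possible; the original chart for this slide is not available in the text, so treat the list as illustrative. The function names are hypothetical.

```python
from itertools import product


def plus_suffix(host_vars_reset: bool, tech_compatible: bool) -> int:
    """Map the two environment factors to Plus 0..3.

    Plus 1 = host variables reset, Plus 2 = compatible technologies,
    Plus 3 = both, Plus 0 = neither.
    """
    return (1 if host_vars_reset else 0) + (2 if tech_compatible else 0)


def state_label(level: int, host_vars_reset: bool, tech_compatible: bool) -> str:
    """Build a label such as 'Level 2 Plus 1'."""
    return f"Level {level} Plus {plus_suffix(host_vars_reset, tech_compatible)}"


# Assumption: all 3 x 4 combinations are valid snapshot states.
ALL_STATES = [
    state_label(level, host, tech)
    for level, host, tech in product((1, 2, 3), (False, True), (False, True))
]
```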
Schemes for Tracking Changes
• Scheme for tracking website structure
changes and web page code changes
– A logging and archiving scheme
• Scheme for tracking content database
changes
Design of a Logging and Archiving Scheme
for Tracking Website Changes
• The log, named TemporalURLLog, has five
fields: URL, PublishDate, DocExpireDate,
URLExpireDate, and NewURL.
• Archived documents are saved in the
Archive using URL + PublishDate as the file
name.
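As a minimal sketch of this design, a log record and its archive file name could be modeled as follows. The class and function names are hypothetical, and the field meanings in the comments are inferred from the field names rather than stated in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    """One row of TemporalURLLog (field meanings are assumptions)."""
    url: str
    publish_date: str                       # when this version was published
    doc_expire_date: Optional[str] = None   # when this version was replaced or deleted
    url_expire_date: Optional[str] = None   # when the URL stopped being valid
    new_url: Optional[str] = None           # the URL it was renamed to, if any


def archive_name(entry: LogEntry) -> str:
    """File name of the archived document: URL + PublishDate."""
    return f"{entry.url}+{entry.publish_date}"
```

For example, `archive_name(LogEntry("P3", "T0", doc_expire_date="T2"))` yields the name `P3+T0`.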
Impacts of Website Changes to
Historical Links and Archive
Time  Website Changes           Current Web Pages    Historical Links Generated  Snapshots in Archive
T0                              P1, P2, P3           None                        None
T1    P1 renamed to P4          P2, P3, P4, P5       P1+T0
      P5 is added
T2    P2 is deleted             P3, P4, P5           P2+T0, P3+T0                P2+T0, P3+T0
      P3 is modified
T3    P3, P4, P5 are modified   P1, P3, P4, P5, P6   P3+T2, P4+T1, P5+T1         P3+T2, P4+T1, P5+T1
      P1, P6 are added
T4    P3 is deleted             P1, P3, P6, P7, P8   P3+T3, P4+T3, P5+T3         P3+T3
      P4 is renamed to P8
      P5 is renamed to P7
      A new page P3 is added
The contents of TemporalURLLog
URL  PublishDate  DocExpireDate  URLExpireDate  NewURL
P1   T0           Null           T1             P4
P2   T0           T2             T2             Null
P3   T0           T2             Null           Null
P4   T1           T3             Null           Null
P4   T3           Null           T4             P8
P5   T1           T3             Null           Null
P3   T2           T3             Null           Null
P3   T3           T4             T4             Null
P5   T3           Null           T4             P7
P6   T3           Null           Null           Null
P1   T3           Null           Null           Null
P8   T4           Null           Null           Null
P7   T4           Null           Null           Null
P3   T4           Null           Null           Null
Examples of Using the Log
• Retrieve a snapshot of a current web page:
• Retrieve a deleted page:
• Retrieve the snapshot of a deleted web page:
– The snapshot of P3 at T2 is in the Archive: P3+T2.
• Retrieve the current web page of an outdated URL:
– The old URL P5 has been renamed to P7. If users submit a
request for P5, it can be traced to P7.
• Retrieve the web page previously associated with a
URL:
– The historical link P1 has since been renamed to P8 (via P4), and the current
link P1 points to a new web page. If the current web page
associated with P1 is not what the users need, the request can be
redirected to P8.
• Determine whether an invalid URL ever existed:
– The URL P12 has never existed in the web site.
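The lookups above can be sketched in Python against the TemporalURLLog contents. This is an illustrative reading, not the authors' implementation: the Entry type and the three function names are hypothetical, times T0..T4 are encoded as the integers 0..4, and the interpretation of each field (a version is valid from PublishDate until its DocExpireDate or URLExpireDate) is an assumption.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    url: str
    publish: int
    doc_expire: Optional[int]
    url_expire: Optional[int]
    new_url: Optional[str]


# The log rows from the slide, with Tn written as n and Null as None.
LOG = [
    Entry("P1", 0, None, 1, "P4"), Entry("P2", 0, 2, 2, None),
    Entry("P3", 0, 2, None, None), Entry("P4", 1, 3, None, None),
    Entry("P4", 3, None, 4, "P8"), Entry("P5", 1, 3, None, None),
    Entry("P3", 2, 3, None, None), Entry("P3", 3, 4, 4, None),
    Entry("P5", 3, None, 4, "P7"), Entry("P6", 3, None, None, None),
    Entry("P1", 3, None, None, None), Entry("P8", 4, None, None, None),
    Entry("P7", 4, None, None, None), Entry("P3", 4, None, None, None),
]


def follow_renames(url: str, log: List[Entry]) -> str:
    """Follow the NewURL chain from a historical URL to its latest name."""
    current, t = url, -1
    while True:
        hops = [e for e in log
                if e.url == current and e.new_url is not None
                and e.url_expire is not None and e.url_expire > t]
        if not hops:
            return current
        hop = min(hops, key=lambda e: e.url_expire)  # earliest later rename
        current, t = hop.new_url, hop.url_expire


def version_at(url: str, t: int, log: List[Entry]) -> Optional[Entry]:
    """Return the log entry for the version of url that was live at time t."""
    for e in log:
        if (e.url == url and e.publish <= t
                and (e.doc_expire is None or t < e.doc_expire)
                and (e.url_expire is None or t < e.url_expire)):
            return e
    return None


def ever_existed(url: str, log: List[Entry]) -> bool:
    """True if the URL appears anywhere in the log."""
    return any(e.url == url for e in log)


def archive_name(e: Entry) -> str:
    """Archive file name: URL + PublishDate."""
    return f"{e.url}+T{e.publish}"
```

Strictly increasing expiry times in follow_renames keep the loop from revisiting a rename, so it terminates on any finite log. Note that follow_renames("P1", LOG) traces the historical P1, not the new page published at P1 in T3.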
Tracking Changes to Content Databases
• A web page may use content databases:
– (1) as a source for querying.
– (2) as storage for contents of placeholders on a
web page.
Database Snapshot Management
• Defining snapshots:
CREATE SNAPSHOT snapshotname
AS OF snaptime
• Refreshing snapshots:
REFRESH SNAPSHOT snapshotname
AS OF new snaptime
Issues in Tracking Changes to Content Databases
• The content databases may exist in many
formats:
– XML, delimited text files, etc.
– Not all content databases are supported by a snapshot
facility.
• The website may not have authority over the
management of the content databases.
• A web page may retrieve data from many
content databases.
• There is no single way of designing content
databases.
Tracking Content Database Changes
Using Log – An Example
– One content database supports many web pages.
– Each page contains many placeholders.
• Log design:
– PageID + PlaceHolderID + Content + Update Flag + Time Stamp
• PageID is (URL + page publish time)
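To illustrate how such a log could be replayed, the Python sketch below rebuilds the placeholder contents of one page as of a given time. The record type and function name are hypothetical, times are integers, and the Update Flag values "I" (insert), "U" (update), and "D" (delete) are an assumption; the source does not list the flag values.

```python
from typing import Dict, List, NamedTuple, Tuple


class ContentLogRecord(NamedTuple):
    page_id: Tuple[str, int]   # (URL, page publish time)
    placeholder_id: str
    content: str
    update_flag: str           # assumed values: "I", "U", "D"
    timestamp: int


def placeholders_at(log: List[ContentLogRecord],
                    page_id: Tuple[str, int],
                    t: int) -> Dict[str, str]:
    """Replay the log in time order and return placeholder contents as of t."""
    state: Dict[str, str] = {}
    for rec in sorted(log, key=lambda r: r.timestamp):
        if rec.page_id != page_id or rec.timestamp > t:
            continue
        if rec.update_flag == "D":
            state.pop(rec.placeholder_id, None)   # placeholder content removed
        else:
            state[rec.placeholder_id] = rec.content
    return state


# Made-up sample records for a page published at URL P3 at time 0.
SAMPLE_LOG = [
    ContentLogRecord(("P3", 0), "headline", "Welcome", "I", 0),
    ContentLogRecord(("P3", 0), "price", "10", "I", 0),
    ContentLogRecord(("P3", 0), "price", "12", "U", 1),
    ContentLogRecord(("P3", 0), "headline", "", "D", 2),
]
```

Replaying up to time 1 gives the headline "Welcome" and the price "12"; replaying up to time 2 leaves only the price, because the headline was deleted.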
Working with the TemporalURLLog
• Because a web page’s URL may change,
the content database log needs the support
of the TemporalURLLog to track URL
changes.
Delivering Historical Resources to Users
• A website consists of:
– (1) a current website where current web pages
are stored and accessed.
– (2) a historical website where historical
resources are stored and accessed.
• A typical web server serves requests for
current web pages only and is inadequate to
serve a request for historical information.
The Design of a Web Page Snapshot
• We developed a scheme to track changes to
website structure, web pages and files
referenced by web pages, and a second
scheme to track changes to content
databases so that the website is capable of
creating Level 2 snapshots.