Tracking a Web Site's Historical Links with Temporal URLs

					   The Management of a
Website’s Historical Resources

              David Chao
           College of Business
      San Francisco State University
• An organization’s websites change
  constantly to reflect the dynamic nature of
  its environment, causing changes in website
  structure, content, and the supporting
  technologies.
             Types of Change
• Website structure:
  – Causes web pages’ URLs to change
• Website content:
  – Changes to web pages:
     • Insertions, deletions, modifications
  – Changes to content databases
• Technology
   What Are a Website’s Historical Resources?
• Outdated URLs
• Outdated web pages:
  – Web page snapshots
  – Content database snapshots
  – Deleted web pages
• Replaced technologies
      The Objective of Managing
         Historical Resources
• The major objective of managing historical
  resources is to satisfy users’ needs for
  historical information by enabling the website
  to recreate or retrieve web page snapshots.
  – A web page snapshot is the state of a web page at
    a specific point in time.
 Factors Affecting The State Of A Web Page
• Content factors:
  – Web page code
  – The state of internal resources it references:
      • Images, style sheet, components, script files,
        databases, etc.
   – The state of external resources it references:
      • External resources are files that are not managed by the
        web site but can be referenced in creating the web site’s
        pages.
• Environment factors:
  – Web site host environment variables:
      • System clock
   – Web technologies implemented on the server-side
     as well as on the client-side
          Levels of Web Page Snapshot
• Level 1 snapshot: A web document snapshot is the
  state of web document code at snaptime.
   – Creating level 1 snapshot enables a web site to trace the
     changes to the web document code over time.
• Level 2 snapshot: A level 2 snapshot is a level 1
  snapshot with the additional requirement that all
  the internal resources it references are at least
  level 1 snapshots at the same snaptime.
   – Referencing database snapshots
• Level 3 snapshot: A level 3 snapshot is a level 2
  snapshot with the additional requirement that all
  the external resources it references are at least
  level 2 snapshots at the snaptime.
Enforcing Environment Factors
• (1) Plus 0: If both environment factors are
  not enforced.
• (2) Plus 1: If the host variables are reset to
  the snapshot time.
• (3) Plus 2: If web technologies are
  compatible with the technologies at the
  snapshot time.
• (4) Plus 3: If both factors are enforced.
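
The snapshot levels and plus ratings above could be encoded as follows. This is a minimal sketch: the names `Plus`, `plus_rating`, `snapshot_level`, and `describe` are hypothetical, and the levels of referenced resources are assumed to be supplied as plain integers.

```python
from enum import IntEnum
from typing import Iterable


class Plus(IntEnum):
    NONE = 0  # neither environment factor is enforced
    HOST = 1  # host variables (e.g. system clock) reset to the snapshot time
    TECH = 2  # web technologies compatible with those at the snapshot time
    BOTH = 3  # both factors are enforced


def plus_rating(host_reset: bool, tech_compatible: bool) -> Plus:
    """Map the two environment factors to a plus rating."""
    if host_reset and tech_compatible:
        return Plus.BOTH
    if tech_compatible:
        return Plus.TECH
    if host_reset:
        return Plus.HOST
    return Plus.NONE


def snapshot_level(internal_levels: Iterable[int],
                   external_levels: Iterable[int]) -> int:
    """Level of a page whose own code has been saved at snaptime.

    internal_levels / external_levels hold the snapshot level of each
    referenced resource at the same snaptime (0 means no snapshot).
    A page that references nothing trivially satisfies both requirements.
    """
    level = 1
    if all(lv >= 1 for lv in internal_levels):
        level = 2
        if all(lv >= 2 for lv in external_levels):
            level = 3
    return level


def describe(level: int, plus: Plus) -> str:
    """Human-readable label for a snapshot state."""
    return f"Level {level} plus {int(plus)}"
```

For example, a page whose internal resources are all archived but whose external resources are not would be rated `snapshot_level([1, 1], [0])`, which is level 2.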
Possible Levels of Snapshot States
  Schemes for Tracking Changes

• Scheme for tracking website structure
  changes and web page code changes
  – A logging and archiving scheme
• Scheme for tracking content database changes
Design of a Logging and Archiving Scheme
      for Tracking Website Changes
• The log, named TemporalURLLog, has five
  fields: URL, PublishDate, DocExpireDate,
  URLExpireDate, and NewURL.
• The archived documents are saved in the
  Archive using URL + PublishDate as the file
  name.
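
One possible in-memory shape for a log record and its archive file name is sketched below. The class name `TemporalURLLogEntry`, the snake_case field names, and the `archive_file_name` helper are assumptions for illustration; only the five fields and the URL + PublishDate naming rule come from the design above. A real URL would also need characters such as `/` escaped before being used in a file name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemporalURLLogEntry:
    url: str                               # URL
    publish_date: str                      # PublishDate
    doc_expire_date: Optional[str] = None  # DocExpireDate (Null while the document is current)
    url_expire_date: Optional[str] = None  # URLExpireDate (Null while the URL is valid)
    new_url: Optional[str] = None          # NewURL (set when the page is renamed)


def archive_file_name(url: str, publish_date: str) -> str:
    """Archive name built from URL + PublishDate, e.g. P3 published at T2."""
    return f"{url}+{publish_date}"
```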
     Impacts of Website Changes to
      Historical Links and Archive
Time  Website Changes           Current Web Pages    Historical Links Generated  Snapshots in Archive
T0                              P1, P2, P3           None                        None
T1    P1 is renamed to P4       P2, P3, P4, P5       P1+T0
      P5 is added
T2    P2 is deleted             P3, P4, P5           P2+T0, P3+T0                P2+T0, P3+T0
      P3 is modified
T3    P3, P4, P5 are modified   P1, P3, P4, P5, P6   P3+T2, P4+T1, P5+T1         P3+T2, P4+T1, P5+T1
      P1, P6 are added
T4    P3 is deleted             P1, P3, P6, P7, P8   P3+T3, P4+T3, P5+T3         P3+T3
      P4 is renamed to P8
      P5 is renamed to P7
      A new page P3 is added
The contents of TemporalURLLog
URL   PublishDate   DocExpireDate   URLExpireDate NewURL

P1    T0            Null            T1           P4
P2    T0            T2              T2           Null
P3    T0            T2              Null         Null
P4    T1            T3              Null         Null
P4    T3            Null            T4           P8
P5    T1            T3              Null         Null
P3    T2            T3              Null         Null
P3    T3            T4              T4           Null
P5    T3            Null            T4           P7
P6    T3            Null            Null         Null
P1    T3            Null            Null         Null
P8    T4            Null            Null         Null
P7    T4            Null            Null         Null
P3    T4            Null            Null         Null
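
The log contents above can be reproduced by replaying the website changes from the preceding table. The sketch below is one way to do that, not necessarily how the original system was built: the class `TemporalURLLog` and its methods `add`, `modify`, `rename`, and `delete` are hypothetical names, times T0..T4 are represented as the integers 0..4, and the archive is modelled as a set of file names.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Entry:
    url: str
    publish: int
    doc_expire: Optional[int] = None
    url_expire: Optional[int] = None
    new_url: Optional[str] = None


class TemporalURLLog:
    def __init__(self) -> None:
        self.entries: List[Entry] = []
        self.archive: Set[str] = set()

    def _current(self, url: str) -> Entry:
        # The current version has neither an expired document nor an expired URL.
        for e in self.entries:
            if e.url == url and e.doc_expire is None and e.url_expire is None:
                return e
        raise KeyError(url)

    def add(self, url: str, t: int) -> None:
        self.entries.append(Entry(url, t))

    def modify(self, url: str, t: int) -> None:
        old = self._current(url)
        old.doc_expire = t                        # old document expires
        self.archive.add(f"{url}+T{old.publish}")  # and is archived
        self.entries.append(Entry(url, t))        # same URL, new document

    def rename(self, url: str, new_url: str, t: int) -> None:
        old = self._current(url)
        old.url_expire = t                        # URL expires, document does not
        old.new_url = new_url
        self.entries.append(Entry(new_url, t))

    def delete(self, url: str, t: int) -> None:
        old = self._current(url)
        old.doc_expire = t                        # both document and URL expire
        old.url_expire = t
        self.archive.add(f"{url}+T{old.publish}")


log = TemporalURLLog()
for page in ("P1", "P2", "P3"):                   # T0
    log.add(page, 0)
log.rename("P1", "P4", 1); log.add("P5", 1)      # T1
log.delete("P2", 2); log.modify("P3", 2)         # T2
for page in ("P3", "P4", "P5"):                   # T3
    log.modify(page, 3)
log.add("P1", 3); log.add("P6", 3)
log.delete("P3", 4)                               # T4 (delete before re-adding P3)
log.rename("P4", "P8", 4); log.rename("P5", "P7", 4); log.add("P3", 4)

rows = {(e.url, e.publish, e.doc_expire, e.url_expire, e.new_url)
        for e in log.entries}
```

Under these rules a rename expires only the URL, a modification expires only the document, and a deletion expires both, which matches the pattern of Null values in the log.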
       Examples of Using the Log
• Retrieve a snapshot of a current web page:
• Retrieve a deleted page:
• Retrieve the snapshot of a deleted web page:
   – The snapshot of P3 at T2 is in the Archive: P3+T2.
• Retrieve the current web page of an outdated URL:
   – An old URL P5 is now renamed to P7. If users submit a
     request for P5, it can be traced to P7.
• Retrieve the web page previously associated with a
  current link:
   – A historical link P1 is now renamed to P8, and a current
     link P1 points to a new web page. If the current web page
     associated with P1 is not what the users need, the request
     can be redirected to P8.
• Determine whether a URL ever existed:
   – A URL P12 has never existed in the web site.
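
Several of the lookups above can be sketched directly against the log contents. The functions `ever_existed`, `version_at`, `snapshot_name`, and `trace_to_current` are hypothetical helpers, times T0..T4 are written as integers 0..4, and the rows are copied from the TemporalURLLog table.

```python
from typing import List, Optional, Tuple

# (URL, PublishDate, DocExpireDate, URLExpireDate, NewURL)
Row = Tuple[str, int, Optional[int], Optional[int], Optional[str]]

LOG: List[Row] = [
    ("P1", 0, None, 1, "P4"), ("P2", 0, 2, 2, None), ("P3", 0, 2, None, None),
    ("P4", 1, 3, None, None), ("P4", 3, None, 4, "P8"), ("P5", 1, 3, None, None),
    ("P3", 2, 3, None, None), ("P3", 3, 4, 4, None), ("P5", 3, None, 4, "P7"),
    ("P6", 3, None, None, None), ("P1", 3, None, None, None),
    ("P8", 4, None, None, None), ("P7", 4, None, None, None),
    ("P3", 4, None, None, None),
]


def ever_existed(url: str) -> bool:
    """True if the URL appears anywhere in the log."""
    return any(row[0] == url for row in LOG)


def version_at(url: str, t: int) -> Optional[Row]:
    """Log row describing what `url` pointed to at time t, if anything."""
    for row in LOG:
        u, publish, doc_expire, url_expire, _ = row
        if u != url or publish > t:
            continue
        if doc_expire is not None and t >= doc_expire:
            continue
        if url_expire is not None and t >= url_expire:
            continue
        return row
    return None


def snapshot_name(url: str, t: int) -> Optional[str]:
    """Archive name (URL + PublishDate) of the version valid at time t."""
    row = version_at(url, t)
    return None if row is None else f"{url}+T{row[1]}"


def trace_to_current(url: str, t: int) -> Optional[str]:
    """Follow the page that `url` named at time t to its current URL.

    Returns None if that page was later deleted.
    """
    while True:
        row = version_at(url, t)
        if row is None:
            raise LookupError(f"{url} did not exist at time {t}")
        _, _, doc_expire, url_expire, new_url = row
        if url_expire is not None:
            if new_url is None:
                return None                 # deleted
            url, t = new_url, url_expire    # renamed: follow the new URL
        elif doc_expire is not None:
            t = doc_expire                  # modified: move to the next version
        else:
            return url                      # still current
```

With this data, `trace_to_current("P5", 1)` follows the rename to P7, `trace_to_current("P1", 0)` follows P1 through P4 to P8, and `ever_existed("P12")` is false.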
   Tracking Changes to Content Databases
• A web page may use content databases:
  – (1) as a source for querying.
  – (2) as storage for contents of placeholders on a
    web page.
Database Snapshot Management
• Defining snapshots:
    CREATE SNAPSHOT snapshotname
    AS query
    AS OF snaptime
• Refreshing snapshots:
    REFRESH SNAPSHOT snapshotname
    AS OF new snaptime
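
The statements above are written in a generic notation. Where a database has no built-in snapshot support, a similar effect could be approximated with a history table that records validity intervals. The sketch below uses Python's `sqlite3` module; the table `news_history`, its columns, the sample rows, and the helpers `create_snapshot` and `snapshot_rows` are all assumptions made for illustration, and the snapshot name is interpolated into the SQL text, so it must come from a trusted source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news_history ("
    "id INTEGER, headline TEXT, valid_from INTEGER, valid_to INTEGER)"
)
conn.executemany(
    "INSERT INTO news_history VALUES (?, ?, ?, ?)",
    [
        (1, "Old headline", 0, 2),      # replaced at time 2
        (1, "New headline", 2, None),   # still valid
        (2, "Second story", 1, None),
    ],
)


def create_snapshot(conn: sqlite3.Connection, name: str, snaptime: int) -> None:
    """Materialize the rows that were valid at `snaptime` into table `name`.

    Calling it again with a new snaptime plays the role of a refresh.
    """
    conn.execute(f'DROP TABLE IF EXISTS "{name}"')
    conn.execute(f'CREATE TABLE "{name}" (id INTEGER, headline TEXT)')
    conn.execute(
        f'INSERT INTO "{name}" '
        "SELECT id, headline FROM news_history "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)",
        (snaptime, snaptime),
    )


def snapshot_rows(conn: sqlite3.Connection, name: str):
    """Return the snapshot's rows ordered by id."""
    return conn.execute(
        f'SELECT id, headline FROM "{name}" ORDER BY id'
    ).fetchall()
```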
     Issues in Tracking Changes to
           Content Databases
• The content databases may exist in many formats:
   – XML, delimited text files, etc.
   – Not all content databases are supported by a snapshot
     management system.
• The website may not have authority over the
  management of the content databases.
• A web page may retrieve data from many
  content databases.
• There is no single way to design content
  databases.
Tracking Content Database Changes
     Using Log – An Example
• Assuming:
  – One content database supports many web pages.
  – Each page contains many placeholders.
• Log design:
  – PageID + PlaceHolderID + Content + Update
    Flag + Time Stamp
     • PageID is (URL + Page publish time)
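
A content log with these fields could be replayed to recover a page's placeholder contents at a given time. In the sketch below, the update flag values "I", "U", and "D" (insert, update, delete), the integer time stamps, the sample entries, and the function `page_content_at` are assumptions; the log design itself only names the fields.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class ContentLogEntry(NamedTuple):
    page_id: Tuple[str, int]   # (URL, page publish time)
    placeholder_id: str
    content: Optional[str]
    update_flag: str           # assumed values: "I", "U", "D"
    timestamp: int


SAMPLE_LOG: List[ContentLogEntry] = [
    ContentLogEntry(("P5", 1), "headline", "Spring sale", "I", 1),
    ContentLogEntry(("P5", 1), "footer", "Contact us", "I", 1),
    ContentLogEntry(("P5", 1), "headline", "Spring sale extended", "U", 2),
    ContentLogEntry(("P5", 1), "footer", None, "D", 3),
]


def page_content_at(log: List[ContentLogEntry],
                    page_id: Tuple[str, int],
                    t: int) -> Dict[str, Optional[str]]:
    """Replay the log up to time t and return placeholder -> content."""
    state: Dict[str, Optional[str]] = {}
    for entry in sorted(log, key=lambda e: e.timestamp):
        if entry.page_id != page_id or entry.timestamp > t:
            continue
        if entry.update_flag == "D":
            state.pop(entry.placeholder_id, None)   # placeholder removed
        else:
            state[entry.placeholder_id] = entry.content
    return state
```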
 Working with the TemporalURLLog
• Because a web page’s URL may change,
  the content database log needs the support
  of the TemporalURLLog to track the
  changes of URL.
• Example:
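
As a minimal sketch of how the two logs might cooperate, the code below looks up the placeholder contents that the page now published at a current URL had at an earlier time. The P5/P7 rows are taken from the TemporalURLLog contents; the content log entries, the flag values, and the functions `page_id_at` and `content_at` are made up for illustration. It also assumes that each new page version logs its placeholder contents again under its new PageID.

```python
from typing import Dict, Optional, Tuple

PageID = Tuple[str, int]

# (URL, PublishDate, DocExpireDate, URLExpireDate, NewURL), times as integers
URL_LOG = [
    ("P5", 1, 3, None, None),
    ("P5", 3, None, 4, "P7"),
    ("P7", 4, None, None, None),
]

# (PageID, PlaceHolderID, Content, UpdateFlag, TimeStamp), hypothetical contents
CONTENT_LOG = [
    (("P5", 1), "headline", "Spring sale", "I", 1),
    (("P5", 3), "headline", "Summer sale", "I", 3),
    (("P7", 4), "headline", "Autumn sale", "I", 4),
]


def page_id_at(current_url: str, t: int) -> Optional[PageID]:
    """PageID that the page now at `current_url` had at time t.

    Walks backwards through modifications and renames; returns None if
    the page did not exist yet at time t.
    """
    versions = [r for r in URL_LOG if r[0] == current_url]
    if not versions:
        return None
    row = max(versions, key=lambda r: r[1])   # latest version of the URL
    while row[1] > t:
        predecessors = [
            r for r in URL_LOG
            # earlier document under the same URL (modified, not deleted) ...
            if (r[0] == row[0] and r[2] == row[1] and r[3] is None)
            # ... or the old URL that was renamed to this one
            or (r[4] == row[0] and r[3] == row[1])
        ]
        if not predecessors:
            return None
        row = predecessors[0]
    return (row[0], row[1])


def content_at(current_url: str, t: int) -> Dict[str, Optional[str]]:
    """Placeholder contents at time t for the page now at `current_url`."""
    page_id = page_id_at(current_url, t)
    state: Dict[str, Optional[str]] = {}
    if page_id is None:
        return state
    for pid, placeholder, content, flag, stamp in sorted(
            CONTENT_LOG, key=lambda e: e[4]):
        if pid == page_id and stamp <= t:
            if flag == "D":
                state.pop(placeholder, None)
            else:
                state[placeholder] = content
    return state
```

Here a request for the contents of P7 as of T3 is first mapped to the PageID (P5, T3) through the URL log, and only then matched against the content log.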
 Delivering Historical Resources to Users

• A website consists of:
  – (1) a current website where current web pages
    are published.
  – (2) a historical website where historical
    resources are stored and accessed.
• A typical web server serves requests for
  current web pages only and is inadequate to
  serve a request for historical information.
The Design of a Web Page Snapshot
      Management System
• We developed a scheme to track changes to
  website structure, web pages and files
  referenced by web pages, and a second
  scheme to track changes to content
  databases so that the website is capable of
  creating Level 2 snapshots.
