WAX: A candle in the darkness
A digital to digital project
Wendy Gogel, Andrea Goethals
Harvard University Library, Office for Information Systems
May 1, 2009
Today’s Journey
• The Darkness – The Web
  Introducing the challenge of web archiving
• The Candle – WAX
  HUL’s Web Archive Collection Service
• The Light – The Collections
  Demonstrating the results
The Darkness: The Web
The Challenges of Web Archiving

• A fleeting record – here today, gone
  tomorrow
   •   Government Documents
   •   Public Debate
   •   Culture
   •   Personal expression
   •   University Output
Harvard Magazine May/June 2009
Curator Activities
• Selection
• Acquisition
• Rights management
• Quality assurance
• Arrangement
• Storage
• Description and indexing for discovery
  (cataloguing, searching, browsing)
• Presentations and exhibitions
• Preservation
IP and Other Legal Risks
  • Copyright infringement
  • State tort liability
     • Civil damages, resulting from invasion
       of privacy, sensitive personal data,
       commercial content, defamatory
       content
  • Statutory content restrictions
  • Foreign Laws
Preservation Challenges
• We were not there at creation
   • Viruses more likely
   • Formats misidentify themselves
   • A lot of formats are invalid (especially HTML)
• It’s a moving target – what should we
  preserve?
   •   Evolving born digital formats
   •   Proliferation of formats
   •   Partial capture
   •   Complex behaviors and styles
• Complex delivery to maintain
   • Hyperlinked resources
   • Multiple renderers will continue to evolve
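One way to catch formats that misidentify themselves is to compare the declared MIME type with the leading bytes of the captured resource. The sketch below is illustrative only: the signature table and the function names are assumptions made for this example, not part of WAX, and a production service would rely on a fuller format-identification tool.

```python
# Well-known leading-byte signatures for a few common formats.
# The selection is an assumption for this sketch, not a complete registry.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff_format(payload: bytes):
    """Return the MIME type implied by the leading bytes, or None if unknown."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    return None


def is_misidentified(declared_mime: str, payload: bytes) -> bool:
    """True when the bytes clearly belong to a different format than declared."""
    sniffed = sniff_format(payload)
    return sniffed is not None and sniffed != declared_mime.strip().lower()
```

A resource whose bytes match no signature is reported as not misidentified, because the sketch has no evidence either way.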
2006/07 Alternatives

Option                      | Selection | Crawling | Management (QA and Metadata) | Storage             | Preservation                                   | Discovery and Display       | Notes
Wayback (IA)                | No        | Yes      | No                           | Yes                 | Partial: replicated storage, not Harvard owned | No full-text searching      |
Contract IA                 | Yes       | Yes      | No, handle in-house          | No, handle in-house | No, handle in-house                            | No, handle in-house         |
Archive It! (IA)            | Yes       | Yes      | Minimal, has since improved  | Yes                 | Partial: replicated storage                    | Minimal, has since improved | 2008 costs: $16,000/yr; $2,000/yr Harvard copy
Customize IIPC Tools (WAX)* | Yes       | Yes      | Yes                          | Yes                 | More than others                               | Yes                         |

* Additional benefit of integration with HUL central services
The Candle: WAX
HUL’s Web Archiving Project
•   2.5 year pilot project funded by LDI
•   Key Goals
    1. Gain experience in domain
    2. Explore legal terrain
    3. Investigate sustainability of a Harvard
       web archiving service
       •   quantify technical, human, and financial
           requirements
       •   aim for operational efficiencies
Project Players
1. Curators and Collection Managers
   •   Harvard University Archives
   •   Schlesinger Library on the History of
       Women in America
   •   Edwin O. Reischauer Institute of
       Japanese Studies
2. Legal Counsel – Office of General
   Counsel (OGC)
3. Technologists - OIS
What Did We Build? WAX
Third Party Software
• International Internet Preservation
  Consortium (IIPC) tools
    www.netpreserve.org
    • Heritrix
    • HCC
    • NutchWAX
    • Wayback
•   JBoss
•   Oracle
•   Struts
•   Tomcat
•   Quartz job scheduler
The Web is vast
and
interconnected.

How do you
specify the part
you want to
capture?

Or “training a
web crawler”…
How to Train a Web Crawler
1. Tell it where to start
   •   “Seed URI”
2. Tell it what to collect and where to
   stop
   •   “Scope”
3. Tell it when and how often
   •   “Schedule”
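The three instructions above can be pictured as one small record per website. This is a minimal sketch, assuming a schedule expressed as a fixed number of days between crawls; the class name, field names, and the `is_due` helper are hypothetical and do not describe the actual WAX data model.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class HarvestProfile:
    """Hypothetical profile: where to start, what to collect, how often."""
    seed_uri: str        # 1. where the crawler starts
    scope: str           # 2. what to collect and where to stop
    interval_days: int   # 3. how often to crawl (assumed: fixed interval)

    def is_due(self, last_harvest: Optional[date], today: date) -> bool:
        """A profile is due if it was never harvested or the interval has passed."""
        if last_harvest is None:
            return True
        return today - last_harvest >= timedelta(days=self.interval_days)


profile = HarvestProfile("http://example.org/", scope="host", interval_days=30)
```

A scheduler could then loop over all profiles each day and start a crawl for every profile where `is_due` returns true.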
Web Archiving Steps
1.   Create a harvest profile
     Identify website URI (“seed”), define scope and
         schedule
2.   Harvest web site
3.   QA harvest
4.   Send harvest to DRS
5.   Index harvest
     Becomes searchable and viewable by users


A lot of work per website –
which steps can be automated?
Web Archiving Steps
Manual by curator → 1. Create a harvest profile
Automated by scheduler and crawler software → 2. Harvest web site
Manual by curator → 3. QA harvest
Manual by curator → 4. Send harvest to DRS
Automated by indexing software → 5. Index harvest
Workflow Efficiencies
• Curator’s manual tasks:
   • Create a harvest profile
       • 3 scopes: Directory, host and host+1
       • Schedules
       • Global excluded URIs
   • QA harvests
       • Remove unwanted pieces
       • Detect missing pieces
       • Refinement of seed scope
   • Send harvests to DRS

How can the system help with these tasks?
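As an illustration of the three scopes, the sketch below decides whether a discovered URI belongs to a crawl. The slides do not define the scopes precisely, so the interpretation here is an assumption: "directory" means the seed's directory and everything beneath it, "host" means the whole seed host, and "host+1" means the seed host plus pages exactly one link off that host. The function name and the `hops_off_host` parameter are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(seed_uri: str, candidate_uri: str, scope: str,
             hops_off_host: int = 0) -> bool:
    """Decide whether candidate_uri falls inside the scope of seed_uri.

    hops_off_host is the number of consecutive links followed away from
    the seed host to reach the candidate (supplied by the caller).
    """
    seed, cand = urlsplit(seed_uri), urlsplit(candidate_uri)
    same_host = seed.hostname == cand.hostname
    if scope == "directory":
        # Directory of the seed, e.g. /dept/news/index.html -> /dept/news/
        seed_dir = seed.path.rsplit("/", 1)[0] + "/"
        return same_host and (cand.path or "/").startswith(seed_dir)
    if scope == "host":
        return same_host
    if scope == "host+1":
        # Assumed meaning: stay on the host, or step exactly one link off it.
        return same_host or hops_off_host == 1
    raise ValueError(f"unknown scope: {scope}")
```

Narrower scopes mean less QA work later, since fewer unwanted pieces are harvested in the first place.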
Efficiencies: QA Harvests
•   Exclude URIs
    from future
    crawls
•   Delete URIs from
    harvest
•   Delete URIs from
    harvest and
    Exclude them
    from future
    crawls
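The three QA actions differ only in which of two lists they touch: the contents of the current harvest and the exclusions applied to future crawls. A minimal sketch, with a hypothetical class and method names that are not taken from WAX:

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class HarvestQA:
    """Hypothetical QA state for one harvest of one seed."""
    harvested: Set[str] = field(default_factory=set)  # URIs in this harvest
    excluded: Set[str] = field(default_factory=set)   # URIs skipped in future crawls

    def exclude(self, uri: str) -> None:
        """Keep the URI in this harvest but skip it in future crawls."""
        self.excluded.add(uri)

    def delete(self, uri: str) -> None:
        """Remove the URI from this harvest only."""
        self.harvested.discard(uri)

    def delete_and_exclude(self, uri: str) -> None:
        """Remove the URI now and skip it in future crawls."""
        self.delete(uri)
        self.exclude(uri)
```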
Efficiencies: Send Harvests to DRS
The Ultimate Shortcut?
• Can pre-configure WAX to send
  harvests directly to the DRS
  • Skip QA step
  • Skip push to archive step
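The effect of the shortcut on the curator's workload can be sketched as a small plan builder. The step names come from the Web Archiving Steps slide; the `send_directly` flag and the function names are hypothetical stand-ins for however WAX is actually pre-configured.

```python
# (step name, who performs it), following the Web Archiving Steps slide.
STEPS = [
    ("Create a harvest profile", "curator"),
    ("Harvest web site", "scheduler and crawler software"),
    ("QA harvest", "curator"),
    ("Send harvest to DRS", "curator"),
    ("Index harvest", "indexing software"),
]


def plan_workflow(send_directly: bool):
    """Return the steps for one harvest; send_directly is a hypothetical flag."""
    plan = []
    for name, actor in STEPS:
        if send_directly and name == "QA harvest":
            continue  # QA step is skipped
        if send_directly and name == "Send harvest to DRS":
            actor = "WAX (pre-configured)"  # no manual push by the curator
        plan.append((name, actor))
    return plan


def manual_steps(plan):
    """List the steps a curator still has to perform by hand."""
    return [name for name, actor in plan if actor == "curator"]
```

With the shortcut enabled, the only manual step left in this sketch is creating the harvest profile; the trade-off is that nothing is reviewed before it reaches the DRS.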
Web Harvest Objects:
Unit of Preservation in the DRS
• For each crawl starting from a seed URI:
   • One or more ARC files (*.arc.gz)
      • contain one or more “resources” - the
        individual HTML, JPEG, Javascript, etc.
        files that make up the harvested web
        pages
   • Crawl log
      • records all URI requests, regardless of
        result
   • Crawler configuration
   • Metadata
      • descriptive, administrative, technical
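Because the crawl log records every URI request, it is a convenient place to start when checking a harvest. The sketch below tallies status codes and MIME types, assuming a Heritrix-style log in which whitespace-separated fields begin with timestamp, status code, size, URI, discovery path, referrer, and MIME type. That column order should be checked against the crawler's documentation, and the sample lines are invented for illustration.

```python
from collections import Counter

# Invented sample lines in the assumed Heritrix-style layout.
SAMPLE_LOG = [
    "2009-04-01T12:00:00.000Z 200 5120 http://example.org/ - - text/html #001 - - - -",
    "2009-04-01T12:00:01.000Z 200 2048 http://example.org/logo.png E http://example.org/ image/png #002 - - - -",
    "2009-04-01T12:00:02.000Z 404 0 http://example.org/missing L http://example.org/ text/html #003 - - - -",
]


def summarize_crawl_log(lines):
    """Count status codes and MIME types across all logged URI requests."""
    statuses, mimes = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        status, mime = fields[1], fields[6]
        statuses[status] += 1
        mimes[mime] += 1
    return statuses, mimes


statuses, mimes = summarize_crawl_log(SAMPLE_LOG)
```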
WAX Legal Mitigations: Crawls
• Polite crawling
   • Obey robots.txt
   • Leave WAX crawler information in logs
• Employ a respectful “request
  frequency” during crawls
   • Don’t overload web servers
• Capture surface web only
   • No attempt to crawl protected content
• Choice of offsite crawler for curators
   • Non-Harvard IP address
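Obeying robots.txt and spacing out requests can be sketched with Python's standard `urllib.robotparser`. The user-agent string and the five-second default delay are assumptions made for this example; the slides do not state the values WAX uses.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "wax-crawler-example"  # hypothetical crawler name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a site's robots.txt into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_fetch(rules: RobotFileParser, uri: str) -> bool:
    """True if robots.txt allows this crawler to request the URI."""
    return rules.can_fetch(USER_AGENT, uri)


def request_delay(rules: RobotFileParser, default_seconds: float = 5.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if given,
    otherwise an assumed default."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds
```

A crawler loop would call `may_fetch` before each request and sleep for `request_delay` seconds between requests to the same server.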
WAX Legal Mitigations: Use
• Don’t compete with or divert traffic
  from live site
   • Exclude robots from the WAX archive
   • Add transformative content
      • Framing
      • Presentation pages with original
        intellectual content
   • Embargo display for 3 months
   • Link to live site
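The three-month display embargo amounts to a date comparison. This sketch assumes the embargo is counted in calendar months from the harvest date, which the slides do not specify; the function names are hypothetical.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 3  # from the slide; counting from harvest date is an assumption


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole calendar months, clamping the day
    to the length of the target month (e.g. 30 Nov + 3 months -> 28 Feb)."""
    index = d.month - 1 + months
    year, month = d.year + index // 12, index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_displayable(harvested_on: date, today: date) -> bool:
    """True once the embargo period after the harvest has elapsed."""
    return today >= add_months(harvested_on, EMBARGO_MONTHS)
```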
The Collections
• 191 “seeds” identified by curators for
  harvesting
• Stored in DRS:
   • Over 8 million web archive resources
   • 365.17 gigabytes of storage ($913/year)
   • 291 mime types
application/x-download                          application/x-java-vm             Shockwave
message/rfc822                                  text/Javascript                   audio/x-realaudio
image/x-portable-anymap                         text\css                          chemical/mdl-rdf
javascript/x-javascript                         application/x-Shockwave-Flash     content-type
application/bds                                 png                               text/text
image/png?ver=074219b2138e87ecf980914471183dfc  text/x-c++                        Text/HTML
application/xrds+xml                            image/x-cmu-raster                audio/mid
"text/xml"                                      httpd/yahoo-send-as-is            text/Calendar
image/x-bmp                                     application/x-mpeg                application/x-wais-source
gif                                             Video/X-Flv                       application/x-perl
application/x-rar-compressed                    text/x-python                     image/txt
Image/png                                       audio/x-scpls                     Applicationxm
mime/type                                       application/pgp-keys              PNG
image/null                                      text/calendar                     x-png
text/troff                                      text/x-vcard                      unknown/unknown
application/vnd.sun.xml.impress                 application/octet-string          text/x-javascript
text/enriched                                   application/x-troff-me            application/octetstream
application/icalendar                           video/x-m4v                       Image
application-x/javascript                        application/pgp-signature         application/x-sh
x-mapp-php4                                     image/x-portable-graymap          audio/x-mpegurl
imag/x-icon                                     image/#{favicon_formats[format]}  audio/unknown
application/x-shockwave-flash2-preview          image/files/curryjpg              chemical/x-xyz
Swish                                           test/xml                          application/perl
image/x-photoshop                               text/x-invalid                    application/x.atom+xml
application/x-quicktimeplayer                   video/x-flv                       application/octet_stream
                                                text/javascript+json              video/mp4
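Many of the values above differ only in case, quoting, or stray characters, so part of the count of distinct types comes from how servers report them rather than from real format variety. The sketch below shows one possible cleanup pass. The rules and the function name are assumptions for illustration, not the method used in WAX or the DRS.

```python
import re

# type/subtype made of lowercase token characters; a deliberately simple check.
_VALID = re.compile(r"^[a-z0-9][a-z0-9!#$&^_.+-]*/[a-z0-9][a-z0-9!#$&^_.+-]*$")


def normalize_mime(raw: str):
    """Return a cleaned type/subtype string, or None if none can be recovered."""
    value = raw.strip().strip('"').lower()   # drop quotes, fold case
    value = value.replace("\\", "/")         # text\css -> text/css
    value = value.split("?", 1)[0]           # drop query-string debris
    value = value.split(";", 1)[0].strip()   # drop parameters such as charset
    return value if _VALID.match(value) else None
```

For example, `Text/HTML` becomes `text/html` and `"text/xml"` becomes `text/xml`, while bare values such as `gif` are rejected. The check is purely syntactic: a misspelling like `test/xml` still passes, so identifying the real format requires inspecting the content itself.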
The Light: The Collections
The Partners
Megan Sniffin-Marinoff, University Archivist

      A-Sites: Archived Harvard Web Sites collected by the
      Harvard University Archives

Marilyn Dunn, Executive Director of the Schlesinger Library
     and Librarian of the Radcliffe Institute

     Blogs: Capturing Women's Voices collected by the Arthur and
     Elizabeth Schlesinger Library on the History of Women in
     America

Helen Hardacre, Reischauer Institute Professor of Japanese
    Religions and Society

      Web Archiving Project on Constitutional Revision collected
      by the Reischauer Institute of Japanese Studies with
      Sponsorship from the Harvard College Library Documentation
      Center on Contemporary Japan
To Participate
http://hul.harvard.edu/ois/systems/wax
Questions?

“…we have rather chosen to fill our hives
  with honey and wax, thus furnishing
  mankind with the two noblest of things,
  which are sweetness and light.”

                  Jonathan Swift
     Image Credits
Title slide:
http://www.flickr.com/photos/lwr/59014972/in/set-1552655/
The darkness:
http://www.melegraph.com/images/outerspace.jpg
The candle:
http://www.sxc.hu/pic/m/a/as/asolario/472153_peach_votive_candle.jpg
The Web:
http://projecta-z.com/Internet_map_1024.jpg
The light:
http://i252.photobucket.com/albums/hh2/habeba2007/candles-1-1.gif

								