PANDORA
Australia’s Web Archive
Library Science Talks
SNL/CERN, September 2004
Paul Koerbin
Digital Archiving Branch
National Library of Australia
pkoerbin@nla.gov.au



1. Background and approach to web
   archiving
2. The management system (PANDAS)
3. Workflows and procedures
4. Issues and future directions




1. Background and approach to web
   archiving in Australia - PANDORA
Beginnings
• Name originally an acronym for:
  ‘Preserving and Accessing Networked
  Documentary Resources of Australia’
• Now: ‘Australia’s Web Archive’
• Selection began in mid-1996
• Archiving began in late 1996 to early 1997
Approach
• Practical and pragmatic
• Began as: Proof-of-concept project
• Now: Routine National Library activity
• Achieving outcomes while continuing to
  develop and extend processes and systems
• Best use of available resources and
  infrastructure
Resources
• Existing technical services staff - librarians
• Digital Archiving Branch has the business
  responsibility
• Information technology staff from within
  the Library for development and support
• PANDORA partner institutions (10
  including the NLA)
Mandate and responsibilities
• National Library of Australia’s statutory
  responsibilities
• National Library Act, 1960
• Maintain and develop a national collection
  of ‘library material’
• Comprehensive collection relating to
  Australia and the Australian people
Mandate and responsibilities
• National Library has a leadership role for
  the Australian library community
Legal deposit
• Legal deposit in the federal jurisdiction in
  Australia does not cover electronic
  resources
Some key characteristics
• Selective approach to archiving online
  resources
• Scalable to available resources and do-able
• Negotiate permission to archive
• Apply manual quality assurance processes
  to harvested resources
• Provide access to the archived resources

Shortcomings of selective approach
• Can’t collect everything that future
  researchers may want
• Labour intensive tasks
• Does not retain the full complexity of the
  linking structure of the Internet

Indicative statistics as at August 2004
•   6,500+ titles
•   13,000+ archived instances
•   21 million files*
•   680 gigabytes*
*These figures are for the display copy only. Two more preservation copies plus
   preservation metadata are maintained.




2. The management system: PANDAS

• PANDAS – PANDORA Digital Archiving
  System
• Integrated web based system
• Workflow management system
• Developed specifically to manage the web
  archiving processes at the National Library
  of Australia
• Used by PANDORA’s partners located
  throughout Australia
• Developed in-house at the NLA
• Replaced multiple non-integrated systems
  used between 1996 and 2001
• Written in Java on Apple WebObjects
  application development platform
• First version released in June 2001
• Second version released August 2002
• Ongoing enhancement and development
  program

PANDAS system architecture consists of 4 layers

• 1) Presentation layer – client applications
  for visual presentation to the end user
• 2) Application layer – the core application
  functionality such as PANDAS and
  PANDORA
• 3) Business layer – application access to the
  data storage and communication
  infrastructure
• 4) Data layer – third-party infrastructure
  products, e.g. Oracle database and
  WebDAV-accessible file servers
Nomenclature
• PANDORA – the whole enterprise
• PANDAS – the whole management system
• PANDAS – the system component
  providing a web-based user application to
  manage workflows
• PANDORA – the system component that
  creates the public interface
PANDAS is used to:
• Record administrative metadata about titles
  selected (or rejected or monitored) for
  archiving
• Schedule and initiate harvesting
• Manage quality assurance checking and
  problem fixing
• Prepare items for public display through the
  PANDORA home page
• Manage access restrictions
• Generate management reports

PANDAS is a workflow system that:
• Connects with and utilises other software
  and protocols for specific functions
• Provides an interface to the harvesting
  software – currently this is HTTrack
  (http://www.httrack.com)
• Uses WebDAV protocol to provide content
  managers with remote access to the
  harvested files
• Uses Z39.50 protocol to access the National
  Bibliographic Database to extract metadata
  from the MARC record

PANDORA public interface component
• Title and subject listings and title entry
  pages are generated ‘on-the-fly’ from
  PANDAS metadata
• Some static web pages (documents,
  information)
• Search engine
Persistent identifiers and URLs
• Running number generated by PANDAS
• Persistent URL applied to title entry page
   http://nla.gov.au/nla.arc-21220
• Logically extended to any resource in the
  Archive
   http://nla.gov.au/nla.arc-21220-20030822-www.ipjp.org/september2002/schweitzer-ed.html
• Citation generator on public interface
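
The two example URLs above suggest a simple pattern: the title's running number, then optionally the harvest date and the original host and path. A minimal sketch of that pattern, inferred only from the examples on this slide; the function names are hypothetical and this is not the PANDAS implementation:

```python
from urllib.parse import urlsplit

# Resolver prefix as it appears in the slide's example URLs.
RESOLVER = "http://nla.gov.au/"


def title_url(title_number: int) -> str:
    """Persistent URL for a title entry page, e.g. nla.arc-21220."""
    return f"{RESOLVER}nla.arc-{title_number}"


def resource_url(title_number: int, harvest_date: str, original_url: str) -> str:
    """Extend the title identifier to one archived resource.

    harvest_date is a YYYYMMDD string. The scheme of the original URL is
    dropped and only host plus path are kept, as in the slide's example;
    treating query strings this way is an assumption of this sketch.
    """
    parts = urlsplit(original_url)
    target = parts.netloc + parts.path
    return f"{title_url(title_number)}-{harvest_date}-{target}"


print(title_url(21220))
print(resource_url(21220, "20030822",
                   "http://www.ipjp.org/september2002/schweitzer-ed.html"))
```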




3. Workflows and procedures
•   Identifying and selecting
•   Recording administrative metadata
•   Harvesting
•   Quality assurance processing
•   Archiving
•   Preparing for public display
•   Creating resource discovery metadata
•   Reporting
Identifying and selecting
• Selection guidelines – each partner has their
  own guidelines
• Just guidelines … not rules nor ideology
• Selection priorities in guidelines (NLA)
• Notification networks – indexing agencies,
  staff, publishers, public
• NLA selection guidelines available at:
  http://pandora.nla.gov.au/selectionguidelines.html

Selection – what sort of publications?
•   Titles – the entities to be archived
•   Defined during the selection process
•   Document-like publications, e.g. PDF
•   Whole web sites
•   Parts of web sites
Selection – what sort of publications?
• Focus on content – substantial, unique
• Special events or issues
• Format or potential technical problems are
  not, in principle, a selection consideration
• One-off archiving
• Scheduled archiving – whole entity, not an
  update
Recording administrative metadata
• Four types of records
  –   Title
  –   Publisher
  –   Indexer
  –   Collection
• Selection status
• Additional details associated with status
  (standing)
Administrative metadata
•   Publisher details
•   Archiving permission status
•   Access restrictions
•   Notes
•   Assigning ownership of titles
•   Transfer titles between agencies
Harvesting
• Mostly harvesting from the Web
• Also able to upload from local drives
  (WebDAV protocol)
• Third party software – HTTrack
• PANDAS interface to set up harvesting
  rules

Harvesting
• Define extent of selected resource to be
  archived
• Set gather filters and gather settings
• Set gather schedule
• Initiate harvesting

Scheduling harvesting
• Significant function of PANDAS
• Regular schedules, e.g. weekly, monthly,
  annual
• Specific dates
• Harvest now
• Combination of scheduling options
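
The scheduling options listed above (regular intervals, specific dates, and combinations of the two) can be illustrated with a small date calculation. This is a sketch under assumptions: the frequency names, the function names and the rule for combining options (take whichever date comes first) are illustrative and are not taken from PANDAS.

```python
import calendar
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Move a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_regular(last: date, frequency: str) -> date:
    """Next gather date for a regular schedule (hypothetical frequency names)."""
    if frequency == "weekly":
        return last + timedelta(weeks=1)
    if frequency == "monthly":
        return add_months(last, 1)
    if frequency == "annual":
        return add_months(last, 12)
    raise ValueError(f"unknown frequency: {frequency}")


def next_due(last: date, frequency: str, specific_dates=()) -> date:
    """Combine a regular schedule with one-off dates: the earliest one wins."""
    candidates = [next_regular(last, frequency)]
    candidates += [d for d in specific_dates if d > last]
    return min(candidates)


# A monthly title with an extra one-off gather requested for 15 August 2004.
print(next_due(date(2004, 8, 1), "monthly", [date(2004, 8, 15)]))
```

A "harvest now" request would simply bypass this calculation and queue the gather immediately.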
Harvesting - filters and settings
• Default settings
• Ignore robots.txt rules because permission to
  archive has been obtained from publisher
• Gather sub-directories
• Gather ‘near files’, e.g. linked images
• Limit on depth – sufficient for any web site,
  but set to prevent abuse of the host server

Harvesting - filters and settings
•   Gather filters are critical
•   Selection based on specific content
•   Archiving permission for specific content
•   Efficient use of resources (bandwidth,
    storage)
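
To show why gather filters matter for restricting a harvest to the selected and permitted content, here is a minimal sketch of include/exclude filtering. It is loosely modelled on the "+pattern" / "-pattern" style of filter lists; the matching below uses Python's `fnmatch` and is an assumption of this sketch, not the harvester's actual matching engine. The example host and paths are made up.

```python
from fnmatch import fnmatchcase


def allowed(url: str, filters: list[str], default: bool = False) -> bool:
    """Apply ordered +/- wildcard rules; the last matching rule decides."""
    decision = default
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {rule!r}")
        if fnmatchcase(url, pattern):
            decision = sign == "+"
    return decision


# Gather only the reports section, but skip large zip downloads.
filters = ["+www.example.org/reports/*", "-*.zip"]
for url in ("www.example.org/reports/2004/annual.pdf",
            "www.example.org/reports/data.zip",
            "www.example.org/news/index.html"):
    print(url, allowed(url, filters))
```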
Quality assurance
• Important process for PANDORA
• Owner of title notified when harvest is
  complete
• Visual, manual checking process
• Check for completeness and functionality
• Check that content is new (if previously
  archived)
• Check that there is no extraneous material
Quality assurance
• Harvested files in a working area – not
  ‘archived’ at this stage
• WebDAV (protocol) access to the working
  area
• Problem analysis and fixing
• Missing files, broken links
• Complex problems referred to IT support
  through PANDAS error reporting module
Quality assurance
• Problems due to limitations of harvesting
  software
• Excessive use of JavaScript
• Deep web resources
• Traps such as metafiles, absolute links
• Other methods of acquisition (CD, FTP)
• Business decision whether or not to accept
  the harvested instance
Archiving
• Harvested instance is accepted
• One-click process for PANDAS user
• Transfers instance from working area to
  Digital Object Storage System
• Creates preservation and display copies
• Perl scripts – e.g. re-write external links
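
The slide mentions Perl scripts that re-write external links in the archived copy. The sketch below shows the general idea in Python rather than Perl. The notice-page path, the `link` query parameter and the function name are hypothetical, and a regular expression is used only for brevity; it handles double-quoted absolute `href` values and nothing else.

```python
import re
from urllib.parse import quote, urlsplit

# Hypothetical page that tells the reader they are leaving the archive.
NOTICE_PAGE = "/external.html"

# Absolute http(s) links in double-quoted href attributes.
HREF = re.compile(r'(href\s*=\s*")(https?://[^"]+)(")', re.IGNORECASE)


def rewrite_external_links(html: str, archived_hosts: set[str]) -> str:
    """Point links to hosts outside the archived instance at a notice page."""
    def replace(match: re.Match) -> str:
        url = match.group(2)
        host = urlsplit(url).hostname or ""
        if host in archived_hosts:
            return match.group(0)  # internal link: leave untouched
        encoded = quote(url, safe="")
        return f"{match.group(1)}{NOTICE_PAGE}?link={encoded}{match.group(3)}"
    return HREF.sub(replace, html)


page = ('<a href="http://www.ipjp.org/a.html">in</a> '
        '<a href="http://example.com/x">out</a>')
print(rewrite_external_links(page, {"www.ipjp.org"}))
```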
Archiving – preservation master copies
• Preservation master – incl. harvest log files
• Display master – includes changes made to
  the harvested instance (manual and scripts)
• Metadata master – HTTP header responses
• Gzip-compressed tarball (tar: Tape Archive
  format) on Digital Object Storage System
  (DOSS)
• Access (display) copy on web server
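
As an illustration of the storage format named above, the following sketch packs a harvested instance directory into a gzip-compressed tarball using Python's standard `tarfile` module. The directory layout and function name are assumptions; how PANDAS itself builds the master copies is not described on the slide.

```python
import tarfile
from pathlib import Path


def make_master(instance_dir: Path, out_path: Path) -> Path:
    """Write instance_dir and its contents to a .tar.gz file at out_path."""
    with tarfile.open(str(out_path), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the instance.
        tar.add(str(instance_dir), arcname=instance_dir.name)
    return out_path


def list_master(path: Path) -> list[str]:
    """Return the member names stored in a gzip-compressed tarball."""
    with tarfile.open(str(path), "r:gz") as tar:
        return tar.getnames()
```

For example, `make_master(Path("21220/20030822"), Path("21220-20030822.tar.gz"))` would produce one tarball per archived instance; both paths here are made up for the example.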
Preparing for public access – title entry pages
• Generated ‘on-the-fly’ from content of
  PANDAS database
• Partner branding
• Link to publisher’s site
• Links to dated archived instances
• Manual additions – notes, links to serial
  issues, copyright statement

Preparing for public access – listings and collections
•   Subject listings
•   Title listings
•   Partner views
•   Collections – events, sampling over specific
    time period
Public access – restrictions
• Period
• Date
• Authentication
• IP addresses/subnet mask (i.e. physical
  locations such as a single PC in the NLA
  main reading room)
• PANDAS manages automatically – can be
  manually enabled/disabled
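
The IP address/subnet mask restriction can be sketched with Python's standard `ipaddress` module. Only that one restriction type is shown; the function name and the example addresses (taken from the documentation range 192.0.2.0/24) are assumptions, not PANDAS configuration.

```python
from ipaddress import ip_address, ip_network


def may_view(client_ip: str, allowed_networks: list[str]) -> bool:
    """True if the client address falls inside any allowed network."""
    addr = ip_address(client_ip)
    return any(addr in ip_network(net) for net in allowed_networks)


# One reading-room PC (a /32) plus a staff subnet given with a subnet mask.
allowed = ["192.0.2.17/32", "192.0.2.128/255.255.255.128"]
print(may_view("192.0.2.17", allowed))   # the single PC
print(may_view("192.0.2.200", allowed))  # inside the staff subnet
print(may_view("192.0.2.18", allowed))   # neither
```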
Creating resource discovery metadata
• MARC record for each title
• National Library of Australia OPAC
• National Bibliographic Database
• Metadata derived from the catalogue record
  is embedded in the title entry pages
• Indexing/abstracting services’ citations
Reporting
• Pre-defined reports from PANDAS UI
• Statistical and data reports
• SQL query on Oracle database (not through
  PANDAS interface)
• ProClarity for user defined data cube
  reporting and analysis
• LinkScan for broken publisher URL links




4. Issues and future directions
Current issues
• Commitment to selective, quality assessed,
  accessible web archiving
• Efficient identification – automated
  selection
• Legal deposit (when?)
• Blanket permission – government agencies
Current issues
• Ongoing development and enhancement of
  PANDAS
• Improve robustness of system
• Re-engineer PANDAS software
• Need to achieve greater efficiencies and
  increase scale of web archiving activity
Future directions
• Automatically ingest and process larger
  volume of online publications and
  associated metadata – batches
• Comply with international standards and
  adopt standard tools – IIPC
• Incorporate other collection methods –
  domain harvesting, deep web, deposit

Future directions
• Automate collection of more preservation
  metadata and develop metadata
  management interface
• Improve access and discovery paths to the
  Archive’s resources as it continues to grow
More information
• PANDORA home page
   http://pandora.nla.gov.au/
• Key documents (background, technical, PIs)
   http://pandora.nla.gov.au/documents.html
• PANDAS manual
   http://pandora.nla.gov.au/manual/pandas
• Papers and presentations
   http://pandora.nla.gov.au/papers.html




     Questions?

   http://pandora.nla.gov.au

								