Digital preservation
Michael Day
UKOLN, University of Bath, UK
m.day@ukoln.ac.uk
University of Bristol, MSc in Library and Information
Management, Unit 6A: Advanced Information Systems
Bristol, 15th October 2003
http://www.ukoln.ac.uk/
Session overview
• The digital preservation problem
• Preservation strategies
• Preservation metadata
– The OAIS model
• Non-technical issues
– collection management, legal issues, costs, …
• Case study: the World Wide Web
• Selected projects and initiatives
Unit 6A: Advanced Information Systems, 15 October 2003
The digital preservation problem
Definitions (1)
• Preservation:
– a management function
• “Its objective is to ensure that information
survives in usable form for as long as it is
wanted” - John Feather (1991)
– not primarily about:
• conservation or restoration
• backups or storage
• concepts of “permanence”
Definitions (2)
• Digital preservation:
– digital information is different
– technical problems with ensuring continued
access
– but also a managerial problem
• “... the planning, resource allocation, and application
of preservation methods and technologies to ensure
that digital information of continuing value remains
accessible and usable” - Margaret Hedstrom (1998)
Definitions (3)
• Potential confusion with:
– “archiving”
• a term used in some computing contexts for
the creation of secure backup copies
– “archives”
• a well-understood term in archives and
recordkeeping professions
• but also used to refer to almost any
collection of data
– e.g., e-print archives, image archives, etc.
Definitions (4)
• Potential confusion (continued):
– “digitisation”
• especially where the motive for digitisation is
the preservation of original items
Digital information (1)
• An increasing flood of data ...
• The Web
– Billions of pages
– Internet Archive: more than 300 Terabytes (and growing
at 12 Terabytes per month)
– The "deep-Web"
• Scientific data
– Wellcome Trust Sanger Institute - manages several
hundred Terabytes of data per year, growing
exponentially
– Particle physics and astronomy - e-Science projects
expected to generate Petabytes of data per year (e.g.,
CERN's Large Hadron Collider)
Digital information (2)
• Sizes:
Kilobyte: 1,000 bytes
Megabyte: 1,000,000 bytes
Gigabyte: 1 billion bytes
Terabyte: 1,000 Gigabytes
Petabyte: 1,000 Terabytes
Exabyte: 1,000 Petabytes
Zettabyte: 1,000 Exabytes
Yottabyte: 1,000 Zettabytes
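As a rough illustration of these decimal units, the sketch below expresses a byte count in the largest fitting unit and projects the Internet Archive figure quoted earlier (more than 300 Terabytes, growing at about 12 Terabytes per month). The function names and the assumption of constant linear growth are illustrative only and are not taken from the lecture.

```python
from typing import Tuple

# Decimal units as listed on the slide: each step is a factor of 1,000.
UNITS = ["byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def humanise(n_bytes: float) -> Tuple[float, str]:
    """Return (value, unit) using the largest unit that keeps value below 1,000."""
    value = float(n_bytes)
    for unit in UNITS[:-1]:
        if value < 1000:
            return value, unit
        value /= 1000
    # Anything larger is reported in the last unit in the list.
    return value, UNITS[-1]


def months_until(target_tb: float, start_tb: float = 300.0,
                 growth_tb_per_month: float = 12.0) -> float:
    """Months for a collection to reach target_tb.

    Assumes constant linear growth, which is a simplification made for
    this sketch; real collections rarely grow at a fixed rate.
    """
    return max(0.0, (target_tb - start_tb) / growth_tb_per_month)


if __name__ == "__main__":
    print(humanise(300 * 1000 ** 4))   # the quoted archive size, in bytes
    print(months_until(1000.0))        # months to reach one Petabyte
```

Under these assumptions a 300 Terabyte collection would take a little under five years to reach one Petabyte, which gives a sense of scale for the e-Science figures mentioned above.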
Digital preservation (1)
• Media issues:
• currently magnetic or optical tape and disks
– e.g., CD-ROM, DVD (optical), DAT, DLT
(magnetic)
• unknown lifetimes
– but relatively short compared to paper or good
quality microform
– probably years rather than decades
Digital preservation (2)
• Media issues (continued):
• technical solutions
– longer lasting media:
» e.g. Norsam's High Density Rosetta system -
analogue storage on nickel plates
» COM (output to good-quality microform)
» Keeping paper copies!
– periodic copying of data bits on to new media
(refreshing)
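Refreshing can be sketched as a copy followed by a bit-level check. The example below is a minimal sketch, assuming files on an ordinary file system and SHA-256 as the checksum; the function names `sha256_of` and `refresh` are hypothetical and not part of any tool named in the lecture.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy the bits to new media and confirm that nothing changed.

    Returns the checksum so that it can be recorded and re-checked
    at the next refresh cycle.
    """
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"refresh failed: checksum mismatch for {destination}")
    return after
```

Refreshing of this kind protects only the bit-stream; it does nothing about the hardware and software dependence discussed next.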
Digital preservation (3)
• Dependence on particular hardware and
software:
• the heart of the digital preservation problem
• relatively short obsolescence cycle for:
– hardware
» e.g., BBC Domesday Project (1986) used a
special type of videodisc player developed by
Philips
– software
» e.g., word-processing files
http://www.atsf.co.uk/dottext/domesday.html
Digital preservation strategies
Preservation strategies
– Main proposed types:
• technology preservation
• emulation
• migration
• encapsulation
• others ...
Technology preservation
• The preservation of an information object
together with all of the hardware and
software needed to interpret it
– preserves the look and feel and behaviour of
whole system
– but will lead to museums of “ageing and
incompatible computer hardware” - Mary Feeney
(1999)
– storage space, maintenance, costs ...
– may have a short-term role in the rescue of digital
objects (digital archaeology)
Emulation (1)
• Preserving the original application software
and running it on emulators that mimic the
behaviour of obsolete hardware and operating
systems
– preserves 'look-and-feel'
– may be useful where the digital object is complex
(e.g. multimedia) or cannot easily be migrated
– development of 'virtual machines' that would have
to be migrated to work on different platforms (Jeff
Rothenberg)
Emulation (2)
– strategy has been tested in:
» Camileon project (JISC/NSF)
» NEDLIB experiments (European national
libraries)
– requires the maintenance of a huge (and
growing) amount of information about platforms
and operating systems
– preserves the defects embedded in original
software
– Hard to know whether user experience has been
accurately preserved
Migration (1)
• Managed transformations:
– The periodic transfer of digital information from
one hardware and software configuration to
another, or from one generation of computer
technology to a subsequent one - CPA/RLG
report (1996)
– abandons attempts to keep old technology (or
substitutes) working
– a linear migration strategy is used by software
vendors for some data types (e.g. Microsoft Excel
files)
Migration (2)
– Migration can often be combined with some form
of standardisation (e.g., on ingest)
» ASCII
» bit-mapped-page images
» well-defined XML formats
– Migration on Request
» Camileon project proposal
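Standardisation on ingest can be illustrated with plain text. The sketch below migrates text from a legacy encoding to ASCII and records whether the transformation lost information. It is a minimal example under stated assumptions: the `MigrationRecord` class and `normalise_text` function are hypothetical names, and real migrations of structured formats are far more involved.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MigrationRecord:
    """Metadata describing one migration step (fields are illustrative)."""
    source_encoding: str
    target_encoding: str
    lossless: bool


def normalise_text(raw: bytes, source_encoding: str,
                   target_encoding: str = "ascii") -> Tuple[bytes, MigrationRecord]:
    """Migrate text to a standard encoding and note any loss."""
    text = raw.decode(source_encoding)
    try:
        migrated = text.encode(target_encoding)
        lossless = True
    except UnicodeEncodeError:
        # Characters with no equivalent in the target are replaced by "?".
        migrated = text.encode(target_encoding, errors="replace")
        lossless = False
    return migrated, MigrationRecord(source_encoding, target_encoding, lossless)
```

A record with `lossless=False` is a signal to keep the original bytes alongside the migrated copy, so that a better migration can be attempted later.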
Encapsulation
• Encapsulating the digital object with
information on how it should be interpreted
– self-describing objects
– the principle underlying the OAIS reference
model
– can also support emulation or migration on
demand strategies
– examples:
» Universal Preservation Format (UPF)
» “Buckets” (NASA Langley Research Center)
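The idea of a self-describing object can be sketched as a container that carries the payload together with notes on how to interpret it. The JSON layout and field names below are assumptions made for illustration; they do not reproduce the Universal Preservation Format or the NASA "Buckets" design.

```python
import base64
import json
from typing import Dict, Tuple


def encapsulate(payload: bytes, format_name: str, interpretation: str) -> str:
    """Wrap a bit-stream with information on how it should be interpreted."""
    package = {
        "format": format_name,
        "interpretation": interpretation,
        "payload_encoding": "base64",
        # Base64 keeps arbitrary bytes safe inside a text container.
        "payload": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(package, indent=2, sort_keys=True)


def unpack(serialised: str) -> Tuple[bytes, Dict[str, str]]:
    """Recover the original bytes and the accompanying description."""
    package = json.loads(serialised)
    payload = base64.b64decode(package.pop("payload"))
    return payload, package
```

The description travels with the object, so a later emulation or migration tool has something to work from even if external catalogues are lost.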
Other strategies
• Digital archaeology
– data recovery
– time consuming process (expensive)
• “Persistent archives”
– San Diego Supercomputer Center
– research funded by NSF, DARPA, NARA
– comprehensive strategy based on an information
management architecture
– infrastructure independent representations of
digital objects (tagged in XML)
– tested on an e-mail collection (Reagan Moore, et
al., 2000)
Mixed strategies
– Preservation strategies are not in competition
• different strategies can work together
• but have implications for:
– the technical infrastructure required (and metadata)
– collection management priorities
» e.g., encouraging the consistent use of standards
(migration), the collection of software and
documentation (emulation)
– rights management
» e.g., holding the rights to re-engineer software
– costs
Preservation metadata
Preservation metadata (1)
• All digital preservation strategies depend - to
some extent - on the creation, capture and
maintenance of metadata
– "Preserving the right metadata is key to
preserving digital objects" (ERPANET Briefing
Paper, 2003)
• Defined as:
– The various types of data that will allow the
re-creation and interpretation of the structure and
content of digital data over time (Ludäscher,
Marciano & Moore, 2001)
Preservation metadata (2)
• Metadata fulfil various roles, e.g.:
– "… to find, manage, control, understand or
preserve … information over time" (Cunningham,
2000)
– Descriptive information; technical information
about formats and structure; information about
provenance and context; administrative
information, e.g. for rights management
– Current schemas either very complex or only
provide a basic framework (sometimes both!)
– Perception that different strategies and objects
will need different metadata
Preservation metadata - standards
– Developed from many different perspectives:
• Digital libraries:
– METS, NISO Z39.87 (to support digitisation initiatives)
– OCLC/RLG Framework, Cedars, NEDLIB, NLA, NLNZ
– OAIS influence has been greatest in this area
• Records management and archival description:
– Pittsburgh BAC, RKMS, NAA, VERS, PRO, EAD, etc.
– Also standards not specifically developed for
preservation, but with some overlap:
• Multimedia
– MPEG-7, SMPTE, etc
• Rights management:
– <indecs>, MPEG-21, etc.
The OAIS model
– Reference Model for an Open Archival
Information System (OAIS)
– ISO 14721:2003
– Established a common framework of terms and
concepts
– Influential on the design of some schemas
» e.g., OCLC/RLG Metadata Framework
– Identified basic functions:
» Ingest, Data Management, Archival Storage,
Administration, Access, Preservation
Planning
OAIS functional model
[Figure: OAIS Functional Entities (Figure 4-1)]
• Producer submits SIPs to Ingest
• Ingest passes descriptive information to Data
Management and AIPs to Archival Storage
• Access receives queries and orders from the Consumer,
and returns result sets and DIPs
• Preservation Planning and Administration span the
other functions; Management sits outside the archive
OAIS information objects
• Information Object (basic concept)
– Data Object (bit-stream)
– Representation Information (permits “the full
interpretation of Data Object into meaningful
information”)
• Information Object Classes
– Content Information
– Preservation Description Information (PDI)
– Packaging Information
– Descriptive Information
OAIS information packages
• Information package:
– Container that encapsulates Content Information and
PDI
– Packages for submission (SIP), archival storage (AIP)
and dissemination (DIP)
» AIP = “... a concise way of referring to a set of
information that has, in principle, all of the qualities
needed for permanent, or indefinite, Long Term
Preservation of a designated Information Object”
– PDI = other information (metadata) “which will allow the
understanding of the Content Information over an
indefinite period of time”
» Reference, Provenance, Context, Fixity
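The relationship between the three package types and the four PDI categories can be sketched as simple data structures. This is a minimal sketch, assuming string-valued PDI fields and a SHA-256 digest for fixity; the OAIS reference model is conceptual and does not prescribe these classes, field types or function names.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PDI:
    """Preservation Description Information (field types are assumptions)."""
    reference: str    # identifier for the content
    provenance: str   # history of custody and processing
    context: str      # relationship to other material
    fixity: str       # here: SHA-256 hex digest of the content


@dataclass(frozen=True)
class InformationPackage:
    kind: str         # "SIP", "AIP" or "DIP"
    content: bytes
    pdi: PDI


def make_sip(content: bytes, reference: str, provenance: str,
             context: str) -> InformationPackage:
    """Build a submission package, computing fixity from the content."""
    fixity = hashlib.sha256(content).hexdigest()
    return InformationPackage("SIP", content,
                              PDI(reference, provenance, context, fixity))


def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a SIP into an AIP after checking fixity."""
    if sip.kind != "SIP":
        raise ValueError("ingest expects a SIP")
    if hashlib.sha256(sip.content).hexdigest() != sip.pdi.fixity:
        raise ValueError("fixity check failed on ingest")
    pdi = replace(sip.pdi, provenance=sip.pdi.provenance + "; ingested")
    return InformationPackage("AIP", sip.content, pdi)


def disseminate(aip: InformationPackage) -> InformationPackage:
    """Derive a DIP from an AIP for delivery to a consumer."""
    if aip.kind != "AIP":
        raise ValueError("disseminate expects an AIP")
    return InformationPackage("DIP", aip.content, aip.pdi)
```

In practice a DIP may differ substantially from the AIP it is derived from (for example, a different format or a subset of the content); the sketch passes the content through unchanged for simplicity.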
The OAIS model (4)
• Preservation Description Information comprises:
– Reference Information
– Provenance Information
– Context Information
– Fixity Information
OAIS Information Package Taxonomy (Figure 4-14)
Metadata schema categorisation
• Earliest schemas were largely conceptual in
nature:
– e.g. Pittsburgh BAC model, Cedars outline
specification, OCLC/RLG WG I
• Gradually moving towards a more practical
focus:
– e.g., VERS, NLNZ, METS, PREMIS WG
– Convergence on XML (DTDs and Schemas)
• But there is an urgent need for all this
practical experience to be shared
– e.g., published schemas, advice on
implementation, etc.
Sustainability issues (1)
• Balance risks with costs:
– There is a perception that metadata creation and
maintenance will be expensive
– But costs associated with data recovery are not
trivial
– Need to balance the risks of data loss with the
cost of creating metadata
» Cost/benefit analysis
» Robust selection criteria
» Co-operation between repositories
» Re-use of existing metadata
Sustainability issues (2)
• Avoid imposing unnecessary costs:
– Avoid large schemas (?)
– Need to identify the right metadata - 'core
metadata' (?)
Metadata creation issues
• Created by humans or captured
automatically?
– Some metadata already exists, e.g.:
» Embedded within objects
» In separate databases
» Generated by particular processes
– Need for this metadata to be captured at creation,
ingest, migration, and at other appropriate points
in object life-cycle
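Some technical metadata can be captured automatically at ingest rather than typed in by hand. The sketch below gathers a few such values from the file system using only the Python standard library. The choice of fields and the function name `capture_technical_metadata` are assumptions for illustration and do not follow any particular schema named above.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Optional, Union


def capture_technical_metadata(path: Path) -> Dict[str, Union[str, int, None]]:
    """Collect metadata that already exists or can be generated by a process."""
    stat = path.stat()
    # guess_type works from the file extension only, so it is a hint, not proof.
    mime_guess: Optional[str] = mimetypes.guess_type(path.name)[0]
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "filename": path.name,
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
        "mime_type_guess": mime_guess,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }
```

Descriptive and contextual metadata still generally need human input; automatic capture mainly reduces the cost of the technical layer.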
Interoperability issues
• Benefits of interoperability
– Support for ingest process
– To support the management of multiple formats
and metadata schema within a digital
preservation system
» Current metadata specifications not entirely
clear on how this should be done
– To support the exchange of information packages
outside the repository, e.g. by converting to
standard 'exchange formats'
» Networks of 'trusted repositories'
Format and metadata registries
• Format registries
– There is "… a pressing need to establish reliable,
sustained repositories of file format specifications,
documentation, and related software" (Lawrence,
et al., 2000)
– DSpace 'bitstream format registry'
– Digital Library Federation, et al. recently
proposed a Global digital format registry
• Metadata registries
– More research into these is required
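What a format registry offers can be sketched as a lookup from a file's leading bytes to a named format and a pointer to its documentation. The three signatures below are the well-known leading bytes of PDF, PNG and GIF files; the registry structure, the `FormatEntry` class and the `identify` function are hypothetical and do not represent the DSpace registry or the proposed Global digital format registry.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FormatEntry:
    name: str
    signature: bytes      # leading "magic" bytes of the file
    documentation: str    # where the format is described


# A deliberately tiny registry; a real one would hold many more entries
# plus versions, specifications and related software.
REGISTRY: List[FormatEntry] = [
    FormatEntry("PDF", b"%PDF-", "Adobe PDF Reference"),
    FormatEntry("PNG", b"\x89PNG\r\n\x1a\n", "W3C PNG Specification"),
    FormatEntry("GIF", b"GIF8", "CompuServe GIF specification"),
]


def identify(data: bytes) -> Optional[FormatEntry]:
    """Return the first registry entry whose signature matches, if any."""
    for entry in REGISTRY:
        if data.startswith(entry.signature):
            return entry
    return None
```

An unidentified file (a `None` result) is itself useful information: it flags material whose format needs investigation before its documentation becomes harder to find.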
Non-technical issues
Collection management
– Selection, storage, access, "de-selection"
– Issues:
• Preservation issues need to be considered early in an
object's life-cycle (the traditional 'transfer to
repository' model will not work)
• An important role for creators (and funding bodies)
– Guidance, documentation
• Sharing of responsibilities
– A need for collaboration
• Digital storage is cheap, so should we keep
everything?
Legal issues (1)
• Institutions need to obtain the legal rights to
preserve digital objects and make them
accessible:
– e.g., copying, the re-engineering of software
– identify and negotiate with rights holders?
» but difficult to identify all rights holders ...
– safeguard rights
– part of legal deposit?
– Monitoring legislation and case law
Legal issues (2)
• Rights holders want increasing control over
content
– e.g., the extension of copyright periods, licensing
of access
– Digital Millennium Copyright Act (US)
– European Union Copyright Directive
• Consideration of “dark archives” -
repositories without access ...
Costs
– Still very little known about costs:
• no widely used economic models
• no clear idea of who pays?
• Moore's Law (technology)
– digital storage densities increase while costs
decrease
– not necessarily applicable to Petabytes of data
from e-science projects
• identification of cost elements is the best
approach
Capturing and preserving the World
Wide Web
Web archiving (1)
• Four main approaches (to date):
– Crawler based (for surface Web)
• Internet Archive
• Swedish Royal Library (Kulturarw3)
• Iceland, Finland, Austria, etc.
– Selective approach
• National Library of Australia (PANDORA)
• British Library pilot
– Direct deposit by creators
– Combined approaches
• Bibliothèque nationale de France
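The crawler-based approach can be sketched as a breadth-first traversal that follows links from a seed page and keeps a copy of each page it reaches. The sketch below is a toy under stated assumptions: it stays on the seed's host, takes the page-fetching function as a parameter so that it runs without network access, and ignores robots.txt, politeness delays, embedded resources and storage formats. It does not describe how the Internet Archive or any national library crawler is implemented.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed: str, fetch: Callable[[str], Optional[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Capture pages reachable from seed on the same host, breadth first."""
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    captured: Dict[str, str] = {}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page missing or not retrievable
            continue
        captured[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return captured
```

Because it only follows links, a crawler of this kind reaches the surface Web; database-driven "deep Web" content is not captured, which is one reason the other approaches exist.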
Web archiving (2)
– an important response to the transitory nature
of the Web
– existing projects more concerned with collection
strategies than access or preservation
– major focus on events, e.g. national elections
• Internet Archive Special Collections
• NARA (US National Archives and Records
Administration) snapshots of US federal agencies and
departments in 2001
• The National Archives (PRO) - capture of No. 10,
Downing Street site (2001); current work with Internet
Archive (UK Central Government Web Archive)
Web archiving (3)
– limited consideration of access issues,
• except for:
– Internet Archive (Wayback Machine)
– PANDORA Archive (NLA)
– Nordic Web Archive project
– A look at the Wayback Machine ...
– http://www.archive.org/
Some projects and initiatives
The Cedars project
• CURL Exemplars in Digital Archives:
• Consortium of University Research Libraries (CURL)
• Funded by the JISC (1998-2002)
• Main partners: Universities of Cambridge, Leeds and
Oxford; support from UKOLN for the work on
metadata
• Final phase produced guides to collection
development, intellectual property rights issues,
metadata, etc.
http://www.leeds.ac.uk/cedars/
Digital Preservation Coalition
• formed in 2001
• aims to foster joint action in the UK and
internationally
– Dissemination (handbook, bulletin, …)
– getting digital preservation on the agenda of key
stakeholders
– members include BL, the e-Science core
programme, JISC, OCLC, the National Archives,
Resource, the BBC, etc.
Digital Curation Centre
• An initiative of the JISC and the Research
Councils e-Science Core Programme
• $1 million (per annum, for 3 years)
• Key objectives (simplified), to develop:
– A research programme
– A centre and repository for tools and
documentation
– Pilot services, e.g. format registries
– Advisory services, identifying best practice, etc.
• Initial bids currently being evaluated
• Deadline for full proposals = November 2003
OCLC/RLG Working Groups
– Preservation Metadata - Implementation
Strategies:
• OAIS model based Metadata Framework (2002)
• PREMIS Working Group
http://www.oclc.org/research/pmwg/
– Digital Archive Attributes Working Group:
• “Trusted digital repositories: attributes and
responsibilities” (May 2002)
http://www.rlg.org/longterm/
NDIIPP
– National Digital Information Infrastructure
Preservation Program
• Funded by the US Congress
• A national planning effort led by the Library of
Congress, in co-operation with representatives of
other federal, research, library, and business
organisations
• $100 million
• Master plan approved by Congress, December 2002
• NDIIPP Programme Announcement
– For projects between $500K - $3 million
– Proposals due 12 November 2003
Summing up
Summing up:
• Digital preservation is a managerial as well
as a technical problem
• Technical agenda is being developed
– there is much work being undertaken into
developing sustainable preservation strategies
and metadata schemas
• Co-operation is essential
– some progress, e.g. the DPC, DCC, NDIIPP
• Many problems remain
– costs, legal issues, etc.
Further information
More information
• Preserving Access to Digital
Information (PADI) gateway:
– http://www.nla.gov.au/padi/
• DPC/PADI "What's New" bulletin:
– http://www.dpconline.org/graphics/whatsnew/
Acknowledgements
UKOLN is funded by Resource: the Council for Museums,
Archives and Libraries, the Joint Information Systems
Committee (JISC) of the UK higher and further education
funding councils, as well as by project funding from the JISC
and the European Union. UKOLN also receives support from
the University of Bath, where it is based.