Digital preservation

Michael Day
UKOLN, University of Bath, UK
m.day@ukoln.ac.uk


University of Bristol, MSc in Library and Information
Management, Unit 6A: Advanced Information Systems
Bristol, 15th October 2003



http://www.ukoln.ac.uk/
Session overview
       • The digital preservation problem
       • Preservation strategies
       • Preservation metadata
           – The OAIS model
       • Non-technical issues
           – collection management, legal issues, costs, …
       • Case study: the World Wide Web
       • Selected projects and initiatives


Unit 6A: Advanced Information Systems, 15 October 2003
 The digital preservation problem




Definitions (1)
• Preservation:
   – a management function
       • “Its objective is to ensure that information
         survives in usable form for as long as it is
         wanted” - John Feather (1991)
   – not primarily about:
       • conservation or restoration
       • backups or storage
       • concepts of “permanence”

Definitions (2)
• Digital preservation:
   – digital information is different
   – technical problems with ensuring continued
     access
   – but also a managerial problem
       • “... the planning, resource allocation, and application
         of preservation methods and technologies to ensure
         that digital information of continuing value remains
         accessible and usable” - Margaret Hedstrom (1998)




Definitions (3)
• Potential confusion with:
   – “archiving”
       • a term used in some computing contexts for
         the creation of secure backup copies
   – “archives”
       • a well-understood term in archives and
         recordkeeping professions
       • but also used to refer to almost any
         collection of data
           – e.g., e-print archives, image archives, etc.


Definitions (4)
• Potential confusion (continued):
   – “digitisation”
       • especially where the motive for digitisation is
         the preservation of original items




Digital information (1)
• An increasing flood of data ...
       • The Web
           – Billions of pages
           – Internet Archive: more than 300 Terabytes (and growing at 12
             Terabytes per month)
           – The "deep Web"
       • Scientific data
           – Wellcome Trust Sanger Institute - manages several
             hundred Terabytes of data per year, growing
             exponentially
           – Particle physics and astronomy - e-Science projects
             expected to generate Petabytes of data per year (e.g.,
             CERN's Large Hadron Collider)



Digital information (2)
• Sizes:
       Kilobyte:       1,000 bytes
       Megabyte:       1,000,000 bytes
       Gigabyte:       1 billion bytes
       Terabyte:       1,000 Gigabytes
       Petabyte:       1,000 Terabytes
       Exabyte:        1,000 Petabytes
       Zettabyte:      1,000 Exabytes
       Yottabyte:      1,000 Zettabytes
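
To give a feel for these units, the short sketch below works through the Internet Archive figures quoted earlier (300 Terabytes, growing at 12 Terabytes per month). It assumes decimal units, as in the list above, and a constant growth rate; the constant rate is an assumption made purely for illustration, and the function name is invented here.

```python
# Decimal (SI) byte units, in the order given on the slide.
UNITS = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]
BYTES_PER_UNIT = {name: 1000 ** (i + 1) for i, name in enumerate(UNITS)}


def months_to_reach(target_tb: float, current_tb: float,
                    growth_tb_per_month: float) -> float:
    """Months needed to grow from current_tb to target_tb at a constant rate."""
    return (target_tb - current_tb) / growth_tb_per_month


# One Petabyte expressed in Terabytes (1,000).
one_petabyte_in_tb = BYTES_PER_UNIT["Petabyte"] // BYTES_PER_UNIT["Terabyte"]

# Quoted figures: 300 TB held, growing at 12 TB per month (linear growth assumed).
months = months_to_reach(one_petabyte_in_tb, 300, 12)
print(f"{months:.1f} months to reach one Petabyte")  # 58.3 months
```

On these assumptions a collection of that size would pass one Petabyte in just under five years.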

Digital preservation (1)
• Media issues:
       • currently magnetic or optical tape and disks
           – e.g., CD-ROM, DVD (optical), DAT, DLT
             (magnetic)
       • unknown lifetimes
           – but relatively short compared to paper or good
             quality microform
           – probably years rather than decades




Digital preservation (2)
• Media issues (continued):
       • technical solutions
           – longer lasting media:
               » e.g. Norsam's High Density Rosetta system -
                  analogue storage on nickel plates
               » COM (output to good-quality microform)
               » Keeping paper copies!
           – periodic copying of data bits on to new media
             (refreshing)
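
Refreshing only helps if the copy is bit-for-bit identical to the original. A minimal sketch of a checked copy follows; the use of SHA-256, the function names and the temporary directories standing in for old and new media are all assumptions for illustration, not a description of any particular archive's practice.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm the bits are unchanged."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"refresh failed: {destination} does not match {source}")
    return after


# Temporary directories stand in for the old and the new storage media.
with tempfile.TemporaryDirectory() as old_media, \
        tempfile.TemporaryDirectory() as new_media:
    original = Path(old_media) / "report.txt"
    original.write_bytes(b"digital information of continuing value")
    copy = Path(new_media) / "report.txt"
    checksum = refresh(original, copy)
    refreshed_bytes = copy.read_bytes()
```

Note that refreshing addresses media decay only; it does nothing about the obsolescence of the formats stored on the media.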




Digital preservation (3)
• Dependence on particular hardware and
  software:
       • the heart of the digital preservation problem
       • relatively short obsolescence cycle for:
           – hardware
               » e.g., BBC Domesday Project (1986) used a
                  special type of videodisc player developed by
                  Philips
           – software
               » e.g., word-processing files

   http://www.atsf.co.uk/dottext/domesday.html




  Digital preservation strategies




Preservation strategies
   – Main proposed types:
       •   technology preservation
       •   emulation
       •   migration
       •   encapsulation
       •   others ...




Technology preservation
       • The preservation of an information object
         together with all of the hardware and
         software needed to interpret it
           – preserves the look and feel and behaviour of
             whole system
           – but will lead to museums of “ageing and
             incompatible computer hardware” - Mary Feeney
             (1999)
           – storage space, maintenance, costs ...
           – may have a short-term role in the rescue of digital
             objects (digital archaeology)



Emulation (1)
       • Preserving the original application
         software and running it on emulators that
         mimic the behaviour of obsolete hardware
         and operating systems
           – preserves ‘look-and-feel’
           – may be useful where the digital object is complex
             (e.g. multimedia) or cannot easily be migrated
           – development of ‘virtual machines’ that would have
             to be migrated to work on different platforms (Jeff
             Rothenberg)




Emulation (2)
           – strategy has been tested in:
               » Camileon project (JISC/NSF)
               » NEDLIB experiments (European national
                 libraries)
           – requires the maintenance of a huge (and
             growing) amount of information about platforms
             and operating systems
           – preserves the defects embedded in original
             software
           – Hard to know whether user experience has been
             accurately preserved



Migration (1)
       • Managed transformations:
           – The periodic transfer of digital information from
             one hardware and software configuration to
             another, or from one generation of computer
             technology to a subsequent one - CPA/RLG
             report (1996)
           – abandons attempts to keep old technology (or
             substitutes) working
           – a linear migration strategy is used by software
             vendors for some data types (e.g. Microsoft Excel
             files)



Migration (2)
           – Migration can often be combined with some form
             of standardisation (e.g., on ingest)
               » ASCII
               » bit-mapped-page images
               » well-defined XML formats
           – Migration on Request
               » Camileon project proposal




Encapsulation
       • Encapsulating the digital object with
         information on how it should be interpreted
           – self-describing objects
           – the principle underlying the OAIS reference
             model
           – can also support emulation or migration on
             demand strategies
           – examples:
               » Universal Preservation Format (UPF)
               » “Buckets” (NASA Langley Research Center)




Other strategies
       • Digital archaeology
           – data recovery
           – time consuming process (expensive)
       • “Persistent archives”
           – San Diego Supercomputer Center
           – research funded by NSF, DARPA, NARA
           – comprehensive strategy based on an information
             management architecture
           – infrastructure-independent representations of
             digital objects (tagged in XML)
           – tested on an e-mail collection (Reagan Moore, et
              al., 2000)



Mixed strategies
   – Preservation strategies are not in competition
       • different strategies can work together
       • but have implications for:
           – the technical infrastructure required (and metadata)
           – collection management priorities
                » e.g., encouraging the consistent use of standards
                  (migration), the collection of software and
                  documentation (emulation)
           – rights management
                » e.g., holding the rights to re-engineer software
           – costs




        Preservation metadata




Preservation metadata (1)
       • All digital preservation strategies depend - to
         some extent - on the creation, capture and
         maintenance of metadata
           – "Preserving the right metadata is key to
             preserving digital objects" (ERPANET Briefing
             Paper, 2003)
       • Defined as:
           – The various types of data that will allow the
             re-creation and interpretation of the structure and
             content of digital data over time (Ludäscher,
             Marciano & Moore, 2001)



Preservation metadata (2)
       • Metadata fulfil various roles, e.g.:
           – "… to find, manage, control, understand or
             preserve … information over time" (Cunningham,
             2000)
           – Descriptive information; technical information
             about formats and structure; information about
             provenance and context; administrative
             information, e.g. for rights management
           – Current schemas are either very complex or only
             provide a basic framework (sometimes both!)
           – Perception that different strategies and objects
             will need different metadata


Preservation metadata - standards
   – Developed from many different perspectives:
       • Digital libraries:
           – METS, NISO Z39.87 (to support digitisation initiatives)
           – OCLC/RLG Framework, Cedars, NEDLIB, NLA, NLNZ
           – OAIS influence has been greatest in this area
       • Records management and archival description:
           – Pittsburgh BAC, RKMS, NAA, VERS, PRO, EAD, etc.
   – Also standards not specifically developed for
     preservation, but with some overlap:
       • Multimedia
           – MPEG-7, SMPTE, etc
       • Rights management:
           – <indecs>, MPEG-21, etc.

The OAIS model
   – Reference Model for an Open Archival
     Information System (OAIS)
           – ISO 14721:2003
           – Established a common framework of terms and
             concepts
           – Influential on the design of some schemas
               » e.g., OCLC/RLG Metadata Framework
           – Identified basic functions:
               » Ingest, Data Management, Archival Storage,
                  Administration, Access, Preservation
                  Planning


    OAIS functional model

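   – OAIS Functional Entities (Figure 4-1), sitting between the PRODUCER
     and the CONSUMER, with MANAGEMENT overseeing the archive:
       • Ingest: receives SIPs from the Producer; passes AIPs to
         Archival Storage and descriptive info. to Data Management
       • Archival Storage: holds AIPs and supplies them to Access
       • Data Management: supplies descriptive info. to Access
       • Access: receives queries and orders from the Consumer;
         returns result sets and DIPs
       • Preservation Planning and Administration: span the other
         entities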


OAIS information objects
       • Information Object (basic concept)
           – Data Object (bit-stream)
           – Representation Information (permits “the full
             interpretation of Data Object into meaningful
             information”)
       • Information Object Classes
           –   Content Information
           –   Preservation Description Information (PDI)
           –   Packaging Information
           –   Descriptive Information



OAIS information packages
       • Information package:
           – Container that encapsulates Content Information and
             PDI
           – Packages for submission (SIP), archival storage (AIP)
             and dissemination (DIP)
               » AIP = “... a concise way of referring to a set of
                 information that has, in principle, all of the qualities
                 needed for permanent, or indefinite, Long Term
                 Preservation of a designated Information Object”
           – PDI = other information (metadata) “which will allow the
             understanding of the Content Information over an
             indefinite period of time”
               » Reference, Provenance, Context, Fixity



The OAIS model (4)
• Preservation Description Information comprises:
   – Reference Information
   – Provenance Information
   – Context Information
   – Fixity Information
• Source: OAIS Information Package Taxonomy (Figure 4-14)


Metadata schema categorisation
       • Earliest schemas were largely conceptual in
         nature:
           – e.g. Pittsburgh BAC model, Cedars outline
             specification, OCLC/RLG WG I
       • Gradually moving towards a more practical
         focus:
           – e.g., VERS, NLNZ, METS, PREMIS WG
           – Convergence on XML (DTDs and Schemas)
       • But there is an urgent need for all this
         practical experience to be shared
           – e.g., published schemas, advice on
             implementation, etc.


Sustainability issues (1)
       • Balance risks with costs:
           – There is a perception that metadata creation and
             maintenance will be expensive
           – But costs associated with data recovery are not
             trivial
           – Need to balance the risks of data loss with the
             cost of creating metadata
                » Cost/benefit analysis
                » Robust selection criteria
                » Co-operation between repositories
                » Re-use of existing metadata


Sustainability issues (2)
       • Avoid imposing unnecessary costs:
           – Avoid large schemas (?)
           – Need to identify the right metadata - 'core
             metadata' (?)




Metadata creation issues
       • Created by humans or captured
         automatically?
           – Some metadata already exists, e.g.:
               » Embedded within objects
               » In separate databases
               » Generated by particular processes
           – Need for this metadata to be captured at creation,
             ingest, migration, and at other appropriate points
             in object life-cycle
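
Some technical metadata can be captured automatically at ingest without any human effort. The sketch below is one possible approach using only the Python standard library; the field names are invented for illustration, and the format guess relies on the file extension alone, so it is a hint rather than a reliable identification.

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def capture_technical_metadata(path: str) -> dict:
    """Collect basic technical metadata from the file system."""
    info = os.stat(path)
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "last_modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc).isoformat(),
        "guessed_mime_type": mime_type,  # derived from the extension only
    }


# A temporary file stands in for an object arriving at ingest.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "minutes.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("Agenda item 1")
    record = capture_technical_metadata(sample)

print(record)
```

Provenance and context are much harder to derive this way, which is one reason for capturing metadata at creation rather than reconstructing it later.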




Interoperability issues
       • Benefits of interoperability
           – Support for ingest process
           – To support the management of multiple formats
             and metadata schemas within a digital
             preservation system
               » Current metadata specifications not entirely
                 clear on how this should be done
           – To support the exchange of information packages
             outside the repository, e.g. by converting to
             standard 'exchange formats'
               » Networks of 'trusted repositories'



Format and metadata registries
       • Format registries
           – There is "… a pressing need to establish reliable,
             sustained repositories of file format specifications,
             documentation, and related software" (Lawrence,
             et al., 2000)
           – DSpace 'bitstream format registry'
           – Digital Library Federation, et al. recently
             proposed a Global digital format registry
       • Metadata registries
           – More research into these is required




         Non-technical issues




Collection management
   – Selection, storage, access, "de-selection"
   – Issues:
       • Preservation issues need to be considered early in an
         object's life-cycle (the traditional 'transfer to
         repository' model will not work)
       • An important role for creators (and funding bodies)
           – Guidance, documentation
       • Sharing of responsibilities
           – A need for collaboration
       • Digital storage is cheap, so should we keep
         everything?


Legal issues (1)
       • Institutions need to obtain the legal rights to
         preserve digital objects and make them
         accessible:
           – e.g., copying, the re-engineering of software
           – identify and negotiate with rights holders?
               » but difficult to identify all rights holders ...
           – safeguard rights
           – part of legal deposit?
           – Monitoring legislation and case law




Legal issues (2)
       • Rights holders want increasing control over
         content
           – e.g., the extension of copyright periods, licensing
             of access
           – Digital Millennium Copyright Act (US)
           – European Union Copyright Directive
       • Consideration of “dark archives” -
         repositories without access ...




Costs
   – Still very little known about costs:
       • no widely used economic models
       • no clear idea of who pays?
       • Moore’s Law (technology)
           – digital storage densities increase while costs
             decrease
           – not necessarily applicable to Petabytes of data
             from e-science projects
       • identification of cost elements is best
         approach


Capturing and preserving the World Wide Web




Web archiving (1)
• Four main approaches (to date):
   – Crawler based (for surface Web)
       • Internet Archive
       • Swedish Royal Library (Kulturarw3)
       • Iceland, Finland, Austria, etc.
   – Selective approach
       • National Library of Australia (PANDORA)
       • British Library pilot
   – Direct deposit by creators
   – Combined approaches
       • Bibliothèque nationale de France


Web archiving (2)
   – an important response to the transitory nature
     of the Web
   – existing projects more concerned with collection
     strategies than access or preservation
   – major focus on events, e.g. national elections
       • Internet Archive Special Collections
       • NARA (US National Archives and Records
         Administration) snapshots of US federal agencies and
         departments in 2001
       • The National Archives (PRO) - capture of No. 10,
         Downing Street site (2001); current work with Internet
         Archive (UK Central Government Web Archive)


Web archiving (3)
   – limited consideration of access issues,
       • except for:
           – Internet Archive (Wayback Machine)
           – PANDORA Archive (NLA)
           – Nordic Web Archive project

           – A look at the Wayback Machine ...
           – http://www.archive.org/




   Some projects and initiatives




The Cedars project
• CURL Exemplars in Digital Archives:
       • Consortium of University Research Libraries (CURL)
       • Funded by the JISC (1998-2002)
       • Main partners: Universities of Cambridge, Leeds and
         Oxford; support from UKOLN for the work on
         metadata
       • Final phase produced guides to collection
         development, intellectual property rights issues,
         metadata, etc.

   http://www.leeds.ac.uk/cedars/




Digital Preservation Coalition
       • formed in 2001
       • aims to foster joint action in the UK and
         internationally
           – Dissemination (handbook, bulletin, …)
           – getting digital preservation on the agenda of key
             stakeholders
           – members include BL, the e-Science core
             programme, JISC, OCLC, the National Archives,
             Resource, the BBC, etc.




Digital Curation Centre
       • An initiative of the JISC and the Research
         Councils e-Science Core Programme
       • $1 million (per annum, for 3 years)
       • Key objectives (simplified), to develop:
           – A research programme
           – A centre and repository for tools and
             documentation
           – Pilot services, e.g. format registries
           – Advisory services, identifying best practice, etc.
       • Initial bids currently being evaluated
       • Deadline for full proposals = November 2003


OCLC/RLG Working Groups
   – Preservation Metadata - Implementation
     Strategies:
       • OAIS model based Metadata Framework (2002)
       • PREMIS Working Group
       http://www.oclc.org/research/pmwg/


   – Digital Archive Attributes Working Group:
       • “Trusted digital repositories: attributes and
         responsibilities” (May 2002)
       http://www.rlg.org/longterm/




NDIIPP
   – National Digital Information Infrastructure
     Preservation Program
       • Funded by the US Congress
       • A national planning effort led by the Library of
         Congress, in co-operation with representatives of
         other federal, research, library, and business
         organisations
       • $100 million
       • Master plan approved by Congress, December 2002
       • NDIIPP Programme Announcement
           – For projects between $500K and $3 million
           – Proposals due 12 November 2003



               Summing up




Summing up:
       • Digital preservation is a managerial as well
         as a technical problem
       • Technical agenda is being developed
           – there is much work being undertaken into
             developing sustainable preservation strategies
             and metadata schemas
       • Co-operation is essential
           – some progress, e.g. the DPC, DCC, NDIIPP
       • Many problems remain
           – costs, legal issues, etc.



          Further information




More information
• Preserving Access to Digital
  Information (PADI) gateway:
   – http://www.nla.gov.au/padi/


• DPC/PADI “What’s New” bulletin:
   – http://www.dpconline.org/graphics/whatsnew/




Acknowledgements
UKOLN is funded by Resource: the Council for Museums,
Archives and Libraries, the Joint Information Systems
Committee (JISC) of the UK higher and further education
funding councils, as well as by project funding from the JISC
and the European Union. UKOLN also receives support from
the University of Bath, where it is based.



