Personal & SOHO Archiving
Stephan Strodl, Florian Motlik, Kevin Stadler, Andreas Rauber
Vienna University of Technology
Vienna, Austria
www.ifs.tuwien.ac.at/dp

ABSTRACT

Digital objects require appropriate measures for digital preservation to ensure that they can be accessed and used in the near and far future. While heritage institutions have been addressing the challenges posed by digital preservation needs for some time, private users and SOHOs (Small Office/Home Office) are less prepared to handle these challenges. Yet, both have increasing amounts of data that represent considerable value, be it office documents or family photographs. Backup, a common practice of home users, avoids the physical loss of data, but it does not prevent the loss of the ability to render and use the data in the long term. Research and development in the area of digital preservation is driven by memory institutions and large businesses. The available tools, services and models are developed to meet the demands of these professional settings.

This paper analyses the requirements and challenges of preservation solutions for private users and SOHOs. Based on the requirements and supported by available tools and services, we are designing and implementing a home archiving system to provide digital preservation solutions specifically for digital holdings in the small office and home environment. It hides the technical complexity of digital preservation challenges and provides simple and automated services based on established best practice examples. The system combines bitstream preservation and logical preservation strategies to avoid the loss of data and of the ability to access and use them. A first software prototype, called Hoppla, is presented in this paper.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.7 Digital Libraries

General Terms

Design, Documentation, Experimentation, Reliability, Theory

Keywords

Personal Archiving, Home Archiving, Home User, SOHO, Digital Preservation, Long Term Access

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL'08, June 16–20, 2008, Pittsburgh, Pennsylvania, USA.
Copyright 2008 ACM 978-1-59593-998-2/08/06 ...$5.00.

1. INTRODUCTION

An increasing amount of electronic material is stored and organised on home PCs. Legal, financial, and business matters of private users, such as insurance, contracts, tax payments and bank activities, are conducted electronically. Other material is highly valuable for private users simply due to its emotional value, such as family photographs, e-mail exchanges, and blogs. SOHOs manage their financial concerns, correspondence and business by using PCs and internet services. The stored data have high value for the business in the long term.

Nowadays, it is common practice for SOHO users to back up their data on CDs, DVDs and external hard discs to guarantee the future use and long term availability of their data. A number of backup solutions are available on the market, ranging from simple open source applications to commercial application suites. The backup of the data only provides protection against technical failures of storage media and the physical loss of data.

Apart from technical failure, information can be lost due to obsolete formats and a lack of metadata, making the information unusable. Private users are hardly aware of these risks. None of the current backup systems for private users deals with the challenge of digital preservation. However, most users live under the impression that copying their files to a DVD is sufficient for ensuring access and usage in the future.

Digital preservation has turned into an important activity for heritage institutions and large businesses. A number of projects worldwide develop models and services for long term preservation in professional settings. Due to the different environments, knowledge and objectives, the requirements for a preservation system for private users differ significantly from those in professional environments. For example, authenticity and audit play a minor role for private data, and access to the archived data has to be kept simple and practicable.

To allow private users to manage and preserve their digital holdings, the complexity of digital preservation has to be reduced based on established best practice examples; simple and automated preservation services are vital to ensure the long term access to these heterogeneous collections. These services have to have small entry barriers for new users and need to be accessible for users possessing little knowledge in the domain of digital preservation. Therefore, service support needs to be kept as simple as possible.

Home Archiving is a new concept to assist private users and SOHOs in long term preservation of their data. It considers the abovementioned issues and tackles the emerging challenges to ensure the accessibility and availability of privately owned digital objects in the future.

This paper describes a practical approach for digital preservation for SOHO users. It combines bitstream preservation with best practice logical preservation strategies to avoid the loss of data and of the ability to access and use the data. The home archiving software Hoppla, introduced in this paper, builds on a service model similar to current firewall and antivirus solutions. It provides user-friendly handling of services and an automated update service, and it hides the technical complexity of the software.

The remainder of this paper is organised as follows: Section 2 provides pointers to related initiatives and gives an overview of work previously done in this area. After that, Section 3 presents the challenges and requirements for digital preservation of private and SOHO holdings. Following the description of a system architecture for home archiving in Section 4, we present an initial prototype in Section 5 including an outlook on future developments. Finally we draw conclusions in Section 6.

2. RELATED WORK

Current research on digital preservation is driven by memory institutions and focuses on professional environments to preserve scientific and cultural heritage. The increasing number of digital objects of legal and personal importance held by SOHOs faces the challenge of obsolete formats and hardware. Preservation solutions for private users and SMEs can benefit from experience and knowledge in professional settings and research.

A series of studies about private users and how they handle their digital holdings have been performed. A study about techniques and tools for managing their electronic material is presented in [ ]. Case studies about digital preservation of personal information were performed in [20, 21], identifying current practices and challenges in digital preservation for private users. The identified practices and challenges of home users form a basis for the requirements of home archiving systems such as the one presented in Section 3. They were further considered for the archiving system design in Section 4.

The MyLifeBits project aims at keeping a complete digital record of a person's life [10, 11]. The project focuses on browsing, searching and managing personal digital information based on semantic analysis of the accumulated data. The preservation of the collected content plays a minor role in this project as well as in several other similar initiatives such as [ ].

The Paradigm project (http://www.paradigm.ac.uk) focuses on preservation of personal material. The final report presents a series of case studies and best practice recommendations for preserving personal digital material in archives curated by archivists.

Apple's Time Machine is a backup utility embedded in the Mac OS X Leopard operating system. It automatically creates incremental backups on an external device of an Apple computer. Time Machine provides bitstream preservation of the data; logical preservation is not covered in the system. A similar application for Linux operating systems, called TimeVault, is under development; it allows automatic backups of data.

Open source digital repositories, such as Fedora (http://www.fedora.info) and DSpace (http://www.dspace.org), are useful environments for professional archiving, but their usability and the knowledge required for configuration and use do not match the skills of home users.

The Reference Model for an Open Archival Information System (OAIS) has been widely accepted as a key standard reference model for archival systems in the digital library community. The standard was taken into consideration for the system architecture in Section 4.

Over the last years a lot of effort was spent to define, improve, and evaluate preservation strategies. A good overview of preservation of digital heritage and preservation strategies is provided by the companion document to the UNESCO charter for the preservation of the digital heritage.

Research on technical preservation issues is focused on two dominant strategies, namely migration and emulation. The Council of Library and Information Resources (CLIR) presented different kinds of risks for a migration project. Migration requires the repeated conversion of a digital object into more stable or current file formats, for example converting a Microsoft WORD97 document into the current Office 2007 format (within format-family migration) or converting it to Adobe PDF/A, a simple ASCII/UNICODE text file, a screenshot image, or others. Migration is a modification of the data and always incurs the risk of losing essential characteristics of the object. Therefore, a verification of completeness and correctness of the migration activity is required for a preservation system. Characterisation services for digital objects that extract information and characteristics from digital objects support this verification. Work in the field of characterisation is done, for example, by the Harvard University Library in the JHOVE project, the Planets Project with the eXtensible Characterisation Languages (XCL), and the Global Grid Forum Data Format Description Language Working Group with DFDL. The number of tools as well as the ease of applying migration makes it a very promising candidate for home archiving.

Emulation, the second important preservation strategy, aims at providing programs that mimic a certain environment, e.g. the emulation of a certain processor type or emulating the features of a certain operating system. An example is to run Microsoft WORD 1.0 on a Linux operating system emulating Windows 3.1. Jeff Rothenberg together with CLIR envision a framework of an ideal preservation surrounding for emulation. Emulation requires sufficient knowledge from the user about the computer environment and dependencies of components. Emulating a certain piece of software to render data may require preserving the operating system, the application software, and the data. If any one of these is lost, the information can no longer be accessed. The emulator itself is a piece of software and has to be preserved over time. Emulation is a useful strategy to preserve software applications; for home archiving we are focusing on preserving the information of digital objects. In order to keep the home archiving system simple and easy to apply, we are currently not considering emulation as a preservation strategy for home archiving in this paper, although it is definitely not excluded from a system design perspective.

3. PRESERVATION HOME ARCHIVING

The underlying principle of a home archiving system is finding a best effort solution with respect to the available technology and skills of private users. We cannot assume a highly sophisticated computer environment; neither can we expect a profound knowledge in digital preservation or archiving. A home archiving system backs up the private holdings and automatically applies appropriate preservation strategies to the objects. The system should provide the best available and most practical preservation solution. It further should hide the technical complexity from the end user. The installation, the execution and the maintenance of the system have to be easy to handle. This requires a user friendly GUI design and the provision of automated services handling migration of objects that are stored in formats that are considered at risk. Experience and knowledge about digital preservation gained in professional environments and research should be used to provide preservation solutions for private users. Even tools and services developed for institutional preservation can be adopted and used in home archiving systems, albeit limitations have to be kept in mind.

3.1 Requirements

Requirements and challenges for digital preservation of private holdings differ from those in professional settings because of different environments, skills, and objectives. Criteria for institutional repositories are an active research field in the digital library community. Examples are the Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) and the Catalogue of Criteria for Trusted Digital Repositories from the certification working group of NESTOR (http://www.langzeitarchivierung.de). Requirements beside the archive and library environment are documented in [ ], [ ] and [ ]. This section analyses the challenges and requirements for home archiving systems and presents potential and practical solutions.

The user studies done by Catherine C. Marshall [20, 21] identified the estimation of the future value of digital material as one of the central challenges for personal archiving. The appraisal of the content can only reasonably be done by the user. Usage statistics of objects can support the selection; the statistics may include creation date, last access, number of accesses and last change.

In order to select material, data acquisition has to be performed. Digital belongings of private users are highly distributed across a variety of media. Private users are using different web services to exchange and publish their digital material. Private photos are sent by e-mail or published via web photo albums; other users publish private web pages or write blogs. Offline media are also in use, for example videos from camcorders stored on CDs and DVDs, or old data moved to external hard discs. Unlike in professional environments, the data in question are not kept in a single repository; they are distributed on both on- and offline media. A potential home archiving solution should support the acquisition of digital material from different sources.

In addition to being stored on different media, electronic materials of private users consist of a variety of formats of different age. In order to find appropriate preservation strategies these object formats have to be identified. For this purpose, a number of tools and services are developed, for example JHOVE, developed by JSTOR and the Harvard University Library, or DROID by the National Archives.

Bitstream preservation protects the digital information against physical deterioration of the media and the obsolescence of media readers. A common practical solution for bitstream preservation is to maintain multiple copies of the digital material on separate media and to transfer the data periodically to new media. The backup of data on CDs, DVDs and external hard discs is a common practice for SOHO users. Yet, there is little knowledge about appropriate archival media. Storage media for use in home archiving thus have to be readily available and commonly known.

While bitstream preservation avoids the physical loss of data, it does not prevent the loss of the ability to decode and represent the stored information. Due to the rapid development in software and formats, information can quickly turn into uninterpretable bitstreams. The loss of the required applications and the information to interpret the format can be avoided by periodical migration and storage of representation information. Migration provides repeated conversion of objects; a file is converted either to a more current version of its own file format, or to another format which is easier to handle and access. In order to understand and interpret the preserved data in the future, additional information is required. The concept of representation information is introduced and discussed in the OAIS Reference Model. For a home archiving system a practical approach is required; therefore the format specifications for all formats in a personal archive, if available, are stored together with the preserved data.

The combination of migration and stored format specification is a practical approach to access and use the preserved objects in the future. The migration should assure that the objects can be accessed in the future by using then current software. In case no software is available or the loss incurred by sequential migration steps exceeds tolerable limits, the information of the objects can be accessed by using the format specification.

The objects in a home archiving system should be self sufficient. That means they should have a minimum of dependencies on systems, other data or documentation. The minimisation of dependencies is a requirement for the selection of appropriate preservation strategies. Best practice preservation strategies and the use of open standards can help reduce dependencies. Moreover, required documentation such as the format specification has to be preserved with the data within the archiving system to prevent additional external dependencies.

Metadata is a key component for archival and library repositories. A number of initiatives and projects developed standards and recommendations for long term metadata, such as Dublin Core and Premis. Private users hardly ever make the effort of assigning metadata to their objects. The aim of a home archiving system is to preserve the available metadata and to obtain additional information about the user's objects. Characterisation services are needed to extract information about the object, its content and its environment.

Privacy and authenticity of the objects are essential for professional repositories as well as for home archiving. The use of external services with private data or information about the data puts privacy at risk. Therefore, the user should be able to decide which data or information about the data are provided to external services. The objects have to be protected against unauthorised access and manipulation. Due to the fact that a home archiving system predominantly stores the data on removable and portable storage media such as external hard discs or DVDs, physical protection is the only effective access control. Encryption of data bears a couple of risks for the long term storage of digital content. The loss of the encryption algorithm or password can result in irrecoverable loss of all stored data. On the other hand, due to the evolution of decryption algorithms and computing power, current encryption cannot provide security in the long term. Therefore, a home archiving system does not support encryption of the data. A simple but effective protection against manipulation can be provided by using checksums. Yet, this is a less prominent issue for home archiving systems than for institutional repositories.

3.2 Differences between Home Archiving and Institutional Archiving

The differences between home archiving and institutional archiving are manifold. The design of potential preservation solutions has to consider these differences. Examples of major differences, among many others, are:

• The level of expertise in digital preservation of home users differs from that in professional settings.

• Staff in institutional repositories have a profound understanding of challenges in digital preservation, for example fragility of formats or dependencies of computer software.

• Home users hold a much smaller amount of data, resulting in different performance requirements for tools and data storage.

• Institutional repositories have a professional hardware environment and infrastructure, for example tape robots, storage servers or RAID systems.

• Home users have minimum requirements in authenticity of data; however, the documentation of changes of objects is an important aspect for both communities.

• The requirements in automation of the archiving process are higher for home archiving software solutions. In institutional settings, critical decisions in preservation endeavours can be made by skilled staff. Examples are error tolerance or the identification of requirements for a preservation solution.

• Preservation endeavours in institutional settings have to meet the legal and institutional obligations, while these limitations do not apply for private users.

4. HOME ARCHIVING SYSTEM

Home archiving software combines bitstream preservation and logical preservation to store private user data for the long term. It further supports acquisition of material from different sources and provides extraction of metadata.

Figure 1 shows the basic architecture of a home archiving system; the architecture is influenced by the OAIS reference model. It consists of six core components: acquisition, ingest, data management, preservation management, storage management, and access. Two registries contain preservation rules and services. Both registries are updated automatically by an external update web service. The service registry contains services and tools for object identification, characterisation, preservation, and preservation validation. The registry also contains representation information about formats, for example the format specification. The preservation rule registry specifies preservation strategies for different types of objects. Preservation rules describe the input format, the output format and the tool including the specific parameter setting for a specific migration task, e.g. migration of Word objects to PDF/A objects by using Adobe Acrobat 7.0. The metadata repository is used for operational purposes and explained in Section 4.4. The functions of the core components are described in more detail below.

Figure 1: Architecture of a Home Archiving System

4.1 Acquisition

The acquisition component is responsible for capturing the digital data from different sources. In order to support different media the acquisition component provides an API for plugins. The use of plugins makes it possible to support all kinds of storage media and current as well as future data sources. The acquisition plugins capture the objects and all relevant information about the objects, such as usage statistics or additional descriptions. Examples of acquisition plugins are disc acquisition, e-mail archiving clients, or web acquisition tools. Disc acquisition acquires objects from home directories and changeable media; e-mail clients acquire objects from e-mail accounts by using POP or IMAP; other sources can be supported by specific tools such as the Internet crawler Heritrix to harvest web content (for example private web pages, community pages or web pages of user interest). The acquired data are submitted to the ingest component.

4.2 Ingest

Appraisal, i.e. the estimation of an object's future value, and the selection of the digital objects to be preserved are performed in the ingest component. The user selects the objects to preserve; additional information about the objects captured by the acquisition component can support the selection. Further analyses of appraisal and selection can be found in [3, 8].

After the selection, the objects are quarantined and checked for viruses. The ingest component is responsible for the identification of an object's format by using identification services from the service registry. Examples of such services are JHOVE or DROID. As none of the existing services knows all formats, a usual home archive will comprise a number of objects in unknown formats. For objects in unknown formats only bitstream preservation can be performed.

The ingest component creates a collection profile describing the format types, their proportion, the number of objects, and the size of the collection.

4.3 Preservation Management

Preservation management controls the logical preservation of the objects. In the home archiving setting this means that it is responsible for performing migration strategies on the objects in its archive. To do this, preservation tools and rules are requested from an update service. Based on the collection profile, suitable preservation strategies are recommended by the web update service.

For privacy reasons, the user can define the level of detail of the collection profile that is provided to the web update service. The minimum level of detail is a list of formats in the archive; in this case default rules for the formats are provided to the home archiving system. More detailed collection profiles contain information about the proportion of formats, size of the collection, and detailed characteristics of the objects. With this information, the web update service can provide more specific preservation rules for a given collection. It can provide one or more preservation rules for a single format, for example the migration of Word objects to PDF/A and to Open Document Format.

Due to the large number of preservation rules and services, those are requested on demand from the web update service. New or updated preservation rules and services are transferred to and installed in the home archiving system. The home archiving system presents the user a list of recommended migrations and allows the user to revise this list.

The objects in the archive are migrated according to the rules. The preservation service defined in the rule is executed on the home archiving system. The migration process can produce one or more output objects from a single input object; for example, the migration from PDF to PNG produces a PNG object for each page in the PDF object. After the execution of the migration, the output objects are validated for correctness and completeness against the input object by using validation services defined in the service registry. Again, these will usually be installed locally in order to ensure privacy, i.e. not having to send the migrated objects to external services. If the verification fails, the output objects are deleted. The failed migration is documented in the metadata, and feedback on the failed migration is potentially provided to the tool provider to allow improvement of the migration service. After migration, a report about performed actions and failures is provided to the user.

The output of the migration consists of the original objects, one or more migrated output objects for each input object, the logging of the migration services and the results of the verification. The migration is documented in the metadata of the objects.

4.4 Data Management

Data management enriches the objects in the archive with metadata to ease later reuse. Metadata are created from the additional information captured by the acquisition component, the documentation of migration processes, and metadata extracted from objects.

Metadata are extracted from original objects as well as migrated objects by using characterisation services. The structure of the extracted information strongly depends on the service and on the format. Additional metadata about the archiving process are added to the objects, such as time of capture, original location in the home system, performed preservation strategy and output of the migration process. Representation information is added to the metadata of the objects. A checksum is derived for each object and added to its metadata. The home archiving system allows the user to add additional metadata to objects or groups of objects. This metadata is packaged together with the data objects. The data management submits the original objects, the migrated objects, and the metadata to the storage management.

The metadata repository in the home archiving system contains information about archiving activities and the objects including all metadata. The repository is only used operationally; all information of the repository is stored with the objects on the storage media. The central metadata repository supports and improves the archiving process, for example when storing a repository across multiple data media. The preserved objects and their metadata can be recovered from the storage media without the application.

4.5 Access

The access module provides services that allow users to access the data stored in the home archive. The access module further displays information about object dependencies and versioning history. In principle the user can directly access the storage media, as all information is stored on the target media. However, additional access services improve the usability of the system. Moreover, direct access would effectively undermine the system's application logic, possibly leading to accidental manipulation of the object and the stored information by the user, thus spoiling authenticity. The access module provides services to retrieve objects from the archive. It accesses the objects through functions provided by the storage management module. Metadata-based search functionality in the access module eases finding old objects in the archiving system.

4.6 Storage Management

Storage management is responsible for bitstream preservation. The data provided by the data management component are stored on various storage media. The storage management supports multiple copies of the data, following the concept of the LOCKSS project [9, 19]. Multiple copies limit the risk of physical deterioration of storage media.

In order to store the data on various storage systems or media, storage management implements a reduced version of a storage resource broker. The storage manager provides a storage interface to access different storage systems, such as file systems or online storage systems, by using plugins.

4.7 Web Service Update

The web service update provides the home archiving system with preservation rules and services. The collection profile and a list of present rules and services from the home archiving system are sent to the web update service. According to the information in the collection profile, preservation rules are selected. Wherever necessary, formats in the archive are assigned at least one preservation rule. The home archiving system receives updated and new rules and services.

A critical part of the system is the selection of the preservation strategies. These rules as well as the selection of migration tools need to be handled by teams of experts. In this aspect, the web service update functionality works similarly to current antivirus software kits, where new rules for detecting viruses, as well as software modules to eliminate them, are downloaded by an update service. Experience and practice of professional settings provide a first indication of applicable preservation strategies. Detailed analysis of preservation strategies can be done with evaluation tools such as the Planets Preservation Planning approach. It allows the evaluation of different preservation strategies against well defined requirements. Examples of requirements of preservation strategies for home archiving are open format specification, portable preservation service and availability of free rendering applications.

In order to be informed about changes and developments in technology, the web service update component needs technology watch services. These monitor technology to identify technologies becoming obsolete and inform about emerging technologies. A watch service for file formats is developed by the National Library of Australia. The Automated Obsolescence Notification System 2 (AONS II) makes it possible to be informed when formats are obsolete or at risk.

4.8 Privacy and Confidence

A software system to preserve private data for the long term has to conform to confidentiality requirements. In the home archiving system, a collection profile is provided to an external web update service. The level of detail of a collection profile ranges from a listing of used formats to characteristics of the objects, available storage space of the user and a user profile. In order to ensure privacy, the user has to be able to select the data that are provided to an external service. More detailed profiles allow more specific preservation rules for the user's collection.

In order to protect the privacy of private data, the home archiving system does not use web services with private data, such as identification, preservation action, or characterisation web services. The services are installed locally and executed on the home archiving system without transferring private data via the internet.

5. HOPPLA SOFTWARE

A first version of a prototype software is currently undergoing evaluation. The software, called Hoppla (Home and Personal Persistent, Long term Archiving), developed in Java, allows the acquisition and selection of digital data, performs migrations according to defined preservation rules and creates multiple backup copies of the output.

5.1 Implementation

The current version of Hoppla supports the acquisition from file systems. An additional module is currently under development to extract e-mails via the IMAP and POP3 protocols. Both messages and attachments are temporarily stored locally. This is realised via a persistence layer handling e-mails in their original format as well as links to attachments on the local file system. The persistence layer stores the e-mails in XML format, preparing them for ingest.

Two kinds of rules are implemented in the Hoppla system, namely backup and migration rules. A migration rule defines a migration service for a specific object format. The rule includes the input format, the output format and the tool to perform the migration including the parameter setting of the tool. The backup rule defines the number of different versions of an object that should be stored in the archive. The rules are currently defined in the client application; the DROID service is integrated into the system to identify object formats and to use Pronom Identifiers for rule definition.

The storage management component in Hoppla supports versioning of objects. At execution of the archiving process,
It allows the it identiﬁes data in the original system that have changed since the last backup. Timestamps of the operating system are currently used to discover changes, but more sophisti- cated models such as those implemented by synchronisation software such as UNISON  are obviously possible. New versions of objects are added to archive. Old versions are kept in the archive. Within a backup rule, the user can specify, depending on object size, how many versions of an object should be kept based on object formats. It is used to meet the demand for keeping few backups of large objects if storage space is scarce. For each object format a backup rule can be deﬁned. In addition to the rules per format, a global default-rule can be used ﬁring for all objects which have not been aﬀected by other rules. When the maximum number of versions of an object is archived, diﬀerent ver- sioning strategies are implemented in the system such as to keep always last versions, keep the ﬁrst and the last version, or keep random interim versions of an object. The logical preservation of the objects is performed ac- cording to preservation rules. Newly added objects or ob- jects with a format with a new preservation rule are mi- grated. The migration is performed by executing the tool with the parameter setting deﬁned in the preservation rule on the home archiving system. If a migration fails the mi- grated objects are deleted and the failure is documented in Figure 2: Screenshot of File Browser in Hoppla the metadata of the original object. The outcome of the suc- cessful migration is the original object, one or more migrated objects and the logging of the tool. The Hoppla system sup- migration, and failures. In order to keep the home archive ports assigning one or more preservation rules for a single up to date including adding new objects, and performing format. Moreover, the system allows versioning of migrated new migrations, the archiving process has to be re-executed objects. 
The preservation rule deﬁnes how many migrated periodically. The user can deﬁne the time period and the versions of an object should be kept in the archive. system creates a reminder for re-execution. The selected ob- The storage management component supports storing the jects and the used storage media are stored in a XML ﬁle to results at one or more storage media. The folder struc- allow a simple re-execution. ture of the original ﬁle system is recreated on the target While the location of objects can change over time, the media as well as speciﬁc structures for other data sources current implementation handles relocated objects as new ob- such as e-mails or web data. This eases locating and using jects. Duplication detection by using checksums can solve the preserved objects for the user. Migrations and previous this issue. versions of objects are stored at the same location of the storage media as the original. A name extension is added to the migrated objects and previous versions providing unique 5.2 Case Study ﬁlenames. A ﬁrst case study was performed on parts of two home For each directory two XML ﬁles are created, one docu- directories from the developer team with a size of 6 and 4,4 menting the objects in the directory the other holding meta- GB. In a ﬁrst run, the home directories are backed up on an data of the objects. The XML ﬁle describing the content in- external hard disc. It took 13,2 and 9 minutes respectively cludes the name of the objects and their history. The history to create a complete backup of the data. documents previous versions and migrations of the object The migration was tested on an initial set of 150 word doc- stored in the same directory. The second XML ﬁle includes uments with a size of 34 MB, 10 postscript objects (25MB) all metadata describing the objects. The metadata contain and 58 jpg images (21 MB). 
Three migration rules were for example the format identiﬁer, the logging information of tested on the data set: migration tools, and checksums. • Conversion of DOC to PDF using antiword 0.37 All information and documentation generated by the Hop- pla system are stored in XML format. It allows the recre- ation of all information stored in the operational metadata • Conversion of PS to PDF using ps2pdf repository from storage media. In order to provide the user a sophisticated way to access the archive a ﬁle browser was de- • Conversion of JPG to TIFF using ImageMagick veloped, shown in Figure 2. In a tree structure the content of The migration results in 7 MB of PDF objects for the word the archive is displayed including the previous versions and documents, 1,7 MB for the postscript data and 40MB of migrations of objects. The ﬁle browser allows the user to ac- TIFF images. The process took about 2 minutes. A second cess metadata about the objects and to retrieve objects from migration test was performed on 636 JPEG images with a the archive. Search functionality for the ﬁle browser using size of 1,8 GB migrating to TIFF using ImageMagick. 1,1 the collected metadata is currently under development. GB of TIFF images were created in 38 minutes. The per- The Hoppla system provides reports about archiving pro- formed case study provided a ﬁrst evaluation of processing cesses including statistics of backed up objects, successful times and storage demand. 5.3 Outlook  Appraisal Task Force. Appraisal task force ﬁnal The ﬁrst version of Hoppla focused on acquisition, ba- report. Tech. rep., InterPARES 1 Project, 2001. sic migration, and storage supporting versioning. Current http://www.interpares.org/display_file.cfm? development eﬀort focuses on ingest and the preservation doc=ip1_aptf_report.pdf. accessed: 25.03.2008. management component. A central web update service will  Baru, C., Moore, R., Rajasekar, A., and Wan, provide rules and services for the Hoppla clients. 
A ﬁrst ver- M. The SDSC storage resource broker. In CASCON sion of the update service will consist of database managing ’98: Proceedings of the 1998 conference of the Centre rules and services and an interface for administration. The for Advanced Studies on Collaborative research (1998), functionality of the web update service will be further ex- IBM Press, p. 5. panded and we speciﬁcally perform research on supporting  Bashi, A. Timevault - gnome backup/snapshot the selection of preservation strategies for collection proﬁles. system. https://launchpad.net/timevault. accessed: Further research will be done on heuristics for the selection 25.03.2008. of electronic material. An ongoing process is the collection  Becker, C., Rauber, A., Heydegger, V., and the evaluation of diﬀerent services for migration and Schnasse, J., and Thaller, M. A generic xml characterisation. language for characterising objects to support digital preservation. In Proceedings of the 23rd Annual ACM 6. CONCLUSIONS Symposium on Applied Computing (New York, NY, In this paper we presented challenges and requirements for USA, 2008), ACM. a digital preservation solution for private users and SOHOs.  Beckerle, M., and Westhead, M. GGF DFDL They diﬀer signiﬁcantly from those in professional settings Primer. Tech. rep., Global Grid Forum Data Format caused by diﬀerent environments, skills, and objectives. The Description Language Working Group, 2004. available tools and services, developed for professional set-  Eastwood, T. Appraising digital records for tings, have to be adopted to meet the requirements of the long-term preservation. Data Science Journal 3 SOHO users. (2004), 202 – 208. We presented a home archiving system that allows pri-  Eckman, C., Reich, V., Robertson, T., and vate users to preserve their data in the long term. The sys- Rosenthal, D. S. Lots of copies keep stuﬀ safe tem combines bitstream preservation and logical preserva- (LOCKSS) government documents: Sger # 0245231. 
tion strategies. It supports the acquisition of digital material In dg.o ’04: Proceedings of the 2004 annual national from diﬀerent sources. The logical preservation is performed conference on Digital government research (2004), by using established best practice preservation strategies. Digital Government Research Center, pp. 1–2. The system supports multiple migration pathways for ob-  Gemmell, J., Bell, G., and Lueder, R. Mylifebits: ject formats. The home archiving system documents object a personal database for everything. Commun. ACM characteristics and performed actions in metadata. Multiple 49, 1 (2006), 88–95. backup versions on diﬀerent storage media avoids the phys-  Gemmell, J., Lueder, R., and Bell, G. The ical loss of the data caused by physical deterioration of the mylifebits lifetime store. In ETP ’03: Proceedings of media. the 2003 ACM SIGMM workshop on Experiential Hoppla has a strong focus on ease of use and heavily re- telepresence (New York, NY, USA, 2003), ACM, lies on the best eﬀort principle. This is realised by centrally pp. 80–83. stored preservation rules as well as tailored to accommodate  Gladney, H. M. Principles for digital preservation. the needs of private users and SOHOs. The ongoing develop- Communication of the ACM 49, 2 (February 2006), ment of the Hoppla software focuses on acquisition plugins 111–116. to capture diﬀerent sources of Internet material, such as e-  Harvard University Library. Jhove - jstor/harvard mail and web sites. Research on the web update service will object validation environment, 2007. focus on methods to support the selection of preservation http://hul.harvard.edu/jhove. accessed: strategies for collection proﬁles. 25.03.2008.  Initiative, D. C. M. Dublin Core Metadata Element Acknowledgements Set, 1.1 ed., Jannuary 2008. http: Part of this work was supported by the European Union in //dublincore.org/documents/2008/01/14/dces/. the 6th Framework Program, IST, through the PLANETS accessed: 25.03.2008. 
project, contract 033789.  Internet Archive. Heritrix. http://crawler.archive.org, 2004. accessed: 7. REFERENCES 25.03.2008.  Ahmed, M., Hoang, H. H., Karim, S., Khusro, S.,  ISO. Space data and information transfer systems – Lanzenberger, M., Latif, K., Michlmayr, E., Open archival information system – Reference model Mustofa, K., Nguyen, M. T., Rauber, A., (ISO 14721:2003), 2003. Schatten, A., Tho, M. N., and Tjoa, A. M.  Kaye, J. J., Vertesi, J., Avery, S., Dafoe, A., Semanticlife - a framework for managing information David, S., Onaga, L., Rosero, I., and Pinch, T. of a human lifetime. In Proceedings of the To have and to hold: exploring the personal archive. International Conference on Information Integration, In CHI ’06: Proceedings of the SIGCHI conference on Web-Applications and Services (Jakarta) (2004). Human Factors in computing systems (New York, NY,  Apple Website. Max os x leopard - time maschine. USA, 2006), ACM, pp. 275–284. http://www.apple.com/macosx/features/ timemachine.html. accessed: 25.03.2008.  Lawrence, G. W., Kehoe, W. R., Rieger, O. Y., 25.03.2008. H.Walters, W., and Kenney, A. R. Risk  Strodl, S., Becker, C., Neumayer, R., and management of digital information: A ﬁle format Rauber, A. How to choose a digital preservation investigation, June 2000. strategy: Evaluating a preservation planning  Maniatis, P., Roussopoulos, M., Giuli, T. J., procedure. In Proceedings of the 7th ACM IEEE Joint Rosenthal, D. S. H., and Baker, M. The lockss Conference on Digital Libraries (JCDL’07) (New peer-to-peer digital preservation system. ACM Trans. York, NY, USA, 2007), ACM, pp. 29–38. Comput. Syst. 23, 1 (2005), 2–50.  The Center for Research Libraries (CRL), and  Marshall, C. C. Rethinking personal digital Online Computer Library Center, Inc.(OCLC ). archiving, part 1. D-Lib Magazine 14, 3/4 Trustworthy Repositories Audit & Certiﬁcation: (March/April 2008). Criteria and Checklist (TRAC). Tech. Rep. 1.0, CRL  Marshall, C. C. 
Rethinking personal digital and OCLC, February 2007. archiving, part 2. D-Lib Magazine 14, 3/4  The National Archives. Droid - digital record (March/April 2008). object identiﬁcation, 2007. http://droid.  nestor Working Group -Trusted Repositories sourceforge.net/wiki/index.php/Introduction. Certification. Catalogue of Criteria for Trusted accessed: 25.03.2008. Digital Repositories. Tech. rep., nestor - Network of  The National Library of Australia. Automatic Expertise in long-term STORage, Frankfurt am Main, obsolescence notiﬁcation system (AONS). http: June 2006. Version 1. //pilot.apsr.edu.au/wiki/index.php/AONS_II.  Pierce, B. C., and Vouillon, J. What’s in Unison? accessed: 25.03.2008. A formal speciﬁcation and reference implementation of  Thomas, S. A practical approach to the preservation a ﬁle synchronizer. Tech. Rep. MS-CIS-03-36, Dept. of of personal digital archives. Report, Paradigm, March Computer and Information Science, University of 2007. http://www.paradigm.ac.uk/projectdocs/ Pennsylvania, 2004. jiscreports/ParadigmFinalReportv1.pdf. accessed:  Preservation Metadata: Implementation 25.03.2008. Strategies (PREMIS) Working Group. Data  UNESCO. Guidelines for the preservation of digital dictionary for preservation metadata. Tech. rep., heritage. UNESCO, Information Society Division, Online Computer Library Center, Inc. (OCLC) and March 2003. unesdoc.unesco.org/images/0013/ Research Libraries Group RLG, Dublin, Ohio, USA, 001300/130071e.pdf. accessed: 25.03.2008. May 2005.  Waugh, A., Wilkinson, R., Hills, B., and  Rothenberg, J. Avoiding Technological Quicksand: Dell’oro, J. Preserving digital information forever. Finding a Viable Technical Foundation for Digital In DL ’00: Proceedings of the ﬁfth ACM conference on Preservation. Council on Library & Information Digital libraries (New York, NY, USA, 2000), ACM, Resources, 1999. http://www.clir.org/pubs/ pp. 175–184. reports/rothen-berg/contents.html. accessed:
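
As an illustration of the backup rules described in Section 5.1 (one rule per object format, a global default rule, and a versioning strategy that applies once the maximum number of versions is reached), the following Python sketch may help. It is not Hoppla's Java implementation: the names `BackupRule`, `Strategy`, `rule_for` and `versions_to_keep` are hypothetical, the format identifier is only an example of a PRONOM-style key, and the reading of "random interim versions" as "first and last version plus randomly chosen versions in between" is an assumption.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    """Versioning strategies named in the text (identifiers are hypothetical)."""
    KEEP_LAST = "keep the most recent versions"
    KEEP_FIRST_AND_LAST = "keep the first and the last version"
    KEEP_RANDOM_INTERIM = "keep first, last and random interim versions"


@dataclass(frozen=True)
class BackupRule:
    max_versions: int
    strategy: Strategy


def rule_for(format_id, rules, default_rule):
    """Return the rule defined for a format, else the global default rule."""
    return rules.get(format_id, default_rule)


def versions_to_keep(versions, rule, rng=None):
    """Select the versions to retain; `versions` is ordered oldest first."""
    if rule.max_versions < 1:
        raise ValueError("max_versions must be at least 1")
    if len(versions) <= rule.max_versions:
        return list(versions)  # limit not reached, keep everything
    if rule.strategy is Strategy.KEEP_LAST:
        return list(versions[-rule.max_versions:])
    if rule.max_versions == 1:
        return [versions[-1]]  # only one slot: keep the newest version
    if rule.strategy is Strategy.KEEP_FIRST_AND_LAST:
        return [versions[0], versions[-1]]
    # KEEP_RANDOM_INTERIM (assumed semantics): first, last, and a random
    # sample of the versions in between, returned in chronological order.
    rng = rng or random.Random()
    interim = sorted(rng.sample(range(1, len(versions) - 1), rule.max_versions - 2))
    return [versions[0]] + [versions[i] for i in interim] + [versions[-1]]


# Example: one format-specific rule plus a global default rule.
rules = {"fmt/43": BackupRule(max_versions=2, strategy=Strategy.KEEP_LAST)}
default_rule = BackupRule(max_versions=3, strategy=Strategy.KEEP_FIRST_AND_LAST)
history = ["v1", "v2", "v3", "v4", "v5"]
print(versions_to_keep(history, rule_for("fmt/43", rules, default_rule)))  # ['v4', 'v5']
```

The sketch leaves out the dependence on object size mentioned in the text; a size threshold could be added as a further field of the rule.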
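
Section 5.1 notes that relocated objects are currently treated as new objects and that duplicate detection by checksums could solve this. A minimal Python sketch of that idea follows, assuming a checksum index recorded during the previous archiving run. The function names, the choice of SHA-256 and the four status labels are illustrative assumptions and are not taken from Hoppla.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(paths):
    """Map checksum -> path as seen in one archiving run.

    Assumption: if several files share a checksum, the last path wins.
    """
    return {sha256_of(path): str(path) for path in paths}


def classify_objects(paths, previous_index):
    """Label each current path relative to the previous run's index."""
    current = {str(path) for path in paths}
    result = {}
    for path in sorted(current):
        previous_path = previous_index.get(sha256_of(path))
        if previous_path is None:
            result[path] = "new"          # content never archived before
        elif previous_path == path:
            result[path] = "unchanged"    # same content at the same location
        elif previous_path in current:
            result[path] = "duplicate"    # old location still exists
        else:
            result[path] = "relocated"    # same content, old location gone
    return result
```

Objects labelled "relocated" would then only need a metadata update recording the new location instead of a fresh copy in the archive.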