

                                         It’s About Time:
        Research Challenges in Digital Archiving and Long-term Preservation


     Report on the NSF Workshop on Research Challenges in Digital Archiving:
      Towards a National Infrastructure for Long-Term Preservation of Digital
                                           Information


                  Workshop Report -- Draft 2.0 (Pre-Publication Draft)
                                         August 12, 2002




                                        Sponsored by the
               National Science Foundation Digital Government Program
    National Science Foundation Division of Information and Intelligent Systems
                                                 and
            Library of Congress National Digital Information Infrastructure
                                   and Preservation Program




               Comments on this draft are welcome and should be sent to
                       orgcomm@umich.edu or hedstrom@umich.edu
This workshop was funded by NSF Award Number 0214690. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the authors
 and do not necessarily reflect the views of the National Science Foundation or the
Library of Congress.




It’s About Time: Research Challenges in Digital Archiving
Workshop Report – Pre-Publication Draft August 12, 2002




                                         Executive Summary


                                           It’s About Time:
        Research Challenges in Digital Archiving and Long-term Preservation


    Report on a Workshop on Research Challenges in Digital Archiving: Towards a
     National Infrastructure for Long-Term Preservation of Digital Information
                                       Pre-Publication Draft
                                           August 12, 2002

Margaret Hedstrom, Sharon Dawes, Carl Fleischhauer, James Gray, Clifford Lynch,
Victor McCrary, Reagan Moore, Kenneth Thibodeau, and Donald Waters (Organizing
Committee)


Executive Summary
In April 2002, a group of computer scientists, information scientists, archivists, digital
library experts, and government program managers met to examine the prospect of
advancing computer and information technology research through a research program
that addresses the unique challenges of long-term preservation of digital information.
Developing an infrastructure for preserving digital information for future exploitation
raises many interesting and difficult issues. The requirements for long-term preservation
test the limits of many current technologies and information management methodologies.
Digital archiving research is based on the premise that computer and information




* A final report from the workshop will be available pending further external review in September 2002 at
www.si.umich.edu/digarch/ and in printed form. In the meantime comments or requests for copies are
welcome and should be sent to hedstrom@umich.edu .






technology will continue to evolve at a rapid pace as long as many of the country’s best
minds concentrate on IT research and development, and as long as the IT sector continues
to serve as an engine for economic development and growth. Some of the information
created yesterday and today may move through many generations of information
technology before it is reused at some point in the future. Other resources may be in
continuous demand over many decades while new systems and technology evolve around
the data. Long-term digital archiving requires systems, institutions, and business models
that are robust enough to withstand technological failures, changes in institutional
missions, and interruptions in management and funding. This report summarizes the
discussions and recommendations of the Workshop on Research Challenges in Digital
Archiving and Long-term Preservation that was sponsored by the National Science
Foundation and the Library of Congress. Some of the key recommendations of the
workshop include:


       The National Science Foundation, the Library of Congress, and other government
        agencies should undertake a massive research effort to improve the state of
        knowledge and practice for long-term preservation of digital information.
       Important new research opportunities have emerged in Computer and Information
        Science to address issues of storage and processing capacities, interoperability
        among heterogeneous systems, automation of many intake and preservation
        management processes, and complex metadata and semantic representation
        requirements.
       Long-term preservation issues will not be resolved through better tools and
        technology alone. Research opportunities abound around questions of economic
        and business models for affordable and sustainable long-term preservation
        programs. Research is also needed on policies and incentives for long-term
        preservation and on the economic, social, and legal impediments to digital
        archiving.
       Research in almost every discipline depends on well-managed, reliable, and
        readily accessible digital resources. Future research capabilities will be seriously
        compromised without significant investments in research and the development of
        digital archives.
       A pressing and urgent need exists to develop better solutions for long-term digital
        preservation in government agencies, libraries, archives, museums, private
        corporations, and even among private citizens who rely increasingly on the
        Internet to transact business and to communicate with colleagues, friends and
        family members.

In the remainder of the report we describe new challenges and opportunities in digital
archiving, explain what is at stake if these challenges are not addressed, and set out a
research agenda with priority research areas and a discussion of research modalities and
necessary investments.


NEW CHALLENGES IN DIGITAL ARCHIVING
Digital collections are vast, heterogeneous, and growing at a rate that outpaces our
ability to manage and preserve them. One of the marvels of the IT revolution is the
continuous improvement in computer, memory, and storage performance and the
simultaneous drop in costs. Thanks to what has been called “silicon scaling” the
processing power of a 1980s vintage mainframe computer now fits on a small silicon chip
that can be embedded in any number of capture devices from complex remote sensors to
consumer digital cameras. Digital storage devices and media have benefited from similar
performance improvements and cost declines. More and more individuals can afford
laptop and desktop computers with multiple gigabytes of storage. Larger organizations
regularly add terabytes of storage capacity. One might suspect that archiving digital
information would become easier and cheaper as a consequence of these improvements.
But from a long-term preservation perspective, there is a dark side to the rapid growth in
digital information. The technologies, strategies, methodologies, and resources needed
to manage digital information for the long term have not kept pace with innovations in
the creation and capture of digital information.






A few examples illustrate this problem. Internet search engines crawl the web, copy web
pages, and then index them automatically so that users have a reasonable chance of
finding information relevant to them on the web. Large search engine companies, such
as Google, index more than 2 billion web pages and store copies in a cache as a back-up
in case the requested page is not available. But search engine companies are in the
business of providing tools for searching and navigating. They are not in the business of
long-term archiving of the web or even a significant portion of it, nor should they be
expected to take on this responsibility. Who will? The Internet Archive was founded in
1996 to preserve content distributed on the web. In six years it has developed the largest
collection of web pages in the world – about 10 billion web pages, including 200 million
pages on the 2000 Election and 500 million pages related to the terrorist attacks of
September 11, 2001. Although the Internet Archive has a policy to migrate its collections
to new media at least once every ten years, it has not yet undertaken one complete
migration. As a small organization without a predictable, steady flow of resources, it is
also seeking stable institutional partners, including the Library of Congress and the
Smithsonian Institution, to collaborate in its long-term preservation endeavors.


Much more digital content is available and worth preserving; researchers
increasingly depend on digital resources and assume that they will be preserved.
During the last decade, many scientific, academic and cultural organizations as well as
government agencies and private enterprises have assembled valuable collections of
digital information, either in the normal course of business or as special projects. Under
the American Memory Program, the Library of Congress (LoC) led an effort to digitize
more than 100 historical collections from materials in its own holdings and in libraries,
archives, and museums across the country. The seven million items in the American
Memory Collections are used daily by teachers, students, scholars, genealogists, and
private citizens. The Digital Library Initiatives, sponsored by NSF, DARPA, the
National Library of Medicine, LoC, NASA, and the National Endowment for the
Humanities, fostered research and development for hundreds of digital libraries. Many
digital library projects started as testbeds and prototypes, but they have evolved into
critical research resources for almost every discipline. These resources need to be
maintained into the foreseeable future to support on-going research and teaching and to
protect the several hundred million dollars invested to digitize, organize, and provide
access. Scholarly journals, preprints and even raw research data have moved on-line and
become the preferred means for keeping up with new research in many fields. Such
resources are emerging as the vital venue for scholarly communications. Society’s ability
to preserve a continuous record of research and scholarship will require an infrastructure
for archiving digital scholarly communications that is as affordable and as robust as the
complex networks and relationships among libraries, and between libraries and content
creators, that have served reasonably well to preserve the published output for the last
400 years.


More and more valuable content is “born-digital” and can only be managed, preserved,
and used in digital form. In the last decade, researchers have mapped significant portions
of the human genome. Advances in biomedical research depend on building and
preserving complex genomic databases. Research in biodiversity and ecosystems, global
climate change, meteorology, and space science – to name only a few fields – is built on
the ability to combine vast quantities of digital information with complex models and
analytical tools. Indeed, the increasing use of complexity theory and integrated models in
scientific research has generated the demand for massive data sets and complex analytical
tools. Recently, NASA investigators had to combine data from current
satellites with data from satellite instruments launched in the early 1980s in order to discover
important anomalies in tropical radiation that current models of atmospheric variability
did not predict. In the future, even longer time series of Earth
observations will be required to establish the true variability of this system and to reveal
unexpected changes and cause-and-effect relationships that could not be exposed reliably
without this long-term record. Digital preservation is important because it allows new
data to be derived from unexpected uses of previous data. In ecology, court records have
been useful in establishing long-term changes in ecosystem types. In atmospheric
chemistry, old stellar spectra have been used to establish changes in the chemical
composition of the atmosphere. Without better systems and methodologies for long-term
preservation, integration of older and more recent data is costly and cumbersome and
many valuable resources remain at risk.


Government, commerce, and personal communications rely on digital information
and communications. Critical needs for digital archiving strategies extend into almost
all aspects of modern society. Business-to-business transactions, on-line purchases of
goods and services, e-mail, and even stand-alone computer systems generate enormous quantities
of information, some of which is worth saving for the long term. The aircraft industry
depends on software systems to design, manufacture, and maintain complex commercial
aircraft. For safety’s sake, design specifications, records of manufacturing processes,
parts inventories, maintenance records, and performance data, much of which is in digital
form, must be kept as long as a particular model of aircraft is in service -- a period that
can exceed fifty years. The FDA requires pharmaceutical companies to file new drug
applications electronically along with documentation of research protocols, tests, and
clinical trials. These digital records have to be kept at least as long as a drug is available.
Medical records that may be needed for an entire lifetime are becoming electronic.
Citizens’ rights, such as eligibility for Social Security benefits, are documented in
databases that accumulate data through each individual’s working life. E-government and
E-commerce could flounder if better methods are not found to identify and preserve those
digital records that have long-term uses for keeping the business running and for
maintaining accountability. The entertainment industry is shifting rapidly to digital
masters of recorded sound, movies, and television programming. Within a few years
digital television and digital movies will be the preferred delivery method. Even private
citizens are seeking ways to manage and preserve their e-mail, on-line accounts, and
digital photographs.


What is unique about digital archiving research? It’s about time.
Digital preservation shares many requirements with well-designed information systems,
such as security, authentication, robust models for representation, and sophisticated
information retrieval mechanisms. Nevertheless, unique long-term preservation
requirements raise many interesting research questions that demand innovative solutions.
One unique aspect of preservation is its concern with the long term, where long term may
simply mean long enough to be concerned about the obsolescence of technology or it
may mean decades or centuries. When long-term preservation spans several decades,
generations, or centuries, the threat of interrupted management of digital objects
becomes critical. Unlike many physical objects that can withstand some period of
neglect without resulting in total loss, digital objects require constant maintenance and
elaborate “life-support” systems to remain viable. Redundancy, replication, and security
against intentional attacks on archival systems and against technological failures are
critical requirements for long-term preservation, as are issues of forward migration. The
challenges of maintaining digital archives over long periods of time are as much social
and institutional as technological. Even the most ideal technological solutions will
require management and support from institutions that go through changes in direction,
purpose, management, and funding.


The funding and business models for digital archives differ considerably from common
business models that are based on relationships between investments, operating costs, and
the utility of goods and services. Repositories may be expected to preserve digital
resources even though their utility may not become apparent until well into the future. In
this respect, the economic models for digital archives resemble the economics of public
goods, where the primary beneficiaries of current investments may be future generations.
Future users of digital archives will have different needs, expectations, technologies and
analytical tools from those of the communities that created the digital content initially.
This raises challenging research questions in the areas of semantics and description and
in knowledge management technologies that will enable future reuse of digital archives.
Another factor that distinguishes digital preservation research from many other types of
research is the difficulty of knowing whether or not we have solved the problems
successfully, because the ultimate test of success will be the new knowledge and
discoveries that result at some future date. This problem requires some very challenging
thinking about success measures and evaluation criteria, and it will demand an extended
research effort over the next decade.






A Digital Archiving and Long-term Preservation Research Agenda

Digital archiving challenges are ubiquitous and multi-faceted. As a consequence, a
significant, multi-disciplinary research effort is needed to produce new knowledge in
computer and information science, economics, and policy. Solving this complex problem
will require many different approaches. We do not anticipate that a single solution will
emerge or would be appropriate for the wide variety of collections, technologies, and
organizational arrangements governing digital archiving requirements. At the same time,
we believe that concerted research efforts will produce basic principles, new
technologies, and new curatorial methods that will enable long-term preservation of vast
resources at a fraction of the cost of today’s immature and customized strategies.
Opportunities for research partnerships abound between academic researchers,
researchers in industry, and the many government agencies, cultural institutions, and
private companies that are seeking solutions to long-term preservation problems. These
research opportunities fall into four closely related categories: attributes of digital
repositories, attributes of archived collections, tools and technologies, and economic and
policy models.


Attributes of Digital Repositories
Even with a common conceptual framework, it seems unlikely that a single approach will
satisfy all the digital preservation needs of various organizations and individuals. The
development of infrastructures for digital archiving is strongly driven by the need to
support multiple communities. Each community has unique requirements that will
influence the design of the digital archive. Computer, information science and
engineering research is needed on a spectrum of archival repository designs. Variations
in archival repository models raise many different research issues.


       Data model driven architecture. This model is used to preserve specific types
        of data for future reuse. Associated research issues include capacity and
        scalability of multiple petabyte repositories and methods for automated
        acquisition, quality control, and description.
       Controlled access repositories. Research questions derive from stringent
        requirements for auditability, authentication, and access controls.
       Archives of temporally changing data. These archives preserve data that is
        continuously changing, either through regular additions that are streamed into the
        archive or through updates and changes. Research is needed on definitions,
        methodologies, and tools for time-based capture and representation, for taking
        useful snapshots of dynamic databases, for versioning, and on the identification of
        knowledge models to represent temporal or procedural relationships.
       Archives of evolving data. Preservation and management of many types of
        digital information requires transformation of the original data to new formats or
        canonical forms. Research is needed to better define and characterize
        transformation processes so that they can be automated, and so that
        transformations made on the original data can be documented.
       Archives of derived data products. Archives are not limited to the original
        materials. In the scientific community, processing may be done on archived
        collections to create derived data products to address scientific questions.
        Research issues include the ability to characterize the derived data products with
        descriptive metadata. This descriptive metadata can include the type of
        processing algorithm that was applied, the mathematical expression of the related
        operation, and the associated software implementation.
       Repurposing of archives. Many archives will need to enable new access
        mechanisms so that their collections can be used for different purposes from those
        originally envisioned. Re-purposing of archived material may require the ability
        to stream the entire collection through processing steps. This requirement
        illustrates the need to think of archives as repositories of information and
        knowledge that may need to be updated at periodic intervals. Archives in the
        future may be dependent upon the ability to support generation of new semantic
        indexing through the processing of every digital entity.
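
The descriptive metadata suggested above for derived data products can be sketched as a small record. This is only an illustration under stated assumptions: the class name, field names, and the sample identifier are hypothetical and are not drawn from any existing archive system. It shows one way to keep the algorithm type, the mathematical expression, and the software implementation together with a derived product.

```python
from dataclasses import dataclass
import hashlib
import json


@dataclass(frozen=True)
class DerivationRecord:
    """Hypothetical descriptive metadata for one derived data product."""
    source_ids: tuple      # identifiers of the archived inputs
    algorithm: str         # type of processing algorithm applied
    expression: str        # mathematical expression of the operation
    software: str          # software implementation that was run
    product_digest: str    # SHA-256 digest of the derived product


def derive_mean(source_id, values):
    """Compute a derived product (a mean) and describe how it was made."""
    product = sum(values) / len(values)
    payload = json.dumps(product).encode("utf-8")
    record = DerivationRecord(
        source_ids=(source_id,),
        algorithm="arithmetic mean",
        expression="(1/n) * sum(x_i)",
        software="derive_mean (illustrative function)",
        product_digest=hashlib.sha256(payload).hexdigest(),
    )
    return product, record


# Illustrative input; the identifier is made up for this sketch.
product, record = derive_mean("obs-series-A", [2.0, 4.0, 6.0])
print(product, record.algorithm)
```

Storing the digest alongside the description is a design choice of this sketch: it lets a later reader confirm that the product they hold is the one the record describes.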


Although this spectrum may not capture all potential types of archival repositories, it
illustrates the need for research that more closely examines the relationships between the
purpose of the archive, the types of data and information that it acquires, and the needs of
its producer and user communities.


Attributes of Archived Collections
A great deal of information is “saved” in digital form on file servers, on personal hard
drives, and in large repositories of tapes and optical disks. Nevertheless, archived
collections have additional attributes that enhance their quality, utility, trustworthiness,
and longevity. Archival collections don’t just happen when someone clicks on the “save”
icon; dumps of saved documents offer precious little to future researchers because
they lack critical contextual and content-oriented metadata. Rather, archival collections
are created through curatorial processes that include selection, organization, description,
and quality control, and they require individuals or organizational entities that will take
on formal responsibility for long-term stewardship. Just as the development of
infrastructure for digital archiving is strongly driven by the need to support multiple
communities, it is also strongly driven by the requirements to preserve many diverse
types of complex objects and collections -- from text, to images, to recorded sound, to
computer models and simulations, to digital video, plus all combinations of these object
types. Research is needed in several key areas to better define the attributes of archival
collections and curatorial processes, including:


       Selection and preservation of complex digital objects. Methods exist today to
        preserve simple, static digital objects, but managing and preserving complex
        multi-media objects and dynamic objects that change on a regular basis present
        significant challenges. An increasing percentage of born-digital content falls into
        this category.
       Aggregation of items and objects into collections. With the need to capture
        materials from the web before they are updated or deleted, research is needed to
        determine the appropriate extent and depth of web-based collections, to bring
        coherence to widely distributed collections, and to further develop effective and
        economical collection-level metadata schema that describe attributes common to
        all items in a collection and provide for inheritance of metadata from the
        collection to the item level.
       Decision models for selection. Long-term preservation does not imply that
        everything is worth saving. Most libraries, archives, and museums have well
        established collecting policies for physical items, but selection decisions in the
        digital realm are becoming more complex. An increasing amount of the content
        that libraries deliver to users is held in publishers’ repositories and is not owned
        physically by the library, raising concerns over who should assume responsibility
        for long-term preservation (publishers or libraries) and when (if ever) the
        obligations to acquire and preserve published material should shift from the
        content providers to a library or an archive. Collecting policies that were designed
        for physical materials do not encompass new types of digital objects and
        collections (such as web sites and multi-media productions). Formal models of
        selection decisions are needed so that tools can be developed to assist curators
        with selection responsibilities and to automate some selection decisions, but not to
        eliminate the considerable human judgment that goes into collection development.
       Resolution of naming hierarchies. Multiple naming conventions are used to
        describe digital entities, ranging from the components of the data model, to local
        file names for the digital entity, to global file names used to assemble distributed
        collections, to attribute names used to build collection catalogs, to relationship
        names used to describe properties of the collection. Preservation requires the
        ability to manipulate each name space at some arbitrary point in the future. A
        major research question is whether the generalization of name spaces as
        ontologies that characterize either semantic relationships, structural relationships,
        or logical relationships, will lead to a simpler way to preserve the information and
        knowledge content of archives.
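
The idea of inheriting metadata from the collection to the item level, mentioned in the list above, can be illustrated with a minimal sketch. The collection name, item identifiers, and field names below are hypothetical; the sketch assumes that item-level values should override inherited ones, which is one possible policy among several.

```python
from collections import ChainMap

# Hypothetical collection-level record: attributes common to every item.
COLLECTION = {
    "collection": "Example Web Harvest",
    "rights": "public domain",
    "language": "en",
}

# Item-level records store only what differs from, or adds to, the collection.
ITEMS = {
    "item-001": {"title": "Home page snapshot"},
    "item-002": {"title": "Press release", "language": "es"},
}


def effective_metadata(item_id):
    """Return item metadata with collection values inherited as defaults."""
    # ChainMap searches the item record first, then the collection record,
    # so an item-level value overrides the inherited one.
    return dict(ChainMap(ITEMS[item_id], COLLECTION))


first = effective_metadata("item-001")
second = effective_metadata("item-002")
print(first["language"], second["language"])
```

Describing shared attributes once at the collection level keeps item records small, which matters when curators must describe very large harvests economically.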


Tools and Technology
Human labor is the most expensive component of digital archiving systems. Therefore,
research and development of better archiving tools and technologies will not only make
digital archives more robust and reliable, it will also drive down the costs of this
endeavor. Some of the priority areas of research and technology development include:


       Acquisition and ingest. Archives can use automated web crawlers and
        harvesters (the “pull” method) or formal submissions (the “push” method) or
        some combination of these to acquire digital content. Both models would benefit
        from research that allows finer tuning of ingestion tools and better
        integration with selection criteria and subsequent preservation management
        requirements. Given the vast quantities of data likely to flow into digital archives,
        tools are needed for automated indexing, metadata extraction, validation, and
        quality control. Tools are also needed to transform disparate types of objects into
        the formats, standard forms, and data models that a repository can manage over
        the long term.
       Naming and authorization. Managing the identity of preserved digital objects
        over time is a challenge for digital archives because the identifiers assigned to
        digital objects can be changed easily and the technologies for naming and tracking
        digital objects evolve over time. Research is needed to develop methods for
        unique and persistent naming of archived digital objects, tools for certification
        and authentication of preserved digital objects, methods for version control, and
        interoperability among naming mechanisms used by different content providers.
        The emergence of data grids that create global name spaces is an example of a
        technology for persistent naming. This technology needs to be extended to
        support persistent naming of the information and knowledge content of the
        collections.
       Decision models and metrics. In addition to decision models to support
        selection, research is needed to develop models and tools that will support
        decisions regarding preservation formats and standards, choice of preservation
        strategies (normalization, migration, emulation), and on the costs and benefits of
        various levels of description and metadata. Key research areas include metrics for
        measuring the quality and fidelity of preserved digital objects and for
        documenting the consequences of archival processes on them. Metrics need to
        include the maximal sustainable archive size (as a function of the access rate), the
        archival bandwidth (amount of material that can be moved forward into the future
        as a function of the type of storage technology), and the re-purposing rate (the
        amount of time needed to process the entire collection to derive new collection
        attributes).
       Standards and interoperability. Standards for data formats, data models,
        metadata, and many other aspects of digital information are useful for long-term
        preservation, but standards change over time and archived digital entities will
        have to be migrated to new standards in the future. Longevity of digital
        information will be enhanced through research on standard and long-term
        methods for representing text, sound, image, video, and other object components
        and for characterizing their semantic, temporal, spatial, and procedural
        relationships. A migration can be viewed as lossless if the new standard provides a
        superset of the features of the old standard. A goal for standard encoding formats
        is the creation of lossless feature conversions when migrating between standards.
        Research is also needed to support interoperability among different competing
        standards and for developing models that help predict which standards are likely
        to achieve wide-scale adoption over extended periods of time.
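
The superset criterion for lossless migration described above can be illustrated with a
minimal Python sketch. The feature names and the three format descriptions are hypothetical
and chosen only for illustration; a real format registry would need to describe features in
far more detail than a flat set of labels.

```python
def is_lossless_migration(old_features, new_features):
    """Return True if every feature of the old standard exists in the new one."""
    return set(old_features) <= set(new_features)


def features_lost(old_features, new_features):
    """Return, in sorted order, the features the new standard cannot represent."""
    return sorted(set(old_features) - set(new_features))


# Hypothetical feature sets for an old standard and two candidate successors.
OLD_FORMAT = {"text", "bold", "footnotes"}
NEW_FORMAT_A = {"text", "bold", "footnotes", "hyperlinks"}  # superset of the old
NEW_FORMAT_B = {"text", "hyperlinks"}                       # drops two features

if __name__ == "__main__":
    for name, candidate in (("A", NEW_FORMAT_A), ("B", NEW_FORMAT_B)):
        print(name, is_lossless_migration(OLD_FORMAT, candidate),
              features_lost(OLD_FORMAT, candidate))
```

Under these assumptions, a migration to the first candidate is lossless, while a migration
to the second would need to document the features it cannot carry forward.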


Policy and Economic Models
Even the most effective tools and technology will be useless without a policy and
economic environment that is conducive to long-term preservation. The area of policy
and economic models is ripe for research. Some of the key research areas include:


       Incentives for long-term preservation of digital information. Research is
        needed on a variety of incentives that would encourage organizations to develop
        digital archiving capabilities, build repositories, provide archiving services, and
        create content in ways that facilitate its long-term preservation. A variety of
        mechanisms warrant investigation including direct public subsidies, tax incentives
        for placing content in the public domain prior to the expiration of copyrights,
        philanthropic donations, and market mechanisms that provide for cost recovery or
        revenue streams to support the repository.
       Incentives for deposit of digital content into archives. Conversely, content
        creators need incentives to deposit content in repositories for long-term
        preservation. Research in this area is closely tied to the concept of trust.
        Depositors must have a very high level of trust in a repository based on secure
        technology, a track record of performance, and consistent application of rules and
        agreements.
       Metrics. There is a critical need for research that will produce metrics and
        methods to measure almost every aspect of digital archiving from the
        performance of storage media over the long-term, to the effectiveness and costs of
        different preservation strategies, to the market value of archiving services and
        market analysis of user demand. Evaluation of digital archiving is impossible
        without concrete measures of the costs, benefits, and value of digital objects.
       Intellectual capital. Archives need to become the repositories of intellectual
        capital that is viewed as the driving resource for economic growth. This
        emphasizes the view of archives as information and knowledge repositories. The
        goal of the archive is to make the information and knowledge content as readily
        accessible as possible, and to make it easy to re-purpose the collection for a new
        use. Digital archiving research is needed to achieve this goal in ways that are
        sustainable, manageable, and cost-effective.


Research Modalities and Scale
Most digital archiving research to date can be characterized as a combination of small
stand-alone projects, projects to resolve immediate operational problems, and projects
that were tacked onto larger research initiatives. A concerted, focused effort is needed
now that engages a sufficient number of researchers, involves government agencies and
other partners with substantial digital archiving needs, and mobilizes an appropriate level
of investment to address the problem effectively. We anticipate that a minimum
investment of $5 to $8 million per year is needed for a focused research program for the
next ten years. The ten-year time frame is essential, not only because of the complexity
of the problem, but also because of the considerable time required to implement,
evaluate, and test the results of research. A ten-year program would also provide a
foundation for evaluating digital preservation strategies over two or three generations of
computer and information technologies. We recommend that the National Science
Foundation and the Library of Congress launch this research initiative; encourage
sponsorship from other government agencies, private foundations, content providers, and
industry; and participate in active partnerships with researchers from many disciplines.


One exciting aspect of research on digital archiving and long-term preservation is that the
research is amenable to many different research methodologies and innovative
approaches. Possible research methodologies span a spectrum from small, single-investigator
projects to testbeds involving many researchers and multiple participating
institutions. Another attractive feature is that, although oriented to the long term, digital
archiving research may have immediate societal benefits by preserving important digital
resources that might otherwise be lost, producing more cost-effective and sustainable
models that address current archiving needs, and by creating business opportunities for
new technologies and services. Therefore, we recommend support for a wide variety of
research modalities, ranging from small, single-investigator projects to the creation of
two or three large testbeds involving multiple institutions, large teams of researchers, and
experimentation with existing digital collections with obvious long-term value. Many
opportunities exist for partnerships between researchers and organizations of all sorts that
hold significant digital collections and face pressing digital archiving needs. There may
be benefit to creating one or more centers for digital archiving and long-term preservation
research to serve as focal points for this effort and to address issues of technology and
knowledge transfer, education and training, and capacity building.


Conclusion
It’s about time to launch a new research initiative that will advance research in computer
and information science, information economics, policy, and social and organizational
behavior while addressing critical needs in government, the private sector, universities,
and cultural institutions to find reliable, sustainable, and cost-effective means to preserve
valuable digital information resources that are critical to near-term and long-term
discoveries of new knowledge. A concerted research effort undoubtedly will advance
our knowledge in many disciplines while also contributing to the foundation and
infrastructure for the discovery and generation of new knowledge in the future. The
remainder of this report presents a more thorough discussion of needs and opportunities
for digital archiving and preservation research.






Table of Contents
Preface
Workshop Report
        What is at Stake?
        Digital Preservation Challenges
        Research Areas
                 Architectures for Repositories
                 Attributes of Archived Collections
                 Policy and Economic Models
                 Tools and Technologies
        What is Unique About Long-term Preservation Research
        Priorities for Research
        Research Scenarios
Organizing Committee
List of Participants






        Preface
This is the pre-publication draft of the Workshop Report. A final report from the
workshop will be available pending further external review in September 2002 at
www.si.umich.edu/digarch/ and in printed form. In the meantime comments or requests
for copies are welcome and should be sent to hedstrom@umich.edu.




                                    It’s About Time:
       Research Challenges in Digital Archiving and Long-term Preservation


           Report on the Workshop on Research Challenges in Digital Archiving:
         Towards a National Infrastructure for Long-Term Preservation of Digital
                                           Information


                Workshop Report – Draft 2.0 – Pre-publication Draft
                                     August 12, 2002


"For digital preservation, the organizational effort -- the process of building deep
infrastructure -- necessarily involves multiple, interrelated factors, many of which are
either unknown or poorly defined." Task Force on Archiving Digital Information, 1996


In 1996, the Task Force on Archiving Digital Information articulated the critical
challenges for long-term preservation of digital information and identified the key
components of a deep infrastructure necessary to ensure continuing access to valuable
digital resources. Computer and information technology has continued to evolve during
the intervening years, providing more powerful tools for creating, organizing, and
distributing digital resources while simultaneously raising new and complex challenges
for long-term preservation of digital information. The need to address the question of the
longevity of digital information has become more urgent as repositories of digital
information are surpassing physical archives in both scale and significance. The ability to
preserve digital information is a serious challenge for government agencies, scientific
data repositories, libraries, archives, museums and other cultural heritage organizations,
and any organization that needs continuing access to digital information resources.


A group of fifty experts from government agencies, academia, and industry gathered on
April 12 and 13, 2002 at a workshop sponsored by the National Science Foundation and
the Library of Congress to discuss the research challenges in digital archiving and long-
term preservation. Individuals with expertise in computer science, mass storage systems,
archival science, digital libraries, and information management discussed with
government managers the obstacles to preserving digital information for future reuse.
The group developed a research agenda that articulates the priority areas for research to
address technical, economic, and policy issues that impede continuing access to and reuse
of digital information. This report describes the issues discussed at the workshop and
presents proposals for urgently needed research.


What is at Stake?
Concern over society's ability to preserve digital information for near-term, intermediate
and long-term future reuse is not new. A few archivists, librarians, and scientific
researchers began to question whether it was possible to preserve information encoded on
fragile storage media and dependent on particular computing platforms shortly after
computer technology was first used to generate and store data in the 1950s. During the
past decade several conferences and workshops have addressed the need for research that
would produce practical and cost-effective solutions to the problems of archiving digital
information, and several small initiatives have generated some useful results.
Nevertheless, many new research challenges continue to surface, several questions
remain unanswered, and new questions emerge as the technology evolves.


Several recent developments have created a climate that is conducive to large-scale
systematic research on digital archiving. There is increasing demand on organizations of
all types to provide ongoing access to digital resources. Over the course of the last
decade government agencies, research libraries, corporations, and private individuals
have accumulated vast quantities of digital information, much of which remains valuable
into the foreseeable future. Federal scientific agencies have invested billions of dollars to
develop and deploy satellites, remote sensing devices, and other instruments, which send
terabytes of data to ground-based receiving stations for processing and analysis. In
biomedical research, teams of investigators in universities and in the burgeoning
biomedical industry rely on access to massive databanks, such as the human genome
databases, to identify the basis for myriad diseases and to accelerate the discovery of
effective remedies or cures. Many of these scientific databases are created and
accumulated through government investments and some are irreplaceable at any price. In
most research communities, including social sciences and the humanities, research results
are disseminated through electronic journals and other digital means, sometimes with and
sometimes without a print version. Digital resources no longer are limited to specialized
databases that are used by scholars. E-government and E-commerce are built on
electronic transactions stored in databases that have to be maintained at least long enough
to satisfy auditing, taxation, compliance monitoring and other accountability
requirements. Medical records that track individual medical histories are going on-line.
Cultural heritage resources and vital evidence of human creativity in the digital age are
produced, shared, and enjoyed in digital form primarily through the World Wide Web.
Even private citizens are accumulating digital resources of value to them, from e-mail
with friends and family to digital photographs.


A variety of stakeholders have expressed concern about the longevity of digital
information, due in part to the ubiquity of digital communications in business, commerce,
education, research, and personal communications. Some of this concern comes from
libraries, archives, and museums, such as the Library of Congress and the National
Archives, which embrace preservation as a core part of their mission. In 2001, LoC was
asked by Congress to lead the effort to develop a National Digital Information
Infrastructure and Preservation Program with the possibility of $175 million available in
Congressional and private funding to support this effort. The National Archives and
Records Administration (NARA) has established a partnership with the San Diego
Supercomputer Center to develop the technical architecture for persistent archives. These
initiatives grew in part from a growing recognition that a rapidly increasing portion of the
nation's published record, historical documentation, and creative expression exists only in
digital form. Solutions to the digital archiving problem are needed to enable archives,
libraries and museums to fulfill their historical roles as stewards and custodians of
intellectual and cultural heritage.


Many government agencies beyond those with historical preservation functions (the Library of
Congress, NARA, and the National Libraries of Agriculture and Medicine) also face
digital archiving challenges. NASA and NOAA manage rich archives of space,
atmospheric, oceanographic, and weather data. Statistical agencies, such as the Census
Bureau, the Bureau of Labor Statistics, and the National Center for Health Statistics, are
building longitudinal data sets that researchers exploit to understand trends in
employment, wages, public health, housing, and education or to analyze the effects of
government policies and public investments in improving social welfare. The
intelligence agencies need rapid access to data stored in many different formats and
crossing many jurisdictional boundaries. As the tragic events of September 11 showed,
the safety of citizens, at home and abroad, depends on the ability of intelligence and law
enforcement agencies to retrieve, evaluate, and analyze data from diverse sources while
also respecting basic civil rights and personal privacy. State and local government
agencies maintain repositories of digital data such as voting registration, deeds,
easements, and other property records, and records of land use in order to ensure citizens'
rights, protect private and public property, and manage public assets.


The recent draft report from the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure
envisions revolutionary changes in science and engineering research, based largely on the
ability of researchers to combine raw data from many sources and to utilize powerful
tools for analysis, visualization, and simulation. To realize this vision, scientists will
need dozens of highly curated digital repositories capable of storing streams of outputs
from satellites, land- and ocean-based instruments; complex models of diverse
phenomena from subatomic-level interactions to the changes in the global climate
system; and the entire published record of science. The new generation of repositories
will need to be maintained indefinitely into the future to protect very large investments in
the initial collection and processing of data and to enable continuous data mining, re-
purposing, and reuse. Private firms also collect and use digital data as part of their core
business mission, whether for the development of new drugs by pharmaceutical
companies; in exploration for resources by oil, gas, water, and mining companies; or for
market and sales analysis. Digital video, sound, animation, and accompanying text are
major assets to the entertainment industry as well as important cultural assets to the
public. In fact, few sectors of society are untouched in some way by the need to enhance
the longevity of digital resources and reduce the risk of catastrophic loss.


The ultimate purpose of a persistent archive is the preservation of intellectual capital,
whether archived in the form of digital entities, or expressed as information and
knowledge. A persistent archive uses collections to organize, present, and characterize
the information and knowledge content of the related digital entities. This organization is
intellectual capital that represents the attempts of the archivist to provide structure and
order to the original data. The persistent archive organization is a representation of the
intellectual capital created by the archivists. This view must be extended to include the
intellectual capital of the persons who created the original digital entities. A persistent
archive should represent the information and knowledge content of the digital entities that
are being preserved and the knowledge that was used to create the digital entities.
Persistent archives are repositories of intellectual content that can be mined and reused in
the future to improve society.


Digital Preservation Challenges


The problem of long-term preservation entails conceptual, technical, social,
organizational and economic aspects. Technology-induced problems, such as fragile
storage media, hardware and software obsolescence, and the difficulty of retrieving and
understanding data from legacy systems are well documented and have received the lion's
share of attention. Despite vastly reduced storage costs and significant improvements in
the capacity, durability, and average life of storage media and storage systems, many
technical challenges remain. The rapid obsolescence of storage media, input and output
devices, programming languages, software applications, and standards presents unique
technical challenges for long-term preservation. Many digital information resources
remain valuable for periods of time that far exceed the useful life of the hardware, media,
and software used to store, encode, and display them. As one workshop participant
observed, with digital archiving, unlike real-time applications, the data are preserved
while the systems evolve. Over time, the systems flow around the data. Physical
preservation can be managed by copying a bit stream from one medium to another, but
this does not address the more challenging problem of "logical" preservation which is
necessary for computers and humans to interpret, use, and understand digital information.
Logical preservation is an increasingly difficult problem because people and
organizations of all types are creating complex digital objects composed of many
different types of information (text, sound, numerical data, images, etc.). Formats,
standards, and software for managing different types of digital information evolve at
different rates, adding to the complexity of designing and selecting effective technical
strategies for long-term preservation.
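
The physical side of the problem, copying a bit stream from one medium to another, can be
sketched in a few lines of Python. This is a minimal illustration rather than a description
of any system discussed at the workshop: the function names are hypothetical, and the use of
a SHA-256 digest to confirm that the copy is bit-identical is an assumption added here.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_bit_stream(source, destination):
    """Copy a file to a new location and confirm the copy is bit-identical."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"fixity check failed for {destination}")
    return actual
```

A check of this kind confirms only that the bits survived the move. It says nothing about
whether future software can still interpret them, which is the logical preservation problem
described above.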


The workshop participants refined the technical challenges but also recognized that
digital preservation is not exclusively a technological problem. Opportunities abound for
research on economic and business models for sustainable long-term repositories, on
policies for distributing the costs and responsibility for digital preservation, and for
managing intellectual property rights and other access restrictions, on decision models for
selecting resources for preservation and for choosing appropriate technical strategies, and
on user needs and requirements. Given the wide variety of digital materials that need to
be preserved and the different stakeholders in preservation processes and outcomes, it is
also apparent that no single approach will address all digital preservation challenges.
Rather, research is needed on a spectrum of solutions ranging from private citizens
wishing to preserve their digital photograph collections to managers of very large
scientific databases.


One goal of the Workshop was to go beyond defining the problem of long-term
preservation and to develop clear definitions of research challenges. Research and
development projects during the past five years or so provide a foundation for moving
from problem definition to problem solving. In addition to the initiatives at LoC and
NARA, discussed above, several research and development projects have enhanced our
understanding of the technological, organizational, and economic aspects of digital
preservation and pointed the way toward promising lines of research. The Open Archival
Information Systems (OAIS) Reference Model, a draft ISO standard developed by the
space data community with leadership from NASA and the European Space Agency, is
gaining rapid acceptance as a framework for the basic technical architecture for digital
repositories. Several projects funded by the National Science Foundation under its
Digital Library Initiative (DLI) are addressing research issues that are germane to long-
term preservation. The Mellon Foundation is sponsoring several research
partnerships that involve content creators and libraries to investigate models for
preserving electronic journals. Several private companies have joined as partners in these
projects or supported their own research on this problem. Despite modest investments in
research and notable progress toward addressing certain aspects of the long-term
preservation problem, formidable challenges remain. The remainder of this report
presents a summary of the workshop discussions and recommendations for future
research projects.


Research Areas
Workshop participants discussed research challenges in four areas:
        Architectures for Archival Repositories
        Attributes of Archived Collections
        Economic and Policy Models
        Tools and Technologies to Facilitate Long-term Preservation


Each of these discussions yielded important research questions. Research challenges in
each area are summarized below, followed by a description of some cross-cutting issues.


Architectures for Archival Repositories
There is a strong consensus that the OAIS Reference Model is a significant advance in
defining the organization, functions, and information models for archival repositories.
The OAIS functional model defines the core functions of a repository as administration,
ingest (more commonly known as accession), archival storage, data management, access,
and preservation planning. Its information model defines various types of information
packages, which if implemented together with the functional model, provide a means for
separating long-term storage of bits or data streams from the management of data and
collections. This model has been used to build prototype persistent archives that can
preserve data independent of any particular hardware and software configuration.
Nevertheless, a great deal of research is still needed to determine whether persistent
archives can scale to meet the preservation requirements for very large
databases (100+ terabytes); whether such a repository architecture can handle tens of
thousands of smaller heterogeneous collections typically found in libraries, archives, and
museums; and what alternative architectures might promise.


Even with a common conceptual framework, it seems unlikely that a single approach will
satisfy all the digital preservation needs of various organizations and individuals. The
development of infrastructure for digital archiving is strongly driven by the need to
support multiple communities. Each community has unique requirements that will
influence the design of the digital archive. In order to build a coherent picture of the
digital archive infrastructure layers and the associated research agendas, a
characterization of the spectrum of digital archive responsibilities and associated digital
archive models is needed. The working group on the Architecture for Archival
Repositories developed a characterization of digital library models that ranged from basic
archival capabilities, through more advanced and sophisticated systems. The
requirements for a particular community or federal agency are expected to fall
somewhere within this spectrum. By identifying the associated research agenda for each
level of the spectrum, it is possible to quantify the research agenda that may be the most
appropriate for a particular agency. Note that this characterization is technology driven,
and does not address the issues of governance, policy and economic models, archived
collection attributes, or archival tools.


The spectrum of digital archiving models spans the following implementations:


       Data model driven archives for the preservation of data into the future. Each type
        of data has an associated data model that governs the set of attributes, the types of
        rendering applications, and the types of knowledge relationships that can be
        described. An archive that supports one particular data model is the simplest to
        implement, requiring support for the chosen data model. Such an archive
        typically provides access to the public, with no access restrictions.


        The attributes are typically organized as a catalog that is stored in an information
        repository (database). The knowledge relationships inherent in the collection are
        typically implicitly defined through the rendering application that is used to
        display and manipulate the digital entities. The data model defines the structural
        relationships inherent in the digital entities. Note that the collection of digital
        entities will have collective properties, such as naming conventions that should be
        stored as knowledge about the collection.


        Even a data model driven archive has associated research issues. They include
        managing technology evolution for storage systems and information repositories
        by defining the set of operations needed to access data in storage through a
        storage abstraction. Similarly, an information repository abstraction is defined to
        manipulate a catalog in an information repository. The abstractions are migrated
        to new technology, making it possible for an archive that implements the
        abstractions to manage data and information across evolving software
        infrastructure. A major challenge is proving that the correct abstractions have
        been chosen for storage and information repositories, such that the archive can
        persist while the underlying repositories evolve.
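
        The idea of a storage abstraction can be illustrated with a minimal Python sketch.
        The three operations shown (put, get, names) and both backend classes are
        hypothetical; deciding which operations actually belong in the abstraction is
        exactly the open research question noted above.

```python
from abc import ABC, abstractmethod


class StorageAbstraction(ABC):
    """Hypothetical minimal set of operations an archive needs from storage."""

    @abstractmethod
    def put(self, name, data):
        """Store the bytes `data` under the logical name `name`."""

    @abstractmethod
    def get(self, name):
        """Return the bytes stored under the logical name `name`."""

    @abstractmethod
    def names(self):
        """Return all logical names, sorted."""


class InMemoryStore(StorageAbstraction):
    """Stand-in for a current-generation storage system."""

    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        self._objects[name] = bytes(data)

    def get(self, name):
        return self._objects[name]

    def names(self):
        return sorted(self._objects)


class PrefixedStore(InMemoryStore):
    """Stand-in for a newer system that lays data out differently inside."""

    def put(self, name, data):
        super().put("v2/" + name, data)

    def get(self, name):
        return super().get("v2/" + name)

    def names(self):
        return [n[len("v2/"):] for n in super().names()]


def migrate(old, new):
    """Move every object to new technology using only the abstraction."""
    for name in old.names():
        new.put(name, old.get(name))
    return new
```

        Because migrate touches storage only through the abstraction, the archive's logical
        names survive the move even though the newer backend stores objects differently.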


       Controlled access repositories. While every archive needs to provide mechanisms
        to manage and prove the authenticity of the archived data, controlled access
        repositories must also manage access control over time periods that exceed the
        lifetime of individuals and even of organizations. The description of the
        membership of individuals in groups is an example of knowledge management, in
        this case structural relationships on individuals. Controlled access repositories
        must either manage this knowledge, or identify services that will manage the
        knowledge independently of the archive through authentication and authorization
        servers.


        Research issues include the persistent management of authentication over long
        periods of time. Note that the strength of public key encryption depends upon
        key length. What was once considered an acceptable length for an encryption key
        (128 bytes) can now be broken by a personal computer. The technology that
        guarantees the ability to authenticate a person today (encrypted password or
        public key certificate) will not be sufficient in the future. Similarly, the
        management of authorization over lifetimes longer than that of individuals
        requires levels of indirection. An approach is to authorize based upon
        membership in groups, and require external servers to manage the group
        membership. This shifts the problem out of the archive, but does not handle the
        situation in which an organization is subsumed within another organization.
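

        The level of indirection described above might look like the following sketch.
        The class names GroupDirectory and Archive, and the sample group and object
        names, are hypothetical; the directory object stands in for an external
        authorization server.

```python
class GroupDirectory:
    """Stand-in for an external authorization server (hypothetical)."""

    def __init__(self, members_by_group):
        self._members_by_group = {g: set(m) for g, m in members_by_group.items()}

    def set_members(self, group, members):
        self._members_by_group[group] = set(members)

    def groups_of(self, person):
        return {g for g, members in self._members_by_group.items() if person in members}


class Archive:
    """Stores access rules in terms of group names only."""

    def __init__(self, directory):
        self._directory = directory
        self._authorized_groups = {}  # object name -> set of group names

    def grant(self, obj, group):
        self._authorized_groups.setdefault(obj, set()).add(group)

    def may_read(self, person, obj):
        # Membership is resolved outside the archive at the time of access.
        allowed = self._authorized_groups.get(obj, set())
        return bool(allowed & self._directory.groups_of(person))


directory = GroupDirectory({"curators": {"alice"}})
archive = Archive(directory)
archive.grant("census-1990", "curators")

# Staff turnover changes the directory, not the archive's access rules.
directory.set_members("curators", {"bob"})
```

        The sketch does not address the case noted above in which one organization is
        subsumed within another; group names themselves would then need to be mapped.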


       Archives of temporally changing data. Communities that archive electronic
        records require the ability to archive all stages of a workflow process. The
        temporal or procedural relationships between the multiple stages are knowledge
        that the archive will have to maintain. An example is an electronic records
        management environment. The description of the processing steps is as important
        as the final product. A related problem is the management of a data collection
        that can change over time, either through extension by addition of new entries, or
        through revision in which new versions are created for member data. In the
        scientific community, an example is the removal of artifacts from the collection to
        eliminate erroneous conclusions. In this case, the challenge is the annotation of
        the collection to represent its changed state, such that results from prior use of the
        collection can be corrected. Both of these examples illustrate the need for the
        archive to manage procedural knowledge.


        Research issues include the identification of the appropriate knowledge model to
        represent temporal or procedural relationships. There are also implicit
        requirements such as the need to generate the required technical and descriptive
        metadata at a rate faster than the archive update rate. How does an archive
        automate metadata generation such that the archive can match the rate at which
        data is sent?


       Archival processing preservation. Archives may convert the data that is received
        into a standard archival form, to simplify the ability to present or render the
        digital object in the future. This requires processing the data. To guarantee
        authenticity, archives that process data must be able to document all operations
        applied to the data as received. This can be viewed as an inventory management
        problem for the multiple versions of the data that are created as the processing is
        done. The procedural knowledge inherent in the processing steps must be
        preserved, along with the versions of the data. Thus version management may
        become a required capability of a persistent archive.


        Research issues include the ability to migrate archival processes forward in time.
        This implies the ability to characterize processes, and recreate the processes that
        were applied to build prior archivable forms. It should be possible to archive
        either the characterization of the archival process, or the result of applying the
        process. The latter is similar to the idea of a self-instantiating archive, in which
        archived material is migrated dynamically as requested from the original storage
        format to the current rendering format. The processing steps used to annotate the
        information content and assemble a collection can also be characterized, making it
        possible to build a collection from the original data at an arbitrary point in the
        future.
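

        One way to picture a characterized archival process is as a log of named steps
        with a checksum for every version created, which can later be replayed against
        the data as received. The sketch below assumes a hypothetical registry of
        deterministic operations (OPERATIONS) and uses SHA-256 only as an example
        fixity check; neither is prescribed by the report.

```python
import hashlib

# Hypothetical registry of named, deterministic processing steps.
OPERATIONS = {
    "strip_whitespace": lambda data: data.strip(),
    "to_upper": lambda data: data.upper(),
}


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def apply_process(original, step_names):
    """Apply the steps in order, logging a checksum for every version created."""
    log = [{"step": "as_received", "sha256": sha256(original)}]
    data = original
    for name in step_names:
        data = OPERATIONS[name](data)
        log.append({"step": name, "sha256": sha256(data)})
    return data, log


def replay(original, log):
    """Rebuild the archival form from the log, verifying each version's checksum."""
    if sha256(original) != log[0]["sha256"]:
        raise ValueError("data as received does not match the log")
    data = original
    for entry in log[1:]:
        data = OPERATIONS[entry["step"]](data)
        if sha256(data) != entry["sha256"]:
            raise ValueError(f"version mismatch at step {entry['step']}")
    return data


received = b"  minutes of the meeting \n"
archival_form, process_log = apply_process(received, ["strip_whitespace", "to_upper"])
```

        Archiving process_log corresponds to archiving the characterization of the
        process; archiving archival_form corresponds to archiving its result. Replay
        works only as long as each named operation can still be executed, which is the
        migration problem the text describes.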


       Archives of derived data products. Archives are not limited to the original
        materials. In the scientific community, processing may be done on archived
        collections to create derived data products to address scientific questions. The
        challenge becomes the ability to characterize processes that are created by groups
        outside of the archival community, which may be based on technologies not
        employed in the persistent archive. Related issues include the ability to track
        provenance information for each derived data product, including a
        characterization of the processing that was applied. An example of this type of
        archive is the remote sensor data collected by NASA. Processing is applied to the
        original sensor data to create multiple levels of derived products, including
        application of calibration mechanisms, conversion from sensor data to physical
        quantities, and transformation to alternate coordinate systems.


        Research issues include the ability to characterize the derived data products by
        descriptive metadata beyond the provenance information. This descriptive
        metadata can include the type of processing algorithm that was applied, the
        mathematical expression of the related operation, and the associated software
        implementation. Managing descriptive metadata that varies with the type of
        data requires support for extensible metadata. Each derived data product that
        is created in the future may require
        additional descriptive metadata within the archive. Archives of derived data
        products require the ability to automate the management of information
        repositories.


       Re-purposed archives. The ability to discover relevant material within an archive
        is dependent upon the descriptive information that is generated during accession.
        The resulting information is queried to determine whether the desired data are
        available. If a new collection is generated, how can existing holdings be re-
        purposed, or organized into a new collection without having to replicate either the
        descriptive metadata or the associated digital entity? If the required descriptive
        metadata must be derived from each digital entity, how can the entire archive be
        reprocessed automatically to generate the metadata? Re-purposing of archived
        material may require the ability to stream the entire collection through processing
        steps. This requirement illustrates the need to think of archives as repositories of
        information and knowledge that may need to be updated at periodic intervals.
        Archives in the future may be dependent upon the ability to support generations
        of new semantic indexing through the processing of every digital entity.



        Research issues include the ability to migrate descriptive metadata to a new
        semantic context. This requires the ability to manipulate concept spaces and to map
        from an existing concept space to a new concept space. Re-purposing in the most
        general sense is the concept that the correct semantic space to use for discovery is
        the user’s semantic space. Thus every access to an archive is a form of re-
        purposing, the mapping of a collection as organized in the original collection
        concept space, to the user’s preferred concept space. This is similar to content
        based addressing, with concept spaces serving as a level of indirection between
        the features inherent in the archived material, and the query issued by the user.
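

        The role of a concept space as a level of indirection can be illustrated with a
        small lookup. Everything in the sketch is hypothetical: the terms, the file
        names, and the assumption that a mapping between concept spaces can be written
        down as a simple table.

```python
# Hypothetical mapping from a user's concept space to the collection's.
USER_TO_COLLECTION = {
    "rainfall": {"precipitation"},
    "weather": {"precipitation", "temperature"},
}

# Catalog built at accession, keyed by terms in the collection concept space.
CATALOG = {
    "precipitation": {"station-12.dat", "station-40.dat"},
    "temperature": {"station-12.dat"},
}


def discover(user_term):
    """Translate a user term into collection terms, then look up holdings."""
    holdings = set()
    for collection_term in USER_TO_COLLECTION.get(user_term, set()):
        holdings |= CATALOG.get(collection_term, set())
    return holdings


rainfall_holdings = discover("rainfall")
```

        The catalog is never rewritten here; only the mapping table changes when a new
        user community arrives. Building and maintaining such mappings at scale is the
        research problem.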


Although this spectrum may not capture all potential types of archival repositories, it
illustrates the need for research that more closely examines the relationships between the
purpose of the archive, the types of data and information that it acquires, and the needs of
its user community or communities.


Repository architectures span a spectrum from highly distributed layered models to self-
contained repositories that provide end-to-end services. In the layered model, different
organizations might specialize in particular preservation functions. A library or an
archive, for example, might contract with a service provider for physical storage and data
management but take responsibility for knowledge management and end user services.
A key research issue for this model is developing the methods and protocols that enable
the different layers to interoperate with each other. On the other hand, a single entity
might maintain a self-contained and relatively autonomous repository that supports all
archival functions from basic physical storage and maintenance of bits to end user
services. Such a model might be appropriate for preservation of highly specialized
collections that serve one particular community. A research challenge, however, is
developing methods to federate heterogeneous repositories or to provide some means of
interoperability among them.


Attributes of Archived Collections



A second area of discussion concerned defining the attributes of archived collections.
We know that a great deal of information is “saved” in digital form on file servers, on
personal hard drives, and in large repositories of tapes and optical disks. Workshop
participants assumed, however, that archived collections have some additional attributes
that enhance their quality, utility, trustworthiness, and longevity. There was a strong
consensus that archival collections don’t just happen when someone clicks on the “save”
icon. Rather, archival collections are created through curatorial processes that include
selection, organization, description, and quality control, and they require individuals or
organizational entities that assume a formal role of stewardship. Just as the development
of infrastructure for digital archiving is strongly driven by the need to support multiple
communities, it is also strongly driven by the requirements to preserve many diverse
types of complex objects and collections.


Research topics span a range of conceptual, policy and technological issues. Specific
research issues may revolve around the particular characteristics of different types of
collections, such as electronic journals, multi-media entertainment objects, scientific
databases, or models and simulations. Common concerns also cut across many different
types of objects, although they may be addressed through means that are specific to a
particular type of object or collection. Some of the common concerns are:


     Selection and preservation of complex digital objects. Although not ideal,
        methods do exist today to preserve simple, static digital objects in widely used
        formats. The next challenge is to develop means for preserving complex objects
        and dynamic objects that change on a regular basis. Here we need criteria for
        determining which aspects of digital objects are worth preserving. Complex
        objects contain some core content, but most also have additional features such as
        formatting, visual aspects and graphics, monochrome and color images, and
        sound and video. Interactive documents provide users with the opportunity to set
        preferences or make choices that determine how an object appears or behaves.
        Dynamic objects and streaming data are updated continually or on an arbitrary
        basis.


Even simple static objects present research issues. They include developing tools to
handle the processes of reformatting, migration, and normalization with minimal human
intervention and methods to evaluate the effectiveness of different formats, data models,
and metadata schema from a preservation perspective. For complex objects, research is
needed to identify which aspects of the objects are worth preserving and how different
features affect the authenticity, utility, and aesthetics of preserved objects for particular
user communities. Significant research is also needed to develop methods, tools, and
technologies that can manage complex objects over time. Dynamic objects raise research
questions about how to “fix” a view of an object that is constantly changing and how
many different versions are needed to preserve a meaningful sense of the object’s
evolution over time.


     Aggregation of items and objects into collections. Physical collections are made
        up of discrete items, but this is not the case with digital collections. A digital
        collection can be created simply by making hyperlinks among widely distributed
        objects. This presents many challenges for archiving because different
        organizational entities control the content, format, quality, and availability of
        linked materials. With the emerging need to capture materials from the web for
        aggregation into collections, definitions are needed to establish boundaries around
        collections and to determine how many levels of linked items need to be
        preserved.


Research questions associated with defining collections include methods to establish
boundaries around collections that give them coherence and an identity. Methods are
needed to develop distributed collections, where several organizations participate in the
curatorial processes and share preservation responsibilities. Another ripe area for
research is the development of collection-level metadata schemas that describe attributes
common to all items in a collection and provide for inheritance of metadata from the
collection to the item level. Research is needed on ways to formally express the
attributes of collections so that higher level services can operate with them. Users also
need to know the nature and quality of collections that are made available to them.
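

Inheritance of metadata from the collection to the item level can be sketched with a
layered lookup. The field names and values below are invented for illustration, and the
rule that item-level values override collection-level values is an assumption, not
something the workshop specified.

```python
from collections import ChainMap


def effective_metadata(item_metadata, collection_metadata):
    """Item-level values take precedence; missing keys fall back to the collection."""
    return ChainMap(item_metadata, collection_metadata)


# Hypothetical collection-level and item-level records.
collection_metadata = {"publisher": "Example Society", "rights": "open", "language": "en"}
item_metadata = {"title": "Volume 3, Issue 2", "language": "fr"}

effective = effective_metadata(item_metadata, collection_metadata)
```

With this arrangement an attribute common to all items is stored once, at the collection
level, and an item records only what differs.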


     Decision models for selection. Long-term preservation does not imply that
        everything is worth saving. Most libraries, archives, and museums have well
        established collecting policies for physical items based on the mission of the
        institution and knowledge of its user community. In the digital realm, selection
        decisions are becoming more complex for several reasons. As discussed above,
        organizations that are building digital collections must be concerned not only with
        which objects or what content to preserve, but also with which features and
        functionality are needed. In libraries, selection decisions are changing because an
        increasing amount of the content that libraries deliver to users is held in
        publishers’ repositories and is not owned physically by the library. This reduces
        the redundancy among collections, but also raises concerns over who should
        assume responsibility for long-term preservation (publishers or libraries) and
        when (if ever) the obligations to acquire and preserve published material should
        shift from the content providers to a library or an archive. Furthermore, collecting
        policies that were designed for physical materials do not encompass new types of
        digital objects and collections (such as web sites and multi-media objects).
        Finally, many organizations that did not have to concern themselves with long-
        term preservation in the past are now confronted with the need to develop some
        means to preserve their digital materials.


In the area of selection for preservation, decision models are needed to assist curators and
collection developers with decisions about what to preserve. Such decisions are not
based on an arbitrary notion of what is valuable. Rather, there is a need to investigate the
requirements of the user community that the repository expects to serve. Formal models
of selection decisions are also needed so that tools can be developed to assist curators
with selection responsibilities and to automate some selection decisions. There is also a
need to assess the costs and benefits of selection versus saving everything from a
particular publisher, domain, or organization.


       Resolution of naming hierarchies. Multiple naming conventions are used to
        describe digital entities, ranging from the components of the data model, to local
        file names for the digital entity, to global file names used to assemble distributed
        collections, to attribute names used to build collection catalogs, to relationship
        names used to describe properties of the collection. Preservation requires the
        ability to manipulate each name space at some arbitrary point in the future. Each
        name space could be managed by a different set of software infrastructure, and
        described by a different metadata standard. If these multiple name spaces can be
        structured for support by similar ontology management systems, the task of
        preservation can be greatly simplified.


A major research question is whether the generalization of name spaces as ontologies that
characterize semantic, structural, or logical relationships will lead to a simpler way to
preserve the information and knowledge content of archives. Such a generalization is
needed to make the management of knowledge tractable.


Policy and Economic Models
A third set of discussion topics focused on the question of policy and economic models
for long-term preservation. Much digital preservation research has focused on technical
problems and technological solutions without careful analysis of the social,
organizational, and economic mechanisms that have to be in place to make preservation
possible and sustainable. The discussions of policy and economic models were informed
by applying the concept of public goods to digital preservation. Public goods, such as
national defense or public parks, share certain properties. Specifically, there are few
market incentives to create public goods because they tend to be costly and it is difficult
to exclude anyone from the benefits of a public good. An additional property of digital
and physical archives is that some of the information they preserve may not be used
immediately, but it is assumed to have value and utility at some point in the future. Other
information may be immediately useful, but access is constrained by copyright and other
restrictions. Creating an economy for long-term preservation entails providing incentives
for organizations to invest in digital archives.


The area of policy and economic models is ripe for research. Some of the key research
areas include:


     Incentives for long-term preservation of digital information. The incentives for
        developing digital archives are likely to vary across different types of
        organizations and between public and private sector organizations.
        Consequently, research is needed on a variety of incentives that would encourage
        organizations to develop digital archiving capabilities, build repositories, provide
        archiving services, and create content in ways that facilitate its long-term
        preservation. A variety of mechanisms warrant investigation, including direct
        public subsidies, tax incentives for placing content in the public domain prior to
        the expiration of copyrights, philanthropic donations, and market mechanisms that
        provide for cost recovery or revenue streams to support the repository.
     Incentives for deposit of digital content in archives. Conversely, content creators
        need incentives to deposit content in repositories for long-term preservation.
        Research in this area is closely tied to the concept of trust. Depositors must have a
        very high level of trust in a repository, based on secure technology, a track record
        of performance, and consistent application of rules and agreements.
     Metrics. One of the major obstacles to long-term preservation of digital objects is
        the absence of metrics to quantify almost all aspects of digital archiving. There is
        a critical need for research on methods to measure the performance of various
        storage media over the long-term, the effectiveness and costs of different
        preservation strategies, the market value of archiving services, and for market
        analysis of user demand. Evaluation of digital archiving is impossible without
        concrete measures of costs, benefits, and value of digital objects.
       Intellectual capital. Archives need to become the repositories of intellectual
        capital that are viewed as the driving resource for economic growth. This
        emphasizes the view of archives as information and knowledge repositories. The
        goal of the archive is to make the information and knowledge content as readily
        accessible as possible, and to make it easy to re-purpose the collection for a new
        use. The worth of an archive is the number of possible uses its inherent
        information and knowledge content has in the future. Examples where the
        preservation of intellectual capital is already important include universities and
        research laboratories. Every group or organization needs to think of persistent
        archives as the mechanism to preserve their information and knowledge for use by
        future generations.




Tools and Technology
A fourth area of discussion concerned the tools and technologies that are needed to
support long-term preservation. Although many useful tools exist, preservation is
different because of its long-term time scale. One cross-cutting theme is that digital
repositories will have to manage the evolution of tools, technology, standards, and
metadata schema over time. The discussion was also shaped by the recognition that
human costs are the most expensive element of archiving, and that human costs are likely
to increase while processing and storage costs decline. There are many opportunities for
research on tools and technologies for long-term preservation.


     Acquisition and ingest. Many powerful tools exist to locate, harvest and copy
        digital information for preservation in a repository. Internet search engines and
        web archiving projects use automated robots and web crawlers extensively to
        acquire material available on the World Wide Web. With enhancements, this basic
        technology could be tuned to support acquisition and future preservation of vast
        digital resources. There are also many valuable resources that are not readily
        available through web robots because of access restrictions, security and privacy
        concerns, or the structure of the underlying resources. Preservation of
        such resources often is handled through formal agreements under which the content
        creator transfers responsibility for preservation to a repository at some point in
        time.


Research issues in the area of acquisition and ingest include both the harvesting or “pull”
model and the formal submission or “push” model. Both models would benefit from
research that would automate the processing of items and collections for preservation at
the point that they are acquired or ingested into a repository. Tools are needed for
automated indexing, metadata extraction, validation, and quality control. For the pull
model, research includes developing ways to incorporate selection criteria or preferences
with the harvesting tools. Tools are also needed to transform disparate types of objects
into the formats, standards, and data models that a repository can manage over the long-
term.
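

For the push model, automated validation and technical metadata extraction at the point
of ingest might be sketched as follows. The function name ingest, the metadata fields,
and the convention that a depositor declares a SHA-256 checksum with each submission are
all assumptions made for the example.

```python
import hashlib


def ingest(name, payload, declared_sha256):
    """Validate a submitted object and return its technical metadata."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != declared_sha256:
        # Quality control: reject objects damaged or altered in transit.
        raise ValueError(f"{name}: checksum mismatch, submission rejected")
    return {"name": name, "size_bytes": len(payload), "sha256": actual}


# A depositor submits an object together with the checksum it computed.
submitted = b"annual report, final"
record = ingest("report-2001.txt", submitted, hashlib.sha256(submitted).hexdigest())
```

Descriptive metadata, indexing, and format transformation are much harder to automate
than the fixity check shown here, which is why they appear on the research agenda.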


     Naming and Authorization. Managing the identity of preserved digital objects
        over time is a challenge for digital archives because the identifiers assigned to
        digital objects can be changed easily and the technologies for naming and tracking
        digital objects evolve over time.


Research issues in the area of naming and authorization include development of methods
for unique and persistent naming of archived digital objects, tools for certification and
authentication of preserved digital objects, methods for version control, and
interoperability among naming mechanisms used by different content providers. The
emergence of data grids that create global name spaces is an example of a technology for
persistent naming. This technology needs to be extended to support persistent naming
of the information and knowledge content of the collections.


     Decision models. Decision models are needed to support many aspects of digital
        preservation in addition to the selection and ingest decisions discussed above.
        Research is needed on cost/benefit models to support decisions regarding
        preservation formats and standards, choice of preservation strategies
        (normalization, migration, emulation), and the costs and benefits of various levels
        of description and metadata.



Key research areas include metrics for measuring the quality and fidelity of preserved
digital objects and for documenting the consequences of archival processes on them. The
metrics need to include the maximal sustainable archive size (as a function of the access
rate), the archival bandwidth (amount of material that can be moved forward into the
future as a function of the type of storage technology), and the re-purposing rate (the
amount of time needed to process the entire collection to derive new collection
attributes).
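

The capacity metrics named above can be related by simple arithmetic. The sketch below
is one possible reading, not a definition from the workshop: it assumes a constant
throughput, treats the re-purposing rate as the time for one full pass over the
collection, and bounds archive size by what can be copied forward within one media
lifetime. All numbers are illustrative.

```python
def full_pass_days(collection_tb, throughput_tb_per_day):
    """Days needed to stream the whole collection once through a process."""
    return collection_tb / throughput_tb_per_day


def sustainable_size_tb(throughput_tb_per_day, media_lifetime_days):
    """Largest collection that can be copied forward once within one media lifetime."""
    return throughput_tb_per_day * media_lifetime_days


# Illustrative values only: a 500 TB collection processed at 2 TB per day,
# on media assumed to remain readable for five years.
repurposing_days = full_pass_days(500, 2)
size_limit_tb = sustainable_size_tb(2, 5 * 365)
```

Under these assumptions an archive larger than size_limit_tb could not finish migrating
before its oldest media expire. The sketch ignores the dependence on access rate that
the text also mentions.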


     Standards and interoperability. Standards for data formats, data models,
        metadata, and many other aspects of digital information are useful for long-term
        preservation. However, standards change over time and digital repositories are
        likely to contain digital objects that conform to many different standards.


Opportunities for research on standards include development of standard and long-term
ways of representing text, sound, image, video, and other object components; temporal,
spatial, and other relationships; and data and data relationships. Research is also needed
to support interoperability among different standards and for predicting which standards
are likely to achieve wide scale adoption over extended periods of time. Archived digital
entities will have to be migrated to new standards in the future. The migration can be
viewed as lossless if the new standard provides a superset of the features of the old
standard. Unfortunately, this is not always true. The conversion from VRML version 1
to VRML version 2 was made difficult by the elimination of some features in version 2.
A goal for standard encoding formats is the creation of lossless feature conversions when
migrating between standards. The new standard needs to provide a way to encode the
operations and features of the prior standard.
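

The superset condition for lossless migration can be stated as a set comparison. The
feature names below are hypothetical and are not taken from VRML or any other real
standard; representing a standard as a flat set of features is itself a simplification.

```python
def lost_features(old_features, new_features):
    """Features of the old standard that the new standard cannot express."""
    return set(old_features) - set(new_features)


def is_lossless(old_features, new_features):
    """A migration is lossless if the new standard covers every old feature."""
    return not lost_features(old_features, new_features)


# Hypothetical feature sets for two versions of an encoding format.
FORMAT_V1 = {"text", "color", "layers"}
FORMAT_V2 = {"text", "color", "animation"}

dropped = lost_features(FORMAT_V1, FORMAT_V2)
```

A check of this kind could run before a migration is scheduled, so that objects using a
dropped feature are flagged rather than silently degraded.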


[The following sections need elaboration]


What is Unique about Long-term Preservation Research?
The key point is that the development of a digital preservation infrastructure will benefit
from related research on storage media and systems, data management, knowledge
management, semantic interoperability, and information retrieval. Research on long-term
preservation needs to take into account the unique aspects of digital archiving. These
include:
     Long-term perspective
     Threat of interrupted management
     Funding and utility models
     Intergenerational beneficiaries
     Changing user communities, requirements, and expectations
     Authenticity/Integrity when records are not maintained in the original state
     Need to manage systems and technology that evolve around the data
     Measures of success are long in the future


Priorities for Research
     Reference architecture for research that supports:                                     Formatted: Bullets and Numbering
        focused research at each layer of the architecture
        research on how to assemble the stack
     Metrics (costs, value, policy options, outcomes)                                       Formatted: Bullets and Numbering

     Preservation of dynamic objects
     Decision models
     Scalability Up and Down
        Spectrum of tools and methods that scale up to very large databases and scale
           down to personal archiving
     Tool development                                                                       Formatted: Bullets and Numbering
        automated ingest
        metadata capture and management
        push vs. pull models
     Predictive models of use and value
     User requirements
     Content creator requirements
     Incentives for building persistent archives
     Incentives for depositors

Research Scenarios
A wide range of research projects and methodologies are needed, including:
     Theory-building
        Individual investigator or small group
        Concepts of value, aesthetics, experience, and behavior of digital objects
     Exploratory
        Small teams
        Alternative architectures and preservation methods
     Simulations
        Small teams
        Simulate different policy and economic models, compare results
     Experimental
        Small to large teams
        Apply different methods to the same content, compare results
     Observational
        Individual investigator or small group
        User behavior, perception, incentives
     Testbeds
        Large teams
        Use existing content, develop metrics, and measure results
     Focused projects. A significant portion of the challenges of persistent archives
       may be addressed by data grid technology. Data grids provide the interoperability
       mechanisms needed to manage data across heterogeneous storage and information
       repositories. Data grid technology is being applied for the Library of Congress
       (a data grid for data replication), the National Archives (a prototype persistent
       archive), and the National Science Foundation (discipline-specific data grids in
       astronomy, earth system science, and education curricula). These focused projects
       can be evaluated for their success in supporting archival processes across
       heterogeneous software platforms.








                                      Organizing Committee




       Margaret Hedstrom, University of Michigan, Chair and Principal Investigator
       Sharon Dawes, Center for Technology in Government, University at Albany
                             Carl Fleischhauer, Library of Congress
                                James Gray, Microsoft Research
                    Clifford Lynch, Coalition for Networked Information
              Victor McCrary, National Institute of Standards and Technology
                       Reagan Moore, San Diego Supercomputer Center
            Kenneth Thibodeau, National Archives and Records Administration
                      Donald J. Waters, Andrew W. Mellon Foundation






                                            Participants
Martha Anderson, Library of Congress
Bruce R. Barkstrom, NASA
Mick Bass, Hewlett-Packard Company
Neil Beagrie, Joint Information Systems Committee, UK
Lawrence Brandt, National Science Foundation
Peter Buneman, University of Edinburgh and University of Pennsylvania
Laura Campbell, Library of Congress
Arturo Crespo, Stanford University
Robin Dale, Research Libraries Group
Jon Eisenberg, National Academies, Computer Science and Telecommunications Board
Dale Flecker, Harvard University
Carl Fleischhauer, Library of Congress
Evelyn Frangakis, National Agricultural Library
Amy Friedlander, Council on Library and Information Resources
Anne Gilliland-Swetland, University of California, Los Angeles
Jim Gray, Microsoft
Daniel Greenstein, Digital Library Federation
Valerie Gregg, National Science Foundation
Stephen M. Griffin, National Science Foundation
Myron P. Gutmann, University of Michigan, Ann Arbor
Rich Harada, High Density Storage Association and Creative Businesses, Inc.
Margaret Hedstrom, University of Michigan, Ann Arbor
Robert Horton, Minnesota State Historical Society
Bernie Hurley, University of California, Berkeley
Carl Lagoze, Cornell University
Brian Lavoie, OCLC
Cal Lee, University of Michigan, Ann Arbor
Raymond Lorie, IBM Almaden
Clifford Lynch, Coalition for Networked Information
Petros Maniatis, Stanford University
Victor McCrary, National Institute of Standards and Technology
Alexa T. McCray, National Library of Medicine
Nancy McGovern, Cornell University
Kurt Molholm, Defense Technical Information Center
Reagan Moore, San Diego Supercomputer Center
Douglas Oard, University of Maryland
Christopher Olsen, Central Intelligence Agency
Arcot K. Rajasekar, San Diego Supercomputer Center
David Rosenthal, Sun Microsystems
Jeff Rothenberg, RAND
Charles Rothwell, National Center for Health Statistics
Ed Sequeira, National Library of Medicine
Abby Smith, Council on Library and Information Resources
MacKenzie Smith, Massachusetts Institute of Technology
Thornton Staples, University of Virginia
Sue Stendebach, National Science Foundation
Kenneth Thibodeau, National Archives and Records Administration
Herbert Van de Sompel, Los Alamos National Laboratory
Howard D. Wactlar, Carnegie Mellon University
Donald J. Waters, Andrew W. Mellon Foundation
Ed H. Zwaneveld, National Film Board of Canada



