					     CSTB Proprietary--DRAFT--Do not copy, distribute, or cite                                           1

 1                               Summary and Recommendations

 2            As in other sectors of society, much of the business--and thus record-keeping--of the
 3   federal government depends on digital information. Documents are created, transmitted, and
 4   stored electronically. E-mail has become an important--and often primary--communications
 5   medium. And many records exist only in electronic form, inside databases and other computer
 6   systems.
 7            Recognizing the growing importance of electronic records to its mission of preserving
 8   "essential evidence,"1 NARA launched the Electronic Records Archives (ERA) initiative in the
 9   late 1990s. The ERA is envisioned by NARA to “authentically preserve and provide access to
10   any kind of electronic record.”2 NARA's current systems for electronic records archiving are
11   limited in capability and ad hoc in nature. Seeking to significantly expand their capabilities,
12   NARA sponsored work at the San Diego Supercomputer Center that resulted in the development
13   of a system that was used to conduct a series of archiving demonstrations. Building on this
14   experience, NARA plans to commence the initial procurement for a production-quality ERA in
15   2003. As of this writing, NARA has started a process of defining desired
16   capabilities/requirements for the system.
17            As part of its preparations for an initial ERA procurement, NARA asked the National
18   Academies' Computer Science and Telecommunications Board to provide independent technical
19   advice on the design of an electronic records archive, including an assessment of how work
20   sponsored by NARA at the San Diego Supercomputer Center (SDSC) helps inform the ERA
21   design, and what key issues should be considered in ERA's design and operation.
22            CSTB's Committee on Digital Archiving and the National Archives and Records
23   Administration has been tasked with preparing two reports. This first report is intended to
24   provide quick, preliminary feedback to NARA on lessons it should take from the SDSC work and
25   to identify key ERA design issues that should be addressed as the ERA procurement process
26   proceeds in 2003. The committee's second report, anticipated in late 2003, will provide longer-
27   term strategic recommendations to NARA on how to meet its electronic records archiving
28   challenges.

29                                               FINDINGS
30   Finding 1. As NARA clearly recognizes, it is critical to develop new electronic records
31   archiving capabilities as quickly as possible in order to fulfill its mandate to preserve
32   federal records.
34            With the rapid increase in federal records originating in digital form--and with many
35   records being digital-only--it is clear that a solution must be found for preserving these records in
36   order for NARA to fulfill its mandate. The volume and diversity of digital records that will be
37   eligible for transfer from the custody of federal agencies to NARA is projected to be very large.
38   Indeed, it is reasonable to anticipate that in the not-too-distant future, the number of digital
39   records is likely to exceed the number of records originating in paper form. NARA’s current
40   systems for electronic records, designed primarily to support archiving of relational databases and
41   other highly structured records, cannot meet these demands. If NARA fails to design and
42   implement an electronic records archiving program that is capable of handling the projected
                 1 NARA Strategic Plan
                 2 ERA Vision Statement
43   volume and diversity of electronic records, important records will be lost because there are no
44   systems or processes in place to preserve them. Program delays would likely put some records at
45   greater risk of loss and would limit access to already existing electronic records.
46            Under the paper-based model, NARA has historically received documents long after they
47   were created, and electronic records scheduling has for the most part proceeded in similar
48   fashion. Thus when the ERA system becomes operational, NARA will face a large pipeline of
49   electronic records that were created over the past few decades, many of which will pose
50   challenging preservation problems owing to their age. Going forward, there may be ways of
51   alleviating the record age problem by restructuring the relationship between NARA and the
52   agencies generating records in order to obtain copies of the records closer to the time of creation.
53   (Discussion of this issue is deferred to the committee’s second report.)
55   Finding 2. Demonstrations conducted at SDSC for NARA have provided a useful
56   opportunity for NARA to see relevant technology tools in action, but the work has not
57   significantly informed the appropriate design of ERA, reduced the engineering risk of the
58   program, or developed NARA’s operational capabilities to run the ERA.
61            These proof-of-concept demonstration projects have provided NARA with the
62   opportunity to interact with the IT community and to explore potential options for a production
63   digital archiving system. Although the SDSC projects have demonstrated potential options for
64   parts of a production digital archiving system, NARA should not interpret these projects as
65   solutions to digital archiving issues or components of a production system.
66            The demonstrations were quite limited and therefore are not easily transferable to more
67   complex problems. Furthermore, some of the demonstrations were constrained by re-use of
68   SDSC’s existing scientific data management software and systems rather than addressing ERA
69   requirements. Finally, some aspects of the SDSC demonstrations focused on attempting to
70   extract meaning from records (the “knowledge layer”). The technology here is immature--far
71   from ready for inclusion in a production system. Moreover, it does not address more fundamental
72   problems that NARA faces, such as engineering a system to preserve the original bits.
73            The SDSC demonstrations also shed little light on NARA’s operational capabilities to run
74   an ERA. It does not appear likely that NARA (or indeed many other organizations) could run the
75   sort of systems that SDSC used to support scientific data management, which are used by
76   scientific researchers and depend on a highly proficient IT support staff. The SDSC work should
77   therefore be understood as much more meaningful as a demonstration of technology than as a
78   prototype of an operational system for NARA.
80   Finding 3. ERA can and should be built, but important work remains to be done before
81   NARA embarks on a procurement.
83            Although no one has yet designed, built, or managed a production digital archives system
84   on the scale that NARA envisions, it is technically feasible to do so. There is no available off-
85   the-shelf overall solution but there are demonstrated solutions to a number of system components
86   NARA will need. The projected scale and complexity of ERA means that the task of
87   designing/engineering the system is a formidable challenge. The work NARA has done to date is
88   not sufficiently robust to ensure the procurement of a workable production system.

89                                        RECOMMENDATIONS
90           The recommendations that follow highlight critical actions for NARA to undertake if it is
91   to reduce the engineering risk of the ERA program and increase the likelihood that the program
 92   will be successfully executed.
 94   Recommendation 1. NARA should not issue an RFP for an ERA system until in-house IT
 95   capabilities to prepare and evaluate responses to the RFP are in place.
 97            Lack of technical expertise at NARA is a major obstacle to successful development and
 98   procurement of the ERA. Based on briefings and other interactions with NARA staff, the
 99   committee concludes that while there is recognition of the importance of the ERA program, few
100   NARA staff members appreciate the complexity of building and managing a production digital
101   archiving system. Also, NARA staff do not appear to have sufficient technical expertise to define
102   and manage an overall architecture, develop an appropriate RFP, evaluate technical responses,
103   negotiate with vendors, or manage the implementation of the system. A contracted system, to be
104   successful, requires at a minimum an in-house contract monitoring staff (e.g., the contracting
105   officer’s technical representative) that has technical skills at least as good as the contractor's
106   people. [[Hedstrom: comparable to the IT expertise in other projects of a similar size and cost;
107   need examples/data]] This need should not be confused with the broader need to develop IT
108   expertise (discussed below). Just the addition of a few people with properly focused systems
109   design expertise would make a critical difference in the success of the program.
110            In addition to needing a quick ramp-up in IT expertise necessary to oversee the early
111   phases of procurement, NARA faces a longer-term need for a more pervasive culture change--IT
112   competence related to preservation will need to be a core competence, on par with its other
113   institutional values. NARA appears to recognize the issue in its appointment of a change manager
114   associated with the ERA program, but the difficulty in achieving this shift should not be
115   underestimated. This need can be addressed in parallel with the system design and early stages of
116   implementation.
118   Recommendation 2. NARA should do more to define the scope and scale of the ERA before
119   issuing an RFP for the ERA system.
121         NARA needs more information before it is ready to prepare or evaluate responses to an
122   RFP. Additional information required includes:
124    • Estimates of the scale of the initial build and scaling goals. Although it is very difficult to
125       estimate the ultimate scale of federal electronic records that may be subject to preservation by
126       NARA, an effort should be made to define the goals for the initial build. Exact estimates are
127       not required but order of magnitude estimates for such parameters as the number of objects to
128       be stored and the total storage required are essential. Similarly, NARA should establish
129       approximate goals for how quickly the archive systems are expected to grow.
130    • Estimated costs. Building on these scale estimates, an estimate should be made of how much
131       computer and storage equipment will be required, the cost of personnel to ingest records and
132       operate the system, and the cost of building the system software. Separate procurement and
133       operational estimates are needed. Although additional funds may be made available to
134       design, build, and operate the ERA, NARA will nonetheless have finite resources and an
135       essentially unbounded mission in preserving all federal records of permanent historical value.
136       Thus in planning for and operating the ERA, NARA will need to make difficult decisions as
137       to priorities.
138    • The most important data formats. It is infeasible to support all possible formats equally well.
139       In any event, early system builds should concentrate on a small number of formats.
140       Consequently, NARA should identify a subset of formats that will address the majority of
141       records that federal agencies are creating and establish priorities for ingest, storage, and
142       access. This likely means that for some lower priority formats, ERA support will, at least
143        initially, be limited to capturing, storing, and providing access to the original bits and any
144        essential metadata.
145             [[Wilensky: However, the more general message might be that (i) one size doesn't fit all,
146   and that (ii) any ERA will have to accommodate different kinds of records differently, and hence,
147   will be some kind of "tiered" system, i.e., one that provides varying QOS depending on record
148   attributes, and that (iii) NARA will have to give careful thought to exactly what the criteria are
149   that determine the QOS for a given record.
150             In this context, we are expressing our belief that data format will be one important
151   criterion. But we might be overly emphasizing the importance of this one attribute. E.g., if NARA
152   rated a certain document to be of critical importance, or some such, then they might be willing to
153   invest the resources to archive regardless of data type.]]
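
As a rough illustration of the order-of-magnitude estimates called for in Recommendation 2, the following sketch computes total storage from an object count, an average object size, and a replica count, and projects growth at a fixed annual rate. It is a sketch only: the function names are introduced here for illustration, and any input values supplied to it would be planning assumptions, not NARA figures.

```python
import math


def estimate_storage_bytes(num_objects: int, avg_object_bytes: int, replicas: int) -> int:
    """Total bytes needed to hold every object once per replica."""
    return num_objects * avg_object_bytes * replicas


def order_of_magnitude(value: float) -> int:
    """Largest power of ten not exceeding a positive value (3e14 -> 14)."""
    return math.floor(math.log10(value))


def project_growth(initial_bytes: float, annual_growth_rate: float, years: int) -> list:
    """Projected holdings at the end of each year, assuming a constant growth rate."""
    return [initial_bytes * (1 + annual_growth_rate) ** year for year in range(years + 1)]
```

Because only the exponent matters at this stage, an error of a factor of two or three in any single input rarely changes the planning conclusion, whereas an error in the assumed growth rate compounds every year.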
155   Recommendation 3. NARA should define the design principles for the initial build and
156   future evolution of the ERA and apply an engineering approach to its development.
158           Although archivists may like to deal in absolutes (e.g., "every important record will be
159   preserved forever"), engineering practice recognizes that there are objectives that are subject to
160   constraints. Engineering an ERA will require specifying the objectives and constraints of such a
161   system. This will require some considerations not normally found in writing about archival
162   processes.
163           This report describes some of the important issues that need to be addressed and provides
164   some advice on how to think about them. In some cases, the committee has provided an
165   assessment of preferred choices for the ERA systems but this preliminary analysis is no substitute
166   for additional analysis by NARA. Key design/engineering issues include:
168            Identifying the most important functions of the system and focusing initial design on
169   these. This implies an initial focus on storage and data management functions, together with
170   ingest and access for a few formats. The storage and data management requirements have much
171   in common with many other applications, so there are opportunities to learn from other systems
172   and from vendor offerings.
173            Designing for common cases. Given inherent resource limitations, it is not reasonable to
174   expect a system to preserve and provide equally good access to all records. So it will be
175   necessary to make some choices about what quality of service to provide for different types of
176   records. For example, it is not feasible to support all possible data formats with the same quality
177   of service. The ERA design should focus on a small number of formats that cover a large fraction
178   of the records.
179            Supporting the activities of future users by providing fundamental services. Archivists and
180   researchers of the future will have more powerful computers and tools, which means that NARA
181   does not have to anticipate in detail every high-level service that they will need. Instead, NARA
182   should concentrate on providing the fundamental low-level services that will be most helpful to
183   those future users with their more powerful tools. Such services include: (1) always archiving
184   the original bits even if derived forms are created for easier access; (2) always capturing as much
185   ephemeral (non-derivable) metadata as possible; (3) archiving critical external references, either
186   explicit or implicit (such as documentation about the originating systems or data standards used);
187   and (4) archiving as much information as possible about the software and workflow processes
188   used to ingest the original records (a desirable goal would be that ingest process workflow be log-
189   based, or otherwise designed so as to facilitate recreating in the future the archived record from
190   original sources).
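
A minimal sketch of the first and fourth services listed above (keeping the original bits and keeping a log-based record of ingest) might look like the following. The directory layout, the log file name, and the function names are hypothetical choices made for this illustration; they do not describe any existing NARA or SDSC system.

```python
import hashlib
import json
import time
from pathlib import Path


def ingest_record(archive_dir: Path, original: bytes, metadata: dict, tool_version: str) -> str:
    """Store the original bits unchanged and append one entry to the ingest log."""
    digest = hashlib.sha256(original).hexdigest()
    objects_dir = archive_dir / "objects"
    objects_dir.mkdir(parents=True, exist_ok=True)
    # The stored file is a byte-for-byte copy, named by its own checksum.
    (objects_dir / digest).write_bytes(original)

    entry = {
        "sha256": digest,
        "size": len(original),
        "metadata": metadata,          # metadata captured at ingest time
        "ingest_tool": tool_version,   # which software performed the ingest
        "ingested_at": time.time(),
    }
    # Append-only log: earlier entries are never rewritten.
    with (archive_dir / "ingest.log").open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return digest


def replay_log(archive_dir: Path) -> list:
    """Read back every ingest entry in the order it was written."""
    with (archive_dir / "ingest.log").open(encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

Naming each stored object by its checksum means that any derived access copy can always be traced back to, and checked against, the original bits.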
191            Selecting the appropriate storage medium. The technology is evolving to make data
192   persistent through means that do not involve tape backup. Such techniques generally involve
193   geographically separated spinning disk replicas. Today, tape is understood to have significant
194   drawbacks in such areas as transfer and access speeds and complexity. Because of dramatic
195   improvements in the performance/cost of disk storage, tape also no longer enjoys a significant
196   cost advantage. While large-scale disk-based archive systems are not yet in widespread use, and
197   their archival properties not entirely understood, there is good reason to believe that such
198   techniques will soon (and may already) be superior to more widely used tape-based techniques.
199   We encourage NARA to strongly consider such designs for their initial ERA prototype. At a
200   minimum, NARA should architect the system so that it could easily add support for spinning
201   disks in the future.
202            Evaluating requirements for trustworthiness. Careful analysis of the acceptable rate of
203   data loss and of threat models is required to understand how much to invest in trustworthiness
204   measures. Measures to consider include: robustness (e.g., redundancy via geographically-
205   distributed replicas), integrity checks (e.g., to verify receipt of records or protect against
206   tampering), and access controls (e.g., to protect classified or otherwise non-public records).
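
To make the robustness and integrity measures above concrete, the sketch below checks each replica of a record against its expected checksum and computes a very simple loss estimate. The loss formula assumes that replicas fail independently, which is a strong assumption that a real threat model would need to examine; the function names and the site labels used with them are hypothetical.

```python
import hashlib


def find_damaged_replicas(replicas: dict, expected_sha256: str) -> list:
    """Return the names of replicas whose content no longer matches the checksum."""
    return sorted(
        site
        for site, data in replicas.items()
        if hashlib.sha256(data).hexdigest() != expected_sha256
    )


def probability_all_replicas_lost(per_replica_loss: float, replica_count: int) -> float:
    """Chance of losing every copy, ASSUMING replicas fail independently."""
    return per_replica_loss ** replica_count
```

A damaged replica found this way can be repaired from any replica that still matches the checksum, which is why periodic checking matters as much as the number of copies.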
207            Deciding where to invest in access capabilities. For paper records, the primary access
208   scheme for an archive has been the finding aid. For born-digital materials, finding aids will
209   continue to be valuable tools, but full-text indexing is a low-cost method with a high payoff.
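
The low cost of basic full-text indexing can be seen from how little is needed for a minimal version. The following sketch builds an inverted index (a map from each word to the records that contain it) and answers queries that require all terms to be present. It is an illustration under simplifying assumptions (plain-text records, ASCII letters and digits only); a production system would more likely rely on an existing indexing product.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict) -> dict:
    """Map each term to the set of record identifiers whose text contains it."""
    index = defaultdict(set)
    for record_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return dict(index)


def search(index: dict, query: str) -> set:
    """Return the records that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```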
211   Recommendation 4. The ERA should be designed as a modular system that can be built,
212   evolved, modified, and maintained incrementally.
214            One of the most important techniques of developing a large system is modularization.
215   Properly designed modules can be replaced with better versions, with minimal disruption to other
216   modules. Clean interfaces among modules will reduce the risk of incompatibilities and will
217   improve the chances that the system can evolve steadily over time, withstanding both increases in
218   scale and changes in technology. For the ERA, a high-level modularization (consistent with the
219   thinking expressed in the commonly cited Open Archival Information System architecture) is to
220   separate the file store, ingest, and access. However, designing the modularity of a system
221   requires much more than simply identifying the different pieces. An overarching architecture is
222   required to define the framework into which the modules fit. Issues such as exactly where the
223   boundaries between these functions lie, what the interfaces are, and so forth require careful
224   analysis. Ideally, the high-level technical vision and responsibility for a modular ERA
225   architecture would belong to staff at NARA, not at an external contractor.
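
The idea of clean interfaces among modules can be sketched in a few lines. Below, ingest and access depend only on a small file-store interface, so the storage module could be replaced without changing either of them. The interface shown (two operations, put and get) is a deliberately simplified assumption made for illustration; it is not the Open Archival Information System model and not a proposed ERA design.

```python
from typing import Dict, Protocol


class FileStore(Protocol):
    """The only operations that ingest and access may rely on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """One interchangeable implementation; a disk or tape store would offer the same methods."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class Ingest:
    def __init__(self, store: FileStore) -> None:
        self._store = store

    def accept(self, record_id: str, data: bytes) -> None:
        self._store.put(record_id, data)


class Access:
    def __init__(self, store: FileStore) -> None:
        self._store = store

    def retrieve(self, record_id: str) -> bytes:
        return self._store.get(record_id)
```

Because neither Ingest nor Access names a concrete store, moving the data to a new storage technology would mean writing one new class rather than revising the whole system.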
226            Iterative or incremental design--in which multiple cycles of specify, design, implement,
227   and test are executed--is a widely accepted way of developing software systems, especially
228   systems, such as ERA, that are not simple duplicates of existing systems. Iterative design allows
229   users (archivists, federal agencies, and those accessing records from the outside) to see a partially
230   working system, or partially working system components, early in the development process. It
231   provides rapid feedback about what works, what doesn't, what needs to be refined or rejected, and
232   what is missing. Incremental design is especially appropriate where such a system has not been
233   built before and where the requirements are relatively poorly understood, as is the case with the
234   ERA program.
235            Although NARA recognizes that there will be multiple builds of the ERA, it appears to
236   be an implicit assumption that the ERA program will at some point result in the creation of an
237   "ultimate system design" that will have a very long life. It is more reasonable to assume that
238   there will be many iterations and the design will continue to evolve indefinitely.
239            It is a very daunting task to contemplate identifying all of the requirements up front. An
240   iterative design approach means starting with something relatively simple that works soon, which
241   alleviates the pressures to try to anticipate everything up front and permits the system to be
242   improved and scaled up as experience is gained. It is important to try to get a good
243   first design, but the full design--modularity, interfaces, processes, workflows, and so forth--will all
244   change as the system evolves.
245            Importantly, there will be occasions where the iterative process will result in considerable
246   changes to the system. It may, for example, be necessary to copy all the data to a new store. And
247   because the correct modularity is only speculation at this point, it should not be a surprise if the
248   entire system needs to be replaced a few times during its evolution.
249            Modularity is the key to allowing the ERA to live a long time. A modular approach will
250   permit separate modules to evolve independently: new modules can be introduced, or replace old
251   modules, without disrupting the overall system function. However, designing a system to be
252   modular and evolvable is quite different from the `one-off' procurement that is more common for
253   most government agencies. In particular, it requires strong in-house IT skills to control the high-
254   level architectural design, and to manage RFP and acquisition processes.
255            [[Wilensky: A modular, evolvable, and incrementally improvable design is surely a good
256   thing. However, our recommendation seems to suggest that NARA initially design a system that
257   has these properties, which seems unlikely to be the case the first time. So, perhaps we mean to
258   ALSO suggest that they "rapid prototype" a (possibly throw-away) system, in order to determine
259   how to subsequently produce one that has these fine properties.]]
260   Recommendation 5. In designing the ERA, NARA should seek out commonalities with
261   other systems and programs wherever possible rather than focus on ERA's unique
262   attributes.
264             Documents and briefings provided by NARA sometimes state or imply that the
265   challenges in the ERA are unique. It is true that the program’s goals are ambitious and that no
266   one has built an electronic records archiving system on the scale envisioned. Although its design
267   and implementation will be challenging, many parts of the problem appear to have precedents,
268   and many of the requirements have much in common with those of other systems. For example,
269   although NARA (rightly) places emphasis on integrity, confidentiality, and authenticity, other
270   sectors such as the financial industry have similarly stringent requirements with respect to these
271   attributes. Techniques for handling metadata and for access to records are similar to those of
272   digital libraries. Long-term robust storage of digital files has a great many applications today.
273   Uniqueness for the ERA may appear in some areas but it is the exception.
274             These commonalities mean that NARA would benefit from increased coordination with
275   other federal entities (such as the Library of Congress) and private-sector institutions that have
276   common interests in digital preservation. Enhanced coordination should extend at least to
277   increased information-sharing as the various institutions move forward with preservation
278   programs; this should not, however, be read as a recommendation for coordinated or joint
279   procurements. [[Dollar: NARA could respond to the Committee recommendation by simply
280   quoting from the ERP]]
281             A corollary relates to the use of commercial off-the-shelf (COTS) products, a decision
282   that NARA has made for the ERA. The committee believes that this is a correct and quite
283   important decision that is essential to the success of the project. But obtaining the benefits of a
284   COTS strategy will require compromises and adaptations--and a willingness to relax certain
285   requirements.
