Workflow in Web-Based Scholarly Publishing: Recommendations to the American Anthropological Association for the AnthroSource Digital Portal

John A. Benner
School of Information
The University of Texas at Austin

INF 392K: Problems in the Permanent Retention of Electronic Records
Dr. Patricia Galloway
May 5, 2004

Introduction

As it enters its second century, the American Anthropological Association (AAA) is planning to implement a new means of promoting its field. With the assistance of the University of California Press (UCP) and Atypon Systems, AAA is developing AnthroSource, a digital portal to a variety of anthropological information resources. The announcement on AAA’s Web site (http://www.aaanet.org/anthrosource/index.htm) has promised that AnthroSource will provide AAA members and other subscribers with access to the following: all AAA journals, newsletters, bulletins, and monographs; a linked, searchable database containing all past, present, and future AAA periodicals; centralized access to a wealth of other key anthropological resources, including text, sound, and video; and interactive services to foster communities of interest and practice throughout the discipline. This report will explore the issues that the portal’s creators must address in implementing a publishing workflow system to produce content suitably packaged to meet the requirements of long-term storage in a digital repository.

Overview

Although the AnthroSource portal will eventually provide access to a wide variety of anthropological resources, most of its content will be in the form of journal articles. This journal-based content will include about 100 years of retrospective material from AAA’s peer-reviewed publications in print, all to be converted to a digital format, and “born-digital” files of current and future issues. AAA has hired UCP to perform all editorial, production, and distribution work for print and electronic publications.
Atypon Systems, a company that specializes in providing technology to the information industry, will serve as UCP’s technology partner by hosting the portal and managing the publishing process via its Literatum software package. Literatum will provide features for licensing, content collection, contracts, e-commerce, production and delivery of content, searching, personalization, customer relationship management, reporting (such as delivery, holdings, and usage reports), and a manuscript workflow system (“Literatum for the UC Press,” n.d.).

Beebe and Meyers (2000) have divided the process of scholarly publishing into the following six major functions: content development, publisher enhancements, manufacturing, distribution, marketing, and archiving. This report will focus on three of these functions (content development, publisher enhancements, and archiving) because they are the most relevant to the process of creating digital documents for a portal and preserving them indefinitely. For each of these three functions, and for retrospective conversion, this report will examine the approach that UCP/Atypon have proposed. Where other models are available, this report will explore methods that parties engaged in similar projects have used or proposed. The goal is to recommend a model to UCP/Atypon for each major function.

Content Development

Content development for academic journals generally begins when the author completes and submits a manuscript, usually in electronic form (either on disk or through the Internet). According to Beebe and Meyers (2000), many publishers have electronic-submission guidelines for authors; these guidelines commonly include a manuscript template (usually in Microsoft Word).
Hodge (2000) has identified the stage of content creation as an opportunity to make decisions that will reduce labor in the long term, stating that “the preservation and archiving process is made more efficient when attention is paid to issues of consistency, format, standardization, and metadata description in the very beginning of the information life cycle.” For example, the Oak Ridge National Laboratory has placed limits on the software, format, and layout that can be used in the creation of digital documents. Hodge has also reported that many project managers consider it a best practice to have the author provide some metadata along with the manuscript file. Recent trends toward incorporation of XML and RDF (Resource Description Framework) capability in word-processing and database software will facilitate the creation of metadata at the time the object is created. The publisher could then create additional metadata for the object at a later stage, which Hodge has described as identification and cataloging (Hodge, 2000, p. 3). Elsevier Science uses the point of login as an occasion to collect certain metadata about the author and the submitted item (Yale, 2002).

The UCP/Atypon proposal has not specified limits to acceptable formats for electronic submission, opting instead to emphasize flexibility toward existing work processes as a key advantage for all parties involved. When a manuscript reaches the publisher, office personnel enter identifying metadata about the file into the publishing system (“UCP Assumptions,” n.d.).

UCP/Atypon could streamline early content development by providing authors with a template; although software conversion programs may make the limitation of acceptable file formats seem unnecessary, the use of a template facilitates the capture of metadata. Requiring authors to supply metadata for certain essential fields would reduce the amount of time that editorial staff or interns would spend on this task.
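The recommendation that authors supply metadata for essential fields at submission can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: neither the UCP/Atypon documents nor the sources cited here list which fields are essential, so the field names in REQUIRED_FIELDS and the function names below are hypothetical.

```python
# Hypothetical list of essential fields; the proposal documents do not
# specify which metadata fields authors would be required to supply.
REQUIRED_FIELDS = ("title", "authors", "abstract", "keywords")


def is_blank(value):
    """Return True if a metadata value is missing or effectively empty."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple)):
        # An empty list, or a list containing only blank entries, is blank.
        return all(is_blank(item) for item in value)
    return False


def missing_fields(metadata):
    """List the required fields the author has not yet filled in."""
    return [name for name in REQUIRED_FIELDS if is_blank(metadata.get(name))]


if __name__ == "__main__":
    submission = {
        "title": "Kinship Terminology Revisited",
        "authors": ["A. Author"],
        "abstract": "   ",
        "keywords": [],
    }
    print(missing_fields(submission))
```

A submission form built along these lines could decline an upload until the returned list is empty, which would leave editorial staff to verify metadata rather than key it in.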
Publisher Enhancements

Beebe and Meyers (2000) have divided the function of publisher enhancements into peer review, editing (including substantive editing and copyediting), and coding. They have listed several advantages of using manuscript-tracking software to manage the peer-review process, including increased efficiency (mainly because the process yields a supply of accepted digital documents that are ready for editing) and the ability to obtain statistics on reviewer performance and on a journal’s rate of manuscript acceptance. Similarly, they have indicated that performance of editing and copyediting on the electronic manuscript should be more efficient than pen-and-ink editing, providing an audit trail of all corrections while also yielding a clean revised manuscript ready for page composition.

AAA has requested a workflow system that includes electronic peer review and manuscript tracking. UCP/Atypon express confidence in their ability to meet AAA’s needs (“UCP Assumptions,” n.d.). Provided that the system is able to execute the tasks described above, including the compilation of peer-review statistics, the system is likely to perform adequately in this part of the process. However, the system should compile not only statistics but also the peer reviews themselves; that is, the assembly of all peer review comments into one working document will streamline the revision process for the editor and the author.

According to Beebe and Meyers (2000), coding is “[t]he act of tagging elements in a document to describe their structure, content, or desired appearance.” This function has acquired new significance through the demand for document portability across a variety of media.
Beebe and Meyers (2000) have emphasized the financial advantage of using a flexible markup language for the coding process, saying that “SGML and XML add…value by assuring that the content can be reused in different media.” As of early 2002, Elsevier Science produced an SGML markup of each document in its workflow. However, the publisher reported that it was developing a new document type definition (DTD) that would be XML-based and MathML-enabled (Yale, 2002). The DiVA project has used a workflow that enables XML markup of both the document structure and the metadata (Müller, Klosa, Andersson, & Hansson, 2003).

Atypon has reported that its DTD, conforming to a standard created by the National Library of Medicine (NLM), is XML-based (“UCP Assumptions,” n.d.). The Mellon Project Steering Committee at Harvard, in a 2003 postscript to its report from the previous year, has expressed confidence that this DTD, although newly released, “should help facilitate the creation and operation of sustainable archives for e-journals.” This report endorses the use of the NLM-standard DTD for document markup.

Archiving

Just as Beebe and Meyers (2000) have emphasized the flexibility of markup languages to enable documents to be displayed on different media, they have pointed out how such coding, especially in SGML or XML, facilitates the long-term preservation of digital documents:

Standard coding in a platform-independent medium assures that the publication can be used in the future. Files can then be transferred from one medium to another as technology advances or be included in a larger database as part of a publisher’s knowledge management activities. SGML or XML are the preferred languages for material to be archived because they are platform-independent; consequently, the material can be transferred to any media in the future.

In archiving, however, much more is at issue than the continued usability of the file.
Preservation must deal with maintaining both the content and the “look and feel” of the original object. Also, archival principles emphasize that the original object should not be altered, if at all possible, and that any necessary change to the object must be thoroughly documented.

One of the main questions to resolve, then, is which form or forms of the digital object an archive should keep. Because storage on hard drives is increasingly inexpensive, all files from the publishing workflow should be kept, at least for the term of their usefulness. These files should be managed according to a records-retention schedule.

Ockerbloom (2002) has provided a good discussion of “source” and “presentation” forms (or files). Source forms are often SGML files or, increasingly, XML files, structured according to a publisher-specified DTD. The benefit of source files is that their structure and markup “often provides information that cannot practically be extracted from presentation files, such as the structure of equations and formulas, bibliographic records, and tabular data.” Provided that the source files have used a well-documented DTD to display this information, “programs can easily analyze journal articles to support value-added services.”

Presentation forms are important, according to Ockerbloom (2002), because they “show what readers actually saw when they read the journals,” whereas source forms typically do not. Often in the format of PDF or HTML accompanied by image files, these forms “faithfully reproduce the appearance of journal articles and their text.” Another advantage of presentation forms is that they are “relatively inexpensive to ingest” into an archival repository.
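Ockerbloom’s point that structured source files let programs recover bibliographic records, which a page image cannot readily yield, can be illustrated with a short sketch. The element names in the sample below are simplified and hypothetical; they are not taken from the NLM DTD or from any publisher’s actual DTD, and the function name is likewise an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical source file. Real publisher DTDs define their
# own element names and a far richer structure than this sample.
SAMPLE_SOURCE = """<article>
  <front>
    <title>Example Article</title>
  </front>
  <back>
    <ref-list>
      <ref id="r1">
        <author>Hodge, G.</author>
        <year>2000</year>
        <source>Journal of Electronic Publishing</source>
      </ref>
      <ref id="r2">
        <author>Ockerbloom, J.</author>
        <year>2002</year>
        <source>Digital Library Federation</source>
      </ref>
    </ref-list>
  </back>
</article>"""


def extract_references(xml_text):
    """Collect each tagged reference as a dictionary of its parts."""
    root = ET.fromstring(xml_text)
    references = []
    for ref in root.iter("ref"):
        references.append({
            "id": ref.get("id"),
            "author": ref.findtext("author"),
            "year": ref.findtext("year"),
            "source": ref.findtext("source"),
        })
    return references


if __name__ == "__main__":
    for record in extract_references(SAMPLE_SOURCE):
        print(record["id"], record["author"], record["year"])
```

Because each part of a reference is tagged, a service such as bibliographic linking could be built on records like these; recovering the same records from a PDF or TIFF presentation file would require text recognition and guesswork about layout.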
Ockerbloom (2002) has listed some of the shortcomings of presentation files: they usually do not include structural markup, and they lack “functionality that would be provided on the original publisher’s Web site.” Therefore, “they might not support…automatic bibliographic linking, cross-article text searching, and data analysis.” Also, although PDF is perhaps the most common format for presentation files, Hodge (2000) has pointed out that it “may not be accepted as a legal depository format because of its proprietary nature.”

Perhaps the best solution to the source-versus-presentation dilemma is, storage capacity permitting, to preserve the object in both forms (in addition to the original format, if different). Ockerbloom (2002) has stated that, ideally, this method of preservation is exactly what an archive should carry out. He has also considered “auxiliary” forms for cases in which the object is more complicated than an article composed of text and images:

It may be useful…to support some basic supplementary data types, such as character and numeric data stored in relational tables. For other formats, if the supplementary material can be packaged as a bit-stream, archives should at least be able to preserve this bit-stream. They should also include some metadata concerning the format of the bit-stream so that new programs can interpret that format, if researchers are sufficiently motivated.

It must be noted, in the discussion of the rather complicated problems that archiving presents, that Atypon has not adequately addressed these issues in the proposal documentation it has released. Atypon has proposed to store each journal page as a TIFF image and preserve this presentation file as the “archival quality master” (“UCP Assumptions,” n.d.). While TIFF may serve as an acceptable format for preserving the converted backfiles of journals, it is not adequate for archiving born-digital files.
This report urges a thorough examination of the issues involved in archiving. At a minimum, Atypon should preserve each of the following: the digital object in its original format; all metadata, including a record of any changes made to the original object; one source file; and one presentation file. For preservation purposes, the metadata and the source file should be bundled together. In addition to taking these steps for preservation, Atypon should use open, publicly documented file formats whenever possible, in order to increase the likelihood of long-term access to format specifications.

Retrospective Conversion

Atypon has stated that it has considerable experience in retrospective content conversion, including work with optical character recognition, creation of PDF files, extracting metadata from PDF, and creating links for PDF (“UCP Assumptions,” n.d.). This experience should qualify Atypon to convert the extensive backfiles of AAA print journals to digital format. However, Atypon must consider the important issues of metadata and open formats that have been discussed previously.

Other Considerations

The parties involved in the creation of AnthroSource may want to consider the development of an article-based workflow model, which could eventually replace the issue-based model. Elsevier Science’s relationship with Science Direct, an internal customer, has revealed two “serious” problems with issue-based electronic workflow: first, that content produced in this workflow is not quickly distributed to customers, and second, that this type of workflow results in high and low periods of work rather than a steady stream of work. Science Direct is developing an alternative, article-based electronic workflow that “will streamline interactions between authors, producers, and suppliers.” This model will include Web-based submission of articles and an electronic peer-review system, which “will interface with a more automated login and tracking system” (Yale, 2002).
Summary of Workflow Recommendations

UCP/Atypon could streamline early content development by providing authors with templates; although software conversion programs may make the limitation of acceptable file formats seem unnecessary, the use of templates facilitates the capture of metadata. Requiring authors to supply metadata for certain essential fields would reduce the amount of time that editorial staff or interns would spend on this task.

UCP/Atypon should save all files from the publishing workflow. These files should be managed on a records-retention schedule according to best practices. Files that represent key stages in the workflow should be stored and maintained on the archival server(s).

In order to fulfill the purposes for which it is intended, the metadata must, of course, be evaluated at appropriate points in the workflow. The publishing workflow system and/or one or more editors should perform this evaluation to ensure the accuracy and relevance of all metadata.

Conclusion

UCP/Atypon appear to have the expertise and experience necessary to manage most aspects of the publishing workflow for AnthroSource. However, it is imperative that they (and AAA) take special care in examining the problems of archiving; by doing so, they will benefit from the knowledge that other groups have gained from experience with similar projects.

References

American Anthropological Association. (2004). AnthroSource. Retrieved March 23, 2004 from http://www.aaanet.org/anthrosource/index.htm.

Beebe, L., & Meyers, B. (2000, reprint). Digital workflow: managing the process electronically. Journal of Electronic Publishing, 5 (4). Retrieved March 23, 2004 from http://www.press.umich.edu/jep/05-04/sheridan.html.

Harvard University Library Mellon Project Steering Committee. (2002). Report on the planning year grant for the design of an e-journal archive. Digital Library Federation. Retrieved March 23, 2004 from http://www.diglib.org/preserve/harvardfinal.html.

Hodge, G. (2000). Best practices for digital archiving. Journal of Electronic Publishing, 5 (4). Retrieved March 23, 2004 from http://www.press.umich.edu/jep/05-04/hodge.html.

“Literatum for the UC Press – List of Features.” (2003?) Retrieved January 31, 2004 from http://www.lib.utexas.edu (document on e-reserves for INF 392K).

Müller, E., Klosa, U., Andersson, S., & Hansson, P. (2003). The DiVA project – development of an electronic publishing system. D-Lib Magazine, 9 (11). Retrieved March 23, 2004 from http://www.dlib.org/dlib/november03/muller/11muller.html.

Ockerbloom, J. (2002). Report on a Mellon-funded planning project for archiving scholarly journals. Digital Library Federation. Retrieved March 23, 2004 from http://www.diglib.org/preserve/upennfinal.html.

“UCP Assumptions for [Atypon].” (2003?) Retrieved January 31, 2004 from http://www.lib.utexas.edu (document on e-reserves for INF 392K).

Yale University Library and Elsevier Science. (2002). YEA: The Yale Electronic Archive – One year of progress – Report on the Digital Preservation Planning Project. Digital Library Federation. Retrieved March 23, 2004 from http://www.diglib.org/preserve/yalefinal.html.