

                     Workflow in Web-Based Scholarly Publishing:
             Recommendations to the American Anthropological Association
                        for the AnthroSource Digital Portal

John A. Benner
School of Information
The University of Texas at Austin
INF 392K: Problems in the Permanent Retention of Electronic Records
Dr. Patricia Galloway
May 5, 2004

       As it enters its second century, the American Anthropological Association (AAA) is

planning to implement a new means of promoting its field. With the assistance of the University

of California Press (UCP) and Atypon Systems, AAA is developing AnthroSource, a digital

portal to a variety of anthropological information resources. The announcement on AAA’s Web

site has promised that AnthroSource will

provide AAA members and other subscribers with access to the following:

       all AAA journals, newsletters, bulletins, and monographs; a linked, searchable database

       containing all past, present, and future AAA periodicals; centralized access to a wealth of

       other key anthropological resources, including text, sound, and video; and interactive

       services to foster communities of interest and practice throughout the discipline.

This report will explore the issues that the portal’s creators must address in implementing a

publishing workflow system to produce content suitably packaged to meet the requirements of

long-term storage in a digital repository.


       Although the AnthroSource portal will eventually provide access to a wide variety of

anthropological resources, most of its content will be in the form of journal articles. This journal-

based content will include about 100 years of retrospective material from AAA’s peer-reviewed

publications in print, all to be converted to a digital format, and “born-digital” files of current

and future issues. AAA has hired UCP to handle all editing, production, and distribution of

print and electronic publications. Atypon Systems, a company that specializes in providing

technology to the information industry, will perform the role of UCP’s technology partner by

hosting the portal and managing the publishing process via its Literatum software package.

Literatum will provide features for licensing, content collection, contracts, e-commerce,

production and delivery of content, searching, personalization, customer relationship

management, reporting (such as delivery, holdings, and usage reports), and a manuscript

workflow system (“Literatum for the UC Press,” n.d.).

         Beebe and Meyers (2000) have divided the process of scholarly publishing into the

following six major functions: content development, publisher enhancements, manufacturing,

distribution, marketing, and archiving. This report will focus on three of these functions—

content development, publisher enhancements, and archiving—because they are the most

relevant functions to the process of creating digital documents for a portal and preserving them

indefinitely. For each of these three functions, and for retrospective conversion, this report will

examine the approach that UCP/Atypon have proposed. Where other models are

available, this report will explore methods that parties engaged in similar projects have used or

proposed. The goal is to recommend a model to UCP/Atypon for each major function.

Content Development

         Content development for academic journals generally begins when the author completes

and submits a manuscript, usually in electronic form (either on disk or through the Internet).

According to Beebe and Meyers (2000), many publishers have electronic-submission guidelines

for authors; these guidelines commonly include a manuscript template (usually in Microsoft


         Hodge (2000) has identified the stage of content creation as an opportunity to make

decisions that will reduce labor in the long term, stating that “the preservation and archiving

process is made more efficient when attention is paid to issues of consistency, format,

standardization, and metadata description in the very beginning of the information life cycle.”

For example, the Oak Ridge National Laboratory has placed limits on the software, format, and

layout that can be used in the creation of digital documents. Also, Hodge has stated that many

project managers consider a best practice for metadata creation to entail having the author

provide some metadata along with the manuscript file. Recent trends toward incorporation of

XML and RDF (Resource Description Framework) capability in word-processing and database

software will facilitate the creation of metadata at the time of creation of the object. The

publisher could then create additional metadata for the object at a later stage, which Hodge has

described as identification and cataloging (Hodge, 2000, p. 3). Elsevier Science uses the point of

login as an occasion to collect certain metadata about the author and the submitted item (Yale, 2002).


         The UCP/Atypon proposal has not specified limits to acceptable formats for electronic

submission, opting instead to emphasize flexibility toward existing work processes as a key

advantage for all parties involved. When a manuscript reaches the publisher, office personnel enter

identifying metadata about the file into the publishing system (“UCP Assumptions,” n.d.).

UCP/Atypon could streamline early content development by providing authors with a template;

although software conversion programs may make the limitation of acceptable file formats seem

unnecessary, the use of a template facilitates the capture of metadata. Requiring authors to

supply metadata for certain essential fields would reduce the amount of time that editorial staff

or interns would spend on this task.
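
       The idea of requiring author-supplied metadata for essential fields can be illustrated with a minimal sketch. The field names and the sample record below are assumptions made for illustration; the UCP/Atypon proposal does not specify which fields would be required or how Literatum would check them.

```python
# Hypothetical list of essential fields; the proposal does not name any.
REQUIRED_FIELDS = ("title", "authors", "abstract", "keywords", "journal")


def missing_metadata(record: dict) -> list:
    """Return the required fields that are absent or empty in a submission record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        # None, blank strings, and empty lists all count as "not supplied".
        if value is None or value == [] or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# Invented example record, as it might arrive from a submission template.
submission = {
    "title": "An Example Manuscript",
    "authors": ["A. Author"],
    "abstract": "",
    "journal": "Example Journal",
}

print(missing_metadata(submission))  # ['abstract', 'keywords']
```

       A check of this kind could run when the author submits the manuscript, so that incomplete records go back to the author rather than to editorial staff.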

Publisher Enhancements

         Beebe and Meyers (2000) have divided the function of publisher enhancements into peer

review, editing (including substantive editing and copyediting), and coding. They have listed

several advantages of using manuscript-tracking software to manage the peer-review process,

including increased efficiency (mainly because the process yields a supply of accepted digital

documents that are ready for editing) and the ability to obtain statistics on reviewer performance

and on a journal’s rate of manuscript acceptance. Similarly, they have indicated that editing and

copyediting the electronic manuscript should be more efficient than pen-and-ink editing,

providing an audit trail of all corrections while also yielding a clean revised

manuscript ready for page composition.

       AAA has requested a workflow system that includes electronic peer review and

manuscript tracking. UCP/Atypon express confidence in their ability to meet AAA’s needs

(“UCP Assumptions,” n.d.). Provided that the system is able to execute the tasks described

above, including the compilation of peer-review statistics, the system is likely to perform

adequately in this part of the process. However, the system should compile not only statistics but

also the peer reviews themselves: assembling all peer-review comments into one

working document will streamline the revision process for the editor and the author.
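
       The two tasks described above, compiling peer-review statistics and assembling the reviews into one working document, can be sketched briefly. The record layout, identifiers, and function names are invented for illustration and do not describe Literatum or any other tracking product.

```python
from collections import Counter

# Invented review records; field names are illustrative assumptions.
reviews = [
    {"manuscript": "MS-001", "reviewer": "R1", "days_to_review": 21,
     "comments": "Clarify the sampling method."},
    {"manuscript": "MS-001", "reviewer": "R2", "days_to_review": 35,
     "comments": "Expand the literature review."},
    {"manuscript": "MS-002", "reviewer": "R1", "days_to_review": 14,
     "comments": "Ready to publish."},
]
decisions = {"MS-001": "revise", "MS-002": "accept", "MS-003": "reject"}


def acceptance_rate(decisions: dict) -> float:
    """Share of decided manuscripts that were accepted."""
    counts = Counter(decisions.values())
    return counts["accept"] / len(decisions) if decisions else 0.0


def mean_turnaround(reviews: list, reviewer: str) -> float:
    """Average number of days one reviewer took to return a review."""
    days = [r["days_to_review"] for r in reviews if r["reviewer"] == reviewer]
    return sum(days) / len(days) if days else 0.0


def compile_reviews(reviews: list, manuscript: str) -> str:
    """Assemble every reviewer's comments on one manuscript into a single text."""
    lines = [f"Reviews for {manuscript}"]
    for r in reviews:
        if r["manuscript"] == manuscript:
            lines.append(f"[{r['reviewer']}] {r['comments']}")
    return "\n".join(lines)


print(round(acceptance_rate(decisions), 2))  # 0.33
print(mean_turnaround(reviews, "R1"))        # 17.5
print(compile_reviews(reviews, "MS-001"))
```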

       According to Beebe and Meyers (2000), coding is “[t]he act of tagging elements in a

document to describe their structure, content, or desired appearance.” This function has acquired

new significance through the demand for document portability across a variety of media. Beebe

and Meyers (2000) have emphasized the financial advantage of using a flexible markup language

for the coding process, saying that “SGML and XML add…value by assuring that the content

can be reused in different media.”

       As of early 2002, Elsevier Science produced an SGML markup of each document in its

workflow. However, the publisher reported that it was developing a new document type

definition (DTD) that would be XML-based and MathML-enabled (Yale, 2002). The DiVA

project has used a workflow that enables XML markup of both the document structure and the

metadata (Müller, Klosa, Andersson, & Hansson, 2003). Atypon has reported that its DTD,

conforming to a standard created by the National Library of Medicine (NLM), is XML-based

(“UCP Assumptions,” n.d.). The Mellon Project Steering Committee at Harvard, in a 2003 postscript

to its report from the previous year, has expressed confidence that this DTD, although newly

released, “should help facilitate the creation and operation of sustainable archives for e-journals.”

This report endorses the use of the NLM-standard DTD for document markup.
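
       A short sketch may help show why structural markup makes content reusable. It builds a small XML document that carries both metadata and document structure, then reads parts of it back by name. The element names are invented for illustration and are not taken from the NLM DTD or from Atypon's implementation.

```python
import xml.etree.ElementTree as ET

# Element names are illustrative only; they are NOT the NLM DTD's tag set.
article = ET.Element("article")
meta = ET.SubElement(article, "metadata")
ET.SubElement(meta, "title").text = "An Example Article"
ET.SubElement(meta, "author").text = "A. Author"
body = ET.SubElement(article, "body")
section = ET.SubElement(body, "section", {"id": "s1"})
ET.SubElement(section, "heading").text = "Introduction"
ET.SubElement(section, "paragraph").text = "Text of the first paragraph."

# Serialize to a platform-independent text form.
xml_text = ET.tostring(article, encoding="unicode")

# Because the structure is explicit, a program can retrieve parts by name
# instead of guessing from page layout.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("metadata/title"))          # An Example Article
print([h.text for h in parsed.iter("heading")])   # ['Introduction']
```

       The same source text could feed a Web page, a print layout, or a search index, which is the kind of reuse that Beebe and Meyers describe.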


Archiving

       Just as Beebe and Meyers (2000) have emphasized the flexibility of markup languages to

enable documents to be displayed on different media, they have pointed out how such coding—

especially in SGML or XML—facilitates the long-term preservation of digital documents:

       Standard coding in a platform-independent medium assures that the publication can be

       used in the future. Files can then be transferred from one medium to another as

       technology advances or be included in a larger database as part of a publisher’s

       knowledge management activities. SGML or XML are the preferred languages for

       material to be archived because they are platform-independent; consequently, the

       material can be transferred to any media in the future.

       In archiving, however, much more is at issue than the continued usability of the file.

Preservation must deal with maintaining both the content and the “look and feel” of the original

object. Also, archival principles emphasize that the original object should not be altered, if at all

possible, and that any necessary change to the object must be thoroughly documented. One of the

main questions to resolve, then, is which form or forms of the digital object an archive should

keep. Because storage on hard drives is increasingly inexpensive, all files from the publishing

workflow should be kept, at least for the term of their usefulness. These files should be managed

according to a records-retention schedule.
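
       As a rough illustration of how a records-retention schedule might be applied to workflow files, the sketch below maps kinds of files to retention periods and reports whether a file has reached the end of its period. The file kinds and the periods are assumptions chosen for the example; an actual schedule would be set by AAA and UCP.

```python
from datetime import date

# Hypothetical retention periods in years; None means "keep permanently".
RETENTION_YEARS = {
    "author_original": None,
    "copyedited_draft": 5,
    "typesetter_proof": 2,
    "published_source": None,
}


def review_due(kind: str, created: date, today: date) -> bool:
    """True when a file has reached the end of its scheduled retention period."""
    years = RETENTION_YEARS[kind]
    if years is None:
        return False  # permanent records never come up for disposal review
    # Age in whole years, counted birthday-style so leap days cause no errors.
    age = today.year - created.year - (
        (today.month, today.day) < (created.month, created.day)
    )
    return age >= years


print(review_due("typesetter_proof", date(2004, 5, 5), date(2006, 5, 5)))  # True
print(review_due("typesetter_proof", date(2004, 5, 5), date(2006, 5, 4)))  # False
print(review_due("author_original", date(2004, 5, 5), date(2104, 5, 5)))   # False
```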

       Ockerbloom (2002) has provided a good discussion of “source” and “presentation” forms

(or files). Source forms are often SGML files or, increasingly, XML files, structured according to

a publisher-specified DTD. The benefit of source files is that their structure and markup “often

provides information that cannot practically be extracted from presentation files, such as the

structure of equations and formulas, bibliographic records, and tabular data.” Provided that the

source files have used a well-documented DTD to encode this information, “programs can easily

analyze journal articles to support value-added services.”

       Presentation forms are important, according to Ockerbloom (2002), because they “show

what readers actually saw when they read the journals,” whereas source forms typically do not.

Often in the format of PDF or HTML accompanied by image files, these forms “faithfully

reproduce the appearance of journal articles and their text.” Another advantage of presentation

forms is that they are “relatively inexpensive to ingest” into an archival repository.

       Ockerbloom (2002) has listed some of the shortcomings of presentation files: they

usually do not include structural markup; they lack “functionality that would be provided on the

original publisher’s Web site.” Therefore, “they might not support…automatic bibliographic

linking, cross-article text searching, and data analysis.” Also, although PDF is perhaps the most

common format for presentation files, Hodge (2000) has pointed out that it “may not be accepted

as a legal depository format because of its proprietary nature.”

       Perhaps the best solution to the source-versus-presentation dilemma is, storage capacity

permitting, to preserve the object in both forms (in addition to the original format, if different).

Ockerbloom (2002) has stated that, ideally, this is exactly what an archive should do. He has

also considered “auxiliary” forms for cases in which the object is more complicated than an

article composed of text and images:

       It may be useful…to support some basic supplementary data types, such as character and

       numeric data stored in relational tables. For other formats, if the supplementary material

       can be packaged as a bit-stream, archives should at least be able to preserve this bit-

       stream. They should also include some metadata concerning the format of the bit-stream

       so that new programs can interpret that format, if researchers are sufficiently motivated.

       Atypon has not adequately addressed these complicated archiving issues in the proposal

documentation it has released. The company has proposed to store each journal page as a TIFF image and preserve this

presentation file as the “archival quality master” (“UCP Assumptions,” n.d.). While TIFF may

serve as an acceptable format for preserving the converted backfiles of journals, it is not

adequate for archiving born-digital files. This report urges a thorough examination of the issues

involved in archiving. At a minimum, Atypon should preserve each of the following: the digital

object in its original format; all metadata, including a record of any changes made to the original

object; one source file; and one presentation file. For preservation purposes, the metadata and the

source file should be bundled together. In addition to taking these steps for preservation, Atypon

should use open, publicly documented file formats whenever possible, in order to increase the likelihood of

long-term access to format specifications.
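
       The minimum set of preserved components recommended above can be sketched as a simple package manifest. The sketch records the four components, a checksum for each, the pairing of metadata with the source file, and a log of changes. The role names, file names, and JSON layout are assumptions for illustration only and do not follow any particular archival packaging standard.

```python
import hashlib
import json


def checksum(data: bytes) -> str:
    """SHA-256 digest, usable later to detect corruption of a stored file."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(article_id: str, files: dict, changes: list) -> str:
    """Describe the preserved components of one article as JSON.

    `files` maps a role name to a (filename, content bytes) pair.
    """
    required = {"original", "metadata", "source", "presentation"}
    missing = required - set(files)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    manifest = {
        "article_id": article_id,
        "components": {
            role: {"filename": name, "sha256": checksum(data)}
            for role, (name, data) in sorted(files.items())
        },
        # Metadata and source are kept together as one bundle.
        "bundles": [["metadata", "source"]],
        # Every alteration of the original object is recorded here.
        "change_log": changes,
    }
    return json.dumps(manifest, indent=2)


# Invented placeholder contents standing in for real files.
files = {
    "original": ("ms.doc", b"original bytes"),
    "metadata": ("ms-meta.xml", b"<metadata/>"),
    "source": ("ms.xml", b"<article/>"),
    "presentation": ("ms.pdf", b"presentation bytes"),
}
manifest_json = build_manifest(
    "example-0001", files, ["converted original to XML source"]
)
print(manifest_json)
```

       Storing checksums alongside the files gives the archive a way to confirm, at any later date, that a component has not changed since ingest.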

Retrospective Conversion

       Atypon has stated that it has considerable experience in retrospective content conversion,

including work with optical character recognition, creation of PDF files, extracting metadata

from PDF, and creating links for PDF (“UCP Assumptions,” n.d.). This experience should

qualify Atypon to convert the tremendous backfiles of AAA print journals to digital format.

However, Atypon must consider the important issues of metadata and open formats that

have been discussed previously.

Other Considerations

       The parties involved in the creation of AnthroSource may want to consider the

development of an article-based workflow model, which could eventually replace the issue-based

model. Elsevier Science’s relationship with ScienceDirect, an internal customer, has revealed

two “serious” problems with issue-based electronic workflow—first, that content produced in

this workflow is not quickly distributed to customers, and second, that this type of workflow

results in high and low periods of work rather than a steady stream of work. ScienceDirect is

developing an alternative, article-based electronic workflow that “will streamline interactions

between authors, producers, and suppliers.” This model will include Web-based submission of

articles and an electronic peer-review system, which “will interface with a more automated login

and tracking system” (Yale, 2002).
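
       The first problem with issue-based workflow, delayed distribution, can be made concrete with a small worked example. The completion days below are invented, and the two release rules are simplified assumptions rather than a description of Elsevier's systems.

```python
# Invented days on which four articles in one issue finish production.
ready_days = [3, 10, 24, 30]


def issue_based_release(ready_days: list) -> list:
    """Every article waits until the last article in the issue is ready."""
    release = max(ready_days)
    return [release for _ in ready_days]


def article_based_release(ready_days: list) -> list:
    """Each article is released as soon as it is ready."""
    return list(ready_days)


def mean_wait(ready_days: list, release_days: list) -> float:
    """Average number of days finished articles sit unreleased."""
    waits = [rel - ready for ready, rel in zip(ready_days, release_days)]
    return sum(waits) / len(waits)


print(mean_wait(ready_days, issue_based_release(ready_days)))    # 13.25
print(mean_wait(ready_days, article_based_release(ready_days)))  # 0.0
```

       Under these assumptions, finished articles wait an average of 13.25 days in the issue-based model and not at all in the article-based model.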

Summary of Workflow Recommendations

     • UCP/Atypon could streamline early content development by providing authors with

       templates; although software conversion programs may make the limitation of acceptable

       file formats seem unnecessary, the use of templates facilitates the capture of metadata.

       Requiring authors to supply metadata for certain essential fields would reduce the amount

       of time that editorial staff or interns would spend on this task.

     • UCP/Atypon should save all files from the publishing workflow. These files should be

       managed on a records-retention schedule according to best practices. Files that represent

       key stages in the workflow should be stored and maintained on the archival server(s).

     • In order to fulfill the purposes for which it is intended, the metadata must, of course, be

       evaluated at appropriate points in the workflow. The publishing workflow system and/or

       one or more editors should perform this evaluation to ensure the accuracy and relevance

       of all metadata.


       UCP/Atypon appear to have the expertise and experience necessary to manage most

aspects of the publishing workflow for AnthroSource. However, it is imperative that they (and

AAA) take special care in examining the problems of archiving; by doing so, they will benefit

from the knowledge that other groups have gained from experience with similar projects.

References

American Anthropological Association. (2004). AnthroSource. Retrieved March 23, 2004 from

Beebe, L., & Meyers, B. (2000, reprint). Digital workflow: Managing the process electronically.
   Journal of Electronic Publishing, 5(4). Retrieved March 23, 2004 from

Harvard University Library Mellon Project Steering Committee. (2002). Report on the planning
   year grant for the design of an e-journal archive. Digital Library Federation. Retrieved
   March 23, 2004 from

Hodge, G. (2000). Best practices for digital archiving. Journal of Electronic Publishing, 5(4).
   Retrieved March 23, 2004 from

“Literatum for the UC Press – List of Features.” (2003?) Retrieved January 31, 2004 from (document on e-reserves for INF 392K).

Müller, E., Klosa, U., Andersson, S., & Hansson, P. (2003). The DiVA project – development of
   an electronic publishing system. D-Lib Magazine, 9(11). Retrieved March 23, 2004 from

Ockerbloom, J. (2002). Report on a Mellon-funded planning project for archiving scholarly
   journals. Digital Library Federation. Retrieved March 23, 2004 from

“UCP Assumptions for [Atypon].” (2003?) Retrieved January 31, 2004 from (document on e-reserves for INF 392K).

Yale University Library and Elsevier Science. (2002). YEA: The Yale Electronic Archive – One
   year of progress – Report on the Digital Preservation Planning Project. Digital Library
   Federation. Retrieved March 23, 2004 from
