Archiving in everyday practice


Archiving and the work flow of field work

Nicholas Thieberger, Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)
Department of Linguistics & Applied Linguistics, The University of Melbourne Vic 3010, Australia
LSA Archiving tutorial, January 2005
Language archiving is an integral part of language documentation. The documents linguists are producing are meant to endure
and to be available for the people we record and their communities, as well as for fellow researchers, well into the future, and, we
hope, for ever. Archiving is no longer something we do at the end of our fieldwork; it is now apparent that it can be integrated into
everyday language documentation work and that it is a crucial aspect of documentary linguistics. We have learned to separate
form and content in the representation of linguistic data, and recent technological advances have pointed to the importance of
planning data management and workflow for ethnographic recording, which in turn has facilitated an expansion in documentary
linguistics and archiving. Recordings should always be of high quality, but it is in the context of small and endangered cultures
and languages that the quality of recording takes on new significance (quality here refers both to the content and the form of the
recording). If we are the only recorders of the last remaining speakers or performers then, right from the moment of recording, we
must be concerned with making good documents which will be placed into a suitable archive for storage and discovery. Thus we
can distinguish archival practice, which will be the main focus of this paper, from archival storage in a repository.
I discuss a workflow that builds in development of archival data and show that making the initial recordings and their digital
representation citable by means of a persistent identifier allows further work to be located with reference to that primary data.
Typically this further work involves annotation of the data and the construction of dictionaries and in all such derived material the
content is plain text structured to allow it to endure into the future. Further description of the data with standard metadata terms
allows its discovery in the longterm. All of this facilitates repatriation of the data to the communities from which it originates, as
they are able to locate the data once it has been archived.

Archives have an image of being repositories of old stuff. Usually old stuff that comes from old people. And in our case it is old
stuff from old people on old languages. I asked a colleague if he was considering depositing with our archive and he said "Did I
look as if I was going to die any minute when you last saw me?" For him, as for many people it seems, archiving is something
done at the end of one's career when there is time to go back and fill in gaps and make the whole data more presentable. This view
of archiving has it that boxes of stuff can be delivered to the archive sometime after the linguist has finished with them and will
then be held in perpetuity. The recent focus of linguistic archives, informed by the discussion of language documentation, is that
the stuff deposited must be of sufficient quality and sufficiently well-described that it can be useful into the future.

           Language Archives
• An integral part of language documentation
• […] activity: ‘ark-hive’ (David Nathan, ELAR)
• Need to develop archival methods for linguistic […]

Producing archival material is something we, as ordinary working linguists (OWLs), should do all the time and, further, the possibilities
provided by new technologies allow us to incorporate archival issues into our everyday practice to the benefit both of our analysis, and
of the use of our recordings and intellectual outputs. Current archives are training and providing advice in response to the need for such a
service in our community, that is the community of documentary linguists. These archives are primarily trusted longterm repositories
that take well-structured data and provide the infrastructure for securely holding and locating it over time. An archive is also the point of
reference for a network of practitioners who want advice on how to proceed. It is the archive’s role to agree on standards that seem most
appropriate and to assist in their adoption by the broader community. My observation of our own and other such archives suggests we
are all acting as a locus for documentary activity, and as proponents for new methods - what has been called an ‘ark-hive’ by David
Nathan of ELAR. As none of us has the resources to edit items in our collections, we rely on the depositors to produce material that is
well-formed from an archival point of view. Such data has an explicit structure, encoded, for example, by tags (as in a Shoebox lexical
file) or by stand-off markup (as in time-aligned transcripts). It is also provided in a non-proprietary form that can be read on
any platform.
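
To illustrate what explicit structure makes possible, here is a minimal sketch in Python of reading tag-structured lexical data of the Shoebox kind. It assumes the commonly used backslash markers \lx (headword), \ps (part of speech) and \ge (English gloss), and that every record begins with \lx. The entries are invented placeholders, and the function name is hypothetical rather than part of any Shoebox tool.

```python
def parse_backslash_records(text, record_marker="lx"):
    """Split backslash-coded text into records of (marker, value) pairs.

    Assumption: a new record starts at each line whose marker equals
    `record_marker`. Lines that do not start with a backslash are
    skipped in this simplified sketch.
    """
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.startswith("\\"):
            continue  # blank lines and unmarked text are ignored here
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = []
            records.append(current)
        if current is not None:
            current.append((marker, value.strip()))
    return records


# Invented placeholder entries, not real lexical data.
SAMPLE = "\n".join([
    r"\lx word1",
    r"\ps n",
    r"\ge gloss one",
    "",
    r"\lx word2",
    r"\ps v",
    r"\ge gloss two",
])

if __name__ == "__main__":
    for record in parse_backslash_records(SAMPLE):
        print(dict(record))
```

Because the structure is carried by plain-text tags rather than by the layout of a word-processor file, the same data can be re-read by any later tool on any platform.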

The best current working tools for transcription and time-alignment are coming out of this same effort, for example IMDI
and Elan from the MPI in Nijmegen, or Transcriber from LIMSI (with strong support from OLAC via the Linguistic Data Consortium).
Perspective of PARADISEC
This paper is written from the perspective of PARADISEC, a young digital archive based in virtual space between Sydney, Melbourne and
Canberra in Australia. PARADISEC was established 18 months ago by a group of linguists and musicologists concerned at the lack of a
repository for material recorded outside of Australia by Australian researchers. For those working with Indigenous Australian languages
there is a national archive called the Australian Institute for Aboriginal and Torres Strait Islander Studies (AIATSIS) which has been
operating since the 1960s. National Australian cultural institutions like the National Library and the National Film and Sound Archive do
not have a mandate to keep field recordings from outside of Australia. We were especially concerned about audiotapes
recorded since the 1950s that were not being stored in any suitable repository and were physically deteriorating. Thus the initial focus
was on the preservation of existing, so-called ‘legacy’ material and we have so far digitised some 660 hours or 1.1 terabytes of data.
However, once we started processing these tapes, it was clear that there was a huge demand from current researchers wanting to work
with their data in a digital form and wanting high-quality archival representation of their media before they conducted most of their

        The coercive archive
• Attached to a funding body
• Obligatory deposit of recorded material
• Enforces data formats by contractual […]

At this point it is useful to distinguish two kinds of existing archives, which I will characterise as coercive and non-coercive.
The coercive archive is part of a funding agency and so has some means and ability to enforce standards on depositors, as is
the case with the ELDP/ELAR or DOBES. When grants are provided for language documentation, the form of the
recordings and their associated descriptive and analytical apparatus can be prescribed by these funding bodies, who have
also been providing training in the use of these methods. Further, the funding body can contractually bind the recipient to
lodge this material with the archive. These archives typically house newly recorded material, often recorded in a digital
form and so not requiring conversion before being archived.

PARADISEC is currently not in a position to fund researchers, and so the appeal to depositors has to be pitched differently.
We encourage practitioners (who we take to mainly include linguists, musicologists, and indigenous language workers) to
deposit media material by ensuring that they will have a high quality digital version of their data in the short term. If an
archival form of the file is created first and is then used as the basis for the subsequent effort of transcription and time-
aligning, the resulting work has a citable source that should persist into the future. We have been encouraging postgraduate
students to lodge their tapes with PARADISEC as soon as they return from fieldwork. We digitise or capture their data and
provide both an archival (usually at 96 kHz/24-bit BWF) and a representational (linear MP3) copy of their
data with persistent identification. Their intellectual effort of annotating this primary data can then build on a firm
foundation for both their own immediate goal (typically a dissertation) and the longterm needs of having richly annotated
primary data safely archived.

I learned the hard way that using a non-archival digital file as the basis for transcription and analysis results in a mismatch in
timecodes when an archival file is later produced. I had digitised my analog field cassettes myself in 1998 by connecting a
tape player to a computer and producing fairly poor digital copies. I then annotated these using the program called
SoundIndex from LACITO as part of my documentation of the Oceanic language South Efate. A few years later the tapes
were digitised at a higher resolution and the timecodes in my earlier versions did not align in a simple way with those of the
new files. I transcribed around eighteen hours of audio altogether and I was under some time constraint as the work was to
result in a dissertation in the form of a documentary grammar with a time-aligned media corpus. The non-archival forms of
transcript may be corrected one day, but the clear lesson is that creation of an archival form of digital data is best done
before the analysis begins.
At PARADISEC we also spend considerable time with many old tapes, preparing them for data transfer by cleaning and,
in some cases, baking them under vacuum. We assign persistent identification and create an enduring citable form of the
data as part of the archival accession and we run training workshops of half a day to several days’ duration on the use of
software tools and on data management. We use these as a means of advocating a workflow for language documentation
that builds archiving into the normal everyday work of the OWL rather than being an onerous addition, or a task left until
the weight of the cumulative research effort becomes unbearable at the end of a researcher’s working life.
The paucity of material related to many Australian indigenous languages is a great motivator for the current generation of
researchers to ensure that the records that we leave behind will be of more use than those we have had to work with.

Both coercive and non-coercive archives rely on the relationships they have established with their communities, both
depositors and users. In general, the benefits of depositing are clear, in particular as we are digitising analog tapes and
holding copies at no cost for members of our consortium. The ability to be ‘trusted’, as a repository should be, arises from
technical and in content, of recordings and associated derived material (transcripts, glosses, dictionaries etc). The
rationale is that if we want high quality recordings and well-structured archival data, then we have to provide training in
its creation. We run workshops in using Shoebox, still the only tool that creates structured lexical files, and, as wonderful
tools like Transcriber and Elan are produced by our colleagues in Europe, we introduce them to a community of users in
our region at occasional workshops, both in our universities and in community-based language centres.
To assist in training, we have cooperated in the establishment of a network for
providing support to language workers, called the Resource Network for Linguistic Diversity, and we use the mailing list
associated with this network to discuss emerging methods and tools, as well as providing a FAQ page and an archive of the
list discussion (kindly provided by LinguistList).

Deposit of data in an archive presupposes that the depositor has sought and received permission from their interlocutors.
Ideally a written consent form, provided by the researcher and signed by the speaker, would specify the uses to which the
recordings could be put. Each item in the archive is accompanied by a deposit form filled out by the depositor or their
executor that outlines conditions on use of the material.

The ability to enforce standards on depositors extends to the description of the data, or the metadata that allows the data to
be discovered. Again, a coercive funding agency can insist on highly detailed metadata descriptions, as we see with the
fine-grained IMDI metadata set. For legacy data, which PARADISEC mainly deals with, the quality of metadata can be quite
variable, often no more than a few lines on a tape box, together with contextual information about the collection from
well as of the process it undergoes from accession. All of this metadata can be output in various forms, one of which is the
OLAC metadata set. We would like to take this opportunity to thank OLAC for the work it has put into developing a
metadata system and for the support we have received in establishing a static metadata repository that we update
periodically from our catalog. Exporting to OLAC metadata has increased the visibility and so the discoverability of the
material in our collection, and its ease of use has meant that we were able to move our metadata system from nothing to
an Open Archives Initiative conformant metadata repository in a few months.
       Filenaming conventions

We also provide users with a spreadsheet with our metadata headings and are hoping to implement a web entry form
for users in the next year. We encourage users to develop a persistent naming convention using fairly standard ASCII
characters and to avoid unnecessarily long names. If we can then take the user’s names for their own files and
incorporate them into our persistent identification it makes it much easier to keep track of the relationships between
the notes and the media files. Our persistent filenames follow the directory structure of the mass storage system on
which the files will reside, and are composed of a collection identifier, followed by an item identifier and then a
specific local identifier (like ‘A’ or ‘B’ for the side of a tape). These are then followed by a three-letter extension
indicating the filetype.
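
The naming scheme just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PARADISEC's own software: the hyphen separator is inferred from the identifier SAW2-018-00005 cited elsewhere in this paper, the collection/item directory layout is a guess at how a filename might "follow the directory structure", and the function names and the example identifiers (XX1, 001) are hypothetical.

```python
import re

# Assumption: each identifier part uses only plain ASCII letters, digits
# and underscores, in line with the advice to keep names simple.
_PART = re.compile(r"[A-Za-z0-9_]+")
_EXTENSION = re.compile(r"[a-z0-9]{3}")


def persistent_filename(collection, item, local, extension):
    """Compose collection, item and local identifiers plus a
    three-letter extension into a single filename."""
    for label, value in (("collection", collection),
                         ("item", item),
                         ("local", local)):
        if not _PART.fullmatch(value):
            raise ValueError(f"{label} identifier {value!r} is not plain ASCII")
    if not _EXTENSION.fullmatch(extension):
        raise ValueError(f"extension {extension!r} is not three letters")
    return f"{collection}-{item}-{local}.{extension}"


def storage_path(collection, item, local, extension):
    """Hypothetical layout: one directory per collection, then per item."""
    name = persistent_filename(collection, item, local, extension)
    return f"{collection}/{item}/{name}"


if __name__ == "__main__":
    # Side A of a tape: hypothetical collection XX1, item 001.
    print(storage_path("XX1", "001", "A", "wav"))
```

Rejecting spaces and non-ASCII characters at the point of naming is a deliberate choice in this sketch: it is cheaper to refuse a problematic name at accession than to repair it once other documents cite it.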

Working with legacy material means that we see what small additional steps a researcher could have taken to make
their recordings more useful. Obviously collections vary greatly in the accompanying documentation. In some cases
there is no specific information about the tapes we have located in a box or filing cabinet, and, while there may be
accompanying fieldnotes, we do not have the time or the personnel to work through fieldnotes and to establish their
relevant material and to reintegrate it with fieldnotes.
           Page image of Stephen Wurm’s fieldnotes on Aiwo, Solomon Islands, SAW2-018-00005

<hasTranscript>/<isTranscriptOf> proposed
addition to OLAC/Dublin Core <relation> element
Recently we have been taking representational images of transcripts found in
association with tapes recorded thirty years ago. These are typescript or handwritten
manuscripts that belong together with tape recordings. We give the images the same
name as the tape, differing in the extension (.wav for the audio and .jpg for the image).
Furthermore, the metadata notes the relation between these two types of information
(using the element <relation>, for which we propose the additional refinement of
<isTranscriptOf>/<isTranscribedBy>).
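
Because the image and the audio share a name and differ only in extension, the relation between them can be derived mechanically. The following Python sketch shows one way this might be done. The function name and the example filenames are hypothetical, and the refinement names are those proposed above, not terms in any published OLAC or Dublin Core vocabulary.

```python
from pathlib import PurePosixPath


def transcript_relations(filenames):
    """Pair each .jpg transcript image with the .wav audio file of the
    same name and return (source, relation, target) triples."""
    by_stem = {}
    for name in filenames:
        path = PurePosixPath(name)
        by_stem.setdefault(path.stem, {})[path.suffix.lower()] = name

    relations = []
    for stem in sorted(by_stem):
        files = by_stem[stem]
        if ".wav" in files and ".jpg" in files:
            # Proposed refinements of <relation>; not an existing standard.
            relations.append((files[".jpg"], "isTranscriptOf", files[".wav"]))
            relations.append((files[".wav"], "isTranscribedBy", files[".jpg"]))
    return relations


if __name__ == "__main__":
    # Hypothetical filenames for illustration only.
    names = ["XX1-001-A.wav", "XX1-001-A.jpg", "XX1-002-A.wav"]
    for source, relation, target in transcript_relations(names):
        print(source, relation, target)
```

An audio file with no matching image, like the third name in the example, simply yields no relation, so the sketch can be run over a whole collection without special cases.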


          Everyday archiving
Creation of optimal archival forms of data through normal linguistic practice, e.g.
• Well-structured data (e.g. backslash codes, […]
• Annotation with time-alignment
• Citation of archival data, which implies
      - persistent identification and location
      - interactive use of archival data
• Consent/deposit forms clarify intellectual property issues

                        Example Workflow
[Workflow diagram; the recoverable steps are listed in order.]
• Analogue tape digitised, or digital recording captured, to give an archival digital file
• Archival digital file archived with PARADISEC
• Descriptive metadata added
• Transcribed and time-aligned (Transcriber or Elan)
• Media corpus instantiates links to media (e.g. Audiamus): concordance of texts, navigation tool
• Output to e.g. Shoebox for […]
• Texts, dictionary etc. archived with PARADISEC

          Contrast and compare

                          Previously                        Current

Data                      Analog                            Digital

[…] in material           […]                               […] interlocutors (because deposit in
clarified                                                   an archive is envisaged as part of
                                                            the […]

Filenames                 Arbitrary                         Persistent Identifiers

Data structure            No explicit structure             Explicit structure is used as the
                          (implicitly […] and styles)       basis for derived forms (e.g. as in
                                                            lexical […]

Archival accession of     After use of the material by      Incremental accession, ideally
primary data              the researcher. (Typically post   before use of the material by the
                          retirement or after death of      researcher.
                          the researcher.)

Annotation of primary     Little done, usually by hand.     More comprehensive annotation,
[…]                                                         using time-alignment and […]

Archival accession of     Typically post retirement or      Work in progress archivable and
annotations               after death of the researcher.    overwritten by subsequent versions
                                                            (safe backup)

Persistent                Maybe in fieldworker’s notes,     Assigned by archive and persistent
identification to         hampered by lack of               identifier resolved to an item in
support citation […]      discoverability.                  the archive.
forms of […]

Metadata standard         Library/MARC (large existing      DC / OLAC (support for small,
                          infrastructure)                   collector-based archives)

Metadata discovery        Library catalogs (not always      Open Archives Initiative, subject
                          […]                               specialised […]

Persistence               Analog tape in one location       Digital simulacra/copies (LOCKSS)

[…] between items         […] treated in catalog            […] instantiated where possible
                                                            (e.g. tape/transcript)

Repatriation of copies    Copies of tapes provided from     Digital copies of tape/transcript
                          a single location.                in linked form. Available for […]

• Secure longterm storage of well-described linguistic records is crucial to language […]
• Training is essential for the integration of an archival sensibility into a linguist’s fieldwork
• It is up to the fieldworker to produce archival material from their fieldwork.

