Archiving and the work flow of field work

Nicholas Thieberger, Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)
Department of Linguistics & Applied Linguistics, The University of Melbourne Vic 3010, Australia
LSA Archiving tutorial, January 2005

Language archiving is an integral part of language documentation. The documents linguists produce are meant to endure and to be available to the people we record and their communities, as well as to fellow researchers, well into the future and, we hope, for ever. Archiving is no longer something we do at the end of our fieldwork; it is now apparent that it can be integrated into everyday language documentation work and that it is a crucial aspect of documentary linguistics. We have learned to separate form and content in the representation of linguistic data, and recent technological advances have pointed to the importance of planning data management and workflow for ethnographic recording, which in turn has facilitated an expansion in documentary linguistics and archiving.

Recordings should always be of high quality, but it is in the context of small and endangered cultures and languages that the quality of recording takes on new significance (quality here refers both to the content and to the form of the recording). If we are the only recorders of the last remaining speakers or performers then, right from the moment of recording, we must be concerned with making good documents which will be placed into a suitable archive for storage and discovery. Thus we can distinguish archival practice, which will be the main focus of this paper, from archival storage in a repository.
I discuss a workflow that builds in the development of archival data, and show that making the initial recordings and their digital representation citable by means of a persistent identifier allows further work to be located with reference to that primary data. Typically this further work involves annotation of the data and the construction of dictionaries, and in all such derived material the content is plain text structured to allow it to endure into the future. Further description of the data with standard metadata terms allows its discovery in the long term. All of this facilitates repatriation of the data to the communities from which it originates, as they are able to locate the data once it has been archived.

Archives have an image of being repositories of old stuff. Usually old stuff that comes from old people. And in our case it is old stuff from old people on old languages. I asked a colleague if he was considering depositing with our archive and he said "Did I look as if I was going to die any minute when you last saw me?" For him, as for many people it seems, archiving is something done at the end of one's career, when there is time to go back, fill in gaps and make the whole data more presentable. On this view, boxes of stuff can be delivered to the archive sometime after the linguist has finished with them and will then be held in perpetuity. The recent focus of linguistic archives, informed by the discussion of language documentation, is that the stuff deposited must be of sufficient quality and sufficiently well described that it can be useful into the future.

Language Archives
• An integral part of language documentation
• The locus for supporting documentary linguistic activity
• Ark - Hive (David Nathan, ELAR)
• Need to develop archival methods for linguistic fieldworkers

Producing archival material is something we, as ordinary working linguists (OWLs), should do all the time. Further, the possibilities provided by new technologies allow us to incorporate archival issues into our everyday practice, to the benefit both of our analysis and of the use of our recordings and intellectual outputs. Current archives are providing training and advice in response to the need for such a service in our community, that is, the community of documentary linguists. These archives are primarily trusted long-term repositories that take well-structured data and provide the infrastructure for securely holding and locating it over time. An archive is also the point of reference for a network of practitioners who want advice on how to proceed. It is the archive’s role to agree on the standards that seem most appropriate and to assist in their adoption by the broader community. My observation of our own and other such archives suggests we are all acting as a locus for documentary activity, and as proponents of new methods - what has been called an ‘ark-hive’ by David Nathan of ELAR.

As none of us has the resources to edit items in our collections, we rely on the depositors to produce material that is well-formed from an archival point of view. Such data has an explicit structure, encoded, for example, by tags (as in a Shoebox lexical file) or by stand-off markup (as in time-aligned transcripts). It is also provided in a non-proprietary form that can be read on any platform.
The fact that the best current working tools for transcription and time-alignment are coming out of this same effort, for example IMDI and Elan from the MPI in Nijmegen, or Transcriber from LIMSI (with strong support from OLAC via the Linguistic Data Consortium), indicates that archives are central to the promotion of new technologies as a means of ensuring that normal linguistic fieldwork will result in the best possible archival form.

Perspective of PARADISEC

This paper is written from the perspective of PARADISEC, a young digital archive based in virtual space between Sydney, Melbourne and Canberra in Australia. PARADISEC was established 18 months ago by a group of linguists and musicologists concerned at the lack of a repository for material recorded outside of Australia by Australian researchers. For those working with Indigenous Australian languages there is a national archive, the Australian Institute for Aboriginal and Torres Strait Islander Studies (AIATSIS), which has been operating since the 1960s. National Australian cultural institutions like the National Library and the National Film and Sound Archive do not have a mandate to keep field recordings from outside of Australia. We were especially concerned about audiotapes recorded since the 1950s that were not being stored in any suitable repository and were physically deteriorating. Thus the initial focus was on the preservation of existing, so-called ‘legacy’ material, and we have so far digitised some 660 hours, or 1.1 terabytes, of data. However, once we started processing these tapes, it was clear that there was a huge demand from current researchers wanting to work with their data in a digital form and wanting a high-quality archival representation of their media before they conducted most of their analysis.
The coercive archive
• Attached to a funding body
• Obligatory deposit of recorded material
• Enforces data formats by contractual requirement

At this point it is useful to distinguish two kinds of existing archives, which I will characterise as coercive and non-coercive. The coercive archive is part of a funding agency and so has some means and ability to enforce standards on depositors, as is the case with ELDP/ELAR or DOBES. When grants are provided for language documentation, the form of the recordings and their associated descriptive and analytical apparatus can be prescribed by these funding bodies, who have also been providing training in the use of these methods. Further, the funding body can contractually bind the recipient to lodge this material with the archive. These archives typically house newly recorded material, often recorded in a digital form and so not requiring conversion before being archived.

PARADISEC is currently not in a position to fund researchers, and so the appeal to depositors has to be pitched differently. We encourage practitioners (whom we take to include mainly linguists, musicologists, and indigenous language workers) to deposit media material by ensuring that they will have a high-quality digital version of their data in the short term. If an archival form of the file is created first and is then used as the basis for the subsequent effort of transcription and time-aligning, the resulting work has a citable source that should persist into the future. We have been encouraging postgraduate students to lodge their tapes with PARADISEC as soon as they return from fieldwork. We digitise or capture their data and provide both an archival (usually 96 kHz/24-bit BWF) and a representational (linear MP3) copy with its persistent identifier in our collection.
This gives them a digital file to work with, but more importantly it gives them a citable form of archival data with persistent identification. Their intellectual effort of annotating this primary data can then build on a firm foundation, both for their own immediate goal (typically a dissertation) and for the long-term need to have richly annotated primary data safely archived.

I learned the hard way that using a non-archival digital file as the basis for transcription and analysis results in a mismatch in timecodes when an archival file is later produced. I had digitised my analog field cassettes myself in 1998 by connecting a tape player to a computer and producing fairly poor digital copies. I then annotated these using the program SoundIndex from LACITO as part of my documentation of the Oceanic language South Efate. A few years later the tapes were digitised at a higher resolution, and the timecodes in my earlier versions did not align in a simple way with those of the new files. I transcribed around eighteen hours of audio altogether, and I was under some time constraint as the work was to result in a dissertation in the form of a documentary grammar with a time-aligned media corpus. The non-archival forms of transcript may be corrected one day, but the clear lesson is that creation of an archival form of digital data is best done before the analysis begins.

At PARADISEC we also spend considerable time with many old tapes, preparing them for data transfer by cleaning and, in some cases, baking them under vacuum. We assign persistent identification and create an enduring citable form of the data as part of the archival accession, and we run training workshops of half a day to several days’ duration on the use of software tools and on data management.
We use these as a means of advocating a workflow for language documentation that builds archiving into the normal everyday work of the OWL, rather than being an onerous addition or a task left until the weight of the cumulative research effort becomes unbearable at the end of a researcher’s working life. The paucity of material related to many Australian indigenous languages is a great motivator for the current generation of researchers to ensure that the records we leave behind will be of more use than those we have had to work with.

Both coercive and non-coercive archives rely on the relationships they have established with their communities, both depositors and users. In general, the benefits of depositing are clear, in particular as we are digitising analog tapes and holding copies at no cost for members of our consortium. The ability to be ‘trusted’, as a repository should be, arises from a number of factors, but a key for us has been the ability to provide advice and training to ensure the quality, both technical and in content, of recordings and associated derived material (transcripts, glosses, dictionaries etc.). The rationale is that if we want high-quality recordings and well-structured archival data, then we have to provide training in their creation. We run workshops in using Shoebox, still the only tool that creates structured lexical files, and, as wonderful tools like Transcriber and Elan are produced by our colleagues in Europe, we introduce them to a community of users in our region at occasional workshops, both in our universities and in community-based language centres.
Resource Network for Linguistic Diversity
http://www.linguistics.unimelb.edu.au/rnld.html

To assist in training, we have cooperated in the establishment of a network for providing support to language workers, called the Resource Network for Linguistic Diversity. We use the mailing list associated with this network to discuss emerging methods and tools, and we also provide a FAQ page and an archive of the list discussion (kindly provided by LinguistList).

Deposit of data in an archive presupposes that the depositor has sought and received permission from their interlocutors. Ideally a written consent form, provided by the researcher and signed by the speaker, would specify the uses to which the recordings could be put. Each item in the archive is accompanied by a deposit form, filled out by the depositor or their executor, that outlines conditions on use of the material.

The ability to enforce standards on depositors extends to the description of the data, or the metadata that allows the data to be discovered. Again, a coercive funding agency can insist on highly detailed metadata descriptions, as we see with the fine-grained IMDI metadata set. For legacy data, which PARADISEC mainly deals with, the quality of metadata can be quite variable, often no more than a few lines on a tape box, together with contextual information about the collection from which the item will be identified. At PARADISEC we use a cataloguing system that provides a description of the item as well as of the process it undergoes from accession. All of this metadata can be output in various forms, one of which is the OLAC metadata set. We would like to take this opportunity to thank OLAC for the work it has put into developing a metadata system and for the support we have received in establishing a static metadata repository that we update periodically from our catalog.
Exporting to OLAC metadata has increased the visibility, and so the discoverability, of the material in our collection, and its ease of use has meant that we were able to move our metadata system from nothing to an Open Archives Initiative conformant metadata repository in a few months.

Filenaming conventions
Collection-item-file.extension
AB1-002-A.wav
AB1/AB1-002/AB1-002-A.wav
AB1/AB1-002/AB1-002-A.mp3
AB1/AB1-002/AB1-002-A.xml
AB1/AB1-002/AB1-002-FN1.tif
AB1/AB1-002/AB1-002-FN1.jpg

We also provide users with a spreadsheet with our metadata headings and are hoping to implement a web entry form for users in the next year. We encourage users to develop a persistent naming convention using fairly standard ASCII characters and to avoid unnecessarily long names. If we can then take the user’s names for their own files and incorporate them into our persistent identification, it becomes much easier to keep track of the relationships between the notes and the media files. Our persistent filenames follow the directory structure of the mass storage system on which the files will reside, and are composed of a collection identifier, followed by an item identifier and then a specific local identifier (like ‘A’ or ‘B’ for the side of a tape). These are then followed by a three-letter extension indicating the filetype.

Working with legacy material means that we see what small additional steps a researcher could have taken to make their recordings more useful. Obviously collections vary greatly in the accompanying documentation. In some cases there is no specific information about the tapes we have located in a box or filing cabinet, and, while there may be accompanying fieldnotes, we do not have the time or the personnel to work through fieldnotes and to establish their relationships to field recordings.
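
The filenaming convention described above (collection identifier, item identifier, local identifier, three-character extension) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression is inferred from the example names on the slide, and the function names parse_name and storage_path are hypothetical, not part of any PARADISEC software.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern inferred from the examples in the text (e.g. AB1-002-A.wav);
# it is an assumption, not PARADISEC's actual validation rule.
NAME_RE = re.compile(
    r"^(?P<collection>[A-Za-z0-9]+)-(?P<item>[A-Za-z0-9]+)"
    r"-(?P<local>[A-Za-z0-9]+)\.(?P<ext>[A-Za-z0-9]{3})$"
)


def parse_name(filename):
    """Split 'AB1-002-A.wav' into collection, item, local identifier and extension."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a collection-item-file.extension name: {filename!r}")
    return match.groupdict()


def storage_path(collection, item, local, ext):
    """Build the path collection/collection-item/collection-item-local.ext."""
    item_id = f"{collection}-{item}"
    return str(PurePosixPath(collection, item_id, f"{item_id}-{local}.{ext}"))


if __name__ == "__main__":
    parts = parse_name("AB1-002-A.wav")
    print(parts)
    print(storage_path(**parts))
```

Because the audio file, its MP3 copy and any page images share the same collection and item identifiers, files that belong together sort together and can be related by name alone.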
Simple descriptive metadata allows us and potential researchers to locate the relevant material and to reintegrate it with fieldnotes.

[Image: page image of Stephen Wurm’s fieldnotes on Aiwo, Solomon Islands, SAW2-018-00005]
<hasTranscript>/<isTranscriptOf>: proposed addition to the OLAC/Dublin Core <relation> element

Recently we have been taking representational images of transcripts found in association with tapes recorded thirty years ago. These are typescript or handwritten manuscripts that belong together with tape recordings. We give the images the same name as the tape, differing in the extension (.wav for the audio and .jpg for the image). Furthermore, the metadata notes the relation between these two types of information (using the element <relation>, for which we propose the additional refinement of <isTranscriptOf>/<isTranscribedBy>).

Everyday archiving
Creation of optimal archival forms of data through normal linguistic practice, e.g.
• Well-structured data (e.g. backslash codes, XML)
• Annotation with time-alignment
• Citation of archival data, which implies
  - persistent identification and location
  - interactive use of archival data
• Tracking relationships within the data
  - speaker
  - tape
  - transcript
  - text
• Consent/deposit forms clarify intellectual property issues

Example Workflow
[Diagram; recoverable labels, in approximate order:]
• Analogue tape digitised / digital file captured
• Archival digital file archived with PARADISEC, descriptive metadata added
• Transcribed and linked (using e.g. Transcriber or Elan)
• Concordance of texts, navigation tool; media corpus instantiates links to media (e.g. Audiamus)
• Output to e.g. Shoebox for interlinearising
• Texts, dictionary etc. archived with PARADISEC

Contrast and compare

Aspect | Previously | Current
Data | Analog | Digital
Copyright in material | Rarely clarified | Consent forms signed by interlocutors (because deposit in an archive is envisaged as part of the process)
Filenames | Arbitrary | Persistent identifiers
Data structure | No explicit structure (implicitly marked by fonts and styles) | Explicit structure is used as the basis for derived forms (e.g. as in lexical files in Shoebox)
Archival accession of primary data | After use of the material by the researcher (typically post retirement or after death of the researcher) | Incremental accession, ideally before use of the material by the researcher
Annotation of primary media | Little done, usually by hand | More comprehensive annotation, using time-alignment and interlinearising
Archival accession of annotations | Typically post retirement or after death of the researcher | Work in progress archivable and overwritten by subsequent versions (safe backup)
Persistent identification to support citation forms of data | Maybe in fieldworker’s notes, hampered by lack of discoverability | Assigned by archive, and persistent identifier resolved to an item in the archive
Metadata standard | Library/MARC (large existing infrastructure) | DC / OLAC (support for small, collector-based archives)
Metadata discovery | Library catalogs (not always interoperable) | Open Archives Initiative, subject specialised searches
Persistence of data | Analog tape in one location | Digital simulacra/copies (LOCKSS)
Relation between items | Ignored or treated in catalog | Treated in metadata and instantiated where possible (e.g. tape/transcript)
Repatriation of copies | Copies of tapes provided from a single location | Digital copies of tape/transcript in linked form, available for download from the web

Conclusion
• Secure long-term storage of well-described linguistic records is crucial to language documentation.
• Archives do not have the resources to prepare data.
• Training is essential for the integration of an archival sensibility into a linguist’s fieldwork methods.
• It is up to the fieldworker to produce archival material from their fieldwork.
"Archiving in everyday practice"