    Exploring and Enriching a LR Archive
    via the Web

                       Marc Kemps-Snijders,
                       Alex Klassmann, Claus Zinn,
                       Peter Berck, Albert Russel,
                       Peter Wittenburg
                       MPI for Psycholinguistics
                       DOBES Endangered Languages Project



    What is a digital archive?


    Two essential dimensions
       • Long-term Preservation of all resources and relations
       • Accessibility and Interpretability
Why preserve?
• We face the loss of our cultural memory held on electronic media
       UNESCO: 80% of the recordings about languages and cultures
                are highly endangered

There are no guarantees of preservation, but we can increase the chances of survival:
      • store everything in a well-organized repository
           (browsable/searchable)
      • take care of redundancy, migration and curation on various dimensions
      • establish organizations that take responsibility


          Digital Archives are living Entities!
Live Archives concept: allow enrichments (stand-off), relations, etc.
     
     
     
         What is in MPI’s archive?

• Endangered Language Documentation resources
    –    Representative record of a language in its cultural context
    –    Crucial is the active involvement of the community
    –    May help in maintaining and revitalizing languages
    –    Therefore: trend towards complementing linguistic information with
         ontological information in collaborative spaces

• Child language, bilingualism, gesture, sign language, the Spoken
  Dutch Corpus, sound corpora, second-language learner corpora, etc.

Mostly annotated audio/video recordings
   30 terabytes, 53,000 AV resources, 24,000 annotation files,
   60 million annotations, lexicons, sketch grammars, etc.


All from a large number of depositors
 
 
 
     DOBES Languages




40 language teams from the DOBES program documenting about
 60 languages and working independently
      
      
      
          Language Archiving Technology (LAT)

LAT supports operations throughout the resource life-time (preparation,
integration, utilization) and supports standards where possible.

• ELAN/LEXUS/SYNPATHY: Annotation + Lexicon
• Shoebox, CHAT, Transcriber, XML
• IMDI: Data Organization, Metadata
• LAMUS: Data Uploading and Management
• Access Management
• Data Archiving and Copying
• Archive Grid Federation
• IMDI / GIS: Metadata Browsing & Searching
• ANNEX/LEXUS/IMEX/TROVA: Complex Access via Web
• ADDIT/VICOS/MEL: Enrichments/Views
• ODIT/ISOcat: Ontology management framework
        
        
        
             LAT Dimensions: Management & Upload

Repository system (resources, metadata):
• take care of consistency
• check uploaded formats
• convert where possible
• create presentation formats
• create indexes
• allow access rights definition
• add unique & persistent IDs
• take care of distribution
• basis is a robust repository system with reliable mechanisms

[Figure: metadata editing]
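
The upload steps listed above (check the uploaded format, add a unique ID, keep the basis for later consistency checks) might look roughly like the sketch below. It is not the LAMUS implementation: the accepted suffix list, the `hypothetical-pid:` prefix and the `ingest` function are assumptions made for illustration only.

```python
import hashlib
import uuid
from dataclasses import dataclass
from pathlib import PurePath

# Hypothetical whitelist; the archive's actual list of accepted
# formats is not given in the slides.
ACCEPTED_SUFFIXES = {".eaf", ".imdi", ".wav", ".mpg", ".xml"}


@dataclass(frozen=True)
class IngestRecord:
    name: str
    persistent_id: str  # placeholder scheme, not a real PID service
    sha256: str         # checksum for later consistency checks


def ingest(name: str, content: bytes) -> IngestRecord:
    """Check the format, then attach a unique ID and a checksum."""
    suffix = PurePath(name).suffix.lower()
    if suffix not in ACCEPTED_SUFFIXES:
        raise ValueError(f"unsupported format: {suffix or '(none)'}")
    checksum = hashlib.sha256(content).hexdigest()
    pid = f"hypothetical-pid:{uuid.uuid4()}"
    return IngestRecord(name, pid, checksum)
```

Storing the checksum next to the ID means every distributed copy can later be verified against the record created at upload time.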



    LAT Dimensions: Complex Access

• access to annotated media or multimedia lexica
• callable via any other web application
     
     
     
         LAT Dimensions: Customized views

• fostering the creation of special websites through REST interfaces and templates
• fostering GIS presentations through special converters
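
To illustrate how a REST interface and a template can combine into a customized view, here is a minimal sketch. The base URL, the query parameters and the metadata fields are all hypothetical. The slides do not describe the actual LAT REST interface, so nothing here should be read as its API.

```python
import html
from string import Template
from urllib.parse import urlencode

# Hypothetical endpoint; not the real LAT REST interface.
BASE_URL = "https://archive.example.org/rest/resources"

PAGE = Template(
    "<h1>$title</h1>\n"
    "<p>Language: $language</p>\n"
    '<a href="$url">Open resource</a>'
)


def resource_url(resource_id: str, fmt: str = "json") -> str:
    """Build the (assumed) REST URL for one archived resource."""
    return f"{BASE_URL}?{urlencode({'id': resource_id, 'format': fmt})}"


def render(metadata: dict) -> str:
    """Fill the page template with escaped metadata values."""
    return PAGE.substitute(
        title=html.escape(metadata["title"]),
        language=html.escape(metadata["language"]),
        url=html.escape(resource_url(metadata["id"])),
    )
```

A community website would then only need its own template; the archive stays the single source of the data.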



    Who are our users?

     Stakeholder       Interest
     archivists        easy management, easy discovery, consistency,
                       statistics, versioning, ...
     researchers       easy visualization, easy discovery, virtual
                       collections, extensions, permissions, ...
     communities       semantic exploration, extensions, permissions, ...
     journalists       appetizers, easy inspection, ...
     students          curiosity, navigation, inspection, ...



      Users are still in a download-first paradigm, not cyberinfrastructure usage
          (result of an ESF/NSF workshop)
  
  
  
      ‘Download first…’ problems and disadvantages

• Tool and format updates are propagated to users at a slow rate
    – 'Legacy' formats offered to archives pose an increasing
      burden on archives or tool builders (conversion/migration)
    – New techniques spread slowly through the community

• Orchestration between tools becomes much more difficult if not
  impossible
• Users need to install tools locally




        Can we provide more incentives on the tools side?
      
      
      
          How to extend LAT?

• Paper dictionaries are of limited use in language maintenance and
   language revival (Manning et al., 2000)
• “Linear” lexicons are of no interest except to linguists
• Speech communities may prefer explicit semantic access and links, possibly
   of a wide variety of types (i.e. beyond formal systems)
• The semantic view is not limited to lexicons, but should include all fragments

Therefore: introduce conceptual spaces, in which concepts are related to
 one another, anchored in the language and illustrated with multimedia

Extension of LAT with ADDIT and VICOS towards a cyberspace paradigm
        ADDIT:
          relations between arbitrary fragments
        VICOS (Visualizing Conceptual Spaces):
          relations within and across lexicons
          and easy visualization

Make VICOS a collaborative tool
      
      
      
          ADDIT: Commentary & Relations




• allow authorized people to make arbitrary comments on and relations between
  object fragments
• visualize them in tools and via VICOS
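
One way to picture stand-off comments and relations between arbitrary fragments is the data-structure sketch below. It is not ADDIT's actual data model: the `Fragment`, `Relation` and `RelationStore` names, the millisecond offsets and the author-based authorization check are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A time span inside an archived resource (hypothetical model)."""
    resource_id: str
    start_ms: int
    end_ms: int


@dataclass(frozen=True)
class Relation:
    """A typed, commented link stored apart from the resources themselves."""
    source: Fragment
    target: Fragment
    kind: str
    comment: str
    author: str


class RelationStore:
    def __init__(self, authorized_authors):
        self._authorized = set(authorized_authors)
        self._relations = []

    def add(self, relation: Relation) -> None:
        # Only authorized people may add comments and relations.
        if relation.author not in self._authorized:
            raise PermissionError(f"{relation.author} may not add relations")
        self._relations.append(relation)

    def touching(self, resource_id: str):
        """Return all relations whose source or target lies in the resource."""
        return [
            r for r in self._relations
            if resource_id in (r.source.resource_id, r.target.resource_id)
        ]
```

Because the relations live outside the resources, the archived files stay unchanged while the enrichment layer grows.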
  
  
  
      VICOS: Lexical relations & navigation

• Allow users to create relations within and across lexicons
   – across lexicons: cognate sets, etc.
• Visualize and allow easy navigation in conceptual spaces
• Empower community members to actively describe their language and
  culture and to learn from such resources
   – Decide which words offer key access to cultural concepts
   – Technology is needed to link words (and the associations they
     evoke) to other words and to all sorts of relevant fragments
• Conceptual spaces = informal ontology of fuzzily defined
  concepts and relationships
   – but the “concepts” are anchored in corresponding
     formal lexicon entries
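
A conceptual space as described above can be sketched as a small labelled graph in which every concept must be anchored in a lexicon entry. This is not VICOS code: the `ConceptualSpace` class, its method names and the entry IDs in the usage note are hypothetical.

```python
from collections import defaultdict, deque


class ConceptualSpace:
    """Informal ontology: free-form relations between anchored concepts."""

    def __init__(self):
        self._anchors = {}              # concept -> lexicon entry ID
        self._edges = defaultdict(set)  # concept -> {(label, concept)}

    def add_concept(self, concept: str, lexicon_entry_id: str) -> None:
        self._anchors[concept] = lexicon_entry_id

    def relate(self, a: str, b: str, label: str) -> None:
        # Relations are only allowed between anchored concepts.
        for concept in (a, b):
            if concept not in self._anchors:
                raise KeyError(f"concept not anchored in a lexicon: {concept}")
        self._edges[a].add((label, b))
        self._edges[b].add((label, a))

    def neighbourhood(self, start: str, max_hops: int) -> dict:
        """Breadth-first navigation: concept -> hop distance from start."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if seen[current] == max_hops:
                continue
            for _, other in self._edges[current]:
                if other not in seen:
                    seen[other] = seen[current] + 1
                    queue.append(other)
        return seen
```

For example, after anchoring "canoe", "paddle" and "river" in entries `lex:1` to `lex:3` and relating canoe to paddle and paddle to river, `neighbourhood("canoe", 1)` reaches only "paddle", while two hops also reach "river".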
    
    
    
        Team and Acknowledgements



                                 LAT Team
                                 • System Managers
                                 • Archive Managers & Digitization
                                 • Software Developers




Acknowledgements
The work was funded by the Volkswagen Foundation, the European
Commission, the Dutch Science Organization, the Dutch Institute for
Lexicology, the Max Planck Society and the Max Planck Institute for
Psycholinguistics.