Daan_Broeder_EPIC_20110412

      History of PIDs at the MPI for Psycholinguistics

                        Daan Broeder
The Language Archive – Max Planck Institute for Psycholinguistics
                 Nijmegen, The Netherlands
          MPI/TLA Archive

• MPI for Psycholinguistics research corpora: Child language
  acquisition, bilingualism, gesture, sign language, Corpus Spoken
  Dutch, second learner corpora, etc.
• Archive for the DoBeS project
• Hosting (and inviting) corpora from other projects in need
  (UNESCO study: 80% of all material is endangered)
   – DBD, NGT, Leiden Univ. language documentation corpora
   – Eibl-Eibesfeldt human ethology collection
• Mirror important corpora such as TalkBank and CHILDES
• Maintain a metadata catalog for IMDI- and OLAC-described
  resources from other institutes
   – BAS, C-ORAL-ROM (Univ. Florence), …
   – LR from Lund Univ, INL, other archive partners

  100 TB data: 180K metadata records, >500K AV resources,
  annotation files, lexicons, sketch grammars, etc.
        DAM-LR infrastructure

Distributed Access Management for Language
Resources - EU project 2005-2007
Establish a federation of LR archives
• Four LR archives: INL, Lund University, MPI-PL, SOAS
• Single metadata domain
   – IMDI metadata set
• Single domain of AAI
   – DAM-LR federation
• Single system for PIDs (URIDs)
   – Handle System

Purpose: investigate use of the HS for
• Location-independent resource identifiers
• Possible sharing of a single LHS by all
• Resource replication
• Potential use for user authentication & group identification
  record administration
• Potential use for authorization record administration
• Potential use for versioning

DAM-LR conclusions:
• Reliable and scalable technology exists for the infrastructure aspects
   – AAI -> SAML based IDF (Shibboleth)
   – PIDs -> Handle System
• However, installing and maintaining it is not within reach of every
  organization

[Figure: resources (R) held as primary and secondary (sec.) copies across
the MPI, Lund and INL archives, labelled with the handle prefixes 1839,
10050 and 10032]
          TLA Archive Management Components

[Figure: archive components]
• WWW browser: Browsing/Search/Visualization with ANNEX, LEXUS and the
  IMDI Browser (web apps behind an HTTP server)
• ARCHIVE: metadata, annotations, media files
• AMS
• Local tools working on LOCAL DATA: ELAN, ARBIL, Toolbox
   – resource download
   – resource upload via LAMUS (typechecking!)
• All resources directly accessible by HTTP if authorized
    Handles at MPI/TLA


[Figure: PID visibility. Components shown: LAMUS & LAT with items labelled
M and R, crawler, ARCHV DB, HS synch, HS DB, HS Resolver]
      CLARIN

  Common Language Resources and Technology Infrastructure
  CLARIN is an (ESFRI) EU RI project with 4.2 ME funding for
  a 3 year preparatory phase (2008-2010) with further
  funding from national CLARIN projects, NL: 9 ME, DE: 10
  ME, DK, CZ, A, FR, …
• Goals not much different from DAM-LR, but more
  emphasis on organizational, legal, sustainability issues
• Builds on DAM-LR experience but much bigger
   – 25 candidate CLARIN centers EU wide, >> users
• Users are not only linguists but wider SSH researchers
• MPI-PL is responsible for the technical infrastructure in the EU and
  is a CLARIN center in CLARIN NL
• For PIDs, CLARIN recommends EPIC
        Desiderata from CLARIN


• (Political) Independence: European GHR mirror
  & proxy + no single point of failure
• Wide(r) acceptance of PID scheme (W3C?)
• Support for object part addressing, from ISO
  TC37/SC4 CITER work.
• Support for (secure) management of resource
  copies (replication)



2008 CNRI Handle System Workshop
      CMDI Virtual Collection Registry


The CLARIN project builds the CMDI infrastructure to arrive at a
common domain for language resources and services
• Much scientific work depends on seemingly accidental,
  distributed collections of (parts of) resources that have no
  independent embodiment.
• These Virtual Collections (VC) exist only as metadata
  records.
• A VC needs to be citable with one single PID
• Prototype VC registry is available
      ISO standardization efforts


Work done within ISO TC37/SC4
Try to give requirements for PID frameworks used for
identifiers with LR and LT
• ISO 24619 CitER started in 2007
• ISO 24619 PISA 2010 went FDIS in 2010
• States requirements for associating multiple URIs,
   checksums, and object part identifiers

Major result: the specification of part identifiers to refer
to parts of identified resources. Subsequent discussions with CNRI
resulted in the HS’s template handles.
             PIDs and resource parts

• Wasteful to issue a PID for each part (think of 100k entries in a
  lexicon). So use part identifiers.
• Resolver can make an adequate translation
  “A#z” -> “objectA?part=z”. This requires enough flexibility from the
  resolver to accommodate the object server.
• The syntax of “z” should be standard for the specific data type.
  Borrow from existing fragment identifier syntax standards.

[Figure: object A with parts x, y and z. Instead of separate handles
1839/A, 1839/x, 1839/y, 1839/z, use one handle 1839/A plus the part
identifiers 1839/A#x, 1839/A#y, 1839/A#z. The PID resolver maps
1839/A to http://oserver/objectA and 1839/A#z to
http://oserver/objectA?part=z on the object server.]
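
The “A#z” -> “objectA?part=z” translation on this slide could be sketched roughly as follows. This is only an illustration, not the Handle System's template handle mechanism: the function name, the in-memory handle table and the `?part=` query convention are assumptions taken from the slide's example.

```python
from urllib.parse import urlencode

# Hypothetical handle table: one PID per object, none per part.
HANDLES = {"1839/A": "http://oserver/objectA"}


def resolve_part(pid: str, handle_table: dict) -> str:
    """Resolve 'prefix/suffix' or 'prefix/suffix#part' to an object-server URL."""
    handle, sep, part = pid.partition("#")
    base_url = handle_table[handle]  # raises KeyError for an unknown handle
    if not sep:
        return base_url
    # Assumed object-server convention from the slide: ?part=<part id>
    return f"{base_url}?{urlencode({'part': part})}"


whole = resolve_part("1839/A", HANDLES)
part_z = resolve_part("1839/A#z", HANDLES)
```

Only the object's handle is stored; the part identifier is passed through to the object server, which has to understand the part syntax for its data type.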
             Data replication in DAM-LR

Problem: handle record access granularity
• What if MPI moves the resource copy?
• MPI should have write access to the Lund Handle record
• This would enable changing the Lund URL record too!

Handle record:
   10050/R -> http://lund/lund_url
           -> http://mpi/mpi_url

Use resource monitor?

[Figure: resource R, primary copy at the Lund archive (10050), copy at
the MPI archive (1839), which MPI may move; LHS access monitor; MPI
Manager]
Data replication needs

• since 2004 an LTP strategy in Max Planck Society
• yet no systematic European solution !!
• yet no safe and rule-based replication !!
• using EPIC services and iRODS
• in addition 13 regional archives worldwide to help human heritage
  to survive (10 requests)

[Figure label: PID system]
         Data Replication in REPLIX project

Replix: small project between CLARIN (MPI-PL) and DEISA
(RZG) investigating the use of iRODS as a data replication layer
for the LAT/LAMUS archiving software

• File copy. Export access rights, import access rights
• Synchronize DBs at receiver side
• Copy new locations back to originator, originator
  adapts handle DB

[Figure: MPI and RZG each run LAMUS & LAT with resources (R); iRODS
connects the two sites; HS DB and HS Resolver]
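
The three replication steps listed on this slide could be modelled as below. This is a toy in-memory sketch, not the Replix implementation: in the project the copying is done by iRODS, and the `Archive` class, the `replicate` function and all paths and URLs here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    name: str
    base_url: str
    files: dict = field(default_factory=dict)   # path -> content
    rights: dict = field(default_factory=dict)  # path -> set of user names
    db: dict = field(default_factory=dict)      # PID -> local path


def replicate(pid, origin, receiver, handle_db):
    """Replicate one resource and register its new location under the PID."""
    path = origin.db[pid]
    # 1. File copy; export access rights at the origin, import at the receiver.
    receiver.files[path] = origin.files[path]
    receiver.rights[path] = set(origin.rights.get(path, ()))
    # 2. Synchronize the DB at the receiver side.
    receiver.db[pid] = path
    # 3. Copy the new location back; the originator adapts its handle DB.
    new_url = f"{receiver.base_url}/{path}"
    handle_db.setdefault(pid, []).append(new_url)
    return new_url


mpi = Archive("MPI", "http://mpi")
rzg = Archive("RZG", "http://rzg")
mpi.files["corpus/r1.wav"] = b"audio bytes"
mpi.rights["corpus/r1.wav"] = {"alice"}
mpi.db["1839/R1"] = "corpus/r1.wav"
handle_db = {"1839/R1": ["http://mpi/corpus/r1.wav"]}

new_url = replicate("1839/R1", mpi, rzg, handle_db)
```

After the call, the handle DB at the originator lists both locations for the same PID, which is the state the last bullet describes.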
      The future of PIDs at MPI/TLA


Plan: use the EPIC PID service as soon as a suitable API is
available and requirements are met
• If a good alternative exists, (smaller) archives should
   not run their own LHS
• Want to keep using our own prefix (of course)
• There should be no lock-in; we need to be able to leave
• Associate extra information with the PID:
   – multiple URIs, MD5 checksums, metadata record pointer, access
     rights
• Allow secure, separate administration of the different URIs
  associated with the PID
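
The last two requirements can be illustrated with a small data model. It is a sketch under stated assumptions rather than the EPIC API or a Handle System record format: the `PidRecord` class, its field names and the metadata URL are hypothetical, and access rights are left out for brevity.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PidRecord:
    pid: str
    md5: str            # checksum of the resource content
    metadata_url: str   # pointer to the metadata record
    urls: dict = field(default_factory=dict)  # administering site -> URL

    def set_url(self, site, acting_site, url):
        """Each site may change only the URL entry it administers."""
        if site != acting_site:
            raise PermissionError(f"{acting_site} may not edit the {site} URL")
        self.urls[site] = url

    def verify(self, content):
        """Check a retrieved copy against the stored MD5 checksum."""
        return hashlib.md5(content).hexdigest() == self.md5


content = b"example resource"
record = PidRecord(
    pid="10050/R",
    md5=hashlib.md5(content).hexdigest(),
    metadata_url="http://lund/lund_md",  # hypothetical metadata location
    urls={"lund": "http://lund/lund_url"},
)
record.set_url("mpi", acting_site="mpi", url="http://mpi/mpi_url")
```

With per-URI administration, MPI can update the location of its own copy without being able to touch the Lund URL in the same record.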
Thank You for Your Attention
  Resource duplicates

Handle records:
   10050/R   -> http://lund/lund_url
             -> hdl:1839/Rcpy
   1839/Rcpy -> http://mpi/mpi_url

Indirect handles*
• TYPE = URL
   – IE-Plugin: ok.
   – HS proxy: not ok
• TYPE = HS_ALIAS (problem*)
   – IE-Plugin: ok.
   – HS proxy: ok
• Status of 1839/Rcpy handle?
   – Use in documents?

[Figure: resource R, primary copy at the Lund archive (10050), copy at
the MPI archive (1839), which MPI may move; MPI Manager]
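
A client that follows such indirect handles might look like the sketch below. It is a simplification for illustration only and does not reproduce how the Handle System proxy or plug-in treat HS_ALIAS values: it collects every URL reachable from a handle, treating both `hdl:` URLs and HS_ALIAS values as pointers to another handle. The function name and record layout are assumptions.

```python
def resolve_with_aliases(handle, records, max_hops=5):
    """Collect all URLs reachable from a handle via indirect handles."""
    urls, seen, queue = [], set(), [(handle, 0)]
    while queue:
        current, hops = queue.pop(0)
        if current in seen or hops > max_hops:
            continue  # guard against alias loops and overly long chains
        seen.add(current)
        for value_type, data in records.get(current, []):
            if value_type == "HS_ALIAS":
                queue.append((data, hops + 1))
            elif value_type == "URL" and data.startswith("hdl:"):
                queue.append((data[len("hdl:"):], hops + 1))
            elif value_type == "URL":
                urls.append(data)
    return urls


# The records from the slide, using the TYPE = URL variant.
records = {
    "10050/R": [("URL", "http://lund/lund_url"), ("URL", "hdl:1839/Rcpy")],
    "1839/Rcpy": [("URL", "http://mpi/mpi_url")],
}
locations = resolve_with_aliases("10050/R", records)
```

Because the MPI location lives in its own handle 1839/Rcpy, MPI can move its copy without write access to the Lund record; the open question on the slide is whether that helper handle should ever be cited directly.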

                   Persistent Identifiers

By default we use URIs (URLs) for referencing resources. However: the
resource is moved, the host name changes or the file system changes.
• Problem for embedded references inside the repository
• But especially outside the repository
• Can be seen as an organizational problem
• But difficult to solve, hence the use of PID frameworks

PID Framework (PURL, HS, ARK, ...)
• Give every resource a unique persistent identifier: PID
• Every PID associated with one (or more) URLs

This comes at a cost:
• Added layer of infrastructure
• Must be managed
• Must run with high availability
• Must be very sure that this can be handled by our archives also in
  the long term.
• But can be used for extra services

CLARIN recommends EPIC (SARA, GWDG, CSC). The underlying technology is
the HS, which is also used by LC, DOI, …

[Figure: the client sends a PID to the resolver and receives a URI, which
it uses to access the repository; the PID <-> URI mapping has to be
maintained]
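
The indirection that a PID framework adds can be shown in a few lines. This is a generic sketch, not the interface of PURL, the HS or ARK; the class, its methods and the example URLs are hypothetical.

```python
class PidResolver:
    """Minimal PID -> URI mapping with explicit maintenance operations."""

    def __init__(self):
        self._mapping = {}  # PID -> current URI

    def register(self, pid, uri):
        if pid in self._mapping:
            raise ValueError(f"{pid} is already registered")
        self._mapping[pid] = uri

    def move(self, pid, new_uri):
        """Mapping maintenance: the resource moved, the PID stays the same."""
        if pid not in self._mapping:
            raise KeyError(pid)
        self._mapping[pid] = new_uri

    def resolve(self, pid):
        return self._mapping[pid]


resolver = PidResolver()
resolver.register("1839/R", "http://oldhost/data/r.wav")
citation = "1839/R"  # what documents inside and outside the repository store
resolver.move("1839/R", "http://newhost/archive/r.wav")
current = resolver.resolve(citation)
```

References keep pointing at the PID, so a host or file system change requires one mapping update instead of fixing every embedded link; the cost is that the resolver itself must be managed and highly available.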
       TLA Archive Data Organization

• Archiving formats only
• Metadata in XML files
• Relations represented by URI links & PIDs in XML files
• DBs only as helpers

[Figure: the ARCHIVE as a tree. Nodes labelled C and S form the IMDI
metadata, with an example path Language, Expedition, Age Group, Genre,
SessionX; below them, items labelled M and T are the resources, e.g.
MediaFile and AnnotationFile]
          TLA Archive Access

[Figure: archive components]
• WWW browser: Browsing/Search/Visualization with ANNEX, LEXUS and the
  IMDI Browser (web apps behind an HTTP server)
• ARCHIVE: metadata, annotations, media files
• AMS
• Local tools working on LOCAL DATA: ELAN, ARBIL, Toolbox
   – resource download
   – resource upload via LAMUS (typechecking!)
• All resources directly accessible by HTTP if authorized

				