        New Digital Library Possibilities Using the
        Open Archives Initiative Protocol for
        Metadata Harvesting (OAI-PMH)

                         Michael L. Nelson
                        Old Dominion University
                         Norfolk Virginia, USA

International Conference on Scientific Electronic Publishing
                  in Developing Countries
                      Valparaiso, Chile
                      October 2, 2002

             Several Slides Also from Van de Sompel & Warner
        Random Thoughts
1. Thanks to the Organizing Committee for
   inviting me
2. I wish I had paid attention to my
   Spanish classes in school
3. Publishers & Editors: if you want
   increased coverage, exposure and
   readership, you must “do” OAI…

• OAI-PMH history and technical highlights
    – a full technical review is out of the scope of
      this presentation
•   Example data provider use
•   Example service provider uses
•   Implications for authors and editors
•   Looking to the future
         Open Archives Initiative

• The protocol is openly documented, and metadata is “exposed” to at least
  some peer group (note: rights management can still apply!)
• Archive defined as a “collection of stuff”, not the archivist’s
  definition of “archive”. “Repository” used in most OAI documents.
• OAI is happening at break-neck speed...
       The Rise and Fall of
      Distributed Searching
• wholesale distributed searching, popular at
  the time, is attractive in theory but
  troublesome in practice
  – Davis & Lagoze, JASIS 51(3), pp. 273-80
  – Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still
  viable, but only for small values of N
      • NCSTRL: N > 100; bad
      • NTRS/NIX: N<=20; ok (but could be better)
        The Rise and Fall of
       Distributed Searching
• Other problems of distributed searching          (from STARTS)

   – source-metadata problem
       • how do you know which nodes to search?
   – query-language problem
       • syntax varies and drifts over time between the various nodes
   – rank-merging problem
       • how do you meaningfully merge multiple result sets?
• Temptations:
   – centralize all functions
       • “everything will be done at X”
   – standardize on a single product
       • “everyone will use system Y”
               Santa Fe Convention [02/2000]

• goal: optimize discovery of e-prints

• input:
   • the UPS prototype

   • RePEc / SODA “data provider / service provider model”
   • Dienst protocol
   • deliberations at Santa Fe meeting [10/99]
   Data and Service Providers
• Data Providers
   – publishing into an archive
   – providing methods for metadata “harvesting”
       • provide non-technical context for sharing information
         also
• Service Providers
   – harvest metadata from providers
   – implement user interface to data
   (Even if these are done by the same DL, these are distinct roles)
• Self-describing archives
   – Much of the learning about the constituent UPS
     archives occurred out of band…
                          Metadata Harvesting
     • Move away from distributed searching
     • Extract metadata from various sources
     • Build services on local copies of metadata
          – data remains at remote repositories
     [Figure: a user searches (e.g. for “cfd…”) against a local copy of
     metadata that was harvested offline from each node; all searching,
     browsing, etc. is performed on the metadata there. Individual nodes
     can still support direct user …; each node independently …]
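A minimal sketch of the harvesting idea above, assuming made-up node names and records (none of these come from a real repository): metadata is copied from each node ahead of time, and the search then runs only against the local copy.

```python
def harvest_all(nodes):
    """Copy metadata from every node into one local list (done offline)."""
    local_copy = []
    for node_name, records in nodes.items():
        for record in records:
            # Remember where each record came from; the data itself stays remote.
            local_copy.append({"node": node_name, **record})
    return local_copy


def search(local_copy, term):
    """Search only the local copy; no remote node is contacted here."""
    needle = term.lower()
    return [r for r in local_copy if needle in r["title"].lower()]


# Hypothetical nodes and records, for illustration only.
nodes = {
    "node-a": [{"id": "a1", "title": "CFD methods for wings"}],
    "node-b": [
        {"id": "b1", "title": "Wind tunnel data"},
        {"id": "b2", "title": "Unsteady cfd results"},
    ],
}

local_copy = harvest_all(nodes)
hits = search(local_copy, "cfd")
```

The point of the design is that query syntax, ranking and availability are decided in one place, which sidesteps the source-metadata, query-language and rank-merging problems of distributed searching.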
               OAI-PMH v.1.0 [01/2001]
• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• focus on document-like objects
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• experimental: 12-18 months
              Santa Fe          OAI-PMH             OAI-PMH
              convention        v.1.0/1.1           v.2.0

nature        experimental      experimental        stable
verbs         Dienst            OAI-PMH             OAI-PMH
responses     XML               XML                 XML
transport     HTTP              HTTP                HTTP
metadata      OAMS              unqualified         unqualified
                                Dublin Core         Dublin Core
about         eprints           document-like       resources
                                objects
model         metadata          metadata            metadata
              harvesting        harvesting          harvesting
               OAI-PMH 2.0
• Good news: OAI-PMH is still Six Verbs + Dublin Core

• Incremental improvements
  –   single XML schema
  –   ambiguities removed
  –   more expressive options
  –   cleaner separation of roles & responsibilities
• Bad news: not backwards compatible with 1.1
                        Dublin Core
• Dublin Core Metadata Initiative
  – from 1994-1995, recognizing the need for simple, interoperable
    metadata for resource discovery
  –   good overview of metadata & DC:

  – 15 elements (qualifiers possible)

       Title         Creator     Subject    Description   Publisher

       Contributor   Date        Type       Format        Identifier

       Source        Language    Relation   Coverage      Rights
                              OAI Mechanics

• Request is encoded in http
• Response is encoded in XML
• XML Schemas for the responses are defined
  in the OAI-PMH document
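As a rough illustration of these mechanics, the sketch below encodes a request as an HTTP GET URL and reads one field out of an XML response. The base URL, identifier and sample response are invented for the example; only the namespace URI and the argument names are taken from OAI-PMH 2.0.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def build_request(base_url, verb, **args):
    """Encode an OAI-PMH request as an HTTP GET URL (verb plus arguments)."""
    params = {"verb": verb}
    params.update(args)
    return base_url + "?" + urlencode(params)


# Hand-written sample response (hypothetical repository), trimmed to a few elements.
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-10-02T12:00:00Z</responseDate>
  <request verb="Identify">http://example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
  </Identify>
</OAI-PMH>"""


def repository_name(xml_text):
    """Pull repositoryName out of an Identify response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return root.findtext(f"{{{OAI_NS}}}Identify/{{{OAI_NS}}}repositoryName")


url = build_request(
    "http://example.org/oai",          # assumed base URL
    "GetRecord",
    identifier="oai:example.org:1",    # assumed identifier
    metadataPrefix="oai_dc",
)
name = repository_name(SAMPLE_RESPONSE)
```

A real harvester would fetch `url` over HTTP and validate the response against the schema; both steps are left out here to keep the sketch self-contained.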
             Overview of OAI-PMH Verbs

                   Verb                  Function
about the          Identify              description of archive
archive            ListMetadataFormats   metadata formats supported by archive
                   ListSets              sets defined by archive

harvesting         ListIdentifiers       OAI unique ids contained in archive
verbs              ListRecords           listing of N records
                   GetRecord             listing of a single record

most verbs take arguments: dates, sets, ids, metadata formats
and resumption token (for flow control)
               protocol vs periphery

• clear distinction between protocol and periphery
  • fixed protocol document
  • extensible implementation guidelines:
     • e.g. sample metadata formats, description
       containers, about containers
     • allows for OAI guidelines and community
              OAI-PMH vs HTTP

• clear separation of OAI-PMH and HTTP
  • OAI-PMH error handling
    • all OK at HTTP level? => 200 OK
    • something wrong at OAI-PMH level? =>
      OAI-PMH error (e.g. badVerb)
  • http codes 302, 503, etc. still available to
    implementers, but no longer represent OAI-PMH errors
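One way a harvester might keep the two layers apart is sketched below. The function and exception names are hypothetical, and the sample body is hand-written; the only protocol details relied on are the `<error code="…">` element and the 2.0 namespace.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


class OAIError(Exception):
    """Raised for a protocol-level error carried inside a 200 OK response."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_response(http_status, body):
    """Separate HTTP-level trouble from OAI-PMH-level trouble."""
    if http_status != 200:
        # Transport problem (e.g. 503): not an OAI-PMH error at all.
        raise RuntimeError(f"HTTP-level problem: status {http_status}")
    root = ET.fromstring(body.encode("utf-8"))
    error = root.find(OAI_NS + "error")
    if error is not None:
        raise OAIError(error.get("code"), (error.text or "").strip())
    return root


# Hand-written example of a response to an unknown verb.
BAD_VERB_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-10-02T12:00:00Z</responseDate>
  <request>http://example.org/oai</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""
```

The HTTP status is checked first because a non-200 response may not contain parseable OAI-PMH XML at all.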
                 resource – item - record

• item = identifier; the item stands for all available metadata
  about the resource (here, “David”)
• set-membership is an item-level property
• records: Dublin Core metadata, MARC metadata, SPECTRUM metadata
• record = identifier + metadata format + datestamp
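The item / record distinction can be modelled with two small data types. This is only a sketch: the class and field names are illustrative choices, and the identifier and metadata strings are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the metadata itself)."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    """An item is addressed by its identifier and carries one record per format."""
    identifier: str
    sets: set = field(default_factory=set)        # set membership lives on the item
    records: dict = field(default_factory=dict)   # metadata prefix -> Record

    def add_record(self, metadata_prefix, datestamp, metadata):
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record


# Hypothetical item describing the resource "David" in two formats.
david = Item("oai:example.org:david", sets={"sculpture"})
david.add_record("oai_dc", "2002-10-02", "<dc>…</dc>")
david.add_record("marc", "2002-09-15", "<marc>…</marc>")
```

Keeping sets on the item rather than on each record mirrors the statement that set-membership is an item-level property.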
            other general changes

• better definitions of harvester,
repository, item, unique identifier, record,
set, selective harvesting

• oai_dc schema builds on DCMI XML
Schema for unqualified Dublin Core

• usage of must, must not etc. as in RFC2119
• wording on response compression
             other general changes

• all protocol responses can be validated with
a single XML Schema
  • easier for data providers

  • no redundancy in type definitions

  • SOAP-ready

  • clean for error handling
                response no errors
<?xml version="1.0" encoding="UTF-8"?>
<request verb="GetRecord"… …></request>
   </header>
   <metadata>
(note: no http encoding of the OAI-PMH request)
               response with error
<?xml version="1.0" encoding="UTF-8"?>
<error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
(note: with errors, only the correct attributes are echoed in …)
scenario: harvesting 2770 records in 3 separate 1000 record “chunks”
(repository backed by an RDBMS)

   response: Records 1-1000, resumptionToken=AXad31
   request:  ListRecords, resumptionToken=AXad31
   response: Records 1001-2000, resumptionToken=pQ22-x
   request:  ListRecords, resumptionToken=pQ22-x
   response: Records 2001-2770
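The flow-control scenario above can be sketched with an in-memory stand-in for the repository. Everything here is an assumption made for illustration: the class name, the record names, and especially the token format (a plain offset), since real resumptionTokens are opaque strings chosen by the repository.

```python
class FakeRepository:
    """In-memory stand-in for a repository that answers ListRecords in chunks."""

    def __init__(self, total=2770, chunk=1000):
        self.records = [f"record-{i}" for i in range(1, total + 1)]
        self.chunk = chunk

    def list_records(self, resumption_token=None):
        # Made-up token scheme: the token is simply the next offset.
        start = int(resumption_token) if resumption_token else 0
        end = min(start + self.chunk, len(self.records))
        next_token = str(end) if end < len(self.records) else None
        return self.records[start:end], next_token


def harvest(repo):
    """Keep reissuing ListRecords until no resumptionToken comes back."""
    harvested, token, requests = [], None, 0
    while True:
        chunk, token = repo.list_records(token)
        requests += 1
        harvested.extend(chunk)
        if token is None:
            break
    return harvested, requests


repo = FakeRepository()
records, requests = harvest(repo)
```

With 2770 records and a chunk size of 1000 the loop issues three requests, and reissuing the same token returns the same incomplete list as long as the repository does not change.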
• idempotency of resumptionToken: return same incomplete
list when rT is reissued

   • while no changes occur in the repo: strict

   • while changes occur in the repo: all items with unchanged

   • new, optional attributes for the resumptionToken:



               harvesting granularity
• harvesting granularity
  • mandatory support of YYYY-MM-DD

  • optional support of YYYY-MM-DDThh:mm:ssZ
    • other granularities considered, but ultimately rejected

  • granularity of from and until must be the same
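A possible check for the granularity rules, written as a sketch: the function names and the `supports_seconds` flag are invented here, the patterns only check the shape of the datestamp (not whether the date is valid), and reporting every violation as `badArgument` is an assumption based on the OAI-PMH 2.0 error list.

```python
import re

DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def granularity(stamp):
    """Classify a datestamp as 'day', 'seconds', or None if it is neither."""
    if DAY.fullmatch(stamp):
        return "day"
    if SECONDS.fullmatch(stamp):
        return "seconds"
    return None


def check_range(from_=None, until=None, supports_seconds=False):
    """Return None if the from/until pair is acceptable, else 'badArgument'."""
    kinds = [granularity(s) for s in (from_, until) if s is not None]
    if None in kinds:
        return "badArgument"      # not one of the two allowed forms
    if "seconds" in kinds and not supports_seconds:
        return "badArgument"      # repository only offers day granularity
    if len(set(kinds)) > 1:
        return "badArgument"      # from and until use different granularities
    return None
```

For example, `check_range("2002-01-01", "2002-10-02")` passes, while mixing a day-level `from` with a seconds-level `until` is rejected even when the repository supports both.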
 • Identify more expressive
    <repositoryName>Library of Congress 1</repositoryName>
• header contains set membership of item
  </record>
  (eliminates the need for the “double harvest” 1.x required to get
  all records and all set information)
• ListIdentifiers returns headers
 <?xml version="1.0" encoding="UTF-8"?>
 <request verb="…" …></request>
• introduction of provenance container to
facilitate tracing of harvesting history
        … … …
• introduction of friends container to
facilitate discovery of repositories

     NASA <friends> example (1)
• A light weight, DP-centric method
  to communicate the existence of

 <friends ..namespace stuff..>
  NASA <friends> example (2)




• introduction of branding container for
DPs to suggest rendering & association hints
<branding xmlns=""

• revision of oai-identifier
  <oai-identifier xmlns="

                                               domain based
                                               repository names
did not make it into OAI-PMH v.2.0

 •   SOAP implementation
 •   Result set filtering
 •   Multiple / “best” metadata
 •   GetRecord -> GetRecords
 •   Machine readable rights management
 •   XML format for “mini-archives”
So What Does OAI-PMH Mean
   for Your Digital Library?

 • Resources on DL projects are typically
   spent in 2 areas:
   – creating & maintaining the collection
      • data provider
   – developing access services for the collection
     (searching, browsing, etc.)
      • service provider
 • OAI-PMH allows for specialization based
   on resources / interest
NACA Report 1345

as seen through its native DL
NACA Report 1345

as seen through MAGiC
NACA Report 1345

as seen through Scirus
NACA Report 1345

as seen through my.OAI
(FS Consulting)
     Scientific Communication
• With only some exceptions, which interface is used
  for discovery is not as important as the fact that
  discovery occurred in the first place…
   – “control” of the discovered objects is not “lost” by data providers
      • however, higher level mirroring services can be built on top of
        OAI (cf. NACA & ARC mirroring between NASA LaRC and
• The real power of OAI-PMH derives as much from
  what it does not do as what it actually does
   What Does OAI-PMH Mean
         for Authors?
• On the surface, absolutely nothing!
   – the ideal OAI deployment should be absolutely invisible to
     normal DL operations
   – uninterested users should not even notice or care
• Indirectly, they should enjoy the benefits of the
  critical mass of current and developing DL tools &
   – personal, institutional data providers
   – proliferation of targeted, value-added service providers
    What Does OAI-PMH Mean
          For Editors?
• Absolutely everything…
• The decoupling of SPs and DPs will have significant and
  profound implications on scientific and technical information
   – OAI-PMH is actually just one component in a larger engineering
     effort for scholarly communication (e.g. OpenURL)
• Service and resource integration will be the focus of journals,
  professional societies, universities, etc.
   – OAI-PMH will be as basic and core a technology for scientific publishing
     as http & XML
             Field of Dreams
• It should be easy to be a data provider, even if it
  makes more work for the service provider.
   – if enough data providers exist, the service providers will
     come (DPs >> SPs)
• Open-source / freely available tools
   – “drop-in” data providers:
      • industrial strength:
      • personal size:
   – tools to make your existing DL a data provider:
      • also: OAI-implementers mailing list / mail archive!
   – service providers:
      • Arc:
           OAI Observation:
            Front-End Only
• No input/registry mechanism
  – OAI harvesting protocol is always a front-end for
    something else
     • filesystem, Dienst, RDBMS, LDAP, etc.
  – convenient for pre-existing DLs, but does not address
    “new” DLs
     • e.g., “we want to do OAI”
• Bounds the scope of OAI
  – responsibilities and domain of OAI are still being discussed
  – tension between functionality and simplicity
    OAI Observation: No T&C
• Possible to use multiple OAI servers in a
  DMZ-like configuration…
   [Figure: a Public OAI Server accepts OAI requests from arbitrary hosts and
   a Private OAI Server accepts OAI requests from trusted hosts; both sit in
   front of the same Source database (could even use a separate copy of the
   database…)]
    OAI Observation: No T&C
• Possible to use OAI harvesting protocol in
  closed, restricted systems

   [Figure: four DLs (OAI 1, OAI 2, OAI 3, OAI 4);
   all OAI requests originate from these 4 DLs]

– Q: “Which format should I use?”
   • A: any/all of them…
– lowest common denominator: unqualified Dublin Core
– Again, little known about actual behavior
   • will DC actually be useful? or too lossy?
   • will communities create/adopt specific formats?
   • will native (presumably richer) formats be harvested?

                                             we very much want
                                             this to happen...
             “The Return of MARC” ?!
The Future: Community Building
• Ultimately, protocols and metadata formats are
  not what makes a difference
• Rather, the critical mass afforded by a common
  set of utilities (cf. http, Dublin Core, XML)
• The best current example: The Open Language
  Archives Community
• OAI-PMH provides the basis for communication
  between strangers, but allows even richer
  communication between friends
Backup Slides
Detailed Review of the
 OAI-PMH 2.0 Verbs
Identify
    1.1                  2.0
• Arguments       • Arguments
  – none            – none
• Errors          • Errors
  – none            – badArgument
ListMetadataFormats
    1.1                      2.0
• Arguments             • Arguments
  – identifier            – identifier
    (OPTIONAL)              (OPTIONAL)
• Errors                • Errors
  – id does not exist     – badArgument
                          – noMetadataFormats
                          – idDoesNotExist
ListSets
    1.1                     2.0
• Arguments            • Arguments
  – resumptionToken      – resumptionToken
    (EXCLUSIVE)            (EXCLUSIVE)
• Errors               • Errors
  – no set hierarchy     – badArgument
                         – badResumptionToken
                         – noSetHierarchy
ListIdentifiers
      1.1                          2.0
• Arguments                  • Arguments
  – from (OPTIONAL)            – from (OPTIONAL)
  – until (OPTIONAL)           – until (OPTIONAL)
  – set (OPTIONAL)             – set (OPTIONAL)
  – resumptionToken            – resumptionToken
    (EXCLUSIVE)                  (EXCLUSIVE)
                               – metadataPrefix (REQUIRED)
• Errors                     • Errors
  – no records match           – badArgument
                               – badResumptionToken
                               – cannotDisseminateFormat
                               – noRecordsMatch
                               – noSetHierarchy
ListRecords
        1.1                                 2.0
•   Arguments                     •   Arguments
    – from (OPTIONAL)                 – from (OPTIONAL)
    – until (OPTIONAL)                – until (OPTIONAL)
    – set (OPTIONAL)                  – set (OPTIONAL)
    – resumptionToken                 – resumptionToken
      (EXCLUSIVE)                       (EXCLUSIVE)
    – metadataPrefix (REQUIRED)       – metadataPrefix (REQUIRED)
•   Errors                        •   Errors
    – no records match                – noRecordsMatch
    – metadata format cannot be       – cannotDisseminateFormat
      disseminated                    – badResumptionToken
                                      – noSetHierarchy
                                      – badArgument
GetRecord
      1.1                            2.0
• Arguments                   • Arguments
   – identifier (REQUIRED)       – identifier (REQUIRED)
   – metadataPrefix              – metadataPrefix (REQUIRED)
     (REQUIRED)
• Errors                      • Errors
   – id does not exist           – badArgument
   – metadata format cannot      – cannotDisseminateFormat
     be disseminated             – idDoesNotExist
                      Argument Summary
                      metadataPrefix   from       until      set        resumptionToken   identifier

Identify              (no arguments)
ListMetadataFormats                                                                       optional
ListSets                                                                exclusive
ListIdentifiers       required         optional   optional   optional   exclusive
ListRecords           required         optional   optional   optional   exclusive
GetRecord             required                                                            required
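                              Error Summary
Identify              BA
ListMetadataFormats   BA                                NMF   IDDNE
ListSets              BA    BRT                  NSH
ListIdentifiers       BA    BRT    CDF    NRM    NSH
ListRecords           BA    BRT    CDF    NRM    NSH
GetRecord             BA           CDF                        IDDNE

BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat,
NRM = noRecordsMatch, NSH = noSetHierarchy, NMF = noMetadataFormats,
IDDNE = idDoesNotExist

               Generate badVerb on any input not matching the 6 defined verbs
                    this is an inversion of the table in section 3.6 of the OAI-PMH specification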
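The argument rules from the verb review can be turned into a small lookup table. This is a sketch under assumptions: the `validate` function and its return convention are invented here, and it covers only badVerb and badArgument, not the errors that depend on repository content.

```python
REQUIRED, OPTIONAL, EXCLUSIVE = "required", "optional", "exclusive"

# Per-verb argument rules, following the 2.0 columns of the verb review.
LIST_ARGS = {
    "metadataPrefix": REQUIRED,
    "from": OPTIONAL,
    "until": OPTIONAL,
    "set": OPTIONAL,
    "resumptionToken": EXCLUSIVE,
}
ARGUMENTS = {
    "Identify": {},
    "ListMetadataFormats": {"identifier": OPTIONAL},
    "ListSets": {"resumptionToken": EXCLUSIVE},
    "ListIdentifiers": dict(LIST_ARGS),
    "ListRecords": dict(LIST_ARGS),
    "GetRecord": {"identifier": REQUIRED, "metadataPrefix": REQUIRED},
}


def validate(verb, args):
    """Return None if the request is well-formed, else an OAI-PMH error code."""
    if verb not in ARGUMENTS:
        return "badVerb"          # anything outside the 6 defined verbs
    spec = ARGUMENTS[verb]
    if any(name not in spec for name in args):
        return "badArgument"      # argument not allowed for this verb
    if any(spec[name] == EXCLUSIVE for name in args):
        # An exclusive argument must be the only argument besides the verb.
        return None if len(args) == 1 else "badArgument"
    missing = [n for n, kind in spec.items() if kind == REQUIRED and n not in args]
    return "badArgument" if missing else None
```

For instance, `validate("ListRecords", {"resumptionToken": "AXad31"})` is accepted, while adding a `set` argument alongside the token is not.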
