

									OAI Overview

    Michael L. Nelson
    Old Dominion University
     Norfolk Virginia, USA

Bioinformatics Seminar
   ODU CS 791/891
      Feb 3 2003
          The Rise and Fall of
          Distributed Searching
• wholesale distributed searching, popular at
  the time, is attractive in theory but
  troublesome in practice
  – Davis & Lagoze, JASIS 51(3), pp. 273-80
  – Powell & French, Proc. 5th ACM DL, pp. 264-265
• distributed searching of N nodes still viable,
  but only for small values of N
      • NCSTRL: N > 100; bad
      • NTRS/NIX: N<=20; ok (but could be better)
           The Rise and Fall of
           Distributed Searching
• Other problems of distributed searching (from STARTS)
   – source-metadata problem
       • how do you know which nodes to search?
   – query-language problem
       • syntax varies and drifts over time between the various nodes
   – rank-merging problem
       • how do you meaningfully merge multiple result sets?
• Temptations:
   – centralize all functions
       • “everything will be done at X”
   – standardize on a single product
       • “everyone will use system Y”
          Universal Preprint Service
• A cross-archive DL that provides services on a collection of
  metadata harvested from multiple archives
   – based on NCSTRL+; a modified version of Dienst
       • support for “clustering”
       • support for “buckets”
• Demonstrated at Santa Fe NM, October 21-22, 1999
   – http://ups.cs.odu.edu/
   – D-Lib Magazine, 6(2) 2000 (2 articles)
       • http://www.dlib.org/dlib/february00/02contents.html
   – UPS was soon renamed the Open Archives Initiative (OAI)
          Data and Service Providers
• Data Providers
   – publishing into an archive
   – providing methods for metadata “harvesting”
       • provide non-technical context for sharing information
• Service Providers
   – harvest metadata from providers
   – implement user interface to data
• Even if these are done by the same DL, these are distinct roles
• Self-describing archives
   – Much of the learning about the constituent UPS
     archives occurred out of band…
                          Metadata Harvesting
     • Move away from distributed searching
     • Extract metadata from various sources
     • Build services on local copies of metadata
          – data remains at remote repositories
     [Figure: a user issues a search (“cfd) against a local copy of metadata that was
      harvested offline from each of the remote nodes. Figure labels: “all searching,
      browsing, etc. performed on the metadata here”; “metadata harvested offline”
      (one per node); “individual nodes can still support direct user”; “each node
      independently”.]
                       Result… OAI
• http://www.openarchives.org/
• The OAI was the result of the demonstration and discussion
  during the Santa Fe meeting
• Initial focus was on federating collections of scholarly e-prints
• …however, interest grew and the scope and application of OAI
  expanded to become a generic bulk metadata transport protocol
• Note:
   – OAI is only about metadata -- not full text!
   – OAI is neutral with respect to the nature of the metadata or the resources
     the metadata describes
       • read: commercial publishers have an interest in OAI too...
              Santa Fe         OAI-PMH                 OAI-PMH
              convention       v.1.0/1.1               v.2.0

 nature       experimental     experimental            stable
 verbs        Dienst           OAI-PMH                 OAI-PMH
 responses    XML              XML                     XML
 transport    HTTP             HTTP                    HTTP
 metadata     OAMS             unqualified             unqualified
                               Dublin Core             Dublin Core
 about        eprints          document-like objects   resources
              metadata         metadata                metadata
              harvesting       harvesting              harvesting
                            Dublin Core
• Dublin Core Metadata Initiative
   – http://www.dublincore.org/
   – from 1994-1995, recognizing the need for simple, interoperable metadata
     for resource discovery
   – good overview of metadata & DC: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
   – 15 elements (qualifiers possible)
                 Overview of OAI Verbs
   Group        Verb                  Function
   metadata     Identify              description of archive
   metadata     ListMetadataFormats   metadata formats supported by archive
   metadata     ListSets              sets defined by archive
   harvesting   ListIdentifiers       OAI unique ids contained in archive
   harvesting   ListRecords           listing of N records
   harvesting   GetRecord             listing of a single record

   Most verbs take arguments: dates, sets, ids, metadata formats,
   and resumption token (for flow control).
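A request is just the verb plus its arguments, sent over HTTP. The following is a minimal sketch of how a harvester might assemble such a request URL. The base URL and the function name `build_request` are assumptions made for illustration; only the verb and argument names come from the protocol.

```python
from urllib.parse import urlencode

# The six protocol verbs from the overview table.
OAI_VERBS = ("Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord")


def build_request(base_url, verb, arguments=None):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown verb: {verb}")
    params = {"verb": verb}
    # Arguments are passed as a dict because "from" is a Python keyword.
    params.update(arguments or {})
    return f"{base_url}?{urlencode(params)}"


# Hypothetical archive address, used only for this example.
BASE_URL = "http://archive.example.org/oai"
print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "ListRecords",
                    {"metadataPrefix": "oai_dc", "from": "2003-01-01"}))
```

The second call shows a dated, format-specific harvest; a `set` or `until` argument would be added to the dict in the same way.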
                       Argument Summary
                      metadataPrefix   from       until      set        resumptionToken   identifier
 Identify             -                -          -          -          -                 -
 ListMetadataFormats  -                -          -          -          -                 optional
 ListSets             -                -          -          -          exclusive         -
 ListIdentifiers      required         optional   optional   optional   exclusive         -
 ListRecords          required         optional   optional   optional   exclusive         -
 GetRecord            required         -          -          -          -                 required
                              Error Summary
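                      BA    BRT   CDF   NRM   NSH   NMF   IDDNE
 Identify             x
 ListMetadataFormats  x                             x     x
 ListSets             x     x                 x
 ListIdentifiers      x     x     x     x     x
 ListRecords          x     x     x     x     x
 GetRecord            x           x                       x

 BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat,
 NRM = noRecordsMatch, NSH = noSetHierarchy, NMF = noMetadataFormats,
 IDDNE = idDoesNotExist

 Generate badVerb on any input not matching the 6 defined verbs.
 This is an inversion of the table in section 3.6 of the OAI-PMH specification.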
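The argument rules and the two purely syntactic errors (badVerb, badArgument) can be checked before a repository touches its data. The sketch below assumes the OAI-PMH 2.0 argument rules (metadataPrefix required for ListIdentifiers, ListRecords, and GetRecord; identifier required for GetRecord); the names `RULES` and `check_request` are hypothetical. The remaining error codes depend on repository contents and are not modeled here.

```python
# Harvesting arguments shared by ListIdentifiers and ListRecords.
_LIST_ARGS = {"required": {"metadataPrefix"},
              "optional": {"from", "until", "set"},
              "exclusive": True}

# Per-verb argument rules (assumed from the OAI-PMH 2.0 specification).
RULES = {
    "Identify":            {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets":            {"required": set(), "optional": set(),
                            "exclusive": True},
    "ListIdentifiers":     _LIST_ARGS,
    "ListRecords":         _LIST_ARGS,
    "GetRecord":           {"required": {"identifier", "metadataPrefix"},
                            "optional": set()},
}


def check_request(verb, args):
    """Return 'badVerb', 'badArgument', or None for a well-formed request."""
    rule = RULES.get(verb)
    if rule is None:
        return "badVerb"
    names = set(args)
    if rule.get("exclusive") and "resumptionToken" in names:
        # An exclusive argument must be the only argument besides the verb.
        return None if names == {"resumptionToken"} else "badArgument"
    if not rule["required"] <= names:
        return "badArgument"  # a required argument is missing
    if names - rule["required"] - rule["optional"]:
        return "badArgument"  # an argument this verb does not accept
    return None


print(check_request("ListRecords", {"metadataPrefix": "oai_dc"}))
print(check_request("ListRecords", {"resumptionToken": "AXad31", "set": "x"}))
print(check_request("Search", {}))
```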
                    Flow Control
• ListSets, ListIdentifiers, ListRecords are all
  allowed to return partial responses, via a
  combination of:
   – resumptionToken – an opaque, archive-defined data
     string that when passed back to the archive allows the
     response to begin where it left off
      • each archive defines its own resumptionToken syntax; it
        may have visible semantics or not
   – HTTP 503 status code – “retry after”
      • up to the harvester to understand this code and respect it, and
        up to the archive to enforce it
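  resumptionToken scenario: harvesting 277 records in 3 separate
  100-record “chunks” (harvester talking to an archive backed by an RDBMS)

     archive   → harvester:  Records 1-100, resumptionToken=AXad31
     harvester → archive:    ListRecords, resumptionToken=AXad31
     archive   → harvester:  Records 101-200, resumptionToken=pQ22-x
     harvester → archive:    ListRecords, resumptionToken=pQ22-x
     archive   → harvester:  Records 201-277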
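The 277-record scenario can be reproduced with a small simulation. This is a sketch only: `ToyArchive` is a hypothetical in-memory stand-in for an OAI server, and its token format (random hex mapped to an offset) is just one of many ways an archive might implement its opaque tokens.

```python
import secrets


class ToyArchive:
    """In-memory stand-in for an OAI server that answers in chunks."""

    def __init__(self, records, chunk_size=100):
        self.records = list(records)
        self.chunk_size = chunk_size
        self.tokens = {}  # opaque token -> offset where the next chunk starts

    def list_records(self, resumption_token=None):
        # No token means a fresh harvest; otherwise resume at the saved offset.
        start = 0 if resumption_token is None else self.tokens.pop(resumption_token)
        end = start + self.chunk_size
        next_token = None
        if end < len(self.records):
            # More records remain: hand out a new opaque token.
            next_token = secrets.token_hex(4)
            self.tokens[next_token] = end
        return self.records[start:end], next_token


def harvest(archive):
    """Keep passing the token back until the archive stops issuing one."""
    harvested, token, requests = [], None, 0
    while True:
        chunk, token = archive.list_records(token)
        harvested.extend(chunk)
        requests += 1
        if token is None:
            return harvested, requests


records, requests = harvest(ToyArchive(range(1, 278)))
print(len(records), "records in", requests, "requests")
```

The harvester never inspects the token; it only echoes it back, which is what lets each archive choose its own syntax.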
               OAI Links & Demos
• Data providers
   – not really meant for end-user interaction, but Suleman’s
     “Repository Explorer” is an excellent tool
      • http://purl.org/net/oai_explorer
      • ~100 registered data providers
          – http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
          – many being used for internal purposes; not registered
• Service providers
   – Arc, the first known SP harvesting from OAI data providers
      • http://arc.cs.odu.edu/
      • ~20 registered service providers
          – http://www.openarchives.org/service_provider/oai_sp.htm
          – several more known to be in testing or creation
                     Field of Dreams
• It should be easy to be a data provider, even if it makes more work for
  the service provider.
    – if enough data providers exist, the service providers will come (DPs >>
• Open-source / freely available tools
    – “drop-in” data providers:
        • industrial strength: http://www.eprints.org/
        • personal size: http://kepler.cs.odu.edu/
    – tools to make your existing DL a data provider:
        • http://www.openarchives.org/tools/tools.htm
        • also: OAI-implementers mailing list / mail archive!
    – service providers:
        • only bits and pieces currently publicly available...
OAI Observation: Front-End Only
• No input/registry mechanism
   – OAI harvesting protocol is always a front-end for something else
       • filesystem, Dienst, RDBMS, LDAP, etc.
   – convenient for pre-existing DLs, but does not address “new” DLs
       • e.g., “we want to do OAI”
• Bounds the scope of OAI
   – responsibilities and domain of OAI are still being discussed
   – tension between functionality and simplicity
     OAI Observation: No T&C
• No terms & conditions provisions in protocol
   – assumes all metadata has uniform access rights
      • how to restrict metadata to certain hosts?
   – introducing T&C would increase the scope of
     application, but at the expense of simplicity
      • how expensive do we want to make a “just-a-front-end
        protocol” ?
      • maybe T&C is a good application for sets?
    OAI Observation: No T&C
• Possible to use multiple OAI servers in a
  DMZ-like configuration…
     [Figure: a Public OAI Server receives OAI requests from arbitrary hosts and a
      Private OAI Server receives OAI requests from trusted hosts; both draw on the
      same source database (could even use a separate copy of the database…)]
    OAI Observation: No T&C
• Possible to use OAI harvesting protocol in
  closed, restricted systems

     [Figure: four DLs labeled OAI 1, OAI 2, OAI 3, and OAI 4;
      all OAI requests originate from these 4 DLs]
  OAI Observation: Monolithic
• An OAI server has no protocol-defined
  concept of “other” OAI servers
  – backups, mirrors, etc. have to be resolved
    outside of the scope of OAI
     • scope vs. complexity again
  – fully connected graph of DLs harvesting from
    each other is unnecessary
     • cf. web crawlers vs. “gatherers” in U of Colorado’s
       Harvest System
        – 3rd party harvesting interfaces raise more T&C and data
          coherency issues
                 302 Load Balancing
   • Interactive users on main DL machine should not be
     impacted by metadata harvesting
       – don’t take deliveries through the front door
       – not part of the protocol; defined outside the protocol

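     [Figure: the harvester sends a request to the OAI server; if load > 0.05 the
      request is redirected with HTTP Status Code 302. The figure also shows the
      start of an XML response: <?xml version="1.0" encoding="UTF-8"?> <ListIdentifiers>]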
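The redirect rule is simple enough to sketch as a pure function. The 0.05 threshold comes from the slide; the function name, the target URL, and the choice to carry the original query string over to the redirect target are assumptions for illustration, not part of the protocol.

```python
def route_harvest_request(load, query, threshold=0.05,
                          harvest_url="http://harvest.example.org/oai"):
    """Decide how the main DL machine answers an incoming OAI request.

    Returns (status, headers): 302 with a Location header when the machine
    is busy, otherwise 200 to signal that the request is served locally.
    """
    if load > threshold:
        # Send the harvester elsewhere, preserving its original arguments.
        return 302, {"Location": f"{harvest_url}?{query}"}
    return 200, {}


print(route_harvest_request(0.40, "verb=ListIdentifiers&metadataPrefix=oai_dc"))
print(route_harvest_request(0.01, "verb=Identify"))
```

A harvester built on a standard HTTP client would normally follow the 302 without any OAI-specific logic, which is why this can stay outside the protocol.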
 OAI Observation: Data Coherency
• In the interest of OAI implementer simplicity,
  several issues are left for the service provider
  to interpret
  – what is an update vs. addition?
     • in the NACA OAI interface, they are reported as the
       same and it's up to the harvesting system to figure it out
  – deletions?
     • it is currently optional for OAI systems to mark records
       as deleted or not…
        – still left to the harvester to interpret
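One way a service provider might resolve these questions locally is to key its store on the record identifier: an unseen identifier is an addition, a known one is an update, and a record flagged as deleted is dropped. The sketch below assumes harvested records arrive as dicts with an `identifier` and an optional `status` of `"deleted"`; the function name and record layout are hypothetical.

```python
def apply_harvest(local_store, harvested):
    """Merge harvested records into local_store (a dict keyed by identifier)."""
    counts = {"added": 0, "updated": 0, "deleted": 0}
    for record in harvested:
        identifier = record["identifier"]
        if record.get("status") == "deleted":
            # Only count deletions of records we actually held.
            if local_store.pop(identifier, None) is not None:
                counts["deleted"] += 1
        elif identifier in local_store:
            local_store[identifier] = record
            counts["updated"] += 1
        else:
            local_store[identifier] = record
            counts["added"] += 1
    return counts


store = {"oai:example:1": {"identifier": "oai:example:1", "title": "old"}}
batch = [
    {"identifier": "oai:example:1", "title": "revised"},
    {"identifier": "oai:example:2", "title": "new"},
    {"identifier": "oai:example:3", "status": "deleted"},
]
print(apply_harvest(store, batch))
```

If the archive does not report deletions at all, this approach cannot detect them; the harvester would need a periodic full harvest to find records that have disappeared.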
 OAI Observation: Harvest Model
• Frequency of harvests
   – all-at-once harvests?
       • initial harvest
       • resolving data coherency
   – frequent incremental harvests?
       • far more efficient for both service and data providers
• Webcrawling vs. digital library models
   – webcrawlers: little to no a priori information about target
   – DLs: frequent harvesting of a small number of known targets
• Realization: we know very little about how harvesting
   – are we optimizing for all-at-once, when incremental will be more
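The difference between the two harvest styles comes down to the `from` argument. Below is a toy sketch, assuming day-granularity datestamps and inclusive `from`/`until` bounds as in OAI-PMH 2.0; the sample repository and the function name `list_records` are invented for illustration.

```python
# Hypothetical repository contents: identifier plus datestamp of last change.
REPOSITORY = [
    {"identifier": "oai:example:1", "datestamp": "2002-11-02"},
    {"identifier": "oai:example:2", "datestamp": "2003-01-15"},
    {"identifier": "oai:example:3", "datestamp": "2003-01-28"},
    {"identifier": "oai:example:4", "datestamp": "2003-02-01"},
]


def list_records(repository, from_=None, until=None):
    """Select records whose datestamp lies in the inclusive [from, until] range.

    ISO 8601 dates sort correctly as plain strings, so no parsing is needed.
    """
    return [r for r in repository
            if (from_ is None or r["datestamp"] >= from_)
            and (until is None or r["datestamp"] <= until)]


# All-at-once: no date arguments, every record is transferred.
full = list_records(REPOSITORY)
# Incremental: only records changed since the previous harvest date.
incremental = list_records(REPOSITORY, from_="2003-01-28")
print(len(full), len(incremental))
```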
Other Uses For the OAI-PMH
• Assumptions:
  – Traditional DLs / SPs will continue on their
    present path of increasing sophistication
     • citation indexing, search results viz, personalization,
       recommendations, subject-based filtering, etc.
  – growth rates remain the same (5x DPs as SPs)
• Premise: OAI-PMH is applicable to any
 scenario that needs to update / synchronize
 distributed state
  – Future opportunities are possible by creatively
    interpreting the OAI-PMH data model
                   OAI-PMH Data Model

     [Figure: the OAI-PMH data model]
        item = identifier: all available metadata about David
               (set-membership is an item-level property)
        records of that item: Dublin Core metadata, MARC metadata, SPECTRUM metadata
        record = identifier + metadata format + datestamp
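The relationships in the data model can be written down as two small types. This is a sketch under stated assumptions: the class and field names are hypothetical, the identifier and metadata strings are placeholders, and only the structure (an item holds one record per metadata format, sets belong to the item) follows the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the metadata)."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    """All available metadata about one resource, under one identifier."""
    identifier: str
    sets: set = field(default_factory=set)        # set membership is item-level
    records: dict = field(default_factory=dict)   # metadata prefix -> Record

    def add_record(self, metadata_prefix, datestamp, metadata):
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record


# Placeholder identifier, set name, and metadata payloads.
david = Item("oai:example:david", sets={"sculpture"})
david.add_record("oai_dc", "2003-01-15", "<dc>...</dc>")
david.add_record("marc", "2003-01-20", "<marc>...</marc>")
print(sorted(david.records), david.sets)
```

Every record of an item shares the item's identifier, so a GetRecord request needs both the identifier and the metadata prefix to name a single record.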
                     Typical Values
•   repository
     – collection of publications
•   resource
     – scholarly publication
•   item
     – all metadata (DC + MARC)
•   record
     – a single metadata format
•   datestamp
     – last update / addition of a record
•   metadata format
     – bibliographic metadata format
•   set
     – originating institution or subject categories
• Stretching the idea of a repository a bit:
   – contextually sensitive repositories
      • “personalization for harvesters”
      • communication between strangers, or communication
        between friends?
   – OAI-PMH for individual complex objects?
      • OAI-PMH without MySQL?!
          – Fedora, Multi-valent documents, buckets
          – tar, jar, zip, etc. files

• What if resource were:
  – computer system status
    • uptime, who, w, df, ps, etc.
  – or generalized “system” status
    • e.g., sports league standings
  – people
    • personnel databases
    • authority files for authors

• What if item were:
  – software
     • union of versions + formats
  – all forms of metadata
     • administrative + structural
     • citations, annotations, reviews, etc.
  – data
     • e.g., newsfeeds and other XML expressible content
           – metadataPrefixes or sets could be defined to be different

• What if record were:
  – specific software instantiations / updates
  – access / retrieval logs for DLs (or computer systems)
  – push / pull model inversion
     • put a harvester on the client behind a firewall, the client
       contacts a DP and receives “instructions” on how to submit
       the desired document (e.g., send email to a specified

• semantics of datestamp are strongly influenced by
  the choice of resource / item / record /
  metadataPrefix, but it could be used to:
   – signify change of set membership (e.g., workflow: item
     moves from “submitted” to “approved”)
   – change datestamp to reflect access to the DP
      • e.g., in conjunction with metadataPrefixes of “accessed” or

• what if metadataPrefix were:
   – instructions for extracting / archiving / scraping the
      • verb=ListRecords&metadataPrefix=extract_TIFFs
   – code fragments to run locally
      • (harvested from a trusted source!)
   – XSLT for other metadataPrefixes
      • branding container is at the repository-level, this could be
        record- or item-level
• sets are already used for tunneling OAI-PMH
  extensions (see Suleman & Fox, D-Lib 7(12))
• other uses:
   – in aggregators, automatically create 1 set per baseURL
   – have “hidden” sets (or metadataPrefix) that have
     administrative or community-specific values (or triggers)
      • set=accessed>1000&from=2001-01-01
      • set=harvestMeWithTheseARGS&until=2002-05-
        Interesting Services

• DP9
   – gateway to expose repository contents in HTML suitable
     for web crawlers
• Celestial
   – OAI “cache”, also 1.1 -> 2.0 converter
• Static (mini-) repositories
   – XML files, based on OLAC work
• OpenURL metadata format registries
   – record = metadata format
